# Regular Expressions in Python

Regular expressions (regex) are a powerful tool for matching patterns in text. Python provides the `re` module to work with regular expressions. Here are some common operations you can perform with regex in Python:

## Importing the `re` Module

To use regular expressions in Python, you need to import the `re` module:

```python
import re
```

## Basic Functions

### `re.search()`

Searches for the first occurrence of the pattern in the string.

```python
match = re.search(r'\d+', 'The price is 100 dollars')
if match:
    print(match.group())  # Output: 100
```

### `re.findall()`

Finds all occurrences of the pattern in the string.

```python
matches = re.findall(r'\d+', 'There are 3 cats, 4 dogs, and 5 birds')
print(matches)  # Output: ['3', '4', '5']
```
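When the position of each match matters as well as its text, `re.finditer()` can be used instead; it yields match objects rather than strings. A minimal sketch reusing the same sentence (the variable names are illustrative):

```python
import re

text = 'There are 3 cats, 4 dogs, and 5 birds'

# finditer yields one match object per occurrence, in left-to-right order
positions = [(m.group(), m.start()) for m in re.finditer(r'\d+', text)]
print(positions)  # Output: [('3', 10), ('4', 18), ('5', 30)]
```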

### `re.sub()`

Replaces occurrences of the pattern with a replacement string.

```python
result = re.sub(r'\d+', 'number', 'There are 3 cats, 4 dogs, and 5 birds')
print(result)  # Output: There are number cats, number dogs, and number birds
```

### `re.split()`

Splits the string by occurrences of the pattern.

```python
result = re.split(r'\s+', 'Split this sentence into words')
print(result)  # Output: ['Split', 'this', 'sentence', 'into', 'words']
```

## Special Characters

- `.`: Matches any character except a newline.
- `^`: Matches the start of the string.
- `$`: Matches the end of the string.
- `*`: Matches 0 or more repetitions of the preceding pattern.
- `+`: Matches 1 or more repetitions of the preceding pattern.
- `?`: Matches 0 or 1 repetition of the preceding pattern.
- `{m,n}`: Matches between m and n repetitions of the preceding pattern.

## Character Classes

- `\d`: Matches any digit.
- `\D`: Matches any non-digit.
- `\w`: Matches any word character (letters, digits, and the underscore).
- `\W`: Matches any non-word character.
- `\s`: Matches any whitespace character.
- `\S`: Matches any non-whitespace character.
- `\b`: Matches a word boundary, the position between a word character and a non-word character.
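The classes above can be tried out on a short string. This is only a sketch; the sample text and variable names are made up for illustration:

```python
import re

sample = "Room 42, Floor_3!"

digits = re.findall(r"\d", sample)      # single digits
words = re.findall(r"\w+", sample)      # runs of letters, digits, and underscores
non_words = re.findall(r"\W", sample)   # everything that is not a word character
# \b keeps 'cat' from matching inside 'concat'
whole_word = re.findall(r"\bcat\b", "cat concat cat.")

print(digits)      # Output: ['4', '2', '3']
print(words)       # Output: ['Room', '42', 'Floor_3']
print(non_words)   # Output: [' ', ',', ' ', '!']
print(whole_word)  # Output: ['cat', 'cat']
```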

Regular expressions are a versatile tool for text processing and can be used for tasks such as validation, parsing, and string manipulation.

### `re.match()`

The `re.match()` function attempts to match a pattern at the beginning of a string. If the pattern is found at the start of the string, it returns a match object; otherwise, it returns `None`.

#### Example

```python
import re

pattern = r'\d+'
string = '123abc456'

match = re.match(pattern, string)
if match:
    print('Match found:', match.group())  # Output: Match found: 123
else:
    print('No match')
```
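The difference between `re.match()`, `re.search()`, and `re.fullmatch()` is easiest to see on a string whose digits are not at the start. A small sketch (the sample string is an assumption chosen for the comparison):

```python
import re

text = "abc123"

at_start = re.match(r"\d+", text)               # None: the string does not start with a digit
anywhere = re.search(r"\d+", text)              # finds '123' further along
whole = re.fullmatch(r"\d+", text)              # None: the whole string is not digits
whole_ok = re.fullmatch(r"[a-z]+\d+", text)     # the pattern covers the entire string

print(at_start, anywhere.group(), whole, whole_ok.group())
# Output: None 123 None abc123
```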

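```python
import re

string = "The quick brown fox jumps over the lazy dog"
pattern = "quick"

# Search for the pattern, without flags and with flags
match = re.search(pattern, string)
match1 = re.search(pattern, string, re.IGNORECASE)  # would also find "Quick" or "QUICK"
match2 = re.search(pattern, string, re.IGNORECASE | re.MULTILINE)  # flags combine with |

print(match.group(), match1.group(), match2.group())  # Output: quick quick quick
```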

```python
import re

# Dot (.)
pattern = r"a.b"
text = "acb aab a.b"
matches = re.findall(pattern, text)
print("Dot (.) matches:", matches)  # Output: ['acb', 'aab', 'a.b'] (the dot also matches a literal '.')

# Caret (^)
pattern = r"^Hello"
text = "Hello world! Hello again!"
matches = re.findall(pattern, text)
print("Caret (^) matches:", matches)  # Output: ['Hello']

# Dollar ($)
pattern = r"world!$"
text = "Hello again! Hello world!"
matches = re.findall(pattern, text)
print("Dollar ($) matches:", matches)  # Output: ['world!']

# Asterisk (*)
pattern = r"ab*"
text = "a ab abb abbb"
matches = re.findall(pattern, text)
print("Asterisk (*) matches:", matches)  # Output: ['a', 'ab', 'abb', 'abbb']

# Plus (+)
pattern = r"ab+"
text = "a ab abb abbb"
matches = re.findall(pattern, text)
print("Plus (+) matches:", matches)  # Output: ['ab', 'abb', 'abbb']

# Question Mark (?)
pattern = r"ab?"
text = "a ab abb abbb"
matches = re.findall(pattern, text)
print("Question Mark (?) matches:", matches)  # Output: ['a', 'ab', 'ab', 'ab']

# Braces ({})
pattern = r"ab{2,3}"
text = "a ab abb abbb abbbb"
matches = re.findall(pattern, text)
print("Braces ({}) matches:", matches)  # Output: ['abb', 'abbb', 'abbb'] (the last comes from 'abbbb')

# Square Brackets ([])
pattern = r"[aeiou]"
text = "hello world"
matches = re.findall(pattern, text)
print("Square Brackets ([]) matches:", matches)  # Output: ['e', 'o', 'o']

# Backslash (\)
pattern = r"\d"
text = "There are 2 apples and 5 oranges."
matches = re.findall(pattern, text)
print("Backslash (\\) matches:", matches)  # Output: ['2', '5']

# Pipe (|)
pattern = r"cat|dog"
text = "I have a cat and a dog."
matches = re.findall(pattern, text)
print("Pipe (|) matches:", matches)  # Output: ['cat', 'dog']

# Parentheses (())
pattern = r"(ab)+"
text = "abab ab ababab"
matches = re.findall(pattern, text)
print("Parentheses (()) matches:", matches)  # Output: ['ab', 'ab', 'ab'] (findall returns the group, once per match)

````
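When a pattern contains a capturing group, `re.findall()` returns the text of the group rather than the whole match, which is why `(ab)+` reports only `'ab'` for each match. The sketch below shows two ways to get the full matches instead: a non-capturing group `(?:...)`, or `re.finditer()` with `group(0)`.

```python
import re

text = "abab ab ababab"

# Capturing group: findall reports the last repetition captured by the group
captured = re.findall(r"(ab)+", text)
# Non-capturing group: findall reports the whole match
whole = re.findall(r"(?:ab)+", text)
# finditer exposes the whole match regardless of grouping
via_finditer = [m.group(0) for m in re.finditer(r"(ab)+", text)]

print(captured)      # Output: ['ab', 'ab', 'ab']
print(whole)         # Output: ['abab', 'ab', 'ababab']
print(via_finditer)  # Output: ['abab', 'ab', 'ababab']
```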

## Common RegEx Patterns

| Pattern | Description | Example Match |
| --- | --- | --- |
| `\d` | Matches any digit (0-9) | `"abc123"` → `1`, `2`, `3` |
| `\D` | Matches any non-digit | `"abc123"` → `a`, `b`, `c` |
| `\w` | Matches any word character (A-Z, a-z, 0-9, _) | `"hello_123"` → every character |
| `\W` | Matches any non-word character | `"hello! world?"` → `!`, the space, `?` |
| `\s` | Matches any whitespace character (space, tab, newline) | `"Hello World"` → the space |
| `\S` | Matches any non-whitespace character | `"Hello World"` → every letter |
| `.` | Matches any character except a newline | `"abc"` → `a`, `b`, `c` |
| `^` | Matches the start of the string | `^H` matches the `H` at the start of `"Hello"` |
| `$` | Matches the end of the string | `!$` matches the `!` at the end of `"World!"` |
| `*` | Matches 0 or more occurrences | `ba*` matches `b`, `ba`, `baaa` |
| `+` | Matches 1 or more occurrences | `ba+` matches `ba`, `baa` but not `b` |
| `?` | Matches 0 or 1 occurrence | `ba?` matches `b`, `ba` |
| `{n}` | Matches exactly n occurrences | `\d{3}` matches `123` |
| `{n,}` | Matches at least n occurrences | `\d{2,}` matches `12`, `123`, `1234` |
| `{n,m}` | Matches between n and m occurrences | `\d{2,4}` matches `12`, `123`, `1234` |
| `[...]` | Matches any character inside the brackets | `[aeiou]` matches `a`, `e`, `i` |
| `[^...]` | Matches any character not inside the brackets | `[^aeiou]` matches any non-vowel |
| `(x\|y)` | Matches x or y | `(cat\|dog)` matches `cat` or `dog` |
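As a small illustration of combining entries from the table, the sketch below checks whether a string has a date-like shape using `\d{n}` quantifiers. The `YYYY-MM-DD` layout and the helper name `looks_like_iso_date` are assumptions made for this example; the pattern checks only the shape, not whether the date exists.

```python
import re

# Four digits, a hyphen, two digits, a hyphen, two digits
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")

def looks_like_iso_date(value):
    """Hypothetical helper: True if the whole string looks like YYYY-MM-DD."""
    return DATE_SHAPE.fullmatch(value) is not None

print(looks_like_iso_date("2024-01-31"))   # Output: True
print(looks_like_iso_date("24-1-31"))      # Output: False
print(looks_like_iso_date("2024-01-311"))  # Output: False
```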

## Greedy and Non-Greedy Expressions

### Greedy Expressions

Greedy expressions in regular expressions try to match as much text as possible. They expand the match as far as they can go while still allowing the overall pattern to match.

#### Example

The second pattern below adds `?` after `*`, which makes the quantifier non-greedy: it matches as little text as possible.

```python
import re

# Greedy expression example
greedy_pattern = r'<.*>'
text = '<div>Some content</div><div>More content</div>'
greedy_matches = re.findall(greedy_pattern, text)
print("Greedy matches:", greedy_matches)  # Output: ['<div>Some content</div><div>More content</div>']

# Non-greedy expression example
non_greedy_pattern = r'<.*?>'
non_greedy_matches = re.findall(non_greedy_pattern, text)
print("Non-greedy matches:", non_greedy_matches)  # Output: ['<div>', '</div>', '<div>', '</div>']
```
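A common alternative to a non-greedy quantifier is a negated character class, which states directly that a tag cannot contain `>`. This is a sketch for simple strings like the one above, not a substitute for an HTML parser:

```python
import re

text = '<div>Some content</div><div>More content</div>'

# [^>]* can never run past the closing '>', so no lazy quantifier is needed
tags = re.findall(r'<[^>]*>', text)
print(tags)  # Output: ['<div>', '</div>', '<div>', '</div>']
```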

```python
import re

result = re.split(r'\s+', 'Split this sentence into words')
print(result)  # Output: ['Split', 'this', 'sentence', 'into', 'words']

# \d matches a single digit, so only the first digit is returned
match = re.search(r'\d', 'The price is 100 dollars 45 878 78')
if match:
    print(match.group())  # Output: 1 (use \d+ to get 100)

matches = re.findall(r'\d+', 'There are 3 cats, 4 dogs, and 5 birds')
print(matches)  # Output: ['3', '4', '5']
```

```python
import re

# List of class names
class_names = ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "George",
               "Nancy", "Oscar", "Paul", "Quincy", "Rachel", "Steve", "Tom", "Zara"]

# Regex patterns
pattern_group1 = r"^[A-Ma-m]"  # Names starting with A-M (either case)
pattern_group2 = r"^[N-Zn-z]"  # Names starting with N-Z (either case)

# Divide names using regex
group1 = [name for name in class_names if re.match(pattern_group1, name)]
group2 = [name for name in class_names if re.match(pattern_group2, name)]

# Print the groups
print("Group 1 (A-M):", group1)  # Alice through George
print("Group 2 (N-Z):", group2)  # Nancy through Zara
```

```python
# Match strings that do not start with 'b', 'c', or 'd' and do not end with "rst"
import re

# Both negative lookaheads sit at the start, before any text is consumed.
# (A lookahead placed after a greedy .* is checked at the end of the string,
# where it can never see "rst", so it would reject nothing.)
pattern = r"^(?![bcd])(?!.*rst$).*$"
# Alternative: negated class for the first character, negative lookbehind at the end
pattern2 = r"^[^bcd].*(?<!rst)$"

# Test strings
test_strings = [
    "apple",    # Should match
    "banana",   # Should not match (starts with 'b')
    "cherry",   # Should not match (starts with 'c')
    "date",     # Should not match (starts with 'd')
    "first",    # Should not match (ends with 'rst')
    "grape",    # Should match
    "mango",    # Should match
    "orange",   # Should match
    "pqrst",    # Should not match (ends with 'rst')
    "strawberry" # Should match
]

# Check each string against the lookahead pattern
for string in test_strings:
    if re.match(pattern, string):
        print(f"'{string}' matches the pattern")
    else:
        print(f"'{string}' does not match the pattern")

print("\n")

# Check each string against the lookbehind alternative
for string in test_strings:
    if re.match(pattern2, string):
        print(f"'{string}' matches the pattern")
    else:
        print(f"'{string}' does not match the pattern")
```

```python
import re

# Signed decimal number: digits with an optional fraction, or a bare fraction such as ".456".
# Requiring at least one digit keeps the pattern from matching "" or a lone sign.
pattern = r"[-+]?(?:\d+(?:\.\d+)?|\.\d+)"
# Digit groups joined by a space, '.', ',' or '/', such as "1 3/4" or "1,255"
# (one possible reading of the original, which was not a valid pattern)
pattern2 = r"\d+(?:[ .,/]\d+)*"

# Test strings
test_strings = [
    "123",      # Integer
    "-123",     # Negative integer
    "+123",     # Positive integer
    "123.456",  # Floating-point number
    "-123.456", # Negative floating-point number
    "+123.456", # Positive floating-point number
    ".456",     # Floating-point number without leading digits
    "-.456",    # Negative floating-point number without leading digits
    "+.456",    # Positive floating-point number without leading digits
    "abc",      # Not a number
    "123abc",   # Not a number
    "123.",     # Integer with trailing decimal point
    "-123.",    # Negative integer with trailing decimal point
    "+123.",    # Positive integer with trailing decimal point
    "1 3/4",
    "1,255,2,5,7"
]

# Check each string against the decimal pattern
for string in test_strings:
    if re.fullmatch(pattern, string):
        print(f"'{string}' matches the pattern")
    else:
        print(f"'{string}' does not match the pattern")

print("\n")

# Check each string against the digit-group pattern
for string in test_strings:
    if re.fullmatch(pattern2, string):
        print(f"'{string}' matches the pattern")
    else:
        print(f"'{string}' does not match the pattern")
```

### `Match.group()`

The `group()` method of a match object returns one or more subgroups of the match. Called with no arguments (or with `0`), it returns the entire match. Called with one group number, it returns that subgroup as a string; called with several, it returns a tuple of the matched subgroups.

Parameters:
- `group1, group2, ...`: The groups to return. Groups are numbered starting from 1; group 0 refers to the entire match.

Returns:
- A string, or a tuple of strings when more than one group is requested.

```python
import re

# One capturing group per word
matches = re.match(r"^(\w+)\s(\w+)\s(\w+)\s(\w+)$", "I am Going Home")

# Example usage of group() with the matches variable
if matches:
    print("Entire match:", matches.group(0))  # Output: I am Going Home
    print("First word:", matches.group(1))    # Output: I
    print("Second word:", matches.group(2))   # Output: am
    print("Third word:", matches.group(3))    # Output: Going
    print("Fourth word:", matches.group(4))   # Output: Home
```

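```python
import re

s = "aditya gupta"
p1 = r"^a\w+?a$"  # no match on s: \w cannot cross the space before "gupta"
p2 = r"^a.*?a"    # lazy: stops at the first 'a' that completes a match
matches = re.match(p2, s)
print(matches)  # Output: <re.Match object; span=(0, 6), match='aditya'>
```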

```python
import re

def is_valid_email(email):
    """Return True if the string has the general shape of an email address."""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None
```

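```python
import re

s = "I am Going Home"
p1 = r"^\w+\s\w+\s\w+\s\w+$"

# Split on runs of non-word characters
matches1 = re.split(r'\W+', s)
print(matches1)  # Output: ['I', 'am', 'Going', 'Home']

# Check that the string is exactly four words, each separated by one whitespace character
matches = re.match(p1, s)
print(matches)  # Output: <re.Match object; span=(0, 15), match='I am Going Home'>
```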

```python
import re

email = "abc_xyz@aid.svnit.ac.in"

# Numbered groups
pattern = r'(\w+)@([\w.]+)\.(\w+)$'
match = re.match(pattern, email)
if match:
    print(match.groups())  # Output: ('abc_xyz', 'aid.svnit.ac', 'in')

# Named groups
pattern_named = r'(?P<username>\w+)@(?P<domain>[\w.]+)\.(?P<tld>\w+)$'
match_named = re.match(pattern_named, email)
if match_named:
    print("Username:", match_named.group('username'))      # Output: Username: abc_xyz
    print("Domain:", match_named.group('domain'))          # Output: Domain: aid.svnit.ac
    print("Top Level Domain:", match_named.group('tld'))   # Output: Top Level Domain: in
    d = match_named.groupdict()
    print(d)  # Output: {'username': 'abc_xyz', 'domain': 'aid.svnit.ac', 'tld': 'in'}
```


```python
import re

st = "(abc def abc def ghi pqr)"
# With several groups, findall returns one tuple per match
matches = re.findall(r'(abc)\s(def)', st)
print(matches)  # Output: [('abc', 'def'), ('abc', 'def')]
```


```python
import re

# Positive lookahead example: Match 'foo' only if it is followed by 'bar'
pattern_lookahead = r'foo(?=bar)'
text = "foobar foo bar foo"
matches_lookahead = re.findall(pattern_lookahead, text)
print("Positive lookahead matches:", matches_lookahead)  # Output: ['foo']

# Negative lookahead example: Match 'foo' only if it is not followed by 'bar'
pattern_neg_lookahead = r'foo(?!bar)'
matches_neg_lookahead = re.findall(pattern_neg_lookahead, text)
print("Negative lookahead matches:", matches_neg_lookahead)  # Output: ['foo', 'foo']

# Positive lookbehind example: Match 'bar' only if it is preceded by 'foo'
pattern_lookbehind = r'(?<=foo)bar'
matches_lookbehind = re.findall(pattern_lookbehind, text)
print("Positive lookbehind matches:", matches_lookbehind)  # Output: ['bar']

# Negative lookbehind example: Match 'bar' only if it is not preceded by 'foo'
pattern_neg_lookbehind = r'(?<!foo)bar'
matches_neg_lookbehind = re.findall(pattern_neg_lookbehind, text)
print("Negative lookbehind matches:", matches_neg_lookbehind)  # Output: ['bar']
```
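Lookarounds are handy when the surrounding context should decide the match without becoming part of it. The sketch below pulls out amounts that follow a dollar sign; the sample sentence is invented for illustration. Note that Python's `re` module requires lookbehind patterns to have a fixed width.

```python
import re

text = "Tickets cost $45, parking is $8, and 3 seats remain."

# Digits preceded by '$'; the '$' itself is not part of the match
dollar_amounts = re.findall(r'(?<=\$)\d+', text)
# Digits preceded by neither '$' nor another digit
other_numbers = re.findall(r'(?<![$\d])\d+', text)

print(dollar_amounts)  # Output: ['45', '8']
print(other_numbers)   # Output: ['3']
```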

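```python
import re

text = "I am going home"  # named text rather than str, to avoid shadowing the built-in
# The capturing group keeps the separators in the result
matches = re.split(r'(\W)', text)
print(matches)  # Output: ['I', ' ', 'am', ' ', 'going', ' ', 'home']
```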


## `re.sub` and `re.subn`

The `re.sub` and `re.subn` functions in Python's `re` module are used for substituting occurrences of a pattern in a string with a replacement string.

### `re.sub`

The `re.sub` function replaces all occurrences of the pattern in the string with the replacement string.

**Syntax:**
```python
re.sub(pattern, repl, string, count=0, flags=0)
```

- `pattern`: The regular expression pattern to search for.
- `repl`: The replacement. Either a string (which may contain backreferences such as `\1`) or a function that takes a match object and returns a string.
- `string`: The input string.
- `count`: The maximum number of pattern occurrences to replace. Default is 0, which means replace all occurrences.
- `flags`: Optional flags to modify the matching behavior.

**Example:**
```python
import re

text = "The rain in Spain stays mainly in the plain."
pattern = r"ain"
replacement = "XYZ"
result = re.sub(pattern, replacement, text)
print(result)  # Output: The rXYZ in SpXYZ stays mXYZly in the plXYZ.
```
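The replacement string can also refer back to groups captured by the pattern, using `\1`, `\2`, and so on, or `\g<name>` for named groups. A short sketch that reorders a date (the sample date and group names are illustrative):

```python
import re

date = "2024-01-31"

# \3, \2, \1 refer to the third, second, and first capturing groups
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

# Named groups are referenced with \g<name>
named = re.sub(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
               r"\g<day>/\g<month>/\g<year>", date)

print(swapped)  # Output: 31/01/2024
print(named)    # Output: 31/01/2024
```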

### `re.subn`

The `re.subn` function is similar to `re.sub`, but it returns a tuple containing the new string and the number of substitutions made.

**Syntax:**
```python
re.subn(pattern, repl, string, count=0, flags=0)
```

- `pattern`: The regular expression pattern to search for.
- `repl`: The replacement. Either a string (which may contain backreferences such as `\1`) or a function that takes a match object and returns a string.
- `string`: The input string.
- `count`: The maximum number of pattern occurrences to replace. Default is 0, which means replace all occurrences.
- `flags`: Optional flags to modify the matching behavior.

**Example:**
```python
import re

text = "The rain in Spain stays mainly in the plain."
pattern = r"ain"
replacement = "XYZ"
result, count = re.subn(pattern, replacement, text)
print(result)  # Output: The rXYZ in SpXYZ stays mXYZly in the plXYZ.
print("Number of substitutions:", count)  # Output: 4
```
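The `count` parameter limits how many matches are replaced, counting from the left. A brief sketch with the same sentence, passing `count` by keyword so it cannot be mistaken for `flags`:

```python
import re

text = "The rain in Spain stays mainly in the plain."

# Replace only the first two occurrences
first_two = re.sub(r"ain", "XYZ", text, count=2)
limited, n = re.subn(r"ain", "XYZ", text, count=2)

print(first_two)  # Output: The rXYZ in SpXYZ stays mainly in the plain.
print(n)          # Output: 2
```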
### Flags

Flags are optional parameters that modify the behavior of the pattern matching. They can be combined using the bitwise OR operator (`|`). Some common flags include:

- `re.IGNORECASE` or `re.I`: Ignore case when matching.
- `re.MULTILINE` or `re.M`: Make `^` and `$` match at the start and end of each line, not only at the start and end of the whole string.
- `re.DOTALL` or `re.S`: Make the `.` special character match any character, including a newline.
- `re.VERBOSE` or `re.X`: Allow the use of whitespace and comments within the pattern for better readability.
- `re.ASCII` or `re.A`: Perform ASCII-only matching instead of Unicode matching.

These flags can be passed as the `flags` argument in the `re.sub` and `re.subn` functions to alter their matching behavior.
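The effect of each flag is easiest to see side by side. The following is a sketch on an invented three-line string; the variable names are illustrative only.

```python
import re

text = "first line\nSecond line\nthird LINE"

# IGNORECASE: 'line' also matches 'LINE'
ignore_case = re.findall(r"line", text, flags=re.IGNORECASE)
# MULTILINE: ^ matches at the start of every line
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
# DOTALL: . also matches the newline, so the match can span lines
dot_all = re.search(r"first.*third", text, flags=re.DOTALL)
# Flags combine with | and are passed to re.sub in the same way
replaced = re.sub(r"^second", "2nd", text, flags=re.IGNORECASE | re.MULTILINE)
# VERBOSE: whitespace and comments inside the pattern are ignored
number = re.compile(r"""
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
""", re.VERBOSE)

print(ignore_case)  # Output: ['line', 'line', 'LINE']
print(line_starts)  # Output: ['first', 'Second', 'third']
print(replaced.splitlines()[1])  # Output: 2nd line
```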

```python
import re

# Example using re.sub
text = "The rain in Spain stays mainly in the plain."
pattern = r"ain"
replacement = "XYZ"
result = re.sub(pattern, replacement, text)
print("Result using re.sub:", result)  # Output: Result using re.sub: The rXYZ in SpXYZ stays mXYZly in the plXYZ.

# Example using re.subn
result, count = re.subn(pattern, replacement, text)
print("Result using re.subn:", result)  # Output: Result using re.subn: The rXYZ in SpXYZ stays mXYZly in the plXYZ.
print("Number of substitutions:", count)  # Output: Number of substitutions: 4
```


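```python
import re

# No match: re.sub returns the string unchanged
s = "abcdef"
p = r"g"
replace = "z"
result = re.sub(p, replace, s)
print(result)  # Output: abcdef

# Collapse each run of hyphens into a single space
s1 = "abc----def"
p1 = r"-+"
replace1 = " "
result1 = re.sub(p1, replace1, s1)  # 'abc def'

def custom_sub(pattern, replacement, string):
    """Hand-rolled re.sub for a plain replacement string.

    Assumes the pattern never matches the empty string.
    """
    compiled_pattern = re.compile(pattern)
    match = compiled_pattern.search(string)
    while match:
        string = string[:match.start()] + replacement + string[match.end():]
        # Resume after the inserted text so the replacement itself is never re-scanned
        match = compiled_pattern.search(string, match.start() + len(replacement))
    return string

s2 = "abc---desf--f"
p2 = r"-+"
replace2 = " "
result2 = custom_sub(p2, replace2, s2)
print(result2)  # Output: abc desf f
```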


```python
import re

s1 = "abc----def"
p1 = r"(-+)"
replace1 = " "

# Replace matches manually, working from right to left so that earlier
# spans stay valid after each replacement changes the string length
result1 = s1
for match in reversed(list(re.finditer(p1, s1))):
    start, end = match.span()
    result1 = result1[:start] + replace1 + result1[end:]

print(result1)  # Output: abc def
```


```python
import re

s = "abc---def--gh-----i"
# -{1,} and -+ are equivalent: one or more hyphens
result = re.sub(r"-{1,}", " ", s)
result1 = re.sub(r"-+", " ", s)
print(result)  # Output: abc def gh i
```


```python
import re

def replace_func(match_obj):
    """Return the replacement text for one run of hyphens."""
    # -+ always matches at least one hyphen, so every run becomes a single space
    return " "

s = "abc---def--gh-----i"
result = re.sub(r"(-+)", replace_func, s)
print(result)  # Output: abc def gh i
```
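A replacement function is most useful when the new text depends on what was matched. As a sketch, the hypothetical `double` function below replaces every number with twice its value:

```python
import re

def double(match_obj):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match_obj.group(0)) * 2)

text = "3 cats, 4 dogs, and 5 birds"
doubled = re.sub(r"\d+", double, text)
print(doubled)  # Output: 6 cats, 8 dogs, and 10 birds
```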



```python
import re

text = "<p> <ul> abc </ul> <ul> efg </ul> </p>"
pattern = r"<ul>.*?</ul>"
matches = re.findall(pattern, text)
print(matches)  # Output: ['<ul> abc </ul>', '<ul> efg </ul>']
```


```python
# Count the occurrences of each character in the string and store the counts in a dictionary

st = "adfagchshudhkafaxafdayrdk"
dict1 = {}
for char in st:
    dict1[char] = dict1.get(char, 0) + 1

print(dict1)
# Output: {'a': 6, 'd': 4, 'f': 3, 'g': 1, 'c': 1, 'h': 3, 's': 1, 'u': 1, 'k': 2, 'x': 1, 'y': 1, 'r': 1}
```


```python
st = "hello"
dict1 = {}

# Same count, written with an explicit membership test
for char in st:
    if char in dict1:
        dict1[char] += 1
    else:
        dict1[char] = 1

print(dict1)  # Output: {'h': 1, 'e': 1, 'l': 2, 'o': 1}
```
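The same count can be combined with a regular expression when only some characters should be included. The sketch below uses `re.findall` to keep lowercase letters and `collections.Counter` from the standard library to tally them; the sample string is an assumption for illustration.

```python
import re
from collections import Counter

st = "hello, world"

# Keep only lowercase letters, dropping the comma and the space
letters = re.findall(r"[a-z]", st)
counts = Counter(letters)

print(counts.most_common(2))  # Output: [('l', 3), ('o', 2)]
```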
