# Intro

The re module is part of Python's standard library and provides support for regular expressions (regex). Regular expressions are a powerful tool for pattern matching and string manipulation.

## Components of Patterns

In regular expressions, we use meta characters and special sequences to create patterns that match specific strings or character sequences. By combining these building blocks, we can create very complex patterns to match a wide variety of text data.



# MetaCharacters

<pre>In regular expressions, metacharacters are characters that have a special meaning and are used to create complex patterns. 
Here are some common metacharacters in the re module:

.  (dot)             : matches any single character except a newline (unless re.DOTALL is set).
^  (caret)           : matches the start of the string (with re.MULTILINE, also the start of each line).
$  (dollar sign)     : matches the end of the string, or just before a newline at the end of the string.
*  (asterisk)        : matches zero or more occurrences of the preceding element.
+  (plus sign)       : matches one or more occurrences of the preceding element.
?  (question mark)   : matches zero or one occurrence of the preceding element.
{m}                  : matches exactly m occurrences of the preceding element.
{m,n}                : matches between m and n occurrences (inclusive) of the preceding element.
[] (square brackets) : matches any one of the characters listed inside the brackets.
() (parentheses)     : groups part of a pattern and captures the text it matches.
|  (pipe symbol)     : matches either the pattern on its left or the one on its right. For example,
                       the pattern cat|dog matches either "cat" or "dog".
</pre>
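
The quantifiers, groups, and alternation listed above can be tried out with a minimal sketch like the following. The sample strings are invented for illustration; only `re.findall` and `re.search` from the standard library are used.

```python
import re

# A quantifier applies to the element just before it: a character, a class, or a group.
star = re.findall(r'ab*', 'a ab abbb')              # 'a' followed by zero or more 'b'
plus = re.findall(r'ab+', 'a ab abbb')              # 'a' followed by one or more 'b'
optional = re.findall(r'colou?r', 'color colour')   # the 'u' may be absent
counted = re.findall(r'[0-9]{2,3}', '1 22 333')     # runs of two or three digits

# Parentheses let a quantifier act on a whole sub-pattern.
repeated = re.search(r'(ab){2}', 'xababx').group()

# The pipe tries the left alternative, then the right one.
either = re.findall(r'cat|dog', 'cat, dog, bird')

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['color', 'colour']
print(counted)   # ['22', '333']
print(repeated)  # abab
print(either)    # ['cat', 'dog']
```

Note that `b*` and `b+` differ only on the input `'a'`, where zero occurrences of `b` are acceptable to `*` but not to `+`.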


# Special Sequences

<pre>
Special sequences in regular expressions are patterns that match specific types of characters or character groups.
Here are some common special sequences in the re module:

\d: matches any digit character. Equivalent to [0-9] when the re.ASCII flag is set; by default, str patterns also match other Unicode digits.
\D: matches any non-digit character. Equivalent to [^0-9] with re.ASCII.
\s: matches any whitespace character, including space, tab, newline, and carriage return. Equivalent to [ \t\n\r\f\v] with re.ASCII.
\S: matches any non-whitespace character. Equivalent to [^ \t\n\r\f\v] with re.ASCII.
\w: matches any word character: a letter, a digit, or the underscore. Equivalent to [a-zA-Z0-9_] with re.ASCII.
\W: matches any non-word character. Equivalent to [^a-zA-Z0-9_] with re.ASCII.
\b: matches a word boundary: the position between a word character (as defined by \w) and a non-word character, or the start or end of the string next to a word character.
\B: matches any position that is not a word boundary.
</pre>

# Some Basic MetaCharacters

###  \. DOT

In [23]:
import re

# Match any string that starts with 'a', ends with 'b', and has any single character in between
pattern = r'a.b'
text1 = 'acb'
text2 = 'a2b'
text3 = 'a@b'
text4 = 'a\nb'  # Dot does not match newline
text5 = 'abc'   # Needs a single character in between

# Use the search function to find a match
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)
match4 = re.search(pattern, text4)
match5 = re.search(pattern, text5)

# Print the matched text
print(match1.group())  # 'acb'
print(match2.group())  # 'a2b'
print(match3.group())  # 'a@b'
print(match4)          # None (no match)
print(match5)          # None (no match)


acb
a2b
a@b
None
None
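
The `None` result for `'a\nb'` comes from the default rule that the dot skips newlines. If matching across lines is wanted, the `re.DOTALL` flag lifts that restriction; a minimal sketch, reusing the same pattern and text as the cell above:

```python
import re

text = 'a\nb'

# Default behaviour: the dot refuses to match the newline.
default = re.search(r'a.b', text)

# With re.DOTALL the dot matches any character, newline included.
dotall = re.search(r'a.b', text, re.DOTALL)

print(default)               # None
print(repr(dotall.group()))  # 'a\nb'
```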


In [24]:
# Escaping the dot makes it literal: the pattern matches only the text 'a.b'

pattern = r'a\.b'  # \. matches a literal dot instead of any single character
text1 = 'a.b'
text2 = 'a\nb'  # No match: a newline is not a literal dot

match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)

print(match1.group())  # 'a.b'
print(match2)          # None (no match)


a.b
None


In [25]:
# Match 'a', then any single character, then a literal dot, then 'b'

pattern = r'a.\.b'

text1 = 'ac.b'
text2 = 'a.sb'  # The literal dot must come immediately before 'b'

match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)

print(match1.group())  # 'ac.b'
print(match2)          # None (no match)

ac.b
None


In [26]:
# Match 'a', then any number of characters other than newline (including zero), then 'b'
pattern = r'a.*b'

text1 = 'acb'
text2 = 'a2b'
text3 = 'a@b'
text4 = 'a\nb'  # Dot does not match newline
text5 = 'abcdefghijklmnopqrstuvwxyzb'
text6 = 'acdefghijklmnopqrstuvwxyzb'

match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)
match4 = re.search(pattern, text4)
match5 = re.search(pattern, text5)
match6 = re.search(pattern, text6)

print(match1.group())  # 'acb'
print(match2.group())  # 'a2b'
print(match3.group())  # 'a@b'
print(match4)  # None (no match)
print(match5.group())  # 'abcdefghijklmnopqrstuvwxyzb' (greedy: .* extends to the last 'b')
print(match6.group())  # 'acdefghijklmnopqrstuvwxyzb'


acb
a2b
a@b
None
abcdefghijklmnopqrstuvwxyzb
acdefghijklmnopqrstuvwxyzb
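
The `.*` in the pattern above is greedy: it consumes as much as it can while still letting the rest of the pattern match. Appending `?` to the quantifier makes it lazy, so it stops at the earliest possible point. The following is a small sketch with a made-up sample string:

```python
import re

text = 'a1b a2b'  # sample string invented for this sketch

# Greedy: .* runs to the last 'b' in the string.
greedy = re.search(r'a.*b', text).group()

# Lazy: .*? stops at the first 'b' that completes the match.
lazy = re.search(r'a.*?b', text).group()

# With findall, the lazy form picks out each short match in turn.
all_lazy = re.findall(r'a.*?b', text)

print(greedy)    # a1b a2b
print(lazy)      # a1b
print(all_lazy)  # ['a1b', 'a2b']
```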


In [27]:
pattern = r'a\.b.*'  # Literal 'a.b' followed by any characters up to (not including) a newline

text1 = 'a.b'
text2 = 'a.b\n'
text3 = 'a.b\n\n'
text4 = 'a.bcdefghijklmnopqrstuvwxyzb'

match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)
match4 = re.search(pattern, text4)

print(match1.group())  # 'a.b'
print(match2.group())  # 'a.b' (.* stops at the newline)
print(match3.group())  # 'a.b' (.* stops at the first newline)
print(match4.group())  # 'a.bcdefghijklmnopqrstuvwxyzb'

a.b
a.b
a.b
a.bcdefghijklmnopqrstuvwxyzb


###  \^ CARET

In [32]:
pattern = r'^Artificial' # Match any string that starts with 'Artificial' (and has any number of characters after that)

text1 = 'Artificial Intelligence'
text2 = 'ArtificialIntelligence'
text3 = 'Artificial Intelligence is the future'

match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)

print(match1.group())  # 'Artificial'
print(match2.group())  # 'Artificial'
print(match3.group())  # 'Artificial'


Artificial
Artificial
Artificial


Note: The caret has a different meaning when it is the first character inside a character class (square brackets). There it negates the class, so the class matches any single character that is not listed. For example, the pattern [^abc] matches any character that is not 'a', 'b', or 'c'. Anywhere else inside the brackets, the caret is an ordinary literal character.

In [33]:
pattern = r'[^a-zA-Z0-9]' # Match the first character that is not an ASCII letter or digit

text1 = '@#$!@#'
text2 = 'a2b'

match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)

print(match1.group())  # '@'
print(match2)          # None (no match)


@
None
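
The different roles of the caret can be put side by side in one short sketch. The sample strings are invented for illustration, and `re.MULTILINE` is a standard flag that the cells above do not use.

```python
import re

text = 'one\ntwo'  # sample string invented for this sketch

# As an anchor, ^ matches only at the very start of the string by default.
first_only = re.findall(r'^[a-z]+', text)

# With re.MULTILINE, ^ also matches right after each newline.
every_line = re.findall(r'^[a-z]+', text, re.MULTILINE)

# As the first character inside brackets, ^ negates the class.
non_digits = re.findall(r'[^0-9]', 'a1b2')

# Anywhere else inside brackets, ^ is just a literal caret.
literal = re.findall(r'[a^]', 'a^b')

print(first_only)  # ['one']
print(every_line)  # ['one', 'two']
print(non_digits)  # ['a', 'b']
print(literal)     # ['a', '^']
```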


###  \$ DOLLAR

In [37]:
pattern = r'aib$' # Match any string that ends with 'aib'

text1 = 'Sai aib'
text2 = 'Shahzaib'
text3 = 'Jahanzib'

match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
match3 = re.search(pattern, text3)

print(match1.group())  # 'aib'
print(match2.group())  # 'aib'
print(match3)          # None (no match)

aib
aib
None


In [39]:
pattern = r'Python\$'  # Escape the dollar sign to match a literal '$' rather than the end of the string

text1 = 'Python$'
text2 = 'Python$ is a programming language'

match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)

print(match1.group())  # 'Python$'
print(match2.group())  # 'Python$'


Python$
Python$
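
One detail of the unescaped `$` is easy to miss: besides the very end of the string, it also matches just before a single trailing newline. The sketch below contrasts it with `\Z`, which matches only at the true end, and with `re.MULTILINE`, which lets `$` match at the end of every line. The input strings are assumptions made for the example.

```python
import re

# $ accepts a position just before a newline that ends the string.
before_newline = re.search(r'aib$', 'Shahzaib\n')

# \Z matches only at the very end, so the trailing newline blocks it.
strict_end = re.search(r'aib\Z', 'Shahzaib\n')

# By default $ anchors to the end of the whole string.
last_line = re.findall(r'[a-z]+$', 'one\ntwo')

# With re.MULTILINE, $ also matches before each newline.
each_line = re.findall(r'[a-z]+$', 'one\ntwo', re.MULTILINE)

print(before_newline.group())  # aib
print(strict_end)              # None
print(last_line)               # ['two']
print(each_line)               # ['one', 'two']
```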
