# Regular Expressions in Python

This notebook demonstrates the usage of the `re` module in Python, which provides regular expression matching operations.

In [None]:
import re

## Basic Search Operations

Here we search for a pattern in a string and get its position.

In [None]:
# The pattern here is a plain literal, so we are simply checking where that substring occurs inside s
s = "AbbbbAbbbbAbbb:A computer science portal for aaaaa"
# Search for the word "portal" in the string, then print the start and end indices of the match.
# The r prefix (r'portal') stands for "raw", not "regex": in a raw string literal, Python does not treat \ as an escape character.

match = re.search(r'portal', s)
print('Start Index: ', match.start())
print('End Index: ', match.end())
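
`re.search()` returns `None` when the pattern is not found, so calling `.start()` on the result would raise an `AttributeError` in that case. A minimal sketch of a guard, assuming a hypothetical helper named `find_span` (not part of the `re` module):

```python
import re

def find_span(pattern, text):
    """Return (start, end) of the first match, or None when nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.start(), match.end()

s = "AbbbbAbbbbAbbb:A computer science portal for aaaaa"
print(find_span(r'portal', s))   # a (start, end) tuple
print(find_span(r'laptop', s))   # None, because "laptop" does not occur in s
```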

## Common Regular Expression Functions

Here are the commonly used regex functions:

- `re.findall()`: Finds and returns all non-overlapping (meaning that they don't have an intersection) matching occurrences in a list
- `re.compile()`: Compiles a regular expression into a reusable pattern object
- `re.split()`: Split the string by the occurrences of a character or a pattern
- `re.sub()`: Replaces all occurrences of a pattern with a string
- `re.escape()`: Escapes the regex special characters in a string so that it can be matched literally
- `re.search()`: Finds the first occurrence of a character or a pattern

In [None]:
string = """Hello my Number is 123456789 and
            my friend's number is 987654321"""
regex = r'\d+'

# \   escape character
# \d  a single digit
# \d+ a sequence of one or more digits
# \d* zero or more digits
match = re.findall(regex, string)
print(match)

## Character Classes

Regular expressions provide a way to match specific sets of characters.

In [None]:
p = re.compile('[a-e]')
# Matches every lowercase letter from a to e (the uppercase "A" is not matched)
print(p.findall("Aye, said Mr. Gibeeeenson Startk"))
# The string is scanned from left to right, and the matches are returned in that order

In [None]:
p = re.compile(r'\d')
# Find every single digit
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))
p = re.compile(r'\d+')
# Find every sequence of digits
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))
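
Character classes are case-sensitive unless told otherwise. As a small sketch (the sample text is made up for illustration), listing both ranges inside the brackets matches upper and lower case:

```python
import re

text = "Aye, Bob"

# Only lowercase a-e
lower_only = re.findall(r'[a-e]', text)
# Lowercase a-e or uppercase A-E
both_cases = re.findall(r'[a-eA-E]', text)

print(lower_only)   # ['e', 'b']
print(both_cases)   # ['A', 'e', 'B', 'b']
```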

## Word Characters

`\w` matches any word character (letters, digits, and the underscore).
`\W` matches any character that is not a word character.

In [None]:
# Use raw strings for patterns: a plain '\w' only works because Python leaves unknown escapes untouched (and newer versions warn about it)
p = re.compile(r'\w')
# Single word characters (the characters allowed in a variable name)

print(p.findall("He said * in some_lang."))
p = re.compile(r'\w+')
# Runs of consecutive word characters
print(p.findall('I went to him at 11 A.M., he \
                said *** in some_language'))
p = re.compile(r'\W')
# Everything that cannot be used in a variable name
print(p.findall("he said *** in some_language."))
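
For `str` patterns, `\w` is Unicode-aware by default, so accented letters count as word characters. The sketch below, using a made-up sample string, contrasts that with the `re.ASCII` flag, which restricts `\w` to `[a-zA-Z0-9_]`:

```python
import re

# "café naïve", written with escapes to keep the source ASCII-only
text = "caf\u00e9 na\u00efve"

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)  # accented letters stay inside the words
print(ascii_words)    # accented letters act as separators
```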

In [None]:
p = re.compile('ab*')
print(p.findall("ababbaabbb"))
# 'ab*' matches 'a' followed by zero or more 'b's
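
To see how the quantifier changes the result, the same string can be run through `*`, `+`, and `?`. This is only an illustrative comparison; the variable names are arbitrary:

```python
import re

text = "ababbaabbb"

star = re.findall(r'ab*', text)      # 'a' followed by zero or more 'b's
plus = re.findall(r'ab+', text)      # 'a' followed by one or more 'b's
optional = re.findall(r'ab?', text)  # 'a' followed by at most one 'b'

print(star)      # ['ab', 'abb', 'a', 'abbb']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['ab', 'ab', 'a', 'ab']
```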

## Split Function

The `re.split()` function splits a string by the occurrences of a pattern.

Syntax:
```python
re.split(pattern, string, maxsplit=0, flags=0)
```

- `maxsplit` is the maximum number of splits to perform; the default `0` means no limit, so the string is split at every occurrence of the pattern
- `flags` (optional) change how the pattern is matched (e.g., `re.IGNORECASE` makes matching case-insensitive)

In [None]:
print(re.split(r'\W+','Words, words , Words'))
# \W+ matches a run of non-word characters, so the string is split on ", " and then on " , "
print(re.split(r'\W+',"Words's words Words"))
# Same idea, but here the apostrophe (') is also a non-word character
print(re.split(r'\W+','On 12th Jan 2016, at 11:02 AM'))
# Digits are word characters, so \W+ does not split on them.
# To split on digits as well, combine both patterns with | (alternation): \W+|\d+
print(re.split(r'\d+','On 12th Jan 2016, at 11:02 AM'))
# Splits at every sequence of digits
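
The alternation mentioned in the comments can be sketched as follows. Because adjacent separators (for example `2016` directly followed by `, `) produce empty strings in the result, this example filters them out; a single character class `[\W\d]+` is shown as one possible alternative that avoids the empty pieces:

```python
import re

text = 'On 12th Jan 2016, at 11:02 AM'

# Split on runs of non-word characters OR runs of digits
pieces = re.split(r'\W+|\d+', text)
non_empty = [piece for piece in pieces if piece]

# One class covering both kinds of separator treats adjacent runs as a single separator
merged = re.split(r'[\W\d]+', text)

print(pieces)     # contains empty strings where two separators touch
print(non_empty)  # ['On', 'th', 'Jan', 'at', 'AM']
print(merged)     # ['On', 'th', 'Jan', 'at', 'AM']
```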

In [None]:
# If we limit the number of splits:
print(re.split(r'\d+','On 12th Jan 2016, at 11:02 AM',maxsplit=1))
# Splits only once, at the first sequence of digits
print(re.split('[a-f]+','Aey, Boy oh boy, come here',flags=re.IGNORECASE))
# [a-f]+ is equivalent to (a|b|c|d|e|f)+; with IGNORECASE the uppercase letters A-F act as separators too
print(re.split('[a-f]+','Aey, Boy oh boy, come here'))
# Without the flag, only lowercase a-f act as separators

## Sub Function

The `re.sub()` function replaces occurrences of a pattern with a provided replacement.

Syntax:
```python
re.sub(pattern, repl, string, count=0, flags=0)
```

We search for a pattern in a string and it is replaced by `repl`.

In [None]:
# Replace every occurrence of "ub" with "~*", ignoring case
print(re.sub('ub','~*', 'Subject has Uber booked already',flags=re.IGNORECASE))
# IGNORECASE matches all 2^n case combinations of an n-letter pattern: ub, Ub, uB, UB (4 possibilities because n = 2)
print(re.sub('ub','~*', 'Subject has Uber booked already'))
# Case-sensitive: only lowercase "ub" is replaced
print(re.sub('ub','~*', 'Subject has Uber booked already',count=1,flags=re.IGNORECASE))
# count=1: only the first match is replaced
print(re.sub('ub','~*', 'Subject has Uber booked already uBik',count=3,flags=re.IGNORECASE))
# count=3: at most three matches are replaced
print(re.sub(r'\sAND\s','&',"Baked Beans And Spam",flags=re.IGNORECASE))
# The r prefix passes \s through to the regex engine unchanged; \s matches any whitespace character, so we are searching for "AND" surrounded by whitespace
print(re.sub(r'\'AND\'','&',"Baked Beans 'And' Spam",flags=re.IGNORECASE))
# Here \' is just a literal single quote, so we are searching for 'AND' including the quotes
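
The replacement string `repl` does not have to be a fixed literal: it can refer back to groups captured by the pattern. A brief sketch, with an invented sample sentence, that swaps two words using the backreferences `\1` and `\2`:

```python
import re

# Each pair of parentheses captures one word; \2 and \1 reinsert them in reverse order
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')

print(swapped)  # world hello
```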

## Subn Function

The `re.subn()` function is just like `sub()` but it returns a tuple with the new string and the count of replacements.

Syntax:
```python
re.subn(pattern, repl, string, count=0, flags=0)
```

The first element of the returned tuple is the modified string, and the second is the number of replacements made.

In [None]:
print(re.subn('ub','~*', 'Subject has Uber booked already',flags=re.IGNORECASE))
t = re.subn('ub','~*', 'Subject has Uber booked already',flags=re.IGNORECASE)
print(t)
print(len(t))  # 2: the tuple holds the new string and the count
print(t[0])    # the modified string
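
Since `re.subn()` returns a two-element tuple, it is often convenient to unpack it directly. A minimal sketch (the names `new_text` and `replacements` are arbitrary):

```python
import re

new_text, replacements = re.subn(
    'ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE
)

print(new_text)      # S~*ject has ~*er booked already
print(replacements)  # 2
```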

In [None]:
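# \W+ matches runs of non-word characters (anything that is not a letter, a digit, or an underscore)
# Each run is replaced by a backslash followed by a space, similar to what re.escape(string) does to the spaces
print(re.sub(r'\W+',r'\ ', 'This is Awesome even 1 AM'))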
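
A typical use of `re.escape()` is searching for text that happens to contain regex special characters. In this sketch (sample strings invented for illustration), the unescaped `+` is read as a quantifier and the search fails, while the escaped version matches literally:

```python
import re

literal = "1+1=2"
text = "We know that 1+1=2."

# Unescaped: "1+" means "one or more 1s", so the literal text is not found
unescaped_match = re.search(literal, text)

# Escaped: every special character is matched literally
escaped_match = re.search(re.escape(literal), text)

print(unescaped_match)        # None
print(escaped_match.group())  # 1+1=2
```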