# REGULAR EXPRESSIONS - REGEX

A regular expression (regex) is a sequence of characters that defines a search pattern. It is used in text processing to describe and match patterns in strings.

### Why is it useful?

- **Pattern matching**: to search for specific patterns within a larger body of text, e.g. finding specific words, dates, email addresses and other structured information.


- **Text extraction**: useful for extracting pieces of information from a document. This information can be anything from a house address to the most repeated phrases in a document.


- **Data validation & pattern validation**: useful for validating input data, e.g. checking whether a phone number matches the 11-digit pattern for Nigerian contact numbers. It is also used to check whether an input adheres to a format, such as whether a password being created has at least one capital letter, a digit and a special symbol.


- **Search and replace**: useful for cases where you need to replace specific patterns with something else. This is especially useful in data cleaning and transformation.


- **Web scraping and data extraction**: to locate and extract specific elements and information/data from HTML pages


- **Parsing and tokenization**: to break down text into smaller units or tokens. Useful in NLP problems.


- **Log file analysis**: used in searching and parsing log files, allowing you to extract important information or identify patterns of interest.


- **Search engines and information retrieval**: used for matching user queries to relevant content on websites or databases


- **URL routing and validation**: useful for validating URLs to ensure they follow the correct format and for extracting specific parameters from them


- **Network security and firewall rules**: lets you define custom rules for allowing or blocking specific types of traffic based on patterns in network traffic logs


- **Extracting metadata from documents**: by parsing the text of documents such as PDFs and Word documents, regex can extract metadata like titles, authors, etc.


- **URL rewriting in web servers**: modifying URLs on the fly to improve SEO or to direct traffic to specific pages

## User-defined character classes

- `[abc]` - matches either a, b, or c
- `[^abc]` - matches any character except a, b, or c
- `[a-z]` - matches any lowercase letter
- `[A-Z]` - matches any uppercase letter
- `[a-zA-Z]` - matches any English letter, whether lowercase or uppercase
- `[0-9]` - matches any digit character
- `[a-zA-Z0-9_]` - matches any alphanumeric character or an underscore
- `[^a-zA-Z0-9_]` - matches any character except alphanumeric characters and the underscore

**NOTE:**

1. The pattern `[abc]` matches exactly one character: a single a, b, or c and nothing else. To match something different, put the characters you actually want to search for inside the square brackets.

2. A dash `-` defines a character range, so a narrower range such as `[0-6]` is also valid.
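
As a minimal sketch of the classes above, the snippet below applies a few of them with `re.findall`. The sample strings are made up for illustration.

```python
import re

sample = "Room 42B, floor_3!"   # hypothetical sample text

# [abc] matches ONE character at a time that is a, b or c
abc_hits = re.findall(r"[abc]", "cabbage")

# [0-9] matches a single digit, [A-Z] a single uppercase letter
digits = re.findall(r"[0-9]", sample)
uppers = re.findall(r"[A-Z]", sample)

# [^a-zA-Z0-9_] matches anything that is NOT a letter, digit or underscore
others = re.findall(r"[^a-zA-Z0-9_]", sample)

print(abc_hits)  # ['c', 'a', 'b', 'b', 'a']
print(digits)    # ['4', '2', '3']
print(uppers)    # ['R', 'B']
print(others)    # [' ', ',', ' ', '!']
```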

## Pre-defined characters
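They serve as shorthand for the classes above.

- `\d` = `[0-9]`: matches a digit character
- `\D` = `[^0-9]`: matches any character except digit characters
- `\w` = `[a-zA-Z0-9_]`: matches an alphanumeric character or an underscore
- `\W` = `[^a-zA-Z0-9_]`: matches any character that is not alphanumeric or an underscore
- `\s`: matches a whitespace character (space, tab, newline, etc.)
- `\S`: matches any character except whitespace
- `\t`: matches a tab character
- `\n`: matches a newline character
- `\r`: matches a carriage return

In Python 3, `\d` and `\w` also match non-ASCII (Unicode) digits and letters unless the `re.ASCII` flag is used.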
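
A short sketch of the shorthands in action, assuming a made-up string that contains a tab and a newline:

```python
import re

text = "Order 66\tshipped\non 2024_01"   # hypothetical text with a tab and a newline

digits = re.findall(r"\d", text)        # every single digit
whitespace = re.findall(r"\s", text)    # space, tab and newline all count
non_word = re.findall(r"\W", text)      # here only the whitespace is "non-word"

# \w includes the underscore, so "2024_01" is made of word characters only
all_word_chars = re.fullmatch(r"\w+", "2024_01") is not None

print(digits)          # ['6', '6', '2', '0', '2', '4', '0', '1']
print(whitespace)      # [' ', '\t', '\n', ' ']
print(all_word_chars)  # True
```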


## Quantifiers
Quantifiers determine how many times the preceding character (or group) must appear.

- `*` = matches 0 or more occurrences; useful for cases where the character might be there or not
- `+` = matches 1 or more occurrences, i.e. the character appears at least once and up to as many times as possible
- `?` = matches at most 1 occurrence, i.e. 0 or 1; it indicates that the character to be matched is optional
- `a{n}` = matches exactly n occurrences of `a`
- `a{m,n}` = matches at least m and at most n occurrences of `a`


**NOTE:** Quantifiers also apply to character classes, so you can have `\d+`, `\d*`, `\d{5}`
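
To make the difference between the quantifiers concrete, here is a small sketch that tests each one against the same made-up word list using `re.fullmatch`:

```python
import re

words = ["ct", "cat", "caat", "caaat"]   # hypothetical test words

def matching(pattern):
    """Return the words that match the pattern in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

star = matching(r"ca*t")            # 0 or more "a"
plus = matching(r"ca+t")            # 1 or more "a"
optional = matching(r"ca?t")        # 0 or 1 "a"
exactly_two = matching(r"ca{2}t")   # exactly 2 "a"
one_to_two = matching(r"ca{1,2}t")  # between 1 and 2 "a"

print(star)         # ['ct', 'cat', 'caat', 'caaat']
print(plus)         # ['cat', 'caat', 'caaat']
print(optional)     # ['ct', 'cat']
print(exactly_two)  # ['caat']
print(one_to_two)   # ['cat', 'caat']
```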

## Meta Characters

Metacharacters are symbols that have a special meaning inside a pattern. If you need to find one of them as an actual symbol in a document, put `\` before it to tell the regex matcher to treat it as a literal character.

`.`, `^`, `*`, `+`, `?`, `{}`, `[]`, `()`, `|`, `\`, `$`

- `^....$` = **starting and ending**. `^` anchors the pattern to the start of the string and `$` anchors it to the end


- `(string)` = `()` are used for **matching groups**. Any subpattern inside a pair of `()` is captured as a group that can be retrieved separately from the full match


- `(abc(cv))` = **capturing sub groups AKA nested groups**. Groups are numbered by the position of their opening parenthesis


- `(abc|cvb)` = **capturing conditionals (alternation)**. `|` matches either the pattern on its left or the one on its right


- `\b` = **matches the boundary between a word character and a non-word character, e.g. a space or a newline, or the start or end of the string**. It is useful in capturing entire words (for example by using the pattern `\w+\b`)


- `.*` = **captures all**. `.` matches any character except a newline and `*` repeats it zero or more times, so `.*` matches everything up to the end of the line

**For example:**

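1. `ab?c` = could be abc or ac; b is an optional character to search for


2. Searching for "success" in a log file that also contains instances of "unsuccessful". The pattern `^success$` matches only when the whole string (or the whole line, with the `re.MULTILINE` flag) is exactly "success". To find the word anywhere inside a longer line without matching "unsuccessful", `\bsuccess\b` can be used instead.

3. "file_record_transcript.pdf" and "file_07241999.pdf": we need to write a regex that captures only the file names (not including the .pdf extension). Pattern will be = `^(file_[a-z0-9]*([_a-z]?)+)`. A simpler pattern that also checks the extension is `^(file_\w+)\.pdf$`; in both cases group 1 holds the name.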

4. Searching for instances of "Buy more milk", "Buy more bread", "Buy more juice" 
The pattern would be `"Buy more (milk|bread|juice)"`
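
The examples above can be tried out with a short sketch. The log line and the simplified file-name pattern are assumptions chosen for illustration, not the only way to write these patterns.

```python
import re

# Escaping: \. matches a literal dot, while a bare . would match any character
literal_dot = re.findall(r"3\.14", "3.14 and 3x14")

# Word boundaries keep "success" from matching inside "unsuccessful"
log = "success unsuccessful success"          # hypothetical log text
hits = re.findall(r"\bsuccess\b", log)

# Alternation inside a group: group(1) holds whichever option matched
m = re.search(r"Buy more (milk|bread|juice)", "Buy more bread")
item = m.group(1)

# Nested groups are numbered by the position of their opening parenthesis
groups = re.search(r"(abc(cv))", "xxabccvxx").groups()

# Capturing a file name without its extension
name = re.fullmatch(r"(file_\w+)\.pdf", "file_record_transcript.pdf").group(1)

print(literal_dot)  # ['3.14']
print(hits)         # ['success', 'success']
print(item)         # bread
print(groups)       # ('abccv', 'cv')
print(name)         # file_record_transcript
```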

## Importing the required module

In [34]:
import re

In [3]:
help(re)

Help on module re:

NAME
    re - Support for regular expressions (RE).

MODULE REFERENCE
    https://docs.python.org/3.9/library/re
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last mat

### re.fullmatch()

Returns a match object **if and only if the entire string matches the pattern**. Otherwise, it will return `None`. This is useful for validating data.

In [None]:
#Syntax:

import re
re.fullmatch(pattern, string)

**1. Regex to match a Nigerian contact number**

Rules:

- It should start with 0

- It should contain exactly 11 digits

In [37]:
#asking for input 
mobile_num = input('Enter your Nigerian mobile number: ')

#defining the expression/pattern for Nigerian phone numbers
#(fullmatch already anchors both ends, so ^ and $ are optional here)
regex = r'^(0)[7-9]\d{9}$'

#match the regex with the user entered mobile number
match = re.fullmatch(regex, mobile_num)

#Check to know if the number entered is valid or not
if match:
    print('Your number is Valid!')
else:
    print('Not valid!')

Enter your Nigerian mobile number: 091234567890987
Not valid!


**Note:**

- `[7-9]` is used to ensure that the first digit after the leading 0 belongs to a mobile operator
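
For testing without typing input each time, the same rule can be wrapped in a function. This is a sketch: `PHONE_PATTERN` and `is_valid_number` are hypothetical names, and the sample numbers are made up.

```python
import re

# Same rule as above: a leading 0, then 7, 8 or 9, then nine more digits
PHONE_PATTERN = re.compile(r"0[7-9]\d{9}")

def is_valid_number(number):
    """Return True only if the WHOLE string is an 11-digit number."""
    return PHONE_PATTERN.fullmatch(number) is not None

print(is_valid_number("08012345678"))      # True
print(is_valid_number("091234567890987"))  # False: too many digits
print(is_valid_number("0801234567"))       # False: only 10 digits

# re.match only checks the beginning, so it would accept the over-long number
prefix_only = re.match(r"0[7-9]\d{9}", "091234567890987") is not None
print(prefix_only)                         # True
```

The last line shows why `re.fullmatch` is the safer choice for validation: `re.match` is satisfied as soon as the first 11 characters fit the pattern.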

**Preceding `r` in a string**

The `r` at the start of the pattern string designates a Python "raw" string. In a raw string, backslashes are kept as literal characters instead of being interpreted as escape codes, so the pattern reaches the regex engine exactly as written.

For example: `'\n'` will be treated as a newline character while `r'\n'` refers to the characters `\` followed by `n`
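
A small sketch of why the raw prefix matters for patterns. The sample text is made up.

```python
import re

normal = "\n"    # one character: a newline
raw = r"\n"      # two characters: a backslash followed by n
lengths = (len(normal), len(raw))

# In a normal string "\b" is a backspace character, so the regex engine
# never sees the word-boundary token and the pattern finds nothing here
without_raw = re.findall("\bcat\b", "cat concat")

# In a raw string the backslash survives and \b means "word boundary"
with_raw = re.findall(r"\bcat\b", "cat concat")

print(lengths)      # (1, 2)
print(without_raw)  # []
print(with_raw)     # ['cat']
```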

**2. Regex to validate a Python identifier (AKA user-defined name)**

Rules:

   - Can be a combination of letters (uppercase and lowercase), digits (0-9) and underscores
    
   - Must start with a letter or an underscore. Digits cannot be the first character.
    
   - No special characters are allowed except _
   
   - Reserved words or keywords are not allowed

In [38]:
#create a list of python keywords
keywords = ["True", "False", "None", "and", "or", "not", "in", "is", 
            "if", "elif", "else", "for", "while", "break", "continue", 
            "pass", "def", "del", "class", "with", "as", "lambda", "return", 
            "yield", "import", "from", "try", "except", "raise", "finally",
            "assert", "async", "await", "global", "nonlocal"]

#asking for input 
my_identifier = input('Enter python identifier: ')

#defining the expression/pattern for python identifiers
#(\w already includes the underscore)
regex = r"[a-zA-Z_]\w*"

#match the regex with the user entered identifier
match = re.fullmatch(regex, my_identifier)

#Check to know if the identifier entered is valid or not
if my_identifier in keywords:
    print('You have entered a python keyword. Please revise.')
elif match:
    print('Valid!')
else:
    print('Not valid!')

Enter python identifier: break
You have entered a python keyword. Please revise.
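
As an alternative sketch, the standard library's `keyword` module can replace the hand-written keyword list. The function name `check_identifier` is hypothetical, and `re.ASCII` is used here on the assumption that only ASCII letters should be accepted.

```python
import keyword
import re

# re.ASCII limits \w to [a-zA-Z0-9_]
IDENTIFIER = re.compile(r"[a-zA-Z_]\w*", re.ASCII)

def check_identifier(name):
    """Classify a candidate name as 'keyword', 'valid' or 'not valid'."""
    if keyword.iskeyword(name):
        return "keyword"
    if IDENTIFIER.fullmatch(name):
        return "valid"
    return "not valid"

print(check_identifier("break"))    # keyword
print(check_identifier("_total1"))  # valid
print(check_identifier("1abc"))     # not valid
print(check_identifier("my-var"))   # not valid
```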


### re.findall()

It scans a string (document) to **find every substring that matches a specified pattern** and returns the matches as a list. The string is scanned left to right, and matches are returned in the order found. Useful for tasks like data extraction and data cleaning.

In [None]:
#Syntax:

import re
re.findall(pattern, string)

In [45]:
import re

string = """
        Hello, my email is john.doe@example.com. 
        For more information, contact me at john.doe+info@example.com. 
        You can also reach out to my colleague at jane.doe@example.co.uk.
        For general inquiries, please contact us at info4567@example.com. 
        For technical support, reach out to support@example.com. 
        For sales inquiries, email sales@example.com. 
        For partnership opportunities, contact us at partners@example.com.
        """
#extract all email addresses from the string
#(each dot in the domain must be followed by letters, so the full stop that ends a sentence is not captured)
regex = r"[\w.+]+@\w+(?:\.[a-z]+)+"

re.findall(regex, string)

['john.doe@example.com',
 'john.doe+info@example.com',
 'jane.doe@example.co.uk',
 'info4567@example.com',
 'support@example.com',
 'sales@example.com',
 'partners@example.com']
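
When the pattern contains capturing groups, `re.findall` returns the contents of the groups instead of the whole match. A minimal sketch, using a made-up sentence:

```python
import re

text = "Contact john.doe@example.com or sales@example.org"  # hypothetical text

# No groups: each list item is the full match
full = re.findall(r"[\w.+]+@\w+\.[a-z]+", text)

# Two groups: each list item is a (local part, domain) tuple
parts = re.findall(r"([\w.+]+)@(\w+\.[a-z]+)", text)

print(full)   # ['john.doe@example.com', 'sales@example.org']
print(parts)  # [('john.doe', 'example.com'), ('sales', 'example.org')]
```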

### Search and replace

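### re.sub()

Replaces every occurrence of the pattern in the target string with the replacement and returns the new string. The original string is left unchanged.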

In [None]:
#Syntax:

re.sub(pattern, replacement, targetString)

In [46]:
import re
new = re.sub(r'\d', '#', 'a1b2c3d4e4f6&*%')
new

'a#b#c#d#e#f#&*%'

In [49]:
#re.sub(regex, replacement, targetString)

import re
phone = '123-1234-546-980. This is my phone number'

#removing the letters and full stops (the spaces are left behind)
num1 = re.sub(r'[a-zA-Z.]+', '', phone)
print('Num1: ', num1)

#replacing - with |
num2 = re.sub(r'-+', '|', phone)
print('Num2: ', num2)

Num1:  123-1234-546-980     
Num2:  123|1234|546|980. This is my phone number
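
Beyond plain text replacements, `re.sub` also accepts group references, a `count` limit and a function as the replacement. The sketch below uses made-up strings to show each option.

```python
import re

# \1, \2 and \3 in the replacement refer to the captured groups
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-01-31")

# count limits how many matches are replaced
first_only = re.sub(r"\d", "#", "a1b2c3", count=1)

# The replacement can be a function that receives each match object
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

print(swapped)     # 31/01/2024
print(first_only)  # a#b2c3
print(doubled)     # 6 apples, 20 pears
```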


### re.split()

Splits the target string wherever the pattern matches and returns the pieces as a list.

In [None]:
#Syntax:

re.split(pattern, string)

In [51]:
import re
line = re.split(r'\W', 'abcdef&ghij+1234@gmail.com')
print(line)

['abcdef', 'ghij', '1234', 'gmail', 'com']
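
A few variations on `re.split`, sketched with made-up strings: splitting on runs of separators, limiting the number of splits with `maxsplit`, and keeping the separators by wrapping the pattern in a group.

```python
import re

# \W+ treats a run of separators as a single split point
parts = re.split(r"\W+", "apples, oranges;  pears")

# maxsplit stops after the given number of splits
limited = re.split(r"\W", "abcdef&ghij+1234@gmail.com", maxsplit=2)

# A capturing group keeps the separators in the result
kept = re.split(r"(\W)", "a&b+c")

print(parts)    # ['apples', 'oranges', 'pears']
print(limited)  # ['abcdef', 'ghij', '1234@gmail.com']
print(kept)     # ['a', '&', 'b', '+', 'c']
```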


## Step by Step for creating RegEx

### Step 1: Create the pattern object

In [52]:
#compile() ->> converts the pattern into regex object

import re
pattern = re.compile('Python')

print(type(pattern))

<class 're.Pattern'>


### Step 2: Create matcher object

In [53]:
matcher = pattern.finditer('Does Python use Object Oriented Programming? How many Python libraries do you know of? Python and python')

print(type(matcher))

<class 'callable_iterator'>


### Step 3: Iterate over the matcher

In [54]:
#start() -> starting index of matched string
#end() -> index just past the last matched character (end+1)
#group() -> returns matched string

for m in matcher:
    print(type(m))
    print('Match is at: {}, End: {}, Pattern found: {}'.format(m.start(), m.end(), m.group()))
print('DONE!')

<class 're.Match'>
Match is at: 5, End: 11, Pattern found: Python
<class 're.Match'>
Match is at: 54, End: 60, Pattern found: Python
<class 're.Match'>
Match is at: 87, End: 93, Pattern found: Python
DONE!
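
One detail worth knowing about the matcher is that it is an iterator, so it can be looped over only once. The sketch below reuses the sentence from the cells above and collects the positions with `span()`, which returns `(start, end)` as a tuple.

```python
import re

text = ('Does Python use Object Oriented Programming? '
        'How many Python libraries do you know of? Python and python')

pattern = re.compile('Python')
matcher = pattern.finditer(text)

# First pass: collect (start, end) for every match
spans = [m.span() for m in matcher]

# Second pass: the iterator is already exhausted, so nothing is left
leftover = list(matcher)

print(spans)     # [(5, 11), (54, 60), (87, 93)]
print(leftover)  # []
```

If the matches are needed more than once, store them in a list first or call `finditer` again.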


## Using various combinations

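The three-step approach above (compile the pattern, create the matcher, iterate over it) is the first way. The same result can be obtained more compactly, as the following cells show.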

In [55]:
#2nd way

import re

matcher = re.compile('Python').finditer('Does Python use Object Oriented Programming? How many Python libraries do you know of? Python and python')

for m in matcher:
    print('Match is at: {}, End: {}, Pattern found: {}'.format(m.start(), m.end(), m.group()))

Match is at: 5, End: 11, Pattern found: Python
Match is at: 54, End: 60, Pattern found: Python
Match is at: 87, End: 93, Pattern found: Python


In [56]:
#3rd way =>> re.finditer(pattern, target)

import re

matcher = re.finditer('Python', 'Does Python use Object Oriented Programming? How many Python libraries do you know of? Python and python')
for m in matcher:
    print('Match is at: {}, End: {}, Pattern found: {}'.format(m.start(), m.end(), m.group()))


Match is at: 5, End: 11, Pattern found: Python
Match is at: 54, End: 60, Pattern found: Python
Match is at: 87, End: 93, Pattern found: Python


In [58]:
import re
matcher = re.finditer('Python','Does Python use Object Oriented Programming? How many Python libraries do you know of? Python and python')

count = 0

for m in matcher:
    count += 1
    print('Match is at: {}, End: {}, Pattern found: {}'.format(m.start(), m.end(), m.group()))
print('Total patterns found: ', count)

Match is at: 5, End: 11, Pattern found: Python
Match is at: 54, End: 60, Pattern found: Python
Match is at: 87, End: 93, Pattern found: Python
Total patterns found:  3


In [59]:
matcher = re.finditer('[^a-z]', 'a4b&u_lkegy/>+ Deciimal points#@')

count = 0

for m in matcher:
    count += 1
    print('Match is at: {}, End: {}, Pattern found: {}'.format(m.start(), m.end(), m.group()))
print('Total patterns found: ', count)

Match is at: 1, End: 2, Pattern found: 4
Match is at: 3, End: 4, Pattern found: &
Match is at: 5, End: 6, Pattern found: _
Match is at: 11, End: 12, Pattern found: /
Match is at: 12, End: 13, Pattern found: >
Match is at: 13, End: 14, Pattern found: +
Match is at: 14, End: 15, Pattern found:  
Match is at: 15, End: 16, Pattern found: D
Match is at: 23, End: 24, Pattern found:  
Match is at: 30, End: 31, Pattern found: #
Match is at: 31, End: 32, Pattern found: @
Total patterns found:  11


In [63]:
matcher = re.finditer('a*', 'abaaaahytajkoaaa7ytaaaweraasdfgaa')

count = 0

for m in matcher:
    count += 1
    print('Match is at: {}, End: {}, Pattern found: {}'.format(m.start(), m.end(), m.group()))
print('Total patterns found: ', count)

Match is at: 0, End: 1, Pattern found: a
Match is at: 1, End: 1, Pattern found: 
Match is at: 2, End: 6, Pattern found: aaaa
Match is at: 6, End: 6, Pattern found: 
Match is at: 7, End: 7, Pattern found: 
Match is at: 8, End: 8, Pattern found: 
Match is at: 9, End: 10, Pattern found: a
Match is at: 10, End: 10, Pattern found: 
Match is at: 11, End: 11, Pattern found: 
Match is at: 12, End: 12, Pattern found: 
Match is at: 13, End: 16, Pattern found: aaa
Match is at: 16, End: 16, Pattern found: 
Match is at: 17, End: 17, Pattern found: 
Match is at: 18, End: 18, Pattern found: 
Match is at: 19, End: 22, Pattern found: aaa
Match is at: 22, End: 22, Pattern found: 
Match is at: 23, End: 23, Pattern found: 
Match is at: 24, End: 24, Pattern found: 
Match is at: 25, End: 27, Pattern found: aa
Match is at: 27, End: 27, Pattern found: 
Match is at: 28, End: 28, Pattern found: 
Match is at: 29, End: 29, Pattern found: 
Match is at: 30, End: 30, Pattern found: 
Match is at: 31, End: 33, Pattern
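
The empty matches in the output above appear because `a*` is allowed to match zero characters, so it "succeeds" at every position that does not hold an `a`. A shorter made-up string makes this easier to see, and using `a+` instead removes the empty matches. The results below assume Python 3.7 or later.

```python
import re

text = "baab"   # hypothetical short string

# a* matches zero or more "a", so it also matches the empty string
star = [(m.start(), m.end(), m.group()) for m in re.finditer(r"a*", text)]

# a+ needs at least one "a", so only the real run is reported
plus = [(m.start(), m.end(), m.group()) for m in re.finditer(r"a+", text)]

print(star)  # [(0, 0, ''), (1, 3, 'aa'), (3, 3, ''), (4, 4, '')]
print(plus)  # [(1, 3, 'aa')]
```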

### re.match()

`re.match(pattern, target)`

It is used to match the given pattern at the beginning of the target string. If it finds the pattern there, it returns a match object. If nothing is found, it returns `None`.

In [65]:
import re

regex = input('Enter pattern to search for: ')

m = re.match(regex, 'abcdefghijk')

if m is None:
    print('Match is not available at the beginning of the string')
else:
    print('Match found at the beginning of the string.')
    print('Match is at: {}, End: {}, Pattern found: {}'.format(m.start(), m.end(), m.group()))

Enter pattern to search for: efg
Match is not available at the beginning of the string


### re.search()

`re.search(pattern, target)`

Searches the whole target string for the pattern, irrespective of location. If a match is found, it returns a match object for the first occurrence; otherwise it returns `None`.

In [69]:
import re

regex = input('Enter pattern to search for: ')

m = re.search(regex, 'abcdefghijk')

if m is not None:
    print('Match available!')
    print('First occurrence at: {}, End: {}, Pattern found: {}'.format(m.start(), m.end(), m.group()))
else:
    print('Match was not found in the entire string.')

Enter pattern to search for: efghi
Match available!
First occurrence at: 4, End: 9, Pattern found: efghi
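
To compare the three functions side by side without typing input, the sketch below runs the same pattern through `re.match`, `re.search` and `re.fullmatch`. The `results` and `summary` names are only for illustration.

```python
import re

target = 'abcdefghijk'
pattern = r'efg'

results = {
    'match': re.match(pattern, target),          # only at the beginning
    'search': re.search(pattern, target),        # anywhere in the string
    'fullmatch': re.fullmatch(pattern, target),  # the whole string
}

# Keep the (start, end) span where a match exists, None otherwise
summary = {name: (m.span() if m else None) for name, m in results.items()}

print(summary)  # {'match': None, 'search': (4, 7), 'fullmatch': None}
```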


### Ignoring the case of the pattern

In [74]:
import re
I = re.IGNORECASE  #You can refer to the help(re) results above to understand how to use flags in regex

string = 'I am learning Python'

match = re.search('python$', string, I)

if match:
    print('Nice!')
else:
    print('Oops🙂')

Nice!
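
The flag can be written in a few equivalent ways. This sketch shows the long name, the short alias `re.I`, and the inline form `(?i)` placed at the start of the pattern, using the same sentence as above.

```python
import re

string = 'I am learning Python'

long_form = re.search(r'python$', string, re.IGNORECASE)
short_form = re.search(r'python$', string, re.I)
inline_form = re.search(r'(?i)python$', string)   # flag written inside the pattern
no_flag = re.search(r'python$', string)           # case-sensitive, so no match

print(long_form.group())    # Python
print(short_form.group())   # Python
print(inline_form.group())  # Python
print(no_flag)              # None
```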
