
# <font color = 'pickle' > **Regular Expressions in Python**

- A regular expression, or regex for short, is a pattern that describes a set of strings. 
- It is a powerful tool for text processing, and is widely used in many programming languages for tasks such as validating input, searching for patterns in strings, and manipulating text. 
- Regular expressions use a special syntax to define the pattern, typically combining literal characters, metacharacters (such as `.` or `\d`), and quantifiers (such as `+` or `*`).




# <font color = 'pickle' >**re module**
The re module in Python is a module for working with regular expressions. 

The re module includes functions for searching for and manipulating text that matches a specified regular expression. 

Some of the most commonly used functions in the re module include:

- re.match(): Attempts to match a regular expression pattern at the beginning of a string and returns a match object if the pattern matches there, otherwise None.
- re.search(): Scans a string for the first location where the pattern matches and returns a match object, or None if the pattern is not found.
- re.finditer(): Returns an iterator yielding match objects for every non-overlapping match in the string.
- re.findall(): Returns all non-overlapping matches of a regular expression pattern in a string, as a list of strings (or a list of tuples if the pattern contains more than one capturing group).
- re.sub(): Replaces all occurrences of a regular expression pattern in a string with a specified replacement string.
- re.split(): Splits a string by a specified regular expression pattern.

Overall, the re module provides a wide range of capabilities for working with regular expressions in Python and makes it easy to search for and manipulate text that matches a specified pattern.

We will discuss these functions with examples in this notebook.

# <font color = 'pickle' >import re module

In [None]:
import re

# <font color = 'pickle' > Sample Text

In [None]:
# Sample text for finding regex patterns
full_text = """This is a random file a-z and a_z
Random text mixed with numbers
Writing programs (or programming) is a very creative and rewarding activity.  computers are fun You can write and eat programs for 3036 reasons. Let us add some random numbers 2234 678 9 67 456. Do you like computers. art mat fat  Computer with capital case. Molly and mollies. Cat and dog
organisation or organization

List of Holidays - Republic Day, National Science Day, Independence day
List of numbers and hyphens: 98166-94916 98794-41515
List of telephone numbers: 562-489-1132 795-456-2615 951-964-4930
List of URLs mixed with random text 
https://www.youtube.com some ransom text https://www.facebook.com  just having some fun http://www.baidu.com bla blah blah https://www.yahoo.com  more urls to have more fun https://www.amazon.com https://www.wikipedia.org
https://docs.python.org/3/library/re.html
List of tweet mentions : @jnardino @pilot @cairdin @Leora13
List of Twitter Hashtags : #influencer #fridayfeelings #mondayblues #tbt

List of Dates : 16-12-2006  xysb  26-09-2012  abcd 1-2-1995  ttt 1-12-2019 12-2-2024
List of amounts: $10000 you never pay Rs 159,974 rat bat cat  is good Rs 300000 $300,000,000 more amount examples $200 only $200,00
List of Names : Mr.Wade and  Mr.Seth are brothers. Mr Wade is married to Mrs.Deborah  and has daughter Miss.Sarah. Mr. Seth ismarried to  Mrs.Ashley

Lits of countries and states 
USA-Texas US-Washington D.C. USA-California USA-Utah USA-Wyoming US-Washington

List of emails : abc@xyz.com xwe_123_faf@fal.com asd.fgh@jklm.com dfgh.ajfb.2008@gmail.com
List of executives mixed with random text: CEO Sundar random text CEO Jeff  12345 CEO Vikram  blah blah blah CFO Ram CTO Sam
Small math Problem 45 + 55 =100"""

# <font color = 'pickle'>**Functions of re module**

## <font color = 'pickle' > **re.match()**
re.match() is a function in Python's re module that checks whether a pattern matches at the beginning of a string. It does not scan the rest of the string. If the pattern matches, it returns a match object, which contains information about the match, such as its position in the string and the text captured by each group.

The basic syntax for the re.match() function is:

```re.match(pattern, string)```

- pattern is the regular expression pattern that you want to search for
- string is the string in which you want to search for the pattern

If the match is successful, the function returns a match object. **If the match is not successful, the function returns None.**

**Note that match() only tries the pattern at the beginning of the string.** A pattern that occurs only later in the string does not count as a match.

In [None]:
# Define the pattern we want to search for
search_pattern = r"This "

# Use the match() function to search for the pattern in the text
# match() function attempts to match the pattern to the beginning of the string
# and returns a match object if it finds a match
# If no match is found, it will return None
result = re.match(search_pattern, full_text)

# Print the match object
print(result)

<re.Match object; span=(0, 5), match='This '>


### <font color = 'pickle' > **group() method**

The group() method of the match object returned by re.match() returns the matched string.

**If re.match() returned None, calling group() on the result raises an AttributeError.** Check that the result is not None before calling group().
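A minimal sketch of that check. The helper name `matched_prefix` and the shortened sample string are hypothetical and used only for illustration; they are not part of the notebook.

```python
import re

sample = "This is a random file"  # hypothetical sample string


def matched_prefix(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    m = re.match(pattern, text)
    # Only call group() when a match object was actually returned
    return m.group() if m is not None else None


print(matched_prefix(r"This", sample))    # the pattern is at the start of the string
print(matched_prefix(r"random", sample))  # the pattern occurs later, so match() fails
```

Testing for None explicitly keeps the "no match" case visible in the code instead of relying on an exception.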

In [None]:
print(result.group())

This 


### <font color = 'pickle' > **start() and end() methods**

In [None]:
# start() and end() return the start and end positions of the matched string
print(result.start())
print(result.end())

0
5


### <font color = 'pickle' > **group(n) method**

The group() method takes an optional parameter, n, that specifies which group to return. If n is not specified, group() returns the entire match. If the pattern has multiple groups, group(n) returns the text captured by the nth group: group 0 is the entire match, group 1 is the first group, group 2 is the second group, and so on.

In [None]:
# pattern to search
search_pattern = r"([A-Z])(\w+)"

The regular expression is composed of two parts:

- ([A-Z]) is the first part of the regular expression. This part uses square brackets [] to define a character set, which matches any single character that is within the set. In this case, the set is defined as [A-Z], which matches any single capital letter. This part is also enclosed in parentheses () which creates a group.
- (\w+) is the second part of the regular expression. This part uses the special character \w to match any word character (letters, digits, or underscores). The plus sign + following the \w means that one or more word characters should be matched. This part is also enclosed in parentheses () which creates a group.

When this regular expression is used to match a string, it looks for one capital letter, followed by one or more word characters.

In [None]:
# Run the match once and reuse the match object
match = re.match(search_pattern, full_text)

# group(0) returns the entire match
result = match.group(0)

# group(1) returns the text captured by the first group
result_2 = match.group(1)

# group(2) returns the text captured by the second group
result_3 = match.group(2)

# Print the results
print('Entire match: ', result, '\nGroup 1: ', result_2, '\nGroup 2: ', result_3)

Entire match:  This 
Group 1:  T 
Group 2:  his


### <font color = 'pickle' > **No match case**

If the re.match() function is unable to find a match for the specified pattern in the given text, it returns None. In this case, if you try to call the group() method on the returned None value, it will raise an AttributeError, because None doesn't have a group() method.

In [None]:
# If there is no match, re.match() returns None

# This pattern will not match because the text does not start with a space
search_pattern = r" This is" # Contains whitespace at the beginning

# Checking for match
result = re.match(search_pattern, full_text)
print(result)
print(result.group())

None


AttributeError: ignored

### <font color = 'pickle' > **use try-except block to handle attribute error**

In [None]:
# Define the search pattern to look for at the beginning of the text
search_pattern = r" This is" # this pattern will not match as it is not in the beginning of a sentence

try:
    # Use the match() function to check if the search pattern is found at the beginning of the full text
    result = re.match(search_pattern, full_text)
    
    # Extracting the matched substring using group()
    matched_substring = result.group()
    print(matched_substring)

# If the match() function returns None, an AttributeError is raised when trying to call the group() method
except AttributeError:
    # Print an error message if the substring is not found at the beginning of the sentence
    print("substring not found at the beginning of sentence")

substring not found at the beginning of sentence


Explanation of the above code:

- The try block of the code calls the re.match() function and assigns the result to the variable result. If the re.match() function returns None, an AttributeError is raised when trying to call the group() method on the result variable.

- The except block of the code catches the AttributeError and prints an error message "substring not found at the beginning of sentence" if the re.match() function returns None.

- The matched_substring variable is assigned the return value of result.group(), which is the matched substring. If re.match() returns None, the call to group() raises an AttributeError before the assignment happens, so matched_substring is never defined.

- Since the search_pattern doesn't match in the full_text, the re.match() function will return None and the code will print "substring not found at the beginning of sentence".

## <font color = 'pickle' >**re.search()**
- re.match() function only searches for a pattern at the beginning of a string. If the pattern is found, it returns a match object, otherwise it returns None.

- re.search(), on the other hand, searches for a pattern anywhere in the string. It returns the first occurrence of the pattern in the string as a match object, or None if the pattern is not found.
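
The difference can be seen by running both functions with the same pattern. This is a small sketch on a shortened, hypothetical sample string rather than the notebook's full_text.

```python
import re

sample = "This is a random file"  # hypothetical sample string
pattern = r"random"

at_start = re.match(pattern, sample)   # only tries position 0
anywhere = re.search(pattern, sample)  # scans the whole string

print(at_start)         # no match object, because the string starts with "This"
print(anywhere.span())  # position of the first occurrence
```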

In [None]:
# Define the search pattern to look for in the text
search_pattern = r"random"

# Use the search() function to find the first occurrence of the substring
result = re.search(search_pattern, full_text)

# Print the match object
print(result)

<re.Match object; span=(10, 16), match='random'>


### <font color = 'pickle' > **group method**

In [None]:
# Use .group() to extract the matched text
# It works the same way as .group() on the object returned by re.match()
# If no match is found, re.search() returns None and .group() raises AttributeError

# Define the search pattern to look for in the text
search_pattern = r"random"

try:
    # Use the search() function to find the first occurrence of the substring
    result = re.search(search_pattern, full_text)
    
    # Extracting the matched substring using group()
    matched_substring = result.group()
    print(matched_substring)

# If the search() function returns None, an AttributeError is raised when trying to call the group() method
except AttributeError:
    # Print an error message if the substring is not found anywhere in the text
    print("substring not found anywhere in the text")

random


## <font color = 'pickle' >**re.findall()**</font>
re.findall() is a function in Python's re module that finds all non-overlapping occurrences of a pattern in a string. It returns a list of strings, one per match. If the pattern contains capturing groups, the list holds the captured groups instead of the whole matches.

The re.findall() function takes two required arguments (an optional third argument, flags, modifies how matching is done):

1. pattern: This is the regular expression pattern to search for. It can be a string or a compiled regular expression object.

2. string: This is the string to search in.
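
One detail that is easy to miss: the shape of the list returned by findall() depends on the capturing groups in the pattern. The sketch below uses a hypothetical sample string to show the three cases (no group, one group, several groups).

```python
import re

sample = "Dates: 16-12-2006 and 1-2-1995"  # hypothetical sample string

# No capturing group: each item is the whole match
no_groups = re.findall(r"\d{1,2}-\d{1,2}-\d{4}", sample)

# One capturing group: each item is the text captured by that group
one_group = re.findall(r"\d{1,2}-\d{1,2}-(\d{4})", sample)

# Several capturing groups: each item is a tuple with one entry per group
many_groups = re.findall(r"(\d{1,2})-(\d{1,2})-(\d{4})", sample)

print(no_groups)
print(one_group)
print(many_groups)
```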

### <font color = 'pickle' > Example 1 - find all occurrences of a word

In [None]:
# Define the search pattern to look for in the text
search_pattern = r"random"

# Use the findall() function to find all non-overlapping occurrences of the substring
result = re.findall(search_pattern, full_text)

# Print the list of all matches
print(result)

['random', 'random', 'random', 'random', 'random']


### <font color = 'pickle' > Example 2 - find all dates

In [None]:
# Define the search pattern to look for in the text
search_pattern = r'\d{1,2}-\d{1,2}-\d{4}'

# Use the findall() function to find all non-overlapping occurrences of the substring
result = re.findall(search_pattern, full_text)

# Print the list of all matches
print(result)

['16-12-2006', '26-09-2012', '1-2-1995', '1-12-2019', '12-2-2024']


- This code uses the re module in Python to search for a specific pattern in a given text, full_text. The re.findall() function is used to search for all non-overlapping occurrences of the pattern in the full_text string.

- The search_pattern variable is defined as the regular expression ```r'\d{1,2}-\d{1,2}-\d{4}'``` which will match all occurrences of the pattern of 1 or 2 digits, followed by a hyphen, followed by 1 or 2 digits, followed by a hyphen, followed by 4 digits.

- This pattern matches dates written as dd-mm-yyyy or d-m-yyyy. It checks only the shape of the text, not whether the date is valid, so a string such as 99-99-2021 would also match.

- The result variable holds the value returned by re.findall(search_pattern, full_text): a list of the matched substrings. If the pattern is not found, the list is empty.

In summary, this code uses the re.findall() function to search for all non-overlapping occurrences of a pattern of dates in the format of dd-mm-yyyy or d-m-yyyy in a string and returns a list of all matched substrings.
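
Because the pattern describes only the layout of a date, a follow-up check is needed if invalid dates must be rejected. One possible approach, sketched here on a hypothetical sample string, is to pass each candidate to datetime.strptime and keep only the ones that parse. The day-month-year order is an assumption taken from the dd-mm-yyyy description above.

```python
import re
from datetime import datetime

# Hypothetical sample string with two valid and two invalid dates
sample = "Valid: 16-12-2006 and 1-2-1995. Invalid: 31-02-2020 and 99-99-2021"

# Step 1: the regex finds everything that looks like a date
candidates = re.findall(r"\d{1,2}-\d{1,2}-\d{4}", sample)

# Step 2: strptime rejects candidates that are not real calendar dates
valid_dates = []
for candidate in candidates:
    try:
        valid_dates.append(datetime.strptime(candidate, "%d-%m-%Y").date())
    except ValueError:
        # Not a real date (for example day 31 in February), so skip it
        pass

print(candidates)
print(valid_dates)
```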

## <font color = 'pickle' >**re.finditer()** 
re.finditer() is a function in the python re (regular expression) module. It returns an iterator yielding match objects for all non-overlapping matches of the given pattern in the string.

The function takes two required arguments, plus an optional third argument, flags (for example re.IGNORECASE):

- The first argument is the regular expression pattern that you want to search for.
- The second argument is the string in which you want to search for the pattern.

The returned iterator object can be used in a for loop to iterate over all the matches, with each iteration returning a match object that contains information about the match, such as the start and end position of the match within the string.

One of the key benefits of using finditer() over the findall() function is that it returns match objects instead of just strings. These match objects contain additional information about the match, such as the start and end positions of the match within the string, which can be useful in certain scenarios.

In [None]:
text = 'Computer Science is the study of computers and computational systems. Computer engineers design and build computer systems.'

search_pattern = r"computer"

# Using finditer to get the matches
matches = re.finditer(search_pattern, text, re.IGNORECASE)

# Iterating over the matches
for match in matches:
    print(f"Matched word: {match.group()}")
    print(f"Start index: {match.start()}")
    print(f"End index: {match.end()}")
    print()

Matched word: Computer
Start index: 0
End index: 8

Matched word: computer
Start index: 33
End index: 41

Matched word: Computer
Start index: 70
End index: 78

Matched word: computer
Start index: 106
End index: 114



Code Explanation:
- In this example, finditer() is used to search for the word "computer" (ignoring the case of the word) in the text variable. 
- The group() method is used to extract the matched word, start() returns the start index and end() returns the end index of the matched word.

As you can see, finditer() gives access to additional information about each match, such as its start and end position in the text, which is useful when you need to locate the matched text or process what surrounds it.
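
As one illustration of using those positions, the sketch below slices out a few characters of context around each match. The helper name `contexts`, its `width` parameter, and the shortened sample text are hypothetical.

```python
import re

text = "Computer engineers design and build computer systems."  # shortened sample


def contexts(pattern, text, width=10):
    """Return each match with up to `width` characters of context on either side."""
    snippets = []
    for m in re.finditer(pattern, text, re.IGNORECASE):
        start, end = m.span()  # same values as m.start() and m.end()
        snippets.append(text[max(0, start - width):end + width])
    return snippets


for snippet in contexts(r"computer", text):
    print(snippet)
```

This kind of slicing is not possible with findall(), which returns only the matched strings and discards their positions.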

## <font color = 'pickle' >**re.sub()** 
re.sub() is a function in the python re (regular expression) module that replaces all occurrences of a regular expression pattern in a string with a specified replacement string.

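The function takes three required arguments (the optional count and flags arguments limit the number of replacements and modify matching behavior):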

- The first argument is the regular expression pattern that you want to search for.
- The second argument is the replacement string.
- The third argument is the string in which you want to search for the pattern.


In [None]:
text = "hello world, hello everyone"
search_pattern = "hello"
new_text = re.sub(search_pattern, "hi", text)
print(new_text)

hi world, hi everyone


In this example, the function is searching for the pattern "hello" in the string "hello world, hello everyone" and replaces all occurrences of the pattern with the string "hi" and returns the modified string "hi world, hi everyone".
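
If only some occurrences should be replaced, re.sub() accepts an optional count argument. A short sketch reusing the same sentence:

```python
import re

text = "hello world, hello everyone"

# count=1 replaces only the first occurrence; the default (0) replaces all of them
first_only = re.sub(r"hello", "hi", text, count=1)
print(first_only)
```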

re.sub() is useful when you want to replace all occurrences of a pattern in a string with a specified replacement string, and <font color = 'red'>**you can also use it to remove certain parts of the string**.</font> For example, you might use re.sub() to remove all the numbers from a string by replacing them with an empty string.

In [None]:
text = "hello 1234 world 5678"
search_pattern = "[0-9]+"
new_text = re.sub(search_pattern, "", text)
print(new_text)

hello  world 


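In this example, the regular expression pattern [0-9]+ searches for one or more consecutive digits and replaces them with an empty string, effectively removing all the numbers from the text. The spaces around the numbers are left in place, which is why the output contains a double space and a trailing space.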
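
Removing the digits leaves the surrounding spaces behind. One way to tidy the result, shown here as a sketch rather than the only option, is a second re.sub() call that collapses runs of whitespace, followed by strip().

```python
import re

text = "hello 1234 world 5678"

# Step 1: remove the digits (extra spaces remain)
no_digits = re.sub(r"[0-9]+", "", text)

# Step 2: collapse runs of whitespace into one space and trim the ends
tidy = re.sub(r"\s+", " ", no_digits).strip()

print(repr(no_digits))
print(repr(tidy))
```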



## <font color = 'pickle' >**re.split()**

re.split() is a function in the python re (regular expression) module that splits a string into a list of substrings by using a specified regular expression pattern as the delimiter.

The function takes two required arguments (optional maxsplit and flags arguments are also available):

- The first argument is the regular expression pattern that you want to use as the delimiter.
- The second argument is the string that you want to split.

re.split() is useful when the delimiter cannot be described by a single fixed string. The string method str.split() splits only on one literal separator (or on whitespace), whereas re.split() splits on anything a regular expression can match.
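
A small sketch of that difference, using a hypothetical string whose fields are separated by a mix of commas, semicolons, and spaces:

```python
import re

text = "apple, banana;cherry  date"  # hypothetical sample string

# str.split() handles only one literal separator at a time
by_comma = text.split(",")

# re.split() treats any run of commas, semicolons, or whitespace as one delimiter
by_regex = re.split(r"[,;\s]+", text)

print(by_comma)
print(by_regex)
```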

### <font color = 'pickle' >Example 1 : Split using comma as delimiter

In [None]:
text = "hello,world,how,are,you"
search_pattern = ","
result = re.split(search_pattern, text)
print(result)

['hello', 'world', 'how', 'are', 'you']


In this example, the function is using the regular expression pattern "," as the delimiter to split the string "hello,world,how,are,you" into a list of substrings ['hello', 'world', 'how', 'are', 'you'].

### <font color = 'pickle' >Example 2 : Create tokens which have only alphanumeric characters

In [None]:
text = 'This is a random file a-z and a_z. Can you help?'
re.split(r'[^\w]+',text)



['This',
 'is',
 'a',
 'random',
 'file',
 'a',
 'z',
 'and',
 'a_z',
 'Can',
 'you',
 'help',
 '']

In the code, ```re.split(r'[^\w]+', text)``` is used to create a list of tokens (substrings) from the text variable by splitting it at every occurrence of one or more non-word characters (anything that is not a letter, digit, or underscore).

Here's what happens:
- The regular expression pattern r'[^\w]+' is used as the delimiter for the split function.
- The pattern matches one or more non-word characters (anything that is not a letter, digit, or underscore). Thus, in this example, every run of spaces or punctuation other than the underscore acts as a delimiter.
- re.split() splits the text variable at each occurrence of the delimiter.
- The resulting list is displayed as the cell output; the code does not assign it to a variable.
- Delimiters are not part of the list of substrings and are thus removed.
- Because the text ends with a delimiter ('?'), the last element of the list is an empty string.

Summary: The resulting tokens contain only word characters (letters, digits, and underscores), because the text is split at every run of non-word characters and those characters are discarded. The one exception is the trailing empty string, which appears because the text ends with a delimiter.

<font color = 'red'>Note: In the above example, the hyphen (-) in 'a-z' is considered a non-word character and is used as a delimiter. As a result, 'a-z' is split into two separate tokens: 'a' and 'z'. On the other hand, the underscore (_) in 'a_z' is considered a word character and is not used as a delimiter. Therefore, 'a_z' appears as a single token.</font>
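
The split output above ends with an empty string because the text ends with a delimiter. Two possible ways to get the tokens without it are sketched below: filtering the split result, or matching the tokens directly with re.findall() and the complementary pattern \w+.

```python
import re

text = 'This is a random file a-z and a_z. Can you help?'

# Option 1: split on non-word characters, then drop empty strings
tokens_split = [token for token in re.split(r'[^\w]+', text) if token]

# Option 2: match runs of word characters directly
tokens_findall = re.findall(r'\w+', text)

print(tokens_split)
print(tokens_findall)
```

Both options produce the same list for this text; the findall() version avoids the empty string without a separate filtering step.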

