<a href="https://colab.research.google.com/github/Anik-Adnan/Natural-Language-Processing/blob/main/Regular_Expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Regular Expressions
A regular expression is a sequence of characters, called the pattern, that describes the substrings you want to find in a given string.

For example, suppose you have a dataset of customer reviews about your restaurant and you want to extract the emojis from the reviews because they are a good predictor of the sentiment of the review.

As another example, virtual assistants such as Siri and Google Now use information retrieval to give you better results. When you ask them a question or ask them to search for something interesting on the screen, they look for common patterns such as emails, phone numbers, place names, dates and times. The assistant can then automatically make a booking or offer to call the restaurant to make one.

Regular expressions are a very powerful tool in text processing. They help you clean and handle your text much more effectively.

### Let's import the regular expression library in Python.

In [None]:
import re

Let's do a quick search using a pattern.

In [None]:
re.search('Ravi', 'Ravi is an exceptional student!')

<re.Match object; span=(0, 4), match='Ravi'>

In [None]:
# print output of re.search()
match = re.search('Ravi', 'Ravi is an exceptional student!')
print(match.group())

Ravi


Let's define a function to match regular expression patterns

In [None]:
def find_pattern(text, patterns):
    # Run the search once and reuse the result
    match = re.search(patterns, text)
    if match:
        return match
    else:
        return 'Not Found!'

### Quantifiers

In [None]:
# '*': Zero or more
print(find_pattern("ac", "ab*"))
print(find_pattern("abc", "ab*"))
print(find_pattern("abbc", "ab*"))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 3), match='abb'>


In [None]:
# '?': Zero or one (tells whether a pattern is absent or present)
print(find_pattern("ac", "ab?"))
print(find_pattern("abc", "ab?"))
print(find_pattern("abbc", "ab?"))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 2), match='ab'>


In [None]:
# '+': One or more
print(find_pattern("ac", "ab+"))
print(find_pattern("abc", "ab+"))
print(find_pattern("abbc", "ab+"))

Not Found!
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 3), match='abb'>


In [None]:
# {n}: Matches if a character is present exactly n number of times
print(find_pattern("abbc", "ab{2}"))


<re.Match object; span=(0, 3), match='abb'>


In [None]:
# {m,n}: Matches if a character is present from m to n number of times
print(find_pattern("aabbbbbbc", "ab{3,5}"))   # matches if 'a' is followed by 3-5 'b's
print(find_pattern("aabbbbbbc", "ab{7,10}"))  # matches if 'a' is followed by 7-10 'b's
print(find_pattern("aabbbbbbc", "ab{,10}"))   # matches if 'a' is followed by at most 10 'b's (zero is allowed)
print(find_pattern("aabbbbbbc", "ab{10,}"))   # matches if 'a' is followed by at least 10 'b's

<re.Match object; span=(1, 7), match='abbbbb'>
Not Found!
<re.Match object; span=(0, 1), match='a'>
Not Found!


### Anchors

In [None]:
# '^': Indicates start of a string
# '$': Indicates end of string

print(find_pattern("James", "^J"))   # matches: the string starts with 'J'
print(find_pattern("Pramod", "^J"))  # no match: the string does not start with 'J'
print(find_pattern("India", "a$"))   # matches: the string ends with 'a'
print(find_pattern("Japan", "a$"))   # no match: the string ends with 'n', not 'a'


<re.Match object; span=(0, 1), match='J'>
Not Found!
<re.Match object; span=(4, 5), match='a'>
Not Found!


### Wildcard

In [None]:
# '.': Matches any single character except a newline (by default)
print(find_pattern("a", "."))
print(find_pattern("#", "."))


<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='#'>
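
One detail worth checking is that `.` does not match a newline by default. The following is a minimal sketch, using the standard `re.DOTALL` flag to lift that restriction; the variable names are only illustrative.

```python
import re

# By default '.' matches any single character except a newline
no_flag = re.search(".", "\n")
print(no_flag)    # None

# With re.DOTALL, '.' matches a newline as well
with_flag = re.search(".", "\n", re.DOTALL)
print(with_flag)  # a match object covering the newline
```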


### Character sets

In [None]:
# Now we will look at '[' and ']'.
# They're used for specifying a character class, which is a set of characters that you wish to match.
# Characters can be listed individually as follows
print(find_pattern("a", "[abc]"))

# Or a range of characters can be indicated by giving two characters and separating them by a '-'.
print(find_pattern("c", "[a-c]"))  # same as above

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='c'>


In [None]:
# '^' is used inside character set to indicate complementary set
print(find_pattern("a", "[^abc]"))  # matches only a character other than a, b or c

Not Found!


### Character sets
| Pattern  | Matches                                                                                    |
|----------|--------------------------------------------------------------------------------------------|
| [abc]    | Matches either an a, b or c character                                                      |
| [abcABC] | Matches either an a, A, b, B, c or C character                                             |
| [a-z]    | Matches any characters between a and z, including a and z                                  |
| [A-Z]    | Matches any characters between A and Z, including A and Z                                  |
| [a-zA-Z] | Matches any characters between a and z, including a and z ignoring cases of the characters |
| [0-9]    | Matches any character which is a number between 0 and 9                                    |

### Meta sequences

| Pattern  | Equivalent to    |
|----------|------------------|
| \s       | [ \t\n\r\f\v]    |
| \S       | [^ \t\n\r\f\v]   |
| \d       | [0-9]            |
| \D       | [^0-9]           |
| \w       | [a-zA-Z0-9_]     |
| \W       | [^a-zA-Z0-9_]    |
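
The equivalences in this table hold exactly for ASCII text. As a hedged aside, for Python 3 `str` patterns `\d` and `\w` also match Unicode digits and letters unless the `re.ASCII` flag is passed. The sample characters below are chosen only for illustration.

```python
import re

arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit
e_acute = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE

unicode_digit = re.search(r"\d", arabic_three)          # Unicode-aware by default
ascii_digit = re.search(r"\d", arabic_three, re.ASCII)  # restricted to [0-9]

unicode_word = re.search(r"\w", e_acute)                # accented letters count as word characters
ascii_word = re.search(r"\w", e_acute, re.ASCII)        # restricted to [a-zA-Z0-9_]

print(unicode_digit is not None, ascii_digit is not None)  # True False
print(unicode_word is not None, ascii_word is not None)    # True False
```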

### Greedy vs non-greedy regex

In [None]:
print(find_pattern("aabbbbbb", "ab{3,5}")) # return if a is followed by b 3-5 times GREEDY

<re.Match object; span=(1, 7), match='abbbbb'>


In [None]:
print(find_pattern("aabbbbbb", "ab{3,5}?")) # return if a is followed by b 3-5 times NON-GREEDY

<re.Match object; span=(1, 5), match='abbb'>


In [None]:
# Example of HTML code
print(re.search("<.*>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 35), match='<HTML><TITLE>My Page</TITLE></HTML>'>


In [None]:
# Example of HTML code
print(re.search("<.*?>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 6), match='<HTML>'>
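
The same contrast shows up when collecting every match rather than only the first. A short sketch with `re.findall()` on the HTML string used above:

```python
import re

html = "<HTML><TITLE>My Page</TITLE></HTML>"

greedy_tags = re.findall(r"<.*>", html)   # one match spanning the whole string
lazy_tags = re.findall(r"<.*?>", html)    # one match per tag

print(greedy_tags)  # ['<HTML><TITLE>My Page</TITLE></HTML>']
print(lazy_tags)    # ['<HTML>', '<TITLE>', '</TITLE>', '</HTML>']
```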


### The five most important re functions that you will use most of the time are

match() Determine if the RE matches at the beginning of the string

search() Scan through a string, looking for any location where this RE matches

findall() Find all the substrings where the RE matches, and return them as a list

finditer() Find all substrings where the RE matches and return them as an iterator

sub() Find all substrings where the RE matches and substitute them with the given string

In [None]:
# - this function uses the re.match() and let's see how it differs from re.search()
def match_pattern(text, patterns):
    if re.match(patterns, text):
        return re.match(patterns, text)
    else:
        return ('Not found!')

In [None]:
print(find_pattern("abbc", "b+"))

<re.Match object; span=(1, 3), match='bb'>


In [None]:
print(match_pattern("abbc", "b+"))

Not found!
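
To make the difference concrete, here is a small side-by-side sketch. It also uses `re.fullmatch()`, which is not covered in this notebook but is part of the standard `re` module and requires the whole string to match.

```python
import re

text = "abbc"

search_b = re.search(r"b+", text)       # finds 'bb' at span (1, 3)
match_b = re.match(r"b+", text)         # None: the string does not start with 'b'
match_ab = re.match(r"ab+", text)       # matches 'abb' at the start
full_ab = re.fullmatch(r"ab+", text)    # None: the trailing 'c' is not covered
full_abc = re.fullmatch(r"ab+c", text)  # matches the whole string 'abbc'

for name, m in [("search b+", search_b), ("match b+", match_b),
                ("match ab+", match_ab), ("fullmatch ab+", full_ab),
                ("fullmatch ab+c", full_abc)]:
    print(name, "->", m.group() if m else None)
```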


In [None]:
## Example usage of the sub() function. Replace Road with Rd.

street = '21 Ramakrishna Road'
print(re.sub('Road', 'Rd', street))

21 Ramakrishna Rd


In [None]:
print(re.sub(r'R\w+', 'Rd', street))  # raw string so that \w reaches the regex engine unchanged

21 Rd Rd


In [None]:
## Example usage of finditer(). Find all occurrences of the word 'festival' in the given sentence

text = 'Diwali is a festival of lights, Holi is a festival of colors!'
pattern = 'festival'
for match in re.finditer(pattern, text):
    print('START -', match.start(), end=" ")
    print('END -', match.end())

START - 12 END - 20
START - 42 END - 50


In [None]:
# Example usage of findall(). In the given URL find all dates.
# Only the first date is followed by '/', so it is the only one this pattern matches.
url = "http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewisl/2017/05/12"
date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})/'
print(re.findall(date_regex, url))

[('2017', '10', '28')]
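
The `findall()` call above returns a single date because the pattern requires a trailing `/`, which the final date in the URL lacks. A possible variant, offered as a sketch rather than as this notebook's own solution, accepts either a `/` or the end of the string after the day. The name `date_regex_any_end` is hypothetical.

```python
import re

url = "http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewisl/2017/05/12"

# (?:/|$) is a non-capturing group: either a '/' or the end of the string
date_regex_any_end = r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)"

dates = re.findall(date_regex_any_end, url)
print(dates)  # [('2017', '10', '28'), ('2017', '05', '12')]
```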


In [None]:
## Exploring Groups
m1 = re.search(date_regex, url)
print(m1.group())  ## print the matched group

/2017/10/28/


In [None]:
print(m1.group(1)) # - Print first group

2017


In [None]:
print(m1.group(2)) # - Print second group

10


In [None]:
print(m1.group(3)) # - Print third group

28


In [None]:
print(m1.group(0)) # - Print zero or the default group

/2017/10/28/
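
Numbered groups can become hard to read once a pattern has several of them. As an optional sketch, the standard `(?P<name>...)` syntax gives each group a name; the group names `year`, `month` and `day` are assumptions made for this example.

```python
import re

url_part = "/2017/10/28/"

# (?P<name>...) gives each group a name in addition to its number
named_date_regex = r"/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})/"

m = re.search(named_date_regex, url_part)
print(m.group("year"), m.group("month"), m.group("day"))  # 2017 10 28
print(m.groups())     # ('2017', '10', '28')
print(m.groupdict())  # {'year': '2017', 'month': '10', 'day': '28'}
```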


# Question 1: Check Presence of 'education'

**Description:**  
Check whether the word `'education'` is present in the sentence:  
*"The roots of education are bitter, but the fruit is sweet."*  
Use `re.search()` to return `True` if present, else `False`.


In [None]:
# Question 1: Python Code
import re

string = 'The roots of education are bitter, but the fruit is sweet.'

# regex pattern to check if 'education' is present
pattern = 'education'

# check whether pattern is present in string or not
result = re.search(pattern, string)

if result != None:
    print(True)
else:
    print(False)


True


# Question 2: Extract Start Position

**Description:**  
Extract the **starting position** of the word `'education'` in the sentence using `result.start()`.


In [None]:
# Question 2: Python Code
import re

string = 'The roots of education are bitter, but the fruit is sweet.'

pattern = 'education'

result = re.search(pattern, string)

# get starting position of the match
start_position = result.start()

print(start_position)


13


# Question 3: Extract End Position

**Description:**  
Extract the **ending position** of the word `'education'` in the sentence using `result.end()`.


In [None]:
# Question 3: Python Code
import re

string = 'The roots of education are bitter, but the fruit is sweet.'

pattern = 'education'

result = re.search(pattern, string)

# get ending position of the match
end_position = result.end()

print(end_position)


22
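
The start and end positions from Questions 2 and 3 can also be read together with `span()` and used to slice the original string. A brief sketch:

```python
import re

string = 'The roots of education are bitter, but the fruit is sweet.'
result = re.search('education', string)

start, end = result.span()   # same values as result.start() and result.end()
print(start, end)            # 13 22
print(string[start:end])     # education
```

The end index points just past the last matched character, which is why it works directly as a slice bound.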


# Regular Expressions — Quantifiers

## Recap
- `re.search()` returns a **match object** (`re.Match`) if the pattern is found; else `None`.
- `match.start()` → starting index of the match
- `match.end()` → index just past the last matched character
- Other functions in the `re` library exist for various tasks.

## Introduction to Quantifiers
Quantifiers let you control **how many times a character or group appears** in a pattern.

### Example
Suppose you have a list of words:


# Regex Exercise — Match 'tree' or 'trees'

**Description:**  
Write a regular expression that matches the word `'tree'` or `'trees'` in a given text.

**Sample positive cases:**
- 'The tree stands tall.'
- 'There are a lot of trees in the forest.'

**Sample negative cases:**
- 'The boy is heading for the school.'
- 'It's really hot outside!'


In [None]:
# Python Code
import sys
import re

# read input string
string = sys.stdin.read()

# regex pattern to match 'tree' or 'trees'
pattern = r'trees?'  # 's?' means optional 's'

# check whether pattern is present in string
result = re.search(pattern, string)

# evaluate result
if result != None:
    print(True)
else:
    print(False)


False
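
The `False` above presumably reflects the cell being run with empty standard input rather than a failing pattern. Here is a small sketch that checks `trees?` against the sample cases directly, with `has_tree` as a hypothetical helper name.

```python
import re

def has_tree(text):
    """Return True if 'tree' or 'trees' occurs anywhere in text."""
    return re.search(r"trees?", text) is not None

samples = [
    "The tree stands tall.",
    "There are a lot of trees in the forest.",
    "The boy is heading for the school.",
    "It's really hot outside!",
]
results = [has_tree(s) for s in samples]
print(results)  # [True, True, False, False]
```

Note that this pattern also matches inside longer words such as "street"; adding `\b` word boundaries would restrict it to whole words.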


# Regex Exercise — Match 'x', 'xy', 'xz', 'xyz'

**Description:**  
Write a regular expression that matches the following words:  
- `x`  
- `xy`  
- `xz`  
- `xyz`  

Do **not** match:  
- `Xyyz`, `Xyzz`, `Xyy`, `Xzz`, `Yz`


In [None]:
# Python Code
import re

# input string (replace with test string)
string = 'xyz'  # Example: change to 'x', 'xy', 'xz', etc. for testing

# regex pattern to match 'x', 'xy', 'xz', 'xyz' only
pattern = r'^xy?z?$'  # '^' and '$' ensure nothing comes before or after

# check whether pattern is present in string
result = re.search(pattern, string)

# evaluate result
if result != None:
    print(True)
else:
    print(False)


True
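
A quick way to check the exercise is to run one anchored pattern over all the listed cases. In this sketch the last two negative cases (`xyyz`, `axyz`) are additions of mine, not part of the exercise.

```python
import re

pattern = r"^xy?z?$"   # 'x', then an optional 'y', then an optional 'z', nothing else

should_match = ["x", "xy", "xz", "xyz"]
should_not_match = ["Xyyz", "Xyzz", "Xyy", "Xzz", "Yz", "xyyz", "axyz"]

positive_results = [bool(re.search(pattern, s)) for s in should_match]
negative_results = [bool(re.search(pattern, s)) for s in should_not_match]

print(positive_results)  # all True
print(negative_results)  # all False
```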


# Question 1: Binary Number Starting with 101

**Description:**  
Match a binary number that **starts with `101`** and ends with **zero or more zeros**.

**Sample positive cases:**  
- 1010  
- 10100  
- 101000  
- 101  

**Sample negative cases:**  
- 10  
- 100  
- 1


In [None]:
# Question 1: Python Code
import re

# input string (replace with test string)
string = '10100'  # test any example

# regex pattern: starts with 101, followed by zero or more 0's
pattern = r'^1010*$'  # without '^', r'1010*$' would also match '0101'

# check pattern
result = re.search(pattern, string)

if result != None:
    print(True)
else:
    print(False)


True


# Question 2: Binary Number Starting with 1 and Ending with 0

**Description:**  
Write a pattern that **starts with 1**, ends with **0**, and has **zero or more 1s in between**.

**Sample positive cases:**  
- 110  
- 11111110  
- 10  

**Sample negative cases:**  
- 11  
- 00  
- 1  
- 0


In [None]:
# Question 2: Python Code
import re

# input string (replace with test string)
string = '11111110'  # test any example

# regex pattern: starts with 1, zero or more 1's, ends with 0
pattern = r'^11*0$'  # same as r'^1+0$'; without '^', '0110' would also match

# check pattern
result = re.search(pattern, string)

if result != None:
    print(True)
else:
    print(False)


True


# Regular Expressions — Quantifiers II

## Recap of Previous Quantifiers
- `?` → Matches the preceding character **0 or 1 time** (optional)  
- `*` → Matches the preceding character **0 or more times**  
- `+` → Matches the preceding character **1 or more times** (at least once)

**Difference between `*` and `+`:**  
- `*` → character may be absent  
- `+` → character must appear at least once

## Fixed-Occurrence Quantifiers
Sometimes you need to match a character **exactly** a number of times or within a range:

1. `{m,n}` → Matches the preceding character **m to n times**  
2. `{m,}` → Matches the preceding character **m or more times**  
3. `{,n}` → Matches the preceding character **0 to n times**  
4. `{n}` → Matches the preceding character **exactly n times**

**Notes:**  
- Avoid spaces inside the braces: write `{m,n}`, not `{m, n}`. Python's `re` treats the spaced form as literal text rather than a quantifier.  
- These can **replace other quantifiers**:
  - `?` ≡ `{0,1}`  
  - `*` ≡ `{0,}`  
  - `+` ≡ `{1,}`
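
The equivalences in these notes can be checked directly. The sketch below compares each shorthand quantifier with its brace form on a few sample strings, and also shows the spaced form being read literally (behaviour observed in CPython's `re`; `matched_text` is a hypothetical helper).

```python
import re

samples = ["a", "ab", "abb", "abbbb", "cab"]

def matched_text(pattern, text):
    """Return the matched substring, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else None

# Each shorthand quantifier and its brace form give identical matches
pairs = [("ab?", "ab{0,1}"), ("ab*", "ab{0,}"), ("ab+", "ab{1,}")]
for short, braces in pairs:
    same = all(matched_text(short, s) == matched_text(braces, s) for s in samples)
    print(short, braces, same)

# With a space inside the braces, the braces are matched as literal text
spaced = re.search(r"ab{1, 3}", "abb")
literal_hit = re.search(r"ab{1, 3}", "ab{1, 3}")
print(spaced, literal_hit is not None)
```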


## Question 1: Match 'ab+' against 'ac'

**Description:**  
Check whether the pattern `'ab+'` matches the string `'ac'`.


In [None]:
import re

string = "ac"
pattern = "ab+"

result = re.search(pattern, string)

if result != None:
    print(True)
else:
    print(False)


False


## Question 2: Match numbers that are powers of 10

**Positive:** 10, 100, 1000  
**Negative:** 0, 1, 15


In [None]:
import re

# Test string
string = "100"  # replace with input string

# Pattern: 1 followed by one or more 0s, anchored so nothing else is allowed
pattern = r'^10+$'

result = re.search(pattern, string)

if result != None:
    print(True)
else:
    print(False)


True
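
Because `re.search()` looks anywhere in the string, anchoring matters for this question. A short sketch comparing an unanchored and an anchored pattern; the inputs `2100` and `105` are extra cases added here for illustration.

```python
import re

unanchored = r"10+"
anchored = r"^10+$"

inputs = ["10", "100", "1000", "0", "1", "15", "2100", "105"]
unanchored_hits = [s for s in inputs if re.search(unanchored, s)]
anchored_hits = [s for s in inputs if re.search(anchored, s)]

print(unanchored_hits)  # ['10', '100', '1000', '2100', '105']
print(anchored_hits)    # ['10', '100', '1000']
```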


## Question 3: Which string will not match 'abc+d'?

**Pattern:** `abc+d`  
**Strings:**
- abcd ✅
- abccd ✅
- abd ❌
- abccccccccd ✅  

**Answer:** `abd` will not match.


## Question 4: '?' quantifier equivalent

**Answer:** `{0,1}` ✅


## Question 5: Match 'hurray' with 2-5 'r's

**Pattern:** `hur{2,5}ay`


In [None]:
import re

string = "hurrray"  # test string

pattern = r'hur{2,5}ay'  # 'hu', then 2-5 'r's, then 'ay'

result = re.search(pattern, string)

if result != None:
    print(True)
else:
    print(False)
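
Assuming the intent is two to five 'r's in total, `hur{2,5}ay` expresses that. Writing `hurr{2,5}ay` would instead require three to six, because the first `r` is matched literally before the quantified one. A sketch that counts this out for one to six 'r's:

```python
import re

total = r"hur{2,5}ay"       # 2 to 5 'r's in total
one_extra = r"hurr{2,5}ay"  # a literal 'r' plus 2 to 5 more: 3 to 6 in total

# Build 'huray', 'hurray', ... with n = 1..6 'r's
total_ok = {n: re.fullmatch(total, "hu" + "r" * n + "ay") is not None for n in range(1, 7)}
extra_ok = {n: re.fullmatch(one_extra, "hu" + "r" * n + "ay") is not None for n in range(1, 7)}

print(total_ok)  # {1: False, 2: True, 3: True, 4: True, 5: True, 6: False}
print(extra_ok)  # {1: False, 2: False, 3: True, 4: True, 5: True, 6: True}
```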


## Question 6: Match 'awesome' with more than 2 'e's at the end

**Pattern:** `awesome{3,}`


In [None]:
import re

string = "awesomeeee"  # test string

pattern = r'awesome{3,}'

result = re.search(pattern, string)

if result != None:
    print(True)
else:
    print(False)


True


## Question 7: Match 'a' followed by 'b' a maximum of 3 times

**Pattern:** `ab{0,3}`


In [None]:
import re

string = "abbb"  # test string

pattern = r'ab{0,3}'

result = re.search(pattern, string)

if result != None:
    print(True)
else:
    print(False)


True


## Question 8: Match three or more '0's followed by one or more '1's

**Pattern:** `0{3,}1+`


In [None]:
import re

string = "0001111"  # test string

pattern = r'0{3,}1+'

result = re.search(pattern, string)

if result != None:
    print(True)
else:
    print(False)


True


# Regular Expressions: Anchors and Wildcard

## Anchors:
- `^` → Start of the string
- `$` → End of the string
- Both can be combined to match patterns at the start **and** end
- Example: `^01*0$` matches strings starting and ending with 0, with any number of 1s in between

## Wildcard:
- `.` matches any single character (except a newline, by default)
- Combined with anchors, quantifiers, and character sets, you can define flexible patterns
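
The `^01*0$` example from the anchors list can be tried on a few strings. The test strings below are chosen only for illustration.

```python
import re

pattern = r"^01*0$"   # a '0', any number of '1's, then a final '0'

inputs = ["00", "010", "01110", "01", "0120", "1010"]
outcome = {s: bool(re.search(pattern, s)) for s in inputs}
print(outcome)
# {'00': True, '010': True, '01110': True, '01': False, '0120': False, '1010': False}
```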


In [None]:
# 1. Match all dictionary words that start with 'A'

import re, sys
string = sys.stdin.read()

# regex pattern - starts with 'A' (ignore case)
pattern = r'^A\w*'

result = re.search(pattern, string, re.I)
print(result != None)


In [None]:
# 2. Match words ending with 'ing'

import re, sys
string = sys.stdin.read()

# regex pattern - ends with 'ing'
pattern = r'\w+ing$'

result = re.search(pattern, string)
print(result != None)


False


In [None]:
# 3. Complex pattern: starts with one or more 1s, followed by >=3 zeros, then any number of ones,
#    followed by 1-7 zeros, ends with 2 or 3 ones

import re, sys
string = sys.stdin.read()

pattern = r'^1+0{3,}1*0{1,7}1{2,3}$'

result = re.search(pattern, string)
print(result != None)


False


In [None]:
# 4. Match first names with length between 3 and 15 characters (no spaces)

import re, sys
string = sys.stdin.read()

# regex pattern
pattern = r'^[A-Za-z]{3,15}$'

result = re.search(pattern, string)
print(result != None)


False


### Regular Expressions: Username Validation

**Description:**  
Write a regular expression using meta-sequences to match usernames of a database.  

**Rules for a valid username:**  
1. Starts with alphabets (1 to 10 characters long).  
2. Followed by a number of exactly 4 digits.  

**Sample Positive Matches:**  
- `sam2340`  
- `irfann2590`  

**Sample Negative Matches:**  
- `8730`  
- `bobby9073834`  
- `sameer728`  
- `radhagopalaswamy7890`  

**Instructions:**  
- Use Python’s `re` module to write your regular expression.  
- Use `re.search()` to check if the pattern is present in the input string.  
- Use the `re.I` flag to ignore case while matching.


In [None]:
import re
import ast, sys
string = sys.stdin.read()

# regex pattern: starts with 1-10 letters, followed by exactly 4 digits
pattern = r'^[a-zA-Z]{1,10}\d{4}$'

# check whether pattern is present in string or not
result = re.search(pattern, string, re.I)

# evaluate result
if result != None:
    print(True)
else:
    print(False)


False
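
The `False` above presumably comes from empty standard input. As a sketch, the same kind of pattern can be run over the sample usernames from the description; `is_valid_username` is a hypothetical helper name.

```python
import re

pattern = r"^[a-z]{1,10}\d{4}$"   # re.I below makes [a-z] cover upper case too

def is_valid_username(name):
    """Return True for 1-10 letters followed by exactly 4 digits."""
    return re.search(pattern, name, re.I) is not None

positives = ["sam2340", "irfann2590"]
negatives = ["8730", "bobby9073834", "sameer728", "radhagopalaswamy7890"]

print([is_valid_username(n) for n in positives])  # [True, True]
print([is_valid_username(n) for n in negatives])  # [False, False, False, False]
```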


### Greedy vs Non-Greedy (Lazy) Search in Regular Expressions

**Summary:**  
- **Greedy Search:** By default, regex tries to match the **longest possible string**.  
  - Example: Pattern `'30+'` on string `'3000'` matches `'3000'`.  
- **Non-Greedy (Lazy) Search:** Stops matching as soon as the condition is satisfied; matches the **shortest string**.  
  - Example: Pattern `'30+?'` on string `'3000'` matches `'30'`.  

**Lazy quantifiers:** Append `?` to any greedy quantifier:  
- `*?`  
- `+?`  
- `??`  
- `{m,n}?`  
- `{m,}?`  
- `{,n}?`  
- `{n}?`  

**Key point:**  
Greedy vs lazy does not refer to matching multiple occurrences; it only affects how much of the string a single match consumes.
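
The summary's `'30+'` example, and the key point about multiple occurrences, can be sketched as follows. The longer string passed to `findall()` is an illustrative assumption.

```python
import re

greedy_first = re.search(r"30+", "3000").group()   # as many zeros as possible
lazy_first = re.search(r"30+?", "3000").group()    # as few zeros as possible
print(greedy_first, lazy_first)  # 3000 30

# Laziness changes the size of each match, not whether later matches are found
greedy_all = re.findall(r"30+", "3000 300 30")
lazy_all = re.findall(r"30+?", "3000 300 30")
print(greedy_all)  # ['3000', '300', '30']
print(lazy_all)    # ['30', '30', '30']
```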

---

### Questions

**1. Greedy HTML Match**  
You are given the following HTML code:  
```html
<html>
<head>
<title> My amazing webpage </title>
</head>
<body> Welcome to my webpage! </body>
</html>
```


In [None]:
import re
import ast, sys
string = sys.stdin.read()

# regex pattern
pattern = r'<html>[\s\S]*</html>'  # write your regex here

# check whether pattern is present in string or not
result = re.search(pattern, string, re.M)  # re.M makes '^' and '$' match at line boundaries; [\s\S] is what lets the match span lines

# evaluate result - don't change the following piece of code, it is used to evaluate your regex
if (result != None) and (len(result.group()) > 6):
    print(True)
else:
    print(False)

False


## Question 2: Non-Greedy Regular Expression

**Description:**  
You’re given the following HTML code:

```html
<html>
<head>
<title> My amazing webpage </title>
</head>
<body> Welcome to my webpage! </body>
</html>
```


In [None]:
import re
import ast, sys
string = sys.stdin.read()

# regex pattern
pattern = r"<.*?>"  # write your regex here

# check whether pattern is present in string or not
result = re.search(pattern, string, re.M)  # re.M makes '^' and '$' match at line boundaries; this pattern does not depend on it

# evaluate result - don't change the following piece of code, it is used to evaluate your regex
if (result != None) and (len(result.group()) <= 6):
    print(True)
else:
    print(False)

False
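
The `False` outputs for the two HTML questions presumably come from empty standard input. Here is a sketch that feeds the HTML shown in the questions to both patterns directly.

```python
import re

html = """<html>
<head>
<title> My amazing webpage </title>
</head>
<body> Welcome to my webpage! </body>
</html>
"""

# [\s\S] matches any character, newlines included, so the greedy match spans the document
greedy = re.search(r"<html>[\s\S]*</html>", html)
# The lazy pattern stops at the first '>'
lazy = re.search(r"<.*?>", html)

print(len(greedy.group()) > 6)   # True
print(lazy.group())              # <html>
```

An alternative to `[\s\S]` is `.` combined with the `re.DOTALL` flag, which lets `.` match newlines.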


# Commonly Used Regular Expression Functions in Python

## Summary

Python’s `re` module provides multiple functions to work with regular expressions:

### 1. `re.match()`
- Matches the pattern **only at the beginning** of the string.
- Returns a match object if successful, otherwise `None`.

### 2. `re.search()`
- Searches the pattern **anywhere** in the string.
- Stops at the first match.

### 3. `re.sub()`
- Replaces all occurrences of a pattern with a replacement string.
- Useful for text cleaning and transformations.

### 4. `re.finditer()`
- Returns an iterator of match objects.
- Useful when you need match positions and detailed control.

### 5. `re.findall()`
- Returns a list of all matches.
- Best when only matched strings are required.

---



## Question 1: re.match() Function

### Description
Write a string such that when `re.match()` is applied using the pattern `a{2,}`, it returns a non-empty match.



In [None]:
import re

pattern = r'a{2,}'  # the pattern from the question, set directly instead of read from stdin

string = "aaa"  # starts with at least two 'a's

result = re.match(pattern, string, re.I)

if result != None:
    print(True)
else:
    print(False)


True


## Question 3: re.sub() – Replace Phone Numbers

### Description
Replace all 11-digit phone numbers in the string with "####".

In [None]:
import re
import ast, sys
string = sys.stdin.read()

pattern = r'\b\d{11}\b'
replacement = "####"

result = re.sub(pattern, replacement, string)

if re.search(replacement, result) != None:
    print(True)
else:
    print(False)


False
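
With empty standard input there is nothing to replace, which would explain the `False` above. A sketch on a made-up sentence (the phone numbers are hypothetical) shows the substitution itself:

```python
import re

# Hypothetical sample text; the numbers are made up for illustration
text = "Call 01712345678 or 09876543210, extension 12345."

# \b on both sides keeps shorter or longer digit runs from matching
masked = re.sub(r"\b\d{11}\b", "####", text)
print(masked)  # Call #### or ####, extension 12345.
```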


Do not compare apples with oranges. Compare apples with apples


In [1]:
import re
import ast, sys

string = sys.stdin.read()

# regex pattern to extract all words
pattern = r'\b\w+\b'

# store results in the list 'result'
result = []

# iterate over the matches
for match in re.finditer(pattern, string):
    if len(match.group()) >= 5:
        result.append(match)
    else:
        continue

# evaluate result
print(len(result))


0
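
Assuming the sentence shown before this cell is the intended input, the same `finditer()` loop can be sketched with the text inlined instead of read from standard input:

```python
import re

sentence = "Do not compare apples with oranges. Compare apples with apples"

# Keep only the words with five or more characters
long_words = [m.group() for m in re.finditer(r"\b\w+\b", sentence) if len(m.group()) >= 5]

print(long_words)       # ['compare', 'apples', 'oranges', 'Compare', 'apples', 'apples']
print(len(long_words))  # 6
```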


Extract all words that end with the suffix **'ing'** using Regular Expressions.
Use the `re.findall()` function and count the total matches.

Sample text: *Playing outdoor games when its raining outside is always fun!*


In [2]:
import re
import ast, sys

string = sys.stdin.read()

# regex pattern to extract words ending with 'ing'
pattern = r'\b\w+ing\b'

# store results in the list 'result'
result = re.findall(pattern, string)

# evaluate result
print(len(result))


0
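
Run on the sentence given with the exercise rather than on empty standard input, the pattern finds two words. A minimal sketch:

```python
import re

sentence = "Playing outdoor games when its raining outside is always fun!"

ing_words = re.findall(r"\b\w+ing\b", sentence)
print(ing_words)       # ['Playing', 'raining']
print(len(ing_words))  # 2
```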


- Grouping in regular expressions is used to extract specific sub-parts of a matched pattern.
- Parentheses `()` are used to create groups.
- `group(0)` returns the entire match.
- `group(1)`, `group(2)`, etc. return specific grouped sub-patterns.
- Grouping is especially useful for extracting parts like day, month, year, or domains.


- The regex pattern matches dates in DD-MM-YYYY format.
- `\d{2}` matches exactly two digits.
- Hyphens (`-`) are matched literally.
- `re.search()` is used to find the first occurrence of the date.
- `group(0)` returns the full matched date.


In [6]:
import re
import ast, sys

string = sys.stdin.read()

# regex pattern to extract date in DD-MM-YYYY format
pattern = r'\d{2}-\d{2}-\d{4}'

# store result
result = re.search(pattern, string)

# evaluate result
if result != None:
    print(result.group(0))
else:
    print(False)


False


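- Parentheses are used to create groups in the regex.
- In the pattern below only the month, the middle `\d{2}`, is wrapped in parentheses, so it is group 1.
- `group(1)` therefore extracts the month from the date.
- If the day and the year were wrapped in parentheses as well, they would become groups 1 and 3 and the month would be `group(2)`.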


In [7]:
import re
import ast, sys

string = "Today's date is 18-05-2018"

# regex pattern with grouping to extract month
pattern = r'\d{2}-(\d{2})-\d{4}'

# store result
result = re.search(pattern, string)

# extract month using group command
if result != None:
    month = result.group(1)
else:
    month = "NA"

# evaluate result
print(month)


05


- The local part (before `@`) contains letters, numbers, and underscores.
- The domain part contains only letters followed by `.com`.
- Grouping is used to capture only the domain name.
- `group(1)` extracts the domain from the email.


In [4]:
import re
import ast, sys

string = sys.stdin.read()

# regex pattern to extract domain using grouping
pattern = r'[a-zA-Z0-9_]+@([a-zA-Z]+\.com)'

# store result
result = re.search(pattern, string)

# extract domain using group command
if result != None:
    domain = result.group(1)
else:
    domain = "NA"

# evaluate result
print(domain)


NA
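
The `NA` above is what the cell prints when no email is supplied on standard input. A sketch with two hypothetical addresses shows both outcomes of the same pattern:

```python
import re

pattern = r"[a-zA-Z0-9_]+@([a-zA-Z]+\.com)"

# Hypothetical addresses, for illustration only
emails = ["user_01@gmail.com", "someone@example.org"]

domains = []
for email in emails:
    m = re.search(pattern, email)
    domains.append(m.group(1) if m else "NA")   # group(1) is the part in parentheses

print(domains)  # ['gmail.com', 'NA']
```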
