<a href="https://colab.research.google.com/github/Omar-Zantot/NLP_SECTIONS/blob/main/Lab_06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Regular Expressions in Python - re library
import re

# 🧩 Introduction to Basic Functions in `re`

* `findall` ▶ Returns a list containing all matches.
* `search` ▶ Returns a Match object if there is a match **`anywhere`** in the string.
* `split` ▶ Returns a list where the string has been split at each match.
* `match` ▶ Returns a Match object if there is a match at the **`start`** of the string.
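
As a quick side-by-side, the four functions above can be tried on one string. This is only a minimal sketch: the sample text and the variable names are made up for illustration.

```python
import re

text = "cat bat cat"

# findall: every non-overlapping match, returned as a list of strings
all_cats = re.findall("cat", text)

# search: the first match anywhere in the string (or None)
first_bat = re.search("bat", text)

# match: succeeds only if the pattern matches at position 0
bat_at_start = re.match("bat", text)

# split: cut the string wherever the pattern matches
pieces = re.split(" ", text)

print(all_cats)           # ['cat', 'cat']
print(first_bat.span())   # (4, 7)
print(bat_at_start)       # None
print(pieces)             # ['cat', 'bat', 'cat']
```

`search` and `match` return `None` when nothing matches, so it is safest to check the result before calling methods such as `.span()` on it.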

# 🧩 Basic Pattern Matching

In [None]:
# Basic pattern matching
txt = 'Omar'
pattern = 'Omar'
if re.match(pattern, txt):
    print('Yes! Match found')
else:
    print('No match found')


Yes! Match found


# 🧩 Special Characters

## The Period (`.`)
+ Matches any single character except a newline.

In [None]:
txt = "Mohamed Maher"
pattern = "M.h..ed"
print(re.findall(pattern, txt))  # Should return ['Mohamed']

['Mohamed']


## Start (`^`) and End (`$`) Anchors

### Caret
- `^` Matches the **`start`** of the string

In [None]:
txt = "Omar loves programming"
pattern = "^Omar"
print(re.findall(pattern, txt))  # Should match 'Omar' if it's at the start

['Omar']


In [None]:
txt = "Omar loves programming"
pattern = "^loves"
print(re.findall(pattern, txt))  # No match !

[]


- `$` Matches the end of the string

In [None]:
txt = "Omar loves programming"
pattern = "programming$"
print(re.findall(pattern, txt))  # Matches only if 'programming' is at the end
print(re.findall('ing$', txt))   # The string ends with 'ing', so this matches
print(re.findall('ed$', txt))    # The string does not end with 'ed', so no match

['programming']
['ing']
[]


## [25Ab]
- Matches 2 or 5 or A or b

## [a-zA-Z0-9]
* It matches any single character that is:
    1. a lowercase letter (a to z)
    2. an uppercase letter (A to Z)
    3. a digit (0 to 9)
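
A common slip with ranges is typing `A-z` instead of `A-Z`. The sketch below (the sample string is made up for illustration) shows why it matters: in ASCII, the range from `A` to `z` also covers the characters `[`, `\`, `]`, `^`, `_` and the backtick.

```python
import re

sample = "aZ_9[^"

# [A-z] spans ASCII codes 65-122, which includes several punctuation characters
loose = re.findall("[a-zA-z0-9]", sample)

# [A-Z] stops at 'Z', so only letters and digits get through
strict = re.findall("[a-zA-Z0-9]", sample)

print(loose)   # ['a', 'Z', '_', '9', '[', '^']
print(strict)  # ['a', 'Z', '9']
```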

In [None]:
txt = "My favorite colors are Red, Green, and blue."
pattern = "[Rr]ed|[Gg]reen|[Bb]lue"
print(re.findall(pattern, txt))

['Red', 'Green', 'blue']


In [None]:
txt = "Omar is 22 years old."
pattern = "[a-zA-Z0-9]"
print(re.findall(pattern, txt))

['O', 'm', 'a', 'r', 'i', 's', '2', '2', 'y', 'e', 'a', 'r', 's', 'o', 'l', 'd']


In [None]:
txt = "Omar is 22 years old."
pattern = "Omar [a-zA-Z0-9]"
print(re.findall(pattern, txt))

['Omar i']


Using a caret `^` as the first character inside `[ ]` negates the set: it matches any single character that is **not** listed.

In [None]:
txt = "4 years"
pattern = "[^9] years"
m = re.findall(pattern, txt)
print(m)

['4 years']


In [None]:
txt = "4 years"
pattern = "[^4] years"
m = re.findall(pattern, txt)
print(m)

[]


## 🧩 Escaping with `\`

The backslash has two roles:
- Option 1: it gives an ordinary character a special meaning (`\t`, `\s`, ...)
- Option 2: it removes the special meaning of a metacharacter so that it is matched literally (`\[`, `\.`, ...)
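
Because Python string literals also use the backslash, regex patterns are usually written as raw strings (`r"..."`), so the backslash reaches the regex engine unchanged. Recent Python versions warn about sequences such as `"\s"` in ordinary string literals. The sketch below uses a made-up sample string and also shows `re.escape`, which escapes a literal string for you.

```python
import re

text = "Price: 3.50 (approx.)"

# Raw string: \. means a literal dot; a bare . would match any character
literal_dot = re.findall(r"\d\.\d+", text)

# re.escape backslash-escapes the regex metacharacters in a literal string
needle = "(approx.)"
found = re.findall(re.escape(needle), text)

print(literal_dot)  # ['3.50']
print(found)        # ['(approx.)']
```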

In [None]:
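# Option 1 - the backslash gives a character a special meaning (\s means whitespace)
# Here \s consumes the 's', so the pattern reads 'Omar' + whitespace + 'hady': no match
txt = "Omar shady"
pattern = r"Omar\shady"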
m = re.findall(pattern, txt)
print(m)

[]


In [None]:
txt = "Omar shady"
pattern = r"Omar\sshady"
m = re.findall(pattern, txt)
print(m)

['Omar shady']


In [None]:
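# Option 2 - the backslash removes a character's special meaning
txt = "Subscribe [o]mar channel"
pattern_0 = "Subscribe [o]mar"     # [o] is a character set, so this looks for 'Subscribe omar'
pattern_1 = r"Subscribe \[o\]mar"  # \[ and \] match literal brackets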

m0 = re.findall(pattern_0, txt)
m1 = re.findall(pattern_1, txt)
print(m0)
print(m1)

[]
['Subscribe [o]mar']


### More special characters
+ `\t` - Matches tab
+ `\n` - Matches newline
+ `\A` - Matches only at the start of the string
+ `\Z` - Matches only at the end of the string
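
On a single-line string, `\A` and `\Z` behave like `^` and `$`. The difference shows up with the `re.MULTILINE` flag, as in this small sketch (the two-line sample string is made up for illustration):

```python
import re

text = "first line\nsecond line"

# With re.MULTILINE, ^ also matches right after each newline
caret_hits = re.findall(r"^\w+", text, flags=re.MULTILINE)

# \A ignores the flag and only matches at the very start of the string
slash_a_hits = re.findall(r"\A\w+", text, flags=re.MULTILINE)

# Same idea at the end: $ matches before each newline, \Z only at the very end
dollar_hits = re.findall(r"\w+$", text, flags=re.MULTILINE)
slash_z_hits = re.findall(r"\w+\Z", text, flags=re.MULTILINE)

print(caret_hits)    # ['first', 'second']
print(slash_a_hits)  # ['first']
print(dollar_hits)   # ['line', 'line']
print(slash_z_hits)  # ['line']
```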

In [None]:
# \A behaves like ^ here: it anchors the match to the start of the string
txt = "Omar loves programming"
pattern = r"\AOmar"
m = re.findall(pattern, txt)
print(m)

['Omar']


In [None]:
# \Z behaves like $ here: it anchors the match to the end of the string
txt = "Omar loves programming"
pattern = r"ming\Z"
m = re.findall(pattern, txt)
print(m)

['ming']


## 🧩 Common Patterns (`\w`, `\d`, `\s`)
- `\w`: Matches any single letter, digit, or underscore.
- `\W`: Matches any character not part of `\w`.
- `\d`: Matches any decimal digit (0-9).
- `\D`: Matches any character that is not a digit.
- `\s`: Matches any whitespace character.
- `\S`: Matches any non-whitespace character.
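
The shorthand classes above can be compared on one string. This is a minimal sketch with a made-up sample text. The last two lines illustrate that, for `str` patterns in Python 3, `\w` also matches non-ASCII letters unless the `re.ASCII` flag is passed.

```python
import re

text = "Room_42 opens at 9 a.m."

words = re.findall(r"\w+", text)    # runs of letters, digits, underscores
digits = re.findall(r"\d+", text)   # runs of digits
chunks = re.findall(r"\S+", text)   # runs of non-whitespace characters
spaces = re.findall(r"\s", text)    # individual whitespace characters

print(words)        # ['Room_42', 'opens', 'at', '9', 'a', 'm']
print(digits)       # ['42', '9']
print(chunks)       # ['Room_42', 'opens', 'at', '9', 'a.m.']
print(len(spaces))  # 4

# \w is Unicode-aware by default; re.ASCII restricts it to [a-zA-Z0-9_]
accented = "caf\u00e9"  # 'café'
unicode_words = re.findall(r"\w+", accented)
ascii_words = re.findall(r"\w+", accented, re.ASCII)
print(unicode_words)  # ['café']
print(ascii_words)    # ['caf']
```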

In [None]:
txt = "Omar loves programming"
pattern = r"Omar loves \wrogramming"
# similar to [0-9a-zA-Z_]
m = re.findall(pattern, txt)
print(m)

['Omar loves programming']


In [None]:
txt = "Omar loves %rogramming"
pattern = r"Omar loves \Wrogramming"
# similar to [^0-9a-zA-Z_]
m = re.findall(pattern, txt)
print(m)

['Omar loves %rogramming']


In [None]:
txt = "4 years"
pattern = r"\d years"
# similar to [0-9]
m = re.findall(pattern, txt)
print(m)

['4 years']


In [None]:
txt = "% years"
pattern = r"\D years"
# similar to [^0-9]
m = re.findall(pattern, txt)
print(m)

['% years']


In [None]:
txt = "Omar loves programming"
pattern = r"Omar\sloves"
m = re.findall(pattern, txt)
print(m)

['Omar loves']


In [None]:
txt = "Omar-loves programming"
pattern = r"Omar\Sloves"
m = re.findall(pattern, txt)
print(m)

['Omar-loves']


## 🧩 Quantifiers

### `+` - One or more occurrences

In [None]:
txt = "programmmmmmmming"
pattern = "program+ing"
print(re.findall(pattern, txt))  # Matches 'programmmmmmmming'

['programmmmmmmming']


In [None]:
# Will this work? Yes: \w+ matches the run of 'm's, then backtracks to leave 'ing'
txt = "programmmmmmmming"
pattern = r"progra\w+ing"
m = re.findall(pattern, txt)
print(m)

['programmmmmmmming']


In [None]:
# This will not work: s+ needs at least one literal 's' after 'progra'
txt = "programmmmmmmming"
pattern = "progras+ing"
m = re.findall(pattern, txt)
print(m)

[]


### `*` - Zero or more occurrences

In [None]:
txt = "programmmmmmmming"
pattern = "program*ing"
print(re.findall(pattern, txt))  # Matches with any number of 'm's

['programmmmmmmming']


In [None]:
txt = "programmmmmmmming"
pattern = "program*f*ing"
print(re.findall(pattern, txt))  # f* can match zero 'f's, so this still matches

['programmmmmmmming']


### `?` - Zero or one occurrence

In [None]:
txt = "programming"
pattern = "programm?ing"
print(re.findall(pattern, txt))  # The last 'm' is optional: matches 'programing' or 'programming'

['programming']


In [None]:
txt = "programming"
pattern = "programmf?ing"
print(re.findall(pattern, txt))  # Matches with zero or one 'f'

['programming']


### `{x,y}` - Range of occurrences

In [None]:
txt = "programmmmming"
pattern = "programm{2,}ing"
print(re.findall(pattern, txt))  # 'program' followed by at least 2 more 'm's, then 'ing'

['programmmmming']
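
The brace quantifier comes in three forms: `{n}` (exactly n), `{n,m}` (between n and m) and `{n,}` (n or more). A minimal sketch on a made-up string follows. It uses `\b` (a word boundary) so that each run of letters is tested as a whole.

```python
import re

text = "a aa aaa aaaa"

exactly_two = re.findall(r"\ba{2}\b", text)      # exactly 2 'a's
two_to_three = re.findall(r"\ba{2,3}\b", text)   # 2 or 3 'a's
at_least_three = re.findall(r"\ba{3,}\b", text)  # 3 or more 'a's

print(exactly_two)     # ['aa']
print(two_to_three)    # ['aa', 'aaa']
print(at_least_three)  # ['aaa', 'aaaa']
```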


# 🧩 Use-cases in NLP

# Tokenization 📝

## 🪄 Note:
The `re.split()` function in Python’s re library is used to split a string by a specified regular expression pattern, giving you ***more control over how the splitting occurs*** compared to Python’s **built-in** `str.split()` method.
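
To make the contrast concrete, here is a small sketch comparing the two on a made-up string. `str.split()` works with one fixed separator (or runs of whitespace by default), while `re.split()` can treat several delimiters as one.

```python
import re

text = "one,two;  three four."

# str.split with no argument splits on runs of whitespace only
by_space = text.split()

# re.split takes a pattern, so commas, semicolons, dots and spaces are handled together
by_pattern = re.split(r"[,;\s.]+", text)

# A delimiter at the very end leaves an empty string, which can be filtered out
tokens = [t for t in by_pattern if t]

print(by_space)    # ['one,two;', 'three', 'four.']
print(by_pattern)  # ['one', 'two', 'three', 'four', '']
print(tokens)      # ['one', 'two', 'three', 'four']
```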

### Simple Split on Non-Word Characters
#### Splitting by any character that is not a letter, digit, or underscore (`\W+`).

In [5]:
text = "Hello, world! Welcome to NLP."
result = re.split(r'\W+', text)
print(result)  # The trailing '' appears because the text ends with a delimiter ('.')

['Hello', 'world', 'Welcome', 'to', 'NLP', '']


#### Splitting on **Digits**

In [6]:
text = "There are 3 cats, 4 dogs, and 1 bird."
result = re.split(r'\d+', text)
print(result)

['There are ', ' cats, ', ' dogs, and ', ' bird.']


#### Splitting on Multiple Delimiters

In [7]:
text = "apple, orange; banana grape"
result = re.split(r'[,\s;]+', text)
print(result)

['apple', 'orange', 'banana', 'grape']


# Text Cleaning and Preprocessing 🧼

### 🪄 Note:
The `re.sub()` function in Python's `re` library is used to replace parts of a string that match a regular expression pattern with a specified replacement. It’s particularly useful in NLP for cleaning and normalizing text data, like removing unwanted characters, handling contractions, or replacing multiple spaces.

#### Removing punctuation and extra whitespace

In [8]:
text = "Hello! This text needs some cleaning...      Lots of extra spaces too!"
cleaned_text = re.sub(r'[^\w\s]', '', text)  # Removes punctuation
print(cleaned_text)
cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Replaces runs of whitespace with a single space
print(cleaned_text)

Hello This text needs some cleaning      Lots of extra spaces too
Hello This text needs some cleaning Lots of extra spaces too


#### Handling Contractions

In [9]:
text = "I can't believe it's already 2024!"
expanded_text = re.sub("can't", "cannot", text)
expanded_text = re.sub("it's", "it is", expanded_text)
print(expanded_text)

I cannot believe it is already 2024!
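
Calling `re.sub` once per contraction does not scale well. One possible alternative, sketched below, builds a single pattern from a lookup table and passes a function as the replacement. The `CONTRACTIONS` mapping and the `expand_contractions` helper are hypothetical names invented for this example, and the mapping is deliberately tiny.

```python
import re

# Hypothetical, deliberately small mapping; a real one would be much longer
CONTRACTIONS = {"can't": "cannot", "it's": "it is", "won't": "will not"}

# Build one alternation; re.escape guards against metacharacters in the keys
pattern = re.compile("|".join(re.escape(k) for k in CONTRACTIONS))

def expand_contractions(text):
    # The callable receives each Match object and returns its replacement
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)

expanded = expand_contractions("I can't believe it's already 2024!")
print(expanded)  # I cannot believe it is already 2024!
```

The lookup is case-sensitive, so `"It's"` would be left unchanged unless the text is lowercased first or the mapping is extended.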


# Pattern Matching and Entity Recognition 🔍

In [10]:
text = "Our next NLP class is on 2024-11-28. Save the date!"
dates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text)  # Finds dates in YYYY-MM-DD format
print(dates)

# \b marks a word boundary

['2024-11-28']
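
If the individual parts of the date are needed, parentheses can be added to capture them. The sketch below reuses the same text. Note that the pattern only checks the shape of the string, so something like `2024-99-99` would also match.

```python
import re

text = "Our next NLP class is on 2024-11-28. Save the date!"

# With capture groups, findall returns one tuple per match
parts = re.findall(r"\b(\d{4})-(\d{2})-(\d{2})\b", text)
print(parts)  # [('2024', '11', '28')]

# search returns a Match object whose groups can be unpacked
m = re.search(r"\b(\d{4})-(\d{2})-(\d{2})\b", text)
if m:
    year, month, day = m.groups()
    print(year, month, day)  # 2024 11 28
```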


In [None]:
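# Ex: create a pattern to validate an email (simplified: only accepts .com addresses)

import re

def validate_email(email):
  # Escape the dot so it matches a literal '.', and use fullmatch so the
  # whole string must fit the pattern (re.match only anchors the start)
  pattern = r'\w+@\w+\.com'
  return re.fullmatch(pattern, email) is not None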

email = "omar82199@gmail.com"
if validate_email(email):
  print("Valid email address")
else:
  print("Invalid email address")

In [None]:
# Extracts all numeric sequences
txt = '101010 202020 Saeed 404040 505050 Isa'
pattern_0 = r'[\d]+'  # The brackets are redundant here
pattern_1 = r'\d+'    # Equivalent, simpler pattern
print(re.findall(pattern_0, txt))  # Extracts all numeric sequences


['101010', '202020', '404040', '505050']


In [None]:
txt0 = 'The end is The end'
txt1 = 'The end is not as i guess'
txt2 = 'The end'
txt3 = 'The apple end'
pattern = '^The end$'
print(re.findall(pattern, txt0))
print(re.findall(pattern, txt1))
print(re.findall(pattern, txt2))
print(re.findall(pattern, txt3))

[]
[]
['The end']
[]


In [None]:
# match only checks the start of the string, and this one starts with 'name'
result = re.match('noorhan', 'name noorhan')
print(result is not None)

False


In [None]:
# search scans the whole string, so it finds 'noorhan' after 'name '
result = re.search('noorhan', 'name noorhan')
print(result is not None)

True
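
A third option, `re.fullmatch`, succeeds only when the whole string fits the pattern, much like wrapping the pattern in `^...$`. A minimal sketch, with sample strings made up for illustration:

```python
import re

pattern = "noorhan"

starts_with = re.match(pattern, "noorhan ali") is not None
whole_long = re.fullmatch(pattern, "noorhan ali") is not None
whole_exact = re.fullmatch(pattern, "noorhan") is not None
anchored = re.search("^noorhan$", "noorhan") is not None

print(starts_with)  # True
print(whole_long)   # False
print(whole_exact)  # True
print(anchored)     # True
```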
