# Regular Expressions (RegEx)

**Regular expressions (RegEx)** are a powerful tool for manipulating and filtering strings. They are useful to:
- Clean columns (emails, phone numbers, etc.)
- Validate formats
- Extract information

Templates are available at www.regex101.com

In [1]:
# import regex module
import re

## RegEx basics
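- `.` : any character (except a newline by default)
- `^` : start of the string
- `$` : end of the string
- `*` : 0 or more repetitions
- `+` : 1 or more repetitions
- `?` : 0 or 1 repetition
- `[]` : set of allowed characters (character class)
- `\` : escape character
- `|` : logical OR (alternation)
- `&` : no special meaning (matches a literal `&`); regex has no AND operator, lookaheads such as `(?=...)` are the usual workaround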

In [2]:
# Example: check that the string starts with a letter and ends with a digit
# (\w would also accept digits and "_", so the letter class is spelled out)
pattern = r"^[a-zA-Z].*\d$"
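
To see how the anchors `^` and `$` behave, the pattern can be tried on a few strings. This is a minimal sketch: the sample strings are hypothetical and only chosen to exercise each branch.

```python
import re

# Starts with a letter, ends with a digit
pattern = r"^[a-zA-Z].*\d$"

# Hypothetical sample strings, not taken from any dataset
samples = ["abc123", "1abc2", "abc", "a1"]

# Map each sample to True/False depending on whether the pattern matches
results = {s: bool(re.search(pattern, s)) for s in samples}
print(results)
# {'abc123': True, '1abc2': False, 'abc': False, 'a1': True}
```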

## Groups & quantifiers
- `(abc)` : group
- `{n}` : exactly n times
- `{n,}` : at least n times
- `{n,m}` : between n & m times

In [3]:
# example : extract a year from a string
pattern = r"(\d{4})"
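
Groups and quantifiers are easier to read on a concrete match. The sketch below assumes a hypothetical invoice string and shows how `group()` and `groups()` expose what each pair of parentheses captured.

```python
import re

# Hypothetical text used only for illustration
text = "Invoice dated 2025-03-14, due 2025-04-01"

# Three groups: 4-digit year, 2-digit month, 2-digit day
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

first = re.search(date_pattern, text)
print(first.group(0))   # whole match: 2025-03-14
print(first.group(1))   # first group only: 2025
print(first.groups())   # all groups: ('2025', '03', '14')

# {n,m}: runs of 2 to 4 digits; longer runs are cut into several matches
runs = re.findall(r"\d{2,4}", "7 42 2025 123456")
print(runs)             # ['42', '2025', '1234', '56']
```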

## `re` methods
- `re.match()` : match only at the start of the string
- `re.search()` : first match anywhere in the string
- `re.findall()` : all non-overlapping matches, as a list
- `re.sub()` : replace matches
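
The difference between these four functions shows up clearly when they are applied to the same input. A minimal sketch, assuming an arbitrary sample string:

```python
import re

text = "abc123abc456"   # arbitrary sample string
digits = r"\d+"

at_start = re.match(digits, text)      # None: the string does not start with digits
first_hit = re.search(digits, text)    # first run of digits, anywhere in the string
all_hits = re.findall(digits, text)    # every run of digits
replaced = re.sub(digits, "#", text)   # every run of digits replaced

print(at_start)           # None
print(first_hit.group())  # 123
print(all_hits)           # ['123', '456']
print(replaced)           # abc#abc#
```

Because `re.match()` and `re.search()` return `None` when nothing matches, it is safer to test the result before calling `.group()` on it.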

## Simple search with `re.search()`
`re.search(pattern, string)` checks whether a pattern is present anywhere in a string and returns the first match (or `None`).
Example:

In [24]:
text = "Le rapport est daté de 2025"
pattern = r"(\b\d{4}\b)"
result = re.search(pattern, text)
print(result)
print(result.group(0))

<re.Match object; span=(23, 27), match='2025'>
2025


### Exercise 01
Test whether the string `test0xB0testtest` contains the substring `0xB0`.

In [33]:
# Answer:
string = "test0xB0testtest"
pattern = r"0xB0"
result = bool(re.search(pattern, string))
if result:
    print("Pattern found!")
else:
    print("Pattern not found!")

Pattern found!


## Filtering a list with `re.search`
We can apply a regex filter to each element of a list.
Example:

In [36]:
fruits = ["banane", "pomme", "ananas", "abricot", "fraise"]

# pattern : starts with "a" followed by at least two letters
pattern = r"^[aA][a-z]{2,}"

result = [fruit for fruit in fruits if re.search(pattern, fruit)]

print(result)

['ananas', 'abricot']


### Exercise 02
Filter a list to keep only the words that do not contain the letter `e`.

In [39]:
# Answer
words = ["ordinateur", "pain", "stylo", "chaussures"]
pattern = r"[eE]"
result = [word for word in words if not re.search(pattern, word)]
print(result)

['pain', 'stylo']


## Split strings and ignore case
- `re.split()` : splits a string wherever the pattern matches
- `re.IGNORECASE` : makes the match case-insensitive

In [43]:
# split() :
text = "bane;pomme;ananas;abricot;fraise"
pattern = r";"
result = re.split(pattern, text)
print(result)

# ignorecase :
text = "Pyhton, PYTHON, pYTHON, python"
pattern = r"python"
match = re.findall(pattern, text, re.IGNORECASE)
print(match)

['bane', 'pomme', 'ananas', 'abricot', 'fraise']
['PYTHON', 'pYTHON', 'python']
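
`re.split()` becomes more useful than `str.split()` when the separator itself varies. The sketch below assumes a hypothetical string where items are separated by either `;` or `,`, with optional surrounding spaces.

```python
import re

# Hypothetical input with inconsistent separators
text = "banane; pomme,ananas ;abricot"

# Split on ";" or ",", swallowing any whitespace around the separator
parts = re.split(r"\s*[;,]\s*", text)
print(parts)
# ['banane', 'pomme', 'ananas', 'abricot']
```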


### Exercise 03
- Delete the lines containing `start` (case-insensitive)
```python
text = """test1
test2start
test3
test4Start"""
```

In [47]:
text = """test1
test2start
test3
test4Start"""

split_pattern = r"\n"
splitted_list = re.split(split_pattern, text)
print("splitted_list : ", splitted_list)

delete_pattern = r"start"
filtered_list = [word for word in splitted_list if not re.search(delete_pattern, word, re.IGNORECASE)]
print("filtered_list : ", filtered_list)

splitted_list :  ['test1', 'test2start', 'test3', 'test4Start']
filtered_list :  ['test1', 'test3']


## Replacing patterns with `re.sub()`
- `re.sub(pattern, replacement, text)` : replaces all occurrences of a given pattern in a string
- `re.sub(pattern, replacement, text, count=1)` : limits the replacement to the first occurrence

In [50]:
text = "Le prix est de 50 euros."
result = re.sub(r" euros", "€", text)
result = re.sub(r"\d+", "**", result)
print(result)

Le prix est de **€.
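
The effect of `count` is easiest to see side by side with the default behaviour. A minimal sketch on a hypothetical sentence:

```python
import re

# Hypothetical sentence containing the same word twice
text = "50 euros, then 20 euros"

once = re.sub(r"euros", "EUR", text, count=1)  # only the first occurrence
every = re.sub(r"euros", "EUR", text)          # all occurrences (default)

print(once)   # 50 EUR, then 20 euros
print(every)  # 50 EUR, then 20 EUR
```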


### Exercise 04
```python
text = "123 456 789 564"
```
1. Replace each `5` with `cinq`
```python
text2 = "note : ce paragraphe contient des notes"
```
2. Replace each `notes` with `X` (case-insensitive)

In [55]:
text = "123 456 789 564"
text2 = "note : ce paragraphe contient des notes"

result = re.sub(r"5", "cinq", text)
print(result)

result = re.sub(r"notes", "X", text2, flags=re.IGNORECASE)
print(result)

123 4cinq6 789 cinq64
note : ce paragraphe contient des X


## Check whether the start of a string matches a pattern with `re.match()`
Example:

In [58]:
text = "abc123abc456"
pattern = r"[a-z]+"
result = re.match(pattern, text)
print(result)
print(result.group())

<re.Match object; span=(0, 3), match='abc'>
abc


## Return all occurrences with `re.findall()`
Example:

In [60]:
result = re.findall(pattern, text)
print(result)

['abc', 'abc']
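
What `re.findall()` returns depends on the capturing groups in the pattern: with no group it returns whole matches, with one group it returns only that group, and with several groups it returns tuples. A short sketch on a hypothetical string:

```python
import re

# Hypothetical text used only for illustration
text = "Age: 34, Age: 28"

whole = re.findall(r"Age:\s*\d+", text)      # no group: whole matches
one = re.findall(r"Age:\s*(\d+)", text)      # one group: only the group content
many = re.findall(r"(\w+):\s*(\d+)", text)   # several groups: list of tuples

print(whole)  # ['Age: 34', 'Age: 28']
print(one)    # ['34', '28']
print(many)   # [('Age', '34'), ('Age', '28')]
```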


### Exercise 05
```python
texte = "Nom: Dupont, Age: 34, Email: dupont@example.com; Nom: Martin, Age: 28, Email: martin@example.com"
```
1. Use `re.match()` to see if the string starts with `Nom`
2. Use `re.findall()` to extract ages
3. Use `re.findall()` to extract emails

In [67]:
# Answer
texte = "Nom: Dupont, Age: 34, Email: dupont@example.com; Nom: Martin, Age: 28, Email: martin@example.com"
string_starter = re.match(r"Nom", texte)
print(string_starter)
ages = re.findall(r"Age:\s*(\d+)", texte)
print(ages)
emails = re.findall(r"[a-zA-Z0-9._$€!#+*-]+@[a-zA-Z.-]+\.\w{2,}", texte)
print(emails)


<re.Match object; span=(0, 3), match='Nom'>
['34', '28']
['dupont@example.com', 'martin@example.com']


## Practical work: Data cleaning
We'll clean a CSV file containing:
- Incorrect emails
- Incorrect prices
- Ambiguous dates of birth
- Different phone number formats

In [143]:
# Import dependencies
import csv
from dateutil import parser

In [144]:
# Load the CSV rows into a list of lists
original_file_path = "./data/donnees_sales.csv"
users_list = []

with open(original_file_path, "r", encoding="utf-8", newline="") as file:
    users = csv.reader(file, delimiter=",")
    next(users)  # skip the first line (header)
    for user in users:
        users_list.append(user)

print(users_list)

[['1', 'jean.dupont[at]example.com', '€35', '02/08/1980', '06-01-02-03-04'], ['2', 'marie.dupont(at)example(dot)fr', '34,99$', '1980.08.02', '0601020304'], ['3', 'pierre.martin[at]exemple.org', '35 euros', '1980/08/02', '+33 6 01 02 03 04'], ['4', 'julie.francois(at)example_com', '€35.00', '02-08-1980', '6.01.02.03.04'], ['5', 'emilie.durand@example.com', '35EUR', '1980-08-02', '0033 6 01 02 03 04']]


In [None]:
# Define regex patterns
email_pattern = r"^((?!\.)[\w\-_.]*[^.])(@\w+)(\.\w+(\.\w+)?[^.\W])$"
price_pattern = r"[+-]?([0-9]*[.])?[0-9]+"
phone_pattern = r"^\+\d{2}\s\d{1}\s\d{2}\s\d{2}\s\d{2}\s\d{2}$"




In [146]:
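# Cleaning emails
def email_cleaner(email: str) -> str:
    # Restore obfuscated separators: "[at]" / "(at)" -> "@", "(dot)" / "_" -> "."
    email = re.sub(r"\[at\]|\(at\)", "@", email)
    email = re.sub(r"\(dot\)|_", ".", email)
    return email

for user in users_list:
    # re.search returns a Match object or None, so test it for truthiness
    if not re.search(email_pattern, user[1]):
        user[1] = email_cleaner(user[1])

print(users_list)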


[['1', 'jean.dupont@example.com', '€35', '02/08/1980', '06-01-02-03-04'], ['2', 'marie.dupont@example.fr', '34,99$', '1980.08.02', '0601020304'], ['3', 'pierre.martin@exemple.org', '35 euros', '1980/08/02', '+33 6 01 02 03 04'], ['4', 'julie.francois@example.com', '€35.00', '02-08-1980', '6.01.02.03.04'], ['5', 'emilie.durand@example.com', '35EUR', '1980-08-02', '0033 6 01 02 03 04']]
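
Once the addresses are cleaned, the same validation pattern can be reused to check that nothing invalid is left. The sketch below is self-contained: it copies `email_pattern` and `email_cleaner` from the cells above and runs them on three of the raw addresses shown earlier; the names `raw`, `cleaned` and `still_invalid` are introduced here for illustration.

```python
import re

email_pattern = r"^((?!\.)[\w\-_.]*[^.])(@\w+)(\.\w+(\.\w+)?[^.\W])$"

def email_cleaner(email: str) -> str:
    email = re.sub(r"\[at\]|\(at\)", "@", email)
    email = re.sub(r"\(dot\)|_", ".", email)
    return email

# Three raw values taken from the CSV output above
raw = [
    "jean.dupont[at]example.com",
    "marie.dupont(at)example(dot)fr",
    "emilie.durand@example.com",
]

valid_before = [e for e in raw if re.search(email_pattern, e)]
cleaned = [email_cleaner(e) for e in raw]
still_invalid = [e for e in cleaned if not re.search(email_pattern, e)]

print(valid_before)   # ['emilie.durand@example.com']
print(still_invalid)  # []
```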


In [147]:
# Cleaning prices
def prices_cleaner(price):
    # Keep the first numeric token and convert it to float
    match = re.search(price_pattern, price)
    return float(match.group()) if match else None

for user in users_list:
    user[2] = prices_cleaner(user[2])

print(users_list)

[['1', 'jean.dupont@example.com', 35.0, '02/08/1980', '06-01-02-03-04'], ['2', 'marie.dupont@example.fr', 34.0, '1980.08.02', '0601020304'], ['3', 'pierre.martin@exemple.org', 35.0, '1980/08/02', '+33 6 01 02 03 04'], ['4', 'julie.francois@example.com', 35.0, '02-08-1980', '6.01.02.03.04'], ['5', 'emilie.durand@example.com', 35.0, '1980-08-02', '0033 6 01 02 03 04']]
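
In the output above, `34,99$` becomes `34.0` because `price_pattern` only recognises `.` as a decimal separator. If the comma is meant as a decimal separator (an assumption, since it could also be a thousands separator in other datasets), a variant along these lines would keep the cents. `price_pattern_decimal` and `parse_price` are hypothetical names introduced for this sketch.

```python
import re

# Accept either "." or "," before the decimals (assumption: "," is a decimal separator)
price_pattern_decimal = r"[+-]?\d+(?:[.,]\d+)?"

def parse_price(raw):
    """Return the first number found in raw as a float, or None if there is none."""
    match = re.search(price_pattern_decimal, raw)
    if match is None:
        return None
    return float(match.group().replace(",", "."))

# Raw price values taken from the CSV output above
prices = [parse_price(p) for p in ["€35", "34,99$", "35 euros", "€35.00", "35EUR"]]
print(prices)
# [35.0, 34.99, 35.0, 35.0, 35.0]
```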


In [None]:
# Cleaning dates

for user in users_list:
    # dayfirst=True reads ambiguous dates such as 02/08/1980 as day/month/year
    user[3] = parser.parse(user[3], dayfirst=True)

print(users_list)

[['1', 'jean.dupont@example.com', 35.0, datetime.datetime(1980, 8, 2, 0, 0), '06-01-02-03-04'], ['2', 'marie.dupont@example.fr', 34.0, datetime.datetime(1980, 2, 8, 0, 0), '0601020304'], ['3', 'pierre.martin@exemple.org', 35.0, datetime.datetime(1980, 2, 8, 0, 0), '+33 6 01 02 03 04'], ['4', 'julie.francois@example.com', 35.0, datetime.datetime(1980, 8, 2, 0, 0), '6.01.02.03.04'], ['5', 'emilie.durand@example.com', 35.0, datetime.datetime(1980, 2, 8, 0, 0), '0033 6 01 02 03 04']]
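
The output above resolves the five raw values to two different dates (2 August and 8 February 1980), because the year-first values are also read day-first. If all five rows are meant to be the same date, one option is to let a regex decide which layout applies before parsing. This is a sketch under two assumptions: year-first values are year/month/day, and all other values are day/month/year. `parse_birth_date` is a hypothetical helper, and it uses `datetime.strptime` from the standard library rather than `dateutil`.

```python
import re
from datetime import datetime

def parse_birth_date(raw):
    # Use a single separator so only two layouts remain
    normalized = re.sub(r"[./]", "-", raw)
    # A leading 4-digit block means the year comes first
    if re.match(r"\d{4}-", normalized):
        return datetime.strptime(normalized, "%Y-%m-%d")
    return datetime.strptime(normalized, "%d-%m-%Y")

# Raw date values taken from the CSV output above
raw_dates = ["02/08/1980", "1980.08.02", "1980/08/02", "02-08-1980", "1980-08-02"]
dates = [parse_birth_date(d) for d in raw_dates]
print(dates)  # five times datetime.datetime(1980, 8, 2, 0, 0)
```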


In [None]:
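# Cleaning phone numbers: normalize to the layout of phone_pattern, "+33 6 01 02 03 04"
def phone_cleaner(phone):
    digits = re.sub(r"\D", "", phone)               # keep digits only
    digits = re.sub(r"^(?:0033|33|0)", "", digits)  # drop country or trunk prefix
    # Assumption: French numbers with 9 significant digits
    return re.sub(r"^(\d)(\d{2})(\d{2})(\d{2})(\d{2})$", r"+33 \1 \2 \3 \4 \5", digits)

for user in users_list:
    # Only rewrite numbers that are not already in the target format
    if not re.search(phone_pattern, user[4]):
        user[4] = phone_cleaner(user[4])

print(users_list)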