# Regular expressions (RegEx)

- Online regex tester: https://regex101.com
- w3schools: https://www.w3schools.com/python/python_regex.asp
- re module documentation: https://docs.python.org/3/library/re.html

### Syntax

- **Literals** `a` 
- **Alternation** `a|b`
- **Character sets** `[ab]`, `[^ab]`
- **Wildcards** `.`
- **Escape special characters** `\` (`?`, `*`, `+`, `^`, `$`)
- **Ranges** `[a-d]`, `[1-9]`
- **Character classes** `\w`, `\d`, `\s`, `\n`, `\W`, `\D`, `\S`
- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+`
- **Grouping** `()`
- **Anchors** `^`, `$`

### Methods

- **re.findall()**
- **re.sub()**
- **re.search()**
- **re.match()**
- **re.split()**

In [1]:
import re

In [2]:
text = "Pepe, Pepa and Luis are 22, 34, and 56 years old, respectively?"

In [3]:
# literals
# re.sub
text1 = re.sub("Luis", "Lola", text)
print(text1)

Pepe, Pepa and Lola are 22, 34, and 56 years old, respectively?


In [4]:
# alternation

text2 = re.sub("22|34", "40" , text)
print(text2)

Pepe, Pepa and Luis are 40, 40, and 56 years old, respectively?


In [5]:
# character sets
text3 = re.sub("[234]", "5", text)
print(text3)

Pepe, Pepa and Luis are 55, 55, and 56 years old, respectively?


In [6]:
# wildcards
text4 = re.sub("Pep.", "Felipe", text)
print(text4)

Felipe, Felipe and Luis are 22, 34, and 56 years old, respectively?


In [7]:
# escape special characters
text5 = re.sub(r"\?", "!", text)
print(text5)

Pepe, Pepa and Luis are 22, 34, and 56 years old, respectively!
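
Patterns that contain backslashes are usually written as raw strings (`r"..."`), because recent Python versions emit a warning for unrecognised escapes such as `"\?"` in a normal string. The following is a minimal sketch of two ways to escape a special character, with a raw string and with `re.escape`; the variable names are illustrative only.

```python
import re

text = "Pepe, Pepa and Luis are 22, 34, and 56 years old, respectively?"

# A raw string keeps the backslash as-is, so the regex engine receives \?
with_raw = re.sub(r"\?", "!", text)

# re.escape() builds the escaped pattern from a literal string
escaped_pattern = re.escape("?")
with_escape = re.sub(escaped_pattern, "!", text)

print(escaped_pattern)
print(with_raw == with_escape)
```

`re.escape` is mostly useful when the literal text comes from a variable and may contain characters that have a special meaning in a pattern.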


In [8]:
# ranges
# re.findall
print(re.findall("[a-df-z]", text))
print(re.findall("[A-Z]", text))
print(re.findall("[0-9]", text))

['p', 'p', 'a', 'a', 'n', 'd', 'u', 'i', 's', 'a', 'r', 'a', 'n', 'd', 'y', 'a', 'r', 's', 'o', 'l', 'd', 'r', 's', 'p', 'c', 't', 'i', 'v', 'l', 'y']
['P', 'P', 'L']
['2', '2', '3', '4', '5', '6']


In [9]:
# character classes
print(re.findall(r"\w+", text))

['Pepe', 'Pepa', 'and', 'Luis', 'are', '22', '34', 'and', '56', 'years', 'old', 'respectively']


In [10]:
# quantifiers
text = "baa ba b a aa aaa aaaa aaaaa"
print(re.findall("a", text))
print(re.findall("a{2}", text))
print(re.findall("a{2,}", text))
print(re.findall("a*", text))
print(re.findall("a+", text))
print(re.findall("ba+", text))
print(re.findall("a?", text))

['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']
['aa', 'aa', 'aa', 'aa', 'aa', 'aa', 'aa']
['aa', 'aa', 'aaa', 'aaaa', 'aaaaa']
['', 'aa', '', '', 'a', '', '', '', 'a', '', 'aa', '', 'aaa', '', 'aaaa', '', 'aaaaa', '']
['aa', 'a', 'a', 'aa', 'aaa', 'aaaa', 'aaaaa']
['baa', 'ba']
['', 'a', 'a', '', '', 'a', '', '', '', 'a', '', 'a', 'a', '', 'a', 'a', 'a', '', 'a', 'a', 'a', 'a', '', 'a', 'a', 'a', 'a', 'a', '']


In [11]:
# grouping
# re.search
text = "abctrc abc"

print(re.search(r"([a-z]{2}c){2}\sabc", text))

<re.Match object; span=(0, 10), match='abctrc abc'>
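
Besides letting a quantifier apply to a whole sub-pattern, parentheses also capture the text they match. As a small sketch on the same string, the captured pieces can be read from the match object with `group()` and `groups()`; the second pattern here is an illustrative one, not part of the original cell.

```python
import re

text = "abctrc abc"

# Two capturing groups: one on each side of the whitespace
match = re.search(r"([a-z]+)\s([a-z]+)", text)

if match:
    print(match.group(0))   # the whole match
    print(match.group(1))   # text captured by the first group
    print(match.group(2))   # text captured by the second group
    print(match.groups())   # all captured groups as a tuple

# When a group is repeated, only its last repetition is kept
repeated = re.search(r"([a-z]{2}c){2}\sabc", text)
print(repeated.group(1))
```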


In [12]:
# anchors
text = "Ironhack is the best school"
inverse = "The best school is Ironhack"

print(re.search("^Ironhack", text))
print(re.search("^Ironhack", inverse))
print(re.search("Ironhack$", text))
print(re.search("Ironhack$", inverse))

<re.Match object; span=(0, 8), match='Ironhack'>
None
None
<re.Match object; span=(19, 27), match='Ironhack'>


In [13]:
# re.search
# re.match
if re.search("a", "hola"):
    print("found!")

print(re.search("a", "hola"))
print(re.match("a", "a"))

found!
<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(0, 1), match='a'>
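
The two calls above both succeed, so they do not show how the functions differ. A short sketch of the difference, using the same word: `re.search` looks anywhere in the string, `re.match` only tries at the beginning, and `re.fullmatch` requires the whole string to fit the pattern.

```python
import re

word = "hola"

# re.search scans the whole string for the first match
print(re.search("a", word))

# re.match only tries at position 0, so "a" is not found in "hola"
print(re.match("a", word))

# re.match succeeds when the pattern fits the beginning of the string
print(re.match("ho", word))

# re.fullmatch requires the pattern to cover the entire string
print(re.fullmatch("ho", word))
print(re.fullmatch("h.la", word))
```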


In [14]:
# re.split
text = "Pepe, Pepa and Luis are 22, 34, and 56 years old, respectively?"
print(re.split(r"\d\d", text))

['Pepe, Pepa and Luis are ', ', ', ', and ', ' years old, respectively?']
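
`re.split` drops the separators by default. As a brief sketch on the same sentence, wrapping the pattern in a capturing group keeps them in the result, and the `maxsplit` argument limits the number of splits.

```python
import re

text = "Pepe, Pepa and Luis are 22, 34, and 56 years old, respectively?"

# A capturing group around the separator keeps the separators in the result
with_separators = re.split(r"(\d\d)", text)
print(with_separators)

# maxsplit limits how many splits are performed
first_split_only = re.split(r"\d\d", text, maxsplit=1)
print(first_split_only)
```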


### Let's practice

You work for a very big company and you are assigned the task of verifying information from the 200 most important clients in Europe for a meeting with the board of directors in an hour. Execute the code below to see your dataframe.

In [15]:
# loading dataframe
import pandas as pd
a = pd.read_csv('db.csv')
display(a)

Unnamed: 0,name phone email date contract_value creditcard country postalcode address
Cameron Sanders +049 106-232065 In@consequatdolorvitae.co.uk Nov 18,2018 €1968,40 5109-0814-0029-5373 Spain 17716 6396 Orci St.
Paki Andrews +92 757-624220 sed.tortor.Integer@amet.org Nov 5,2019 €1436,12 5193 1959 4162 5216 Spain 85485 669-6354 Orci Ave
Alice Henderson +058 588-402901 Phasellus@vitaesodalesnisi.co.uk Mar 9,2020 €1454,20 5582.3641.9791.9751 Austria 4964 1696 Ac St.
Jamalia Johnston +080 954-388350 turpis.nec@elitafeugiat.net Sep 30,2019 €1979,49 5398 2093 5064 8545 Netherlands 7683 XL Ap #958-9660 Dis Rd.
Erica Meloni +044 143-705707 pede@risusQuisquelibero.net Sep 24,2019 €1532,...,20 5573-3754-4110-1229 Germany 60983 Ap #411-6393 Neque Ave
Ina Pope +66 710-695422 ipsum.leo@etnetus.org Jun 25,2019 €1411,14 5494.4294.5676.1991 Turkey 56481 100-3143 Sem Av.
Pierre Huet +59 780-534428 Duis@massanon.edu Feb 12,2020 €1204,66 5464.3195.6239.7366 France 56606 9836 Proin St.
Fuller Wong +066 630-613976 arcu.Sed@pharetrafelis.edu Feb 16,2019 €1964,41 5173.8314.8364.7499 Poland 70-698 4283 Proin Avenue
Inga Perez +077 335-699540 lectus.ante@dolordolortempus.edu Apr 13,2020 €1416,76 5259 5489 5154 6089 Spain 91092 P.O. Box 788,5453 Enim. St.


Oh no! It seems that one of the interns has messed up the `.csv` file, there is no backup, and you are to blame. In order to keep your job, you must find a way to restore the original data. But wait! There is no time to go through all the data manually. Good thing you know how to use `regular expressions`.

First of all, let's import the `re` module and load our text file into a variable.
These are the fields we must recover:
- name
- phone
- email
- date
- contract_value
- creditcard
- country
- address

In [16]:
import re

In [17]:
with open('db.csv') as file:
    db = file.read()
print(db)

name phone email date contract_value creditcard country postalcode address Cameron Sanders +049 106-232065 In@consequatdolorvitae.co.uk Nov 18, 2018 €1968,40 5109-0814-0029-5373 Spain 17716 6396 Orci St. Paki Andrews +92 757-624220 sed.tortor.Integer@amet.org Nov 5, 2019 €1436,12 5193 1959 4162 5216 Spain 85485 669-6354 Orci Ave Alice Henderson +058 588-402901 Phasellus@vitaesodalesnisi.co.uk Mar 9, 2020 €1454,20 5582.3641.9791.9751 Austria 4964 1696 Ac St. Jamalia Johnston +080 954-388350 turpis.nec@elitafeugiat.net Sep 30, 2019 €1979,49 5398 2093 5064 8545 Netherlands 7683 XL Ap #958-9660 Dis Rd. Erica Meloni +044 143-705707 pede@risusQuisquelibero.net Sep 24, 2019 €1532,00 5186 5977 9103 6189 Italy 31046 P.O. Box 271, 1072 Cursus Rd. Hunter Cardenas +042 741-154469 dolor@quis.edu Jul 31, 2020 €1950,98 5406 4489 0949 9550 United Kingdom C57 2AU 473-2340 Nec Road Anders Bodin +21 119-216844 elit.Aliquam@quistristique.edu Jul 28, 2019 €1763,38 5511-2605-5014-4845 Sweden 51381 Ap #633-7

In [18]:
#name
name = re.findall(r"[A-Z][a-z]+\s[A-Z][a-z]+\s\+",db)
name = [re.sub(r"\s\+","", nombre) for nombre in name]
print(len(name))

200
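
The cell above matches the trailing ` +` together with the name and then removes it with `re.sub`. A possible alternative, sketched here on a two-record sample copied from the text above, is a lookahead `(?=...)`, which checks what follows without making it part of the match. The `sample` and `names` variables are illustrative.

```python
import re

# Short sample in the same layout as the db.csv text shown above
sample = (
    "Cameron Sanders +049 106-232065 In@consequatdolorvitae.co.uk "
    "Nov 18, 2018 €1968,40 5109-0814-0029-5373 Spain 17716 6396 Orci St. "
    "Paki Andrews +92 757-624220 sed.tortor.Integer@amet.org "
    "Nov 5, 2019 €1436,12 5193 1959 4162 5216 Spain 85485 669-6354 Orci Ave"
)

# (?=\s\+) requires " +" to follow, without including it in the match
names = re.findall(r"[A-Z][a-z]+\s[A-Z][a-z]+(?=\s\+)", sample)
print(names)
```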


In [19]:
#phone
phone = re.findall(r"\+[0-9]+\s[0-9]+\-[0-9]+", db)
print(len(phone))

200


In [20]:
#email
email = re.findall(r"\S+@\S+",db)
# email = re.findall(r'(?i)[a-z0-9.]*@[a-z0-9.]*', db)
print(len(email))

200


In [21]:
#date
date = re.findall(r"\S+\s\d{1,2},\s\S+",db)
# date = re.findall(r'[JFMASOND][a-z]{2}\s[0-9]+,\s[0-9]{4}', db)
print(len(date))

200
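
A count of 200 only shows that the pattern matched 200 times, not that every match is a real date. One way to check the matches, sketched here under the assumption that all dates follow the `Mon D, YYYY` layout seen above and that month abbreviations are in English (the `%b` directive depends on the locale), is to parse them with `datetime.strptime`, which raises `ValueError` on a malformed date.

```python
import re
from datetime import datetime

# Sample fragments copied from the db.csv text shown above
sample = (
    "In@consequatdolorvitae.co.uk Nov 18, 2018 €1968,40 "
    "Phasellus@vitaesodalesnisi.co.uk Mar 9, 2020 €1454,20"
)

dates = re.findall(r"[A-Z][a-z]{2}\s\d{1,2},\s\d{4}", sample)

# strptime raises ValueError if a match is not a valid date
parsed = [datetime.strptime(d, "%b %d, %Y").date() for d in dates]
print(parsed)
```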


In [22]:
#contract_value
# contract = re.findall(r"€[0-9,]+", db)
# contract = re.findall(r"€[0-9]+,[0-9]{2}", db)
contract = re.findall(r"€\S+", db)
print(len(contract))

200
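
The extracted values are still strings such as `€1968,40`. If numbers are needed later, a conversion could look like the sketch below, which assumes that the comma is a decimal separator and that there are no thousands separators; the `values` name is illustrative.

```python
import re

# Sample fragments copied from the db.csv text shown above
sample = (
    "Nov 18, 2018 €1968,40 5109-0814-0029-5373 Spain "
    "Nov 5, 2019 €1436,12 5193 1959 4162 5216 Spain"
)

contract = re.findall(r"€\d+,\d{2}", sample)

# Assumption: "," is the decimal separator, so swap it for "." before float()
values = [float(c.lstrip("€").replace(",", ".")) for c in contract]
print(values)
```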


In [23]:
#creditcard
creditcard = re.findall(r"\d{4}.\d{4}.\d{4}.\d{4}",db)
print(len(creditcard))

200
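
In the pattern above, `.` accepts any character between the digit groups. A stricter variant, sketched here on the three separator styles seen in the data, restricts the separator to `-`, `.` or a space and uses a backreference (`\1`) to require the same separator throughout. Because `re.findall` returns only the captured group when a pattern contains one, the sketch uses `re.finditer` and `group(0)`.

```python
import re

# Sample fragments copied from the db.csv text shown above
sample = (
    "€1968,40 5109-0814-0029-5373 Spain "
    "€1436,12 5193 1959 4162 5216 Spain "
    "€1454,20 5582.3641.9791.9751 Austria"
)

# ([-. ]) captures the first separator; \1 requires the same one afterwards
pattern = r"\d{4}([-. ])\d{4}\1\d{4}\1\d{4}"
cards = [m.group(0) for m in re.finditer(pattern, sample)]
print(cards)
```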


In [24]:
#country
country = re.findall(r"\d{4}.\d{4}.\d{4}.\d{4}\s\w+",db)
country = [re.sub(r"\d{4}.\d{4}.\d{4}.\d{4}\s","",e) for e in country]
# \w+ captures only the first word, so "United" is expanded to "United Kingdom"
country = [re.sub("United","United Kingdom",e) for e in country]
print(len(country))

200
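
The three steps above can also be collapsed into one, at the cost of a stronger assumption. The sketch below assumes that every country name is made of capitalised words and that whatever follows it (the postal code) does not start with a capitalised word, which holds for the records shown above but is not verified for the whole file. A capturing group makes `re.findall` return only the country.

```python
import re

# Sample fragments copied from the db.csv text shown above
sample = (
    "5109-0814-0029-5373 Spain 17716 6396 Orci St. "
    "Hunter Cardenas +042 741-154469 dolor@quis.edu Jul 31, 2020 €1950,98 "
    "5406 4489 0949 9550 United Kingdom C57 2AU 473-2340 Nec Road"
)

# Group 1: one or more capitalised words right after the card number
pattern = r"\d{4}[-. ]\d{4}[-. ]\d{4}[-. ]\d{4}\s([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"
countries = re.findall(pattern, sample)
print(countries)
```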


In [25]:
#address
# Build alternation patterns (joined with |) that match any known name or country.
name_reg = '|'.join(set(name)).replace(' ', r'\s')
country_reg = '|'.join(set(country)).replace(' ', r'\s')
# Split db by country, then split each resulting chunk by name
address = [re.split(name_reg,e) for e in re.split(country_reg,db)][1:]
# Take the piece that holds the address and strip leading and
# trailing whitespace
address = [e[0].strip() for e in address]
print(len(address))

200
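
The split-based approach depends on the `name` and `country` lists built earlier. As an alternative sketch, not the approach used in this notebook, a single pattern with named groups can describe a whole record, with the address defined as everything up to the next name or the end of the text. It is shown on two records copied from above and relies on the same assumptions as the individual patterns (capitalised two-word names, capitalised country names, comma as decimal separator); the `record` and `rows` names are illustrative.

```python
import re

# Two records copied from the db.csv text shown above
sample = (
    "Cameron Sanders +049 106-232065 In@consequatdolorvitae.co.uk "
    "Nov 18, 2018 €1968,40 5109-0814-0029-5373 Spain 17716 6396 Orci St. "
    "Paki Andrews +92 757-624220 sed.tortor.Integer@amet.org "
    "Nov 5, 2019 €1436,12 5193 1959 4162 5216 Spain 85485 669-6354 Orci Ave"
)

record = re.compile(
    r"(?P<name>[A-Z][a-z]+\s[A-Z][a-z]+)\s"
    r"(?P<phone>\+\d+\s\d+-\d+)\s"
    r"(?P<email>\S+@\S+)\s"
    r"(?P<date>[A-Z][a-z]{2}\s\d{1,2},\s\d{4})\s"
    r"(?P<contract>€\d+,\d{2})\s"
    r"(?P<creditcard>\d{4}[-. ]\d{4}[-. ]\d{4}[-. ]\d{4})\s"
    r"(?P<country>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)\s"
    # Lazy match: stop as soon as the next name (followed by " +") or the end appears
    r"(?P<address>.+?)"
    r"(?=\s[A-Z][a-z]+\s[A-Z][a-z]+\s\+|$)"
)

# groupdict() turns each match into a {field: value} dictionary
rows = [m.groupdict() for m in record.finditer(sample)]
for row in rows:
    print(row["name"], "|", row["country"], "|", row["address"])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, which avoids having to keep eight separate lists aligned.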


In [26]:
# Creating DataFrame

df = pd.DataFrame(list(zip(name,phone,email,date,contract,creditcard,country,address)), 
                  columns=['name','phone','email','date','contract','creditcard',
                           'country','address'])
display(df.head())
df.to_csv('df_cleaned.csv')

Unnamed: 0,name,phone,email,date,contract,creditcard,country,address
0,Cameron Sanders,+049 106-232065,In@consequatdolorvitae.co.uk,"Nov 18, 2018","€1968,40",5109-0814-0029-5373,Spain,17716 6396 Orci St.
1,Paki Andrews,+92 757-624220,sed.tortor.Integer@amet.org,"Nov 5, 2019","€1436,12",5193 1959 4162 5216,Spain,85485 669-6354 Orci Ave
2,Alice Henderson,+058 588-402901,Phasellus@vitaesodalesnisi.co.uk,"Mar 9, 2020","€1454,20",5582.3641.9791.9751,Austria,4964 1696 Ac St.
3,Jamalia Johnston,+080 954-388350,turpis.nec@elitafeugiat.net,"Sep 30, 2019","€1979,49",5398 2093 5064 8545,Netherlands,7683 XL Ap #958-9660 Dis Rd.
4,Erica Meloni,+044 143-705707,pede@risusQuisquelibero.net,"Sep 24, 2019","€1532,00",5186 5977 9103 6189,Italy,"31046 P.O. Box 271, 1072 Cursus Rd."
