# Regular Expressions (RegEx)

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.

![regex](https://miro.medium.com/max/1200/1*ZVlIZ1ZYC6rASz-dYPzhZQ.jpeg)

**references**

- https://docs.python.org/3/howto/regex.html
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

**may save your life**

- https://regex101.com/

### First things first

For most cases the standard-library module is enough: **import re**. For the third-party `regex` module, **pip3/pip install regex** should install it.

In [7]:
import re
import numpy as np
import pandas as pd

## Syntax
### Special Characters:
- `.` Matches any character except a newline.
- `^` Matches the start of the string.
- `$` Matches the end of the string or just before the newline at the end of the string.
- `*` Matches 0 or more repetitions of the preceding RE.
- `+` Matches 1 or more repetitions of the preceding RE.
- `?` Matches 0 or 1 repetitions of the preceding RE.
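- `(?<=...)` Positive lookbehind: matches only if the current position is preceded by `...`. See https://www.regular-expressions.info/lookaround.html
- Note: `re.M` (`re.MULTILINE`) enables multiline mode, in which `^` and `$` also match at the start and end of each line.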
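The special characters above can be tried on small strings. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

text = "cat\ncot\ncut"

# `.` matches any single character except a newline
any_char = re.findall(r"c.t", text)

# Without re.M, `^` only matches at the very start of the string
start_only = re.findall(r"^c.t", text)

# With re.M, `^` also matches right after each newline
every_line = re.findall(r"^c.t", text, re.M)

# `*`, `+` and `?` differ in how many repetitions of "b" they accept
zero_or_more = re.findall(r"ab*", "a ab abb")
one_or_more = re.findall(r"ab+", "a ab abb")
zero_or_one = re.findall(r"ab?", "a ab abb")

# (?<=\$) is a lookbehind: keep digits only when a "$" comes right before them
after_dollar = re.findall(r"(?<=\$)\d+", "cost: $42, qty: 7")

print(any_char, start_only, every_line)
print(zero_or_more, one_or_more, zero_or_one, after_dollar)
```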

### Special Sequences:

- **Literals** `a` 
- **Alternation** `a|b`
- **Character sets** `[ab]`, `[^ab]` <- we use the hat between brackets to indicate that we want the opposite
- **Wildcards** `.`
- **Escape special characters** `\` (?,*,+,^,$)
- **Ranges** `[a-d]`, `[1-9]`, `[A-D]`

- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+`
- **Grouping** `()`
- **Anchors** `^`, `$`
- **Character classes** `\w`, `\d`, `\s`, `\n`, `\W`, `\D`, `\S`
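As a quick illustration of the building blocks listed above, here is a small sketch run against a made-up sample string (the string and variable names are assumptions, not part of the original notebook).

```python
import re

sample = "cat bat rat 2024-01-15"

# Alternation: either "cat" or "rat"
alternation = re.findall(r"cat|rat", sample)

# Character set: "c" or "b" followed by "at"
char_set = re.findall(r"[cb]at", sample)

# Negated set: any character except "c", "b" or a space, followed by "at"
negated_set = re.findall(r"[^cb ]at", sample)

# Quantifier: runs of 2 to 4 digits
digit_runs = re.findall(r"\d{2,4}", sample)

# Grouping: capture year, month and day separately
date_parts = re.search(r"(\d{4})-(\d{2})-(\d{2})", sample).groups()

# Escaping: a backslash makes "?" a literal character
question_marks = re.findall(r"\?", "what? really?")

print(alternation, char_set, negated_set)
print(digit_runs, date_parts, question_marks)
```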

**\w** - Matches any word character: letters, digits, and the underscore `_`. Equivalent to `[a-zA-Z0-9_]` for ASCII text (with `str` patterns, Python also matches Unicode letters and digits unless `re.ASCII` is set).

**\d** - Matches any digit. Equivalent to `[0-9]` for ASCII text.

**\s** - Matches any whitespace character. Equivalent to `[ \t\n\r\f\v]` for ASCII text.

**\W** - Matches any non-word character. Equivalent to `[^a-zA-Z0-9_]` for ASCII text.

**\D** - Matches any non-digit. Equivalent to `[^0-9]` for ASCII text.

**\S** - Matches any non-whitespace character. Equivalent to `[^ \t\n\r\f\v]` for ASCII text.
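A short sketch of the character classes in action, using made-up strings. The last two lines show how `re.ASCII` changes what `\w` accepts for a non-ASCII letter.

```python
import re

sentence = "user_1 paid 30 EUR!"

words = re.findall(r"\w+", sentence)       # runs of word characters
numbers = re.findall(r"\d+", sentence)     # runs of digits
spaces = re.findall(r"\s", sentence)       # each whitespace character
non_words = re.findall(r"\W", sentence)    # everything that is not a word character

# By default \w also accepts Unicode letters; re.ASCII restricts it to [a-zA-Z0-9_]
unicode_words = re.findall(r"\w+", "niño")
ascii_words = re.findall(r"\w+", "niño", re.ASCII)

print(words, numbers, len(spaces), non_words)
print(unicode_words, ascii_words)
```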



### Methods

### re.sub(pattern, repl, string, count=0)
Replaces one or many matches with a string

In [8]:
txt = "gabriel, DDio & CClara are TA's??"

In [9]:
a = '2'

In [10]:
txt = "gabriel, DDDio & CClara are TA's??"
re.sub('[A-Z]', a, txt)

"gabriel, 222io & 22lara are 22's??"

In [11]:
re.sub('[A-Z]{2}', a, txt)

"gabriel, 2Dio & 2lara are 2's??"

In [12]:
#re.sub
#Literals
#{2}: every 2 consecutive occurrences of [A-Z] are substituted by a
txt = "gabriel, DDDio & CClara are TA's??"
txt = re.sub('[A-Z]{2}', a, txt)
txt

"gabriel, 2Dio & 2lara are 2's??"

In [13]:
#Ranges
re.sub('[A-Z]','',txt)

"gabriel, 2io & 2lara are 2's??"

In [14]:
#Escape special character, quantifiers
re.sub(r'\?{2}','.',txt)

"gabriel, 2Dio & 2lara are 2's."
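The `count` argument in the signature is not exercised in the cells above. The sketch below shows it, along with two other options `re.sub` supports: a function as the replacement and a backreference (`\1`) in the replacement string. Variable names are illustrative.

```python
import re

txt = "gabriel, DDDio & CClara are TA's??"

# count=1 stops after the first replacement
first_only = re.sub(r"[A-Z]", "2", txt, count=1)

# The replacement can be a function that receives each match object
lowered = re.sub(r"[A-Z]+", lambda m: m.group().lower(), txt)

# ([A-Z])\1+ finds a repeated uppercase letter; r"\1" keeps a single copy
collapsed = re.sub(r"([A-Z])\1+", r"\1", txt)

print(first_only)
print(lowered)
print(collapsed)
```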

### re.search(pattern, string, flags=0)
Scan through a string, looking for any location where this RE matches. If the search is successful, `re.search()` returns a match object. Otherwise, it returns `None`.

In [21]:
#re.search
txt = "The rain in Spain and the sun in England, Tower"
x = re.search("^The.*Spain", txt) 
x.span()

(0, 17)

In [29]:
txt = "The rain in Spain"
#\b is a word boundary, so \bS matches an "S" at the start of a word
#\w+ then matches the rest of that word
x = re.search(r"\bS\w+", txt)
print(x)
print(x.span())
#returns a tuple containing the start-, and end positions of the match
print(x.start())
#contains the start position of the match
print(x.end())
#contains the end position of the match
print(x.string)
#print the string passed into the function (variable 'txt')
print(x.group())
#Print the part of the string where there was a match

<re.Match object; span=(12, 17), match='Spain'>
(12, 17)
12
17
The rain in Spain
Spain


In [28]:
for i in range(x.span()[0], x.span()[1]):
    print(x.string[i])

S
p
a
i
n


In [32]:
print(re.search(r'r\w+\s\w+', txt).group())
print(re.search(r'R\w+', txt))
print(re.search(r'T\w*', txt).group())
print(re.search(r'^t\w*', txt))

rain in
None
The
None
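Because `re.search()` returns `None` when nothing matches, calling `.group()` directly on the result raises an `AttributeError` in that case. One possible way to guard against it is sketched below; the helper name `first_word_starting_with` is hypothetical.

```python
import re

def first_word_starting_with(letter, text):
    """Return the first word that starts with `letter`, or None if there is none."""
    # re.escape keeps the letter literal even if it is a special character
    match = re.search(rf"\b{re.escape(letter)}\w*", text)
    return match.group() if match else None

txt = "The rain in Spain"
found = first_word_starting_with("S", txt)
missing = first_word_starting_with("R", txt)   # the search is case-sensitive

print(found, missing)
```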


### re.match(pattern, string)
Determine if the RE matches at the beginning of the string.

In [126]:
#re.match
pattern = r"Cookie"
sequence = "I want a Cookie"
sequence2= "Cookie, I want you!"
if re.match(pattern, sequence2):
    print("Match!")
else: 
    print("Not a match!")

Match!


In [35]:
txt = "The rain in Spain"
#matches at the beginning of the string
print(re.match(r'r\w*', txt))
print(re.match(r'^r\w*', txt))

print(re.search(r'^T\w*', txt).group()) # without re.M, search with ^ behaves like match

print(re.match(r'T\w*', txt).group())

None
None
The
The
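The equivalence between `re.search` with `^` and `re.match` holds for single-line use. With `re.M`, `^` also matches at the start of every line, while `re.match` still only looks at the very beginning of the string. A small sketch with a made-up two-line string:

```python
import re

lines = "first line\nThe rain in Spain"

# "The" is not at the start of the string, so both of these fail
match_start = re.match(r"The", lines)
search_start = re.search(r"^The", lines)

# In multiline mode `^` matches after the newline, so search succeeds
search_multiline = re.search(r"^The", lines, re.M)

# re.match ignores line starts even in multiline mode
match_multiline = re.match(r"The", lines, re.M)

print(match_start, search_start, match_multiline)
print(search_multiline.span())
```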


In [42]:
email_address = 'Please contact us at: support@thebridge.com'
matchi = re.search(r'(\w+)@([\w\.]+)', email_address)
if matchi:
    print(matchi.group(0)) # The whole matched text
    print(matchi.group(1)) # The username (group 1)
    print(matchi.group(2)) # The host (group 2)

support@thebridge.com
support
thebridge.com


### re.fullmatch(pattern, string)
Returns a match object only if the whole string matches the RE. Otherwise, it returns `None`.

In [276]:
class_names = ["Andrea", "Ariadna", "Estela", "Anais", "Xeles", "Maria", "Mar"]
for name in class_names:
    if re.fullmatch("Maria", name):
        print(f"{name} is desired name")
    else:
        print(f"{name} is not desired name")

Andrea is not desired name
Ariadna is not desired name
Estela is not desired name
Anais is not desired name
Xeles is not desired name
Maria is desired name
Mar is not desired name
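To contrast `re.search`, `re.match` and `re.fullmatch` side by side, here is a short sketch on a made-up name:

```python
import re

name = "Maria Jose"

found_anywhere = bool(re.search("Jose", name))     # search looks at every position
found_at_start = bool(re.match("Jose", name))      # match only looks at the beginning
starts_with = bool(re.match("Maria", name))        # a prefix is enough for match
whole_string = bool(re.fullmatch("Maria", name))   # fullmatch needs the entire string
exact = bool(re.fullmatch("Maria", "Maria"))

print(found_anywhere, found_at_start, starts_with, whole_string, exact)
```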


### re.findall(pattern, string)
Finds all substrings where the RE matches and returns them as a list.

In [138]:
#re.findall
email_address = "Please contact us at: support.data@data-science.com, xyz@thebridge.com"

#'addresses' is a list that stores all the matches
addresses = re.findall(r'\w+\.?\w+@[\w\.-]+', email_address)
addresses

['support.data@data-science.com', 'xyz@thebridge.com']

In [140]:
print(re.findall(r'[^aeiou\s]',email_address))
print(re.findall(r'\sc\w*',email_address))
print(re.findall(r'^P\w*',email_address))

['P', 'l', 's', 'c', 'n', 't', 'c', 't', 's', 't', ':', 's', 'p', 'p', 'r', 't', '.', 'd', 't', '@', 'd', 't', '-', 's', 'c', 'n', 'c', '.', 'c', 'm', ',', 'x', 'y', 'z', '@', 't', 'h', 'b', 'r', 'd', 'g', '.', 'c', 'm']
[' contact']
['Please']


In [141]:
with open('info.txt', 'r') as file:
    client_info = file.read()

In [159]:
emails_clients=re.findall(r"[\w.]+@[\w.]+", client_info)
print(emails_clients[:5])

['sit.amet.metus@egestasnunc.ca', 'arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk', 'Nulla.eu.neque@Fuscealiquetmagna.com', 'Nullam.velit@non.ca', 'In@gravidamolestiearcu.co.uk']


In [174]:
client_numbers_normal = re.findall(r"[0-9]{2}-\d{3}-\d{3}-\d{3}", client_info)

In [178]:
print(client_numbers_normal[:5])

['34-739-941-941', '34-278-870-242', '34-999-876-292', '34-345-887-949', '34-905-089-682']


In [172]:
client_numbers = re.findall(r"(([0-9]{2}-)\d{3}-\d{3}-\d{3})", client_info)

In [177]:
print(client_numbers[:5])

[('34-739-941-941', '34-'), ('34-278-870-242', '34-'), ('34-999-876-292', '34-'), ('34-345-887-949', '34-'), ('34-905-089-682', '34-')]
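The tuples in the last output appear because the pattern contains capturing groups: with groups, `re.findall` returns the groups instead of the whole match. The sketch below, on a made-up string, shows the variants; `(?:...)` is a non-capturing group.

```python
import re

sample = "Call 34-739-941-941 or 34-278-870-242"

# No groups: a list of whole matches
no_groups = re.findall(r"\d{2}-\d{3}-\d{3}-\d{3}", sample)

# One capturing group: only the captured part of each match
one_group = re.findall(r"(\d{2})-\d{3}-\d{3}-\d{3}", sample)

# Several groups: one tuple per match
two_groups = re.findall(r"((\d{2})-\d{3}-\d{3}-\d{3})", sample)

# A non-capturing group groups without changing what findall returns
non_capturing = re.findall(r"(?:\d{2}-)\d{3}-\d{3}-\d{3}", sample)

print(no_groups, one_group, two_groups, non_capturing, sep="\n")
```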


### re.split(pattern, string, maxsplit=0)
Returns a list where the string has been split at each match

In [19]:
#re.split
sente = "Hello,\n Please, contact me the sooner.\n Thank you,\n Me"

In [179]:
reg = re.split("\n", sente)
reg

['Hello,', ' Please, contact me the sooner.', ' Thank you,', ' Me']

In [181]:
"".join(reg)

'Hello, Please, contact me the sooner. Thank you, Me'

https://stackoverflow.com/questions/10804732/what-is-the-difference-between-and-in-regex

In [190]:
client_list = re.split(r"(?<=[0-9]{2}-\d{3}-\d{3}-\d{3})",client_info)
print(client_list[:3])

['Wyoming Dudley sit.amet.metus@egestasnunc.ca 2518 Nulla Road 34-739-941-941', ' Aristotle Grant arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk 8453 Nostra, St. 34-278-870-242', ' Zephania Copeland Nulla.eu.neque@Fuscealiquetmagna.com P.O. Box 733, 3179 Ligula. Av. 34-999-876-292']
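A few more `re.split` behaviours, sketched on made-up strings: `maxsplit` limits the number of splits, the separator pattern can absorb the whitespace that follows it, and a lookbehind splits after a match without removing it. The final filter drops the empty string that a split at the very end of the text can leave behind.

```python
import re

sente = "Hello,\n Please, contact me the sooner.\n Thank you,\n Me"

# maxsplit=1 splits only at the first newline
head_and_rest = re.split(r"\n", sente, maxsplit=1)

# Let the separator swallow the leading space of each following line
clean_lines = re.split(r"\n\s*", sente)

# Lookbehind: split right after each phone number, keeping the number itself
records = "Ann 34-111-222-333 Bob 34-444-555-666"
pieces = re.split(r"(?<=\d{2}-\d{3}-\d{3}-\d{3})\s*", records)
pieces = [p for p in pieces if p]   # discard empty strings

print(head_and_rest, clean_lines, pieces, sep="\n")
```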


### re.compile(pattern, flags=0)
Compiles a RE into a regular expression object.

In [193]:
name_check = re.compile(r"[^A-Za-z ]")

In [194]:
name = input("Please insert your name:")
while name_check.search(name):
    # it loops while it finds a character that is not a letter or a space
    print("Please enter your name correctly!")
    name = input("Please insert your name:")
print("Finally mate, I thought you'd never do it")

Please insert your name:clara+piniella
Please enter your name correctly!
Please insert your name:Clara.piniella
Please enter your name correctly!
Please insert your name:7835djfa
Please enter your name correctly!
Please insert your name:898593*+
Please enter your name correctly!
Please insert your name:Clara Piniella
Finally mate, I thought you'd never do it
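A compiled pattern is mostly useful when the same RE is applied many times. The sketch below reuses `name_check` in a non-interactive helper; the function name `is_valid_name` and the candidate list are hypothetical.

```python
import re

# Compile once, reuse for every name
name_check = re.compile(r"[^A-Za-z ]")

def is_valid_name(name):
    """A name is valid when no character other than letters and spaces is found."""
    return name_check.search(name) is None

candidates = ["Clara Piniella", "clara+piniella", "7835djfa"]
results = {name: is_valid_name(name) for name in candidates}

print(results)
print(name_check.pattern)   # the original pattern string is kept on the object
```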


-----------------------------------------------------------------------------------------------------------


## Some practice 
Now it's your turn.

#### Simple validation of a username
https://www.codewars.com/kata/56a3f08aa9a6cc9b75000023
    
Write a simple regex to validate a username. Allowed characters are:

- lowercase letters,
- numbers,
- underscore

Length should be between 4 and 16 characters (both included).


In [282]:
# Maria team
def validate_usr(username):
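    # Note: this only flags a run of 4 to 16 invalid characters, so it enforces
    # neither the allowed characters nor the length limits
    namecheck = re.compile(r"[^a-z0-9_]{4,16}")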
    while namecheck.search(username):
        print("Please enter your name correctly!")
        username = input("Please insert your name:")
    print("Finally mate, I thought you'd never do it")
    
validate_usr(username=input("Enter a username: "))

Enter a username: cla_piniella
Finally mate, I thought you'd never do it


In [243]:
# Alex team
def validate_usr(username):
    str_list = re.findall("[a-z]|[0-9]|_",username)
    compare = ""
    compare = compare.join(str_list)
    if compare == username and 4 <= len(compare) <= 16:
        return True
    else:
        return False

username = "clara"
test = validate_usr(username)
test

True

In [25]:
# your solution

In [199]:
def validate_usr(username):
    if re.match(r'^[a-z\d_]{4,16}$', username):
        return True
    else:
        return False

In [205]:
validate_user = lambda string: bool(re.match(r'^[a-z\d_]{4,16}$', string))

In [250]:
check_usernames = ["unusuario", "UNUSUARIO", "UN_usuario", "un_usuario76", "8934_38875aa"]

In [251]:
check_usernamesbool = [validate_user(e) for e in check_usernames]

In [252]:
check_usernamesbool

[True, False, False, True, True]
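One subtlety of the `^...$` approach: `$` also matches just before a newline at the end of the string, so a username with a trailing newline still validates. The sketch below compares it with `re.fullmatch` and with the `\Z` anchor, which only matches at the very end. The test string is made up.

```python
import re

candidate = "abcd\n"   # valid characters followed by a stray newline

with_dollar = re.match(r"^[a-z\d_]{4,16}$", candidate)
with_fullmatch = re.fullmatch(r"[a-z\d_]{4,16}", candidate)
with_end_anchor = re.match(r"[a-z\d_]{4,16}\Z", candidate)

print(bool(with_dollar), bool(with_fullmatch), bool(with_end_anchor))
```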

#### Regex validate PIN code 

https://www.codewars.com/kata/55f8a9c06c018a0d6e000132

ATM machines allow 4 or 6 digit PIN codes and PIN codes cannot contain anything but exactly 4 digits or exactly 6 digits.

If the function is passed a valid PIN string, return true, else return false.

Examples:
```python
"1234"   -->  True
"12345"  -->  False
"a234"   -->  False
```

In [None]:
# Javi Gil team

In [264]:
pin = re.compile(r"^[0-9]{4,6}$")
enter_pin= input("Please insert your pin:")
while not (pin.search(enter_pin)):
    # it loops while the PIN does not match
    print("Come on, blockhead, let's see if you get the regex right")
    enter_pin = input("Please insert your pin:")
print("Finally mate, I thought you'd never do it")

# this one is not valid because it also accepts 5-digit PINs

Please insert your pin:893758954
Come on, blockhead, let's see if you get the regex right
Please insert your pin:17718
Finally mate, I thought you'd never do it


In [None]:
# Alfon team

In [265]:
pin_check = re.compile(r"^[0-9]{4}$|^[0-9]{6}$")

def validate_usr(pin):
    while not pin_check.search(pin):
        print("'Incorrect PIN'")
        pin = input('Give me a PIN')
    print("Correct PIN")


validate_usr(pin=input("Insert a number"))

# passed!

Insert a number98725973289075
'Incorrect PIN'
Give me a PINclara
'Incorrect PIN'
Give me a PIN83789
'Incorrect PIN'
Give me a PIN3546
Correct PIN


In [27]:
# your solution

In [267]:
def validate_pin(pin):
    if re.fullmatch(r"\d{4}|\d{6}", str(pin)):
        return True
    else:
        return False

In [268]:
pin = 7847

In [269]:
pins = ["8475", 8735, 9858473485, "fklajkfd", pin]

In [270]:
checkpins = [validate_pin(pi) for pi in pins]

In [271]:
checkpins

[True, True, False, False, True]

In [273]:
list(filter(validate_pin, pins))

['8475', 8735, 7847]
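A detail worth knowing for PIN validation: with `str` patterns, `\d` also accepts decimal digits from other scripts. If only `0-9` should pass, `[0-9]` or the `re.ASCII` flag can be used. A minimal sketch, using Arabic-Indic digits as a made-up input:

```python
import re

arabic_indic_pin = "\u0661\u0662\u0663\u0664"   # four Arabic-Indic digits

unicode_digits = re.fullmatch(r"\d{4}|\d{6}", arabic_indic_pin)
ascii_flag = re.fullmatch(r"\d{4}|\d{6}", arabic_indic_pin, re.ASCII)
explicit_range = re.fullmatch(r"[0-9]{4}|[0-9]{6}", arabic_indic_pin)

print(bool(unicode_digits), bool(ascii_flag), bool(explicit_range))
```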

-----------------------------------------------------------------------------------------------------------

# The FBI challenge

- https://www.fbi.gov/scams-and-safety/common-fraud-schemes/nigerian-letter-or-419-fraud
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

It's your first day at the FBI office and your boss has just sent you a `txt` file, `emails.txt`. She asked you to run some analysis, but first of all you need to build a DataFrame like the following. You'll need some Python knowledge and some regex for that goal.

In [47]:
df.head()

| | sender_email | sender_name | date_sent | time_sent | subject |
|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | 31 Oct 2002 | 02:38:20 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | 31 Oct 2002 | 05:10:00 | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:17:55 | GOOD DAY TO YOU |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:44:20 | GOOD DAY TO YOU |
| 4 | m_abacha03@www.com | "Maryam Abacha" | 1 Nov 2002 | 01:45:04 | I Need Your Assistance. |


---------------------------------------------------------------------------------------------------------------------------

#### Since we are good people, here is a proposed solution

In [69]:
emails_info={}

In [70]:
# read the whole file and close it automatically
with open("emails.txt", "r") as f:
    fh = f.read()

In [71]:
fh.count("From r")

3977

In [72]:
contents = re.split(r"From r", fh)

In [73]:
contents[0]

''

In [74]:
contents.pop(0)

''
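The empty first element comes from how `re.split` treats a separator at the very start of the text: whatever precedes it (here, nothing) becomes the first piece. A toy illustration with a made-up string:

```python
import re

toy = "From r one\nFrom r two\n"

parts = re.split(r"From r", toy)
print(parts)          # the first piece is the empty string before the first separator

# Dropping empty pieces is an alternative to pop(0)
non_empty = [p for p in parts if p]
print(non_empty)
```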

The fields to extract from each email are:

- sender_email
- sender_name
- date_sent
- time_sent
- subject


### Info Sender

In [75]:

info_sender=[]
for e in contents:
    try:
        info_sender.append(re.search("From:.*", e).group())
    except AttributeError:  # re.search returned None
        info_sender.append("not found")

In [78]:
len(info_sender)

3977

In [38]:
#sender_email
emails_info['sender_email']=[]
for line in info_sender:
    res=re.findall(r'[\w\.]+@[\w\.-]+', line)
    if res:
        emails_info['sender_email'].append(res[0])
    else:
        emails_info['sender_email'].append(np.nan)
        
len(emails_info['sender_email'])

3977

In [39]:
#sender name
emails_info['sender_name']=[]
for line in info_sender:
    res=re.findall(r':.*<', line)
    if res:
        emails_info['sender_name'].append(res[0][1:-1])
    else:
        emails_info['sender_name'].append(np.nan)
len(emails_info['sender_name'])

3977

### Info Dates

In [40]:
#DATES
dates=[]
for e in contents:
    try:
        dates.append(re.search("Date:.*", e).group())
    except AttributeError:  # re.search returned None
        dates.append("not found")
len(dates)

3977

In [41]:
#email date
emails_info['date_sent']=[]
for dat in dates:
    res=re.findall(r"\d+\s\w{3}\s\d+", dat)
    if res:
        emails_info['date_sent'].append(res[0])
    else:
        emails_info['date_sent'].append(np.nan)

len(emails_info['date_sent'])

3977

In [42]:
emails_info['time_sent']=[]
for dat in dates:
    res=re.findall(r"\d{2}:\d{2}:\d{2}", dat)
    if res:
        emails_info['time_sent'].append(res[0])
    else:
        emails_info['time_sent'].append(np.nan)

len(emails_info['time_sent'])

3977

### Subject

In [43]:
subject=[]
for e in contents:
    try:
        subject.append(re.search("Subject:.*", e).group())
    except AttributeError:  # re.search returned None
        subject.append("not found")
len(subject)

3977

In [44]:
emails_info['subject']=[]
for sub in subject:
    res=re.findall(r":.*", sub)
    if res:
        emails_info['subject'].append(res[0][2:])
    else:
        emails_info['subject'].append(np.nan)

len(emails_info['subject'])

3977

### Creating DataFrame

In [45]:
df=pd.DataFrame(emails_info)
df.isnull().sum()

sender_email    476
sender_name     837
date_sent       614
time_sent       618
subject          27
dtype: int64

In [46]:
df.head()

| | sender_email | sender_name | date_sent | time_sent | subject |
|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | 31 Oct 2002 | 02:38:20 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | 31 Oct 2002 | 05:10:00 | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:17:55 | GOOD DAY TO YOU |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:44:20 | GOOD DAY TO YOU |
| 4 | m_abacha03@www.com | "Maryam Abacha" | 1 Nov 2002 | 01:45:04 | I Need Your Assistance. |
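As an alternative sketch (not the solution used above), the per-field loops could be folded into one helper that applies a dictionary of compiled patterns to each email. The sample message below is hypothetical and only assumes headers shaped like `From:`, `Date:` and `Subject:` lines; `HEADER_PATTERNS` and `parse_email` are illustrative names.

```python
import re

# Hypothetical message; the real layout of emails.txt may differ
sample_email = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    "Date: Thu, 31 Oct 2002 02:38:20 +0000\n"
    "Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n"
    "\n"
    "Body text\n"
)

# One pattern per output column; group 1 holds the value to keep
HEADER_PATTERNS = {
    "sender_email": re.compile(r"^From:.*?([\w.]+@[\w.-]+)", re.M),
    "sender_name": re.compile(r"^From:\s*(.*?)\s*<", re.M),
    "date_sent": re.compile(r"^Date:.*?(\d+\s\w{3}\s\d+)", re.M),
    "time_sent": re.compile(r"^Date:.*?(\d{2}:\d{2}:\d{2})", re.M),
    "subject": re.compile(r"^Subject:\s*(.*)", re.M),
}

def parse_email(text):
    """Return a dict with one entry per field, or None where nothing matched."""
    row = {}
    for field, pattern in HEADER_PATTERNS.items():
        match = pattern.search(text)
        row[field] = match.group(1) if match else None
    return row

row = parse_email(sample_email)
print(row)
```

A list of such dictionaries, one per email, can be passed to `pd.DataFrame` to build the table in a single step.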


### Now you can start your analysis!