# Regular Expressions (RegEx)

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.

![regex](https://miro.medium.com/max/1200/1*ZVlIZ1ZYC6rASz-dYPzhZQ.jpeg)

**references**

- https://docs.python.org/3/howto/regex.html
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

**may save your life**

- https://regex101.com/

### First things first

The standard-library module is usually enough: **import re**. If you need the extended third-party module instead, **pip3/pip install regex** should install it.

In [1]:
import re
import numpy as np
import pandas as pd

In [3]:
#help(re)

## Syntax
### Special Characters:
- `.` Matches any character except a newline.
- `^` Matches the start of the string.
- `$` Matches the end of the string or just before the newline at the end of the string.
- `*` Matches 0 or more repetitions of the preceding RE.
- `+` Matches 1 or more repetitions of the preceding RE.
- `?` Matches 0 or 1 repetitions of the preceding RE.
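- `(?<=...)` Positive lookbehind: matches only if the text just before the current position matches `...` --> https://www.regular-expressions.info/lookaround.html
- NOTE: `re.M` enables multiline mode, where `^` and `$` also match at the start and end of each line.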
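
As a quick illustration of the last two items, here is a minimal sketch of multiline mode and a lookbehind. The sample strings are made up for this example.

```python
import re

text = "first line\nsecond line"

# Without re.M, ^ only matches at the very start of the string.
no_flag = re.findall(r"^\w+", text)
print(no_flag)      # ['first']

# With re.M, ^ also matches right after each newline.
multiline = re.findall(r"^\w+", text, re.M)
print(multiline)    # ['first', 'second']

# Lookbehind: match digits only when they are preceded by "$",
# without including the "$" in the match itself.
prices = re.findall(r"(?<=\$)\d+", "apples $3, pears 4, kiwis $12")
print(prices)       # ['3', '12']
```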

### Special Sequences:

- **Literals** `a` 
- **Alternation** `a|b`
- **Character sets** `[ab]`, `[^ab]` <- we use the hat between brackets to indicate that we want the opposite
- **Wildcards** `.`
- **Escape special characters** `\` (?,*,+,^,$)
- **Ranges** `[a-d]`, `[1-9]`, `[A-D]`

- **Quantifiers** `{2}`, `{2,}`, `{2,4}`, `?`, `*`, `+`
- **Grouping** `()`
- **Anchors** `^`, `$`
- **Character classes** `\w`, `\d`, `\s`, `\n`, `\W`, `\D`, `\S`

**\w** - Matches any word character: letters, digits, and the underscore `_`. Equivalent to `[a-zA-Z0-9_]` for ASCII text.

**\d** - Matches any digit. Equivalent to `[0-9]` 

**\s** - Matches where a string contains any whitespace character. Equivalent to `[ \t\n\r\f\v]`

**\W** - Matches any non-alphanumeric character. Equivalent to `[^a-zA-Z0-9_]`

**\D** - Matches any non digit. Equivalent to `[^0-9]` 

**\S** - Matches where a string contains any non-whitespace character. Equivalent to `[^ \t\n\r\f\v]`
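
The character classes and quantifiers above can be combined as in the following small sketch. The `sample` string is invented for illustration.

```python
import re

sample = "Order 66 shipped to Ana_99 on 2021-03-07"

# \d+ : one or more digits in a row
digits = re.findall(r"\d+", sample)
print(digits)   # ['66', '99', '2021', '03', '07']

# \w+ : one or more word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample)
print(words)    # ['Order', '66', 'shipped', 'to', 'Ana_99', 'on', '2021', '03', '07']

# {n} : exactly n repetitions, here a date written as 4-2-2 digits
dates = re.findall(r"\d{4}-\d{2}-\d{2}", sample)
print(dates)    # ['2021-03-07']
```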



### Methods

### re.sub(pattern, repl, string, count=0)
Replaces one or many matches with a string

In [4]:
txt = "gabriel, DDio & CClara are TA's??"

In [5]:
a = '2'

In [6]:
help(re.sub)

Help on function sub in module re:

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the Match object and must return
    a replacement string to be used.
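
The help text mentions that `repl` can be a callable and that `count` limits the number of replacements. A minimal sketch of both follows; the helper name `lower_match` is made up for this example.

```python
import re

txt = "gabriel, DDio & CClara are TA's??"

def lower_match(match):
    # `match` is the Match object for one occurrence;
    # whatever this function returns is used as the replacement.
    return match.group().lower()

# Every run of uppercase letters is replaced by its lowercase version.
lowered = re.sub(r"[A-Z]+", lower_match, txt)
print(lowered)      # gabriel, ddio & cclara are ta's??

# count=1 replaces only the leftmost match.
first_only = re.sub(r"[A-Z]", "2", txt, count=1)
print(first_only)   # gabriel, 2Dio & CClara are TA's??
```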



In [14]:
re.sub('[A-Z]', a, txt)

"gabriel, 22io & 22lara are 22's??"

In [17]:
re.sub('[A-Z]{2}', a, txt)

"gabriel, 2io & 2lara are 2's??"

In [12]:
#re.sub
#Literals
#{2}: two consecutive matches of [A-Z] are replaced by a
txt = "gabriel, DDDio & CClara are TA's??"
txt = re.sub('[A-Z]{2}', a, txt)
txt

"gabriel, 2Dio & 2lara are 2's??"

In [18]:
#Ranges
re.sub('[A-Z]', '', txt)

"gabriel, io & lara are 's??"

In [19]:
#Escape special character, quantifiers
re.sub(r'\?{2}','.',txt)

"gabriel, DDio & CClara are TA's."

### re.search(pattern, string, flags=0)
Scans through a string, looking for the first location where this RE matches. If the search is successful, `re.search()` returns a match object. Otherwise, it returns `None`.

In [26]:
#re.search
txt = "The rain in Spain and the sun in England, Tower"
x = re.search("^The.*Sp[aA]in", txt) 
x.span()
#txt[0:17]

(0, 17)

In [43]:
txt = "The rain on Spain i slightly different"
# \b anchors the match at a word boundary; [Ss] matches an upper- or lowercase "s"
# \w+ then consumes the rest of the word
x = re.search(r"\b[Ss]\w+", txt)
print(x)
print(x.span())
#returns a tuple containing the start-, and end positions of the match
print(x.start())
#contains the start position of the match
print(x.end())
#contains the end position of the match
print(x.string)
#print the string passed into the function (variable 'txt')
print(x.group())
#Print the part of the string where there was a match

<re.Match object; span=(12, 17), match='Spain'>
(12, 17)
12
17
The rain on Spain i slightly different
Spain


In [35]:
for i in range(x.span()[0], x.span()[1]):
    print(x.string[i])

S
p
a
i
n


In [53]:
txt1 = "The raun on Spaun i sli di"
print(re.search(r'[Ss]\w+\s\w+', txt1).group())
#print(re.search(r'i\w+', txt1).group())
print(re.search(r'i\w*', txt1).group())
print(re.search(r'^T\w*', txt1).group())

Spaun i
i
The
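
Because `re.search()` returns `None` when nothing matches, calling `.group()` directly on the result can raise an `AttributeError`. One possible guard is sketched below; `first_match` is a hypothetical helper, not part of the `re` module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    match = re.search(pattern, text)
    return match.group() if match else None

txt = "The rain in Spain"

found = first_match(r"S\w+", txt)
print(found)     # Spain

missing = first_match(r"Portugal", txt)
print(missing)   # None
```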


### re.match(pattern, string)
Determine if the RE matches at the beginning of the string.

In [76]:
#re.match
pattern = r"Co*kie"
sequence = "I want a Cookie"
sequence2= "Cookie, I want you!"
if re.match(pattern, sequence2):
    print("Match!")
else: 
    print("Not a match!")

Match!


In [78]:
txt = "The rain in Spain"
#matches at the beginning of the string
print(re.search(r'r\w*', txt))
print(re.match(r'^r\w*', txt))

print(re.search(r'^T\w*', txt).group()) # search with ^ is the same as match

print(re.match(r'T\w*', txt).group())

<re.Match object; span=(4, 8), match='rain'>
None
The
The


In [104]:
email_address = 'Please contact us at: support1@thebridge.gov'
matchi = re.search(r'(\w+)@([\w\.]+)', email_address)
if matchi:
    print(matchi.group(0)) # The whole matched text
    print(matchi.group(1)) # The username (group 1)
    print(matchi.group(2)) # The host (group 2)

support1@thebridge.gov
support1
thebridge.gov


In [105]:
# Email pattern for .com and .es domains
matchi = re.search(r'(\w+)@(\w+)\.(com|es)', email_address)
if matchi:
    print(matchi.span())
    print(matchi.group(0)) # The whole matched text
    print(matchi.group(1)) # The username (group 1)
    print(matchi.group(2)) # The host (group 2)  
    print(matchi.group(3)) # The top-level domain (group 3)
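
Note that a `.gov` address does not satisfy a pattern restricted to `.com` and `.es`, so the search returns `None` and the `if` body is skipped. A small sketch of both outcomes follows; the `.com` address is a made-up variant used only for comparison.

```python
import re

pattern = re.compile(r"(\w+)@(\w+)\.(com|es)")

gov = pattern.search("Please contact us at: support1@thebridge.gov")
com = pattern.search("Please contact us at: support1@thebridge.com")

print(gov)           # None: ".gov" is neither ".com" nor ".es"
print(com.groups())  # ('support1', 'thebridge', 'com')
```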

### re.fullmatch(pattern, string)
Returns a match object only if the whole string matches the RE. Otherwise, it returns `None`.

In [107]:
class_names = ["Andrea", "Ariadna", "Estela", "Anais", "Xeles", "Maria", "Mar"]
for name in class_names:
    if re.fullmatch("Maria", name):
        print(f"{name} is desired name")
    else:
        print(f"{name} is not desired name")

Andrea is not desired name
Ariadna is not desired name
Estela is not desired name
Anais is not desired name
Xeles is not desired name
Maria is desired name
Mar is not desired name
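
To see how `re.search`, `re.match` and `re.fullmatch` differ, the sketch below runs the same pattern against two made-up names.

```python
import re

pattern = "Maria"
results = {}
for name in ["Mariana", "Ana Maria"]:
    results[name] = (
        bool(re.search(pattern, name)),     # found anywhere in the string?
        bool(re.match(pattern, name)),      # found at the beginning?
        bool(re.fullmatch(pattern, name)),  # does the whole string match?
    )

print(results["Mariana"])    # (True, True, False)
print(results["Ana Maria"])  # (True, False, False)
```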


### re.findall (pattern, string)
Finds all substrings where the RE matches and returns them as a list.

In [121]:
#re.findall
email_address = """
    Please contact us at canada: 
    support.data@data-science.com, xyz@thebridge.com"""

#'addresses' is a list that stores all the matches
addresses = re.findall(r'\w+\.?\w+@[\w\.-]+', email_address)
addresses

['support.data@data-science.com', 'xyz@thebridge.com']

In [123]:
print(re.findall(r'[^aeiou\s]',email_address))
print(re.findall(r'\sc\w*',email_address))
print(re.findall(r'^\sP\w*',email_address))

['P', 'l', 's', 'c', 'n', 't', 'c', 't', 's', 't', 'c', 'n', 'd', ':', 's', 'p', 'p', 'r', 't', '.', 'd', 't', '@', 'd', 't', '-', 's', 'c', 'n', 'c', '.', 'c', 'm', ',', 'x', 'y', 'z', '@', 't', 'h', 'b', 'r', 'd', 'g', '.', 'c', 'm']
[' contact', ' canada']
['\nPlease']


In [124]:
with open('info.txt', 'r') as file:
    client_info = file.read()

In [127]:
client_info[:130]

'Wyoming Dudley sit.amet.metus@egestasnunc.ca 2518 Nulla Road 34-739-941-941 Aristotle Grant arcu.Aliquam.ultrices@vestibulumneceui'

In [132]:
emails_clients = re.findall(r"[\w.]+@[\w.]+", client_info)
print(emails_clients[:10])

['sit.amet.metus@egestasnunc.ca', 'arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk', 'Nulla.eu.neque@Fuscealiquetmagna.com', 'Nullam.velit@non.ca', 'In@gravidamolestiearcu.co.uk', 'non.bibendum@ipsumdolorsit.edu', 'rutrum.Fusce.dolor@purusNullamscelerisque.ca', 'quam.quis@ac.net', 'Nulla.eu.neque@idmollis.com', 'lacus.Cras@quisaccumsan.net']


In [133]:
client_numbers_normal = re.findall(r"[0-9]{2}-\d{3}-\d{3}-\d{3}", client_info)

In [134]:
print(client_numbers_normal[:5])

['34-739-941-941', '34-278-870-242', '34-999-876-292', '34-345-887-949', '34-905-089-682']


In [145]:
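p = "0034 666666666"
p = "+34 999999999"

# The prefix is either "00" or "+": alternation needs a group, since [...] only matches single characters
x = re.search(r"(?:00|\+)34 [6-9]\d{8}", p)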
x.group()

'+34 999999999'
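
Square brackets define a set of single characters, so a class such as `[(00)|\+]` matches just one of `(`, `0`, `)`, `|` or `+`. To choose between the prefixes `00` and `+`, a group with alternation is needed. The sketch below compares both patterns on the `0034` number; the variable names are chosen for this example.

```python
import re

p = "0034 666666666"

# Character class: only ONE character from the set is consumed before "34".
with_class = re.search(r"[(00)|\+]34 [6-9]\d{8}", p).group()
print(with_class)   # 034 666666666  (the first "0" is left out)

# Non-capturing group with alternation: "00" or "+" as a whole prefix.
with_group = re.search(r"(?:00|\+)34 [6-9]\d{8}", p).group()
print(with_group)   # 0034 666666666
```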

In [172]:
client_numbers = re.findall(r"(([0-9]{2}-)\d{3}-\d{3}-\d{3})", client_info)

In [177]:
print(client_numbers[:5])

[('34-739-941-941', '34-'), ('34-278-870-242', '34-'), ('34-999-876-292', '34-'), ('34-345-887-949', '34-'), ('34-905-089-682', '34-')]
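
When the pattern contains capturing groups, `re.findall` returns one tuple per match with the content of each group. If only the full match is wanted, a non-capturing group `(?:...)` is one way to get plain strings back. The sketch below uses a short excerpt as a stand-in for `client_info`.

```python
import re

# Short stand-in for the contents of info.txt
client_info = "Nulla Road 34-739-941-941 Aristotle Grant 8453 Nostra, St. 34-278-870-242"

# Two capturing groups -> a list of 2-tuples
with_groups = re.findall(r"(([0-9]{2}-)\d{3}-\d{3}-\d{3})", client_info)
print(with_groups)  # [('34-739-941-941', '34-'), ('34-278-870-242', '34-')]

# Non-capturing group -> a list of strings
no_groups = re.findall(r"(?:[0-9]{2}-)\d{3}-\d{3}-\d{3}", client_info)
print(no_groups)    # ['34-739-941-941', '34-278-870-242']
```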


### re.split(pattern, string, maxsplit=0)
Returns a list where the string has been split at each match

In [147]:
#re.split
sente = "Hello,\n Please, contact me the sooner.\n Thank you,\n Me"
sente.split("\n")

['Hello,', ' Please, contact me the sooner.', ' Thank you,', ' Me']

In [148]:
reg = re.split(r",\s+", sente)
reg

['Hello', 'Please', 'contact me the sooner.\n Thank you', 'Me']

In [149]:
"".join(reg)

'HelloPleasecontact me the sooner.\n Thank youMe'

https://stackoverflow.com/questions/10804732/what-is-the-difference-between-and-in-regex

In [190]:
client_list = re.split(r"(?<=[0-9]{2}-\d{3}-\d{3}-\d{3})",client_info)
print(client_list[:3])

['Wyoming Dudley sit.amet.metus@egestasnunc.ca 2518 Nulla Road 34-739-941-941', ' Aristotle Grant arcu.Aliquam.ultrices@vestibulumneceuismod.co.uk 8453 Nostra, St. 34-278-870-242', ' Zephania Copeland Nulla.eu.neque@Fuscealiquetmagna.com P.O. Box 733, 3179 Ligula. Av. 34-999-876-292']
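
A lookbehind matches a position rather than characters, so splitting on it keeps the phone number inside each record, whereas splitting on the number itself removes it. The sketch below assumes Python 3.7 or later (where `re.split` accepts empty matches) and uses made-up records.

```python
import re

records = "Ana 34-111-222-333 Luis 34-444-555-666"
phone = r"\d{2}-\d{3}-\d{3}-\d{3}"

# Splitting ON the number consumes it (empty pieces are filtered out).
consumed = [part for part in re.split(phone, records) if part]
print(consumed)  # ['Ana ', ' Luis ']

# Splitting at the position right AFTER the number keeps it in the record.
kept = [part for part in re.split(rf"(?<={phone})", records) if part]
print(kept)      # ['Ana 34-111-222-333', ' Luis 34-444-555-666']
```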


### re.compile(pattern, flags=0)
Compiles a RE into a regular expression object.

In [193]:
name_check = re.compile(r"[^A-Za-z ]")

In [194]:
name = input("Please insert your name:")
while name_check.search(name):
    # keep asking while the name contains a character that is not a letter or a space
    print("Please enter your name correctly!")
    name = input("Please insert your name:")
print("Finally mate, I thought you'd never do it")

Please insert your name:clara+piniella
Please enter your name correctly!
Please insert your name:Clara.piniella
Please enter your name correctly!
Please insert your name:7835djfa
Please enter your name correctly!
Please insert your name:898593*+
Please enter your name correctly!
Please insert your name:Clara Piniella
Finally mate, I thought you'd never do it
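
A compiled pattern can be reused without passing the pattern string again. The sketch below checks a few candidate names non-interactively with the same `name_check` pattern; the `candidates` list is made up for this example.

```python
import re

# Matches any character that is NOT a letter or a space.
name_check = re.compile(r"[^A-Za-z ]")

candidates = ["Clara Piniella", "clara+piniella", "Clara.piniella", "7835djfa"]

# A name is valid when the pattern finds no disallowed character.
is_valid = {name: name_check.search(name) is None for name in candidates}
for name, ok in is_valid.items():
    print(f"{name!r}: {'valid' if ok else 'invalid'}")
```

Only `'Clara Piniella'` is reported as valid, since the other names contain `+`, `.` or digits.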


-----------------------------------------------------------------------------------------------------------


## Some practice 
Now it's your turn.

#### Simple validation of a username
https://www.codewars.com/kata/56a3f08aa9a6cc9b75000023
    
Write a simple regex to validate a username. Allowed characters are:

- lowercase letters,
- numbers,
- underscore

Length should be between 4 and 16 characters (both included).


Enter a username: cla_piniella
Finally mate, I thought you'd never do it


True

In [25]:
# your solution

[True, False, False, True, True]

#### Regex validate PIN code 

https://www.codewars.com/kata/55f8a9c06c018a0d6e000132

ATM machines allow 4 or 6 digit PIN codes and PIN codes cannot contain anything but exactly 4 digits or exactly 6 digits.

If the function is passed a valid PIN string, return true, else return false.

Examples:
```python
"1234"   -->  True
"12345"  -->  False
"a234"   -->  False
```

Please insert your pin:893758954
Come on, blockhead, let's see if you get the regex right
Please insert your pin:17718
Finally mate, I thought you'd never do it


Insert a number: 98725973289075
'Incorrect PIN'
Give me a PIN: clara
'Incorrect PIN'
Give me a PIN: 83789
'Incorrect PIN'
Give me a PIN: 3546
Correct PIN


[True, True, False, False, True]

['8475', 8735, 7847]

-----------------------------------------------------------------------------------------------------------

# The FBI challenge

- https://www.fbi.gov/scams-and-safety/common-fraud-schemes/nigerian-letter-or-419-fraud
- https://www.kaggle.com/rtatman/fraudulent-email-corpus

It's your first day at the FBI office and your boss has just sent you a `txt` file, `emails.txt`. She asked you to do some analysis, but first of all you need to build a DataFrame like the following. You'll need some Python knowledge and some regex to get there.

| | sender_email | sender_name | date_sent | time_sent | subject |
|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | 31 Oct 2002 | 02:38:20 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | 31 Oct 2002 | 05:10:00 | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:17:55 | GOOD DAY TO YOU |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:44:20 | GOOD DAY TO YOU |
| 4 | m_abacha03@www.com | "Maryam Abacha" | 1 Nov 2002 | 01:45:04 | I Need Your Assistance. |


---------------------------------------------------------------------------------------------------------------------------

#### Since we are good people, here is a proposed solution

3977

''

''

- sender_email
- sender_name
- date_sent
- time_sent
- subject
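
One possible way to extract these five fields is sketched below. It is not the original solution: the header layout (`From r` as message separator, followed by `From:`, `Date:` and `Subject:` lines) is an assumption about `emails.txt`, and the `sample` string is a hand-written stand-in for the real file.

```python
import re

# Hand-written stand-in for emails.txt (assumed layout, not the real file)
sample = '''From r  Wed Oct 30 21:41:56 2002
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP

Dear friend ...
From r  Thu Oct 31 08:11:39 2002
From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
Date: Thu, 31 Oct 2002 05:10:00 -0000
Subject: URGENT ASSISTANCE /RELATIONSHIP (P)

Dear sir ...
'''

# Assumption: every message starts with a line beginning with "From r".
messages = re.split(r"^From r", sample, flags=re.M)[1:]

rows = []
for msg in messages:
    # Name and address, e.g.  From: "NAME" <user@host>
    sender = re.search(r"^From: (.*?)\s*<([\w.+-]+@[\w.-]+)>", msg, re.M)
    # Date and time, e.g.  Date: Thu, 31 Oct 2002 02:38:20 +0000
    date = re.search(r"^Date: .*?(\d{1,2} \w{3} \d{4}) (\d{2}:\d{2}:\d{2})", msg, re.M)
    subject = re.search(r"^Subject: (.*)", msg, re.M)
    rows.append({
        "sender_email": sender.group(2) if sender else None,
        "sender_name": sender.group(1) if sender else None,
        "date_sent": date.group(1) if date else None,
        "time_sent": date.group(2) if date else None,
        "subject": subject.group(1) if subject else None,
    })

for row in rows:
    print(row)
```

With pandas available, `pd.DataFrame(rows)` would turn this list of dictionaries into a table with the five columns. Storing `None` for missing fields keeps every column the same length, so incomplete headers show up as missing values instead of breaking the table.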


### Info Sender

3977

3977

3977

### Info Dates

3977

3977

3977

### Subject

3977

3977

### Creating DataFrame

sender_email    476
sender_name     837
date_sent       614
time_sent       618
subject          27
dtype: int64

| | sender_email | sender_name | date_sent | time_sent | subject |
|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | "MR. JAMES NGOLA." | 31 Oct 2002 | 02:38:20 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP |
| 1 | bensul2004nng@spinfinder.com | "Mr. Ben Suleman" | 31 Oct 2002 | 05:10:00 | URGENT ASSISTANCE /RELATIONSHIP (P) |
| 2 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:17:55 | GOOD DAY TO YOU |
| 3 | obong_715@epatra.com | "PRINCE OBONG ELEME" | 31 Oct 2002 | 22:44:20 | GOOD DAY TO YOU |
| 4 | m_abacha03@www.com | "Maryam Abacha" | 1 Nov 2002 | 01:45:04 | I Need Your Assistance. |


### Now you can start your analysis!