# Regular expressions (regex): love or hate?

![commit strip](http://www.commitstrip.com/wp-content/uploads/2014/02/Strips-Le-dernier-des-vrais-codeurs-650-finalenglsih.jpg)

Regular expressions are available in almost every programming language. They are a very powerful tool for checking whether the content of a variable has the shape you expect. 

For example, if you retrieve a phone number, you expect the variable to be composed of numbers and spaces (or dashes) but nothing more. 

Regular expressions can not only warn you about an unwanted character, they can also delete or modify every character that is not desirable.
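
As a minimal sketch of the "delete/modify" use case, the snippet below relies on `re.sub`, which replaces every occurrence of a pattern. The sample phone numbers and the variable names are made up for illustration.

```python
import re

# Hypothetical input: a phone number typed with stray characters.
raw_phone = "+32 (0)471-12 34 56 ext"

# Delete: replace every non-digit character (\D) with nothing.
digits_only = re.sub(r"\D", "", raw_phone)
print(digits_only)

# Modify: turn each run of spaces or dashes into a single space.
normalised = re.sub(r"[ -]+", " ", "0471-12--34  56")
print(normalised)
```

The first call keeps only the digits, the second one keeps the digits and normalises the separators.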


**There are two ways to use regular expressions:**
* The first is to call a function of the `re` module with the pattern as the first parameter and the string to be analyzed as the second parameter.
* The second is to compile the regex, and then use the methods of the resulting object to analyze a string passed as an argument. This approach speeds up processing when a regex is used several times.  

In [2]:
import re

In [3]:
pattern = "[ ]"
string = "I am fine ! There are still 6 months left :()"

# Searches for the pattern in the string and returns a `re.Match` object if a match is found,
# otherwise returns `None`.
print(re.search(pattern, string))

<re.Match object; span=(1, 2), match=' '>


In [4]:
pattern = "[ ]"
string = "I am fine ! There are still 6 months left :()"

# Cuts the string according to the occurrence of the pattern.
print(re.split(pattern, string))

['I', 'am', 'fine', '!', 'There', 'are', 'still', '6', 'months', 'left', ':()']


### A little syntax

    [xy]  A set of possible characters. Example: [abc] matches a, b or c.

    (x|y) A choice between alternatives. Example: (ps|ump) matches "ps" OR "ump".

    \d    A digit, which is equivalent to [0-9].

    \D    Anything that is not a digit, which is equivalent to [^0-9].

    \s    A whitespace character, which is equivalent to [ \t\n\r\f\v].

    \S    Anything that is not a whitespace character, which is equivalent to [^ \t\n\r\f\v].

    \w    A word character (letter, digit or underscore), which is equivalent to [a-zA-Z0-9_].

    \W    Anything that is not a word character, which is equivalent to [^a-zA-Z0-9_].

    \     The escape character.
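
The shorthand classes above can be tried out on a small sample. This is only a sketch: the sample string is invented, and the equivalences listed above hold for plain ASCII text (on Python 3 strings, `\d`, `\s` and `\w` also accept non-ASCII digits, spaces and letters unless the `re.ASCII` flag is used).

```python
import re

sample = "Room 42,\tfloor_3!"

digits = re.findall(r"\d", sample)      # every digit
spaces = re.findall(r"\s", sample)      # every whitespace character
words = re.findall(r"\w+", sample)      # runs of letters, digits or underscores
non_words = re.findall(r"\W", sample)   # everything that is not a word character

print(digits)
print(spaces)
print(words)
print(non_words)
```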

Let's try it.

If the result is not `None`, the pattern matched. GREY does indeed start with GR, followed by an optional character and then a Y.

In [5]:
print(re.match("GR(.)?Y", "GREY"))
# (.)? means that we expect 0 or 1 character:
# `.` stands for any character, and the `?` that follows makes it optional.

<re.Match object; span=(0, 4), match='GREY'>


In [6]:
pattern = "GR(.)?Y"
string = "GREY"

result = re.match(pattern, string)
print(result)

# It is equivalent to
compiled = re.compile(pattern)
result = compiled.match(string)
print(result)

<re.Match object; span=(0, 4), match='GREY'>
<re.Match object; span=(0, 4), match='GREY'>


In [7]:
# In a loop, the second syntax is nicer: the compiled object is built once and reused
pattern = "GR(.)?Y"
compiled = re.compile(pattern)
l = ["GREY 'S", "GRAY", "GREYISH", "A GREY"]

for elem in l:
    result = compiled.match(elem)
    print(elem, result)

GREY 'S <re.Match object; span=(0, 4), match='GREY'>
GRAY <re.Match object; span=(0, 4), match='GRAY'>
GREYISH <re.Match object; span=(0, 4), match='GREY'>
A GREY None
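
The last line is `None` because `match()` only looks at the beginning of the string. As a short sketch using the same pattern, `search()` and `fullmatch()` behave differently:

```python
import re

compiled = re.compile(r"GR(.)?Y")

# match() only succeeds if the pattern is found at the very start.
at_start = compiled.match("A GREY")
# search() scans the whole string for the first occurrence.
anywhere = compiled.search("A GREY")
# fullmatch() requires the entire string to fit the pattern.
whole = compiled.fullmatch("GREYISH")

print(at_start)
print(anywhere)
print(whole)
```

Here only `search()` returns a match object, covering the characters at positions 2 to 5.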


In the following, we search for specific expressions in a string.

In [8]:
print(re.findall("GR(.)?Y", "GREY"))
# Here findall returns what the group (.)? captured between GR and Y

['E']


In [9]:
# The same with two groups: each match is returned as a tuple of captures
re.findall("G(.)?(.)?Y", "GREY")

[('R', 'E')]
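
What `findall` returns depends on the capturing groups in the pattern. The sketch below (the sample text is made up) compares a pattern without a group, with one capturing group, and with a non-capturing group `(?:...)`:

```python
import re

text = "GREY and GRAY"

# No group: findall returns the full matches.
full_matches = re.findall(r"GR.Y", text)
# One capturing group: findall returns only what the group captured.
captured = re.findall(r"GR(.)Y", text)
# Non-capturing group: groups the alternatives without changing the result.
non_capturing = re.findall(r"GR(?:E|A)Y", text)

print(full_matches)
print(captured)
print(non_capturing)
```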

To keep only numbers. 

In [10]:
# Only numbers
print(re.findall("([0-9]+)", "Hello I live on the 7th floor of 220 street of sims"))
# "+" means one or more occurrences of the preceding element

['7', '220']


And conversely, if you only want to keep the words. 

In [11]:
# Only words ([A-Za-z], because the range [A-z] would also accept [ \ ] ^ _ and `)
print(re.findall("([A-Za-z]+)", "Hello I live on the 7th floor of 220 street of sims"))

['Hello', 'I', 'live', 'on', 'the', 'th', 'floor', 'of', 'street', 'of', 'sims']


### Stop, let's recap!

Character | Meaning   
:-------------------------:|:-------------------------:
**.** | **Matches any character (except a line break by default).**
**^** | **Indicates that the beginning of the string must match <br/> (i.e. a string can only match if it starts in the same way; <br /> it does not match if it is preceded by spaces or a line break).**
**$** | **Indicates that the end of the string must match <br /> (the same remark as above applies, but at the end of the string).**
**{n}**|**The previous character must be repeated exactly n times.**
**{n,m}**|**The previous character must be repeated between n and m times (written without a space after the comma).**
**\***| **The previous character can be repeated zero or more times. <br />For example, `ab*` matches a, ab, or a followed by any number of b.**
**+**|**The previous character can be repeated one or more times. <br/>For example, `ab+` matches an a followed by at least one b.**
**?**|**The previous character can appear zero times or once.<br /> For example, `ab?` matches ab and a.**
**\w** | **Matches any word character (letter, digit or underscore); it is equivalent to [a-zA-Z0-9_].**
**\W** | **Matches everything that is not a word character.**
**\d** | **Matches any digit, i.e. it is equivalent to [0-9].**
**\D** | **Matches everything that is not a digit.**
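
To see the quantifiers from the table at work, here is a small sketch that checks which candidate strings are matched entirely by each pattern. The candidate list and the `results` dictionary are only illustrative names.

```python
import re

candidates = ["a", "ab", "abb", "abbb"]
patterns = [r"ab*", r"ab+", r"ab?", r"ab{2}", r"ab{1,2}"]

# For each pattern, keep the candidates that match from start to end.
results = {}
for pattern in patterns:
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])
```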

![alt text](http://www.codercaste.com/wp-content/uploads/2013/01/regex.gif)

### Some useful resources
http://www.rexegg.com/regex-quickstart.html  
http://www.dreambank.net/regex.html#examples  
https://pythex.org/ *(Pythex is a real-time regular expression editor for Python, a quick way to test your regular expressions.)*   
https://regex101.com/   
*(Regex101 is an online regex editor and debugger. It lets you create, debug and test your expressions, and have them explained, for PHP, PCRE, Python, Golang and JavaScript. The website also features a community where you can share useful expressions.)*

#### How to check that the entered string is a number?

In [14]:
number = input("Your number : ")
if re.match("^[0-9]+$", number):
    print("The string entered is a number.")
else:
    print("The string entered is NOT a number.")

The string entered is NOT a number.


Another way, with a compiled pattern:

In [15]:
compiled = re.compile("^[0-9]+$")
if compiled.search(number) is not None:
    print("The string entered is a number.")
else:
    print("The string entered is NOT a number")

The string entered is NOT a number
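
A third variant, sketched here with a hypothetical helper named `is_number`, uses `re.fullmatch`, which makes the `^` and `$` anchors unnecessary. It is also slightly stricter: `$` tolerates one trailing line break, whereas `fullmatch` does not.

```python
import re

def is_number(text):
    """Return True when `text` consists of the digits 0-9 only."""
    return re.fullmatch(r"[0-9]+", text) is not None

print(is_number("2024"))
print(is_number("20 24"))
# With a trailing line break the two approaches disagree:
print(is_number("2024\n"), re.match("^[0-9]+$", "2024\n") is not None)
```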


## Drill 


**1. Create a regex that finds integers without size limit.**

In [16]:
s = "sssgdds8sfsfs"
first_pattern = re.compile(r"[0-9]+")
result = first_pattern.findall(s)
print(result)


['8']


**2. Create a regex that finds negative integers without size limit.**

In [17]:
s = "sssgdds-8sfsfs"
second_pattern = re.compile(r"-[0-9]+")
result = second_pattern.findall(s)
print(result)


['-8']


**3. Create a regex that finds (positive or negative) integers without size limit.**

In [18]:
s = "sssgdds-8s8fsfs"
third_pattern = re.compile(r"-?[0-9]+")
result = third_pattern.findall(s)
print(result)


['-8', '8']


**4. Capture all the numbers of the following sentence :**

In [57]:
text = "21 scouts and 3 tanks fought against 4,003 protestors, so the manager was not 100.00% happy."

# Digits, optionally followed by more groups of digits separated by "," or "."
fourth_pattern = re.compile(r"[0-9]+(?:[.,][0-9]+)*")
result = fourth_pattern.findall(text)
print(result)

['21', '3', '4,003', '100.00']


**5. Find all words that end with 'ly'.**

In [61]:
text = "He had prudently disguised himself but was quickly captured by the police."
result = re.findall(r"\b\w+ly\b", text)
print(result)

['prudently', 'quickly']


**6. License plate number**  
A license plate consists of 2 capital letters, a dash ('-'), 3 digits, a dash ('-') and finally 2 capital letters. Write a script to check that an input string is a license plate.  
If it's correct, print `"good"`. If it's not correct, print `"Not good"`.

In [4]:
plate = input("Enter your license plate number: ")
new_pattern = re.compile(r"[A-Z]{2}-[0-9]{3}-[A-Z]{2}")
# fullmatch() also rejects plates with extra characters after the last two letters
is_good_license = new_pattern.fullmatch(plate)
if is_good_license:
    print("good")
else:
    print("Not good")


Not good


**7. IPv4 address**  
An IPv4 address is composed of 4 numbers between 0 and 255 separated by '.'   
Write a script to verify that a string entered is that of an IPv4 address.

In [10]:
ip = input("Enter your IP address :")
# One number between 0 and 255: 250-255, 200-249, or 0-199 (leading zeros tolerated)
octet = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
# Four such numbers separated by dots, and nothing else
seventh_pattern = re.compile(octet + r"(?:\." + octet + r"){3}")
Valid_IPV4 = seventh_pattern.fullmatch(ip)

print(ip)
if Valid_IPV4:
    print("it's a good IPV4")
else:
    print("Try again")

50.190.2.0
it's a good IPV4
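
Outside of a regex drill, the same check can be delegated to Python's standard `ipaddress` module. This is a different technique from the regex above, shown only for comparison; the helper name `is_valid_ipv4` is made up.

```python
import ipaddress

def is_valid_ipv4(text):
    """Return True when `text` parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        # Raised for out-of-range numbers, missing parts, stray characters...
        return False
    return True

print(is_valid_ipv4("50.190.2.0"))
print(is_valid_ipv4("256.1.1.1"))
print(is_valid_ipv4("1.2.3"))
```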


**8. Valid Mail**  
An email is composed of alphanumeric characters followed by `@` and a domain name.  
Write a script that checks that the string entered by a user is indeed an email address; otherwise, ask them to enter it again (until they provide a valid email).

In [22]:

# Letters, digits or dots, then "@", a domain name, a literal dot and a known extension
email_validity = re.compile(r"[A-Za-z0-9.]+@[A-Za-z0-9.]+\.(com|org|net|be)")
while True:
    mail = input("Enter your email :")
    if email_validity.fullmatch(mail):
        print("it's a good e-mail")
        print(mail)
        break
    else:
        print("Invalid Email, Try again")
    



it's a good e-mail
rana@rana.be


**9. Valid Password**  
Write an additional script that verifies the password (obviously if the email is valid) where the only specificity of the password is that it has to contain at least 6 characters.

In [25]:
password = input("Enter your password :")
# Any 6 or more characters; fullmatch() checks the whole string
password_validity = re.compile(r".{6,}")
if password_validity.fullmatch(password):
    print("It's Done!")
else:
    print("Try again")
print(password)

It's Done!
uRd523


**10. Valid Password bis**  
The password must now contain at least 6 characters AND  

- at least one lowercase letter AND 
- at least one uppercase letter AND 
- at least one number AND 
- at least one special character (among `$#@`).

In [34]:
password = input("Enter your password :")
# Each (?=.*X) lookahead requires X somewhere in the string; special characters are limited to $#@
password_validity = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[$#@])[A-Za-z0-9$#@]{6,}$")
if re.fullmatch(password_validity,password):
    print("It's Done!")
else:
    print("It's not valid, Try again")

It's Done!
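
Each rule of this exercise can also be checked with its own small regex, which makes it possible to tell the user which rule failed. The sketch below is one possible way to do it: the `RULES` list and the `failed_rules` helper are hypothetical names, and unlike a single anchored regex this version does not restrict which other characters are allowed.

```python
import re

# Each rule is a (description, pattern) pair.
RULES = [
    ("at least 6 characters", r".{6,}"),
    ("a lowercase letter", r"[a-z]"),
    ("an uppercase letter", r"[A-Z]"),
    ("a digit", r"\d"),
    ("a special character among $#@", r"[$#@]"),
]

def failed_rules(password):
    """Return the descriptions of the rules that `password` does not satisfy."""
    return [name for name, pattern in RULES if not re.search(pattern, password)]

print(failed_rules("uRd523"))
print(failed_rules("uRd#523"))
```

An empty list means that the password satisfies every rule.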


**11. Search by groups**  
It is possible to search by groups, and it is very powerful!  
`(?P<x>\w+)` captures a "group" named `x`; this group is composed of at least one (`+`) alphanumeric character (`\w`).

In [35]:
m = re.search(
    # Raw string so that \w and \d reach the regex engine untouched; \? is a literal "?"
    r"Welcome to (?P<where>\w+) ! You are (?P<age>\d+) years old \?",
    "Welcome to Olivier ! You are 32 years old ?",
)
print(m.group("where"))
print(m.group("age"))

Olivier
32


In [36]:
# Another Example
m = re.search(
    r"^(?P<who>\w*)[.]?(?P<who2>\w*)@(?P<operator>\w+)[.](?P<zone>\w+$)",
    "audrey.boulevart@benextcomapgny.com",
)
if m is not None:
    print(m.group("who"))
    print(m.group("who2"))
    print(m.group("operator"))
    print(m.group("zone"))

audrey
boulevart
benextcomapgny
com
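
When a pattern has several named groups, `groupdict()` returns all of them at once as a dictionary. A short sketch with the same address (the variable name `parts` is only illustrative):

```python
import re

pattern = re.compile(
    r"^(?P<who>\w*)[.]?(?P<who2>\w*)@(?P<operator>\w+)[.](?P<zone>\w+)$"
)
m = pattern.search("audrey.boulevart@benextcomapgny.com")

parts = {}
if m is not None:
    # One dictionary entry per named group.
    parts = m.groupdict()
print(parts)
```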


Load the file `./data/mail.txt` and clean it with a regex. The goal is to retrieve the first name, last name, operator and zone, as in the previous example. Store each of those in its own separate list.

In [46]:
import re

# loading and opening the mails' file:

file = open("./data/mail.txt","r") 
mailList = file.readlines()

#print(len(mailList))
#print(mailList[4000])

# creating the lists for each type of data:

firstNameList = []
lastNameList = []
operatoreList = []
zonesList = []

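# implementing the pattern that can clean all emails.
# Caution: inside a character class, `.-_` is a range (from "." to "_"), so this separator
# also accepts digits, uppercase letters and "@"; write `[._-]` to accept only ".", "_" or "-".

mail_pattern = re.compile(r"^(?P<firstName>\w*)[.-_](?P<LastName>\w*)@(?P<operator>\w+)[.](?P<zone>\w+$)")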

# check each email
for each_email in mailList:
    check_step = re.search(mail_pattern,each_email)
    if check_step is not None:
        firstNameList.append(check_step.group("firstName"))                      
        lastNameList.append(check_step.group("LastName"))
        operatoreList.append(check_step.group("operator"))
        zonesList.append(check_step.group("zone"))

#print("The First Names list is : \n", firstNameList)
#print("The last Names list is : \n", lastNameList)
#print("The operators Names list is : \n", operatoreList)
#print("The zone Names list is : \n", zonesList)
print(len(lastNameList))
print(len(firstNameList))
#print(len(operatoreList))
#print(len(zonesList))


file.close()
#print(list_mail)

4456
4456
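
A detail worth checking before interpreting these counts is the separator class `[.-_]` used in the pattern. Inside square brackets, a dash between two characters defines a range, so `[.-_]` is not the same as `[._-]`. The sketch below compares the two; the address `mark4451@example.com` is a made-up example.

```python
import re

range_class = re.compile(r"[.-_]")    # range from "." (0x2E) to "_" (0x5F)
literal_class = re.compile(r"[._-]")  # only ".", "_" and "-"

# Which single characters does each class accept?
accepted = {}
for char in [".", "_", "-", "7", "A", "@"]:
    accepted[char] = (
        bool(range_class.fullmatch(char)),
        bool(literal_class.fullmatch(char)),
    )
    print(repr(char), accepted[char])

# Consequence for an address without a real separator:
as_written = re.compile(
    r"^(?P<firstName>\w*)[.-_](?P<LastName>\w*)@(?P<operator>\w+)[.](?P<zone>\w+$)"
)
m = as_written.search("mark4451@example.com")
print(m.group("firstName"), repr(m.group("LastName")))
```

With the range, the last digit of `mark4451` is consumed as the "separator", so the address is accepted with an empty last name; a literal dash, on the other hand, is not accepted as a separator at all.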


**12. Another way of doing things.**

In [6]:
mail = "audrey.boulevart@benextcomapgny.com"
# Replace the dots with spaces, then separate the local part from the domain at "@".
splitMail = mail.replace(".", " ").split("@")
print(splitMail)

firstName = []
name = []
ope = []
zone = []

# The first word of the local part is the first name.
print(splitMail[0].split()[0])
#firstName.append(splitMail[0].split()[0])
#name.append(splitMail[0].split()[-1])
#ope.append(splitMail[1].split()[0])
#zone.append(splitMail[1].split()[-1])

#firstName, name, ope, zone

['audrey boulevart', 'benextcomapgny com']
audrey


In [44]:
# loading and opening the mails' file:

file = open("./data/mail.txt","r") 
NewmailList = file.readlines()

# creating the lists for each type of data:

firstNameList2 = []
lastNameList2 = []
operatoreList2 = []
zonesList2 = []

# split each address into its local part and its domain:
for each_email in NewmailList:
    new_string = each_email.replace(".", " ").replace("-", " ").replace("_", " ").split("@")
    print(new_string[0])

    firstNameList2.append(new_string[0].split()[0])
    lastNameList2.append(new_string[0].split()[-1])
    operatoreList2.append(new_string[1].split()[0])
    zonesList2.append(new_string[1].split()[-1])

print(len(lastNameList2))

vogal roger
aikin joe
moore
halknutson
alexnorquist
matthewlulloff
jenson thomas
mark4451
monte hylan
dan80
schmitt steve
knutsondan
lepage
chapman ben
upson1544
552966959
soloman ziegler
ortiz mark
ashwoon hank
pettigrew
doranedward
mills joe
valente alex
yang
ike
hankshaffer
larrytrebil
davis ike
davidsonhal
john8733
johnfletcher
cataldi larry
rogerfietzer
edward4185
hancock matthew
solberg cast
vader jack
georgepak
caswell hal
ike
tapia quinn
795419860
chambers johnsen
jack
joe
reyesaaron
ben yang
pettigrewgeorge
miller peter
linde carl
boyd carl
yocum mark
davidferro
george59
johnsen2201
thompson adam
schutzlarry
quizoz joe
matthew8227
alex
georgemoody
lawicki589
doran7947
baueraaron
mark
waltermartin
kaskel
roberts matthew
george
ferry victor
fred weiss
jurgensonpeter
stevefietzer
sonderling caswell
johnsen walter
mccormack
roberts hank
valente
ory walter
tyirwin
paulferry
george
frank lulloff
ripka6596
wagner schuster
mccormack carl
851886792
zeaser aaron
adam
tisler roger
ferro 

Repeat the previous exercise with this new approach and compare the lengths of your lists with those of the previous exercise.  
What do you notice?