# Regular expressions (regex): love or hate?

![commit strip](http://www.commitstrip.com/wp-content/uploads/2014/02/Strips-Le-dernier-des-vrais-codeurs-650-finalenglsih.jpg)

Regular expressions are available in almost every programming language. They are a very powerful tool for checking whether the content of a variable has the shape you expect.

For example, if you retrieve a phone number, you expect the variable to be composed of digits and spaces (or dashes) but nothing more.

Regular expressions can not only warn you about an unwanted character but also delete or replace every character that is not desirable.
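As a minimal sketch of both uses on a phone number: `re.search` reports the first unwanted character and `re.sub` removes everything that is not a digit. The sample value and the variable names are made up for illustration.

```python
import re

raw_phone = "0470-12 34 56x"  # hypothetical user input with a stray "x"

# Warn: look for any character that is not a digit, a space or a dash.
unwanted = re.search(r"[^0-9 \-]", raw_phone)
print(unwanted.group() if unwanted else "no unwanted character")

# Clean: delete every character that is not a digit (\D).
digits_only = re.sub(r"\D", "", raw_phone)
print(digits_only)
```

Here the first `print` shows `x` and the second shows `0470123456`.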


**There are two ways to use regular expressions:**
* The first is to call a function of the `re` module (such as `re.search`) with the pattern as the first parameter and the string to be analyzed as the second parameter.
* The second is to compile the regex, then use the methods of the resulting object to analyze a string passed as an argument. This method speeds up processing when a regex is used several times.  

In [1]:
import re

In [2]:
pattern = "[ ]"
string = "I am fine ! There are still 6 months left :()"

# Searches for the pattern in the string above and returns a `re.Match` object
# for the first match found, otherwise returns `None`.
print(re.search(pattern, string))

<re.Match object; span=(1, 2), match=' '>


In [None]:
pattern = "[ ]"
string = "I am fine ! There are still 6 months left :()"

# Cuts the string according to the occurrence of the pattern.
print(re.split(pattern, string))

### A little syntax

    [xy]  A set of possible characters. Example: [abc] matches a, b or c.

    (x|y) An alternation (multiple choice). Example: (ps|ump) matches "ps" OR "ump".

    \d    A digit, which is equivalent to [0-9].

    \D    Any character that is not a digit, which is equivalent to [^0-9].

    \s    A whitespace character, which is equivalent to [ \t\n\r\f\v].

    \S    Any character that is not whitespace, which is equivalent to [^ \t\n\r\f\v].

    \w    A word character (letter, digit or underscore), which is equivalent to [a-zA-Z0-9_].

    \W    Any character that is not a word character, which is equivalent to [^a-zA-Z0-9_].

    \     The escape character. It removes the special meaning of a reserved character so that it is matched literally (for example, \. matches a dot).
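To make the list above concrete, here is a small sketch that applies a few of these classes with `re.findall`. The sample strings are invented for the example.

```python
import re

sample = "Room 42_b costs 3.50 EUR"  # hypothetical sample text

digits = re.findall(r"\d+", sample)   # runs of digits
words = re.findall(r"\w+", sample)    # runs of letters, digits or underscores
print(digits)
print(words)

# An unescaped dot matches any character; an escaped dot matches only ".".
any_char = re.findall(r"3.50", "3x50 and 3.50")
literal_dot = re.findall(r"3\.50", "3x50 and 3.50")
print(any_char)
print(literal_dot)
```

Note that `42_b` comes back as a single word because the underscore belongs to `\w`.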

### Let's try it.

If the result is not `None`, the pattern matched. GREY is indeed a string beginning with GR, followed by one character and ending with Y.

In [None]:
print(re.match(r"GR(.)?Y", "GREY"))
# `.` means any character, and the `?` that follows makes the group optional,
# so (.)? means that we expect 0 or 1 character.

In [3]:
pattern = "GR(.)?Y"
string = "GREY"

result = re.match(pattern, string)
print(result)

# It is equal to
compiled = re.compile(pattern)
result = compiled.match(string)
print(result)

<re.Match object; span=(0, 4), match='GREY'>
<re.Match object; span=(0, 4), match='GREY'>


In [4]:
# In a loop, the compiled form is more convenient: the pattern is built only once
pattern = "GR(.)?Y"
compiled = re.compile(pattern)
names = ["GREY 'S", "GRAY", "GREYISH", "A GREY"]

for elem in names:
    result = compiled.match(elem)
    print(elem, result)

GREY 'S <re.Match object; span=(0, 4), match='GREY'>
GRAY <re.Match object; span=(0, 4), match='GRAY'>
GREYISH <re.Match object; span=(0, 4), match='GREY'>
A GREY None
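The last result is `None` because `match` only tries the pattern at the beginning of the string. A short sketch, reusing the same pattern, contrasts it with `search` (which scans the whole string) and `fullmatch` (which requires the entire string to match):

```python
import re

compiled = re.compile(r"GR(.)?Y")

at_start = compiled.match("A GREY")     # pattern is not at position 0
anywhere = compiled.search("A GREY")    # pattern is found further in the string
whole = compiled.fullmatch("GREYISH")   # "ISH" is left over, so no full match

print(at_start)
print(anywhere)
print(whole)
```

Which of the three to use depends on the question asked: "does it start with", "does it contain" or "is it exactly".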


In the following, we search for specific expressions in a string.

In [7]:
print(re.findall("GR(.)?Y", "GREY"))
# When the pattern contains a capturing group, findall returns the content of
# that group: here, the single character (.)? between GR and Y

['E']


In [8]:
# Same idea with two capturing groups: findall returns one tuple per match
re.findall("G(.)?(.)?Y", "GREY")

[('R', 'E')]

To keep only the numbers:

In [5]:
# Only numbers
print(re.findall("([0-9]+)", "Hello I live on the 7th floor of 220 street of sims"))
# "+" means one or more occurrences of the preceding element (here, a digit)

['7', '220']


And conversely, if you only want to keep the words:

In [6]:
# Only words
# [A-Za-z] rather than [A-z]: the range A-z also covers the symbols [ \ ] ^ _ `
print(re.findall("([A-Za-z]+)", "Hello I live on the 7th floor of 220 street of sims"))

['Hello', 'I', 'live', 'on', 'the', 'th', 'floor', 'of', 'street', 'of', 'sims']


### Stop, let's recap!

Character | Meaning   
:-------------------------:|:-------------------------:
**.** | Refers to any character (except a newline by default).
**^** | Indicates the beginning of the string.<br />For example, _^a_ matches _ab_ but not _ba_. 
**$** | Indicates the end of the string.<br />For example, _a$_ matches _ba_ but not _ab_. 
**?** | The previous character can appear zero times or once.<br />For example, _ab?_ matches _ab_ and _a_.
**\*** | The previous character can appear zero or more times.<br />For example, _ab\*_ matches _a_, _ab_, or _a_ followed by any number of _b_.
**+** | The previous character can appear one or more times.<br />For example, _ab+_ matches an _a_ followed by at least one _b_.
**{n}** | Indicates that the previous character must be repeated exactly _n_ times.
**{n,m}** | Indicates that the previous character must be repeated between _n_ and _m_ times (no space after the comma).
**\w** | Matches any word character (letter, digit or underscore), i.e. it is equivalent to _[a-zA-Z0-9\_]_.
**\W** | Matches everything that is not a word character.
**\d** | Matches any numeric character, i.e. it is equivalent to _[0-9]_.
**\D** | Matches everything that is not a numeric character.
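The quantifiers in the table can be compared side by side. The sketch below checks a few made-up candidate strings against each pattern with `re.fullmatch`, so a candidate is listed only if the whole string matches.

```python
import re

candidates = ["a", "ab", "abb", "abbb"]  # hypothetical test strings
patterns = [r"ab?", r"ab*", r"ab+", r"ab{2}", r"ab{1,2}"]

# For each pattern, keep the candidates that it matches in full.
results = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, "->", matched)
```

For instance, `ab?` keeps only `a` and `ab`, while `ab{2}` keeps only `abb`.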

<img src="https://i.redd.it/nac35ntlfg831.jpg" width="400">


### Some useful resources
[Regex quickstart](http://www.rexegg.com/regex-quickstart.html): the Regex cheat sheet

[Dreambank Regex](http://www.dreambank.net/regex.html#examples): some examples of regex behaviour

[Pythex](https://pythex.org/): a real-time regular expression editor for Python, a quick way to test your regular expressions.  

[Regex101](https://regex101.com/): online regex editor and debugger. Regex101 allows you to create, debug, test and have your expressions explained for PHP, PCRE, Python, Golang and JavaScript. The website also features a community where you can share useful expressions.

##### And just for fun...
[Regex Crosswords](https://regexcrossword.com/): some crossword puzzles to test your Regex knowledge


#### How to check that the entered string is a number?

In [None]:
import re
number = input("Your number : ")
if re.match("^[0-9]+$", number):
    print("The string entered is a number.")
else:
    print("The string entered is NOT a number.")

Another way, with a compiled pattern:

In [11]:
compiled = re.compile("^[0-9]+$")
if compiled.search(number) is not None:
    print("The string entered is a number.")
else:
    print("The string entered is NOT a number")

The string entered is NOT a number
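A third option, sketched here with a hypothetical helper named `is_number`, is `re.fullmatch`, which needs no `^` and `$` anchors. One difference worth knowing: `$` also matches just before a trailing newline, whereas `fullmatch` does not tolerate it.

```python
import re

def is_number(text):
    """Hypothetical helper: True only if text consists solely of digits."""
    return re.fullmatch(r"[0-9]+", text) is not None

print(is_number("2024"))
print(is_number("20 24"))
print(is_number("2024\n"))

# For comparison, the anchored pattern accepts the trailing newline.
anchored_accepts_newline = re.match(r"^[0-9]+$", "2024\n") is not None
print(anchored_accepts_newline)
```

With `input()` the difference rarely matters because the returned string has no trailing newline, but it can matter when reading lines from a file.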


### Drill


**1. Create a regex that finds integers without size limit.**

In [28]:
import re
compiled = re.compile(r"^\d+$")  # raw string so that \d is passed to the regex engine as-is

s = "sssgdds8sfsfs"
if compiled.search(s) is not None:
    print(f"The string {s} is an integer")
else:
    print(f"The string {s} is not an integer")



The string sssgdds8sfsfs is not an integer


**2. Create a regex that finds negative integers without size limit.**

In [27]:
import re
compiled = re.compile(r"^-\d+$")  # a mandatory minus sign followed by digits

s = "sssgdds8sfsfs"
if compiled.search(s) is not None:
    print(f"The string {s} is a negative integer")
else:
    print(f"The string {s} is not a negative integer")

The string sssgdds8sfsfs is not a negative integer


**3. Create a regex that finds (positive or negative) integers without size limit.**

In [29]:
import re
compiled = re.compile(r"^-?\d+$")  # an optional minus sign followed by digits

s = "sssgdds8sfsfs"
if compiled.search(s) is not None:
    print(f"The string {s} is an integer")
else:
    print(f"The string {s} is not an integer")

The string sssgdds8sfsfs is not an integer


**4. Capture all the numbers in the following sentence:**

In [39]:
import re

compiled = re.compile(r"[0-9]+")

text = "21 scouts and 3 tanks fought against 4,003 protestors, so the manager was not 100.00% happy."
numbers_in_string = compiled.findall(text)

if numbers_in_string:
    print(f"The string \"{text}\" contains numbers: {numbers_in_string}")
else:
    print(f"The string \"{text}\" does not contain any numbers.")

print(f"{numbers_in_string} is {type(numbers_in_string)}")

The string "21 scouts and 3 tanks fought against 4,003 protestors, so the manager was not 100.00% happy." contains numbers: ['21', '3', '4', '003', '100', '00']
['21', '3', '4', '003', '100', '00'] is <class 'list'>
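With `[0-9]+`, `4,003` and `100.00` are each split into two pieces. If the intent is to keep them whole, one possible variant is sketched below. It assumes that a comma is a thousands separator followed by exactly three digits and that a dot introduces decimals, which is an assumption about the text rather than a general rule.

```python
import re

# Digits, then optional ",ddd" groups, then an optional decimal part.
# (?:...) groups without capturing, so findall returns the whole match.
number_pattern = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

text = "21 scouts and 3 tanks fought against 4,003 protestors, so the manager was not 100.00% happy."
whole_numbers = number_pattern.findall(text)
print(whole_numbers)
```

This prints `['21', '3', '4,003', '100.00']`. The comma after "protestors" is ignored because it is not followed by three digits.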


**5. Find all words that end with 'ly'.**

In [45]:

import re

compiled = re.compile(r'\w+ly\b')  # Words ending with "ly"
# \w+ matches one or more word characters (letters, digits and underscores)
# \b is a word boundary assertion to ensure that "ly" is at the end of the word.

text = "He had prudently disguised himself but was quickly captured by the police."
words_ending_with_ly = compiled.findall(text)

if words_ending_with_ly:
    print(f"The string \"{text}\" contains words ending with 'ly': {words_ending_with_ly}")
else:
    print(f"The string \"{text}\" does not contain any such word.")

print(f"{words_ending_with_ly} is {type(words_ending_with_ly)}")


The string "He had prudently disguised himself but was quickly captured by the police." contains words ending with 'ly': ['prudently', 'quickly']
['prudently', 'quickly'] is <class 'list'>


**6. License plate number**  
A Belgian license plate consists of 1 digit (0, 1 or 2), a dash ('-'), 3 capital letters, a dash ('-') and finally 3 digits. Write a script to check that an input string is a valid license plate.  
If it's correct, print `"good"`. If it's not correct, print `"Not good"`.

In [50]:
import re

pattern = r'^[012]-[A-Z]{3}-\d{3}$'

compiled = re.compile(pattern) 
plate = input("Enter your license plate number: ")

check_plate = compiled.match(plate)
if check_plate:
    print(f"{plate} is a valid license plate")
else:
    print(f"{plate} is an invalid license plate")

print(f"{check_plate} is {type(check_plate)}")

2-BUD-234 is a valid license plate
<re.Match object; span=(0, 9), match='2-BUD-234'> is <class 're.Match'>


**7. IPv4 address**  
An IPv4 address is composed of 4 numbers between 0 and 255 separated by '.'   
Write a script to verify that a string entered is a valid IPv4 address.

In [51]:
import re

pattern = r'^(\d{1,3}\.){3}\d{1,3}$'

compiled = re.compile(pattern)

ip = input("Enter your IP address: ")

check_ip = compiled.match(ip)

# Default to False so that the final print also works when the format is invalid
valid_ip = False

if check_ip:
    # Further validation to check that each octet is within the valid range (0-255)
    octets = ip.split('.')
    valid_ip = all(0 <= int(octet) <= 255 for octet in octets)
    if valid_ip:
        print(f"{ip} is a valid IP address.")
    else:
        print(f"{ip} is not a valid IP address (out of range).")
else:
    print(f"{ip} is not a valid IP address (invalid format).")


print(f"{valid_ip} is {type(valid_ip)}")

234.123.8.234 is a valid IP address.
True is <class 'bool'>
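The range check can also be folded into the regex itself. The sketch below is one possible way to do it, with a hypothetical helper named `is_valid_ipv4`. Unlike the `int()` check above, it rejects leading zeros such as `01`, which is a design choice and not a requirement of the exercise.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
IPV4_PATTERN = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

def is_valid_ipv4(text):
    """Hypothetical helper: True if text is four dot-separated octets in 0-255."""
    return IPV4_PATTERN.fullmatch(text) is not None

for candidate in ["234.123.8.234", "256.1.1.1", "1.2.3", "01.2.3.4"]:
    print(candidate, is_valid_ipv4(candidate))
```

Only the first candidate is accepted. The pattern is harder to read than the two-step version, so the regex-plus-`int()` approach remains a reasonable choice.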


**8. Valid Mail**  
An email address is composed of alphanumeric characters followed by `@` and a domain name.  
Write a script that checks that the string entered by a user is a valid email address; otherwise, ask them to enter it again (until they provide a valid one).

In [53]:
import re

# Function: is_valid_email
# Validates an email address using a regular expression pattern.
# Parameters:
#   - email (str): The email address to be validated.
# Returns:
#   - bool: True if valid, False otherwise.
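def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    # Convert the Match object (or None) into a real boolean
    return re.match(pattern, email) is not None

email = input("Enter an email address: ")

# Keep asking until the user enters a valid email address
while not is_valid_email(email):
    print(f"{email} is not a valid email address.")
    email = input("Enter an email address: ")

print(f"{email} is a valid email address.")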



nguyen@hoa.com is a valid email address.


**9. Valid Password**  
Write an additional script that verifies the password (only if the email is valid). The only requirement for the password is that it contains at least 6 characters.

In [56]:
import re

# Function: is_valid_email
# Description: Validates an email address using a regular expression pattern.
# Parameters:
#   - email (str): The email address to be validated.
# Returns:
#   - bool: True if valid, False otherwise.
def is_valid_email(email):
    # Define the regular expression pattern for a valid email address
    pattern = r'^[a-zA-Z0-9]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

    # Use re.match() to check if the input email matches the pattern
    # and convert the Match object (or None) into the documented boolean
    return re.match(pattern, email) is not None

# Function: is_valid_password
# Description: Validates a password: any character, len >=6.
# Parameters:
#   - password (str): The password to be validated.
# Returns:
#   - bool: True if valid, False otherwise.
def is_valid_password(password):
    # Password validation requiring a minimum length of 6 characters:
    return len(password) >= 6

# Get an email address from the user
email = input("Enter an email address: ")

# Check if the input email is valid
if is_valid_email(email):
    print(f"{email} is a valid email address.")

    # Only ask for and check the password when the email is valid
    password = input("Enter a password: ")
    if is_valid_password(password):
        print("Password is valid.")
    else:
        print("Password is not valid. It should be at least 6 characters long.")
else:
    print(f"{email} is not a valid email address.")


hoa@nguyen.com is a valid email address.
Password is valid.


**10. Valid Password bis**  
The password must now contain at least 6 characters AND  

- at least one lowercase letter AND 
- at least one uppercase letter AND 
- at least one number AND 
- at least one special character (among `$#@`).

In [64]:
import re

# Function to validate an email address: alphanumeric characters, then @, then a domain name
def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

# Function to validate a password
def is_valid_password(password):
    # At least 6 characters and AT LEAST: one lowercase letter, one uppercase letter,
    # one digit and one special character among $, # and @.
    # Each (?=...) is a lookahead: it checks a condition without consuming characters.
    pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[$#@]).{6,}$'

    return re.match(pattern, password) is not None


# Get an email address from the user
#email = input("Enter an email address: ")
email = "abc123@hotmail.org"

# Check if the input email is valid
if is_valid_email(email):
    print(f"Email address {email} is valid.")

    # If the email is valid, get the password from the user
    #password = input("Enter a password: ")
    password = "Passw01@"

    # Check if the input password is valid
    if is_valid_password(password):
        print(f"Password {password} is valid.")
    else:
        print("Password is not valid. It should be at least 6 characters long, and contain at least: one lowercase letter, one uppercase letter, one number and one special character among $, # and @.")

    # Test the password validation function:
    password2 = "weak"       # too short, no uppercase letter, digit or special character
    password3 = "TooShort1"  # long enough, but no special character
    print(is_valid_password(password2))  # prints False
    print(is_valid_password(password3))  # prints False

Email address abc123@hotmail.org is valid.
Password Passw01@ is valid.
False
False


**11. Search by groups**  
It is possible to search by groups, and it is very powerful!  
`(?P<x>\w+)` captures a "group" named `x`; this group is composed of at least one (`+`) alphanumeric character (`\w`).

In [65]:
m = re.search(
    r"Welcome to (?P<where>\w+) ! You are (?P<age>\d+) years old \?",
    "Welcome to Olivier ! You are 32 years old ?",
)
print(m.group("where"))
print(m.group("age"))

Olivier
32


In [66]:
# Another Example
m = re.search(
    r"^(?P<who>\w*)[.]?(?P<who2>\w*)@(?P<operator>\w+)[.](?P<zone>\w+$)",
    "audrey.boulevart@benextcomapgny.com",
)
if m is not None:
    print(m.group("who"))
    print(m.group("who2"))
    print(m.group("operator"))
    print(m.group("zone"))

audrey
boulevart
benextcomapgny
com
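When several named groups are needed at once, `groupdict()` returns them all as a dictionary. A minimal sketch with the same pattern and the same sample address:

```python
import re

pattern = re.compile(
    r"^(?P<who>\w*)[.]?(?P<who2>\w*)@(?P<operator>\w+)[.](?P<zone>\w+)$"
)

m = pattern.search("audrey.boulevart@benextcomapgny.com")

# groupdict() maps each group name to the text it captured.
parts = m.groupdict() if m is not None else {}
print(parts)
```

The keys of the dictionary are the group names, which makes it easy to append each value to a list of the same name.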


Load the file `./data/mail.txt` and parse it with a regex. The goal is to retrieve the first name, last name, operator and zone, as in the previous example. Store each of them in its own separate list.

In [67]:
#  Reads the contents of the mail.txt file into the list_mail list, and then 
#  it processes each email address from that list using the regex pattern and 
#  stores the extracted information in separate lists as before.

import re

# Initialize lists to store extracted information
first_names = []
last_names = []
operators = []
zones = []

# Define the regex pattern
pattern = r"^(?P<who>\w*)[.]?(?P<who2>\w*)@(?P<operator>\w+)[.](?P<zone>\w+$)"

# Read the contents of the 'mail.txt' file into a list
with open('./data/mail.txt', 'r') as file:
    list_mail = file.readlines()

# Iterate through the list of email addresses
for line in list_mail:
    # Use regex to extract information from each email address
    m = re.search(pattern, line)
    
    if m is not None:
        # Append extracted information to respective lists
        # extract specific parts of the email address using m.group() when a match is found.
        first_names.append(m.group("who"))
        last_names.append(m.group("who2"))
        operators.append(m.group("operator"))
        zones.append(m.group("zone"))

# Print the extracted information, walking through the four lists in parallel
for first, last, operator, zone in zip(first_names, last_names, operators, zones):
    print("First Name:", first)
    print("Last Name:", last)
    print("Operator:", operator)
    print("Zone:", zone)
    print()  # Add an empty line for separation



First Name: aikin
Last Name: joe
Operator: odul
Zone: xyz

First Name: moore
Last Name: 
Operator: imail
Zone: gov

First Name: halknutson
Last Name: 
Operator: email
Zone: xyz

First Name: alexnorquist
Last Name: 
Operator: proton
Zone: com

First Name: monte
Last Name: hylan
Operator: belgaximus
Zone: net

First Name: knutsondan
Last Name: 
Operator: belgaximus
Zone: org

First Name: lepage
Last Name: 
Operator: gaagle
Zone: xyz

First Name: chapman_ben
Last Name: 
Operator: napster
Zone: hu

First Name: upson1544
Last Name: 
Operator: youhoo
Zone: hu

First Name: 552966959
Last Name: 
Operator: belgaximus
Zone: hu

First Name: soloman_ziegler
Last Name: 
Operator: email
Zone: tech

First Name: ortiz
Last Name: mark
Operator: napster
Zone: me

First Name: ashwoon_hank
Last Name: 
Operator: imail
Zone: io

First Name: pettigrew
Last Name: 
Operator: proton
Zone: hu

First Name: doranedward
Last Name: 
Operator: napster
Zone: net

First Name: mills_joe
Last Name: 
Operator: imail
Zone:

**12. Another way of doing things.**

In [71]:
# Extract the first name, last name, operator, and zone from an email 

mail = "audrey.boulevart@benextcomapgny.com"

# splitMail: a two-element list built from the email address.
# replace(".", " ") turns every period into a space, then
# split("@") cuts the string at the "@" symbol, which gives
# the part before "@" and the part after "@".
# split() already returns a new list, so no copy() is needed.

splitMail = mail.replace(".", " ").split("@")

firstName = []
name = []
ope = []
zone = []

firstName.append(splitMail[0].split()[0])
name.append(splitMail[0].split()[-1])
ope.append(splitMail[1].split()[0])
zone.append(splitMail[1].split()[-1])

print(splitMail)
print(splitMail[0].split()[0])
print(splitMail[0].split()[-1])
print(splitMail[1].split()[0])
print((splitMail[1].split()[-1]))
firstName, name, ope, zone

['audrey boulevart', 'benextcomapgny com']
audrey
boulevart
benextcomapgny
com


(['audrey'], ['boulevart'], ['benextcomapgny'], ['com'])

Repeat the previous exercise with this new approach and compare the length of your lists with those of the previous exercise.  
What do you notice?