# <center>MSC-INF101E – Practical 2: Regular Expressions under Python
## <center>Professors: Lina Fahed, Virgil Hamici-Aubert
    
Very useful examples can be found here:
* https://www.w3schools.com/python/python_regex.asp   
* https://docs.python.org/3/library/re.html

Learning regular expressions (regex) in Python is important for several reasons:
1.    String Manipulation and Validation: Regular expressions provide a powerful way to manipulate and validate strings. They allow you to search, match, replace, and extract specific patterns from text, which is essential for tasks like data cleaning and validation.
2.    Text Processing and Analysis: When working with textual data, regex is invaluable for extracting meaningful information. Whether it’s parsing log files, extracting data from HTML or XML, or analyzing large text corpora, regex enables efficient and precise text processing.
3.    Pattern Matching: Regular expressions offer a concise and expressive syntax for defining patterns in strings. This is particularly useful when searching for specific patterns or validating input formats, such as email addresses, phone numbers, or URLs.
4.    Scripting and Automation: In scripting tasks, regex can streamline data processing. For example, you can use regex to identify and modify specific patterns within files or automate text-based tasks efficiently.
5.    Data Extraction from Web Scraping: When extracting data from web pages, regular expressions are often used to locate and extract specific content. While caution is needed for complex HTML parsing, regex can be effective for simpler cases.
6.    Data Cleaning and Transformation: Regular expressions are powerful tools for cleaning and transforming data. They allow you to find and replace specific patterns, making data more consistent and suitable for analysis.
7.    Code Parsing and Analysis: In programming, regex is handy for code parsing and analysis. For instance, you can use regex to search for specific function calls, variable names, or patterns in code.
8.    Cross-Language Compatibility: Regular expressions are widely used across different programming languages. Once you understand regex in Python, you can apply similar patterns in other languages, enhancing your versatility as a programmer.


In Python, the re module provides robust support for regular expressions, making it an essential skill for anyone working with textual data, automation, or data analysis in Python.

** ** 
** ** 
## 1. RegEx

A **RegEx**, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern. It is a powerful tool for searches and replacements. 

In Python, when you have imported the **re** module (package), you can start using regular expressions. 

Let us take an example: Search the following string to see if it starts with "The" and ends with "Spain":


In [77]:
import re

txt = "The rain in Spain"
x = re.search(r"^The.*Spain$", txt)
# ^The   : the string starts with "The"
# .      : any character (except a newline)
# *      : zero or more occurrences of the preceding element (here, any character)
# Spain$ : the string ends with "Spain"
if x:
  print("YES! We have a match!")
else:
  print("No match")

YES! We have a match!


** ** 
** ** 
## 2. RegEx Functions

The **re** module offers a set of functions that allow us to search a string for a match:
***
* **findall**	: Returns a list containing all matches
* **search**	: Returns a Match object if there is a match anywhere in the string
* **split**	: Returns a list where the string has been split at each match
* **sub**	: Replaces one or many matches with a string

***
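
As a quick overview before the detailed sections, the four functions listed above can be compared on the same string. This is only a minimal sketch; the variable names are chosen for illustration.

```python
import re

txt = "The rain in Spain"

# findall: every non-overlapping match, as a list of strings
all_ai = re.findall(r"ai", txt)

# search: the first match only, as a Match object (or None)
first_ai = re.search(r"ai", txt)

# split: cut the string wherever the pattern matches
words = re.split(r"\s", txt)

# sub: replace every match with another string
dashed = re.sub(r"\s", "-", txt)

print(all_ai)            # ['ai', 'ai']
print(first_ai.span())   # (5, 7)
print(words)             # ['The', 'rain', 'in', 'Spain']
print(dashed)            # The-rain-in-Spain
```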

** ** 
** ** 
## 3. Metacharacters

Metacharacters are characters with a special meaning:
***
* **[ ]**	: A set of characters
* **\**	: Signals a special sequence (can also be used to escape special characters)	
* **.**	: Any character (except newline character)	
* **^**	: Starts with	
*  \\$ 	: Ends with		
*  \* 	: Zero or more occurrences	
* **+**	: One or more occurrences	
* **{ }**	: Exactly the specified number of occurrences	
* **|**	: Either or		
* **( )**	: Capture and group

***

Let us try these metacharacters:

In [78]:
import re

txt = "The rain in Spain"


In [79]:
#use [ ] : Find all lower case characters alphabetically between "a" and "m":

x = re.findall("[a-m]", txt)
print(x)

['h', 'e', 'a', 'i', 'i', 'a', 'i']


In [80]:
#use \ : Signals a special sequence
## \d  : Returns a match where the string contains digits (numbers from 0-9)
import re
txt = "That will be 59 dollars"
#Find all digit characters:
x = re.findall(r"\d", txt)
print(x)

['5', '9']



In [81]:
#use . : Any character (except newline character)
import re
txt = "hello world"
#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":
x = re.findall("he..o", txt)
print(x)

['hello']


In [82]:
#use ^: Starts with
import re
txt = "hello world"
#Check if the string starts with 'hello':
x = re.findall("^hello", txt)
if x:
  print("Yes, the string starts with 'hello'")
else:
  print("No match")

Yes, the string starts with 'hello'


In [83]:
#use $ : Ends with
import re
txt = "hello world"
#Check if the string ends with 'world':
x = re.findall("world$", txt)
if x:
  print("Yes, the string ends with 'world'")
else:
  print("No match")

Yes, the string ends with 'world'


In [84]:
#use * : Zero or more occurrences
import re
txt = "The rain in Spain falls mainly in the plain!"
#Check if the string contains "ai" followed by 0 or more "x" characters:
x = re.findall("aix*", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ai', 'ai', 'ai', 'ai']
Yes, there is at least one match!


In [85]:
#use + : One or more occurrences
import re
txt = "The rain in Spain falls mainly in the plain!"
#Check if the string contains "ai" followed by 1 or more "x" characters:
x = re.findall("aix+", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [86]:
#use { } : Exactly the specified number of occurrences
import re
txt = "The rain in Spain falls mainly in the plain!"
#Check if the string contains "a" followed by exactly two "l" characters:
x = re.findall("al{2}", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['all']
Yes, there is at least one match!


In [87]:
#use | : Either or
import re
txt = "The rain in Spain falls mainly in the plain!"
#Check if the string contains either "falls" or "stays":
x = re.findall("falls|stays", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['falls']
Yes, there is at least one match!


** ** 
** ** 
## 4. Special Sequences

A special sequence is a **\** followed by one of the characters in the list below, and has a special meaning:

***
* **\A**	: Matches only at the beginning of the string
* **\b**	: Matches at a word boundary, i.e. at the beginning or at the end of a word (write the pattern as a raw string, r"...", so that Python does not read "\b" as a backspace character)
* **\B**	: Matches at any position that is NOT a word boundary, i.e. NOT at the beginning or at the end of a word (again, write the pattern as a raw string)
* **\d** :	Matches any digit (numbers from 0-9)
* **\D** :	Matches any character that is NOT a digit
* **\s** :	Matches any white-space character
* **\S** :	Matches any character that is NOT a white-space character
* **\w**	: Matches any word character (letters from a to z and A to Z, digits from 0-9, and the underscore _ character)
* **\W**	: Matches any character that is NOT a word character
* **\Z**	: Matches only at the end of the string
***

Raw string notation *(r"text")* keeps regular expressions sane. Without it, every backslash ('\\') in a regular expression would have to be prefixed with another one to escape it.
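
The effect of the raw string prefix can be checked directly. In this small sketch, the same pattern is written once as a normal string and once as a raw string:

```python
import re

# In a normal string literal, "\b" is a single character: the backspace.
normal_len = len("\b")
# In a raw string, r"\b" is two characters: a backslash and the letter b.
raw_len = len(r"\b")
print(normal_len, raw_len)   # 1 2

# Doubling the backslash gives the same two characters without the r prefix.
print("\\b" == r"\b")        # True

txt = "The rain in Spain"
without_raw = re.findall("ain\b", txt)   # looks for "ain" followed by a backspace
with_raw = re.findall(r"ain\b", txt)     # looks for "ain" at the end of a word
print(without_raw)   # []
print(with_raw)      # ['ain', 'ain']
```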


Let us try these sequences:

In [88]:
#use \A : Returns a match if the specified characters are at the beginning of the string
import re
txt = "The rain in Spain"
#Check if the string starts with "The":
x = re.findall(r"\AThe", txt)
print(x)
if x:
  print("Yes, there is a match!")
else:
  print("No match")

['The']
Yes, there is a match!



In [89]:
#use \b : Returns a match where the specified characters are at the beginning or at the end of a word 
#(the "r" in the beginning is making sure that the string is being treated as a "raw string")
import re
txt = "The rain in Spain"
#Check if "ain" is present at the beginning of a WORD:
x = re.findall(r"\bain", txt)
print(x)
if x:
  print("Yes, there is at least one match at the beginning!")
else:
  print("No match")

#Check if "ain" is present at the end of a WORD:
x = re.findall(r"ain\b", txt)
print(x)
if x:
  print("Yes, there is at least one match at the end!")
else:
  print("No match")

[]
No match
['ain', 'ain']
Yes, there is at least one match at the end!


In [90]:
#use \B : Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word
#(the "r" in the beginning is making sure that the string is being treated as a "raw string")
import re
txt = "The rain in Spain"
#Check if "ain" is present, but NOT at the beginning of a word:
x = re.findall(r"\Bain", txt)
print(x)
if x:
  print("Yes, there is at least one match, NOT at the beginning of a word!")
else:
  print("No match")

#Check if "ain" is present, but NOT at the end of a word:
x = re.findall(r"ain\B", txt)
print(x)
if x:
  print("Yes, there is at least one match, NOT at the end of a word!")
else:
  print("No match")

['ain', 'ain']
Yes, there is at least one match, NOT at the beginning of a word!
[]
No match


In [91]:
#use \d : Returns a match where the string contains digits (numbers from 0-9)
import re
txt = "The rain in Spain"
#Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", txt)
print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[]
No match



In [92]:
#use \D : Returns a match where the string DOES NOT contain digits
import re
txt = "The rain in Spain"
#Return a match at every non-digit character:
x = re.findall(r"\D", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!



In [93]:
#use \s : Returns a match where the string contains a white space character
import re
txt = "The rain in Spain"
#Return a match at every white-space character:
x = re.findall(r"\s", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ']
Yes, there is at least one match!



In [94]:
#use \S : Returns a match where the string DOES NOT contain a white space character
import re
txt = "The rain in Spain"
#Return a match at every NON white-space character:
x = re.findall(r"\S", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!



In [95]:
#use \w : Returns a match where the string contains any word characters 
#(characters from a to Z, digits from 0-9, and the underscore _ character)
import re
txt = "The rain in Spain"
#Return a match at every word character (characters from a to Z, digits from 0-9, and the underscore _ character):
x = re.findall(r"\w", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!



In [96]:
#use \W : Returns a match where the string DOES NOT contain any word characters
import re
txt = "The rain in Spain"
#Return a match at every NON word character (characters NOT between a and Z. Like "!", "?" white-space etc.):
x = re.findall(r"\W", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ']
Yes, there is at least one match!



In [97]:
#use \Z : Returns a match if the specified characters are at the end of the string
import re
txt = "The rain in Spain"
#Check if the string ends with "Spain":
x = re.findall(r"Spain\Z", txt)
print(x)
if x:
  print("Yes, there is a match!")
else:
  print("No match")

['Spain']
Yes, there is a match!



** ** 
** ** 
## 5. Sets

A set is a set of characters inside a pair of square brackets [ ] with a special meaning:

***
* **[arn]** :	Returns a match where one of the specified characters (for example: a, r, or n) is present
* **[a-n]**	: Returns a match for any lower case character, alphabetically between (for example) a and n
* **[^arn]** :	Returns a match for any character EXCEPT (for example) a, r, and n
* **[0123]** : Returns a match where any of the specified digits (for example 0, 1, 2, or 3) is present
* **[0-9]** :	Returns a match for any digit between 0 and 9
* **[0-5][0-9]** :	Returns a match for any two-digit number from 00 to 59
* **[a-zA-Z]**	: Returns a match for any character alphabetically between a and z, lower case OR upper case
* **[+]**	: In sets, +, *, ., |, ( ), $, { } have no special meaning, so [+] means: return a match for any + character in the string
***

Let us try these sets:

In [98]:
#use [arn] : Returns a match where one of the specified characters (a, r, or n) is present
import re
txt = "The rain in Spain"
#Check if the string has any a, r, or n characters:
x = re.findall("[arn]", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['r', 'a', 'n', 'n', 'a', 'n']
Yes, there is at least one match!


In [99]:
#use [a-n] : Returns a match for any lower case character, alphabetically between a and n
import re
txt = "The rain in Spain"
#Check if the string has any characters between a and n:
x = re.findall("[a-n]", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']
Yes, there is at least one match!


In [100]:
#use [^arn] : Returns a match for any character EXCEPT a, r, and n
import re
txt = "The rain in Spain"
#Check if the string has other characters than a, r, or n:
x = re.findall("[^arn]", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']
Yes, there is at least one match!


In [101]:
#use [0123] : Returns a match where any of the specified digits (0, 1, 2, or 3) are present
import re
txt = "The rain in Spain"
#Check if the string has any 0, 1, 2, or 3 digits:
x = re.findall("[0123]", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


In [102]:
#use [0-9] : Returns a match for any digit between 0 and 9
import re
txt = "8 times before 11:45 AM"
#Check if the string has any digits:
x = re.findall("[0-9]", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['8', '1', '1', '4', '5']
Yes, there is at least one match!


In [103]:
#use [0-5][0-9] : Returns a match for any two-digit number from 00 to 59
import re
txt = "8 times before 11:45 AM"
#Check if the string has any two-digit numbers, from 00 to 59:
x = re.findall("[0-5][0-9]", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['11', '45']
Yes, there is at least one match!


In [104]:
#use [a-zA-Z] Returns a match for any character alphabetically between a and z, lower case OR upper case
import re
txt = "8 times before 11:45 AM"
#Check if the string has any characters from a to z lower case, and A to Z upper case:
x = re.findall("[a-zA-Z]", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']
Yes, there is at least one match!


In [None]:
#use [+] : In sets, +, *, ., |, ( ), $, { } have no special meaning, so [+] means: return a match for any + character in the string
import re
txt = "8 times before 11:45+ AM"
#Check if the string has any + characters:
x = re.findall("[+]", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['+']
Yes, there is at least one match!
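
Sets become more useful when combined with each other and with the metacharacters seen earlier. The following sketch, on an invented sample string, chains several sets to find clock times and uses `+` after a set to collect whole words. Note that `[0-2][0-9]` is only an approximation of an hour: it would also accept "29".

```python
import re

txt = "8 times before 11:45 AM, then again at 23:05"

# [0-2][0-9] : an hour-like pair, then a literal ":", then [0-5][0-9] : minutes 00-59
times = re.findall(r"[0-2][0-9]:[0-5][0-9]", txt)
print(times)     # ['11:45', '23:05']

# A set followed by + matches a run of one or more characters from the set
letters = re.findall(r"[a-zA-Z]+", txt)
print(letters)   # ['times', 'before', 'AM', 'then', 'again', 'at']
```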


** ** 
** ** 
## 6. findall() Function

The findall() function returns a list containing all matches. The list contains the matches in the order they are found. If no matches are found, an empty list is returned:

In [None]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

y = re.findall("Portugal", txt)
print(y)

['ai', 'ai']
[]
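
When the pattern contains capture groups ( ), findall() returns the captured parts rather than the whole match. A short sketch of the three cases:

```python
import re

txt = "The rain in Spain"

# No group: the whole match is returned
whole = re.findall(r"\w+ain", txt)
print(whole)      # ['rain', 'Spain']

# One group: only the captured part is returned
prefixes = re.findall(r"(\w+)ain", txt)
print(prefixes)   # ['r', 'Sp']

# Two groups: each match becomes a tuple of captured parts
pairs = re.findall(r"(\w+)(ain)", txt)
print(pairs)      # [('r', 'ain'), ('Sp', 'ain')]
```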


** ** 
** ** 
## 7. search() Function

The search() function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match will be returned. If no matches are found, the value None is returned.

In [115]:
import re

txt = "The rain in Spain"
x = re.search(r'\s', txt) ## recall : \s : Returns a match where the string contains a white space character

print("The first white-space character is located in position:", x.start()) 

y = re.search("Portugal", txt)
print(y)

The first white-space character is located in position: 3
None
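
Because search() returns None when nothing is found, calling a method such as start() on the result without checking it first raises an AttributeError. One possible way to guard against this is sketched below; `first_position` is a hypothetical helper written for this example, not part of the re module.

```python
import re

txt = "The rain in Spain"

def first_position(pattern, text):
    """Return the start index of the first match, or -1 if there is none."""
    m = re.search(pattern, text)
    if m is None:          # no match: do not call m.start()
        return -1
    return m.start()

print(first_position(r"\s", txt))        # 3
print(first_position(r"Portugal", txt))  # -1
```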



** ** 
** ** 
## 8. split() Function
The split() function returns a list where the string has been split at each match. You can limit the number of splits by specifying the maxsplit parameter.

In [108]:
import re

#Split the string at every white-space character:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

#Split the string at the first white-space character only:
y = re.split(r"\s", txt, maxsplit=1)
print(y)


['The', 'rain', 'in', 'Spain']
['The', 'rain in Spain']
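
The pattern given to split() can be any regular expression, which makes it possible to split on several delimiters at once. A minimal sketch on an invented string:

```python
import re

txt = "rain, Spain;plain   main"

# [,;\s]+ : one or more commas, semicolons or white-space characters
parts = re.split(r"[,;\s]+", txt)
print(parts)     # ['rain', 'Spain', 'plain', 'main']

# maxsplit limits the number of cuts; the rest of the string is left untouched
limited = re.split(r"[,;\s]+", txt, maxsplit=2)
print(limited)   # ['rain', 'Spain', 'plain   main']
```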



** ** 
** ** 
## 9. sub() Function
The sub() function replaces the matches with the text of your choice. You can control the number of replacements by specifying the count parameter:

In [109]:
import re

#Replace all white-space characters with the digit "9":
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

#Replace the first two occurrences of a white-space character with the digit 9:
txt = "The rain in Spain"
y = re.sub(r"\s", "9", txt, count=2)
print(y)


The9rain9in9Spain
The9rain9in Spain
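
The replacement passed to sub() does not have to be a fixed string. As a sketch of two further possibilities, it can refer to captured groups with `\1`, `\2`, ..., or it can be a function that receives each Match object and returns the replacement text:

```python
import re

txt = "The rain in Spain"

# \1 and \2 in the replacement refer to the first and second captured groups
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", txt)
print(swapped)   # rain The Spain in

# The replacement can also be a function applied to each match
shouted = re.sub(r"\w*ain\b", lambda m: m.group().upper(), txt)
print(shouted)   # The RAIN in SPAIN
```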



** ** 
** ** 
## 10. Match Object
A Match object contains information about the search and its result.
If there is no match, the value None is returned instead of a Match object.
The Match object has properties and methods used to retrieve information about the search and the result:
* span() returns a tuple containing the start and end positions of the match.
* string returns the string passed into the function.
* group() returns the part of the string where there was a match.

You can also use the function **match()**, which only looks for a match at the beginning of the string.

In [110]:
import re

txt = "The rain in Spain"
x = re.search("ai", txt)
print("object is : " , x) #this will print an object

#Search for an upper case "S" character in the beginning of a word, and print its position:
x = re.search(r"\bS\w+", txt)
print(x.span())
print(x.start())  # prints the start position only



#The string property returns the search string:
y = re.search(r"\bS\w+", txt)
print(y.string)

#Search for an upper case "S" character in the beginning of a word, and print the word:
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())

# check whether the string 'blabliblu' starts with 'bl' followed by 'a', 'i' or 'u'
match = re.match('bl[aiu]','blabliblu')
print(match.group())


object is :  <re.Match object; span=(5, 7), match='ai'>
(12, 17)
12
The rain in Spain
Spain
bla
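
The difference between match(), search() and fullmatch() is easiest to see on the same string. In this short sketch, match() only succeeds at position 0, search() looks anywhere, and fullmatch() requires the pattern to cover the whole string:

```python
import re

txt = "The rain in Spain"

at_start = re.match(r"rain", txt)       # "rain" is not at position 0
anywhere = re.search(r"rain", txt)      # found further inside the string
print(at_start)                         # None
print(anywhere.span())                  # (4, 8)

print(re.match(r"The", txt).group())    # The

whole = re.fullmatch(r"The.*Spain", txt)   # the pattern covers the entire string
partial = re.fullmatch(r"The", txt)        # "The" alone does not
print(whole is not None)                # True
print(partial)                          # None
```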


In [111]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print("object is : " , x) #this will print an object

#Search for an upper case "S" character in the beginning of a word, and print its position:
x = re.search(r"\bS\w+", txt)
print(x.span())
print(x.start())  # prints the start position only

object is :  <re.Match object; span=(12, 17), match='Spain'>
(12, 17)
12


** ** 
** ** 
## 11. Exercise: find *named entities*
We want to find all the words that always start with a capital letter, i.e. words that never appear in lowercase in the text. Such words are likely to be named entities.

In [112]:
import re


### Let us start with a simple case: finding words starting with a capital letter
## notice that the word "The" should not appear in our final result, because "the" also appears in lowercase
txt = "The American scientist Michael Irwin Jordan is one of the leading figures in machine learning"
capital_letter_words = re.findall('([A-Z][a-zA-Z]+)', txt) # find words starting with a letter from A to Z, followed by one or more letters in either case
print(capital_letter_words)
### We will now go through the resulting list to check whether each word also appears in lowercase
## we can use a list comprehension
## we use "set" to keep only one occurrence of each item in the list
named_entities_set = set([x for x in capital_letter_words if x in txt and x.lower() not in txt])
print(named_entities_set)




['The', 'American', 'Michael', 'Irwin', 'Jordan']
{'Michael', 'American', 'Irwin', 'Jordan'}
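
One limitation of the test `x.lower() not in txt` is that it looks for a substring, not a whole word. The sketch below, on an invented sentence, compares it with a possible word-level variant that first collects the set of whole lowercase words; the names `by_substring`, `by_word` and `lowercase_words` are chosen for this illustration only.

```python
import re

txt = "The remarks of Mark Twain surprised the audience"

capital_letter_words = re.findall(r"\b[A-Z][a-zA-Z]+", txt)

# Substring test, as in the cell above: "mark" is found inside "remarks",
# so "Mark" is discarded by mistake
by_substring = {w for w in capital_letter_words if w.lower() not in txt}

# Word-level test: compare against the set of whole lowercase words
lowercase_words = set(re.findall(r"\b[a-z]+\b", txt))
by_word = {w for w in capital_letter_words if w.lower() not in lowercase_words}

print(by_substring)      # {'Twain'}
print(sorted(by_word))   # ['Mark', 'Twain']
```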


In [113]:
## Let us now do this when reading from a file
import re

## "with" closes the file automatically at the end of the block
with open("files/Michael-Irwin-Jordan.txt", "r") as my_file:
    txt = my_file.read() ## read whole file to one String
print(" ------Text------\n",txt,"\n\n")

capital_letter_words = re.findall('([A-Z][a-zA-Z]+)', txt)
named_entities_list = [x for x in capital_letter_words if x in txt and x.lower() not in txt]

print("------Named entities list------ \n ",named_entities_list, "\n")
named_entities_set = set(named_entities_list) ## to keep one occurrence of "University" for example
print("------Named entities set without repetition------ \n ",named_entities_set, "\n")

 ------Text------
 Michael Irwin Jordan (born February 25, 1956) is an American scientist, professor at the University of California, Berkeley and researcher in machine learning, statistics, and artificial intelligence.
The American scientist is one of the leading figures in machine learning, and in 2016 Science reported him as the world's most influential computer scientist.
Jordan received his BS magna cum laude in Psychology in 1978 from the Louisiana State University, his MS in Mathematics in 1980 from Arizona State University and his PhD in Cognitive Science in 1985 from the University of California, San Diego.
At the University of California, San Diego, Jordan was a student of David Rumelhart and a member of the PDP Group in the 1980s. 


------Named entities list------ 
  ['Michael', 'Irwin', 'Jordan', 'February', 'American', 'University', 'California', 'Berkeley', 'American', 'Science', 'Jordan', 'BS', 'Psychology', 'Louisiana', 'State', 'University', 'MS', 'Mathematics', 'Ariz