<img src="images/lasalle_logo.png" style="width:375px;height:110px;">

# Week 6 - Regular Expressions

### WIM250 - Introduction to Scripting Languages 
### Instructor: Ivaldo Tributino

Source:

- Python for Everybody Exploring Data Using Python 3 by Dr. Charles R. Severance
- https://medium.com/@chongye225/python-regular-expression-2ac91e084662

## Introduction

Searching for and extracting text from strings is so common that Python has a very powerful library for `regular expressions`, the `re` module, that handles many of these tasks quite elegantly.

<img src="images/regular_ex.png" style="width:750px;height:400px;">

A `regular expression` is a sequence of characters that forms a search pattern. The characters in the image above can be used to check whether a string contains a specified search pattern.


Regular expressions are almost their own little programming language for searching and parsing strings. In this course, we will only cover the basics of regular expressions. For more detail on regular expressions, see:
https://docs.python.org/3/library/re.html

The regular expression library `re` must be imported into your program before you can use it. The simplest use of the regular expression library is the `search()` function. The following program demonstrates a trivial use of the search function.

In [6]:
import re  # Importing re module

name = 'Ivaldo' # try Ryan, john, Risa ...
fhand = open('attendance.txt')
for line in fhand:
    line = line.rstrip()
    if re.search(name,line): 
        print(line)

Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM
Tributino Ivaldo de Sousa Joined 7/27/2021, 1:28:33 PM


In [7]:
# Using the string method find() 
fhand = open('attendance.txt')
for line in fhand:
    line = line.rstrip()
    if line.find(name)!= -1:  
        print(line)

Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM
Tributino Ivaldo de Sousa Joined 7/27/2021, 1:28:33 PM
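
As a rough sketch of why `re.search()` works inside an `if`: it returns a match object when the pattern is found and `None` otherwise. The first sample line is taken from the output above; the second is a hypothetical line added only for contrast.

```python
import re

# Sample lines standing in for attendance.txt (the second one is made up)
sample_lines = [
    'Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM',
    'Some Other Student Joined 7/27/2021, 1:30:02 PM',
]

for line in sample_lines:
    match = re.search('Ivaldo', line)
    # A match object is truthy; None is falsy
    if match:
        print('found at position', match.start(), 'in:', line)
    else:
        print('no match in:', line)

# True where the pattern was found, False otherwise
found = [re.search('Ivaldo', line) is not None for line in sample_lines]
print(found)
```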


The power of `regular expressions` comes when we add special characters to the search string that allow us to control more precisely which lines match the string. Adding these special characters to our regular expression allows us to do `sophisticated matching` and extraction while writing very little code.


For example, the caret character (`^`) is used in regular expressions to match “the beginning” of a line. We could change our program to only match lines where “Ivaldo” is at the beginning of the line as follows:

In [2]:
fhand = open('attendance.txt')

name = '^Ivaldo'
for line in fhand:
    line = line.rstrip()
    if re.search(name, line): 
        print(line)


Now we will only match lines that start with the string “Ivaldo”. This is still a very simple example that we could have done equivalently with the `startswith()` string method. But it serves to introduce the notion that regular expressions contain special characters that give us more control over what will match the regular expression.

In [8]:
# Using the string method startswith() 
fhand = open('attendance.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('Ivaldo'): 
        print(line)

Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM
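
A minimal sketch comparing the two approaches side by side. It uses the two lines printed earlier in this notebook instead of reading `attendance.txt`, so it can run on its own.

```python
import re

# Two lines taken from the output shown earlier in this notebook
lines = [
    'Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM',
    'Tributino Ivaldo de Sousa Joined 7/27/2021, 1:28:33 PM',
]

# The caret anchors the pattern to the beginning of the line
with_caret = [line for line in lines if re.search('^Ivaldo', line)]

# The string method gives the same answer for a fixed prefix
with_startswith = [line for line in lines if line.startswith('Ivaldo')]

print(with_caret)
print(with_caret == with_startswith)
```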


## Character matching in regular expressions

There are a number of other special characters that let us build even more powerful regular expressions. The most commonly used special character is the `period` or `full stop`, which matches any single character (except a newline, by default).

For example, the regular expression `F..m:` would match any of the strings `“From:”`, `“Fxxm:”`, `“F12m:”`, or `“F!@m:”` since each period in the regular expression matches any character.

In [None]:
fhand = open('attendance.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('J..n', line): # Try Joined, Join and J..n
        print(line)
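
The wildcard behaviour can be checked without a file. In this sketch the first four candidate strings come from the text above; `'Fm:'` and `'Frooom:'` are extra, made-up strings that should not match because each period stands for exactly one character.

```python
import re

candidates = ['From:', 'Fxxm:', 'F12m:', 'F!@m:', 'Fm:', 'Frooom:']

# Keep only the strings where F, any two characters, m and a colon appear
matches = [text for text in candidates if re.search('F..m:', text)]
print(matches)

# group() returns the part of the string that actually matched
joined = re.search('J..n', 'Joined 7/27/2021, 1:28:33 PM').group()
print(joined)
```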

Let's say you don't know how to spell my name, 'Ivaldo'. However, you know that it starts with `"I"` and ends with `"o"`. How can you find my name if you don't know how many characters there are between `"I"` and `"o"`?

## <center>I____o</center>

In [None]:
fhand = open('attendance.txt')
for line in fhand:
    line = line.rstrip()
    if re.search(r'I\S+o', line):  # try I..o, I.+o or I.*o
        print(line)

It helps to think of the `plus` and `asterisk` characters as “pushy”: they expand to match as many characters as possible. For example, in the first string below, `I.*o` matches all the way to the last `o` in the string, because the `.*` pushes outwards:

In [None]:
line1 = 'I am not an impresario'  # impresario: someone who produces entertainment, e.g. the director of an opera company
line2 = 'Ivaldo'
lines = [line1, line2]
for line in lines:
    if re.search('I.*o', line): 
        print(line)

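This behavior is called `“greedy”` matching (matching as many characters as possible). It is possible to tell an asterisk or plus sign not to be so greedy by adding a question mark after it (`*?` or `+?`). However, in this example it is better to use the `non-whitespace character` `\S`, so that the match cannot run across spaces.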

In [None]:
for line in lines:
    if re.search(r'I\S*o', line):  # try 'I.*o', r'I\S+o' and r'I\S+'
        print(line)
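
To see what each pattern actually matched, and not only whether it matched, a small sketch can print `group()` from the match object. The variable names are only for illustration.

```python
import re

line1 = 'I am not an impresario'
line2 = 'Ivaldo'

# Greedy: .* pushes out to the last 'o' in the string
greedy = re.search('I.*o', line1).group()

# Non-greedy: .*? stops at the first 'o' it can reach
lazy = re.search('I.*?o', line1).group()

# \S cannot cross the space after 'I', so line1 does not match at all
no_spaces = re.search(r'I\S*o', line1)

# line2 has no spaces between 'I' and 'o', so it matches in full
name = re.search(r'I\S*o', line2).group()

print(greedy)
print(lazy)
print(no_spaces)
print(name)
```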

## Extracting data using regular expressions

If we want to extract data from a string in Python we can use the `findall()` function to extract all of the `substrings` which match a regular expression. Let’s use the example of wanting to extract anything that looks like a time.

In [None]:
s = 'I joined the meeting at 01:32:24 and left the meeting at 02:56:54'
lst = re.findall('[0-9][0-9]', s) # try [0-9], [0-9][0-9], [0-9]+
print(lst)
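
The three patterns suggested in the comment above can be compared directly. This sketch reuses the same sentence; the variable names are only for illustration.

```python
import re

s = 'I joined the meeting at 01:32:24 and left the meeting at 02:56:54'

single = re.findall('[0-9]', s)      # one digit per match
pairs = re.findall('[0-9][0-9]', s)  # exactly two digits per match
runs = re.findall('[0-9]+', s)       # one or more digits per match

print(single)
print(pairs)
print(runs)
```

In this sentence every run of digits is exactly two digits long, so the last two patterns return the same list.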

In [None]:
lst = list(map(int, lst))
print(lst)
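h = (lst[3]-lst[0])*60   # difference in hours, converted to minutes
m = (lst[4]-lst[1])      # difference in minutes
s = (lst[5]-lst[2])/60   # difference in seconds, converted to minutes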
t = h+m+s     
print('I spent %0.2f minutes in the meeting' %t)        
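
An alternative sketch of the same calculation, assuming every time is written as `HH:MM:SS`. It uses parentheses (introduced later in this notebook) so that `findall()` returns one tuple per time; `to_minutes` is a hypothetical helper name.

```python
import re

text = 'I joined the meeting at 01:32:24 and left the meeting at 02:56:54'

# Each pair of parentheses captures one field, so findall returns
# a list of (hours, minutes, seconds) tuples, one per time found
times = re.findall('([0-9][0-9]):([0-9][0-9]):([0-9][0-9])', text)
print(times)

def to_minutes(hms):
    """Convert an (hours, minutes, seconds) tuple of strings to minutes."""
    hours, minutes, seconds = map(int, hms)
    return hours * 60 + minutes + seconds / 60

elapsed = to_minutes(times[1]) - to_minutes(times[0])
print('I spent %0.2f minutes in the meeting' % elapsed)
```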

Another example, from the book Python for Everybody:

In [None]:
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall(r'\S+@\S+', s)  # also try r'\s@\S+' and r'@+\S'
print(lst)

Translating the regular expression `\S+@\S+`, we are looking for substrings that have at least one non-whitespace character, followed by an `@`, followed by at least one more non-whitespace character. The `\S+` matches as many non-whitespace characters as possible.

The regular expression would match twice (`csev@umich.edu` and `cwen@iupui.edu`), but it would not match the string “@2PM” because there are no non-blank characters before the at-sign. We can use this regular expression in a program that reads all the lines in a file and prints anything that looks like an email address.
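
A minimal sketch of such a program. So that it runs on its own, it reads from an in-memory stand-in (`io.StringIO`) with a few sample lines instead of a real file; with a real file you would use `open(...)` in the same loop.

```python
import io
import re

# Hypothetical stand-in for a mail file such as mbox-short.txt
fake_file = io.StringIO(
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n'
    'Subject: meeting @2PM\n'
    'A message from csev@umich.edu to cwen@iupui.edu\n'
)

found = []
for line in fake_file:
    line = line.rstrip()
    # All substrings with non-whitespace on both sides of an at-sign
    addresses = re.findall(r'\S+@\S+', line)
    if len(addresses) > 0:
        print(addresses)
        found.extend(addresses)
```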

## Combining searching and extracting

Let's translate the code below:

In [None]:
fhand = open('mbox-short.txt') 
for line in fhand:
    line = line.rstrip()
    if re.search(r'^X\S*: [0-9.]+', line):
        print(line)    

`Parentheses` are another special character in regular expressions. When you add `parentheses` to a regular expression, they are ignored when matching the string. But when you are using `findall()`, parentheses indicate that while you want the whole expression to match, you are only interested in `extracting` the portion of the substring that falls inside the parentheses.

In [None]:
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: ([0-9.]+)', line)  # compare with r'^X\S*: [0-9.]+' (no parentheses)
    if len(x) > 0:
        print(x)
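
The effect of the parentheses is easiest to see on a single line. The header line below is an assumed sample in the style of the mail file, not read from it.

```python
import re

line = 'X-DSPAM-Confidence: 0.8475'  # assumed sample header line

# Without parentheses, findall returns the whole matching substring
whole = re.findall(r'^X\S*: [0-9.]+', line)

# With parentheses, findall returns only the captured part
number_only = re.findall(r'^X\S*: ([0-9.]+)', line)

print(whole)
print(number_only)

# The extracted text is still a string; convert it before doing arithmetic
value = float(number_only[0])
print(value)
```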

Now suppose we are interested in the time of day of each mail message. We look for lines of the form:

```
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
```

We can select these lines with the following regular expression:

```
^From .* [0-9][0-9]:
```

The translation of this regular expression is that we are looking for lines that start with `From ` (note the space), followed by any number of characters `(.*)`, followed by a space, followed by two digits `[0-9][0-9]`, followed by a colon character. This is the definition of the kinds of lines we are looking for. In order to pull out only the hour using `findall()`, we add parentheses around the two digits as follows:
```
^From .* ([0-9][0-9]):
```

In [None]:
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    x = re.findall('^From .* ([0-9][0-9]):', line)
    if len(x) > 0: 
        print(x)
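
A small self-contained check of this pattern, using the sample line shown above plus a `From:` header line (an assumed example) that should be skipped because it has no space right after `From`.

```python
import re

from_line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
header_line = 'From: stephen.marquard@uct.ac.za'  # assumed example

# Only the two digits inside the parentheses are returned
hours = re.findall('^From .* ([0-9][0-9]):', from_line)
skipped = re.findall('^From .* ([0-9][0-9]):', header_line)

print(hours)
print(skipped)
```

Only `09` is captured: the minutes and seconds are preceded by a colon, not a space, so they do not fit the pattern.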

Let's redo one of the problems from your midterm exam using the `re` library.

**Get the highest number, lowest number, and calculate the sum of the numbers in the list from myList.txt file.**

In [None]:
with open("myList.txt", "r") as file:
    data = file.read()
data

In [None]:
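numbers = [float(n) for n in re.findall(r'-?[0-9]+\.?[0-9]*', data)]  # assumes the file holds plain integers or decimals
print("min: %g, max: %g and sum: %g" % (min(numbers), max(numbers), sum(numbers)))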

## Escape character


Since we use special characters in regular expressions to match the beginning or end of a line or to specify wild cards, we need a way to indicate that these characters are “normal” and that we want to match the actual character, such as a dollar sign, caret or parenthesis.

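We can indicate that we simply want to match a character by prefixing that character with a `backslash`. For example, `\$` matches a literal dollar sign, which is useful for finding money amounts. The cells below use the same idea to find the smiley `:)` in a comment, where the closing parenthesis must be escaped.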

In [None]:
comment = '''The view from my window overlooked the wall of another building and the location 
        was not convenient. But other than that everything was perfect, the whole 
        staff was very attentive :).'''
print(comment)

In [None]:
x = re.findall(r':\)', comment)
x
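
The same escaping idea applies to money amounts. In this sketch the sentence is an assumed sample; `\$` matches a literal dollar sign instead of the end of the line.

```python
import re

text = 'We just received $10.00 for cookies.'  # assumed sample sentence

# \$ is a literal dollar sign, followed by one or more digits or periods
amounts = re.findall(r'\$[0-9.]+', text)
print(amounts)

# \) is a literal closing parenthesis, as in the smiley example
smileys = re.findall(r':\)', 'the whole staff was very attentive :).')
print(smileys)
```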

<img src="images/sent_analysis.png" style="width:350px;height:175px;">

<center>Image from monkeylearn.com</center>
    
If you want to know a little more about `Sentiment Analysis`, see: https://monkeylearn.com/sentiment-analysis/    

In [None]:
:)