### Regular Expressions

In computing, a regular expression, also referred to as "regex" or "regexp", provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.

<b>Understanding Regular Expressions</b>

- Very powerful and quite cryptic<br>
- Fun once you understand them<br>
- Regular expressions are a language unto themselves<br>
- A language of "marker characters" - programming with characters<br>
- It is kind of an "old school" language - compact<br>

<b>The Regular Expression Module</b>

| SYMBOL   | DESCRIPTION                                         |
|----------|-----------------------------------------------------|
| ^        | Matches the beginning of a line                     |
| $        | Matches the end of the line                         |
| .        | Matches any character                               |
| \s       | Matches whitespace                                  |
| \S       | Matches any non-whitespace character                |
| *        | Repeats a character zero or more times              |
| *?       | Repeats a character zero or more times (non-greedy) |
| +        | Repeats a character one or more times               |
| +?       | Repeats a character one or more times (non-greedy)  |
| [aeiou]  | Matches a single character in the listed set        |
| [^XYZ]   | Matches a single character not in the listed set    |
| [a-z0-9] | The set of characters can include a range           |
| (        | Indicates where string extraction is to start       |
| )        | Indicates where string extraction is to end         |


- Before you can use regular expressions in your program, you must import the library using "import re"<br>
- You can use re.search() to see if a string matches a regular expression, similar to using the find() method for strings<br>
- You can use re.findall() to extract portions of a string that match your regular expression, similar to a combination of find() and slicing: var[5:10]
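
The comparison in the bullets above can be sketched as follows. This is a minimal illustration on a made-up line in the style of mbox-short.txt; the variable names are assumptions, not part of the original notebook.

```python
import re

line = "From: zqian@umich.edu"  # sample line, assumed for illustration

# find() returns an index (or -1); re.search() returns a match object (or None).
find_hit = line.find("From:") >= 0
search_hit = re.search("From:", line) is not None
print(find_hit, search_hit)

# Slicing needs a known position; findall() locates the text by its pattern.
sliced = line[6:]
found = re.findall("[a-z]+@[a-z.]+", line)
print(sliced, found)
```

Slicing only works here because the address happens to start at position 6, while the pattern finds the address wherever it sits in the line.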

<b>Using re.search() Like find() and startswith()</b>

Using a regular expression, we can get the same output as we get using find() and startswith().


In [7]:
hand = open("mbox-short.txt")
for line in hand:
    line = line.rstrip()
    if line.find('From:')>=0:
        print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


In [8]:
hand = open("mbox-short.txt")
for line in hand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


The two snippets above give the same output as the re.search() version below.

<b>Note:</b> Import the library first using import re.

In [9]:
import re
hand = open("mbox-short.txt")
for line in hand:
    line = line.rstrip()
    if re.search('^From:' , line):
        print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


#### Here, ^ is a special character that anchors the match to the start of the line, so only lines that begin with From: are printed

## Wild-Card Characters

- The dot character matches any character<br>
- If you add the asterisk character, the preceding character is matched "any number of times" (zero or more)

<b>Example: ^X.*:</b>

Here, <b> ^ </b> matches the start of the line,<br>
<b> . </b> matches any character,<br>
<b> * </b> repeats it zero or more times, and<br>
<b> : </b> is the last character in the match
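
As a small sketch of how ^X.*: behaves, the lines below are made up for illustration (they are not read from mbox-short.txt). Note that the last sample line also matches, because .* accepts any characters, including spaces.

```python
import re

# Hypothetical sample lines, assumed for illustration.
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "Received: from murder",
    "X Plane is behind schedule: two weeks",
]

# Keep only lines that start with X and contain a colon somewhere after it.
matches = [line for line in lines if re.search("^X.*:", line)]
for line in matches:
    print(line)
```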

## Matching and Extracting Data

- re.search() returns a match object if the string matches the regular expression and None otherwise, so it works as a True/False test<br>
- If we actually want the matching strings to be extracted, we use re.findall()

<b>Example: [0-9]+</b>

In [10]:
import re
x= "My 2 favourite numbers are 19 and 42"
y = re.findall("[0-9]+",x)
print(y)

['2', '19', '42']


<b> re.findall() </b> returns a list of zero or more substrings that match the regular expression.

In [11]:
import re
x= "My 2 favourite numbers are 19 and 42"
y = re.findall("[AEIOU]+",x)
print(y)

[]

The pattern [AEIOU]+ matches only uppercase vowels, and the string contains none, so re.findall() returns an empty list.


## Warning: Greedy Matching

The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string.

<b>Example: ^F.+:</b>

In [12]:
import re
x= "From: Using the : character"
y = re.findall("^F.+:",x)
print(y)

['From: Using the :']


## Non-Greedy Matching

Not all regular expression repeat codes are greedy. Adding ? after + or * makes the repeat match the shortest possible string instead.

In [13]:
import re
x = "From: Using the : character"
y = re.findall("^F.+?:",x)
print(y)

['From:']


Here, <b> ^F </b> matches an F at the start of the line,<br>
<b> .+? </b> is one or more characters, but not greedy, and<br>
<b> : </b> is the last character in the match
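
The same contrast applies to any repeat code. The sketch below uses a made-up string with HTML-like tags (an assumption for illustration) to show a greedy and a non-greedy pattern side by side.

```python
import re

text = "<b>bold</b> and <i>italic</i>"  # sample string, assumed for illustration

# Greedy: .+ stretches to the last > in the string.
greedy = re.findall("<.+>", text)
# Non-greedy: .+? stops at the first > it can reach.
non_greedy = re.findall("<.+?>", text)

print(greedy)
print(non_greedy)
```

The greedy version returns the whole string as a single match, while the non-greedy version returns each tag separately.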

## Fine-Tuning String Extraction

- You can refine the match for re.findall() and separately determine which portion of the match is to be extracted by using parentheses
- Parentheses are not part of the match, but they mark where the extracted string starts and stops

In [14]:
import re
x = "From stephen.marqard@utc.ac.za Sat Jan"
y = re.findall(r"\S+@\S+",x)
print(y)

['stephen.marqard@utc.ac.za']
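
To illustrate the role of parentheses described above, here is a minimal sketch. The sample line is assumed for illustration; the pattern requires the line to start with "From " but extracts only the address inside the parentheses.

```python
import re

# Sample line, assumed for illustration.
x = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# Without parentheses, findall() returns the whole match.
whole = re.findall(r"^From \S+@\S+", x)
# With parentheses, findall() returns only the parenthesized part.
only_email = re.findall(r"^From (\S+@\S+)", x)

print(whole)
print(only_email)
```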


## The Double Split Pattern 

Sometimes we split a line one way, then grab one of the pieces and split that piece again.

In [15]:
line = "From stephen.marquard@utc.ac.za"
words = line.split()
print(words)

['From', 'stephen.marquard@utc.ac.za']


In [16]:
line = "From stephen.marquard@utc.ac.za"
words = line.split()
email = words[1]
print(email)

stephen.marquard@utc.ac.za


In [17]:
line = "From stephen.marquard@utc.ac.za"
words = line.split()
email = words[1]
pieces = email.split('@')
print(pieces[1])

utc.ac.za


## The Regex Version of Double Split Pattern

In [18]:
import re
line = "From stephen.marquard@utc.ac.za Sat Jan"
y = re.findall("@([^ ]*)",line)
print(y)

['utc.ac.za']


In [19]:
import re
line = "From stephen.marquard@utc.ac.za Sat Jan"
y = re.findall("^From .*@([^ ]*)",line)
print(y)

['utc.ac.za']
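
In the pattern above, @ locates the at sign and ([^ ]*) extracts the run of non-blank characters that follows it. A rough side-by-side sketch of the two approaches is shown below; the helper names host_by_split and host_by_regex are hypothetical, introduced only for this comparison.

```python
import re

def host_by_split(line):
    """Double split: take the second word, then the part after the @."""
    return line.split()[1].split("@")[1]

def host_by_regex(line):
    """Regex version: return the host, or None if the line does not match."""
    found = re.findall("^From .*@([^ ]*)", line)
    return found[0] if found else None

line = "From stephen.marquard@utc.ac.za Sat Jan"
print(host_by_split(line))
print(host_by_regex(line))

# A line without an address: the regex version returns None,
# while the double split would raise an IndexError.
print(host_by_regex("Subject: hello"))
```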


## Spam Confidence

In [20]:
import re
hand = open("mbox-short.txt")
numlist = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall("^X-DSPAM-Confidence: ([0-9.]+)", line)
    if len(stuff) != 1: 
        continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))

Maximum: 0.9907
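
To see why the loop skips lines where len(stuff) != 1, the sketch below applies a pattern like the one above to a few made-up header lines (assumed for illustration, not read from the file). Only the confidence line yields a one-element list that can be converted with float().

```python
import re

pattern = "^X-DSPAM-Confidence: ([0-9.]+)"

# Hypothetical header lines, assumed for illustration.
sample_lines = [
    "X-DSPAM-Confidence: 0.8475",
    "X-DSPAM-Probability: 0.0000",
    "Subject: [sakai] svn commit",
]

results = [re.findall(pattern, line) for line in sample_lines]
print(results)

# Convert only the lines that produced exactly one extracted value.
values = [float(r[0]) for r in results if len(r) == 1]
print(values)
```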


## Escape Character

If you want a special regular expression character to just behave normally (most of the time), you prefix it with a backslash "\".

In [21]:
import re
x = "We just received $10.00 for cookies."
y = re.findall(r"\$[0-9.]+",x)
print(y)

['$10.00']
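
A short sketch of what the escape changes, using the same sentence: without the backslash, $ means "end of the line", so nothing can follow it and the match fails. Putting the character inside brackets is another common way to make it literal. The variable names are illustrative.

```python
import re

x = "We just received $10.00 for cookies."

# Unescaped: $ anchors to the end of the line, so no digits can follow it.
unescaped = re.findall("$[0-9.]+", x)
# Escaped: \$ matches a literal dollar sign (raw string keeps the backslash).
escaped = re.findall(r"\$[0-9.]+", x)
# Inside a character class, $ is also treated literally.
bracketed = re.findall("[$][0-9.]+", x)

print(unescaped, escaped, bracketed)
```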


<b> Assignment: Finding Numbers in a Haystack</b> <br>

In this assignment you will read through and parse a file with text and numbers. You will extract all the numbers in the file and compute the sum of the numbers.

<b> Data Files</b> <br>

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.<br>

    Sample data: regex_sum_42.txt (There are 90 values with a sum=445833)
    Actual data: regex_sum_704346.txt (There are 81 values and the sum ends with 368)

These data files are already in the folder. Make sure the files are in the same folder where you will write your Python program.

In [22]:
import re

if __name__ == "__main__":
    sm = 0  # running sum of all numbers found
    wd = 0  # count of numbers found
    with open("regex_sum_42.txt") as file:
        for line in file:
            # findall() returns every run of digits on the line (possibly none)
            for w in re.findall("[0-9]+", line):
                wd += 1
                sm += int(w)
    print("The sum for the sample text above is %d\n" % sm)

The sum for the sample text above is 445833



In [23]:
import re

if __name__ == "__main__":
    sm = 0  # running sum of all numbers found
    wd = 0  # count of numbers found
    with open("regex_sum_704346.txt") as file:
        for line in file:
            # findall() returns every run of digits on the line (possibly none)
            for w in re.findall("[0-9]+", line):
                wd += 1
                sm += int(w)
    print("The sum for the sample text above is %d\n" % sm)

The sum for the sample text above is 329368
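
The same idea can be written more compactly. The sketch below is one possible variant, not the assignment's reference solution: the helper name sum_numbers and the sample text are assumptions for illustration, and the text is kept in memory so the example does not depend on the data files.

```python
import re

def sum_numbers(text):
    """Return the sum of every run of digits found in text."""
    return sum(int(n) for n in re.findall("[0-9]+", text))

# Made-up sample text, assumed for illustration.
sample = "Why should you learn to write programs? 7746\n12 1929 8827\n"
total = sum_numbers(sample)
print(total)
```

For a file, the whole contents could be passed in with sum_numbers(open(name).read()), since re.findall() scans across line breaks.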

