# Regular Expressions

- Based on Coursera: Using Python to Access Web Data - https://www.coursera.org/learn/python-network-data/home/welcome
- Code for Dr. Charles R. Severance's presentation (attached in the repo)

### Quick Guide



- `^` - Matches the beginning of a line
- `$` - Matches the end of the line
- `.` - Matches any character (except a newline, by default)
- `\s` - Matches whitespace
- `\S` - Matches any non-whitespace character
- `*` - Repeats a character zero or more times
- `*?` - Repeats a character zero or more times (non-greedy)
- `+` - Repeats a character one or more times
- `+?` - Repeats a character one or more times (non-greedy)
- `[aeiou]` - Matches a single character in the listed set
- `[^XYZ]` - Matches a single character not in the listed set
- `[a-z0-9]` - The set of characters can include a range
- `(` - Indicates where string extraction is to start
- `)` - Indicates where string extraction is to end
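
A few of the entries above can be tried out in a short, self-contained sketch. The sample string and the address `alice@example.com` are made up for illustration and do not come from the course data.

```python
import re

# Hypothetical sample line, not taken from mbox-short.txt
sample = 'From: alice@example.com Sat Jan  5 09:14:16 2008'

# ^ anchors the pattern to the beginning of the string
starts_with_from = bool(re.search(r'^From:', sample))

# \S+ matches a run of non-whitespace characters on each side of the @
addresses = re.findall(r'\S+@\S+', sample)

# [0-9]+ matches every run of one or more digits
numbers = re.findall(r'[0-9]+', sample)

# Parentheses restrict what findall() returns to the captured part
domains = re.findall(r'@([a-z.]+)', sample)

print(starts_with_from)  # True
print(addresses)         # ['alice@example.com']
print(numbers)           # ['5', '09', '14', '16', '2008']
print(domains)           # ['example.com']
```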


### Using re.search() like find()
- Display all the lines from the `./txt/mbox-short.txt` file that contain the phrase `From:`

#### Without Regular Expressions

In [None]:
with open('./txt/mbox-short.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if line.find('From:') >= 0:
            print(line)

#### With Regular Expressions

In [None]:
import re
with open('./txt/mbox-short.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if re.search('From:', line):
            print(line)

### Using re.search() like startswith()
- Display all the lines from the `./txt/mbox-short.txt` file that start with `From:`

#### Without Regular Expressions

In [None]:
with open('./txt/mbox-short.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if line.startswith('From:'):
            print(line)

#### With Regular Expressions

In [None]:
with open('./txt/mbox-short.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if re.search('^From:', line):
            print(line)

### Matching and Extracting Data - Numbers
- `re.search()` returns a match object (truthy) if the string matches the regular expression and `None` (falsy) if it does not, which is why it can be used directly in an `if` test
- If we actually want the matching strings to be extracted, we use `re.findall()`

`[0-9]+` - one or more digits

In [None]:
x = 'My 2 favorite numbers are  19 and 42'
y = re.findall('[0-9]+', x)
print(y)
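
Strictly speaking, `re.search()` hands back a match object or `None` rather than a literal boolean, and it stops at the first hit, while `re.findall()` collects every non-overlapping hit as a list of strings. The sketch below contrasts the two on the same sentence; the variable names are only illustrative.

```python
import re

x = 'My 2 favorite numbers are  19 and 42'

# search() returns a match object for the first hit, or None
match = re.search(r'[0-9]+', x)
first = match.group() if match else None

# findall() returns a list with every non-overlapping hit
all_numbers = re.findall(r'[0-9]+', x)

# With no digits present, search() gives None and findall() an empty list
no_match = re.search(r'[0-9]+', 'no digits here')
no_hits = re.findall(r'[0-9]+', 'no digits here')

print(first)        # 2
print(all_numbers)  # ['2', '19', '42']
print(no_match)     # None
print(no_hits)      # []
```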

### Greedy and Non-Greedy Matching
- The repeat characters (`*` and `+`) push outward in both directions (greedy) to match the largest possible string

In [None]:
x = 'From: Using the : character'
y = re.findall('^F.+:', x)
y

- If you want non-greedy matching, add a `?` character right after the `*` or `+`

In [None]:
y = re.findall('^F.+?:', x)
y
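
The same contrast shows up with any pair of delimiters. As a further sketch, on a made-up string with angle-bracket tags (not part of the course material), the greedy pattern swallows everything between the first `<` and the last `>`, while the non-greedy one stops at the nearest `>`.

```python
import re

# Hypothetical sample text with several tag-like pieces
s = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ expands as far as possible before the final >
greedy = re.findall(r'<.+>', s)

# Non-greedy: .+? stops at the first > that allows a match
non_greedy = re.findall(r'<.+?>', s)

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(non_greedy)  # ['<b>', '</b>', '<i>', '</i>']
```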

### Fine-Tuning String Extraction

- To do: find an e-mail address of the sender
- First step: Let's try to find an e-mail

In [None]:
mail_header = 'From: przemek.sekula@ue.katowice.pl Sat, Jul 19 2019'
re.findall(r'\S+@\S+', mail_header)  # raw string keeps \S from being read as a string escape

- This code was OK, but what if we have more than one e-mail address in the line?

In [None]:
mail_header_2 = 'From: przemek.sekula@ue.katowice.pl Sat, Jul 19 2019 To: przemeksekula@gmail.com'
re.findall(r'\S+@\S+', mail_header_2)

- Parentheses `( )` are not part of the match themselves; they mark where the extracted string starts and stops, so `re.findall()` returns only the part inside them


In [None]:
re.findall(r'^From: (\S+@\S+)', mail_header_2)

### E-mail Domain Extraction
- Extract the domain that the e-mail has been sent from
- *Note: `[^ ]` matches any character other than a space, so it serves as `\S` here*

In [None]:
re.findall('@([^ ]+)', mail_header_2)

In [None]:
re.findall('^From: .+?@([^ ]*)', mail_header_2)
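
`[^ ]` and `\S` agree as long as the fields are separated by plain spaces. A small sketch of where they differ, assuming a hypothetical header in which a tab follows the address:

```python
import re

# Hypothetical header: a tab (not a space) follows the address
header = 'From: alice@example.com\tSat Jan  5 09:14:16 2008'

# [^ ] only excludes the space character, so the tab is swallowed
space_only = re.findall(r'@([^ ]+)', header)

# \S excludes every whitespace character, including the tab
any_whitespace = re.findall(r'@(\S+)', header)

print(space_only)      # ['example.com\tSat']
print(any_whitespace)  # ['example.com']
```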

### Spam confidence
- Given: Information about spam confidence presented as: `X-DSPAM-Confidence: 0.8475`
- To do: 
    - Extract spam confidences of all e-mails
    - Compute maximum and median spam confidence


In [None]:
import numpy as np
numlist = []
with open('./txt/mbox-short.txt') as hand:
    for line in hand:
        line = line.rstrip()
        stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
        if len(stuff) != 1:
            continue
        numlist.append(float(stuff[0]))
print('Maximum: {}, Median: {}'.format(max(numlist), np.median(numlist)))
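
For trying the extraction without the data file or NumPy, here is a minimal sketch that uses a handful of made-up header lines and the standard-library `statistics.median`. The sample values are assumptions for illustration, not results from `mbox-short.txt`.

```python
import re
from statistics import median

# Hypothetical lines standing in for ./txt/mbox-short.txt
lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'From: alice@example.com',
    'X-DSPAM-Confidence: 0.6178',
    'X-DSPAM-Confidence: 0.9907',
]

# Compile once; the group captures only the numeric part
pattern = re.compile(r'^X-DSPAM-Confidence: ([0-9.]+)')

confidences = []
for line in lines:
    match = pattern.search(line)
    if match:  # skip lines that are not confidence headers
        confidences.append(float(match.group(1)))

print('Maximum: {}, Median: {}'.format(max(confidences), median(confidences)))
# Maximum: 0.9907, Median: 0.8475
```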

### Escape Character
- If you want a special regular expression character to be matched literally, prefix it with `\`
- To do: Extract the amount (in USD) from the given sentence 

In [None]:
text = 'We just received $10.00 for cookies.'
re.findall(r'\$[0-9.]+', text)  # \$ matches a literal dollar sign

- Question: What output will we get now?

In [None]:
text = 'We just received $10.00 for cookies. Each of us received $5 as we share the income evenly.'
re.findall(r'\$[0-9.]+', text)

- You can usually achieve the same goal in several ways. The trade-off is that a more specific (and therefore more complex) regular expression is less susceptible to contaminated data.

In [None]:
text = 'We just received $10.00 for cookies. Each of us received $5 as we share the income evenly.'
re.findall(r'\$[0-9]+\.*[0-9]*', text)
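
The trade-off becomes visible when an amount is directly followed by a sentence-ending period. The sketch below uses a made-up sentence and a stricter pattern with an optional non-capturing group, `(?:\.[0-9]+)?`. This is one possible tightening, not the only one.

```python
import re

# Hypothetical text where amounts sit right before a period
text = 'Cookies cost $10.00. Each of us received $5.'

# Simple character class: the trailing period is picked up as well
simple = re.findall(r'\$[0-9.]+', text)

# Stricter: digits, then optionally one dot followed by more digits
strict = re.findall(r'\$[0-9]+(?:\.[0-9]+)?', text)

print(simple)  # ['$10.00.', '$5.']
print(strict)  # ['$10.00', '$5']
```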