# Lesson 7: Advanced Web Scraping and Data Gathering
## Topic 4: Regular Expressions (aka _regex_)

Regular expressions describe text patterns and are used to test whether such a pattern occurs in a given sequence of characters (a string). They help in manipulating textual data, which is often a prerequisite for data science projects that involve text mining.

In [1]:
import re

### Exercise 27: Use the `match` method to check whether a pattern matches at the beginning of a string. Matching is case-sensitive.

In [2]:
string1 = 'Python'
pattern = r"Python"

In [3]:
if re.match(pattern,string1):
    print("Matches!")
else:
    print("Doesn't match.")

Matches!


In [4]:
string2 = 'python'

In [5]:
if re.match(pattern,string2):
    print("Matches!")
else:
    print("Doesn't match.")

Doesn't match.
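If case should not matter, the standard `re.IGNORECASE` flag can be passed to `match`. The following is a minimal sketch; the variable names `strict` and `relaxed` are only illustrative.

```python
import re

pattern = r"Python"

# Default behaviour: case-sensitive, so 'python' does not match
strict = re.match(pattern, "python")

# re.IGNORECASE (short form: re.I) makes letter matching case-insensitive
relaxed = re.match(pattern, "python", flags=re.IGNORECASE)

print(strict)           # None
print(relaxed.group())  # python
```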


### Exercise 28: Instead of repeating the pattern in every call, we can use `compile` to create a reusable regex object and call its methods

In [6]:
prog = re.compile(pattern)
prog.match(string1)

<_sre.SRE_Match object; span=(0, 6), match='Python'>

### Exercise 29: Compiled programs return special objects, e.g. `match` objects. If the pattern doesn't match, they return `None`, so we can still use the same conditional check!

In [7]:
prog = re.compile(pattern)
if prog.match(string1) is not None:
    print("Matches!")
else:
    print("Doesn't match.")

Matches!


In [8]:
if prog.match(string2) is not None:
    print("Matches!")
else:
    print("Doesn't match.")

Doesn't match.
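Because a match object is always truthy and `None` is falsy, the result can also be tested directly in a condition. A short sketch, with `candidates` and `results` as illustrative names:

```python
import re

prog = re.compile(r"Python")
candidates = ["Python", "python", "Python 3"]

# The match result itself serves as the condition:
# a Match object counts as True, None counts as False.
results = ["Matches!" if prog.match(s) else "Doesn't match." for s in candidates]

for s, r in zip(candidates, results):
    print(s, "->", r)
```

Note that `'Python 3'` matches as well, since `match` only requires the pattern to be found at the start of the string, not to cover all of it.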


### Exercise 30: Use additional parameters in `match` to check for positional matching
The following example matches **`y`** at the 2nd position (index/pos 1).

In [9]:
prog = re.compile(r'y')

In [10]:
prog.match('Python',pos=1)

<_sre.SRE_Match object; span=(1, 2), match='y'>

In [11]:
prog = re.compile(r'thon')

In [12]:
prog.match('Python',pos=2)

<_sre.SRE_Match object; span=(2, 6), match='thon'>

In [13]:
prog.match('Marathon',pos=4)

<_sre.SRE_Match object; span=(4, 8), match='thon'>
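To make the role of `pos` clearer, the sketch below compares `match` with and without it, and also uses the optional `endpos` argument, which limits how far into the string the pattern may look. The variable names are illustrative.

```python
import re

prog = re.compile(r'thon')

# Without pos, match only tries index 0, where 'Python' has 'P'
at_start = prog.match('Python')

# With pos=2, matching is attempted at index 2 instead
at_pos2 = prog.match('Python', pos=2)

# endpos=5 hides everything from index 5 onward, so only 'tho' is visible
clipped = prog.match('Python', pos=2, endpos=5)

print(at_start)        # None
print(at_pos2.span())  # (2, 6)
print(clipped)         # None
```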

### Exercise 31: Let's see a use case. Find out which words in a list have 'ing' as their last three letters

In [14]:
prog = re.compile(r'ing')
words = ['Spring','Cycling','Ringtone']
for w in words:
    if prog.match(w,pos=len(w)-3) is not None:
        print("{} has last three letters 'ing'".format(w))
    else:
        print("{} does not have 'ing' as its last three letters".format(w))

Spring has last three letters 'ing'
Cycling has last three letters 'ing'
Ringtone does not have 'ing' as its last three letters
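For comparison, the same check can be sketched in two other ways: with the plain string method `str.endswith`, and with a regex that uses `search` together with the `$` end-of-string anchor (both are covered in later exercises). The list names are illustrative.

```python
import re

words = ['Spring', 'Cycling', 'Ringtone']

# Plain string method: no regex needed for a fixed suffix
with_str = [w for w in words if w.endswith('ing')]

# Regex alternative: '$' anchors the pattern at the end of the string
with_regex = [w for w in words if re.search(r'ing$', w)]

print(with_str)         # ['Spring', 'Cycling']
print(with_regex)       # ['Spring', 'Cycling']
print(len(with_regex))  # 2
```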


### Exercise 32: We could have used a simple string method for that. So what is powerful about regex? The answer is that it can match very complex patterns. But to see such examples, let's first explore the `search` method.

In [15]:
prog = re.compile('ing')

In [16]:
prog.search('Spring')

<_sre.SRE_Match object; span=(3, 6), match='ing'>

In [17]:
prog.search('Ringtone')

<_sre.SRE_Match object; span=(1, 4), match='ing'>

### Exercise 33: Use the `span()` method of the `match` object, returned by `search`, to locate the position of the matched pattern

In [18]:
prog = re.compile(r'ing')
words = ['Spring','Cycling','Ringtone']
for w in words:
    mt = prog.search(w)
    # Span returns a tuple of start and end positions of the match
    start_pos = mt.span()[0] # Starting position of the match
    end_pos = mt.span()[1] # Ending position of the match
    print("The word '{}' contains 'ing' in the position {}-{}".format(w,start_pos,end_pos))

The word 'Spring' contains 'ing' in the position 3-6
The word 'Cycling' contains 'ing' in the position 4-7
The word 'Ringtone' contains 'ing' in the position 1-4
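`search` reports only the first occurrence. When a word may contain the pattern more than once, the standard `finditer` method yields one match object per occurrence. A small sketch, using the word 'singing' as an assumed example:

```python
import re

prog = re.compile(r'ing')

# finditer yields a match object for every non-overlapping occurrence
spans = [(m.start(), m.end()) for m in prog.finditer('singing')]

print(spans)  # [(1, 4), (4, 7)]
```

Here `m.start()` and `m.end()` return the same two numbers as `m.span()[0]` and `m.span()[1]`.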


### Exercise 34: Examples of single-character pattern matching with `search`. Here we will also use the `group()` method, which returns the matched string.

#### Dot `.` matches any single character except newline character

In [19]:
prog = re.compile(r'py.')
print(prog.search('pygmy').group())
print(prog.search('Jupyter').group())

pyg
pyt


#### `\w` (lowercase w) matches any single letter, digit or underscore

In [20]:
prog = re.compile(r'c\wm')
print(prog.search('comedy').group())
print(prog.search('camera').group())
print(prog.search('pac_man').group())
print(prog.search('pac2man').group())

com
cam
c_m
c2m


#### `\W` (uppercase W) matches any single character that `\w` does not match

In [21]:
prog = re.compile(r'9\W11')
print(prog.search('9/11 was a terrible day!').group())
print(prog.search('9-11 was a terrible day!').group())
print(prog.search('9.11 was a terrible day!').group())
print(prog.search('Remember the terrible day 09/11?').group())

9/11
9-11
9.11
9/11


#### `\s` (lowercase s) matches a single whitespace character, such as a space, newline, tab, or carriage return.

In [22]:
prog = re.compile(r'Data\swrangling')

print(prog.search("Data wrangling is cool").group())
print("-"*80)
print("Data\twrangling is the full string")
print(prog.search("Data\twrangling is the full string").group())
print("-"*80)

print("Data\nwrangling is the full string")
print(prog.search("Data\nwrangling").group())

Data wrangling
--------------------------------------------------------------------------------
Data	wrangling is the full string
Data	wrangling
--------------------------------------------------------------------------------
Data
wrangling is the full string
Data
wrangling


#### `\d` matches numerical digits 0 - 9

In [23]:
prog = re.compile(r"score was \d\d")

print(prog.search("My score was 67").group())
print(prog.search("Your score was 73").group())

score was 67
score was 73
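These single-character classes can be combined in one pattern. The sketch below, on an assumed log-style sentence, uses `\d`, `\s`, and `\w` together to pick out a date followed by a three-letter day name.

```python
import re

# Two digits, slash, two digits, slash, four digits, one whitespace, three word characters
prog = re.compile(r'\d\d/\d\d/\d\d\d\d\s\w\w\w')

mt = prog.search("Logged on 07/04/2021 Sun at noon")
found = mt.group()

print(found)  # 07/04/2021 Sun
```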


### Exercise 35: Examples of pattern matching either at the start or end of the string

In [22]:
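def print_match(s):
    # Uses the global `prog`, which each later cell recompiles before calling
    mt = prog.search(s)  # search once and reuse the result
    if mt is None:
        print("No match")
    else:
        print(mt.group())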

#### `^` (Caret) matches a pattern at the start of the string

In [23]:
prog = re.compile(r'^India')

print_match("Russia implemented this law")
print_match("India implemented that law")
print_match("This law was implemented by India")

No match
India
No match


#### `$` (dollar sign) matches a pattern at the end of the string

In [24]:
prog = re.compile(r'Apple$')

print_match("Patent no 123456 belongs to Apple")
print_match("Patent no 345672 belongs to Samsung")
print_match("Patent no 987654 belongs to Apple")

Apple
No match
Apple
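By default `^` refers to the start of the whole string. With the standard `re.MULTILINE` flag it also matches at the start of every line. A minimal sketch on an assumed three-line text:

```python
import re

text = "India signed first\nThen India signed again\nIndia signed last"

# Default: '^' matches only at the very start of the string
default = re.findall(r'^India', text)

# re.MULTILINE: '^' also matches right after each newline
per_line = re.findall(r'^India', text, flags=re.MULTILINE)

print(default)   # ['India']
print(per_line)  # ['India', 'India']
```

The second line starts with 'Then', so it contributes no match in either mode.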


### Exercise 36: Examples of pattern matching with multiple characters

#### `*` matches 0 or more repetitions of the preceding RE

In [25]:
prog = re.compile(r'ab*')

print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")

a
ab
abbb
No match
ab
abb


#### `+` causes the resulting RE to match 1 or more repetitions of the preceding RE

In [26]:
prog = re.compile(r'ab+')

print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")

No match
ab
abbb
No match
ab
abb


#### `?` causes the resulting RE to match precisely 0 or 1 repetitions of the preceding RE

In [27]:
prog = re.compile(r'ab?')

print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")

a
ab
ab
No match
ab
ab
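The three quantifiers can be compared side by side. This sketch uses `re.fullmatch` (not used elsewhere in this lesson), which succeeds only when the pattern covers the entire string, so the differences are easier to see than with `search`.

```python
import re

samples = ['a', 'ab', 'abbb']
table = {}

for pat in (r'ab*', r'ab+', r'ab?'):
    # True where the whole sample is described by the pattern
    table[pat] = [re.fullmatch(pat, s) is not None for s in samples]
    print(pat, table[pat])
```

For `ab*` all three samples qualify, `ab+` rejects the bare `'a'`, and `ab?` rejects `'abbb'` because it allows at most one `b`.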


### Exercise 37: Greedy vs. non-greedy matching

In [28]:
prog = re.compile(r'<.*>')
print_match('<a> b <c>')

<a> b <c>


In [29]:
prog = re.compile(r'<.*?>')
print_match('<a> b <c>')

<a>
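The difference becomes more visible when collecting every match with `findall`: the greedy pattern swallows everything between the first `<` and the last `>`, while the non-greedy pattern stops at the nearest `>`. A short sketch:

```python
import re

text = '<a> b <c>'

greedy = re.findall(r'<.*>', text)   # '.*' takes as much as possible
lazy = re.findall(r'<.*?>', text)    # '.*?' takes as little as possible

print(greedy)  # ['<a> b <c>']
print(lazy)    # ['<a>', '<c>']
```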


### Exercise 38: Controlling how many repetitions to match

#### `{m}` specifies exactly `m` copies of the preceding RE. With fewer copies there is no match, and `search` returns `None`

In [30]:
prog = re.compile(r'A{3}')

print_match("ccAAAdd")
print_match("ccAAAAdd")
print_match("ccAAdd")

AAA
AAA
No match


#### `{m,n}` matches from `m` to `n` copies of the preceding RE, as many as possible. Omitting `m` specifies a lower bound of zero, and omitting `n` specifies an infinite upper bound.

In [31]:
prog = re.compile(r'A{2,4}B')

print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAAAABdd")

AAAB
No match
AAB
AAAAB


In [32]:
prog = re.compile(r'A{,3}B')

print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAAAABdd")

AAAB
AB
AAB
AAAB


In [33]:
prog = re.compile(r'A{3,}B')

print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAAAABdd")

AAAB
No match
No match
AAAAAAAB


#### `{m,n}?` specifies `m` to `n` copies of RE to match in a non-greedy fashion.

In [34]:
prog = re.compile(r'A{2,4}')
print_match("AAAAAAA")

prog = re.compile(r'A{2,4}?')
print_match("AAAAAAA")

AAAA
AA
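Note that `search` with `{m}` still succeeds inside a longer run, as the `ccAAAAdd` case shows. When the count must hold for the whole string, `re.fullmatch` is one option. The sketch below assumes a hypothetical five-digit code check; `is_five_digits` is an illustrative helper.

```python
import re

five_digits = re.compile(r'\d{5}')

def is_five_digits(s):
    # fullmatch requires the pattern to cover the entire string
    return five_digits.fullmatch(s) is not None

checks = [is_five_digits(s) for s in ('60614', '606145', '6061')]
inside_longer = five_digits.search('606145').group()

print(checks)         # [True, False, False]
print(inside_longer)  # 60614
```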


### Exercise 39: Sets of matching characters

#### `[xyz]` matches x, y, or z. Members are listed without separators; a comma inside the brackets would itself be matched.

In [35]:
prog = re.compile(r'[AB]')
print_match("ccAd")
print_match("ccABd")
print_match("ccXdB")
print_match("ccXdZ")

A
A
B
No match
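One detail about sets is worth checking: every character between the brackets is a member, including a comma. The sketch below also shows the negated form `[^...]`, which matches any single character not in the set.

```python
import re

text = 'A,B'

with_comma = re.findall(r'[A,B]', text)  # the comma is part of the set
without_comma = re.findall(r'[AB]', text)
negated = re.findall(r'[^AB]', text)     # anything except A or B

print(with_comma)     # ['A', ',', 'B']
print(without_comma)  # ['A', 'B']
print(negated)        # [',']
```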


#### A range of characters can be matched inside the set. This is one of the most widely used regex techniques!

In [36]:
prog = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.com')

print_match("My email is coolguy@xyz.com")
print_match("My email is coolguy12@xyz.com")

coolguy@xyz.com
No match


In [37]:
prog = re.compile(r'[a-zA-Z0-9]+@[a-zA-Z]+\.com')

print_match("My email is coolguy12@xyz.com")
print_match("My email is coolguy12@xyz.org")

coolguy12@xyz.com
No match


In [38]:
prog = re.compile(r'[a-zA-Z0-9]+@[a-zA-Z]+\.[a-zA-Z]{2,3}')
print_match("My email is coolguy12@xyz.org")
print_match("My email is coolguy12[AT]xyz[DOT]org")

coolguy12@xyz.org
No match


### Exercise 40: OR-ing of regex using `|`

In [39]:
prog = re.compile(r'[0-9]{10}')

print_match("3124567897")
print_match("312-456-7897")

3124567897
No match


In [108]:
prog = re.compile(r'[0-9]{10}|[0-9]{3}-[0-9]{3}-[0-9]{4}')

print_match("3124567897")
print_match("312-456-7897")

3124567897
312-456-7897


In [113]:
p1 = r'[0-9]{10}'
p2 = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
p3 = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'
p4 = r'[0-9]{3}\.[0-9]{3}\.[0-9]{4}'
pattern = p1+'|'+p2+'|'+p3+'|'+p4
prog = re.compile(pattern)

print_match("3124567897")
print_match("312-456-7897")
print_match("(312)456-7897")
print_match("312.456.7897")

3124567897
312-456-7897
(312)456-7897
312.456.7897
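When the list of alternatives grows, joining them with `'|'.join` keeps the code readable. The sketch below also swaps `search` for `fullmatch`, so that a string with extra digits is rejected rather than partially matched. The names `formats` and `is_phone` are illustrative.

```python
import re

# Same four formats as above, kept in a list (illustrative name)
formats = [
    r'[0-9]{10}',
    r'[0-9]{3}-[0-9]{3}-[0-9]{4}',
    r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}',
    r'[0-9]{3}\.[0-9]{3}\.[0-9]{4}',
]
prog = re.compile('|'.join(formats))

def is_phone(s):
    # The whole string must fit one of the alternatives
    return prog.fullmatch(s) is not None

tests = ["3124567897", "(312)456-7897", "312-5478-9999", "31245678970"]
checks = [is_phone(s) for s in tests]

print(checks)  # [True, True, False, False]
```

With `search`, the 11-digit string would still produce a match on its first ten digits, which is why `fullmatch` is used for validation here.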


### Exercise 41: The `findall` method finds all occurrences of the pattern and returns them as a list of strings

In [40]:
ph_numbers = """Here are some phone numbers.
Pick out the numbers with 312 area code: 
312-423-3456, 456-334-6721, 312-5478-9999, 
312-Not-a-Number,777.345.2317, 312.331.6789"""

print(ph_numbers)
re.findall(r'312[-.][0-9.-]+',ph_numbers)

Here are some phone numbers.
Pick out the numbers with 312 area code: 
312-423-3456, 456-334-6721, 312-5478-9999, 
312-Not-a-Number,777.345.2317, 312.331.6789


['312-423-3456', '312-5478-9999', '312.331.6789']
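The result above still contains '312-5478-9999', which has one digit too many in the middle group. A stricter sketch fixes the group lengths with `{3}` and `{4}` and adds the word-boundary token `\b` (not covered earlier in this lesson) so that a match cannot start or end in the middle of a longer run of digits. The sample text is reproduced here so the block runs on its own.

```python
import re

ph_numbers = """Here are some phone numbers.
Pick out the numbers with 312 area code:
312-423-3456, 456-334-6721, 312-5478-9999,
312-Not-a-Number,777.345.2317, 312.331.6789"""

# 312, a separator, exactly three digits, a separator, exactly four digits
strict = re.findall(r'\b312[-.][0-9]{3}[-.][0-9]{4}\b', ph_numbers)

print(strict)  # ['312-423-3456', '312.331.6789']
```

This pattern still accepts mixed separators such as '312-423.3456'; whether that is acceptable depends on the data at hand.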