# Intermediate Regex Exercises (Solution)

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function

## Exercise 1: IMDb top 100 movies

Data about the 100 highest-rated movies has been scraped from the IMDb website and stored in the file **`imdb_100.csv`** (in the **`data`** directory of the course repository).

In [2]:
# read the file into a DataFrame
import pandas as pd
path = '../data/imdb_100.csv'
imdb = pd.read_csv(path)

In [3]:
imdb.columns

Index([u'star_rating', u'title', u'content_rating', u'genre', u'duration',
       u'actors_list'],
      dtype='object')

In [4]:
# save the 'title' Series as a Python list
titles = imdb.title.tolist()

In [5]:
print(titles)

['The Shawshank Redemption', 'The Godfather', 'The Godfather: Part II', 'The Dark Knight', 'Pulp Fiction', '12 Angry Men', 'The Good, the Bad and the Ugly', 'The Lord of the Rings: The Return of the King', "Schindler's List", 'Fight Club', 'The Lord of the Rings: The Fellowship of the Ring', 'Inception', 'Star Wars: Episode V - The Empire Strikes Back', 'Forrest Gump', 'The Lord of the Rings: The Two Towers', 'Interstellar', "One Flew Over the Cuckoo's Nest", 'Seven Samurai', 'Goodfellas', 'Star Wars', 'The Matrix', 'City of God', "It's a Wonderful Life", 'The Usual Suspects', 'Se7en', 'Life Is Beautiful', 'Once Upon a Time in the West', 'The Silence of the Lambs', 'Leon: The Professional', 'City Lights', 'Spirited Away', 'The Intouchables', 'Casablanca', 'Whiplash', 'American History X', 'Modern Times', 'Saving Private Ryan', 'Raiders of the Lost Ark', 'Rear Window', 'Psycho', 'The Green Mile', 'Sunset Blvd.', 'The Pianist', 'The Dark Knight Rises', 'Gladiator', 'Terminator 2: Judgmen

Here are a few of the titles from this list:

> `titles = [..., "It's a Wonderful Life", 'The Usual Suspects', 'Se7en', ...]`

We want a revised list with the **initial article (A/An/The) removed**, without affecting the rest of the title. Here is the **expected output:**

> `clean_titles = [..., "It's a Wonderful Life", 'Usual Suspects', 'Se7en', ...]`

In [6]:
import re

In [7]:
# remove the initial article
clean_titles = [re.sub(r'^(A|An|The) ', r'', title) for title in titles]
print(clean_titles)

['Shawshank Redemption', 'Godfather', 'Godfather: Part II', 'Dark Knight', 'Pulp Fiction', '12 Angry Men', 'Good, the Bad and the Ugly', 'Lord of the Rings: The Return of the King', "Schindler's List", 'Fight Club', 'Lord of the Rings: The Fellowship of the Ring', 'Inception', 'Star Wars: Episode V - The Empire Strikes Back', 'Forrest Gump', 'Lord of the Rings: The Two Towers', 'Interstellar', "One Flew Over the Cuckoo's Nest", 'Seven Samurai', 'Goodfellas', 'Star Wars', 'Matrix', 'City of God', "It's a Wonderful Life", 'Usual Suspects', 'Se7en', 'Life Is Beautiful', 'Once Upon a Time in the West', 'Silence of the Lambs', 'Leon: The Professional', 'City Lights', 'Spirited Away', 'Intouchables', 'Casablanca', 'Whiplash', 'American History X', 'Modern Times', 'Saving Private Ryan', 'Raiders of the Lost Ark', 'Rear Window', 'Psycho', 'Green Mile', 'Sunset Blvd.', 'Pianist', 'Dark Knight Rises', 'Gladiator', 'Terminator 2: Judgment Day', 'Memento', 'Taare Zameen Par', 'Dr. Strangelove or: 
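The `^` anchor and the trailing space in the pattern are what keep a title like 'American History X' or 'Leon: The Professional' intact: the article must sit at the very start and be followed by a space. The sketch below checks this on a small hand-picked list. The names `sample` and `strip_article` are illustrative and not part of the exercise, and 'A Clockwork Orange' is included only as an assumed example of a title starting with "A".

```python
import re

# hand-picked sample (an assumption for illustration, not the full dataset)
sample = ['The Usual Suspects', 'American History X',
          'A Clockwork Orange', 'Leon: The Professional']

def strip_article(title):
    # '^' pins the match to the start of the title;
    # the space after the group requires the article to be a whole word
    return re.sub(r'^(A|An|The) ', '', title)

cleaned = [strip_article(t) for t in sample]
print(cleaned)
```

Only the first and third titles change, because the other two either start with a longer word ('American') or contain the article somewhere other than the start.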

As a **bonus task**, add the removed article to the end of the title. Here is the **expected output:**

> `better_titles = [..., "It's a Wonderful Life", 'Usual Suspects, The', 'Se7en', ...]`

In [8]:
# move the initial article to the end
better_titles = [re.sub(r'^(A|An|The) (.+)', r'\2, \1', title) for title in titles]
print(better_titles)

['Shawshank Redemption, The', 'Godfather, The', 'Godfather: Part II, The', 'Dark Knight, The', 'Pulp Fiction', '12 Angry Men', 'Good, the Bad and the Ugly, The', 'Lord of the Rings: The Return of the King, The', "Schindler's List", 'Fight Club', 'Lord of the Rings: The Fellowship of the Ring, The', 'Inception', 'Star Wars: Episode V - The Empire Strikes Back', 'Forrest Gump', 'Lord of the Rings: The Two Towers, The', 'Interstellar', "One Flew Over the Cuckoo's Nest", 'Seven Samurai', 'Goodfellas', 'Star Wars', 'Matrix, The', 'City of God', "It's a Wonderful Life", 'Usual Suspects, The', 'Se7en', 'Life Is Beautiful', 'Once Upon a Time in the West', 'Silence of the Lambs, The', 'Leon: The Professional', 'City Lights', 'Spirited Away', 'Intouchables, The', 'Casablanca', 'Whiplash', 'American History X', 'Modern Times', 'Saving Private Ryan', 'Raiders of the Lost Ark', 'Rear Window', 'Psycho', 'Green Mile, The', 'Sunset Blvd.', 'Pianist, The', 'Dark Knight Rises, The', 'Gladiator', 'Termin
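The numbered backreferences `\1` and `\2` work well here, but they can become hard to follow as a pattern grows. One possible alternative, sketched below, uses named groups so the replacement string reads more like a sentence. The group names `article` and `rest`, the helper `move_article`, and the sample list are all assumptions made for illustration.

```python
import re

# named groups instead of numbered ones (names are illustrative)
pattern = re.compile(r'^(?P<article>A|An|The) (?P<rest>.+)')

def move_article(title):
    # \g<name> refers to a named group in the replacement string
    return pattern.sub(r'\g<rest>, \g<article>', title)

# hand-picked sample, not the full dataset
sample = ['The Godfather: Part II', 'Pulp Fiction', 'A Separation']
moved = [move_article(t) for t in sample]
print(moved)
```

Titles without an initial article do not match the pattern at all, so `sub` returns them unchanged.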

## Exercise 2: FAA tower closures, revisited

A list of FAA tower closures has been copied from a [PDF](http://www.faa.gov/news/media/fct_closed.pdf) into the file **`faa.txt`**, which is stored in the **`data`** directory of the course repository.

In [9]:
# read the file into a single string
with open('../data/faa.txt') as f:
    data = f.read()

In [10]:
# examine the first 300 characters
print(data[0:300])

FAA Contract Tower Closure List
(149 FCTs)
3-22-2013
LOC
ID Facility Name City State
DHN DOTHAN RGNL DOTHAN AL
TCL TUSCALOOSA RGNL TUSCALOOSA AL
FYV DRAKE FIELD FAYETTEVILLE AR
TXK TEXARKANA RGNL-WEBB FIELD TEXARKANA AR
GEU GLENDALE MUNI GLENDALE AZ
GYR PHOENIX GOODYEAR GOODYEAR AZ
IFP LAUGHLIN/BULL


In [11]:
# create a list of tuples containing the tower IDs and their states
print(re.findall(r'([A-Z]{3}) .+ ([A-Z]{2})', data))

[('DHN', 'AL'), ('TCL', 'AL'), ('FYV', 'AR'), ('TXK', 'AR'), ('GEU', 'AZ'), ('GYR', 'AZ'), ('IFP', 'AZ'), ('RYN', 'AZ'), ('FUL', 'CA'), ('MER', 'CA'), ('OXR', 'CA'), ('RAL', 'CA'), ('RNM', 'CA'), ('SAC', 'CA'), ('SDM', 'CA'), ('SNS', 'CA'), ('VCV', 'CA'), ('WHP', 'CA'), ('WJF', 'CA'), ('BDR', 'CT'), ('DXR', 'CT'), ('GON', 'CT'), ('HFD', 'CT'), ('HVN', 'CT'), ('OXC', 'CT'), ('APF', 'FL'), ('BCT', 'FL'), ('EVB', 'FL'), ('FMY', 'FL'), ('HWO', 'FL'), ('LAL', 'FL'), ('LEE', 'FL'), ('OCF', 'FL'), ('OMN', 'FL'), ('PGD', 'FL'), ('SGJ', 'FL'), ('SPG', 'FL'), ('SUA', 'FL'), ('TIX', 'FL'), ('ABY', 'GA'), ('AHN', 'GA'), ('LZU', 'GA'), ('MCN', 'GA'), ('RYY', 'GA'), ('DBQ', 'IA'), ('IDA', 'ID'), ('LWS', 'ID'), ('PIH', 'ID'), ('SUN', 'ID'), ('ALN', 'IL'), ('BMI', 'IL'), ('DEC', 'IL'), ('MDH', 'IL'), ('UGN', 'IL'), ('BAK', 'IN'), ('GYY', 'IN'), ('HUT', 'KS'), ('IXD', 'KS'), ('MHK', 'KS'), ('OJC', 'KS'), ('TOP', 'KS'), ('OWB', 'KY'), ('PAH', 'KY'), ('DTN', 'LA'), ('BVY', 'MA'), ('EWB', 'MA'), ('LWM', '
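This pattern relies on two behaviors: `.` does not match a newline by default, so each match stays on one line, and `.+` is greedy, so it consumes as much of the line as it can before backing off to the last space that is followed by two capital letters. A rough sketch of the difference, using two lines copied from the file preview above (the variable names are illustrative):

```python
import re

# two lines taken from the preview of faa.txt shown earlier
sample = 'DHN DOTHAN RGNL DOTHAN AL\nTXK TEXARKANA RGNL-WEBB FIELD TEXARKANA AR\n'

# greedy: '.+' runs to the end of the line, then backtracks to the final ' XX'
greedy = re.findall(r'([A-Z]{3}) .+ ([A-Z]{2})', sample)
print(greedy)

# lazy: '.+?' stops at the first place where ' XX' can match,
# which picks up the start of 'RGNL' instead of the state
lazy = re.findall(r'([A-Z]{3}) .+? ([A-Z]{2})', 'DHN DOTHAN RGNL DOTHAN AL')
print(lazy)
```

The lazy variant is shown only as a contrast; for this file the greedy quantifier is the one that reaches the state code at the end of each line.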

Without changing the output, make this regular expression pattern more readable by using the **`re.VERBOSE`** option flag and adding comments.

In [12]:
print(re.findall(r'''
([A-Z]{3})\    # match group 1 is ID, then space
.+\            # any characters, then space
([A-Z]{2})     # match group 2 is state
''', data, flags=re.VERBOSE))

[('DHN', 'AL'), ('TCL', 'AL'), ('FYV', 'AR'), ('TXK', 'AR'), ('GEU', 'AZ'), ('GYR', 'AZ'), ('IFP', 'AZ'), ('RYN', 'AZ'), ('FUL', 'CA'), ('MER', 'CA'), ('OXR', 'CA'), ('RAL', 'CA'), ('RNM', 'CA'), ('SAC', 'CA'), ('SDM', 'CA'), ('SNS', 'CA'), ('VCV', 'CA'), ('WHP', 'CA'), ('WJF', 'CA'), ('BDR', 'CT'), ('DXR', 'CT'), ('GON', 'CT'), ('HFD', 'CT'), ('HVN', 'CT'), ('OXC', 'CT'), ('APF', 'FL'), ('BCT', 'FL'), ('EVB', 'FL'), ('FMY', 'FL'), ('HWO', 'FL'), ('LAL', 'FL'), ('LEE', 'FL'), ('OCF', 'FL'), ('OMN', 'FL'), ('PGD', 'FL'), ('SGJ', 'FL'), ('SPG', 'FL'), ('SUA', 'FL'), ('TIX', 'FL'), ('ABY', 'GA'), ('AHN', 'GA'), ('LZU', 'GA'), ('MCN', 'GA'), ('RYY', 'GA'), ('DBQ', 'IA'), ('IDA', 'ID'), ('LWS', 'ID'), ('PIH', 'ID'), ('SUN', 'ID'), ('ALN', 'IL'), ('BMI', 'IL'), ('DEC', 'IL'), ('MDH', 'IL'), ('UGN', 'IL'), ('BAK', 'IN'), ('GYY', 'IN'), ('HUT', 'KS'), ('IXD', 'KS'), ('MHK', 'KS'), ('OJC', 'KS'), ('TOP', 'KS'), ('OWB', 'KY'), ('PAH', 'KY'), ('DTN', 'LA'), ('BVY', 'MA'), ('EWB', 'MA'), ('LWM', '
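In verbose mode, unescaped whitespace in the pattern is ignored, which is why the solution writes each literal space as a backslash followed by a space. A character class such as `[ ]` is another way to keep a literal space, and some readers find it easier to spot. The sketch below compiles that variant and compares it with the compact pattern on two lines copied from the file preview; `tower_pattern` and the other variable names are illustrative.

```python
import re

tower_pattern = re.compile(r'''
    ([A-Z]{3})[ ]   # group 1: three-letter tower ID, then a literal space
    .+[ ]           # facility name and city (greedy), then a literal space
    ([A-Z]{2})      # group 2: two-letter state code
''', flags=re.VERBOSE)

# two lines taken from the preview of faa.txt shown earlier
sample = 'GEU GLENDALE MUNI GLENDALE AZ\nGYR PHOENIX GOODYEAR GOODYEAR AZ\n'

compact = re.findall(r'([A-Z]{3}) .+ ([A-Z]{2})', sample)
verbose = tower_pattern.findall(sample)
print(compact == verbose)
```

Compiling the pattern once also makes it easy to reuse without repeating the `flags` argument on every call.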