# Lab 4.2 - Regular Expression Practice

Now that we have grouped the data blocks, we need to identify and correct problems in the data.  The go-to tool for this task is the regular expression.  In this lab, we will point out a number of problems in the grouped data and create regular expressions to perform various tasks (matching/splitting/substitution).

## Problem 1 -- Reading in current progress

Recall that we saved the results of grouping the data in a file named `911_Deaths_Grouped.csv`.  Read in the content of this file and split the content into a list of lines.

In [2]:
# Your code here

#### Key

In [3]:
with open('911_Deaths_Grouped.csv') as f:
    content = f.read()
content[:500]

"Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.\nEdelmiro Abad, 54, Brooklyn, N.Y., Fiduciary Trust Company International, World Trade Center.\nMarie Rose Abad, 49, Keefe, Bruyette&Woods, Inc., World Trade Center.\nAndrew Anthony Abate, 37, Melville, N.Y., Cantor Fitzgerald, World Trade Center.\nVincent Paul Abate, 40, Brooklyn, N.Y., Cantor Fitzgerald, World Trade Center.\nLaurence Christopher Abel, 37, New York City, Cantor Fitzgerald, World Trade Center.\nAlona Abraham, 3"

In [4]:
grouped_lines = content.split('\n')
grouped_lines

antor Fitzgerald, World Trade Center.',
 'Anthony J. Fallone, Jr., 39, New York City, Cantor Fitzgerald, World Trade Center.',
 'Dolores Brigitte Fanelli, 38, Farmingville, N.Y., Marsh&McLennan Companies, Inc., World Trade Center.',
 'Robert John Fangman, 33, Chelsea, Mass., Flight Crew, United 175, World Trade Center.',
 'John Joseph Fanning, 54, West Hempstead, N.Y., New York City Fire Department, World Trade Center.',
 'Kathleen Anne Faragher, 33, Risk Waters Group conference attendee from Janus Capital Group, World Trade Center.',
 'Thomas James Farino, 37, Bohemia, N.Y., New York City Fire Department, World Trade Center.',
 'Nancy C. Doloszycki Farley, 45, Jersey City, N.J., Reinsurance Solutions, World Trade Center.',
 'Paige Marie Farley-Hackel, 46, Newton, Mass., Passenger, United 11, World Trade Center.',
 'Elizabeth Ann Farmer, 62, Cantor Fitzgerald contractor, World Trade Center.',
 'Douglas Jon Farnum, 33, Brooklyn, N.Y., Marsh&McLennan Companies, Inc., World Trade Center.'

## Problem 2 -- Inspecting problem lines

I have provided some examples of problems that can be found in this data set below.  Inspect the lines and determine one or more things that are problematic for each line.

In [5]:
example_idx = (0, 33, 75, 76, 150, 232, 1304, 1305, 1343)
examples = [l for i, l in enumerate(grouped_lines) if i in example_idx]
examples

["Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.",
 'Godwin O. Ajala, 33, Summit Security Services, Inc., World Trade Center, died 9/15/01.',
 'Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and Pasadena, Calif., Passenger, United 11, World Trade Center.',
 'Laura Angilletta, 23, Staten Island, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Lorraine G. Bay, 58, East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.',
 'Canfield D. Boone, ??, United States Army, Pentagon.',
 'Albert Gunnis Joseph, 79, New York City, Morgan Stanley, World Trade Center, died 1/2/02.',
 'Ingeborg Joseph, 53, Marriott guest, World Trade Center, died 10/9/01.',
 'Brenda Kegler, ??, Capitol Heights, Md., United States Army Civilian, Pentagon.']

> Your answers here

#### Key

Some problems include

1. Commas in names, companies, and locations
2. Missing ages (represented as `??`)
3. Rows missing an entry for hometowns, passenger type, flight, and date of death.
4. Non-uniform hometown/state entries.
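
One quick way to see the non-uniform structure is to count the commas on each line. The following is a minimal sketch, assuming three of the example lines above are copied in as literal sample data (the name `sample_lines` is introduced here only for illustration):

```python
from collections import Counter

# Three of the example lines shown above, copied as literal sample data.
sample_lines = [
    "Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.",
    'Canfield D. Boone, ??, United States Army, Pentagon.',
    'Lorraine G. Bay, 58, East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.',
]

# A uniform CSV would have the same number of commas on every row.
comma_counts = Counter(line.count(',') for line in sample_lines)
print(comma_counts)
```

Each of the three lines has a different comma count, so splitting on commas alone cannot recover the fields.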


## General procedure

1. Create the expression and test it against a positive case.
2. Match/search against all examples.
3. After you know it works, add the `groups` method for all examples.
4. Look for any non-matches in the full data set.
5. Test on the full data set if all rows match.
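
The steps above could be wrapped in a small helper. This is only a sketch: `check_pattern`, `sample_lines`, `two_digits`, and `first_field` are hypothetical names introduced here, not part of the lab.

```python
import re

def check_pattern(pattern, lines):
    """Return (non_matching_rows, captured_groups) for a compiled pattern.

    Groups are only extracted when every line matches, mirroring the
    procedure above; otherwise captured_groups is None.
    """
    non_matches = [(i, l) for i, l in enumerate(lines) if not pattern.search(l)]
    if non_matches:
        return non_matches, None
    return [], [pattern.search(l).groups() for l in lines]

# Two of the example lines, copied as literal sample data.
sample_lines = [
    'Laura Angilletta, 23, Staten Island, N.Y., Cantor Fitzgerald, World Trade Center.',
    'Canfield D. Boone, ??, United States Army, Pentagon.',
]

# Fails on the second line, so no groups are extracted.
two_digits = re.compile(r', (\d\d),')
print(check_pattern(two_digits, sample_lines))

# Matches both lines, so the groups are safe to extract.
first_field = re.compile(r'^([^,]+),')
print(check_pattern(first_field, sample_lines))
```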

## What's the big deal?

So why are we being so careful to make sure everything matches? It turns out that if any row fails to match, `search` returns `None`, and calling `groups` on that `None` will crash the code :/

In [6]:
import re
test = re.compile(r', \d\d,')
[test.search(l) for l in examples]

[<re.Match object; span=(21, 26), match=', 32,'>,
 <re.Match object; span=(15, 20), match=', 33,'>,
 <re.Match object; span=(24, 29), match=', 52,'>,
 <re.Match object; span=(16, 21), match=', 23,'>,
 <re.Match object; span=(15, 20), match=', 58,'>,
 None,
 <re.Match object; span=(20, 25), match=', 79,'>,
 <re.Match object; span=(15, 20), match=', 53,'>,
 None]

In [7]:
[test.search(l).groups() for l in examples]

AttributeError: 'NoneType' object has no attribute 'groups'
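
When a crash is not acceptable, one option is to check each result for `None` before calling a method on it. A minimal sketch, assuming two of the example lines as literal sample data (`sample_lines` and `safe` are names introduced here for illustration):

```python
import re

test = re.compile(r', \d\d,')

# Two of the example lines: one with a two-digit age, one with '??'.
sample_lines = [
    'Laura Angilletta, 23, Staten Island, N.Y., Cantor Fitzgerald, World Trade Center.',
    'Canfield D. Boone, ??, United States Army, Pentagon.',
]

# search returns None when nothing matches, so test the result first.
matches = [test.search(l) for l in sample_lines]
safe = [m.group() if m is not None else None for m in matches]
print(safe)  # [', 23,', None]
```

This hides the failure rather than fixing it, which is why the procedure above still checks for non-matching rows first.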

## Problem 3 -- Capturing the age field

Notice that every victim has an age field that contains either their age or `??` if the age is unknown.

In this problem, we will build a regular expression to match this field, which will ALSO allow us to capture the name field (even when there are problems with extra commas).

#### Task 1 - Capture the age field.

Write a regular expression that matches and captures the age field.  

**Hints:** Remember that 

* Use `(pat)` to capture a pattern.
* Use `\d` to match digits
* `(p1|p2)` allows you to match `p1` or `p2`.   

In [50]:
# Your code here

#### Key

In [51]:
import re
age = re.compile(r', (\?\?|\d{1,3}),')
type(age.search(examples[0]))

re.Match

We are searching for a comma and a space, followed by either two question marks or 1 to 3 digits, followed by one last comma. Only the age (`??` or the digits) is captured.
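
The same pattern can be spelled out piece by piece with the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows comments. This is a sketch for reading purposes; `age_verbose` is a name introduced here and is meant to behave like `age` above.

```python
import re

age_verbose = re.compile(r'''
    ,[ ]          # a comma and a literal space before the field
    (             # start of the captured age
      \?\?        #   either two literal question marks
      | \d{1,3}   #   or one to three digits
    )
    ,             # the comma that closes the field
''', re.VERBOSE)

known = "Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center."
unknown = 'Canfield D. Boone, ??, United States Army, Pentagon.'

print(age_verbose.search(known).groups())    # ('32',)
print(age_verbose.search(unknown).groups())  # ('??',)
```

Note that the literal space has to be written as `[ ]` (or `\ `) because plain spaces are ignored in verbose mode.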

In [52]:
examples[0], age.search(examples[0]).groups()

("Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.",
 ('32',))

In [53]:
[age.search(l) for l in examples]

[<re.Match object; span=(21, 26), match=', 32,'>,
 <re.Match object; span=(15, 20), match=', 33,'>,
 <re.Match object; span=(24, 29), match=', 52,'>,
 <re.Match object; span=(16, 21), match=', 23,'>,
 <re.Match object; span=(15, 20), match=', 58,'>,
 <re.Match object; span=(17, 22), match=', ??,'>,
 <re.Match object; span=(20, 25), match=', 79,'>,
 <re.Match object; span=(15, 20), match=', 53,'>,
 <re.Match object; span=(13, 18), match=', ??,'>]

In [54]:
[age.search(l).groups() for l in examples]

[('32',),
 ('33',),
 ('52',),
 ('23',),
 ('58',),
 ('??',),
 ('79',),
 ('53',),
 ('??',)]

In [55]:
[(i, l) for i, l in enumerate(grouped_lines) if not age.search(l)]

[]

In [56]:
[age.search(l).groups() for l in grouped_lines]

[('32',),
 ('54',),
 ('49',),
 ('37',),
 ('40',),
 ('37',),
 ('30',),
 ('55',),
 ('42',),
 ('38',),
 ('29',),
 ('37',),
 ('28',),
 ('61',),
 ('25',),
 ('51',),
 ('62',),
 ('28',),
 ('22',),
 ('36',),
 ('48',),
 ('32',),
 ('37',),
 ('36',),
 ('37',),
 ('35',),
 ('46',),
 ('30',),
 ('43',),
 ('74',),
 ('27',),
 ('47',),
 ('30',),
 ('33',),
 ('37',),
 ('37',),
 ('41',),
 ('39',),
 ('46',),
 ('25',),
 ('46',),
 ('57',),
 ('43',),
 ('51',),
 ('44',),
 ('39',),
 ('31',),
 ('30',),
 ('36',),
 ('48',),
 ('41',),
 ('31',),
 ('23',),
 ('38',),
 ('25',),
 ('60',),
 ('40',),
 ('60',),
 ('43',),
 ('41',),
 ('32',),
 ('29',),
 ('28',),
 ('42',),
 ('35',),
 ('26',),
 ('57',),
 ('53',),
 ('52',),
 ('34',),
 ('43',),
 ('37',),
 ('63',),
 ('38',),
 ('54',),
 ('52',),
 ('23',),
 ('44',),
 ('32',),
 ('48',),
 ('26',),
 ('55',),
 ('26',),
 ('26',),
 ('36',),
 ('45',),
 ('32',),
 ('38',),
 ('37',),
 ('34',),
 ('52',),
 ('29',),
 ('48',),
 ('50',),
 ('49',),
 ('37',),
 ('47',),
 ('53',),
 ('25',),
 ('21',),


#### Task 2 - Capture the age field, as well as everything before and after.

Adapt your work from the last problem to not only capture the age field, but also everything before and after.

**Hint:** Remember that 

* Use greedy wild-cards `.*` and/or `.+` to grab as much as possible.
* Use commas to anchor the three parts, e.g. `(pat1), (pat2), (pat3)`

In [57]:
# Your code here

#### Key

In [58]:
import re
age_plus = re.compile(r'(.+), (\?\?|\d{1,3}), (.+)')

In [59]:
[age_plus.search(l) for l in examples]

[<re.Match object; span=(0, 74), match="Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Part>,
 <re.Match object; span=(0, 86), match='Godwin O. Ajala, 33, Summit Security Services, In>,
 <re.Match object; span=(0, 109), match='Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and>,
 <re.Match object; span=(0, 81), match='Laura Angilletta, 23, Staten Island, N.Y., Cantor>,
 <re.Match object; span=(0, 81), match='Lorraine G. Bay, 58, East Windsor, N.J., Flight C>,
 <re.Match object; span=(0, 52), match='Canfield D. Boone, ??, United States Army, Pentag>,
 <re.Match object; span=(0, 89), match='Albert Gunnis Joseph, 79, New York City, Morgan S>,
 <re.Match object; span=(0, 70), match='Ingeborg Joseph, 53, Marriott guest, World Trade >,
 <re.Match object; span=(0, 79), match='Brenda Kegler, ??, Capitol Heights, Md., United S>]

In [60]:
[age_plus.search(l).groups() for l in examples]

[('Gordon M. Aamoth, Jr.',
  '32',
  "Sandler O'Neill + Partners, World Trade Center."),
 ('Godwin O. Ajala',
  '33',
  'Summit Security Services, Inc., World Trade Center, died 9/15/01.'),
 ('Mary Lynn Edwards Angell',
  '52',
  'Cape Cod, Mass. and Pasadena, Calif., Passenger, United 11, World Trade Center.'),
 ('Laura Angilletta',
  '23',
  'Staten Island, N.Y., Cantor Fitzgerald, World Trade Center.'),
 ('Lorraine G. Bay',
  '58',
  'East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.'),
 ('Canfield D. Boone', '??', 'United States Army, Pentagon.'),
 ('Albert Gunnis Joseph',
  '79',
  'New York City, Morgan Stanley, World Trade Center, died 1/2/02.'),
 ('Ingeborg Joseph',
  '53',
  'Marriott guest, World Trade Center, died 10/9/01.'),
 ('Brenda Kegler',
  '??',
  'Capitol Heights, Md., United States Army Civilian, Pentagon.')]

In [61]:
[(i, l) for i, l in enumerate(grouped_lines) if not age_plus.search(l)]

[]

In [62]:
[age_plus.search(l).groups() for l in grouped_lines]

, N.J., Aon Corporation, World Trade Center.'),
 ('William M. Feehan',
  '71',
  'Flushing, N.Y., New York City Fire Department, World Trade Center.'),
 ('Francis Jude Feely',
  '41',
  'Marsh&McLennan Companies, Inc., World Trade Center.'),
 ('Garth Erin Feeney',
  '25',
  'New York City, Risk Waters Group conference attendee from DataSynapse, World Trade Center.'),
 ('Sean Bernard Fegan',
  '34',
  'New York City, Fred Alger Management, Inc., World Trade Center.'),
 ('Lee S. Fehling',
  '28',
  'Wantagh, N.Y., New York City Fire Department, World Trade Center.'),
 ('Peter Adam Feidelberg',
  '34',
  'Hoboken, N.J., Aon Corporation, World Trade Center.'),
 ('Alan D. Feinberg',
  '48',
  'Marlboro, N.J., New York City Fire Department, World Trade Center.'),
 ('Rosa Maria Feliciano',
  '30',
  'Marsh&McLennan Companies, Inc., World Trade Center.'),
 ('Edward P. Felt',
  '41',
  'Matawan, N.J., Passenger, United 93, Shanksville, Pa.'),
 ('Edward Thomas Fergus, Jr.',
  '40',
  'Wilton, Co

## Problem 4 -- Capturing the date of death

While most victims of the attack died on 9/11, a few died at a later date.  Notice that those who died later have an additional field at the end of the line.

In this problem, we will build a regular expression to match this field.

In [63]:
examples[-2]

'Ingeborg Joseph, 53, Marriott guest, World Trade Center, died 10/9/01.'

#### Task 1 - Capture the date of death field.
  
Write a regular expression that matches and captures the date of death (e.g. `10/9/01`).  This expression should return `None` when this field is missing.

**Hints:** Remember that 

* Use `$` to match the end of the line.
* Escape periods to match them exactly, i.e. `\.`
* Use `\d{n,m}` to match between `n` and `m` digits
* `?` allows you to match optional patterns

In [64]:
# Your code here

#### Key

In [65]:
dod = re.compile(r'(, died \d{1,2}/\d{1,2}/\d{1,2})?(\.)?$')
dod.search(examples[-2]).groups()

(', died 10/9/01', '.')

We are searching for an optional captured sequence of `, died ` followed by three groups of 1 to 2 digits separated by `/` (the capture is `None` when the sequence is absent), followed by a captured period that may or may not occur immediately afterwards. The `$` anchors this sequence to the end of the string.
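
If only the date itself is wanted (e.g. `10/9/01` without the leading `, died `), one variation is to wrap the prefix in a non-capturing group and name the inner capture. A minimal sketch; `dod_date` and the group name `date` are introduced here for illustration:

```python
import re

# (?: ... ) groups without capturing; (?P<date> ... ) captures under a name.
dod_date = re.compile(r'(?:, died (?P<date>\d{1,2}/\d{1,2}/\d{1,2}))?\.?$')

with_date = 'Ingeborg Joseph, 53, Marriott guest, World Trade Center, died 10/9/01.'
without_date = 'Canfield D. Boone, ??, United States Army, Pentagon.'

print(dod_date.search(with_date).group('date'))     # 10/9/01
print(dod_date.search(without_date).group('date'))  # None
```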

In [66]:
[dod.search(l) for l in examples]

[<re.Match object; span=(73, 74), match='.'>,
 <re.Match object; span=(71, 86), match=', died 9/15/01.'>,
 <re.Match object; span=(108, 109), match='.'>,
 <re.Match object; span=(80, 81), match='.'>,
 <re.Match object; span=(80, 81), match='.'>,
 <re.Match object; span=(51, 52), match='.'>,
 <re.Match object; span=(75, 89), match=', died 1/2/02.'>,
 <re.Match object; span=(55, 70), match=', died 10/9/01.'>,
 <re.Match object; span=(78, 79), match='.'>]

In [67]:
[dod.search(l).groups() for l in examples]

[(None, '.'),
 (', died 9/15/01', '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (', died 1/2/02', '.'),
 (', died 10/9/01', '.'),
 (None, '.')]

In [68]:
[(i, l) for i, l in enumerate(grouped_lines) if not dod.search(l)]

[]

In [69]:
[dod.search(l).groups() for l in grouped_lines]

[(None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (', died 9/15/01', '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, '.'),
 (None, 

#### Task 2 - Capture the date of death field, as well as everything before.

Adapt your work from the last task to capture not only the date of death field, but also everything before it.

**Hint:** Remember that 

* Use greedy wild-cards `.*` and/or `.+` to grab as much as possible.
* Use commas to anchor the parts, e.g. `(pat1), (pat2), (pat3)`

In [70]:
# Your code here

age_every = re.compile(r'(.*), (\?\?|\d{1,3}),(.*)')
type(age_every.search(examples[0]))

re.Match

This is fundamentally the same as our earlier age expression, with two additional captures of any characters other than a line break (`.*`), one on each end. The commas separating the age from everything else are not captured.
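
The task description also asks for the date of death field together with everything before it. One possible sketch of that variant follows; `before_dod` is a hypothetical name, and the pattern has only been checked against the sample lines shown here, not the full file.

```python
import re

# Lazy (.*?) takes as little as possible, so the optional date group
# gets the first chance to claim ', died m/d/yy' at the end of the line.
before_dod = re.compile(r'^(.*?)(?:, died (\d{1,2}/\d{1,2}/\d{1,2}))?\.?$')

later = 'Godwin O. Ajala, 33, Summit Security Services, Inc., World Trade Center, died 9/15/01.'
same_day = 'Canfield D. Boone, ??, United States Army, Pentagon.'

print(before_dod.search(later).groups())
print(before_dod.search(same_day).groups())
```

A greedy `(.*)` in the first position would swallow the date as well, leaving the second capture as `None` on every row.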

In [71]:
[age_every.search(l) for l in examples]

[<re.Match object; span=(0, 74), match="Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Part>,
 <re.Match object; span=(0, 86), match='Godwin O. Ajala, 33, Summit Security Services, In>,
 <re.Match object; span=(0, 109), match='Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and>,
 <re.Match object; span=(0, 81), match='Laura Angilletta, 23, Staten Island, N.Y., Cantor>,
 <re.Match object; span=(0, 81), match='Lorraine G. Bay, 58, East Windsor, N.J., Flight C>,
 <re.Match object; span=(0, 52), match='Canfield D. Boone, ??, United States Army, Pentag>,
 <re.Match object; span=(0, 89), match='Albert Gunnis Joseph, 79, New York City, Morgan S>,
 <re.Match object; span=(0, 70), match='Ingeborg Joseph, 53, Marriott guest, World Trade >,
 <re.Match object; span=(0, 79), match='Brenda Kegler, ??, Capitol Heights, Md., United S>]

In [72]:
[age_every.search(l).groups() for l in examples]

[('Gordon M. Aamoth, Jr.',
  '32',
  " Sandler O'Neill + Partners, World Trade Center."),
 ('Godwin O. Ajala',
  '33',
  ' Summit Security Services, Inc., World Trade Center, died 9/15/01.'),
 ('Mary Lynn Edwards Angell',
  '52',
  ' Cape Cod, Mass. and Pasadena, Calif., Passenger, United 11, World Trade Center.'),
 ('Laura Angilletta',
  '23',
  ' Staten Island, N.Y., Cantor Fitzgerald, World Trade Center.'),
 ('Lorraine G. Bay',
  '58',
  ' East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.'),
 ('Canfield D. Boone', '??', ' United States Army, Pentagon.'),
 ('Albert Gunnis Joseph',
  '79',
  ' New York City, Morgan Stanley, World Trade Center, died 1/2/02.'),
 ('Ingeborg Joseph',
  '53',
  ' Marriott guest, World Trade Center, died 10/9/01.'),
 ('Brenda Kegler',
  '??',
  ' Capitol Heights, Md., United States Army Civilian, Pentagon.')]

In [73]:
[(i, l) for i, l in enumerate(grouped_lines) if not age_every.search(l)] #what we want to see

[]

In [74]:
[age_every.search(l).groups() for l in grouped_lines] # the displayed output is cut off in the editor, but it works

World Trade Center.'),
 ('Garth Erin Feeney',
  '25',
  ' New York City, Risk Waters Group conference attendee from DataSynapse, World Trade Center.'),
 ('Sean Bernard Fegan',
  '34',
  ' New York City, Fred Alger Management, Inc., World Trade Center.'),
 ('Lee S. Fehling',
  '28',
  ' Wantagh, N.Y., New York City Fire Department, World Trade Center.'),
 ('Peter Adam Feidelberg',
  '34',
  ' Hoboken, N.J., Aon Corporation, World Trade Center.'),
 ('Alan D. Feinberg',
  '48',
  ' Marlboro, N.J., New York City Fire Department, World Trade Center.'),
 ('Rosa Maria Feliciano',
  '30',
  ' Marsh&McLennan Companies, Inc., World Trade Center.'),
 ('Edward P. Felt',
  '41',
  ' Matawan, N.J., Passenger, United 93, Shanksville, Pa.'),
 ('Edward Thomas Fergus, Jr.',
  '40',
  ' Wilton, Conn., Cantor Fitzgerald, World Trade Center.'),
 ('George J. Ferguson III',
  '54',
  ' Teaneck, N.J., Westfalia Investments, Inc., World Trade Center.'),
 ('J. Joseph Ferguson',
  '39',
  ' Washington, D.C., Pas

## Problem 5 -- Working with passenger data

Notice that 

1. Passengers on the flights have two extra fields: passenger status and flight
2. Other victims are missing these fields.

In this problem, we will build a regular expression to match these fields and use this expression to split the data.  In the process, we will be able to add the missing fields to the other rows.

#### Task 1 - Make an expression that matches the passenger status.

Make a regular expression that matches and extracts the passenger status field.  This expression should match all lines, returning `None` for the other rows.

**Hint:** Remember that 

* `(p1|p2)` allows you to match `p1` or `p2`.   
* `?` allows you to match optional patterns

In [75]:
examples[:20]

["Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.",
 'Godwin O. Ajala, 33, Summit Security Services, Inc., World Trade Center, died 9/15/01.',
 'Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and Pasadena, Calif., Passenger, United 11, World Trade Center.',
 'Laura Angilletta, 23, Staten Island, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Lorraine G. Bay, 58, East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.',
 'Canfield D. Boone, ??, United States Army, Pentagon.',
 'Albert Gunnis Joseph, 79, New York City, Morgan Stanley, World Trade Center, died 1/2/02.',
 'Ingeborg Joseph, 53, Marriott guest, World Trade Center, died 10/9/01.',
 'Brenda Kegler, ??, Capitol Heights, Md., United States Army Civilian, Pentagon.']

In [76]:
#passenger_status = re.compile('^..*?,(?:.*?(Passenger)|.*?(Flight Crew))?')
#passenger_status = re.compile('^.*?,.*(P?a?s?s?e?n?g?e?r?)|^.*?,.*(F?l?i?g?h?t? ?C?r?e?w?)')

passenger_status = re.compile(r'^.*?,(?:.*?(Passenger|Flight\s+Crew))?')
type(passenger_status.search(examples[4]))




re.Match

This regex lazily matches every character from the start of the line up to the first comma. It then optionally continues through the rest of the line, stopping to capture `Passenger` or `Flight Crew` (allowing for more than one space) if either is present.

The passenger status is always either `Flight Crew` or `Passenger`. Therefore, a regular expression that looks for one or the other (`|`) inside an optional group (`?`), so that the capture is `None` when neither is found, works just fine.
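
For comparison, a simpler alternative is to search only for the comma-delimited field and handle the missing case in Python instead of in the pattern. This is a sketch, not the approach the task asks for (it does not match every line); `status` and `passenger_status_of` are names introduced here.

```python
import re

# Match the status only when it appears as its own comma-delimited field.
status = re.compile(r', (Passenger|Flight\s+Crew),')

def passenger_status_of(line):
    """Return 'Passenger', 'Flight Crew', or None when the field is absent."""
    m = status.search(line)
    return m.group(1) if m else None

print(passenger_status_of('Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and Pasadena, Calif., Passenger, United 11, World Trade Center.'))
print(passenger_status_of('Lorraine G. Bay, 58, East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.'))
print(passenger_status_of('Canfield D. Boone, ??, United States Army, Pentagon.'))
```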

In [77]:
[passenger_status.search(l) for l in examples] #looks good

[<re.Match object; span=(0, 17), match='Gordon M. Aamoth,'>,
 <re.Match object; span=(0, 16), match='Godwin O. Ajala,'>,
 <re.Match object; span=(0, 77), match='Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and>,
 <re.Match object; span=(0, 17), match='Laura Angilletta,'>,
 <re.Match object; span=(0, 52), match='Lorraine G. Bay, 58, East Windsor, N.J., Flight C>,
 <re.Match object; span=(0, 18), match='Canfield D. Boone,'>,
 <re.Match object; span=(0, 21), match='Albert Gunnis Joseph,'>,
 <re.Match object; span=(0, 16), match='Ingeborg Joseph,'>,
 <re.Match object; span=(0, 14), match='Brenda Kegler,'>]

In [78]:
#it works
[passenger_status.search(l).groups() for l in examples]



[(None,),
 (None,),
 ('Passenger',),
 (None,),
 ('Flight Crew',),
 (None,),
 (None,),
 (None,),
 (None,)]

In [79]:
[(i, l) for i, l in enumerate(grouped_lines) if not passenger_status.search(l)] #passes

[]

In [80]:
[passenger_status.search(l).groups() for l in grouped_lines] 

[(None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('Passenger',),
 (None,),
 (None,),
 (None,),
 (None,),
 ('Passenger',),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('Passenger',),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('Passenger',),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('Passenger',),
 ('Passenger',),
 (None,),
 (None,),
 (None,),
 ('Passenger',),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('Flight Crew',),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('Passenger

#### Task 2 - Capture the flight field

Make a regular expression that matches and extracts the flight field.  This expression should match all lines, returning `None` for the other rows.

**Hint:** Remember that 

* Look through the data file to identify the airlines.
* Use `\d` to match digits
* `(p1|p2)` allows you to match `p1` or `p2`.   
* `?` allows you to match optional patterns

In [81]:
#the four flights on 9/11 are United 93, United 175, American 77, American 11

#oddly, this dataset has a typo, referring to American 11 as "United 11"


Flight_Field = re.compile(r'^.*?,(?:.*?(United\s+93|United\s+175|United\s+11|American\s+77))?')
type(Flight_Field.search(examples[4]))


re.Match

This regex works on fundamentally the same principle as the previous one. The only difference is that it now stops to pick up one of the four flights by name (allowing for any number of spaces).
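
Rather than listing each flight, the airline and the number can be matched separately. The sketch below assumes a flight is always an airline name followed by 1 to 3 digits and a comma; `flight_general` is a name introduced here, and the pattern has only been checked against the sample lines shown.

```python
import re

# Requiring digits after the airline keeps 'United States Army' from matching.
flight_general = re.compile(r'^.*?,(?:.*?((?:United|American)\s+\d{1,3}),)?')

lines = [
    'Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and Pasadena, Calif., Passenger, United 11, World Trade Center.',
    'Lorraine G. Bay, 58, East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.',
    'Canfield D. Boone, ??, United States Army, Pentagon.',
]

flights = [flight_general.search(l).group(1) for l in lines]
print(flights)  # ['United 11', 'United 93', None]
```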

In [82]:
[Flight_Field.search(l) for l in examples] #looks good

[<re.Match object; span=(0, 17), match='Gordon M. Aamoth,'>,
 <re.Match object; span=(0, 16), match='Godwin O. Ajala,'>,
 <re.Match object; span=(0, 88), match='Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and>,
 <re.Match object; span=(0, 17), match='Laura Angilletta,'>,
 <re.Match object; span=(0, 63), match='Lorraine G. Bay, 58, East Windsor, N.J., Flight C>,
 <re.Match object; span=(0, 18), match='Canfield D. Boone,'>,
 <re.Match object; span=(0, 21), match='Albert Gunnis Joseph,'>,
 <re.Match object; span=(0, 16), match='Ingeborg Joseph,'>,
 <re.Match object; span=(0, 14), match='Brenda Kegler,'>]

In [83]:
[(i, l) for i, l in enumerate(grouped_lines) if not Flight_Field.search(l)] #passes

[]

In [84]:
[Flight_Field.search(l).groups() for l in examples]

[(None,),
 (None,),
 ('United 11',),
 (None,),
 ('United 93',),
 (None,),
 (None,),
 (None,),
 (None,)]

In [85]:
[Flight_Field.search(l).groups() for l in grouped_lines] 

[(None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('United 175',),
 (None,),
 (None,),
 (None,),
 (None,),
 ('United 93',),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('United 11',),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('American 77',),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('United 11',),
 ('United 11',),
 (None,),
 (None,),
 (None,),
 ('United 11',),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('United 11',),
 (None,),
 (None,),
 (None,),
 (None,),
 (None,),
 ('United 1

#### Task 3 - Combine the last two expressions

Now combine the last two expressions to capture the two flight fields, but also all content before and after these fields.

In [86]:
examples[:20]

["Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners, World Trade Center.",
 'Godwin O. Ajala, 33, Summit Security Services, Inc., World Trade Center, died 9/15/01.',
 'Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and Pasadena, Calif., Passenger, United 11, World Trade Center.',
 'Laura Angilletta, 23, Staten Island, N.Y., Cantor Fitzgerald, World Trade Center.',
 'Lorraine G. Bay, 58, East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.',
 'Canfield D. Boone, ??, United States Army, Pentagon.',
 'Albert Gunnis Joseph, 79, New York City, Morgan Stanley, World Trade Center, died 1/2/02.',
 'Ingeborg Joseph, 53, Marriott guest, World Trade Center, died 10/9/01.',
 'Brenda Kegler, ??, Capitol Heights, Md., United States Army Civilian, Pentagon.']

In [87]:
#This one took awhile....

#Combined_Flight_Field = re.compile('^(.*?),(?:.*?(Passenger|Flight\s+Crew), (United 93|United\s+175|United\s+11|American\s+77))?,?(.*)')
#Combined_Flight_Field = re.compile('^(.*?,.*?)(?:(Passenger|Flight\s+Crew), (United 93|United\s+175|United\s+11|American\s+77))?,?(.*)')
#Combined_Flight_Field = re.compile('^(?:(.*)(Passenger|Flight\s+Crew), (United 93|United\s+175|United\s+11|American\s+77))?,?(.*)') #last best one
#Combined_Flight_Field = re.compile('(?:(.*?)(Passenger)?,?\s?(United 11)?,([^,]*$))')
#Combined_Flight_Field = re.compile('(?:(.*?)(Passenger|Flight\s+Crew)?,?\s?(United\s+93|United\s+175|United\s+11|American\s+77)?,([^,]*$))')
#Combined_Flight_Field = re.compile('^(?:(.*,.*)(Passenger|Flight\s+Crew)?,? ?(United 93|United\s+175|United\s+11|American\s+77)?),?(.*)')
#Combined_Flight_Field = re.compile('^(.*)(Passenger|Flight\s+Crew)?,?\s?(United 93|United\s+175|United\s+11|American\s+77)?,?((?:World\s?Trade\s?Center|Shanksville,\s?Pa|Pentagon)(?:, died \d{1,2}.\d{1,2}.\d{1,2})?\.?)')
#Combined_Flight_Field = re.compile('^(.*)(?:(Passenger),\s(United 93|United\s+175|United\s+11|American\s+77), ((?:World\s?Trade\s?Center|Shanksville,\s?Pa|Pentagon)(?:, died \d{1,2}.\d{1,2}.\d{1,2})?\.?)|(Flight\s+Crew),\s(United 93|United\s+175|United\s+11|American\s+77), ((?:World\s?Trade\s?Center|Shanksville,\s?Pa|Pentagon)(?:, died \d{1,2}.\d{1,2}.\d{1,2})?\.?)|((?:World\s?Trade\s?Center|Shanksville,\s?Pa|Pentagon)(?:, died \d{1,2}.\d{1,2}.\d{1,2})?\.?))')

Combined_Flight_Field = re.compile(r'^(.*?)(?:(Passenger|Flight\s+Crew))?,\s?(?:(United 93|United\s+175|United\s+11|American\s+77))?,?\s?((?:World\s?Trade\s?Center|Shanksville,\s?Pa|Pentagon)(?:, died \d{1,2}.\d{1,2}.\d{1,2})?\.?)') #thanks again for the help prof, I was stumped for hours



type(Combined_Flight_Field.search(examples[4]))


re.Match

This expression combines the previous two steps. It scans the line lazily, capturing the passenger status and flight along the way when they are present, until it reaches one of three cases:

died at the World Trade Center, died at the Pentagon, or died in Pennsylvania

Then, it optionally includes the `died` date in the last capture, if it exists, and finishes the line. Spaces are matched with either `\s+` or `\s?`, depending on whether a given position might contain extra spaces or none at all. Further small adjustments were made until all cases matched.
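
To show how the captures can fill in the missing fields for the other rows, here is a simplified stand-in written with named groups. It is a sketch under stated assumptions, not the expression above: `record`, `parse_line`, and the group names are hypothetical, and it has only been checked against the sample lines shown here.

```python
import re

record = re.compile(
    r'^(?P<before>.*?)'
    r'(?:, (?P<status>Passenger|Flight\s+Crew), (?P<flight>(?:United|American)\s+\d{1,3}))?'
    r', (?P<site>World Trade Center|Pentagon|Shanksville, Pa)'
    r'(?:, died (?P<died>\d{1,2}/\d{1,2}/\d{1,2}))?\.?$'
)

def parse_line(line):
    """Return a dict with the same keys for every row, or None on no match."""
    m = record.search(line)
    return m.groupdict() if m else None

passenger = parse_line('Lorraine G. Bay, 58, East Windsor, N.J., Flight Crew, United 93, Shanksville, Pa.')
worker = parse_line('Godwin O. Ajala, 33, Summit Security Services, Inc., World Trade Center, died 9/15/01.')
print(passenger)
print(worker)
```

Because `groupdict` returns every named group, rows without a status, flight, or date still get those keys, with `None` as the value.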


In [88]:
[Combined_Flight_Field.search(l) for l in examples] #looks good

[<re.Match object; span=(0, 74), match="Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Part>,
 <re.Match object; span=(0, 86), match='Godwin O. Ajala, 33, Summit Security Services, In>,
 <re.Match object; span=(0, 109), match='Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and>,
 <re.Match object; span=(0, 81), match='Laura Angilletta, 23, Staten Island, N.Y., Cantor>,
 <re.Match object; span=(0, 81), match='Lorraine G. Bay, 58, East Windsor, N.J., Flight C>,
 <re.Match object; span=(0, 52), match='Canfield D. Boone, ??, United States Army, Pentag>,
 <re.Match object; span=(0, 89), match='Albert Gunnis Joseph, 79, New York City, Morgan S>,
 <re.Match object; span=(0, 70), match='Ingeborg Joseph, 53, Marriott guest, World Trade >,
 <re.Match object; span=(0, 79), match='Brenda Kegler, ??, Capitol Heights, Md., United S>]

In [89]:
[(i, l) for i, l in enumerate(grouped_lines) if not Combined_Flight_Field.search(l)] #passes

[]

In [90]:
[Combined_Flight_Field.search(l).groups() for l in examples]

[("Gordon M. Aamoth, Jr., 32, Sandler O'Neill + Partners",
  None,
  None,
  'World Trade Center.'),
 ('Godwin O. Ajala, 33, Summit Security Services, Inc.',
  None,
  None,
  'World Trade Center, died 9/15/01.'),
 ('Mary Lynn Edwards Angell, 52, Cape Cod, Mass. and Pasadena, Calif., ',
  'Passenger',
  'United 11',
  'World Trade Center.'),
 ('Laura Angilletta, 23, Staten Island, N.Y., Cantor Fitzgerald',
  None,
  None,
  'World Trade Center.'),
 ('Lorraine G. Bay, 58, East Windsor, N.J., ',
  'Flight Crew',
  'United 93',
  'Shanksville, Pa.'),
 ('Canfield D. Boone, ??, United States Army', None, None, 'Pentagon.'),
 ('Albert Gunnis Joseph, 79, New York City, Morgan Stanley',
  None,
  None,
  'World Trade Center, died 1/2/02.'),
 ('Ingeborg Joseph, 53, Marriott guest',
  None,
  None,
  'World Trade Center, died 10/9/01.'),
 ('Brenda Kegler, ??, Capitol Heights, Md., United States Army Civilian',
  None,
  None,
  'Pentagon.')]

In [91]:
[Combined_Flight_Field.search(l).groups() for l in grouped_lines] # the displayed output is still cut off, but it does work

ouise Fialko, 29, Teaneck, N.J., Aon Corporation',
  None,
  None,
  'World Trade Center.'),
 ('Kristen Nicole Fiedel, 27, Marsh&McLennan Companies, Inc.',
  None,
  None,
  'World Trade Center.'),
 ('Amelia V. Fields, 46, Dumfries, Va., United States Army Civilian',
  None,
  None,
  'Pentagon.'),
 ('Samuel Fields, 36, Summit Security Services, Inc.',
  None,
  None,
  'World Trade Center.'),
 ('Alexander Milan Filipov, 70, Concord, Mass., ',
  'Passenger',
  'United 11',
  'World Trade Center.'),
 ('Michael Bradley Finnegan, 37, Basking Ridge, N.J., Cantor Fitzgerald',
  None,
  None,
  'World Trade Center.'),
 ('Timothy J. Finnerty, 33, Glen Rock, N.J., Cantor Fitzgerald',
  None,
  None,
  'World Trade Center.'),
 ('Michael C. Fiore, 46, Staten Island, N.Y., New York City Fire Department',
  None,
  None,
  'World Trade Center.'),
 ('Stephen J. Fiorelli, 43, Aberdeen, N.J., Port Authority of New York and New Jersey',
  None,
  None,
  'World Trade Center.'),
 ('Paul M. Fiori, 31, C