# Basic Regex Application

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function

## Exercise 1: Baltimore homicides

A Baltimore newspaper's website contains a map of all homicides in the city, including details about the victims. That map has been scraped, and the data is stored in the file **`homicides.txt`** (in the **`data`** directory of the course repository).

In [2]:
# read the data into a list (each row is one list element)
with open('../data/homicides.txt') as f:
    data = [row for row in f]

In [3]:
# check the number of rows
len(data)

1250

In [4]:
# examine the first 5 rows
data[0:5]

['39.311024, -76.674227, iconHomicideShooting, \'p2\', \'<dl><dt>Leon Nelson</dt><dd class="address">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>\'\n',
 '39.312641, -76.698948, iconHomicideShooting, \'p3\', \'<dl><dt>Eddie Golf</dt><dd class="address">4900 Challedon Road<br />Baltimore, MD 21207</dd><dd>black male, 26 years old</dd><dd>Found on January 2, 2007</dd><dd>Victim died at scene</dd><dd>Cause: shooting</dd></dl>\'\n',
 '39.309781, -76.649882, iconHomicideBluntForce, \'p4\', \'<dl><dt>Nelsene Burnette</dt><dd class="address">2000 West North Ave<br />Baltimore, MD 21217</dd><dd>black female, 44 years old</dd><dd>Found on January 2, 2007</dd><dd>Victim died at scene</dd><dd>Cause: blunt force</dd></dl>\'\n',
 '39.363925, -76.598772, iconHomicideAsphyxiation, \'p5\', \'<dl><dt>Thomas MacKenney</dt><dd class="address">5900 Northwood Drive<br />Balti

We want a list of the **ages** of the Baltimore homicide victims. (If the age is missing, insert `'0 years old'` as a placeholder.) Here is the **expected output:**

> `ages = ['17 years old', '26 years old', ..., '0 years old', ...]`

In [5]:
import re

In [6]:
ages = []
for row in data:
    match = re.search(r'\d+ years? old', row)
    if match:
        ages.append(match.group())
    else:
        ages.append('0 years old')

In [7]:
ages[:5]

['17 years old',
 '26 years old',
 '44 years old',
 '21 years old',
 '61 years old']

As a **bonus task**, use the `ages` list to create a second list of integers. (This does not require regular expressions.) Here is the **expected output:**

> `age_nums = [17, 26, ..., 0, ...]`
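
One possible approach to this bonus task, sketched on a small stand-in list rather than the full data: since every entry starts with the number, splitting on whitespace and converting the first token is enough. The names `ages_sample` and `age_nums_sample` are illustrative only and are not part of the original notebook.

```python
# Stand-in for the `ages` list built above (sample values, not the full data)
ages_sample = ['17 years old', '26 years old', '0 years old', '1 year old']

# The number is always the first whitespace-separated token of each string
age_nums_sample = [int(age.split()[0]) for age in ages_sample]

print(age_nums_sample)  # [17, 26, 0, 1]
```

Applying the same list comprehension to the real `ages` list should produce `age_nums`.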

As **another bonus task**, use the `age_nums` list to calculate the **mean age** of the homicide victims (excluding the zeros, which stand in for missing ages).
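
A minimal sketch of the mean calculation, assuming a list of integers in which zero marks a missing age. The sample list and the names `known_ages` and `mean_age` are hypothetical; `float()` is used so the division behaves the same way in Python 2 and Python 3.

```python
# Stand-in for `age_nums` (sample values; 0 marks a missing age)
age_nums_sample = [17, 26, 0, 44, 0, 21]

# Drop the placeholder zeros before averaging
known_ages = [age for age in age_nums_sample if age > 0]

# Guard against an empty list to avoid dividing by zero
mean_age = sum(known_ages) / float(len(known_ages)) if known_ages else None

print(mean_age)  # 27.0
```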

## Exercise 2: Baltimore homicides, revisited

Using **match groups**, create the `age_nums` list directly from the regular expression. Here is the **expected output:**

> `age_nums = [17, 26, ..., 0, ...]`
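
As a quick illustration of how match groups behave, the sketch below runs the pattern against a single shortened row (`sample_row` is an illustrative string, not a line from the file). Wrapping `\d+` in parentheses lets `group(1)` return just the digits, while `group()` still returns the whole match.

```python
import re

# Shortened, illustrative row (not taken verbatim from homicides.txt)
sample_row = '<dd>black male, 17 years old</dd>'

match = re.search(r'(\d+) years? old', sample_row)

print(match.group())   # 17 years old  (the entire match)
print(match.group(1))  # 17            (only the first parenthesized group)
```

Note that `group(1)` returns a string, so it still needs `int()` before it can be used as a number.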

In [8]:
age_nums = []
for row in data:
    match = re.search(r'(\d+) years? old', row)
    if match:
        age_nums.append(int(match.group(1)))
    else:
        age_nums.append(0)

In [11]:
print(age_nums[:20])

[17, 26, 44, 21, 61, 46, 27, 21, 16, 21, 34, 25, 23, 30, 26, 36, 21, 27, 30, 19]
