# **Basic Regex Exercise**

**Exercise 1: Baltimore homicides**

A Baltimore newspaper's website contains a map of all homicides in the city, including details about the victims. That map has been scraped, and the data is stored in the file **`homicides.txt`**.

In [1]:
#read the data into a list (each row is one list element)
with open('/content/homicides.txt') as f:
  data = [row for row in f]

In [2]:
# check the number of rows
len(data)

1250

In [4]:
# examine the first 5 rows, i.e. the first 5 elements of the list
data[0:5]

['39.311024, -76.674227, iconHomicideShooting, \'p2\', \'<dl><dt>Leon Nelson</dt><dd class="address">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>\'\n',
 '39.312641, -76.698948, iconHomicideShooting, \'p3\', \'<dl><dt>Eddie Golf</dt><dd class="address">4900 Challedon Road<br />Baltimore, MD 21207</dd><dd>black male, 26 years old</dd><dd>Found on January 2, 2007</dd><dd>Victim died at scene</dd><dd>Cause: shooting</dd></dl>\'\n',
 '39.309781, -76.649882, iconHomicideBluntForce, \'p4\', \'<dl><dt>Nelsene Burnette</dt><dd class="address">2000 West North Ave<br />Baltimore, MD 21217</dd><dd>black female, 44 years old</dd><dd>Found on January 2, 2007</dd><dd>Victim died at scene</dd><dd>Cause: blunt force</dd></dl>\'\n',
 '39.363925, -76.598772, iconHomicideAsphyxiation, \'p5\', \'<dl><dt>Thomas MacKenney</dt><dd class="address">5900 Northwood Drive<br />Balti

We want a list of the **ages** of the Baltimore homicide victims. (If the age is missing, insert a zero instead.) Here is the **expected output:**
> `ages = ['17 years old', '26 years old', ..., '0 years old', ...]`

In [5]:
import re

In [6]:
#this is your framework code
ages = []
for row in data:
    match = re.search(r'\d+ years? old', row)
    if match:
        ages.append(match.group())
    else:
        ages.append('0 years old')
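To see why the pattern uses `years?` and why the `else` branch is needed, here is a minimal sketch run on a few shortened rows. The rows in `sample_rows` are hypothetical fragments in the same format as the data (the "1 year old" row and the row with no age are assumptions made for illustration, not rows copied from the file).

```python
import re

# Hypothetical, shortened rows in the same format as homicides.txt
sample_rows = [
    "<dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd>",
    "<dd>black female, 1 year old</dd><dd>Found on March 3, 2007</dd>",
    "<dd>black male</dd><dd>Found on May 5, 2007</dd>",  # age missing
]

# 's?' makes the 's' optional, so both 'year old' and 'years old' match
pattern = re.compile(r'\d+ years? old')

sample_ages = []
for row in sample_rows:
    match = pattern.search(row)  # returns None when nothing matches
    sample_ages.append(match.group() if match else '0 years old')

print(sample_ages)
# ['17 years old', '1 year old', '0 years old']
```

Because `re.search` returns `None` when there is no match, calling `.group()` without the `if match` check would raise an `AttributeError` on the third row.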

In [7]:
print(ages[0:100])

['17 years old', '26 years old', '44 years old', '21 years old', '61 years old', '46 years old', '27 years old', '21 years old', '16 years old', '21 years old', '34 years old', '25 years old', '23 years old', '30 years old', '26 years old', '36 years old', '21 years old', '27 years old', '30 years old', '19 years old', '31 years old', '34 years old', '24 years old', '31 years old', '33 years old', '24 years old', '25 years old', '22 years old', '23 years old', '52 years old', '34 years old', '32 years old', '26 years old', '39 years old', '28 years old', '29 years old', '19 years old', '37 years old', '22 years old', '27 years old', '32 years old', '18 years old', '21 years old', '25 years old', '17 years old', '19 years old', '20 years old', '28 years old', '17 years old', '37 years old', '36 years old', '40 years old', '48 years old', '19 years old', '17 years old', '18 years old', '27 years old', '49 years old', '65 years old', '21 years old', '30 years old', '19 years old', '54 yea

Use the `ages` list to create a second list of integers. (This does not require regular expressions.) Here is the **expected output:**

> `age_nums = [17, 26, ..., 0, ...]`

In [8]:
#split the string on spaces, only keep the first element, and convert to int
age_nums = [int(element.split()[0]) for element in ages]
print(age_nums[0:100])

[17, 26, 44, 21, 61, 46, 27, 21, 16, 21, 34, 25, 23, 30, 26, 36, 21, 27, 30, 19, 31, 34, 24, 31, 33, 24, 25, 22, 23, 52, 34, 32, 26, 39, 28, 29, 19, 37, 22, 27, 32, 18, 21, 25, 17, 19, 20, 28, 17, 37, 36, 40, 48, 19, 17, 18, 27, 49, 65, 21, 30, 19, 54, 17, 39, 18, 17, 16, 23, 23, 21, 21, 39, 25, 20, 16, 45, 25, 23, 45, 29, 23, 18, 25, 35, 30, 36, 22, 16, 24, 31, 18, 31, 0, 23, 23, 24, 25, 23, 26]
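The list comprehension above packs three steps into one line. As a small sketch, here are the same steps applied to a single element, using the intermediate names `parts` and `age` (introduced here only for illustration):

```python
element = '17 years old'

parts = element.split()  # split on whitespace
print(parts)             # ['17', 'years', 'old']

age = int(parts[0])      # keep the first piece and convert it to an integer
print(age)               # 17
```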


In [9]:
# Let's go back to regex101.com and type 'my 1st string!!' to demonstrate
# match groups.
s = 'my 1st string!!'
s

'my 1st string!!'

In [10]:
re.search(r'(\d)(..)', s).group()

'1st'

In [11]:
# By creating match groups (the parts of the pattern in parentheses),
# I can refer to them by number. group(1) returns '1' because
# (\d) is the first match group.
re.search(r'(\d)(..)', s).group(1)

'1'

In [12]:
# group(2) returns 'st', the second match group. Match groups let you define
# logical groups within a regex match and retrieve each one with the group method.
re.search(r'(\d)(..)', s).group(2)

'st'
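The three calls above can be collected into one sketch. It stores the match object once, checks that a match was found, and also uses `groups()`, which returns all match groups as a tuple. The variable names (`whole`, `first`, `second`, `both`) are just illustrative.

```python
import re

s = 'my 1st string!!'
match = re.search(r'(\d)(..)', s)

if match:  # re.search returns None when there is no match
    whole = match.group(0)   # same as match.group(): the entire match
    first = match.group(1)   # text captured by (\d)
    second = match.group(2)  # text captured by (..)
    both = match.groups()    # all captured groups as a tuple
    print(whole, first, second, both)
    # 1st 1 st ('1', 'st')
```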

As **a bonus task**, use the `age_nums` list to calculate the **mean age** of a homicide victim (excluding zeros).

In [14]:
# remove zero ages
clean_age_nums = [num for num in age_nums if num != 0]
len(clean_age_nums)

1239

In [15]:
# calculate the mean age (excluding zeros)
sum(clean_age_nums)/len(clean_age_nums)

29.919289749798224
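As an alternative sketch, the standard library's `statistics.mean` does the same calculation. The short list `sample_age_nums` below is made up for illustration, and the check for an empty list is an added safeguard (dividing by `len()` of an empty list would raise a `ZeroDivisionError`).

```python
from statistics import mean

# Hypothetical short list; 0 marks a missing age
sample_age_nums = [17, 26, 0, 44, 0, 21]

# drop the placeholder zeros before averaging
known_ages = [num for num in sample_age_nums if num != 0]

# guard against an empty list, for which a mean is undefined
mean_age = mean(known_ages) if known_ages else None
print(mean_age)
```

Here the four known ages sum to 108, so the mean is 27.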

**Exercise 2: Baltimore homicides, revisited**

Using **match groups**, create the `age_nums` list directly from the regular expression. Here is the **expected output:**

> `age_nums = [17, 26, ..., 0, ...]`

In [18]:
age_nums = []
for row in data:
    match = re.search(r'(\d+) years? old', row)
    if match:
        age_nums.append(int(match.group(1)))
    else:
        age_nums.append(0)

In [19]:
print(age_nums[0:100])

[17, 26, 44, 21, 61, 46, 27, 21, 16, 21, 34, 25, 23, 30, 26, 36, 21, 27, 30, 19, 31, 34, 24, 31, 33, 24, 25, 22, 23, 52, 34, 32, 26, 39, 28, 29, 19, 37, 22, 27, 32, 18, 21, 25, 17, 19, 20, 28, 17, 37, 36, 40, 48, 19, 17, 18, 27, 49, 65, 21, 30, 19, 54, 17, 39, 18, 17, 16, 23, 23, 21, 21, 39, 25, 20, 16, 45, 25, 23, 45, 29, 23, 18, 25, 35, 30, 36, 22, 16, 24, 31, 18, 31, 0, 23, 23, 24, 25, 23, 26]
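One possible way to package the same logic is a small helper function, sketched below. The names `AGE_PATTERN` and `extract_age` are hypothetical, and `sample_rows` holds shortened rows made up for illustration.

```python
import re

# compile once so the pattern can be reused for every row
AGE_PATTERN = re.compile(r'(\d+) years? old')

def extract_age(row):
    """Return the age found in row as an int, or 0 if no age is found."""
    match = AGE_PATTERN.search(row)
    return int(match.group(1)) if match else 0

# Hypothetical, shortened rows in the same format as homicides.txt
sample_rows = [
    "<dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd>",
    "<dd>black male</dd><dd>Found on May 5, 2007</dd>",  # age missing
]

sample_age_nums = [extract_age(row) for row in sample_rows]
print(sample_age_nums)
# [17, 0]
```

With the real data, the equivalent call would be `[extract_age(row) for row in data]`.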


In [None]:
# Let's go back to regex101 to demonstrate character classes.

In [None]:
# I want to show you how we can use character classes to solve the Cause problem
# on the homicides.txt file. But first, let's try it on the regex101 platform.
# Let's copy the first string from homicides.txt into regex101.
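As a rough sketch of where character classes could help with the cause of death, the example below uses two of them: `[Cc]` matches either an uppercase or a lowercase "c", and `[a-z ]+` matches a run of lowercase letters and spaces. This is only one possible pattern, not necessarily the one used in class. The first two rows are shortened from the output shown earlier. The third row, with a lowercase "cause", and the `'unknown'` fallback are assumptions made for illustration.

```python
import re

sample_rows = [
    '<dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>',
    '<dd>Victim died at scene</dd><dd>Cause: blunt force</dd></dl>',
    '<dd>Victim died at scene</dd><dd>cause: stabbing</dd></dl>',  # hypothetical variant
]

# [Cc]     -> 'C' or 'c'
# [a-z ]+  -> one or more lowercase letters or spaces (stops at '<')
cause_pattern = re.compile(r'[Cc]ause: ([a-z ]+)')

causes = []
for row in sample_rows:
    match = cause_pattern.search(row)
    causes.append(match.group(1) if match else 'unknown')

print(causes)
# ['shooting', 'blunt force', 'stabbing']
```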