## Case Study: Data Types in Data Science

The data file, **crime_sampler.csv**, contains eight columns: the date (1st column), the block where the crime occurred (2nd), the primary type of the crime (3rd), a description of the crime (4th), a description of the location (5th), whether an arrest was made (6th), whether it was a domestic case (7th), and the city district (8th).

Here, however, you'll focus on only four columns: the date, the type of crime, the location, and whether or not the crime resulted in an arrest.

#### Reading your Data with CSV Reader and Establishing your Data Containers

In [2]:
import csv

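crime_data = []

# Open the file in a context manager so it is closed automatically
with open('crime_sampler.csv', 'r', newline='') as csvfile:
    for row in csv.reader(csvfile):
        # Keep only the date, primary type, location description, and arrest flag
        crime_data.append((row[0], row[2], row[4], row[5]))

# Remove the header row
crime_data.pop(0)

print(crime_data[:10])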

[('05/23/2016 05:35:00 PM', 'ASSAULT', 'STREET', 'false'), ('03/26/2016 08:20:00 PM', 'BURGLARY', 'SMALL RETAIL STORE', 'false'), ('04/25/2016 03:05:00 PM', 'THEFT', 'DEPARTMENT STORE', 'true'), ('04/26/2016 05:30:00 PM', 'BATTERY', 'SIDEWALK', 'false'), ('06/19/2016 01:15:00 AM', 'BATTERY', 'SIDEWALK', 'false'), ('05/28/2016 08:00:00 PM', 'BATTERY', 'GAS STATION', 'false'), ('07/03/2016 03:43:00 PM', 'THEFT', 'OTHER', 'false'), ('06/11/2016 06:55:00 PM', 'PUBLIC PEACE VIOLATION', 'STREET', 'true'), ('10/04/2016 10:20:00 AM', 'BATTERY', 'STREET', 'true'), ('02/14/2017 09:00:00 PM', 'CRIMINAL DAMAGE', 'PARK PROPERTY', 'false')]
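
As an alternative to appending the header and then removing it with `pop(0)`, the header can be consumed with `next()` before the loop starts. The following is a minimal sketch, not the notebook's own method: it reads from an in-memory `io.StringIO` instead of the real file, using two rows and the column names that appear in the notebook's outputs. The names `sample_csv`, `header`, and `sample_crime_data` are introduced here for illustration only.

```python
import csv
import io

# Two rows copied from the notebook output, standing in for crime_sampler.csv
sample_csv = io.StringIO(
    "Date,Block,Primary Type,Description,Location Description,Arrest,Domestic,District\n"
    "05/23/2016 05:35:00 PM,024XX W DIVISION ST,ASSAULT,SIMPLE,STREET,false,true,14\n"
    "09/22/2016 03:00:00 PM,027XX N SPAULDING AVE,THEFT,FROM BUILDING,APARTMENT,false,false,14\n"
)

reader = csv.reader(sample_csv)
header = next(reader)  # consume the header row before looping over the data

# Keep the date, primary type, location description, and arrest flag
sample_crime_data = [(row[0], row[2], row[4], row[5]) for row in reader]

print(header[0], header[2], header[4], header[5])
print(sample_crime_data)
```

Keeping the header in its own variable also makes it easy to check which positions correspond to which columns.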


#### Find the Months with the Highest Number of Crimes

Using the **crime_data** list from the prior cell, we'll answer a common question that arises when dealing with crime data:

*How many crimes are committed each month?*

In [3]:
crime_data

[('05/23/2016 05:35:00 PM', 'ASSAULT', 'STREET', 'false'),
 ('03/26/2016 08:20:00 PM', 'BURGLARY', 'SMALL RETAIL STORE', 'false'),
 ('04/25/2016 03:05:00 PM', 'THEFT', 'DEPARTMENT STORE', 'true'),
 ('04/26/2016 05:30:00 PM', 'BATTERY', 'SIDEWALK', 'false'),
 ('06/19/2016 01:15:00 AM', 'BATTERY', 'SIDEWALK', 'false'),
 ('05/28/2016 08:00:00 PM', 'BATTERY', 'GAS STATION', 'false'),
 ('07/03/2016 03:43:00 PM', 'THEFT', 'OTHER', 'false'),
 ('06/11/2016 06:55:00 PM', 'PUBLIC PEACE VIOLATION', 'STREET', 'true'),
 ('10/04/2016 10:20:00 AM', 'BATTERY', 'STREET', 'true'),
 ('02/14/2017 09:00:00 PM', 'CRIMINAL DAMAGE', 'PARK PROPERTY', 'false'),
 ('03/01/2017 12:29:00 AM', 'ROBBERY', 'CURRENCY EXCHANGE', 'false'),
 ('12/29/2016 10:00:00 AM', 'ASSAULT', 'APARTMENT', 'false'),
 ('01/25/2016 12:22:00 AM', 'BATTERY', 'BAR OR TAVERN', 'false'),
 ('05/20/2016 03:00:00 PM', 'DECEPTIVE PRACTICE', '', 'false'),
 ('06/28/2016 12:30:00 AM', 'DECEPTIVE PRACTICE', 'BAR OR TAVERN', 'false'),
 ('09/01/2016 03:

In [15]:
from collections import Counter
from datetime import datetime

crimes_by_month = Counter()

for crime in crime_data:
    # Parse the first element of each tuple into a datetime object: date
    date = datetime.strptime(crime[0], '%m/%d/%Y %I:%M:%S %p')
    crimes_by_month[date.month] += 1

# The 3 most common months for crime
print(crimes_by_month.most_common(3))

[(1, 1948), (2, 1862), (7, 1257)]
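
To see what the format string in the loop above is doing, it can help to parse a single timestamp on its own. This small sketch uses the first date from `crime_data`; the variable names `raw` and `parsed` are illustrative. Note that matching of the AM/PM marker (`%p`) depends on the locale, so this assumes an English or C locale.

```python
from datetime import datetime

# Timestamp copied from the first record in crime_data
raw = '05/23/2016 05:35:00 PM'

# %m/%d/%Y -> month/day/four-digit year
# %I:%M:%S %p -> 12-hour clock time followed by AM or PM
parsed = datetime.strptime(raw, '%m/%d/%Y %I:%M:%S %p')

print(parsed)        # 2016-05-23 17:35:00
print(parsed.month)  # 5
print(parsed.year)   # 2016
```

Once the string is a `datetime` object, attributes such as `.month` and `.year` are plain integers, which is why they work directly as `Counter` keys.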


#### Transforming your Data Containers to Month and Location

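Now let's flip the **crime_data** list into a dictionary keyed by month, with a list of location values for each month, and filter down to the records for the year 2016. Restricting the data to one year keeps the same calendar month from two different years from being lumped together.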

In [21]:
print(crime_data[1])
print(crime_data[1][2])

('03/26/2016 08:20:00 PM', 'BURGLARY', 'SMALL RETAIL STORE', 'false')
SMALL RETAIL STORE


In [25]:
from collections import defaultdict

locations_by_month = defaultdict(list)

for row in crime_data:
    date = datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p')
    if date.year == 2016:
        locations_by_month[date.month].append(row[2])
    
print(locations_by_month)

defaultdict(<class 'list'>, {5: ['STREET', 'GAS STATION', '', 'PARKING LOT/GARAGE(NON.RESID.)', 'RESIDENCE', 'STREET', 'RESTAURANT', 'SMALL RETAIL STORE', 'STREET', 'APARTMENT', 'SIDEWALK', 'PARKING LOT/GARAGE(NON.RESID.)', 'DEPARTMENT STORE', 'PARKING LOT/GARAGE(NON.RESID.)', 'SMALL RETAIL STORE', 'RESIDENCE', 'STREET', 'RESIDENCE', 'APARTMENT', 'RESIDENCE-GARAGE', 'APARTMENT', 'ALLEY', 'HIGHWAY/EXPRESSWAY', 'SIDEWALK', 'POLICE FACILITY/VEH PARKING LOT', 'RESIDENCE', 'STREET', 'APARTMENT', 'RESIDENCE PORCH/HALLWAY', 'STREET', 'RESIDENCE', 'SMALL RETAIL STORE', 'SCHOOL, PUBLIC, BUILDING', 'SIDEWALK', 'SCHOOL, PUBLIC, BUILDING', 'STREET', 'APARTMENT', 'STREET', 'SIDEWALK', 'SMALL RETAIL STORE', 'ALLEY', 'OTHER', 'APARTMENT', 'STREET', 'RESIDENCE', 'GROCERY FOOD STORE', 'SIDEWALK', 'SCHOOL, PUBLIC, BUILDING', 'APARTMENT', 'APARTMENT', 'PARKING LOT/GARAGE(NON.RESID.)', 'RESIDENCE', 'STREET', 'APARTMENT', 'APARTMENT', 'CURRENCY EXCHANGE', 'RESIDENTIAL YARD (FRONT/BACK)', 'ALLEY', 'CTA TRAI
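
The grouping pattern is easier to inspect on a handful of records. The sketch below repeats the same logic on four tuples copied from `crime_data` (one of them from 2017, so it is filtered out) and also shows one behavior of `defaultdict` worth knowing: reading a missing key with square brackets creates that key. The names `records` and `grouped` are illustrative.

```python
from collections import defaultdict
from datetime import datetime

# Four records copied from crime_data; the last one is from 2017
records = [
    ('05/23/2016 05:35:00 PM', 'ASSAULT', 'STREET', 'false'),
    ('03/26/2016 08:20:00 PM', 'BURGLARY', 'SMALL RETAIL STORE', 'false'),
    ('05/28/2016 08:00:00 PM', 'BATTERY', 'GAS STATION', 'false'),
    ('02/14/2017 09:00:00 PM', 'CRIMINAL DAMAGE', 'PARK PROPERTY', 'false'),
]

grouped = defaultdict(list)
for row in records:
    date = datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p')
    if date.year == 2016:
        # A missing month key is created automatically with an empty list
        grouped[date.month].append(row[2])

print(dict(grouped))  # {5: ['STREET', 'GAS STATION'], 3: ['SMALL RETAIL STORE']}

# Caution: merely reading a missing key with [] inserts it
print(8 in grouped)   # False
grouped[8]
print(8 in grouped)   # True

# .get() reads without inserting
print(grouped.get(9), 9 in grouped)  # None False
```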

#### Find the Most Common Crimes by Location Type by Month in 2016

Using the **locations_by_month** dictionary from the prior cell, we'll now determine the most common location types for crimes in each month of 2016. Because the dataset is so large, it's a good idea to use Counter to reduce one aspect of it to a more manageable size and learn more about it.

In [34]:
# Loop over the items from locations_by_month
for month, locations in locations_by_month.items():
    location_count = Counter(locations)
    print(month)
    print(location_count.most_common(5))

5
[('STREET', 241), ('RESIDENCE', 175), ('APARTMENT', 128), ('SIDEWALK', 111), ('OTHER', 41)]
3
[('STREET', 240), ('RESIDENCE', 190), ('APARTMENT', 139), ('SIDEWALK', 99), ('OTHER', 52)]
4
[('STREET', 213), ('RESIDENCE', 171), ('APARTMENT', 152), ('SIDEWALK', 96), ('OTHER', 40)]
6
[('STREET', 245), ('RESIDENCE', 164), ('APARTMENT', 159), ('SIDEWALK', 123), ('PARKING LOT/GARAGE(NON.RESID.)', 44)]
7
[('STREET', 309), ('RESIDENCE', 177), ('APARTMENT', 166), ('SIDEWALK', 125), ('OTHER', 47)]
10
[('STREET', 248), ('RESIDENCE', 206), ('APARTMENT', 122), ('SIDEWALK', 92), ('OTHER', 62)]
12
[('STREET', 207), ('RESIDENCE', 158), ('APARTMENT', 136), ('OTHER', 47), ('SIDEWALK', 46)]
1
[('STREET', 196), ('RESIDENCE', 160), ('APARTMENT', 153), ('SIDEWALK', 72), ('PARKING LOT/GARAGE(NON.RESID.)', 43)]
9
[('STREET', 279), ('RESIDENCE', 183), ('APARTMENT', 144), ('SIDEWALK', 121), ('OTHER', 39)]
11
[('STREET', 236), ('RESIDENCE', 182), ('APARTMENT', 154), ('SIDEWALK', 75), ('OTHER', 41)]
8
[('STREET',
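
The months in the output above appear in the order they were first encountered in the file, not in calendar order. One possible way to get calendar order is to iterate over `sorted()` keys. This is a sketch on the first ten records of `crime_data`, reduced to (date, location) pairs; `sample`, `sample_locations_by_month`, and `top_location_by_month` are names introduced here for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# First ten records from crime_data, reduced to (date, location) pairs
sample = [
    ('05/23/2016 05:35:00 PM', 'STREET'),
    ('03/26/2016 08:20:00 PM', 'SMALL RETAIL STORE'),
    ('04/25/2016 03:05:00 PM', 'DEPARTMENT STORE'),
    ('04/26/2016 05:30:00 PM', 'SIDEWALK'),
    ('06/19/2016 01:15:00 AM', 'SIDEWALK'),
    ('05/28/2016 08:00:00 PM', 'GAS STATION'),
    ('07/03/2016 03:43:00 PM', 'OTHER'),
    ('06/11/2016 06:55:00 PM', 'STREET'),
    ('10/04/2016 10:20:00 AM', 'STREET'),
    ('02/14/2017 09:00:00 PM', 'PARK PROPERTY'),
]

sample_locations_by_month = defaultdict(list)
for raw_date, location in sample:
    date = datetime.strptime(raw_date, '%m/%d/%Y %I:%M:%S %p')
    if date.year == 2016:
        sample_locations_by_month[date.month].append(location)

# sorted() walks the month keys in calendar order
top_location_by_month = {}
for month in sorted(sample_locations_by_month):
    counts = Counter(sample_locations_by_month[month])
    # Store the single most common (location, count) pair for later use
    top_location_by_month[month] = counts.most_common(1)[0]
    print(month, counts.most_common(2))
```

Storing the results in a dictionary, rather than only printing them, keeps them available for later cells.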

#### Reading your Data with DictReader and Establishing your Data Containers

The data file, **crime_sampler.csv**, contains the following columns in positional order: the date, the block where the crime occurred, the primary type of the crime, a description of the crime, a description of the location, whether an arrest was made, whether it was a domestic case, and the city district.

We'll now use a DictReader to load the data into a dictionary, with the district as the key and the remaining fields of each row collected in a list.

In [62]:
crimes_by_district = defaultdict(list)

# Open the file in a context manager so it is closed automatically
with open('crime_sampler.csv', 'r', newline='') as csvfile:
    for row in csv.DictReader(csvfile):
        # Pop the district from each row: district
        district = row.pop('District')
        # Append the rest of the data to the list for the proper district
        crimes_by_district[district].append(row)
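
To make the effect of `DictReader` and `pop` concrete, here is a small sketch that runs the same loop on an in-memory file holding two of the district 14 rows shown in this notebook. `sample_csv`, `sample_by_district`, and `first` are illustrative names. On Python 3.8 and later, `DictReader` yields plain `dict` rows; an `OrderedDict` in the output indicates an older Python version.

```python
import csv
import io
from collections import defaultdict

# Two rows copied from the notebook output, standing in for crime_sampler.csv
sample_csv = io.StringIO(
    "Date,Block,Primary Type,Description,Location Description,Arrest,Domestic,District\n"
    "05/23/2016 05:35:00 PM,024XX W DIVISION ST,ASSAULT,SIMPLE,STREET,false,true,14\n"
    "09/22/2016 03:00:00 PM,027XX N SPAULDING AVE,THEFT,FROM BUILDING,APARTMENT,false,false,14\n"
)

sample_by_district = defaultdict(list)
for row in csv.DictReader(sample_csv):
    # pop() returns the value and removes the key from the row
    district = row.pop('District')
    sample_by_district[district].append(row)

first = sample_by_district['14'][0]
print(list(first.keys()))       # column names, without 'District'
print(first['Primary Type'])    # ASSAULT
print(type(first['Arrest']))    # values are strings, not booleans
```

Because every value is read as a string, the arrest flag has to be compared with `'true'` rather than `True`.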

In [66]:
len(crimes_by_district)

23

In [64]:
type(crimes_by_district)

collections.defaultdict

In [65]:
crimes_by_district['14']

[OrderedDict([('Date', '05/23/2016 05:35:00 PM'),
              ('Block', '024XX W DIVISION ST'),
              ('Primary Type', 'ASSAULT'),
              ('Description', 'SIMPLE'),
              ('Location Description', 'STREET'),
              ('Arrest', 'false'),
              ('Domestic', 'true')]),
 OrderedDict([('Date', '09/22/2016 03:00:00 PM'),
              ('Block', '027XX N SPAULDING AVE'),
              ('Primary Type', 'THEFT'),
              ('Description', 'FROM BUILDING'),
              ('Location Description', 'APARTMENT'),
              ('Arrest', 'false'),
              ('Domestic', 'false')]),
 OrderedDict([('Date', '08/24/2016 05:13:00 AM'),
              ('Block', '033XX W BARRY AVE'),
              ('Primary Type', 'CRIMINAL DAMAGE'),
              ('Description', 'TO VEHICLE'),
              ('Location Description', 'STREET'),
              ('Arrest', 'false'),
              ('Domestic', 'false')]),
 OrderedDict([('Date', '05/20/2016 05:00:00 PM'),
             

#### Determine the Arrests by District by Year

Using the **crimes_by_district** dictionary from the previous cell, we'll now determine the number of arrests in each city district for each year.

In [69]:
# Loop over the crimes_by_district
for district, crimes in crimes_by_district.items():
    print(district)
    year_count = Counter()
    for crime in crimes:
        if crime['Arrest'] == 'true':
            year = datetime.strptime(crime['Date'], '%m/%d/%Y %I:%M:%S %p').year
            year_count[year] += 1
            
    print(year_count)

14
Counter({2016: 59, 2017: 8})
24
Counter({2016: 51, 2017: 10})
6
Counter({2016: 157, 2017: 32})
15
Counter({2016: 154, 2017: 16})
12
Counter({2016: 72, 2017: 9})
7
Counter({2016: 181, 2017: 27})
1
Counter({2016: 124, 2017: 15})
11
Counter({2016: 275, 2017: 53})
18
Counter({2016: 92, 2017: 17})
22
Counter({2016: 78, 2017: 12})
5
Counter({2016: 149, 2017: 30})
16
Counter({2016: 66, 2017: 9})
9
Counter({2016: 116, 2017: 17})
8
Counter({2016: 124, 2017: 26})
3
Counter({2016: 98, 2017: 18})
2
Counter({2016: 84, 2017: 15})
19
Counter({2016: 88, 2017: 11})
10
Counter({2016: 144, 2017: 20})
4
Counter({2016: 134, 2017: 15})
17
Counter({2016: 38, 2017: 5})
20
Counter({2016: 27, 2017: 8})
25
Counter({2016: 150, 2017: 26})
31
Counter({2016: 1})
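
The loop above prints each `Counter` and then discards it. If the counts are needed afterwards, one option is to keep them in a dictionary keyed by district. The sketch below assumes that approach and, rather than re-reading the file, hard-codes the counts for three districts copied from the output above; `arrests_by_district` and `total_arrests` are names introduced here.

```python
from collections import Counter

# Per-district arrest counts copied from the output above (three districts only)
arrests_by_district = {
    '14': Counter({2016: 59, 2017: 8}),
    '24': Counter({2016: 51, 2017: 10}),
    '6': Counter({2016: 157, 2017: 32}),
}

# District keys are strings, so sort them numerically rather than alphabetically
for district in sorted(arrests_by_district, key=int):
    print(district, dict(arrests_by_district[district]))

# Counters support addition, so totals across districts can be summed directly
total_arrests = sum(arrests_by_district.values(), Counter())
print(total_arrests)  # Counter({2016: 267, 2017: 50})
```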


#### Unique Crimes by City Block

The task in this cell is to get the unique set of crime types that have occurred on a couple of blocks that have been selected for you to learn more about.

In [78]:
crimes_by_block = defaultdict(list)

# Open the file in a context manager so it is closed automatically
with open('crime_sampler.csv', 'r', newline='') as csvfile:
    for row in csv.DictReader(csvfile):
        # Pop the block from each row: block
        block = row.pop('Block')
        # Append the primary crime type to the list for the proper block
        crimes_by_block[block].append(row['Primary Type'])

In [79]:
crimes_by_block['070XX S SOUTH SHORE DR']

['THEFT', 'THEFT', 'ROBBERY']
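
The list for this block contains `'THEFT'` twice because a list keeps every occurrence. If only the unique crime types are of interest, a possible variation is to collect them in a `defaultdict(set)` from the start. This sketch is not the notebook's approach; it replays the three entries shown above as (block, primary type) pairs, and `pairs` and `unique_crimes_by_block` are illustrative names.

```python
from collections import defaultdict

# (block, primary type) pairs matching the output shown for this block
pairs = [
    ('070XX S SOUTH SHORE DR', 'THEFT'),
    ('070XX S SOUTH SHORE DR', 'THEFT'),
    ('070XX S SOUTH SHORE DR', 'ROBBERY'),
]

unique_crimes_by_block = defaultdict(set)
for block, primary_type in pairs:
    # add() ignores values that are already in the set
    unique_crimes_by_block[block].add(primary_type)

# Sets are unordered, so sort for stable display
print(sorted(unique_crimes_by_block['070XX S SOUTH SHORE DR']))  # ['ROBBERY', 'THEFT']
```

The trade-off is that the set discards how many times each crime occurred, so the list version is still the right choice when counts matter.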

In [80]:
len(crimes_by_block)

9195

In [82]:
# First block
n_state_st_crimes = set(crimes_by_block['001XX N STATE ST'])
print(n_state_st_crimes)

# Second block
w_terminal_st_crimes = set(crimes_by_block['0000X W TERMINAL ST'])
print(w_terminal_st_crimes)

# Crimes on the first block that did not occur on the second
crime_differences = n_state_st_crimes - w_terminal_st_crimes
print(crime_differences)

{'ASSAULT', 'CRIMINAL DAMAGE', 'DECEPTIVE PRACTICE', 'CRIMINAL TRESPASS', 'OTHER OFFENSE', 'ROBBERY', 'THEFT', 'BATTERY'}
{'NARCOTICS', 'ASSAULT', 'CRIMINAL DAMAGE', 'DECEPTIVE PRACTICE', 'CRIMINAL TRESPASS', 'PUBLIC PEACE VIOLATION', 'OTHER OFFENSE', 'THEFT'}
{'BATTERY', 'ROBBERY'}
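
Set difference is directional: `a - b` keeps what is in `a` but not in `b`. The sketch below rebuilds the two sets from the output above and shows the other common set operations alongside it. The variable names on the left-hand side (`only_state`, `only_terminal`, `shared`, `either_not_both`) are introduced here for illustration.

```python
# Sets copied from the output above
n_state_st_crimes = {
    'ASSAULT', 'CRIMINAL DAMAGE', 'DECEPTIVE PRACTICE', 'CRIMINAL TRESPASS',
    'OTHER OFFENSE', 'ROBBERY', 'THEFT', 'BATTERY',
}
w_terminal_st_crimes = {
    'NARCOTICS', 'ASSAULT', 'CRIMINAL DAMAGE', 'DECEPTIVE PRACTICE',
    'CRIMINAL TRESPASS', 'PUBLIC PEACE VIOLATION', 'OTHER OFFENSE', 'THEFT',
}

only_state = n_state_st_crimes - w_terminal_st_crimes       # difference
only_terminal = w_terminal_st_crimes - n_state_st_crimes    # difference, reversed
shared = n_state_st_crimes & w_terminal_st_crimes           # intersection
either_not_both = n_state_st_crimes ^ w_terminal_st_crimes  # symmetric difference

# Sets are unordered, so sort for stable display
print(sorted(only_state))       # ['BATTERY', 'ROBBERY']
print(sorted(only_terminal))    # ['NARCOTICS', 'PUBLIC PEACE VIOLATION']
print(sorted(shared))
print(sorted(either_not_both))  # ['BATTERY', 'NARCOTICS', 'PUBLIC PEACE VIOLATION', 'ROBBERY']
```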
