<h1>CS 210 Project - File 2</h1>
<h4>Hypothesis</h4> The number of crimes in the areas populated with more Starbucks stores per person in the townships than other areas have a higher 911 call rate <br />
[Hypothesis statement & testing will be further discussed in file 3]
<h5>Assumptions</h5> Every people goes to the Starbucks stores in their own town and the usual assumptions of statistical analysis (randomly, identically and independently distributed data etc.)
<hr />
Here is the plan:<br />
<ol>
    ~~<li>Drop the Starbucks stores outside Montgomery County</li>~~
    <li>Obtain the population data of each township in the Montgomery County</li>
    <li>Build a new dataframe consisting of the following:</li><ul>
            <li>Township Name</li>
            <li>Township Population</li>
            <li>Number of people per store living in the township</li>
            <li>Number of 911 calls received in the township</li>
        </ul>
   <li>Create an ML model of this data and test the hypothesis</li>
</ol>


Import the libraries and pickles

In [1]:
import pandas as pd
import numpy as np
import reverse_geocoder as rg

In [2]:
starbucks = pd.read_pickle('./pickles/montgomeryStarbucksStores.pickle')
emergency = pd.read_pickle('./pickles/emergencyCalls.pickle')

<b>Step 2</b> - Obtain the population data of each township in the Montgomery County

In [3]:
# 2010 census was used to gather the following data
populations = {
    "Bala-Cynwyd": 9619,
    "King of Prussia": 19936,
    "Montgomeryville": 12624,
    "Plymouth Meeting": 6177,
    "Willow Grove": 15726,
    "Rockledge": 24998,
    "Flourtown": 4538,
    "Ardmore": 24486,
    "Horsham": 14842,
    "Wyncote": 3044,
    "Trappe": 3516,
    "Bridgeport": 4558,
    "Pottstown": 22392,
    "Trooper": 5744,
    "Narberth": 4284,
    "Norristown": 34347,
    "Glenside": 8384,
    "Jenkintown": 4426,
    "Conshohocken": 7842,
    "Wyndmoor": 5498,
    "Spring House": 3804,
    "Audubon": 8814,
    "Dresher": 5395,
    "Bryn Mawr": 3779,
    "Lansdale": 16282,
    "Sanatoga": 8378,
    "Collegeville": 5094
    }

<b>Step 3</b> - Build a new dataframe consisting of the following:<ul>
            <li>Township Name</li>
            <li>Township Population</li>
            <li>Number of people per store living in the township</li>
            <li>Number of 911 calls received in the township</li>
        </ul>

In [4]:
search_results = rg.search(list(zip(list(starbucks.Latitude), list(starbucks.Longitude))));
sbTowns = pd.DataFrame([x['name'] for x in search_results], columns=['Town'])
sbTowns.head(2)

Loading formatted geocoded file...


Unnamed: 0,Town
0,Rockledge
1,Willow Grove


In [5]:
em = emergency.drop('zip', axis="columns")
em = em.drop('timeStamp', axis="columns")
em.head()

Unnamed: 0,lat,lng,title
0,40.297876,-75.581294,EMS: BACK PAINS/INJURY
1,40.258061,-75.26468,EMS: DIABETIC EMERGENCY
2,40.121182,-75.351975,Fire: GAS-ODOR/LEAK
3,40.116153,-75.343513,EMS: CARDIAC EMERGENCY
5,40.253473,-75.283245,EMS: HEAD INJURY


The reverse geocoder works in a feasible time if we provide the lat/long pairs in advance. Thus we do the following

In [6]:
# 1 - Obtain the reverse geocoding results
coordinates = zip(list(em.lat), list(em.lng))
reverseGeocodingOutput = rg.search(list(coordinates))

In [7]:
# 2 - Obtain the indexes of rows for which the call originates from within the Montgomery County
montgomery_indexes = []
town_list = list(sbTowns.Town)
for i in range(len(reverseGeocodingOutput)):
    if reverseGeocodingOutput[i]['name'] in town_list:
        montgomery_indexes.append(i)

In [8]:
# 3 - Select the rows we have just distinguished from the 911 calls data frame
filtered_rows = []
for i in montgomery_indexes:
    filtered_rows.append(list(em.iloc[i]))
filtered_em = pd.DataFrame(filtered_rows)
filtered_em.reset_index()
filtered_em.columns = ['Lat', 'Lng', 'Title']
filtered_em.head(3)

Unnamed: 0,Lat,Lng,Title
0,40.258061,-75.26468,EMS: DIABETIC EMERGENCY
1,40.121182,-75.351975,Fire: GAS-ODOR/LEAK
2,40.116153,-75.343513,EMS: CARDIAC EMERGENCY


We know that the rows in the filtered_em data frame has Starbucks stores. We need to get a data frame consisting of the town name instead of lat-long pairs.

In [9]:
coordinates = list(zip(list(filtered_em.Lat), list(filtered_em.Lng)))
em_search = rg.search(coordinates)
em_with_town = filtered_em.drop(filtered_em.columns[[0, 1]], axis="columns") # drop Lat & Lng columns

In [10]:
em_with_town['Town'] = [x['name'] for x in em_search]

In [11]:
em_with_town.head(3)

Unnamed: 0,Title,Town
0,EMS: DIABETIC EMERGENCY,Montgomeryville
1,Fire: GAS-ODOR/LEAK,Norristown
2,EMS: CARDIAC EMERGENCY,Norristown


In [12]:
sbTowns = sbTowns.Town.value_counts().reset_index()
sbTowns.columns = ['Town', 'Count']

In [13]:
sbTowns.head()

Unnamed: 0,Town,Count
0,Bala-Cynwyd,7
1,King of Prussia,6
2,Montgomeryville,5
3,Plymouth Meeting,3
4,Willow Grove,3


We've built the necessary data frames, now we need to perform merge operation

In [14]:
df = pd.merge(sbTowns, em_with_town, how="inner", on=['Town'])
df.columns=['Town', 'SBcount', 'Incident']
df.head()

Unnamed: 0,Town,SBcount,Incident
0,Bala-Cynwyd,7,Traffic: VEHICLE ACCIDENT -
1,Bala-Cynwyd,7,Traffic: DISABLED VEHICLE -
2,Bala-Cynwyd,7,Traffic: VEHICLE ACCIDENT -
3,Bala-Cynwyd,7,Traffic: VEHICLE ACCIDENT -
4,Bala-Cynwyd,7,EMS: FALL VICTIM


In [15]:
populationDF = pd.DataFrame(list(populations.keys()), list(populations.values())).reset_index()
populationDF.columns = ['Population', 'Town']
populationDF.head(3)

Unnamed: 0,Population,Town
0,9619,Bala-Cynwyd
1,19936,King of Prussia
2,12624,Montgomeryville


In [16]:
df = pd.merge(df, populationDF, how="inner", on=["Town"])

In [17]:
df

Unnamed: 0,Town,SBcount,Incident,Population
0,Bala-Cynwyd,7,Traffic: VEHICLE ACCIDENT -,9619
1,Bala-Cynwyd,7,Traffic: DISABLED VEHICLE -,9619
2,Bala-Cynwyd,7,Traffic: VEHICLE ACCIDENT -,9619
3,Bala-Cynwyd,7,Traffic: VEHICLE ACCIDENT -,9619
4,Bala-Cynwyd,7,EMS: FALL VICTIM,9619
5,Bala-Cynwyd,7,Traffic: VEHICLE ACCIDENT -,9619
6,Bala-Cynwyd,7,Fire: CARBON MONOXIDE DETECTOR,9619
7,Bala-Cynwyd,7,Traffic: DISABLED VEHICLE -,9619
8,Bala-Cynwyd,7,Traffic: VEHICLE ACCIDENT -,9619
9,Bala-Cynwyd,7,EMS: LACERATIONS,9619


In [18]:
df.Incident.value_counts()

Traffic: VEHICLE ACCIDENT -             22399
Fire: FIRE ALARM                         6245
Traffic: DISABLED VEHICLE -              6158
EMS: RESPIRATORY EMERGENCY               5586
EMS: CARDIAC EMERGENCY                   5194
EMS: FALL VICTIM                         5029
EMS: SUBJECT IN PAIN                     2979
EMS: HEAD INJURY                         2705
EMS: VEHICLE ACCIDENT                    2522
Traffic: ROAD OBSTRUCTION -              2503
EMS: UNKNOWN MEDICAL EMERGENCY           1970
EMS: SYNCOPAL EPISODE                    1818
EMS: SEIZURES                            1721
EMS: ABDOMINAL PAINS                     1554
EMS: ALTERED MENTAL STATUS               1498
EMS: MEDICAL ALERT ALARM                 1476
Fire: FIRE INVESTIGATION                 1467
EMS: GENERAL WEAKNESS                    1465
EMS: HEMORRHAGING                        1344
EMS: OVERDOSE                            1318
EMS: UNCONSCIOUS SUBJECT                 1279
EMS: CVA/STROKE                   

There are numerous types of incidents, we shall define a well defined categorical data group to ease our processing

In [19]:
def simplifyIncident(s):
    simplified = s[:s.find(':')]
    return simplified

In [20]:
df['Incident'] = df.apply(lambda x: simplifyIncident(x['Incident']), axis="columns");

In [21]:
df.head()

Unnamed: 0,Town,SBcount,Incident,Population
0,Bala-Cynwyd,7,Traffic,9619
1,Bala-Cynwyd,7,Traffic,9619
2,Bala-Cynwyd,7,Traffic,9619
3,Bala-Cynwyd,7,Traffic,9619
4,Bala-Cynwyd,7,EMS,9619


In [22]:
df.Incident.value_counts()

EMS        50209
Traffic    32522
Fire       14167
Name: Incident, dtype: int64

The ideal data set we want has the following columns:<br />
Town, SBcount, EMScount, TrafficCount, FireCount, Population<br />
We shall group our queries with respect to town and have one row for each town

In [23]:
len(df[(df.Incident == "Traffic") & (df.Town == "Bala-Cynwyd")])

1430

In [24]:
ems_counts = []
fire_counts = []
traffic_counts = []

for town in populations.keys():
    ems_counts.append( len(df[(df.Incident == "EMS") & (df.Town == town)]) )
    fire_counts.append( len(df[(df.Incident == "Fire") & (df.Town == town)] ) )
    traffic_counts.append( len(df[(df.Incident == "Traffic") & (df.Town == town)] ) )

In [25]:
len(ems_counts), len(fire_counts), len(traffic_counts)

(27, 27, 27)

Now that we have these counts, we can drop the redundant rows in the dataframe to hold a single row per town

In [26]:
df = df.drop_duplicates(subset="Town").reset_index()

In [27]:
df = df.drop('index', axis="columns")

In [28]:
df = df.drop('Incident', axis="columns")

In [29]:
df['EMScount'] = ems_counts
df['FireCount'] = fire_counts
df['TrafficCount'] = traffic_counts
total_911_count = []
for i in range(27):
    total_911_count.append(ems_counts[i] + fire_counts[i] + traffic_counts[i])
df['TotalCalls'] = total_911_count

In [30]:
df

Unnamed: 0,Town,SBcount,Population,EMScount,FireCount,TrafficCount,TotalCalls
0,Bala-Cynwyd,7,9619,1109,473,1430,3012
1,King of Prussia,6,19936,2155,527,1193,3875
2,Montgomeryville,5,12624,1845,445,1741,4031
3,Plymouth Meeting,3,6177,2468,736,1955,5159
4,Willow Grove,3,15726,3209,794,2229,6232
5,Rockledge,3,24998,1668,550,1293,3511
6,Pottstown,2,22392,1054,369,803,2226
7,Trappe,2,3516,1264,481,1093,2838
8,Bridgeport,2,4558,1550,368,1525,3443
9,Wyncote,2,3044,1336,378,1372,3086


We now have the data frame we want!

In [31]:
df.to_pickle('./pickles/finalDF_incident.pickle')