# 911 Emergencies

A US county would like to know which kinds of incidents it should focus on to protect its citizens. They hired you to make these recommendations. In addition, they gave you a map of all the 911 calls they received over the past years.

1. Import common libraries (including plotly)

In [1]:
import pandas as pd

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import DBSCAN

import plotly.express as px


2. Import the dataset here 👉👉 <a href="https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/DBSCAN/Datasets/911.csv" target="_blank">911.csv</a>

In [3]:
# df = pd.read_csv("../12_assets/06_unsupervised_ML/911_lite.csv", delimiter=';')
df = pd.read_csv("../../12_assets/06_unsupervised_ML/911.csv")
df.head()

Unnamed: 0,lat,lng,desc,zip,title,timeStamp,twp,addr,e
0,40.297876,-75.581294,REINDEER CT & DEAD END; NEW HANOVER; Station ...,19525.0,EMS: BACK PAINS/INJURY,2015-12-10 17:10:52,NEW HANOVER,REINDEER CT & DEAD END,1
1,40.258061,-75.26468,BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP...,19446.0,EMS: DIABETIC EMERGENCY,2015-12-10 17:29:21,HATFIELD TOWNSHIP,BRIAR PATH & WHITEMARSH LN,1
2,40.121182,-75.351975,HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St...,19401.0,Fire: GAS-ODOR/LEAK,2015-12-10 14:39:21,NORRISTOWN,HAWS AVE,1
3,40.116153,-75.343513,AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;...,19401.0,EMS: CARDIAC EMERGENCY,2015-12-10 16:47:36,NORRISTOWN,AIRY ST & SWEDE ST,1
4,40.251492,-75.60335,CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S...,,EMS: DIZZINESS,2015-12-10 16:56:52,LOWER POTTSGROVE,CHERRYWOOD CT & DEAD END,1


3. The dataset is quite big, take a sample of 10 000 observations

In [4]:
# Careful: 911_lite has only 5,000 rows
df = df.sample(10_000, random_state=0)
df.head()

Unnamed: 0,lat,lng,desc,zip,title,timeStamp,twp,addr,e
600055,40.360688,-75.482317,WALTERS RD & GRAVEL PIKE; UPPER HANOVER; Stat...,18073.0,EMS: HEAD INJURY,2020-01-22 21:02:54,UPPER HANOVER,WALTERS RD & GRAVEL PIKE,1
21520,40.120442,-75.120903,HIGHLAND AVE & WOODLAND RD; ABINGTON; 2016-02-...,19001.0,Traffic: VEHICLE ACCIDENT -,2016-02-02 08:14:02,ABINGTON,HIGHLAND AVE & WOODLAND RD,1
598158,40.164656,-75.286313,SKIPPACK PIKE & CRESTLINE DR; WHITPAIN; Stati...,19422.0,EMS: SYNCOPAL EPISODE,2020-01-18 13:20:29,WHITPAIN,SKIPPACK PIKE & CRESTLINE DR,1
196304,40.072731,-75.155969,CHELTENHAM AVE & 79TH AVE; CHELTENHAM; Statio...,19150.0,EMS: HEMORRHAGING,2017-05-04 14:48:19,CHELTENHAM,CHELTENHAM AVE & 79TH AVE,1
504256,40.097222,-75.376195,PENNSYLVANIA TPKE & ALLENDALE RD OVERPASS; UPP...,,Traffic: DISABLED VEHICLE -,2019-06-04 07:50:36,UPPER MERION,PENNSYLVANIA TPKE & ALLENDALE RD OVERPASS,1
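
As the comment in the cell above notes, the lite file has fewer rows than the requested sample size, and `df.sample` raises a `ValueError` when asked for more rows than exist (without replacement). A minimal sketch of a guard, using a hypothetical stand-in frame (`small_df`) rather than the real data:

```python
import pandas as pd

# Hypothetical stand-in for the 5,000-row lite file (not the real data).
small_df = pd.DataFrame({"lat": range(5_000), "lng": range(5_000)})

# Cap the sample size at the number of rows actually available.
n_rows = min(10_000, len(small_df))
sample = small_df.sample(n=n_rows, random_state=0)

print(len(sample))
```

With the full file the cap has no effect, so the same line can be used for both datasets.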


5. Using plotly scatter mapbox, visualize your data points on a map. You should also differentiate colors depending on `title`

In [5]:
# https://plotly.com/python/mapbox-layers/
fig = px.scatter_mapbox(
        df,
        lat="lat",
        lon="lng",
        color="title",
        mapbox_style='open-street-map'
)
fig.show()

6. The dataset is quite big, so let's use DBSCAN to help us out. First, create a variable `data_sample` that includes only the `lat`, `lng` and `title` columns.

In [121]:
features_of_interest = ["lat", "lng", "title"]

# .copy() so that adding a column later does not trigger a SettingWithCopyWarning
data_sample = df.loc[:, features_of_interest].copy()
data_sample.head()

Unnamed: 0,lat,lng,title
600055,40.360688,-75.482317,EMS: HEAD INJURY
21520,40.120442,-75.120903,Traffic: VEHICLE ACCIDENT -
598158,40.164656,-75.286313,EMS: SYNCOPAL EPISODE
196304,40.072731,-75.155969,EMS: HEMORRHAGING
504256,40.097222,-75.376195,Traffic: DISABLED VEHICLE -


7. Create dummy variables for the `title` column.

In [122]:
numeric_features = data_sample.select_dtypes(include="number").columns
print(numeric_features)

numeric_transformer = Pipeline(
  steps=[
    #("imputer_num", SimpleImputer(strategy="median")),
    #("imputer_num", KNNImputer()),
    ("scaler_num" , StandardScaler()),
  ]
)


categorical_features = data_sample.select_dtypes(exclude="number").columns
print(categorical_features)

categorical_transformer = Pipeline(
  steps=[
    # ("imputer_cat", SimpleImputer(strategy="most_frequent")),  
    ("encoder_cat", OneHotEncoder(drop="first")),                 
  ]
)

preprocessor = ColumnTransformer(
  transformers=[
    ("num", numeric_transformer,     numeric_features),
    ("cat", categorical_transformer, categorical_features),
  ]
)

X = preprocessor.fit_transform(data_sample)
print(X[0:5, ])


Index(['lat', 'lng'], dtype='object')
Index(['title'], dtype='object')
  (0, 0)	0.9030367896916485
  (0, 1)	-0.11004154567506207
  (0, 29)	1.0
  (1, 0)	-0.18022407382652786
  (1, 1)	0.10947176369941068
  (1, 82)	1.0
  (2, 0)	0.01913326285356729
  (2, 1)	0.009006555384859206
  (2, 48)	1.0
  (3, 0)	-0.3953505949560368
  (3, 1)	0.08817391187273635
  (3, 31)	1.0
  (4, 0)	-0.28492076331826843
  (4, 1)	-0.04558574464212034
  (4, 79)	1.0


8. Let's start using DBSCAN: import the module and fit DBSCAN to your data. Use `eps=0.2`, `min_samples=100` and `metric="manhattan"` as parameters.

In [139]:
# Note: min_samples=200 is used here, not the 100 given in the instructions above
db = DBSCAN(eps=0.2, min_samples=200, metric="manhattan")
db.fit(X)
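
One consequence of these settings is worth checking by hand. The one-hot columns are not scaled, so two calls with different titles differ by at least 1 in Manhattan distance, whatever their location. Since that exceeds `eps=0.2`, such calls can never be neighbours, and every cluster ends up containing a single `title`. A small sketch with hypothetical encoded rows (the vectors below are illustrative, not taken from `X`):

```python
import numpy as np

# Hypothetical encoded rows: [scaled lat, scaled lng, one-hot title columns...]
same_place_title_a = np.array([0.10, -0.05, 1.0, 0.0])
same_place_title_b = np.array([0.10, -0.05, 0.0, 1.0])
# A title removed by drop="first" is encoded as all zeros.
same_place_dropped = np.array([0.10, -0.05, 0.0, 0.0])


def manhattan(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return float(np.abs(u - v).sum())


eps = 0.2
d_ab = manhattan(same_place_title_a, same_place_title_b)
d_a_dropped = manhattan(same_place_title_a, same_place_dropped)

# Both distances exceed eps even though the locations are identical.
print(d_ab, d_a_dropped, d_ab > eps, d_a_dropped > eps)
```

In effect, DBSCAN here looks for dense geographic pockets of one incident type at a time, with `eps` acting on the standardized coordinates.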

9. Find out how many clusters DBSCAN created.

In [140]:
set(db.labels_)

{-1, 0, 1, 2, 3}
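
The label `-1` marks noise rather than a cluster, so the set above corresponds to four clusters plus the outliers. A minimal sketch of counting clusters and points per label, using a hypothetical `labels` array in the same format as `db.labels_`:

```python
import numpy as np

# Hypothetical labels in the same format as db.labels_ (-1 marks noise).
labels = np.array([-1, 0, 0, 1, -1, 2, 3, 3, 0, -1])

values, counts = np.unique(labels, return_counts=True)

# Clusters are all distinct labels except the noise label.
n_clusters = int((values != -1).sum())
n_noise = int(counts[values == -1].sum())

print(n_clusters, n_noise)
print(dict(zip(values.tolist(), counts.tolist())))
```

Looking at the counts per label also shows how much of the sample was discarded as noise, which is useful when judging whether `eps` and `min_samples` are too strict.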

10. Add a new column `"cluster"` to `data_sample` that holds the cluster label of each observation.

In [141]:
data_sample["cluster"] = db.labels_
data_sample.head()

Unnamed: 0,lat,lng,title,cluster
600055,40.360688,-75.482317,EMS: HEAD INJURY,-1
21520,40.120442,-75.120903,Traffic: VEHICLE ACCIDENT -,0
598158,40.164656,-75.286313,EMS: SYNCOPAL EPISODE,-1
196304,40.072731,-75.155969,EMS: HEMORRHAGING,-1
504256,40.097222,-75.376195,Traffic: DISABLED VEHICLE -,1


11. Visualize all the clusters on a map, excluding the points that DBSCAN considered outliers.

In [142]:
fig = px.scatter_mapbox(
  data_sample,
  lat="lat",
  lon="lng",
  color="cluster",
  mapbox_style='open-street-map'
)

fig.show()

In [143]:
fig = px.scatter_mapbox(
  data_sample[data_sample.cluster != -1],
  lat="lat",
  lon="lng",
  color="cluster",
  mapbox_style="open-street-map"
)

fig.show()
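
Because `cluster` is an integer column, Plotly Express treats it as a continuous variable and draws a color bar rather than one legend entry per cluster. A possible workaround, sketched on a hypothetical frame (`clustered`), is to convert the labels to strings before plotting:

```python
import pandas as pd

# Hypothetical clustered rows (not the real data).
clustered = pd.DataFrame(
    {
        "lat": [40.12, 40.10, 40.36],
        "lng": [-75.12, -75.38, -75.48],
        "cluster": [0, 1, -1],
    }
)

# Drop noise, then store labels as strings so they are treated as categories.
plot_df = clustered[clustered["cluster"] != -1].copy()
plot_df["cluster"] = plot_df["cluster"].astype(str)

print(plot_df["cluster"].tolist())
```

Passing a frame prepared this way to `px.scatter_mapbox(..., color="cluster")` should give each cluster its own discrete color.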

12. Visualize all data points on a map except outliers using plotly. You should have different colors per `title`.

13. What would your recommendations then be for this US county's politicians?

In [144]:
fig = px.scatter_mapbox(
  data_sample[data_sample.cluster != -1],
  lat="lat",
  lon="lng",
  color="title",
  mapbox_style="open-street-map"
)
fig.show()

**The map shows the main types of incidents to focus on and the main areas where these events occur. These are therefore the areas that politicians should focus on.**
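
To back the map with numbers, the clusters can also be summarized in a table: size, dominant title and approximate center of each one. The sketch below uses a hypothetical frame (`calls`) with the same columns as `data_sample`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical clustered calls (not the real data).
calls = pd.DataFrame(
    {
        "lat": [40.12, 40.13, 40.10, 40.11, 40.36],
        "lng": [-75.12, -75.13, -75.38, -75.37, -75.48],
        "title": [
            "Traffic: VEHICLE ACCIDENT -",
            "Traffic: VEHICLE ACCIDENT -",
            "Traffic: DISABLED VEHICLE -",
            "Traffic: DISABLED VEHICLE -",
            "EMS: HEAD INJURY",
        ],
        "cluster": [0, 0, 1, 1, -1],
    }
)

summary = (
    calls[calls["cluster"] != -1]  # ignore noise points
    .groupby("cluster")
    .agg(
        n_calls=("title", "size"),
        main_title=("title", lambda s: s.mode().iloc[0]),
        center_lat=("lat", "mean"),
        center_lng=("lng", "mean"),
    )
    .sort_values("n_calls", ascending=False)
)

print(summary)
```

Applied to `data_sample`, a table like this would list which incident type dominates each dense area and roughly where that area is.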