# **911 Exploratory Analysis Project**
For this project we'll analyze the 911 call dataset from Kaggle. The data contains the following fields:

* lat: String variable, Latitude
* lng: String variable, Longitude
* desc: String variable, Description of the Emergency Call
* zip: String variable, Zipcode
* title: String variable, Title
* timeStamp: String variable, YYYY-MM-DD HH:MM:SS
* twp: String variable, Township
* addr: String variable, Address
* e: String variable, Dummy variable (always 1)

Let's start with some data analysis and visualisation imports.

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')

plt.rcParams['figure.figsize'] = (6, 4)

In [None]:
#Reading the data
df = pd.read_csv('data/911.csv')

In [None]:
df.info()

In [None]:
#Checking the head of the dataframe
df.head()


# **Basic Analysis**
Let's check out the top 5 zip codes and the top 5 townships for calls, and then count the unique title codes.

In [None]:
df['zip'].value_counts().head(5)
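As a side note on `value_counts`, the sketch below uses a handful of invented zip codes (not taken from the dataset) to show how missing values and shares are handled. It assumes the real `zip` column may contain missing entries, which `df.info()` would confirm.

```python
import numpy as np
import pandas as pd

# Hypothetical zip codes, including one missing value
zips = pd.Series([19401.0, 19401.0, 19464.0, np.nan, 19401.0, 19464.0, 19403.0])

counts = zips.value_counts()                    # NaN is excluded by default
shares = zips.value_counts(normalize=True)      # fractions of the non-missing rows
with_missing = zips.value_counts(dropna=False)  # NaN gets its own row

print(counts.head(5))
print(shares.head(5))
```

Passing `normalize=True` is handy when the question is "what share of calls" rather than "how many calls".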

In [None]:
df['twp'].value_counts().head(5)

In [None]:
df['title'].nunique()

# **Data Wrangling for Feature Creation**
We can extract some generalised features from the columns in our dataset for further analysis.

In the title column, each entry is allotted a 'reason for call' category, denoted by the text before the colon.

The timestamp column can also be broken down into Hour, Month and Day of Week.

Let's start with creating a 'Reason' feature for each call.

In [None]:
df['Reason'] = df['title'].apply(lambda x: x.split(':')[0])
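To make the split concrete, here is a minimal sketch on three illustrative titles written in the 'Reason: detail' shape described above. It uses the vectorised `.str` accessor, which should give the same result as the `apply` call; the sample strings are assumptions for demonstration only.

```python
import pandas as pd

# Illustrative titles in the 'Reason: detail' shape
titles = pd.Series([
    'EMS: BACK PAINS/INJURY',
    'Fire: GAS-ODOR/LEAK',
    'Traffic: VEHICLE ACCIDENT -',
])

# Vectorised equivalent of apply(lambda x: x.split(':')[0])
reasons = titles.str.split(':').str[0]
print(reasons.tolist())
```

The `.str` version also passes missing values through as NaN instead of raising, which the lambda would not do.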

In [None]:
df.tail()


Now, let's find out the most common reason for 911 calls, according to our dataset.

In [None]:
df['Reason'].value_counts()

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='Reason', data=df)

Let's deal with the time information we have. Checking the datatype of the timestamp column.

In [None]:
type(df['timeStamp'][0])

As the timestamps are still strings, it'll make our life easier if we convert them to pandas datetime (Timestamp) objects, so we can extract the hour, month, and day of week more intuitively.

In [None]:
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
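If some rows might not follow the `YYYY-MM-DD HH:MM:SS` layout, a more defensive conversion can help. The sketch below runs on invented strings and assumes that malformed entries should become `NaT` rather than raise an error.

```python
import pandas as pd

# Two well-formed timestamps and one deliberately broken entry (all invented)
raw = pd.Series(['2015-12-10 17:40:00', '2016-01-03 08:05:30', 'not a date'])

# An explicit format avoids guessing; errors='coerce' turns bad rows into NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')

print(parsed.isna().sum())   # number of rows that failed to parse
print(parsed.iloc[0].hour, parsed.iloc[0].month, parsed.iloc[0].dayofweek)
```

Counting the `NaT` values afterwards gives a quick check on how clean the column was.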

In [None]:
time = df['timeStamp'].iloc[0]

print('Hour:',time.hour)
print('Month:',time.month)
print('Day of Week:',time.dayofweek)

Now let's create new features for the above pieces of information.

In [None]:
# The .dt accessor extracts each component for the whole column at once
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['Day of Week'] = df['timeStamp'].dt.dayofweek

In [None]:
df.head(3)

In [None]:
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}

In [None]:
df['Day of Week'] = df['Day of Week'].map(dmap)

df.tail(3)
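One caveat with string day labels is that they sort alphabetically (Fri, Mon, Sat, ...). A possible workaround, sketched here on invented dates, is an ordered categorical; `day_order` is a name introduced for this example only.

```python
import pandas as pd

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Invented dates: a Sunday, a Monday, a Saturday and a Wednesday
stamps = pd.Series(pd.to_datetime(['2016-01-03', '2016-01-04', '2016-01-09', '2016-01-06']))

# day_name() returns full names; the first three letters match the dmap labels
labels = stamps.dt.day_name().str[:3]

# An ordered categorical keeps Mon..Sun order when sorting or counting
days = pd.Series(pd.Categorical(labels, categories=day_order, ordered=True))
print(days.sort_values().tolist())
```

With the column stored this way, sorting follows calendar order rather than alphabetical order.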

Let's combine the newly created features, to check out the most common call reasons based on the day of the week.

In [None]:
sns.countplot(x='Day of Week', hue='Reason', data=df)

plt.legend(bbox_to_anchor=(1.25,1))


It makes sense for the number of traffic-related 911 calls to be lowest on weekends. What's also interesting is that Emergency Service (EMS) related calls are low during the weekend too.

In [None]:
sns.countplot(x='Month', hue='Reason', data=df)

plt.legend(bbox_to_anchor=(1.25,1))

Now, let's check out the relationship between the number of calls and the month.

In [None]:
# pd.groupby was removed from pandas; call groupby on the DataFrame instead
byMonth = df.groupby('Month').count()
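A detail worth keeping in mind is that `count()` tallies non-null values per column, so the number can differ between columns when data is missing. The toy frame below (invented values) contrasts it with `size()`, which counts rows.

```python
import numpy as np
import pandas as pd

# Invented rows: 'zip' has gaps, 'e' is always present
toy = pd.DataFrame({
    'Month': [1, 1, 2, 2, 2],
    'zip': [19401.0, np.nan, 19464.0, np.nan, np.nan],
    'e': [1, 1, 1, 1, 1],
})

per_column = toy.groupby('Month').count()  # non-null values in each column
per_group = toy.groupby('Month').size()    # rows in each group

print(per_column)
print(per_group)
```

This is why a column with no missing values, such as `e`, is a safe choice when reading call totals off a `count()` result.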

In [None]:
# 'e' is already selected, so no y argument is needed
byMonth['e'].plot.line()
plt.title('Calls per Month')
plt.ylabel('Number of Calls')


Using seaborn, let's fit a linear regression of the number of calls against the month and see whether there's any clear relationship between the two.

In [None]:
byMonth.reset_index(inplace=True)

In [None]:
sns.lmplot(x='Month',y='e',data=byMonth)
plt.ylabel('Number of Calls')
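To put numbers on a fit like this, the slope and the Pearson correlation can be computed directly. The sketch below uses invented monthly counts, not values from the dataset, purely to show the calls involved.

```python
import numpy as np

# Invented monthly call counts for six months
months = np.array([1, 2, 3, 4, 5, 6])
calls = np.array([120, 115, 112, 108, 101, 98])

# Least-squares straight line: calls is approximated by slope * month + intercept
slope, intercept = np.polyfit(months, calls, deg=1)

# Pearson correlation coefficient between month and call count
r = np.corrcoef(months, calls)[0, 1]

print(f'slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}')
```

With only a dozen or fewer monthly points, a figure like `r` is best read as a rough indication rather than firm evidence.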

So, the fit suggests that there are fewer emergency calls during the holiday season, although with so few monthly points this is an indication rather than a firm correlation.

Let's extract the date from the timestamp and look at the daily call volume in a little more detail.

In [None]:
df['Date'] = df['timeStamp'].dt.date

In [None]:
df.head(2)


Grouping and plotting the data:

In [None]:
df.groupby('Date').count()['e'].plot.line()

plt.legend().remove()
plt.tight_layout()

We can also check out the same plot for each reason separately.

In [None]:
df[df['Reason']=='Traffic'].groupby('Date').count().plot.line(y='e')
plt.title('Traffic')
plt.legend().remove()
plt.tight_layout()

In [None]:
df[df['Reason']=='Fire'].groupby('Date').count().plot.line(y='e')
plt.title('Fire')
plt.legend().remove()
plt.tight_layout()

In [None]:
df[df['Reason']=='EMS'].groupby('Date').count().plot.line(y='e')
plt.title('EMS')
plt.legend().remove()
plt.tight_layout()
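The three cells above repeat the same steps per reason. One possible alternative, sketched on invented rows, is to build a single table with one column per reason and plot from that; the frame and its values are assumptions for illustration.

```python
import pandas as pd

# Invented calls spread over two dates
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-01-01',
                            '2016-01-02', '2016-01-02']).date,
    'Reason': ['EMS', 'EMS', 'Traffic', 'Fire', 'EMS'],
})

# Rows are dates, columns are reasons, cells are call counts (0 where none)
daily = toy.groupby(['Date', 'Reason']).size().unstack(fill_value=0)
print(daily)

# daily.plot(subplots=True) would then draw one panel per reason
```

Keeping the counts in one table also makes it easy to compare reasons on the same date.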

Let's create a heatmap for the counts of calls on each hour, during a given day of the week.

In [None]:
day_hour = df.pivot_table(values='lat',index='Day of Week',columns='Hour',aggfunc='count')

day_hour
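The same kind of table can also be built with `groupby` and `unstack`, which makes it easy to put the rows in calendar order. This is a sketch on invented rows; `day_order` is a helper name introduced here.

```python
import pandas as pd

# Invented calls with a day label and an hour
toy = pd.DataFrame({
    'Day of Week': ['Mon', 'Mon', 'Tue', 'Sun', 'Sun', 'Sun'],
    'Hour': [8, 17, 17, 8, 8, 17],
    'lat': [40.1, 40.2, 40.3, 40.1, 40.2, 40.3],
})
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Count rows per (day, hour), spread hours into columns, then order the rows
table = (toy.groupby(['Day of Week', 'Hour']).size()
            .unstack(fill_value=0)
            .reindex(day_order, fill_value=0))
print(table)
```

Days with no calls in the toy data appear as rows of zeros, so the layout stays stable whatever the input contains.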


Now let's create a heatmap from this new DataFrame.

In [None]:
sns.heatmap(day_hour)

plt.tight_layout()
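Raw counts let busy days dominate the colour scale. If the interest is in the hourly pattern within each day, one option is to convert each row to shares before plotting. The sketch below uses an invented two-day table.

```python
import pandas as pd

# Invented counts: rows are days, columns are hours
counts = pd.DataFrame({8: [10, 40], 17: [30, 60]}, index=['Sat', 'Mon'])
counts.columns.name = 'Hour'

# Divide every row by its own total so each day sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

Passing a table like `shares` to `sns.heatmap` would then compare the shape of each day rather than its overall volume.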

We see that most calls take place around the end of office hours on weekdays. We can create a clustermap to pair up similar Hours and Days.

In [None]:
sns.clustermap(day_hour)


And this concludes the exploratory analysis project.