# Crime Incident Analysis on Data from SFPD


This report summarizes my findings on the crime incident data provided freely by the SFPD through the [portal](https://data.sfgov.org/), as part of my submission for the Crime Analytics assignment for the course __Communicating Data Science Results__ by the University of Washington on Coursera.


#### Important notes
 * The assumptions made for this study, as well as the conclusions drawn, are based on my personal intuition and ability of interpretation, so they should not be treated as valid scientific facts.
 * For the needs of this assignment I worked on a subset of the crime incident data that corresponds to the Summer of 2014. The scripts that supplement this report were tested on that subset and are not guaranteed to work on the whole dataset.

## Finding 1 

---

## Crime incidents happen more frequently near the city center

The first question I looked into was whether incidents happen more frequently near specific areas / districts. Since each report includes the geocoordinates of the place where the incident happened, it was convenient to start by investigating that relationship.

So, I started by plotting all the available data points using the Longitude values on the x-axis and the Latitude values on the y-axis. I also colored the data points based on the district they belong to. The script below produces that scatter plot.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb


marker_style = dict(color='red', linestyle=':', marker='D',
                    markersize=5)


sanfIncidents = pd.read_csv('sanfrancisco_incidents_summer_2014.csv')
# x and y are passed by keyword; recent seaborn releases reject them positionally
sb.lmplot(x='X', y='Y', data=sanfIncidents, hue='PdDistrict', fit_reg=False)
plt.ylabel('Latitude')
plt.xlabel('Longitude')
plt.title('San Francisco incidents summer 2014 colored by district')

# place marker at the city center
plt.plot(-122.419416, 37.774929, **marker_style)

plt.show()


The red marker indicates the coordinates of the city center.

As you may be able to see, there is an increased density of data points near the city center. It is quite apparent that the areas near the center contribute more data points than the rest, but it is not so clear whether that difference is significant. It is also quite difficult to tell from that plot exactly which color corresponds to which district. That makes it even harder, especially for someone who is not familiar with those areas, to estimate the level of criminal activity in each district.
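One way to put a number on "near the city center" is to measure each incident's distance from the red marker and report the share that falls inside a chosen radius. The sketch below is only an illustration of that idea: the helper names `km_from_center` and `share_within` and the 2 km default radius are my own assumptions and are not part of the original scripts. It uses the haversine great-circle formula with a mean Earth radius of 6371 km.

```python
import numpy as np
import pandas as pd

# coordinates of the city center marker used in the scatter plot
CENTER_LON, CENTER_LAT = -122.419416, 37.774929
EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def km_from_center(lon, lat):
    """Haversine distance in km from the city center (hypothetical helper)."""
    lon1, lat1 = np.radians(CENTER_LON), np.radians(CENTER_LAT)
    lon2, lat2 = np.radians(lon), np.radians(lat)
    a = (np.sin((lat2 - lat1) / 2) ** 2 +
         np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def share_within(df, radius_km=2.0):
    """Fraction of incidents within radius_km of the center.

    Assumes longitude in column 'X' and latitude in column 'Y',
    as in the dataset used in this report.
    """
    dist = km_from_center(df['X'].to_numpy(), df['Y'].to_numpy())
    return float((dist <= radius_km).mean())
```

With the report's data this could be called as `share_within(pd.read_csv('sanfrancisco_incidents_summer_2014.csv'))`. The radius is arbitrary, so it is worth trying a few values before reading much into the result.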

The next plot presents the distribution of data points along the two axes used previously (Longitude, Latitude) in the two marginal plots on the top and on the right, and also uses hue in the central plot to indicate areas with greater density.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb


sanfIncidents = pd.read_csv('sanfrancisco_incidents_summer_2014.csv')
# x and y are passed by keyword; recent seaborn releases reject them positionally
sb.jointplot(x='X', y='Y', data=sanfIncidents, color='r', kind='hex')

plt.show()

Inspecting the two distribution plots makes it much clearer that the density of incidents is noticeably higher near the center. Specifically, the two global maxima point us to (37.783, -122.407), and we can also see that most of the activity happens around that location.

But again, information is not that easily inferred from that plot. That is mostly due to the use of geocoordinates, since it is very hard to tell which district corresponds to which area of the plot.
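The peak location quoted above was read off the plot by eye. A rough numerical counterpart is to histogram each coordinate separately and take the center of the fullest bin. The sketch below assumes this simple approach; `marginal_peak` is a hypothetical helper, and the bin count is an arbitrary choice that affects the answer.

```python
import numpy as np


def marginal_peak(values, bins=50):
    """Center of the most populated bin of a 1D histogram (hypothetical helper)."""
    counts, edges = np.histogram(values, bins=bins)
    k = int(np.argmax(counts))
    # midpoint of the bin with the highest count
    return (edges[k] + edges[k + 1]) / 2
```

Applied to the dataset, `marginal_peak(sanfIncidents['Y'])` and `marginal_peak(sanfIncidents['X'])` would give approximate latitude and longitude peaks. These are peaks of the marginal distributions, so they do not have to coincide with the darkest hexagon of the joint plot.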

To get a clearer overview of how much each district contributes, I counted the incidents corresponding to each district and created the following plot.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb


sanfIncidents = pd.read_csv('sanfrancisco_incidents_summer_2014.csv')

districtCount = sanfIncidents.groupby('PdDistrict').count()
y_pos = np.arange(len(districtCount.index.values))
plt.barh(y_pos, districtCount['IncidntNum'])
plt.yticks(y_pos, districtCount.index.values)
plt.title('Amount of incidents in San Francisco, per district, for Summer 2014')
plt.show()

Now we can see that Southern is by far the district with the most criminal activity. The total count of incidents for Southern is above 5700, while the second most criminally active district reaches 3700 incidents.
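The same counts can also be listed as a sorted table, which makes the gap between the first and second district easier to quote. This is a minimal sketch, assuming the `PdDistrict` column used above; `district_ranking` is a hypothetical helper and not part of the original scripts.

```python
import pandas as pd


def district_ranking(df):
    """Incidents per district, most active first, with each district's share."""
    # value_counts sorts in descending order by default
    ranking = df['PdDistrict'].value_counts().to_frame('incidents')
    ranking['share'] = ranking['incidents'] / ranking['incidents'].sum()
    return ranking
```

Calling `district_ranking(sanfIncidents).head()` would show the most active districts together with the fraction of all incidents each one accounts for.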

Since we have identified the most active areas, it would be valuable to find out what types of incidents happen in those areas. In a first rough attempt I reused the script for the first plot, but instead of coloring by district, I colored each data point based on the type of incident.

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb


marker_style = dict(linestyle=':', marker='D', markersize=5, color='red')

sanfIncidents = pd.read_csv('sanfrancisco_incidents_summer_2014.csv')
# x and y are passed by keyword; recent seaborn releases reject them positionally
sb.lmplot(x='X', y='Y', data=sanfIncidents, hue='Category', fit_reg=False)

# place marker at the city center
plt.plot(-122.419416, 37.774929, **marker_style)

plt.ylabel('Latitude')
plt.xlabel('Longitude')
plt.title('San Francisco incidents summer 2014 colored by incident category')

plt.show()

In that plot it is easy to see that the city center is dominated by green and blue points. But given that there are 30 different categories, some of which share similar colors, it is hard to be sure whether, for example, there are a lot of robberies or a lot of secondary codes in the city center.

To resolve that, I counted the incidents per category and made a bar plot that includes all categories for the most active districts. The variable `__top__` in the script controls the number of districts that will be included in the bar chart. For example, if `__top__ == 1` then only the most active district will be included, if `__top__ == 2` the two most active, etc.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

__top__ = 1

# load data
sanFIncidents = pd.read_csv('sanfrancisco_incidents_summer_2014.csv')
# compute count of incidents per district
districtCount = sanFIncidents.groupby('PdDistrict').count()
# keep only the top==__top__ districts
districtCount = districtCount.sort_values(
        'IncidntNum', ascending=False).head(__top__)
# filter out the incidents that happened outside of the top freq districts
sanFIncidents = sanFIncidents[sanFIncidents['PdDistrict'].isin(
        districtCount.index.values)]
# compute count of incidents per incident Category
categoryCount = sanFIncidents.groupby('Category').count()

districts = ''
for district in districtCount.index.values:
    districts = districts + str(district) + ', '
districts = districts[:-2]
title = ('Amount of incidents for districts ' + districts +
         ' per incident category')

y_pos = np.arange(len(categoryCount.index.values))
plt.barh(y_pos, categoryCount['IncidntNum'])
plt.yticks(y_pos, categoryCount.index.values)
plt.xlabel('amount of incidents')
plt.title(title)
plt.show()

As we can see, the most active district is dominated by theft-related incidents.
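When more than one district is selected, the bar chart above merges their counts. A cross-tabulation keeps the districts apart, so the category mix of each one can be compared directly. The sketch below assumes the `PdDistrict` and `Category` columns used above; `top_district_categories` is a hypothetical helper whose `top` parameter plays the same role as `__top__`.

```python
import pandas as pd


def top_district_categories(df, top=1):
    """Category x district table of counts for the `top` most active districts."""
    # districts ordered by number of incidents, most active first
    top_districts = df['PdDistrict'].value_counts().head(top).index
    subset = df[df['PdDistrict'].isin(top_districts)]
    # rows are incident categories, columns are the selected districts
    return pd.crosstab(subset['Category'], subset['PdDistrict'])
```

For example, `top_district_categories(sanFIncidents, top=2)` would return one column per district, which could then be sorted or plotted as grouped bars.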

In [12]:
from __future__ import division
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime


labels = {6: 'June', 7: 'July', 8: 'August'}

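# treat empty and whitespace-only fields as missing values
inc = pd.read_csv('sanfrancisco_incidents_summer_2014.csv',
                  na_values=['', ' '])
# dropna() returns a new frame, so the result has to be assigned back
inc = inc.dropna()

# 'Date' is formatted as MM/DD/YYYY; keep only the month number
inc['Date'] = inc['Date'].apply(
            lambda date: int(date.split('/')[0]))

# count incidents per month and attach the month name for plotting
incCount = inc.groupby('Date').count()
incCount['Month'] = incCount.index

incCount['Month'] = incCount['Month'].apply(
            lambda month: labels[month])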

sns.set_style("whitegrid")
sns.barplot(x='Month', y='IncidntNum', data=incCount, color='salmon',
            saturation=.5)
plt.ylabel('Count of incidents')
plt.title('Count of incidents per month of summer 2014')
plt.show()
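The script above extracts the month by splitting the date string by hand. An alternative sketch, assuming the `Date` column keeps the MM/DD/YYYY layout that the split relies on, is to let pandas parse the dates and count by month. `incidents_per_month` is a hypothetical helper and not part of the original script.

```python
import calendar

import pandas as pd


def incidents_per_month(df):
    """Number of incidents per calendar month, in calendar order."""
    month = pd.to_datetime(df['Date'], format='%m/%d/%Y').dt.month
    counts = month.value_counts().sort_index()
    # replace month numbers (6, 7, 8, ...) with their English names
    counts.index = [calendar.month_name[int(m)] for m in counts.index]
    return counts
```

Parsing with an explicit format fails loudly on malformed dates, which is usually preferable to silently miscounting them.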