Note: I used the [Boston police districts](https://www.kaggle.com/christotk/boston-police-districts) dataset so that the district boundaries can be processed with <i>geopandas</i>.

<h3>Import libraries and load data</h3>

In [1]:
import pandas as pd # data processing
import geopandas as gpd # geospatial data processing
import numpy as np # linear algebra
import folium # mapping
from folium.plugins import HeatMap
import seaborn as sns # visualization
import matplotlib.pyplot as plt # visualization
%matplotlib inline

# read crimes file
crimes = pd.read_csv('../input/crimes-in-boston/crime.csv', encoding = 'latin')

# read Police Districts shapefile with geopandas
gdf = gpd.read_file('../input/boston-police-districts/police_districts/Police_Districts.shp')

ModuleNotFoundError: No module named 'geopandas'

<h3>Explore the datasets</h3>

In [None]:
gdf

In [None]:
crimes.head()

In [None]:
crimes.shape

In [None]:
crimes.describe()

Check whether any data is missing.

In [None]:
crimes.isnull().sum()
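
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up toy table (the `toy` frame and its values are assumptions, not the real crimes data) that reports both and keeps only the affected columns:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crimes table; the values are made up for illustration.
toy = pd.DataFrame({
    'DISTRICT': ['B2', None, 'D4', 'A1'],
    'Lat': [42.33, np.nan, np.nan, 42.36],
    'HOUR': [13, 2, 18, 23],
})

# Number of missing values per column, plus the fraction of rows affected.
missing = pd.DataFrame({
    'missing': toy.isnull().sum(),
    'share': toy.isnull().mean(),
})

# Keep only columns with at least one missing value, worst first.
missing = missing[missing['missing'] > 0].sort_values('share', ascending=False)
print(missing)
```

The same two expressions should work on `crimes` directly, since they only rely on `isnull()`.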

<h3>Plot spatial data</h3>

In [None]:
gdf.plot()
plt.tight_layout()

geopandas plots the data in the geometry column.

We need to label each polygon (district). To do so, we define a point within each polygon.

In [None]:
gdf['point'] = gdf.representative_point() # this is a point guaranteed to be within each polygon

# label_points - a GeoDataFrame used for labeling
label_points = gdf.copy()
label_points.set_geometry('point', inplace = True)

# plot districts
ax = gdf.plot(color = 'whitesmoke', figsize = (12,8), edgecolor = 'black', linewidth = 0.3)

def add_label():
    # add label for each polygon
    for x, y, label in zip(label_points.geometry.x, label_points.geometry.y, label_points['DISTRICT']):
        plt.text(x, y, label, fontsize = 10, fontweight = 'bold')

add_label()
plt.title('Boston police districts', fontsize = 16)
plt.tight_layout()

<h3>Analysis</h3>

* What types of crimes are the most common?

In [None]:
most_common_crimes = pd.DataFrame({'Count': crimes.OFFENSE_CODE_GROUP.value_counts().sort_values(ascending = False).head(10)}) # top 10 most common crimes
most_common_crimes

In [None]:
plt.figure(figsize = (20,12))
sns.barplot(x = most_common_crimes.index, y = 'Count', data = most_common_crimes)
plt.yticks((np.arange(5000, most_common_crimes['Count'].max(), 5000)))
plt.ylabel(None)
plt.tick_params(labelsize = 12)
plt.xlabel('\n Most common crime types', fontsize = 12)
plt.title('Top 10 crimes in Boston', fontsize = 18)
plt.tight_layout()

* How is crime distributed in the Boston area? (most_common_crimes)

A folium HeatMap seems useful in this case.

In [None]:
location_of_most_common_crimes = crimes[crimes.OFFENSE_CODE_GROUP.isin(most_common_crimes.index)].loc[:, ['Lat', 'Long']].dropna()

my_map=folium.Map(location = [42.320,-71.05], #Initiate map on Boston city
                  zoom_start = 11,
                  min_zoom = 11
)

HeatMap(data=location_of_most_common_crimes.sample(10000), radius=16).add_to(my_map)

my_map

* How are crimes distributed amongst the districts?

In [None]:
districts = pd.DataFrame({'Count': crimes.DISTRICT.value_counts().sort_values(ascending = False)})
districts

In [None]:
plt.figure(figsize = (12,8))
sns.barplot(x = districts.index, y = 'Count', data = districts, palette = 'Reds_r')
plt.axhline(districts['Count'].mean(), label = 'mean', color = 'black') # horizontal line at the mean value
plt.legend()
plt.title('Crimes per district in Boston', fontsize = 16)
plt.ylabel(None)
plt.xlabel('\nDISTRICT')
plt.yticks(np.arange(10000, 55000, 10000))
plt.tick_params(labelsize = 12)
plt.tight_layout()

B2, C11, D4, A1 and B3 have the highest crime counts.
(Note that the DISTRICT column contains 1765 NaN values, which are not included in these counts.)
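
To see why the NaN rows do not appear in the chart: `value_counts()` drops missing values unless `dropna=False` is passed. A small sketch on a made-up series (`toy_district` is an assumption used only for illustration):

```python
import pandas as pd

# Made-up district labels with two missing entries.
toy_district = pd.Series(['B2', 'B2', None, 'D4', None, 'B2'])

# Default behaviour: missing values are silently left out.
counts = toy_district.value_counts()

# With dropna=False the missing values get their own row.
counts_with_nan = toy_district.value_counts(dropna=False)

print(counts.sum(), counts_with_nan.sum(), toy_district.isnull().sum())
```

The difference between the two totals equals the number of missing values, which is a quick way to confirm how many rows a per-district chart leaves out.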

Let's visualize this with geopandas. 

In [None]:
gdf['crimes'] = gdf.DISTRICT.map(districts['Count']) # use map function to match each district with its corresponding value
ax = gdf.plot(column = gdf.crimes, cmap = 'Reds', legend = True, edgecolor = 'black', linewidth = 0.3, figsize = (12,8))
add_label()
plt.title('Crimes per district in Boston', fontsize = 16)
plt.tight_layout()
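
The `map` call above matches on the district label, so any district in the shapefile that has no entry in the counts ends up as NaN. A minimal sketch of that behaviour, using a plain DataFrame as a stand-in for the GeoDataFrame (`toy_gdf`, `toy_counts` and the label `Z9` are made up for illustration):

```python
import pandas as pd

# Stand-in for the districts table; 'Z9' has no matching count.
toy_gdf = pd.DataFrame({'DISTRICT': ['A1', 'B2', 'Z9']})

# Stand-in for districts['Count'], indexed by district label.
toy_counts = pd.Series({'A1': 10, 'B2': 25})

# Series.map looks each label up in the index of toy_counts.
toy_gdf['crimes'] = toy_gdf['DISTRICT'].map(toy_counts)

# Labels that found no match are left as NaN and can be listed explicitly.
unmatched = toy_gdf.loc[toy_gdf['crimes'].isnull(), 'DISTRICT'].tolist()
print(unmatched)
```

Listing the unmatched labels before plotting is a cheap check that the two datasets spell the district codes the same way.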

* At what time of day are most crimes reported?

In [None]:
crimes_per_hour = pd.DataFrame({'Count': crimes['HOUR'].value_counts().sort_index()})
crimes_per_hour

In [None]:
plt.figure(figsize = (12,8))
sns.barplot(x = crimes_per_hour.index, y = crimes_per_hour['Count'], data = crimes_per_hour, color = '#7AD7F0')
plt.ylabel(None)
plt.xlabel(None)
plt.yticks(np.arange(2500, 22000, 2500))
plt.tick_params(labelsize = 12)
plt.title('Boston crimes per hour', fontsize = 16)
plt.tight_layout()

Reports peak between 4 PM and 7 PM and are lowest between 2 AM and 6 AM.

* How are crimes distributed across the week?

In [None]:
labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
crimes_per_day = pd.DataFrame({'Count': crimes['DAY_OF_WEEK'].value_counts().loc[labels]})
crimes_per_day

In [None]:
plt.figure(figsize = (12,8))
sns.barplot(x = crimes_per_day.index, y = 'Count', data = crimes_per_day)
plt.ylabel(None)
plt.xlabel(None)
plt.yticks(np.arange(10000, 55000, 10000))
plt.tick_params(labelsize = 12)
plt.title('Boston crimes per day', fontsize = 16)
plt.tight_layout()

Crime peaks on Friday and reaches its trough on Sunday.

* How many crimes per month/year?

As shown below, this dataset contains crimes reported between 06/2015 and 09/2018. Because 2016 and 2017 are the only complete years, I will use the data collected in those two years for this question.

In [None]:
print(crimes.OCCURRED_ON_DATE.min())
print(crimes.OCCURRED_ON_DATE.max())
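
Taking `min()` and `max()` of a text column compares the strings character by character, which gives the right answer only if the timestamps are written year first. A hedged sketch of the safer route, parsing to datetimes first (the three timestamps below are made up, and the year-first format is an assumption about the column):

```python
import pandas as pd

# Made-up timestamps in an assumed 'YYYY-MM-DD HH:MM:SS' format.
toy_dates = pd.Series([
    '2016-01-01 12:30:00',
    '2015-06-15 00:00:00',
    '2018-09-03 21:25:00',
])

# Parse the strings so comparisons are chronological, not alphabetical.
parsed = pd.to_datetime(toy_dates)

first, last = parsed.min(), parsed.max()
print(first, last)
```

On the real data this would be `pd.to_datetime(crimes.OCCURRED_ON_DATE)`, assuming the column parses cleanly.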

In [None]:
crimes_2016_2017 = crimes[crimes['YEAR'].isin([2016, 2017])] # YEAR is numeric, so compare against integers
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
crimes_per_month = pd.DataFrame({'Count': crimes_2016_2017['MONTH'].value_counts().sort_index().values}, index = months)
crimes_per_month
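
Assigning the month names through `.values` relies on all twelve months being present in the data. A sketch of a variant that does not depend on that, using `reindex` and the standard `calendar` module (`toy_month` is made-up data; the English month names assume the default locale):

```python
import calendar
import pandas as pd

# Made-up month numbers; several months are deliberately absent.
toy_month = pd.Series([1, 1, 3, 12, 3, 3])

# Force one row per month 1..12, filling absent months with zero.
counts = toy_month.value_counts().reindex(range(1, 13), fill_value=0)

# Replace the month numbers with their names, in calendar order.
counts.index = [calendar.month_name[m] for m in counts.index]
print(counts)
```

With this approach a missing month shows up as a zero instead of shifting every later label by one position.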

In [None]:
plt.figure(figsize = (14, 8))
sns.barplot(x = crimes_per_month.index, y = 'Count', data = crimes_per_month, palette = 'tab10')
plt.ylabel(None)
plt.xlabel(None)
plt.yticks(np.arange(2500, 20000, 2500))
plt.tick_params(labelsize = 12)
plt.title('Boston crimes per month 2016 - 2017', fontsize = 16)
plt.tight_layout()

More crimes are reported in summer than in winter.

In [None]:
crimes_per_year = pd.DataFrame({'Count': crimes_2016_2017['YEAR'].value_counts().sort_index()})
crimes_per_year

In [None]:
plt.figure(figsize = (12, 8))
sns.barplot(x = crimes_per_year.index, y = 'Count', data = crimes_per_year)
plt.ylabel(None)
plt.tick_params(labelsize = 12)
plt.yticks(np.arange(20000, 120000, 20000))
plt.title('Boston crimes per year 2016 - 2017', fontsize = 16)
plt.tight_layout()

In [None]:
crimes_per_year['population'] = [678430, 685094] # Boston population for 2016 and 2017

In [None]:
(crimes_per_year.loc[2017].Count-crimes_per_year.loc[2016].Count)/crimes_per_year.loc[2016].Count

In [None]:
(crimes_per_year.loc[2017].population-crimes_per_year.loc[2016].population)/crimes_per_year.loc[2016].population

From 2016 to 2017, the population increased by about 1% and the number of reported crimes by about 1.8%.
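
The two growth figures can be combined into a rough per-capita change. The sketch below uses the population numbers from the cell above and takes the 1.8% growth in crimes as given; the helper `pct_change` is a hypothetical name introduced here for illustration:

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

# Boston population for 2016 and 2017, as used above.
population_growth = pct_change(678430, 685094)
print(round(population_growth, 2))  # 0.98

# Growth in reported crimes, taken from the result above (about 1.8%).
crime_growth = 1.8

# Crimes per resident scale with the ratio of the two growth factors.
per_capita_growth = ((1 + crime_growth / 100) / (1 + population_growth / 100) - 1) * 100
print(round(per_capita_growth, 2))
```

Under these numbers, crimes per resident still rise, but by less than the raw count, because part of the increase is absorbed by population growth.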

**This was my first kernel. Thank you for reviewing. Please feel free to comment or advise.**