This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at 312.745.6071 or RDAnalysis@chicagopolice.org. Disclaimer: These crimes may be based upon preliminary information supplied to the Police Department by the reporting parties that have not been verified. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time. The Chicago Police Department will not be responsible for any error or omission, or for the use of, or the results obtained from the use of this information. All data visualizations on maps should be considered approximate and attempts to derive specific addresses are strictly prohibited. The Chicago Police Department is not responsible for the content of any off-site pages that are referenced by or that reference this web page other than an official City of Chicago or Chicago Police Department web page. The user specifically acknowledges that the Chicago Police Department is not responsible for any defamatory, offensive, misleading, or illegal conduct of other users, links, or third parties and that the risk of injury from the foregoing rests entirely with the user. 
The unauthorized use of the words "Chicago Police Department," "Chicago Police," or any colorable imitation of these words or the unauthorized use of the Chicago Police Department logo is unlawful. This web page does not, in any way, authorize such use. Data are updated daily. The dataset contains more than 6,000,000 records/rows of data and cannot be viewed in full in Microsoft Excel. To access a list of Chicago Police Department - Illinois Uniform Crime Reporting (IUCR) codes, go to http://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e

Content
ID - Unique identifier for the record.

Case Number - The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.

Date - Date when the incident occurred. This is sometimes a best estimate.

Block - The partially redacted address where the incident occurred, placing it on the same block as the actual address.

IUCR - The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. See the list of IUCR codes at https://data.cityofchicago.org/d/c7ck-438e.

Primary Type - The primary description of the IUCR code.

Description - The secondary description of the IUCR code, a subcategory of the primary description.

Location Description - Description of the location where the incident occurred.

Arrest - Indicates whether an arrest was made.

Domestic - Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.

Beat - Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the beats at https://data.cityofchicago.org/d/aerh-rz74.

District - Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r.

Ward - The ward (City Council district) where the incident occurred. See the wards at https://data.cityofchicago.org/d/sp34-6z76.

Community Area - Indicates the community area where the incident occurred. Chicago has 77 community areas. See the community areas at https://data.cityofchicago.org/d/cauq-8yn6.

FBI Code - Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these classifications at http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html.

X Coordinate - The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.

Y Coordinate - The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.

Year - Year the incident occurred.

Updated On - Date and time the record was last updated.

Latitude - The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.

Longitude - The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.

Location - The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.

What we are doing in this project:
Data Cleanup
Data Visualization
Data Modeling

In [None]:
import pandas as pd
import numpy as np

import sklearn.preprocessing 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Importing required library module
from sklearn.preprocessing import OneHotEncoder
# scale the features
from sklearn.preprocessing import MinMaxScaler


import matplotlib as mpl
import matplotlib.cm as cm
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
sns.set_style("whitegrid")
sns.set_context("poster")

sns.set()

In [None]:
# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
data = pd.read_csv('dataset/Crimes_-_2001_to_present.csv', on_bad_lines='skip')

In [None]:
df=data.copy()

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

In [None]:
df.replace(' ',np.nan,inplace=True)

In [None]:
# Preview only: the result is not assigned, so df itself is unchanged
df.dropna()

In [None]:
df.fillna(0,inplace=True)

In [None]:
df.isnull().sum()

In [None]:
#df = df[(df['Year'] >= 2011) & (df['Year'] <= 2017)]

In [None]:
#df[:10]

In [None]:
df['Primary Type'].value_counts()

In [None]:
df['Location Description'].value_counts()

In [None]:
## Plot these for better visualization
crime_type_df = df['Primary Type'].value_counts(ascending=True)

## Some formatting to make it look nicer
fig=plt.figure(figsize=(18, 16))
plt.title("Frequency of Crimes Per Crime Type")
plt.xlabel("Frequency of Crimes")
plt.ylabel("Type of Crime")
ax = crime_type_df.plot(kind='barh')
ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

In [None]:
df['Year'].isnull().values.any()

In [None]:
## Count number of reported crimes for each year
df['Year'].value_counts()

In [None]:
## Plot these for better visualization
crime_year_df = df['Year'].value_counts(ascending=True)

## Some formatting to make it look nicer
fig=plt.figure(figsize=(10, 8))
plt.title("Frequency of Crimes Per Year in Chicago")
plt.xlabel("Frequency of Crimes")
plt.ylabel("Year")
ax = crime_year_df.plot(kind='barh')
ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

In [None]:
## Check if any rows are missing data and are null
df['Arrest'].isnull().values.any()

In [None]:
## Count incidents with and without an arrest (over all years)
df['Arrest'].value_counts()

In [None]:
sns.countplot(x='Arrest', data=df);

In [None]:
## Compare the distribution of years for incidents with and without an arrest
sns.boxplot(x='Arrest',y='Year',data=df)

In [None]:
## Convert values into percentages of all incidents
arrest_df = df['Arrest'].value_counts()
## Divide by the total number of incidents, not by the number of arrests
arrest_percent = (arrest_df / arrest_df.sum()) * 100 

## Rename Series.name
arrest_percent.rename("% of Arrests",inplace=True)

## Rename True and False to % Arrested and % Not Arrested
arrest_percent.rename({True: '% Arrested', False: '% Not Arrested'},inplace=True)
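As a cross-check on the percentages above, pandas can compute the shares directly with `value_counts(normalize=True)`. The sketch below uses a made-up four-row `Arrest` column (the name `toy_arrest` and its values are assumptions for illustration, not the Chicago data).

```python
import pandas as pd

# Hypothetical toy column: one arrest out of four incidents
toy_arrest = pd.Series([True, False, False, False], name="Arrest")

# normalize=True returns each value's share of all rows; multiply by 100 for percent
toy_percent = toy_arrest.value_counts(normalize=True) * 100
toy_percent = toy_percent.rename({True: "% Arrested", False: "% Not Arrested"})
print(toy_percent)
```

Because the shares are taken over all rows, the two percentages always add up to 100.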

In [None]:
## Format pie chart to nicely show percentage and count
def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return '{p:.2f}%  ({v:d})'.format(p=pct,v=val)
    return my_autopct

## Plot results in a pie chart
arrest_percent.plot.pie(fontsize=11,
                       autopct=make_autopct(df['Arrest'].value_counts()),
                       figsize=(8, 8))


In [None]:
#What are the successful arrest percentages per year?
## Group dataset by year and arrests
arrest_per_year = df.groupby('Year')['Arrest'].value_counts().rename('Counts').to_frame()
arrest_per_year['Percentage'] = (100 * arrest_per_year / arrest_per_year.groupby(level=0).sum())
arrest_per_year.reset_index(level=[1],inplace=True)
arrest_per_year

In [None]:
## Create a line plot for percentages of successful arrests over time (2001 to present)
line_plot = arrest_per_year[arrest_per_year['Arrest'] == True]['Percentage']

## Configure line plot to make visualizing data cleaner
labels = line_plot.index.values
fig=plt.figure(figsize=(12, 10))
plt.title('Percentages of successful arrests from 2001 to present')
plt.xlabel("Year")
plt.ylabel("Successful Arrest Percentage")
plt.xticks(line_plot.index, line_plot.index.values)

line_plot.plot(grid=True, marker='o', color='mediumvioletred')


In [None]:
plt.figure(figsize=(10,5))
chart = sns.countplot(
    x='Year',
    hue='Arrest',
    data=df
)
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)

In [None]:
df['Arrest'].value_counts()

In [None]:
plt.figure(figsize=(25,5))
chart = sns.countplot(
    x='Ward',
    hue='Arrest',
    data=df
)
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)

In [None]:
plt.figure(figsize=(10,5))
chart = sns.countplot(
    x='Primary Type',
    hue='Arrest',
    data=df
)
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)

In [None]:
df.groupby('Primary Type')['Arrest'].value_counts()

In [None]:
plt.figure(figsize=(8,30))
df.groupby([df['Location Description']]).size().sort_values(ascending=True).plot(kind='barh')
plt.title('Number of crimes by Location')
plt.ylabel('Crime Location')
plt.xlabel('Number of crimes')
plt.show()

In [None]:
plt.figure(figsize=(8,30))
df.groupby([df['Ward']]).size().sort_values(ascending=True).plot(kind='barh')
plt.title('Number of crimes by Ward')
plt.ylabel('Crime Ward')
plt.xlabel('Number of crimes')
plt.show()

In [None]:
df_graphs=df.copy()

In [None]:
# convert dates to pandas datetime format and setting the index to be the date will help us a lot later on
df_graphs.Date = pd.to_datetime(df_graphs.Date, format='%m/%d/%Y %I:%M:%S %p')

df_graphs.index = pd.DatetimeIndex(df_graphs.Date)

df_graphs.tail()

In [None]:
#Exploration and visualization
#Question answered: how many crimes were reported per month over the period covered by the data (2001 - 2019)?
plt.figure(figsize=(12,6))
df_graphs.resample('M').size().plot()
plt.title('Number of crimes per month (2001 - 2019)')
plt.xlabel('Months')
plt.ylabel('Number of crimes')
plt.show()

The above chart shows a clear periodic (seasonal) pattern in crime counts that repeats over many years.

I suspect this regular pattern is a large part of why crime is such a predictable activity.

The chart also shows a decreasing trend in the number of crimes from around 2006-2007, but it is not clear from the monthly counts alone whether the overall total is decreasing.

To find out, I have written code that shows whether the sum of all crimes decreases over time.

In [None]:
#now let's see if the sum of all crimes is decreasing over time
plt.figure(figsize=(12,6))
df_graphs.resample('D').size().rolling(365).sum().plot()
plt.title('Sum of all crimes from 2001 - 2019')
plt.xlabel('Days')
plt.ylabel('Number of crimes')
plt.show()

# The diagram below shows a decrease in the overall number of crimes

Here, we take a finer scale to get the visualization right. I decided to look at the rolling sum of crimes: for each day, we calculate the total number of crimes over the preceding 365 days. If this rolling sum decreases over a given year, fewer crimes were reported in that year than in the year before. If the rolling sum stays the same during a given year, the yearly total stayed the same.
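To make the rolling sum concrete, here is a minimal sketch on synthetic daily counts (the numbers are assumptions for illustration, not the Chicago data): 100 crimes per day in the first year and 80 per day in the second.

```python
import numpy as np
import pandas as pd

# Synthetic daily counts: 365 days at 100 per day, then 365 days at 80 per day
days = pd.date_range("2001-01-01", periods=730, freq="D")
daily_counts = pd.Series(np.r_[np.full(365, 100), np.full(365, 80)], index=days)

# Each value is the total over the current day and the 364 days before it;
# the first 364 values are NaN because the window is not yet full
rolling_year = daily_counts.rolling(365).sum()

first_full_window = rolling_year.iloc[364]  # covers the first year only
last_window = rolling_year.iloc[-1]         # covers the second year only
print(first_full_window, last_window)
```

The rolling sum falls steadily during the second year because every new day replaces a 100-crime day in the window with an 80-crime day.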

Thus, from the above chart we can say that the sum of crime has indeed decreased.

But now the question that comes to my mind is: are all types of crime decreasing? Let's find out.

In [None]:
#now let's separate crimes by type
crimes_count_date = df_graphs.pivot_table('ID', aggfunc=np.size, columns='Primary Type',
                                       index=df_graphs.index.date, fill_value=0)
crimes_count_date.index = pd.DatetimeIndex(crimes_count_date.index)
plo = crimes_count_date.rolling(365).sum().plot(figsize=(12, 30), 
                                                subplots=True, layout=(-1, 3), 
                                                sharex=False, sharey=False)

#if we had relied only on the previous graph we would have been misled, since some crime types have actually
#increased over time
#Crimes like Concealed Carry License Violation, Deceptive Practice, and Human Trafficking show an increasing trend

In [None]:
data.head()

In [None]:
session_one_data=data.copy()

In [None]:
# Preview only: the result is not assigned, so session_one_data is unchanged
session_one_data.dropna()

In [None]:
session_one_data = session_one_data[(session_one_data['Year'] >= 2011) & (session_one_data['Year'] <= 2017)]

In [None]:
session_one_data=session_one_data.drop(["Case Number", "Domestic", "Beat", "Updated On", "Arrest"], axis=1)

In [None]:
datagb_crime=session_one_data.groupby("Primary Type")["Primary Type"].count()
datagb_crime.sort_values(ascending=False, inplace=True)
datagb_crime.head(50)

In [None]:
crime_list=datagb_crime.index.values[0:25].tolist()
session_one_data=session_one_data[session_one_data["Primary Type"].isin(crime_list)]

In [None]:
session_one_data=session_one_data[session_one_data["Primary Type"]!="OTHER OFFENSE"]

In [None]:
severe_crime_list=["ARSON", "ASSAULT", "BATTERY", "CRIM SEXUAL ASSAULT", "CRIMINAL DAMAGE", "CRIMINAL TRESPASS", "HOMICIDE", "ROBBERY"]
session_one_data["severe"]=np.where(session_one_data['Primary Type'].isin(severe_crime_list), 1, 0)
session_one_data.head(5)

In [None]:
print(len(session_one_data.groupby("Location Description")["Location Description"].count().index.values))

In [None]:
datagb_location=session_one_data.groupby("Location Description")["Location Description"].count()
datagb_location.sort_values(ascending=False, inplace=True)
datagb_location.head(50)

In [None]:
location_list=datagb_location.index.values[0:25].tolist()
session_one_data=session_one_data[session_one_data["Location Description"].isin(location_list)]
session_one_data.shape

In [None]:
print(session_one_data.groupby("District")["District"].count())
print(session_one_data.groupby("Community Area")["Community Area"].count())

In [None]:
datagb_destrict=session_one_data.groupby("District")["District"].count()
district_list=datagb_destrict.index.values[0:22].tolist()
session_one_data=session_one_data[session_one_data["District"].isin(district_list)]
session_one_data.shape

In [None]:
session_one_data["District"]='D'+session_one_data['District'].astype(str)
session_one_data.head(2)

In [None]:
#make a dummy variable for district and primary type of crime
dummydf=pd.get_dummies(session_one_data,columns=["Primary Type","District"])
#we will just make a copy here in case we need to use it in the future
dummydf=dummydf.join(session_one_data[["District","Primary Type"]])
print(dummydf.shape)
dummydf.head(2)
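For readers unfamiliar with `pd.get_dummies`, the sketch below shows on a made-up three-row frame (the name `toy` and its values are assumptions) how one categorical column becomes one indicator column per category. This is also where column names such as `District_D11.0` used later in the notebook come from.

```python
import pandas as pd

# Hypothetical toy frame with a single categorical column
toy = pd.DataFrame({"ID": [1, 2, 3], "District": ["D1.0", "D2.0", "D1.0"]})

# Each distinct value becomes its own column, named <original column>_<value>
encoded = pd.get_dummies(toy, columns=["District"])
print(list(encoded.columns))
```

The encoded column itself is removed from the result, which is why the notebook joins `District` and `Primary Type` back on afterwards.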

In [None]:
timedataf=dummydf.copy()
from datetime import datetime
# Parse the timestamp column once instead of re-parsing every row four times
date_format = '%m/%d/%Y %I:%M:%S %p'
parsed = pd.to_datetime(dummydf["Date"], format=date_format)
dummydf["time_24hour"]=parsed.dt.strftime("%H:%M")
# 3-hour blocks labelled by their starting hour: "0", "3", ..., "21"
dummydf["Timeblock"]=(3*(parsed.dt.hour//3)).astype(str)
dummydf['Date_no_time']=parsed.dt.strftime("%Y%m%d")
dummydf["Weekday"]=parsed.dt.day_name()
dummydf=pd.get_dummies(dummydf,columns=["Timeblock","Weekday"])
dummydf.shape
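The time features derived in the cell above can be checked on a single timestamp. This is a small standard-library sketch; the helper name `time_features` and the sample timestamp are assumptions for illustration.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # the format string used in this notebook

def time_features(raw):
    """Derive the 24-hour time, 3-hour block, and weekday from one Date string."""
    ts = datetime.strptime(raw, DATE_FORMAT)
    return {
        "time_24hour": ts.strftime("%H:%M"),
        # Blocks are labelled by their starting hour: 0, 3, ..., 21
        "Timeblock": str(3 * (ts.hour // 3)),
        "Weekday": ts.strftime("%A"),
    }

features = time_features("03/18/2015 07:44:00 PM")
print(features)
```

An incident at 7:44 PM falls into the block that starts at 18:00, so its `Timeblock` value is `"18"`.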

In [None]:
dummydf = dummydf.dropna()
dummydf.shape

# Working on Police Station Data

In [None]:
chicago=dummydf
police_df=pd.read_csv("dataset/police-stations.csv")

In [None]:
police_df.head()

In [None]:
chicago_lati = np.array(chicago["Latitude"])

In [None]:
chicago_longi = np.array(chicago["Longitude"])

In [None]:
chicago_lat_long = zip(chicago_lati, chicago_longi)

In [None]:
chicago_lat_long_list = list(chicago_lat_long)

In [None]:
chicago_lat_long_list

In [None]:
stations_array = [(41.8583725929, -87.627356171), (41.8018110912, -87.6305601801), 
                  (41.7664308925, -87.6057478606), (41.7079332906, -87.5683491228),
                  (41.6927233639, -87.6045058667),
                 (41.7521368378, -87.6442289066), (41.7796315359, -87.6608870173),
                  (41.778987189, -87.7088638153),(41.8373944311, -87.6464077068),
                  (41.8566845327, -87.708381958),(41.8735822883, -87.705488126),
                  (41.8629766244, -87.6569725149),(41.9211033246, -87.6974518223),
                  (41.8800834614, -87.768199889),(41.9740944511, -87.7661488432),
                  (41.9660534171, -87.728114561),(41.9032416531, -87.6433521393),
                  (41.9474004564, -87.651512018),(41.9795495131, -87.6928445094),
                  (41.6914347795, -87.6685203937), (41.9997634842, -87.6713242922),
                  (41.9186088912, -87.765574479)]

In [None]:
station_array = np.array (stations_array)

In [None]:
chicago["Location"] = chicago_lat_long_list

# Calculating distance between particular crime location and its closest police station

We will use the Haversine formula to calculate the distance from each crime scene in our dataset to its closest police station:

In [None]:
%%time
from math import radians, cos

distance = []
earth_radius_miles = 3956.0

x = chicago["Location"]
y = station_array 

def get_shortest_in(needle, haystack):
    # needle is a single (lat, long) tuple; haystack is an (n, 2) numpy array of (lat, long) points.
    # Returns the Haversine distance in miles from needle to the nearest point in haystack.
    
    dlat = np.radians(haystack[:,0]) - radians(needle[0])
    dlon = np.radians(haystack[:,1]) - radians(needle[1])
    a = np.square(np.sin(dlat/2.0)) + cos(radians(needle[0])) * np.cos(np.radians(haystack[:,0])) * np.square(np.sin(dlon/2.0))
    # Clip sqrt(a) at 1 to guard against floating-point overshoot before arcsin
    great_circle_distance = 2 * np.arcsin(np.minimum(np.sqrt(a), 1.0))
    d = earth_radius_miles * great_circle_distance
    return np.min(d)

# Distance from every crime location to its nearest police station
for i in x:
    distance.append(get_shortest_in(i, y))
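As a sanity check on the Haversine formula, $a = \sin^2(\Delta\varphi/2) + \cos\varphi_1 \cos\varphi_2 \sin^2(\Delta\lambda/2)$ and $d = 2R \arcsin\sqrt{a}$, the sketch below measures the distance between two points one degree of latitude apart. The function name `haversine_miles` and the two coordinates are hypothetical; the radius is the value used in this notebook.

```python
import numpy as np

EARTH_RADIUS_MILES = 3956.0  # same radius as in the notebook cell

def haversine_miles(point, others):
    """Great-circle distance in miles from one (lat, lon) point to each row of `others`."""
    lat1, lon1 = np.radians(point)
    lat2 = np.radians(others[:, 0])
    lon2 = np.radians(others[:, 1])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Hypothetical "stations": the query point itself and a point one degree further north
toy_stations = np.array([[41.0, -87.0], [42.0, -87.0]])
toy_distances = haversine_miles((41.0, -87.0), toy_stations)
print(toy_distances)
```

With this radius, one degree of latitude comes out at about 69 miles, which matches the usual rule of thumb.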


In [None]:
chicago["closest_station"] = distance
chicago.head()

In [None]:
chicago.shape

In [None]:
smalldf = chicago
inc_edu_age = pd.read_csv("dataset/Census_Data_-_Selected_socioeconomic_indicators_in_Chicago__2008___2012.csv",index_col=0)

In [None]:
smalldf.head()

In [None]:
inc_edu_age.head(5)

In [None]:
smalldf["Income"]=smalldf["Community Area"].apply(lambda row: inc_edu_age.iloc[int(row)-1].iloc[5])

In [None]:
smalldf['HARDSHIP INDEX']=smalldf["Community Area"].apply(lambda row: inc_edu_age.iloc[int(row)-1].iloc[0])

In [None]:
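#NOTE: this reads positional column 5, the same column that "Income" was read from above; check the positions against inc_edu_age.columns
smalldf['Under18_over64']=smalldf["Community Area"].apply(lambda row: inc_edu_age.iloc[int(row)-1].iloc[5])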

In [None]:
smalldf['Unemployed']=smalldf["Community Area"].apply(lambda row: inc_edu_age.iloc[int(row)-1].iloc[4])

In [None]:
smalldf['House_below_poverty']=smalldf["Community Area"].apply(lambda row: inc_edu_age.iloc[int(row)-1].iloc[2])
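Positional lookups like the ones above are easy to get wrong, so a name-based lookup may be safer. The sketch below is one possible approach using `Series.map`. The census column names are assumptions about the file's header and the values are invented, so check both against `inc_edu_age.columns` before adapting it.

```python
import pandas as pd

# Hypothetical stand-in for the census table, indexed by community area number.
# Column names are assumed and the values are made up.
toy_census = pd.DataFrame(
    {"PER CAPITA INCOME": [20000, 30000], "PERCENT AGED 16+ UNEMPLOYED": [5.0, 10.0]},
    index=pd.Index([1, 2], name="Community Area Number"),
)
toy_crimes = pd.DataFrame({"Community Area": [2.0, 1.0, 2.0]})

# Map each incident's community area number to the matching census value by name
area = toy_crimes["Community Area"].astype(int)
toy_crimes["Income"] = area.map(toy_census["PER CAPITA INCOME"])
toy_crimes["Unemployed"] = area.map(toy_census["PERCENT AGED 16+ UNEMPLOYED"])
print(toy_crimes)
```

A community area number that is missing from the census index yields NaN here, where a positional lookup would fail or pick up an unrelated row.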

In [None]:
smalldf.head(5)

In [None]:
smalldf.head()
print(smalldf.shape)
smalldf=smalldf[smalldf["Location Description"]!="OTHER"]
smalldf=pd.get_dummies(smalldf,columns=["Location Description"])
print(smalldf.shape)

# Modeling on Data

In [None]:
data=smalldf.sample(80000)
data.shape

In [None]:
data.head(5)

In [None]:
data.reset_index(inplace=True)

In [None]:
data.head(5)

In [None]:
data=data.drop(["HARDSHIP INDEX"], axis=1)

In [None]:
data=data.drop(["ID","Date","Block","IUCR","Description","Ward","Community Area","FBI Code"], axis=1)

In [None]:
data=data.drop(["X Coordinate","Y Coordinate","Year","Location","District","Primary Type","time_24hour","Date_no_time"], axis=1)

In [None]:
a=list(data.columns.values)
droplist=[]
for i in a:
    if i.startswith("Primary Type"):
        droplist.append(i)
data=data.drop(droplist, axis=1)

In [None]:
#pip install -U scikit-learn

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
itrain, itest = train_test_split(range(data.shape[0]), train_size=0.7)

In [None]:
mask=np.ones(data.shape[0], dtype='int')
mask[itrain]=1
mask[itest]=0
mask=(mask==1)

In [None]:
data.shape

In [None]:
data.head(5)

In [None]:
data=data.dropna()

In [None]:
# Display only: np.nan_to_num returns a new array and leaves data unchanged
np.nan_to_num(data)

In [None]:
#We have a list of continuous features, or in other words standardizable variables
STANDARDIZABLE=["Latitude", "Longitude","closest_station","House_below_poverty","Unemployed","Under18_over64","Income"]

#Also create a list of the remaining columns. Only the response variable is removed here, so this list
#still contains the continuous features above and the "index" column created by reset_index.
INDICATOR=list(data.columns)
#We need to remove the response variable from our total list of features
INDICATOR.remove(u'severe')


In [None]:
print (len(STANDARDIZABLE), len(INDICATOR))

In [None]:
STANDARDIZABLE

In [None]:
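from sklearn.preprocessing import StandardScaler
#Fit the scaler on the training set only, so that test rows are scaled with the training mean and variance
scaler=StandardScaler().fit(data.loc[mask,STANDARDIZABLE])
#Standardize training set
data.loc[mask,STANDARDIZABLE]=scaler.transform(data.loc[mask,STANDARDIZABLE])
#Standardize test set with the same fitted scaler
data.loc[~mask,STANDARDIZABLE]=scaler.transform(data.loc[~mask,STANDARDIZABLE])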

In [None]:
data.head(5)

In [None]:
fig=plt.figure(figsize=(24,36))
pos=data[data["severe"]==1]
neg=data[data["severe"]==0]
for k in range (7):
    ax=fig.add_subplot(5,3,k+1)    
    sns.kdeplot(pos[STANDARDIZABLE[k]],color="red",label="Severe")
    sns.kdeplot(neg[STANDARDIZABLE[k]],color="blue",label="Not severe")
    ax.set_title(STANDARDIZABLE[k])
    ax.set_xlabel("Normalized Z-score")
    ax.set_ylabel("Frequency")

In [None]:
#The following cell plots only the continuous features with significant effects (for presentation purposes)
plotlist=["Latitude","House_below_poverty","Unemployed","Under18_over64","Income"]
plotlist_title=["Latitude","Proportion of House Below Poverty","Proportion Unemployed", "Proportion with Age Under 18 Over 64", "Income per Capita"]
fig=plt.figure(figsize=(24,18))
for k in range (5):
    ax=fig.add_subplot(2,3,k+1)    
    sns.kdeplot(pos[plotlist[k]],color="red",label="Severe")
    sns.kdeplot(neg[plotlist[k]],color="blue",label="Not severe")
    ax.set_title(plotlist_title[k])
    ax.set_xlabel("Normalized Z-score")
    ax.set_ylabel("Frequency")

In [None]:
fig=plt.figure(figsize=(30,144))
pos=data[data['severe']==1]
neg=data[data['severe']==0]
for k in range (68):
    ax=fig.add_subplot(17,4,k+1)
    ax.hist((pos[INDICATOR[k]],neg[INDICATOR[k]]),stacked=True,color=("red","blue"),range=[0,1])
    ax.set_title(INDICATOR[k])
    ax.legend(("Severe","Not severe"),loc="upper center")
    ax.set_xlabel("Value")
    ax.set_ylabel("Crime Counts")

In [None]:
#This cell plots the indicator variables that look like good predictors of whether a crime is severe or not.
fig=plt.figure(figsize=(26,18))
plotfeature=["District_D11.0","Timeblock_3","Timeblock_9","Timeblock_12","Location Description_APARTMENT","Location Description_STREET"]
plotfeature_title=["Police District 11","Time: 3am-6am","Time: 9am-12pm", "Time: 12pm to 3pm","Location: apartment","Location: street"]
for k in range (6):
    ax=fig.add_subplot(2,3,k+1)
    ax.hist((pos[plotfeature[k]],neg[plotfeature[k]]),stacked=True,color=("red","blue"),range=[0,1])
    ax.set_title(plotfeature_title[k])
    ax.legend(("Severe","Not severe"),loc="upper center")
    ax.set_xlabel("Value")
    ax.set_ylabel("Crime Counts")

# Model 0 - Baseline model

In [None]:
pos=data[data['severe']==1]
neg=data[data['severe']==0]
percent_severe=float(len(pos))/len(data)
percent_non_severe=float(len(neg))/len(data)
print (percent_severe, percent_non_severe)

In [None]:
#Let's make a dictionary storing confusion matrix for all the algorithms, so that we can have some comparison
confusion_dict={}
confusion_dict["Baseline_model"]=np.asarray([[len(neg),0],[len(pos),0]])
#Also create a dictionary to store all the models
model_dict={}
#The following dict will store the accuracy for training set
accuracy_dict={}
#The following dict will store the accuracy for test set
accuracy_dict1={}
train_not_severe_percent=1-float(sum(data["severe"].values[mask]))/len(data["severe"].values[mask])
test_not_severe_percent=1-float(sum(data["severe"].values[~mask]))/len(data["severe"].values[~mask])
print (train_not_severe_percent, test_not_severe_percent) 
accuracy_dict["Baseline_model"]=train_not_severe_percent
accuracy_dict1["Baseline_model"]=test_not_severe_percent
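The baseline always predicts the majority class, so its accuracy is simply the share of that class. Here is a minimal sketch with made-up labels and a made-up train/test mask (both are assumptions for illustration):

```python
import numpy as np

# Hypothetical labels (1 = severe) and a boolean mask marking the training rows
toy_y = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
toy_mask = np.array([True] * 7 + [False] * 3)

# The majority class is learned from the training rows only
majority_class = np.bincount(toy_y[toy_mask]).argmax()

# Accuracy of always predicting that class
toy_train_accuracy = np.mean(toy_y[toy_mask] == majority_class)
toy_test_accuracy = np.mean(toy_y[~toy_mask] == majority_class)
print(majority_class, toy_train_accuracy, toy_test_accuracy)
```

Any model in the sections that follow should beat these two numbers to be worth using.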

# Model 1 - Logistic regression with Lasso-based feature selection

In [None]:
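#Assemble the feature list: continuous features first, then the indicator columns.
#Skip columns already in STANDARDIZABLE (so no feature appears twice) and the leftover "index" column.
total_features=STANDARDIZABLE+[c for c in INDICATOR if c not in STANDARDIZABLE and c!="index"]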

In [None]:
#reference: hw3 do_classify function
#Slightly modify the hw3 function, but overall very similar
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
def do_classify(clf, parameters, indf, featurenames, targetname, target1val, mask, score_func=None, n_folds=5):
    subdf=indf[featurenames]
    X=subdf.values
    y=(indf[targetname].values==target1val)*1
    Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print ("Training accuracy: %0.2f" % (training_accuracy))
    print ("Test accuracy:     %0.2f" % (test_accuracy))
    confmatrix=confusion_matrix(ytest, clf.predict(Xtest))
    print (confmatrix)
    print (clf)
    return clf, Xtrain, ytrain, Xtest, ytest, confmatrix, training_accuracy, test_accuracy

In [None]:
#reference: hw3 cv_optimize function
#we will use five fold validation by default
#This function is largely the same as the one in our hw
def cv_optimize(clf, parameters, X, y, n_folds=5, score_func=None):
    if score_func:
        gs=GridSearchCV(clf, param_grid=parameters, cv=n_folds, scoring=score_func)
    else:
        gs=GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(X, y)
    best = gs.best_estimator_
    return best

In [None]:
from sklearn.linear_model import LogisticRegression
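#L1 (lasso) penalty, so that C controls how many coefficients are driven to zero; liblinear supports L1
clflog = LogisticRegression(penalty="l1", solver="liblinear")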
clflog, Xtrain, ytrain, Xtest, ytest, confclflog, training_accuracy, test_accuracy=do_classify(clflog, {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 20.0, 40.0, 70.0, 100.0]}, data, total_features, u'severe', 1, mask=mask)
confusion_dict["Logistic"]=confclflog
model_dict["Logistic"]=clflog
accuracy_dict["Logistic"]=training_accuracy
accuracy_dict1["Logistic"]=test_accuracy

In [None]:
#in addition to l1 (lasso) regularization, we also tried l2 regularization (the default mode). It works equally well and seems to run faster.
from sklearn.linear_model import LogisticRegression
clflog2 = LogisticRegression(penalty="l2")
clflog2, Xtrain, ytrain, Xtest, ytest, confclflog2, training_accuracy, test_accuracy=do_classify(clflog2, {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 20.0, 40.0, 70.0, 100.0]}, data, total_features, u'severe', 1, mask=mask)

# Model 2 - Linear SVM

In [None]:
from sklearn.svm import LinearSVC
clfsvm=LinearSVC(loss="hinge")
clfsvm, Xtrain, ytrain, Xtest, ytest, confclfsvm, training_accuracy, test_accuracy= do_classify(clfsvm, {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 50, 100.0]}, data, total_features, u'severe',1, mask=mask)
confusion_dict["svm"]=confclfsvm
model_dict["svm"]=clfsvm
accuracy_dict["svm"]=training_accuracy
accuracy_dict1["svm"]=test_accuracy

# Model 3 - Decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
clfdt=DecisionTreeClassifier()
clfdt, Xtrain, ytrain, Xtest, ytest, confclfdt, training_accuracy, test_accuracy = do_classify(clfdt, {"max_depth":np.arange(1,20,2)}, data, total_features, u'severe',1, mask=mask)
confusion_dict["decision tree"]=confclfdt
model_dict["decision tree"]=clfdt
accuracy_dict["decision tree"]=training_accuracy
accuracy_dict1["decision tree"]=test_accuracy

# Model 4 - Naive Bayes model

In [None]:
from sklearn.naive_bayes import GaussianNB
clfgnb = GaussianNB()
clfgnb, Xtrain, ytrain, Xtest, ytest, confgnb, training_accuracy, test_accuracy=do_classify(clfgnb, None, data, total_features, u'severe',1, mask=mask)
confusion_dict["Naive Bayes"]=confgnb
model_dict["Naive Bayes"]=clfgnb
accuracy_dict["Naive Bayes"]=training_accuracy
accuracy_dict1["Naive Bayes"]=test_accuracy

# Model 5 - Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
randf=RandomForestClassifier()
clfrdf, Xtrain, ytrain, Xtest, ytest, confrdf, training_accuracy, test_accuracy=do_classify(randf, {"n_estimators":[10, 20, 30, 40, 100]}, data, total_features, u'severe',1, mask=mask)
confusion_dict["Random forest"]=confrdf
model_dict["Random forest"]=clfrdf
accuracy_dict["Random forest"]=training_accuracy
accuracy_dict1["Random forest"]=test_accuracy

# Model 6 - KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier()
neigh, Xtrain1, ytrain1, Xtest1, ytest1, confknn, training_accuracy, test_accuracy=do_classify(neigh, {"n_neighbors":[5, 10, 20, 40]}, data, total_features, u'severe',1, mask=mask)
confusion_dict["KNN"]=confknn
model_dict["KNN"]=neigh
accuracy_dict["KNN"]=training_accuracy
accuracy_dict1["KNN"]=test_accuracy

# Compare testing accuracy

In [None]:
#Compare training and testing accuracy
pd.Series(accuracy_dict).plot(kind="bar",title="Training accuracy",width=0.8,color="grey", fontsize=20)

In [None]:
#Compare training and testing accuracy
pd.Series(accuracy_dict1).plot(kind="bar",title="Test accuracy",width=0.8, color="grey",fontsize=20)

# Identify important factors determining whether a crime is severe or not based on coefficients

In [None]:
def nonzero_lasso(clf):
    featuremask=(clf.coef_ !=0.0)[0]
    return pd.DataFrame(dict(feature=total_features, coef=clf.coef_[0], abscoef=np.abs(clf.coef_[0])))[featuremask].sort_values('abscoef', ascending=False)
lasso_importances=nonzero_lasso(clflog)
lasso_importances.head(50)

# Multiclass classification into crime types

In [None]:
data2=smalldf.sample(80000)
data2=data2.drop(["ID","Date","Block","IUCR","Description","Ward","Community Area","FBI Code","severe"], axis=1)
data2=data2.drop(["X Coordinate","Y Coordinate","Year","Location","District","time_24hour","Date_no_time"], axis=1)
data2.head()

In [None]:
#np.nan_to_num(data2)
#data2 = data2.fillna(data2.mean())
#data2 = data2.dropna()

#data2.isnull().values.any()
#data2.isnull().sum().sum()

In [None]:
print(data2.shape)
data2["Crime_interested"]=data2["Primary Type_THEFT"]+data2["Primary Type_BATTERY"]+data2["Primary Type_NARCOTICS"]+data2["Primary Type_CRIMINAL DAMAGE"]
#only maintain the four specific crime types we are interested in classifying
data2=data2[data2["Crime_interested"]==1]
print(data2.shape)
data2.head()

In [None]:
def get_categorical_integer(row):
    if row=="THEFT":
        return int(0)
    elif row=="BATTERY":
        return int(1)
    elif row=="CRIMINAL DAMAGE":
        return int(2)
    elif row=="NARCOTICS":
        return int(3)

#Assign each crime type an integer identifier. These identifiers are passed directly to the algorithms below as the target.
data2["category"]=data2["Primary Type"].apply(get_categorical_integer)
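
The same encoding can also be written as a dictionary lookup. The sketch below is illustrative only: the toy Series stands in for the "Primary Type" column, and `category_codes` is a name introduced here. One difference is that a type missing from the dictionary becomes NaN, which makes stray rows easy to spot.

```python
import pandas as pd

# Hypothetical sample of "Primary Type" values, for illustration only
primary_type = pd.Series(["THEFT", "BATTERY", "NARCOTICS", "CRIMINAL DAMAGE", "ASSAULT"])

# Same integer codes as get_categorical_integer above
category_codes = {"THEFT": 0, "BATTERY": 1, "CRIMINAL DAMAGE": 2, "NARCOTICS": 3}

# Types that are not in the dictionary map to NaN
category = primary_type.map(category_codes)
print(category.tolist())
```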

In [None]:
#Previously we dropped rows that are not one of the four types, which left gaps in the integer index of the dataframe.
#Here we reset the integer index. This is required for the mask to work.
#data2 = data2.reset_index(drop=True)
data2.reset_index(inplace=True)

In [None]:
data2.head()

In [None]:
def do_classify2(clf, parameters, indf, featurenames,targetname, mask, score_func=None, n_folds=5):
    subdf=indf[featurenames]
    X=subdf.values
    #y is a Series of integer class labels (0, 1, 2, 3)
    y=indf[targetname]
    Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print ("Training accuracy: %0.2f" % (training_accuracy))
    print( "Test accuracy:     %0.2f" % (test_accuracy))
    confmatrix=confusion_matrix(ytest, clf.predict(Xtest))
    print (confmatrix)
    print (clf)
    return clf, Xtrain, ytrain, Xtest, ytest, training_accuracy, test_accuracy

In [None]:
mask=np.ones(data2.shape[0], dtype='int')
itrain, itest = train_test_split(range(data2.shape[0]), train_size=0.7)
mask[itrain]=1
mask[itest]=0
mask=(mask==1)
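
A boolean mask of this kind can be sketched with NumPy alone. This is an illustration under assumed sizes (10 rows, 7 of them for training), not the notebook's actual split; `rng`, `n_rows`, and `X` are names introduced here.

```python
import numpy as np

n_rows = 10                      # hypothetical number of records
rng = np.random.default_rng(0)   # fixed seed so the split is reproducible
itrain = rng.choice(n_rows, size=7, replace=False)

# True marks a training row, False marks a test row
mask = np.zeros(n_rows, dtype=bool)
mask[itrain] = True

# Indexing with mask and ~mask partitions the rows without overlap
X = np.arange(n_rows * 2).reshape(n_rows, 2)
Xtrain, Xtest = X[mask], X[~mask]
print(Xtrain.shape, Xtest.shape)
```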

In [None]:
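from sklearn.preprocessing import StandardScaler
#Fit the scaler on the training set only, so the test set does not influence the scaling parameters
scaler=StandardScaler().fit(data2.loc[mask,STANDARDIZABLE])
#Standardize training set
data2.loc[mask,STANDARDIZABLE]=scaler.transform(data2.loc[mask,STANDARDIZABLE])
#Standardize test set with the training-set mean and standard deviation
data2.loc[~mask,STANDARDIZABLE]=scaler.transform(data2.loc[~mask,STANDARDIZABLE])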
data2.head()
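
A common convention is to estimate the mean and standard deviation on the training rows only and reuse them for the test rows, so that nothing about the test set enters the preprocessing. A minimal NumPy sketch of that convention follows; the values and the mask are made up for illustration.

```python
import numpy as np

# Hypothetical feature column and train/test mask, for illustration only
values = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 20.0])
mask = np.array([True, True, True, True, False, False])

# Estimate the scaling parameters from the training rows only
train_mean = values[mask].mean()
train_std = values[mask].std()

# Apply the same shift and scale to both partitions
scaled = (values - train_mean) / train_std
print(scaled[mask].mean(), scaled[mask].std())
```

With this convention the test rows are not guaranteed to have zero mean or unit variance, which is expected: they are measured on the training set's scale.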

# Model 0: Base-line model

In [None]:
print(float(len(data2[data2["category"]==0]))/len(data2))
train_theft_percent=float(sum(data2["Primary Type_THEFT"].values[mask]))/len(data2["Primary Type_THEFT"].values[mask])
test_theft_percent=float(sum(data2["Primary Type_THEFT"].values[~mask]))/len(data2["Primary Type_THEFT"].values[~mask])

print(train_theft_percent)
print(test_theft_percent)

accuracy_multi_train={}
accuracy_multi_test={}
accuracy_multi_train["Baseline model"]=train_theft_percent
accuracy_multi_test["Baseline model"]=test_theft_percent
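
The baseline here always predicts the most frequent class, so its accuracy equals that class's share of the records. A small sketch of the idea on made-up labels (the `labels` Series is hypothetical):

```python
import pandas as pd

# Hypothetical integer class labels (0 = THEFT, 1 = BATTERY, ...), for illustration only
labels = pd.Series([0, 0, 0, 1, 1, 2, 3, 0])

# A majority-class baseline always predicts the most frequent label,
# so its accuracy is that label's share of the records
shares = labels.value_counts(normalize=True)
majority_class = shares.idxmax()
baseline_accuracy = shares.max()
print(majority_class, baseline_accuracy)
```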

# Model 1: Decision tree for multiclass

In [None]:
from sklearn.tree import DecisionTreeClassifier
clfdt_multi=DecisionTreeClassifier()
clfdt_multi, Xtrain, ytrain, Xtest, ytest, training_accuracy, test_accuracy=do_classify2(clfdt_multi, {"max_depth":np.arange(1,20,2)}, data2, total_features, "category", mask=mask)
accuracy_multi_train["Decision tree multiclass"]=training_accuracy
accuracy_multi_test["Decision tree multiclass"]=test_accuracy


# from sklearn.tree import DecisionTreeClassifier
# clfdt=DecisionTreeClassifier()
# clfdt, Xtrain, ytrain, Xtest, ytest, confclfdt, training_accuracy, test_accuracy = do_classify(clfdt, {"max_depth":np.arange(1,20,2)}, data, total_features, u'severe',1, mask=mask)
# confusion_dict["decision tree"]=confclfdt
# model_dict["decision tree"]=clfdt
# accuracy_dict["decision tree"]=training_accuracy
# accuracy_dict1["decision tree"]=test_accuracy


# Model 2: Random forest for multiclass

In [None]:
from sklearn.ensemble import RandomForestClassifier
randfmulti=RandomForestClassifier()
randfmulti, Xtrain, ytrain, Xtest, ytest, training_accuracy, test_accuracy=do_classify2(randfmulti, {"n_estimators":[10, 20, 30, 40, 100]}, data2, total_features, "category", mask=mask)
accuracy_multi_train["Random forest multiclass"]=training_accuracy
accuracy_multi_test["Random forest multiclass"]=test_accuracy

# Model 3: Logistic regression for multiclass

In [None]:
from sklearn.linear_model import LogisticRegression
clflogmulti=LogisticRegression(penalty="l2",multi_class='multinomial',solver="newton-cg",max_iter=100)
clflogmulti, Xtrain, ytrain, Xtest, ytest, training_accuracy, test_accuracy=do_classify2(clflogmulti, {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, data2, total_features, 'category', mask=mask)
accuracy_multi_train["Logistic - newton cg"]=training_accuracy
accuracy_multi_test["Logistic - newton cg"]=test_accuracy

In [None]:
from sklearn.linear_model import LogisticRegression
clflogmulti2 = LogisticRegression(penalty="l2",multi_class='multinomial',solver="lbfgs",max_iter=400)
clflogmulti2, Xtrain, ytrain, Xtest, ytest, training_accuracy, test_accuracy=do_classify2(clflogmulti2, {"C": [0.01, 0.1, 1.0, 10.0]}, data2, total_features, 'category', mask=mask)
accuracy_multi_train["Logistic - lbfgs"]=training_accuracy
accuracy_multi_test["Logistic - lbfgs"]=test_accuracy

In [None]:
#Compare training accuracy across multiclass models
pd.Series(accuracy_multi_train).plot(kind="bar",title="Training accuracy",width=0.8,color="grey", fontsize=20)

In [None]:
#Compare test accuracy across multiclass models
pd.Series(accuracy_multi_test).plot(kind="bar",title="Test accuracy",width=0.8,color="grey", fontsize=20)

In [None]:
#What is the two-type accuracy of a baseline that always guesses THEFT (0) and BATTERY (1)?
total_predict=len(ytest)
correct_predict=0    #records whose true type is THEFT or BATTERY
baseline_accuracy=0  #records whose true type is THEFT (counted but not printed)
for i in range(0,total_predict):
    if ytest.iloc[i]==0 or ytest.iloc[i]==1:
        correct_predict += 1
    if ytest.iloc[i]==0:
        baseline_accuracy += 1
print(float(correct_predict)/total_predict)

In [None]:
#Reference: http://stackoverflow.com/questions/6910641/how-to-get-indices-of-n-maximum-values-in-a-numpy-array
def accuracy_two_type(est, Xtest, ytest):
    probs=est.predict_proba(Xtest)
    correct_predict=0
    total_predict=len(ytest)
    #For each row (a prediction with four probabilities for the four categories), we obtain the two most likely classes.
    #We check whether the actual observation is one of the two predictions.
    #If so, we count the prediction for this crime record as accurate.
    for i in range(0,total_predict):
        #This returns the column indices of the two largest probabilities, which correspond directly to the crime type
        ind=np.argpartition(probs[i,],-2)[-2:]
        if ytest.iloc[i] in ind:
            correct_predict+=1
    return float(correct_predict)/total_predict
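
The same top-two check can be written without an explicit loop. The sketch below is an illustration, not the notebook's function: `top_k_accuracy` is a name introduced here, and it assumes that column j of the probability matrix corresponds to integer label j, as it does when the labels are 0, 1, 2, 3.

```python
import numpy as np

def top_k_accuracy(probs, y_true, k=2):
    """Fraction of rows whose true class is among the k most probable classes."""
    probs = np.asarray(probs)
    y_true = np.asarray(y_true)
    # Column indices of the k largest probabilities in each row
    top_k = np.argsort(probs, axis=1)[:, -k:]
    # A row is a hit if any of its top-k columns equals the true label
    hits = (top_k == y_true[:, None]).any(axis=1)
    return hits.mean()

# Hypothetical predicted probabilities for three records and four classes
probs = [[0.50, 0.30, 0.10, 0.10],
         [0.10, 0.20, 0.30, 0.40],
         [0.60, 0.10, 0.25, 0.05]]
y_true = [1, 3, 3]
print(top_k_accuracy(probs, y_true, k=2))
```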

In [None]:
print(accuracy_two_type(clflogmulti, Xtest, ytest))
print(accuracy_two_type(clflogmulti2, Xtest, ytest))

In [None]:
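#Note: clflogmulti uses an l2 penalty, so few if any coefficients are exactly zero.
#nonzero_lasso reads clf.coef_[0], so for this multinomial model it reports only the coefficients of class 0 (THEFT).
lasso_importances_multi=nonzero_lasso(clflogmulti)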
lasso_importances_multi.head(50)

# Visualization

In [None]:
crimedata =chicago
print(crimedata.shape)
crimedata.head(5)

# Plotting crime rates

Here, we will plot the occurrence rates of the following:

1) Crime Type
2) Scene of Crime
3) Time of Crime
4) Day of Crime
5) Month of Crime
6) Average Temperature of Crime

In [None]:
# Occurrence rates of the various types of crime
crimetypegb=crimedata.groupby(["Primary Type"])["Primary Type"].count()/len(crimedata)*100
crimetypegb.sort_values(ascending=False, inplace=True)
print(crimetypegb)
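
An equivalent way to get percentage rates sorted in descending order is `value_counts(normalize=True)`. A short sketch on a made-up frame (the `toy` data is hypothetical):

```python
import pandas as pd

# Hypothetical records, for illustration only
toy = pd.DataFrame({"Primary Type": ["THEFT", "THEFT", "BATTERY", "NARCOTICS"]})

# normalize=True returns shares instead of counts, already sorted from most to least frequent
rates = toy["Primary Type"].value_counts(normalize=True) * 100
print(rates)
```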


In [None]:
crimetypegb.plot(kind='bar',title="Type of Crime")
plt.ylabel('Occurrence rate (%)')

In [None]:
locationgb=crimedata.groupby(['Location Description'])["Location Description"].count()/len(crimedata)*100
locationgb.sort_values(ascending=False, inplace=True)
print(locationgb)

In [None]:
locationgb.plot(kind='bar',title="Scene of Crime")
plt.ylabel('Occurrence rate (%)')

In [None]:
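from datetime import datetime
#Renamed from "format" to avoid shadowing the Python built-in of the same name
date_format = '%m/%d/%Y %I:%M:%S %p'
#Extract the hour (00-23) and month (01-12) as zero-padded strings
crimedata["time_hour"]=crimedata.Date.apply(lambda row: datetime.strptime(row, date_format).strftime("%H"))
crimedata["month"]=crimedata.Date.apply(lambda row: datetime.strptime(row, date_format).strftime("%m"))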
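
On a dataset of this size, parsing the whole column once with `pd.to_datetime` is usually faster than calling `strptime` row by row. A sketch on two made-up timestamps in the same layout (the `dates` Series is hypothetical):

```python
import pandas as pd

# Hypothetical timestamps in the dataset's "MM/DD/YYYY HH:MM:SS AM/PM" layout
dates = pd.Series(["01/15/2015 11:30:00 PM", "07/04/2016 12:05:00 AM"])

# Parse the whole column once, then derive both fields from the parsed values
parsed = pd.to_datetime(dates, format="%m/%d/%Y %I:%M:%S %p")

# Zero-padded strings, matching what strftime("%H") and strftime("%m") produce
time_hour = parsed.dt.strftime("%H")
month = parsed.dt.strftime("%m")
print(time_hour.tolist(), month.tolist())
```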

In [None]:
timegb=crimedata.groupby(['time_hour'])["time_hour"].count()/len(crimedata)*100
print(timegb)


In [None]:
timegb.plot(kind='bar',title="Hour Crime occurred")
plt.ylabel('Occurrence rate (%)')

In [None]:
weekday=crimedata[["Weekday_Monday","Weekday_Tuesday","Weekday_Wednesday","Weekday_Thursday","Weekday_Friday","Weekday_Saturday","Weekday_Sunday"]].sum()/len(crimedata)*100
print(weekday)

In [None]:
weekday.plot(kind='bar',title="Day Crime occurred") #color='bgyrc'
plt.ylabel('Occurrence rate (%)')

In [None]:
month_group=crimedata.groupby(['month'])['month'].count()/len(crimedata)*100
print(month_group)

In [None]:
month_group.plot(kind='bar',title="Month Crime occurred")
plt.ylabel('Occurrence rate (%)')

# Now we focus on only the 4 major types of crimes

In [None]:
def topfour(row):
    keep = ["THEFT", "BATTERY", "NARCOTICS", "CRIMINAL DAMAGE"]
    if row not in keep:
        return "OTHERS"
    else:
        return row

In [None]:
crimedata["New_Type"] = crimedata["Primary Type"].apply(topfour)

In [None]:
# Here we write a function to take in a column name, title, and return a plot that displays the percentage, per column
# with the normalized types of crimes for each feature in the selected column
def plotsplit(cnam, title):
    datasplit = crimedata.groupby([cnam, "New_Type"])[cnam].count().unstack()
    # Convert everything to percentage for normalization, so we can compare!
    datasplit= datasplit.apply(lambda c: c / c.sum() * 100, axis=1)
    # Reorder columns
    datasplit = datasplit[['OTHERS', 'CRIMINAL DAMAGE', 'NARCOTICS', 'BATTERY', 'THEFT']]
    datasplit.plot(kind = "bar", stacked = True, title = title)
    plt.ylabel('Fraction of Crime Type (%)')
    # Anchoring legend from http://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
    plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1), ncol=3, fancybox=True, shadow=True)
    plt.ylim([0,120])
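
The row-wise normalization inside `plotsplit` can also be expressed with `pd.crosstab(..., normalize="index")`. The sketch below uses a made-up frame with the same column names, to show that each row of the resulting table sums to 100%.

```python
import pandas as pd

# Hypothetical records, for illustration only
toy = pd.DataFrame({
    "time_hour": ["00", "00", "00", "01", "01"],
    "New_Type": ["THEFT", "THEFT", "BATTERY", "THEFT", "OTHERS"],
})

# normalize="index" divides each row by its total, so every row sums to 1
shares = pd.crosstab(toy["time_hour"], toy["New_Type"], normalize="index") * 100
print(shares.round(1))
```

Unlike the `groupby(...).count().unstack()` route, combinations that never occur show up as 0 rather than NaN.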

In [None]:
import matplotlib.pyplot as plt

#Enlarge the default figure size for the plots below
plt.rcParams["figure.figsize"] = (12,9)

In [None]:

plotsplit("Location Description", "Normalized Crime Types by Location")

Crime type per hour of day

In [None]:

plotsplit("time_hour", "Normalized Crime Types by Time")

Crimes by day: first we need to reverse the get_dummies (one-hot) encoding of the weekday columns

In [None]:
# Turn the Mon, Tue, Wed, Thu, Fri, Sat, Sun indicator columns into day numbers 1 - 7 (Monday stays 1).
# Each column is rebuilt from a > 0 test, so re-running this cell gives the same result.
crimedata["Weekday_Tuesday"] = (crimedata["Weekday_Tuesday"] > 0).astype(int) * 2
crimedata["Weekday_Wednesday"] = (crimedata["Weekday_Wednesday"] > 0).astype(int) * 3
crimedata["Weekday_Thursday"] = (crimedata["Weekday_Thursday"] > 0).astype(int) * 4
crimedata["Weekday_Friday"] = (crimedata["Weekday_Friday"] > 0).astype(int) * 5
crimedata["Weekday_Saturday"] = (crimedata["Weekday_Saturday"] > 0).astype(int) * 6
crimedata["Weekday_Sunday"] = (crimedata["Weekday_Sunday"] > 0).astype(int) * 7

In [None]:
crimedata["Num_Day"] = crimedata["Weekday_Monday"] + crimedata["Weekday_Tuesday"] + crimedata["Weekday_Wednesday"] + crimedata["Weekday_Thursday"] + crimedata["Weekday_Friday"] + crimedata["Weekday_Saturday"] + crimedata["Weekday_Sunday"]
crimedata["Num_Day"] = crimedata["Num_Day"].astype(int)
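
An alternative that leaves the indicator columns untouched is to look up which column holds the 1 in each row. The sketch below assumes every row has exactly one weekday indicator set; the `toy` frame and `day_number` mapping are introduced here for illustration.

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
cols = ["Weekday_" + d for d in days]

# Hypothetical one-hot rows: a Monday, a Thursday, and a Sunday
toy = pd.DataFrame(0, index=range(3), columns=cols)
toy.loc[0, "Weekday_Monday"] = 1
toy.loc[1, "Weekday_Thursday"] = 1
toy.loc[2, "Weekday_Sunday"] = 1

# idxmax(axis=1) returns the name of the column holding the row maximum (the 1);
# the dictionary then converts that column name to a day number from 1 to 7
day_number = {col: n for n, col in enumerate(cols, start=1)}
toy["Num_Day"] = toy[cols].idxmax(axis=1).map(day_number)
print(toy["Num_Day"].tolist())
```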

In [None]:

plotsplit("Num_Day", "Normalized Crime Types by Day")

In [None]:
plotsplit("month", "Normalized Crime Types by Month")

In [None]:
%pip install folium

# Now let's make a map to visualize the crime locations and types

In [None]:
import folium
from IPython.display import HTML

In [None]:
def display(m, height=300):
    """Takes a folium instance and embeds its HTML in an iframe."""
    #m._build_map() and m.HTML belong to an old folium API; get_root().render() returns the page HTML
    srcdoc = m.get_root().render().replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{0}" '
                 'style="width: 100%; height: {1}px; '
                 'border: none"></iframe>'.format(srcdoc, height))
    return embed

We can start an instance of the Chicago map like this

In [None]:
map = folium.Map(location=[41.8369, -87.6847], zoom_start=10)
folium.Marker([41.8369, -87.6847]).add_to(map) 
#display(map)

In [None]:
from folium.plugins import MarkerCluster
#Map each of the main crime types onto its own map
types = ['OTHERS', 'CRIMINAL DAMAGE', 'NARCOTICS', 'BATTERY', 'THEFT']
for i in types:
    #skip records without coordinates, which cannot be placed on the map
    typedata=crimedata[crimedata["New_Type"]==i].dropna(subset=['Latitude','Longitude'])
    #create a fresh map and a fresh cluster for each type, so markers do not pile up on the first map
    type_map = folium.Map(location=[41.8369, -87.6847], zoom_start=10)
    marker_cluster = MarkerCluster().add_to(type_map)
    #add a marker for every record in the filtered data, use a clustered view
    for _, row in typedata.iterrows():
        folium.Marker(location=[row['Latitude'], row['Longitude']]).add_to(marker_cluster)
    #display(type_map)
    #create_map belongs to an old folium API; save writes the map to an HTML file
    type_map.save(i + 'map.html')