In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess
from matplotlib.dates import DayLocator, MonthLocator, DateFormatter, drange
import seaborn as sns

## Data Preparation Part 1

Here, we read in the data and parse the CMPLNT_FR_DT variable as datetime.

In [2]:
df = pd.read_csv('/home/drew/School/Semester4/ML1/NewYorkCityCrimes2015/Data/Lab2_Daily_Crime_Volume_Data/Training_and_Test_Set.csv',
                parse_dates = ['CMPLNT_FR_DT'])
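To check that parse_dates does what we expect, here is a minimal sketch that reads a two-row stand-in for the file from an in-memory string. The names `sample_csv` and `sample` are illustrative only, and just three of the dataset's columns are kept.

```python
import io
import pandas as pd

# Two-row stand-in for the real file (illustrative; only three columns kept).
sample_csv = io.StringIO(
    "CMPLNT_FR_DT,BORO_NM,count_cmplnt\n"
    "2014-11-26,QUEENS,1\n"
    "2014-12-01,QUEENS,1\n"
)
sample = pd.read_csv(sample_csv, parse_dates=['CMPLNT_FR_DT'])

# The parsed column should have a datetime64 dtype rather than object.
is_datetime = pd.api.types.is_datetime64_any_dtype(sample['CMPLNT_FR_DT'])
print(sample.dtypes)
```

If `is_datetime` is False on the real file, the column likely contains values pandas could not parse as dates.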

As the output below shows, many of the variables are non-numeric. KNN computes distances between observations, so it requires numeric inputs; we must pre-process these variables before we can use them in our model.

In [3]:
df.head(10)

Unnamed: 0,CMPLNT_FR_DT,Daytime,Day_Name,Month,Day,Year,Season,GeoCell,BORO_NM,PRCP,...,TMIN,TMAX,Population,PC_INCOME,Hm_Sls_Price_Range,Holiday,Event,is_Holiday,is_Event,count_cmplnt
0,2014-11-26,Morning,Wednesday,November,26.0,2014.0,Fall,66.0,QUEENS,1.24,...,34,51,2250002,40997,Medium,,,0,0,1
1,2014-12-01,Late Night,Monday,December,1.0,2014.0,Winter,60.0,QUEENS,0.09,...,42,65,2250002,40997,Medium,,,0,0,1
2,2015-11-10,Morning,Tuesday,November,10.0,2015.0,Fall,15.0,BROOKLYN,0.26,...,51,57,2552911,43915,High,,,0,0,2
3,2014-02-04,Morning,Tuesday,February,4.0,2014.0,Winter,48.0,QUEENS,0.0,...,22,35,2250002,40997,Medium,,,0,0,3
4,2015-08-25,Late Night,Tuesday,August,25.0,2015.0,Summer,35.0,BROOKLYN,0.0,...,73,90,2552911,43915,High,,,0,0,1
5,2014-09-03,Morning,Wednesday,September,3.0,2014.0,Fall,23.0,STATEN ISLAND,0.0,...,72,86,468730,48123,Low,,US Open Tennis,0,1,1
6,2015-09-17,Morning,Thursday,September,17.0,2015.0,Fall,49.0,QUEENS,0.0,...,68,89,2250002,44031,Medium,,New York Boat Show,0,1,7
7,2015-01-29,Morning,Thursday,January,29.0,2015.0,Winter,35.0,BROOKLYN,0.02,...,19,36,2552911,43915,High,,,0,0,5
8,2015-03-12,Evening,Thursday,March,12.0,2015.0,Spring,67.0,QUEENS,0.0,...,36,47,2250002,44031,Medium,,,0,0,1
9,2014-06-23,Late Night,Monday,June,23.0,2014.0,Summer,48.0,QUEENS,0.0,...,65,81,2250002,40997,Medium,,,0,0,3
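One way to list the non-numeric columns programmatically is pandas' select_dtypes. The sketch below uses a small hand-built frame (`sample`, an illustrative name) with a few columns and two rows taken from the output above; on the real data the same call would be made on df.

```python
import pandas as pd

# Small illustrative frame using a few columns and rows from the output above.
sample = pd.DataFrame({
    'Daytime': ['Morning', 'Late Night'],
    'Season': ['Fall', 'Winter'],
    'BORO_NM': ['QUEENS', 'QUEENS'],
    'PRCP': [1.24, 0.09],
    'TMIN': [34, 42],
})

# Columns whose dtype is not numeric need encoding before a distance-based model.
non_numeric_cols = sample.select_dtypes(exclude='number').columns.tolist()
print(non_numeric_cols)
```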


Our Day_Name variable was built from the CMPLNT_FR_DT variable. To get a numeric version of it, we use the pandas dt.dayofweek accessor, which maps Monday to 0, Tuesday to 1, and so on through Sunday at 6.

In [4]:
df['DayOfWeek'] = df['CMPLNT_FR_DT'].dt.dayofweek
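As a small illustration of the mapping, the sketch below applies dt.dayofweek to three dates from the sample output, which the Day_Name column lists as Monday, Tuesday, and Wednesday. The variable names are illustrative only.

```python
import pandas as pd

# Three dates from the sample rows: a Monday, a Tuesday, and a Wednesday.
dates = pd.Series(pd.to_datetime(['2014-12-01', '2015-11-10', '2014-11-26']))

day_numbers = dates.dt.dayofweek.tolist()   # Monday=0 ... Sunday=6
day_names = dates.dt.day_name().tolist()    # English names by default

for name, number in zip(day_names, day_numbers):
    print(name, number)
```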

Similarly, dt.month maps January to 1, February to 2, and so on through December at 12. We use it to create a numeric version of our Month variable.

In [5]:
df['Month_No'] = df['CMPLNT_FR_DT'].dt.month
# Note: this binds a second name to the same DataFrame; no copy is made.
df_Crime = df
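A quick sanity check, sketched here under the assumption that the existing Month column holds English month names as in the sample output, is to confirm that the derived month number agrees with that column. The frame `sample` and the column `Month_From_Date` are illustrative names.

```python
import pandas as pd

# Three rows from the sample output, keeping only the date and Month columns.
sample = pd.DataFrame({
    'CMPLNT_FR_DT': pd.to_datetime(['2014-11-26', '2014-02-04', '2015-08-25']),
    'Month': ['November', 'February', 'August'],
})

sample['Month_No'] = sample['CMPLNT_FR_DT'].dt.month
# month_name() rebuilds the name from the date, so it can be compared to Month.
sample['Month_From_Date'] = sample['CMPLNT_FR_DT'].dt.month_name()

months_agree = bool((sample['Month'] == sample['Month_From_Date']).all())
print(sample)
```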

As the plot below shows, the number of crimes appears to depend on the time of year: crimes peak in the summer and drop in the winter. We encode this temporal dependence as numeric variables identifying the month, the year, and the day.
