# San Francisco Crime Classification with 7 Features, KNN

Data: https://www.kaggle.com/c/sf-crime/data

Data fields
- Dates - timestamp of the crime incident
- Category - category of the crime incident (only in train.csv). **This is the target variable you are going to predict.**
- Descript - detailed description of the crime incident (only in train.csv)
- DayOfWeek - the day of the week
- PdDistrict - name of the Police Department District
- Resolution - how the crime incident was resolved (only in train.csv)
- Address - the approximate street address of the crime incident 
- X - Longitude
- Y - Latitude

Please read the "San Francisco Crime Classification EDA and Basic Modeling" notebook and the earlier notebooks before working through this one.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from statsmodels.stats.outliers_influence import variance_inflation_factor
from patsy import dmatrices
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [2]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

## Dates Split
Split Dates into year, month, day, and hour, and store each as a column in the DataFrame.

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null object
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: float64(2), object(7)
memory usage: 60.3+ MB


First, convert the Dates column to the datetime type.

In [4]:
train.Dates = pd.to_datetime(train.Dates)
test.Dates = pd.to_datetime(test.Dates)

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null datetime64[ns]
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: datetime64[ns](1), float64(2), object(6)
memory usage: 60.3+ MB


Store the year, month, day, and hour as new columns, in that order.

In [6]:
train['Year'] = train['Dates'].dt.year
train['Month'] = train['Dates'].dt.month
train['Day'] = train['Dates'].dt.day
train['Hour'] = train['Dates'].dt.hour

In [7]:
test['Year'] = test['Dates'].dt.year
test['Month'] = test['Dates'].dt.month
test['Day'] = test['Dates'].dt.day
test['Hour'] = test['Dates'].dt.hour

## Convert Numeric Variables to Categorical Variables

In [8]:
X_bins = [-123, -122.5, -122.48, -122.46, -122.44, -122.43, -122.42, -122.41, -122.40, -122.38, -122.36, -120]
Y_bins = [37, 37.72, 37.74, 37.76, 37.77, 37.78, 37.80, 37.82, 91]
X_labels = [1, 2, 3 ,4 ,5, 6, 7, 8, 9, 10, 11]
Y_labels = [1, 2, 3 ,4 ,5, 6, 7, 8]

train_X_cats = pd.cut(train.X, X_bins, labels=X_labels)
train_Y_cats = pd.cut(train.Y, Y_bins, labels=Y_labels)

test_X_cats = pd.cut(test.X, X_bins, labels=X_labels)
test_Y_cats = pd.cut(test.Y, Y_bins, labels=Y_labels)
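The binning step can be sketched on toy values to show how `pd.cut` assigns labels. The edges and values below are a shortened, hypothetical subset chosen for illustration, not the notebook's full bin list.

```python
import pandas as pd

# Toy longitudes and a shortened, hypothetical set of bin edges.
x = pd.Series([-122.45, -122.41, -122.405, -121.0])
edges = [-123, -122.42, -122.41, -122.40, -120]
labels = [1, 2, 3, 4]

# Intervals are right-closed by default, so (-122.42, -122.41] gets label 2.
cats = pd.cut(x, edges, labels=labels)
print(cats.tolist())

# A value outside every interval becomes NaN, which a later integer cast cannot handle.
outside = pd.cut(pd.Series([-125.0]), edges, labels=labels)
print(outside.isna().sum())
```

This is why the outermost edges are set wide: any coordinate that falls outside them would become missing and break the later conversion to an integer type.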

In [9]:
train['X_cats'] = train_X_cats
train['Y_cats'] = train_Y_cats
test['X_cats'] = test_X_cats
test['Y_cats'] = test_Y_cats

In [10]:
test.head()

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y,Year,Month,Day,Hour,X_cats,Y_cats
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051,2015,5,10,23,9,2
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432,2015,5,10,23,9,2
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212,2015,5,10,23,6,6
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412,2015,5,10,23,5,2
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412,2015,5,10,23,5,2


In [11]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 15 columns):
Dates         878049 non-null datetime64[ns]
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
Year          878049 non-null int64
Month         878049 non-null int64
Day           878049 non-null int64
Hour          878049 non-null int64
X_cats        878049 non-null category
Y_cats        878049 non-null category
dtypes: category(2), datetime64[ns](1), float64(2), int64(4), object(6)
memory usage: 88.8+ MB


X_cats and Y_cats have the category dtype, so convert them to an integer type.

In [12]:
train['X_cats'] = train['X_cats'].astype('int8')
train['Y_cats'] = train['Y_cats'].astype('int8')
test['X_cats'] = test['X_cats'].astype('int8')
test['Y_cats'] = test['Y_cats'].astype('int8')

In [13]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 15 columns):
Dates         878049 non-null datetime64[ns]
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
Year          878049 non-null int64
Month         878049 non-null int64
Day           878049 non-null int64
Hour          878049 non-null int64
X_cats        878049 non-null int8
Y_cats        878049 non-null int8
dtypes: datetime64[ns](1), float64(2), int64(4), int8(2), object(6)
memory usage: 88.8+ MB


## Label Encoding

In [14]:
# Category Encoding
cat_encoder = LabelEncoder()
train['Category_encoded'] = cat_encoder.fit_transform(train['Category'])

In [15]:
# DayOfWeek Encoding
day_encoder = LabelEncoder()
train['DayOfWeek_encoded'] = day_encoder.fit_transform(train['DayOfWeek'])
# Reuse the mapping learned on train so each code means the same day in both sets
test['DayOfWeek_encoded'] = day_encoder.transform(test['DayOfWeek'])

In [16]:
# Descript Encoding
descript_encoder = LabelEncoder()
train['Descript_encoded'] = descript_encoder.fit_transform(train['Descript'])

In [17]:
# PdDistrict Encoding
pdDistrict_encoder = LabelEncoder()
train['PdDistrict_encoded'] = pdDistrict_encoder.fit_transform(train['PdDistrict'])
# Reuse the mapping learned on train so each code means the same district in both sets
test['PdDistrict_encoded'] = pdDistrict_encoder.transform(test['PdDistrict'])

In [18]:
# Address Encoding
address_encoder = LabelEncoder()
train['Address_encoded'] = address_encoder.fit_transform(train['Address'])
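# Fit separately on test, so these codes are generally not comparable with train's;
# Address_encoded is not used as a feature below
test['Address_encoded'] = address_encoder.fit_transform(test['Address'])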

In [19]:
train.head(1)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,Year,Month,Day,Hour,X_cats,Y_cats,Category_encoded,DayOfWeek_encoded,Descript_encoded,PdDistrict_encoded,Address_encoded
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015,5,13,23,6,5,37,6,866,4,19790


In [20]:
test.head(1)

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y,Year,Month,Day,Hour,X_cats,Y_cats,DayOfWeek_encoded,PdDistrict_encoded,Address_encoded
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051,2015,5,10,23,9,2,3,0,6407
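A minimal sketch of why it matters whether an encoder is fit once or refit on the test set. The day lists are hypothetical toy data; only `LabelEncoder` itself comes from the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: "Friday" happens to be absent from the test rows.
train_days = ["Monday", "Friday", "Sunday"]
test_days = ["Sunday", "Monday"]

# Fit once on train, then reuse the same mapping for test.
enc = LabelEncoder().fit(train_days)
shared = enc.transform(test_days)

# Refitting on test builds a new mapping from the test values alone.
refit = LabelEncoder().fit_transform(test_days)

print(dict(zip(enc.classes_, range(len(enc.classes_)))))
print(shared.tolist(), refit.tolist())
```

When train and test contain exactly the same set of values, both approaches give identical codes. They diverge as soon as one set is missing a value, and `transform` raises an error if the test set contains a value the encoder never saw.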


## Modeling

Address is excluded from the features because its information is already captured by the X and Y coordinates.

In [21]:
X = train[['Year', 'Month', 'Hour', 'X_cats', 'Y_cats', 'DayOfWeek_encoded', 'PdDistrict_encoded']]
y = train['Category_encoded']

X_test = test[['Year', 'Month', 'Hour', 'X_cats', 'Y_cats', 'DayOfWeek_encoded', 'PdDistrict_encoded']]

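Apply feature scaling. When feature ranges differ widely, scaling puts them on a comparable footing; this matters for KNN, whose distance calculation would otherwise be dominated by the features with the largest ranges.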

In [22]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)


StandardScaler(copy=True, with_mean=True, with_std=True)
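The scaling step can be illustrated on a toy array. The rows below are hypothetical `[Year, Hour]` values, not taken from the dataset; the sketch shows that the scaler learns its mean and standard deviation from the training rows and reuses them for new rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows: [Year, Hour]
X_train = np.array([[2003.0, 0.0], [2009.0, 12.0], [2015.0, 23.0]])
X_new = np.array([[2012.0, 6.0]])

# Learn per-column mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# New rows are scaled with the training statistics, not their own.
X_new_scaled = scaler.transform(X_new)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
print(X_new_scaled)
```

After scaling, each training column has mean 0 and unit standard deviation, so no single feature dominates the Euclidean distance.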

In [23]:
X = scaler.transform(X)
X_test = scaler.transform(X_test)

In [24]:
knn_model = KNeighborsClassifier(n_neighbors=100)

knn_model.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=100, p=2,
           weights='uniform')

In [25]:
y_test = knn_model.predict(X_test)
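`train_test_split` is imported at the top of the notebook but never used. One possible use, sketched here on synthetic data, is to hold out a validation set and compare a few values of `n_neighbors`. The data, the candidate values, and the choice of log loss as the score are assumptions made for this illustration (the competition is understood to be scored by multi-class log loss, which is worth confirming on the competition page).

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with three classes (0, 1, 2); not the crime data.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(600, 3))
y_demo = (X_demo[:, 0] > 0).astype(int) + (X_demo[:, 1] > 0).astype(int)

# Hold out 25% of the rows for validation, keeping class proportions.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Score each candidate k by validation log loss (lower is better).
scores = {}
for k in (5, 25, 100):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    proba = model.predict_proba(X_val)  # one probability column per class
    scores[k] = log_loss(y_val, proba, labels=model.classes_)

print(scores)
```

`predict_proba` returns the share of the k neighbors belonging to each class, which gives a graded prediction rather than the single hard label returned by `predict`.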

## Submission

In [26]:
submission = pd.read_csv('./data/sampleSubmission.csv')
submission.head(2)

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


The WARRANTS column of the sample submission is filled with 1s, so reset it to the default value of 0.

In [27]:
submission.WARRANTS = 0
submission.head(2)

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


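Every column of submission is now 0, so for each row, set the column that matches the predicted Category to 1.

In [28]:
# Rows yielded by iterrows() can be copies, so writing to them may not update submission.
# Build the one-hot matrix with NumPy instead. Column 0 is Id, so class k maps to column k + 1
# (this assumes the LabelEncoder order matches the submission's column order).
onehot = np.zeros((len(submission), submission.shape[1] - 1), dtype=int)
onehot[np.arange(len(y_test)), y_test] = 1
submission.iloc[:, 1:] = onehot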
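Before writing the file, it can help to confirm that each row really carries exactly one 1. The sketch below runs that check on a toy frame; `classes`, `preds`, and `sub` are hypothetical stand-ins for the encoder classes, `y_test`, and `submission`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the notebook's objects.
classes = ["ARSON", "ASSAULT", "BURGLARY"]
preds = np.array([2, 0, 1, 2])

# All-zero frame with an Id column followed by one column per class.
sub = pd.DataFrame(0, index=range(len(preds)), columns=["Id"] + classes)
sub["Id"] = np.arange(len(preds))

# One-hot encode the predictions and write them into the class columns.
onehot = np.zeros((len(preds), len(classes)), dtype=int)
onehot[np.arange(len(preds)), preds] = 1
sub[classes] = onehot

# Each row should contain exactly one 1, and argmax should recover the predictions.
assert (sub[classes].sum(axis=1) == 1).all()
assert (sub[classes].to_numpy().argmax(axis=1) == preds).all()
print(sub)
```

If either assertion fails on the real submission, the one-hot step did not write back to the DataFrame or the column order does not match the encoder's class order.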

In [29]:
submission.to_csv('submission_knn.csv', index=False)