# San Francisco Crime Classification

### **Description:** The main purpose of this project is to classify the category of crime based on Location and Time at which crime was commited .


The steps followed are as follows:
* Step-1: Download dataset from [https://www.kaggle.com/c/sf-crime/data](https://www.kaggle.com/c/sf-crime/data), unzip it and place it in same directory as the .ipynb file. Read the data and store it in a dataframe. 
* Step-2: Create a new dataframe containing Latitude, Longtitude, Category, Hour, Week, Month and Year of crime.
* Step-3: Create a new dataframe X by dropping Category. Create a variable Y containing the category of crime.
* Step-4: Split the data(X_train, y_train and X_test, y_test data(80:20)).
* Step-5: Check Different Machine Learning Algorithm to get best accuracy.
* Step-6: Check Different feature Engineering techniques on a particular algorithm(I have tried on KNN) to get better accuracy.
     * Convert to 3 Dimensional point(x, y, z) from Latitude and Longtitude and use as input.
     * Use only Latitude and Longtitude as input data.
     * Use Hour, Week, Month, Year, Latitude, Longtitude as input data.
     * Use the columns of dataframe that has positive correlation with category of crime.
     * Try on a smaller part of dataset maybe the model is overfitting.
* Step-7: Use One Hot Encoder Technique to submit predicted value into required submission format


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

In [2]:
import pandas as pd
sfc = pd.read_csv('/train.csv')

lat=sfc['X'].values
long=sfc['Y'].values
category=sfc['Category'].values

In [3]:
#Convert Weekaday from Text to Numeric
week=sfc['DayOfWeek'].values
week_map={'Monday':'1', 'Tuesday':'2', 'Wednesday':'3', 'Thursday':'4', 'Friday':'5', 'Saturday':'6', 'Sunday':'7'}
week1=[]
for i in range(len(week)):
    week1.append(int(week_map[str(week[i])]))
print(week1)

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 

In [4]:
#Create Month, Year and Hour list from DateStamp
month=[]
year=[]
hour=[]
for i in range(len(sfc['Dates'])):
    t = pd.tslib.Timestamp(sfc['Dates'][i])
    month.append(t.month)
    year.append(t.year)
    hour.append(t.hour)

You can access Timestamp as pandas.Timestamp
  


In [5]:
df=pd.DataFrame({'Latitude':lat, 'Longtitude':long, 'Category':category, 'Hour':hour, 'Week':week1, 'Month':month, 'Year':year})
print(df)

                      Category  Hour    Latitude  Longtitude  Month  Week  \
0                     WARRANTS    23 -122.425892   37.774599      5     3   
1               OTHER OFFENSES    23 -122.425892   37.774599      5     3   
2               OTHER OFFENSES    23 -122.424363   37.800414      5     3   
3                LARCENY/THEFT    23 -122.426995   37.800873      5     3   
4                LARCENY/THEFT    23 -122.438738   37.771541      5     3   
5                LARCENY/THEFT    23 -122.403252   37.713431      5     3   
6                VEHICLE THEFT    23 -122.423327   37.725138      5     3   
7                VEHICLE THEFT    23 -122.371274   37.727564      5     3   
8                LARCENY/THEFT    23 -122.508194   37.776601      5     3   
9                LARCENY/THEFT    23 -122.419088   37.807802      5     3   
10               LARCENY/THEFT    22 -122.419088   37.807802      5     3   
11              OTHER OFFENSES    22 -122.487983   37.737667      5     3   

In [6]:
import copy
X=copy.deepcopy(df)
X.drop('Category', axis=1, inplace=True)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df['Category'])
Y=le.transform(df['Category']) 
print(Y)

[37 21 21 ..., 16 35 12]


In [7]:
df['Y']=Y
corr = df.corr()
print(corr)

                Hour  Latitude  Longtitude     Month      Week      Year  \
Hour        1.000000  0.001057   -0.002018 -0.001896 -0.021087 -0.006257   
Latitude    0.001057  1.000000    0.559338  0.001548  0.005999 -0.004078   
Longtitude -0.002018  0.559338    1.000000  0.003307 -0.000908 -0.009163   
Month      -0.001896  0.001548    0.003307  1.000000  0.010685 -0.048188   
Week       -0.021087  0.005999   -0.000908  0.010685  1.000000  0.014228   
Year       -0.006257 -0.004078   -0.009163 -0.048188  0.014228  1.000000   
Y           0.023524 -0.024401   -0.000414  0.000008  0.001078 -0.021803   

                   Y  
Hour        0.023524  
Latitude   -0.024401  
Longtitude -0.000414  
Month       0.000008  
Week        0.001078  
Year       -0.021803  
Y           1.000000  


In [8]:
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [9]:
#Convert to 3 Dimensional point(x, y, z) from Latitude and Longtitude and use as input.
import math
X4=[]
Y4=[]
Z4=[]
for i in range(len(df['Latitude'])):
    X4.append(math.cos(df['Latitude'][i])*math.cos(df['Longtitude'][i]))
    Y4.append(math.cos(df['Latitude'][i])*math.sin(df['Longtitude'][i]))
    Z4.append(math.sin(df['Latitude'][i]))

In [40]:
from sklearn.neighbors import KNeighborsClassifier
X3=pd.DataFrame({'X':X4, 'Y':Y4, 'Z':Z4})
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, Y, test_size=0.2)
neigh3 = KNeighborsClassifier(n_neighbors = 50, weights='uniform', algorithm='auto')
neigh3.fit(X_train3,y_train3) 
y_pred3 = neigh3.predict(X_test3)
print("Accuracy is ", accuracy_score(y_test3,y_pred3)*100,"% for K-Value:")

Accuracy is  27.5086840157 % for K-Value:


In [None]:
#Use only Latitude and Longtitude as input data.
X2=pd.DataFrame({'Latitude':df['Latitude'], 'Longitude':df['Longtitude']})
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, Y, test_size=0.2)
#Implement KNN(So we take K value to be )
neigh2 = KNeighborsClassifier(n_neighbors = 50, weights='uniform', algorithm='auto')
neigh2.fit(X_train2,y_train2) 
y_pred2 = neigh2.predict(X_test2)
print("Accuracy is ", accuracy_score(y_test2,y_pred2)*100,"% for K-Value:")

In [44]:
#Use Hour, Week, Month, Year, Latitude, Longtitude as input data.
neigh = KNeighborsClassifier(n_neighbors = 50, weights='uniform', algorithm='auto')
neigh.fit(X_train,y_train) 
y_pred = neigh.predict(X_test)
print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K-Value:")

Accuracy is  21.1639428279 % for K-Value:


In [16]:
#Use the columns of dataframe that has positive correlation with category of crime.
X6=pd.DataFrame({'Hour':df['Hour'], 'Month':df['Month'], 'Week':df['Week']})
X_train6, X_test6, y_train6, y_test6 = train_test_split(X6, Y, test_size=0.2)
neigh6 = KNeighborsClassifier(n_neighbors = 50, weights='uniform', algorithm='auto')
neigh6.fit(X_train6,y_train6) 
y_pred6 = neigh6.predict(X_test6)
print("Accuracy is ", accuracy_score(y_test6,y_pred6)*100,"% for K-Value:")

Accuracy is  18.8497238198 % for K-Value:


In [45]:
#Try on a smaller part of dataset maybe the model is overfitting.
df1 = df.sample(frac=0.2).reset_index(drop=True)
X1=copy.deepcopy(df1)
X1.drop('Category', axis=1, inplace=True)
Y1=df1['Category']
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, Y1, test_size=0.2)
#Implement KNN(So we take K value to be )
neigh1 = KNeighborsClassifier(n_neighbors = 50, weights='uniform', algorithm='auto')
neigh1.fit(X_train1,y_train1) 
y_pred1 = neigh1.predict(X_test1)
print("Accuracy is ", accuracy_score(y_test1,y_pred1)*100,"% for K-Value:")

Accuracy is  83.9787028074 % for K-Value:


In [38]:
#Check k for best results of KNN
from sklearn.neighbors import KNeighborsClassifier
for K in range(100):
    K_value = K+1
neigh = KNeighborsClassifier(n_neighbors = 50, weights='uniform', algorithm='auto')
neigh.fit(X_train, y_train) 
y_pred = neigh.predict(X_test)
print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K-Value:",50)


Accuracy is  21.0739707306 % for K-Value: 50


In [None]:
#Implement KNN for the best features after Feature Engineering and best value of K
neigh = KNeighborsClassifier(n_neighbors = 50, weights='uniform', algorithm='auto')
neigh.fit(X_train,y_train) 
y_pred = neigh.predict(X_test)
print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K-Value:")

In [None]:
#Implement Grid Serch for best Gamma, C and Selection between rbf and linear kernel
#from sklearn import svm, datasets
#from sklearn.cross_validation import StratifiedKFold
#from sklearn.grid_search import GridSearchCV
#from sklearn.svm import SVC
#parameter_candidates = [
#  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
#  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
#]
#clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)
#clf.fit(X_train1, y_train1)   
#print('Best score for data1:', clf.best_score_) 
#print('Best C:',clf.best_estimator_.C) 
#print('Best Kernel:',clf.best_estimator_.kernel)
#print('Best Gamma:',clf.best_estimator_.gamma)



In [None]:
#OVA SVM(Grid Search Results: Kernel - linear, C -1 , Gamma - auto)
#from sklearn import svm
#lin_clf = svm.LinearSVC(C=1)
#lin_clf.fit(X_train2, y_train2)
#y_pred2=lin_clf.predict(X_test2)
#print(accuracy_score(y_test2,y_pred2)*100)

In [None]:
#OVA SVM(Grid Search Results: Kernel - rbf, C -1 , Gamma - auto)
#from sklearn import svm
#lin_clf=svm.SVC(kernel='rbf')
#lin_clf.fit(X_train2, y_train2)
#y_pred2=lin_clf.predict(X_test2)
#print(accuracy_score(y_test2,y_pred2)*100)

In [None]:
#SVM by Crammer(Grid Search Results: Gamma - Auto, C - 1)
#lin_clf = svm.LinearSVC(C=1, multi_class='crammer_singer')
#lin_clf.fit(X_train2, y_train2)
#y_pred2=lin_clf.predict(X_test2)
#print(accuracy_score(y_test2,y_pred2)*100)

In [12]:
#Implementing OVA Naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train2, y_train2)
y_pred2 = clf.predict(X_test2)
print(accuracy_score(y_test2,y_pred2)*100)

ValueError: Input X must be non-negative

In [40]:
#Implementing OVA Logistic Regerssion
from sklearn.linear_model import LogisticRegression
X2=pd.DataFrame({'Latitude':df['Latitude'], 'Longitude':df['Longtitude']})
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, Y, test_size=0.2)
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train2, y_train2)
y_pred = logisticRegr.predict(X_test2)
print(accuracy_score(y_test,y_pred)*100)

19.9646944935


In [18]:
data_dict={}
target = df["Category"].unique()
count = 1
for data in target:
    data_dict[data] = count
    count+=1
print(data_dict)

{'WARRANTS': 1, 'OTHER OFFENSES': 2, 'LARCENY/THEFT': 3, 'VEHICLE THEFT': 4, 'VANDALISM': 5, 'NON-CRIMINAL': 6, 'ROBBERY': 7, 'ASSAULT': 8, 'WEAPON LAWS': 9, 'BURGLARY': 10, 'SUSPICIOUS OCC': 11, 'DRUNKENNESS': 12, 'FORGERY/COUNTERFEITING': 13, 'DRUG/NARCOTIC': 14, 'STOLEN PROPERTY': 15, 'SECONDARY CODES': 16, 'TRESPASS': 17, 'MISSING PERSON': 18, 'FRAUD': 19, 'KIDNAPPING': 20, 'RUNAWAY': 21, 'DRIVING UNDER THE INFLUENCE': 22, 'SEX OFFENSES FORCIBLE': 23, 'PROSTITUTION': 24, 'DISORDERLY CONDUCT': 25, 'ARSON': 26, 'FAMILY OFFENSES': 27, 'LIQUOR LAWS': 28, 'BRIBERY': 29, 'EMBEZZLEMENT': 30, 'SUICIDE': 31, 'LOITERING': 32, 'SEX OFFENSES NON FORCIBLE': 33, 'EXTORTION': 34, 'GAMBLING': 35, 'BAD CHECKS': 36, 'TREA': 37, 'RECOVERED VEHICLE': 38, 'PORNOGRAPHY/OBSCENE MAT': 39}


In [19]:
from collections import OrderedDict
data_dict_new = OrderedDict(sorted(data_dict.items()))
print(data_dict_new)

OrderedDict([('ARSON', 26), ('ASSAULT', 8), ('BAD CHECKS', 36), ('BRIBERY', 29), ('BURGLARY', 10), ('DISORDERLY CONDUCT', 25), ('DRIVING UNDER THE INFLUENCE', 22), ('DRUG/NARCOTIC', 14), ('DRUNKENNESS', 12), ('EMBEZZLEMENT', 30), ('EXTORTION', 34), ('FAMILY OFFENSES', 27), ('FORGERY/COUNTERFEITING', 13), ('FRAUD', 19), ('GAMBLING', 35), ('KIDNAPPING', 20), ('LARCENY/THEFT', 3), ('LIQUOR LAWS', 28), ('LOITERING', 32), ('MISSING PERSON', 18), ('NON-CRIMINAL', 6), ('OTHER OFFENSES', 2), ('PORNOGRAPHY/OBSCENE MAT', 39), ('PROSTITUTION', 24), ('RECOVERED VEHICLE', 38), ('ROBBERY', 7), ('RUNAWAY', 21), ('SECONDARY CODES', 16), ('SEX OFFENSES FORCIBLE', 23), ('SEX OFFENSES NON FORCIBLE', 33), ('STOLEN PROPERTY', 15), ('SUICIDE', 31), ('SUSPICIOUS OCC', 11), ('TREA', 37), ('TRESPASS', 17), ('VANDALISM', 5), ('VEHICLE THEFT', 4), ('WARRANTS', 1), ('WEAPON LAWS', 9)])


In [33]:
import pandas as pd
test_dataset = pd.read_csv('/test.csv')
df_test=pd.DataFrame({'Latitude':test_dataset['X'],'Longtitude':test_dataset['Y']})

predictions = neigh2.predict(df_test)


In [35]:
#One Hot Encoding of knn as per submission format
result_dataframe = pd.DataFrame({
    "Id": test_dataset["Id"]
})
for key,value in data_dict_new.items():
    result_dataframe[key] = 0
count = 0
for item in predictions:
    for key,value in data_dict.items():
        if(value == item):
            result_dataframe[key][count] = 1
    count+=1
result_dataframe.to_csv("submission_knn.csv", index=False) 

In [41]:
predictions = logisticRegr.predict(df_test)

In [42]:
#One Hot Encoding of logistic regression as per submission format
result_dataframe = pd.DataFrame({
    "Id": test_dataset["Id"]
})
for key,value in data_dict_new.items():
    result_dataframe[key] = 0
count = 0
for item in predictions:
    for key,value in data_dict.items():
        if(value == item):
            result_dataframe[key][count] = 1
    count+=1
result_dataframe.to_csv("submission_logistic.csv", index=False) 

# Conclusion
* Out of the feature engineering methods suggested the one which works best on KNN is simply by taking latitude and longtitude as input data.
* Since input data contains negative values we cannot use OVA Naive Bayes.
* OVA SVM with linear and rbf kernel and CrammerSVM 
* The  Algorithm gives the best accuracy of  for the given dataset.

|       Algorithm       |           Parameters           | Accuracy |
|:----------------------|:------------------------------:|:--------:|
|          KNN          |              K= 50             |  27.65%  |
|    OVA Naive Bayes    |               -                |    -     |      
|OVA Logistic Regression|               -                |  19.96%  |

The reason for such less accuracy is because the dataset contains around 800k rows and ML algorithm work well only till 100k. A strong proof for this can be seen with the shoot of accuracy to 83.97% by reducing the dataset to 20% after randomly shuffing it. So we need to use Neural Network approach.
