# San Francisco Crime Classification
## Predict the category of crimes that occurred in the city by the bay

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

We're also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The most up-voted scripts from this competition will receive official Kaggle swag as prizes.

![Police Line](https://kaggle2.blob.core.windows.net/competitions/kaggle/4458/media/sfcrime_banner.png)

## Load Data

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv("data/train.csv")
# Shuffle the rows (the K-fold splits used later take contiguous blocks of rows).
data = data.reindex(np.random.permutation(data.index))

data

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
377876,2010-01-31 08:23:00,OTHER OFFENSES,CONTRIBUTING TO THE DELINQUENCY OF MINOR,Sunday,TENDERLOIN,"ARREST, BOOKED",400 Block of ELLIS ST,-122.413631,37.784805
739105,2004-11-12 12:00:00,FORGERY/COUNTERFEITING,"CHECKS, FORGERY (FELONY)",Friday,MISSION,NONE,4200 Block of 23RD ST,-122.436992,37.752672
312949,2011-01-16 09:00:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,SOUTHERN,NONE,700 Block of FOLSOM ST,-122.399625,37.783159
104868,2013-12-14 16:30:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Saturday,RICHMOND,NONE,0 Block of COMMONWEALTH AV,-122.455856,37.785018
602622,2006-10-22 22:33:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,BAYVIEW,NONE,700 Block of KIRKWOOD AV,-122.374019,37.729203
147530,2013-05-31 08:20:00,NON-CRIMINAL,AIDED CASE -PROPERTY FOR DESTRUCTION,Friday,TARAVAL,NONE,3300 Block of NORIEGA ST,-122.499515,37.753169
205885,2012-08-21 15:00:00,LARCENY/THEFT,PETTY THEFT FROM A BUILDING,Tuesday,TARAVAL,NONE,1500 Block of 48TH AV,-122.507938,37.757511
637277,2006-05-01 15:29:00,FAMILY OFFENSES,MINOR WITHOUT PROPER PARENTAL CARE,Monday,MISSION,NONE,1000 Block of POTRERO AV,-122.406539,37.756486
673139,2005-10-23 16:40:00,SUSPICIOUS OCC,SUSPICIOUS PERSON,Sunday,INGLESIDE,NONE,MOSCOW ST / ITALY AV,-122.432322,37.715039
103748,2013-12-26 11:30:00,OTHER OFFENSES,VIOLATION OF EMERGENCY PROTECTIVE ORDER,Thursday,TARAVAL,"ARREST, BOOKED",2100 Block of 20TH AV,-122.477018,37.747650


## Feature Engineering

In [2]:
preprocess_list = []

### Encode Label

In [3]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
data['category_encoded'] = label_encoder.fit_transform(data["Category"])

# Replace labels that are too rare with the most frequent value (otherwise log_loss cannot be computed).
data.loc[data['category_encoded'] == 22, 'category_encoded'] = 21
data.loc[data['category_encoded'] == 33, 'category_encoded'] = 21

data.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,category_encoded
377876,2010-01-31 08:23:00,OTHER OFFENSES,CONTRIBUTING TO THE DELINQUENCY OF MINOR,Sunday,TENDERLOIN,"ARREST, BOOKED",400 Block of ELLIS ST,-122.413631,37.784805,21
739105,2004-11-12 12:00:00,FORGERY/COUNTERFEITING,"CHECKS, FORGERY (FELONY)",Friday,MISSION,NONE,4200 Block of 23RD ST,-122.436992,37.752672,12
312949,2011-01-16 09:00:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,SOUTHERN,NONE,700 Block of FOLSOM ST,-122.399625,37.783159,32
104868,2013-12-14 16:30:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Saturday,RICHMOND,NONE,0 Block of COMMONWEALTH AV,-122.455856,37.785018,35
602622,2006-10-22 22:33:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,BAYVIEW,NONE,700 Block of KIRKWOOD AV,-122.374019,37.729203,32
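
The indices 22 and 33 above are hard-coded, and they depend on the alphabetical order in which `LabelEncoder` numbers the categories. A minimal sketch of an alternative, assuming you would rather pick rare labels by their counts: the helper `merge_rare_labels` and its `min_count` threshold are hypothetical names, not part of the original notebook, and the sketch folds rare labels into whichever label is most frequent, following the comment in the cell above.

```python
import pandas as pd

def merge_rare_labels(encoded, min_count):
    """Replace every label seen fewer than min_count times with the most frequent label."""
    counts = encoded.value_counts()
    rare = counts[counts < min_count].index
    most_frequent = counts.idxmax()
    # Keep values that are not rare; overwrite the rare ones.
    return encoded.where(~encoded.isin(rare), most_frequent)

# Toy labels: 5 appears only once, so it is folded into 1 (the most frequent label).
toy = pd.Series([1, 1, 1, 2, 2, 5])
merged = merge_rare_labels(toy, min_count=2)
print(merged.tolist())
```

On the real data this would be called as `merge_rare_labels(data['category_encoded'], min_count=...)`, with a threshold chosen by inspecting `value_counts()`.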


### Encode PdDistrict

In [4]:
from sklearn.preprocessing import LabelEncoder

def preprocess_pd_district(data):
    label_encoder = LabelEncoder()
    data['pd_district_encoded'] = label_encoder.fit_transform(data["PdDistrict"])
    
preprocess_pd_district(data)
preprocess_list.append(preprocess_pd_district)

data.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,category_encoded,pd_district_encoded
377876,2010-01-31 08:23:00,OTHER OFFENSES,CONTRIBUTING TO THE DELINQUENCY OF MINOR,Sunday,TENDERLOIN,"ARREST, BOOKED",400 Block of ELLIS ST,-122.413631,37.784805,21,9
739105,2004-11-12 12:00:00,FORGERY/COUNTERFEITING,"CHECKS, FORGERY (FELONY)",Friday,MISSION,NONE,4200 Block of 23RD ST,-122.436992,37.752672,12,3
312949,2011-01-16 09:00:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,SOUTHERN,NONE,700 Block of FOLSOM ST,-122.399625,37.783159,32,7
104868,2013-12-14 16:30:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Saturday,RICHMOND,NONE,0 Block of COMMONWEALTH AV,-122.455856,37.785018,35,6
602622,2006-10-22 22:33:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,BAYVIEW,NONE,700 Block of KIRKWOOD AV,-122.374019,37.729203,32,0


### Scale/Normalize Location

In [5]:
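def scale_normalize(data):
    # Shift so the maximum becomes 0, then divide by the (negative) minimum.
    # The result lies in [0, 1]: the largest input maps to 0, the smallest to 1.
    scaled = data - data.max()
    normalize = scaled / scaled.min()
    
    return normalize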

def preprocess_location(data):
    data["scale_normalized_X"] = scale_normalize(data["X"])
    data["scale_normalized_Y"] = scale_normalize(data["Y"])

preprocess_location(data)
preprocess_list.append(preprocess_location)

data.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,category_encoded,pd_district_encoded,scale_normalized_X,scale_normalized_Y
377876,2010-01-31 08:23:00,OTHER OFFENSES,CONTRIBUTING TO THE DELINQUENCY OF MINOR,Sunday,TENDERLOIN,"ARREST, BOOKED",400 Block of ELLIS ST,-122.413631,37.784805,21,9,0.950333,0.998529
739105,2004-11-12 12:00:00,FORGERY/COUNTERFEITING,"CHECKS, FORGERY (FELONY)",Friday,MISSION,NONE,4200 Block of 23RD ST,-122.436992,37.752672,12,3,0.961935,0.999143
312949,2011-01-16 09:00:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,SOUTHERN,NONE,700 Block of FOLSOM ST,-122.399625,37.783159,32,7,0.943378,0.99856
104868,2013-12-14 16:30:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Saturday,RICHMOND,NONE,0 Block of COMMONWEALTH AV,-122.455856,37.785018,35,6,0.971303,0.998525
602622,2006-10-22 22:33:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,BAYVIEW,NONE,700 Block of KIRKWOOD AV,-122.374019,37.729203,32,0,0.930662,0.999592
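
To make the direction of this scaling concrete, here is a small sketch on made-up numbers. It re-declares `scale_normalize` so the block runs on its own and compares it with the equivalent closed form `(max - x) / (max - min)`, which is an inverted min-max scaling.

```python
import pandas as pd

def scale_normalize(data):
    scaled = data - data.max()
    return scaled / scaled.min()

# Made-up values, not taken from the dataset.
toy = pd.Series([10.0, 15.0, 20.0])
result = scale_normalize(toy)

# Same mapping written as an inverted min-max scaling.
inverted_min_max = (toy.max() - toy) / (toy.max() - toy.min())

print(result.tolist())
print(inverted_min_max.tolist())
```

Because the range is set by the extreme values, a single outlying coordinate compresses everything else. The sample rows above all fall between roughly 0.93 and 0.97 for X and above 0.998 for Y, which suggests the raw coordinates contain at least one such outlier.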


### Parse Date/Week

In [6]:
from sklearn.preprocessing import LabelEncoder

def preprocess_dayofweek(data):
    label_encoder = LabelEncoder()
    data['dayofweek_encoded'] = label_encoder.fit_transform(data["DayOfWeek"])

preprocess_dayofweek(data)
preprocess_list.append(preprocess_dayofweek)

data.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,category_encoded,pd_district_encoded,scale_normalized_X,scale_normalized_Y,dayofweek_encoded
377876,2010-01-31 08:23:00,OTHER OFFENSES,CONTRIBUTING TO THE DELINQUENCY OF MINOR,Sunday,TENDERLOIN,"ARREST, BOOKED",400 Block of ELLIS ST,-122.413631,37.784805,21,9,0.950333,0.998529,3
739105,2004-11-12 12:00:00,FORGERY/COUNTERFEITING,"CHECKS, FORGERY (FELONY)",Friday,MISSION,NONE,4200 Block of 23RD ST,-122.436992,37.752672,12,3,0.961935,0.999143,0
312949,2011-01-16 09:00:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,SOUTHERN,NONE,700 Block of FOLSOM ST,-122.399625,37.783159,32,7,0.943378,0.99856,3
104868,2013-12-14 16:30:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Saturday,RICHMOND,NONE,0 Block of COMMONWEALTH AV,-122.455856,37.785018,35,6,0.971303,0.998525,2
602622,2006-10-22 22:33:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,BAYVIEW,NONE,700 Block of KIRKWOOD AV,-122.374019,37.729203,32,0,0.930662,0.999592,3
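
`LabelEncoder` numbers the day names alphabetically, which is why Friday is 0 and Sunday is 3 in the output above. If calendar order matters to the model, an explicit mapping is one option. The sketch below is an assumption about what such a mapping could look like; `DAY_ORDER` and `encode_dayofweek_calendar` are hypothetical names that do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical explicit mapping (not in the original notebook): Monday=0 ... Sunday=6.
DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
DAY_TO_INDEX = {day: index for index, day in enumerate(DAY_ORDER)}

def encode_dayofweek_calendar(days):
    """Map day names to their position in the calendar week."""
    return days.map(DAY_TO_INDEX)

toy = pd.Series(["Sunday", "Friday", "Saturday"])
calendar_codes = encode_dayofweek_calendar(toy)

# What an alphabetical encoding produces for the same days, for comparison.
alphabetical_codes = toy.map({day: i for i, day in enumerate(sorted(DAY_ORDER))})

print(calendar_codes.tolist())
print(alphabetical_codes.tolist())
```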


In [7]:
import time

def preprocess_dates(data):
    # Parse every timestamp in one vectorized call instead of row by row.
    dates = pd.to_datetime(data["Dates"], format="%Y-%m-%d %H:%M:%S")

    data["year"] = dates.dt.year
    data["month"] = dates.dt.month
    data["day"] = dates.dt.day
    data["hour"] = dates.dt.hour
    data["minute"] = dates.dt.minute
    data["second"] = dates.dt.second

    # Seconds since the epoch; time.mktime interprets each timestamp as local time.
    data["numeric_time"] = dates.map(lambda date: time.mktime(date.timetuple()))

preprocess_dates(data)
preprocess_list.append(preprocess_dates)

data.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,category_encoded,...,scale_normalized_X,scale_normalized_Y,dayofweek_encoded,year,month,day,hour,minute,second,numeric_time
377876,2010-01-31 08:23:00,OTHER OFFENSES,CONTRIBUTING TO THE DELINQUENCY OF MINOR,Sunday,TENDERLOIN,"ARREST, BOOKED",400 Block of ELLIS ST,-122.413631,37.784805,21,...,0.950333,0.998529,3,2010,1,31,8,23,0,1264893780
739105,2004-11-12 12:00:00,FORGERY/COUNTERFEITING,"CHECKS, FORGERY (FELONY)",Friday,MISSION,NONE,4200 Block of 23RD ST,-122.436992,37.752672,12,...,0.961935,0.999143,0,2004,11,12,12,0,0,1100228400
312949,2011-01-16 09:00:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,SOUTHERN,NONE,700 Block of FOLSOM ST,-122.399625,37.783159,32,...,0.943378,0.99856,3,2011,1,16,9,0,0,1295136000
104868,2013-12-14 16:30:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Saturday,RICHMOND,NONE,0 Block of COMMONWEALTH AV,-122.455856,37.785018,35,...,0.971303,0.998525,2,2013,12,14,16,30,0,1387006200
602622,2006-10-22 22:33:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,BAYVIEW,NONE,700 Block of KIRKWOOD AV,-122.374019,37.729203,32,...,0.930662,0.999592,3,2006,10,22,22,33,0,1161523980


### Tokenize Address

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bow_address = vectorizer.fit_transform(data["Address"]).toarray()

bow_address = pd.DataFrame(bow_address, index=data.index)
bow_address.columns = vectorizer.get_feature_names_out()

bow_address

Unnamed: 0,100,1000,101,10th,1100,11th,1200,12th,1300,13th,...,yorke,yosemite,young,ysabel,yukon,zampa,zeno,zircon,zoe,zoo
377876,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
739105,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
312949,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
104868,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
602622,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
147530,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
205885,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
637277,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
673139,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
103748,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
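
To see which tokens end up as columns, the sketch below applies the tokenization rule that `CountVectorizer` documents as its default (lowercase the text, then keep runs of two or more word characters). It is an illustration using only the standard library, not the library's own code; `tokenize_address` is a hypothetical helper.

```python
import re
from collections import Counter

# CountVectorizer's documented default token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize_address(address):
    """Lowercase an address and split it into tokens of length >= 2."""
    return TOKEN_PATTERN.findall(address.lower())

# Addresses copied from the sample rows shown earlier.
addresses = [
    "400 Block of ELLIS ST",
    "MOSCOW ST / ITALY AV",
    "0 Block of COMMONWEALTH AV",
]

tokens = [tokenize_address(address) for address in addresses]
vocabulary = sorted({token for row in tokens for token in row})
counts = [Counter(row) for row in tokens]

for address, row in zip(addresses, tokens):
    print(address, "->", row)
```

Note that the single-character block number `0` and the `/` separator produce no token, while generic words such as `block`, `of`, `st`, and `av` each get a column of their own.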


### Cross validate by multi-class logarithmic loss

In [9]:
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

def multiclass_logloss_score(model, features, labels, num_folds=5):
    kfolds = KFold(n_splits=num_folds)

    total_score = 0.0

    for train_index, test_index in kfolds.split(features):
        train_features = features.iloc[train_index]
        test_features = features.iloc[test_index]
        train_labels = labels.iloc[train_index]
        test_labels = labels.iloc[test_index]

        model.fit(train_features, train_labels)
        prediction = model.predict_proba(test_features)

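        # Pass the fitted classes so the probability columns line up with the labels,
        # even when a class is missing from this fold's test split.
        score = log_loss(test_labels, prediction, labels=model.classes_)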
        total_score += score

    total_score = total_score / num_folds
    
    return total_score
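
For reference, the multi-class logarithmic loss over $N$ samples and $M$ classes is $-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$, where $y_{ij}$ is 1 if sample $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability. The numpy sketch below is one way to compute it, assuming the labels are integer column indices `0..M-1` and that probabilities are clipped away from 0 and 1 and renormalised per row. The function name and the `eps` value are assumptions for illustration, and this is not scikit-learn's implementation.

```python
import numpy as np

def multiclass_log_loss(true_labels, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each sample."""
    probabilities = np.clip(probabilities, eps, 1 - eps)
    probabilities = probabilities / probabilities.sum(axis=1, keepdims=True)
    rows = np.arange(len(true_labels))
    return -np.mean(np.log(probabilities[rows, true_labels]))

# A uniform guess over 3 classes scores ln(3), whatever the true labels are.
uniform = np.full((4, 3), 1.0 / 3.0)
labels = np.array([0, 1, 2, 0])
baseline = multiclass_log_loss(labels, uniform)

# Putting most of the mass on the correct class gives a lower (better) loss.
confident = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]])
better = multiclass_log_loss(labels, confident)

print(baseline, better)
```

A uniform guess over all classes is a useful sanity baseline: any model scoring above the natural logarithm of the number of classes is doing worse than guessing.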

## Prediction

### Predict from Location, Time, and District

In [10]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# feature_names = ["scale_normalized_X", "scale_normalized_Y", "dayofweek_encoded", "year", "month", "day", "hour", "minute", "second", "pd_district_encoded"]
feature_names = ["scale_normalized_X", "scale_normalized_Y", "dayofweek_encoded", "numeric_time", "pd_district_encoded"]
label_name = "category_encoded"

gaussian_score = multiclass_logloss_score(GaussianNB(), data[feature_names], data[label_name])
multinomial_score = multiclass_logloss_score(MultinomialNB(), data[feature_names], data[label_name])
bernoulli_score = multiclass_logloss_score(BernoulliNB(), data[feature_names], data[label_name])

print("GaussianNB = %.5f, MultinomialNB = %.5f, BernoulliNB = %.5f" % (gaussian_score, multinomial_score, bernoulli_score))

GaussianNB = 2.66670, MultinomialNB = 2.67391, BernoulliNB = 2.67144
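
One thing worth keeping in mind when reading the `BernoulliNB` score: scikit-learn documents its default as `binarize=0.0`, meaning every feature is thresholded at zero before the model is fitted. The numpy sketch below imitates only that thresholding step on made-up rows (it is not the library's code) to show how rows with different values can collapse into the same binary pattern.

```python
import numpy as np

# Made-up rows: scaled X, scaled Y, day-of-week code, hour, district code.
features = np.array([
    [0.95, 0.998, 3.0,  8.0, 9.0],
    [0.96, 0.999, 2.0, 16.0, 6.0],
    [0.93, 0.999, 3.0, 22.0, 0.0],
])

# Thresholding at 0.0: any strictly positive value becomes 1, everything else 0.
binarized = (features > 0.0).astype(int)
distinct_rows = np.unique(binarized, axis=0)

print(binarized)
print(len(distinct_rows), "distinct patterns from", len(features), "rows")
```

With mostly positive features, only the positions of the zeros (for example a district or day encoded as 0) distinguish one row from another. The repeated probability rows in the submission output further down are consistent with this.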


## Submit

In [40]:
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

for preprocess_function in preprocess_list:
    preprocess_function(train)
    preprocess_function(test)

feature_names = ["scale_normalized_X", "scale_normalized_Y", "dayofweek_encoded", "year", "month", "day", "hour", "minute", "second", "pd_district_encoded"]
label_name = "Category"

model = BernoulliNB()
model.fit(train[feature_names], train[label_name])
prediction = model.predict_proba(test[feature_names])

label_columns = sorted(train[label_name].unique())

submit = pd.DataFrame(prediction)
submit.index.names = ["id"]
submit.columns = label_columns

submit

id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0.005016,0.121032,0.000190,0.000650,0.033831,0.003102,0.002471,0.068774,0.003488,0.000638,...,0.000142,0.005401,0.000455,0.043761,1.523766e-05,0.008728,0.048059,0.049998,0.064774,0.023262
1,0.005016,0.121032,0.000190,0.000650,0.033831,0.003102,0.002471,0.068774,0.003488,0.000638,...,0.000142,0.005401,0.000455,0.043761,1.523766e-05,0.008728,0.048059,0.049998,0.064774,0.023262
2,0.001646,0.094925,0.000234,0.000308,0.032591,0.006673,0.003310,0.087247,0.006660,0.000777,...,0.000090,0.006488,0.000651,0.035545,1.757862e-06,0.010576,0.040746,0.037198,0.065500,0.011251
3,0.001646,0.094925,0.000234,0.000308,0.032591,0.006673,0.003310,0.087247,0.006660,0.000777,...,0.000090,0.006488,0.000651,0.035545,1.757862e-06,0.010576,0.040746,0.037198,0.065500,0.011251
4,0.001646,0.094925,0.000234,0.000308,0.032591,0.006673,0.003310,0.087247,0.006660,0.000777,...,0.000090,0.006488,0.000651,0.035545,1.757862e-06,0.010576,0.040746,0.037198,0.065500,0.011251
5,0.001646,0.094925,0.000234,0.000308,0.032591,0.006673,0.003310,0.087247,0.006660,0.000777,...,0.000090,0.006488,0.000651,0.035545,1.757862e-06,0.010576,0.040746,0.037198,0.065500,0.011251
6,0.001646,0.094925,0.000234,0.000308,0.032591,0.006673,0.003310,0.087247,0.006660,0.000777,...,0.000090,0.006488,0.000651,0.035545,1.757862e-06,0.010576,0.040746,0.037198,0.065500,0.011251
7,0.001646,0.094925,0.000234,0.000308,0.032591,0.006673,0.003310,0.087247,0.006660,0.000777,...,0.000090,0.006488,0.000651,0.035545,1.757862e-06,0.010576,0.040746,0.037198,0.065500,0.011251
8,0.001646,0.094925,0.000234,0.000308,0.032591,0.006673,0.003310,0.087247,0.006660,0.000777,...,0.000090,0.006488,0.000651,0.035545,1.757862e-06,0.010576,0.040746,0.037198,0.065500,0.011251
9,0.001646,0.094925,0.000234,0.000308,0.032591,0.006673,0.003310,0.087247,0.006660,0.000777,...,0.000090,0.006488,0.000651,0.035545,1.757862e-06,0.010576,0.040746,0.037198,0.065500,0.011251
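
In the cell above, every function in `preprocess_list` runs on `train` and `test` separately, so `scale_normalize` uses each file's own minimum and maximum. If the two files have different extremes, the same coordinate maps to different scaled values. A minimal sketch of one way around this, assuming you want a shared scale: learn the extremes from the training data and reuse them. `fit_scale_normalize` is a hypothetical helper, and the coordinates are made up.

```python
import pandas as pd

def fit_scale_normalize(train_values):
    """Return a function that scales any series with the training set's max and min."""
    maximum = train_values.max()
    minimum = train_values.min()

    def transform(values):
        # Same mapping as scale_normalize: max -> 0, min -> 1.
        return (values - maximum) / (minimum - maximum)

    return transform

# Made-up longitudes; the test range is narrower than the training range.
train_x = pd.Series([-122.5, -122.4, -122.3])
test_x = pd.Series([-122.45, -122.35])

transform_x = fit_scale_normalize(train_x)
train_scaled = transform_x(train_x)
test_scaled = transform_x(test_x)

print(train_scaled.tolist())
print(test_scaled.tolist())
```

Scaling `test_x` with its own extremes would instead stretch it across the full 0 to 1 range, so the model would see shifted inputs at prediction time.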


In [41]:
from time import strftime, localtime

current_time = strftime("%Y.%m.%d %H.%M.%S", localtime())

submit.to_csv("submit/%s.csv" % current_time)