# San Francisco Crime Classification
## Predict the category of crimes that occurred in the city by the bay

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

We're also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The most up-voted scripts from this competition will receive official Kaggle swag as prizes.

![Police Line](https://kaggle2.blob.core.windows.net/competitions/kaggle/4458/media/sfcrime_banner.png)

# Load Data

In [2]:
import numpy as np
import pandas as pd

data = pd.read_csv("train.csv", parse_dates=['Dates'])
data = data.reindex(np.random.permutation(data.index)) 
data

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
522695,2007-12-28 23:30:00,LARCENY/THEFT,PETTY THEFT FROM A BUILDING,Friday,PARK,NONE,600 Block of DIVISADERO ST,-122.437781,37.775483
534955,2007-10-21 17:00:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Sunday,BAYVIEW,NONE,1200 Block of MARIPOSA ST,-122.396168,37.763823
350355,2010-07-02 23:00:00,MISSING PERSON,MISSING ADULT,Friday,CENTRAL,LOCATED,600 Block of SUTTER ST,-122.411069,37.788856
62810,2014-07-13 06:11:00,WEAPON LAWS,POSS OF PROHIBITED WEAPON,Sunday,MISSION,"ARREST, BOOKED",300 Block of SHOTWELL ST,-122.416082,37.762823
699921,2005-06-14 09:20:00,LARCENY/THEFT,PETTY THEFT SHOPLIFTING,Tuesday,SOUTHERN,"ARREST, BOOKED",200 Block of KING ST,-122.393111,37.777310
338493,2010-09-07 19:35:00,PROSTITUTION,SOLICITS FOR ACT OF PROSTITUTION,Tuesday,MISSION,"ARREST, CITED",17TH ST / CAPP ST,-122.418486,37.763495
182514,2012-12-11 21:00:00,VEHICLE THEFT,STOLEN AUTOMOBILE,Tuesday,INGLESIDE,NONE,300 Block of MOSCOW ST,-122.427008,37.721998
675831,2005-10-17 02:16:00,WARRANTS,WARRANT ARREST,Monday,NORTHERN,"ARREST, BOOKED",900 Block of LARKIN ST,-122.418065,37.786342
270343,2011-09-10 15:50:00,RECOVERED VEHICLE,RECOVERED VEHICLE - STOLEN OUTSIDE SF,Saturday,TARAVAL,NONE,700 Block of QUINTARA ST,-122.473424,37.748818
37482,2014-11-12 19:10:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Wednesday,BAYVIEW,NONE,900 Block of CAROLINA ST,-122.399638,37.755974
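
The shuffle in the cell above is unseeded, so the row order (and therefore the fold contents used later) changes from run to run. A minimal sketch of a reproducible shuffle with `DataFrame.sample`; the toy frame and the seed value `0` are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for the training frame; the rows are made up.
toy = pd.DataFrame({"Category": ["LARCENY/THEFT", "VANDALISM", "WARRANTS", "WEAPON LAWS"]})

# frac=1 returns every row in random order; random_state fixes that order.
shuffled_a = toy.sample(frac=1, random_state=0)
shuffled_b = toy.sample(frac=1, random_state=0)
```

Both calls return the same permutation, so cross-validation scores computed on the shuffled frame can be compared between runs.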


# Feature Engineering

In [3]:
preprocess_list = []

## Encode Label

In [4]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
data['Category_Ecoded'] = label_encoder.fit_transform(data["Category"])
# Add a column for the encoded label; the original column can be dropped
# later if it is no longer needed.

# Replace labels that are too rare with the most frequent value (otherwise log_loss cannot be computed).
data.loc[data['Category_Ecoded'] == 22, 'Category_Ecoded'] = 21
data.loc[data['Category_Ecoded'] == 33, 'Category_Ecoded'] = 21

data.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,Category_Ecoded
522695,2007-12-28 23:30:00,LARCENY/THEFT,PETTY THEFT FROM A BUILDING,Friday,PARK,NONE,600 Block of DIVISADERO ST,-122.437781,37.775483,16
534955,2007-10-21 17:00:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Sunday,BAYVIEW,NONE,1200 Block of MARIPOSA ST,-122.396168,37.763823,35
350355,2010-07-02 23:00:00,MISSING PERSON,MISSING ADULT,Friday,CENTRAL,LOCATED,600 Block of SUTTER ST,-122.411069,37.788856,19
62810,2014-07-13 06:11:00,WEAPON LAWS,POSS OF PROHIBITED WEAPON,Sunday,MISSION,"ARREST, BOOKED",300 Block of SHOTWELL ST,-122.416082,37.762823,38
699921,2005-06-14 09:20:00,LARCENY/THEFT,PETTY THEFT SHOPLIFTING,Tuesday,SOUTHERN,"ARREST, BOOKED",200 Block of KING ST,-122.393111,37.77731,16
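
The cell above hard-codes the two rare label codes (22 and 33). As a hedged alternative, rare categories could be found from their counts instead. The sketch below uses a made-up toy frame, a hypothetical threshold `min_count`, and takes the most frequent category as the merge target; none of these values come from the notebook:

```python
import pandas as pd

# Toy stand-in for data["Category"]; the counts are invented for illustration.
toy = pd.DataFrame(
    {"Category": ["LARCENY/THEFT"] * 6 + ["VANDALISM"] * 4 + ["WARRANTS"]}
)

min_count = 2  # hypothetical threshold, not taken from the notebook
counts = toy["Category"].value_counts()
rare = counts[counts < min_count].index  # categories seen fewer than min_count times
fallback = counts.idxmax()               # most frequent category, used as merge target

# Keep a category unless it is rare; otherwise substitute the fallback.
toy["CategoryMerged"] = toy["Category"].where(~toy["Category"].isin(rare), fallback)
```

Merging before encoding keeps the rule readable and avoids depending on the integer codes that `LabelEncoder` happens to assign.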


## Encode PdDistrict

In [5]:
from sklearn.preprocessing import LabelEncoder

def preprocess_pd_district(data):
    label_encoder = LabelEncoder()
    data['PdDistrictEncoded'] = label_encoder.fit_transform(data["PdDistrict"])
    
preprocess_pd_district(data)
preprocess_list.append(preprocess_pd_district)

data.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,Category_Ecoded,PdDistrictEncoded
522695,2007-12-28 23:30:00,LARCENY/THEFT,PETTY THEFT FROM A BUILDING,Friday,PARK,NONE,600 Block of DIVISADERO ST,-122.437781,37.775483,16,5
534955,2007-10-21 17:00:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Sunday,BAYVIEW,NONE,1200 Block of MARIPOSA ST,-122.396168,37.763823,35,0
350355,2010-07-02 23:00:00,MISSING PERSON,MISSING ADULT,Friday,CENTRAL,LOCATED,600 Block of SUTTER ST,-122.411069,37.788856,19,1
62810,2014-07-13 06:11:00,WEAPON LAWS,POSS OF PROHIBITED WEAPON,Sunday,MISSION,"ARREST, BOOKED",300 Block of SHOTWELL ST,-122.416082,37.762823,38,3
699921,2005-06-14 09:20:00,LARCENY/THEFT,PETTY THEFT SHOPLIFTING,Tuesday,SOUTHERN,"ARREST, BOOKED",200 Block of KING ST,-122.393111,37.77731,16,7


## Encode DayOfWeek

In [6]:
from sklearn.preprocessing import LabelEncoder

def preprocess_dayofweek(data):
    label_encoder = LabelEncoder()
    data['DayOfWeekEncoded'] = label_encoder.fit_transform(data["DayOfWeek"])
    
preprocess_dayofweek(data)
preprocess_list.append(preprocess_dayofweek)

data.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,Category_Ecoded,PdDistrictEncoded,DayOfWeekEncoded
522695,2007-12-28 23:30:00,LARCENY/THEFT,PETTY THEFT FROM A BUILDING,Friday,PARK,NONE,600 Block of DIVISADERO ST,-122.437781,37.775483,16,5,0
534955,2007-10-21 17:00:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Sunday,BAYVIEW,NONE,1200 Block of MARIPOSA ST,-122.396168,37.763823,35,0,3
350355,2010-07-02 23:00:00,MISSING PERSON,MISSING ADULT,Friday,CENTRAL,LOCATED,600 Block of SUTTER ST,-122.411069,37.788856,19,1,0
62810,2014-07-13 06:11:00,WEAPON LAWS,POSS OF PROHIBITED WEAPON,Sunday,MISSION,"ARREST, BOOKED",300 Block of SHOTWELL ST,-122.416082,37.762823,38,3,3
699921,2005-06-14 09:20:00,LARCENY/THEFT,PETTY THEFT SHOPLIFTING,Tuesday,SOUTHERN,"ARREST, BOOKED",200 Block of KING ST,-122.393111,37.77731,16,7,5
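
`LabelEncoder` numbers the classes in sorted order, which is why Friday becomes 0, Sunday 3 and Tuesday 5 in the output above: the codes are alphabetical, not chronological. The plain-Python sketch below reproduces that mapping and contrasts it with a calendar-order mapping; `calendar_codes` is a hypothetical alternative, not something the notebook uses:

```python
all_days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Codes in sorted (alphabetical) order, mirroring what LabelEncoder assigns.
alphabetical_codes = {day: code for code, day in enumerate(sorted(all_days))}

# Hypothetical alternative: codes that follow the order of the week.
calendar_codes = {day: code for code, day in enumerate(all_days)}

sample = ["Friday", "Sunday", "Tuesday"]  # DayOfWeek values from the rows above
encoded = [alphabetical_codes[day] for day in sample]
```

Tree-based models can split on either numbering, but a linear model treats the code as a magnitude, so the choice of order may matter there.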


## Date parsing

In [7]:
import time
from datetime import datetime

def preprocess_dates(data):
    data['Day'] = data['Dates'].dt.day
    data['Month'] = data['Dates'].dt.month
    data['Year'] = data['Dates'].dt.year
    data['Hour'] = data['Dates'].dt.hour
    data['Minute'] = data['Dates'].dt.minute

preprocess_dates(data)
preprocess_list.append(preprocess_dates)

data.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,Category_Ecoded,PdDistrictEncoded,DayOfWeekEncoded,Day,Month,Year,Hour,Minute
522695,2007-12-28 23:30:00,LARCENY/THEFT,PETTY THEFT FROM A BUILDING,Friday,PARK,NONE,600 Block of DIVISADERO ST,-122.437781,37.775483,16,5,0,28,12,2007,23,30
534955,2007-10-21 17:00:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Sunday,BAYVIEW,NONE,1200 Block of MARIPOSA ST,-122.396168,37.763823,35,0,3,21,10,2007,17,0
350355,2010-07-02 23:00:00,MISSING PERSON,MISSING ADULT,Friday,CENTRAL,LOCATED,600 Block of SUTTER ST,-122.411069,37.788856,19,1,0,2,7,2010,23,0
62810,2014-07-13 06:11:00,WEAPON LAWS,POSS OF PROHIBITED WEAPON,Sunday,MISSION,"ARREST, BOOKED",300 Block of SHOTWELL ST,-122.416082,37.762823,38,3,3,13,7,2014,6,11
699921,2005-06-14 09:20:00,LARCENY/THEFT,PETTY THEFT SHOPLIFTING,Tuesday,SOUTHERN,"ARREST, BOOKED",200 Block of KING ST,-122.393111,37.77731,16,7,5,14,6,2005,9,20


## Scale/Normalize Features

In [8]:
# Normalize several features here if needed

# Cross-validate by multi-class logarithmic loss


In [9]:
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in newer releases
from sklearn.metrics import log_loss

def multiclass_logloss_score(model, features, labels, num_folds = 5):
    # Mean multi-class log loss of `model` over `num_folds` unshuffled folds.
    kfolds = KFold(n_splits = num_folds)
    
    total_score = 0.0
    
    for train_index, test_index in kfolds.split(features):
        train_features = features.iloc[train_index]
        test_features = features.iloc[test_index]
        train_labels = labels.iloc[train_index]
        test_labels = labels.iloc[test_index]
        
        model.fit(train_features, train_labels)
        prediction = model.predict_proba(test_features)
        
        # Pass the fitted classes so the probability columns stay aligned
        # even when a test fold happens to miss a class.
        score = log_loss(test_labels, prediction, labels = model.classes_)
        total_score += score
        
    total_score = total_score / num_folds
    
    return total_score
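
The metric averaged by the function above is the multi-class logarithmic loss: for each sample, take the negative log of the probability assigned to the true class, then average over samples. A minimal pure-Python sketch of that formula follows; the helper name `multiclass_log_loss`, the clipping constant `eps` and the toy numbers are assumptions for illustration, and the notebook itself relies on `sklearn.metrics.log_loss`:

```python
import math

def multiclass_log_loss(true_labels, probabilities, eps=1e-15):
    """Average of -log(p_true) over all samples.

    true_labels: integer class index per sample.
    probabilities: one row of class probabilities per sample.
    eps: clipping bound that keeps log() away from 0 (assumed value).
    """
    total = 0.0
    for label, row in zip(true_labels, probabilities):
        p_true = min(max(row[label], eps), 1 - eps)
        total -= math.log(p_true)
    return total / len(true_labels)

# Made-up example: three samples, three classes.
true_labels = [0, 2, 1]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
score = multiclass_log_loss(true_labels, probabilities)
```

A useful reference point: a uniform prediction over K classes scores ln(K), so a model is only informative if it scores below that value.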
        

# Prediction

## Predict 

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

feature_names = ["X", "Y", "PdDistrictEncoded", "DayOfWeekEncoded","Day", "Month", "Year", "Hour", "Minute"]
# passed as a list of column names
label_name = "Category_Ecoded"
# passed as a single column name

#tree10_score = multiclass_logloss_score(RandomForestClassifier(n_estimators = 10, n_jobs= -1), data[feature_names], data[label_name])
#tree40_score = multiclass_logloss_score(RandomForestClassifier(n_estimators = 40, n_jobs= -1), data[feature_names], data[label_name])
logistic_score = multiclass_logloss_score(LogisticRegression(n_jobs = -1), data[feature_names], data[label_name])
#tree100_score = multiclass_logloss_score(RandomForestClassifier(n_estimators = 100, n_jobs= -1), data[feature_names], data[label_name])
tree100_score = 10.0
# the models are passed in above
print("logistic = %.5f, tree100 = %.5f" % ( logistic_score ,tree100_score))

NameError: name 'tree100_score' is not defined

In [19]:
from sklearn.ensemble import RandomForestClassifier

feature_names = ["X", "Y", "PdDistrictEncoded", "DayOfWeekEncoded", "Month", "Year", "Hour", "Minute"]
# passed as a list of column names
label_name = "Category_Ecoded"
# passed as a single column name

tree200_score = multiclass_logloss_score(RandomForestClassifier(n_estimators = 200, n_jobs= -1), data[feature_names], data[label_name])

print("tree200 = %.5f" % (tree200_score))

tree200 = 8.20108


# Submit

In [None]:
from sklearn.ensemble import RandomForestClassifier


train = pd.read_csv("train.csv",parse_dates=['Dates'])
test = pd.read_csv("test.csv",parse_dates=['Dates'])

for preprocess_function in preprocess_list:
    preprocess_function(train)
    preprocess_function(test)
    
feature_names = ["X", "Y", "PdDistrictEncoded", "DayOfWeekEncoded", "Month", "Year", "Hour", "Minute"]
# drop() returns a new frame, so assign the result back
train = train.drop(["Descript", "DayOfWeek", "PdDistrict", "Resolution", "Dates"], axis = 1)

label_name = "Category"

model = RandomForestClassifier(n_estimators = 120, n_jobs = -1)
model.fit(train[feature_names], train[label_name])
prediction = model.predict_proba(test[feature_names])

label_columns = sorted(train[label_name].unique())

submit = pd.DataFrame(prediction)
submit.index.names = ["Id"]
submit.columns = label_columns

submit
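
scikit-learn orders the columns of `predict_proba` by `model.classes_`, which holds the sorted unique labels seen during `fit`, so `sorted(train[label_name].unique())` lines up with the probability columns. A small sketch of the resulting frame plus two sanity checks that could be run before writing the file; the probabilities and the three category names are made up, and the checks are a suggestion rather than part of the original notebook:

```python
import pandas as pd

# Made-up probabilities for three test rows; in the notebook these come from predict_proba.
prediction = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
# Assumed to equal model.classes_, i.e. the sorted category names.
label_columns = ["LARCENY/THEFT", "VANDALISM", "WARRANTS"]

submit = pd.DataFrame(prediction, columns=label_columns)
submit.index.name = "Id"

# Each row should be a probability distribution over the categories.
row_sums = submit.sum(axis=1)
rows_ok = bool(((row_sums - 1.0).abs() < 1e-9).all())

# Column names should appear in sorted order, matching model.classes_.
columns_ok = list(submit.columns) == sorted(submit.columns)
```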

In [None]:
from time import strftime, localtime

current_time = strftime("%Y.%m.%d %H.%M.%S", localtime())

submit.to_csv("tree_120_%s.csv" % current_time)