In [1]:
import scipy
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
%matplotlib inline
le = LabelEncoder()



The first thing I need to do is import the libraries I will use. The ones I use most are pandas and XGBoost: pandas reads the files into DataFrames, while XGBoost is what I use to build the prediction model.

In [2]:
test = pd.read_csv('SF_crime/test.csv', index_col='Id')
test = test.rename(columns={'X': 'Longitude', "Y": "Latitude"})
test.Dates = pd.to_datetime(test.Dates)
test_keep = test
crime_in_sf = pd.read_csv('SF_crime/train.csv')
crime_in_sf.Dates = pd.to_datetime(crime_in_sf.Dates)
crime_in_sf = crime_in_sf.rename(columns={'X': 'Longitude', "Y": "Latitude",})
crime_in_sf = crime_in_sf.drop(['Resolution', 'Descript'], axis=1)
crime_train, crime_test = train_test_split(crime_in_sf, test_size=.4)

The next thing I have to do is read in all the files and clean them up so they are more readable. I rename the X and Y columns to Longitude and Latitude, and convert the Dates column to a datetime format so I can pull out individual years or days if I need to. I also drop the Resolution and Descript columns from the training data, since they do not appear in the test data and so cannot be used for prediction. Finally, I split the training data 60/40 into a part to train on and a held-out part to evaluate on.
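As a small illustration of why the datetime conversion is useful, here is a sketch that pulls individual date parts out of a Dates column. The toy rows and the new column names (Year, Hour, Weekday) are assumptions made up for this example; they are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
df = pd.DataFrame({
    "Dates": ["2015-05-13 23:53:00", "2014-01-02 08:15:00"],
    "X": [-122.4258, -122.4103],
    "Y": [37.7745, 37.8004],
})

df = df.rename(columns={"X": "Longitude", "Y": "Latitude"})
df["Dates"] = pd.to_datetime(df["Dates"])

# The .dt accessor exposes the individual parts of each timestamp.
df["Year"] = df["Dates"].dt.year
df["Hour"] = df["Dates"].dt.hour
df["Weekday"] = df["Dates"].dt.day_name()
```

Columns like these could be fed to the model as separate features instead of a single encoded timestamp, though the notebook itself does not do this.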

In [3]:
# Encode every non-coordinate column of the test set as integer codes.
# Note: the encoder here is fit on the test set alone, so the codes for
# columns such as Address may not match the ones used for training.
for column in test.columns.values:
    if column != 'Longitude' and column != 'Latitude':
        le.fit(test[column])
        test[column] = le.transform(test[column])

# Refit the encoder on each full training column so that the training
# and held-out splits share the same codes.
for column in crime_in_sf.columns.values:
    if column != 'Longitude' and column != 'Latitude':
        le.fit(crime_in_sf[column])
        crime_train[column] = le.transform(crime_train[column])
        crime_test[column] = le.transform(crime_test[column])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


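Here I take every column except latitude and longitude and convert it from strings (or timestamps) into integer codes with LabelEncoder. The same encoder is refit for each column, so only the most recent mapping is kept; to convert codes back into labels later, I refit it on the column I need.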
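One way to keep every mapping around, and to make sure a label gets the same code in the training and test data, is to store one fitted encoder per column. The sketch below is an alternative under stated assumptions, not what the notebook does: the `encoders` dictionary and the toy frames are hypothetical names and data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for the real data; the values are made up.
train = pd.DataFrame({"PdDistrict": ["MISSION", "BAYVIEW", "MISSION"],
                      "DayOfWeek": ["Monday", "Friday", "Friday"]})
test = pd.DataFrame({"PdDistrict": ["TARAVAL", "BAYVIEW"],
                     "DayOfWeek": ["Monday", "Sunday"]})

encoders = {}  # hypothetical: column name -> fitted LabelEncoder
for column in ["PdDistrict", "DayOfWeek"]:
    enc = LabelEncoder()
    # Fit on the union of both frames so the codes agree everywhere.
    enc.fit(pd.concat([train[column], test[column]]))
    train[column] = enc.transform(train[column])
    test[column] = enc.transform(test[column])
    encoders[column] = enc

# Codes can be mapped back to labels with the stored encoder.
districts = encoders["PdDistrict"].inverse_transform(test["PdDistrict"])
```

With this layout there is no need to refit an encoder before calling `inverse_transform`, at the cost of holding one encoder object per column.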

In [4]:
categories = crime_train.Category
crime_train = crime_train.drop('Category', axis=1)

categories2 = crime_test.Category
crime_test = crime_test.drop('Category', axis=1)

To train the model properly, I need to separate the crime categories (the labels I want to predict) from the rest of the data (the features).

In [5]:
# DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same array.
dtrain = xgb.DMatrix(crime_train.values,
                     label=categories)
dtest = xgb.DMatrix(crime_test.values,
                    label=categories2)

Now that the data is separated, it needs to be prepared for XGBoost's boosted decision trees. The features are converted from a pandas DataFrame into a matrix, and the categories are passed separately as the label so that the model knows what it is predicting.

In [6]:
# 'max_depth' is the current parameter name; the 'bst:' prefix comes from older XGBoost documentation.
param = {'max_depth':6, 'objective':'multi:softprob', 'num_class':39}
param['nthread'] = 4
param['eval_metric'] = ['merror', 'mlogloss']
evallist  = [(dtest,'eval'), (dtrain,'train')]
num_round = 280

Lastly, I need to tell the program how it's supposed to run and what it should use to evaluate the results. I set how deep each tree may grow (max_depth), what I want it to return (softprob, a probability for every class), and how many categories there are (39). I also set up the evaluation metrics to report each round: merror, the fraction of rows classified incorrectly, and mlogloss, the multiclass log loss.
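To make the two evaluation metrics concrete, here is a minimal NumPy sketch of how merror and mlogloss can be computed from a matrix of class probabilities. The probabilities and labels are made up for illustration, and this is a simplified version of the definitions rather than XGBoost's own code.

```python
import numpy as np

# Made-up probabilities for 3 rows and 3 classes (each row sums to 1),
# shaped like the output of a 'multi:softprob' model.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.5, 0.3, 0.2]])
labels = np.array([0, 2, 1])  # true class index for each row

# merror: fraction of rows whose most likely class is not the true class.
merror = np.mean(probs.argmax(axis=1) != labels)

# mlogloss: mean negative log of the probability given to the true class.
true_class_probs = probs[np.arange(len(labels)), labels]
mlogloss = -np.mean(np.log(true_class_probs))
```

Log loss rewards putting high probability on the correct class, so it can keep improving even when the single most likely class (and therefore merror) does not change.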

In [7]:
bst = xgb.train(param, dtrain, num_round, evallist, early_stopping_rounds=3)

Will train until train error hasn't decreased in 3 rounds.
Multiple eval metrics have been passed: 'mlogloss' will be used for early stopping.

[0]	eval-merror:0.744966	eval-mlogloss:3.105221	train-merror:0.742843	train-mlogloss:3.100287
[1]	eval-merror:0.738694	eval-mlogloss:2.925017	train-merror:0.735478	train-mlogloss:2.917377
[2]	eval-merror:0.736570	eval-mlogloss:2.810193	train-merror:0.733469	train-mlogloss:2.800245
[3]	eval-merror:0.735263	eval-mlogloss:2.727604	train-merror:0.731953	train-mlogloss:2.715515
[4]	eval-merror:0.734093	eval-mlogloss:2.666842	train-merror:0.730406	train-mlogloss:2.652890
[5]	eval-merror:0.733153	eval-mlogloss:2.619537	train-merror:0.729493	train-mlogloss:2.603561
[6]	eval-merror:0.732381	eval-mlogloss:2.582603	train-merror:0.728652	train-mlogloss:2.564752
[7]	eval-merror:0.731789	eval-mlogloss:2.553843	train-merror:0.728033	train-mlogloss:2.533941
[8]	eval-merror:0.730867	eval-mlogloss:2.530344	train-merror:0.726955	train-mlogloss:2.508746
[9]	eval-m

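And here is where the program trains. As you can see, both the error rate and the log loss shrink each round on the training data and on the held-out data, showing that the model is getting more accurate. Because the training set is the last entry in evallist, early stopping watches the training log loss rather than the held-out one, as the log message above shows. This will hopefully give me a better prediction.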
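The early_stopping_rounds=3 argument in the training call stops training once the monitored metric has gone 3 rounds without improving. The sketch below shows that rule in plain Python; the function name `rounds_until_stop` and the list of scores are made up for illustration, and it is a simplified picture of the idea rather than XGBoost's implementation.

```python
def rounds_until_stop(scores, patience):
    """Return (stop_round, best_round) for per-round scores where lower
    is better, stopping after `patience` rounds without a new best."""
    best_round, best_score = 0, float("inf")
    for i, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = i, score
        elif i - best_round >= patience:
            return i, best_round
    # Never ran out of patience: training used every round.
    return len(scores) - 1, best_round


# Made-up log-loss values that improve and then start to get worse.
held_out = [3.10, 2.92, 2.81, 2.80, 2.83, 2.85, 2.86, 2.87]
stop_round, best_round = rounds_until_stop(held_out, patience=3)
```

A rule like this is most informative when the scores come from data the model did not train on, since training loss usually keeps falling round after round.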

In [8]:
predictions = bst.predict(xgb.DMatrix(test.values), output_margin=False)  # .values replaces the removed as_matrix()

Now that the model is trained, I convert the data I actually want predictions for into a matrix and run it through the model I just created. It returns its predictions, based on all the decisions it learned to make.

In [9]:
predictions = pd.DataFrame(predictions)

I then put those predictions back into a DataFrame. I can easily use that to look over my data and see what it looks like. This is a good time to see if there are any trends or problems that may arise.

In [10]:
le.fit(crime_in_sf.Category)
predictions.columns = le.inverse_transform(predictions.columns)

I also relabel the columns with the names of the crimes instead of the numbers 0-38, so that I can see which crime each prediction refers to.

In [11]:
test_keep.head()

Id,Dates,DayOfWeek,PdDistrict,Address,Longitude,Latitude
0,392172,3,0,6407,-122.399588,37.735051
1,392171,3,0,9744,-122.391523,37.732432
2,392170,3,4,6336,-122.426002,37.792212
3,392169,3,2,10633,-122.437394,37.721412
4,392169,3,2,10633,-122.437394,37.721412


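This is what my data looked like when I fed it into my program. Because test_keep refers to the same DataFrame as test rather than a copy, it shows the integer-encoded values.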

In [12]:
predictions.head()

,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0.000949,0.044045,9.55018e-07,0.00011,0.003899,0.000366,0.00063,0.00066,0.00015,0.000219,...,7.601371e-07,0.120559,7.2e-05,0.003106,5.250954e-07,0.000603,0.090599,0.033588,0.003451,0.030885
1,0.000141,0.034285,1.139519e-06,1.8e-05,0.00016,0.000917,0.001422,0.002334,0.00025,4.6e-05,...,9.537289e-07,0.010991,1.9e-05,0.002438,2.473992e-06,0.000278,0.037942,0.019777,0.012543,0.038404
2,0.000422,0.006864,3.679608e-06,3.7e-05,0.016094,0.000242,4.5e-05,0.003035,0.001438,5.4e-05,...,1.18542e-05,0.001401,0.000169,0.001423,1.301531e-06,0.007248,0.212529,0.324179,0.007601,0.030309
3,3.7e-05,0.042451,3.707393e-05,0.000398,0.059745,0.001464,0.000672,0.00627,0.002349,6e-06,...,5.848924e-05,0.001301,6e-06,0.00566,3.670072e-07,0.001865,0.235017,0.117005,0.065951,0.073283
4,3.7e-05,0.042451,3.707393e-05,0.000398,0.059745,0.001464,0.000672,0.00627,0.002349,6e-06,...,5.848924e-05,0.001301,6e-06,0.00566,3.670072e-07,0.001865,0.235017,0.117005,0.065951,0.073283


And this is what my predictions look like: one row per incident in the test set, identified by its row number, with a column for each crime category giving the predicted probability of that type of crime.
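If a single best guess per row is wanted instead of the full set of probabilities, the most likely category can be read off a table like this one. The sketch below uses a made-up three-column frame, and the names `most_likely` and `confidence` are hypothetical; the notebook itself keeps the full probability table.

```python
import pandas as pd

# Made-up probabilities for two incidents and three categories.
probs = pd.DataFrame({
    "ASSAULT": [0.10, 0.60],
    "VANDALISM": [0.25, 0.30],
    "VEHICLE THEFT": [0.65, 0.10],
})

# Column name holding the largest probability in each row.
most_likely = probs.idxmax(axis=1)

# The probability assigned to that category.
confidence = probs.max(axis=1)
```

This only makes sense while every column is a probability, so it should be done before any identifier column is added to the frame.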

In [13]:
predictions['Id'] = predictions.index

def order(frame, var):
    """Return frame with the columns listed in var moved to the front."""
    varlist = [w for w in frame.columns if w not in var]
    frame = frame[var + varlist]
    return frame

predictions = order(predictions,['Id'])

I used a small function here to add an Id column and move it to the front of my data so that each row can be easily identified for the competition. Then I simply run the cell below to create a file which I can submit.

In [14]:
predictions.to_csv('/Users/MatthewBarnette/final_project_predictions//predictions_XGB_280.csv', index=False)
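As an aside, pandas can also write the row index under a chosen header directly, which is another way to get an Id column at the front of the file. The sketch below assumes the frame's index already holds the row Ids (as it does when they are simply 0, 1, 2, ...); the toy data and the in-memory buffer are stand-ins for illustration.

```python
import io
import pandas as pd

# Made-up probabilities; the default index 0, 1, ... plays the role of Id.
probs = pd.DataFrame({"ARSON": [0.2, 0.1], "ASSAULT": [0.8, 0.9]})

buffer = io.StringIO()  # stands in for a file path
# index_label names the index column in the written CSV.
probs.to_csv(buffer, index_label="Id")
csv_text = buffer.getvalue()
```

Either approach should produce a file whose first column is Id followed by one column per category.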