code for the kaggle san francisco crime challenge


This code is for the Kaggle San Francisco crime challenge. It contains a data loader with preprocessing and two main files. The first trains single classifiers and evaluates them using logloss (the metric also used in the competition); the second uses the randomized search of sklearn for hyperparameter estimation.
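As a rough illustration of the metric, here is a minimal pure-Python sketch of multiclass logloss. It assumes the commonly used definition (each row rescaled to sum to one, probabilities clipped away from 0 and 1, then the mean negative log-probability of the true class). The function name and the toy numbers are made up for this example and are not taken from the repository.

```python
import math


def multiclass_log_loss(y_true, probas, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class indices.
    probas: list of rows, one probability per class.
    """
    total = 0.0
    for label, row in zip(y_true, probas):
        # Rescale the row so it sums to one, then clip to avoid log(0).
        p = row[label] / sum(row)
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


# Toy data (assumption): three samples, three classes.
y_true = [0, 2, 1]
probas = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
]

loss = multiclass_log_loss(y_true, probas)

# A classifier that always predicts the uniform distribution scores log(3)
# on three classes, which makes a handy sanity baseline.
uniform = multiclass_log_loss(y_true, [[1 / 3] * 3] * 3)

print(f"logloss: {loss:.4f}, uniform baseline: {uniform:.4f}")
```

Lower is better; a model is only useful if it beats the uniform baseline for the number of classes at hand.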

To get a feel for the data, a separate script plots some statistics about the dataset.

The first try:

  • Random Forest Classifier (clf = RandomForestClassifier(max_depth=16, n_estimators=1024, n_jobs=48)) placed 580/2335 with a logloss of 2.41519 (number one entry: 1.95936)
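The two steps described above (a single classifier scored with logloss, then a randomized hyperparameter search) could look roughly like the sketch below. It is not the repository's code: the data is a synthetic stand-in generated with `make_classification`, the forest is scaled down so it runs in seconds, and the searched parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the preprocessed crime features (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1: a single classifier, evaluated with logloss on held-out data.
clf = RandomForestClassifier(max_depth=8, n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
baseline = log_loss(y_test, clf.predict_proba(X_test), labels=clf.classes_)

# Step 2: randomized search over a small, illustrative parameter space.
# scoring="neg_log_loss" makes the search maximise the negated logloss.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "max_depth": [4, 8, 16],
        "n_estimators": [25, 50, 100],
    },
    n_iter=4,
    scoring="neg_log_loss",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)
best = search.best_estimator_
tuned = log_loss(y_test, best.predict_proba(X_test), labels=best.classes_)

print("baseline logloss:", round(baseline, 4))
print("best params:", search.best_params_)
print("tuned logloss:", round(tuned, 4))
```

Note that `best_score_` is reported as a negative number because of the `neg_log_loss` convention; the held-out logloss values printed above are the ones comparable to a leaderboard score.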


I did not want to check in the raw data (too big), but I also hate searching for data in the future, so I zipped the Kaggle data. Just unzip data/, and you have everything you need.


Some plots using pandas and seaborn

Global stats

Number of Crimes per Hour of Day for each Category

Number of Crimes per Attribute for the Top 5 categories
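The counts behind a plot like "Number of Crimes per Hour of Day for each Category" can be computed with a pandas cross-tabulation. The sketch below uses a tiny hand-made frame; the column names `Dates` and `Category` and the category labels are assumptions for illustration, so adjust them to whatever the data loader actually produces.

```python
import pandas as pd

# Toy rows (assumption): one timestamp and one category label per incident.
df = pd.DataFrame({
    "Dates": pd.to_datetime([
        "2015-05-13 23:53:00",
        "2015-05-13 23:30:00",
        "2015-05-13 08:15:00",
        "2015-05-12 08:45:00",
        "2015-05-12 17:05:00",
        "2015-05-12 17:40:00",
    ]),
    "Category": [
        "VANDALISM", "THEFT", "THEFT", "THEFT", "ASSAULT", "ASSAULT",
    ],
})

# Rows: hour of day, columns: category, values: number of incidents.
hour = df["Dates"].dt.hour.rename("Hour")
per_hour = pd.crosstab(hour, df["Category"])

# Keep only the most frequent categories (top 2 here, top 5 in the plots).
top = df["Category"].value_counts().head(2).index
per_hour_top = per_hour[top]

print(per_hour_top)
```

The resulting table has one line per hour and one column per category, which is the shape most plotting calls (for example `DataFrame.plot`) expect for one curve per category.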

Map for visualization

Map creation: the script in utils (get_map_and_save.r) [0] uses the ggmap package of R: specify a lat/lon box and retrieve the map.

Two options:

  1) Save to an rds file with gray values, then use a Python script to reload the rds and plot it [1]: mapdata = np.loadtxt("outputmap.txt")
  2) Use the colored image map file created by ggmap (ggmapTemp.png); it can be loaded and shown in matplotlib with the extent set to lat_lon_box

Example using the second option: Map plot for a specific category with Kernel Density as Heatmap
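A heatmap of this kind can be sketched as follows. This is not the repository's plotting code: the kernel density is a plain NumPy Gaussian estimate that treats longitude and latitude as flat coordinates, the bounding box values are a hypothetical stand-in for the box passed to ggmap, and `kde_grid` and `plot_heatmap` are names invented for this example.

```python
import numpy as np

# Hypothetical box as (lon_min, lon_max, lat_min, lat_max). This is the order
# matplotlib expects for `extent`; the values must match the ggmap request.
lat_lon_box = (-122.52, -122.35, 37.70, 37.83)


def kde_grid(lons, lats, box, bandwidth=0.01, shape=(60, 60)):
    """Gaussian kernel density of the points on a regular lon/lat grid."""
    lon_min, lon_max, lat_min, lat_max = box
    gx = np.linspace(lon_min, lon_max, shape[1])  # grid longitudes (columns)
    gy = np.linspace(lat_min, lat_max, shape[0])  # grid latitudes (rows)
    # Broadcast to (rows, cols, n_points) squared distances.
    dx = gx[None, :, None] - np.asarray(lons)[None, None, :]
    dy = gy[:, None, None] - np.asarray(lats)[None, None, :]
    kernel = np.exp(-(dx**2 + dy**2) / (2 * bandwidth**2))
    density = kernel.mean(axis=2) / (2 * np.pi * bandwidth**2)
    return density, gx, gy


def plot_heatmap(map_png, density, box):
    """Draw the map image and overlay the density in the same coordinates."""
    import matplotlib.pyplot as plt  # imported lazily; only needed to plot

    fig, ax = plt.subplots()
    ax.imshow(plt.imread(map_png), extent=box)
    # Row 0 of the density is the southern edge, hence origin="lower".
    ax.imshow(density, extent=box, origin="lower", cmap="hot", alpha=0.5)
    return fig


# Toy incident coordinates (assumption) for one category.
lons = np.array([-122.42, -122.42, -122.41])
lats = np.array([37.77, 37.77, 37.78])

density, gx, gy = kde_grid(lons, lats, lat_lon_box)
row, col = np.unravel_index(density.argmax(), density.shape)
print("density peak near lon/lat:", gx[col], gy[row])
```

With the map file in place, `plot_heatmap("ggmapTemp.png", density, lat_lon_box)` would draw both layers. Passing the same box as `extent` to both `imshow` calls is what keeps the heatmap aligned with the map.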