This code is for the Kaggle San Francisco crime challenge (https://www.kaggle.com/c/sf-crime). It contains a data loader with preprocessing and two main files: the first trains single classifiers and evaluates them with log loss (the metric used in the competition); the second (main_search.py) uses scikit-learn's randomized search for hyperparameter tuning.
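The randomized-search step can be sketched roughly as follows. This is a minimal illustration, not the actual main_search.py: the dummy data and parameter ranges are placeholders, and the competition metric is passed as sklearn's `neg_log_loss` scorer.

```python
# Illustrative sketch: hyperparameter search with sklearn's RandomizedSearchCV,
# scored with log loss (negated, since sklearn scorers are maximized).
# Data and parameter ranges are placeholders, not the real competition setup.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data with a few classes, mimicking a multi-class crime-category target.
X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

# Sample integer hyperparameters from ranges instead of a fixed grid.
param_dist = {
    "max_depth": randint(4, 20),
    "n_estimators": randint(50, 200),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,                # number of random configurations to try
    scoring="neg_log_loss",  # competition metric, negated for maximization
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```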
To get a feel for the data, visualization.py plots some statistics about the dataset.
First attempt:
- A Random Forest classifier (clf = RandomForestClassifier(max_depth=16, n_estimators=1024, n_jobs=48)) placed 580/2335 with a log loss of 2.41519 (number one entry: 1.95936)
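The single-classifier evaluation above can be sketched like this. It is a scaled-down illustration with dummy data standing in for the Kaggle set (a smaller forest so it runs quickly); the real run used the parameters quoted in the bullet.

```python
# Illustrative sketch: fit a Random Forest on a train split and score the
# held-out split with log loss, the competition metric. Dummy data; the
# forest is smaller than the one quoted in the README.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(max_depth=16, n_estimators=64, n_jobs=-1,
                             random_state=0)
clf.fit(X_tr, y_tr)

# log_loss expects per-class probabilities, hence predict_proba.
score = log_loss(y_va, clf.predict_proba(X_va))
print(f"validation log loss: {score:.5f}")
```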
I did not want to check in the raw data (too big), but I also hate hunting for data later, so I zipped the Kaggle data. Just unzip data/kaggle_data.zip and you have everything you need.
Map for visualization
map creation: script in utils (get_map_and_save.r): use the ggmap package in R, specify a lat/lon bounding box, and retrieve the map.
Two options:
1) Save the map to an rds file with gray values; reload it from the Python side and plot, e.g. mapdata = np.loadtxt("outputmap.txt")
2) Use the colored map image created by ggmap (ggmapTemp.png); load it in matplotlib and set the extent to the lat/lon box.
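Option 1 can be sketched as below. This is a minimal, hedged example: the file name is replaced by random stand-in data, and the lat/lon box values are illustrative placeholders, not the actual box used in get_map_and_save.r.

```python
# Illustrative sketch: pin the reloaded gray-value map to a lat/lon box via
# matplotlib's `extent`, so crime coordinates can be scattered on top in
# lat/lon units. Box values and data are placeholders.
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np

# (lon_min, lon_max, lat_min, lat_max) -- illustrative values around SF
lat_lon_box = (-122.52, -122.36, 37.70, 37.82)

# Stand-in for: mapdata = np.loadtxt("outputmap.txt")
mapdata = np.random.rand(100, 100)

fig, ax = plt.subplots()
ax.imshow(mapdata, cmap="gray", extent=lat_lon_box, aspect="auto")
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
fig.savefig("map_overlay.png")
```

With the extent set, a scatter of incident coordinates (`ax.scatter(lons, lats)`) lines up with the map automatically.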