# San Francisco Crime Dataset

## Introduction

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.


## Load Data and Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
from scipy import stats
from scipy.stats import norm, pearsonr
from scipy.special import inv_boxcox
from sklearn import datasets, linear_model, svm, tree
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
# Imputer was removed in scikit-learn 0.22; on newer versions use sklearn.impute.SimpleImputer
from sklearn.preprocessing import Imputer, StandardScaler
from xgboost import XGBRegressor, plot_importance


In [2]:
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

In [3]:
# to make this notebook's output stable across runs
random_state = 42
np.random.seed(random_state)

In [4]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Data Analysis

In [5]:
print("Train dataset shape : ",train.shape)
print("Test dataset shape : ",test.shape)

Train dataset shape :  (878049, 9)
Test dataset shape :  (884262, 7)


In [6]:
print("Number of datapoints in the train data set : {} and the number of attributes including label : {}".format(*(train.shape)))
print("Number of datapoints in the test data set : {} and the number of attributes without label : {}".format(*(test.shape)))
print("Column in the input data : ")
train.info()
test.info()

Number of datapoints in the train data set : 878049 and the number of attributes including label : 9
Number of datapoints in the test data set : 884262 and the number of attributes without label : 7
Column in the input data : 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null object
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: float64(2), object(7)
memory usage: 60.3+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 884262 entries, 0 to 884261
Data columns (total 7 columns):
Id            884262 non-null int64
Dates         884262 non-null object
DayOfWeek     884262 non-null object
PdDistrict    884262 non-null object
Address       8842

In [7]:
# Count unique values per column to spot categorical data
train.nunique()

Dates         389257
Category          39
Descript         879
DayOfWeek          7
PdDistrict        10
Resolution        17
Address        23228
X              34243
Y              34243
dtype: int64

In [8]:
# Summary statistics for the numeric columns of the train set
train.describe()

|  | X | Y |
|---|---|---|
| count | 878049.0 | 878049.0 |
| mean | -122.422616 | 37.77102 |
| std | 0.030354 | 0.456893 |
| min | -122.513642 | 37.707879 |
| 25% | -122.432952 | 37.752427 |
| 50% | -122.41642 | 37.775421 |
| 75% | -122.406959 | 37.784369 |
| max | -120.5 | 90.0 |
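
The maximum values in the summary above (X = -120.5, Y = 90.0) sit far from the rest of the distribution, which suggests some rows carry placeholder coordinates. A minimal sketch of how such rows could be flagged follows. The `sample` frame is a hypothetical stand-in for `train`, and the bounding box is an assumed rough envelope around San Francisco, not a value taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook this would be the `train` frame.
sample = pd.DataFrame({
    "X": [-122.425892, -122.424363, -120.5],
    "Y": [37.774599, 37.800414, 90.0],
})

# Assumed rough bounding box for San Francisco (illustrative only).
LON_MIN, LON_MAX = -123.0, -122.0
LAT_MIN, LAT_MAX = 37.0, 38.0

# Series.between is inclusive on both ends by default.
in_bounds = (
    sample["X"].between(LON_MIN, LON_MAX)
    & sample["Y"].between(LAT_MIN, LAT_MAX)
)

# Rows whose coordinates fall outside the assumed box.
outliers = sample[~in_bounds]
print("Rows outside the bounding box :", len(outliers))
```

Whether such rows should be dropped or have their coordinates imputed is a separate modelling decision; the sketch only counts them.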


In [9]:
# Look at a few sample rows of the train set
train.head()

|  | Dates | Category | Descript | DayOfWeek | PdDistrict | Resolution | Address | X | Y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-13 23:53:00 | WARRANTS | WARRANT ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 1 | 2015-05-13 23:53:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 2 | 2015-05-13 23:33:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | VANNESS AV / GREENWICH ST | -122.424363 | 37.800414 |
| 3 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | NORTHERN | NONE | 1500 Block of LOMBARD ST | -122.426995 | 37.800873 |
| 4 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | PARK | NONE | 100 Block of BRODERICK ST | -122.438738 | 37.771541 |


In [10]:
# Look at a few sample rows of the test set
test.head()

|  | Id | Dates | DayOfWeek | PdDistrict | Address | X | Y |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015-05-10 23:59:00 | Sunday | BAYVIEW | 2000 Block of THOMAS AV | -122.399588 | 37.735051 |
| 1 | 1 | 2015-05-10 23:51:00 | Sunday | BAYVIEW | 3RD ST / REVERE AV | -122.391523 | 37.732432 |
| 2 | 2 | 2015-05-10 23:50:00 | Sunday | NORTHERN | 2000 Block of GOUGH ST | -122.426002 | 37.792212 |
| 3 | 3 | 2015-05-10 23:45:00 | Sunday | INGLESIDE | 4700 Block of MISSION ST | -122.437394 | 37.721412 |
| 4 | 4 | 2015-05-10 23:45:00 | Sunday | INGLESIDE | 4700 Block of MISSION ST | -122.437394 | 37.721412 |


In [11]:
# Columns with null/NaN values, as a count and as a percentage of rows
train_nan_freq = train.isnull().sum().to_frame()
train_nan_freq['nan_frequency'] = (train_nan_freq[0]/train.shape[0])*100
train_nan_freq.sort_values(by=['nan_frequency'], ascending=False)

|  | 0 | nan_frequency |
|---|---|---|
| Dates | 0 | 0.0 |
| Category | 0 | 0.0 |
| Descript | 0 | 0.0 |
| DayOfWeek | 0 | 0.0 |
| PdDistrict | 0 | 0.0 |
| Resolution | 0 | 0.0 |
| Address | 0 | 0.0 |
| X | 0 | 0.0 |
| Y | 0 | 0.0 |


__Observations__ :
-  The training set has 878049 datapoints.
-  The test set has 884262 datapoints.
-  Independent attributes in the train dataset : 6
    - Dates
    - DayOfWeek
    - PdDistrict
    - Address
    - X
    - Y
-  Dependent variable : 1, a categorical variable
    - Category
-  Features to ignore/drop, since they do not appear in the test set :
    - Descript
    - Resolution
-  The target/dependent variable is a string (object dtype) with 39 distinct values.
-  The independent variables are a mix of float and object datatypes.
-  There are no null/NaN values.
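
The feature, target, and drop lists in the observations can be derived by comparing the columns of the two frames rather than typing them by hand. The sketch below is illustrative only: `train_demo` and `test_demo` are hypothetical empty frames carrying the column names shown in the outputs above, and the variable names `feature_cols` and `drop_cols` are assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical empty frames with the column layout shown earlier;
# in the notebook these would be `train` and `test`.
train_demo = pd.DataFrame(columns=[
    "Dates", "Category", "Descript", "DayOfWeek", "PdDistrict",
    "Resolution", "Address", "X", "Y",
])
test_demo = pd.DataFrame(columns=[
    "Id", "Dates", "DayOfWeek", "PdDistrict", "Address", "X", "Y",
])

target = "Category"

# Features usable at prediction time: columns present in both frames.
feature_cols = [c for c in train_demo.columns if c in test_demo.columns]

# Train-only columns other than the target cannot be used on the test set.
drop_cols = [
    c for c in train_demo.columns
    if c not in test_demo.columns and c != target
]

print("Features :", feature_cols)
print("To drop  :", drop_cols)
```

Deriving the lists this way keeps them in sync if the input files change, at the cost of hiding the column names from the reader.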

## Data preparation