# Naive Bayes from scratch for the "San Francisco Crime Classification" Kaggle competition

This challenge was the second public competition I tried on Kaggle. I was required to participate by uploading a Naive Bayes classifier written from scratch within a week of the assignment date.

## Short description of the competition

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

![Screenshot](images/sfcrime_banner.png)

## Implementation

Let's recall the formula of Naive Bayes:

![Screenshot](images/bayes.png)
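
For reference, the rule can also be written out as below. This is the standard textbook form rather than a transcription of the image; $C$ stands for a crime category and $x_1, \dots, x_n$ for the feature values of one report (here year, month, hour, police district and day of week).

$$
P(C \mid x_1, \dots, x_n) = \frac{P(C)\,\prod_{i=1}^{n} P(x_i \mid C)}{\sum_{C'} P(C')\,\prod_{i=1}^{n} P(x_i \mid C')}
$$

The denominator is the same for every category, so it only rescales the scores to sum to 1, which is the role `rescal_factor` plays in the prediction step.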

In [1]:
import csv
from collections import Counter
from datetime import datetime

In [2]:
# Stream the training set row by row; each row is a dict keyed by column name.
Traindata = csv.DictReader(open("train.csv"))

categories_raw = []  # one crime category per training row, used for the priors
C_I = {}             # category -> flat list of every feature value seen with it
for item in Traindata:
    category = item["Category"]
    categories_raw.append(category)
    # Explicit time format; '%X' depends on the current locale.
    timestamp = datetime.strptime(item["Dates"], '%Y-%m-%d %H:%M:%S')
    # All five features go into one shared bag per category.
    C_I.setdefault(category, []).extend([
        timestamp.strftime("%Y"),   # year
        timestamp.strftime("%B"),   # month name
        timestamp.strftime("%H"),   # hour of day
        item["PdDistrict"],
        item["DayOfWeek"],
    ])

In [3]:
# Prior P(C): relative frequency of each category in the training set.
P_C = Counter(categories_raw)
items = sum(P_C.values())

for key in list(P_C):
    P_C[key] = P_C[key] / float(items)

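# Peek at the first 10 training rows. Traindata was consumed by the loop
# above, so the file has to be read again.
for i, item in enumerate(csv.DictReader(open("train.csv"))):
    if i == 10:
        break
    print(item)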

In [4]:
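P_I_C = {}  # category -> Counter of feature-value counts, later P(item | category)

# Count feature values per category and collect the full vocabulary.
items = set()
for crime in P_C.keys():
    P_I_C[crime] = Counter(C_I[crime])
    items.update(P_I_C[crime].keys())

# Give values never seen with a category a count of 1, so that no
# product of likelihoods collapses to zero.
for crime in P_C.keys():
    for item in items:
        if item not in P_I_C[crime]:
            P_I_C[crime][item] = 1

# Turn counts into relative frequencies within each category.
for crime in P_C.keys():
    number_of_items = sum(P_I_C[crime].values())
    for key in list(P_I_C[crime]):
        P_I_C[crime][key] = P_I_C[crime][key] / float(number_of_items)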
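
The loop above gives a count of 1 only to values that were never seen with a category. A common alternative is add-one (Laplace) smoothing, where every value gets +1 and each feature is normalised on its own. The sketch below is not the code used for the submission: `laplace_likelihoods`, its `alpha` parameter and the toy rows are hypothetical and made up for illustration.

```python
from collections import Counter, defaultdict

def laplace_likelihoods(rows, labels, alpha=1.0):
    """Estimate P(value | category) per feature with add-alpha smoothing.

    rows   -- list of dicts mapping feature name -> value (hypothetical layout)
    labels -- list of category names, one per row
    Returns {category: {feature: {value: probability}}}.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    vocab = defaultdict(set)
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            counts[label][feature][value] += 1
            vocab[feature].add(value)

    likelihoods = {}
    for label in counts:
        likelihoods[label] = {}
        for feature, values in vocab.items():
            # Every value of the feature receives alpha pseudo-counts.
            total = sum(counts[label][feature].values()) + alpha * len(values)
            likelihoods[label][feature] = {
                v: (counts[label][feature][v] + alpha) / total for v in values
            }
    return likelihoods

# Toy data, invented for this example only.
rows = [
    {"PdDistrict": "MISSION", "DayOfWeek": "Monday"},
    {"PdDistrict": "MISSION", "DayOfWeek": "Friday"},
    {"PdDistrict": "RICHMOND", "DayOfWeek": "Friday"},
]
labels = ["ASSAULT", "ASSAULT", "VANDALISM"]
table = laplace_likelihoods(rows, labels)
print(table["VANDALISM"]["PdDistrict"])
```

Keeping one table per feature means the probabilities of each feature sum to 1 within a category, instead of sharing a single bag across all five features.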

In [5]:
Testdata = csv.DictReader(open("test.csv"))

submit = []  # one dict per test row: Id plus a probability for every category

for item in Testdata:
    timestamp = datetime.strptime(item["Dates"], '%Y-%m-%d %H:%M:%S')
    # Same five features as in training.
    test_items = [
        timestamp.strftime("%Y"),
        timestamp.strftime("%B"),
        timestamp.strftime("%H"),
        item["PdDistrict"],
        item["DayOfWeek"],
    ]
    row = {"Id": item["Id"]}
    # Unnormalised posterior: prior times the likelihood of each feature.
    for crime in P_C.keys():
        prob = P_C[crime]
        for test_item in test_items:
            prob *= P_I_C[crime][test_item]
        row[crime] = prob
    # Rescale so the category probabilities of each row sum to 1.
    rescal_factor = sum(row[crime] for crime in P_C.keys())
    for crime in P_C.keys():
        row[crime] = round(row[crime] / rescal_factor, 6)
    submit.append(row)
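
With only five features the plain product above is safe, but products of many small probabilities can underflow to 0.0. A usual remedy is to add logarithms and normalise with the log-sum-exp trick. The sketch below is an alternative to the loop above, not what produced the submission; `normalise_log_scores` is a hypothetical helper and the numbers are made up.

```python
import math

def normalise_log_scores(log_scores):
    """Turn {category: log of unnormalised posterior} into probabilities summing to 1."""
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow
    # and the largest term is exactly 1.
    peak = max(log_scores.values())
    shifted = {c: math.exp(s - peak) for c, s in log_scores.items()}
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}

# Made-up scores: log P(C) + sum of log P(x_i | C) for two categories.
log_scores = {
    "ASSAULT": math.log(0.2) + math.log(0.5) + math.log(0.1),
    "VANDALISM": math.log(0.1) + math.log(0.3) + math.log(0.1),
}
posterior = normalise_log_scores(log_scores)
print(posterior)
```

The same helper still returns usable probabilities when the raw scores are far too small to exponentiate directly, for example log scores around -2000.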

In [6]:
keys = submit[0].keys()
# 'wb' is the file mode the Python 2 csv module expects.
with open('out.csv', 'wb') as f:
    dict_writer = csv.DictWriter(f, keys)
    dict_writer.writeheader()
    dict_writer.writerows(submit)

import pandas as pd
output=pd.read_csv('out.csv')
print(output)

        KIDNAPPING  WEAPON LAWS  SECONDARY CODES  WARRANTS  PROSTITUTION  \
0         0.004602     0.025322         0.026729  0.036752      0.000201   
1         0.004602     0.025322         0.026729  0.036752      0.000201   
2         0.002726     0.009736         0.012895  0.031388      0.004226   
3         0.005588     0.018972         0.022832  0.023475      0.000082   
4         0.005588     0.018972         0.022832  0.023475      0.000082   
5         0.003680     0.011423         0.020717  0.018019      0.000309   
6         0.005588     0.018972         0.022832  0.023475      0.000082   
7         0.005588     0.018972         0.022832  0.023475      0.000082   
8         0.003486     0.015528         0.017588  0.042749      0.007931   
9         0.002791     0.007215         0.010910  0.022755      0.002156   
10        0.005588     0.018972         0.022832  0.023475      0.000082   
11        0.003486     0.015528         0.017588  0.042749      0.007931   
12        0.

## Final considerations

I couldn't push the Naive Bayes method beyond the score I obtained (for example, adding a month column didn't improve it). Other ideas worth trying, besides using the street names (without the house numbers) and the years, are:

- Distinguishing blocks from street corners
- Tracking changes in the areas controlled by each Police Department (some PDs are more efficient)
- Tracking the years in which relevant laws came into force (e.g. if cannabis is legalized in California, the probability of Drug/Narcotic decreases)
- Accounting for improvements in city planning and public services in some areas (e.g. quality of life improves, people are less likely to become criminals, and the probability of violent crime decreases)
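
The street-name and block-versus-corner ideas above could start from the `Address` column. The sketch below assumes, without having checked the whole dataset, that addresses look either like `1500 Block of LOMBARD ST` or like `OAK ST / LAGUNA ST`; `address_features` is a hypothetical helper and not part of the submitted code.

```python
import re

def address_features(address):
    """Return (kind, streets), where kind is 'corner' or 'block'.

    Assumes corners are written 'A ST / B ST' and blocks 'N Block of A ST'.
    """
    if "/" in address:
        # Sort so that 'A / B' and 'B / A' map to the same corner.
        streets = sorted(part.strip() for part in address.split("/"))
        return "corner", streets
    # Drop the leading house number and the words 'Block of'.
    street = re.sub(r"^\d+\s+Block\s+of\s+", "", address.strip(), flags=re.IGNORECASE)
    return "block", [street]

print(address_features("OAK ST / LAGUNA ST"))
print(address_features("1500 Block of LOMBARD ST"))
```

The `kind` and each returned street could then be appended to the feature list like any other categorical value.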

![Screenshot](images/kaggle.png)