# Naive Bayes from scratch for "What's cooking?" Kaggle competition

This challenge was the first public competition I entered on Kaggle. I had been coding for only one month, and I was required to participate by submitting a Naive Bayes algorithm written from scratch within a week of the assignment date.

## Short description of the competition

Picture yourself strolling through your local, open-air market... What do you see? What do you smell? What will you make for dinner tonight?

If you're in Northern California, you'll be walking past the inevitable bushels of leafy greens, spiked with dark purple kale and the bright pinks and yellows of chard. Across the world in South Korea, mounds of bright red kimchi greet you, while the smell of the sea draws your attention to squids squirming nearby. India’s market is perhaps the most colorful, awash in the rich hues and aromas of dozens of spices: turmeric, star anise, poppy seeds, and garam masala as far as the eye can see.

Some of our strongest geographic and cultural associations are tied to a region's local foods. 

This playground competition asks you to predict the category of a dish's cuisine given a list of its ingredients.
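
To make the task concrete, here is a minimal sketch of what a single recipe record looks like. The keys `id`, `cuisine` and `ingredients` are the ones read by the code further down; the values and the helper `is_valid_recipe` are hypothetical and only serve as an illustration.

```python
# Hypothetical recipe record; only the keys are taken from the code in this write-up.
example_recipe = {
    "id": 1,
    "cuisine": "italian",  # the label to predict; assumed absent from test records
    "ingredients": ["tomato", "basil", "olive oil"],
}


def is_valid_recipe(record, labelled=True):
    """Return True if the record has the fields the classifier relies on."""
    required = {"id", "ingredients"}
    if labelled:
        required = required | {"cuisine"}
    if not required.issubset(record):
        return False
    # Ingredients must be a list of strings, one entry per ingredient.
    ingredients = record["ingredients"]
    return isinstance(ingredients, list) and all(
        isinstance(name, str) for name in ingredients
    )


print(is_valid_recipe(example_recipe))
```

A check like this can be run over the loaded data before training to catch malformed records early.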

## Implementation

Let's recall the formula of Naive Bayes:

![Screenshot](cooking.png)
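
For reference, the standard Naive Bayes decision rule can be written as follows, with $c$ denoting a cuisine and $x_1, \dots, x_n$ the ingredients of a recipe. This notation is an assumption and may differ from the one used in the image.

$$
P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The "naive" part is the assumption that the ingredients are conditionally independent given the cuisine, which is what allows the likelihood to factor into a product.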

In [2]:
import json
from collections import Counter
import pandas as pd

with open('train.json') as data_file: 
    train = json.load(data_file)
with open('test.json') as data_file: 
    test = json.load(data_file)

In [3]:
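# Prior P(cuisine): fraction of training recipes that belong to each cuisine.
Prob_C = {}
cuisines = [item["cuisine"] for item in train]
cuisine_counts = dict(Counter(cuisines))

for key in cuisine_counts:
    Prob_C[key] = float(cuisine_counts[key]) / float(len(cuisines))

# Pool the ingredients of every training recipe by cuisine.
dict_cuis = {cuis: [] for cuis in Prob_C.keys()}
for item in train:
    dict_cuis[item['cuisine']].extend(item['ingredients'])

# Per-cuisine ingredient counts. Note: these are raw counts, not normalized
# probabilities P(ingredient | cuisine), so products built from them are unnormalized scores.
Prob_I_C = {cuis: dict(Counter(dict_cuis[cuis])) for cuis in dict_cuis}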
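# Collect predictions in a list and build the DataFrame once at the end
# (DataFrame.append was removed in pandas 2.0).
rows = []

for item in test:
    best_prob = 0  # avoid shadowing the built-in max()
    best_cuis = ""
    for cuis in Prob_C:
        # Start from the prior, then multiply in one factor per ingredient.
        prob = Prob_C[cuis]
        for ingr in item["ingredients"]:
            if ingr in Prob_I_C[cuis]:
                prob *= Prob_I_C[cuis][ingr]  # count of this ingredient in the cuisine
            else:
                prob *= 10**-6  # fixed penalty for an ingredient never seen in this cuisine
        if prob > best_prob:
            best_prob = prob
            best_cuis = cuis
    rows.append({"id": int(item["id"]), "cuisine": best_cuis})

output = pd.DataFrame(rows, columns=["id", "cuisine"])
output.to_csv("output.csv", index=False)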
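
The loop above multiplies raw ingredient counts and applies a fixed 10**-6 factor for unseen ingredients. A common variant, sketched here as an alternative and not as the code that produced the submission, normalizes the counts with Laplace (add-one) smoothing and sums logarithms instead of multiplying, which avoids numerical underflow on long ingredient lists. The toy recipes, the function names `fit` and `predict`, and the `alpha` parameter are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data in the same shape as train.json / test.json.
toy_train = [
    {"id": 1, "cuisine": "italian", "ingredients": ["tomato", "basil", "olive oil"]},
    {"id": 2, "cuisine": "italian", "ingredients": ["tomato", "garlic", "pasta"]},
    {"id": 3, "cuisine": "mexican", "ingredients": ["tortilla", "beans", "tomato"]},
    {"id": 4, "cuisine": "mexican", "ingredients": ["tortilla", "chili", "lime"]},
]
toy_test = [
    {"id": 10, "ingredients": ["basil", "pasta"]},
    {"id": 11, "ingredients": ["tortilla", "lime", "saffron"]},  # "saffron" is unseen
]


def fit(recipes, alpha=1.0):
    """Estimate log P(cuisine) and smoothed log P(ingredient | cuisine)."""
    cuisine_counts = Counter(r["cuisine"] for r in recipes)
    ingr_counts = defaultdict(Counter)
    for r in recipes:
        ingr_counts[r["cuisine"]].update(r["ingredients"])
    vocab = {ingr for counts in ingr_counts.values() for ingr in counts}

    log_prior = {c: math.log(n / len(recipes)) for c, n in cuisine_counts.items()}
    log_lik, log_unseen = {}, {}
    for c, counts in ingr_counts.items():
        # One extra slot in the denominator is reserved for unseen ingredients.
        denom = sum(counts.values()) + alpha * (len(vocab) + 1)
        log_lik[c] = {i: math.log((counts[i] + alpha) / denom) for i in vocab}
        log_unseen[c] = math.log(alpha / denom)
    return log_prior, log_lik, log_unseen


def predict(ingredients, model):
    """Return the cuisine with the highest log-posterior score."""
    log_prior, log_lik, log_unseen = model
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_lik[c].get(i, log_unseen[c]) for i in ingredients
        )
    return max(scores, key=scores.get)


model = fit(toy_train)
predictions = {r["id"]: predict(r["ingredients"], model) for r in toy_test}
print(predictions)
```

Because the logarithm is monotonic, taking the arg max over summed log-probabilities selects the same cuisine as taking it over the product of probabilities, while keeping the scores in a numerically safe range.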

In [5]:
print(output)

         id      cuisine
0     18009  southern_us
1     28583  southern_us
2     41580      italian
3     29752  southern_us
4     35687      italian
5     38527  southern_us
6     19666      italian
7     41217      chinese
8     28753      mexican
9     22659      british
10    21749      mexican
11    44967      italian
12    42969       indian
13    44883      italian
14    20827      chinese
15    23196      italian
16    35387  southern_us
17    33780  southern_us
18    19001      mexican
19    16526  southern_us
20    42455      chinese
21    47453       indian
22    42478      italian
23    11885   vietnamese
24    16585      italian
25    29639  southern_us
26    26245         thai
27    38516      chinese
28    47520      italian
29    26212      mexican
...     ...          ...
9914  49157      chinese
9915  40847      italian
9916  14084      italian
9917   6802      italian
9918  22381  southern_us
9919  21016    brazilian
9920  29024      italian
9921   4478      chinese
