# Naive Bayes is a classification algorithm that works well with text #
A classic use is spam filtering. The more spammy words an email contains, the more likely it is to be spam.

### Types of Bayes ###

Naive Bayes works with text. Sometimes the text is long, sometimes short, and that affects which type fits best.

**Multinomial Bayes - (multiple numbers)**

We count words, and we care about how often each word occurs in the text: once, twice, or many times.

**Bernoulli Bayes - True/False Bayes**: we only care about whether a word shows up (True) or not (False). Better for shorter passages.
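To make the difference concrete, here is a minimal sketch using scikit-learn's `CountVectorizer`. The two toy "emails" are invented for illustration. The same text produces word counts (what a multinomial model looks at) or 0/1 presence flags (what a Bernoulli model looks at).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "emails", invented for this illustration
docs = ["free free free money", "free lunch"]

# Multinomial-style features: how many times each word occurs
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

# Bernoulli-style features: only whether each word occurs (1) or not (0)
binary_vec = CountVectorizer(binary=True)
presence = binary_vec.fit_transform(docs).toarray()

# Column order of both matrices (the vocabulary is sorted alphabetically)
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

print(vocab)
print(counts)
print(presence)
```

In the first email "free" appears three times: the count matrix records 3, the presence matrix records only 1.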

In [1]:
from sklearn import preprocessing
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
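# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split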


In [2]:
df=pd.read_csv("recipes.csv")
df.head()

Unnamed: 0,cuisine,id,ingredient_list
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,..."
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr..."
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr..."
3,indian,22213,"water, vegetable oil, wheat, salt"
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep..."


## Training our algorithm to recognize a particular type of food

We have a bunch of recipes sorted into categories. When someone sends a new recipe, which category does it belong to?
We are going to train a classifier to recognize, for example, Indian food, so that when someone sends a new recipe, we can tell whether it's Indian food.

## RULE 1 : For classification algorithms, you must have categories in your original dataset.

**For clustering**

1. You get a lot of documents.
2. You feed them to an algorithm and tell it to create 'x' number of categories.
3. The machine gives you back categories, whether they make sense or not.


**For classification**

1. You get a lot of documents.
2. You classify some of them into categories that you know.
3. You ask the algorithm which categories a new bunch of unlabeled documents end up in.

Terminology: classification category = class = label
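The contrast between the two workflows can be sketched in a few lines. This is only an illustration on invented toy data. `KMeans` is used here as one possible clustering algorithm (an assumption, the notes do not name one), and `toy_clf` is a hypothetical name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import BernoulliNB

# Toy data (made up): each row is a document, each column a 0/1 feature
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

# Clustering: no labels, we only say how many groups we want
km = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)   # group numbers are arbitrary

# Classification: we supply known labels for the training documents
y = np.array([1, 1, 0, 0])
toy_clf = BernoulliNB().fit(X, y)

# ...and then ask which label new, unlabeled documents should get
new_docs = np.array([[1, 0], [0, 1]])
predicted = toy_clf.predict(new_docs)

print(cluster_ids)
print(predicted)
```

The cluster ids only say which documents were grouped together; the classifier's output uses the labels we chose ourselves.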

## How does Naive Bayes work ?

**Bayes' Theorem**

**Naive:** every word (or ingredient, etc.) is treated as independent of every other word.

If you see a word that normally appears in spam email, chances are high that it's spam.
If you see a word that doesn't normally appear in spam email, chances are high that it's not spam.
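Bayes' theorem says P(spam | word) = P(word | spam) × P(spam) / P(word). The "naive" part means the per-word probabilities are simply multiplied together. Below is a small hand-rolled sketch of that arithmetic. All numbers and the helper `posterior_spam` are made up for illustration, and it only looks at words that are present (a real Bernoulli model also uses absent words).

```python
# Made-up numbers, for illustration only
p_spam = 0.4            # prior: 40% of emails are spam
p_ham = 1 - p_spam      # the rest are not spam

# Assumed P(word appears | class)
p_word_given_spam = {"prize": 0.30, "meeting": 0.02}
p_word_given_ham = {"prize": 0.01, "meeting": 0.20}


def posterior_spam(words):
    """P(spam | words), multiplying word probabilities as if independent."""
    score_spam = p_spam
    score_ham = p_ham
    for w in words:
        score_spam *= p_word_given_spam[w]
        score_ham *= p_word_given_ham[w]
    # Normalise so the two posteriors sum to 1
    return score_spam / (score_spam + score_ham)


print(posterior_spam(["prize"]))             # a spammy word pushes towards spam
print(posterior_spam(["meeting"]))           # a non-spammy word pushes away from spam
print(posterior_spam(["prize", "meeting"]))  # both pieces of evidence combined
```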

## Step 1 : Convert the text data into numerical data

We need two things:

* Our labels: aka the categories everything belongs in
* Features

We have two labels:

* Italian (1)
* Not Italian (0)

In [3]:
df.head()

Unnamed: 0,cuisine,id,ingredient_list
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,..."
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr..."
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr..."
3,indian,22213,"water, vegetable oil, wheat, salt"
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep..."


In [4]:
def make_label(cuisine):
    if cuisine =="italian":
        return 1
    else:
        return 0
    
    

In [5]:
df['label'] = df['cuisine'].apply(make_label)
df.head(8)

Unnamed: 0,cuisine,id,ingredient_list,label
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0
3,indian,22213,"water, vegetable oil, wheat, salt",0
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0
5,jamaican,6602,"plain flour, sugar, butter, eggs, fresh ginger...",0
6,spanish,42779,"olive oil, salt, medium shrimp, pepper, garlic...",0
7,italian,3735,"sugar, pistachio nuts, white almond bark, flou...",1
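As an aside, the same 0/1 label can be built without a helper function by comparing the column directly. This is a sketch on a tiny invented frame (`toy` is a stand-in for the recipes table, not the real data).

```python
import pandas as pd

# Tiny stand-in for the recipes table (rows invented for illustration)
toy = pd.DataFrame({"cuisine": ["greek", "italian", "indian", "italian"]})

# Comparing the column to a string gives True/False for every row;
# astype(int) turns that into 1/0
toy["label"] = (toy["cuisine"] == "italian").astype(int)

print(toy)
```

Both approaches give the same column; the comparison version avoids calling a Python function once per row.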


## Converting Features into Numbers

**Feature selection**

Which features matter? In this case, which ingredients do we want to look at?
Our features are going to be whether the recipe has spaghetti and whether it has curry powder.

In [6]:
df["has_spaghetti"]= df["ingredient_list"].str.contains("spaghetti")
df["has_curry_powder"]= df["ingredient_list"].str.contains("curry powder")
df.head()

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0,False,False
3,indian,22213,"water, vegetable oil, wheat, salt",0,False,False
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0,False,False
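Picking ingredients by hand does not scale very far. One possible alternative, sketched here on invented ingredient lists, is to let `CountVectorizer(binary=True)` build a True/False column for every ingredient. The helper `split_ingredients` is hypothetical and assumes ingredients are separated by commas, as in `ingredient_list`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented ingredient lists in the same comma-separated style as ingredient_list
recipes = [
    "spaghetti, olive oil, garlic",
    "curry powder, water, salt",
    "olive oil, salt",
]


def split_ingredients(text):
    # Treat each comma-separated ingredient as one token
    return [item.strip() for item in text.split(",")]


# binary=True records presence (1) or absence (0) instead of counts
vec = CountVectorizer(tokenizer=split_ingredients, token_pattern=None, binary=True)
X = vec.fit_transform(recipes).toarray()

# Column names, in the same order as the columns of X
columns = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(columns)
print(X)
```

Note one difference from `str.contains`: this matches whole ingredients, while `str.contains("spaghetti")` also matches ingredients that merely include the word.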


## Let's feed our labels and features and see how it learns !!!

### looking at our labels

We stored them in the `label` column: `0` means not Italian, `1` means Italian.

### looking at our features

We have two features: `has_spaghetti` and `has_curry_powder`.

In [7]:
df[['has_spaghetti','has_curry_powder']].head()

Unnamed: 0,has_spaghetti,has_curry_powder
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False


In [8]:
# splitting into training and test sets
# X are features, y are labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
df[['has_spaghetti', 'has_curry_powder']], #the first is our FEATURES
df['label'], # the second parameter is the label (this is 0/1, not italian/italian)
test_size=0.2) #80% training, 20% testing
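Because the split is random, the scores change a little on every run. A sketch of a reproducible split on invented toy data follows. `random_state` fixes the shuffle, and `stratify` (optional) keeps the share of Italian recipes similar in both parts. The frame `toy` and the `_tr`/`_te` names are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df (values invented): 20 recipes, 4 of them Italian
toy = pd.DataFrame({
    "has_spaghetti": [True, False, False, False, False] * 4,
    "label": [1, 0, 0, 0, 0] * 4,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["has_spaghetti"]],   # features
    toy["label"],             # labels
    test_size=0.2,            # 80% training, 20% testing
    random_state=42,          # same split every run
    stratify=toy["label"],    # keep the Italian share similar in both parts
)

print(len(X_tr), len(X_te))
print(y_tr.sum(), y_te.sum())  # number of Italian recipes in each part
```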

In [10]:
# import naive_bayes to get access to ALL kinds of naive Bayes classifiers
# but remember we're using Bernoulli because our features are true/false,
# which is fine for small passages

from sklearn import naive_bayes

# create a Bernoulli Naive Bayes classifier

clf = naive_bayes.BernoulliNB()


# feed the classifier two things:
#   our training features (X_train)
#   our training labels (y_train)
# to help it study for the exam later when we test it
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [11]:
# all the zeroes are recipes that are not Italian 
clf.predict(X_test)

array([0, 1, 0, ..., 0, 0, 0])
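`predict` only returns the final 0/1 decision. To see how confident the classifier is, `predict_proba` returns the probability of each class. A minimal sketch on invented toy data (`toy_clf`, `X_toy` and `y_toy` are placeholder names):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy training data (invented): columns are has_spaghetti, has_curry_powder
X_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 0], [0, 0], [0, 1]])
y_toy = np.array([1, 1, 0, 0, 1, 0])   # 1 = Italian, 0 = not Italian

toy_clf = BernoulliNB().fit(X_toy, y_toy)

new_recipe = np.array([[1, 0]])        # has spaghetti, no curry powder
decision = toy_clf.predict(new_recipe)
proba = toy_clf.predict_proba(new_recipe)

print(decision)  # hard 0/1 decision
print(proba)     # columns follow toy_clf.classes_: [P(not Italian), P(Italian)]
```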

In [12]:
# Naive Bayes is much less prone to overfitting than, say, a decision tree
# with only two true/false features it can't study too hard or memorize the questions
# a decision tree can
# so even if we give it the training data back it will get some wrong
clf.score(X_train, y_train)

0.80863634935101669

In [13]:
clf.score(X_test, y_test)

0.817850408548083

In [14]:
df.head()

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0,False,False
3,indian,22213,"water, vegetable oil, wheat, salt",0,False,False
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0,False,False


In [15]:
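# copy so that new columns don't also get added to df
# (note: the label column still marks Italian vs. not Italian)
greek_df = df.copy()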
greek_df.head()

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0,False,False
3,indian,22213,"water, vegetable oil, wheat, salt",0,False,False
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0,False,False


In [16]:
greek_df["has_olives"]= greek_df["ingredient_list"].str.contains("black olives")
greek_df["has_pistachio_nuts"]= greek_df["ingredient_list"].str.contains("pistachio nuts")
greek_df["has_curry_powder"]= greek_df["ingredient_list"].str.contains("curry powder")

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
greek_df[['has_olives', 'has_curry_powder', 'has_pistachio_nuts']], #the first is our FEATURES
greek_df['label'], # the second parameter is the label (this is 0/1, not italian/italian)
test_size=0.2) #80% training, 20% testing

In [18]:
greek_df

Unnamed: 0,cuisine,id,ingredient_list,label,has_spaghetti,has_curry_powder,has_olives,has_pistachio_nuts
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0,False,False,True,False
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0,False,False,False,False
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0,False,False,False,False
3,indian,22213,"water, vegetable oil, wheat, salt",0,False,False,False,False
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0,False,False,False,False
5,jamaican,6602,"plain flour, sugar, butter, eggs, fresh ginger...",0,False,False,False,False
6,spanish,42779,"olive oil, salt, medium shrimp, pepper, garlic...",0,False,False,False,False
7,italian,3735,"sugar, pistachio nuts, white almond bark, flou...",1,False,False,False,True
8,mexican,16903,"olive oil, purple onion, fresh pineapple, pork...",0,False,False,False,False
9,italian,12734,"chopped tomatoes, fresh basil, garlic, extra-v...",1,False,False,False,False


In [19]:
X_train, X_test, y_train, y_test = train_test_split(
greek_df[['has_olives', 'has_pistachio_nuts', 'has_curry_powder']], #the first is our FEATURES
greek_df['label'], # the second parameter is the label (this is 0/1, not italian/italian)
test_size=0.2) #80% training, 20% testing

In [20]:
# import naive_bayes to get access to ALL kinds of naive Bayes classifiers
# but remember we're using Bernoulli because our features are true/false,
# which is fine for small passages

from sklearn import naive_bayes

# create a Bernoulli Naive Bayes classifier

clf = naive_bayes.BernoulliNB()


# feed the classifier two things:
#   our training features (X_train)
#   our training labels (y_train)
# to help it study for the exam later when we test it
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [21]:
# all the zeroes are recipes that are not Italian 
clf.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])

In [22]:
# Naive Bayes is much less prone to overfitting than, say, a decision tree
# with only three true/false features it can't study too hard or memorize the questions
# a decision tree can
# so even if we give it the training data back it will get some wrong
clf.score(X_train, y_train)

0.8027593576165184

In [23]:
clf.score(X_test, y_test)

0.80364550597108741

In [24]:
# how often do recipes contain water and come from Brazil?

In [25]:
df["has_water"]= df["ingredient_list"].str.contains("water")

In [26]:
len(df[(df['has_water']) & (df['cuisine'] == 'brazilian')])  # number of recipes that have water and are Brazilian

109

In [27]:
len(df['has_water'])  # total number of recipes (len counts every row, True or False)

39774

In [28]:
109/39774  # probability that a recipe has water and is Brazilian

0.0027404837330919697

In [29]:
9385/39774  # chance that a recipe is not Brazilian and has water in it

0.2359581636244783
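The counts above give joint probabilities (out of all recipes). A conditional probability divides by a subset instead. The sketch below shows the difference on a tiny invented table (`toy` and its values are made up, not taken from recipes.csv).

```python
import pandas as pd

# Invented mini-dataset in the same shape as the recipes table
toy = pd.DataFrame({
    "cuisine": ["brazilian", "brazilian", "indian", "indian", "greek"],
    "has_water": [True, False, True, True, False],
})

n_total = len(toy)                                # every recipe
n_water = toy["has_water"].sum()                  # True counts as 1
is_brazilian = toy["cuisine"] == "brazilian"
n_water_and_brazilian = (toy["has_water"] & is_brazilian).sum()

# Joint: share of ALL recipes that have water and are Brazilian
p_joint = n_water_and_brazilian / n_total
# Conditional: among recipes WITH water, share that are Brazilian
p_brazilian_given_water = n_water_and_brazilian / n_water
# Conditional the other way round: among Brazilian recipes, share with water
p_water_given_brazilian = n_water_and_brazilian / is_brazilian.sum()

print(p_joint, p_brazilian_given_water, p_water_given_brazilian)
```

Note that `toy["has_water"].sum()` counts only the `True` rows, whereas `len(toy["has_water"])` counts every row.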

In [30]:
# label encoder: turns category names into integer codes

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [31]:
le.fit(['orange', 'red', 'red', 'yellow', 'blue'])

LabelEncoder()

In [32]:
le.transform(['orange', 'blue', 'yellow'])

array([1, 0, 3])

In [33]:
le.fit(df['cuisine'])

LabelEncoder()

In [34]:
le.transform(df['cuisine'])

array([ 6, 16,  4, ...,  8,  3, 13])
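The integer codes are not arbitrary: `LabelEncoder` sorts the distinct values and uses each value's position as its code. A short sketch with the same colours as above shows how to inspect the mapping and go back from codes to names (`enc` is a placeholder name).

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["orange", "red", "red", "yellow", "blue"])

# classes_ holds the distinct values in sorted order; the index is the code
classes = list(enc.classes_)
# inverse_transform maps codes back to the original names
names = list(enc.inverse_transform([1, 0, 3]))

print(classes)
print(list(codes))
print(names)
```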

In [36]:
from sklearn import naive_bayes


clf = naive_bayes.BernoulliNB()

clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [37]:
clf.score(X_test, y_test)

0.80364550597108741

In [38]:
from sklearn.dummy import DummyClassifier

In [39]:
dummy_clf = DummyClassifier(strategy='most_frequent')

# fit with our training data
dummy_clf.fit(X_train, y_train)

DummyClassifier(constant=None, random_state=None, strategy='most_frequent')

In [40]:
# note: this scores clf again; dummy_clf.score(X_train, y_train) would give the baseline
clf.score(X_train, y_train)

0.8027593576165184

In [41]:
# note: this scores clf again; dummy_clf.score(X_test, y_test) would give the baseline
clf.score(X_test, y_test)

0.80364550597108741
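The point of a dummy classifier is to have a baseline: if most recipes are not Italian, always answering "not Italian" already scores well. Here is a sketch of that comparison on invented, imbalanced toy data (scored on the training data only to keep it short; `baseline` and `model` are placeholder names).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import BernoulliNB

# Invented, imbalanced toy data: 8 of 10 recipes are not Italian (label 0)
X_toy = np.array([[0], [0], [0], [0], [0], [0], [0], [0], [1], [1]])
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Baseline: ignores the features and always predicts the most frequent label
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
# Real model: actually uses the feature
model = BernoulliNB().fit(X_toy, y_toy)

baseline_acc = baseline.score(X_toy, y_toy)
model_acc = model.score(X_toy, y_toy)

print(baseline_acc, model_acc)
```

A classifier is only useful to the extent that it beats the baseline; an accuracy close to the dummy's suggests the chosen features carry little information.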