# Naive Bayes

### Work on probability

# Conditional Probability 

In [9]:
## Probability of being both an adult man and an alcoholic = 2.25%
## What is the probability of being an alcoholic given that you're a man?
probability_manAlcoholic = 0.0225
probability_beingMan = 0.5
probability_alcoholic_man = probability_manAlcoholic/probability_beingMan
print ("Probability of being an alcoholic if you're a man is: " ,probability_alcoholic_man * 100, "%")

Probability of being an alcoholic if you're a man is:  4.5 %


## Bayesian Trap

### True Positive:
    > Sick and diagnosed sick
### False Negative (Type II error):
    > Sick but diagnosed not sick
### False Positive (Type I error):
    > Not sick but diagnosed sick
### True Negative:
    > Not sick and diagnosed not sick
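
The four outcomes can be counted directly from paired lists of true and diagnosed states. This is a minimal sketch on made-up data; the lists and the helper `count_outcomes` are hypothetical and not part of the original notebook.

```python
# Hypothetical toy data: 1 = sick, 0 = not sick.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
diagnosed = [1, 0, 1, 0, 1, 0, 0, 1]

def count_outcomes(actual, diagnosed):
    """Count true/false positives/negatives (hypothetical helper)."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, d in zip(actual, diagnosed):
        if a == 1 and d == 1:
            counts["TP"] += 1   # sick and diagnosed sick
        elif a == 1 and d == 0:
            counts["FN"] += 1   # sick but diagnosed not sick (Type II error)
        elif a == 0 and d == 1:
            counts["FP"] += 1   # not sick but diagnosed sick (Type I error)
        else:
            counts["TN"] += 1   # not sick and diagnosed not sick
    return counts

outcomes = count_outcomes(actual, diagnosed)
print(outcomes)
```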


![title](error.jpg)


## Confusion Matrix


![title](confusion.PNG)

## Sensitivity and specificity

![title](sen.png)
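
As a rough illustration, sensitivity and specificity can be computed from the confusion-matrix counts. The counts below are invented for the example, and the function names are hypothetical.

```python
def sensitivity(tp, fn):
    """True positive rate: share of sick people who are diagnosed sick."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: share of healthy people who are diagnosed not sick."""
    return tn / (tn + fp)

# Hypothetical counts, chosen only for illustration.
tp, fn, fp, tn = 90, 10, 50, 850

print("Sensitivity:", sensitivity(tp, fn))
print("Specificity:", specificity(tn, fp))
```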

## Accuracy, Precision and Recall

## Precision in spam detection
    Out of all the emails that we flagged as spam and sent to the spam folder, how many were actually spam?
    High precision matters here because we do not want good emails to be sent to the spam folder.
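
A small sketch of that question in code, assuming made-up counts for a spam folder. The helper name `precision_from_counts` is hypothetical, and returning 0.0 when nothing was flagged is just a convention chosen for this sketch.

```python
def precision_from_counts(true_positives, false_positives):
    """Share of flagged emails that really were spam."""
    flagged = true_positives + false_positives
    if flagged == 0:
        return 0.0  # nothing was flagged; convention chosen for this sketch
    return true_positives / flagged

# Hypothetical: 100 emails were sent to the spam folder, 95 of them were really spam.
spam_precision = precision_from_counts(true_positives=95, false_positives=5)
print("Precision:", spam_precision)
```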

# Binary classification

 ![title](spam.jpg)

In the example above, suppose we want to find the probability that the next message is spam, or ham, given that it contains the word money.

Notice that P(Spam|money) and P(Ham|money) have a common denominator, P(money). Since it is the same for both classes, we can ignore it when comparing them.

The "Naive" in "Naive Bayes" is the assumption that the features (here, the words) are independent of each other given the class. Even though it is a naive assumption, the resulting classifier is cheap to compute and often works well in practice. Also, in the example above, besides ignoring dependencies between words, NB ignores the length of the message and the order of the words in it.
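
To see what ignoring word order means, here is a minimal bag-of-words sketch. The helper `bag_of_words` is hypothetical and only splits on whitespace; it is not what the notebook uses later.

```python
from collections import Counter

def bag_of_words(message):
    """Count how often each lowercase word occurs; the order is thrown away."""
    return Counter(message.lower().split())

a = bag_of_words("easy money now")
b = bag_of_words("now money easy")

# Both messages contain the same words, so their representations are identical.
print(a == b)
```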

In [5]:
prob_money_given_spam = 2/3

In [7]:
print("The probability that an email contains the word 'money' given that it is spam:", prob_money_given_spam)

The probability that an email contains the word 'money' given that it is spam: 0.6666666666666666


> Ignoring the denominator in Bayes' theorem, what is the probability of an email being spam given that it contains the word money? Similarly, what is the probability of an email being ham given that it contains the word money?

The question is asking for:
- p(spam|money) ∝ p(money|spam)*p(spam) (∝ means "proportional to", since the denominator is dropped)
- p(ham|money) ∝ p(money|ham)*p(ham)

In [8]:
prob_spam = 3/8 

In [13]:
prob_spam_money = prob_money_given_spam * prob_spam

In [15]:
print('The probability of an email being spam given that it contains the word money is proportional to: ', prob_spam_money)

The probability of an email being spam given that it contains the word money is proportional to:  0.25


In [16]:
prob_ham = 5/8

In [17]:
prob_money_given_ham = 1/5

In [18]:
prob_ham_money = prob_money_given_ham * prob_ham


In [19]:
print('The probability of an email being ham given that it contains the word money is proportional to: ', prob_ham_money)

The probability of an email being ham given that it contains the word money is proportional to:  0.125


In [21]:
prob_spam_money + prob_ham_money

0.375

> As probabilities, the two values should sum to 1. They do not, because we dropped the denominator.

### Normalization

How do you rescale a and b so that their sum is equal to 1?
> a / (a+b) + b / (a+b) = 1

Because the denominator was removed, the scores are not normalized. Normalization is therefore a later step: dividing each score by the sum of the scores gives the probabilities.


In [26]:
norm_prob_spam_money = prob_spam_money/(prob_spam_money+prob_ham_money)

In [27]:
norm_prob_ham_money = prob_ham_money/(prob_spam_money+prob_ham_money)

In [28]:
norm_prob_spam_money+norm_prob_ham_money

1.0

### Assuming independent events
> - p(spam | money, easy, cash)
> - ∝ p(money, easy, cash | spam) * p(spam)
> - = p(money | spam) * p(easy | spam) * p(cash | spam) * p(spam)
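
The factorization can be sketched for several words at once. Only p(money|spam) = 2/3, p(money|ham) = 1/5 and the priors 3/8 and 5/8 come from the example above; the likelihoods for "easy" and "cash" and the helper `naive_bayes_posteriors` are hypothetical, made up for illustration.

```python
from math import prod

# Per-word likelihoods. The values for "easy" and "cash" are invented.
likelihoods = {
    "spam": {"money": 2 / 3, "easy": 1 / 3, "cash": 1 / 3},
    "ham":  {"money": 1 / 5, "easy": 1 / 5, "cash": 1 / 10},
}
priors = {"spam": 3 / 8, "ham": 5 / 8}

def naive_bayes_posteriors(words, likelihoods, priors):
    """Multiply the per-word likelihoods with the prior, then normalize."""
    scores = {
        label: priors[label] * prod(likelihoods[label][w] for w in words)
        for label in priors
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

posteriors = naive_bayes_posteriors(["money", "easy", "cash"], likelihoods, priors)
print(posteriors)
```

With the single word "money" the same function reproduces the normalized values computed earlier.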

# NB Spam Detection Classifier

In [2]:
### Import cell
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Import DataFrame

In [6]:
df = pd.read_table('SMSSpamCollection', sep='\t', header=None, names=['label', 'sms_message'])

In [7]:
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [8]:
df.tail()

| | label | sms_message |
|---|---|---|
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
| 5571 | ham | Rofl. Its true to its name |


In [10]:
df.groupby('label').size()

label
ham     4825
spam     747
dtype: int64

In [12]:
df.shape

(5572, 2)

## Data preprocessing

> Scikit-learn works with numerical values. Its classifiers can accept string labels, but leaving our labels as strings can cause problems later when calculating metrics such as precision and recall, which by default treat the label 1 as the positive class.

Therefore we change our ham label to 0 and our spam label to 1.

In [13]:
# Map ham to 0 and spam to 1. This is our response variable
df['label'] = df.label.map({'ham':0, 'spam':1})

In [14]:
# Split our dataset into training and testing data (test_size defaults to 0.25)
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

In [15]:
print('Number of rows in the total set: ', df.shape[0])
print('Number of rows in the training set: ', X_train.shape[0])
print('Number of rows in the test set: ', X_test.shape[0])

Number of rows in the total set:  5572
Number of rows in the training set:  4179
Number of rows in the test set:  1393


In [17]:
# Instantiate the CountVectorizer class
count_vector = CountVectorizer()

In [18]:
# Learn the vocabulary from the training data and return the document-term matrix
training_data = count_vector.fit_transform(X_train)

In [19]:
# Transform the testing data and return the matrix.
# Note that we do not fit the CountVectorizer on the testing data, so the vocabulary comes from the training data only
testing_data = count_vector.transform(X_test)
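
To make the fit/transform distinction concrete, here is a pure-Python sketch of the idea. The helpers `fit_vocabulary` and `transform` are hypothetical and much simpler than `CountVectorizer`, which uses its own tokenizer and returns a sparse matrix.

```python
def fit_vocabulary(messages):
    """Assign a column index to every distinct lowercase word in the training messages."""
    words = sorted({w for m in messages for w in m.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(messages, vocabulary):
    """Turn each message into a row of word counts over the fitted vocabulary."""
    rows = []
    for m in messages:
        row = [0] * len(vocabulary)
        for w in m.lower().split():
            if w in vocabulary:  # words not seen during fitting are ignored
                row[vocabulary[w]] += 1
        rows.append(row)
    return rows

# Hypothetical toy messages.
train = ["free money now", "call me now"]
test = ["free cash now"]

vocab = fit_vocabulary(train)           # learned from the training data only
train_counts = transform(train, vocab)
test_counts = transform(test, vocab)    # "cash" is not in the vocabulary, so it is dropped

print(vocab)
print(train_counts)
print(test_counts)
```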

# Model Training and Predicting

In [20]:
# Instantiate our model
naive_bayes = MultinomialNB()

In [21]:
# Fit our model to the training data
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [22]:
# Predict on the test data
predictions = naive_bayes.predict(testing_data)

### Model Evaluation

In [23]:
# Score our model
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
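
As a sanity check, the four scores can be reproduced from confusion-matrix counts. The counts below are not notebook output; they are inferred from the reported scores and the test set size of 1393, so treat them as an assumption.

```python
# Counts inferred from the reported scores (an assumption, not notebook output).
tp, fp, fn, tn = 174, 5, 11, 1203

accuracy = (tp + tn) / (tp + fp + fn + tn)           # share of all messages classified correctly
precision = tp / (tp + fp)                           # share of predicted spam that is spam
recall = tp / (tp + fn)                              # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print('Accuracy score: ', accuracy)
print('Precision score: ', precision)
print('Recall score: ', recall)
print('F1 score: ', f1)
```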


It turns out that our Naive Bayes model does a pretty good job: about 98.9% accuracy, 97.2% precision and 94.1% recall on the test set.