### **Implementing Naive Bayes Algorithm to Detect Spam Messages**

In [None]:
import pandas as pd
import numpy as np

In [None]:
df=pd.read_table("https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv",header=None)
names=['label','messages']
df.columns=names
df.head()

Unnamed: 0,label,messages
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Split the message data into training and test sets using sklearn's train_test_split. The indices of all four pieces are then reset so that rows can be addressed by position.

X_train holds the training messages on which the algorithm is trained (the 'label' output column is excluded).

X_test holds the test messages on which the algorithm is tested (the 'label' output column is excluded).

y_train holds the 'label' (output) column for the training messages.

y_test holds the 'label' (output) column for the test messages; predictions are checked against it to calculate the accuracy of the algorithm.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['messages'], df['label'], test_size=0.30)
XTrain=X_train.reset_index(drop=True)
YTrain=y_train.reset_index(drop=True)
XTest=X_test.reset_index(drop=True)
YTest=y_test.reset_index(drop=True)

CountVectorizer builds the word-count matrix. The number of columns is set by the number of distinct words in the given data. Each row corresponds to one message, and each column value is the number of occurrences of that distinct word (the column name) in that message.

CountVectorizer is part of scikit-learn, not Jupyter Notebook, so it is imported from sklearn.feature_extraction.text.

The default token pattern, token_pattern=r"(?u)\b\w\w+\b", only matches tokens of two or more characters, so single-character words like 'a' are dropped. It is replaced with token_pattern=r"(?u)\b\w+\b", which also keeps single-character words.

The vectorized data comes back as a sparse matrix and is converted with toarray() to a dense array with one row per message.
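
The effect of the token pattern can be checked on a toy input. The sketch below fits one vectorizer with the default pattern and one with the single-character pattern, then compares the vocabularies. The two sample sentences and the variable names are made up for illustration and are not taken from the SMS data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample messages, not taken from the SMS dataset
samples = ["win a prize", "a prize a day"]

default_cv = CountVectorizer()  # default token_pattern drops one-character tokens
single_cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # keeps them

default_counts = default_cv.fit_transform(samples).toarray()
single_counts = single_cv.fit_transform(samples).toarray()

# vocabulary_ maps each word to its column index; sorting the keys gives column order
default_vocab = sorted(default_cv.vocabulary_)
single_vocab = sorted(single_cv.vocabulary_)

print(default_vocab)   # vocabulary without 'a'
print(single_vocab)    # vocabulary with 'a'
print(single_counts)   # one row per message, one column per word
```

With the single-character pattern, the second message contributes a count of 2 in the 'a' column, which the default pattern would have discarded.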

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv=CountVectorizer(stop_words=None, token_pattern=r"(?u)\b\w+\b")
x=cv.fit_transform(XTrain)
a2=x.toarray()

The array a2 from CountVectorizer is loaded into a new DataFrame for the training data. The data is now numerical: each value is the number of occurrences of a distinct word in the message for that row.

Column names are the distinct words, obtained from the vectorizer's feature-name method: get_feature_names_out() in scikit-learn 1.0 and later, get_feature_names() in older releases (removed in 1.2).

The output column is then appended.

In [None]:
dfTRAIN=pd.DataFrame(a2, columns=cv.get_feature_names_out())  # use get_feature_names() on scikit-learn older than 1.0
dfTRAIN['Output']=YTrain
dfTRAIN.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,02,0207,02072069400,02073162414,03,04,0430,05,050703,06,07,07008009200,07090201529,07090298926,07099833605,07123456789,0721072,07734396839,07742676969,07753741225,0776xxxxxxx,07781482378,077xxx,078,07808,07808247860,07808726822,07821230901,078498,0796xxxxxx,07xxxxxxxxx,...,yo,yoga,yogasana,you,youdoing,youi,youphone,your,youre,yourjob,yours,yourself,youwanna,yowifes,yr,yrs,ystrday,ything,yummmm,yummy,yun,yunny,yuo,yup,z,zaher,zealand,zebra,zed,zeros,zhong,zindgi,zoe,zyada,é,ú1,ü,〨ud,鈥,Output
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,spam
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham


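train_col contains the column names of the training data (the distinct words, plus the final 'Output' column, which the loop skips).

output_vals contains the distinct values of the output, i.e. spam and ham.

The probability P('word'|'output value') is built from:

Numerator=(Total no. of occurrences of 'word' in messages classified as the given 'output value') + 1

Denominator=(Total no. of occurrences of all words in messages classified as the given 'output value') + (no. of distinct words in messages classified as the given 'output value')

P('word'|'output value')=Numerator/Denominator

Adding 1 to the numerator keeps a word that never appears in a class from getting probability 0, which would zero out the whole product at prediction time.

denom holds the per-word occurrence totals for messages classified as the given 'output value'; its sum is the total no. of occurrences of all words in that class.

PriorProb has keys of the form 'word'+'output value' (for example 'youham' for the word 'you' and the class 'ham') and values P(word|output value). Despite its name, it stores these conditional probabilities, not the class priors.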
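
To make the formula concrete, here is a minimal sketch that applies it to a tiny hand-made count table. The table `toy`, the helper `smoothed_likelihoods`, and all the numbers are hypothetical and only illustrate the arithmetic; they are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical word-count table: one row per message, one column per word
toy = pd.DataFrame(
    {"free": [2, 0, 1], "hello": [0, 3, 0], "win": [1, 0, 0],
     "Output": ["spam", "ham", "spam"]}
)

def smoothed_likelihoods(counts, label):
    """Return P(word | label) for every word column, using the formula above."""
    per_word = counts[counts["Output"] == label].drop(columns=["Output"]).sum()
    total = per_word.sum()            # all word occurrences in this class
    distinct = (per_word > 0).sum()   # distinct words seen in this class
    return (per_word + 1) / (total + distinct)

spam_probs = smoothed_likelihoods(toy, "spam")
ham_probs = smoothed_likelihoods(toy, "ham")
print(spam_probs)
print(ham_probs)
```

For the spam rows, 'free' occurs 3 times out of 4 word occurrences and 2 distinct words are seen, so P(free|spam) = (3+1)/(4+2). Note that standard Laplace smoothing adds the full vocabulary size to the denominator (3 here) so that the probabilities for a class sum to 1; the variant described above adds only the number of distinct words seen in that class, so the values are not normalized.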

In [None]:
PriorProb={}  # maps 'word'+'output value' to the smoothed P(word|output value)

train_col=dfTRAIN.columns
output_vals=dfTRAIN['Output'].value_counts().index

for i in range(0,len(output_vals)):
    # Per-word totals for this class, computed once per class instead of once per word
    denom=dfTRAIN[dfTRAIN['Output']==output_vals[i]].drop(columns=['Output']).sum()
    total_words=denom.sum()             # all word occurrences in this class
    distinct_words=len(denom[denom>0])  # distinct words seen in this class

    for j in range(0,(len(train_col)-1)):  # the last column is 'Output', so skip it
        Num=denom[train_col[j]]
        Prob=(Num+1)/(total_words+distinct_words)

        print("P(Text=",train_col[j],"|","Output=",output_vals[i],")=",Prob)

        PriorProb[train_col[j]+output_vals[i]]=Prob

print(PriorProb)
        

P(Text= 0 | Output= ham )= 1.7988846914912754e-05
P(Text= 00 | Output= ham )= 1.7988846914912754e-05
P(Text= 000 | Output= ham )= 1.7988846914912754e-05
P(Text= 000pes | Output= ham )= 3.597769382982551e-05
P(Text= 008704050406 | Output= ham )= 1.7988846914912754e-05
P(Text= 0089 | Output= ham )= 1.7988846914912754e-05
P(Text= 0121 | Output= ham )= 1.7988846914912754e-05
P(Text= 01223585236 | Output= ham )= 1.7988846914912754e-05
P(Text= 01223585334 | Output= ham )= 1.7988846914912754e-05
P(Text= 02 | Output= ham )= 1.7988846914912754e-05
P(Text= 0207 | Output= ham )= 1.7988846914912754e-05
P(Text= 02072069400 | Output= ham )= 1.7988846914912754e-05
P(Text= 02073162414 | Output= ham )= 1.7988846914912754e-05
P(Text= 03 | Output= ham )= 1.7988846914912754e-05
P(Text= 04 | Output= ham )= 1.7988846914912754e-05
P(Text= 0430 | Output= ham )= 1.7988846914912754e-05
P(Text= 05 | Output= ham )= 1.7988846914912754e-05
P(Text= 050703 | Output= ham )= 1.7988846914912754e-05
P(Text= 06 | Output= 

In [None]:
# Build the test frame from the re-indexed test messages and their labels
Test=pd.DataFrame({'Text':XTest,'Class':YTest})

words holds the words of each message in the 'Test' data.

output_vals contains the distinct values of the output, i.e. spam and ham.

colnames holds the distinct words seen in training; a word in a test message that is not among them is skipped.

The probability of a message in the Test data having a certain 'Output Value' is calculated by multiplying the probabilities of each of its words for that 'Output Value', then multiplying by the fraction of training messages that have that 'Output Value'. The word probabilities are fetched from the PriorProb dictionary.

This is compared with the probability of the same message having the other 'Output Value', calculated in the same way.

The 'Output Value' with the highest probability is the classification of that message. This prediction is compared with the Output Values of the Test data to get the final accuracy.
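
Multiplying many small probabilities can underflow to 0 for long messages, at which point no class can win the comparison. A common workaround, sketched below, is to add logarithms instead of multiplying; the ranking of the classes is unchanged. This is an alternative to the product described above, not the notebook's own code, and `toy_probs`, `toy_priors`, and `classify` are hypothetical names with made-up values.

```python
import math

# Hypothetical likelihoods and priors, keyed the same way as PriorProb (word + class)
toy_probs = {"freespam": 0.4, "freeham": 0.1, "hellospam": 0.1, "helloham": 0.6}
toy_priors = {"spam": 0.3, "ham": 0.7}

def classify(words, probs, priors):
    """Pick the class with the highest log-score; unseen words are skipped."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        score = math.log(prior)  # start from the log of the class prior
        for word in words:
            key = word + label
            if key in probs:
                score += math.log(probs[key])  # sum of logs = log of the product
        if score > best_score:
            best_label, best_score = label, score
    return best_label

spam_like = classify(["free", "free", "unknown"], toy_probs, toy_priors)
ham_like = classify(["hello"], toy_probs, toy_priors)
long_message = classify(["free"] * 2000, toy_probs, toy_priors)
print(spam_like, ham_like, long_message)
```

The 2000-word message would drive a plain product to 0 for both classes, while the log-scores remain finite and comparable.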

In [None]:
count=0

tokenize=cv.build_analyzer()  # same lowercasing and token pattern used to build the training columns
colnames=set(dfTRAIN.columns)-{'Output'}  # training vocabulary, as a set for fast lookup

# Fraction of training messages in each class, computed once
class_prior=dfTRAIN['Output'].value_counts(normalize=True)

for k in range(0,Test.shape[0]):
    words=tokenize(Test['Text'][k])

    Greatest=0
    Greatest_Output=None  # stays None only if every class score underflows to 0

    for i in range(0,len(output_vals)):
        Mul=1
        for j in range(0,len(words)):
            if words[j] not in colnames:
                continue  # word never seen in training: skip it
            Mul=PriorProb[words[j]+output_vals[i]]*Mul
        Mul=Mul*class_prior[output_vals[i]]

        if(Mul>Greatest):
            Greatest=Mul
            Greatest_Output=output_vals[i]

    print(Test['Text'][k],"is classified as:",Greatest_Output)
    if(Greatest_Output==Test['Class'][k]):
        count=count+1

print("Accuracy:",(count/Test.shape[0]))