In [None]:
Naive Bayes algorithms are widely used in sentiment analysis, spam filtering, recommendation systems, etc.
They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent.
In most real-life cases the predictors are dependent, which hinders the performance of the classifier.
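
The independence requirement can be written out explicitly. As a sketch, for a class y and predictors x_1, ..., x_n (symbols introduced here for illustration), the "naive" assumption replaces the joint likelihood of all predictors with a product of per-predictor terms:

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

When the predictors are in fact correlated, the product counts the same evidence more than once, which is why the estimated probabilities can be off even when the predicted class is correct.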

In [None]:
Multinomial Naive Bayes:
This is mostly used for document classification problems, i.e. deciding whether a document belongs to a category such as sports,
politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.

Bernoulli Naive Bayes:
This is similar to Multinomial Naive Bayes, but the predictors are boolean variables. The features used to predict the class
variable take only the values yes or no, for example whether a word occurs in the text or not.

Gaussian Naive Bayes:
When the predictors take continuous rather than discrete values, we assume that these values are sampled from a Gaussian
distribution.
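
To make the difference between the three variants concrete, here is a minimal sketch of the kind of feature each one expects. The toy message, the four-word vocabulary, and the "message length" feature with its mean and standard deviation are all made up for illustration.

```python
import math
from collections import Counter

tokens = "free entry free prize".split()          # toy message (assumption)
vocabulary = ["entry", "free", "prize", "win"]    # hypothetical toy vocabulary

counts = Counter(tokens)
# Multinomial NB: how many times each vocabulary word occurs
multinomial_features = [counts[w] for w in vocabulary]
# Bernoulli NB: whether each vocabulary word occurs at all
bernoulli_features = [int(counts[w] > 0) for w in vocabulary]


def gaussian_pdf(x, mean, std):
    """Normal density, used as the likelihood of a continuous feature."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))


# Gaussian NB: a continuous feature, here a hypothetical message length in characters
length_likelihood = gaussian_pdf(x=120.0, mean=100.0, std=20.0)

print(multinomial_features)  # [1, 2, 1, 0]
print(bernoulli_features)    # [1, 1, 1, 0]
```

The same message yields different inputs: the repeated word "free" counts twice for the multinomial model but only once for the Bernoulli model.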

In [None]:
Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
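
A small worked example may help here. Take A = "the message is spam" and B = "the message contains the word 'free'". The three input probabilities below are invented for illustration and are not statistics from the dataset.

```python
# Illustrative, made-up probabilities (not taken from the SMS dataset)
p_spam = 0.2               # P(A): prior probability that a message is spam
p_free_given_spam = 0.5    # P(B|A): 'free' appears in a spam message
p_free_given_ham = 0.05    # P(B|not A): 'free' appears in a ham message

p_ham = 1 - p_spam
# P(B) from the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(round(p_spam_given_free, 3))  # 0.714
```

With these numbers, seeing the word "free" raises the probability of spam from the prior of 0.2 to about 0.71.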

#### Here we use the SMS Spam Collection dataset: based on the text of each message, we predict whether it is 'spam' or 'ham' using NLP.
#### Dataset downloaded from Kaggle: https://www.kaggle.com/uciml/sms-spam-collection-dataset

In [1]:
# 1. Importing the dataset

import pandas as pd
messages = pd.read_csv('SMSSpamCollection.csv', sep='\t', names=['label', 'message'])
# The label and the message are separated by a tab, so we pass sep='\t' as the delimiter

messages.head()

,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
# 2. Data cleaning and preprocessing (punctuation, digits and stopwords such as 'the' do not help decide whether a message is spam)

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer          # a lemmatizer could be used instead

nltk.download('stopwords', quiet=True)              # fetch the stopword list if it is not already installed
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))        # build the set once instead of once per word
corpus = []                                         # cleaned messages are collected here

for i in range(0, len(messages)):
    review = re.sub('[^a-zA-Z]', ' ', messages['message'][i])      # replace everything except letters with a space
    review = review.lower()                                        # lowercase all the words
    review = review.split()                                        # split on whitespace into a list of words

    review = [ps.stem(word) for word in review if word not in stop_words]   # drop stopwords and stem the remaining words
    review = ' '.join(review)                                               # join the words back into a single string
    corpus.append(review)                                                   # finally append the cleaned message to 'corpus'
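
To see what the cleaning steps do to a single message, here is a standard-library-only sketch. It is a simplification: the stop-word set is a tiny hand-picked stand-in for NLTK's English list, stemming is left out, and the `clean` helper is a name introduced here for illustration.

```python
import re

# Tiny hand-picked stop-word set, standing in for NLTK's English list (assumption)
STOP_WORDS = {"a", "in", "to", "the", "is"}


def clean(message):
    """Keep letters only, lowercase, split into words, drop stop words, rejoin."""
    letters_only = re.sub("[^a-zA-Z]", " ", message)   # digits and punctuation become spaces
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


print(clean("Free entry in 2 a wkly comp to win FA Cup!"))
# free entry wkly comp win fa cup
```

The digit "2" and the exclamation mark disappear, and the stop words "in", "a" and "to" are dropped. With the Porter stemmer applied on top, words would additionally be reduced to their stems.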

In [3]:
# 3. Create Bag of Words (document-term matrix)

from sklearn.feature_extraction.text import CountVectorizer       # a TF-IDF vectorizer could be used instead

cv = CountVectorizer(max_features=5000)  # keep only the 5000 most frequent words instead of the whole vocabulary; tune as needed
X = cv.fit_transform(corpus).toarray()   # X holds the independent features (word counts per message)

y = pd.get_dummies(messages['label'])    # one indicator column per label: 'ham' and 'spam'
y = y.iloc[:, 1].astype(int).values      # keep only the 'spam' column (0: ham, 1: spam); y is the dependent variable
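
As a rough illustration of what a bag-of-words matrix with a capped vocabulary looks like, here is a standard-library-only sketch. It is not how CountVectorizer is implemented; the `bag_of_words` helper and the toy documents are made up, and the alphabetical column order is a choice made here to keep the output deterministic.

```python
from collections import Counter

docs = ["free entri win", "ok lar joke", "win free free prize"]  # toy cleaned messages


def bag_of_words(docs, max_features):
    """Return (vocabulary, count matrix) limited to the most frequent words."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Keep the max_features most frequent words, sorted alphabetically as columns
    vocab = sorted(word for word, _ in totals.most_common(max_features))
    matrix = [[doc.split().count(word) for word in vocab] for doc in docs]
    return vocab, matrix


vocab, matrix = bag_of_words(docs, max_features=2)
print(vocab)   # ['free', 'win']
print(matrix)  # [[1, 1], [0, 0], [2, 1]]
```

Each row is one message and each column one vocabulary word. The second message contains none of the two retained words, so its row is all zeros, which is the information lost when the vocabulary is capped too aggressively.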

In [4]:
# 4. Train-test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

In [5]:
# 5. Training the model - Naive Bayes classifier (it works on probabilities and performs well on NLP tasks)

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB      # MultinomialNB works for any number of classes

spam_detect_model = MultinomialNB().fit(X_train, y_train)  # first train the model

y_pred = spam_detect_model.predict(X_test)                 # then predict on the test set
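
For intuition about what a multinomial Naive Bayes model learns, here is a from-scratch sketch on four toy messages. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the function names and the toy data are made up, and smoothing uses alpha=1.0 (Laplace smoothing).

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Return log priors and smoothed per-class log word likelihoods."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Adding alpha per vocabulary word avoids zero probabilities
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


docs = ["free prize win", "win free cash", "see you at lunch", "lunch at noon"]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_likelihood = train_multinomial_nb(docs, labels)

print(predict("free cash prize", log_prior, log_likelihood))  # spam
print(predict("lunch at noon", log_prior, log_likelihood))    # ham
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on long messages.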

In [6]:
# Compare y_pred with y_test
# using confusion_matrix (for two classes it gives a 2x2 matrix counting correct and incorrect predictions per class)

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)    # rows show the actual class, columns the predicted class
print('Confusion matrix : \n', cm)

# To check accuracy

accuracy = accuracy_score(y_test,y_pred)
print(f'\nAccuracy is : {accuracy:0.2%}')

Confusion matrix : 
 [[946   9]
 [  8 152]]

Accuracy is : 98.48%
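
Accuracy alone can hide how the errors are distributed, since ham messages far outnumber spam. As a sketch, precision and recall for the spam class can be read off the confusion matrix printed above, assuming rows are actual classes, columns are predicted classes, and class 1 is spam (as encoded in step 3).

```python
# Confusion matrix copied from the output above: rows = actual, columns = predicted
cm = [[946, 9],
      [8, 152]]
(tn, fp), (fn, tp) = cm   # class 0 = ham, class 1 = spam

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many really are spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:0.2%}")    # 98.48%
print(f"Precision: {precision:0.2%}")   # 94.41%
print(f"Recall   : {recall:0.2%}")      # 95.00%
print(f"F1 score : {f1:0.2%}")
```

Under that reading, 9 ham messages were wrongly flagged as spam and 8 spam messages slipped through.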
