# Naive Bayes

##### Bayes’s Theorem

According to Wikipedia, in probability theory and statistics, **Bayes’s theorem** (alternatively *Bayes’s law* or *Bayes’s rule*) describes the probability of an event based on prior knowledge of conditions that might be related to the event.
Mathematically, it can be written as:

$P(A|B) = \frac{P(A)*P(B|A)} {P(B)}$

where A and B are events and P(B) ≠ 0.
* P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.
* P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is true.
* P(A) and P(B) are the probabilities of observing A and B respectively; they are known as the marginal probabilities.


Let’s understand it with the help of an example:

**The problem statement:**

You are planning a picnic today, but the morning is cloudy.

- Oh no! 50% of all rainy days start off cloudy.
- But cloudy mornings are common (about 40% of days start cloudy).
- And this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%).

What is the chance of rain during the day?

We will use Rain to mean rain during the day, and Cloud to mean cloudy morning.

The chance of Rain given Cloud is written P(Rain|Cloud)

So let's put that in the formula:

$P(Rain|Cloud) = \frac{P(Rain)*P(Cloud|Rain)} {P(Cloud)}$

- P(Rain) is Probability of Rain = 10%
- P(Cloud|Rain) is Probability of Cloud, given that Rain happens = 50%
- P(Cloud) is Probability of Cloud = 40%

$P(Rain|Cloud) = \frac{0.1 \times 0.5} {0.4} = 0.125$

Or a 12.5% chance of rain. Not too bad, let's have a picnic!
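
The same arithmetic can be checked in a few lines of Python. This is only a minimal sketch: the function name `bayes_posterior` and its parameter names are chosen here for illustration and are not part of any library.

```python
def bayes_posterior(prior, likelihood, evidence):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    if evidence == 0:
        raise ValueError("P(B) must be non-zero")
    return prior * likelihood / evidence


# Numbers from the picnic example above
p_rain = 0.10              # P(Rain)
p_cloud_given_rain = 0.50  # P(Cloud|Rain)
p_cloud = 0.40             # P(Cloud)

p_rain_given_cloud = bayes_posterior(p_rain, p_cloud_given_rain, p_cloud)
print(f"P(Rain|Cloud) = {p_rain_given_cloud:.3f}")
```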

**Naïve:** It is called Naïve because it assumes that, given the class, the occurrence of one feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple, and each feature contributes individually to that identification without depending on the others.<br>
**Bayes:** It is called Bayes because it relies on Bayes' Theorem.
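
To make the "naive" assumption concrete, the sketch below scores the fruit example by multiplying a class prior with one independent likelihood per feature. All probabilities here are hypothetical values invented for illustration, and the "orange" class is added only so that there is something to compare against.

```python
import math

# Hypothetical numbers, chosen only to illustrate the arithmetic
priors = {"apple": 0.5, "orange": 0.5}
likelihoods = {
    "apple":  {"red": 0.8, "spherical": 0.9, "sweet": 0.7},
    "orange": {"red": 0.1, "spherical": 0.9, "sweet": 0.6},
}


def naive_bayes_posteriors(observed, priors, likelihoods):
    # Unnormalised score per class: P(class) * product of P(feature | class)
    scores = {
        label: priors[label] * math.prod(likelihoods[label][f] for f in observed)
        for label in priors
    }
    # Dividing by the total plays the role of P(features) in Bayes' theorem
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


posteriors = naive_bayes_posteriors(["red", "spherical", "sweet"], priors, likelihoods)
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

Each feature contributes its own factor to the product, which is exactly the independence assumption described above.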

# Problem statement

Build a spam filter with a naive Bayes classifier that predicts, from the content of a new mail, whether it should be categorized as spam or not-spam (ham).

### Data processing using the pandas library

In [1]:
# Import the required libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import string
import matplotlib.pyplot as plt

In [3]:
# Load the dataset

data = pd.read_csv("spam.tsv",sep='\t',names=['Class','Message'])
data.head(8) # View the first 8 records of our dataset

Unnamed: 0,Class,Message
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!!
5,ham,As per your request 'Melle Melle (Oru Minnamin...
6,spam,WINNER!! As a valued network customer you have...
7,spam,Had your mobile 11 months or more? U R entitle...


In [None]:
# to view the first record
data.loc[:0]

In [None]:
# Summary of the dataset
data.info()

In [4]:
# create a column to keep the count of the characters present in each record
data['Length'] = data['Message'].apply(len)

In [5]:
data['Length']

0       196
1       155
2        61
3        77
4        36
       ... 
5562    160
5563     36
5564     57
5565    125
5566     26
Name: Length, Length: 5567, dtype: int64

In [6]:
# view the dataset with the column 'Length' which contains the number of characters present in each mail
data.head(10)

Unnamed: 0,Class,Message,Length
0,ham,I've been searching for the right words to tha...,196
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
2,ham,"Nah I don't think he goes to usf, he lives aro...",61
3,ham,Even my brother is not like to speak with me. ...,77
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!!,36
5,ham,As per your request 'Melle Melle (Oru Minnamin...,160
6,spam,WINNER!! As a valued network customer you have...,157
7,spam,Had your mobile 11 months or more? U R entitle...,154
8,ham,I'm gonna be home soon and i don't want to tal...,109
9,spam,"SIX chances to win CASH! From 100 to 20,000 po...",136


In [7]:
# The mails are categorised into 2 classes, i.e. spam and ham.
# Let's see the count of each class
data.groupby('Class').count()

Class,Message,Length
ham,4821,4821
spam,746,746


### Data Visualization

In [None]:
data['Length'].describe() # to find the max length of the message. 

The summary shows a maximum length of 910 characters. Let's use a boolean mask to find this message:

In [None]:
data['Length']==910

In [None]:
# the message that has the max characters
data[data['Length']==910]['Message']

In [None]:
# view the message that has 910 characters in it
data[data['Length']==910]['Message'].iloc[0]

In [None]:
# View the message that has min characters
data[data['Length']==2]['Message'].iloc[0]

### Text Pre-Processing

In [None]:
# Extract the target values as a NumPy array using the .values attribute
dObject = data['Class'].values
dObject

In [8]:
# Let's encode ham as 1
data.loc[data['Class']=="ham","Class"] = 1

In [9]:
# Let's encode spam as 0
data.loc[data['Class']=="spam","Class"] = 0

In [10]:
dObject2=data['Class'].values
dObject2

array([1, 0, 1, ..., 1, 1, 1], dtype=object)

In [None]:
data.head(8)

### REMOVE PUNCTUATION
Whenever you work with text data, a common first step is to remove the punctuation. We can take advantage of Python's built-in string library to get a quick list of all the possible punctuation characters:

In [12]:
# the default list of punctuations
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:
# Why is it important to remove punctuation?

"This message is spam" == "This message is spam"

True
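
A short sketch of why this matters: two messages that differ only in punctuation are different strings until the punctuation is stripped. The two sample strings are made up for illustration, and the helper follows the same idea as the `remove_punct` function used in this notebook.

```python
import string


def remove_punct(text):
    # Keep only the characters that are not in string.punctuation
    return "".join(ch for ch in text if ch not in string.punctuation)


a = "This message is spam"
b = "This message is spam!!!"

print(a == b)                              # compared as raw strings
print(remove_punct(a) == remove_punct(b))  # compared after cleaning
```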

In [14]:
# Let's remove the punctuation

def remove_punct(text):
    # Keep every character that is not punctuation and join them back into one string
    text = "".join([char for char in text if char not in string.punctuation])
    return text

data['text_clean'] = data['Message'].apply(remove_punct) # Add a cleaned-text column to the data

data.head()

Unnamed: 0,Class,Message,Length,text_clean
0,1,I've been searching for the right words to tha...,196,Ive been searching for the right words to than...
1,0,Free entry in 2 a wkly comp to win FA Cup fina...,155,Free entry in 2 a wkly comp to win FA Cup fina...
2,1,"Nah I don't think he goes to usf, he lives aro...",61,Nah I dont think he goes to usf he lives aroun...
3,1,Even my brother is not like to speak with me. ...,77,Even my brother is not like to speak with me T...
4,1,I HAVE A DATE ON SUNDAY WITH WILL!!!,36,I HAVE A DATE ON SUNDAY WITH WILL


In [15]:
data['Message']

0       I've been searching for the right words to tha...
1       Free entry in 2 a wkly comp to win FA Cup fina...
2       Nah I don't think he goes to usf, he lives aro...
3       Even my brother is not like to speak with me. ...
4                    I HAVE A DATE ON SUNDAY WITH WILL!!!
                              ...                        
5562    This is the 2nd time we have tried 2 contact u...
5563                 Will ü b going to esplanade fr home?
5564    Pity, * was in mood for that. So...any other s...
5565    The guy did some bitching but I acted like i'd...
5566                           Rofl. Its true to its name
Name: Message, Length: 5567, dtype: object

### BAG OF WORDS
Convert text into feature vectors with the help of CountVectorizer.

#### Step:1
__Tokenization__ (the process of converting normal text strings into a list of tokens, i.e. the individual words we want to count).

Now we need to convert each of those messages into a numeric vector that scikit-learn's models can work with.
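
Before handing this job to scikit-learn, it may help to see the idea in plain Python. The sketch below is a hand-rolled stand-in for illustration only: the three toy messages are invented, and tokenization is a simple whitespace split, whereas `CountVectorizer` uses a regular expression and can also drop stop words.

```python
from collections import Counter

docs = ["free entry win cash", "call me later", "win cash now call now"]  # toy examples

# Tokenize: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct token, sorted so each one gets a stable column index
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One count vector per document: how often each vocabulary word occurs in it
vectors = []
for toks in tokenized:
    counts = Counter(toks)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vec in vectors:
    print(vec)
```

Most entries are zero because each message uses only a few words of the vocabulary, which is why the real vectorizer returns a sparse matrix.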

In [20]:
# Step:1 Object creation
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
cv = CountVectorizer(stop_words="english")


# Step:2 Define independent and dependent variables
xset = data['text_clean'].values
yset = data['Class'].values
yset

array([1, 0, 1, ..., 1, 1, 1], dtype=object)

In [24]:
xset

array(['Ive been searching for the right words to thank you for this breather I promise i wont take your help for granted and will fulfil my promise You have been wonderful and a blessing at all times',
       'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s',
       'Nah I dont think he goes to usf he lives around here though', ...,
       'Pity  was in mood for that Soany other suggestions',
       'The guy did some bitching but I acted like id be interested in buying something else next week and he gave it to us for free',
       'Rofl Its true to its name'], dtype=object)

In [23]:
# change data type
yset = yset.astype('int')
yset


array([1, 0, 1, ..., 1, 1, 1])

In [25]:
# Step:3 create training and testing data
from sklearn.model_selection import train_test_split
Xset_train,Xset_test,yset_train,yset_test = train_test_split(xset,yset,test_size=0.25,random_state=42)

In [28]:
Xset_train

array(['You are being contacted by our Dating Service by someone you know To find out who it is call from your mobile or landline 09064017305 PoBox75LDNS7 ',
       'Im in a meeting call me later at',
       'Uhhhhrmm isnt having tb test bad when youre sick', ...,
       'I realise you are a busy guy and im trying not to be a bother I have to get some exams outta the way and then try the cars Do have a gr8 day',
       'Dunno lei shd b driving lor cos i go sch 1 hr oni',
       'Dude ive been seeing a lotta corvettes lately'], dtype=object)

In [30]:
# Step:4 fit and transform Xset train
Xset_train_cv = cv.fit_transform(Xset_train)
Xset_train_cv # Sparse matrix: only the non-zero counts are stored

<4175x7847 sparse matrix of type '<class 'numpy.int64'>'
	with 32249 stored elements in Compressed Sparse Row format>

### Training a model

With messages represented as vectors, we can finally train our spam/ham classifier. Almost any classification algorithm could be used at this point; Naive Bayes is a good choice because it is fast to train and works well with word-count features.
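
To show roughly what a multinomial Naive Bayes model does with word counts, here is a minimal from-scratch sketch with add-one (Laplace) smoothing. It is not scikit-learn's implementation: the function names, the toy messages, and the `alpha` parameter name are assumptions made for illustration. The label encoding follows the notebook (0 = spam, 1 = ham).

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {tok for tokens in docs for tok in tokens}

    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Smoothed denominator: words seen in class c plus alpha per vocabulary word
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like, vocab


def predict(tokens, log_prior, log_like, vocab):
    # Sum of logs instead of a product of probabilities; unseen words are skipped
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


train_docs = [
    "win cash prize now".split(),
    "free prize call now".split(),
    "are we meeting for lunch".split(),
    "call me after lunch".split(),
]
train_labels = [0, 0, 1, 1]  # 0 = spam, 1 = ham

model = train_multinomial_nb(train_docs, train_labels)
print(predict("free cash prize".split(), *model))
print(predict("lunch meeting".split(), *model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.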

In [31]:
# Step:5 Initialising the model
from sklearn.naive_bayes import MultinomialNB
NB = MultinomialNB() # Object creation

# Step:6 fit data to the model
NB.fit(Xset_train_cv,yset_train)

MultinomialNB()

In [32]:
# Step:7 apply the fitted CountVectorizer to the test data
Xset_test_cv = cv.transform(Xset_test) # use only transform: the vocabulary was learned from the training data

# Step:8 Prediction on new test data
y_hat = NB.predict(Xset_test_cv)
y_hat

array([1, 1, 1, ..., 1, 1, 1])

### EVALUATION

In [34]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(yset_test,y_hat)*100 # arguments are (y_true, y_pred)
accuracy

98.5632183908046

### SPAM CLASSIFICATION APPLICATION

In [36]:
MSG = input("Enter the message:")  # get the input message
MSGINPUT = cv.transform([MSG]) # transform the message with the fitted CountVectorizer
PREDICTMSG = NB.predict(MSGINPUT) # predict the class of the input message
if PREDICTMSG[0] == 0:  # 0 = spam, 1 = ham
    print("------------------------------MESSAGE SENT [CHECK-SPAM-FOLDER]---------------------------")
else:
    print("------------------------------MESSAGE SENT [CHECK-INBOX]---------------------------------")

Enter the message:The workshop will start on 18th May 2022 at 8:30 PM IST  About the Webinar  For every Data Scientist a company hires, they in turn need to hire an average of 5 Data Engineers. This has led to a huge increase in the demand for data engineers – in India as well as globally. A career in this field can be both rewarding and challenging. You’ll play an important role in an organization’s success, providing easier access to data that data scientists, analysts, and decision-makers need to do their jobs.   So, This Wednesday our Intellipaat Team has organised one Special webinar on "IS DATA ENGINEERING A GOOD CAREER"   Topic Covered
------------------------------MESSAGE SENT [CHECK-INBOX]---------------------------------


## TF-IDF

In the **BOW approach** we have seen so far, all the words in the text are treated as equally important. There is no notion of some words in the document being more important than others. TF-IDF addresses this issue. It aims to quantify the importance of a given word relative to other words in the document and in the corpus. It is a commonly used representation scheme in information retrieval systems, for extracting relevant documents from a corpus for a given text query.


<font color=darkviolet>  **Term Frequency (tf)** </font>
TF: Term Frequency measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term 't' appears in a document) / (Total number of terms in the document).



<font color=darkviolet>  **Inverse Document Frequency (idf)** </font>
IDF: Inverse Document Frequency measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).



__Let's see an example:__

Consider a document containing 100 words wherein the word cat appears 3 times. 

The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. 

Now, assume we have 10 million documents and the word cat appears in one thousand of these. 

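Then, the inverse document frequency (i.e., idf), taking a base-10 logarithm to keep the numbers round, is calculated as log10(10,000,000 / 1,000) = 4. (With the natural logarithm used in the formula above, the value would be ln(10,000) ≈ 9.21.)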

Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12
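
The worked example can be reproduced with a short sketch. The function names are chosen here for illustration, and a base-10 logarithm is assumed so that the idf comes out as 4, as in the example. Note that scikit-learn's `TfidfVectorizer` uses, by default, a smoothed natural-log idf and normalizes each row, so its values will not match this hand calculation.

```python
import math


def term_frequency(term_count, total_terms):
    # Share of the document's terms that are this term
    return term_count / total_terms


def inverse_document_frequency(n_documents, n_documents_with_term, log=math.log10):
    # Base-10 log is an assumption made to match the worked example
    return log(n_documents / n_documents_with_term)


tf_cat = term_frequency(3, 100)
idf_cat = inverse_document_frequency(10_000_000, 1_000)
tfidf_cat = tf_cat * idf_cat

print(f"tf = {tf_cat:.2f}, idf = {idf_cat:.2f}, tf-idf = {tfidf_cat:.2f}")
```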

In [37]:
# Step:1 Define independent and dependent features

X = data['text_clean'].values
y = data['Class'].values

In [38]:
X

array(['Ive been searching for the right words to thank you for this breather I promise i wont take your help for granted and will fulfil my promise You have been wonderful and a blessing at all times',
       'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s',
       'Nah I dont think he goes to usf he lives around here though', ...,
       'Pity  was in mood for that Soany other suggestions',
       'The guy did some bitching but I acted like id be interested in buying something else next week and he gave it to us for free',
       'Rofl Its true to its name'], dtype=object)

In [39]:
# changing the data type of y
y = y.astype('int')
y

array([1, 0, 1, ..., 1, 1, 1])

In [40]:
# Check the type
type(X)

numpy.ndarray

In [41]:
# Step:2 Create the TF-IDF vectorizer that extracts features from the text documents
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer() # Object creation

# Step:3 Fitting and transforming data into vectors
X = tf.fit_transform(X)
X.shape

(5567, 9537)

In [44]:
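# Step:3 print the feature names selected from the raw documents
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(tf.get_feature_names_out())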



In [45]:
# getting the number of features
len(tf.get_feature_names_out())

9537

In [46]:
# Get the type
type(X)

scipy.sparse.csr.csr_matrix

In [30]:
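# convert the sparse matrix to a dense array of feature vectors
# (optional: MultinomialNB also accepts the sparse matrix directly, which uses far less memory)
X=X.toarray()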

In [31]:
X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [32]:
X.shape

(5567, 9537)

In [53]:
# Step:4 create training and testing data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=45)

# Step:5 Model creation
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB() # object creation

# Step:6 fitting the data
nb.fit(X_train,y_train)

# Step:7 prediction on test data
y_predict = nb.predict(X_test)


### EVALUATION

In [54]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_predict))

              precision    recall  f1-score   support

           0       1.00      0.59      0.74       145
           1       0.94      1.00      0.97       969

    accuracy                           0.95      1114
   macro avg       0.97      0.80      0.86      1114
weighted avg       0.95      0.95      0.94      1114
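
The report above shows a precision of 1.00 but a recall of only 0.59 for class 0 (spam): nothing flagged as spam was actually ham, yet a sizeable share of the spam messages was still predicted as ham. The sketch below shows how these per-class numbers are derived from counts. The helper name and the toy labels are assumptions for illustration and are not the notebook's data.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (0 = spam, 1 = ham): half of the spam is missed, no ham is flagged
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy, positive=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With imbalanced classes like ham and spam, these per-class figures say more than overall accuracy does.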



### SPAM AND HAM CLASSIFIER

In [55]:
msg = input("Enter The Message:") # Get message input
msginput = tf.transform([msg])   # transform msg with tfidf
predict = nb.predict(msginput)  # prediction on input msg

# Condition: 0 = spam, 1 = ham
if predict[0] == 0:
    print("------------------------MESSAGE-SENT-[CHECK-SPAM-FOLDER]---------------------------")
else:
    print("---------------------------MESSAGE-SENT-[CHECK-INBOX]------------------------------")
    

Enter The Message:	    D-Tribe           Shubhangi Sakarkar   Shubhangi shared: Dimensionality Reduction and Feature Projection The black spiral represents a certain mechanism that generates data in 3D. The values of Y are being translated to a projected red line (sine) 2D The values of X are being translated to a projected blue line (cosine) 2D   Go to Post     	Shubhangi Sakarkar      Dimensionality Reduction and Feature Projection  The black spiral represents a certain mechanism that generates data in 3D.  The values of Y are be... This is a notification for Updates From Your Hosts.  Mute Post Switch to Daily Digest Update Notification Settings     	 	You’re a Member of D-Tribe  Mute Post Update Preferences Unsubscribe  Sent by Mighty Networks  530 Lytton Ave 2nd Fl Office #208, Palo Alto, CA 94301	 	    	    
---------------------------MESSAGE-SENT-[CHECK-INBOX]------------------------------


### Pros of Naive Bayes

- Naive Bayes is a fast, highly scalable algorithm.
- Naive Bayes can be used for both binary and multi-class classification. Scikit-learn provides several variants, such as GaussianNB, MultinomialNB, and BernoulliNB.
- It is a simple algorithm that mostly relies on counting.
- It is a great choice for text classification problems and a popular choice for spam email classification.
- It can be easily trained on small datasets.
- In principle, Naive Bayes can handle missing data, because a missing feature can simply be ignored when the probability is calculated for a class value (scikit-learn's estimators, however, expect missing values to be imputed or removed beforehand).


### Cons of Naive Bayes

- It treats all features as unrelated, so it cannot learn relationships between features. This limits its applicability in real-world use cases where features interact.
- Naive Bayes can learn the importance of individual features but cannot determine the relationships among them.

## Application of Naive Bayes

##### Text classification / spam filtering / Sentiment analysis:
 - Naive Bayes classifiers are mostly used in text classification.
 - News article classification (SPORTS, TECHNOLOGY, etc.).
 - Spam or Ham: Naive Bayes is a popular method for mail filtering.
 - Sentiment analysis focuses on identifying whether customers think positively or negatively about a certain topic (product or service).
 
 
##### Recommendation System:
- A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.



### 3 Types of Naive Bayes in Scikit Learn

__Gaussian__

- It is used in classification and it assumes that features follow a normal distribution.

__Multinomial__
- It is used for discrete counts. For example, in a text classification problem it goes one step further than Bernoulli trials: instead of "word occurs in the document", we "count how often the word occurs in the document". You can think of it as "the number of times outcome number_x is observed over n trials".

__Bernoulli__
- The Bernoulli model is useful if your feature vectors are binary (i.e., zeroes and ones). One application would be text classification with a 'bag of words' model where the 1s and 0s are "word occurs in the document" and "word does not occur in the document" respectively.
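
The difference between the multinomial and Bernoulli views of the same message can be sketched as follows. The vocabulary and message are hypothetical. In scikit-learn, presence features of this kind can be produced with `CountVectorizer(binary=True)`, and `BernoulliNB` can also binarize its input itself.

```python
from collections import Counter

vocabulary = ["call", "cash", "now", "win"]  # hypothetical vocabulary
message = "win cash now call now"

counts = Counter(message.split())

# Multinomial view: how often each vocabulary word occurs
multinomial_features = [counts[w] for w in vocabulary]

# Bernoulli view: whether each vocabulary word occurs at all
bernoulli_features = [1 if counts[w] > 0 else 0 for w in vocabulary]

print(multinomial_features)
print(bernoulli_features)
```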