# NaiveBayes Classifier

### Bayes' Theorem:

According to Wikipedia, in probability theory and statistics, **Bayes’s theorem** (alternatively *Bayes’s law* or *Bayes’s rule*) describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
Mathematically, it can be written as:

$P(A|B) = \frac{P(A)*P(B|A)} {P(B)}$

Where A and B are events and P(B)≠0
* P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.
* P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is true.
* P(A) and P(B) are the probabilities of observing A and B respectively, without any given conditions; they are known as the marginal probabilities.


Let’s understand it with the help of an example:

**The problem statement:**

You are planning a picnic today, but the morning is cloudy.

- Oh no! 50% of all rainy days start off cloudy!
- But cloudy mornings are common (about 40% of days start cloudy).
- And this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%).

What is the chance of rain during the day?

We will use Rain to mean rain during the day, and Cloud to mean cloudy morning.

The chance of Rain given Cloud is written P(Rain|Cloud)

So let's put that in the formula:

$P(Rain|Cloud) = \frac{P(Rain)*P(Cloud|Rain)} {P(Cloud)}$          
                      
 

- P(Rain) is Probability of Rain = 10%
- P(Cloud|Rain) is Probability of Cloud, given that Rain happens = 50%
- P(Cloud) is Probability of Cloud = 40%

$P(Rain|Cloud) = \frac{0.1 \times 0.5} {0.4} = 0.125$

Or a 12.5% chance of rain. Not too bad, let's have a picnic!
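
The same arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the function name `bayes_posterior` and its parameter names are made up for illustration, and the numbers are the ones from the picnic example above.

```python
def bayes_posterior(prior, likelihood, evidence):
    """Return P(A|B) = P(A) * P(B|A) / P(B).

    prior      -- P(A), e.g. P(Rain)
    likelihood -- P(B|A), e.g. P(Cloud|Rain)
    evidence   -- P(B), e.g. P(Cloud); must be non-zero
    """
    if evidence == 0:
        raise ValueError("P(B) must be non-zero")
    return prior * likelihood / evidence


# Numbers from the picnic example
p_rain = 0.1              # 3 of 30 days are rainy
p_cloud_given_rain = 0.5  # 50% of rainy days start cloudy
p_cloud = 0.4             # 40% of days start cloudy

p_rain_given_cloud = bayes_posterior(p_rain, p_cloud_given_rain, p_cloud)
print(f"P(Rain|Cloud) = {p_rain_given_cloud:.3f}")
```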

**Naïve:** It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature contributes individually to identifying it as an apple, without depending on the others.<br>
**Bayes:** It is called Bayes because it depends on the principle of Bayes' Theorem.
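
Putting the two ideas together gives the rule a Naive Bayes classifier applies. The notation below is an assumption introduced for illustration: $y$ stands for a class (for example spam or ham) and $x_1, \dots, x_n$ for the features of one record.

$P(y|x_1, \dots, x_n) \propto P(y) * \prod_{i=1}^{n} P(x_i|y)$

The product appears because of the naïve independence assumption: given the class, each feature contributes its own factor $P(x_i|y)$. The denominator $P(x_1, \dots, x_n)$ is the same for every class, so it can be dropped when we only want the most probable class.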

# Problem: 
To predict, based on its content, whether a new mail should be categorized as spam or not-spam.

In [1]:
import pandas as pd
import numpy as np
import string 

In [2]:
data = pd.read_csv('spam.tsv',sep='\t',names=['Class','Message'])

In [3]:
data

| | Class | Message |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!!! |
| ... | ... | ... |
| 5562 | spam | This is the 2nd time we have tried 2 contact u... |
| 5563 | ham | Will ü b going to esplanade fr home? |
| 5564 | ham | Pity, * was in mood for that. So...any other s... |
| 5565 | ham | The guy did some bitching but I acted like i'd... |


# Basic Checks:

In [4]:
data.shape

(5567, 2)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5567 entries, 0 to 5566
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Class    5567 non-null   object
 1   Message  5567 non-null   object
dtypes: object(2)
memory usage: 87.1+ KB


 So no null values are present here.

### Create a column 'Length' to keep the count of characters present in each record/message:

In [6]:
data['Length'] = data['Message'].apply(len)

In [7]:
data.head()

| | Class | Message | Length |
|---|---|---|---|
| 0 | ham | I've been searching for the right words to tha... | 196 |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... | 61 |
| 3 | ham | Even my brother is not like to speak with me. ... | 77 |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!!! | 36 |


### Let's see the counts of each class:

In [8]:
data.groupby('Class').count()

| Class | Message | Length |
|---|---|---|
| ham | 4821 | 4821 |
| spam | 746 | 746 |


In [9]:
data.describe()

| | Length |
|---|---|
| count | 5567.0 |
| mean | 80.450153 |
| std | 59.891023 |
| min | 2.0 |
| 25% | 36.0 |
| 50% | 62.0 |
| 75% | 122.0 |
| max | 910.0 |


The longest message in this dataset has 910 characters and the shortest has 2.

### The message with the maximum number of characters:

In [10]:
data.loc[data['Length']==910]

| | Class | Message | Length |
|---|---|---|---|
| 1080 | ham | For me the love should start with attraction.i... | 910 |


### The messages with the minimum number of characters:

In [11]:
data.loc[data['Length'] == 2]

| | Class | Message | Length |
|---|---|---|---|
| 1920 | ham | Ok | 2 |
| 3046 | ham | Ok | 2 |
| 4493 | ham | Ok | 2 |
| 5352 | ham | Ok | 2 |


In [12]:
data[data['Length']==910]['Message']

1080    For me the love should start with attraction.i...
Name: Message, dtype: object

# Text Preprocessing:

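Create arrays for the target variable and the independent variable using the `values` attribute, which converts a Series into a NumPy array. These cells are left commented out here because `x` and `y` are created later, in Step-IV: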

In [13]:
# class_obj = data['Class'].values

In [14]:
# msg_obj = data['Message'].values

# Handling Categorical Column:

In [15]:
# Encode the target column as numbers: ham -> 1, spam -> 0
data.loc[data['Class'] == 'ham','Class'] = 1
data.loc[data['Class'] == 'spam','Class'] = 0

In [17]:
# class_obj

In [18]:
# The dtype of 'Class' is still object, so let's convert it to int
data['Class']=data['Class'].astype('int')


# Step-I:
Remove punctuation from each text/message/record:

In [19]:
string.punctuation  

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

* These are the characters that Python's `string` module treats as punctuation.

### Let's remove the punctuation:

In [20]:
def remove_punct(text):
    text = ''.join([char for char in text if char not in string.punctuation])
    
    return text

### We just created a simple function to remove punctuation from a text:
1. `char for char in text` traverses whatever message or text we pass in, one character at a time.
2. `if char not in string.punctuation` compares each character with the punctuation signs and keeps only the characters that are not punctuation.
3. The list comprehension collects the kept characters in a list.
4. `''.join(...)` joins those characters using an empty string as the separator, and the function returns the cleaned text.
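
The steps above can be tried on a couple of sample strings. The sketch below repeats `remove_punct` so that it runs on its own and, as an optional alternative, adds a hypothetical `remove_punct_translate` helper built on `str.translate`, which typically runs faster on long texts. Note that `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes are left untouched by both versions.

```python
import string


def remove_punct(text):
    """Drop every character that appears in string.punctuation."""
    return ''.join(char for char in text if char not in string.punctuation)


# Alternative (an assumption, not the notebook's method): a translation table
# that maps every ASCII punctuation character to None.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def remove_punct_translate(text):
    """Same result as remove_punct, using str.translate."""
    return text.translate(_PUNCT_TABLE)


samples = [
    "I HAVE A DATE ON SUNDAY WITH WILL!!!",
    "Nah I don't think he goes to usf, he lives around here",
]
for sample in samples:
    print(remove_punct(sample))
```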

# Step-II:

Add a cleaned-text column to the dataset

In [21]:
data['cleaned_text'] = data['Message'].apply(lambda x: remove_punct(x))

In [22]:
data.head()

| | Class | Message | Length | cleaned_text |
|---|---|---|---|---|
| 0 | 1 | I've been searching for the right words to tha... | 196 | Ive been searching for the right words to than... |
| 1 | 0 | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | 1 | Nah I don't think he goes to usf, he lives aro... | 61 | Nah I dont think he goes to usf he lives aroun... |
| 3 | 1 | Even my brother is not like to speak with me. ... | 77 | Even my brother is not like to speak with me T... |
| 4 | 1 | I HAVE A DATE ON SUNDAY WITH WILL!!! | 36 | I HAVE A DATE ON SUNDAY WITH WILL |


# Step-III:
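* Tokenization: the process of converting normal string data into a list of tokens (words or other small units). Tokens are not the same as lemmas, which are the dictionary base forms of words.
* In other words, it consists of splitting an entire text into small units (tokens).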

# CountVectorizer :
* It'll help us convert a collection of text documents into vectors of token counts.
* Each vector counts the number of times every word is mentioned in a document.

In [23]:
from sklearn.feature_extraction.text import CountVectorizer , TfidfVectorizer

In [24]:
CV_object = CountVectorizer(stop_words="english")  # drop common English stop words while counting

Stop words are words in any language that do not add much meaning to a sentence. They are very common in text documents, such as a, an, the, you, your, etc. Because they appear so frequently but are rarely helpful for text analysis, it is usually better to remove them from the text, so that we can focus on the more important words.

# Step-IV:
Splitting x and y

In [25]:
x = data['cleaned_text'].values
y = data['Class'].values


### Ques:
Can a NumPy array be used to hold string objects? Yes, but NumPy's numerical functions
cannot be performed on string objects.

### Splitting training and testing data:

In [26]:
from sklearn.model_selection import train_test_split

In [27]:
x_train,x_test,y_train,y_test = train_test_split(x , y , random_state = 42 , test_size=0.2)

### Creating the vectors of the training data:

In [28]:
x_train_CV = CV_object.fit_transform(x_train)

In [29]:
x_train_CV

<4453x8216 sparse matrix of type '<class 'numpy.int64'>'
	with 34481 stored elements in Compressed Sparse Row format>

In [30]:
x_train_CV.getrow(2)

<1x8216 sparse matrix of type '<class 'numpy.int64'>'
	with 5 stored elements in Compressed Sparse Row format>

### Training a model

With messages represented as vectors, we can finally train our spam/ham classifier. Almost any classification algorithm could be used at this point. The Naive Bayes classifier is a good choice here because it is fast to train and works well with high-dimensional, sparse word-count features.

In [31]:
from sklearn.naive_bayes import MultinomialNB

In [32]:
NB_model = MultinomialNB()

In [33]:
NB_model.fit(x_train_CV,y_train)

MultinomialNB()
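
To make concrete what `fit` estimates, the sketch below recomputes the multinomial Naive Bayes quantities by hand on a tiny made-up count matrix and compares them with scikit-learn. The data and variable names are assumptions for illustration; `alpha = 1.0` is Laplace smoothing, which is also the default of `MultinomialNB`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count matrix: rows are messages, columns are counts of three words
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 3],
              [0, 0, 2]])
y = np.array([1, 1, 0, 0])  # 1 = ham, 0 = spam, as in this notebook

alpha = 1.0  # Laplace smoothing
classes = np.unique(y)

# log P(class): fraction of training messages in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# log P(word | class): smoothed share of each word among all words of the class
word_counts = np.array([X[y == c].sum(axis=0) for c in classes])
smoothed = word_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Score a new message: log prior plus count-weighted log likelihoods
new_message = np.array([[1, 0, 1]])
scores = new_message @ log_likelihood.T + log_prior
manual_prediction = classes[np.argmax(scores, axis=1)]

model = MultinomialNB(alpha=alpha).fit(X, y)
print("Manual prediction: ", manual_prediction)
print("sklearn prediction:", model.predict(new_message))
```

The smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.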

### Prediction/Testing the model:

In [34]:
x_test_CV = CV_object.transform(x_test)

In [35]:
y_predict = NB_model.predict(x_test_CV)

In [36]:
y_predict

array([1, 1, 1, ..., 1, 1, 1])

In [37]:
from sklearn.metrics import classification_report

In [38]:
print(classification_report(y_test,y_predict))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96       145
           1       0.99      0.99      0.99       969

    accuracy                           0.99      1114
   macro avg       0.98      0.98      0.98      1114
weighted avg       0.99      0.99      0.99      1114



# Spam Filtering Application:

In [44]:
msg = input('Enter any mail: ')
msg_input = CV_object.transform([msg])
predict = NB_model.predict(msg_input)

if (predict[0] == 0):
    print('Mail will go to the spam folder')
else:
    print('Check your inbox')

Enter any mail: Hello and welcome to this course What is Machine Learning?  
Check your inbox


# TF-IDF:

In the **BOW approach** we saw so far, all the words in the text are treated as equally important. There is no notion of some words in the document being more important than others. TF-IDF addresses this issue. It aims to quantify the importance of a given word relative to other words in the document and in the corpus. It is a commonly used representation scheme for information retrieval systems, for extracting relevant documents from a corpus for a given text query.


<font color=darkviolet>  **Term Frequency (tf)** </font>
TF: Term Frequency measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term 't' appears in a document) / (Total number of terms in the document).



<font color=darkviolet>  **Inverse Document Frequency (idf)** </font>
IDF: Inverse Document Frequency measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).



__Let's see an example:__

Consider a document containing 100 words wherein the word cat appears 3 times. 

The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. 

Now, assume we have 10 million documents and the word cat appears in one thousand of these. 

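Then, the inverse document frequency (i.e., idf), using a base-10 logarithm to keep the arithmetic simple, is calculated as log_10(10,000,000 / 1,000) = 4. (With the natural logarithm the value would be about 9.21; the choice of base only rescales the weights.)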

Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12
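
The arithmetic of this example can be reproduced in plain Python. This is a minimal sketch: the helper names are made up for illustration, and the base-10 logarithm is used because it is the base that yields the value 4 in the example. scikit-learn's `TfidfVectorizer` uses a smoothed, natural-log variant of idf and normalizes each row, so its numbers will not match this textbook calculation.

```python
import math


def term_frequency(term_count, total_terms):
    """TF(t) = occurrences of the term / total terms in the document."""
    return term_count / total_terms


def inverse_document_frequency(total_docs, docs_with_term, log=math.log10):
    """IDF(t) = log(total documents / documents containing the term)."""
    return log(total_docs / docs_with_term)


tf = term_frequency(3, 100)                           # 'cat' appears 3 times in 100 words
idf = inverse_document_frequency(10_000_000, 1_000)   # base-10 logarithm
tf_idf = tf * idf

# Same idf with the natural logarithm, for comparison
idf_natural = inverse_document_frequency(10_000_000, 1_000, log=math.log)

print(f"tf = {tf:.2f}, idf = {idf:.2f}, tf-idf = {tf_idf:.2f}")
print(f"idf with natural log = {idf_natural:.2f}")
```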

# Step-I: Splitting x and y:

In [46]:
x = data['cleaned_text'].values
y= data['Class'].values

# Step-II : Creating vectors using TF-IDF technique:

In [47]:
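## Text preprocessing and feature vectorization
# To extract features from a collection of text documents, we import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_object = TfidfVectorizer()   ## object creation
# Note: fitting on the full dataset before the train/test split lets words that only
# occur in test messages influence the vocabulary and the idf weights.
x = tfidf_object.fit_transform(x)    ## fit the vocabulary and transform the messages into vectors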


In [48]:
## print feature names selected from the raw documents
tfidf_object.get_feature_names_out()[2000:3000]

array(['campus', 'camry', 'can', 'canada', 'canal', 'canary', 'cancel',
       'canceled', 'cancelled', 'cancer', 'candont', 'canlove', 'canname',
       'cannot', 'cannt', 'cant', 'cantdo', 'canteen', 'capacity',
       'capital', 'cappuccino', 'caps', 'captain', 'captaining', 'car',
       'card', 'cardiff', 'cardin', 'cards', 'care', 'careabout', 'cared',
       'career', 'careful', 'carefully', 'careinsha', 'careless',
       'carente', 'cares', 'careswt', 'careumma', 'carewhoever', 'caring',
       'carlie', 'carlin', 'carlos', 'carlosll', 'carly', 'carolina',
       'caroline', 'carpark', 'carry', 'carryin', 'cars', 'carso',
       'cartons', 'cartoon', 'case', 'cash', 'cashbalance', 'cashbincouk',
       'cashed', 'cashin', 'cashto', 'casing', 'cast', 'casting',
       'castor', 'casualty', 'cat', 'catch', 'catches', 'catching',
       'categories', 'caught', 'cause', 'causes', 'causing', 'caveboy',
       'cbe', 'cc', 'cc100pmin', 'ccna', 'cd', 'cdgt', 'cds', 'cedar',
       'c

In [49]:
len(tfidf_object.get_feature_names_out())

9537

In [50]:
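# Getting the feature vectors as a dense array.
# This step is optional: MultinomialNB also accepts the sparse matrix, which needs far less memory.
x = x.toarray()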

In [51]:
x

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

# Step-III: Split data into training and testing data

In [52]:
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state =42, test_size=0.2)

# Step-IV: Model Creation

In [53]:
nb = MultinomialNB()

In [54]:
nb.fit(x_train,y_train)

MultinomialNB()
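
In the cells above the TF-IDF vocabulary is fitted on all messages before the split. A common alternative is to split first and let a scikit-learn `Pipeline` fit the vectorizer on the training messages only. The sketch below shows that idea on a handful of made-up messages; the data and variable names are assumptions for illustration, and this is not the approach the notebook takes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages: 1 = ham, 0 = spam (same encoding as the notebook)
messages = [
    "Free entry win a prize now",
    "Are we still meeting for lunch",
    "Claim your free cash prize today",
    "I will call you after work",
    "Winner you have won a free ticket",
    "See you at home tonight",
    "Urgent claim your prize now",
    "Thanks for the help yesterday",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Split the raw text first, so the test messages stay unseen
x_train, x_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer and the classifier on the training part only
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(x_train, y_train)

predictions = spam_pipeline.predict(x_test)
print(list(zip(x_test, predictions)))
```

The pipeline also accepts raw strings at prediction time, so a new message can be passed as `spam_pipeline.predict([msg])` without a separate `transform` call.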

In [55]:
y_hat = nb.predict(x_test)

In [56]:
y_hat

array([1, 1, 1, ..., 1, 1, 1])

In [57]:
pd.crosstab(y_test,y_hat)

| Actual \ Predicted | 0 | 1 |
|---|---|---|
| 0 | 100 | 45 |
| 1 | 0 | 969 |


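* Rows are the actual classes and columns are the predicted classes. No ham message (class 1) was misclassified as spam (class 0), so no legitimate mail would be lost to the spam folder. However, 45 of the 145 spam messages were predicted as ham, so this model lets through more spam than the CountVectorizer model above, whose spam recall was 0.96.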

In [58]:
print(classification_report(y_test,y_hat))

              precision    recall  f1-score   support

           0       1.00      0.69      0.82       145
           1       0.96      1.00      0.98       969

    accuracy                           0.96      1114
   macro avg       0.98      0.84      0.90      1114
weighted avg       0.96      0.96      0.96      1114
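
The precision, recall, and accuracy values in this report can be derived directly from the cross-tabulation above. A minimal sketch, with the counts typed in by hand (rows are actual classes, columns are predicted classes; the variable names are made up for illustration):

```python
import numpy as np

# Counts copied from the cross-tabulation: rows = actual, columns = predicted
confusion = np.array([[100, 45],    # actual spam (0): 100 caught, 45 predicted as ham
                      [0, 969]])    # actual ham  (1): none predicted as spam

correct = np.diag(confusion)              # correctly classified messages per class
actual_totals = confusion.sum(axis=1)     # support of each class
predicted_totals = confusion.sum(axis=0)  # how often each class was predicted

recall = correct / actual_totals          # share of each actual class that was found
precision = correct / predicted_totals    # share of each predicted class that was right
accuracy = correct.sum() / confusion.sum()

for label in (0, 1):
    print(f"class {label}: precision={precision[label]:.2f}, recall={recall[label]:.2f}")
print(f"accuracy={accuracy:.2f}")
```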



# Spam Filtering Application

In [61]:
msg = input('Enter any msg: ')
msgInput = tfidf_object.transform([msg])
prediction = nb.predict(msgInput)

if prediction[0]==0:
    print('It\'ll go to the spam folder')
else:
    print('Check Inbox')

Enter any msg: Hello and welcome to this course What is Machine Lear
Check Inbox


### Pros of Naive Bayes

- Naive Bayes is a fast, highly scalable algorithm.
- Naive Bayes can be used for both binary classification and multi-class classification. Scikit-learn provides different types of Naive Bayes algorithms, such as GaussianNB, MultinomialNB, and BernoulliNB.
- It is a simple algorithm that essentially depends on counting.
- It is a great choice for text classification problems and a popular choice for spam email classification.
- It can be easily trained on small datasets.
- In principle, Naive Bayes can handle missing data, since a missing feature can simply be ignored when the probability is calculated for a class value. Note that scikit-learn's implementations still expect missing values to be filled in first.


### Cons of Naive Bayes

- It considers all the features to be unrelated, so it cannot learn the relationships between features. This limits the applicability of the algorithm in real-world use cases where features interact.
- In other words, Naive Bayes can learn the importance of individual features but cannot determine how features depend on each other.

## Application of Naive Bayes

##### Text classification / spam filtering / Sentiment analysis:
 - Naive Bayes classifiers are mostly used in text classification.
 - News article classification into categories such as SPORTS, TECHNOLOGY, etc.
 - Spam or Ham: Naive Bayes is one of the most popular methods for mail filtering.
 - Sentiment analysis focuses on identifying whether customers think positively or negatively about a certain topic (product or service).


##### Recommendation System:
- A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.


### 3 Types of Naive Bayes in Scikit Learn

__Gaussian__

- It is used in classification and it assumes that features follow a normal distribution.

__Multinomial__
- It is used for discrete counts. For example, in a text classification problem it goes one step further than Bernoulli trials: instead of "word occurs in the document", the feature is "how often the word occurs in the document". You can think of it as "the number of times outcome number_x is observed over n trials".

__Bernoulli__
- The Bernoulli model is useful if your feature vectors are binary (i.e., zeros and ones). One application is text classification with a 'bag of words' model where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document" respectively.
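
The three variants differ mainly in the kind of features they expect. The sketch below fits each one on synthetic data of the matching type. All data is randomly generated for illustration, and the printed training accuracies say nothing about real-world performance.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)  # two synthetic classes

# Continuous features (e.g. measurements) -> GaussianNB
X_continuous = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(50, 2)),
])

# Count features (e.g. word counts) -> MultinomialNB
X_counts = np.vstack([
    rng.poisson(lam=[5, 1, 1], size=(50, 3)),
    rng.poisson(lam=[1, 1, 5], size=(50, 3)),
])

# Binary features (word present / absent) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

gaussian_score = GaussianNB().fit(X_continuous, y).score(X_continuous, y)
multinomial_score = MultinomialNB().fit(X_counts, y).score(X_counts, y)
bernoulli_score = BernoulliNB().fit(X_binary, y).score(X_binary, y)

print(f"GaussianNB training accuracy:    {gaussian_score:.2f}")
print(f"MultinomialNB training accuracy: {multinomial_score:.2f}")
print(f"BernoulliNB training accuracy:   {bernoulli_score:.2f}")
```

Reducing the counts to 0/1 discards how often a word occurs, which is exactly the difference between the Multinomial and Bernoulli models described above.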