## <Font color=Red>360DigiTMG Assignments<font/>  - <font color=Blue>NaiveBayes<font/>

- Naïve Bayes is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
- It is mainly used in text classification, where the training data is high-dimensional.
- The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps build fast machine learning models that make quick predictions.
- It is a probabilistic classifier, which means it predicts on the basis of the probability that an object belongs to each class.
- Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
    

#### The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described as:

#### Naïve:

It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.
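
In symbols, the independence assumption can be sketched as follows. The notation is introduced here for illustration only: $x_1, \dots, x_n$ are the features (for example color, shape, and taste) and $y$ is the class (for example apple).

$$P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$

Each factor on the right depends on a single feature, which is why the features are said to contribute individually.
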
#### Bayes:
    
It is called Bayes because it depends on the principle of Bayes' theorem.


#### <font color=Green>Bayes' Theorem:<font/>

Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability.
    
The formula for Bayes' theorem is given as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Where,

#### P(A|B) is Posterior probability: 
    Probability of hypothesis A on the observed event B.

#### P(B|A) is Likelihood probability: 
    Probability of the evidence B given that hypothesis A is true.

#### P(A) is Prior Probability: 
    Probability of hypothesis before observing the evidence.

#### P(B) is Marginal Probability: 
    Probability of Evidence.
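
As a small worked example of the four terms above, the sketch below computes a posterior from a prior and two likelihoods. All numbers and the function name `posterior` are hypothetical and chosen only for illustration; they are not taken from the data set used later.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) via Bayes' theorem for a hypothesis A and its complement."""
    # Marginal probability of the evidence:
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    # Posterior = likelihood * prior / evidence
    return likelihood_b_given_a * prior_a / evidence

# Hypothetical numbers: 40% of tweets are about real disasters (A);
# the word "fire" (B) appears in 30% of those and in 5% of the others.
p = posterior(prior_a=0.4, likelihood_b_given_a=0.3, likelihood_b_given_not_a=0.05)
print(round(p, 3))
```

With these made-up inputs, observing the evidence raises the probability of the hypothesis well above its prior of 0.4.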
    
    
#### <font color=purple>Advantages of Naïve Bayes Classifier:<font/>
    
- Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a data point.
- It can be used for binary as well as multi-class classification.
- It often performs well in multi-class prediction compared with other algorithms.
- It is a popular choice for text classification problems.

#### <font color=purple>Disadvantages of Naïve Bayes Classifier:<font/>
    
- Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.

#### <font color=purple>Applications of Naïve Bayes Classifier:<font/>
    
- It is used for Credit Scoring.
- It is used in medical data classification.
- It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
- It is used in Text classification such as Spam filtering and Sentiment analysis.

## Problem 3). <font color=darkblue >Build a Naive Bayes model on the data set for classifying whether a tweet about a disaster is real or fake

## <font color=blue>Step 1.<font/>

### <font color=red>Business objective:-<font/> <font color=black>To classify whether a given tweet is about a real disaster (real) or not (fake).<font/>

### <font color=red>Business constraints:- <font/> <font color=black>Minimize the model error percentage.<font/>

## <font color=blue>Step 2.<font/>

### <font color=red>Data Understanding and Analysis<font/>

In [4]:
# Importing all the dependencies/libraries

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer

In [5]:
######## Loading the data set
tweets = pd.read_csv("Disaster_tweets_NB.csv",encoding = "ISO-8859-1")

## <font color=blue>Step 3.<font/>

### <font color=red>Data Pre-Cleansing or Data Preparation:<font/>

In [6]:
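# Keep the columns from index 3 onward (these include text and target); copy so later edits do not touch a slice of tweets
tweet_data = tweets.iloc[:, 3:].copy()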

In [8]:
# cleaning data
import re

# Load the custom-built stop words (one word per line)
# Note: cleaning_text below does not use this list; it only drops short words
with open("stopwords_en.txt", "r") as sw:
    stop_words = sw.read().split("\n")

def cleaning_text(i):
    # Replace every run of non-letter characters (digits, punctuation) with a space, then lowercase
    i = re.sub("[^A-Za-z]+", " ", i).lower()
    # Keep only words longer than three characters
    w = [word for word in i.split(" ") if len(word) > 3]
    return " ".join(w)


#### Testing the above specified function

In [11]:
# testing above function with sample text => removes punctuation, numbers, and words of three letters or fewer
print(cleaning_text("Hope you are having a good week. Just checking in"))
print(cleaning_text("hope i can understand your feelings 123121. 123 hi how .. are you?"))
print(cleaning_text("Hi how are you, I am good"))

tweet_data.text = tweet_data.text.apply(cleaning_text)

hope having good week just checking
hope understand your feelings
good
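
The stop-word file is loaded in the cell above, but the cleaning function filters only by word length. A minimal sketch of how a stop-word list could also be applied is shown below. The function name `cleaning_text_with_stopwords` and the small `example_stop_words` set are hypothetical stand-ins, since the contents of `stopwords_en.txt` are not shown here.

```python
import re

# Hypothetical stand-in for the contents of stopwords_en.txt
example_stop_words = {"your", "just", "have", "having", "with", "that", "this"}

def cleaning_text_with_stopwords(text, stop_words):
    """Lowercase, keep letters only, then drop short words and stop words."""
    text = re.sub("[^A-Za-z]+", " ", text).lower()
    kept = [w for w in text.split() if len(w) > 3 and w not in stop_words]
    return " ".join(kept)

cleaned = cleaning_text_with_stopwords(
    "Hope you are having a good week. Just checking in", example_stop_words
)
print(cleaned)
```

Using a set for the stop words keeps each membership check cheap, which matters when the function is applied to thousands of tweets.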


In [12]:
# removing empty rows: cleaning_text returns "" when no words survive
tweet_data = tweet_data.loc[tweet_data.text.str.strip() != "", :]

# CountVectorizer
# Convert a collection of text documents to a matrix of token counts
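
As a rough plain-Python illustration of what a matrix of token counts looks like, the sketch below builds one by hand. It is not scikit-learn's implementation, and the two example sentences are made up.

```python
from collections import Counter

# Hypothetical cleaned tweets
docs = ["fire near forest", "forest fire fire"]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc.split(" ")})

# One row per document, one column per vocabulary word
matrix = []
for doc in docs:
    counts = Counter(doc.split(" "))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
print(matrix)
```

Each row is a document and each column is a word, so the second row records that "fire" occurs twice in the second sentence.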

### Splitting data into train and test data sets

In [13]:
# splitting data into train and test data sets 
from sklearn.model_selection import train_test_split

tweet_train, tweet_test = train_test_split(tweet_data, test_size = 0.3)

# tokenizer for CountVectorizer: splits a cleaned tweet into words on spaces
def split_into_words(i):
    return i.split(" ")
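
The split above is drawn at random on each run, so the accuracy figures will vary between runs. One possible way to make it repeatable, and to keep the class balance similar in both parts, is sketched below with `random_state` and `stratify`. The toy `texts` and `labels` lists and the seed value 42 are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten short texts with alternating labels
texts = [f"tweet {i}" for i in range(10)]
labels = [0, 1] * 5

# random_state fixes the shuffle; stratify keeps the label proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))
```

Calling the function again with the same `random_state` returns the same split.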

In [14]:
# Defining the preparation of tweet texts into word count matrix format - Bag of Words
tweets_bow = CountVectorizer(analyzer = split_into_words).fit(tweet_data.text)

# Defining BOW for all messages
all_tweets_matrix = tweets_bow.transform(tweet_data.text)

# For training messages
train_tweets_matrix = tweets_bow.transform(tweet_train.text)

# For testing messages
test_tweets_matrix = tweets_bow.transform(tweet_test.text)

# Learning Term weighting and normalizing on entire tweets
tfidf_transformer = TfidfTransformer().fit(all_tweets_matrix)

In [15]:
# Preparing TFIDF for train tweets
train_tfidf = tfidf_transformer.transform(train_tweets_matrix)
train_tfidf.shape # (row, column)

(5329, 19280)

In [16]:
# Preparing TFIDF for test tweets
test_tfidf = tfidf_transformer.transform(test_tweets_matrix)
test_tfidf.shape #  (row, column)

(2284, 19280)
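
The cells above learn the vocabulary and the IDF weights from the entire data set, so some information from the test tweets reaches the features. A possible alternative, sketched below under the assumption that a stricter evaluation is wanted, fits both steps on the training texts only and reuses them for the test texts. The toy sentences and the step names `bow` and `tfidf` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for tweet_train.text and tweet_test.text
train_texts = ["forest fire near town", "lovely sunny morning", "flood warning issued"]
test_texts = ["fire warning tonight"]

features = Pipeline([
    ("bow", CountVectorizer(analyzer=lambda text: text.split(" "))),
    ("tfidf", TfidfTransformer()),
])

# Learn the vocabulary and IDF weights from the training texts only
train_tfidf_demo = features.fit_transform(train_texts)
# Reuse them for the test texts; unseen words such as "tonight" are ignored
test_tfidf_demo = features.transform(test_texts)

print(train_tfidf_demo.shape, test_tfidf_demo.shape)
```

Both matrices share the same number of columns because the test texts are mapped onto the vocabulary learned from the training texts.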

## <font color=blue>Step 4.<font/>

### <font color=red>Model Building - Multinomial NaiveBayes<font/>

In [17]:
# Preparing a naive bayes model on training data set 

from sklearn.naive_bayes import MultinomialNB as MB

# Multinomial Naive Bayes
classifier_mb = MB()
classifier_mb.fit(train_tfidf, tweet_train.target)


MultinomialNB()

## <font color=blue>Step 5.<font/>

### <font color=red>Model Evaluation - Multinomial Naive Bayes<font/>

In [18]:

# Training Data accuracy
train_pred_m = classifier_mb.predict(train_tfidf)
accuracy_train_m = np.mean(train_pred_m == tweet_train.target)
accuracy_train_m

0.9044848939763558

In [19]:
# Evaluation on Test Data
test_pred_m = classifier_mb.predict(test_tfidf)
accuracy_test_m = np.mean(test_pred_m == tweet_test.target)
accuracy_test_m

0.7898423817863398

In [20]:
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred); only the crosstab, the last expression, is displayed
accuracy_score(tweet_test.target, test_pred_m)
pd.crosstab(test_pred_m, tweet_test.target)

| row_0 (predicted) | target = 0 | target = 1 |
|---|---|---|
| 0 | 1178 | 390 |
| 1 | 90 | 626 |
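
Accuracy alone hides how each class is handled. The sketch below derives accuracy, per-class recall, and precision for class 1 from the counts in the crosstab above, where rows are predictions and columns are actual targets. The variable names are chosen here for illustration.

```python
# Counts copied from the crosstab: rows are predictions, columns are actual targets
pred0_actual0, pred0_actual1 = 1178, 390
pred1_actual0, pred1_actual1 = 90, 626

total = pred0_actual0 + pred0_actual1 + pred1_actual0 + pred1_actual1

# Share of all test tweets that were classified correctly
accuracy = (pred0_actual0 + pred1_actual1) / total
# Share of actual class 0 tweets predicted as 0
recall_0 = pred0_actual0 / (pred0_actual0 + pred1_actual0)
# Share of actual class 1 tweets predicted as 1
recall_1 = pred1_actual1 / (pred0_actual1 + pred1_actual1)
# Share of predicted class 1 tweets that really are class 1
precision_1 = pred1_actual1 / (pred1_actual0 + pred1_actual1)

print(round(accuracy, 4), round(recall_0, 4), round(recall_1, 4), round(precision_1, 4))
```

The gap between the two recall values shows that the model misses far more class 1 tweets than class 0 tweets.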


In [21]:
# Multinomial Naive Bayes changing default alpha for laplace smoothing
# if alpha = 0 then no smoothing is applied and the default alpha parameter is 1
# the smoothing process mainly solves the emergence of zero probability problem in the dataset.

classifier_mb_lap = MB(alpha = 3)
classifier_mb_lap.fit(train_tfidf, tweet_train.target)

MultinomialNB(alpha=3)
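
To see what `alpha` changes, the sketch below evaluates the additive smoothing estimate (count + alpha) / (total + alpha * vocabulary size) for a word that never appears in a class. The function name `smoothed_likelihood` and all counts are hypothetical; with TF-IDF input the counts would be fractional weights rather than integers.

```python
def smoothed_likelihood(word_count, total_count, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace/Lidstone) smoothing."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Hypothetical counts: a word never seen in a class that has 100 tokens,
# with a vocabulary of 50 distinct words
estimates = {alpha: smoothed_likelihood(0, 100, 50, alpha) for alpha in (0, 1, 3)}
for alpha, value in estimates.items():
    print(alpha, value)
```

With `alpha = 0` the unseen word gets probability zero, which would wipe out the whole product of likelihoods; larger `alpha` values move every estimate closer to a uniform distribution over the vocabulary.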

In [23]:
# Training Data accuracy
train_pred_lap = classifier_mb_lap.predict(train_tfidf)
accuracy_train_lap = np.mean(train_pred_lap == tweet_train.target)
accuracy_train_lap

0.8556952523925689

In [22]:
# Evaluation on Test Data after applying laplace
test_pred_lap = classifier_mb_lap.predict(test_tfidf)
accuracy_test_lap = np.mean(test_pred_lap == tweet_test.target)
accuracy_test_lap

0.7666374781085814

In [24]:
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred); only the crosstab, the last expression, is displayed
accuracy_score(tweet_test.target, test_pred_lap)

pd.crosstab(test_pred_lap, tweet_test.target)

| row_0 (predicted) | target = 0 | target = 1 |
|---|---|---|
| 0 | 1214 | 479 |
| 1 | 54 | 537 |


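### With stronger Laplace smoothing (alpha = 3), only the target 0 class improves.
Tweets with target 0 are detected more accurately (1214 of 1268, up from 1178), but overall test accuracy drops from 0.790 to 0.767 because more target 1 tweets are misclassified (479, up from 390).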