<a href="https://colab.research.google.com/github/SwappyCodes/SwappyCodes/blob/main/Untitled7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Naive Bayes

#Objectives

Setup

Importing Required Libraries

Load the Data

Conduct an Exploratory Data Analysis

Split Your Data

Preprocess the Data

Optimize and Evaluate Your Multinomial Naive Bayes Model

Other Naive Bayes Classification Models

Gaussian Naive Bayes

In [1]:
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
import seaborn as sns
import re
import os, types

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, precision_score, recall_score, accuracy_score, balanced_accuracy_score, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB


In [2]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX09X7EN/SMSSpamCollection', sep='\t', header=None)

In [3]:
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
df.sample(5)


Unnamed: 0,0,1
4127,spam,"SPJanuary Male Sale! Hot Gay chat now cheaper,..."
3071,ham,I'm now but have to wait till 2 for the bus to...
2487,ham,I dont thnk its a wrong calling between us
3060,ham,"Dear all, as we know &lt;#&gt; th is the &lt..."
2661,ham,Want to finally have lunch today?


In [5]:
df = df.rename(columns={0: "classification", 1: "text"})

In [6]:
df.head()


Unnamed: 0,classification,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


#Step 2: Conduct an exploratory data analysis
Before preprocessing, it’s always good to organize the data and examine it for any underlying issues, such as missing or duplicate data. Plotting the data can also help us to see if the data is balanced.

Let's start by recoding our class labels from their categorical form ("ham" and "spam") to a numerical format, using 0 for ham and 1 for spam.


In [7]:
df['target'] = np.where(df['classification'] == 'ham', 0, 1)


While there is no missing data, there is duplicate data in this data set: dropping duplicates reduces it from 5,572 rows to 5,169. Additionally, the class distribution is uneven, with spam representing only about 13% of the data. So, we have an imbalanced dataset, which can be a concern because a model can reach high accuracy by favoring the majority class while still missing much of the spam. This does not mean that our model will perform poorly, but it is good to be aware of it upfront in case we need to use over-sampling (also known as upsampling) or under-sampling (also known as downsampling) techniques. With this in mind, we'll drop the duplicate data and proceed with training our model on this imbalanced distribution first.
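The three checks described above (missing values, duplicates, class balance) can be sketched on a small frame. This is only an illustration: the `toy` DataFrame and its rows are made up, and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Toy stand-in for the SMS data; these rows are invented for illustration.
toy = pd.DataFrame({
    "classification": ["ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"],
    "text": ["ok lar", "see you at 5", "win a free prize", "ok lar",
             "call me later", "win a free prize", "lunch today?", "on my way"],
})

missing_per_column = toy.isnull().sum()       # missing values in each column
n_duplicates = int(toy.duplicated().sum())    # rows identical to an earlier row
class_share = toy["classification"].value_counts(normalize=True)  # fraction per class

print(missing_per_column)
print("Duplicate rows:", n_duplicates)
print(class_share)
```

Running the same three expressions on `df` should give the numbers discussed in the text.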


In [8]:
df.info()

df_no_dup = df.drop_duplicates()
df_no_dup.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   classification  5572 non-null   object
 1   text            5572 non-null   object
 2   target          5572 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 130.7+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 5169 entries, 0 to 5571
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   classification  5169 non-null   object
 1   text            5169 non-null   object
 2   target          5169 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 161.5+ KB


#Step 3: Split your data
Next, we will split our data set into two groups, a training set and a test set. The training data will help us train our Naive Bayes model and our test data will help us to evaluate its performance. The test data set will be 30% of our initial data set, but you can adjust this by changing the value of the test_size parameter.



In [9]:
X_train, X_test, Y_train, Y_test = train_test_split(df_no_dup['text'],
                                                    df_no_dup['target'],
                                                    test_size=0.3,
                                                    random_state=0)
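Because the classes are imbalanced, one optional variation is to stratify the split so that both sets keep the same spam/ham ratio. The split above does not do this; the sketch below is an assumption about how it could be done, using synthetic labels rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 80 "ham" (0) and 20 "spam" (1); values are made up.
labels = np.array([0] * 80 + [1] * 20)
texts = np.array([f"message {i}" for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.3,      # 30% of the rows go to the test set
    random_state=0,     # fixed seed so the split is reproducible
    stratify=labels,    # keep the spam/ham ratio the same in both splits
)

print(len(X_tr), len(X_te))
print("spam share in train:", y_tr.mean(), "test:", y_te.mean())
```

With `stratify` set, both splits keep the 20% spam share of the full label array instead of leaving it to chance.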

#Step 4: Preprocess the data
After we split the data, we can start preprocessing it. This includes natural language processing tasks, such as tokenization, stop-word removal, stemming, and lemmatization.

Then, we will use a popular feature extraction technique, called bag-of-words, to turn the text into numeric features. This technique represents each document by the frequency of the words it contains, ignoring word order, which can help us classify documents, assuming that similar documents have similar content.

We can use sklearn's CountVectorizer or TfidfVectorizer to do the heavy lifting for us here. For the purposes of this tutorial, we will only show the code for TfidfVectorizer, but you can find the full code in the notebook on GitHub.
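To make the difference between the two vectorizers concrete, here is a minimal sketch on two made-up documents. `CountVectorizer` produces raw word counts, while `TfidfVectorizer` reweights the same columns and, with its default settings, scales each row to unit length.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration.
docs = ["free entry win", "win win prize"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)      # sparse matrix of raw word counts

# vocabulary_ maps each term to its column index; sort terms by that index
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)
print(counts.toarray())

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)     # same columns, TF-IDF weighted
print(weights.toarray().round(2))
```

Each row is one document and each column is one vocabulary term, so "win win prize" gets a count of 2 in the "win" column.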


In [10]:
def text_clean(text, method, rm_stop):
    text = re.sub(r"\n", "", text)   #remove line breaks
    text = text.lower()   #convert to lowercase
    text = re.sub(r'https?://\S+', '', text)   #remove hyperlinks (must run before punctuation is stripped)
    text = re.sub(r'\d+[\.\/-]\d+[\.\/-]\d+', '', text)   #remove dates (must run before digits are stripped)
    text = re.sub(r'[\$\d]', '', text)   #remove digits and dollar signs
    text = re.sub(r'[^\x00-\x7f]', ' ', text)   #replace non-ascii characters with a space
    text = re.sub(r'[^\w\s]', '', text)   #remove punctuation

    #remove stop words
    if rm_stop:
        stop_words = set(stopwords.words('english'))   #build the set once instead of once per token
        filtered_tokens = [word for word in word_tokenize(text) if word not in stop_words]
        text = " ".join(filtered_tokens)

    #lemmatization: typically preferred over stemming
    if method == 'L':
        lemmer = WordNetLemmatizer()
        lemm_tokens = [lemmer.lemmatize(word) for word in word_tokenize(text)]
        return " ".join(lemm_tokens)

    #stemming
    if method == 'S':
        porter = PorterStemmer()
        stem_tokens = [porter.stem(word) for word in word_tokenize(text)]
        return " ".join(stem_tokens)

    return text
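Each substitution in a cleaning function sees the output of the previous one, so the order of the steps changes the result. The sketch below illustrates this with two hypothetical helpers (`strip_punct_first` and `strip_url_first` are not part of the notebook) and a made-up message containing an `example.com` URL.

```python
import re

# Made-up message for illustration.
raw = "WIN cash now!! Visit http://example.com/win before 12/05/2024"

def strip_punct_first(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)        # punctuation is removed first...
    text = re.sub(r"https?://\S+", "", text)   # ...so "://" is gone and this never matches
    return text

def strip_url_first(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # URL removed while its punctuation is intact
    text = re.sub(r"[^\w\s]", "", text)        # then strip the remaining punctuation
    return text

print(strip_punct_first(raw))
print(strip_url_first(raw))
```

In the first helper the URL survives as one long junk token, which would end up as a feature in the vocabulary. The same reasoning applies to dates: a date pattern can only match while the digits are still present.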

#Step 5: Optimize and evaluate your Multinomial Naive Bayes model
Because we can process the data in a number of ways, we should model different versions of the preprocessed data to understand which variation gives the best results with our model. For this use case, we will use Multinomial Naive Bayes, the Naive Bayes variant most commonly used for document classification. It is designed for discrete features, such as word frequency counts, and in practice it also works with fractional weights such as TF-IDF, which is why it is typically applied in natural language processing use cases.

After we apply the Multinomial Naive Bayes model to our different variants of training data, we can evaluate the performance of the estimator using the test data.
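As a rough illustration of what Multinomial Naive Bayes computes, here is a from-scratch sketch on made-up, pre-tokenized messages. It is not how sklearn is implemented internally; it only applies the same smoothed estimate, P(word | class) = (count + alpha) / (total + alpha * vocabulary size), with `alpha = 0.1` chosen to match the value used later in this notebook. The helper names `log_posterior` and `predict` are hypothetical.

```python
import math
from collections import Counter

# Made-up training messages, already tokenized; labels: 1 = spam, 0 = ham.
train = [
    (["win", "free", "prize"], 1),
    (["free", "entry", "win"], 1),
    (["lunch", "today"], 0),
    (["see", "you", "today"], 0),
]
alpha = 0.1  # additive smoothing, so unseen word/class pairs never get probability 0

vocab = sorted({w for tokens, _ in train for w in tokens})
word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for tokens, label in train:
    word_counts[label].update(tokens)
    doc_counts[label] += 1

def log_posterior(tokens, label):
    """Unnormalized log P(label | tokens) under multinomial Naive Bayes."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))   # log prior of the class
    for w in tokens:
        if w not in vocab:
            continue                                    # ignore words never seen in training
        p = (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
        score += math.log(p)
    return score

def predict(tokens):
    return max((0, 1), key=lambda label: log_posterior(tokens, label))

print(predict(["free", "prize", "today"]))
print(predict(["lunch", "today"]))
```

Working in log space avoids underflow when many small probabilities are multiplied, and a smaller `alpha` makes the model trust the observed counts more.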

In [11]:
def transform_model_data_w_tfidf_vectorizer(preprocessed_text, Y_train,  X_test, Y_test):
    #vectorize dataset
    tfidf = TfidfVectorizer()
    vectorized_data = tfidf.fit_transform(preprocessed_text)

    #define model
    model = MultinomialNB(alpha=0.1)
    model.fit(vectorized_data, Y_train)

    #evaluate model
    predictions = model.predict(tfidf.transform(X_test))

    accuracy = accuracy_score(Y_test, predictions)
    balanced_accuracy = balanced_accuracy_score(Y_test, predictions)
    precision = precision_score(Y_test, predictions)

    print("Accuracy:",round(100*accuracy,2),'%')
    print("Balanced accuracy:",round(100*balanced_accuracy,2),'%')
    print("Precision:", round(100*precision,2),'%')
    return predictions


X_test_preprocessed = X_test.apply(lambda x: text_clean(x, 'L', False))
X_train_preprocessed = X_train.apply(lambda x: text_clean(x, 'L', False))
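The three metrics printed by the evaluation function behave differently on imbalanced data. The sketch below computes them by hand for the binary case on made-up labels, so it is a worked example of the definitions rather than a replacement for the sklearn functions.

```python
# Made-up labels: 1 = spam, 0 = ham. The classifier catches one of two spam messages.
y_true = [0] * 8 + [1, 1]
y_pred = [0] * 8 + [1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)   # spam predicted as spam
tn = sum(t == 0 and p == 0 for t, p in pairs)   # ham predicted as ham
fp = sum(t == 0 and p == 1 for t, p in pairs)   # ham predicted as spam
fn = sum(t == 1 and p == 0 for t, p in pairs)   # spam predicted as ham

accuracy = (tp + tn) / len(y_true)
recall_spam = tp / (tp + fn)
recall_ham = tn / (tn + fp)
balanced_accuracy = (recall_spam + recall_ham) / 2   # mean of per-class recall
precision = tp / (tp + fp)                            # share of spam predictions that are right

print("Accuracy:", accuracy)
print("Balanced accuracy:", balanced_accuracy)
print("Precision:", precision)
```

Accuracy looks strong here because ham dominates, while balanced accuracy exposes that half of the spam was missed. Precision only says that the messages flagged as spam really were spam.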
