---

## **Machine Learning - II**

---

### **Gaussian, Bernoulli & Multinomial Naive Bayes on News Groups Dataset**

In [None]:
# Import necessary libraries
import numpy as np
from sklearn.datasets import fetch_20newsgroups # Importing the News Group dataset with 20 categories
from sklearn.feature_extraction.text import CountVectorizer # For text data preprocessing
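# CountVectorizer builds the vocabulary with feature indices in alphabetical order.
# It does not remove stop words (the, in, ...) unless stop_words is set; TfidfVectorizer is an alternative.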

In [None]:
# Importing Bernoulli and Multinomial Naive Bayes models
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

In [None]:
# Load the News Groups dataset
newsgroups = fetch_20newsgroups(subset='all') # Fetch all the data from the dataset

In [None]:
# Vectorization using CountVectorizer with binary and count settings
vectorizer1 = CountVectorizer(binary=True) # Binary representation: 1 if the word is present in the document, else 0 (e.g. 0 1 0 1 1 0)
vectorizer2 = CountVectorizer(binary=False) # Frequency representation: number of times each word occurs in the document (e.g. 0 1 2 1 4)

# The vocabulary is the set of unique tokens found in the corpus.
# For this dataset it has 173,762 entries (the number of columns in X1 and X2).
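
To make the difference between the two settings concrete, here is a minimal sketch on a two-sentence toy corpus. The sentences and variable names are made up for illustration and are not part of the News Groups data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not from the News Groups dataset)
docs = ["the cat sat on the mat", "the dog chased the cat"]

binary_vec = CountVectorizer(binary=True)   # presence / absence
count_vec = CountVectorizer(binary=False)   # raw word counts

binary_matrix = binary_vec.fit_transform(docs).toarray()
count_matrix = count_vec.fit_transform(docs).toarray()

# Vocabulary words ordered by their column index (alphabetical)
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

print(vocab)
print(binary_matrix)
print(count_matrix)
```

The word "the" occurs twice in each toy sentence, so its column holds 2 in the count matrix and 1 in the binary matrix. It also stays in the vocabulary, because no stop-word list was requested.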

In [None]:
vectorizer1

In [None]:
vectorizer2

In [None]:
# Transform the dataset into binary and frequency-based feature vectors
X1 = vectorizer1.fit_transform(newsgroups.data) # Binary features
X2 = vectorizer2.fit_transform(newsgroups.data) # Frequency-based features

In [None]:
X1

<18846x173762 sparse matrix of type '<class 'numpy.int64'>'
	with 2952534 stored elements in Compressed Sparse Row format>

In [None]:
X2

<18846x173762 sparse matrix of type '<class 'numpy.int64'>'
	with 2952534 stored elements in Compressed Sparse Row format>

In [None]:
# Target labels (categories of the news articles)
y = newsgroups.target
y

array([10,  3, 17, ...,  3,  1,  7])

In [None]:
# Split the data into training and testing sets for binary and frequency-based features
from sklearn.model_selection import train_test_split
xtrain1, xtest1, ytrain1, ytest1 = train_test_split(X1, y, test_size=0.25, random_state=42) # Binary features
xtrain2, xtest2, ytrain2, ytest2 = train_test_split(X2, y, test_size=0.25, random_state=42) # Frequency-based features
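
The cells above fit the vectorizers on the full corpus and split afterwards, so the vocabulary also contains words that only appear in test documents. A stricter variant, sketched below on made-up texts, splits the raw text first and fits the vocabulary on the training portion only. The toy texts and labels are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for newsgroups.data / newsgroups.target
texts = [
    "car engine oil", "new car tires", "engine repair shop", "used car dealer",
    "hockey game tonight", "goalie saved the puck", "hockey playoffs", "puck on ice",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw documents first
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Learn the vocabulary from the training documents only
vec = CountVectorizer()
X_train = vec.fit_transform(text_train)
X_test = vec.transform(text_test)  # words unseen in training are ignored

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns because `transform` reuses the vocabulary learned by `fit_transform`.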

In [None]:
# Initialize and train Bernoulli Naive Bayes model (for binary features)
bnb = BernoulliNB() # suited to binary features, i.e. each word is either present or absent
bnb.fit(xtrain1,ytrain1)
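
One detail worth knowing: `BernoulliNB` has a `binarize` parameter (default `0.0`) that thresholds the input itself, so feeding it raw counts gives the same fitted model as feeding it 0/1 features. The small arrays below are invented to illustrate this.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical count features for four documents and three words
counts = np.array([[2, 0, 1],
                   [0, 3, 0],
                   [1, 0, 4],
                   [0, 1, 0]])
labels = np.array([0, 1, 0, 1])

# The same features reduced to presence / absence
binary = (counts > 0).astype(int)

clf_counts = BernoulliNB().fit(counts, labels)   # binarized internally
clf_binary = BernoulliNB().fit(binary, labels)

same_model = np.allclose(clf_counts.feature_log_prob_,
                         clf_binary.feature_log_prob_)
print(same_model)
```

So the accuracy gap between the two models in this notebook comes from the model's assumptions and the information it uses, not from which matrix happened to be passed in.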

In [None]:
# Initialize and train Multinomial Naive Bayes model (for frequency-based features)
mnb = MultinomialNB() # suited to count features, i.e. how often each word occurs
mnb.fit(xtrain2, ytrain2)

In [None]:
# Predict on the test data for both models
y_pred1 = bnb.predict(xtest1) # Binary predictions
y_pred1

array([ 9, 12, 14, ...,  9,  3,  8])

In [None]:
y_pred2 = mnb.predict(xtest2) # Frequency-based predictions
y_pred2

array([ 9, 12, 14, ...,  9,  3,  8])

In [None]:
# Calculate the accuracy of both models
from sklearn.metrics import accuracy_score


In [None]:
# Accuracy of BernoulliNB
accuracy_score(ytest1,y_pred1)

0.6878183361629882

In [None]:
# Accuracy of MultinomialNB
accuracy_score(ytest2,y_pred2)

0.8469864176570459
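
For reference, `accuracy_score` is simply the fraction of test documents whose predicted label matches the true label. The sketch below checks this on invented label arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

acc = accuracy_score(y_true, y_pred)
manual = np.mean(y_true == y_pred)  # fraction of matching labels

print(acc, manual)
```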

***Comparing Bernoulli and Multinomial Naive Bayes on the results above:***
***MultinomialNB scored higher, which indicates that the model trained on frequency-based vectors achieves higher accuracy than the one trained on binary vectors.***

In [None]:
# Text processing using TFIDF (Term Frequency - Inverse Document Frequency) and Multinomial Naive Bayes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline # chains preprocessing and model steps into a single estimator

In [None]:
# Create a pipeline for TFIDF vectorization followed by Multinomial Naive Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
# TfidfVectorizer turns the raw text into TF-IDF vectors, which are then passed to MultinomialNB

In [None]:
# Fetch training and testing data separately
train_data = fetch_20newsgroups(subset='train')
test_data = fetch_20newsgroups(subset='test')

In [None]:
# Train the model on the training data
model.fit(train_data.data, train_data.target)
# fit takes the inputs (X) and the labels (y), as in any supervised learning task

In [None]:
# Predict on the test data using the trained pipeline
predictions_tf = model.predict(test_data.data)
# the raw test posts go through the pipeline: they are vectorized first, then classified

In [None]:
# Calculate the accuracy of the TFIDF-MultinomialNB model
accuracy_score(test_data.target,predictions_tf)

0.7738980350504514
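
The TF-IDF model above was evaluated on the dataset's predefined train/test split, while the two CountVectorizer models used a random 75/25 split of the full dataset. To compare the three setups fairly, they would need to share one split. A minimal sketch of such a comparison follows. The helper name `compare_on_same_split` and the toy texts are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_on_same_split(train_texts, train_labels, test_texts, test_labels):
    """Fit each vectorizer/model pair on the same training texts and
    return its accuracy on the same test texts."""
    candidates = {
        "BernoulliNB + binary counts": make_pipeline(
            CountVectorizer(binary=True), BernoulliNB()),
        "MultinomialNB + counts": make_pipeline(
            CountVectorizer(), MultinomialNB()),
        "MultinomialNB + TF-IDF": make_pipeline(
            TfidfVectorizer(), MultinomialNB()),
    }
    scores = {}
    for name, pipeline in candidates.items():
        pipeline.fit(train_texts, train_labels)
        scores[name] = accuracy_score(test_labels, pipeline.predict(test_texts))
    return scores


# Hypothetical toy data, only to show the call
train_texts = ["the engine and brakes of my car", "new tires for the car",
               "the goalie saved the puck", "hockey playoffs start tonight"]
train_labels = ["autos", "autos", "hockey", "hockey"]
test_texts = ["car engine trouble", "the puck hit the goalie"]
test_labels = ["autos", "hockey"]

scores = compare_on_same_split(train_texts, train_labels, test_texts, test_labels)
for name, score in scores.items():
    print(name, score)
```

On the real data, the same function could be called with `train_data.data`, `train_data.target`, `test_data.data` and `test_data.target`.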



---


### **Results**


---


The accuracy scores for the different models and feature extraction methods are as follows:

- **BernoulliNB with Binary CountVectorizer:** 0.6878
- **MultinomialNB with Frequency CountVectorizer:** 0.8470
- **MultinomialNB with TF-IDF Vectorizer:** 0.7739


---


### **Conclusion**


---


1. **Multinomial Naive Bayes** with frequency-based CountVectorizer performs the best with an accuracy of **0.8470**.
2. **Bernoulli Naive Bayes**, which uses binary features (presence or absence of words), reaches a lower accuracy of **0.6878**.
3. **TF-IDF with Multinomial Naive Bayes** scores **0.7739**, higher than the binary approach but about 7 percentage points below the frequency-based approach. This model was trained and tested on the dataset's predefined train/test split, whereas the other two used a random 75/25 split of the full dataset, so the scores are not directly comparable.


---


### **Interpretation**


---


- The **frequency-based CountVectorizer** approach gives the highest accuracy on this dataset, likely because word counts retain how strongly a document uses each term, which matters in text classification.
- **BernoulliNB**, which uses binary features, discards how often each word occurs, which may explain its lower accuracy.
- **TF-IDF** reduces the weight of words that occur in many documents, but it scored lower than simple frequency counts here. Part of that gap may come from the different evaluation split rather than from the vectorizer itself.
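
The down-weighting of common words by TF-IDF can be seen directly in the inverse document frequency values that `TfidfVectorizer` learns. The sketch below uses three made-up sentences and the default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" is in every document, "of" in two, the rest in one
docs = ["the engine of the car",
        "the goalie saved the puck",
        "the orbit of the shuttle"]

tfidf = TfidfVectorizer()
tfidf.fit(docs)

# Map each vocabulary word to its learned IDF weight
idf_by_word = {word: tfidf.idf_[index]
               for word, index in tfidf.vocabulary_.items()}

for word in ("the", "of", "engine"):
    print(word, round(idf_by_word[word], 3))
```

The more documents a word appears in, the smaller its IDF weight, so "the" gets the lowest weight and a word found in a single document gets the highest.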

This practical demonstrates the importance of selecting the right vectorization technique based on the nature of the data and the problem at hand.


---

