# **NAIVE BAYES ML MODEL**

* Naive Bayes is a simple supervised machine learning algorithm.
* Naive Bayes is a family of probabilistic algorithms primarily used for classification problems.
* It is a classification method; the variants described here are not used for regression tasks.
* The algorithm is based on applying `Bayes' theorem` and assuming `conditional independence` between features given the class label.
* It's a simple yet effective method for many classification tasks, especially with text data (e.g., spam filtering or sentiment analysis).
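
The two ideas above (Bayes' theorem plus conditional independence) can be written out as a short derivation. The symbols are notation chosen here for illustration: $y$ is the class label and $x_1, \dots, x_n$ are the feature values.

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
$$

Assuming the features are conditionally independent given the class, the likelihood factorizes:

$$
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
$$

The denominator does not depend on $y$, so the predicted class is:

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The variants described next differ only in how they model $P(x_i \mid y)$.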

## There are three main types of Naive Bayes:
1. **Gaussian Naive Bayes**
   * This algorithm is used for `continuous numerical features` that are assumed to follow a `normal distribution` (also known as Gaussian distribution).
   * In Gaussian Naive Bayes, the likelihood probability P(features | class) is modeled using the normal distribution with a mean and variance estimated from the training data.
2. **Multinomial Naive Bayes**
   * This algorithm is used for `discrete count data` such as `word counts` in text classification. 
   * In Multinomial Naive Bayes, the likelihood probability P(features | class) is modeled using the `Multinomial distribution`, which models the probability of observing a feature count given the class label.
3. **Bernoulli Naive Bayes**
   * This algorithm is used for `binary features (0 or 1)`, such as whether a word is `present or absent` in a document, rather than how many times it occurs.
   * In Bernoulli Naive Bayes, the likelihood probability P(features | class) is modeled using the `Bernoulli distribution`, which treats each feature as a binary variable and explicitly accounts for the absence of a feature as well as its presence.
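
To make the Gaussian case concrete, here is a minimal NumPy sketch of fitting per-class means and variances and then scoring new points. It is an illustration under stated assumptions, not scikit-learn's implementation: the function names `fit_gaussian_nb` and `predict_gaussian_nb`, the toy data, and the small variance-smoothing constant are all chosen here for the example.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate class priors plus per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Assumed smoothing: add a tiny constant so no variance is exactly zero
    eps = var_smoothing * X.var(axis=0).max()
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Return the class maximizing log P(c) + sum_i log N(x_i | mean, variance)."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]          # shape (n_samples, n_classes, n_features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None]).sum(axis=2)
    return classes[np.argmax(np.log(priors)[None, :] + log_lik, axis=1)]

# Toy data (hypothetical): two well-separated classes with two features each
X_toy = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = fit_gaussian_nb(X_toy, y_toy)
toy_pred = predict_gaussian_nb(model, np.array([[1.1, 2.0], [3.1, 4.0]]))
print(toy_pred)  # [0 1]
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.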


<img src="https://www.scribbr.com/wp-content/uploads/2023/02/standard-normal-distribution-example.webp" alt="Gaussian distribution" width="20%">

<img src="https://www.researchgate.net/profile/Peter-Hall-20/publication/266659404/figure/fig4/AS:806568463458306@1569312315640/A-Multinomial-Distribution_W640.jpg" alt="Multinomial Distribution" width="35%">

<img src="https://www.tutorialspoint.com/python/images/bernoulli.png" alt="Bernoulli Distribution" width="20%">
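
The difference between the Multinomial and Bernoulli likelihoods is easiest to see on a single toy document. The sketch below uses only the standard library; the vocabulary, the training statistics for a hypothetical "spam" class, and the choice of Laplace smoothing with `alpha = 1` are assumptions made for illustration.

```python
import math

alpha = 1  # Laplace smoothing (assumed)
vocab = ["free", "money", "meeting"]

# Hypothetical training statistics for one class ("spam")
word_counts = [3, 2, 0]     # total occurrences of each word across spam documents
docs_with_word = [3, 2, 0]  # number of spam documents containing each word
n_docs = 3                  # number of spam documents

# Multinomial: theta_i = (count_i + alpha) / (total count + alpha * |V|)
total = sum(word_counts)
theta = [(c + alpha) / (total + alpha * len(vocab)) for c in word_counts]

# Bernoulli: p_i = (documents containing word i + alpha) / (n_docs + 2 * alpha)
p = [(d + alpha) / (n_docs + 2 * alpha) for d in docs_with_word]

doc_counts = [2, 1, 0]                         # the document "free free money"
doc_binary = [int(c > 0) for c in doc_counts]  # presence/absence view of the same document

# Multinomial log-likelihood, omitting the multinomial coefficient (identical for every class)
log_lik_multinomial = sum(c * math.log(t) for c, t in zip(doc_counts, theta))

# Bernoulli log-likelihood: an absent word contributes log(1 - p_i)
log_lik_bernoulli = sum(
    math.log(pi) if present else math.log(1 - pi)
    for present, pi in zip(doc_binary, p)
)

print("theta:", theta)
print("p:", p)
print("multinomial:", log_lik_multinomial)
print("bernoulli:", log_lik_bernoulli)
```

Note that the Multinomial score ignores words with a count of zero, while the Bernoulli score includes a term for every vocabulary word, present or not.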

**Complement Naive Bayes:**

This algorithm is an `extension of Multinomial Naive Bayes` that is designed `to handle imbalanced datasets`, where some classes have far fewer observations than others.

In Complement Naive Bayes, the parameters for each class are estimated from `the complement of that class`, that is, from the training samples of all other classes. A sample is then assigned to the class whose complement matches it least.
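
As a rough sketch of the complement idea: for each class, feature counts are gathered from every other class, smoothed, and turned into log weights; a sample goes to the class whose complement weights score it lowest. The helper names `fit_complement_weights` and `predict_complement`, the toy count matrix, and `alpha = 1.0` are assumptions for illustration, and the sketch omits optional refinements such as weight normalization.

```python
import numpy as np

def fit_complement_weights(X, y, alpha=1.0):
    """Log weights estimated from the complement (all other classes) of each class."""
    classes = np.unique(y)
    # Per-class feature totals, shape (n_classes, n_features)
    class_counts = np.array([X[y == c].sum(axis=0) for c in classes])
    # Feature totals over every class except c, plus smoothing
    comp_counts = class_counts.sum(axis=0) - class_counts + alpha
    weights = np.log(comp_counts / comp_counts.sum(axis=1, keepdims=True))
    return classes, weights

def predict_complement(classes, weights, X):
    """Assign each sample to the class whose complement explains it worst."""
    return classes[np.argmin(X @ weights.T, axis=1)]

# Toy count data (hypothetical): three features, imbalanced classes (2 vs. 3 samples)
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 3, 1], [0, 2, 0], [0, 4, 1]])
y_counts = np.array([0, 0, 1, 1, 1])

cnb_classes, cnb_weights = fit_complement_weights(X_counts, y_counts)
cnb_toy_pred = predict_complement(cnb_classes, cnb_weights, np.array([[2, 0, 0], [0, 3, 0]]))
print(cnb_toy_pred)  # [0 1]
```

Because each class's weights come from the pooled data of all other classes, a rare class is estimated from more data than it would be under plain Multinomial Naive Bayes.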

# Example

In [32]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

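# Initialize the Naive Bayes models
gnb = GaussianNB()     # suits continuous features such as the iris measurements
mnb = MultinomialNB()  # requires non-negative features; the iris measurements qualify
bnb = BernoulliNB()    # default binarize=0.0 maps every positive iris value to 1
cnb = ComplementNB()   # also requires non-negative features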

# Train the models on the training set
gnb.fit(X_train, y_train)
mnb.fit(X_train, y_train)
bnb.fit(X_train, y_train)
cnb.fit(X_train, y_train)

# Make predictions on the testing set using each model
gnb_pred = gnb.predict(X_test)
mnb_pred = mnb.predict(X_test)
bnb_pred = bnb.predict(X_test)
cnb_pred = cnb.predict(X_test)

# Calculate the accuracy scores for each model
gnb_score = accuracy_score(y_test, gnb_pred)
mnb_score = accuracy_score(y_test, mnb_pred)
bnb_score = accuracy_score(y_test, bnb_pred)
cnb_score = accuracy_score(y_test, cnb_pred)

# Print the accuracy scores
print("Gaussian Naive Bayes accuracy:", gnb_score)
print("Multinomial Naive Bayes accuracy:", mnb_score)
print("Bernoulli Naive Bayes accuracy:", bnb_score)
print("Complement Naive Bayes accuracy:", cnb_score)

# Select the best model based on the accuracy score
best_model = max([(gnb_score, 'Gaussian'), (mnb_score, 'Multinomial'), (bnb_score, 'Bernoulli'), (cnb_score, 'Complement')])
# print a separating line in output
print("---------------------------------")

print("Best model:", best_model[1])
print("Best accuracy:", best_model[0])


Gaussian Naive Bayes accuracy: 0.9777777777777777
Multinomial Naive Bayes accuracy: 0.9555555555555556
Bernoulli Naive Bayes accuracy: 0.28888888888888886
Complement Naive Bayes accuracy: 0.7111111111111111
---------------------------------
Best model: Gaussian
Best accuracy: 0.9777777777777777


In [37]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets (this run holds out 20% instead of 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Naive Bayes models
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()
cnb = ComplementNB()

# Train the models on the training set
gnb.fit(X_train, y_train)
mnb.fit(X_train, y_train)
bnb.fit(X_train, y_train)
cnb.fit(X_train, y_train)

# Make predictions on the testing set using each model
gnb_pred = gnb.predict(X_test)
mnb_pred = mnb.predict(X_test)
bnb_pred = bnb.predict(X_test)
cnb_pred = cnb.predict(X_test)

# Calculate the accuracy scores for each model
gnb_score = accuracy_score(y_test, gnb_pred)
mnb_score = accuracy_score(y_test, mnb_pred)
bnb_score = accuracy_score(y_test, bnb_pred)
cnb_score = accuracy_score(y_test, cnb_pred)

# Print the accuracy scores
print("Gaussian Naive Bayes accuracy:", gnb_score)
print("Multinomial Naive Bayes accuracy:", mnb_score)
print("Bernoulli Naive Bayes accuracy:", bnb_score)
print("Complement Naive Bayes accuracy:", cnb_score)

# Select the best model based on the accuracy score
best_model = max([(gnb_score, 'Gaussian'), (mnb_score, 'Multinomial'), (bnb_score, 'Bernoulli'), (cnb_score, 'Complement')])
# print a separating line in output
print("---------------------------------")

print("Best model:", best_model[1])
print("Best accuracy:", best_model[0])


Gaussian Naive Bayes accuracy: 1.0
Multinomial Naive Bayes accuracy: 0.9
Bernoulli Naive Bayes accuracy: 0.3
Complement Naive Bayes accuracy: 0.7
---------------------------------
Best model: Gaussian
Best accuracy: 1.0
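
The near-chance Bernoulli scores above have a simple explanation: `BernoulliNB` binarizes its input at a threshold (`binarize=0.0` by default), and every iris measurement is greater than zero, so all features become 1 and every sample looks identical to the model. The sketch below checks this and tries one possible alternative, thresholding each feature at its training-set median. The median choice is an assumption made here for illustration, not a recommended recipe.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# With the default threshold of 0.0, every (strictly positive) measurement maps to 1
all_ones = bool((X > 0.0).all())
print("All features binarize to 1:", all_ones)

# Identical inputs give identical predictions, so only one class is ever predicted
default_pred = BernoulliNB().fit(X_train, y_train).predict(X_test)
print("Distinct predicted classes:", np.unique(default_pred))

# Illustrative alternative: binarize each feature at its training-set median
medians = np.median(X_train, axis=0)
X_train_bin = (X_train > medians).astype(int)
X_test_bin = (X_test > medians).astype(int)

# binarize=None tells BernoulliNB the input is already binary
bnb_median = BernoulliNB(binarize=None).fit(X_train_bin, y_train)
median_score = accuracy_score(y_test, bnb_median.predict(X_test_bin))
print("Accuracy with median thresholds:", median_score)
```

Even with better thresholds, reducing continuous measurements to single bits discards information, which is why Gaussian Naive Bayes is the more natural fit for this dataset.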


# **Text Classification:**

In [38]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the 20 Newsgroups dataset
categories = ['alt.atheism', 'talk.religion.misc']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, test_size=0.3, random_state=42)

# Convert the text data into feature vectors using bag-of-words model
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Transform the feature vectors using TF-IDF weighting
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# Train the Naive Bayes model on the training set
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# Make predictions on the testing set
y_pred = clf.predict(X_test_tfidf)

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8181818181818182
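
The vectorize, weight, and classify steps above can also be chained into a single scikit-learn pipeline, which keeps the training and test transformations in sync. The sketch below uses `TfidfVectorizer`, which combines `CountVectorizer` and `TfidfTransformer` in one step. To stay self-contained it trains on a tiny made-up corpus rather than 20 Newsgroups; the texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam, 0 = not spam
train_texts = [
    "free prize money now",
    "win money free offer",
    "meeting schedule tomorrow",
    "project meeting agenda",
]
train_labels = [1, 1, 0, 0]

# One object that vectorizes, applies TF-IDF weighting, and classifies
text_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_clf.fit(train_texts, train_labels)

# Raw strings go in; the pipeline applies the fitted vocabulary and weights
pipeline_pred = text_clf.predict(["free money", "meeting agenda"])
print(pipeline_pred)  # [1 0]
```

With a pipeline, the vocabulary and IDF weights are learned only from the data passed to `fit`, so there is no risk of accidentally calling `fit_transform` on the test set.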


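# Multi-Class Text Classification

The 20 Newsgroups labels are topics, not sentiment, so this example classifies posts into four newsgroups.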

In [39]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian'], shuffle=True, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, test_size=0.3, random_state=42)

# Convert the text data into feature vectors using bag-of-words model
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,2))
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Transform the feature vectors using TF-IDF weighting
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# Train the Naive Bayes model on the training set
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# Make predictions on the testing set
y_pred = clf.predict(X_test_tfidf)

# Calculate the accuracy score and classification report
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification report:")
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))


Accuracy: 0.9539007092198581
Classification report:
                        precision    recall  f1-score   support

           alt.atheism       0.99      0.90      0.94       252
         comp.graphics       0.97      0.99      0.98       295
               sci.med       0.97      0.94      0.96       299
soc.religion.christian       0.89      0.98      0.94       282

              accuracy                           0.95      1128
             macro avg       0.96      0.95      0.95      1128
          weighted avg       0.96      0.95      0.95      1128



# Classify your own text

In [42]:
# Re-fit the Naive Bayes model on the training features from the previous cell
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# Define your own text
my_text = "The movie was sick."

# Convert your text into a feature vector using the same CountVectorizer and TfidfTransformer objects used for training
my_text_counts = vectorizer.transform([my_text])
my_text_tfidf = tfidf_transformer.transform(my_text_counts)

# Make a prediction on your text
my_prediction = clf.predict(my_text_tfidf)

# The labels are indices into newsgroups.target_names (topics, not sentiment),
# so print the predicted newsgroup rather than "Positive" or "Negative"
print("Predicted category:", newsgroups.target_names[my_prediction[0]])
