# Machine Learning
## Programming Assignment 6: Naive Bayes

Instructions:
The aim of this assignment is to give you hands-on experience with a real-life machine learning application.
You will be analyzing the sentiment of reviews using Naive Bayes classification.
You can only use the Python programming language and Jupyter Notebooks.
Please use procedural programming style and comment your code thoroughly.
This assignment has two parts. In Part 1, you can use NumPy, Pandas, Matplotlib, and any other standard Python libraries. You are not allowed to use NLTK, scikit-learn, or any other machine learning toolkit. You can only use scikit-learn in Part 2.

### Part 1: Implementing Naive Bayes classifier from scratch (60 points)

You are not allowed to use scikit-learn or any other machine learning toolkit for this part. You have to implement your own Naive Bayes classifier from scratch. You may use Pandas, NumPy, Matplotlib, and other standard Python libraries.

#### Problem:
The purpose of this assignment is to get you familiar with Naive Bayes classification. The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt], where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test-set example with unique id 200 and star rating 8/10 from IMDb.
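As an illustration of the naming convention described above, a file path can be split into its id, rating, and label as in the following minimal sketch. The helper name `parse_review_filename` is hypothetical and is not part of the assignment.

```python
import os

def parse_review_filename(path):
    """Split a path like 'test/pos/200_8.txt' into (review_id, rating, label)."""
    # The parent directory name ('pos' or 'neg') carries the binary label.
    label = os.path.basename(os.path.dirname(path))
    # The file stem has the form '<id>_<rating>'.
    stem, _ = os.path.splitext(os.path.basename(path))
    review_id, rating = stem.split("_")
    return int(review_id), int(rating), label

print(parse_review_filename("test/pos/200_8.txt"))  # (200, 8, 'pos')
```

The star rating is not needed for the binary classifier itself, but it can be handy for sanity-checking that a file sits in the expected `pos/` or `neg/` directory.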


In [1]:
## Here are the libraries you will need for this part.
import pandas as pd
import numpy as np
import scipy.spatial as sc
import matplotlib.pyplot as plt
import re
import random
%matplotlib inline

#### Task 1.1: Dataset (5 points)
Your task is to read the dataset and the stop words file into a useful data structure. Print out a few reviews and a few items from the stop word list; successfully doing this will earn you 5 points.

In [2]:
import os

# Function to read files from directory
def read_reviews(directory):
    reviews = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                reviews.append(file.read())
    return reviews

# Reading stop words
def read_stopwords(filepath):
    with open(filepath, 'r') as file:
        stopwords = file.read().splitlines()
    return stopwords

train_pos_reviews = read_reviews('C:/Users/HP/Desktop/Naive Bayes Data/train/pos')
train_neg_reviews = read_reviews('C:/Users/HP/Desktop/Naive Bayes Data/train/neg')
test_pos_reviews = read_reviews('C:/Users/HP/Desktop/Naive Bayes Data/test/pos')
test_neg_reviews = read_reviews('C:/Users/HP/Desktop/Naive Bayes Data/test/neg')
stopwords = read_stopwords('stop_words.txt')

print("Sample Train Positive Review:\n", train_pos_reviews[0])
print("\nSample Train Negative Review:\n", train_neg_reviews[0])
print("\nSample Test Positive Review:\n", test_pos_reviews[0])
print("\nSample Test Negative Review:\n", test_neg_reviews[0])
print("\nSample Stop Words: ", stopwords[:10])

Sample Train Positive Review:
 Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!

Sample Train Negative Review:
 Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. 

#### Task 1.2: Data Preprocessing (10 points)

In the preprocessing step, you are required to remove the stop words, punctuation marks, numbers, unwanted symbols, hyperlinks, and usernames from the reviews and convert them to lower case. You may find the string and re modules useful for this purpose. Use the stop word list provided with the assignment.

Print out a few random reviews from your dataset. If they conform to the rules mentioned above, you will gain 10 points.
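The rules above mention hyperlinks and usernames, which are best removed before punctuation is stripped (otherwise a URL collapses into one long letter-only token). The following is a minimal sketch of that step. The function name `strip_links_and_usernames` is hypothetical, and the two regular expressions are assumptions about what a URL and a username look like, not a definition given by the assignment.

```python
import re

def strip_links_and_usernames(text):
    """Remove URLs and @usernames before other punctuation is stripped."""
    # Assumed URL shape: anything starting with http://, https://, or www.
    text = re.sub(r'(https?://\S+|www\.\S+)', ' ', text)
    # Assumed username shape: '@' followed by letters, digits, or underscores
    text = re.sub(r'@\w+', ' ', text)
    # Collapse the extra spaces left behind
    return ' '.join(text.split())

sample = "Loved it! See https://example.com/review thanks @film_fan for the tip"
print(strip_links_and_usernames(sample))
# Loved it! See thanks for the tip
```

A function like this could be called at the start of the preprocessing routine, before lowercasing and punctuation removal.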

In [3]:
# Function to preprocess a single review
def preprocess_review(review, stopwords):
    # Use a set so each stop-word lookup is O(1) instead of scanning a list
    stopword_set = set(stopwords)
    # Convert to lowercase
    review = review.lower()
    # Remove punctuation, numbers, and symbols (keep only letters and whitespace)
    review = re.sub(r'[^a-z\s]', '', review)
    # Tokenize on whitespace and drop stop words
    words = review.split()
    filtered_words = [word for word in words if word not in stopword_set]

    return ' '.join(filtered_words)

# Preprocessing
train_pos_reviews_cleaned = [preprocess_review(review, stopwords) for review in train_pos_reviews]
train_neg_reviews_cleaned = [preprocess_review(review, stopwords) for review in train_neg_reviews]
test_pos_reviews_cleaned = [preprocess_review(review, stopwords) for review in test_pos_reviews]
test_neg_reviews_cleaned = [preprocess_review(review, stopwords) for review in test_neg_reviews]

# Display cleaned reviews
print("Sample Cleaned Positive Review:\n", train_pos_reviews_cleaned[0])
print("\nSample Cleaned Negative Review:\n", train_neg_reviews_cleaned[0])

Sample Cleaned Positive Review:
 bromwell high cartoon comedy ran time programs school life teachers years teaching profession lead believe bromwell highs satire much closer reality teachers scramble survive financially insightful students see right pathetic teachers pomp pettiness whole situation remind schools knew students saw episode student repeatedly tried burn school immediately recalled high classic line inspector im sack one teachers student welcome bromwell high expect many adults age think bromwell high far fetched pity isnt

Sample Cleaned Negative Review:
 story man unnatural feelings pig starts opening scene terrific example absurd comedy formal orchestra audience turned insane violent mob crazy chantings singers unfortunately stays absurd whole time general narrative eventually making putting even era turned cryptic dialogue would make shakespeare seem easy third grader technical level better might think good cinematography future great vilmos zsigmond future stars sally

#### Task 1.3: Splitting the dataset (5 points)

In this part, divide the given dataset into training and testing sets based on an 80-20 split using Python.
Print out the sizes of the training and test datasets. The training data should contain 40000 reviews and the test data should contain 10000 reviews. If your sizes are correct, you get full points.

In [4]:
# Print lengths of positive and negative reviews for train and test sets
print("Training Positive Reviews:", len(train_pos_reviews_cleaned))
print("Training Negative Reviews:", len(train_neg_reviews_cleaned))

print("Test Positive Reviews:", len(test_pos_reviews_cleaned))
print("Test Negative Reviews:", len(test_neg_reviews_cleaned))


Training Positive Reviews: 12500
Training Negative Reviews: 12500
Test Positive Reviews: 12500
Test Negative Reviews: 12500


In [5]:
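# Labels:
# 1 for positive
# 0 for negative
train_labels = [1] * len(train_pos_reviews_cleaned) + [0] * len(train_neg_reviews_cleaned)
test_labels = [1] * len(test_pos_reviews_cleaned) + [0] * len(test_neg_reviews_cleaned)

# Pool all 50,000 (review, label) pairs, shuffle them, and cut at 80%.
# The classifier cells below still train and test on the original 25k/25k folders.
all_reviews = (train_pos_reviews_cleaned + train_neg_reviews_cleaned
               + test_pos_reviews_cleaned + test_neg_reviews_cleaned)
all_data = list(zip(all_reviews, train_labels + test_labels))
random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(all_data)

split_index = int(0.8 * len(all_data))
train_data = all_data[:split_index]
test_data = all_data[split_index:]

print("Training Data Size (80%):", len(train_data))  # expected: 40000
print("Test Data Size (20%):", len(test_data))  # expected: 10000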

Training Data Size (80%): 40000
Test Data Size (20%): 10000


#### Task 1.4: Create Naive Bayes classifier (30 points)

You will create your own Naive Bayes classifier function by implementing the following algorithm.
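The assignment text does not spell the algorithm out. One common formulation, multinomial Naive Bayes with add-$\alpha$ (Laplace) smoothing, is sketched below as an assumption. Here $c$ is a class label, $\operatorname{count}(w, c)$ is the number of times word $w$ appears in training reviews of class $c$, $N_c$ is the total number of word tokens in class $c$, $|V|$ is the vocabulary size, and $w_1, \dots, w_n$ are the words of the review being classified.

```latex
\begin{align}
P(c) &= \frac{\text{number of training reviews with label } c}{\text{total number of training reviews}} \\
P(w \mid c) &= \frac{\operatorname{count}(w, c) + \alpha}{N_c + \alpha\,|V|} \\
\hat{c} &= \arg\max_{c \in \{\text{pos},\,\text{neg}\}} \Big[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big]
\end{align}
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long reviews.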

In [7]:
from collections import defaultdict
import numpy as np

# to build vocabulary and word frequency
def build_vocabulary(reviews):
    word_freq = defaultdict(int)
    for review in reviews:
        words = review.split()
        for word in words:
            word_freq[word] += 1
    return word_freq

pos_word_freq = build_vocabulary(train_pos_reviews_cleaned)
neg_word_freq = build_vocabulary(train_neg_reviews_cleaned)

total_pos_words = sum(pos_word_freq.values())
total_neg_words = sum(neg_word_freq.values())

vocabulary = set(list(pos_word_freq.keys()) + list(neg_word_freq.keys()))
print("Vocabulary Size:", len(vocabulary))

Vocabulary Size: 117710


In [8]:
# Calculate prior probabilities
P_positive = len(train_pos_reviews_cleaned) / (len(train_pos_reviews_cleaned) + len(train_neg_reviews_cleaned))
P_negative = len(train_neg_reviews_cleaned) / (len(train_pos_reviews_cleaned) + len(train_neg_reviews_cleaned))

print("P(Positive):", P_positive)
print("P(Negative):", P_negative)

P(Positive): 0.5
P(Negative): 0.5


In [9]:
alpha = 1  # Laplace (add-one) smoothing constant

def calculate_word_likelihood(word_freq, total_words, vocabulary_size, word):
    # .get avoids inserting unseen words into the defaultdict as a side effect
    return (word_freq.get(word, 0) + alpha) / (total_words + alpha * vocabulary_size)

def calculate_log_likelihood(review, word_freq, total_words, vocabulary_size):
    log_likelihood = 0
    words = review.split()
    for word in words:
        if word in vocabulary:  # Only consider words seen during training
            log_likelihood += np.log(calculate_word_likelihood(word_freq, total_words, vocabulary_size, word))
    return log_likelihood
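To see what the smoothing term does, here is a small worked example on a toy corpus. All counts and names in it (`pos_counts`, `neg_counts`, `smoothed_likelihood`) are invented for illustration and are unrelated to the IMDb data.

```python
import math
from collections import Counter

# Toy word counts per class, invented for illustration only
pos_counts = Counter({"great": 3, "fun": 1})
neg_counts = Counter({"boring": 2, "fun": 1})
vocab = set(pos_counts) | set(neg_counts)  # three distinct words
alpha = 1

def smoothed_likelihood(word, counts):
    """P(word | class) with add-alpha smoothing."""
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(vocab))

# 'boring' never occurs in the positive class, yet its probability is not zero
p_boring_pos = smoothed_likelihood("boring", pos_counts)  # (0 + 1) / (4 + 3)
p_great_pos = smoothed_likelihood("great", pos_counts)    # (3 + 1) / (4 + 3)
print(p_boring_pos, p_great_pos)

# Log scores for the one-word review "boring" with equal priors of 0.5
score_pos = math.log(0.5) + math.log(p_boring_pos)
score_neg = math.log(0.5) + math.log(smoothed_likelihood("boring", neg_counts))
print("predicted:", "pos" if score_pos > score_neg else "neg")
```

Without smoothing, the unseen word would contribute `log(0)` to the positive score, so a single unfamiliar word would rule out a class entirely.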

In [10]:
# Function to classify

def classify_review(review):
    pos_log_likelihood = np.log(P_positive) + calculate_log_likelihood(review, pos_word_freq, total_pos_words, len(vocabulary))
    neg_log_likelihood = np.log(P_negative) + calculate_log_likelihood(review, neg_word_freq, total_neg_words, len(vocabulary))
    
    if pos_log_likelihood > neg_log_likelihood:
        return 1  # Positive
    else:
        return 0  # Negative

In [12]:
# Classify all test data

predicted_labels = [classify_review(review) for review in test_pos_reviews_cleaned + test_neg_reviews_cleaned]

# Calculate accuracy
def calculate_accuracy(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Accuracy on the test set
accuracy = calculate_accuracy(predicted_labels, test_labels)
print("Accuracy on Test data:", accuracy)

Accuracy on Test data: 0.82456


#### Task 1.5: Implement evaluation functions (10 points)

Implement evaluation functions that calculate the following for your classifier on the test set:
- classification accuracy,
- F1 score,
- and the confusion matrix.


In [13]:
# calculate classification accuracy

def calculate_accuracy(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

In [14]:
# Function to calculate confusion matrix

def calculate_confusion_matrix(predicted, actual):
    TP = FP = TN = FN = 0
    for p, a in zip(predicted, actual):
        if p == 1 and a == 1:
            TP += 1
        elif p == 1 and a == 0:
            FP += 1
        elif p == 0 and a == 0:
            TN += 1
        elif p == 0 and a == 1:
            FN += 1
    return TP, FP, TN, FN

In [15]:
# calculate F1 Score, Precision, and Recall

def calculate_f1_score(predicted, actual):
    TP, FP, TN, FN = calculate_confusion_matrix(predicted, actual)

    precision = TP / (TP + FP) if (TP + FP) != 0 else 0
    recall = TP / (TP + FN) if (TP + FN) != 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) != 0 else 0

    return precision, recall, f1_score

predicted_labels = [classify_review(review) for review in test_pos_reviews_cleaned + test_neg_reviews_cleaned]

# Calculate accuracy
accuracy = calculate_accuracy(predicted_labels, test_labels)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Calculate confusion matrix
TP, FP, TN, FN = calculate_confusion_matrix(predicted_labels, test_labels)
print(f"Confusion Matrix:\nTP: {TP}, FP: {FP}, TN: {TN}, FN: {FN}")

# Calculate Precision, Recall, and F1 Score
precision, recall, f1 = calculate_f1_score(predicted_labels, test_labels)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1:.2f}")


Accuracy: 82.46%
Confusion Matrix:
TP: 9616, FP: 1502, TN: 10998, FN: 2884
Precision: 0.86, Recall: 0.77, F1 Score: 0.81
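As a cross-check, the reported metrics can be recomputed directly from the four confusion-matrix counts printed above. This is a minimal sketch that only reuses those printed numbers.

```python
# Counts copied from the confusion-matrix output above
TP, FP, TN, FN = 9616, 1502, 10998, 2884

accuracy = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```

The four counts sum to 25000, the size of the test set, and the recomputed values agree with the accuracy, precision, recall, and F1 score shown in the output.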


### Part 2:  Naive Bayes classifier using scikit-learn (40 points)

In this part, use scikit-learn’s CountVectorizer to transform your train and test sets into a bag-of-words representation, and its Naïve Bayes implementation to train and test a classifier on the provided dataset. Use scikit-learn’s accuracy_score function to calculate the accuracy and its confusion_matrix function to calculate the confusion matrix on the test set.
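Before running on the full dataset, it can help to see the bag-of-words pipeline on a handful of documents. The sketch below uses four toy sentences invented for illustration; the `toy_` names are hypothetical and chosen so they do not collide with the variables used for the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy documents and labels, invented for illustration (1 = positive, 0 = negative)
toy_train = ["great fun movie", "great acting", "boring movie", "boring plot"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_train)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index in the matrix
print(sorted(toy_vectorizer.vocabulary_))
print(X_toy.toarray())

toy_model = MultinomialNB()  # default alpha=1.0 is add-one smoothing
toy_model.fit(X_toy, toy_labels)

# The test documents must go through transform, not fit_transform,
# so that they share the column layout learned from the training set.
toy_pred = toy_model.predict(toy_vectorizer.transform(["great fun", "boring plot"]))
print(toy_pred)
```

Each row of the matrix counts how often every vocabulary word occurs in one document, which is the same information the from-scratch word-frequency dictionaries hold, just per document instead of per class.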

In [16]:
# Here are the libraries and specific functions you will be needing for this part

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [17]:
# Reuse the preprocessed reviews from Part 1 (e.g. 'train_pos_reviews_cleaned')

vectorizer = CountVectorizer(stop_words=stopwords)

X_train = vectorizer.fit_transform(train_pos_reviews_cleaned + train_neg_reviews_cleaned)
X_test = vectorizer.transform(test_pos_reviews_cleaned + test_neg_reviews_cleaned)

# Labels: 1 for positive, 0 for negative
y_train = [1] * len(train_pos_reviews_cleaned) + [0] * len(train_neg_reviews_cleaned)
y_test = [1] * len(test_pos_reviews_cleaned) + [0] * len(test_neg_reviews_cleaned)

In [18]:
# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier on the training data
nb_classifier.fit(X_train, y_train)

In [19]:
# Predict the labels for the test set
y_pred = nb_classifier.predict(X_test)

In [22]:
# Accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n")
print(conf_matrix)

# Classification report (Includes Precision, Recall, F1 Score)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))

Accuracy: 82.44%

Confusion Matrix:

[[10994  1506]
 [ 2885  9615]]

Classification Report:
              precision    recall  f1-score   support

    Negative       0.79      0.88      0.83     12500
    Positive       0.86      0.77      0.81     12500

    accuracy                           0.82     25000
   macro avg       0.83      0.82      0.82     25000
weighted avg       0.83      0.82      0.82     25000
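The matrix printed by scikit-learn uses a different layout from the TP/FP/TN/FN tuple in Part 1. With labels sorted as [0, 1], rows are true classes and columns are predicted classes. The following sketch unpacks the printed matrix and compares it with the Part 1 counts; the variable names `conf_matrix_sk`, `part1`, and `part2` are hypothetical.

```python
import numpy as np

# Matrix copied from the scikit-learn output above
conf_matrix_sk = np.array([[10994, 1506],
                           [2885, 9615]])

# Rows are true classes and columns are predictions (labels sorted as [0, 1]),
# so flattening row by row yields TN, FP, FN, TP.
tn, fp, fn, tp = conf_matrix_sk.ravel()
print(f"TP: {tp}, FP: {fp}, TN: {tn}, FN: {fn}")

# Counts reported by the from-scratch classifier in Part 1
part1 = {"TP": 9616, "FP": 1502, "TN": 10998, "FN": 2884}
part2 = {"TP": int(tp), "FP": int(fp), "TN": int(tn), "FN": int(fn)}
for key in part1:
    print(key, "difference:", part2[key] - part1[key])
```

The two classifiers disagree on only a handful of the 25000 test reviews, which matches the small gap between the two reported accuracies.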



***I got nearly the same accuracy with the from-scratch implementation (82.46%) and with the standard scikit-learn library (82.44%), which suggests my algorithms are working correctly.***