<h1 style="text-align: center;">Sentiment Analysis on Amazon Reviews 📊</h1>

## Objective

The rapid growth of e-commerce, accelerated significantly during and after the COVID-19 pandemic, has reshaped consumer purchasing behaviors for both essential and non-essential goods. This shift has resulted in an overwhelming increase in online customer reviews, offering businesses a wealth of insights into customer satisfaction, product performance, and potential areas for improvement. However, the sheer volume of these reviews makes manual analysis infeasible for organizations striving to understand and act on customer sentiments effectively.

Sentiment analysis has emerged as an essential solution, leveraging Natural Language Processing (NLP) and machine learning techniques to automatically identify and classify opinions expressed in text. This research explores the application of these techniques to analyze e-commerce reviews, aiming to uncover actionable insights at scale. By automating sentiment analysis, businesses can enhance customer experiences, personalize offerings, and make informed, data-driven decisions that align with evolving customer preferences. This study addresses the challenges of large-scale sentiment analysis and highlights its potential for improving business strategies in the dynamic e-commerce landscape.

## Data Description

The dataset used in this project is titled **Amazon Product Reviews** and was sourced from both Kaggle and the University of San Diego’s website. It is a publicly available dataset under the **CC0 1.0 Universal license**, which means it is free to use, share, and adapt without legal restrictions. The dataset can be accessed through [this Kaggle link](https://www.kaggle.com/datasets/arhamrumi/amazon-product-reviews/data).

### Dataset Structure

The dataset comprises the following fields:

1. **Id**: A unique identifier for each review entry.
2. **ProductId**: A unique identifier for the product being reviewed.
3. **UserId**: A unique identifier for the user who submitted the review.
4. **ProfileName**: The name of the user who submitted the review.
5. **HelpfulnessNumerator**: The number of users who found the review helpful.
6. **HelpfulnessDenominator**: The total number of users who rated the helpfulness of the review.
7. **Score**: The rating provided by the user, typically on a scale of 1 to 5.
8. **Time**: A timestamp representing when the review was submitted.
9. **Summary**: A short title or summary of the review.
10. **Text**: The full review text.

### Data Preprocessing and Ethical Considerations

For this project, the **UserId** and **ProfileName** columns are dropped from the dataset so that no personal identifiers are used, in line with ethical standards and data privacy principles. Removing these fields keeps the dataset ethically cleared for analysis while retaining all the information needed for sentiment analysis.
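
As a minimal sketch of this anonymization step, the snippet below drops the two identifier columns from a toy frame. The frame `reviews` and its values are made up for illustration; only the column names come from the dataset description. The notebook itself reaches the same result later by selecting only the columns it needs.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values are invented)
reviews = pd.DataFrame({
    "Id": [1, 2],
    "UserId": ["A1", "B2"],
    "ProfileName": ["alice", "bob"],
    "Score": [5, 1],
    "Text": ["Great coffee", "Stale on arrival"],
})

# errors="ignore" keeps the call safe if a column was already removed
anonymized = reviews.drop(columns=["UserId", "ProfileName"], errors="ignore")
print(list(anonymized.columns))
```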

## Key Research Questions to be Addressed

- **How accurately can various machine learning models classify sentiment in e-commerce reviews?**
- **How do different text preprocessing techniques impact the performance of sentiment classification models?**
- **How do various feature extraction methods affect the accuracy of sentiment classification?**
- **How do different machine learning models compare in terms of performance when classifying sentiment in e-commerce reviews?**

## Methodology

### Imports & Downloads

To run this notebook you will need the following installed:
- `pip install pandas`
- `pip install numpy`
- `pip install seaborn`
- `pip install matplotlib`
- `pip install scikit-learn`
- `pip install nltk`
- `pip install textblob`
- `pip install wordcloud`
- `pip install beautifulsoup4`
- `pip install emoji`
- `pip install contractions`
- `pip install tqdm`

#### 1. Libraries & Packages

In [None]:
# For local installation please uncomment the following and run this code block

# %pip install pandas
# %pip install numpy
# %pip install seaborn
# %pip install matplotlib
# %pip install scikit-learn
# %pip install nltk
# %pip install textblob
# %pip install wordcloud
# %pip install beautifulsoup4
# %pip install emoji
# %pip install contractions
# %pip install tqdm

#### 2. Import Packages

In [None]:
# ===================================================
# DATA MANIPULATION & ANALYSIS
# ===================================================
import pandas as pd
import numpy as np

# ===================================================
# DATA VISUALIZATION
# ===================================================
import seaborn as sns
import matplotlib.pyplot as plt

# ===================================================
# MACHINE LEARNING & MODEL SELECTION
# ===================================================
import sklearn as sk
from sklearn.model_selection import train_test_split, GridSearchCV, validation_curve
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay

# ===================================================
# MACHINE LEARNING MODELS
# ===================================================
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

# ===================================================
# FEATURE EXTRACTION & TEXT VECTORIZATION
# ===================================================
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

# ===================================================
# TEXT PREPROCESSING
# ===================================================
import re
import string
import unicodedata
import contractions
import collections
import emoji
from bs4 import BeautifulSoup  # For parsing HTML
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from nltk.stem import WordNetLemmatizer

# ===================================================
# SENTIMENT ANALYSIS
# ===================================================
from textblob import TextBlob

# ===================================================
# WORD CLOUDS
# ===================================================
from wordcloud import WordCloud

# ===================================================
# UTILITIES & MISCELLANEOUS
# ===================================================
import json  # To save results
from tqdm import tqdm  # Progress bar

#### 3. Install NLTK Tools

In [None]:
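# Downloading NLTK resources
# Uncomment the following line if you haven't downloaded the NLTK resources yet.
# 'all' is a large download; this notebook uses the stop word list, the punkt
# tokenizer, WordNet, and the averaged perceptron POS tagger.

# nltk.download('all')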

### **Step 1: Load Data**

In [None]:
# Balanced data: every star rating (1 to 5) is represented in equal proportion,
# so no single rating dominates training
# 25,000 records are taken for each star rating
df_balancedData = pd.read_csv('Datasets/balanced_reviews.csv')
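
The notebook loads an already balanced file, and how that file was built is not shown. One possible way to draw an equal number of reviews per star rating is sketched below; the frame `raw` and the sample size `n_per_class` are hypothetical stand-ins (the notebook's file holds 25,000 reviews per rating).

```python
import pandas as pd

# Hypothetical unbalanced data: more high ratings than low ones
raw = pd.DataFrame({
    "Score": [1] * 3 + [2] * 4 + [3] * 5 + [4] * 6 + [5] * 7,
    "Text": ["review"] * 25,
})

n_per_class = 3  # stand-in for the 25,000 used in the notebook

# Draw the same number of rows from every star rating
balanced = raw.groupby("Score").sample(n=n_per_class, random_state=42)
print(balanced["Score"].value_counts().sort_index())
```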

### **Step 2: Exploratory Data Analysis (EDA)**

In [None]:
df_balancedData.shape

In [None]:
df_balancedData.head()

In [None]:
df_balancedData.tail()

In [None]:
df_balancedData.info()

In [None]:
df_balancedData.describe()

In [None]:
df_balancedData.columns

In [None]:
df_balancedData['Text']

In [None]:
# Create a pie chart
plt.figure(figsize=(10, 8))
df_balancedData['Score'].value_counts().plot(
    kind='pie', 
    autopct='%.0f%%', 
    startangle=90,  # Rotate the pie chart for better orientation
    colors=plt.cm.Paired.colors  # Add distinct colors for each segment
)

# Add title and legend
plt.title('Data Distribution of Scores', fontsize=16)
plt.ylabel('')  # Remove the default y-axis label for better aesthetics
plt.legend(title='Scores', loc='upper right')

# Show the plot
plt.show()

In [None]:
# Define stopwords
stop_words = set(stopwords.words('english'))

# Combine all text from the 'Text' column into a single string
all_reviews = " ".join(df_balancedData["Text"].astype(str))

# Generate the Word Cloud
wordcloud = WordCloud(
    stopwords=stop_words, 
    width=800, 
    height=800, 
    background_color='white', 
    min_font_size=10
).generate(all_reviews)

# Plot the Word Cloud
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

In [None]:
# Select numeric columns only
numeric_data = df_balancedData.select_dtypes(include=['int64', 'float64'])

# Compute the correlation matrix
correlation_matrix = numeric_data.corr()

plt.figure(figsize=(16, 6))

# Generate the heatmap
heatmap = sns.heatmap(
    correlation_matrix, 
    vmin=-1, vmax=1, annot=True, fmt=".2f", cmap='coolwarm', cbar_kws={'label': 'Correlation Coefficient'}
)

# Customize the title
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 16, 'fontweight': 'bold'}, pad=20)

# Rotate x and y axis labels for clarity
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)

# Tighten the layout for better spacing
plt.tight_layout()

# Display the heatmap
plt.show()

### **Step 3: Data Cleaning & Preprocessing**

#### Drop Unnecessary Columns

In [None]:
# Select only relevant columns (copy so later column assignments do not act on a view)
df_balancedData = df_balancedData[['ProductId', 'Score', 'Text']].copy()
# Display the updated DataFrame
df_balancedData.head()

#### Map Sentiment Classes from Score (3 Classes)

In [None]:
#########################################
#   Map 5-Star Ratings to 3 Classes
#########################################
def map_to_3_classes(score):
    if score in [1, 2]:
        return 'negative'
    elif score == 3:
        return 'neutral'
    else:  # 4 or 5
        return 'positive'

df_balancedData['Sentiment'] = df_balancedData['Score'].apply(map_to_3_classes)
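
Note that a dataset balanced over five star ratings is no longer balanced after this mapping. The sketch below works through the arithmetic, assuming exactly 25,000 reviews per rating as stated in the load cell; `map_score` is a local copy of the mapping logic above.

```python
from collections import Counter

def map_score(score):
    # Same rule as the notebook: 1-2 negative, 3 neutral, 4-5 positive
    if score in (1, 2):
        return "negative"
    if score == 3:
        return "neutral"
    return "positive"

# 25,000 reviews for each of the five star ratings
scores = [star for star in range(1, 6) for _ in range(25000)]

counts = Counter(map_score(s) for s in scores)
shares = {label: n / len(scores) for label, n in counts.items()}
print(counts)
print(shares)
```

The neutral class ends up with half as many reviews as each of the other two, which is worth keeping in mind when reading per-class precision and recall.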

In [None]:
df_balancedData.head()

#### Convert Emojis to Text

In [None]:

class CleanEmojis(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
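    def transform(self, X):
        # Replace each emoji in the 'Text' column with its text alias (e.g. ":thumbs_up:");
        # the colons and underscores are stripped later during string processing
        X = X.copy()  # avoid mutating the caller's DataFrame
        tqdm.pandas(desc="Converting emojis to text")  # Add progress bar for this step
        X['Text'] = X['Text'].progress_apply(emoji.demojize)
        return X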

# Apply the transformer to the DataFrame
cleaner = CleanEmojis()
df_balancedData = cleaner.transform(df_balancedData)

# Display the updated DataFrame
df_balancedData.head()

#### Method to Remove Special Characters, URLs, and HTML and Convert to Lowercase

In [None]:
# Define the StringProcessing class
class StringProcessing(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        filtered_reviews = []
        for review in X:
            # Converting to lowercase
            review = review.lower()
            # Substituting ampersand signs (including the HTML entity) with 'and'
            review = re.sub(r"&amp", "and", review)
            review = re.sub(r"&", "and", review)
            # Removing @ mentions and hashtags
            review = re.sub(r"[@#][^\s]+", " ", review)
            # Removing URLs
            review = re.sub(r"(http\S+)|(www\S+)", "", review)
            # Removing HTML tags
            review = re.sub(r"<[^<]+?>", " ", review)
            # Stripping the colons and underscores left by emoji.demojize
            review = re.sub(r":", " ", review)
            review = review.replace("_", " ")
            # Removing special characters
            review = re.sub(r"[^a-z0-9'’ ]", " ", review)
            # Removing extra whitespace
            review = re.sub(r"\s+", " ", review)
            filtered_reviews.append(review)
        return filtered_reviews

# Create an instance of the StringProcessing transformer
string_processor = StringProcessing()

# Apply the transformation to the 'Text' column and overwrite it with the cleaned text
df_balancedData.loc[:, 'Text'] = string_processor.transform(df_balancedData['Text'])

# Display the updated DataFrame with the cleaned 'Text' column
df_balancedData.head()
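
To make the order of the cleaning steps easier to follow, here is a standalone sketch that applies the same regular expressions to one made-up review. The function name `clean_review`, the sample string, and the final `strip()` are assumptions added for illustration; they are not part of the transformer above.

```python
import re

def clean_review(review: str) -> str:
    review = review.lower()
    review = re.sub(r"&amp", "and", review)             # HTML ampersand entity
    review = re.sub(r"&", "and", review)                # bare ampersand
    review = re.sub(r"[@#][^\s]+", " ", review)         # mentions and hashtags
    review = re.sub(r"(http\S+)|(www\S+)", "", review)  # URLs
    review = re.sub(r"<[^<]+?>", " ", review)           # HTML tags
    review = re.sub(r":", " ", review)                  # colons around emoji aliases
    review = review.replace("_", " ")                   # underscores inside emoji aliases
    review = re.sub(r"[^a-z0-9'’ ]", " ", review)       # remaining special characters
    review = re.sub(r"\s+", " ", review)                # collapse whitespace
    return review.strip()                               # illustrative addition

sample = "Great <b>value</b> :thumbs_up: see http://example.com &amp; enjoy!"
print(clean_review(sample))
```

URLs are removed before colons are stripped, so the `http://` prefix is still intact when the URL pattern runs.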

#### Keep Relevant Columns

In [None]:
# Select only relevant columns (copy so new columns can be added without warnings)
df_balancedData = df_balancedData[['ProductId', 'Score', 'Text', 'Sentiment']].copy()
df_balancedData.head()

#### Method to substitute contractions

In [None]:
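# Apostrophe-free forms that are ordinary English words; mapping them would
# corrupt reviews (e.g. "were" -> "we are", "well" -> "we will")
AMBIGUOUS_KEYS = {"hell", "id", "ill", "its", "lets", "shed", "shell", "wed", "well", "were"}

def collect_short_contractions(short_contractions):
    short_contractions_dict = {}
    # Sort the contractions alphabetically
    for key, value in sorted(short_contractions.items()):
        # Remove straight and curly apostrophes from the contraction keys
        new_key = key.replace("'", "").replace("’", "")
        if new_key.lower() in AMBIGUOUS_KEYS:
            continue
        short_contractions_dict[new_key] = value
    return short_contractions_dict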

# (Maarten et al., 2013)
short_contractions = {
    "i'd": "i would",
    "we'd": "we would",
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "doesn’t": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hasn't": "has not",
    "haven't": "have not",
    "haven’t": "have not",
    "he'd": "he would",
    "he'll": "he will",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "I'd": "I would",
    "I'll": "I will",
    "I'm": "I am",
    "I've": "I have",
    "isn't": "is not",
    "it'd": "it would",
    "it'll": "it will",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "oughtn't": "ought not",
    "shan't": "shall not",
    "she'd": "she would",
    "she'll": "she will",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "that'd": "that would",
    "that's": "that is",
    "there'd": "there had",
    "there's": "there is",
    "they'd": "they would",
    "they'll": "they will",
    "they're": "they are",
    "they've": "they have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'll": "we will",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who's": "who is",
    "who've": "who have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "y'all": "you all",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you're": "you are",
    "you've": "you have"
}

short_contractions_dict = collect_short_contractions(short_contractions)
short_contractions_dict

In [None]:
class ExpandContractions(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        expanded_reviews = []
        # Wrap X in tqdm for a progress bar
        for review in tqdm(X, desc="Expanding Contractions"):
            expanded_reviews.append(contractions.fix(review))
        return expanded_reviews

#### Method to handle Slang words

In [None]:
def collect_slangs(slangs):
    slang_dict = {}
    for line in slangs.strip().split('\n'):
        key, value = line.lower().strip().split(': ', 1)  # Added limit to split on first ':'
        slang_dict[key] = value

    # Sort the dictionary by keys alphabetically
    return dict(sorted(slang_dict.items()))

# (SlickText, 2023)
slangs = '''
  ROFL: Rolling on the floor laughing
  STFU: Shut up
  ICYMI: In case you missed it
  TL;DR: Too long, didn’t read
  TMI: Too much information
  AFAIK: As far as I know
  LMK: Let me know
  NVM: Nevermind
  FTW: For the win
  BYOB: Bring your own beer
  BOGO: Buy one get one
  JK: Just kidding
  TBH: To be honest
  TBF: To be frank
  RN: Right now
  BRB: Be right back
  BTW: By the way
  GG: Good game
  IRL: In real life
  LOL: Laugh out loud
  SMH: Shaking my head
  NGL: Not gonna lie
  IKR: I know right
  TTYL: Talk to you later
  IMO: In my opinion
  WYD: What are you doing?
  IDK: I don’t know
  IDC: I don’t care
  IDGAF: I don’t care
  TBA: To be announced
  TBD: To be decided
  AFK: Away from keyboard
  IYKYK: If you know you know
  B4: Before
  FOMO: Fear of missing out
  GTG: Got to go
  G2G: Got to go
  H8: Hate
  LMAO: Laughing my ass off
  IYKWIM: If you know what I mean
  MYOB: Mind your own business
  POV: Point of view
  HBD: Happy birthday
  WYSIWYG: What you see is what you get
  FWIW: For what it’s worth
  TW: Trigger warning
  EOD: End of day
  FAQ: Frequently asked question
  AKA: Also known as
  ASAP: As soon as possible
  DIY: Do it yourself
  NP: No problem
  U: you
  R: are
  PLS: please
  WYD: what are you doing
  N/A: Not applicable
  K: Okay
  WUT: what
  FYI: For your information
  NSFW: Not safe for work
  WFH: Work from home
  OMW: On my way
  DM: Direct message
  FB: Facebook
  IG: Instagram
  YT: YouTube
  QOTD: Quote of the day
  OOTD: Outfit of the day
  AMA: Ask me anything
  HMU: Hit me up
  ILY: I love you
  BF: Boyfriend
  GF: Girlfriend
  BAE: Before anyone else
  LYSM: Love you so much
  PDA: Public display of affection
  XOXO: Hugs and kisses
  LOML: Love of my life
  THX: thanks
  V: very
  OMG: Oh My God
'''

slang_dict = collect_slangs(slangs)
slang_dict

In [None]:
import re
from sklearn.base import BaseEstimator, TransformerMixin

class ReplaceSlangs(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        filtered_reviews = []
        for review in tqdm(X, desc="Replacing Slangs"):
            # Replace slang
            for key, value in slang_dict.items():
                review = re.sub(r'\b{}\b'.format(re.escape(key)), value, review, flags=re.IGNORECASE)
            
            # Replace short contractions
            for key, value in short_contractions_dict.items():
                review = re.sub(r'\b{}\b'.format(re.escape(key)), value, review, flags=re.IGNORECASE)
            
            filtered_reviews.append(review)
        return filtered_reviews
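
The transformer above runs one substitution per dictionary entry for every review. A possible alternative, sketched here under the assumption that all keys can share one word-boundary pattern, is to compile a single alternation and look up each match. The three-entry `slang_demo` dictionary and the function `replace_slang` are hypothetical and only stand in for `slang_dict`.

```python
import re

# Small stand-in for the full slang dictionary (keys are lowercase)
slang_demo = {"btw": "by the way", "imo": "in my opinion", "thx": "thanks"}

# Longest keys first so a longer abbreviation wins over a shorter one
# that shares its prefix
keys_by_length = sorted(slang_demo, key=len, reverse=True)
slang_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in keys_by_length) + r")\b",
    flags=re.IGNORECASE,
)

def replace_slang(review: str) -> str:
    # Look up the lowercase form of whatever was matched
    return slang_pattern.sub(lambda match: slang_demo[match.group(0).lower()], review)

print(replace_slang("BTW the blender is great imo, thx"))
```

This scans each review once instead of once per key; whether that matters depends on the dataset size.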

#### Method for Tokenization

In [None]:
class Tokenize(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        tokenized_reviews = []
        for review in tqdm(X, desc="Tokenizing"):
            tokenized_reviews.append(word_tokenize(review))
        return tokenized_reviews

#### Method to remove stopwords
**Some words in the built-in NLTK stop word list carry sentiment, and removing them would change the meaning of a review. We therefore modify the original list so that words such as "no", "against", and "above", along with the negation forms, are kept in the text.**

In [None]:
class RemoveStopwords(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        sentiment_stopwords = [
            'against', 'above', 'below', 'up', 'down', 'over', 'under', 'no', 'nor', 'not',
            "don't", 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't",
            'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't",
            'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",
            'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't",
            'won', "won't", 'wouldn', "wouldn't"
        ]
        
        stop_words = set(stopwords.words('english'))
        # Set difference does not raise if a word is missing from this NLTK version's list
        stop_words -= set(sentiment_stopwords)
        
        processed_reviews = []
        for review in tqdm(X, desc="Removing Stopwords"):
            tokens_without_stopwords = [word for word in review if word not in stop_words]
            processed_reviews.append(tokens_without_stopwords)
        
        return processed_reviews
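
The effect of keeping negation words can be seen on a toy example. The stop word set below is a tiny hand-written stand-in, not the NLTK list, and the variable names are illustrative only.

```python
# Tiny stand-in for a default stop word list that includes negations
default_stop = {"this", "is", "a", "the", "not", "no"}

# Words we want to keep because they flip the sentiment
negations = {"not", "no"}
custom_stop = default_stop - negations

tokens = "this is not a good product".split()

with_default = [t for t in tokens if t not in default_stop]
with_custom = [t for t in tokens if t not in custom_stop]

print(with_default)  # negation is lost
print(with_custom)   # negation is preserved
```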

#### Method to apply lemmatization

In [None]:
class Lemmatize(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        lemmatizer = WordNetLemmatizer()
        lemmatized_reviews = []

        def get_wordnet_pos(pos_tag):
            if pos_tag.startswith('J'):
                return wordnet.ADJ
            elif pos_tag.startswith('V'):
                return wordnet.VERB
            elif pos_tag.startswith('N'):
                return wordnet.NOUN
            elif pos_tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN
        
        # Wrap X in tqdm
        for review in tqdm(X, desc="Lemmatizing"):
            pos_tags = nltk.pos_tag(review)
            lemmatized_tokens = [
                lemmatizer.lemmatize(token, get_wordnet_pos(tag)) 
                for token, tag in pos_tags
            ]
            lemmatized_reviews.append(lemmatized_tokens)
        
        return lemmatized_reviews

#### Basic cleaning after preprocessing

In [None]:
class FinalClean(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.remove_digits = re.compile(r'\d+')
        self.remove_single_letters = re.compile(r'\b\w{1}\b')
        self.remove_apostrophes = re.compile(r"'")
        self.remove_href = "href"

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        cleaned_reviews = []
        
        for review in tqdm(X, desc="Final Cleaning"):
            cleaned_words = [
                self.clean_word(word)
                for word in review
                if word != self.remove_href
            ]
            # Drop tokens that became empty after cleaning (e.g. pure digits)
            cleaned_reviews.append([word for word in cleaned_words if word])
        
        return cleaned_reviews

    def clean_word(self, word):
        # Removing digits
        word = self.remove_digits.sub("", word)
        # Removing single letters
        word = self.remove_single_letters.sub("", word)
        # Removing apostrophes
        word = self.remove_apostrophes.sub("", word)
        return word

#### Preprocessing Pipeline

In [None]:
df_balancedData.head()

In [None]:
def preprocess_reviews(reviews_frame):
    # Define the list of preprocessing steps as tuples
    # Each tuple contains a step name (for identification) and an instance of the corresponding transformer
    preprocessing_steps = [
        ('expand_contractions', ExpandContractions()),  # Step to expand contractions like "don't" -> "do not"
        ('replace_slangs', ReplaceSlangs()),         # Step to replace slang words with their full form
        ('tokenize', Tokenize()),                    # Step to tokenize the text (split into words)
        ('remove_stopwords', RemoveStopwords()),     # Step to remove common stopwords
        ('lemmatize', Lemmatize()),                  # Step to lemmatize words (reduce to root form)
        ('final_clean', FinalClean())                # Final cleaning step (removing unwanted characters, etc.)
    ]

    # Create the pipeline object using the preprocessing steps defined above
    # The Pipeline applies each step sequentially to the data
    preprocessing_pipeline = Pipeline(preprocessing_steps)
    
    # Apply the pipeline to the 'Text' column of the DataFrame
    # This will preprocess the 'Text' data column by applying each step defined in the pipeline
    preprocessed_reviews = preprocessing_pipeline.fit_transform(reviews_frame['Text'])
    
    # Convert the list of tokenized words into strings (space-separated)
    # List comprehension is used to join each list of tokens into a single string for each review
    # Each review is a list of words, so ' '.join(review) joins them back into a single string
    preprocessed_strings = [' '.join(review) for review in preprocessed_reviews]

    # Add the preprocessed reviews as a new column to the original DataFrame
    # This new column will contain the cleaned and processed review text
    reviews_frame['Preprocessed_Review'] = preprocessed_strings

    # Return the DataFrame with the added preprocessed reviews
    return reviews_frame


In [None]:
preprocess_reviews(df_balancedData)

In [None]:
df_balancedData.to_csv('Datasets/preprocessed_reviews.csv', index=False) # Save the preprocessed data to a new CSV file

## 🟢 Classification Tri-Class 

### **Train/Test Split**

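#### Reload the preprocessed data

In [None]:
df_preprocessed = pd.read_csv('Datasets/preprocessed_reviews.csv')
# Reviews that became empty during preprocessing are read back as NaN, which the vectorizers reject
df_preprocessed = df_preprocessed.dropna(subset=['Preprocessed_Review'])
df_preprocessed.head()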

In [None]:
df_preprocessed.shape

In [None]:
# 1. View the raw counts
print("Class Distribution (Counts):")
print(df_preprocessed['Sentiment'].value_counts())

# 2. View the normalized distribution (percentages)
print("\nClass Distribution (Percentages):")
print(df_preprocessed['Sentiment'].value_counts(normalize=True) * 100)

# 3. Visualize with a Seaborn countplot
plt.figure(figsize=(6,4))
sns.countplot(x='Sentiment', data=df_preprocessed, order=['negative','neutral','positive'])  # or remove 'order' if you want auto-sorting
plt.title("Distribution of Sentiment Classes")
plt.xlabel("Sentiment Class")
plt.ylabel("Count of Reviews")
plt.show()

#### Define Features (Preprocessed Text) and Target (Sentiment)

In [None]:
# Define Features (Preprocessed Text) and Target (Sentiment)
X = df_preprocessed['Preprocessed_Review']
y = df_preprocessed['Sentiment']

In [None]:
df_preprocessed.head()

#### Split into training and testing

In [None]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,      # 70 train 30 test
    random_state=42,
    stratify=y
)
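
The `stratify=y` argument keeps the class proportions the same in both splits. The sketch below checks this on toy labels with a 40/20/40 class mix; the label list and the `toy_` variable names are made up for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with a 40/20/40 class mix
toy_labels = ["negative"] * 40 + ["neutral"] * 20 + ["positive"] * 40
toy_texts = [f"review {i}" for i in range(len(toy_labels))]

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.3,
    random_state=42,
    stratify=toy_labels,
)

train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print(train_counts)
print(test_counts)
```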

### **NLP Method 1:** Term Frequency-Inverse Document Frequency (TF-IDF)
Here, we run each model separately, each with its own hyperparameter tuning and 10-fold cross-validation.
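
For reference, the weighting applied by the vectorizer configured in the next cell can be written out explicitly. This assumes scikit-learn's defaults `smooth_idf=True` and `norm='l2'`, which the cell does not override, together with the `sublinear_tf=True` setting it does use. For a term $t$ that occurs $\mathrm{tf}(t, d) > 0$ times in review $d$, with $n$ training reviews and $\mathrm{df}(t)$ reviews containing $t$:

$$
w(t, d) = \bigl(1 + \ln \mathrm{tf}(t, d)\bigr) \cdot \left(\ln \frac{1 + n}{1 + \mathrm{df}(t)} + 1\right)
$$

Each review vector is then scaled to unit Euclidean length. Terms that appear in more than 80% of the training reviews are dropped by `max_df=0.8` before weighting.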


#### Feature Extraction Method: TF-IDF Setup

In [None]:
# TF-IDF Transformation
tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,  # Can be varied
    ngram_range=(1,2),
    max_df=0.8,
    sublinear_tf=True
)

# Fit on training data and transform
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Create a dictionary to store the final results from each model (optional)
results = {}
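
One point to be aware of: the vectorizer above is fitted on the full training set before cross-validation, so within each fold the IDF weights have already seen the validation reviews. A possible alternative, sketched here on a toy corpus, is to place the vectorizer inside a `Pipeline` so that it is refitted on each training fold. The documents, labels, and grid values below are invented for illustration and are much smaller than the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus: four positive and four negative reviews
docs = [
    "love this coffee", "great taste love it", "excellent product", "really great value",
    "terrible taste", "awful product waste", "bad stale coffee", "really bad value",
]
labels = ["positive"] * 4 + ["negative"] * 4

# The vectorizer is a pipeline step, so it is fitted only on each training fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LinearSVC()),
])

# Parameters of a step are addressed as <step name>__<parameter>
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1.0]},
    scoring="accuracy",
    cv=2,
)
search.fit(docs, labels)
print(search.best_params_)
```

The trade-off is runtime: the vectorizer is refitted for every fold and parameter combination instead of once.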

#### M1 (Decision Tree)

##### a. Implementation

In [None]:
# -------------------------------
#  SECTION 1: DECISION TREE
# -------------------------------
print("\n" + "="*50)
print("MODEL 1: Decision Tree (Tri-Class)")
print("="*50)

# Define model (fixed seed so results are reproducible across runs)
dt_classifier = DecisionTreeClassifier(random_state=42)

# Define hyperparameter grid
dt_param_grid = {
    'max_depth': [5, 10, 15, 20, 25],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],
}

# Set up GridSearchCV with 10-fold CV
dt_grid_search = GridSearchCV(
    estimator=dt_classifier,
    param_grid=dt_param_grid,
    scoring='accuracy',  
    cv=10,
    n_jobs=4,           # Adjust CPU usage
    verbose=2,
    return_train_score=True
)
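
Before fitting, it helps to estimate how much work this grid search implies. The sketch below counts the parameter combinations and the resulting number of cross-validation fits for the grid defined above; the variable names `n_combinations` and `n_fits` are illustrative.

```python
from math import prod

# Same grid as in the cell above
dt_param_grid = {
    "max_depth": [5, 10, 15, 20, 25],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5, 10],
}
n_folds = 10

# Every combination of values is tried once per fold
n_combinations = prod(len(values) for values in dt_param_grid.values())
n_fits = n_combinations * n_folds

print(n_combinations, "combinations,", n_fits, "cross-validation fits")
```

With the default `refit=True`, one more fit on the full training set follows for the best combination.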

In [None]:
# Fit GridSearch on the TF-IDF training data
dt_grid_search.fit(X_train_tfidf, y_train)

In [None]:
# Evaluate best model
print("\nBest Params (Decision Tree):", dt_grid_search.best_params_)
print("Best CV Score (Decision Tree):", dt_grid_search.best_score_)

In [None]:
# Predict on test data
dt_best = dt_grid_search.best_estimator_ # Get the best model
y_pred_dt = dt_best.predict(X_test_tfidf) # Predict on test data

In [None]:
# Accuracy & Classification Report
dt_test_accuracy = accuracy_score(y_test, y_pred_dt)
print("Test Accuracy (Decision Tree):", dt_test_accuracy)
print(classification_report(y_test, y_pred_dt))

# Store results
results['DecisionTree'] = {
    'best_params': dt_grid_search.best_params_,
    'best_cv_score': dt_grid_search.best_score_,
    'test_accuracy': dt_test_accuracy,
    'classification_report': classification_report(y_test, y_pred_dt, output_dict=True)
}

##### b. Save Results

In [None]:
# Save Decision Tree results to a JSON file
with open('decision_tree_results.json', 'w') as f:
    # results['DecisionTree'] is a dictionary
    json.dump(results['DecisionTree'], f, indent=4)


In [None]:
# After fitting your grid search
cv_results_df = pd.DataFrame(dt_grid_search.cv_results_)

# Display the first few rows to see the structure
cv_results_df.head(15)


In [None]:
# Plot mean test scores for different hyperparameter combinations
plt.figure(figsize=(12, 6))
plt.plot(range(1, len(cv_results_df) + 1), cv_results_df['mean_test_score'], marker='o')
plt.title('Cross-Validation Accuracy for Different Hyperparameters')
plt.xlabel('Hyperparameter Combination Index')
plt.ylabel('Mean CV Accuracy')
plt.grid(True)
plt.show()

In [None]:
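# Save to CSV for analysis in Excel, Google Sheets, etc.
import os
os.makedirs('Results_3Classes/TFIDF_Models', exist_ok=True)  # to_csv does not create missing folders
cv_results_df.to_csv('Results_3Classes/TFIDF_Models/tfidf_decisionTree_cv_results.csv', index=False)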

##### c. Visualizations

In [None]:
param_range = [5, 10, 15, 20, 25]
train_scores, test_scores = validation_curve(
    dt_classifier, X_train_tfidf, y_train, param_name="max_depth", param_range=param_range,
    scoring="accuracy", cv=5, n_jobs=-1
)


# Calculate mean and std
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(param_range, train_mean, label="Training Score", color='blue', marker='o')
plt.plot(param_range, test_mean, label="Validation Score", color='green', linestyle='--', marker='x')
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.2, color='blue')
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, alpha=0.2, color='green')
plt.title("Validation Curve for max_depth")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Confusion matrix for the best Decision Tree on the held-out test set
cm = confusion_matrix(y_test, y_pred_dt, labels=dt_best.classes_)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=dt_best.classes_)
disp.plot(cmap='Blues')
plt.title("Confusion Matrix (Decision Tree, TF-IDF)")
plt.show()

#### M2 (Linear SVM)

##### a. Implementation

In [None]:
# -------------------------------
#  SECTION 2: SVM (LinearSVC)
# -------------------------------
print("\n" + "="*50)
print("MODEL 2: LinearSVC")
print("="*50)

# Define model
svm_classifier = LinearSVC(max_iter=10000)

# Define hyperparameter grid
svm_param_grid = {
    'C': [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]
}

# Set up GridSearchCV with 10-fold CV
svm_grid_search = GridSearchCV(
    estimator=svm_classifier,
    param_grid=svm_param_grid,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit the grid search
svm_grid_search.fit(X_train_tfidf, y_train)

# Evaluate best model
print("\nBest Params (SVM):", svm_grid_search.best_params_)
print("Best CV Score (SVM):", svm_grid_search.best_score_)

# Predict on test data
svm_best = svm_grid_search.best_estimator_
y_pred_svm = svm_best.predict(X_test_tfidf)

# Accuracy & Classification Report
svm_test_accuracy = accuracy_score(y_test, y_pred_svm)
print("Test Accuracy (SVM):", svm_test_accuracy)
print(classification_report(y_test, y_pred_svm))

# Store results
results['SVM'] = {
    'best_params': svm_grid_search.best_params_,
    'best_cv_score': svm_grid_search.best_score_,
    'test_accuracy': svm_test_accuracy,
    'classification_report': classification_report(y_test, y_pred_svm, output_dict=True)
}

##### b. Save and store results

In [None]:
# After svm_grid_search.fit(X_train_tfidf, y_train)
svm_cv_results_df = pd.DataFrame(svm_grid_search.cv_results_)

# Print per-fold scores
svm_fold_columns = [col for col in svm_cv_results_df.columns if col.startswith("split") and col.endswith("_test_score")]
svm_cv_fold_scores = svm_cv_results_df[['params', 'mean_test_score', 'std_test_score'] + svm_fold_columns]
print("\n--- SVM: Cross-Validation Scores for Each Fold ---")
print(svm_cv_fold_scores)

# Optionally, save results to CSV
svm_cv_results_df.to_csv('Results_3Classes/TFIDF_Models/tfidf_svm_cv_results.csv', index=False)
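
The per-fold extraction above is repeated for every model. A minimal sketch of how it could be factored into one helper, assuming a hypothetical function name `fold_score_table`; synthetic data stands in for the TF-IDF features so the block runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def fold_score_table(cv_results):
    """Return params, mean/std test score, and the per-fold test scores."""
    df = pd.DataFrame(cv_results)
    fold_cols = [
        c for c in df.columns
        if c.startswith("split") and c.endswith("_test_score")
    ]
    # Sort by fold index so split10 would follow split9, not split1
    fold_cols.sort(key=lambda c: int(c[len("split"):-len("_test_score")]))
    return df[["params", "mean_test_score", "std_test_score"] + fold_cols]


# Synthetic data (illustrative only)
X, y = make_classification(n_samples=60, n_features=8, random_state=0)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4]},
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)
table = fold_score_table(search.cv_results_)
print(table)
```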

#### M3 Random Forest

##### a. Implementation

In [None]:
# -------------------------------
#  SECTION 3: RANDOM FOREST
# -------------------------------
print("\n" + "="*50)
print("MODEL 3: Random Forest")
print("="*50)

# Define model
rf_classifier = RandomForestClassifier()

# Define hyperparameter grid
# If the model overfits, consider adding these entries to the grid:
#   'min_samples_split': [2, 5],
#   'min_samples_leaf': [1, 2],
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2']   # Usually beneficial for text
}

# Set up GridSearchCV
rf_grid_search = GridSearchCV(
    estimator=rf_classifier,
    param_grid=rf_param_grid,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit
rf_grid_search.fit(X_train_tfidf, y_train)

print("\nBest Params (Random Forest):", rf_grid_search.best_params_)
print("Best CV Score (Random Forest):", rf_grid_search.best_score_)

# Predict
rf_best = rf_grid_search.best_estimator_
y_pred_rf = rf_best.predict(X_test_tfidf)

rf_test_accuracy = accuracy_score(y_test, y_pred_rf)
print("Test Accuracy (Random Forest):", rf_test_accuracy)
print(classification_report(y_test, y_pred_rf))

# Store results
results['RandomForest'] = {
    'best_params': rf_grid_search.best_params_,
    'best_cv_score': rf_grid_search.best_score_,
    'test_accuracy': rf_test_accuracy,
    'classification_report': classification_report(y_test, y_pred_rf, output_dict=True)
}

##### b. Save and store results

In [None]:
# After rf_grid_search.fit(X_train_tfidf, y_train)
rf_cv_results_df = pd.DataFrame(rf_grid_search.cv_results_)

# Print per-fold scores
rf_fold_columns = [col for col in rf_cv_results_df.columns if col.startswith("split") and col.endswith("_test_score")]
rf_cv_fold_scores = rf_cv_results_df[['params', 'mean_test_score', 'std_test_score'] + rf_fold_columns]
print("\n--- Random Forest: Cross-Validation Scores for Each Fold ---")
print(rf_cv_fold_scores)

# Optionally, save results to CSV
rf_cv_results_df.to_csv('Results_3Classes/TFIDF_Models/tfidf_RandomForest_cv_results.csv', index=False)

#### M4 kNN

##### a. Implementation

In [None]:
# -------------------------------
#  SECTION 4: kNN
# -------------------------------
print("\n" + "="*50)
print("MODEL 4: kNN")
print("="*50)

knn_classifier = KNeighborsClassifier()

knn_param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

knn_grid_search = GridSearchCV(
    estimator=knn_classifier,
    param_grid=knn_param_grid,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

knn_grid_search.fit(X_train_tfidf, y_train)

print("\nBest Params (kNN):", knn_grid_search.best_params_)
print("Best CV Score (kNN):", knn_grid_search.best_score_)

knn_best = knn_grid_search.best_estimator_
y_pred_knn = knn_best.predict(X_test_tfidf)

knn_test_accuracy = accuracy_score(y_test, y_pred_knn)
print("Test Accuracy (kNN):", knn_test_accuracy)
print(classification_report(y_test, y_pred_knn))

results['kNN'] = {
    'best_params': knn_grid_search.best_params_,
    'best_cv_score': knn_grid_search.best_score_,
    'test_accuracy': knn_test_accuracy,
    'classification_report': classification_report(y_test, y_pred_knn, output_dict=True)
}


##### b. Save and store results

In [None]:
# After knn_grid_search.fit(X_train_tfidf, y_train)
knn_cv_results_df = pd.DataFrame(knn_grid_search.cv_results_)

# Print per-fold scores
knn_fold_columns = [col for col in knn_cv_results_df.columns if col.startswith("split") and col.endswith("_test_score")]
knn_cv_fold_scores = knn_cv_results_df[['params', 'mean_test_score', 'std_test_score'] + knn_fold_columns]
print("\n--- kNN: Cross-Validation Scores for Each Fold ---")
print(knn_cv_fold_scores)

# Optionally, save results to CSV
knn_cv_results_df.to_csv('Results_3Classes/TFIDF_Models/tfidf_knn_cv_results.csv', index=False)

#### M5 Naive Bayes

##### a. Implementation

In [None]:
# -------------------------------
#  SECTION 5: Naïve Bayes
# -------------------------------
print("\n" + "="*50)
print("MODEL 5: Naïve Bayes")
print("="*50)

nb_classifier = MultinomialNB()

nb_param_grid = {
    'alpha': [0.5, 1.0, 1.5]
}

nb_grid_search = GridSearchCV(
    estimator=nb_classifier,
    param_grid=nb_param_grid,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

nb_grid_search.fit(X_train_tfidf, y_train)

print("\nBest Params (Naïve Bayes):", nb_grid_search.best_params_)
print("Best CV Score (Naïve Bayes):", nb_grid_search.best_score_)

nb_best = nb_grid_search.best_estimator_
y_pred_nb = nb_best.predict(X_test_tfidf)

nb_test_accuracy = accuracy_score(y_test, y_pred_nb)
print("Test Accuracy (Naïve Bayes):", nb_test_accuracy)
print(classification_report(y_test, y_pred_nb))

results['NaiveBayes'] = {
    'best_params': nb_grid_search.best_params_,
    'best_cv_score': nb_grid_search.best_score_,
    'test_accuracy': nb_test_accuracy,
    'classification_report': classification_report(y_test, y_pred_nb, output_dict=True)
}

##### b. Save and store results

In [None]:
# After nb_grid_search.fit(X_train_tfidf, y_train)
nb_cv_results_df = pd.DataFrame(nb_grid_search.cv_results_)

# Print per-fold scores
nb_fold_columns = [col for col in nb_cv_results_df.columns if col.startswith("split") and col.endswith("_test_score")]
nb_cv_fold_scores = nb_cv_results_df[['params', 'mean_test_score', 'std_test_score'] + nb_fold_columns]
print("\n--- Naïve Bayes: Cross-Validation Scores for Each Fold ---")
print(nb_cv_fold_scores)

# Optionally, save results to CSV
nb_cv_results_df.to_csv('Results_3Classes/TFIDF_Models/tfidf_NaiveBayes_cv_results.csv', index=False)

#### Summary

In [None]:
pd.DataFrame(results).T

#### Confusion Matrices (Tri-Class TF-IDF)

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay 
# Dictionary of your best models for Tri-Class classification (TF-IDF)
models_triclass = {
    "DecisionTree": dt_best,       # from your dt_grid_search
    "SVM": svm_best,               # from your svm_grid_search
    "RandomForest": rf_best,       # from your rf_grid_search
    "kNN": knn_best,               # from your knn_grid_search
    "NaiveBayes": nb_best          # from your nb_grid_search
}
 
# Generate and display the confusion matrix for each model
for model_name, model in models_triclass.items():
    y_pred = model.predict(X_test_tfidf)         # X_test_tfidf from your Tri-Class TF-IDF setup
    cm = confusion_matrix(y_test, y_pred)        # y_test is your Tri-Class test labels
    disp = ConfusionMatrixDisplay(
        confusion_matrix=cm, 
        display_labels=model.classes_            # Ensures the labels match your model’s classes
    )
    disp.plot(cmap='Blues')
    plt.title(f"Confusion Matrix: {model_name} (Tri-Class, TF-IDF)")
    plt.show()
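
Raw counts can be hard to compare across classes if the classes are imbalanced. A minimal sketch of a row-normalized confusion matrix, where each row shows the fraction of a true class assigned to each predicted class; the toy labels below are illustrative only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["negative", "neutral", "positive"]

# Toy labels (illustrative only)
y_true = ["negative", "negative", "neutral", "neutral",
          "positive", "positive", "positive", "positive"]
y_pred = ["negative", "positive", "neutral", "positive",
          "positive", "positive", "positive", "negative"]

# normalize="true" divides each row by the number of true samples in that class,
# so the diagonal holds the per-class recall
cm_norm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(np.round(cm_norm, 2))
```

The same `normalize="true"` argument should also be accepted by `ConfusionMatrixDisplay.from_predictions` in recent scikit-learn versions, if a plot is preferred.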

### **NLP Method 2:** N-Gram (Tri-Gram)

#### Feature Extraction Method: N-Gram

In [None]:
# ================================
# NLP Method 2: N-Gram Approach
# ================================
print("\n=== NLP Method 2: N-Gram Approach (Unigrams, Bigrams, Trigrams) ===")

from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer for (1,3)
ngram_vectorizer = CountVectorizer(
    ngram_range=(1,3),  # (1,3) => unigrams + bigrams + trigrams
    max_features=10000,
    max_df=0.8
)

# Fit on the same X_train, transform X_train and X_test
X_train_ngram = ngram_vectorizer.fit_transform(X_train)
X_test_ngram = ngram_vectorizer.transform(X_test)

# (Optional) create a dictionary to store the final results for N-Gram method
results_ngram = {}


#### M1: Decision Tree

In [None]:
# -------------------------------
#  SECTION 1: Decision Tree
# -------------------------------

dt_classifier = DecisionTreeClassifier()
dt_param_grid = {
    'max_depth': [10, 20, 25],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10]
}

dt_grid_search_ngram = GridSearchCV(
    estimator=dt_classifier,
    param_grid=dt_param_grid,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

In [None]:
print("\n" + "="*50)
print("N-Gram: Decision Tree")
print("="*50)

dt_grid_search_ngram.fit(X_train_ngram, y_train)

In [None]:
print("\nBest Params (N-Gram, Decision Tree):", dt_grid_search_ngram.best_params_)
print("Best CV Score (N-Gram, Decision Tree):", dt_grid_search_ngram.best_score_)

In [None]:
dt_best_ngram = dt_grid_search_ngram.best_estimator_
y_pred_dt_ngram = dt_best_ngram.predict(X_test_ngram)

dt_test_accuracy_ngram = accuracy_score(y_test, y_pred_dt_ngram)
print("Test Accuracy (N-Gram, Decision Tree):", dt_test_accuracy_ngram)
print(classification_report(y_test, y_pred_dt_ngram))

results_ngram['DecisionTree'] = {
    'best_params': dt_grid_search_ngram.best_params_,
    'best_cv_score': dt_grid_search_ngram.best_score_,
    'test_accuracy': dt_test_accuracy_ngram,
    'classification_report': classification_report(y_test, y_pred_dt_ngram, output_dict=True)
}

# Save cross-validation results to CSV in "Results_3Classes/NGram_Models"
dt_cv_results_ngram = pd.DataFrame(dt_grid_search_ngram.cv_results_)
dt_cv_results_ngram.to_csv('Results_3Classes/NGram_Models/ngram_decision_tree_cv_results.csv', index=False)

#### M2: Linear SVM

In [None]:
# -------------------------------
#  SECTION 2: Linear SVM
# -------------------------------
svm_classifier = LinearSVC(max_iter=10000)
svm_param_grid = {
    'C': [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]
}

svm_grid_search_ngram = GridSearchCV(
    estimator=svm_classifier,
    param_grid=svm_param_grid,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

In [None]:
print("\n" + "="*50)
print("N-Gram: Linear SVM")
print("="*50)

svm_grid_search_ngram.fit(X_train_ngram, y_train)

In [None]:
print("\nBest Params (N-Gram, SVM):", svm_grid_search_ngram.best_params_)
print("Best CV Score (N-Gram, SVM):", svm_grid_search_ngram.best_score_)

In [None]:
svm_best_ngram = svm_grid_search_ngram.best_estimator_
y_pred_svm_ngram = svm_best_ngram.predict(X_test_ngram)

svm_test_accuracy_ngram = accuracy_score(y_test, y_pred_svm_ngram)
print("Test Accuracy (N-Gram, SVM):", svm_test_accuracy_ngram)
print(classification_report(y_test, y_pred_svm_ngram))

results_ngram['SVM'] = {
    'best_params': svm_grid_search_ngram.best_params_,
    'best_cv_score': svm_grid_search_ngram.best_score_,
    'test_accuracy': svm_test_accuracy_ngram,
    'classification_report': classification_report(y_test, y_pred_svm_ngram, output_dict=True)
}

In [None]:
svm_cv_results_ngram = pd.DataFrame(svm_grid_search_ngram.cv_results_)
svm_cv_results_ngram.to_csv('Results_3Classes/NGram_Models/ngram_svm_cv_results.csv', index=False)

#### M3: Random Forest

In [None]:
# -------------------------------
#  SECTION 3: Random Forest
# -------------------------------

rf_classifier = RandomForestClassifier()
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2']
}

rf_grid_search_ngram = GridSearchCV(
    estimator=rf_classifier,
    param_grid=rf_param_grid,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

In [None]:
print("\n" + "="*50)
print("N-Gram: Random Forest")
print("="*50)
rf_grid_search_ngram.fit(X_train_ngram, y_train)

In [None]:
print("\nBest Params (N-Gram, Random Forest):", rf_grid_search_ngram.best_params_)
print("Best CV Score (N-Gram, Random Forest):", rf_grid_search_ngram.best_score_)

In [None]:
rf_best_ngram = rf_grid_search_ngram.best_estimator_
y_pred_rf_ngram = rf_best_ngram.predict(X_test_ngram)

rf_test_accuracy_ngram = accuracy_score(y_test, y_pred_rf_ngram)
print("Test Accuracy (N-Gram, Random Forest):", rf_test_accuracy_ngram)
print(classification_report(y_test, y_pred_rf_ngram))

results_ngram['RandomForest'] = {
    'best_params': rf_grid_search_ngram.best_params_,
    'best_cv_score': rf_grid_search_ngram.best_score_,
    'test_accuracy': rf_test_accuracy_ngram,
    'classification_report': classification_report(y_test, y_pred_rf_ngram, output_dict=True)
}

rf_cv_results_ngram = pd.DataFrame(rf_grid_search_ngram.cv_results_)
rf_cv_results_ngram.to_csv('Results_3Classes/NGram_Models/ngram_random_forest_cv_results.csv', index=False)

#### M4: kNN

In [None]:
# -------------------------------
#  SECTION 4: kNN
# -------------------------------

from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier()
knn_param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

knn_grid_search_ngram = GridSearchCV(
    estimator=knn_classifier,
    param_grid=knn_param_grid,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

In [None]:
print("\n" + "="*50)
print("N-Gram: kNN")
print("="*50)

knn_grid_search_ngram.fit(X_train_ngram, y_train)

In [None]:
print("\nBest Params (N-Gram, kNN):", knn_grid_search_ngram.best_params_)
print("Best CV Score (N-Gram, kNN):", knn_grid_search_ngram.best_score_)

In [None]:
knn_best_ngram = knn_grid_search_ngram.best_estimator_
y_pred_knn_ngram = knn_best_ngram.predict(X_test_ngram)

knn_test_accuracy_ngram = accuracy_score(y_test, y_pred_knn_ngram)
print("Test Accuracy (N-Gram, kNN):", knn_test_accuracy_ngram)
print(classification_report(y_test, y_pred_knn_ngram))

results_ngram['kNN'] = {
    'best_params': knn_grid_search_ngram.best_params_,
    'best_cv_score': knn_grid_search_ngram.best_score_,
    'test_accuracy': knn_test_accuracy_ngram,
    'classification_report': classification_report(y_test, y_pred_knn_ngram, output_dict=True)
}

In [None]:
knn_cv_results_ngram = pd.DataFrame(knn_grid_search_ngram.cv_results_)
knn_cv_results_ngram.to_csv('Results_3Classes/NGram_Models/ngram_knn_cv_results.csv', index=False)

#### M5: Naive Bayes

In [None]:
# -------------------------------
#  SECTION 5: Naïve Bayes
# -------------------------------

nb_classifier = MultinomialNB()
nb_param_grid = {
    'alpha': [0.5, 1.0, 1.5]
}

nb_grid_search_ngram = GridSearchCV(
    estimator=nb_classifier,
    param_grid=nb_param_grid,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

In [None]:
print("\n" + "="*50)
print("N-Gram: Naïve Bayes")
print("="*50)

nb_grid_search_ngram.fit(X_train_ngram, y_train)

In [None]:
print("\nBest Params (N-Gram, Naïve Bayes):", nb_grid_search_ngram.best_params_)
print("Best CV Score (N-Gram, Naïve Bayes):", nb_grid_search_ngram.best_score_)

In [None]:
nb_best_ngram = nb_grid_search_ngram.best_estimator_
y_pred_nb_ngram = nb_best_ngram.predict(X_test_ngram)

nb_test_accuracy_ngram = accuracy_score(y_test, y_pred_nb_ngram)
print("Test Accuracy (N-Gram, Naïve Bayes):", nb_test_accuracy_ngram)
print(classification_report(y_test, y_pred_nb_ngram))

results_ngram['NaiveBayes'] = {
    'best_params': nb_grid_search_ngram.best_params_,
    'best_cv_score': nb_grid_search_ngram.best_score_,
    'test_accuracy': nb_test_accuracy_ngram,
    'classification_report': classification_report(y_test, y_pred_nb_ngram, output_dict=True)
}

In [None]:
nb_cv_results_ngram = pd.DataFrame(nb_grid_search_ngram.cv_results_)
nb_cv_results_ngram.to_csv('Results_3Classes/NGram_Models/ngram_naive_bayes_cv_results.csv', index=False)

#### Summary of Results

In [None]:
# ================================
# FINAL SUMMARY FOR N-GRAM
# ================================
print("\n=== Final Results (N-Gram) ===")
pd.DataFrame(results_ngram).T

#### Confusion Matrices (Tri-Class N-Gram)

In [None]:
# Dictionary of your best N-Gram models for Tri-Class classification
models_triclass_ngram = {
    "DecisionTree": dt_best_ngram,       # from your dt_grid_search_ngram
    "SVM": svm_best_ngram,               # from your svm_grid_search_ngram
    "RandomForest": rf_best_ngram,       # from your rf_grid_search_ngram
    "kNN": knn_best_ngram,               # from your knn_grid_search_ngram
    "NaiveBayes": nb_best_ngram          # from your nb_grid_search_ngram
}
 
# Generate and display the confusion matrix for each model
for model_name, model in models_triclass_ngram.items():
    # X_test_ngram: your N-Gram test features for the Tri-Class setup
    # y_test:       your Tri-Class test labels (negative, neutral, positive)
    y_pred = model.predict(X_test_ngram)
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot(cmap='Blues')
    plt.title(f"Confusion Matrix: {model_name} (Tri-Class, N-Gram)")
    plt.show()

## 🔵 Classification Dual-Class (General)

### Load a fresh copy of the preprocessed data

In [None]:
df_preprocessed_2C = pd.read_csv('Datasets/preprocessed_reviews.csv')
df_preprocessed_2C.head()

In [None]:
df_preprocessed_2C.shape

### Remove 3 Star Reviews

In [None]:
# Remove all neutral sentiment reviews to create a dual-class dataset
df_preprocessed_2C = df_preprocessed_2C[df_preprocessed_2C['Sentiment'] != 'neutral'].copy()

# Verify the distribution of sentiment classes
print("Updated Class Distribution (Dual-Class General):")
print(df_preprocessed_2C['Sentiment'].value_counts())

# Display the first few rows to confirm changes
df_preprocessed_2C.head()

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='Sentiment', data=df_preprocessed_2C, order=['negative', 'positive'])  # neutral reviews were removed above
plt.title("Distribution of Sentiment Classes")
plt.xlabel("Sentiment Class")
plt.ylabel("Count of Reviews")
plt.show()

In [None]:
df_preprocessed_2C.shape

### Train/Test Split

#### Define Features (Preprocessed Text) and Target (Sentiment)

In [None]:
# Define Features (Preprocessed Text) and Target (Sentiment) for Dual-Class (General)
X_2C = df_preprocessed_2C['Preprocessed_Review']
y_2C = df_preprocessed_2C['Sentiment']

In [None]:
df_preprocessed_2C.head()

#### Split Data for Training

In [None]:
# Train/Test Split for Dual-Class (General)
X_train_2C, X_test_2C, y_train_2C, y_test_2C = train_test_split(
    X_2C, 
    y_2C, 
    test_size=0.3,  # Keeping 70-30 split
    random_state=42, 
    stratify=y_2C  # Preserves the class proportions in both the train and test splits
)

# Confirm the new distribution
print(y_train_2C.value_counts())
print(y_test_2C.value_counts())
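
The `stratify` argument does not balance the classes; it keeps their proportions the same in both splits. A minimal sketch on toy labels with an assumed 80/20 imbalance (illustrative only, not the real class distribution):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with an 80/20 imbalance (illustrative only)
y_toy = pd.Series(["positive"] * 80 + ["negative"] * 20)
X_toy = pd.Series([f"review {i}" for i in range(len(y_toy))])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Both splits should show roughly the same class shares as the full data
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```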

### **NLP Method 1:** Term Frequency-Inverse Document Frequency (TF-IDF)
Here, each model is run separately, with its own hyperparameter tuning and 10-fold cross-validation.

#### Feature Extraction Method : TFIDF Setup

In [None]:
# TF-IDF Transformation for Dual-Class (General)
tfidf_vectorizer_2C = TfidfVectorizer(
    max_features=10000,  # Keeping same as Tri-Class
    ngram_range=(1,2),  # Unigram and Bigram
    max_df=0.8,
    sublinear_tf=True
)

# Fit on training data and transform
X_train_tfidf_2C = tfidf_vectorizer_2C.fit_transform(X_train_2C)
X_test_tfidf_2C = tfidf_vectorizer_2C.transform(X_test_2C)

# Create a dictionary to store the final results from each model (optional)
results_2C = {}

# Verify shape of transformed feature set
print(f"TF-IDF Train Shape: {X_train_tfidf_2C.shape}")
print(f"TF-IDF Test Shape: {X_test_tfidf_2C.shape}")

#### M1 (Decision Tree)

##### a. Implementation

In [None]:
# -------------------------------
#  SECTION 1: DECISION TREE (Dual-Class General)
# -------------------------------
print("\n" + "="*50)
print("MODEL 1: Decision Tree (Dual-Class General)")
print("="*50)

# Define model
dt_classifier_2C = DecisionTreeClassifier()

# Define hyperparameter grid
dt_param_grid_2C = {
    'max_depth': [5, 10, 15, 20, 25],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],
}

# Set up GridSearchCV with 10-fold CV
dt_grid_search_2C = GridSearchCV(
    estimator=dt_classifier_2C,
    param_grid=dt_param_grid_2C,
    scoring='accuracy',  
    cv=10,
    n_jobs=4,  # Adjust CPU usage
    verbose=2,
    return_train_score=True
)

In [None]:
# Fit GridSearch on the TF-IDF training data
dt_grid_search_2C.fit(X_train_tfidf_2C, y_train_2C)

In [None]:
# Evaluate best model
print("\nBest Params (Decision Tree - Dual-Class General):", dt_grid_search_2C.best_params_)
print("Best CV Score (Decision Tree - Dual-Class General):", dt_grid_search_2C.best_score_)

In [None]:
# Predict on test data
dt_best_2C = dt_grid_search_2C.best_estimator_  # Get the best model
y_pred_dt_2C = dt_best_2C.predict(X_test_tfidf_2C)  # Predict on test data

In [None]:
# Accuracy & Classification Report
dt_test_accuracy_2C = accuracy_score(y_test_2C, y_pred_dt_2C)
print("Test Accuracy (Decision Tree - Dual-Class General):", dt_test_accuracy_2C)
print(classification_report(y_test_2C, y_pred_dt_2C))

# Store results
results_2C['DecisionTree'] = {
    'best_params': dt_grid_search_2C.best_params_,
    'best_cv_score': dt_grid_search_2C.best_score_,
    'test_accuracy': dt_test_accuracy_2C,
    'classification_report': classification_report(y_test_2C, y_pred_dt_2C, output_dict=True)
}

##### b. Save Results

In [None]:
# Save Decision Tree results to a JSON file
with open('Results_2C(General)/TFIDF_Models/tfidf_decisionTree_2C_results.json', 'w') as f:
    json.dump(results_2C['DecisionTree'], f, indent=4)
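
The `open(..., 'w')` call above fails if the output folders do not exist yet. A minimal sketch of creating them first, assuming `pathlib`; the temporary directory, the file name `example_results.json`, and the payload are placeholders for illustration, not real model output.

```python
import json
import tempfile
from pathlib import Path

# A temporary directory stands in for the project root in this sketch
project_root = Path(tempfile.mkdtemp())
out_path = project_root / "Results_2C(General)" / "TFIDF_Models" / "example_results.json"

# Create missing parent folders; exist_ok avoids an error on reruns
out_path.parent.mkdir(parents=True, exist_ok=True)

# Placeholder payload, not real model output
payload = {"best_params": {"max_depth": 10}, "note": "placeholder"}
with open(out_path, "w") as f:
    json.dump(payload, f, indent=4)

# Read the file back to confirm the round trip
with open(out_path) as f:
    reloaded = json.load(f)
```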

In [None]:
# Convert cross-validation results to DataFrame and save
cv_results_df_2C = pd.DataFrame(dt_grid_search_2C.cv_results_)
cv_results_df_2C.to_csv('Results_2C(General)/TFIDF_Models/tfidf_decisionTree_2C_cv_results.csv', index=False)

# Plot mean test scores for different hyperparameter combinations
plt.figure(figsize=(12, 6))
plt.plot(range(1, len(cv_results_df_2C) + 1), cv_results_df_2C['mean_test_score'], marker='o')
plt.title('Cross-Validation Accuracy for Different Hyperparameters (Decision Tree - Dual-Class General)')
plt.xlabel('Hyperparameter Combination Index')
plt.ylabel('Mean CV Accuracy')
plt.grid(True)
plt.show()

##### c. Visualizations

In [None]:
# -------------------------------
#  Validation Curve Visualization (Decision Tree - Dual-Class General)
# -------------------------------
param_range_2C = [5, 10, 15, 20, 25]

# Compute validation curve
train_scores_2C, test_scores_2C = validation_curve(
    dt_classifier_2C, X_train_tfidf_2C, y_train_2C, param_name="max_depth", 
    param_range=param_range_2C, scoring="accuracy", cv=5, n_jobs=-1
)

# Calculate mean and standard deviation for training and validation scores
train_mean_2C = np.mean(train_scores_2C, axis=1)
test_mean_2C = np.mean(test_scores_2C, axis=1)
train_std_2C = np.std(train_scores_2C, axis=1)
test_std_2C = np.std(test_scores_2C, axis=1)

# Plot validation curve
plt.figure(figsize=(8, 6))
plt.plot(param_range_2C, train_mean_2C, label="Training Score", color='blue', marker='o')
plt.plot(param_range_2C, test_mean_2C, label="Validation Score", color='green', linestyle='--', marker='x')

# Fill areas for standard deviation
plt.fill_between(param_range_2C, train_mean_2C - train_std_2C, train_mean_2C + train_std_2C, alpha=0.2, color='blue')
plt.fill_between(param_range_2C, test_mean_2C - test_std_2C, test_mean_2C + test_std_2C, alpha=0.2, color='green')

plt.title("Validation Curve for max_depth (Decision Tree - Dual-Class General)")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.legend()
plt.grid(True)
plt.show()

#### M2 (Linear SVM)

##### a. Implementation

In [None]:
# -------------------------------
#  SECTION: SVM (LinearSVC) - Dual-Class General
# -------------------------------
print("\n" + "="*50)
print("MODEL: LinearSVC (Dual-Class General)")
print("="*50)

# Define model
svm_classifier_2C = LinearSVC(max_iter=10000)

# Define hyperparameter grid
svm_param_grid_2C = {
    'C': [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]
}

# Set up GridSearchCV with 10-fold CV
svm_grid_search_2C = GridSearchCV(
    estimator=svm_classifier_2C,
    param_grid=svm_param_grid_2C,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit the grid search
svm_grid_search_2C.fit(X_train_tfidf_2C, y_train_2C)

# Evaluate best model
print("\nBest Params (SVM - Dual-Class General):", svm_grid_search_2C.best_params_)
print("Best CV Score (SVM - Dual-Class General):", svm_grid_search_2C.best_score_)

# Predict on test data
svm_best_2C = svm_grid_search_2C.best_estimator_
y_pred_svm_2C = svm_best_2C.predict(X_test_tfidf_2C)

In [None]:
# Accuracy & Classification Report
svm_test_accuracy_2C = accuracy_score(y_test_2C, y_pred_svm_2C)
print("Test Accuracy (SVM - Dual-Class General):", svm_test_accuracy_2C)
print(classification_report(y_test_2C, y_pred_svm_2C))

# Store results
results_2C['SVM'] = {
    'best_params': svm_grid_search_2C.best_params_,
    'best_cv_score': svm_grid_search_2C.best_score_,
    'test_accuracy': svm_test_accuracy_2C,
    'classification_report': classification_report(y_test_2C, y_pred_svm_2C, output_dict=True)
}

##### b. Save and store results

In [None]:
# -------------------------------
#  Save and Store Results for SVM - Dual-Class General
# -------------------------------

# After svm_grid_search_2C.fit(X_train_tfidf_2C, y_train_2C)
svm_cv_results_df_2C = pd.DataFrame(svm_grid_search_2C.cv_results_)

# Print per-fold scores
svm_fold_columns_2C = [col for col in svm_cv_results_df_2C.columns if col.startswith("split") and col.endswith("_test_score")]
svm_cv_fold_scores_2C = svm_cv_results_df_2C[['params', 'mean_test_score', 'std_test_score'] + svm_fold_columns_2C]

print("\n--- SVM: Cross-Validation Scores for Each Fold (Dual-Class General) ---")
print(svm_cv_fold_scores_2C)

# Optionally, save results to CSV
svm_cv_results_df_2C.to_csv('Results_2C(General)/TFIDF_Models/tfidf_dualClass_svm_cv_results.csv', index=False)

#### M3 (Random Forest)

##### a. Implementation

In [None]:
# -------------------------------
#  SECTION: Random Forest (Dual-Class General)
# -------------------------------
print("\n" + "="*50)
print("MODEL: Random Forest (Dual-Class General)")
print("="*50)

# Define model
rf_classifier_2C = RandomForestClassifier()

# Define hyperparameter grid
rf_param_grid_2C = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2']
}

# Set up GridSearchCV
rf_grid_search_2C = GridSearchCV(
    estimator=rf_classifier_2C,
    param_grid=rf_param_grid_2C,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit
rf_grid_search_2C.fit(X_train_tfidf_2C, y_train_2C)

print("\nBest Params (Random Forest - Dual-Class General):", rf_grid_search_2C.best_params_)
print("Best CV Score (Random Forest - Dual-Class General):", rf_grid_search_2C.best_score_)

# Predict
rf_best_2C = rf_grid_search_2C.best_estimator_
y_pred_rf_2C = rf_best_2C.predict(X_test_tfidf_2C)

In [None]:
# Accuracy & Classification Report
rf_test_accuracy_2C = accuracy_score(y_test_2C, y_pred_rf_2C)
print("Test Accuracy (Random Forest - Dual-Class General):", rf_test_accuracy_2C)
print(classification_report(y_test_2C, y_pred_rf_2C))

# Store results
results_2C['RandomForest'] = {
    'best_params': rf_grid_search_2C.best_params_,
    'best_cv_score': rf_grid_search_2C.best_score_,
    'test_accuracy': rf_test_accuracy_2C,
    'classification_report': classification_report(y_test_2C, y_pred_rf_2C, output_dict=True)
}

##### b. Save and Store Results

In [None]:
# Convert cv_results_ to DataFrame and view per-fold scores
rf_cv_results_df_2C = pd.DataFrame(rf_grid_search_2C.cv_results_)

rf_fold_columns_2C = [col for col in rf_cv_results_df_2C.columns if col.startswith("split") and col.endswith("_test_score")]
rf_cv_fold_scores_2C = rf_cv_results_df_2C[['params', 'mean_test_score', 'std_test_score'] + rf_fold_columns_2C]

print("\n--- Random Forest: Cross-Validation Scores for Each Fold (Dual-Class General) ---")
print(rf_cv_fold_scores_2C)

# Save results to CSV using your naming convention
rf_cv_results_df_2C.to_csv('Results_2C(General)/TFIDF_Models/tfidf_dualClass_randomForest_cv_results.csv', index=False)

#### M4 (kNN)

##### a. Implementation

In [None]:
# -------------------------------
#  SECTION: kNN (Dual-Class General)
# -------------------------------
print("\n" + "="*50)
print("MODEL: kNN (Dual-Class General)")
print("="*50)

# Define model
knn_classifier_2C = KNeighborsClassifier()

# Define hyperparameter grid
knn_param_grid_2C = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

# Set up GridSearchCV
knn_grid_search_2C = GridSearchCV(
    estimator=knn_classifier_2C,
    param_grid=knn_param_grid_2C,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit
knn_grid_search_2C.fit(X_train_tfidf_2C, y_train_2C)

print("\nBest Params (kNN - Dual-Class General):", knn_grid_search_2C.best_params_)
print("Best CV Score (kNN - Dual-Class General):", knn_grid_search_2C.best_score_)

# Predict
knn_best_2C = knn_grid_search_2C.best_estimator_
y_pred_knn_2C = knn_best_2C.predict(X_test_tfidf_2C)


In [None]:

# Accuracy & Classification Report
knn_test_accuracy_2C = accuracy_score(y_test_2C, y_pred_knn_2C)
print("Test Accuracy (kNN - Dual-Class General):", knn_test_accuracy_2C)
print(classification_report(y_test_2C, y_pred_knn_2C))

# Store results
results_2C['kNN'] = {
    'best_params': knn_grid_search_2C.best_params_,
    'best_cv_score': knn_grid_search_2C.best_score_,
    'test_accuracy': knn_test_accuracy_2C,
    'classification_report': classification_report(y_test_2C, y_pred_knn_2C, output_dict=True)
}
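
A side note on the `metric` grid: `TfidfVectorizer` L2-normalizes each row by default, and for unit-length vectors the squared Euclidean distance equals twice the cosine distance. Under that assumption, `euclidean` and `cosine` rank neighbors identically, so they should mostly differ only through the `distance` weighting. A minimal sketch that checks the identity on a toy corpus (illustrative only):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

docs = [
    "great product works well",
    "terrible product broke quickly",
    "works well great value",
    "broke quickly very disappointed",
]

# norm='l2' is the default, so every row has unit length
X = TfidfVectorizer().fit_transform(docs)

eu = euclidean_distances(X)
cos = cosine_distances(X)

# For unit vectors: ||a - b||^2 = 2 * (1 - cosine_similarity) = 2 * cosine_distance
identity_holds = np.allclose(eu ** 2, 2 * cos, atol=1e-6)
print(identity_holds)
```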

##### b. Save and Store results

In [None]:
# Cross-validation results
knn_cv_results_df_2C = pd.DataFrame(knn_grid_search_2C.cv_results_)
knn_fold_columns_2C = [col for col in knn_cv_results_df_2C.columns if col.startswith("split") and col.endswith("_test_score")]
knn_cv_fold_scores_2C = knn_cv_results_df_2C[['params', 'mean_test_score', 'std_test_score'] + knn_fold_columns_2C]

print("\n--- kNN: Cross-Validation Scores for Each Fold (Dual-Class General) ---")
print(knn_cv_fold_scores_2C)

# Save results to CSV
knn_cv_results_df_2C.to_csv('Results_2C(General)/TFIDF_Models/tfidf_dualClass_kNN_cv_results.csv', index=False)

#### M5 (Naive Bayes)

##### a. Implementation

In [None]:
# -------------------------------
#  SECTION: Naïve Bayes (Dual-Class General)
# -------------------------------
print("\n" + "="*50)
print("MODEL: Naïve Bayes (Dual-Class General)")
print("="*50)

# Define model
nb_classifier_2C = MultinomialNB()

# Define hyperparameter grid
nb_param_grid_2C = {
    'alpha': [0.5, 1.0, 1.5]
}

# Set up GridSearchCV
nb_grid_search_2C = GridSearchCV(
    estimator=nb_classifier_2C,
    param_grid=nb_param_grid_2C,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit
nb_grid_search_2C.fit(X_train_tfidf_2C, y_train_2C)

print("\nBest Params (Naïve Bayes - Dual-Class General):", nb_grid_search_2C.best_params_)
print("Best CV Score (Naïve Bayes - Dual-Class General):", nb_grid_search_2C.best_score_)

# Predict
nb_best_2C = nb_grid_search_2C.best_estimator_
y_pred_nb_2C = nb_best_2C.predict(X_test_tfidf_2C)


In [None]:
# Accuracy & Classification Report
nb_test_accuracy_2C = accuracy_score(y_test_2C, y_pred_nb_2C)
print("Test Accuracy (Naïve Bayes - Dual-Class General):", nb_test_accuracy_2C)
print(classification_report(y_test_2C, y_pred_nb_2C))

# Store results
results_2C['NaiveBayes'] = {
    'best_params': nb_grid_search_2C.best_params_,
    'best_cv_score': nb_grid_search_2C.best_score_,
    'test_accuracy': nb_test_accuracy_2C,
    'classification_report': classification_report(y_test_2C, y_pred_nb_2C, output_dict=True)
}
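
The `alpha` values in the grid above control additive smoothing. As described in the scikit-learn documentation for `MultinomialNB`, the smoothed estimate for feature $i$ in class $c$ is

$$
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_{c} + \alpha\, n}
$$

where $N_{ci}$ is the total weight of feature $i$ over the training samples of class $c$ (here TF-IDF weights rather than raw counts), $N_{c} = \sum_i N_{ci}$, and $n$ is the number of features. As a worked illustration with made-up numbers, take $N_{ci} = 3$, $N_{c} = 10$ and $n = 5$:

$$
\alpha = 1:\ \frac{3 + 1}{10 + 5} = \frac{4}{15} \approx 0.267,
\qquad
\alpha = 0.5:\ \frac{3 + 0.5}{10 + 2.5} = \frac{3.5}{12.5} = 0.28
$$

Smaller values of `alpha` keep the estimate closer to the unsmoothed ratio $3/10$, while larger values pull it toward the uniform value $1/n$.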

##### b. Save and store results

In [None]:

# Cross-validation results
nb_cv_results_df_2C = pd.DataFrame(nb_grid_search_2C.cv_results_)
nb_fold_columns_2C = [col for col in nb_cv_results_df_2C.columns if col.startswith("split") and col.endswith("_test_score")]
nb_cv_fold_scores_2C = nb_cv_results_df_2C[['params', 'mean_test_score', 'std_test_score'] + nb_fold_columns_2C]

print("\n--- Naïve Bayes: Cross-Validation Scores for Each Fold (Dual-Class General) ---")
print(nb_cv_fold_scores_2C)

# Save results to CSV
nb_cv_results_df_2C.to_csv('Results_2C(General)/TFIDF_Models/tfidf_dualClass_naiveBayes_cv_results.csv', index=False)

#### Summary

In [None]:
# ================================
# FINAL SUMMARY FOR Dual-Class (General) - TF-IDF
# ================================
print("\n=== Final Results (Dual-Class General - TF-IDF) ===")

summary_2C_df = pd.DataFrame(results_2C).T
display(summary_2C_df)
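
The summary table above keeps each classification report as a nested dictionary inside one cell, which is hard to scan. The sketch below flattens a results dictionary of the same layout into one row per model. The `demo_results` entries and all numbers in them are hypothetical; in the notebook the same loop could be pointed at `results_2C`.

```python
import pandas as pd

# Hypothetical entries mirroring the layout of results_2C; all numbers are made up
demo_results = {
    "ModelA": {
        "best_params": {"C": 1},
        "best_cv_score": 0.90,
        "test_accuracy": 0.89,
        "classification_report": {
            "negative": {"precision": 0.80, "recall": 0.60, "f1-score": 0.69, "support": 20},
            "positive": {"precision": 0.91, "recall": 0.96, "f1-score": 0.93, "support": 80},
            "macro avg": {"precision": 0.86, "recall": 0.78, "f1-score": 0.81, "support": 100},
        },
    },
    "ModelB": {
        "best_params": {"alpha": 0.5},
        "best_cv_score": 0.91,
        "test_accuracy": 0.90,
        "classification_report": {
            "negative": {"precision": 0.78, "recall": 0.70, "f1-score": 0.74, "support": 20},
            "positive": {"precision": 0.93, "recall": 0.95, "f1-score": 0.94, "support": 80},
            "macro avg": {"precision": 0.86, "recall": 0.83, "f1-score": 0.85, "support": 100},
        },
    },
}

rows = []
for name, res in demo_results.items():
    report = res["classification_report"]
    rows.append({
        "model": name,
        "best_cv_score": res["best_cv_score"],
        "test_accuracy": res["test_accuracy"],
        "macro_f1": report["macro avg"]["f1-score"],
        "negative_recall": report["negative"]["recall"],
    })

# One row per model, best macro F1 first
summary_flat = pd.DataFrame(rows).set_index("model").sort_values("macro_f1", ascending=False)
print(summary_flat)
```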

#### Confusion Matrices (Dual Class General TFIDF)

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Best TF-IDF models (Dual-Class General)
models_2C_tfidf = {
    "DecisionTree": dt_best_2C,
    "SVM": svm_best_2C,
    "RandomForest": rf_best_2C,
    "kNN": knn_best_2C,
    "NaiveBayes": nb_best_2C
}

for model_name, model in models_2C_tfidf.items():
    # Predict on the TF-IDF test features (Dual-Class General)
    y_pred = model.predict(X_test_tfidf_2C)
    # Pass labels explicitly so rows/columns follow model.classes_ (negative, positive)
    cm = confusion_matrix(y_test_2C, y_pred, labels=model.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot(cmap='Blues')
    plt.title(f"Confusion Matrix: {model_name} (Dual-Class General, TF-IDF)")
    plt.show()
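
Raw counts can be hard to compare when one class is much larger than the other. A minimal sketch of row-normalising the matrix with `normalize="true"`, so that each row shows per-class recall, follows. The labels and predictions are toy values, not results from this project.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions, for illustration only
y_true = ["negative"] * 4 + ["positive"] * 6
y_pred = ["negative", "negative", "positive", "positive"] + ["positive"] * 5 + ["negative"]
labels = ["negative", "positive"]

cm_counts = confusion_matrix(y_true, y_pred, labels=labels)
# normalize="true" divides each row by the number of true samples in that class
cm_rates = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

print(cm_counts)
print(cm_rates.round(3))
```

The normalised array can be passed to `ConfusionMatrixDisplay` in the same way as the count matrix.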

### **NLP Method 2:** N-Gram (Tri-Gram)

#### Feature Extraction Method: N-Gram

In [None]:
# ================================
# NLP Method 2: N-Gram Approach (Dual-Class General)
# ================================
print("\n=== NLP Method 2: N-Gram Approach (Dual-Class General) ===")

from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer for (1,3) => unigrams, bigrams, and trigrams
ngram_vectorizer_2C = CountVectorizer(
    ngram_range=(1,3),  # (1,3) includes unigrams, bigrams, and trigrams
    max_features=10000,
    max_df=0.8
)

# Fit on X_train_2C (the dual-class training data) and transform
X_train_ngram_2C = ngram_vectorizer_2C.fit_transform(X_train_2C)
X_test_ngram_2C = ngram_vectorizer_2C.transform(X_test_2C)

# Dictionary to store the final results of the N-Gram models (Dual-Class General)
results_ngram_2C = {}

# Verify shape of transformed feature set
print(f"N-Gram Train Shape (2C): {X_train_ngram_2C.shape}")
print(f"N-Gram Test Shape (2C): {X_test_ngram_2C.shape}")

#### M1 Decision Tree

In [None]:
# -------------------------------
#  SECTION 1: Decision Tree (Dual-Class General - N-Gram)
# -------------------------------

# Define model
dt_classifier_2C_ngram = DecisionTreeClassifier(random_state=42)  # fixed seed for reproducible results

# Define hyperparameter grid
dt_param_grid_2C_ngram = {
    'max_depth': [10, 20, 25],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10]
}

# Set up GridSearchCV
dt_grid_search_ngram_2C = GridSearchCV(
    estimator=dt_classifier_2C_ngram,
    param_grid=dt_param_grid_2C_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

print("\n" + "="*50)
print("N-Gram: Decision Tree (Dual-Class General)")
print("="*50)

# Fit the grid search on N-Gram training data
dt_grid_search_ngram_2C.fit(X_train_ngram_2C, y_train_2C)

# Display best params
print("\nBest Params (N-Gram, Decision Tree - Dual-Class General):", dt_grid_search_ngram_2C.best_params_)
print("Best CV Score (N-Gram, Decision Tree - Dual-Class General):", dt_grid_search_ngram_2C.best_score_)

# Predict on test data
dt_best_ngram_2C = dt_grid_search_ngram_2C.best_estimator_
y_pred_dt_ngram_2C = dt_best_ngram_2C.predict(X_test_ngram_2C)

# Accuracy & Classification Report
dt_test_accuracy_ngram_2C = accuracy_score(y_test_2C, y_pred_dt_ngram_2C)
print("Test Accuracy (N-Gram, Decision Tree - Dual-Class General):", dt_test_accuracy_ngram_2C)
print(classification_report(y_test_2C, y_pred_dt_ngram_2C))

# Store results
results_ngram_2C['DecisionTree'] = {
    'best_params': dt_grid_search_ngram_2C.best_params_,
    'best_cv_score': dt_grid_search_ngram_2C.best_score_,
    'test_accuracy': dt_test_accuracy_ngram_2C,
    'classification_report': classification_report(y_test_2C, y_pred_dt_ngram_2C, output_dict=True)
}

# Convert cross-validation results to DataFrame
dt_cv_results_ngram_2C = pd.DataFrame(dt_grid_search_ngram_2C.cv_results_)

# Save results to CSV in "Results_2C(General)/NGram_Models"
dt_cv_results_ngram_2C.to_csv('Results_2C(General)/NGram_Models/ngram_dualClass_decision_tree_cv_results.csv', index=False)
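
Every model cell in this notebook repeats the same steps: grid search, prediction, scoring, and storing results. The sketch below wraps those steps in a helper. `run_grid_search` is a hypothetical name, not a function from the notebook, and the synthetic data and small grid exist only to show the call.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def run_grid_search(estimator, param_grid, X_train, y_train, X_test, y_test, cv=10, n_jobs=1):
    """Tune the estimator, score the best model on the test set, and return
    (best_model, results_entry, fitted_search)."""
    search = GridSearchCV(
        estimator, param_grid, scoring="accuracy", cv=cv, n_jobs=n_jobs, return_train_score=True
    )
    search.fit(X_train, y_train)
    best_model = search.best_estimator_
    y_pred = best_model.predict(X_test)
    results_entry = {
        "best_params": search.best_params_,
        "best_cv_score": search.best_score_,
        "test_accuracy": accuracy_score(y_test, y_pred),
        "classification_report": classification_report(y_test, y_pred, output_dict=True),
    }
    return best_model, results_entry, search


# Synthetic data, used only to demonstrate the call
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
demo_model, demo_entry, demo_search = run_grid_search(
    DecisionTreeClassifier(random_state=42), {"max_depth": [3, 5]}, X_tr, y_tr, X_te, y_te, cv=3
)
print(demo_entry["best_params"], round(demo_entry["test_accuracy"], 3))
```

In the notebook, the returned entry could be assigned to `results_ngram_2C['DecisionTree']`, and `pd.DataFrame(demo_search.cv_results_)` saved to CSV as before.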


#### M2 Linear SVM

In [None]:
# -------------------------------
#  SECTION 2: Linear SVM (Dual-Class General - N-Gram)
# -------------------------------
svm_classifier_2C_ngram = LinearSVC(max_iter=10000)

svm_param_grid_2C_ngram = {
    'C': [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]
}

svm_grid_search_ngram_2C = GridSearchCV(
    estimator=svm_classifier_2C_ngram,
    param_grid=svm_param_grid_2C_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

print("\n" + "="*50)
print("N-Gram: Linear SVM (Dual-Class General)")
print("="*50)

# Fit on dual-class N-Gram data
svm_grid_search_ngram_2C.fit(X_train_ngram_2C, y_train_2C)

print("\nBest Params (N-Gram, SVM - Dual-Class General):", svm_grid_search_ngram_2C.best_params_)
print("Best CV Score (N-Gram, SVM - Dual-Class General):", svm_grid_search_ngram_2C.best_score_)

# Predict on test data
svm_best_ngram_2C = svm_grid_search_ngram_2C.best_estimator_
y_pred_svm_ngram_2C = svm_best_ngram_2C.predict(X_test_ngram_2C)

# Evaluate performance
svm_test_accuracy_ngram_2C = accuracy_score(y_test_2C, y_pred_svm_ngram_2C)
print("Test Accuracy (N-Gram, SVM - Dual-Class General):", svm_test_accuracy_ngram_2C)
print(classification_report(y_test_2C, y_pred_svm_ngram_2C))

# Store results in n-gram dictionary
results_ngram_2C['SVM'] = {
    'best_params': svm_grid_search_ngram_2C.best_params_,
    'best_cv_score': svm_grid_search_ngram_2C.best_score_,
    'test_accuracy': svm_test_accuracy_ngram_2C,
    'classification_report': classification_report(y_test_2C, y_pred_svm_ngram_2C, output_dict=True)
}

# Convert cross-validation results to DataFrame
svm_cv_results_ngram_2C = pd.DataFrame(svm_grid_search_ngram_2C.cv_results_)

# Save results to CSV (dual-class general n-gram path)
svm_cv_results_ngram_2C.to_csv('Results_2C(General)/NGram_Models/ngram_dualClass_svm_cv_results.csv', index=False)

#### M3 Random Forest

In [None]:
# -------------------------------
#  SECTION: Random Forest (Dual-Class General - N-Gram)
# -------------------------------
print("\n" + "="*50)
print("N-Gram: Random Forest (Dual-Class General)")
print("="*50)

# Define the Random Forest model
rf_classifier_2C_ngram = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results

# Define the hyperparameter grid
rf_param_grid_2C_ngram = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2']
}

# Set up GridSearchCV
rf_grid_search_ngram_2C = GridSearchCV(
    estimator=rf_classifier_2C_ngram,
    param_grid=rf_param_grid_2C_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit on N-Gram training data (dual-class general)
rf_grid_search_ngram_2C.fit(X_train_ngram_2C, y_train_2C)

# Print the best parameters and best CV score
print("\nBest Params (N-Gram, Random Forest - Dual-Class General):", rf_grid_search_ngram_2C.best_params_)
print("Best CV Score (N-Gram, Random Forest - Dual-Class General):", rf_grid_search_ngram_2C.best_score_)

# Predict on test data
rf_best_ngram_2C = rf_grid_search_ngram_2C.best_estimator_
y_pred_rf_ngram_2C = rf_best_ngram_2C.predict(X_test_ngram_2C)

# Evaluate performance
rf_test_accuracy_ngram_2C = accuracy_score(y_test_2C, y_pred_rf_ngram_2C)
print("Test Accuracy (N-Gram, Random Forest - Dual-Class General):", rf_test_accuracy_ngram_2C)
print(classification_report(y_test_2C, y_pred_rf_ngram_2C))

# Store results in the n-gram dictionary for Dual-Class
results_ngram_2C['RandomForest'] = {
    'best_params': rf_grid_search_ngram_2C.best_params_,
    'best_cv_score': rf_grid_search_ngram_2C.best_score_,
    'test_accuracy': rf_test_accuracy_ngram_2C,
    'classification_report': classification_report(y_test_2C, y_pred_rf_ngram_2C, output_dict=True)
}

# Convert cv_results_ to DataFrame and save
rf_cv_results_ngram_2C = pd.DataFrame(rf_grid_search_ngram_2C.cv_results_)
rf_cv_results_ngram_2C.to_csv('Results_2C(General)/NGram_Models/ngram_dualClass_randomForest_cv_results.csv', index=False)

#### M4 kNN

In [None]:
# -------------------------------
#  SECTION: kNN (Dual-Class General - N-Gram)
# -------------------------------
print("\n" + "="*50)
print("N-Gram: kNN (Dual-Class General)")
print("="*50)

# Define the kNN model
knn_classifier_2C_ngram = KNeighborsClassifier()

# Define the hyperparameter grid
knn_param_grid_2C_ngram = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

# Set up GridSearchCV
knn_grid_search_ngram_2C = GridSearchCV(
    estimator=knn_classifier_2C_ngram,
    param_grid=knn_param_grid_2C_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit on N-Gram training data
knn_grid_search_ngram_2C.fit(X_train_ngram_2C, y_train_2C)

# Print best params and best CV score
print("\nBest Params (N-Gram, kNN - Dual-Class General):", knn_grid_search_ngram_2C.best_params_)
print("Best CV Score (N-Gram, kNN - Dual-Class General):", knn_grid_search_ngram_2C.best_score_)

# Predict on test data
knn_best_ngram_2C = knn_grid_search_ngram_2C.best_estimator_
y_pred_knn_ngram_2C = knn_best_ngram_2C.predict(X_test_ngram_2C)

# Evaluate performance
knn_test_accuracy_ngram_2C = accuracy_score(y_test_2C, y_pred_knn_ngram_2C)
print("Test Accuracy (N-Gram, kNN - Dual-Class General):", knn_test_accuracy_ngram_2C)
print(classification_report(y_test_2C, y_pred_knn_ngram_2C))

# Store results
results_ngram_2C['kNN'] = {
    'best_params': knn_grid_search_ngram_2C.best_params_,
    'best_cv_score': knn_grid_search_ngram_2C.best_score_,
    'test_accuracy': knn_test_accuracy_ngram_2C,
    'classification_report': classification_report(y_test_2C, y_pred_knn_ngram_2C, output_dict=True)
}

# Convert cv_results_ to DataFrame and save
knn_cv_results_df_2C_ngram = pd.DataFrame(knn_grid_search_ngram_2C.cv_results_)
knn_cv_results_df_2C_ngram.to_csv('Results_2C(General)/NGram_Models/ngram_dualClass_kNN_cv_results.csv', index=False)

#### M5 Naive Bayes

In [None]:
# -------------------------------
#  SECTION: Naïve Bayes (Dual-Class General - N-Gram)
# -------------------------------
print("\n" + "="*50)
print("N-Gram: Naïve Bayes (Dual-Class General)")
print("="*50)

# Define the Multinomial Naive Bayes model
nb_classifier_2C_ngram = MultinomialNB()

# Define the hyperparameter grid
nb_param_grid_2C_ngram = {
    'alpha': [0.5, 1.0, 1.5]
}

# Set up GridSearchCV
nb_grid_search_ngram_2C = GridSearchCV(
    estimator=nb_classifier_2C_ngram,
    param_grid=nb_param_grid_2C_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit on dual-class N-Gram data
nb_grid_search_ngram_2C.fit(X_train_ngram_2C, y_train_2C)

# Print best parameters and best CV score
print("\nBest Params (N-Gram, Naïve Bayes - Dual-Class General):", nb_grid_search_ngram_2C.best_params_)
print("Best CV Score (N-Gram, Naïve Bayes - Dual-Class General):", nb_grid_search_ngram_2C.best_score_)

# Predict on test data
nb_best_ngram_2C = nb_grid_search_ngram_2C.best_estimator_
y_pred_nb_ngram_2C = nb_best_ngram_2C.predict(X_test_ngram_2C)

# Evaluate performance
nb_test_accuracy_ngram_2C = accuracy_score(y_test_2C, y_pred_nb_ngram_2C)
print("Test Accuracy (N-Gram, Naïve Bayes - Dual-Class General):", nb_test_accuracy_ngram_2C)
print(classification_report(y_test_2C, y_pred_nb_ngram_2C))

# Store results
results_ngram_2C['NaiveBayes'] = {
    'best_params': nb_grid_search_ngram_2C.best_params_,
    'best_cv_score': nb_grid_search_ngram_2C.best_score_,
    'test_accuracy': nb_test_accuracy_ngram_2C,
    'classification_report': classification_report(y_test_2C, y_pred_nb_ngram_2C, output_dict=True)
}

# Convert cv_results_ to DataFrame
nb_cv_results_df_2C_ngram = pd.DataFrame(nb_grid_search_ngram_2C.cv_results_)

# Save results to CSV
nb_cv_results_df_2C_ngram.to_csv('Results_2C(General)/NGram_Models/ngram_dualClass_naiveBayes_cv_results.csv', index=False)

#### Summary of Results

In [None]:
pd.DataFrame(results_ngram_2C).T

#### Confusion Matrices (Dual Class General - Ngram)

In [None]:
# Best N-Gram models (Dual-Class General)
models_2C_ngram = {
    "DecisionTree": dt_best_ngram_2C,
    "SVM": svm_best_ngram_2C,
    "RandomForest": rf_best_ngram_2C,
    "kNN": knn_best_ngram_2C,
    "NaiveBayes": nb_best_ngram_2C
}

for model_name, model in models_2C_ngram.items():
    # Predict on the N-Gram test features (Dual-Class General)
    y_pred = model.predict(X_test_ngram_2C)
    # Pass labels explicitly so rows/columns follow model.classes_ (negative, positive)
    cm = confusion_matrix(y_test_2C, y_pred, labels=model.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot(cmap='Blues')
    plt.title(f"Confusion Matrix: {model_name} (Dual-Class General, N-Gram)")
    plt.show()


## 🟣 Classification Dual-Class (Negative Biased)

### Load a fresh copy of the preprocessed data

In [None]:
# 1. Load a fresh copy of the preprocessed reviews
df_negBias = pd.read_csv('Datasets/preprocessed_reviews.csv')
df_negBias.head()

### Merge 3 Star Reviews with Negative class

In [None]:
# 2. Drop the existing Sentiment column
df_negBias.drop(columns=['Sentiment'], inplace=True)

In [None]:
df_negBias.head()

In [None]:
# 3. Define a function to merge 1,2,3 → negative and 4,5 → positive
def map_neg_biased(score):
    if score in [1, 2, 3]:
        return 'negative'
    else:
        return 'positive'

In [None]:
# 4. Recreate the Sentiment column with negative bias mapping
df_negBias['Sentiment'] = df_negBias['Score'].apply(map_neg_biased)

In [None]:
df_negBias.head()

In [None]:
# 5. Verify the distribution
print("\nDistribution for Negative-Biased Classification:")
print(df_negBias['Sentiment'].value_counts())
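
The `map_neg_biased` function above sends every value outside 1 to 3 to `'positive'`, including missing or unexpected scores. A hedged alternative is sketched below: it validates the scores first and then applies an explicit lookup. `demo_scores` is toy data standing in for `df_negBias['Score']`.

```python
import pandas as pd

# Toy scores standing in for df_negBias['Score']
demo_scores = pd.Series([1, 2, 3, 4, 5, 3])

# Fail early on anything outside the 1-5 star range (NaN also fails this check)
if not demo_scores.isin([1, 2, 3, 4, 5]).all():
    raise ValueError("Unexpected Score values found")

# Explicit lookup: 1, 2, 3 -> negative; 4, 5 -> positive
score_to_sentiment = {1: "negative", 2: "negative", 3: "negative", 4: "positive", 5: "positive"}
demo_sentiment = demo_scores.map(score_to_sentiment)
print(demo_sentiment.value_counts())
```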

### Train/Test Split

#### Define Features (Preprocessed Text) and Target (Sentiment)

In [None]:
# 6. Define Features (Preprocessed Text) and Target (Neg-Biased Sentiment)
X_negBias = df_negBias['Preprocessed_Review']
y_negBias = df_negBias['Sentiment']

##### Train Test Split

In [None]:
# 7. Train/Test Split
from sklearn.model_selection import train_test_split

X_train_negBias, X_test_negBias, y_train_negBias, y_test_negBias = train_test_split(
    X_negBias,
    y_negBias,
    test_size=0.3,      # 70 train / 30 test
    random_state=42,
    stratify=y_negBias
)

# 8. Print distribution in train/test sets
print("\nTraining Set Distribution:")
print(y_train_negBias.value_counts())
print("\nTest Set Distribution:")
print(y_test_negBias.value_counts())
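
The `stratify` argument is what keeps the class proportions the same in both splits. The sketch below checks this on a toy label series with a 20/80 class ratio. The data are invented, and the split settings mirror the ones used above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 20/80 class ratio, for illustration only
demo_y = pd.Series(["negative"] * 20 + ["positive"] * 80)
demo_X = pd.Series([f"review {i}" for i in range(100)])

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=42, stratify=demo_y
)

# Class shares in each split; with stratification both match the 20/80 ratio
train_share = y_tr_demo.value_counts(normalize=True)
test_share = y_te_demo.value_counts(normalize=True)
print(train_share)
print(test_share)
```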

### **NLP Method 1:** Term Frequency-Inverse Document Frequency (TF-IDF)
Here, each model is run separately, with its own hyperparameter tuning and 10-fold cross-validation.

#### Feature Extraction Method: TFIDF

In [None]:
# ================================
# TF-IDF Feature Extraction (Negative-Biased)
# ================================

print("\n=== TF-IDF Feature Extraction for Negative-Biased Sentiment ===")

# Define the TF-IDF Vectorizer
tfidf_vectorizer_negBias = TfidfVectorizer(
    max_features=10000,  # Adjust as needed
    ngram_range=(1,2),   # Unigrams & Bigrams
    max_df=0.8,
    sublinear_tf=True
)

# Fit on training data and transform
X_train_tfidf_negBias = tfidf_vectorizer_negBias.fit_transform(X_train_negBias)
X_test_tfidf_negBias = tfidf_vectorizer_negBias.transform(X_test_negBias)

# Initialize a dictionary to store model results
results_negBias = {}

print("\nTF-IDF Feature Extraction Complete. Ready for Model Training!")
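
The `sublinear_tf=True` option replaces each raw term count `tf` with `1 + log(tf)`, so a word repeated many times in one review counts for more, but not proportionally more. The sketch below isolates that effect on a toy document. Setting `use_idf=False` and `norm=None` is an assumption made purely to keep the numbers readable and is not the configuration used above.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# One toy document in which "good" appears three times and "bad" once
toy_doc = ["good good good bad"]

# IDF weighting and normalisation are switched off to show the term-frequency part alone
raw_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(toy_doc).toarray()[0]
sub_tf = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True).fit_transform(toy_doc).toarray()[0]

# Vocabulary is sorted alphabetically: index 0 is "bad", index 1 is "good"
print("raw tf:      ", raw_tf)
print("sublinear tf:", sub_tf.round(4))
```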

#### M1 Decision Tree

In [None]:
# ================================
# MODEL 1: Decision Tree (Negative-Biased)
# ================================

print("\n" + "="*50)
print("Negative-Biased: Decision Tree")
print("="*50)

# Define the model
dt_classifier_negBias = DecisionTreeClassifier(random_state=42)  # fixed seed for reproducible results

# Define hyperparameter grid
dt_param_grid_negBias = {
    'max_depth': [5, 10, 15, 20, 25],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],
}

# Set up GridSearchCV with 10-fold cross-validation
dt_grid_search_negBias = GridSearchCV(
    estimator=dt_classifier_negBias,
    param_grid=dt_param_grid_negBias,
    scoring='accuracy',  
    cv=10,
    n_jobs=4,           # Adjust CPU usage
    verbose=2,
    return_train_score=True
)

# Fit GridSearch on the TF-IDF training data
dt_grid_search_negBias.fit(X_train_tfidf_negBias, y_train_negBias)

# Evaluate best model
print("\nBest Params (Decision Tree, Negative-Biased):", dt_grid_search_negBias.best_params_)
print("Best CV Score (Decision Tree, Negative-Biased):", dt_grid_search_negBias.best_score_)

# Predict on test data
dt_best_negBias = dt_grid_search_negBias.best_estimator_  # Get the best model
y_pred_dt_negBias = dt_best_negBias.predict(X_test_tfidf_negBias)  # Predict on test data

# Compute Accuracy & Classification Report
dt_test_accuracy_negBias = accuracy_score(y_test_negBias, y_pred_dt_negBias)
print("Test Accuracy (Decision Tree, Negative-Biased):", dt_test_accuracy_negBias)
print(classification_report(y_test_negBias, y_pred_dt_negBias))

# Store results
results_negBias['DecisionTree'] = {
    'best_params': dt_grid_search_negBias.best_params_,
    'best_cv_score': dt_grid_search_negBias.best_score_,
    'test_accuracy': dt_test_accuracy_negBias,
    'classification_report': classification_report(y_test_negBias, y_pred_dt_negBias, output_dict=True)
}

# Save cross-validation results to CSV
dt_cv_results_negBias = pd.DataFrame(dt_grid_search_negBias.cv_results_)
dt_cv_results_negBias.to_csv('Results_2C(NegBias)/TFIDF_Models/tfidf_negBias_decisionTree_cv_results.csv', index=False)

print("\nDecision Tree Model Training & Evaluation Complete for Negative-Biased Sentiment!")

#### M2 Linear SVM

In [None]:
# -------------------------------
#  SECTION 2: Linear SVM (Negative Bias)
# -------------------------------
print("\n" + "="*50)
print("NEGATIVE-BIASED: Linear SVM")
print("="*50)

# Define model
svm_classifier_negBias = LinearSVC(max_iter=10000)

# Define hyperparameter grid
svm_param_grid_negBias = {
    'C': [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]
}

# Set up GridSearchCV with 10-fold CV
svm_grid_search_negBias = GridSearchCV(
    estimator=svm_classifier_negBias,
    param_grid=svm_param_grid_negBias,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit the grid search
svm_grid_search_negBias.fit(X_train_tfidf_negBias, y_train_negBias)

# Evaluate best model
print("\nBest Params (Negative-Biased, SVM):", svm_grid_search_negBias.best_params_)
print("Best CV Score (Negative-Biased, SVM):", svm_grid_search_negBias.best_score_)

# Predict on test data
svm_best_negBias = svm_grid_search_negBias.best_estimator_
y_pred_svm_negBias = svm_best_negBias.predict(X_test_tfidf_negBias)

# Accuracy & Classification Report
svm_test_accuracy_negBias = accuracy_score(y_test_negBias, y_pred_svm_negBias)
print("Test Accuracy (Negative-Biased, SVM):", svm_test_accuracy_negBias)
print(classification_report(y_test_negBias, y_pred_svm_negBias))

# Store results
results_negBias['SVM'] = {
    'best_params': svm_grid_search_negBias.best_params_,
    'best_cv_score': svm_grid_search_negBias.best_score_,
    'test_accuracy': svm_test_accuracy_negBias,
    'classification_report': classification_report(y_test_negBias, y_pred_svm_negBias, output_dict=True)
}

# Save cross-validation results to CSV
svm_cv_results_negBias = pd.DataFrame(svm_grid_search_negBias.cv_results_)
svm_cv_results_negBias.to_csv('Results_2C(NegBias)/TFIDF_Models/tfidf_negBias_svm_cv_results.csv', index=False)

print("\nLinear SVM Model Training & Evaluation Complete for Negative-Biased Sentiment!")

#### M3 Random Forest

In [None]:
# -------------------------------
#  SECTION 3: Random Forest (Negative Bias)
# -------------------------------
print("\n" + "="*50)
print("NEGATIVE-BIASED: Random Forest")
print("="*50)

# Define model
rf_classifier_negBias = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results

# Define hyperparameter grid
rf_param_grid_negBias = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2']   # Usually beneficial for text
}

# Set up GridSearchCV
rf_grid_search_negBias = GridSearchCV(
    estimator=rf_classifier_negBias,
    param_grid=rf_param_grid_negBias,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit the grid search
rf_grid_search_negBias.fit(X_train_tfidf_negBias, y_train_negBias)

# Evaluate best model
print("\nBest Params (Negative-Biased, Random Forest):", rf_grid_search_negBias.best_params_)
print("Best CV Score (Negative-Biased, Random Forest):", rf_grid_search_negBias.best_score_)

# Predict on test data
rf_best_negBias = rf_grid_search_negBias.best_estimator_
y_pred_rf_negBias = rf_best_negBias.predict(X_test_tfidf_negBias)

# Accuracy & Classification Report
rf_test_accuracy_negBias = accuracy_score(y_test_negBias, y_pred_rf_negBias)
print("Test Accuracy (Negative-Biased, Random Forest):", rf_test_accuracy_negBias)
print(classification_report(y_test_negBias, y_pred_rf_negBias))

# Store results
results_negBias['RandomForest'] = {
    'best_params': rf_grid_search_negBias.best_params_,
    'best_cv_score': rf_grid_search_negBias.best_score_,
    'test_accuracy': rf_test_accuracy_negBias,
    'classification_report': classification_report(y_test_negBias, y_pred_rf_negBias, output_dict=True)
}

# Save cross-validation results to CSV
rf_cv_results_negBias = pd.DataFrame(rf_grid_search_negBias.cv_results_)
rf_cv_results_negBias.to_csv('Results_2C(NegBias)/TFIDF_Models/tfidf_negBias_randomForest_cv_results.csv', index=False)

print("\nRandom Forest Model Training & Evaluation Complete for Negative-Biased Sentiment!")


#### M4 kNN

In [None]:
# -------------------------------
#  SECTION 4: kNN (Negative Bias)
# -------------------------------
print("\n" + "="*50)
print("NEGATIVE-BIASED: kNN")
print("="*50)

# Define model
knn_classifier_negBias = KNeighborsClassifier()

# Define hyperparameter grid
knn_param_grid_negBias = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

# Set up GridSearchCV
knn_grid_search_negBias = GridSearchCV(
    estimator=knn_classifier_negBias,
    param_grid=knn_param_grid_negBias,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit the grid search
knn_grid_search_negBias.fit(X_train_tfidf_negBias, y_train_negBias)

# Evaluate best model
print("\nBest Params (Negative-Biased, kNN):", knn_grid_search_negBias.best_params_)
print("Best CV Score (Negative-Biased, kNN):", knn_grid_search_negBias.best_score_)

# Predict on test data
knn_best_negBias = knn_grid_search_negBias.best_estimator_
y_pred_knn_negBias = knn_best_negBias.predict(X_test_tfidf_negBias)

# Accuracy & Classification Report
knn_test_accuracy_negBias = accuracy_score(y_test_negBias, y_pred_knn_negBias)
print("Test Accuracy (Negative-Biased, kNN):", knn_test_accuracy_negBias)
print(classification_report(y_test_negBias, y_pred_knn_negBias))

# Store results
results_negBias['kNN'] = {
    'best_params': knn_grid_search_negBias.best_params_,
    'best_cv_score': knn_grid_search_negBias.best_score_,
    'test_accuracy': knn_test_accuracy_negBias,
    'classification_report': classification_report(y_test_negBias, y_pred_knn_negBias, output_dict=True)
}

# Save cross-validation results to CSV
knn_cv_results_negBias = pd.DataFrame(knn_grid_search_negBias.cv_results_)
knn_cv_results_negBias.to_csv('Results_2C(NegBias)/TFIDF_Models/tfidf_negBias_knn_cv_results.csv', index=False)

print("\nkNN Model Training & Evaluation Complete for Negative-Biased Sentiment!")
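
One detail of the kNN grid is worth noting. `TfidfVectorizer` L2-normalises each row by default, and for unit-length vectors the squared Euclidean distance equals twice the cosine distance. The `'euclidean'` and `'cosine'` options therefore rank neighbours identically on TF-IDF features, although the `'distance'` weights still differ in value. The sketch below checks the relationship on an invented toy corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented toy corpus, for illustration only
toy = ["great tea", "great coffee", "bad tea", "bad coffee beans", "tea and coffee", "awful"]
X_toy = TfidfVectorizer().fit_transform(toy)  # rows are L2-normalised by default

nn_euc = NearestNeighbors(n_neighbors=3, metric="euclidean", algorithm="brute").fit(X_toy)
nn_cos = NearestNeighbors(n_neighbors=3, metric="cosine", algorithm="brute").fit(X_toy)

d_euc, _ = nn_euc.kneighbors(X_toy)
d_cos, _ = nn_cos.kneighbors(X_toy)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b)) = 2 * cosine_distance
same_geometry = np.allclose(d_euc ** 2, 2 * d_cos, atol=1e-6)
print("squared Euclidean == 2 * cosine distance:", same_geometry)
```

This equivalence does not hold for the raw n-gram counts from `CountVectorizer`, which are not normalised.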

#### M5 Naive Bayes

In [None]:
# -------------------------------
#  SECTION 5: Naïve Bayes (Negative Bias)
# -------------------------------
print("\n" + "="*50)
print("NEGATIVE-BIASED: Naïve Bayes")
print("="*50)

# Define model
nb_classifier_negBias = MultinomialNB()

# Define hyperparameter grid
nb_param_grid_negBias = {
    'alpha': [0.5, 1.0, 1.5]
}

# Set up GridSearchCV
nb_grid_search_negBias = GridSearchCV(
    estimator=nb_classifier_negBias,
    param_grid=nb_param_grid_negBias,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Fit the grid search
nb_grid_search_negBias.fit(X_train_tfidf_negBias, y_train_negBias)

# Evaluate best model
print("\nBest Params (Negative-Biased, Naïve Bayes):", nb_grid_search_negBias.best_params_)
print("Best CV Score (Negative-Biased, Naïve Bayes):", nb_grid_search_negBias.best_score_)

# Predict on test data
nb_best_negBias = nb_grid_search_negBias.best_estimator_
y_pred_nb_negBias = nb_best_negBias.predict(X_test_tfidf_negBias)

# Accuracy & Classification Report
nb_test_accuracy_negBias = accuracy_score(y_test_negBias, y_pred_nb_negBias)
print("Test Accuracy (Negative-Biased, Naïve Bayes):", nb_test_accuracy_negBias)
print(classification_report(y_test_negBias, y_pred_nb_negBias))

# Store results
results_negBias['NaiveBayes'] = {
    'best_params': nb_grid_search_negBias.best_params_,
    'best_cv_score': nb_grid_search_negBias.best_score_,
    'test_accuracy': nb_test_accuracy_negBias,
    'classification_report': classification_report(y_test_negBias, y_pred_nb_negBias, output_dict=True)
}

# Save cross-validation results to CSV
nb_cv_results_negBias = pd.DataFrame(nb_grid_search_negBias.cv_results_)
nb_cv_results_negBias.to_csv('Results_2C(NegBias)/TFIDF_Models/tfidf_negBias_NaiveBayes_cv_results.csv', index=False)

print("\nNaïve Bayes Model Training & Evaluation Complete for Negative-Biased Sentiment!")

#### Summary

In [None]:
# -------------------------------
#  SUMMARY: Negative-Biased Sentiment Classification
# -------------------------------

print("\n" + "="*50)
print("NEGATIVE-BIASED: Summary of Results")
print("="*50)

# Convert results dictionary to DataFrame for readability
df_results_summary_negBias = pd.DataFrame(results_negBias).T

# Display summary table
print(df_results_summary_negBias)
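
All grid searches in this notebook use `scoring='accuracy'`. If the classes turn out to be imbalanced (the distribution printed earlier would show this), accuracy alone can look strong for a model that ignores the minority class. The sketch below illustrates the gap on invented labels with a predictor that always answers `'positive'`.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Invented labels with a 10/90 class ratio, and a predictor that always says "positive"
y_true = ["negative"] * 10 + ["positive"] * 90
y_majority = ["positive"] * 100

acc = accuracy_score(y_true, y_majority)
bal_acc = balanced_accuracy_score(y_true, y_majority)  # mean of per-class recall
macro_f1 = f1_score(y_true, y_majority, average="macro", zero_division=0)

print(f"accuracy={acc:.3f}  balanced_accuracy={bal_acc:.3f}  macro_f1={macro_f1:.3f}")
```

`GridSearchCV` also accepts `scoring='balanced_accuracy'` or `scoring='f1_macro'`, which could be tried as alternatives if minority-class performance matters.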

#### Confusion Matrices (Dual Class NEG BIAS - TFIDF)

In [None]:
# Best TF-IDF models (Negative-Biased)
models_negBias_tfidf = {
    "DecisionTree": dt_best_negBias,
    "SVM": svm_best_negBias,
    "RandomForest": rf_best_negBias,
    "kNN": knn_best_negBias,
    "NaiveBayes": nb_best_negBias
}

for model_name, model in models_negBias_tfidf.items():
    # Predict on the TF-IDF test features (Negative-Biased)
    y_pred = model.predict(X_test_tfidf_negBias)
    # Pass labels explicitly so rows/columns follow model.classes_ (negative, positive)
    cm = confusion_matrix(y_test_negBias, y_pred, labels=model.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot(cmap='Blues')
    plt.title(f"Confusion Matrix: {model_name} (Negative-Biased, TF-IDF)")
    plt.show()

### **NLP Method 2:** N-Gram (Tri-Gram)

#### Feature Extraction Method: N-Gram

In [None]:
# ================================
# NEGATIVE-BIASED: N-Gram Feature Extraction
# ================================

print("\n" + "="*50)
print("NEGATIVE-BIASED: N-Gram Feature Extraction")
print("="*50)

from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer for (1,3) N-Grams
ngram_vectorizer_negBias = CountVectorizer(
    ngram_range=(1,3),  # (1,3) => unigrams, bigrams, trigrams
    max_features=10000,
    max_df=0.8
)

# Fit on training data and transform both train & test sets
X_train_ngram_negBias = ngram_vectorizer_negBias.fit_transform(X_train_negBias)
X_test_ngram_negBias = ngram_vectorizer_negBias.transform(X_test_negBias)

# Initialize a dictionary to store model results for N-Gram approach
results_ngram_negBias = {}

print("\nN-Gram Feature Extraction for Negative-Biased Sentiment Completed!")

#### M1: Decision Tree

In [None]:
# ================================
# NEGATIVE-BIASED: Decision Tree (N-Gram)
# ================================

print("\n" + "="*50)
print("NEGATIVE-BIASED: N-Gram Decision Tree")
print("="*50)

# Define Decision Tree Classifier
dt_classifier_negBias_ngram = DecisionTreeClassifier(random_state=42)  # fixed seed for reproducible results

# Define hyperparameter grid
dt_param_grid_negBias_ngram = {
    'max_depth': [10, 20, 25],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10]
}

# Set up GridSearchCV with 10-fold cross-validation
dt_grid_search_negBias_ngram = GridSearchCV(
    estimator=dt_classifier_negBias_ngram,
    param_grid=dt_param_grid_negBias_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Train the model
dt_grid_search_negBias_ngram.fit(X_train_ngram_negBias, y_train_negBias)

# Print best parameters
print("\nBest Params (Negative-Biased, N-Gram, Decision Tree):", dt_grid_search_negBias_ngram.best_params_)
print("Best CV Score (Negative-Biased, N-Gram, Decision Tree):", dt_grid_search_negBias_ngram.best_score_)

# Make predictions
dt_best_negBias_ngram = dt_grid_search_negBias_ngram.best_estimator_
y_pred_dt_negBias_ngram = dt_best_negBias_ngram.predict(X_test_ngram_negBias)

# Evaluate model performance
dt_test_accuracy_negBias_ngram = accuracy_score(y_test_negBias, y_pred_dt_negBias_ngram)
print("Test Accuracy (Negative-Biased, N-Gram, Decision Tree):", dt_test_accuracy_negBias_ngram)
print(classification_report(y_test_negBias, y_pred_dt_negBias_ngram))

# Store results
results_ngram_negBias['DecisionTree'] = {
    'best_params': dt_grid_search_negBias_ngram.best_params_,
    'best_cv_score': dt_grid_search_negBias_ngram.best_score_,
    'test_accuracy': dt_test_accuracy_negBias_ngram,
    'classification_report': classification_report(y_test_negBias, y_pred_dt_negBias_ngram, output_dict=True)
}

# Save cross-validation results to CSV
dt_cv_results_negBias_ngram = pd.DataFrame(dt_grid_search_negBias_ngram.cv_results_)
dt_cv_results_negBias_ngram.to_csv('Results_2C(NegBias)/NGram_Models/ngram_negBias_decisionTree_cv_results.csv', index=False)

print("\nDecision Tree Model Training & Evaluation Complete for Negative-Biased Sentiment (N-Gram)!")

#### M2: Linear SVM

In [None]:
# ================================
# NEGATIVE-BIASED: Linear SVM (N-Gram)
# ================================

print("\n" + "="*50)
print("NEGATIVE-BIASED: N-Gram Linear SVM")
print("="*50)

# Define SVM Classifier
svm_classifier_negBias_ngram = LinearSVC(max_iter=10000)

# Define hyperparameter grid
svm_param_grid_negBias_ngram = {
    'C': [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]
}

# Set up GridSearchCV with 10-fold cross-validation
svm_grid_search_negBias_ngram = GridSearchCV(
    estimator=svm_classifier_negBias_ngram,
    param_grid=svm_param_grid_negBias_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Train the model
svm_grid_search_negBias_ngram.fit(X_train_ngram_negBias, y_train_negBias)

# Print best parameters
print("\nBest Params (Negative-Biased, N-Gram, SVM):", svm_grid_search_negBias_ngram.best_params_)
print("Best CV Score (Negative-Biased, N-Gram, SVM):", svm_grid_search_negBias_ngram.best_score_)

# Make predictions
svm_best_negBias_ngram = svm_grid_search_negBias_ngram.best_estimator_
y_pred_svm_negBias_ngram = svm_best_negBias_ngram.predict(X_test_ngram_negBias)

# Evaluate model performance
svm_test_accuracy_negBias_ngram = accuracy_score(y_test_negBias, y_pred_svm_negBias_ngram)
print("Test Accuracy (Negative-Biased, N-Gram, SVM):", svm_test_accuracy_negBias_ngram)
print(classification_report(y_test_negBias, y_pred_svm_negBias_ngram))

# Store results
results_ngram_negBias['SVM'] = {
    'best_params': svm_grid_search_negBias_ngram.best_params_,
    'best_cv_score': svm_grid_search_negBias_ngram.best_score_,
    'test_accuracy': svm_test_accuracy_negBias_ngram,
    'classification_report': classification_report(y_test_negBias, y_pred_svm_negBias_ngram, output_dict=True)
}

# Save cross-validation results to CSV
svm_cv_results_negBias_ngram = pd.DataFrame(svm_grid_search_negBias_ngram.cv_results_)
svm_cv_results_negBias_ngram.to_csv('Results_2C(NegBias)/NGram_Models/ngram_negBias_svm_cv_results.csv', index=False)

print("\nLinear SVM Model Training & Evaluation Complete for Negative-Biased Sentiment (N-Gram)!")

#### M3: Random Forest

In [None]:
# ================================
# NEGATIVE-BIASED: Random Forest (N-Gram)
# ================================

print("\n" + "="*50)
print("NEGATIVE-BIASED: N-Gram Random Forest")
print("="*50)

# Define Random Forest Classifier
rf_classifier_negBias_ngram = RandomForestClassifier()

# Define hyperparameter grid
rf_param_grid_negBias_ngram = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2']
}

# Set up GridSearchCV
rf_grid_search_negBias_ngram = GridSearchCV(
    estimator=rf_classifier_negBias_ngram,
    param_grid=rf_param_grid_negBias_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Train the model
rf_grid_search_negBias_ngram.fit(X_train_ngram_negBias, y_train_negBias)

# Print best parameters
print("\nBest Params (Negative-Biased, N-Gram, Random Forest):", rf_grid_search_negBias_ngram.best_params_)
print("Best CV Score (Negative-Biased, N-Gram, Random Forest):", rf_grid_search_negBias_ngram.best_score_)

# Make predictions
rf_best_negBias_ngram = rf_grid_search_negBias_ngram.best_estimator_
y_pred_rf_negBias_ngram = rf_best_negBias_ngram.predict(X_test_ngram_negBias)

# Evaluate model performance
rf_test_accuracy_negBias_ngram = accuracy_score(y_test_negBias, y_pred_rf_negBias_ngram)
print("Test Accuracy (Negative-Biased, N-Gram, Random Forest):", rf_test_accuracy_negBias_ngram)
print(classification_report(y_test_negBias, y_pred_rf_negBias_ngram))

# Store results
results_ngram_negBias['RandomForest'] = {
    'best_params': rf_grid_search_negBias_ngram.best_params_,
    'best_cv_score': rf_grid_search_negBias_ngram.best_score_,
    'test_accuracy': rf_test_accuracy_negBias_ngram,
    'classification_report': classification_report(y_test_negBias, y_pred_rf_negBias_ngram, output_dict=True)
}

# Save cross-validation results to CSV
rf_cv_results_negBias_ngram = pd.DataFrame(rf_grid_search_negBias_ngram.cv_results_)
rf_cv_results_negBias_ngram.to_csv('Results_2C(NegBias)/NGram_Models/ngram_negBias_randomForest_cv_results.csv', index=False)

print("\nRandom Forest Model Training & Evaluation Complete for Negative-Biased Sentiment (N-Gram)!")

#### M4: kNN

In [None]:
# ================================
# NEGATIVE-BIASED: kNN (N-Gram)
# ================================

print("\n" + "="*50)
print("NEGATIVE-BIASED: N-Gram kNN")
print("="*50)

# Define kNN Classifier
knn_classifier_negBias_ngram = KNeighborsClassifier()

# Define hyperparameter grid
knn_param_grid_negBias_ngram = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

# Set up GridSearchCV
knn_grid_search_negBias_ngram = GridSearchCV(
    estimator=knn_classifier_negBias_ngram,
    param_grid=knn_param_grid_negBias_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Train the model
knn_grid_search_negBias_ngram.fit(X_train_ngram_negBias, y_train_negBias)

# Print best parameters
print("\nBest Params (Negative-Biased, N-Gram, kNN):", knn_grid_search_negBias_ngram.best_params_)
print("Best CV Score (Negative-Biased, N-Gram, kNN):", knn_grid_search_negBias_ngram.best_score_)

# Make predictions
knn_best_negBias_ngram = knn_grid_search_negBias_ngram.best_estimator_
y_pred_knn_negBias_ngram = knn_best_negBias_ngram.predict(X_test_ngram_negBias)

# Evaluate model performance
knn_test_accuracy_negBias_ngram = accuracy_score(y_test_negBias, y_pred_knn_negBias_ngram)
print("Test Accuracy (Negative-Biased, N-Gram, kNN):", knn_test_accuracy_negBias_ngram)
print(classification_report(y_test_negBias, y_pred_knn_negBias_ngram))

# Store results
results_ngram_negBias['kNN'] = {
    'best_params': knn_grid_search_negBias_ngram.best_params_,
    'best_cv_score': knn_grid_search_negBias_ngram.best_score_,
    'test_accuracy': knn_test_accuracy_negBias_ngram,
    'classification_report': classification_report(y_test_negBias, y_pred_knn_negBias_ngram, output_dict=True)
}

# Save cross-validation results to CSV
knn_cv_results_negBias_ngram = pd.DataFrame(knn_grid_search_negBias_ngram.cv_results_)
knn_cv_results_negBias_ngram.to_csv('Results_2C(NegBias)/NGram_Models/ngram_negBias_knn_cv_results.csv', index=False)

print("\nkNN Model Training & Evaluation Complete for Negative-Biased Sentiment (N-Gram)!")

#### M5: Naive Bayes

In [None]:
# ================================
# NEGATIVE-BIASED: Naïve Bayes (N-Gram)
# ================================

print("\n" + "="*50)
print("NEGATIVE-BIASED: N-Gram Naïve Bayes")
print("="*50)

# Define Naïve Bayes Classifier
nb_classifier_negBias_ngram = MultinomialNB()

# Define hyperparameter grid
nb_param_grid_negBias_ngram = {
    'alpha': [0.5, 1.0, 1.5]
}

# Set up GridSearchCV
nb_grid_search_negBias_ngram = GridSearchCV(
    estimator=nb_classifier_negBias_ngram,
    param_grid=nb_param_grid_negBias_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

# Train the model
nb_grid_search_negBias_ngram.fit(X_train_ngram_negBias, y_train_negBias)

# Print best parameters
print("\nBest Params (Negative-Biased, N-Gram, Naïve Bayes):", nb_grid_search_negBias_ngram.best_params_)
print("Best CV Score (Negative-Biased, N-Gram, Naïve Bayes):", nb_grid_search_negBias_ngram.best_score_)

# Make predictions
nb_best_negBias_ngram = nb_grid_search_negBias_ngram.best_estimator_
y_pred_nb_negBias_ngram = nb_best_negBias_ngram.predict(X_test_ngram_negBias)

# Evaluate model performance
nb_test_accuracy_negBias_ngram = accuracy_score(y_test_negBias, y_pred_nb_negBias_ngram)
print("Test Accuracy (Negative-Biased, N-Gram, Naïve Bayes):", nb_test_accuracy_negBias_ngram)
print(classification_report(y_test_negBias, y_pred_nb_negBias_ngram))

# Store results
results_ngram_negBias['NaiveBayes'] = {
    'best_params': nb_grid_search_negBias_ngram.best_params_,
    'best_cv_score': nb_grid_search_negBias_ngram.best_score_,
    'test_accuracy': nb_test_accuracy_negBias_ngram,
    'classification_report': classification_report(y_test_negBias, y_pred_nb_negBias_ngram, output_dict=True)
}

# Save cross-validation results to CSV
nb_cv_results_negBias_ngram = pd.DataFrame(nb_grid_search_negBias_ngram.cv_results_)
nb_cv_results_negBias_ngram.to_csv('Results_2C(NegBias)/NGram_Models/ngram_negBias_naiveBayes_cv_results.csv', index=False)

print("\nNaïve Bayes Model Training & Evaluation Complete for Negative-Biased Sentiment (N-Gram)!")

#### Summary

In [None]:
# ================================
# FINAL SUMMARY FOR NEGATIVE-BIASED (N-GRAM)
# ================================
print("\n=== Final Results (Negative-Biased, N-Gram) ===")
pd.DataFrame(results_ngram_negBias).T
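
Because each entry stores a nested `classification_report` dictionary, the transposed table above can be hard to scan. The following is a minimal sketch of one way to flatten such a dictionary into a compact comparison table. The helper name `flatten_results`, the model names, and every number in `example_results` are placeholders chosen for illustration; they are not results from this study.

```python
import pandas as pd

# Placeholder dictionary with the same layout as results_ngram_negBias.
# Model names and numbers are illustrative only, not results from this study.
example_results = {
    "ModelA": {
        "best_params": {"C": 1},
        "best_cv_score": 0.80,
        "test_accuracy": 0.75,
        "classification_report": {
            "negative": {"precision": 0.70, "recall": 0.60, "f1-score": 0.65, "support": 10},
            "positive": {"precision": 0.80, "recall": 0.90, "f1-score": 0.85, "support": 30},
            "macro avg": {"precision": 0.75, "recall": 0.75, "f1-score": 0.75, "support": 40},
        },
    },
    "ModelB": {
        "best_params": {"alpha": 1.0},
        "best_cv_score": 0.86,
        "test_accuracy": 0.85,
        "classification_report": {
            "negative": {"precision": 0.75, "recall": 0.70, "f1-score": 0.72, "support": 10},
            "positive": {"precision": 0.90, "recall": 0.92, "f1-score": 0.91, "support": 30},
            "macro avg": {"precision": 0.82, "recall": 0.81, "f1-score": 0.80, "support": 40},
        },
    },
}


def flatten_results(results):
    """Hypothetical helper: one row per model with a few headline metrics."""
    rows = []
    for model_name, res in results.items():
        report = res["classification_report"]
        rows.append({
            "model": model_name,
            "best_cv_score": res["best_cv_score"],
            "test_accuracy": res["test_accuracy"],
            "macro_f1": report["macro avg"]["f1-score"],
            "negative_recall": report["negative"]["recall"],
        })
    table = pd.DataFrame(rows).set_index("model")
    return table.sort_values("test_accuracy", ascending=False)


summary_table = flatten_results(example_results)
print(summary_table)
```

Passing `results_ngram_negBias` instead of `example_results` should give the equivalent table for the models trained above, assuming every entry contains the keys used here.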

#### Confusion Matrices (Dual Class NEG Bias - NGram)

In [None]:
# Best N‑Gram models (Negative-Biased)
models_negBias_ngram = {
    "DecisionTree": dt_best_negBias_ngram,
    "SVM": svm_best_negBias_ngram,
    "RandomForest": rf_best_negBias_ngram,
    "kNN": knn_best_negBias_ngram,
    "NaiveBayes": nb_best_negBias_ngram
}

for model_name, model in models_negBias_ngram.items():
    # X_test_ngram_negBias: your N‑Gram test data (Negative-Biased)
    # y_test_negBias:       your Negative-Biased labels (negative, positive)
    y_pred = model.predict(X_test_ngram_negBias)
    cm = confusion_matrix(y_test_negBias, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot(cmap='Blues')
    plt.title(f"Confusion Matrix: {model_name} (Negative-Biased, N-Gram)")
    plt.show()
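
The raw counts in a confusion matrix are easier to compare across models once they are converted to per-class recall. The sketch below uses a small hand-made label list (not data from this project) to show how the rows of the matrix relate to recall, and how `normalize="true"` produces the same values directly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only.
y_true = ["negative", "negative", "negative",
          "positive", "positive", "positive", "positive", "positive"]
y_pred = ["negative", "negative", "positive",
          "positive", "positive", "positive", "positive", "negative"]
labels = ["negative", "positive"]

# Rows are true classes, columns are predicted classes.
cm_counts = confusion_matrix(y_true, y_pred, labels=labels)

# Recall per class = correct predictions on the diagonal / row total.
recall_per_class = cm_counts.diagonal() / cm_counts.sum(axis=1)

# normalize="true" divides each row by its total, so the diagonal is recall.
cm_row_normalized = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

print(cm_counts)
print(dict(zip(labels, recall_per_class.round(3))))
```

With imbalanced classes, the row-normalized view makes it clearer how often the minority class is misclassified, which accuracy alone can hide.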

## 🔴 Classification Dual-Class (Positive Biased)

### Load a fresh copy of the data

In [None]:
df_posBias = pd.read_csv('Datasets/preprocessed_reviews.csv')

In [None]:
df_posBias.head()

### Merge 3 Star Reviews with POSITIVE class

In [None]:
# 3. Drop the existing 'Sentiment' column (ensuring no conflicts)
df_posBias.drop(columns=['Sentiment'], inplace=True)

In [None]:
df_posBias.head()

In [None]:
# 4. Define a function to merge 1,2 → negative and 3,4,5 → positive
def map_pos_biased(score):
    if score in [1, 2]:
        return 'negative'
    else:  # 3, 4, 5 become positive
        return 'positive'

In [None]:
# 5. Create a new 'Sentiment' column with updated mapping
df_posBias['Sentiment'] = df_posBias['Score'].apply(map_pos_biased)

In [None]:
# 6. Verify the new class distribution
print("\nDistribution for Positive-Biased Classification:")
print(df_posBias['Sentiment'].value_counts())
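
As a quick sanity check, the mapping can be applied to a toy column containing each star rating once. The function body below is copied from the cell above; the `toy_scores` frame and the vectorized alternative are illustrative additions. Note that the `else` branch sends any value other than 1 or 2 to `'positive'`, including missing scores, so the two versions agree only for valid ratings from 1 to 5.

```python
import pandas as pd


def map_pos_biased(score):
    # Same rule as above: 1-2 stars are negative, everything else is positive.
    if score in [1, 2]:
        return 'negative'
    else:
        return 'positive'


# Toy frame with one row per star rating (illustration only).
toy_scores = pd.DataFrame({"Score": [1, 2, 3, 4, 5]})
toy_scores["Sentiment"] = toy_scores["Score"].apply(map_pos_biased)

# Vectorized alternative; equivalent for valid scores 1-5 only.
vectorized_sentiment = (toy_scores["Score"] >= 3).map(
    {True: "positive", False: "negative"}
)

print(toy_scores)
```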

### Train/Test Split

In [None]:
# 7. Define Features (Preprocessed Text) and Target (Pos-Biased Sentiment)
X_pos = df_posBias['Preprocessed_Review']
y_pos = df_posBias['Sentiment']

In [None]:
# 8. Train/Test Split
X_train_pos, X_test_pos, y_train_pos, y_test_pos = train_test_split(
    X_pos,
    y_pos,
    test_size=0.3,      # 70% train / 30% test
    random_state=42,
    stratify=y_pos
)

# 9. Print distribution in train/test sets
print("\nTraining Set Distribution:")
print(y_train_pos.value_counts())
print("\nTest Set Distribution:")
print(y_test_pos.value_counts())
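
The `stratify` argument is what keeps the class ratio the same in both partitions. A minimal sketch on a synthetic 80/20 label series (the `toy_` names and the data are made up for illustration) shows the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 positive and 20 negative (illustration only).
toy_labels = pd.Series(["positive"] * 80 + ["negative"] * 20)
toy_texts = pd.Series([f"review {i}" for i in range(100)])

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.3,
    random_state=42,
    stratify=toy_labels,  # preserve the 80/20 ratio in both partitions
)

print(toy_y_train.value_counts(normalize=True))
print(toy_y_test.value_counts(normalize=True))
```

Without `stratify`, the minority share in the test set would vary with the random seed, which matters when the negative class is much smaller than the positive one.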

### **NLP Method 1:** Term Frequency-Inverse Document Frequency (TF-IDF)
Here, each model is run separately, with its own hyperparameter grid and 10-fold cross-validation.
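
Every model cell in this section follows the same pattern: grid search with cross-validation on the training set, then a single evaluation on the held-out test set. The sketch below condenses that pattern into one function. The helper name `tune_and_evaluate` and the synthetic data from `make_classification` are assumptions made for illustration; the notebook itself repeats the steps inline for each model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def tune_and_evaluate(estimator, param_grid, X_train, y_train, X_test, y_test, cv=10):
    """Hypothetical helper: grid-search on the training set, score on the test set."""
    search = GridSearchCV(
        estimator=estimator,
        param_grid=param_grid,
        scoring="accuracy",
        cv=cv,
        return_train_score=True,
    )
    search.fit(X_train, y_train)
    y_pred = search.best_estimator_.predict(X_test)
    return {
        "best_params": search.best_params_,
        "best_cv_score": search.best_score_,
        "test_accuracy": accuracy_score(y_test, y_pred),
        "classification_report": classification_report(y_test, y_pred, output_dict=True),
    }


# Synthetic stand-in for the vectorized reviews (illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

demo_result = tune_and_evaluate(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 4]},
    X_demo_train, y_demo_train, X_demo_test, y_demo_test,
)
print(demo_result["best_params"], round(demo_result["test_accuracy"], 3))
```

The test set is touched only once, after the best parameters have been chosen, so the reported test accuracy is not influenced by the hyperparameter search.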

#### Feature Extraction Method: TF-IDF Setup

In [None]:
# ================================
# POSITIVE-BIASED SENTIMENT: TF-IDF SETUP
# ================================
print("\n=== Feature Extraction: TF-IDF for Positive-Biased Sentiment ===")

# Initialize the TF-IDF vectorizer
tfidf_vectorizer_posBias = TfidfVectorizer(
    max_features=10000,  # Keep the 10,000 most frequent terms (unigrams and bigrams)
    ngram_range=(1, 2),  # Unigrams and bigrams
    max_df=0.8,  # Ignore terms that appear in more than 80% of documents
    sublinear_tf=True  # Apply sublinear term frequency scaling
)

# Fit on training data and transform both train and test sets
X_train_tfidf_pos = tfidf_vectorizer_posBias.fit_transform(X_train_pos)
X_test_tfidf_pos = tfidf_vectorizer_posBias.transform(X_test_pos)

# Create a dictionary to store results for this section
results_posBias = {}

print("\nTF-IDF transformation complete for Positive-Biased Sentiment Classification!")

#### M1 Decision Tree

##### a. Implementation

In [None]:
# ================================
# POSITIVE-BIASED SENTIMENT: DECISION TREE
# ================================
print("\n" + "="*50)
print("POSITIVE-BIASED: Decision Tree Model")
print("="*50)

# Define Decision Tree classifier
dt_classifier_posBias = DecisionTreeClassifier()

# Define hyperparameter grid
dt_param_grid_posBias = {
    'max_depth': [5, 10, 15, 20, 25],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],
}

# Set up GridSearchCV with 10-fold cross-validation
dt_grid_search_posBias = GridSearchCV(
    estimator=dt_classifier_posBias,
    param_grid=dt_param_grid_posBias,
    scoring='accuracy',
    cv=10,
    n_jobs=4,  # Adjust CPU usage
    verbose=2,
    return_train_score=True
)

# Fit GridSearchCV on the TF-IDF transformed training data
dt_grid_search_posBias.fit(X_train_tfidf_pos, y_train_pos)

# Display the best parameters and cross-validation score
print("\nBest Params (Positive-Biased, Decision Tree):", dt_grid_search_posBias.best_params_)
print("Best CV Score (Positive-Biased, Decision Tree):", dt_grid_search_posBias.best_score_)

# Predict on test data using the best model
dt_best_posBias = dt_grid_search_posBias.best_estimator_
y_pred_dt_pos = dt_best_posBias.predict(X_test_tfidf_pos)

# Evaluate performance
dt_test_accuracy_pos = accuracy_score(y_test_pos, y_pred_dt_pos)
print("\nTest Accuracy (Positive-Biased, Decision Tree):", dt_test_accuracy_pos)
print(classification_report(y_test_pos, y_pred_dt_pos))

# Store results
results_posBias['DecisionTree'] = {
    'best_params': dt_grid_search_posBias.best_params_,
    'best_cv_score': dt_grid_search_posBias.best_score_,
    'test_accuracy': dt_test_accuracy_pos,
    'classification_report': classification_report(y_test_pos, y_pred_dt_pos, output_dict=True)
}

# Save cross-validation results to CSV
dt_cv_results_posBias = pd.DataFrame(dt_grid_search_posBias.cv_results_)
dt_cv_results_posBias.to_csv('Results_2C(PosBias)/TFIDF_Models/tfidf_posBias_decisionTree_cv_results.csv', index=False)

print("\nDecision Tree Model Training & Evaluation Complete for Positive-Biased Sentiment!")
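
`DataFrame.to_csv` does not create missing folders, so the save step above fails with an `OSError` if the results directory has not been created beforehand. A minimal sketch of guarding against that is shown below. It writes into a temporary directory so that it can run anywhere; the `base_dir` stand-in and the file name `example_cv_results.csv` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Temporary directory standing in for the project root (illustration only).
base_dir = tempfile.mkdtemp()
output_dir = os.path.join(base_dir, "Results_2C(PosBias)", "TFIDF_Models")

# Create the nested folders if needed; exist_ok avoids an error on re-runs.
os.makedirs(output_dir, exist_ok=True)

output_path = os.path.join(output_dir, "example_cv_results.csv")
pd.DataFrame({"mean_test_score": [0.5]}).to_csv(output_path, index=False)

print("Saved to:", output_path)
```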

##### b. Visualizations

In [None]:
# ================================
# POSITIVE-BIASED: VALIDATION CURVE (max_depth)
# ================================

param_range = [5, 10, 15, 20, 25]
train_scores, test_scores = validation_curve(
    dt_classifier_posBias, X_train_tfidf_pos, y_train_pos, 
    param_name="max_depth", param_range=param_range,
    scoring="accuracy", cv=5, n_jobs=-1
)

# Calculate mean and std
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(param_range, train_mean, label="Training Score", color='blue', marker='o')
plt.plot(param_range, test_mean, label="Validation Score", color='green', linestyle='--', marker='x')
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.2, color='blue')
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, alpha=0.2, color='green')

plt.title("Validation Curve for max_depth (Positive-Biased)")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.legend()
plt.grid(True)
plt.show()
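
The same arrays that feed the plot can also be read numerically: the depth with the highest mean validation score is the natural candidate, and a widening gap between training and validation scores points to overfitting. The sketch below uses placeholder score arrays (two folds per depth, invented for illustration) in the shape returned by `validation_curve`.

```python
import numpy as np

depth_values = [5, 10, 15, 20, 25]

# Placeholder scores with shape (n_depths, n_folds); not results from this study.
demo_train_scores = np.array([[0.80, 0.82], [0.85, 0.87], [0.90, 0.92],
                              [0.95, 0.95], [0.98, 0.98]])
demo_val_scores = np.array([[0.78, 0.80], [0.81, 0.83], [0.83, 0.85],
                            [0.80, 0.82], [0.79, 0.79]])

demo_train_mean = demo_train_scores.mean(axis=1)
demo_val_mean = demo_val_scores.mean(axis=1)

# Depth with the best mean validation accuracy.
best_depth = depth_values[int(np.argmax(demo_val_mean))]

# Train-validation gap per depth; a growing gap suggests overfitting.
overfit_gap = demo_train_mean - demo_val_mean

for depth, gap in zip(depth_values, overfit_gap):
    print(f"max_depth={depth}: gap={gap:.3f}")
print("Best depth by validation score:", best_depth)
```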

In [None]:
# ================================
# POSITIVE-BIASED: CONFUSION MATRIX
# ================================

cm = confusion_matrix(y_test_pos, y_pred_dt_pos)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Negative", "Positive"])

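# Create the axes first and pass them in; otherwise disp.plot() opens its own
# figure and the plt.figure() call leaves an empty one behind
fig, ax = plt.subplots(figsize=(6, 6))
disp.plot(cmap='Blues', ax=ax)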
plt.title("Confusion Matrix: Decision Tree (Positive-Biased)")
plt.show()

In [None]:
# ================================
# POSITIVE-BIASED: HYPERPARAMETER TUNING RESULTS
# ================================

# Convert results to DataFrame
dt_cv_results_df_posBias = pd.DataFrame(dt_grid_search_posBias.cv_results_)

# Display the mean and standard deviation of the CV score for each parameter combination
print("\n--- Decision Tree: Mean Cross-Validation Score per Parameter Combination (Positive-Biased) ---")
print(dt_cv_results_df_posBias[['params', 'mean_test_score', 'std_test_score']])

# Save results to CSV
dt_cv_results_df_posBias.to_csv('Results_2C(PosBias)/TFIDF_Models/tfidf_posBias_decisionTree_cv_results.csv', index=False)
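
The `cv_results_` table is often easier to read when sorted by rank and reduced to a few columns. The sketch below runs a small grid search on synthetic data (generated with `make_classification`, purely for illustration) and adds a train-validation gap column; the `toy_` names and the `train_val_gap` column are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the TF-IDF features (illustration only).
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 5, 10]},
    scoring="accuracy",
    cv=5,
    return_train_score=True,  # required for the mean_train_score column
)
toy_search.fit(X_toy, y_toy)

toy_cv_df = pd.DataFrame(toy_search.cv_results_)
toy_cv_df["train_val_gap"] = toy_cv_df["mean_train_score"] - toy_cv_df["mean_test_score"]

# Best-ranked parameter combinations first.
ranked_cv = toy_cv_df.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score", "train_val_gap"]
]
print(ranked_cv.to_string(index=False))
```

The same column selection should work on `dt_cv_results_df_posBias`, since that grid search was also run with `return_train_score=True`.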

#### M2 Linear SVM

In [None]:
# ================================
# POSITIVE-BIASED: LINEAR SVM
# ================================
print("\n" + "="*50)
print("TF-IDF: Linear SVM (Positive-Biased Sentiment)")
print("="*50)

# 1. Define the model
svm_classifier_posBias = LinearSVC(max_iter=10000)

# 2. Define the hyperparameter grid
svm_param_grid_posBias = {
    'C': [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]  # Regularization parameter
}

# 3. Set up GridSearchCV for hyperparameter tuning
svm_grid_search_posBias = GridSearchCV(
    estimator=svm_classifier_posBias,
    param_grid=svm_param_grid_posBias,
    scoring='accuracy',
    cv=10,  # 10-fold Cross Validation
    n_jobs=4,  # Parallel processing
    verbose=2,
    return_train_score=True
)

# 4. Train the model using GridSearchCV
svm_grid_search_posBias.fit(X_train_tfidf_pos, y_train_pos)

# 5. Display the best hyperparameters
print("\nBest Params (TF-IDF, SVM - Positive-Biased):", svm_grid_search_posBias.best_params_)
print("Best CV Score (TF-IDF, SVM - Positive-Biased):", svm_grid_search_posBias.best_score_)

# 6. Make predictions on the test set
svm_best_posBias = svm_grid_search_posBias.best_estimator_
y_pred_svm_pos = svm_best_posBias.predict(X_test_tfidf_pos)

# 7. Evaluate the model performance
svm_test_accuracy_posBias = accuracy_score(y_test_pos, y_pred_svm_pos)
print("Test Accuracy (TF-IDF, SVM - Positive-Biased):", svm_test_accuracy_posBias)
print(classification_report(y_test_pos, y_pred_svm_pos))

# 8. Store results in dictionary
results_posBias['SVM'] = {
    'best_params': svm_grid_search_posBias.best_params_,
    'best_cv_score': svm_grid_search_posBias.best_score_,
    'test_accuracy': svm_test_accuracy_posBias,
    'classification_report': classification_report(y_test_pos, y_pred_svm_pos, output_dict=True)
}

# 9. Save cross-validation results to CSV
svm_cv_results_posBias = pd.DataFrame(svm_grid_search_posBias.cv_results_)
svm_cv_results_posBias.to_csv('Results_2C(PosBias)/TFIDF_Models/tfidf_posBias_svm_cv_results.csv', index=False)

print("\nLinear SVM Model Training & Evaluation Complete for Positive-Biased Sentiment!")

#### M3 Random Forest

In [None]:
# ================================
# POSITIVE-BIASED: RANDOM FOREST
# ================================
print("\n" + "="*50)
print("TF-IDF: Random Forest (Positive-Biased Sentiment)")
print("="*50)

# 1. Define the model
rf_classifier_posBias = RandomForestClassifier()

# 2. Define the hyperparameter grid
rf_param_grid_posBias = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 10, 20],  # Maximum depth of the tree
    'max_features': ['sqrt', 'log2']  # Features considered for best split
}

# 3. Set up GridSearchCV for hyperparameter tuning
rf_grid_search_posBias = GridSearchCV(
    estimator=rf_classifier_posBias,
    param_grid=rf_param_grid_posBias,
    scoring='accuracy',
    cv=10,  # 10-fold Cross Validation
    n_jobs=4,  # Parallel processing
    verbose=2,
    return_train_score=True
)

# 4. Train the model using GridSearchCV
rf_grid_search_posBias.fit(X_train_tfidf_pos, y_train_pos)

# 5. Display the best hyperparameters
print("\nBest Params (TF-IDF, Random Forest - Positive-Biased):", rf_grid_search_posBias.best_params_)
print("Best CV Score (TF-IDF, Random Forest - Positive-Biased):", rf_grid_search_posBias.best_score_)

# 6. Make predictions on the test set
rf_best_posBias = rf_grid_search_posBias.best_estimator_
y_pred_rf_pos = rf_best_posBias.predict(X_test_tfidf_pos)

# 7. Evaluate the model performance
rf_test_accuracy_posBias = accuracy_score(y_test_pos, y_pred_rf_pos)
print("Test Accuracy (TF-IDF, Random Forest - Positive-Biased):", rf_test_accuracy_posBias)
print(classification_report(y_test_pos, y_pred_rf_pos))

# 8. Store results in dictionary
results_posBias['RandomForest'] = {
    'best_params': rf_grid_search_posBias.best_params_,
    'best_cv_score': rf_grid_search_posBias.best_score_,
    'test_accuracy': rf_test_accuracy_posBias,
    'classification_report': classification_report(y_test_pos, y_pred_rf_pos, output_dict=True)
}

# 9. Save cross-validation results to CSV
rf_cv_results_posBias = pd.DataFrame(rf_grid_search_posBias.cv_results_)
rf_cv_results_posBias.to_csv('Results_2C(PosBias)/TFIDF_Models/tfidf_posBias_randomForest_cv_results.csv', index=False)

print("\nRandom Forest Model Training & Evaluation Complete for Positive-Biased Sentiment!")

#### M4 kNN

In [None]:
# ================================
# POSITIVE-BIASED: k-NEAREST NEIGHBORS (kNN)
# ================================
print("\n" + "="*50)
print("TF-IDF: kNN (Positive-Biased Sentiment)")
print("="*50)

# 1. Define the model
knn_classifier_posBias = KNeighborsClassifier()

# 2. Define the hyperparameter grid
knn_param_grid_posBias = {
    'n_neighbors': [3, 5, 7, 9],  # Number of neighbors to consider
    'weights': ['uniform', 'distance'],  # Weight function
    'metric': ['euclidean', 'manhattan', 'cosine']  # Distance metric
}

# 3. Set up GridSearchCV for hyperparameter tuning
knn_grid_search_posBias = GridSearchCV(
    estimator=knn_classifier_posBias,
    param_grid=knn_param_grid_posBias,
    scoring='accuracy',
    cv=10,  # 10-fold Cross Validation
    n_jobs=4,  # Parallel processing
    verbose=2,
    return_train_score=True
)

# 4. Train the model using GridSearchCV
knn_grid_search_posBias.fit(X_train_tfidf_pos, y_train_pos)

# 5. Display the best hyperparameters
print("\nBest Params (TF-IDF, kNN - Positive-Biased):", knn_grid_search_posBias.best_params_)
print("Best CV Score (TF-IDF, kNN - Positive-Biased):", knn_grid_search_posBias.best_score_)

# 6. Make predictions on the test set
knn_best_posBias = knn_grid_search_posBias.best_estimator_
y_pred_knn_pos = knn_best_posBias.predict(X_test_tfidf_pos)

# 7. Evaluate the model performance
knn_test_accuracy_posBias = accuracy_score(y_test_pos, y_pred_knn_pos)
print("Test Accuracy (TF-IDF, kNN - Positive-Biased):", knn_test_accuracy_posBias)
print(classification_report(y_test_pos, y_pred_knn_pos))

# 8. Store results in dictionary
results_posBias['kNN'] = {
    'best_params': knn_grid_search_posBias.best_params_,
    'best_cv_score': knn_grid_search_posBias.best_score_,
    'test_accuracy': knn_test_accuracy_posBias,
    'classification_report': classification_report(y_test_pos, y_pred_knn_pos, output_dict=True)
}

# 9. Save cross-validation results to CSV
knn_cv_results_posBias = pd.DataFrame(knn_grid_search_posBias.cv_results_)
knn_cv_results_posBias.to_csv('Results_2C(PosBias)/TFIDF_Models/tfidf_posBias_knn_cv_results.csv', index=False)

print("\nkNN Model Training & Evaluation Complete for Positive-Biased Sentiment!")
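
One detail about the kNN grid is worth noting. `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and the setup above does not change that. For unit-length vectors, squared Euclidean distance equals twice the cosine distance, so the `'euclidean'` and `'cosine'` options rank neighbours in the same order on these features. The sketch below checks the identity on four made-up snippets (not dataset reviews).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Made-up snippets for illustration only.
toy_reviews = [
    "great product fast delivery",
    "terrible product broke quickly",
    "great value would buy again",
    "delivery was slow and late",
]

# Default norm='l2' gives every row unit length.
X_unit = TfidfVectorizer().fit_transform(toy_reviews)

euclidean_dist = pairwise_distances(X_unit, metric="euclidean")
cosine_dist = pairwise_distances(X_unit, metric="cosine")

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b) = 2 * cosine_distance(a, b)
identity_holds = np.allclose(euclidean_dist ** 2, 2 * cosine_dist, atol=1e-6)
print("Squared Euclidean equals 2 x cosine distance:", identity_holds)
```

With `weights='uniform'` the two metrics should therefore select the same neighbours (up to ties), so those grid entries are largely redundant on TF-IDF features. With `weights='distance'` the vote weights differ, and the raw counts used in the N-Gram section are not normalized, so the identity does not apply there.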

#### M5 Naive Bayes

In [None]:
# ================================
# POSITIVE-BIASED: NAÏVE BAYES (TF-IDF)
# ================================
print("\n" + "="*50)
print("TF-IDF: Naïve Bayes (Positive-Biased Sentiment)")
print("="*50)

# 1. Define the model
nb_classifier_posBias = MultinomialNB()

# 2. Define the hyperparameter grid
nb_param_grid_posBias = {
    'alpha': [0.5, 1.0, 1.5]  # Smoothing parameter
}

# 3. Set up GridSearchCV for hyperparameter tuning
nb_grid_search_posBias = GridSearchCV(
    estimator=nb_classifier_posBias,
    param_grid=nb_param_grid_posBias,
    scoring='accuracy',
    cv=10,  # 10-fold Cross Validation
    n_jobs=4,  # Parallel processing
    verbose=2,
    return_train_score=True
)

# 4. Train the model using GridSearchCV
nb_grid_search_posBias.fit(X_train_tfidf_pos, y_train_pos)

# 5. Display the best hyperparameters
print("\nBest Params (TF-IDF, Naïve Bayes - Positive-Biased):", nb_grid_search_posBias.best_params_)
print("Best CV Score (TF-IDF, Naïve Bayes - Positive-Biased):", nb_grid_search_posBias.best_score_)

# 6. Make predictions on the test set
nb_best_posBias = nb_grid_search_posBias.best_estimator_
y_pred_nb_pos = nb_best_posBias.predict(X_test_tfidf_pos)

# 7. Evaluate the model performance
nb_test_accuracy_posBias = accuracy_score(y_test_pos, y_pred_nb_pos)
print("Test Accuracy (TF-IDF, Naïve Bayes - Positive-Biased):", nb_test_accuracy_posBias)
print(classification_report(y_test_pos, y_pred_nb_pos))

# 8. Store results in dictionary
results_posBias['NaiveBayes'] = {
    'best_params': nb_grid_search_posBias.best_params_,
    'best_cv_score': nb_grid_search_posBias.best_score_,
    'test_accuracy': nb_test_accuracy_posBias,
    'classification_report': classification_report(y_test_pos, y_pred_nb_pos, output_dict=True)
}

# 9. Save cross-validation results to CSV
nb_cv_results_posBias = pd.DataFrame(nb_grid_search_posBias.cv_results_)
nb_cv_results_posBias.to_csv('Results_2C(PosBias)/TFIDF_Models/tfidf_posBias_naiveBayes_cv_results.csv', index=False)

print("\nNaïve Bayes Model Training & Evaluation Complete for Positive-Biased Sentiment!")


#### Summary

In [None]:
# ================================
# SUMMARY: TF-IDF (Positive-Biased Sentiment)
# ================================
print("\n=== SUMMARY: TF-IDF Classification for Positive-Biased Sentiment ===")

# Convert results dictionary into a DataFrame for better visualization
summary_posBias_tfidf = pd.DataFrame(results_posBias).T  # Transpose for readability
summary_posBias_tfidf  # Display the summary table as the cell output


#### Confusion Matrices (Dual Class POS Bias - TFIDF)

In [None]:
# Best TF‑IDF models (Positive-Biased)
models_posBias_tfidf = {
    "DecisionTree": dt_best_posBias,
    "SVM": svm_best_posBias,
    "RandomForest": rf_best_posBias,
    "kNN": knn_best_posBias,
    "NaiveBayes": nb_best_posBias
}

for model_name, model in models_posBias_tfidf.items():
    # X_test_tfidf_pos: your TF‑IDF test data (Positive-Biased)
    # y_test_pos:       your Positive-Biased labels (negative, positive)
    y_pred = model.predict(X_test_tfidf_pos)
    cm = confusion_matrix(y_test_pos, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot(cmap='Blues')
    plt.title(f"Confusion Matrix: {model_name} (Positive-Biased, TF-IDF)")
    plt.show()

### **NLP Method 2:** N-Gram (Tri-Gram)

#### Feature Extraction Method: N-Gram

In [None]:
# ================================
# POSITIVE-BIASED: N-Gram Feature Extraction
# ================================
print("\n" + "="*50)
print("POSITIVE-BIASED: N-Gram Feature Extraction")
print("="*50)

from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer for N-Grams (unigrams, bigrams, trigrams)
ngram_vectorizer_pos = CountVectorizer(
    ngram_range=(1, 3),  # (1,3) includes unigrams, bigrams, and trigrams
    max_features=10000,
    max_df=0.8
)

# Fit the vectorizer on the positive-biased training set and transform both training and test sets
X_train_ngram_pos = ngram_vectorizer_pos.fit_transform(X_train_pos)
X_test_ngram_pos = ngram_vectorizer_pos.transform(X_test_pos)

# Create a dictionary to store the N-Gram results for Positive-Biased sentiment (filled in by the model cells below)
results_ngram_pos = {}

print("\nN-Gram Feature Extraction Complete for Positive-Biased Sentiment!")

#### M1 Decision Tree

In [None]:
# -------------------------------
#  POSITIVE-BIASED: Decision Tree (N-Gram)
# -------------------------------

# Define Decision Tree model
dt_classifier_pos_ngram = DecisionTreeClassifier()

# Define hyperparameter grid
dt_param_grid_pos_ngram = {
    'max_depth': [10, 20, 25],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10]
}

# Set up GridSearchCV for hyperparameter tuning
dt_grid_search_pos_ngram = GridSearchCV(
    estimator=dt_classifier_pos_ngram,
    param_grid=dt_param_grid_pos_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

print("\n" + "="*50)
print("N-Gram: Decision Tree (Positive-Biased)")
print("="*50)

# Fit GridSearch on the N-Gram transformed training data
dt_grid_search_pos_ngram.fit(X_train_ngram_pos, y_train_pos)

# Best parameters and cross-validation score
print("\nBest Params (N-Gram, Decision Tree - Positive Biased):", dt_grid_search_pos_ngram.best_params_)
print("Best CV Score (N-Gram, Decision Tree - Positive Biased):", dt_grid_search_pos_ngram.best_score_)

# Predict on test data
dt_best_pos_ngram = dt_grid_search_pos_ngram.best_estimator_
y_pred_dt_pos_ngram = dt_best_pos_ngram.predict(X_test_ngram_pos)

# Evaluate model
dt_test_accuracy_pos_ngram = accuracy_score(y_test_pos, y_pred_dt_pos_ngram)
print("Test Accuracy (N-Gram, Decision Tree - Positive Biased):", dt_test_accuracy_pos_ngram)
print(classification_report(y_test_pos, y_pred_dt_pos_ngram))

# Store results
results_ngram_pos['DecisionTree'] = {
    'best_params': dt_grid_search_pos_ngram.best_params_,
    'best_cv_score': dt_grid_search_pos_ngram.best_score_,
    'test_accuracy': dt_test_accuracy_pos_ngram,
    'classification_report': classification_report(y_test_pos, y_pred_dt_pos_ngram, output_dict=True)
}

# Save cross-validation results to CSV
dt_cv_results_pos_ngram = pd.DataFrame(dt_grid_search_pos_ngram.cv_results_)
dt_cv_results_pos_ngram.to_csv('Results_2C(PosBias)/NGram_Models/ngram_posBias_decisionTree_cv_results.csv', index=False)

print("\nDecision Tree Model Training & Evaluation Complete for Positive-Biased Sentiment (N-Gram)!")

#### M2 Linear SVM

In [None]:
# -------------------------------
#  POSITIVE-BIASED: Linear SVM (N-Gram)
# -------------------------------

# Define SVM model
svm_classifier_pos_ngram = LinearSVC(max_iter=10000)

# Define hyperparameter grid
svm_param_grid_pos_ngram = {
    'C': [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]
}

# Set up GridSearchCV for hyperparameter tuning
svm_grid_search_pos_ngram = GridSearchCV(
    estimator=svm_classifier_pos_ngram,
    param_grid=svm_param_grid_pos_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

print("\n" + "="*50)
print("N-Gram: Linear SVM (Positive-Biased)")
print("="*50)

# Fit GridSearch on the N-Gram transformed training data
svm_grid_search_pos_ngram.fit(X_train_ngram_pos, y_train_pos)

# Best parameters and cross-validation score
print("\nBest Params (N-Gram, SVM - Positive Biased):", svm_grid_search_pos_ngram.best_params_)
print("Best CV Score (N-Gram, SVM - Positive Biased):", svm_grid_search_pos_ngram.best_score_)

# Predict on test data
svm_best_pos_ngram = svm_grid_search_pos_ngram.best_estimator_
y_pred_svm_pos_ngram = svm_best_pos_ngram.predict(X_test_ngram_pos)

# Evaluate model
svm_test_accuracy_pos_ngram = accuracy_score(y_test_pos, y_pred_svm_pos_ngram)
print("Test Accuracy (N-Gram, SVM - Positive Biased):", svm_test_accuracy_pos_ngram)
print(classification_report(y_test_pos, y_pred_svm_pos_ngram))

# Store results
results_ngram_pos['SVM'] = {
    'best_params': svm_grid_search_pos_ngram.best_params_,
    'best_cv_score': svm_grid_search_pos_ngram.best_score_,
    'test_accuracy': svm_test_accuracy_pos_ngram,
    'classification_report': classification_report(y_test_pos, y_pred_svm_pos_ngram, output_dict=True)
}

# Save cross-validation results to CSV
svm_cv_results_pos_ngram = pd.DataFrame(svm_grid_search_pos_ngram.cv_results_)
svm_cv_results_pos_ngram.to_csv('Results_2C(PosBias)/NGram_Models/ngram_posBias_svm_cv_results.csv', index=False)

print("\nLinear SVM Model Training & Evaluation Complete for Positive-Biased Sentiment (N-Gram)!")

#### M3 Random Forest

In [None]:
# -------------------------------
#  POSITIVE-BIASED: Random Forest (N-Gram)
# -------------------------------

# Define Random Forest model
rf_classifier_pos_ngram = RandomForestClassifier()

# Define hyperparameter grid
rf_param_grid_pos_ngram = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2']  # Usually beneficial for text
}

# Set up GridSearchCV for hyperparameter tuning
rf_grid_search_pos_ngram = GridSearchCV(
    estimator=rf_classifier_pos_ngram,
    param_grid=rf_param_grid_pos_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

print("\n" + "="*50)
print("N-Gram: Random Forest (Positive-Biased)")
print("="*50)

# Fit GridSearch on the N-Gram transformed training data
rf_grid_search_pos_ngram.fit(X_train_ngram_pos, y_train_pos)

# Best parameters and cross-validation score
print("\nBest Params (N-Gram, Random Forest - Positive Biased):", rf_grid_search_pos_ngram.best_params_)
print("Best CV Score (N-Gram, Random Forest - Positive Biased):", rf_grid_search_pos_ngram.best_score_)

# Predict on test data
rf_best_pos_ngram = rf_grid_search_pos_ngram.best_estimator_
y_pred_rf_pos_ngram = rf_best_pos_ngram.predict(X_test_ngram_pos)

# Evaluate model
rf_test_accuracy_pos_ngram = accuracy_score(y_test_pos, y_pred_rf_pos_ngram)
print("Test Accuracy (N-Gram, Random Forest - Positive Biased):", rf_test_accuracy_pos_ngram)
print(classification_report(y_test_pos, y_pred_rf_pos_ngram))

# Store results
results_ngram_pos['RandomForest'] = {
    'best_params': rf_grid_search_pos_ngram.best_params_,
    'best_cv_score': rf_grid_search_pos_ngram.best_score_,
    'test_accuracy': rf_test_accuracy_pos_ngram,
    'classification_report': classification_report(y_test_pos, y_pred_rf_pos_ngram, output_dict=True)
}

# Save cross-validation results to CSV
rf_cv_results_pos_ngram = pd.DataFrame(rf_grid_search_pos_ngram.cv_results_)
rf_cv_results_pos_ngram.to_csv('Results_2C(PosBias)/NGram_Models/ngram_posBias_randomForest_cv_results.csv', index=False)

print("\nRandom Forest Model Training & Evaluation Complete for Positive-Biased Sentiment (N-Gram)!")

#### M4 kNN

In [None]:
# -------------------------------
#  POSITIVE-BIASED: kNN (N-Gram)
# -------------------------------

# Define kNN model
knn_classifier_pos_ngram = KNeighborsClassifier()

# Define hyperparameter grid
knn_param_grid_pos_ngram = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

# Set up GridSearchCV for hyperparameter tuning
knn_grid_search_pos_ngram = GridSearchCV(
    estimator=knn_classifier_pos_ngram,
    param_grid=knn_param_grid_pos_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

print("\n" + "="*50)
print("N-Gram: kNN (Positive-Biased)")
print("="*50)

# Fit GridSearch on the N-Gram transformed training data
knn_grid_search_pos_ngram.fit(X_train_ngram_pos, y_train_pos)

# Best parameters and cross-validation score
print("\nBest Params (N-Gram, kNN - Positive Biased):", knn_grid_search_pos_ngram.best_params_)
print("Best CV Score (N-Gram, kNN - Positive Biased):", knn_grid_search_pos_ngram.best_score_)

# Predict on test data
knn_best_pos_ngram = knn_grid_search_pos_ngram.best_estimator_
y_pred_knn_pos_ngram = knn_best_pos_ngram.predict(X_test_ngram_pos)

# Evaluate model
knn_test_accuracy_pos_ngram = accuracy_score(y_test_pos, y_pred_knn_pos_ngram)
print("Test Accuracy (N-Gram, kNN - Positive Biased):", knn_test_accuracy_pos_ngram)
print(classification_report(y_test_pos, y_pred_knn_pos_ngram))

# Store results
results_ngram_pos['kNN'] = {
    'best_params': knn_grid_search_pos_ngram.best_params_,
    'best_cv_score': knn_grid_search_pos_ngram.best_score_,
    'test_accuracy': knn_test_accuracy_pos_ngram,
    'classification_report': classification_report(y_test_pos, y_pred_knn_pos_ngram, output_dict=True)
}

# Save cross-validation results to CSV
knn_cv_results_pos_ngram = pd.DataFrame(knn_grid_search_pos_ngram.cv_results_)
knn_cv_results_pos_ngram.to_csv('Results_2C(PosBias)/NGram_Models/ngram_posBias_knn_cv_results.csv', index=False)

print("\nkNN Model Training & Evaluation Complete for Positive-Biased Sentiment (N-Gram)!")

#### M5 Naive Bayes

In [None]:
# -------------------------------
#  POSITIVE-BIASED: Naïve Bayes (N-Gram)
# -------------------------------

# Define Naïve Bayes model
nb_classifier_pos_ngram = MultinomialNB()

# Define hyperparameter grid
nb_param_grid_pos_ngram = {
    'alpha': [0.5, 1.0, 1.5]
}

# Set up GridSearchCV for hyperparameter tuning
nb_grid_search_pos_ngram = GridSearchCV(
    estimator=nb_classifier_pos_ngram,
    param_grid=nb_param_grid_pos_ngram,
    scoring='accuracy',
    cv=10,
    n_jobs=4,
    verbose=2,
    return_train_score=True
)

print("\n" + "="*50)
print("N-Gram: Naïve Bayes (Positive-Biased)")
print("="*50)

# Fit GridSearch on the N-Gram transformed training data
nb_grid_search_pos_ngram.fit(X_train_ngram_pos, y_train_pos)

# Best parameters and cross-validation score
print("\nBest Params (N-Gram, Naïve Bayes - Positive Biased):", nb_grid_search_pos_ngram.best_params_)
print("Best CV Score (N-Gram, Naïve Bayes - Positive Biased):", nb_grid_search_pos_ngram.best_score_)

# Predict on test data
nb_best_pos_ngram = nb_grid_search_pos_ngram.best_estimator_
y_pred_nb_pos_ngram = nb_best_pos_ngram.predict(X_test_ngram_pos)

# Evaluate model
nb_test_accuracy_pos_ngram = accuracy_score(y_test_pos, y_pred_nb_pos_ngram)
print("Test Accuracy (N-Gram, Naïve Bayes - Positive Biased):", nb_test_accuracy_pos_ngram)
print(classification_report(y_test_pos, y_pred_nb_pos_ngram))

# Store results
results_ngram_pos['NaiveBayes'] = {
    'best_params': nb_grid_search_pos_ngram.best_params_,
    'best_cv_score': nb_grid_search_pos_ngram.best_score_,
    'test_accuracy': nb_test_accuracy_pos_ngram,
    'classification_report': classification_report(y_test_pos, y_pred_nb_pos_ngram, output_dict=True)
}

# Save cross-validation results to CSV
nb_cv_results_pos_ngram = pd.DataFrame(nb_grid_search_pos_ngram.cv_results_)
nb_cv_results_pos_ngram.to_csv('Results_2C(PosBias)/NGram_Models/ngram_posBias_naiveBayes_cv_results.csv', index=False)

print("\nNaïve Bayes Model Training & Evaluation Complete for Positive-Biased Sentiment (N-Gram)!")
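Each grid search above is exhaustive, so its cost is the number of parameter combinations multiplied by the number of folds. The sketch below counts those cross-validation fits for the Random Forest, kNN, and Naïve Bayes grids defined in the preceding cells. The helper name `count_cv_fits` is hypothetical and is not part of the notebook or of scikit-learn.

```python
from itertools import product


def count_cv_fits(param_grid, cv):
    """Return (n_candidates, n_cv_fits) for an exhaustive grid search."""
    # Every combination of the listed values is one candidate
    n_candidates = len(list(product(*param_grid.values())))
    # Each candidate is trained once per fold
    return n_candidates, n_candidates * cv


# Grids copied from the Random Forest, kNN, and Naive Bayes cells above
grids = {
    "RandomForest": {
        "n_estimators": [100, 200, 300],
        "max_depth": [None, 10, 20],
        "max_features": ["sqrt", "log2"],
    },
    "kNN": {
        "n_neighbors": [3, 5, 7, 9],
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "manhattan", "cosine"],
    },
    "NaiveBayes": {"alpha": [0.5, 1.0, 1.5]},
}

fit_counts = {name: count_cv_fits(grid, cv=10) for name, grid in grids.items()}

for name, (n_candidates, n_fits) in fit_counts.items():
    print(f"{name}: {n_candidates} candidates x 10 folds = {n_fits} fits")
```

With `cv=10` this gives 180 fits for Random Forest, 240 for kNN, and 30 for Naïve Bayes. Because `GridSearchCV` refits the best candidate on the full training set by default, each search performs one additional fit beyond these counts.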

#### Summary

In [None]:
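# One row per model; columns are the keys stored in results_ngram_pos
summary_results_ngram_pos = pd.DataFrame(results_ngram_pos).T

# Display the summary table as the cell output
summary_results_ngram_pos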
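The summary table keeps each `classification_report` as a nested dictionary in a single column, which makes per-class comparison awkward. The following is a minimal sketch of one way to flatten those dictionaries into plain rows. It assumes the class labels are the strings `negative` and `positive` (taken from the comment in the confusion-matrix cell), the helper `flatten_results` is hypothetical, and the values in `demo_results` are placeholders, not results from this project.

```python
def flatten_results(results, class_labels):
    """Turn a results dict (model -> stored metrics) into flat rows."""
    rows = []
    for model_name, entry in results.items():
        report = entry["classification_report"]
        row = {
            "model": model_name,
            "best_cv_score": entry["best_cv_score"],
            "test_accuracy": entry["test_accuracy"],
        }
        # Copy precision, recall, and F1 for each class into its own column
        for label in class_labels:
            for metric in ("precision", "recall", "f1-score"):
                row[f"{label}_{metric}"] = report[label][metric]
        rows.append(row)
    return rows


# Placeholder entry that mimics the structure stored in results_ngram_pos;
# every number here is a dummy value, NOT a result from this project.
demo_results = {
    "DemoModel": {
        "best_params": {"alpha": 1.0},
        "best_cv_score": 0.5,
        "test_accuracy": 0.5,
        "classification_report": {
            "negative": {"precision": 0.5, "recall": 0.5, "f1-score": 0.5, "support": 2},
            "positive": {"precision": 0.5, "recall": 0.5, "f1-score": 0.5, "support": 2},
            "accuracy": 0.5,
        },
    }
}

rows = flatten_results(demo_results, class_labels=["negative", "positive"])
print(rows[0])
```

In the notebook, `pd.DataFrame(flatten_results(results_ngram_pos, ["negative", "positive"]))` would give one row per model with a separate column for each per-class metric, provided the labels match those assumed here.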

#### Confusion Matrices (Dual Class POS Bias - NGram)

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Best N-Gram models (Positive-Biased)
models_posBias_ngram = {
    "DecisionTree": dt_best_pos_ngram,
    "SVM": svm_best_pos_ngram,
    "RandomForest": rf_best_pos_ngram,
    "kNN": knn_best_pos_ngram,
    "NaiveBayes": nb_best_pos_ngram
}

for model_name, model in models_posBias_ngram.items():
    # X_test_ngram_pos: N-Gram test features (Positive-Biased)
    # y_test_pos:       Positive-Biased labels (negative, positive)
    y_pred = model.predict(X_test_ngram_pos)
    # Pass labels explicitly so the matrix rows/columns follow the same order as display_labels
    cm = confusion_matrix(y_test_pos, y_pred, labels=model.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot(cmap='Blues')
    plt.title(f"Confusion Matrix: {model_name} (Positive-Biased, N-Gram)")
    plt.show()
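Raw counts can be hard to compare across classes when one class dominates the test set, as is likely in the positive-biased setting. The sketch below row-normalizes a confusion matrix so that each row sums to 1 and the diagonal shows per-class recall. The helper `row_normalize` is hypothetical and the counts in `demo_cm` are placeholders, not output from the models above.

```python
def row_normalize(cm):
    """Divide each row of a confusion matrix by its row total."""
    normalized = []
    for row in cm:
        total = sum(row)
        # Guard against a class with no true samples in the test set
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Placeholder counts (rows = true class, columns = predicted class);
# these are dummy values, NOT results from this project.
demo_cm = [
    [30, 10],   # true negative class
    [20, 140],  # true positive class
]

demo_recall = row_normalize(demo_cm)
for row in demo_recall:
    print(row)
```

In scikit-learn the same normalization is available by passing `normalize='true'` to `confusion_matrix`, which would avoid the helper entirely.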