# Name: Marwa
## Registration No: 2023-ag-9969

# Sentiment Classification with Bag-of-Words and Naive Bayes

This notebook demonstrates a simple text classification pipeline to categorize sentences into one of three sentiments: **Happy**, **Sad**, or **Neutral**. We use the Bag-of-Words (BoW) model for feature extraction and a Multinomial Naive Bayes classifier for model training.

In [1]:
# Import necessary libraries
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

## 1. Dataset Initialization and Preprocessing

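We load the six-sentence sample dataset and apply two preprocessing steps: converting the text to lowercase and removing every character that is not a letter `a-z` or whitespace. This strips punctuation, digits, and the curly apostrophe in "don’t", which becomes "dont".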

In [2]:
# 1. Dataset
data = {
    'text': [
        "I am so excited for today!",
        "I feel really bad about the situation.",
        "I am just here, nothing special.",
        "This is the best day ever!",
        "I am feeling really down today.",
        "I don’t care about this anymore."
    ],
    'sentiment': [
        'Happy',
        'Sad',
        'Neutral',
        'Happy',
        'Sad',
        'Neutral'
    ]
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)

# Data Preprocessing: Convert to lowercase and remove punctuation
# Note: We use a simple regex to keep only letters and spaces.
df['processed_text'] = df['text'].str.lower()
df['processed_text'] = df['processed_text'].apply(lambda x: re.sub(r'[^a-z\s]', '', x))

print("\nProcessed Data:")
print(df[['processed_text', 'sentiment']])

Original Data:
                                     text sentiment
0              I am so excited for today!     Happy
1  I feel really bad about the situation.       Sad
2        I am just here, nothing special.   Neutral
3              This is the best day ever!     Happy
4         I am feeling really down today.       Sad
5        I don’t care about this anymore.   Neutral

Processed Data:
                          processed_text sentiment
0              i am so excited for today     Happy
1  i feel really bad about the situation       Sad
2         i am just here nothing special   Neutral
3              this is the best day ever     Happy
4         i am feeling really down today       Sad
5         i dont care about this anymore   Neutral
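
The preprocessing step above can also be wrapped in a small helper so that exactly the same cleaning is applied to training data and to new sentences. The following is a minimal sketch; the function name `preprocess` and the second sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

def preprocess(text: str) -> str:
    """Lowercase the text and drop every character except a-z and whitespace."""
    return re.sub(r'[^a-z\s]', '', text.lower())

samples = [
    "I don’t care about this anymore.",  # from the dataset: the curly apostrophe is removed
    "Room 101 is empty.",                # hypothetical sentence: digits are removed as well
]
cleaned = [preprocess(s) for s in samples]
print(cleaned)
```

Note that removing characters without replacing them can leave double spaces behind (as in the second sample). This is harmless here because `CountVectorizer` splits on token boundaries rather than on single spaces.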


## 2. Bag-of-Words (BoW) Feature Extraction and Dataset Splitting

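The `CountVectorizer` converts the text into a matrix of token counts, with one row per sentence and one column per vocabulary word. The dataset is then split with `test_size=0.2`. Because there are only six samples, scikit-learn rounds the test set up to two sentences, so the effective split is 4 training samples and 2 testing samples (roughly 67/33 rather than 80/20).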

In [3]:
# Bag-of-Words (BoW) Feature Extraction
vectorizer = CountVectorizer() # Tokenize sentences and build vocabulary
X = vectorizer.fit_transform(df['processed_text']) # Transform text into token count matrix
y = df['sentiment']

print("Vocabulary size:", len(vectorizer.get_feature_names_out()))
print("Feature matrix shape (Samples x Features):", X.shape)

# Split the dataset (test_size=0.2; with 6 samples this yields 4 training and 2 testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTraining samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}")

Vocabulary size: 25
Feature matrix shape (Samples x Features): (6, 25)

Training samples: 4, Testing samples: 2
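
The 4/2 split reported above follows from how `train_test_split` sizes the test set when `test_size` is a fraction: the test count is rounded up. A minimal sketch of that arithmetic, using a stand-in list of six indices instead of the real feature matrix:

```python
import math
from sklearn.model_selection import train_test_split

n_samples = 6
test_size = 0.2

# A fractional test_size is rounded up: ceil(0.2 * 6) = ceil(1.2) = 2
expected_test = math.ceil(test_size * n_samples)
expected_train = n_samples - expected_test

# Stand-in data: six integer indices instead of the real sentences
train_idx, test_idx = train_test_split(
    list(range(n_samples)), test_size=test_size, random_state=42
)
print(f"expected: {expected_train}/{expected_test}, actual: {len(train_idx)}/{len(test_idx)}")
```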


## 3. Model Training (Naive Bayes)

In [4]:
# Model Training: Using Multinomial Naive Bayes
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

print("Multinomial Naive Bayes Model trained successfully.")

Multinomial Naive Bayes Model trained successfully.
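
To show what `fit` estimates, the sketch below recomputes the per-class word log-probabilities by hand using additive (Laplace) smoothing and compares them with the fitted model. The four training sentences and all `toy_` names are illustrative assumptions; `alpha=1.0` is the `MultinomialNB` default.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set (not the notebook's actual split)
toy_texts = ["best day ever", "so excited today", "feeling really down", "really bad day"]
toy_labels = np.array(["Happy", "Happy", "Sad", "Sad"])

toy_X = CountVectorizer().fit_transform(toy_texts).toarray()

alpha = 1.0  # Laplace smoothing strength (the MultinomialNB default)
toy_clf = MultinomialNB(alpha=alpha).fit(toy_X, toy_labels)

# Manual estimate: log((count of word in class + alpha) /
#                      (total words in class + alpha * vocabulary size))
manual_rows = []
for cls in toy_clf.classes_:
    word_counts = toy_X[toy_labels == cls].sum(axis=0)
    manual_rows.append(
        np.log((word_counts + alpha) / (word_counts.sum() + alpha * toy_X.shape[1]))
    )
manual_log_prob = np.vstack(manual_rows)

print(np.allclose(manual_log_prob, toy_clf.feature_log_prob_))
```

The smoothing term is what keeps a word that never appeared in a class from driving that class's probability to zero.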


## 4. Model Evaluation

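We evaluate the trained model on the two held-out test sentences and report accuracy together with per-class precision, recall, and F1-score. Because the test set contains only two sentences, each prediction shifts accuracy by 50 percentage points.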

In [5]:
# Predict on the test set
y_pred = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)

print(f"Accuracy: {accuracy}")
print(f"\nClassification Report:\n{report}")

Accuracy: 0.5

Classification Report:
              precision    recall  f1-score   support

       Happy       0.00      0.00      0.00         1
         Sad       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2
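
A confusion matrix can make the report above easier to read. The sketch below uses label lists that are an assumption reconstructed from the report (one Happy and one Sad test sentence, both predicted as Sad); it does not rerun the notebook's model.

```python
from sklearn.metrics import confusion_matrix

# Assumed labels, consistent with the classification report above
all_labels = ["Happy", "Neutral", "Sad"]
assumed_true = ["Happy", "Sad"]
assumed_pred = ["Sad", "Sad"]

# Passing labels= keeps a row and column for Neutral even though
# no Neutral sentence landed in the test set.
cm = confusion_matrix(assumed_true, assumed_pred, labels=all_labels)
print(cm)
```

Rows are true classes and columns are predicted classes, so the single nonzero entry in the Happy row shows the Happy sentence being classified as Sad.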



## 5. Final Task: Predicting New Sample Sentences

In [6]:
# New sample sentences
new_sentences = [
    "I am having a great day!",
    "Everything feels so wrong today."
]

# Preprocess and vectorize the new data using the FITTED vectorizer
processed_new_sentences = [re.sub(r'[^a-z\s]', '', s.lower()) for s in new_sentences]
X_new = vectorizer.transform(processed_new_sentences)

# Predict the sentiment
predictions = classifier.predict(X_new)

# Report the predictions
print("New Sentence Predictions:")
print("---------------------------")
for sentence, prediction in zip(new_sentences, predictions):
    print(f'"{sentence}" -> Predicted: {prediction}')

New Sentence Predictions:
---------------------------
"I am having a great day!" -> Predicted: Neutral
"Everything feels so wrong today." -> Predicted: Sad

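
One way to interpret these predictions is to check which words of each new sentence the fitted vocabulary actually contains, since `transform` silently ignores unseen words. The sketch below refits a separate vectorizer on the six processed sentences (as the notebook does) and splits each new sentence into known and unknown tokens; the `demo_` names are assumptions used to avoid overwriting the notebook's variables.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# The six processed sentences from the dataset
demo_corpus = [
    "i am so excited for today",
    "i feel really bad about the situation",
    "i am just here nothing special",
    "this is the best day ever",
    "i am feeling really down today",
    "i dont care about this anymore",
]
demo_vectorizer = CountVectorizer().fit(demo_corpus)
demo_analyzer = demo_vectorizer.build_analyzer()  # same tokenization as fit/transform

demo_sentences = ["I am having a great day!", "Everything feels so wrong today."]
coverage = {}
for sentence in demo_sentences:
    tokens = demo_analyzer(re.sub(r'[^a-z\s]', '', sentence.lower()))
    known = [t for t in tokens if t in demo_vectorizer.vocabulary_]
    unknown = [t for t in tokens if t not in demo_vectorizer.vocabulary_]
    coverage[sentence] = (known, unknown)
    print(f'"{sentence}" known: {known} | ignored: {unknown}')
```

In the first sentence, "having" and "great" are not in the vocabulary, so the classifier only sees "am" and "day". This suggests why a clearly positive sentence can still receive a Neutral prediction with so little training data.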