# Task 04: Sentiment Analysis using NLP

## Objective
To classify text documents into two categories using a bag-of-words representation and a Naive Bayes classifier.

## Dataset
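Used the training subset of the 20 Newsgroups dataset (loaded with scikit-learn's `fetch_20newsgroups`), restricted to two categories: `rec.sport.hockey` and `sci.space`. The labels are therefore topics rather than sentiment polarity, although the notebook stores them in a column named `sentiment`.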
Total samples: 1193

## Methodology
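- Converted the text into a bag-of-words count matrix using `CountVectorizer` with English stop words removed (23,283 features). The vectorizer is fitted on all 1193 documents before the split, so the vocabulary also contains words that occur only in the test set.
- Split the dataset into 80% training (954 samples) and 20% testing (239 samples) with `random_state=42`.
- Trained a Multinomial Naive Bayes classifier (`MultinomialNB` with default settings) on the training split.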
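
The three steps above can also be chained in a scikit-learn pipeline, which learns the vocabulary from the training texts only. The following is a minimal sketch, not the notebook's code: the toy sentences and the variable names `train_texts`, `test_texts`, and `pipeline` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus for illustration; 0 = hockey, 1 = space
train_texts = [
    "the goalie stopped the puck",
    "hockey playoffs start tonight",
    "the team scored a late goal",
    "the rocket reached orbit",
    "nasa launched a new satellite",
    "the shuttle docked in orbit",
]
train_labels = [0, 0, 0, 1, 1, 1]

# fit() builds the vocabulary from the training texts only,
# then trains the classifier on the resulting count matrix
pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(train_texts, train_labels)

# predict() reuses the training vocabulary; unseen words are ignored
test_texts = ["the puck hit the goalie", "a satellite in orbit"]
predictions = pipeline.predict(test_texts)
print(predictions)
```

Keeping the vectorizer inside the pipeline means the test texts never influence the vocabulary, which makes the held-out score a cleaner estimate.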

## Results
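- Accuracy: 98.7% (236 of 239 test samples classified correctly).
- Precision and recall are at least 0.97 for both classes (class 0: precision 1.00, recall 0.98; class 1: precision 0.97, recall 1.00).
- Only 3 misclassifications: three class 0 samples were predicted as class 1, and no class 1 sample was predicted as class 0.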
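
The reported metrics can be rederived from the confusion matrix printed in the notebook. This sketch assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the notebook output:
# rows = true class, columns = predicted class
cm = np.array([[122, 3],
               [0, 114]])

# Accuracy: correct predictions (diagonal) over all test samples
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions per class over everything predicted as that class
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions per class over all true members of that class
recall = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy:  {accuracy:.4f}")
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
```

The rounded values agree with the accuracy and classification report shown in the notebook output.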

## Conclusion
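Bag-of-words vectorization combined with Multinomial Naive Bayes is a simple and effective baseline for this two-class text classification problem, reaching about 98.7% accuracy on the held-out test split.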

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [5]:
from sklearn.datasets import fetch_20newsgroups

categories = ['rec.sport.hockey', 'sci.space']

data = fetch_20newsgroups(subset='train', categories=categories)

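# Note: data.target holds topic labels (0 = rec.sport.hockey, 1 = sci.space),
# stored here under the column name "sentiment".
df = pd.DataFrame({
    "text": data.data,
    "sentiment": data.target
})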

print("Dataset loaded successfully!")
print("Shape:", df.shape)

df.head()

Dataset loaded successfully!
Shape: (1193, 2)


| | text | sentiment |
|---|---|---|
| 0 | `From: e8l6@jupiter.sun.csd.unb.ca (Rocket)\nSu...` | 0 |
| 1 | `From: umfu0009@ccu.umanitoba.ca (J. M. K. Fu)\...` | 0 |
| 2 | `From: Mark.Prado@p2.f349.n109.z1.permanet.org ...` | 1 |
| 3 | `From: igor@pravda.tse.su\nSubject: Who will br...` | 0 |
| 4 | `From: u1452@penelope.sdsc.edu (Jeff Bytof - SI...` | 1 |


In [6]:
df.shape

(1193, 2)

In [7]:
vectorizer = CountVectorizer(stop_words='english')

X = vectorizer.fit_transform(df["text"])
y = df["sentiment"]

print("Shape of transformed text data:", X.shape)

Shape of transformed text data: (1193, 23283)
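
To make the shape above concrete, here is a small sketch of what `CountVectorizer` produces: one row per document and one column per vocabulary term, with each cell holding a word count. The three sentences in `docs` are hypothetical and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus for illustration
docs = [
    "the rocket reached orbit",
    "the puck hit the net",
    "orbit after orbit",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary terms

# vocabulary_ maps each kept term to its column index
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X.toarray())

# Count of "orbit" in the third document
orbit_col = vectorizer.vocabulary_["orbit"]
print(X[2, orbit_col])
```

Stop words such as "the" get no column, which is why the vocabulary is smaller than the number of distinct words in the raw text.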


In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training size:", X_train.shape)
print("Testing size:", X_test.shape)

Training size: (954, 23283)
Testing size: (239, 23283)
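
The split above is purely random. As an optional variation that the notebook does not use, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below uses made-up labels with a deliberate 80/20 imbalance to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class counts per split; both keep the original 4:1 ratio
train_counts = np.bincount(yd_train)
test_counts = np.bincount(yd_test)
print(train_counts, test_counts)
```

For the two roughly balanced classes used in this notebook the difference is likely small, so this matters mainly for skewed datasets.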


In [9]:
model = MultinomialNB()

model.fit(X_train, y_train)

print("Model trained successfully!")

Model trained successfully!
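
For intuition about what `fit` estimates, the sketch below recomputes the Laplace-smoothed per-class word probabilities by hand and compares them with the model's `feature_log_prob_` attribute. The small count matrix is hypothetical, and `alpha=1.0` is the `MultinomialNB` default.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical term-count matrix: 4 documents x 3 vocabulary terms
X_small = np.array([[2, 1, 0],
                    [3, 0, 0],
                    [0, 1, 2],
                    [0, 0, 4]])
y_small = np.array([0, 0, 1, 1])

alpha = 1.0
nb = MultinomialNB(alpha=alpha).fit(X_small, y_small)

# Manual estimate per class:
# (term count in class + alpha) / (total term count in class + alpha * vocabulary size)
manual = []
for c in (0, 1):
    term_counts = X_small[y_small == c].sum(axis=0)
    smoothed = (term_counts + alpha) / (term_counts.sum() + alpha * X_small.shape[1])
    manual.append(np.log(smoothed))
manual = np.array(manual)

matches = np.allclose(manual, nb.feature_log_prob_)
print(matches)
```

The smoothing term keeps every probability above zero, so a word never seen in one class during training cannot rule that class out on its own.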


In [10]:
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9874476987447699


In [11]:
print("Confusion Matrix:\n")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

Confusion Matrix:

[[122   3]
 [  0 114]]

Classification Report:

              precision    recall  f1-score   support

           0       1.00      0.98      0.99       125
           1       0.97      1.00      0.99       114

    accuracy                           0.99       239
   macro avg       0.99      0.99      0.99       239
weighted avg       0.99      0.99      0.99       239

