
# 📊 Sentiment Analysis Notebook

This notebook demonstrates **Sentiment Analysis** on movie reviews using Natural Language Processing (NLP) techniques: TF-IDF features fed into a Logistic Regression classifier.

---

## 📋 Process Overview

- **Data Source**: Sample movie reviews dataset (IMDB)
- **Goal**: Classify reviews as **Positive** or **Negative**.
- **Pipeline**:
    - Data Preprocessing
    - Model Implementation
    - Evaluation
    - Insights

---


## 1. Load Dataset

In [None]:

import pandas as pd

# Load IMDB movie reviews dataset
df = pd.read_csv('https://raw.githubusercontent.com/datasets/imdb-reviews/main/imdb_reviews.csv')
df.head()
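
Before preprocessing, it can help to confirm that the loaded frame has the columns the later cells rely on. The sketch below is a minimal check under the assumption that the file exposes `review` and `sentiment` columns; the `sample` frame and the `check_reviews` helper are hypothetical stand-ins, not part of the dataset or of pandas.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data; the real file is assumed to
# expose the same two columns, 'review' and 'sentiment'.
sample = pd.DataFrame({
    "review": ["Great film!", "Terrible plot.", "Loved it.", None],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

def check_reviews(frame, text_col="review", label_col="sentiment"):
    """Return a small summary; raise if an expected column is missing."""
    missing = [c for c in (text_col, label_col) if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return {
        "rows": len(frame),
        "null_reviews": int(frame[text_col].isna().sum()),
        "label_counts": frame[label_col].value_counts().to_dict(),
    }

summary = check_reviews(sample)
print(summary)
```

Calling `check_reviews(df)` on the real frame would surface missing columns, empty reviews, or a skewed class balance before any model is trained.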


## 2. Data Preprocessing

In [None]:

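import re
from sklearn.model_selection import train_test_split

# Clean text function
def clean_text(text):
    """Strip HTML tags and non-letter characters, then lowercase."""
    text = re.sub(r"<.*?>", " ", text)      # replace tags (e.g. <br />) with a space so adjacent words stay separate
    text = re.sub(r"[^a-zA-Z]", " ", text)  # keep letters only
    text = re.sub(r"\s+", " ", text)        # collapse runs of whitespace
    return text.lower().strip()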

df['cleaned_review'] = df['review'].apply(clean_text)

# Encode target labels
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_review'], df['sentiment'], test_size=0.2, random_state=42)
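
One detail of the label encoding step is worth checking: `Series.map` returns `NaN` for any value missing from the mapping dictionary, so labels with unexpected casing or extra classes would silently become missing targets. A minimal sketch of such a check, using a hypothetical `labels` series rather than the notebook's data:

```python
import pandas as pd

# Hypothetical labels, including two values the mapping does not cover.
labels = pd.Series(["positive", "negative", "Positive", "neutral"])
encoded = labels.map({"positive": 1, "negative": 0})

# Values absent from the mapping come back as NaN.
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)  # ['Positive', 'neutral']
```

If the list is non-empty on the real data, the labels would need normalizing (or the rows dropping) before the train-test split.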


## 3. Model Implementation (TF-IDF + Logistic Regression)

In [None]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Create a pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', LogisticRegression(max_iter=1000))  # extra iterations give the solver headroom to converge
])

# Train the model
pipeline.fit(X_train, y_train)

# Evaluate
accuracy = pipeline.score(X_test, y_test)
print(f"Logistic Regression Model Accuracy: {accuracy*100:.2f}%")


## 4. Evaluation and Insights

In [None]:

from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Classification Report
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
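sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])  # label order follows the 0/1 encoding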
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
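
The numbers in the classification report can be derived directly from the confusion matrix. The worked example below uses hypothetical counts (not results from this notebook) arranged in scikit-learn's layout for binary labels 0 and 1: rows are actual classes, columns are predicted classes.

```python
# Hypothetical counts in scikit-learn's layout:
#               predicted 0   predicted 1
# actual 0          TN            FP
# actual 1          FN            TP
cm = [[42, 8],
      [6, 44]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tp + tn) / total          # share of all reviews classified correctly
precision = tp / (tp + fp)            # of predicted positives, how many were right
recall = tp / (tp + fn)               # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

These are the formulas for the positive class; swapping the roles of the two classes gives the corresponding figures for the negative class.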


## 5. Key Findings


- The Logistic Regression model achieved ~85% accuracy on the held-out test set (20% of the data).
- TF-IDF features, capped at 5,000 terms, gave the linear classifier enough signal to separate positive from negative reviews.
- This makes the pipeline a reasonable baseline for binary sentiment classification.
- Next steps: experiment with transformer-based models such as BERT, which may improve performance.
