# 🔄 Step-by-Step Flow

1. 🔍 **Load & Explore:** Load a small IMDB-style set of labeled reviews
2. 🧹 **Preprocess:** Lowercase the text and strip non-letter characters
3. ✂️ **Split:** Train/test split (75/25)
4. 🔢 **Vectorize:** Fit TF-IDF on the training text, then transform both sets
5. 🤖 **Train:** Fit a Logistic Regression model
6. 🎯 **Evaluate:** Report accuracy, precision, recall, and F1 on the test set
7. 🎭 **Predict:** Try the model on new reviews

We'll use scikit-learn and a small hand-written set of reviews as a stand-in for the IMDB dataset.
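
To make the Vectorize step less of a black box, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how scikit-learn documents its default behavior; the `tfidf` helper and the two toy documents are hypothetical, and the whitespace tokenizer is simpler than what `TfidfVectorizer` actually uses (no stop-word removal, no token pattern).

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: rarer terms get larger values.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {term: tf[term] * idf[term] for term in tf}
        # L2-normalize so document length does not dominate; guard empty docs.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["great movie great cast", "boring movie"]  # hypothetical toy corpus
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the second document, "boring" ends up with a higher weight than "movie" because "movie" appears in both documents and is therefore less distinctive.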

In [None]:
# Install dependencies (uncomment if needed)
# !pip install scikit-learn

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import re

class SentimentAnalyzer:
    def __init__(self):
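        # Note: the built-in English stop-word list drops negations such as "not",
        # which can blur sentiment; consider a custom list for real use.
        self.vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')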
        self.classifier = LogisticRegression(max_iter=200)
    
    def preprocess_text(self, text):
        text = text.lower()
        text = re.sub(r'[^a-z\s]', '', text)
        return text.strip()

    def train(self, texts, labels):
        clean_texts = [self.preprocess_text(t) for t in texts]
        X = self.vectorizer.fit_transform(clean_texts)
        self.classifier.fit(X, labels)

    def evaluate(self, test_texts, test_labels):
        clean_texts = [self.preprocess_text(t) for t in test_texts]
        X_test = self.vectorizer.transform(clean_texts)
        y_pred = self.classifier.predict(X_test)

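        # zero_division=0 returns 0 instead of warning when a metric's
        # denominator is zero, which is likely with a tiny test set.
        accuracy = accuracy_score(test_labels, y_pred)
        precision = precision_score(test_labels, y_pred, zero_division=0)
        recall = recall_score(test_labels, y_pred, zero_division=0)
        f1 = f1_score(test_labels, y_pred, zero_division=0)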

        print(f"Accuracy:  {accuracy:.2f}")
        print(f"Precision: {precision:.2f}")
        print(f"Recall:    {recall:.2f}")
        print(f"F1-Score:  {f1:.2f}")

    def predict(self, text):
        text = self.preprocess_text(text)
        X = self.vectorizer.transform([text])
        pred = self.classifier.predict(X)[0]
        label = 'Positive' if pred == 1 else 'Negative'
        return label

### 🧪 Let's Try It Out
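We'll train on a mini dataset, evaluate on the held-out reviews, and then predict on new ones. With only eight reviews, the scores are illustrative rather than meaningful.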
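
Before training, it can help to see what the cleaning step does to raw text. The sketch below copies the body of `SentimentAnalyzer.preprocess_text` into a standalone function so it runs on its own; the second sample string is a made-up example chosen to show that digits and apostrophes are removed along with punctuation.

```python
import re

def preprocess_text(text):
    # Same steps as SentimentAnalyzer.preprocess_text, as a plain function.
    text = text.lower()                    # normalize case
    text = re.sub(r'[^a-z\s]', '', text)   # keep only letters and whitespace
    return text.strip()                    # trim leading/trailing whitespace

samples = [
    "I loved this movie! It was fantastic.",
    "Don't miss it, 10/10!",  # hypothetical review with digits and an apostrophe
]
for sample in samples:
    print(repr(preprocess_text(sample)))
# 'i loved this movie it was fantastic'
# 'dont miss it'
```

Note that the rating "10/10" disappears entirely and "Don't" becomes "dont", so this cleaning trades some information for simplicity.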

In [None]:
# Sample IMDB-like dataset (label 1 = positive, 0 = negative)
texts = [
    "I loved this movie! It was fantastic.",
    "An excellent film with great performances.",
    "The movie was boring and too long.",
    "I hated this film. Terrible acting!",
    "Good story but weak direction.",
    "Absolutely amazing experience!",
    "Not worth watching again.",
    "It was okay, not bad but not great either."
]
labels = [1, 1, 0, 0, 0, 1, 0, 0]

# Train/test split (75/25: 6 reviews for training, 2 for testing)
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Initialize and train model
analyzer = SentimentAnalyzer()
analyzer.train(train_texts, train_labels)

# Evaluate
print("\n📊 Model Evaluation:")
analyzer.evaluate(test_texts, test_labels)

# Predict new examples
print("\n🎭 Try New Reviews:")
print("'This was a masterpiece!' →", analyzer.predict("This was a masterpiece!"))
print("'I regret watching this.' →", analyzer.predict("I regret watching this."))

### 🧠 What We Did
- Used **TF-IDF** to convert text into numerical features.
- Trained a **Logistic Regression** model on those features.
- Evaluated the model on a held-out test set using **accuracy**, **precision**, **recall**, and **F1-score**.
- Ran predictions on **new, unseen reviews**.
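
For readers who want to see how the four evaluation metrics relate, here is a small pure-Python sketch that computes them from true and predicted labels. The `binary_metrics` helper and the label lists are hypothetical and are not output from the model above; the helper returns 0.0 when a denominator is zero.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical labels, for illustration only
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-Score:  {f1:.2f}")
```

Here there are 2 true positives, 1 false positive, 1 false negative, and 2 true negatives, so all four metrics come out to 2/3.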

This simple structure can be extended with larger datasets (such as the full IMDB reviews), scikit-learn pipelines, and richer preprocessing.
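
As one possible direction, the vectorizer and classifier can be chained with scikit-learn's `Pipeline`, so fitting and predicting each take a single call. This is a sketch under a few assumptions: it reuses the same hyperparameters as the class above, relies on `TfidfVectorizer`'s default lowercasing instead of the custom regex cleaning, and uses the step names `"tfidf"` and `"clf"`, which are arbitrary labels chosen here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Chain vectorizer and classifier; the step names are arbitrary labels.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, stop_words="english")),
    ("clf", LogisticRegression(max_iter=200)),
])

# Same mini dataset as above (1 = positive, 0 = negative)
texts = [
    "I loved this movie! It was fantastic.",
    "An excellent film with great performances.",
    "The movie was boring and too long.",
    "I hated this film. Terrible acting!",
    "Good story but weak direction.",
    "Absolutely amazing experience!",
    "Not worth watching again.",
    "It was okay, not bad but not great either.",
]
labels = [1, 1, 0, 0, 0, 1, 0, 0]

# fit() runs TF-IDF fitting and classifier training in one step.
pipeline.fit(texts, labels)

# predict() applies the fitted TF-IDF transform, then the classifier.
prediction = pipeline.predict(["An excellent and amazing film"])[0]
print("Positive" if prediction == 1 else "Negative")
```

Because the pipeline is a single estimator, it can also be passed to utilities such as `cross_val_score`, which refit the vectorizer inside each fold and so avoid leaking test vocabulary into training.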