# 📌 Topic: Text Classification

### What you will learn
- How to train supervised learning models on text data
- Comparison of three popular algorithms: Logistic Regression, Naive Bayes, and SVM
- The standard ML pipeline: Bag of Words -> Splitting -> Training -> Evaluation
- How to interpret classification performance (Accuracy, Precision, Recall)

### Why this matters
Text classification is the backbone of many real-world applications: spam filters, sentiment analyzers, content moderation (such as detecting hate speech), and automated routing of customer support requests. Knowing which algorithm to use and how to evaluate it is key to building reliable NLP systems.

---

## The Workflow

1.  **Vectorization**: Transform text into numbers (we'll use Bag of Words here).
2.  **Dataset Splitting**: Split the data into a **Training Set** (to teach the model) and a **Testing Set** (to check its homework).
3.  **Model Training**: Let each algorithm find patterns between word counts and labels.
4.  **Evaluation**: Measure how many labels the model gets right on unseen data.

In [None]:
import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Create a toy sentiment dataset
data = pd.DataFrame(
    [
        ("I love spending time with my friends and family", "positive"),
        ("That was the best meal I've ever had in my life", "positive"),
        ("I feel so grateful for everything I have in my life", "positive"),
        ("I received a promotion at work and I couldn't be happier", "positive"),
        ("Watching a beautiful sunset always fills me with joy", "positive"),
        ("My partner surprised me with a thoughtful gift and it made my day", "positive"),
        ("I am so proud of my daughter for graduating with honors", "positive"),
        ("Listening to my favorite music always puts me in a good mood", "positive"),
        ("I love the feeling of accomplishment after completing a challenging task", "positive"),
        ("I am excited to go on vacation next week", "positive"),
        ("I feel so overwhelmed with work and responsibilities", "negative"),
        ("The traffic during my commute is always so frustrating", "negative"),
        ("I received a parking ticket and it ruined my day", "negative"),
        ("I got into an argument with my partner and we're not speaking", "negative"),
        ("I have a headache and I feel terrible", "negative"),
        ("I received a rejection letter for the job I really wanted", "negative"),
        ("My car broke down and it's going to be expensive to fix", "negative"),
        ("I'm feeling sad because I miss my friends who live far away", "negative"),
        ("I'm frustrated because I can't seem to make progress on my project", "negative"),
        ("I'm disappointed because my team lost the game", "negative"),
    ],
    columns=["text", "sentiment"]
)

# Shuffle the data for fair training
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

## Step 1: Preprocessing and Vectorizing

We convert our text column into numerical features using `CountVectorizer`.

In [None]:
x = data['text']
y = data['sentiment']

# Transform text into Bag of Words representation
vectorizer = CountVectorizer()
x_bow = vectorizer.fit_transform(x)

# Split the data: 70% for training, 30% for testing
x_train, x_test, y_train, y_test = train_test_split(x_bow, y, test_size=0.3, random_state=7)
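
One caveat: fitting the vectorizer on every row before splitting lets the test sentences influence the vocabulary. With plain word counts the effect is small, but a stricter variant splits the raw text first and fits the vectorizer on the training portion only. The sketch below shows that variant under stated assumptions: the `texts` and `labels` lists are made up for illustration, and `stratify` is an optional extra that keeps both classes represented in each split.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Illustrative data (hypothetical), standing in for data['text'] and data['sentiment']
texts = [
    "great food", "lovely day", "happy news", "wonderful trip",
    "awful traffic", "terrible headache", "sad story", "broken car",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Split the raw text first; stratify keeps the class balance in both parts
text_train, text_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=7, stratify=labels
)

# Learn the vocabulary from the training text only, then reuse it for the test text
split_vectorizer = CountVectorizer()
bow_train = split_vectorizer.fit_transform(text_train)
bow_test = split_vectorizer.transform(text_test)

print(bow_train.shape, bow_test.shape)
```

Words that appear only in the test text are ignored by `transform`, which mirrors what happens when a deployed model meets vocabulary it has never seen.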

## Step 2: Logistic Regression

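**Logistic Regression** models the probability that a sample belongs to a class. It learns one weight per word and passes the weighted sum of the word counts through a sigmoid function; training finds the weights that best separate positive from negative samples.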

In [None]:
# Initialize and train
lr_model = LogisticRegression()
lr_model.fit(x_train, y_train)

# Prediction and evaluation
y_pred_lr = lr_model.predict(x_test)
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, y_pred_lr):.2f}")

## Step 3: Multinomial Naive Bayes

**Naive Bayes** is based on Bayes' Theorem. It's "naive" because it assumes that, given the class, every word occurs independently of the others (which isn't true of real language), but it's very fast and often performs surprisingly well on text data.

In [None]:
# Train Naive Bayes (MultinomialNB accepts sparse matrices, so no .toarray() is needed)
nb_model = MultinomialNB()
nb_model.fit(x_train, y_train)

# Predict
y_pred_nb = nb_model.predict(x_test)
print(f"Naive Bayes Accuracy: {accuracy_score(y_test, y_pred_nb):.2f}")
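
The independence assumption turns classification into simple arithmetic: a class score is the log prior plus, for each word, its count times the log probability of that word in the class. The sketch below recomputes those quantities by hand on a made-up count matrix and compares them with what `MultinomialNB` learned. The `counts`, `classes`, and `new_doc` arrays are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Illustrative word-count matrix (hypothetical): rows are documents,
# columns are counts for three vocabulary words
counts = np.array([[2, 1, 0], [1, 0, 0], [0, 1, 2], [0, 0, 1]])
classes = np.array(["positive", "positive", "negative", "negative"])

nb = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing (the default)
nb.fit(counts, classes)

# Recompute P(word | class) by hand: (count + alpha) / (total + alpha * vocab_size)
manual_log_prob = []
for label in nb.classes_:
    word_totals = counts[classes == label].sum(axis=0)
    smoothed = (word_totals + 1.0) / (word_totals.sum() + 1.0 * counts.shape[1])
    manual_log_prob.append(np.log(smoothed))
manual_log_prob = np.array(manual_log_prob)

# Score a new document: log prior + sum of count * log P(word | class)
new_doc = np.array([[1, 1, 0]])
manual_scores = nb.class_log_prior_ + new_doc @ manual_log_prob.T
manual_prediction = nb.classes_[manual_scores.argmax(axis=1)]
print(manual_prediction, nb.predict(new_doc))
```

Smoothing matters here: without it, a word never seen in a class would get probability zero and wipe out the whole score for that class.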

## Step 4: Support Vector Machine (SVM)

**SVM** tries to find the widest possible "road" or boundary between the two classes. It is very effective in high-dimensional spaces (like text datasets where every word is a dimension).

In [None]:
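# SGDClassifier with hinge loss (its default) trains a linear SVM via stochastic gradient descent
# random_state makes the result reproducible across runs
svm_model = SGDClassifier(loss="hinge", random_state=42)
svm_model.fit(x_train, y_train)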

# Predict
y_pred_svm = svm_model.predict(x_test)
print(f"SVM Accuracy: {accuracy_score(y_test, y_pred_svm):.2f}")
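
The "road" picture can be checked numerically. A linear SVM assigns each sample a signed score: the sign says which side of the boundary the sample falls on, and the magnitude grows with its distance from the boundary. The sketch below is an illustration on made-up 2-D points, using `LinearSVC` rather than the `SGDClassifier` above to keep the toy example stable; `points` and `groups` are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Illustrative 2-D points (hypothetical): two well-separated clusters
points = np.array(
    [[0.0, 0.0], [0.5, 0.5], [1.0, 0.0], [3.0, 3.0], [3.5, 2.5], [4.0, 3.0]]
)
groups = np.array(["negative"] * 3 + ["positive"] * 3)

margin_svm = LinearSVC(C=1.0, random_state=0)
margin_svm.fit(points, groups)

# decision_function returns a signed score: its sign picks the side of the
# boundary, and larger magnitudes mean the point lies farther from it
scores = margin_svm.decision_function(points)
from_sign = np.where(scores > 0, margin_svm.classes_[1], margin_svm.classes_[0])

# Nominal width of the "road" between the two margin lines: 2 / ||w||
road_width = 2.0 / np.linalg.norm(margin_svm.coef_)

for point, score in zip(points, scores):
    print(point, f"{score:+.2f}")
print(f"nominal margin width: {road_width:.2f}")
```

Reading the labels off the sign of the score gives the same result as calling `predict`, which is all a linear classifier does at prediction time.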

## Comparing the Results

A single accuracy number doesn't tell the whole story. Let's look at the detailed breakdown.

In [None]:
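reports = {"Logistic Regression": y_pred_lr, "Naive Bayes": y_pred_nb, "SVM": y_pred_svm}
for name, y_pred in reports.items():
    print(f"Detailed Report ({name}):")
    print(classification_report(y_test, y_pred, zero_division=0))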
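
To read the report, it helps to compute the three metrics once by hand. The sketch below uses made-up labels (not output from the models above) and treats "positive" as the class of interest; precision and recall are then cross-checked against scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels (hypothetical), not output from the models above
true_labels = ["positive", "positive", "positive", "negative", "negative", "negative"]
pred_labels = ["positive", "positive", "negative", "positive", "negative", "negative"]

# With labels=["negative", "positive"], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(
    true_labels, pred_labels, labels=["negative", "positive"]
).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything truly positive, how much was found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")

# Cross-check against scikit-learn's implementations
sk_precision = precision_score(true_labels, pred_labels, pos_label="positive")
sk_recall = recall_score(true_labels, pred_labels, pos_label="positive")
print(f"sklearn precision={sk_precision:.2f} recall={sk_recall:.2f}")
```

Precision and recall diverge when the model makes one kind of mistake more often than the other, which a single accuracy figure hides.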

## Key Takeaways

1.  **Fast Baseline**: Naive Bayes is usually the fastest to train and serves as a great starting reference.
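2.  **High Accuracy**: Logistic Regression and SVM tend to be more accurate on larger datasets, because they learn their weights directly from the data and do not rely on the independence assumption that Naive Bayes makes.
3.  **Feature Quality**: How you represent the text (BoW, TF-IDF) often matters just as much as the algorithm itself.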

## Next Steps
- Try adding **Stopword Removal** to see if accuracy improves.
- Experiment with **Hyperparameter Tuning** (e.g., changing the `C` parameter in Logistic Regression).
- Learn about **Dimensionality Reduction** in the next notebook to handle very large vocabularies.
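
As a starting point for the first two suggestions, the sketch below combines stopword removal and a small search over `C` in one pipeline. It is a sketch under assumptions rather than a tuned setup: the `tune_texts` data is made up, the grid values are arbitrary, and `cv=2` is chosen only because the example has so few rows.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Illustrative data (hypothetical)
tune_texts = [
    "I love this sunny day", "What a wonderful surprise", "This meal is great",
    "I am happy with the result", "I hate this traffic", "What a terrible headache",
    "This meal is awful", "I am sad about the result",
]
tune_labels = ["positive"] * 4 + ["negative"] * 4

# The pipeline refits the vectorizer inside every cross-validation fold,
# so the held-out fold never influences the vocabulary
pipeline = Pipeline([
    ("bow", CountVectorizer(stop_words="english")),  # built-in English stopword list
    ("clf", LogisticRegression()),
])

# Smaller C means stronger regularization; the values here are arbitrary
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(tune_texts, tune_labels)

print(search.best_params_)
```

On a dataset this small the chosen `C` is mostly noise, so treat the output as a demonstration of the mechanics rather than a recommendation.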