# Text Classification with Sentence Embeddings & Logistic Regression

In this notebook, we’ll walk step-by-step through building a movie-review classifier on the Rotten Tomatoes dataset using pretrained sentence embeddings and scikit-learn’s Logistic Regression. We’ll cover:

1. **Environment Setup** – install and import libraries.  
2. **Load & Inspect Data** – load the Rotten Tomatoes reviews and take a first look.  
3. **Generate Embeddings** – encode each review into a fixed-size vector via sentence-transformers.  
4. **Train/Test Split** – partition our data for training and evaluation.  
5. **Train Classifier** – fit a Logistic Regression model on embedding features.  
6. **Evaluate Performance** – compute accuracy, confusion matrix, and classification report.  
7. **Predict New Texts** – write new reviews and predict whether each one is "fresh" or "rotten".

## **Step 1: Environment Setup**

First, install the required libraries (`sentence-transformers` for embeddings, `datasets` for loading our data) and import everything we need.

In [None]:
# Install sentence-transformers for easy embeddings
%pip install -q sentence-transformers

# Install the Hugging Face datasets library to load Rotten Tomatoes reviews
%pip install -q datasets

## **Step 2: Load & Inspect Data**
We’ll load the [Cornell Movie Review “Rotten Tomatoes” dataset](https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes) (the training split only, for sample-size reasons) and convert it to a pandas DataFrame for easy inspection. The labels are:
- `label = 1` means "fresh"
- `label = 0` means "rotten"

In [None]:
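import numpy as np
import pandas as pd
from datasets import load_dataset

# Load the Rotten Tomatoes dataset (training split only) and convert it to a DataFrame
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
df = dataset.to_pandas()
df.head()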

In [None]:
# Check class distribution to ensure balance (df is the DataFrame loaded in Step 2)
df["label"].value_counts()


## **Step 3: Generate Sentence Embeddings**
We use the [`SentenceTransformer` wrapper](https://huggingface.co/sentence-transformers), which exposes a simple `.encode()` method. It returns an (n_samples × embedding_dim) NumPy array.
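
To make the output shape concrete without downloading a model, here is a minimal sketch that uses a toy hashed bag-of-words encoder. `toy_encode` is a hypothetical stand-in for `.encode()`, not part of sentence-transformers, and its vectors carry none of the semantic information a pretrained model provides. It only illustrates that every review, short or long, maps to a vector of the same length.

```python
import zlib

def toy_encode(texts, dim=8):
    """Hypothetical stand-in for embedder.encode(): one fixed-size vector per text."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for word in text.lower().split():
            # Hash each word into one of `dim` buckets and count it
            vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
        vectors.append(vec)
    return vectors

reviews = [
    "a charming and funny film",
    "dull",
    "the plot drags but the cast is great",
]
toy_embeddings = toy_encode(reviews)

# Shape is (n_samples, embedding_dim) no matter how long each review is
print(len(toy_embeddings), len(toy_embeddings[0]))  # 3 8
```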

In [None]:
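# sentence-transformers for pretrained encoder
from sentence_transformers import SentenceTransformer

# Instantiate the pretrained embedding model
# (assumption: the notebook does not name a checkpoint; "all-MiniLM-L6-v2" is used here as an example)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Encode all review texts to embeddings
# convert_to_numpy=True returns a NumPy array
# show_progress_bar displays encoding progress
embeddings = embedder.encode(
    df["text"].tolist(),  # df is the DataFrame from Step 2
    convert_to_numpy=True,
    show_progress_bar=True,
)

# Inspect shape: (n_reviews, embedding_dimension)
print(embeddings.shape)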


## **Step 4: Split into Training & Testing Sets**
We’ll hold out 20% of the embeddings and labels for testing, using a fixed `random_state` for reproducibility.
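
To show why a fixed seed matters, here is a small standard-library sketch of a seeded 80/20 hold-out split. `holdout_split` is a hypothetical, simplified stand-in for scikit-learn's `train_test_split`, and the seed value 42 is arbitrary. The point is only that the same seed yields the same partition on every run.

```python
import random

def holdout_split(items, test_fraction=0.2, seed=42):
    """Shuffle indices with a fixed seed, then hold out the last test_fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_fraction)
    cut = len(indices) - n_test
    train_idx, test_idx = indices[:cut], indices[cut:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

samples = list(range(10))
train_a, test_a = holdout_split(samples)
train_b, test_b = holdout_split(samples)

print(len(train_a), len(test_a))  # 8 2
print(test_a == test_b)           # True: same seed, same split
```

Changing `seed` would generally produce a different partition, which is why results are only comparable across runs when the seed is held constant.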

In [None]:

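from sklearn.model_selection import train_test_split

# Define features (embeddings) and target (labels)
X = embeddings          # array from Step 3
y = df["label"].values  # labels from the Step 2 DataFrame

# Split embeddings and labels into train/test (random_state=42 is an arbitrary fixed seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Confirm sizes
print(X_train.shape, X_test.shape)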


## **Step 5: Train Logistic Regression**
Fit a logistic regression classifier on our training set. We keep `scikit-learn`'s defaults: an L2 penalty with `C=1.0` and the `lbfgs` solver.
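
For intuition about what the fitted model computes, the sketch below applies the logistic (sigmoid) function to a weighted sum of the features. The 3-dimensional "embedding", the weights, and the bias are made-up numbers for illustration. In a fitted scikit-learn model the corresponding values would live in `clf.coef_[0]` and `clf.intercept_[0]`.

```python
import math

def predict_proba_positive(x, weights, bias):
    """P(label = 1 | x) = sigmoid(w . x + b) for a binary logistic regression."""
    z = sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 3-dimensional "embedding" and made-up coefficients
x = [0.5, -1.0, 2.0]
weights = [1.0, 0.5, 0.25]
bias = 0.5

p_fresh = predict_proba_positive(x, weights, bias)
label = 1 if p_fresh > 0.5 else 0  # predict "fresh" when the probability exceeds 0.5

print(f"P(fresh) = {p_fresh:.3f}, predicted label = {label}")  # P(fresh) = 0.731, predicted label = 1
```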

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression model
clf = LogisticRegression()

# Train on embedding features
clf.fit(X_train, y_train)

## **Step 6: Evaluate Performance**
We’ll predict on the test set, compute overall accuracy, plot the confusion matrix, and print a detailed classification report (precision, recall, F1-score).
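
As a reference for reading the report, this sketch derives each metric by hand from a 2×2 confusion matrix. The counts are hypothetical, not results from this notebook. The layout follows scikit-learn's convention: rows are actual classes, columns are predicted classes, and class 0 comes first.

```python
# Hypothetical confusion matrix (rows = actual, columns = predicted, class 0 first)
cm = [[80, 20],   # actual rotten: 80 predicted rotten, 20 predicted fresh
      [10, 90]]   # actual fresh:  10 predicted rotten, 90 predicted fresh
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of reviews predicted fresh, how many really are
recall = tp / (tp + fn)      # of truly fresh reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.850 precision=0.818 recall=0.900 f1=0.857
```

These are the metrics for the "fresh" class (label 1). The classification report prints the same quantities for each class in turn.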

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Predict on the test set
y_pred = clf.predict(X_test)

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Generate and plot Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
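# scikit-learn orders classes 0 (rotten), 1 (fresh) along both axes
class_names = ['rotten', 'fresh']
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)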
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Print Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))

## **Step 7: Predict New Texts**
Now that we have a trained model, let’s see how it does on user-supplied reviews. We’ll:

1. Define a few custom review strings.  
2. Encode them into embeddings.  
3. Predict labels and class probabilities with our logistic regression classifier.  
4. Print out each review with its predicted sentiment and confidence.
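
Steps 3 and 4 can be sketched without a trained model. `predict_proba` returns one row per input and one column per class, with the columns ordered as in `clf.classes_`. The probability rows below are made up for illustration. The predicted label is the class with the largest probability, and that probability is the confidence we report.

```python
# Made-up probability rows in the layout predict_proba returns:
# one row per review, one column per class, columns ordered like clf.classes_
classes = [0, 1]
names = {0: "rotten", 1: "fresh"}
probs = [[0.10, 0.90],
         [0.85, 0.15],
         [0.45, 0.55]]

results = []
for row in probs:
    best = max(range(len(row)), key=row.__getitem__)  # index of the largest probability
    results.append((names[classes[best]], row[best]))

for label, confidence in results:
    print(f"{label} (confidence = {confidence:.3f})")
```

A confidence near 0.5, as in the last row, signals a review the model finds ambiguous.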

In [None]:
# 7.1 Define some new reviews to classify
new_reviews = [
    "I absolutely loved this movie, the performances were stellar!",
    "This was a terrible film; I wasted two hours of my life.",
    "The story was okay, but the pacing felt off in the second half.",
    "An underrated gem—beautiful cinematography and great score."
]

# 7.2 Generate embeddings for these new texts
# (using the same embedder we initialized earlier)
new_embs = embedder.encode(
    new_reviews,
    convert_to_numpy=True,
    show_progress_bar=False
)

# 7.3 Predict labels and probabilities
new_preds = clf.predict(new_embs)
new_probs = clf.predict_proba(new_embs)

# 7.4 Display results
for text, pred, probs in zip(new_reviews, new_preds, new_probs):
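    # Map numeric label back to sentiment (1 = fresh/positive, 0 = rotten/negative)
    label_str = "Positive" if pred == 1 else "Negative"
    # Labels are 0/1, so the predicted label doubles as the column index into probs
    confidence = probs[pred]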
    print(f"Review: {text!r}")
    print(f" → Predicted: {label_str} (confidence = {confidence:.3f})\n")