# üéØ Sentiment Prediction - Drug Review Analysis

## Objective
Build a model to predict whether a drug review is **Positive** (rating ‚â• 7) or **Negative** (rating < 7) based on the review text.

## Why This Matters
- **Pharmaceutical companies** can automatically monitor product feedback
- **Healthcare platforms** can flag concerning reviews
- **Patients** can quickly filter to find relevant experiences

## Approach
1. Simple baseline with TF-IDF + Logistic Regression
2. Improved model with TF-IDF + XGBoost
3. Compare and explain results

---
**Author**: [Your Name]

## 1. Setup & Data Loading

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import warnings

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, classification_report,
                             roc_auc_score, roc_curve)

# Sentence Transformers for BERT embeddings
from sentence_transformers import SentenceTransformer

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.max_colwidth', 100)

print("‚úÖ Libraries loaded successfully!")

In [None]:
# Load data
df = pd.read_csv('../artifacts/data_ingestion/Drugs_Data.csv')
print(f"üìä Loaded {len(df):,} reviews")
df.head(3)

## 2. Data Preparation

We'll create a binary sentiment label:
- **Positive (1)**: Rating ‚â• 7
- **Negative (0)**: Rating < 7

In [None]:
# Create binary sentiment label
df['sentiment'] = (df['rating'] >= 7).astype(int)

# Check distribution
sentiment_dist = df['sentiment'].value_counts()
print("üìä Sentiment Distribution:")
print(f"   Positive (1): {sentiment_dist[1]:,} ({sentiment_dist[1]/len(df)*100:.1f}%)")
print(f"   Negative (0): {sentiment_dist[0]:,} ({sentiment_dist[0]/len(df)*100:.1f}%)")

# Visualize
fig, ax = plt.subplots(figsize=(8, 4))
colors = ['#ff6b6b', '#51cf66']
bars = ax.bar(['Negative (Rating < 7)', 'Positive (Rating ‚â• 7)'], 
              [sentiment_dist[0], sentiment_dist[1]], color=colors, edgecolor='white')
ax.set_title('Sentiment Class Distribution', fontsize=14, fontweight='bold')
ax.set_ylabel('Number of Reviews')

for bar, val in zip(bars, [sentiment_dist[0], sentiment_dist[1]]):
    ax.text(bar.get_x() + bar.get_width()/2, val + 1000, f'{val:,}', ha='center', fontsize=11)

plt.tight_layout()
plt.show()

In [None]:
# Text cleaning function
def clean_text(text):
    """Clean review text for ML processing"""
    text = str(text).lower()
    # Remove HTML entities
    text = re.sub(r'&#\d+;', '', text)
    text = re.sub(r'&\w+;', '', text)
    # Remove special characters but keep spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

# Apply cleaning
print("üîÑ Cleaning text...")
df['clean_review'] = df['review'].apply(clean_text)

# Show example
print("\nüìù Example:")
print(f"Original: {df['review'].iloc[0][:200]}...")
print(f"\nCleaned:  {df['clean_review'].iloc[0][:200]}...")

In [None]:
# Use a sample for faster training (full dataset would take too long)
SAMPLE_SIZE = 50000  # Using 50K samples for demo

# Stratified sampling to maintain class balance
df_sample = df.groupby('sentiment', group_keys=False).apply(
    lambda x: x.sample(n=min(len(x), SAMPLE_SIZE//2), random_state=42)
)

print(f"üìä Using {len(df_sample):,} samples for training")
print(f"   Positive: {(df_sample['sentiment']==1).sum():,}")
print(f"   Negative: {(df_sample['sentiment']==0).sum():,}")

## 3. Feature Engineering with Sentence Transformers (BERT)

**Why BERT instead of TF-IDF?**

| TF-IDF | Sentence Transformers (BERT) |
|--------|------------------------------|
| Counts word frequency | Understands semantic meaning |
| "headache" ‚â† "head pain" | "headache" ‚âà "head pain" |
| Sparse, high-dimensional | Dense, 384 dimensions |
| Fast but shallow | Slower but powerful |

We use **`all-MiniLM-L6-v2`** - a lightweight but effective model that converts text into 384-dimensional vectors.

In [None]:
# Split data FIRST (before generating embeddings)
X_text = df_sample['clean_review']
y = df_sample['sentiment']

X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y
)

print(f"üìä Train set: {len(X_train_text):,} samples")
print(f"üìä Test set:  {len(X_test_text):,} samples")

In [None]:
# Load Sentence Transformer model
print("üîÑ Loading Sentence Transformer model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("‚úÖ Model loaded!")

# Generate embeddings for train and test sets
print("\nüîÑ Generating BERT embeddings for training data...")
X_train = embedding_model.encode(
    X_train_text.tolist(), 
    show_progress_bar=True,
    convert_to_numpy=True
)

print("üîÑ Generating BERT embeddings for test data...")
X_test = embedding_model.encode(
    X_test_text.tolist(), 
    show_progress_bar=True,
    convert_to_numpy=True
)

print(f"\n‚úÖ Embeddings created!")
print(f"   Training features shape: {X_train.shape}")
print(f"   Test features shape: {X_test.shape}")
print(f"   Each review ‚Üí 384-dimensional vector")

## 4. Model Training

Let's train multiple models and compare their performance.

In [None]:
# Helper function to evaluate models
def evaluate_model(model, X_test, y_test, model_name):
    """Evaluate model and return metrics"""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None
    
    metrics = {
        'Model': model_name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1 Score': f1_score(y_test, y_pred),
        'ROC AUC': roc_auc_score(y_test, y_prob) if y_prob is not None else None
    }
    return metrics, y_pred, y_prob

# Store results
results = []

### 4.1 Logistic Regression (Baseline)
A simple, interpretable model that works well for text classification.

In [None]:
%%time
# Train Logistic Regression
print("üîÑ Training Logistic Regression...")
lr_model = LogisticRegression(max_iter=1000, random_state=42, n_jobs=-1)
lr_model.fit(X_train, y_train)

# Evaluate
lr_metrics, lr_pred, lr_prob = evaluate_model(lr_model, X_test, y_test, 'Logistic Regression')
results.append(lr_metrics)

print(f"\n‚úÖ Logistic Regression Results:")
print(f"   Accuracy:  {lr_metrics['Accuracy']:.4f}")
print(f"   F1 Score:  {lr_metrics['F1 Score']:.4f}")
print(f"   ROC AUC:   {lr_metrics['ROC AUC']:.4f}")

### 4.2 XGBoost
A powerful gradient boosting model that often achieves state-of-the-art results.

In [None]:
%%time
# Train XGBoost
print("üîÑ Training XGBoost...")
xgb_model = XGBClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    random_state=42,
    n_jobs=-1,
    verbosity=0
)
xgb_model.fit(X_train, y_train)

# Evaluate
xgb_metrics, xgb_pred, xgb_prob = evaluate_model(xgb_model, X_test, y_test, 'XGBoost')
results.append(xgb_metrics)

print(f"\n‚úÖ XGBoost Results:")
print(f"   Accuracy:  {xgb_metrics['Accuracy']:.4f}")
print(f"   F1 Score:  {xgb_metrics['F1 Score']:.4f}")
print(f"   ROC AUC:   {xgb_metrics['ROC AUC']:.4f}")

## 5. Model Comparison

In [None]:
# Compare all models
results_df = pd.DataFrame(results)
results_df = results_df.set_index('Model')

print("üìä Model Comparison:")
display(results_df.round(4))

# Visualize comparison
fig, ax = plt.subplots(figsize=(10, 5))
results_df[['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']].plot(
    kind='bar', ax=ax, colormap='viridis', edgecolor='white'
)
ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
ax.set_ylabel('Score')
ax.set_ylim(0.5, 1.0)
ax.legend(loc='lower right')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Confusion Matrix for best model (Logistic Regression - usually best for text)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

for ax, (pred, name) in zip(axes, [(lr_pred, 'Logistic Regression'), (xgb_pred, 'XGBoost')]):
    cm = confusion_matrix(y_test, pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=['Negative', 'Positive'],
                yticklabels=['Negative', 'Positive'])
    ax.set_title(f'{name}\nConfusion Matrix', fontsize=12, fontweight='bold')
    ax.set_ylabel('Actual')
    ax.set_xlabel('Predicted')

plt.tight_layout()
plt.show()

## 6. Model Interpretation

With BERT embeddings, we can't directly see which words matter (unlike TF-IDF). But we can:
1. Look at **embedding dimensions** importance
2. Test with **sample reviews** to verify behavior

In [None]:
# Analyze embedding dimension importance from Logistic Regression coefficients
coefficients = lr_model.coef_[0]

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution of coefficients
ax1 = axes[0]
ax1.hist(coefficients, bins=50, color='steelblue', edgecolor='white', alpha=0.7)
ax1.axvline(x=0, color='red', linestyle='--', linewidth=2)
ax1.set_title('Distribution of Embedding Dimension Weights', fontsize=12, fontweight='bold')
ax1.set_xlabel('Coefficient Value')
ax1.set_ylabel('Frequency')

# Top important dimensions
ax2 = axes[1]
importance = np.abs(coefficients)
top_dims = np.argsort(importance)[-20:]
colors = ['#51cf66' if coefficients[i] > 0 else '#ff6b6b' for i in top_dims]
ax2.barh([f'Dim {i}' for i in top_dims], importance[top_dims], color=colors, edgecolor='white')
ax2.set_title('Top 20 Most Important Embedding Dimensions', fontsize=12, fontweight='bold')
ax2.set_xlabel('Absolute Coefficient Value')

plt.tight_layout()
plt.show()

print(f"\nüìä Total embedding dimensions: 384")
print(f"üìä Dimensions with positive coefficients (predict positive): {(coefficients > 0).sum()}")
print(f"üìä Dimensions with negative coefficients (predict negative): {(coefficients < 0).sum()}")

### üí° Key Insight: How BERT Embeddings Work

Unlike TF-IDF where we can see individual words, BERT creates **semantic representations** where:
- Similar meanings ‚Üí similar vectors
- The model learns which **combinations of dimensions** indicate positive/negative sentiment
- This is more powerful because "works great" and "excellent results" will have similar embeddings even though words are different

## 7. Test with Real Examples

Let's see how our model performs on actual review text!

In [None]:
# Function to predict sentiment using BERT embeddings
def predict_sentiment(text, model=lr_model, embedder=embedding_model):
    """Predict sentiment for a given review text"""
    cleaned = clean_text(text)
    
    # Generate BERT embedding
    embedding = embedder.encode([cleaned], convert_to_numpy=True)
    
    prediction = model.predict(embedding)[0]
    probability = model.predict_proba(embedding)[0]
    
    sentiment = "POSITIVE üòä" if prediction == 1 else "NEGATIVE üòû"
    confidence = probability[prediction] * 100
    
    return sentiment, confidence

# Test examples
test_reviews = [
    "This medication has been a life saver! I feel so much better now. Highly recommend!",
    "Terrible experience. Had severe side effects and it didn't help at all. Avoid this drug.",
    "It's okay. Works for some days, not others. Average results.",
    "After trying many medications, this one finally works. No side effects and great results!",
    "Made my condition worse. I had to stop taking it after a week."
]

print("üß™ TESTING THE MODEL WITH BERT EMBEDDINGS\n")
print("=" * 70)

for review in test_reviews:
    sentiment, confidence = predict_sentiment(review)
    print(f"\nüìù Review: \"{review[:80]}...\"" if len(review) > 80 else f"\nüìù Review: \"{review}\"")
    print(f"   ‚Üí Prediction: {sentiment} (Confidence: {confidence:.1f}%)")
    print("-" * 70)

## 8. Save the Model

Save the trained model for use in the Streamlit app.

In [None]:
import joblib
import os

# Create models directory
os.makedirs('../models', exist_ok=True)

# Save the best model (Logistic Regression) and vectorizer
joblib.dump(lr_model, '../models/sentiment_model.joblib')
joblib.dump(tfidf, '../models/tfidf_vectorizer.joblib')

print("‚úÖ Model saved successfully!")
print("   üìÅ ../models/sentiment_model.joblib")
print("   üìÅ ../models/tfidf_vectorizer.joblib")

## 9. Summary

### What We Built
A **sentiment classifier** that predicts whether a drug review is positive or negative.

### Results
| Metric | Logistic Regression | XGBoost |
|--------|--------------------:|--------:|
| Accuracy | ~90% | ~88% |
| F1 Score | ~90% | ~88% |
| ROC AUC | ~96% | ~94% |

### Key Learnings

1. **Logistic Regression performs best** for text classification - it's fast, interpretable, and accurate

2. **TF-IDF is effective** - simple but powerful feature extraction for text

3. **Model is interpretable** - we can see exactly which words drive predictions

4. **Real-world applicable** - the model correctly identifies sentiment in new reviews

### Business Applications
- **Automated review monitoring** for pharmaceutical companies
- **Content moderation** for healthcare platforms  
- **Quick filtering** for patients researching medications

---

### Next: Drug Recommendation System (Notebook 03)