# Sentiment Analysis Model Development

**Purpose**: Develop and evaluate a sentiment analysis model for production deployment

**Model**: Logistic Regression with TF-IDF features

**Deployment**: AWS SageMaker with an MLOps pipeline

## Notebook Structure
1. Environment Setup
2. Data Loading and Exploration
3. Text Preprocessing
4. Feature Engineering
5. Model Training
6. Model Evaluation
7. Model Saving and Export
8. Local Testing
9. Preparation for Cloud Deployment

## 1. Environment Setup

In [None]:
# Standard library imports
import os
import sys
from pathlib import Path

# Add src to path for imports
sys.path.append('../src')

# Data manipulation
import numpy as np
import pandas as pd

# Machine learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# NLP
import nltk

# Custom modules
from preprocess import TextPreprocessor

# Configurations
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Environment setup complete!")

## 2. Data Loading and Exploration

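For this demo, we create synthetic sentiment data by repeating ten positive and ten negative template reviews 50 times each (1,000 rows in total). Because every review appears many times, identical texts end up in both the training and test splits, so the metrics in Section 6 only confirm that the pipeline runs end to end; they do not estimate real-world performance. In production, you would load real customer reviews.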

In [None]:
# Create synthetic sentiment data for demonstration
positive_reviews = [
    "This product is absolutely amazing! Best purchase ever.",
    "Excellent quality and fast shipping. Highly recommend!",
    "Love it! Exceeded my expectations in every way.",
    "Great value for money. Very satisfied with this purchase.",
    "Outstanding product! Will definitely buy again.",
    "Fantastic! Exactly what I was looking for.",
    "Superb quality and excellent customer service.",
    "Best product in its category. Cannot recommend enough!",
    "Wonderful experience from start to finish.",
    "Amazing features and great performance!",
] * 50  # Repeat to create more samples

negative_reviews = [
    "Terrible product. Complete waste of money.",
    "Poor quality. Broke after just one use.",
    "Very disappointed. Not as described.",
    "Awful experience. Would not recommend to anyone.",
    "Worst purchase ever. Total garbage.",
    "Bad quality and terrible customer service.",
    "Don't buy this! Save your money.",
    "Horrible product. Completely useless.",
    "Disappointing quality. Not worth the price.",
    "Terrible! Nothing works as advertised.",
] * 50  # Repeat to create more samples

# Create DataFrame
data = pd.DataFrame({
    'text': positive_reviews + negative_reviews,
    'label': [1] * len(positive_reviews) + [0] * len(negative_reviews)
})

# Shuffle data
data = data.sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)

print(f"Dataset shape: {data.shape}")
print("\nLabel distribution:")
print(data['label'].value_counts())
print("\nFirst few samples:")
data.head(10)

In [None]:
# Visualize label distribution
plt.figure(figsize=(8, 5))
data['label'].value_counts().plot(kind='bar')
plt.title('Sentiment Label Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Sentiment (0=Negative, 1=Positive)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

print(f"Positive reviews: {(data['label'] == 1).sum()} ({(data['label'] == 1).mean():.1%})")
print(f"Negative reviews: {(data['label'] == 0).sum()} ({(data['label'] == 0).mean():.1%})")
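
Since the synthetic reviews are built by repeating a small set of templates, it can be useful to measure how much of a dataset is duplicated before trusting any held-out metric. The following is a minimal sketch using only the standard library; `duplicate_summary` and the toy list are illustrative names that are not part of this notebook.

```python
from collections import Counter


def duplicate_summary(texts):
    """Return total, unique, and redundant-row counts for a list of texts."""
    counts = Counter(texts)
    total = sum(counts.values())
    unique = len(counts)
    return {
        "total": total,
        "unique": unique,
        # Rows that repeat an identical text seen elsewhere in the list
        "redundant_rows": total - unique,
        "duplicate_fraction": (total - unique) / total if total else 0.0,
    }


# Toy stand-in for data['text'].tolist(): two templates repeated 50 times
toy_texts = ["great product", "terrible product"] * 50
summary = duplicate_summary(toy_texts)
print(summary)
```

A high duplicate fraction suggests deduplicating, or splitting by template, before the train/test split.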

## 3. Text Preprocessing

Apply our custom preprocessing pipeline to clean and normalize the text data.

In [None]:
# Initialize preprocessor
preprocessor = TextPreprocessor(
    lowercase=True,
    remove_stopwords=True,
    remove_numbers=False
)

# Example of preprocessing
sample_text = data['text'].iloc[0]
processed_text = preprocessor.preprocess(sample_text)

print("Original text:")
print(sample_text)
print("\nProcessed text:")
print(processed_text)

In [None]:
# Preprocess all texts
print("Preprocessing texts...")
data['processed_text'] = data['text'].apply(preprocessor.preprocess)

# Show comparison
comparison_df = data[['text', 'processed_text', 'label']].head(5)
print("\nSample of preprocessed data:")
comparison_df
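
`TextPreprocessor` is imported from the project's `preprocess` module, and its implementation is not shown in this notebook. As a rough illustration of what the three flags used above might do, here is a hypothetical stand-in; the class name `SimplePreprocessor`, the tiny stopword list, and the punctuation handling are all assumptions, and the real module may behave differently.

```python
import re

# Small illustrative stopword list; the real module may use a different one
STOPWORDS = {"a", "an", "the", "is", "it", "this", "and", "of", "to", "in"}


class SimplePreprocessor:
    """Hypothetical stand-in that mirrors the three flags used above."""

    def __init__(self, lowercase=True, remove_stopwords=True, remove_numbers=False):
        self.lowercase = lowercase
        self.remove_stopwords = remove_stopwords
        self.remove_numbers = remove_numbers

    def preprocess(self, text):
        if self.lowercase:
            text = text.lower()
        if self.remove_numbers:
            text = re.sub(r"\d+", " ", text)
        # Replace punctuation with spaces, keeping word characters and apostrophes
        text = re.sub(r"[^\w\s']", " ", text)
        tokens = text.split()
        if self.remove_stopwords:
            tokens = [t for t in tokens if t.lower() not in STOPWORDS]
        return " ".join(tokens)


demo = SimplePreprocessor()
print(demo.preprocess("This product is absolutely amazing! Best purchase ever."))
```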

## 4. Feature Engineering

Convert text to TF-IDF features for model training.

In [None]:
# Split data into train and test sets
X = data['processed_text'].values
y = data['label'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print("\nTrain label distribution:")
print(pd.Series(y_train).value_counts())

In [None]:
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)

# Fit on training data and transform
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"Feature matrix shape (train): {X_train_tfidf.shape}")
print(f"Feature matrix shape (test): {X_test_tfidf.shape}")
print(f"\nNumber of features: {len(vectorizer.get_feature_names_out())}")
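
To make the weighting concrete, the sketch below computes TF-IDF by hand using the formula scikit-learn documents for `TfidfVectorizer` defaults: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document. It is a simplification that handles whitespace-separated unigrams only and ignores `ngram_range`, `min_df`, `max_df`, and `max_features`; `tfidf_rows` is an illustrative name.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2 row normalization, unigram tokens only."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1.0 for term, d in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows, idf


docs = ["great product", "terrible product"]
rows, idf = tfidf_rows(docs)
print(idf)
print(rows[0])
```

In this toy corpus "product" appears in every document, so it gets the minimum IDF and a lower weight than the more distinctive "great".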

## 5. Model Training

Train a Logistic Regression model for binary sentiment classification.

In [None]:
# Train model
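# n_jobs is omitted: it only parallelizes one-vs-rest fits across multiple
# classes, so it has no effect on this binary problem.
model = LogisticRegression(
    C=1.0,              # Inverse regularization strength (smaller = stronger penalty)
    max_iter=1000,
    random_state=RANDOM_STATE,
    solver='lbfgs'
)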

print("Training model...")
model.fit(X_train_tfidf, y_train)
print("Training complete!")
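
Once trained, a binary logistic regression scores a document as the sigmoid of a weighted sum of its features; in scikit-learn the learned weights and bias are exposed as `model.coef_` and `model.intercept_`. The sketch below shows that computation with the standard library only. The weights are hand-picked for illustration and are not the values this notebook learns.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba_positive(features, weights, bias):
    """P(label = 1 | features) for sparse features given as {term: value}."""
    score = bias + sum(value * weights.get(term, 0.0) for term, value in features.items())
    return sigmoid(score)


# Hypothetical weights, chosen by hand for illustration (not learned from data)
weights = {"great": 2.0, "terrible": -2.0}
bias = 0.0

p_pos = predict_proba_positive({"great": 0.8, "product": 0.6}, weights, bias)
p_neg = predict_proba_positive({"terrible": 0.8, "product": 0.6}, weights, bias)
print(round(p_pos, 4), round(p_neg, 4))
```

Terms with no weight, such as "product" here, contribute nothing to the score, which is how uninformative features end up being ignored.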

## 6. Model Evaluation

In [None]:
# Make predictions
y_pred_train = model.predict(X_train_tfidf)
y_pred_test = model.predict(X_test_tfidf)

y_pred_proba_train = model.predict_proba(X_train_tfidf)[:, 1]
y_pred_proba_test = model.predict_proba(X_test_tfidf)[:, 1]

# Calculate metrics
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)

train_f1 = f1_score(y_train, y_pred_train)
test_f1 = f1_score(y_test, y_pred_test)

train_auc = roc_auc_score(y_train, y_pred_proba_train)
test_auc = roc_auc_score(y_test, y_pred_proba_test)

print("=" * 50)
print("MODEL PERFORMANCE METRICS")
print("=" * 50)
print("\nTRAINING SET:")
print(f"  Accuracy: {train_accuracy:.4f}")
print(f"  F1 Score: {train_f1:.4f}")
print(f"  ROC AUC:  {train_auc:.4f}")

print("\nTEST SET:")
print(f"  Accuracy: {test_accuracy:.4f}")
print(f"  F1 Score: {test_f1:.4f}")
print(f"  ROC AUC:  {test_auc:.4f}")
print("\n" + "=" * 50)

In [None]:
# Classification report
print("\nCLASSIFICATION REPORT (Test Set):")
print(classification_report(y_test, y_pred_test, target_names=['Negative', 'Positive']))
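
The numbers in the classification report all derive from four counts: true positives, false positives, false negatives, and true negatives. As a minimal sketch of those definitions (standard library only, labels assumed to be 0 or 1 with 1 as the positive class; `binary_metrics` is an illustrative name):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1} (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against empty denominators by falling back to 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 3 TP, 1 FN, 2 FP, 4 TN
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```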

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_test)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba_test)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {test_auc:.4f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
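
ROC AUC has a useful interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. It is quadratic in the number of examples and meant only for illustration; `roc_auc_pairwise` is an illustrative name.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_true = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(toy_true, toy_scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly in this toy example, which is why the value falls between random (0.5) and perfect (1.0).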

## 7. Model Saving and Export

In [None]:
import joblib

# Create models directory
model_dir = Path('../models/local')
model_dir.mkdir(parents=True, exist_ok=True)

# Save model
model_path = model_dir / 'model.pkl'
joblib.dump(model, model_path)
print(f"Model saved to {model_path}")

# Save vectorizer
vectorizer_path = model_dir / 'vectorizer.pkl'
joblib.dump(vectorizer, vectorizer_path)
print(f"Vectorizer saved to {vectorizer_path}")

# Save preprocessor
preprocessor_path = model_dir / 'preprocessor.pkl'
joblib.dump(preprocessor, preprocessor_path)
print(f"Preprocessor saved to {preprocessor_path}")

print("\nAll artifacts saved successfully!")
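
Saving the vectorizer and model as separate files works, but the two must always be loaded and versioned together. One alternative, sketched here under the assumption that scikit-learn and joblib are installed, is to wrap both steps in a single scikit-learn `Pipeline` and persist one artifact. The toy data and the file name `pipeline.pkl` are illustrative, and the custom preprocessor would still need to be applied separately before calling the pipeline.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, for illustration only
texts = ["great product", "love it", "terrible product", "awful quality"] * 5
labels = [1, 1, 0, 0] * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

# Round-trip through disk and confirm the restored pipeline behaves identically
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipeline.pkl"
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

original_pred = pipeline.predict(texts).tolist()
restored_pred = restored.predict(texts).tolist()
print(original_pred == restored_pred)
```

Either way, pickled artifacts should be loaded with the same library versions that produced them, and any custom class (such as the preprocessor) must be importable at load time.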

## 8. Local Testing

Test the saved model on new examples.

In [None]:
# Load saved artifacts
loaded_model = joblib.load(model_path)
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_preprocessor = joblib.load(preprocessor_path)

print("Model artifacts loaded successfully!")

In [None]:
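def predict_sentiment(text):
    """Predict the sentiment of a single text with the loaded artifacts."""
    # Apply the same preprocessing that was used at training time
    processed = loaded_preprocessor.preprocess(text)

    # Vectorize; transform() expects an iterable of documents, hence the list
    features = loaded_vectorizer.transform([processed])

    # predict_proba columns follow loaded_model.classes_, which is [0, 1] here,
    # so the predicted label can be used directly as a column index
    prediction = int(loaded_model.predict(features)[0])
    probabilities = loaded_model.predict_proba(features)[0]

    sentiment = "Positive" if prediction == 1 else "Negative"
    confidence = float(probabilities[prediction])

    # Return plain Python floats rather than NumPy scalars
    return {
        'sentiment': sentiment,
        'confidence': confidence,
        'probabilities': {
            'negative': float(probabilities[0]),
            'positive': float(probabilities[1])
        }
    }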

# Test on new examples
test_examples = [
    "This product is absolutely fantastic! I love it!",
    "Terrible experience. Would not recommend.",
    "It's okay, nothing special.",
    "Best purchase of the year! Highly recommend!",
    "Complete waste of money. Very disappointed."
]

print("\nTESTING PREDICTIONS ON NEW EXAMPLES")
print("=" * 80)

for i, text in enumerate(test_examples, 1):
    result = predict_sentiment(text)
    print(f"\n{i}. Text: {text}")
    print(f"   Sentiment: {result['sentiment']}")
    print(f"   Confidence: {result['confidence']:.2%}")
    print(f"   Probabilities: Neg={result['probabilities']['negative']:.2%}, Pos={result['probabilities']['positive']:.2%}")

## 9. Preparation for Cloud Deployment

Next steps for deploying to AWS SageMaker:

1. **Upload model artifacts to S3**
2. **Build and push Docker container to ECR**
3. **Create SageMaker model and endpoint**
4. **Set up monitoring and auto-scaling**

See the main README.md for detailed deployment instructions.

In [None]:
# Model metadata for deployment
model_metadata = {
    'model_name': 'sentiment-classifier',
    'version': '1.0',
    'train_samples': len(X_train),
    'test_samples': len(X_test),
    'test_accuracy': test_accuracy,
    'test_f1': test_f1,
    'test_auc': test_auc,
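    'features': {
        # Read settings from the fitted vectorizer so metadata cannot drift from the code
        'max_features': vectorizer.max_features,
        'ngram_range': str(vectorizer.ngram_range),
        'vectorizer': 'TF-IDF'
    },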
    'model_type': 'LogisticRegression',
    'preprocessing': {
        'lowercase': True,
        'remove_stopwords': True,
        'remove_numbers': False
    }
}

# Save metadata
import json

metadata_path = model_dir / 'metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(model_metadata, f, indent=2)

print("Model metadata:")
print(json.dumps(model_metadata, indent=2))
print(f"\nMetadata saved to {metadata_path}")
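
For the first deployment step, SageMaker expects model artifacts in S3 as a gzip-compressed tarball, conventionally named `model.tar.gz`. The sketch below bundles the saved files using only the standard library. The helper name `package_artifacts` is illustrative, the demo writes placeholder files to a temporary directory rather than touching `model_dir`, and the flat archive layout is an assumption: the layout your serving container expects may differ, so check the README before relying on it.

```python
import tarfile
import tempfile
from pathlib import Path


def package_artifacts(artifact_dir, output_path, names):
    """Bundle the named files from artifact_dir into a gzip-compressed tarball."""
    artifact_dir = Path(artifact_dir)
    missing = [n for n in names if not (artifact_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing artifacts: {missing}")
    with tarfile.open(output_path, "w:gz") as tar:
        for name in names:
            # arcname keeps files at the archive root instead of nesting local paths
            tar.add(artifact_dir / name, arcname=name)
    return Path(output_path)


# Demo with placeholder files; in the notebook, artifact_dir would be model_dir
names = ["model.pkl", "vectorizer.pkl", "preprocessor.pkl", "metadata.json"]
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for name in names:
        (tmp / name).write_bytes(b"placeholder")
    archive = package_artifacts(tmp, tmp / "model.tar.gz", names)
    with tarfile.open(archive, "r:gz") as tar:
        packaged = sorted(tar.getnames())

print(packaged)
```

The resulting archive is what would be uploaded to S3; the upload itself is not shown here.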

## Summary

✅ Data loaded and explored

✅ Text preprocessing pipeline implemented

✅ TF-IDF features extracted

✅ Model trained and evaluated

✅ Model artifacts saved

✅ Local testing completed

✅ Ready for cloud deployment

**Next Steps**: Follow the deployment guide in README.md to deploy this model to AWS SageMaker with the full MLOps pipeline.