# Week 7 Lab: Building Your First ML Models

**CS 203: Software Tools and Techniques for AI**

In this lab, you'll learn to:
1. Build baseline models (Logistic Regression, Decision Tree, Random Forest)
2. Use cross-validation for reliable evaluation
3. Tune hyperparameters with GridSearchCV
4. Try AutoML with AutoGluon
5. Use transfer learning for text classification

## Setup

In [None]:
# Install required packages
# !pip install pandas scikit-learn matplotlib seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

## Part 1: Create Sample Netflix Movie Data

Let's create a synthetic dataset similar to what we've been working with.

In [None]:
# Create synthetic movie data
np.random.seed(42)
n_samples = 500

# Generate features
genres = np.random.choice(['Action', 'Comedy', 'Drama', 'Horror', 'Sci-Fi'], n_samples)
budgets = np.random.uniform(5, 300, n_samples)  # Budget in millions
runtimes = np.random.uniform(80, 180, n_samples)  # Runtime in minutes
is_sequel = np.random.choice([0, 1], n_samples, p=[0.7, 0.3])
star_power = np.random.uniform(1, 10, n_samples)  # Actor popularity score

# Generate target (success) with some logic
success_prob = (
    0.3 +  # Base probability
    0.002 * budgets +  # Higher budget helps
    0.02 * star_power +  # Star power helps
    0.1 * is_sequel +  # Sequels have slight advantage
    np.where(genres == 'Action', 0.1, 0) +  # Action does well
    np.random.normal(0, 0.1, n_samples)  # Random noise
)
success_prob = np.clip(success_prob, 0, 1)
success = (np.random.random(n_samples) < success_prob).astype(int)

# Create DataFrame
movies = pd.DataFrame({
    'genre': genres,
    'budget': budgets,
    'runtime': runtimes,
    'is_sequel': is_sequel,
    'star_power': star_power,
    'success': success
})

print(f"Dataset shape: {movies.shape}")
print(f"\nSuccess rate: {movies['success'].mean():.1%}")
movies.head()

## Part 2: Prepare Data for Modeling

In [None]:
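# Encode the categorical variable as integers (Action=0, Comedy=1, ...)
# Note: integer codes imply an arbitrary order; tree models tolerate this,
# but linear models such as Logistic Regression may misread it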
le = LabelEncoder()
movies['genre_encoded'] = le.fit_transform(movies['genre'])

# Prepare features and target
feature_cols = ['genre_encoded', 'budget', 'runtime', 'is_sequel', 'star_power']
X = movies[feature_cols]
y = movies['success']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
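
An alternative to integer codes is one-hot encoding, which gives each genre its own 0/1 column so that no order is implied. The sketch below uses a small made-up DataFrame (`demo`, not the lab's `movies` data) and `pd.get_dummies`; treat it as one possible approach, not the encoding this lab depends on.

```python
import pandas as pd

# Small stand-in for the lab's `movies` DataFrame (values are made up)
demo = pd.DataFrame({
    'genre': ['Action', 'Comedy', 'Drama', 'Action', 'Horror'],
    'budget': [120.0, 35.0, 20.0, 250.0, 10.0],
})

# One-hot encode: each genre becomes its own 0/1 column
demo_onehot = pd.get_dummies(demo, columns=['genre'], dtype=int)
onehot_cols = list(demo_onehot.columns)

print(demo_onehot)
```

To try this in the lab, the resulting `genre_*` columns would replace `genre_encoded` in `feature_cols`.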

## Part 3: The Dumbest Baseline

Before building any model, let's see what accuracy we get by just predicting the most common class.

In [None]:
# Majority class classifier
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
dummy_acc = dummy.score(X_test, y_test)

print(f"Majority class baseline accuracy: {dummy_acc:.1%}")
print("\nThis is what we need to beat!")
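
To see what the majority-class baseline does, the same number can be computed by hand: find the most common label in the training set, predict it for every test sample, and count how often that is right. The arrays below are made-up toy labels, not the lab's data.

```python
import numpy as np

# Toy labels (hypothetical, for illustration only)
y_train_demo = np.array([1, 1, 1, 0, 1, 0, 1, 1])
y_test_demo = np.array([1, 0, 1, 1])

# Most frequent class in the training labels
values, counts = np.unique(y_train_demo, return_counts=True)
majority_class = values[np.argmax(counts)]

# Accuracy when predicting that class for every test sample
manual_acc = np.mean(y_test_demo == majority_class)

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {manual_acc:.1%}")
```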

## Part 4: Baseline Models

### 4.1 Logistic Regression

In [None]:
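# Logistic Regression - a simple linear baseline
# max_iter is raised because the features are unscaled and the solver may not
# converge in the default 100 iterations (warnings are silenced above)
lr = LogisticRegression(max_iter=1000, random_state=42)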
lr.fit(X_train, y_train)

lr_acc = lr.score(X_test, y_test)
print(f"Logistic Regression accuracy: {lr_acc:.1%}")
print(f"Improvement over baseline: {(lr_acc - dummy_acc)*100:.1f} percentage points")

In [None]:
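# Look at the coefficients as a rough measure of feature importance
# Note: the features are on different scales (budget in millions, is_sequel 0/1),
# so coefficient magnitudes are not directly comparable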
coef_df = pd.DataFrame({
    'feature': feature_cols,
    'coefficient': lr.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)

print("Feature importance (Logistic Regression):")
print(coef_df)
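
Coefficients become easier to compare when every feature is standardized first. The sketch below chains `StandardScaler` and `LogisticRegression` in a pipeline on synthetic data; the two features and the target formula are assumptions made up for this example, not the lab's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical features and target)
rng = np.random.default_rng(0)
n = 400
demo_X = pd.DataFrame({
    'budget': rng.uniform(5, 300, n),
    'star_power': rng.uniform(1, 10, n),
})
logit = 0.01 * (demo_X['budget'] - 150) + 0.3 * (demo_X['star_power'] - 5.5)
demo_y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Scale features to zero mean and unit variance, then fit the classifier
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
scaled_lr.fit(demo_X, demo_y)

# Each coefficient is now the effect of a one-standard-deviation change
scaled_coefs = pd.Series(
    scaled_lr.named_steps['logisticregression'].coef_[0],
    index=demo_X.columns,
)
print(scaled_coefs)
```

Putting the scaler inside the pipeline also means it is refit on each training fold during cross-validation, so no information from the validation fold leaks into the scaling.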

### 4.2 Decision Tree

In [None]:
# Decision Tree
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)

dt_acc = dt.score(X_test, y_test)
print(f"Decision Tree accuracy: {dt_acc:.1%}")

In [None]:
# Visualize the decision tree
plt.figure(figsize=(20, 10))
plot_tree(dt, feature_names=feature_cols, class_names=['Fail', 'Success'], 
          filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree for Movie Success")
plt.tight_layout()
plt.show()
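
When the plotted tree is hard to read, the same rules can be printed as text with `export_text`. This is a minimal sketch on a six-row toy dataset (made-up values) where budget alone separates the classes.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): high-budget movies succeed, low-budget ones fail
demo_X = pd.DataFrame({
    'budget': [10, 20, 30, 200, 250, 300],
    'is_sequel': [0, 1, 0, 1, 0, 1],
})
demo_y = [0, 0, 0, 1, 1, 1]

tiny_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tiny_tree.fit(demo_X, demo_y)

# Print the learned splits as indented if/else rules
rules_text = export_text(tiny_tree, feature_names=list(demo_X.columns))
print(rules_text)
```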

### 4.3 Random Forest

In [None]:
# Random Forest - an ensemble of decision trees, often a strong baseline on tabular data
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

rf_acc = rf.score(X_test, y_test)
print(f"Random Forest accuracy: {rf_acc:.1%}")

In [None]:
# Feature importance from Random Forest
importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 5))
plt.barh(importance_df['feature'], importance_df['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance (Random Forest)')
plt.gca().invert_yaxis()
plt.show()
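
The `feature_importances_` values above are impurity-based and computed from the training data. A common cross-check is permutation importance, which measures how much the held-out score drops when one feature's values are shuffled. The sketch below uses `make_classification` data as a stand-in for the movie features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the lab's movie dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle each feature 5 times on the held-out set and record the score drop
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

for i, (mean, std) in enumerate(zip(perm.importances_mean, perm.importances_std)):
    print(f"feature {i}: {mean:.3f} ± {std:.3f}")
```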

## Part 5: Model Comparison

In [None]:
# Compare all models
results = pd.DataFrame({
    'Model': ['Majority Baseline', 'Logistic Regression', 'Decision Tree', 'Random Forest'],
    'Accuracy': [dummy_acc, lr_acc, dt_acc, rf_acc]
})

results['Improvement'] = results['Accuracy'] - dummy_acc
results = results.sort_values('Accuracy', ascending=False)

print("Model Comparison:")
print(results.to_string(index=False))

In [None]:
# Visualize comparison
plt.figure(figsize=(10, 5))
colors = ['gray' if x == 'Majority Baseline' else 'steelblue' for x in results['Model']]
plt.barh(results['Model'], results['Accuracy'], color=colors)
plt.axvline(x=dummy_acc, color='red', linestyle='--', label='Baseline')
plt.xlabel('Accuracy')
plt.title('Model Comparison')
plt.xlim(0, 1)
plt.legend()
plt.tight_layout()
plt.show()

## Part 6: Cross-Validation

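With only 100 test samples, a single train/test split gives a noisy accuracy estimate that depends on which rows land in the test set. Let's use 5-fold cross-validation, which averages over five different splits, for more reliable estimates.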

In [None]:
# Cross-validation for all models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

cv_results = []

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    cv_results.append({
        'Model': name,
        'Mean Accuracy': scores.mean(),
        'Std': scores.std(),
        'Scores': scores
    })
    print(f"{name}: {scores.mean():.1%} ± {scores.std():.1%}")
    print(f"  Individual folds: {[f'{s:.1%}' for s in scores]}")
    print()

In [None]:
# Visualize CV results
cv_df = pd.DataFrame(cv_results)

plt.figure(figsize=(10, 5))
x = range(len(cv_df))
plt.bar(x, cv_df['Mean Accuracy'], yerr=cv_df['Std'], capsize=5, color='steelblue')
plt.xticks(x, cv_df['Model'])
plt.ylabel('Accuracy')
plt.title('5-Fold Cross-Validation Results')
plt.ylim(0.5, 0.9)
plt.show()
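
With an integer `cv` and a classifier, scikit-learn stratifies the folds but does not shuffle the rows. If you want shuffled folds or several metrics at once, one option is to pass an explicit `StratifiedKFold` to `cross_validate`. The sketch below runs on `make_classification` data as a stand-in for `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (not the lab's movie dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Shuffled, stratified folds with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_out = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo,
    cv=cv,
    scoring=['accuracy', 'f1'],
)

print(f"Accuracy: {cv_out['test_accuracy'].mean():.1%} ± {cv_out['test_accuracy'].std():.1%}")
print(f"F1:       {cv_out['test_f1'].mean():.1%} ± {cv_out['test_f1'].std():.1%}")
```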

## Part 7: Hyperparameter Tuning

### 7.1 Grid Search

In [None]:
# Grid Search for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}

print(f"Total combinations to try: 3 * 4 * 3 = {3 * 4 * 3}")
print("This might take a moment...\n")

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1  # Use all CPU cores
)

grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.1%}")

### 7.2 Random Search (Faster Alternative)

In [None]:
from scipy.stats import randint

# Random Search - sample random combinations
param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 20)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=20,  # Only try 20 random combinations
    cv=5,
    random_state=42,
    n_jobs=-1
)

random_search.fit(X, y)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.1%}")
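
A detail worth knowing about `scipy.stats.randint(low, high)`: it samples integers from `low` up to but not including `high`. The short sketch below draws from the same `randint(50, 300)` distribution used for `n_estimators` to show the range it covers.

```python
from scipy.stats import randint

# Same distribution as 'n_estimators' in param_dist
n_estimators_dist = randint(50, 300)

# Draw many samples to see the range actually covered
samples = n_estimators_dist.rvs(size=1000, random_state=42)

print(f"First five samples: {samples[:5]}")
print(f"Smallest: {samples.min()}, largest: {samples.max()}")
```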

In [None]:
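# Compare tuned vs default
# Note: the tuned parameters were selected on this same data, so the tuned
# score below is optimistically biased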
default_rf = RandomForestClassifier(random_state=42)
default_scores = cross_val_score(default_rf, X, y, cv=5)

tuned_rf = grid_search.best_estimator_
tuned_scores = cross_val_score(tuned_rf, X, y, cv=5)

print(f"Default RF: {default_scores.mean():.1%} ± {default_scores.std():.1%}")
print(f"Tuned RF:   {tuned_scores.mean():.1%} ± {tuned_scores.std():.1%}")
print(f"\nImprovement: {(tuned_scores.mean() - default_scores.mean())*100:.1f} percentage points")
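
One way to get an unbiased estimate for the tuned model is to run the search on the training split only and score the winner once on the held-out test split. The sketch below shows that pattern on `make_classification` data with a deliberately small grid; the grid values and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the lab's movie dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Small illustrative grid: 2 * 2 = 4 combinations
small_grid = {'n_estimators': [25, 50], 'max_depth': [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    small_grid,
    cv=3,
    scoring='accuracy',
)
search.fit(X_tr, y_tr)  # the test split is never seen during tuning

# The best model is refit on the full training split, then scored once
holdout_acc = search.score(X_te, y_te)

print(f"Best parameters: {search.best_params_}")
print(f"CV accuracy (train split): {search.best_score_:.1%}")
print(f"Held-out test accuracy:    {holdout_acc:.1%}")
```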

## Part 8: AutoML with AutoGluon (Optional)

If you have AutoGluon installed, try this section. If not, skip to the next part.

In [None]:
# Uncomment to install AutoGluon (takes a while)
# !pip install autogluon

try:
    from autogluon.tabular import TabularPredictor
    AUTOGLUON_AVAILABLE = True
    print("AutoGluon is available!")
except ImportError:
    AUTOGLUON_AVAILABLE = False
    print("AutoGluon not installed. Skip this section or install with: pip install autogluon")

In [None]:
if AUTOGLUON_AVAILABLE:
    # Prepare data for AutoGluon
    train_data = movies[['genre', 'budget', 'runtime', 'is_sequel', 'star_power', 'success']]
    
    # Train with AutoGluon (time_limit in seconds)
    predictor = TabularPredictor(label='success', eval_metric='accuracy')
    predictor.fit(train_data, time_limit=120)  # 2 minutes
    
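    # Show leaderboard (scores come from AutoGluon's internal validation data,
    # since no separate test set is passed here)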
    print("\nAutoGluon Leaderboard:")
    print(predictor.leaderboard())
else:
    print("Skipping AutoGluon section.")

## Part 9: Transfer Learning for Text (Demo)

Let's see how transfer learning works with a pretrained model for text classification.

In [None]:
# Install transformers if not available
# !pip install transformers

try:
    from transformers import pipeline
    TRANSFORMERS_AVAILABLE = True
    print("Transformers is available!")
except ImportError:
    TRANSFORMERS_AVAILABLE = False
    print("Transformers not installed. Skip this section or install with: pip install transformers")

In [None]:
if TRANSFORMERS_AVAILABLE:
    # Load pretrained sentiment classifier
    classifier = pipeline("sentiment-analysis")
    
    # Test on movie reviews
    reviews = [
        "This movie was absolutely fantastic! The acting was superb.",
        "Terrible film. Complete waste of time and money.",
        "It was okay, nothing special but not bad either.",
        "A masterpiece! One of the best movies I've ever seen.",
        "Boring and predictable. Would not recommend."
    ]
    
    print("Transfer Learning Demo: Sentiment Analysis\n")
    print("Using a pretrained model (no training needed!)\n")
    
    for review in reviews:
        result = classifier(review)[0]
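        # Only add an ellipsis when the review is actually truncated
        preview = review if len(review) <= 50 else review[:50] + "..."
        print(f"Review: '{preview}'")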
        print(f"  → {result['label']} (confidence: {result['score']:.1%})")
        print()
else:
    print("Skipping transfer learning section.")

## Part 10: Final Summary

In [None]:
# Final comparison
print("="*60)
print("FINAL MODEL COMPARISON")
print("="*60)
print(f"\nMajority Baseline:       {dummy_acc:.1%}")
print(f"Logistic Regression:     {lr_acc:.1%}")
print(f"Decision Tree:           {dt_acc:.1%}")
print(f"Random Forest:           {rf_acc:.1%}")
print(f"Tuned Random Forest:     {grid_search.best_score_:.1%} (CV)")
print("\n" + "="*60)
print("\nKey Takeaways:")
print("1. Always compare against a baseline!")
print("2. Use cross-validation for reliable estimates")
print("3. Hyperparameter tuning can improve accuracy")
print("4. Random Forest is often a great default choice")
print("5. AutoML can find good models automatically")

## Exercises

Try these on your own:

1. **Add more features**: What if you had director name, release month, etc.?
2. **Try other models**: SVM, XGBoost, LightGBM
3. **Feature engineering**: Create interaction features (budget × star_power)
4. **Different metrics**: Try precision, recall, F1 instead of accuracy
5. **Larger parameter grid**: More combinations for grid search
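
As a possible starting point for exercise 4, the sketch below computes precision, recall, and F1 on a handful of made-up labels and prints a full `classification_report`. In the lab you would pass `y_test` and a model's predictions (for example `rf.predict(X_test)`) instead of the toy lists.

```python
from sklearn.metrics import (
    classification_report, f1_score, precision_score, recall_score
)

# Toy labels (hypothetical): 1 = Success, 0 = Fail
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision: of the predicted successes, how many were real?
precision = precision_score(y_true_demo, y_pred_demo)
# Recall: of the real successes, how many did we find?
recall = recall_score(y_true_demo, y_pred_demo)
# F1: harmonic mean of precision and recall
f1 = f1_score(y_true_demo, y_pred_demo)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
print(classification_report(y_true_demo, y_pred_demo, target_names=['Fail', 'Success']))
```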