# 🎓 Week 6 - Day 4: Scikit-Learn Basics

## Today's Goals:
✅ Understand the Scikit-Learn API

✅ Learn data splitting and scaling

✅ Build ML pipelines

✅ Train, predict, and evaluate models

✅ Save and load models with joblib

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-Learn imports
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib

import warnings
# Silence only FutureWarnings; convergence and other warnings stay visible
warnings.filterwarnings('ignore', category=FutureWarning)

print('✅ Libraries imported!')

---
## Part 1: The Scikit-Learn Workflow

**4 Simple Steps:**
1. Import & Instantiate
2. Fit (Train)
3. Predict
4. Evaluate

In [None]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

print(f'Dataset: {iris.data.shape[0]} samples, {iris.data.shape[1]} features')
print(f'Classes: {iris.target_names}')

In [None]:
# Step 1: Import & Instantiate
model = LogisticRegression(max_iter=200)
print('Step 1: ✅ Model created')

# Step 2: Fit (Train)
model.fit(X, y)
print('Step 2: ✅ Model trained')

# Step 3: Predict
predictions = model.predict(X[:5])
print('Step 3: ✅ Predictions made')
print(f'Predictions: {predictions}')

# Step 4: Evaluate (scored on the same data used for training, so this is optimistic)
accuracy = model.score(X, y)
print('Step 4: ✅ Model evaluated')
print(f'Training accuracy: {accuracy:.3f}')
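
Every Scikit-Learn estimator exposes the same `fit` / `predict` / `score` methods, so the four steps carry over unchanged when the model class is swapped. A minimal sketch, assuming `KNeighborsClassifier` and `DecisionTreeClassifier` as stand-in models (neither is used elsewhere in this notebook):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Hypothetical model choices, picked only to show the shared interface
demo_models = [KNeighborsClassifier(n_neighbors=5), DecisionTreeClassifier(random_state=0)]

train_scores = {}
for demo_model in demo_models:
    demo_model.fit(X_demo, y_demo)                 # Step 2: fit
    first_preds = demo_model.predict(X_demo[:5])   # Step 3: predict
    name = type(demo_model).__name__
    train_scores[name] = demo_model.score(X_demo, y_demo)  # Step 4: evaluate
    print(f'{name}: predictions={first_preds}, training accuracy={train_scores[name]:.3f}')
```

These are training accuracies, so they say little about performance on unseen data.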

---
## Part 2: Train-Test Split

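**Never test on training data!** A model scored on the data it was fit on (as in Part 1) looks better than it will on unseen samples, so hold out a test set.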

In [None]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Training samples: {len(X_train)}')
print(f'Test samples: {len(X_test)}')
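
A plain random split can leave the classes unevenly represented in the test set. Passing `stratify=y` (as the Titanic project in Part 7 does) keeps the class proportions the same in both parts. A small sketch on Iris, with variable names chosen here for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# Plain random split: class counts in the test set depend on the shuffle
_, _, _, y_test_plain = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# Stratified split: each class keeps its share of the data
_, _, _, y_test_strat = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

plain_counts = np.bincount(y_test_plain, minlength=3)
strat_counts = np.bincount(y_test_strat, minlength=3)
print('Test-set class counts (plain):     ', plain_counts)
print('Test-set class counts (stratified):', strat_counts)
```

Iris has 50 samples per class, so the stratified 30-sample test set contains 10 of each.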

In [None]:
# Train on training data
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Test on test data
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f'Test Accuracy: {accuracy:.3f}')
print('\nSample predictions (first 5):')
for i in range(5):
    print(f'  True: {y_test[i]}, Predicted: {y_pred[i]}')

---
## Part 3: Feature Scaling

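**Why scale?** Features measured in different units can differ in range by orders of magnitude. Algorithms that rely on distances or gradients (SVM, KNN, neural nets) then let the largest-ranged feature dominate.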

In [None]:
# Create sample data with different scales
sample_data = np.array([
    [25, 50000],  # Age: 25, Salary: $50,000
    [30, 60000],  # Age: 30, Salary: $60,000
    [35, 70000]   # Age: 35, Salary: $70,000
])

print('Original data:')
print(sample_data)
print(f'\nAge range: {sample_data[:, 0].min()} - {sample_data[:, 0].max()}')
print(f'Salary range: ${sample_data[:, 1].min()} - ${sample_data[:, 1].max()}')
print('\n⚠️ Very different scales!')
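
To see why this matters for distance-based models, the sketch below measures how much of the Euclidean distance between two people comes from salary alone. The two people are made up for illustration.

```python
import numpy as np

# Hypothetical people described as [age, salary]
person_a = np.array([25, 50000])
person_b = np.array([60, 50500])   # 35 years older, $500 more salary

squared_diffs = (person_a - person_b) ** 2
distance = np.sqrt(squared_diffs.sum())
salary_share = squared_diffs[1] / squared_diffs.sum()

print(f'Euclidean distance: {distance:.1f}')
print(f'Share of squared distance due to salary: {salary_share:.1%}')
```

A 35-year age gap barely registers next to a $500 salary gap, which is the imbalance scaling is meant to remove.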

In [None]:
# Apply StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(sample_data)

print('After StandardScaler:')
print(scaled_data)
print('\n✅ Now both features are on a similar scale!')
print(f'Mean per column: {scaled_data.mean(axis=0).round(3)}')
print(f'Std dev per column: {scaled_data.std(axis=0).round(3)}')
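
`StandardScaler` applies z = (x - mean) / std to each column, using the population standard deviation. The sketch below recomputes that by hand with NumPy and checks it against the library; the array repeats the age and salary values from the cell above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages_salaries = np.array([
    [25, 50000],
    [30, 60000],
    [35, 70000],
], dtype=float)

# z = (x - mean) / std, computed column by column
col_means = ages_salaries.mean(axis=0)
col_stds = ages_salaries.std(axis=0)   # population std (ddof=0), as StandardScaler uses
manual_scaled = (ages_salaries - col_means) / col_stds

sklearn_scaled = StandardScaler().fit_transform(ages_salaries)

print('Column means after scaling:', manual_scaled.mean(axis=0).round(6))
print('Column stds after scaling: ', manual_scaled.std(axis=0).round(6))
print('Matches StandardScaler:', np.allclose(manual_scaled, sklearn_scaled))
```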

### Scaling Impact on Model Performance

In [None]:
# WITHOUT scaling
svm_no_scale = SVC()
svm_no_scale.fit(X_train, y_train)
score_no_scale = svm_no_scale.score(X_test, y_test)

print('❌ SVM without scaling:')
print(f'   Accuracy: {score_no_scale:.3f}')

In [None]:
# WITH scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svm_scaled = SVC()
svm_scaled.fit(X_train_scaled, y_train)
score_scaled = svm_scaled.score(X_test_scaled, y_test)

print('✅ SVM WITH scaling:')
print(f'   Accuracy: {score_scaled:.3f}')
# Iris features share similar ranges, so the gap may be small (or even negative)
print(f'\n📈 Difference: {(score_scaled - score_no_scale)*100:+.1f} percentage points')

---
## Part 4: Pipelines - Clean ML Workflows

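**Pipelines chain steps together!** Calling `fit` fits each step on the training data in order, and `predict` applies the same fitted transforms before the final model predicts.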

In [None]:
# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

print('✅ Pipeline created with 2 steps:')
print('   1. StandardScaler')
print('   2. SVM')

In [None]:
# Train (scaling happens automatically!)
pipeline.fit(X_train, y_train)
print('✅ Pipeline trained')

# Predict (scaling happens automatically!)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.3f}')
print('\n✅ Much cleaner! Pipeline handles everything!')
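
To check what the pipeline does internally, the sketch below looks up the fitted scaler through `named_steps` and confirms two things: its statistics come from the training data only, and scaling then predicting by hand gives the same result as `pipeline.predict`. It rebuilds the Iris split so it runs on its own; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC()),
])
demo_pipeline.fit(X_tr, y_tr)

# The fitted scaler is reachable by its step name
fitted_scaler = demo_pipeline.named_steps['scaler']
uses_train_stats_only = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))
print('Scaler means come from the training data only:', uses_train_stats_only)

# Equivalent manual route: transform the test data with the fitted scaler, then predict
manual_preds = demo_pipeline.named_steps['svm'].predict(fitted_scaler.transform(X_te))
same_predictions = np.array_equal(manual_preds, demo_pipeline.predict(X_te))
print('Manual scale-then-predict matches pipeline.predict:', same_predictions)
```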

---
## Part 5: Cross-Validation

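**More reliable than a single train-test split!** Each sample is used for validation exactly once, so the mean score depends less on one lucky or unlucky split.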

In [None]:
# Single train-test split
pipeline.fit(X_train, y_train)
single_score = pipeline.score(X_test, y_test)

print('Single Train-Test Split:')
print(f'  Accuracy: {single_score:.3f}')
print('  → Based on ONE split')

In [None]:
# 5-Fold Cross-Validation
cv_scores = cross_val_score(pipeline, X, y, cv=5)

print('\n5-Fold Cross-Validation:')
print(f'  Fold scores: {["{:.3f}".format(s) for s in cv_scores]}')
print(f'  Mean: {cv_scores.mean():.3f} (+/- {cv_scores.std()*2:.3f})')
print('  → Based on 5 different splits')
print('\n✅ Cross-validation is more reliable!')
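
What `cross_val_score` does can be written out as an explicit loop: split into folds, fit a fresh copy of the pipeline on each training part, and score it on the held-out part. A sketch assuming `StratifiedKFold` with 5 unshuffled folds, passed to both routes so they see identical splits:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
cv_pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X_iris, y_iris):
    fold_model = clone(cv_pipeline)                      # fresh, unfitted copy per fold
    fold_model.fit(X_iris[train_idx], y_iris[train_idx])
    manual_scores.append(fold_model.score(X_iris[val_idx], y_iris[val_idx]))

manual_scores = np.array(manual_scores)
helper_scores = cross_val_score(cv_pipeline, X_iris, y_iris, cv=folds)
print('Manual loop:    ', manual_scores.round(3))
print('cross_val_score:', helper_scores.round(3))
```

Because the scaler sits inside the pipeline, it is refit on each fold's training part and never sees that fold's validation rows.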

---
## Part 6: Saving & Loading Models

**Train once, reuse without retraining!**

In [None]:
# Train a model
best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
])

best_pipeline.fit(X_train, y_train)
original_score = best_pipeline.score(X_test, y_test)

print('✅ Model trained!')
print(f'Accuracy: {original_score:.3f}')

In [None]:
# Save the model
joblib.dump(best_pipeline, 'iris_model.pkl')
print('✅ Model saved as "iris_model.pkl"')

In [None]:
# Load the model
loaded_model = joblib.load('iris_model.pkl')
loaded_score = loaded_model.score(X_test, y_test)

print('✅ Model loaded successfully!')
print(f'\nOriginal accuracy: {original_score:.3f}')
print(f'Loaded accuracy: {loaded_score:.3f}')
print('\n✅ Scores match! Model saved and loaded correctly.')
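
Comparing predictions is a stricter round-trip check than comparing scores. The sketch below does that and writes to a temporary directory so nothing is left on disk; the file name is hypothetical. Two caveats apply: joblib files are pickle-based, so load them only from sources you trust, and a model saved with one scikit-learn version is not guaranteed to load correctly under another.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_iris, y_iris = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_iris, y_iris)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.joblib')   # hypothetical file name
    joblib.dump(forest, model_path)
    restored = joblib.load(model_path)   # only load files from sources you trust

predictions_match = np.array_equal(forest.predict(X_iris), restored.predict(X_iris))
print('Restored model gives identical predictions:', predictions_match)
```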

---
## Part 7: Project - Compare Multiple Models on Titanic

**Let's apply everything we learned!**

In [None]:
# Load Titanic dataset
titanic = sns.load_dataset('titanic')

print('✅ Titanic dataset loaded!')
print(f'Shape: {titanic.shape}')
print('\nFirst few rows:')
titanic.head()

In [None]:
# Prepare data
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']
df = titanic[features + ['survived']].copy()

# Handle missing values (assign back: in-place fillna on a selected column is deprecated in recent pandas)
df['age'] = df['age'].fillna(df['age'].median())
df['fare'] = df['fare'].fillna(df['fare'].median())
df = df.dropna()

# Encode sex
df['sex'] = df['sex'].map({'male': 0, 'female': 1})

print('✅ Data preprocessed!')
print(f'Final shape: {df.shape}')
print(f'Survival rate: {df["survived"].mean():.2%}')
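
Filling missing values before splitting lets the median depend on rows that later land in the test set. One alternative is to put the imputation inside the pipeline so the median is learned from training rows only. A sketch assuming `SimpleImputer` (not used elsewhere in this notebook) and a tiny made-up dataset:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset: columns are [age, fare]; np.nan marks a missing age
X_train_toy = np.array([[22.0, 7.25], [np.nan, 71.3], [26.0, 7.9], [35.0, 53.1],
                        [np.nan, 8.05], [54.0, 51.9]])
y_train_toy = np.array([0, 1, 1, 1, 0, 0])
X_new_toy = np.array([[np.nan, 30.0]])

impute_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(max_iter=200)),
])
impute_pipeline.fit(X_train_toy, y_train_toy)

learned_medians = impute_pipeline.named_steps['imputer'].statistics_
new_prediction = impute_pipeline.predict(X_new_toy)
print('Medians learned from training rows:', learned_medians)
print('Prediction for a row with missing age:', new_prediction)
```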

In [None]:
# Split data
X = df[features].values
y = df['survived'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f'Training: {len(X_train)} samples')
print(f'Test: {len(X_test)} samples')

In [None]:
# Create 3 different pipelines
models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(max_iter=200))
    ]),
    
    'SVM': Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC())
    ]),
    
    'Random Forest': Pipeline([
        ('scaler', StandardScaler()),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
    ])
}

print('✅ Created 3 pipelines:')
for name in models.keys():
    print(f'  • {name}')

In [None]:
# Train and compare all models
results = []

for name, pipeline in models.items():
    # Train
    pipeline.fit(X_train, y_train)
    
    # Predict
    y_pred = pipeline.predict(X_test)
    
    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    
    results.append({
        'Model': name,
        'Accuracy': f"{accuracy:.3f}"
    })
    
    print(f'✅ {name}: {accuracy:.3f}')

# Display results
results_df = pd.DataFrame(results)
print('\n' + '='*40)
print('🏆 MODEL COMPARISON')
print('='*40)
print(results_df.to_string(index=False))
print('='*40)

In [None]:
# Find best model (highest test accuracy; ties go to the first model listed)
best_result = max(results, key=lambda r: float(r['Accuracy']))
best_model_name = best_result['Model']

print(f'⭐ Best Model: {best_model_name}')
print(f'   Accuracy: {float(best_result["Accuracy"]):.3f}')
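
Picking the winner by test accuracy makes that accuracy a slightly optimistic estimate, because the test set helped make the choice. One common alternative is to select with cross-validation on the training portion and touch the test set once at the end. A sketch under stated assumptions: synthetic data from `make_classification` stands in for Titanic so it runs offline, and the Random Forest is left unscaled since trees do not need scaling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data so the sketch runs without downloading Titanic
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

candidates = {
    'Logistic Regression': Pipeline([('scaler', StandardScaler()),
                                     ('lr', LogisticRegression(max_iter=200))]),
    'SVM': Pipeline([('scaler', StandardScaler()), ('svm', SVC())]),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Choose the model using cross-validation on the training portion only
cv_means = {name: cross_val_score(est, X_tr, y_tr, cv=5).mean()
            for name, est in candidates.items()}
best_name = max(cv_means, key=cv_means.get)

# Touch the test set once, for the chosen model
final_model = candidates[best_name].fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)
print(f'Selected by CV: {best_name} (mean CV accuracy {cv_means[best_name]:.3f})')
print(f'Held-out test accuracy: {test_accuracy:.3f}')
```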

In [None]:
# Confusion Matrix for best model
best_pipeline = models[best_model_name]
y_pred = best_pipeline.predict(X_test)

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Died', 'Survived'],
            yticklabels=['Died', 'Survived'])
plt.title(f'Confusion Matrix - {best_model_name}', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

print('\nClassification Report:')
print(classification_report(y_test, y_pred, target_names=['Died', 'Survived']))
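
The precision and recall columns in the classification report come straight from the confusion-matrix cells. A small sketch with made-up labels, treating "Survived" (1) as the positive class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 0 = Died, 1 = Survived
y_true_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# Rows are true labels, columns are predicted labels
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # of those predicted "Survived", the fraction who did
recall = tp / (tp + fn)      # of those who survived, the fraction the model found

matches_sklearn = (
    np.isclose(precision, precision_score(y_true_toy, y_pred_toy))
    and np.isclose(recall, recall_score(y_true_toy, y_pred_toy))
)

print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
print(f'Precision: {precision:.2f}, Recall: {recall:.2f}')
print('Matches sklearn metrics:', matches_sklearn)
```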

In [None]:
# Save the best model
joblib.dump(best_pipeline, 'titanic_best_model.pkl')
print(f'✅ Best model ({best_model_name}) saved!')
print('\n🎉 Project Complete!')

---
## 📚 Summary

### What We Learned:

**1. Scikit-Learn Workflow:**
- Import → Fit → Predict → Evaluate
- Consistent API for all algorithms

**2. Train-Test Split:**
- Never test on training data
- `test_size=0.2` holds out 20% for testing; add `stratify=y` to keep class proportions

**3. Feature Scaling:**
- `StandardScaler`: Mean=0, Std=1
- Critical for SVM, KNN, Neural Nets

**4. Pipelines:**
- Chain preprocessing + model
- Helps prevent data leakage (the scaler is fit on training data only)
- Cleaner code

**5. Cross-Validation:**
- More reliable than single split
- Use `cv=5` for 5-fold CV

**6. Model Persistence:**
- `joblib.dump()`: Save model
- `joblib.load()`: Load model

### 🎯 Key Takeaways:
- Prefer pipelines so preprocessing and model stay together
- Scale features for distance-based algorithms
- Cross-validation gives a more stable estimate than a single train-test split
- Compare multiple models to find the best
- Save trained models for reuse

---

**Great job! You've mastered Scikit-Learn basics! 🎉**