# 🎓 AI Bootcamp - Week 6 Day 1
## Logistic Regression: Binary Classification

### Today's Learning Goals:
- ✅ Understand classification vs regression
- ✅ Implement logistic regression with scikit-learn
- ✅ Work with probabilities and thresholds
- ✅ Build and interpret confusion matrices
- ✅ Calculate accuracy, precision, recall, F1-score
- ✅ Apply to Titanic survival prediction

---

**Let's start classifying! 🚀**

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, 
    recall_score, f1_score, classification_report, roc_curve, auc
)

np.random.seed(42)
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)
print('✅ Libraries loaded!')

## Part 1: Understanding the Sigmoid Function

The sigmoid function σ(z) = 1 / (1 + e^(-z)) maps any real number to a value between 0 and 1, which logistic regression interprets as the probability of class 1.

In [None]:
# Visualize the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 200)
sigma = sigmoid(z)

plt.figure(figsize=(10, 6))
plt.plot(z, sigma, 'b-', linewidth=2, label='Sigmoid σ(z)')
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.7, label='Threshold = 0.5')
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('z (input)', fontsize=12)
plt.ylabel('σ(z) (probability)', fontsize=12)
plt.title('Sigmoid Function: The Heart of Logistic Regression', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.legend()
plt.ylim(-0.1, 1.1)
plt.show()

print('Key observations:')
print(f'σ(-5) = {sigmoid(-5):.4f} (very unlikely)')
print(f'σ(0)  = {sigmoid(0):.4f} (50-50 chance)')
print(f'σ(5)  = {sigmoid(5):.4f} (very likely)')
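
The inverse of the sigmoid is the logit, or log-odds, which is the quantity a logistic regression models as a linear function of the inputs. The sketch below checks that round trip numerically; `logit` is a helper defined here for illustration and is not used elsewhere in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a value in (0, 1)."""
    return 1 / (1 + np.exp(-z))

def logit(p):
    """Illustrative helper: inverse of sigmoid, i.e. the log-odds of p."""
    return np.log(p / (1 - p))

z_demo = np.array([-2.0, 0.0, 3.0])
p_demo = sigmoid(z_demo)
z_recovered = logit(p_demo)

print('probabilities:', np.round(p_demo, 4))
print('recovered z:  ', np.round(z_recovered, 4))
# Symmetry: sigmoid(-z) equals 1 - sigmoid(z)
print('symmetric:', np.allclose(sigmoid(-z_demo), 1 - sigmoid(z_demo)))
```

A score of z = 0 corresponds to even odds, which is why a 0.5 probability threshold is equivalent to checking the sign of z.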

## Part 2: Simple Binary Classification Example

Let's create a simple dataset and train our first classifier!

In [None]:
# Generate simple binary classification data
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2,
    n_redundant=0, n_clusters_per_class=1, random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Visualize the data
plt.figure(figsize=(10, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], c='red', label='Class 0', alpha=0.6, s=50)
plt.scatter(X[y==1, 0], X[y==1, 1], c='blue', label='Class 1', alpha=0.6, s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Binary Classification Dataset', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f'Training samples: {len(X_train)}')
print(f'Test samples: {len(X_test)}')
print(f'Class distribution: {np.bincount(y)}')

In [None]:
# Train logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1

print('✅ Model trained!')
print(f'\nCoefficients: {model.coef_[0]}')
print(f'Intercept: {model.intercept_[0]:.4f}')
print(f'\nSample predictions (first 5):')
for i in range(5):
    print(f'  True: {y_test[i]}, Predicted: {y_pred[i]}, Probability: {y_pred_proba[i]:.3f}')
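
To see how the coefficients and intercept produce those probabilities, the sketch below recomputes them by hand as sigmoid(w · x + b) and compares the result with `predict_proba`. It refits a model on freshly generated data so the block runs on its own; the names `X_demo`, `y_demo`, and `clf` are chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2,
    n_redundant=0, n_clusters_per_class=1, random_state=42
)
clf = LogisticRegression(random_state=42).fit(X_demo, y_demo)

# Linear score z = w . x + b for every sample
z_scores = X_demo @ clf.coef_[0] + clf.intercept_[0]
manual_proba = 1 / (1 + np.exp(-z_scores))

sklearn_proba = clf.predict_proba(X_demo)[:, 1]
print('Max difference:', np.abs(manual_proba - sklearn_proba).max())

# A positive score means probability above 0.5, i.e. predicted class 1
manual_pred = (z_scores > 0).astype(int)
print('Predictions match:', np.array_equal(manual_pred, clf.predict(X_demo)))
```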

## Part 3: Confusion Matrix

The confusion matrix counts all four possible outcomes of a binary prediction: true negatives, false positives, false negatives, and true positives.

In [None]:
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

# Extract values
tn, fp, fn, tp = cm.ravel()
print('Confusion Matrix Breakdown:')
print(f'  True Negatives (TN):  {tn} - Correctly predicted class 0')
print(f'  False Positives (FP): {fp} - Incorrectly predicted class 1')
print(f'  False Negatives (FN): {fn} - Incorrectly predicted class 0')
print(f'  True Positives (TP):  {tp} - Correctly predicted class 1')
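
The four counts can also be derived directly with boolean masks, which makes the layout returned by `confusion_matrix(...).ravel()` easier to remember. The labels below are a small made-up example, not output from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration
y_true_demo = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_hat_demo = np.array([0, 1, 1, 0, 1, 0, 1, 0])

tn_demo = int(np.sum((y_true_demo == 0) & (y_hat_demo == 0)))
fp_demo = int(np.sum((y_true_demo == 0) & (y_hat_demo == 1)))
fn_demo = int(np.sum((y_true_demo == 1) & (y_hat_demo == 0)))
tp_demo = int(np.sum((y_true_demo == 1) & (y_hat_demo == 1)))

print('By hand:     ', (tn_demo, fp_demo, fn_demo, tp_demo))
# scikit-learn orders the flattened 2x2 matrix as TN, FP, FN, TP
print('scikit-learn:', tuple(confusion_matrix(y_true_demo, y_hat_demo).ravel()))
```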

## Part 4: Evaluation Metrics

Calculate all the key metrics!

In [None]:
# Calculate metrics manually
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print('Manual Calculation:')
print(f'  Accuracy:  {accuracy:.3f}')
print(f'  Precision: {precision:.3f}')
print(f'  Recall:    {recall:.3f}')
print(f'  F1-Score:  {f1:.3f}')

# Using scikit-learn functions
print('\nUsing scikit-learn:')
print(f'  Accuracy:  {accuracy_score(y_test, y_pred):.3f}')
print(f'  Precision: {precision_score(y_test, y_pred):.3f}')
print(f'  Recall:    {recall_score(y_test, y_pred):.3f}')
print(f'  F1-Score:  {f1_score(y_test, y_pred):.3f}')

# Full classification report
print('\nFull Classification Report:')
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1']))
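
Accuracy alone can hide a useless model when one class dominates. As a rough illustration, the sketch below scores a "classifier" that always predicts class 0 on a made-up 95/5 class split. The `zero_division=0` argument tells scikit-learn to report 0 instead of warning when no positives are predicted.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_imbalanced = np.array([0] * 95 + [1] * 5)
y_all_negative = np.zeros_like(y_imbalanced)  # always predict class 0

acc_demo = accuracy_score(y_imbalanced, y_all_negative)
prec_demo = precision_score(y_imbalanced, y_all_negative, zero_division=0)
rec_demo = recall_score(y_imbalanced, y_all_negative)
f1_demo = f1_score(y_imbalanced, y_all_negative, zero_division=0)

print(f'Accuracy:  {acc_demo:.3f}')
print(f'Precision: {prec_demo:.3f}')
print(f'Recall:    {rec_demo:.3f}')
print(f'F1-Score:  {f1_demo:.3f}')
```

The accuracy looks strong even though the model never finds a single positive, which is what recall and F1 expose.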

## Part 5: Experimenting with Thresholds

The default threshold is 0.5, but we can adjust it!

In [None]:
# Test different thresholds
thresholds = [0.3, 0.5, 0.7]
results = []

for threshold in thresholds:
    y_pred_custom = (y_pred_proba >= threshold).astype(int)
    acc = accuracy_score(y_test, y_pred_custom)
    prec = precision_score(y_test, y_pred_custom)
    rec = recall_score(y_test, y_pred_custom)
    f1_t = f1_score(y_test, y_pred_custom)  # separate name so the earlier f1 is not overwritten
    
    results.append({
        'Threshold': threshold,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1-Score': f1_t
    })

results_df = pd.DataFrame(results)
print('Effect of Different Thresholds:')
print(results_df.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, threshold in enumerate(thresholds):
    y_pred_custom = (y_pred_proba >= threshold).astype(int)
    cm_t = confusion_matrix(y_test, y_pred_custom)  # separate name so the earlier cm is not overwritten
    
    sns.heatmap(cm_t, annot=True, fmt='d', cmap='Blues', cbar=False, ax=axes[idx])
    axes[idx].set_title(f'Threshold = {threshold}')
    axes[idx].set_ylabel('True')
    axes[idx].set_xlabel('Predicted')

plt.tight_layout()
plt.show()

print('\nüìä Observation: Lower threshold ‚Üí More positives, higher recall')
print('üìä Observation: Higher threshold ‚Üí Fewer positives, higher precision')

## Part 6: ROC Curve and AUC

The ROC curve plots the true positive rate against the false positive rate across all thresholds!

In [None]:
# Calculate ROC curve
fpr, tpr, thresholds_roc = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'r--', linewidth=2, label='Random Classifier')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.title('ROC Curve', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f'AUC Score: {roc_auc:.3f}')
print('\nInterpretation:')
print('  AUC = 1.0: Perfect classifier')
print('  AUC = 0.5: Random classifier')
print('  AUC > 0.7: Often considered acceptable (rule of thumb; depends on the problem)')
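
AUC also has a ranking interpretation: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below checks this on made-up scores by counting correctly ordered pairs and comparing against `roc_auc_score`, a scikit-learn function not imported earlier in this notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_rank = rng.integers(0, 2, size=200)
# Hypothetical scores: positives are shifted upward, plus noise
scores = y_rank * 1.0 + rng.normal(0, 1, size=200)

pos_scores = scores[y_rank == 1]
neg_scores = scores[y_rank == 0]

# Compare every positive with every negative; ties count as half
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(f'Pairwise ranking AUC: {pairwise_auc:.4f}')
print(f'roc_auc_score:        {roc_auc_score(y_rank, scores):.4f}')
```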

## Part 7: TITANIC DATASET - Real World Application

Let's predict Titanic survival using logistic regression!

In [None]:
# Load Titanic data
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)

print(f'Dataset shape: {df.shape}')
print(f'\nFirst few rows:')
df.head()

In [None]:
# Data preprocessing
from sklearn.preprocessing import LabelEncoder

# Select features
df_clean = df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].copy()

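# Handle missing values (assign the result back: calling fillna with
# inplace=True on a selected column is deprecated in recent pandas
# and may leave df_clean unchanged)
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())
df_clean['Fare'] = df_clean['Fare'].fillna(df_clean['Fare'].median())

# Encode Sex (LabelEncoder sorts labels alphabetically: female=0, male=1)
df_clean['Sex'] = LabelEncoder().fit_transform(df_clean['Sex'])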

print('Clean dataset:')
df_clean.info()  # info() prints directly and returns None, so no print() wrapper
print(f'\nSurvival rate: {df_clean["Survived"].mean():.2%}')
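
A two-level column like `Sex` fits in a single 0/1 column, but columns with more levels or many missing values need a different treatment. As a rough sketch, the block below builds a missing-value indicator and one-hot columns on a tiny made-up frame. The column names `Cabin` and `Embarked` exist in the Titanic data, while the rows and the `has_cabin` name are invented for this illustration.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset
toy = pd.DataFrame({
    'Cabin': ['C85', None, 'E46', None],
    'Embarked': ['C', 'S', 'S', 'Q'],
})

# 1 if a cabin is recorded, 0 if it is missing
toy['has_cabin'] = toy['Cabin'].notna().astype(int)

# One-hot encode Embarked; drop_first avoids a redundant column
embarked_dummies = pd.get_dummies(
    toy['Embarked'], prefix='Embarked', drop_first=True, dtype=int
)

toy_encoded = pd.concat([toy[['has_cabin']], embarked_dummies], axis=1)
print(toy_encoded)
```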

In [None]:
# Prepare features and target
X_titanic = df_clean.drop('Survived', axis=1)
y_titanic = df_clean['Survived']

# Split data
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(
    X_titanic, y_titanic, test_size=0.2, random_state=42, stratify=y_titanic
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_t)
X_test_scaled = scaler.transform(X_test_t)

print(f'Training samples: {len(X_train_t)}')
print(f'Test samples: {len(X_test_t)}')
print(f'\nClass distribution:')
print(y_titanic.value_counts())
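
Fitting the scaler on the training split only, as above, keeps test-set statistics out of training. One way to make that automatic is to wrap the scaler and the model in a pipeline, so cross-validation refits the scaler inside each fold. The sketch below uses synthetic data so it runs on its own; `pipe` and `cv_scores` are names chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a tabular dataset with six numeric features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=42
)

# The scaler is refit on the training part of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')

print('Fold accuracies:', np.round(cv_scores, 3))
print(f'Mean accuracy: {cv_scores.mean():.3f}')
```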

In [None]:
# Train model on Titanic data
model_titanic = LogisticRegression(random_state=42, max_iter=1000)
model_titanic.fit(X_train_scaled, y_train_t)

# Predictions
y_pred_t = model_titanic.predict(X_test_scaled)
y_pred_proba_t = model_titanic.predict_proba(X_test_scaled)[:, 1]

print('✅ Titanic model trained!')
print(f'\nTest Accuracy: {accuracy_score(y_test_t, y_pred_t):.3f}')

# Feature importance (coefficient magnitudes are comparable because the inputs were standardized)
feature_importance = pd.DataFrame({
    'Feature': X_titanic.columns,
    'Coefficient': model_titanic.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False)

print('\nFeature Importance (by coefficient magnitude):')
print(feature_importance)
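
Coefficients are easier to read as odds ratios: raising a feature by one unit multiplies the odds of class 1 by exp(coefficient), with the other features held fixed. The sketch below verifies this on synthetic data; `clf` and the helper `odds` are defined here for illustration. With standardized inputs, as in the Titanic model, "one unit" means one standard deviation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3,
    n_redundant=0, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

odds_ratios = np.exp(clf.coef_[0])
print('Odds ratios:', np.round(odds_ratios, 3))

def odds(x):
    """Illustrative helper: odds p / (1 - p) of class 1 for one sample."""
    p = clf.predict_proba(x.reshape(1, -1))[0, 1]
    return p / (1 - p)

# Increase only the first feature by one unit and compare the odds
x_base = np.zeros(3)
x_plus = x_base.copy()
x_plus[0] += 1.0
observed_ratio = odds(x_plus) / odds(x_base)

print(f'Observed odds ratio for feature 0: {observed_ratio:.3f}')
```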

In [None]:
# Confusion matrix for Titanic
cm_titanic = confusion_matrix(y_test_t, y_pred_t)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_titanic, annot=True, fmt='d', cmap='RdYlGn',
            xticklabels=['Did Not Survive', 'Survived'],
            yticklabels=['Did Not Survive', 'Survived'])
plt.title('Titanic Survival Prediction - Confusion Matrix', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

# Metrics
print('\nTitanic Model Performance:')
print(classification_report(y_test_t, y_pred_t, 
                          target_names=['Did Not Survive', 'Survived']))

In [None]:
# Sample predictions
sample_passengers = X_test_t.head(10).copy()
sample_actual = y_test_t.head(10).values
sample_pred = y_pred_t[:10]
sample_proba = y_pred_proba_t[:10]

print('Sample Predictions:')
print('=' * 80)
for i in range(10):
    actual = 'Survived' if sample_actual[i] == 1 else 'Did Not Survive'
    predicted = 'Survived' if sample_pred[i] == 1 else 'Did Not Survive'
    prob = sample_proba[i]
    correct = '✅' if sample_actual[i] == sample_pred[i] else '❌'
    
    print(f'{correct} Actual: {actual:20} | Predicted: {predicted:20} | Prob: {prob:.2%}')

## 🎯 Your Challenge

Try these exercises:
1. Add more features to the Titanic model (Embarked, a has_cabin flag derived from Cabin, etc.)
2. Try different thresholds and see the effect on precision/recall
3. Compare performance with/without feature scaling
4. Build a logistic regression from scratch using gradient descent
5. Try multi-class logistic regression on a different dataset

## 📚 Summary

Today you learned:
- ✅ Classification predicts categories; regression predicts continuous numbers
- ✅ Sigmoid function converts any real number to a probability between 0 and 1
- ✅ Decision boundary separates classes (default threshold = 0.5)
- ✅ Confusion matrix shows all four outcomes (TP, TN, FP, FN)
- ✅ Accuracy: overall correctness
- ✅ Precision: of the predicted positives, how many are correct (penalizes false positives)
- ✅ Recall: of the actual positives, how many were found (penalizes false negatives)
- ✅ F1-Score: harmonic mean of precision and recall
- ✅ ROC/AUC: performance across all thresholds

**Key Takeaways:**
- Choose metrics based on the business problem
- Accuracy can be misleading with imbalanced data
- Threshold tuning is powerful
- Always visualize the confusion matrix
- Feature engineering matters!

**Tomorrow:** Support Vector Machines! 🚀