# Coding Assignment 2: Classification Models

**Name:** [Your Name Here]  
**Student ID:** [Your Student ID]  
**Date:** [Today's Date]  

## Overview

Welcome to your second machine learning assignment! In this notebook, you'll explore classification algorithms by implementing three fundamental approaches: K-Nearest Neighbors, Logistic Regression, and Naive Bayes. You'll apply these to predict Titanic passenger survival using real historical data.

**Learning Goals:**
- Understand classification vs regression
- Implement and compare classification algorithms
- Apply data preprocessing for classification tasks
- Interpret model results and performance metrics
- Reflect on algorithm strengths and weaknesses

**Estimated Time:** About 3 hours (the per-part estimates below add up to 170 minutes)

## Part 1: Mathematical Foundation (30 minutes)

Before implementing algorithms, let's understand the mathematics behind classification.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

### 1.1 Understanding Classification

**Classification** predicts discrete categories (classes), unlike regression, which predicts continuous values.

**Key Concepts:**
- **Decision Boundary**: Line/surface separating different classes
- **Probability**: Many classifiers output probabilities for each class
- **Binary vs Multiclass**: Two classes vs multiple classes

**Examples:**
- Email spam detection (spam/not spam)
- Medical diagnosis (disease/healthy)
- Image recognition (cat/dog/bird)
- **Our task**: Titanic survival (survived/died)

In [None]:
# Let's create a simple visualization of classification vs regression
np.random.seed(42)

# Generate sample data
X_sample = np.random.rand(100, 1) * 10
y_regression = 2 * X_sample.flatten() + np.random.normal(0, 1, 100)
y_classification = (X_sample.flatten() + np.random.normal(0, 1, 100) > 5).astype(int)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Regression plot
ax1.scatter(X_sample, y_regression, alpha=0.6, color='blue')
ax1.set_xlabel('X (Feature)')
ax1.set_ylabel('y (Continuous Target)')
ax1.set_title('Regression: Predicting Continuous Values')
ax1.grid(True, alpha=0.3)

# Classification plot
colors = ['red' if y == 0 else 'green' for y in y_classification]
ax2.scatter(X_sample, y_classification, alpha=0.6, c=colors)
ax2.set_xlabel('X (Feature)')
ax2.set_ylabel('y (Class: 0 or 1)')
ax2.set_title('Classification: Predicting Categories')
ax2.set_yticks([0, 1])
ax2.set_yticklabels(['Class 0', 'Class 1'])
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key Differences:")
print("Regression: Predicts continuous values (prices, temperatures, etc.)")
print("Classification: Predicts discrete categories (classes, labels, etc.)")

### 1.2 Distance Metrics for K-Nearest Neighbors

KNN classifies points based on the class of their nearest neighbors. The key is measuring **distance**.

**Euclidean Distance** between two points $A = (x_1, y_1)$ and $B = (x_2, y_2)$ in the plane:
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

For n-dimensional space:
$$d = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}$$
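
To make the role of distance concrete, here is a minimal sketch of the voting step that KNN performs, written with only the standard library. The toy points, their labels, and the helper name `knn_predict` are made up for illustration; `math.dist` computes the n-dimensional Euclidean distance given by the formula above.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with that point's label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors (smallest distances first)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
toy_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
toy_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_points, toy_labels, query=(2.0, 2.0), k=3))
print(knn_predict(toy_points, toy_labels, query=(6.0, 6.5), k=3))
```

Because the vote depends only on distances, a feature with a large numeric range (such as fare) can dominate the result unless the features are scaled first.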

In [None]:
def euclidean_distance(point1, point2):
    """
    Calculate Euclidean distance between two points.
    
    Parameters:
    point1 (array): First point coordinates
    point2 (array): Second point coordinates
    
    Returns:
    distance (float): Euclidean distance
    """
    # TODO: Calculate the Euclidean distance
    # Hint: Use np.sqrt() and np.sum() with squared differences
    distance = None  # Replace None with your code
    
    return distance

# Test the distance function
# TODO: Uncomment the lines below after implementing the function
# point_a = np.array([1, 2])
# point_b = np.array([4, 6])
# test_distance = euclidean_distance(point_a, point_b)
# print(f"Distance between {point_a} and {point_b}: {test_distance:.2f}")
# print(f"Expected: 5.00 (3² + 4² = 9 + 16 = 25, √25 = 5)")

print("TODO: Implement the euclidean_distance function above")

### 1.3 Logistic Regression Mathematics

Unlike linear regression, logistic regression predicts probabilities using the **sigmoid function**.

**Sigmoid Function:**
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where $z = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

**Key Properties:**
- Output is always between 0 and 1 (perfect for probabilities)
- S-shaped curve
- Each coefficient $\beta_i$ is the change in the log-odds of the positive class for a one-unit increase in $x_i$
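
The log-odds property can be checked numerically. The sketch below uses a scalar, standard-library version of the sigmoid and two made-up numbers (`z` and `beta` are assumptions for illustration, not fitted values) to show that raising the linear term by `beta` multiplies the odds by `exp(beta)`.

```python
import math

def logistic(z):
    """Scalar version of the sigmoid formula shown above."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical values chosen only for illustration
z = -0.3      # linear combination before the change
beta = 0.7    # coefficient of a feature that increases by one unit

p_before = logistic(z)
p_after = logistic(z + beta)
odds_ratio = odds(p_after) / odds(p_before)

print(f"Probability: {p_before:.3f} -> {p_after:.3f}")
print(f"Odds ratio: {odds_ratio:.4f}, exp(beta): {math.exp(beta):.4f}")
```

The change in probability depends on where `z` starts, but the odds ratio does not, which is why coefficients are usually read on the odds scale.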

In [None]:
def sigmoid(z):
    """
    Sigmoid activation function.
    
    Parameters:
    z (array): Input values
    
    Returns:
    sigmoid_z (array): Sigmoid of input
    """
    # TODO: Implement the sigmoid function: 1 / (1 + exp(-z))
    # Hint: Use np.exp() for exponential
    sigmoid_z = None  # Replace None with your code
    
    return sigmoid_z

# TODO: Uncomment the visualization code below after implementing sigmoid
# z_values = np.linspace(-10, 10, 100)
# sigmoid_values = sigmoid(z_values)

# plt.figure(figsize=(10, 6))
# plt.plot(z_values, sigmoid_values, 'b-', linewidth=2, label='Sigmoid Function')
# plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.7, label='Decision Boundary (0.5)')
# plt.axvline(x=0, color='r', linestyle='--', alpha=0.7)
# plt.xlabel('z (Linear Combination)')
# plt.ylabel('σ(z) (Probability)')
# plt.title('Sigmoid Function: Converting Linear Output to Probabilities')
# plt.grid(True, alpha=0.3)
# plt.legend()
# plt.ylim(-0.1, 1.1)
# plt.show()

# print("Key Points:")
# print(f"σ(-∞) ≈ {sigmoid(-100):.6f} (approaches 0)")
# print(f"σ(0) = {sigmoid(0):.6f} (exactly 0.5)")
# print(f"σ(+∞) ≈ {sigmoid(100):.6f} (approaches 1)")

print("TODO: Implement the sigmoid function above")

### 1.4 Naive Bayes Foundation

Naive Bayes applies **Bayes' Theorem** with a "naive" assumption of feature independence.

**Bayes' Theorem:**
$$P(Class|Features) = \frac{P(Features|Class) \times P(Class)}{P(Features)}$$

**Naive Assumption:**
All features are independent, so:
$$P(x_1, x_2, ..., x_n|Class) = P(x_1|Class) \times P(x_2|Class) \times ... \times P(x_n|Class)$$

**Example**: For Titanic survival, $P(Features)$ is the same for every class, so classes can be compared using only the numerator:
- P(Survived | Age=25, Class=1st, Gender=Female)
- ∝ P(Age=25 | Survived) × P(Class=1st | Survived) × P(Gender=Female | Survived) × P(Survived)
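
As a worked illustration of the example above, the following sketch multiplies a prior by per-feature likelihoods for each class and then normalizes so the two posteriors sum to 1. Every probability in it is a made-up number, and the function name `naive_bayes_posterior` is hypothetical; none of it is estimated from the Titanic data.

```python
# All probabilities below are made-up illustrative numbers, not estimates from the data
priors = {"survived": 0.38, "died": 0.62}
likelihoods = {
    "survived": {"age=25": 0.03, "class=1st": 0.40, "sex=female": 0.68},
    "died": {"age=25": 0.03, "class=1st": 0.15, "sex=female": 0.15},
}

def naive_bayes_posterior(priors, likelihoods, observed):
    """Return normalized class posteriors under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in observed:
            # Independence assumption: multiply the per-feature likelihoods
            score *= likelihoods[cls][feature]
        scores[cls] = score
    # The sum of the class scores plays the role of P(Features)
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

posterior = naive_bayes_posterior(
    priors, likelihoods, ["age=25", "class=1st", "sex=female"]
)
for cls, prob in posterior.items():
    print(f"P({cls} | features) = {prob:.3f}")
```

With many features the product of small probabilities can underflow, so implementations typically add log-probabilities instead of multiplying.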

## Part 2: Dataset & Exploration (20 minutes)

Let's load and explore the famous Titanic dataset to understand our classification task.

In [None]:
# Load the Titanic dataset
try:
    # Try loading from seaborn first
    titanic = sns.load_dataset('titanic')
    print("Loaded Titanic dataset from seaborn")
except Exception:
    # Fallback: create a sample dataset if seaborn data is not available
    print("Creating sample Titanic dataset...")
    np.random.seed(42)
    n_samples = 891
    
    # Create synthetic Titanic-like data
    titanic = pd.DataFrame({
        'survived': np.random.choice([0, 1], n_samples, p=[0.62, 0.38]),
        'pclass': np.random.choice([1, 2, 3], n_samples, p=[0.24, 0.21, 0.55]),
        'sex': np.random.choice(['male', 'female'], n_samples, p=[0.65, 0.35]),
        'age': np.random.normal(30, 12, n_samples).clip(0.5, 80),
        'sibsp': np.random.poisson(0.5, n_samples).clip(0, 8),
        'parch': np.random.poisson(0.4, n_samples).clip(0, 6),
        'fare': np.random.lognormal(3, 1, n_samples).clip(0, 500),
        'embarked': np.random.choice(['S', 'C', 'Q'], n_samples, p=[0.72, 0.19, 0.09])
    })
    
    # Add some missing values to make it realistic
    titanic.loc[np.random.choice(titanic.index, 177, replace=False), 'age'] = np.nan
    titanic.loc[np.random.choice(titanic.index, 2, replace=False), 'embarked'] = np.nan

print(f"\nDataset shape: {titanic.shape}")
print(f"\nFirst few rows:")
print(titanic.head())

print(f"\nDataset info:")
titanic.info()

### 2.1 Understanding Our Target Variable

In [None]:
# Analyze the target variable: survival
print("Survival Statistics:")
survival_counts = titanic['survived'].value_counts().sort_index()
print(survival_counts)
print(f"\nSurvival Rate: {titanic['survived'].mean():.3f} ({titanic['survived'].mean()*100:.1f}%)")

# Visualize survival distribution
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
survival_counts.plot(kind='bar', color=['red', 'green'], alpha=0.7)
plt.title('Survival Distribution')
plt.xlabel('Survived (0=No, 1=Yes)')
plt.ylabel('Count')
plt.xticks([0, 1], ['Died', 'Survived'], rotation=0)

plt.subplot(1, 3, 2)
# plt.pie has no alpha argument; transparency is passed through wedgeprops
plt.pie(survival_counts.values, labels=['Died', 'Survived'], colors=['red', 'green'], wedgeprops={'alpha': 0.7}, autopct='%1.1f%%')
plt.title('Survival Proportions')

plt.subplot(1, 3, 3)
# Survival by passenger class
survival_by_class = titanic.groupby('pclass')['survived'].mean()
survival_by_class.plot(kind='bar', color='skyblue', alpha=0.7)
plt.title('Survival Rate by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.xticks(rotation=0)

plt.tight_layout()
plt.show()

print(f"\nSurvival by Class:")
for pclass in [1, 2, 3]:
    rate = survival_by_class[pclass]
    print(f"Class {pclass}: {rate:.3f} ({rate*100:.1f}%)")

### 2.2 Feature Exploration

In [None]:
# Explore key relationships
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Survival by Gender
axes[0, 0].set_title('Survival by Gender')
survival_gender = pd.crosstab(titanic['sex'], titanic['survived'])
survival_gender.div(survival_gender.sum(axis=1), axis=0).plot(kind='bar', ax=axes[0, 0], color=['red', 'green'])
axes[0, 0].set_ylabel('Proportion')
axes[0, 0].legend(['Died', 'Survived'])
axes[0, 0].tick_params(axis='x', rotation=0)

# Survival by Age
axes[0, 1].set_title('Age Distribution by Survival')
titanic[titanic['survived']==0]['age'].hist(alpha=0.7, color='red', label='Died', bins=20, ax=axes[0, 1])
titanic[titanic['survived']==1]['age'].hist(alpha=0.7, color='green', label='Survived', bins=20, ax=axes[0, 1])
axes[0, 1].set_xlabel('Age')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()

# Survival by Fare
axes[1, 0].set_title('Fare Distribution by Survival')
titanic[titanic['survived']==0]['fare'].hist(alpha=0.7, color='red', label='Died', bins=30, ax=axes[1, 0])
titanic[titanic['survived']==1]['fare'].hist(alpha=0.7, color='green', label='Survived', bins=30, ax=axes[1, 0])
axes[1, 0].set_xlabel('Fare')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_xlim(0, 200)  # Limit x-axis for better visualization
axes[1, 0].legend()

# Family size vs Survival
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1
family_survival = titanic.groupby('family_size')['survived'].mean()
axes[1, 1].bar(family_survival.index, family_survival.values, alpha=0.7, color='skyblue')
axes[1, 1].set_title('Survival Rate by Family Size')
axes[1, 1].set_xlabel('Family Size')
axes[1, 1].set_ylabel('Survival Rate')

plt.tight_layout()
plt.show()

# Print some insights
print("Key Insights:")
print(f"Female survival rate: {titanic[titanic['sex']=='female']['survived'].mean():.3f}")
print(f"Male survival rate: {titanic[titanic['sex']=='male']['survived'].mean():.3f}")
print(f"Average age of survivors: {titanic[titanic['survived']==1]['age'].mean():.1f}")
print(f"Average age of non-survivors: {titanic[titanic['survived']==0]['age'].mean():.1f}")

## Part 3: Data Preprocessing (15 minutes)

Before training models, we need to prepare our data.

### 3.1 Handle Missing Values and Feature Selection

In [None]:
# Check for missing values
print("Missing values by column:")
missing_values = titanic.isnull().sum()
print(missing_values[missing_values > 0])

# Create a working copy
df = titanic.copy()

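# Handle missing age values by filling with the median
# (assign back rather than calling fillna(inplace=True) on a column, which newer pandas versions warn about)
df['age'] = df['age'].fillna(df['age'].median())

# Handle missing embarked values by filling with the mode (most common value)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])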

# Select features for our models
# We'll use: pclass, sex, age, sibsp, parch, fare, embarked
feature_columns = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
target_column = 'survived'

print(f"\nSelected features: {feature_columns}")
print(f"Target variable: {target_column}")
print(f"\nMissing values remaining after preprocessing: {df[feature_columns + [target_column]].isnull().sum().sum()}")

### 3.2 Encode Categorical Variables

In [None]:
# Create features and target arrays
X = df[feature_columns].copy()
y = df[target_column].copy()

print("Before encoding:")
print(X.dtypes)
print(f"\nSample of categorical data:")
print(X[['sex', 'embarked']].head())

# TODO: Encode the 'sex' column (male=0, female=1)
# Hint: You can use pd.get_dummies() or manual mapping
X['sex_encoded'] = None  # Replace None with your code to encode sex

# TODO: Encode the 'embarked' column using one-hot encoding
# Hint: Use pd.get_dummies(X['embarked'], prefix='embarked')
embarked_encoded = None  # Replace None with your code

# Add encoded columns and remove original categorical columns
# TODO: Uncomment these lines after implementing the encoding above
# X = pd.concat([X, embarked_encoded], axis=1)
# X = X.drop(['sex', 'embarked'], axis=1)

print("\nTODO: Implement the encoding steps above")
# TODO: Uncomment the lines below after implementing encoding
# print(f"\nAfter encoding:")
# print(f"Shape: {X.shape}")
# print(f"Columns: {list(X.columns)}")
# print(f"\nFirst few rows:")
# print(X.head())

### 3.3 Train-Test Split and Feature Scaling
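
Scaling matters because distance-based models such as KNN are dominated by whichever feature has the largest numeric range. The sketch below standardizes a single feature by hand: the mean and standard deviation are computed from the training values only and then reused for the test values, which is the same fit-then-transform pattern the cell below asks for. The fare values and helper names are hypothetical.

```python
import statistics

def fit_standardizer(train_values):
    """Learn the mean and (population) standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics to any set of values."""
    return [(value - mean) / std for value in values]

# Hypothetical fares, split into training and test portions
train_fares = [7.25, 71.28, 7.92, 53.10, 8.05]
test_fares = [8.46, 51.86]

mean, std = fit_standardizer(train_fares)
train_scaled = standardize(train_fares, mean, std)
test_scaled = standardize(test_fares, mean, std)  # reuses the training statistics

print(f"Train mean after scaling: {statistics.fmean(train_scaled):.6f}")
print(f"Train std after scaling: {statistics.pstdev(train_scaled):.6f}")
print(f"Scaled test fares: {[round(v, 3) for v in test_scaled]}")
```

Computing the statistics on the training set alone keeps information about the test set from leaking into the model.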

In [None]:
# TODO: After implementing encoding above, uncomment this entire cell

# # Split the data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42, stratify=y
# )

# print(f"Training set: {X_train.shape[0]} samples")
# print(f"Test set: {X_test.shape[0]} samples")
# print(f"Training survival rate: {y_train.mean():.3f}")
# print(f"Test survival rate: {y_test.mean():.3f}")

# # TODO: Apply feature scaling using StandardScaler
# # Initialize the scaler
# scaler = None  # Replace None with StandardScaler()

# # TODO: Fit the scaler on training data and transform both train and test
# # Hint: Use fit_transform for training data, transform for test data
# X_train_scaled = None  # Replace None with your code
# X_test_scaled = None   # Replace None with your code

# print(f"\nOriginal feature ranges:")
# print(f"Age: {X_train['age'].min():.1f} to {X_train['age'].max():.1f}")
# print(f"Fare: {X_train['fare'].min():.1f} to {X_train['fare'].max():.1f}")

# # TODO: Uncomment after implementing scaling
# # print(f"\nAfter scaling:")
# # print(f"Mean: {X_train_scaled.mean():.6f}")
# # print(f"Std: {X_train_scaled.std():.6f}")

print("TODO: Implement encoding first, then uncomment this cell")

## Part 4: Model Implementations (60 minutes)

Now let's implement and compare three classification algorithms!

### 4.1 K-Nearest Neighbors (KNN)

In [None]:
# TODO: Uncomment and complete this section after preprocessing is done

# print("=== K-Nearest Neighbors Classification ===")

# # TODO: Create a KNN classifier with k=5
# # Hint: Use KNeighborsClassifier from sklearn
# knn_model = None  # Replace None with KNeighborsClassifier(n_neighbors=?)

# # TODO: Train the model
# # Hint: Use the fit method with scaled training data
# # knn_model.fit(?, ?)

# # TODO: Make predictions on test set
# y_pred_knn = None  # Replace None with predictions

# # TODO: Calculate accuracy
# knn_accuracy = None  # Replace None with accuracy calculation

# print(f"KNN Accuracy (k=5): {knn_accuracy:.4f}")

# # Test different values of k
# print(f"\nTesting different k values:")
# k_values = [1, 3, 5, 7, 10, 15, 20]
# k_accuracies = []

# for k in k_values:
#     # Create, train, and evaluate a KNN model for each k (provided for you)
#     knn_temp = KNeighborsClassifier(n_neighbors=k)
#     knn_temp.fit(X_train_scaled, y_train)
#     y_pred_temp = knn_temp.predict(X_test_scaled)
#     accuracy = accuracy_score(y_test, y_pred_temp)
#     k_accuracies.append(accuracy)
#     print(f"k={k}: {accuracy:.4f}")

# # Plot k vs accuracy
# plt.figure(figsize=(10, 6))
# plt.plot(k_values, k_accuracies, 'bo-', linewidth=2, markersize=8)
# plt.xlabel('k (Number of Neighbors)')
# plt.ylabel('Accuracy')
# plt.title('KNN Performance vs k Value')
# plt.grid(True, alpha=0.3)
# plt.xticks(k_values)
# plt.show()

# best_k = k_values[np.argmax(k_accuracies)]
# print(f"\nBest k value: {best_k} with accuracy: {max(k_accuracies):.4f}")

print("TODO: Complete preprocessing first, then implement KNN")

### 4.2 Logistic Regression

In [None]:
# TODO: Uncomment and complete this section after preprocessing is done

# print("=== Logistic Regression Classification ===")

# # TODO: Create a Logistic Regression classifier
# # Hint: Use LogisticRegression from sklearn with random_state=42
# lr_model = None  # Replace None with LogisticRegression

# # TODO: Train the model
# # lr_model.fit(?, ?)

# # TODO: Make predictions
# y_pred_lr = None  # Replace None with predictions
# y_proba_lr = None  # Replace None with predict_proba for probabilities

# # TODO: Calculate accuracy
# lr_accuracy = None  # Replace None with accuracy calculation

# print(f"Logistic Regression Accuracy: {lr_accuracy:.4f}")

# # Analyze coefficients
# print(f"\nLogistic Regression Coefficients:")
# feature_names = X.columns
# coefficients = lr_model.coef_[0]

# # TODO: Sort features by absolute coefficient value and print
# coef_df = pd.DataFrame({
#     'Feature': feature_names,
#     'Coefficient': coefficients,
#     'Abs_Coefficient': np.abs(coefficients)
# })
# coef_df = coef_df.sort_values('Abs_Coefficient', ascending=False)

# print(coef_df)

# # Visualize coefficients
# plt.figure(figsize=(12, 6))
# colors = ['red' if x < 0 else 'green' for x in coef_df['Coefficient']]
# plt.barh(coef_df['Feature'], coef_df['Coefficient'], color=colors, alpha=0.7)
# plt.xlabel('Coefficient Value')
# plt.title('Logistic Regression Coefficients\n(Positive = Increases Survival Probability)')
# plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)
# plt.grid(True, alpha=0.3, axis='x')
# plt.tight_layout()
# plt.show()

print("TODO: Complete preprocessing first, then implement Logistic Regression")

### 4.3 Naive Bayes

In [None]:
# TODO: Uncomment and complete this section after preprocessing is done

# print("=== Naive Bayes Classification ===")

# # TODO: Create a Gaussian Naive Bayes classifier
# # Hint: Use GaussianNB from sklearn
# nb_model = None  # Replace None with GaussianNB()

# # TODO: Train the model
# # Note: Naive Bayes can work with or without scaling, but let's use scaled data for consistency
# # nb_model.fit(?, ?)

# # TODO: Make predictions
# y_pred_nb = None  # Replace None with predictions
# y_proba_nb = None  # Replace None with predict_proba for probabilities

# # TODO: Calculate accuracy
# nb_accuracy = None  # Replace None with accuracy calculation

# print(f"Naive Bayes Accuracy: {nb_accuracy:.4f}")

# # Show some example predictions with probabilities
# print(f"\nExample Predictions (first 10 test samples):")
# print(f"{'Actual':<8} {'Predicted':<10} {'Prob_Died':<12} {'Prob_Survived':<12} {'Correct':<8}")
# print("-" * 60)

# for i in range(min(10, len(y_test))):
#     actual = y_test.iloc[i]
#     predicted = y_pred_nb[i]
#     prob_died = y_proba_nb[i][0]
#     prob_survived = y_proba_nb[i][1]
#     correct = "✓" if actual == predicted else "✗"
    
#     print(f"{actual:<8} {predicted:<10} {prob_died:<12.3f} {prob_survived:<12.3f} {correct:<8}")

print("TODO: Complete preprocessing first, then implement Naive Bayes")

## Part 5: Model Comparison (20 minutes)

Let's compare all three models systematically.
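
Accuracy alone hides which kind of mistake a model makes, so it helps to see how the common metrics come out of the four confusion-matrix counts. The sketch below computes them by hand on made-up labels; the toy lists and the helper name `confusion_counts` are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted survived, actually survived
            else:
                fp += 1   # predicted survived, actually died
        else:
            if actual == positive:
                fn += 1   # predicted died, actually survived
            else:
                tn += 1   # predicted died, actually died
    return tp, fp, tn, fn

# Hypothetical labels: 1 = survived, 0 = died
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true_toy, y_pred_toy)
accuracy = (tp + tn) / len(y_true_toy)
precision = tp / (tp + fp)   # of predicted survivors, how many survived
recall = tp / (tp + fn)      # of actual survivors, how many were found

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} Recall={recall:.2f}")
```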

In [None]:
# TODO: Uncomment and complete this section after implementing all models

# print("=== Model Comparison ===")

# # Collect all results
# models = {
#     'K-Nearest Neighbors': {
#         'model': None,  # TODO: Use your best KNN model
#         'predictions': None,  # TODO: Add KNN predictions
#         'accuracy': None  # TODO: Add KNN accuracy
#     },
#     'Logistic Regression': {
#         'model': None,  # TODO: Add your LR model
#         'predictions': None,  # TODO: Add LR predictions  
#         'accuracy': None  # TODO: Add LR accuracy
#     },
#     'Naive Bayes': {
#         'model': None,  # TODO: Add your NB model
#         'predictions': None,  # TODO: Add NB predictions
#         'accuracy': None  # TODO: Add NB accuracy
#     }
# }

# # TODO: Create a comparison table
# comparison_df = pd.DataFrame({
#     'Model': list(models.keys()),
#     'Accuracy': [models[name]['accuracy'] for name in models]
# })
# comparison_df = comparison_df.sort_values('Accuracy', ascending=False)
# print(comparison_df)

# # Visualize model comparison
# plt.figure(figsize=(12, 8))

# # Subplot 1: Accuracy comparison
# plt.subplot(2, 2, 1)
# colors = ['gold', 'silver', '#cd7f32'][:len(comparison_df)]  # '#cd7f32' is bronze; 'bronze' is not a named Matplotlib color
# bars = plt.bar(comparison_df['Model'], comparison_df['Accuracy'], color=colors, alpha=0.7)
# plt.title('Model Accuracy Comparison')
# plt.ylabel('Accuracy')
# plt.xticks(rotation=15)
# plt.ylim(0.7, 0.9)  # Focus on the relevant range

# # Add accuracy values on top of bars
# for bar, acc in zip(bars, comparison_df['Accuracy']):
#     plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, 
#              f'{acc:.3f}', ha='center', va='bottom', fontweight='bold')

# # TODO: Add confusion matrices for each model (subplots 2, 3, 4)
# for i, (model_name, model_data) in enumerate(models.items(), 2):
#     plt.subplot(2, 2, i)
#     cm = confusion_matrix(y_test, model_data['predictions'])
#     sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
#                 xticklabels=['Died', 'Survived'], 
#                 yticklabels=['Died', 'Survived'])
#     plt.title(f'{model_name}\nConfusion Matrix')
#     plt.xlabel('Predicted')
#     plt.ylabel('Actual')

# plt.tight_layout()
# plt.show()

# # Detailed classification reports
# print("\n=== Detailed Classification Reports ===")
# for model_name, model_data in models.items():
#     print(f"\n{model_name}:")
#     print(classification_report(y_test, model_data['predictions'], 
#                               target_names=['Died', 'Survived']))

print("TODO: Complete all model implementations first")

### 5.1 Cross-Validation Analysis (Optional Bonus)
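
Cross-validation repeats the train-and-evaluate cycle on several different splits of the training data and averages the scores, which gives a steadier estimate than a single split. The sketch below shows only the splitting step for 5 folds, without shuffling; the function name `k_fold_indices` and the sample count are hypothetical. scikit-learn's `cross_val_score` additionally stratifies the folds for classifiers, which this sketch does not do.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) for each fold, without shuffling."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=12, n_folds=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: validate on {val_idx}, train on {len(train_idx)} samples")
```

Each sample is used for validation exactly once and for training in the remaining folds.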

In [None]:
# TODO: Bonus section - implement cross-validation
# Uncomment after completing main implementations

# print("=== Cross-Validation Analysis ===")

# # TODO: Perform 5-fold cross-validation for each model
# cv_results = {}

# for model_name, model_data in models.items():
#     model = model_data['model']
#     
#     # TODO: Use cross_val_score with cv=5
#     cv_scores = None  # Replace with cross_val_score(model, X_train_scaled, y_train, cv=5)
#     
#     cv_results[model_name] = {
#         'scores': cv_scores,
#         'mean': cv_scores.mean(),
#         'std': cv_scores.std()
#     }
#     
#     print(f"{model_name}:")
#     print(f"  Mean CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
#     print(f"  Individual Scores: {cv_scores}")

# # Visualize cross-validation results
# cv_means = [cv_results[model]['mean'] for model in cv_results.keys()]
# cv_stds = [cv_results[model]['std'] for model in cv_results.keys()]

# plt.figure(figsize=(10, 6))
# plt.bar(list(cv_results.keys()), cv_means, yerr=cv_stds, capsize=10, alpha=0.7)
# plt.title('Cross-Validation Results\n(Mean ± Standard Deviation)')
# plt.ylabel('Accuracy')
# plt.xticks(rotation=15)
# plt.grid(True, alpha=0.3, axis='y')
# plt.show()

print("TODO: Bonus - implement after completing main models")

## Part 6: Analysis Questions (15 minutes)

Now let's analyze your results and reflect on what you've learned.

### 6.1 Model Performance Analysis

**TODO: Answer these questions based on your results:**

**1. Which model performed best on the Titanic survival prediction? Why might this be the case?**

[TODO: Write your answer here. Consider the nature of the data, feature relationships, and algorithm characteristics]

**2. How does changing k in KNN affect the results? What did you observe when testing k=1, 5, 10, 20?**

[TODO: Write your observations here. Discuss overfitting vs underfitting, bias-variance tradeoff]

**3. What do the logistic regression coefficients tell you about the factors affecting Titanic survival?**

[TODO: Analyze the coefficients. Which features have positive vs negative coefficients? What does this mean for survival probability?]

**4. Compare the confusion matrices. Which model makes fewer false positives vs false negatives?**

[TODO: Analyze the confusion matrices. Discuss the trade-offs between different types of errors]

### 6.2 Algorithm Characteristics

**5. When would you choose each algorithm in practice?**

**K-Nearest Neighbors:**
[TODO: Describe scenarios where KNN would be preferred]

**Logistic Regression:**
[TODO: Describe scenarios where Logistic Regression would be preferred]

**Naive Bayes:**
[TODO: Describe scenarios where Naive Bayes would be preferred]

**6. What assumptions does each algorithm make? How might these affect performance?**

[TODO: Discuss the key assumptions of each algorithm and their implications]

## Part 7: Critical Reflection (10 minutes)

Let's step back and think about the bigger picture.

### 7.1 Classification vs Regression

**1. How does classification differ from the regression you implemented in CA.01?**

**Key Differences:**
[TODO: Compare the two problem types, evaluation metrics, output types, etc.]

**2. Which evaluation metrics are important for classification vs regression?**

**Classification Metrics:**
[TODO: List and explain classification metrics like accuracy, precision, recall]

**Regression Metrics (from CA.01):**
[TODO: Recall regression metrics like MSE, R²]

### 7.2 Real-World Applications

**3. Give examples of real-world classification problems for each algorithm:**

**K-Nearest Neighbors:**
[TODO: Provide 2-3 real-world examples where KNN would be used]

**Logistic Regression:**
[TODO: Provide 2-3 real-world examples where Logistic Regression would be used]

**Naive Bayes:**
[TODO: Provide 2-3 real-world examples where Naive Bayes would be used]

**4. What ethical considerations should we keep in mind when applying these models to real problems?**

[TODO: Discuss bias, fairness, interpretability, and other ethical concerns in classification]

### 7.3 Limitations and Improvements

**5. What limitations did you encounter with each algorithm?**

[TODO: Reflect on challenges, assumptions violations, or performance issues you observed]

**6. How could you improve the Titanic survival predictions?**

[TODO: Suggest feature engineering, different algorithms, ensemble methods, etc.]

## Bonus: Advanced Experiments (Optional)

If you have extra time, try these experiments:
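
For the title-extraction idea, one possible starting point is a regular expression. This sketch assumes names in the "Surname, Title. Given names" layout; the copy of the data loaded in this notebook may not include a name column at all, so the column itself is an assumption, and the helper `extract_title` is hypothetical.

```python
import re

# Example names in the assumed "Surname, Title. Given names" layout
sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
]

# Capture the text between the first comma and the next period
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def extract_title(name):
    """Return the title found in a name, or None if the layout does not match."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None

titles = [extract_title(name) for name in sample_names]
print(titles)
```

Rare titles are often grouped into a single "Other" category before encoding, so that each category has enough examples to be useful.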

In [None]:
# Bonus 1: Feature Engineering
# TODO: Create new features like 'Title' from name, 'IsAlone' from family size, etc.

print("Bonus experiments:")
print("1. Extract title from passenger names (Mr, Mrs, Miss, etc.)")
print("2. Create 'IsAlone' feature (family_size == 1)")
print("3. Create age groups (Child, Adult, Senior)")
print("4. Try ensemble methods (combining multiple models)")
print("5. Experiment with different train/test splits")

In [None]:
# Bonus 2: Hyperparameter Tuning
# TODO: Use GridSearchCV to find optimal parameters for each model

# from sklearn.model_selection import GridSearchCV
# 
# # Example for KNN
# param_grid_knn = {'n_neighbors': range(1, 31)}
# grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5)
# grid_knn.fit(X_train_scaled, y_train)
# print(f"Best KNN parameters: {grid_knn.best_params_}")

print("TODO: Implement hyperparameter tuning if you have extra time")

## Summary and Submission

### What You've Accomplished

Congratulations! In this assignment, you have:

**Understood classification fundamentals** and how they differ from regression  
**Implemented three classification algorithms**: KNN, Logistic Regression, and Naive Bayes  
**Applied data preprocessing** including encoding and scaling  
**Compared model performance** using multiple metrics  
**Interpreted results** and analyzed algorithm characteristics  
**Reflected on real-world applications** and ethical considerations

### Key Takeaways

**TODO: Write 2-3 key insights from this assignment:**

1. [TODO: Your first key takeaway about classification algorithms]
2. [TODO: Your second key takeaway about model comparison or data preprocessing]  
3. [TODO: Your third key takeaway about practical applications or limitations]

### Final Reflection

**TODO: Write a brief (100-150 words) final reflection on your experience with classification:**

[TODO: Your final reflection here - discuss what surprised you, what was challenging, what you'd like to explore further]

---

**Assignment Complete!** 

Make sure to:
1. Fill in all TODO sections
2. Answer all analysis questions
3. Save your notebook
4. Export as HTML
5. Submit both .ipynb and .html files
6. Include your name and student ID at the top