# Exercise-4: Model Comparison [30 points]

This notebook provides a comprehensive comparison of different machine learning models, focusing on Decision Trees and Support Vector Machines from the previous exercises, along with additional baseline models.

## Learning Objectives
- Systematic comparison of multiple machine learning algorithms
- Understanding model selection criteria and evaluation metrics
- Analyzing trade-offs between different algorithms
- Statistical significance testing for model comparisons
- Creating comprehensive model evaluation reports
- Making informed decisions about model selection

## Instructions
Complete the exercises below by implementing the required code in the designated cells.

## 1. Import Required Libraries

Import all necessary libraries for comprehensive model comparison.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import classification_report, confusion_matrix
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

## 2. Load Preprocessed Data

Load the data that was prepared in Exercise-1.

In [None]:
# TODO: Load the preprocessed data from Exercise-1
# Example:
# X_train = pd.read_csv('../data/X_train_processed.csv')
# X_test = pd.read_csv('../data/X_test_processed.csv')
# y_train = pd.read_csv('../data/y_train.csv').squeeze()
# y_test = pd.read_csv('../data/y_test.csv').squeeze()

# TODO: Display the shape of the datasets
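# TODO: Determine if this is a classification or regression problem and store
#       the answer in problem_type ('classification' or 'regression');
#       later cells branch on this variable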
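
One possible way to set `problem_type` programmatically is scikit-learn's `type_of_target`. The sketch below is only an illustration: the helper name `infer_problem_type` and the toy targets are assumptions, not part of the exercise. Note that integer-valued regression targets (counts, for example) would be reported as multiclass, so the result is worth checking by eye.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target


def infer_problem_type(y):
    """Return 'classification' or 'regression' based on the target values.

    Hypothetical helper for this notebook; it only handles single-output targets.
    """
    target_type = type_of_target(y)
    if target_type in ("binary", "multiclass"):
        return "classification"
    if target_type == "continuous":
        return "regression"
    raise ValueError(f"Unsupported target type: {target_type}")


# Toy targets standing in for y_train (illustrative values only)
y_labels = np.array([0, 1, 1, 0, 2])
y_values = np.array([0.5, 1.7, 3.2, 2.9])

print(infer_problem_type(y_labels))
print(infer_problem_type(y_values))
```

In the notebook itself, the call would presumably be `problem_type = infer_problem_type(y_train)`.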

## 3. Define Models for Comparison

Define a comprehensive set of models to compare.

In [None]:
# TODO: Define models for comparison
# For classification:
# models = {
#     'Decision Tree': DecisionTreeClassifier(random_state=42),
#     'SVM (RBF)': SVC(kernel='rbf', random_state=42),
#     'SVM (Linear)': SVC(kernel='linear', random_state=42),
#     'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
#     'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
#     'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
#     'Naive Bayes': GaussianNB()
# }

# For regression:
# models = {
#     'Decision Tree': DecisionTreeRegressor(random_state=42),
#     'SVM (RBF)': SVR(kernel='rbf'),
#     'SVM (Linear)': SVR(kernel='linear'),
#     'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
#     'Linear Regression': LinearRegression(),
#     'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5)
# }

# TODO: Choose appropriate models based on your problem type

## 4. Basic Model Training and Evaluation

Train all models and collect basic performance metrics.

In [None]:
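# TODO: Train all models and collect basic metrics
# from sklearn.base import is_classifier
#
# results = {}
# training_times = {}
# prediction_times = {}

# for name, model in models.items():
#     print(f"Training {name}...")
#     
#     # Measure training time (perf_counter is a monotonic, high-resolution clock)
#     start_time = time.perf_counter()
#     model.fit(X_train, y_train)
#     training_time = time.perf_counter() - start_time
#     
#     # Measure prediction time
#     start_time = time.perf_counter()
#     y_pred = model.predict(X_test)
#     prediction_time = time.perf_counter() - start_time
#     
#     # Calculate metrics. Check the estimator type rather than predict_proba:
#     # SVC has no predict_proba unless probability=True, so it would be
#     # misrouted to the regression branch.
#     if is_classifier(model):  # Classification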
#         accuracy = accuracy_score(y_test, y_pred)
#         results[name] = {
#             'accuracy': accuracy,
#             'predictions': y_pred
#         }
#     else:  # Regression
#         mse = mean_squared_error(y_test, y_pred)
#         mae = mean_absolute_error(y_test, y_pred)
#         r2 = r2_score(y_test, y_pred)
#         results[name] = {
#             'mse': mse,
#             'mae': mae,
#             'r2': r2,
#             'predictions': y_pred
#         }
#     
#     training_times[name] = training_time
#     prediction_times[name] = prediction_time
#     
#     print(f"  Training time: {training_time:.4f}s")
#     print(f"  Prediction time: {prediction_time:.4f}s")
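
The timing loop can be tried end to end before the Exercise-1 data is loaded. The sketch below assumes synthetic data from `make_classification` as a stand-in; the helper `timed_fit_predict` and the dictionary `demo_results` are hypothetical names introduced for illustration. It uses `time.perf_counter` for interval timing and `is_classifier` to pick the metric branch.

```python
import time

from sklearn.base import is_classifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def timed_fit_predict(model, X_tr, y_tr, X_te):
    """Fit and predict, returning predictions plus wall-clock timings in seconds."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_te)
    predict_seconds = time.perf_counter() - start
    return y_pred, fit_seconds, predict_seconds


# Synthetic stand-in for the Exercise-1 data (an assumption for this sketch)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

demo_models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM (RBF)": SVC(kernel="rbf", random_state=42),
}

demo_results = {}
for name, model in demo_models.items():
    y_pred, fit_s, pred_s = timed_fit_predict(model, X_tr, y_tr, X_te)
    demo_results[name] = {
        "is_classifier": is_classifier(model),
        "accuracy": accuracy_score(y_te, y_pred),
        "fit_seconds": fit_s,
        "predict_seconds": pred_s,
    }
    print(f"{name}: accuracy={demo_results[name]['accuracy']:.3f}, "
          f"fit={fit_s:.4f}s, predict={pred_s:.4f}s")
```

Single-run timings on small data are noisy, so repeating the measurement a few times and reporting the median is probably a safer basis for comparison.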

## 5. Cross-Validation Comparison

Perform systematic cross-validation for all models.

In [None]:
# TODO: Perform cross-validation for all models
# cv_results = {}
# cv_folds = 5

# # Choose appropriate cross-validation strategy
# if problem_type == 'classification':
#     cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42)
#     scoring = 'accuracy'
# else:
#     cv = KFold(n_splits=cv_folds, shuffle=True, random_state=42)
#     scoring = 'r2'  # or 'neg_mean_squared_error'

# for name, model in models.items():
#     print(f"Cross-validating {name}...")
#     scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=scoring)
#     cv_results[name] = {
#         'scores': scores,
#         'mean': scores.mean(),
#         'std': scores.std()
#     }
#     print(f"  CV Score: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
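
If more than one metric is of interest, `cross_validate` can score several in a single pass and also reports fit and score times per fold. The following is a minimal sketch on synthetic data (an assumption made so the block runs on its own); the metric choice `accuracy` plus `f1_macro` is just an example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (illustrative only)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_output = cross_validate(
    DecisionTreeClassifier(random_state=42),
    X,
    y,
    cv=cv,
    scoring=["accuracy", "f1_macro"],
)

# cross_validate returns one array per metric, prefixed with 'test_'
for metric in ("test_accuracy", "test_f1_macro"):
    fold_scores = cv_output[metric]
    print(f"{metric}: {fold_scores.mean():.4f} (std {fold_scores.std():.4f})")
```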

## 6. Performance Visualization

Create comprehensive visualizations comparing model performance.

In [None]:
# TODO: Create performance comparison plots

# Cross-validation scores comparison
# cv_df = pd.DataFrame({
#     'Model': list(cv_results.keys()),
#     'Mean_Score': [cv_results[model]['mean'] for model in cv_results.keys()],
#     'Std_Score': [cv_results[model]['std'] for model in cv_results.keys()]
# })

# plt.figure(figsize=(12, 6))
# plt.errorbar(cv_df['Model'], cv_df['Mean_Score'], yerr=cv_df['Std_Score'], 
#              fmt='o', capsize=5, capthick=2, elinewidth=2, markersize=8)
# plt.xticks(rotation=45)
# plt.title('Cross-Validation Performance Comparison')
# plt.ylabel('Score')
# plt.grid(True, alpha=0.3)
# plt.tight_layout()
# plt.show()

# Box plots for cross-validation scores
# cv_scores_list = [cv_results[model]['scores'] for model in cv_results.keys()]
# plt.figure(figsize=(12, 6))
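# # Set tick labels separately: the `labels` keyword of boxplot is deprecated
# # in recent Matplotlib releases (renamed to `tick_labels`)
# plt.boxplot(cv_scores_list)
# plt.xticks(range(1, len(cv_results) + 1), list(cv_results.keys()), rotation=45)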
# plt.title('Cross-Validation Score Distribution')
# plt.ylabel('Score')
# plt.grid(True, alpha=0.3)
# plt.tight_layout()
# plt.show()

## 7. Training and Prediction Time Analysis

Compare computational efficiency of different models.

In [None]:
# TODO: Create time comparison plots
# time_df = pd.DataFrame({
#     'Model': list(training_times.keys()),
#     'Training_Time': list(training_times.values()),
#     'Prediction_Time': list(prediction_times.values())
# })

# fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# # Training time comparison
# ax1.bar(time_df['Model'], time_df['Training_Time'])
# ax1.set_title('Training Time Comparison')
# ax1.set_ylabel('Time (seconds)')
# ax1.tick_params(axis='x', rotation=45)

# # Prediction time comparison
# ax2.bar(time_df['Model'], time_df['Prediction_Time'])
# ax2.set_title('Prediction Time Comparison')
# ax2.set_ylabel('Time (seconds)')
# ax2.tick_params(axis='x', rotation=45)

# plt.tight_layout()
# plt.show()

# TODO: Create a scatter plot of performance vs training time
# plt.figure(figsize=(10, 6))
# plt.scatter(time_df['Training_Time'], cv_df['Mean_Score'], s=100)
# for i, model in enumerate(time_df['Model']):
#     plt.annotate(model, (time_df['Training_Time'][i], cv_df['Mean_Score'][i]), 
#                  xytext=(5, 5), textcoords='offset points')
# plt.xlabel('Training Time (seconds)')
# plt.ylabel('Cross-Validation Score')
# plt.title('Performance vs Training Time Trade-off')
# plt.grid(True, alpha=0.3)
# plt.show()

## 8. Statistical Significance Testing

Perform statistical tests to determine if performance differences are significant.

In [None]:
# TODO: Perform pairwise statistical significance tests
# from scipy.stats import ttest_rel

# # Get model names and their CV scores
# model_names = list(cv_results.keys())
# n_models = len(model_names)

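# # Create a matrix to store p-values. Use 1.0 on the diagonal (a model never
# # differs from itself); zeros would read as "highly significant" on the heatmap.
# p_values = np.ones((n_models, n_models))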

# for i in range(n_models):
#     for j in range(n_models):
#         if i != j:
#             scores_i = cv_results[model_names[i]]['scores']
#             scores_j = cv_results[model_names[j]]['scores']
#             _, p_value = ttest_rel(scores_i, scores_j)
#             p_values[i, j] = p_value

# # Create a heatmap of p-values
# plt.figure(figsize=(10, 8))
# sns.heatmap(p_values, 
#             xticklabels=model_names, 
#             yticklabels=model_names,
#             annot=True, 
#             fmt='.3f', 
#             cmap='viridis',
#             cbar_kws={'label': 'p-value'})
# plt.title('Statistical Significance Test (p-values)\nValues < 0.05 indicate significant differences')
# plt.tight_layout()
# plt.show()
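
With seven models there are 21 pairwise tests, so some p-values below 0.05 are expected by chance alone. One common remedy is the Holm-Bonferroni adjustment, sketched here in plain NumPy. The function name `holm_adjust` and the raw p-values are illustrative assumptions. Keep in mind as well that fold scores from the same cross-validation share training data, so a paired t-test on them tends to be optimistic.

```python
import numpy as np


def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of p-values, smallest first
    scaled = (m - np.arange(m)) * p[order]     # multiply by m, m-1, ..., 1
    # Enforce monotonicity, then cap at 1.0
    adjusted_sorted = np.minimum(1.0, np.maximum.accumulate(scaled))
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted          # restore the original order
    return adjusted


# Made-up raw p-values for three model pairs (illustrative only)
raw_p = [0.01, 0.04, 0.03]
adjusted_p = holm_adjust(raw_p)
print(adjusted_p)
```

To apply this to the p-value matrix, only the upper triangle (each unordered pair once) would be passed in, since the matrix is symmetric.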

## 9. Detailed Classification Report (for Classification Problems)

Generate detailed classification reports for all models.

In [None]:
# TODO: Generate detailed classification reports (if classification problem)
# if problem_type == 'classification':
#     for name, model in models.items():
#         print(f"\n{'='*50}")
#         print(f"Classification Report - {name}")
#         print(f"{'='*50}")
#         
#         model.fit(X_train, y_train)
#         y_pred = model.predict(X_test)
#         
#         print(classification_report(y_test, y_pred))
#         
#         # Confusion matrix
#         plt.figure(figsize=(8, 6))
#         cm = confusion_matrix(y_test, y_pred)
#         sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
#         plt.title(f'Confusion Matrix - {name}')
#         plt.ylabel('True Label')
#         plt.xlabel('Predicted Label')
#         plt.show()

## 10. Model Complexity Analysis

Analyze the complexity and interpretability of different models.

In [None]:
# TODO: Analyze model complexity
# complexity_analysis = {
#     'Model': [],
#     'Interpretability': [],
#     'Training_Complexity': [],
#     'Prediction_Complexity': [],
#     'Memory_Usage': [],
#     'Hyperparameters': []
# }

# # Define complexity characteristics for each model
# model_characteristics = {
#     'Decision Tree': {
#         'interpretability': 'High',
#         'training_complexity': 'O(n*log(n)*m)',
#         'prediction_complexity': 'O(log(n))',
#         'memory_usage': 'Low',
#         'hyperparameters': 'Few'
#     },
#     'SVM (RBF)': {
#         'interpretability': 'Low',
#         'training_complexity': 'O(n²) to O(n³)',
#         'prediction_complexity': 'O(k*m)',
#         'memory_usage': 'Medium',
#         'hyperparameters': 'Medium'
#     },
#     # Add other models...
# }

# TODO: Create a comparison table
# comparison_df = pd.DataFrame(model_characteristics).T
# print("Model Complexity Analysis:")
# print(comparison_df)
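
The table above is qualitative. Some complexity indicators can also be measured on fitted models, for example tree depth, leaf count, number of support vectors, and serialized size. The sketch below assumes synthetic data and uses `pickle` size only as a rough proxy for memory use; the dictionary name `measured` is hypothetical.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
svm = SVC(kernel="rbf").fit(X, y)

measured = {
    "Decision Tree": {
        "depth": tree.get_depth(),
        "n_leaves": tree.get_n_leaves(),
        "pickled_bytes": len(pickle.dumps(tree)),
    },
    "SVM (RBF)": {
        # n_support_ holds the number of support vectors per class
        "n_support_vectors": int(svm.n_support_.sum()),
        "pickled_bytes": len(pickle.dumps(svm)),
    },
}

for name, stats_for_model in measured.items():
    print(name, stats_for_model)
```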

## 11. Bias-Variance Trade-off Analysis

Analyze the bias-variance trade-off for different models.

In [None]:
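# TODO: Estimate how stable each model's score is across resampled splits
# Note: this measures score variability only; it is not a bias-variance
# decomposition. A proper decomposition needs repeated refits evaluated against
# a fixed test set (specialized libraries such as mlxtend provide one).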

# # Analyze variance across different train-test splits
# from sklearn.model_selection import ShuffleSplit

# n_splits = 20
# cv_splitter = ShuffleSplit(n_splits=n_splits, test_size=0.3, random_state=42)

# model_variances = {}

# for name, model in models.items():
#     scores = []
#     for train_idx, test_idx in cv_splitter.split(X_train):
#         X_train_fold, X_test_fold = X_train.iloc[train_idx], X_train.iloc[test_idx]
#         y_train_fold, y_test_fold = y_train.iloc[train_idx], y_train.iloc[test_idx]
#         
#         model.fit(X_train_fold, y_train_fold)
#         score = model.score(X_test_fold, y_test_fold)
#         scores.append(score)
#     
#     model_variances[name] = {
#         'mean': np.mean(scores),
#         'variance': np.var(scores),
#         'std': np.std(scores)
#     }

# # Plot bias-variance trade-off
# variance_df = pd.DataFrame(model_variances).T
# plt.figure(figsize=(10, 6))
# plt.scatter(variance_df['variance'], variance_df['mean'], s=100)
# for i, model in enumerate(variance_df.index):
#     # variance_df is indexed by model name, so use positional .iloc here
#     plt.annotate(model, (variance_df['variance'].iloc[i], variance_df['mean'].iloc[i]),
#                  xytext=(5, 5), textcoords='offset points')
# plt.xlabel('Variance')
# plt.ylabel('Mean Performance')
# plt.title('Bias-Variance Trade-off Analysis')
# plt.grid(True, alpha=0.3)
# plt.show()
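
For squared-error loss, an actual decomposition can be simulated when the true function is known. The sketch below is a toy setup under stated assumptions: a sine target, Gaussian noise, and decision tree regressors of two depths. The names `true_function` and `bias_variance` are hypothetical. It refits the model on many fresh training sets and splits the error against the noise-free target into squared bias and variance.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)


def true_function(x):
    """Noise-free target, assumed known only for this simulation."""
    return np.sin(2 * np.pi * x)


def bias_variance(model_factory, n_rounds=50, n_train=80, noise_std=0.3):
    """Return (bias^2, variance, mean squared error vs. the true function)."""
    x_test = np.linspace(0.0, 1.0, 100)
    f_test = true_function(x_test)
    preds = np.empty((n_rounds, x_test.size))
    for r in range(n_rounds):
        # Draw a fresh noisy training set each round
        x_train = rng.uniform(0.0, 1.0, n_train)
        y_train = true_function(x_train) + rng.normal(0.0, noise_std, n_train)
        model = model_factory().fit(x_train.reshape(-1, 1), y_train)
        preds[r] = model.predict(x_test.reshape(-1, 1))
    bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    mse_vs_truth = np.mean((preds - f_test) ** 2)
    return bias_sq, variance, mse_vs_truth


deep = bias_variance(lambda: DecisionTreeRegressor(random_state=0))
shallow = bias_variance(lambda: DecisionTreeRegressor(max_depth=2, random_state=0))

for label, (b, v, mse) in {"deep tree": deep, "depth-2 tree": shallow}.items():
    print(f"{label}: bias^2={b:.4f}, variance={v:.4f}, bias^2+variance={b + v:.4f}, mse={mse:.4f}")
```

The printed `bias^2 + variance` matches the error against the noise-free target; the irreducible noise term is left out because predictions are compared with `true_function`, not with noisy labels. On real data the true function is unknown, which is why the notebook falls back on score variability.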

## 12. Learning Curves Comparison

Compare how models perform with different amounts of training data.

In [None]:
# TODO: Generate learning curves for selected models
# from sklearn.model_selection import learning_curve

# # Select a subset of models for learning curve analysis
# selected_models = ['Decision Tree', 'SVM (RBF)', 'Random Forest']

# plt.figure(figsize=(15, 5))

# for i, model_name in enumerate(selected_models):
#     model = models[model_name]
#     
#     train_sizes, train_scores, test_scores = learning_curve(
#         model, X_train, y_train, cv=5, n_jobs=-1, 
#         train_sizes=np.linspace(0.1, 1.0, 10)
#     )
#     
#     train_mean = np.mean(train_scores, axis=1)
#     train_std = np.std(train_scores, axis=1)
#     test_mean = np.mean(test_scores, axis=1)
#     test_std = np.std(test_scores, axis=1)
#     
#     plt.subplot(1, 3, i+1)
#     plt.plot(train_sizes, train_mean, 'o-', label='Training Score')
#     plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
#     plt.plot(train_sizes, test_mean, 'o-', label='Validation Score')
#     plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1)
#     
#     plt.title(f'Learning Curve - {model_name}')
#     plt.xlabel('Training Set Size')
#     plt.ylabel('Score')
#     plt.legend()
#     plt.grid(True, alpha=0.3)

# plt.tight_layout()
# plt.show()

## 13. Model Selection Criteria

Establish comprehensive criteria for model selection.

In [None]:
# TODO: Create a comprehensive model comparison table
# model_comparison = pd.DataFrame({
#     'Model': list(models.keys()),
#     'CV_Score_Mean': [cv_results[model]['mean'] for model in models.keys()],
#     'CV_Score_Std': [cv_results[model]['std'] for model in models.keys()],
#     'Training_Time': [training_times[model] for model in models.keys()],
#     'Prediction_Time': [prediction_times[model] for model in models.keys()]
# })

# # Add normalized scores for easier comparison
# model_comparison['CV_Score_Normalized'] = (
#     model_comparison['CV_Score_Mean'] / model_comparison['CV_Score_Mean'].max()
# )
# model_comparison['Speed_Score'] = (
#     1 / (model_comparison['Training_Time'] + model_comparison['Prediction_Time'])
# )
# model_comparison['Speed_Score_Normalized'] = (
#     model_comparison['Speed_Score'] / model_comparison['Speed_Score'].max()
# )

# # Calculate composite score (you can adjust weights based on your priorities)
# performance_weight = 0.7
# speed_weight = 0.3
# model_comparison['Composite_Score'] = (
#     performance_weight * model_comparison['CV_Score_Normalized'] +
#     speed_weight * model_comparison['Speed_Score_Normalized']
# )

# # Sort by composite score
# model_comparison = model_comparison.sort_values('Composite_Score', ascending=False)
# print("Model Comparison Summary:")
# print(model_comparison.round(4))
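
The composite ranking depends heavily on the chosen weights, so it may be worth checking how stable the winner is. The sketch below reuses the same normalization (divide by the column maximum) on two made-up models; the function `rank_models` and the toy numbers are assumptions for illustration. Dividing by the maximum presumes positive scores where higher is better, so it would not suit `neg_mean_squared_error`.

```python
import pandas as pd


def rank_models(df, performance_weight):
    """Rank models by a weighted sum of normalized score and normalized speed."""
    speed = 1.0 / (df["Training_Time"] + df["Prediction_Time"])
    composite = (
        performance_weight * df["CV_Score_Mean"] / df["CV_Score_Mean"].max()
        + (1.0 - performance_weight) * speed / speed.max()
    )
    return df.assign(Composite_Score=composite).sort_values(
        "Composite_Score", ascending=False
    )


# Made-up numbers for two hypothetical models (illustrative only)
toy = pd.DataFrame({
    "Model": ["Slow but accurate", "Fast but weaker"],
    "CV_Score_Mean": [0.90, 0.85],
    "Training_Time": [9.9, 0.09],
    "Prediction_Time": [0.1, 0.01],
})

for weight in (0.7, 0.99):
    winner = rank_models(toy, weight).iloc[0]["Model"]
    print(f"performance_weight={weight}: top model is {winner}")
```

Here the faster model wins at a 0.7 performance weight and the more accurate one wins at 0.99, which suggests reporting the weights alongside any recommendation.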

## 14. Final Model Recommendation

Make a final recommendation based on comprehensive analysis.

In [None]:
# TODO: Make final recommendation
# best_model_name = model_comparison.iloc[0]['Model']
# best_model = models[best_model_name]

# print(f"\n{'='*60}")
# print(f"FINAL MODEL RECOMMENDATION: {best_model_name}")
# print(f"{'='*60}")

# print(f"\nReasons for selection:")
# print(f"1. Cross-validation score: {cv_results[best_model_name]['mean']:.4f} (+/- {cv_results[best_model_name]['std']*2:.4f})")
# print(f"2. Training time: {training_times[best_model_name]:.4f} seconds")
# print(f"3. Prediction time: {prediction_times[best_model_name]:.4f} seconds")
# print(f"4. Composite score: {model_comparison.iloc[0]['Composite_Score']:.4f}")

# # Train the final model on full training data
# final_model = best_model
# final_model.fit(X_train, y_train)
# final_predictions = final_model.predict(X_test)

# # Final evaluation
# if problem_type == 'classification':
#     final_accuracy = accuracy_score(y_test, final_predictions)
#     print(f"\nFinal test accuracy: {final_accuracy:.4f}")
# else:
#     final_r2 = r2_score(y_test, final_predictions)
#     final_mse = mean_squared_error(y_test, final_predictions)
#     print(f"\nFinal test R²: {final_r2:.4f}")
#     print(f"Final test MSE: {final_mse:.4f}")

## 15. Model Deployment Considerations

Discuss considerations for deploying the selected model.

In [None]:
# TODO: Save the final model and preprocessing steps
# import joblib

# # Save the model
# joblib.dump(final_model, '../models/final_model.pkl')
# print("Final model saved to '../models/final_model.pkl'")

# # Create a simple prediction function
# def make_prediction(input_data):
#     """
#     Make prediction using the final model.
#     
#     Args:
#         input_data: preprocessed input data
#     
#     Returns:
#         prediction
#     """
#     prediction = final_model.predict(input_data)
#     return prediction

# # Test the prediction function
# sample_prediction = make_prediction(X_test.iloc[:1])
# print(f"Sample prediction: {sample_prediction[0]}")
# print(f"Actual value: {y_test.iloc[0]}")
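
Before relying on a saved model, a quick round-trip check can confirm that the reloaded object predicts identically. This sketch assumes synthetic data and writes to a temporary directory instead of `../models/`, so it can run anywhere. Pickled scikit-learn models are not guaranteed to load under a different scikit-learn version, so recording the version used for training is a sensible extra step.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the final model and its data (illustrative only)
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "final_model.pkl"
    joblib.dump(model, model_path)       # serialize to disk
    reloaded = joblib.load(model_path)   # load it back while the file still exists

same_predictions = np.array_equal(model.predict(X), reloaded.predict(X))
print(f"Round-trip predictions identical: {same_predictions}")
print(f"Trained with scikit-learn {sklearn.__version__}")
```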

## Summary and Conclusions

Provide a comprehensive summary of the model comparison analysis.

### Key Findings

**TODO: Summarize your key findings here:**

1. **Best Performing Model:** [Model name and performance metrics]
2. **Performance vs Complexity Trade-offs:** [Discussion of trade-offs]
3. **Computational Efficiency:** [Training and prediction time analysis]
4. **Statistical Significance:** [Results of significance tests]
5. **Bias-Variance Analysis:** [Insights from bias-variance trade-off]

### Recommendations

**TODO: Provide specific recommendations:**

1. **Primary Recommendation:** [Best model for production use]
2. **Alternative Options:** [Backup models and when to use them]
3. **Deployment Considerations:** [Important factors for deployment]
4. **Future Improvements:** [Suggestions for model enhancement]

### Lessons Learned

**TODO: Discuss lessons learned:**

1. **Data Insights:** [What the comparison revealed about your data]
2. **Algorithm Insights:** [Strengths and weaknesses of different approaches]
3. **Methodology Insights:** [Lessons about model comparison process]


## Reflection Questions

1. Which model performed best overall and why do you think it was superior for this dataset?
2. How did the computational efficiency of models compare, and how important is this for your use case?
3. What role did statistical significance testing play in your model selection decision?
4. How did the bias-variance trade-off manifest differently across the models?
5. What would you do differently if you had to deploy this model in a production environment?
6. How might the model comparison results change with a different dataset or problem domain?
7. What additional models or techniques would you consider exploring in future iterations?

**TODO: Answer the reflection questions above in markdown cells below.**