# Exercise-2: Decision Trees [30 points]

This notebook covers the implementation and analysis of Decision Tree algorithms for machine learning tasks.

## Learning Objectives
- Understanding Decision Tree algorithms and their parameters
- Implementing Decision Trees for classification and regression
- Analyzing tree structure and feature importance
- Hyperparameter tuning and model optimization
- Handling overfitting and underfitting

## Instructions
Complete the exercises below by implementing the required code in the designated cells.

## 1. Import Required Libraries

Import the necessary libraries for Decision Tree implementation and analysis.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import plot_tree, export_text, export_graphviz
import graphviz  # optional: only needed to render export_graphviz output

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

## 2. Load Preprocessed Data

Load the data that was prepared in Exercise-1.

In [None]:
# TODO: Load the preprocessed data from Exercise-1
# Example:
# X_train = pd.read_csv('../data/X_train_processed.csv')
# X_test = pd.read_csv('../data/X_test_processed.csv')
# y_train = pd.read_csv('../data/y_train.csv').squeeze()
# y_test = pd.read_csv('../data/y_test.csv').squeeze()

# TODO: Display the shape of the datasets

## 3. Basic Decision Tree Implementation

Implement a basic Decision Tree model with default parameters.

In [None]:
# TODO: Create a Decision Tree classifier/regressor with default parameters
# dt_model = DecisionTreeClassifier(random_state=42)  # or DecisionTreeRegressor

# TODO: Train the model
# dt_model.fit(X_train, y_train)

# TODO: Make predictions
# y_pred = dt_model.predict(X_test)

## 4. Model Evaluation

Evaluate the performance of the Decision Tree model.
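
As a rough illustration of the classification metrics this section asks for, here is a minimal sketch on synthetic data. The dataset from `make_classification` and all the `*_demo` / `*d_*` names are stand-ins invented for this example, not the Exercise-1 data; macro averaging is one reasonable choice among several.

```python
# Illustrative only: synthetic data stands in for the Exercise-1 dataset.
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary classification problem with 300 samples
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Fit a default tree and predict on the held-out split
demo_tree = DecisionTreeClassifier(random_state=42).fit(Xd_train, yd_train)
yd_pred = demo_tree.predict(Xd_test)

# Accuracy plus macro-averaged precision, recall and F1
demo_accuracy = accuracy_score(yd_test, yd_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    yd_test, yd_pred, average="macro", zero_division=0
)

# Rows are true classes, columns are predicted classes
demo_cm = confusion_matrix(yd_test, yd_pred)

print(f"accuracy={demo_accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(demo_cm)
```

For a regression target, the analogous calls would be `mean_squared_error` and `r2_score`, with RMSE taken as the square root of the MSE.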

In [None]:
# TODO: Calculate and display evaluation metrics
# For classification: accuracy, precision, recall, F1-score
# For regression: MSE, RMSE, R²

# TODO: Display confusion matrix (for classification) or scatter plot (for regression)

## 5. Tree Visualization and Analysis

Visualize the Decision Tree structure and analyze its components.

In [None]:
# TODO: Visualize the Decision Tree
# Option 1: Using plot_tree (for smaller trees)
# plt.figure(figsize=(20, 10))
# plot_tree(dt_model, feature_names=list(X_train.columns), class_names=None, filled=True)
# plt.show()

# Option 2: Using export_text for text representation
# tree_rules = export_text(dt_model, feature_names=list(X_train.columns))
# print(tree_rules[:2000])  # Print first 2000 characters

In [None]:
# TODO: Analyze feature importance
# feature_importance = dt_model.feature_importances_
# importance_df = pd.DataFrame({
#     'feature': X_train.columns,
#     'importance': feature_importance
# }).sort_values('importance', ascending=False)

# TODO: Plot feature importance
# plt.figure(figsize=(10, 6))
# sns.barplot(data=importance_df.head(10), x='importance', y='feature')
# plt.title('Top 10 Feature Importances')
# plt.show()

## 6. Hyperparameter Tuning

Optimize the Decision Tree model by tuning its hyperparameters.

In [None]:
# TODO: Define hyperparameter grid for GridSearch
# param_grid = {
#     'max_depth': [3, 5, 10, 15, 20, None],
#     'min_samples_split': [2, 5, 10, 20],
#     'min_samples_leaf': [1, 2, 4, 8],
#     'criterion': ['gini', 'entropy'],  # for classification
#     # 'criterion': ['squared_error', 'absolute_error'],  # for regression (scikit-learn >= 1.0; 'mse'/'mae' were removed in 1.2)
# }

# TODO: Perform GridSearchCV
# grid_search = GridSearchCV(
#     DecisionTreeClassifier(random_state=42),
#     param_grid,
#     cv=5,
#     scoring='accuracy',  # or appropriate metric for regression
#     n_jobs=-1
# )
# grid_search.fit(X_train, y_train)

# TODO: Display best parameters and score
# print("Best parameters:", grid_search.best_params_)
# print("Best cross-validation score:", grid_search.best_score_)

## 7. Optimized Model Evaluation

Evaluate the performance of the optimized Decision Tree model.

In [None]:
# TODO: Get the best model from GridSearch
# best_dt_model = grid_search.best_estimator_

# TODO: Make predictions with the optimized model
# y_pred_optimized = best_dt_model.predict(X_test)

# TODO: Evaluate the optimized model
# Compare performance with the basic model

## 8. Overfitting Analysis

Analyze the relationship between model complexity and overfitting.

In [None]:
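# TODO: Create validation curves to analyze overfitting
# Test different max_depth values and plot training vs cross-validated scores
# (keep the test set out of this loop so it remains an unbiased final check)

# max_depths = range(1, 21)
# train_scores = []
# val_scores = []

# for depth in max_depths:
#     dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
#
#     # Validation score: mean 5-fold cross-validation score on the training data
#     val_score = cross_val_score(dt, X_train, y_train, cv=5).mean()
#
#     # Training score: fit on the full training set and score on it
#     dt.fit(X_train, y_train)
#     train_score = dt.score(X_train, y_train)
#
#     train_scores.append(train_score)
#     val_scores.append(val_score)

# TODO: Plot the validation curves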
# plt.figure(figsize=(10, 6))
# plt.plot(max_depths, train_scores, 'o-', label='Training Score')
# plt.plot(max_depths, val_scores, 'o-', label='Validation Score')
# plt.xlabel('Max Depth')
# plt.ylabel('Score')
# plt.title('Training vs Validation Score by Max Depth')
# plt.legend()
# plt.grid(True)
# plt.show()

## 9. Cross-Validation Analysis

Perform cross-validation to get a more robust evaluation of model performance.

In [None]:
# TODO: Perform k-fold cross-validation
# cv_scores = cross_val_score(
#     best_dt_model, 
#     X_train, 
#     y_train, 
#     cv=5, 
#     scoring='accuracy'  # or appropriate metric
# )

# TODO: Display cross-validation results
# print(f"Cross-validation scores: {cv_scores}")
# print(f"Mean CV score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

## 10. Model Interpretation

Interpret the Decision Tree model and extract insights.
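
One way to approach rule extraction is to walk the fitted tree's internal arrays and collect the conditions along each root-to-leaf path. The sketch below is a possible starting point, not a required solution: `extract_rules` is a hypothetical helper written for this example, and the Iris dataset is used only so the code runs on its own.

```python
# Illustrative only: Iris stands in for the Exercise-1 dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def extract_rules(model, feature_names):
    """Return one (conditions, n_samples, predicted_class) tuple per leaf,
    sorted so the leaves covering the most training samples come first."""
    tree = model.tree_
    rules = []

    def walk(node, conditions):
        # Leaves have both child pointers set to -1, so they compare equal
        if tree.children_left[node] == tree.children_right[node]:
            predicted = model.classes_[tree.value[node][0].argmax()]
            rules.append((conditions, int(tree.n_node_samples[node]), predicted))
            return
        name = feature_names[tree.feature[node]]
        threshold = tree.threshold[node]
        # Samples with feature <= threshold go left, the rest go right
        walk(tree.children_left[node], conditions + [f"{name} <= {threshold:.2f}"])
        walk(tree.children_right[node], conditions + [f"{name} > {threshold:.2f}"])

    walk(0, [])
    return sorted(rules, key=lambda rule: rule[1], reverse=True)


iris = load_iris()
iris_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
iris_tree.fit(iris.data, iris.target)

iris_rules = extract_rules(iris_tree, iris.feature_names)
for conditions, n_samples, predicted in iris_rules:
    print(f"IF {' AND '.join(conditions)} "
          f"THEN class={iris.target_names[predicted]} ({n_samples} samples)")
```

Sorting by sample count is a simple proxy for "most important path"; weighting by leaf purity or by test-set coverage would be equally defensible.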

In [None]:
# TODO: Extract and analyze decision rules
# Create a function to extract meaningful rules from the tree

# TODO: Identify the most important decision paths

# TODO: Analyze how different features contribute to the final predictions

## 11. Comparison with Different Tree Configurations

Compare Decision Trees with different configurations and criteria.
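
A comparison table can be built by looping over a dictionary of settings and cross-validating each one. The following is a sketch under stated assumptions: the breast cancer dataset, the four configurations, and the `ccp_alpha=0.01` value are arbitrary choices for illustration, and cost-complexity pruning (`ccp_alpha`) is just one of several pruning strategies that could be compared.

```python
# Illustrative only: a built-in dataset stands in for the Exercise-1 data.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Hypothetical configurations: two criteria, pre-pruning, post-pruning
configs = {
    "gini, unpruned": {"criterion": "gini"},
    "entropy, unpruned": {"criterion": "entropy"},
    "gini, max_depth=4": {"criterion": "gini", "max_depth": 4},
    "gini, ccp_alpha=0.01": {"criterion": "gini", "ccp_alpha": 0.01},
}

rows = []
for name, params in configs.items():
    model = DecisionTreeClassifier(random_state=42, **params)
    # 5-fold cross-validated accuracy for this configuration
    scores = cross_val_score(model, X_bc, y_bc, cv=5, scoring="accuracy")
    # Refit on all data only to report the size of the resulting tree
    model.fit(X_bc, y_bc)
    rows.append({
        "configuration": name,
        "cv_accuracy_mean": scores.mean(),
        "cv_accuracy_std": scores.std(),
        "n_leaves": model.get_n_leaves(),
    })

comparison_df = pd.DataFrame(rows).sort_values("cv_accuracy_mean", ascending=False)
print(comparison_df.to_string(index=False))
```

Reporting the number of leaves next to the score makes the accuracy versus complexity trade-off visible in a single table.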

In [None]:
# TODO: Compare different splitting criteria (gini vs entropy for classification)
# TODO: Compare different pruning strategies
# TODO: Create a comparison table of different configurations

## Summary and Conclusions

Summarize the findings from the Decision Tree analysis.

## Reflection Questions

1. How does the complexity of the Decision Tree affect its performance?
2. What are the advantages and disadvantages of Decision Trees for your dataset?
3. How do different hyperparameters influence the model's behavior?
4. What insights can you gain from the feature importance analysis?
5. How would you prevent overfitting in Decision Trees?

**TODO: Answer the reflection questions above in markdown cells below.**