<p style="font-family: 'Arial', sans-serif; font-size: 3rem; color: #6a1b9a; text-align: center; margin: 0; 
           text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.1); background-color: #f5f5f5; padding: 10px; 
           border-radius: 10px; border: 4px solid #6a5acd; box-shadow: 2px 2px 12px rgba(0, 0, 0, 0.1); width: 97%;">
    <span style="font-weight: bold; color: #6a1b9a; animation: pulse 2s infinite;"></span>COMPX310-2025B Lab 2 <br>Decision Trees Forest Cover Type Prediction
</p>


## Lab Information
- **Due Date:** Monday, September 29, 11:59 PM
- **Weight:** 3% of your total COMPX310 grade
- **Platform:** VSCode, Kaggle or Google Colab

## 1. Introduction to This Lab

### What is a Decision Tree?
A **decision tree** is a machine learning algorithm that makes predictions by asking a series of yes/no questions about the data. Think of it like a flowchart:
- Each node (box) asks a question about a feature
- Each branch represents a possible answer
- The leaves (final boxes) give us the prediction
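
To make the flowchart idea concrete, here is a minimal hand-written sketch of a tiny tree. The function name and both thresholds are invented for illustration; a real tree learns its questions from the data.

```python
def predict_cover_type(elevation_m: float, slope_deg: float) -> str:
    """Toy hand-written tree; the thresholds are made up, not learned."""
    # Root node: the first yes/no question about one feature
    if elevation_m > 3100:
        # Leaf: no more questions, return a prediction
        return "Spruce/Fir"
    # Internal node: a second question on the "no" branch
    if slope_deg > 20:
        return "Lodgepole Pine"
    return "Ponderosa Pine"


print(predict_cover_type(3300, 10))
print(predict_cover_type(2500, 25))
print(predict_cover_type(2500, 5))
```

Each `if` plays the role of a node, each outcome of the test is a branch, and each `return` is a leaf.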

<div align="center">
  <img src="https://miro.medium.com/v2/resize:fit:2000/1*S10T4ah3_JqdQ-eY6Hau0Q.png" width="600" height="400">
</div>

### What Will You Learn?
In this lab, you will:
1. Load and explore a real-world dataset
2. Build decision tree models with different settings
3. Use validation techniques to find the best model
4. Visualize your results with plots
5. Compare different approaches to model evaluation

### The Dataset: Forest Cover Type
We will predict what type of forest covers a specific area based on geographical and environmental features like elevation, slope, and distance to water sources.

## 2. Dataset Information

### Background
This dataset comes from the US Forest Service and contains information about forest areas in northern Colorado. Each data point represents a 30×30 meter area of forest. The goal is to predict which of 7 different tree types grows in each area.

<div align="center">
  <img src="https://assets.weforum.org/article/image/responsive_big_webp_qGgfxhM2PU3tyvAFoc-5FDTxmVY5sDcL8JSIo5Kj4aI.webp" width="600" height="400">
</div>

The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas have had minimal human-caused disturbance, so the existing forest cover types are more a result of ecological processes than of forest management practices.

### Study Areas Background
- **Neota (Area 2)**: Probably has the highest mean elevation of the four wilderness areas
- **Rawah (Area 1)** and **Comanche Peak (Area 3)**: Probably have lower mean elevations
- **Cache la Poudre (Area 4)**: Probably has the lowest mean elevation

### Primary Tree Species by Area
- **Neota**: Mainly spruce/fir (type 1)
- **Rawah and Comanche Peak**: Mainly lodgepole pine (type 2), followed by spruce/fir and aspen (type 5)
- **Cache la Poudre**: Mainly Ponderosa pine (type 3), Douglas-fir (type 6), and cottonwood/willow (type 4)

The Rawah and Comanche Peak areas would tend to be more typical of the overall dataset than either the Neota or Cache la Poudre, due to their assortment of tree species and range of predictive variable values. Cache la Poudre would probably be more unique than the others, due to its relatively low elevation range and species composition.

### Dataset Features
The dataset contains 54 features total. Here's what each feature means:

| Feature Name | Data Type | Unit | Description |
|--------------|-----------|------|-------------|
| Elevation | Numerical | meters | Elevation in meters |
| Aspect | Numerical | degrees azimuth | Aspect in degrees azimuth |
| Slope | Numerical | degrees | Slope in degrees |
| Horizontal_Distance_To_Hydrology | Numerical | meters | Horizontal distance to nearest surface water features |
| Vertical_Distance_To_Hydrology | Numerical | meters | Vertical distance to nearest surface water features |
| Horizontal_Distance_To_Roadways | Numerical | meters | Horizontal distance to nearest roadway |
| Hillshade_9am | Numerical | 0-255 index | Hillshade index at 9am, summer solstice |
| Hillshade_Noon | Numerical | 0-255 index | Hillshade index at noon, summer solstice |
| Hillshade_3pm | Numerical | 0-255 index | Hillshade index at 3pm, summer solstice |
| Horizontal_Distance_To_Fire_Points | Numerical | meters | Horizontal distance to nearest wildfire ignition points |
| Wilderness_Area (4 binary columns) | Binary | 0 or 1 | Wilderness area designation (0 = absence, 1 = presence) |
| Soil_Type (40 binary columns) | Binary | 0 or 1 | Soil Type designation (0 = absence, 1 = presence) |
| class | Integer | 1-7 | **TARGET VARIABLE**: Forest Cover Type designation |

### Forest Cover Types (Target Classes)
The **class** is what we want to predict. There are 7 different types:
1. **Spruce/Fir**
2. **Lodgepole Pine** 
3. **Ponderosa Pine**
4. **Cottonwood/Willow**
5. **Aspen**
6. **Douglas-fir**
7. **Krummholz**

### Additional Information
- **Total Features:** 54 (10 numerical + 4 wilderness areas + 40 soil types)
- **Target:** 1 (class)
- **Data Format:** Raw form (not scaled) with binary columns for qualitative variables
- **Learn more:** https://archive.ics.uci.edu/dataset/31/covertype
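
The "binary columns for qualitative variables" layout means each wilderness area and soil type gets its own 0/1 column. As a small sketch of what that looks like, the toy table below uses assumed column names (the real CSV headers may differ) and collapses four binary columns back into one label per row.

```python
import pandas as pd

# Hypothetical mini-table; column names are assumptions for illustration only.
toy = pd.DataFrame({
    "Wilderness_Area_1": [1, 0, 0],
    "Wilderness_Area_2": [0, 0, 1],
    "Wilderness_Area_3": [0, 1, 0],
    "Wilderness_Area_4": [0, 0, 0],
})
area_cols = list(toy.columns)

# In this toy table every row belongs to exactly one area.
rows_with_one_area = (toy[area_cols].sum(axis=1) == 1).all()

# idxmax along the columns returns the name of the column holding the 1.
area_label = toy[area_cols].idxmax(axis=1)

print(rows_with_one_area)
print(area_label.tolist())
```

Decision trees can split directly on the 0/1 columns, so you do not need to collapse them for modelling; the single label is mainly handy for summaries and plots.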

## 3. Required Files
You need to download these two CSV files:
- `covtype_train.csv` - for training your models
- `covtype_test.csv` - for final testing

**Important:** Keep both files in the folder you open in VSCode, or upload both files to your Kaggle or Google Colab environment, before starting.

## 4. Assignment Tasks
### **Complete the missing parts marked with # COMPLETE comments**

### Task 1: Data Loading and Preparation (Setup)

**What you need to do:**
1. Import all necessary libraries (pandas, numpy, sklearn, matplotlib, seaborn)
2. Set your random seed using the last 4 digits of your student ID (or sum of IDs if working in pairs)
3. Load both CSV files (`covtype_train.csv` and `covtype_test.csv`)
4. Separate features (X) and target variable (y) from both datasets
   - **X**: All columns except "class" (the cover type)
   - **y**: Only the "class" column
5. Display basic information about your datasets (shape, first few rows)

**Important Note:** For all random steps in this lab, always use the last four digits of your student ID as the random seed!
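
As a rough illustration of step 4, the sketch below separates features from the target on a tiny in-memory table. The values and the stray spaces in one header are invented; only the `class` column name comes from this lab.

```python
import pandas as pd

# Toy stand-in for the real CSV; values are made up.
toy = pd.DataFrame({
    " Elevation ": [2596, 2804, 3200],  # header with stray spaces, as can happen in CSVs
    "Slope": [3, 9, 18],
    "class": [5, 2, 1],
})

# Remove leading/trailing spaces from the column names.
toy.columns = toy.columns.str.strip()

X = toy.drop(columns="class")  # every column except the target
y = toy["class"]               # the target column only

print(list(X.columns))
print(X.shape, y.shape)
```

Note that `drop` returns a new DataFrame, so `toy` itself still contains the `class` column afterwards.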

In [None]:
# Task 1: Data Loading and Preparation
# Complete the missing parts marked with # COMPLETE comments

# Step 1: Import all necessary libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree
import matplotlib.pyplot as plt
import seaborn as sns
# COMPLETE: Import graphviz for tree visualization
# import graphviz

print("✓ All libraries imported successfully!")

# Step 2: Set your random seed
# COMPLETE: Replace XXXX with the last 4 digits of your student ID
ID = XXXX  # COMPLETE: Change this to your student ID

np.random.seed(ID)
print(f"✓ Random seed set to: {ID}")

# Step 3: Load the datasets
# COMPLETE: Load the CSV files using pd.read_csv()
train_df = # COMPLETE: Load 'covtype_train.csv'
test_df = # COMPLETE: Load 'covtype_test.csv'

# Clean up column names by removing leading/trailing spaces
train_df.columns = train_df.columns.str.strip()
test_df.columns = test_df.columns.str.strip()

print("✓ Data loaded successfully!")

# Step 4: Check your data
print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")

# COMPLETE: Display first 5 rows of training data
print("\nFirst 5 rows:")
# COMPLETE: Use .head() method

In [None]:
# Step 5: Separate features (X) and target (y)
# COMPLETE: Create X by dropping 'class' column, y by selecting 'class'
y_train_full = # COMPLETE: Select 'class' from train_df
X_train_full = # COMPLETE: Drop 'class' from train_df

y_test = # COMPLETE: Select 'class' from test_df
X_test = # COMPLETE: Drop 'class' from test_df  

# Verify the separation
print(f"\nFeatures shape: {X_train_full.shape} (should be 54 columns)")
print(f"Target values: {sorted(y_train_full.unique())} (should be 1-7)")

print("\n✓ Task 1 completed!")

### Task 2: Basic Data Analysis (Understand Your Data)
**What you need to do:**
1. **Dataset Information**: Check data types, missing values, and basic statistics
2. **Target Distribution**: Create a bar plot showing how many examples of each class you have
3. **Feature Distributions**: Create histograms for the first 10 numerical features
4. **Correlation Analysis**: Create a correlation heatmap for numerical features
5. **Feature vs Target**: Create box plots showing how key features (like Elevation) vary by class

**Why this is important:**
- **Histograms** help you see if features are normally distributed, skewed, or have outliers
- **Correlation heatmaps** show which features are related to each other
- **Box plots** help you understand which features might be most useful for prediction
- Understanding your data helps you make better modeling decisions

**Add markdown cells to explain what you observe in each plot!**

In [None]:
# Task 2: Data Exploration - Understanding Your Data
# Complete the missing parts marked with # COMPLETE comments

print("TASK 2: DATA EXPLORATION")
print("="*50)

# Part 1: Basic Dataset Information
print("1. BASIC DATASET INFORMATION")

# COMPLETE: Use .info() to check data types and missing values
train_df.info()

# COMPLETE: Count missing values in both datasets
missing_train = # COMPLETE: Use train_df.isnull().sum().sum()
missing_test = # COMPLETE: Use test_df.isnull().sum().sum()

print(f"\nMissing values - Training: {missing_train}, Test: {missing_test}")

# COMPLETE: Get summary statistics for first 10 numerical features
print("\nSummary statistics for first 10 features:")
# COMPLETE: Use train_df.iloc[:, :10].describe()

# Part 2: Target Distribution
print("\n2. TARGET DISTRIBUTION")

# COMPLETE: Count how many examples of each cover type (the 'class' column) we have
cover_counts = # COMPLETE: Use train_df['class'].value_counts().sort_index()

print("Cover Type distribution:")
print(cover_counts)

# COMPLETE: Create bar plot of target distribution
plt.figure(figsize=(10, 6))
# COMPLETE: Use cover_counts.plot(kind='bar')

plt.title('Forest Cover Types Distribution')
plt.xlabel('Cover Type')
plt.ylabel('Number of Examples')
plt.xticks(rotation=0)
plt.grid(True, alpha=0.3)
plt.show()

# Part 3: Feature Distributions
print("\n3. FEATURE DISTRIBUTIONS")

# First 10 numerical features
numerical_features = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 
                     'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
                     'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 
                     'Horizontal_Distance_To_Fire_Points']

# COMPLETE: Create histograms for all numerical features
fig, axes = plt.subplots(2, 5, figsize=(20, 10))
axes = axes.flatten()

for i, feature in enumerate(numerical_features):
    # COMPLETE: Create histogram - axes[i].hist(train_df[feature], bins=30, alpha=0.7)
    
    axes[i].set_title(f'{feature}')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Part 4: Correlation Analysis
print("\n4. CORRELATION ANALYSIS")

# COMPLETE: Calculate correlation matrix for numerical features
correlation_matrix = # COMPLETE: Use train_df[numerical_features].corr()

# COMPLETE: Create correlation heatmap
plt.figure(figsize=(12, 10))
# COMPLETE: Use sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)

plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# Part 5: Feature vs Target Analysis
print("\n5. FEATURE vs TARGET ANALYSIS")

# COMPLETE: Create box plot showing Elevation by class 
plt.figure(figsize=(12, 8))
# COMPLETE: Use sns.boxplot(data=train_df, x='class', y='Elevation')

plt.title('Elevation by Cover Type')
plt.grid(True, alpha=0.3)
plt.show()

# COMPLETE: Analyze wilderness areas by cover type
wilderness_cols = ['Wilderness_Area_1', 'Wilderness_Area_2', 'Wilderness_Area_3', 'Wilderness_Area_4']
# COMPLETE: Use train_df.groupby('class')[wilderness_cols].sum()
wilderness_summary = 

print("Wilderness Area by Cover Type:")
print(wilderness_summary)

print("\n✓ Task 2 completed! Add markdown cells to explain your observations.")

### Task 3: Cross-Validation Experiment - Build and Validate Decision Trees (1 mark)

**What you need to do:**

**Step 1: Set up the experiment**
- Split your training data into 80% for model fitting and 20% for validation
- Repeat this process **30 times** with different random splits
- Use `random_state=ID+i` where `i` goes from 0 to 29
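
The sketch below shows why the seed matters, using toy arrays and a placeholder `ID = 1234` (both are assumptions, not lab data). The helper `val_rows` is a hypothetical name used only here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

ID = 1234  # placeholder seed; use your own value in the lab

# Toy data: 20 rows, 2 features, two balanced classes.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)


def val_rows(i):
    """Return the first-column values of the validation rows for split i."""
    _, X_val, _, _ = train_test_split(
        X, y, test_size=0.2, random_state=ID + i, stratify=y
    )
    return sorted(X_val[:, 0].tolist())


# The same seed always reproduces the same split.
same_seed_same_split = val_rows(0) == val_rows(0)
print(same_seed_same_split)
print(val_rows(0))
print(val_rows(1))
```

Because each split uses `ID + i`, rerunning the notebook gives the same 30 splits every time, while the splits themselves still differ from one another.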

**Step 2: Test different tree complexities**
For each of the 30 splits, build 15 different decision trees using these `max_leaf_nodes` values:
```
[2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
```

**Step 3: Collect results**
- For each tree, measure its accuracy on the validation set
- You should collect **450 accuracy values** total (30 splits × 15 parameter values)

**What is max_leaf_nodes?**
- This parameter controls how complex your decision tree can be
- **Small values** (like 2, 4): Simple tree, might be too simple (underfitting)
- **Large values** (like 16384): Complex tree, might memorize training data (overfitting)
- We test different values to find the best balance
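
To see the effect of `max_leaf_nodes` in isolation, here is a small sketch on synthetic data generated with `make_classification`. The dataset, the seed, and the four limits are arbitrary choices for illustration and are unrelated to the forest cover data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem (not the lab dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

results = []
for limit in [2, 8, 32, 128]:
    model = DecisionTreeClassifier(max_leaf_nodes=limit, random_state=0)
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)  # accuracy on data the tree has seen
    val_acc = model.score(X_va, y_va)    # accuracy on held-out data
    results.append((limit, model.get_n_leaves(), train_acc, val_acc))
    print(f"limit={limit:>3}  leaves={model.get_n_leaves():>3}  "
          f"train={train_acc:.3f}  val={val_acc:.3f}")
```

A widening gap between training and validation accuracy as the limit grows is the usual sign of overfitting; `get_n_leaves()` shows how many leaves the tree actually used, which can be fewer than the limit.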

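**What is Cross-Validation?**
- We hold out part of the training data to test how well the model works on "unseen" data
- This lets us choose the best parameters without touching the test set
- This lab uses repeated random 80/20 splits (sometimes called Monte Carlo cross-validation) rather than fixed k-fold partitions
- Averaging over 30 different splits gives a more stable estimate of performance than a single split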

<div align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*GhKMAUmi4bfFiEwZCPlDsA.png" width="700" height="600">
</div>
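
For comparison only, scikit-learn can run repeated stratified random splits for you. The sketch below uses `StratifiedShuffleSplit` with `cross_val_score` on the built-in iris data; the dataset, the 5 splits, and the tree settings are assumptions for illustration. The lab still asks you to write the loop by hand.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5 random 80/20 splits that keep class proportions similar in each part.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0)
scores = cross_val_score(model, X, y, cv=splitter)  # one accuracy per split

print(scores)
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The spread of the scores is what the violin plot in Task 4 visualizes for each `max_leaf_nodes` value.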

In [None]:
# Task 3: Cross-Validation Experiment - Build and Validate Decision Trees
# Complete the missing parts marked with # COMPLETE comments

print("TASK 3: CROSS-VALIDATION EXPERIMENT")
print("="*50)

# COMPLETE: Step 1: Set up experiment parameters
max_leaf_nodes_values = [ ]

print(f"Testing {len(max_leaf_nodes_values)} different max_leaf_nodes values")
print(f"Will perform 30 splits × 15 parameters = {30 * 15} total experiments")

# Initialize storage for results
all_accuracies = []
all_max_leaf_nodes = []
all_split_numbers = []

# Step 2: Run the cross-validation experiment
print("\nStarting experiment...")

# COMPLETE: Create loop for 30 different train/validation splits
for i in range(xx):
    print(f"Split {i+1}/30...", end=" ")
    
    # COMPLETE: Split training data into 80% train, 20% validation
    # Use train_test_split with test_size=0.2, random_state=ID+i, stratify=y_train_full
    X_train, X_val, y_train, y_val = train_test_split(
        # COMPLETE: Fill in parameters
    )
    
    # COMPLETE: Test each max_leaf_nodes value
    for max_nodes in max_leaf_nodes_values:
        
        # COMPLETE: Create DecisionTreeClassifier with max_leaf_nodes=max_nodes, random_state=ID
        tree_model = DecisionTreeClassifier(
            # COMPLETE: Fill in parameters
        )
        
        # COMPLETE: Train the model
        # COMPLETE: Use .fit(X_train, y_train)
        
        # COMPLETE: Make predictions on validation set
        y_val_pred = # COMPLETE: Use .predict(X_val)
        
        # COMPLETE: Calculate accuracy
        accuracy = accuracy_score(# COMPLETE: y_val, y_val_pred)
        
        # Store results
        all_accuracies.append(accuracy)
        all_max_leaf_nodes.append(max_nodes)
        all_split_numbers.append(i + 1)
    
    print("✓")

print(f"\n✓ Experiment completed! Total results: {len(all_accuracies)}")

# Step 3: Organize results
# COMPLETE: Create DataFrame with results
results_df = pd.DataFrame({
    # COMPLETE: Fill in the three columns
    'split_number': ,
    'max_leaf_nodes': ,
    'validation_accuracy': 
})

print("First 10 results:")
print(results_df.head(10))

# Step 4: Analyze results
print("\nValidation accuracy summary:")
# COMPLETE: Show summary statistics using .describe()

# COMPLETE: Calculate mean accuracy for each max_leaf_nodes
summary_stats = results_df.groupby('max_leaf_nodes')['validation_accuracy'].agg([
    # COMPLETE: Add 'mean', 'std' aggregation functions
]).round(4)

print("\nMean performance by max_leaf_nodes:")
print(summary_stats)

# COMPLETE: Find best performing max_leaf_nodes
best_max_leaf_nodes = # COMPLETE: Use summary_stats['mean'].idxmax()

print(f"\nBest max_leaf_nodes based on validation: {best_max_leaf_nodes}")
print(f"Mean accuracy: {summary_stats.loc[best_max_leaf_nodes, 'mean']:.4f}")

print("\n✓ Task 3 completed! Ready for visualization.")

### Task 4: Create Violin Plot (1 mark)

**What you need to do:**
1. Create a **violin plot** using seaborn that shows the distribution of validation accuracies
2. Group results by `max_leaf_nodes` values (x-axis should show all 15 values)
3. Y-axis should show validation accuracy
4. The plot should clearly show 15 "violins" - one for each `max_leaf_nodes` value

**What is a Violin Plot?**
- Similar to a box plot but shows the full shape of the data distribution
- **Width** indicates how common different accuracy values are
- **Wider parts** = more common accuracy values
- **Narrower parts** = less common accuracy values

<div align="center">
  <img src="https://miro.medium.com/0*PHYwIcm5knwcrTek.png" width="600" height="500">
</div>
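
A minimal sketch of the plotting call on made-up data, so you can see the shape of the input that `sns.violinplot` expects: one row per measurement, one column for the group and one for the value. The column names and numbers here are invented.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Long-format toy data: 30 fake accuracy values for each of 3 settings.
toy = pd.DataFrame({
    "setting": np.repeat(["A", "B", "C"], 30),
    "score": np.concatenate([
        rng.normal(0.60, 0.02, 30),
        rng.normal(0.70, 0.01, 30),
        rng.normal(0.68, 0.04, 30),
    ]),
})

plt.figure(figsize=(6, 4))
ax = sns.violinplot(data=toy, x="setting", y="score")
ax.set_title("Toy violin plot (synthetic data)")
plt.tight_layout()
```

In this toy data, setting B should appear as a short, compact violin (little variation) and setting C as a taller one (more variation), which is the kind of difference to look for in your own plot.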

**Add a markdown cell explaining:**
- Which `max_leaf_nodes` value appears to perform best based on the violin plot?
- What patterns do you observe as `max_leaf_nodes` increases?

In [None]:
# Task 4: Create Violin Plot - Visualizing Validation Results
# Complete the missing parts marked with # COMPLETE comments

print("TASK 4: VIOLIN PLOT VISUALIZATION")
print("="*50)

# Step 1: Check our data from Task 3
print(f"We have {len(results_df)} validation results")
print(f"Testing {len(results_df['max_leaf_nodes'].unique())} different max_leaf_nodes values")

# Step 2: Create the violin plot
print("\nCreating violin plot...")

# Set up the figure size
plt.figure(figsize=(15, 8))

# COMPLETE: Create violin plot using sns.violinplot()
# Parameters needed: data=results_df, x='max_leaf_nodes', y='validation_accuracy'
sns.violinplot(
    # COMPLETE: Fill in the parameters
)

# Add labels and title
plt.title('Validation Accuracy Distribution for Different max_leaf_nodes Values')
plt.xlabel('max_leaf_nodes')
plt.ylabel('Validation Accuracy')

# COMPLETE: Rotate x-axis labels for better readability
# Use plt.xticks(rotation=45)

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Step 3: Find the best performing parameter
print("\nAnalyzing results...")

# COMPLETE: Calculate mean accuracy for each max_leaf_nodes value
# Use results_df.groupby('max_leaf_nodes')['validation_accuracy'].mean()
mean_accuracies = 

# COMPLETE: Find the max_leaf_nodes with highest mean accuracy
# Use mean_accuracies.idxmax()
best_max_leaf_nodes = 

print(f"Best max_leaf_nodes based on validation: {best_max_leaf_nodes}")
print(f"Mean validation accuracy: {mean_accuracies[best_max_leaf_nodes]:.4f}")

print("\n✓ Task 4 completed!")
print("Now write a markdown cell explaining your observations!")

### Task 5: Test Set Evaluation - Final Model Comparison (1 mark)

**What you need to do:**

**Step 1: Train on full training data**
- Use the complete training dataset (not split)
- Train 15 decision trees, one for each `max_leaf_nodes` value
- Test each tree on the test set and record the accuracy

**Step 2: Create line plot**
- Plot test accuracy (y-axis) vs `max_leaf_nodes` (x-axis)
- Use a line plot with markers to show the results
- Clearly mark the best-performing parameter value

**Step 3: Compare results**
Answer these questions in markdown cells:
- **Agreement**: Is the `max_leaf_nodes` value with the highest test accuracy the same as the one whose violin sits highest (highest mean validation accuracy) in your validation plot?
- **Analysis**: Do the validation and test results agree on the best parameter? Why or why not?

**Step 4: Visualize decision tree**
- Train a decision tree with `max_leaf_nodes=8` on the full training data
- Plot this decision tree using `sklearn.tree.export_graphviz` and `graphviz`
- Add a markdown cell explaining what you observe in the tree structure

In [None]:
# Task 5: Test Set Evaluation - Final Model Comparison
# Complete the missing parts marked with # COMPLETE comments

print("TASK 5: TEST SET EVALUATION")
print("="*50)

# Step 1: Train models on full training data and test on test set
print("Training models on full training data...")

test_accuracies = []

# COMPLETE: Train 15 decision trees with different max_leaf_nodes values
for max_nodes in max_leaf_nodes_values:
    print(f"Testing max_leaf_nodes={max_nodes}...")
    
    # COMPLETE: Create DecisionTreeClassifier with max_leaf_nodes=max_nodes, random_state=ID
    tree_model = DecisionTreeClassifier(
        # COMPLETE: Fill in parameters
    )
    
    # COMPLETE: Train on full training data
    # COMPLETE: Use .fit(X_train_full, y_train_full)
    
    # COMPLETE: Test on test set
    y_test_pred = # COMPLETE: Use .predict(X_test)
    
    # COMPLETE: Calculate test accuracy
    accuracy = accuracy_score(# COMPLETE: y_test, y_test_pred)
    
    test_accuracies.append(accuracy)
    print(f"  Test accuracy: {accuracy:.4f}")

print(f"\n✓ All models tested!")

# Step 2: Create line plot of test results
print("\nCreating test results plot...")

# COMPLETE: Find best test performance
best_test_idx = # COMPLETE: Use np.argmax(test_accuracies)
best_test_max_leaf_nodes = max_leaf_nodes_values[best_test_idx]
best_test_accuracy = test_accuracies[best_test_idx]

print(f"Best test performance: max_leaf_nodes={best_test_max_leaf_nodes}, accuracy={best_test_accuracy:.4f}")

# COMPLETE: Create line plot
plt.figure(figsize=(12, 8))
# COMPLETE: Use plt.plot() with max_leaf_nodes_values, test_accuracies, and markers
# (pass x and y positionally; plt.plot does not accept x= or y= keywords)
plt.plot(# COMPLETE: max_leaf_nodes_values, test_accuracies, marker='o', linewidth=2, markersize=8)

plt.title('Test Set Accuracy vs max_leaf_nodes')
plt.xlabel('max_leaf_nodes')
plt.ylabel('Test Accuracy')

# COMPLETE: Use log scale for x-axis since values vary widely
plt.xscale('log')

# COMPLETE: Mark the best performing point
plt.plot(best_test_max_leaf_nodes, best_test_accuracy, 'ro', markersize=12, 
         label=f'Best: {best_test_max_leaf_nodes} nodes')

plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Step 3: Compare validation and test results
print("\n" + "="*40)
print("VALIDATION vs TEST COMPARISON")
print("="*40)

# COMPLETE: Get the best max_leaf_nodes from validation (from Task 3)
best_validation_max_leaf_nodes = # COMPLETE: From Task 3 results

print(f"Best from validation: max_leaf_nodes = {best_validation_max_leaf_nodes}")
print(f"Best from test set: max_leaf_nodes = {best_test_max_leaf_nodes}")

# COMPLETE: Check if they agree
agreement = best_validation_max_leaf_nodes == best_test_max_leaf_nodes
print(f"Do validation and test agree? {agreement}")

if agreement:
    print("✓ Great! Validation successfully identified the best parameter.")
else:
    print("⚠ Validation and test disagree. Check how close the accuracies are: small gaps are often just random variation.")

# Step 4: Visualize decision tree with max_leaf_nodes=8
print("\n" + "="*40)
print("DECISION TREE VISUALIZATION")
print("="*40)

print("Training decision tree with max_leaf_nodes=8...")

# COMPLETE: Create and train tree with max_leaf_nodes=8
viz_tree = DecisionTreeClassifier(
    # COMPLETE: max_leaf_nodes=8, random_state=ID
)

# COMPLETE: Train on full training data
# COMPLETE: Use .fit(X_train_full, y_train_full)

# Get accuracy for this specific tree
viz_predictions = viz_tree.predict(X_test)
viz_accuracy = accuracy_score(y_test, viz_predictions)
print(f"Tree with max_leaf_nodes=8 test accuracy: {viz_accuracy:.4f}")

# COMPLETE: Visualize the tree using graphviz
print("\nCreating tree visualization...")

# COMPLETE: Export tree to graphviz format
dot_data = tree.export_graphviz(
    viz_tree,
    feature_names=list(X_train_full.columns),
    class_names=[str(i) for i in range(1, 8)],  # Cover types 1-7
    filled=True,
    rounded=True,
    out_file=None
)

# COMPLETE: Create graphviz visualization
# COMPLETE: graph = graphviz.Source(dot_data, format="png")
# COMPLETE: display(graph)  # a bare `graph` only renders when it is the last line of a cell

print("✓ Tree visualization completed!")

print("\n" + "="*50)
print("✓ TASK 5 COMPLETED!")
print("\nNow add markdown cells to discuss:")
print("- Do validation and test results agree?")
print("- What does the decision tree structure tell you?")
print("- Which features appear most important?")
print("="*50)

### Task 6: Discussion and Analysis - Results Interpretation

**Answer these questions in markdown cells:**

1. **Model Performance**: 
   - What was your best validation accuracy and best test accuracy?
   - Do these results seem reasonable for a 7-class classification problem?

2. **Parameter Selection**:
   - Did validation and test results agree on the best `max_leaf_nodes`?
   - Which approach (validation or test) should you trust for selecting parameters? Why?

3. **Overfitting vs Underfitting**:
   - Which `max_leaf_nodes` values likely caused underfitting? Why?
   - Which values likely caused overfitting? Why?
   - Where do you think the "sweet spot" is?

4. **Decision Tree Interpretation**:
   - Looking at your tree with `max_leaf_nodes=8`, which features seem most important?
   - Can you trace through a decision path and understand how the tree makes predictions?

5. **Real-world Application**:
   - How could this forest cover prediction model be useful in practice?
   - What limitations might this approach have?
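
When judging whether an accuracy is "reasonable" (question 1), one common reference point is the majority-class baseline: the accuracy you would get by always predicting the most frequent class. The sketch below computes it for a made-up label column; the values are not from the lab dataset.

```python
import pandas as pd

# Invented labels for illustration only.
y_toy = pd.Series([2, 2, 2, 1, 1, 2, 3, 2, 1, 7])

# Share of each class, largest first.
shares = y_toy.value_counts(normalize=True)

majority_class = shares.idxmax()
baseline_accuracy = shares.max()

print(majority_class)
print(baseline_accuracy)
```

A model is only doing useful work to the extent that it beats this baseline, which for imbalanced classes can be well above 1/7.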

## 5. Important Guidelines

### Technical Requirements:
- **Random Seeds**: Always use your student ID (last 4 digits) for random_state parameters
- **Pair Programming**: If working with a buddy, use the sum of both student IDs
- **Code Comments**: Add comments to explain what your code does
- **Markdown Explanations**: Use markdown cells to explain your observations and findings

### What to Include:
- Your name and student ID at the top
- All plots should be clearly labeled with titles and axis labels
- Markdown explanations for each major step
- Discussion of your results and findings
- Answers to all analysis questions

### Submission:
- Execute all code cells so outputs are visible
- Export (or print) your notebook to HTML
- Submit HTML to Canvas before the deadline
- **Both partners must submit if working in pairs**

---

## 6. Helpful Hints

### For Decision Tree Visualization:
```python
# Hint for plotting trees:
from sklearn import tree
import graphviz

dot_data = tree.export_graphviz(tree_model, out_file=None)
graph = graphviz.Source(dot_data, format="png")
graph
```

### For Advanced Visualization (Optional):
```python
# If you want to try a fancier tree visualization:
!pip install dtreeviz
import dtreeviz
# Note: You'll need to adjust class labels for dtreeviz (subtract 1 from class)
```
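
The label adjustment mentioned in the comment above can be sketched as follows, using a toy series rather than the real target column. The cover type names are the seven listed earlier in this handout; the variable names are assumptions.

```python
import pandas as pd

# Toy labels in the lab's 1-7 encoding.
y_toy = pd.Series([1, 2, 7, 5, 3, 4, 6])

# Shift to 0-6 so that labels can index a list of class names.
y_zero_based = y_toy - 1

class_names = [
    "Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine", "Cottonwood/Willow",
    "Aspen", "Douglas-fir", "Krummholz",
]

# Original class 7 becomes index 6.
print(class_names[y_zero_based.iloc[2]])
```

If you shift the labels for visualization, keep the original 1-7 column for the rest of the lab so your reported classes match the dataset description.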

### Installing Graphviz for Tree Visualization

- Download Graphviz from https://graphviz.org/download/ (choose the installer for your operating system)
- The `graphviz` Python package is separate from the Graphviz program; install it as well (for example with `pip install graphviz`)
- During installation, **check the option "Add Graphviz to system PATH"** if available
- If the PATH option wasn't available on Windows, manually add `C:\Program Files\Graphviz\bin` to your system PATH
- **Restart VS Code completely** after installation
- Test by running `dot -V` in command prompt to verify installation
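
While you are sorting out the Graphviz installation, scikit-learn's `export_text` can give you a quick plain-text view of a tree for checking your work. This is a sketch on the built-in iris data, not the lab dataset, and it does not replace the `export_graphviz` plot that Task 5 asks for.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Small tree so the printed rules stay short.
model = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0)
model.fit(iris.data, iris.target)

# One indented line per node: split rules for internal nodes, classes for leaves.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

Reading the indented rules from top to bottom traces the same decision paths you would follow in the graphical tree.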

### Remember:
- Learning is easier with a buddy - pair programming is encouraged!
- Don't share your work outside your pair
- Ask questions if you get stuck
- Focus on understanding the concepts, not just getting the code to work

---

## 7. Learning Objectives Check

By the end of this lab, you should be able to:
- [ ] Load and explore a real dataset
- [ ] Understand what decision trees are and how they work
- [ ] Use cross-validation to select model parameters
- [ ] Create and interpret violin plots
- [ ] Compare validation and test set performance
- [ ] Visualize and interpret decision trees
- [ ] Discuss overfitting, underfitting, and model complexity

**Good luck with your assignment!**