# Decision Tree Challenge

Feature Importance and Categorical Variable Encoding

# 🌳 Decision Tree Challenge - Feature Importance and Variable Encoding

## Challenge Overview

**Your Mission:** Create a simple GitHub Pages site that demonstrates
how decision trees measure feature importance and analyzes the critical
differences between categorical and numerical variable encoding. You’ll
answer two key discussion questions by adding narrative to a pre-built
analysis and posting those answers to your GitHub Pages site as a
rendered HTML document.

``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.ioff()  # Turn off interactive mode for Quarto
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Load data
sales_data = pd.read_csv('salesPriceData.csv')
sales_data.head()

# Prepare model data (treating zipCode as numerical)
model_vars = ['SalePrice', 'LotArea', 'YearBuilt', 'GrLivArea', 'FullBath',
              'HalfBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'GarageCars', 'zipCode']
model_data = sales_data[model_vars].dropna()

# Split data
X = model_data.drop('SalePrice', axis=1)
y = model_data['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Build decision tree
tree_model = DecisionTreeRegressor(max_depth=3,
                                  min_samples_split=20,
                                  min_samples_leaf=10,
                                  random_state=123)
tree_model.fit(X_train, y_train)

print(f"Model built with {tree_model.get_n_leaves()} terminal nodes")

# Prepare model data (treating zipCode as categorical - one-hot encoded)
model_vars_cat = ['SalePrice', 'LotArea', 'YearBuilt', 'GrLivArea', 'FullBath',
                  'HalfBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'GarageCars', 'zipCode']
model_data_cat = sales_data[model_vars_cat].dropna()

# One-hot encode zipCode
model_data_cat = pd.get_dummies(model_data_cat, columns=['zipCode'], prefix='zipCode')

# Split data
X_cat = model_data_cat.drop('SalePrice', axis=1)
y_cat = model_data_cat['SalePrice']
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(X_cat, y_cat, test_size=0.2, random_state=123)

# Build decision tree with categorical zipCode
tree_model_cat = DecisionTreeRegressor(max_depth=3,
                                     min_samples_split=20,
                                     min_samples_leaf=10,
                                     random_state=123)
tree_model_cat.fit(X_train_cat, y_train_cat)

print(f"Categorical model built with {tree_model_cat.get_n_leaves()} terminal nodes")
```

## Tree Visualization

### Python

``` python
#| fig-width: 20
#| fig-height: 12
# Visualize tree
plt.figure(figsize=(20, 12))
plot_tree(tree_model,
          feature_names=list(X_train.columns),  # a plain list works across scikit-learn versions
          filled=True,
          rounded=True,
          fontsize=10,
          max_depth=3)
plt.title("Decision Tree (zipCode as Numerical)")
plt.tight_layout()
```

``` python
#| fig-width: 20
#| fig-height: 12
# Visualize categorical tree
plt.figure(figsize=(20, 12))
plot_tree(tree_model_cat,
          feature_names=list(X_train_cat.columns),  # a plain list works across scikit-learn versions
          filled=True,
          rounded=True,
          fontsize=10,
          max_depth=3)
plt.title("Decision Tree (zipCode as Categorical - One-Hot Encoded)")
plt.tight_layout()
```

## Discussion Questions for Challenge

1.  **Numerical vs Categorical Encoding:** The two Python models above
    differ in how zip code is modelled: either as a numerical variable
    or as a categorical variable. Given what you know about zip codes
    and real estate prices, how should zip code be modelled, numerically
    or categorically? Is zip code an ordinal or non-ordinal variable?

# Answer

Calculate and visualize the feature importances for both decision tree
models (one with `zipCode` as numerical, one with `zipCode` as
categorical), then compare and interpret the results, summarizing the
impact of numerical vs. categorical encoding of `zipCode` on feature
importance.

## Calculate Numerical Feature Importance

### Subtask:

Calculate and store the feature importances from `tree_model`, which was
trained using `zipCode` as a numerical variable.

**Reasoning**: The task is to calculate and store feature importances
from `tree_model`. I will access the `feature_importances_` attribute,
create a pandas Series mapping these to feature names from
`X_train.columns`, and then sort them in descending order for clear
presentation.

``` python
#| fig-width: 10
#| fig-height: 6
feature_importances = pd.Series(tree_model.feature_importances_, index=X_train.columns)
feature_importances_numerical = feature_importances.sort_values(ascending=False)

print("Feature Importances (zipCode as Numerical):")
print(feature_importances_numerical)

# Plot feature importances for numerical zipCode
plt.figure(figsize=(10, 6))
feature_importances_numerical.plot(kind='bar')
plt.title('Feature Importances (zipCode as Numerical)')
plt.ylabel('Importance')
plt.xlabel('Features')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
```
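To make the `feature_importances_` attribute less of a black box, the sketch below recomputes impurity-based importance by hand: each split contributes its weighted impurity decrease to the feature it splits on, and the totals are normalized to sum to 1. The data here is synthetic and the names `X_demo`, `y_demo`, and `demo_tree` are illustrative assumptions, not part of the challenge dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical, NOT the challenge dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

demo_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)
t = demo_tree.tree_

manual = np.zeros(X_demo.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no contribution
        continue
    # Weighted impurity decrease achieved by this node's split
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    manual[t.feature[node]] += decrease

# Normalize so the importances sum to 1, as scikit-learn does
manual_importances = manual / manual.sum()

print(manual_importances)
print(demo_tree.feature_importances_)
```

A feature that is never chosen for a split accumulates no decrease, which is why an unused feature reports an importance of exactly 0.0.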

**Reasoning**: The subtask asked to calculate and visualize feature
importances for both models. I have already done this for the numerical
zipCode model. Now, I need to calculate and visualize the feature
importances for the `tree_model_cat`, which was trained with `zipCode`
as a categorical variable using one-hot encoding.

``` python
#| fig-width: 12
#| fig-height: 7
feature_importances_cat = pd.Series(tree_model_cat.feature_importances_, index=X_train_cat.columns)
feature_importances_categorical = feature_importances_cat.sort_values(ascending=False)

print("Feature Importances (zipCode One-Hot Encoded):")
print(feature_importances_categorical)

# Plot feature importances for categorical zipCode
plt.figure(figsize=(12, 7))
feature_importances_categorical.plot(kind='bar')
plt.title('Feature Importances (zipCode One-Hot Encoded)')
plt.ylabel('Importance')
plt.xlabel('Features')
plt.xticks(rotation=90, ha='center')  # vertical labels centered under their ticks
plt.tight_layout()
```

## Comparison and Interpretation of Feature Importances

We have analyzed two decision tree models, one treating `zipCode` as a
numerical variable and another treating it as a categorical variable
(one-hot encoded).

### Observations:

**1. Numerical `zipCode` Model:**

-   **`zipCode` Importance:** In the model where `zipCode` was treated
    numerically, its importance was 0.0, indicating it was not used for
    any splits in the tree. This is consistent with the fact that the
    numerical ordering of zip codes (e.g., 50010 < 50011) holds no
    intrinsic meaning for predicting house prices.
-   **Top Features:** `GrLivArea`, `GarageCars`, `YearBuilt`, and
    `TotRmsAbvGrd` were identified as the most important features.

**2. Categorical `zipCode` Model (One-Hot Encoded):**

-   **`zipCode` Importance:** Even after one-hot encoding, none of the
    individual `zipCode` dummy variables (`zipCode_50011`,
    `zipCode_50012`, etc.) showed any importance (all 0.0). This
    suggests that within the limited `max_depth` (3) of our decision
    tree and the `min_samples_split`/`min_samples_leaf` constraints, no
    single `zipCode` category was deemed significant enough to create a
    split that substantially reduced impurity.
-   **Top Features:** The top features remain the same (`GrLivArea`,
    `GarageCars`, `YearBuilt`, `TotRmsAbvGrd`) and their importances are
    identical to the numerical `zipCode` model.
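Because one-hot encoding spreads a single variable across many columns, it can help to sum the dummy importances back into one `zipCode` total before comparing against the numerical model. The following is a minimal sketch on synthetic data; the frame `demo`, the price formula, and the zip premium are hypothetical and only illustrate the aggregation step.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical, NOT the challenge dataset)
rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    "GrLivArea": rng.uniform(800, 3000, n),
    "zipCode": rng.choice([50010, 50011, 50012, 50013], n),
})
# Assumed price: an area effect plus a premium for one zip code
demo["SalePrice"] = (100 * demo["GrLivArea"]
                     + np.where(demo["zipCode"] == 50012, 80000, 0)
                     + rng.normal(0, 5000, n))

X_demo = pd.get_dummies(demo[["GrLivArea", "zipCode"]],
                        columns=["zipCode"], prefix="zipCode", dtype=float)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, demo["SalePrice"])
imp = pd.Series(tree.feature_importances_, index=X_demo.columns)

# Map every zipCode_* dummy back to its source variable, then sum per variable
source = imp.index.to_series().str.replace(r"^zipCode_.*$", "zipCode", regex=True)
grouped_importances = imp.groupby(source.values).sum().sort_values(ascending=False)

print(grouped_importances)
```

On the challenge data the same grouping could be applied to `feature_importances_categorical`; if every dummy is 0.0, the grouped total is 0.0 as well.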

### Impact of Encoding:

In this specific scenario with a shallow tree (`max_depth=3`) and the
given data, the encoding of `zipCode` (numerical vs. one-hot encoded
categorical) did not change the **overall ranked importance** of the
other numerical features (`GrLivArea`, `GarageCars`, `YearBuilt`,
`TotRmsAbvGrd`). Both models yielded the same dominant features with
identical importance scores.

However, the lack of importance for *any* `zipCode` representation
(either numerical or individual one-hot encoded dummies) is notable.
While `zipCode` is generally considered an important factor in real
estate, its individual one-hot encoded features did not contribute to
the model’s splits in this instance. This could be due to several
factors:

-   **Tree Depth:** A shallow tree (max_depth=3) might not be able to
    capture the complex relationships or small improvements in impurity
    offered by individual zip code categories.
-   **One-Hot Encoding Sparsity:** One-hot encoding creates many new
    features, making the data sparse. With a limited number of splits,
    the tree might prioritize more impactful continuous features.
-   **Lack of Strong Individual Zip Code Effect:** It’s possible that,
    alongside other strong predictors like `GrLivArea`, no *single* zip
    code dummy variable has enough individual predictive power to be
    selected for a split at this tree depth.
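One way to probe the tree-depth explanation is to refit the one-hot model at several depths and track the total importance of the `zipCode_*` columns. The helper `zip_importance_by_depth` below is a hypothetical name introduced for this sketch, and the demo data is synthetic; on the challenge data it could be called with `X_train_cat` and `y_train_cat` instead.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def zip_importance_by_depth(X, y, depths, prefix="zipCode_"):
    """Total importance of all columns starting with `prefix`, per max_depth."""
    totals = {}
    for depth in depths:
        model = DecisionTreeRegressor(max_depth=depth, min_samples_split=20,
                                      min_samples_leaf=10, random_state=123)
        model.fit(X, y)
        imp = pd.Series(model.feature_importances_, index=X.columns)
        totals[depth] = imp[imp.index.str.startswith(prefix)].sum()
    return pd.Series(totals, name="zipCode_total_importance")

# Synthetic stand-in data (hypothetical, NOT the challenge dataset)
rng = np.random.default_rng(2)
n = 600
demo = pd.DataFrame({
    "GrLivArea": rng.uniform(800, 3000, n),
    "zipCode": rng.choice([50010, 50011, 50012, 50013], n),
})
# Assumed price: a large area effect plus small per-zip offsets
zip_offset = demo["zipCode"].map({50010: 0, 50011: 15000, 50012: -10000, 50013: 5000})
price = 100 * demo["GrLivArea"] + zip_offset + rng.normal(0, 5000, n)

X_demo = pd.get_dummies(demo, columns=["zipCode"], prefix="zipCode", dtype=float)
depth_sweep = zip_importance_by_depth(X_demo, price, depths=[1, 3, 5, 8])
print(depth_sweep)
```

If the zip code total stays at zero even for deep trees, the "lack of strong individual effect" explanation becomes more plausible than the depth limit alone.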

### Conclusion:

Encoding categorical variables correctly is crucial for model
interpretability and for avoiding meaningless numerical splits. In this
particular limited-depth decision tree, however, `zipCode` (whether
numerical or one-hot encoded) did not emerge as an important feature.
The problem description itself highlights that a split such as ‘zipCode
< 50012.5’ is meaningless. One-hot encoding removes this meaningless
numerical order, even though the individual dummy variables didn’t
contribute to the tree splits here. Had the numerical model split on
`zipCode`, its importance score would have rested on an arbitrary
ordering and could have been misleading. One-hot encoding, while also
showing zero importance in this shallow tree, at least correctly
represents the variable’s categorical nature.
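The point about meaningless thresholds can be illustrated with a small synthetic sketch. It assumes a made-up situation in which only the numerically middle zip code carries a price premium, then compares a one-split tree on the raw numeric code with a one-split tree on one-hot dummies. All data and names here are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical, NOT the challenge dataset)
rng = np.random.default_rng(3)
n = 500
zips = rng.choice([50010, 50011, 50012, 50013, 50014], n)
# Assumed: only the numerically "middle" zip code is expensive
price = 200000 + np.where(zips == 50012, 100000, 0) + rng.normal(0, 5000, n)

X_numeric = pd.DataFrame({"zipCode": zips})
X_onehot = pd.get_dummies(X_numeric, columns=["zipCode"], prefix="zipCode", dtype=float)

# A single split each, so the comparison isolates what one threshold can express
numeric_stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_numeric, price)
onehot_stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_onehot, price)

r2_numeric = numeric_stump.score(X_numeric, price)
r2_onehot = onehot_stump.score(X_onehot, price)
root_feature = X_onehot.columns[onehot_stump.tree_.feature[0]]

print(f"Numeric threshold split R^2: {r2_numeric:.3f}")
print(f"One-hot split R^2:           {r2_onehot:.3f} (root split on {root_feature})")
```

A numeric threshold can only separate "low" codes from "high" codes, so it has to group the expensive zip code with cheap neighbours, while a dummy split can isolate it directly. Whether this pattern holds for the challenge data is a separate empirical question.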

2.  **R vs Python Implementation Differences:** When modelling zip code
    as a categorical variable, the output tree and feature importance
    would differ quite significantly had you used R as opposed to
    Python. Investigate why this is the case. What does R offer that
    Python does not? Which language would you say does a better job of
    modelling zip code as a categorical variable? Can you quote the
    documentation at <https://scikit-learn.org/stable/modules/tree.html>
    suggesting a weakness in the Python implementation? If so, please
    provide a quote from the documentation.

3.  **Are There Any Suggestions for Implementing Decision Trees in
    Python With Proper Categorical Handling?** Please poke around the
    Internet (AI is not as helpful with new libraries) for suggestions
    on how to implement decision trees in Python with better (i.e., not
    one-hot encoding) categorical handling. Please provide a link to the
    source and a quote from the source. There is no right answer here,
    but please provide a thoughtful answer; I am curious to see what you
    find.