### **Chunk 5: Essential Tree-Based Models**

#### **1. Concept Introduction**

Linear models fit a single line or hyperplane to the data, either to predict a value or to separate classes. Tree-based models work differently. They learn by asking a series of simple "yes/no" questions that split the data into smaller and smaller groups.

1.  **`DecisionTreeClassifier`/`DecisionTreeRegressor`**:
    -   **How it works**: Imagine a flowchart. The model starts with all your data at the top (the "root"). It then finds the best possible feature and split-point (e.g., "Is `median_income` <= 3.5?") that divides the data into two purer groups (the "branches"). It repeats this process on each new group, creating a tree structure. To make a prediction for a new data point, it simply follows the path down the tree from the root to a final "leaf" node and returns the majority class or average value of the training samples that ended up in that leaf.
    -   **Pros**: Extremely easy to understand and visualize. The decision-making process is transparent.
    -   **Cons**: Their biggest weakness is a tendency to **overfit**. A single decision tree can keep splitting until every single data point is in its own leaf, perfectly memorizing the training data but failing to generalize to new data.

2.  **`RandomForestClassifier`/`RandomForestRegressor`**:
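    -   **How it works**: This is an **ensemble** model that addresses the overfitting problem of a single decision tree. Instead of relying on one tree, it builds many trees (a "forest"), often hundreds, and combines their predictions. Each tree is usually still grown deep; it is the averaging across trees that reduces overfitting, not making the individual trees simpler.
    -   The "Random" part is key:
        1.  Each tree is trained on a different random sample of the training data, drawn with replacement (a bootstrap sample).
        2.  At each split point, each tree considers only a random subset of the features. In scikit-learn this is controlled by `max_features`: `RandomForestClassifier` defaults to the square root of the feature count, while `RandomForestRegressor` defaults to all features, so for regression only the bootstrap sampling is active unless you lower it.
    -   This randomness makes the individual trees diverse, so each makes somewhat different errors. To make a final prediction, the Random Forest averages the predictions of all the individual trees (for regression) or takes a majority vote (for classification). This "wisdom of the crowd" approach typically reduces overfitting substantially compared with a single tree and gives a more robust model.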
    -   **Feature Importance**: A useful byproduct of tree-based models is that feature importance is easy to calculate. The model tracks how much each feature contributes to reducing impurity (making better splits) across all the trees in the forest. This helps you see which features drive the predictions. These impurity-based scores are computed on the training data and tend to favor features with many distinct values, so treat them as a guide rather than a definitive ranking.
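
To make the "find the best feature and split-point" step concrete, here is a minimal sketch of a single split search for a regression tree. It is illustrative only: `variance_sum` and `best_split` are hypothetical helpers written for this example, the toy data is made up, and scikit-learn's real implementation is far more efficient.

```python
# Minimal sketch of one split search for a regression tree (illustrative only).
# `variance_sum` and `best_split` are hypothetical helpers, not scikit-learn APIs.

def variance_sum(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets):
    """Return (feature_index, threshold, cost) with the lowest total squared error."""
    best = (None, None, variance_sum(targets))  # baseline: no split at all
    n_features = len(rows[0])
    for j in range(n_features):
        values = sorted(set(row[j] for row in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            cost = variance_sum(left) + variance_sum(right)
            if cost < best[2]:
                best = (j, threshold, cost)
    return best


# Toy data (made up): column 0 drives the target, column 1 is noise.
rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 9.0], [5.0, 4.0], [6.0, 8.0], [7.0, 2.0]]
targets = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]

feature, threshold, cost = best_split(rows, targets)
print(f"Best question: is feature {feature} <= {threshold}?")
```

A full tree would apply the same search recursively to the left and right groups until a stopping rule (such as a maximum depth) is reached.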

#### **2. Dataset EDA: California Housing Dataset**

The goal is to predict the median house value for California districts, based on features from the 1990 census. This is a classic regression task perfect for demonstrating the power of tree-based models and their feature importance capabilities.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Set plot style
sns.set_style('whitegrid')

`Load Data`

In [None]:
housing  = fetch_california_housing()
df = pd.DataFrame(data=housing.data,
                  columns=housing.feature_names)
df['MedHouseVal'] = housing.target
df.info()

In [None]:
df.head()

`Basic Statistics`

In [None]:
df.describe()

`Check for missing values`

In [None]:
df.isnull().sum()

`Target Variable Distribution (Median House Value)`

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df['MedHouseVal'], bins=50,
             kde=True)
plt.title('Distribution of Median House Value in California')
plt.xlabel('Median House Value ($100,000s)')
plt.ylabel('Frequency')
plt.show()

***Correlation Matrix Heatmap***

In [None]:
plt.figure(figsize=(12, 10))
corr_mat  = df.corr()
sns.heatmap(corr_mat,
            annot=True,
            cmap='YlGnBu',
            fmt='.2f')
plt.title('Correlation Matrix of California Housing Features')
plt.show()
# 'MedInc' (Median Income) has the strongest positive correlation with the target.


#### **3. Minimal Working Example: Overfitting and the Random Forest Solution**

One of the best things about tree models is that they are **not sensitive to feature scaling**. Each split compares a single feature against a threshold, so only the ordering of that feature's values matters, not their magnitude. We can therefore skip the `StandardScaler` step.
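
If you want to convince yourself of this, the following sketch fits the same tree on raw and rescaled features and compares the predictions. The data is synthetic (an assumption made so the example runs on its own, not the California Housing set), and the scale factors are arbitrary powers of two chosen so the rescaling is exact in floating point.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption; not the California Housing set).
# Values are rounded so that distinct feature values stay clearly separated.
X = rng.normal(size=(300, 3)).round(2)
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)

# Rescale each column by a different positive factor.
scale = np.array([1024.0, 0.5, 8.0])
X_scaled = X * scale

# Same hyperparameters and seed; only the feature scale differs.
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y)

# New points must be rescaled the same way before going into the second tree.
X_new = rng.normal(size=(50, 3)).round(2)
pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(X_new * scale)

print("Predictions identical:", np.allclose(pred_raw, pred_scaled))
```

The thresholds inside the two trees differ, but they partition the samples in the same way, which is why the predictions agree.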

In [None]:
# Imports, data, and splitting
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X = housing.data
y = housing.target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Attempt 1: A single Decision Tree (demonstrating overfitting)**

In [None]:
# Train an unconstrained Decision Tree
# By not setting max_depth, the tree can grow as deep as it wants
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

**Evaluate the Decision Tree**

In [None]:
y_pred_tree  = tree_reg.predict(X_test)
rmse_tree    = np.sqrt(mean_squared_error(y_test, y_pred_tree))

# Also check performance on the training set
y_train_pred_tree = tree_reg.predict(X_train)
rmse_train_tree   = np.sqrt(mean_squared_error(y_true=y_train, y_pred=y_train_pred_tree))

print(f"Decision Tree TRAIN RMSE: {rmse_train_tree:.4f}")
print(f"Decision Tree TEST RMSE:  {rmse_tree:.4f}")
# The TRAIN RMSE is essentially 0.0, a classic sign of heavy overfitting.
# The model has memorized the training data but performs much worse on new data.


**Attempt 2: This time with Random Forest**

In [None]:
# Train a Random Forest Regressor
# n_estimators is the number of trees in the forest.
# n_jobs = -1 uses all available CPU cores for faster training.
forest_reg = RandomForestRegressor(n_estimators=100,
                                   random_state=42,
                                   n_jobs=-1)
forest_reg.fit(X_train, y_train)

**Evaluate the Random Forest**

In [None]:
y_pred_forest = forest_reg.predict(X_test)
rmse_forest = np.sqrt(mean_squared_error(y_test, y_pred_forest))

# Check performance on the TRAINING set again
y_train_pred_forest = forest_reg.predict(X_train)
rmse_train_forest = np.sqrt(mean_squared_error(y_train, y_train_pred_forest))

print(f"Random Forest TRAIN RMSE: {rmse_train_forest:.4f}")
print(f"Random Forest TEST RMSE:  {rmse_forest:.4f}")
# Notice two things:
# 1. The TEST RMSE is noticeably better than the single decision tree's.
# 2. The gap between TRAIN and TEST RMSE is smaller. A gap remains because each tree
#    in the forest is still fully grown, but the averaged model generalizes better.
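
The averaging step described earlier can be checked by hand. This sketch uses a small synthetic dataset (an assumption, so the block runs without downloading anything) and the fitted forest's `estimators_` attribute, which holds the individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption; not the California Housing set).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# One row of predictions per tree, for the first 10 samples.
per_tree = np.stack([tree.predict(X[:10]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("Per-tree prediction matrix shape:", per_tree.shape)
print("Spread across trees (std), first sample:", per_tree[:, 0].std())
print("Matches forest.predict:", np.allclose(manual_average, forest.predict(X[:10])))
```

The spread across trees shows how much the individual trees disagree; averaging them is what smooths out those individual errors.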

#### **4. Feature Importance Extraction**
Now for the best part. Let's see what the Random Forest learned.

In [None]:
imps = forest_reg.feature_importances_
feature_names = housing.feature_names

Create a Pandas DataFrame for easier visualization

In [None]:
featu_imp_df = pd.DataFrame({
    'feature': feature_names,
    'importances': imps
}).sort_values('importances', ascending=False)
featu_imp_df

##### Plotting

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x='importances', y='feature', data=featu_imp_df)
plt.title('Feature Importance for California Housing Prediction')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
# As the EDA suggested, Median Income is by far the most important feature.
# The geographical coordinates (Latitude, Longitude) are also highly important.
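
Because `feature_importances_` is derived from impurity reductions on the training data, it can be worth cross-checking the ranking. One option, shown here as a sketch rather than as part of the workflow above, is scikit-learn's `permutation_importance`, which measures how much the score on held-out data drops when a feature's values are shuffled. The data is synthetic (an assumption) so that the true drivers of the target are known.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): only the first two columns
# influence the target; the remaining two are pure noise.
X = rng.normal(size=(500, 4))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Impurity-based importances come from training; they sum to 1.
impurity_imp = forest.feature_importances_

# Permutation importances are measured on the held-out test set.
perm = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=0)

for j in range(X.shape[1]):
    print(
        f"feature {j}: impurity={impurity_imp[j]:.3f}  "
        f"permutation={perm.importances_mean[j]:.3f}"
    )
```

When both methods agree on the top features, as they should on this toy data, you can be more confident in the ranking.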

#### **5. Common Pitfalls**

1.  **Shipping an Overfit Decision Tree**: Avoid using a single, unconstrained decision tree as your final model. It's a great tool for learning, but a Random Forest usually predicts better.
2.  **Choosing `n_estimators`**: More trees (`n_estimators`) is generally better, but it comes at the cost of longer training time and more memory. Performance usually plateaus after a few hundred trees. 100 (scikit-learn's default) is a common and effective starting point.
3.  **Ignoring Hyperparameters**: We used the defaults here, but tuning parameters like `max_depth` (how deep each tree can be) and `min_samples_leaf` (minimum samples required to be at a leaf node) is crucial for getting the best performance. We will cover this with `GridSearchCV` soon.
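
As a rough illustration of pitfalls 1 and 3, the sketch below compares an unconstrained tree with two constrained ones. The dataset is synthetic and the specific values (`max_depth=6`, `min_samples_leaf=10`) are arbitrary assumptions for demonstration, not tuned recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the California Housing set).
X, y = make_regression(n_samples=1000, n_features=8, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def rmse(model, X_part, y_part):
    """Root mean squared error of `model` on the given split."""
    return float(np.sqrt(mean_squared_error(y_part, model.predict(X_part))))


# Arbitrary example settings, not tuned values.
settings = {
    "unconstrained": {},
    "max_depth=6": {"max_depth": 6},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
}

results = {}
for name, params in settings.items():
    tree = DecisionTreeRegressor(random_state=0, **params).fit(X_train, y_train)
    results[name] = (rmse(tree, X_train, y_train), rmse(tree, X_test, y_test))
    print(f"{name:>20}: train RMSE={results[name][0]:.2f}  test RMSE={results[name][1]:.2f}")
```

The unconstrained tree drives its training error to zero, while the constrained trees accept some training error. Whether that trade improves the test error depends on the data, which is why these settings are worth tuning rather than guessing.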


# Move on to Chunk 06 Pipelines