### HW 4: Build and evaluate regression models

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing
    
    

In [36]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

In [37]:
from sklearn.metrics import mean_squared_error as mse, r2_score as r_2

# Follow this link to find more metrics for regression:
# https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics
def regression_metrics(y_train, y_train_pred, y_test, y_test_pred):
    '''report the mse and r2 scores for both the training and test sets
    '''
    print('MSE Train: ', round(mse(y_train, y_train_pred), 3))
    print('MSE Test: ', round(mse(y_test, y_test_pred), 3))
    print('R^2 Train: ', round(r_2(y_train, y_train_pred), 3))
    print('R^2 Test: ', round(r_2(y_test, y_test_pred), 3))

#### Load the dataset

In [38]:
#Load the California housing dataset
ds = fetch_california_housing()
X = ds.data
y = ds.target
print('dataset size:', X.shape, y.shape)
print('Data in ds', dir(ds))
print(ds.DESCR)

dataset size: (20640, 8) (20640,)
Data in ds ['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house 

In [39]:
# data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
print('training set size:', X_train.shape[0])
print('test set size:', X_test.shape[0])

training set size: 14448
test set size: 6192


#### 1 Train and evaluate a linear regression model using OLS. 20 points.
    1) Train a linear regression model
    2) Evaluate the model and print out the RMS and $(R^2)$ values on the training and test sets
    3) What issues can you observe from the results? Any possible solutions?

In [40]:
#OLS: ordinary least squares. 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the Linear Regression model
lr_model = LinearRegression()

# Train the model
lr_model.fit(X_train, y_train)

# Predict on training and test data
y_train_pred = lr_model.predict(X_train)
y_test_pred = lr_model.predict(X_test)

# Evaluate the model
mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

# Print the evaluation results
print("Training MSE:", round(mse_train, 3))
print("Test MSE:", round(mse_test, 3))
print("Training R^2:", round(r2_train, 3))
print("Test R^2:", round(r2_test, 3))

# Analyzing the results
print("Issues observed and possible solutions:")
if r2_test < r2_train:
    print("1. Overfitting: The model performs well on training data but not on unseen test data.")
    print("   Solution: Try regularization methods like Ridge or Lasso.")
if mse_test > mse_train:
    print("2. High error on test set indicates potential overfitting or underfitting.")
    print("   Solution: Adjust model complexity, increase data, or improve feature selection.")


Training MSE: 0.517
Test MSE: 0.543
Training R^2: 0.611
Test R^2: 0.593
Issues observed and possible solutions:
1. Overfitting: The model performs well on training data but not on unseen test data.
   Solution: Try regularization methods like Ridge or Lasso.
2. High error on test set indicates potential overfitting or underfitting.
   Solution: Adjust model complexity, increase data, or improve feature selection.
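
The task statement asks for the RMS error, while the cell above reports MSE. The two are related by a square root. A minimal sketch of the conversion, using only the MSE values printed above (the names `reported_mse` and `rmse` are illustrative and not part of the original notebook):

```python
import numpy as np

# MSE values copied from the output of the linear regression cell
reported_mse = {"train": 0.517, "test": 0.543}

# RMSE is the square root of MSE, expressed in the same units as the target
rmse = {split: round(float(np.sqrt(value)), 3) for split, value in reported_mse.items()}
print(rmse)
```

Because RMSE is in the units of the target rather than squared units, it is usually the easier of the two to interpret.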


##### 3) Question: What issues can you observe from the results? Any possible solutions? 10 points

Response:

Slight overfitting:
 - Evidence: The training R² (0.611) is higher than the test R² (0.593). The gap is small (about 0.02), so any overfitting is mild; a modest drop from training to test performance is expected for most fitted models.
 - Possible solutions:
   - Regularization: Ridge or Lasso regression penalizes large coefficients, which reduces the model's effective complexity.
   - Feature selection: Evaluate feature importance and remove irrelevant features that mainly add noise.

Higher test error:
 - Evidence: The MSE is higher on the test set (0.543) than on the training set (0.517), so the model generalizes slightly worse to unseen data than its training error suggests.
 - Possible solutions:
   - Increase model complexity: If underfitting is suspected (a purely linear model may miss nonlinear patterns), add polynomial features or interaction terms.
   - Cross-validation: Evaluate the model across several train/validation splits to check that the scores do not depend on one particular split.
   - More data: A larger or more diverse training set can improve generalization.

Moderate overall fit:
 - Insight: Both R² scores are around 0.6, so the model explains only about 60% of the variance in the target, on the training set as well as on the test set. Because the training score itself is low, underfitting is the more important issue here; an R² of 1 would indicate a perfect fit.
 - Possible actions:
   - More flexible models: Besides regularization, consider decision trees, random forests, or other ensemble methods that can capture nonlinear patterns.
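
The remedies listed above (regularization, polynomial features, cross-validation) can be compared in one loop. The following is a minimal sketch, not the notebook's own method: it assumes synthetic stand-in data from `make_regression` with an added quadratic effect, and the dictionary keys and the `alpha=1.0` setting are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): 8 features plus one quadratic effect
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
y_demo = y_demo + 30.0 * X_demo[:, 0] ** 2

candidates = {
    "ols": LinearRegression(),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "poly2_ridge": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),
        Ridge(alpha=1.0),
    ),
}

# 5-fold cross-validated R^2 for each candidate model
cv_r2 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

On data with a real nonlinear component, the polynomial pipeline should score higher than plain OLS; on the housing data the ranking would have to be checked by running the same loop on `X_train` and `y_train`.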


#### 2 Train and evaluate the decision tree approach? 50 points.

    1) Train a decision tree model. Please tune the arguments, 'criterion', 'max_depth', and 'min_samples_leaf', to achieve good performance
    2) Print out the depth of the tree, the number of leaves, and the importance of each feature
    3) Evaluate the model and print out the RMS and $R^2$ values on the training and test sets
    4) Show the tree using tree.export_text() in sklearn
    5) Print out the decision path for data sample X_test[0]
    6) Test different max_depth values and analyse the results?

In [41]:
# Decision tree
from sklearn.tree import DecisionTreeRegressor, export_text
dt_model = DecisionTreeRegressor(criterion='squared_error', max_depth=10, min_samples_leaf=20)
dt_model.fit(X_train, y_train)

# Print the decision tree model attributes
print("Decision Tree - Depth of the tree:", dt_model.get_depth())
print("Decision Tree - Number of leaves:", dt_model.get_n_leaves())
print("Decision Tree - Feature importances:", dict(zip(ds.feature_names, dt_model.feature_importances_)))

# Evaluate the decision tree model
y_train_pred_dt = dt_model.predict(X_train)
y_test_pred_dt = dt_model.predict(X_test)
mse_train_dt = mse(y_train, y_train_pred_dt)
mse_test_dt = mse(y_test, y_test_pred_dt)
r2_train_dt = r_2(y_train, y_train_pred_dt)
r2_test_dt = r_2(y_test, y_test_pred_dt)
print("Decision Tree - Training MSE:", round(mse_train_dt, 3))
print("Decision Tree - Test MSE:", round(mse_test_dt, 3))
print("Decision Tree - Training R^2:", round(r2_train_dt, 3))
print("Decision Tree - Test R^2:", round(r2_test_dt, 3))

# Display the decision tree using textual representation
tree_text = export_text(dt_model, feature_names=ds.feature_names)
print(tree_text)

# Print the decision path for the first test sample
# X_test is a NumPy array, so slice it directly (.iloc exists only on pandas objects)
path = dt_model.decision_path(X_test[:1])
print("Decision path for the first test sample:", path)



Decision Tree - Depth of the tree: 10
Decision Tree - Number of leaves: 336
Decision Tree - Feature importances: {'MedInc': 0.6483682099511283, 'HouseAge': 0.0375038075743299, 'AveRooms': 0.032364554230063326, 'AveBedrms': 0.006510907284394702, 'Population': 0.0035173208874631446, 'AveOccup': 0.1421169322865336, 'Latitude': 0.06336834534579244, 'Longitude': 0.06624992244029478}
Decision Tree - Training MSE: 0.29
Decision Tree - Test MSE: 0.4
Decision Tree - Training R^2: 0.782
Decision Tree - Test R^2: 0.7
|--- MedInc <= 5.03
|   |--- MedInc <= 3.07
|   |   |--- AveRooms <= 4.22
|   |   |   |--- AveOccup <= 2.50
|   |   |   |   |--- MedInc <= 2.19
|   |   |   |   |   |--- AveRooms <= 3.33
|   |   |   |   |   |   |--- Population <= 1208.00
|   |   |   |   |   |   |   |--- Latitude <= 37.51
|   |   |   |   |   |   |   |   |--- Latitude <= 34.01
|   |   |   |   |   |   |   |   |   |--- value: [1.63]
|   |   |   |   |   |   |   |   |--- Latitude >  34.01

In [42]:
#4) explore the trained decision tree using export_text(). 5 points
from sklearn import tree
from sklearn.tree import export_text

# Export and print the text representation of the trained decision tree
tree_text = export_text(dt_model, feature_names=ds.feature_names)
print("Text representation of the trained Decision Tree:")
print(tree_text)

Text representation of the trained Decision Tree:
|--- MedInc <= 5.03
|   |--- MedInc <= 3.07
|   |   |--- AveRooms <= 4.22
|   |   |   |--- AveOccup <= 2.50
|   |   |   |   |--- MedInc <= 2.19
|   |   |   |   |   |--- AveRooms <= 3.33
|   |   |   |   |   |   |--- Population <= 1208.00
|   |   |   |   |   |   |   |--- Latitude <= 37.51
|   |   |   |   |   |   |   |   |--- Latitude <= 34.01
|   |   |   |   |   |   |   |   |   |--- value: [1.63]
|   |   |   |   |   |   |   |   |--- Latitude >  34.01
|   |   |   |   |   |   |   |   |   |--- value: [2.29]
|   |   |   |   |   |   |   |--- Latitude >  37.51
|   |   |   |   |   |   |   |   |--- value: [1.36]
|   |   |   |   |   |   |--- Population >  1208.00
|   |   |   |   |   |   |   |--- value: [2.62]
|   |   |   |   |   |--- AveRooms >  3.33
|   |   |   |   |   |   |--- AveOccup <= 1.97
|   |   |   |   |   |   |   |--- Longitude <= -118.19
|   |   |   |   |   |   |   |   |--- Latitude <= 37.95
|   |   |   |   |   |   |   |   |   |--- valu

In [43]:
#5) print out the decision path for the first test data sample X_test[0]. 5 points
# Print the first test data sample to see what it looks like
print("First test data sample:")
print(X_test[0])

# decision_path returns a sparse indicator matrix; for a single sample,
# its .indices attribute lists the ids of the visited nodes, root first
path = dt_model.decision_path(X_test[0:1])
path_indices = path.indices

print("Decision path for the first test sample:")
for node_index in path_indices:
    split_feature = dt_model.tree_.feature[node_index]
    # Leaf nodes store a negative placeholder (-2) for feature and threshold,
    # so they must not be reported as splits
    if split_feature < 0:
        print(f"Node {node_index} is a leaf")
        continue
    split_feature_name = ds.feature_names[split_feature]
    threshold = dt_model.tree_.threshold[node_index]
    value = X_test[0, split_feature]
    # Report which branch the sample takes at this node (the root is included)
    sign = "<=" if value <= threshold else ">"
    print(f"Node {node_index}: {split_feature_name} = {value:.2f} {sign} {threshold:.2f}")



First test data sample:
[ 4.15180000e+00  2.20000000e+01  5.66307278e+00  1.07547170e+00
  1.55100000e+03  4.18059299e+00  3.25800000e+01 -1.17050000e+02]
Decision path for the first test sample:
Node 0: MedInc = 4.15 <= 5.03
Node 1: MedInc = 4.15 > 3.07
Node 255: AveOccup = 4.18 > 2.34
Node 335: MedInc = 4.15 > 3.93
Node 397: AveOccup = 4.18 > 2.84
Node 429: Longitude = -117.05 > -122.38
Node 435: HouseAge = 22.00 > 18.50
Node 449: Longitude = -117.05 > -117.58
Node 457: AveOccup = 4.18 > 3.42
Node 461 is a leaf


In [44]:
# This is an additional cell added to analyze the results
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Define different max_depth values to test
depths = [5, 8, 10, 20]
results = []

for depth in depths:
    # Use a separate name so the tuned dt_model from the earlier cells is not overwritten
    depth_model = DecisionTreeRegressor(criterion='squared_error', max_depth=depth, min_samples_leaf=20)
    depth_model.fit(X_train, y_train)
    y_pred_train = depth_model.predict(X_train)
    y_pred_test = depth_model.predict(X_test)
    train_mse = mean_squared_error(y_train, y_pred_train)
    test_mse = mean_squared_error(y_test, y_pred_test)
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    results.append({
        'max_depth': depth,
        'train_mse': train_mse,
        'test_mse': test_mse,
        'train_r2': train_r2,
        'test_r2': test_r2
    })

# Print the results
for result in results:
    print(f"Depth {result['max_depth']} - Train MSE: {result['train_mse']:.3f}, Test MSE: {result['test_mse']:.3f}, Train R²: {result['train_r2']:.3f}, Test R²: {result['test_r2']:.3f}")

Depth 5 - Train MSE: 0.491, Test MSE: 0.538, Train R²: 0.631, Test R²: 0.596
Depth 8 - Train MSE: 0.343, Test MSE: 0.434, Train R²: 0.743, Test R²: 0.675
Depth 10 - Train MSE: 0.290, Test MSE: 0.400, Train R²: 0.782, Test R²: 0.700
Depth 20 - Train MSE: 0.258, Test MSE: 0.387, Train R²: 0.806, Test R²: 0.710


##### 6) Question: Test different max_depth values (5, 8, 10, 20) and analyse the results? 15 points.
 - SET criterion = 'squared_error' and min_samples_leaf = 20
 - Compare the results with different max_depth values
 - Summarize the main disadvantages of decision trees
 
Response: The results above show how model complexity (controlled by max_depth) affects performance.
 - Lower depths lead to higher bias (underfitting): at max_depth = 5 the tree is too simple to capture the patterns in the data and gives the lowest scores on both sets (train R² 0.631, test R² 0.596).
 - Higher depths lead to higher variance (overfitting): the model becomes tailored to the training data and starts fitting noise as if it were signal. The train/test R² gap widens steadily with depth: 0.035 at depth 5, 0.068 at depth 8, 0.082 at depth 10, and 0.096 at depth 20.
 - Test performance still improves with depth here (test R² 0.596, 0.675, 0.700, 0.710), but with diminishing returns: going from depth 10 to depth 20 adds only 0.010. One likely reason is that min_samples_leaf = 20 stops the tree from growing much further, which also keeps the overfitting moderate.

The main disadvantages of decision trees:
 - Overfitting: Decision trees are prone to overfitting, especially when they are very deep. They are flexible enough to fit the noise in the training data, which leads to poor generalization to new data.
 - Instability: Small changes in the data can lead to very different tree structures. This follows from the hierarchical nature of the splits: a change at the top of the tree affects every decision below it.
 - Greedy algorithm: Trees are built greedily, choosing the best split at each node without considering its effect on later splits. Locally optimal splits do not guarantee a globally optimal tree.
 - Biased splits: Trees tend to favor features with many distinct values, because more candidate split points make it more likely that one of them improves the training fit by chance. This can distort feature importances and hurt performance on unseen data.
 - Difficulty capturing linear relationships: A regression tree predicts a constant value in each leaf, so it needs many splits to approximate a smooth linear trend that a linear model captures with a single coefficient.

#### 3. Random forests. 30 points
    1) What is the difference between bagging and random forests?
    2) Train a random forest model. Please tune the arguments, n_estimators, max_features, max_depth, to achieve good performance.
    3) Can your random forest model achieve better performance than the decision tree? Please summarize the advantages of random forests.

##### 1) What is the main problem of the bagging approach, and how can random forests address it? 5 points

Response:

Main Problem of Bagging:
Bagged trees tend to be highly correlated with each other. Every tree considers all features at every split, so when a few features are strong predictors, most trees choose them for their top splits and end up with similar structure. Averaging correlated trees reduces variance much less than averaging independent ones.

How Random Forests Address It:
Random forests add an extra layer of randomness on top of bagging: besides training each tree on a bootstrap sample of the data, they consider only a random subset of the features at each split. This de-correlates the trees, so averaging them reduces the variance of the ensemble further. It also gives weaker features a chance to be used, which helps on datasets where a few features would otherwise dominate every tree.
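
The de-correlation effect can be measured directly. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_regression`, treats `RandomForestRegressor(max_features=1.0)` as plain bagging of trees (all features available at every split), and introduces a hypothetical helper `mean_tree_correlation` that is not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=800, n_features=8, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

def mean_tree_correlation(forest, X):
    """Average pairwise correlation between the individual trees' predictions."""
    preds = np.array([tree.predict(X) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    n = corr.shape[0]
    # Exclude the diagonal (each tree is perfectly correlated with itself)
    return (corr.sum() - n) / (n * (n - 1))

# max_features=1.0: every split sees all features, i.e. bagged trees
bagged = RandomForestRegressor(n_estimators=50, max_features=1.0, random_state=0)
bagged.fit(X_tr, y_tr)

# max_features='sqrt': every split sees a random subset of the features
forest = RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)

corr_bagged = mean_tree_correlation(bagged, X_te)
corr_forest = mean_tree_correlation(forest, X_te)
print(f"Mean tree correlation, bagging:       {corr_bagged:.3f}")
print(f"Mean tree correlation, random forest: {corr_forest:.3f}")
```

A lower average correlation for the feature-subsampled forest is what the explanation above predicts; whether that also improves test error depends on how much accuracy each individual tree loses.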

In [46]:
# Random forests
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the Random Forest Regressor
# 'n_estimators' is the number of trees in the forest
# 'max_features' is the number of features to consider when looking for the best split
# 'max_depth' is the maximum depth of each tree
rf_model = RandomForestRegressor(n_estimators=200, max_features='sqrt', max_depth=15, random_state=42)

# Train the model on the training data
rf_model.fit(X_train, y_train)

# Predict on training and test data
y_train_pred_rf = rf_model.predict(X_train)
y_test_pred_rf = rf_model.predict(X_test)

# Evaluate the model using MSE and R^2
train_mse_rf = mean_squared_error(y_train, y_train_pred_rf)
test_mse_rf = mean_squared_error(y_test, y_test_pred_rf)
train_r2_rf = r2_score(y_train, y_train_pred_rf)
test_r2_rf = r2_score(y_test, y_test_pred_rf)

# Print evaluation metrics
print("Random Forest - Train MSE:", round(train_mse_rf, 3))
print("Random Forest - Test MSE:", round(test_mse_rf, 3))
print("Random Forest - Train R²:", round(train_r2_rf, 3))
print("Random Forest - Test R²:", round(test_r2_rf, 3))

# Check for achieving R² > 0.8 on test set
if test_r2_rf > 0.8:
    print("Achieved R² > 0.8 on test set: Extra points earned!")
else:
    print("R² <= 0.8 on test set: Consider further tuning parameters or exploring more complex models.")


Random Forest - Train MSE: 0.07
Random Forest - Test MSE: 0.257
Random Forest - Train R²: 0.947
Random Forest - Test R²: 0.807
Achieved R² > 0.8 on test set: Extra points earned!


##### 3) Can your random forest model achieve better performance than the decision tree? Please summarize the advantages of random forests. 10 points

Response:
Yes. On the test set the random forest reaches R² = 0.807 and MSE = 0.257, compared with R² = 0.700 and MSE = 0.400 for the decision tree (max_depth = 10, min_samples_leaf = 20). The forest still fits the training set much more closely (train R² 0.947), so a gap between training and test scores remains. Random forests usually outperform single decision trees for several reasons:

 - Reduction in overfitting: A single decision tree can easily overfit the training data by creating complex paths that account for outliers or noise. Random forests mitigate this by averaging many trees, which gives a more general model that performs better on unseen data.
 - Higher accuracy: By combining the predictions of many trees, random forests reduce variance without substantially increasing bias. This generally leads to better accuracy on test data than a single decision tree achieves.
 - Improved stability: Small changes in the data can significantly alter the structure of a single tree, making its predictions sensitive to the specifics of the training data. A random forest is more stable because each tree in the ensemble compensates for the weaknesses of the others.

Advantages of Random Forests

The advantages of random forests extend beyond performance metrics:

 - Robustness to noise and outliers: Each individual tree sees only a subset of the data (and of the features), so the impact of any single outlier or noisy point on the ensemble is reduced.
 - Feature importance: Random forests report how much each feature contributes to the model's splits. This is useful for feature selection and for understanding what drives the predictions.
 - Versatility: Random forests can be used for both classification and regression. They are non-parametric, meaning they make no assumptions about the distribution of the data.
 - Handling missing values: Some random forest implementations can work with missing values directly, so imputation is not always required. Whether this applies depends on the library and version, so it should be checked before relying on it.
 - No need for feature scaling: Input features do not need to be scaled or normalized, because tree splits depend only on the ordering of feature values.
 - Parallelizable computation: Each tree is built independently, so training can be parallelized (in scikit-learn, via the n_jobs parameter). This makes training faster and more scalable.
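
Several of these advantages map onto options and attributes of scikit-learn's `RandomForestRegressor`. The following is a minimal sketch on synthetic stand-in data (an assumption; the housing data is not used here): `n_jobs=-1` builds trees in parallel, `oob_score=True` estimates generalization from the samples each tree did not see, and `feature_importances_` ranks the features.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): only 3 of the 8 features carry signal
X_demo, y_demo = make_regression(
    n_samples=600, n_features=8, n_informative=3, noise=10.0, random_state=0
)

forest = RandomForestRegressor(
    n_estimators=100,
    max_features="sqrt",
    oob_score=True,   # score each sample using only trees that did not train on it
    n_jobs=-1,        # build trees in parallel on all available cores
    random_state=0,
)
forest.fit(X_demo, y_demo)

oob_r2 = forest.oob_score_
importances = forest.feature_importances_

# Rank features from most to least important
ranking = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
print(f"Out-of-bag R^2: {oob_r2:.3f}")
for feature_index, importance in ranking:
    print(f"feature {feature_index}: {importance:.3f}")
```

The out-of-bag score gives a generalization estimate without setting aside a separate validation split, which is convenient when tuning `n_estimators` or `max_depth`.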