# Exercise - DTs for regression

1. Use the `./data/HousingData.csv` data (remember to split your data into a train and test data). Using your training and validation data, optimize the parameters of your DT. How well does your optimized model perform on the test data?
1. (Optional/bonus): Try to perform standardization to your data. Does it improve your model? Further, try to select only the 5 most important features. Does it improve the performance of your model?

**See slides for more details!**

In [2]:
import pandas as pd
import numpy as np
# from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score
from sklearn import tree

data = './data/HousingData.csv'
raw_df = pd.read_csv(data).dropna()

print(raw_df.head())

# Create a copy of the DataFrame with column names
df_copy = raw_df.copy()

# Separate the target variable (y) and features (X)
y = df_copy['MEDV']  # Replace 'TargetColumn' with your actual target column name

# MEDV - median value of owner occupied
X = df_copy.drop(columns=['MEDV'])  # Remove the target column

# Use `train_test_split` to split your data into a train and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use `train_test_split` to split your train data into a train and a validation  set.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print("\nThe shape of train, validation and test sets are:")
print(X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape)

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900    1  296     15.3   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671    2  242     17.8   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671    2  242     17.8   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622    3  222     18.7   
5  0.02985   0.0   2.18   0.0  0.458  6.430  58.7  6.0622    3  222     18.7   

        B  LSTAT  MEDV  
0  396.90   4.98  24.0  
1  396.90   9.14  21.6  
2  392.83   4.03  34.7  
3  394.63   2.94  33.4  
5  394.12   5.21  28.7  

The shape of train, validation and test sets are:
(252, 13) (63, 13) (79, 13) (252,) (63,) (79,)


# Exercise 1

Use the `./data/HousingData.csv` data (remember to split your data into a train and test data). Using your training and validation data, optimize the parameters of your DT. How well does your optimized model perform on the test data? Is it better than your optimized SVM for the same data (the third exercise from last week)?

In [3]:
from sklearn import tree
min_samples_split_list = [2, 4, 6, 12] # input values seperated by ",".
min_samples_leaf_list = [5, 10, 15] # input values seperated by ",".
max_features_list = [2, 5, 10] # input values seperated by ",".

results = []

for min_samples_split in min_samples_split_list:
    for min_samples_leaf in min_samples_leaf_list:
        for max_features in max_features_list:
            dt_current = tree.DecisionTreeRegressor(min_samples_split=min_samples_split,
                                                    min_samples_leaf=min_samples_leaf,
                                                    max_features=max_features)
                                                    
            dt_current.fit(X_train, y_train)
            y_val_hat = dt_current.predict(X_val)
            mse = mean_squared_error(y_val, y_val_hat)

            results.append([mse, min_samples_split, min_samples_leaf, max_features])

results = pd.DataFrame(results)
results.columns = ['MSE', 'min_samples_split', 'min_samples_leaf', 'max_features']
print(results)

          MSE  min_samples_split  min_samples_leaf  max_features
0   11.474091                  2                 5             2
1   18.755583                  2                 5             5
2   28.926581                  2                 5            10
3   22.719969                  2                10             2
4   28.074906                  2                10             5
5   13.557582                  2                10            10
6   22.514651                  2                15             2
7   15.911417                  2                15             5
8   15.197658                  2                15            10
9   36.234507                  4                 5             2
10  15.905257                  4                 5             5
11  23.447054                  4                 5            10
12  26.912885                  4                10             2
13  20.320952                  4                10             5
14  25.276486            

In [4]:
# Extract best parameters.
results[results['MSE'] == results['MSE'].max()]

Unnamed: 0,MSE,min_samples_split,min_samples_leaf,max_features
33,42.748998,12,15,2


In [5]:
# Initialize your final model
dt_optimized = tree.DecisionTreeRegressor(
    min_samples_split=10,
    min_samples_leaf=5,
    max_features=10,
    )

# Use both training and validation data to fit it (np.concatenate "stacks" the array like rbind in R)
dt_optimized.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))

# Predict on test data
y_test_hat_optimized = dt_optimized.predict(X_test)

# Obtain and check mse on test data
mse_optimized = mean_squared_error(y_test, y_test_hat_optimized)
print(f'Optimized DT achieved MSE = {round(mse_optimized, 2)}. - lower values indicating better performance.')



Optimized DT achieved MSE = 39.58. - lower values indicating better performance.




# Exercise 2

(Optional/bonus): Try to perform standardization to your data. Does it improve your model? Further, try to select only the 5 most important features. Does it improve the performance of your model?

In [6]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the scaler on the training data and transform both train and test data
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

In [7]:
from sklearn.feature_selection import SelectFromModel

# Train the decision tree regressor
dt_regressor.fit(X_train_scaled, y_train)

# Get feature importances
feature_importances = dt_regressor.feature_importances_

# Create a feature selector to select the top 5 features
selector = SelectFromModel(dt_regressor, max_features=5)

# Fit the selector on your scaled training data
X_train_selected = selector.fit_transform(X_train_scaled, y_train)

# Transform your validation and test data using the same selector
X_val_selected = selector.transform(X_val_scaled)
X_test_selected = selector.transform(X_test_scaled)


NameError: name 'dt_regressor' is not defined