# Introduction
We have done feature engineering on the raw dataset and got our final dataset.
Now we will use three different models for predictions:
- Linear Regression
- Decision Tree
- Random Forest Regression
- Regression Enhanced Random Forest

For each model, predictions are made, after a training and parameter tuning phase, on each of the datasets produced in the feat-engineering notebook.
Results for each set of predictions are then plotted and visualized.

## Train Multiple Models

Now that we've tested our data preparation pipeline with a sample model, the next step is to train the data on different regression algorithms to shortlist the most promising algorithms for our problem.

Algorithms to test with include:
- **Linear Regression**: Simple algorithm to implement but can over-simplify real-world problems by assuming a linear relationship among the variables.
- **Support Vector Regression**: Uses hyperplanes to segregate the data.
- **Decision Tree**: Powerful model capable of finding complex nonlinear relationships in the data.
- **Random Forest**: Train many Decision Tress on random subsets of the features (*Ensemble Learning*).

- i.Adaboost Regressor

# Setup
Let us import the required modules.

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import sys
import os
from math import sqrt
import matplotlib.pyplot as plt
import pickle

import project.src.feat_eng as fe
import project.src.visualization as viz

import sklearn.model_selection as modsel
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, accuracy_score

%matplotlib inline
sys.path.insert(0, os.path.abspath("../../"))
color = sns.color_palette()
pd.set_option("display.max_columns", 100) #
np.random.seed(1)

## Load Data
Note that the dataset is already split into Train-Test sets.

In [4]:
engineered_dataset = fe.TrainTestSplit.from_csv_directory(dir_path="../data/lvl4_rfecv")

In [5]:
full_dataset = pd.concat([engineered_dataset.x_train, engineered_dataset.x_test])

In [7]:
engineered_dataset.x_train.info()
engineered_dataset.y_train

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62090 entries, 0 to 62089
Data columns (total 23 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   bathroomcnt                   62090 non-null  float64
 1   bedroomcnt                    62090 non-null  float64
 2   fireplacecnt                  62090 non-null  float64
 3   garagecarcnt                  62090 non-null  float64
 4   latitude                      62090 non-null  float64
 5   longitude                     62090 non-null  float64
 6   poolcnt                       62090 non-null  float64
 7   roomcnt                       62090 non-null  float64
 8   threequarterbathnbr           62090 non-null  float64
 9   unitcnt                       62090 non-null  float64
 10  numberofstories               62090 non-null  float64
 11  house_age                     62090 non-null  float64
 12  heatingorsystemtypeid_2.0     62090 non-null  float64
 13  h

array([ 0.01776168,  0.04743217,  0.05062936, ...,  0.03397105,
       -0.08284219,  0.02421837])

promemoria:
-
-
-
- Linear Regression
- Decision Tree
- Random Forest Regression

In [49]:
# METTERE IN FILE

class Evaluation(object):
    def __init__(self,y_real: np.ndarray, y_pred: np.ndarray):
        self.y_pred = y_pred
        self.y_real = y_real

        self.mae = mean_absolute_error(y_real, y_pred)
        self.mse = mean_squared_error(y_real, y_pred)
        self.rmse = sqrt(mean_squared_error(y_real, y_pred))
        self.r2 = r2_score(y_real, y_pred)

    def print_eval(self):
        print("--------------Model Evaluations:--------------")
        print('Mean Absolute Error : {}'.format(self.mae))
        print()
        print('Mean Squared Error : {}'.format(self.mse))
        print()
        print('Root Mean Squared Error : {}'.format(self.rmse))
        print()
        print('R2 : {}'.format(self.r2))
        print()

# Linear Regression Model

-------SCRIVERE QUALCOSA----------------

In [50]:
linear_reg = LinearRegression()
linear_reg.fit(engineered_dataset.x_train, engineered_dataset.y_train)

LinearRegression()

In [51]:
linreg_train_pred = linear_reg.predict(engineered_dataset.x_train)
linreg_test_pred = linear_reg.predict(engineered_dataset.x_test)

linreg_train_eval = Evaluation(y_real=engineered_dataset.y_train, y_pred=linreg_train_pred)
linreg_test_eval = Evaluation(y_real=engineered_dataset.y_test, y_pred=linreg_test_pred)

In [52]:
print("Training:")
linreg_train_eval.print_eval()
print("Testing:")
linreg_test_eval.print_eval()

Training:
--------------Model Evaluations:--------------
Mean Absolute Error : 0.07040053572011877

Mean Squared Error : 0.02808442166089148

Root Mean Squared Error : 0.1675840734106063

R2 : 0.004461712937144702

Testing:
--------------Model Evaluations:--------------
Mean Absolute Error : 0.07225147731871365

Mean Squared Error : 0.033117004168537564

Root Mean Squared Error : 0.1819807796679022

R2 : 0.0015696233476769628



# Ada Boost Regression Model

-------------DA FARE BENE----------- vedere notebook lezione su che parametri ci sono da tunare perche lotto lo fa

In [None]:
adaboost_reg = AdaBoostRegressor()
adaboost_reg.fit(engineered_dataset.x_train, engineered_dataset.y_train)

In [None]:
adaboostreg_train_pred = adaboost_reg.predict(engineered_dataset.x_train)
adaboostreg_test_pred = adaboost_reg.predict(engineered_dataset.x_test)

adaboostreg_train_eval = Evaluation(y_real=engineered_dataset.y_train, y_pred=adaboostreg_train_pred)
adaboostreg_test_eval = Evaluation(y_real=engineered_dataset.y_test, y_pred=adaboostreg_test_pred)

In [None]:
print("Training:")
adaboostreg_train_eval.print_eval()
print("Testing:")
adaboostreg_test_eval.print_eval()

# Decision Tree Regressor

-------SCRIVERE QUALCOSA----------------

In [None]:
# parameters tuning
dt=DecisionTreeRegressor()
properties={
            'min_samples_leaf':[x for x in range(1,200,2)],
            'max_leaf_nodes':[x for x in range(1,50,2)]
            }

tuned_dt=GridSearchCV(dt,properties,scoring="neg_mean_squared_error",cv=5,return_train_score=True,verbose=2,n_jobs=-1)
tuned_dt.fit(engineered_dataset.x_train,engineered_dataset.y_train)

print ("Best Score: {:.3f}".format(tuned_dt.best_score_) )
print ("Best Params: ", tuned_dt.best_params_)

Fitting 5 folds for each of 2500 candidates, totalling 12500 fits
[CV] END ...............max_leaf_nodes=1, min_samples_leaf=1; total time=   0.0s
[CV] END ...............max_leaf_nodes=1, min_samples_leaf=1; total time=   0.0s
[CV] END ...............max_leaf_nodes=1, min_samples_leaf=3; total time=   0.1s
[CV] END ...............max_leaf_nodes=1, min_samples_leaf=3; total time=   0.0s
[CV] END ...............max_leaf_nodes=1, min_samples_leaf=5; total time=   0.0s
[CV] END ...............max_leaf_nodes=1, min_samples_leaf=7; total time=   0.1s
[CV] END ...............max_leaf_nodes=1, min_samples_leaf=9; total time=   0.0s
[CV] END ..............max_leaf_nodes=1, min_samples_leaf=11; total time=   0.1s
[CV] END ..............max_leaf_nodes=1, min_samples_leaf=13; total time=   0.0s
[CV] END ..............max_leaf_nodes=1, min_samples_leaf=13; total time=   0.0s
[CV] END ..............max_leaf_nodes=1, min_samples_leaf=17; total time=   0.0s
[CV] END ..............max_leaf_nodes=1, mi

In [None]:
print(tuned_dt.cv_results_)
results = pd.DataFrame( tuned_dt.cv_results_)

print(tuned_dt.best_estimator__)
results.info(verbose=True)

In [None]:
#accuracy score è di classificator non va bene per regressor
# test_acc = accuracy_score(y_true = engineered_dataset.y_test,
#                           y_pred = tuned_dt.predict(engineered_dataset.x_test) )
# print ("Test Accuracy: {:.3f}".format(test_acc) )

In [56]:
tree_reg = DecisionTreeRegressor(min_samples_leaf=1, max_leaf_nodes=1) #da mettere con i parametri che trovo con gridserachcv, leggi sugli appunti
tree_reg.fit(engineered_dataset.x_train, engineered_dataset.y_train)

DecisionTreeRegressor(max_depth=5)

In [57]:
dtreg_train_pred = tree_reg.predict(engineered_dataset.x_train)
dtreg_test_pred = tree_reg.predict(engineered_dataset.x_test)

dtreg_train_eval = Evaluation(y_real=engineered_dataset.y_train, y_pred=dtreg_train_pred)
dtreg_test_eval = Evaluation(y_real=engineered_dataset.y_test, y_pred=dtreg_test_pred)

In [None]:
print("Training:")
dtreg_train_eval.print_eval()
print("Testing:")
dtreg_test_eval.print_eval()

# Random Forest Regression Model

-------SCRIVERE QUALCOSA----------------

fare GridSearchCV per il tuning di parametri cosi ho un motivo per cui sto usando certi parametri

uso parametri trovati per decision tree

In [None]:
RF_HYPER_PARAMS = { # riempire con risultati del gridsearchsv
    "n_estimators": [150, 200, 250],
    "min_samples_leaf": [1, 50, 100, 200],
    "max_leaf_nodes": [2, 5, 10],
    "max_features": ["sqrt", "log2"]
}

In [53]:
forest_reg = RandomForestRegressor()

search_result = tr.grid_search_cv_tuning(
                                         train_data=data.x_train.values,
                                         train_target=data.y_train,
                                         scoring="neg_mean_squared_error",)

tuned_model = modsel.HalvingGridSearchCV(forest_reg, RF_HYPER_PARAMS, cv=5, verbose=2,
                                         n_jobs=-1, **gridsearchcv_kwargs)
tuned_model.fit(train_data, train_target)

search_result = tuned_model

print("Tuning results:")
print(f"Best params: {search_result.best_params_}")

print("Fitting with best params and full training set...")
tuned_rf = RandomForestRegressor(n_jobs=N_JOBS, random_state=RND_SEED,
                                 **search_result.best_params_)
tuned_rf.fit(X=data.x_train.values, y=data.y_train)



forest_reg.fit(engineered_dataset.x_train, engineered_dataset.y_train)

RandomForestRegressor(max_depth=6, n_estimators=50)

In [54]:
rforreg_train_pred = forest_reg.predict(engineered_dataset.x_train)
rforreg_test_pred = forest_reg.predict(engineered_dataset.x_test)

rforreg_train_eval = Evaluation(y_real=engineered_dataset.y_train, y_pred=rforreg_train_pred)
rforreg_test_eval = Evaluation(y_real=engineered_dataset.y_test, y_pred=rforreg_test_pred)

In [55]:
print("Training:")
rforreg_train_eval.print_eval()
print("Testing:")
rforreg_test_eval.print_eval()

Training:
--------------Model Evaluations:--------------
Mean Absolute Error : 0.0696034486709443

Mean Squared Error : 0.026873474451725804

Root Mean Squared Error : 0.1639313101629027

R2 : 0.047387443254572004

Testing:
--------------Model Evaluations:--------------
Mean Absolute Error : 0.07203089965868967

Mean Squared Error : 0.03335425261038464

Root Mean Squared Error : 0.18263146664905433

R2 : -0.005583078326304669



----------------------- DA NOTEBOOK CRASTO -----------------------

## Model Evaluation

### Baseline Metrics

It is important to set a baseline for the model's performance to compare different algorithms. For regression problems, the baseline metrics are calculated by replacing $y'$ with $\bar{y}$. Using this, the different baseline regression metrics are:

- **MSE Baseline**: Variance of the target variable (Mean Squared Error)
- **RMSE Baseline**: Standard Deviation of the target variable (Root Mean Squared Error)
- **MAE Baseline**: Average Abolsute Deviation of the target variable (Mean Absolute Error)
- **R2 Baseline**: 0

For this regression problem, we will use the models' **Mean Absolute Error** and **RMSE (Root Mean Squared Error)** to compare the different algorithms which have **baseline values of 0.533 and 0.0837** respectively.

We will also observe the RMSE as another evaluation metric which punishes more for outliers than MAE.

In [None]:
# Baseline for RMSE
print(f"MAE Baseline: {engineered_dataset.y_train.mad()}")
print(f"RMSE Baseline: {engineered_dataset.y_train.std()}")

### MAE Evaluation

To evaluate and short list the most promising models, we will use the models' **MAE** in two different ways:

1) **MAE on Validation Set**: Calculates the MAE on the validation set which is quicker to calculate than evaluation using Cross-Validation. However, it is possible the MAE obtained is skewed depending on the instances sampled in the validation set.

2) A great alternative is to use **K-Fold Cross-Validation** where the training set is randomly split into `n` subsets (for example 10 subsets) called *folds*. It trains and evaluates the model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. Result is an array containing the 10 evaluation scores. Takes longer to evaluate but provides a more accurate measure of the model's performance.

In [23]:
def get_eval_metrics(models, X, y_true):
    """
    Calculates MAE (Mean Absoulate Error) and RMSE (Root Mean Squared Error) on the data set for input models.
    `models`: list of fit models
    """
    for model in models:
        y_pred= model.predict(X)
        rmse = mean_squared_error(y_true, y_pred, squared=False)
        mae = mean_absolute_error(y_true, y_pred)
        print(f"Model: {model}")
        print(f"MAE: {mae}, RMSE: {rmse}")

# Test usage of RMSE function
# get_eval_metrics([lin_reg, ridge_reg, lasso_reg], X_prepared_val, y_val)

In [24]:
def display_scores(model, scores):
    print("-"*50)
    print("Model:", model)
    print("\nScores:", scores)
    print("\nMean:", scores.mean())
    print("\nStandard deviation:", scores.std())

def get_cross_val_scores(models, X, y, cv=10, fit_params=None):
    """
    Performs k-fold cross validation and calculates MAE for each fold for all input models.
    `models`: list of fit models
    """
    for model in models:
        mae = -cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=cv, fit_params=fit_params)
        display_scores(model, mae)

    # Test usage of cross val function
# get_cross_val_scores([lin_reg, ridge_reg], X_prepared, y_train, cv=5)

# Linear Regression Model

Linear Regression: Plain linear regression that minimizes the Mean Squared Error(MSE) cost function.

The model RMSE is significantly higher than MAE which suggests that the outliers are affecting the model's performance as RMSE punishes the model more for mispredicting outliers.
The K-Fold Cross Validation shows that the model's performance is highly volatile

In [None]:
# dopo fit (notebook crasto)
# fa osservazioni craste ma complicate

# Decision Tree Regressor

Decision Tree: Powerful model capable of finding complex nonlinear relationships in the data.
Random Forest: Train many Decision Tress on random subsets of the features via the bagging method (Ensemble Learning).

# Random Forest Regression Model¶

In [12]:
forest_reg = RandomForestRegressor(n_estimators= 50, max_depth=6)

forest_reg.fit(engineered_dataset.x_train, engineered_dataset.y_train)

  forest_reg.fit(engineered_dataset.x_train, engineered_dataset.y_train)


RandomForestRegressor(max_depth=6, n_estimators=50)

In [13]:
forest_reg_pred = forest_reg.predict(engineered_dataset.x_test)

print('Mean Absolute Error : {}'.format(mean_absolute_error(engineered_dataset.y_test, forest_reg_pred)))
print()
print('Mean Squared Error : {}'.format(mean_squared_error(engineered_dataset.y_test, forest_reg_pred)))
print()
print('Root Mean Squared Error : {}'.format(sqrt(mean_squared_error(engineered_dataset.y_test, forest_reg_pred))))

Mean Absolute Error : 0.07201648566472933

Mean Squared Error : 0.03331842958157641

Root Mean Squared Error : 0.18253336566659917


# Cross Validation & Hyperparameter Optimization for Random Forest

In [16]:
scores = cross_val_score(forest_reg, engineered_dataset.x_train, engineered_dataset.y_train, scoring="neg_mean_squared_error", cv = 5)

  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)


In [17]:
forest_reg_rmse_scores = np.sqrt(-scores)
forest_reg_rmse_scores

array([0.16437461, 0.16956266, 0.16824223, 0.15338318, 0.17947782])

In [21]:
param_grid = [
    {'n_estimators': [3, 4, 5], 'max_features': [2, 4, 6]},
    {'bootstrap': [False], 'n_estimators': [3, 6, 9], 'max_features': [2, 4, 6]}]

forest_regressor = RandomForestRegressor()

grid_search = GridSearchCV(forest_regressor, param_grid, scoring='neg_mean_squared_error',return_train_score=True,cv=3)

In [22]:
# grid_search.fit(engineered_dataset.x_train, engineered_dataset.y_train)

  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_

GridSearchCV(cv=3, estimator=RandomForestRegressor(),
             param_grid=[{'max_features': [2, 4, 6],
                          'n_estimators': [300, 400, 500]},
                         {'bootstrap': [False], 'max_features': [2, 4, 6],
                          'n_estimators': [3, 6, 9]}],
             return_train_score=True, scoring='neg_mean_squared_error')

In [25]:
grid_search.best_params_

{'max_features': 4, 'n_estimators': 500}

In [26]:
grid_search.best_estimator_

RandomForestRegressor(max_features=4, n_estimators=500)

In [None]:
# final_predictor = grid_search.best_estimator_
# final_predictor.fit(engineered_dataset.x_train, engineered_dataset.y_train)
# final_pred = final_predictor.predict(engineered_dataset.x_test)

In [None]:
# print('Mean Absolute Error : {}'.format(mean_absolute_error(engineered_dataset.y_test, final_pred)))
# print()
# print('Mean Squared Error : {}'.format(mean_squared_error(engineered_dataset.y_test, final_pred)))
# print()
# print('Root Mean Squared Error : {}'.format(sqrt(mean_squared_error(engineered_dataset.y_test, final_pred))))

In [None]:
# saving the model
# file_name = 'final_pickle_model.pickle'
# pickle.dump(final_predictor,open(file_name,'wb'))

# Feature importance

In [None]:
# feature_importances = grid_search.best_estimator_.feature_importances_
#
# attrs = list(engineered_dataset.select_dtypes(include = ['float64','int64']))
#
# sorted(zip(attrs, feature_importances), reverse=True)

# Saving Predictions

In [None]:
# model_pred = pd.DataFrame({'parcelid':X_test_new.parcelid, 'logerror':final_pred})
# model_pred.to_csv('model_predictions.csv',index=False)
# model_pred.head()

# Conclusion

1. I have performed all the feature engineering steps necessary to ensure the dataset is ready to be fed into Machine Learning algorithms.

2. After Pre-processing and Feature Engineering the raw dataset we splitted the dataset into train and test sets.

3. Performed Feature scaling on data for better performance.

4. Trained multiple models using different ML regression algorithms on dataset.

5. Appleied Performance metrics such as MAE, MSE, RMSE to find out best prediction model.

6. With the help of GridSearch CV we found out best estimator with least Root mean squred error.

7. Saved best predictor in .pickle format for future predictions.

8. Done prediction on test data and saved predictions into .csv file.

# Hypermeter Tuning (GridSearchCV)

a.For Random Forest Regressor

# Checking for Feature Importance

# Creating the final model and making predictions

# Conclusion