# Introduction
We have completed feature engineering on the raw dataset and obtained our final dataset.
Now we will use the following models for predictions:
- Linear Regression
- Decision Tree
- Random Forest Regression
- Regression Enhanced Random Forest

For each model, predictions are made after a training and parameter-tuning phase on each of the datasets produced in the feature-engineering notebook.
Results for each set of predictions are then plotted and visualized.

## Train Multiple Models

Now that we've tested our data preparation pipeline with a sample model, the next step is to train different regression algorithms on the data and shortlist the most promising ones for our problem.

Algorithms to test with include:
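- **Linear Regression**: Simple algorithm to implement, but it can over-simplify real-world problems by assuming a linear relationship among the variables.
- **Support Vector Regression**: Fits a function that tolerates errors within a margin around it and can capture nonlinear relationships through kernels.
- **Decision Tree**: Powerful model capable of finding complex nonlinear relationships in the data.
- **Random Forest**: Trains many Decision Trees on random subsets of the training instances and features, then averages their predictions (*Ensemble Learning*).
- **AdaBoost Regressor**: Trains a sequence of weak learners, each one focusing more on the instances the previous ones predicted poorly.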
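As a rough illustration of the shortlisting idea, the sketch below fits each candidate on a small synthetic dataset and ranks them by test MAE. The synthetic data, the hyperparameter values, and the `candidates` and `test_mae` names are assumptions made for this example; they are not the notebook's data or results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the engineered dataset (assumption: not the real data)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# One instance per candidate algorithm; hyperparameters are illustrative only
candidates = {
    "Linear Regression": LinearRegression(),
    "Support Vector Regression": SVR(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=1),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=1),
    "AdaBoost": AdaBoostRegressor(random_state=1),
}

# Fit every candidate on the same split and record its test MAE
test_mae = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    test_mae[name] = mean_absolute_error(y_test, model.predict(X_test))

# Print the candidates from lowest to highest MAE
for name, mae in sorted(test_mae.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {mae:.3f}")
```

The ranking on synthetic data says nothing about the real problem; the point is only the pattern of evaluating every candidate on the same split with the same metric.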

# Setup
Let us import the required modules.

In [15]:
import os
import pickle
import sys
from math import sqrt

import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Make the project package importable before importing from it
sys.path.insert(0, os.path.abspath("../../"))

import project.src.feat_eng as fe
import project.src.visualization as viz

%matplotlib inline
color = sns.color_palette()
pd.set_option("display.max_columns", 100)  # show up to 100 columns when displaying DataFrames
np.random.seed(1)

## Load Data
Note that the dataset is already split into Train-Test sets.

In [2]:
engineered_dataset = fe.TrainTestSplit.from_csv_directory(dir_path="../data/lvl4_rfecv")

In [3]:
engineered_dataset.x_train.info()
engineered_dataset.y_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62090 entries, 0 to 62089
Data columns (total 23 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   bathroomcnt                   62090 non-null  float64
 1   bedroomcnt                    62090 non-null  float64
 2   fireplacecnt                  62090 non-null  float64
 3   garagecarcnt                  62090 non-null  float64
 4   latitude                      62090 non-null  float64
 5   longitude                     62090 non-null  float64
 6   poolcnt                       62090 non-null  float64
 7   roomcnt                       62090 non-null  float64
 8   threequarterbathnbr           62090 non-null  float64
 9   unitcnt                       62090 non-null  float64
 10  numberofstories               62090 non-null  float64
 11  house_age                     62090 non-null  float64
 12  heatingorsystemtypeid_2.0     62090 non-null  float64
 13  h

In [None]:
# Hyperparameters
# TODO: explain how these values were chosen (or add a screenshot of the search); ask Pier for details

In [None]:
# This is already a selection after many grid-search runs
RF_HYPER_PARAMS = {
    "n_estimators": [150, 200, 250],
    "min_samples_leaf": [1, 50, 100, 200],
    "max_leaf_nodes": [2, 5, 10],
    # 1.0 (all features) replaces the legacy "auto" option, which recent scikit-learn releases reject
    "max_features": [1.0, "sqrt", "log2"]
}
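To get a feel for how expensive this grid is, the candidate combinations can be counted before launching the search. The sketch below uses scikit-learn's `ParameterGrid`; the lowercase `rf_hyper_params` copy and the use of `1.0` in place of the legacy `"auto"` value are assumptions made so the example stands alone.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid above; 1.0 (all features) stands in for the legacy "auto" option
rf_hyper_params = {
    "n_estimators": [150, 200, 250],
    "min_samples_leaf": [1, 50, 100, 200],
    "max_leaf_nodes": [2, 5, 10],
    "max_features": [1.0, "sqrt", "log2"],
}

k_folds = 4  # same number of folds as the tuning call used later in this notebook
n_candidates = len(ParameterGrid(rf_hyper_params))  # 3 * 4 * 3 * 3 combinations
n_fits = n_candidates * k_folds
print(f"{n_candidates} candidates x {k_folds} folds = {n_fits} cross-validation fits per dataset")
```

With 108 candidates and 4 folds, each dataset requires 432 cross-validation fits, which is why narrowing the grid beforehand matters.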

# Linear Regression

In [None]:
@utils.print_time_perf
def evaluate_lin_reg(datasets: dict[str, tr.TrainTestSplit]) \
        -> tuple[list[ev.RegressorEvaluation], list[ev.RegressorEvaluation]]:

    train_results = []
    test_results = []
    for name, data in datasets.items():
        print(f"---- Now using {name} dataset ----")
        reg = sup.linear_regressor_fit(data.x_train.values, data.y_train, n_jobs=N_JOBS)

        train_eval, test_eval = ev.evaluate_performance(reg, name, data)
        train_results.append(train_eval)
        test_results.append(test_eval)

    return train_results, test_results

In [None]:
print("------ LINEAR REGRESSION ------\n")
lin_reg_train_results, lin_reg_test_results = evaluate_lin_reg(engineered_datasets)

In [None]:
# Save progress so that training doesn't have to be run again
dm.store_evaluations(dir_path=LIN_REG_DIR, evals=lin_reg_test_results)

train_dir = f"{LIN_REG_DIR}/train"
dm.create_dirs_if_not_exists(dir_paths=[train_dir])
dm.store_evaluations(dir_path=train_dir, evals=lin_reg_train_results)

In [None]:
print("Training:")
ev.print_evaluation_stats(lin_reg_train_results)

print("Testing:")
ev.print_evaluation_stats(lin_reg_test_results)

# Random Forest Regression

In [None]:
@utils.print_time_perf
def evaluate_rf_reg(datasets: dict[str, tr.TrainTestSplit]) \
        -> tuple[list[ev.RegressorEvaluation], list[ev.RegressorEvaluation]]:

    train_results = []
    test_results = []
    for name, data in datasets.items():
        print(f"---- Now using {name} dataset ----")

        print("Tuning...")
        rf_reg = RandomForestRegressor(n_jobs=N_JOBS, random_state=RND_SEED)
        search_result = tr.grid_search_cv_tuning(model=rf_reg,
                                                 train_data=data.x_train.values,
                                                 train_target=data.y_train,
                                                 hyper_params=RF_HYPER_PARAMS,
                                                 scoring="neg_mean_squared_error",
                                                 k_folds=4, n_jobs=N_JOBS, verbosity=VERBOSITY)
        print("Tuning results:")
        print(f"Best params: {search_result.best_params_}")

        print("Fitting with best params and full training set...")
        tuned_rf = RandomForestRegressor(n_jobs=N_JOBS, random_state=RND_SEED,
                                         **search_result.best_params_)
        tuned_rf.fit(X=data.x_train.values, y=data.y_train)

        print("Evaluating performance...")
        train_eval, test_eval = ev.evaluate_performance(tuned_rf, name, data)
        train_results.append(train_eval)
        test_results.append(test_eval)

    return train_results, test_results
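The source of `tr.grid_search_cv_tuning` is not shown in this notebook. One plausible shape, assuming it is a thin wrapper around scikit-learn's `GridSearchCV`, is sketched below. The function name `grid_search_cv_tuning_sketch`, the mapping of its arguments, the synthetic data, and the reduced grid are all hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def grid_search_cv_tuning_sketch(model, train_data, train_target, hyper_params,
                                 scoring, k_folds, n_jobs=None, verbosity=0):
    """Hypothetical stand-in for tr.grid_search_cv_tuning (assumed behavior)."""
    search = GridSearchCV(estimator=model, param_grid=hyper_params, scoring=scoring,
                          cv=k_folds, n_jobs=n_jobs, verbose=verbosity)
    search.fit(train_data, train_target)
    return search


# Tiny synthetic problem and a reduced grid so the sketch runs quickly
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
result = grid_search_cv_tuning_sketch(
    model=RandomForestRegressor(random_state=0),
    train_data=X, train_target=y,
    hyper_params={"n_estimators": [10, 20], "max_leaf_nodes": [5, 10]},
    scoring="neg_mean_squared_error", k_folds=4,
)
print(f"Best params: {result.best_params_}")
```

Returning the fitted search object keeps `best_params_` available to the caller, which is how the function above consumes the result.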

In [None]:
print("------- RF REGRESSION -------\n")
rf_reg_train_results, rf_reg_test_results = evaluate_rf_reg(engineered_datasets)

In [None]:
# Save progress so that training doesn't have to be run again
dm.store_evaluations(dir_path=RF_REG_DIR, evals=rf_reg_test_results)

train_dir = f"{RF_REG_DIR}/train"
dm.create_dirs_if_not_exists(dir_paths=[train_dir])
dm.store_evaluations(dir_path=train_dir, evals=rf_reg_train_results)

In [None]:
print("Training:")
ev.print_evaluation_stats(rf_reg_train_results)

print("Testing:")
ev.print_evaluation_stats(rf_reg_test_results)

Note: at this point the original notebook trains the Regression Enhanced Random Forest model.

# Performance Visualization

## Setup

In [None]:
# Order results by regressor_id (dataset name) so that plots are ordered
key_selector = lambda x: x.regressor_id

lin_reg_train_results = list(sorted(dm.load_evaluations(dir_path=f"{LIN_REG_DIR}/train"), key=key_selector))
rf_reg_train_results = list(sorted(dm.load_evaluations(dir_path=f"{RF_REG_DIR}/train"), key=key_selector))
refr_train_results = list(sorted(dm.load_evaluations(dir_path=f"{REFR_DIR}/train"), key=key_selector))

lin_reg_test_results = list(sorted(dm.load_evaluations(dir_path=LIN_REG_DIR), key=key_selector))
rf_reg_test_results = list(sorted(dm.load_evaluations(dir_path=RF_REG_DIR), key=key_selector))
refr_test_results = list(sorted(dm.load_evaluations(dir_path=REFR_DIR), key=key_selector))

training_results = {
    "[Training] Linear Regression": lin_reg_train_results,
    "[Training] Random Forest Regression": rf_reg_train_results,
    "[Training] Regression Enhanced Random Forest": refr_train_results
}

testing_results = {
    "Linear Regression": lin_reg_test_results,
    "Random Forest Regression": rf_reg_test_results,
    "Regression Enhanced Random Forest": refr_test_results
}

In [None]:
def get_performance_df(results: dict[str, list[ev.RegressorEvaluation]]):
    perf_records = []
    for model_name, evaluations in results.items():
        for evl in evaluations:
            record = {
                "model": model_name,
                "dataset id": evl.regressor_id,
                "MAE": evl.mae,
                "MSE": evl.mse,
                "R2": evl.r2
            }
            perf_records.append(record)

    return pd.DataFrame.from_records(data=perf_records).sort_values(by="dataset id")

def performance_plot(performance_df: pd.DataFrame):
    plot = sns.lineplot(data=performance_df, x="dataset id", y="MSE", hue="model",
                        style="model", palette="pastel", markers=True)
    plot.tick_params(axis="x", rotation=90)

    return plot

In [None]:
def plot_features_vs_predictions(evaluation: ev.RegressorEvaluation):
    dataset_name = evaluation.regressor_id
    dataset = engineered_datasets[dataset_name]
    test_data = dataset.x_test

    fig, axs = vis.bivariate_feature_plot(data=test_data, mode="scatter",
                                          y_var=("Model predictions", pd.Series(evaluation.y_pred)),
                                          subplot_size=(5, 4),
                                          width=3, title_size=50,
                                          title=f"[{dataset_name}] Features vs Predictions",
                                          scatter_kwargs={
                                              "alpha": 0.8
                                          })

    return fig, axs


def plot_features_vs_residuals(evaluation: ev.RegressorEvaluation):
    dataset_name = evaluation.regressor_id
    dataset = engineered_datasets[dataset_name]
    test_data = dataset.x_test

    residuals = evaluation.y_true - evaluation.y_pred
    fig, axs = vis.bivariate_feature_plot(data=test_data, mode="scatter",
                                          y_var=("Model residuals", pd.Series(residuals)),
                                          subplot_size=(5, 4),
                                          width=3, title_size=50,
                                          title=f"[{dataset_name}] Features vs Residuals",
                                          scatter_kwargs={
                                              "alpha": 0.8
                                          })

    return fig, axs


def get_extreme_predictions(data: pd.DataFrame, evaluation: ev.RegressorEvaluation, percentile: float):
    most_wrong = ev.get_highest_error_instances(data=data, percentile=percentile,
                                                pred=evaluation.y_pred, true_pred=evaluation.y_true,
                                                error_type="squared")
    most_correct = ev.get_lowest_error_instances(data=data, percentile=percentile,
                                                 pred=evaluation.y_pred, true_pred=evaluation.y_true,
                                                 error_type="squared")

    return most_wrong, most_correct


color_red = "#bf1515"
color_green = "#32a852"


def plot_extreme_instances_on_distribution(evaluation: ev.RegressorEvaluation):
    dataset_name = evaluation.regressor_id
    dataset = engineered_datasets[dataset_name]
    test_data = dataset.x_test.copy()

    fig, axs = vis.feature_distributions_plot(data=test_data, numerical_mode="violin",
                                              subplot_size=(5, 4),
                                              width=5, title_size=40,
                                              title=f"[{dataset_name}] Most Wrong/Correct on Distributions")

    most_wrong, most_correct = get_extreme_predictions(data=test_data,
                                                       evaluation=evaluation,
                                                       percentile=99.5)
    for ax in axs.flatten():
        feature_name = ax.get_xlabel()
        if feature_name != "":
            x_worst = most_wrong[feature_name].values
            x_best = most_correct[feature_name].values

            # Points are drawn at mid height + an offset so they don't overlap
            y_min, y_max = ax.get_ylim()
            height = (abs(y_max) - abs(y_min))
            half_height = height / 2

            # Heights of worst and best on two different levels
            y_worst_height = half_height + 1
            y_best_height = half_height + 2

            # Also add gaussian noise to mitigate overlapping with the violin/box plot
            y_worst = [y_worst_height + np.random.normal(0, 0.05) for _ in range(len(x_worst))]
            y_best = [y_best_height + np.random.normal(0, 0.05) for _ in range(len(x_best))]

            # Plot the most wrong/correct values over the distribution plots
            # and assign them size in proportion to their wrongness/correctness.
            # A double argsort converts the errors into ranks (0 = smallest value).
            worst_rank = np.argsort(np.argsort(most_wrong["errors"].values))
            best_rank = np.argsort(np.argsort(-most_correct["errors"].values))
            worst_size = 50 * ((worst_rank + 1) / len(x_worst))
            best_size = 50 * ((best_rank + 1) / len(x_best))

            ax.scatter(x=x_worst, y=y_worst, s=worst_size, c=color_red)
            ax.scatter(x=x_best, y=y_best, s=best_size, c=color_green)

    return fig, axs


def plot_extreme_instances_on_feature_vs_target(evaluation: ev.RegressorEvaluation):
    dataset_name = evaluation.regressor_id
    dataset = engineered_datasets[dataset_name]
    test_data = dataset.x_test

    fig, axs = vis.bivariate_feature_plot(data=test_data, mode="scatter",
                                          y_var=("True logerror", pd.Series(dataset.y_test)),
                                          subplot_size=(5, 4),
                                          width=3, title_size=40,
                                          title=f"[{dataset_name}] Most Wrong/Correct on Features vs Target",
                                          scatter_kwargs={
                                              "alpha": 0.65  # so that extreme instances are highlighted
                                          })

    most_wrong, most_correct = get_extreme_predictions(data=test_data,
                                                       evaluation=evaluation,
                                                       percentile=99.5)
    for ax in axs.flatten():
        feature_name = ax.get_xlabel()
        if feature_name != "":
            x_worst = most_wrong[feature_name].values
            x_best = most_correct[feature_name].values

            y_worst = most_wrong["true predictions"].values
            y_best = most_correct["true predictions"].values

            # Plot the most wrong/correct values over the scatter plots
            # and assign them size in proportion to their wrongness/correctness.
            # A double argsort converts the errors into ranks (0 = smallest value).
            worst_rank = np.argsort(np.argsort(most_wrong["errors"].values))
            best_rank = np.argsort(np.argsort(-most_correct["errors"].values))
            worst_size = 80 * ((worst_rank + 1) / len(x_worst))
            best_size = 80 * ((best_rank + 1) / len(x_best))

            ax.scatter(x=x_worst, y=y_worst, s=worst_size, c=color_red)
            ax.scatter(x=x_best, y=y_best, s=best_size, c=color_green)

    return fig, axs

## Training Performance

In [None]:
train_performance_df = get_performance_df(training_results)
train_performance_df

In [None]:
performance_plot(train_performance_df)

## Testing performance

In [None]:
test_performance_df = get_performance_df(testing_results)
test_performance_df

In [None]:
performance_plot(test_performance_df)

## Training vs Testing

In [None]:
train_test_perf_df = pd.concat([train_performance_df, test_performance_df]).reset_index(drop=True)
train_test_perf_df = train_test_perf_df.sort_values(by="dataset id")
train_test_perf_df

In [None]:
performance_plot(train_test_perf_df)

## Predictions and Residuals

In this section, for each combination of model and testing set, two types of plots are shown:
- Features vs model predictions;
- Features vs model residuals.

Here the goal is to understand how the models make their predictions and get a general idea of where and how wrong they are.

### Linear Regression

In [None]:
for result in lin_reg_test_results:
    plot_features_vs_predictions(evaluation=result)
    plot_features_vs_residuals(evaluation=result)

### Random Forest Regression

In [None]:
for result in rf_reg_test_results:
    plot_features_vs_predictions(evaluation=result)
    plot_features_vs_residuals(evaluation=result)

### Regression Enhanced Random Forest

### A Look at the Expected Value and Variance of True log-errors

The expected value of the targets of both the training and testing sets is shown to give more context to the previously plotted predictions. The expected value is around 0.017, which is very close to what the models are predicting: in other words, the models seem to predict values around the average of the true log-errors, with relatively little variance.

The variance of the true log-errors also suggests why the models perform much better on the testing sets than on the training ones: since all the models seem to predict the average log-error, or a value very close to it, for each instance, the error is expected to be directly proportional to the variance of the target of each set. The variance of the testing set is, in fact, lower than that of the training set, most likely because its smaller size means it contains fewer extreme log-errors.
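The link between error and target variance can be checked numerically: when every prediction equals the mean of the target, the MSE is exactly the (population) variance of that target. The sketch below uses synthetic targets; the means, spreads, and sample sizes are assumptions chosen for illustration, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic targets (assumption): same mean, different spread
y_wide = rng.normal(loc=0.017, scale=0.16, size=5000)
y_narrow = rng.normal(loc=0.017, scale=0.08, size=5000)


def mse_of_mean_predictor(y):
    """MSE obtained when every prediction equals the mean of y."""
    return np.mean((y - y.mean()) ** 2)


for label, y in [("wide", y_wide), ("narrow", y_narrow)]:
    print(f"{label}: MSE = {mse_of_mean_predictor(y):.5f}, variance = {np.var(y):.5f}")
```

Note that `np.var` divides by `n`, while `pandas.Series.var()` divides by `n - 1` by default, so values computed with pandas differ by a factor of `n / (n - 1)`.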

In [None]:
# Since all y_train and y_test are equal, the dataset from which they are extracted
# does not matter
any_dataset = "lvl1-leave-one-out"
y_train = engineered_datasets[any_dataset].y_train
y_test = engineered_datasets[any_dataset].y_test

pd.DataFrame(data={
    "Set": ["Training", "Testing"],
    "Expected value": [y_train.mean(), y_test.mean()],
    "Variance": [y_train.var(), y_test.var()]
})

## Best and Worst instances

In this section, for each combination of model and testing set, two types of plots are shown:
- Distribution of extreme instances (in terms of predictions) vs actual feature distribution;
- Dataset features and extreme instances vs true logerror.

The goal of both plots is to help understand whether there is some peculiarity in the distribution and predictions of the most wrongly/correctly predicted instances.

### Linear Regression

In [None]:
for result in lin_reg_test_results:
    plot_extreme_instances_on_distribution(evaluation=result)
    plot_extreme_instances_on_feature_vs_target(evaluation=result)

### Random Forest Regression

In [None]:
for result in rf_reg_test_results:
    plot_extreme_instances_on_distribution(evaluation=result)
    plot_extreme_instances_on_feature_vs_target(evaluation=result)

## Model Evaluation

### Baseline Metrics

It is important to set a baseline for the model's performance in order to compare different algorithms. For regression problems, the baseline metrics are calculated by replacing each prediction $y'$ with the mean of the target $\bar{y}$. The resulting baseline regression metrics are:

- **MSE Baseline**: Variance of the target variable (Mean Squared Error)
- **RMSE Baseline**: Standard Deviation of the target variable (Root Mean Squared Error)
- **MAE Baseline**: Average Absolute Deviation of the target variable (Mean Absolute Error)
- **R2 Baseline**: 0
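The baselines listed above follow directly from substituting $y'_i = \bar{y}$ into each metric. As a short sketch, assuming $n$ instances with true values $y_i$ (notation introduced here for illustration):

$$\mathrm{MSE}_{\text{base}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 = \mathrm{Var}(y)$$

$$\mathrm{RMSE}_{\text{base}} = \sqrt{\mathrm{MSE}_{\text{base}}} = \sqrt{\mathrm{Var}(y)} = \sigma_y$$

$$\mathrm{MAE}_{\text{base}} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \bar{y}\right|$$

$$R^2_{\text{base}} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} = 0$$

Here $\mathrm{Var}(y)$ is the population variance (dividing by $n$); library functions that divide by $n - 1$ give a marginally larger value.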

For this regression problem, we will use the models' **Mean Absolute Error (MAE)** and **Root Mean Squared Error (RMSE)** to compare the different algorithms; their **baseline values are 0.533 and 0.0837** respectively.

RMSE is included alongside MAE because it penalizes large errors on outliers more heavily.

In [None]:
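# Baselines for MAE and RMSE
# (the mean absolute deviation is computed directly, since `.mad()` is not available in pandas 2.x)
baseline_target = engineered_dataset.y_train
print(f"MAE Baseline: {(baseline_target - baseline_target.mean()).abs().mean()}")
print(f"RMSE Baseline: {baseline_target.std()}")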

### MAE Evaluation

To evaluate and shortlist the most promising models, we will use the models' **MAE** in two different ways:

1) **MAE on Validation Set**: Calculates the MAE on a single validation set, which is quicker than evaluating with Cross-Validation. However, the MAE obtained may be skewed depending on which instances were sampled into the validation set.

2) **K-Fold Cross-Validation**: The training set is randomly split into `n` subsets called *folds* (for example, 10 folds). The model is trained and evaluated 10 times, each time picking a different fold for evaluation and training on the other 9 folds. The result is an array containing the 10 evaluation scores. This takes longer to run but provides a more reliable measure of the model's performance.
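The K-Fold procedure described above can be written out by hand to make each step explicit. This is a minimal sketch on synthetic data; the dataset, the choice of `LinearRegression`, and the `fold_mae` name are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the training set (assumption: not the real data)
X, y = make_regression(n_samples=500, n_features=5, noise=15.0, random_state=42)

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
fold_mae = []
for train_idx, val_idx in kfold.split(X):
    # Train on 9 folds, evaluate on the held-out fold
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

fold_mae = np.array(fold_mae)
print("MAE per fold:", np.round(fold_mae, 3))
print(f"Mean: {fold_mae.mean():.3f}, standard deviation: {fold_mae.std():.3f}")
```

The standard deviation across folds is what a single validation split cannot provide: it indicates how much the score depends on which instances end up in the evaluation fold.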

In [None]:
def get_eval_metrics(models, X, y_true):
    """
    Calculates MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) on the data set for input models.
    `models`: list of fit models
    """
    for model in models:
        y_pred = model.predict(X)
        # Take the square root explicitly instead of passing `squared=False`,
        # which recent scikit-learn releases no longer accept
        rmse = sqrt(mean_squared_error(y_true, y_pred))
        mae = mean_absolute_error(y_true, y_pred)
        print(f"Model: {model}")
        print(f"MAE: {mae}, RMSE: {rmse}")

# Test usage of RMSE function
# get_eval_metrics([lin_reg, ridge_reg, lasso_reg], X_prepared_val, y_val)

In [None]:
def display_scores(model, scores):
    print("-"*50)
    print("Model:", model)
    print("\nScores:", scores)
    print("\nMean:", scores.mean())
    print("\nStandard deviation:", scores.std())

def get_cross_val_scores(models, X, y, cv=10, fit_params=None):
    """
    Performs k-fold cross validation and calculates MAE for each fold for all input models.
    `models`: list of estimators; each one is cloned and refit on every fold
    """
    for model in models:
        mae = -cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=cv, fit_params=fit_params)
        display_scores(model, mae)

# Test usage of cross val function
# get_cross_val_scores([lin_reg, ridge_reg], X_prepared, y_train, cv=5)

# Linear Regression Model

Linear Regression: Plain linear regression that minimizes the Mean Squared Error (MSE) cost function.

The model's RMSE is significantly higher than its MAE, which suggests that outliers are affecting the model's performance, since RMSE penalizes large errors more heavily than MAE.
K-Fold Cross-Validation also shows that the model's performance is highly volatile across folds.
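The effect of outliers on the gap between RMSE and MAE can be seen with a handful of made-up residuals. In the sketch below, the error values and the `mae_and_rmse` helper are illustrative assumptions, not results from the models.

```python
import numpy as np


def mae_and_rmse(errors):
    """Return (MAE, RMSE) for a sequence of residuals."""
    errors = np.asarray(errors, dtype=float)
    return np.mean(np.abs(errors)), np.sqrt(np.mean(errors ** 2))


uniform_errors = [0.1, 0.1, 0.1, 0.1, 0.1]   # every residual has the same size
outlier_errors = [0.1, 0.1, 0.1, 0.1, 2.0]   # one large residual

mae_uniform, rmse_uniform = mae_and_rmse(uniform_errors)
mae_outlier, rmse_outlier = mae_and_rmse(outlier_errors)

print(f"Uniform errors: MAE = {mae_uniform:.3f}, RMSE = {rmse_uniform:.3f}")
print(f"With outlier:   MAE = {mae_outlier:.3f}, RMSE = {rmse_outlier:.3f}")
```

When all residuals are equal the two metrics coincide; a single large residual raises RMSE far more than MAE because the error is squared before averaging.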

In [4]:
linear_reg = LinearRegression()
linear_reg.fit(engineered_dataset.x_train, engineered_dataset.y_train)

LinearRegression()

In [5]:
linear_reg_pred = linear_reg.predict(engineered_dataset.x_test)

print('Mean Absolute Error : {}'.format(mean_absolute_error(engineered_dataset.y_test, linear_reg_pred)))
print()
print('Mean Squared Error : {}'.format(mean_squared_error(engineered_dataset.y_test, linear_reg_pred)))
print()
print('Root Mean Squared Error : {}'.format(sqrt(mean_squared_error(engineered_dataset.y_test, linear_reg_pred))))

Mean Absolute Error : 0.07225147731871365

Mean Squared Error : 0.033117004168537564

Root Mean Squared Error : 0.1819807796679022


# AdaBoost Regression Model

In [6]:
adaboost_reg = AdaBoostRegressor()

adaboost_reg.fit(engineered_dataset.x_train, engineered_dataset.y_train)


AdaBoostRegressor()
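If the target is stored as a single-column DataFrame rather than a 1-D array or Series, scikit-learn estimators typically emit a `DataConversionWarning` during `fit`. A minimal sketch of avoiding it by flattening the target with `ravel()` follows; the synthetic data, column names, and the `n_conversion_warnings` counter are assumptions made for illustration.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor
from sklearn.exceptions import DataConversionWarning

rng = np.random.default_rng(1)
x_train = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
# Target held as a column vector (single-column DataFrame)
y_train = pd.DataFrame({"logerror": rng.normal(scale=0.1, size=100)})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # ravel() flattens the (n_samples, 1) target to shape (n_samples,)
    AdaBoostRegressor(n_estimators=10, random_state=1).fit(x_train, y_train.values.ravel())

n_conversion_warnings = sum(issubclass(w.category, DataConversionWarning) for w in caught)
print("DataConversionWarning count:", n_conversion_warnings)
```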

In [7]:
adaboost_reg_pred = adaboost_reg.predict(engineered_dataset.x_test)

print('Mean Absolute Error : {}'.format(mean_absolute_error(engineered_dataset.y_test, adaboost_reg_pred)))
print()
print('Mean Squared Error : {}'.format(mean_squared_error(engineered_dataset.y_test, adaboost_reg_pred)))
print()
print('Root Mean Squared Error : {}'.format(sqrt(mean_squared_error(engineered_dataset.y_test, adaboost_reg_pred))))

Mean Absolute Error : 0.20165288729791706

Mean Squared Error : 0.08715703095313854

Root Mean Squared Error : 0.29522369646276453


# Decision Tree Regressor

Decision Tree: Powerful model capable of finding complex nonlinear relationships in the data.
Random Forest: Trains many Decision Trees on random subsets of the training instances and features via the bagging method (Ensemble Learning).
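To make the bagging idea concrete, the sketch below builds a small forest by hand: each tree is trained on a bootstrap sample and considers a random subset of features at each split, and the ensemble prediction is the average over trees. This is a simplified illustration under assumed settings (synthetic data, 25 trees, `max_depth=5`), not how `RandomForestRegressor` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data (assumption: not the real data)
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw n rows with replacement
    rows = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of the features
    tree = DecisionTreeRegressor(max_depth=5, max_features="sqrt",
                                 random_state=int(rng.integers(0, 1_000_000)))
    trees.append(tree.fit(X[rows], y[rows]))

# The ensemble prediction is the average of the individual tree predictions
ensemble_pred = np.mean([tree.predict(X) for tree in trees], axis=0)
print("Ensemble prediction shape:", ensemble_pred.shape)
```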

In [8]:
tree_reg = DecisionTreeRegressor(max_depth=5)

tree_reg.fit(engineered_dataset.x_train, engineered_dataset.y_train)

DecisionTreeRegressor(max_depth=5)

In [9]:
tree_reg_pred = tree_reg.predict(engineered_dataset.x_test)

print('Mean Absolute Error : {}'.format(mean_absolute_error(engineered_dataset.y_test, tree_reg_pred)))
print()
print('Mean Squared Error : {}'.format(mean_squared_error(engineered_dataset.y_test, tree_reg_pred)))
print()
print('Root Mean Squared Error : {}'.format(sqrt(mean_squared_error(engineered_dataset.y_test, tree_reg_pred))))

Mean Absolute Error : 0.07222576974868114

Mean Squared Error : 0.033611091729615135

Root Mean Squared Error : 0.18333328047470032


# Random Forest Regression Model

In [12]:
forest_reg = RandomForestRegressor(n_estimators= 50, max_depth=6)

forest_reg.fit(engineered_dataset.x_train, engineered_dataset.y_train)


RandomForestRegressor(max_depth=6, n_estimators=50)

In [13]:
forest_reg_pred = forest_reg.predict(engineered_dataset.x_test)

print('Mean Absolute Error : {}'.format(mean_absolute_error(engineered_dataset.y_test, forest_reg_pred)))
print()
print('Mean Squared Error : {}'.format(mean_squared_error(engineered_dataset.y_test, forest_reg_pred)))
print()
print('Root Mean Squared Error : {}'.format(sqrt(mean_squared_error(engineered_dataset.y_test, forest_reg_pred))))

Mean Absolute Error : 0.07201648566472933

Mean Squared Error : 0.03331842958157641

Root Mean Squared Error : 0.18253336566659917


# Cross Validation & Hyperparameter Optimization for Random Forest

In [16]:
scores = cross_val_score(forest_reg, engineered_dataset.x_train, engineered_dataset.y_train, scoring="neg_mean_squared_error", cv = 5)


In [17]:
forest_reg_rmse_scores = np.sqrt(-scores)
forest_reg_rmse_scores

array([0.16437461, 0.16956266, 0.16824223, 0.15338318, 0.17947782])
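The per-fold values are easier to compare with other models once they are summarized by their mean and standard deviation. The sketch below copies the five RMSE values from the output above into a fresh array so it can run on its own; the `mean_rmse` and `std_rmse` names are introduced here for illustration.

```python
import numpy as np

# RMSE values copied from the cross-validation output above
forest_reg_rmse_scores = np.array([0.16437461, 0.16956266, 0.16824223, 0.15338318, 0.17947782])

mean_rmse = forest_reg_rmse_scores.mean()
std_rmse = forest_reg_rmse_scores.std()
print(f"Mean RMSE: {mean_rmse:.4f}")
print(f"Standard deviation: {std_rmse:.4f}")
```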

In [21]:
param_grid = [
    {'n_estimators': [300, 400, 500], 'max_features': [2, 4, 6]},
    {'bootstrap': [False], 'n_estimators': [3, 6, 9], 'max_features': [2, 4, 6]}]

forest_regressor = RandomForestRegressor()

grid_search = GridSearchCV(forest_regressor, param_grid, scoring='neg_mean_squared_error', return_train_score=True, cv=3)

In [None]:
grid_search.fit(engineered_dataset.x_train, engineered_dataset.y_train)

In [None]:
# grid_search.best_params_

In [None]:
# grid_search.best_estimator_
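Once a grid search has been fitted, its `cv_results_` attribute can be turned into a table to compare candidates. The sketch below does this on a small synthetic problem with a reduced grid; the data, the grid values, and the `summary` name are assumptions chosen so the example runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a reduced grid (assumptions, for speed)
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
small_grid = {"n_estimators": [10, 20], "max_features": [2, 4]}
search = GridSearchCV(RandomForestRegressor(random_state=0), small_grid,
                      scoring="neg_mean_squared_error", return_train_score=True, cv=3)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
# Scores are negated MSE values, so flip the sign before taking the square root
results["mean_test_rmse"] = np.sqrt(-results["mean_test_score"])
results["mean_train_rmse"] = np.sqrt(-results["mean_train_score"])
summary = results[["params", "mean_train_rmse", "mean_test_rmse", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
print("Best params:", search.best_params_)
```

Comparing the train and test columns side by side is a quick way to spot candidates that overfit.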

In [None]:
# final_predictor = grid_search.best_estimator_
# final_predictor.fit(engineered_dataset.x_train, engineered_dataset.y_train)
# final_pred = final_predictor.predict(engineered_dataset.x_test)

In [None]:
# print('Mean Absolute Error : {}'.format(mean_absolute_error(engineered_dataset.y_test, final_pred)))
# print()
# print('Mean Squared Error : {}'.format(mean_squared_error(engineered_dataset.y_test, final_pred)))
# print()
# print('Root Mean Squared Error : {}'.format(sqrt(mean_squared_error(engineered_dataset.y_test, final_pred))))

In [None]:
# saving the model
# file_name = 'final_pickle_model.pickle'
# with open(file_name, 'wb') as f:
#     pickle.dump(final_predictor, f)
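A saved model is only useful if it can be restored and reproduces the same predictions. The sketch below shows a save/load round trip with `pickle`, using a small model trained on synthetic data and a temporary directory; the data, the model settings, and the `restored` name are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic problem (assumption: not the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] * 0.5 + rng.normal(scale=0.1, size=100)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, "final_pickle_model.pickle")
    # Context managers make sure the file handles are closed
    with open(file_name, "wb") as f:
        pickle.dump(model, f)
    with open(file_name, "rb") as f:
        restored = pickle.load(f)

same_predictions = np.allclose(model.predict(X), restored.predict(X))
print("Restored model reproduces predictions:", same_predictions)
```

Pickled scikit-learn models should be loaded with the same library version that created them, and only from trusted sources.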

# Feature importance

In [None]:
# feature_importances = grid_search.best_estimator_.feature_importances_
#
# attrs = list(engineered_dataset.x_train.select_dtypes(include=['float64', 'int64']))
#
# # Put the importance first so that sorting orders by importance, not by feature name
# sorted(zip(feature_importances, attrs), reverse=True)
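Feature importances are often easier to read as a labelled, sorted pandas Series. The sketch below shows this on synthetic data; the `feature_{i}` column names, the model settings, and the `importances` name are hypothetical and only serve the example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

feature_names = [f"feature_{i}" for i in range(6)]  # hypothetical column names
X, y = make_regression(n_samples=300, n_features=6, n_informative=2, random_state=0)
x_train = pd.DataFrame(X, columns=feature_names)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(x_train, y)

# Label each importance with its column name and sort from most to least important
importances = pd.Series(forest.feature_importances_, index=x_train.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```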

# Saving Predictions

In [None]:
# model_pred = pd.DataFrame({'parcelid':X_test_new.parcelid, 'logerror':final_pred})
# model_pred.to_csv('model_predictions.csv',index=False)
# model_pred.head()

# Conclusion

1. Performed all the feature engineering steps necessary to make the dataset ready to be fed into Machine Learning algorithms.

2. After pre-processing and feature engineering, split the dataset into train and test sets.

3. Performed feature scaling on the data for better performance.

4. Trained multiple models on the dataset using different ML regression algorithms.

5. Applied performance metrics such as MAE, MSE, and RMSE to find the best prediction model.

6. Used GridSearchCV to find the estimator with the lowest Root Mean Squared Error.

7. Saved the best predictor in .pickle format for future predictions.

8. Made predictions on the test data and saved them to a .csv file.
