Must run from start using run all:

### Importing libraries
This section imports all libraries utilised within the programme

In [335]:
import pandas as pd
import numpy as np
from math import sqrt

# SKLearn
from sklearn.pipeline import Pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import matplotlib.ticker as ticker
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import StackingRegressor

# Graphing
from IPython.display import display, clear_output
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from ipywidgets import interact, widgets, interactive


'''Global variables
    '''

global gMethodDictionary
global gVarErrorDf
gVarErrorDf = pd.DataFrame(columns=[ 'Drop index','Predicted y', 'RMSE', 'R^2', 'MAE', 'RMSE %', 'R^2 %', 'MAE %'])
gVarNames = ['Cement','Blast Furnace Slag','Fly Ash','Water','Superplasticizer','Coarse Aggregate','Fine Aggregate','Age','Concrete compressive strength',' ']
gX_train_preprocessed = pd.DataFrame()
gX_test_preprocessed = pd.DataFrame()
gy_train = []
gy_test = []


Hyperperameters

As shown Random Forrest Regression has the best untuned performance on our dataset and as such we are only analysing hyperperameters for this model:
The most important hyperparameters to tune in a Random Forest model to improve its performance are:

1. `n_estimators`: This is the number of trees you want to build before taking the maximum voting or averages of predictions. Higher number of trees give you better performance but makes your code slower.

2. `max_depth`: The maximum depth of the tree. This parameter can help to prevent overfitting. If the max depth is too high, the model may learn too much from the training data and perform poorly on unseen data.

3. `min_samples_split`: The minimum number of samples required to split an internal node. If you increase this parameter, each tree in the forest becomes more constrained as it has to consider more samples at each node.

4. `min_samples_leaf`: The minimum number of samples required to be at a leaf node. This parameter is similar to min_samples_splits, however, this describe the minimum number of samples of samples at the leafs, the base of the tree.

5. `max_features`: The number of features to consider when looking for the best split. If set to "auto", then `max_features=sqrt(n_features)`.

6. `bootstrap`: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

Both Random Search and Grid Search are hyperparameter tuning techniques, and each has its own advantages and disadvantages.

**Grid Search** systematically works through multiple combinations of parameter tunes, cross-validating as it goes to determine which tune gives the best performance. The benefit is that it's guaranteed to find the best combination of parameters supplied. However, it can be computationally expensive, especially if the number of parameters or their possible values are large.

**Random Search** sets up a grid of hyperparameter values and selects random combinations to train the model and score. The benefit is that it's not as computationally expensive as Grid Search, and you have more control over how long you want it to run for, as you can set the number of iterations. However, it's not guaranteed to find the best parameters.

In practice, it's often recommended to start with Random Search to narrow down the possible range of values for each hyperparameter, and then use Grid Search within that range to find the best combination.

So, neither is strictly "better" - the best choice depends on your specific situation, including the number of hyperparameters you need to tune, the number of possible values for each hyperparameter, and the computational resources you have available.

In [336]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

def random_grid_search():
    """
    This function performs a random grid search to identify the best hyperparameter combination for the Random Forest Regressor model.

    The function first defines a hyperparameter grid, then performs a randomized search to identify the best hyperparameter combination. 
    It then defines a narrower hyperparameter grid around the best hyperparameter combination and performs a grid search to identify the 
    best hyperparameter combination. Finally, it uses the best hyperparameter combination to train the final Random Forest model.

    Parameters:
    None

    Returns:
    None
    """
    '''param_grid = {
        'max_features': ['auto', 'sqrt'],
        'bootstrap': [True, False]
    }'''


    param_random = {
        'n_estimators': [50, 100, 125, 135, 145, 155, 170], # 7
        'max_depth': [None, 20, 30, 40], #4
        'min_samples_split': [ 2, 3,5, 10], #5
        'min_samples_leaf': [ 1, 2, 5, 10],#4
        'min_impurity_decrease': [0.0, 0.2, 0.4, 0.5],#4
        'bootstrap': [False], #2
        'max_features': ['auto', 'sqrt'] #2
    }

    #add a parameter to the param_random dictionary
    #param_random['max_features'] = ['auto', 'sqrt']

    # Step 1: Perform randomized search
    rf_random = RandomizedSearchCV(estimator=RandomForestRegressor(), param_distributions=param_random, n_iter=300, cv=5, n_jobs=-1, verbose=2)
    rf_random.fit(gX_train_preprocessed, gy_train)

    # Step 2: Identify the best hyperparameter combination from randomized search
    best_random_params = rf_random.best_params_
    print("Best Random Forrest Parameters:", best_random_params)

    # Step 3: Define a narrower hyperparameter grid around the best hyperparameter combination

    param_grid = {
        'n_estimators': [best_random_params['n_estimators'] - 10, best_random_params['n_estimators'], best_random_params['n_estimators'] + 10],
        'min_samples_split': [best_random_params['min_samples_split'] - 1 if best_random_params['min_samples_split'] > 1 else 2, best_random_params['min_samples_split'], best_random_params['min_samples_split'] + 1],
        'min_samples_leaf': [best_random_params['min_samples_leaf'] - 1 if best_random_params['min_samples_leaf'] > 1 else 2, best_random_params['min_samples_leaf'], best_random_params['min_samples_leaf'] + 1],
        'min_impurity_decrease': [best_random_params['min_impurity_decrease'] - 0.1 if best_random_params['min_impurity_decrease'] > 0.1 else 0, best_random_params['min_impurity_decrease'], best_random_params['min_impurity_decrease'] + 0.1],
        'bootstrap': [best_random_params['bootstrap']],
        'max_features': [best_random_params['max_features']]
    }

    if best_random_params['max_depth'] == None:
        param_grid['max_depth'] = [None]
    else:
        param_grid['max_depth'] = [best_random_params['max_depth'] - 5, best_random_params['max_depth'], best_random_params['max_depth'] + 5]
    

    # Step 5: Perform grid search
    rf_grid = GridSearchCV(estimator=RandomForestRegressor(), param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
    rf_grid.fit(gX_train_preprocessed, gy_train)

    # Step 6: Identify the best hyperparameter combination from grid search
    best_grid_params = rf_grid.best_params_

    # Use the best hyperparameter combination to train the final Random Forest model
    final_rf_model = RandomForestRegressor(**best_grid_params)
    final_rf_model.fit(gX_train_preprocessed, gy_train)
    print("Best Random Forrest Parameters:", best_grid_params)
    return best_grid_params


In [337]:
def opt_unopt_random_forest(best_grid_params):
    """
    This function compares the performance of the original unoptimized Random Forest Regressor model with the optimized Random Forest Regressor model.

    The function first trains the original unoptimized Random Forest Regressor model and uses it to predict the target variable for the testing data. 
    It then trains the optimized Random Forest Regressor model and uses it to predict the target variable for the testing data. Finally, it calculates 
    the root mean squared error (RMSE), R^2 score, and mean absolute error (MAE) for the predictions of both models and prints the results.

    Parameters:
    None

    Returns:
    None
    """ 
    # Original unoptimized Random Forest Regression
    y_pred_unoptimized = random_forest_regression(gX_train_preprocessed, gy_train, gX_test_preprocessed)

    # Optimized Random Forest Regression  
    y_pred_optimized = random_forest_regression(gX_train_preprocessed, gy_train, gX_test_preprocessed,best_grid_params)

    # Calculate errors
    rmse_unoptimized = sqrt(mean_squared_error(gy_test, y_pred_unoptimized))
    rmse_optimized = sqrt(mean_squared_error(gy_test, y_pred_optimized))
    r2_unoptimized = r2_score(gy_test, y_pred_unoptimized)
    r2_optimized = r2_score(gy_test, y_pred_optimized)
    mae_unoptimized = mean_absolute_error(gy_test, y_pred_unoptimized)
    mae_optimized = mean_absolute_error(gy_test, y_pred_optimized)

    # Print errors
    print("Unoptimized RMSE:", rmse_unoptimized)
    print("Optimized RMSE:", rmse_optimized)
    print("Unoptimized R^2:", r2_unoptimized)
    print("Optimized R^2:", r2_optimized)
    print("Unoptimized MAE:", mae_unoptimized)
    print("Optimized MAE:", mae_optimized)


## Function: main()

The `main()` function is the entry point of the program. It is responsible for coordinating the execution of other functions and controlling the flow of the program.



In [338]:
import time

def main():
    X,y = csv_import()
    preprocessing(X,y)
    print("Preprocessing done")
    
    # Hyperparameter tuning
    print("Performing grid search \n")
    start_time = time.time()  # Start the timer
    best_grid_params = random_grid_search()
    end_time = time.time()  # End the timer
    print("********Time taken for grid search:", end_time - start_time, "seconds********\n")
    print("Grid search complete. Errors are as follows:")
    opt_unopt_random_forest(best_grid_params)
main()

Null values: 
 cement               0
slag                 6
flyash               1
water                8
superplasticizer    14
coarseaggregate      7
fineaggregate        3
age                  5
csMPa                0
dtype: int64
Duplicated values: 23
Preprocessing done
Performing grid search 

Fitting 5 folds for each of 300 candidates, totalling 1500 fits
Best Random Forrest Parameters: {'n_estimators': 100, 'min_samples_split': 3, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.0, 'max_features': 'sqrt', 'max_depth': 30, 'bootstrap': False}
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Best Random Forrest Parameters: {'bootstrap': False, 'max_depth': 35, 'max_features': 'sqrt', 'min_impurity_decrease': 0, 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 110}
********Time taken for grid search: 101.90292930603027 seconds********

Grid search complete. Errors are as follows:
Unoptimized RMSE: 4.813945610963076
Optimized RMSE: 4.015334217741445
Un