### Code_hyperparameters

- This code was used to run extensive external hyperparameter tests, which informed our choice of parameter grids and optimisation methods
- We have included it to show the larger parameter grids we chose to test: runs took more than an hour with randomised search and ran overnight with grid search

- As stated in our main code, we use a **2-step optimisation process**, which increases efficiency and the likelihood of finding the "best" parameter combination

### Importing libraries
This section imports all libraries utilised within the programme

In [1]:
import pandas as pd
import numpy as np
from math import sqrt
import time

# SKLearn
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

#Define global variables
global gMethodDictionary
global gVarErrorDf
gVarErrorDf = pd.DataFrame(columns=[ 'Drop index','Predicted y', 'RMSE', 'R^2', 'MAE', 'RMSE %', 'R^2 %', 'MAE %'])
gVarNames = ['Cement','Blast Furnace Slag','Fly Ash','Water','Superplasticizer','Coarse Aggregate','Fine Aggregate','Age','Concrete compressive strength',' ']
gX_train_preprocessed = pd.DataFrame()
gX_test_preprocessed = pd.DataFrame()
gy_train = []
gy_test = []

In [2]:
def csv_import() -> tuple[pd.DataFrame, pd.Series]:
    """ 
    Import the CSV file, drop duplicate rows and split the data into features and target

    Returns:
    X_dataset (pd.DataFrame): Feature variables (Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age)
    y_dataset (pd.Series): Target variable (Concrete compressive strength)
    """
    # Import data from the file (read_csv already returns a DataFrame)
    dataset = pd.read_csv('Concrete_Data_Yeh_final.csv')

    # Data preprocessing
    #print(f'Null values: \n',dataset.isnull().sum()) #check for null values
    print(dataset.duplicated().sum(), 'duplicated rows dropped') # count duplicates before dropping them
    dataset = dataset.drop_duplicates() # drop duplicates
    #print(dataset.dtypes) # uncomment to check data types
  
    # Separate the target column from the feature columns
    y_dataset = dataset["csMPa"]
    X_dataset = dataset.drop("csMPa", axis=1)
    return X_dataset, y_dataset

In [3]:
def preprocessing(X_dataset: pd.DataFrame, y_dataset: pd.Series) -> None:
    """ 
    Preprocess the data (simple imputer (mean), standard scaler) and split into training and test sets

    Parameters:
    X_dataset (pd.DataFrame): The feature variables
    y_dataset (pd.Series): The target variable (concrete compressive strength)
    
    Returns: None (global variables are set)
    """
    # Splitting the data into training and test sets
    global gX_train_preprocessed, gX_test_preprocessed
    global gy_train, gy_test    

    X_train, X_test, y_train, y_test = train_test_split(X_dataset, y_dataset, test_size=0.2, random_state=42)

    # Creating a preprocessing pipeline that imputes missing values with the mean 
    # and scales features to have zero mean and unit variance.
    preprocessing_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())])

    gX_train_preprocessed = pd.DataFrame(preprocessing_pipeline.fit_transform(X_train), columns=X_train.columns)
    gX_test_preprocessed = pd.DataFrame(preprocessing_pipeline.transform(X_test), columns=X_test.columns)
    gy_train = y_train
    gy_test = y_test

In [4]:
def random_forest_regression(xTrainData: pd.DataFrame, yTrainData: pd.Series, xTestData: pd.DataFrame, bestFit: dict = None) -> np.ndarray:
    """
    This function creates and fits a random forest regression model

    Parameters:
    xTrainData (pandas.DataFrame): The independent variables for training
    yTrainData (pandas.Series): Target training values (Compressive Strength)
    xTestData (pandas.DataFrame): The independent variables for the test set, used to predict y_pred
    bestFit (dict, optional): The hyperparameters for the Random Forest model. Defaults to None (scikit-learn defaults).

    Returns:
    y_pred (numpy.ndarray): The predicted values of compressive strength for the test set
    """ 
   
    regressor = RandomForestRegressor(**(bestFit or {})) # Creating the Random Forest Regressor
    regressor.fit(xTrainData, yTrainData) 
    y_pred = regressor.predict(xTestData) # Predict from the test features, not the test targets

    return y_pred

### Hyperparameters

As shown, Random Forest Regression has the best untuned performance on our dataset, so we only analyse hyperparameters for this model.
The most important hyperparameters to tune in a Random Forest model to improve its performance are:

1. `n_estimators`: The number of trees in the forest. More trees generally give better performance, but are more computationally intensive.

2. `max_depth`: The maximum depth of each tree. Limiting it can help to prevent overfitting: if the maximum depth is too high, the model may fit noise in the training data and perform poorly on test data.

3. `min_samples_split`: The minimum number of samples a node must contain before it can be split. Increasing it makes the trees shallower and more regularised.

4. `min_samples_leaf`: Similar to `min_samples_split`, but sets the minimum number of samples required at each leaf.

5. `max_features`: The number of features to consider when looking for the best split. Note that for `RandomForestRegressor` the legacy `"auto"` option meant all features (`max_features=n_features`), not `sqrt(n_features)`; it was deprecated in scikit-learn 1.1 and removed in 1.3.

6. `bootstrap`: If False, the whole dataset is used to build each tree. We chose to use only False throughout, as initial testing found True to yield significantly lower performance.

We chose to test these along with `min_impurity_decrease` (the minimum impurity decrease required for a split).
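
To make the roles of these hyperparameters concrete, here is a minimal sketch that fits a small forest and inspects the resulting trees. It uses synthetic data from `make_regression` as a stand-in for the concrete dataset (an assumption for illustration only), and the specific values are arbitrary rather than the tuned ones.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: 8 features, like the concrete dataset)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

model = RandomForestRegressor(
    n_estimators=20,            # number of trees
    max_depth=None,             # grow until the other stopping rules apply
    min_samples_split=4,        # a node needs at least 4 samples to be split
    min_samples_leaf=3,         # every leaf keeps at least 3 samples
    max_features="sqrt",        # features considered at each split
    min_impurity_decrease=0.0,  # no minimum gain required for a split
    bootstrap=False,            # each tree sees the whole training set
    random_state=0,
)
model.fit(X, y)

def smallest_leaf(tree) -> int:
    """Return the number of training samples in the smallest leaf of a fitted tree."""
    t = tree.tree_
    is_leaf = t.children_left == -1  # leaves have no children
    return int(t.n_node_samples[is_leaf].min())

smallest = min(smallest_leaf(est) for est in model.estimators_)
deepest = max(est.get_depth() for est in model.estimators_)
print("trees:", len(model.estimators_))
print("smallest leaf across the forest:", smallest)
print("deepest tree:", deepest)
```

The smallest leaf never falls below `min_samples_leaf`, which is how that parameter limits how closely each tree can fit individual training points.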

**Random Search** sets up a grid of hyperparameter values and evaluates only a random selection of combinations from it. It is therefore less computationally expensive than Grid Search (and was found to be better optimised for multi-core processing). We use this step first because, while efficient, it is not guaranteed to find the best parameters.
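
As a rough illustration of the sampling step, the sketch below counts the combinations in a grid of the same shape as the one used later in this notebook and draws a few random candidates from it. `ParameterGrid` and `ParameterSampler` are the scikit-learn utilities that enumerate and sample such grids; the `max_features` values here are placeholders for the two options in the search cell.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_random = {
    'n_estimators': [50, 100, 125, 135, 145, 155, 170, 180, 190, 200, 300, 400, 600],
    'max_depth': [None, 20, 30, 40, 60, 80, 100],
    'min_samples_split': [2, 3, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10, 20],
    'min_impurity_decrease': [0.0, 0.2, 0.4, 0.5, 0.8],
    'bootstrap': [False],
    'max_features': ['sqrt', 1.0],  # placeholder for the two options in the grid
}

# Size of the full grid: 13 * 7 * 5 * 5 * 5 * 1 * 2
total = len(ParameterGrid(param_random))
print("combinations in the full grid:", total)

# When every entry is a list, candidates are drawn without replacement
samples = list(ParameterSampler(param_random, n_iter=10, random_state=0))
unique = {tuple(sorted(s.items())) for s in samples}
print("sampled:", len(samples), "unique:", len(unique))

# Share of the grid covered by a 20,000-iteration search
print(f"coverage with n_iter=20000: {20000 / total:.0%}")
```

With `n_iter=20000` the randomised search covers roughly 88% of this grid, so most of the saving over an exhaustive search comes from choosing a smaller `n_iter` on quicker runs.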

**Grid Search** systematically works through every combination of parameter values in a grid, cross-validating each one to determine which gives the best performance. We use a smaller grid defined around the results of the random search.

**NOTE: This shows a larger test grid; this configuration took ~1hr30 to run**
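
One way to build the smaller grid around the random-search result is sketched below. The helper `neighbourhood` is hypothetical (it is not part of this notebook or of scikit-learn): it clamps candidates at a lower bound and removes duplicates. This makes it a variant of the refinement in the search cell rather than a copy of it. The "best" values are taken from the recorded randomised-search output.

```python
from sklearn.model_selection import ParameterGrid

def neighbourhood(best, step, lower):
    """Return sorted unique candidates around `best`, never going below `lower`."""
    candidates = {max(lower, best - step), best, best + step}
    return sorted(candidates)

# Best parameters reported by the randomised search in the recorded run
best_random_params = {
    'n_estimators': 180, 'min_samples_split': 3, 'min_samples_leaf': 1,
    'min_impurity_decrease': 0.0, 'max_features': 'sqrt',
    'max_depth': None, 'bootstrap': False,
}

param_grid = {
    'n_estimators': neighbourhood(best_random_params['n_estimators'], 10, 1),
    'min_samples_split': neighbourhood(best_random_params['min_samples_split'], 1, 2),
    'min_samples_leaf': neighbourhood(best_random_params['min_samples_leaf'], 1, 1),
    'min_impurity_decrease': neighbourhood(best_random_params['min_impurity_decrease'], 0.1, 0.0),
    'bootstrap': [best_random_params['bootstrap']],
    'max_features': [best_random_params['max_features']],
}

# max_depth=None means "unlimited", so there is nothing to step around
depth = best_random_params['max_depth']
param_grid['max_depth'] = [None] if depth is None else neighbourhood(depth, 5, 1)

n_candidates = len(ParameterGrid(param_grid))
print(param_grid)
print("candidates:", n_candidates)
```

Because duplicates are removed, this variant yields fewer candidates than a grid that always lists three values per parameter, so its candidate count will not match the count printed by the notebook's own grid search.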

In [5]:
def random_grid_search():
    """
    This function performs a randomised search to narrow down the hyperparameter grid for the Random Forest Regressor.
    It then performs a grid search based on the randomised search results, to find the best hyperparameter combination.

    Parameters: None
    Returns: best_grid_params (dict): The best hyperparameter combination
    """

    # Large hyperparameter grid for randomised search - 13*7*5*5*5*1*2 = 22,750 combinations
    # While this is a large run during testing we completed several larger runs including using full exhaustive grid search over the entire hyperparameter space
    param_random = {
        'n_estimators': [50, 100, 125, 135, 145, 155, 170, 180, 190, 200, 300, 400,600], 
        'max_depth': [None, 20, 30, 40, 60, 80, 100],
        'min_samples_split': [ 2, 3,5, 10, 20], 
        'min_samples_leaf': [ 1, 2, 5, 10, 20],
        'min_impurity_decrease': [0.0, 0.2, 0.4, 0.5, 0.8],
        'bootstrap': [False], 
        'max_features': [1.0, 'sqrt']  # 1.0 (all features) stands in for the legacy 'auto', which was removed in scikit-learn 1.3
    }


    # Randomised search - 20,000 iterations
    rf_random = RandomizedSearchCV(estimator=RandomForestRegressor(), param_distributions=param_random, n_iter=20000, cv=5, n_jobs=-1, verbose=2)
    rf_random.fit(gX_train_preprocessed, gy_train)

    # Finding the best hyperparameter combination from randomised search
    best_random_params = rf_random.best_params_
    print("Best Random Forrest Parameters:", best_random_params)

    # Defining a smaller hyperparameter grid for grid search
    param_grid = {
        'n_estimators': [best_random_params['n_estimators'] - 10, best_random_params['n_estimators'], best_random_params['n_estimators'] + 10],
        'min_samples_split': [max(2, best_random_params['min_samples_split'] - 1), best_random_params['min_samples_split'], best_random_params['min_samples_split'] + 1],  # integer min_samples_split must be at least 2
        'min_samples_leaf': [best_random_params['min_samples_leaf'] - 1 if best_random_params['min_samples_leaf'] > 1 else 2, best_random_params['min_samples_leaf'], best_random_params['min_samples_leaf'] + 1],
        'min_impurity_decrease': [best_random_params['min_impurity_decrease'] - 0.1 if best_random_params['min_impurity_decrease'] > 0.1 else 0, best_random_params['min_impurity_decrease'], best_random_params['min_impurity_decrease'] + 0.1],
        'bootstrap': [best_random_params['bootstrap']],
        'max_features': [best_random_params['max_features']]
    }
    # If statements used to ensure that hyperparameter values are not out of range (and within the param_grid)
    if best_random_params['max_depth'] is None:
        param_grid['max_depth'] = [None]
    else:
        param_grid['max_depth'] = [best_random_params['max_depth'] - 5, best_random_params['max_depth'], best_random_params['max_depth'] + 5]
    
    # Grid search
    rf_grid = GridSearchCV(estimator=RandomForestRegressor(), param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
    rf_grid.fit(gX_train_preprocessed, gy_train)

    # Finding the best hyperparameter combination from grid search
    best_grid_params = rf_grid.best_params_

    # Use the best hyperparameter combination to train the final Random Forest model
    final_rf_model = RandomForestRegressor(**best_grid_params)
    final_rf_model.fit(gX_train_preprocessed, gy_train)
    print("Best Random Forrest Parameters:", best_grid_params)
    return best_grid_params


In [6]:
def opt_unopt_random_forest(best_grid_params: dict):
    """
    This function compares the performance of the original unoptimised Random Forest Regressor model with the optimised Random Forest Regressor model.
    The model trained, using the hyperparameter combination identified by the grid search, performance metrics are compared to the original model. If the
    optimised model performs worse than the original model, the grid search is deemed unsuccessful and an empty dictionary is returned.

    Parameters: 
    best_grid_params (dict): The best hyperparameter combination from the grid search.
    
    Returns:
    best_grid_params (dict): The best hyperparameter combination in case the optimisation was unsuccessful.
    """ 
    
    # Original unoptimized Random Forest Regression
    y_pred_unoptimised = random_forest_regression(gX_train_preprocessed, gy_train, gX_test_preprocessed)

    # Optimised Random Forest Regressor
    y_pred_optimised = random_forest_regression(gX_train_preprocessed, gy_train, gX_test_preprocessed,best_grid_params)

    # Calculate error metrics
    rmse_unoptimised = sqrt(mean_squared_error(gy_test, y_pred_unoptimised))
    rmse_optimised = sqrt(mean_squared_error(gy_test, y_pred_optimised))
    r2_unoptimised = r2_score(gy_test, y_pred_unoptimised)
    r2_optimised = r2_score(gy_test, y_pred_optimised)
    mae_unoptimised = mean_absolute_error(gy_test, y_pred_unoptimised)
    mae_optimised = mean_absolute_error(gy_test, y_pred_optimised)

    # Print errors
    print("Unoptimised RMSE:", rmse_unoptimised)
    print("Optimised RMSE:", rmse_optimised)
    print("Unoptimised R^2:", r2_unoptimised)
    print("Optimised R^2:", r2_optimised)
    print("Unoptimised MAE:", mae_unoptimised)
    print("Optimised MAE:", mae_optimised)


## Function: main()

- The `main()` function is the entry point of the program. It is responsible for coordinating the execution of other functions and controlling the flow of the program.
- This is a *"stripped back"* version from the main code only running the functions needed for hyperparameter optimisation 

- A timer is included in the function to determine how long the hyperparameter optimisation run took when run overnight/unsupervised
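
For long unattended runs, the elapsed time is easier to read in hours and minutes than in raw seconds. The sketch below shows one way to do this; `format_duration` is a hypothetical helper that is not part of this notebook, and `time.perf_counter()` is used here as an alternative to `time.time()` because it is intended for measuring intervals.

```python
import time

def format_duration(seconds: float) -> str:
    """Format a duration in seconds as hours, minutes and whole seconds."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"

start = time.perf_counter()
total = sum(range(1_000_000))  # stand-in for the hyperparameter search
elapsed = time.perf_counter() - start
print("Time taken:", format_duration(elapsed))

# The duration recorded in the output of this notebook, reformatted
recorded = format_duration(6757.785653829575)
print("Recorded run:", recorded)
```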

In [7]:
def main():
    X_dataset, y_dataset = csv_import()
    preprocessing(X_dataset, y_dataset)
    print("Preprocessing done")
    
    # Hyperparameter optimisation
    print("Performing grid search \n")
    start_time = time.time()  # Start the timer
    best_grid_params = random_grid_search()
    end_time = time.time()  # End the timer
    print("********Time taken for grid search:", end_time - start_time, "seconds********\n")
    print("Grid search complete. Errors are as follows:")
    opt_unopt_random_forest(best_grid_params) # Calculate errors for optimised and unoptimised models and print
main()

23 duplicated rows dropped
Preprocessing done
Performing grid search 

Fitting 5 folds for each of 20000 candidates, totalling 100000 fits
Best Random Forrest Parameters: {'n_estimators': 180, 'min_samples_split': 3, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.0, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Fitting 5 folds for each of 81 candidates, totalling 405 fits
Best Random Forrest Parameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 180}
********Time taken for grid search: 6757.785653829575 seconds********

Grid search complete. Errors are as follows:
Unoptimized RMSE: 4.7741369547647965
Optimized RMSE: 4.093477595057695
Unoptimized R^2: 0.9005838067551077
Optimized R^2: 0.9269109666398374
Unoptimized MAE: 3.5635091643092873
Optimized MAE: 2.967129675467544
