### **HyperParameter Optimization using GridSearch**

**Estimating blueberry yield using a Random Forest Model**

The focus is on basic hyperparameter optimization with grid search (scikit-learn's `GridSearchCV`), so we use a clean, straightforward dataset.

**Features**

| Feature | Unit | Description |
|---|---|---|
| Clonesize | m² | The average blueberry clone size in the field |
| Honeybee | bees/m²/min | Honeybee density in the field |
| Bumbles | bees/m²/min | Bumblebee density in the field |
| Andrena | bees/m²/min | Andrena bee density in the field |
| Osmia | bees/m²/min | Osmia bee density in the field |
| MaxOfUpperTRange | ℃ | The highest record of the upper band daily air temperature during the bloom season |
| MinOfUpperTRange | ℃ | The lowest record of the upper band daily air temperature |
| AverageOfUpperTRange | ℃ | The average of the upper band daily air temperature |
| MaxOfLowerTRange | ℃ | The highest record of the lower band daily air temperature |
| MinOfLowerTRange | ℃ | The lowest record of the lower band daily air temperature |
| AverageOfLowerTRange | ℃ | The average of the lower band daily air temperature |
| RainingDays | Day | The total number of days during the bloom season with precipitation greater than zero |
| AverageRainingDays | | The average of raining days over the entire bloom season |


Dataset: [wild-blueberry-yield-prediction](https://www.kaggle.com/datasets/saurabhshahane/wild-blueberry-yield-prediction)

**Import packages, libraries and modules**

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Import dataset
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/HyperParameter_optimisation/WildBlueberryPollinationSimulationData.csv')

In [None]:
data.head()

Unnamed: 0,Row#,clonesize,honeybee,bumbles,andrena,osmia,MaxOfUpperTRange,MinOfUpperTRange,AverageOfUpperTRange,MaxOfLowerTRange,MinOfLowerTRange,AverageOfLowerTRange,RainingDays,AverageRainingDays,fruitset,fruitmass,seeds,yield
0,0,37.5,0.75,0.25,0.25,0.25,86.0,52.0,71.9,62.0,30.0,50.8,16.0,0.26,0.410652,0.408159,31.678898,3813.165795
1,1,37.5,0.75,0.25,0.25,0.25,86.0,52.0,71.9,62.0,30.0,50.8,1.0,0.1,0.444254,0.425458,33.449385,4947.605663
2,2,37.5,0.75,0.25,0.25,0.25,94.6,57.2,79.0,68.2,33.0,55.9,16.0,0.26,0.383787,0.399172,30.546306,3866.798965
3,3,37.5,0.75,0.25,0.25,0.25,94.6,57.2,79.0,68.2,33.0,55.9,1.0,0.1,0.407564,0.408789,31.562586,4303.94303
4,4,37.5,0.75,0.25,0.25,0.25,86.0,52.0,71.9,62.0,30.0,50.8,24.0,0.39,0.354413,0.382703,28.873714,3436.493543


# **Preprocessing**

In [None]:
# Preprocessing function

def preprocess_inputs(df):
    # Work on a copy so the original DataFrame is left untouched
    df = df.copy()

    # Drop the id column
    df = df.drop('Row#', axis=1)

    # Split the dataset into target (y) and features (X)
    y = df['yield']
    X = df.drop('yield', axis=1)

    # Train-test split: 70% train, 30% test, shuffled with a fixed seed
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)

    return X_train, X_test, y_train, y_test


In [None]:
X_train, X_test, y_train, y_test=preprocess_inputs(data)

In [None]:
X_train

Unnamed: 0,clonesize,honeybee,bumbles,andrena,osmia,MaxOfUpperTRange,MinOfUpperTRange,AverageOfUpperTRange,MaxOfLowerTRange,MinOfLowerTRange,AverageOfLowerTRange,RainingDays,AverageRainingDays,fruitset,fruitmass,seeds
214,12.5,0.25,0.250,0.50,0.50,94.6,57.2,79.0,68.2,33.0,55.9,1.00,0.10,0.582954,0.488176,40.559770
88,12.5,0.25,0.250,0.25,0.50,86.0,52.0,71.9,62.0,30.0,50.8,34.00,0.56,0.435969,0.419720,32.815794
479,25.0,0.50,0.250,0.38,0.63,94.6,57.2,79.0,68.2,33.0,55.9,24.00,0.39,0.364565,0.391617,29.908518
602,25.0,0.50,0.250,0.75,0.50,86.0,52.0,71.9,62.0,30.0,50.8,1.00,0.10,0.523846,0.460305,37.277297
147,12.5,0.25,0.250,0.38,0.38,86.0,52.0,71.9,62.0,30.0,50.8,16.00,0.26,0.553730,0.471250,38.534569
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
715,25.0,0.50,0.380,0.50,0.63,94.6,57.2,79.0,68.2,33.0,55.9,16.00,0.26,0.527592,0.464639,37.782288
767,20.0,0.00,0.585,0.00,0.00,86.0,52.0,71.9,62.0,30.0,50.8,3.77,0.06,0.599984,0.529791,46.585105
72,12.5,0.25,0.250,0.25,0.38,86.0,52.0,71.9,62.0,30.0,50.8,34.00,0.56,0.416271,0.409438,31.577558
235,12.5,0.25,0.250,0.50,0.63,77.4,46.8,64.7,55.8,27.0,45.8,16.00,0.26,0.589306,0.488616,40.546480


In [None]:
X_train.describe()

Unnamed: 0,clonesize,honeybee,bumbles,andrena,osmia,MaxOfUpperTRange,MinOfUpperTRange,AverageOfUpperTRange,MaxOfLowerTRange,MinOfLowerTRange,AverageOfLowerTRange,RainingDays,AverageRainingDays,fruitset,fruitmass,seeds
count,543.0,543.0,543.0,543.0,543.0,543.0,543.0,543.0,543.0,543.0,543.0,543.0,543.0,543.0,543.0,543.0
mean,18.775322,0.389118,0.283116,0.476162,0.559987,82.078637,49.56372,68.547514,59.167956,28.617495,48.487845,18.121418,0.317827,0.503151,0.446645,36.201995
std,6.922487,0.787501,0.063583,0.160474,0.162965,9.19762,5.606868,7.678822,6.652775,3.209863,5.418443,12.188437,0.171467,0.078537,0.039988,4.319438
min,12.5,0.0,0.0,0.0,0.0,69.7,39.0,58.2,50.2,24.3,41.2,1.0,0.06,0.192732,0.311921,22.079199
25%,12.5,0.25,0.25,0.38,0.5,77.4,46.8,64.7,55.8,27.0,45.8,1.0,0.1,0.458556,0.417076,33.225444
50%,12.5,0.25,0.25,0.5,0.63,86.0,52.0,71.9,62.0,30.0,50.8,16.0,0.26,0.508992,0.446576,36.250838
75%,25.0,0.5,0.38,0.63,0.63,86.0,52.0,71.9,62.0,30.0,50.8,24.0,0.39,0.562291,0.476229,39.285339
max,40.0,18.43,0.585,0.75,0.75,94.6,57.2,79.0,68.2,33.0,55.9,34.0,0.56,0.645641,0.532772,46.585105


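Tree-based models split on one feature at a time using thresholds, so rescaling a feature changes the threshold values but not the resulting partitions. We therefore don't need to scale the data.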
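To make the scaling argument concrete, here is a minimal pure-Python sketch of a single-threshold split (a decision "stump"). The helper names `best_split` and `standardize`, and the toy values, are illustrative assumptions and are not taken from the dataset or from scikit-learn. The sketch checks that the best partition of the samples is the same before and after standardizing the feature.

```python
def best_split(xs, ys):
    """Return the indices sent left by the single-threshold split with the lowest SSE."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_sse, best_left = float("inf"), None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold can separate two equal feature values
        sse = 0.0
        for part in (order[:k], order[k:]):
            mean = sum(ys[i] for i in part) / len(part)
            sse += sum((ys[i] - mean) ** 2 for i in part)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(order[:k])
    return best_left


def standardize(xs):
    """Zero-mean, unit-variance scaling of one feature."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]


# Made-up toy values for illustration only
x_raw = [1.0, 3.77, 16.0, 16.0, 24.0, 34.0]
y_toy = [4900.0, 4800.0, 3850.0, 3800.0, 3400.0, 3000.0]
x_scaled = standardize(x_raw)

# The split error depends only on which samples land on each side,
# and standardizing preserves the ordering of the feature values.
same_partition = best_split(x_raw, y_toy) == best_split(x_scaled, y_toy)
print(same_partition)
```

A random forest repeats this kind of threshold search many times over bootstrapped samples and feature subsets, so the same reasoning carries over.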


# **Training  -  Random Forest Model**

In [None]:
# Random Forest Model

model=RandomForestRegressor(random_state=1)
model.fit(X_train, y_train)

y_pred=model.predict(X_test)

In [None]:
# RMSE
RMSE=np.sqrt(np.mean((y_test-y_pred)**2))

In [None]:
# Baseline model: always predict the mean of the test targets

y_test.mean()

5963.470689585471

In [None]:
# Sum of squared errors for the baseline model
np.sum((y_test-y_test.mean())**2)

462021092.3782099

In [None]:
# Sum of squared errors for our model
np.sum((y_test-y_pred)**2)

8115311.619052531

In [None]:
# R^2
r2=1-(np.sum((y_test-y_pred)**2)/np.sum((y_test-y_test.mean())**2))

In [None]:
print('RMSE: {:.2f}'.format(RMSE))
print('R^2 score: {:.5f}'.format(r2))

RMSE: 186.23
R^2 score: 0.98244
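As a quick sanity check, the R^2 score can be reproduced from the two sums of squared errors printed earlier: it is one minus the ratio of the model's error to the baseline's error. The variable names below are illustrative; the numbers are copied from the outputs above.

```python
# Sums of squared errors copied from the notebook outputs above
sse_baseline = 462021092.3782099  # predicting y_test.mean() for every sample
sse_model = 8115311.619052531     # random forest predictions

# R^2 = 1 - SSE_model / SSE_baseline
r2_check = 1 - sse_model / sse_baseline
print('R^2 score: {:.5f}'.format(r2_check))
```

An R^2 of 0 would mean the model is no better than always predicting the mean, and 1 would mean a perfect fit.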


# **Improve results with HyperParameter optimization using GridSearch**

In [None]:
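# Candidate values to try for each hyperparameter
params = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [2, 4, 6, 8, 10]
}

# Exhaustive cross-validated search over the grid; the best model is refit on X_train
model = GridSearchCV(RandomForestRegressor(random_state=1), params)
model.fit(X_train, y_train)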


GridSearchCV(estimator=RandomForestRegressor(random_state=1),
             param_grid={'max_depth': [2, 4, 6, 8, 10],
                         'n_estimators': [50, 100, 150, 200]})
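To get a feel for how much work this search does, the sketch below enumerates the grid with `itertools.product`. It assumes a recent scikit-learn release, where `GridSearchCV` defaults to 5-fold cross-validation when `cv` is not set; the variable names are illustrative.

```python
from itertools import product

params = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [2, 4, 6, 8, 10],
}

# Every combination of one value per hyperparameter
names = list(params)
candidates = [dict(zip(names, values)) for values in product(*params.values())]

n_folds = 5  # assumed default cv of GridSearchCV in recent scikit-learn versions
total_fits = len(candidates) * n_folds

print(len(candidates), 'candidates,', total_fits, 'cross-validation fits')
print(candidates[0])
```

With the default `refit=True`, one more fit on the full training set follows for the winning combination, which is why adding values to the grid quickly increases the run time.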

In [None]:
y_pred=model.predict(X_test)

In [None]:
# RMSE and R^2

RMSE=np.sqrt(np.mean((y_test-y_pred)**2))
r2=1-(np.sum((y_test-y_pred)**2)/np.sum((y_test-y_test.mean())**2))

print('RMSE: {:.2f}'.format(RMSE))
print('R^2 score: {:.5f}'.format(r2))

RMSE: 185.37
R^2 score: 0.98260
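The RMSE only moves from 186.23 to 185.37, so it helps to check which combination the search actually picked. A fitted `GridSearchCV` exposes `best_params_`, `best_score_` and `cv_results_` for this. The sketch below shows them on a small synthetic dataset with a deliberately tiny grid so that it runs quickly; the data, the grid and the names `X_demo`, `y_demo`, `demo_params` and `search` are assumptions for illustration, not the notebook's own.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the blueberry dataset)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=1)

# Tiny grid and 3 folds, chosen only to keep the example fast
demo_params = {'n_estimators': [10, 20], 'max_depth': [2, 4]}
search = GridSearchCV(RandomForestRegressor(random_state=1), demo_params, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)  # winning combination
print(search.best_score_)   # its mean cross-validated R^2 (default scoring for regressors)

# Mean cross-validated score of every candidate
for p, s in zip(search.cv_results_['params'], search.cv_results_['mean_test_score']):
    print(p, round(s, 4))
```

In the notebook itself the same attributes are available on `model` after `model.fit(X_train, y_train)`. If the best `max_depth` or `n_estimators` sits at the edge of the grid, widening the grid in that direction is a reasonable next step.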
