<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Exercise: The random forest
© ExploreAI Academy

In this exercise, we build, evaluate and compare random forest regression models.

## Learning objectives

By the end of this train, you should be able to:
* Build a random forest regression model in Python.
* Experiment with different number of trees.
* Evaluate feature importance using a random forest. 

## Exercises

In this excercise, we will be using the `Crop_yield` dataset that contains various factors that could influence the yield of a particular crop across different regions.

### Import libraries and dataset

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

In [2]:
# Load dataset
df= pd.read_csv("https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/Python/Crop_yield.csv")
df.head(5)

Unnamed: 0,Region,Temperature,Rainfall,Soil_Type,Fertilizer_Usage,Pesticide_Usage,Irrigation,Crop_Variety,Yield
0,East,23.152156,803.362573,Clayey,204.792011,20.76759,1,Variety B,40.316318
1,West,19.382419,571.56767,Sandy,256.201737,49.290242,0,Variety A,26.846639
2,North,27.89589,-8.699637,Loamy,222.202626,25.316121,0,Variety C,-0.323558
3,East,26.741361,897.426194,Loamy,187.98409,17.115362,0,Variety C,45.440871
4,East,19.090286,649.384694,Loamy,110.459549,24.068804,1,Variety B,35.478118


### Preparing the dataset

In the code below, we prepare our dataset for modeling by encoding categorical variables and standardising our features.

In [3]:
# Dummy Variable Encoding for categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)

# Standardisation of features
scaler = StandardScaler()
scaled = scaler.fit_transform(df_encoded)
df_scaled = pd.DataFrame(scaled, columns=df_encoded.columns)

### Exercise 1

Create a function named `train_rf_model` to train and evaluate a random forest regression model on the scaled dataset. 

The function should take in a single parameter , i.e., the number of estimators for the random forest and return the trained model object as well as the RMSE and R<sup>2</sup> scores of the model's performance on the test set. 

In [6]:
# Your solution here...
def train_rf_model(n_estimators):
    X = df_scaled.drop(columns='Yield')
    y = df_scaled['Yield']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    rf_reg = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    rf_reg.fit(X_train, y_train)

    y_pred = rf_reg.predict(X_test)

    rmse = np.sqrt(metrics.mean_squared_error(y_pred, y_test))
    r2 = metrics.r2_score(y_pred, y_test)

    return rf_reg, {'RMSE': rmse, 'R2': r2}

### Exercise 2

Use the function you have defined in **Exercise 1** to train and evaluate three different random forest regression models with each having the following number of estimators respectively: `50`, `100`, and `200`. Store the results in a dictionary.

In [12]:
# Your solution here...
results = {}

for i in (50, 100, 200):
    model, results[str(i) + ' Trees'] = train_rf_model(i)

results

{'50 Trees': {'RMSE': 0.08498822668615197, 'R2': 0.9917895881703979},
 '100 Trees': {'RMSE': 0.08445395347368703, 'R2': 0.9918845091378994},
 '200 Trees': {'RMSE': 0.08383772170644259, 'R2': 0.9920032842017515}}

### Exercise 3

Say we wish to understand which features have the most impact on crop yield predictions.

Use the `feature_importances_` attribute from our lastly trained random forest model in **Exercise 2** to return a series containing the feature importance score for each of the features in our dataset, sorted in descending order. 

In [26]:
# Your solution here...
feature_importances = model.feature_importances_
feature_names = df_scaled.drop(columns='Yield').columns

pd.Series(feature_importances, feature_names).sort_values(ascending=False)

Rainfall                  0.979003
Fertilizer_Usage          0.016679
Temperature               0.001972
Pesticide_Usage           0.001013
Irrigation                0.000240
Crop_Variety_Variety B    0.000203
Region_West               0.000195
Soil_Type_Loamy           0.000168
Soil_Type_Sandy           0.000152
Crop_Variety_Variety C    0.000140
Region_North              0.000119
Region_South              0.000117
dtype: float64

## Solutions

### Exercise 1

In [None]:
def train_rf_model(n_estimators):
    
    # Splitting the dataset into features and target variable
    X = df_scaled.drop('Yield', axis=1)  # Features
    y = df_scaled['Yield']  # Target variable

    # Splitting the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initializing the RandomForestRegressor with n_estimators
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)

    # Training the model on the training set
    rf_model.fit(X_train, y_train)

    # Making predictions on the test set
    y_pred = rf_model.predict(X_test)

    # Evaluating the model
    mse = metrics.mean_squared_error(y_test, y_pred)  # Setting squared=False returns the RMSE
    r2 = metrics.r2_score(y_test, y_pred)
    
    # Print the training MSE and R-squared score
    return rf_model, {'MSE': mse, 'R2': r2}

The function `train_rf_model` is designed to train and evaluate a random forest model. 

It takes one parameter, `n_estimators`, which specifies the number of trees in the Random Forest. 

The function returns two items: the trained random forest model `rf_model` and a dictionary containing the evaluation metrics, `mse` and `r2`.

### Exercise 2

In [None]:
# Number of estimators to evaluate
estimators_list = [50, 100, 200]

# Dictionary to store results
results = {}

# Train and evaluate models with different numbers of estimators
for n in estimators_list:
    # Store the entire returned dictionary as the value for each key
    model, metric = train_rf_model(n)
    results[f"{n} trees"] = metric
    
results

In the code above, we use the previously created function to train and evaluate multiple random forest models, each with a different number of trees (estimators). 

The for loop iterates over each value in `estimators_list`, where it calls the `train_rf_model()` function, passing the current number of estimators `n` as an argument.

The two items returned by the function are stored in separate variables, `model` and `metric`.

The `results` dictionary is then used to store the evaluation metrics for each model trained with a different number of trees. The keys are strings indicating the number of trees, and the values are the dictionary of metrics returned by the function.

### Exercise 3

In [28]:
# Extract feature importances from the model
feature_importances = model.feature_importances_

# Get the names of the features, excluding the target variable 'Yield'
feature_names =df_scaled.drop('Yield', axis=1).columns

# Create a pandas Series 
importances = pd.Series(feature_importances, index=feature_names)

# Sort the feature importances in descending order
sorted_importances = importances.sort_values(ascending=False)
sorted_importances

feature_importances

array([1.97174154e-03, 9.79002823e-01, 1.66789583e-02, 1.01317829e-03,
       2.39707649e-04, 1.18613416e-04, 1.16642091e-04, 1.95040685e-04,
       1.67974904e-04, 1.51918507e-04, 2.02930014e-04, 1.40472046e-04])

In the code above, we use the `feature_importances_` attribute of the trained random forest model to extract the importance scores for each feature. 

The variable `feature_names` stores the list of feature names that were used to train the model. This will be used for mapping each importance score to its corresponding feature name.

`importances` is a pandas series object where each feature's importance score is associated with its name. 

In `sorted_importances`, we get the importances sorted in descending order to get a quick view of the features considered most important by the model.

<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>