# Crop Yield Prediction Across Different Regions

### Import libraries and dataset

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

In [2]:
# Load dataset
df = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/Python/Crop_yield.csv")
df.head(5)

| | Region | Temperature | Rainfall | Soil_Type | Fertilizer_Usage | Pesticide_Usage | Irrigation | Crop_Variety | Yield |
|---|---|---|---|---|---|---|---|---|---|
| 0 | East | 23.152156 | 803.362573 | Clayey | 204.792011 | 20.76759 | 1 | Variety B | 40.316318 |
| 1 | West | 19.382419 | 571.56767 | Sandy | 256.201737 | 49.290242 | 0 | Variety A | 26.846639 |
| 2 | North | 27.89589 | -8.699637 | Loamy | 222.202626 | 25.316121 | 0 | Variety C | -0.323558 |
| 3 | East | 26.741361 | 897.426194 | Loamy | 187.98409 | 17.115362 | 0 | Variety C | 45.440871 |
| 4 | East | 19.090286 | 649.384694 | Loamy | 110.459549 | 24.068804 | 1 | Variety B | 35.478118 |


### Preparing the dataset

In the code below, we prepare the dataset for modelling by dummy-encoding the categorical variables (`Region`, `Soil_Type`, and `Crop_Variety`) so that every feature is numeric.

In [3]:
# Dummy Variable Encoding for categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)
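
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration, not the crop dataset). Each categorical column becomes one indicator column per category, minus the first category in sorted order, which acts as the baseline. Numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy data, not taken from the crop dataset
toy = pd.DataFrame({
    "Region": ["East", "West", "North"],
    "Rainfall": [800.0, 570.0, 650.0],
})

toy_encoded = pd.get_dummies(toy, drop_first=True)

# "East" sorts first, so it is dropped and becomes the implicit baseline:
# a row with Region_North and Region_West both false is an "East" row
print(list(toy_encoded.columns))
print(toy_encoded)
```

Dropping one category per variable avoids carrying a redundant column, since the dropped category is implied when all of its sibling indicators are false.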

### Training Model

In [4]:
def train_rf_model(data, target_variable, n_estimators):

    # Splitting the dataset into features and target variable
    X = data.drop(target_variable, axis=1)  # Features
    y = data[target_variable]  # Target variable

    # Splitting the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initializing the RandomForestRegressor with n_estimators
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)

    # Training the model on the training set
    rf_model.fit(X_train, y_train)

    # Making predictions on the test set
    y_pred = rf_model.predict(X_test)

    # Evaluating the model
    mse = metrics.mean_squared_error(y_test, y_pred)  # Mean squared error; np.sqrt(mse) gives the RMSE
    r2 = metrics.r2_score(y_test, y_pred)

    # Return the trained model and its performance metrics
    return rf_model, {'MSE': mse, 'R2': r2}


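The function `train_rf_model` trains and evaluates a random forest regression model.

It takes three parameters: `data` (the encoded DataFrame), `target_variable` (the name of the column to predict), and `n_estimators` (the number of trees in the forest). Inside, the data is split 80/20 into training and test sets, and the metrics are computed on the test set.

The function returns two items: the trained random forest model `rf_model` and a dictionary of evaluation metrics with the keys `'MSE'` and `'R2'`.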
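
As a rough illustration of what the two metrics measure, the sketch below computes MSE, RMSE, and R2 by hand with NumPy and compares them with scikit-learn's results. The arrays `y_true` and `y_hat` are hypothetical values chosen only for this example.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted values, chosen only for illustration
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# RMSE: square root of the MSE, in the same units as the target
rmse_manual = np.sqrt(mse_manual)

# R2: 1 minus the ratio of the residual to the total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The hand-computed values should agree with scikit-learn
print(mse_manual, metrics.mean_squared_error(y_true, y_hat))
print(r2_manual, metrics.r2_score(y_true, y_hat))
print(rmse_manual)
```

A lower MSE and an R2 closer to 1 both indicate predictions that sit nearer to the observed values.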

### Hyperparameter Tuning

In [5]:
# Number of estimators to evaluate
estimators_list = [50, 100, 200]

# Dictionary to store results
results = {}

# Train and evaluate models with different numbers of estimators
for n in estimators_list:
    # Unpack the trained model and its metrics dictionary
    model, metric = train_rf_model(df_encoded, 'Yield', n)
    # Store the metrics dictionary under a key naming the number of trees
    results[f"{n} trees"] = metric

results

{'50 trees': {'MSE': 0.739261264251345, 'R2': 0.9920180175887953},
 '100 trees': {'MSE': 0.7288864859605081, 'R2': 0.9921300365756436},
 '200 trees': {'MSE': 0.7200078994393476, 'R2': 0.9922259008186051}}

In the code above, we use the previously created function to train and evaluate multiple random forest models, each with a different number of trees (estimators).

The `for` loop iterates over each value in `estimators_list` and calls `train_rf_model()` with the encoded data, the target column `'Yield'`, and the current number of estimators `n`.

The two items returned by the function are stored in separate variables, `model` and `metric`.

The `results` dictionary is then used to store the evaluation metrics for each model trained with a different number of trees. The keys are strings indicating the number of trees, and the values are the dictionary of metrics returned by the function.
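
One possible way to compare the configurations is to turn the nested dictionary into a table and look up the row with the lowest MSE. The sketch below uses the metric values reported in the output above; the names `results_df` and `best_config` are introduced here for illustration.

```python
import pandas as pd

# Metrics copied from the output above
results = {
    "50 trees": {"MSE": 0.739261264251345, "R2": 0.9920180175887953},
    "100 trees": {"MSE": 0.7288864859605081, "R2": 0.9921300365756436},
    "200 trees": {"MSE": 0.7200078994393476, "R2": 0.9922259008186051},
}

# One row per configuration, one column per metric
results_df = pd.DataFrame(results).T

# Configuration with the lowest test MSE
best_config = results_df["MSE"].idxmin()

print(results_df)
print(best_config)
```

On these numbers, going from 50 to 200 trees lowers the MSE by roughly 0.02, so the gain from adding trees is small for this dataset.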

### Important Features

In [6]:
# Extract feature importances from the model
feature_importances = model.feature_importances_

# Get the names of the features, excluding the target variable 'Yield'
feature_names = df_encoded.drop('Yield', axis=1).columns

# Create a Pandas series
importances = pd.Series(feature_importances, index=feature_names)

# Sort the feature importances in descending order
sorted_importances = importances.sort_values(ascending=False)
sorted_importances

| Feature | Importance |
|---|---|
| Rainfall | 0.97891 |
| Fertilizer_Usage | 0.01667 |
| Temperature | 0.001971 |
| Pesticide_Usage | 0.001102 |
| Irrigation | 0.000251 |
| Crop_Variety_Variety B | 0.000202 |
| Region_West | 0.000194 |
| Soil_Type_Loamy | 0.000161 |
| Soil_Type_Sandy | 0.000158 |
| Crop_Variety_Variety C | 0.000143 |


In the code above, we use the `feature_importances_` attribute of the trained random forest model to extract the importance score of each feature. Note that `model` here is the last model trained in the loop, the one with 200 trees, because the variable is overwritten on each iteration.

The variable `feature_names` stores the column names of the features used to train the model, that is, every column of `df_encoded` except `Yield`. These names are used to map each importance score to its corresponding feature.

`importances` is a pandas Series in which each feature's importance score is indexed by its name.

In `sorted_importances`, the scores are sorted in descending order to give a quick view of the features the model considers most important.
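
The same pattern can be tried on a small synthetic dataset where the answer is known in advance. In the sketch below, the data and the names `X_demo`, `y_demo`, and `demo_model` are assumptions made for illustration. It also shows that scikit-learn's impurity-based importances are normalised to sum to 1, so each score can be read as a share of the total.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data: 'strong' drives the target,
# 'weak' contributes a little, and 'noise' is unrelated to it
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "strong": rng.normal(size=300),
    "weak": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_demo = 5 * X_demo["strong"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.1, size=300)

demo_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Pair each score with its feature name and sort in descending order
demo_importances = pd.Series(
    demo_model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(demo_importances)
print(demo_importances.sum())  # the scores add up to 1
```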

> Which top two features contribute the most to the model's predictive ability?

Understanding feature importance and the contribution of each variable to the model's predictions offers us an opportunity to streamline our models. This understanding enables us to focus on the most influential features, thereby reducing model complexity without significantly sacrificing performance.
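
As a hedged sketch of that idea, the example below trains one model on all features and a second model on only the two highest-ranked features, then compares their R2 on a held-out set. It uses hypothetical synthetic data rather than the crop dataset, and all variable names are introduced here for illustration, so the numbers say nothing about the results above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: only 'a' and 'b' influence the target
rng = np.random.default_rng(42)
X_syn = pd.DataFrame(rng.uniform(size=(500, 5)), columns=["a", "b", "c", "d", "e"])
y_syn = 3 * X_syn["a"] + 2 * X_syn["b"] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Full model trained on all five features
full = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
r2_full = r2_score(y_te, full.predict(X_te))

# Keep the two features with the highest importance scores
top_two = (
    pd.Series(full.feature_importances_, index=X_tr.columns)
    .sort_values(ascending=False)
    .index[:2]
    .tolist()
)

# Reduced model trained on those two features only
reduced = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr[top_two], y_tr)
r2_reduced = r2_score(y_te, reduced.predict(X_te[top_two]))

print(top_two, r2_full, r2_reduced)
```

If the reduced model's score stays close to the full model's, the dropped features were adding little, which is the trade-off described above.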