# Yield Prediction using Random Forest Regressor

This notebook tunes a Random Forest Regressor with Optuna and uses the tuned model to predict yield.

## 1. Imports and Data Loading

In [27]:
import pandas as pd
import numpy as np
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, KFold
# Load data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

## 2. Data Preprocessing

In [28]:
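def preprocess_data(train, test):
    """Split the raw frames into features, target, test features, and test ids."""
    # Drop the identifier columns and the target from the training features
    X = train.drop(columns=["id", "Row#", "yield"])
    y = train["yield"]
    # Keep the test ids for the submission file
    test_ids = test["id"]
    X_test = test.drop(columns=["id", "Row#"])
    return X, y, X_test, test_ids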

X, y, X_test, test_ids = preprocess_data(train, test)

## 3. Defining Cross-Validation

This sets up 5-fold cross-validation. `shuffle=True` shuffles the rows before splitting them into folds, and `random_state=42` makes that shuffle reproducible.

In [29]:
cv = KFold(n_splits=5, shuffle=True, random_state=42)
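
To make the splitting concrete, here is a minimal sketch that runs the same `KFold` configuration on a toy array of ten rows. The array and the `cv_demo` name are assumptions for illustration only, not the competition data. It prints which rows end up in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 rows, used only to illustrate the splitting (not the real dataset).
X_toy = np.arange(10).reshape(-1, 1)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)

val_folds = []
for fold, (train_idx, val_idx) in enumerate(cv_demo.split(X_toy)):
    # With 10 rows and 5 folds, each fold holds out 2 rows for validation.
    print(f"fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")
    val_folds.append(val_idx)

# Collect all validation indices to confirm each row is held out exactly once.
all_val = np.sort(np.concatenate(val_folds))
print(all_val.tolist())
```

Each row is used for validation exactly once and for training in the other four folds, which is why the averaged score is less sensitive to a single lucky or unlucky split.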

## 4. Defining the Optuna Objective Function

The objective function tries different combinations of hyperparameters (`n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`).
For each combination, it builds the model and evaluates it with the 5-fold cross-validation defined above.
It returns the mean MAE across the folds, which Optuna then tries to minimize.

(I received help from ChatGPT to explore Optuna and Random Forest tuning. I'm using this to learn how to improve model performance and understand cross-validation better.)

In [30]:
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 5, 50),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 5),
        'random_state': 42,
        'n_jobs': -1
    }

    model = RandomForestRegressor(**params)
    score = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_absolute_error').mean()
    return -score  # The scorer returns negative MAE; flip the sign so lower is better
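
The sign flip in the last line can be checked in isolation. The sketch below is illustrative only: it uses synthetic data from `make_regression` and a small forest (both assumptions, not the notebook's data or settings) to show that `neg_mean_absolute_error` produces non-positive fold scores, and that negating their mean gives an ordinary MAE.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, standing in for the real features and target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)
model_demo = RandomForestRegressor(n_estimators=20, random_state=42)

# scikit-learn scorers follow "higher is better", so MAE is reported negated.
neg_mae_per_fold = cross_val_score(
    model_demo, X_demo, y_demo, cv=cv_demo, scoring="neg_mean_absolute_error"
)

# Flip the sign to get a non-negative MAE that a minimizer can work with.
mean_mae = -neg_mae_per_fold.mean()
print("per-fold scores:", neg_mae_per_fold)
print("mean MAE:", mean_mae)
```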

## 5. Running the Optuna Study

In [31]:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=30)

[I 2025-06-03 09:55:25,514] A new study created in memory with name: no-name-c64453a8-adeb-44d4-b37e-263cfc2d5fa5
[I 2025-06-03 09:55:40,834] Trial 0 finished with value: 245.69991940017894 and parameters: {'n_estimators': 922, 'max_depth': 28, 'min_samples_split': 10, 'min_samples_leaf': 5}. Best is trial 0 with value: 245.69991940017894.
[I 2025-06-03 09:55:49,582] Trial 1 finished with value: 246.17526887239924 and parameters: {'n_estimators': 509, 'max_depth': 26, 'min_samples_split': 10, 'min_samples_leaf': 4}. Best is trial 0 with value: 245.69991940017894.
[I 2025-06-03 09:55:52,159] Trial 2 finished with value: 252.05284654194074 and parameters: {'n_estimators': 115, 'max_depth': 33, 'min_samples_split': 6, 'min_samples_leaf': 1}. Best is trial 0 with value: 245.69991940017894.
[I 2025-06-03 09:56:08,020] Trial 3 finished with value: 247.24973039492278 and parameters: {'n_estimators': 882, 'max_depth': 46, 'min_samples_split': 8, 'min_samples_leaf': 3}. Best is trial 0 with val

## Training the Final Model with the Best Optuna Parameters and Predicting

In [32]:
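best_params = study.best_params
# best_params holds only the tuned values, so the fixed settings are passed again
model = RandomForestRegressor(**best_params, random_state=42, n_jobs=-1)
model.fit(X, y)
preds = model.predict(X_test)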

## 6. Submission File

In [34]:
submission = pd.DataFrame({
    "id": test_ids,
    "yield": preds
})

submission.to_csv("submission.csv", index=False)
submission.head()

|   | id | yield |
|---|-------|-------------|
| 0 | 15000 | 5550.581655 |
| 1 | 15001 | 6335.442007 |
| 2 | 15002 | 5546.580174 |
| 3 | 15003 | 3083.345956 |
| 4 | 15004 | 3363.252236 |
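
Before uploading, it can help to run a few sanity checks on the submission frame. The sketch below is a hypothetical helper (`check_submission` is not part of the notebook) applied to small stand-in frames. The rules it checks are assumptions based on the file produced above: the `id` and `yield` columns, one row per test row, no missing predictions, and no duplicate ids.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's `test` and `submission` frames.
test_demo = pd.DataFrame({"id": [15000, 15001, 15002]})
submission_demo = pd.DataFrame({
    "id": test_demo["id"],
    "yield": [5550.58, 6335.44, 5546.58],
})

def check_submission(submission, test):
    """Return a list of problems found; an empty list means all checks passed."""
    # Stop early if the columns are wrong, since the later checks rely on them.
    if list(submission.columns) != ["id", "yield"]:
        return ["unexpected columns"]
    problems = []
    if len(submission) != len(test):
        problems.append("row count differs from test set")
    if submission["yield"].isna().any():
        problems.append("missing predictions")
    if not submission["id"].is_unique:
        problems.append("duplicate ids")
    return problems

print(check_submission(submission_demo, test_demo))
```

In the notebook itself, the equivalent call would be `check_submission(submission, test)`.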
