## NOTEBOOK 1 ##
Group members:

150487 Kevin Otieno

169423 Ryan Matu

134706 Peter Wachira

169957 Joshua Munyirwa

122830 Princess Imelda Sidai

168956 Mutunga Eric Musyimi

# Task Selection and Justification

**Selected Task:** Regression

**Justification:**
Regression is the most suitable task for this dataset because our goal is to predict `hg/ha_yield` (maize yield in hectograms per hectare), which is a continuous numerical variable.

* **Nature of Data:** We are predicting a specific quantity (amount of crop per hectare) rather than a class label.
* **Why not Classification?** Classification would be appropriate if we were predicting discrete categories (e.g., "High Yield" vs. "Low Yield"). Since we need to estimate the exact numerical yield based on inputs like rainfall and temperature, regression algorithms (like Random Forest Regressor) are the correct choice.
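The difference between the two target types can be sketched on a few rows. This is a minimal illustration, assuming the first five `hg/ha_yield` values shown by `df.head()` later in this notebook; the 25,000 hg/ha cut-off and the names `threshold` and `labels` are hypothetical and not part of the analysis.

```python
import pandas as pd

# First five hg/ha_yield values from the df.head() output in this notebook.
yields = pd.Series([36613, 29068, 24876, 24185, 25848], name="hg/ha_yield")

# Hypothetical cut-off, chosen only to illustrate a classification target.
threshold = 25000

# A classification target collapses each numeric yield into a discrete label.
labels = (yields >= threshold).map({True: "High Yield", False: "Low Yield"})

# Regression keeps the numeric column; classification would discard the magnitudes.
comparison = pd.DataFrame(
    {"regression_target": yields, "classification_target": labels}
)
print(comparison)
```

Binning in this way loses the size of each yield, which is why the notebook keeps the numeric column as the target.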

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
url = 'https://raw.githubusercontent.com/OKevina/AI-assignment-dataset/refs/heads/main/maize_global_cleaned.csv'

In [None]:
df = pd.read_csv(url)

In [None]:
df.head()

| | Unnamed: 0 | Country | Crop | Year | hg/ha_yield | average_rain_fall_mm_per_year | pesticides_tonnes | avg_temp |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Albania | Maize | 1990 | 36613 | 1485.0 | 121.0 | 16.37 |
| 1 | 6 | Albania | Maize | 1991 | 29068 | 1485.0 | 121.0 | 15.36 |
| 2 | 12 | Albania | Maize | 1992 | 24876 | 1485.0 | 121.0 | 16.06 |
| 3 | 18 | Albania | Maize | 1993 | 24185 | 1485.0 | 121.0 | 16.05 |
| 4 | 23 | Albania | Maize | 1994 | 25848 | 1485.0 | 201.0 | 16.96 |


In [None]:
df.describe()

| | Unnamed: 0 | Year | hg/ha_yield | average_rain_fall_mm_per_year | pesticides_tonnes | avg_temp |
|---|---|---|---|---|---|---|
| count | 4121.0 | 4121.0 | 4121.0 | 4121.0 | 4121.0 | 4121.0 |
| mean | 14135.846639 | 2001.553749 | 36310.070614 | 1098.124242 | 32765.983322 | 19.925159 |
| std | 8201.607601 | 7.04449 | 27456.370877 | 721.559071 | 54088.622824 | 6.654389 |
| min | 0.0 | 1990.0 | 849.0 | 51.0 | 0.04 | 1.61 |
| 25% | 6847.0 | 1995.0 | 17086.0 | 537.0 | 1597.0 | 15.67 |
| 50% | 14595.0 | 2001.0 | 25401.0 | 1020.0 | 14485.33 | 20.81 |
| 75% | 21228.0 | 2008.0 | 48243.0 | 1622.0 | 43720.04 | 25.92 |
| max | 28235.0 | 2013.0 | 207556.0 | 3240.0 | 367778.0 | 30.65 |


In [None]:
target = 'hg/ha_yield' # Defining the target variable for prediction

In [None]:
numeric_features = ['Year', 'average_rain_fall_mm_per_year', 'pesticides_tonnes', 'avg_temp'] # Defining numeric features
categorical_features = ['Country'] # Defining categorical features

In [None]:
X = df[numeric_features + categorical_features] # Creating feature matrix X
y = df[target] # Creating target vector y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Splitting data into training and testing sets

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features), # Applies StandardScaler to numeric features
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features) # Applies OneHotEncoder to categorical features
    ])

In [None]:
model = Pipeline([
    ('preprocessor', preprocessor), # Applies the defined preprocessor to the data
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42)) # Uses a RandomForestRegressor for the prediction task
])

In [None]:
model.fit(X_train, y_train) # Training the model with the training data

In [None]:
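y_pred = model.predict(X_test) # Predicting yield for the held-out test set

# Collecting evaluation metrics computed on the test set
metrics = {
    'R2 Score': r2_score(y_test, y_pred),
    'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
    'MAE': mean_absolute_error(y_test, y_pred)
}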

In [None]:
pd.DataFrame([metrics], index=['Random Forest'])

| | R2 Score | RMSE | MAE |
|---|---|---|---|
| Random Forest | 0.964744 | 5088.041932 | 2315.260412 |


In [None]:
# Extract feature names
ohe_feature_names = model.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)
all_feature_names = numeric_features + list(ohe_feature_names)

# Get importances
importances = model.named_steps['regressor'].feature_importances_

# Create DataFrame and display top 10
feature_importance_df = pd.DataFrame({'feature': all_feature_names, 'importance': importances})
feature_importance_df.sort_values(by='importance', ascending=False).head(10)

| | feature | importance |
|---|---|---|
| 3 | avg_temp | 0.352575 |
| 1 | average_rain_fall_mm_per_year | 0.100339 |
| 76 | Country_Qatar | 0.095707 |
| 2 | pesticides_tonnes | 0.093839 |
| 0 | Year | 0.067418 |
| 83 | Country_Spain | 0.062011 |
| 36 | Country_Greece | 0.040887 |
| 48 | Country_Japan | 0.037016 |
| 9 | Country_Australia | 0.021408 |
| 30 | Country_Egypt | 0.016139 |


# Model Evaluation and Interpretation

The model was evaluated using R² Score, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).

### **Results Interpretation:**

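* **R² Score (~0.96):**
    The model explains approximately 96.5% of the variance in maize yield on the held-out 20% test set. This is a very high score, indicating that the selected features (rainfall, temperature, country, etc.) are strong predictors of yield. Because the split is random, test rows come from the same countries and neighboring years as the training rows, so the score reflects prediction within known countries rather than for unseen countries or future years.

* **Error Margins (MAE & RMSE):**
    * **MAE (~2,315 hg/ha):** On average, the model's predictions deviate from the actual yield by about 2,315 hectograms per hectare.
    * **RMSE (~5,088 hg/ha):** The RMSE is roughly twice the MAE, which indicates that a minority of predictions have much larger errors than the typical one.
    * **Context:** The mean yield in the dataset is roughly 36,300 hg/ha, so the MAE is about 6.4% of the mean and the RMSE about 14%, making the model reliable for general estimation.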
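The relative error figures can be checked directly from the reported outputs. The snippet below is a minimal sketch that copies the values from the metrics table and the `df.describe()` output rather than recomputing them; the variable names are illustrative. Note that the mean and standard deviation are for the full dataset, not the test split.

```python
# Values copied from the notebook outputs above (not recomputed here).
mean_yield = 36310.070614  # mean of hg/ha_yield from df.describe()
std_yield = 27456.370877   # std of hg/ha_yield from df.describe()
mae = 2315.260412          # MAE from the metrics table
rmse = 5088.041932         # RMSE from the metrics table

# Express each error as a share of the mean yield.
relative_mae = mae / mean_yield
relative_rmse = rmse / mean_yield

# Compare the RMSE with the overall spread of the target.
rmse_vs_std = rmse / std_yield

print(f"MAE as share of mean yield:  {relative_mae:.1%}")
print(f"RMSE as share of mean yield: {relative_rmse:.1%}")
print(f"RMSE relative to yield std:  {rmse_vs_std:.2f}")
```

An RMSE that is a small fraction of the target's standard deviation is what one would expect alongside a high R² score.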

### **Feature Importance:**

The analysis highlights the features the model relies on most when predicting maize yield:
1.  **Average Temperature (`avg_temp`):** This is the most important single feature (importance ≈ 0.35), which is consistent with maize being sensitive to thermal conditions.
2.  **Rainfall (`average_rain_fall_mm_per_year`):** Water availability is the second most important feature (≈ 0.10); `pesticides_tonnes` (≈ 0.09) and `Year` (≈ 0.07) rank next among the numeric features.
3.  **Country Effects:** Several one-hot "Country" features (Qatar, Spain, Greece, Japan) appear in the top 10, which suggests that local factors (soil quality, farming technology, policy) account for a substantial share of yield differences beyond weather alone. These impurity-based importances describe what the model uses, not causal effects.
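Because `Country` is spread across many one-hot columns, its total contribution is easier to judge when those columns are summed. The sketch below works only from the top-10 table above and assumes that the forest's importances sum to 1, as scikit-learn normalizes them; the names `top10`, `numeric_total`, `country_total`, and `country_top10` are illustrative.

```python
import pandas as pd

# Top-10 rows copied from the feature-importance output above.
top10 = pd.DataFrame({
    "feature": [
        "avg_temp", "average_rain_fall_mm_per_year", "Country_Qatar",
        "pesticides_tonnes", "Year", "Country_Spain", "Country_Greece",
        "Country_Japan", "Country_Australia", "Country_Egypt",
    ],
    "importance": [
        0.352575, 0.100339, 0.095707, 0.093839, 0.067418,
        0.062011, 0.040887, 0.037016, 0.021408, 0.016139,
    ],
})

numeric_features = ["Year", "average_rain_fall_mm_per_year",
                    "pesticides_tonnes", "avg_temp"]

# All four numeric features appear in the top 10, so their total is exact.
is_numeric = top10["feature"].isin(numeric_features)
numeric_total = top10.loc[is_numeric, "importance"].sum()

# Assumption: importances sum to 1, so the remainder belongs to the
# Country_* columns, including those outside the top 10.
country_total = 1.0 - numeric_total

# Share carried by the six country columns that made the top 10.
country_top10 = top10.loc[~is_numeric, "importance"].sum()

print(f"Numeric features combined: {numeric_total:.3f}")
print(f"All Country_* combined:    {country_total:.3f}")
print(f"Country_* within top 10:   {country_top10:.3f}")
```

Under that assumption, the country columns together carry a larger share than `avg_temp` alone, even though no single country outranks it.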