# 🌾 Ethiopian Crop Yield Prediction

Welcome to the **Ethiopian Data Science & AI Community** starter project!

In this notebook, you'll build a simple **crop yield prediction model** using climate and soil data — a first step toward data-driven agriculture in Ethiopia.

🎯 Goal: Predict crop yield (kg/ha) based on environmental and farming factors.

🔧 Skills: Data exploration, visualization, linear regression, random forest regression

In [None]:
# Step 1: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## Step 2: Generate Synthetic Crop Dataset

💡 This is **synthetic data** for learning. In real projects, you'd use data from the Ministry of Agriculture (MoA), the Food and Agriculture Organization (FAO), or field surveys.

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Regions in Ethiopia
regions = ["Amhara", "Oromia", "SNNP", "Tigray", "Sidama"]
crops = ["Teff", "Maize", "Coffee"]

# Generate 300 synthetic samples
n_samples = 300
data = {
    "region": np.random.choice(regions, n_samples),
    "crop": np.random.choice(crops, n_samples),
    "rainfall_mm": np.random.normal(800, 150, n_samples),  # mm/year
    "temperature_c": np.random.normal(22, 3, n_samples),  # °C
    "soil_ph": np.random.uniform(5.0, 7.5, n_samples),    # pH
    "fertilizer_kg_per_ha": np.random.uniform(0, 100, n_samples),
}

# Illustrative yield formula per crop (kg/ha); the coefficients are made up and noise is added below
yield_per_crop = {
    "Teff": lambda x: 1000 + 0.8*x['rainfall_mm'] - 20*abs(x['temperature_c'] - 18) + 30*x['fertilizer_kg_per_ha'] + 50*x['soil_ph'],
    "Maize": lambda x: 2000 + 0.6*x['rainfall_mm'] - 15*abs(x['temperature_c'] - 25) + 25*x['fertilizer_kg_per_ha'] + 40*x['soil_ph'],
    "Coffee": lambda x: 800 + 0.5*x['rainfall_mm'] - 10*abs(x['temperature_c'] - 20) + 15*x['fertilizer_kg_per_ha'] + 60*x['soil_ph']
}

yields = []
for i in range(n_samples):
    row = {"rainfall_mm": data['rainfall_mm'][i],
           "temperature_c": data['temperature_c'][i],
           "soil_ph": data['soil_ph'][i],
           "fertilizer_kg_per_ha": data['fertilizer_kg_per_ha'][i]}
    base_yield = yield_per_crop[data['crop'][i]](row)
    # Add Gaussian noise (standard deviation 100 kg/ha) and enforce a 300 kg/ha floor
    yield_val = max(base_yield + np.random.normal(0, 100), 300)
    yields.append(round(yield_val, 2))

data["yield_kg_per_ha"] = yields

# Convert to DataFrame
df = pd.DataFrame(data)
print("First 10 rows of the dataset:")
df.head(10)

## Step 3: Explore the Data

In [None]:
# Basic info
print("Dataset Shape:", df.shape)
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())

In [None]:
# Distribution of yield by crop
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.boxplot(data=df, x="crop", y="yield_kg_per_ha")
plt.title("Crop Yield by Crop Type")
plt.ylabel("Yield (kg/ha)")

plt.subplot(1, 2, 2)
sns.scatterplot(data=df, x="rainfall_mm", y="yield_kg_per_ha", hue="crop")
plt.title("Yield vs Rainfall")
plt.xlabel("Rainfall (mm)")
plt.ylabel("Yield (kg/ha)")

plt.tight_layout()
plt.show()

## Step 4: Prepare Data for Modeling

We need to convert categorical columns (`region`, `crop`) into numbers using **one-hot encoding**.

In [None]:
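# One-hot encode categorical variables.
# drop_first=True drops one baseline category per column, which avoids
# perfectly collinear dummy columns in the linear model.
df_encoded = pd.get_dummies(df, columns=["region", "crop"], drop_first=True)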

# Define features and target
X = df_encoded.drop("yield_kg_per_ha", axis=1)
y = df_encoded["yield_kg_per_ha"]

print("Features shape:", X.shape)
print("Target shape:", y.shape)
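
To see what `drop_first=True` does, here is a minimal sketch on a toy column. The `toy` frame is made up for illustration, and `dtype=int` is only there so the output prints 0/1 rather than booleans. The dropped category becomes the baseline, encoded as zeros in every remaining column.

```python
import pandas as pd

# Hypothetical four-row frame, not part of the notebook's dataset
toy = pd.DataFrame({"crop": ["Teff", "Maize", "Coffee", "Teff"]})

# Categories are ordered alphabetically, so "Coffee" is the one dropped
encoded = pd.get_dummies(toy, columns=["crop"], drop_first=True, dtype=int)
print(encoded)
```

In this sketch the Coffee row is all zeros, so a model reads "neither Maize nor Teff" as Coffee.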

## Step 5: Split into Train and Test Sets

In [None]:
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])

## Step 6: Train a Linear Regression Model

In [None]:
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)

# Predict
y_pred_lr = model_lr.predict(X_test)

# Evaluate
mae_lr = mean_absolute_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
r2_lr = r2_score(y_test, y_pred_lr)

print("Linear Regression Results:")
print(f"MAE: {mae_lr:.2f} kg/ha")
print(f"RMSE: {rmse_lr:.2f} kg/ha")
print(f"R²: {r2_lr:.4f}")
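
For readers new to these metrics, the sketch below computes MAE, RMSE, and R² by hand on four made-up yield values. The numbers are assumptions chosen to keep the arithmetic simple; they are not results from this notebook.

```python
import math

# Made-up yields in kg/ha, chosen only to keep the arithmetic easy to follow
y_true = [2000, 2500, 3000, 3500]
y_hat = [2100, 2400, 3100, 3300]

errors = [p - t for p, t in zip(y_hat, y_true)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n               # mean absolute error
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)   # root mean squared error

mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)                # variation the predictions miss
ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation in the target
r2 = 1 - ss_res / ss_tot                            # share of variation explained

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R²: {r2:.4f}")
```

Here RMSE (about 132) is larger than MAE (125) because squaring gives extra weight to the single 200 kg/ha miss.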

## Step 7: Try a Random Forest Model (Can Capture Non-Linear Effects)

In [None]:
model_rf = RandomForestRegressor(n_estimators=100, random_state=42)
model_rf.fit(X_train, y_train)

# Predict
y_pred_rf = model_rf.predict(X_test)

# Evaluate
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Results:")
print(f"MAE: {mae_rf:.2f} kg/ha")
print(f"RMSE: {rmse_rf:.2f} kg/ha")
print(f"R²: {r2_rf:.4f}")
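
A single train/test split can favor one model by chance, so a comparison is more trustworthy when averaged over several folds. The following is a minimal sketch using 5-fold cross-validation on a small made-up dataset. `X_demo`, `y_demo`, and the helper `cv_mae` are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Made-up data with one linear effect and one non-linear (absolute value) effect
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 5 * X_demo[:, 0] - 10 * np.abs(X_demo[:, 1] - 50) + rng.normal(0, 20, 200)

def cv_mae(model, X, y, folds=5):
    """Average MAE across cross-validation folds (lower is better)."""
    scores = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_absolute_error")
    return -scores.mean()  # scikit-learn reports the negated error

mae_linear = cv_mae(LinearRegression(), X_demo, y_demo)
mae_forest = cv_mae(RandomForestRegressor(n_estimators=100, random_state=42), X_demo, y_demo)

print(f"Linear regression CV MAE: {mae_linear:.1f}")
print(f"Random forest CV MAE:     {mae_forest:.1f}")
```

The same helper could be called with the notebook's `X` and `y` to check whether the random forest's advantage holds across folds.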

## Step 8: Feature Importance (Insight for Farmers)

In [None]:
importances = model_rf.feature_importances_
features = X.columns
feat_importance = pd.Series(importances, index=features).sort_values(ascending=False)

# Top 10 features
plt.figure(figsize=(10, 6))
feat_importance.head(10).plot(kind='bar')
plt.title("Top 10 Features Influencing Crop Yield")
plt.ylabel("Importance")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
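
The `feature_importances_` values above are impurity-based and computed from the training data. Permutation importance on held-out data is a common cross-check: it measures how much the score drops when one feature's values are shuffled. The sketch below uses made-up data in which only the first feature matters; `X_demo`, `y_demo`, and the feature names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Made-up data: the target depends on feature 0 only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(300, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(0, 10, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the average score drop
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)
for name, score in zip(["x0", "x1", "x2"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```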

## Step 9: Predict Yield for a New Farm

In [None]:
def predict_yield(region, crop, rainfall, temp, soil_ph, fertilizer):
    # Reject unknown categories instead of silently treating them as the baseline
    if region not in regions:
        raise ValueError(f"Unknown region: {region!r}")
    if crop not in crops:
        raise ValueError(f"Unknown crop: {crop!r}")

    # Single-row DataFrame of zeros with the same columns (and order) as the training features
    sample = pd.DataFrame(0.0, index=[0], columns=X.columns)

    # Fill known values
    sample["rainfall_mm"] = rainfall
    sample["temperature_c"] = temp
    sample["soil_ph"] = soil_ph
    sample["fertilizer_kg_per_ha"] = fertilizer

    # Set one-hot encoded columns. The baseline category dropped by
    # drop_first=True has no column, so it stays encoded as all zeros.
    region_col = "region_" + region
    crop_col = "crop_" + crop

    if region_col in sample.columns:
        sample[region_col] = 1
    if crop_col in sample.columns:
        sample[crop_col] = 1

    # Predict
    pred = model_rf.predict(sample)[0]
    return round(pred, 2)

# Example: Predict Teff yield in Amhara
yield_pred = predict_yield(
    region="Amhara",
    crop="Teff",
    rainfall=850,
    temp=18,
    soil_ph=6.2,
    fertilizer=40
)

print(f"Predicted Teff Yield: {yield_pred} kg/ha")
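
Because the data is synthetic, the prediction can be sanity-checked against the noise-free Teff formula from Step 2. The sketch below applies that formula to the same inputs; `teff_base_yield` is a hypothetical helper name introduced here. The model's prediction should land in the same neighbourhood, though noise in the training data and the forest's averaging will keep it from matching exactly.

```python
def teff_base_yield(rainfall_mm, temperature_c, soil_ph, fertilizer_kg_per_ha):
    """Noise-free Teff yield (kg/ha) from the Step 2 generating formula."""
    return (1000
            + 0.8 * rainfall_mm
            - 20 * abs(temperature_c - 18)
            + 30 * fertilizer_kg_per_ha
            + 50 * soil_ph)

# Same inputs as the example above: 1000 + 680 - 0 + 1200 + 310
expected = teff_base_yield(rainfall_mm=850, temperature_c=18, soil_ph=6.2, fertilizer_kg_per_ha=40)
print(f"Noise-free Teff yield: {expected:.1f} kg/ha")
```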

## ✅ Next Steps & Challenges

🔹 **Add real data** from MoA, FAO, or field surveys
🔹 **Include soil type or altitude**
🔹 **Predict by region-season** (e.g., Meher vs Belg)
🔹 **Build a web dashboard** for farmers
🔹 **Contribute your cleaned dataset** to the community repo!

📌 Join the [Ethiopian Data Science & AI Community](https://www.linkedin.com/groups/11836061/) to share your results and collaborate!