# Climate and Crop Yield Prediction (Wheat)

This notebook demonstrates a simple pipeline to predict wheat yield (tons/ha) from climate variables:
- Average temperature (°C)
- Total annual rainfall (mm)
- Solar radiation (MJ/m²)

**Files provided in `data/`**:
- `crop_yield_wheat.csv` — simplified wheat yields
- `climate_annual.csv` — simplified climate variables (annual)
- `merged_crop_climate.csv` — merged dataset used for modeling

This is a toy dataset for demonstration only. For production analysis, replace it with real FAO yield data and NASA POWER climate data.
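The merged file is presumably the result of joining the two source tables. A minimal sketch of such a join follows, assuming both tables share `Country` and `Year` key columns (an assumption, since the headers of the two source files are not shown here). Small made-up frames stand in for the CSVs:

```python
import pandas as pd

# Hypothetical stand-ins for crop_yield_wheat.csv and climate_annual.csv.
# Column names follow those used on the merged data in this notebook;
# the values are made up purely for illustration.
crop_demo = pd.DataFrame({
    'Country': ['A', 'A', 'B'],
    'Year': [2000, 2001, 2000],
    'Yield_t_ha': [3.0, 3.2, 2.5],
})
climate_demo = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Year': [2000, 2001, 2000, 2001],
    'Avg_Temp_C': [15.0, 15.4, 18.1, 18.3],
    'Total_Rainfall_mm': [600.0, 580.0, 420.0, 450.0],
    'Radiation_MJ_m2': [5200.0, 5250.0, 6100.0, 6050.0],
})

# Inner join keeps only country-years present in both tables.
# validate='one_to_one' raises if a (Country, Year) key is duplicated
# on either side, which catches accidental row multiplication.
merged_demo = crop_demo.merge(
    climate_demo, on=['Country', 'Year'], how='inner', validate='one_to_one'
)
print(merged_demo.shape)
```

An inner join silently drops country-years missing from either table, so it is worth comparing row counts before and after the merge when working with real data.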


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

pd.options.display.float_format = '{:,.3f}'.format


In [None]:
# Load data
crop = pd.read_csv('data/crop_yield_wheat.csv')
climate = pd.read_csv('data/climate_annual.csv')
data = pd.read_csv('data/merged_crop_climate.csv')
print('Crop data:'); display(crop.head())
print('\nClimate data:'); display(climate.head())
print('\nMerged data:'); display(data.head())

# Quick info
print('\nDataset shape:', data.shape)


In [None]:
# Exploratory analysis
sns.pairplot(data[['Yield_t_ha','Avg_Temp_C','Total_Rainfall_mm','Radiation_MJ_m2']], diag_kind='kde')
plt.suptitle('Pairplot of Yield and Climate Variables', y=1.02)
plt.show()

print('\nCorrelation matrix:')
display(data[['Yield_t_ha','Avg_Temp_C','Total_Rainfall_mm','Radiation_MJ_m2']].corr())

# Country trends
plt.figure(figsize=(8,5))
for c in data['Country'].unique():
    sub = data[data['Country']==c].sort_values('Year')  # plot years in chronological order
    plt.plot(sub['Year'], sub['Yield_t_ha'], marker='o', label=c)
plt.xlabel('Year')
plt.ylabel('Yield (t/ha)')
plt.title('Wheat Yield Over Time by Country')
plt.legend()
plt.show()


In [None]:
# Modeling: Predict Yield using climate variables
X = data[['Avg_Temp_C','Total_Rainfall_mm','Radiation_MJ_m2']]
y = data['Yield_t_ha']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
pred_lr = lr.predict(X_test)
print('Linear Regression R2:', r2_score(y_test, pred_lr))
# The `squared` argument was deprecated and later removed in scikit-learn; take the root explicitly
print('Linear Regression RMSE:', np.sqrt(mean_squared_error(y_test, pred_lr)))
coefs = pd.Series(lr.coef_, index=X.columns)
print('\nCoefficients:'); display(coefs)

# Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)
print('\nRandom Forest R2:', r2_score(y_test, pred_rf))
# The `squared` argument was deprecated and later removed in scikit-learn; take the root explicitly
print('Random Forest RMSE:', np.sqrt(mean_squared_error(y_test, pred_rf)))

# Feature importances
fi = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print('\nFeature importances:'); display(fi)

# Plot predictions vs actual
plt.figure(figsize=(6,4))
plt.scatter(y_test, pred_rf, label='RF Predictions')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', label='Perfect')
plt.xlabel('Actual Yield')
plt.ylabel('Predicted Yield')
plt.title('RF: Actual vs Predicted')
plt.legend()
plt.show()


---
## Conclusion & Next Steps

This toy analysis shows how climate variables can be used to predict wheat yield with a linear regression and a random forest. For a stronger project, you should:
- Replace toy data with **real FAO crop yield** and **NASA POWER** climate data (download CSVs for each country/year).
- Expand features: soil type, irrigation, fertilizer use, planting area, and management practices.
- Use panel data techniques (e.g., country fixed effects), time-series forecasting (e.g., LSTM), or spatial analysis (e.g., yield maps).
- Build an interactive dashboard (e.g., with Streamlit) to present results to stakeholders.
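
As one illustration of the panel-data suggestion, country fixed effects can be estimated with the within transformation: subtract each country's mean from every variable, then fit OLS without an intercept. The sketch below runs on synthetic data rather than the notebook's dataset. The country labels, baseline effects, and the true slope of 0.5 are all made up for the example:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic panel: 3 hypothetical countries x 20 observations each.
countries = np.repeat(['A', 'B', 'C'], 20)
effect = pd.Series(countries).map({'A': 1.0, 'B': 3.0, 'C': 5.0}).to_numpy()

# Warmer countries also have higher baseline yields (confounding by design).
temp = rng.normal(15, 2, size=60) + effect
# True within-country slope on temperature is 0.5.
yield_t_ha = 0.5 * temp + 2.0 * effect + rng.normal(0, 0.1, size=60)
panel = pd.DataFrame(
    {'Country': countries, 'Avg_Temp_C': temp, 'Yield_t_ha': yield_t_ha}
)

# Within transformation: remove each country's mean from x and y.
cols = ['Avg_Temp_C', 'Yield_t_ha']
demeaned = panel[cols] - panel.groupby('Country')[cols].transform('mean')

# OLS on the demeaned data, without intercept, gives the fixed-effects slope.
fe = LinearRegression(fit_intercept=False)
fe.fit(demeaned[['Avg_Temp_C']], demeaned['Yield_t_ha'])
fe_slope = fe.coef_[0]

# Pooled OLS ignores country baselines and absorbs them into the slope.
pooled = LinearRegression().fit(panel[['Avg_Temp_C']], panel['Yield_t_ha'])
pooled_slope = pooled.coef_[0]

print('Fixed-effects slope:', fe_slope)
print('Pooled slope:', pooled_slope)
```

In this synthetic setup the pooled slope is inflated because the country baselines are correlated with temperature; demeaning by country removes that confounding. This sketch gives point estimates only and does not compute panel-appropriate standard errors.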
