# Crop Yield Prediction Using Machine Learning

Authors: Awanti Dattu Rohite (RBT23CS079), Saundarya Karhade (RBT23CS081), Simrah Shaikh (RBT23CS083)

This notebook was prepared for the Machine Learning Mini Project (TY B, Semester I, 2025–26).


## 1. Setup and Imports

This notebook loads a Kaggle dataset when `kaggle_csv_path` (set in Section 2) points to an existing file; otherwise it falls back to the bundled sample CSV.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import os
print('Libraries imported')


## 2. Load Dataset

By default, this notebook loads `crop_yield_sample.csv`, which is included with the project. To use a Kaggle dataset instead, set `kaggle_csv_path` to the path of a local CSV file.

In [None]:
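# Set this to a local CSV path to use a Kaggle dataset instead of the sample.
kaggle_csv_path = None
if kaggle_csv_path and os.path.exists(kaggle_csv_path):
    df = pd.read_csv(kaggle_csv_path)
else:
    # Fall back to the sample CSV bundled with the project.
    df = pd.read_csv('crop_yield_sample.csv')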

df.head()
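
The fallback logic above can be generalized slightly. The sketch below is one possible helper and is not part of the original project: `first_existing_path` is a hypothetical name, and it returns the first candidate path that exists on disk so the result can be passed to `pd.read_csv`.

```python
import os
from typing import Iterable, Optional


def first_existing_path(candidates: Iterable[Optional[str]]) -> str:
    """Return the first candidate that is an existing file.

    None and empty strings are skipped, so an unset Kaggle path
    falls through to the next candidate (e.g. the sample CSV).
    """
    checked = []
    for path in candidates:
        if not path:
            continue
        checked.append(path)
        if os.path.isfile(path):
            return path
    # Listing the paths that were tried makes a missing file easier to diagnose.
    raise FileNotFoundError(f"None of these files exist: {checked}")
```

In this notebook it could be used as `df = pd.read_csv(first_existing_path([kaggle_csv_path, 'crop_yield_sample.csv']))`. Unlike a bare `pd.read_csv` call, a missing sample file then produces an error that names every path that was checked.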


## 3. EDA
Inspect the dataset's shape and summary statistics, then plot the distribution of yield and the yield for each crop.

In [None]:
print('Dataset shape:', df.shape)
print('Columns:', df.columns.tolist())
df.describe()
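
`df.describe()` summarizes the numeric columns, but it only hints at missing values through its `count` row and does not flag duplicated rows. A minimal sketch of both checks follows; the rows are made up for illustration and are not taken from `crop_yield_sample.csv`.

```python
import pandas as pd

# Toy rows for illustration only; the values are invented.
toy = pd.DataFrame({
    'Crop': ['Rice', 'Wheat', 'Rice', 'Wheat'],
    'Rainfall_mm': [1200.0, None, 1200.0, 450.0],
    'Yield_quintals_per_hectare': [32.5, 28.0, 32.5, 30.1],
})

missing_per_column = toy.isna().sum()          # number of NaN cells in each column
duplicate_rows = int(toy.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print('Duplicate rows:', duplicate_rows)
```

The same two calls can be run on `df` directly; nonzero counts would suggest cleaning the data before the preprocessing step.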


In [None]:
plt.figure(figsize=(10,6))
sns.histplot(df['Yield_quintals_per_hectare'], kde=True)
plt.title('Distribution of Yield (quintals per hectare)')
plt.show()


In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='Crop', y='Yield_quintals_per_hectare', data=df)
plt.title('Yield by Crop')
plt.show()
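
A box plot draws the median and quartiles of each group. As a numeric counterpart, the sketch below computes those quantities per crop with `groupby(...).describe()`. The rows are made up for illustration and do not come from the project dataset.

```python
import pandas as pd

# Invented values, three rows per crop, chosen to keep the arithmetic simple.
toy = pd.DataFrame({
    'Crop': ['Rice', 'Rice', 'Rice', 'Wheat', 'Wheat', 'Wheat'],
    'Yield_quintals_per_hectare': [30.0, 32.0, 34.0, 26.0, 28.0, 33.0],
})

# describe() reports the quartiles under the labels '25%', '50%' (median), '75%'.
quartiles = (
    toy.groupby('Crop')['Yield_quintals_per_hectare']
    .describe()[['25%', '50%', '75%']]
)
print(quartiles)
```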


## 4. Preprocessing
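Encode categorical variables with one-hot encoding and standardize numeric features. The preprocessor in this section is fit on the full dataset before the train/test split, so its scaling statistics also reflect the rows that later form the test set.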

In [None]:
# Separate the target from the features; State, District, and Year are not used as predictors.
X = df.drop(columns=['Yield_quintals_per_hectare', 'State', 'District', 'Year'])
y = df['Yield_quintals_per_hectare']

numeric_features = [
    'Area_hectares', 'Rainfall_mm', 'Avg_Temperature_C',
    'Humidity_pct', 'Fertilizer_kg_per_hectare',
]
categorical_features = ['Crop', 'Soil_Type']

# Standardize numeric columns; one-hot encode categoricals, dropping the first level of each.
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
cat_transformer = Pipeline(steps=[('onehot', OneHotEncoder(drop='first'))])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', cat_transformer, categorical_features),
])

X_pre = preprocessor.fit_transform(X)
print('Preprocessing finished. Feature shape:', X_pre.shape)
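
The cell above calls `fit_transform` on all rows. One common alternative, sketched here under stated assumptions, is to split first and wrap the preprocessor and model in a single `Pipeline`, so the scaler and encoder are fit on training rows only. The data below is synthetic, only three of the notebook's columns are used, and the encoder uses `handle_unknown='ignore'` instead of `drop='first'`; all of these are choices made for this illustration rather than the project's settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data for illustration only (not the project dataset).
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    'Rainfall_mm': rng.uniform(300, 1500, n),
    'Fertilizer_kg_per_hectare': rng.uniform(50, 200, n),
    'Crop': rng.choice(['Rice', 'Wheat', 'Maize'], n),
})
demo['Yield_quintals_per_hectare'] = (
    0.01 * demo['Rainfall_mm']
    + 0.05 * demo['Fertilizer_kg_per_hectare']
    + rng.normal(0, 1, n)
)

X_demo = demo.drop(columns=['Yield_quintals_per_hectare'])
y_demo = demo['Yield_quintals_per_hectare']

# Split first, so the test rows never influence the fitted preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pre = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Rainfall_mm', 'Fertilizer_kg_per_hectare']),
    # Categories unseen during fit are encoded as all zeros instead of raising.
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Crop']),
])
pipe = Pipeline(steps=[('pre', pre), ('model', LinearRegression())])

pipe.fit(X_tr, y_tr)               # scaler and encoder are fit on training rows only
demo_preds = pipe.predict(X_te)    # the already-fitted transforms are applied to test rows
print('Test predictions shape:', demo_preds.shape)
```

With this layout the pipeline receives the raw feature frame, so the same object can also be passed to cross-validation utilities without refitting the preprocessor by hand.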


## 5. Train/Test Split and Model Training
Train Linear Regression, Decision Tree, and Random Forest regressors on an 80/20 train/test split, and compare them by MAE, RMSE, and R2 on the test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_pre, y, test_size=0.2, random_state=42)
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
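    # Take the square root explicitly; the `squared` argument is not accepted by recent scikit-learn releases.
    rmse = np.sqrt(mean_squared_error(y_test, preds))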
    r2 = r2_score(y_test, preds)
    results[name] = {'MAE': mae, 'RMSE': rmse, 'R2': r2}
    print(f'{name} -> MAE: {mae:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}')
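
To make the three reported metrics concrete, the sketch below writes them out in plain Python. `regression_metrics` is a hypothetical helper name, and the numbers are made up so the arithmetic is easy to follow; scikit-learn's functions remain the ones used in this notebook.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, RMSE, and R2 for two equal-length sequences.

    Assumes `actual` is not constant; otherwise the R2 denominator is zero.
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n                 # mean absolute error
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    rmse = math.sqrt(ss_res / n)                          # root mean squared error
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {'MAE': mae, 'RMSE': rmse, 'R2': r2}


# Made-up numbers: the errors are -2, 0, and 3.
good = regression_metrics([20.0, 30.0, 40.0], [22.0, 30.0, 37.0])
# Predictions in reverse order are worse than always predicting the mean.
bad = regression_metrics([20.0, 30.0, 40.0], [40.0, 30.0, 20.0])
print(good)
print(bad)
```

The second call illustrates that R2 has no lower bound of 0: a model that does worse than predicting the mean of the actual values gets a negative score.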


## 6. Model Comparison and Plots

In [None]:
res_df = pd.DataFrame(results).T
res_df


In [None]:
res_df[['R2']].plot(kind='bar', legend=False, figsize=(8,5))
plt.title('Model R2 Scores')
plt.ylabel('R2')
plt.ylim(min(0, res_df['R2'].min()), 1)  # R2 can be negative, so extend the lower limit when needed
plt.show()


## 7. Actual vs Predicted (Best Model)

In [None]:
best_model_name = max(results.items(), key=lambda x: x[1]['R2'])[0]
best_model = models[best_model_name]
preds = best_model.predict(X_test)

plt.figure(figsize=(8,6))
plt.scatter(y_test, preds, alpha=0.7)
plt.xlabel('Actual Yield (quintals/ha)')
plt.ylabel('Predicted Yield (quintals/ha)')
plt.title(f'Actual vs Predicted - {best_model_name}')
lims = [min(y_test.min(), preds.min()), max(y_test.max(), preds.max())]
plt.plot(lims, lims, '--', color='grey')
plt.show()
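
In the scatter plot, the vertical distance of each point from the dashed diagonal is that sample's residual. A short numeric sketch of the same idea follows; the arrays are made up for illustration, and in this notebook they would be `y_test` and `preds`.

```python
import numpy as np

# Made-up values for illustration only.
actual = np.array([20.0, 30.0, 40.0, 50.0])
predicted = np.array([22.0, 29.0, 37.0, 50.0])

residuals = actual - predicted                   # positive means the model under-predicted
mean_residual = float(residuals.mean())          # a value far from 0 suggests systematic bias
worst_index = int(np.abs(residuals).argmax())    # sample farthest from the diagonal

print('Residuals:', residuals)
print('Mean residual:', mean_residual)
print('Largest error at index:', worst_index)
```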


## 8. Conclusion
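This notebook demonstrated classical ML regression techniques (Linear Regression, Decision Tree, and Random Forest) for crop yield prediction. Future work could add more features, try further ensemble methods beyond Random Forest, and incorporate satellite data.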