# Crop Yield Prediction using Machine Learning
Predict crop yield from rainfall, fertilizer usage, season, state, and crop type. This project aims to help farmers and policymakers make data-driven decisions.


# Problem Statement
The goal of this project is to forecast the yield of crops per unit area using machine learning models. 
Input features include annual rainfall, fertilizer and pesticide usage, crop type, season, state, and area.
The predicted yield can help in agricultural planning and resource optimization.


In [57]:
# Importing Required Libraries

# pandas: For data loading, manipulation, and analysis.
# numpy: For numerical computations and array operations.
# matplotlib.pyplot: For creating static plots and visualizations.
# seaborn: For advanced and statistical data visualizations (heatmaps, pairplots, etc.).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
     

In [59]:
# Step 1: Load the dataset
df = pd.read_csv("crop_yield.csv")
df.head()

| | Crop | Crop_Year | Season | State | Area | Production | Annual_Rainfall | Fertilizer | Pesticide | Yield |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arecanut | 1997 | Whole Year | Assam | 73814.0 | 56708 | 2051.4 | 7024878.38 | 22882.34 | 0.796087 |
| 1 | Arhar/Tur | 1997 | Kharif | Assam | 6637.0 | 4685 | 2051.4 | 631643.29 | 2057.47 | 0.710435 |
| 2 | Castor seed | 1997 | Kharif | Assam | 796.0 | 22 | 2051.4 | 75755.32 | 246.76 | 0.238333 |
| 3 | Coconut | 1997 | Whole Year | Assam | 19656.0 | 126905000 | 2051.4 | 1870661.52 | 6093.36 | 5238.051739 |
| 4 | Cotton(lint) | 1997 | Kharif | Assam | 1739.0 | 794 | 2051.4 | 165500.63 | 539.09 | 0.420909 |


# Dataset Information

- `df.info()` is used to check:
  - Number of rows and columns in the dataset.
  - Column names and data types (e.g., int, float, object).
  - Non-null counts for each column to identify missing values.
  
This helps in understanding the structure and completeness of the dataset before preprocessing.
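
The same facts that `df.info()` prints can also be read programmatically, which is handy for automated checks. A minimal sketch on a hypothetical three-row frame (the `sample` data is invented for illustration and is not taken from `crop_yield.csv`):

```python
import pandas as pd

# Hypothetical frame standing in for the real dataset
sample = pd.DataFrame({
    "Crop": ["Rice", "Wheat", None],
    "Area": [100.0, 250.5, 80.0],
    "Production": [300, 500, 120],
})

n_rows, n_cols = sample.shape        # number of rows and columns
dtypes = sample.dtypes.astype(str)   # data type of each column
non_null = sample.notna().sum()      # non-null count of each column

print(n_rows, n_cols)
print(dtypes.to_dict())
print(non_null.to_dict())
```

A non-null count below the row count, as for `Crop` here, marks a column with missing values.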


In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19689 entries, 0 to 19688
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Crop             19689 non-null  object 
 1   Crop_Year        19689 non-null  int64  
 2   Season           19689 non-null  object 
 3   State            19689 non-null  object 
 4   Area             19689 non-null  float64
 5   Production       19689 non-null  int64  
 6   Annual_Rainfall  19689 non-null  float64
 7   Fertilizer       19689 non-null  float64
 8   Pesticide        19689 non-null  float64
 9   Yield            19689 non-null  float64
dtypes: float64(5), int64(2), object(3)
memory usage: 1.5+ MB


# Statistical Summary of Dataset

- `df.describe()` provides descriptive statistics for numerical columns:
  - **Count**: Number of non-null values.
  - **Mean**: Average value of the column.
  - **Std**: Standard deviation (measure of spread).
  - **Min / Max**: Minimum and maximum values.
  - **25%, 50%, 75%**: Quartiles, showing data distribution.
  
This helps in understanding the range, central tendency, and variability of each numerical feature.
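
One practical use of these statistics is spotting skew: when the mean sits far above the median (the 50% row), a few very large values dominate the column. A small sketch of that comparison, using a hypothetical `yields` series rather than the real data:

```python
import pandas as pd

# Hypothetical yields: mostly small values plus one extreme outlier
yields = pd.Series([0.5, 0.8, 1.0, 1.2, 2.0, 5000.0], name="Yield")

summary = yields.describe()
mean_value = summary["mean"]
median_value = summary["50%"]

# A mean far above the median signals a right-skewed column
print(f"mean={mean_value:.2f}, median={median_value:.2f}")
print("right-skewed:", mean_value > median_value)
```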


In [65]:
df.describe()

| | Crop_Year | Area | Production | Annual_Rainfall | Fertilizer | Pesticide | Yield |
|---|---|---|---|---|---|---|---|
| count | 19689.0 | 19689.0 | 19689.0 | 19689.0 | 19689.0 | 19689.0 | 19689.0 |
| mean | 2009.127584 | 179926.6 | 16435940.0 | 1437.755177 | 24103310.0 | 48848.35 | 79.954009 |
| std | 6.498099 | 732828.7 | 263056800.0 | 816.909589 | 94946000.0 | 213287.4 | 878.306193 |
| min | 1997.0 | 0.5 | 0.0 | 301.3 | 54.17 | 0.09 | 0.0 |
| 25% | 2004.0 | 1390.0 | 1393.0 | 940.7 | 188014.6 | 356.7 | 0.6 |
| 50% | 2010.0 | 9317.0 | 13804.0 | 1247.6 | 1234957.0 | 2421.9 | 1.03 |
| 75% | 2015.0 | 75112.0 | 122718.0 | 1643.7 | 10003850.0 | 20041.7 | 2.388889 |
| max | 2020.0 | 50808100.0 | 6326000000.0 | 6552.7 | 4835407000.0 | 15750510.0 | 21105.0 |


# Checking for Missing Values

- `df.isnull().sum()` counts the number of missing (null) values in each column.
- Identifying missing values is important because:
  - Most scikit-learn estimators, including `LinearRegression`, raise an error when the input contains nulls.
  - Missing data may need to be filled (imputation) or removed.
- This step helps in planning proper data preprocessing.
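
The two remedies mentioned above, removal and imputation, might look like the following. This is a sketch on a hypothetical frame with deliberately inserted gaps; median imputation is only one possible strategy and is an assumption here, not something the project prescribes:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column
sample = pd.DataFrame({
    "Annual_Rainfall": [1200.0, np.nan, 950.0, 1100.0],
    "Fertilizer": [500.0, 620.0, np.nan, 580.0],
})

missing_counts = sample.isnull().sum()   # nulls per column

# Option 1: drop every row that contains at least one null
dropped = sample.dropna()

# Option 2: fill each null with the median of its column
imputed = sample.fillna(sample.median())

print(missing_counts.to_dict())
print(len(dropped), "rows remain after dropna()")
print(imputed)
```

Dropping rows is simplest but discards data; imputation keeps every row at the cost of inventing plausible values.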


In [68]:
df.isnull().sum()

Crop               0
Crop_Year          0
Season             0
State              0
Area               0
Production         0
Annual_Rainfall    0
Fertilizer         0
Pesticide          0
Yield              0
dtype: int64

## Preprocess Data

## Encoding Categorical Variables

- Many machine learning models cannot work with categorical text data directly.
- `LabelEncoder` from `sklearn.preprocessing` is used to convert categorical values into numerical labels.
- In this step:
  - The `Crop` column (categorical) is transformed into numeric values.
  - Each unique crop name is assigned a unique integer.
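
To see which integer each label receives, the fitted encoder's `classes_` attribute can be inspected. A minimal sketch with a hypothetical list of crop names (the values are illustrative only):

```python
from sklearn.preprocessing import LabelEncoder

crops = ["Rice", "Arecanut", "Coconut", "Rice"]  # hypothetical sample values

encoder = LabelEncoder()
codes = encoder.fit_transform(crops)

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
decoded = encoder.inverse_transform(codes)   # integers back to names

print(mapping)
print(codes.tolist())
print(decoded.tolist())
```

Because the codes follow alphabetical order, their numeric ordering carries no agronomic meaning, yet a linear model will still treat them as ordered quantities.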


In [71]:
# Preprocess Data
# Encode the categorical column 'Crop' as integer labels
from sklearn.preprocessing import LabelEncoder
crop_encoder = LabelEncoder()  # kept under its own name so labels can be decoded later
df['Crop'] = crop_encoder.fit_transform(df['Crop'])

## Encoding the 'Season' Column

- The `Season` column contains categorical data (e.g., Kharif, Whole Year).
- Machine learning models require numerical input, so we use `LabelEncoder` to convert categories into integers.
- Each unique season is assigned a unique numeric label.


In [74]:
from sklearn.preprocessing import LabelEncoder
season_encoder = LabelEncoder()  # separate encoder so the season mapping is not overwritten
df['Season'] = season_encoder.fit_transform(df['Season'])

## Encoding the 'State' Column

- The `State` column contains categorical values representing different regions.
- Machine learning models require numeric input, so we use `LabelEncoder` to convert each state into a unique integer.
- This allows the model to process the region information as a feature.


In [77]:

from sklearn.preprocessing import LabelEncoder
state_encoder = LabelEncoder()  # separate encoder so the state mapping is not overwritten
df['State'] = state_encoder.fit_transform(df['State'])
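
Keeping one fitted encoder per column makes it possible to map encoded values back to their original labels later. A compact sketch of that pattern, using a hypothetical three-row frame and an `encoders` dictionary that is not part of the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with the three categorical columns
sample = pd.DataFrame({
    "Crop": ["Arecanut", "Coconut", "Arecanut"],
    "Season": ["Whole Year", "Whole Year", "Kharif"],
    "State": ["Assam", "Assam", "Assam"],
})

encoders = {}  # column name -> fitted LabelEncoder
for column in ["Crop", "Season", "State"]:
    encoders[column] = LabelEncoder()
    sample[column] = encoders[column].fit_transform(sample[column])

# Recover the original crop names from the integer codes
original_crops = encoders["Crop"].inverse_transform(sample["Crop"])

print(sample)
print(original_crops.tolist())
```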

## Dropping Unnecessary Columns

- This project does not use the `Crop_Year` column as a predictor of crop yield.
- Removing irrelevant or redundant columns can help:
  - Reduce noise in the data
  - Improve model performance
- Here, we drop the `Crop_Year` column using `df.drop()`.
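
For reference, `df.drop()` returns a new frame by default and only modifies the original when `inplace=True` is passed. A small sketch of the non-mutating form on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame
sample = pd.DataFrame({
    "Crop_Year": [1997, 1998],
    "Area": [73814.0, 6637.0],
    "Yield": [0.80, 0.71],
})

# Without inplace=True, drop() returns a new frame and leaves `sample` unchanged
trimmed = sample.drop(columns=["Crop_Year"])

print(list(sample.columns))
print(list(trimmed.columns))
```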


In [80]:
# Drop the Crop_Year column, which this project does not use as a feature
df.drop(columns=["Crop_Year"], inplace=True)

In [82]:
# Separate features and target variable
from sklearn.model_selection import train_test_split
x = df.drop(columns=["Yield"])  # Features
y = df["Yield"]  # Target variable
     

In [84]:
#  Split data into training and testing sets (80% train, 20% test)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)
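
To check what an 80/20 split means for a dataset of this size, the sketch below splits placeholder arrays with the same number of rows that `df.info()` reports (19689). The arrays `x_demo` and `y_demo` are stand-ins, not the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 19689                               # row count reported by df.info()
x_demo = np.arange(n_rows).reshape(-1, 1)    # placeholder feature matrix
y_demo = np.arange(n_rows)                   # placeholder target

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

print(x_tr.shape, x_te.shape)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs evaluate the model on the same held-out rows.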

In [86]:
# Standardize features to zero mean and unit variance; the scaler is fit on the training set only
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
xtrain = scaler.fit_transform(xtrain)
xtest = scaler.transform(xtest)
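
Fitting the scaler on the training split and only transforming the test split keeps test-set statistics out of preprocessing. A minimal sketch of what `StandardScaler` computes, on hypothetical arrays (`demo_scaler`, `train`, and `test` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

After scaling, both columns are on a comparable scale even though the raw values differ by two orders of magnitude.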

In [88]:
#  Train the Linear Regression model

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(xtrain, ytrain)
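
After fitting, a `LinearRegression` model exposes its learned weights through `coef_` and `intercept_`. The sketch below fits hypothetical data that lies exactly on the line y = 3x + 2, so the recovered parameters can be checked by eye:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on y = 3x + 2
x_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 3.0 * x_demo.ravel() + 2.0

demo_model = LinearRegression()
demo_model.fit(x_demo, y_demo)

print(demo_model.coef_, demo_model.intercept_)   # slope and intercept
print(demo_model.predict(np.array([[4.0]])))     # prediction for x = 4
```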

In [90]:
#  Make predictions
ypred = model.predict(xtest)
ypred
     

array([-49.24045179,  82.88116574, 210.69432385, ..., 261.76956498,
       225.64438805,  34.4073312 ])

# Model Evaluation

- After training the model, we evaluate its performance using several metrics:

1. **Mean Absolute Error (MAE)**  
   - Measures the average absolute difference between actual and predicted values.  
   - Lower MAE indicates better model performance.

2. **Mean Squared Error (MSE)**  
   - Measures the average squared difference between actual and predicted values.  
   - Penalizes larger errors more than MAE.

3. **R-squared Score (R²)**  
   - Indicates how well the model explains the variance in the target variable.  
   - R² closer to 1 means the model fits the data better.

- Together, these metrics summarize the typical error size (MAE), the sensitivity to large errors (MSE), and the share of variance explained (R²).
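
The three definitions above can be verified by computing them by hand and comparing against scikit-learn. A sketch with four hypothetical values (`y_true` and `y_hat` are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical actual values
y_hat = np.array([1.5, 2.0, 2.0, 5.0])    # hypothetical predictions

errors = y_true - y_hat
mae_manual = np.mean(np.abs(errors))       # average absolute error
mse_manual = np.mean(errors ** 2)          # average squared error
# R² = 1 - (residual sum of squares / total sum of squares)
r2_manual = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```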


In [93]:

# Evaluate the model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(ytest, ypred)
mse = mean_squared_error(ytest, ypred)
r2 = r2_score(ytest, ypred)

# Print results
print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("R-squared Score:", r2)

Mean Absolute Error: 139.5031162793605
Mean Squared Error: 480217.7380571669
R-squared Score: 0.4006567083226701
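
Because MSE is expressed in squared units, taking its square root gives the root mean squared error (RMSE), which is in the same units as `Yield` and easier to compare with the summary statistics. A short sketch using the MSE value printed above (`reported_mse` is an illustrative name):

```python
import math

reported_mse = 480217.7380571669   # Mean Squared Error printed above
rmse = math.sqrt(reported_mse)     # back in the units of Yield

print(round(rmse, 2))
```

An RMSE of roughly 693 is large next to the median yield of 1.03 and not far below the standard deviation of about 878 from `df.describe()`, which fits with an R² of about 0.40: the model explains less than half of the variance in yield.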
