# Crop Yield Prediction #

This project focuses on predicting crop yield using a combination of soil properties, weather conditions, and farming practices. The dataset contains information on soil nutrients (such as nitrogen, phosphorus, and potassium), environmental factors like temperature, rainfall, humidity, and sunlight, as well as categorical variables including crop type, season, region, and irrigation method.

The goal of this project is to build a machine learning regression model that can learn the relationships between these factors and crop yield measured in tons per hectare. By performing exploratory data analysis, feature engineering, and model evaluation, I aim to identify the most important factors influencing crop productivity and develop a model that can generalize well across different crops and regions. This work is intended to simulate a real-world agricultural yield prediction system that could support data-driven decision-making in smart farming and agricultural planning.

### Import Libraries ###

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [2]:
data = pd.read_csv("../data/crop-yield.csv")

In [3]:
data.head(10)

|   | N | P | K | Soil_pH | Soil_Moisture | Soil_Type | Organic_Carbon | Temperature | Humidity | Rainfall | Sunlight_Hours | Wind_Speed | Region | Altitude | Season | Crop_Type | Irrigation_Type | Fertilizer_Used | Pesticide_Used | Crop_Yield_ton_per_hectare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 132 | 62 | 22 | 6.35 | 59.78 | Clay | 0.43 | 22.97 | 53.89 | 1305.68 | 7.73 | 15.96 | Central | 36 | Rabi | Maize | Canal | 223.48 | 23.36 | 11.42 |
| 1 | 122 | 71 | 66 | 5.98 | 25.54 | Sandy | 0.65 | 17.0 | 76.9 | 1942.05 | 9.25 | 12.6 | North | 1561 | Rabi | Potato | Canal | 161.54 | 4.42 | 23.19 |
| 2 | 44 | 35 | 104 | 8.07 | 25.87 | Sandy | 0.79 | 25.52 | 44.78 | 2216.2 | 8.5 | 15.63 | North | 1870 | Rabi | Rice | Rainfed | 184.62 | 6.29 | 7.94 |
| 3 | 136 | 96 | 113 | 4.83 | 42.97 | Silt | 0.45 | 18.59 | 31.89 | 607.18 | 8.75 | 5.49 | East | 765 | Kharif | Sugarcane | Rainfed | 274.02 | 2.72 | 72.53 |
| 4 | 101 | 34 | 42 | 5.84 | 48.01 | Silt | 0.69 | 22.74 | 46.27 | 483.47 | 8.0 | 7.44 | Central | 1143 | Zaid | Wheat | Rainfed | 72.69 | 15.37 | 6.72 |
| 5 | 50 | 29 | 22 | 6.87 | 32.73 | Silt | 1.2 | 13.88 | 68.91 | 1993.65 | 10.17 | 11.25 | East | 1739 | Kharif | Rice | Canal | 335.8 | 3.8 | 8.67 |
| 6 | 132 | 83 | 148 | 7.46 | 40.98 | Silt | 0.92 | 14.92 | 87.21 | 2433.33 | 10.28 | 13.82 | East | 1360 | Rabi | Potato | Canal | 301.54 | 2.84 | 26.96 |
| 7 | 151 | 91 | 86 | 7.58 | 26.39 | Sandy | 0.85 | 28.42 | 53.74 | 1499.4 | 8.24 | 17.7 | North | 1348 | Kharif | Rice | Rainfed | 317.16 | 19.71 | 9.51 |
| 8 | 104 | 65 | 90 | 4.96 | 21.8 | Silt | 0.86 | 26.96 | 77.85 | 1881.33 | 9.12 | 2.16 | East | 54 | Rabi | Cotton | Sprinkler | 253.49 | 17.82 | 7.01 |
| 9 | 117 | 90 | 86 | 7.21 | 26.91 | Clay | 1.29 | 15.14 | 42.03 | 1045.25 | 7.24 | 10.21 | North | 57 | Rabi | Sugarcane | Sprinkler | 231.33 | 21.83 | 75.1 |


Checking for null entries

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   N                           10000 non-null  int64  
 1   P                           10000 non-null  int64  
 2   K                           10000 non-null  int64  
 3   Soil_pH                     10000 non-null  float64
 4   Soil_Moisture               10000 non-null  float64
 5   Soil_Type                   10000 non-null  object 
 6   Organic_Carbon              10000 non-null  float64
 7   Temperature                 10000 non-null  float64
 8   Humidity                    10000 non-null  float64
 9   Rainfall                    10000 non-null  float64
 10  Sunlight_Hours              10000 non-null  float64
 11  Wind_Speed                  10000 non-null  float64
 12  Region                      10000 non-null  object 
 13  Altitude                    1000

In [7]:
# data.info() is not repeated here; its output is already shown above.
# A notebook cell displays only its last expression, so the value below
# is the duplicate-row count returned by data.duplicated().sum().
data.isnull().sum()
data.duplicated().sum()

np.int64(0)

Since no column contains null values and there are no duplicate rows, we can proceed with exploratory data analysis.
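Because a notebook cell only displays its last expression, it can help to collect both checks into one explicit summary. The sketch below is a minimal illustration: `quality_report` is a hypothetical helper name, and the small `toy` frame (with made-up values) stands in for `data`.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Count missing cells and fully duplicated rows in one pass."""
    return {
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "N": [132, 122, 122, None],
    "Soil_Type": ["Clay", "Sandy", "Sandy", "Silt"],
})

report = quality_report(toy)
print(report)
```

On the real dataset the same call would be `quality_report(data)`.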

## Exploratory Data Analysis

### 1. Do Mineral Nutrients Impact Crop Yield?

#### Scientific Expectation

- Nitrogen (N): Strong positive impact (leaf growth, biomass)

- Phosphorus (P): Root development, early growth

- Potassium (K): Stress tolerance, water regulation

- Organic Carbon: Soil fertility and microbial health

#### Engineering Approach

##### We test:

- Correlation with yield

- Non-linear effects

- Diminishing returns (yield gains shrink, and can reverse, at high nutrient levels)

##### What We Expect in Data

- Positive correlation for N and Organic Carbon

- Weaker but meaningful contribution from P and K

- Tree-based models will capture thresholds better than linear ones
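The three tests listed above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, Spearman correlation is computed as Pearson on ranks, a quadratic fit is used as one simple way to look for diminishing returns, and the `toy` frame is synthetic rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

def pearson_and_spearman(df, feature, target):
    """Pearson measures linear association. Spearman is Pearson applied to
    ranks, so it also picks up monotonic but curved relationships."""
    pearson = df[feature].corr(df[target])
    spearman = df[feature].rank().corr(df[target].rank())
    return pearson, spearman

def quadratic_fit(df, feature, target):
    """Fit target = a*x**2 + b*x + c. A negative `a` means the curve bends
    downward, which is the shape diminishing returns would produce."""
    a, b, c = np.polyfit(df[feature], df[target], deg=2)
    return a, b, c

# Synthetic example (not the real dataset): yield peaks at N = 120, then declines.
n_levels = np.arange(0, 201, 10, dtype=float)
toy = pd.DataFrame({
    "N": n_levels,
    "toy_yield": 20 - 0.001 * (n_levels - 120) ** 2,
})

pearson, spearman = pearson_and_spearman(toy, "N", "toy_yield")
a, b, c = quadratic_fit(toy, "N", "toy_yield")
peak_n = -b / (2 * a)  # vertex of the fitted parabola
print(f"pearson={pearson:.2f} spearman={spearman:.2f} a={a:.4f} peak_N={peak_n:.0f}")
```

On the real data, the same helpers could be called with `data`, a nutrient column such as `"N"`, and `"Crop_Yield_ton_per_hectare"`.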

In [5]:

data[["N", "P", "K", "Organic_Carbon", "Crop_Yield_ton_per_hectare"]].corr() 

|   | N | P | K | Organic_Carbon | Crop_Yield_ton_per_hectare |
|---|---|---|---|---|---|
| N | 1.0 | 0.005533 | -0.006166 | 0.010804 | 0.002334 |
| P | 0.005533 | 1.0 | -0.004304 | 0.010566 | -0.002483 |
| K | -0.006166 | -0.004304 | 1.0 | -0.008846 | 0.009897 |
| Organic_Carbon | 0.010804 | 0.010566 | -0.008846 | 1.0 | -0.000781 |
| Crop_Yield_ton_per_hectare | 0.002334 | -0.002483 | 0.009897 | -0.000781 | 1.0 |



The data shows almost no linear relationship between these soil nutrients and crop yield: every correlation with yield has an absolute value below 0.01.

N, P, and K do affect yield in controlled experiments, so the near-zero correlations in this pooled dataset may have several explanations:

- Other factors dominate (weather, farm management, crop type, pests)

- Nutrient levels may already be within optimal ranges (farmers apply adequate amounts)

- With diminishing returns, the variation present in the data may not move yield much
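One way to probe the "other factors dominate" explanation is to compute the nutrient correlation separately for each crop, since pooling crops with very different yield scales can hide a within-crop relationship. The sketch below assumes a hypothetical helper `corr_by_group` and uses a made-up `toy` frame, not the real data.

```python
import pandas as pd

def corr_by_group(df, group_col, feature, target):
    """Pearson correlation between feature and target, computed per group."""
    return pd.Series({
        name: group[feature].corr(group[target])
        for name, group in df.groupby(group_col)
    })

# Made-up values: within each crop, more N goes with more yield,
# but the high-N crop happens to be the low-yield one.
toy = pd.DataFrame({
    "Crop_Type": ["Sugarcane"] * 3 + ["Wheat"] * 3,
    "N": [100, 110, 120, 130, 140, 150],
    "toy_yield": [70.0, 72.0, 74.0, 6.0, 7.0, 8.0],
})

pooled = toy["N"].corr(toy["toy_yield"])
per_crop = corr_by_group(toy, "Crop_Type", "N", "toy_yield")
print("pooled:", round(pooled, 3))
print(per_crop)
```

In this toy case the pooled correlation is negative even though each crop shows a perfect positive relationship. Whether anything similar holds in the real dataset would need to be checked with `corr_by_group(data, "Crop_Type", "N", "Crop_Yield_ton_per_hectare")`.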

### 2. Does Weather Affect Crop Yield?

(Temperature, Humidity, Rainfall, Sunlight, Wind, Altitude)

#### Scientific Expectation

- Optimal temperature range per crop

- Rainfall improves yield up to a saturation point

- Sunlight is critical for photosynthesis

- Excessive wind can reduce yield

- Altitude indirectly affects temperature and oxygen levels

#### Engineering Reality

- Weather variables often interact

- Correlation alone is insufficient
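To illustrate why correlation alone can miss an interaction, the sketch below bins two variables and tabulates the mean target per cell. The helper name `binned_mean_table`, the two-bin split, and the synthetic `toy` frame are all assumptions made for illustration. In the toy data the target is the product of the two variables, so each variable has zero correlation with it on its own.

```python
from itertools import product

import pandas as pd

def binned_mean_table(df, row_feature, col_feature, target, q=2):
    """Mean target for each combination of quantile bins of two features."""
    rows = pd.qcut(df[row_feature], q=q, duplicates="drop")
    cols = pd.qcut(df[col_feature], q=q, duplicates="drop")
    return df.groupby([rows, cols], observed=False)[target].mean().unstack()

# Synthetic grid (not the real dataset): the target depends only on the
# combination of the two anomalies, not on either one alone.
levels = [-1.0, -0.5, 0.5, 1.0]
toy = pd.DataFrame(list(product(levels, levels)),
                   columns=["temp_anomaly", "rain_anomaly"])
toy["toy_yield"] = toy["temp_anomaly"] * toy["rain_anomaly"]

marginal_corr = toy["temp_anomaly"].corr(toy["toy_yield"])
table = binned_mean_table(toy, "temp_anomaly", "rain_anomaly", "toy_yield")
print("marginal correlation:", round(marginal_corr, 3))
print(table)
```

The same table could be built for the real data with, for example, `binned_mean_table(data, "Temperature", "Rainfall", "Crop_Yield_ton_per_hectare", q=4)`.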

In [10]:
weather_cols = [
    "Temperature", "Humidity", "Rainfall",
    "Sunlight_Hours", "Wind_Speed", "Altitude"
]

data[weather_cols + ["Crop_Yield_ton_per_hectare"]].corr()


|   | Temperature | Humidity | Rainfall | Sunlight_Hours | Wind_Speed | Altitude | Crop_Yield_ton_per_hectare |
|---|---|---|---|---|---|---|---|
| Temperature | 1.0 | -0.008862 | -0.005747 | 0.000798 | -0.010281 | 0.003325 | -0.006589 |
| Humidity | -0.008862 | 1.0 | 0.012658 | 0.021042 | -0.002977 | -0.002748 | 0.006916 |
| Rainfall | -0.005747 | 0.012658 | 1.0 | -0.001289 | -0.014392 | -0.000546 | 0.031213 |
| Sunlight_Hours | 0.000798 | 0.021042 | -0.001289 | 1.0 | -0.003318 | 0.016818 | -0.001162 |
| Wind_Speed | -0.010281 | -0.002977 | -0.014392 | -0.003318 | 1.0 | -0.013441 | 0.015185 |
| Altitude | 0.003325 | -0.002748 | -0.000546 | 0.016818 | -0.013441 | 1.0 | -0.001562 |
| Crop_Yield_ton_per_hectare | -0.006589 | 0.006916 | 0.031213 | -0.001162 | 0.015185 | -0.001562 | 1.0 |


Once again, there are no meaningful linear correlations between individual weather variables and crop yield. The strongest is Rainfall vs. yield at 0.031, which is still extremely weak.

### 3. Do Certain Regions Have Better Crop Yield?

In [12]:
data.groupby("Region")["Crop_Yield_ton_per_hectare"].mean().sort_values(ascending=False)

Region
West       23.100000
South      22.567029
Central    22.540969
East       22.435189
North      21.074284
Name: Crop_Yield_ton_per_hectare, dtype: float64

#### Analysis Summary:

- West region has the highest average yield: 23.10 tons/hectare

- North region has the lowest: 21.07 tons/hectare

- Range: 2.03 tons/hectare between the highest and lowest regions

- All regional means fall between roughly 21 and 23 tons/hectare
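A gap in group means does not by itself say whether the regions differ beyond sampling noise. One common check is a one-way ANOVA, sketched below with `scipy.stats.f_oneway`. The helper name `anova_by_group` is hypothetical and the `toy` values are made up. ANOVA also does not account for regions growing different crop mixes, so a significant result would still need a closer look.

```python
import pandas as pd
from scipy import stats

def anova_by_group(df, group_col, target):
    """One-way ANOVA: are the group means of `target` plausibly all equal?"""
    samples = [group[target].to_numpy() for _, group in df.groupby(group_col)]
    return stats.f_oneway(*samples)

# Made-up values with clearly separated group means.
toy = pd.DataFrame({
    "Region": ["West"] * 3 + ["South"] * 3 + ["North"] * 3,
    "toy_yield": [21.0, 22.0, 23.0, 11.0, 12.0, 13.0, 1.0, 2.0, 3.0],
})

result = anova_by_group(toy, "Region", "toy_yield")
print(f"F = {result.statistic:.1f}, p = {result.pvalue:.2g}")
```

For the real data the call would be `anova_by_group(data, "Region", "Crop_Yield_ton_per_hectare")`.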

### 4. Does Season and Crop Type Matter?

- Season: controls rainfall, sunlight, and temperature cycles

- Crop Type: different crops have different nutrient and climate needs

In [15]:
data.groupby("Season")["Crop_Yield_ton_per_hectare"].mean()

Season
Kharif    22.425284
Rabi      22.799642
Zaid      21.793781
Name: Crop_Yield_ton_per_hectare, dtype: float64

In [16]:
data.groupby("Crop_Type")["Crop_Yield_ton_per_hectare"].mean()

Crop_Type
Cotton        7.427941
Maize        10.076037
Potato       24.891012
Rice          9.402027
Sugarcane    74.830257
Wheat         8.724680
Name: Crop_Yield_ton_per_hectare, dtype: float64

#### Key Insights:
1. Season Impact:
- Rabi season: Highest yield (22.80 tons/ha)

- Kharif season: Close second (22.43 tons/ha)

- Zaid season: Lowest yield (21.79 tons/ha)

- Difference: ~1 ton/ha between the best and worst seasons

2. Crop Type Impact:

- Sugarcane: 74.83 tons/ha (extremely high: about 3× potato and roughly 7-10× the other crops)

- Potato: 24.89 tons/ha

- Maize: 10.08 tons/ha

- Rice: 9.40 tons/ha

- Wheat: 8.72 tons/ha

- Cotton: 7.43 tons/ha (lowest)
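To put the sugarcane gap in proportion, the crop means printed by the `Crop_Type` groupby above can be divided into the sugarcane mean. The snippet below is a small worked example that copies those printed means by hand. On the live notebook the same ratios could be computed directly from the groupby result.

```python
import pandas as pd

# Mean yields (tons/ha) copied from the Crop_Type groupby output above.
crop_means = pd.Series({
    "Cotton": 7.427941,
    "Maize": 10.076037,
    "Potato": 24.891012,
    "Rice": 9.402027,
    "Sugarcane": 74.830257,
    "Wheat": 8.724680,
})

# How many times larger is the sugarcane mean than each other crop's mean?
ratios = (crop_means["Sugarcane"] / crop_means).drop("Sugarcane").round(2)
print(ratios.sort_values(ascending=False))
```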

### 5. Does Fertilizer and Pesticide Used Matter?

In [3]:
data[["Fertilizer_Used", "Pesticide_Used", "Crop_Yield_ton_per_hectare"]].corr()

|   | Fertilizer_Used | Pesticide_Used | Crop_Yield_ton_per_hectare |
|---|---|---|---|
| Fertilizer_Used | 1.0 | 0.007045 | 0.051371 |
| Pesticide_Used | 0.007045 | 1.0 | -0.01206 |
| Crop_Yield_ton_per_hectare | 0.051371 | -0.01206 | 1.0 |


#### Analysis Summary:
- Fertilizer usage: correlation = 0.051. More fertilizer is very slightly associated with higher yield.

- Pesticide usage: correlation = -0.012. No meaningful linear relationship with yield.

Key insight: even management practices show very weak correlations with yield in this dataset. This is consistent with crop type dominating yield variation.
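If crop type dominates the variation, a single number that removes it can be useful: subtract each crop's mean from both the input and the yield, then correlate what is left. The sketch below shows this pooled within-group correlation. The helper name `pooled_within_group_corr` is hypothetical and the `toy` values are made up.

```python
import pandas as pd

def pooled_within_group_corr(df, group_col, feature, target):
    """Correlate feature and target after removing each group's mean from both."""
    grouped = df.groupby(group_col)
    feature_centered = df[feature] - grouped[feature].transform("mean")
    target_centered = df[target] - grouped[target].transform("mean")
    return feature_centered.corr(target_centered)

# Made-up values: fertilizer helps within each crop, but the crop that
# receives more fertilizer has a much lower yield scale.
toy = pd.DataFrame({
    "Crop_Type": ["Sugarcane"] * 3 + ["Wheat"] * 3,
    "Fertilizer_Used": [100.0, 200.0, 300.0, 300.0, 400.0, 500.0],
    "toy_yield": [70.0, 72.0, 74.0, 6.0, 7.0, 8.0],
})

raw = toy["Fertilizer_Used"].corr(toy["toy_yield"])
within = pooled_within_group_corr(toy, "Crop_Type", "Fertilizer_Used", "toy_yield")
print(f"raw = {raw:.3f}, within-crop = {within:.3f}")
```

The real-data equivalent would pass `data` with `"Crop_Yield_ton_per_hectare"` as the target. The result there is not known in advance and may well stay close to zero.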

### 6. Does Soil pH, Soil Moisture, and Soil Type Matter?
#### Domain Knowledge

- Soil pH controls nutrient absorption

- Soil moisture directly impacts growth

- Soil type influences drainage and retention

In [5]:
data.groupby("Soil_Type")["Crop_Yield_ton_per_hectare"].mean()

Soil_Type
Clay     22.686830
Loamy    22.599233
Sandy    21.261783
Silt     22.822911
Name: Crop_Yield_ton_per_hectare, dtype: float64

In [6]:
data[["Soil_pH", "Soil_Moisture", "Crop_Yield_ton_per_hectare"]].corr()

|   | Soil_pH | Soil_Moisture | Crop_Yield_ton_per_hectare |
|---|---|---|---|
| Soil_pH | 1.0 | -0.005134 | -0.010509 |
| Soil_Moisture | -0.005134 | 1.0 | 0.000152 |
| Crop_Yield_ton_per_hectare | -0.010509 | 0.000152 | 1.0 |


#### Analysis Summary:
##### Soil Type Findings:
- Silt: Highest yield (22.82 tons/ha)

- Clay: Close second (22.69 tons/ha)

- Loamy: Third (22.60 tons/ha)

- Sandy: Lowest yield (21.26 tons/ha), about 1.3-1.6 tons/ha below the other soil types

##### Soil pH & Moisture:
- Soil pH: Correlation = -0.0105 (essentially zero)

- Soil Moisture: Correlation = 0.00015 (essentially zero)
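A near-zero Pearson correlation rules out a linear trend, but not an optimum in the middle of the range, which is what domain knowledge suggests for pH. A simple check is to bin the variable and compare mean yield per bin. In the sketch below, the helper name `mean_by_bin`, the bin edges, and the `toy` values are all illustrative assumptions.

```python
import pandas as pd

def mean_by_bin(df, feature, target, edges):
    """Mean and row count of `target` within fixed-width bins of `feature`."""
    bins = pd.cut(df[feature], bins=edges)
    return df.groupby(bins, observed=False)[target].agg(["mean", "count"])

# Made-up values: yield is highest near pH 6.5 and falls off symmetrically,
# so the linear correlation is zero even though pH clearly matters.
toy = pd.DataFrame({
    "Soil_pH": [4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5],
    "toy_yield": [5.0, 6.0, 7.0, 10.0, 11.0, 10.0, 7.0, 6.0, 5.0],
})

ph_corr = toy["Soil_pH"].corr(toy["toy_yield"])
table = mean_by_bin(toy, "Soil_pH", "toy_yield", edges=[4.0, 5.75, 7.25, 9.0])
print("correlation:", round(ph_corr, 3))
print(table)
```

Applied to the real data, flat bin means would support the "essentially zero" reading, while a hump would point to a non-linear effect that a tree-based model could pick up.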


### Hierarchy of Factors in This Dataset:
- Crop Type → Massive influence (74.8 vs 7.4 t/ha)

- Region → Small influence (2.0 t/ha difference West vs North)

- Soil Type → Small influence (1.6 t/ha difference Silt vs Sandy)

- Season → Small influence (1.0 t/ha difference Rabi vs Zaid)

- All other variables → Minimal linear correlation with yield (|r| of about 0.05 at most)
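Comparing gaps between group means is a rough way to rank categorical factors. A more comparable measure is the share of total yield variance explained by each factor (eta squared, the between-group sum of squares divided by the total). The sketch below assumes hypothetical helper names and a made-up `toy` frame. It is one possible way to quantify the ranking, not the method used to produce the list above.

```python
import pandas as pd

def eta_squared(df, group_col, target):
    """Share of the target's variance explained by group membership (0 to 1)."""
    grand_mean = df[target].mean()
    ss_total = ((df[target] - grand_mean) ** 2).sum()
    group_means = df.groupby(group_col)[target].transform("mean")
    ss_between = ((group_means - grand_mean) ** 2).sum()
    return ss_between / ss_total

def rank_categorical_factors(df, factors, target):
    """Eta squared for each factor, largest first."""
    scores = {factor: eta_squared(df, factor, target) for factor in factors}
    return pd.Series(scores).sort_values(ascending=False)

# Made-up values: crop type sets the yield scale, season shifts it slightly.
toy = pd.DataFrame({
    "Crop_Type": ["Sugarcane", "Sugarcane", "Wheat", "Wheat"],
    "Season": ["Kharif", "Rabi", "Kharif", "Rabi"],
    "toy_yield": [70.0, 74.0, 6.0, 8.0],
})

ranking = rank_categorical_factors(toy, ["Season", "Crop_Type"], "toy_yield")
print(ranking)
```

For the real data, the factor list could be `["Crop_Type", "Region", "Soil_Type", "Season", "Irrigation_Type"]` with `"Crop_Yield_ton_per_hectare"` as the target.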