# **Final Project Submission**

Name: Austin Murunga

## **Business Problem**
### **Crop Yield Production in India**

In a country where agriculture plays a key role in the economy, crop yields vary widely because of factors such as climate change, soil quality, and the availability of resources. This project builds a machine learning model to predict crop production across states and districts in India, using features such as state, district, and crop type in this edition. The feature set will be updated as more data becomes available.

### **Objectives**

**Crop Selection Optimization:** Use predicted yield to suggest which crop is likely to give the highest output, giving farmers guidance on the best crop to grow in their region and season.

**Resource Distribution:** Advise policy makers on how best to allocate resources, e.g. water and fertilisers.

**Risk Management:** Be proactive by forecasting likely yield deficits.

**Impact:** Reliable predictions will help farmers and officials make informed decisions, which in turn can improve productivity, the economical use of resources, and the sustainability of farming in India.

### Key features

1. State: The state in India where the crop is cultivated.

2. District: The district within the state where the crop is cultivated.

3. Crop: The type of crop being cultivated (e.g., rice, wheat, maize).

4. Season: The season during which the crop is cultivated (e.g., Kharif, Rabi).

5. Crop Year: The year in which the crop was cultivated.

6. Area: The area of land (in hectares) used for cultivating the crop.

7. Production: The total production of the crop (in tonnes), which is the target variable for prediction.

### Import Libraries

In [113]:
# basic libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn tools for splitting the data, building the model, and evaluating it
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


### Load the Data

Import the data stored in ***crop_production.csv***

In [114]:
data = pd.read_csv('crop_production.csv')
data

Unnamed: 0,State_Name,District_Name,Crop_Year,Season,Crop,Area,Production
0,Andaman and Nicobar Islands,NICOBARS,2000,Kharif,Arecanut,1254.0,2000.0
1,Andaman and Nicobar Islands,NICOBARS,2000,Kharif,Other Kharif pulses,2.0,1.0
2,Andaman and Nicobar Islands,NICOBARS,2000,Kharif,Rice,102.0,321.0
3,Andaman and Nicobar Islands,NICOBARS,2000,Whole Year,Banana,176.0,641.0
4,Andaman and Nicobar Islands,NICOBARS,2000,Whole Year,Cashewnut,720.0,165.0
...,...,...,...,...,...,...,...
246086,West Bengal,PURULIA,2014,Summer,Rice,306.0,801.0
246087,West Bengal,PURULIA,2014,Summer,Sesamum,627.0,463.0
246088,West Bengal,PURULIA,2014,Whole Year,Sugarcane,324.0,16250.0
246089,West Bengal,PURULIA,2014,Winter,Rice,279151.0,597899.0


In [115]:
# basic info about the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246091 entries, 0 to 246090
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   State_Name     246091 non-null  object 
 1   District_Name  246091 non-null  object 
 2   Crop_Year      246091 non-null  int64  
 3   Season         246091 non-null  object 
 4   Crop           246091 non-null  object 
 5   Area           246091 non-null  float64
 6   Production     242361 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.1+ MB


In [116]:
#first 5 rows of every column
print(data.head())

                    State_Name District_Name  Crop_Year       Season  \
0  Andaman and Nicobar Islands      NICOBARS       2000  Kharif        
1  Andaman and Nicobar Islands      NICOBARS       2000  Kharif        
2  Andaman and Nicobar Islands      NICOBARS       2000  Kharif        
3  Andaman and Nicobar Islands      NICOBARS       2000  Whole Year    
4  Andaman and Nicobar Islands      NICOBARS       2000  Whole Year    

                  Crop    Area  Production  
0             Arecanut  1254.0      2000.0  
1  Other Kharif pulses     2.0         1.0  
2                 Rice   102.0       321.0  
3               Banana   176.0       641.0  
4            Cashewnut   720.0       165.0  


In [117]:
data.describe()

Unnamed: 0,Crop_Year,Area,Production
count,246091.0,246091.0,242361.0
mean,2005.643018,12002.82,582503.4
std,4.952164,50523.4,17065810.0
min,1997.0,0.04,0.0
25%,2002.0,80.0,88.0
50%,2006.0,582.0,729.0
75%,2010.0,4392.0,7023.0
max,2015.0,8580100.0,1250800000.0


### Missing Values

In [118]:
#checking missing values 
data.isnull().sum()

State_Name          0
District_Name       0
Crop_Year           0
Season              0
Crop                0
Area                0
Production       3730
dtype: int64
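
The raw counts above are easier to judge as shares of the column: 3,730 missing values out of 246,091 rows is roughly 1.5% of `Production`. A minimal sketch of that calculation, using a small made-up frame (`toy` and its values are illustrative assumptions, not rows from `crop_production.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crop data (values are illustrative only)
toy = pd.DataFrame({
    "Area": [10.0, 20.0, 30.0, 40.0],
    "Production": [5.0, np.nan, 15.0, np.nan],
})

# isnull() yields booleans, so the column mean is the fraction of missing entries
missing_share = toy.isnull().mean()
print(missing_share)
```

Running `data.isnull().mean()` on the real frame would give the same kind of per-column fraction.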

The missing `Production` values will be filled with the column median, which is less sensitive to outliers than the mean.

In [119]:
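# fill missing values in the Production column with the median
# (assigning the result back avoids the chained-assignment pitfall of inplace=True on a column)
data['Production'] = data['Production'].fillna(data['Production'].median())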

#inspect
data.isnull().sum()

State_Name       0
District_Name    0
Crop_Year        0
Season           0
Crop             0
Area             0
Production       0
dtype: int64
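
To illustrate why the median is a reasonable fill value for a skewed column (the summary statistics report a `Production` mean of about 582,503 against a median of 729), here is a small sketch on made-up numbers. The series `toy` is a hypothetical example, not data from the project:

```python
import numpy as np
import pandas as pd

# Four observed values, one of them an extreme outlier, plus one missing entry
toy = pd.Series([1.0, 2.0, 3.0, np.nan, 1000.0], name="Production")

mean_fill = toy.mean()      # pulled far upward by the 1000.0 outlier
median_fill = toy.median()  # stays close to the bulk of the values

# Fill the gap with the median and assign the result to a new series
filled = toy.fillna(median_fill)

print(f"mean: {mean_fill}, median: {median_fill}")
print(filled)
```

Both `mean()` and `median()` skip missing values by default, so the fill value is computed from the observed entries only.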

### Split the data into Train and Test


I will create another column called 'Yield_Production' based on the 'Production' column: rows whose 'Production' is above the 45th percentile are labelled 'HIGH', and all other rows are labelled 'LOW'.

In [120]:
# calculate the 45th percentile threshold of the Production column
threshold = data['Production'].quantile(0.45)

#create the Yield_Production column
data['Yield_Production'] = data['Production'].apply(lambda x: 'HIGH' if x > threshold else 'LOW')

#inspect
data.head()

Unnamed: 0,State_Name,District_Name,Crop_Year,Season,Crop,Area,Production,Yield_Production
0,Andaman and Nicobar Islands,NICOBARS,2000,Kharif,Arecanut,1254.0,2000.0,HIGH
1,Andaman and Nicobar Islands,NICOBARS,2000,Kharif,Other Kharif pulses,2.0,1.0,LOW
2,Andaman and Nicobar Islands,NICOBARS,2000,Kharif,Rice,102.0,321.0,LOW
3,Andaman and Nicobar Islands,NICOBARS,2000,Whole Year,Banana,176.0,641.0,HIGH
4,Andaman and Nicobar Islands,NICOBARS,2000,Whole Year,Cashewnut,720.0,165.0,LOW


In [121]:
#checking the distribution of Yield_Production
data['Yield_Production'].value_counts()

Yield_Production
HIGH    135350
LOW     110741
Name: count, dtype: int64
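
The split above is close to 55% HIGH and 45% LOW, which is what a 45th-percentile cut-off should produce. The sketch below reproduces the idea on a made-up column of the values 1 to 100 and uses `np.where` as a vectorised alternative to `apply` with a lambda. The frame `toy` is an assumption for illustration only:

```python
import numpy as np
import pandas as pd

# Toy Production column holding the values 1.0 .. 100.0
toy = pd.DataFrame({"Production": np.arange(1, 101, dtype=float)})

# 45th percentile of the column (linear interpolation by default)
threshold = toy["Production"].quantile(0.45)

# Vectorised labelling: strictly above the threshold is HIGH, otherwise LOW
toy["Yield_Production"] = np.where(toy["Production"] > threshold, "HIGH", "LOW")

# Share of each label
shares = toy["Yield_Production"].value_counts(normalize=True)
print(threshold)
print(shares)
```

On real data, ties at the threshold value can shift the shares slightly away from an exact 55/45 split.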

In [122]:
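# target (y) and features (X)
# note: X still contains 'Production', from which the label is derived;
# it should be dropped before fitting a model to avoid target leakage
y = data['Yield_Production']
X = data.drop(columns=['Yield_Production'])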

#split the data into training and testing sets (80% train and 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) 

#stratify=y ensures that the class distribution (High/Low) is maintained in both the training and testing sets.

#inspect the shapes of the resulting datasets
print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape: {y_test.shape}')

X_train shape: (196872, 7)
X_test shape: (49219, 7)
y_train shape: (196872,)
y_test shape: (49219,)
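
One way to confirm that `stratify` preserves the class balance is to compare label shares in the two splits. The sketch below does this on a made-up frame with 55 HIGH and 45 LOW rows. The names `toy`, `X_toy`, and `y_toy` are hypothetical and the data is synthetic:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 55 HIGH rows and 45 LOW rows with a single numeric feature
toy = pd.DataFrame({
    "Area": range(100),
    "Yield_Production": ["HIGH"] * 55 + ["LOW"] * 45,
})
X_toy = toy[["Area"]]
y_toy = toy["Yield_Production"]

# Same settings as the split above: 20% test, fixed seed, stratified on the label
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Label shares in each split should both be close to 0.55 / 0.45
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```

The same check applied to `y_train` and `y_test` from the cell above would show whether the HIGH/LOW proportions match the full dataset.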
