# **Yield Estimation**
> **Crop yield** is a standard measurement of the amount of agricultural production harvested per unit of land area. Why is it important?
+ It provides valuable information for planning, resource management, and informed crop production decisions. It also helps improve food security, increase the efficiency of food production, and reduce food waste.

> 💡 **Insight**

> Although it is an agriculture-specific task, crop yield estimation maps to a well-defined concept in Artificial Intelligence: **regression**. This exercise involves training a model to predict a _continuous variable_, `Production`, which makes it a **regression** task.

### **The dataset**
> **Citation:** The dataset used in this exercise contains agricultural data for multiple crops cultivated across various states in **India** (_since I couldn't find any open-source datasets for Kenya_), covering the years **1997** to **2020**.
+ The dataset provides key features for crop yield prediction, and can be found on the [`Kaggle`](https://www.kaggle.com/) website [`here`](https://www.kaggle.com/datasets/akshatgupta7/crop-yield-in-indian-states-dataset).

> The dataset contains the following columns:
+ `Crop` - The name of the crop cultivated
+ `Crop_Year` - The year in which the crop was grown
+ `Season` - The specific cropping season (_Kharif, Rabi, Whole Year, Summer, Winter or Autumn_)
+ `State` - The Indian state where the crop was cultivated
+ `Area` - The total land area under cultivation for the specific crop (_ha_)
+ `Production` - The quantity of crop produced (_Tons_)
+ `Annual_Rainfall` - The annual rainfall received in the crop-growing region (_mm_)
+ `Fertilizer` - The total amount of fertilizer used for the crop (_kg_)
+ `Pesticide` - The total amount of pesticide used for the crop (_kg_)
+ `Yield` - The calculated crop yield (production per unit area) _[Tons/ha]_

> 📝 **Note**  

> The `Production` column is the _target / label_ _(y value)_ we want to train the model to predict. The remaining columns are known as _features_ _(x values)_ in Machine Learning lingo.  

> Also, some rows and columns will be dropped, so not all of them will be used in model training. In particular:
+ `Crop_Year` - Since I want to predict crop yield irrespective of the year the crop was cultivated
+ `Yield` - Since this is a derived quantity that can simply be calculated after inference as `Production / Area`
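
> The column handling described above can be sketched with pandas as follows. The rows are made up for illustration, and the names `features`, `target`, `predicted_production` and `derived_yield` are assumptions rather than names used later in this exercise. The "prediction" is just a stand-in that reuses the true values, so that the `Production / Area` step can be shown without a trained model.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the list above.
df = pd.DataFrame({
    "Crop": ["Rice", "Maize"],
    "Crop_Year": [1997, 2020],
    "Season": ["Kharif", "Rabi"],
    "State": ["Assam", "Punjab"],
    "Area": [100.0, 50.0],
    "Production": [250.0, 200.0],
    "Annual_Rainfall": [2000.0, 600.0],
    "Fertilizer": [9000.0, 5000.0],
    "Pesticide": [30.0, 15.0],
    "Yield": [2.5, 4.0],
})

# Drop the two excluded columns, and separate the target from the features.
features = df.drop(columns=["Crop_Year", "Yield", "Production"])
target = df["Production"]

# Stand-in for model output: here the true values are simply reused.
predicted_production = target.to_numpy()

# Yield is recovered after inference as Production / Area (tons per hectare).
derived_yield = predicted_production / df["Area"].to_numpy()
print(derived_yield)
```

> Because `Yield` is fully determined by `Production` and `Area`, leaving it in the feature set would leak the target into the inputs, which is another reason to drop it before training.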

### **Library Versions**
> For the sake of future reference and comparison, I'll print the versions of all the libraries I used for this exercise

In [None]:
# install optuna & xgboost
!pip install optuna xgboost

In [None]:
# import libraries
import optuna, xgboost, matplotlib, pandas, seaborn, sys, sklearn

In [None]:
print('System version: ',sys.version)
print('optuna version:', optuna.__version__)
print('xgboost version: ', xgboost.__version__)
print('pandas version: ', pandas.__version__)
print('seaborn version: ', seaborn.__version__)
print('matplotlib version: ', matplotlib.__version__)
print('scikit-learn version', sklearn.__version__)

System version:  3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
optuna version: 4.0.0
xgboost version:  2.1.1
pandas version:  2.1.4
seaborn version:  0.13.1
matplotlib version:  3.7.1
scikit-learn version 1.3.2


> ▶️ **Up Next**  

> Some _thorough_ Exploratory Data Analysis (EDA)