## 🌾 **A Data-Driven Investigation into Climate, Pesticides, and Crop Yields**

### **Business Understanding**

Agriculture is the backbone of global food security, and accurate prediction of crop yield is essential for ensuring sustainable production and informed policy-making. With increasing pressure from climate change, evolving pesticide usage patterns, and the need to optimize inputs for better outcomes, stakeholders—from farmers to governments—require data-driven insights to make proactive decisions.

This project focuses on analyzing historical data related to climate conditions, pesticide usage, and crop yields for the 10 most consumed crops globally. The goal is to uncover patterns, trends, and statistically significant relationships that influence crop productivity. By developing a deep understanding of these interdependencies, this investigation aims to support strategies that enhance agricultural efficiency, minimize environmental impact, and reduce risk in food production systems.

## 📌 Project Plan

### 🧾 **Overview / Background**

With the growing global population and increasing threats from climate change, the demand for accurate, data-driven insights in agriculture has never been higher. This project explores the relationship between climate conditions, pesticide usage, and crop yields using historical data from the 10 most consumed crops worldwide. By understanding how these factors interact, the project aims to inform strategies for improving productivity, sustainability, and resilience in agricultural systems.

### ⚠️ **Challenges**

- 🌦 **Climate Variability**: Unpredictable weather patterns can obscure yield trends and complicate analysis.
- 🧪 **Data Quality**: Issues like missing values, inconsistent units, and outliers may affect data reliability.
- 🌍 **Regional Differences**: Yield determinants may vary significantly across countries, complicating generalized conclusions.
- 📉 **Complex Interactions**: The interplay between pesticides, rainfall, and temperature can be nonlinear and hard to isolate without robust statistical methods.

### 💡 **Proposed Solution**

The project will follow a structured data science pipeline starting with rigorous data cleaning and exploration. It will use descriptive statistics and hypothesis testing to uncover meaningful patterns and relationships between climate factors, pesticides, and yield performance. Insights will be visualized for interpretability and aimed at supporting better decision-making in crop management, policy formulation, and risk mitigation.

### ❗ Problem Statement

Accurately predicting crop yield remains a significant challenge due to the complex and often unpredictable nature of climate variability, inconsistencies in data quality, and the diverse agricultural practices across different regions. The presence of missing values, non-uniform data entries, and outliers further complicates the analysis. Additionally, the intricate interactions between environmental factors such as temperature, rainfall, and pesticide usage make it difficult to isolate and understand their individual and combined effects on crop productivity.

### 🎯 Project Objectives

The primary objective of this project is to conduct a comprehensive, data-driven investigation into the factors affecting crop yield, with a focus on climate variables and pesticide usage. Specific objectives include:

1. **Understand the Dataset**: Explore and describe the structure, variables, and distribution of the crop yield dataset to gain a clear understanding of its scope and limitations.

2. **Assess Data Quality**: Identify and address missing values, inconsistencies, outliers, and other data quality issues to ensure reliable and accurate analysis.

3. **Perform Exploratory Data Analysis (EDA)**: Visualize and summarize the data to detect trends, patterns, and potential relationships between climate factors, pesticide usage, and crop yield.

4. **Apply Computational Statistics**: Use statistical techniques to test hypotheses and measure the strength and significance of associations between variables.

5. **Generate Insightful Visuals**: Create intuitive charts and graphs that clearly communicate key findings to both technical and non-technical stakeholders.

6. **Support Decision-Making**: Derive actionable insights that can inform agricultural policy, resource allocation, and climate-resilient crop management strategies.

7. **Lay the Foundation for Predictive Modeling**: Prepare the dataset and insights in a way that supports the development of future machine learning models for crop yield prediction.
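
As a rough illustration of objective 4, the sketch below runs a Spearman rank correlation test with `scipy.stats`. The data is synthetic and generated only for this example; the column names mirror the dataset, but the values, the seed, and the choice of Spearman over other tests are assumptions rather than the project's confirmed method.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Synthetic stand-in data (illustrative only, NOT the real dataset)
rng = np.random.default_rng(0)
rain = rng.uniform(50, 3000, size=200)
demo = pd.DataFrame({
    "average_rain_fall_mm_per_year": rain,
    # Yield is constructed to rise with rainfall, plus random noise
    "hg/ha_yield": 20 * rain + rng.normal(0, 5000, size=200),
})

# Spearman rank correlation: the null hypothesis is "no monotonic association"
rho, p_value = st.spearmanr(
    demo["average_rain_fall_mm_per_year"], demo["hg/ha_yield"]
)

alpha = 0.05  # assumed significance level
print(f"rho = {rho:.3f}, p-value = {p_value:.3g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Spearman is used here because it does not assume a linear relationship, which fits the nonlinear interactions noted under Challenges; Pearson correlation would be a reasonable alternative if linearity holds.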




### 🧾 **Conclusion**

This project serves as a practical and insightful exploration into how climate factors and pesticide usage impact crop yields, with a focus on data quality, analysis, and statistical reasoning. By leveraging data science techniques in the agritech domain, the investigation aims to generate actionable insights that can support more sustainable and productive agricultural practices. The findings are expected to contribute toward data-informed decisions in food security, environmental management, and policy development.




### 🔍 **Step 2: Data Understanding**

The objective of this phase is to **explore and assess the dataset** to uncover quality issues and extract initial insights that guide further analysis.

#### 📊 **Dataset Overview**
- **Columns Description**: Review each column to understand its meaning, data type, and role in the analysis (e.g., crop type, country, year, yield, rainfall, temperature, pesticide usage).
- **Data Types**: Ensure columns have appropriate formats (e.g., numerical, categorical, datetime).

#### ✅ **Key Checks**
- 🟨 **Missing Values**: Identify and quantify null or NaN entries.
- 📏 **Uniformity**: Check for consistent units, naming conventions, and formats.
- ⚠️ **Outliers**: Detect unusual values that may skew analysis.
- ♻️ **Duplicates**: Look for repeated rows or records.
- 🧹 **Noise/Errors**: Spot typos, impossible values (e.g., negative rainfall), and inconsistent labeling.

This step sets the foundation for effective cleaning, analysis, and modeling by ensuring the **data is reliable, accurate, and ready for deeper investigation**.
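
The key checks above could be sketched as a small helper like the one below. The function name `quality_report` and the toy DataFrame are hypothetical and exist only for illustration; in the notebook the helper would be applied to the loaded dataset instead.

```python
import numpy as np
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Summarise basic data-quality checks (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    return {
        # Count of null/NaN entries per column
        "missing_per_column": df.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Negative values per numeric column (e.g. impossible negative rainfall)
        "negative_values": (numeric < 0).sum().to_dict(),
    }

# Toy data with one duplicate row, one missing value and one negative rainfall
demo = pd.DataFrame({
    "Area": ["Albania", "Albania", "Albania", "Kenya"],
    "Item": ["Maize", "Maize", "Potatoes", "Maize"],
    "average_rain_fall_mm_per_year": [1485.0, 1485.0, np.nan, -5.0],
})

report = quality_report(demo)
print(report)
```

Checks for inconsistent units or labels usually need domain knowledge, so they are left out of this sketch.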


In the cell below, I import the required libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
from load import load_data

In the cell below, I load the data.

In [8]:
# load data

yield_df = pd.read_csv("Data/Agritech/yield_df.csv", index_col=0)
yield_df.head()


|   | Area | Item | Year | hg/ha_yield | average_rain_fall_mm_per_year | pesticides_tonnes | avg_temp |
|---|---|---|---|---|---|---|---|
| 0 | Albania | Maize | 1990 | 36613 | 1485.0 | 121.0 | 16.37 |
| 1 | Albania | Potatoes | 1990 | 66667 | 1485.0 | 121.0 | 16.37 |
| 2 | Albania | Rice, paddy | 1990 | 23333 | 1485.0 | 121.0 | 16.37 |
| 3 | Albania | Sorghum | 1990 | 12500 | 1485.0 | 121.0 | 16.37 |
| 4 | Albania | Soybeans | 1990 | 7000 | 1485.0 | 121.0 | 16.37 |


In the cell below, I check the shape of `yield_df` using `.shape`.

In [9]:
yield_df.shape

(28242, 7)

The output above shows that `yield_df` contains `28242` entries and `7` features.

In the cell below, I call `.info()` to check the metadata summary.

In [10]:
yield_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 28242 entries, 0 to 28241
Data columns (total 7 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Area                           28242 non-null  object 
 1   Item                           28242 non-null  object 
 2   Year                           28242 non-null  int64  
 3   hg/ha_yield                    28242 non-null  int64  
 4   average_rain_fall_mm_per_year  28242 non-null  float64
 5   pesticides_tonnes              28242 non-null  float64
 6   avg_temp                       28242 non-null  float64
dtypes: float64(3), int64(2), object(2)
memory usage: 1.7+ MB


The metadata above shows that `yield_df` has `7` features, all with **no missing values**. Two of the features are of type `object`, while the remaining five are numeric (`int64` or `float64`).

Below, I produce a statistical summary of the numeric columns using `.describe()`.

In [11]:
# statistics summary
yield_df.describe()

|   | Year | hg/ha_yield | average_rain_fall_mm_per_year | pesticides_tonnes | avg_temp |
|---|---|---|---|---|---|
| count | 28242.0 | 28242.0 | 28242.0 | 28242.0 | 28242.0 |
| mean | 2001.544296 | 77053.332094 | 1149.05598 | 37076.909344 | 20.542627 |
| std | 7.051905 | 84956.612897 | 709.81215 | 59958.784665 | 6.312051 |
| min | 1990.0 | 50.0 | 51.0 | 0.04 | 1.3 |
| 25% | 1995.0 | 19919.25 | 593.0 | 1702.0 | 16.7025 |
| 50% | 2001.0 | 38295.0 | 1083.0 | 17529.44 | 21.51 |
| 75% | 2008.0 | 104676.75 | 1668.0 | 48687.88 | 26.0 |
| max | 2013.0 | 501412.0 | 3240.0 | 367778.0 | 30.65 |


## 🧹 **Data Preparation**

This phase focuses on transforming raw data into a clean and analysis-ready format. Preparing the dataset is crucial to ensure that insights drawn later are accurate, meaningful, and unbiased. The quality of the data directly influences the outcome of the analysis and any models built upon it.

Key steps include:

- 🔍 **Checking for Data Inconsistencies**: Identify and resolve issues such as inconsistent naming conventions, duplicated entries, or incorrect data types.

- ❓ **Handling Missing Values**: Detect and appropriately address null or empty values that could skew results.

- ⚠️ **Detecting and Treating Outliers**: Locate unusually high or low values that may distort analysis and handle them accordingly.

- 🧾 **Data Formatting**: Standardize formats across the dataset (e.g., date formats, numerical precision, unit conversions) to ensure consistency and compatibility during analysis.

By the end of this stage, the dataset will be cleaner, more reliable, and ready for effective exploratory data analysis and statistical testing.
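
One common way to approach the outlier step is the interquartile range (IQR) rule, sketched below. The helper name `iqr_outlier_mask` and the `k = 1.5` multiplier are assumptions for illustration; the notebook may use a different method. The sample values are the five yields from the `.head()` output plus the maximum from `.describe()`.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Small illustrative sample of yield values (hg/ha)
demo_yield = pd.Series(
    [36613, 66667, 23333, 12500, 7000, 501412], name="hg/ha_yield"
)

mask = iqr_outlier_mask(demo_yield)
print(demo_yield[mask])  # values flagged as potential outliers
```

Flagged values are only candidates: a very high yield may be genuine for some crops, so it may make more sense to apply the rule per crop (`Item`) than across the whole column.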

In the cell below, I create a copy of the original dataset to avoid tampering with it, and store the copy in the variable `yield_copy`. I then group the columns into `numeric` and `categorical` for easier and more efficient cleaning.
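
A minimal sketch of this step is shown below. It uses a small stand-in DataFrame (`demo_df`, built from the rows shown in the `.head()` output) so that it runs on its own; in the notebook, `yield_df` would take its place. The names `numeric_cols` and `categorical_cols` are assumptions, not necessarily the ones used in the actual cell.

```python
import pandas as pd

# Stand-in for yield_df, using rows from the .head() output
demo_df = pd.DataFrame({
    "Area": ["Albania", "Albania", "Albania"],
    "Item": ["Maize", "Potatoes", "Rice, paddy"],
    "Year": [1990, 1990, 1990],
    "hg/ha_yield": [36613, 66667, 23333],
    "average_rain_fall_mm_per_year": [1485.0, 1485.0, 1485.0],
    "pesticides_tonnes": [121.0, 121.0, 121.0],
    "avg_temp": [16.37, 16.37, 16.37],
})

# Work on a copy so the original DataFrame is left untouched
yield_copy = demo_df.copy()

# Split column names by dtype: numeric vs. everything else (treated as categorical)
numeric_cols = yield_copy.select_dtypes(include="number").columns.tolist()
categorical_cols = yield_copy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

`DataFrame.copy()` makes a deep copy by default, so cleaning steps applied to `yield_copy` do not modify the original data.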
