## Business Understanding

### Business Context

Agriculture remains one of the world’s most vital economic sectors, providing food, employment, and raw materials for millions of people. Globally, agriculture contributes about 4% to the world’s GDP and employs nearly 26% of the global workforce, according to the World Bank (2024). Despite this importance, crop yields are increasingly affected by climate change, soil degradation, and fluctuating input costs, posing major risks to global food security.

The need to accurately predict crop yields has therefore become critical for decision-making in agribusiness, policymaking, and food supply management. Using statistical and machine learning methods, stakeholders can forecast yields based on factors such as temperature, rainfall, soil quality, and fertilizer use. These insights help optimize resource allocation, reduce financial losses, and enhance productivity across different regions and crop types.

### Business Problem
Unpredictable weather patterns, soil degradation, and rising input costs make global crop yield forecasting unreliable. This project aims to develop a data-driven model to predict yields of the 10 most consumed crops worldwide, improving food security and agribusiness decision-making.

## Data Understanding

In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import warnings
 

print("LOADING ALL DATASETS...")
print("="*50)

# Load all datasets
yield_df = pd.read_csv('data/yield_df.csv')
yield_data = pd.read_csv('data/yield.csv')
pesticides = pd.read_csv('data/pesticides.csv')
rainfall = pd.read_csv('data/rainfall.csv')
temperature = pd.read_csv('data/temp.csv')

print("📊 DATASET OVERVIEW:")
datasets = {
    'yield_df.csv': yield_df,
    'yield.csv': yield_data,
    'pesticides.csv': pesticides,
    'rainfall.csv': rainfall,
    'temp.csv': temperature
}

for name, df in datasets.items():
    print(f"\n{name}:")
    print(f"  Shape: {df.shape}")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Missing values: {df.isnull().sum().sum()}")

LOADING ALL DATASETS...
📊 DATASET OVERVIEW:

yield_df.csv:
  Shape: (28242, 8)
  Columns: ['Unnamed: 0', 'Area', 'Item', 'Year', 'hg/ha_yield', 'average_rain_fall_mm_per_year', 'pesticides_tonnes', 'avg_temp']
  Missing values: 0

yield.csv:
  Shape: (56717, 12)
  Columns: ['Domain Code', 'Domain', 'Area Code', 'Area', 'Element Code', 'Element', 'Item Code', 'Item', 'Year Code', 'Year', 'Unit', 'Value']
  Missing values: 0

pesticides.csv:
  Shape: (4349, 7)
  Columns: ['Domain', 'Area', 'Element', 'Item', 'Year', 'Unit', 'Value']
  Missing values: 0

rainfall.csv:
  Shape: (6727, 3)
  Columns: [' Area', 'Year', 'average_rain_fall_mm_per_year']
  Missing values: 774

temp.csv:
  Shape: (71311, 3)
  Columns: ['year', 'country', 'avg_temp']
  Missing values: 2547
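
The overview above surfaces a few header issues worth handling before any join: yield_df.csv carries an `Unnamed: 0` column that appears to be a saved row index, the first rainfall column is named `' Area'` with a leading space, and temp.csv uses `year`/`country` where the other files use `Year`/`Area`. The following is a minimal sketch of one way to normalize the headers and see which columns hold the missing values. It runs on small stand-in frames rather than the real files; the helper name `standardize_columns` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def standardize_columns(df, renames=None, drop=None):
    """Return a copy with stripped headers, optional renames, and optional dropped columns."""
    out = df.copy()
    out.columns = out.columns.str.strip()      # ' Area' -> 'Area'
    out = out.rename(columns=renames or {})
    return out.drop(columns=drop or [])

# Stand-in frames that mimic the headers printed above (values are illustrative).
yield_demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Area": ["Albania", "Albania"],
    "Year": [1990, 1990],
})
rainfall_demo = pd.DataFrame({
    " Area": ["Albania", "Albania"],
    "Year": [1990, 1991],
    "average_rain_fall_mm_per_year": [1485.0, np.nan],
})
temperature_demo = pd.DataFrame({
    "year": [1990, 1991],
    "country": ["Albania", "Albania"],
    "avg_temp": [16.37, np.nan],
})

yield_clean = standardize_columns(yield_demo, drop=["Unnamed: 0"])
rainfall_clean = standardize_columns(rainfall_demo)
temperature_clean = standardize_columns(
    temperature_demo, renames={"year": "Year", "country": "Area"}
)

# Per-column missing counts say more than a single grand total per file.
print(rainfall_clean.isnull().sum())
print(temperature_clean.isnull().sum())
```

Because the helper works on a copy, the frames loaded from disk stay untouched, which keeps the overview printed above reproducible.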


The exploration step below looks at each dataset in turn:

- yield_df.csv: shows the first 3 rows, counts unique countries, lists unique crops, and displays the year range.
- yield.csv (loaded as yield_data): shows the first 3 rows and lists unique elements (e.g., yield, production).
- pesticides.csv: shows the first 3 rows.
- rainfall.csv: shows the first 3 rows.
- temp.csv (loaded as temperature): shows the first 3 rows.

## Data Exploration

In [5]:
def explore_each_dataset():
    """Detailed exploration of each dataset"""
    
    print("\n🔍 DETAILED DATASET EXPLORATION")
    print("="*60)
    
    # 1. Yield DataFrame
    print("\n1. YIELD_DF.CSV:")
    print(yield_df.head(3))
    print(f"Unique countries: {yield_df['Area'].nunique()}")
    print(f"Unique crops: {yield_df['Item'].unique()}")
    print(f"Year range: {yield_df['Year'].min()} - {yield_df['Year'].max()}")
    
    # 2. Yield Data
    print("\n2. YIELD.CSV:")
    print(yield_data.head(3))
    print(f"Unique elements: {yield_data['Element'].unique()}")
    
    # 3. Pesticides
    print("\n3. PESTICIDES.CSV:")
    print(pesticides.head(3))
    
    # 4. Rainfall
    print("\n4. RAINFALL.CSV:")
    print(rainfall.head(3))
    
    # 5. Temperature
    print("\n5. TEMPERATURE.CSV:")
    print(temperature.head(3))

explore_each_dataset()


🔍 DETAILED DATASET EXPLORATION

1. YIELD_DF.CSV:
   Unnamed: 0     Area         Item  Year  hg/ha_yield  \
0           0  Albania        Maize  1990        36613   
1           1  Albania     Potatoes  1990        66667   
2           2  Albania  Rice, paddy  1990        23333   

   average_rain_fall_mm_per_year  pesticides_tonnes  avg_temp  
0                         1485.0              121.0     16.37  
1                         1485.0              121.0     16.37  
2                         1485.0              121.0     16.37  
Unique countries: 101
Unique crops: ['Maize' 'Potatoes' 'Rice, paddy' 'Sorghum' 'Soybeans' 'Wheat' 'Cassava'
 'Sweet potatoes' 'Plantains and others' 'Yams']
Year range: 1990 - 2013

2. YIELD.CSV:
  Domain Code Domain  Area Code         Area  Element Code Element  Item Code  \
0          QC  Crops          2  Afghanistan          5419   Yield         56   
1          QC  Crops          2  Afghanistan          5419   Yield         56   
2          QC  Cro

## Data Preparation

In [6]:
def integrate_datasets(yield_df, yield_data, pesticides, rainfall, temperature):
    """
    Integrate all datasets into a single master dataset
    """
    print("🔄 INTEGRATING ALL DATASETS...")
    
    # Start with the main yield dataframe
    master_df = yield_df.copy()
    
    print("1. Checking common keys for integration...")
    
    # Check common columns across datasets
    print(f"yield_df columns: {list(yield_df.columns)}")
    print(f"pesticides columns: {list(pesticides.columns)}")
    print(f"rainfall columns: {list(rainfall.columns)}")
    print(f"temperature columns: {list(temperature.columns)}")
    
    # Check if we need to merge additional data
    # If yield_df already has all columns, we might not need to merge
    if all(col in yield_df.columns for col in ['average_rain_fall_mm_per_year', 'pesticides_tonnes', 'avg_temp']):
        print("✅ yield_df.csv already contains integrated data (rainfall, pesticides, temperature)")
        return master_df
    else:
        print("🔄 Need to merge separate datasets...")
        # Integration logic would go here
        return master_df

# Integrate all datasets
master_df = integrate_datasets(yield_df, yield_data, pesticides, rainfall, temperature)
print(f"✅ Final master dataset shape: {master_df.shape}")

🔄 INTEGRATING ALL DATASETS...
1. Checking common keys for integration...
yield_df columns: ['Unnamed: 0', 'Area', 'Item', 'Year', 'hg/ha_yield', 'average_rain_fall_mm_per_year', 'pesticides_tonnes', 'avg_temp']
pesticides columns: ['Domain', 'Area', 'Element', 'Item', 'Year', 'Unit', 'Value']
rainfall columns: [' Area', 'Year', 'average_rain_fall_mm_per_year']
temperature columns: ['year', 'country', 'avg_temp']
✅ yield_df.csv already contains integrated data (rainfall, pesticides, temperature)
✅ Final master dataset shape: (28242, 8)


- yield_df already contained the columns average_rain_fall_mm_per_year, pesticides_tonnes, and avg_temp, which correspond to the rainfall, pesticides, and temperature datasets respectively, so the function returned it unchanged and no merge was needed.
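
Had yield_df.csv not already carried these columns, the `else` branch of `integrate_datasets` would need actual join logic. The following is a minimal sketch of what that could look like. It assumes that `Area` and `Year` are the join keys and that the pesticide quantity lives in the `Value` column; both are inferences from the printed column lists, not something the notebook confirms. The function name `merge_climate_features` and the sample rows are hypothetical.

```python
import pandas as pd

def merge_climate_features(yields, pesticides, rainfall, temperature):
    """Left-join pesticide, rainfall, and temperature data onto yield rows by Area and Year."""
    keys = ["Area", "Year"]

    # Assumption: 'Value' holds the pesticide quantity per country-year.
    pest = pesticides[keys + ["Value"]].rename(columns={"Value": "pesticides_tonnes"})

    # Strip the stray leading space in the rainfall header (' Area').
    rain = rainfall.rename(columns=lambda c: c.strip())

    # Align temp.csv naming with the other files, then average so that each
    # (Area, Year) pair has one row and the join cannot duplicate yield rows.
    temp = (
        temperature.rename(columns={"year": "Year", "country": "Area"})
        .groupby(keys, as_index=False)["avg_temp"]
        .mean()
    )

    merged = yields
    for extra in (pest, rain, temp):
        merged = merged.merge(extra, on=keys, how="left")
    return merged

# Tiny stand-in frames with the same headers as the real files (values are made up).
yields_demo = pd.DataFrame({
    "Area": ["Albania", "Albania"],
    "Item": ["Maize", "Potatoes"],
    "Year": [1990, 1990],
    "hg/ha_yield": [36613, 66667],
})
pesticides_demo = pd.DataFrame({"Area": ["Albania"], "Year": [1990], "Value": [121.0]})
rainfall_demo = pd.DataFrame({
    " Area": ["Albania"],
    "Year": [1990],
    "average_rain_fall_mm_per_year": [1485.0],
})
temperature_demo = pd.DataFrame({
    "year": [1990, 1990],
    "country": ["Albania", "Albania"],
    "avg_temp": [16.0, 17.0],
})

merged_demo = merge_climate_features(
    yields_demo, pesticides_demo, rainfall_demo, temperature_demo
)
print(merged_demo)
```

A left join keeps every yield row even when a country-year has no matching climate or pesticide record, so gaps show up as missing values rather than silently dropped rows.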