## Business Understanding

### Business Context

Agriculture remains one of the world's most vital economic sectors, providing food, employment, and raw materials for millions of people. Globally, agriculture contributes about 4% to the world's GDP and employs nearly 26% of the global workforce, according to the World Bank (2024). Despite this importance, crop yields are increasingly affected by climate change, soil degradation, and fluctuating input costs, posing major risks to global food security.

The need to accurately predict crop yields has therefore become critical for decision-making in agribusiness, policymaking, and food supply management. Using statistical and machine learning methods, stakeholders can forecast yields based on factors such as temperature, rainfall, soil quality, and fertilizer use. These insights help optimize resource allocation, reduce financial losses, and enhance productivity across different regions and crop types.

### Business Problem

Unpredictable weather patterns, soil degradation, and rising input costs make global crop yield forecasting unreliable. This project aims to develop a data-driven model that predicts yields of the 10 most consumed crops worldwide, in order to support food security planning and agribusiness decision-making.

## Data Understanding

In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import warnings
 

print("LOADING ALL DATASETS...")
print("="*50)

# Load all datasets
yield_df = pd.read_csv('data/yield_df.csv')
yield_data = pd.read_csv('data/yield.csv')
pesticides = pd.read_csv('data/pesticides.csv')
rainfall = pd.read_csv('data/rainfall.csv')
temperature = pd.read_csv('data/temp.csv')

print("📊 DATASET OVERVIEW:")
datasets = {
    'yield_df.csv': yield_df,
    'yield.csv': yield_data,
    'pesticides.csv': pesticides,
    'rainfall.csv': rainfall,
    'temp.csv': temperature
}

for name, df in datasets.items():
    print(f"\n{name}:")
    print(f"  Shape: {df.shape}")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Missing values: {df.isnull().sum().sum()}")

LOADING ALL DATASETS...
==================================================
📊 DATASET OVERVIEW:

yield_df.csv:
  Shape: (28242, 8)
  Columns: ['Unnamed: 0', 'Area', 'Item', 'Year', 'hg/ha_yield', 'average_rain_fall_mm_per_year', 'pesticides_tonnes', 'avg_temp']
  Missing values: 0

yield.csv:
  Shape: (56717, 12)
  Columns: ['Domain Code', 'Domain', 'Area Code', 'Area', 'Element Code', 'Element', 'Item Code', 'Item', 'Year Code', 'Year', 'Unit', 'Value']
  Missing values: 0

pesticides.csv:
  Shape: (4349, 7)
  Columns: ['Domain', 'Area', 'Element', 'Item', 'Year', 'Unit', 'Value']
  Missing values: 0

rainfall.csv:
  Shape: (6727, 3)
  Columns: [' Area', 'Year', 'average_rain_fall_mm_per_year']
  Missing values: 774

temp.csv:
  Shape: (71311, 3)
  Columns: ['year', 'country', 'avg_temp']
  Missing values: 2547
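
The overview above shows two things worth handling before the tables are combined: the key columns are named inconsistently (`' Area'` with a leading space in `rainfall.csv`, and lowercase `year` / `country` in `temp.csv`), and the missing values are reported only as a single total per file. The following is a minimal sketch of one way to address both, not necessarily the approach taken later in this notebook. It uses small made-up frames that only mimic the printed headers, and `normalize_columns` is a hypothetical helper introduced here for illustration.

```python
import pandas as pd

# Toy frames that mimic the headers printed above; the values are made up.
rainfall_demo = pd.DataFrame({
    " Area": ["Kenya", "Kenya", "India"],   # note the leading space, as in rainfall.csv
    "Year": [1990, 1991, 1990],
    "average_rain_fall_mm_per_year": [630.0, None, 1083.0],
})
temp_demo = pd.DataFrame({
    "year": [1990, 1991, 1990],
    "country": ["Kenya", "Kenya", "India"],
    "avg_temp": [24.1, None, 25.3],
})


def normalize_columns(df, renames=None):
    """Strip whitespace from headers and apply optional renames (hypothetical helper)."""
    out = df.copy()
    out.columns = out.columns.str.strip()
    return out.rename(columns=renames or {})


rainfall_clean = normalize_columns(rainfall_demo)
temp_clean = normalize_columns(temp_demo, {"year": "Year", "country": "Area"})

# Per-column missing counts show which columns carry the gaps,
# rather than one total for the whole file.
print(rainfall_clean.isnull().sum())
print(temp_clean.isnull().sum())

# With consistent key names, the two tables can be joined on Area and Year.
merged = rainfall_clean.merge(temp_clean, on=["Area", "Year"], how="inner")
print(merged.columns.tolist())
```

Applied to the real data, the same helper could be called on `rainfall` and `temperature` as loaded above. Whether rows with missing rainfall or temperature should then be dropped or imputed is a separate modelling decision that this sketch does not make.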
