# Crop Yield Prediction

This notebook builds a machine learning model that predicts crop yield from location, crop type, year, rainfall, pesticide use, and average temperature.

## 1. Setup

First, let's install the necessary libraries.

In [1]:
!pip install pandas scikit-learn


## 2. Data Loading and Exploratory Data Analysis (EDA)

Next, we load the dataset and inspect its first rows, column types, summary statistics, and missing values to understand its structure.

In [2]:
import pandas as pd

# Load the dataset
file_path = "data/yield_df.csv"
df = pd.read_csv(file_path)

# Display the first few rows
print("First 5 rows of the dataset:")
print(df.head())

# Display dataset information
print("\nDataset Info:")
df.info()

# Display summary statistics
print("\nSummary Statistics:")
print(df.describe())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

First 5 rows of the dataset:
   Unnamed: 0     Area         Item  Year  hg/ha_yield  \
0           0  Albania        Maize  1990        36613   
1           1  Albania     Potatoes  1990        66667   
2           2  Albania  Rice, paddy  1990        23333   
3           3  Albania      Sorghum  1990        12500   
4           4  Albania     Soybeans  1990         7000   

   average_rain_fall_mm_per_year  pesticides_tonnes  avg_temp  
0                         1485.0              121.0     16.37  
1                         1485.0              121.0     16.37  
2                         1485.0              121.0     16.37  
3                         1485.0              121.0     16.37  
4                         1485.0              121.0     16.37  

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28242 entries, 0 to 28241
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  ---
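Two further checks often go alongside the ones above: counting distinct values in the categorical columns (which determines how many one-hot columns are created later) and looking for duplicate rows. The sketch below is not part of the original notebook; it rebuilds only the five rows printed by df.head() into a small frame named sample (a made-up name) so that it runs on its own. On the full dataset, the same calls would be applied to df.

```python
import pandas as pd

# Illustrative frame built from the five rows shown in the head() output above.
sample = pd.DataFrame({
    "Area": ["Albania"] * 5,
    "Item": ["Maize", "Potatoes", "Rice, paddy", "Sorghum", "Soybeans"],
    "Year": [1990] * 5,
    "hg/ha_yield": [36613, 66667, 23333, 12500, 7000],
})

# Number of distinct categories per categorical column.
cardinality = sample[["Area", "Item"]].nunique()

# Number of fully duplicated rows.
duplicates = int(sample.duplicated().sum())

# Mean yield per crop, highest first.
mean_yield_by_item = (
    sample.groupby("Item")["hg/ha_yield"].mean().sort_values(ascending=False)
)

print(cardinality)
print("Duplicate rows:", duplicates)
print(mean_yield_by_item)
```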

## 3. Preprocessing and Feature Engineering

Next, we preprocess the data for model training: we drop the leftover index column 'Unnamed: 0' and convert the categorical features 'Area' and 'Item' into numeric indicator columns with one-hot encoding.

In [3]:
# Drop the leftover index column 'Unnamed: 0' (ignored if it is not present)
df = df.drop(columns=['Unnamed: 0'], errors='ignore')

# One-hot encode categorical features
df_processed = pd.get_dummies(df, columns=['Area', 'Item'], drop_first=True)

print("Shape of the processed dataframe:", df_processed.shape)
print("First 5 rows of the processed dataframe:")
print(df_processed.head())

Shape of the processed dataframe: (28242, 114)
First 5 rows of the processed dataframe:
   Year  hg/ha_yield  average_rain_fall_mm_per_year  pesticides_tonnes  \
0  1990        36613                         1485.0              121.0   
1  1990        66667                         1485.0              121.0   
2  1990        23333                         1485.0              121.0   
3  1990        12500                         1485.0              121.0   
4  1990         7000                         1485.0              121.0   

   avg_temp  Area_Algeria  Area_Angola  Area_Argentina  Area_Armenia  \
0     16.37         False        False           False         False   
1     16.37         False        False           False         False   
2     16.37         False        False           False         False   
3     16.37         False        False           False         False   
4     16.37         False        False           False         False   

   Area_Australia  ...  Area_Zimba
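To make the effect of drop_first=True concrete, here is a minimal sketch on a toy frame. The frame toy and its temperature values for Algeria and Angola are invented for illustration. With drop_first=True, a column with k categories yields k - 1 indicator columns, and the first category in sorted order becomes the implicit reference level. That is why the output above has no Area_Albania column: Albania rows are encoded as False in every Area_ column.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Area": ["Albania", "Albania", "Algeria", "Angola"],
    "Item": ["Maize", "Potatoes", "Maize", "Potatoes"],
    "avg_temp": [16.37, 16.37, 17.5, 24.1],
})

# Same call as in the notebook cell, applied to the toy frame.
encoded = pd.get_dummies(toy, columns=["Area", "Item"], drop_first=True)

# Non-encoded columns come first, followed by the indicator columns.
print(encoded.columns.tolist())
# ['avg_temp', 'Area_Algeria', 'Area_Angola', 'Item_Potatoes']
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters for an unregularized linear model that also fits an intercept.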

## 4. Model Training and Evaluation

Now we split the data into training (80%) and test (20%) sets, train a Linear Regression model, and evaluate it on the test set using R-squared and mean absolute error (MAE).

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# Split data into features (X) and target (y)
X = df_processed.drop('hg/ha_yield', axis=1)
y = df_processed['hg/ha_yield']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Training
print("Training the model...")
model = LinearRegression()
model.fit(X_train, y_train)
print("Model training complete.")

# Model Evaluation
print("\nEvaluating the model...")
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"R-squared: {r2:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")

Training the model...
Model training complete.

Evaluating the model...
R-squared: 0.7551
Mean Absolute Error: 29582.4950
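
To put these numbers in context: an R-squared of 0.7551 means the model explains about 76% of the variance in yield on the test set. The target is measured in hectograms per hectare (hg/ha), and 1 tonne = 10,000 hg, so an MAE of 29,582 hg/ha is roughly 2.96 tonnes per hectare. A common sanity check, which is not part of the original notebook, is to compare the model with a baseline that always predicts the training mean. The sketch below does this on synthetic data; the helper name compare_to_mean_baseline, the constant HG_PER_TONNE, and the demo arrays are all assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

HG_PER_TONNE = 10_000  # 1 tonne = 1,000 kg = 10,000 hectograms


def compare_to_mean_baseline(X_train, X_test, y_train, y_test):
    """Hypothetical helper: score a linear model and a predict-the-mean baseline."""
    results = {}
    for name, estimator in [
        ("baseline", DummyRegressor(strategy="mean")),
        ("linear", LinearRegression()),
    ]:
        estimator.fit(X_train, y_train)
        pred = estimator.predict(X_test)
        mae = mean_absolute_error(y_test, pred)
        results[name] = {
            "r2": r2_score(y_test, pred),
            "mae_hg_per_ha": mae,
            "mae_t_per_ha": mae / HG_PER_TONNE,
        }
    return results


# Synthetic stand-in data (not the crop dataset): a linear signal plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
true_coefs = np.array([8_000.0, -3_000.0, 1_500.0])
y_demo = 50_000 + X_demo @ true_coefs + rng.normal(scale=2_000, size=500)

# train_test_split returns X_train, X_test, y_train, y_test in that order.
results = compare_to_mean_baseline(
    *train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
)
for name, scores in results.items():
    print(name, {key: round(value, 4) for key, value in scores.items()})
```

In the notebook itself, the same helper could be called as compare_to_mean_baseline(X_train, X_test, y_train, y_test). If the linear model's MAE is not clearly below the baseline's, the features add little beyond the average yield.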
