# Crop Yield Prediction

This notebook builds a machine learning model that predicts crop yield from factors such as location, weather, and agricultural inputs.

## 1. Setup

First, let's install the necessary libraries.

In [None]:
!pip install pandas scikit-learn

## 2. Data Loading and Exploratory Data Analysis (EDA)

Now, we'll load the dataset and perform some initial analysis to understand its structure and properties.

In [None]:
import pandas as pd

# Load the dataset
file_path = "data/yield_df.csv"
df = pd.read_csv(file_path)

# Display the first few rows
print("First 5 rows of the dataset:")
print(df.head())

# Display dataset information
print("\nDataset Info:")
df.info()

# Display summary statistics
print("\nSummary Statistics:")
print(df.describe())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

## 3. Preprocessing and Feature Engineering

Next, we'll preprocess the data to prepare it for model training. This includes dropping the leftover index column (`Unnamed: 0`), if present, and converting the categorical features `Area` and `Item` into a numerical format using one-hot encoding.

In [None]:
# Drop the 'Unnamed: 0' column if it exists
if 'Unnamed: 0' in df.columns:
    df = df.drop('Unnamed: 0', axis=1)

# One-hot encode categorical features
df_processed = pd.get_dummies(df, columns=['Area', 'Item'], drop_first=True)

print("Shape of the processed dataframe:", df_processed.shape)
print("First 5 rows of the processed dataframe:")
print(df_processed.head())
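
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` with `drop_first=True` does on a tiny frame. The column names match the dataset used above, but the rows are made up for illustration and are not taken from `yield_df.csv`.

```python
import pandas as pd

# Toy rows: values are invented purely to illustrate the encoding.
toy = pd.DataFrame({
    "Area": ["Albania", "Brazil", "Albania"],
    "Item": ["Maize", "Wheat", "Wheat"],
    "hg/ha_yield": [36613, 25000, 30000],
})

# Each categorical column becomes one indicator column per category.
# drop_first=True drops the first category of each column (here
# "Albania" and "Maize"), which then acts as the implicit baseline.
toy_encoded = pd.get_dummies(toy, columns=["Area", "Item"], drop_first=True)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Dropping the first category avoids perfectly collinear indicator columns, which matters for a linear model with an intercept. A row with all indicators set to zero (or `False`) belongs to the baseline category.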

## 4. Model Training and Evaluation

Now we'll split the data into training (80%) and testing (20%) sets, train a linear regression model, and evaluate it with R-squared, mean absolute error, and root mean squared error.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Split data into features (X) and target (y)
X = df_processed.drop('hg/ha_yield', axis=1)
y = df_processed['hg/ha_yield']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Training
print("Training the model...")
model = LinearRegression()
model.fit(X_train, y_train)
print("Model training complete.")

# Model Evaluation
print("\nEvaluating the model...")
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
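# Take the square root explicitly: the `squared` argument is deprecated
# and has been removed in recent scikit-learn releases.
rmse = mean_squared_error(y_test, y_pred) ** 0.5
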
print(f"R-squared: {r2:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
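
To see what the three reported metrics measure, the sketch below computes them by hand with NumPy on a small made-up example. The arrays and the `toy_` names are hypothetical and chosen so they don't overwrite the notebook's own variables. The formulas are the standard definitions, which should agree with the scikit-learn functions used above.

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
toy_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_pred = np.array([2.0, 5.0, 8.0, 11.0])

residuals = toy_true - toy_pred

# MAE: average absolute error, in the same units as the target (hg/ha).
toy_mae = np.mean(np.abs(residuals))

# RMSE: square root of the average squared error; penalizes large misses more.
toy_rmse = np.sqrt(np.mean(residuals ** 2))

# R-squared: 1 minus the ratio of residual variation to total variation
# around the mean of the true values.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((toy_true - toy_true.mean()) ** 2)
toy_r2 = 1 - ss_res / ss_tot

print(f"MAE:  {toy_mae:.4f}")
print(f"RMSE: {toy_rmse:.4f}")
print(f"R2:   {toy_r2:.4f}")
```

Because MAE and RMSE are expressed in the target's units, they are easiest to interpret next to the typical magnitude of `hg/ha_yield`, for example its mean or median from the summary statistics.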