# Crop Yield Prediction

## Introduction
This notebook explores agricultural data as the first step toward predicting crop yield with machine learning models. The goal is to understand which factors influence yield and to build predictive models that help farmers improve productivity.

## 1. Setup and Data Loading

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Set plot style (recent Matplotlib releases no longer accept 'seaborn-whitegrid')
sns.set_style('whitegrid')
sns.set_palette('viridis')

# Display settings
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Load the dataset
df = pd.read_csv('../data/crop_yield.csv')

# Display the first few rows
print("First 5 rows of the dataset:")
df.head()

## 2. Dataset Information

In [None]:
# Display general information about the dataset
print("Dataset shape (rows, columns):", df.shape)
print("\nColumn names:")
for col in df.columns:
    print(f"- {col}")

print("\nData types:")
df.info()

In [None]:
# Display basic statistical summary of numerical columns
print("Statistical summary of numerical columns:")
df.describe()

In [None]:
# Count and list the unique values in the categorical 'Crop' column
print("Number of unique crop types:", df['Crop'].nunique())
print("\nList of unique crops:")
df['Crop'].unique()

## 3. Missing Values Analysis

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

# Create a DataFrame to display missing values information
missing_info = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage (%)': missing_percentage
})

print("Missing values per column:")
missing_info

In [None]:
# Visualize missing values if any
if missing_values.sum() > 0:
    plt.figure(figsize=(10, 6))
    sns.heatmap(df.isnull(), cmap='viridis', cbar=False, yticklabels=False)
    plt.title('Missing Values Heatmap')
    plt.tight_layout()
    plt.show()
else:
    print("No missing values found in the dataset.")
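If the check above does report missing values, one common way to handle them is simple imputation. The sketch below is only an illustration, not a step this notebook prescribes: the helper `impute_missing` and the toy column names are hypothetical, and median/mode filling is an assumed strategy that may not suit every column.

```python
import numpy as np
import pandas as pd


def impute_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where numeric NaNs become the column median and
    non-numeric NaNs become the column's most frequent value."""
    result = frame.copy()
    for col in result.columns:
        if not result[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(result[col]):
            result[col] = result[col].fillna(result[col].median())
        else:
            modes = result[col].mode(dropna=True)
            if not modes.empty:
                result[col] = result[col].fillna(modes.iloc[0])
    return result


# Hypothetical toy data; the real dataset's columns may differ
demo = pd.DataFrame({
    'Crop': ['Rice', 'Rice', None, 'Wheat'],
    'Temperature': [25.0, np.nan, 30.0, 20.0],
})
demo_filled = impute_missing(demo)
print(demo_filled)
```

The helper works on a copy, so the original frame keeps its gaps and the effect of imputation can be compared before and after.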

## 4. Data Distribution

In [None]:
# Distribution of the target variable (Yield)
plt.figure(figsize=(10, 6))
sns.histplot(df['Yield'], kde=True)
plt.title('Distribution of Crop Yield')
plt.xlabel('Yield')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Distribution of numerical features
numerical_cols = df.select_dtypes(include='number').columns.tolist()
if 'Yield' in numerical_cols:
    numerical_cols.remove('Yield')  # Remove target variable

# Size the grid to the feature count; a fixed 2x2 grid fails with more than four features
n_rows = max(1, int(np.ceil(len(numerical_cols) / 2)))
plt.figure(figsize=(15, 5 * n_rows))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(n_rows, 2, i)
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
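# Box plots for numerical features
# Size the grid to the feature count; a fixed 2x2 grid fails with more than four features
n_rows = max(1, int(np.ceil(len(numerical_cols) / 2)))
plt.figure(figsize=(15, 5 * n_rows))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(n_rows, 2, i)
    sns.boxplot(y=df[col])
    plt.title(f'Box Plot of {col}')
    plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()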
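Box plots show outliers visually, but a count per column is easier to record in the findings. The sketch below applies the conventional 1.5 × IQR rule; the helper `count_iqr_outliers` and the toy column names are hypothetical, and the quartiles pandas computes may differ slightly from the ones the plotting library uses for its whiskers.

```python
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = frame.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame's columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Hypothetical toy data with one obvious extreme value
demo = pd.DataFrame({
    'Precipitation': [10.0, 12.0, 11.0, 13.0, 12.0, 100.0],
    'Humidity': [50.0, 52.0, 51.0, 49.0, 50.0, 51.0],
})
outlier_counts = count_iqr_outliers(demo)
print(outlier_counts)
```

On the real data this would be called as `count_iqr_outliers(df)`; a flagged value is only a candidate for review, not necessarily an error.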

In [None]:
# Yield distribution by crop type
plt.figure(figsize=(14, 8))
sns.boxplot(x='Crop', y='Yield', data=df)
plt.title('Yield Distribution by Crop Type')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 5. Correlation Analysis

In [None]:
# Calculate correlation matrix for numerical features
numerical_df = df.select_dtypes(include='number')  # covers every numeric dtype, not only 64-bit
correlation_matrix = numerical_df.corr()

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.show()

In [None]:
# Correlation with target variable (Yield)
target_correlation = correlation_matrix['Yield'].sort_values(ascending=False)
print("Correlation with Yield (target variable):")
target_correlation
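`DataFrame.corr()` defaults to the Pearson coefficient, which captures only linear association. As a possible cross-check, the sketch below places Pearson and Spearman (rank-based) coefficients side by side; a large gap between them hints at a monotonic but non-linear relationship. The helper `compare_correlations` and the toy data are hypothetical.

```python
import pandas as pd


def compare_correlations(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Pearson and Spearman correlation of each numeric feature with the target."""
    numeric = frame.select_dtypes(include='number')
    pearson = numeric.corr(method='pearson')[target].drop(target)
    spearman = numeric.corr(method='spearman')[target].drop(target)
    return pd.DataFrame({'pearson': pearson, 'spearman': spearman})


# Hypothetical toy data: the target grows as the cube of the feature
demo = pd.DataFrame({
    'Temperature': [1.0, 2.0, 3.0, 4.0, 5.0],
    'Yield': [1.0, 8.0, 27.0, 64.0, 125.0],
})
comparison = compare_correlations(demo, 'Yield')
print(comparison)
```

In the toy data the relationship is perfectly monotonic, so the Spearman coefficient reaches 1 while the Pearson coefficient stays below it.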

In [None]:
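# Scatter plots of features vs. target
# Size the grid to the feature count; a fixed 2x2 grid fails with more than four features
n_rows = max(1, int(np.ceil(len(numerical_cols) / 2)))
plt.figure(figsize=(15, 5 * n_rows))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(n_rows, 2, i)
    sns.scatterplot(x=df[col], y=df['Yield'], alpha=0.6)
    plt.title(f'{col} vs. Yield')
    plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()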

## 6. Summary of Findings

Based on the exploratory data analysis, we can summarize the following key findings:

1. **Dataset Overview**: The dataset contains information about different crops and their yields, along with environmental factors like precipitation, humidity, and temperature.

2. **Missing Values**: [To be filled after running the analysis]

3. **Data Distribution**: [To be filled after running the analysis]

4. **Correlation Analysis**: [To be filled after running the analysis]

5. **Next Steps**: In the following sections, we will perform clustering analysis to identify patterns in crop productivity and develop predictive models to forecast crop yields based on environmental factors.
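
As a rough preview of the next steps listed above, the sketch below runs a clustering step and a baseline regression on synthetic stand-in data. Everything in it is an assumption made for illustration: the column names, the synthetic relationship between features and yield, the choice of three clusters, and the use of k-means and linear regression rather than whatever models the later analysis adopts.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names and coefficients are assumptions
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    'Precipitation': rng.uniform(500, 2500, n),
    'Humidity': rng.uniform(40, 90, n),
    'Temperature': rng.uniform(15, 35, n),
})
demo['Yield'] = (0.002 * demo['Precipitation'] + 0.05 * demo['Humidity']
                 - 0.1 * demo['Temperature'] + rng.normal(0, 0.2, n))

features = ['Precipitation', 'Humidity', 'Temperature']
X_train, X_test, y_train, y_test = train_test_split(
    demo[features], demo['Yield'], test_size=0.2, random_state=42)

# Fit the scaler on the training split only, so test statistics do not leak in
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Clustering: group training observations by environmental profile
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_train_scaled)

# Baseline regression evaluated on the held-out split
model = LinearRegression().fit(X_train_scaled, y_train)
r2 = r2_score(y_test, model.predict(X_test_scaled))
print(f"Clusters found: {np.unique(cluster_labels).size}, test R^2: {r2:.3f}")
```

Because the synthetic target is linear in the features by construction, the score here says nothing about how well a linear model would fit the real crop data.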