Project Report: Crop Recommendation Model

Executive Summary

This project developed a machine learning model to recommend suitable crops for farmers based on critical agricultural parameters. The model aims to optimize farming outcomes by matching crops to specific environmental conditions, ultimately reducing losses and increasing production efficiency.

Project Overview

The Crop Recommendation Model was designed to solve a critical problem in agriculture: farmers often invest significant capital and resources without knowledge of which crops would thrive in their specific conditions. By analyzing parameters such as soil nutrients (N, P, K), temperature, humidity, pH levels, and rainfall, the model provides data-driven crop recommendations to maximize yields while minimizing inputs.

Approach and Methodology

Data Collection and Preparation

Utilized a comprehensive dataset containing soil nutrients, climate conditions, and corresponding crop labels
Initial dataset exploration revealed multiple data types (numeric and categorical) requiring different preprocessing approaches
Conducted extensive exploratory data analysis to understand parameter distributions and relationships

Here's how we loaded and initially explored the dataset:
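# Import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset

df = pd.read_csv("Crop_recommendation.csv")

# Inspect the first rows, dimensions, column types, and summary statistics
# (wrapped in print so the output also appears when run as a script)
print(df.head())
print(df.shape)
df.info()  # prints dtypes and non-null counts directly
print(df.describe())
print(df.dtypes)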

# Data Distribution Analysis
# Analysis of Object Column

print("Object Column Analysis:")
print(df['label'].value_counts())
df['label'].value_counts().plot(kind='bar')
plt.show()

Data Preprocessing

Performed thorough data cleaning to handle missing values, placeholders, and anomalies
Encoded categorical variables using One-Hot Encoding for machine readability
Standardized numerical features with StandardScaler so that features with large numeric ranges do not dominate the others
Balanced the dataset using SMOTETomek to address class imbalance issues
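The scaling step above can be illustrated without any library. The sketch below is not the project's code; it only shows the arithmetic that standard scaling performs on one column (subtract the mean, divide by the population standard deviation). The function name `standardize` and the rainfall readings are made up for illustration.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column has no spread; map every value to zero.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


# Hypothetical rainfall readings in mm, for illustration only
rainfall = [50.0, 100.0, 150.0, 200.0]
scaled = standardize(rainfall)
print(scaled)
```

After scaling, the column has mean 0 and standard deviation 1, so a feature measured in hundreds (rainfall) and one measured in single digits (pH) contribute on a comparable scale.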

The data cleaning process included:
# Check for missing values in each column
print(df.isnull().sum())

# Checking NAN, N/A, Unknown Values in the dataset

object_columns = df.select_dtypes(include="object").columns
placeholders = ["NA", "Unknown", "missing", "N/A"]
placeholder_mask = df[object_columns].apply(lambda x: x.isin(placeholders))
print(placeholder_mask.sum())

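# Handling Categorical Values Using One-Hot Encoder

from sklearn.preprocessing import OneHotEncoder

# Categorical feature columns to encode. Assumption: the target column
# 'label' is excluded, since it is predicted rather than used as an input.
# If this list is empty, there is nothing to encode and this step can be skipped.
categorical_cols = [col for col in object_columns if col != "label"]

encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_data = encoder.fit_transform(df[categorical_cols])

encoded_df = pd.DataFrame(
    encoded_data,
    columns=encoder.get_feature_names_out(categorical_cols),
    index=df.index  # keep rows aligned with df for the concat below
)

final_df = pd.concat([df.drop(categorical_cols, axis=1), encoded_df], axis=1)

Feature Engineering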

Created enhanced features to improve model performance:

Nutrient ratios (N/P, N/K, P/K)
Composite NPK scores
pH suitability indicators
Temperature-humidity indices
Growing degree days (GDD) calculations
Cumulative rainfall measurements
Crop growth stage indicators
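Two items in the list above, growing degree days and cumulative rainfall, depend on daily records. The sketch below shows one common way to compute them. It is an illustration under assumptions, not the project's implementation: the daily minimum and maximum temperatures, the 10 °C base temperature, and the function name `growing_degree_days` are all hypothetical.

```python
from itertools import accumulate


def growing_degree_days(t_min, t_max, t_base=10.0):
    """Daily GDD: mean of min and max temperature minus a base, floored at zero."""
    return max(0.0, (t_min + t_max) / 2.0 - t_base)


# Hypothetical daily (min, max) temperatures in degrees C and rainfall in mm
daily_temps = [(12.0, 24.0), (8.0, 10.0), (15.0, 29.0)]
daily_rain = [0.0, 12.5, 3.0]

gdd = [growing_degree_days(t_min, t_max) for t_min, t_max in daily_temps]
cumulative_gdd = list(accumulate(gdd))        # running total of heat units
cumulative_rain = list(accumulate(daily_rain))  # running total of rainfall

print(gdd, cumulative_gdd, cumulative_rain)
```

The base temperature varies by crop, so a real feature would need a per-crop value rather than a single default.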



Feature engineering implementation:

# Nutrient ratios
# Note: a zero in P or K produces inf in the corresponding ratio
df['N_P_ratio'] = df['N'] / df['P']
df['N_K_ratio'] = df['N'] / df['K']
df['P_K_ratio'] = df['P'] / df['K']

# NPK composite score

df['npk_score'] = (df['N'] * 0.4) + (df['P'] * 0.3) + (df['K'] * 0.3)

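# pH suitability
# Note: this indicator reads the target column 'label', so it cannot be
# computed for unlabeled inputs and leaks the target if used as a model feature.

optimal_ph = {
    'wheat': (6.0, 7.0),
    'rice': (5.0, 6.5),
    'corn': (5.8, 6.8),
}

def ph_suitability(row):
    """Return 1 if the row's pH lies in its crop's optimal range, else 0."""
    crop = row['label']
    # Crops missing from optimal_ph fall back to (0, 14) and always score 1
    min_ph, max_ph = optimal_ph.get(crop, (0, 14))
    return 1 if min_ph <= row['ph'] <= max_ph else 0

df['ph_suitability'] = df.apply(ph_suitability, axis=1)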

# Temperature-humidity index

df['temp_humidity_index'] = df['temperature'] * df['humidity'] / 100

Model Development

Implemented a structured approach to model selection by testing multiple algorithms:

Logistic Regression
Decision Tree
Random Forest
Support Vector Machine (SVM)


Split the data into three sets (70% training, 15% validation, 15% testing)
Created robust preprocessing pipelines to handle different data types
Conducted hyperparameter tuning using RandomizedSearchCV
Validated models using 5-fold cross-validation
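To make the 5-fold validation step concrete, the sketch below generates the index split that k-fold cross-validation relies on: each sample lands in the validation fold exactly once and in the training portion of the other folds. This is a simplified, dependency-free illustration (no shuffling, no stratification by crop label); the function name `k_fold_indices` is hypothetical and it is not the routine used in the project.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) for each of n_splits folds."""
    indices = list(range(n_samples))
    # Spread the remainder so the first folds hold one extra sample
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, val_idx
        start = stop


folds = list(k_fold_indices(n_samples=12, n_splits=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train size={len(train_idx)}, validation={val_idx}")
```

The model is trained once per fold and scored on that fold's validation indices; the five scores are then averaged.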

Model pipeline and training:

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Data Splitting
# Features are every column except the target; the target is the crop label

X = df.drop('label', axis=1)
y = df['label']

# First split: 70% train, 30% temp (val+test)

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42
)

# Second split: Split temp into 15% val and 15% test

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    random_state=42
)

# Building preprocessing pipeline

numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numeric_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        ]), categorical_features)
    ])

# Full pipeline with model

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Model training

pipeline.fit(X_train, y_train)

Model Evaluation

Assessed models using multiple metrics:

Accuracy
F1-score (macro average)
ROC-AUC for multi-class classification


Analyzed feature importance to understand key predictors
Evaluated model fairness across different parameter groups
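The macro-averaged F1-score listed above treats every crop equally regardless of how many samples it has. The sketch below computes it by hand on a tiny made-up example so the definition is explicit; the labels and predictions are illustrative only and are not results from this project.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels for illustration only
y_true = ["rice", "rice", "corn", "corn", "wheat", "wheat"]
y_pred = ["rice", "rice", "corn", "rice", "wheat", "corn"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1(y_true, y_pred):.3f}")
```

Because each class contributes one equally weighted term, a crop that the model rarely gets right pulls the macro F1 down even when overall accuracy looks acceptable.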

Model evaluation and comparison:

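from sklearn.metrics import classification_report, roc_auc_score, accuracy_score, f1_score
from sklearn.preprocessing import label_binarize
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC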

# Evaluate predictions

y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)

print(classification_report(y_test, y_pred))

# ROC-AUC for multiclass

classes = pipeline.named_steps['classifier'].classes_
y_test_bin = label_binarize(y_test, classes=classes)
roc_auc = roc_auc_score(y_test_bin, y_proba, multi_class='ovr', average='macro')
print(f"ROC-AUC (multi-class OVR): {roc_auc:.2f}")

# Comparison of different models

results = {}  # maps model name to validation accuracy
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=10),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC(probability=True)
}

for name, model in models.items():
    # Build a separate pipeline per model so the Random Forest `pipeline`
    # fitted earlier is not overwritten
    model_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    model_pipeline.fit(X_train, y_train)
    score = model_pipeline.score(X_val, y_val)
    results[name] = score
    print(f"{name}: Validation Accuracy = {score:.4f}")

Challenges Encountered

Data Quality Issues

Missing values and potential placeholders required careful handling
Encountered type conversion difficulties when working with categorical data
Some preprocessing pipelines created dimensionality mismatches

Technical Implementation Challenges

Faced pipeline compatibility issues when combining preprocessing steps with models
Encountered errors in SMOTE implementation due to data format inconsistencies
ROC-AUC calculation for multi-class problems required special handling
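The special handling for multi-class ROC-AUC comes from the fact that AUC is defined for two classes. A one-vs-rest (OvR) scheme computes one binary AUC per crop (that crop against all others) and averages them. The sketch below spells this out in plain Python; the function names, crop labels, and probabilities are hypothetical and serve only to show the mechanics.

```python
def binary_auc(labels, scores):
    """AUC as the chance a random positive outranks a random negative (ties count half)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def macro_ovr_auc(y_true, y_proba, classes):
    """Average the one-vs-rest AUC over all classes."""
    aucs = []
    for column, cls in enumerate(classes):
        binary_labels = [1 if label == cls else 0 for label in y_true]
        class_scores = [row[column] for row in y_proba]
        aucs.append(binary_auc(binary_labels, class_scores))
    return sum(aucs) / len(aucs)


# Made-up labels and predicted probabilities (columns follow `classes`)
classes = ["corn", "rice", "wheat"]
y_true = ["corn", "rice", "wheat", "rice"]
y_proba = [
    [0.70, 0.20, 0.10],
    [0.20, 0.60, 0.20],
    [0.10, 0.30, 0.60],
    [0.50, 0.15, 0.35],
]
macro_auc = macro_ovr_auc(y_true, y_proba, classes)
print(f"Macro OvR AUC: {macro_auc:.3f}")
```

The `ValueError` branch reflects a practical constraint: if a crop never appears in the test split, its one-vs-rest AUC is undefined, which is one reason the multi-class calculation needs care.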

Model Performance Challenges

Some models showed signs of potential overfitting
Balancing model complexity with interpretability proved difficult
Ensuring fairness across all parameter groups required additional analysis

Key Improvements and Solutions

Enhanced Data Pipeline

Created a robust preprocessing pipeline handling both numerical and categorical features
Implemented proper imputation strategies to handle missing values
Developed a consistent approach to feature scaling and encoding

Model Optimization

Fine-tuned hyperparameters to improve model performance:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Hyperparameter tuning
# Keys use the 'classifier__' prefix to reach the Random Forest step of the pipeline
param_dist = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': randint(2, 10),  # integers 2 to 9 (upper bound excluded)
    'classifier__min_samples_leaf': randint(1, 5),    # integers 1 to 4
    'classifier__max_features': ['sqrt', 'log2', None],
    'classifier__bootstrap': [True, False],
    'classifier__class_weight': [None, 'balanced']
}

random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

random_search.fit(X_train, y_train)
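For intuition about what the search above does, the following dependency-free sketch draws random candidate configurations from a search space of the same shape. It is a simplified illustration, not scikit-learn's implementation: the `sample_candidates` helper is hypothetical, lists are sampled uniformly, and `(low, high)` tuples stand in for an integer range with the upper bound excluded.

```python
import random

# Search space shaped like the parameter distributions above
search_space = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": (2, 10),  # integers 2 to 9
    "min_samples_leaf": (1, 5),    # integers 1 to 4
}


def sample_candidates(space, n_iter, seed=42):
    """Draw n_iter random configurations from the search space."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        candidate = {}
        for name, choices in space.items():
            if isinstance(choices, tuple):
                low, high = choices
                candidate[name] = rng.randrange(low, high)
            else:
                candidate[name] = rng.choice(choices)
        candidates.append(candidate)
    return candidates


candidates = sample_candidates(search_space, n_iter=10)
n_fits = len(candidates) * 5  # each candidate is evaluated on 5 folds
print(f"{len(candidates)} candidates, {n_fits} cross-validation fits")
```

With n_iter=10 and cv=5, the search trains 50 models during cross-validation, far fewer than an exhaustive grid over the same space would need.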

Interpretability Enhancements

Added LIME explainer to provide insights into individual predictions:

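from lime.lime_tabular import LimeTabularExplainer

# Assumption: every column of X_train is numeric and free of missing values,
# so the raw values can be passed to LIME directly.
explainer = LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=X_train.columns.tolist(),
    class_names=pipeline.named_steps['classifier'].classes_.tolist(),
    mode='classification',
    discretize_continuous=False
)

# LIME passes NumPy arrays, while the pipeline's ColumnTransformer selects
# columns by name, so rebuild a DataFrame before predicting.
def predict_proba_from_array(data):
    return pipeline.predict_proba(pd.DataFrame(data, columns=X_train.columns))

# Explain the first test sample
instance = X_test.iloc[0].values
explanation = explainer.explain_instance(
    instance,
    predict_proba_from_array
)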

Visualized feature importance:

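# Feature Importance Analysis
# Assumes `pipeline` holds a fitted tree-based classifier (e.g. Random Forest),
# since feature_importances_ is not available on every model type
importances = pipeline.named_steps['classifier'].feature_importances_
features = pipeline.named_steps['preprocessor'].get_feature_names_out()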

importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=importance_df.head(15), x='Importance', y='Feature')
plt.title('Top 15 Feature Importances')
plt.tight_layout()
plt.show()

Key Findings and Insights

Model Performance

Random Forest consistently outperformed other models, showing superior accuracy and F1-scores
Tree-based models generally performed better than linear models, suggesting non-linear relationships in the data
Cross-validation demonstrated stable performance across different data subsets

Feature Importance

Soil nutrients (N, P, K) showed significant influence on crop recommendations
Environmental factors (particularly rainfall and temperature) were crucial determinants
Composite features like nutrient ratios proved valuable for prediction

Agricultural Insights

Different crops showed distinct requirements for optimal growth conditions
Certain parameter combinations were strongly predictive of specific crop suitability
The relationship between soil pH and crop suitability was confirmed as a critical factor

Recommendations for Implementation

Deploy as Decision Support Tool: Implement the model as a practical tool for farmers to consult before planting decisions
Continuous Improvement: Set up mechanisms to gather feedback from actual crop outcomes to refine the model
Localization: Consider regional adaptations to account for local agricultural practices and conditions
User Interface Development: Create an accessible interface that translates complex model outputs into actionable recommendations
Integration with Other Systems: Connect with weather forecasting and soil testing services for real-time recommendations
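As one possible starting point for the user-interface recommendation above, class probabilities from the model could be turned into a short ranked list of crops. The sketch below is a hypothetical helper, not part of the project; the crop names and probabilities are placeholders for what a fitted model's predict_proba output might look like.

```python
def top_k_recommendations(probabilities, class_names, k=3):
    """Return the k most probable crops as (name, probability) pairs, best first."""
    ranked = sorted(
        zip(class_names, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Placeholder values for illustration only
class_names = ["corn", "rice", "wheat", "cotton"]
probabilities = [0.10, 0.55, 0.30, 0.05]

recommendations = top_k_recommendations(probabilities, class_names, k=2)
for crop, probability in recommendations:
    print(f"{crop}: {probability:.0%} estimated suitability")
```

Showing a few ranked options with their probabilities, rather than a single label, lets a farmer weigh the model's suggestion against local knowledge such as market prices or seed availability.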

Conclusion

The Crop Recommendation Model successfully addresses a critical agricultural challenge by providing data-driven guidance for crop selection. By analyzing soil and environmental parameters, the model helps farmers optimize resource allocation and increase productivity. The Random Forest algorithm proved most effective for this application, demonstrating strong predictive performance while maintaining interpretability. With proper implementation, this model has significant potential to improve agricultural outcomes and promote sustainable farming practices.