# Phase 2 – Crop Recommendation Model (Classification)

Objective:
To build a machine learning model that recommends the most suitable crop
based on soil nutrients and environmental conditions.

1️⃣ Import Libraries

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

2️⃣ Load Dataset

In [6]:
df = pd.read_csv("/content/Crop_recommendation.csv")
df.head()

    N   P   K  temperature   humidity        ph    rainfall label
0  90  42  43    20.879744  82.002744  6.502985  202.935536  rice
1  85  58  41    21.770462  80.319644  7.038096  226.655537  rice
2  60  55  44    23.004459  82.320763  7.840207  263.964248  rice
3  74  35  40    26.491096  80.158363  6.980401  242.864034  rice
4  78  42  42    20.130175  81.604873  7.628473   262.71734  rice


3️⃣ Basic Data Check

In [7]:
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   N            2200 non-null   int64  
 1   P            2200 non-null   int64  
 2   K            2200 non-null   int64  
 3   temperature  2200 non-null   float64
 4   humidity     2200 non-null   float64
 5   ph           2200 non-null   float64
 6   rainfall     2200 non-null   float64
 7   label        2200 non-null   object 
dtypes: float64(4), int64(3), object(1)
memory usage: 137.6+ KB


N              0
P              0
K              0
temperature    0
humidity       0
ph             0
rainfall       0
label          0


The dataset contains 2,200 rows with soil nutrient values (N, P, K), environmental factors (temperature, humidity, pH, rainfall), and a crop label. No missing values were observed.

4️⃣ Encode Target Variable

In [8]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])
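Because the encoder replaces crop names with integers, predictions later need to be mapped back to names. A minimal sketch of that round trip, using a short hypothetical list of crop names in place of the dataset's `label` column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical crop names standing in for the dataset's `label` column.
crops = ["rice", "maize", "rice", "coffee", "maize"]

encoder = LabelEncoder()
codes = encoder.fit_transform(crops)

# Classes are stored in sorted order, so coffee -> 0, maize -> 1, rice -> 2.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform recovers the original names from the integer codes.
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())
```

The same `inverse_transform` call applies to the fitted `le` object when turning model predictions into crop names.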

5️⃣ Define Features & Target

In [9]:
X = df.drop('label', axis=1)
y = df['label']

6️⃣ Train-Test Split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
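The split above is random but not stratified, so the number of test rows per crop can vary from class to class. One possible variant passes `stratify=y` to keep class proportions equal in both splits. The sketch below uses synthetic data with 22 equally sized classes; that class layout is an assumption for illustration, not something shown in the outputs above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 22 classes with 100 rows each, 7 features.
rng = np.random.default_rng(42)
y_demo = np.repeat(np.arange(22), 100)
X_demo = rng.normal(size=(y_demo.size, 7))

# stratify=y_demo preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Count how many rows of each class landed in the test split.
test_counts = np.bincount(y_te, minlength=22)
print(test_counts)
```

With balanced classes, every class contributes the same number of rows to the test set, which makes per-class metrics easier to compare.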

7️⃣ Feature Scaling (for Logistic Regression)

In [11]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

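Standardization (zero mean and unit variance, with statistics fitted on the training split only) was applied so that features measured on different scales contribute comparably. This matters mainly for Logistic Regression, which is scale-sensitive; tree-based models such as Random Forest and XGBoost are largely unaffected by feature scaling.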
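An alternative way to keep the scaler tied to the training data is to wrap it together with the model in a scikit-learn `Pipeline`. This is a sketch on synthetic data, not the approach used in this notebook; the feature scales and the target rule are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 50.0, 200.0])
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# fit() learns the scaling statistics from X_tr only, then trains the model
# on the scaled data; score() reuses those statistics on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

scaler_step = pipe.named_steps["standardscaler"]
test_accuracy = pipe.score(X_te, y_te)
print(scaler_step.mean_)
print(test_accuracy)
```

A pipeline also simplifies export, since a single object carries both the scaling statistics and the classifier.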

8️⃣ Model Training & Evaluation Function

In [12]:
def evaluate_model(model, name):
    """Fit `model` on the scaled training split and return its metrics as a dict.

    Uses the notebook-level X_train, X_test, y_train and y_test.
    """
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Precision, recall and F1 are averaged over classes, weighted by class support.
    results = {
        "Model": name,
        "Train Accuracy": accuracy_score(y_train, y_train_pred),
        "Test Accuracy": accuracy_score(y_test, y_test_pred),
        "Precision": precision_score(y_test, y_test_pred, average='weighted'),
        "Recall": recall_score(y_test, y_test_pred, average='weighted'),
        "F1 Score": f1_score(y_test, y_test_pred, average='weighted')
    }

    return results
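The `average='weighted'` setting weights each class's score by its number of true samples, whereas `average='macro'` treats all classes equally. A small worked example on hypothetical labels shows how the two can differ when classes are imbalanced:

```python
from sklearn.metrics import precision_score

# Hypothetical imbalanced labels: class 0 has four samples, class 1 has two.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

# Per-class precision: class 0 -> 3/3, class 1 -> 2/3.
per_class = precision_score(y_true, y_pred, average=None)

# 'weighted' weights each class by its support (4 and 2);
# 'macro' takes the plain mean over classes.
weighted = precision_score(y_true, y_pred, average="weighted")
macro = precision_score(y_true, y_pred, average="macro")

print(per_class, weighted, macro)
```

Here the weighted value is (4 × 1 + 2 × 2/3) / 6 = 8/9 and the macro value is (1 + 2/3) / 2 = 5/6. Weighted recall always equals overall accuracy, which is why a Recall column computed this way repeats the test accuracy.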

9️⃣ Train All Models

In [13]:
results = []

# Logistic Regression
results.append(evaluate_model(
    LogisticRegression(max_iter=1000),
    "Logistic Regression"))

# Random Forest
results.append(evaluate_model(
    RandomForestClassifier(),
    "Random Forest"))

# XGBoost
results.append(evaluate_model(
    XGBClassifier(eval_metric='mlogloss'),
    "XGBoost"))

🔟 Create Comparison Table

In [14]:
comparison_df = pd.DataFrame(results)
comparison_df

                 Model  Train Accuracy  Test Accuracy  Precision    Recall  F1 Score
0  Logistic Regression        0.977273       0.959091   0.960432  0.959091  0.958853
1        Random Forest             1.0       0.992424   0.993395  0.992424  0.992312
2              XGBoost             1.0       0.980303     0.9825  0.980303  0.979934


# Model Comparison Analysis

The three classification models were evaluated on the 30% held-out test set using accuracy and weighted-average precision, recall, and F1-score.

* Logistic Regression reached 95.91% test accuracy; as a linear model, it is limited in capturing nonlinear relationships between features.

* Random Forest achieved the highest test accuracy (99.24%) and F1-score (0.992), demonstrating excellent predictive performance.

* XGBoost also performed strongly (98.03% test accuracy, F1-score 0.980) but slightly below Random Forest.

Although Random Forest and XGBoost both reached 100% training accuracy, their test accuracy is only about 0.8 and 2.0 percentage points lower, respectively, which suggests minimal overfitting.

Final Model Selected: Random Forest, because it scored highest on every test-set metric (accuracy, precision, recall, and F1-score).
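The comparison above rests on a single 70/30 split. One way to check how stable such a ranking is would be stratified k-fold cross-validation. The sketch below runs it for a Random Forest on synthetic data from `make_classification`; the dataset shape, fold count, and `n_estimators` value are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic multi-class data standing in for the crop dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42)

# Five stratified folds, shuffled with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# One accuracy value per fold; the spread indicates how stable the estimate is.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

A small standard deviation across folds would support the conclusion drawn from the single split; a large one would suggest the ranking between models could change with a different split.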

In [16]:
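from sklearn.ensemble import RandomForestClassifier

# Refit the selected model on the scaled training split before export.
# No random_state is set, so this forest can differ slightly from the one scored above.
rf = RandomForestClassifier()
rf.fit(X_train, y_train)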

In [17]:
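import joblib

# The model was trained on standardized features, so the fitted scaler is saved too;
# new inputs need the same transform. The scaler file name is an illustrative choice.
joblib.dump(scaler, "crop_feature_scaler.pkl")
joblib.dump(rf, "random_forest_classifier.pkl")
joblib.dump(le, "crop_label_encoder.pkl")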

['crop_label_encoder.pkl']
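At prediction time, the saved objects have to be applied in the same order as during training: scale the input, predict the class code, then decode it to a crop name. The following is a self-contained sketch of that flow on toy data with two made-up clusters; the file names, the temporary directory, and the data are all hypothetical, and it assumes a fitted scaler is persisted alongside the model and encoder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy training data (assumption): two well-separated clusters with 7 features.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(50, 7)),
                    rng.normal(5, 1, size=(50, 7))])
labels = ["rice"] * 50 + ["maize"] * 50

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(labels)
scaler_demo = StandardScaler().fit(X_demo)
rf_demo = RandomForestClassifier(n_estimators=20, random_state=42)
rf_demo.fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    # Save all three artifacts, then load them back as a fresh session would.
    paths = {name: os.path.join(tmp, name + ".pkl")
             for name in ("model", "encoder", "scaler")}
    joblib.dump(rf_demo, paths["model"])
    joblib.dump(le_demo, paths["encoder"])
    joblib.dump(scaler_demo, paths["scaler"])

    model = joblib.load(paths["model"])
    encoder = joblib.load(paths["encoder"])
    scaler_loaded = joblib.load(paths["scaler"])

# Scale -> predict -> decode for one new sample near the second cluster.
sample = np.full((1, 7), 5.0)
pred_code = model.predict(scaler_loaded.transform(sample))
pred_name = encoder.inverse_transform(pred_code)[0]
print(pred_name)
```

Skipping the scaling step at inference would feed raw values to a model trained on standardized ones, so the scaler is as much a part of the deployed artifact as the classifier itself.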