<a href="https://colab.research.google.com/github/KaifAhmad1/Crop-Recommendation/blob/main/CropRecommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Crop Recommendation System: Utilizing Machine Learning for Precise Crop Selection Based on Environmental Conditions

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load the dataset
file_path = "/content/drive/MyDrive/Crop_recommendation.csv"
data = pd.read_csv(file_path)

**Exploring the Data:**

In [3]:
data

Unnamed: 0,N,P,K,temperature,humidity,ph,rainfall,label
0,90,42,43,20.879744,82.002744,6.502985,202.935536,rice
1,85,58,41,21.770462,80.319644,7.038096,226.655537,rice
2,60,55,44,23.004459,82.320763,7.840207,263.964248,rice
3,74,35,40,26.491096,80.158363,6.980401,242.864034,rice
4,78,42,42,20.130175,81.604873,7.628473,262.717340,rice
...,...,...,...,...,...,...,...,...
2195,107,34,32,26.774637,66.413269,6.780064,177.774507,coffee
2196,99,15,27,27.417112,56.636362,6.086922,127.924610,coffee
2197,118,33,30,24.131797,67.225123,6.362608,173.322839,coffee
2198,117,32,34,26.272418,52.127394,6.758793,127.175293,coffee


In [4]:
data.shape

(2200, 8)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   N            2200 non-null   int64  
 1   P            2200 non-null   int64  
 2   K            2200 non-null   int64  
 3   temperature  2200 non-null   float64
 4   humidity     2200 non-null   float64
 5   ph           2200 non-null   float64
 6   rainfall     2200 non-null   float64
 7   label        2200 non-null   object 
dtypes: float64(4), int64(3), object(1)
memory usage: 137.6+ KB


In [6]:
data.describe()

Unnamed: 0,N,P,K,temperature,humidity,ph,rainfall
count,2200.0,2200.0,2200.0,2200.0,2200.0,2200.0,2200.0
mean,50.551818,53.362727,48.149091,25.616244,71.481779,6.46948,103.463655
std,36.917334,32.985883,50.647931,5.063749,22.263812,0.773938,54.958389
min,0.0,5.0,5.0,8.825675,14.25804,3.504752,20.211267
25%,21.0,28.0,20.0,22.769375,60.261953,5.971693,64.551686
50%,37.0,51.0,32.0,25.598693,80.473146,6.425045,94.867624
75%,84.25,68.0,49.0,28.561654,89.948771,6.923643,124.267508
max,140.0,145.0,205.0,43.675493,99.981876,9.935091,298.560117


In [7]:
# How many target category classes or lables in this dataset:
targets = dict(enumerate(data['label'].astype('category').cat.categories))
print(targets)

{0: 'apple', 1: 'banana', 2: 'blackgram', 3: 'chickpea', 4: 'coconut', 5: 'coffee', 6: 'cotton', 7: 'grapes', 8: 'jute', 9: 'kidneybeans', 10: 'lentil', 11: 'maize', 12: 'mango', 13: 'mothbeans', 14: 'mungbean', 15: 'muskmelon', 16: 'orange', 17: 'papaya', 18: 'pigeonpeas', 19: 'pomegranate', 20: 'rice', 21: 'watermelon'}


In [8]:
num_categories = len(data['label'].unique())
print(num_categories)

22


In [9]:
# Checking for missing values:
print(data.isnull().sum())

N              0
P              0
K              0
temperature    0
humidity       0
ph             0
rainfall       0
label          0
dtype: int64


**EDA and Finding the Insights by Visualizing**

In [10]:
import plotly.express as px

fig = px.scatter_3d(data, x='temperature', y='ph', z='rainfall', color='temperature', opacity=0.7,
                    size_max=10, symbol='temperature', title='3D Scatter Plot of Temperature, pH, and Rainfall',
                    labels={'temperature': 'Temperature', 'ph': 'pH', 'rainfall': 'Rainfall'})

fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGray')), selector=dict(mode='markers'))

fig.update_layout(scene=dict(xaxis_title='Temperature',
                             yaxis_title='pH',
                             zaxis_title='Rainfall'))

fig.show()

In [11]:
import plotly.express as px

fig = px.box(data[['temperature', 'ph', 'rainfall']], title='Box Plot of Temperature, pH, and Rainfall',
             labels={'variable': 'Variable', 'value': 'Value'}, color_discrete_map={'temperature': 'blue', 'ph': 'green', 'rainfall': 'orange'})
fig.update_layout(xaxis=dict(title='Variables'), yaxis=dict(title='Values'))
fig.show()

In [12]:
import plotly.express as px

fig = px.violin(data, y=['temperature', 'ph', 'rainfall'], box=True, points="all", title='Violin Plot of Temperature, pH, and Rainfall',
                labels={'value': 'Measurement', 'variable': 'Variable'})
fig.update_layout(
    yaxis_title='Measurement',
    xaxis_title='Variable',
    showlegend=False
)

fig.show()

In [13]:
fig = px.bar(data['label'].value_counts().reset_index(),
             x='label',
             y='index',
             orientation='h',
             title='Count Plot of Labels',
             labels={'index': 'Label', 'label': 'Count'},
             color='label',
             color_continuous_scale='plasma_r')

fig.update_layout(
    yaxis_title='Label',
    xaxis_title='Count',
    showlegend=False
)
fig.show()

In [14]:
fig = px.scatter_matrix(data, dimensions=data.columns[:-1], color='label', title='Pair Plot of Features with Label Hue')
fig.show()

#### Rainy Season Insights

During the rainy season, we observe a substantial increase in average rainfall, reaching around 120 mm, while temperatures remain relatively cool, staying below 30°C. The increased rainfall significantly impacts soil moisture levels, subsequently influencing the pH of the soil. This environmental condition is conducive to the cultivation of specific crops.

##### Rice Cultivation

- **Water Requirement:** Rice, being a water-intensive crop, thrives during this season. It requires heavy rainfall, typically exceeding 200 mm.
- **Humidity Level:** Rice cultivation benefits from a high humidity level above 80%.
- **Geographical Impact:** Major rice production in India is concentrated in the East Coast, where an average annual rainfall of approximately 220 mm is recorded.

##### Coconut Plantations

- **Crop Preference:** Coconut, being a tropical crop, flourishes in regions with high humidity levels.
- **Export Hotspots:** Coastal areas across the country are notable for exporting coconuts, as they provide optimal conditions for coconut cultivation.
- **Contribution to Production:** These coastal areas significantly contribute to the country's coconut production.


In [15]:
import plotly.express as px

filtered_df = data[(data['temperature'] < 30) & (data['rainfall'] > 120)]
fig = px.scatter(filtered_df, x='rainfall', y='humidity', color='label', title='Joint Plot of Rainfall and Humidity with Label Hue')

fig.show()

In [16]:
import plotly.express as px

fig = px.scatter(data, x='K', y='humidity', color='label', size_max=8, opacity=0.7, title='Joint Plot of K and Humidity with Label Hue')
fig.show()

In [17]:
import plotly.express as px

fig = px.box(data, x='ph', y='label', orientation='h', title='Box Plot of pH with Label Hue')
fig.show()

In [18]:
filtered_df = data[data['rainfall'] > 150]

fig = px.box(filtered_df, x='P', y='label', orientation='h', title='Box Plot of P with Label Hue (Rainfall > 150)')
fig.show()

### Further Analysis of Phosphorus Levels

When humidity levels drop below `65`, a consistent range of phosphorous levels `(approximately 14 to 25)` is required for the successful cultivation of six different crops. These crops can thrive based on the anticipated amount of rainfall over the next few weeks.



In [19]:
filtered_df = data[data['humidity'] < 65]

fig = px.line(filtered_df, x='K', y='rainfall', color='label', title='Line Plot of K and Rainfall with Label  (Humidity < 65)')
fig.show()

**Data Preprcessing**

In [20]:
# Convert the 'label' column to categorical codes, create 'target' column, and define X and y variables
data['target'] = data['label'].astype('category').cat.codes
X = data[['N', 'P', 'K', 'temperature', 'humidity', 'ph', 'rainfall']]
y = data['target']

**Correlation Analysis:**

In [21]:
# Correlation Heatmap and features:
import plotly.express as px
correlation_matrix = X.corr()

fig = px.imshow(correlation_matrix, x=X.columns, y=X.columns, color_continuous_scale='Viridis')

fig.update_layout(
    title='Correlation Heatmap of Features',
    width=700,
    height=600,
    xaxis=dict(tickangle=-45, tickmode='array', tickvals=list(range(len(X.columns))), ticktext=X.columns),
    yaxis=dict(tickangle=0, tickmode='array', tickvals=list(range(len(X.columns))), ticktext=X.columns),
)
for i in range(len(X.columns)):
    for j in range(len(X.columns)):
        fig.add_annotation(
            x=X.columns[i],
            y=X.columns[j],
            text=f"{correlation_matrix.iloc[j, i]:.2f}",
            showarrow=False,
            font=dict(size=10),
            xshift=-8,
            opacity=0.7
        )
fig.add_annotation(
    x=1.1,
    y=0.5,
    xref="paper",
    yref="paper",
    text=f"Overall Correlation: {correlation_matrix.mean().mean():.2f}",
    showarrow=False,
    font=dict(size=12),
    opacity=0.7
)
fig.show()

In [22]:
# Print the correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)

Correlation Matrix:
                    N         P         K  temperature  humidity        ph  \
N            1.000000 -0.231460 -0.140512     0.026504  0.190688  0.096683   
P           -0.231460  1.000000  0.736232    -0.127541 -0.118734 -0.138019   
K           -0.140512  0.736232  1.000000    -0.160387  0.190859 -0.169503   
temperature  0.026504 -0.127541 -0.160387     1.000000  0.205320 -0.017795   
humidity     0.190688 -0.118734  0.190859     0.205320  1.000000 -0.008483   
ph           0.096683 -0.138019 -0.169503    -0.017795 -0.008483  1.000000   
rainfall     0.059020 -0.063839 -0.053461    -0.030084  0.094423 -0.109069   

             rainfall  
N            0.059020  
P           -0.063839  
K           -0.053461  
temperature -0.030084  
humidity     0.094423  
ph          -0.109069  
rainfall     1.000000  


**Normalization and Feature Scaling:**

In [23]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=1)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# we must apply the scaling to the test set as well that we are computing for the training set
X_test_scaled = scaler.transform(X_test)

### Model Building:
**KNN Classifier for Crop Prediction:**

In [24]:
# Import necessary libraries
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

In [25]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [26]:
# Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [27]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'n_neighbors': [3, 5, 7, 9],  # You can adjust these values based on your dataset
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

In [28]:
# Initialize the KNN classifier
knn = KNeighborsClassifier()

In [29]:
# Use GridSearchCV to find the best parameters
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

In [31]:
from sklearn.metrics import classification_report
# Display the best parameters and corresponding accuracy
print("Best Parameters: ", grid_search.best_params_)
print("Best Training Accuracy: {:.2f}".format(grid_search.best_score_))
# Print the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_test))

Best Parameters:  {'metric': 'manhattan', 'n_neighbors': 3, 'weights': 'distance'}
Best Training Accuracy: 0.98

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        23
           1       1.00      1.00      1.00        21
           2       1.00      1.00      1.00        20
           3       1.00      1.00      1.00        26
           4       1.00      1.00      1.00        27
           5       1.00      1.00      1.00        17
           6       1.00      1.00      1.00        17
           7       1.00      1.00      1.00        14
           8       1.00      1.00      1.00        23
           9       1.00      1.00      1.00        20
          10       1.00      1.00      1.00        11
          11       1.00      1.00      1.00        21
          12       1.00      1.00      1.00        19
          13       1.00      1.00      1.00        24
          14       1.00      1.00      1.00        19

**SVM Classifier for Crop Prediction:**

In [32]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import numpy as np

In [33]:
# Linear Kernel
svc_linear = SVC(kernel='linear').fit(X_train_scaled, y_train)
linear_accuracy = svc_linear.score(X_test_scaled, y_test)
print("Linear Kernel Accuracy: ", linear_accuracy)

Linear Kernel Accuracy:  0.9772727272727273


In [34]:
# Rbf Kernel
svc_rbf = SVC(kernel='rbf').fit(X_train_scaled, y_train)
rbf_accuracy = svc_rbf.score(X_test_scaled, y_test)
print("Rbf Kernel Accuracy: ", rbf_accuracy)

Rbf Kernel Accuracy:  0.9681818181818181


In [35]:
# Poly Kernel
svc_poly = SVC(kernel='poly').fit(X_train_scaled, y_train)
poly_accuracy = svc_poly.score(X_test_scaled, y_test)
print("Poly Kernel Accuracy: ", poly_accuracy)

Poly Kernel Accuracy:  0.9204545454545454


In [36]:
# Parameter tuning using GridSearchCV for the Linear Kernel
parameters = {'C': np.logspace(-3, 2, 6).tolist(), 'gamma': np.logspace(-3, 2, 6).tolist()}

In [37]:
# Create a GridSearchCV model for the Linear Kernel
svc_linear_grid_search = GridSearchCV(estimator=SVC(kernel="linear"), param_grid=parameters, n_jobs=-1, cv=4)
svc_linear_grid_search.fit(X_train_scaled, y_train)

In [38]:
# Print the results of the GridSearchCV
print("Best Score: ", svc_linear_grid_search.best_score_)
print("Best Parameters: ", svc_linear_grid_search.best_params_)

Best Score:  0.9846590909090909
Best Parameters:  {'C': 10.0, 'gamma': 0.001}


In [39]:
# Use the best parameters to fit the SVC model with the Linear Kernel
best_svc_linear = SVC(kernel='linear', C=svc_linear_grid_search.best_params_['C'],
                      gamma=svc_linear_grid_search.best_params_['gamma']).fit(X_train_scaled, y_train)

In [40]:
# Evaluate the accuracy of the tuned Linear Kernel model
tuned_linear_accuracy = best_svc_linear.score(X_test_scaled, y_test)
print("Tuned Linear Kernel Accuracy: ", tuned_linear_accuracy)

Tuned Linear Kernel Accuracy:  0.9840909090909091


In [41]:
from sklearn.metrics import classification_report

# Linear Kernel
linear_predictions = svc_linear.predict(X_test_scaled)
print("Classification Report for Linear Kernel:")
print(classification_report(y_test, linear_predictions))

# Rbf Kernel
rbf_predictions = svc_rbf.predict(X_test_scaled)
print("Classification Report for Rbf Kernel:")
print(classification_report(y_test, rbf_predictions))

# Poly Kernel
poly_predictions = svc_poly.predict(X_test_scaled)
print("Classification Report for Poly Kernel:")
print(classification_report(y_test, poly_predictions))

# Tuned Linear Kernel
tuned_linear_predictions = best_svc_linear.predict(X_test_scaled)
print("Classification Report for Tuned Linear Kernel:")
print(classification_report(y_test, tuned_linear_predictions))

Classification Report for Linear Kernel:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        23
           1       1.00      1.00      1.00        21
           2       0.95      1.00      0.98        20
           3       1.00      1.00      1.00        26
           4       1.00      1.00      1.00        27
           5       1.00      1.00      1.00        17
           6       0.94      1.00      0.97        17
           7       1.00      1.00      1.00        14
           8       0.85      0.96      0.90        23
           9       0.91      1.00      0.95        20
          10       0.92      1.00      0.96        11
          11       1.00      0.95      0.98        21
          12       1.00      1.00      1.00        19
          13       1.00      0.96      0.98        24
          14       1.00      1.00      1.00        19
          15       1.00      1.00      1.00        17
          16       1.00      1.00      1

**Classification using Decision Tree**

In [42]:
from sklearn.tree import DecisionTreeClassifier

In [43]:
# Create a Decision Tree Classifier with a fixed random state for reproducibility
clf = DecisionTreeClassifier(random_state=42)

In [44]:
# Train the classifier using the training data
clf.fit(X_train, y_train)

In [45]:
# Evaluate the classifier on the test data and print the accuracy score
accuracy = clf.score(X_test, y_test)
print(f"Accuracy on the test set: {accuracy:.4f}")

Accuracy on the test set: 0.9864


In [46]:
import plotly.express as px
# Get feature importances from the model
feature_importance = clf.feature_importances_
# Create a DataFrame for plotting
importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importance})
# Sort the DataFrame by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)
# Create a horizontal bar chart
fig = px.bar(
    importance_df,
    x='Importance',
    y='Feature',
    orientation='h',
    title='Feature Importance',
    labels={'Importance': 'Feature Importance', 'Feature': 'Feature Name'},
    width=800,
    height=400
)
fig.show()

**Classification using Random Forest:**

In [47]:
from sklearn.ensemble import RandomForestClassifier

In [48]:
clf = RandomForestClassifier(max_depth=4,n_estimators=100,random_state=42).fit(X_train, y_train)

In [49]:
print('RF Accuracy on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('RF Accuracy on test set: {:.2f}'.format(clf.score(X_test, y_test)))

RF Accuracy on training set: 0.95
RF Accuracy on test set: 0.92


**Classification using Gradient Boosting:**

In [51]:
from sklearn.ensemble import GradientBoostingClassifier

In [52]:
grad = GradientBoostingClassifier().fit(X_train, y_train)
print('Gradient Boosting accuracy : {}'.format(grad.score(X_test,y_test)))

Gradient Boosting accuracy : 0.9818181818181818


In [53]:
# List of custom inputs representing environmental conditions for multiple samples
custom_inputs = [
    {'N': 75, 'P': 35, 'K': 45, 'temperature': 28.5, 'humidity': 70.5, 'ph': 6.8, 'rainfall': 120.0}
]

# Create an empty list to store predicted crop names
predicted_crop_names = []

# Loop through each custom input and print the predicted crop name
for i, custom_input in enumerate(custom_inputs):
    custom_df = pd.DataFrame([custom_input])[['N', 'P', 'K', 'temperature', 'humidity', 'ph', 'rainfall']]
    custom_input_scaled = scaler.transform(custom_df)
    predicted_crop_name = targets[clf.predict(custom_input_scaled)[0]]
    predicted_crop_names.append(predicted_crop_name)
    print(f"Sample {i + 1}: Predicted Crop Name: {predicted_crop_name}")

# Print the overall predicted crop names for all samples
print("\nOverall Predicted Crop Names:", predicted_crop_names)

Sample 1: Predicted Crop Name: muskmelon

Overall Predicted Crop Names: ['muskmelon']



X does not have valid feature names, but RandomForestClassifier was fitted with feature names

