# Sowing Success: How Machine Learning Helps Farmers Select the Best Crops

In agriculture, making informed decisions about crop selection is crucial for maximizing yield and ensuring sustainable farming practices. One of the primary factors influencing crop growth is the condition of the soil, which can be assessed by measuring essential soil metrics such as nitrogen, phosphorus, potassium levels, and pH value. However, this process can often be expensive and time-consuming, forcing farmers to prioritize which metrics to measure based on their budget constraints.

Farmers face numerous choices each season regarding which crops to plant, with their primary objective being to optimize crop yield. Understanding soil health is vital in this decision-making process, as each crop has specific ideal soil conditions that promote optimal growth and yield.

A farmer has approached you, a machine learning expert, for assistance in selecting the best crop for his field. To aid in this endeavor, you have been provided with a dataset called `soil_measures.csv`, which contains the following key metrics:

- **N**: Nitrogen content ratio in the soil
- **P**: Phosphorus content ratio in the soil
- **K**: Potassium content ratio in the soil
- **ph**: pH value of the soil
- **crop**: Categorical values representing various crops (target variable)

Each row in this dataset corresponds to the soil measurements of a specific field, with the crop specified in the "crop" column being the optimal choice for that field based on these measurements.

## Project Goals

In this project, you will build multi-class classification models to predict the type of crop that should be planted based on the provided soil metrics. Additionally, you will identify the single most important feature that contributes to predictive performance. The steps will include:

1. **Model Development**
2. **Feature Importance Analysis**
3. **Best Predictive Feature**

This project aims to leverage machine learning techniques to provide valuable insights that can enhance decision-making in crop selection, ultimately leading to increased agricultural productivity.


In [36]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

In [2]:
df = pd.read_csv('Dataset/soil_measures.csv')

In [3]:
# Show the first five rows of the DataFrame
df.head()

    N   P   K        ph  crop
0  90  42  43  6.502985  rice
1  85  58  41  7.038096  rice
2  60  55  44  7.840207  rice
3  74  35  40  6.980401  rice
4  78  42  42  7.628473  rice


In [4]:
# Display the dimensions of the DataFrame
df.shape

(2200, 5)

In [5]:
# Display summary statistics for the numeric columns
df.describe()

In [6]:
# Display the number of missing values in each column
df.isna().sum()

N       0
P       0
K       0
ph      0
crop    0
dtype: int64

In [7]:
# Display all unique values
df['crop'].unique()

array(['rice', 'maize', 'chickpea', 'kidneybeans', 'pigeonpeas',
       'mothbeans', 'mungbean', 'blackgram', 'lentil', 'pomegranate',
       'banana', 'mango', 'grapes', 'watermelon', 'muskmelon', 'apple',
       'orange', 'papaya', 'coconut', 'cotton', 'jute', 'coffee'],
      dtype=object)
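
With 22 crop classes, it can help to check how evenly they are represented before settling on a metric. The sketch below is one possible way to do that. `class_balance` is a hypothetical helper, and the small `toy_labels` series is a made-up stand-in for `df['crop']` so that the block runs on its own.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and each class's share of all rows."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / len(labels)})


# Toy stand-in for df['crop'] (assumption); in the notebook, pass df['crop'] instead
toy_labels = pd.Series(["rice", "rice", "maize", "coffee"], name="crop")
balance = class_balance(toy_labels)
print(balance)
```

If the shares turn out to be very uneven, a support-weighted metric and a plain accuracy score can tell quite different stories.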

## Split the Data 

In [8]:
X = df.drop(columns=["crop"])
y = df['crop']

In [9]:
X

        N   P   K        ph
0      90  42  43  6.502985
1      85  58  41  7.038096
2      60  55  44  7.840207
3      74  35  40  6.980401
4      78  42  42  7.628473
...   ...  ..  ..       ...
2195  107  34  32  6.780064
2196   99  15  27  6.086922
2197  118  33  30  6.362608
2198  117  32  34  6.758793
2199  104  18  30  6.779833


In [10]:
y

0         rice
1         rice
2         rice
3         rice
4         rice
         ...  
2195    coffee
2196    coffee
2197    coffee
2198    coffee
2199    coffee
Name: crop, Length: 2200, dtype: object

## Model Development

In [11]:
# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
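
The split in this notebook is not stratified, so the class proportions in the test set can drift from those in the full data. As a minimal sketch of the alternative, `train_test_split` accepts a `stratify` argument. The toy frame and the variable names below are assumptions made only so the block is self-contained, and switching the notebook to a stratified split would change the scores reported in it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y (assumption): 3 hypothetical crops with 10 rows each
toy_X = pd.DataFrame({"K": np.arange(30)})
toy_y = pd.Series(["rice"] * 10 + ["maize"] * 10 + ["coffee"] * 10, name="crop")

# stratify=toy_y asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Each crop should appear equally often in the held-out rows
print(y_te.value_counts())
```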

In [48]:
# Create a dictionary to store the model performance for each feature
feature_performance = {}

In [52]:
# Train a logistic regression model for each feature
for feature in ["N", "P", "K", "ph"]:
    log_reg = LogisticRegression(multi_class="multinomial", max_iter=4000)
    log_reg.fit(X_train[[feature]], y_train)
    y_pred = log_reg.predict(X_test[[feature]])
    
    # Calculate the weighted F1 score: the per-class F1 (harmonic mean of
    # precision and recall) averaged with weights proportional to class support
    # Could also use balanced_accuracy_score
    f1 = metrics.f1_score(y_test, y_pred, average="weighted")
    
    # Add feature-f1 score pairs to the dictionary
    feature_performance[feature] = f1
    print(f"F1-score for {feature}: {f1}")

F1-score for N: 0.10386483359322711
F1-score for P: 0.13079194919058798
F1-score for K: 0.20104791685662504
F1-score for ph: 0.04532731061152114
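
To make the `average="weighted"` setting concrete, the following sketch recomputes the weighted F1 by hand on a tiny example and compares it with scikit-learn's value. The labels and predictions are invented for illustration and are not taken from `soil_measures.csv`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented labels and predictions (assumption), for illustration only
y_true_toy = ["rice", "rice", "rice", "maize", "maize", "coffee"]
y_pred_toy = ["rice", "rice", "maize", "maize", "coffee", "coffee"]

labels = sorted(set(y_true_toy))

# Per-class F1 scores, in the order given by `labels`
per_class_f1 = f1_score(y_true_toy, y_pred_toy, average=None, labels=labels)

# Support = number of true instances of each class
support = np.array([y_true_toy.count(label) for label in labels])

# Weighted F1 = sum(support_i * F1_i) / sum(support_i)
manual_weighted = float(np.sum(per_class_f1 * support) / support.sum())
library_weighted = f1_score(y_true_toy, y_pred_toy, average="weighted")

print(manual_weighted, library_weighted)
```

Because the weights are class supports, frequent classes influence this score more than rare ones.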


In [53]:
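# Select the feature with the highest F1 score
# and store it in the best_predictive_feature dictionary
best_feature = max(feature_performance, key=feature_performance.get)
best_predictive_feature = {best_feature: feature_performance[best_feature]}
best_predictive_feature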

{'K': 0.20104791685662504}
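
A ranking based on a single train/test split can shift with the random seed. One hedged way to check its stability is to score each feature with cross-validation instead. The sketch below uses synthetic data in which only `K` carries signal by construction, so its result says nothing about the real soil dataset. The names `toy_X`, `toy_y`, and `cv_performance` are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the soil data (assumption): 3 hypothetical crops, 60 rows each.
# Only "K" depends on the class; "N", "P" and "ph" are pure noise by construction.
class_ids = np.repeat([0, 1, 2], 60)
toy_X = pd.DataFrame({
    "N": rng.normal(size=180),
    "P": rng.normal(size=180),
    "K": class_ids * 5.0 + rng.normal(size=180),
    "ph": rng.normal(size=180),
})
toy_y = pd.Series(np.array(["rice", "maize", "coffee"])[class_ids], name="crop")

# Mean 5-fold cross-validated weighted F1 for each single-feature model
cv_performance = {}
for feature in toy_X.columns:
    model = LogisticRegression(max_iter=4000)
    scores = cross_val_score(model, toy_X[[feature]], toy_y, cv=5, scoring="f1_weighted")
    cv_performance[feature] = scores.mean()

best_cv_feature = max(cv_performance, key=cv_performance.get)
print(best_cv_feature, cv_performance[best_cv_feature])
```

In the notebook itself, the same loop could be run with `X` and `y` in place of the toy data.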