### Project Description
A farmer has reached out to you, as a machine learning expert, for help selecting the best crop for his field. Because of budget constraints, he can afford to measure only one of the four essential soil measures:

- Nitrogen content ratio in the soil
- Phosphorous content ratio in the soil
- Potassium content ratio in the soil
- pH value of the soil

This is a classic feature selection problem: the objective is to pick the single feature that best predicts the crop. Can you help him?

In [1]:
# All required libraries are imported here for you.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# Load the dataset
crops = pd.read_csv("soil_measures.csv")

In [2]:
# Take a look at the data
crops.head()

| | N | P | K | ph | crop |
|---|---|---|---|---|---|
| 0 | 90 | 42 | 43 | 6.502985 | rice |
| 1 | 85 | 58 | 41 | 7.038096 | rice |
| 2 | 60 | 55 | 44 | 7.840207 | rice |
| 3 | 74 | 35 | 40 | 6.980401 | rice |
| 4 | 78 | 42 | 42 | 7.628473 | rice |


In [3]:
# Checking for missing values
crops.isna().sum()

N       0
P       0
K       0
ph      0
crop    0
dtype: int64

In [4]:
# Checking how many crops we have, i.e., multi-class target
crops.crop.unique()

array(['rice', 'maize', 'chickpea', 'kidneybeans', 'pigeonpeas',
       'mothbeans', 'mungbean', 'blackgram', 'lentil', 'pomegranate',
       'banana', 'mango', 'grapes', 'watermelon', 'muskmelon', 'apple',
       'orange', 'papaya', 'coconut', 'cotton', 'jute', 'coffee'],
      dtype=object)

In [5]:
# Splitting into feature and target sets
X = crops.drop(columns="crop")
y = crops["crop"]

# Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
# Dictionary to store the model performance for each feature
feature_performance = {}

# Training a logistic regression model for each feature
for feature in ["N", "P", "K", "ph"]:
    # max_iter and solver are set explicitly for this task
    logreg = LogisticRegression(max_iter=1000, solver="newton-cg")
    logreg.fit(X_train[[feature]], y_train)
    y_pred = logreg.predict(X_test[[feature]])

    # Calculate F1 score, the harmonic mean of precision and recall
    # Could also use balanced_accuracy_score
    f1 = metrics.f1_score(y_test, y_pred, average="weighted")
    
    # Add feature-f1 score pairs to the dictionary
    feature_performance[feature] = f1
    print(f"F1-score for {feature}: {f1}")

F1-score for N: 0.10408698154331626
F1-score for P: 0.1281669634364753
F1-score for K: 0.17939335656599273
F1-score for ph: 0.04539052723723777


In [7]:
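# Pick the feature with the highest F1 score (K in this run)
# Store in best_predictive_feature dictionary
best_feature = max(feature_performance, key=feature_performance.get)
best_predictive_feature = {best_feature: feature_performance[best_feature]}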
best_predictive_feature

{'K': 0.17939335656599273}

#### Here are the main options for the solver parameter:

- 'newton-cg' - Good for multinomial logistic regression and handles L2 regularization
- 'lbfgs' - The default solver; works well for multiclass problems but may need a higher max_iter or scaled features to converge
- 'liblinear' - Works well for small datasets but only supports "ovr" (one-vs-rest) for multiclass
- 'sag' - Stochastic Average Gradient descent; fast for large datasets when features are on a similar scale
- 'saga' - Variant of SAG that also handles L1 and elastic-net regularization
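To see how these solvers behave in practice, one option is to fit the same model with each of them and compare scores. The sketch below does this on synthetic data from `make_classification`, since it is meant to run on its own; the dataset shape, the `*_demo` names, and the `solver_scores` dictionary are assumptions for illustration, not results from `soil_measures.csv`. 'liblinear' is left out because it handles multiclass problems only through one-vs-rest.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem with 4 features (illustrative stand-in data)
X_demo, y_demo = make_classification(
    n_samples=600,
    n_features=4,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# sag and saga converge more reliably on standardized features
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

solver_scores = {}
for solver in ["newton-cg", "lbfgs", "sag", "saga"]:
    model = LogisticRegression(max_iter=1000, solver=solver, random_state=42)
    model.fit(X_tr_s, y_tr)
    solver_scores[solver] = f1_score(y_te, model.predict(X_te_s), average="weighted")
    print(f"{solver}: weighted F1 = {solver_scores[solver]:.3f}")
```

On a small, well-conditioned problem like this the solvers would be expected to reach very similar scores; differences tend to show up in training time and convergence behaviour on larger or unscaled data.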