# Sowing Success: How Machine Learning Helps Farmers Select the Best Crops

![Farmer in a field]()

Measuring essential soil metrics such as nitrogen, phosphorous, potassium levels, and pH value is an important aspect of assessing soil condition. However, it can be an expensive and time-consuming process, which can cause farmers to prioritize which metrics to measure based on their budget constraints.

Farmers have various options when it comes to deciding which crop to plant each season. Their primary objective is to maximize the yield of their crops, taking into account different factors. One crucial factor that affects crop growth is the condition of the soil in the field, which can be assessed by measuring basic elements such as nitrogen and potassium levels. Each crop has an ideal soil condition that ensures optimal growth and maximum yield.

A farmer reached out to you as a machine learning expert for assistance in selecting the best crop for his field. They've provided you with a dataset called `soil_measures.csv`, which contains:

- `"N"`: Nitrogen content ratio in the soil
- `"P"`: Phosphorous content ratio in the soil
- `"K"`: Potassium content ratio in the soil
- `"pH"` value of the soil
- `"crop"`: categorical values that contain various crops (target variable).

Each row in this dataset represents various measures of the soil in a particular field. Based on these measurements, the crop specified in the `"crop"` column is the optimal choice for that field.  

In this project, you will build multi-class classification models to predict the type of `"crop"` and identify the single most importance feature for predictive performance.

In [1]:
# All required libraries are imported here for you.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load the dataset
crops = pd.read_csv("datasets\datacamp_ml\soil_measures.csv")

# Write your code here

  crops = pd.read_csv("datasets\datacamp_ml\soil_measures.csv")
  crops = pd.read_csv("datasets\datacamp_ml\soil_measures.csv")


ModuleNotFoundError: No module named 'pandas'

# 1. Basics EDA

## 1.1: Sample display of `crops` df

In [30]:
display(crops)

Unnamed: 0,N,P,K,ph,crop
0,90,42,43,6.502985,rice
1,85,58,41,7.038096,rice
2,60,55,44,7.840207,rice
3,74,35,40,6.980401,rice
4,78,42,42,7.628473,rice
...,...,...,...,...,...
2195,107,34,32,6.780064,coffee
2196,99,15,27,6.086922,coffee
2197,118,33,30,6.362608,coffee
2198,117,32,34,6.758793,coffee


## 1.2: Count null rows

In [31]:
display(crops.isna().sum())

N       0
P       0
K       0
ph      0
crop    0
dtype: int64

# 2. Split the data into train and test set

## 2.1: Set features and target variables

In [32]:
y = crops['crop']
display(y)

0         rice
1         rice
2         rice
3         rice
4         rice
         ...  
2195    coffee
2196    coffee
2197    coffee
2198    coffee
2199    coffee
Name: crop, Length: 2200, dtype: object

In [33]:
X = crops.drop("crop", axis=1)
display(X)

Unnamed: 0,N,P,K,ph
0,90,42,43,6.502985
1,85,58,41,7.038096
2,60,55,44,7.840207
3,74,35,40,6.980401
4,78,42,42,7.628473
...,...,...,...,...
2195,107,34,32,6.780064
2196,99,15,27,6.086922
2197,118,33,30,6.362608
2198,117,32,34,6.758793


## 2.2: Split

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 12)

In [35]:
display(X_train, X_test, y_train, y_test)

Unnamed: 0,N,P,K,ph
625,34,45,21,6.287380
307,26,80,18,5.581022
2169,111,28,26,6.937353
671,37,50,23,6.530471
443,1,76,17,6.012719
...,...,...,...,...
1987,117,43,25,7.839849
1283,39,140,203,6.349876
1414,109,26,45,6.224535
1691,37,24,13,7.854624


Unnamed: 0,N,P,K,ph
1074,97,74,45,5.677720
2156,108,35,25,6.971963
1815,0,19,33,6.234458
1771,67,68,49,6.821775
988,40,18,43,5.767373
...,...,...,...,...
1549,19,122,202,5.811975
1860,20,29,27,6.047044
427,9,66,21,5.719890
1391,100,10,53,6.211749


625        mungbean
307     kidneybeans
2169         coffee
671        mungbean
443      pigeonpeas
           ...     
1987         cotton
1283         grapes
1414      muskmelon
1691         orange
1867        coconut
Name: crop, Length: 1540, dtype: object

1074         banana
2156         coffee
1815        coconut
1771         papaya
988     pomegranate
           ...     
1549          apple
1860        coconut
427      pigeonpeas
1391     watermelon
1940         cotton
Name: crop, Length: 660, dtype: object

# 3. Train the model, select the best features

## 3.1: 

In [36]:
# Empty dictionary
feature_performance_dict = {}

## 3.2: Train the model on EACH FEATURE

We train the model on each feature as if it's a dataset. 

Pseudocode:

For each feature:
- Initialize a LogReg object
- Fit training data
- Make prediction

In [37]:
# List of features (columns)
feature_list = list(X.columns)

for feature in feature_list:
    # Create a logistic regression object:
    feature_predictor = LogisticRegression(
        # Specifiy using multi-class here:
        multi_class = 'multinomial',
    
        # Solver
        solver='lbfgs', 
    
        # Maximmum iterations
        max_iter=200
    )
  

    # Fit each individual feature to classification label
    feature_predictor.fit(
        X_train[[feature]], y_train
    )

    # Make prediction based on testing data
    y_pred = feature_predictor.predict(X_test[[feature]])

    # Calculate metrics (F1):
    feature_importance = metrics.f1_score(y_pred, y_test, average='weighted')

    # Append result to a dict:
    feature_performance_dict[feature] = feature_importance

print(feature_performance_dict)


{'N': 0.18333110866076713, 'P': 0.24827048711322613, 'K': 0.3269372490249437, 'ph': 0.11790819057728834}


## 3.3: Best predictive feature

In [38]:
# Retrive item with highest values:
# To sort from LOWEST score to HIGHEST score (ascending)
sorted_features = sorted(feature_performance_dict.items(), key=lambda item: item[1], reverse=True)
print(sorted_features)

best_predictive_feature = (sorted_features[0])

best_predictive_feature = dict([best_predictive_feature])
print(best_predictive_feature)

[('K', 0.3269372490249437), ('P', 0.24827048711322613), ('N', 0.18333110866076713), ('ph', 0.11790819057728834)]
{'K': 0.3269372490249437}
