Python 3.8

Project Instructions

Identify the single feature that has the strongest predictive performance for classifying crop types.

    Find the feature in the dataset that produces the best score for predicting "crop".
    From this information, create a variable called best_predictive_feature, which:
        Should be a dictionary containing the best predictive feature name as a key and the evaluation score (for the metric you chose) as the value.

How to approach the project

1. Read the data into a pandas DataFrame and perform exploratory data analysis

Read in the "soil_measures.csv" file as a pandas DataFrame.
Read in a csv file

    You can use pd.read_csv() to read in a csv file.

Check for missing values

    You can chain the pandas DataFrame methods isna().sum() to count the number of null values in each column, helping you decide whether you need to drop or impute missing values. 

Check for crop types

    To confirm whether "crop" is a binary or multi-class target, you can use the pandas Series .unique() method to display all unique values in that column.
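
The three checks above can be sketched on a small inline table. The rows below are invented for illustration and merely stand in for "soil_measures.csv"; only the column names follow the dataset description.

```python
import io

import pandas as pd

# Hypothetical stand-in for soil_measures.csv (values invented for illustration)
csv_text = """N,P,K,ph,crop
90,42,43,6.5,rice
85,58,41,7.0,rice
20,67,20,5.7,kidneybeans
40,72,77,,chickpea
"""

# pd.read_csv() accepts a file path or any file-like object
crops = pd.read_csv(io.StringIO(csv_text))

# Count null values per column
missing_counts = crops.isna().sum()
print(missing_counts)

# List the distinct target values; more than two means a multi-class problem
crop_types = crops["crop"].unique()
print(crop_types)
```

With the real file, the same calls apply after replacing the inline text with the path to the csv.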

2. Split the data

Create training and test sets using all features.
Features and target variables

    Create a variable containing the features, all columns except "crop", and another variable containing only the "crop".

Use train_test_split()

    You can unpack the results of train_test_split() into four variables: X_train, X_test, y_train, and y_test.
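
A minimal sketch of the split, assuming a toy DataFrame with the same column names as the dataset (all values are made up). The test_size and random_state values are illustrative choices, not requirements.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; only the column names match the real dataset
crops = pd.DataFrame({
    "N": [90, 85, 60, 74, 78, 20, 40, 35, 28, 55],
    "P": [42, 58, 55, 35, 42, 67, 72, 60, 70, 44],
    "K": [43, 41, 44, 40, 42, 20, 77, 22, 80, 18],
    "ph": [6.5, 7.0, 7.8, 7.0, 7.6, 5.7, 6.1, 5.9, 6.3, 6.8],
    "crop": ["rice"] * 5 + ["maize"] * 5,
})

# Features: every column except the target
X = crops.drop(columns="crop")
# Target: only the "crop" column
y = crops["crop"]

# train_test_split() returns four objects in this order
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```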

3. Evaluate feature performance

Predict the crop using each feature individually. You should build a model for each feature. That means you will build four models.
Create a dictionary to store each feature's predictive performance

    Create an empty dictionary, e.g., feature_performance = {}.

Loop through the features

    You can train and evaluate the performance of each feature by looping through them using the syntax for feature in ["N", "P", "K", "ph"]:.

Training a multi-class classifier algorithm

    Inside of the for loop iterating over a list of features, you can call LogisticRegression() to create your model, assigning to the variable log_reg.
    You should set the multi_class argument to "multinomial" so that multi-class prediction is supported.
    Fit the model to the feature in X_train by subsetting it using double square brackets e.g., log_reg.fit(X_train[[feature]], y_train).
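
The double square brackets matter because scikit-learn estimators expect a two-dimensional feature input of shape (n_samples, n_features). A short sketch of the difference, using a made-up three-row DataFrame:

```python
import pandas as pd

# Hypothetical training features (values invented for illustration)
X_train = pd.DataFrame({"N": [90, 85, 60], "K": [43, 41, 44]})

# Single brackets return a one-dimensional Series
as_series = X_train["K"]
# Double brackets return a two-dimensional DataFrame with one column
as_frame = X_train[["K"]]

print(as_series.shape)
print(as_frame.shape)
```

Passing the one-column DataFrame to .fit() and .predict() keeps the input two-dimensional without any manual reshaping.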

Predicting target values using the test set

    You can use the model's .predict() method, subsetting the feature from X_test, to predict target values.
    Convention is to store the results as a variable called y_pred.

Evaluating the performance of each feature

    You can calculate F1 score, which is the harmonic mean of precision and recall, to evaluate feature performance.
    Alternatively, you can use metrics.balanced_accuracy_score().
    Scikit-learn's metrics.f1_score() function takes the target values, y_test, and the predicted values, y_pred, in order to calculate the F1 score.
    Set f1_score()'s average keyword argument to "weighted" when calculating performance for each feature.
    Assign the result of f1_score() to a variable, e.g., f1.
    If you created an empty dictionary called feature_performance outside the for loop where you build your models, you can add the feature-score key-value pairs to it using the syntax feature_performance[feature] = f1.
    You can use a print() statement with an f-string to output the feature and its score, for example, print(f"F1-score for {feature}: {f1}").
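
To make the "weighted" average concrete, here is a pure-Python sketch that computes an F1 score per class and then averages the scores, weighting each class by its share of the true labels. The helper weighted_f1 and the toy labels are hypothetical; the function is meant to illustrate the idea behind average="weighted", not to replace metrics.f1_score().

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighted by each class's count in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn)
        # F1 is the harmonic mean of precision and recall
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * count / total
    return score


# Toy labels invented for illustration
y_test = ["rice", "rice", "maize", "maize", "maize", "jute"]
y_pred = ["rice", "maize", "maize", "maize", "rice", "jute"]
print(weighted_f1(y_test, y_pred))
```

Because larger classes get larger weights, a model that does well only on rare crops scores lower than one that does well on common crops.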

4. Create the best_predictive_feature variable

Store the feature name as a key and the respective model's evaluation score as the value.
Saving the information

    Create a variable called best_predictive_feature.
    It should contain a single key-value pair.
    The key should be a string representing the name of the feature that produced the best model performance.
    The value should be the model's evaluation metric score. 
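
One way to build this variable without hard-coding the feature name is to let max() pick the key with the highest score. The sketch below assumes a feature_performance dictionary filled in step 3; the scores are the ones printed by the notebook run in this document.

```python
# F1 scores as printed by the notebook run in this document
feature_performance = {
    "N": 0.09149868209906838,
    "P": 0.14761942909728204,
    "K": 0.23896974566001802,
    "ph": 0.04532731061152114,
}

# max() iterates over the keys and compares them by their stored score
best_feature = max(feature_performance, key=feature_performance.get)

# Single key-value pair: feature name -> evaluation score
best_predictive_feature = {best_feature: feature_performance[best_feature]}
print(best_predictive_feature)
```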

# Sowing Success: How Machine Learning Helps Farmers Select the Best Crops

![Farmer in a field](farmer_in_a_field.jpg)

Measuring essential soil metrics such as nitrogen, phosphorus, and potassium levels and pH value is an important aspect of assessing soil condition. However, it can be an expensive and time-consuming process, so farmers may have to prioritize which metrics to measure based on their budget.

Farmers have various options when it comes to deciding which crop to plant each season. Their primary objective is to maximize the yield of their crops, taking into account different factors. One crucial factor that affects crop growth is the condition of the soil in the field, which can be assessed by measuring basic elements such as nitrogen and potassium levels. Each crop has an ideal soil condition that ensures optimal growth and maximum yield.

A farmer has reached out to you as a machine learning expert for assistance in selecting the best crop for their field. They've provided you with a dataset called `soil_measures.csv`, which contains:

- `"N"`: Nitrogen content ratio in the soil
- `"P"`: Phosphorus content ratio in the soil
- `"K"`: Potassium content ratio in the soil
- `"ph"`: pH value of the soil
- `"crop"`: categorical values that contain various crops (target variable).

Each row in this dataset represents various measures of the soil in a particular field. Based on these measurements, the crop specified in the `"crop"` column is the optimal choice for that field.  

In this project, you will build multi-class classification models to predict the type of `"crop"` and identify the single most important feature for predictive performance.

In [3]:
# All required libraries are imported here for you.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load the dataset
crops = pd.read_csv("soil_measures.csv")

# Check for missing values
crops.isna().sum()

# Check how many crops we have, i.e., multi-class target
crops.crop.unique()

# Split into feature and target sets
X = crops.drop(columns="crop")
y = crops["crop"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

# Create a dictionary to store the model performance for each feature
feature_performance = {}

# Train a logistic regression model for each feature
for feature in ["N", "P", "K", "ph"]:
    log_reg = LogisticRegression(multi_class="multinomial")
    log_reg.fit(X_train[[feature]], y_train)
    y_pred = log_reg.predict(X_test[[feature]])
    
    # Calculate F1 score, the harmonic mean of precision and recall
    # Could also use balanced_accuracy_score
    f1 = metrics.f1_score(y_test, y_pred, average="weighted")
    
    # Add feature-f1 score pairs to the dictionary
    feature_performance[feature] = f1
    print(f"F1-score for {feature}: {f1}")

# K produced the best F1 score
# Store in best_predictive_feature dictionary
best_predictive_feature = {"K": feature_performance["K"]}
best_predictive_feature

F1-score for N: 0.09149868209906838
F1-score for P: 0.14761942909728204
F1-score for K: 0.23896974566001802
F1-score for ph: 0.04532731061152114


{'K': 0.23896974566001802}

In [6]:
print(crops.head(100))
# Congratulations, you completed the project!

     N   P   K        ph  crop
0   90  42  43  6.502985  rice
1   85  58  41  7.038096  rice
2   60  55  44  7.840207  rice
3   74  35  40  6.980401  rice
4   78  42  42  7.628473  rice
..  ..  ..  ..       ...   ...
95  88  46  42  6.604993  rice
96  93  47  37  6.500343  rice
97  60  55  45  5.935745  rice
98  78  35  44  7.072656  rice
99  65  37  40  5.333323  rice

[100 rows x 5 columns]
