# 4. Predictive Modelling for Agriculture

Dive into agriculture using supervised machine learning and feature selection to aid farmers in crop cultivation and solve real-world problems.

## Project Description

A farmer has reached out to you, a machine learning expert, for help selecting the best crop for his field. Because of budget constraints, he can afford to measure only one of the four essential soil measures:

* ``Nitrogen`` content ratio in the soil
* ``Phosphorus`` content ratio in the soil
* ``Potassium`` content ratio in the soil
* ``pH`` value of the soil

You recognize this as a classic feature selection problem: the objective is to pick the single most important feature for predicting the crop accurately. Can you help him?

## Project Instructions

Identify the single feature that has the strongest predictive performance for classifying crop types.

* Find the feature in the dataset that produces the best score for predicting ``"crop"``.
* From this information, create a variable called ``best_predictive_feature``, which:
    * Should be a ``dictionary`` containing the best predictive feature name as a key and the evaluation score (for the metric you chose) as the value.
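
As a rough illustration of the expected shape, the sketch below checks that a candidate value is a dictionary with exactly one feature-score pair. The helper name `is_valid_best_feature` is hypothetical, the feature names are taken from the loop suggested in the guide, and the score shown is a placeholder rather than a real result.

```python
# Hypothetical helper (not part of the project): checks the shape that
# best_predictive_feature is expected to have.
def is_valid_best_feature(candidate, feature_names=("N", "P", "K", "ph")):
    """Return True if candidate is a dict holding exactly one feature-score pair."""
    if not isinstance(candidate, dict) or len(candidate) != 1:
        return False
    (name, score), = candidate.items()
    # The key must be one of the soil measures and the value a numeric score.
    return name in feature_names and isinstance(score, (int, float))


# Placeholder score for illustration only; this is not a real result.
example = {"K": 0.5}
print(is_valid_best_feature(example))
```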

## Guides

### How to approach the project

1. Read the data into a pandas DataFrame and perform exploratory data analysis
2. Split the data
3. Evaluate feature performance
4. Create the best_predictive_feature variable

### Steps

1. Read the data into a pandas DataFrame and perform exploratory data analysis.
    * Read in the ``"soil_measures.csv"`` file as a pandas DataFrame.
    * Read in a CSV file
        * You can use ``pd.read_csv()`` to read in a CSV file.
    * Check for missing values
        * You can chain the pandas DataFrame methods ``isna().sum()`` to count the number of null values in each column, helping you decide whether you need to drop or impute missing values.
    * Check for crop types
        * To confirm whether ``"crop"`` is a binary or multi-class target, you can use the pandas Series ``.unique()`` method to display all unique values in that column.

2. Split the data
    * Create training and test sets using all features.
        * Features and target variables
            * Create a variable containing the features, all columns except ``"crop"``, and another variable containing only the ``"crop"``.
        * Use ``train_test_split()``
            * You can unpack the results of ``train_test_split()`` into four variables: ``X_train``, ``X_test``, ``y_train``, and ``y_test``.
3. Evaluate feature performance
    * Predict the crop using each feature individually. You should build a model for each feature. That means you will build four models.
        * Create a dictionary to store each feature's predictive performance
            * Create an empty dictionary, e.g., ``features_dict = {}``.
        * Loop through the features
            * You can train and evaluate the performance of each feature by looping through them using the syntax ``for feature in ["N", "P", "K", "ph"]:``.
        * Training a multi-class classifier algorithm
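            * Inside the for loop that iterates over the list of features, you can call ``LogisticRegression()`` to create your model, assigning it to the variable ``log_reg``.
            * In scikit-learn versions before 1.5, you can set the ``multi_class`` argument to ``"multinomial"`` so that multi-class prediction is supported. From version 1.5 the argument is deprecated, and multinomial is used by default for problems with three or more classes.
            * Fit the model to the feature in ``X_train`` by subsetting it using double square brackets, which keeps the data two-dimensional as scikit-learn expects, e.g., ``log_reg.fit(X_train[[feature]], y_train)``.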
        * Predicting target values using the test set
            * You can use the model's ``.predict()`` method, subsetting the feature from ``X_test``, to predict target values.
            * Convention is to store the results as a variable called ``y_pred``.
        * Evaluating the performance of each feature
            * You can calculate the F1 score, which is the harmonic mean of precision and recall, to evaluate feature performance.
            * Scikit-learn's ``metrics.f1_score()`` function takes the target values, ``y_test``, and the predicted values, ``y_pred``, and returns the F1 score.
            * Set ``f1_score()``'s keyword argument ``average`` to ``"weighted"`` when calculating performance for each feature.
            * Alternatively, you can use ``metrics.balanced_accuracy_score()``.
            * Assign the result of ``f1_score()`` to a variable called ``feature_performance``.
            * Add each feature-performance key-value pair to the ``features_dict`` dictionary you created outside the for loop, using the syntax ``features_dict[feature] = feature_performance``.
            * You can use a ``print()`` statement with an f-string to output the feature and its performance, for example, ``print(f"F1-score for {feature}: {feature_performance}")``.
4. Create the best_predictive_feature variable
    * Store the feature name as a key and the respective model's evaluation score as the value
        * Saving the information
            * Create a variable called ``best_predictive_feature``.
            * It should contain a single key-value pair.
            * The key should be a string representing the name of the feature that produced the best model performance.
            * The value should be the model's evaluation metric score.
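
The four steps above can be sketched end to end as follows. This is one possible approach, not a reference solution. So that it runs without the project file, it builds a small synthetic stand-in for ``"soil_measures.csv"``; the crop names, means, and spreads are invented for illustration. The ``random_state``, ``test_size``, ``stratify``, and ``max_iter`` settings are also assumptions rather than project requirements.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Step 1: load the data. The synthetic frame below is an assumption so the
# sketch runs anywhere; with the real file, replace this block with
#   crops = pd.read_csv("soil_measures.csv")
rng = np.random.default_rng(42)
n_per_crop = 100
crop_means = {  # invented values for illustration, not real agronomy
    "rice": {"N": 80, "P": 48, "K": 40, "ph": 6.4},
    "maize": {"N": 78, "P": 47, "K": 20, "ph": 6.3},
    "lentil": {"N": 19, "P": 68, "K": 60, "ph": 6.5},
}
frames = []
for crop, means in crop_means.items():
    frame = pd.DataFrame(
        {
            col: rng.normal(mean, 0.5 if col == "ph" else 5.0, n_per_crop)
            for col, mean in means.items()
        }
    )
    frame["crop"] = crop
    frames.append(frame)
crops = pd.concat(frames, ignore_index=True)

# Exploratory checks: missing values per column and the distinct crop labels.
print(crops.isna().sum())
print(crops["crop"].unique())

# Step 2: split into features and target, then into training and test sets.
X = crops.drop(columns="crop")
y = crops["crop"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 3: train one model per feature and record its weighted F1 score.
features_dict = {}
for feature in ["N", "P", "K", "ph"]:
    # multi_class is left unset: recent scikit-learn versions deprecate it and
    # use a multinomial model by default for three or more classes.
    log_reg = LogisticRegression(max_iter=2000)
    log_reg.fit(X_train[[feature]], y_train)
    y_pred = log_reg.predict(X_test[[feature]])
    feature_performance = f1_score(y_test, y_pred, average="weighted")
    features_dict[feature] = feature_performance
    print(f"F1-score for {feature}: {feature_performance}")

# Step 4: keep only the best-scoring feature as a single key-value pair.
best_name = max(features_dict, key=features_dict.get)
best_predictive_feature = {best_name: features_dict[best_name]}
print(best_predictive_feature)
```

Scores from the synthetic data say nothing about the real dataset; they only show that the loop and the final dictionary behave as described.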