# Prediction of Soil Viability for Sustainable Agriculture

🎯 The goal of this challenge is to train a model that classifies soils as viable or not for sustainable agriculture.

💡 As part of an initiative to promote sustainable agriculture worldwide, experiments were conducted at different locations.

Each experiment consisted of an analysis of the soil.  
The results of these analyses are our features.

After the analysis, a small agriculture project was launched at the location:    
- If the project was successful, the soil was labeled as viable.  
- If the project failed, the soil was labeled as not viable.  

The viability of the soil is our target.

💡 Small test projects were used for data collection, but the ambition is to launch projects of much larger scale.  

The cost and time investment for these large-scale projects are extremely high.  

🎯 To be valuable, our model should be right at least 90% of the time when it identifies a viable soil.

Here is a description of the fields:
- **id**: Unique identification number of the experiment
- **scientist**: Name of the scientist responsible for the experiment
- **measure_index**: Engineered measure of soil characteristics
- **measure_moisture**: Moisture level of the soil
- **measure_temperature**: Temperature of the soil, in degrees Celsius
- **measure_chemicals**: Index of the presence of chemicals in the soil
- **measure_biodiversity**: Index of biodiversity in the soil
- **measure_flora**: Index of flora diversity in the soil
- **main_element**: Symbol of the main chemical element found in the soil
- **past_agriculture**: Indicates the presence of past agriculture on the soil
- **soil_condition**: Overall indicator of the soil fertility
- **datetime_start**: Timestamp of the experiment's start
- **datetime_end**: Timestamp of the experiment's end
- **target**: Viability of the soil  
    - 1: means the soil was viable, i.e. the test project was a success  
    - 0: means the soil was not viable, i.e. the test project was a failure

## Data Collection

**📝 Load the csv provided at this URL: https://wagon-public-datasets.s3.amazonaws.com/certification/soils_viability/soils_viability_train.csv.**

In [1]:
# YOUR CODE HERE

**📝 Clean the dataset and store the resulting dataset in the `data` variable:**

In [6]:
# YOUR CODE HERE
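
As a hint for the cleaning step, here is a minimal sketch on a made-up frame. It assumes the issues to look for are exact duplicate rows and missing values; the real dataset may need a different policy, and the toy values below are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw CSV (values are invented for illustration).
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "measure_moisture": [0.31, 0.52, 0.52, np.nan, 0.47],
    "target": [1, 0, 0, 1, 0],
})

# Inspect the usual suspects before deciding what to clean.
n_duplicates = int(raw.duplicated().sum())
missing_per_column = raw.isna().sum()

# One possible policy (an assumption): drop exact duplicates, then rows with gaps.
data_toy = raw.drop_duplicates().dropna().reset_index(drop=True)
print(n_duplicates, data_toy.shape)
```

Whether to drop or impute missing values is a judgment call that depends on how many rows are affected.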

### 💾 Save your results

Run the cell below to save your results.

In [16]:
from nbresult import ChallengeResult
results = ChallengeResult(
    "data_cleaning",
    columns=data.columns,
    shape=data.shape,
    samples=data.loc[7000:,:]
)
results.write()

## Target, Baseline & Metrics

**📝 Check the number of target classes and their distribution.**

In [18]:
# YOUR CODE HERE
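
A short sketch of how class counts and class shares can be read with pandas. The series below is a toy stand-in for the `target` column, not the real data.

```python
import pandas as pd

# Toy target column (invented values): 1 = viable, 0 = not viable.
target_toy = pd.Series([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], name="target")

n_classes = target_toy.nunique()                        # number of distinct classes
class_counts = target_toy.value_counts()                # absolute counts per class
class_shares = target_toy.value_counts(normalize=True)  # proportions per class
print(class_shares)
```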

❓ Is the dataset balanced?

> YOUR ANSWER HERE

🎯 Recall our initial requirement:

**"To be valuable, our model should be right at least 90% of the time when it identifies a viable soil."**

📝 From the list of predefined scoring names proposed by [Scikit-learn](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values), pick the metric suited to this requirement and store its name in a variable `metric`.


In [19]:
# YOUR CODE HERE
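
One reading of the requirement, offered as an assumption rather than the official answer: "right when it identifies a viable soil" is the share of predicted positives that are truly positive, which is precision. The sketch below computes it by hand and with scikit-learn on invented labels.

```python
from sklearn.metrics import get_scorer, precision_score

# Invented labels: 1 = viable, 0 = not viable.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# precision = true positives / (true positives + false positives)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
precision_by_hand = tp / (tp + fp)

precision_sklearn = precision_score(y_true, y_pred)

# "precision" is one of scikit-learn's predefined scoring names.
scorer = get_scorer("precision")
print(precision_by_hand, precision_sklearn)
```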

**📝 Compute the baseline score and store the result as a floating number in the `baseline_score` variable.**


In [20]:
# YOUR CODE HERE
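
A possible baseline, assuming precision is the chosen metric: a naive model that labels every soil as viable. Its precision equals the share of viable soils in the data. Other baselines exist; a model that never predicts "viable" has undefined precision, so it is less useful here. The labels below are a toy example.

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy target (invented values): 3 viable soils out of 10.
y_toy = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# Naive model: always predict "viable".
y_always_viable = np.ones_like(y_toy)

baseline_toy = float(precision_score(y_toy, y_always_viable))
print(baseline_toy)
```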

**📝 Store the target in a variable named `y`.**

In [21]:
# YOUR CODE HERE

### 💾 Save your results

Run the cell below to save your results.

In [22]:
results = ChallengeResult(
    "baseline",
    metric=metric,
    baseline=baseline_score
)
results.write()

## Features

In [24]:
from sklearn import set_config; set_config(display='diagram')

**📝 Store the features in a DataFrame `X`.**


In [25]:
# YOUR CODE HERE

💡 Two features in there are useless.

- `id`: serves a technical need and does not carry any information.  
- `scientist`: almost all experiments were conducted by different scientists, and we assume they all followed the same protocol for the experiment.

**📝 Drop these two features.**

In [26]:
# YOUR CODE HERE

**📝 Create variables to store feature names according to their types.**

- `feat_num`: list of numerical feature names
- `feat_cat`: list of categorical feature names
- `feat_time`: list of time feature names

In [27]:
# YOUR CODE HERE
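
A sketch of one way to split column names by type. It assumes the two datetime columns are listed by hand and that every remaining non-numeric column is categorical; the toy frame and its values (including the element symbols) are invented.

```python
import pandas as pd

# Toy feature frame (invented values) with a subset of the described fields.
X_toy = pd.DataFrame({
    "measure_moisture": [0.31, 0.52],
    "measure_temperature": [18.5, 21.0],
    "main_element": ["Fe", "Ca"],
    "datetime_start": ["2020-01-01 08:00:00", "2020-02-01 09:30:00"],
    "datetime_end": ["2020-01-03 08:00:00", "2020-02-02 09:30:00"],
})

# Time columns are named explicitly, since they may be loaded as plain strings.
feat_time_toy = ["datetime_start", "datetime_end"]
# Numerical columns are detected from their dtype.
feat_num_toy = X_toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical (an assumption).
feat_cat_toy = [c for c in X_toy.columns if c not in feat_num_toy + feat_time_toy]
print(feat_num_toy, feat_cat_toy, feat_time_toy)
```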

💡 We will ignore date-like features for the basic preprocessing.

**📝 Create `X_basic` that contains only numerical and categorical features.**


In [28]:
# YOUR CODE HERE

### 💾 Save your results

Run the cell below to save your results.

In [29]:
from nbresult import ChallengeResult
result = ChallengeResult(
    "features",
    columns=X.columns,
    shape=X.shape,
    target=y.ndim
)
result.write()

## Preprocessing

In [31]:
from sklearn import set_config; set_config(display='diagram')

**📝 Scale and Encode your features.**

Prepare a ColumnTransformer that:
- Scales the numerical features between $0$ and $1$
- Encodes the categorical features

Store it in a variable `preprocessing_basic`.


In [33]:
# YOUR CODE HERE
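
A minimal sketch of a parallel scaler and encoder, assuming `MinMaxScaler` for the 0 to 1 scaling and `OneHotEncoder` for the categorical columns. The toy frame, column lists, and the name `preprocessing_sketch` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame (invented values).
X_toy = pd.DataFrame({
    "measure_moisture": [0.2, 0.4, 0.6],
    "measure_temperature": [10.0, 15.0, 20.0],
    "main_element": ["Fe", "Ca", "Fe"],
})
num_cols = ["measure_moisture", "measure_temperature"]
cat_cols = ["main_element"]

preprocessing_sketch = ColumnTransformer(
    [
        ("num", MinMaxScaler(), num_cols),                          # scale to [0, 1]
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # one column per category
    ],
    sparse_threshold=0,  # always return a dense array, easier to inspect
)

X_out = preprocessing_sketch.fit_transform(X_toy)
print(X_out)
```

`handle_unknown="ignore"` keeps the transformer from failing on categories that appear only in validation folds.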

## Linear Model

**📝 Cross-validate a linear model on `X_basic` to see how it compares to your baseline.**

Inside a pipeline, apply the basic preprocessing, then use a basic **linear** model with **no penalty**.

Cross-validate your pipeline and store the scores in `scores_linear` as a `numpy.ndarray`.

In [36]:
# YOUR CODE HERE

**❓ Does your model beat the baseline? Do you reach your goal?**

> YOUR ANSWER HERE

### 💾 Save your results

Run the cell below to save your results.

In [38]:
from nbresult import ChallengeResult
X_preproc=preprocessing_basic.fit_transform(X_basic)
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val = train_test_split(X_basic,y,test_size=0.3,random_state=10)
pipe=pipeline_linear.fit(X_,y_)

result = ChallengeResult(
    'basic_pipeline',
    preproc=preprocessing_basic,
    preproc_shape=X_preproc.shape,
    pipe=pipeline_linear,
    y=y_val,
    y_pred=pipeline_linear.predict(X_val),
    scores=scores_linear
)
result.write()

## Feature Engineering

💡 We are going to look more closely at the features and try to enhance our preprocessing.

### Enhanced `soil_condition` Encoding

**📝 Check the possible values of the feature `soil_condition`**

In [41]:
# YOUR CODE HERE

**❓ Can you think of a better way to encode the `soil_condition` feature?**

> YOUR ANSWER HERE

**📝 Select a transformer keeping a sense of the order of the values of `soil_condition` to encode that feature.** 

Encode `soil_condition` from `X` with that relevant encoder and store the result in `X_soil_condition_encoded` as a `numpy.ndarray`.

In [6]:
# YOUR CODE HERE
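
A sketch of an order-preserving encoding with `OrdinalEncoder`. The level names `"poor"`, `"average"`, `"good"` are hypothetical; the actual values of `soil_condition` must be read from the data and ordered accordingly.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with hypothetical levels (the real levels may differ).
soil_toy = pd.DataFrame({"soil_condition": ["poor", "good", "average", "good", "poor"]})

# Explicit order from least to most fertile, so the integers reflect the ranking.
ordered_levels = ["poor", "average", "good"]
encoder = OrdinalEncoder(categories=[ordered_levels])

# The encoder expects 2D input, hence the double brackets.
encoded_toy = encoder.fit_transform(soil_toy[["soil_condition"]])
print(encoded_toy.ravel())
```

Without the explicit `categories` argument, the encoder would sort the levels alphabetically, which would not match the fertility order.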

**📝 Make sure that it works properly.**

Check the value counts for the feature `soil_condition`

In [43]:
# YOUR CODE HERE

**📝 Check it again after transformation with the relevant encoder:**

In [46]:
# YOUR CODE HERE

### Custom Time Transformers

#### Datetime Features Extraction

💡 We want to extract two pieces of information from our time features:

📅 The `month` of the experiment's start

⏳ The `duration` of the experiment in an appropriate unit

**📝 Compute the `duration` of experiments, and look at the statistics.**

In [47]:
# YOUR CODE HERE
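
A sketch of computing durations with pandas, assuming the two timestamp columns are stored as strings that `pd.to_datetime` can parse. The timestamps below are invented.

```python
import pandas as pd

# Toy timestamps (invented values).
times_toy = pd.DataFrame({
    "datetime_start": ["2020-01-01 08:00:00", "2020-03-15 12:00:00"],
    "datetime_end": ["2020-01-03 20:00:00", "2020-03-16 00:00:00"],
})

start = pd.to_datetime(times_toy["datetime_start"])
end = pd.to_datetime(times_toy["datetime_end"])

duration = end - start                             # Series of Timedelta values
duration_hours = duration.dt.total_seconds() / 3600
summary = duration.describe()                      # count, mean, min, quartiles, max
print(summary)
```

Looking at the minimum, median, and maximum of the summary helps judge which unit gives readable numbers.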

**❓ What is the most appropriate time unit to use to describe the `duration` feature?**

**📝 Choose between `['days', 'hours', 'minutes', 'seconds']` and store your choice in the `duration_time_unit` variable:**

In [49]:
# YOUR CODE HERE

**📝 Create a `TimeFeaturesExtractor` class that transforms `datetime_start` and `datetime_end` into `month` and `duration`:**
- `month` as a number from 1 to 12
- `duration` as a float in the relevant `duration_time_unit`

In [50]:
# YOUR CODE HERE
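
A sketch of a custom scikit-learn transformer for this extraction. The class name `TimeFeaturesExtractorSketch` and the default unit `"hours"` are placeholders chosen for the example, not the expected answer; the input frame is invented.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TimeFeaturesExtractorSketch(BaseEstimator, TransformerMixin):
    """Turn datetime_start / datetime_end into month and duration columns."""

    def __init__(self, duration_time_unit="hours"):
        self.duration_time_unit = duration_time_unit

    def fit(self, X, y=None):
        return self  # nothing to learn, the transformer is stateless

    def transform(self, X, y=None):
        seconds_per_unit = {"days": 86400, "hours": 3600, "minutes": 60, "seconds": 1}
        start = pd.to_datetime(X["datetime_start"])
        end = pd.to_datetime(X["datetime_end"])
        duration = (end - start).dt.total_seconds() / seconds_per_unit[self.duration_time_unit]
        return pd.DataFrame({"month": start.dt.month, "duration": duration}, index=X.index)


# Toy input (invented timestamps).
X_toy = pd.DataFrame({
    "datetime_start": ["2020-01-01 08:00:00", "2020-03-15 12:00:00"],
    "datetime_end": ["2020-01-03 20:00:00", "2020-03-16 00:00:00"],
})
X_time_toy = TimeFeaturesExtractorSketch(duration_time_unit="hours").fit_transform(X_toy)
print(X_time_toy)
```

Inheriting from `TransformerMixin` provides `fit_transform`, and `BaseEstimator` provides `get_params`, which pipelines and grid searches rely on.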

**📝 Apply your `TimeFeaturesExtractor` to _100 rows_ of `X` and store the result in a DataFrame `X_time_features`**

Double check that it has **2 columns**: `month` and `duration`, and **100 rows**

In [51]:
# YOUR CODE HERE

#### Cyclical Encoding & Scaling

💡 We now have to encode and scale the extracted time features!  

You should scale the `duration` between 0 and 1.  

However we need to build a **Cyclical Encoder** for the `month`.

**📝 Create a `CyclicalEncoder` class that transforms `month` into `month_cos` and `month_sin`.**

Recall the equations:  

$month\_norm = 2\pi\frac{month}{12}$  
$month\_cos = \cos({month\_norm})$  
$month\_sin = \sin({month\_norm})$

In [52]:
# YOUR CODE HERE
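
The equations above can be sketched as a small transformer. The class name `CyclicalEncoderSketch` is a placeholder, and the input is assumed to be a DataFrame with a `month` column.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CyclicalEncoderSketch(BaseEstimator, TransformerMixin):
    """Map month (1 to 12) onto the unit circle as month_cos and month_sin."""

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X, y=None):
        month_norm = 2 * np.pi * X["month"] / 12
        return pd.DataFrame(
            {"month_cos": np.cos(month_norm), "month_sin": np.sin(month_norm)},
            index=X.index,
        )


months_toy = pd.DataFrame({"month": [3, 6, 12]})
encoded_months = CyclicalEncoderSketch().fit_transform(months_toy)
print(encoded_months.round(3))
```

With this encoding December and January end up close to each other, which a plain 1 to 12 integer would not express.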

**📝 Apply your `CyclicalEncoder` to `X_time_features` and store the result in a DataFrame `X_time_cyclical`.**

Double check that it has **2 columns**: `month_cos` and `month_sin`, and **100 rows**

In [53]:
# YOUR CODE HERE

**📝 Build a pipeline that contains all the steps for time features.**

Store it in a variable `preprocessing_time`

**Steps**

- Extraction of `month` and `duration` from  `datetime_start` and `datetime_end`  
- Scaling of `duration` between 0 and 1
- Cyclical encoding of `month`

In [54]:
# YOUR CODE HERE
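
A compact sketch of how the three steps can be chained: extraction first, then scaling and cyclical encoding in parallel. To stay short it uses `FunctionTransformer` with two helper functions as stand-ins for the custom classes; the helper names, the choice of hours, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler


def extract_time_features(X):
    """Stand-in for the extractor: returns a DataFrame with month and duration (hours)."""
    start = pd.to_datetime(X["datetime_start"])
    end = pd.to_datetime(X["datetime_end"])
    hours = (end - start).dt.total_seconds() / 3600
    return pd.DataFrame({"month": start.dt.month, "duration": hours}, index=X.index)


def encode_month(X):
    """Stand-in for the cyclical encoder: returns [cos, sin] columns."""
    angle = 2 * np.pi * np.asarray(X, dtype=float) / 12
    return np.hstack([np.cos(angle), np.sin(angle)])


time_pipeline_sketch = Pipeline([
    ("extract", FunctionTransformer(extract_time_features)),
    ("encode_scale", ColumnTransformer([
        ("duration", MinMaxScaler(), ["duration"]),             # scale to [0, 1]
        ("month", FunctionTransformer(encode_month), ["month"]),
    ])),
])

X_toy = pd.DataFrame({
    "datetime_start": ["2020-01-01 08:00:00", "2020-03-15 12:00:00", "2020-06-01 00:00:00"],
    "datetime_end": ["2020-01-03 20:00:00", "2020-03-16 00:00:00", "2020-06-02 12:00:00"],
})
X_time_toy = time_pipeline_sketch.fit_transform(X_toy)
print(X_time_toy.round(3))
```

The output columns are, in order, the scaled duration, the cosine of the month, and the sine of the month.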

### 💾 Save your results

Run the cell below to save your results.

In [56]:
from nbresult import ChallengeResult
results = ChallengeResult(
    'feature_engineering',
    x_soil_condition=X_soil_condition_encoded,
    X_time_features=X_time_features,
    X_time_cyclical= X_time_cyclical,
    X_time=preprocessing_time.fit_transform(X)
)
results.write()

## Advanced Pipeline

**📝  Build a full preprocessing pipeline and store it in `preprocessing_advanced`.**

Its steps, listed below, should run in parallel inside a ColumnTransformer:

- Scale all numerical features between 0 and 1
- Encode `main_element`  
- Better encode `soil_condition`
- Apply the `preprocessing_time` pipeline on `datetime_start` and `datetime_end`

In [58]:
# YOUR CODE HERE

## Regularized Linear Model

**📝 Build a pipeline that uses `preprocessing_advanced` and then a _Regularized Linear_ model.**

Cross-validate your pipeline and store the scores in a list `scores_regularized`

In [62]:
# YOUR CODE HERE

### 💾 Save your results

Run the cell below to save your results.

In [64]:
from nbresult import ChallengeResult
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val = train_test_split(X,y,test_size=0.3,random_state=7)
pipe=pipeline_regularized.fit(X_,y_)

result = ChallengeResult(
    'advanced_pipeline',
    steps=str(pipeline_regularized.steps),
    scores=scores_regularized,
    y=y_val,
    y_pred=pipeline_regularized.predict(X_val)
)
result.write()

## Dimensionality Reduction

**📝 Add a dimensionality reduction step as the last step of your `preprocessing_advanced`. Make sure your dimensionality reduction keeps _only 12 features_.**

In [67]:
# YOUR CODE HERE
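
A sketch of appending a reduction step to a preprocessing pipeline, assuming PCA as the technique (other reducers would fit the instruction too). A single `MinMaxScaler` stands in for the column-wise preprocessing, and the input is random data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Random stand-in for an already numeric feature matrix with 20 columns.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 20))

preprocessing_sketch = Pipeline([
    ("scale", MinMaxScaler()),          # placeholder for the real preprocessing
    ("reduce", PCA(n_components=12)),   # last step: keep 12 components
])

X_reduced = preprocessing_sketch.fit_transform(X_toy)
print(X_reduced.shape)
```

PCA needs dense input, so any one-hot encoder placed before it should produce dense output.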

**📝 Apply your `preprocessing_advanced` to `X` and store the result in the `X_preproc_adv` variable.**

In [68]:
# YOUR CODE HERE

### 💾 Save your results

Run the cell below to save your results.

In [69]:
from nbresult import ChallengeResult
results=ChallengeResult(
    'unsupervised',
    algorithm=preprocessing_advanced.steps[-1],
    X_preproc_adv=X_preproc_adv
)
results.write()

## Non-linear Model

**📝 Build a pipeline that uses `preprocessing_advanced` and then an _Ensemble_ model.**

Store this pipeline in the variable `pipeline_ensemble`

Cross-validate your pipeline and store the scores in a list `scores_ensemble`

In [73]:
# YOUR CODE HERE

**❓ Does this non-linear model satisfy the goal of the study?**

> YOUR ANSWER HERE

💡 Wait, did our feature engineering help us ❓

**📝 Build a pipeline that uses `preprocessing_basic` and the same Ensemble model as above.**

In [88]:
# YOUR CODE HERE

**❓ What is your conclusion?**

> YOUR ANSWER HERE

### 💾 Save your results

Run the cell below to save your results.

In [90]:
from nbresult import ChallengeResult
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val=train_test_split(X,y,test_size=0.3,random_state=7)
pipeline_ensemble.fit(X_,y_)
y_pred=pipeline_ensemble.predict(X_val)

results=ChallengeResult(
    'ensemble',
    steps=str(pipeline_ensemble.steps),
    scores=scores_ensemble,
    y=y_val,
    y_pred=y_pred
)
results.write()

## Fine-Tuning

💡 To improve the model as much as we can, it's time to grid search for optimal hyperparameters

**📝 Look at the hyperparameters of your estimator**

In [93]:
# YOUR CODE HERE

**📝 Try to fine tune some hyperparameters to improve your model!**

In [94]:
# YOUR CODE HERE

**📝 Store the _fitted_ grid search in the `search` variable:**

In [98]:
# YOUR CODE HERE

**📝 Store the _cross-validated results_ of your grid search in the `cv_results` variable:**

In [99]:
# YOUR CODE HERE

**📝 Store the _best model_ of your grid search in a variable `tuned_model`.**

In [100]:
# YOUR CODE HERE
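
A sketch of the grid search workflow, assuming a `RandomForestClassifier` as the ensemble model and precision as the scoring name. The grid values are arbitrary examples, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (not the soil dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# get_params lists every hyperparameter that can be tuned.
available_params = sorted(RandomForestClassifier().get_params())

# Small, arbitrary grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

search_sketch = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="precision",
    cv=3,
)
search_sketch.fit(X_toy, y_toy)

cv_results_sketch = search_sketch.cv_results_       # per-candidate scores and timings
best_model_sketch = search_sketch.best_estimator_   # refit on all data with best params
print(search_sketch.best_params_, search_sketch.best_score_)
```

When the estimator sits inside a pipeline, grid keys take the form `<step name>__<parameter>`.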

### 💾 Save your results

Run the cell below to save your results.

In [101]:
from nbresult import ChallengeResult
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val=train_test_split(X,y,test_size=0.3,random_state=10)
tuned_model.fit(X_,y_)

result = ChallengeResult(
    'model_tuning',
    scores_ensemble=scores_ensemble,
    scoring=search.scorer_,
    params=search.best_params_,
    cv_results=cv_results,
    y=y_val,
    y_pred=tuned_model.predict(X_val)
)
result.write()

## Prediction

**📝 Use your newly fine-tuned model to predict on a test set.**

Load the test set provided at this URL: "https://wagon-public-datasets.s3.amazonaws.com/certification/soils_viability/soils_viability_test.csv".

Create `X_test` and `y_test`

Use your fine-tuned model to predict on `X_test`

Print a full classification report with your prediction and `y_test`

In [104]:
import pandas as pd

df_test = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/certification/soils_viability/soils_viability_test.csv")

In [105]:
# YOUR CODE HERE
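
A sketch of producing and reading a classification report. The labels are invented; with real data, `y_test` and the model's predictions take their place.

```python
from sklearn.metrics import classification_report

# Invented labels: 1 = viable, 0 = not viable.
y_test_toy = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
y_pred_toy = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]

class_names = ["not viable", "viable"]  # order follows the sorted labels 0, 1

# Text version for display.
report_text = classification_report(y_test_toy, y_pred_toy, target_names=class_names)
print(report_text)

# Dictionary version, convenient for checking a single figure.
report_dict = classification_report(
    y_test_toy, y_pred_toy, target_names=class_names, output_dict=True
)
viable_precision = report_dict["viable"]["precision"]
```

The precision on the "viable" row is the figure to compare against the 90% requirement.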

**❓ Comment on your results:**

> YOUR ANSWER HERE

## API 

Time to put a pipeline in production!

👉 Go back to the certification interface and follow the instructions about the API challenge.

**This final part is independent of the notebook above.**