# Prediction of Soil Viability for Sustainable Agriculture

🎯 The goal of this challenge is to train a model that classifies soils as viable or not for sustainable agriculture.

💡 As part of an initiative to promote sustainable agriculture worldwide, experiments were made at different locations.

Each experiment consisted of a soil analysis.  
The results of these analyses are our features.

After the analysis, a small agriculture project was launched at the location:    
- If the project was successful, the soil was labeled as viable.  
- If the project failed, the soil was labeled as not viable.  

The viability of the soil is our target.

💡 Small test projects were used for data collection, but the ambition is to launch projects of much larger scale.  

The cost and time investment of these large-scale projects are extremely high.  

🎯 To be valuable, our model should be right at least 90% of the time when it identifies a viable soil.

Here is a description of the fields:
- **id**: Unique identification number of the experiment
- **scientist**: Name of the scientist responsible for the experiment
- **measure_index**: Engineered measure of soil characteristics
- **measure_moisture**: Moisture level of the soil
- **measure_temperature**: Temperature of the soil, in degrees Celsius
- **measure_chemicals**: Index of the presence of chemicals in the soil
- **measure_biodiversity**: Index of biodiversity in the soil
- **measure_flora**: Index of the diversity of flora in the soil
- **main_element**: Symbol of the main chemical element found in the soil
- **past_agriculture**: Indicates the presence of past agriculture on the soil
- **soil_condition**: Overall indicator of the soil fertility
- **datetime_start**: Timestamp of experiment's start 
- **datetime_end**: Timestamp of experiment's end
- **target**: Viability of the soil  
    - 1: means the soil was viable, i.e. the test project was a success  
    - 0: means the soil was not viable, i.e. the test project was a failure

In [1]:
# import 
import numpy as np
import pandas as pd


from sklearn.impute import SimpleImputer

## Data Collection

**📝 Load the csv provided at this URL: https://wagon-public-datasets.s3.amazonaws.com/certification/soils_viability/soils_viability_train.csv.**

In [2]:
data=pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/certification/soils_viability/soils_viability_train.csv")

**📝 Clean the dataset and store the resulting dataset in the `data` variable:**

In [4]:
data.shape

(8302, 14)

In [3]:
data.drop_duplicates(inplace=True)

In [4]:
data.shape

(8138, 14)

In [5]:
data.isnull().sum().sort_values(ascending=False)/len(data)

measure_flora           0.990538
past_agriculture        0.402802
measure_chemicals       0.002212
measure_biodiversity    0.002212
measure_temperature     0.002089
measure_moisture        0.000860
id                      0.000000
scientist               0.000000
measure_index           0.000000
main_element            0.000000
soil_condition          0.000000
datetime_start          0.000000
datetime_end            0.000000
target                  0.000000
dtype: float64

__measure_flora:__ too many missing values (about 99%) for SimpleImputer, so the column is dropped

In [6]:
data.measure_flora.value_counts()

58.823740    1
17.563939    1
6.998023     1
24.147078    1
16.775127    1
            ..
4.137682     1
40.549023    1
4.976018     1
0.854240     1
12.647071    1
Name: measure_flora, Length: 77, dtype: int64

In [7]:
data.drop(columns=["measure_flora"],inplace=True)

In [8]:
data.shape

(8138, 13)

__past_agriculture:__ same issue, too many missing values (about 40%), so the column is dropped

In [9]:
data.past_agriculture.value_counts()

no     3247
yes    1613
Name: past_agriculture, dtype: int64

In [10]:
data.drop(columns=["past_agriculture"],inplace=True)
#data.past_agriculture.replace(np.nan, "no", inplace=True)

__measure_chemicals :__ 

In [11]:
imputer = SimpleImputer(strategy="median")

In [12]:
imputer.fit(data[['measure_chemicals']]) # Call the "fit" method on the object

data['measure_chemicals'] = imputer.transform(data[['measure_chemicals']]) # Call the "transform" method on the object

__measure_biodiversity :__ 

In [13]:
imputer = SimpleImputer(strategy="median")

imputer.fit(data[['measure_biodiversity']]) # Call the "fit" method on the object

data['measure_biodiversity'] = imputer.transform(data[['measure_biodiversity']]) # Call the "transform" method on the object

__measure_temperature :__

In [14]:
imputer = SimpleImputer(strategy="median")

imputer.fit(data[['measure_temperature']]) # Call the "fit" method on the object

data['measure_temperature'] = imputer.transform(data[['measure_temperature']]) # Call the "transform" method on the object

__measure_moisture__ :

In [15]:
imputer = SimpleImputer(strategy="median")

imputer.fit(data[['measure_moisture']]) # Call the "fit" method on the object

data['measure_moisture'] = imputer.transform(data[['measure_moisture']]) # Call the "transform" method on the object

In [16]:
data.isnull().sum().sort_values(ascending=False)/len(data)

id                      0.0
scientist               0.0
measure_index           0.0
measure_moisture        0.0
measure_temperature     0.0
measure_chemicals       0.0
measure_biodiversity    0.0
main_element            0.0
soil_condition          0.0
datetime_start          0.0
datetime_end            0.0
target                  0.0
dtype: float64
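
The four median imputations above repeat the same steps. Since `SimpleImputer` computes one statistic per column, they could plausibly be done in a single call. A minimal sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data standing in for the numerical columns with missing values
toy = pd.DataFrame({
    "measure_moisture": [10.0, np.nan, 30.0],
    "measure_temperature": [15.0, 17.0, np.nan],
})

cols = ["measure_moisture", "measure_temperature"]
imputer = SimpleImputer(strategy="median")
# fit_transform learns one median per column and fills the NaNs with it
toy[cols] = imputer.fit_transform(toy[cols])
print(toy)
```

Each column is filled with its own median, so the result should match what separate imputers would produce.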

In [17]:
data.head()

| | id | scientist | measure_index | measure_moisture | measure_temperature | measure_chemicals | measure_biodiversity | main_element | soil_condition | datetime_start | datetime_end | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 493 | Kathryn Owens | 1.875085 | 24.442232 | 18.510316 | 5.715697 | 521.074105 | Na | normal | 2017-06-27 16:53:42 | 2017-06-27 20:05:36 | 1 |
| 1 | 2340 | Andrea Pratt | 7.658911 | 30.121175 | 17.05025 | 1.973804 | 314.443474 | Ca | rich | 2018-12-10 07:06:56 | 2018-12-10 11:43:29 | 1 |
| 2 | 5434 | Kaitlyn Jackson | 18.000212 | 34.188025 | 17.157393 | 3.658506 | 361.79618 | Al | normal | 2018-10-04 18:45:29 | 2018-10-04 23:20:38 | 0 |
| 3 | 2304 | Brett Rosario | 4.056764 | 37.462768 | 13.275961 | 6.666983 | 402.016494 | Ca | normal | 2018-10-03 08:03:36 | 2018-10-03 10:56:40 | 0 |
| 4 | 1911 | Craig Thompson | 53.271676 | 31.425482 | 17.433458 | 1.940748 | 978.383654 | Si | poor | 2018-07-20 09:27:34 | 2018-07-20 13:48:30 | 0 |


### 💾 Save your results

Run the cell below to save your results.

In [18]:
from nbresult import ChallengeResult
results = ChallengeResult(
    "data_cleaning",
    columns=data.columns,
    shape=data.shape,
    samples=data.loc[7000:,:]
)
results.write()

## Target, Baseline & Metrics

**📝 Check the number of target classes and their distribution.**

In [19]:
data.target.value_counts()

1    4092
0    4046
Name: target, dtype: int64

❓ Is the dataset balanced?

Yes: the two classes are almost evenly represented (4092 viable vs. 4046 not viable).

🎯 Recall our initial requirement:

**"To be valuable, our model should be right at least 90% of the time when it predicts a viable soil."**

📝 Store the name of the metric we should use for this purpose in a variable `metric` from the list proposed by [Scikit-learn](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values).


In [64]:
metric="precision"
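
Precision is the share of soils predicted viable that really are viable, TP / (TP + FP), which is what the requirement above describes. A small sketch with made-up labels (the lists `y_true` and `y_pred` are hypothetical) comparing a manual count with `precision_score`:

```python
from sklearn.metrics import precision_score

# Hypothetical labels: 1 = viable, 0 = not viable
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]

# Count true positives and false positives by hand
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
manual_precision = tp / (tp + fp)

# Both values should agree
print(manual_precision, precision_score(y_true, y_pred))
```

Note that the missed viable soil (a false negative) does not affect precision; it would only lower recall.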

**📝 Compute the baseline score and store the result as a floating number in the `baseline_score` variable.**


In [83]:
from sklearn.metrics import precision_score

y_dummy=np.ones(len(data["target"]))
baseline_score=precision_score(data["target"],y_dummy)

In [84]:
baseline_score

0.5028262472351929
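
The same "always predict viable" baseline can also be expressed with scikit-learn's `DummyClassifier`, which makes the rule explicit. A sketch on made-up labels (`X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score

# Hypothetical target: 3 viable soils out of 5
y_toy = np.array([1, 0, 1, 1, 0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict class 1, whatever the input
dummy = DummyClassifier(strategy="constant", constant=1)
dummy.fit(X_toy, y_toy)

# With every sample predicted positive, precision equals the share of positives
baseline_toy = precision_score(y_toy, dummy.predict(X_toy))
print(baseline_toy)
```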

**📝 Store the target in a variable named `y`.**

In [85]:
y=data["target"]

### 💾 Save your results

Run the cell below to save your results.

In [86]:
results = ChallengeResult(
    "baseline",
    metric=metric,
    baseline=baseline_score
)
results.write()

## Features

In [87]:
from sklearn import set_config; set_config(display='diagram')

**📝 Store the features in a DataFrame `X`.**


In [340]:
X=data.drop(columns=["target"])

💡 Two features in there are useless.

- `id`: serves a technical need and does not carry any information.  
- `scientist`: almost all experiments were conducted by different scientists, we assume they all followed the same protocol for the experiment.

**📝 Drop these two features.**

In [341]:
X.drop(columns=["id","scientist"],inplace=True)

**📝 Create variables to store feature names according to their types.**

- `feat_num`: list of the numerical feature names
- `feat_cat`: list of the categorical feature names
- `feat_time`: list of the time feature names

In [91]:
feat_num=X.select_dtypes(include='float64').columns.tolist()

In [92]:
feat_cat=["main_element","soil_condition"]  # named explicitly rather than by position among the object columns

In [93]:
feat_time=["datetime_start","datetime_end"]

💡 We will ignore date-like features for the basic preprocessing.

**📝 Create `X_basic` that contains only numerical and categorical features.**


In [94]:
X_basic=X[feat_num+feat_cat]

In [95]:
X_basic.head()

| | measure_index | measure_moisture | measure_temperature | measure_chemicals | measure_biodiversity | main_element | soil_condition |
|---|---|---|---|---|---|---|---|
| 0 | 1.875085 | 24.442232 | 18.510316 | 5.715697 | 521.074105 | Na | normal |
| 1 | 7.658911 | 30.121175 | 17.05025 | 1.973804 | 314.443474 | Ca | rich |
| 2 | 18.000212 | 34.188025 | 17.157393 | 3.658506 | 361.79618 | Al | normal |
| 3 | 4.056764 | 37.462768 | 13.275961 | 6.666983 | 402.016494 | Ca | normal |
| 4 | 53.271676 | 31.425482 | 17.433458 | 1.940748 | 978.383654 | Si | poor |


### 💾 Save your results

Run the cell below to save your results.

In [96]:
from nbresult import ChallengeResult
result = ChallengeResult(
    "features",
    columns=X.columns,
    shape=X.shape,
    target=y.ndim
)
result.write()

## Preprocessing

In [97]:
from sklearn import set_config; set_config(display='diagram')

**📝 Scale and Encode your features.**

Prepare a ColumnTransformer that:
- Scales the numerical features between $0$ and $1$
- Encodes the categorical features

Store it in a variable `preprocessing_basic`


In [98]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer,make_column_selector
from sklearn.preprocessing import OneHotEncoder

In [99]:
num_transformer=MinMaxScaler()

In [100]:
cat_transformer = OneHotEncoder(handle_unknown='ignore')

In [101]:
preprocessing_basic=ColumnTransformer([
    ('num_transformer', num_transformer, make_column_selector(dtype_include=['float64'])),
    ('cat_transformer', cat_transformer, make_column_selector(dtype_include=['object']))],
    remainder='passthrough')


In [102]:
preprocessing_basic
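
To see what such a transformer produces, here is a minimal sketch on a made-up frame (the `toy` DataFrame is hypothetical). It selects the categorical column with `dtype_exclude="number"` and forces a dense output with `sparse_threshold=0`; both are choices made for this illustration, not settings used in the notebook:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up rows mimicking two of the dataset's columns
toy = pd.DataFrame({
    "measure_moisture": [10.0, 20.0, 30.0],
    "soil_condition": ["poor", "rich", "poor"],
})

preproc = ColumnTransformer(
    [
        ("num", MinMaxScaler(), make_column_selector(dtype_include="float64")),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude="number")),
    ],
    sparse_threshold=0,  # always return a dense array, easier to print
)

out = preproc.fit_transform(toy)
# First column: moisture scaled to [0, 1]; then one column per category
print(out)
```

The output columns follow the order of the transformers: the scaled numerical column first, then the one-hot columns in alphabetical order of the categories.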

## Linear Model

**📝 Cross-validate a linear model on `X_basic` to see how it compares to your baseline.**

Inside a pipeline, apply the basic preprocessing, then use a basic **linear** model with **no penalty**.

Cross-validate your pipeline and store the scores in `scores_linear` as a `numpy.ndarray`.

In [372]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_basic,y, test_size=0.3, random_state=0)

In [116]:
from sklearn.linear_model import LogisticRegression
pipeline_linear = Pipeline([
    ('preprocessing', preprocessing_basic),
    ('log_regression', LogisticRegression(penalty="none"))])  # scikit-learn >= 1.2: use penalty=None instead of the string "none"


In [117]:
pipeline_linear

In [198]:
from sklearn.model_selection import cross_val_score

# Cross validate pipeline
scores_linear=cross_val_score(pipeline_linear, X_train, y_train, cv=5, scoring='precision')

In [199]:
scores_linear

array([0.78666667, 0.80576923, 0.80426357, 0.78666667, 0.77573529])

**❓ Does your model beat the baseline? Do you reach your goal?**

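Yes, the model beats the baseline: the cross-validated precision is about 0.79, against 0.50 for the baseline. It does not reach the 90% goal yet.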

### 💾 Save your results

Run the cell below to save your results.

In [120]:
from nbresult import ChallengeResult
X_preproc=preprocessing_basic.fit_transform(X_basic)
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val = train_test_split(X_basic,y,test_size=0.3,random_state=10)
pipe=pipeline_linear.fit(X_,y_)

result = ChallengeResult(
    'basic_pipeline',
    preproc=preprocessing_basic,
    preproc_shape=X_preproc.shape,
    pipe=pipeline_linear,
    y=y_val,
    y_pred=pipeline_linear.predict(X_val),
    scores=scores_linear
)
result.write()

## Feature Engineering

💡 We are going to look more closely at the features and try to enhance our preprocessing.

### Enhanced `soil_condition` Encoding

**📝 Check the possible values of the feature `soil_condition`**

In [162]:
X_basic.soil_condition.unique()

array(['normal', 'rich', 'poor'], dtype=object)

**❓ Can you think of a better way to encode the `soil_condition` feature?**

Ordinal (numeric) encoding with `OrdinalEncoder`, since the values have a natural order: poor < normal < rich.

**📝 Select a transformer keeping a sense of the order of the values of `soil_condition` to encode that feature.** 

Encode `soil_condition` from `X` with that relevant encoder and store the result in `X_soil_condition_encoded` as a `numpy.ndarray`.

In [359]:
from sklearn.preprocessing import OrdinalEncoder

lab_enc = OrdinalEncoder(categories=[["poor","normal","rich"]])
X_soil_condition_encoded= lab_enc.fit_transform(X[["soil_condition"]])

In [360]:
X_soil_condition_encoded

array([[1.],
       [2.],
       [1.],
       ...,
       [2.],
       [0.],
       [1.]])

**📝 Make sure that it works properly.**

Check the value counts for the feature `soil_condition`

In [361]:
X.soil_condition.value_counts()

normal    4076
poor      2456
rich      1606
Name: soil_condition, dtype: int64

**📝 Check it again,  after transformation with the relevant encoder:**

In [362]:
pd.DataFrame(X_soil_condition_encoded).value_counts()

1.0    4076
0.0    2456
2.0    1606
dtype: int64
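
The counts above match, which suggests the mapping is poor → 0, normal → 1, rich → 2. A short sketch on three made-up values (the `values` array is hypothetical) shows the mapping directly and checks that it can be reversed:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

order = ["poor", "normal", "rich"]
enc = OrdinalEncoder(categories=[order])

# One column, three rows
values = np.array([["normal"], ["rich"], ["poor"]], dtype=object)
codes = enc.fit_transform(values)

# The code of each value is its position in `order`
for value, code in zip(values.ravel(), codes.ravel()):
    print(value, "->", int(code))

# inverse_transform recovers the original labels
print(enc.inverse_transform(codes).ravel())
```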

### Custom Time Transformers

#### Datetime Features Extraction

💡  We want to extract two pieces of information from our time features

📅 The `month` of the experiment's start

⏳ The `duration` of the experiment in an appropriate unit

**📝 Compute the `duration` of experiments, and look at the statistics.**

In [122]:
X.head()

| | measure_index | measure_moisture | measure_temperature | measure_chemicals | measure_biodiversity | main_element | soil_condition | datetime_start | datetime_end |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.875085 | 24.442232 | 18.510316 | 5.715697 | 521.074105 | Na | normal | 2017-06-27 16:53:42 | 2017-06-27 20:05:36 |
| 1 | 7.658911 | 30.121175 | 17.05025 | 1.973804 | 314.443474 | Ca | rich | 2018-12-10 07:06:56 | 2018-12-10 11:43:29 |
| 2 | 18.000212 | 34.188025 | 17.157393 | 3.658506 | 361.79618 | Al | normal | 2018-10-04 18:45:29 | 2018-10-04 23:20:38 |
| 3 | 4.056764 | 37.462768 | 13.275961 | 6.666983 | 402.016494 | Ca | normal | 2018-10-03 08:03:36 | 2018-10-03 10:56:40 |
| 4 | 53.271676 | 31.425482 | 17.433458 | 1.940748 | 978.383654 | Si | poor | 2018-07-20 09:27:34 | 2018-07-20 13:48:30 |


In [124]:
X["datetime_start"]=pd.to_datetime(X["datetime_start"],format="%Y-%m-%d %H:%M:%S")

In [125]:
X["datetime_end"]=pd.to_datetime(X["datetime_end"],format="%Y-%m-%d %H:%M:%S")

In [128]:
duration=X["datetime_end"]-X["datetime_start"]

In [140]:
duration/ np.timedelta64(1, 'h')

0       3.198333
1       4.609167
2       4.585833
3       2.884444
4       4.348889
          ...   
8297    2.016944
8298    3.539722
8299    2.437222
8300    2.835278
8301    4.770833
Length: 8138, dtype: float64

**❓ What is the most accurate time unit to use to describe the `duration` feature?**

**📝 Choose between `['days', 'hours', 'minutes', 'seconds']` and store your choice in the `duration_time_unit` variable:**

In [141]:
duration_time_unit="hours"

**📝 Create a `TimeFeaturesExtractor` class that transforms `datetime_start` and `datetime_end` into `month` and `duration`:**
- `month` as a number from 1 to 12
- `duration` as a float in the relevant `duration_time_unit`

In [273]:
from sklearn.base import TransformerMixin, BaseEstimator

class TimeFeaturesExtractor(TransformerMixin, BaseEstimator): 
# TransformerMixin generates a fit_transform method from fit and transform
# BaseEstimator generates get_params and set_params methods
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_start = pd.to_datetime(X["datetime_start"],format="%Y-%m-%d %H:%M:%S")
        X_end = pd.to_datetime(X["datetime_end"],format="%Y-%m-%d %H:%M:%S")
        X_transformed=pd.DataFrame({"month":X_start.dt.month})
        X_transformed['duration']=(X_end-X_start)/np.timedelta64(1, 'h')
        # Return result as dataframe for integration into ColumnTransformer
        return X_transformed


**📝 Apply your `TimeFeaturesExtractor` to _100 rows_ of `X` and store the result in a DataFrame `X_time_features`**

Double check that it has **2 columns**: `month` and `duration`, and **100 rows**

In [274]:
X_time_features=TimeFeaturesExtractor().fit_transform(X[["datetime_start","datetime_end"]].head(100))

In [275]:
X_time_features.shape

(100, 2)

In [276]:
X_time_features.head()

| | month | duration |
|---|---|---|
| 0 | 6 | 3.198333 |
| 1 | 12 | 4.609167 |
| 2 | 10 | 4.585833 |
| 3 | 10 | 2.884444 |
| 4 | 7 | 4.348889 |
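
As a sanity check on the extraction logic, here is a sketch on two made-up timestamps (the `toy` frame is hypothetical). It uses `Series.dt.total_seconds()`, which should be equivalent to dividing by `np.timedelta64(1, 'h')`:

```python
import pandas as pd

# Hypothetical start/end timestamps in the dataset's format
toy = pd.DataFrame({
    "datetime_start": ["2018-07-20 09:00:00", "2018-12-10 07:30:00"],
    "datetime_end":   ["2018-07-20 12:30:00", "2018-12-10 09:00:00"],
})

start = pd.to_datetime(toy["datetime_start"], format="%Y-%m-%d %H:%M:%S")
end = pd.to_datetime(toy["datetime_end"], format="%Y-%m-%d %H:%M:%S")

features = pd.DataFrame({
    "month": start.dt.month,
    # Timedelta -> seconds -> hours
    "duration": (end - start).dt.total_seconds() / 3600,
})
print(features)
```

The first experiment lasts 3.5 hours in July and the second 1.5 hours in December, which is easy to verify by hand.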


#### Cyclical Encoding & Scaling

💡 We now have to encode and scale the extracted time features!  

You should scale the `duration` between 0 and 1.  

However we need to build a **Cyclical Encoder** for the `month`.

**📝 Create a `CyclicalEncoder` class that transforms `month` into `month_cos` and `month_sin`.**

Recall the equations:  

$month\_norm = 2\pi\frac{month}{12}$  
$month\_cos = \cos({month\_norm})$  
$month\_sin = \sin({month\_norm})$

In [304]:
import math

class CyclicalEncoder(TransformerMixin, BaseEstimator): 
# TransformerMixin generates a fit_transform method from fit and transform
# BaseEstimator generates get_params and set_params methods
    
    def __init__(self):
        self.pi=math.pi
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        norm=2*self.pi*(X["month"]/12)
        X_transformed=pd.DataFrame({"month_cos":np.cos(norm)})
        X_transformed['month_sin']=np.sin(norm)
        # Return result as dataframe for integration into ColumnTransformer
        return X_transformed

**📝 Apply your `CyclicalEncoder` to `X_time_features` and store the result in a DataFrame `X_time_cyclical`.**

Double check that it has **2 columns**: `month_cos` and `month_sin`, and **100 rows**

In [306]:
X_time_cyclical=CyclicalEncoder().fit_transform(X_time_features)

In [307]:
X_time_cyclical.shape

(100, 2)

In [308]:
X_time_cyclical.columns

Index(['month_cos', 'month_sin'], dtype='object')
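
The point of the cyclical encoding is that December and January end up next to each other, whereas they are 11 steps apart on the raw 1 to 12 scale. A minimal NumPy sketch of the equations above (the `distance` helper is hypothetical, written only for this illustration):

```python
import numpy as np

months = np.arange(1, 13)
angle = 2 * np.pi * months / 12
month_cos, month_sin = np.cos(angle), np.sin(angle)

def distance(m1, m2):
    """Euclidean distance between two months placed on the unit circle."""
    i, j = m1 - 1, m2 - 1
    return np.hypot(month_cos[i] - month_cos[j], month_sin[i] - month_sin[j])

# December-January is as close as January-February;
# months six apart (January-July) sit on opposite sides of the circle
print(distance(12, 1), distance(1, 2), distance(1, 7))
```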

**📝 Build a pipeline, that contains all the steps for time features.**

Store it in a variable `preprocessing_time`

**Steps**

- Extraction of `month` and `duration` from  `datetime_start` and `datetime_end`  
- Scaling of `duration` between 0 and 1
- Cyclical encoding of `month`

In [313]:
post_extract=ColumnTransformer([
    ('cyclical_encoding', CyclicalEncoder(),["month"]),
    ('scale_duration',MinMaxScaler(),["duration"])
])

In [314]:
preprocessing_time=Pipeline([
    ('month_duration', TimeFeaturesExtractor()),
    ('scale_cycling', post_extract)
])

In [315]:
preprocessing_time

In [316]:
preprocessing_time.fit_transform(X)

array([[-1.00000000e+00,  1.22464680e-16,  5.49465055e-01],
       [ 1.00000000e+00, -2.44929360e-16,  9.02320411e-01],
       [ 5.00000000e-01, -8.66025404e-01,  8.96484646e-01],
       ...,
       [ 8.66025404e-01,  5.00000000e-01,  3.59107962e-01],
       [ 1.00000000e+00, -2.44929360e-16,  4.58663332e-01],
       [-8.66025404e-01,  5.00000000e-01,  9.42753925e-01]])

### 💾 Save your results

Run the cell below to save your results.

In [363]:
from nbresult import ChallengeResult
results = ChallengeResult(
    'feature_engineering',
    x_soil_condition=X_soil_condition_encoded,
    X_time_features=X_time_features,
    X_time_cyclical= X_time_cyclical,
    X_time=preprocessing_time.fit_transform(X)
)
results.write()

## Advanced Pipeline

**📝  Build a full preprocessing pipeline and store it in `preprocessing_advanced`.**

Here are its steps, they should go in a parallel ColumnTransformer

- Scale all numerical features between 0 and 1
- Encode `main_element`  
- Better encode `soil_condition`
- Apply the `preprocessing_time` pipeline on `datetime_start` and `datetime_end`

In [382]:
preprocessing_advanced = ColumnTransformer([
    ('num_transformer', MinMaxScaler(), feat_num),
    ('main_element_transformer', OneHotEncoder(), ['main_element']),
    ('soil_transformer', OrdinalEncoder(categories=[["poor","normal","rich"]]), ["soil_condition"]),
    ('time_preprocessing', preprocessing_time, feat_time)
])

## Regularized Linear Model

**📝 Build a pipeline that uses `preprocessing_advanced` and then a _Regularized Linear_ model.**

Cross-validate your pipeline and store the scores in a list `scores_regularized`

In [383]:
pipeline_regularized = Pipeline([
    ('preprocessing', preprocessing_advanced),
    ('log_regression', LogisticRegression(C=0.5))])

In [384]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)

In [385]:
scores_regularized=cross_val_score(pipeline_regularized, X_train, y_train, cv=5, scoring='precision')

### 💾 Save your results

Run the cell below to save your results.

In [387]:
from nbresult import ChallengeResult
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val = train_test_split(X,y,test_size=0.3,random_state=7)
pipe=pipeline_regularized.fit(X_,y_)

result = ChallengeResult(
    'advanced_pipeline',
    steps=str(pipeline_regularized.steps),
    scores=scores_regularized,
    y=y_val,
    y_pred=pipeline_regularized.predict(X_val)
)
result.write()

## Dimensionality Reduction

**📝 Add a dimensionality reduction step as the last step of your `preprocessing_advanced`. Make sure your dimensionality reduction keeps _only 12 features_.**

In [388]:
from sklearn.decomposition import PCA

In [389]:
preprocessing_temp=ColumnTransformer([
    ('num_transformer', MinMaxScaler(), feat_num),
    ('main_element_transformer', OneHotEncoder(), ['main_element']),
    ('soil_transformer', OrdinalEncoder(categories=[["poor","normal","rich"]]), ["soil_condition"]),
    ('time_preprocessing', preprocessing_time, feat_time)
])

In [390]:
preprocessing_advanced=Pipeline([
    ('feature_modifications',preprocessing_temp),
    ('PCA',PCA(n_components=12))
    
])

**📝 Apply your `preprocessing_advanced` to `X` and store the result in the `X_preproc_adv` variable.**

In [391]:
X_preproc_adv=preprocessing_advanced.fit_transform(X)
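
After a PCA step it can be useful to check the output shape and how much variance the kept components retain. A sketch on random data (the matrix `X_toy` is made up and only stands in for the preprocessed features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up matrix: 200 rows, 15 features
X_toy = rng.normal(size=(200, 15))

pca = PCA(n_components=12)
X_reduced = pca.fit_transform(X_toy)

# 12 columns remain after the reduction
print(X_reduced.shape)
# Share of the total variance kept by the 12 components
print(pca.explained_variance_ratio_.sum())
```

On the real pipeline, the fitted PCA would be reachable as the last step of the preprocessing pipeline, for instance through its `named_steps` attribute.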

### 💾 Save your results

Run the cell below to save your results.

In [381]:
from nbresult import ChallengeResult
results=ChallengeResult(
    'unsupervised',
    algorithm=preprocessing_advanced.steps[-1],
    X_preproc_adv=X_preproc_adv
)
results.write()

## Non-linear Model

**📝 Build a pipeline that uses `preprocessing_advanced` and then an _Ensemble_ model.**

Store this pipeline in the variable `pipeline_ensemble`

In [393]:
from sklearn.ensemble import RandomForestClassifier
pipeline_ensemble=Pipeline([
    ("preprocessing",preprocessing_advanced),
    ("ensemble_model",RandomForestClassifier()) 
])

In [394]:
pipeline_ensemble

Cross-validate your pipeline and store the scores in a list `scores_ensemble`

In [395]:
scores_ensemble=cross_val_score(pipeline_ensemble, X_train, y_train, cv=5, scoring='precision')

In [396]:
scores_ensemble

array([0.98387097, 0.98395722, 0.97686833, 0.96509599, 0.97326203])

**❓ Does this non-linear model satisfy the goal of the study?**

Yes! The cross-validated precision (about 0.97 to 0.98) exceeds the expected 90%.

💡 Wait, did our feature engineering help us ❓

**📝 Build a pipeline that uses `preprocessing_basic` and the same Ensemble model as above.**

In [397]:
pipeline_ensemble_basic=Pipeline([
    ("preprocessing",preprocessing_basic),
    ("ensemble_model",RandomForestClassifier()) 
])

In [400]:
from sklearn.model_selection import cross_validate

In [401]:
cross_validate(pipeline_ensemble, X_train, y_train, cv=5, scoring='precision')

{'fit_time': array([1.14126897, 0.95567298, 0.85340691, 0.96131992, 1.09046888]),
 'score_time': array([0.06788993, 0.03660202, 0.03748131, 0.03866911, 0.03817511]),
 'test_score': array([0.98207885, 0.98401421, 0.97868561, 0.96678322, 0.97860963])}

In [402]:
cross_validate(pipeline_ensemble_basic, X_train, y_train, cv=5, scoring='precision')

{'fit_time': array([3.24518514, 3.20243311, 3.07462788, 3.00827384, 2.95543981]),
 'score_time': array([0.11057305, 0.09811401, 0.10859299, 0.10001612, 0.11043406]),
 'test_score': array([0.8697479 , 0.854     , 0.87018256, 0.86486486, 0.83762376])}

**❓ What is your conclusion?**

The advanced preprocessing improved the results (precision of about 0.98 versus 0.86 with the basic preprocessing) and allowed us to reach our goal.

Computation times are also much shorter with the advanced preprocessing (probably thanks to the dimensionality reduction).

### 💾 Save your results

Run the cell below to save your results.

In [403]:
from nbresult import ChallengeResult
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val=train_test_split(X,y,test_size=0.3,random_state=7)
pipeline_ensemble.fit(X_,y_)
y_pred=pipeline_ensemble.predict(X_val)

results=ChallengeResult(
    'ensemble',
    steps=str(pipeline_ensemble.steps),
    scores=scores_ensemble,
    y=y_val,
    y_pred=y_pred
)
results.write()

## Fine-Tuning

💡 To improve the model as much as we can, it's time to grid search for optimal hyperparameters

**📝 Look at the hyperparameters of your estimator**

In [407]:
pipeline_ensemble.get_params()

{'memory': None,
 'steps': [('preprocessing',
   Pipeline(steps=[('feature_modifications',
                    ColumnTransformer(transformers=[('num_transformer',
                                                     MinMaxScaler(),
                                                     ['measure_index',
                                                      'measure_moisture',
                                                      'measure_temperature',
                                                      'measure_chemicals',
                                                      'measure_biodiversity']),
                                                    ('main_element_transformer',
                                                     OneHotEncoder(),
                                                     ['main_element']),
                                                    ('soil_transformer',
                                                     OrdinalEncoder(categories=[['poor',
      

**📝 Try to fine tune some hyperparameters to improve your model!**

In [410]:
from sklearn.model_selection import GridSearchCV

# Instantiate grid search
grid_search = GridSearchCV(
    pipeline_ensemble, 
    param_grid={
        # Access any component of the pipeline, as far back as you want
        'preprocessing__PCA__n_components': [5,6,7,8,9,10,11,12,13],
        'ensemble_model__criterion': ["gini","entropy"]},cv=5,scoring="precision")

search=grid_search.fit(X_train, y_train)


{'ensemble_model__criterion': 'gini', 'preprocessing__PCA__n_components': 13}

**📝 Store the _fitted_ grid search in the `search` variable:**

In [98]:
# The grid search was already fitted above; fit() returns the fitted object itself
search = grid_search

**📝 Store the _cross-validated results_ of your grid search in the `cv_results` variable:**

In [412]:
cv_results=grid_search.cv_results_
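
`cv_results_` is a dictionary of arrays, which is hard to read as is. Loading it into a DataFrame makes it easier to compare parameter combinations. A sketch on synthetic data (`X_toy`, `y_toy` and the small grid are made up for illustration and are not the notebook's search):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, only to have something to fit
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

toy_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"criterion": ["gini", "entropy"]},
    cv=3,
    scoring="precision",
).fit(X_toy, y_toy)

# One row per parameter combination, sorted from best to worst
table = pd.DataFrame(toy_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(table[columns].sort_values("rank_test_score"))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the difference between two combinations is larger than the fold-to-fold variation.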

**📝 Store the _best model_ of your grid search in a variable `tuned_model`.**

In [413]:
tuned_model=grid_search.best_estimator_

### 💾 Save your results

Run the cell below to save your results.

In [414]:
from nbresult import ChallengeResult
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val=train_test_split(X,y,test_size=0.3,random_state=10)
tuned_model.fit(X_,y_)

result = ChallengeResult(
    'model_tuning',
    scores_ensemble=scores_ensemble,
    scoring=search.scorer_,
    params=search.best_params_,
    cv_results=cv_results,
    y=y_val,
    y_pred=tuned_model.predict(X_val)
)
result.write()

## Prediction

**📝 Use your newly fine-tuned model to predict on a test set.**

Load the test set provided at this URL: "https://wagon-public-datasets.s3.amazonaws.com/certification/soils_viability/soils_viability_test.csv".

Create `X_test` and `y_test`

Use your fine-tuned model to predict on `X_test`

Print a full classification report with your prediction and `y_test`

In [416]:
df_test = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/certification/soils_viability/soils_viability_test.csv")

In [417]:
X_test=df_test.drop(columns="target")

In [418]:
y_test=df_test["target"]

In [419]:
tuned_model.fit(X,y)

In [420]:
tuned_model.predict(X_test)

array([0, 0, 1, ..., 0, 1, 0])

In [421]:
from sklearn.metrics import classification_report

In [423]:
print(classification_report(y_test,tuned_model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1036
           1       0.98      0.97      0.98      1000

    accuracy                           0.98      2036
   macro avg       0.98      0.98      0.98      2036
weighted avg       0.98      0.98      0.98      2036
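
To compare the report with the 90% requirement programmatically, `classification_report` can return a dictionary through `output_dict=True`. A sketch on made-up labels (`y_true`, `y_hat` and `goal` are hypothetical names used only here):

```python
from sklearn.metrics import classification_report

# Made-up labels and predictions: 1 = viable, 0 = not viable
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat  = [1, 1, 0, 0, 0, 1, 1, 0]

report = classification_report(y_true, y_hat, output_dict=True)
# Keys of the dictionary are the class labels converted to strings
precision_viable = report["1"]["precision"]

goal = 0.90
# Precision of the viable class, and whether it meets the goal
print(precision_viable, precision_viable >= goal)
```

In this toy case three of the four soils predicted viable are truly viable, so the precision is 0.75 and the goal is not met.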



**❓ Comment on your results:**

Judging by these results, the model performs well: precision on the viable class is 0.98 on the test set, above the 90% goal.

## API 

Time to put a pipeline in production!

👉 Go back to the certification interface and follow the instructions about the API challenge.

**This final part is independent from the above notebook**