# Model Training

Project links (water potability):
[(fr)](https://drive.google.com/file/d/1FGNR1O8EKGVKpVB_PMb5Ty2LipYgoM8q/view?usp=sharing)
[(kaggle)](https://www.kaggle.com/artimule/drinking-water-probability)

In this notebook, we train the model in the following steps:
- Preprocessing
- Building the model

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

## Preprocessing

In [4]:
TEST_SIZE = 0.2
RANDOM_STATE = 42

### Import the data

In [5]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
# import data
path = "/content/drive/MyDrive/Best ML model ever/input/drinking_water_potability.csv"

df = pd.read_csv(path)
df.head()

|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|----|----------|--------|-------------|---------|--------------|----------------|-----------------|-----------|------------|
| 0 | NaN | 204.890456 | 20791.31898 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.05786 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.54173 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.41744 | 8.059332 | 356.886136 | 363.266516 | 18.436525 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.98634 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
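
The preview already shows missing values in `ph` and `Sulfate`. As a minimal sketch, the per-column count of missing values can be checked with `isna().sum()`. The small frame below is an illustration only: it copies three columns from the five preview rows, and the names `preview` and `missing_counts` are hypothetical. On the full data the same call would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Three columns copied from the five preview rows (illustration only).
preview = pd.DataFrame({
    "ph": [np.nan, 3.71608, 8.099124, 8.316766, 9.092223],
    "Sulfate": [368.516441, np.nan, np.nan, 356.886136, 310.135738],
    "Potability": [0, 0, 0, 0, 0],
})

# Count the missing values in each column.
missing_counts = preview.isna().sum()
print(missing_counts)
```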


### Train/test split

Since all the features are continuous, there is no categorical feature to stratify on, so we use a simple random split.

In [None]:
train_set, test_set = train_test_split(df, test_size=TEST_SIZE, random_state=RANDOM_STATE)
print("Train shape:", train_set.shape)
print("Test shape:", test_set.shape)

Train shape: (2620, 10)
Test shape: (656, 10)
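
The notebook uses a plain random split. If one wanted both subsets to keep the same share of potable samples, `train_test_split` also accepts a `stratify` argument. The sketch below is only an illustration on hypothetical toy data (the frame `toy` and its 40/60 class balance are assumptions, not properties of the real dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 40% positive class (not the real dataset).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "Potability": [1] * 40 + [0] * 60,
})

# stratify keeps the class proportions the same in both subsets.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Potability"]
)

print("Train positive share:", toy_train["Potability"].mean())
print("Test positive share:", toy_test["Potability"].mean())
```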


In [None]:
X_train = train_set.drop("Potability", axis=1)
y_train = train_set["Potability"].copy()

X_test = test_set.drop("Potability", axis=1)
y_test = test_set["Potability"].copy()

### Build the pipeline

In [None]:
class RemoveNull(BaseEstimator, TransformerMixin):
  '''Transformer that drops rows (direction=0) or columns (direction=1) containing null values.

  Expects a pandas DataFrame as input.'''

  def __init__(self, direction=0):
    self.direction = direction

  def fit(self, X, y=None):
    return self

  def transform(self, X, y=None):
    return X.dropna(axis=self.direction)
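
To illustrate how this transformer behaves, here is a small sketch on hypothetical toy data (`X_toy` and `y_toy` are made up for the example). The class is repeated so the snippet runs on its own. One point worth noting: dropping rows changes the number of samples in `X` but not in `y`, so the labels have to be realigned on the surviving index; doing it with `.loc` as below is one possible approach, not necessarily what the notebook does later.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RemoveNull(BaseEstimator, TransformerMixin):
    """Same transformer as above, repeated so the example is self-contained."""

    def __init__(self, direction=0):
        self.direction = direction

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.dropna(axis=self.direction)


# Hypothetical toy data (not the real dataset).
X_toy = pd.DataFrame({"ph": [7.0, np.nan, 8.1], "Hardness": [200.0, 130.0, 224.0]})
y_toy = pd.Series([0, 1, 0])

X_rows = RemoveNull(direction=0).fit_transform(X_toy)  # drops the row with a NaN
X_cols = RemoveNull(direction=1).fit_transform(X_toy)  # drops the "ph" column

# Dropping rows leaves y untouched, so realign it on the surviving index.
y_rows = y_toy.loc[X_rows.index]

print(X_rows.shape, X_cols.shape, len(y_rows))
```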

In [None]:
def preprocessing_pipeline(missing="median", scaling="standard"):
  """
  Build a preprocessing pipeline with the given strategies.

  Parameters
  ----------
  missing : string
      Specify the strategy for dealing with the missing values (default is "median")
      Possible values: "mean", "median", "most_frequent", "remove_rows", "remove_cols", "regression", "stochastic", "knn"
  scaling : string
      Specify the strategy for dealing with the scaling (default is "standard")
      Possible values: "standard", "min_max"
    
  Returns
  -------
  sklearn.Pipeline
      The preprocessing pipeline with given strategies
  """
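  # Missing values: pick the imputer (or row/column removal) matching the strategy
  if missing in ["mean", "median", "most_frequent"]:
    missing_imputer = SimpleImputer(strategy=missing)
  elif missing in ["remove_rows", "remove_cols"]:
    missing_imputer = RemoveNull(0 if missing == "remove_rows" else 1)
  elif missing in ["regression", "stochastic"]:
    # sample_posterior=True samples imputations from the predictive posterior
    missing_imputer = IterativeImputer(sample_posterior=(missing == "stochastic"))
  elif missing == "knn":
    missing_imputer = KNNImputer()
  else:
    raise ValueError(f"Unknown missing strategy: {missing!r}")

  # Scaling
  if scaling == "standard":
    scaler = StandardScaler()
  elif scaling == "min_max":
    scaler = MinMaxScaler(feature_range=(-1, 1))
  else:
    raise ValueError(f"Unknown scaling strategy: {scaling!r}")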

  return Pipeline([
      ('missing', missing_imputer),
      ('scaling', scaler)
  ])
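
As a usage sketch, the pipeline below mirrors what `preprocessing_pipeline("median", "standard")` builds, written out directly so the snippet runs on its own. The toy frames `toy_train` and `toy_test` are hypothetical. The idea shown is to fit the imputer and scaler on the training data only and then reuse the learned statistics on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not the real dataset).
toy_train = pd.DataFrame({
    "ph": [6.0, np.nan, 8.0, 7.0],
    "Sulfate": [300.0, 340.0, np.nan, 320.0],
})
toy_test = pd.DataFrame({"ph": [np.nan], "Sulfate": [320.0]})

# Same two steps as the "median" + "standard" configuration.
pipeline = Pipeline([
    ("missing", SimpleImputer(strategy="median")),
    ("scaling", StandardScaler()),
])

# Fit on the training data only, then apply the learned medians and scales to the test data.
train_prepared = pipeline.fit_transform(toy_train)
test_prepared = pipeline.transform(toy_test)

print(train_prepared)
print(test_prepared)
```

Fitting on the training set alone avoids leaking test-set statistics into the preprocessing step.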

## Building the model