# Working with Missing Values, Outliers, and Unbalanced Datasets

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports-and-Dataset" data-toc-modified-id="Imports-and-Dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports and Dataset</a></span></li><li><span><a href="#Sampler,--transformer-and-estimator" data-toc-modified-id="Sampler,--transformer-and-estimator-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Sampler,  transformer and estimator</a></span></li><li><span><a href="#Lab-1:-Missing-value" data-toc-modified-id="Lab-1:-Missing-value-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Lab 1: Missing value</a></span></li><li><span><a href="#Outlier-removal" data-toc-modified-id="Outlier-removal-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Outlier removal</a></span></li><li><span><a href="#Unbalance-dataset" data-toc-modified-id="Unbalance-dataset-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Unbalance dataset</a></span></li></ul></div>

## Imports and Dataset 

In [105]:
#import warnings
#warnings.filterwarnings('ignore')

In [106]:
from tqdm import tqdm
import seaborn as sns                                     # For plotting data
import pandas as pd                                       # For dataframes
import numpy as np
import matplotlib.pyplot as plt                           # For plotting data
%matplotlib inline

# For splitting the dataset
from sklearn.model_selection import train_test_split

# For setting up pipeline
from imblearn.pipeline import Pipeline
from imblearn import FunctionSampler
from sklearn.preprocessing import FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin

# For Missing data
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# For Outlier detection
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# For Unbalanced dataset
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# For classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

# For optimization
from sklearn.model_selection import GridSearchCV      
from sklearn.metrics import mean_absolute_error

The **original ForestCover/Covertype dataset** from the UCI Machine Learning Repository is a multiclass classification dataset. It contains tree observations from four wilderness areas of the Roosevelt National Forest in northern Colorado. These areas represent forests with minimal human-caused disturbance, so the existing forest cover types are the result of ecological processes rather than forest management practices.

In this notebook you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.

This dataset has 54 attributes:
* 10 quantitative variables,
* 4 binary wilderness area variables,
* 40 binary soil type variables.

In the outlier-detection version of this dataset, only the 10 quantitative attributes are used. Instances from class 2 are considered normal points and instances from class 4 are anomalies. The anomaly ratio is 0.9%. Instances from the other classes are omitted.

Dataset description available on [Kaggle](https://www.kaggle.com/uciml/forest-cover-type-dataset).
* Elevation: Elevation in meters.
* Aspect: Aspect in degrees azimuth.
* Slope: Slope in degrees.
* Horizontal_Distance_To_Hydrology: Horizontal distance in meters to nearest surface water features.
* Vertical_Distance_To_Hydrology: Vertical distance in meters to nearest surface water features.
* Horizontal_Distance_To_Roadways: Horizontal distance in meters to the nearest roadway.
* Hillshade_9am: Hillshade index at 9am, summer solstice. Value out of 255.
* Hillshade_Noon: Hillshade index at noon, summer solstice. Value out of 255.
* Hillshade_3pm: Hillshade index at 3pm, summer solstice. Value out of 255.
* Horizontal_Distance_To_Fire_Points: Horizontal distance in meters to the nearest wildfire ignition points.
* Wilderness_Area#: wilderness area designation.
* Soil_Type#: soil type designation.

The Wilderness_Area feature is one-hot encoded as 4 binary columns (0 = absence, 1 = presence), each corresponding to one wilderness area designation. The areas are mapped to the columns in the following order:
1. Rawah Wilderness Area
1. Neota Wilderness Area
1. Comanche Peak Wilderness Area
1. Cache la Poudre Wilderness Area

The same goes for the Soil_Type feature, which is encoded as 40 binary columns (0 = absence, 1 = presence), each representing one soil type designation. The possible soil types are:
1. Cathedral family - Rock outcrop complex, extremely stony
1. Vanet - Ratake families complex, very stony
1. Haploborolis - Rock outcrop complex, rubbly
1. Ratake family - Rock outcrop complex, rubbly
1. Vanet family - Rock outcrop complex, rubbly
1. Vanet - Wetmore families - Rock outcrop complex, stony
1. Gothic family
1. Supervisor - Limber families complex
1. Troutville family, very stony
1. Bullwark - Catamount families - Rock outcrop complex, rubbly
1. Bullwark - Catamount families - Rock land complex, rubbly.
1. Legault family - Rock land complex, stony
1. Catamount family - Rock land - Bullwark family complex, rubbly
1. Pachic Argiborolis - Aquolis complex
1. Unspecified in the USFS Soil and ELU Survey
1. Cryaquolis - Cryoborolis complex
1. Gateview family - Cryaquolis complex
1. Rogert family, very stony
1. Typic Cryaquolis - Borohemists complex
1. Typic Cryaquepts - Typic Cryaquolls complex
1. Typic Cryaquolls - Leighcan family, till substratum complex
1. Leighcan family, till substratum, extremely bouldery
1. Leighcan family, till substratum - Typic Cryaquolls complex
1. Leighcan family, extremely stony
1. Leighcan family, warm, extremely stony
1. Granile - Catamount families complex, very stony
1. Leighcan family, warm - Rock outcrop complex, extremely stony
1. Leighcan family - Rock outcrop complex, extremely stony
1. Como - Legault families complex, extremely stony
1. Como family - Rock land - Legault family complex, extremely stony
1. Leighcan - Catamount families complex, extremely stony
1. Catamount family - Rock outcrop - Leighcan family complex, extremely stony
1. Leighcan - Catamount families - Rock outcrop complex, extremely stony
1. Cryorthents - Rock land complex, extremely stony
1. Cryumbrepts - Rock outcrop - Cryaquepts complex
1. Bross family - Rock land - Cryumbrepts complex, extremely stony
1. Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony
1. Leighcan - Moran families - Cryaquolls complex, extremely stony
1. Moran family - Cryorthents - Leighcan family complex, extremely stony
1. Moran family - Cryorthents - Rock land complex, extremely stony

Cover_Type is the forest cover type designation. Its values range from 1 to 7 and are mapped as follows:
1. Spruce/Fir
1. Lodgepole Pine
1. Ponderosa Pine
1. Cottonwood/Willow
1. Aspen
1. Douglas-fir
1. Krummholz

<font color=blue>
We will use a very small part of this dataset with only classes 1 and 7.
</font>

In [107]:
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

url = "https://www.i3s.unice.fr/~riveill/dataset/covtype/"
filename = "covtype.csv"

In [108]:
# load train and test
train = pd.read_csv(url+"train.csv", delimiter=',')
test = pd.read_csv(url+"test.csv", delimiter=',')

In [109]:
columns = list(train.columns)
target = 'Cover_Type'
columns.remove(target)
cat_columns=[c for c in columns if 'Soil_Type' in c or 'Wilderness_Area' in c] # already one-hot encoded
num_columns=[c for c in columns if c not in cat_columns]

In [110]:
y_train = np.array(train[target]).reshape(-1,1)
X_train = train[columns]

y_test = np.array(test[target]).reshape(-1,1)
X_test = test[columns]

X_train.shape, X_test.shape

((10000, 54), (10000, 54))

In [111]:
# Class distribution
distribution = pd.Series(y_train.flatten()).value_counts().to_dict()
distribution

{1: 9083, 7: 917}

## Sampler,  transformer and estimator

There are three types of objects in imblearn/scikit-learn design:

A **Transformer** transforms observations (it modifies only `X`) and implements:
* fit: computes the parameters it needs from the training data and stores them as internal object state.
* transform: applies the fitted parameters and returns the modified data. It does not change the number of rows.

A **Predictor** is a "model" and implements:
* fit: computes the parameters or weights from the training data and stores them as internal object state.
* predict: uses the fitted weights to make predictions on new data.

A **Sampler** is a new kind of object, provided by the imblearn library. A sampler changes the number of observations in the training set (it modifies both `X_train` and `y_train`) and implements:
* fit_resample
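
As a minimal sketch of the difference between a transformer and a sampler (toy data rather than the Covertype set; `drop_nan_rows` is a hypothetical helper introduced only for this illustration): a transformer such as `SimpleImputer` keeps every row, whereas a sampler-style function changes the number of rows and therefore has to return `y` as well.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: 4 observations, 2 features, one missing value
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])
y = np.array([0, 1, 0, 1])

# Transformer: fit learns the column means, transform fills the gaps.
# The number of rows is unchanged, so y does not need to be touched.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

def drop_nan_rows(X, y):
    """Sampler-style function (hypothetical name): removes the rows of X
    that contain NaN and the matching entries of y."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]

X_dropped, y_dropped = drop_nan_rows(X, y)

print(X_imputed.shape)                   # every row is kept
print(X_dropped.shape, y_dropped.shape)  # one row removed from both X and y
```

Because the row count changes, a step like `drop_nan_rows` cannot be expressed through `transform` alone, which is why imblearn introduces `fit_resample`.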

The following cells build a pipeline that combines these three kinds of objects.

In [112]:
# A sampler
class mySampler(BaseEstimator):
    def fit_resample(self, X, y):
        data = np.concatenate((X, y), axis=1)
        # remove rows with NaN
        data = data[~np.isnan(data).any(axis=1), :]
        return data[:,:-1], data[:,-1]

It is also possible to build a sampler from a function, by wrapping it in `FunctionSampler`:

In [113]:
def mySamplerFunction(X, y, conta=0.1):
    iforest = IsolationForest(n_estimators=300, max_samples='auto', contamination=conta)
    outliers = iforest.fit_predict(X, y)

    X_filtered = X[outliers == 1]
    y_filtered = y[outliers == 1]

    return X_filtered, y_filtered

In [114]:
# A transformer
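class myTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy="most_frequent"):
        # Only store hyperparameters here, so that clone() and set_params() work
        self.strategy = strategy
    def fit(self, X, y=None):
        # Build the imputer at fit time so a strategy set by the grid search is used
        self.imputer_ = SimpleImputer(strategy=self.strategy)
        self.imputer_.fit(X, y)
        return self
    def transform(self, X):
        return self.imputer_.transform(X)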

As with samplers, a transformer can also be built from a function; see `sklearn.preprocessing.FunctionTransformer`.
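
A small sketch of that idea, assuming a stateless function is enough (`fill_nan_with_zero` is a hypothetical example function; replacing NaN by 0 is only used to keep the illustration short, not as a recommended imputation):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def fill_nan_with_zero(X):
    """Hypothetical stateless transformation: replace every NaN by 0."""
    return np.where(np.isnan(X), 0.0, X)

# FunctionTransformer turns the function into an object with fit/transform
nan_to_zero = FunctionTransformer(func=fill_nan_with_zero)

X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
X_filled = nan_to_zero.fit_transform(X)
print(X_filled)
```

The resulting object can be used as a pipeline step like any other transformer. It learns nothing during `fit`, so it is only suitable for transformations that do not depend on the training data.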

In [115]:
# A predictor
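class myPredictor(BaseEstimator):
    def __init__(self, penalty="l2"):
        # Only store hyperparameters here, so that clone() and set_params() work
        self.penalty = penalty
    def fit(self, X, y):
        # Build the model at fit time so a penalty set by the grid search is used
        self.model_ = LogisticRegression(solver="lbfgs", penalty=self.penalty, max_iter=10000)
        self.model_.fit(X, y)
        return self
    def predict(self, X):
        return self.model_.predict(X)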
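
When a custom class wraps an existing estimator, a grid search changes its hyperparameters through `set_params`, which only updates attributes. The sketch below (both class names are hypothetical, and the data is random) compares a wrapper that builds its inner model in `__init__` with one that builds it in `fit`.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression

class EagerWrapper(BaseEstimator):
    """Hypothetical wrapper that builds the inner model in __init__."""
    def __init__(self, C=1.0):
        self.C = C
        self.model = LogisticRegression(C=self.C)

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

class LazyWrapper(BaseEstimator):
    """Hypothetical wrapper that builds the inner model in fit."""
    def __init__(self, C=1.0):
        self.C = C

    def fit(self, X, y):
        self.model_ = LogisticRegression(C=self.C).fit(X, y)
        return self

rng = np.random.RandomState(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] > 0).astype(int)

# set_params is what GridSearchCV calls for entries such as 'clf__C'
eager = EagerWrapper().set_params(C=100.0).fit(X, y)
lazy = LazyWrapper().set_params(C=100.0).fit(X, y)

print(eager.C, eager.model.C)  # the inner model still holds the value from __init__
print(lazy.C, lazy.model_.C)   # the inner model uses the value set afterwards
```

With the eager variant, every candidate of the grid would silently train the same inner model, so all candidates would be expected to score alike.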

In [116]:
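# Different versions of the 3-step pipeline
# step 1 : remove or impute missing data
# step 2 : remove outliers
# step 3 : predictor
# Note: 'none' is the spelling accepted by the scikit-learn version used in this notebook; recent releases expect None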
pipeline = Pipeline([('missing_data', None),
                     ('outlier', FunctionSampler(func=mySamplerFunction)),
                     ('clf', None)])

parameters = [{'missing_data': [mySampler()],
               'clf': [myPredictor()],
               'clf__penalty': ['none'],
              },
              {'missing_data': [myTransformer()],
               'missing_data__strategy': ['most_frequent'],
               'clf': [myPredictor()],
               'clf__penalty': ['none'],
              },
              ]

pipeline

Pipeline(steps=[('missing_data', None),
                ('outlier',
                 FunctionSampler(func=<function mySamplerFunction at 0x7f0fbdbdaa70>)),
                ('clf', None)])

In [117]:
# GridSearch with pipeline
grid = GridSearchCV(pipeline, parameters, cv=2,
                    scoring="f1_micro", refit=True,
                    verbose=2)
grid

GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('missing_data', None),
                                       ('outlier',
                                        FunctionSampler(func=<function mySamplerFunction at 0x7f0fbdbdaa70>)),
                                       ('clf', None)]),
             param_grid=[{'clf': [myPredictor()], 'clf__penalty': ['none'],
                          'missing_data': [mySampler()]},
                         {'clf': [myPredictor()], 'clf__penalty': ['none'],
                          'missing_data': [myTransformer()],
                          'missing_data__strategy': ['most_frequent']}],
             scoring='f1_micro', verbose=2)

<font color="blue">
Remember: samplers are applied only during "fit", never during "predict". If the validation part of the data set contains missing values (NaN), they are therefore not removed, and a warning or an error may be raised when the model is scored.
</font>

In [118]:
# Try to find the best model
grid.fit(X_train[:50], y_train[:50]) # Only 50 rows here, to test the process quickly; use all available data for a real run

Fitting 2 folds for each of 2 candidates, totalling 4 fits


  f"X has feature names, but {self.__class__.__name__} was fitted without"
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 762, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 221, in __call__
    sample_weight=sample_weight,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 258, in _score
    y_pred = method_caller(estimator, "predict", X)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 68, in _cached_call
    return getattr(estimator, method)(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/metaestimators.py", line 113, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 470, in predict
    return self.steps[-1][1].predic

[CV] END clf=myPredictor(), clf__penalty=none, missing_data=mySampler(); total time=   0.7s


  f"X has feature names, but {self.__class__.__name__} was fitted without"
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 762, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 221, in __call__
    sample_weight=sample_weight,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 258, in _score
    y_pred = method_caller(estimator, "predict", X)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 68, in _cached_call
    return getattr(estimator, method)(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/metaestimators.py", line 113, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 470, in predict
    return self.steps[-1][1].predic

[CV] END clf=myPredictor(), clf__penalty=none, missing_data=mySampler(); total time=   0.8s
[CV] END clf=myPredictor(), clf__penalty=none, missing_data=myTransformer(), missing_data__strategy=most_frequent; total time=   0.7s
[CV] END clf=myPredictor(), clf__penalty=none, missing_data=myTransformer(), missing_data__strategy=most_frequent; total time=   0.7s




GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('missing_data', None),
                                       ('outlier',
                                        FunctionSampler(func=<function mySamplerFunction at 0x7f0fbdbdaa70>)),
                                       ('clf', None)]),
             param_grid=[{'clf': [myPredictor()], 'clf__penalty': ['none'],
                          'missing_data': [mySampler()]},
                         {'clf': [myPredictor(penalty='none')],
                          'clf__penalty': ['none'],
                          'missing_data': [myTransformer()],
                          'missing_data__strategy': ['most_frequent']}],
             scoring='f1_micro', verbose=2)

In [119]:
# Evaluate the refitted best model on the first 500 training rows (not a held-out test set, despite the label printed below)
y_pred = grid.predict(X_train[:500])
print("Best: {:.2f} using {}".format(
    grid.best_score_, 
    grid.best_params_
))
print('Test set score: ' + str(grid.score(X_train[:500], y_train[:500])))

Best: 0.66 using {'clf': myPredictor(penalty='none'), 'clf__penalty': 'none', 'missing_data': myTransformer(), 'missing_data__strategy': 'most_frequent'}
Test set score: 0.7160000000000001


## Lab 1: Missing value

$$[TO DO - Students]$$

Test some algorithms to handle missing data.
* Choose the classifier that you think is preferable for this job.

1. with removal of missing data
1. with one of the following imputation methods
    * With SimpleImputer
    * With IterativeImputer
    * With KNNImputer

Build a two-step pipeline and use a grid search to find the right hyperparameters.

Submit your work in the form of an executable and commented notebook at lms.univ-cotedazur.fr
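
As a rough illustration of how the three imputers differ, the sketch below fills a single missing value in a toy array whose second column is exactly twice the first. The array and the chosen parameters are assumptions made for this example only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must precede the IterativeImputer import
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer

# Second column is 2 * first column; the value at row 2 is missing
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])

imputers = {
    "simple (mean)": SimpleImputer(strategy="mean"),
    "iterative": IterativeImputer(random_state=0),
    "knn (k=2)": KNNImputer(n_neighbors=2),
}

# Each imputer is a transformer: same number of rows in and out
filled = {name: imp.fit_transform(X) for name, imp in imputers.items()}

for name, X_filled in filled.items():
    print(name, X_filled[2, 1])
```

`SimpleImputer` ignores the other column and uses the column mean. `KNNImputer` averages the values of the closest rows. `IterativeImputer` models each feature from the other features. All three keep the number of rows, so they fit in a pipeline as ordinary transformers.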

### Answers

#### 1. Removal of missing data

In [120]:
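from sklearn.ensemble import RandomForestClassifier  # needed here: otherwise it is only imported in a later cell

class remove_missing_data(BaseEstimator):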
    def fit_resample(self, X, y):
        data = np.concatenate((X, y), axis=1)
        # remove rows with NaN
        data = data[~np.isnan(data).any(axis=1), :]
        return data[:,:-1], data[:,-1]

pipeline = Pipeline([('missing_data', None),
                     ('clf', RandomForestClassifier())])

parameters = [{'missing_data': [remove_missing_data()],
               'clf__n_estimators': [200, 500],
              'clf__max_features': ['sqrt', 'log2'],  # 'auto' is the same as 'sqrt' for classifiers and is rejected by recent scikit-learn releases
              'clf__max_depth' : [5,10,20]
              }
              ]

pipeline

Pipeline(steps=[('missing_data', None), ('clf', RandomForestClassifier())])

In [121]:
# GridSearch with pipeline
grid = GridSearchCV(pipeline, parameters, cv=2,
                    scoring="f1_micro", refit=True,
                    verbose=2)
grid

GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('missing_data', None),
                                       ('clf', RandomForestClassifier())]),
             param_grid=[{'missing_data': [remove_missing_data()]}],
             scoring='f1_micro', verbose=2)

In [None]:
grid.fit(X_train, y_train) # Fit on the whole training set

In [123]:
# TEST SCORE:
y_pred = grid.predict(X_test)
print("Best: {:.2f} using {}".format(
    grid.best_score_, 
    grid.best_params_
))
print('Test set score: ' + str(grid.score(X_test, y_test)))

  f"X has feature names, but {self.__class__.__name__} was fitted without"
  f"X has feature names, but {self.__class__.__name__} was fitted without"


Best: nan using {'missing_data': remove_missing_data()}
Test set score: 0.9776


#### 2.1 With SimpleImputer

In [124]:
#I choose Random Forest as predictor 
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

#Building a pipeline
pipeline = Pipeline([('missing_data', SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('clf', RandomForestClassifier())])

#Hyper-parameters to test
parameters = [{'missing_data__strategy': ['most_frequent','mean'],
               'clf__n_estimators': [200, 500],
              'clf__max_features': ['sqrt', 'log2'],  # 'auto' is the same as 'sqrt' for classifiers and is rejected by recent scikit-learn releases
              'clf__max_depth' : [5,10,20]
              }
              ]

pipeline

Pipeline(steps=[('missing_data', SimpleImputer()),
                ('clf', RandomForestClassifier())])

In [125]:
# GridSearch with pipeline
grid = GridSearchCV(pipeline, parameters, cv=2,
                    scoring="f1_micro", refit=True,
                    verbose=2)
grid

GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('missing_data', SimpleImputer()),
                                       ('clf', RandomForestClassifier())]),
             param_grid=[{'missing_data__strategy': ['most_frequent', 'mean']}],
             scoring='f1_micro', verbose=2)

In [126]:
# Try to find the best model
grid.fit(X_train, y_train) 

Fitting 2 folds for each of 2 candidates, totalling 4 fits


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END ...............missing_data__strategy=most_frequent; total time=   0.7s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END ...............missing_data__strategy=most_frequent; total time=   0.7s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END ........................missing_data__strategy=mean; total time=   0.7s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END ........................missing_data__strategy=mean; total time=   0.7s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('missing_data', SimpleImputer()),
                                       ('clf', RandomForestClassifier())]),
             param_grid=[{'missing_data__strategy': ['most_frequent', 'mean']}],
             scoring='f1_micro', verbose=2)

In [142]:
from sklearn.metrics import accuracy_score  # not imported in the first cell

#Training score:
y_pred = grid.predict(X_train)
print("ACC on train", accuracy_score(y_train, y_pred))
print("Best: {:.2f} using {}".format(
    grid.best_score_, 
    grid.best_params_
))
print('TRAIN set score: ' + str(grid.score(X_train, y_train)))

#Test score:
y_pred = grid.predict(X_test)
print('TEST set score: ' + str(grid.score(X_test, y_test)))

ACC on train 0.995
Best: 0.96 using {'missing_data__strategy': 'mean'}
TRAIN set score: 0.995
TEST set score: 0.9769


#### 2.2 With IterativeImputer




In [128]:
#I choose Random Forest as predictor 
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import IterativeImputer
#Building a pipeline
pipeline_IterativeImputer = Pipeline([('missing_data', IterativeImputer(missing_values=np.nan)),
                     ('clf', RandomForestClassifier())])

#Hyper-parameters to test
parameters_IterativeImputer = [{'missing_data__initial_strategy': ['most_frequent','mean'],
               'clf__n_estimators': [200, 500],
              'clf__max_features': ['sqrt', 'log2'],  # 'auto' is the same as 'sqrt' for classifiers and is rejected by recent scikit-learn releases
              'clf__max_depth' : [5,10,20]
              }
              ]

pipeline_IterativeImputer

Pipeline(steps=[('missing_data', IterativeImputer()),
                ('clf', RandomForestClassifier())])

In [129]:
# GridSearch with pipeline
grid_IterativeImputer = GridSearchCV(pipeline_IterativeImputer, parameters_IterativeImputer, cv=2,
                    scoring="f1_micro", refit=True,
                    verbose=2)

# Try to find the best model
grid_IterativeImputer.fit(X_train, y_train) 

Fitting 2 folds for each of 2 candidates, totalling 4 fits


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END .......missing_data__initial_strategy=most_frequent; total time=  13.7s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END .......missing_data__initial_strategy=most_frequent; total time=  11.3s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END ................missing_data__initial_strategy=mean; total time=   6.9s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END ................missing_data__initial_strategy=mean; total time=   8.9s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('missing_data', IterativeImputer()),
                                       ('clf', RandomForestClassifier())]),
             param_grid=[{'missing_data__initial_strategy': ['most_frequent',
                                                             'mean']}],
             scoring='f1_micro', verbose=2)

In [130]:
#Training score:
y_pred = grid_IterativeImputer.predict(X_train)
print("Best: {:.2f} using {}".format(
    grid_IterativeImputer.best_score_, 
    grid_IterativeImputer.best_params_
))
print('TRAIN set score: ' + str(grid_IterativeImputer.score(X_train, y_train)))

#Test score:
y_pred = grid_IterativeImputer.predict(X_test)
print("Best: {:.2f} using {}".format(
    grid_IterativeImputer.best_score_, 
    grid_IterativeImputer.best_params_
))
print('TEST set score: ' + str(grid_IterativeImputer.score(X_test, y_test)))

Best: 0.96 using {'missing_data__initial_strategy': 'most_frequent'}
TRAIN set score: 0.995
Best: 0.96 using {'missing_data__initial_strategy': 'most_frequent'}
TEST set score: 0.9768


#### 2.3 With KNNImputer

In [131]:
#I choose Random Forest as predictor 
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer

#Building a pipeline
pipeline_KNNImputer = Pipeline([('missing_data', KNNImputer()),
                     ('clf', RandomForestClassifier())])

#Hyper-parameters to test
parameters_KNNImputer = [{'missing_data__n_neighbors': [2,5],
               'clf__n_estimators': [200, 500],
              'clf__max_features': ['sqrt', 'log2'],  # 'auto' is the same as 'sqrt' for classifiers and is rejected by recent scikit-learn releases
              'clf__max_depth' : [5,10,20]
              }
              ]

pipeline_KNNImputer

Pipeline(steps=[('missing_data', KNNImputer()),
                ('clf', RandomForestClassifier())])

In [132]:
# GridSearch with pipeline
grid_KNNImputer = GridSearchCV(pipeline_KNNImputer, parameters_KNNImputer, cv=2,
                    scoring="f1_micro", refit=True,
                    verbose=2)

# Try to find the best model
grid_KNNImputer.fit(X_train, y_train) 

Fitting 2 folds for each of 2 candidates, totalling 4 fits


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END ........................missing_data__n_neighbors=2; total time=   1.4s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END ........................missing_data__n_neighbors=2; total time=   1.4s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END ........................missing_data__n_neighbors=5; total time=   1.4s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


[CV] END ........................missing_data__n_neighbors=5; total time=   1.4s


  self._final_estimator.fit(Xt, yt, **fit_params_last_step)


GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('missing_data', KNNImputer()),
                                       ('clf', RandomForestClassifier())]),
             param_grid=[{'missing_data__n_neighbors': [2, 5]}],
             scoring='f1_micro', verbose=2)

In [133]:
grid_KNNImputer.get_params()

{'cv': 2,
 'error_score': nan,
 'estimator': Pipeline(steps=[('missing_data', KNNImputer()),
                 ('clf', RandomForestClassifier())]),
 'estimator__clf': RandomForestClassifier(),
 'estimator__clf__bootstrap': True,
 'estimator__clf__ccp_alpha': 0.0,
 'estimator__clf__class_weight': None,
 'estimator__clf__criterion': 'gini',
 'estimator__clf__max_depth': None,
 'estimator__clf__max_features': 'auto',
 'estimator__clf__max_leaf_nodes': None,
 'estimator__clf__max_samples': None,
 'estimator__clf__min_impurity_decrease': 0.0,
 'estimator__clf__min_samples_leaf': 1,
 'estimator__clf__min_samples_split': 2,
 'estimator__clf__min_weight_fraction_leaf': 0.0,
 'estimator__clf__n_estimators': 100,
 'estimator__clf__n_jobs': None,
 'estimator__clf__oob_score': False,
 'estimator__clf__random_state': None,
 'estimator__clf__verbose': 0,
 'estimator__clf__warm_start': False,
 'estimator__memory': None,
 'estimator__missing_data': KNNImputer(),
 'estimator__missing_data__add_indicator

In [134]:
#Training score:
y_pred = grid_KNNImputer.predict(X_train)
print("Best: {:.2f} using {}".format(
    grid_KNNImputer.best_score_, 
    grid_KNNImputer.best_params_
))
print('TRAIN set score: ' + str(grid_KNNImputer.score(X_train, y_train)))

#Test score:
y_pred = grid_KNNImputer.predict(X_test)

print('TEST set score: ' + str(grid_KNNImputer.score(X_test, y_test)))

Best: 0.96 using {'missing_data__n_neighbors': 2}
TRAIN set score: 0.995
Best: 0.96 using {'missing_data__n_neighbors': 2}
TEST set score: 0.9769


## Outlier removal

Removing outliers changes the number of observations, so it must be done by a sampler.

<font color='blue'> 
IsolationForest and the other scikit-learn detectors are not samplers, so they have to be wrapped in a FunctionSampler. See the </font>[imblearn documentation](https://imbalanced-learn.org/dev/references/generated/imblearn.FunctionSampler.html)
    
A small example with parameters:
<pre>
from collections import Counter
from imblearn import FunctionSampler
from imblearn.under_sampling import RandomUnderSampler

def func(X, y, sampling_strategy, random_state):
  return RandomUnderSampler(
      sampling_strategy=sampling_strategy,
      random_state=random_state).fit_resample(X, y)
      
sampler = FunctionSampler(func=func,
                          kw_args={'sampling_strategy': 'auto',
                                   'random_state': 0})
X_res, y_res = sampler.fit_resample(X, y)
print(f'Resampled dataset shape {sorted(Counter(y_res).items())}')
</pre>

$$[TO DO - Students]$$

Test some algorithms to handle outliers.
* Choose the classifier that you think is preferable for this job.

1. Without taking any precautions
1. By eliminating outliers with one of the following approaches:
    * With Isolation Forest (IF)
    * With Local Outlier Factor (LOF)
    * With Minimum Covariance Determinant (MCD)

Build a three-step pipeline and use a grid search to find the right hyperparameters.
The first step is your best previous "missing data" method.

Submit your work in the form of an executable and commented notebook at lms.univ-cotedazur.fr
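
The three detectors listed above share the `fit_predict` interface in scikit-learn, where the Minimum Covariance Determinant approach is available as `EllipticEnvelope`. A possible sketch is given below; `make_outlier_filter`, the toy data and the contamination values are assumptions. It builds one filtering function per detector, and each of them could then be passed as `func` to a `FunctionSampler`.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def make_outlier_filter(detector):
    """Hypothetical factory: returns func(X, y) that drops the rows
    the given detector labels as outliers (-1)."""
    def outlier_filter(X, y):
        is_inlier = detector.fit_predict(X) == 1
        return X[is_inlier], y[is_inlier]
    return outlier_filter

detectors = {
    "IF": IsolationForest(contamination=0.05, random_state=0),
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "MCD": EllipticEnvelope(contamination=0.05, random_state=0),
}

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
y = rng.randint(0, 2, size=300)

kept = {}
for name, detector in detectors.items():
    X_res, y_res = make_outlier_filter(detector)(X, y)
    kept[name] = len(X_res)
    print(name, "kept", kept[name], "of", len(X))
```

With the same contamination value, each detector removes roughly the same share of rows, but not necessarily the same rows. This is why the choice of detector is worth comparing in the grid search.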

### Answer

#### 1. Without taking any precautions
This is the same two-step pipeline as in the previous question (missing data handling and classifier).

#### 2. LocalOutlierFactor

Since the assignment says "By eliminating outliers with ONE of the following approaches", I chose Local Outlier Factor.

In [135]:
# step 1 : impute missing data
# step 2 : remove outliers
# step 3 : classifier
from sklearn.neighbors import LocalOutlierFactor

def outlier_removal(X, y):
    # LocalOutlierFactor.fit_predict returns +1 for inliers and -1 for outliers
    lof = LocalOutlierFactor(n_neighbors=2)
    outliers = lof.fit_predict(X, y)

    X_filtered = X[outliers == 1]
    y_filtered = y[outliers == 1]

    return X_filtered, y_filtered

pipeline_outlier = Pipeline([('missing_data', SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('outlier', FunctionSampler(func=outlier_removal)),
                     ('clf', RandomForestClassifier())])

parameters_outlier = [{'missing_data__strategy': ['most_frequent','mean'],
               'clf__n_estimators': [200, 500],
              'clf__max_features': ['sqrt', 'log2'],  # 'auto' is the same as 'sqrt' for classifiers and is rejected by recent scikit-learn releases
              'clf__max_depth' : [5,20]
              }
              ]

pipeline_outlier

Pipeline(steps=[('missing_data', SimpleImputer()),
                ('outlier',
                 FunctionSampler(func=<function outlier_removal at 0x7f0fbdbfd5f0>)),
                ('clf', RandomForestClassifier())])

In [136]:
# GridSearch with pipeline
grid_outlier = GridSearchCV(pipeline_outlier, parameters_outlier, cv=2,
                    scoring="f1_micro", refit=True,
                    verbose=2)

# Try to find the best model
grid_outlier.fit(X_train, y_train) 

Fitting 2 folds for each of 2 candidates, totalling 4 fits
[CV] END ...............missing_data__strategy=most_frequent; total time=   1.1s
[CV] END ...............missing_data__strategy=most_frequent; total time=   1.1s
[CV] END ........................missing_data__strategy=mean; total time=   1.1s
[CV] END ........................missing_data__strategy=mean; total time=   1.0s


GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('missing_data', SimpleImputer()),
                                       ('outlier',
                                        FunctionSampler(func=<function outlier_removal at 0x7f0fbdbfd5f0>)),
                                       ('clf', RandomForestClassifier())]),
             param_grid=[{'missing_data__strategy': ['most_frequent', 'mean']}],
             scoring='f1_micro', verbose=2)

In [137]:
#Training score:
y_pred_outlier = grid_outlier.predict(X_train)
print("Best: {:.2f} using {}".format(
    grid_outlier.best_score_, 
    grid_outlier.best_params_
))
print('TRAIN set score: ' + str(grid_outlier.score(X_train, y_train)))

#Test score:
y_pred_outlier = grid_outlier.predict(X_test)

print('TEST set score: ' + str(grid_outlier.score(X_test, y_test)))

Best: 0.96 using {'missing_data__strategy': 'mean'}
TRAIN set score: 0.992
TEST set score: 0.976
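
A note on the metric: for single-label classification, `f1_micro` is equal to plain accuracy, so it can look high on an unbalanced dataset even when the minority class is never predicted. The sketch below illustrates this on hypothetical labels (`y_true`, `y_pred` are made up for the demo); macro F1 or balanced accuracy are shown only as possible alternatives, not as the metric required by the lab.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical labels: 95 majority samples, 5 minority samples
y_true = np.array([0] * 95 + [1] * 5)
# A useless classifier that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print("accuracy:", acc)
print("f1_micro:", f1_micro)            # same value as accuracy
print("f1_macro:", round(f1_macro, 3))  # penalised by the missed minority class
print("balanced accuracy:", bal_acc)
```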


## Unbalance dataset

$$[TO DO - Students]$$

Test some algorithms to work with unbalanced dataset.
Choose the classifier that you think is preferable for this job.

1. Without taking any precautions
1. With modification of the dataset, by over-sampling, under-sampling, or SMOTE
1. Without modification of the dataset, by using class weights

Build a 4-step pipeline and use a grid search to find the right hyperparameters. The first and second steps are your best previous methods.

Submit your work in the form of an executable and commented notebook at lms.univ-cotedazur.fr

# Answer:

## 1 - Without taking any precautions:
This is the same as the previous question: a 3-step pipeline (handling missing data, outlier removal, and classifier) with no resampling or class weighting.

## 2.1 - Over sampling:

In [138]:
from imblearn.over_sampling import RandomOverSampler

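def random_over_sample(X, y):
    # sampling_strategy=0.5: duplicate minority samples at random until
    # minority/majority = 0.5 (a float is only valid for binary targets)
    over_sampler = RandomOverSampler(sampling_strategy=0.5)
    X_over, y_over = over_sampler.fit_resample(X, y)
    return X_over, y_over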


pipeline_oversampling = Pipeline([('missing_data', SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('outlier', FunctionSampler(func=outlier_removal)),
                     ('over_sampling', FunctionSampler(func=random_over_sample)),
                     ('clf', RandomForestClassifier())])

parameters_oversampling = [{
    'missing_data__strategy': ['most_frequent', 'mean'],
    'clf__n_estimators': [200, 500],
    'clf__max_features': ['auto', 'sqrt', 'log2'],
    'clf__max_depth': [5, 20],
}]

pipeline_oversampling

Pipeline(steps=[('missing_data', SimpleImputer()),
                ('outlier',
                 FunctionSampler(func=<function outlier_removal at 0x7f0fbdbfd5f0>)),
                ('over_sampling',
                 FunctionSampler(func=<function random_over_sample at 0x7f0fbde3e320>)),
                ('clf', RandomForestClassifier())])

In [139]:
# GridSearch with pipeline
grid_oversampling = GridSearchCV(pipeline_oversampling, parameters_oversampling, cv=2,
                    scoring="f1_micro", refit=True,
                    verbose=2)

# Try to find the best model
grid_oversampling.fit(X_train, y_train) 

Fitting 2 folds for each of 2 candidates, totalling 4 fits
[CV] END ...............missing_data__strategy=most_frequent; total time=   1.2s
[CV] END ...............missing_data__strategy=most_frequent; total time=   1.2s
[CV] END ........................missing_data__strategy=mean; total time=   1.2s
[CV] END ........................missing_data__strategy=mean; total time=   1.2s


GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('missing_data', SimpleImputer()),
                                       ('outlier',
                                        FunctionSampler(func=<function outlier_removal at 0x7f0fbdbfd5f0>)),
                                       ('over_sampling',
                                        FunctionSampler(func=<function random_over_sample at 0x7f0fbde3e320>)),
                                       ('clf', RandomForestClassifier())]),
             param_grid=[{'missing_data__strategy': ['most_frequent', 'mean']}],
             scoring='f1_micro', verbose=2)

In [140]:
#Training score:
y_pred_oversampling = grid_oversampling.predict(X_train)
print("Best: {:.2f} using {}".format(
    grid_oversampling.best_score_, 
    grid_oversampling.best_params_
))
print('TRAIN set score: ' + str(grid_oversampling.score(X_train, y_train)))

#Test score:
y_pred_oversampling = grid_oversampling.predict(X_test)
print('TEST set score: ' + str(grid_oversampling.score(X_test, y_test)))

Best: 0.96 using {'missing_data__strategy': 'mean'}
TRAIN set score: 0.9906
TEST set score: 0.9786


## 2.2 - Under sampling:

In [144]:
from imblearn.under_sampling import RandomUnderSampler

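def random_under_sample(X, y):
    # sampling_strategy=0.5: drop majority samples at random until
    # minority/majority = 0.5 (a float is only valid for binary targets)
    under_sampler = RandomUnderSampler(sampling_strategy=0.5)
    X_under, y_under = under_sampler.fit_resample(X, y)
    return X_under, y_under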


pipeline_undersampling = Pipeline([('missing_data', SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('outlier', FunctionSampler(func=outlier_removal)),
                     ('under_sampling', FunctionSampler(func=random_under_sample)),
                     ('clf', RandomForestClassifier())])

parameters_undersampling = [{
    'missing_data__strategy': ['most_frequent', 'mean'],
    'clf__n_estimators': [200, 500],
    'clf__max_features': ['auto', 'sqrt', 'log2'],
    'clf__max_depth': [5, 20],
}]

pipeline_undersampling

Pipeline(steps=[('missing_data', SimpleImputer()),
                ('outlier',
                 FunctionSampler(func=<function outlier_removal at 0x7f0fbdbfd5f0>)),
                ('under_sampling',
                 FunctionSampler(func=<function random_under_sample at 0x7f0fbdb614d0>)),
                ('clf', RandomForestClassifier())])

In [145]:
# GridSearch with pipeline
grid_undersampling = GridSearchCV(pipeline_undersampling, parameters_undersampling, cv=2,
                    scoring="f1_micro", refit=True,
                    verbose=2)

# Try to find the best model
grid_undersampling.fit(X_train, y_train) 

Fitting 2 folds for each of 2 candidates, totalling 4 fits
[CV] END ...............missing_data__strategy=most_frequent; total time=   0.8s
[CV] END ...............missing_data__strategy=most_frequent; total time=   0.8s
[CV] END ........................missing_data__strategy=mean; total time=   0.8s
[CV] END ........................missing_data__strategy=mean; total time=   0.7s


GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('missing_data', SimpleImputer()),
                                       ('outlier',
                                        FunctionSampler(func=<function outlier_removal at 0x7f0fbdbfd5f0>)),
                                       ('under_sampling',
                                        FunctionSampler(func=<function random_under_sample at 0x7f0fbdb614d0>)),
                                       ('clf', RandomForestClassifier())]),
             param_grid=[{'missing_data__strategy': ['most_frequent', 'mean']}],
             scoring='f1_micro', verbose=2)

In [146]:
#Training score:
y_pred_undersampling = grid_undersampling.predict(X_train)
print("Best: {:.2f} using {}".format(
    grid_undersampling.best_score_, 
    grid_undersampling.best_params_
))
print('TRAIN set score: ' + str(grid_undersampling.score(X_train, y_train)))

#Test score:
y_pred_undersampling = grid_undersampling.predict(X_test)
print('TEST set score: ' + str(grid_undersampling.score(X_test, y_test)))

Best: 0.93 using {'missing_data__strategy': 'most_frequent'}
TRAIN set score: 0.9573
TEST set score: 0.9554


## 2.3 - SMOTE:

In [147]:
from imblearn.over_sampling import SMOTE 

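def stome(X, y):
    # SMOTE over-samples the minority class by interpolating between
    # neighbouring minority samples
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X, y)
    return X_res, y_res


# Note: the third step is named 'under_sampling', although SMOTE is an over-sampling method
pipeline_stome = Pipeline([('missing_data', SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('outlier', FunctionSampler(func=outlier_removal)),
                     ('under_sampling', FunctionSampler(func=stome)),
                     ('clf', RandomForestClassifier())])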

parameters_stome = [{
    'missing_data__strategy': ['most_frequent', 'mean'],
    'clf__n_estimators': [200, 500],
    'clf__max_features': ['auto', 'sqrt', 'log2'],
    'clf__max_depth': [5, 20],
}]

pipeline_stome


Pipeline(steps=[('missing_data', SimpleImputer()),
                ('outlier',
                 FunctionSampler(func=<function outlier_removal at 0x7f0fbdbfd5f0>)),
                ('under_sampling',
                 FunctionSampler(func=<function stome at 0x7f0fbde3e560>)),
                ('clf', RandomForestClassifier())])

In [148]:
# GridSearch with pipeline
grid_stome = GridSearchCV(pipeline_stome, parameters_stome, cv=2,
                    scoring="f1_micro", refit=True,
                    verbose=2)

# Try to find the best model
grid_stome.fit(X_train, y_train) 

Fitting 2 folds for each of 2 candidates, totalling 4 fits
[CV] END ...............missing_data__strategy=most_frequent; total time=   1.8s
[CV] END ...............missing_data__strategy=most_frequent; total time=   1.8s
[CV] END ........................missing_data__strategy=mean; total time=   1.8s
[CV] END ........................missing_data__strategy=mean; total time=   1.8s


GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('missing_data', SimpleImputer()),
                                       ('outlier',
                                        FunctionSampler(func=<function outlier_removal at 0x7f0fbdbfd5f0>)),
                                       ('under_sampling',
                                        FunctionSampler(func=<function stome at 0x7f0fbde3e560>)),
                                       ('clf', RandomForestClassifier())]),
             param_grid=[{'missing_data__strategy': ['most_frequent', 'mean']}],
             scoring='f1_micro', verbose=2)

In [149]:
#Training score:
y_pred_stome = grid_stome.predict(X_train)
print("Best: {:.2f} using {}".format(
    grid_stome.best_score_, 
    grid_stome.best_params_
))
print('TRAIN set score: ' + str(grid_stome.score(X_train, y_train)))

#Test score:
y_pred_stome = grid_stome.predict(X_test)
print('TEST set score: ' + str(grid_stome.score(X_test, y_test)))

Best: 0.96 using {'missing_data__strategy': 'mean'}
TRAIN set score: 0.9859
TEST set score: 0.9757
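
## 3 - Class weights (sketch):

The third option in the task list, handling the imbalance by weight without modifying the dataset, is not run above. A minimal sketch, assuming a `RandomForestClassifier` as in the previous pipelines: its `class_weight` parameter can be set to `'balanced'`, which weights each class by `n_samples / (n_classes * class_count)`. The data below (`X_demo`, `y_demo`) are synthetic and the grid entry in the final comment is only a suggestion.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic unbalanced data (hypothetical): 90 majority, 10 minority
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 90 + [1] * 10)

# Weights that 'balanced' would assign: n_samples / (n_classes * count)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], weights)))

# The dataset is left unchanged; errors on the minority class cost more
clf = RandomForestClassifier(n_estimators=20, class_weight="balanced",
                             random_state=0)
clf.fit(X_demo, y_demo)

# In a pipeline, this could be tuned with a grid entry such as:
# 'clf__class_weight': [None, 'balanced', 'balanced_subsample']
```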
