<a href="https://colab.research.google.com/github/NotAndex/Demo/blob/main/on_the_importance_of_pipelines_during_cross_validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 Setup

In [None]:
import pandas as pd
import numpy as np


import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score, precision_recall_curve
from sklearn.preprocessing import Normalizer, MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV 

!pip install catboost
from catboost import CatBoostClassifier


# 2 Data

In [3]:
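# Load the credit card fraud data set
df = pd.read_csv('https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv')

# Rename the target column and log-transform the amount
# (the small offset avoids log10(0) if an amount is zero)
df.rename(columns={'Class':'label'}, inplace=True)
df['log10_amount'] = np.log10(df.Amount + 0.00001)
df = df.drop(['Time','Amount'],axis=1)

X = df.drop('label', axis=1)
y = df.label

# Hold out 30% of the data as the test set
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, random_state=42)

# Split the remaining 70% into train (56% of all rows) and validation (14%)
X_train, X_validate, y_train, y_validate = \
    train_test_split(X_train, y_train, test_size=0.2, random_state=42)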

# 3 Implementing the pre-processing pipeline

In [4]:
# Definition of how to transform numeric values
num_transformer = Pipeline(steps=[('scaler', StandardScaler())])

# Columns of the data that are numeric features
num_feat = X_train.select_dtypes(include=['float64']).columns

# Pipeline: How (num_transformer) to transform which (num_feat) columns
preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num_feat)])

# Glue pipeline together: preprocessor + Classifier
cat_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('cb_clf', CatBoostClassifier())])

## But why is a pipeline important? 
Suppose you want to transform your data so that the model can learn from it more easily. The same transformation then has to be applied to the train, validation, and test sets. Take the StandardScaler() as an example, which computes z = (x - u) / s, where u is the mean and s is the standard deviation of a feature. The key point is that u and s must not come from the validation or test set: at prediction time those statistics are unknown, and estimating them from held-out data would leak information into training. Instead, you fit the StandardScaler() on the train set and reuse its u and s to transform the validation and test sets. When the StandardScaler() is part of a pipeline, this behavior [carries over to cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html): in every cross-validation split, the StandardScaler() is fitted on the training folds only and then used to transform the held-out fold.
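
The following is a minimal sketch of that idea, not the notebook's actual model. It assumes synthetic data from `make_classification` and swaps in `LogisticRegression` for the CatBoost classifier to keep the run short; all names ending in `_demo` are made up for illustration. It first scales a validation split with the train statistics, then checks that a pipeline inside `cross_val_score` gives the same scores as a manual loop that refits the scaler on the training folds of every split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the credit card data set)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Manual version: u and s are estimated on the train part only
X_tr, X_val = X_demo[:150], X_demo[150:]
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_val_scaled = scaler.transform(X_val)  # reuses u and s from X_tr

# Train columns are centered exactly; validation columns only approximately
print(np.abs(X_tr_scaled.mean(axis=0)).max())
print(np.abs(X_val_scaled.mean(axis=0)).max())

# Pipeline version: the scaler is refitted inside every cross-validation split
demo_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                            ('clf', LogisticRegression())])
cv = KFold(n_splits=5)
pipe_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=cv)

# Manual loop doing the same thing by hand, for comparison
manual_scores = []
for train_idx, held_out_idx in cv.split(X_demo):
    fold_scaler = StandardScaler().fit(X_demo[train_idx])
    clf = LogisticRegression().fit(fold_scaler.transform(X_demo[train_idx]),
                                   y_demo[train_idx])
    manual_scores.append(clf.score(fold_scaler.transform(X_demo[held_out_idx]),
                                   y_demo[held_out_idx]))
manual_scores = np.array(manual_scores)

print(pipe_scores)
print(manual_scores)
```

Scaling the full data set once before calling `cross_val_score` would instead let every held-out fold influence u and s, which is the leakage the pipeline avoids.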

# 4 Hyperparameter tuning



## 4.1 Hyperparameter definition


In [5]:
cat_hyperparams = {'cb_clf__learning_rate': [0.01, 0.03, 0.1],
                   'cb_clf__objective': ['CrossEntropy'],
                   'cb_clf__eval_metric': ['BalancedAccuracy']}
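
The `cb_clf__` prefix follows scikit-learn's `<step name>__<parameter>` convention, so each entry is routed to the pipeline step named `cb_clf`. As a small sketch, the grid can be expanded with `ParameterGrid` to see which candidates a grid search would try. The name `demo_grid` is a copy made for illustration, and the fold count of 5 is an assumption matching the `cv = 5` used in section 4.2.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid above so that this block runs on its own
demo_grid = {'cb_clf__learning_rate': [0.01, 0.03, 0.1],
             'cb_clf__objective': ['CrossEntropy'],
             'cb_clf__eval_metric': ['BalancedAccuracy']}

# Every combination of the listed values is one candidate configuration
for candidate in ParameterGrid(demo_grid):
    print(candidate)

n_candidates = len(ParameterGrid(demo_grid))

# With 5-fold cross-validation each candidate is fitted once per fold
n_fits = n_candidates * 5

# The part before '__' is the pipeline step, the part after it the parameter
step_name, param_name = 'cb_clf__learning_rate'.split('__', 1)
print(step_name, param_name, n_candidates, n_fits)
```

Only the learning rate varies here, so the search compares three candidates; with the default `refit=True`, GridSearchCV() adds one final fit of the best candidate on the full training data.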

## 4.2 Hyperparameter search via GridSearchCV()

![Pic](https://github.com/NotAndex/Demo/blob/main/images/visio_cross_val_graphic.png?raw=true)

In [None]:
cat_model = GridSearchCV(cat_clf, cat_hyperparams, scoring="balanced_accuracy", cv = 5)

cat_model.fit(X_train, y_train)

# 5 Model implementation

## 5.1 Get + set best model configuration

In [10]:
best_parameter = cat_model.best_params_
print(best_parameter)
cat_clf.set_params(**best_parameter)

{'cb_clf__eval_metric': 'BalancedAccuracy', 'cb_clf__learning_rate': 0.03, 'cb_clf__objective': 'CrossEntropy'}


## 5.2 Model fit + predict

In [None]:
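# The pipeline only transforms the data passed as X, not the eval_set,
# so scale the validation set with the preprocessor fitted on the train set
X_validate_scaled = preprocessor.fit(X_train).transform(X_validate)

cat_clf.fit(X_train, y_train,
            cb_clf__eval_set=(X_validate_scaled, y_validate))

# Predict on the held-out test set (the pipeline scales X_test internally)
y_test_pred_cat = cat_clf.predict(X_test)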
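
One way to score such predictions, consistent with the `balanced_accuracy` scoring used in the grid search, is sketched below. The label lists are made-up toy values standing in for `y_test` and `y_test_pred_cat`; they are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Toy labels (assumption): 8 negatives, 2 positives, one positive is missed
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Plain accuracy is dominated by the majority class
acc = accuracy_score(y_true_demo, y_pred_demo)

# Balanced accuracy averages the recall of each class
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(acc, bal_acc)
print(cm)
```

On imbalanced data such as fraud labels, plain accuracy can look high even when half of the positive cases are missed, which is why the balanced variant is the more informative score here.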