# Starting-kit for ML pipeline

## Data loading

We can load the dataset that we cleaned during the first session, taking care to parse the date and time columns.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv(
    'cleaner_dataframe.csv',
    index_col=0,
    parse_dates=[
        'FE_Declaration_date',
        'Claim Incident date',
        'Initial coverage date',
        'First claim decision date',
        'Last claim decisión date',
        'Policy Holder date of birth'
    ]
)
# Assign with df[col] = ... so that the column is replaced and gets the
# timedelta dtype; df.loc[:, col] = ... may keep the original object dtype.
for col in ['Age policy at claim', 'Delay declaration', 'Age client at claim']:
    df[col] = pd.to_timedelta(df[col])
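To see why both conversions are needed, here is a minimal sketch on a toy frame. The column names `start` and `delay` are hypothetical and only serve the illustration. After a CSV round-trip, `parse_dates` restores the date column, while the duration column is read back as text and has to go through `pd.to_timedelta`.

```python
import io

import pandas as pd

# Toy frame with one datetime column and one timedelta column
# (hypothetical names, not columns of the real dataset).
toy = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01", "2020-03-15"]),
    "delay": pd.to_timedelta(["3 days", "10 days 12:00:00"]),
})

# Write to an in-memory CSV and read it back, as we do with the real file.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["start"])

# The durations come back as strings, so we convert them explicitly.
reloaded["delay"] = pd.to_timedelta(reloaded["delay"])
print(reloaded.dtypes)
```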

Let's have a look to be sure that everything was properly parsed.

In [None]:
df.info()

In [None]:
df.head()

In addition, we can use the `category` pandas data type for the categorical columns.

In [None]:
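def convert_to_int_object(col):
    """Cast each value to int, replacing unparseable values by NaN."""
    serie = []
    # `Series.iteritems` was removed in pandas 2.0; `items` is the replacement.
    for _, x in col.items():
        try:
            serie.append(int(x))
        except (ValueError, TypeError):
            # ValueError: NaN or a non-numeric string; TypeError: None.
            serie.append(np.nan)
    return pd.Series(serie, index=col.index, dtype=object)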

In [None]:
df['Risk code'] = df['Risk code'].astype('category')
df['Sexo'] = df['Sexo'].astype('category')
df['Refused decision reason code'] = convert_to_int_object(df['Refused decision reason code']).astype('category')
df['Trad_Refusal_reason'] = df['Trad_Refusal_reason'].astype('category')
df['Refusal_Category'] = df['Refusal_Category'].astype('category')
df['Claim_Status_Level_0'] = df['Claim_Status_Level_0'].astype('category')
df['Refusal_Flag'] = df['Refusal_Flag'].astype('category')
df['Local Partner name categories'] = df['Local Partner name categories'].astype('category')
df['Insured NIF categories'] = df['Insured NIF categories'].astype('category')
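As a small illustration of what the `category` dtype stores, the sketch below converts a toy series of raw codes. Note that it uses `pd.to_numeric(..., errors="coerce")`, a vectorized alternative to the helper above and not the same function: the resulting categories are floats rather than Python integers. The values in `raw_codes` are made up.

```python
import pandas as pd

# Hypothetical raw codes mixing numeric strings, junk and missing values.
raw_codes = pd.Series(["12", "7", "unknown", None, "12"])

# Values that cannot be parsed as numbers become NaN.
numeric = pd.to_numeric(raw_codes, errors="coerce")
as_category = numeric.astype("category")

# The distinct values are stored once; each row only holds an integer code.
# Missing values are not a category and get the code -1.
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.tolist())
```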

We saw that some columns should not be used as features. We drop them and keep `Refusal_Flag` as the target.

In [None]:
target = df['Refusal_Flag']
data = df.drop(columns=[
    'Refusal_Flag',
    'Refused decision reason code',
    'Claim_Status_Level_0',
    'Trad_Refusal_reason',
    'Refusal_Category',
    'First claim decision date',
    'Last claim decisión date',
    'Insured NIF categories',
    'Claim Number categories'
])

Looking at `Refusal_Category`, we can check the reasons for refusal. For this first try, we will work on a subset of the data that excludes the "Administrative" rows.

In [None]:
df['Refusal_Category'].unique()

In [None]:
mask_not_administrative = ~(df['Refusal_Category'] == 'Administrative')
data = data[mask_not_administrative]
target = target[mask_not_administrative]
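The same filtering can be sketched on a toy frame to check that the features and the target stay aligned after masking. The `feature` column and the "Medical" label are hypothetical. Rows with a missing category are kept, because a missing value is never equal to `'Administrative'`.

```python
import pandas as pd

# Toy data: two administrative refusals, one other refusal, one accepted claim.
toy = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0],
    "Refusal_Category": ["Administrative", "Medical", None, "Administrative"],
    "Refusal_Flag": [1, 1, 0, 1],
})
toy_target = toy["Refusal_Flag"]
toy_data = toy.drop(columns=["Refusal_Flag", "Refusal_Category"])

# Build the mask from the original frame, then apply it to both objects.
keep = toy["Refusal_Category"] != "Administrative"
toy_data, toy_target = toy_data[keep], toy_target[keep]

print(toy_data.index.tolist(), toy_target.tolist())
```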

Define the different parts of the pipeline:

* Encode the categorical data;
* Drop the date columns;
* Pass the numerical columns through unchanged.

In [None]:
data.info()

In [None]:
from sklearn.compose import make_column_selector

categorical_columns = make_column_selector(dtype_include="category")(data)
numerical_columns = make_column_selector(dtype_include=[np.int64, np.float64])(data)

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

In [None]:
preprocessor = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), categorical_columns),
    ("passthrough", numerical_columns),
    n_jobs=-1
)
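To make the behaviour of this preprocessor concrete, here is a minimal sketch on a toy frame with hypothetical columns `kind`, `amount` and `when`. It illustrates two points: columns that are not selected (here the date column) are dropped because `remainder` defaults to `"drop"`, and a category unseen during `fit` is encoded as `-1`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one categorical, one numerical and one date column.
train = pd.DataFrame({
    "kind": pd.Categorical(["a", "b", "a"]),
    "amount": [10.0, 20.0, 30.0],
    "when": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
})
test = pd.DataFrame({
    "kind": pd.Categorical(["b", "c"]),  # "c" is not present in `train`
    "amount": [5.0, 15.0],
    "when": pd.to_datetime(["2020-02-01", "2020-02-02"]),
})

cat_cols = make_column_selector(dtype_include="category")(train)
num_cols = make_column_selector(dtype_include=[np.int64, np.float64])(train)

toy_preprocessor = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), cat_cols),
    ("passthrough", num_cols),
)
toy_preprocessor.fit(train)

# Two output columns only: the encoded category and the untouched amount.
encoded = toy_preprocessor.transform(test)
print(encoded)
```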

Use a `RandomForestClassifier` and evaluate it with a 3-fold cross-validation. We will report the `balanced_accuracy_score` and the `roc_auc_score`.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

In [None]:
model = make_pipeline(preprocessor, RandomForestClassifier(n_estimators=100, n_jobs=-1))

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data, target, stratify=target, random_state=0
)

Let's check that our pipeline is working before performing the cross-validation.

In [None]:
model.fit(X_train, y_train)
model.score(X_test, y_test)
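Keep in mind that `score` returns the plain accuracy for a classifier. The sketch below, on synthetic labels that are assumed to be imbalanced (90% / 10%), shows why the balanced accuracy can be more informative in that case: a baseline that always predicts the majority class looks good on accuracy but not on balanced accuracy. Whether the real target is imbalanced has to be checked on the data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic, deliberately imbalanced labels; the features are irrelevant here.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, pred)
bal_acc = balanced_accuracy_score(y_toy, pred)
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
```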

In [None]:
# Use `target` (defined above); `label` was never defined.
scores = cross_validate(
    model, data, target, scoring=['roc_auc', 'balanced_accuracy'], cv=3, n_jobs=-1
)

Convert the scores to a dataframe for a nicer display.

In [None]:
scores = pd.DataFrame(scores)
scores

Compute the mean performance.

In [None]:
scores.mean().to_frame().T

As well as the standard deviation of those scores.

In [None]:
scores.std().to_frame().T
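If a single summary table is preferred, both statistics can be computed in one call with `DataFrame.agg`. The sketch below runs on a synthetic dataset from `make_classification` (an assumption for illustration, not the claims data) with a small forest so that it executes quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic binary problem standing in for the real data.
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

toy_scores = cross_validate(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X, y, scoring=["roc_auc", "balanced_accuracy"], cv=3,
)

# One row per statistic, one column per entry returned by cross_validate.
summary = pd.DataFrame(toy_scores).agg(["mean", "std"])
print(summary[["test_roc_auc", "test_balanced_accuracy"]])
```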