# Tutorial for Logistic Regression

Cobra requires the usual Python packages for data science:
- numpy (>=1.19.4)
- pandas (>=1.1.5)
- scipy (>=1.5.4)
- scikit-learn (>=0.23.1)
- matplotlib (>=3.3.3)
- seaborn (>=0.11.0)

These packages, along with their minimum versions, are listed in `requirements.txt` and can be installed using pip.


Note: if you install Cobra with pip, you don't have to install these requirements separately, as they are installed automatically with Cobra itself.

In [None]:
pip install -r requirements.txt

The easiest way to install cobra is using pip:

In [None]:
pip install -U pythonpredictions-cobra

*****

In this section, we will walk you through all the required steps to build a predictive logistic regression model using **Cobra**. All classes and functions used here are well-documented. In case you want more information on a class or function, run the next cell:

In [None]:
#help(function_or_class_you_want_info_from)

Building a good model involves three steps:

1. **Preprocessing**: properly prepare the predictors (a synonym for “feature” or variable that we use throughout this tutorial) for modelling.

2. **Feature Selection**: automatically select a subset of predictors which contribute most to the target variable or output in which you are interested.

3. **Model Evaluation**: once a model has been built, a detailed evaluation can be performed by computing a range of evaluation metrics.



Let's dive in!!!
***

## Survival Prediction using Titanic data
- GOAL : Predict whether an individual survived the sinking of the Titanic
- BASETABLE : seaborn dataset Titanic

Import the necessary libraries:

In [None]:
import json
import pandas as pd
import numpy as np

from pandas.api.types import is_datetime64_any_dtype

pd.set_option('display.max_columns', 50)
pd.set_option("display.max_rows", 50)
from cobra.preprocessing import PreProcessor
from cobra.evaluation import generate_pig_tables, plot_incidence
from cobra.evaluation import evaluator

In [None]:
import seaborn as sns
df=sns.load_dataset('titanic')
df.head()

In the example below, we assume the data for model building is available in a pandas DataFrame. This DataFrame should contain an ID column, a target column (e.g. “**survived**”) and a number of candidate predictors (features) to build a model with.

***


In [None]:
df.dtypes

All columns of dtype category have to be converted to object dtype:


In [None]:
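# Convert categorical columns to plain object dtype.
# Assigning the columns directly (instead of via df.loc) makes sure the dtype actually changes.
cat_cols = df.select_dtypes(["category"]).columns
df[cat_cols] = df[cat_cols].astype("object")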

## Data preprocessing

#### The first part focusses on preparing the predictors for modelling by:

1. Defining the ID column, the target, discrete and continuous variables

2. Splitting the dataset into training, selection and validation datasets.

3. Binning continuous variables into discrete intervals

4. Replacing missing values of both categorical and continuous variables (which are now binned) with an additional “Missing” bin/category

5. Regrouping categories into a new category “other”

6. Replacing bins/categories with their corresponding incidence rate per category/bin.

*Disclaimer*: Cobra's PreProcessor is valid only if the original data does not contain extreme irregularities, such as outliers or very skewed distributions. The user should always check this beforehand.
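
To make steps 4 and 6 more concrete, here is a minimal plain-Python sketch of replacing categories by their incidence rate, with missing values first mapped to a "Missing" category. It only illustrates the idea and is not Cobra's implementation; the function name `incidence_encode` and the toy data are made up for this example.

```python
from collections import defaultdict


def incidence_encode(values, targets, missing_label="Missing"):
    """Replace each category by the mean target (incidence) observed for it."""
    # Step 4: missing values get their own category
    labels = [missing_label if v is None else v for v in values]

    totals = defaultdict(int)
    positives = defaultdict(int)
    for label, target in zip(labels, targets):
        totals[label] += 1
        positives[label] += target

    # Step 6: incidence rate per category, then map every row to its rate
    incidence = {label: positives[label] / totals[label] for label in totals}
    return [incidence[label] for label in labels], incidence


# Toy data (hypothetical, not taken from the Titanic dataset)
deck = ["A", "B", None, "A", None, "B", "A"]
survived = [1, 0, 0, 1, 1, 0, 0]

encoded, incidence = incidence_encode(deck, survived)
print(incidence)
print(encoded)
```

In the tutorial itself, this kind of mapping is learned on the training split only and then applied to all rows, which is why `fit` and `transform` are separate calls.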

In this toy dataset, the index will serve as ID,

In [None]:
df["id"] = df.index + 1
id_col = "id"

and survived is the target,


In [None]:
target_col = "survived"

Now, we remove the columns 'who' and 'adult_male' since they duplicate the information in 'sex', and also 'alive', which seems to be a duplicate of 'survived'.


In [None]:
del df['who']
del df['adult_male']
del df['alive']

Finding out which variables are categorical ("discrete") and which are continuous:


 => columns that contain strings are definitely discrete:

In [None]:
col_dtypes = df.dtypes
discrete_vars = [col for col in col_dtypes[col_dtypes==object].index.tolist() if col not in [id_col, target_col]] 
print(discrete_vars)
print()
for col in discrete_vars:
    print(col)
    print(df[col].value_counts())
    print()

Next, we also check for numerical columns that only contain a few different values, thus to be interpreted as discrete, categorical variables


In [None]:
for col in df.columns:
    # Skip columns already marked as discrete (string-typed), as well as the ID and the target:
    if col not in discrete_vars and col not in [id_col, target_col]:
        val_counts = df[col].value_counts()
        if 1 < len(val_counts) <= 10:  # The column contains at most 10 different values.
            print(col)
            print(val_counts)
            print()

By taking a look at the printed variables, it is clear that we have to include those in the list of discrete variables. This can be done as follows:

In [None]:
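# Add the low-cardinality columns found above, skipping any that are already listed:
low_cardinality_vars = ["pclass", "sibsp", "parch", "class", "deck", "alone"]
discrete_vars.extend([col for col in low_cardinality_vars if col not in discrete_vars])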
discrete_vars

The remaining variables, excluding the ID and the target, can be labelled as continuous predictors.


In [None]:
continuous_vars = list(set(df.columns)
                       - set(discrete_vars) 
                       - set([id_col, target_col]))
continuous_vars                       

Now, we can prepare **Cobra's PreProcessor**:

In [None]:
# using all Cobra's default parameters for preprocessing for now:
preprocessor = PreProcessor.from_params(
    model_type="classification")

# These are the options though:
help(PreProcessor.from_params)

Split the data into train, selection and validation sets:


In [None]:
from cobra.preprocessing import PreProcessor
basetable = preprocessor.train_selection_validation_split(
                data=df,
                train_prop=0.6,
                selection_prop=0.2,
                validation_prop=0.2)
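
As a rough illustration of what such a split does, the sketch below assigns a split label to each row according to the given proportions. This is an assumption-laden stand-in, not Cobra's code: the function name `assign_splits`, the fixed seed and the rounding rule are all made up for this example, and the real method may behave differently (for instance in how it shuffles or rounds).

```python
import random


def assign_splits(n_rows, train_prop=0.6, selection_prop=0.2,
                  validation_prop=0.2, seed=42):
    """Return a shuffled list of split labels, one per row."""
    if abs(train_prop + selection_prop + validation_prop - 1.0) > 1e-9:
        raise ValueError("The three proportions must sum to 1.")

    n_train = int(n_rows * train_prop)
    n_selection = int(n_rows * selection_prop)
    n_validation = n_rows - n_train - n_selection  # remainder goes to validation

    labels = (["train"] * n_train
              + ["selection"] * n_selection
              + ["validation"] * n_validation)
    random.Random(seed).shuffle(labels)  # fixed seed (an assumption) for reproducibility
    return labels


labels = assign_splits(891)  # the seaborn Titanic dataset has 891 rows
print({name: labels.count(name) for name in ("train", "selection", "validation")})
```

In the tutorial, the resulting label ends up in the `split` column of the basetable, which is what the later cells filter on.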

And fit the preprocessor pipeline:


In [None]:
preprocessor.fit(basetable[basetable["split"] == "train"],
                 continuous_vars=continuous_vars,
                 discrete_vars=discrete_vars,
                 target_column_name=target_col)

The fitted pipeline can now be applied to the whole basetable:


In [None]:
basetable = preprocessor.transform(basetable,
                                   continuous_vars=continuous_vars,
                                   discrete_vars=discrete_vars)
basetable.head()

## Feature selection

Once the predictors are properly prepared, we can start building a predictive model, which boils down to selecting the right predictors from the dataset to train a model on.
As a dataset typically contains many predictors, **we first perform a univariate preselection** to rule out any predictor with little to no predictive power. Later, using the list of preselected features, we build a logistic regression model using **forward feature selection** to choose the right set of predictors.

These are the predictors as preprocessed in the previous steps:

In [None]:
preprocessed_predictors = [
    col for col in basetable.columns
    if col.endswith("_bin") or col.endswith("_processed")]
sorted(preprocessed_predictors)

But for feature selection, we use the target encoded version of each of these.

In [None]:
preprocessed_predictors = [col for col in basetable.columns.tolist()
                           if '_enc' in col]

A univariate selection on the preprocessed predictors can now be conducted. The thresholds for retaining a feature are currently set to their default values but can be changed by the user.


In [None]:
from cobra.model_building import univariate_selection

df_auc = univariate_selection.compute_univariate_preselection(
    target_enc_train_data=basetable[basetable["split"] == "train"],
    target_enc_selection_data=basetable[basetable["split"] == "selection"],
    predictors=preprocessed_predictors,
    target_column=target_col,
    preselect_auc_threshold=0.53,  # if auc_selection <= 0.53 exclude predictor
    preselect_overtrain_threshold=0.05  # if (auc_train - auc_selection) >= 0.05 --> overfitting!
    )
from cobra.evaluation import plot_univariate_predictor_quality
plot_univariate_predictor_quality(df_auc)
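
The selection rule described in the comments above can be sketched in plain Python. The `auc` helper computes the AUC as the share of positive/negative pairs that are ranked correctly, and `is_preselected` applies the two thresholds. Both function names are hypothetical, and this is only an illustration of the rule, not Cobra's implementation.

```python
def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    # Count pairwise "wins" of positives over negatives; ties count for half.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def is_preselected(auc_train, auc_selection,
                   auc_threshold=0.53, overtrain_threshold=0.05):
    """Keep a predictor only if it is predictive enough and not overfitted."""
    predictive = auc_selection > auc_threshold
    stable = (auc_train - auc_selection) < overtrain_threshold
    return predictive and stable


toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)                      # 0.75
print(is_preselected(0.70, 0.68))   # True
print(is_preselected(0.80, 0.60))   # False: train AUC is much higher than selection AUC
```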

Next, we compute the correlations between the preprocessed predictors and plot them as a correlation matrix:


In [None]:
from cobra.evaluation import plot_correlation_matrix
df_corr = (univariate_selection
           .compute_correlations(basetable[basetable["split"] == "train"],
                                 preprocessed_predictors))
plot_correlation_matrix(df_corr)

To get a list of the selected predictors after the univariate selection, run the following call:


In [None]:
preselected_predictors = (univariate_selection
                          .get_preselected_predictors(df_auc))
preselected_predictors

After an initial preselection on the predictors, we can start building the model itself using forward feature selection to choose the right set of predictors. Since we use target encoding on all our predictors, we will only consider models with positive coefficients (no sign flip should occur) as this makes the model more interpretable.
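
The greedy idea behind forward feature selection can be sketched as follows. At each step, the candidate that gives the best score on top of the already selected predictors is added. The function `forward_selection`, the scoring function `toy_score` and the gain values are hypothetical stand-ins; in Cobra, each step fits a logistic regression, and with `pos_only=True` only models with positive coefficients are considered.

```python
def forward_selection(candidates, score_fn, max_predictors=30):
    """Greedy forward selection: add the best remaining predictor at each step."""
    selected, remaining, steps = [], list(candidates), []
    while remaining and len(selected) < max_predictors:
        # Try each remaining predictor on top of the current selection
        best = max(remaining, key=lambda p: score_fn(selected + [p]))
        selected.append(best)
        remaining.remove(best)
        steps.append((list(selected), score_fn(selected)))
    return steps


# Hypothetical stand-in for "fit a model on these predictors and return its AUC"
toy_gain = {"class_enc": 0.10, "sex_enc": 0.25, "fare_enc": 0.08}


def toy_score(predictors):
    return 0.5 + sum(toy_gain[p] for p in predictors)


steps = forward_selection(toy_gain, toy_score)
for predictors, score in steps:
    print(len(predictors), predictors, round(score, 2))
```

Each entry in `steps` corresponds to one model with one more predictor than the previous one, which is the structure the performance table and curves in the next section are built on.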

## Modeling

In [None]:
from cobra.model_building import ForwardFeatureSelection

forward_selection = ForwardFeatureSelection(model_type="classification",
                                            max_predictors=30,
                                            pos_only=True)

# fit the forward feature selection on the train data
# has optional parameters to force and/or exclude certain predictors (see docs)
forward_selection.fit(basetable[basetable["split"] == "train"],
                      target_column_name = target_col,
                      predictors = preselected_predictors)
                      #forced_predictors: list = [],
                      #excluded_predictors: list = [])

# compute model performance
performances = (forward_selection
                .compute_model_performances(basetable, target_column_name = target_col))
performances

As can be seen, the forward selection completed 4 steps before no further improvement could be observed.

In [None]:
from cobra.evaluation import plot_performance_curves

# plot performance curves
plot_performance_curves(performances)

Based on the performance curves (AUC per model with a particular number of predictors in case of logistic regression), a final model can then be chosen and the variable importance can be plotted:


In [None]:
model = forward_selection.get_model_from_step(3)

# Note that the chosen model uses the following variables:
final_predictors = model.predictors
print(final_predictors)
from cobra.evaluation import plot_variable_importance

variable_importance = model.compute_variable_importance(
    basetable[basetable["split"] == "selection"]
)
plot_variable_importance(variable_importance)

**Note**: variable importance is based on correlation of the predictor with the model scores (and not the true labels!).
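
To illustrate what "correlation of the predictor with the model scores" means, here is a small sketch using the Pearson correlation. Whether Cobra uses Pearson or another correlation measure is an assumption here, and the scores and encoded values below are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented model scores and target-encoded predictor values for four rows
model_scores = [0.9, 0.7, 0.4, 0.2]
sex_enc = [0.74, 0.74, 0.19, 0.19]
fare_enc = [0.3, 0.6, 0.3, 0.6]

importance = {
    "sex_enc": pearson(sex_enc, model_scores),
    "fare_enc": pearson(fare_enc, model_scores),
}
print(importance)
```

A predictor that moves closely with the model scores gets a high importance, regardless of how well those scores match the true labels.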



Finally, we can export the model to a dictionary and store it as JSON:

In [None]:
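import os

model_dict = model.serialize()

# Make sure the output directory exists before writing to it
os.makedirs("output", exist_ok=True)
model_path = os.path.join("output", "model.json")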
with open(model_path, "w") as file:
    json.dump(model_dict, file)

# To reload the model again from a JSON file, run the following snippet:
# from cobra.model_building import LogisticRegressionModel
# with open(model_path, "r") as file:
#     model_dict = json.load(file)
# model = LogisticRegressionModel()
# model.deserialize(model_dict)

## Evaluation

Now that we have built and selected a final model, it is time to evaluate its predictions on held-out data (the cells below use the selection split) against various evaluation metrics. The metrics used are:
1. Accuracy
2. AUC: Area Under Curve
3. Precision
4. Recall
5. F1
6. Matthews Correlation Coefficient
7. Lift

Furthermore, we can evaluate the classification performance using a confusion matrix.
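
Several of the scalar metrics listed above follow directly from the four counts of a confusion matrix. The sketch below shows the standard formulas in plain Python; the function names are made up for this example, the labels are toy data, and the code does not guard against division by zero.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def scalar_metrics(tp, fp, tn, fn):
    """Standard classification metrics computed from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "matthews_corrcoef": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # labels after applying a cut-off to the scores
counts = confusion_counts(toy_true, toy_pred)
metrics = scalar_metrics(*counts)
print(counts)
print(metrics)
```

AUC is different in that it is computed from the scores themselves, so it does not depend on the chosen cut-off.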


Plots also make the evaluation of a logistic regression model a lot easier. We will first use a **Receiver Operating Characteristic (ROC) curve**, which is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis). Next, the **Cumulative Gains curve**, **Cumulative Lift curve** and **Cumulative Response curve** can be plotted.
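
For readers unfamiliar with gains and lift, the sketch below computes a single point of each curve using the usual definitions: rank the rows by score, take the top fraction, and compare what is found there with the whole population. The helper names and the toy data are hypothetical, and Cobra's plotting methods may differ in details such as how the fractions are binned.

```python
def _rank_by_score(y_true, scores):
    """Return the true labels ordered from highest to lowest score."""
    pairs = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return [y for _, y in pairs]


def cumulative_gains(y_true, scores, fraction):
    """Share of all positives captured in the top `fraction` of scored rows."""
    ranked = _rank_by_score(y_true, scores)
    n_top = max(1, round(len(ranked) * fraction))
    return sum(ranked[:n_top]) / sum(ranked)


def cumulative_lift(y_true, scores, fraction):
    """Incidence in the top `fraction` divided by the overall incidence."""
    ranked = _rank_by_score(y_true, scores)
    n_top = max(1, round(len(ranked) * fraction))
    return (sum(ranked[:n_top]) / n_top) / (sum(ranked) / len(ranked))


toy_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

for fraction in (0.2, 0.5, 1.0):
    print(fraction,
          round(cumulative_gains(toy_true, toy_scores, fraction), 3),
          round(cumulative_lift(toy_true, toy_scores, fraction), 3))
```

A lift above 1 in the top fractions means the model concentrates positives among its highest scores; at a fraction of 1.0 both gains and lift are 1 by construction.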

In [None]:
from cobra.evaluation import ClassificationEvaluator

# get numpy array of True target labels and predicted scores:
y_true = basetable[basetable["split"] == "selection"][target_col].values
y_pred = model.score_model(basetable[basetable["split"] == "selection"])

In [None]:
evaluator = ClassificationEvaluator()
evaluator.fit(y_true, y_pred)  # Automatically find the best cut-off probability

In [None]:
evaluator.scalar_metrics

In [None]:
evaluator.plot_confusion_matrix()

In [None]:
evaluator.plot_roc_curve()

In [None]:
evaluator.plot_cumulative_gains()

In [None]:
evaluator.plot_lift_curve()

In [None]:
evaluator.plot_cumulative_response_curve()

Additionally, we can compute the output needed to plot the so-called Predictor Insights Graphs (PIGs for short). These graphs show the relationship between a single predictor and the target: the predictor is binned into groups, the group sizes are shown as bars, and the group (target) incidence is shown as a colored line. We have the option to force the order of the predictor values.
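
The numbers behind one such graph can be sketched in a few lines: for every bin of a predictor, compute its relative size (the bars) and its target incidence (the line). The function `pig_table`, its output keys and the toy bins are hypothetical and only mimic the idea; the actual columns returned by `generate_pig_tables` may be named differently.

```python
from collections import Counter


def pig_table(bins, targets):
    """Per bin: relative group size and target incidence."""
    sizes = Counter(bins)
    positives = Counter(b for b, t in zip(bins, targets) if t == 1)
    n_rows = len(bins)
    return [
        {
            "label": label,
            "pop_size": sizes[label] / n_rows,              # height of the bar
            "incidence": positives[label] / sizes[label],   # point on the line
        }
        for label in sorted(sizes)
    ]


# Toy binned predictor and target (invented values)
age_bin = ["0 - 18", "18 - 40", "18 - 40", "40 - 80", "0 - 18", "18 - 40"]
survived = [1, 0, 1, 0, 1, 0]

table = pig_table(age_bin, survived)
for row in table:
    print(row)
```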

In [None]:
from cobra.evaluation import generate_pig_tables
predictor_list = [col for col in basetable.columns
                  if col.endswith("_bin") or col.endswith("_processed")]
pig_tables = generate_pig_tables(basetable[basetable["split"] == "selection"],
                                 id_column_name=id_col,
                                 target_column_name=target_col,
                                 preprocessed_predictors=predictor_list)
pig_tables

In [None]:
from cobra.evaluation import plot_incidence
for predictor in list(pig_tables.variable.unique()):
    print(predictor)
    try:
        if predictor + "_bin" in basetable.columns:
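            # Sort the distinct bins via a Series, so this also works when the column has object dtype
            column_order = basetable[predictor + "_bin"].drop_duplicates().sort_values().tolist()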
        else:
            column_order = None #sorted(list(basetable[predictor].unique())) # e.g. just binary variable
        plot_incidence(pig_tables,
                       variable=predictor,
                       model_type="classification",
                       column_order=column_order)
    except (ValueError, TypeError) as err:
        print(f"Can't plot PIG for {predictor}. Error was: {err}")