# Logistic Regression

In this notebook, we apply Logistic Regression to our data to predict 'churn'.

In [1]:
# All imports will be here:
import pandas as pd
import numpy as np

from utils import import_and_transform
from utils import evaluate_model
from utils import aggregate, aggregate_features_improved, aggregate_features_improved2
from utils import get_churned_users

import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

Based on the exploratory data analysis (**EDA**), we now modify our database accordingly. The EDA revealed issues and necessary changes that require these modifications.

We restructure our database:

Since this competition's task asks us to focus on churn in a specific time window, we create a function that identifies who churned within a specified period.
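
The actual `get_churned_users` helper lives in `utils` and is not shown here. As a rough sketch of the idea only, assume an event log with `userId`, `ts`, and `page` columns in which churn is recorded as a `"Cancellation Confirmation"` event. The column names, the event label, the window convention `(start, end]`, and the function name `churned_between` are all assumptions made for illustration:

```python
import pandas as pd


def churned_between(events, start, end, churn_page="Cancellation Confirmation"):
    """Return the ids of users with a churn event in the window (start, end].

    Hypothetical stand-in for the helper in utils; the column names and
    the churn event label are assumptions, not the notebook's schema.
    """
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    in_window = (events["ts"] > start) & (events["ts"] <= end)
    is_churn = events["page"] == churn_page
    return set(events.loc[in_window & is_churn, "userId"])


# Toy event log: user 1 churns inside the window, user 2 after it, user 3 never.
toy_events = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2, 3],
        "ts": pd.to_datetime(
            ["2018-10-16", "2018-10-20", "2018-10-17", "2018-11-02", "2018-10-18"]
        ),
        "page": [
            "NextSong",
            "Cancellation Confirmation",
            "NextSong",
            "Cancellation Confirmation",
            "NextSong",
        ],
    }
)

toy_churned = churned_between(toy_events, "2018-10-15", "2018-10-25")
print(toy_churned)
```

Only user 1 is returned here, because user 2's cancellation falls after the end of the window.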

Now we are ready to load the train and test data and apply these modifications directly to our datasets.

In [2]:
# Load training data
df_train = import_and_transform("Data/train.parquet")

# Prepare test data
df_test = import_and_transform("Data/test.parquet")

We want to capture more temporal patterns, that is, to "teach" the model to recognize churn patterns across different time periods. How do we do that?

If we look at only one point in time, we do not have enough examples to train the model. So instead of taking just one "snapshot", we take multiple snapshots at different times and treat each of them as an individual prediction problem.

This way, we increase the amount of training data.

In [None]:
# Create observation dates every 5 days; each date yields
# a separate training sample (sliding window)
training_dates = pd.date_range("2018-10-15", "2018-11-05", freq="5D")

X_train_list = []
y_train_list = []

# For each observation date, we create a separate training sample:
for obs_date in training_dates:
    # Filtering data up to the observation date
    df_obs = df_train[df_train["ts"] <= obs_date]
    features = aggregate_features_improved(df_obs, obs_date)   # Better aggregate function.
    # features = aggregate_features_improved2(df_obs, obs_date)  # Better function, but takes ages to run
    
    # Create a 10-day window after the observation date
    # and identify who churned in that period
    window_end = obs_date + pd.Timedelta(days=10)
    churned_users = get_churned_users(df_train, obs_date, window_end)

    # 1 if they churned in the next 10 days, 0 otherwise
    labels = pd.Series(
        features.index.isin(churned_users).astype(int),
        index=features.index,
        name="churned",
    )

    X_train_list.append(features)
    y_train_list.append(labels)

    print(
        f"Date of the observation: {obs_date.date()}, with {len(features)} users, and a {labels.mean():.2%} churn rate"
    )

# We combine all observation windows:
X_train_combined = pd.concat(X_train_list)
y_train_combined = pd.concat(y_train_list)

# Keep numeric columns only, then exclude a few columns by name
feature_cols = X_train_combined.select_dtypes(include=[np.number]).columns
feature_cols = [
    c
    for c in feature_cols
    if c not in ["registration", "ts_min", "ts_max", "total_length"]
]

X_train_final = X_train_combined[feature_cols]




In [5]:
test_features = aggregate_features_improved(df_test, "2018-11-20")
X_test = test_features[feature_cols]

We are trying to predict a binary outcome: churn (1) or no churn (0). A natural first step is to apply Logistic Regression and then optimize it.

Hence, we start by applying a vanilla Logistic Regression model.

In [6]:
log_reg = LogisticRegression(class_weight="balanced")
log_reg.fit(X_train_final, y_train_combined)

evaluate_model(log_reg, X_test, 0.5, file_out="log_reg_1.csv")

Base predicted churn: 49.86%
Predicted churn at 0.5 threshold: 49.86%
Submission saved to log_reg_1.csv


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
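
The warning above means that the default `lbfgs` solver hit its `max_iter=100` limit before converging, so the fitted coefficients may not be optimal. Below is a minimal sketch of the two remedies the warning itself suggests: scaling the features and raising `max_iter`. It runs on synthetic data because the notebook's features are not available here. The value `max_iter=1000`, the synthetic dataset, and the variable names are illustrative assumptions:

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real feature matrix (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)
# Put one feature on a much larger scale, as raw counts or durations often are.
X_demo[:, 0] *= 10_000

scaled_log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot pass unnoticed.
    warnings.simplefilter("error", category=ConvergenceWarning)
    scaled_log_reg.fit(X_demo, y_demo)

n_iter_used = int(scaled_log_reg[-1].n_iter_[0])
print(f"lbfgs iterations used: {n_iter_used}")
```

In the notebook, the same kind of pipeline could be fitted on `X_train_final` and `y_train_combined` in place of the bare `LogisticRegression`.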


In Logistic Regression, we have hyperparameters that we need to choose *before* training; they control how the model learns. Our hyperparameters are:

- **C**: the inverse of the regularization strength, so smaller values penalize complexity more (too high risks overfitting, too low risks underfitting)
- **penalty**: which type of regularization we use (l1 or l2)
- **solver**: which optimization algorithm we use

Since we do not know which combination of these 3 is the best, we try all of them (a grid search) and keep the one that performs best.

For the solver, we use `liblinear` and `saga`, since both support the l1 and l2 penalties. A solver is the optimization algorithm that finds the best model parameters during training. **Saga** is a variant of Stochastic Average Gradient descent: it updates the model parameters using one randomly chosen sample at a time instead of the whole dataset at every step, so it tends to converge quickly on large datasets, provided the features are on similar scales. **Liblinear** is a classic, reliable choice that works well on small to medium-sized datasets.
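
The cell that performs the search is not shown in this section. Below is a minimal sketch of how such a search could be set up with the already imported `GridSearchCV` and `StratifiedKFold`, scored with ROC-AUC (described below). The grid values, the number of folds, the scaling step, and the synthetic data are assumptions for illustration, not the notebook's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)

pipe = Pipeline(
    [
        ("scale", StandardScaler()),  # saga converges faster on scaled features
        ("clf", LogisticRegression(class_weight="balanced", max_iter=5000)),
    ]
)

# Every combination of C, penalty, and solver is tried: 3 * 2 * 2 = 12 candidates.
param_grid = {
    "clf__C": [0.01, 0.1, 1.0],
    "clf__penalty": ["l1", "l2"],
    "clf__solver": ["liblinear", "saga"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated ROC-AUC: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds the best pipeline refitted on all the data passed to `fit`; the `best_model` used in the final cell would presumably be obtained in a similar way.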

**ROC-AUC** (area under the curve) measures how well the model separates the 2 classes, despite the imbalance. So in our case, it is used to answer:

"What is the probability the model will rank the churner higher if we pick one churner and one non-churner at random?"

Possible scores:

- $<0.5$ means worse than random
- $\approx 0.5$ means random guessing
- $1.0$ means perfect
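
The ranking interpretation above can be checked directly on a small example: count, over all (churner, non-churner) pairs, how often the churner gets the higher score, and compare the result with `roc_auc_score`. The labels and scores below are made-up values chosen for illustration:

```python
from itertools import product

import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels (1 = churner) and predicted churn probabilities (made-up values).
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.65, 0.30, 0.20])

churner_scores = y_score[y_true == 1]
non_churner_scores = y_score[y_true == 0]

# Compare every (churner, non-churner) pair; a tie counts as half a win.
wins = sum(
    1.0 if c > n else 0.5 if c == n else 0.0
    for c, n in product(churner_scores, non_churner_scores)
)
pairwise_auc = wins / (len(churner_scores) * len(non_churner_scores))

sklearn_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, sklearn_auc)
```

Here the churner is ranked higher in 11 of the 15 pairs, so both computations give the same value, 11/15.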

In [None]:
evaluate_model(best_model, X_test, 0.5, file_out="log_reg_2.csv")