In [0]:
%load_ext autoreload
%autoreload 2
# Enables autoreload; learn more at https://docs.databricks.com/en/files/workspace-modules.html#autoreload-for-python-modules
# To disable autoreload; run %autoreload 0


# Credit Card Churn Prediction
---  

## Project Description  
In this project, I will develop a machine learning model, specifically **a custom neural network architecture built with PyTorch**, to predict the probability of a customer canceling their credit card service (*churn*). The model follows a **supervised learning** approach, using a labeled dataset where:  
- **Customers who left the service** (*churn*) are labeled as **1**.  
- **Active customers** (*non-churn*) are labeled as **0**.  

---  

### CRISP-DM Methodology  
The project will follow the CRISP-DM (*Cross-Industry Standard Process for Data Mining*) framework:  

| **Stage** | **Objective** | **Key Actions** |  
|-----------|---------------|------------------|  
| **1. Business Understanding** | Define the impact of churn prediction on customer retention. | - Identify the causes and possible solutions for the business.<br>- Align metrics with business KPIs. |  
| **2. Data Understanding** | Analyze data structure, quality, and variable relationships. | - Exploratory Data Analysis (EDA).<br>- Outlier and correlation detection. |  
| **3. Data Preparation** | Prepare data for model training. | - Split training and test data.<br>- Remove redundant variables. |  
| **4. Modeling** | Train and compare classical models and neural networks. | - Random Forest/Logistic Regression (baseline).<br>- PyTorch neural network (focus on generalization). |  
| **5. Evaluation** | Validate performance with business-oriented metrics. | - AUC-ROC, Recall, confusion matrix.<br>- Simulate financial impact. |  
| **6. Deployment** | Deploy the model for production use. | - Build a final churn prediction model with customer behavior indicators. |  

*This notebook covers the Modeling, Evaluation, and Deployment stages.*  

---  


## Installs:

In [0]:
%pip install -r '/Workspace/Repos/otniel.g.andrade@outlook.com/1_Portfolio-Credit-Card_Churn_Analysis_with_Pytorch/modeling_requeriments.txt'

In [0]:
# Command to restart the kernel and update the installed libraries
%restart_python

## Imports:

In [0]:
# Data Loading and Modeling:
# Pandas
import pandas as pd

# SRC/ Functions Utils
import sys
sys.path.append('/Workspace/Repos/otniel.g.andrade@outlook.com/1_Portfolio-Credit-Card_Churn_Analysis_with_Pytorch/src')
from data import DataSpark
from preprocessing import PreprocessingData

In [0]:

# pandas API on Spark
from pyspark import pandas as ps

# Numpy
import numpy as np

# Models of Machine Learning:
# Scikit-Learn Preprocessing / Metrics
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler, RobustScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Scikit-Learn Models
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# Torch Metrics
from torchmetrics.classification import BinaryAccuracy, BinaryAUROC, BinaryRecall, BinaryF1Score, BinaryConfusionMatrix, BinaryPrecision, BinaryNegativePredictiveValue, BinaryROC, BinaryPrecisionRecallCurve, BinaryAveragePrecision
from torchmetrics import MetricCollection

# Hyperparameter tuning for PyTorch:
# Ray Tune / Optuna
from ray import tune
from ray.tune import Checkpoint, Tuner, TuneConfig, RunConfig
from ray.tune.search.optuna import OptunaSearch
import ray.cloudpickle as pickle
# Tqdm
from tqdm import tqdm

# Graphics:
# Matplotlib
import matplotlib.pyplot as plt
# Seaborn
import seaborn as sns

# Python:
# Time
import time
# Random
import random
# Partial
from functools import partial
# OS
import os
# Tempfile
import tempfile
# Path
from pathlib import Path
# Warnings
import warnings

## Configs:

In [0]:
# Pandas:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)


## 4. Modeling
---  
In this stage, **I will test classical machine learning models** to establish a baseline on the training data. The approach is **intentionally simple** (no extensive hyperparameter tuning or advanced preprocessing), since algorithms such as **Random Forest, Logistic Regression, and SVM** usually reach reasonable performance with straightforward data transformations.  

---  

After this initial analysis, **I will prioritize the project’s main model**: a **neural network developed in PyTorch**. This architecture was chosen due to its:  

- **Ability to identify complex patterns** in non-linear data.  
- **Flexibility to adapt to class imbalances** (e.g., the observed 84%-16% class distribution).  
- **Generalization capability** (with appropriate regularization, it can perform well on unseen data).  

However, neural networks require **specific preprocessing**, particularly to address:  
1. **High-cardinality categorical variables** (e.g., unique identifiers).  
2. **Asymmetric distributions** (identified during the EDA phase).  
3. **Data noise** (such as outliers in numerical variables).  

To address these, I will apply:  
- **Embedding layers** for categorical variables.  
- **Cross-validation** to check that performance is stable across different data partitions.  
- **Regularization techniques** (e.g., *dropout*) to prevent *overfitting*.  
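
As a minimal sketch of how these pieces can fit together (not the final architecture of this project), the block below builds a small tabular network with one embedding per categorical column, a dropout layer, and a loss weighted for the 84%-16% class split. The class name `TabularChurnNet`, the layer sizes, and the random batch are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TabularChurnNet(nn.Module):
    """Hypothetical tabular network: one embedding per categorical column,
    concatenated with the numeric features, followed by an MLP with dropout."""

    def __init__(self, cardinalities, n_numeric, emb_dim=4, hidden=32, p_drop=0.3):
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(n_categories, emb_dim) for n_categories in cardinalities]
        )
        in_features = emb_dim * len(cardinalities) + n_numeric
        self.mlp = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),    # regularization against overfitting
            nn.Linear(hidden, 1),  # one logit per customer
        )

    def forward(self, x_cat, x_num):
        # Look up each categorical column in its own embedding table
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat(embedded + [x_num], dim=1)
        return self.mlp(x).squeeze(1)


torch.manual_seed(33)
model = TabularChurnNet(cardinalities=[3, 5], n_numeric=4)

# Made-up batch: 8 customers, 2 categorical codes and 4 numeric features each
x_cat = torch.stack([torch.randint(0, 3, (8,)), torch.randint(0, 5, (8,))], dim=1)
x_num = torch.randn(8, 4)
y = torch.tensor([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0])

# Weight the positive (churn) class by the negative/positive ratio, about 84/16
pos_weight = torch.tensor([84.0 / 16.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = model(x_cat, x_num)
loss = criterion(logits, y)
print(logits.shape, loss.item())
```

Working on logits with `BCEWithLogitsLoss` avoids a separate sigmoid in the model; the `pos_weight` ratio is one common way to handle imbalance, and resampling would be an alternative.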

---  

### Modeling Split into Two Phases 
#### **Phase 1: Classical Machine Learning Models**  
| **Objective** | **Tools** | **Metric** |  
|---------------|------------|-------------|  
| Establish a performance baseline for future comparison. | Scikit-learn (Decision Trees, SVM, Logistic Regression). | AUC-ROC. |  

#### **Phase 2: PyTorch Neural Network**  
| **Objective** | **Tools** | **Metric** |  
|---------------|------------|-------------|  
| Achieve better generalization on unseen data. | PyTorch, Torchmetrics, Ray Tune. | AUC-ROC, Recall. |  

---  

### Evaluation Metric Choice: AUC-ROC 
#### Why AUC-ROC?  
| **Criterion** | **Explanation** | **Business Impact** |  
|---------------|------------------|----------------------|  
| **Class imbalance** | Balances *recall* (capturing churning customers) and *specificity* (avoiding unnecessary actions on loyal customers). | Reduces operational costs by prioritizing high-risk customers. |  
| **Asymmetric cost sensitivity** | False negatives (missing churn) are more critical than false positives. | Improves retention campaign efficacy (e.g., personalized offers). |  
| **Universal interpretability** | As a rule of thumb, scores above **0.85** are usually read as strong discriminative power for binary classification; 0.5 corresponds to random ranking. | Simplifies communication with non-technical stakeholders. |  
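
For intuition, AUC-ROC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it by brute-force pair counting; the helper `pairwise_auc` and the toy labels and scores are invented for illustration and are not part of the project code.

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Toy labels (1 = churn) and model scores, invented for illustration
y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]

auc = pairwise_auc(y_true, y_score)
print(f"AUC-ROC: {auc:.3f}")  # 7 of 8 pairs are ordered correctly -> 0.875
```

Because the value depends only on how customers are ranked, it does not change with the decision threshold, which is why Recall is tracked alongside it in Phase 2.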

---  


#### Loading the training and test data

In [0]:
# File location and file type -- train
file_location  = '/Volumes/workspace/projects-data-science/churn-project-data/gold/train'
file_type = 'parquet'
train = DataSpark(spark = spark, file_location = file_location).load_data_pandas(file_type = file_type)

In [0]:
# File location and file type -- test
file_location  = '/Volumes/workspace/projects-data-science/churn-project-data/gold/test'
file_type = 'parquet'
test = DataSpark(spark = spark, file_location = file_location).load_data_pandas(file_type = file_type)

In [0]:
train.head()

### Checking the dimensions of the training and test data

In [0]:
train.count()

In [0]:
train.shape

In [0]:
test.count()

In [0]:
test.shape

### Training Classical Models - Cross-Validation

#### Separating features and labels

In [0]:
# Train
X_train = train.drop(columns = 'churn_target')
y_train = train['churn_target'].copy()
# Test 
X_test = test.drop(columns = 'churn_target')
y_test = test['churn_target'].copy()

# Checking the dimensions of the training and test data
print(f'The Train features dataset shape: {X_train.shape}')
print(f'The Train labels dataset shape: {y_train.shape}')
print(f'\nThe Test features dataset shape: {X_test.shape}')
print(f'The Test labels dataset shape: {y_test.shape}')

#### Preprocessing

In [0]:
ml_preprocessor = PreprocessingData().MLClassicPreprocessing()

In [0]:
X_train_preprocessed = ml_preprocessor.fit_transform(X_train)
pd.DataFrame(X_train_preprocessed, columns = ml_preprocessor.get_feature_names_out(X_train.columns)).head(10)

Shape of the preprocessed training data

In [0]:
X_train_preprocessed.shape

In [0]:
ml_preprocessor

In [0]:
# Baseline candidates as (name, pipeline) pairs
models = [

    ('Logistic Regression', 
     Pipeline([('model', LogisticRegression(
         random_state = 33, 
         class_weight = 'balanced',
    ))])),

    ('Decision Tree Classifier', 
     Pipeline([('model', DecisionTreeClassifier(
         random_state = 33, 
         class_weight = 'balanced',
         max_depth = 5,
         criterion = 'gini',
         min_impurity_decrease = 0.001,
    ))])),


    ('Random Forest Classifier', 
     Pipeline([('model', RandomForestClassifier(
        random_state = 33,
        class_weight = 'balanced',
        n_estimators = 100,
        max_depth = 10,
        min_samples_split = 2,
        
    ))])),
     
    ('KNeighbors Classifier', 
     Pipeline([('model', KNeighborsClassifier(
        n_neighbors = 5,
        weights = 'distance',
        metric = 'minkowski'
    ))])),

    ('Support Vector Machine Classifier', 
     Pipeline([('model', SVC(
        random_state = 33,
        class_weight = 'balanced',
        C = 1.0,
        kernel = 'rbf',
        gamma = 'scale',
        probability = True, 
    ))])),

    ('Gradient Boosting Classifier', 
     Pipeline([('model', GradientBoostingClassifier(
        random_state = 33,
        n_estimators = 200,
        max_depth = 3,
        learning_rate = 0.1,
        subsample = 0.7
    ))])),

]
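
A list of `(name, pipeline)` pairs like this one is typically scored with stratified cross-validation and AUC-ROC. The sketch below shows one way to do that, under stated assumptions: it runs on synthetic data from `make_classification` rather than the project's data, uses a shortened `demo_models` list, and picks `n_splits=5` arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data (about 84/16 class split)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=33
)

demo_models = [
    ("Logistic Regression",
     Pipeline([("model", LogisticRegression(
         random_state=33, class_weight="balanced", max_iter=1000))])),
    ("Decision Tree Classifier",
     Pipeline([("model", DecisionTreeClassifier(
         random_state=33, class_weight="balanced", max_depth=5))])),
]

# Stratified folds keep the churn ratio roughly constant in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=33)

cv_results = {}
for name, pipeline in demo_models:
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_results[name] = scores
    print(f"{name}: mean AUC-ROC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

With the real data, placing the preprocessor inside each `Pipeline` would refit it on every training fold, which avoids leaking fold statistics into validation.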

In [0]:
nn_proprocessor = PreprocessingData().NeuralNetWorkPreprocessing()

In [0]:
nn_proprocessor

In [0]:
preprocessed_train = nn_proprocessor.fit_transform(train)
pd.DataFrame(preprocessed_train, columns = nn_proprocessor.get_feature_names_out(train.columns)).head(10)