# **FICO Analytic Challenge © Fair Isaac 2024**

# Week 5 - Logistic Regression - Model training and evaluation

## Logistic Regression

Logistic regression is a machine learning algorithm used primarily for binary classification problems, where the outcome takes one of two possible values. The goal is to predict the probability that an observation belongs to a given class, based on the values of the predictor variables.

A few examples of binary classification problems:
1. Spam Detection: To predict if an email is spam or not spam
2. Medical Diagnosis: To predict if a tumor is malignant or not
3. Marketing: To predict if a customer will buy a product or not
4. **Credit Scoring**: To predict if a customer will default on a loan or not
5. **Fraud Detection**: To identify if a transaction is fraud or not



<img src = https://cdn.analyticsvidhya.com/wp-content/uploads/2021/03/Screenshot-from-2021-03-05-11-51-17.png width = "800" style="margin:50px 0px 50px 0px">

<img src = https://www.saedsayad.com/images/LogReg_1.png width = "1000" style="margin:50px 0px 50px 0px">

**Sigmoid function:** The S-shaped curve used to predict probabilities. Its value always lies between 0 and 1. <br>
<img src = https://editor.analyticsvidhya.com/uploads/642295.png style="margin:0px 50px 20px 250px">

<img src = https://cdn.analyticsvidhya.com/wp-content/uploads/2021/03/Screenshot-from-2021-03-05-10-58-02.png width = "800" style="margin:50px 0px 50px 0px">
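
As a rough illustration of the sigmoid function described above, the sketch below evaluates it on a few values. The helper name `sigmoid` and the sample inputs are illustrative assumptions and are not part of the challenge code.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A weighted sum of features (the log-odds) can be any real number ...
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
# ... and the sigmoid turns it into a probability
p = sigmoid(z)
print(np.round(p, 3))
```

An input of 0 maps to a probability of 0.5, large negative inputs approach 0, and large positive inputs approach 1.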

**Resources:**

https://www.kdnuggets.com/2020/03/linear-logistic-regression-explained.html

https://www.analyticsvidhya.com/blog/2021/08/conceptual-understanding-of-logistic-regression-for-data-science-beginners/

https://www.analyticsvidhya.com/blog/2021/10/building-an-end-to-end-logistic-regression-model/

## Contents

**1. Load Dataset**

**2. Modelling Data Preparation**

    2.1 Summary Statistics
    2.2 Missing value analysis
    2.3 Normalization of the features
    2.4 Dataset filtering
    2.5 Create train and test datasets
    
**3. Model Training**

    3.1 Training Logistic Regression Model
    3.2 Forward Selection of features
    3.3 Backward Elimination of features

## 1. Load Dataset

1. Load the train and test datasets
2. Create a new column 'is_train' to use it as a tag to identify train and test datasets
3. Combine the train and test datasets for further analysis

In [None]:
! pip install mlxtend

In [None]:
! pip install seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pickle import dump, load

pd.set_option('display.max_columns', None)   # displays all columns
pd.set_option('display.max_rows', None)    # displays all rows

In [None]:
# Setting up the Google Drive mount
from google.colab import drive
drive.mount('/content/drive')

import os
import sys

path = '/content/drive/MyDrive/FICO Analytic Challenge/'
os.chdir(path)
print(os.getcwd())

In [None]:
# Location of the data
data = 'Data'

# Location to save model
model_folder = 'Model'

# # Names of the datasets
# train_dataset_name = "train_features.csv"
# test_dataset_name = "test_A_features.csv"

# Name of the model
model = 'LogReg'

# dataset file prefix
trainFile = ['train']
testFile = ['test_A']
# testFile = ['test_B']
# testFile = ['test_C_notags']

# CSV filename and where features dataset will be saved
featureTrainFileSuffix="_advanced_features.csv"
featureTestFileSuffix="_advanced_features.csv"

filePathTrain=os.path.join(path + data, trainFile[0] + featureTrainFileSuffix)
filePathTest=os.path.join(path + data, testFile[0] + featureTestFileSuffix)

if not os.path.isfile(filePathTrain):
    featureTrainFileSuffix="_features.csv"
    filePathTrain=os.path.join(path + data, trainFile[0] + featureTrainFileSuffix)

if not os.path.isfile(filePathTest):
    featureTestFileSuffix="_features.csv"
    filePathTest=os.path.join(path + data, testFile[0] + featureTestFileSuffix)

# CSV filename and where outputs will be saved
trainsaveCSV = os.path.join(path + data, 'score.' + model + '.' + trainFile[0] + featureTrainFileSuffix)
testsaveCSV = os.path.join(path + data, 'score.' + model + '.' + testFile[0] + featureTestFileSuffix)

print("Path to Train Output file and filename: {}".format(trainsaveCSV))
print("Path to Test Output file and filename: {}".format(testsaveCSV))

In [None]:
def import_df(filename):
    df1 = pd.read_csv(filename)
    df1['transactionDateTime'] = pd.to_datetime(df1['transactionDateTime'])
    df1 = df1.sort_values(by=['pan','transactionDateTime'])
    return df1

In [None]:
# Read the train dataset, print the dimensions (#rows x #columns) of the dataset and view the first 5 rows
df_train_features = import_df(filePathTrain)
# df_train_features = pd.read_csv(os.path.join(path, data, train_dataset_name))
print(df_train_features.shape)
df_train_features.head()

In [None]:
# Read the test dataset, print the dimensions (#rows x #columns) of the dataset and view the first 5 rows
df_test_features = import_df(filePathTest)
# df_test_features = pd.read_csv(os.path.join(path, data, test_dataset_name))
print(df_test_features.shape)
df_test_features.head()

In [None]:
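# Tag each dataset so train and test rows can be told apart after combining
df_train_features['is_train'] = 1
df_test_features['is_train'] = 0

# Combine train and test datasets for feature analysis and further processing
df = pd.concat([df_train_features, df_test_features], ignore_index=True, axis=0)
df.shape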

In [None]:
# # Drop the columns which are not needed for modelling
# def modify_df(df1):
#     # UPDATE THIS PART, ONLY WITH COLUMNS THAT ARE NOT NEEDED
#     df1.drop(columns=['transactionDateTime',
#                       'trans_num',
#                       'unix_time',
#                       'merchCountry',
#                       'merchState',
#                       'merch_lat',
#                       'merch_long',
#                       'city_pop',
#                       'street',
#                       'gender',
#                       'deltaTime',
#                       'job',
#                       'dob',
#                       'zip',
#                       'lat',
#                       'long'
#                      ],
#              inplace= True)

#     df1['datetime'] = pd.to_datetime(df1['datetime']).astype('datetime64[ns]')
#     df1.rename(columns = {'datetime':'transactionDateTime'}, inplace = True)

#     return df1

# df = modify_df(df)
# df.shape

#### Analyze transaction level fraud rates

In [None]:
# Analyze distribution of mdlIsFraudTrx
print(df['mdlIsFraudTrx'].value_counts(dropna = False))

# Analyze distribution of mdlIsFraudTrx as percentage - Fraud rate
print(df['mdlIsFraudTrx'].value_counts(dropna = False, normalize = True))

#### Analyze account level fraud rates

In [None]:
# Number of unique pan ids
print(df['pan'].nunique())

# Create account level dataset by retaining unique pan ids
df_account = df[['pan','mdlIsFraudAcct']].sort_values(by = 'mdlIsFraudAcct',ascending = False).drop_duplicates('pan', keep = 'first')
print(df_account.shape)

In [None]:
# Analyze distribution of mdlIsFraudAcct
print(df_account['mdlIsFraudAcct'].value_counts(dropna = False))

# Analyze distribution of mdlIsFraudAcct as percentage - Fraud rate
print(df_account['mdlIsFraudAcct'].value_counts(dropna = False, normalize = True))

#### Create Card Present/Card Not Present flag

In [None]:
# Create flag is_CNP which takes values 1 for card not present and 0 for card present
df['is_CNP'] = (df['category'].apply(lambda x: x[-3:] == 'net')).astype(int)
df['is_CNP'].value_counts(dropna = False)

In [None]:
# Fraud rates for CP vs CNP
print(pd.crosstab(df['is_CNP'],df['mdlIsFraudTrx']))
print(pd.crosstab(df['is_CNP'],df['mdlIsFraudTrx'], normalize = 'index'))

#### Create Features dataset and Target data

**Features** - Features are also known as predictors, independent variables, or input variables. These are the attributes of the data that are used to make predictions. They are the inputs to the model. Features are derived from information available at the time of prediction. Features are generally represented by X.

**Target** - Target is also known as the response, dependent variable, or output variable. It is the value or label that the model is trying to predict. It is the output of the model. Target is generally represented by y.

In [None]:
# Columns in the dataframe that aren't inputs to the model, aside from the columns dropped when importing the dataset
base_cols = ['pan', 'merchant', 'category', 'transactionAmount', 'first', 'last',
       'mdlIsFraudTrx', 'mdlIsFraudAcct', 'is_train', 'cardholderCountry',
       'cardholderState', 'transactionDateTime']

feature_columns = list(set(df.columns) - set(base_cols+['is_CNP']))
feature_columns.sort()

print('Number of features : ',len(feature_columns))
print('features : ',feature_columns)

# Assign all predictor variables (features) to X and target variable to y
X = df[feature_columns].copy()
print(X.shape)
y = df['mdlIsFraudTrx']

In [None]:
# Analyze distribution of target variable
y.value_counts(dropna = False)

In [None]:
# Analyze distribution of target variable as percentage - Fraud rate
y.value_counts(dropna = False, normalize = True)

## 2. Modelling data preparation

### 2.1 Summary Statistics

Analyze univariate statistics such as min, max, median, and percentile distribution for all the predictive features.

In [None]:
print("Summary Statistics")
summary_statistics = X.describe().T
summary_statistics

### 2.2 Missing value analysis

There are various methods to handle missing data. For example,
- If the proportion of missing values in a column is beyond a tolerable limit, those columns can be excluded from the model
- If the missing values of a column are within tolerable limit, they are imputed with median or mean value of the column

**isna()** is used to identify missing values in the dataset. It returns boolean values: 'True' indicates a missing value, 'False' indicates a non-missing value<br>
**fillna()** is used to replace missing/null values with a specified value
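
The two methods can be tried on a small example first. This is a minimal sketch on made-up data; the column names, values, and the 0.5 threshold are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with missing entries
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 50.0],
    "velocity": [np.nan, np.nan, np.nan, 2.0],
})

# Share of missing values per column: the mean of the True/False mask
missing_share = toy.isna().mean()
print(missing_share)

# Keep columns below the threshold, then impute what is left with the column median
threshold = 0.5
kept = toy.loc[:, missing_share < threshold]
imputed = kept.fillna(kept.median())
print(imputed)
```

Here `velocity` is dropped because 75% of its values are missing, and the single gap in `amount` is filled with that column's median.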

In [None]:
# Calculate the proportion of missing values in each column
X.isna().mean()

In [None]:
# Remove columns with high missing values beyond a threshold
# Threshold value can be changed
threshold_missing = 0.2

X = X.loc[:, X.isna().mean() < threshold_missing]
X.shape

In [None]:
## Impute missings with median value for variables below missing threshold
print(X.median())
X = X.fillna(X.median())

### 2.3 Normalization of the features

Normalization transforms the features to the same scale. This helps with stable model training and makes feature importance easier to interpret.

There are different ways to normalize data. More details on why normalization is required and different ways to normalize the data can be found here -
https://www.datacamp.com/tutorial/normalization-in-machine-learning

**StandardScaler** from sklearn is used to normalize the features. It applies the formula below to each feature.

x_transform = (x - mean) / standard deviation
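
To check the formula against the library, the sketch below standardizes a made-up column by hand and compares the result with `StandardScaler`. The toy values are an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column (hypothetical values)
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual standardization; np.std defaults to the population standard
# deviation (ddof=0), which is what StandardScaler uses as well
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)
print(np.allclose(manual, scaled))
```

Note that the pandas `.std()` method defaults to the sample standard deviation (ddof=1), so a hand calculation done with pandas would differ slightly from the scaler's output.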

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_transform = pd.DataFrame(scaler.fit_transform(X), columns = X.columns)
print(X_transform.shape)
X_transform.head()

In [None]:
# Save the scaling parameters
scaleFile = os.path.join(path, data , 'scaler.' + model + '.' + data + ".pkl")
with open(scaleFile, 'wb') as f:
    dump(scaler, f)

### 2.4 Dataset filtering

The following transactions need to be excluded from the modelling data.
 - Transactions in the first two months, to allow for profile maturation
 - All non-fraud transactions belonging to a fraud account

#### Removing first two months transactions to allow for profile maturation
Since some of our profile variables depend on potentially long periods of time, we would like to allow those features to build up over their entire calculation window. Ideally, we would have several months of data before the training period to allow these features to mature, but since we only have a year's worth of data, we restrict ourselves to a 2-month window and train on the remaining 10 months. These initial 2 months are known as the ‘profile maturation period’, during which the profile variables develop. We exclude the transactions from the first 2 months from the modelling data to allow for profile maturation.

<font color='red'>**Do not modify the function**</font>

In [None]:
# Function to create boolean variable which tags transactions in the first two months as 'False' and transactions after two months as 'True'
def matureProf_n_months(df1, datetime_col, n_months=2):
    # Find earliest date in dataset
    min_date = df1[datetime_col].min()
    # Calculate cutoff date by adding n_months to min_date
    cutoff_date = min_date + pd.DateOffset(months=n_months)
    print('Earliest date: ', min_date)
    print('Cutoff date: ', cutoff_date)

    # return a boolean column which is 'True' for rows where the datetime is on or after the cutoff date, otherwise 'False'
    return df1[datetime_col] >= cutoff_date

# Create boolean variable which tags transactions in the first two months as 'False' and transactions after two months as 'True'
profileMature_bool = matureProf_n_months(df, 'transactionDateTime', n_months=2)

In [None]:
# Remove the transactions from first two months using profileMature_bool
X_profileMature = X_transform[profileMature_bool]
y_profileMature = y[profileMature_bool]

df_profileMature = df[profileMature_bool]

In [None]:
print(X_profileMature.shape)
print(y_profileMature.shape)
print(df_profileMature.shape)

#### Removing Non-fraud transactions from Fraud accounts
A fraud account can have fraud transactions and non-fraud transactions. To avoid any uncertainty of these non-fraud transactions being fraud or not, we remove all non-fraud transactions of a fraud account from modelling data.
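
The effect of this filter can be seen on a tiny made-up example. The sketch below assumes the same column names as the dataset (`pan`, `mdlIsFraudAcct`, `mdlIsFraudTrx`); the rows themselves are hypothetical.

```python
import pandas as pd

# Toy transactions (hypothetical): account B is a fraud account with one
# fraud transaction and one transaction that is not tagged as fraud
toy = pd.DataFrame({
    "pan": ["A", "A", "B", "B"],
    "mdlIsFraudAcct": [0, 0, 1, 1],
    "mdlIsFraudTrx": [0, 0, 1, 0],
})

# True for rows to keep; False only for non-fraud rows of fraud accounts
keep = ~((toy["mdlIsFraudAcct"] == 1) & (toy["mdlIsFraudTrx"] == 0))
filtered = toy[keep]
print(filtered)
```

Only the untagged transaction of account B is removed; the clean account A and the fraud transaction of account B stay in the modelling data.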

In [None]:
# Generate cross-frequency of mdlIsFraudAcct and mdlIsFraudTrx
pd.crosstab(df_profileMature['mdlIsFraudAcct'], df_profileMature['mdlIsFraudTrx'])

In [None]:
# Create a boolean variable which tags non-fraud transactions of a fraud account as 'False' and the rest of the transactions as 'True'
filter_bool = ~((df_profileMature['mdlIsFraudAcct']==1) & (df_profileMature['mdlIsFraudTrx']==0))
filter_bool.value_counts(dropna = False)

In [None]:
# Use the filter_bool variable to filter the features and target datasets
X_filtered = X_profileMature[filter_bool]
y_filtered = y_profileMature[filter_bool]

print(X_filtered.shape)
print(y_filtered.shape)

In [None]:
# Filtering the main dataset
df_filtered = df_profileMature.loc[filter_bool, :]
df_filtered.shape

In [None]:
# Unique pan ids
df_filtered['pan'].nunique()

### 2.5 Create train and test datasets

The train dataset is used to **train the model** and the test dataset is used to **evaluate the model** performance. The train and test datasets are chosen randomly so that both datasets represent the distributions in the overall data. Evaluating the model on test data that is independent of the train data also helps to detect over-fitting.

https://www.geeksforgeeks.org/training-data-vs-testing-data/

For the purpose of fraud modelling, we need to make sure that all the transactions corresponding to an account (pan id) are part of either the train data or the test data, never both. Otherwise, profile information leaks from train to test when transactions from the same account are included in both datasets. In that case the test data cannot be considered independent of the train data for the sake of evaluation.

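**GroupShuffleSplit** from sklearn can create train and test datasets while ensuring that all transactions from the same group (pan) fall into either the train or the test data. That approach is kept below as commented-out code; this notebook instead uses the `is_train` tag to recover the original train and test files.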
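
A minimal sketch of a group-aware split on made-up data is shown here. The account labels, features, and targets are hypothetical; only the `GroupShuffleSplit` parameters mirror the commented-out cell.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data (hypothetical): 10 transactions from 5 accounts
groups = np.array(["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"])
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=100)
train_idx, test_idx = next(gss.split(X_toy, y_toy, groups=groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
print("train accounts:", sorted(train_groups))
print("test accounts:", sorted(test_groups))
# No account appears on both sides of the split
print("overlap:", train_groups & test_groups)
```

With `GroupShuffleSplit`, `test_size` is a share of the groups rather than of the rows, so the number of test transactions depends on how active the selected accounts are.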

In [None]:
# from sklearn.model_selection import GroupShuffleSplit

# gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=100)

# for train_idx, test_idx in gss.split(X_filtered, y_filtered, groups=df_filtered['pan']):
#     # splitting features dataset and target dataset into train and test
#     X_train, X_test = X_filtered.iloc[train_idx], X_filtered.iloc[test_idx]
#     y_train, y_test = y_filtered.iloc[train_idx], y_filtered.iloc[test_idx]

#     # splitting the filtered dataset
#     df_train, df_test = df_filtered.iloc[train_idx], df_filtered.iloc[test_idx]

In [None]:
# Use is_train field to split the data into train and test samples

# Create boolean variables for train and test
train_bool = (df_filtered['is_train']==1)
test_bool = (df_filtered['is_train']==0)

# Use the boolean variable to create train data
X_train = X_filtered.loc[train_bool]
y_train = y_filtered.loc[train_bool]
df_train = df_filtered.loc[train_bool]

# Use the boolean variable to create test data
X_test = X_filtered.loc[test_bool]
y_test = y_filtered.loc[test_bool]
df_test = df_filtered.loc[test_bool]

In [None]:
print('X_train :', X_train.shape)
print('X_test :', X_test.shape)
print('y_train :', y_train.shape)
print('y_test :', y_test.shape)

#### Analyze transaction level fraud rates in train and test

In [None]:
print('Target rate (transaction fraud rate) in y_train : ', y_train.mean())
print('Target rate (transaction fraud rate) in y_test : ', y_test.mean())

#### Analyze CP vs CNP transaction level fraud rates in train and test

In [None]:
pd.crosstab(df_train['is_CNP'], df_train['mdlIsFraudTrx'], normalize = 'index')

In [None]:
pd.crosstab(df_test['is_CNP'], df_test['mdlIsFraudTrx'], normalize = 'index')

#### Analyze account level fraud rates

In [None]:
##-----Create account level train dataset-----------
# Number of unique pan ids in train dataset
print('unique pan ids in train: ',df_train['pan'].nunique())

# Number of unique fraud pan ids in train dataset
print('unique fraud pan ids in train: ',df_train[df_train['mdlIsFraudAcct']==1]['pan'].nunique())

# Create account level dataset for train by retaining unique pan ids
df_train_account = df_train[['pan','mdlIsFraudAcct']].sort_values(by = 'mdlIsFraudAcct',ascending = False).drop_duplicates('pan', keep = 'first')
print('Account level train dataset shape:', df_train_account.shape)

##-----Create account level test dataset-----------
# Number of unique pan ids in test dataset
print('unique pan ids in test: ',df_test['pan'].nunique())

# Number of unique fraud pan ids in test dataset
print('unique fraud pan ids in test: ',df_test[df_test['mdlIsFraudAcct']==1]['pan'].nunique())

# Create account level dataset for test by retaining unique pan ids
df_test_account = df_test[['pan','mdlIsFraudAcct']].sort_values(by = 'mdlIsFraudAcct',ascending = False).drop_duplicates('pan', keep = 'first')
print('Account level test dataset shape:', df_test_account.shape)

In [None]:
print('Account level fraud rate in train: ', df_train_account['mdlIsFraudAcct'].mean())
print('Account level fraud rate in test: ', df_test_account['mdlIsFraudAcct'].mean())

## 3. Model Training

### 3.1 Training Logistic Regression Model

**LogisticRegression** from sklearn is used to train a logistic regression model.

More details on LogisticRegression can be found here - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
# import the function
from sklearn.linear_model import LogisticRegression

In [None]:
# initialize the model
LR = LogisticRegression(solver='liblinear', random_state=40, max_iter=500)

#Train the LR model using train data
LR.fit(X_train, y_train)

#### Evaluating Feature Importance

Since the data is normalized, the coefficients of the features represent their importance. The **magnitudes of the coefficients** indicate the strength of the association with the target variable. The **sign of coefficient** indicates the direction of relationship between the feature and target. Positive sign indicates that if the feature value increases, the likelihood of positive class in target increases and vice-versa.

The **coef_** attribute is used to fetch the coefficients of the features in the model.

In [None]:
# Setting option to display numbers in float format
pd.options.display.float_format = '{:f}'.format

# Fetching coefficients
feature_coefficients = LR.coef_[0]

# Create a DataFrame to display feature importance
feature_names = X_train.columns
df_importance = pd.DataFrame({
    'Feature': feature_names,
    'Coefficients': feature_coefficients
}).sort_values(by='Coefficients', ascending=False)

df_importance

The features with the largest coefficient magnitudes influence the predicted target the most. Note that the table above is sorted by signed coefficient value, so features with large negative coefficients appear at the bottom. For features with a negative coefficient, higher values are associated with the negative outcome. Similarly, for features with a positive coefficient, higher values are associated with the positive outcome.
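
One way to read coefficients by magnitude is to rank them on their absolute value. The sketch below does this on synthetic data; the feature names, the simulated effects, and the extra `OddsRatio` column are assumptions added for illustration and are not part of the notebook's model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): f_strong drives the target, f_weak barely does
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "f_strong": rng.normal(size=2000),
    "f_negative": rng.normal(size=2000),
    "f_weak": rng.normal(size=2000),
})
logit = 2.0 * X_toy["f_strong"] - 1.0 * X_toy["f_negative"] + 0.1 * X_toy["f_weak"]
y_toy = (rng.random(2000) < 1 / (1 + np.exp(-logit))).astype(int)

X_scaled = StandardScaler().fit_transform(X_toy)
toy_model = LogisticRegression(solver="liblinear").fit(X_scaled, y_toy)

ranking = pd.DataFrame({
    "Feature": X_toy.columns,
    "Coefficient": toy_model.coef_[0],
})
# Rank by absolute value so strong negative effects are not pushed to the bottom
ranking["AbsCoefficient"] = ranking["Coefficient"].abs()
# exp(coefficient) is the multiplicative change in odds per one standard deviation
ranking["OddsRatio"] = np.exp(ranking["Coefficient"])
ranking = ranking.sort_values("AbsCoefficient", ascending=False).reset_index(drop=True)
print(ranking)
```

The same ranking could be applied to `df_importance` by adding an absolute-value column and sorting on it.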

#### Generate predictions and convert to score
When the Logistic Regression model is applied to a transaction, its output is the probability of the transaction being fraud. For operational purposes, these probabilities are converted to a score ranging from 1 to 999. A high score indicates a high probability of fraud and a low score indicates a low probability of fraud. The following function uses the trained Logistic Regression model to generate probabilities on the transaction data and convert them to scores.

<font color='red'>**Do not modify the function**</font>

In [None]:
# generate predictions and scores
from sklearn.preprocessing import MinMaxScaler

def scoring_predictions_logreg(X, LR):
    ##------- Generate predictions on the data ----#
    predictions = pd.Series(LR.predict_proba(X)[:,1])

    ##------ convert predictions to score ----#
    scaler = MinMaxScaler(feature_range=(1, 999))

    # Converting probabilities to logOdds to get a distribution about origin (0)
    log_odds = predictions.apply(lambda p: np.log(0.99999/(1-0.99999)) if p == 1 else np.log(p/(1-p)))
    score = pd.Series(scaler.fit_transform(log_odds.values[:, None]).astype(int).flatten())

    print("Y pred min = {}".format(predictions.min()))
    print("Y pred max = {}".format(predictions.max()))
    print("LogOdds min = {}".format(log_odds.min()))
    print("LogOdds max = {}".format(log_odds.max()))
    print("Score min = {}".format(score.min()))
    print("Score max = {}".format(score.max()))

    return score
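
To see what the conversion does, the sketch below repeats the same two steps (log-odds, then min-max scaling to the 1 to 999 range) on a handful of made-up probabilities. The probability values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical model probabilities for five transactions
probs = np.array([0.001, 0.01, 0.5, 0.9, 0.999])

# Log-odds spread out probabilities that sit close to 0 or 1
log_odds = np.log(probs / (1 - probs))

# Min-max scaling maps the smallest and largest log-odds to the ends of
# the 1 to 999 range; astype(int) then truncates to whole numbers
scores = (
    MinMaxScaler(feature_range=(1, 999))
    .fit_transform(log_odds.reshape(-1, 1))
    .astype(int)
    .flatten()
)
print(scores)
```

One consequence worth keeping in mind: because the scaler is refit on every call, the score assigned to a given probability depends on the minimum and maximum log-odds of the dataset being scored, so scores from different datasets are not directly comparable.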

In [None]:
# generate scores on train dataset
score_train = scoring_predictions_logreg(X_train, LR)

In [None]:
# generate scores on test dataset
score_test = scoring_predictions_logreg(X_test, LR)

#### Evaluating performance of the model on train and test datasets

The ROC-AUC metric (Receiver Operating Characteristic Area Under the Curve) is a performance metric used to evaluate the performance of binary classification models. It measures the ability of a model to distinguish between the positive and negative classes. Higher value indicates that the model is better at distinguishing the positive and negative classes.

ROC curve helps in visualizing the performance of the model. It helps us in understanding the fraud capture rates at different thresholds of non-frauds.

Detailed information on ROC-AUC can be found here - https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/

**predict_proba** is used to calculate the probability estimates for the positive and negative classes <br>
**roc_auc_score** is used to calculate the ROC-AUC metric. It takes the actual values of the target and the probability estimates (or scores) of the positive class as inputs.

<font color='red'>**Do not modify the function**</font>

In [None]:
from sklearn.metrics import roc_curve
def plotROC(y_train, y_train_score, y_test, y_test_score, model = 'All Features', target_fraud_rate = None):
    # roc curve for models
    NF1, F1, thresh1 = roc_curve(y_train, y_train_score, pos_label=1)
    NF2, F2, thresh2 = roc_curve(y_test, y_test_score, pos_label=1)

    # roc curve for tpr = fpr
    random_probs = [0 for i in range(len(y_test))]
    p_NF, p_F, _ = roc_curve(y_test, random_probs, pos_label=1)

    # plot roc curves
    plt.plot(NF1, F1, linestyle='--',color='orange', label='train')
    plt.plot(NF2, F2, linestyle='--',color='green', label='test')
    plt.plot(p_NF, p_F, linestyle='--', color='blue')

    if target_fraud_rate is not None:
        # Find the fraud capture rate at the target non-fraud rate (false positive rate)
        target_NF = target_fraud_rate

        idx1 = np.argmin(np.abs(NF1 - target_NF))
        target_F1 = F1[idx1]

        idx2 = np.argmin(np.abs(NF2 - target_NF))
        target_F2 = F2[idx2]

        print(f"Fraud capture rate at {target_NF} of Non-Frauds in train data is : {target_F1}")
        print(f"Fraud capture rate at {target_NF} of Non-Frauds in test data is : {target_F2}")
        # Plot vertical line at target NF
        plt.axvline(x=target_NF, ymin=0, ymax=target_F1, color='red', linestyle='--', label=f'FPR = {target_NF * 100:.1f}%')
        # Plot horizontal line at corresponding F
        plt.axhline(y=target_F1, xmin=0, xmax=target_NF,color='red', linestyle='--')

        plt.xlim(0, min(target_fraud_rate*10, 1))

    # title
    plt.title('ROC curve : '+model)
    # x label
    plt.xlabel('% Non-Frauds')
    # y label
    plt.ylabel('% Frauds')

    plt.legend(loc='best')
    plt.show();

In [None]:
plotROC(y_train, score_train, y_test, score_test, model = 'All Features')

In [None]:
# target_fraud_rate can be changed
plotROC(y_train, score_train, y_test, score_test, model = 'All Features', target_fraud_rate = 0.005)

In [None]:
# import roc_auc_score
from sklearn.metrics import roc_auc_score

In [None]:
# Train data performance

auc_train = roc_auc_score(y_train, score_train)
print("AUC value of the Model on train data : ", auc_train)

lauc_train = roc_auc_score(y_train, score_train, max_fpr=0.02)
print("LAUC value of the Model on train data : ", lauc_train)

In [None]:
# Test data performance

auc_test = roc_auc_score(y_test, score_test)
print("AUC value of the Model on test data: ", auc_test)

lauc_test = roc_auc_score(y_test, score_test, max_fpr=0.02)
print("LAUC value of the Model on test data : ", lauc_test)
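
The cells above pass scores rather than probabilities to `roc_auc_score` and use `max_fpr` for the LAUC value. The sketch below, on made-up labels and probabilities, illustrates why both work: AUC depends only on the ranking of the transactions, and `max_fpr` makes sklearn return a standardized partial AUC over the low false-positive region. The toy values and the 0.2 cutoff are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and hypothetical model probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
probs = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.80, 0.90])

# AUC depends only on how the transactions are ranked, so any strictly
# increasing transform of the probabilities (such as a score) gives the same value
scores = np.round(1 + 998 * probs)
auc_probs = roc_auc_score(y_true, probs)
auc_scores = roc_auc_score(y_true, scores)
print(auc_probs, auc_scores)

# max_fpr restricts the evaluation to false positive rates up to the cutoff
lauc = roc_auc_score(y_true, probs, max_fpr=0.2)
print(lauc)
```

In fraud detection only a small share of non-fraud transactions can be reviewed, which is why a metric focused on the low false-positive region is reported alongside the full AUC.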

#### Performance of Card Present and Card Not Present on test data

In [None]:
# Card Present
auc_test_CP = roc_auc_score(y_test[df_test['is_CNP']==0], score_test.values[df_test['is_CNP']==0])
print("AUC value of the Model on test data for Card Present Transactions: ", auc_test_CP)

lauc_test_CP = roc_auc_score(y_test[df_test['is_CNP']==0], score_test.values[df_test['is_CNP']==0], max_fpr=0.02)
print("LAUC value of the Model on test data for Card Present Transactions: ", lauc_test_CP)

# Card not Present
auc_test_CNP = roc_auc_score(y_test[df_test['is_CNP']==1], score_test.values[df_test['is_CNP']==1])
print("AUC value of the Model on test data for Card Not Present Transactions: ", auc_test_CNP)

lauc_test_CNP = roc_auc_score(y_test[df_test['is_CNP']==1], score_test.values[df_test['is_CNP']==1], max_fpr=0.02)
print("LAUC value of the Model on test data for Card Not Present Transactions: ", lauc_test_CNP)

#### Save the Logistic Regression Model

In [None]:
modelFile = os.path.join(path, model_folder, model + '.' + data + ".pkl")

with open(modelFile, 'wb') as f:
    dump(LR, f)

#### Generate score on the whole dataset and output the scored out dataset

This dataset will be used as an input to the perf_metrics notebook.

In [None]:
# generate probability predictions on whole data
df['y_preds'] = pd.Series(LR.predict_proba(X_transform)[:,1])
# use the scoring function defined above to generate scores on whole dataset
df['score'] = scoring_predictions_logreg(X_transform, LR)
df.head()

In [None]:
saveColumns = [*base_cols, *feature_columns, 'y_preds', 'score']
print(f"Columns to save: {saveColumns}")

In [None]:
# saveCSV_train = os.path.join(path + data, 'score.' + model + '.' + train_dataset_name)
# saveCSV_test = os.path.join(path + data, 'score.' + model + '.' + test_dataset_name)

df[df['is_train']==1][saveColumns].to_csv(trainsaveCSV, index=False)
df[df['is_train']==0][saveColumns].to_csv(testsaveCSV, index=False)

### 3.2 Forward Selection of features

Forward Selection is a feature selection technique that starts with an empty model and adds features one by one based on a specific criterion, typically the model's performance metric. Features are added until a stopping criterion is met, such as a maximum number of features or no further improvement in performance.

**SequentialFeatureSelector** from mlxtend is used to add features in Forward Selection or remove features in Backward Elimination (discussed in the next section).

More details on SequentialFeatureSelector can be found here - https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/#overview
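
The greedy loop behind forward selection can be sketched by hand. This is a simplified illustration on synthetic data, not the mlxtend implementation; the helper `forward_select`, the feature names, and the simulated effects are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (hypothetical): only f0 and f1 carry signal
rng = np.random.default_rng(1)
X_toy = pd.DataFrame(rng.normal(size=(600, 5)), columns=[f"f{i}" for i in range(5)])
logit = 2.0 * X_toy["f0"] + 1.0 * X_toy["f1"]
y_toy = (rng.random(600) < 1 / (1 + np.exp(-logit))).astype(int)

def forward_select(X, y, k):
    """Greedy forward selection: add the feature with the best CV AUC each step."""
    selected, history = [], []
    remaining = list(X.columns)
    while len(selected) < k:
        cv_auc = {}
        for col in remaining:
            est = LogisticRegression(solver="liblinear")
            cv_auc[col] = cross_val_score(
                est, X[selected + [col]], y, scoring="roc_auc", cv=5
            ).mean()
        best = max(cv_auc, key=cv_auc.get)
        selected.append(best)
        remaining.remove(best)
        history.append((best, cv_auc[best]))
    return selected, history

selected, history = forward_select(X_toy, y_toy, k=3)
for step, (name, auc) in enumerate(history, start=1):
    print(f"Step {step}: added {name}, CV roc_auc = {auc:.3f}")
```

Each step refits one model per remaining candidate feature and per cross-validation fold, which is why the procedure becomes slow on wide datasets.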

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector

import warnings
warnings.filterwarnings("ignore")

In [None]:
## initialize logistic regression model
LR_forward = LogisticRegression(solver='liblinear', random_state=10)

<font color='red'>**This step may take up to 1 hour to run**</font>

In [None]:
# Number of features to select can be changed
num_features_to_select_forward = 10

sfs_forward = SequentialFeatureSelector(LR_forward, k_features=num_features_to_select_forward, scoring='roc_auc',forward=True, floating=False, cv=5)
sfs_forward.fit(X_train, y_train)

#### Analyzing selected features
<font color='red'>**(Do not modify)**</font>

**k_feature_names_** gives list of final variables selected by SequentialFeatureSelector

In [None]:
selected_features_forward = list(sfs_forward.k_feature_names_)
print(f"Selected features: {selected_features_forward}")

**get_metric_dict** provides summary of iterations

In [None]:
# Create DataFrame to store the results
results_forward = pd.DataFrame.from_dict(sfs_forward.get_metric_dict()).T[['feature_idx','feature_names','avg_score']]
results_forward.rename(columns = {'avg_score':'roc'}, inplace = True)
results_forward

Each row in the above dataset represents one model. The columns feature_idx and feature_names show the index numbers and names of all the features included in the model, and the column roc shows the cross-validated roc_auc of the model. At each step, the feature that gives the best performance is added to the model.

In [None]:
# Print the selected features and the corresponding model performance
selected_features = results_forward['feature_names'].apply(lambda x: list(x))
model_performance = results_forward['roc']

list_added_features = []
for i, (features, score) in enumerate(zip(selected_features, model_performance)):
    print(f"Step {i+1}:")
    if i>0:
        # Rows are labelled 1..k, so the label i refers to the previous step's feature set
        added_feature = [x for x in features if x not in selected_features[i]][0]
        print(f"Added feature(s): {added_feature}")
    else:
        added_feature = features[0]
        print(f"Added feature(s): {added_feature}")
    print(f"Model performance (roc): {score}")
    print("-" * 30)
    list_added_features = list_added_features+[added_feature]

In [None]:
plt.plot(list_added_features, model_performance,label='roc_auc')
plt.ylim(0.5, 1)
plt.xticks(rotation = 90)
# x label
plt.xlabel('Feature added at each step')
# y label
plt.ylabel('roc_auc')
plt.show()

As variables are added to the model at each step, the performance of the model increases initially. The rate of performance improvement decreases with each iteration. After a certain point, no further significant improvement is observed.

#### Train a Log Reg model with selected features and evaluate the performance

In [None]:
# Creating train and test feature datasets with only the selected features from forward selection method
X_train_forward = X_train[selected_features_forward]
X_test_forward = X_test[selected_features_forward]
print(X_train_forward.shape)
print(X_test_forward.shape)

In [None]:
# Train the model using selected features
LR_forward.fit(X_train_forward, y_train)

In [None]:
# Feature Importance
# Fetching coefficients
feature_coefficients_forward = LR_forward.coef_[0]

# Create a DataFrame to display feature importance
feature_names_forward = X_train_forward.columns
df_importance_forward = pd.DataFrame({
    'Feature_forward': feature_names_forward,
    'Coefficients_forward': feature_coefficients_forward
}).sort_values(by='Coefficients_forward', ascending=False)

display(df_importance_forward)

In [None]:
# Generate scores and evaluate performance on train dataset
score_train_forward = scoring_predictions_logreg(X_train_forward, LR_forward)
auc_train_forward = roc_auc_score(y_train, score_train_forward)
print("AUC value of Forward Inclusion Model on train data: ", auc_train_forward)

In [None]:
# Generate scores and evaluate performance on test dataset
score_test_forward = scoring_predictions_logreg(X_test_forward, LR_forward)
auc_test_forward = roc_auc_score(y_test, score_test_forward)
print("AUC value of Forward Inclusion Model on test data: ", auc_test_forward)

In [None]:
plotROC(y_train, score_train_forward, y_test, score_test_forward, model = 'Forward Selection')

In [None]:
# Model performance on CP and CNP
# Card Present
auc_test_CP_forward = roc_auc_score(y_test[df_test['is_CNP']==0], score_test_forward.values[df_test['is_CNP']==0])
print("AUC value of the Forward Inclusion Model on test data for Card Present Transactions: ", auc_test_CP_forward)

# Card not Present
auc_test_CNP_forward = roc_auc_score(y_test[df_test['is_CNP']==1], score_test_forward.values[df_test['is_CNP']==1])
print("AUC value of the Forward Inclusion Model on test data for Card Not Present Transactions: ", auc_test_CNP_forward)

### 3.3 Backward Elimination of features

Backward Elimination is a feature selection method that starts with a model containing all the variables and removes the least significant feature one at a time until a stopping criterion is met.
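
To make the procedure concrete, here is a simplified, hand-written sketch of backward elimination on a small synthetic dataset. It illustrates the idea only and is not the implementation used by the selector in the cells that follow; `X_demo`, `y_demo`, `n_keep` and `cv_auc` are hypothetical names introduced for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only (not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=10
)

model = LogisticRegression(solver='liblinear', random_state=10)
remaining = list(range(X_demo.shape[1]))  # start with all feature indices
n_keep = 3                                # stopping criterion: number of features to keep


def cv_auc(cols):
    """Mean 5-fold cross-validated roc_auc using only the given columns."""
    return cross_val_score(model, X_demo[:, cols], y_demo, scoring='roc_auc', cv=5).mean()


while len(remaining) > n_keep:
    # Score every candidate model obtained by dropping exactly one feature
    scores = {f: cv_auc([c for c in remaining if c != f]) for f in remaining}
    # The least significant feature is the one whose removal leaves the highest score
    drop = max(scores, key=scores.get)
    remaining.remove(drop)
    print(f"Removed feature {drop}, roc_auc without it: {scores[drop]:.4f}")

print("Selected feature indices:", remaining)
```

Each pass refits the model once per remaining feature and per fold, which is why backward elimination on a dataset with many features can take a long time.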

In [None]:
## initialize logistic regression model
LR_backward = LogisticRegression(solver='liblinear', random_state=10)

<font color='red'>**This step may take up to 1 hour to run**</font>

In [None]:
# Number of features to select can be changed
num_features_to_select_backward = 10

sfs_backward = SequentialFeatureSelector(LR_backward, k_features=num_features_to_select_backward, scoring='roc_auc', forward=False, floating=False, cv=5)
sfs_backward.fit(X_train, y_train)

#### Analyzing selected features
<font color='red'>**(Do not modify)**</font>

In [None]:
selected_features_backward = list(sfs_backward.k_feature_names_)
print(f"Selected features: {selected_features_backward}")

In [None]:
# Create DataFrame to store the results
results_backward = pd.DataFrame.from_dict(sfs_backward.get_metric_dict()).T[['feature_idx','feature_names','avg_score']]
results_backward.rename(columns = {'avg_score':'roc'}, inplace = True)
results_backward

Each row in the above dataset represents one model. The columns feature_idx and feature_names show the index numbers and names of all features included in the model, and the column roc shows the model's roc_auc. At each step, the feature whose removal least affects the performance is removed from the model.

In [None]:
# Print the selected features and the corresponding model performance
selected_features = results_backward['feature_names'].apply(lambda x: list(x))
model_performance = results_backward['roc']

list_removed_features = []
for i, (features, score) in enumerate(zip(selected_features, model_performance)):
    print(f"Step {i+1}:")
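    if i>0:
        # selected_features is indexed by number of features, so this label is the previous (larger) feature set
        removed_feature = [x for x in selected_features[X_train.shape[1]+1-i] if x not in features][0]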
        print(f"Removed feature(s): {removed_feature}")
    else:
        removed_feature = 'NA'
        print(f"Removed feature(s): {removed_feature}")
    print(f"Model performance (roc): {score}")
    print("-" * 30)
    list_removed_features = list_removed_features+[removed_feature]

In [None]:
plt.plot(list_removed_features, model_performance,label='roc_auc')
plt.ylim(0.5, 1)
plt.xticks(rotation = 90)
# x label
plt.xlabel('Feature removed at each step')
# y label
plt.ylabel('roc_auc')
plt.show()

As features are removed at each step, the performance initially remains almost the same. As more features are removed, roc starts to decrease slightly, and with further removals the performance decreases at each step.
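
One way to turn this observation into a choice of model size is to keep the smallest feature set whose roc stays within a tolerance of the full model. The sketch below uses made-up values: `roc_by_n_features` and `tolerance` are assumptions for illustration, not results from this notebook.

```python
import pandas as pd

# Hypothetical roc_auc indexed by the number of features in the model (illustrative values only)
roc_by_n_features = pd.Series(
    {15: 0.905, 14: 0.905, 13: 0.904, 12: 0.903, 11: 0.899, 10: 0.890}
)

# Hypothetical tolerance: accept a drop of at most 0.003 from the full model
tolerance = 0.003

# Performance of the model that uses all features
full_model_roc = roc_by_n_features.loc[roc_by_n_features.index.max()]

# Models whose performance stays within the tolerance
acceptable = roc_by_n_features[roc_by_n_features >= full_model_roc - tolerance]

# Smallest acceptable model
smallest_n = int(acceptable.index.min())
print("Smallest acceptable number of features:", smallest_n)
```

A tighter tolerance keeps more features; a looser one trades a little performance for a simpler model.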

#### Train a Log Reg model with selected features and evaluate the performance

In [None]:
# Creating train and test feature datasets with only the selected features from backward elimination method
X_train_backward = X_train[selected_features_backward]
X_test_backward = X_test[selected_features_backward]
print(X_train_backward.shape)
print(X_test_backward.shape)

In [None]:
# train the model using selected features
LR_backward.fit(X_train_backward, y_train)

In [None]:
# Feature Importance
# Fetching coefficients
feature_coefficients_backward = LR_backward.coef_[0]

# Create a DataFrame to display feature importance
feature_names_backward = X_train_backward.columns
df_importance_backward = pd.DataFrame({
    'Feature_backward': feature_names_backward,
    'Coefficients_backward': feature_coefficients_backward
}).sort_values(by='Coefficients_backward', ascending=False)

display(df_importance_backward)

In [None]:
# Make predictions and evaluate performance on train dataset
score_train_backward = scoring_predictions_logreg(X_train_backward, LR_backward)
auc_train_backward = roc_auc_score(y_train, score_train_backward)
print("AUC value of Backward Elimination Model on train data: ", auc_train_backward)

In [None]:
# Make predictions and evaluate performance on test dataset
score_test_backward = scoring_predictions_logreg(X_test_backward, LR_backward)
auc_test_backward = roc_auc_score(y_test, score_test_backward)
print("AUC value of Backward Elimination Model on test data: ", auc_test_backward)

In [None]:
plotROC(y_train, score_train_backward, y_test, score_test_backward, model = 'Backward Elimination')

In [None]:
# Model performance on CP and CNP
# Card Present
auc_test_CP_backward = roc_auc_score(y_test[df_test['is_CNP']==0], score_test_backward.values[df_test['is_CNP']==0])
print("AUC value of the Backward Elimination Model on test data for Card Present Transactions: ", auc_test_CP_backward)

# Card not Present
auc_test_CNP_backward = roc_auc_score(y_test[df_test['is_CNP']==1], score_test_backward.values[df_test['is_CNP']==1])
print("AUC value of the Backward Elimination Model on test data for Card Not Present Transactions: ", auc_test_CNP_backward)

### Exercise

- Train log reg model on new features (at least 10 variables)
- Identify important features
- Calculate AUC values for train, test, CP and CNP
- Prepare Midpoint Report