**Problem Statement**

Credit risk has become one of the most significant financial threats to commercial banks as the financial industry has grown. One of the biggest challenges these banks face is predicting the risk posed by their credit clients. The goal is to predict the probability of credit default based on the credit card owner's characteristics, payment history, and transaction history.

**Approach**

Tasks: Hybrid technique using deep learning and a genetic algorithm. 
Supervised learning task (classification problem).

Environment and Tools:

1. TensorFlow and Keras API.
2. Scikit-learn.
3. Matplotlib and Seaborn.
4. MLflow.
5. PyGAD.

**Table of Contents**
1. Importing Dependencies.
2. Data Understanding.
3. Exploratory Data Analysis.
4. Feature Engineering.
5. Model Building.
6. Model Evaluation and Interpretation.

**Loading Dependencies**

In [7]:
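import pandas as pd
import numpy as np
from collections import Counter
from statsmodels.stats.outliers_influence import variance_inflation_factor

import seaborn as sns
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

# imports needed by the later cells (import paths assumed; the original cell omits them)
from imblearn.combine import SMOTEENN
from tensorflow.keras.models import Sequential, save_model, load_model
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.metrics import BinaryAccuracy
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard, EarlyStopping
import mlflow
from mlflow import log_metrics
from mlflow.tensorflow import autolog
from pygad import GA
from pygad.kerasga import KerasGA, model_weights_as_matrix, predict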



**Data Understanding**

Data understanding means exploring the dataset to learn its structure, quality, and content so that the derived knowledge can support decision making.

In [40]:
# loading dataset
df = pd.read_csv('customer_transaction.csv')

In [7]:
# returns the first five rows
df.head()

In [8]:
# return a tuple representing the dimensionality of the dataset
df.shape

In [10]:
# print a concise summary about the dataset
df.info()

In [11]:
# return the number of duplicate rows.
df.duplicated().sum()

In [12]:
# return a Series containing the count of each class in the target column.
df['TX_FRAUD'].value_counts()

In [13]:
# return the number of missing values in each column of the dataset
df.isnull().sum()

In [14]:
# generate descriptive statistics of the dataset.
df.describe()

In [15]:
# checking relationship i.e. correlation between the numeric variables in the dataset
# (numeric_only=True skips string columns such as TX_DATETIME)
corr = df.corr(numeric_only=True)
corr

In [16]:
# checking correlation in heatmap
sns.heatmap(corr, cmap='coolwarm', annot=True)
plt.show()
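
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `df.corr()`. The following is a minimal pure-Python sketch of that formula for two columns. The function name `pearson` and the sample values are illustrative and are not part of the notebook.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance and variances (the 1/n factors cancel in the ratio)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

amounts = [10.0, 20.0, 30.0, 40.0]   # made-up values
doubled = [2 * a for a in amounts]
reversed_amounts = amounts[::-1]
print(pearson(amounts, doubled))           # perfectly correlated
print(pearson(amounts, reversed_amounts))  # perfectly anti-correlated
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.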

**Data visualization is the graphical representation of information and data.**

In [17]:
# view the distribution of the transaction amount
plt.title('Distribution of Transaction Amount')
sns.histplot(data=df, x='TX_AMOUNT', bins=30)
plt.show()

In [36]:
# detecting outliers using boxplot
plt.title('detecting outliers using boxplot')
sns.boxplot(x='TX_AMOUNT', data=df)
plt.show()
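
By default a boxplot marks as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The sketch below reproduces that rule with the standard library. The helper name `iqr_outliers` and the amounts are made up for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, whis=1.5):
    """Return (lower_fence, upper_fence, outliers) using the IQR rule."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return lower, upper, [v for v in values if v < lower or v > upper]

tx_amounts = [10, 12, 11, 13, 12, 11, 500]  # made-up amounts
lower, upper, outliers = iqr_outliers(tx_amounts)
print(lower, upper, outliers)
```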

In [37]:
# plotting target col
plt.title('Visualizing Target Col')
sns.countplot(x='TX_FRAUD', data=df)
plt.show()

**Performing Feature Engineering**

In [41]:
df['TX_DATETIME'] = df['TX_DATETIME'].astype(str)

In [42]:
# feature engineering: split the date string on '/' into day, month, and year parts
df['DATE'] = df['TX_DATETIME'].str.split('/').str[0]
df['MONTH'] = df['TX_DATETIME'].str.split('/').str[1]
df['YEAR'] = df['TX_DATETIME'].str.split('/').str[2]
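
Splitting on '/' leaves DATE, MONTH, and YEAR as strings, and the last piece keeps any text that follows the year. A sketch of parsing a value into integer parts is shown here. It assumes a day/month/year layout such as 25/12/2018; the real format of TX_DATETIME is not shown in the notebook, so the format string and the helper name `split_date` are assumptions.

```python
from datetime import datetime

def split_date(text, fmt="%d/%m/%Y"):
    """Parse a date string and return (day, month, year) as integers."""
    parsed = datetime.strptime(text, fmt)
    return parsed.day, parsed.month, parsed.year

day, month, year = split_date("25/12/2018")
print(day, month, year)
```

In pandas, `pd.to_datetime(df['TX_DATETIME'], format=...)` followed by the `.dt.day`, `.dt.month`, and `.dt.year` accessors yields the same integer columns.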

In [35]:
df.drop(['TRANSACTION_ID', 'TX_DATETIME'], inplace=True, axis=1)

In [20]:
# assign independent variables 
x = df.drop(['TX_FRAUD'], axis=1)
# assign dependent variable
y = df['TX_FRAUD']

**Splitting Dataset into Train and Test**

In [21]:
# Split arrays or matrices into random train and test subsets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=1)

In [22]:
# check the shapes of the train and test splits
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

**Using the Imblearn Library to Handle the Class Imbalance in the Dataset**

The Synthetic Minority Oversampling Technique (SMOTE) is a statistical technique that increases the number of minority-class cases by generating synthetic samples, which balances the class distribution. Here it is combined with Edited Nearest Neighbours (ENN) cleaning, through SMOTEENN, to handle the imbalance present in the dataset.

non_fraud_transaction = 23364

fraud_transaction = 6636
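
SMOTE creates each synthetic sample by picking a minority-class point, choosing one of its nearest minority-class neighbors, and interpolating at a random position on the line segment between them. The sketch below shows only that interpolation step; the neighbor search and the ENN cleaning are omitted, and the function name and sample points are illustrative.

```python
import random

def smote_sample(point, neighbor, rng):
    """Create one synthetic sample on the segment between point and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(0)
minority_point = [1.0, 2.0]      # made-up minority-class sample
nearest_neighbor = [3.0, 6.0]    # one of its minority-class neighbors
synthetic = smote_sample(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Because every feature uses the same random gap, the synthetic point always lies between the two real points rather than duplicating either one.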

In [23]:
# output the class counts of y before resampling.
counter = Counter(y_train)
print("Before Sampling: {}".format(counter))

# Over-sampling using SMOTE and cleaning using ENN.
sm = SMOTEENN()
# resample the training dataset.
x_train_sm, y_train_sm = sm.fit_resample(x_train, y_train)

# output the class counts of y after resampling.
counter = Counter(y_train_sm)
print("After Sampling: {}".format(counter))

**Automatic Treatment of Outliers using LocalOutlierFactor**

The anomaly score of each sample is called the Local Outlier Factor.
It measures the local deviation of the density of a given sample with respect
to its neighbors: samples with a substantially lower density than their neighbors are treated as outliers.
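
To make the definition concrete, here is a small pure-Python sketch of the LOF score for one-dimensional data. It is a simplified illustration and not scikit-learn's implementation: it uses k=2 instead of the library default of 20 neighbors and assumes there are no duplicate points. The function name and data are made up.

```python
def local_outlier_factors(points, k=2):
    """Return the LOF score of each 1-D point (scores near 1 mean inlier)."""
    n = len(points)
    dist = [[abs(p - q) for q in points] for p in points]
    # indices of the k nearest neighbors of each point, excluding itself
    neighbors = [sorted((j for j in range(n) if j != i),
                        key=lambda j: dist[i][j])[:k]
                 for i in range(n)]
    # k-distance: distance from a point to its k-th nearest neighbor
    k_dist = [dist[i][neighbors[i][-1]] for i in range(n)]

    def lrd(i):
        # local reachability density: inverse of the mean reachability distance
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        return len(reach) / sum(reach)

    densities = [lrd(i) for i in range(n)]
    # LOF: mean density of the neighbors relative to the point's own density
    return [sum(densities[j] for j in neighbors[i]) / (k * densities[i])
            for i in range(n)]

values = [1.0, 1.1, 1.2, 1.3, 10.0]  # made-up data; 10.0 is isolated
scores = local_outlier_factors(values)
print([round(s, 2) for s in scores])
```

The clustered points score close to 1, while the isolated point scores far above 1. scikit-learn exposes the negated scores through the `negative_outlier_factor_` attribute.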

In [None]:
# Unsupervised Outlier Detection using the Local Outlier Factor (LOF).
lof = LocalOutlierFactor()
# Returns -1 for outliers and 1 for inliers.
pred = lof.fit_predict(x_train_sm)

In [None]:
# select all rows that are not outliers
mask = pred != -1
x_train_sm, y_train_sm = x_train_sm[mask], y_train_sm[mask]
# summarize the shape of the updated training dataset
print(x_train_sm.shape, y_train_sm.shape)

In [None]:
# Standardize features by removing the mean and scaling to unit variance.
sc = StandardScaler()
# fit on the training dataset
sc.fit(x_train_sm)

# scale the training datasets
x_train_sm = sc.transform(x_train_sm)
# scale the testing datasets
x_test = sc.transform(x_test)
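
Standardization computes z = (x - mean) / std for each feature, where the mean and the population standard deviation are learned from the training data only and then reused for the test data. A minimal sketch for a single column follows, with illustrative helper names and made-up values.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of a training column."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std using statistics learned from training data."""
    return [(x - mean) / std for x in column]

train_amounts = [10.0, 20.0, 30.0]  # made-up training values
test_amounts = [20.0, 40.0]         # made-up test values
mean, std = fit_scaler(train_amounts)
print(transform(train_amounts, mean, std))
print(transform(test_amounts, mean, std))
```

Fitting on the training split alone keeps information about the test set from leaking into the preprocessing step.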

**Model Development**

TensorFlow is an open-source, end-to-end machine learning platform, while Keras is a high-level neural network API that runs on top of TensorFlow.

In [None]:
# kernel_regularizer=L1L2(l1=1e-3, l2=1e-2), bias_regularizer=L2(1e-4)
# building keras model
def build_model():
    # Sequential groups a linear stack of layers
    model = Sequential([
        Dense(64, activation="relu", input_shape=(9,)),
        Dense(128, activation="relu"),
        Dense(256, activation="relu"),
        Dense(256, activation="relu"),
        Dense(128, activation="relu"),
        Dense(1, activation="sigmoid")
    ])
    return model

model = build_model()

In [None]:
# Prints a string summary of the network.
model.summary()

In [None]:
# Configures the model for training.
model.compile(
    # Gradient descent (with momentum) optimizer.
    optimizer=SGD(learning_rate=0.001, momentum=0.9, clipnorm=5.0, clipvalue=1.0),
    # Computes the cross-entropy loss between true labels and predicted labels.
    loss=BinaryCrossentropy(),
    # Calculates how often predictions match binary labels.
    metrics=[BinaryAccuracy()])
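
Binary cross-entropy averages -(y·log(p) + (1 - y)·log(1 - p)) over all samples, where y is the true label and p the predicted probability. The sketch below computes it in plain Python. The clipping constant `eps` is an assumption added to avoid log(0), and the predictions are made up.

```python
from math import log

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # made-up predictions close to the labels
uncertain = [0.5, 0.5, 0.5, 0.5]   # made-up predictions with no information
print(binary_crossentropy(labels, confident))
print(binary_crossentropy(labels, uncertain))
```

Predictions of 0.5 everywhere give a loss of log(2), so a trained model should score well below that value.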

In [None]:
# model improving.
call_back = [
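    # Callback to save the best model weights (lowest val_loss) during training.
    # The file name is a placeholder; recent Keras versions require the .weights.h5 suffix.
    ModelCheckpoint(filepath="./best.weights.h5", monitor="val_loss", save_best_only=True, save_weights_only=True, mode="min"),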
    # TensorBoard is a visualization tool provided with TensorFlow.
    TensorBoard(log_dir='./logs', update_freq=1),
    # Stop training when a monitored metric has stopped improving.
    EarlyStopping(monitor='val_loss', patience=20, mode='min', restore_best_weights=True)
]
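
The patience rule used by early stopping can be sketched in a few lines: training stops once the monitored loss has failed to improve for `patience` consecutive epochs. This is a simplified illustration that ignores options such as a minimum improvement threshold; the function name and loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a minimise-the-loss patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # patience never ran out: training ends at the last epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]  # made-up values
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

Here training stops before the final, lower loss is ever reached, which is why a larger patience such as 20 is more forgiving of noisy validation curves.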

In [None]:
# Set the given experiment as the active experiment. 
mlflow.set_experiment("Customer Churn Prediction")

# Start a new MLflow run
with mlflow.start_run(): 
    # Enables automatic logging
    autolog()
    
    # Trains the model for a fixed number of epochs (iterations on a dataset).
    history = model.fit(x_train_sm, y_train_sm, batch_size=64, epochs=50, callbacks=call_back, validation_data=(x_test, y_test))
    
    # Saves a model as a TensorFlow SavedModel or HDF5 file.
    save_model(model, filepath="model.h5")
    
    # Loads a model saved
    loaded_model = load_model(filepath="model.h5")

    # Generates output predictions for the x_test input sample.
    pred = loaded_model.predict(x_test)
    
    # Returns the loss value & metrics values for the model on the training data.
    _, train_acc = loaded_model.evaluate(x_train_sm, y_train_sm, batch_size=128, verbose=0)
    print("Train Accuracy: {:.2f}".format(train_acc*100))

    # Returns the loss value & metrics values for the model on the test data.
    _, test_acc = loaded_model.evaluate(x_test, y_test, batch_size=128, verbose=0)
    print("Test Accuracy: {:.2f}".format(test_acc*100))
    
    metric = {
        "Training Accuracy": train_acc,
        "Testing Accuracy": test_acc
    }
    # Log multiple metrics for the current run
    log_metrics(metric)
    
# End an active MLflow run
mlflow.end_run()

In [None]:
# convert predicted probabilities to class labels using a 0.5 threshold
y_hat = [1 if i > 0.5 else 0 for i in pred]

In [None]:
# Build a text report showing the main classification metrics.
print(classification_report(y_test, y_hat))

In [None]:
# Compute confusion matrix to evaluate the accuracy of a classification.
con_matrix = pd.DataFrame(confusion_matrix(y_test, y_hat), 
                          index=["Actual: No", "Actual: Yes"],
                          columns=("Predicted: No", "Predicted: Yes"))
print(con_matrix)
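
The precision, recall, and F1 values in the classification report can be derived directly from the four confusion-matrix counts. A minimal sketch with made-up counts is given below; the function name `binary_metrics` is illustrative.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# made-up counts laid out like confusion_matrix: [[tn, fp], [fn, tp]]
precision, recall, f1 = binary_metrics(tn=80, fp=10, fn=5, tp=15)
print(precision, recall, f1)
```

On imbalanced data these per-class metrics are more informative than accuracy, because a model that always predicts the majority class can still reach a high accuracy.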

In [None]:
# plotting model performance
def plot_graph(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history["val_"+string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, "val_"+string])
    plt.show()

# Training and validation accuracy per epoch
plot_graph(history, "binary_accuracy")
# Training and validation loss per epoch
plot_graph(history, "loss")

**Using PyGad for Optimization.**

1. Instantiate KerasGA
2. Define fitness function
3. Create Callback function
4. Plot the result
5. Measure the Loss and Accuracy

In [None]:
# Creates an instance of the KerasGA class to build a population of model parameters.
keras_ga = KerasGA(model=model,
                   num_solutions=50)

In [None]:
# assign x_train to data_input
data_input = x_train_sm
# assign y_train to data_output
data_output = y_train_sm

In [None]:
# creating fitness function.
# Note: PyGAD 2.20.0 and later expect fitness_func(ga_instance, solution, solution_idx).
def fitness_func(solution, solution_idx):
    global data_input, data_output, keras_ga, model
    
    # reshape the flat solution vector into the model's per-layer weight arrays
    model_weight_matrix = model_weights_as_matrix(model=model, weights_vector=solution)
    
    # sets the weights of the model, from NumPy arrays.
    model.set_weights(weights=model_weight_matrix)
    
    # Generates output predictions for the input samples
    prediction = predict(model=model, 
                         solution=solution, 
                         data=data_input)
    
    # Computes the cross-entropy loss between true labels and predicted labels.
    criterion = BinaryCrossentropy()
    
    # fitness is the inverse of the loss; the small constant avoids division by zero
    solution_fitness = 1.0 / (criterion(data_output, prediction).numpy() + 1e-8)
    return solution_fitness

In [None]:
# creating callback on_generation function
def on_generation(ga_instance):
    # Print the number of generations completed so far.
    print("Generation: {generation}".format(generation=ga_instance.generations_completed), end='\n')
    # Returns information about the best solution found by the genetic algorithm.
    print("Fitness: {fitness}".format(fitness=ga_instance.best_solution()[1]))

In [None]:
# Initial population of network weights.
initial_pol = keras_ga.population_weights

In [None]:
# Instantiate GA class
ga_instance = GA(num_generations=400,
                 num_parents_mating=4, 
                 initial_population=initial_pol,
                 fitness_func=fitness_func,
                 on_generation=on_generation,
                 parent_selection_type="sss",
                 keep_parents=1,
                 crossover_type="single_point",
                 mutation_type="random",
                 mutation_percent_genes=10,)

In [None]:
# Runs the genetic algorithm.
ga_instance.run()
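
To show what a run involves, the sketch below implements a tiny genetic algorithm in plain Python: the fittest solutions are kept as parents, children are produced by single-point crossover, and genes are randomly mutated. It is a simplified stand-in and not PyGAD's implementation; all names, the toy target, and the parameter values are assumptions chosen for illustration.

```python
import random

def run_ga(fitness, num_genes, pop_size=20, num_parents=4,
           generations=50, mutation_rate=0.1, seed=0):
    """Tiny GA: keep the fittest parents, single-point crossover, random mutation."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(num_genes)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        # selection: the fittest solutions become parents
        parents = sorted(population, key=fitness, reverse=True)[:num_parents]
        best_per_generation.append(fitness(parents[0]))
        children = [parents[0][:]]  # carry the best parent over unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, num_genes)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # random mutation: perturb each gene with a small probability
            child = [g + rng.uniform(-1, 1) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness), best_per_generation

target = [0.5, -0.25, 0.75]  # made-up "ideal weights"

def toy_fitness(solution):
    loss = sum((s - t) ** 2 for s, t in zip(solution, target))
    return 1.0 / (loss + 1e-8)  # inverse-loss fitness, as in fitness_func

best, fitness_history = run_ga(toy_fitness, num_genes=len(target))
print(fitness_history[0], fitness_history[-1])
```

Because the best parent is always copied into the next generation, the best fitness never decreases from one generation to the next.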

In [None]:
# Creates, shows, and returns a figure that summarizes how the fitness value evolved by generation
ga_instance.plot_fitness(title="PyGAD & Keras - Iteration vs. Fitness", linewidth=4)

In [None]:
# Returning the details of the best solution.
solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Fitness value of the best solution = {solution_fitness}".format(solution_fitness=solution_fitness))
print("Index of the best solution : {solution_idx}".format(solution_idx=solution_idx))

In [None]:
# Make prediction based on the best solution.
predictions = predict(model=model,
                      solution=solution,
                      data=data_input)

print("Predictions : \n", predictions)

In [None]:
# calculate the binary crossentropy for the trained model
bce = BinaryCrossentropy()
print("BinaryCrossentropy: {:.3f}".format(bce(data_output, predictions).numpy()))

# calculate the binary accuracy on the trained model
ba = BinaryAccuracy()
ba.update_state(data_output, predictions)
accuracy = ba.result().numpy()
print("Binary Accuracy: {:.3f}".format(accuracy))