What are the key factors that predict a fraudulent customer?
The initial round of feature engineering, conducted within the limitations of my existing computational resources, has pinpointed the following features as potentially instrumental in predicting fraudulent behavior: 'step', 'type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'nameDest', and 'oldbalanceDest'.

Do these factors make sense? If yes, how? If not, why not?
The majority of the selected features intuitively make sense as they are often linked with financial irregularities. However, the 'type' of the transaction is a feature that may require further validation:

Requires Further Validation: Although 'type' was deemed significant in the initial model, independent analysis indicates that it is correlated with the timing of fraud events, so its apparent importance may be confounded with time. Retraining the model without this feature would show whether it carries real predictive signal; if performance is unchanged, omitting it would also reduce noise in the model.
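
The re-evaluation described above can be sketched as a simple ablation: score a model with and without the feature and compare. This is a minimal sketch, not the pipeline used in this notebook. The helper name `ablation_scores`, the random forest stand-in, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def ablation_scores(X, y, feature, cv=3, seed=0):
    """Mean cross-validated average precision with and without `feature`."""
    folds = StratifiedKFold(n_splits=cv, shuffle=True, random_state=seed)
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    with_feature = cross_val_score(clf, X, y, cv=folds, scoring="average_precision").mean()
    without_feature = cross_val_score(
        clf, X.drop(columns=[feature]), y, cv=folds, scoring="average_precision"
    ).mean()
    return {"with": with_feature, "without": without_feature}

# Synthetic stand-in data (hypothetical); swap in the real features and labels
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "amount": rng.exponential(1000.0, 600),
    "type": rng.integers(0, 5, 600).astype(float),
})
demo_y = (demo_X["amount"] > np.quantile(demo_X["amount"], 0.9)).astype(int)

scores = ablation_scores(demo_X, demo_y, "type")
print(scores)
```

If the two scores are close, the feature adds little beyond what the remaining features already provide.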

What kind of prevention should be adopted while the company updates its infrastructure?
To enhance the robustness of fraud detection, I recommend integrating ensemble classifiers with the existing neural network. These classifiers can act as a supplementary decision layer after the primary sequential network, effectively "voting" on the neural network's predictions to achieve more accurate results.
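
A minimal sketch of the voting idea, assuming soft voting (averaging predicted fraud probabilities). To keep the example self-contained, scikit-learn's `MLPClassifier` stands in for the Keras network, and the data is synthetic. The helper `soft_vote` is a hypothetical name, not part of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Imbalanced synthetic data standing in for the transaction features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Stand-in for the primary neural network plus two ensemble classifiers
nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
boost = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

def soft_vote(models, X):
    """Average the positive-class probabilities of all fitted models."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

vote_proba = soft_vote([nn, forest, boost], X_te)
vote_label = (vote_proba >= 0.5).astype(int)
```

With a Keras model, `model.predict(X).squeeze()` would take the place of `predict_proba(X)[:, 1]` for the network's vote.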

Assuming these actions have been implemented, how would you determine if they work?
To gauge the effectiveness of the implemented actions, several approaches can be employed:

Model Optimization: Continued refinement of the model's hyperparameters, possibly including alternative algorithms or class-weighting methods, with each change judged by whether validation performance improves.

Noise Reduction: Reducing reliance on synthetic oversampling techniques like SMOTE, which might introduce noise, in favor of clean and relevant feature selection. Performance that holds or improves with less synthetic data would be another indicator.

Anomaly Detection: Tuning the contamination setting of an anomaly detector (for example, the IsolationForest imported below) can further refine the model by filtering out statistical anomalies that could distort its performance.

Performance improvements can then be quantitatively measured using established metrics such as AUC-ROC, F1-Score, and others.
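
As a sketch of such a before/after comparison, the helper below collects several of these metrics in one place. The function name `evaluate` and the two small score arrays are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    matthews_corrcoef,
    roc_auc_score,
)

def evaluate(y_true, y_score, threshold=0.5):
    """Threshold-free and thresholded metrics for one set of predictions."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "auc_roc": roc_auc_score(y_true, y_score),
        "auc_pr": average_precision_score(y_true, y_score),
        "f1": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }

# Hypothetical labels and scores from a model before and after a change
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
before = evaluate(y_true_demo, np.array([0.1, 0.6, 0.2, 0.3, 0.4, 0.1, 0.7, 0.2, 0.5, 0.3]))
after = evaluate(y_true_demo, np.array([0.1, 0.2, 0.2, 0.3, 0.8, 0.1, 0.7, 0.2, 0.4, 0.6]))

for name in before:
    print(f"{name}: {before[name]:.3f} -> {after[name]:.3f}")
```

Both runs must be scored on the same held-out data for the comparison to be meaningful.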


In [2]:
'''Install the packages that are not preinstalled in the Python environment'''
!pip install missingno
!pip install imbalanced-learn
!pip install scipy
!pip install hyperopt
!pip install psutil torch

Collecting multiprocessing
  Downloading multiprocessing-2.6.2.1.tar.gz (108 kB)
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.


In [3]:
'''Import all the packages required for the analysis and the neural network'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, make_scorer, f1_score
from sklearn.preprocessing import LabelEncoder
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import chi2, f_classif
from scipy.stats import mannwhitneyu, ttest_ind
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
import missingno as msno
from sklearn.metrics import roc_auc_score, matthews_corrcoef, cohen_kappa_score, f1_score, average_precision_score
from tensorflow.keras.models import clone_model
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import tensorflow as tf
from hyperopt import fmin, tpe, hp, STATUS_OK
from typing import List, Tuple
from tensorflow.keras import regularizers


In [4]:
'''Here, I will load the data and view the following features:
-Check a small snippet of the dataframe.
-Look at the information of each column and their datatypes.
-View the unique values per column as well.
'''

fraudcsv=pd.read_csv('Fraud.csv')
fraudcsv.head(5)
fraudcsv.info()
fraudcsv.nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69858 entries, 0 to 69857
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   step            69858 non-null  int64  
 1   type            69858 non-null  object 
 2   amount          69858 non-null  float64
 3   nameOrig        69858 non-null  object 
 4   oldbalanceOrg   69858 non-null  float64
 5   newbalanceOrig  69858 non-null  float64
 6   nameDest        69858 non-null  object 
 7   oldbalanceDest  69858 non-null  float64
 8   newbalanceDest  69857 non-null  float64
 9   isFraud         69857 non-null  float64
 10  isFlaggedFraud  69857 non-null  float64
dtypes: float64(7), int64(1), object(3)
memory usage: 5.9+ MB


step                  9
type                  5
amount            69533
nameOrig          69858
oldbalanceOrg     41993
newbalanceOrig    34045
nameDest          38011
oldbalanceDest    36977
newbalanceDest    13553
isFraud               2
isFlaggedFraud        1
dtype: int64
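
The unique-value counts above show that 'isFlaggedFraud' takes a single value in this sample, so it cannot help separate fraud from non-fraud here. A small sketch for spotting such columns follows. The helper `constant_columns` and the three-row frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def constant_columns(df):
    """Names of columns holding at most one distinct non-null value."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]

# Tiny made-up frame mirroring the situation in the output above
demo = pd.DataFrame({
    "amount": [10.0, 250.0, 31.5],
    "isFraud": [0.0, 1.0, 0.0],
    "isFlaggedFraud": [0.0, 0.0, 0.0],
})

to_drop = constant_columns(demo)
demo_reduced = demo.drop(columns=to_drop)
print(to_drop)
```

The same output also shows 'nameOrig' with one distinct value per row, which makes it behave like a row identifier rather than a feature.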

In [None]:
"""
Here, a line plot is generated to visualize transaction amounts grouped by 'step' (which represents hour) and differentiated by 'type'.
- The x-axis represents the hour (step).
- The y-axis represents the amount of the transaction.
"""
plt.figure(figsize=(15,6))
sns.lineplot(data=fraudcsv, x='step', y='amount', hue='type')
plt.title('Transaction Amounts by Hour')
plt.xlabel('Step (Hour)')
plt.ylabel('Amount')
plt.show()

"""
This count plot displays the frequency of each transaction type ('type'), further categorized by whether the transaction is fraudulent ('isFraud').
- It provides insights into which types of transactions are more often associated with fraud.
"""
plt.figure(figsize=(10,6))
sns.countplot(data=fraudcsv, x='type', hue='isFraud')
plt.title('Transaction Types and Fraud Frequency')
plt.show()

"""
The scatter plot is used to compare the old balance and the new balance at the origin account.
- Transactions are color-coded based on whether they are fraudulent ('isFraud').
- The 'fit_reg' parameter set to False ensures that no regression line is fitted.
"""
sns.lmplot(data=fraudcsv, x='oldbalanceOrg', y='newbalanceOrig', hue='isFraud', fit_reg=False)
plt.title('Old Balance vs New Balance (Origin)')
plt.show()

"""
A box plot is created to showcase the distribution of transaction amounts for transactions that have been flagged as fraud.
- The x-axis represents the transaction type, and the y-axis represents the amount.
"""
sns.boxplot(data=fraudcsv[fraudcsv['isFlaggedFraud']==1], x='type', y='amount')
plt.title('Transaction Amounts for Flagged Transactions')
plt.show()

"""
This scatter plot compares the old balance at the origin account to the transaction amount.
- Points are color-coded based on whether they are fraudulent and styled based on the 'isFlaggedFraud' feature.
"""
sns.scatterplot(data=fraudcsv, x='oldbalanceOrg', y='amount', hue='isFraud', style='isFlaggedFraud', palette='viridis')
plt.title('Old Balance and Amounts with Fraud Flag')
plt.show()


"""
A new feature 'day' is created from 'step' to represent the day on which a transaction occurs.
- A line plot is generated to visualize average transaction amounts grouped by 'day' and differentiated by 'type'.
- The x-axis represents the day, and the y-axis represents the average amount of the transactions.
"""
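fraudcsv['day'] = fraudcsv['step'] // 24

# Average the amount per day and transaction type before plotting
fraudcsv_grouped = fraudcsv.groupby(['day', 'type'], as_index=False)['amount'].mean()

plt.figure(figsize=(15,6))
sns.lineplot(data=fraudcsv_grouped, x='day', y='amount', hue='type')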
plt.title('Average Transaction Amounts by Day')
plt.xlabel('Day')
plt.ylabel('Average Amount')
plt.show()

In [None]:
plt.figure(figsize=(6, 6))
fraudcsv['isFraud'].value_counts().plot.pie(autopct='%1.1f%%')
plt.title('Fraud vs Non-Fraud Distribution')
plt.show()

plt.figure(figsize=(10, 6))
sns.countplot(data=fraudcsv, x='type', hue='isFraud')
plt.title('Transaction Type by Fraud vs Non-Fraud')
plt.show()

plt.figure(figsize=(10, 6))
sns.boxplot(data=fraudcsv, x='isFraud', y='amount')
plt.title('Amount vs Fraud')
plt.show()

In [5]:
"""
Here, I convert the object columns 'nameOrig', 'type', and 'nameDest' to the string data type.
- '.astype('string')' ensures that the data in these columns is treated as strings in later operations.
"""
# change the data from object to string
fraudcsv['nameOrig']=fraudcsv['nameOrig'].astype('string')
fraudcsv['type']=fraudcsv['type'].astype('string')
fraudcsv['nameDest']=fraudcsv['nameDest'].astype('string')

"""
This block of code is responsible for label encoding of categorical variables.
- I will use LabelEncoder from scikit-learn to convert the string labels into numerical format.
- Columns 'nameOrig', 'type', and 'nameDest' are transformed.
"""
# Encode the data using label encoder
labelencoder = LabelEncoder()
fraudcsv['nameOrig'] = labelencoder.fit_transform(fraudcsv['nameOrig'])
fraudcsv['type'] = labelencoder.fit_transform(fraudcsv['type'])
fraudcsv['nameDest'] = labelencoder.fit_transform(fraudcsv['nameDest'])

"""
After label encoding, I will convert the numerical labels into float data types.
- I will change the data type to 'float' to make it compatible for passing through neural networks or other machine learning algorithms.
"""
# convert the data to numeric float values to pass through the network
fraudcsv['nameOrig']=fraudcsv['nameOrig'].astype('float')
fraudcsv['type']=fraudcsv['type'].astype('float')
fraudcsv['nameDest']=fraudcsv['nameDest'].astype('float')
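
LabelEncoder assigns integer codes in sorted order, so for account identifiers such as 'nameOrig' and 'nameDest' the resulting numbers carry an arbitrary ordering. One possible alternative, which this notebook does not use, is frequency encoding: replace each identifier with how often it appears. The helper `frequency_encode` and the sample values are hypothetical.

```python
import pandas as pd

def frequency_encode(series):
    """Replace each category with the number of times it occurs in the series."""
    counts = series.value_counts()
    return series.map(counts).astype(float)

# Made-up destination identifiers for illustration
dest = pd.Series(["C1", "C2", "C1", "M7", "C1"], name="nameDest")
encoded = frequency_encode(dest)
print(encoded.tolist())
```

In a real pipeline the counts would be computed on the training rows only and then mapped onto the validation rows.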


In [None]:
"""
In this section, the original dataframe is separated into feature set (features) and target labels (tar).
- The feature set omits the 'isFraud' column.
- The target label set only includes the 'isFraud' column.
"""
# Feature Engineering
features=fraudcsv.drop('isFraud',axis=1)
tar=fraudcsv['isFraud']

"""
The Mann-Whitney U test is used to compare the distributions of each feature for fraud and non-fraud cases.
- A dictionary 'mannwhitneyu_scores' is initialized to hold p-values for each feature.
- The p-values are calculated using the mannwhitneyu function from the scipy.stats module.
"""
mannwhitneyu_scores = {}
for feature in features.columns:
    mannwhitneyu_scores[feature] = mannwhitneyu(features[feature][tar == 1], features[feature][tar == 0]).pvalue

"""
The Independent t-test is used for the same purpose as the Mann-Whitney U test but assumes that the data is normally distributed.
- A dictionary 'ttest_scores' is initialized to hold p-values for each feature.
- The p-values are calculated using the ttest_ind function from the scipy.stats module.
"""
ttest_scores = {}
for feature in features.columns:
    ttest_scores[feature] = ttest_ind(features[feature][tar == 1], features[feature][tar == 0]).pvalue

"""
The significance level (alpha) is adjusted using the Bonferroni correction to account for multiple comparisons.
- Alpha is divided by the number of features tested.
"""
alpha = 0.05 / len(features.columns)

"""
A feature is kept only if both tests reject the null hypothesis at the corrected alpha.
- These features are considered significant for classifying fraud.
- The resulting list is printed at the end.
"""
relevant_features = []
for feature in features.columns:
    if mannwhitneyu_scores[feature] < alpha and ttest_scores[feature] < alpha:
        relevant_features.append(feature)

print("Relevant Features for Fraud Classification: ", relevant_features)


Relevant Features for Fraud Classification:  ['step', 'type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'ComputedDifference']
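
The screening logic can also be condensed into a single function that drops missing values before each test, which matters because 'newbalanceDest' contains a null. This is a sketch on synthetic data using only the Mann-Whitney U test. The name `screen_features` and the two demo columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

def screen_features(features, target, alpha=0.05):
    """Keep features whose Mann-Whitney p-value beats the Bonferroni threshold."""
    threshold = alpha / features.shape[1]
    selected = []
    for name in features.columns:
        fraud = features.loc[target == 1, name].dropna()
        legit = features.loc[target == 0, name].dropna()
        p_value = mannwhitneyu(fraud, legit, alternative="two-sided").pvalue
        if p_value < threshold:
            selected.append(name)
    return selected

# Synthetic data: one column differs between classes, the other is pure noise
rng = np.random.default_rng(42)
target = pd.Series(np.repeat([0, 1], [400, 100]))
features = pd.DataFrame({
    "shifted": np.concatenate([rng.normal(0, 1, 400), rng.normal(3, 1, 100)]),
    "noise": rng.normal(0, 1, 500),
})

selected = screen_features(features, target)
print(selected)
```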


In [None]:
"""
This section is responsible for data preprocessing.
- A copy of the original dataframe is made to preserve the original data.
- Rows with missing 'isFraud' values are dropped.
- The features (X) and target labels (y) are separated.
- The data is split into training and validation sets.
- The Standard Scaler is fitted on the training set only and then applied to both sets, so validation statistics do not leak into training.
"""


# Create a copy to keep the original fraudcsv intact
df = fraudcsv.copy()
df.dropna(subset=['isFraud'], inplace=True)
X = df.drop('isFraud', axis=1)
y = df["isFraud"]

input_dim=len(X.columns)

scale_cols=['step','type','amount', 'oldbalanceOrg','nameOrig', 'newbalanceOrig', 'oldbalanceDest', 'nameDest']
X_train,X_val,y_train,y_val=train_test_split(X,y,test_size=0.2,random_state=42)
X_train,X_val=X_train.copy(),X_val.copy()

scaler = StandardScaler()
X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])
X_val[scale_cols] = scaler.transform(X_val[scale_cols])


"""
The modelStruct function defines the architecture of the neural network model.
- It takes the number of nodes for each layer and the dropout rate as arguments.
- The function uses metrics like AUC-ROC, AUC-PR, Precision, and Recall for evaluation.
- Batch normalization and L2 regularization are applied.
- The model is compiled using the Adam optimizer and binary cross-entropy loss.
"""

def modelStruct(node1, node2,node3,dropout):
    metric_list=[
        tf.keras.metrics.AUC(name='auc_roc',curve='ROC'),
        tf.keras.metrics.AUC(name='auc_pr',curve='PR'),
        tf.keras.metrics.Precision(name='Precision'),
        tf.keras.metrics.Recall(name='recall')
        ]

    # Check twice one with flattening the input tensor and without
    # Since input tensors are 1D flattening would be unnecessary
    # model=tf.keras.Sequential([
    #     tf.keras.layers.Dense(node1,activation='relu'),
    #     tf.keras.layers.Dropout(dropout),
    #     tf.keras.layers.Dense(node2,activation='relu'),
    #     tf.keras.layers.Dropout(dropout),
    #     tf.keras.layers.Dense(1,activation='sigmoid')
    # ])
    model = tf.keras.Sequential([
          tf.keras.layers.InputLayer(input_shape=(input_dim,)),
          tf.keras.layers.BatchNormalization(),
          tf.keras.layers.Dense(node1, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
          tf.keras.layers.Dropout(dropout),
          tf.keras.layers.BatchNormalization(),
          tf.keras.layers.Dense(node2, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
          tf.keras.layers.Dropout(dropout),
          tf.keras.layers.BatchNormalization(),
          tf.keras.layers.Dense(node3, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
          tf.keras.layers.Dropout(dropout),
          tf.keras.layers.BatchNormalization(),
          tf.keras.layers.Dense(1, activation='sigmoid')
      ])

    model.compile(optimizer='adam',loss='binary_crossentropy',metrics=metric_list)
    return model

"""
The trainmodel function trains a clone of a source model.
- It receives hyperparameters and datasets as arguments.
- SMOTE and RandomUnderSampler are used to balance the dataset.
- Early stopping is employed to halt training when the loss plateaus.
- Returns the trained model.
"""
def trainmodel(params,source_model,X_train,y_train):
    # Clone and compile the model
    model = clone_model(source_model)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=source_model.metrics)
    try:
      smote=SMOTE(sampling_strategy=params['smote_strategy'])
      rus=RandomUnderSampler(sampling_strategy=params['rus_strategy'])

      X_res,y_res=smote.fit_resample(X_train,y_train)
      X_res,y_res=rus.fit_resample(X_res,y_res)
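    except ValueError as err:
      # imbalanced-learn raises ValueError when a requested ratio cannot be reached
      print(f"Skipping trial due to invalid ratio. SMOTE: {params['smote_strategy']}, RUS: {params['rus_strategy']} ({err})")
      return None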


    early_stopping=tf.keras.callbacks.EarlyStopping(monitor='loss',patience=3)
    model.fit(X_res,y_res,epochs=10,verbose=0,callbacks=[early_stopping],validation_split=0.2)

    return model

"""
This function creates N random, roughly equal-sized subsets from the training data.
- It checks that X and y have the same number of rows.
- The data is shuffled and then split into N subsets.
- The subsets are not class-balanced at this stage; balancing happens later in trainmodel through SMOTE and RandomUnderSampler.
- Returns these subsets for further processing.
"""

def create_balanced_subsets(X: np.ndarray,y: np.ndarray,N=5)->List[Tuple[np.ndarray,np.ndarray]]:
    # Check dimensions
    if X.shape[0]!=y.shape[0]:
        raise ValueError("X and y have a different number of rows")

    # np.asarray accepts DataFrames, Series, and arrays alike
    X=np.asarray(X)
    y=np.asarray(y).reshape(-1,1)
    combined_samples=np.hstack([X,y])
    np.random.shuffle(combined_samples)
    make_splits=np.array_split(combined_samples,N)
    subsets=[(subset[:,:-1],subset[:,-1]) for subset in make_splits]

    return subsets

"""
The ensembleTraining function manages the ensemble learning process.
- It calls create_balanced_subsets to get N subsets.
- Trains models on each of these subsets using trainmodel.
- Takes the mean of the predictions from all trained models.
- Evaluates the ensemble's performance using multiple metrics.
- Returns the mean of these metrics as the final evaluation score.
"""

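def ensembleTraining(params):
    # Inspired by the multi-head idea in the transformer architecture: instead of duplicating each batch across heads, the data is split into subsets
    try:
        subsets=create_balanced_subsets(X_train,y_train,N=5)
    except ValueError as e:
        # Without subsets there is nothing to train on, so report a worst-case loss for this trial
        print(e)
        return {'loss': float('inf'), 'status': STATUS_OK}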

    source_model=modelStruct(params['node1'],params['node2'],params['node3'],params['dropout'])
    models=[trainmodel(params,clone_model(source_model),X_sub,y_sub) for X_sub,y_sub in subsets]

    models = [model for model in models if model is not None]
    if len(models) == 0:
      return {'loss': float('inf'), 'status': STATUS_OK}
    val_preds=np.mean([model.predict(X_val).squeeze() for model in models],axis=0)

    val_preds_rounded=np.round(np.array(val_preds))

    metrics=[
        roc_auc_score(y_val,val_preds),
        matthews_corrcoef(y_val,val_preds_rounded),
        cohen_kappa_score(y_val,val_preds_rounded),
        f1_score(y_val,val_preds_rounded),
        average_precision_score(y_val,val_preds)
    ]

    mean_metrics=np.mean(metrics)

    return {'loss':-mean_metrics,'status':STATUS_OK}

"""
This section sets up hyperparameter tuning using the fmin function from the hyperopt library.
- A search space is defined for hyperparameters like SMOTE and RUS ratios, node sizes, and dropout rate.
- The ensembleTraining function is used as the objective function.
- The best hyperparameters are printed at the end.
"""

space = {
    'smote_strategy': hp.uniform('smote_strategy', 0.1, 1),
    'rus_strategy': hp.uniform('rus_strategy', 0.1, 1),
    'node1': hp.choice('node1', [512, 1024]),
    'node2': hp.choice('node2', [256, 512]),
    'node3': hp.choice('node3', [128, 256]),
    'dropout': hp.uniform('dropout', 0.3, 0.7)
    }

best = fmin(fn=ensembleTraining, space=space, algo=tpe.suggest, max_evals=100)
print("Best Hyperparameters:", best)
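
Because create_balanced_subsets shuffles and splits at random, a subset can end up with very few fraud rows, which is one way SMOTE can fail inside trainmodel. One possible alternative, not used in this notebook, is to reuse StratifiedKFold so that every subset keeps roughly the same fraud ratio. The function name `stratified_subsets` and the demo arrays are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_subsets(X, y, n_subsets=5, seed=42):
    """Split (X, y) into disjoint subsets with a similar class ratio in each."""
    X = np.asarray(X)
    y = np.asarray(y)
    splitter = StratifiedKFold(n_splits=n_subsets, shuffle=True, random_state=seed)
    # The held-out indices of each fold partition the data into disjoint subsets
    return [(X[idx], y[idx]) for _, idx in splitter.split(X, y)]

# Made-up data: 100 rows, 10 of them positive
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array([1] * 10 + [0] * 90)

subsets = stratified_subsets(X_demo, y_demo)
for X_sub, y_sub in subsets:
    print(X_sub.shape, int(y_sub.sum()))
```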



The best parameters from a previous runtime were as follows (that run used 50 evaluations and an earlier search space without 'node3'):
100%|██████████| 50/50 [54:38<00:00, 65.57s/trial, best loss: -0.39153398391526484]
Best Hyperparameters: {'dropout': 0.29244632948742716, 'node1': 1, 'node2': 1, 'rus_strategy': 0.9822700231410721, 'smote_strategy': 0.8030273605560339}
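
The search space samples 'smote_strategy' and 'rus_strategy' independently, so some combinations cannot be satisfied and those trials are skipped. Assuming a float sampling_strategy means the target minority-to-majority ratio after resampling, an approximate feasibility check looks like the sketch below. The helper `resampling_is_feasible` is hypothetical, and it ignores rounding effects on very small subsets.

```python
def resampling_is_feasible(n_minority, n_majority, smote_ratio, rus_ratio):
    """Approximate check that SMOTE followed by random undersampling can run.

    SMOTE can only raise the minority/majority ratio, and undersampling the
    majority class afterwards can only raise it further, up to at most 1.
    """
    current_ratio = n_minority / n_majority
    return current_ratio < smote_ratio <= rus_ratio <= 1.0

# Class counts below are made up; the ratios mirror the logged best parameters
print(resampling_is_feasible(100, 10_000, 0.80, 0.98))
# Undersampling target below the SMOTE target: the trial would be skipped
print(resampling_is_feasible(100, 10_000, 0.80, 0.30))
```

Sampling 'rus_strategy' conditionally on 'smote_strategy' would avoid spending evaluations on trials that are skipped anyway.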

In [None]:
'''After the hyperparameter search, I use the best parameters
   to train the final models and save them for later use
'''
best_params = {
    'smote_strategy': best['smote_strategy'],
    'rus_strategy': best['rus_strategy'],
    'node1': [512, 1024][best['node1']],
    'node2': [256, 512][best['node2']],
    'node3': [128, 256][best['node3']],
    'dropout': best['dropout']
}
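import os

final_models = [trainmodel(best_params, clone_model(modelStruct(best_params['node1'], best_params['node2'], best_params['node3'], best_params['dropout'])), X_sub, y_sub) for X_sub, y_sub in create_balanced_subsets(X_train, y_train, N=5)]
# trainmodel returns None when resampling fails, so keep only the trained models
final_models = [model for model in final_models if model is not None]

# Make sure the target directory exists before saving
os.makedirs('saved_models', exist_ok=True)
for i, model in enumerate(final_models):
    model.save(f'saved_models/final_model_{i}.h5')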
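
When the saved models are later used for prediction, note that ensembleTraining rounds the averaged probabilities, which fixes the decision threshold at 0.5. On heavily imbalanced data a different threshold can give a better F1 score. The sketch below picks it from the precision-recall curve. The helper `best_f1_threshold` and the demo scores are illustrative assumptions, and in practice the threshold would be chosen on validation data.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_score):
    """Return the score threshold that maximizes F1, together with that F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final precision/recall pair has no matching threshold, so drop it
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])

# Made-up averaged ensemble scores that all fall below 0.5
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score_demo = np.array([0.05, 0.10, 0.10, 0.20, 0.15, 0.30, 0.35, 0.25, 0.40, 0.45])

threshold, f1 = best_f1_threshold(y_true_demo, y_score_demo)
print(threshold, f1)
```

In this toy case rounding at 0.5 would flag nothing as fraud, while the selected threshold separates the two classes.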
