<a href="https://colab.research.google.com/github/Beabsira94/Enhanced-Fraud-Detection/blob/Task-2/Notebooks/Model_training_Fraud_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Task 2
In this notebook we begin **Task 2** of the project: **Model Building and Training** for the credit card and fraud detection datasets. The workflow separates features from the target variable, performs a train-test split, and compares several models: Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, and neural network architectures such as MLP, CNN, RNN, and LSTM. Each model is trained, evaluated, and tuned for performance. We also apply MLOps practices, using MLflow for experiment tracking, model versioning, and parameter logging to keep the workflow organized and reproducible.

In [None]:
!pwd

/content/drive/MyDrive/Colab Notebooks/Enhanced-Fraud-Detection/notebooks


We start by changing into the project's notebooks directory on the mounted Google Drive.

In [None]:
%cd "/content/drive/MyDrive/Colab Notebooks/Enhanced-Fraud-Detection/notebooks"

/content/drive/MyDrive/Colab Notebooks/Enhanced-Fraud-Detection/notebooks
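
The `%cd` above only works once Google Drive is mounted at `/content/drive`. The mount call itself is not shown in this notebook, so the following is a minimal sketch of that step, assuming the standard Colab helper `google.colab.drive`. The wrapper name `mount_drive_if_colab` is hypothetical, and outside Colab it does nothing.

```python
def mount_drive_if_colab(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive when running inside Colab; return True if mounted."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import drive
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


mounted = mount_drive_if_colab()
print(f"Drive mounted: {mounted}")
```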


We begin model training with the Fraud dataset, after installing MLflow.

In [None]:
!pip install mlflow



In [None]:
# Import libraries and load the merged fraud dataset (df_merged.csv)
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Enhanced-Fraud-Detection/Data/df_merged.csv")
print(df.head())  # call head(); printing the bare method dumps the whole frame

   user_id      signup_time   purchase_time  purchase_value  \
0    22058  2/24/2015 22:55  4/18/2015 2:47              34   
1   333320   6/7/2015 20:39   6/8/2015 1:38              16   
2     1359   1/1/2015 18:52  1/1/2015 18:52              15   
3     1359   1/1/2015 18:52  1/1/2015 18:52              15   
4     1359   1/1/2015 18:52  1/1/2015 18:52              15   

       device_id source browser  sex  age  ip_address  class  \
0  QVPSPJUOCKZAR    SEO  Chrome    1   39   732758368      0   
1  

Before training any models, we run a final check for missing values.

In [None]:
print(df.isnull().sum())

user_id                  0
signup_time              0
purchase_time            0
purchase_value           0
device_id                0
source                   0
browser                  0
sex                      0
age                      0
ip_address               0
class                    0
country                  0
hour_of_day              0
day_of_week              0
transaction_frequency    0
normalized_purchase      0
dtype: int64
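
Missing values are only one part of a final check. A minimal sketch of a slightly broader check follows, using a small toy frame in place of `df` (the values are made up for illustration) and a hypothetical helper `cleaning_report`. It also counts exact duplicate rows and reports the class balance, both of which tend to matter for fraud data.

```python
import pandas as pd


def cleaning_report(frame: pd.DataFrame, target: str) -> dict:
    """Summarize missing values, duplicate rows, and class balance."""
    return {
        "missing_total": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "class_share": frame[target].value_counts(normalize=True).to_dict(),
    }


# Toy stand-in for df; values are illustrative only
toy = pd.DataFrame({
    "purchase_value": [34, 16, 15, 15, 15, 44],
    "age": [39, 53, 53, 53, 53, 41],
    "class": [0, 0, 1, 1, 1, 0],
})

report = cleaning_report(toy, target="class")
print(report)
```

On the real data the same helper could be called as `cleaning_report(df, target="class")`.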


**Next, we split the data into training and test sets and log the split parameters to MLflow.**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import mlflow
import mlflow.sklearn

# Separate the features (X) and the target variable (y)
X = df.drop(columns=['class', 'user_id', 'signup_time', 'purchase_time', 'device_id', 'ip_address'])
y = df['class']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Set the MLflow tracking URI to the specified directory
mlflow.set_tracking_uri("/content/drive/MyDrive/Colab Notebooks/Enhanced-Fraud-Detection/Log mlflow")

# Create or set an experiment
# If the experiment doesn't exist, it will be created
# If it exists, MLflow will use the existing experiment
mlflow.set_experiment("Fraud Detection Experiment")  # Replace with your desired experiment name

# Start an MLflow run
with mlflow.start_run():
    # Log the dataset split
    mlflow.log_param("test_size", 0.3)
    mlflow.log_param("random_state", 42)

    # Log the features and target shapes
    mlflow.log_param("X_train_shape", X_train.shape)
    mlflow.log_param("X_test_shape", X_test.shape)
    mlflow.log_param("y_train_shape", y_train.shape)
    mlflow.log_param("y_test_shape", y_test.shape)

    # Optionally: Save the train and test datasets as artifacts
    # mlflow.log_artifact("X_train.csv")
    # mlflow.log_artifact("X_test.csv")

print("Train and test data split successfully and logged to MLflow.")

2024/10/24 17:20:20 INFO mlflow.tracking.fluent: Experiment with name 'Fraud Detection Experiment' does not exist. Creating a new experiment.


Train and test data split successfully and logged to MLflow.
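
The split above samples rows at random, so the fraud rate in the training and test sets can drift slightly apart. One common alternative, not used in this notebook, is to pass `stratify=y` to `train_test_split`. A minimal sketch on synthetic labels follows; the 10% fraud rate and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 1,000 rows, 3 features, exactly 10% positive labels
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the positive rate of the full data
print(f"train fraud rate: {y_tr.mean():.3f}")
print(f"test fraud rate:  {y_te.mean():.3f}")
```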


**Next, we train and evaluate four models: Logistic Regression, Gradient Boosting, a Convolutional Neural Network (CNN), and a Long Short-Term Memory (LSTM) network.**
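
Fraud labels are usually heavily imbalanced, so accuracy alone can look strong for a model that rarely flags fraud. As a complement to the accuracy scores logged in this notebook, the sketch below computes precision, recall, F1, and ROC AUC with scikit-learn. The synthetic data from `make_classification` and the roughly 5% positive rate are assumptions for illustration, not the project's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 5% positives
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred, zero_division=0),
    "recall": recall_score(y_te, pred, zero_division=0),
    "f1": f1_score(y_te, pred, zero_division=0),
    "roc_auc": roc_auc_score(y_te, proba),
}

# Baseline: accuracy obtained by always predicting the majority (negative) class
majority_baseline = 1.0 - y_te.mean()

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Each of these values could be logged next to accuracy with `mlflow.log_metric`, and comparing accuracy against the majority-class baseline shows how much of the score comes from class imbalance alone.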

In [None]:
!pip install mlflow scikit-learn xgboost tensorflow pyngrok
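
The training cell below label-encodes the categorical columns and fits Logistic Regression on unscaled features, which is where the convergence warning in its output comes from. One alternative, sketched here under assumptions, wraps one-hot encoding and standard scaling in a scikit-learn `Pipeline` so that both are fit on the training split only. The toy frame, its random values, and the category lists are illustrative; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400

# Toy stand-in for X and y; values are random and purely illustrative
X_demo = pd.DataFrame({
    "purchase_value": rng.integers(5, 150, size=n),
    "age": rng.integers(18, 70, size=n),
    "source": rng.choice(["SEO", "Ads", "Direct"], size=n),
    "browser": rng.choice(["Chrome", "Safari", "FireFox"], size=n),
})
y_demo = (X_demo["purchase_value"] > 100).astype(int)

numeric_cols = ["purchase_value", "age"]
categorical_cols = ["source", "browser"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    # handle_unknown="ignore" keeps unseen categories from raising at predict time
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
pipeline.fit(X_tr, y_tr)
test_accuracy = pipeline.score(X_te, y_te)
print(f"Pipeline test accuracy: {test_accuracy:.3f}")
```

Because the pipeline is a single estimator with `fit` and `predict`, it could be passed to `log_model_and_metrics` in place of a bare model.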



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import mlflow
import mlflow.sklearn
import mlflow.tensorflow  # used below to log the Keras models
import pickle
import tensorflow as tf
from tensorflow.keras import layers, models
from pyngrok import ngrok

# Ensure your data frame 'df' is already loaded and split
X = df.drop(columns=['class', 'user_id', 'signup_time', 'purchase_time', 'device_id', 'ip_address'])
y = df['class']

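# Encode categorical features in X, keeping one fitted encoder per column
# so each column's mapping can be reused at inference time
label_encoders = {}
for column in X.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    X[column] = label_encoders[column].fit_transform(X[column])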

# Now proceed with the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Set the MLflow tracking URI and experiment name
mlflow.set_tracking_uri("/content/drive/MyDrive/Colab Notebooks/Enhanced-Fraud-Detection/Log mlflow")
mlflow.set_experiment("fraud_detection_experiment")

# Function to log model and metrics
def log_model_and_metrics(model, model_name, X_train, y_train, X_test, y_test):
    with mlflow.start_run() as run:
        # Train the model
        model.fit(X_train, y_train)

        # Predict on the test set
        predictions = model.predict(X_test)

        # Evaluate the model
        accuracy = accuracy_score(y_test, predictions)
        print(f"{model_name} Accuracy: {accuracy}")

        # Log metrics, parameters, and the model
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_param("model_name", model_name)
        mlflow.sklearn.log_model(model, f"{model_name}_model")

        # Print the run ID for reference
        run_id = run.info.run_id
        print(f"{model_name} Run ID: {run_id}")

        return accuracy

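# 1. Logistic Regression
# max_iter=100 is scikit-learn's default; on the unscaled features used here the
# solver stops at this limit (see the convergence warning in the output)
logistic_model = LogisticRegression(max_iter=100)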
logistic_accuracy = log_model_and_metrics(logistic_model, "Logistic Regression", X_train, y_train, X_test, y_test)

# 2. Gradient Boosting
gb_model = GradientBoostingClassifier()
gb_accuracy = log_model_and_metrics(gb_model, "Gradient Boosting", X_train, y_train, X_test, y_test)

# 3. Convolutional Neural Network (CNN)
# Reshape X_train and X_test for CNN (1D CNN input)
X_train_cnn = np.expand_dims(X_train.values, axis=-1)
X_test_cnn = np.expand_dims(X_test.values, axis=-1)

cnn_model = models.Sequential()
cnn_model.add(layers.Conv1D(32, 2, activation='relu', input_shape=(X_train_cnn.shape[1], 1)))
cnn_model.add(layers.MaxPooling1D(pool_size=2))
cnn_model.add(layers.Flatten())
cnn_model.add(layers.Dense(64, activation='relu'))
cnn_model.add(layers.Dense(1, activation='sigmoid'))  # Binary classification

cnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
cnn_model.fit(X_train_cnn, y_train, epochs=10, batch_size=32, verbose=0)

# Predict and evaluate CNN
cnn_predictions = (cnn_model.predict(X_test_cnn) > 0.5).astype("int32")
cnn_accuracy = accuracy_score(y_test, cnn_predictions)
print(f"CNN Accuracy: {cnn_accuracy}")

# Log CNN model
with mlflow.start_run() as run:
    mlflow.log_metric("accuracy", cnn_accuracy)
    mlflow.log_param("model_name", "Convolutional Neural Network")
    mlflow.tensorflow.log_model(cnn_model, "cnn_model")
    print(f"CNN Run ID: {run.info.run_id}")

# 4. Long Short-Term Memory (LSTM)
# Reshape X_train and X_test for LSTM (3D input)
X_train_lstm = np.reshape(X_train.values, (X_train.shape[0], X_train.shape[1], 1))
X_test_lstm = np.reshape(X_test.values, (X_test.shape[0], X_test.shape[1], 1))

lstm_model = models.Sequential()
lstm_model.add(layers.LSTM(50, input_shape=(X_train_lstm.shape[1], 1)))
lstm_model.add(layers.Dense(1, activation='sigmoid'))

lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
lstm_model.fit(X_train_lstm, y_train, epochs=10, batch_size=32, verbose=0)

# Predict and evaluate LSTM
lstm_predictions = (lstm_model.predict(X_test_lstm) > 0.5).astype("int32")
lstm_accuracy = accuracy_score(y_test, lstm_predictions)
print(f"LSTM Accuracy: {lstm_accuracy}")

# Log LSTM model
with mlflow.start_run() as run:
    mlflow.log_metric("accuracy", lstm_accuracy)
    mlflow.log_param("model_name", "Long Short-Term Memory")
    mlflow.tensorflow.log_model(lstm_model, "lstm_model")
    print(f"LSTM Run ID: {run.info.run_id}")

# Save all models for later deployment
model_directory = "/content/drive/MyDrive/Colab Notebooks/Enhanced-Fraud-Detection/Models"
with open(f"{model_directory}/logistic_regression_model.pkl", 'wb') as f:
    pickle.dump(logistic_model, f)

with open(f"{model_directory}/gradient_boosting_model.pkl", 'wb') as f:
    pickle.dump(gb_model, f)

cnn_model.save(f"{model_directory}/cnn_model.h5")  # Save CNN model
lstm_model.save(f"{model_directory}/lstm_model.h5")  # Save LSTM model

# Show results for all models
results = {
    "Logistic Regression": logistic_accuracy,
    "Gradient Boosting": gb_accuracy,
    "CNN": cnn_accuracy,
    "LSTM": lstm_accuracy,
}

print("\nModel Performance:")
for model_name, accuracy in results.items():
    print(f"{model_name} Accuracy: {accuracy}")

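# Set up ngrok to access the MLflow UI.
# Read the auth token from an environment variable rather than hard-coding the secret;
# NGROK_AUTH_TOKEN is an assumed name and must be set before this cell runs.
import os
ngrok.set_auth_token(os.environ["NGROK_AUTH_TOKEN"])
public_url = ngrok.connect(5000)
print(f"MLflow UI available at: {public_url}")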

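# Run the MLflow UI against the same tracking directory the runs were logged to
!mlflow ui --port 5000 --backend-store-uri "/content/drive/MyDrive/Colab Notebooks/Enhanced-Fraud-Detection/Log mlflow"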


2024/10/24 17:20:39 INFO mlflow.tracking.fluent: Experiment with name 'fraud_detection_experiment' does not exist. Creating a new experiment.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression Accuracy: 0.9343372046455747




Logistic Regression Run ID: ec767abd80dc4bd3b356a726c3315c1c
Gradient Boosting Accuracy: 0.9400240288346016


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Gradient Boosting Run ID: 7529ff1a5c654eab93624c5447aa1659
[1m1951/1951[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step




CNN Accuracy: 0.9389347216659992


  super().__init__(**kwargs)


CNN Run ID: 11251745cc2f48d8a876d932816a0f2d
[1m1951/1951[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 2ms/step




LSTM Accuracy: 0.9434040849018822




LSTM Run ID: adc82d3ab7b247f1992027dd13d0af31

Model Performance:
Logistic Regression Accuracy: 0.9343372046455747
Gradient Boosting Accuracy: 0.9400240288346016
CNN Accuracy: 0.9389347216659992
LSTM Accuracy: 0.9434040849018822
MLflow UI available at: NgrokTunnel: "https://f0aa-34-19-27-92.ngrok-free.app" -> "http://localhost:5000"
[2024-10-24 17:26:23 +0000] [10395] [INFO] Starting gunicorn 23.0.0
[2024-10-24 17:26:23 +0000] [10395] [INFO] Listening at: http://127.0.0.1:5000 (10395)
[2024-10-24 17:26:23 +0000] [10395] [INFO] Using worker: sync
[2024-10-24 17:26:23 +0000] [10400] [INFO] Booting worker with pid: 10400
[2024-10-24 17:26:23 +0000] [10401] [INFO] Booting worker with pid: 10401
[2024-10-24 17:26:24 +0000] [10402] [INFO] Booting worker with pid: 10402
[2024-10-24 17:26:24 +0000] [10403] [INFO] Booting worker with pid: 10403
[2024-10-24 17:49:42 +0000] [10395] [INFO] Handling signal: int

Aborted!
[2024-10-24 17:49:42 +0000] [10400] [INFO] Worker exiting (pid: 10400)
[2024-1