This notebook is a Python script that trains, evaluates, and compares several scikit-learn classifiers on a credit card fraud detection dataset, and plots their ROC curves for comparison.

Dataset: https://www.kaggle.com/datasets/kartik2112/fraud-detection

**Importing Libraries**: This section imports pandas for data manipulation, scikit-learn for the machine learning models and their evaluation, matplotlib for plotting, and datetime for handling date and time data. It also mounts Google Drive in Colab so the dataset file can be read.

In [1]:
# Import required libraries
from datetime import datetime
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support, roc_curve, auc
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**load_and_process_data Function**: This function loads a dataset from a file path, drops any duplicate entries, and returns the processed DataFrame.

In [2]:
# Load the training dataset and drop duplicate entries
def load_and_process_data(file_path):
    df = pd.read_csv(file_path).drop_duplicates()
    return df
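
As a quick illustration of what `drop_duplicates` does here, the following is a minimal sketch on a hypothetical three-row CSV held in memory. The column names and values are made up for the example and are not read from the actual dataset.

```python
import io
import pandas as pd

# Hypothetical toy CSV; the second and third rows are exact duplicates.
csv_text = """trans_num,amt,is_fraud
a1,10.50,0
b2,99.00,1
b2,99.00,1
"""

toy_df = pd.read_csv(io.StringIO(csv_text))
deduped = toy_df.drop_duplicates()  # keeps the first copy of each identical row

rows_before = len(toy_df)
rows_after = len(deduped)
print(rows_before, rows_after)
```

Only rows that match in every column are dropped, so two transactions with the same amount but different identifiers would both be kept.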

**add_time_related_features Function**: This function calculates the age based on the date of birth, and extracts additional time-related features such as hour, day of the week, and month from the transaction date and time.

In [3]:
# Feature Engineering
def add_time_related_features(df):
    # Parse the transaction timestamp once and reuse it below
    trans_time = pd.to_datetime(df['trans_date_trans_time'])
    # Calculate age from the birth year, relative to the year the code is run
    df['age'] = datetime.now().year - pd.to_datetime(df['dob']).dt.year
    # Extract hour, day of the week (Monday=0), and month from the transaction time
    df['hour'] = trans_time.dt.hour
    df['day'] = trans_time.dt.dayofweek
    df['month'] = trans_time.dt.month
    return df
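
The function above measures age against the year in which the notebook is run, so the feature shifts over time. One possible alternative, shown here only as a sketch on a hypothetical two-row frame, is to measure age against the transaction year instead. The column name `age_at_trans` is an assumption introduced for this example.

```python
import pandas as pd

# Hypothetical toy frame using the same input column names as the function above.
toy = pd.DataFrame({
    "dob": ["1990-06-15", "1985-01-01"],
    "trans_date_trans_time": ["2019-01-01 00:00:18", "2020-06-21 12:14:25"],
})

trans_time = pd.to_datetime(toy["trans_date_trans_time"])
birth_date = pd.to_datetime(toy["dob"])

# Year difference at transaction time; does not depend on when the code runs.
toy["age_at_trans"] = trans_time.dt.year - birth_date.dt.year
toy["hour"] = trans_time.dt.hour
toy["day"] = trans_time.dt.dayofweek  # Monday=0 ... Sunday=6
toy["month"] = trans_time.dt.month
print(toy[["age_at_trans", "hour", "day", "month"]])
```

Like the original, this is a plain year difference and ignores whether the birthday has already passed in the transaction year.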

**prepare_train_test_data Function**: This function selects the feature columns and the target variable, converts the categorical `category` column to dummy variables (dropping the first level), and splits the data into training (80%) and testing (20%) sets using the `train_test_split` function from scikit-learn with a fixed random seed.

In [4]:
# Model Training and Testing
def prepare_train_test_data(df, target_column):
    # Prepare features and target variable for model training
    train = df[['category', 'amt', 'zip', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'age', 'hour', 'day', 'month', target_column]]
    # Convert categorical 'category' column to dummy variables
    train = pd.get_dummies(train, columns=['category'], drop_first=True)
    # Separate the target from the features
    y = train[target_column].values
    X = train.drop(target_column, axis=1).values
    # Split the data into training (80%) and testing (20%) sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test
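
Because fraud cases are rare, a plain random split can leave the two partitions with slightly different fraud rates. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class ratio the same on both sides. It uses synthetic data with an assumed 2% positive rate and is not a change the notebook itself makes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, hypothetical data: 1,000 rows with 20 positives (2%).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# stratify=y preserves the 2% positive rate in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```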

**train_and_evaluate_model Function**: This function trains a given model, predicts on the test set, prints the classification report and confusion matrix, runs 5-fold cross-validation (scored by accuracy) on the training set, calculates precision, recall, and F1-score for the positive class, and adds the model's Receiver Operating Characteristic (ROC) curve, labeled with its Area Under the Curve (AUC) value, to the current plot.

In [5]:
# Function for model training and testing
def train_and_evaluate_model(model, model_name, X_train, X_test, y_train, y_test):
    # Train the model and make predictions
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)

    # Print classification report and confusion matrix
    print('\n\n' + model_name + ' - Classification Report:\n', classification_report(y_test, predicted))
    conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
    print(model_name + ' - Confusion Matrix:\n', conf_mat)

    # Perform 5-fold cross-validation on the training data (accuracy scoring)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print("\nCross-Validation Scores for " + model_name + ":", cv_scores)

    # Calculate precision, recall, and F1-score for the positive (fraud) class
    precision, recall, fscore, _ = precision_recall_fscore_support(y_test, predicted, average='binary')
    print("\nPrecision:", precision)
    print("Recall:", recall)
    print("F1-Score:", fscore)

    # Plot ROC curve and calculate AUC
    fpr, tpr, threshold = roc_curve(y_test, model.predict_proba(X_test)[:,1])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label = model_name + ' (AUC = %0.2f)' % roc_auc)

**evaluate_models Function**: This function creates a plot for the ROC curve and iterates through each model, calling the `train_and_evaluate_model` function to train, evaluate, and visualize the performance of each model.

In [6]:
# Create models and evaluate them
def evaluate_models(models, X_train, X_test, y_train, y_test):
    # Initialize a plot for ROC curve
    plt.figure(figsize=(10, 8))
    # Iterate through each model, train and evaluate it
    for model_name, model in models.items():
        train_and_evaluate_model(model, model_name, X_train, X_test, y_train, y_test)

    # Add the chance-level diagonal, label the axes, and display the graph
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
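
To make the ROC and AUC computation concrete, here is a minimal sketch of the same `roc_curve` and `auc` calls on four hand-written scores. The labels and scores are hypothetical and chosen so the result is easy to check by hand.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical true labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
print(roc_auc)
```

The AUC equals the fraction of (positive, negative) pairs in which the positive example gets the higher score. Here three of the four pairs are ordered correctly, which gives 0.75.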

**Main Code Execution**: The main part of the code loads and processes the dataset, adds the time-related features, prepares the train and test data, defines seven machine learning models, and evaluates them on the prepared data by calling the `evaluate_models` function.

In [None]:
# Main code execution
file_path = '/content/drive/MyDrive/Colab Notebooks/Neuronexus Innovations/Neuronexus Innovations - Machine Learning/Credit Card Fraud Detection/fraudTrain.csv'
df = load_and_process_data(file_path)
df = add_time_related_features(df)

target_column = 'is_fraud'
X_train, X_test, y_train, y_test = prepare_train_test_data(df, target_column)

# Define models and evaluate them using the prepared data
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=5),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'Decision Tree (Gini)': DecisionTreeClassifier(criterion='gini', random_state=5),
    'Decision Tree (Entropy)': DecisionTreeClassifier(criterion='entropy', random_state=5),
    'Support Vector Machine': SVC(probability=True, random_state=5)
}

evaluate_models(models, X_train, X_test, y_train, y_test)



Logistic Regression - Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00    257815
           1       0.00      0.00      0.00      1520

    accuracy                           0.99    259335
   macro avg       0.50      0.50      0.50    259335
weighted avg       0.99      0.99      0.99    259335

Logistic Regression - Confusion Matrix:
 [[257671    144]
 [  1520      0]]

Cross-Validation Scores for Logistic Regression: [0.99357973 0.99363757 0.99355563 0.99381109 0.99376771]

Precision: 0.0
Recall: 0.0
F1-Score: 0.0


Random Forest - Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    257815
           1       0.97      0.74      0.84      1520

    accuracy                           1.00    259335
   macro avg       0.99      0.87      0.92    259335
weighted avg       1.00      1.00      1.00    259335

Random Forest - Confusion Matri
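
The Logistic Regression figures printed above can be recomputed by hand from its confusion matrix. The sketch below copies the four counts from that output and assumes scikit-learn's layout, in which rows are true classes and columns are predicted classes.

```python
# Counts copied from the Logistic Regression confusion matrix above.
tn, fp = 257671, 144   # true class 0: predicted 0, predicted 1
fn, tp = 1520, 0       # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall = tp / (tp + fn)                            # share of fraud cases caught
precision = tp / (tp + fp) if (tp + fp) else 0.0   # guard against 0/0

print(f"total={total} accuracy={accuracy:.4f} recall={recall} precision={precision}")
```

The accuracy is high only because the non-fraud class dominates the test set. None of the fraud cases are detected, which is why precision, recall, and F1-score are all 0.0.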

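Overall, this code demonstrates a pipeline for training, evaluating, and visualizing several machine learning classifiers on a fraud detection task. The output above also shows why accuracy alone is misleading on this imbalanced dataset: Logistic Regression reaches 0.99 accuracy while detecting none of the 1,520 fraud cases in the test set, whereas Random Forest reaches 0.74 recall on the fraud class.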