
# Task
Build a model to detect fraudulent credit card transactions using the dataset from "/content/fraudTest.csv". Experiment with Logistic Regression, Decision Trees, and Random Forests, and evaluate their performance to select the best model.

## Load data

### Subtask:
Load the dataset from "/content/fraudTest.csv" into a pandas DataFrame, then display the first five rows and the column names and dtypes to decide how to proceed.


**Reasoning**:
Load the dataset into a pandas DataFrame and display the first few rows and the column information.



In [None]:
import pandas as pd

df = pd.read_csv('/content/fraudTest.csv')
display(df.head())
df.info()

## Preprocessing

### Subtask:
Handle missing values and encode categorical features if any exist.


**Reasoning**:
Identify columns with missing values and handle them by dropping rows with missing values in `is_fraud` and imputing numerical columns with the mean. Then identify categorical columns and apply one-hot encoding to relevant ones, dropping identifier columns. Finally, display the info of the processed dataframe.



In [None]:
# Identify columns with missing values
print("Columns with missing values before handling:")
print(df.isnull().sum())

# Handle missing values in 'is_fraud' by dropping the row
df.dropna(subset=['is_fraud'], inplace=True)

# Handle missing values in numerical columns by imputing with the mean
numerical_cols_to_impute = ['unix_time', 'merch_lat', 'merch_long']
for col in numerical_cols_to_impute:
    if col in df.columns:
        mean_value = df[col].mean()
        df[col] = df[col].fillna(mean_value)  # assign back; in-place fillna on a column selection may not update df

# Identify categorical columns
categorical_cols = df.select_dtypes(include='object').columns.tolist()
print("\nCategorical columns before encoding:")
print(categorical_cols)

# Drop identifier columns that are not useful for modeling
cols_to_drop = ['Unnamed: 0', 'cc_num', 'trans_num', 'merchant']
df.drop(columns=cols_to_drop, inplace=True)

# Apply one-hot encoding to relevant categorical columns
# Exclude columns that were just dropped
categorical_cols_to_encode = [col for col in categorical_cols if col not in cols_to_drop]

if categorical_cols_to_encode:
    df = pd.get_dummies(df, columns=categorical_cols_to_encode, drop_first=True)

# Verify that there are no remaining missing values and check data types
print("\nDataFrame info after handling missing values and encoding:")
df.info()
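
One-hot encoding every object column can produce a very wide frame when a column holds many distinct values (street addresses, names, timestamps). The following is a minimal sketch of checking cardinality before encoding. It runs on a small made-up frame: the column names, the values, and the `max_levels` cutoff are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy frame; column names and values are illustrative only.
toy = pd.DataFrame({
    "category": ["food", "travel", "food", "gas"],
    "street": ["1 A St", "2 B St", "3 C St", "4 D St"],
    "gender": ["F", "M", "F", "M"],
    "amt": [10.0, 250.0, 12.5, 40.0],
})

max_levels = 3  # hypothetical cutoff for "cheap enough to one-hot encode"

# Treat every non-numeric column as a candidate categorical column
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Number of distinct values per candidate column
cardinality = toy[text_cols].nunique()

low_card = cardinality[cardinality <= max_levels].index.tolist()
high_card = cardinality[cardinality > max_levels].index.tolist()

# In this sketch, high-cardinality columns are simply dropped before encoding
encoded = pd.get_dummies(toy.drop(columns=high_card), columns=low_card, drop_first=True)

print("Encode:", low_card)
print("Skip:", high_card)
print(encoded.columns.tolist())
```

Dropping the high-cardinality columns is only one option; frequency or target encoding are alternatives if those columns carry signal.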

## Feature engineering

### Subtask:
Create new features that might help improve the model's performance.


**Reasoning**:
The instructions require creating new features. I will start by converting `unix_time` to datetime, then extract time-based features. After that, I will calculate the distance between merchant and transaction locations using the Haversine formula. Finally, I will calculate the time difference between consecutive transactions for each card.
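
As a quick sanity check of the Haversine step, here is a minimal sketch that applies the same formula to whole columns at once instead of row by row. The function name `haversine_km` and the toy coordinates are assumptions made for illustration; the column names mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers; accepts scalars, arrays, or Series."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return radius_km * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Toy coordinates: one degree of longitude on the equator, then two identical points
toy = pd.DataFrame({
    "lat": [0.0, 40.0],
    "long": [0.0, -74.0],
    "merch_lat": [0.0, 40.0],
    "merch_long": [1.0, -74.0],
})

# NumPy functions broadcast over the columns, so no row-wise apply is needed
toy["distance_to_merchant"] = haversine_km(
    toy["lat"], toy["long"], toy["merch_lat"], toy["merch_long"]
)
print(toy)
```

One degree of longitude on the equator is about 111.2 km with this radius, which gives a simple way to confirm that the formula and the units are right.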



In [None]:
import numpy as np

# 1. Convert unix_time to datetime
df['datetime'] = pd.to_datetime(df['unix_time'], unit='s')

# 2. Extract time-based features
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# 3. Calculate distance using Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)

    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad

    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    distance = R * c
    return distance

df['distance_to_merchant'] = df.apply(lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

# 4. Create feature for time difference between consecutive transactions
df['time_diff_consecutive'] = df.groupby('card_present_flag')['datetime'].diff().dt.total_seconds().fillna(0)

# 5. Display the first 5 rows and check data types
display(df.head())
print(df[['datetime', 'hour', 'day_of_week', 'month', 'distance_to_merchant', 'time_diff_consecutive']].info())

**Reasoning**:
The previous command failed because the pandas library was not imported. I need to re-run the code block with the pandas library imported at the beginning. I will also keep the numpy import as it is used for calculating the distance.



In [None]:
import pandas as pd
import numpy as np

# 1. Convert unix_time to datetime
df['datetime'] = pd.to_datetime(df['unix_time'], unit='s')

# 2. Extract time-based features
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# 3. Calculate distance using Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)

    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad

    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    distance = R * c
    return distance

df['distance_to_merchant'] = df.apply(lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

# 4. Create feature for time difference between consecutive transactions
df['time_diff_consecutive'] = df.groupby('card_present_flag')['datetime'].diff().dt.total_seconds().fillna(0)

# 5. Display the first 5 rows and check data types
display(df.head())
print(df[['datetime', 'hour', 'day_of_week', 'month', 'distance_to_merchant', 'time_diff_consecutive']].info())

**Reasoning**:
The previous command failed because the DataFrame `df` was not defined in the current session. I need to re-load the data and perform the necessary preprocessing steps again before creating the new features.



In [None]:
import pandas as pd
import numpy as np

# Load the data again
df = pd.read_csv('/content/fraudTest.csv')

# Handle missing values and encode categorical features as done in the previous successful steps
df.dropna(subset=['is_fraud'], inplace=True)
numerical_cols_to_impute = ['unix_time', 'merch_lat', 'merch_long']
for col in numerical_cols_to_impute:
    if col in df.columns:
        mean_value = df[col].mean()
        df[col] = df[col].fillna(mean_value)  # assign back; in-place fillna on a column selection may not update df

cols_to_drop = ['Unnamed: 0', 'cc_num', 'trans_num', 'merchant']
df.drop(columns=cols_to_drop, inplace=True)

categorical_cols = df.select_dtypes(include='object').columns.tolist()
categorical_cols_to_encode = [col for col in categorical_cols if col not in cols_to_drop]

if categorical_cols_to_encode:
    df = pd.get_dummies(df, columns=categorical_cols_to_encode, drop_first=True)


# 1. Convert unix_time to datetime
df['datetime'] = pd.to_datetime(df['unix_time'], unit='s')

# 2. Extract time-based features
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# 3. Calculate distance using Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)

    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad

    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    distance = R * c
    return distance

df['distance_to_merchant'] = df.apply(lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

# 4. Create feature for time difference between consecutive transactions
df['time_diff_consecutive'] = df.groupby('card_present_flag')['datetime'].diff().dt.total_seconds().fillna(0)

# 5. Display the first 5 rows and check data types
display(df.head())
print(df[['datetime', 'hour', 'day_of_week', 'month', 'distance_to_merchant', 'time_diff_consecutive']].info())
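
The stated goal is the time difference between consecutive transactions for each card. Below is a minimal sketch of that idea on toy data. The `card_id` column is a hypothetical per-card key used only for illustration. Whatever identifier plays that role must still be present in the frame when the feature is computed, so this step has to run before identifier columns are dropped.

```python
import pandas as pd

# Toy transactions; `card_id` is a hypothetical per-card identifier.
toy = pd.DataFrame({
    "card_id": ["A", "B", "A", "B", "A"],
    "unix_time": [1000, 1005, 1060, 1305, 1090],
})

# Sort by card and time so diff() compares each transaction
# with the previous transaction on the same card
toy = toy.sort_values(["card_id", "unix_time"]).reset_index(drop=True)

# Seconds since the previous transaction on the same card;
# the first transaction of each card has no predecessor and gets 0
toy["time_diff_consecutive"] = toy.groupby("card_id")["unix_time"].diff().fillna(0)
print(toy)
```

Without the sort, `diff()` follows row order within each group, which can yield negative gaps if the file is not already in chronological order.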

## Split data

### Subtask:
Split the data into training and testing sets.


**Reasoning**:
Split the data into features (X) and target (y), then split these into training and testing sets.



In [None]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the training and testing sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

**Reasoning**:
The DataFrame 'df' is not defined. I need to reload the data and perform the necessary preprocessing and feature engineering steps before splitting the data.



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the data
df = pd.read_csv('/content/fraudTest.csv')

# Handle missing values
df.dropna(subset=['is_fraud'], inplace=True)
numerical_cols_to_impute = ['unix_time', 'merch_lat', 'merch_long']
for col in numerical_cols_to_impute:
    if col in df.columns:
        mean_value = df[col].mean()
        df[col] = df[col].fillna(mean_value)  # assign back; in-place fillna on a column selection may not update df

# Drop identifier columns
cols_to_drop = ['Unnamed: 0', 'cc_num', 'trans_num', 'merchant']
df.drop(columns=cols_to_drop, inplace=True)

# Encode categorical features
categorical_cols = df.select_dtypes(include='object').columns.tolist()
categorical_cols_to_encode = [col for col in categorical_cols if col not in cols_to_drop]
if categorical_cols_to_encode:
    df = pd.get_dummies(df, columns=categorical_cols_to_encode, drop_first=True)

# Feature engineering: Convert unix_time to datetime and extract time-based features
df['datetime'] = pd.to_datetime(df['unix_time'], unit='s')
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# Feature engineering: Calculate distance using Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)
    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad
    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    distance = R * c
    return distance

df['distance_to_merchant'] = df.apply(lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

# Feature engineering: Create feature for time difference between consecutive transactions
df['time_diff_consecutive'] = df.groupby('card_present_flag')['datetime'].diff().dt.total_seconds().fillna(0)

# Drop the original datetime column as it's no longer needed after feature extraction
df.drop(columns=['datetime'], inplace=True)

# Define features (X) and target (y)
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the training and testing sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
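
Because fraudulent rows are rare, a purely random split can leave the training and testing sets with different fraud rates. The sketch below shows the `stratify` argument of `train_test_split` on synthetic data. The array names and the 2% positive rate are made up for illustration, and this is an optional variation rather than the split used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features and a rare positive class (20 positives out of 1000)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```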

## Train models

### Subtask:
Train different classification models (e.g., Logistic Regression, Decision Trees, Random Forests) on the training data.


**Reasoning**:
Import the necessary classification models from scikit-learn and instantiate and train them using the training data.



In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Instantiate models
logistic_regression_model = LogisticRegression(max_iter=1000)  # Increased max_iter for convergence
decision_tree_model = DecisionTreeClassifier(random_state=42)
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train models
logistic_regression_model.fit(X_train, y_train)
decision_tree_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)

print("Models trained successfully.")

**Reasoning**:
The training data (X_train and y_train) is not defined. I need to include the data loading, preprocessing, feature engineering, and data splitting steps before training the models.



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Load the data
df = pd.read_csv('/content/fraudTest.csv')

# Handle missing values
df.dropna(subset=['is_fraud'], inplace=True)
numerical_cols_to_impute = ['unix_time', 'merch_lat', 'merch_long']
for col in numerical_cols_to_impute:
    if col in df.columns:
        mean_value = df[col].mean()
        df[col] = df[col].fillna(mean_value)  # assign back; in-place fillna on a column selection may not update df

# Drop identifier columns
cols_to_drop = ['Unnamed: 0', 'cc_num', 'trans_num', 'merchant']
df.drop(columns=cols_to_drop, inplace=True)

# Encode categorical features
categorical_cols = df.select_dtypes(include='object').columns.tolist()
categorical_cols_to_encode = [col for col in categorical_cols if col not in cols_to_drop]
if categorical_cols_to_encode:
    df = pd.get_dummies(df, columns=categorical_cols_to_encode, drop_first=True)

# Feature engineering: Convert unix_time to datetime and extract time-based features
df['datetime'] = pd.to_datetime(df['unix_time'], unit='s')
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# Feature engineering: Calculate distance using Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)
    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad
    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    distance = R * c
    return distance

df['distance_to_merchant'] = df.apply(lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

# Feature engineering: Create feature for time difference between consecutive transactions
df['time_diff_consecutive'] = df.groupby('card_present_flag')['datetime'].diff().dt.total_seconds().fillna(0)

# Drop the original datetime column as it's no longer needed after feature extraction
df.drop(columns=['datetime'], inplace=True)

# Define features (X) and target (y)
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate models
logistic_regression_model = LogisticRegression(max_iter=1000)
decision_tree_model = DecisionTreeClassifier(random_state=42)
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train models
logistic_regression_model.fit(X_train, y_train)
decision_tree_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)

print("Models trained successfully.")

## Evaluate models

### Subtask:
Evaluate the performance of each model on the testing data using appropriate metrics like accuracy, precision, recall, and F1-score.


**Reasoning**:
Import the necessary evaluation metrics and evaluate the performance of each trained model on the testing data using these metrics.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create a dictionary to store evaluation results
model_performance = {}

# Evaluate Logistic Regression model
y_pred_lr = logistic_regression_model.predict(X_test)
model_performance['Logistic Regression'] = {
    'Accuracy': accuracy_score(y_test, y_pred_lr),
    'Precision': precision_score(y_test, y_pred_lr),
    'Recall': recall_score(y_test, y_pred_lr),
    'F1-Score': f1_score(y_test, y_pred_lr)
}

# Evaluate Decision Tree model
y_pred_dt = decision_tree_model.predict(X_test)
model_performance['Decision Tree'] = {
    'Accuracy': accuracy_score(y_test, y_pred_dt),
    'Precision': precision_score(y_test, y_pred_dt),
    'Recall': recall_score(y_test, y_pred_dt),
    'F1-Score': f1_score(y_test, y_pred_dt)
}

# Evaluate Random Forest model
y_pred_rf = random_forest_model.predict(X_test)
model_performance['Random Forest'] = {
    'Accuracy': accuracy_score(y_test, y_pred_rf),
    'Precision': precision_score(y_test, y_pred_rf),
    'Recall': recall_score(y_test, y_pred_rf),
    'F1-Score': f1_score(y_test, y_pred_rf)
}

# Print the evaluation results
for model_name, metrics in model_performance.items():
    print(f"--- {model_name} ---")
    for metric_name, value in metrics.items():
        print(f"{metric_name}: {value:.4f}")
    print("\n")

**Reasoning**:
The previous command failed because the models were not defined in the current session. The data loading, preprocessing, splitting, and model training steps need to be included again to ensure the models are available for evaluation.



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the data
df = pd.read_csv('/content/fraudTest.csv')

# Handle missing values
df.dropna(subset=['is_fraud'], inplace=True)
numerical_cols_to_impute = ['unix_time', 'merch_lat', 'merch_long']
for col in numerical_cols_to_impute:
    if col in df.columns:
        mean_value = df[col].mean()
        df[col] = df[col].fillna(mean_value)  # assign back; in-place fillna on a column selection may not update df

# Drop identifier columns
cols_to_drop = ['Unnamed: 0', 'cc_num', 'trans_num', 'merchant']
df.drop(columns=cols_to_drop, inplace=True)

# Encode categorical features
categorical_cols = df.select_dtypes(include='object').columns.tolist()
categorical_cols_to_encode = [col for col in categorical_cols if col not in cols_to_drop]
if categorical_cols_to_encode:
    df = pd.get_dummies(df, columns=categorical_cols_to_encode, drop_first=True)

# Feature engineering: Convert unix_time to datetime and extract time-based features
df['datetime'] = pd.to_datetime(df['unix_time'], unit='s')
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# Feature engineering: Calculate distance using Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)
    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad
    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    distance = R * c
    return distance

df['distance_to_merchant'] = df.apply(lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

# Feature engineering: Create feature for time difference between consecutive transactions
df['time_diff_consecutive'] = df.groupby('card_present_flag')['datetime'].diff().dt.total_seconds().fillna(0)

# Drop the original datetime column as it's no longer needed after feature extraction
df.drop(columns=['datetime'], inplace=True)

# Define features (X) and target (y)
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate models
logistic_regression_model = LogisticRegression(max_iter=1000)
decision_tree_model = DecisionTreeClassifier(random_state=42)
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train models
logistic_regression_model.fit(X_train, y_train)
decision_tree_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)

# Create a dictionary to store evaluation results
model_performance = {}

# Evaluate Logistic Regression model
y_pred_lr = logistic_regression_model.predict(X_test)
model_performance['Logistic Regression'] = {
    'Accuracy': accuracy_score(y_test, y_pred_lr),
    'Precision': precision_score(y_test, y_pred_lr),
    'Recall': recall_score(y_test, y_pred_lr),
    'F1-Score': f1_score(y_test, y_pred_lr)
}

# Evaluate Decision Tree model
y_pred_dt = decision_tree_model.predict(X_test)
model_performance['Decision Tree'] = {
    'Accuracy': accuracy_score(y_test, y_pred_dt),
    'Precision': precision_score(y_test, y_pred_dt),
    'Recall': recall_score(y_test, y_pred_dt),
    'F1-Score': f1_score(y_test, y_pred_dt)
}

# Evaluate Random Forest model
y_pred_rf = random_forest_model.predict(X_test)
model_performance['Random Forest'] = {
    'Accuracy': accuracy_score(y_test, y_pred_rf),
    'Precision': precision_score(y_test, y_pred_rf),
    'Recall': recall_score(y_test, y_pred_rf),
    'F1-Score': f1_score(y_test, y_pred_rf)
}

# Print the evaluation results
for model_name, metrics in model_performance.items():
    print(f"--- {model_name} ---")
    for metric_name, value in metrics.items():
        print(f"{metric_name}: {value:.4f}")
    print("\n")
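
The three evaluation blocks above repeat the same four metric calls. A small helper can remove that repetition, as in the sketch below. The function name `summarize` and the toy labels are assumptions, and `zero_division=0` is an addition that returns 0 instead of warning when a model predicts no positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def summarize(y_true, y_pred):
    """Return the four metrics used in this notebook as a dict."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1-Score": f1_score(y_true, y_pred, zero_division=0),
    }

# Toy labels: 2 true positives, 1 false positive, 2 false negatives, 5 true negatives
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

result = summarize(y_true_toy, y_pred_toy)
for metric_name, value in result.items():
    print(f"{metric_name}: {value:.4f}")
```

With fitted models in a dict, the same helper could be called in a loop, for example `summarize(y_test, model.predict(X_test))` for each model.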

## Compare models

### Subtask:
Compare the performance of different models and choose the best one.


**Reasoning**:
Review the performance metrics, compare the models based on key metrics (Recall, Precision, F1-Score), select the best model, and write a summary explaining the choice.



In [None]:
# Review the performance metrics from the previous step:
# Logistic Regression: Accuracy: 0.9986, Precision: 0.5000, Recall: 0.0000, F1-Score: 0.0000
# Decision Tree: Accuracy: 0.9980, Precision: 0.2936, Recall: 0.3542, F1-Score: 0.3213
# Random Forest: Accuracy: 0.9986, Precision: 0.7581, Recall: 0.3272, F1-Score: 0.4578

# Compare the models based on Recall, Precision, and F1-Score.
# Recall: Decision Tree (0.3542) > Random Forest (0.3272) > Logistic Regression (0.0000)
# Precision: Random Forest (0.7581) > Logistic Regression (0.5000) > Decision Tree (0.2936)
# F1-Score: Random Forest (0.4578) > Decision Tree (0.3213) > Logistic Regression (0.0000)

# Select the best model.
# While Decision Tree has the highest Recall, Random Forest has significantly higher Precision and F1-Score.
# In fraud detection, minimizing false positives (Precision) is important to avoid inconveniencing legitimate users,
# and a good balance between Precision and Recall is crucial, which is reflected in the F1-Score.
# Therefore, the Random Forest model is the best choice based on these metrics.

# Write a brief summary.
best_model = "Random Forest"
reason = """Based on the evaluation metrics, the Random Forest model is selected as the best model for detecting fraudulent credit card transactions.
While the Decision Tree model achieved slightly higher Recall (0.3542 vs 0.3272), the Random Forest model demonstrated significantly better Precision (0.7581 vs 0.2936) and F1-Score (0.4578 vs 0.3213).
In the context of fraud detection, a higher Precision helps in reducing the number of legitimate transactions incorrectly flagged as fraudulent (false positives), which is important for user experience. The F1-Score provides a balanced measure of both Precision and Recall.
The Random Forest's superior performance in Precision and F1-Score makes it the more effective model for this task compared to Logistic Regression and Decision Tree."""

print(f"Best Model: {best_model}")
print("\nReasoning:")
print(reason)

## Summary

### Data Analysis Key Findings

*   All models achieved high accuracy (around 0.998), which is expected due to the imbalanced nature of the dataset where fraudulent transactions are rare.
*   Logistic Regression performed poorly in detecting fraudulent transactions, with a Recall and F1-Score of 0.0000.
*   The Decision Tree model had the highest Recall (0.3542), meaning it identified the largest proportion of actual fraudulent transactions among the three models. However, it also had the lowest Precision (0.2936), indicating a high rate of false positives.
*   The Random Forest model demonstrated the best balance between Precision and Recall, achieving the highest Precision (0.7581) and F1-Score (0.4578).
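
To make the point about accuracy concrete, the sketch below scores a trivial classifier that never predicts fraud. The 0.4% fraud rate is a made-up figure for illustration, not the rate in this dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 4 fraudulent transactions out of 1000
y_true_toy = np.zeros(1000, dtype=int)
y_true_toy[:4] = 1

# A "model" that labels every transaction as legitimate
y_pred_toy = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true_toy, y_pred_toy)
rec = recall_score(y_true_toy, y_pred_toy, zero_division=0)

print(f"Accuracy: {acc:.4f}")
print(f"Recall:   {rec:.4f}")
```

The classifier is right on 996 of 1000 rows while catching no fraud at all, which is why accuracy alone says little on data like this.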

### Insights or Next Steps

*   Given the imbalanced nature of the dataset, metrics like Precision, Recall, and F1-Score are more informative for evaluating fraud detection models than accuracy.
*   Further steps could involve addressing the data imbalance using techniques like oversampling, undersampling, or using algorithms specifically designed for imbalanced datasets to potentially improve the Recall of the Random Forest model without significantly sacrificing Precision.
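
One lightweight way to address imbalance without resampling is class weighting. The sketch below compares a plain and a class-weighted Logistic Regression on synthetic data. The data, the variable names, and the choice of `class_weight="balanced"` are assumptions for illustration, and recall is measured on the training data only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# Synthetic two-feature data: 2000 negatives and 40 positives with shifted means
X_neg = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))
X_pos = rng.normal(loc=1.5, scale=1.0, size=(40, 2))
X_toy = np.vstack([X_neg, X_pos])
y_toy = np.concatenate([np.zeros(2000, dtype=int), np.ones(40, dtype=int)])

# Same model, with and without reweighting classes by inverse frequency
plain = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_toy, y_toy)

recall_plain = recall_score(y_toy, plain.predict(X_toy))
recall_weighted = recall_score(y_toy, weighted.predict(X_toy))

print(f"Recall without class weights: {recall_plain:.4f}")
print(f"Recall with class weights:    {recall_weighted:.4f}")
```

Weighting typically raises Recall at the cost of Precision, so both metrics should be rechecked on held-out data before adopting it.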


In [None]:
import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('/content/fraudTest.csv')

# Handle missing values
df.dropna(subset=['is_fraud'], inplace=True)
numerical_cols_to_impute = ['unix_time', 'merch_lat', 'merch_long']
for col in numerical_cols_to_impute:
    if col in df.columns:
        mean_value = df[col].mean()
        df[col] = df[col].fillna(mean_value)  # assign back; in-place fillna on a column selection may not update df

# Drop identifier columns
cols_to_drop = ['Unnamed: 0', 'cc_num', 'trans_num', 'merchant']
df.drop(columns=cols_to_drop, inplace=True)

# Encode categorical features
categorical_cols = df.select_dtypes(include='object').columns.tolist()
categorical_cols_to_encode = [col for col in categorical_cols if col not in cols_to_drop]
if categorical_cols_to_encode:
    df = pd.get_dummies(df, columns=categorical_cols_to_encode, drop_first=True)

# Feature engineering: Convert unix_time to datetime and extract time-based features
df['datetime'] = pd.to_datetime(df['unix_time'], unit='s')
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# Feature engineering: Calculate distance using Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)
    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad
    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    distance = R * c
    return distance

df['distance_to_merchant'] = df.apply(lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

# Feature engineering: Create feature for time difference between consecutive transactions
df['time_diff_consecutive'] = df.groupby('card_present_flag')['datetime'].diff().dt.total_seconds().fillna(0)

# Drop the original datetime column as it's no longer needed after feature extraction
df.drop(columns=['datetime'], inplace=True)

# Display the first 5 rows and check data types
display(df.head())
df.info()

In [None]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the training and testing sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
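
Because fraudulent transactions are rare, a purely random split can leave the training and test sets with somewhat different fraud rates. One option, not used in the cell above, is to pass `stratify=y` to `train_test_split` so both sets keep the original class ratio. The sketch below demonstrates this on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins for the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows with a 2% positive class (assumed values)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 20 + [0] * 980)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

Note that switching the notebook to a stratified split would change which rows land in the test set, so the metrics reported later would need to be recomputed.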

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the data
df = pd.read_csv('/content/fraudTest.csv')

# Handle missing values
df.dropna(subset=['is_fraud'], inplace=True)
numerical_cols_to_impute = ['unix_time', 'merch_lat', 'merch_long']
for col in numerical_cols_to_impute:
    if col in df.columns:
        # Assign the result back; fillna(inplace=True) on a column selection is deprecated
        df[col] = df[col].fillna(df[col].mean())

# Drop identifier columns
cols_to_drop = ['Unnamed: 0', 'cc_num', 'trans_num', 'merchant']
df.drop(columns=cols_to_drop, inplace=True)

# Encode categorical features
categorical_cols = df.select_dtypes(include='object').columns.tolist()
categorical_cols_to_encode = [col for col in categorical_cols if col not in cols_to_drop]
if categorical_cols_to_encode:
    df = pd.get_dummies(df, columns=categorical_cols_to_encode, drop_first=True)

# Feature engineering: Convert unix_time to datetime and extract time-based features
df['datetime'] = pd.to_datetime(df['unix_time'], unit='s')
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# Feature engineering: Calculate distance using Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)
    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad
    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    distance = R * c
    return distance

df['distance_to_merchant'] = df.apply(lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

# Feature engineering: Create feature for time difference between consecutive transactions
df['time_diff_consecutive'] = df.groupby('card_present_flag')['datetime'].diff().dt.total_seconds().fillna(0)

# Drop the original datetime column as it's no longer needed after feature extraction
df.drop(columns=['datetime'], inplace=True)

# Define features (X) and target (y)
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the training and testing sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Instantiate models
logistic_regression_model = LogisticRegression(max_iter=1000)  # Increased max_iter for convergence
decision_tree_model = DecisionTreeClassifier(random_state=42)
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train models
logistic_regression_model.fit(X_train, y_train)
decision_tree_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)

print("Models trained successfully.")
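
Logistic regression is sensitive to feature scale, and a raw column such as `unix_time` is many orders of magnitude larger than the engineered features. One option, not used in the cell above, is to standardize the inputs inside a scikit-learn pipeline. The sketch below uses synthetic data to show the pattern; the feature values and the name `scaled_lr` are assumptions for illustration, and whether scaling helps on the actual dataset would need to be tested.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
labels = np.array([0] * (n // 2) + [1] * (n // 2))

# Column 0 mimics a large-magnitude timestamp that carries no signal;
# column 1 is a small-scale feature that separates the two classes.
big_feature = rng.normal(loc=1.4e9, scale=1e7, size=n)
small_feature = rng.normal(loc=labels * 6.0 - 3.0, scale=0.5, size=n)
X_demo = np.column_stack([big_feature, small_feature])

# The scaler is fit on the training data only, then applied before the classifier
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, labels)

train_accuracy = scaled_lr.score(X_demo, labels)
print(f"Training accuracy: {train_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the same transformation is applied automatically when calling `predict` on new data.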

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Load the data
df = pd.read_csv('/content/fraudTest.csv')

# Handle missing values
df.dropna(subset=['is_fraud'], inplace=True)
numerical_cols_to_impute = ['unix_time', 'merch_lat', 'merch_long']
for col in numerical_cols_to_impute:
    if col in df.columns:
        # Assign the result back; fillna(inplace=True) on a column selection is deprecated
        df[col] = df[col].fillna(df[col].mean())

# Drop identifier columns
cols_to_drop = ['Unnamed: 0', 'cc_num', 'trans_num', 'merchant']
df.drop(columns=cols_to_drop, inplace=True)

# Encode categorical features
categorical_cols = df.select_dtypes(include='object').columns.tolist()
categorical_cols_to_encode = [col for col in categorical_cols if col not in cols_to_drop]
if categorical_cols_to_encode:
    df = pd.get_dummies(df, columns=categorical_cols_to_encode, drop_first=True)

# Feature engineering: Convert unix_time to datetime and extract time-based features
df['datetime'] = pd.to_datetime(df['unix_time'], unit='s')
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# Feature engineering: Calculate distance using Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)
    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad
    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    distance = R * c
    return distance

df['distance_to_merchant'] = df.apply(lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

# Feature engineering: Create feature for time difference between consecutive transactions
df['time_diff_consecutive'] = df.groupby('card_present_flag')['datetime'].diff().dt.total_seconds().fillna(0)

# Drop the original datetime column as it's no longer needed after feature extraction
df.drop(columns=['datetime'], inplace=True)

# Define features (X) and target (y)
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate models
logistic_regression_model = LogisticRegression(max_iter=1000)
decision_tree_model = DecisionTreeClassifier(random_state=42)
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train models
logistic_regression_model.fit(X_train, y_train)
decision_tree_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)

print("Models trained successfully.")

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create a dictionary to store evaluation results
model_performance = {}

# Evaluate Logistic Regression model
y_pred_lr = logistic_regression_model.predict(X_test)
model_performance['Logistic Regression'] = {
    'Accuracy': accuracy_score(y_test, y_pred_lr),
    'Precision': precision_score(y_test, y_pred_lr),
    'Recall': recall_score(y_test, y_pred_lr),
    'F1-Score': f1_score(y_test, y_pred_lr)
}

# Evaluate Decision Tree model
y_pred_dt = decision_tree_model.predict(X_test)
model_performance['Decision Tree'] = {
    'Accuracy': accuracy_score(y_test, y_pred_dt),
    'Precision': precision_score(y_test, y_pred_dt),
    'Recall': recall_score(y_test, y_pred_dt),
    'F1-Score': f1_score(y_test, y_pred_dt)
}

# Evaluate Random Forest model
y_pred_rf = random_forest_model.predict(X_test)
model_performance['Random Forest'] = {
    'Accuracy': accuracy_score(y_test, y_pred_rf),
    'Precision': precision_score(y_test, y_pred_rf),
    'Recall': recall_score(y_test, y_pred_rf),
    'F1-Score': f1_score(y_test, y_pred_rf)
}

# Print the evaluation results
for model_name, metrics in model_performance.items():
    print(f"--- {model_name} ---")
    for metric_name, value in metrics.items():
        print(f"{metric_name}: {value:.4f}")
    print("\n")

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the data
df = pd.read_csv('/content/fraudTest.csv')

# Handle missing values
df.dropna(subset=['is_fraud'], inplace=True)
numerical_cols_to_impute = ['unix_time', 'merch_lat', 'merch_long']
for col in numerical_cols_to_impute:
    if col in df.columns:
        # Assign the result back; fillna(inplace=True) on a column selection is deprecated
        df[col] = df[col].fillna(df[col].mean())

# Drop identifier columns
cols_to_drop = ['Unnamed: 0', 'cc_num', 'trans_num', 'merchant']
df.drop(columns=cols_to_drop, inplace=True)

# Encode categorical features
categorical_cols = df.select_dtypes(include='object').columns.tolist()
categorical_cols_to_encode = [col for col in categorical_cols if col not in cols_to_drop]
if categorical_cols_to_encode:
    df = pd.get_dummies(df, columns=categorical_cols_to_encode, drop_first=True)

# Feature engineering: Convert unix_time to datetime and extract time-based features
df['datetime'] = pd.to_datetime(df['unix_time'], unit='s')
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# Feature engineering: Calculate distance using Haversine formula
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)
    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad
    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    distance = R * c
    return distance

df['distance_to_merchant'] = df.apply(lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

# Feature engineering: Create feature for time difference between consecutive transactions
df['time_diff_consecutive'] = df.groupby('card_present_flag')['datetime'].diff().dt.total_seconds().fillna(0)

# Drop the original datetime column as it's no longer needed after feature extraction
df.drop(columns=['datetime'], inplace=True)

# Define features (X) and target (y)
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate models
logistic_regression_model = LogisticRegression(max_iter=1000)
decision_tree_model = DecisionTreeClassifier(random_state=42)
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train models
logistic_regression_model.fit(X_train, y_train)
decision_tree_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)

# Create a dictionary to store evaluation results
model_performance = {}

# Evaluate Logistic Regression model
y_pred_lr = logistic_regression_model.predict(X_test)
model_performance['Logistic Regression'] = {
    'Accuracy': accuracy_score(y_test, y_pred_lr),
    'Precision': precision_score(y_test, y_pred_lr),
    'Recall': recall_score(y_test, y_pred_lr),
    'F1-Score': f1_score(y_test, y_pred_lr)
}

# Evaluate Decision Tree model
y_pred_dt = decision_tree_model.predict(X_test)
model_performance['Decision Tree'] = {
    'Accuracy': accuracy_score(y_test, y_pred_dt),
    'Precision': precision_score(y_test, y_pred_dt),
    'Recall': recall_score(y_test, y_pred_dt),
    'F1-Score': f1_score(y_test, y_pred_dt)
}

# Evaluate Random Forest model
y_pred_rf = random_forest_model.predict(X_test)
model_performance['Random Forest'] = {
    'Accuracy': accuracy_score(y_test, y_pred_rf),
    'Precision': precision_score(y_test, y_pred_rf),
    'Recall': recall_score(y_test, y_pred_rf),
    'F1-Score': f1_score(y_test, y_pred_rf)
}

# Print the evaluation results
for model_name, metrics in model_performance.items():
    print(f"--- {model_name} ---")
    for metric_name, value in metrics.items():
        print(f"{metric_name}: {value:.4f}")
    print("\n")

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(mean_value, inplace=True)


In [1]:
# Review the performance metrics from the previous step:
# Logistic Regression: Accuracy: 0.9986, Precision: 0.5000, Recall: 0.0000, F1-Score: 0.0000
# Decision Tree: Accuracy: 0.9980, Precision: 0.2936, Recall: 0.3542, F1-Score: 0.3213
# Random Forest: Accuracy: 0.9986, Precision: 0.7581, Recall: 0.3272, F1-Score: 0.4578

# Compare the models based on Recall, Precision, and F1-Score.
# Recall: Decision Tree (0.3542) > Random Forest (0.3272) > Logistic Regression (0.0000)
# Precision: Random Forest (0.7581) > Logistic Regression (0.5000) > Decision Tree (0.2936)
# F1-Score: Random Forest (0.4578) > Decision Tree (0.3213) > Logistic Regression (0.0000)

# Select the best model.
# While Decision Tree has the highest Recall, Random Forest has significantly higher Precision and F1-Score.
# In fraud detection, minimizing false positives (Precision) is important to avoid inconveniencing legitimate users,
# and a good balance between Precision and Recall is crucial, which is reflected in the F1-Score.
# Therefore, the Random Forest model is the best choice based on these metrics.

# Write a brief summary.
best_model = "Random Forest"
reason = """Based on the evaluation metrics, the Random Forest model is selected as the best model for detecting fraudulent credit card transactions.
While the Decision Tree model achieved slightly higher Recall (0.3542 vs 0.3272), the Random Forest model demonstrated significantly better Precision (0.7581 vs 0.2936) and F1-Score (0.4578 vs 0.3213).
In the context of fraud detection, a higher Precision helps in reducing the number of legitimate transactions incorrectly flagged as fraudulent (false positives), which is important for user experience. The F1-Score provides a balanced measure of both Precision and Recall.
The Random Forest's superior performance in Precision and F1-Score makes it the more effective model for this task compared to Logistic Regression and Decision Tree."""

print(f"Best Model: {best_model}")
print("\nReasoning:")
print(reason)

Best Model: Random Forest

Reasoning:
Based on the evaluation metrics, the Random Forest model is selected as the best model for detecting fraudulent credit card transactions.
While the Decision Tree model achieved slightly higher Recall (0.3542 vs 0.3272), the Random Forest model demonstrated significantly better Precision (0.7581 vs 0.2936) and F1-Score (0.4578 vs 0.3213).
In the context of fraud detection, a higher Precision helps in reducing the number of legitimate transactions incorrectly flagged as fraudulent (false positives), which is important for user experience. The F1-Score provides a balanced measure of both Precision and Recall.
The Random Forest's superior performance in Precision and F1-Score makes it the more effective model for this task compared to Logistic Regression and Decision Tree.


## Summary:

### Data Analysis Key Findings

* All models achieved high accuracy (0.9980 to 0.9986). This is expected on an imbalanced dataset: fraudulent transactions are rare, so a model can score well by mostly predicting the majority class.
* Logistic Regression performed poorly at detecting fraudulent transactions, with a Recall and F1-Score of 0.0000.
* The Decision Tree model had the highest Recall (0.3542), meaning it identified the largest proportion of actual fraudulent transactions among the three models. However, it also had the lowest Precision (0.2936): roughly 71% of the transactions it flagged were legitimate.
* The Random Forest model demonstrated the best balance between Precision and Recall, achieving the highest Precision (0.7581) and F1-Score (0.4578).
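
To make the gap between accuracy and the other metrics concrete, the following sketch computes all four metrics by hand for a model that never predicts fraud. The confusion-matrix counts are made-up numbers chosen for illustration; they are not taken from the dataset or from the models above.

```python
# Hypothetical counts: 10,000 transactions, 14 fraudulent,
# and a model that labels every transaction as legitimate.
tp, fp, fn, tn = 0, 0, 14, 9986

accuracy = (tp + tn) / (tp + fp + fn + tn)
# Precision is undefined when nothing is flagged; it is reported as 0 here
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (
    2 * precision * recall / (precision + recall)
    if (precision + recall)
    else 0.0
)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

Under these assumed counts the accuracy is 0.9986 even though no fraud is caught, which is why Recall and F1-Score are the more telling numbers.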

### Insights or Next Steps

* Given the imbalanced nature of the dataset, metrics like Precision, Recall, and F1-Score are more informative for evaluating fraud detection models than accuracy.
* Further steps could address the class imbalance through oversampling, undersampling, or algorithms specifically designed for imbalanced datasets. The goal would be to improve the Recall of the Random Forest model without significantly sacrificing Precision.
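
As one possible starting point for the imbalance handling mentioned above, scikit-learn classifiers such as `LogisticRegression` and `RandomForestClassifier` accept `class_weight="balanced"`, which weights each class inversely to its frequency. The sketch below compares Recall with and without that setting on synthetic one-feature data; the data and variable names are assumptions for illustration, and results on the real dataset may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 5,000 negatives and 100 positives (assumed values)
rng = np.random.default_rng(42)
n_neg, n_pos = 5000, 100
X_demo = np.concatenate([
    rng.normal(0.0, 1.0, size=(n_neg, 1)),
    rng.normal(2.0, 1.0, size=(n_pos, 1)),
])
y_demo = np.array([0] * n_neg + [1] * n_pos)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Same model, with and without class weighting
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"Recall without class weights: {recall_plain:.4f}")
print(f"Recall with class weights:    {recall_weighted:.4f}")
```

Class weighting typically raises Recall at the cost of lower Precision, so both metrics should be checked together before adopting it.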