# Bank Transaction Dataset for Fraud Detection

## Introduction
This dataset contains **2,512 bank transactions** with **16 features** related to transaction details, customer information, and device/location data. The goal is to analyze the dataset to detect fraudulent transactions and build a predictive model for fraud detection.

### Dataset Overview:
- **Rows**: 2,512 transactions
- **Columns**: 16 features (described below)

## Dataset Columns Description

Below is a detailed description of each column in the dataset:

1. **TransactionID**: Unique alphanumeric identifier for each transaction.
2. **AccountID**: Unique identifier for each account, with multiple transactions per account.
3. **TransactionAmount**: Monetary value of each transaction, ranging from small everyday expenses to larger purchases.
4. **TransactionDate**: Timestamp of each transaction, capturing date and time.
5. **TransactionType**: Categorical field indicating 'Credit' or 'Debit' transactions.
6. **Location**: Geographic location of the transaction, represented by U.S. city names.
7. **DeviceID**: Alphanumeric identifier for devices used to perform the transaction.
8. **IP Address**: IPv4 address associated with the transaction, with occasional changes for some accounts.
9. **MerchantID**: Unique identifier for merchants, showing preferred and outlier merchants for each account.
10. **Channel**: Channel through which the transaction was performed (e.g., Online, ATM, Branch).
11. **CustomerAge**: Age of the account holder, with logical groupings based on occupation.
12. **CustomerOccupation**: Occupation of the account holder (e.g., Doctor, Engineer, Student, Retired), reflecting income patterns.
13. **TransactionDuration**: Duration of the transaction in seconds, varying by transaction type.
14. **LoginAttempts**: Number of login attempts before the transaction, with higher values indicating potential anomalies.
15. **AccountBalance**: Balance in the account post-transaction, with logical correlations based on transaction type and amount.
16. **PreviousTransactionDate**: Timestamp of the last transaction for the account, aiding in calculating transaction frequency.

## Load the dataset

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('../data/bank_transactions.csv')

df.head()

## Basic Dataset Information

Let’s explore the dataset's structure and summary statistics to understand its composition.

In [None]:
# Display dataset information
df.info()

### Observations from `df.info()`:
- The dataset has **2,512 rows** and **16 columns**.
- There are **no missing values** in any column (all columns have 2,512 non-null entries).
- Data types include:
  - **Float64**: `TransactionAmount`, `AccountBalance`
  - **Int64**: `CustomerAge`, `TransactionDuration`, `LoginAttempts`
  - **Object**: All other columns (e.g., `TransactionID`, `AccountID`, `TransactionDate`, etc.)
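
The dtype observations above can be turned into an automated check, so that a changed CSV export is caught early. The following is a minimal sketch: `check_schema` and `EXPECTED_DTYPES` are hypothetical names introduced for illustration, and the toy frame stands in for the real `df`.

```python
import pandas as pd

# Expected dtypes for the numerical columns, taken from the df.info() observations.
EXPECTED_DTYPES = {
    "TransactionAmount": "float64",
    "AccountBalance": "float64",
    "CustomerAge": "int64",
    "TransactionDuration": "int64",
    "LoginAttempts": "int64",
}

def check_schema(frame, expected):
    """Return {column: actual dtype, or None if missing} for every mismatch."""
    problems = {}
    for col, dtype in expected.items():
        if col not in frame.columns:
            problems[col] = None
        elif str(frame[col].dtype) != dtype:
            problems[col] = str(frame[col].dtype)
    return problems

# Toy data (hypothetical values) used in place of the real CSV.
toy = pd.DataFrame({
    "TransactionAmount": [14.09, 376.24],
    "AccountBalance": [5112.21, 13758.91],
    "CustomerAge": [70, 68],
    "TransactionDuration": [81, 141],
    "LoginAttempts": [1, 1],
})

schema_problems = check_schema(toy, EXPECTED_DTYPES)
print(schema_problems)  # an empty dict means every column matched
```

In the notebook, the same helper could be called as `check_schema(df, EXPECTED_DTYPES)` right after loading the data.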

In [None]:
# Display summary statistics for numerical columns
df.describe()

### Observations from `df.describe()`:

1. **TransactionAmount**:
   - Average: **\$297.59**, with a wide range from **\$0.26** to **\$1,919.11**.
   - 75% of transactions are below **\$414.53**.

2. **CustomerAge**:
   - Average age: **44.67 years**, ranging from **18** to **80**.
   - Median age: **45 years**.

3. **TransactionDuration**:
   - Average duration: **119.64 seconds**, ranging from **10** to **300 seconds**.
   - 75% of transactions complete in under **161 seconds**.

4. **LoginAttempts**:
   - Average: **1.12 attempts**, with most transactions (75%) involving only **1 attempt**.
   - Maximum attempts: **5**.

5. **AccountBalance**:
   - Average balance: **\$5,114.30**, ranging from **\$101.25** to **\$14,977.99**.
   - Median balance: **\$4,735.51**.


## Data Cleaning

In [None]:
df.isnull().sum()


In [None]:
df.duplicated().sum()


## Data Conversion

In [None]:
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'])
df['PreviousTransactionDate'] = pd.to_datetime(df['PreviousTransactionDate'])

df[['TransactionDate', 'PreviousTransactionDate']].info()
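
By default, `pd.to_datetime` raises an error on strings it cannot parse. If the goal is to count bad timestamps rather than stop at the first one, `errors="coerce"` converts them to `NaT`. A small sketch on made-up strings (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Two well-formed timestamps and one deliberately broken value.
raw = pd.Series(["2023-04-11 16:29:14", "2023-06-27 16:44:19", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

n_bad = int(parsed.isna().sum())
print(parsed.dtype)
print(f"Unparseable timestamps: {n_bad}")
```

Since the dataset has no missing values before conversion, any `NaT` that appears afterwards would point to a formatting problem in the source column.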


## Outliers Check

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set a global plot style: dark background with light gray grid, edges, and labels
sns.set_theme(style="whitegrid", palette="muted", rc={
    "axes.facecolor": "#284B63",  # Dark background for the axes
    "figure.facecolor": "#284B63",  # Dark background for the whole figure
    "grid.color": "#D9D9D9",  # Light gray grid lines
    "axes.grid": True,  # Show the grid
    "axes.edgecolor": "#D9D9D9",  # Light gray axes borders
    "xtick.color": "#D9D9D9",  # Light gray x-axis tick labels
    "ytick.color": "#D9D9D9",  # Light gray y-axis tick labels
    "axes.labelcolor": "#D9D9D9",  # Light gray axis labels
    "axes.titlesize": 14,  # Title font size
    "axes.titleweight": 'bold',  # Bold titles
})

custom_palette = ["#D9D9D9", "#A0A0A0"]

# Boxplot wrapper that applies the light gray styling by default
def custom_boxplot(*args, **kwargs):
    kwargs.setdefault("color", "#D9D9D9")  # Light gray box fill
    kwargs.setdefault("flierprops", dict(markerfacecolor='#D9D9D9', marker='o', markersize=7))
    kwargs.setdefault("boxprops", dict(edgecolor='#D9D9D9', linewidth=1))
    kwargs.setdefault("medianprops", dict(color='#D9D9D9', linewidth=1))
    kwargs.setdefault("capprops", dict(color='#D9D9D9', linewidth=1))
    kwargs.setdefault("whiskerprops", dict(color='#D9D9D9', linewidth=1))
    
    return sns.boxplot(*args, **kwargs)

def custom_countplot(*args, **kwargs):
    # Default colors for countplots
    kwargs.setdefault("palette", "dark:#D9D9D9")  # Bar colors (gradient ending in light gray)
    kwargs.setdefault("edgecolor", "#D9D9D9")  # Bar edges (light gray)

    # Draw the countplot
    ax = sns.countplot(*args, **kwargs)

    # Restyle the legend only if seaborn created one
    # (ax.legend() would always create a new, possibly empty, legend)
    legend = ax.get_legend()
    if legend is not None:
        plt.setp(legend.get_texts(), color="#D9D9D9")  # Legend text color
        legend.get_frame().set_edgecolor("#D9D9D9")  # Legend border color

    return ax

def custom_scatterplot(x, y, data, **kwargs):
    kwargs.setdefault("color", "#D9D9D9")  # Marker color
    kwargs.setdefault("edgecolor", "#D9D9D9")  # Marker edge color
    kwargs.setdefault("s", 50)  # Marker size
    kwargs.setdefault("alpha", 0.7)  # Transparency to reduce overplotting

    return sns.scatterplot(x=x, y=y, data=data, **kwargs)

In [None]:
numerical_columns = ['TransactionAmount', 'CustomerAge', 'TransactionDuration', 'LoginAttempts', 'AccountBalance']
plt.figure(figsize=(15, 8))
for i, col in enumerate(numerical_columns, 1):
    plt.subplot(2, 3, i)
    custom_boxplot(y=df[col])  # Use the custom wrapper
    plt.title(col, color='#D9D9D9')  # Light gray titles
plt.tight_layout()
plt.show()

### Observations:
1. **TransactionAmount**:
   - Some transactions have very high amounts (up to \$1,919.11), which could be outliers or legitimate large transactions.
2. **CustomerAge**:
   - The age range (18 to 80) is reasonable, with no outliers.
3. **TransactionDuration**:
   - Some transactions take up to 300 seconds (5 minutes), which might be outliers or indicate unusual activity.
4. **LoginAttempts**:
   - Most transactions involve 1 login attempt, but some have up to 5 attempts, which could indicate anomalies.
5. **AccountBalance**:
   - Some accounts have very high balances (up to \$14,977.99), which could be outliers or legitimate for high-net-worth individuals.

### Action:
- We’ll keep the outliers for now, as they might be important for fraud detection (e.g., unusually high transaction amounts or login attempts).
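
The points drawn beyond the whiskers in the boxplots follow the 1.5 × IQR rule, which is the default whisker length in matplotlib and seaborn. To count such points per column instead of reading them off the plot, the rule can be sketched as follows. `iqr_outlier_mask` is a hypothetical helper and the amounts are toy values:

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy amounts: six everyday values and one very large transaction.
amounts = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 500.0])
mask = iqr_outlier_mask(amounts)

print(amounts[mask].tolist())  # [500.0]
```

In the notebook, `df[numerical_columns].apply(iqr_outlier_mask).sum()` would give the number of flagged rows per column. The mask only flags rows; it does not remove them, which matches the decision to keep outliers.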

## Unique Values Categorical Columns

In [None]:
# Check unique values in categorical columns
categorical_columns = ['TransactionType', 'Location', 'DeviceID', 'IP Address', 'MerchantID', 'Channel', 'CustomerOccupation']

for col in categorical_columns:
    print(f"Unique values in {col}: {df[col].unique()}")
    print("\n")

### Observations:
1. **TransactionType**:
   - Only two values: `Debit` and `Credit`. No inconsistencies.
2. **Location**:
   - All values are valid U.S. city names. No inconsistencies.
3. **DeviceID**:
   - All values are alphanumeric identifiers. No inconsistencies.
4. **IP Address**:
   - All values are valid IPv4 addresses. No inconsistencies.
5. **MerchantID**:
   - All values are alphanumeric identifiers. No inconsistencies.
6. **Channel**:
   - Only three values: `Branch`, `ATM`, and `Online`. No inconsistencies.
7. **CustomerOccupation**:
   - Only four values: `Student`, `Doctor`, `Engineer`, and `Retired`. No inconsistencies.
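
Printing unique values works well for short categorical columns, but for a column such as `IP Address` a programmatic check is easier than reading the output by eye. The sketch below uses Python's standard `ipaddress` module; `is_valid_ipv4` is a hypothetical helper and the sample addresses are made up.

```python
import ipaddress

def is_valid_ipv4(value):
    """Return True if value is a string holding a well-formed IPv4 address."""
    if not isinstance(value, str):
        return False
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False

# Illustrative inputs: two valid addresses, one octet out of range, one too short.
samples = ["162.198.218.92", "13.149.61.4", "256.1.1.1", "10.0.0"]
results = [is_valid_ipv4(s) for s in samples]
print(results)  # [True, True, False, False]
```

Applied to the dataset, `df['IP Address'].map(is_valid_ipv4).all()` should return `True` if the observation above holds.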


## Summary of Data Cleaning

1. **No Missing Values or Duplicates**:
   - The dataset is clean, with no missing values or duplicate rows.

2. **Data Type Conversion**:
   - `TransactionDate` and `PreviousTransactionDate` have been converted to `datetime`.

3. **Outliers**:
   - Outliers were identified in `TransactionAmount`, `TransactionDuration`, and `LoginAttempts`, but they were retained as they might be relevant for fraud detection.

4. **Categorical Data**:
   - No inconsistencies were found in categorical columns.

The dataset is now ready for **Exploratory Data Analysis (EDA)**.

## Exploratory Data Analysis (EDA)

In this step, we’ll explore the dataset visually and statistically to uncover patterns, trends, and relationships. EDA helps us understand the data better and identify potential features for fraud detection.

### 1. Univariate Analysis


#### Numerical Columns

`AccountID` and `TransactionID` are excluded from this analysis because they are identifiers and carry no distributional information.


### Transaction Amount

In [None]:
# Histogram for TransactionAmount
sns.histplot(df['TransactionAmount'], bins=30, kde=True, color="#D9D9D9")
plt.title('Distribution of Transaction Amount',color="#D9D9D9")
plt.show()

# Boxplot for TransactionAmount
custom_boxplot(x=df['TransactionAmount'])
plt.title('Boxplot of Transaction Amount',color="#D9D9D9")
plt.show()

### Customer Age

In [None]:

# Histogram for CustomerAge
sns.histplot(df['CustomerAge'], bins=30, kde=True, color="#D9D9D9")
plt.title('Distribution of Customer Age',color="#D9D9D9")
plt.show()


### Account Balance

In [None]:

# Histogram for AccountBalance
sns.histplot(df['AccountBalance'], bins=30, kde=True, color="#D9D9D9")
plt.title('Distribution of Account Balance',color="#D9D9D9")
plt.show()


### Transaction Duration

In [None]:

# Histogram for TransactionDuration
sns.histplot(df['TransactionDuration'], bins=30, kde=True, color="#D9D9D9")
plt.title('Distribution of Transaction Duration',color="#D9D9D9")
plt.show()



### Login Attempts

In [None]:

# Histogram for LoginAttempts (integer counts, so use one bar per value and no KDE)
sns.histplot(df['LoginAttempts'], discrete=True, color="#D9D9D9")
plt.title('Distribution of Login Attempts',color="#D9D9D9")
plt.show()


### Categorical Columns

#### TransactionType

In [None]:
# Transaction Type
custom_countplot(x='TransactionType', hue='TransactionType' , data=df)
plt.title('Distribution of Transaction Types', color="#D9D9D9")
plt.show()


### Location

In [None]:
# Location

# The 20 most frequent locations
top_20_locations = df['Location'].value_counts().index[:20]

custom_countplot(y='Location', data=df, hue='Location', order=top_20_locations)
plt.title('Distribution of Transactions by Location (Top 20)', color="#D9D9D9")
plt.show()

### Channel

In [None]:
custom_countplot(x='Channel', hue='Channel', data=df)
plt.title('Distribution of Transaction Channels', color="#D9D9D9")
plt.show()

### CustomerOccupation

In [None]:
custom_countplot(x='CustomerOccupation' ,hue='CustomerOccupation' , data=df)
plt.title('Distribution of Customer Occupation', color="#D9D9D9")
plt.show()

## Bivariate Analysis

### TransactionAmount vs AccountBalance



In [None]:
custom_scatterplot(x='TransactionAmount', y='AccountBalance', data=df)
plt.title('Transaction Amount vs. Account Balance', color="#D9D9D9")
plt.show()

### TransactionAmount vs CustomerAge



In [None]:
custom_scatterplot(x='TransactionAmount', y='CustomerAge', data=df)
plt.title('Transaction Amount vs. Customer Age', color="#D9D9D9")
plt.show()

### TransactionAmount vs TransactionDuration



In [None]:
custom_scatterplot(x='TransactionAmount', y='TransactionDuration', data=df)
plt.title('Transaction Amount vs Transaction Duration', color="#D9D9D9")
plt.show()

### AccountBalance vs CustomerAge	

In [None]:
custom_scatterplot(x='AccountBalance', y='CustomerAge', data=df)
plt.title('Account Balance vs Customer Age', color="#D9D9D9")
plt.show()

### AccountBalance vs TransactionDuration	

In [None]:
custom_scatterplot(x='AccountBalance', y='TransactionDuration', data=df)
plt.title('Account Balance vs Transaction Duration', color="#D9D9D9")
plt.show()

## Numerical vs Categorical


### TransactionAmount vs TransactionType



In [None]:
custom_boxplot(x='TransactionType', y='TransactionAmount', data=df)
plt.title('Transaction Amount by Transaction Type', color="#D9D9D9")
plt.show()

### TransactionAmount vs Location



In [None]:
plt.figure(figsize=(12, 12))  # Larger figure so every location fits
custom_boxplot(y='Location', x='TransactionAmount', data=df)
plt.title('Transaction Amount by Location', color="#D9D9D9")
plt.show()


### TransactionAmount vs Channel



In [None]:
custom_boxplot(x='Channel', y='TransactionAmount', data=df)
plt.title('Transaction Amount by Channel', color="#D9D9D9")
plt.show()

### CustomerAge vs TransactionType	

In [None]:
custom_boxplot(x='CustomerAge', y='TransactionType', data=df)
plt.title('Customer Age by Transaction Type', color="#D9D9D9")
plt.show()

### AccountBalance vs TransactionType	

In [None]:
custom_boxplot(x='AccountBalance', y='TransactionType', data=df)
plt.title('Account Balance by Transaction Type', color="#D9D9D9")
plt.show()

## Categorical vs Categorical


### TransactionType vs Location

In [None]:
plt.figure(figsize=(14, 6))  # Wider figure
top_10_locations = df['Location'].value_counts().index[:10]
df_filtered = df[df['Location'].isin(top_10_locations)]
custom_countplot(x='Location', hue='TransactionType', data=df_filtered)
plt.xticks(rotation=45, ha="right")  # Rotate labels
plt.title('Transaction Type by Location (Top 10)', color="#D9D9D9")
plt.show()


### TransactionType vs Channel

In [None]:
custom_countplot(x='Channel', hue='TransactionType', data=df)
plt.title('Transaction Type by Channel', color="#D9D9D9")
plt.show()

### Location vs Channel	

In [None]:
# Select the 10 most frequent locations
top_10_locations = df['Location'].value_counts().index[:10]
df_filtered = df[df['Location'].isin(top_10_locations)]

# Plot only those locations
plt.figure(figsize=(14, 6))  # Wider figure
custom_countplot(x='Location', hue='Channel', data=df_filtered)
plt.xticks(rotation=45, ha="right")  # Rotate labels so they do not overlap
plt.title('Channel by Location (Top 10 Locations)', color="#D9D9D9")
plt.show()


### TransactionType vs CustomerOccupation

In [None]:
custom_countplot(x='TransactionType', hue='CustomerOccupation', data=df)
plt.title('Transaction Type by Customer Occupation', color="#D9D9D9")
plt.show()

### Location vs CustomerOccupation	

In [None]:
# Select the 10 most frequent locations
top_10_locations = df['Location'].value_counts().index[:10]
df_filtered = df[df['Location'].isin(top_10_locations)]

# Plot only those locations
plt.figure(figsize=(14, 6))  # Wider figure
custom_countplot(x='Location', hue='CustomerOccupation', data=df_filtered)
plt.xticks(rotation=45, ha="right")  # Rotate labels so they do not overlap
plt.title('Customer Occupation by Location (Top 10 Locations)', color="#D9D9D9")
plt.show()


In [None]:
# Select numerical columns
numerical_columns = ['TransactionAmount', 'AccountBalance', 'CustomerAge', 'TransactionDuration', 'LoginAttempts']

# Calculate correlation matrix
correlation_matrix = df[numerical_columns].corr()

# Plot heatmap
plt.figure(figsize=(12, 8))  # Larger figure for readability

# Use a grayscale colormap
sns.heatmap(correlation_matrix, annot=True, cmap="Greys", fmt=".2f", annot_kws={"size": 12, "ha": "center", "va": "center"}, cbar_kws={'label': 'Correlation'})

# Additional styling
plt.title('Correlation Heatmap of Numerical Variables', color="#D9D9D9")
plt.xticks(rotation=45, ha="right", color="#D9D9D9")  # Rotate and recolor x-axis labels
plt.yticks(rotation=0, color="#D9D9D9")  # Recolor y-axis labels
plt.show()


In [None]:
# Create a cross-tabulation of TransactionType and Channel
cross_tab = pd.crosstab(df['TransactionType'], df['Channel'])

# Plot heatmap
plt.figure(figsize=(12, 7))  # Larger figure for readability

# Use a grayscale colormap
sns.heatmap(cross_tab, annot=True, cmap="Greys", fmt="d", annot_kws={"size": 12, "ha": "center", "va": "center"}, cbar_kws={'label': 'Frequency'})

# Additional styling
plt.title('Frequency Heatmap: Transaction Type vs Channel', color="#D9D9D9")
plt.xticks(rotation=45, ha="right", color="#D9D9D9")  # Rotate and recolor x-axis labels
plt.yticks(rotation=0, color="#D9D9D9")  # Recolor y-axis labels
plt.show()
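
Raw counts in a cross-tabulation are hard to compare when one transaction type is much more common than the other. One option, sketched here on toy data with made-up values, is to row-normalize with the `normalize='index'` argument of `pd.crosstab`, so each row shows the share of channels within a transaction type:

```python
import pandas as pd

# Toy data: three debit transactions and one credit transaction.
toy = pd.DataFrame({
    "TransactionType": ["Debit", "Debit", "Debit", "Credit"],
    "Channel": ["ATM", "Online", "ATM", "Branch"],
})

# normalize="index" divides each row by its row total.
shares = pd.crosstab(toy["TransactionType"], toy["Channel"], normalize="index")
print(shares.round(2))
```

The resulting frame can be passed to `sns.heatmap` in place of `cross_tab`, using `fmt=".2f"` instead of `fmt="d"` because the cells are now fractions.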


In [None]:

# Create a cross-tabulation of Location and TransactionType
cross_tab = pd.crosstab(df['Location'], df['TransactionType'])

# Plot heatmap
plt.figure(figsize=(20, 12))  # Larger figure for readability

# Use a grayscale colormap
sns.heatmap(cross_tab, annot=True, cmap="Greys", fmt="d", annot_kws={"size": 10, "ha": "center", "va": "center"}, cbar_kws={'label': 'Frequency'})

# Additional styling
plt.title('Frequency Heatmap: Transaction Type vs Location', color="#D9D9D9")

# Keep both axes' labels horizontal and recolor them
plt.xticks(rotation=0, ha="center", color="#D9D9D9")  # x-axis labels
plt.yticks(rotation=0, color="#D9D9D9")  # y-axis labels

# Show the plot
plt.show()


In [None]:
# Create a cross-tabulation of Location and Channel
cross_tab = pd.crosstab(df['Location'], df['Channel'])

# Plot heatmap
plt.figure(figsize=(20, 10))  # Larger figure for readability

# Use a grayscale colormap
sns.heatmap(cross_tab, annot=True, cmap="Greys", fmt="d", annot_kws={"size": 12, "ha": "center", "va": "center"}, cbar_kws={'label': 'Frequency'})

# Additional styling
plt.title('Frequency Heatmap: Location vs Channel', color="#D9D9D9")
plt.xticks(rotation=0, color="#D9D9D9")  # Recolor x-axis labels
plt.yticks(rotation=0, color="#D9D9D9")  # Recolor y-axis labels

plt.show()


## Feature Engineering

### Time-Based Features:

In [None]:
df['TimeSinceLastTransaction'] = (df['TransactionDate'] - df['PreviousTransactionDate']).dt.total_seconds()
df['TransactionHour'] = df['TransactionDate'].dt.hour

### Transaction Frequency:

In [None]:
df['TransactionFrequency'] = df.groupby('AccountID')['TransactionID'].transform('count')

### Device and IP Usage:

In [None]:
df['DeviceUsage'] = df.groupby('DeviceID')['TransactionID'].transform('count')
df['IPUsage'] = df.groupby('IP Address')['TransactionID'].transform('count')

### Merchant Preferences:

In [None]:
df['MerchantPreference'] = df.groupby(['AccountID', 'MerchantID'])['TransactionID'].transform('count')
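
The frequency, usage, and preference features all rely on the same pattern: `groupby(...).transform('count')` computes a per-group count and broadcasts it back to every row of that group, so the result has the same length as `df`. A small worked example on toy identifiers (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({
    "TransactionID": ["TX1", "TX2", "TX3", "TX4"],
    "AccountID": ["AC1", "AC1", "AC2", "AC1"],
    "MerchantID": ["M1", "M2", "M1", "M1"],
})

# Transactions per account, repeated on every row of that account.
toy["TransactionFrequency"] = toy.groupby("AccountID")["TransactionID"].transform("count")

# Transactions per (account, merchant) pair.
toy["MerchantPreference"] = (
    toy.groupby(["AccountID", "MerchantID"])["TransactionID"].transform("count")
)

print(toy["TransactionFrequency"].tolist())  # [3, 3, 1, 3]
print(toy["MerchantPreference"].tolist())    # [2, 1, 1, 2]
```

A low `MerchantPreference` next to a high `TransactionFrequency` marks a merchant the account rarely uses, which is the kind of deviation these features are meant to expose.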

In [None]:
df.head()

## Data Preprocessing

###  Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# Select numerical features to scale
numerical_features = [
    'TransactionAmount', 'TransactionDuration', 'LoginAttempts', 
    'AccountBalance', 'TimeSinceLastTransaction', 'TransactionFrequency', 
    'DeviceUsage', 'IPUsage', 'MerchantPreference'
]

# Apply StandardScaler
scaler = StandardScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])
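
`StandardScaler` rescales each column to zero mean and unit variance using the population standard deviation (`ddof=0`). The NumPy sketch below shows the same arithmetic on three toy values. `standardize` is a hypothetical helper and, unlike scikit-learn, it does not guard against zero-variance columns.

```python
import numpy as np

def standardize(values):
    """z = (x - mean) / std, with the population standard deviation (ddof=0)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # NumPy's default is ddof=0
    return (values - mean) / std

z = standardize([100.0, 200.0, 300.0])
print(z.round(4))
```

Note that the cell above overwrites the original columns in `df`, so any later plot of `TransactionAmount` or `AccountBalance` shows standardized values, not dollar amounts.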

### Encode Categorical Variables

In [None]:
# Label Encoding for 'TransactionType' (binary: Debit/Credit)
df['TransactionTypeEncoded'] = df['TransactionType'].map({'Debit': 0, 'Credit': 1})

# One-Hot Encoding for 'Channel' and 'Location'
df = pd.get_dummies(df, columns=['Channel', 'Location'], drop_first=True)

# Drop the original categorical columns (optional)
#df.drop(columns=['TransactionType'], inplace=True)

df.head()
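
With `drop_first=True`, `pd.get_dummies` drops the first category of each column (in sorted order) and encodes it implicitly as "all remaining indicators are zero". A toy example with the three channel values:

```python
import pandas as pd

toy = pd.DataFrame({"Channel": ["ATM", "Online", "Branch", "ATM"]})

# Categories are sorted (ATM, Branch, Online); the first one, ATM, is dropped.
encoded = pd.get_dummies(toy, columns=["Channel"], drop_first=True)

print(list(encoded.columns))  # ['Channel_Branch', 'Channel_Online']
print(encoded.astype(int).values.tolist())
```

The indicator columns are boolean in pandas 2.0 and later and `uint8` in earlier versions, which matters for any later step that selects columns by dtype.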

In [None]:
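# Keep numeric columns plus the one-hot columns
# (get_dummies returns booleans in pandas >= 2.0, which 'number' alone would exclude)
df_numeric = df.select_dtypes(include=['number', 'bool']).astype(float)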

# Check the resulting dataframe
print(df_numeric.head())

In [None]:
from sklearn.ensemble import IsolationForest

# Initialize the Isolation Forest model
iso_forest = IsolationForest(contamination=0.05, random_state=42)  # Adjust contamination based on expected anomaly rate

# Fit the model and predict anomalies
# Note: fit_predict returns labels (-1 for anomalies, 1 for normal transactions), not continuous scores
df['AnomalyScore'] = iso_forest.fit_predict(df_numeric)

# Map the labels to a 0/1 flag (1 = anomaly)
df['IsAnomaly'] = (df['AnomalyScore'] == -1).astype(int)

# Check the number of anomalies
print(df['IsAnomaly'].value_counts())
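
`fit_predict` only returns the labels `-1` and `1`. When a ranking of the most suspicious transactions is more useful than a yes/no flag, `IsolationForest.decision_function` gives a continuous score, where lower values are more anomalous and negative values correspond to the `-1` label. A sketch on synthetic two-dimensional data (the data and the single planted outlier are assumptions made for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 points around the origin plus one obvious outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(X)        # -1 = anomaly, 1 = normal
scores = model.decision_function(X)  # lower = more anomalous

# Indices of the five lowest-scoring (most anomalous) points.
most_anomalous = np.argsort(scores)[:5]
print(most_anomalous)
print(f"Flagged as anomalies: {(labels == -1).sum()} of {len(X)}")
```

In the notebook, the equivalent call would be `iso_forest.decision_function(df_numeric)`, stored in its own column so it is not confused with the label column.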

In [None]:
import matplotlib.pyplot as plt

# Visualize anomalies (e.g., TransactionAmount vs. AccountBalance)
# Both columns were standardized earlier, so the axes show z-scores rather than dollar amounts
plt.figure(figsize=(10, 6))
plt.scatter(df['TransactionAmount'], df['AccountBalance'], c=df['IsAnomaly'], cmap='coolwarm', alpha=0.6)
plt.title('Anomalies Detected by Isolation Forest')
plt.xlabel('Transaction Amount (standardized)')
plt.ylabel('Account Balance (standardized)')
plt.colorbar(label='Anomaly (1) / Normal (0)')
plt.show()