# AI-Driven Anomaly Detection in Credit Card Transactions using Azure OpenAI

## 1. Introduction

In an era of rapidly growing digital financial systems, fraud has become a sophisticated and persistent threat — costing institutions billions annually. Traditional rule-based systems often fall short due to their inability to adapt to evolving fraud tactics and complex behavioral patterns.

To address this challenge, our project leverages **Microsoft Fabric’s modern data architecture**, the **powerful language reasoning capabilities of Azure OpenAI's GPT-4 model**, and **interactive data visualization with Power BI** to create a smart, explainable fraud detection system.

Using a public transaction dataset from **Kaggle** as a proxy for real financial data, we simulate a production-grade fraud detection pipeline. The architecture is built on the **Lakehouse Medallion model (Bronze → Silver → Gold)** for organized data processing, feature engineering, and AI insights. The **GPT-4 model** (accessed through Azure OpenAI) analyzes each transaction and provides two outputs:
- A concise, human-like explanation of whether the transaction seems fraudulent or not.
- A **fraud risk score** (`Low`, `Medium`, or `High`) to flag suspicious records.
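
The two outputs above could be requested and recovered roughly as follows. This is a minimal sketch, not the project's actual prompt: the prompt wording, the two-line `Risk:` / `Explanation:` reply format, and the helper names `build_prompt` and `parse_reply` are assumptions made for illustration, and the call to Azure OpenAI itself is left out.

```python
import re

def build_prompt(txn: dict) -> str:
    """Render one transaction as a prompt asking for a risk level and a reason."""
    details = "\n".join(f"- {key}: {value}" for key, value in txn.items())
    return (
        "You are a fraud analyst. Review this transaction:\n"
        f"{details}\n"
        "Reply in exactly two lines:\n"
        "Risk: <Low|Medium|High>\n"
        "Explanation: <one or two sentences>"
    )

def parse_reply(reply: str) -> tuple[str, str]:
    """Extract (risk, explanation); fall back to 'Unknown' if the format is off."""
    risk_match = re.search(r"^Risk:\s*(Low|Medium|High)\b", reply, re.I | re.M)
    expl_match = re.search(r"^Explanation:\s*(.+)$", reply, re.I | re.M)
    risk = risk_match.group(1).capitalize() if risk_match else "Unknown"
    explanation = expl_match.group(1).strip() if expl_match else reply.strip()
    return risk, explanation

# Hypothetical transaction and a hand-written reply standing in for the model.
sample = {"TransactionAmount": 1250.0, "Channel": "Online", "LoginAttempts": 4}
prompt = build_prompt(sample)
risk, why = parse_reply(
    "Risk: high\nExplanation: Four login attempts preceded a large online payment."
)
```

Constraining the reply to a fixed format keeps the risk label machine-readable for the dashboard, while the `Unknown` fallback avoids silently mislabeling a reply that drifts from the format.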

This AI-enhanced approach doesn't just detect fraud — it **communicates the rationale**, empowering analysts to understand *why* a transaction may be risky.

The final fraud intelligence is surfaced through **Power BI dashboards**, offering visual summaries, risk distribution, and GPT-generated fraud insights in a business-friendly format.

---

## 2. Problem Statement

As digital banking and online payments expand, fraudsters continue to develop more complex, real-time attack strategies. Existing fraud detection methods — often built on rigid rules or traditional supervised models — struggle to detect novel fraud patterns or adapt to changing behavior.

This project responds to this problem by introducing a **generative AI-based fraud detection model** that:
- Thinks like a human fraud analyst.
- Understands the context behind each transaction.
- Flags anomalies based on behavioral reasoning — not just mathematical thresholds.

By combining **Microsoft Fabric** for structured data processing and **GPT-4** for intelligent transaction analysis, we aim to build a flexible, scalable system capable of detecting both known and unknown fraud behaviors.

---

## 3. Objectives

### 3.1 Main Objective  
To develop an AI-powered fraud detection system that can intelligently evaluate financial transactions, assign risk levels, and provide human-readable fraud justifications — using Azure OpenAI and Microsoft Fabric.

### 3.2 Specific Objectives

- Ingest and process transactional data using the Lakehouse architecture (Bronze → Silver → Gold).
- Analyze financial transactions to extract behavioral and contextual patterns.
- Leverage **Azure OpenAI GPT-4** to classify transactions by risk level and generate fraud analysis summaries.
- Visualize flagged transactions and fraud trends using Power BI.
- Showcase explainable AI as a tool to support fraud analysts in real-world decision-making.

---

## 4. Hackathon Achievements

- Successfully implemented an **AI-driven fraud detection system** using **Azure OpenAI** for fraud analysis and **Microsoft Fabric** for scalable data processing.
- Demonstrated how **AI models** can provide detailed **fraud risk assessments** for transactions without requiring labeled fraud data.
- Created an interactive **Power BI dashboard** for presenting fraud trends, risk distributions, and GPT-generated insights.
- Participated in the **Microsoft Fabric Data + AI Kenya Hackathon**, applying data science techniques to real-world challenges and advancing fraud detection capabilities with modern AI technologies.


---

## 5. Data Understanding
The dataset contains key transactional attributes that will be analyzed to detect fraud:

| Feature | Description |
|---|---|
| **TransactionID** | Unique identifier for each transaction |
| **AccountID** | Unique identifier for the account |
| **TransactionAmount** | Monetary value of the transaction |
| **TransactionDate** | Timestamp of the transaction |
| **TransactionType** | Indicates whether it is a 'Credit' or 'Debit' transaction |
| **Location** | Geographic location of the transaction |
| **DeviceID** | Identifier for the device used in the transaction |
| **IP Address** | IP address linked to the transaction |
| **MerchantID** | Unique identifier for the merchant |
| **Channel** | Mode of transaction (e.g., Online, ATM, Branch) |
| **AccountBalance** | Remaining balance after the transaction |
| **TransactionDuration** | Time taken for the transaction (in seconds) |
| **LoginAttempts** | Number of login attempts before the transaction |

These features will be analyzed to identify anomalies such as:

1. Transactions from unusual locations or devices.

2. Sudden large withdrawals or transactions.

3. Multiple failed login attempts before transactions.

4. High transaction frequency within a short period.
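
To make the four anomaly types concrete, the sketch below expresses each one as a simple rule over one transaction and the same account's earlier transactions. It is plain Python for illustration only: the function name `rule_flags`, every threshold, and the sample values (`CityA`, `CityB`, `D1`) are hypothetical, and the project itself relies on GPT-4 reasoning rather than fixed rules like these.

```python
from datetime import datetime, timedelta

def rule_flags(txn, history, amount_factor=5.0, max_logins=1,
               window=timedelta(minutes=10), max_in_window=3):
    """Return the names of the simple rules that `txn` trips.

    `history` holds the account's earlier transactions as dicts.
    All thresholds are illustrative assumptions, not tuned values.
    """
    flags = []
    # 1. Location or device never seen before on this account.
    if history and txn["Location"] not in {h["Location"] for h in history}:
        flags.append("new_location")
    if history and txn["DeviceID"] not in {h["DeviceID"] for h in history}:
        flags.append("new_device")
    # 2. Amount far above the account's historical average.
    if history:
        avg = sum(h["TransactionAmount"] for h in history) / len(history)
        if txn["TransactionAmount"] > amount_factor * avg:
            flags.append("large_amount")
    # 3. More than one login attempt before the transaction.
    if txn["LoginAttempts"] > max_logins:
        flags.append("multiple_logins")
    # 4. Too many earlier transactions inside a short time window.
    recent = [h for h in history
              if timedelta(0) <= txn["TransactionDate"] - h["TransactionDate"] <= window]
    if len(recent) >= max_in_window:
        flags.append("high_frequency")
    return flags

t0 = datetime(2023, 4, 11, 16, 0)
history = [
    {"Location": "CityA", "DeviceID": "D1", "TransactionAmount": 40.0,
     "TransactionDate": t0 + timedelta(minutes=m)}
    for m in (0, 2, 4)
]
txn = {"Location": "CityB", "DeviceID": "D1", "TransactionAmount": 900.0,
       "LoginAttempts": 3, "TransactionDate": t0 + timedelta(minutes=5)}
flags = rule_flags(txn, history)
```

Here the sample transaction trips every rule except `new_device`, since device `D1` already appears in the account history.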


## Data Ingestion

We begin by loading the transactional data from a CSV file into a Spark DataFrame. This raw dataset will serve as the foundation for all subsequent analysis.

```python
# Read the CSV (with a header row) into a Spark DataFrame.
# No schema inference is requested, so every column is loaded as a string.
df = spark.read.format("csv").option("header", "true").load("Files/bank_transactions_data_2.csv")
display(df)
```


The above DataFrame contains transactional records with attributes like transaction amount, date, location, type, and user identifiers. This raw data will be stored in the Bronze layer of our lakehouse architecture for initial analysis and data cleaning.


## Saving to Bronze Layer

To maintain data lineage and ensure structured data management, we store the raw dataset in the Bronze layer in Delta format. Spaces in column names are replaced with underscores first, so `IP Address` becomes `IP_Address`.


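```python
from pyspark.sql.functions import col

# Replace spaces in column names with underscores (e.g. "IP Address" -> "IP_Address"),
# since Delta tables do not accept spaces in column names by default.
df = df.select([col(c).alias(c.replace(" ", "_")) for c in df.columns])

# Save the raw data to the Bronze layer. Note that "append" adds the rows again
# on every run; use "overwrite" if the notebook is re-run on the same file.
df.write.format("delta").mode("append").save("Tables/Bronze/bank_transactions_data")
```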


This concludes the Bronze layer processing, where the raw dataset is ingested and preserved.
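
The Bronze and Silver writes in this notebook both follow a `Tables/<Layer>/<table>` path pattern. A small helper along the following lines could keep those paths consistent across layers. The name `layer_path` is hypothetical, and the assumption that the Gold layer uses the same pattern is not confirmed by the notebook.

```python
LAYERS = ("Bronze", "Silver", "Gold")

def layer_path(layer: str, table: str) -> str:
    """Build the Delta path for a table in a given medallion layer."""
    if layer not in LAYERS:
        raise ValueError(f"Unknown layer {layer!r}; expected one of {LAYERS}")
    return f"Tables/{layer}/{table}"

bronze_path = layer_path("Bronze", "bank_transactions_data")
silver_path = layer_path("Silver", "bank_transactions_data")
```

Rejecting unknown layer names catches typos such as `"bronze"` before they create a stray folder next to the real tables.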

## Data Cleaning and Type Casting

Before moving to analysis, we need to clean the dataset. This involves checking summary statistics and ensuring that all numerical and date fields are correctly typed.


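```python
# describe() only builds the summary DataFrame. Without .show(), the cell prints
# just that DataFrame's schema; use df.describe().show() to print the statistics.
df.describe()
```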


```text
DataFrame[summary: string, TransactionID: string, AccountID: string, TransactionAmount: string, TransactionDate: string, TransactionType: string, Location: string, DeviceID: string, IP_Address: string, MerchantID: string, Channel: string, CustomerAge: string, CustomerOccupation: string, TransactionDuration: string, LoginAttempts: string, AccountBalance: string, PreviousTransactionDate: string]
```

### Data Type Transformation

We now cast specific columns to appropriate data types such as integers, doubles and timestamps. This ensures consistency and enables proper numeric analysis.


```python
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType, IntegerType, TimestampType

# Map each column name to its target Spark data type.
cast_dict = {
    "TransactionAmount": DoubleType(),
    "TransactionDate": TimestampType(),
    "CustomerAge": IntegerType(),
    "TransactionDuration": IntegerType(),  # duration in whole seconds
    "LoginAttempts": IntegerType(),
    "AccountBalance": DoubleType(),
    "PreviousTransactionDate": TimestampType(),
}

# Apply the casts one column at a time.
for col_name, data_type in cast_dict.items():
    df = df.withColumn(col_name, col(col_name).cast(data_type))
```
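
One point worth keeping in mind with this step: when ANSI mode is off (the default in Spark 3.x, as far as we know), `cast` turns a string it cannot parse into null instead of raising an error, so a bad value can disappear silently. The plain-Python sketch below mimics that behavior on a made-up row to show why counting nulls after casting is a useful check. The helper `cast_or_none` and the sample values are hypothetical.

```python
from datetime import datetime

def cast_or_none(value, caster):
    """Apply `caster`; return None instead of raising when the value is malformed."""
    try:
        return caster(value)
    except (TypeError, ValueError):
        return None

# Hypothetical raw row as read from CSV: every field is a string.
raw = {
    "TransactionAmount": "14.09",
    "LoginAttempts": "one",  # malformed on purpose
    "TransactionDate": "2023-04-11 16:29:14",
}
casters = {
    "TransactionAmount": float,
    "LoginAttempts": int,
    "TransactionDate": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
}

typed = {name: cast_or_none(raw[name], fn) for name, fn in casters.items()}
# Columns whose value was lost during casting.
failed = [name for name, value in typed.items() if value is None]
```

In this example only `LoginAttempts` fails to cast; the missing-value check in the data quality step plays the same role for the real DataFrame.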


### Verifying Schema

After casting, we verify the schema to ensure all columns have the correct data types.


```python
# Display the updated data types.
df.printSchema()
```


```text
root
 |-- TransactionID: string (nullable = true)
 |-- AccountID: string (nullable = true)
 |-- TransactionAmount: double (nullable = true)
 |-- TransactionDate: timestamp (nullable = true)
 |-- TransactionType: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- DeviceID: string (nullable = true)
 |-- IP_Address: string (nullable = true)
 |-- MerchantID: string (nullable = true)
 |-- Channel: string (nullable = true)
 |-- CustomerAge: integer (nullable = true)
 |-- CustomerOccupation: string (nullable = true)
 |-- TransactionDuration: integer (nullable = true)
 |-- LoginAttempts: integer (nullable = true)
 |-- AccountBalance: double (nullable = true)
 |-- PreviousTransactionDate: timestamp (nullable = true)
```



## Data Quality Checks

We perform data quality checks to ensure the dataset is clean and ready. This includes checking for missing values, duplicates and outliers.


```python
# Import the functions module under an alias so that Spark's sum() does not
# shadow Python's built-in sum().
from pyspark.sql import functions as F

NUMERIC_TYPES = {"int", "double", "float"}

def check_data_quality(df):
    print("\n Checking Data Quality...\n")

    # Missing values: count the nulls in every column.
    print("Missing Values Per Column:")
    df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()

    # Duplicates: compare the total row count with the distinct row count.
    total_rows = df.count()
    unique_rows = df.dropDuplicates().count()
    print(f"Total Rows: {total_rows}, Unique Rows: {unique_rows}, Duplicates: {total_rows - unique_rows}")

    # Column data types.
    print("\n Data Types:")
    df.printSchema()

    # Outliers: IQR method on numeric columns, using approximate quartiles
    # (relative error 0.05).
    for col_name, dtype in df.dtypes:
        if dtype not in NUMERIC_TYPES:
            continue
        quantiles = df.approxQuantile(col_name, [0.25, 0.75], 0.05)
        if len(quantiles) == 2:
            q1, q3 = quantiles
            iqr = q3 - q1
            lower_bound = q1 - 1.5 * iqr
            upper_bound = q3 + 1.5 * iqr
            outlier_count = df.filter(
                (F.col(col_name) < lower_bound) | (F.col(col_name) > upper_bound)
            ).count()
            print(f"Outliers in '{col_name}': {outlier_count}")

check_data_quality(df)
```



```text
 Checking Data Quality...

Missing Values Per Column:
+-------------+---------+-----------------+---------------+---------------+--------+--------+----------+----------+-------+-----------+------------------+-------------------+-------------+--------------+-----------------------+
|TransactionID|AccountID|TransactionAmount|TransactionDate|TransactionType|Location|DeviceID|IP_Address|MerchantID|Channel|CustomerAge|CustomerOccupation|TransactionDuration|LoginAttempts|AccountBalance|PreviousTransactionDate|
+-------------+---------+-----------------+---------------+---------------+--------+--------+----------+----------+-------+-----------+------------------+-------------------+-------------+--------------+-----------------------+
|            0|        0|                0|              0|              0|       0|       0|         0|         0|      0|          0|                 0|                  0|            0|             0|                      0|
+-------------+---------+--------
```

The quality check shows no missing or duplicate values. However, outliers were detected in several numeric fields. These outliers do not significantly impact our analysis and have therefore been retained.
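
As a worked illustration of the IQR rule applied in the quality check, the sketch below computes the fences for a handful of made-up amounts in plain Python. It uses exact quartiles from the standard library rather than Spark's `approxQuantile`, so the bounds will not match the notebook's approximate ones exactly. The function name `iqr_outliers` and the sample values are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower, upper, outliers) using the Q1 - k*IQR and Q3 + k*IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

amounts = [10, 12, 14, 15, 18, 20, 22, 500]  # made-up transaction amounts
lower, upper, outliers = iqr_outliers(amounts)
```

Only the value 500 falls outside the fences here. Flagging a value this way does not remove it, which matches the decision above to retain the outliers.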


## Saving to Silver Layer

Now that the data is cleaned and types are standardized, we store the dataset in the Silver layer of the lakehouse for downstream analytics.


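```python
# Column names were already sanitized before the Bronze write, so this rename
# is only a safeguard against spaces in column names.
df = df.select([col(c).alias(c.replace(" ", "_")) for c in df.columns])

# Save the cleaned, typed data to the Silver layer ("append" adds rows on every run).
df.write.format("delta").mode("append").save("Tables/Silver/bank_transactions_data")
```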


The cleaned data has been successfully stored in the Silver layer. In the next section, we explore the data further through exploratory data analysis (EDA) to uncover patterns and trends and to prepare it for further analysis.
