FraudDetection — Real-Time Fraud Detection Pipeline (NiFi • Kafka • Spark • ML • MySQL • Streamlit)

This repository contains an end-to-end Big Data streaming pipeline for real-time detection of fraudulent transactions. It combines Apache NiFi (ingestion), Apache Kafka (event buffering), Apache Spark in Scala (stream processing and ML inference), MySQL (hot storage for alerts), and Streamlit + Flask (dashboard and API).


Architecture Overview

Flow (high level):

  1. Dataset (Kaggle) → downloaded as CSV
  2. DataSplitter.py → prepares/splits the dataset into files ready for NiFi ingestion
  3. NiFi flow (JSON provided) → GetFile → SplitText (chunking) → SplitText (atomization) → PublishKafka
  4. Kafka topic buffers events
  5. Spark Streaming (Scala) consumes Kafka → feature engineering → ML inference (Random Forest) (a sketch of this job follows the list)
  6. Outputs:
    • MySQL: only fraud alerts (hot path)
    • Data Lake (Parquet): all transactions (cold path)
  7. Flask API serves stats → Streamlit dashboard displays KPIs and charts
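
The core of this flow is the Spark job in steps 5–6. Below is a minimal sketch of its logic; the schema (PaySim-style Kaggle columns), the topic name, the paths, and the MySQL credentials are all placeholders, and FraudDetectionStreaming.scala in the repository is the authoritative version:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import org.apache.spark.ml.PipelineModel

object FraudDetectionStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("FraudDetectionStreaming").getOrCreate()
    import spark.implicits._

    // Hypothetical transaction schema (PaySim-style Kaggle columns);
    // match it to what DataSplitter.py actually emits.
    val schema = new StructType()
      .add("step", IntegerType).add("type", StringType).add("amount", DoubleType)
      .add("oldbalanceOrg", DoubleType).add("newbalanceOrig", DoubleType)

    val txns = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "kafka:29092") // internal Docker listener
      .option("subscribe", "transactions")              // placeholder topic name
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", schema).as("t"))
      .select("t.*")

    // Model produced by the ModelTrainer job (step 7), saved under fraud-model/.
    val model = PipelineModel.load("/opt/spark/work-dir/fraud-model")

    // An explicitly typed function value sidesteps the Scala 2.12
    // foreachBatch overload ambiguity.
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) => {
      // Cold path: every transaction appended to the Parquet data lake.
      batch.write.mode("append").parquet("/opt/spark/work-dir/data-lake")
      // Hot path: only predicted frauds, plain columns only
      // (ML vector columns cannot be written over JDBC).
      batch.filter(batch("prediction") === 1.0)
        .select("step", "type", "amount", "prediction")
        .write.format("jdbc")
        .option("url", "jdbc:mysql://mysql:3306/fraud_db") // placeholder host/db
        .option("dbtable", "fraud_alerts")                 // placeholder table
        .option("user", "root").option("password", "***")
        .mode("append").save()
    }

    model.transform(txns).writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "/opt/spark/work-dir/checkpoint")
      .start()
      .awaitTermination()
  }
}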

Project Structure


FraudDetection/
├── docker-compose.yml
├── build.sbt
├── requirements.txt
├── backend_api.py
├── dashboard_streamlit.py
├── nifi-flow.json
├── src/
│   └── main/
│       └── scala/
│           ├── input_data/
│           │   └── fraud.csv
│           ├── DataSplitter.py
│           ├── ModelTrainer.scala
│           ├── FraudDetectionStreaming.scala
│           ├── CreateMySQLTable.scala
│           └── CreateOracleTable.scala
├── data-lake/
├── fraud-model/
├── checkpoint/
└── target/


Prerequisites

  • Docker + Docker Compose
  • Java (for SBT / Scala build)
  • SBT
  • Python 3.9+
  • Optional: NiFi installed locally or otherwise accessible (NiFi runs outside Docker in this setup)

Setup & Run (Recommended Order)

1) Clone the repository

git clone https://github.com/Strawberry404/FraudDetection.git
cd FraudDetection

2) Download the dataset (Kaggle)

Download from: https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data

Then place/rename the file as:

src/main/scala/input_data/fraud.csv

Note: If Kaggle provides a different filename, rename it to fraud.csv to match the project structure.


3) Run the data preparation step (mandatory)

This script prepares the raw CSV and splits it into files that the NiFi GetFile processor can ingest.

python src/main/scala/DataSplitter.py

4) Start infrastructure (Spark + Kafka)

docker-compose up -d

Quick checks:

  • Spark UI: http://localhost:8080
  • Kafka, external listener (host / Windows / NiFi): localhost:9092 (a producer smoke test follows this list)
  • Kafka, internal listener (Docker / Spark): kafka:29092
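
If you want to confirm the broker is reachable from the host before wiring up NiFi, a minimal smoke test with the Kafka client library looks like the sketch below. It assumes kafka-clients on the classpath and a topic named transactions, which is a placeholder; substitute the topic your flow actually publishes to.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaSmokeTest extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // external listener
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // Blocks until the broker acknowledges the record; throws if unreachable.
  producer.send(new ProducerRecord("transactions", "test-key", """{"amount": 1.0}""")).get()
  println("Broker reachable on localhost:9092")
  producer.close()
}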

5) Import the NiFi flow

A ready-to-use NiFi flow export is provided:

  • File: nifi-flow.json (repo root)

Import steps (NiFi UI):

  1. Open NiFi canvas
  2. Use Upload / Import Flow Definition
  3. Select nifi-flow.json
  4. Drop the imported Process Group onto the canvas

After import, verify:

  • Kafka3ConnectionService → Bootstrap servers: localhost:9092 (if NiFi runs on the host, e.g. Windows)
  • Kafka topic name matches the Spark consumer configuration (a quick topic check is sketched below)
  • GetFile input directory points to the output generated by DataSplitter.py
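
To verify the topic itself, you can list what the broker knows about with Kafka's AdminClient (again assuming kafka-clients on the classpath; run from the host, so the external listener applies):

import java.util.Properties
import org.apache.kafka.clients.admin.AdminClient

object TopicCheck extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  val admin = AdminClient.create(props)
  // listTopics() returns a future; get() blocks until the broker answers.
  val topics = admin.listTopics().names().get()
  println(s"Topics on the broker: $topics")
  admin.close()
}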

6) Build the Scala JAR (SBT)

sbt clean compile package

The JAR is expected under:

target/scala-2.12/FraudDetectionPipeline_2.12-1.0.jar
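
For reference, a minimal build.sbt consistent with this setup might look like the sketch below. The versions and settings here are assumptions; the repository's own build.sbt is authoritative.

name := "FraudDetectionPipeline"
version := "1.0"
scalaVersion := "2.12.18"

// Spark itself is supplied by the cluster at spark-submit time, and the
// Kafka connector and MySQL driver arrive via --packages in steps 7 and 8,
// so only compile-time dependencies are listed here.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "3.5.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.5.0" % "provided"
)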

7) Train the model (Spark job)

docker exec -it -u 0 spark-master /opt/spark/bin/spark-submit \
  --class ModelTrainer \
  --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  /opt/spark/work-dir/target/scala-2.12/FraudDetectionPipeline_2.12-1.0.jar
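
ModelTrainer.scala is the source of truth for how the model is built; for orientation, a Random Forest training job over this dataset typically has the shape sketched below. Column names follow the PaySim-style schema of the Kaggle CSV and should be checked against the actual code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

object ModelTrainerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ModelTrainer").getOrCreate()

    // Kaggle CSV as placed in step 2.
    val df = spark.read
      .option("header", "true").option("inferSchema", "true")
      .csv("/opt/spark/work-dir/src/main/scala/input_data/fraud.csv")

    // Encode the categorical transaction type, then assemble numeric features.
    val typeIndexer = new StringIndexer().setInputCol("type").setOutputCol("typeIdx")
    val assembler = new VectorAssembler()
      .setInputCols(Array("typeIdx", "amount", "oldbalanceOrg", "newbalanceOrig"))
      .setOutputCol("features")
    val rf = new RandomForestClassifier()
      .setLabelCol("isFraud").setFeaturesCol("features").setNumTrees(50)

    val Array(train, _) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = new Pipeline().setStages(Array(typeIndexer, assembler, rf)).fit(train)

    // The persisted pipeline is what FraudDetectionStreaming loads at inference time.
    model.write.overwrite().save("/opt/spark/work-dir/fraud-model")
    spark.stop()
  }
}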

8) Start streaming detection (Spark Structured Streaming)

docker exec -it -u 0 spark-master /opt/spark/bin/spark-submit \
  --class FraudDetectionStreaming \
  --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,mysql:mysql-connector-java:8.0.33 \
  --driver-memory 2g \
  --executor-memory 2g \
  /opt/spark/work-dir/target/scala-2.12/FraudDetectionPipeline_2.12-1.0.jar

9) Start API + Dashboard

Install Python dependencies:

pip install -r requirements.txt

Run:

python backend_api.py &
streamlit run dashboard_streamlit.py

Dashboard usually runs at http://localhost:8501 (Streamlit's default port).


Expected Outputs

  • MySQL: fraud alerts table (hot path); a sketch of the table-creation helper follows this list
  • Data Lake (Parquet): stored in data-lake/ (cold path)
  • Model artifacts: saved into fraud-model/ (if configured)
  • Streaming checkpoints: checkpoint/
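
The repository ships CreateMySQLTable.scala for provisioning the hot-path table. A sketch of what such a helper does is shown below; the host, database, table, and column names are placeholders, so take the real values from the actual source and docker-compose.yml.

import java.sql.DriverManager

object CreateMySQLTableSketch extends App {
  // Requires the MySQL JDBC driver (mysql-connector-java) on the classpath.
  // Placeholder host, database, and credentials.
  val conn = DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/fraud_db", "root", "changeme")
  try {
    conn.createStatement().executeUpdate(
      """CREATE TABLE IF NOT EXISTS fraud_alerts (
        |  id BIGINT AUTO_INCREMENT PRIMARY KEY,
        |  step INT,
        |  type VARCHAR(32),
        |  amount DOUBLE,
        |  prediction DOUBLE,
        |  detected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        |)""".stripMargin)
    println("fraud_alerts table ready")
  } finally conn.close()
}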

Troubleshooting

NiFi can’t publish to Kafka

  • If NiFi runs on Windows, use: localhost:9092
  • If NiFi runs inside Docker, use: kafka:29092
  • Verify Kafka topic exists and matches both NiFi and Spark configs

Spark can’t read from Kafka

  • Spark running inside Docker should use internal listener: kafka:29092

Missing files / paths

  • Ensure fraud.csv is placed exactly in: src/main/scala/input_data/fraud.csv

License

For academic use / coursework.

