This repository contains an end-to-end Big Data streaming pipeline for real-time fraudulent-transaction detection, combining Apache NiFi (ingestion), Apache Kafka (event buffering), Apache Spark in Scala (stream processing + ML inference), MySQL (hot storage for alerts), and Streamlit + Flask (dashboard & API).
- GitHub: https://github.com/Strawberry404/FraudDetection.git
- Dataset (Kaggle): https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data
Flow (high level):
- Dataset (Kaggle) → downloaded as CSV
- DataSplitter.py → prepares/splits the dataset into files ready for NiFi ingestion
- NiFi flow (JSON provided) → GetFile → SplitText (chunking) → SplitText (atomization) → PublishKafka
- Kafka topic buffers events
- Spark Streaming (Scala) consumes Kafka → feature engineering → ML inference (Random Forest)
- Outputs:
  - MySQL: only fraud alerts (hot path)
  - Data Lake (Parquet): all transactions (cold path)
- Flask API serves stats → Streamlit dashboard displays KPIs and charts
Project structure:
```
FraudDetection/
  docker-compose.yml
  build.sbt
  requirements.txt
  backend_api.py
  dashboard_streamlit.py
  nifi-flow.json
  src/
    main/
      scala/
        input_data/
          fraud.csv
        DataSplitter.py
        ModelTrainer.scala
        FraudDetectionStreaming.scala
        CreateMySQLTable.scala
        CreateOracleTable.scala
  data-lake/
  fraud-model/
  checkpoint/
  target/
```

Prerequisites:
- Docker + Docker Compose
- Java (for SBT / Scala build)
- SBT
- Python 3.9+
- Optional: NiFi installed locally or otherwise accessible (NiFi runs outside Docker in this setup)
Clone the repository:
```
git clone https://github.com/Strawberry404/FraudDetection.git
cd FraudDetection
```

Download the dataset from: https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data

Then place/rename the file as:
```
src/main/scala/input_data/fraud.csv
```
Note: if Kaggle provides a different filename, rename it to `fraud.csv` to match the project structure.
This script prepares and/or splits the dataset into files ready for NiFi ingestion.
Run the splitter:
```
python src/main/scala/DataSplitter.py
```

Start the Docker services:
```
docker-compose up -d
```

Quick checks:
- Spark UI: http://localhost:8080
- Kafka (external / Windows / NiFi): `localhost:9092`
- Kafka (internal / Docker / Spark): `kafka:29092`
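Before wiring NiFi, it can help to confirm the broker is reachable and that the topic exists. Below is a minimal sketch using the `kafka-clients` AdminClient; the topic name `transactions` and the partition/replication settings are assumptions — match them to your NiFi and Spark configuration.

```scala
// Hypothetical helper: verify the broker and create the (assumed) topic.
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

object TopicSetup {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // external listener
    val admin = AdminClient.create(props)
    val existing = admin.listTopics().names().get()
    println(s"Existing topics: $existing")
    if (!existing.contains("transactions")) {              // assumed topic name
      val topic = new NewTopic("transactions", 3, 1.toShort) // 3 partitions, RF 1
      admin.createTopics(Collections.singleton(topic)).all().get()
      println("Created topic 'transactions'")
    }
    admin.close()
  }
}
```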
A ready-to-use NiFi flow export is provided:
- File: `nifi-flow.json` (repo root)
Import steps (NiFi UI):
- Open NiFi canvas
- Use Upload / Import Flow Definition
- Select `nifi-flow.json`
- Drop the imported Process Group onto the canvas
After import, verify:
- `Kafka3ConnectionService` → Bootstrap servers: `localhost:9092` (if NiFi runs on Windows)
- The Kafka topic name matches the Spark consumer configuration
- The `GetFile` input directory points to the output generated by `DataSplitter.py`
Build the Spark jobs:
```
sbt clean compile package
```
The JAR is expected under:
```
target/scala-2.12/FraudDetectionPipeline_2.12-1.0.jar
```
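For reference, a `build.sbt` consistent with that artifact name might look like the sketch below. The Spark version mirrors the `--packages` flags used in the spark-submit commands, but the exact dependencies are assumptions — defer to the repo's actual `build.sbt`.

```scala
// Sketch only: settings consistent with the JAR name above; versions assumed.
name := "FraudDetectionPipeline"
version := "1.0"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "3.5.0" % "provided",
  "org.apache.spark" %% "spark-mllib"          % "3.5.0" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.5.0" % "provided",
  "mysql"             % "mysql-connector-java" % "8.0.33" % "provided"
)
```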
Train the model:
```
docker exec -it -u 0 spark-master /opt/spark/bin/spark-submit \
  --class ModelTrainer \
  --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  /opt/spark/work-dir/target/scala-2.12/FraudDetectionPipeline_2.12-1.0.jar
```
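For orientation, here is a minimal sketch of what a Random Forest trainer along these lines could look like. The input path and the PaySim-style feature columns (`type`, `amount`, `oldbalanceOrg`, `newbalanceOrig`, label `isFraud`) are assumptions based on the Kaggle dataset, not a copy of the repo's `ModelTrainer`.

```scala
// Hypothetical sketch of a Random Forest trainer; column names and paths are
// assumptions based on the Kaggle dataset, not the repo's actual ModelTrainer.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ModelTrainerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ModelTrainerSketch").getOrCreate()

    // Load the raw CSV and cast the label to double for the classifier.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("src/main/scala/input_data/fraud.csv")
      .withColumn("isFraud", col("isFraud").cast("double"))

    // Encode the categorical transaction type, then assemble numeric features.
    val typeIndexer = new StringIndexer().setInputCol("type").setOutputCol("typeIdx")
    val assembler = new VectorAssembler()
      .setInputCols(Array("typeIdx", "amount", "oldbalanceOrg", "newbalanceOrig"))
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("isFraud")
      .setFeaturesCol("features")
      .setNumTrees(50)

    // Train and persist the full pipeline so the streaming job can reload it.
    val model = new Pipeline().setStages(Array(typeIndexer, assembler, rf)).fit(df)
    model.write.overwrite().save("fraud-model")
    spark.stop()
  }
}
```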
Run the streaming job:
```
docker exec -it -u 0 spark-master /opt/spark/bin/spark-submit \
  --class FraudDetectionStreaming \
  --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,mysql:mysql-connector-java:8.0.33 \
  --driver-memory 2g \
  --executor-memory 2g \
  /opt/spark/work-dir/target/scala-2.12/FraudDetectionPipeline_2.12-1.0.jar
```
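As a rough guide to what `FraudDetectionStreaming` does, here is a hedged sketch of the hot/cold dual-sink pattern described in the flow above. The topic name `transactions`, the CSV schema, and the MySQL URL/table are all assumptions; adapt them to the actual class.

```scala
// Hypothetical sketch of the streaming job; topic, schema, and MySQL settings
// are assumptions, not the repo's actual FraudDetectionStreaming class.
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_csv}
import org.apache.spark.sql.types._

object FraudDetectionStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("FraudDetectionStreamingSketch").getOrCreate()

    // Assumed PaySim-style schema for the CSV lines NiFi publishes to Kafka.
    val schema = StructType(Seq(
      StructField("step", IntegerType), StructField("type", StringType),
      StructField("amount", DoubleType), StructField("nameOrig", StringType),
      StructField("oldbalanceOrg", DoubleType), StructField("newbalanceOrig", DoubleType),
      StructField("nameDest", StringType), StructField("oldbalanceDest", DoubleType),
      StructField("newbalanceDest", DoubleType)))

    val model = PipelineModel.load("fraud-model")

    val txs = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "kafka:29092") // internal Docker listener
      .option("subscribe", "transactions")              // assumed topic name
      .load()
      .selectExpr("CAST(value AS STRING) AS line")
      .select(from_csv(col("line"), schema, Map.empty[String, String]).as("tx"))
      .select("tx.*")

    // Score, then keep only plain columns (JDBC cannot store ML vector columns).
    val scored = model.transform(txs)
      .select(schema.fieldNames.map(col) :+ col("prediction"): _*)

    val query = scored.writeStream
      .option("checkpointLocation", "checkpoint")
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write.mode("append").parquet("data-lake")  // cold path: all transactions
        batch.filter(col("prediction") === 1.0)          // hot path: fraud alerts only
          .write.format("jdbc")
          .option("url", "jdbc:mysql://mysql:3306/fraud_db") // assumed host/db
          .option("dbtable", "fraud_alerts")                 // assumed table name
          .option("user", "root").option("password", "<password>")
          .mode("append").save()
      }
      .start()

    query.awaitTermination()
  }
}
```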
Install Python dependencies:
```
pip install -r requirements.txt
```
Run the API and dashboard:
```
python backend_api.py &
streamlit run dashboard_streamlit.py
```
The dashboard usually runs at http://localhost:8501 (Streamlit's default port).
Outputs:
- MySQL: fraud alerts table (hot path)
- Data Lake (Parquet): stored in `data-lake/` (cold path)
- Model artifacts: saved into `fraud-model/` (if configured)
- Streaming checkpoints: `checkpoint/`
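To sanity-check the cold path offline, you can read the Parquet lake with a local batch session — a minimal sketch; the `type` column is an assumption carried over from the Kaggle schema.

```scala
// Quick offline look at the cold path; the "type" column is an assumption.
import org.apache.spark.sql.SparkSession

object DataLakeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DataLakeCheck").master("local[*]").getOrCreate()
    val all = spark.read.parquet("data-lake")
    println(s"Transactions persisted: ${all.count()}")
    all.groupBy("type").count().show() // breakdown by transaction type
    spark.stop()
  }
}
```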
Troubleshooting:
- If NiFi runs on Windows, use `localhost:9092`
- If NiFi runs inside Docker, use `kafka:29092`
- Verify the Kafka topic exists and matches both the NiFi and Spark configs (see the consumer sketch after this list)
- Spark running inside Docker should use the internal listener `kafka:29092`
- Ensure `fraud.csv` is placed exactly at `src/main/scala/input_data/fraud.csv`
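If the dashboard stays empty, a quick way to confirm that NiFi is actually publishing is to poll the topic directly — a hedged sketch with `kafka-clients`; again, `transactions` is an assumed topic name.

```scala
// Hypothetical debugging helper: poll a few records straight from Kafka.
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object TopicCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // external listener
    props.put("group.id", "debug-check")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("transactions")) // assumed topic name
    val records = consumer.poll(Duration.ofSeconds(5))
    println(s"Fetched ${records.count()} records")
    records.forEach(r => println(r.value()))
    consumer.close()
  }
}
```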
For academic use / coursework.