This repository contains an end-to-end Big Data streaming pipeline for real-time fraudulent-transaction detection, combining Apache NiFi (ingestion), Apache Kafka (event buffering), Apache Spark in Scala (stream processing + ML inference), MySQL (hot storage for alerts), and Streamlit + Flask (dashboard & API).
- GitHub: https://github.com/Strawberry404/FraudDetection.git
- Dataset (Kaggle): https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data
Flow (high level):
- Dataset (Kaggle) → downloaded as CSV
- DataSplitter.py → prepares/splits the dataset into files ready for NiFi ingestion
- NiFi flow (JSON provided) → GetFile → SplitText (chunking) → SplitText (atomization) → PublishKafka
- Kafka topic buffers events
- Spark Streaming (Scala) consumes Kafka → feature engineering → ML inference (Random Forest)
- Outputs:
  - MySQL: only fraud alerts (hot path)
  - Data Lake (Parquet): all transactions (cold path)
- Flask API serves stats → Streamlit dashboard displays KPIs and charts
Project structure:
```
FraudDetection/
  docker-compose.yml
  build.sbt
  requirements.txt
  backend_api.py
  dashboard_streamlit.py
  nifi-flow.json
  src/
    main/
      scala/
        input_data/
          fraud.csv
        DataSplitter.py
        ModelTrainer.scala
        FraudDetectionStreaming.scala
        CreateMySQLTable.scala
        CreateOracleTable.scala
  data-lake/
  fraud-model/
  checkpoint/
  target/
```

Prerequisites:
- Docker + Docker Compose
- Java (for SBT / Scala build)
- SBT
- Python 3.9+
- Optional: NiFi installed locally or otherwise accessible (NiFi runs outside Docker in this setup)
Clone the repository:
```
git clone https://github.com/Strawberry404/FraudDetection.git
cd FraudDetection
```

Download the dataset from: https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data

Then place/rename the file as:
```
src/main/scala/input_data/fraud.csv
```
Note: if Kaggle provides a different filename, rename it to `fraud.csv` to match the project structure.
This script prepares and/or splits the dataset into files ready for NiFi ingestion.
Run the splitter:
```
python src/main/scala/DataSplitter.py
```

Start the Docker services:
```
docker-compose up -d
```

Quick checks:
- Spark UI: http://localhost:8080
- Kafka (external / Windows / NiFi): `localhost:9092`
- Kafka (internal / Docker / Spark): `kafka:29092`
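Before wiring NiFi, it can help to confirm the broker is reachable and that the topic exists. Below is a minimal sketch using the `kafka-clients` AdminClient; the topic name `transactions` and the partition/replication settings are assumptions — match them to your NiFi and Spark configuration.

```scala
// Hypothetical helper: verify the broker and create the (assumed) topic.
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

object TopicSetup {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // external listener
    val admin = AdminClient.create(props)
    val existing = admin.listTopics().names().get()
    println(s"Existing topics: $existing")
    if (!existing.contains("transactions")) {              // assumed topic name
      val topic = new NewTopic("transactions", 3, 1.toShort) // 3 partitions, RF 1
      admin.createTopics(Collections.singleton(topic)).all().get()
      println("Created topic 'transactions'")
    }
    admin.close()
  }
}
```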
A ready-to-use NiFi flow export is provided:
- File: `nifi-flow.json` (repo root)
Import steps (NiFi UI):
- Open NiFi canvas
- Use Upload / Import Flow Definition
- Select `nifi-flow.json`
- Drop the imported Process Group onto the canvas
After import, verify:
- `Kafka3ConnectionService` → Bootstrap servers: `localhost:9092` (if NiFi runs on Windows)
- The Kafka topic name matches the Spark consumer configuration
- The `GetFile` input directory points to the output generated by `DataSplitter.py`
Build the Spark jobs:
```
sbt clean compile package
```
The JAR is expected under:
```
target/scala-2.12/FraudDetectionPipeline_2.12-1.0.jar
```
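For reference, a `build.sbt` consistent with that artifact name might look like the sketch below. The Spark version mirrors the `--packages` flags used in the spark-submit commands, but the exact dependencies are assumptions — defer to the repo's actual `build.sbt`.

```scala
// Sketch only: settings consistent with the JAR name above; versions assumed.
name := "FraudDetectionPipeline"
version := "1.0"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "3.5.0" % "provided",
  "org.apache.spark" %% "spark-mllib"          % "3.5.0" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.5.0" % "provided",
  "mysql"             % "mysql-connector-java" % "8.0.33" % "provided"
)
```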
Train the model:
```
docker exec -it -u 0 spark-master /opt/spark/bin/spark-submit \
  --class ModelTrainer \
  --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  /opt/spark/work-dir/target/scala-2.12/FraudDetectionPipeline_2.12-1.0.jar
```
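For orientation, here is a minimal sketch of what a Random Forest trainer along these lines could look like. The input path and the PaySim-style feature columns (`type`, `amount`, `oldbalanceOrg`, `newbalanceOrig`, label `isFraud`) are assumptions based on the Kaggle dataset, not a copy of the repo's `ModelTrainer`.

```scala
// Hypothetical sketch of a Random Forest trainer; column names and paths are
// assumptions based on the Kaggle dataset, not the repo's actual ModelTrainer.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ModelTrainerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ModelTrainerSketch").getOrCreate()

    // Load the raw CSV and cast the label to double for the classifier.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("src/main/scala/input_data/fraud.csv")
      .withColumn("isFraud", col("isFraud").cast("double"))

    // Encode the categorical transaction type, then assemble numeric features.
    val typeIndexer = new StringIndexer().setInputCol("type").setOutputCol("typeIdx")
    val assembler = new VectorAssembler()
      .setInputCols(Array("typeIdx", "amount", "oldbalanceOrg", "newbalanceOrig"))
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("isFraud")
      .setFeaturesCol("features")
      .setNumTrees(50)

    // Train and persist the full pipeline so the streaming job can reload it.
    val model = new Pipeline().setStages(Array(typeIndexer, assembler, rf)).fit(df)
    model.write.overwrite().save("fraud-model")
    spark.stop()
  }
}
```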
Run the streaming job:
```
docker exec -it -u 0 spark-master /opt/spark/bin/spark-submit \
  --class FraudDetectionStreaming \
  --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,mysql:mysql-connector-java:8.0.33 \
  --driver-memory 2g \
  --executor-memory 2g \
  /opt/spark/work-dir/target/scala-2.12/FraudDetectionPipeline_2.12-1.0.jar
```
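As a rough guide to what `FraudDetectionStreaming` does, here is a hedged sketch of the hot/cold dual-sink pattern described in the flow above. The topic name `transactions`, the CSV schema, and the MySQL URL/table are all assumptions; adapt them to the actual class.

```scala
// Hypothetical sketch of the streaming job; topic, schema, and MySQL settings
// are assumptions, not the repo's actual FraudDetectionStreaming class.
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_csv}
import org.apache.spark.sql.types._

object FraudDetectionStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("FraudDetectionStreamingSketch").getOrCreate()

    // Assumed PaySim-style schema for the CSV lines NiFi publishes to Kafka.
    val schema = StructType(Seq(
      StructField("step", IntegerType), StructField("type", StringType),
      StructField("amount", DoubleType), StructField("nameOrig", StringType),
      StructField("oldbalanceOrg", DoubleType), StructField("newbalanceOrig", DoubleType),
      StructField("nameDest", StringType), StructField("oldbalanceDest", DoubleType),
      StructField("newbalanceDest", DoubleType)))

    val model = PipelineModel.load("fraud-model")

    val txs = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "kafka:29092") // internal Docker listener
      .option("subscribe", "transactions")              // assumed topic name
      .load()
      .selectExpr("CAST(value AS STRING) AS line")
      .select(from_csv(col("line"), schema, Map.empty[String, String]).as("tx"))
      .select("tx.*")

    // Score, then keep only plain columns (JDBC cannot store ML vector columns).
    val scored = model.transform(txs)
      .select(schema.fieldNames.map(col) :+ col("prediction"): _*)

    val query = scored.writeStream
      .option("checkpointLocation", "checkpoint")
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write.mode("append").parquet("data-lake")  // cold path: all transactions
        batch.filter(col("prediction") === 1.0)          // hot path: fraud alerts only
          .write.format("jdbc")
          .option("url", "jdbc:mysql://mysql:3306/fraud_db") // assumed host/db
          .option("dbtable", "fraud_alerts")                 // assumed table name
          .option("user", "root").option("password", "<password>")
          .mode("append").save()
      }
      .start()

    query.awaitTermination()
  }
}
```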
Install Python dependencies:
```
pip install -r requirements.txt
```
Run the API and dashboard:
```
python backend_api.py &
streamlit run dashboard_streamlit.py
```
The dashboard usually runs at http://localhost:8501 (Streamlit's default port).
Outputs:
- MySQL: fraud alerts table (hot path)
- Data Lake (Parquet): stored in `data-lake/` (cold path)
- Model artifacts: saved into `fraud-model/` (if configured)
- Streaming checkpoints: `checkpoint/`
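To sanity-check the cold path offline, you can read the Parquet lake with a local batch session — a minimal sketch; the `type` column is an assumption carried over from the Kaggle schema.

```scala
// Quick offline look at the cold path; the "type" column is an assumption.
import org.apache.spark.sql.SparkSession

object DataLakeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DataLakeCheck").master("local[*]").getOrCreate()
    val all = spark.read.parquet("data-lake")
    println(s"Transactions persisted: ${all.count()}")
    all.groupBy("type").count().show() // breakdown by transaction type
    spark.stop()
  }
}
```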
Troubleshooting:
- If NiFi runs on Windows, use `localhost:9092`
- If NiFi runs inside Docker, use `kafka:29092`
- Verify the Kafka topic exists and matches both the NiFi and Spark configs (see the consumer sketch after this list)
- Spark running inside Docker should use the internal listener `kafka:29092`
- Ensure `fraud.csv` is placed exactly at `src/main/scala/input_data/fraud.csv`
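If the dashboard stays empty, a quick way to confirm that NiFi is actually publishing is to poll the topic directly — a hedged sketch with `kafka-clients`; again, `transactions` is an assumed topic name.

```scala
// Hypothetical debugging helper: poll a few records straight from Kafka.
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object TopicCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // external listener
    props.put("group.id", "debug-check")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("transactions")) // assumed topic name
    val records = consumer.poll(Duration.ofSeconds(5))
    println(s"Fetched ${records.count()} records")
    records.forEach(r => println(r.value()))
    consumer.close()
  }
}
```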
For academic use / coursework.