This document outlines the architecture for a high-performance, scalable system designed to process a large stream of transaction data in near real-time. The system is composed of two primary, independent mechanisms, both running on Databricks:
• Mechanism X: A scheduled Databricks job that reads a master transaction log file from an AWS S3 bucket, breaks it into smaller, manageable chunks, and writes them to a Delta table in a landing zone.
• Mechanism Y: A continuous Databricks job that streams from the Delta table as new transaction chunks arrive, analyses them to detect specific, predefined patterns, and writes the results to a designated S3 bucket.
The architecture is designed around a Databricks-centric model, using Delta Lake both for reliable data transfer and for the stateful analysis tables.
Purpose
Mechanism X is responsible for breaking down a potentially massive, continuously growing source transaction file into small, consistently sized chunks using pandas on Databricks.
Process Flow
- Trigger: A Databricks Job, containing a notebook or script, is scheduled to run every 5 seconds.
- State Management: The job uses a simple mechanism, a small state file in S3 (offset.csv), to remember the last line number (offset) it processed from the source file.
- Chunk Creation: The job reads the source file, skips to the last known offset, and reads the next 10,000 transaction entries into a DataFrame.
- Output: The DataFrame is written as a chunk CSV file to the S3 landing zone (s3://your-bucket/landing_zone/).
- State Update: The job updates its state file with the new last processed line number. A minimal sketch of this flow follows this list.
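To make the chunking flow concrete, here is a minimal sketch of how Mechanism X could be implemented with pandas and boto3. The bucket, key names, the column layout of offset.csv, and the helper names are assumptions for illustration rather than the actual job code; reading s3:// paths directly with pandas also assumes s3fs is available on the cluster.

```python
# Hypothetical sketch of Mechanism X; names, paths, and the offset.csv layout
# are assumptions for illustration, not confirmed details of the actual job.
import io
import time

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "your-bucket"                  # placeholder bucket name from the doc
SOURCE_KEY = "source/transactions.csv"  # assumed location of the master transaction log
OFFSET_KEY = "state/offset.csv"         # the offset.csv state file
LANDING_PREFIX = "landing_zone/"
CHUNK_SIZE = 10_000


def read_offset() -> int:
    """Return the last processed line number, or 0 if no state exists yet."""
    try:
        body = s3.get_object(Bucket=BUCKET, Key=OFFSET_KEY)["Body"].read()
        return int(pd.read_csv(io.BytesIO(body))["offset"].iloc[0])
    except s3.exceptions.NoSuchKey:
        return 0


def write_offset(offset: int) -> None:
    """Persist the new last processed line number back to offset.csv."""
    buf = io.StringIO()
    pd.DataFrame({"offset": [offset]}).to_csv(buf, index=False)
    s3.put_object(Bucket=BUCKET, Key=OFFSET_KEY, Body=buf.getvalue().encode("utf-8"))


def run_once() -> None:
    offset = read_offset()
    # Skip already-processed rows (keeping the header) and read the next 10,000 entries.
    # Reading an s3:// path with pandas assumes s3fs is installed on the cluster.
    chunk = pd.read_csv(
        f"s3://{BUCKET}/{SOURCE_KEY}",
        skiprows=range(1, offset + 1),
        nrows=CHUNK_SIZE,
    )
    if chunk.empty:
        return  # nothing new to process yet
    chunk_key = f"{LANDING_PREFIX}chunk_{int(time.time())}.csv"
    buf = io.StringIO()
    chunk.to_csv(buf, index=False)
    s3.put_object(Bucket=BUCKET, Key=chunk_key, Body=buf.getvalue().encode("utf-8"))
    write_offset(offset + len(chunk))


if __name__ == "__main__":
    run_once()  # the scheduled Databricks job invokes this every 5 seconds
```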
Purpose
Mechanism Y is the core analytical engine. It runs as a continuous Databricks job, using Structured Streaming to process data as soon as it is available in the landing zone, perform complex pattern matching, and output actionable insights.
Process Flow
- Trigger: A continuous Databricks job is initiated, configured to use the landing zone chunks as a streaming source. It automatically detects and processes new data added by Mechanism X.
- Ingestion: The streaming job reads new transaction chunks from the S3 folder in micro-batches.
- Stateful Analysis: For each transaction in the chunk, the job reads from and writes to a Delta table. This table is crucial for maintaining the state required to evaluate the patterns across multiple chunks. The table stores aggregates such as:
  o Total transaction counts per merchant.
  o Total transaction counts for each customer with a specific merchant.
  o Running average transaction values for customers.
  o Gender-based customer counts for each merchant.
  o Data for calculating percentiles.
- Pattern Detection: The job evaluates the incoming transactions against the three patterns defined below.
- Buffering: Detections are collected within the Spark job. A foreachBatch sink is used to process the micro-batch of detections.
- Output: Within the foreachBatch operation, once 50 detections are collected, the job writes all 50 records to a single CSV file in the detections S3 bucket (s3://your-bucket/detections/), named uniquely, e.g., detections_1657886410.csv. A sketch of this streaming skeleton follows this list.
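The skeleton of Mechanism Y might look like the sketch below, assuming the chunk files land as CSV in the landing zone. The schema, table and path names, and the detect_patterns helper are placeholders; the real aggregation and pattern logic would run inside process_batch. Note that Spark writes CSV output as a directory of part files, so producing a single detections_<timestamp>.csv would need an extra rename or move step.

```python
# Minimal sketch of the Mechanism Y streaming job. Schema, table and path
# names, and detect_patterns() are assumptions for illustration only.
import time

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Assumed column layout of the chunk CSV files produced by Mechanism X.
chunk_schema = StructType([
    StructField("customer", StringType()),
    StructField("merchant", StringType()),
    StructField("amount", DoubleType()),
    StructField("gender", StringType()),
])

chunks = (
    spark.readStream
    .schema(chunk_schema)
    .option("header", "true")
    .csv("s3://your-bucket/landing_zone/")
)


def detect_patterns(batch_df: DataFrame) -> DataFrame:
    """Placeholder: merge the batch into the Delta state table and return detections."""
    # The real job updates the aggregate Delta table here and evaluates
    # PatId1-PatId3 against the refreshed aggregates.
    return batch_df.limit(0)


def process_batch(batch_df: DataFrame, batch_id: int) -> None:
    detections = detect_patterns(batch_df)
    # The real job buffers detections until 50 have accumulated; this sketch
    # simply flushes whatever the current micro-batch produced. Spark writes a
    # directory of part files, so a rename step is needed for a single CSV.
    if detections.count() > 0:
        (detections.coalesce(1)
            .write.mode("overwrite")
            .option("header", "true")
            .csv(f"s3://your-bucket/detections/detections_{int(time.time())}"))


(chunks.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "s3://your-bucket/checkpoints/mechanism_y/")
    .start()
    .awaitTermination())
```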
PatId1: UPGRADE
• ActionType: UPGRADE
• Conditions:
- The merchant's total number of transactions must exceed 50,000.
- The customer's total transaction count for that merchant must be in the top 10th percentile compared to all other customers of that same merchant.
- The customer's average transaction weight (averaged over all their transactions with that merchant) must be in the bottom 10th percentile. A PySpark sketch of these checks follows this pattern definition.
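Assuming the Delta state table exposes per-customer aggregates per merchant (here called customer_merchant_aggregates with columns txn_count and avg_weight, which are illustrative names), the PatId1 conditions could be expressed roughly as below, interpreting "top 10th percentile" as at or above the 90th-percentile cutoff and "bottom 10th percentile" as at or below the 10th-percentile cutoff.

```python
# Hedged sketch of the PatId1 checks; the table and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
agg = spark.table("customer_merchant_aggregates")  # merchant, customer, txn_count, avg_weight

# Per-merchant thresholds: total volume, 90th percentile of txn_count,
# and 10th percentile of avg_weight across that merchant's customers.
merchant_stats = agg.groupBy("merchant").agg(
    F.sum("txn_count").alias("merchant_total_txns"),
    F.expr("percentile_approx(txn_count, 0.9)").alias("txn_count_p90"),
    F.expr("percentile_approx(avg_weight, 0.1)").alias("avg_weight_p10"),
)

pat1 = (
    agg.join(merchant_stats, "merchant")
    .where(
        (F.col("merchant_total_txns") > 50_000)
        & (F.col("txn_count") >= F.col("txn_count_p90"))    # top 10th percentile
        & (F.col("avg_weight") <= F.col("avg_weight_p10"))  # bottom 10th percentile
    )
    .select(
        F.lit("PatId1").alias("patternId"),
        F.lit("UPGRADE").alias("actionType"),
        "customer",
        "merchant",
    )
)
```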
PatId2: CHILD
• ActionType: CHILD
• Conditions:
- The customer has made at least 80 transactions with the specific merchant.
- The customer's average transaction value for that merchant is less than ₹23.
PatId3: DEI-NEEDED
• ActionType: DEI-NEEDED
• Conditions:
- The total number of female customers for the merchant is greater than 100.
- The total number of female customers is less than the total number of male customers for that same merchant.
• Note: For this detection, customerName can be “” as the detection applies to the merchant, not an individual customer. A sketch of the PatId2 and PatId3 checks follows below.
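In the same spirit as the PatId1 sketch, the PatId2 and PatId3 conditions reduce to simple filters over the assumed aggregate tables (customer_merchant_aggregates and merchant_gender_counts are illustrative names, not confirmed table names).

```python
# Hedged sketch of the PatId2 and PatId3 checks; table and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
agg = spark.table("customer_merchant_aggregates")      # merchant, customer, txn_count, avg_amount
gender_counts = spark.table("merchant_gender_counts")  # merchant, female_customers, male_customers

# PatId2: at least 80 transactions with the merchant and an average value below Rs 23.
pat2 = (
    agg.where((F.col("txn_count") >= 80) & (F.col("avg_amount") < 23))
    .select(
        F.lit("PatId2").alias("patternId"),
        F.lit("CHILD").alias("actionType"),
        "customer",
        "merchant",
    )
)

# PatId3: more than 100 female customers, yet fewer female than male customers;
# customerName is left empty because the detection applies to the merchant.
pat3 = (
    gender_counts.where(
        (F.col("female_customers") > 100)
        & (F.col("female_customers") < F.col("male_customers"))
    )
    .select(
        F.lit("PatId3").alias("patternId"),
        F.lit("DEI-NEEDED").alias("actionType"),
        F.lit("").alias("customer"),
        "merchant",
    )
)
```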
• State Management: A Delta table accumulates the aggregates derived from incoming chunks, and PySpark analyses it for detections within the Databricks job. The chunk offset value is maintained in a CSV file in S3.
• Job Management: Both Databricks jobs need to be monitored for failures. Configure job failure notifications and retry policies appropriately; a sketch of such a configuration via the Jobs API follows below.
• Deployment: The entire infrastructure, including Databricks jobs and cloud resources, should be managed as code using a framework like Terraform for repeatable and reliable deployments.
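As an illustration of the job-management point, the sketch below shows a Databricks Jobs API 2.1 payload that enables failure e-mails and retries for the Mechanism X job. The workspace URL, token, notebook path, cluster ID, and e-mail address are placeholders, and the same settings could equally be expressed through Terraform's Databricks provider as part of the infrastructure-as-code setup.

```python
# Hypothetical Jobs API 2.1 payload enabling failure notifications and retries;
# every identifier below is a placeholder, not a value from this project.
import json

import requests

job_spec = {
    "name": "mechanism-x-chunker",
    "email_notifications": {"on_failure": ["data-team@example.com"]},
    "tasks": [
        {
            "task_key": "chunk_source_file",
            "notebook_task": {"notebook_path": "/Jobs/mechanism_x"},
            "existing_cluster_id": "<cluster-id>",
            "max_retries": 2,
            "min_retry_interval_millis": 10_000,
            "retry_on_timeout": True,
        }
    ],
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <token>"},
    data=json.dumps(job_spec),
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id on success
```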