Transaction Analytics Batch Job

Worflow Explaination

There is a mocker script which is generating random 100 transactions every second . This transactions will be stored temporarily in assets/temp folder. after every 1 seccond file will be opened in append mode after adding the transaction size of file is checked if size of files 6mb then this file will be uploaded in s3 and meta data of that file like file name and boolean 'is_processed=false' is added to postgres db.
Parallelly airflow is running which will schedule a dag after every 30 minutes.The task of dag is to submit the spark job using livy rest api.
After dag submit the spark job files will get loaded from hdfs like *.egg and some dependencies. First that job query a table from postgres named file_audit_entity and query the all files which is not processed yet. this include name of file_name .Then this file_names is used to get all the files from s3.After getting all the file this files will get combined in one .txt file this file will loaded into a dataframe after converting into dataframe we will perform some queries .
After performing queries writin this data into a pdf and uploading to back into s3 bucket. after uploading the pdf report updating the table attribute 'is_processed'=true

Project dependencies

Hadoop 2.9.x
Spark 3.3.0
Apache livy 0.7.1
Docker

Setup Project

This project used python 3.9.7 so we have to setup that first. there is script in scripts folder named as setup-python.bash run that script using bash script/setup-python.bash
Create and activate the pyenv enviroment using pyenv virtualenv 3.9.7 [env_name]
After we have install dependencies. using pip install -r requirements.txt --no-cache-dir
Now we have setup the airflow so we have run airflow-setup script using bach scripts/airflow-setup.bash
create a .env file in root of project and paste this

How to run the project

start all services
1. $HADOOP_HOME/start-all.sh
2. $SPARK_HOME/start-all.sh
3. $LIVY_HOME/bin/livy-server
start airflow using bash scripts/start-airflow.sh
run a postgres image in docker
run mock_stream_gpay_transaction_data_producer.py file
go to airflow and start the spark-job-submit dag

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
airflow/dags		airflow/dags
dist		dist
jobs		jobs
mock		mock
scripts		scripts
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

airflow/dags

airflow/dags

dist

dist

jobs

jobs

mock

mock

scripts

scripts

.gitignore

.gitignore

README.md

README.md

requirements.txt

requirements.txt

Repository files navigation

Transaction Analytics Batch Job

Worflow Explaination

Project dependencies

Setup Project

How to run the project

About

Releases

Packages

Contributors 2

Languages

ANGADJIT/airflow_batch_job

Folders and files

Latest commit

History

Repository files navigation

Transaction Analytics Batch Job

Worflow Explaination

Project dependencies

Setup Project

How to run the project

About

Resources

Stars

Watchers

Forks

Languages