Transaction Analytics Batch Job



Workflow Explanation

  1. A mocker script generates 100 random transactions every second. These transactions are stored temporarily in the assets/temp folder: each second the current file is opened in append mode and the new transactions are appended. After every write the file size is checked, and once it reaches roughly 6 MB the file is uploaded to S3 and its metadata (the file name and a boolean 'is_processed=false') is inserted into the Postgres database (a producer sketch follows this list).
  2. In parallel, Airflow schedules a DAG every 30 minutes. The DAG's task submits the Spark job through the Livy REST API (see the DAG sketch below).
  3. Once the DAG submits the Spark job, the job's files (the *.egg package and its other dependencies) are loaded from HDFS. The job first queries the file_audit_entity table in Postgres for all files that have not yet been processed, which gives their file_name values. Those file names are then used to download the corresponding files from S3. The downloaded files are combined into a single .txt file, which is loaded into a DataFrame, and a set of analytics queries is run on it (see the Spark job sketch below).
  4. The query results are written to a PDF report, which is uploaded back to the S3 bucket. After the report is uploaded, the 'is_processed' attribute of the processed files is set to true in the table.
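
For step 1, here is a minimal sketch of what such a producer loop could look like, assuming boto3 and psycopg2 handle the S3 upload and the metadata insert. The transaction fields, file naming, and any column names other than file_audit_entity / is_processed / file_name are illustrative assumptions, not the repo's actual code (which lives in mock_stream_gpay_transaction_data_producer.py).

```python
# Sketch of the producer loop from step 1 (illustrative, not the repo's code).
import json
import os
import random
import time

import boto3
import psycopg2

TEMP_DIR = "assets/temp"
FILE_SIZE_LIMIT = int(os.environ.get("FILE_SIZE", 6664093))  # ~6 MB threshold from .env

os.makedirs(TEMP_DIR, exist_ok=True)

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_KEY_ID"],
    region_name=os.environ["AWS_REGION"],
)
conn = psycopg2.connect(
    host=os.environ["HOST"],
    user=os.environ["USERNAME"],
    password=os.environ["PASSWORD"],
    dbname=os.environ["DATABASE"],
    port=os.environ["PORT"],
)

def make_transactions(n=100):
    # 100 random transactions per tick; the fields here are made up for illustration.
    return [
        {"txn_id": random.randint(1, 10**9), "amount": round(random.uniform(1, 500), 2)}
        for _ in range(n)
    ]

current_file = os.path.join(TEMP_DIR, f"transactions_{int(time.time())}.txt")

while True:
    # Append this second's batch of transactions to the temp file.
    with open(current_file, "a") as f:
        for txn in make_transactions():
            f.write(json.dumps(txn) + "\n")

    # Once the file crosses the size threshold, upload it to S3 and
    # record its metadata in Postgres with is_processed=false.
    if os.path.getsize(current_file) >= FILE_SIZE_LIMIT:
        file_name = os.path.basename(current_file)
        s3.upload_file(current_file, os.environ["DEV_BUCKET_NAME"], file_name)
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO file_audit_entity (file_name, is_processed) VALUES (%s, %s)",
                (file_name, False),
            )
        conn.commit()
        os.remove(current_file)
        current_file = os.path.join(TEMP_DIR, f"transactions_{int(time.time())}.txt")

    time.sleep(1)
```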

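For step 2, a hedged sketch of the 30-minute DAG submitting a batch through Livy's POST /batches endpoint. The Livy host/port, the HDFS paths, and the use of a plain PythonOperator with requests are assumptions; only the dag_id spark-job-submit is taken from the run instructions below.

```python
# Sketch of the scheduled DAG from step 2 (illustrative, not the repo's DAG).
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

LIVY_URL = "http://localhost:8998"  # assumed default Livy host/port

def submit_spark_job():
    # The file/pyFiles paths are placeholders for the job's HDFS artifacts.
    payload = {
        "file": "hdfs:///jobs/transaction_analytics/main.py",
        "pyFiles": ["hdfs:///jobs/transaction_analytics/deps.egg"],
        "name": "transaction-analytics-batch",
    }
    resp = requests.post(f"{LIVY_URL}/batches", json=payload,
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp.json()["id"]  # Livy batch id, useful for polling its state later

with DAG(
    dag_id="spark-job-submit",
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/30 * * * *",  # every 30 minutes
    catchup=False,
) as dag:
    PythonOperator(task_id="submit_spark_job", python_callable=submit_spark_job)
```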

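For steps 3 and 4, a sketch of the batch job using PySpark, the Postgres JDBC driver, and boto3. The example aggregation and the transaction column names are assumptions, and the PDF report plus the is_processed update are only indicated as comments.

```python
# Sketch of the batch job from steps 3-4 (illustrative only).
import os

import boto3
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transaction-analytics-batch").getOrCreate()

# Requires the Postgres JDBC driver on the Spark classpath.
jdbc_url = f"jdbc:postgresql://{os.environ['HOST']}:{os.environ['PORT']}/{os.environ['DATABASE']}"
jdbc_props = {"user": os.environ["USERNAME"], "password": os.environ["PASSWORD"],
              "driver": "org.postgresql.Driver"}

# 1. Find the files that have not been processed yet.
audit = spark.read.jdbc(jdbc_url, "file_audit_entity", properties=jdbc_props)
pending = [r["file_name"] for r in audit.filter(~F.col("is_processed")).collect()]

# 2. Download those files from S3 and combine them into a single .txt file.
s3 = boto3.client("s3")
bucket = os.environ["DEV_BUCKET_NAME"]
combined_path = "/tmp/combined_transactions.txt"
with open(combined_path, "w") as out:
    for file_name in pending:
        body = s3.get_object(Bucket=bucket, Key=file_name)["Body"].read().decode()
        out.write(body)

# 3. Load the combined file into a DataFrame and run the analytics queries
#    (one JSON transaction per line; the aggregation below is just an example).
df = spark.read.json(f"file://{combined_path}")
summary = df.groupBy("txn_type").agg(F.sum("amount").alias("total_amount"))
summary.show()

# 4. Write the results to a PDF report, upload it to S3, and set
#    is_processed=true for the handled files (omitted in this sketch).
```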
Project dependencies

  1. Hadoop 2.9.x
  2. Spark 3.3.0
  3. Apache Livy 0.7.1
  4. Docker

Setup Project

  1. This project uses Python 3.9.7, so set that up first. There is a script in the scripts folder named setup-python.bash; run it with bash scripts/setup-python.bash
  2. Create and activate a pyenv environment using pyenv virtualenv 3.9.7 [env_name] and pyenv activate [env_name]
  3. Next, install the dependencies using pip install -r requirements.txt --no-cache-dir
  4. Now set up Airflow by running the airflow-setup script: bash scripts/airflow-setup.bash
  5. Create a .env file in the root of the project and paste the following (a note on loading these values follows this list):
      AWS_ACCESS_KEY_ID=[your-access-key]
      AWS_SECRET_KEY_ID=[your-secret-key]
      AWS_REGION=[your-region]
      DEV_BUCKET_NAME=[your-s3-bucket-name]
      HOST=[your-host]
      USERNAME=[your-username]
      PASSWORD=[your-password]
      DATABASE=[your-database]
      PORT=[your-port]
      FILE_SIZE=6664093
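
The application presumably reads these values from the environment. A minimal sketch, assuming python-dotenv is available (not confirmed from requirements.txt):

```python
# Minimal sketch of loading the .env values (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root into os.environ

bucket_name = os.environ["DEV_BUCKET_NAME"]
file_size_limit = int(os.environ["FILE_SIZE"])  # 6664093 bytes, roughly 6 MB
```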

How to run the project

  1. Start all the services:
    1. $HADOOP_HOME/sbin/start-all.sh
    2. $SPARK_HOME/sbin/start-all.sh
    3. $LIVY_HOME/bin/livy-server
  2. Start Airflow using bash scripts/start-airflow.sh
  3. Run a Postgres image in Docker
  4. Run the mock_stream_gpay_transaction_data_producer.py file
  5. Go to the Airflow UI and start the spark-job-submit DAG (a sketch for checking the submitted Livy batches follows this list)
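
To verify that the DAG actually reached Livy, you can list the submitted batches through Livy's GET /batches endpoint; a small sketch, assuming Livy runs on its default port 8998:

```python
# Sketch: list Livy batches and their states to verify the DAG's submission.
import requests

LIVY_URL = "http://localhost:8998"  # assumed default Livy host/port

resp = requests.get(f"{LIVY_URL}/batches")
resp.raise_for_status()
for batch in resp.json().get("sessions", []):
    print(batch["id"], batch["state"])  # e.g. "starting", "running", "success"
```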
