About the project

This repository holds a Data Engineering project that aims to gather data from the Brazilian economic landscape. Initially, most of the data is about companies listed on B3 (Brasil, Bolsa, Balcão).

About the requirements

To carry out this project you will need:

  • KIND (basics only, for creating the cluster)
  • Docker (basics, to understand the cluster behavior and occasionally debug it)
  • kubectl (basics, for deployment and debugging)
  • AWS account (IAM role, security group, AWS CLI, credentials, Glue)
  • Python (modules, classes, functions)
  • PySpark
  • Airflow
  • Terraform

Project outline

The project consists of collecting data with Python applications and writing it to an S3 data lake, then performing the necessary data cleaning and transformations. Once the data is available in the data lake, it is turned into a data lakehouse using Delta and PySpark. The whole environment is based on Docker containers orchestrated by Kubernetes operators in Airflow.
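For illustration only, a minimal sketch of the "raw data to Delta" step is shown below. The bucket name and prefixes are placeholders, not the project's actual layout, and it assumes the Delta Lake (delta-spark) package and AWS credentials are available to Spark.

```python
# Minimal sketch of the "raw data -> Delta table" step.
# Assumptions: delta-spark is on the classpath, AWS credentials are configured,
# and the bucket/prefixes below are placeholders, not the project's real layout.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("b3-lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the raw files previously written to the data lake by the ingestion scripts.
raw_df = spark.read.option("header", True).csv("s3://my-datalake/raw/dfp/")

# Basic cleaning, then write the result as a Delta table (the lakehouse layer).
(
    raw_df.dropDuplicates()
    .write.format("delta")
    .mode("overwrite")
    .save("s3://my-datalake/delta/dfp/")
)
```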

Project Status

Phase 1 Ingestion to the Data Lake


  • Ingestion Script: Done
  • Cluster Config: Done
  • Python Docker Image: Done
  • Ingestion Deployment: Done

Phase 2 Data Discovery

This step is required to understand the data and start the analysis. The presentation below is just a first glance at the data: it shows the revenue from sales of goods and services by UF (a query sketch in that spirit follows the status list below).

  • SQL queries: Done
  • Python Script: Done
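For illustration only, the kind of aggregation behind that first glance might look like the sketch below; the Delta path and the column names (uf, sales_revenue) are hypothetical and will differ from the project's actual schema.

```python
# Hypothetical "revenue by UF" aggregation; the path and column names are
# assumptions, not the project's real schema. Requires the delta-spark package.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("b3-discovery-sketch").getOrCreate()

dfp = spark.read.format("delta").load("s3://my-datalake/delta/dfp/")

(
    dfp.groupBy("uf")                                     # Brazilian state
    .agg(F.sum("sales_revenue").alias("total_revenue"))   # revenue from goods/services
    .orderBy(F.desc("total_revenue"))
    .show()
)
```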

Phase 3 Data Lakehouse


  • Glue PySpark: Done
  • Terraform: Done
  • Data Lakehouse: In Progress

Phase 4 Airflow orchestration

  • Airflow setup: Pending
  • Create the DAGs: Pending

Phase 5 SODA Data Quality

  • Implementation: Pending

Phase 6 AWS RDS

  • Create DB: Pending
  • Data modeling: Pending

Phase 7 AWS EKS

  • Cluster creation: In progress
  • Replicate local cluster on EKS: Pending

Settings

The first thing you need is a local Kubernetes cluster; KIND was used for this task. KIND runs the cluster as Docker containers, so it is crucial to have KIND and Docker Desktop installed on your machine. KIND uses YAML files to build the cluster; take a look at b3_app\kind_b3.yaml to get familiar with the key:value pairs required for the cluster configuration.

Creating a new cluster on KIND

Make sure you are in the b3_app folder.
Run:

  • kind create cluster --name dfpcluster --config kind_b3.yaml
    As a result, you'll have two new containers running on Docker, both based on a kind image: the control plane and the worker. After that, you'll see them listed in Docker Desktop.

Setting up AWS credentials

You'll need to type your credentials into b3_app\src\.env, right after the "=" sign.
If you use a different region, adjust the region as well.
If the file is not in the folder structure, create it at b3_app\src\.env and add your credentials and region.
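If you want to confirm that the credentials in the file are being picked up, a quick check like the sketch below can help. The variable names (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION) and the file path are assumptions, so match them to what your .env actually contains.

```python
# Quick credential sanity check (not part of the project code).
# The variable names and the .env path below are assumptions.
import os

import boto3
from dotenv import load_dotenv

load_dotenv("b3_app/src/.env")  # adjust if your .env lives somewhere else

session = boto3.Session(
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    region_name=os.getenv("AWS_DEFAULT_REGION", "us-east-1"),
)

# Prints the account and ARN the credentials belong to if they are valid.
print(session.client("sts").get_caller_identity())
```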

Understanding the src folder structure

The src folder contains the scripts that extract data from the sources and write it into the data lake. objects.py holds the classes and functions used for that, and DFPgov is the file that calls functions from objects.py to carry out the ETL job.
In objects.py there is a variable called dotenv_path that you should be aware of. It points to the .env file you just adjusted. Note that the path is /workspaces/app/.env, which is the correct path inside the container. If you want to run the script locally, change that path to wherever your .env file lives.
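As a rough sketch of that pattern (not the actual objects.py code), dotenv_path can default to the in-container path and be overridden for local runs; the DOTENV_PATH environment variable below is a hypothetical convenience, not something the project necessarily implements.

```python
# Sketch of the dotenv_path pattern; the DOTENV_PATH override is an assumption
# for local runs, not necessarily how objects.py is written.
import os
from pathlib import Path

from dotenv import load_dotenv

# Inside the container the file sits at /workspaces/app/.env; locally you can
# point DOTENV_PATH at your own copy instead of editing the code.
dotenv_path = Path(os.getenv("DOTENV_PATH", "/workspaces/app/.env"))
load_dotenv(dotenv_path=dotenv_path)

aws_key = os.getenv("AWS_ACCESS_KEY_ID")  # assumed variable name
```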

Building and pushing a Docker image to Docker Hub

Your recently created Kubernetes cluster needs your images to be available in a container registry. The following steps build the image locally and then push it to Docker Hub:

(A Docker Hub account is required.)

  • docker build -t dfp:v1 .
  • docker login
  • docker tag dfp:v1 jcs7/dfp:v1
  • docker push jcs7/dfp:v1

(The image is pushed as public; make it private as soon as the push finishes, via the repository settings on Docker Hub.)

Adding a secret to the cluster to access your Docker Hub image

As this project accesses an AWS bucket, make sure you work only with private repositories so that your credentials are protected from unauthorized access.

  • kubectl create secret docker-registry regcred --docker-server="https://index.docker.io/v1/" --docker-username="" --docker-password="" --docker-email=""

The above command creates a new secret named regcred. This secret is referenced in the deployment YAML file (python-deploy.yaml).

Deploying your app

Within the src folder you can see python-deploy.yaml. This file holds the deployment configuration and can be adjusted if you need to scale out (for example, by increasing the number of replicas).

  • Note that replicas is set to 1 and imagePullSecrets is set to regcred, the name of the secret you just created.
  • kubectl apply -f python-deploy.yaml
  • kubectl get pods (you should see the pod with status Running; it may take a few minutes).
  • If you get a CrashLoopBackOff status, it is probably because Docker ran out of memory; restart your system and repeat these steps if needed. If you need to troubleshoot, the following commands may be handy: kubectl get pods to get the pod name, then kubectl describe pod <pod-name> to inspect it.