# Azure-Data-Pipelines

Industry Practicum Project on developing a scalable data pipeline using Microsoft Azure, Kubernetes and Pachyderm

Microsoft Azure data pipeline implementation for Crowe LLC.

Report: https://www.dropbox.com/s/89ski1ng4cae01v/Crowe%20Paper.pdf?dl=0

Presentation: https://www.dropbox.com/s/l0wf6xgd9wezo1a/Final%20Presentation%20Crowe.pptx?dl=0

Our client is one of the largest public accounting, consulting, and technology firms in the U.S., with a need to process massive amounts of healthcare data. In particular, the client's "Revenue Analytics" product gathers data from over 1,000 hospitals. As the business grows, its decentralized databases become increasingly fragmented, making it difficult to scale the infrastructure, use compute resources effectively, and manage failures and downtime. With this project, we consolidate this data in the cloud using Kubernetes and Blob storage, creating an easily accessible, robust, and centralized data and analysis platform that meets the needs of various stakeholders.

Moreover, this new data architecture overcomes the challenges of the client's current one, in which a series of ETL transformations push data into various database locations; previously, the client could not track the points of failure in its data pipelines. We solve this problem with Azure cloud services, Kubernetes, and Docker, which we use to design, develop, and track all data pipelines. These pipelines feed a centralized "SQL Data Hub" that drives all decision-making processes and powers enhanced BI reporting.

## Understanding the various files

Setup-Environment.sh - Sets up the infrastructure on Azure (creates the resource group, storage account, and Blob container; installs Pachyderm; creates the ACR).
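The script itself is not reproduced in this README. As a hedged sketch of the steps listed above (the resource names are placeholders, and the Pachyderm deploy command follows the 1.x CLI), the provisioning might look like:

```shell
#!/usr/bin/env bash
# Sketch only: all names below are illustrative placeholders, not the project's values.
set -euo pipefail

RG=crowe-pipeline-rg
LOCATION=eastus
STORAGE=crowepipelinestore
CONTAINER=crowe-input
ACR=crowepipelineacr

# Resource group, storage account, and Blob container
az group create --name "$RG" --location "$LOCATION"
az storage account create --name "$STORAGE" --resource-group "$RG" \
  --location "$LOCATION" --sku Standard_LRS
az storage container create --name "$CONTAINER" --account-name "$STORAGE"

# Container registry for the pipeline images
az acr create --name "$ACR" --resource-group "$RG" --sku Basic

# Deploy Pachyderm backed by Blob storage (Pachyderm 1.x syntax:
# pachctl deploy microsoft <container> <account-name> <account-key> <disk-size-GB>)
STORAGE_KEY=$(az storage account keys list --account-name "$STORAGE" \
  --resource-group "$RG" --query '[0].value' -o tsv)
pachctl deploy microsoft "$CONTAINER" "$STORAGE" "$STORAGE_KEY" 16
```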

Setup-SQL.sh - This file is used to create the SQL Server and Database on Azure.
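Setup-SQL.sh is likewise not reproduced here; a minimal sketch using the Azure CLI (the server, database, and credential values are placeholders) might be:

```shell
# Sketch only: all names and credentials below are placeholders.
az sql server create --name crowe-sql-server --resource-group crowe-pipeline-rg \
  --location eastus --admin-user croweadmin --admin-password '<admin-password>'
az sql db create --name crowe-datahub --resource-group crowe-pipeline-rg \
  --server crowe-sql-server --service-objective S0
```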

pachyderm_blob - This folder contains the following artifacts:

- crowe_input.py - Python code that fetches files from Blob storage and stores them in the Pachyderm input repository.
- Dockerfile & requirements.txt - The Dockerfile and the list of dependencies to install.
- crowe_input_json.json - The Pachyderm pipeline configuration used to launch Docker containers on Kubernetes.
- serviceprincipal.sh - Creates the service principal that gives the Kubernetes cluster access to ACR.
- acr.sh - Fetches the ACR credentials.

pachyderm_validation - This folder contains the following artifacts:

- chg_validate.py - Python code implementing the validation logic and the insertion of validated data into SQL.
- Dockerfile & requirements.txt - The Dockerfile and the list of dependencies to install.
- crowe_input_chg.json - The Pachyderm pipeline configuration used to launch Docker containers on Kubernetes.
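The two JSON files above (crowe_input_json.json and crowe_input_chg.json) are Pachyderm pipeline specifications. Their exact contents are not shown in this README; a minimal Pachyderm 1.x spec of the same general shape (the pipeline name, image path, and repo name below are placeholders) looks like:

```json
{
  "pipeline": { "name": "crowe_input" },
  "transform": {
    "image": "<acr-login-server>/crowe_input:latest",
    "cmd": ["python3", "/crowe_input.py"]
  },
  "input": {
    "pfs": { "repo": "crowe_input", "glob": "/*" }
  }
}
```

Pachyderm reads such a spec (e.g. with `pachctl create pipeline -f crowe_input_json.json`) and launches the named Docker image as containers on the Kubernetes cluster, mounting the input repo's files for the pipeline code to process.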

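The validation logic inside chg_validate.py is not shown in this README. As a hedged illustration of what record-level validation ahead of a SQL insert might look like (the field names, rules, and function names here are hypothetical, not the client's actual schema):

```python
# Hypothetical sketch of the kind of validation chg_validate.py could perform.
# Field names and rules are illustrative, not the client's actual schema.

REQUIRED_FIELDS = ("hospital_id", "charge_code", "amount")

def validate_record(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in record or record[field] in ("", None):
            errors.append(f"missing required field: {field}")
    amount = record.get("amount")
    if amount is not None:
        try:
            if float(amount) < 0:
                errors.append("amount must be non-negative")
        except (TypeError, ValueError):
            errors.append("amount must be numeric")
    return errors

def partition_records(records):
    """Split records into (valid, rejected) ahead of the SQL insert step."""
    valid, rejected = [], []
    for rec in records:
        (valid if not validate_record(rec) else rejected).append(rec)
    return valid, rejected
```

In the real pipeline, the valid records would then be inserted into the Azure SQL database (e.g. via a driver such as pyodbc) and the rejected records logged for review.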