COVID19 Data Pipeline: An end-to-end ETL data pipeline that fetches daily and weekly COVID-19 data from an API, transforms it, and loads it into a SQL database using Azure services.

rashmi0007/health_data

Covid-19 Health Care Data Engineering Project

About Project:

Created a Data Warehouse of COVID-19 data covering Cases & Deaths, Hospital Admissions and more, and developed a complete Data Pipeline using Azure Data Factory & Databricks. Data Visualization was done in Power BI.

Solution Architecture:

Covid19_DataFlow_diagram

Getting Started

  1. Cloned the project repository from GitHub.

  2. The step above can be skipped by fetching the data directly from the ECDC API.

  3. Developed a Data Pipeline in Azure Data Factory:

    ◾ Copied data from GitHub into Azure Blob Storage.

copyActivity

CopySuccess

   ◾ Processed the data by applying the required transformations using:

       ▪ Dataflows in Data Factory

Hospital_Datafloow

admission_hospitalFlowData

       ▪ PySpark in Azure Databricks, which also writes the data to Azure SQL DB.
  4. Created a Data Lake to store raw and processed data.

  5. Developed a Data Warehouse in Azure SQL DB (DDL Command & Pyspark_code_in_SQL) and masked the sensitive data using PySpark functionality (Pyspark code).

  6. To get insights from the data, loaded it from SQL DB into Power BI Desktop.

Health_care_report
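
The fetch, mask and load steps above can be sketched in plain Python. This is only a minimal illustration, not the code used in this repository: the function names, the sample column `reporting_contact`, the salt, and the server/database/table names are all hypothetical. The JDBC URL follows the usual Azure SQL connection-string layout and the options are meant for Spark's generic JDBC writer.

```python
import csv
import hashlib
import io


def parse_covid_csv(csv_text):
    """Parse CSV text (for example an API response body) into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def mask_value(value, salt="demo-salt"):
    """Return a one-way SHA-256 hex digest of the salted value.

    Hypothetical masking rule; PySpark's built-in
    pyspark.sql.functions.sha2(col, 256) offers a comparable column-level hash.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def azure_sql_jdbc_options(server, database, table, user, password):
    """Build an options dict for Spark's JDBC writer (all argument values are placeholders)."""
    url = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};encrypt=true;trustServerCertificate=false;"
        "hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
    )
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }


if __name__ == "__main__":
    # Made-up sample data standing in for a downloaded CSV file.
    sample = "country,reporting_contact,cases\nFrance,alice@example.org,120\n"
    rows = parse_covid_csv(sample)
    for row in rows:
        row["reporting_contact"] = mask_value(row["reporting_contact"])

    opts = azure_sql_jdbc_options(
        "myserver", "covid_dw", "dbo.cases", "etl_user", "<secret>"
    )
    print(rows[0]["country"], opts["dbtable"])
    # In a Databricks notebook, a DataFrame `df` could then be written with:
    # df.write.format("jdbc").options(**opts).mode("append").save()
```

In a real workspace the password would normally come from a secret store rather than a literal, and the masking rule would depend on which columns are considered sensitive.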

Services Used:

◽ Azure Data Factory (Dataflows, Linked Services, Triggers, Azure Databricks)

◽ Azure Blob Storage

◽ Azure Data Lake Storage Gen 2

◽ Azure SQL DB
