Daily Covid Data Analysis on Serverless Spark

1. Overview

With the advent of cloud environments, large up-front investments in infrastructure, both in capital and in maintenance, are largely a thing of the past. Even so, provisioning infrastructure on cloud services can still be tedious and cumbersome.

In this example, you will run a simple Python script as a Serverless batch (a fully managed Dataproc cluster). It is similar to running code on a Dataproc cluster, but without the need to initialize, deploy, or manage the underlying infrastructure.
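As a rough illustration of what submitting such a batch involves, the sketch below assembles a `gcloud dataproc batches submit pyspark` command from the lab variables described in the Checklist. It only builds and prints the command; it does not run it. The helper name `build_batch_submit_command`, the script path, and the sample values are hypothetical placeholders, not names taken from this repository.

```python
import shlex


def build_batch_submit_command(main_script, project_id, region, subnet,
                               service_account, history_server_name, batch_id):
    """Build (but do not run) a gcloud command that submits a PySpark batch.

    All arguments are caller-supplied; nothing here is read from the lab itself.
    """
    # The history server is referenced by its full cluster resource name.
    history_server = (
        f"projects/{project_id}/regions/{region}/clusters/{history_server_name}"
    )
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", main_script,
        f"--project={project_id}",
        f"--region={region}",
        f"--subnet={subnet}",
        f"--service-account={service_account}",
        f"--history-server-cluster={history_server}",
        f"--batch={batch_id}",
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cmd = build_batch_submit_command(
        main_script="gs://my-code-bucket/scripts/my_job.py",
        project_id="my-project",
        region="us-central1",
        subnet="my-subnet",
        service_account="my-umsa@my-project.iam.gserviceaccount.com",
        history_server_name="my-history-server",
        batch_id="covid-analysis-batch-1",
    )
    print(shlex.join(cmd))
```

Keeping the command as a list of arguments makes it easy to pass to `subprocess.run` later, or to print and paste into Cloud Shell.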

This repository collects information to assess the economic impact of Covid-19 on the stock market for more than 1850 companies in 50 different countries around the world. It includes a dataset of more than 2.5 million rows containing each company's stock market data going back 400 days, which shows the stock values before the outbreak and the impact during the pandemic. It also includes a file produced by the Oxford Coronavirus Government Response Tracker (OxCGRT project), which provides a Government Response Stringency Index for each country. The index indicates how stringent each government's response was in reducing the impact of Covid-19 (https://ourworldindata.org/policy-responses-covid#government-stringency-index).

2. Services Used

Google Cloud Storage
Google Cloud Dataproc

3. Permissions / IAM Roles required to run the lab

The following permissions / roles are required to execute the serverless batch:

Viewer
Dataproc Editor
Service Account User
Storage Admin
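The roles above can be granted with `gcloud projects add-iam-policy-binding`. The sketch below maps each display name to its predefined role ID and generates one command per role. Which principal should receive the roles (your user account or the service account) is not stated here, so `member` is left to the caller; the helper name `iam_binding_commands` and the sample values are hypothetical.

```python
# Display names from the list above mapped to predefined IAM role IDs.
ROLES = {
    "Viewer": "roles/viewer",
    "Dataproc Editor": "roles/dataproc.editor",
    "Service Account User": "roles/iam.serviceAccountUser",
    "Storage Admin": "roles/storage.admin",
}


def iam_binding_commands(project_id, member):
    """Return one gcloud command string per required role.

    `member` uses the IAM principal syntax, for example
    "user:someone@example.com" or "serviceAccount:name@project.iam.gserviceaccount.com".
    """
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member={member} --role={role}"
        for role in ROLES.values()
    ]


if __name__ == "__main__":
    # Hypothetical project and principal, for illustration only.
    for command in iam_binding_commands("my-project", "user:someone@example.com"):
        print(command)
```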

4. Checklist

To perform the lab, complete the following activities:

1. GCP Prerequisites
2. Spark History Server Setup
3. Uploading scripts and datasets to GCP
4. Create a Cloud Composer Environment

Note down the values for the variables below to get started with the lab:

PROJECT_ID=                                         #Current GCP project where we are building our use case
REGION=                                             #GCP region where all our resources will be created
SUBNET=                                             #subnet which has private google access enabled
BUCKET_CODE=                                        #GCP bucket where our code, data and model files will be stored
BUCKET_PHS=                                         #bucket where our application logs created in the history server will be stored
HISTORY_SERVER_NAME=                                #name of the history server which will store our application logs
UMSA_NAME=                                          #user managed service account required for the PySpark job executions
SERVICE_ACCOUNT=$UMSA_NAME@$PROJECT_ID.iam.gserviceaccount.com
NAME=<your_name_here>                               #Your Unique Identifier
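If you drive the lab from Python rather than the shell, a small check like the one below can catch unset variables early. It is a minimal sketch that assumes the variables above have been exported into the environment; the helper name `load_lab_config` is hypothetical. `SERVICE_ACCOUNT` is derived the same way as in the list above.

```python
import os

# Variables the lab asks you to note down (SERVICE_ACCOUNT is derived below).
REQUIRED = [
    "PROJECT_ID", "REGION", "SUBNET", "BUCKET_CODE", "BUCKET_PHS",
    "HISTORY_SERVER_NAME", "UMSA_NAME", "NAME",
]


def load_lab_config(env=None):
    """Read the lab variables from `env` (defaults to os.environ).

    Raises ValueError listing every variable that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing lab variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    # Same derivation as SERVICE_ACCOUNT=$UMSA_NAME@$PROJECT_ID.iam.gserviceaccount.com
    config["SERVICE_ACCOUNT"] = (
        f"{config['UMSA_NAME']}@{config['PROJECT_ID']}.iam.gserviceaccount.com"
    )
    return config
```

Passing a plain dictionary as `env` makes the function easy to try without touching your real environment.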

5. Lab Modules

The lab consists of the following modules.

Understand the Data
Solution Architecture
Executing the notebook
Examine the logs
Explore the output

There are 3 ways of performing the lab.

Using Google Cloud Shell
Using GCP console
Using Airflow

Please choose one of the methods to execute the lab.

6. CleanUp

Delete the resources after finishing the lab.
Refer to Cleanup.
