This exercise is about data organization, orchestration, and coding, including creating Docker images.
The aims of this exercise are to:
- Evaluate your coding (e.g. data reorganization)
- Understand your data orchestration capabilities (e.g. Docker)
- Understand how you design a solution (overall thinking)
The data is daily COVID-19 case data from the United States, located at https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports_us as a series of CSV files, one for each day. Information about the data is available at https://github.com/CSSEGISandData/COVID-19.
Put your code in a public repository hosted on GitHub, or in a private repository with muschellij2 added as a collaborator with read access.
Note: the goal is the solution. If any step below poses an unreasonable challenge at any time and you need to reach the end result in a different way due to time constraints, please communicate that.
- Create a GitHub repository for this exercise.
- Create a Docker image that reads in the data from the day before and compiles a report/printout of the cases for the day before. If the data is not there, print a diagnostic message. If you are using R, you can use the rocker images as a base: https://github.com/rocker-org/rocker-versioned2
- Set up GitHub Actions to build this Docker image
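A minimal workflow for the image build might look like the following; the workflow name, branch, and the choice of GitHub Container Registry (ghcr.io) as the push target are assumptions, and it expects a Dockerfile at the repository root:

```yaml
name: build-image
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:latest
```

Pushing to ghcr.io with the built-in `GITHUB_TOKEN` avoids managing separate registry credentials.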
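As a sketch of the "read yesterday's data" step, the daily report files in the JHU repository appear to be named `MM-DD-YYYY.csv`, and the raw-content base URL below is assumed from the repository layout; verify both against the repo before relying on them:

```python
import datetime
import urllib.request
import urllib.error
from typing import Optional

# Base raw-content URL for the daily US reports (assumed from the repo layout).
BASE_URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
            "csse_covid_19_data/csse_covid_19_daily_reports_us/")

def report_filename(day: datetime.date) -> str:
    """Daily report files are named MM-DD-YYYY.csv."""
    return day.strftime("%m-%d-%Y") + ".csv"

def fetch_yesterday(today: Optional[datetime.date] = None) -> Optional[str]:
    """Return the CSV text for yesterday's report, or None after a diagnostic."""
    today = today or datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    url = BASE_URL + report_filename(yesterday)
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Diagnostic message for when the file has not been published yet.
        print(f"No report found for {yesterday} at {url} (HTTP {err.code})")
        return None
```

A script like this (or an R equivalent) would be the entrypoint of the Docker image, so the container does one run of the report and exits.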
Using this Docker image:
- filter rows that are only in the United States,
- take the mean cases (Confirmed variable) and deaths (Deaths) by state, averaging over counties (Admin2). Print this out in the action.
- Append the results to a file from the previous days’ results.
- run this pipeline on a schedule (daily) using GitHub Actions.
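For the daily schedule, a workflow along these lines would run the container each morning; the cron time, image name, and results filename are placeholders to adapt:

```yaml
name: daily-report
on:
  schedule:
    - cron: "30 6 * * *"   # every day at 06:30 UTC
  workflow_dispatch: {}    # allow manual runs for debugging
jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run pipeline in the image
        run: docker run --rm -v "$PWD:/work" ghcr.io/${{ github.repository }}:latest
      - name: Commit appended results
        run: |
          git config user.name github-actions
          git config user.email github-actions@github.com
          git add results.csv
          git commit -m "Daily results" || echo "Nothing to commit"
          git push
```

Note that GitHub's cron schedules run in UTC and scheduled jobs can start late during busy periods, so the run time should leave slack after the upstream data is published.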
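The aggregation steps above could be sketched as follows in pandas; the column names (Country_Region, Province_State, Admin2, Confirmed, Deaths) follow the JHU CSV schema, the sample data is made up for illustration, and the history filename is an arbitrary choice:

```python
import os
import pandas as pd

def summarize_by_state(df: pd.DataFrame) -> pd.DataFrame:
    """Mean Confirmed and Deaths per state, averaged over counties (Admin2)."""
    us = df[df["Country_Region"] == "US"]  # keep rows that are in the US only
    return (us.groupby("Province_State")[["Confirmed", "Deaths"]]
              .mean()
              .reset_index())

# Illustrative toy data using the JHU column names (values are made up).
sample = pd.DataFrame({
    "Country_Region": ["US", "US", "US", "Canada"],
    "Province_State": ["Maryland", "Maryland", "Virginia", "Ontario"],
    "Admin2": ["Baltimore", "Howard", "Fairfax", ""],
    "Confirmed": [100, 50, 80, 30],
    "Deaths": [4, 2, 3, 1],
})

result = summarize_by_state(sample)
print(result)

# Append to a running history file (write the header only when the file is new).
result.to_csv("history.csv", mode="a",
              header=not os.path.exists("history.csv"), index=False)
```

In the real pipeline a date column would be added before appending, so each day's rows in the history file are distinguishable.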
Please provide a half/full page description (either separate or in a README) of:
- the challenges in getting this up and running
- improvements you’d make in this pipeline if more time were available, or any issues with the solution and how you’d perform checks on it
- additional cleaning you would consider performing on a data set like this.