# COVID-19 Data Exploration for Beginners

<img src="https://www.scientificanimations.com/wp-content/uploads/2020/01/3D-medical-animation-coronavirus-structure.jpg" width="800" height="400">
<font size="2">Figure 1: COVID-19 Virus structure</font>
<br>
Image credits: [Scientific Animations](https://www.scientificanimations.com/wiki-images/) under a [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license

# Version Control

| Version Number | Version Description | Creator | Date |
| -- | -- | -- | -- |
| 0.1 | Draft version | Thabor Walbeek | 05 April 2020 |

## Introduction

Due to the COVID-19 outbreak, many people would like to understand how the available data is analyzed. Many public sites offer dashboards, analyses and notebooks.

For those new to the analysis, this notebook will guide them through some basic steps to start an exploratory analysis on available data sets and gain their first insights.

The notebook follows simple step-by-step guidance for retrieving the data sets and understanding what is important when starting an Exploratory Data Analysis (EDA). Questions are asked along the way to help you understand why each step is executed and what matters when looking at data.

If you're not familiar with tools like R or Python, don't worry. You can easily download the files (data sets) to your local machine and perform similar actions in tools like MS Excel. Where possible, examples will show how to perform the same actions in MS Excel.

-------------

Before starting any analysis, it's important to understand what kind of data we are looking at. This means that we first need to go through some background information. In our analysis we will be looking at the outbreak of the COVID-19 virus, so we first need to understand what this virus is and how the data is collected.

**Read the information behind these links before continuing:**

[WHO definition of COVID-19 virus: Get more understanding of what the COVID-19 virus is](https://www.who.int/health-topics/coronavirus#tab=tab_1)


**Data sets used in this analysis can be found on GitHub:**

[CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series)
<br>
2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE.
This data set is updated on a daily basis by Johns Hopkins CSSE.

-----------------

The best way to start analyzing data is to follow this notebook chapter by chapter. That will help in understanding all steps of the analysis. If you want to go back to another chapter, use the Table of Contents to skip to that section.

# Table of Contents

* [1. Start](#1start)
* [1.1 Start](#11start)
* [1.1.1 Start](#111start)

# 0. Preparing

The following steps must be run so that the Python functions used for visualization and analysis are available. If you're not familiar with Python, please ignore these steps and continue with [1. Start](#1start).

## 0.1 Load Python packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 0.2 Load the data sets into the environment

In [2]:
confirmed = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
deaths = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
recovered = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv")
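
For readers who prefer MS Excel, as mentioned in the introduction, a loaded data set can also be written to a local CSV file. The following is a minimal sketch, not part of the original notebook: the helper name `save_for_excel`, the toy table `confirmed_sample` and its values are assumptions made for illustration. In the notebook itself you would pass `confirmed`, `deaths` or `recovered` instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for a loaded data set; the values are made up for illustration.
confirmed_sample = pd.DataFrame(
    {
        "Province/State": [None, "Example Province"],
        "Country/Region": ["Country A", "Country B"],
        "Lat": [10.0, 20.0],
        "Long": [30.0, 40.0],
        "1/22/20": [0, 1],
        "1/23/20": [2, 3],
    }
)


def save_for_excel(df: pd.DataFrame, path: Path) -> Path:
    """Write a data set to a CSV file that MS Excel can open (hypothetical helper)."""
    df.to_csv(path, index=False)  # index=False leaves out the pandas row numbers
    return path


# A temporary folder is used here; in practice you would pick your own folder.
output_dir = Path(tempfile.mkdtemp())
saved_path = save_for_excel(confirmed_sample, output_dir / "confirmed.csv")
print(saved_path)
```

The printed path shows where the file was written, so it can be opened from MS Excel.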

<a id='1start'></a>
# 1. Start

In the previous step we loaded 3 .csv files into this environment, so we can have a look at them and start analyzing.

The 3 files are time-series files, which means they contain day-by-day data per country on the registrations of:

- Confirmed cases (dataset name: **'confirmed'**)
- Deaths (dataset name: **'deaths'**)
- Recovered cases (dataset name: **'recovered'**)

Even though we have loaded the files into the environment, we still have no idea what the data actually looks like, so we start by exploring the data at a high level.


<a id='11start'></a>
# 1.1 Looking into the data set

The first step is to get an understanding of the data sets. Every data analysis starts this way: we need to know not only what kind of data we are looking at, but also what the data means and, moreover, whether we have enough data.
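
One simple way to approach the "enough data" question is to count the rows, columns and empty cells of a data set. The sketch below is an illustration only: `describe_size` is a hypothetical helper, and the toy table with made-up values stands in for the real data sets. In the notebook you could call the helper on `confirmed`, `deaths` and `recovered`.

```python
import pandas as pd


def describe_size(df: pd.DataFrame) -> dict:
    """Return the number of rows, columns and empty cells (hypothetical helper)."""
    rows, columns = df.shape
    return {
        "rows": rows,
        "columns": columns,
        "empty_cells": int(df.isna().sum().sum()),  # total count of missing values
    }


# Toy table with made-up values, only used to show the helper at work.
toy = pd.DataFrame(
    {"name": ["a", "b", None], "x": [1, 2, 3], "y": [0.5, None, 1.5]}
)
size = describe_size(toy)
print(size)
```

Whether the counts are "enough" depends on the question being asked; the numbers only tell you how much material there is to work with.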

In the next cell we will run code to retrieve the column names of the data set **'confirmed'**. Run the following cell and then answer the question in the [next cell block](#11question).

In [3]:
for col in confirmed.columns: 
    print(col) 

Province/State
Country/Region
Lat
Long
1/22/20
1/23/20
1/24/20
1/25/20
1/26/20
1/27/20
1/28/20
1/29/20
1/30/20
1/31/20
2/1/20
2/2/20
2/3/20
2/4/20
2/5/20
2/6/20
2/7/20
2/8/20
2/9/20
2/10/20
2/11/20
2/12/20
2/13/20
2/14/20
2/15/20
2/16/20
2/17/20
2/18/20
2/19/20
2/20/20
2/21/20
2/22/20
2/23/20
2/24/20
2/25/20
2/26/20
2/27/20
2/28/20
2/29/20
3/1/20
3/2/20
3/3/20
3/4/20
3/5/20
3/6/20
3/7/20
3/8/20
3/9/20
3/10/20
3/11/20
3/12/20
3/13/20
3/14/20
3/15/20
3/16/20
3/17/20
3/18/20
3/19/20
3/20/20
3/21/20
3/22/20
3/23/20
3/24/20
3/25/20
3/26/20
3/27/20
3/28/20
3/29/20
3/30/20
3/31/20
4/1/20
4/2/20
4/3/20
4/4/20


<a id='11question'></a>
`Answer the following question:`

**Q1: What is the format for the majority of the columns?**

1. Country Information in text format
2. Date information in date format
3. None of the above

In [4]:
# Answer the question by running this cell block, and give the number of the correct answer and press ENTER:
answer11 = input()

1


------

<span style="color:red">**All the results of the questions will be shown at the end of the notebook**</span>

-----

<a id='11start'></a>
# 1.1 Start

some text ...

<a id='111start'></a>
# 1.1.1 Start

some text ...

# Results of all the questions

In [8]:
print("Q1: What is the format for the majority of the columns?")
print("1. Country Information in text format")
print("2. Date information in date format")
print("3. None of the above")


# Compare as text so that a non-numeric answer does not raise a ValueError
if answer11.strip() == "2":
    print("Your answer was: ", answer11," The answer to that was: correct")
else:
    print("Your answer was: ", answer11," The answer to that was: incorrect. Check the naming of the columns. Besides the first few columns, all columns contain a date value")

Q1: What is the format for the majority of the columns?
1. Country Information in text format
2. Date information in date format
3. None of the above
Your answer was:  1  The answer to that was: incorrect. Check the naming of the columns. Besides the first few columns, all columns contain a date value
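
The explanation for Q1 can also be checked in code. The sketch below is a hedged illustration, not part of the original notebook. It takes a shortened list of the column names printed earlier and tries to read each one as a month/day/year date with `pd.to_datetime`. The variable names and the choice of the format string `%m/%d/%y` are assumptions based on how the names look. Note that the names are stored as plain text in the .csv file; the conversion only shows that they can be read as dates.

```python
import pandas as pd

# Column names taken from the column listing earlier in this notebook (shortened).
column_names = [
    "Province/State", "Country/Region", "Lat", "Long",
    "1/22/20", "1/23/20", "4/4/20",
]

# Try to read every name as a month/day/year date; names that do not fit become NaT.
parsed = pd.to_datetime(pd.Series(column_names), format="%m/%d/%y", errors="coerce")

date_columns = [name for name, value in zip(column_names, parsed) if pd.notna(value)]
other_columns = [name for name, value in zip(column_names, parsed) if pd.isna(value)]

print("Date columns: ", date_columns)
print("Other columns:", other_columns)
```

With the full data set, the same check can be run on `list(confirmed.columns)` to count how many columns hold daily values.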
