According to the World Health Organization, coronaviruses are a family of viruses that cause illnesses ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS).

The viruses can be found in animals and can be transmitted to humans. The novel coronavirus that has been in the news of late is SARS-CoV-2, and the disease it causes is referred to as COVID-19. This virus is closely related genetically to SARS-CoV-1, the virus that emerged in China at the end of 2002. Although the mortality rate of COVID-19 is significantly lower than that of the 2003 SARS outbreak, it is much higher than that of seasonal influenza.

In the following exercise we will read a set of csv files with data on deaths and recoveries from COVID-19 between January 22 and February 9, 2020, in different provinces around China.

## Reading a csv using the pandas library

Pandas makes it easy to read data from a csv file using pd.read_csv. In this case we want to read multiple csv files from a GitHub repository. There are two ways to go about this. The first is to read each csv file separately. Remember that if we are reading a csv file from GitHub we have to point pandas at the raw file URL (raw.githubusercontent.com) rather than at the repository's web page.

In [6]:
import pandas as pd
path = 'https://raw.githubusercontent.com/EmmS21/GradientBoostIntrotoDS/master/Datasets/Coronavirusstats/Jan22_12am.csv'
file = pd.read_csv(path)
file.head()

Unnamed: 0,Province/State,Country,Date last updated,Confirmed,Suspected
0,Shanghai,China,1/21/2020,9.0,10.0
1,Yunnan,China,1/21/2020,1.0,
2,Beijing,China,1/21/2020,10.0,
3,Taiwan,China,1/21/2020,1.0,
4,Jilin,China,1/21/2020,,1.0
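
The note about raw data can be made concrete with a small helper. This is a minimal sketch, not part of the original notebook: the function name `to_raw_github_url` is hypothetical, and `page_url` is an assumed browser-style URL reconstructed from the raw path used above. It only rewrites the string and does not download anything.

```python
from urllib.parse import urlparse


def to_raw_github_url(url: str) -> str:
    """Convert a github.com 'blob' page URL into its raw.githubusercontent.com form."""
    parts = urlparse(url)
    if parts.netloc != "github.com":
        return url  # already raw (or not a GitHub page URL); leave unchanged
    # The path looks like /<user>/<repo>/blob/<branch>/<path to file>
    segments = parts.path.lstrip("/").split("/")
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub file page URL: {url}")
    user, repo, _, *rest = segments  # drop the 'blob' segment
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, *rest])


# Assumed page URL for the file read above (hypothetical, for illustration).
page_url = (
    "https://github.com/EmmS21/GradientBoostIntrotoDS/blob/master/"
    "Datasets/Coronavirusstats/Jan22_12am.csv"
)
raw_url = to_raw_github_url(page_url)
print(raw_url)
```

The resulting string is what would be handed to pd.read_csv; passing the page URL instead would make pandas try to parse an HTML page.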


Since we want to read 40 files, we can automate this process and reduce the time we spend acquiring our data. The second way is to use the glob module to find all files that end with .csv in our local Coronavirusstats folder. We could use the glob module inside a list comprehension to create a list of all the csv files we need to read from our folder.

More about using the glob module:

https://www.poftut.com/python-glob-function-to-match-path-directory-file-names-with-examples/

In [51]:
import glob
# Collect the path of every .csv file in the local Coronavirusstats folder
files = [i for i in glob.glob('../Datasets/Coronavirusstats/*.csv')]

We could then read each csv using pd.read_csv, again inside a list comprehension, and store the results in a variable we will name temp.

In [55]:
temp = [pd.read_csv(f) for f in files]

Lastly, we can use pd.concat to concatenate pandas dataframes, so we will use it to combine all the dataframes we have stored in temp.

In [57]:
# temp is the list of dataframes created in the previous cell
covid = pd.concat(temp)
covid.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Unnamed: 6,Unnamed: 7,"Quick note: Starting from this tab, our map is updating (almost) in real time (China data - at least once per hour; non China data - several times per day). This table is planning to be updated twice a day. The discrepancy between the map and this sheet is expected. Sorry for any confusion and inconvenience.",Country,Date last updated,Suspected,Demised
0,Hubei,Mainland China,2/1/2020 10:00,7153.0,249.0,168.0,,,,,,,
1,Zhejiang,Mainland China,2/1/2020 10:00,599.0,,21.0,,,,,,,
2,Guangdong,Mainland China,2/1/2020 10:00,535.0,,14.0,,,,,,,
3,Henan,Mainland China,2/1/2020 10:00,422.0,2.0,3.0,,,,,,,
4,Hunan,Mainland China,2/1/2020 10:00,389.0,,8.0,,,,,,,


We can also combine these steps into a one-liner. We opted to break down each step to give you an understanding of what it does and why it appears where it does in the sequence. To recap, we start off by using glob.glob to find all files ending with .csv in our Coronavirusstats folder. We then read each file as a pandas dataframe using pd.read_csv, and lastly we concatenate all the dataframes we have created.

In [61]:
covid = pd.concat([pd.read_csv(f) for f in glob.glob('../Datasets/Coronavirusstats/*.csv')])
covid.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Unnamed: 6,Unnamed: 7,"Quick note: Starting from this tab, our map is updating (almost) in real time (China data - at least once per hour; non China data - several times per day). This table is planning to be updated twice a day. The discrepancy between the map and this sheet is expected. Sorry for any confusion and inconvenience.",Country,Date last updated,Suspected,Demised
0,Hubei,Mainland China,2/1/2020 10:00,7153.0,249.0,168.0,,,,,,,
1,Zhejiang,Mainland China,2/1/2020 10:00,599.0,,21.0,,,,,,,
2,Guangdong,Mainland China,2/1/2020 10:00,535.0,,14.0,,,,,,,
3,Henan,Mainland China,2/1/2020 10:00,422.0,2.0,3.0,,,,,,,
4,Hunan,Mainland China,2/1/2020 10:00,389.0,,8.0,,,,,,,
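
To see why the combined dataframe ends up with so many mostly empty columns, it can help to run the same glob, read and concatenate steps on two tiny files whose column names differ. The sketch below is an illustration under stated assumptions: the temporary folder, the file names `day1.csv` and `day2.csv`, and the `source_file` column are hypothetical and not part of the original notebook. The two rows are copied from the outputs above.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two tiny csv files that name the country column differently.
folder = tempfile.mkdtemp()
pd.DataFrame(
    {"Province/State": ["Shanghai"], "Country": ["China"], "Confirmed": [9.0]}
).to_csv(os.path.join(folder, "day1.csv"), index=False)
pd.DataFrame(
    {"Province/State": ["Hubei"], "Country/Region": ["Mainland China"], "Confirmed": [7153.0]}
).to_csv(os.path.join(folder, "day2.csv"), index=False)

# sorted() makes the file order deterministic; glob itself guarantees no order.
files = sorted(glob.glob(os.path.join(folder, "*.csv")))

# Tag each dataframe with its source file so rows can be traced back later.
frames = [pd.read_csv(f).assign(source_file=os.path.basename(f)) for f in files]

# pd.concat keeps the union of all columns and fills the gaps with NaN.
# ignore_index=True gives one continuous index instead of restarting per file.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)
```

Each file contributes its own column names, so a row from one file is NaN in every column that only exists in another. Without `ignore_index=True` the index restarts for every file, which is why index values repeat in the outputs above.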


In [62]:
covid

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Unnamed: 6,Unnamed: 7,"Quick note: Starting from this tab, our map is updating (almost) in real time (China data - at least once per hour; non China data - several times per day). This table is planning to be updated twice a day. The discrepancy between the map and this sheet is expected. Sorry for any confusion and inconvenience.",Country,Date last updated,Suspected,Demised
0,Hubei,Mainland China,2/1/2020 10:00,7153.0,249.0,168.0,,,,,,,
1,Zhejiang,Mainland China,2/1/2020 10:00,599.0,,21.0,,,,,,,
2,Guangdong,Mainland China,2/1/2020 10:00,535.0,,14.0,,,,,,,
3,Henan,Mainland China,2/1/2020 10:00,422.0,2.0,3.0,,,,,,,
4,Hunan,Mainland China,2/1/2020 10:00,389.0,,8.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
57,,Sri Lanka,1/31/2020 14:00,1.0,,,,,,,,,
58,,Finland,1/31/2020 14:00,1.0,,,,,,,,,
59,,Philippines,1/31/2020 14:00,1.0,,,,,,,,,
60,,India,1/31/2020 14:00,1.0,,,,,,,,,


There are already a few problems we can see. Firstly, for the purpose of this assignment we are only interested in a few columns: Province/State, Country/Region, Last Update, Confirmed, Deaths and Recovered.

However, we do not want to simply remove the Country, Date last updated, Suspected and Demised columns. From observing our data we can assume that in certain files these columns hold the same information as the columns of interest we have already identified, only under different names.

**How do we know this?**
To validate this assumption, let us look at rows where the Country/Region column contains a NaN value. If the Country column holds the relevant country or region in those rows, the assumption is validated.

We will start off by filtering for rows where the Country/Region column contains a NaN value.

In [65]:
covid[covid['Country/Region'].isna()]

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Unnamed: 6,Unnamed: 7,"Quick note: Starting from this tab, our map is updating (almost) in real time (China data - at least once per hour; non China data - several times per day). This table is planning to be updated twice a day. The discrepancy between the map and this sheet is expected. Sorry for any confusion and inconvenience.",Country,Date last updated,Suspected,Demised
0,Shanghai,,,9.0,,,,,,China,1/21/2020,10.0,
1,Yunnan,,,1.0,,,,,,China,1/21/2020,,
2,Beijing,,,10.0,,,,,,China,1/21/2020,,
3,Taiwan,,,1.0,,,,,,China,1/21/2020,,
4,Jilin,,,,,,,,,China,1/21/2020,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
33,Yunnan,,1/22/2020 12:00,1.0,,,,,,China,,,
34,Zhejiang,,1/22/2020 12:00,10.0,,,,,,China,,,
35,,,1/22/2020 12:00,2.0,,,,,,Japan,,,
36,,,1/22/2020 12:00,2.0,,,,,,Thailand,,,


## Question 1

This means that in cases where the Country/Region column contains a NaN, we need to populate that row with the associated value from the Country column. Give this part of the exercise a go.

## Question 2

You may have noticed that, much like the relationship between the Country/Region and Country columns, some of the csv files use the columns 'Date last updated' and 'Demised' in place of 'Last Update' and 'Deaths' respectively. With this in mind, move all values from the 'Date last updated' and 'Demised' columns to the 'Last Update' and 'Deaths' columns in cases where there is a NaN value in the latter columns. Treat the 'Suspected' column with more care: the first file reports both 'Confirmed' and 'Suspected' counts for the same province, so suspected cases are not simply confirmed cases under another name.

## Question 3

Let's assume we want to better understand the growth of COVID-19 around the world. Create visualizations to help us understand how the statistics related to COVID-19 evolved between January 22 and February 9, 2020. Show any interesting patterns you observe and explain your observations. You can opt to create either multiple visualizations in Jupyter or an interactive dashboard with multiple visualizations. Remember, if you choose a solution that challenges you, you will learn a lot more, and we are here to help you.
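
Most time-based charts need the data reshaped into one value per date first. As a possible starting point, and not a full answer, the sketch below parses the 'Last Update' strings and sums 'Confirmed' per day. The numbers are made up for illustration, and the names `toy` and `daily` are hypothetical. The real files use more than one date format, so the `format` argument here is an assumption that fits only this toy data.

```python
import pandas as pd

# Made-up values; only the column names come from the tables above.
toy = pd.DataFrame(
    {
        "Province/State": ["Hubei", "Zhejiang", "Hubei", "Zhejiang"],
        "Last Update": [
            "1/31/2020 14:00",
            "1/31/2020 14:00",
            "2/1/2020 10:00",
            "2/1/2020 10:00",
        ],
        "Confirmed": [100.0, 10.0, 150.0, 20.0],
    }
)

# Turn the text timestamps into real datetimes (month/day/year order).
toy["Last Update"] = pd.to_datetime(toy["Last Update"], format="%m/%d/%Y %H:%M")

# Drop the time of day, then total the confirmed cases for each date.
daily = toy.groupby(toy["Last Update"].dt.normalize())["Confirmed"].sum()
print(daily)
```

A series like `daily` has dates as its index, so it can be handed directly to a plotting library. For example, pandas' own `daily.plot()` draws a line chart when matplotlib is installed.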

## Some sources to help with creating visualizations

- The Python Graph Gallery: https://python-graph-gallery.com/
- 9 popular ways to perform Data Visualizations in Python: https://www.analyticsvidhya.com/blog/2015/05/data-visualization-python/

## Some sources to help with building an interactive dashboard

- Creating interactive visualizations with Plotly's Dash Framework: https://pbpython.com/plotly-dash-intro.html
- Creating interactive dashboards from Jupyter Notebooks: https://pbpython.com/interactive-dashboards.html
- Build a live dashboard with Python: https://pusher.com/tutorials/live-dashboard-python
- How to create an analytics dashboard in a Django app: https://www.freecodecamp.org/news/how-to-create-an-analytics-dashboard-in-django-app/