**Welcome!**

We are delighted that you are interested in data analytics and the Aalto Junior workshops. We hope this workshop inspires you to explore data analytics, data science and programming!

**Data**

Data analysis requires collected information, known as data. Data comes in many forms, for example spreadsheets, text, audio or photos, and it can be static (fixed once published) or dynamic (updated over time). In this analysis, we use dynamic data published by the WHO. The data is in spreadsheet form and is [publicly available](https://covid19.who.int/table) under a [licence](https://creativecommons.org/licenses/by-nc-sa/3.0/igo/) that allows us to use it. It contains the recorded COVID-19 cases and deaths for the whole world. Remember to always check the usage rights of the data you use.

In this exercise, the data is already in a usable form. This is great, since we can focus on the data analysis and the conclusions. Only minimal data cleanup is required here.

The data can be downloaded in several file formats. In this exercise, we use the CSV format, which you can read more about [here](https://en.wikipedia.org/wiki/Comma-separated_values). The data lists new COVID-19 cases and deaths by country and date, and it also includes the [WHO Regions](https://en.wikipedia.org/wiki/List_of_WHO_regions).
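
To get a feel for what a CSV file looks like inside, here is a minimal sketch that uses Python's built-in csv module. The rows are made up for illustration (they are not real WHO figures), and only column names that appear in this workshop are used.

```python
import csv
import io

# Made-up sample rows for illustration only; these are not real WHO figures.
sample_csv = """Date_reported,Country_code,New_cases,New_deaths
2020-03-01,SE,1,0
2020-03-01,FI,2,0
2020-03-02,SE,3,1
"""

# DictReader uses the first line as column names and yields one dict per row.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

for row in rows:
    print(row["Date_reported"], row["Country_code"], row["New_cases"])

# Every value is read as text, so numbers must be converted before doing maths.
total_cases = sum(int(row["New_cases"]) for row in rows)
print("Total cases in the sample:", total_cases)
```

Each line of the file is one row and the commas separate the columns. The pandas module does this reading and converting for us.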

**Programming**

There are numerous modules available for Python. They add functionality to your code and extend what Python can do. In this exercise, we use the pandas module for the data analysis and later matplotlib for data visualization. The data is given to us in CSV format, so we start by reading it into a pandas.DataFrame, the table-like format that the pandas module works with.

For this exercise, most of the code is already written for you. You can edit and extend the code and run it in the code boxes by clicking the play symbol or pressing Shift+Enter. Remember to experiment with the code and play around.

**Getting familiar with the data**

Let's start by familiarizing ourselves with the data.

In [None]:
#Import the pandas module.
import pandas
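#Turn off the SettingWithCopyWarning that pandas may show when we modify a selected part of the data.
pandas.options.mode.chained_assignment = None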
#Save the link to WHO data in a variable.
linkki = 'https://covid19.who.int/WHO-COVID-19-global-data.csv'
#Use the read_csv command to read the data from the link to pandas.DataFrame format.
data = pandas.read_csv(linkki)

In [None]:
#The head() command from the pandas module shows the first few rows, which gives a quick overview of the data.
data.head()

In [None]:
#We can also display the whole data set. Luckily, this programming environment is smart enough to print only a few rows of it.
data
#What's the data like? How many rows does it have? What data is in each column?

In [None]:
'''
With the describe() command from the pandas module we can calculate summary statistics for the numeric columns of the data set. You can look up the mathematical concepts online if they are unclear.
count - number of values
mean - mean (average)
std - standard deviation
min - minimum
25% - lower quartile
50% - median
75% - upper quartile
max - maximum
'''
data.describe()
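
To see what these values mean in practice, here is a small sketch that calculates them by hand with Python's built-in statistics module. The numbers are made up for illustration. The sample standard deviation and the "inclusive" quantile method are chosen here because they should correspond to the pandas defaults, but treat that as an assumption and compare with describe() yourself.

```python
import statistics

# Made-up daily case counts, used only to illustrate the calculations.
daily_cases = [0, 2, 4, 6, 8]

count = len(daily_cases)
mean = statistics.mean(daily_cases)
# Sample standard deviation: how far the values typically are from the mean.
std = statistics.stdev(daily_cases)
# Quartiles: the three cut points that split the sorted values into four parts.
q25, q50, q75 = statistics.quantiles(daily_cases, n=4, method="inclusive")

print("count", count)
print("mean ", mean)
print("std  ", round(std, 3))
print("min  ", min(daily_cases))
print("25%  ", q25)
print("50%  ", q50)
print("75%  ", q75)
print("max  ", max(daily_cases))
```

Half of the values lie between the lower and the upper quartile, and the median is the middle value.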

In [None]:
'''
Example: Get all rows for Sweden from the data set. The Country_code for Sweden is "SE".
'''
#Save the data to a variable data_sweden
data_sweden = data.loc[data['Country_code']=='SE']
#Look at the data
data_sweden
#In this exercise, we study Finland's COVID-19 data. So the task here is to get the data for Finland and save it to a variable as well. The Country_code for Finland is "FI".
#What differences do you notice compared to the global data? Compared to Sweden's data?
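
The selection with data.loc[...] can look a bit magical at first. The sketch below breaks it into two steps on a tiny made-up table (the values and the names toy, mask and toy_sweden are only for illustration): the comparison first produces a True/False value for every row, and .loc then keeps the rows marked True.

```python
import pandas

# A tiny made-up table, used only to illustrate the selection.
toy = pandas.DataFrame({
    "Country_code": ["SE", "FI", "SE", "FI"],
    "New_cases": [5, 3, 8, 1],
})

# Step 1: compare every row with "SE". The result has one True/False per row.
mask = toy["Country_code"] == "SE"
print(mask.tolist())

# Step 2: keep only the rows where the mask is True.
toy_sweden = toy.loc[mask]
print(toy_sweden)
```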

In [None]:
'''
Here we calculate statistical values for Sweden.
'''
data_sweden.describe()

#Do the same thing for Finland and compare them.

Here we can see the COVID-19 cases and deaths per day. We can also see the averages of deaths and cases per day.

Numbers on 18.1.2022:

Max 101 deaths per day

Max 20513 cases per day

Average 2228.4 cases per day during the whole pandemic

Average 22.6 deaths per day during the whole pandemic

The number of cases varies greatly from day to day. This can be seen from the large standard deviation of 2766.

**Visualizing the data**

Next, we will generate graphical plots of the data. For this, we utilize the matplotlib module.

In [None]:
#Import matplotlib module and use it as "plt"
import matplotlib.pyplot as plt
#Create an area for the figure and give it a size of 16x10
plt.figure(figsize=(16,10))
#Create a plot for all cases in Sweden and give it the color red ("r")
plt.plot(data_sweden.Date_reported, data_sweden.New_cases, color="r")
#Create a plot for all deaths in Sweden and give it the color black ("k")
plt.plot(data_sweden.Date_reported, data_sweden.New_deaths, color="k")
#Create corresponding plots for Finland. What do you notice from the plot? Where does it go up and down?

As we can see, the data also includes many rows of zeros at the start (the horizontal line on the left side of the plot). From the data, we can see that no cases were recorded in Sweden until 27.2.2020. Since there were no recorded cases in Finland or Sweden before that date, we can cut that unneeded data out by selecting only the days from 27.2.2020 onwards. Do the same for Finland here.

The pandas module has its own data type for dates and times called Timestamp (pandas' version of datetime). Working with this data type makes the coding a lot easier for us. For example, the plot no longer tries to label every single date: once it knows that the values are dates, it adds only a few of them so that we can see approximately what date it is. A Timestamp is displayed as "year-month-day". The pandas module includes a command for converting text to this data type. Next, we convert all the dates to the correct data type. After that, we can easily select only the data we want by keeping the rows whose date is greater than or equal to the date where the data starts.
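
Why does the data type matter? The sketch below uses only Python's built-in datetime module (not pandas) and two example dates written in "day.month.year" form to show that comparing dates as text can give the wrong answer, while comparing real date objects works as expected.

```python
from datetime import datetime

# Two example dates written as text in day.month.year form.
earlier_text = "27.2.2020"
later_text = "10.1.2021"

# Text is compared character by character, so "1..." sorts before "2...".
print(later_text > earlier_text)  # False, although 2021 is after 2020

# Parse the text into real date objects.
earlier = datetime.strptime(earlier_text, "%d.%m.%Y")
later = datetime.strptime(later_text, "%d.%m.%Y")

# Date objects are compared in time order.
print(later > earlier)  # True

# The same date shown in year-month-day form.
print(later.strftime("%Y-%m-%d"))  # 2021-01-10
```

Text in "year-month-day" form happens to sort in time order, but converting to a real date type is still the safer habit.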

In [None]:
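#Convert the dates from text to Timestamp format using pandas.to_datetime
#Assigning to the column by name replaces it, so the column gets the date data type in all pandas versions.
data_sweden['Date_reported'] = pandas.to_datetime(data_sweden['Date_reported'])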
#Check out the new data format
data_sweden
#Do the same for Finland data

In [None]:
#Check the data type of the first row of Sweden's data to ensure it's in the correct form
type(data_sweden.iloc[0].Date_reported)
#Do the same for Finland data

In [None]:
#Create a figure in the same way as earlier and see what difference all this made.
plt.figure(figsize=(16,10))
#Create a plot for all cases in Sweden and give it the color red ("r")
plt.plot(data_sweden.Date_reported, data_sweden.New_cases, color="r")
#Create a plot for all deaths in Sweden and give it the color black ("k")
plt.plot(data_sweden.Date_reported, data_sweden.New_deaths, color="k")

In [None]:
#As we can see, the rows of zeros are still there. Remember to remove them.
#Select the dates greater than or equal to "2020-02-27" and save the result to the same variable.
data_sweden = data_sweden.loc[data_sweden['Date_reported']>='2020-02-27']
#Have a look at the data
data_sweden
#Let's do the same thing for Finland here

After the modifications, we once again create a figure in the same way as earlier.

In [None]:
#Create an area for the figure and give it a size of 16x10
plt.figure(figsize=(16,10))
#Create a plot for all cases in Sweden and give it the color red ("r")
cases, = plt.plot(data_sweden.Date_reported, data_sweden.New_cases, color="r")
#Create a plot for all deaths in Sweden and give it the color black ("k")
deaths, = plt.plot(data_sweden.Date_reported, data_sweden.New_deaths, color="k")
#Give a title to the plot and names to the x and y axes.
plt.gca().set(title="COVID-19 Cases and deaths in Sweden", xlabel="Date",ylabel="Count")
#Since there is more than one line in the graph, add a legend to show which is which.
plt.legend((cases,deaths),("Cases","Deaths"))
#Draw the corresponding plots for Finland

Let's continue improving the figure. The daily data is a bit noisy, which can make it hard to read; whether that matters depends on the intended use of the figure. As good practice, we will next plot a weekly (7-day) average, which can be easier to read and more useful.

In [None]:
#Create a new variable for Sweden's average and give it a copy of Sweden's data, so that the original data stays unchanged.
data_sweden_average = data_sweden.copy()
#Calculate averages for cases and deaths using the rolling command. Set the window value to 7 to average over 7 days (one week).
data_sweden_average['New_cases']=data_sweden.New_cases.rolling(window=7).mean()
data_sweden_average['New_deaths']=data_sweden.New_deaths.rolling(window=7).mean()

#Create an area for the figure and give it a size of 16x10
plt.figure(figsize=(16,10))
#Create a plot for the average cases in Sweden and give it the color red ("r")
mean_cases, = plt.plot(data_sweden_average.Date_reported, data_sweden_average.New_cases, color="r")
#Create a plot for the average deaths in Sweden and give it the color black ("k")
mean_deaths, = plt.plot(data_sweden_average.Date_reported, data_sweden_average.New_deaths, color="k")
#Give a title to the plot and names to the x and y axes.
plt.gca().set(title="COVID-19 Cases and deaths in Sweden", xlabel="Date",ylabel="Count")
#Since there is more than one line in the graph, add a legend to show which is which.
plt.legend((mean_cases,mean_deaths),("Cases","Deaths"))

#Try changing the window value. What changes do you see in the graph?
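
To make the rolling average less of a black box, here is a minimal sketch of the same idea in plain Python. The function name rolling_mean and the numbers are made up for illustration. For every day it takes the average of that day and the six days before it. The first six days have no full week behind them, so they get no value; pandas shows NaN in those places.

```python
def rolling_mean(values, window):
    """Return trailing-window means; None until a full window is available."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough earlier days yet to fill the window.
            result.append(None)
        else:
            # Take the current value and the (window - 1) values before it.
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


# Made-up daily counts that jump up and down a lot.
daily = [0, 14, 0, 7, 0, 21, 0, 7, 0, 14]
weekly = rolling_mean(daily, window=7)
print(weekly)
```

The daily numbers swing between 0 and 21, while the averaged values stay between 5 and 7. This is why the averaged plot looks smoother.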

**Conclusions**



Examine the figures and tables. Consider how cases and deaths have developed. Do you notice clear declines or surprising increases?

It is often difficult to draw direct conclusions from the data alone. You also need to research the topic and current events. Observations in the data can then be linked to what you learn from that research, or vice versa. It is good to remember that data is rarely perfect but is often useful.

What do you notice? What information did you find from different sources? How can you link them to this data?

**Sources**

https://matplotlib.org/ (3.5.2020)
https://pandas.pydata.org/ (3.5.2020)
https://www.suomenmaa.fi/uutiset/koronakuolemien-maara-hyppasi-suomessa-ylospain-hus-raportoi-hoivakotikuolemia-6.3.600805.a23e518765 (3.5.2020)