# Project 1 - COVID-19 Data Analysis

**Project deadline:** This project is due for submission on Monday, 25.05.2020. You will receive details on the submission process from your tutor!

**PLEASE READ THIS NOTEBOOK COMPLETELY BEFORE YOU START TO WORK ON THE PROJECT!**

## About the Projects
- You will get one project approximately every other week.
- Besides the homework assignments, you need to solve the projects in order to pass the course. Your final course mark is the mean of your project marks. We aim to hand out six projects during the term, and your worst project mark is not counted towards your final course mark. Projects that you do not hand in are counted with a mark of 4.
- The projects need to be submitted to your tutor, who will give you the necessary information on the submission process!
- **In contrast to the homework exercises, each student must hand in their own solution for the projects! Of course you can and should discuss problems with each other! However, you must not use code or code-parts from your student peers in your project solutions!**

**Note: The tutors, Oliver and I are very happy to help you out with difficulties you might have with the project tasks! You can ask questions any time but please do so well in advance of the deadlines!**

## Analysis of public COVID-19 data

In this first project, we would like to demonstrate that you can already do advanced data analysis with your current knowledge and with just a few lines of `Python` code. Nevertheless, the notebook contains some more advanced technical aspects to load data from the WWW and to prepare them for further analysis. Please do not worry if you do not fully understand all details of that part right now. We will cover those aspects later in the term.

We will do this project on a topic that concerns all of us at the moment, the COVID-19 pandemic. We will download publicly available data with a daily listing of new (known!) COVID-19 cases and new deaths due to the pandemic. The data set contains information on *all* countries with known COVID-19 cases. Your task will be to analyse the development of the pandemic and to check which countries currently have a rising number of infectious COVID-19 patients.

The data that we will use in this notebook are updated daily and published by the [European Centre for Disease Prevention and Control](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide).

In [None]:
# We need some modules (Python libraries) in the following.
# Usually, such modules are loaded in the first cell of a notebook.
# The modules that we need concern loading the data and plotting
# them later.

# all plots should appear directly within the notebook
%matplotlib inline

# modules necessary for plotting
import matplotlib.pyplot as plt

# seaborn just makes plots look a bit nicer - not
# absolutely necessary though.
import seaborn as sns
sns.set_style("whitegrid")

# modules to load the data. The Pandas module
# is just needed for a quick data-loading demonstration at the
# start of the Notebook. The corona_data module is self-made
# to comfortably load and administrate the COVID-19 data.
# To work correctly, a file named 'corona_data.py' must be
# in the same directory as this notebook file!
import pandas as pd
import corona_data

# module that makes available data structures and routines
# for numerics
import numpy as np

## Loading data

### Data-loading demo with standard Python-modules

One great feature of `Python` is the ability to load all kinds of standardised data-formats into memory - in most cases with a single command. The data can be located on your disk or on the Web. In the following, we directly load data from [this Web-address](https://opendata.ecdc.europa.eu/covid19/casedistribution/csv) (no need to separately download them).

In [None]:
# load COVID-19 data from the Web with the pandas module
data = pd.read_csv('https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/', engine="python")

# Uncomment the following line if you want to see all lines
# (more than 15000) and not only 10:
#pd.set_option('display.max_rows', None)
data

The data lists among other quantities:
- first column (dateRep): (reported) date
- fifth column (cases): new confirmed COVID-19 cases at that date
- sixth column (deaths): new deaths because of COVID-19 at that date
- seventh column (countriesAndTerritories): country
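
As a small illustration of how these columns can be addressed by name, the following sketch builds a tiny stand-in table with the same column names and selects the rows of one country. The numbers and the country names `A_land` and `B_land` are made up for the example and are not ECDC data.

```python
import pandas as pd

# Tiny stand-in for the ECDC table: same column names, made-up numbers.
toy = pd.DataFrame({
    "dateRep": ["01/03/2020", "02/03/2020", "01/03/2020", "02/03/2020"],
    "cases": [5, 8, 2, 3],
    "deaths": [0, 1, 0, 0],
    "countriesAndTerritories": ["A_land", "A_land", "B_land", "B_land"],
})

# Boolean mask: True for every row that belongs to the chosen country.
mask = toy["countriesAndTerritories"] == "A_land"

# Keep only those rows and the three columns of interest.
subset = toy.loc[mask, ["dateRep", "cases", "deaths"]]
print(subset)
```

The same two steps (build a mask, then select with `.loc`) should work on the full `data` table loaded above, provided its column names match those listed.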

The file lists all data from the 31st of December 2019 up to the present day for all countries with known COVID-19 cases. Which countries are listed? Such information can be retrieved easily and quickly.

In [None]:
# list all countries reported in the data:
#
# The following line ensures that each country is reported once and that
# the resulting list is sorted:
countries = sorted(set(data['countriesAndTerritories']))

# we only print the first 3 countries as the list is very long. Just remove
# the brackets if you want the full list:
countries[0:3]


### Data-loading for our project

Although the above data format can be used efficiently, it requires a longer sequence of commands to retrieve interesting time-sequence data for specific countries. Because we do not want to deal with those data-handling issues at the moment, I transferred this part of the code to a module `corona_data`. It reads the data and extracts the columns *cases* and *deaths* for a specific country. Furthermore, it removes all data before the 1st of March 2020. Henceforth, we consider this date our *Day Zero* of the pandemic.
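
To give an idea of what such a sequence of commands might look like, here is a minimal sketch with plain `pandas`. It is not the actual code of `corona_data`: the function name `extract_country` is hypothetical, the toy table contains made-up numbers, and the day/month/year date format is an assumption about the `dateRep` column.

```python
import pandas as pd

def extract_country(table, country, day_zero="2020-03-01"):
    """Return (day, cases, deaths) arrays for one country, starting at day_zero."""
    rows = table[table["countriesAndTerritories"] == country].copy()
    # Assumption: dates are written as day/month/year.
    rows["date"] = pd.to_datetime(rows["dateRep"], format="%d/%m/%Y")
    start = pd.Timestamp(day_zero)
    # Drop everything before Day Zero and order the rows chronologically.
    rows = rows[rows["date"] >= start].sort_values("date")
    # Day index: number of days elapsed since Day Zero.
    day = (rows["date"] - start).dt.days.to_numpy()
    return day, rows["cases"].to_numpy(), rows["deaths"].to_numpy()

# Made-up stand-in table with the same column names as the ECDC data.
toy = pd.DataFrame({
    "dateRep": ["02/03/2020", "01/03/2020", "29/02/2020", "01/03/2020"],
    "cases": [8, 5, 1, 2],
    "deaths": [1, 0, 0, 0],
    "countriesAndTerritories": ["A_land", "A_land", "A_land", "B_land"],
})

toy_day, toy_cases, toy_deaths = extract_country(toy, "A_land")
print(toy_day, toy_cases, toy_deaths)
```

The row dated 29/02/2020 is dropped because it lies before Day Zero, and the remaining rows are sorted so that index 0 corresponds to the 1st of March.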

In [None]:
# first read all the data into our own CoronaData class structure. This
# needs to be done only once within this notebook!
corona = corona_data.CoronaData()

# The countries listed are accessed as member variable of the Corona class.
# We do not need them immediately but it comes in handy for your own tasks
# below.
countries = corona.countries


In [None]:
# now isolate interesting data for a specific country
#countries = corona.countries
country = 'Germany'

# The structure 'corona[country]' contains a triple of numpy-arrays
# with days, cases and deaths. We assign them to three variables
# with 'simultaneous assignment'.
day, cases, deaths = corona[country]

print(day)
print(cases)
print(deaths)

The three arrays have the following intuitive meaning: At day 0 (1st of March 2020), 54 new COVID-19 infections and zero new deaths were reported from Germany and so on. 
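
Because the arrays are indexed by pandemic day rather than by calendar date, it can help to convert between the two. The following sketch uses only the standard library; the helper names `day_to_date` and `date_to_day` are hypothetical and not part of `corona_data`.

```python
from datetime import date, timedelta

DAY_ZERO = date(2020, 3, 1)  # Day Zero as defined in this notebook

def day_to_date(day_index):
    """Calendar date that belongs to a given pandemic day."""
    return DAY_ZERO + timedelta(days=int(day_index))

def date_to_day(calendar_date):
    """Pandemic day that belongs to a given calendar date."""
    return (calendar_date - DAY_ZERO).days

print(day_to_date(0))                  # 2020-03-01
print(date_to_day(date(2020, 3, 16)))  # 15
```

With this convention, the 16th of March 2020 corresponds to day 15.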

We now can make a first plot with the new cases against the day.

In [None]:

# The following command alone is sufficient to create the plot
plt.plot(day, cases)

# The following commands label the axes and the plot
plt.xlabel('day of COVID-19 pandemic')
plt.ylabel('new COVID-19 cases')
plt.title('Daily COVID-19 infections in Germany')

## Your tasks

**Note:** Please continue this notebook and do all the following tasks within this notebook. Please comment all code blocks appropriately and discuss your results in Markdown cells. All plots must have appropriate axes-labels and a title! Your project submission will consist of the modified notebook.

The plot that we just created tells us that *new* infections have a decreasing trend. But to better understand the current state of the pandemic, we want to look at additional quantities.

1. Plot the *total accumulated number* of COVID-19 cases against the day. Give a short discussion on that plot. What kind of curve do you expect for a pandemic that can spread freely? What effect do the current measures and restrictions in Germany (e.g. social distancing) have on the curve? Discuss this with the knowledge that drastic limitations on our life (closure of schools etc.) took effect in Germany on the 16th of March. What will the curve look like when the pandemic is over?
   
   **Hint:** Have a look at the `numpy` `cumsum`-function.
   
2. A very important quantity to decide whether current measures to confine the pandemic can be relaxed is the *development of the number of people who still can infect others (the infectious population)*. The main purpose of all COVID-19 restrictions is to realise a decreasing trend of that number! This quantity is obtained as *the number of infected people minus those who died and minus those who recovered from COVID-19*. It is implicitly assumed in the following that recovered patients are immune to COVID-19.

   From the required information *only* the number of dead people is certain. The number of infected people is uncertain because we only have *reported* cases and we do not know how many people are infected but were not (yet) tested. Even more uncertain is the number of recovered patients and we entirely rely on an estimate for it. Furthermore, there are *many* definitions of *recovered patients* around. The one coming closest to our procedure is the following: *A patient has recovered if there are no symptoms 14 days after she tested positive or after she left hospital.*
   
   Lacking further information, we define (overestimate) the number of recovered persons as follows: We consider everybody recovered who tested positive more than 13 days ago and did not die.    
   Given these assumptions, create a plot of the infectious population as a function of pandemic day for Germany. Discuss that plot. Assuming the government withdrew all COVID-19 restrictions today and people immediately behaved as before the crisis, how long would it take until the number of infectious patients again reaches its all-time maximum?
   
   **Hint:** `numpy` array-slicing!
   
3. Create a loop over all countries with confirmed COVID-19 cases - see the hint below. List those countries that currently still have a *rising* infectious population. Limit the analysis to countries with more than 5000 confirmed COVID-19 cases.

   **Hint:** A rising infectious population means (for us) that the *derivative* of the plot from task (2) is positive today.
   
**Note:** I have added sample plots for tasks 1 and 2, dated the 28th of March, to the materials of this project. This allows you to verify your solution.
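
The `numpy` tools named in the hints can be tried out on a small made-up array first. The following sketch uses invented numbers and a window of 3 days instead of 14 to keep the arrays short. It only illustrates the mechanics of `cumsum`, shifted array slices and `np.diff` (used here as a discrete stand-in for the derivative) and is not meant as a solution for the real data.

```python
import numpy as np

new_cases = np.array([1, 2, 4, 3, 2, 1])   # made-up daily numbers

# Running total of all cases reported so far.
total = np.cumsum(new_cases)               # [ 1  3  7 10 12 13]

# Shifted slices: total[window:] is aligned with total[:-window], so their
# difference counts the cases reported during the last `window` days.
window = 3
recent = total.copy()
recent[window:] = total[window:] - total[:-window]   # [1 3 7 9 9 6]

# Discrete derivative: change from one day to the next.
change = np.diff(recent)                   # [ 2  4  2  0 -3]

# "Rising today" means that the most recent change is positive.
is_rising_today = change[-1] > 0           # False for this toy series
print(total, recent, change, is_rising_today)
```

Note that `np.diff` returns an array that is one element shorter than its input, so `change[-1]` compares the last two days.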

In [None]:
# Hint to create a loop over all countries.
#
# The countries with known COVID-19 cases are stored in a so-called list.
# A Python list is, like the numpy arrays, a container (in this case of strings)
# whose elements can be accessed and iterated over in a very similar way:
#for country in countries:
    #print(country)

In [None]:
# Alternative method to calculate the cumulative sum of cases in Germany
# without np.cumsum: add each day's new cases to the previous running total.
tot_c = np.zeros(len(cases), dtype=int)  # running total of cases per day
if len(cases) > 0:
    tot_c[0] = cases[0]  # the first day has no predecessor
for j in range(1, len(cases)):
    tot_c[j] = tot_c[j-1] + cases[j]
#print(tot_c)
#plt.plot(day, tot_c)


In [None]:
# plot of the total accumulated cases against the day
plt.plot(day, np.cumsum(cases))  # np.cumsum calculates the cumulative sum of cases in Germany
plt.title("Total accumulated cases in Germany")
plt.xlabel("Day of pandemic")
plt.ylabel("Total accumulated cases")

The plot above shows that the total number of cases grows every day, but the growth has slowed substantially over time, which is consistent with the preventive measures taken by the public and the government.

In the beginning, the case numbers grow approximately exponentially, which is what one generally expects when an infectious disease spreads unhindered.

Over time, social distancing and government measures such as curfews led to fewer and fewer contacts between infected and non-infected persons. This reduced the so-called reproduction rate, which is visible in the plot as a curve that becomes flatter with time.
When the pandemic is over, the curve will be flat, i.e. its derivative will be zero.
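
The difference between unhindered growth and a curve that flattens can be illustrated with two textbook growth models. This is only a sketch: the exponential and logistic functions are a common simplification, and all parameter values below are arbitrary and not fitted to the German data.

```python
import numpy as np

t = np.arange(0, 60)   # days
rate = 0.2             # arbitrary growth rate per day
capacity = 10000.0     # arbitrary final number of cases
start = 10.0           # arbitrary number of cases at t = 0

# Unhindered spread: exponential growth without any upper bound.
exponential = start * np.exp(rate * t)

# Spread that saturates: logistic curve approaching `capacity`.
logistic = capacity / (1.0 + (capacity / start - 1.0) * np.exp(-rate * t))

# Daily increase of the logistic curve: it rises first and then falls
# back towards zero as the curve becomes flat.
daily_new = np.diff(logistic)
print(daily_new.argmax(), logistic[-1])
```

Plotting `exponential` and `logistic` against `t` with `plt.plot` shows that both curves start out almost identically, and that only the logistic one levels off.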

In [None]:
# Infectious population as a function of day, daily cases and daily deaths
def infectious_population(day, cases, deaths):
    total_cases = np.cumsum(cases)    # accumulated confirmed cases
    total_deaths = np.cumsum(deaths)  # accumulated deaths
    # Recovered (overestimate): everybody who tested positive more than
    # 13 days ago and did not die. Nobody counts as recovered before day 14.
    recovered = np.zeros(len(day))
    recovered[14:] = total_cases[:-14] - total_deaths[14:]
    # infectious = infected - died - recovered
    return total_cases - total_deaths - recovered

y = infectious_population(day, cases, deaths)
plt.plot(day, y)
plt.xlabel('day of COVID-19 pandemic')
plt.ylabel('infectious population')
plt.title('Infectious COVID-19 population in Germany')