# Covid Status
This data science project by Sven Oberwalder and Yasin Sahin analyzes the development of the coronavirus pandemic based on input data provided by Statistik Austria.

The Datasets can be found here:
### [Dataset 1](https://data.statistik.gv.at/web/meta.jsp?dataset=OGD_covidggstatus2_GGSTATUS_2)

### [Dataset 2](https://data.statistik.gv.at/web/meta.jsp?dataset=OGD_covidggstatus_GGSTATUS_1)

In [22]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# Initial Data Analysis and Import
Dataset 1 has the following attributes:
* **C-BEZIMST-0** Political district (PolBez)
* **C-ALTGRIMST-0** 5-year age group (Altersgr)
* **C-GLIMST-0** Country of Birth (GebLand)
* **C-C11-0** Sex (Geschl)
* **C-IMST-0** COVID-19 vaccinated-recovered-status (GeimpftGenesen)
* **F-DATA** Number of records (Anz)

Dataset 2 has the following attributes:
* **C-B00-0** Federal state (Bundesland)
* **C-BILIMST-0** Education (Bildung)
* **C-ALTGRIMST-0** 5-year age group (Altersgr)
* **C-ESIMST-0** Economic status (ErwerbStatus)
* **C-IMST-0** COVID-19 vaccinated-recovered-status (GeimpftGenesen)
* **F-DATA** Number of records (Anz)

In the code below, the datasets are imported and the attributes are renamed to more meaningful names. Since the data is separated by semicolons (;), the separator must be specified when importing the CSV files. Afterwards, a simple standard analysis (``sample``, ``head``, ``info``, ``describe``) is run on each dataset to check that the import worked.

In [23]:
# import the semicolon-separated datasets
dataset1 = pd.read_csv("./data/dataset1.csv", sep=";")
dataset2 = pd.read_csv("./data/dataset2.csv", sep=";")

#rename attributes
dataset1.rename(columns={"C-BEZIMST-0": "PolBez",
                         "C-ALTGRIMST-0": "Altersgr",
                         "C-GLIMST-0": "GebLand",
                         "C-C11-0": "Geschl",
                         "C-IMST-0": "GeimpftGenesen",
                         "F-DATA": "Anz"}, inplace=True)
dataset2.rename(columns={"C-B00-0": "Bundesland",
                         "C-BILIMST-0": "Bildung",
                         "C-ALTGRIMST-0": "Altersgr",
                         "C-ESIMST-0": "ErwerbStatus",
                         "C-IMST-0": "GeimpftGenesen",
                         "F-DATA": "Anz"}, inplace=True)

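By default, ``DataFrame.rename`` silently ignores keys that match no column, so a typo in one of the attribute codes would go unnoticed. The following is a minimal sketch of a stricter variant using ``errors="raise"``. It is not part of the original notebook: the inline CSV text is made-up toy data with only three columns, and ``F-DATEN`` is a deliberately wrong key.

```python
import io
import pandas as pd

# Toy semicolon-separated input (illustrative only, not the real file)
raw = "C-B00-0;C-IMST-0;F-DATA\nB00-1;IMST-1;611\nB00-1;IMST-2;285\n"
toy = pd.read_csv(io.StringIO(raw), sep=";")

# errors="raise" makes rename fail if a key matches no existing column
toy = toy.rename(columns={"C-B00-0": "Bundesland",
                          "C-IMST-0": "GeimpftGenesen",
                          "F-DATA": "Anz"}, errors="raise")

# A misspelled code is now reported instead of being skipped silently
try:
    toy.rename(columns={"F-DATEN": "Anz"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(toy.columns))
print(typo_detected)
```

Returning a new DataFrame instead of using ``inplace=True`` is a style choice here; both forms accept ``errors="raise"``.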

## Standard Analysis for Dataset 1

In [24]:
dataset1.sample(5)

Unnamed: 0,PolBez,Altersgr,GebLand,Geschl,GeimpftGenesen,Anz
21821,BEZIMST-409,ALTGRIMST-15,GLIMST-1,C11-1,IMST-3,38
8652,BEZIMST-303,ALTGRIMST-4,GLIMST-2,C11-1,IMST-2,4
31576,BEZIMST-614,ALT10IMST-10,GLIMST-2,C11-2,IMST-4,4
484,BEZIMST-199,ALTGRIMST-5,GLIMST-1,C11-1,IMST-3,153
21063,BEZIMST-407,ALT10IMST-3,GLIMST-2,C11-2,IMST-2,63


In [25]:
dataset1.head(5)

Unnamed: 0,PolBez,Altersgr,GebLand,Geschl,GeimpftGenesen,Anz
0,BEZIMST-101,ALTGRIMST-1,GLIMST-1,C11-1,IMST-1,1
1,BEZIMST-101,ALTGRIMST-1,GLIMST-1,C11-1,IMST-3,80
2,BEZIMST-101,ALTGRIMST-1,GLIMST-1,C11-1,IMST-4,216
3,BEZIMST-101,ALTGRIMST-1,GLIMST-1,C11-2,IMST-1,2
4,BEZIMST-101,ALTGRIMST-1,GLIMST-1,C11-2,IMST-3,86


In [26]:
dataset1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53961 entries, 0 to 53960
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   PolBez          53961 non-null  object
 1   Altersgr        53961 non-null  object
 2   GebLand         53961 non-null  object
 3   Geschl          53961 non-null  object
 4   GeimpftGenesen  53961 non-null  object
 5   Anz             53961 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 2.5+ MB


In [27]:
dataset1.describe()

Unnamed: 0,Anz
count,53961.0
mean,642.049703
std,2171.148389
min,1.0
25%,44.0
50%,169.0
75%,508.0
max,70338.0


## Standard Analysis for Dataset 2

In [28]:
dataset2.sample(5)

Unnamed: 0,Bundesland,Bildung,Altersgr,ErwerbStatus,GeimpftGenesen,Anz
66,B00-1,BILIMST-3,ALT10IMST-4,ESIMST-1,IMST-3,939
127,B00-1,BILIMST-4,ALT10IMST-7,ESIMST-2,IMST-4,162
929,B00-6,BILIMST-99,ALT10IMST-7,ESIMST-2,IMST-2,1
1060,B00-7,BILIMST-99,ALT10IMST-4,ESIMST-1,IMST-2,7
858,B00-6,BILIMST-3,ALT10IMST-6,ESIMST-1,IMST-4,2450


In [29]:
dataset2.head(5)

Unnamed: 0,Bundesland,Bildung,Altersgr,ErwerbStatus,GeimpftGenesen,Anz
0,B00-1,BILIMST-1,ALT10IMST-4,ESIMST-1,IMST-1,611
1,B00-1,BILIMST-1,ALT10IMST-4,ESIMST-1,IMST-2,285
2,B00-1,BILIMST-1,ALT10IMST-4,ESIMST-1,IMST-3,396
3,B00-1,BILIMST-1,ALT10IMST-4,ESIMST-1,IMST-4,472
4,B00-1,BILIMST-1,ALT10IMST-4,ESIMST-2,IMST-1,471


In [30]:
dataset2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1404 entries, 0 to 1403
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Bundesland      1404 non-null   object
 1   Bildung         1404 non-null   object
 2   Altersgr        1404 non-null   object
 3   ErwerbStatus    1404 non-null   object
 4   GeimpftGenesen  1404 non-null   object
 5   Anz             1404 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 65.9+ KB


In [31]:
dataset2.describe()

Unnamed: 0,Anz
count,1404.0
mean,3414.423789
std,5746.31088
min,1.0
25%,310.5
50%,1375.5
75%,3926.0
max,55581.0


# Data Cleaning
In this section, the input data is cleaned: any wrong, missing, or irrelevant information is treated accordingly.

**Note**
The column ``Altersgr`` groups people into 10-year age intervals. The first interval, currently named ``ALT10IMST-1``, is the exception: it spans only 5 years (ages 0 - 4) so that babies and small children are addressed separately.

### Dataset 1

In [32]:
geimpftGenesenDict = {"IMST-4": 0, # neither vaccinated nor recovered
                      "IMST-1": 1, # vaccinated
                      "IMST-3": 2, # recovered
                      "IMST-2": 3} # vaccinated and recovered

ageGapDict = {"ALT10IMST-1": "0 - 4",
              "ALT10IMST-2": "5 - 14",
              "ALT10IMST-3": "15 - 24",
              "ALT10IMST-4": "25 - 34",
              "ALT10IMST-5": "35 - 44",
              "ALT10IMST-6": "45 - 54",
              "ALT10IMST-7": "55 - 64",
              "ALT10IMST-8": "65 - 74",
              "ALT10IMST-9": "75 - 84",
              "ALT10IMST-10": "85+"}

dataset1["GebLand"] = dataset1["GebLand"].map({"GLIMST-1": "INLAND",
                                               "GLIMST-2": "AUSLAND"})

dataset1["Geschl"] = dataset1["Geschl"].map({"C11-1": "m",
                                             "C11-2": "f"})

dataset1["GeimpftGenesen"] = dataset1["GeimpftGenesen"].map(geimpftGenesenDict)

dataset1["Altersgr"] = dataset1["Altersgr"].map(ageGapDict)

dataset1.sample(10)

Unnamed: 0,PolBez,Altersgr,GebLand,Geschl,GeimpftGenesen,Anz
20785,BEZIMST-407,,INLAND,m,2,792
27084,BEZIMST-503,,INLAND,m,3,337
22960,BEZIMST-412,,AUSLAND,f,1,34
27077,BEZIMST-503,,AUSLAND,m,2,9
44702,BEZIMST-912,,INLAND,m,2,796
27599,BEZIMST-504,25 - 34,INLAND,m,2,1025
33592,BEZIMST-622,5 - 14,INLAND,f,1,463
38487,BEZIMST-801,75 - 84,AUSLAND,m,2,6
11236,BEZIMST-309,,AUSLAND,m,0,17
17931,BEZIMST-325,,INLAND,m,3,185


### Dataset 2

In [33]:
bundeslandDict = {
    "B00-1": "Burgenland",
    "B00-2": "Kärnten",
    "B00-3": "Niederösterreich",
    "B00-4": "Oberösterreich",
    "B00-5": "Salzburg",
    "B00-6": "Steiermark",
    "B00-7": "Tirol",
    "B00-8": "Vorarlberg",
    "B00-9": "Wien"
}

dataset2["Bundesland"] = dataset2["Bundesland"].map(bundeslandDict)

#uses the dictionary geimpftGenesenDict from the previous code block
dataset2["GeimpftGenesen"] = dataset2["GeimpftGenesen"].map(geimpftGenesenDict)

dataset2["Bildung"] = dataset2["Bildung"].map({"BILIMST-1": "Pflichtschule",
                                               "BILIMST-2": "Lehrabschluss/BMS",
                                               "BILIMST-3": "BHS/AHS/Kolleg",
                                               "BILIMST-4": "Akademie/Hochschule",
                                               "BILIMST-99": "N/A"})

dataset2["ErwerbStatus"] = dataset2["ErwerbStatus"].map({"ESIMST-1": "aktiv",
                                                         "ESIMST-2": "inaktiv"})

#uses the dictionary ageGapDict from the previous code block
dataset2["Altersgr"] = dataset2["Altersgr"].map(ageGapDict)

dataset2.sample(10)

Unnamed: 0,Bundesland,Bildung,Altersgr,ErwerbStatus,GeimpftGenesen,Anz
1397,Wien,,55 - 64,aktiv,3,2
262,Kärnten,Akademie/Hochschule,45 - 54,aktiv,1,5689
1296,Wien,Lehrabschluss/BMS,45 - 54,inaktiv,1,8570
649,Salzburg,Lehrabschluss/BMS,25 - 34,aktiv,1,8030
1335,Wien,BHS/AHS/Kolleg,55 - 64,aktiv,0,2603
482,Oberösterreich,Pflichtschule,45 - 54,inaktiv,3,903
494,Oberösterreich,Lehrabschluss/BMS,25 - 34,aktiv,3,12216
136,Burgenland,,35 - 44,inaktiv,1,2
963,Tirol,Lehrabschluss/BMS,25 - 34,aktiv,1,13239
487,Oberösterreich,Pflichtschule,55 - 64,aktiv,2,2668


## Duplicate Data

In [34]:
dataset1["Anz"].sum()

34645644

The sum of records above is approximately 4 times the total population of Austria, which strongly suggests duplicate data. That is indeed the case: ``PolBez`` contains records for each political district and also for each federal state. This duplication is unnecessary, since a district can easily be assigned to its federal state (e.g. ``BEZIMST-304`` (= Wiener Neustadt) must be in ``B00-3`` (= Niederösterreich)). The records for the federal states can therefore be removed.
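One way to spot this kind of duplication is to total ``Anz`` per code prefix. The sketch below uses a toy frame and is not part of the original notebook: the ``OTHER-*`` codes are hypothetical placeholders for the non-district rows (the real codes are not shown in this document), and the counts are invented.

```python
import pandas as pd

# Toy data: two district rows plus two placeholder rows that repeat
# the same people at a coarser level (codes and counts are invented)
toy = pd.DataFrame({
    "PolBez": ["BEZIMST-101", "BEZIMST-304", "OTHER-1", "OTHER-3"],
    "Anz": [100, 200, 100, 200],
})

# Everything before the first "-" is treated as the code prefix
prefix = toy["PolBez"].str.split("-").str[0].rename("prefix")

# If each prefix sums to roughly the same total, the levels overlap
anz_per_prefix = toy.groupby(prefix)["Anz"].sum()
print(anz_per_prefix)
```

Equal totals per prefix indicate that each prefix describes the whole population once, so only one of them should be kept.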

In [35]:
# keep only district-level rows; only districts have the prefix BEZIMST
dataset1 = dataset1.loc[dataset1["PolBez"].str.startswith("BEZIMST")]
dataset1["Anz"].sum()

17322822

The above code still returns a total that is twice Austria's population. Similarly, the column ``Altersgr`` contains duplicate records for different age groupings. When we mapped ``Altersgr`` to more readable strings earlier, the dictionary covered only the ``ALT10IMST`` codes, so every other code became N/A. Those are exactly the duplicate rows, so we can simply drop the N/A records.
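The reason dropping N/A rows works is that ``Series.map`` with a dictionary returns ``NaN`` for every value that is not a key. A minimal sketch on toy data (the counts are invented, and the dictionary is a shortened stand-in for ``ageGapDict``):

```python
import pandas as pd

# Shortened stand-in for ageGapDict: only ALT10IMST codes are keys
age_labels = {"ALT10IMST-1": "0 - 4", "ALT10IMST-2": "5 - 14"}

# Toy rows that mix both code families (counts are invented)
toy = pd.DataFrame({
    "Altersgr": ["ALT10IMST-1", "ALT10IMST-2",
                 "ALTGRIMST-1", "ALTGRIMST-2", "ALTGRIMST-3"],
    "Anz": [10, 25, 10, 12, 13],
})
total_before = toy["Anz"].sum()

# Codes missing from the dictionary are mapped to NaN
toy["Altersgr"] = toy["Altersgr"].map(age_labels)

# Dropping those rows removes the second, overlapping grouping
cleaned = toy.dropna(subset=["Altersgr"])
total_after = cleaned["Anz"].sum()

print(total_before, total_after)
```

Passing ``subset=["Altersgr"]`` restricts the drop to rows where the age group is missing, which is a slightly narrower choice than a bare ``dropna()``.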

In [37]:
dataset1 = dataset1.dropna()
dataset1["Anz"].sum()

8661411

Now we have removed all duplicate values. Let's continue cleaning our data.

## Null Values

In [None]:
dataset1.isnull().sum()

In [None]:
dataset2.isnull().sum()

Since there are no null values, we can continue!

In [None]:
dataset2["Bildung"].value_counts()

After this we are done cleaning our data and can save it to two files: ``korr1.csv`` for the cleaned ``dataset1.csv`` and ``korr2.csv`` for the cleaned ``dataset2.csv``.

In [None]:
dataset1.to_csv("./data/korr1.csv", sep=";")
dataset2.to_csv("./data/korr2.csv", sep=";")
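One thing to watch when reading these files back: by default ``pd.read_csv`` treats the string ``N/A`` as a missing value, so the ``"N/A"`` label assigned to ``Bildung`` would come back as ``NaN``. The sketch below shows the effect on a toy frame written to an in-memory buffer instead of ``./data/korr2.csv``; the rows are invented, and whether the literal label should be preserved is an assumption about how the files are used later.

```python
import io
import pandas as pd

# Toy frame with the literal label "N/A" (rows are invented)
toy = pd.DataFrame({"Bildung": ["Pflichtschule", "N/A"], "Anz": [611, 2]})

# Write with the same separator as the notebook, but into memory
buffer = io.StringIO()
toy.to_csv(buffer, sep=";")
csv_text = buffer.getvalue()

# Default parsing converts "N/A" into a missing value
default_read = pd.read_csv(io.StringIO(csv_text), sep=";", index_col=0)

# keep_default_na=False keeps the string exactly as written
literal_read = pd.read_csv(io.StringIO(csv_text), sep=";", index_col=0,
                           keep_default_na=False)

print(default_read["Bildung"].isna().sum())
print(literal_read["Bildung"].tolist())
```

``index_col=0`` restores the row index that ``to_csv`` writes as the first column.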

# Data Preparation
## Numerical Values

In [39]:
dataset1.sample(10)

Unnamed: 0,PolBez,Altersgr,GebLand,Geschl,GeimpftGenesen,Anz
38392,BEZIMST-801,15 - 24,AUSLAND,m,0,140
48152,BEZIMST-919,55 - 64,INLAND,f,3,723
42342,BEZIMST-906,5 - 14,AUSLAND,f,3,24
8576,BEZIMST-302,75 - 84,INLAND,m,0,122
29355,BEZIMST-603,45 - 54,AUSLAND,m,3,47
9870,BEZIMST-305,75 - 84,INLAND,f,1,2916
44666,BEZIMST-911,85+,AUSLAND,m,0,14
34467,BEZIMST-701,5 - 14,AUSLAND,m,0,276
34945,BEZIMST-702,35 - 44,INLAND,f,2,768
41479,BEZIMST-904,15 - 24,AUSLAND,f,2,109
