<a href="https://colab.research.google.com/github/Sormy23/DataScience_CoronaStatistics/blob/main/mini_projekt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Covid Status
This data science project by Sven Oberwalder and Yasin Sahin analyzes the development of the COVID-19 pandemic based on data provided by Statistik Austria.

The datasets can be found here:
### [Dataset 1](https://data.statistik.gv.at/web/meta.jsp?dataset=OGD_covidggstatus2_GGSTATUS_2)

### [Dataset 2](https://data.statistik.gv.at/web/meta.jsp?dataset=OGD_covidggstatus_GGSTATUS_1)

In [126]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive

In [127]:
# Mount Google Drive when working in Google Colab
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Initial Data Analysis and Import
Dataset 1 has the following attributes:
* **C-BEZIMST-0** Political district (PolBez)
* **C-ALTGRIMST-0** 5-year age group (Altersgr)
* **C-GLIMST-0** Country of birth (GebLand)
* **C-C11-0** Sex (Geschl)
* **C-IMST-0** COVID-19 vaccination/recovery status (GeimpftGenesen)
* **F-DATA** Number of records (Anz)

Dataset 2 has the following attributes:
* **C-B00-0** Federal state (Bundesland)
* **C-BILIMST-0** Education (Bildung)
* **C-ALTGRIMST-0** 5-year age group (Altersgr)
* **C-ESIMST-0** Economic status (ErwerbStatus)
* **C-IMST-0** COVID-19 vaccination/recovery status (GeimpftGenesen)
* **F-DATA** Number of records (Anz)

The code below imports the datasets and renames the attributes to more meaningful names. Because the data is separated by semicolons (;), the separator must be specified when importing the CSV files. A simple standard analysis is then carried out on each dataset to check that the import worked as expected.
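
For readers without access to the Drive files, the import and rename step can be tried on a small inline sample. This is only a minimal sketch: the three data rows are copied from the `dataset1.head(5)` output shown later in this notebook, while the names `raw_csv`, `column_names`, `sample`, and `missing` are illustrative assumptions and not part of the original notebook. The extra check is there because `rename` silently ignores keys that do not match any column.

```python
import io

import pandas as pd

# Hypothetical inline sample in the same semicolon-separated layout as dataset1.csv.
raw_csv = io.StringIO(
    "C-BEZIMST-0;C-ALTGRIMST-0;C-GLIMST-0;C-C11-0;C-IMST-0;F-DATA\n"
    "BEZIMST-101;ALTGRIMST-1;GLIMST-1;C11-1;IMST-1;1\n"
    "BEZIMST-101;ALTGRIMST-1;GLIMST-1;C11-1;IMST-3;80\n"
    "BEZIMST-101;ALTGRIMST-1;GLIMST-1;C11-1;IMST-4;216\n"
)

column_names = {
    "C-BEZIMST-0": "PolBez",
    "C-ALTGRIMST-0": "Altersgr",
    "C-GLIMST-0": "GebLand",
    "C-C11-0": "Geschl",
    "C-IMST-0": "GeimpftGenesen",
    "F-DATA": "Anz",
}

# sep=";" is required; with the default comma each row would end up in one column.
sample = pd.read_csv(raw_csv, sep=";")

# Fail early if an expected source column is missing from the file.
missing = set(column_names) - set(sample.columns)
if missing:
    raise KeyError(f"columns missing from input: {sorted(missing)}")

sample = sample.rename(columns=column_names)
print(sample.columns.tolist())
# ['PolBez', 'Altersgr', 'GebLand', 'Geschl', 'GeimpftGenesen', 'Anz']
```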

In [128]:
# Import the datasets (semicolon-separated CSV files)
dataset1 = pd.read_csv("/content/drive/MyDrive/dataset1.csv", sep=";")
dataset2 = pd.read_csv("/content/drive/MyDrive/dataset2.csv", sep=";")

#rename attributes
dataset1.rename(columns={"C-BEZIMST-0": "PolBez",
                         "C-ALTGRIMST-0": "Altersgr",
                         "C-GLIMST-0": "GebLand",
                         "C-C11-0": "Geschl",
                         "C-IMST-0": "GeimpftGenesen",
                         "F-DATA": "Anz"}, inplace=True)
dataset2.rename(columns={"C-B00-0": "Bundesland",
                         "C-BILIMST-0": "Bildung",
                         "C-ALTGRIMST-0": "Altersgr",
                         "C-ESIMST-0": "ErwerbStatus",
                         "C-IMST-0": "GeimpftGenesen",
                         "F-DATA": "Anz"}, inplace=True)


## Standard Analysis for Dataset 1

In [129]:
dataset1.sample(5)

Unnamed: 0,PolBez,Altersgr,GebLand,Geschl,GeimpftGenesen,Anz
40148,BEZIMST-901,ALT10IMST-3,GLIMST-2,C11-1,IMST-4,45
17764,BEZIMST-325,ALTGRIMST-4,GLIMST-2,C11-1,IMST-2,1
23519,BEZIMST-413,ALTGRIMST-12,GLIMST-1,C11-2,IMST-1,990
50544,B00-2,ALTGRIMST-8,GLIMST-2,C11-2,IMST-3,819
4794,BEZIMST-204,ALTGRIMST-5,GLIMST-1,C11-2,IMST-1,448


In [130]:
dataset1.head(5)

Unnamed: 0,PolBez,Altersgr,GebLand,Geschl,GeimpftGenesen,Anz
0,BEZIMST-101,ALTGRIMST-1,GLIMST-1,C11-1,IMST-1,1
1,BEZIMST-101,ALTGRIMST-1,GLIMST-1,C11-1,IMST-3,80
2,BEZIMST-101,ALTGRIMST-1,GLIMST-1,C11-1,IMST-4,216
3,BEZIMST-101,ALTGRIMST-1,GLIMST-1,C11-2,IMST-1,2
4,BEZIMST-101,ALTGRIMST-1,GLIMST-1,C11-2,IMST-3,86


In [131]:
dataset1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53961 entries, 0 to 53960
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   PolBez          53961 non-null  object
 1   Altersgr        53961 non-null  object
 2   GebLand         53961 non-null  object
 3   Geschl          53961 non-null  object
 4   GeimpftGenesen  53961 non-null  object
 5   Anz             53961 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 2.5+ MB


In [132]:
dataset1.describe()

Unnamed: 0,Anz
count,53961.0
mean,642.049703
std,2171.148389
min,1.0
25%,44.0
50%,169.0
75%,508.0
max,70338.0


## Standard Analysis for Dataset 2

In [133]:
dataset2.sample(5)

Unnamed: 0,Bundesland,Bildung,Altersgr,ErwerbStatus,GeimpftGenesen,Anz
621,B00-5,BILIMST-1,ALT10IMST-4,ESIMST-2,IMST-1,934
882,B00-6,BILIMST-4,ALT10IMST-5,ESIMST-1,IMST-4,2744
98,B00-1,BILIMST-4,ALT10IMST-4,ESIMST-1,IMST-3,567
1141,B00-8,BILIMST-2,ALT10IMST-6,ESIMST-2,IMST-2,385
234,B00-2,BILIMST-3,ALT10IMST-6,ESIMST-2,IMST-1,633


In [134]:
dataset2.head(5)

Unnamed: 0,Bundesland,Bildung,Altersgr,ErwerbStatus,GeimpftGenesen,Anz
0,B00-1,BILIMST-1,ALT10IMST-4,ESIMST-1,IMST-1,611
1,B00-1,BILIMST-1,ALT10IMST-4,ESIMST-1,IMST-2,285
2,B00-1,BILIMST-1,ALT10IMST-4,ESIMST-1,IMST-3,396
3,B00-1,BILIMST-1,ALT10IMST-4,ESIMST-1,IMST-4,472
4,B00-1,BILIMST-1,ALT10IMST-4,ESIMST-2,IMST-1,471


In [135]:
dataset2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1404 entries, 0 to 1403
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Bundesland      1404 non-null   object
 1   Bildung         1404 non-null   object
 2   Altersgr        1404 non-null   object
 3   ErwerbStatus    1404 non-null   object
 4   GeimpftGenesen  1404 non-null   object
 5   Anz             1404 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 65.9+ KB


In [136]:
dataset2.describe()

Unnamed: 0,Anz
count,1404.0
mean,3414.423789
std,5746.31088
min,1.0
25%,310.5
50%,1375.5
75%,3926.0
max,55581.0


# Data Cleaning
In this section, the input data is cleaned: any wrong, missing, or irrelevant information is treated accordingly.

### Dataset 1

In [137]:
geimpftGenesenDict = {"IMST-4": 0, # neither vaccinated nor recovered
                      "IMST-1": 1, # vaccinated
                      "IMST-3": 2, # recovered
                      "IMST-2": 3} # vaccinated and recovered

dataset1["GebLand"] = dataset1["GebLand"].map({"GLIMST-1": "INLAND",
                                               "GLIMST-2": "AUSLAND"})

dataset1["Geschl"] = dataset1["Geschl"].map({"C11-1": "m",
                                             "C11-2": "f"})

dataset1["GeimpftGenesen"] = dataset1["GeimpftGenesen"].map(geimpftGenesenDict)
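
One caveat with `Series.map` and a dictionary is that any code missing from the dictionary is turned into `NaN` without a warning. The following sketch shows one way to guard against that. The helper `map_codes` and the variable names are hypothetical and not part of the original notebook; the mapping itself is the one used above.

```python
import pandas as pd

# Same coding as geimpftGenesenDict in the notebook.
status_codes = {"IMST-4": 0, "IMST-1": 1, "IMST-3": 2, "IMST-2": 3}


def map_codes(series: pd.Series, mapping: dict) -> pd.Series:
    """Map raw codes to new values and raise if a code has no mapping."""
    mapped = series.map(mapping)
    # Values that were present before mapping but are NaN afterwards were not in the dict.
    unmapped = series[mapped.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"unmapped codes: {sorted(unmapped)}")
    return mapped


raw = pd.Series(["IMST-1", "IMST-4", "IMST-2"])
print(map_codes(raw, status_codes).tolist())  # [1, 0, 3]
```

Note that the check assumes the mapping itself never maps a code to a missing value on purpose.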

### Dataset 2

In [138]:
bundeslandDict = {
    "B00-1": "Burgenland",
    "B00-2": "Kärnten",
    "B00-3": "Niederösterreich",
    "B00-4": "Oberösterreich",
    "B00-5": "Salzburg",
    "B00-6": "Steiermark",
    "B00-7": "Tirol",
    "B00-8": "Vorarlberg",
    "B00-9": "Wien"
}

dataset2["Bundesland"] = dataset2["Bundesland"].map(bundeslandDict)

dataset2["GeimpftGenesen"] = dataset2["GeimpftGenesen"].map(geimpftGenesenDict)

dataset2["Bildung"] = dataset2["Bildung"].map({"BILIMST-1": "Pflichtschule",
                                               "BILIMST-2": "Lehrabschluss/BMS",
                                               "BILIMST-3": "BHS/AHS/Kolleg",
                                               "BILIMST-4": "Akademie/Hochschule",
                                               "BILIMST-99": "N/A"})

dataset2["ErwerbStatus"] = dataset2["ErwerbStatus"].map({"ESIMST-1": 1,
                                                         "ESIMST-2": 0})

dataset2.sample(10)

Unnamed: 0,Bundesland,Bildung,Altersgr,ErwerbStatus,GeimpftGenesen,Anz
116,Burgenland,Akademie/Hochschule,ALT10IMST-6,0,1,203
300,Kärnten,,ALT10IMST-7,1,1,1
1245,Wien,Pflichtschule,ALT10IMST-4,1,3,4629
1036,Tirol,Akademie/Hochschule,ALT10IMST-5,1,3,5986
1010,Tirol,BHS/AHS/Kolleg,ALT10IMST-5,0,0,469
113,Burgenland,Akademie/Hochschule,ALT10IMST-6,1,3,1922
843,Steiermark,BHS/AHS/Kolleg,ALT10IMST-4,0,1,2885
309,Niederösterreich,Pflichtschule,ALT10IMST-4,0,1,3686
984,Tirol,Lehrabschluss/BMS,ALT10IMST-6,0,3,713
279,Kärnten,,ALT10IMST-4,1,3,1


## Duplicate Data

In [139]:
dataset1["Anz"].sum()

34645644

The output above shows that the sum of records is approximately four times the total population of Austria, which strongly suggests duplicate data. That is indeed the case: ``PolBez`` contains records for each political district and also for each federal state. The state-level records are redundant, since every district can be assigned to its federal state (e.g. ``BEZIMST-304`` (= Wiener Neustadt) belongs to ``B00-3`` (= Niederösterreich)). The records for the federal states can therefore be removed.
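
The double counting described above can be made visible by summing ``Anz`` per aggregation level. The sketch below uses a hypothetical toy frame with invented counts (one federal state made up of two districts); the column `level` and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: the B00-3 row is the sum of the two district rows.
toy = pd.DataFrame({
    "PolBez": ["BEZIMST-301", "BEZIMST-302", "B00-3"],
    "Anz": [100, 150, 250],
})

# The text before the last hyphen identifies the aggregation level of a code.
toy["level"] = toy["PolBez"].str.rsplit("-", n=1).str[0]

# Each level sums to the same total, so adding all rows counts everyone twice.
print(toy.groupby("level")["Anz"].sum())

# Keeping a single level removes the duplication.
districts_only = toy[toy["level"] == "BEZIMST"]
print(districts_only["Anz"].sum())  # 250
```

If every level reports the same total, as here, the rows of all but one level can be dropped without losing information.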

In [140]:
dataset1 = dataset1.loc[dataset1["PolBez"].str.contains("BEZIMST", regex=False)] # keep only district rows; only districts have the prefix BEZIMST
dataset1["Anz"].sum()

17322822

The sum is still twice as large as Austria's population. Similarly, the column ``Altersgr`` contains overlapping records for two sets of age groups: the 5-year groups (prefix ``ALTGRIMST``) and coarser groups (prefix ``ALT10IMST``). We keep the more specific 5-year groups and remove the rest.

In [141]:
dataset1 = dataset1.loc[dataset1["Altersgr"].str.contains("ALTGRIMST", regex=False)] # keep only age groups with the prefix ALTGRIMST
dataset1["Anz"].sum()

8661411

Now we have removed all duplicate values. Let's continue cleaning our data.
## Null Values

In [142]:
dataset1.isnull().sum()

PolBez            0
Altersgr          0
GebLand           0
Geschl            0
GeimpftGenesen    0
Anz               0
dtype: int64

In [143]:
dataset2.isnull().sum()

Bundesland        0
Bildung           0
Altersgr          0
ErwerbStatus      0
GeimpftGenesen    0
Anz               0
dtype: int64

Since there are no null values, we can continue. (Note: ``Bildung`` does contain missing values, marked with ``BILIMST-99`` (see below); it is not yet clear how they should be handled.)

In [144]:
dataset2["Bildung"].value_counts()

Pflichtschule          288
Lehrabschluss/BMS      288
BHS/AHS/Kolleg         288
Akademie/Hochschule    288
N/A                    252
Name: Bildung, dtype: int64
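
Before deciding how to treat the ``N/A`` category, it may help to check how many people it actually covers, since `value_counts()` counts rows and not records. The sketch below is one possible approach on a hypothetical toy frame with invented counts; it is not the notebook's chosen treatment, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical toy rows in the cleaned dataset2 layout; the counts are invented.
toy = pd.DataFrame({
    "Bildung": ["Pflichtschule", "N/A", "Akademie/Hochschule", "N/A"],
    "Anz": [600, 1, 398, 1],
})

is_unknown = toy["Bildung"] == "N/A"

# Share of rows versus share of people (weighted by Anz).
row_share = is_unknown.mean()
people_share = toy.loc[is_unknown, "Anz"].sum() / toy["Anz"].sum()
print(f"rows: {row_share:.1%}, people: {people_share:.1%}")
# rows: 50.0%, people: 0.2%

# One option: turn the placeholder into a real missing value so isnull() reports it.
toy["Bildung"] = toy["Bildung"].mask(is_unknown)
print(toy["Bildung"].isnull().sum())  # 2
```

Whether such rows are dropped, kept as a separate category, or marked as missing depends on the later analysis.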

After this, we are done cleaning our data and can save the results to two files: ``korr1.csv`` for ``dataset1.csv`` and ``korr2.csv`` for ``dataset2.csv``.

In [145]:
dataset1.to_csv("/content/drive/MyDrive/output/korr1.csv", sep=";")
dataset2.to_csv("/content/drive/MyDrive/output/korr2.csv", sep=";")
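
By default, `to_csv` also writes the DataFrame index as a first column with an empty header, which shows up as ``Unnamed: 0`` when the file is read back without further options. The sketch below illustrates this on a hypothetical two-row frame, using an in-memory buffer instead of Drive; the variable names are assumptions, and both options are alternatives rather than the notebook's own choice.

```python
import io

import pandas as pd

# Hypothetical frame with a non-contiguous index, as left behind by row filtering.
frame = pd.DataFrame(
    {"PolBez": ["BEZIMST-101", "BEZIMST-204"], "Anz": [1, 448]},
    index=[0, 4794],
)

# Default behaviour: the index is written as an extra, unnamed first column.
buffer = io.StringIO()
frame.to_csv(buffer, sep=";")
buffer.seek(0)
with_index = pd.read_csv(buffer, sep=";")
print(with_index.columns.tolist())  # ['Unnamed: 0', 'PolBez', 'Anz']

# Option A: read the first column back as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";", index_col=0)

# Option B: do not write the index at all.
no_index = io.StringIO()
frame.to_csv(no_index, sep=";", index=False)
```

Option A keeps the original row labels, while option B gives a cleaner file when the labels carry no meaning.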

# Data Preparation
## Numerical Values

In [146]:
dataset1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32287 entries, 0 to 49823
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   PolBez          32287 non-null  object
 1   Altersgr        32287 non-null  object
 2   GebLand         32287 non-null  object
 3   Geschl          32287 non-null  object
 4   GeimpftGenesen  32287 non-null  int64 
 5   Anz             32287 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 1.7+ MB
