# Analysis of the South Korea COVID-19 Kaggle Dataset

## Introduction: COVID-19
- The COVID-19 pandemic is a defining event of the 21st century.
- Data science can help us understand it and assess the risks.
- Datasets are publicly available.

## The Data

Link to the COVID-19 Kaggle Dataset for South Korea:
https://www.kaggle.com/kimjihoo/coronavirusdataset
You need a Kaggle account to access the dataset.


## Let's open the folder and have a first look at the data

### Imports

In [7]:
import os
import pandas as pd 
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import numpy as np

In [8]:
for file in os.listdir('coronavirusdataset/'):
    print(file)

SeoulFloating.csv
TimeAge.csv
SearchTrend.csv
TimeProvince.csv
Weather.csv
PatientRoute.csv
PatientInfo.csv
Region.csv
TimeGender.csv
Case.csv
Time.csv


There are quite a few files in CSV format. Let's load them into a dict and print the head of each.

In [9]:
df_dict={}
for file in os.listdir('coronavirusdataset/'):
    df=pd.read_csv('coronavirusdataset/'+file)
    df_dict[file.split('.')[0]]=df
    print(file.split('.')[0])
    print(df.head())

SeoulFloating
         date  hour  birth_year     sex province           city  fp_num
0  2020-01-01     0          20  female    Seoul      Dobong-gu   19140
1  2020-01-01     0          20    male    Seoul      Dobong-gu   19950
2  2020-01-01     0          20  female    Seoul  Dongdaemun-gu   25450
3  2020-01-01     0          20    male    Seoul  Dongdaemun-gu   27050
4  2020-01-01     0          20  female    Seoul     Dongjag-gu   28880
TimeAge
         date  time  age  confirmed  deceased
0  2020-03-02     0   0s         32         0
1  2020-03-02     0  10s        169         0
2  2020-03-02     0  20s       1235         0
3  2020-03-02     0  30s        506         1
4  2020-03-02     0  40s        633         1
SearchTrend
         date     cold      flu  pneumonia  coronavirus
0  2016-03-17  0.15554  0.34471    0.18181      0.01236
1  2016-03-18  0.14417  0.49416    0.17563      0.01027
2  2016-03-19  0.13290  0.39907    0.15145      0.01154
3  2016-03-20  0.13863  0.39662   
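
The loading loop above can also be written as a small helper. The following is a minimal sketch, not the notebook's actual code: the function name `load_csv_folder` is hypothetical, and it assumes every `.csv` file in the folder can be read with pandas' default options. Filtering on the `.csv` suffix avoids errors from stray non-CSV files in the folder.

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Read every .csv file in `folder` into a dict keyed by file stem.

    Hypothetical helper for illustration; assumes pandas' default
    parsing options are adequate for each file.
    """
    frames = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # Path.stem drops the directory and the ".csv" suffix,
        # e.g. "coronavirusdataset/TimeAge.csv" -> "TimeAge".
        frames[path.stem] = pd.read_csv(path)
    return frames


if __name__ == "__main__":
    folder = Path("coronavirusdataset")
    # Only run the demo if the dataset folder is actually present.
    if folder.is_dir():
        df_dict = load_csv_folder(folder)
        for name, frame in df_dict.items():
            print(name, frame.shape)
```

Printing the shape instead of the head keeps the overview compact when there are many files.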

File | Description
--- | ---
Case | Data of COVID-19 infection cases

Let's try to understand how the virus progresses. To do so, we take a look at the "TimeAge" data and try to find out who is most affected and who is most at risk. We look at the overall and cumulative numbers of confirmed cases and deaths and calculate the mortality rate. We use [plotly](https://plotly.com/) for this, a fantastic, easy-to-use library for interactive plots.

In [128]:
df=df_dict['TimeAge']
fig = make_subplots(rows=3, cols=2, subplot_titles=("Number of confirmed cases", "Cumulative number of confirmed cases", "Number of deceased", "Cumulative number of deceased", "Mortality rate"))
# Start the colormap at 0.3 so the lightest line stays visible on a white background
colors=plt.cm.Blues(np.linspace(0.3, 1, len(df['age'].unique())))

for i, age in enumerate(df['age'].unique()):
    plt_df=df[df['age']==age]
    x=plt_df['date'].values
    y_conf=plt_df['confirmed'].values
    y_conf_cum=plt_df['confirmed'].cumsum().values
    
    y_des=plt_df['deceased'].values
    y_des_cum=plt_df['deceased'].cumsum().values

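    # mortality rate in percent
    y_des_rate=y_des/y_conf*100

    # matplotlib returns RGBA floats in [0, 1]; plotly expects 0-255 for R, G and B
    r, g, b, a = colors[i]
    color=f"rgba({int(r*255)},{int(g*255)},{int(b*255)},{a})"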
    
    
    fig.add_trace(go.Scatter(x=x, y=y_conf, name=age, line=dict(color=color)), row=1, col=1)
    fig.add_trace(go.Scatter(x=x, y=y_conf_cum, name=f"{age} cumulative", line=dict(color=color), showlegend=False), row=1, col=2)
    fig.add_trace(go.Scatter(x=x, y=y_des, name=age, line=dict(color=color), showlegend=False), row=2, col=1)
    fig.add_trace(go.Scatter(x=x, y=y_des_cum, name=f"{age} cumulative", line=dict(color=color), showlegend=False), row=2, col=2)
    fig.add_trace(go.Scatter(x=x, y=y_des_rate, name=f"{age} mortality rate", line=dict(color=color), showlegend=False), row=3, col=1)

fig.update_yaxes(title_text="[-]", row=1, col=1)
fig.update_yaxes(title_text="[-]", row=1, col=2)
fig.update_yaxes(title_text="[-]", row=2, col=1)
fig.update_yaxes(title_text="[-]", row=2, col=2)
fig.update_yaxes(title_text="[%]", row=3, col=1)
fig.update_layout(height=800,
                  width=1000,
                  title_text="Analysis of COVID progression over time",
                  template="plotly_white")
fig.show()
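
The mortality rate in the cell above is the ratio of deceased to confirmed cases, in percent. As a hedged sketch (the helper name `mortality_rate_percent` is made up for illustration), the division can be guarded so that entries with zero confirmed cases yield `NaN` instead of a division warning:

```python
import numpy as np


def mortality_rate_percent(deceased, confirmed):
    """Return 100 * deceased / confirmed, with NaN where confirmed == 0."""
    deceased = np.asarray(deceased, dtype=float)
    confirmed = np.asarray(confirmed, dtype=float)
    rate = np.full_like(confirmed, np.nan)
    # Only divide where there is at least one confirmed case;
    # all other entries keep their NaN placeholder.
    np.divide(deceased, confirmed, out=rate, where=confirmed > 0)
    return rate * 100


# Counts from the TimeAge head shown earlier (30s and 40s on 2020-03-02),
# plus an artificial zero row to show the guard.
print(mortality_rate_percent([1, 1, 0], [506, 633, 0]))
```

Plotly leaves a gap in the line wherever the value is `NaN`, which is usually preferable to a spike caused by an infinite value.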

From this graph we can observe a couple of interesting things:
- Younger people tend to have a higher infection rate, especially people in their 20s
- People over 70 show the highest mortality rate, between 5 and 10 %
- The cumulative curves show an exponential trend, so a further increase can be anticipated
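
The exponential trend noted in the last bullet can be checked numerically. One common approach, sketched here under the assumption that the series is strictly positive, is to fit a straight line to the logarithm of the cumulative counts: a good linear fit suggests exponential growth, and the slope gives a doubling time. The function name and the synthetic series are illustrative and not taken from the dataset.

```python
import numpy as np


def doubling_time(cumulative_counts):
    """Estimate the doubling time (in time steps) from a log-linear fit.

    Assumes all counts are > 0 and roughly follow c(t) = c0 * exp(k * t).
    """
    counts = np.asarray(cumulative_counts, dtype=float)
    t = np.arange(len(counts))
    # Fit log(c) = k * t + log(c0); polyfit returns [slope, intercept].
    slope, _intercept = np.polyfit(t, np.log(counts), 1)
    return np.log(2) / slope


# Synthetic series that doubles every 3 steps (not real data).
synthetic = 10 * 2 ** (np.arange(12) / 3)
print(round(doubling_time(synthetic), 2))
```

If the curve flattens, the linear fit in log space becomes poor and the estimate should not be trusted.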

Let's investigate whether gender has an effect. To do so, we take a look at the TimeGender dataset and plot the mortality rate in the same way as above.

In [141]:
df=df_dict['TimeGender']
fig = go.Figure()
# Start the colormap at 0.3 so the lightest line stays visible on a white background
colors=plt.cm.Blues(np.linspace(0.3, 1, len(df['sex'].unique())))

for i, sex in enumerate(df['sex'].unique()):
    plt_df=df[df['sex']==sex]
    x=plt_df['date'].values
    y_conf=plt_df['confirmed'].values
    
    y_des=plt_df['deceased'].values

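    # mortality rate in percent
    y_des_rate=y_des/y_conf*100

    # matplotlib returns RGBA floats in [0, 1]; plotly expects 0-255 for R, G and B
    r, g, b, a = colors[i]
    color=f"rgba({int(r*255)},{int(g*255)},{int(b*255)},{a})"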
    
    
    # Show the legend so the two lines can be told apart
    fig.add_trace(go.Scatter(x=x, y=y_des_rate, name=sex, line=dict(color=color), showlegend=True))

fig.update_yaxes(title_text="[%]")
fig.update_layout(height=400,
                  width=600,
                  title_text="Influence of gender on the mortality rate",
                  template="plotly_white")
fig.show()

OK... this does not look good for men: on average, their mortality rate is twice as high as that of women, and both rates still seem to be increasing.
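
To put a number on the comparison, the latest mortality rate per sex can be computed directly. This is a sketch with made-up values; the column names `date`, `sex`, `confirmed` and `deceased` are the ones used in the plotting cell above.

```python
import pandas as pd

# Made-up values for illustration only, not taken from the dataset.
toy = pd.DataFrame(
    {
        "date": ["2020-03-02", "2020-03-02", "2020-03-03", "2020-03-03"],
        "sex": ["male", "female", "male", "female"],
        "confirmed": [1000, 2000, 1200, 2400],
        "deceased": [10, 10, 12, 12],
    }
)

# Take the most recent row per sex, then compute the rate in percent.
latest = toy.sort_values("date").groupby("sex").last()
latest["mortality_pct"] = latest["deceased"] / latest["confirmed"] * 100

ratio = latest.loc["male", "mortality_pct"] / latest.loc["female", "mortality_pct"]
print(latest[["mortality_pct"]])
print(f"male/female mortality ratio: {ratio:.2f}")
```

Replacing `toy` with `df_dict['TimeGender']` should give the corresponding figure for the real data, provided the file has the same columns.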

Now we want to investigate the demographics and the influence of population size and density. Unfortunately, the Kaggle dataset does not provide this data, only the names of the provinces and cities, which limits how well the results can be transferred to other regions. We therefore need to get the data ourselves. Fortunately, there are credible sites like [City Population](https://citypopulation.de/) that use a standardized data format from which we can easily extract data. We are especially interested in the overall population and the population density.

![City Population table](./static/images/citypopulation_table.png)

To get the information for all cities listed in the TimeProvince dataset, one could either google the cities individually or write a web scraper to get the information from the City Population website. Since the latter approach is more fun, I did just that, using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a web scraping library, and [Selenium](https://selenium-python.readthedocs.io/), a web driver library. Both can be installed via pip. I am not going to explain the code in detail, but this is basically what happens:
- Selenium automates the search for a city
- With BeautifulSoup we extract the standardized HTML table
- With a regex operation the two numbers for the population and the population density are extracted
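
The regex step in the last bullet can be tried in isolation. The HTML fragment below is hypothetical: it only mimics the `data-density` attribute and the `<td>` cell layout that the regular expressions in the scraper rely on, and the real page markup may differ.

```python
import re

# Hypothetical table cells shaped like the ones the scraper expects.
density_cell = '<td class="rname" data-density="16000.5">Example-si</td>'
population_cell = '<td class="rpop prio1">1,234,567</td>'

# The non-greedy group captures the attribute value between the quotes.
density_match = re.search(r'data-density="(.*?)"', density_cell)
# Captures the cell text between the end of the opening tag and </td>.
population_match = re.search(r'">(.*?)</td>', population_cell)

# re.search returns None when nothing matches, so check before using it.
if density_match and population_match:
    density = float(density_match.group(1))
    population_text = population_match.group(1)
    print(density, population_text)
```

Checking for `None` makes a failed match explicit instead of surfacing later as an `AttributeError` on `.group`.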

In [33]:
from bs4 import BeautifulSoup
import requests
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from tqdm import tqdm

city_pd=pd.DataFrame([])

driver=webdriver.Firefox()

city_list=df_dict['TimeProvince']['province'].unique()
for city in tqdm(city_list):
    i=city_pd.shape[0]
    city_pd.at[i,'city']=city

    # Go to the City Population search page
    driver.get("https://www.citypopulation.de/search.html")
    time.sleep(1)
    # Search for the city
    try:
        # find_element(By...) works in Selenium 3 and 4; the older
        # find_element_by_* helpers were removed in Selenium 4
        search_country=driver.find_element(By.ID, "countries1")
        search_country.send_keys('South Korea')
        search_place=driver.find_element(By.ID, "places1")
        search_place.send_keys(city)
        search_place.send_keys(Keys.RETURN)
        time.sleep(1)

        # Open the first search result
        result=driver.find_element(By.CLASS_NAME, "result")
        print(result)
        result.click()

        # Scrape the HTML of the result page
        URL=driver.current_url

        response=requests.get(URL)
        soup=BeautifulSoup(response.text,'html.parser')

        # Get the table with the population information
        table=soup.find('table',{"id":"tl"}).tbody
        rows = table.find_all("td")

        # Extract population density and population from the table
        city_pd.at[i,'PopulationDensity']=float(re.search(r'data-density="(.*?)"', str(rows[0])).group(1))
        # Remove thousands separators before converting to float
        city_pd.at[i,'Population']=float(re.search(r'">(.*?)</td>', str(rows[-2])).group(1).replace(',',''))
        time.sleep(1)
    except Exception:
        # Catch Exception instead of using a bare except so that Ctrl+C still interrupts the loop
        print(f'Did not work for {city}')

# quit() closes all windows and ends the browser session
driver.quit()
print(city_pd)



  0%|          | 0/17 [00:00<?, ?it/s]
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="8839c7fa-320a-584e-9bf2-d15033d32615", element="3d711ab5-b1f3-8041-878c-2a4ddcbd2ee6")>
  6%|▌         | 1/17 [00:06<01:37,  6.09s/it]
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="8839c7fa-320a-584e-9bf2-d15033d32615", element="48611cad-f9a6-c742-b2d9-9ae5840856a3")>
 12%|█▏        | 2/17 [00:10<01:24,  5.62s/it]
 18%|█▊        | 3/17 [00:13<01:05,  4.69s/it]
Did not work for Daegu
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="8839c7fa-320a-584e-9bf2-d15033d32615", element="d269d28e-b0fb-0044-a708-0ea5f17d9b87")>
 24%|██▎       | 4/17 [00:17<00:59,  4.56s/it]
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="8839c7fa-320a-584e-9bf2-d15033d32615", element="db187f3f-6dc6-bc42-b8fb-beb725f26077")>
 29%|██▉       | 5/17 [00:21<00:54,  4.52s/it]
<selenium.webdriver.firefox.webelement.Fi

In [30]:
float('10,010,983')

ValueError: could not convert string to float: '10,010,983'
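
The error above occurs because `float` does not accept thousands separators. A small helper, sketched here with a hypothetical name, strips the commas first. It assumes that commas are only ever used as thousands separators, which holds for the number shown here but not for locales that use a comma as the decimal mark.

```python
def parse_grouped_number(text):
    """Convert a string such as '10,010,983' to a float.

    Assumes ',' is a thousands separator, never a decimal mark.
    """
    return float(text.replace(",", "").strip())


print(parse_grouped_number("10,010,983"))
```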

So with the exception of Jeollabuk-do, all numbers could be scraped automatically. The missing information was added manually.
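
Once the population figures are available, they can be joined to the case counts so that provinces are compared on an equal footing. The following sketch uses made-up numbers and placeholder province names. It assumes the scraped table has the columns `city` and `Population` as created in the scraper cell, and that the case table has `province` and `confirmed`.

```python
import pandas as pd

# Made-up values for illustration only.
cases = pd.DataFrame(
    {"province": ["A-do", "B-si"], "confirmed": [500, 100]}
)
city_pd = pd.DataFrame(
    {"city": ["A-do", "B-si"], "Population": [2_000_000.0, 100_000.0]}
)

# A left join keeps every province even if scraping failed for it;
# such rows get NaN in the population columns.
merged = cases.merge(city_pd, left_on="province", right_on="city", how="left")

# Confirmed cases per 100,000 residents.
merged["confirmed_per_100k"] = merged["confirmed"] / merged["Population"] * 1e5
print(merged[["province", "confirmed", "confirmed_per_100k"]])
```

If the `confirmed` column holds one row per date and province, select a single date (for example the latest one) before merging, so that each province appears only once.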

In [12]:
print(df_dict['TimeProvince'].head(2))
print(df_dict['Region'].head(2))
print(df_dict['Case'].head(2))
print(df_dict['Time'].head(2))

         date  time province  confirmed  released  deceased
0  2020-01-20    16    Seoul          0         0         0
1  2020-01-20    16    Busan          0         0         0
    code province        city   latitude   longitude  elementary_school_count  \
0  10000    Seoul       Seoul  37.566953  126.977977                      607   
1  10010    Seoul  Gangnam-gu  37.518421  127.047222                       33   

   kindergarten_count  university_count  academy_ratio  \
0                 830                48           1.44   
1                  38                 0           4.18   

   elderly_population_ratio  elderly_alone_ratio  nursing_home_count  
0                     15.38                  5.8               22739  
1                     13.17                  4.3                3088  
   case_id province           city  group       infection_case  confirmed  \
0  1000001    Seoul        Guro-gu   True  Guro-gu Call Center         79   
1  1000002    Seoul  Dongdaemun-gu   True  

In [155]:
df_dict['PatientInfo']['city'].unique()

array(['Gangseo-gu', 'Jungnang-gu', 'Jongno-gu', 'Mapo-gu', 'Seongbuk-gu',
       'etc', 'Songpa-gu', 'Seodaemun-gu', 'Seongdong-gu', 'Seocho-gu',
       'Guro-gu', 'Gangdong-gu', 'Eunpyeong-gu', 'Geumcheon-gu',
       'Gwanak-gu', 'Nowon-gu', 'Dongjak-gu', 'Gangnam-gu',
       'Yangcheon-gu', 'Gwangjin-gu', 'Dongdaemun-gu', 'Yeongdeungpo-gu',
       'Dobong-gu', 'Yongsan-gu', 'Gangbuk-gu', 'Jung-gu', 'Dongnae-gu',
       'Haeundae-gu', 'Yeonje-gu', nan, 'Buk-gu', 'Nam-gu', 'Seo-gu',
       'Geumjeong-gu', 'Saha-gu', 'Suyeong-gu', 'Sasang-gu',
       'Busanjin-gu', 'Dalseo-gu', 'Dalseong-gun', 'Suseong-gu',
       'Dong-gu', 'Wuhan', 'Bupyeong-gu', 'Michuhol-gu', 'Yeonsu-gu',
       'Gyeyang-gu', 'Namdong-gu', 'Yuseong-gu', 'Daedeok-gu', 'Ulju-gun',
       'Sejong', 'Goyang-si', 'Pyeongtaek-si', 'Bucheon-si', 'Suwon-si',
       'Guri-si', 'Siheung-si', 'Gimpo-si', 'Icheon-si', 'Pocheon-si',
       'Anyang-si', 'Yongin-si', 'Paju-si', 'Namyangju-si', 'Seongnam-si',
       'Gwangmyeong-s

In [156]:
df_dict['Case']

Unnamed: 0,case_id,province,city,group,infection_case,confirmed,latitude,longitude
0,1000001,Seoul,Guro-gu,True,Guro-gu Call Center,79,37.508163,126.884387
1,1000002,Seoul,Dongdaemun-gu,True,Dongan Church,24,37.592888,127.056766
2,1000003,Seoul,Eunpyeong-gu,True,Eunpyeong St. Mary's Hospital,14,37.63369,126.9165
3,1000004,Seoul,Seongdong-gu,True,Seongdong-gu APT,13,37.55713,127.0403
4,1000005,Seoul,Jongno-gu,True,Jongno Community Center,10,37.57681,127.006
5,1000006,Seoul,Jung-gu,True,Jung-gu Fashion Company,7,37.562405,126.984377
6,1000007,Seoul,from other city,True,Shincheonji Church,6,-,-
7,1000008,Seoul,-,False,etc,65,-,-
8,1100001,Busan,Dongnae-gu,True,Onchun Church,34,35.21628,129.0771
9,1100002,Busan,from other city,True,Shincheonji Church,8,-,-

