# Korean Covid-19 Infection Percentage Prediction

Now that we have a regression model for each cluster, I'd like to use them to predict the Covid-19 infection percentage in Korea.

##### Here, I used data from US_state_hospitalbeds to fit the models, together with Korean data compiled by our team.

## 1. Fit Cluster and Linear Regression Model

### Fitting Cluster

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Load cluster data
data_1=pd.read_csv("datasets/population_beds.csv")
data_1.drop(['Unnamed: 0'],axis=1,inplace=True)
data_1.head()

| | State | LandArea | Density | Population | Total Hospital Beds | Total ICU Beds |
|---|---|---|---|---|---|---|
| 0 | Alabama | 50645 | 97.4271 | 4934193 | 14793.0 | 1903.0 |
| 1 | Alaska | 570641 | 1.2694 | 724357 | 1583.0 | 130.0 |
| 2 | Arizona | 113594 | 66.2016 | 7520103 | 12590.0 | 1702.0 |
| 3 | Arkansas | 52035 | 58.3059 | 3033946 | 8560.0 | 908.0 |
| 4 | California | 155779 | 254.2929 | 39613493 | 68074.0 | 8105.0 |


In [5]:
# Fit a KMeans clustering model on "Total Hospital Beds" (n_clusters=3)
from sklearn.cluster import KMeans

kmeans=KMeans(n_clusters=3,random_state=144,max_iter=500)
kmeans.fit(data_1[["Total Hospital Beds"]])

KMeans(max_iter=500, n_clusters=3, random_state=144)

In [6]:
kmeans.labels_

array([1, 1, 1, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 2, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 2, 1,
       1, 0, 1, 1, 1, 1])
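The label ids printed above are arbitrary, so it helps to check which id corresponds to small, medium, or large bed counts. The following is a minimal sketch, assuming only the five bed counts visible in `data_1.head()`; the names `km`, `order`, and `size_name` are illustrative and not part of the notebook, and the labels it prints need not match the ids in the full 50-state fit.

```python
import numpy as np
from sklearn.cluster import KMeans

# Bed counts for the five states shown in data_1.head() (a subset, for illustration only)
beds = np.array([14793.0, 1583.0, 12590.0, 8560.0, 68074.0]).reshape(-1, 1)

# n_init is set explicitly because its default differs across scikit-learn versions
km = KMeans(n_clusters=3, random_state=144, max_iter=500, n_init=10).fit(beds)

# Rank the arbitrary label ids by their cluster center (ascending bed count)
order = np.argsort(km.cluster_centers_.ravel())
size_name = {int(label): name for label, name in zip(order, ["small", "medium", "large"])}

for value, label in zip(beds.ravel(), km.labels_):
    print(f"{value:>8.0f} beds -> cluster {label} ({size_name[int(label)]})")
```

Reading the centers this way makes it easier to interpret later results, such as which cluster a Korean province is assigned to.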

### Fitting Regression Model

In [7]:
# Load Regression Data
data_2=pd.read_csv("datasets/final_data.csv")
data_2.drop(['Unnamed: 0'],inplace=True,axis=1)
data_2.head()

| | State | LandArea | Density | Population | Total Hospital Beds | Total ICU Beds | 0 | 1 | 2 | 3 | ... | 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 50645 | 97.4271 | 4934193 | 14793.0 | 1903.0 | 0.0008 | 0.0008 | 0.0009 | 0.0009 | ... | 0.1117 | 0.1117 | 0.1117 | 0.1117 | 0.1117 | 0.1117 | 0.1121 | 0.1121 | 0.1123 | 0.1124 |
| 1 | Alaska | 570641 | 1.2694 | 724357 | 1583.0 | 130.0 | 0.0004 | 0.0004 | 0.0004 | 0.0004 | ... | 0.0984 | 0.0985 | 0.0985 | 0.0985 | 0.0985 | 0.0985 | 0.0988 | 0.0988 | 0.099 | 0.099 |
| 2 | Arizona | 113594 | 66.2016 | 7520103 | 12590.0 | 1702.0 | 0.0005 | 0.0005 | 0.0005 | 0.0006 | ... | 0.1191 | 0.1191 | 0.1192 | 0.1193 | 0.1193 | 0.1194 | 0.1195 | 0.1195 | 0.1197 | 0.1198 |
| 3 | Arkansas | 52035 | 58.3059 | 3033946 | 8560.0 | 908.0 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | ... | 0.1154 | 0.1156 | 0.1156 | 0.1156 | 0.116 | 0.1161 | 0.1164 | 0.1168 | 0.1172 | 0.1172 |
| 4 | California | 155779 | 254.2929 | 39613493 | 68074.0 | 8105.0 | 0.0006 | 0.0006 | 0.0007 | 0.0007 | ... | 0.0964 | 0.0964 | 0.0964 | 0.0965 | 0.0965 | 0.0965 | 0.0966 | 0.0967 | 0.0967 | 0.0968 |


In [8]:
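# Select the columns used by the regression model: Density, Total Hospital Beds, and two day columns
# Note: iloc is positional and also counts the six metadata columns, so positions 448 and 453 are the day columns labeled 442 and 447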
data=pd.DataFrame(data_2['Density'])
data['Total Hospital Beds']=data_2['Total Hospital Beds']
data['Lag 5']=data_2.iloc[:,448]
data['Confirmed(%)']=data_2.iloc[:,453]
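Because `iloc` selects by position while the day columns are named by day number, the two can point at different columns. The sketch below illustrates the offset on a hypothetical miniature of `final_data.csv` with the same six metadata columns and only ten day columns; the frame `toy` and its day values are made up for the illustration.

```python
import pandas as pd

# Hypothetical miniature of final_data.csv: six metadata columns, then day columns "0".."9"
meta = ["State", "LandArea", "Density", "Population", "Total Hospital Beds", "Total ICU Beds"]
days = [str(d) for d in range(10)]
row = ["Alabama", 50645, 97.4271, 4934193, 14793.0, 1903.0] + [d / 100 for d in range(10)]
toy = pd.DataFrame([row], columns=meta + days)

by_position = toy.iloc[:, 8]  # position 8 counts the six metadata columns as well
by_label = toy["8"]           # label "8" is the day-8 column

print("iloc[:, 8] selects the column labeled:", by_position.name)
print('toy["8"] selects the column labeled:', by_label.name)
```

Selecting by label (for example `data_2['453']`, assuming the day columns are read as strings) would pick the day the comment names; the notebook's outputs correspond to the positional selection.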

In [9]:
# Add Cluster
data['Cluster']=kmeans.labels_

In [10]:
data['Cluster'].value_counts()

1    35
0    11
2     4
Name: Cluster, dtype: int64

In [11]:
data.head()

| | Density | Total Hospital Beds | Lag 5 | Confirmed(%) | Cluster |
|---|---|---|---|---|---|
| 0 | 97.4271 | 14793.0 | 0.1116 | 0.1117 | 1 |
| 1 | 1.2694 | 1583.0 | 0.0983 | 0.0985 | 1 |
| 2 | 66.2016 | 12590.0 | 0.1189 | 0.1193 | 1 |
| 3 | 58.3059 | 8560.0 | 0.1149 | 0.1156 | 1 |
| 4 | 254.2929 | 68074.0 | 0.0964 | 0.0965 | 2 |


In [12]:
# DataFrame grouped by Clusters
data0=data[data['Cluster']==0]
data1=data[data['Cluster']==1]
data2=data[data['Cluster']==2]

In [13]:
print(len(data0))
print(len(data1))
print(len(data2))

11
35
4


In [14]:
# drop() without inplace returns a new DataFrame, which avoids SettingWithCopyWarning on the filtered slices
data0=data0.drop(['Cluster'],axis=1)
data1=data1.drop(['Cluster'],axis=1)
data2=data2.drop(['Cluster'],axis=1)
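Filtering a DataFrame and then modifying the result in place is what pandas warns about with SettingWithCopyWarning. One way to sidestep it is to filter and drop in a single chain so that each subset is a fresh DataFrame. This is a minimal sketch on a hypothetical stand-in for `data` built from the five rows shown above; the dictionary `subsets` is an illustrative name, not something the notebook defines.

```python
import pandas as pd

# Hypothetical stand-in for `data`, reusing the first five rows shown above
data = pd.DataFrame({
    "Density": [97.4271, 1.2694, 66.2016, 58.3059, 254.2929],
    "Total Hospital Beds": [14793.0, 1583.0, 12590.0, 8560.0, 68074.0],
    "Cluster": [1, 1, 1, 1, 2],
})

# Filter and drop in one chain; drop() without inplace returns a new DataFrame
subsets = {
    label: data[data["Cluster"] == label].drop(columns=["Cluster"])
    for label in sorted(data["Cluster"].unique())
}

print(subsets[1])
```

The original `data` keeps its `Cluster` column, and each entry of `subsets` can be modified later without touching it.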


**(1) cluster = 0**

In [15]:
from sklearn.linear_model import LinearRegression

X_features=data0.drop(['Confirmed(%)'],axis=1)
y_target=data0['Confirmed(%)']

linear_reg0=LinearRegression()
linear_reg0.fit(X_features,y_target)

LinearRegression()

In [16]:
linear_reg0.coef_

array([-1.99169667e-07, -7.74848263e-09,  9.99092435e-01])

**(2) cluster = 1**

In [17]:
X_features=data1.drop(['Confirmed(%)'],axis=1)
y_target=data1['Confirmed(%)']

linear_reg1=LinearRegression()
linear_reg1.fit(X_features,y_target)

LinearRegression()

In [18]:
linear_reg1.coef_

array([-1.93002091e-07, -7.47275621e-10,  1.00157016e+00])

**(3) cluster = 2**

In [19]:
X_features=data2.drop(['Confirmed(%)'],axis=1)
y_target=data2['Confirmed(%)']

linear_reg2=LinearRegression()
linear_reg2.fit(X_features,y_target)

LinearRegression()

In [20]:
linear_reg2.coef_

array([-4.35421262e-05,  1.26585805e-06,  3.23559014e+00])
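Cluster 2 contains only four states, while the model has three coefficients plus an intercept, that is, four free parameters. In that situation a linear regression can generally pass through every training point exactly, whatever the targets are. The sketch below illustrates this on synthetic random data (not the notebook's data), so the numbers themselves carry no meaning.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # four samples, three features (synthetic)
y = rng.normal(size=4)       # targets generated independently of X

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

print("largest absolute residual:", np.abs(residuals).max())
print("R^2 on the training data:", model.score(X, y))
```

A perfect fit on four points says little about how the model behaves on new inputs, which is worth keeping in mind when reading the cluster 2 coefficients.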

## 2. Predict Korean Covid-19 Infection percentage (%)

In [21]:
# Load data
korea=pd.read_csv("datasets/koreadata.csv")

In [22]:
korea.shape

(17, 7)

In [23]:
korea.head()

| | Province | PopDensity | HosBeds(2019) | Confirmed | Death | Recovered | Popularity |
|---|---|---|---|---|---|---|---|
| 0 | Seoul | 15865 | 57142 | 30635 | 88 | 30245 | 51781000 |
| 1 | Busan | 4342 | 6925 | 4291 | 72 | 4471 | 9602000 |
| 2 | Daegu | 2738 | 10933 | 2754 | 4 | 3055 | 3344000 |
| 3 | Incheon | 2770 | 7682 | 3863 | 4 | 3788 | 2419000 |
| 4 | Gwangju | 2969 | 3106 | 1839 | 3 | 1784 | 1488000 |


In [24]:
# Drop unnecessary columns in predicting
korea.drop(['Death','Recovered'],axis=1,inplace=True)

In [25]:
# Rename HosBeds(2019) as Beds
korea.rename(columns={'HosBeds(2019)':'Beds'},inplace=True)

In [26]:
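# Divide "Confirmed" by "Popularity" (used here as the population) to get the infected fraction, rename it "Confirmed(%)" to match the US data, and drop "Popularity"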
korea['Confirmed']=korea['Confirmed']/korea['Popularity']
korea.rename(columns={'Confirmed':'Confirmed(%)'},inplace=True)
korea.drop(['Popularity'],axis=1,inplace=True)

### Clustering
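Here, we reuse the KMeans model fitted on Total Hospital Beds above to cluster the Korean provinces by the same criterion.

In [27]:
# Rename "Beds" to the column name used at fit time; newer scikit-learn versions check feature names
korea['cluster']=kmeans.predict(korea[['Beds']].rename(columns={'Beds':'Total Hospital Beds'}))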

In [28]:
korea.head()

| | Province | PopDensity | Beds | Confirmed(%) | cluster |
|---|---|---|---|---|---|
| 0 | Seoul | 15865 | 57142 | 0.000592 | 2 |
| 1 | Busan | 4342 | 6925 | 0.000447 | 1 |
| 2 | Daegu | 2738 | 10933 | 0.000824 | 1 |
| 3 | Incheon | 2770 | 7682 | 0.001597 | 1 |
| 4 | Gwangju | 2969 | 3106 | 0.001236 | 1 |


In [29]:
korea['cluster'].value_counts()

1    15
2     2
Name: cluster, dtype: int64

There are 15 provinces in cluster_1 and 2 provinces in cluster_2. No province falls in cluster_0.
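`KMeans.predict` assigns each new row to the nearest center learned from the US data. The sketch below reproduces that step on the first five rows of each table, passing plain NumPy arrays so that the differing column names (`Total Hospital Beds` at fit time, `Beds` at predict time) do not matter. It assumes only those ten values, so the label ids it prints need not match the ones from the full 50-state fit.

```python
import numpy as np
from sklearn.cluster import KMeans

# First five rows of each table shown above (illustrative subsets)
us_beds = np.array([14793.0, 1583.0, 12590.0, 8560.0, 68074.0]).reshape(-1, 1)
korea_beds = np.array([57142, 6925, 10933, 7682, 3106], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=3, random_state=144, max_iter=500, n_init=10).fit(us_beds)

# predict() returns, for each province, the index of the nearest US-derived center
korea_labels = km.predict(korea_beds)
centers = km.cluster_centers_.ravel()

for beds, label in zip(korea_beds.ravel(), korea_labels):
    print(f"{beds:>7.0f} beds -> cluster {label} (center at {centers[label]:.0f} beds)")
```

In this subset Seoul (57142 beds) lands in the same cluster as California (68074 beds), which mirrors its assignment to the large-bed cluster in the notebook.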

In [30]:
# DataFrame group by clusters
korea0=korea[korea['cluster']==0]
korea1=korea[korea['cluster']==1]
korea2=korea[korea['cluster']==2]

In [31]:
# Drop cluster column (drop() without inplace returns a new DataFrame and avoids SettingWithCopyWarning)
korea0=korea0.drop(['cluster'],axis=1)
korea1=korea1.drop(['cluster'],axis=1)
korea2=korea2.drop(['cluster'],axis=1)


In [32]:
# Korea provinces in cluster 0
korea0['Province'].values

array([], dtype=object)

In [33]:
# Korea provinces in cluster 1
korea1['Province'].values

array(['Busan', 'Daegu', 'Incheon', 'Gwangju', 'Daejeon', 'Ulsan',
       'Sejong', 'Gangwon-do', 'Chungcheongbuk-do', 'Chungcheongnam-do',
       'Jeollabuk-do', 'Jeollanam-do', 'Gyeongsangbuk-do',
       'Gyeongsangnam-do', 'Jeju'], dtype=object)

In [35]:
# Korea provinces in cluster 2
korea2['Province'].values

array(['Seoul', 'Gyeonggi-do'], dtype=object)

### Infection Percentage Prediction

**(1) cluster_0**

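# No Korean province falls in cluster 0, so korea0 is empty and there is nothing to predict
if not korea0.empty:
    X_features=korea0.drop(['Province'],axis=1)
    korea0=korea0.assign(Prediction=linear_reg0.predict(X_features))

korea0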

In [79]:
linear_reg0.coef_

array([-1.99169667e-07, -7.74848263e-09,  9.99092435e-01])

**(2) cluster_1**

In [37]:
# Rename the Korean columns to the feature names the model was fitted on (same column order as before)
X_features=korea1.drop(['Province'],axis=1).rename(columns={'PopDensity':'Density','Beds':'Total Hospital Beds','Confirmed(%)':'Lag 5'})
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning
korea1=korea1.assign(Prediction=linear_reg1.predict(X_features))


In [38]:
korea1

| | Province | PopDensity | Beds | Confirmed(%) | Prediction |
|---|---|---|---|---|---|
| 1 | Busan | 4342 | 6925 | 0.000447 | -0.000332 |
| 2 | Daegu | 2738 | 10933 | 0.000824 | 0.000351 |
| 3 | Incheon | 2770 | 7682 | 0.001597 | 0.001122 |
| 4 | Gwangju | 2969 | 3106 | 0.001236 | 0.000726 |
| 5 | Daejeon | 2780 | 3155 | 0.001197 | 0.000723 |
| 6 | Ulsan | 1073 | 2992 | 0.001849 | 0.001706 |
| 7 | Sejong | 750 | 626 | 0.001175 | 0.001095 |
| 9 | Gangwon-do | 90 | 3823 | 0.001533 | 0.001578 |
| 10 | Chungcheongbuk-do | 220 | 3442 | 0.0013 | 0.00132 |
| 11 | Chungcheongnam-do | 267 | 4308 | 0.000969 | 0.000979 |


In [39]:
linear_reg1.coef_

array([-1.93002091e-07, -7.47275621e-10,  1.00157016e+00])
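Since the regression was fitted on US values, it is worth checking whether the Korean inputs fall inside the range the model saw during training. The sketch below compares ranges using only the first five rows of each table shown in this notebook, so it is illustrative rather than complete. The mapping in `rename_map` is an assumption that mirrors the column order used in the prediction step, and the units of the two density columns are not stated here, so they may not be comparable.

```python
import pandas as pd

train_cols = ["Density", "Total Hospital Beds", "Lag 5"]

# First five rows of the US training table shown in data.head()
us = pd.DataFrame({
    "Density": [97.4271, 1.2694, 66.2016, 58.3059, 254.2929],
    "Total Hospital Beds": [14793.0, 1583.0, 12590.0, 8560.0, 68074.0],
    "Lag 5": [0.1116, 0.0983, 0.1189, 0.1149, 0.0964],
})

# First five rows of the Korean table shown in korea.head()
korea = pd.DataFrame({
    "PopDensity": [15865, 4342, 2738, 2770, 2969],
    "Beds": [57142, 6925, 10933, 7682, 3106],
    "Confirmed(%)": [0.000592, 0.000447, 0.000824, 0.001597, 0.001236],
})

# Assumed mapping from Korean column names to training feature names
rename_map = {"PopDensity": "Density", "Beds": "Total Hospital Beds", "Confirmed(%)": "Lag 5"}
korea_X = korea.rename(columns=rename_map)[train_cols]

ranges = pd.DataFrame({
    "us_min": us[train_cols].min(),
    "us_max": us[train_cols].max(),
    "korea_min": korea_X.min(),
    "korea_max": korea_X.max(),
})
# Flag features where the Korean values leave the range seen in the US rows
ranges["outside_us_range"] = (
    (ranges["korea_min"] < ranges["us_min"]) | (ranges["korea_max"] > ranges["us_max"])
)
print(ranges)
```

On these rows the density and lagged-infection features lie outside the US range, which means the model is extrapolating for those inputs.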

**(3) cluster_2**

In [40]:
# Rename the Korean columns to the feature names the model was fitted on (same column order as before)
X_features=korea2.drop(['Province'],axis=1).rename(columns={'PopDensity':'Density','Beds':'Total Hospital Beds','Confirmed(%)':'Lag 5'})
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning
korea2=korea2.assign(Prediction=linear_reg2.predict(X_features))


In [41]:
korea2

| | Province | PopDensity | Beds | Confirmed(%) | Prediction |
|---|---|---|---|---|---|
| 0 | Seoul | 15865 | 57142 | 0.000592 | -0.907058 |
| 8 | Gyeonggi-do | 1315 | 49588 | 0.002188 | -0.277919 |


In [42]:
linear_reg2.coef_

array([-4.35421262e-05,  1.26585805e-06,  3.23559014e+00])
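A linear prediction is the intercept plus the sum of coefficient times feature, so each prediction can be split into per-feature terms. The sketch below does this for the two cluster_2 provinces using the coefficients and table values printed above. The intercept is never printed in this notebook, so it is inferred here by subtracting the terms from the printed prediction; treat it as an estimate limited by the rounding of the displayed values.

```python
import numpy as np

# linear_reg2.coef_ as printed above: Density, Total Hospital Beds, Lag 5
coef = np.array([-4.35421262e-05, 1.26585805e-06, 3.23559014e+00])

# Feature values and predictions copied from the korea2 table
features = {
    "Seoul": np.array([15865, 57142, 0.000592]),
    "Gyeonggi-do": np.array([1315, 49588, 0.002188]),
}
printed_prediction = {"Seoul": -0.907058, "Gyeonggi-do": -0.277919}

terms = {}
implied_intercept = {}
for province, x in features.items():
    terms[province] = coef * x  # per-feature contribution to the prediction
    # Inferred, not printed in the notebook: prediction minus the sum of the terms
    implied_intercept[province] = printed_prediction[province] - terms[province].sum()
    print(province, "terms:", terms[province].round(4),
          "implied intercept:", round(implied_intercept[province], 4))
```

Both rows imply nearly the same intercept, which is a consistency check on the arithmetic. For Seoul, the density term alone is strongly negative, so it is one place to look when asking where the negative values come from.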

## Result
**Some of the predicted values are negative. I can think of two possible reasons.**
* Because the clustering and regression models were fitted on US state data, there may be more environmental or social differences between the US and Korea than we assumed.
* Because Korea's medical infrastructure is much better than that of the US, the regression model might predict that, with Korea's facilities, Covid-19 would spread no further.

Although we could not determine why the predictions were negative, this project was meaningful to me: it broadened my perspective and gave me practice in dealing with the many errors that come up during data analysis.