# 2019 Novel Corona Virus Outbreak: EDA with estimation of Case Fatality Rate (CFR)


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

### Let's read the data and get a glimpse of what it looks like

In [None]:
ncov = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/2019_nCoV_data.csv')
print(ncov.head(7))
print('\n')

In [None]:
print(ncov.info())

### The data has 434 rows and 7 columns. The first column, 'Sno', doesn't appear to be informative, so we will remove it. Let's also rename the columns so that none of them contains spaces or special characters. Finally, 'Last Update' holds dates but is not stored in a proper datetime format, so let's convert it as well.

In [None]:
ncov = ncov.drop('Sno', axis = 1)
ncov.columns = ['State', 'Country', 'Date', 'Confirmed', 'Deaths', 'Recovered']
ncov['Date'] = ncov['Date'].apply(pd.to_datetime).dt.normalize() #convert to proper datetime object

In [None]:
ncov.info()

In [None]:
ncov[['State','Country','Date']].drop_duplicates().shape[0] == ncov.shape[0]

##### The code above checks whether the number of distinct (State, Country, Date) combinations equals the number of rows in the full dataset. The check returned True, which means each row is uniquely identified by its State, Country and Date. This is important to know before performing grouping and aggregation operations.
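
The same uniqueness check can also be written with `DataFrame.duplicated`, which has the side benefit of showing the offending rows when the check fails. The sketch below runs on a small made-up frame (the values are hypothetical, only the column names mirror the dataset).

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the renamed dataset.
toy = pd.DataFrame({
    "State": ["Hubei", "Hubei", "Hubei", None],
    "Country": ["China", "China", "China", "Japan"],
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-23", "2020-01-22"]),
    "Confirmed": [444, 444, 549, 2],
})

key = ["State", "Country", "Date"]

# True only when no (State, Country, Date) combination repeats
is_unique = not toy.duplicated(subset=key).any()

# keep=False marks every member of a duplicated group, not just the later ones
offending_rows = toy[toy.duplicated(subset=key, keep=False)]

print(is_unique)
print(offending_rows)
```

In this toy frame the two rows for Hubei on 23 Jan share a key, so the check fails and both rows are listed.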

### Now let's view some summary statistics for all 6 columns

In [None]:
ncov.describe(include = 'all')

### From the table above, we can see that 'State' has missing values, while none of the other columns do. Let's explore 'State' to see whether its missingness is anything suspicious.

In [None]:
ncov.loc[ncov['State'].isnull(), ['Country', 'State']].drop_duplicates()

In [None]:
countries_missing_state = ncov.loc[ncov['State'].isnull(), 'Country'].unique() # countries with at least one missing 'State'
ncov.loc[ncov['Country'].isin(countries_missing_state), 'State'].unique()

### OK, it looks like 'State' is missing only for certain countries. For Australia, some records have 'State' filled in while others do not. A plausible reason is that China is the epicenter of the 2019-nCoV outbreak, so the data from China is likely to be more complete. Either way, the missingness of 'State' doesn't appear to be suspicious.
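
To quantify this kind of pattern, one option is to count missing 'State' values per country. This is a minimal sketch on made-up rows; the frame `toy` and the helper column `state_missing` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the dataset.
toy = pd.DataFrame({
    "State": ["Hubei", "Beijing", None, "New South Wales", None],
    "Country": ["China", "China", "Australia", "Australia", "Japan"],
})

# Per country: how many rows lack a State, and how many rows there are in total
summary = (
    toy.assign(state_missing=toy["State"].isnull())
       .groupby("Country")["state_missing"]
       .agg(missing="sum", total="size")
)
summary["share_missing"] = summary["missing"] / summary["total"]

print(summary)
```

A share strictly between 0 and 1, as for Australia here, corresponds to the mixed case described above.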

In [None]:
ncov.State.unique()

In [None]:
ncov.Country.unique()

### Looking at the list of countries, two values appear to conflict: 'China' and 'Mainland China'. They should represent a single country, but the data treats them as two different entities. Let's see what's going on.

In [None]:
print(ncov[ncov['Country'].isin(['China', 'Mainland China'])].groupby('Country')['State'].unique())
print(ncov[ncov['Country'].isin(['China', 'Mainland China'])].groupby('Country')['Date'].unique())

### It looks like 'Country' is set to 'China' for observations on 22 Jan 2020 and to 'Mainland China' from 23 Jan 2020 onwards. Both values also share the same distinct states, so they represent a single entity, 'China'. We should unify the name; otherwise the two values could cause trouble in the EDA downstream.
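
The two observations above (non-overlapping dates, identical states) can also be checked programmatically rather than by eye. The sketch below uses made-up rows to illustrate the idea; it is not the notebook's actual data.

```python
import pandas as pd

# Hypothetical toy data that imitates the relabelling from 'China' to 'Mainland China'.
toy = pd.DataFrame({
    "Country": ["China", "China", "Mainland China", "Mainland China"],
    "State": ["Hubei", "Beijing", "Hubei", "Beijing"],
    "Date": pd.to_datetime(["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-24"]),
})

# First and last date on which each label appears
spans = toy.groupby("Country")["Date"].agg(["min", "max"])

# Set of states recorded under each label
states = toy.groupby("Country")["State"].apply(lambda s: set(s))

# The labels never co-occur in time if one ends before the other starts
no_overlap = spans.loc["China", "max"] < spans.loc["Mainland China", "min"]
same_states = states["China"] == states["Mainland China"]

print(spans)
print(no_overlap, same_states)
```

When both flags are true, merging the two labels loses no information.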

In [None]:
ncov['Country'] = ncov['Country'].replace(['Mainland China'], 'China') #set 'Mainland China' to 'China'
sorted(ncov.Country.unique())

In [None]:
print(ncov.head())

In [None]:
china = ncov[ncov['Country']=='China']
china.head()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams["figure.figsize"] = (7,5)
ax1 = china[['Date','Confirmed']].groupby(['Date']).sum().plot()
ax1.set_ylabel("Total Number of Confirmed Cases")
ax1.set_xlabel("Date")

ax2 = china[['Date','Deaths', 'Recovered']].groupby(['Date']).sum().plot()
ax2.set_ylabel("Total Number of Deaths / Recovered Cases")
ax2.set_xlabel("Date")


#### The two line plots above use only the data from China to explore the trend of the 2019-nCoV outbreak. The number of confirmed cases is rising, although there doesn't seem to be an explosive increase in confirmed infections over time. The numbers of deaths and recovered cases follow a similar trend: deaths are rising day by day, with no plateau reached as of 30 Jan 2020, and the number of people recovering from the infection is also increasing day by day.
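
Whether growth is "explosive" is easier to judge from day-over-day changes than from the raw curve. The sketch below assumes the counts are cumulative per date (an assumption about the dataset) and uses made-up numbers, not the notebook's data.

```python
import pandas as pd

# Hypothetical cumulative confirmed counts, one value per day
daily = pd.Series(
    [100, 150, 225, 300],
    index=pd.to_datetime(["2020-01-27", "2020-01-28", "2020-01-29", "2020-01-30"]),
    name="Confirmed",
)

# New cases per day: difference between consecutive cumulative totals
new_cases = daily.diff()

# Day-over-day growth in percent; a roughly constant value suggests exponential growth
growth_pct = daily.pct_change() * 100

print(pd.DataFrame({"Confirmed": daily, "New": new_cases, "Growth %": growth_pct}))
```

With the real data, `daily` would be the per-date sum of 'Confirmed' that the first plot is drawn from.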

### Total confirmed cases per state/province in China

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import seaborn as sns
plt.rcParams["figure.figsize"] = (17,10)
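# Total confirmed cases per state, sorted in descending order
nums = china.groupby("State")['Confirmed'].sum().reset_index().sort_values('Confirmed', ascending=False)
# Plot the aggregated totals so the bars match the axis label (on the raw rows the default estimator would plot per-state means)
ax = sns.barplot(x="Confirmed", y="State", order=nums['State'], data=nums, ci=None)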
ax.set_xlabel("Total Confirmed Cases")

#### Looking at the bar plot above, Hubei clearly has more confirmed cases than anywhere else in China. Wuhan is the capital of Hubei province. 'Taiwan' also appears among the provinces; this is likely an error in the data.

## Case Fatality Rate
The case fatality rate (CFR) is the proportion of deaths among people diagnosed with a disease. It measures how lethal a disease is, that is, the risk of death for someone who gets infected with the infectious agent. For infectious agents like Ebola and Nipah this number is quite high. The CFR of 2019-nCoV is estimated to be below 5% in most of the literature. We can compute the same figure from this data.

In [None]:
# A custom function that returns the lower and upper bounds of the 95% confidence interval
# of a proportion p observed among N cases (normal approximation), formatted as percentages
def get_ci(N, p):
    margin = 1.96 * (((p * (1 - p)) / N) ** 0.5)
    lci = (p - margin) * 100
    uci = (p + margin) * 100
    return str(np.round(lci, 3)) + "% - " + str(np.round(uci, 3)) + '%'
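
The interval computed by `get_ci` is the normal-approximation (Wald) interval, which can dip below 0% when the number of cases is small or the proportion is close to 0. As a possible alternative that is not part of the original analysis, the Wilson score interval stays within [0, 1]. The sketch below compares the two on made-up counts; `wald_ci` and `wilson_ci` are hypothetical helper names.

```python
import math

def wald_ci(n, p, z=1.96):
    """Normal-approximation interval (same formula as get_ci), returned as fractions."""
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

def wilson_ci(n, p, z=1.96):
    """Wilson score interval for a proportion p observed in n cases (hypothetical helper)."""
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - margin, centre + margin

# Made-up example: 1 death among 10 confirmed cases
n, p = 10, 1 / 10
wald_low, wald_high = wald_ci(n, p)
wilson_low, wilson_high = wilson_ci(n, p)

print(f"Wald:   {wald_low * 100:.2f}% - {wald_high * 100:.2f}%")
print(f"Wilson: {wilson_low * 100:.2f}% - {wilson_high * 100:.2f}%")
```

For states with only a handful of confirmed cases, the Wald lower bound becomes negative, while the Wilson bounds remain valid proportions.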

In [None]:
final = ncov[ncov.Date==np.max(ncov.Date)]
final = final.copy()

final['CFR'] = np.round((final.Deaths.values/final.Confirmed.values)*100,3)
final['CFR 95% CI'] = final.apply(lambda row: get_ci(row['Confirmed'],row['CFR']/100),axis=1)
global_cfr = np.round(np.sum(final.Deaths.values)/np.sum(final.Confirmed.values)*100, 3)
final.sort_values('CFR', ascending= False).head(10)

In [None]:
tops = final.sort_values('CFR', ascending= False)
tops = tops[tops.CFR >0]
df = final[final['CFR'] != 0]
plt.rcParams["figure.figsize"] = (10,5)
ax = sns.barplot(y="CFR", x="State", order = tops['State'], data=df, ci=None) 
ax.axhline(global_cfr, alpha=.5, color='r', linestyle='dashed')
ax.set_title('Case Fatality Rates (CFR) as of 30 Jan 2020')
ax.set_ylabel('CFR %')
print('Average CFR % = ' + str(global_cfr))

#### The average (pooled) CFR, computed as total deaths divided by total confirmed cases, is shown as the dashed red line. The CFR of 2019-nCoV as of now is roughly 2-4%. Note that many infectious agents have a CFR many times higher than that of 2019-nCoV.
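
Note that `global_cfr` pools deaths and confirmed cases across all rows, which is not the same as averaging the per-state CFR values. The sketch below, on made-up counts, shows how the two can differ when states have very different case numbers.

```python
import pandas as pd

# Hypothetical counts: one large state and two small ones
toy = pd.DataFrame({
    "State": ["A", "B", "C"],
    "Confirmed": [1000, 10, 10],
    "Deaths": [30, 1, 0],
})
toy["CFR"] = toy["Deaths"] / toy["Confirmed"] * 100

# Pooled CFR: total deaths over total confirmed cases (what global_cfr computes)
pooled_cfr = toy["Deaths"].sum() / toy["Confirmed"].sum() * 100

# Unweighted mean of per-state CFRs: small states count as much as large ones
mean_of_cfrs = toy["CFR"].mean()

print(round(pooled_cfr, 3), round(mean_of_cfrs, 3))
```

The pooled figure is dominated by the state with the most cases, which is usually the more stable estimate when many states have only a few cases.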

## Local Outlier Factor (LOF)
> In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours.
>
> https://en.wikipedia.org/wiki/Local_outlier_factor

#### The numerical columns (confirmed cases, deaths and recovered cases) are the variables of interest for outlier detection. The geocoordinates of each state could add value, but the data doesn't include them. The variables are standardized and then fed into the LOF algorithm to obtain a LOF score for each state. A LOF score close to 1 is characteristic of an inlier, while a LOF score much greater than 1 is typical of an outlier.
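
To make the "close to 1 versus much greater than 1" reading concrete, here is a minimal sketch on synthetic data: 50 points drawn from a standard normal distribution plus one point placed far away. The data and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 ordinary points and one far-away point (index 50)
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(50, 3))
outlier = np.array([[25.0, 25.0, 25.0]])
X_toy = StandardScaler().fit_transform(np.vstack([inliers, outlier]))

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X_toy)

# scikit-learn stores the negated LOF, so flip the sign to get the usual score
scores = -lof.negative_outlier_factor_

print("median score:", np.median(scores))
print("score of the far-away point:", scores[50])
```

The far-away point receives by far the largest score, while the bulk of the points stay near 1.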

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor
scaler = StandardScaler()
scd = scaler.fit_transform(final[['Confirmed','Deaths','Recovered']])
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.1) # LOF is sensitive to the choice of n_neighbors; 20 is scikit-learn's default and usually a reasonable starting point
clf.fit(scd)
lofs = -clf.negative_outlier_factor_ # flip the sign so that larger scores mean more anomalous
final['LOF Score'] = lofs
tops = final.sort_values('LOF Score', ascending= False)
plt.rcParams["figure.figsize"] = (20,12)
ax = sns.barplot(x="LOF Score", y="State", order = tops['State'], data=final, ci=None) 
ax.axvline(1, alpha=.5, color='g', linestyle='dashed')
ax.axvline(np.median(lofs), alpha=.5, color='b', linestyle='dashed')
ax.axvline(np.mean(lofs) + 3*np.std(lofs), alpha=.5, color='r', linestyle='dashed')

#### Hubei is indeed in a league of its own when it comes to nCoV. Its LOF score is 70, which is astronomically high! The bar plot above has three dashed lines marking a LOF score of 1 (green), the median LOF score (blue) and the LOF score three standard deviations above the mean (red). Note that a high LOF score does not by itself indicate a bad epidemiological situation in a region; it only indicates that the region is an outlier compared with the others. For example, New South Wales also has a very high LOF score (~12), but that is because 50% of its cases had already recovered.

In [None]:
final.sort_values('LOF Score', ascending=False)

## K-Means Clustering 
Hubei also stands out clearly as an outlier under K-Means clustering.

In [None]:
from sklearn.cluster import KMeans
plt.rcParams["figure.figsize"] = (5,5)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=1897)
    kmeans.fit(scd)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within Cluster Sum of Squares')
plt.show()

##### n_clusters = 2 seems to be the best choice according to the elbow plot above
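
Reading an elbow by eye is subjective. One rough heuristic (a rule of thumb, not a formal criterion) is to look at the relative drop in the within-cluster sum of squares from one value of k to the next. The sketch below applies it to two synthetic, well-separated blobs; the data is made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated blobs in 2D
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(40, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(40, 2))
X_toy = np.vstack([blob_a, blob_b])

ks = list(range(1, 6))
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy).inertia_ for k in ks]

# Relative improvement when going from k-1 to k clusters
rel_drop = [(wcss[i - 1] - wcss[i]) / wcss[i - 1] for i in range(1, len(wcss))]

# The k that follows the largest relative drop is the elbow candidate
best_k = ks[int(np.argmax(rel_drop)) + 1]

print(dict(zip(ks[1:], np.round(rel_drop, 3))))
print("elbow candidate:", best_k)
```

On data without such clear separation the largest drop may be less decisive, so the plot is still worth inspecting.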

In [None]:
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=1897)
clusters = np.where(kmeans.fit_predict(scd) == 0, 'Cluster 1', 'Cluster 2')
clusters

#### Let's do Principal Component Analysis (PCA). This makes it easier to visualize high-dimensional data in 2D-space

In [None]:
from sklearn import decomposition
pca = decomposition.PCA(n_components=3)
pca.fit(scd)
X = pca.transform(scd)
print(pca.explained_variance_ratio_.cumsum())

##### The first principal component alone explains ~99% of the variance.
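
A first component that explains nearly all the variance usually means the standardized columns are almost perfectly correlated. The sketch below illustrates this on synthetic data (three noisy copies of one underlying signal) and checks that the explained variance ratios match the normalized eigenvalues of the covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three columns driven by the same underlying signal plus small noise
rng = np.random.default_rng(2)
base = rng.normal(size=200)
X_toy = np.column_stack([
    base + rng.normal(scale=0.05, size=200),
    2 * base + rng.normal(scale=0.05, size=200),
    -base + rng.normal(scale=0.05, size=200),
])
X_std = StandardScaler().fit_transform(X_toy)

pca_toy = PCA(n_components=3).fit(X_std)
ratios = pca_toy.explained_variance_ratio_

# Same quantity by hand: eigenvalues of the covariance matrix, largest first, normalized
eigvals = np.linalg.eigvalsh(np.cov(X_std, rowvar=False))[::-1]
manual = eigvals / eigvals.sum()

print(ratios.cumsum())
print(np.allclose(ratios, manual))
```

In the notebook's data, a similar effect would be expected if confirmed cases, deaths and recoveries all scale together across states.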

In [None]:
plt.rcParams["figure.figsize"] = (7,7)
ax = sns.scatterplot(x=X[:,0], y=X[:,1], marker = 'X', s = 80, hue=clusters) # pass x and y by keyword; recent seaborn versions reject positional data arguments
ax.set_title('K-Means Clusters of States/Provinces')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')

##### Hubei is assigned to 'Cluster 2' and lies far from all the other states.

In [None]:
pd.DataFrame({'State': final.State.values, 'Cluster': clusters}) # one row per state with its cluster label