In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.axis as axis
import matplotlib.ticker as ticker
import matplotlib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
import math
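# 'seaborn-white' was renamed 'seaborn-v0_8-white' in matplotlib 3.6; use whichever name this version provides
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')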
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Objectives
The project is mainly to fulfill the requirements of the Udacity Data Science Nanodegree program, but I take it seriously and with great enthusiasm.  

## Primary objectives: 

 1.  To demonstrate the top 10 countries with the greatest number of total confirmed cases and deaths of Covid-19
 2.  To show and compare total confirmed vs. confirmed/million people, and total deaths vs. deaths/million people of Covid-19 in each WHO region
 3.  To visualize deaths/100 cases and recovered/100 cases of Covid-19 among countries in each WHO region

## Secondary objective
 To use a machine learning method (linear regression) to show how some factors of interest contribute to deaths/100 cases 

# Business understanding
## Overview
COVID-19 is still a worldwide public health challenge, although the US Food and Drug Administration has issued emergency use authorizations (EUA) for three vaccines, from Pfizer-BioNTech, Moderna and Johnson & Johnson (Janssen). As a biostatistician who works for the R&D department of a biotech company that is a leader in COVID-19 diagnosis, I want to put what I learned from the Data Science Nanodegree program into practice by analyzing and visualizing COVID-19 data. The dataset I use does not focus only on the US; instead, it covers 187 countries from all 6 World Health Organization (WHO) regions. 
## Questions and importance

### Primary questions
1. What are the top 10 countries with the greatest number of total confirmed cases and deaths of Covid-19, and in which World Health Organization (WHO) regions are those countries located? People are likely to be interested in the countries that fared worst in the pandemic, so I put this question and its answer first, mostly to attract the audience. The large numbers of confirmed cases and deaths can also impress upon my audience how bad the situation is, which is a good way to raise their awareness. 

2. What are the similarities and differences between total confirmed and confirmed/million people, as well as between total deaths and deaths/million people of Covid-19 in each WHO region? Just as there can be a huge difference between the rank of a country's GDP and its GDP per capita, my potential audience is also curious about the differences between the total numbers of confirmed cases and deaths and the corresponding per-capita rates (prevalence and death rate). 

3. What are the distributions of deaths/100 cases and recovered/100 cases of each WHO region? Assuming that the virus does not discriminate by race or ethnicity, the distributions of deaths/100 cases and recovered/100 cases are a good criterion for comparing and contrasting the public health responses of the WHO regions. 

### Secondary questions
4. How are the factors, including Confirmed, Deaths, Recovered, Active, New cases, New deaths, New recovered, Recovered / 100 Cases, Deaths / 100 Recovered, Confirmed last week, 1-week change, 1-week % increase and WHO region, correlated with and contributing to deaths/100 cases? This question is good practice for data cleaning, handling categorical values, missing values and infinite values, and training a linear model. In the meantime, the model can also be used to estimate and predict the death rate of COVID-19 given the other factors.  





# Data understanding

The dataset is named COVID-19 Dataset: Number of Confirmed, Death and Recovered cases every day across the globe.  It was collected and cleaned by Devakumar kp from the sources https://github.com/CSSEGISandData/COVID-19 and
https://www.worldometers.info, and is freely accessible to everyone. This dataset has 187 observations, 13 numeric variables (Confirmed, Deaths, Recovered, Active, New cases, New deaths, New recovered, Deaths / 100 Cases, Recovered / 100 Cases, Deaths / 100 Recovered, Confirmed last week, 1-week change, 1-week % increase), and 2 categorical variables, Country/Region and WHO Region.  

## Dataset structures

- The first 5 observations

In [None]:
# Import data from country_wise_latest.csv and name it as df_country

df_country = pd.read_csv("/kaggle/input/corona-virus-report/country_wise_latest.csv")

# print out the first 5 obs
df_country.head()



- The number of observations, and the number of variables

In [None]:
#print number of countries, number of variables

print('Number of observations:{}, Number of numeric variables:{}, Number of categorical variables:{}'.format(df_country.shape[0],df_country.select_dtypes(include = ["float","int"]).shape[1], df_country.select_dtypes(exclude = ["float","int"]).shape[1]))

## Data cleaning and descriptive analysis




- List the WHO regions

In [None]:
print(pd.DataFrame(np.unique(df_country['WHO Region']),columns = ["WHO Region"]))


- Number of observations (country/region) in each WHO region


In [None]:
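# Keep the result in its own variable; assigning to df_country.count would shadow the DataFrame.count method
region_counts = pd.DataFrame(df_country.groupby('WHO Region')['Country/Region'].nunique())

region_counts.columns = ["Number of Country/Region"]

print(region_counts)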

- Descriptive analyses on numerical data

In [None]:
## Get the all numeric variables and WHO region

df_country2 = df_country.drop(["Country/Region"],axis =1 )
num_col= df_country.select_dtypes(include = ["float","int"]).columns

# agg with a list of function names needs no axis argument (recent pandas versions may reject it)
df1 = df_country2.groupby('WHO Region')[num_col].agg(['mean','median','min','max'])


df1


## Deeper analyses and visualization

 - List of the top 10 countries/regions with the greatest number of confirmed cases

In [None]:
# Sort the df_country by column 'Confirmed' in descending order,
# then get the top 10 observations and save them in the new var 'df_country_sortconfirmed'

df_country_sortconfirmed =df_country.sort_values('Confirmed',ascending= False).head(10).sort_values(['WHO Region','Confirmed'],ascending= True)

# select and print columns 'Country/Region','Confirmed','WHO Region' of df_country_sortconfirmed
df_country_sortconfirmed[['Country/Region','Confirmed','WHO Region']]

#df_country_sortconfirmed['Country/Region']

- Plot the top 10 countries/regions with the greatest number of confirmed cases

In [None]:
# prepare parameters for plots

scatter_x = np.array(df_country_sortconfirmed['Country/Region'])

scatter_y = np.array(df_country_sortconfirmed['Confirmed'])

size = np.array(df_country_sortconfirmed['Confirmed']/500)

colors = {'Eastern Mediterranean': 'r', 'Europe': 'b', 'Africa': 'g','Americas': 'y','Western Pacific': 'c','South-East Asia': 'm'}

# derive the region of each plotted country from the data instead of hard-coding it
group = np.array(df_country_sortconfirmed['WHO Region'])


In [None]:
#Plots

f,ax= plt.subplots(1,1,figsize=(12, 9))
ax2 = ax.twiny()
plt.ticklabel_format(style = 'plain')

ax.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

plt.xticks(fontsize=14, rotation=90)

plt.ylim([100000,5000000])
plt.yticks(fontsize=14, color = 'black')
plt.title("Total Confirmed Cases by WHO Region")

## add first x-axis
ax.set_xticklabels(list(df_country_sortconfirmed['Country/Region']),fontsize=14, rotation=90)
ax.set_xlabel(r"Country")
## add second x-axis
ax2.set_xlim(ax.get_xlim())
ax2.set_xticks(np.arange(0.06,1.06,0.1))
ax2.set_xticklabels(['Africa','Americas','Americas','Americas','Americas','Americas','Eastern Mediterranean','Europe','Europe','South-East Asia'])
ax2.set_xlabel(r"WHO Region")


for g in np.unique(group):
    i = np.where(group == g)
    ax.scatter(scatter_x[i],scatter_y[i],
            c = colors[g], s = size[i], lw=1, label = g)


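# Save before plt.show(): once the figure has been shown and closed, savefig writes an empty canvas
plt.savefig('total confirmed.png')

plt.show()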

- List the top 10 countries/regions with the greatest number of deaths

In [None]:
# Sort the df_country by column 'Deaths' in descending order,
# then get the top 10 observations and save them in the new var 'df_country_sortdeath'

df_country_sortdeath = df_country.sort_values('Deaths',ascending= False).head(10).sort_values(['WHO Region','Deaths'],ascending = True)

# select and print columns 'Country/Region','Deaths','WHO Region' of df_country_sortdeath
df_country_sortdeath[['Country/Region','Deaths','WHO Region']]


#df_country_sortdeath['Country/Region']

In [None]:
# prepare parameters for plots

scatter_x = np.array(df_country_sortdeath['Country/Region'])

scatter_y = np.array(df_country_sortdeath['Deaths'])

size = np.array(df_country_sortdeath['Deaths']**2/3000000)

colors = {'Eastern Mediterranean': 'r', 'Europe': 'b', 'Africa': 'g','Americas': 'y','Western Pacific': 'c','South-East Asia': 'm'}

# derive the region of each plotted country from the data instead of hard-coding it
group = np.array(df_country_sortdeath['WHO Region'])



- Plot the top 10 countries/regions with the greatest number of deaths

In [None]:
# Plots

f,ax= plt.subplots(1,1,figsize=(12, 9))
ax2 = ax.twiny()
plt.ticklabel_format(style = 'plain')

ax.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))



plt.ylim([2000,180000])
plt.yticks(fontsize=16, color = 'black')

plt.title('Total Deaths by WHO Region', color='black')
plt.xticks(fontsize=14, rotation=90)


## add first x-axis
ax.set_xticklabels(list(df_country_sortdeath['Country/Region']),fontsize=14, rotation=90)
ax.set_xlabel(r"Country")
## add second x-axis
ax2.set_xlim(ax.get_xlim())
ax2.set_xticks(np.arange(0.06,1.06,0.1))
ax2.set_xticklabels(['Americas','Americas','Americas','Americas','Eastern Mediterranean','Europe','Europe','Europe','Europe','South-East Asia'])
ax2.set_xlabel(r"WHO Region")

for g in np.unique(group):
    i = np.where(group == g)
    ax.scatter(scatter_x[i],scatter_y[i],
            c = colors[g], s = size[i], lw=1, label = g)



plt.show()

- List and plot confirmed cases vs. confirmed cases/1 million people, and deaths vs. deaths/1 million people, within each WHO region

In [None]:
# Sum only the numeric columns per region (the groupby result is sorted alphabetically by region)
df_country_sum = df_country.groupby('WHO Region').sum(numeric_only=True)
df_country_sum['WHO Region'] = df_country_sum.index

# Population per region, listed in the same alphabetical order as the regions above
df_country_sum['Population (in millions)'] = [1019.922,992.155,664.336,916.315,1947.632,1889.901]

df_country_sum['Confirmed cases/ million people'] =np.round(df_country_sum['Confirmed'] /df_country_sum['Population (in millions)'],3)

df_country_sum['Deaths / million people'] =np.round(df_country_sum['Deaths'] /df_country_sum['Population (in millions)'],3)

df_country_sum[['Population (in millions)','Confirmed','Deaths','Confirmed cases/ million people','Deaths / million people']]
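To illustrate why totals and per-million rates can rank regions differently, here is a minimal sketch with made-up numbers. The region names, case counts and populations below are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Made-up totals and populations (in millions) for three hypothetical regions.
toy = pd.DataFrame(
    {
        "Confirmed": [900_000, 300_000, 50_000],
        "Population (in millions)": [1800.0, 200.0, 20.0],
    },
    index=["Region A", "Region B", "Region C"],
)

# Cases per million people = total cases / population in millions.
toy["Confirmed / million"] = toy["Confirmed"] / toy["Population (in millions)"]

# Rank 1 = largest value; compare the two orderings.
toy["Rank by total"] = toy["Confirmed"].rank(ascending=False).astype(int)
toy["Rank by rate"] = toy["Confirmed / million"].rank(ascending=False).astype(int)
print(toy)
```

In this toy table the region with the most cases in total has the lowest rate per million, which is the kind of reversal the bar charts are meant to reveal.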


In [None]:
f,ax= plt.subplots(1,1,figsize=(16, 9))

plt.ticklabel_format(style = 'plain')

ax.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

plt.xticks(fontsize=16, rotation=90)
plt.yticks(fontsize=14, color = 'black')

plt.title('Total Confirmed Cases Grouped by WHO Region', color='black')



colors = {'Eastern Mediterranean': 'r', 'Europe': 'b', 'Africa': 'g','Americas': 'y','Western Pacific': 'c','South-East Asia': 'm'}



df_country_sum.plot(x='WHO Region',y='Confirmed',kind="bar", ax=f.gca(), color=[colors[i] for i in df_country_sum['WHO Region']])
plt.show()




In [None]:
f,ax= plt.subplots(1,1,figsize=(16, 9))

plt.ticklabel_format(style = 'plain')

ax.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

plt.xticks(fontsize=16, rotation=90)
plt.yticks(fontsize=14, color = 'black')

plt.title('Confirmed Cases / 1 million people Grouped by WHO Region', color='black')



colors = {'Eastern Mediterranean': 'r', 'Europe': 'b', 'Africa': 'g','Americas': 'y','Western Pacific': 'c','South-East Asia': 'm'}

df_country_sum.plot(x='WHO Region',y='Confirmed cases/ million people',kind="bar", ax=f.gca(), color=[colors[i] for i in df_country_sum['WHO Region']])
plt.show()

In [None]:
f,ax= plt.subplots(1,1,figsize=(16, 9))

plt.ticklabel_format(style = 'plain')

ax.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

plt.xticks(fontsize=16, rotation=90)
plt.yticks(fontsize=14, color = 'black')



plt.title('Total Deaths Grouped by WHO Region', color='black')
df_country_sum.plot(x='WHO Region',y='Deaths',kind="bar", ax=f.gca(), color=[colors[i] for i in df_country_sum['WHO Region']])
plt.show()

In [None]:
f,ax= plt.subplots(1,1,figsize=(16, 9))

plt.ticklabel_format(style = 'plain')

ax.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

plt.xticks(fontsize=16, rotation=90)
plt.yticks(fontsize=14, color = 'black')

plt.title('Deaths / 1 million people Grouped by WHO Region', color='black')



colors = {'Eastern Mediterranean': 'r', 'Europe': 'b', 'Africa': 'g','Americas': 'y','Western Pacific': 'c','South-East Asia': 'm'}

df_country_sum.plot(x='WHO Region',y='Deaths / million people',kind="bar", ax=f.gca(), color=[colors[i] for i in df_country_sum['WHO Region']])
plt.show()

- Plot Deaths/100 Cases and Recovered / 100 cases of each country grouped by WHO regions

In [None]:
#Plot Deaths/100 Cases and Recovered / 100 cases of each country grouped by WHO regions

fig, axs = plt.subplots(1,2,figsize=(16,  9))



df_country.boxplot(column =['Deaths / 100 Cases'],
                   by = 'WHO Region', ax=axs[0],rot = 90,fontsize = 14)

df_country.boxplot(column =['Recovered / 100 Cases'],
                   by = 'WHO Region', ax=axs[1],rot = 90,fontsize = 14)
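By default, matplotlib boxplots draw points beyond 1.5 times the interquartile range (IQR) from the box as individual outliers. The following is a minimal sketch of that rule on made-up rates; the helper `iqr_outliers` and the sample values are hypothetical and only illustrate how such points can be listed.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

# Made-up death rates (per 100 cases) for one hypothetical region.
rates = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 28.0], index=list("ABCDEFG"))

outliers = iqr_outliers(rates)
print(outliers)
```

Applied per WHO region (for example through `groupby('WHO Region')`), the same rule would list the countries drawn as isolated points in the boxplots.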



List the outlier (Yemen)

In [None]:
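# idxmax returns the row label of the largest rate per region, so country and rate come from the same row
top_by_region = df_country.loc[df_country.groupby('WHO Region')['Deaths / 100 Cases'].idxmax(), ['Country/Region','WHO Region','Deaths / 100 Cases']]
print(top_by_region[top_by_region['WHO Region'] == 'Eastern Mediterranean'])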
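A note on locating the country behind a maximum: `groupby(...).max()` takes the maximum of every column independently, so the country name it returns is the alphabetically last one in the region and not necessarily the one with the highest rate. The sketch below uses a hypothetical toy table to show the difference and a row-wise alternative based on `idxmax`.

```python
import pandas as pd

# Hypothetical toy data: "Alpha" has the highest rate in region R1.
toy = pd.DataFrame({
    "Country/Region": ["Zeta", "Alpha", "Beta"],
    "WHO Region": ["R1", "R1", "R2"],
    "Deaths / 100 Cases": [1.0, 9.0, 4.0],
})

# Column-wise maxima: "Zeta" (alphabetical maximum) ends up next to 9.0,
# although 9.0 actually belongs to "Alpha".
per_column = toy.groupby("WHO Region").max()

# Row-wise lookup: idxmax gives the index label of the largest rate per region,
# and .loc retrieves the full matching rows.
top_rows = toy.loc[toy.groupby("WHO Region")["Deaths / 100 Cases"].idxmax()]
print(top_rows)
```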

Plot Deaths/100 Cases of each country grouped by WHO regions after removing the outlier (Yemen)

In [None]:
#Plot Deaths/100 Cases of each country grouped by WHO regions after removing the outlier (Yemen)

df_country.loc[df_country['Country/Region']!='Yemen'].boxplot(column =['Deaths / 100 Cases'],
                   by = 'WHO Region',rot = 90,fontsize = 14,figsize = (16,9))

plt.title('Deaths / 100 Cases (excluding Yemen)')

- Data preparation for the correlation heatmap

In [None]:
column_name = df_country.columns

#column_name.remove('Deaths / 100 Cases')

df_country.replace([np.inf, -np.inf], np.nan, inplace=True)

df_country.dropna(subset=['Deaths / 100 Cases'],how='any',axis=0,inplace=True)

#df_country.dropna(axis=1)

ml_country_cat = df_country[["WHO Region"]]
#print(df_country)




In [None]:
ml_dummy = pd.get_dummies(ml_country_cat,drop_first =True)
ml_num = df_country.drop(['Country/Region','WHO Region',], axis = 1)

fill_na = lambda col: col.fillna(col.mean())

ml_num = ml_num.apply(fill_na,axis = 0)

ml_df = pd.concat([ml_num,ml_dummy],axis = 1)

#ml_df.head()

In [None]:
#Use a heatmap to show the correlations between the numeric variables

df_country_num = df_country.select_dtypes(include = ['float','int'])


plt.figure(figsize=(16, 9))

plt.xticks(fontsize=15, rotation=90)
plt.yticks(fontsize=15, color = 'black')

# use the builtin bool: the np.bool alias was removed in NumPy 1.24
mask = np.triu(np.ones_like(df_country_num.corr(), dtype=bool))

corr_map= sns.heatmap(df_country_num.corr(),annot = True,cmap='BrBG',mask =mask)

corr_map.set_title("Correlation Heatmap", pad = 12)
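A correlation matrix is symmetric, so the heatmap only needs one triangle. The sketch below, on a hypothetical 3x3 matrix, shows what the upper-triangle mask looks like; seaborn hides the cells where the mask is `True`, leaving the entries below the diagonal.

```python
import numpy as np

# Hypothetical symmetric correlation matrix for three variables.
corr = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, 0.3],
                 [0.1, 0.3, 1.0]])

# True on and above the diagonal; those cells would be hidden by the heatmap.
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# The entries that remain visible (strictly below the diagonal).
visible = corr[~mask]
print(visible)
```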

## Data modeling: linear model training

- Data cleaning for training linear model


In [None]:
# Data cleaning for training linear model


## dropping na
df_country.dropna(subset=['Deaths / 100 Cases'],how="any",axis = 0,inplace = True)

## selecting numeric variables
df_country_num = df_country.select_dtypes(include = ['float','int'])

## replacing infinity with nan
df_country_num = df_country_num.replace([np.inf, -np.inf], np.nan)

## imputation with col mean
fill_na = lambda col:col.fillna(col.mean())

df_country_num = df_country_num.apply(fill_na)

## creating dummy variable of WHO region, here set Africa as the reference
df_country_region = df_country['WHO Region']

df_country_dummy = pd.get_dummies(df_country_region,drop_first = True)

# Concatenate the cleaned numeric and dummy dataframes
ml_df = pd.concat([df_country_num,df_country_dummy],axis = 1)
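The cleaning steps above (infinity to NaN, mean imputation, dummy coding with a dropped reference level) can be traced on a small example. This is a minimal sketch on a hypothetical four-row table, not the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows containing one infinite and one missing value.
toy = pd.DataFrame({
    "Deaths / 100 Recovered": [2.0, np.inf, np.nan, 4.0],
    "WHO Region": ["Africa", "Europe", "Americas", "Europe"],
})

# 1. Treat +/- infinity as missing, 2. fill missing values with the column mean.
num = toy[["Deaths / 100 Recovered"]].replace([np.inf, -np.inf], np.nan)
num = num.fillna(num.mean())

# 3. One-hot encode the region; drop_first removes the first category
#    (alphabetically "Africa"), which becomes the reference level.
dummies = pd.get_dummies(toy["WHO Region"], drop_first=True)

clean = pd.concat([num, dummies], axis=1)
print(clean)
```

Note that the mean is computed after infinity has been replaced, so infinite values do not distort the imputed value (here the mean of 2.0 and 4.0).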


- Get dependent and independent variables

In [None]:
#independent variables

X = ml_df.drop('Deaths / 100 Cases',axis = 1)

#Dependent variable
y= pd.DataFrame(ml_df['Deaths / 100 Cases'])



- Split the data set into training and testing sets, and check for missing/infinite values

In [None]:
X_train, X_test, y_train,y_test = train_test_split(X,y,test_size=0.33, random_state=42)


# check for missing and infinite values (np.any answers "is there at least one?")
print("Are there any missing values in X_train? {}".format(np.any(np.isnan(X_train))))

print("Are there any infinite values in X_train? {}".format(np.any(np.isinf(X_train))))

print("Are there any missing values in y_train? {}".format(np.any(np.isnan(y_train))))

- Model training, then predict y with the training set and the testing set respectively

In [None]:
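# the `normalize` argument was removed in scikit-learn 1.2; ordinary least squares predictions do not depend on feature scaling
lm_model = LinearRegression()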

lm = lm_model.fit(X_train,y_train)

y_train_pred = lm.predict(X_train)

y_test_pred = lm.predict(X_test)
#print(X_train_pred.shape)

#print(X_train.shape)

- Get the R-squared scores of both the training and testing sets.  

In [None]:

r2_train = r2_score(y_train,y_train_pred)
r2_test =r2_score(y_test,y_test_pred)

print("r2_score of training dataset {}".format(r2_train))
print("r2_score of testing dataset {}".format(r2_test))
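For reference, the R-squared score compares the model's squared errors with those of a baseline that always predicts the mean of the observed values: R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on made-up numbers; the helper name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

# Perfect predictions leave no residual error.
perfect = r_squared(y_true, y_true)

# Always predicting the mean (2.5) is no better than the baseline.
mean_only = r_squared(y_true, [2.5] * 4)

print(perfect, mean_only)
```

A score on the testing set can also be negative, which would mean the model predicts worse than the mean baseline on unseen data.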




Comparable R-squared scores on the training and testing sets suggest that overfitting is unlikely to be a problem here.

- Get the coefficients

In [None]:
coefficients = pd.concat([pd.DataFrame(X_train.columns),pd.DataFrame(np.transpose(lm.coef_))], axis = 1)
coefficients.columns=['Variables',"Coefficients"]

# display the variable names together with their coefficients
coefficients
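As a reading aid for the coefficient table, the fitted model can be written as follows. This is a sketch that assumes, as stated in the cleaning step, that Africa is the dropped reference region; the symbols $\beta_j$ (numeric predictors) and $\gamma_r$ (region dummies) are notation introduced here.

$$
\widehat{\text{Deaths / 100 Cases}}_i \;=\; \beta_0 \;+\; \sum_{j} \beta_j\, x_{ij} \;+\; \sum_{r \neq \text{Africa}} \gamma_r\, \mathbb{1}[\text{region}_i = r]
$$

Under this reading, each $\gamma_r$ is the estimated difference in deaths/100 cases between region $r$ and Africa when the numeric predictors $x_{ij}$ are held fixed, and each $\beta_j$ is the estimated change per unit increase of predictor $j$.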

# Results evaluation

## Summary for primary questions


- 1 country in Africa (South Africa), 5 countries in the Americas (Chile, Peru, Mexico, Brazil, and US), 1 country in the Eastern Mediterranean (Iran), 2 countries in Europe (United Kingdom, Russia), and 1 country in South-East Asia (India) are among the top 10 countries with the greatest number of total confirmed cases.

- As for total deaths from Covid-19, 4 countries in the Americas (Peru, Mexico, Brazil and US), 1 country in the Eastern Mediterranean (Iran), 4 countries in Europe (France, Spain, Italy, United Kingdom), and 1 country in South-East Asia (India) rank in the top 10 among all countries.


- Among all WHO regions, the Americas have the largest number of confirmed cases, confirmed cases/1M people, deaths and deaths/1M people, followed by Europe; the Western Pacific region has the lowest values of all four metrics. What also draws my attention is that, although the total confirmed cases and deaths in South-East Asia seem higher than those in the Eastern Mediterranean, its numbers per 1M people are relatively small.

- The Americas have a relatively low average recovery rate compared with their high death rate, while Europe has both a high average death rate and a high average recovery rate among the WHO regions. However, the variability among countries is large in all 6 WHO regions. 


## Discussion

As a person of Chinese origin working in America, I hope both countries can beat the pandemic.  They have a lot to learn from each other. Several major US cities, such as NYC, Chicago and Boston, effectively slowed the spread of the virus by practicing social distancing and quarantine. Meanwhile, the EUA of the vaccines from Pfizer and Moderna should be an effective way to control the prevalence as more and more people get fully vaccinated. 