1. **About testing**  
Source: WHO   
https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/laboratory-guidance
 
What are countries looking for in **diagnostics**?   

In [None]:
import pandas as pd 
  
# initialize list of lists of external data
data1 = [['China', 'China CDC', 'ORF1ab and N'], ['Germany', 'Charité', 'RdRP, E, N'], ['Hong Kong', 'SAR HKU', 'ORF1ab and N'],
         ['Japan', 'National Institute of Infectious Diseases', 'Pancorona and multiple targets, Spike protein'], 
         ['Thailand', 'National Institute of Health', 'N'], ['US', 'CDC', 'Three targets in N gene'], 
         ['France', 'Institut Pasteur, Paris', 'Two targets in RdRP']] 
  
# Create as a DataFrame 
df = pd.DataFrame(data1, columns = ['Country', 'Institute', 'Gene target']) 
  
df 


What does the **literature** have in common?

In [None]:
#text mining

df1=pd.read_csv('../input/CORD-19-research-challenge/metadata.csv')
df1.head(3)

In [None]:
journals= df1[['title', 'abstract', 'publish_time']]
journals.head()

In [None]:
import numpy as np
a=journals.dropna(how='all')    #drops rows with all NaN values

In [None]:
b = a.dropna(how='any').copy()    # drops any row with a NaN; copy so new columns can be added safely

In [None]:
b[['abstract']]\
.describe(include=object)\
.transpose()

In [None]:
b['words'] = b.abstract.str.strip().str.split(r'[\W_]+')  # raw string avoids an invalid-escape warning
b['words'].head()

In [None]:
abstracts = b[b.words.str.len() > 0]
abstracts.head()

In [None]:
rows = list()
for word_list in abstracts['words']:
    # each entry is the list of tokens from one abstract
    rows.extend(word_list)

words = pd.DataFrame(rows, columns=['word'])
words.head()


In [None]:
#To calculate the TF-IDF statistic, first normalize the words by changing them to the same case.
text1 = words.word.str.lower()

In [None]:
counts = text1.value_counts()
counts
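
The comment above mentions TF-IDF, while the cell itself counts raw word frequencies. As a rough sketch of what the statistic adds, the block below computes TF-IDF by hand for three made-up abstracts. The `sample_abstracts` list and the `tokenize` and `tf_idf` helpers are hypothetical and not part of the dataset or the analysis above. It assumes the plain textbook form, term frequency times log(N / document frequency), which differs from the smoothed variant that libraries such as scikit-learn use by default.

```python
import math
import re
from collections import Counter

# Hypothetical toy abstracts, used only to illustrate the calculation.
sample_abstracts = [
    "The spike protein binds the ACE2 receptor",
    "A diagnostic assay targets the N gene",
    "The assay detects antibodies against the spike protein",
]

def tokenize(text):
    # Lowercase, then split on non-word characters and underscores.
    return [w for w in re.split(r"[\W_]+", text.lower()) if w]

def tf_idf(documents):
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(sample_abstracts)
for i, doc_scores in enumerate(scores):
    # Show the three highest-scoring words per toy abstract.
    top = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(i, top)
```

Words that occur in every abstract, such as "the", score zero here, which is why TF-IDF can surface topic words that raw counts bury under common ones.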

**S-Protein** is consistently discussed in the medical/research literature abstracts available on COVID-19.   
Other important terms:   
Assays, Blood Type, Antigens/Antibodies, diagnostics, and testing.   

In [None]:
abstracts[abstracts['abstract'].str.contains('assay')]

In [None]:
abstracts[abstracts['abstract'].str.contains('blood type')]

In [None]:
abstracts[abstracts['abstract'].str.contains('symptoms')]

In [None]:
abstracts[abstracts['abstract'].str.contains('antigens')]

In [None]:
abstracts[abstracts['abstract'].str.contains('antibodies')]

In [None]:
abstracts[abstracts['abstract'].str.contains('diagnostic')]

In [None]:
abstracts[abstracts['abstract'].str.contains('testing')]

In [None]:
abstracts[abstracts['abstract'].str.contains('false negative')]
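
The searches above are case-sensitive, and each one returns a full table. One compact way to compare the terms is to count the matching abstracts per term. The sketch below does this on a small made-up DataFrame (`sample` is a hypothetical stand-in for `abstracts`), assuming a case-insensitive literal match is what is wanted.

```python
import pandas as pd

# Hypothetical stand-in for the `abstracts` DataFrame built earlier.
sample = pd.DataFrame({
    "abstract": [
        "A rapid diagnostic Assay for the spike protein.",
        "Antibodies were measured; false negative results were rare.",
        "Symptoms and testing strategies are reviewed.",
    ]
})

terms = ["assay", "blood type", "symptoms", "antigens", "antibodies",
         "diagnostic", "testing", "false negative"]

# case=False ignores capitalisation; regex=False treats each term as a literal string.
term_counts = pd.Series(
    {
        term: int(sample["abstract"].str.contains(term, case=False, regex=False).sum())
        for term in terms
    },
    name="abstracts_mentioning",
)
print(term_counts)
```

Swapping `sample` for `abstracts` would give one count per term for the real metadata, which makes it easier to see which terms dominate.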

What do the numbers show?   

In [None]:
df2 = pd.read_csv("../input/httpsourworldindataorgcoronavirussourcedata/full_data(4).csv")
df2

In [None]:
df2.describe ()

In [None]:
#Correlation among numeric columns (numeric_only skips text columns such as date and location)
df2.corr(numeric_only=True)

The above shows a correlation between new cases and newly reported deaths, and between total cases and total deaths.
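
By default, pandas reports the Pearson correlation coefficient. To make that number less of a black box, here is a minimal sketch that computes it by hand with the standard library. The `pearson_r` helper and the two short lists are made up for illustration and are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical cumulative counts, for illustration only.
total_cases = [10, 20, 40, 80, 160]
total_deaths = [0, 1, 2, 5, 9]

r = pearson_r(total_cases, total_deaths)
print(round(r, 3))
```

Two cumulative series that both grow over time will almost always correlate strongly, so a high value here says more about shared growth than about cause and effect.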

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
df2.corr(numeric_only=True).style.background_gradient(cmap='coolwarm')

In [None]:
#Density plots
df2.plot(kind='density', subplots=True, layout=(2,2), sharex=False)
plt.show()

The numbers are not increasing by the hundreds or thousands, yet.

In [None]:
sns.regplot(x='total_cases', y='total_deaths', data=df2, logistic=False)

The above reflects a clear relationship between total cases and total deaths.

In [None]:
df2.corr(numeric_only=True).plot.bar()

**USA**   
What does the data show?

In [None]:
USA = df2.loc[df2['location'] == 'United States'].copy()  # copy so the date column can be converted later
USA

In [None]:
sns.countplot(x='total_deaths', data=USA)

In [None]:
USA.describe()

In [None]:
USA.corr(numeric_only=True)

The above shows that, in the **USA**, most columns are highly correlated with one another; the exception is new cases and new deaths, which are only moderately correlated.

In [None]:
# histplot replaces the deprecated distplot; stat='density' keeps the same scaling
sns.histplot(USA['total_cases'],
             kde=True, stat='density',
             color='orange',
             edgecolor='black',
             line_kws={'linewidth': 2})

In [None]:
sns.regplot(x='new_cases', y='total_cases', data=USA)

In [None]:
USA['date']=pd.to_datetime(USA['date'])

In [None]:
USA

In [None]:
sns.set(rc={'figure.figsize':(11, 4)})

In [None]:
USA['total_cases'].plot(linewidth=2)

In [None]:
plt.figure(figsize=(16,9)) # Figure size
sns.lineplot(x='date', y='total_cases', data=USA, marker='o', color='red') 
plt.title('Total cases over time') # Title (total_cases is cumulative)
plt.xticks(USA.date.unique(), rotation=90) # All values in x-axis; rotate 90 degrees
plt.show()

The above shows a sharp increase in cases in the month of March.

In [None]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [None]:
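plt.figure(figsize=(8,6))
# histplot replaces the deprecated distplot
sns.histplot(USA['new_cases'], kde=True, stat='density', color='purple')
plt.tight_layout()  # adjust the layout after the plot has been drawn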

In [None]:
#predict
x= USA['total_cases'].values.reshape(-1,1)
y= USA['total_deaths'].values.reshape(-1,1)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)



In [None]:
# fit model
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
#To retrieve the intercept:
print(model.intercept_)
#For retrieving the slope:
print(model.coef_)

The intercept (often labeled the constant) is the expected mean value of Y when all X=0.    
The regression coefficient is the constant that represents the rate of change of one variable (y) as a function of changes in the other (x); it is the slope of the regression line.
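
To make the two definitions concrete, the sketch below applies a fitted line by hand. The `intercept` and `slope` values are invented for illustration; in the notebook they would come from `model.intercept_` and `model.coef_`. The `predict_deaths` helper is likewise hypothetical.

```python
# Hypothetical fitted values, for illustration only.
intercept = -5.0
slope = 0.02

def predict_deaths(total_cases):
    """Apply the fitted line: y = intercept + slope * x."""
    return intercept + slope * total_cases

for cases in (0, 1000, 5000):
    print(cases, predict_deaths(cases))
```

With these made-up numbers the prediction at zero cases is negative, which is impossible for a death count. This shows why the intercept is best read as the value that positions the line rather than as a meaningful forecast at x = 0.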

In [None]:
y_pred = model.predict(X_test)

In [None]:
df3 = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df3

In [None]:
df4 = df3
df4.plot(kind='bar',figsize=(10,6))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

In [None]:
plt.scatter(X_test, y_test,  color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The predictions above are imperfect but informative.
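
The three error metrics printed above are simple averages over the prediction errors. As a small sketch of the arithmetic, the block below recomputes them with the standard library on made-up numbers. The `regression_errors` helper and the two lists are hypothetical, not values from the model.

```python
import math

def regression_errors(actual, predicted):
    """Return MAE, MSE and RMSE for paired actual and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
    mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}

# Hypothetical values, for illustration only.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]

result = regression_errors(actual, predicted)
print(result)
```

Because RMSE squares the errors before averaging, it penalises a few large misses more heavily than MAE does, and both are expressed in the same unit as the target (here, deaths).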

For comparison, consider a country that has been more successful in controlling the spread: Singapore.

In [None]:
SING=df2.loc[df2['location']== 'Singapore']
SING

In [None]:
SING.corr(numeric_only=True)

In [None]:
plt.figure(figsize=(16,9)) # Figure size
sns.lineplot(x='date', y='total_cases', data=SING, marker='o', color='red') 
plt.title('Total cases over time') # Title (total_cases is cumulative)
plt.xticks(SING.date.unique(), rotation=90) # All values in x-axis; rotate 90 degrees
plt.show()

In [None]:
#predict
x1= SING['total_cases'].values.reshape(-1,1)
y1= SING['total_deaths'].values.reshape(-1,1)

X_train, X_test, y_train, y_test = train_test_split(x1, y1, test_size=0.4, random_state=1)

In [None]:
# fit model
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
df5 = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df5

The predictions above for Singapore appear close to the actual values. However, their reliability depends on how completely cases and deaths are reported.

In [None]:
plt.scatter(X_test, y_test,  color='gray')
plt.plot(X_test, y_pred, color='green', linewidth=2)
plt.show()