# COVID-19 daily cases prediction

Given survey results from the past three days in a specific U.S. state,
predict the percentage of new positive test results on the third day.
This sample works with the data file `covid.csv`.


## Importing and preparing data
First, import the required packages and enable Intel(R) Extension for Scikit-learn*. Call `patch_sklearn()` before importing the scikit-learn estimators so that the patched versions are picked up.

In [32]:
# Intel(R) Extension for Scikit-learn dynamically patches scikit-learn estimators to use oneDAL as the underlying solver
from sklearnex import patch_sklearn
patch_sklearn()

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

csv = pd.read_csv('covid.csv')
data = csv.values
columns = csv.columns

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Feature extraction and normalization

First, we prepare the data and extract features. We use `SelectKBest` with the `f_regression` score function to keep the features with the highest univariate F-scores against the target, which is the last column of the data.

We select `k` features; here `k = 12`.

In [33]:
k = 12
selector = SelectKBest(score_func=f_regression, k=k)
result = selector.fit(data[:, :-1], data[:,-1])
idx = np.argsort(result.scores_)[::-1]
feat_idx = list(np.sort(idx[:k]))

print(f'Selected features: {columns[feat_idx]}')
features = data[:, feat_idx]

# Scale each selected feature to the range [0, 1]
normalizer = MinMaxScaler(feature_range=(0, 1))
features = normalizer.fit_transform(features)

values = data[:, -1]
values = np.expand_dims(values, axis=1)
features_with_values = np.concatenate((features, values), axis=1)

Selected features: Index(['cli', 'ili', 'hh_cmnty_cli', 'nohh_cmnty_cli', 'tested_positive',
       'cli.1', 'hh_cmnty_cli.1', 'nohh_cmnty_cli.1', 'tested_positive.1',
       'cli.2', 'hh_cmnty_cli.2', 'nohh_cmnty_cli.2'],
      dtype='object')
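
The selection step above can also be checked on a small synthetic problem. The following is a minimal sketch, not part of the original sample: the data, the seed, and the choice of informative columns (1 and 4) are assumptions made for illustration. It shows that the manual `argsort` over `scores_` picks the same columns as the selector's own `get_support(indices=True)`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical data: 200 samples, 6 features, target driven by columns 1 and 4 only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)

k = 2
selector = SelectKBest(score_func=f_regression, k=k).fit(X, y)

# Manual ranking, as in the cell above: sort scores in descending order, keep the top k
manual_idx = np.sort(np.argsort(selector.scores_)[::-1][:k])

# Built-in alternative: indices of the selected columns, in ascending order
support_idx = selector.get_support(indices=True)

print(manual_idx, support_idx)
```

Either form can be used to index the column names; `get_support` simply avoids re-deriving the ranking by hand.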


## Preparing dataset

Next, we split the dataset into a training set and a test set.
scikit-learn provides `train_test_split` for this; with `test_size=0.2`, 20% of the rows are held out for testing.

In [34]:
train_set, test_set = train_test_split(features_with_values, test_size=0.2, random_state=42)

# The last column is the target; all other columns are features
x_train = train_set[:, :-1]
y_train = train_set[:, -1]
x_test = test_set[:, :-1]
y_test = test_set[:, -1]
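
To make the split and the column slicing concrete, here is a small sketch on a toy array. The array `table` is hypothetical and stands in for `features_with_values`: each row holds three feature columns followed by one target column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical table: 10 rows, 3 feature columns plus 1 target column
table = np.arange(40, dtype=float).reshape(10, 4)

# Rows are shuffled and split 80/20; features and target stay together in each row
train_set, test_set = train_test_split(table, test_size=0.2, random_state=42)

# Separate features (all but the last column) from the target (last column)
x_train, y_train = train_set[:, :-1], train_set[:, -1]
x_test, y_test = test_set[:, :-1], test_set[:, -1]

print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
```

Splitting the combined array keeps each target value aligned with its feature row, which is why the sample concatenates features and values before calling `train_test_split`.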

## Linear regression

Finally, we will train our model using linear regression.

In [35]:
linearRegression = LinearRegression()
linearRegression.fit(x_train, y_train)

# Weights and bias
print(f"Weights: {linearRegression.coef_}")
print(f"Bias: {linearRegression.intercept_}")

# Mean squared error (MSE) on the test set
print("Losses: %.2f" % np.mean((linearRegression.predict(x_test) - y_test) ** 2))
# Coefficient of determination (R^2) on the test set
print("Prediction scores: %.2f" % linearRegression.score(x_test, y_test))


Weights: [ -3.91028257   3.64099088   6.18709571  -8.59003291  -0.08478557
  -6.2970195  -19.18986644  15.91035666  37.2064383    7.39297461
  14.59408981  -8.1357505 ]
Bias: 2.271530322667597
Losses: 0.94
Prediction scores: 0.98
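
The two reported numbers are the mean squared error ("Losses") and the coefficient of determination R^2 ("Prediction scores", which is what `LinearRegression.score` returns). As a minimal sketch of how the two relate, the following computes both by hand with NumPy; the arrays `y_true` and `y_pred` are made-up values for illustration, not results from `covid.csv`.

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

# Mean squared error: average of the squared residuals
mse = np.mean((y_pred - y_true) ** 2)

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE: {mse:.3f}")  # MSE: 0.125
print(f"R^2: {r2:.3f}")   # R^2: 0.975
```

MSE is expressed in squared units of the target, so it depends on the target's scale, while R^2 is unitless and equals 1 for a perfect fit.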
