## Model Building and Evaluation
We treat this as a supervised regression problem: the goal is to predict the number of COVID-19 deaths from the other internal and external factors in the dataset. Visually and statistically, the dependent variable (total deaths) appears to be linearly related to the independent variables. Earlier, in the data exploration part, we fitted a Linear Regression model and used a pair plot to inspect the correlations between variables. We noted a strong relationship between the dependent variable and the independent variables, but also strong correlations among several of the independent variables themselves (multicollinearity). One option would be to apply feature selection and drop the redundant variables. Instead, we keep all variables and try Lasso Regression, whose L1 penalty can shrink the coefficients of the least important features to exactly zero.
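To illustrate why Lasso is a reasonable choice when predictors are correlated, here is a minimal sketch on synthetic data, not on the COVID-19 dataset. The names `x1`, `x2`, `x3`, `X_demo` and `y_demo`, as well as `alpha=0.1`, are assumptions chosen only for the illustration: `x2` is almost a copy of `x1`, and `x3` is unrelated to the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly identical to x1 (strong collinearity)
x3 = rng.normal(size=n)               # carries no information about the target
y_demo = 3.0 * x1 + 0.1 * rng.normal(size=n)
X_demo = np.column_stack([x1, x2, x3])

# Ordinary least squares keeps a non-zero coefficient for every column
ols = LinearRegression().fit(X_demo, y_demo)

# The L1 penalty can set uninformative coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)

print("OLS coefficients:  ", ols.coef_)
print("Lasso coefficients:", lasso.coef_)
print("Coefficients zeroed by Lasso:", int(np.sum(lasso.coef_ == 0.0)))
```

How many coefficients end up at zero depends on `alpha`; a larger penalty removes more features, so in practice the value would be tuned rather than fixed as it is here.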

In [31]:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split

# retrieve the dataframe stored by the preprocessing notebook
%store -r df

def rmse(a, b):
    """Root mean squared error between two arrays of the same shape."""
    return np.sqrt(np.mean((a - b) ** 2))
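
As a quick check of the `rmse` helper, the sketch below evaluates it on a few made-up numbers (the arrays `y_true` and `y_pred` are hypothetical). It also shows a shape pitfall: the target in this notebook is a column vector of shape `(n, 1)`, and subtracting a one-dimensional prediction array from it broadcasts to an `(n, n)` matrix, which silently gives a different result.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two arrays of the same shape."""
    return np.sqrt(np.mean((a - b) ** 2))

# Hypothetical values, shaped like the notebook's target: (n, 1)
y_true = np.array([[3.0], [5.0], [7.0]])
y_pred = np.array([[2.0], [5.0], [9.0]])

# Errors are 1, 0 and -2, so the result is sqrt((1 + 0 + 4) / 3)
print(rmse(y_true, y_pred))

# Pitfall: a 1-D prediction array broadcasts against the (n, 1) target
y_flat = y_pred.ravel()
print((y_true - y_flat).shape)            # (3, 3), not (3, 1)

# Flattening both arrays first avoids the problem
print(rmse(y_true.ravel(), y_flat))
```

Flattening both inputs (or reshaping both to `(n, 1)`) before calling `rmse` is one simple way to guard against this.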

### Pipeline
We separate the independent variables and the target variable into two lists.

In [32]:
dep_var = ['total_deaths']

ind_vars = list(df.columns)
ind_vars.remove('total_deaths')

Linear models do not require Gaussian-distributed features, but a penalized model such as Lasso is sensitive to feature scale, so we rescale each feature individually to remove the differences in scale. Since total deaths and other variables such as cases and tests are always non-negative and do not follow a bell-shaped (normal) distribution, we apply min-max normalization (`MinMaxScaler`), which maps each feature to the range [0, 1]. Missing values are first filled with the column median.

In [33]:
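# step 1: fill missing values with the column median
num_si_step = ('si', SimpleImputer(strategy='median'))
# step 2: rescale each column to the range [0, 1]
num_scl_step = ('scl', MinMaxScaler())
num_steps = [num_si_step, num_scl_step]
num_pipe = Pipeline(num_steps)
# apply the numeric pipeline to every independent variable
num_transformer = [('num', num_pipe, ind_vars)]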
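
To make the two pipeline steps concrete, here is a small sketch on a toy DataFrame. The frame `toy` and its columns `cases` and `tests` are made up for the illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "cases": [10.0, np.nan, 30.0, 50.0],
    "tests": [100.0, 200.0, np.nan, 400.0],
})

# Same two steps as above: median imputation, then min-max scaling
toy_pipe = Pipeline([
    ("si", SimpleImputer(strategy="median")),
    ("scl", MinMaxScaler()),
])
toy_ct = ColumnTransformer([("num", toy_pipe, ["cases", "tests"])])

toy_scaled = toy_ct.fit_transform(toy)
print(toy_scaled)
# cases: NaN -> median 30, then (x - 10) / 40   gives 0, 0.5, 0.5, 1
# tests: NaN -> median 200, then (x - 100) / 300 gives 0, 1/3, 1/3, 1
```

Each column is scaled with its own minimum and maximum, so after the transform every feature lies in [0, 1] regardless of its original units.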

In [34]:
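ct = ColumnTransformer(transformers=num_transformer)
# learn the medians and min/max values on the full dataset, then transform it
ct.fit(df[ind_vars])
X = ct.transform(df[ind_vars])
# keep the target as a column vector of shape (n_samples, 1)
y = df[['total_deaths']].values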

### Train-Test Split (80-20) -- Regularization

In [35]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(11096, 30) (2774, 30) (11096, 1) (2774, 1)
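
The cells above fit the imputer and scaler on the full dataset before splitting. A common alternative, sketched here on synthetic data, is to split first and fit the transformer on the training portion only, so that medians and min/max values from the test rows do not influence the preprocessing. The frame `demo`, its column names, and the variables `raw_train`, `raw_test` and `demo_ct` are assumptions made for this sketch; this is not the procedure used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real dataframe (hypothetical values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cases": rng.uniform(0, 1000, size=50),
    "tests": rng.uniform(0, 5000, size=50),
    "total_deaths": rng.uniform(0, 100, size=50),
})
features = ["cases", "tests"]

# Split the raw rows first (80-20, as in the notebook)
raw_train, raw_test = train_test_split(demo, test_size=0.2, random_state=0)

demo_pipe = Pipeline([
    ("si", SimpleImputer(strategy="median")),
    ("scl", MinMaxScaler()),
])
demo_ct = ColumnTransformer([("num", demo_pipe, features)])

# Fit on the training rows only, then apply the same transform to both parts
demo_ct.fit(raw_train[features])
X_train_demo = demo_ct.transform(raw_train[features])
X_test_demo = demo_ct.transform(raw_test[features])

print(X_train_demo.shape, X_test_demo.shape)
```

With this ordering the scaled test features can fall slightly outside [0, 1], because the test rows may contain values beyond the training minimum or maximum.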


In [37]:
%store X_train
%store X_test
%store y_train
%store y_test
%store ind_vars
%store dep_var
%store X
%store y

Stored 'X_train' (ndarray)
Stored 'X_test' (ndarray)
Stored 'y_train' (ndarray)
Stored 'y_test' (ndarray)
Stored 'ind_vars' (list)
Stored 'dep_var' (list)
Stored 'X' (ndarray)
Stored 'y' (ndarray)
