# COVID-19: A predictive analysis
by: Kaike W. Reis


## Steps
- Missing Data Analysis & Pre-processing
- Exploratory Data Analysis
- Predictive Analysis - General Information
- Task 1
- Task 2
- Conclusions

## Notebook Libraries

In [None]:
# Standard modules
import numpy as np
import pandas as pd

# Machine Learning modules - quantitative analysis
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split 
from sklearn.metrics import classification_report, confusion_matrix

# Graphical modules
import matplotlib.pyplot as plt
import seaborn as sns

# Display module
from IPython.display import display

# Missing Data Analysis & Pre-processing

## Importing data

In [None]:
# Importing raw dataset
dataset_raw = pd.read_excel('/kaggle/input/covid19/dataset.xlsx')

In [None]:
# Drop unused columns
dataset_raw.drop('Patient ID', axis=1, inplace=True)

In [None]:
# Change column names: lowercase and '_' over space
new_columns = list()
# Loop over all columns
for col in dataset_raw.columns:
    new_columns.append(col.lower().replace(' ','_'))
# Modify dataset columns
dataset_raw.columns = new_columns

## Missing Data Analysis

### **How many NaN values does each column have?**

In [None]:
# Count the NaN values in each column
nan_per_column = pd.DataFrame(dataset_raw.isna().sum(), columns=['nanValues']).reset_index()

# Calculate NaN % per column (vectorized, one decimal place)
nan_per_column['nanValuesPct'] = (100 * nan_per_column['nanValues'] / len(dataset_raw)).round(1)

In [None]:
# Plot - % of missing rows for each column
plt.figure(figsize=(20,15))
sns.barplot(x="index", y="nanValuesPct", data=nan_per_column)
plt.xlabel('Variables', fontsize=20)
plt.ylabel('Missing %', fontsize=20)
plt.title('Missing Data Plot', fontsize=30)
plt.yticks([0,10,20,30,40,50,60,70,80,90,100])
plt.xticks(rotation=90);

Based on this plot, it's clear that this dataset has a lot of missing values, so before developing any model it's necessary to obtain a complete dataset to train on.

First I have to decide what to do with the NaN values:
- Keep only complete samples?
- Impute all missing values?

Neither option works as is. The first one will probably select a complete dataset with 0 samples.

In [None]:
len(dataset_raw.dropna(how='any'))

The second one, "Impute all missing values", can be complicated. In Missing Data Analysis (the field that studies ways to impute data) there are some golden rules (more like recommendations):
- You need to understand whether your data is Missing Completely at Random (MCAR), Missing at Random (MAR) or Missing Not at Random (MNAR)
- Samples with more than 50% missing data should not be imputed
- Columns with more than 5%-10% missing data should not be imputed
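
A minimal sketch of how the last two recommendations could be checked with pandas. The toy DataFrame, the helper name `missing_report` and the exact cut-offs (50% per row, 10% per column) are assumptions for illustration; the MCAR/MAR/MNAR question cannot be answered by counting alone.

```python
import numpy as np
import pandas as pd

def missing_report(df, row_limit=0.50, col_limit=0.10):
    """Flag rows and columns whose missing fraction exceeds the given limits.

    Hypothetical helper: the limits mirror the rules of thumb listed above.
    """
    row_missing = df.isna().mean(axis=1)  # fraction of NaN per sample
    col_missing = df.isna().mean(axis=0)  # fraction of NaN per column
    rows_too_sparse = row_missing[row_missing > row_limit].index.tolist()
    cols_too_sparse = col_missing[col_missing > col_limit].index.tolist()
    return rows_too_sparse, cols_too_sparse

# Toy data (made up): 4 samples, 3 variables
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, 1.0, 2.0],
    'c': [np.nan, 5.0, 6.0, 7.0],
})

rows_flagged, cols_flagged = missing_report(toy)
print('Rows above 50% missing:', rows_flagged)
print('Columns above 10% missing:', cols_flagged)
```

On the real data the same call would take `dataset_raw` instead of `toy`; the flagged lists are only candidates for exclusion, not a decision.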

Let's see an example:


In [None]:
# Boolean dataset for missing values, except sars-cov-2
dataset_nan = dataset_raw.drop('sars-cov-2_exam_result',axis=1).isnull().join(dataset_raw['sars-cov-2_exam_result'])

In [None]:
# PLOT - Plot NaN Bool Dataset related to sars
fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(20,15))
features_to_plot = dataset_nan.columns[4:24] # Avoid first 4 variables because they are complete and are target variables
r = 0 # Index row
c = 0 # Index col
for f in features_to_plot:
    # Count Plot
    sns.countplot(x=f, hue='sars-cov-2_exam_result', data=dataset_nan,ax=axes[r][c])
    # Plot configs
    axes[r][c].legend(title='sars-cov-2_exam_result', loc='upper right')
    # Index control
    c += 1
    if c > 3:
        c = 0
        r += 1

plt.tight_layout()

In [None]:
# PLOT - Plot NaN Bool Dataset related to sars
fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(20,15))
features_to_plot = dataset_nan.columns[24:44] # Next 20 columns
r = 0 # Index row
c = 0 # Index col
for f in features_to_plot:
    # Count Plot
    sns.countplot(x=f, hue='sars-cov-2_exam_result', data=dataset_nan,ax=axes[r][c])
    # Plot configs
    axes[r][c].legend(title='sars-cov-2_exam_result', loc='upper right')
    # Index control
    c += 1
    if c > 3:
        c = 0
        r += 1

plt.tight_layout()

In [None]:
# PLOT - Plot NaN Bool Dataset related to sars
fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(20,15))
features_to_plot = dataset_nan.columns[44:64] # Next 20 columns
r = 0 # Index row
c = 0 # Index col
for f in features_to_plot:
    # Count Plot
    sns.countplot(x=f, hue='sars-cov-2_exam_result', data=dataset_nan,ax=axes[r][c])
    # Plot configs
    axes[r][c].legend(title='sars-cov-2_exam_result', loc='upper right')
    # Index control
    c += 1
    if c > 3:
        c = 0
        r += 1

plt.tight_layout()

In [None]:
# PLOT - Plot NaN Bool Dataset related to sars
fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(20,15))
features_to_plot = dataset_nan.columns[64:84] # Next 20 columns
r = 0 # Index row
c = 0 # Index col
for f in features_to_plot:
    # Count Plot
    sns.countplot(x=f, hue='sars-cov-2_exam_result', data=dataset_nan,ax=axes[r][c])
    # Plot configs
    axes[r][c].legend(title='sars-cov-2_exam_result', loc='upper right')
    # Index control
    c += 1
    if c > 3:
        c = 0
        r += 1

plt.tight_layout()

In [None]:
# PLOT - Plot NaN Bool Dataset related to sars
fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(20,15))
features_to_plot = dataset_nan.columns[84:104] # Next 20 columns
r = 0 # Index row
c = 0 # Index col
for f in features_to_plot:
    # Count Plot
    sns.countplot(x=f, hue='sars-cov-2_exam_result', data=dataset_nan,ax=axes[r][c])
    # Plot configs
    axes[r][c].legend(title='sars-cov-2_exam_result', loc='upper right')
    # Index control
    c += 1
    if c > 3:
        c = 0
        r += 1

plt.tight_layout()

In [None]:
# PLOT - Plot NaN Bool Dataset related to sars
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(20,15))
features_to_plot = dataset_nan.columns[104:109] # Next 5 columns
r = 0 # Index row
c = 0 # Index col
for f in features_to_plot:
    # Count Plot
    sns.countplot(x=f, hue='sars-cov-2_exam_result', data=dataset_nan,ax=axes[r][c])
    # Plot configs
    axes[r][c].legend(title='sars-cov-2_exam_result', loc='upper right')
    # Index control
    c += 1
    if c > 1:
        c = 0
        r += 1

plt.tight_layout()

From those graphs you can see:
- With average imputation you may be generalizing negative cases into positive cases, which can be considered an error. After all, how can you extrapolate the values of a positive case using only values from negative cases?
- There is a column with all values missing, so a zero imputation would add a variable with no variance, which is useless to the model.
- Even a selected-case average imputation (imputing negative cases with the negative-case average and positive cases with the positive-case average) can cause two problems: too few samples (mainly for positive cases) to compute a representative average, or situations where positive cases have no samples at all to average.

Besides that, mean imputation and zero imputation are not recommended given the diversity of imputation techniques available.
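
The last problem in the list can be reproduced on a tiny example. This is only a sketch: the column names `result` and `marker` and the values are made up, and the class-wise average is computed with a pandas `groupby` rather than with any technique used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the positive class has no observed value for 'marker'
toy = pd.DataFrame({
    'result': ['negative', 'negative', 'negative', 'positive', 'positive'],
    'marker': [1.0, 3.0, np.nan, np.nan, np.nan],
})

# Selected-case average: each sample receives the mean of its own class
class_mean = toy.groupby('result')['marker'].transform('mean')
toy['marker_imputed'] = toy['marker'].fillna(class_mean)

print(toy)
```

The missing negative sample is filled with the negative-class mean, while the positive samples stay NaN because their class has nothing to average.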

So it's important to build a consistent dataset before any modeling, to guarantee a representative prediction in the future. To continue my analysis I will develop a function to find a complete dataset that respects two ideas:
- Keep the highest number of variables
- Keep most of the rows

Since evaluating all 109! possible dataset combinations (1.4438595832025E+176 combinations) is infeasible, I will propose a different approach...
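
As a quick sanity check on that figure, the factorial can be computed with Python's standard library. The comparison with the number of column subsets (2^109) is an added illustration, not part of the original analysis.

```python
import math

n_columns = 109  # number of candidate columns quoted in the text

# Number of ways to order all columns (the figure quoted above)
n_orderings = math.factorial(n_columns)
print(f"{n_columns}! = {float(n_orderings):.13e}")

# For comparison (added illustration): number of possible column subsets
n_subsets = 2 ** n_columns
print(f"2^{n_columns} = {float(n_subsets):.3e}")
```

Either way the search space is far too large to enumerate, which motivates a heuristic.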

### Function to find a local complete dataset
Arguments:
- ```rows_threshold``` - Minimum number of non-missing values each column must have to be kept
- ```target_variable``` - Target variable used to group the counts for the previous argument
- ```df``` - Dataset to verify
- ```asc``` - Whether to sort the selected columns by their number of non-missing values in ascending (```True```) or descending (```False```) order

The function routine is:
- Count how many non-missing values each column has with respect to ```target_variable``` and filter the columns by ```rows_threshold```
- Sort the result in ascending or descending order
- Print how many complete rows remain for each number of variables, dropping one variable from the end of the sorted list at each step

In [None]:
def find_complete_dataset(rows_threshold, target_variable, df, asc):
    # Count how many existent values each variable have in relation to target_variable
    df_value_counts = pd.DataFrame(df.groupby(target_variable).count())
    # Create a df to keep the variables that contains: existent values >= threshold
    df_vars_threshold = pd.DataFrame(columns=['vars','values'])
    # Evaluate the variables that respect this condition
    for var in df_value_counts.columns:
        # Sum samples from groupby.count
        existent_samples = df_value_counts[var].sum()  
        if existent_samples >= rows_threshold:
            index = len(df_vars_threshold)
            df_vars_threshold.loc[index, 'vars'] = var
            df_vars_threshold.loc[index, 'values'] = existent_samples
    # Sort by number of existing values in the order given by asc (descending assumes that more complete columns help keep most of the variables)
    df_vars_threshold.sort_values(by=['values'], ascending=asc, inplace=True)
    # List all variables (features) that pass the established condition 
    vars_threshold = list(df_vars_threshold['vars'])
    # Print Info
    print('### For ',rows_threshold,' rows samples complete we have ',len(vars_threshold),' possible feature variables.')
    # Verify
    for i in range(1, len(vars_threshold)+1):
        # Define the set of variables to verify
        vars_to_test = vars_threshold[0:1+len(vars_threshold)-i]
        df_model = df[vars_to_test].dropna(how='any')
        # Print Info
        print('With ', len(vars_to_test),' variables, we have ', len(df_model),' complete rows.')
    # Return sorted variables
    return vars_threshold

In [None]:
# Target variables to keep out from feature dataset
targets_out = ['patient_addmited_to_semi-intensive_unit_(1=yes,_0=no)','patient_addmited_to_intensive_care_unit_(1=yes,_0=no)',
               'patient_addmited_to_regular_ward_(1=yes,_0=no)']

# Raw data for features (keep target sars)
data = dataset_raw.drop(targets_out, axis=1)

# Eval a possible dataset given my assumption for Ascending sorting
dataset_vars_eval_asc = find_complete_dataset(rows_threshold=500, target_variable='sars-cov-2_exam_result', df=data, asc=True)

In [None]:
# Eval a possible dataset given my assumption for Descending sorting
dataset_vars_eval_des = find_complete_dataset(rows_threshold=500, target_variable='sars-cov-2_exam_result', df=data, asc=False)

Based on this approach, I found two possible datasets:
- Ascending order with 420 samples
- Descending order with 1352 samples

Let's explore the results further.

In [None]:
# Creating two complete datasets
## Ascending
vars_selected = ['sars-cov-2_exam_result'] + targets_out + dataset_vars_eval_asc[0:16]
df1 = dataset_raw[vars_selected].dropna(how='any')
df1.index = range(0,len(df1))
## Show info
print('### Ascending way')
for i in df1.columns:
    print(i)
## Descending
vars_selected = ['sars-cov-2_exam_result'] + targets_out + dataset_vars_eval_des[0:18]
df2 = dataset_raw[vars_selected].dropna(how='any')
df2.index = range(0,len(df2))
## Show info
print('\n### Descending way')
for i in df2.columns:
    print(i)

In [None]:
# View df
display(df1)

In [None]:
# View df
display(df2)

As you may have noticed, these are two distinct datasets:
- Different variables
- Different types (categorical df vs numerical df)

The EDA phase will decide which dataset will be used for model development.

# Exploratory Data Analysis (EDA)


## EDA for Numerical complete local dataset

**PS**: The data already has mean 0 and standard deviation 1.

In [None]:
# Eval 'sars-cov-2_exam_result' proportions
print('Positive case proportion - original dataset [%]: ', round(100*dataset_raw['sars-cov-2_exam_result'].value_counts(normalize=True)['positive'],2))
print('Positive case proportion - numerical dataset [%]: ', round(100*df1['sars-cov-2_exam_result'].value_counts(normalize=True)['positive'],2))

This complete dataset has a higher proportion of positive cases than the original one.

In [None]:
# Eval features for 'sars-cov-2_exam_result'
## Defining our Y variables out
targets_out = ['sars-cov-2_exam_result','patient_addmited_to_semi-intensive_unit_(1=yes,_0=no)','patient_addmited_to_intensive_care_unit_(1=yes,_0=no)',
           'patient_addmited_to_regular_ward_(1=yes,_0=no)']

## Defining our X variables
feat_cols = list(set(df1.columns).difference(set(targets_out)))

## Data
data = df1[['sars-cov-2_exam_result']+feat_cols]

In [None]:
# PLOT - Strip plots of each feature against sars-cov-2_exam_result
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(20,15))
r = 0 # Index row
c = 0 # Index col
for f in data.columns:
    if not f == 'sars-cov-2_exam_result':
        sns.stripplot(x=f, y='sars-cov-2_exam_result', data=data, ax=axes[r][c])
        #axes[r][c].legend(title='sars-cov-2_exam_result', loc='upper right')
        # Index control
        c += 1
        if c > 3:
            c = 0
            r += 1

plt.tight_layout()

From this result, I conclude that none of the variables alone can linearly separate the exam results (no clear boundary regions appear).

In [None]:
sns.pairplot(data, hue='sars-cov-2_exam_result')

Across these plots, it's possible to see that some variables are correlated. Let's evaluate this through a correlation heatmap:

In [None]:
plt.figure(figsize=(20,10))
myBasicCorr = df1[feat_cols].corr('spearman')
sns.heatmap(myBasicCorr, annot = True)

Considering an absolute correlation threshold of 0.90 (the maximum is 1), two pairs need to be evaluated:
- neutrophils VS lymphocytes
- hemoglobin VS hematocrit

Because they are highly correlated, one variable from each pair should be removed to decrease model complexity and noise. First let's evaluate each data distribution:
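
Pairs like these can also be listed programmatically instead of being read off the heatmap. The sketch below assumes a hypothetical helper `highly_correlated_pairs` and a made-up toy DataFrame; only the Spearman method and the 0.90 threshold come from the text.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.90, method='spearman'):
    """Return the feature pairs whose absolute correlation reaches the threshold.

    Hypothetical helper, not part of the original notebook.
    """
    corr = df.corr(method=method).abs()
    # Keep only the upper triangle so each pair appears once (no diagonal)
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs >= threshold]

# Toy data (made up): 'b' is a monotonic function of 'a', 'c' is unrelated
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [2, 4, 6, 8, 10, 12],
    'c': [6, 1, 5, 2, 4, 3],
})

result = highly_correlated_pairs(toy)
print(result)
```

On the notebook's data the call would be `highly_correlated_pairs(df1[feat_cols])`, assuming `feat_cols` holds only numeric features.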

In [None]:
# Dist plot
corr_vars = ['neutrophils', 'lymphocytes', 'hemoglobin', 'hematocrit']
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20,15))
r = 0 # Index row
c = 0 # Index col
# Array for each category
target_0 = data.loc[data['sars-cov-2_exam_result'] == 'negative']
target_1 = data.loc[data['sars-cov-2_exam_result'] == 'positive']
# Plot process
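for f in corr_vars:
    # Density curve per class (sns.distplot is deprecated in recent seaborn versions)
    sns.kdeplot(x=target_0[f], label='negative', ax=axes[r][c])
    sns.kdeplot(x=target_1[f], label='positive', ax=axes[r][c])
    axes[r][c].legend(title='sars_case')
    c += 1
    if c > 1:
        c = 0
        r += 1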

plt.tight_layout()

Based on the distributions, I will keep ```hematocrit``` and ```lymphocytes```, mostly because of the clearer difference between the positive and negative case distributions.

In [None]:
# Update main df
df1.drop(['hemoglobin', 'neutrophils'], axis=1, inplace=True)

Now the continuous dataset has 14 features (the 16 selected minus the two dropped above)!

## EDA for Categorical complete local dataset

In [None]:
# Eval 'sars-cov-2_exam_result' proportions
print('Positive case proportion - original dataset [%]: ', round(100*dataset_raw['sars-cov-2_exam_result'].value_counts(normalize=True)['positive'],2))
print('Positive case proportion - categorical dataset [%]: ', round(100*df2['sars-cov-2_exam_result'].value_counts(normalize=True)['positive'],2))

Unlike the numerical dataset, this one lost positive proportion, resulting in a more unbalanced dataset.

In [None]:
# Eval features for 'sars-cov-2_exam_result'
## Defining our Y variables out
targets_out = ['sars-cov-2_exam_result','patient_addmited_to_semi-intensive_unit_(1=yes,_0=no)','patient_addmited_to_intensive_care_unit_(1=yes,_0=no)',
           'patient_addmited_to_regular_ward_(1=yes,_0=no)']

## Defining our X variables
feat_cols = list(set(df2.columns).difference(set(targets_out)))

## Data
data = df2[['sars-cov-2_exam_result']+feat_cols]

In [None]:
# PLOT - Barplots over our variables
fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(20,15))
r = 0 # Index row
c = 0 # Index col
for f in feat_cols:
    # Count Plot
    sns.countplot(x=f, hue='sars-cov-2_exam_result', data=data,ax=axes[r][c])
    # Plot configs
    axes[r][c].legend(title='sars-cov-2_exam_result', loc='upper right')
    # Index control
    c += 1
    if c > 3:
        c = 0
        r += 1

plt.tight_layout()

Given that most of the variables present positive samples in only one category, and that the target variable lost positive proportion, **I will discontinue the study of the discrete dataset**.

# Predictive Analysis - General Information
For both tasks, the information below is necessary for a better understanding.

## Why Support Vector Machine (SVM)
- It's a robust model and, before the deep learning explosion, one of the most widely used
- It works really well with small datasets, and I only have 420 samples. This does not hold for Deep Learning models, which require a lot of data to generalize correctly.
- It's a powerful binary classifier and this problem is a binary classification (TASK 1). Nonetheless, it can also be applied to multi-class problems (TASK 2)

## Measures against imbalanced target class
All targets for tasks 1 and 2 are unbalanced. To mitigate this I proposed:
- Stratified train/test split
- Stratified K-Fold Cross-Validation, keeping the class proportion between train/validation splits
- Cross-validation analysis to avoid possible overfitting due to the small dataset
- Score metric - balanced accuracy
- Evaluating confusion matrix results besides accuracy

**PS**: Stratifying keeps the same proportion of target categories in the splitting process.
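
A minimal sketch of what stratification does, on a made-up imbalanced target (80 negatives, 20 positives). The toy arrays and the explicit `StratifiedKFold` object are assumptions for illustration; they are not the notebook's data or its exact cross-validation setup.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy imbalanced target (made up): 80 negatives, 20 positives
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single feature

# Stratified hold-out split: both parts keep the 20% positive rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1206, stratify=y)
print('Positive rate train/test:', y_train.mean(), y_test.mean())

# Stratified 5-fold CV on the training part: each validation fold stays close to that rate
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1206)
fold_rates = [y_train[val_idx].mean() for _, val_idx in skf.split(X_train, y_train)]
print('Positive rate per validation fold:', fold_rates)
```

As far as I know from scikit-learn's documentation, passing an integer `cv` to `GridSearchCV` with a classifier and a binary or multi-class target also uses stratified folds, so the explicit object is mainly useful when shuffling or a fixed seed is wanted.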

## Steps for model development
- Prepare the continuous dataset
- Split a test set to verify the metrics
- Grid Search Stratified KFold Cross Validation for SVM Classifier
- Evaluate Metrics

## Metrics details
**Balanced accuracy** and a **confusion matrix report** will be evaluated. To understand the latter, let me explain a binary confusion matrix:

![cm](https://user-images.githubusercontent.com/32513366/77873202-aec63780-721f-11ea-9955-08e3860e2a01.PNG)

- Column - Predicted value
- Row - Real value

As the target variable is medical in nature, each value in the confusion matrix has a meaning:
- TP (True Positive): The patient has COVID-19 and the model predicts it correctly
- TN (True Negative): The patient doesn't have COVID-19 and the model predicts it correctly
- FP (False Positive): The patient doesn't have COVID-19 but the model says they are infected
- FN (False Negative): The patient has COVID-19 but the model says they are fine

Given the current world situation, I considered the **FN error the worst**: after leaving the hospital, the patient could infect others. I will consider **recall** and **accuracy** the most important measures for this study.
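
To make the link between the four cells and the metrics concrete, here is a small sketch that computes them by hand with NumPy. The label vectors are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Toy labels (made up): 1 = positive, 0 = negative
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# Confusion matrix cells
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

# Recall penalizes FN: share of real positives that the model found
recall = tp / (tp + fn)
# Specificity: share of real negatives that the model found
specificity = tn / (tn + fp)
# Plain accuracy and balanced accuracy (mean of the per-class recalls)
accuracy = (tp + tn) / len(y_true)
balanced_accuracy = (recall + specificity) / 2

print('TP, TN, FP, FN:', tp, tn, fp, fn)
print('recall:', recall, 'accuracy:', accuracy, 'balanced accuracy:', balanced_accuracy)
```

Balanced accuracy weighs both classes equally, which is why it is less flattering than plain accuracy when the majority class dominates.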

# Task 1
**Predict confirmed COVID-19 cases among suspected cases. Based on the results of laboratory tests commonly collected for a suspected COVID-19 case during a visit to the emergency room, would it be possible to predict the test result for SARS-Cov-2 (positive/negative)?**

### Prepare the continuous dataset

In [None]:
## Defining our X variables
feat_cols = list(set(df1.columns).difference(set(targets_out)))

## Setting binary values - label encoding (positive = 1, negative = 0), stored as integers so scikit-learn accepts the target
df1['sars-cov-2_exam_result'] = df1['sars-cov-2_exam_result'].map({'negative': 0, 'positive': 1}).astype(int)

### Split a test set to verify the metrics

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df1[feat_cols], df1['sars-cov-2_exam_result'], test_size = 0.20, random_state = 1206, stratify=df1['sars-cov-2_exam_result'])

### Grid Search Stratified KFold Cross Validation for SVM Classifier

In [None]:
# Defining parameter range to grid search
param_gridSVM = {'C': [0.1, 1, 10, 100, 1000],
                 'shrinking':[True, False],
                 'gamma': ['scale', 'auto', 1, 0.1, 0.01, 0.001, 0.0001], 
                 'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}  

# Define grid instance
gridSVM = GridSearchCV(cv=5, estimator=SVC(class_weight='balanced', random_state=101), param_grid=param_gridSVM, refit = True, verbose = 0, scoring='balanced_accuracy') 

# Initialize grid search, fitting the best model
gridSVM.fit(x_train, y_train)

In [None]:
# print best parameter after tuning 
print(gridSVM.best_params_)

In [None]:
# print how our best model looks after hyper-parameter tuning 
print(gridSVM.best_estimator_) 

### Evaluate Metrics
Evaluating SVM model performance can answer the first task.

In [None]:
# Make predictions over the test set
y_pred_svm = gridSVM.predict(x_test)

In [None]:
# print classification report SVM
print(classification_report(y_test, y_pred_svm))

In [None]:
# Confusion Matrix SVM
## original binary labels
labels = np.unique(y_test)
## DF with C.M.
cm = pd.DataFrame(confusion_matrix(y_test, y_pred_svm, labels=labels), index=labels, columns=labels)
# Visualize labels
cm.index = ['real: 0', 'real: 1']
cm.columns = ['pred: 0', 'pred: 1']

# CM visualization
cm

### Results - Task 1
- The SVM model performed well, with a recall (related to the **FN error**) of 75% and an accuracy of ~80%
- The model is robust against unbalanced data for task 1

With more complete samples for those continuous features, the model could probably be improved!

# Task 2
**Predict admission to general ward, semi-intensive unit or intensive care unit among confirmed COVID-19 cases. Based on the results of laboratory tests commonly collected among confirmed COVID-19 cases during a visit to the emergency room, would it be possible to predict which patients will need to be admitted to a general ward, semi-intensive unit or intensive care unit?**

To evaluate this problem, I will create a single categorical variable that combines the three admission targets for each patient.

### Prepare the continuous dataset

In [None]:
## Defining our X variables
feat_cols = list(set(df1.columns).difference(set(['sars-cov-2_exam_result'] + targets_out)))

## Evaluate how many possibilities we have for three targets in a single column
patient_addmited_possibilities = list()
for i in range(0, len(df1)):
    possibility=str(df1.loc[i,targets_out[1]]) + str(df1.loc[i,targets_out[2]]) + str(df1.loc[i,targets_out[3]])
    patient_addmited_possibilities.append(possibility)

## Get unique possibilities
patient_addmited_cats = sorted(set(patient_addmited_possibilities))
patient_addmited_cats

In [None]:
# Numeric labels that will replace the categories above (in the same sorted order)
[0,1,2,3]

In [None]:
## Create a new dataset for this task with a new column
df1_t2 = df1.copy()
df1_t2['patient_addmited_cats'] = patient_addmited_possibilities
df1_t2.drop(targets_out, axis=1, inplace=True)

## Change to numeric values: the i-th sorted category becomes label i
cat_to_num = {cat: num for num, cat in enumerate(patient_addmited_cats)}
df1_t2['patient_addmited_cats'] = df1_t2['patient_addmited_cats'].map(cat_to_num).astype(int)
## See df
df1_t2['patient_addmited_cats'].value_counts()

Again, we can see an unbalanced problem.

### Split a test set to verify the metrics

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df1_t2[feat_cols], df1_t2['patient_addmited_cats'], test_size = 0.20, random_state = 1206, stratify=df1_t2['patient_addmited_cats'])

### Grid Search Stratified KFold Cross Validation for SVM Classifier

In [None]:
# Defining parameter range to grid search
param_gridSVM = {'C': [0.1, 1, 10, 100, 1000], 
                 'shrinking':[True, False],
                 'gamma': ['scale', 'auto', 1, 0.1, 0.01, 0.001, 0.0001], 
                 'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}  

# Define grid instance
gridSVM = GridSearchCV(cv=5, estimator=SVC(class_weight='balanced', random_state=101), param_grid=param_gridSVM, refit = True, verbose = 0, scoring='balanced_accuracy') 

# Initialize grid search, fitting the best model
gridSVM.fit(x_train, y_train)

In [None]:
# print best parameter after tuning 
print(gridSVM.best_params_)

In [None]:
# print how our best model looks after hyper-parameter tuning 
print(gridSVM.best_estimator_) 

### Evaluate Metrics

In [None]:
# Make predictions over the test set
y_pred_svm = gridSVM.predict(x_test)

In [None]:
# print classification report SVM
print(classification_report(y_test, y_pred_svm))

In [None]:
# Confusion Matrix SVM
## original class labels
labels = np.unique(y_test)
## DF with C.M.
cm = pd.DataFrame(confusion_matrix(y_test, y_pred_svm, labels=labels), index=labels, columns=labels)

# CM visualization
cm

### Results - Task 2
- The model provides a good **recall** of 67% for the first class (no hospitalization needed) and an accuracy of 58%.
- Unfortunately, the model predicts the other categories worse. This could be explained by several reasons, but based on my work here it's possible that the continuous feature space is not as good for this task as it was for the first one, or that the multi-class classification problem requires more complete samples.


# Conclusions
The whole pipeline was explained during code development. I think it is possible to keep those continuous variables and increase the number of complete samples to improve the performance of both models.

#### Let's win this fight against Covid-19!