In [None]:
#import relevant modules

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from patsy import dmatrices

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.tree import export_graphviz

import graphviz
from graphviz import Source

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# (1) Data understanding and prep:

The original dataset has been cleaned (Homework1) and is now imported as a starting point for this work. 
1. The accompanying data quality report from Task1 can be found attached as PDF as a background to this cleaned dataset;
2. A summary of this plan can be seen in the table below;


# Summary of data quality plan from Assignment 1:


| Variable Names                     | Data Quality Issue                                 | Handling Strategy                                                                  |
|------------------------------------|----------------------------------------------------|------------------------------------------------------------------------------------|
| cdc_report_dt                      | Deprecated by CDC                                  | Dropped column in Part 1                                                           |
| pos_spec_dt                        | 67.88% missing values                              | Drop column                                                                        |
| cdc_case_earliest_dt               | NA                                                 | Do nothing                                                                         |
| onset_dt                           | 44.97% missing values                              | Do nothing                                                                         |
| sex                                | 0.062% missing values                              | Replace missing values with mode                                                   |
| sex                                | 0.682% unknown values                              | Do nothing                                                                         |
| age_group                          | 0.126% missing values                              | Replace missing values with mode                                                   |
| race_ethnicity_combined            | 1.017% missing values                              | Replace missing values with mode                                                   |
| race_ethnicity_combined            | 38.70% unknown values                              | Do nothing                                                                         |
| hosp_yn                            | 22.64% missing values                              | Combine with unknown values                                                        |
| hosp_yn                            | 16.56% unknown values                              | Combine with missing values                                                        |
| hosp_yn                            | 2 outliers of 'OTHER'                              | Remove outliers                                                                    |
| icu_yn                             | 75.34% missing values                              | Combine with unknown values                                                        |
| icu_yn                             | 13.52% unknown values                              | Combine with missing values                                                        |
| death_yn                           | NA                                                 | Do nothing                                                                         |
| medcond_yn                         | 73.83% missing values                              | Combine with unknown values                                                        |
| medcond_yn                         | 8.15% unknown values                               | Combine with missing values                                                        |
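
Two of the handling strategies in the table can be sketched as follows. This is a minimal illustration on a made-up four-row frame, not the code used in Assignment 1; the toy values and the `'Unknown'` label are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the raw data
toy = pd.DataFrame({
    "sex": ["Male", np.nan, "Male", "Female"],
    "hosp_yn": ["Yes", np.nan, "Unknown", "No"],
})

# Strategy "replace missing values with mode"
toy["sex"] = toy["sex"].fillna(toy["sex"].mode()[0])

# Strategy "combine missing values with unknown values"
toy["hosp_yn"] = toy["hosp_yn"].fillna("Unknown")

print(toy)
```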

##  Part 1 1.1 Review, prepare and split the dataset into two datasets: 70% training and 30% test
Here we will import the cleaned dataset

In [None]:
# Read csv file into a dataframe.
# sep and delimiter are aliases, so only one may be given; skipinitialspace strips spaces after each comma
df = pd.read_csv('covid19-cdc-13336431-cleaned_data_Final3.csv', keep_default_na=True, sep=',', skipinitialspace=True)
df.head(10)

After importing the cleaned csv, as a reminder we will check the shape, inspect the datatypes and check for any remaining null values

In [None]:
df.shape

In [None]:
df.dtypes
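
The text above also mentions checking for remaining null values. A minimal sketch of that check is shown here on a toy frame with made-up values; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Toy frame (hypothetical values); in the notebook this would be `df`
toy = pd.DataFrame({
    "age_group": ["0 - 9 Years", "80+ Years", None],
    "death_yn": ["No", "Yes", "No"],
})

# Count of missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Total number of missing cells in the frame
print("Total missing:", int(null_counts.sum()))
```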

#### Convert datatypes for plotting later
We will now review the datatypes and convert if needed. This will help avoid plotting errors later in the notebook
- The target feature "death_yn" is type object, with values "Yes" & "No". These will be mapped "Yes": 1, "No": 0 and stored as "int64"
- Categorical features will stay as category datatypes
- Some Continuous features are datetime64 type
  

#### Setup Column types
We will now setup the continuous, categorical, target features

##### Select all categorical columns and convert to categorical type
 - This will be needed later when it will be required to convert categorical features into dummy features for modeling

In [None]:
# convert death_yn to 0,1
df['death_yn'] = df['death_yn'].map({'Yes': 1, "No": 0})

In [None]:
# Convert date time features to appropriate date time data types
df['cdc_case_earliest_dt'] = df['cdc_case_earliest_dt'].astype('datetime64[ns]')

# Select all columns of type 'object'
object_columns = df.select_dtypes(['object']).columns

# Convert selected columns to type 'category'
for column in object_columns:
    df[column] = df[column].astype('category')

# The only continuous feature is the datetime column
continous_columns = df.select_dtypes(['datetime64[ns]']).columns

In [None]:
df.dtypes

##### Finally set the target feature "death_yn" to int64 
- death_yn is a categorical feature but it is also the target feature
- To allow continuous features to plot against the target, it will need to be int64 type
- We will also remove it from the list of categorical columns

In [None]:
df['death_yn'] = df['death_yn'].astype('int64')

In [None]:
categorical_columns = df[["current_status","sex","age_group","Race","Ethnicity", "hosp_yn", "icu_yn","medcond_yn"]].columns

In [None]:
df.dtypes

##### The dataset will now be split into two datasets: 70% training and 30% test
- First we will set the target feature "y" to be death_yn
- Then we will set "X" to be the remaining features in the dataframe i.e. we drop "death_yn" from the dataframe

In [None]:
y = pd.DataFrame(df["death_yn"])
X = df.drop(["death_yn"], axis=1)

##### The data set can now be split
- The train test split will randomly split the dataset as per the test size
- We will set random_state=1 so that the random shuffle, and therefore the split, is the same every time this notebook is run

In [None]:
# Split the dataset into two datasets: 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=1)

print("original range is: ",df.shape[0])
print("training range (70%):\t rows 0 to", round(X_train.shape[0]))
print("test range (30%): \t rows", round(X_train.shape[0]), "to", round(X_train.shape[0]) + X_test.shape[0])
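
The effect of `random_state` can be checked with a small sketch on toy data (the arrays below are made up). It also shows the optional `stratify` argument, which the notebook does not use; it is included only as an illustration of how the class ratio could be kept equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 10% positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# Same random_state -> identical split on every call
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
print("Identical test rows:", np.array_equal(a_test, b_test))

# Optional: stratify keeps the class ratio the same in both subsets
_, _, ys_train, ys_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)
print("Positive rate train/test:", ys_train.mean(), ys_test.mean())
```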

In [None]:
X_train.dtypes
y_train.dtypes

In [None]:
X_train.dtypes

## Part 1 1.2 On the training set we will now carry out a series of plots comparing all features to help make decisions on what features to keep for the model 

- All plots will be using the training subset of dataset: X_train, y_train

### 1.2 Plot interaction between categorical features and target feature
- Here we will plot a pairwise interaction between each categorical feature against the target feature. 
- We will discuss what we observe from these plots, e.g., which categorical features seem to be better at predicting the target feature...
- We will choose a subset of categorical features we find promising (if any) and justify our choice.

Here we have 8 categorical features
- current_status, sex, age_group, hosp_yn, icu_yn, medcond_yn, race, ethnicity 
- A stacked bar plot for each will be created.

In [None]:
## for the one continuous feature
## create plot for cdc_case_earliest_dt
## note: this plot uses the full dataframe df rather than the training subset
date_df = df.groupby(["cdc_case_earliest_dt", "death_yn"])["death_yn"].count().unstack("death_yn").fillna(0)
date_df

date_df.plot(kind="barh", figsize=(10, 80), stacked=True)

In [None]:
##### Create stacked bar plots for all categorical features

In [None]:
import matplotlib.patches as mpatches
# for the comparison we will convert the target back to categorical
y_train = y_train.astype("category")
#X_train[categorical_columns].columns

# we will create temp dataframe for these plots, mergeing X_train and y_train
df_temp= pd.concat([X_train, y_train], axis=1)


# for each categorical feature create a stacked bar plot
for categorical_feature in categorical_columns:
    # Using code from the module lab

    # unique values (categories) of this feature
    feature_values = df_temp[categorical_feature].unique()

    # add new column and set values to zero (float, as it will hold percentages)
    df_temp['percent'] = 0.0

    #print header
    print("\n", categorical_feature)
    print("Index \t Count")

    # for each category of the feature
    for i in feature_values:

        # rows belonging to this category
        mask = df_temp[categorical_feature] == i
        count = df_temp.loc[mask, 'death_yn'].count()
        # each row contributes an equal share, so the category sums to 100%
        count_percentage = (1 / count) * 100

        # print out index vs count
        print(i, "\t", count)

        # assign the share to every row of the category in one step
        df_temp.loc[mask, 'percent'] = count_percentage

    group = df_temp[['percent',categorical_feature,'death_yn']].groupby([categorical_feature,'death_yn']).sum()

    my_plot = group.unstack().plot(kind='bar', stacked=True, title=f"Death_yn vs {categorical_feature}", figsize=(15,7), grid=True)

    # add legend
    orange_patch = mpatches.Patch(color='orange', label='Yes')
    blue_patch = mpatches.Patch(color='blue', label='No')
    my_plot.legend(handles=[orange_patch, blue_patch], frameon = True)

    # add gridlines (first argument passed positionally; the old `b=` keyword is no longer accepted)
    plt.grid(True, which='major', color='#666666', linestyle='-')
    plt.minorticks_on()
    plt.grid(True, which='minor', color='#999999', linestyle='-', alpha=0.2)

    # add labels
    my_plot.set_xlabel("Feature Values")
    my_plot.set_ylabel("% of rows")
    my_plot.set_ylim([0,100])
    
# drop 'percent' that was used only for stacked bar plot
df_temp = df_temp.drop("percent", axis=1)
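
The per-category percentages plotted above can also be obtained in one step with a row-normalised cross-tabulation. The sketch below is an alternative, not the lab's method, and runs on a toy frame with made-up values standing in for `df_temp`.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for df_temp
toy = pd.DataFrame({
    "hosp_yn": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "death_yn": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Row-normalised cross-tabulation: each feature value sums to 100%
pct = pd.crosstab(toy["hosp_yn"], toy["death_yn"], normalize="index") * 100
print(pct.round(1))

# The same table can be drawn as a stacked bar chart with:
# pct.plot(kind="bar", stacked=True, ylim=(0, 100))
```

Reading across a row gives the share of deaths within that feature value, which is the quantity discussed in the interpretation that follows.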

#### Interpretation of the results
##### death_yn vs cdc_case_earliest_dt
- Observing the plot we can see there is no clear relationship between the cdc_case_earliest_dt of a COVID-19 positive case and death.

##### death_yn vs current_status
- Observing the plots we can see there is no clear relationship between the current_status of a COVID-19 positive case and death. 
- Both statuses have a similar probability of death, and neither is high (less than ~5%).
- Due to this it would seem that current_status does not have a notable effect on the likelihood of death.
- It is not a high value feature.

##### death_yn vs sex
- Observing the plots we can see there is no clear relationship between the sex of COVID-19 positive cases and death. 
- Male and female cases have a similar probability of death, and neither is high (less than ~5%); for unknown sex it is ~2%. This suggests sex does not strongly affect the likelihood of death from COVID-19.
- It is not a high value feature.

##### death_yn vs age_group
- Observing the plots we can see there is a clear relationship between the age of COVID-19 positive cases and death. 
- The risk of death increases with age, with the 80+ age group most at risk: ~35% of COVID-19 positive cases aged 80+ died.
- The younger age groups show a much lower risk of death when COVID-19 positive.
- age_group is a high value feature. 

##### death_yn vs race
- Observing the plots we can see there is no clear relationship between the race of COVID-19 positive cases and death. 
- The highest % of deaths is in the white race category (~5%), but this may be due to the white category having the highest number of cases; all races could have numbers of deaths proportional to their number of COVID-19 positive cases.
- It is not a high value feature.

##### death_yn vs hosp_yn
- Observing the plots we can see there is a clear relationship between hospitalisation of COVID-19 positive cases and death. 
- The risk of death is higher for positive cases who were hospitalised: ~30% of positive cases who were hospitalised died as a result of COVID-19.
- hosp_yn is a high value feature.

##### death_yn vs icu_yn
- Observing the plots we can see there is a clear relationship between positive cases being admitted to ICU and death. 
- The risk of death is higher for positive cases who were admitted to ICU: ~60% of positive cases who were admitted to ICU died as a result of COVID-19.
- icu_yn is a high value feature.

##### death_yn vs medcond_yn
- Observing the plots we can see there is a clear relationship between positive cases having existing medical conditions and death. 
- The risk of death is higher for positive cases with an existing medical condition: ~12% of positive cases who had a medical condition died as a result of COVID-19.
- Although the risk of death associated with a medical condition is not as high as that associated with ICU admission, having a medical condition does seem to increase the risk of death, and therefore medcond_yn is a high value feature.


**Categorical features will be split into low value features and high value features. Low value features (current_status, sex, race and ethnicity) will be dropped. High value features (age_group, hosp_yn, icu_yn, medcond_yn) will be kept.**

**cdc_case_earliest_dt is the only continuous feature and will also be dropped due to low correlation with the target feature.**

In [None]:
low_gain_features = ['current_status', 'sex', 'Race', 'Ethnicity']
low_correlation_features = ['cdc_case_earliest_dt']
# combine into a single list of low value features (dropped in a later cell)
low_value_features = list(set(low_gain_features + low_correlation_features))
print(low_value_features)

In [None]:
df

In [None]:
# drop all low value features
# before dropping make copy of original
df_rev1 = df.copy()
# drop low value features
df_rev1.drop(low_value_features, axis=1, inplace=True)
print('\nRemaining columns:', df_rev1.columns)
print('\nNew shape:', df_rev1.shape)

In [None]:
df_rev1.head(10)

## Part 1 1.2 Prepare dataset for modeling
Now we have picked our descriptive features for the whole dataset, a number of additional steps will need to be taken to prepare the dataset for modeling 
- We will now convert the categorical variables into dummy variables to allow modeling
- We will then set up the train test split again based on the dataset with the dummies included

#### Set up dummy features. 
This will split up each categorical feature into a number of dummy features. The dummy columns are stored as "uint8"; the dtype is set explicitly because recent pandas versions default to bool.

In [None]:
df_rev1 = pd.get_dummies(df_rev1, columns=['age_group','hosp_yn','icu_yn','medcond_yn'], drop_first=True, dtype='uint8')

#### Categorical columns are:

In [None]:
categorical_columns = df_rev1.select_dtypes(include=['uint8']).columns.tolist()


categorical_columns
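
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a single toy column with made-up values. The category that sorts first is dropped and acts as the baseline for the remaining dummies.

```python
import pandas as pd

# Toy column (hypothetical values)
toy = pd.DataFrame({"hosp_yn": ["No", "Yes", "Unknown", "Yes"]})

# drop_first=True drops the first category in sorted order ("No" here),
# which becomes the baseline that the other dummies are compared against
dummies = pd.get_dummies(toy, columns=["hosp_yn"], drop_first=True, dtype="uint8")
print(dummies)
```

A row with zeros in every dummy of a feature therefore belongs to that feature's baseline category.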

### Setting up the train/test split
- This is based on the dataset with dummy values
- The Target is stored in dataframe "y"
- The remaining features are stored in dataframe "X"
- Both are split into training and test subsets

In [None]:
# y is the target
y = df_rev1["death_yn"]


# X is everything else
X = df_rev1.drop(["death_yn"], axis=1)
# Split the dataset into two datasets: 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1,  test_size=0.3)

print("original range is: ",df_rev1.shape[0])
print("training range (70%):\t rows 0 to", round(X_train.shape[0]))
print("test range (30%): \t rows", round(X_train.shape[0]), "to", round(X_train.shape[0]) + X_test.shape[0])

In [None]:
print("\nDescriptive features in X:\n", X_train.head(5))
print("\nTarget feature in y:\n", y_train.head(5))

In [None]:
X_train.head(5)

In [None]:
y_train
y_train.head(5)
y_train

#### Reset the indexes of the training and test splits
- We can see from the X_train printout above that the indexes are no longer consecutive
- This is the same for y_train, X_test, y_test
- This will need to be dealt with next as it will affect merging of dataframes in the coming sections

In [None]:
# need to reset the index to allow concatenation with predicted values, otherwise the rows will not join on the same index
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
X_train.head(5)

In [None]:
y_train.head(5)
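
Why the reset matters can be seen in a small sketch with made-up values: `pd.concat` with `axis=1` aligns rows on index labels, so a shuffled target and freshly built predictions only line up once both use the same 0..n-1 index.

```python
import pandas as pd

# Toy target with a shuffled, non-consecutive index (as after train_test_split)
y_toy = pd.Series([0, 1, 0], index=[7, 2, 5], name="death_yn")
# Predictions built from an array get a fresh 0..n-1 index
pred = pd.DataFrame([0.0, 1.0, 0.0], columns=["Predicted"])

# concat aligns on index labels, so mismatched labels produce NaN rows
misaligned = pd.concat([y_toy, pred], axis=1)
print(misaligned)

# After resetting the index the rows line up one-to-one
aligned = pd.concat([y_toy.reset_index(drop=True), pred], axis=1)
print(aligned)
```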

The datasets are now ready for modeling

## Part 2 Linear Regression Model
#### 2.1 Train a linear regression model using only the descriptive features selected from part 1 above

In [None]:
# Train (aka fit) a model using the descriptive features selected in part 1.
multiple_linreg = LinearRegression().fit(X_train, y_train)

#### 2.2 Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model by analysing each coefficient and how it relates each input feature to the target feature).

In [None]:
# Print the weights learned for each feature.
print("\nFeatures are: \n", X_train.columns)
print("\nCoefficients are: \n", multiple_linreg.coef_)
print("\nIntercept is: \n", multiple_linreg.intercept_)
print("\nFeatures and coefficients: \n", list(zip(X_train.columns, multiple_linreg.coef_)))

#### Regression coefficients are not correlation coefficients

The values printed above are the weights (regression coefficients) learned by the model. They should not be read as correlation coefficients. A correlation coefficient measures how strong the relationship is between two variables and always lies between -1 and 1, where:

- 1 indicates a strong positive relationship.
- -1 indicates a strong negative relationship.
- 0 indicates no relationship at all.

A regression coefficient is not bounded in this way. Because every descriptive feature here is a 0/1 dummy, each coefficient is the change in the model output when that dummy is 1 rather than 0 (i.e. relative to the dropped baseline category), with all other features held fixed.

In this case most coefficients are small and positive, between 0 and 0.1. The coefficients closest to zero are 'age_group_30 - 39 Years' (-0.001203725227328168) and 'medcond_yn_unknown' (-0.007032338248988288), both slightly negative. The largest coefficients are 'age_group_80+ Years' (0.27613110357796317), 'hosp_yn_Yes' (0.16592315710540517) and 'icu_yn_Yes' (0.4040060481753917).

#### 2.2 Interpreting the linear regression model

- Linear regression is a modeling tool that is used to make predictions based on a linear relationship between the target (dependent variable) and any number of predictors (independent variables)
    - It finds the line of best fit that describes the relationship between the target and predictors 
    - This line is calculated by minimising the overall error
- The purpose of regression analysis is to:
    - Predict the value of the dependent variable as a function of the value(s) of at least one independent variable.
    - Explain how changes in an independent variable are manifested in the dependent variable
    - The dependent variable is the variable that is to be predicted or explained.
    - An independent variable is the variable or variables that is used to predict or explain the dependent variable
- The linear regression formula takes the following form:
    - $target\_feature = w_0 + w_1 * feature_1 + w_2*feature_2 + ...+ w_n*feature_n $
    - The output of this formula will be a continuous value that can be less than 0 and higher than 1
- We can see the calculated intercept is -0.030 (w_0)
    - This is the starting point, i.e. if all feature values are zero (every dummy is 0, so the case falls in the baseline category of each feature) then the model output will be -0.030
    - Can be thought of as where the model line intercepts the y axis
- We can see all the coefficients for each feature
    - These are zipped together in a single list for ease of inspection.
    - The sum of all the feature*coefficients + intercept will result in the model prediction  
- We cannot always compare coefficients directly, as the size of a coefficient is tied to the range of its feature.
    - If all features were normalized, the coefficients could be compared directly
    - Here, however, every feature is a one-hot encoded categorical with values of either 0 or 1, so all features are already on the same scale and normalisation is not needed
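
The formula above can be checked numerically. The sketch below fits a model on a tiny made-up dataset (the arrays are assumptions, not the notebook's data) and confirms that the intercept plus the weighted sum of features reproduces `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy one-hot style features and binary target (hypothetical values)
X_toy = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [1, 0], [0, 0]])
y_toy = np.array([0, 0, 1, 1, 1, 0])

model = LinearRegression().fit(X_toy, y_toy)

# Manual prediction: w_0 + w_1 * feature_1 + w_2 * feature_2
manual = model.intercept_ + X_toy @ model.coef_

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
print("Matches model.predict:", np.allclose(manual, model.predict(X_toy)))
```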
    
It is important to note that the output from a linear regression model is not suited to the classification problem that we are trying to solve.
- The output is not a probability and an additional thresholding step is necessary to convert the output into a binary classification
- We will threshold the output so that any values >=0.5 will be cast to 1, any values <0.5 will be cast to 0

Finally it is worth mentioning the effect outliers can have on linear regression output. 
- A single outlying training example can shift the fitted line noticeably, which changes the predicted class for examples whose output lies near the threshold.
- This sensitivity to outliers is one reason linear regression is a poor choice for classification problems.
- Another reason is that linear regression outputs are continuous and can be far greater than 1 or far less than 0. Since our classes are discrete, consisting only of 0 and 1 (death no or yes), linear regression is not a natural fit for this problem.
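
The point about outputs falling outside the 0 to 1 range can be illustrated with a toy example. The single numeric feature below is an assumption made for clarity; it is not one of the notebook's features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): one numeric feature, binary target
x = np.arange(10).reshape(-1, 1)
y = (x.ravel() >= 5).astype(int)

model = LinearRegression().fit(x, y)

# Predictions far from the training range leave the [0, 1] interval
low, high = model.predict(np.array([[-10], [20]]))
print("Prediction at x=-10:", low)
print("Prediction at x=20:", high)
```

Neither value can be read as a probability, which is why the output has to be thresholded before it is used as a class label.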

Note: parts of this explanation are adapted from the DA lab solution.

source: https://medium.com/@rgotesman1/learning-machine-learning-part-3-logistic-regression-94db47a94ea3

## Part 2 2.3 Printing 10 predicted target features and evaluate the prediction
Print the predicted target feature value for the first 10 training examples. Threshold the predicted target feature value given by the linear regression model at 0.5, to get the predicted class for each example. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.


- Here we will print the predicted target feature value for the first 10 training examples. 
- We will threshold the predicted target feature value given by the linear regression model at 0.5, to get the predicted class for each example. 
    - If value is >= 0.5 it is cast to 1, if < 0.5 it is cast to 0

In [None]:
# calculate the prediction and threshold the value: outputs >= 0.5 are cast to 1.0, otherwise 0.0
multiple_linreg_predictions_train = (multiple_linreg.predict(X_train) >= 0.5) * 1.0

print("\nPredictions with multiple linear regression: \n")
actual_vs_predicted_multiplelinreg = pd.concat([y_train, pd.DataFrame(multiple_linreg_predictions_train, columns=['Predicted'])], axis=1)
print(actual_vs_predicted_multiplelinreg.head(10))

## Part 2 2.4 Evaluation metrics based on training data
- We will print the classification evaluation measures computed on the training set (e.g. Accuracy, Confusion matrix, Precision, Recall, F1)
- We will discuss findings based on these measures

In [None]:
# Some more evaluation metrics.
print("==================== Train Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_train, multiple_linreg_predictions_train))
print("Confusion matrix: \n", metrics.confusion_matrix(y_train, multiple_linreg_predictions_train))
print("Classification report:\n ", metrics.classification_report(y_train, multiple_linreg_predictions_train))
print("======================================================")

## Interpretation of results
- Accuracy
    - This is simply stating how often the model is correct. We have an accuracy of 96.6%
- Confusion Matrix - A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
    - [0][0] TRUE NEGATIVE - The number that is predicted 0 that was actually 0
    - [0][1] FALSE POSITIVE - The number that is predicted 1 that is actually 0
    - [1][0] FALSE NEGATIVE - The number that is predicted 0 that is actually 1
    - [1][1] TRUE POSITIVE - The number that is predicted 1 that is actually 1
- Precision - How reliable the model's positive predictions are
    - What % of the predicted positive are actually positive
    - It is the number of values correctly predicted positive over the total number of values predicted positive
    - Precision Positive is 0.62
    - Precision Negative is 0.97
- Recall - The precise definition of recall is the number of true positives divided by the number of true positives plus the number of false negatives. 
    - What % of the positive values did we predict
    - Is the number correctly predicted positive over the total number actual positive
    - Recall Positive is 0.16
    - Recall Negative is 1.00
- F1 Score
    - Is the harmonic mean of precision and recall
    - F1 Score Positive is 0.26
    - F1 Score Negative is 0.98
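
The definitions above can be worked through by hand from a confusion matrix. The labels below are made up for illustration and are unrelated to the notebook's results.

```python
import numpy as np
from sklearn import metrics

# Toy labels (hypothetical): 10 cases, 3 actual positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

# sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # correct positives / all predicted positives
recall = tp / (tp + fn)      # correct positives / all actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Precision:", precision, "Recall:", recall, "F1:", f1)
```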

Summary
- These values seem reasonable. 
- The model is much better at predicting the negative class, i.e. not death.
- This makes sense as the model has significantly more negative data to learn from - it is the majority class. 
- This is not ideal as it does not suit the need to determine features that may increase risk of death as a result of COVID-19. 
- This represents an imbalanced classification problem: we have two classes we need to identify, death and not death, with one class (not death) representing the overwhelming majority of the data points.
- This is not ideal as the model may underestimate the number of deaths as a result of COVID-19.
- Problems like this are examples of when accuracy is not a good measure of model performance. Because the dataset is heavily imbalanced towards non-deaths, a model that predicts "no death" for almost every case will be accurate most of the time simply because most cases are non-deaths. Such a model does not identify the cases at risk, which is why the positive recall is so low and the positive precision is only 62%.
- The important statistic to look at here is recall: the recall for the positive class is 16%.
- This means the model is very poor at predicting deaths, as there are a lot of false negatives.
- The downside is that more people who have a risk factor may die as a result of COVID-19 because the model underestimated their likelihood of dying.
- As a result, this would not be an appropriate model to use to determine risk factors of COVID-19 deaths.
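
The accuracy problem described above can be reproduced with a trivial baseline. The class ratio below (3.5% deaths) is a made-up figure chosen for illustration, not a value taken from the dataset.

```python
import numpy as np
from sklearn import metrics

# Toy target (hypothetical): 1000 cases, 35 deaths (3.5% positive)
y_true = np.array([1] * 35 + [0] * 965)

# A "model" that always predicts the majority class (no death)
y_pred = np.zeros_like(y_true)

accuracy = metrics.accuracy_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)
print("Accuracy:", accuracy)
print("Recall (death):", recall)
```

A model with high accuracy and near-zero positive recall has learned little beyond the class ratio, so the recall for the death class is the more informative number.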

## Part 2 2.4 Evaluate the model using the hold-out (30% examples) test set
- The results from the test data will be compared to the results from the training data.
- In addition they will be compared to the results from a cross-validated model (i.e. a new model trained and evaluated using cross-validation on the full dataset).

In [None]:
# calculate the prediction and threshold the value: outputs >= 0.5 are cast to 1.0, otherwise 0.0
multiple_linreg_predictions_test = (multiple_linreg.predict(X_test) >= 0.5) * 1.0

print("\nPredictions with multiple linear regression: \n")
actual_vs_predicted_multiplelinreg = pd.concat([y_test, pd.DataFrame(multiple_linreg_predictions_test, columns=['Predicted'])], axis=1)
print(actual_vs_predicted_multiplelinreg.head(10))

In [None]:
# Some more evaluation metrics.
print("==================== Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, multiple_linreg_predictions_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, multiple_linreg_predictions_test))
print("Classification report - Test data:\n ", metrics.classification_report(y_test, multiple_linreg_predictions_test))
print("\n==================== Train Data ======================")
print("Accuracy: ", metrics.accuracy_score(y_train, multiple_linreg_predictions_train))
print("Confusion matrix: \n", metrics.confusion_matrix(y_train, multiple_linreg_predictions_train))
print("\nClassification report: - Training data\n ", metrics.classification_report(y_train, multiple_linreg_predictions_train))

**Interpretation of test results and comparison with training results**
- Accuracy
    - The accuracy on the test data is marginally lower: 96.4% vs 96.6%. 
- Precision
    - The precision score for predicting the positive case decreased from 62% to 45%.
    - The precision score for predicting the negative case did not change as it was 97% in both cases.
- Recall
    - The recall score for predicting the positive case has decreased from 16% to 9%.
    - The recall score for predicting the negative case did not change, it was 100% in both cases.
- F1
    - The f1 score for predicting the positive case has decreased from 26% to 15%.
    - The f1 score for predicting the negative case did not change, it was 98% in both cases.

**Summary**
- These values are what we expected, as we are now testing the model's predictions on data it has not seen before. The drop in positive precision and recall suggests this is not a good model to use on real-world data.
- More comparisons need to be made and this is where cross validation steps in

## Part 2 2.4 Cross validation

Cross-validation is a statistical method used to estimate the skill of machine learning models.

It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem because it is easy to understand, easy to implement, and results in skill estimates that generally have a lower bias than other methods.

It is a validation method to see how accurate your model is on new data as an indicator of how it would perform on real-world data. 

k-fold cross validation is a procedure used to estimate the skill of the model on new data.


- We will now perform cross validation on the linear regression model. 
- Here we perform the same evaluation as above but multiple times
- Each time the data is shuffled so we get a slightly different view of the data for training and testing
- This works well for evaluating on a limited set of data
- We will store the results in a dictionary for later use

Note: this approach is adapted from the lab solution example.

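First we need to create a function to perform this cross validation. Sklearn's cross-validation helpers cannot be used directly for linear regression here, because the continuous output has to be thresholded before classification metrics can be computed. They can be used directly for the logistic regression and random forest models.
- The cross validation function can be seen below
- Note that it repeats a random 70/30 train/test split `cv` times rather than partitioning the data into k folds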

In [None]:
def cross_val_LinReg(X, y, cv=3, scoring='accuracy'):
    """Function to carry out repeated hold-out validation on the linear regression model.
    Default number of repetitions is 3. The random state is set to the iteration
    number on each repetition so that our results can be repeated"""
    
    # store results
    results = []
    # evaluate cv times and append to results
    for i in range(cv):
        # set up train test split
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=i , test_size=0.3)
        # generate model
        multiple_linreg = LinearRegression().fit(X_train, y_train)
        # threshold
        multiple_linreg_predictions = (multiple_linreg.predict(X_test) >= 0.5) * 1.0
        # calc score
        if scoring=='accuracy':
            score = metrics.accuracy_score(y_test, multiple_linreg_predictions)
        elif scoring=='precision':
            score = metrics.precision_score(y_test, multiple_linreg_predictions)
        elif scoring=='f1':
            score = metrics.f1_score(y_test, multiple_linreg_predictions)
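        elif scoring=='recall':
            score = metrics.recall_score(y_test, multiple_linreg_predictions)
        else:
            # fail early instead of reaching the append below with 'score' undefined
            raise ValueError(f"Unsupported scoring metric: {scoring}")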
        # append to results
        results.append(score)
    return results

In [None]:
def cross_val_LinReg_DF(X,y):
    """Function to perform cross validation and store results 
    in dataframe. Cross validation looks at accuracy, precision, 
    recall, f1. Returns a dataframe with results"""

    linRegResults = {}
    # metrics to test against
    test_metrics = ['accuracy','precision','recall', 'f1']
    
    for metric in test_metrics:
        # generate test results
        result = cross_val_LinReg(X, y, cv=10, scoring=metric)
        length = len(result)
        # store result in dict
        linRegResults[metric] = sum(result)/length

    # create dataframe with results
    LinRegDF = pd.DataFrame.from_dict(linRegResults, orient='index', columns=['Linear_Regression'])
    
    return LinRegDF

In [None]:
linRegDF = cross_val_LinReg_DF(X,y)
print(f"Mean results from 10 fold cross validation are:")
linRegDF

These results are marginally lower than the previous results, but this is expected as we have taken the mean of 10 sets of results. This supports what was stated in the summary of the interpretation of results: this model would not accurately predict COVID-19 deaths based on the selected risk factors and so would not be a viable model to use on real-world data.

## Part 3 Logistic Regression  

- Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable.

- Linear regression is used to predict a continuous dependent variable from a given set of independent variables, whereas logistic regression is used to predict a categorical dependent variable. In theory it should therefore be better suited to this problem, where the output is a classification of death no or death yes.

- In order to map predicted values to probabilities, a sigmoid function is used. The function maps any real value into another value between 0 and 1.

    - The first part of the logistic regression function is similar to linear regression, i.e. we compute a weighted sum of the features
    - We then pass this value through what is called a sigmoid function
    - This sigmoid function outputs a value bounded between 0 and 1. It is a probability
    - The model then applies a threshold to this probability, so that if it is >= 0.5 it is cast to 1 and if it is < 0.5 it is cast to 0
    - All of these steps are carried out within the logistic regression function; however, the threshold value can be adjusted up or down depending on the problem you are trying to solve.

The model estimated in logistic regression is given by the logistic function: <br>
$probability(target=1|descriptive\_features)=logistic(w_0 + w_1 * feature_1 + w_2*feature_2 + ...+ w_n*feature_n)$ <br>
where $logistic(x)$ is defined as: $logistic(x) = \frac{e ^ x}{1 + e ^ x} = \frac{1}{1+e^{-x}}$

- From the values above we can see the calculated intercept is -1.006
    - This is the starting point, i.e. if all other coefficients were zero then the value for 'x' would be -1.006.
    - This is the input to the logistic function, and the logistic function will then calculate the probability and threshold based on this.
- We can see the coefficients for each of the features
    - These are zipped together in a single list for ease of inspection.
    - The effect of these on the value 'x' is the same as for linear regression
    - Only when the value 'x' is fed into the logistic function do we see the real difference between the two methods.
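
The steps above can be sketched in a few lines. This is a minimal illustration using only the standard library, not sklearn's internal implementation. The intercept is the value quoted above; the weight 3.408 is the coefficient reported for 'age_group_80+ Years' in section 3.2, used here purely as an example input. `logistic` and `predict_class` are hypothetical helper names.

```python
import math

def logistic(x):
    """Logistic (sigmoid) function: maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_class(x, threshold=0.5):
    """Cast the probability logistic(x) to a class label using a threshold."""
    return 1 if logistic(x) >= threshold else 0

intercept = -1.006                          # intercept quoted above
baseline_probability = logistic(intercept)  # every feature equal to zero

# adding one positive weight raises x, and therefore the probability
x_with_feature = intercept + 3.408
print(round(baseline_probability, 3), predict_class(intercept))
print(round(logistic(x_with_feature), 3), predict_class(x_with_feature))
```

Passing a lower `threshold` to `predict_class` makes the sketch predict the positive class more readily, which is the adjustment referred to in the list above.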
    
A benefit of logistic regression worth highlighting is that it is less sensitive to outliers than linear regression.
- As discussed for linear regression, outliers can significantly skew which values fall on each side of the threshold point.
- Outliers have much less effect on a logistic regression model, because the sigmoid function flattens out for extreme values.

* Got information from DA Lab solution example.




## Part 3 3.1 Train a logistic regression model using only the descriptive features selected from part 1 above

In [None]:
# Train aka fit, a model using all continuous and categorical features.
multiple_logisticreg = LogisticRegression().fit(X_train, y_train)

## Part 3 3.2 Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model by analysing each coefficient and how it relates each input feature to the target feature).

In [None]:
# Print the weights learned for each feature.
print("\nFeatures are: \n", X_train.columns)
print("\nCoefficients are: \n", multiple_logisticreg.coef_[0])
print("\nIntercept is: \n", multiple_logisticreg.intercept_)
print("\nFeatures and coefficients: \n", list(zip(X_train.columns, multiple_logisticreg.coef_[0])))

The coefficients learned by a logistic regression model are not correlation coefficients, so they are not bounded between -1 and 1. Each coefficient is the change in the log-odds of the target (death) associated with a one-unit increase in that feature, holding the other features fixed:

- A positive coefficient increases the predicted probability of death.
- A negative coefficient decreases the predicted probability of death.
- A coefficient of zero means the feature has no effect on the prediction.

In this case most features have a small positive coefficient between 0 and 0.3. The features with the most negative coefficients are 'age_group_20 - 29 Years' (-1.701), 'age_group_10 - 19 Years' (-1.318) and 'age_group_40 - 49 Years' (-0.645), which lower the predicted risk of death. The features with the largest positive coefficients are 'age_group_80+ Years' (3.408), 'icu_yn_Yes' (2.631), 'hosp_yn_Yes' (2.457) and 'age_group_70 - 79 Years' (2.258), which raise it.

These values are considerably larger than the coefficients learned by the linear regression model, but the two sets are on different scales (log-odds versus a direct change in the predicted value), so they should not be compared directly.
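
One common way to read logistic regression weights is to exponentiate them: a weight is a change in log-odds, so `exp(weight)` is the factor by which the odds of the target are multiplied. The sketch below applies this to a few of the values quoted above (rounded to three decimals). It is an illustrative calculation only and says nothing about statistical significance.

```python
import math

# coefficient values copied from the model output quoted above (rounded)
coefficients = {
    'age_group_80+ Years': 3.408,
    'icu_yn_Yes': 2.631,
    'hosp_yn_Yes': 2.457,
    'age_group_20 - 29 Years': -1.701,
}

# exp(coefficient) is the multiplicative change in the odds of the target
odds_ratios = {name: math.exp(weight) for name, weight in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: odds multiplied by {ratio:.2f}")
```

A ratio above 1 means the feature raises the odds of death and a ratio below 1 means it lowers them.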

## Part 3 3.3 Evaluate the model using the hold-out (30% examples) test set
- These results from the test data will be compared with the results from the training data.
- In addition they will be compared to the results from a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset).

In [None]:
# predict the class labels; predict() applies the 0.5 probability threshold internally
multiple_logisticreg_predictions_train = multiple_logisticreg.predict(X_train)

print("\nPredictions with multiple logistic regression: \n")
# reuse y_train's index so the actual and predicted values line up row by row
actual_vs_predicted_multiplelogisticreg = pd.concat([y_train, pd.DataFrame(multiple_logisticreg_predictions_train, columns=['Predicted'], index=y_train.index)], axis=1)
print(actual_vs_predicted_multiplelogisticreg.head(10))

In [None]:
# Some more evaluation metrics.
print("==================== Train Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_train, multiple_logisticreg_predictions_train))
print("Confusion matrix: \n", metrics.confusion_matrix(y_train, multiple_logisticreg_predictions_train))
print("Classification report:\n ", metrics.classification_report(y_train, multiple_logisticreg_predictions_train))
print("======================================================")

##### Interpretation of results
- Accuracy
    - This is simply stating how often the model is correct. We have an accuracy of 96.7%
- Confusion Matrix - A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
    - [0][0] TRUE NEGATIVE - The number that is predicted 0 that was actually 0
    - [0][1] FALSE POSITIVE - The number that is predicted 1 that is actually 0
    - [1][0] FALSE NEGATIVE - The number that is predicted 0 that is actually 1
    - [1][1] TRUE POSITIVE - The number that is predicted 1 that is actually 1
- Precision - How good the model is at predicting the positive class
    - What % of the predicted positive are actually positive
    - It is the number of values correctly predicted positive over the total number of values predicted positive
    - Precision Positive is 0.59
    - Precision Negative is 0.97
- Recall - The precise definition of recall is the number of true positives divided by the number of true positives plus the number of false negatives. 
    - What % of the positive values did we predict
    - Is the number correctly predicted positive over the total number actual positive
    - Recall Positive is 0.31
    - Recall Negative is 0.99
- F1 Score
    - Is the harmonic mean of precision and recall
    - F1 Score Positive is 0.40
    - F1 Score Negative is 0.98
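
The definitions above can be checked by hand from the four cells of a confusion matrix. The sketch below does this for the positive class; the counts are hypothetical and chosen only to make the arithmetic easy to follow, and `classification_scores` is a helper name invented for this example.

```python
def classification_scores(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# hypothetical counts, laid out as in the confusion matrix: [[tn, fp], [fn, tp]]
scores = classification_scores(tn=950, fp=10, fn=30, tp=10)
print(scores)
```

With these counts accuracy is high even though only a quarter of the actual positives are found, which is the same pattern seen in the results above.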

Summary
- These values seem reasonable.
- The model is much better at predicting the negative class, i.e. not death.
- This makes sense as the model has significantly more negative data to learn from - it is the majority class.
- This model achieves better positive recall (Recall Positive is 0.31, Recall Negative is 0.99) than the linear regression model (Recall Positive is 0.16, Recall Negative is 1.00).
- This may be because outliers do not significantly affect the output of this model, and because logistic regression outputs a probability between 0 and 1 that maps naturally onto a binary class, which suits the problem of this assignment better than the unbounded continuous output of a linear regression model.
- This is still not ideal, as a recall value of 0.31 still results in a significant number of false negatives, so the model may underestimate the number of deaths as a result of COVID-19.
- Again, the downside of this is that more people who have a risk factor may die as a result of COVID-19 because the model underestimated their likelihood of dying.
- As a result this would not be an appropriate model to use to determine risk factors of COVID-19 deaths, but it is a slightly better model than the linear regression model.

In [None]:
# predict the class labels; predict() applies the 0.5 probability threshold internally
multiple_logisticreg_predictions_test = multiple_logisticreg.predict(X_test)

print("\nPredictions with multiple logistic regression: \n")
# reuse y_test's index so the actual and predicted values line up row by row
actual_vs_predicted_multiplelogisticreg = pd.concat([y_test, pd.DataFrame(multiple_logisticreg_predictions_test, columns=['Predicted'], index=y_test.index)], axis=1)
print(actual_vs_predicted_multiplelogisticreg.head(10))

In [None]:
# Some more evaluation metrics.
print("==================== Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, multiple_logisticreg_predictions_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, multiple_logisticreg_predictions_test))
print("Classification report:\n ", metrics.classification_report(y_test, multiple_logisticreg_predictions_test))
print("==================== Train Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_train, multiple_logisticreg_predictions_train))
print("Confusion matrix: \n", metrics.confusion_matrix(y_train, multiple_logisticreg_predictions_train))
print("Classification report:\n ", metrics.classification_report(y_train, multiple_logisticreg_predictions_train))
print("======================================================")

**Interpretation of test results and comparison with training results**
- Accuracy
    - The accuracy on the test data is marginally higher: 96.8% vs 96.7%.
- Precision
    - The precision score for predicting the positive case did not change, it was 59% in both cases.
    - The precision score for predicting the negative case has increased from 97% to 98%.
- Recall
    - The recall score for predicting the positive case has increased from 31% to 35%.
    - The recall score for predicting the negative case did not change, it was 99% in both cases.
- F1
    - The f1 score for predicting the positive case has increased from 40% to 44%.
    - The f1 score for predicting the negative case did not change, it was 98% in both cases.

**Summary**
- These values are a little higher than expected as we are now testing the model prediction on data it has not seen before
- This is a good sign that the model is generalising
- More comparisons need to be made and this is where cross validation steps in

## Part 3 3.4 Cross validation
We will first create a function to perform 10-fold cross validation and store the results in a dataframe
- This will be used to simplify further analysis of the dataset, looking at accuracy, precision, recall and f1.

In [None]:
def cross_val_LogReg_DF(X,y):
    """Function to perform cross validation and store results 
    in dataframe. Cross validation looks at accuracy, precision, 
    recall, f1. Returns a dataframe with results"""

    logRegResults = {}
    # metrics to test against
    test_metrics = ['accuracy','precision','recall', 'f1']

    for metric in test_metrics:
        # generate test results
        result = cross_val_score(LogisticRegression(), X, y, scoring=metric, cv=10)
        # store result in dict
        logRegResults[metric] = result.mean()
        
    # create dataframe with results
    LogRegDF = pd.DataFrame.from_dict(logRegResults, orient='index', columns=['Logistic_Regression'])
    
    return LogRegDF

In [None]:
logRegDF = cross_val_LogReg_DF(X,y)
print(f"Mean results from 10 fold cross validation are:")
logRegDF

## Part 4 Random Forest Model

### 4.1 Train a random forest model using only the descriptive features selected from part 1 above 
- A random forest model will now be trained on our training data.
- We use the RandomForestClassifier() function.
- The random state will be set to 1 to allow the results to be repeated

In [None]:
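# Train RF with 100 trees
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' has been removed from newer scikit-learn releases
rfc = RandomForestClassifier(n_estimators=100, max_features='sqrt', oob_score=True, random_state=1)
rfc.fit(X_train, y_train)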

### 4.2 Understanding the Random Forest model

- A random forest is made up of a collection of decision trees, so in order to understand random forests we must first understand decision trees
- Decision tree
    - A decision tree is a set of if-then-else rules that split the data on specific features
    - Each split is chosen based on information value, i.e. how much it reduces the impurity of the resulting groups
    - The tree will split on the feature that gives the highest information value
    - The feature with the highest information value will sit at the root of each decision tree
    - Usually, the higher a feature's information value, the closer to the root it appears.
    - Each decision tree produces a single prediction
- Random Forest 
    - The random forest is made up of decision trees
    - The output of the random forest is based on all the decision trees combined
    - This gives reliable results but can be hard to interpret
   
    
** Got information from lab solution example.
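
To make the idea of "information value" more concrete, the sketch below scores two candidate splits by how much they reduce Gini impurity, which is the default split criterion of sklearn's DecisionTreeClassifier (entropy-based information gain is the alternative). The labels are hypothetical and the helper names `gini` and `impurity_reduction` are invented for this example; it is a simplified illustration, not sklearn's implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_reduction(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# hypothetical target labels (1 = death) split by two candidate features
parent = [0, 0, 0, 0, 0, 0, 1, 1]
split_a = ([0, 0, 0, 0, 0, 0], [1, 1])  # separates the classes perfectly
split_b = ([0, 0, 0, 1], [0, 0, 0, 1])  # leaves both children as mixed as the parent

print(impurity_reduction(parent, *split_a))
print(impurity_reduction(parent, *split_b))
```

The split with the larger reduction is the one the tree would choose, so a feature that produces such a split ends up nearer the root.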

In [None]:
dtc4 = DecisionTreeClassifier(max_depth=4, random_state=1)
dtc10 = DecisionTreeClassifier(max_depth=10, random_state=1)
dtc4.fit(X_train, y_train)
dtc10.fit(X_train, y_train)
print("Max depth 4: \n",dtc4)
print("Max depth 10: \n",dtc10)

In [None]:
import graphviz
from graphviz import Source
# write each tree to a Graphviz .dot file
with open("DecisionTree4.dot", 'w') as f1:
    export_graphviz(dtc4, out_file=f1, feature_names=X_train.columns)
with open("DecisionTree10.dot", 'w') as f2:
    export_graphviz(dtc10, out_file=f2, feature_names=X_train.columns)
# convert the .dot files to png images
!dot -Tpng DecisionTree4.dot -o DecisionTree4.png
!dot -Tpng DecisionTree10.dot -o DecisionTree10.png

In [None]:
importance = pd.DataFrame({'feature': X_train.columns, 'importance':rfc.feature_importances_})
importance.sort_values('importance', ascending=False)

- This shows that age_group 80+ has the highest importance for predicting the risk of death. This may be due to the strong relationship between age group and death, with ~35% of COVID-19 patients aged 80+ dying.

- The second most important feature is hospitalisation (hosp_yn_Yes), due to the significant relationship between patients who were hospitalised and patients who died.

- The features that rank the highest are also in line with what was seen in the coefficients of the logistic regression model, where the largest positive coefficients were for age group 80+ years, hosp_yn_Yes and icu_yn_Yes.

## Part 4 4.4

In [None]:
# Predicted class labels for all examples, 
# using the trained model, on in-sample data (same sample used for training and test)
rfc_predictions_train = rfc.predict(X_train)
df_true_vs_rfc_predicted = pd.DataFrame({'ActualClass': y_train, 'PredictedClass': rfc_predictions_train})
df_true_vs_rfc_predicted.head(10)

In [None]:
print("==================== Train Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_train, rfc_predictions_train))
print("Confusion matrix: \n", metrics.confusion_matrix(y_train, rfc_predictions_train))
print("Classification report:\n ", metrics.classification_report(y_train, rfc_predictions_train))

##### Interpretation of results
- Accuracy
    - This is simply stating how often the model is correct. We have an accuracy of 96.7%
- Confusion Matrix - A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
    - [0][0] TRUE NEGATIVE - The number that is predicted 0 that was actually 0
    - [0][1] FALSE POSITIVE - The number that is predicted 1 that is actually 0
    - [1][0] FALSE NEGATIVE - The number that is predicted 0 that is actually 1
    - [1][1] TRUE POSITIVE - The number that is predicted 1 that is actually 1
- Precision - How good the model is at predicting the positive class
    - What % of the predicted positive are actually positive
    - It is the number of values correctly predicted positive over the total number of values predicted positive
    - Precision Positive is 0.73
    - Precision Negative is 0.97
- Recall - The precise definition of recall is the number of true positives divided by the number of true positives plus the number of false negatives. 
    - What % of the positive values did we predict
    - Is the number correctly predicted positive over the total number actual positive
    - Recall Positive is 0.24
    - Recall Negative is 1.0
- F1 Score
    - Is the harmonic mean of precision and recall
    - F1 Score Positive is 0.36
    - F1 Score Negative is 0.98

Summary
- These values seem reasonable.
- Again, similar to the other models, this model is much better at predicting the negative class, i.e. not death.
- This makes sense as the model has significantly more negative data to learn from - it is the majority class.
- This model achieves better recall values than the linear regression model but worse recall values than the logistic regression model.
- This model is still not ideal, as a recall value of 0.24 still results in a significant number of false negatives, so the model may underestimate the number of deaths as a result of COVID-19.
- Again, the downside of this is that more people who have a risk factor may die as a result of COVID-19 because the model underestimated their likelihood of dying.
- As a result this would not be an appropriate model to use to determine risk factors of COVID-19 deaths, but it is a slightly better model than the linear regression model.

In [None]:
# Predicted class labels for all examples, 
# using the trained model, on the hold-out test data (not seen during training)
rfc_predictions_test = rfc.predict(X_test)
df_true_vs_rfc_predicted_test = pd.DataFrame({'ActualClass': y_test, 'PredictedClass': rfc_predictions_test})
df_true_vs_rfc_predicted_test.head(20)

In [None]:
print("==================== Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, rfc_predictions_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, rfc_predictions_test))
print("Classification report:\n ", metrics.classification_report(y_test, rfc_predictions_test))
print("==================== Train Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_train, rfc_predictions_train))
print("Confusion matrix: \n", metrics.confusion_matrix(y_train, rfc_predictions_train))
print("Classification report:\n ", metrics.classification_report(y_train, rfc_predictions_train))
print("======================================================")

**Interpretation of test results and comparison with training results**
- Accuracy
    - The accuracy of the test data is marginally lower 96.83% vs 97%. 
- Precision
    - The precision score for predicting the negative case did not change, it was 97% in both cases.
    - The precision score for predicting the positive case has decreased in the test case from 73% to 46%.
- Recall
    - The recall score for predicting the positive case has decreased in the test case from 24% to 13%.
    - The recall score for predicting the negative case decreased slightly, from 100% in the train case to 99% in the test case.
- F1
    - The f1 score for predicting the positive case has decreased in the test case from 36% to 20%.
    - The f1 score for predicting the negative case did not change, it was 98% in both cases.

**Summary**
- These values are a little lower than expected, as we are now testing the model's predictions on data it has not seen before
- The drop from train to test is a sign that the model is overfitting the training data and not generalising well
- More comparisons need to be made and this is where cross validation steps in

In [None]:
def cross_val_RandomForest_DF(X,y, depth=None, estimators=100):
    """Function to perform cross validation and store results 
    in dataframe. Cross validation looks at accuracy, precision, 
    recall, f1. Returns a dataframe with results"""

    # store results in dict
    RandomForestResults = {}
    # metrics to test against
    test_metrics = ['accuracy','precision','recall', 'f1']

    for metric in test_metrics:
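        # generate test results ('sqrt' is what 'auto' meant for classifiers; 'auto' has been removed from newer scikit-learn releases)
        result = cross_val_score(RandomForestClassifier(n_estimators=estimators, max_features='sqrt', oob_score=True, random_state=1, max_depth=depth), X, y, scoring=metric, cv=10)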
        # store result in dict
        RandomForestResults[metric] = result.mean()
    
    # create dataframe with results
    RandomForestDF = pd.DataFrame.from_dict(RandomForestResults, orient='index', columns=['Random_Forests'])

    return RandomForestDF

In [None]:
RandomForestDF = cross_val_RandomForest_DF(X,y)
print(f"Mean results from 10 fold cross validation are:")
RandomForestDF

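Out of bag accuracy
- This is the out-of-bag accuracy estimate
- It is an internal estimate computed as the random forest is being constructed: each tree is evaluated on the training rows that were left out of its bootstrap sample.
- It is used as an additional measure and should be in line with the cross validation results
- Because each row is only scored by the subset of trees that did not see it, this is generally expected to be the lowest of the accuracy estimates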
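
The out-of-bag rows exist because each tree is trained on a bootstrap sample, i.e. rows drawn with replacement. A short simulation, using only the standard library, shows roughly how many rows a single tree never sees. `out_of_bag_fraction` is a hypothetical helper for this illustration and is not how sklearn computes `oob_score_`.

```python
import random

def out_of_bag_fraction(n_rows, seed=0):
    """Draw one bootstrap sample and return the fraction of rows left out of it."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}  # sample with replacement
    return (n_rows - len(in_bag)) / n_rows

# with many rows, roughly 1/e (about 37%) of them are out-of-bag for each tree
fraction = out_of_bag_fraction(n_rows=10_000)
print(round(fraction, 3))
```

Those left-out rows act as a small built-in test set for each tree, which is why no separate hold-out split is needed to compute the out-of-bag score.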

In [None]:
# compute the out-of-bag classification accuracy
rfc.oob_score_

This is similar to the cross validation accuracy

In [None]:
ResultsDF = pd.concat([linRegDF, logRegDF, RandomForestDF], axis=1)
ResultsDF

## Part 5 Improving Predictive Models.

### 5.1 Which model of the ones trained above performs better at predicting the target feature? 



In [None]:
# First test a simple baseline to see whether the models above do better than always predicting the majority class
good_count = y_test[y_test == 1].count()
bad_count = y_test[y_test == 0].count()
total = good_count+ bad_count
print(f'From test dataset: \t\tCount {total}')
print(f'Total number deaths:\t {good_count}\t{round(good_count/len(y_test)*100,2)}%')
print(f'Total number non-deaths:\t {bad_count}\t{round(bad_count/len(y_test)*100,2)}%')

- This shows the majority class is non-deaths
- We will now generate scores for a simple model that always predicts the majority class

In [None]:
# generate an array of zeros, one per row of the test dataset (2860 rows)
majority = np.zeros(len(y_test))
# create dataframe 
df_majorityClass = pd.DataFrame(majority, columns=['prediction'])
# calculate scores of simple prediction vs actual
print("Accuracy: ", metrics.accuracy_score(y_test, df_majorityClass))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, df_majorityClass))
print("Classification report:\n ", metrics.classification_report(y_test, df_majorityClass))

- For predicting the positive class (death) the simple model fails completely, as expected, because it never predicts a death
- For predicting the negative class (non-death) the simple model fares much better but still worse than all the other models outlined above
    - The precision is the same as the accuracy in this case - i.e. how good is it at predicting 0
    - The recall is 100% as all of the actual 0 values were predicted 0
    - The f1 is between the precision and recall as it is the harmonic mean of the two

The best model at predicting the target feature is the logistic regression model, as it has the highest accuracy, precision and recall values in comparison to the linear regression and random forest models. It is also better than a simple model that always predicts the majority class. Its positive precision is 59%, meaning 59% of the cases it predicts as deaths are actual deaths, and its negative precision is 97%, meaning 97% of the cases it predicts as non-deaths are actual non-deaths. The simple model, by contrast, never predicts a death, so it identifies none of the actual deaths: every true positive is classified as negative because it always predicts the majority class, which is death no. That would be a dangerous and useless model to use on such critical information.


## Part 5 **5.2)** 

However, with the logistic regression model the recall value is still too low to be able to use this model on real-world data, especially when dealing with data that is critical to the life of a person. As previously stated, this model is much better at predicting the negative class, i.e. not death. This makes sense as the model has significantly more negative data to learn from - it is the majority class. The problem is that this does not suit the need to determine the risk of death in COVID-19 positive patients. This represents an imbalanced classification problem: we have two classes we need to identify (death and not death), with one category, not death, representing the overwhelming majority of the data points. This is not ideal as the model may underestimate the number of deaths as a result of COVID-19.

- One way to try and fix this imbalance is to oversample the data to balance the classes. This can be done with the Synthetic Minority Oversampling Technique (SMOTE). This approach addresses imbalanced datasets by oversampling the minority class, i.e. death yes. The simplest approach involves duplicating examples in the minority class, although these examples don't add any new information to the model. Instead, new examples can be synthesized from the existing examples.

- When trying to improve a model, one first needs to understand the problem. In this case, we want a model with as few false negatives as possible. That is, we don't want a model that predicts that many people will survive COVID-19 when in reality they will die due to their risk factors. So for this model we want a high recall value. At present the model underestimates the risk of death, which we don't want for our problem.

- Maximizing precision will minimize the number of false positives, whereas maximizing the recall will minimize the number of false negatives.

    - Precision: Appropriate when minimizing false positives is the focus.
    - Recall: Appropriate when minimizing false negatives is the focus.

- This oversampling approach will hopefully improve the recall value. 
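
The core step of synthesizing a new minority example can be sketched as follows: take a minority-class row and one of its minority-class neighbours, and pick a random point on the line between them. This is a simplified illustration that skips the nearest-neighbour search; it is not imblearn's implementation, and `synthesize` and the two example rows are hypothetical.

```python
import random

def synthesize(sample, neighbour, rng):
    """Create one synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [a + gap * (b - a) for a, b in zip(sample, neighbour)]

rng = random.Random(101)
# two hypothetical minority-class rows with two numeric features each
minority_a = [1.0, 4.0]
minority_b = [3.0, 8.0]

new_point = synthesize(minority_a, minority_b, rng)
print(new_point)
```

Because the new point is interpolated rather than copied, it adds variety that plain duplication does not. Note that for one-hot encoded features, interpolation can produce values between 0 and 1 rather than exactly 0 or 1.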


### Use Oversampling to improve the model:

In [None]:
import imblearn
print(imblearn.__version__)

In [None]:
#get values of death_yn
deaths = df['death_yn']
deaths.value_counts()



In [None]:
#plot number of yes and no deaths count to show the imbalance in data.
ax = sns.countplot(x="death_yn", data=df)

It is evident that the dataset is imbalanced, as the class death_yn 'no' or '0' has significantly more values than death_yn 'yes' or '1'. This means a model trained on it will be heavily skewed in favour of the majority class.

In [None]:
from imblearn.over_sampling import SMOTE

#Create an oversampled training data
smote = SMOTE(random_state = 101)
X_oversample, y_oversample = smote.fit_resample(X_train, y_train)

In [None]:
#Training with oversampled data
# Importing the splitter, classification model, and the metric
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
classifier_o = LogisticRegression()
classifier_o.fit(X_oversample, y_oversample)
print("==================== Balanced Train Data =======================")
print(metrics.classification_report(y_train, classifier_o.predict(X_train)))
print("Accuracy: ", metrics.accuracy_score(y_train, classifier_o.predict(X_train)))
print("Confusion matrix: \n", metrics.confusion_matrix(y_train, classifier_o.predict(X_train)))

# Some more evaluation metrics.
print("==================== Imbalanced Train Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_train, multiple_logisticreg_predictions_train))
print("Confusion matrix: \n", metrics.confusion_matrix(y_train, multiple_logisticreg_predictions_train))
print("Classification report:\n ", metrics.classification_report(y_train, multiple_logisticreg_predictions_train))
print("======================================================")

- It is evident that the positive recall value has improved dramatically, from 31% with the imbalanced training data to 93% with the oversampled balanced data. A higher recall means the model identifies more of the true positives, which reduces the number of false negatives.

- The negative recall value is slightly reduced with the balanced oversampled data: 91% in comparison to 99% with the imbalanced train set.

- The positive precision value has also reduced significantly with the balanced train set: 28% in comparison to 59% with the imbalanced train set. Precision would be expected to decrease because the model now makes more positive predictions, and since our data is so heavily skewed towards a negative outcome, many of them are false positives.

- In imbalanced datasets, the goal is to improve recall without hurting precision. These goals, however, are often conflicting, since in order to increase the True Positives for the minority class, the number of False Positives is also often increased, resulting in reduced precision.

- However, in this case we are more interested in improving recall because a model that overestimates risk of death is better than underestimating risk of death.

source: https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
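
The same conflict between precision and recall can be seen with a different lever from oversampling: moving the classification threshold, which was mentioned earlier as adjustable. The sketch below uses hypothetical probabilities and labels, and the helper `precision_recall_at` is invented for this example; it is an illustration of the trade-off, not a result from this dataset.

```python
def precision_recall_at(threshold, probabilities, actual):
    """Precision and recall for the positive class at a given probability threshold."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# hypothetical predicted probabilities and true labels (1 = death)
probabilities = [0.05, 0.10, 0.30, 0.35, 0.40, 0.45, 0.70, 0.90]
actual        = [0,    0,    0,    1,    0,    1,    1,    1]

for threshold in (0.5, 0.25):
    precision, recall = precision_recall_at(threshold, probabilities, actual)
    print(threshold, round(precision, 2), round(recall, 2))
```

Lowering the threshold catches more of the actual positives (higher recall) but also lets in more false positives (lower precision).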


- It is evident from the tables above that oversampling the data dramatically improved the positive recall at the cost of precision, and also improved the f1 score. F-measure provides a way to combine both precision and recall into a single measure that captures both properties.

- Alone, neither precision nor recall tells the whole story. In this case the f1 score has increased from 41% to 44%.

- Accuracy has reduced from 97% to 91% with the oversampled train set. This is to be expected: the model trained on oversampled data predicts the positive class more often, which produces more true positives but also more false positives, and because the negative class is the majority those false positives reduce the overall accuracy.

- Overall, this model implemented with SMOTE could be used on real-life data as recall is 95%, meaning the model correctly flags the risk of death in 95% of the COVID-19 positive cases that actually died. I would be confident in using this model on real-world data to predict the risk of death in COVID-19 positive patients.

source: https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
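
As a quick check on the F-measure figures, the sketch below computes F1 as the harmonic mean of precision and recall, using the rounded positive-class values quoted in the bullets above. The helper name `f1_from` is an assumption for illustration. Because the inputs are already rounded, the results can differ by about a percentage point from the classification report.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded positive-class values quoted in the text above.
f1_imbalanced = f1_from(precision=0.59, recall=0.31)
f1_balanced = f1_from(precision=0.28, recall=0.93)

print(f"imbalanced f1: {f1_imbalanced:.2f}")
print(f"balanced f1:   {f1_balanced:.2f}")
```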

## Check to see how the model works on test data that the model hasn't seen

In [None]:
# Train a logistic regression on the oversampled data, then evaluate it on the test set
# Importing the splitter, classification model, and the metric
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
classifier_o = LogisticRegression()
classifier_o.fit(X_oversample_test, y_oversample_test)
# Predict once and reuse the predictions for every metric
oversampled_predictions_test = classifier_o.predict(X_test)
print("==================== Balanced Test Data =======================")
print(metrics.classification_report(y_test, oversampled_predictions_test))
print("Accuracy: ", metrics.accuracy_score(y_test, oversampled_predictions_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, oversampled_predictions_test))

# Some more evaluation metrics.
print("==================== Imbalanced Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, multiple_logisticreg_predictions_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, multiple_logisticreg_predictions_test))
print("Classification report:\n ", metrics.classification_report(y_test, multiple_logisticreg_predictions_test))
print("======================================================")

- The improved model using SMOTE shows that most values have significantly improved compared with the model trained on imbalanced data, when both are evaluated on the test data.

- Accuracy has reduced from 97% to 93%.
- Precision has increased from 62% to 93%.
- Recall has increased from 33% to 93%.
- The f1 score has increased from 42% to 93%.

This suggests that the model can be used on data it has not seen before, with accuracy and recall both at 93%.
- The recall value on the test data is slightly lower than on the train data. This is to be expected, as the model was trained on the training data and has not seen the test data.
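
For the test scores to reflect unseen data, resampling should touch only the training split, with the test split left at its original class balance. The sketch below illustrates that pattern on toy data. It uses plain random oversampling (duplicating minority rows) as a stand-in for SMOTE, which instead synthesises new minority points by interpolation. The function name and the toy arrays are assumptions, not objects from this notebook.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until both classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

# Toy data (assumption): 10 rows, 8 negatives and 2 positives.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

# Fixed split for illustration: 7 training rows, 3 test rows.
train_idx = np.array([0, 1, 2, 3, 4, 5, 8])
test_idx = np.array([6, 7, 9])
X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]
X_te, y_te = X_toy[test_idx], y_toy[test_idx]

# Only the training split is rebalanced; the test split is untouched.
X_bal, y_bal = random_oversample(X_tr, y_tr)
print("train class counts after oversampling:", np.bincount(y_bal))
print("test class counts (unchanged):", np.bincount(y_te))
```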

## Part 5  **(5.3)**

### Test other csv provided using the trained Logistic Regression Model



In [None]:
# Read the csv file into a dataframe. It was cleaned beforehand using the homework1 Jupyter notebook, which is why it has a new name.
# Note: sep and delimiter are aliases in read_csv, so only one of them is passed.
df_new = pd.read_csv('covid19-cdc-13336431-cleaned_data_Final_testCase.csv', keep_default_na=True, delimiter=',', skipinitialspace=True)
df_new.head(10)

In [None]:
df_new.shape

In [None]:
df_new.dtypes

#### Convert datatypes for plotting later
We will now review the datatypes and convert them if needed. This will help avoid plotting errors later in the notebook.
- The target feature "death_yn" is of type object, with values "Yes" and "No". These will be mapped to 1 and 0 respectively and stored as "int64"
- Categorical features will stay as category datatypes
- Some continuous features are of datetime64 type
  

#### Setup Column types
We will now set up the continuous, categorical and target features

##### Select all categorical columns and convert to categorical type
 - This will be needed later when it will be required to convert categorical features into dummy features for modeling

In [None]:
# convert the target feature death_yn to 0,1
df_new['death_yn'] = df_new['death_yn'].map({'Yes': 1, "No": 0})

In [None]:
#Convert date time features to appropriate date time data types
df_new['cdc_case_earliest_dt'] = df_new['cdc_case_earliest_dt'].astype('datetime64[ns]')
df_new.dtypes

#Select all columns of type 'object'
object_columns = df_new.select_dtypes(['object']).columns
object_columns
#Convert selected columns to type 'category'
for column in object_columns:
    df_new[column] = df_new[column].astype('category')
df_new.dtypes 
continous_columns = df_new.select_dtypes(['datetime64[ns]']).columns

##### Finally set the target feature "death_yn" to int64 
- death_yn is a categorical feature but it is also the target feature
- To allow continuous features to plot against the target, it will need to be int64 type
- We will also remove it from the list of categorical columns
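
One caveat with this step: `Series.map` returns NaN for any label that is not in the mapping, and casting a column containing NaN to `int64` raises an error. The sketch below shows one possible check before casting. The toy frame and the decision to drop unmapped rows are assumptions for illustration; in this notebook the cleaned csv is expected to contain only "Yes" and "No".

```python
import pandas as pd

# Toy data (assumption): one unexpected label alongside "Yes"/"No".
toy = pd.DataFrame({"death_yn": ["Yes", "No", "No", "Missing"]})

mapped = toy["death_yn"].map({"Yes": 1, "No": 0})

# Labels outside the mapping become NaN, so count them before casting.
n_unmapped = int(mapped.isna().sum())
print("unmapped rows:", n_unmapped)

# Keep only rows with a valid target, then cast safely to int64.
clean = toy.loc[mapped.notna()].copy()
clean["death_yn"] = mapped[mapped.notna()].astype("int64")
print(clean.dtypes)
```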

In [None]:
df_new['death_yn'] = df_new['death_yn'].astype('int64')

In [None]:
df_new.dtypes

In [None]:
y = pd.DataFrame(df_new["death_yn"])
X = df_new.drop(["death_yn"], axis=1)

In [None]:
low_gain_features = ['current_status', 'sex', 'Race', 'Ethnicity']
low_correlation_features = ['cdc_case_earliest_dt']
# drop all low value features
low_value_features = list(set(low_gain_features + low_correlation_features))
print(low_value_features)

In [None]:
# drop all low value features
# before dropping make copy of original
df_rev1_new = df_new.copy()
# drop low value features
df_rev1_new.drop(low_value_features, axis=1, inplace=True)
print('\nRemaining columns:', df_rev1_new.columns)
print('\nNew shape:', df_rev1_new.shape)

## Prepare dataset for modeling
Now that we have picked our descriptive features for the whole dataset, a number of additional steps need to be taken to prepare the dataset for modeling.
- We will now convert the categorical variables into dummy variables to allow modeling
- We will then set up the train test split again based on the dataset with the dummies included
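
When a model trained on one file is applied to another, the dummy columns must match those seen during training. If a category is absent from the new file, `get_dummies` with `drop_first=True` can silently drop a different baseline level. One possible safeguard, sketched below on toy data, is to fix the category levels from the training data before encoding. The helper `encode_like_training` and the toy frames are hypothetical, not part of this notebook.

```python
import pandas as pd

def encode_like_training(df, categories):
    """One-hot encode df using fixed category levels so columns match training."""
    df = df.copy()
    for col, levels in categories.items():
        df[col] = pd.Categorical(df[col], categories=levels)
    return pd.get_dummies(df, columns=list(categories), drop_first=True)

# Toy training and new data (assumption); the new data lacks some age groups.
train = pd.DataFrame({"age_group": ["0-9", "80+", "40-49"],
                      "hosp_yn": ["Yes", "No", "No"]})
new = pd.DataFrame({"age_group": ["80+", "80+"],
                    "hosp_yn": ["No", "Yes"]})

# Category levels are taken from the training data only.
levels = {col: sorted(train[col].unique()) for col in train.columns}

train_X = encode_like_training(train, levels)
new_X = encode_like_training(new, levels)

print(list(train_X.columns))
print(list(new_X.columns))
```

With the levels fixed, both frames share the same columns in the same order, so the new frame can be passed directly to a model fitted on the training frame.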

In [None]:
df_rev1_new = pd.get_dummies(df_rev1_new, columns=['age_group','hosp_yn','icu_yn','medcond_yn'], drop_first=True)

In [None]:
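# get_dummies returns uint8 columns in older pandas versions and bool columns in pandas 2.x
categorical_columns = df_rev1_new.select_dtypes(include=['uint8', 'bool']).columns.tolist()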


categorical_columns

In [None]:
# y is the target
y = df_rev1_new["death_yn"]


# X is everything else
X = df_rev1_new.drop(["death_yn"], axis=1)

In [None]:
# Some more evaluation metrics.
print("==================== Provided New CSV Imbalanced Data when SMOTE is not applied =======================")
print("Accuracy: ", metrics.accuracy_score(y, multiple_logisticreg.predict(X)))
print("Confusion matrix: \n", metrics.confusion_matrix(y, multiple_logisticreg.predict(X)))
print("Classification report:\n ", metrics.classification_report(y, multiple_logisticreg.predict(X)))

print("==================== Provided New CSV Balanced Data when SMOTE is applied=======================")
print(metrics.classification_report(y, classifier_o.predict(X)))
print("Accuracy: ", metrics.accuracy_score(y, classifier_o.predict(X)))
print("Confusion matrix: \n", metrics.confusion_matrix(y, classifier_o.predict(X)))

print("==================== Balanced Test Data on csv from Homework 1 =======================")
print(metrics.classification_report(y_test, classifier_o.predict(X_test)))
print("Accuracy: ", metrics.accuracy_score(y_test, classifier_o.predict(X_test)))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, classifier_o.predict(X_test)))


## Summary of Findings:

- Applying the new csv to the trained model that used SMOTE gives results similar to those obtained when the test data from the cleaned homework1 csv was used to predict death in COVID-19 positive patients.

- The positive precision value is low, as is expected when the model produces a larger number of false positives.
- The recall value is 64%, which is not as high as the 84% seen in the test data. This may be because the test data is only 30% of the original dataset, whereas the whole of the new csv was passed to the model.
- Accuracy is similar, at 93%, for both the homework1 csv test data and the provided new csv data.

Conclusion:

- Three predictive models have been evaluated in this notebook, with all models performing very similarly.
- The logistic and random forest models perform within +/-10% of each other, and both could be used.
- The logistic model shows better accuracy and recall than the other two models.
- 64% recall on new data using the best trained logistic regression model is not a bad result. However, it is not high enough for me to believe that the model could be used on real world data with confidence, because the output is of critical importance to the lives of COVID-19 positive patients.

Recommendations:

- As a result, I believe more data is needed to train the model, in particular significantly more records where death_yn is yes. The limited number of minority-class examples makes it difficult to predict the risk of death confidently.
- More stratified information on certain features, such as exact medical conditions, may also give more insight into the risk factors associated with death.


