## Introduction 

- This dataset contains 13 columns (12 features and 1 target) and 299 observations.
- It comes from a study that examined the effect of two specific features on a model developed to predict heart failure.
***
- In my project, I will work to *confirm the researchers' claim that serum creatinine and ejection fraction alone do a better job in model development and accuracy than all of the features combined.* I will do this with Python-based models, which differ from the techniques used by the researchers. I will refer to these two features as the "ideal features."
    - If the claim holds, my results will fortify the researchers' argument; if I observe a significant drop in accuracy, they will counter it. In doing so I will also be testing the credibility of the Python-based regression and classification models.
***
- Additionally, throughout the study the researchers split the features into Numerical and Categorical groups. I took a different route and split the dataset into **Invasive and Non-Invasive** features.
- To elaborate:
    - *'Invasive'* would be any feature of the dataset that required some form of invasive procedure on the patient to be retrieved, such as blood tests, sample testing, or image scans.
    - *'Non-invasive'* would be something as simple as a Yes or No question or a blood pressure reading.
***
- In total, I will be working with three main datasets/models.

# In this Notebook (NB1):

In this notebook we will be working **backwards** and starting on the Invasive and Non Invasive models.
***
- Interestingly, most of the features I consider to be 'Invasive' were classified as 'Numerical' by the researchers. However, I will be adding the categorical feature _Anaemia_ because its classification in the dataset was done by measuring haematocrit levels in the blood, most likely from a blood test.
- Similarly, most of the features I consider to be 'Non-Invasive' were classified as 'Categorical' by the researchers. However, I will be adding the numerical features _Age and Time_ since they are numeric values that are not retrieved by an invasive procedure.
***
**I am going against the researchers' findings: I hypothesize that the Non-Invasive models will perform better than the Invasive models, because patterns are easier to identify in non-continuous or non-numeric values.**

### A. Loading Data

In [3]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import sklearn.preprocessing as preprocessing
from sklearn import datasets, linear_model, svm
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
# plot_roc_curve was removed in scikit-learn 1.2 and is not used in this notebook.
from sklearn.metrics import auc, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

warnings.filterwarnings('ignore')  # silence library warnings in the cell outputs

In [4]:
dataset1=pd.read_csv('heart_failure_dataset.csv') #Reading dataset

In [5]:
df = dataset1.copy() #Make a copy to work with

In [6]:
df.dtypes #Checking the types of variables for each feature: 3 floats and 10 ints.

age                         float64
anaemia                       int64
creatinine_phosphokinase      int64
diabetes                      int64
ejection_fraction             int64
high_blood_pressure           int64
platelets                   float64
serum_creatinine            float64
serum_sodium                  int64
sex                           int64
smoking                       int64
time                          int64
DEATH_EVENT                   int64
dtype: object

In [7]:
df.describe() #Checking to see if all of the features are present here before moving on.

| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 |
| mean | 60.833893 | 0.431438 | 581.839465 | 0.41806 | 38.083612 | 0.351171 | 263358.029264 | 1.39388 | 136.625418 | 0.648829 | 0.32107 | 130.26087 | 0.32107 |
| std | 11.894809 | 0.496107 | 970.287881 | 0.494067 | 11.834841 | 0.478136 | 97804.236869 | 1.03451 | 4.412477 | 0.478136 | 0.46767 | 77.614208 | 0.46767 |
| min | 40.0 | 0.0 | 23.0 | 0.0 | 14.0 | 0.0 | 25100.0 | 0.5 | 113.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 25% | 51.0 | 0.0 | 116.5 | 0.0 | 30.0 | 0.0 | 212500.0 | 0.9 | 134.0 | 0.0 | 0.0 | 73.0 | 0.0 |
| 50% | 60.0 | 0.0 | 250.0 | 0.0 | 38.0 | 0.0 | 262000.0 | 1.1 | 137.0 | 1.0 | 0.0 | 115.0 | 0.0 |
| 75% | 70.0 | 1.0 | 582.0 | 1.0 | 45.0 | 1.0 | 303500.0 | 1.4 | 140.0 | 1.0 | 1.0 | 203.0 | 1.0 |
| max | 95.0 | 1.0 | 7861.0 | 1.0 | 80.0 | 1.0 | 850000.0 | 9.4 | 148.0 | 1.0 | 1.0 | 285.0 | 1.0 |


### B. Splitting Data

- List of the features in the **Invasive** dataset would be: Anaemia, Creatinine Phosphokinase (CPK), Ejection Fraction, Platelets, Serum Creatinine, Serum Sodium.
- List of the features in the **Non-Invasive** dataset would be: Age, High BP, Diabetes, Sex, Smoking, Time.
- The Target Feature is: Death Event.
- I will be using a 70-30 train-test split. The model has a lot of features to train with, so I want a relatively large testing set.
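
The feature groups and the split described above can be sketched as follows. This is only an illustration on a synthetic stand-in frame, not the real CSV: the names `INVASIVE`, `NON_INVASIVE`, `TARGET`, and `toy_df` are hypothetical, while the column names are the ones in the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Column groups as listed above (column names taken from the dataset).
INVASIVE = ['anaemia', 'creatinine_phosphokinase', 'ejection_fraction',
            'platelets', 'serum_creatinine', 'serum_sodium']
NON_INVASIVE = ['age', 'high_blood_pressure', 'diabetes', 'sex', 'smoking', 'time']
TARGET = 'DEATH_EVENT'

# Synthetic stand-in with the same shape as the real data (299 rows, 13 columns).
rng = np.random.default_rng(55)
toy_df = pd.DataFrame(rng.random((299, 13)),
                      columns=INVASIVE + NON_INVASIVE + [TARGET])

X_train, X_test, y_train, y_test = train_test_split(
    toy_df[INVASIVE], toy_df[TARGET], test_size=0.30, random_state=55)

# With 299 rows, test_size=0.30 rounds the test set up to 90 rows.
print(len(X_train), len(X_test))
```

With 299 observations the split leaves 209 rows for training and 90 for testing, which is why the accuracy scores later in the notebook are multiples of 1/90.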

### 1. Invasive:

**Interesting fact:** In this particular situation the Invasive dataset/model could also answer the research question: *How effective is a patient's blood work when it comes to developing a prediction model for heart failure?* This is because nearly all of the features here are collected through blood work; ejection fraction, which is measured by imaging, is the exception.

In [74]:
#Feature Selection: All of the invasive features present in the invasive dataset.
invasive_df = df[['anaemia', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium']].copy()
inv_target = df[['DEATH_EVENT']].copy()

In [75]:
#Splitting Data
inv_train, inv_test, inv_target_train, inv_target_test = train_test_split(invasive_df, inv_target, test_size=0.30, random_state=55)

### C. Model Training

- For classification purposes, I wanted to run all three types of kernels for this dataset. However, because "implementations explicitly store this as an NxN matrix," the non-linear kernels add runtime and complexity. Note that N here is the number of training observations (209 after the split), not the number of features, so the kernel matrix would be 209 x 209 rather than 6 x 6. To keep things simple, I will not be using the polynomial and RBF kernels for classification.

- Source/Online Explanation: https://ai.stackexchange.com/questions/7202/why-does-training-an-svm-take-so-long-how-can-i-speed-it-up
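
A quick check of what the "NxN matrix" in the quote refers to. This is a minimal sketch on random stand-in data with the same shape as the Invasive training set; `X_train_toy` is a hypothetical name and the `gamma` value is arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_train_toy = rng.normal(size=(209, 6))  # 209 observations, 6 features

# The kernel (Gram) matrix holds one entry per pair of observations.
K = rbf_kernel(X_train_toy, gamma=0.5)
print(K.shape)
```

The matrix grows with the number of observations, not with the number of features, so adding or removing columns does not change its size.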

###### Classification

In [77]:
#Linear Kernel SVM- Classification (gamma is ignored by the linear kernel)
svc1 = svm.SVC(kernel='linear', C=1,gamma= 0.5)
svc1.fit(inv_train, np.ravel(inv_target_train.values)) #flatten the single-column target to 1-D

SVC(C=1, gamma=0.5, kernel='linear')

In [78]:
#Accuracy
print(f'Accuracy Score for the Linear Kernel SVM (svc1) is {svc1.score(inv_test, inv_target_test):f}')

Accuracy Score for the Linear Kernel SVM (svc1) is 0.700000


In [79]:
#Logistic Regression- Classification
LR1 =  LogisticRegressionCV(random_state=0, solver='lbfgs',cv=5)
LR1 = LR1.fit(inv_train,np.ravel(inv_target_train.values)) 

In [80]:
#Accuracy
print(f'Accuracy Score of the LR model for the Invasive Features is {LR1.score(inv_test,inv_target_test):f}')

Accuracy Score of the LR model for the Invasive Features is 0.722222


###### Regression

In [93]:
#Linear Kernel SVM- Regression
svr_lin = SVR(kernel='linear', C=100, gamma='auto')

In [94]:
inv_FOlin = svr_lin.fit(inv_train, inv_target_train).predict(inv_test)

In [95]:
inv_FOlin #Really Bad Predictions.

array([ 254759.65827405, -278869.2206322 ,  -66321.98235095,
       -844878.64055408, -477196.7909447 ,  157555.24421155,
       -133124.71672595,  272003.34675061, -421158.08830798,
       -389787.97160877, -402459.33391345,  -69770.62297595,
       -349772.6112572 , -161526.33391345,  -82218.60637439,
       -325430.34953845, -414731.02824939, -235510.5331322 ,
       -216455.13078845, -473682.6815697 , -495593.95989002,
        -19041.69328845, -318639.49797595, -362856.40422595,
       -402438.59953845,  -57358.70110095,  -37426.15324939,
       -213874.28508533, -195090.46672595, -628833.24016345,
       -246218.48235095,  -10136.19328845, -279670.44328845,
        225273.88483655,  -15352.05266345, -156243.64641345,
       -145777.32512439,  159177.11139905, -284688.34660877,
        227708.13483655,  115310.48639905, -375292.9393822 ,
       -230028.71672595,   44300.8887428 , -352774.91252673,
        -43219.61516345, -463481.56047595, -301013.0878197 ,
       -151111.6034447 ,

In [96]:
# Running some analysis on the Linear Kernel SVM. coef_ is only defined for the linear kernel, so the other 2 kernels are not compatible with this.
print("SVR- Linear Kernel")
print('Coefficients for the Model is: {}'.format(svr_lin.coef_))
print("Mean squared error for the Model is: %.2f"% mean_squared_error(inv_target_test, inv_FOlin))
print('R-square for the Model is: %.2f \n' % r2_score(inv_target_test, inv_FOlin))

#These are very bad values of MSE and R-squared. They cannot be compared with other kernel types because those are not compatible.

SVR- Linear Kernel
Coefficients for the Model is: [[-7.93127055e+01  7.39486637e+01 -1.44842034e+04  2.34545395e+00
  -1.18377611e+03  3.54836370e+03]]
Mean squared error for the Model is: 90114056708.17
R-square for the Model is: -438656165465.43 



# (*)
- Because the analysis resulted in bad values, I will scale the predicted values to see if that improves the scores.
- I will try two scalers, MinMaxScaler and StandardScaler, and compare which one gives the better scores.
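
A different option, which is not what this notebook does, is to scale the input features before fitting instead of scaling the predictions afterwards. The sketch below assumes synthetic stand-in data with columns on very different scales; the variable names and the choice of `C=1.0` are illustrative only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in features on very different scales (not the real data).
X = np.column_stack([
    rng.integers(0, 2, n),            # binary flag
    rng.normal(250_000, 90_000, n),   # platelets-like magnitude
    rng.normal(38, 12, n),            # ejection-fraction-like magnitude
])
y = (X[:, 2] + rng.normal(0, 10, n) < 35).astype(float)  # toy 0/1 target

# The scaler is fitted on the training rows only, then applied before the SVR.
model = make_pipeline(StandardScaler(), SVR(kernel='linear', C=1.0))
model.fit(X[:150], y[:150])
scaled_preds = model.predict(X[150:])
print(scaled_preds.min(), scaled_preds.max())
```

Putting the scaler inside the pipeline keeps the test rows out of the scaling statistics, and the predictions stay on roughly the same scale as the 0/1 target.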

In [122]:
inv_FOlin1 = inv_FOlin.reshape(-1, 1) #Reshaping necessary for scaling

In [136]:
# MinMax Scaler- values scaled first and then analyzed.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
inv_mmscaled= scaler.fit_transform(inv_FOlin1)
print("SVR- Linear Kernel")
print('Coefficients for the Model is: {}'.format(svr_lin.coef_)) #
print("Mean squared error for the Model is: %.2f"% mean_squared_error(inv_target_test, inv_mmscaled))
print('R-square for the Model is: %.2f \n' % r2_score(inv_target_test, inv_mmscaled))

SVR- Linear Kernel
Coefficients for the Model is: [[-7.93127055e+01  7.39486637e+01 -1.44842034e+04  2.34545395e+00
  -1.18377611e+03  3.54836370e+03]]
Mean squared error for the Model is: 0.28
R-square for the Model is: -0.38 



In [135]:
# Standard Scaler- values scaled first and then analyzed.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
inv_sscaled= scaler.fit_transform(inv_FOlin1)
print("SVR- Linear Kernel- StandardScaled")
print('Coefficients for the Model is: {}'.format(svr_lin.coef_)) #
print("Mean squared error for the Model is: %.2f"% mean_squared_error(inv_target_test, inv_sscaled))
print('R-square for the Model is: %.2f \n' % r2_score(inv_target_test, inv_sscaled))

SVR- Linear Kernel- StandardScaled
Coefficients for the Model is: [[-7.93127055e+01  7.39486637e+01 -1.44842034e+04  2.34545395e+00
  -1.18377611e+03  3.54836370e+03]]
Mean squared error for the Model is: 1.15
R-square for the Model is: -4.58 



- Comparing the analysis after the two scaling methods, MinMax scaling clearly gives the better scores. However, this does not mean the model is performing well: MinMax scaling only maps the predictions into [0, 1], the same range as the target, and leaves the fitted model unchanged.
- SVR fits a continuous output. In this case the target variable is binary rather than continuous, while the predictions and their scaled values are continuous, hence the bad model performance.
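
To see what scaling the predictions does and does not change, here is a small sketch. It uses the first five SVR predictions from the output above, rounded; `raw_preds` and `scaled` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# First five SVR predictions from the output above, rounded to two decimals.
raw_preds = np.array([254759.66, -278869.22, -66321.98, -844878.64, -477196.79])

scaled = MinMaxScaler().fit_transform(raw_preds.reshape(-1, 1)).ravel()

# The values now sit in [0, 1] ...
print(scaled.min(), scaled.max())
# ... but their ordering, and therefore the model's ranking of patients, is unchanged.
print(np.array_equal(np.argsort(raw_preds), np.argsort(scaled)))
```

The error metrics shrink because the numbers are squeezed into the target's range, not because the model has learned anything new.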

In [25]:
#Linear Regression
LinR1 = linear_model.LinearRegression() 
LinR1.fit(inv_train, inv_target_train) #Fitting it to the training data
LinR1

LinearRegression()

In [26]:
pred1=LinR1.predict(inv_test) #prediction function
pred1

array([ 0.22336123,  0.22752947,  0.36043579,  0.24218315,  0.15221861,
        0.92702138,  0.5278146 ,  0.43143275,  0.25317659, -0.09036654,
        0.08122538,  0.31071505, -0.03957541,  0.171856  ,  0.31912229,
        0.24153479,  0.0368415 ,  0.17014603,  0.26229329,  0.35670341,
        0.17393847,  0.14671524,  0.11522316,  0.25695542,  0.12253053,
        0.10986305,  0.29238432,  0.5342844 ,  0.17769649,  0.00417002,
        0.16937825, -0.02428376,  0.35471903,  0.5831268 ,  0.31377355,
        0.36351239,  0.29578434,  0.18617424,  0.16478214,  0.1759465 ,
        0.1535532 ,  0.11015841,  0.28030532,  0.38737674,  0.25109581,
        0.1704136 ,  0.00691298,  0.24202431,  0.22194949,  0.18806067,
       -0.02368234,  0.38950835,  0.30286833,  0.38631327,  0.57887421,
       -0.01913975,  0.10461903,  0.32198739,  0.288632  ,  0.98966014,
        0.38066864,  0.3268916 ,  0.38536414,  0.43260549,  0.30275364,
        0.22734558,  0.03791466,  0.26977061,  0.34926562,  0.07

In [27]:
print('Coefficients for the Model is: {}'.format(LinR1.coef_)) #
print("Mean squared error for the Model is: %.2f"% mean_squared_error(inv_target_test, pred1))
print('R-square for the Model is: %.2f \n' % r2_score(inv_target_test, pred1))

Coefficients for the Model is: [ 8.37446162e-02  5.10746238e-05 -1.08310177e-02 -1.23719082e-07
  1.30583778e-01 -1.11458547e-02]
Mean squared error for the Model is: 0.19
R-square for the Model is: 0.06 



In [28]:
#Note: for regressors, .score() returns R-squared, not classification accuracy.
print(f'Accuracy Score of the Linear Regression is {LinR1.score(inv_test, inv_target_test):f}')

Accuracy Score of the Linear Regression is 0.061128


In [29]:
#Ridge Regression
RidR1=linear_model.RidgeCV()

In [30]:
RidR1.fit(inv_train,inv_target_train) #fitting

RidgeCV(alphas=array([ 0.1,  1. , 10. ]))

In [31]:
R_pred1=RidR1.predict(inv_test)
R_pred1

array([ 2.29545842e-01,  2.23643233e-01,  3.67092532e-01,  2.36637964e-01,
        1.61791399e-01,  9.14124039e-01,  5.20100093e-01,  4.36542845e-01,
        2.58294368e-01, -8.34267098e-02,  7.51749550e-02,  3.15878723e-01,
       -3.14889818e-02,  1.66277421e-01,  3.14604457e-01,  2.32495760e-01,
        4.44756674e-02,  1.77317937e-01,  2.56216313e-01,  3.49076442e-01,
        1.80312815e-01,  1.40260184e-01,  1.22426614e-01,  2.52109929e-01,
        1.15057215e-01,  1.03007052e-01,  2.99208492e-01,  5.40312645e-01,
        1.86262182e-01, -1.92039196e-04,  1.76551044e-01, -1.51821356e-02,
        3.49000608e-01,  5.83958204e-01,  3.21340391e-01,  3.67837906e-01,
        2.89950417e-01,  1.80182066e-01,  1.67545591e-01,  1.83800478e-01,
        1.61251508e-01,  1.20512922e-01,  2.88292852e-01,  3.81940801e-01,
        2.59833367e-01,  1.78490385e-01,  1.41777554e-02,  2.36624266e-01,
        2.30082728e-01,  1.95936731e-01, -2.81456084e-02,  3.84694688e-01,
        2.98309676e-01,  

In [32]:
print('Coefficients for the Model is: {}'.format(RidR1.coef_)) #
print("Mean squared error for the Model is: %.2f"% mean_squared_error(inv_target_test, R_pred1))
print('R-square for the Model is: %.2f \n' % r2_score(inv_target_test, R_pred1))

Coefficients for the Model is: [ 6.99356630e-02  4.97005923e-05 -1.08248859e-02 -1.29686669e-07
  1.26186890e-01 -1.11637275e-02]
Mean squared error for the Model is: 0.19
R-square for the Model is: 0.06 



In [33]:
#Note: for regressors, .score() returns R-squared, not classification accuracy.
print(f'Accuracy Score of the Ridge Regression is {RidR1.score(inv_test,inv_target_test):f}')

Accuracy Score of the Ridge Regression is 0.063376


### 2. Non-Invasive:

In [137]:
#Feature Selection: All of the non-invasive features present in the non-invasive dataset.
noninvasive_df = df[['age', 'high_blood_pressure', 'diabetes', 'sex', 'smoking', 'time']].copy()
noninv_target = df.DEATH_EVENT

In [138]:
noninv_train, noninv_test, noninv_target_train, noninv_target_test = train_test_split(noninvasive_df, noninv_target, test_size=0.30, random_state=55)

#### Model Training

- Similar to the previous classification model, I will only be using Linear kernel classification for this Non-Invasive training.

In [139]:
#Linear kernel- Classification
svc2 = svm.SVC(kernel='linear', C=1,gamma= 0.5)
svc2.fit(noninv_train, noninv_target_train)

SVC(C=1, gamma=0.5, kernel='linear')

In [140]:
print(f'Accuracy Score for the Linear Kernel SVM (svc2) is {svc2.score(noninv_test, noninv_target_test):f}')

Accuracy Score for the Linear Kernel SVM (svc2) is 0.844444


In [141]:
#Logistic Regression (with the basic solver and crossvalidation)
LR2 =  LogisticRegressionCV(random_state=0, solver='lbfgs',cv=5)
LR2 = LR2.fit(noninv_train,np.ravel(noninv_target_train.values)) 

In [142]:
print(f'Accuracy Score of the LR model for the Non-Invasive Features is {LR2.score(noninv_test,noninv_target_test):f}')

Accuracy Score of the LR model for the Non-Invasive Features is 0.833333


###### Regression

In [143]:
#Linear Kernel SVM- Regression
svr_lin = SVR(kernel='linear', C=100, gamma='auto')

In [144]:
noninv_FOlin = svr_lin.fit(noninv_train, noninv_target_train).predict(noninv_test)

In [145]:
noninv_FOlin #Continuous Values.

array([-0.00602545,  0.5950876 ,  0.14648968,  0.91022057,  0.27637985,
        0.93930484,  0.12356853, -0.04715618,  0.95768591,  0.42751828,
        0.34543208,  0.04143149,  0.14045145,  0.85027744,  0.10893186,
        0.26418422,  0.67930443,  0.85930407,  0.18634253, -0.06771015,
        0.31345863,  0.15759107,  0.57561446,  1.05752744,  0.2519264 ,
        0.83324443,  0.49799243, -0.01189773, -0.00185198,  0.62683339,
        0.58110325,  0.40617108,  0.33062894,  0.74431863,  0.69343046,
        1.03987686,  0.64136242,  0.17856566,  0.53003442, -0.16919292,
       -0.13106525, -0.2705326 ,  0.42221015,  0.65819441,  0.43860226,
       -0.08449974,  0.3956854 ,  0.64043702,  0.53056715,  0.46253404,
        0.60353359, -0.01555208,  0.6317913 , -0.06875995,  0.73951569,
        0.34664284,  0.38169462, -0.08014015,  0.7208444 ,  0.46139939,
        0.7349078 ,  0.29100812,  0.25339045, -0.02010903,  0.20970104,
        0.04355365, -0.06975576,  0.78460262, -0.34974894,  0.60

In [146]:
# Running some analysis on the Linear Kernel SVM. coef_ is only defined for the linear kernel, so the other 2 kernels are not compatible with this.
print("SVR- Linear Kernel")
print('Coefficients for the Model is: {}'.format(svr_lin.coef_))
print("Mean squared error for the Model is: %.2f"% mean_squared_error(noninv_target_test, noninv_FOlin))
print('R-square for the Model is: %.2f \n' % r2_score(noninv_target_test, noninv_FOlin))

#These MSE and R-squared values are much better than for the Invasive SVR, but they cannot be compared with other kernel types because those are not compatible.

SVR- Linear Kernel
Coefficients for the Model is: [[ 0.00856981 -0.00184805  0.07923323 -0.0774783   0.02149074 -0.0036119 ]]
Mean squared error for the Model is: 0.18
R-square for the Model is: 0.14 



In [147]:
noninv_FOlin1 = noninv_FOlin.reshape(-1, 1) #Reshaping necessary for scaling

In [148]:
# MinMax Scaler- values scaled first and then analyzed.
# Not performing Standard Scaler because it does not yield better scaled scores.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
noninv_mmscaled= scaler.fit_transform(noninv_FOlin1)
print("SVR- Linear Kernel- MinMaxScaled")
print('Coefficients for the Model is: {}'.format(svr_lin.coef_)) #
print("Mean squared error for the Model is: %.2f"% mean_squared_error(noninv_target_test, noninv_mmscaled))
print('R-square for the Model is: %.2f \n' % r2_score(noninv_target_test, noninv_mmscaled))

SVR- Linear Kernel- MinMaxScaled
Coefficients for the Model is: [[ 0.00856981 -0.00184805  0.07923323 -0.0774783   0.02149074 -0.0036119 ]]
Mean squared error for the Model is: 0.20
R-square for the Model is: 0.01 



- The target and most of the features are categorical (age and time are the numeric exceptions), which might be the explanation for the better performance of the Non-Invasive models.

In [150]:
#Linear Regression
LinR2 = linear_model.LinearRegression() 
LinR2.fit(noninv_train, noninv_target_train) #Fitting it to the training data
LinR2

LinearRegression()

In [151]:
pred2=LinR2.predict(noninv_test) #prediction function
pred2

array([-0.03444669,  0.55719169,  0.06042937,  0.84286641,  0.24063361,
        0.82692506,  0.0620359 ,  0.01462381,  0.7828007 ,  0.34854551,
        0.29762034, -0.00394861,  0.09499252,  0.68678894,  0.09154121,
        0.24941302,  0.57090646,  0.673093  ,  0.1181505 , -0.06981826,
        0.21321622,  0.10585575,  0.52061122,  0.87854314,  0.24673312,
        0.71825124,  0.45819288,  0.03038274, -0.05669974,  0.49008136,
        0.50869328,  0.45068098,  0.21239398,  0.63194427,  0.64554972,
        0.82720938,  0.6306181 ,  0.10989398,  0.47480421, -0.14244643,
       -0.16616717, -0.16110788,  0.40795797,  0.58507019,  0.42718888,
       -0.08198087,  0.43920863,  0.55156462,  0.50968767,  0.3793144 ,
        0.49336104, -0.07863137,  0.49212199,  0.01525375,  0.67758003,
        0.25431822,  0.40651712, -0.14711127,  0.63207625,  0.41647381,
        0.67424003,  0.30169209,  0.26039779, -0.00346683,  0.14194511,
       -0.01404042, -0.04484897,  0.67071971, -0.29064626,  0.56

In [152]:
print('Coefficients for the Model is: {}'.format(LinR2.coef_)) #
print("Mean squared error for the Model is: %.2f"% mean_squared_error(noninv_target_test, pred2))
print('R-square for the Model is: %.2f \n' % r2_score(noninv_target_test, pred2))

Coefficients for the Model is: [ 0.00543162 -0.04521857  0.011856   -0.04758517 -0.02714858 -0.00339099]
Mean squared error for the Model is: 0.17
R-square for the Model is: 0.20 



In [153]:
#Note: for regressors, .score() returns R-squared, not classification accuracy.
print(f'Accuracy Score of the Linear Regression is {LinR2.score(noninv_test, noninv_target_test):f}')

Accuracy Score of the Linear Regression is 0.196001


In [154]:
#Ridge Regression
RidR2=linear_model.RidgeCV()

In [155]:
RidR2.fit(noninv_train,noninv_target_train) #fitting

RidgeCV(alphas=array([ 0.1,  1. , 10. ]))

In [156]:
R_pred2=RidR2.predict(noninv_test)
R_pred2

array([-0.04281966,  0.55781864,  0.06167448,  0.8390617 ,  0.25296438,
        0.81446689,  0.06352925,  0.01592308,  0.78017249,  0.35956991,
        0.30952267, -0.00459999,  0.1043956 ,  0.68432181,  0.10163434,
        0.24829505,  0.56978813,  0.68135002,  0.11874032, -0.06926201,
        0.20251034,  0.11508056,  0.52090946,  0.8746357 ,  0.23764819,
        0.71723128,  0.45900869,  0.04207071, -0.04655601,  0.48873089,
        0.51645918,  0.45052515,  0.21249658,  0.6305844 ,  0.63448067,
        0.83142008,  0.6395925 ,  0.09959193,  0.47486193, -0.13893139,
       -0.16314601, -0.15823386,  0.4095137 ,  0.58477514,  0.43663012,
       -0.08948933,  0.43085008,  0.54033144,  0.51084081,  0.3686276 ,
        0.50085448, -0.07649775,  0.4906958 ,  0.01709164,  0.67490797,
        0.26524203,  0.40902552, -0.14490957,  0.62004591,  0.41483505,
        0.6634007 ,  0.31196818,  0.26039129,  0.00767869,  0.13219768,
       -0.02317055, -0.03340441,  0.68013093, -0.28564947,  0.55

In [157]:
print('Coefficients for the Model is: {}'.format(RidR2.coef_)) #
print("Mean squared error for the Model is: %.2f"% mean_squared_error(noninv_target_test, R_pred2))
print('R-square for the Model is: %.2f \n' % r2_score(noninv_target_test, R_pred2))

Coefficients for the Model is: [ 0.00534248 -0.03526495  0.01079954 -0.03853997 -0.02522809 -0.00337757]
Mean squared error for the Model is: 0.16
R-square for the Model is: 0.20 



In [158]:
#Note: for regressors, .score() returns R-squared, not classification accuracy.
print(f'Accuracy Score of the Ridge Regression is {RidR2.score(noninv_test,noninv_target_test):f}')

Accuracy Score of the Ridge Regression is 0.198865


### D. Discussion

Across these models it is difficult to find one that performs well: the results range from very poor to average. Let's dive into each below.
***
###### Classification models:
- INVASIVE Linear SVM: The accuracy score here was 0.70. This is far from a perfect 1.0 and only slightly above the roughly 0.68 that always predicting survival would achieve on the full dataset (the mean of DEATH_EVENT is 0.32), so it is an average score at best.
- INVASIVE Logistic Regression: The accuracy score here was slightly higher at 0.72. Once again this is not perfect but still an average score. I believe it will perform better with more observations for training.
- NON-INVASIVE Linear SVM: The accuracy score here was 0.84. This is not a perfect score of 1.0, but it is a good score.
- NON-INVASIVE Logistic Regression: The accuracy score here was similar at 0.83. Once again this is not perfect but still good. More observations should improve the score.

Visualization is difficult for classification models that have more than two features.
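
One way to put the accuracy scores above in context is a majority-class baseline. The sketch below uses synthetic labels with the class balance reported by `df.describe()` (a DEATH_EVENT mean of about 0.32, which is 96 of 299 rows); it does not use the real test labels, and `y_toy`, `X_toy`, and `baseline` are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels matching the overall class balance (96 deaths out of 299).
y_toy = np.array([1] * 96 + [0] * 203)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class (survival).
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(round(baseline_acc, 3))
```

A classifier is only adding information to the extent that it beats this baseline of about 0.68; the exact baseline on the 90-row test set may differ slightly.
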
***
###### The Regression models
This is where things get complicated. Just as I hypothesized, the Non-Invasive models performed better than the Invasive models. Strikingly, the Invasive models were very poor at predicting.

###### Let's Begin with the INVASIVE:
- INVASIVE Linear SVM regression: the values for this model were far off. The MSE was 90114056708.17 and the R-square was -438656165465.43.
    - This raised a flag for me, so I decided to take the predicted values and scale them to see if that would help the scores. I used two scalers, StandardScaler and MinMaxScaler. When the analysis was rerun with the scaled predictions, the MSE and R-square values improved, but that does not mean the model is performing well, because the model itself is unchanged.
- INVASIVE Linear SVM regression **MinMaxScaler**: MSE: 0.28 and R-square: -0.38. Both scores are far closer to zero now. The MSE looks small only because the scaled predictions now share the [0, 1] range of the target, so it does not indicate a good predictor. The negative R-squared means the predictions fit worse than simply predicting the mean of the target; it does not signify negative correlation.
- INVASIVE Linear SVM regression **StandardScaler**: MSE: 1.15 and R-square: -4.58. These are also better than the original SVM analysis.
    - Comparing the MinMax and Standard results, MinMax is clearly better, with a lower MSE and an R-squared closer to zero.
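
To make the meaning of a negative R-squared concrete, here is a small worked sketch on made-up numbers (`y_true` and `y_bad` are toy values, not taken from the models above). R-squared is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0, 0, 1, 0, 1, 0])              # toy binary target
y_bad = np.array([0.9, 0.8, 0.1, 0.7, 0.2, 0.9])   # toy predictions that miss badly

ss_res = np.sum((y_true - y_bad) ** 2)             # error of the predictions
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

# Always predicting the mean gives an R-squared of exactly zero.
y_mean = np.full_like(y_true, y_true.mean(), dtype=float)
print(r2_manual, r2_score(y_true, y_bad), r2_score(y_true, y_mean))
```

Whenever the predictions have more squared error than the constant mean prediction, the ratio exceeds 1 and R-squared drops below zero.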


Now moving on to the last two regression models. These tell a similar story to the SVM regression. Note that the "Accuracy score" printed for them comes from `.score()`, which for scikit-learn regressors returns R-squared rather than classification accuracy.
- INVASIVE Linear Regression: MSE: 0.19, which is close to 0 but only slightly better than the roughly 0.22 that always predicting the mean would give for a 0/1 target with about 32% positives. R-squared: 0.06, which is very far from a perfect 1.0, and the `.score()` value of 0.061128 is that same R-squared. This shows that the invasive features alone were not good enough for model development.
- INVASIVE Ridge Regression: MSE: 0.19, R-squared: 0.06, and `.score()`: 0.063376. These are all similar to the Linear Regression values.
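
For scikit-learn regressors, `.score()` returns R-squared, which is why the "accuracy" values for the regression models coincide with their R-squared. The sketch below checks this on synthetic data and shows one way to get an actual accuracy from regression output; the 0.5 cut-off and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(120, 3))                      # synthetic features
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=120) > 0).astype(int)

reg = LinearRegression().fit(X_toy[:90], y_toy[:90])
preds = reg.predict(X_toy[90:])

# .score() on a regressor is the same number as r2_score on its predictions.
r2_from_score = reg.score(X_toy[90:], y_toy[90:])
r2_from_metric = r2_score(y_toy[90:], preds)

# Threshold the continuous predictions at 0.5 (assumed cut-off) to get class labels.
labels = (preds >= 0.5).astype(int)
acc = accuracy_score(y_toy[90:], labels)
print(r2_from_score, r2_from_metric, acc)
```

Thresholding makes the regression models comparable with the classifiers on the same accuracy scale.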

Overall, for the INVASIVE regression models there was nothing positive to report: all of the models performed poorly. This is perhaps due to the nature of the target variable, which is categorical rather than continuous. These regression models perhaps work better with a continuous target. On the other hand, the NON INVASIVE models reported better values from their training, perhaps because most of that dataset and the target are categorical.
***
###### Let's look at NON INVASIVE:
- NON INVASIVE Linear SVM regression: MSE: 0.18, which looks good but can only be confirmed by comparison. R-squared is 0.14, which is close to 0 and is not good. Although these MSE and R-squared values are acceptable, I still ran a MinMaxScaler on the predicted values to see the effect on the scores.
- NON INVASIVE Linear SVM regression **MinMaxScaler**: MSE: 0.20 and R-squared: 0.01. With a slightly higher MSE and a lower R-squared, the scaling did not help in this situation. (I only ran MinMaxScaler here since it was the better of the two on the Invasive predictions.)

It is interesting to see the model identify enough patterns in this mostly categorical dataset to perform better in predicting heart failure. The last two regression models tell a similar story.

- NON INVASIVE Linear Regression: MSE: 0.17, which is good and close to 0. R-squared: 0.20, which is still low and not close to 1.0. The `.score()` value of 0.196001 is this same R-squared (for regressors, `.score()` does not return classification accuracy). It is not an amazing score, but I believe it can improve with more observations.
- NON INVASIVE Ridge Regression: MSE: 0.16, R-squared: 0.20, and `.score()`: 0.198865. These are all similar to the Linear Regression values.

For the NON INVASIVE regression there was an improvement in every score compared to the INVASIVE models (for example, an R-squared of 0.20 versus 0.06 for Linear Regression). Because of the mostly categorical nature of the dataset, I believe that with enough observations the NON INVASIVE features can be highly accurate in predicting heart failure.
***
###### Conclusion
- For classification, the two model types performed similarly within each feature set. Logistic Regression scored slightly higher on the Invasive features (0.72 vs 0.70) and Linear SVM scored slightly higher on the Non Invasive features (0.84 vs 0.83). The Non Invasive classifiers scored higher than the Invasive ones overall.
- The best INVASIVE regression model is the **Ridge Regression**, because it had a marginally better R-squared (0.063 vs 0.061) with the same MSE.
- The best NON INVASIVE regression model is also the **Ridge Regression**, because of its slightly lower MSE (0.16 vs 0.17) and marginally higher R-squared.
- My Hypothesis: Supported. The Non Invasive models outperformed the Invasive models on this split.

# Please move on to Notebook 2 for the 2 feature comparisons.

Reference:

Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5