In [4]:
import numpy as np, pandas as pd

metricContainer1 = np.load("metric arrays/metrics_original_normal_ttsplit.npz")          #Array of performance metrics realized from initial attempt using normal train_test_split.
metrics_Initial = metricContainer1["arr_0"]

metricContainer2 = np.load("metric arrays/metrics_TestingCV.npz")                        #Array of performance metrics realized during plain cross-validation testing.
metrics_TestingCV = metricContainer2["arr_0"]

metricContainer3 = np.load("metric arrays/metrics_ageS6_removed.npz")                     #Array of performance metrics realized when low importance features were removed.
metrics_ageS6_removed = metricContainer3["arr_0"]

metricContainer4 = np.load("metric arrays/metrics_mod_complete.npz")                       #Array of performance metrics realized when low importance features have been removed and interaction features added.
metrics_mod_complete = metricContainer4["arr_0"]


The metrics contained in the arrays above are compared below. They represent the results of 4 stages of attempts to model sklearn's load_diabetes dataset more robustly.

1. First attempt: The diabetes dataset was split, modelled and evaluated with 6 models using the basic train_test_split approach.

2. Second attempt: Used cross-validation (CV) across the 6 models to get a more robust measure of model performance on the dataset.

3. Third attempt: In addition to CV, dropped 2 very low-importance features; the decision was informed by analysis of feature importances and possible interactions. File attached (TestForInteraction.ipynb).

4. Fourth attempt: Adding to step 3 above, created 3 new interaction features; the interactions were determined through analysis of feature interactions. File attached (TestForInteraction.ipynb).
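The four stages listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that produced the saved metric arrays: a single LinearRegression stands in for the 6 models, the dropped columns (age and s6) are inferred from the file name metrics_ageS6_removed.npz, the interaction feature bmi_x_s5 is hypothetical, and cv=5, test_size=0.2 and random_state=0 are arbitrary choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_validate, train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Attempt 1: a single hold-out split (split settings are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
pred = LinearRegression().fit(X_train, y_train).predict(X_test)
holdout = {"MSE": mean_squared_error(y_test, pred), "R2": r2_score(y_test, pred)}


def cv_scores(features):
    """Mean MSE and R2 over 5 folds for a plain linear regression."""
    cv = cross_validate(
        LinearRegression(), features, y, cv=5,
        scoring=("neg_mean_squared_error", "r2"),
    )
    # sklearn reports negated MSE, so flip the sign back.
    return {"MSE": -cv["test_neg_mean_squared_error"].mean(),
            "R2": cv["test_r2"].mean()}


# Attempt 2: cross-validation on all features (the baseline).
baseline = cv_scores(X)

# Attempt 3: drop two low-importance columns (assumed here to be age and s6).
X_reduced = X.drop(columns=["age", "s6"])
reduced = cv_scores(X_reduced)

# Attempt 4: add an interaction feature (bmi * s5 is a hypothetical example).
X_mod = X_reduced.assign(bmi_x_s5=X_reduced["bmi"] * X_reduced["s5"])
full = cv_scores(X_mod)

for name, scores in [("holdout", holdout), ("cv baseline", baseline),
                     ("features dropped", reduced), ("full modification", full)]:
    print(f"{name}: MSE={scores['MSE']:.2f}, R2={scores['R2']:.4f}")
```

Averaging over folds makes the score less dependent on which rows happen to land in the test set, which is why the CV result is the more natural baseline for the later stages.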

Note: Metrics realized from the 2nd attempt using CV (see Testing_CV branch) are used as the baseline, so every comparison is made against them. The baseline is subtracted from the metrics of each subsequent attempt, so a negative MSE difference and a positive R2 difference both indicate an improvement. The data is displayed in tables for easier reading.
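Because a lower MSE and a higher R2 both mean "better", raw differences against the baseline have opposite sign conventions. One way to make every row read the same way is to flip the sign of the lower-is-better rows, as sketched below. The helper name improvement_over_baseline and the toy numbers are illustrative assumptions only; they are not taken from the saved metric arrays.

```python
import pandas as pd


def improvement_over_baseline(stage, baseline, lower_is_better=("MSE score",)):
    """Return stage - baseline with lower-is-better rows negated,
    so that a positive value always means an improvement."""
    diff = stage - baseline
    rows = [r for r in lower_is_better if r in diff.index]
    diff.loc[rows] = -diff.loc[rows]
    return diff


# Toy values for two hypothetical models (not real results).
toy_baseline = pd.DataFrame(
    {"ModelA": [100.0, 0.50], "ModelB": [200.0, 0.40]},
    index=["MSE score", "R2 score"],
)
toy_stage = pd.DataFrame(
    {"ModelA": [90.0, 0.55], "ModelB": [210.0, 0.35]},
    index=["MSE score", "R2 score"],
)

toy_improvement = improvement_over_baseline(toy_stage, toy_baseline)
print(toy_improvement)
```

With this convention ModelA shows positive values in both rows (it improved) and ModelB shows negative values in both rows (it got worse), which avoids having to remember a separate sign rule per metric.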

In [5]:
#Performance from initial attempt.
initialMetrics = pd.DataFrame(metrics_Initial, 
                         index = ["MSE score", "R2 score", "SNR"],
                         columns = ["OLSRegression", "LassoRegression", "RidgeRegression", "RandomForest", "XGBoost", "NeuralNetwork"])
initialMetrics[:2]

Unnamed: 0,OLSRegression,LassoRegression,RidgeRegression,RandomForest,XGBoost,NeuralNetwork
MSE score,3145.386692,4067.208772,3712.537141,3115.877072,3461.812012,22613.793331
R2 score,0.488422,0.338493,0.396179,0.493222,0.436958,-2.677995


In [6]:
#Baseline performance from cross-validation

cvMetricsFrame = pd.DataFrame(metrics_TestingCV, 
                         index = ["MSE score", "R2 score", "SNR"],
                         columns = ["OLSRegression", "LassoRegression", "RidgeRegression", "RandomForest", "XGBoost", "NeuralNetwork"])
cvMetricsFrame

Unnamed: 0,OLSRegression,LassoRegression,RidgeRegression,RandomForest,XGBoost,NeuralNetwork
MSE score,2992.33563,3004.09572,2997.22919,3269.98265,4195.118823,23927.527059
R2 score,0.492439,0.461785,0.463474,0.414577,0.252512,-3.197962
SNR,0.938803,0.0,0.0,0.0,0.0,0.0


In [7]:
#Performance change from dropping 2 low-importance features, with the CV attempt as baseline.
#Note: A +ve R2 difference and a -ve MSE difference indicate an improvement. The reverse indicates a decrease in performance.

featuresDropdiff = metrics_ageS6_removed - metrics_TestingCV
featuresDroppedMetricsFrame = pd.DataFrame((featuresDropdiff), 
                         index = ["MSE improvement", "R2 improvement", "SNR improvement"],
                         columns = ["OLSRegression", "LassoRegression", "RidgeRegression", "RandomForest", "XGBoost", "NeuralNetwork"])
featuresDroppedMetricsFrame

Unnamed: 0,OLSRegression,LassoRegression,RidgeRegression,RandomForest,XGBoost,NeuralNetwork
MSE improvement,-20.572591,-34.444533,-34.360725,-5.526038,-127.230615,45.34877
R2 improvement,0.003388,0.006499,0.006681,-0.000821,0.013701,-0.015549
SNR improvement,0.020186,0.0,0.0,0.0,0.0,0.0


#From table above:
1. Dropping 2 low-importance features marginally improved model performance on the diabetes dataset: MSE decreased for every model except the neural network. XGBoost showed the largest improvement, with a decrease of 127.23 units.

2. R2 scores also increased marginally, except for the RandomForest and NeuralNetwork models. Again, the XGBoost model saw the largest increase, about 0.014.

3. The signal-to-noise ratio (SNR) also improved (by about 0.02 for OLSRegression, the only model with a nonzero SNR entry). This probably indicates that low-importance features that also do not interact with other features add noise rather than signal.

Summary/Conclusion: Given the general decrease in MSE and the increases in R2 and SNR, removing the low-importance features had a positive, though marginal, influence on model performance.

In [8]:
#Performance change from dropping 2 low-importance features and adding 3 new interaction features, with the CV attempt as baseline.

fullModMetricDiff = metrics_mod_complete - metrics_TestingCV
modCompleteMetricsFrame = pd.DataFrame((fullModMetricDiff), 
                         index = ["MSE score", "R2 score", "SNR"],
                         columns = ["OLSRegression", "LassoRegression", "RidgeRegression", "RandomForest", "XGBoost", "NeuralNetwork"])
modCompleteMetricsFrame

Unnamed: 0,OLSRegression,LassoRegression,RidgeRegression,RandomForest,XGBoost,NeuralNetwork
MSE score,-42.160563,-34.426581,-35.501725,50.930922,-165.704907,63.107559
R2 score,0.007111,0.006496,0.006883,-0.011117,0.018405,-0.099803
SNR,-0.002689,0.0,0.0,0.0,0.0,0.0


#From table above:
1. Dropping 2 low-importance features and adding custom interaction features also improved model performance on the diabetes dataset, though marginally. MSE decreased for every model except the neural network and random forest models. XGBoost showed the largest improvement, with a decrease of about 165.70 units.

2. R2 scores also increased marginally, except for the RandomForest and NeuralNetwork models. Again, the XGBoost model saw the largest increase, about 0.018.

3. Here, the signal-to-noise ratio (SNR) dropped slightly below the baseline. The added interaction features appear to have added more noise than signal.

In [9]:
#So the question now becomes: did the complete modifications really yield a further improvement,
#or did the sub-step (low-importance feature removal alone) perform better?

#To determine this, we subtract featuresDropdiff (sub-step metrics) from fullModMetricDiff (final metrics).
#A -ve MSE, +ve R2 and +ve SNR difference mean that the complete modifications yielded a further improvement.

stepsComparisonMetrics = fullModMetricDiff - featuresDropdiff
stepCompMetricsFrame = pd.DataFrame(stepsComparisonMetrics, 
                         index = ["MSE score", "R2 score", "SNR"],
                         columns = ["OLSRegression", "LassoRegression", "RidgeRegression", "RandomForest", "XGBoost", "NeuralNetwork"])

#To simplify comparisons, we leave out the models that performed poorly at this stage: RandomForest and NeuralNetwork.
#Dropping both columns in one call avoids modifying the frame while looping over its columns.

stepCompMetricsFrame = stepCompMetricsFrame.drop(columns=["RandomForest", "NeuralNetwork"])
stepCompMetricsFrame

Unnamed: 0,OLSRegression,LassoRegression,RidgeRegression,XGBoost
MSE score,-21.587972,0.017952,-1.141,-38.474292
R2 score,0.003722,-3e-06,0.000201,0.004704
SNR,-0.022874,0.0,0.0,0.0


From the table above, 3 out of 4 models (OLSRegression, RidgeRegression and XGBoost) confirm that the complete modifications improved model performance beyond feature removal alone. Nevertheless, SNR and Lasso regression had their best scores with low-importance feature removal only.
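The "3 out of 4" count can be re-derived from the MSE differences in the two earlier tables. The sketch below copies those values by hand (the frame and column names are illustrative choices) and picks, for each model, the stage with the most negative MSE change against the CV baseline.

```python
import pandas as pd

models = ["OLSRegression", "LassoRegression", "RidgeRegression", "XGBoost"]

# MSE change relative to the CV baseline, copied from the tables above.
mse_change = pd.DataFrame(
    {
        "features_dropped": [-20.572591, -34.444533, -34.360725, -127.230615],
        "full_modification": [-42.160563, -34.426581, -35.501725, -165.704907],
    },
    index=models,
)

# The most negative change is the largest MSE reduction.
best_stage = mse_change.idxmin(axis=1)
n_full_wins = int((best_stage == "full_modification").sum())

print(best_stage)
print(f"{n_full_wins} of {len(models)} models do best with the full modification")
```

Only LassoRegression does better with feature removal alone, and its margin (about 0.018 MSE units) is small enough that it may not be meaningful.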

Note: Model hyperparameters were not tuned, so this could be a major source of bias in the comparison.

Focus for further improvements:
1. Tuning of model hyperparameters for better performance.
2. Given the small size of sklearn's diabetes dataset, a GAN or VAE could be used to augment the dataset.
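As a possible starting point for the first item, the sketch below tunes a single hyperparameter with a cross-validated grid search. It is an assumption-laden illustration rather than part of the analysis above: Ridge is used only because it has one obvious parameter to tune, and the alpha grid and cv=5 are arbitrary choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# Candidate regularization strengths (an arbitrary illustrative grid).
alpha_grid = [0.001, 0.01, 0.1, 1.0, 10.0]

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alpha_grid},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_mse = -search.best_score_  # sklearn reports negated MSE

print(f"best alpha: {best_alpha}, mean CV MSE: {best_mse:.2f}")
```

The same pattern extends to the other models by swapping the estimator and the parameter grid.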