# Purpose

#### This notebook can be used to identify the best feature sets for our machine learning problem. Note: if this turns out to be the wrong approach, it could still serve as a way of confirming predictions we make via other methods.

### Based on our previous [Pearson/Spearman] correlation matrix, the 4 strongest correlations were 'ADULT_MORTALITY' (-0.92), 'INCOME_COMPOSITION_OF_RESOURCES' (0.91), 'SCHOOLING' (0.83) and 'BMI' (0.73). As such, we predict these will be the most significant features and will give the most accurate models.
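
The ranking behind this prediction can be sketched as follows. This is a minimal illustration on synthetic data, not the original correlation analysis: `rank_by_correlation`, `demo_df` and its column names are hypothetical stand-ins. Note that a strong pairwise correlation with the target does not guarantee that a feature adds much once other, related features are already in the model.

```python
import numpy as np
import pandas as pd


def rank_by_correlation(df, target, method="pearson"):
    """Return each numeric feature's correlation with `target`, strongest first."""
    corr = df.select_dtypes("number").corr(method=method)[target].drop(target)
    # Sort by absolute value so strong negative correlations rank highly too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in data (hypothetical columns), only to show the call pattern.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
demo_df = pd.DataFrame({
    "TARGET": 3 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n),
    "STRONG": x1,
    "WEAK": x2,
    "NOISE": rng.normal(size=n),
})

ranking = rank_by_correlation(demo_df, "TARGET")
print(ranking)
```

Passing `method="spearman"` gives the rank-based variant of the same ordering.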

In [1]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
from statistics import mean
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import missingno as msno

# Import the required packages for our regression models.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

In [2]:
# Ignore inline warnings thrown by packages
warnings.filterwarnings("ignore")

life_df = pd.read_csv('Cleaned_Life_Expectancy_Data.csv', delimiter=',')

In [3]:
reg_life_df = life_df[["STATUS", "LIFE_EXPECTANCY", "ALCOHOL", "ADULT_MORTALITY", "BMI", "SCHOOLING", "INCOME_COMPOSITION_OF_RESOURCES"]]
reg_life_df = reg_life_df.dropna()

In [6]:
from itertools import combinations

# Create a list containing the candidate combinations of independent variables.
col_list = list(reg_life_df)
col_list.remove("LIFE_EXPECTANCY") # Remove dependent variable
all_iv = [] # List to store the independent variable combinations

# Iterate over combination sizes and append every combination of that size to the list.
# NOTE: range() stops at len(col_list) - 1, so the single combination containing
# all of the variables is not tested.
for i in range(1, len(col_list)):
    all_iv.append(list(combinations(col_list, i)))

testing_results = []

# Iterate through combinations list and test all combinations.
for combinations_list in all_iv:
    for combination in combinations_list:
        
        # Define our x (independent) and y (dependent) variables for our regression models.
        x = reg_life_df[list(combination)]
        y = reg_life_df["LIFE_EXPECTANCY"]
        
        trials = 200
        trials_r2 = np.zeros(trials)  # storing coefficient of determination
        trials_mse = np.zeros(trials)  # storing model prediction error

        for i in range(0, trials):
            # NOTE: Using the "random_state" parameter ensures we get repeatable results for each execution.
            x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=i)

            # Create and train the ordinary least squares Linear Regression model based on our training data.
            linear_regression = LinearRegression()
            linear_regression.fit(x_train, y_train)

            # Run a prediction using our testing data.
            y_pred = linear_regression.predict(x_test)

            trials_r2[i] = r2_score(y_test, y_pred)
            trials_mse[i] = mean_squared_error(y_test, y_pred)
        
        # Add the combination's name, size and mean scores to the results list.
        iv_len = len(combination)
        temp_str = ", ".join(combination)
        temp_list = [temp_str, iv_len, trials_r2.mean(), trials_mse.mean()]
        testing_results.append(temp_list)
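
The same search can be written more compactly with scikit-learn's resampling helpers. The sketch below is an alternative under stated assumptions, not the code that produced the results in this notebook: `all_subsets`, `score_subset` and the synthetic `demo_df` are hypothetical names, and `ShuffleSplit` draws its own random splits rather than reusing `random_state=i` per trial. The loop above iterates `range(0, len(col_list))`, so it covers subset sizes up to one fewer than the number of columns; this version also includes the full set.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate


def all_subsets(columns):
    """Yield every non-empty combination of `columns`, including the full set."""
    for size in range(1, len(columns) + 1):
        yield from combinations(columns, size)


def score_subset(df, features, target, n_splits=200, test_size=0.2, seed=0):
    """Return (mean R2, mean MSE) of OLS over repeated random train/test splits."""
    cv = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
    scores = cross_validate(
        LinearRegression(), df[list(features)], df[target],
        scoring=("r2", "neg_mean_squared_error"), cv=cv,
    )
    # scikit-learn reports MSE negated so that larger is always better.
    return scores["test_r2"].mean(), -scores["test_neg_mean_squared_error"].mean()


# Synthetic stand-in data (hypothetical columns), only to show the call pattern.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["A", "B", "C"])
demo_df["Y"] = 2 * demo_df["A"] - demo_df["B"] + rng.normal(scale=0.1, size=300)

results = [
    (", ".join(subset), len(subset), *score_subset(demo_df, subset, "Y", n_splits=20))
    for subset in all_subsets(["A", "B", "C"])
]
results_df = pd.DataFrame(results, columns=["Features", "Count", "R2", "MSE"])
results_df = results_df.sort_values("MSE")
print(results_df)
```

For `n` candidate columns this evaluates `2**n - 1` subsets, so the approach is only practical for a small number of features.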

In [7]:
reg_list_tests_df = pd.DataFrame(testing_results, columns = ["Independant_Variables", "IV_Count", "R2_Score", "MSE_Error"]) 
reg_list_tests_df = reg_list_tests_df.sort_values(by=['MSE_Error'])

reg_list_tests_df.head(20)

| | Independant_Variables | IV_Count | R2_Score | MSE_Error |
|---|---|---|---|---|
| 53 | ALCOHOL, ADULT_MORTALITY, SCHOOLING, INCOME_CO... | 4 | 0.917465 | 6.903078 |
| 61 | ALCOHOL, ADULT_MORTALITY, BMI, SCHOOLING, INCO... | 5 | 0.917462 | 6.903421 |
| 58 | STATUS, ALCOHOL, ADULT_MORTALITY, SCHOOLING, I... | 5 | 0.917452 | 6.904132 |
| 57 | STATUS, ALCOHOL, ADULT_MORTALITY, BMI, INCOME_... | 5 | 0.917414 | 6.90739 |
| 52 | ALCOHOL, ADULT_MORTALITY, BMI, INCOME_COMPOSIT... | 4 | 0.917412 | 6.907502 |
| 33 | ALCOHOL, ADULT_MORTALITY, INCOME_COMPOSITION_O... | 3 | 0.917403 | 6.908261 |
| 43 | STATUS, ALCOHOL, ADULT_MORTALITY, INCOME_COMPO... | 4 | 0.917395 | 6.908928 |
| 55 | ADULT_MORTALITY, BMI, SCHOOLING, INCOME_COMPOS... | 4 | 0.91733 | 6.914912 |
| 39 | ADULT_MORTALITY, SCHOOLING, INCOME_COMPOSITION... | 3 | 0.917326 | 6.915186 |
| 38 | ADULT_MORTALITY, BMI, INCOME_COMPOSITION_OF_RE... | 3 | 0.917318 | 6.915858 |


### Based on the mean R2 and MSE scores from 200 trial runs for each tested combination of one to five of our six independent variables, we see that the first 15 combinations listed are all within 0.002 and 0.02 of one another for the R2 and MSE means respectively.

### Of these 15, the most accurate IV combination uses the following 4 IVs:
#### - 'ALCOHOL', 'ADULT_MORTALITY', 'SCHOOLING' and 'INCOME_COMPOSITION_OF_RESOURCES'
### Whilst this differs from our initial prediction, 3 of the 4 predicted features were correct, with 'ALCOHOL' unexpectedly taking the place of 'BMI'.
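
Because the leading combinations are so close, it may be worth checking whether the gaps between them exceed the variation from one train/test split to the next. The sketch below shows one way to do that on synthetic data. It is an illustration only: `trial_mse`, `X` and `y` are hypothetical stand-ins, and no such check was run on the results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def trial_mse(X, y, trials=50, test_size=0.2):
    """Per-trial test MSE of OLS, using random_state=i for trial i."""
    out = np.zeros(trials)
    for i in range(trials):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=i
        )
        model = LinearRegression().fit(X_tr, y_tr)
        out[i] = mean_squared_error(y_te, model.predict(X_te))
    return out


# Synthetic stand-in: the third column is pure noise, so dropping it should barely matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400)

mse_full = trial_mse(X, y)           # all three columns
mse_small = trial_mse(X[:, :2], y)   # without the noise column

# The same seeds give the same row splits, so the per-trial differences are paired.
diff = mse_full - mse_small
print(f"mean MSE: full={mse_full.mean():.4f}, small={mse_small.mean():.4f}")
print(f"mean paired difference: {diff.mean():.4f}")
print(f"trial-to-trial std of MSE: {mse_full.std(ddof=1):.4f}")
```

If the mean difference between two feature sets is much smaller than the trial-to-trial standard deviation, their ranking is best treated as a near-tie rather than as evidence that one set is better.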