<h4>Data and Module Importing</h4>

Importing the most-used functions from throughout the project and also some additional ones that will help with model analysis.

In [31]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

import statsmodels.api as sm
import statsmodels.tools

from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("C:\\Users\\Toby\\Documents\\Digital Futures\\Projects\\WHO Project\\Life Expectancy Data.csv")


(2864, 21)

<h4>Train-Test Split</h4>

In [None]:
feature_cols = list(df.columns)
feature_cols.remove('Life_expectancy')

X = df[feature_cols]
y = df['Life_expectancy']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)

<h2>Interactive Model Selection and Outputs</h2>

<h4>Behind the Scenes Workings</h4>

This first function is defined to give the user the option of a full model or a sensitivity-corrected model. This handles all inputs to only return a model if it is a valid input.

Following this, we can begin running some functions that will be defined later (All models must be run before this function can be called.)

For both models, we send our data into a feature engineering function to clean up the data and get our columns in the right format.

Following this, we then find the optimal VIF for the columns we use in each model.

Finally, we complete some model analysis.

This is all then output to the user to give an overview of the quality and fit of the model being used. This is helpful in showing that the final models are up to the required standards of predictiveness and consistency.

In [473]:
def model_selection():

    # Error Handling
    
    try: model_choice = int(input("""Do you want to run the full model (1) or run a censored model to cover sensitive data (2)?
    Enter your option here: """))
    except:
        print("Invalid input. Please enter either 1 or 2 to choose your model")
        model_selection()

    if model_choice == 1:

    # Model FE and defining stage
    
        X_train_fe = feature_eng_full(X_test)
        model_cols = X_train_fe.columns

    # Model Metrics
        global model_state
        model_state = "full"
        modelling(model_cols)
        print("\nThe equation derived from our linear regression model is:")
        print()
        print(equation) 
        print() 

        lin_reg = sm.OLS(y_train, X_train_fe[optimal_cols])
        results = lin_reg.fit()
        print(f"For comparison, the optimal condition number found using VIF is {results.condition_number}")


    elif model_choice == 2:
        
        X_train_fe = feature_eng_full(X_test)
        model_cols = ['const','Year', 'Alcohol_consumption', 'log_GDP', 'Adult_mortality']

    # Model Metrics
        model_state = "sensitive"
        modelling(model_cols)
        print("\nThe equation derived from our linear regression model is:")
        print()
        print(equation) 
        print() 

        lin_reg = sm.OLS(y_train, X_train_fe[optimal_cols])
        results = lin_reg.fit()
        print(f"For comparison, the optimal condition number found using VIF is {results.condition_number}")

    else:
        print("This is not one of the options. Please enter either 1 or 2 to choose your model")
        model_selection()

In [462]:
def feature_eng_full(data):
    data = data.copy()

    # Removing autocorrelated columns
    
    data = data.drop(columns = ['Country', 'Economy_status_Developing', 'Infant_deaths'])
    
    # One hot encoding
    
    data = pd.get_dummies(data, columns = ['Region'], drop_first = True, prefix = 'Region', dtype=int) 

    # Fixing exponential relationship

    data['log_GDP'] =  np.log(data['GDP_per_capita'])

    # Scaling
    
    scaler = StandardScaler()
    data[data.columns] = scaler.fit_transform(data[data.columns])

    # Removing columns we are not interested in for our model

    data = data.drop(columns = ['Measles', 'GDP_per_capita', 'Population_mln', 'Thinness_five_nine_years'])
    
    # VIF

    if model_state == "full":
        data_col = data.columns
    else:
        data_col = ['Year', 'Alcohol_consumption', 'log_GDP', 'Adult_mortality']
    
    calculate_vif(data[data_col])
    
    data = sm.add_constant(data)
    return data

In [464]:
def calculate_vif(X, thresh = 5.0):
    variables = list(range(X.shape[1]))
    dropped = True
    while dropped:
        dropped = False
        # this bit uses list comprehension to gather all the VIF values of the different variables
        vif = [variance_inflation_factor(X.iloc[:, variables].values, ix)
               for ix in range(X.iloc[:, variables].shape[1])]
        
        maxloc = vif.index(max(vif)) # getting the index of the highest VIF value
        if max(vif) > thresh:
            del variables[maxloc] # we delete the highest VIF value on condition that it's higher than the threshold
            dropped = True # if we deleted anything, we set the 'dropped' value to True to stay in the while loop    
    
    global optimal_cols 
    optimal_cols = list(X.columns[variables])
    optimal_cols.append('const')

    # We now create a global variable and assign the list of columns still in the valid set to it, remembering to add the constant back in. We can use this to check for an optimal condition number.
    
    return optimal_cols

In [466]:
def modelling(col):

    # Modelling Stage
    
    lin_reg = sm.OLS(y_train, X_train_fe[col])
    results = lin_reg.fit()

    # Metrics Observations 
    
    print(f"\nThe following shows the level of success our {model_state} model has with predicting life expectancy:\n")
    print(f"""
P-Values:

{round(results.pvalues,3)}

R-Squared:
    
{results.rsquared}
    
AIC and BIC:
    
{results.aic}
{results.bic}
    
Condition Number:
    
{results.condition_number}
""")

    # Coefficients

    # From stackoverflow.com - 'user idiot-tom' - this shows how to extract the coefficients from our results and insert it into a data frame. We then are able to put them into a dictionary which
    # will be used to form the equation used for predictions.

    global coef_df
    global coefficients
    global equation
    coef_df = pd.read_html(results.summary().tables[1].as_html(),header=0,index_col=0)[0]
    coefficients = coef_df['coef'].to_dict()

    equation = f"y = {coefficients['const']}"
    for val, key in coefficients.items():
        if val != 'const':
            equation += f" + {key}*{val}"
    
    # RMSE Calculations
    
    y_pred = results.predict(X_train_fe[col])
    rmse = statsmodels.tools.eval_measures.rmse(y_train, y_pred)
    print(f"RMSE:\n\n{rmse}")
    # print(results.summary())

<h2>Model Analysis</h2>

Here, we can call the function that allows us to select which model we want to use. The entry is exception-tested and will only allow one of two inputs. Once the user has selected their chosen model, the output will represent all the important metrics that are used to check the effectiveness of the model. This output will include:

* P-Values
* R-Squared
* AIC and BIC
* Condition Number
* RMSE
* Linear Regression Equation
* Optimal VIF

In [475]:
model_selection()

Do you want to run the full model (1) or run a censored model to cover sensitive data (2)?
    Enter your option here:  1



The following shows the level of success our full model has with predicting life expectancy:


P-Values:

const                                   0.000
Year                                    0.000
Under_five_deaths                       0.000
Adult_mortality                         0.000
Alcohol_consumption                     0.057
Hepatitis_B                             0.002
BMI                                     0.000
Polio                                   0.000
Diphtheria                              0.014
Incidents_HIV                           0.000
Thinness_ten_nineteen_years             0.001
Schooling                               0.000
Economy_status_Developed                0.000
Region_Asia                             0.032
Region_Central America and Caribbean    0.000
Region_European Union                   0.000
Region_Middle East                      0.115
Region_North America                    0.006
Region_Oceania                          0.000
Region_Rest of Euro

  coef_df = pd.read_html(results.summary().tables[1].as_html(),header=0,index_col=0)[0]


ValueError: The indices for endog and exog are not aligned

<br>
<br>
<br>
<br>
<h3>User Inputs Experiment</h3>

After a test to see how we would make our function interactive, we decided it would be best to host this portion in an online application. We thought it would be useful to keep this section in the notebook as a display of our thought process and for some transparency. It currently has no effect on the interactive function.

In [383]:
def user_inputs_1():

    # List of features included in our final full model
    
    user_values = ['year', 'U5 Deaths per 1000', 'adult mortality rate',
       'alcohol consumption', 'hepatitis B immunization (%)', 'BMI', 'polio immunization (%)',
       'diphtheria immunization (%)', 'HIV per 1000', 
       'thinness between 10-19', 'schooling years',
       'economy status (Developed or Developing)', 'region', 'GDP']

    # Initialise a dictionary to store the users inputs
    
    user_dict = {}

    # Define lists to use in input checking - these features are grouped by the upper limit of their input
    
    limit1000 = ['U5 Deaths per 1000', 'adult mortality rate', 'HIV per 1000']
    limit100 = ['hepatitis B immunization (%)', 'polio immunization (%)', 'diphtheria immunization (%)', 'thinness between 10-19']
    #regions = 

    # Section used to take in user inputs

    for each in user_values:       # Creates a new input for every feature in the model

        # For features that require text inputs, a regular input is used. Once an input is taken, it is added into the dictionary under the respective feature.
        
        if each in ['economy status (Developed or Developing)', 'region']:
            user_input = input(f"Please enter your value for {each}: ")
            user_dict[each] = user_input

        # For features that require an integer input, there is some checking that must be done. The try exception ensures that only numbers can be used as an input and it will keep asking for
        # a number until the user inputs one.
        
        else:
            while True:
                try:
                    user_input = int(input(f"Please enter your value for {each}: "))

                    # This section is used for data checking too. Some features have maximum limits for them to make sense (E.g. You cannot have 101% of a population!). This is where
                    # the inputs are checked to ensure they follow the rules.
                    
                    if each in limit100 and user_input > 100:
                        print(f"{each.title()} must be less than 100, please enter a new value")
                    elif each in limit1000 and user_input > 1000:
                        print(f"{each.title()} must be less than 1000, please enter a new value")

                    # Once the user has entered a valid input, it is entered into the dictionary and the checks are ended. It then moves onto the next input (or ends if it is the last input)
                    
                    else:
                        user_dict[each] = user_input
                        break
                except:
                    print("This must be an integer, try again")

    # Prints the dictionary of the user's inputs
            
    for a, b in user_dict.items():
        print(f"{a.title()}: {b}")
    global input_df
    input_df = pd.DataFrame(columns=[
    'Year', 'Under_five_deaths', 'Adult_mortality',
       'Alcohol_consumption', 'Hepatitis_B', 'BMI', 'Polio', 'Diphtheria',
       'Incidents_HIV', 'Thinness_ten_nineteen_years', 'Schooling',
       'Economy_status_Developed', 'Region', 'GDP_per_capita'])

In [None]:
user_inputs_1()

In [None]:
input_df.head()