<img src = 'https://upload.wikimedia.org/wikipedia/commons/2/26/World_Health_Organization_Logo.svg'
    width = 690px
    height= 665px />

# **Life Expectancy of a Country - Predictive Model**  
By: Team Scrum of Digital Futures

### Please run the following cells sequentially

In [1]:
# Libraries Preamble
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

import statsmodels.api as sm
import statsmodels.tools

In [2]:
# read in the file data
df = pd.read_csv('Life Expectancy Data.csv')

## Explaining the Validation Functions
The following three functions are similar and serve similar purpose, checking if user inputs are valid. ```valid_input()``` will check that the user input is a number that is contained within the minimum and maximum value. ```valid_string()``` will check the user input strictly for a string of 'yes' or 'no', with any capitalisation. ```valid_region()``` will check if the input is a string containing one of the nine allowed regions, again with any capitalisation. Should the user input something that doesn't pass their respective checks, the fucntion will print an error statement and ask the user for their input again by simply returning the function itself.

In [3]:
# Function that checks the validity of the user input
def valid_input(prompt, input_type=float, input_min=None, input_max=None):
    
    try:
        # Store user input
        user_input = input_type(input(prompt))
        # check if number
        if isinstance(user_input, (int, float)):
            # prompt the user again if value is too low
            if user_input < input_min:
                print(f'Value cannot be less than {input_min}. Try again.')
                return valid_input(prompt, input_type, input_min, input_max)
            # prompt user again if value is too high
            elif user_input > input_max:
                print(f'Value cannot be more than {input_max}. Try again.')
                return valid_input(prompt, input_type, input_min, input_max)
            # return input
            else:
                return user_input
        else:
            return user_input
    # retry if input is the wrong type (i.e a string)
    except ValueError:
        print('Invalid input! Please enter a number')
        return valid_input(prompt, input_type, input_min, input_max)   

In [4]:
# Function that checks the validity of the user input (yes or no version)
def valid_string(prompt):
    # store user input
    user_input = input(prompt)
    # allowed inputs
    options = ['yes', 'no']
    # return input if is one of the allowed options   
    if user_input.lower() in options:
        return user_input
    # retry if not
    else:
        print('You must respond \"Yes\" or \"No\"')
        return valid_string(prompt)     

In [5]:
# Function for region validation
def valid_region(prompt):
    # store user input
    user_input = input(prompt)
    # allowed inputs
    options = ['asia', 'central america and caribbean', 'european union', 'middle east', 'north america', 'oceania', 'rest of europe', 'south america', 'africa']
    # return input if is one of the allowed options   
    if user_input.lower() in options:
        return user_input
    # retry if not
    else:
        print('Please type one of the nine Regions: Asia, Central America and Caribbean, European Union, Middle East, North America, Oceania, Rest of Europe, South America, Africa')
        return valid_region(prompt)

## Explaining the Model Functions
The two functions below are the functions the contains a OLS model and the imported WHO data:
- The data is first train-test split
- A nested function feature engineers the data
- Some of the data is also scaled before being trained on the OLS model. 
- Then the model asks for the users inputs and stores them.
    - some of the inputs are manipulated in the same way as the feature engineering function
- Then a dataframe is created for the user inputs.
- The user dataframe is also scaled appropriately
- Finally, the fitted model uses the user inputs to produce a prediction of the life expectancy.
- The results are printed to the terminal, along with the $R^2$ value and the condition number

```all_data()``` asks the user for all the data necessary for generating the life expectancy. ```sensitiveless_data()``` doesn't ask the user for data that was deemed sensitive.

In [6]:
# Fucntion that uses all the data
def all_data():
    # ----------------------------------- Modelling data ---------------------------------------------
    X = df.drop(columns=['Life_expectancy'])    # X dataframe (features)
    y = df['Life_expectancy']                   # y series (Target)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 423)   #Train-Test split
    
    def feature_eng(df):
        df = df.copy() 
        # One-hot enconding the 'Region' column and dropping it afterwards
        df = pd.get_dummies(df, columns=['Region'], drop_first=True, prefix='Region',dtype=int) 
        # Feature enrichment - creating new columns from combining highly correlated features
        df['Diphtheria_Polio'] = df['Diphtheria'] / df['Polio']
        df['Under_five_Infant_deaths'] = df['Under_five_deaths'] / df['Infant_deaths']
        df['Thinness_combined'] = df['Thinness_ten_nineteen_years'] / df['Thinness_five_nine_years']
        # dropping the columns that have been combined 
        df.drop(columns=['Diphtheria', 'Polio', 'Under_five_deaths', 'Infant_deaths','Thinness_five_nine_years', 'Thinness_ten_nineteen_years'], inplace=True)
        # dropping 'Country', 'Economy_status_Developed' and 'Year' columns
        df.drop(columns=['Country', 'Economy_status_Developed', 'Year'], inplace=True)
        # adding a constant
        df = sm.add_constant(df) 
    
        return df  # returning the DatFrame

    X_train_fe = feature_eng(X_train)
    
    # Creating a list of columns to scale
    columns_to_scale = ['Adult_mortality','Alcohol_consumption','Hepatitis_B', 'Measles', 'BMI', 'GDP_per_capita','Population_mln', 'Schooling']
    # Initialize scaler
    scaler = RobustScaler()
    # Fit on train and transform both sets
    X_train_fe[columns_to_scale] = scaler.fit_transform(X_train_fe[columns_to_scale])
    
    #Fitting
    lin_reg = sm.OLS(y_train, X_train_fe)
    results = lin_reg.fit()
    # ------------------------------------------------------------------------------------------------
    
    # User Inputs
    u_Infant_deaths               = valid_input('\nPlease input the following:\nNumber of Infant Deaths per 1000 population: ', float, 0, 1000)     
    u_Under_five_deaths           = valid_input('Number of deaths (0-4 years) per 1000 population: ', float, 0, 1000)             
    u_Adult_mortality             = valid_input('Adult (15-60 years) mortality rate per 1000: ', float, 0, 1000)        
    u_Alcohol_consumption         = valid_input('Alcohol consumption in litres per captia: ', float, 0, 1000)                 
    u_Hepatitis_B                 = valid_input('Hepatitis B immunisation coverage in 1-year olds (%): ', float, 0, 100)            
    u_Measles                     = valid_input('Measles cases per 1000 population: ', float, 0, 1000)            
    u_BMI                         = valid_input('National average BMI: ', float, 0, 250)            
    u_Polio                       = valid_input('Polio immunisation coverage in 1-year olds (%): ', float, 0, 100)
    u_Diphtheria                  = valid_input('Diphtheria immunisation coverage in 1-year olds (%): ', float, 0, 100)      
    u_Incidents_HIV               = valid_input('HIV related deaths (0-4 years) per 1000 live births: ', float, 0, 1000)     
    u_GDP_per_capita              = valid_input('GDP per capita (USD): ', int, 0, 1000000)    
    u_Thinness_ten_nineteen_years = valid_input('Prevalence of thinness among children Aged 10 to 19 (%): ', float, 0, 100)      
    u_Thinness_five_nine_years    = valid_input('Prevalence of thinness among children Aged 5 to 9 (%): ', float, 0, 100)    
    u_Schooling                   = valid_input('Average years of schooling: ', float, 0, 40)
    u_Population_mln              = valid_input('Population of country in millions: ', float, 0, 4000)            
    u_Economy_status              = valid_string('Is this country developed? (Yes or No) ')
    u_Economy_status_Developing   = 0
    u_region                      = valid_region('What region is this country from: ')
    u_Diphtheria_Polio            = u_Diphtheria / u_Polio
    u_Under_five_Infant_deaths    = u_Under_five_deaths / u_Infant_deaths
    u_Thinness_combined           = u_Thinness_ten_nineteen_years / u_Thinness_five_nine_years
    u_Region_Asia                 = 0
    u_Region_Central_America      = 0  
    u_Region_European_Union       = 0
    u_Region_Middle_East          = 0 
    u_Region_North_America        = 0
    u_Region_Oceania              = 0 
    u_Region_Rest_of_Europe       = 0
    u_Region_South_America        = 0 
    
    # converting user input into OHE value
    if u_Economy_status == 'no':
        u_Economy_status_Developing = 1
    if u_region == 'asia':
        u_Region_Asia = 1
    if u_region == 'central america and caribbean':
        u_Region_Central_America = 1
    if u_region == 'european union':
        u_Region_European_Union = 1
    if u_region == 'middle east':
        u_Region_Middle_East = 1
    if u_region == 'north america':
        u_Region_North_America = 1
    if u_region == 'oceania':
        u_Region_Oceania = 1
    if u_region == 'rest of europe':
        u_Region_Rest_of_Europe = 1
    if u_region == 'south america':
        u_Region_South_America = 1
    
    # Make empty dataframe for user inputs
    user_record = pd.DataFrame(columns=['const', 'Adult_mortality', 'Alcohol_consumption', 'Hepatitis_B', 'Measles', 'BMI', 'Incidents_HIV', 'GDP_per_capita', 'Population_mln', 'Schooling', 'Economy_status_Developing', 
    'Region_Asia', 'Region_Central_America_and_Caribbean', 'Region_European_Union', 'Region_Middle_East', 'Region_North_America', 'Region_Oceania', 'Region_Rest_of_Europe', 'Region_South_America', 
    'Diphtheria_Polio', 'Under_five_Infant_deaths', 'Thinness_combined'])
    
    # Add user inputs
    user_record.loc[0] = [1.0, u_Adult_mortality, u_Alcohol_consumption, u_Hepatitis_B, u_Measles, u_BMI, u_Incidents_HIV, u_GDP_per_capita, u_Population_mln, u_Schooling, u_Economy_status_Developing,
    u_Region_Asia, u_Region_Central_America, u_Region_European_Union, u_Region_Middle_East, u_Region_North_America, u_Region_Oceania, u_Region_Rest_of_Europe, u_Region_South_America, 
    u_Diphtheria_Polio, u_Under_five_Infant_deaths, u_Thinness_combined]
    
    # Scaling user inputs
    user_record[columns_to_scale] = scaler.fit_transform(user_record[columns_to_scale])
    
    # Generates predicted life expectancy
    y_pred = results.predict(user_record)
    
    # Print Result
    print('\nAverage life expectancy of country: ' ,round(y_pred[0], 1), 'years'
         '\n          R\u00B2 = ', round(results.rsquared, 3), 
         '\n   Cond. No. = ', round(results.condition_number, 2))       

In [7]:
# Function that doesn't use sensitive data
def sensitiveless_data():
    
    # ----------------------------------- Modelling data ---------------------------------------------
    df_WHO = df.drop(columns=['Life_expectancy'])     # creating dataframe with desired columns
    X = df_WHO                      # X dataframe (features)
    y = df['Life_expectancy']       # y series (Target)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 423)      #Train-Test split
    
    def feature_eng(df):
        # Creating a copy of the DataFrame
        df = df.copy()
        # One-hot enconding the 'Region' column and dropping it afterwards
        df = pd.get_dummies(df, columns=['Region'], drop_first=True, prefix='Region',dtype=int)
        # Feature enrichment - creating new columns from combining highly correlated features
        df['log_population'] = np.log1p(X['Population_mln'])  # log(1 + x) to avoid log(0)
        # dropping the sensitive and irrelevent columns
        df.drop(columns=['Country','Year','Population_mln', 'Infant_deaths', 'Under_five_deaths', 'Hepatitis_B','Polio', 'Diphtheria','Incidents_HIV','Thinness_ten_nineteen_years',
                  'Thinness_five_nine_years','Schooling','Economy_status_Developed', 'Economy_status_Developing'], inplace=True)
        # adding a constant
        df = sm.add_constant(df)

        return df  # returning the DatFrame
    
    X_train_fe = feature_eng(X_train)
    
    # Creating a list of columns to scale
    columns_to_scale = ['Adult_mortality', 'Alcohol_consumption','Measles','BMI','GDP_per_capita']
    # Initialize scaler
    scaler = RobustScaler()
    # Fit on train and transform both sets
    X_train_fe[columns_to_scale] = scaler.fit_transform(X_train_fe[columns_to_scale])
    
    lin_reg = sm.OLS(y_train, X_train_fe)          #Fitting
    results = lin_reg.fit()
    # ------------------------------------------------------------------------------------------------ 

    # User Inputs             
    u_Adult_mortality             = valid_input('\nPlease input the following:\nAdult (15-60 years) mortality rate per 1000: ', float, 0, 1000)        
    u_Alcohol_consumption         = valid_input('Alcohol consumption in litres per captia: ', float, 0, 1000)                           
    u_BMI                         = valid_input('National average BMI: ', float, 0, 250)               
    u_GDP_per_capita              = valid_input('GDP per capita (USD): ', int, 0, 1000000)       
    u_Population_mln              = valid_input('Population of country in millions: ', float, 0, 4000)            
    u_Measles                     = valid_input('Measles cases per 1000 population: ', float, 0, 1000)
    u_region                      = valid_region('What region is this country from: ')
    u_log_Population              = np.log1p(u_Population_mln)
    u_Region_Asia                 = 0
    u_Region_Central_America      = 0  
    u_Region_European_Union       = 0
    u_Region_Middle_East          = 0 
    u_Region_North_America        = 0
    u_Region_Oceania              = 0 
    u_Region_Rest_of_Europe       = 0
    u_Region_South_America        = 0 
    
    # converting user input into OHE value
    if u_region == 'asia':
        u_Region_Asia = 1
    if u_region == 'central america and caribbean':
        u_Region_Central_America = 1
    if u_region == 'european union':
        u_Region_European_Union = 1
    if u_region == 'middle east':
        u_Region_Middle_East = 1
    if u_region == 'north america':
        u_Region_North_America = 1
    if u_region == 'oceania':
        u_Region_Oceania = 1
    if u_region == 'rest of europe':
        u_Region_Rest_of_Europe = 1
    if u_region == 'south america':
        u_Region_South_America = 1
    
    # Make empty dataframe for user inputs
    user_record = pd.DataFrame(columns=['const','Adult_mortality', 'Alcohol_consumption','BMI', 'GDP_per_capita', 'Measles', 'log_Population', 
    'Region_Asia', 'Region_Central_America_and_Caribbean', 'Region_European_Union', 'Region_Middle_East', 'Region_North_America', 'Region_Oceania', 'Region_Rest_of_Europe', 'Region_South_America', ])
    
    # Add user inputs
    user_record.loc[0] = [1.0, u_Adult_mortality, u_Alcohol_consumption, u_BMI, u_GDP_per_capita, u_Measles, u_log_Population, 
    u_Region_Asia, u_Region_Central_America, u_Region_European_Union, u_Region_Middle_East, u_Region_North_America, u_Region_Oceania, u_Region_Rest_of_Europe, u_Region_South_America ]
    
    # Scaling user inputs
    user_record[columns_to_scale] = scaler.fit_transform(user_record[columns_to_scale])
    
    # Generates predicted life expectancy
    y_pred = results.predict(user_record)
    
    # Print Result
    print('\nAverage life expectancy of country: ' ,round(y_pred[0], 1), 'years'
         '\n          R\u00B2 = ', round(results.rsquared, 3), 
         '\n   Cond. No. = ', round(results.condition_number, 2))

## Life Expectancy Function
This is the main function. It first asks the user which model they would like to run. Then , depending on the user input, it will run one of the two model functions above.

In [8]:
# Main function that asks the user which version to run
def life_expectancy_est():
    concent = valid_string('Do you consent to using advanced population data, which may include protected information, for improved accuracy? (Yes or No) ')
    if concent.lower() == 'yes':
        print('Running Expanded Model')
        return all_data()
    elif concent.lower() == 'no':
        print('Running Sensitive Model ')
        return sensitiveless_data()
    else:
        raise Exception('An unknown error has occurred')

# The Life Expectancy Estimator
### Please run the cell and fill out the prompts.

In [9]:
# run the life expectancy estimator 
life_expectancy_est()

Do you consent to using advanced population data, which may include protected information, for improved accuracy? (Yes or No)  fdsbkfbhbfafbh


You must respond "Yes" or "No"


Do you consent to using advanced population data, which may include protected information, for improved accuracy? (Yes or No)  yes


Running Expanded Model



Please input the following:
Number of Infant Deaths per 1000 population:  11
Number of deaths (0-4 years) per 1000 population:  heraeagr


Invalid input! Please enter a number


Number of deaths (0-4 years) per 1000 population:  11
Adult (15-60 years) mortality rate per 1000:  111
Alcohol consumption in litres per captia:  1
Hepatitis B immunisation coverage in 1-year olds (%):  90
Measles cases per 1000 population:  90
National average BMI:  25
Polio immunisation coverage in 1-year olds (%):  90
Diphtheria immunisation coverage in 1-year olds (%):  90
HIV related deaths (0-4 years) per 1000 live births:  0.1
GDP per capita (USD):  11111
Prevalence of thinness among children Aged 10 to 19 (%):  5
Prevalence of thinness among children Aged 5 to 9 (%):  5
Average years of schooling:  8
Population of country in millions:  11
Is this country developed? (Yes or No)  no
What region is this country from:  asia



Average life expectancy of country:  71.1 years
          R² =  0.966 
   Cond. No. =  149.34
