# Linear regression
The goal of regression is to predict continuous values with a trained model. We want to find a relationship between one or more features and a specific output variable, which we call 'y'.

This relationship is a hyperplane of the form $y=\theta_nX_n+\dots+\theta_1X_1+\theta_0$. The vector $X$ holds the features and the values $\theta$ are called the parameters.
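
As a minimal sketch of this formula, a prediction is the dot product of the feature vector with the weights, plus the intercept $\theta_0$. The parameter and feature values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical parameters: theta_0 is the intercept, theta holds theta_1..theta_3
theta_0 = 2.0
theta = np.array([0.5, -1.0, 3.0])

# One sample with three features X_1..X_3 (made-up values)
X = np.array([4.0, 1.0, 2.0])

# y = theta_3*X_3 + theta_2*X_2 + theta_1*X_1 + theta_0
y = X @ theta + theta_0
print(y)
```

Training a linear regression model amounts to finding the values of `theta` and `theta_0` that fit the training data best.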

The goals of this exercise are:
* Read in data and process it using the pandas library
* Perform simple statistical analyses on the data to detect inconsistencies and to see correlations between features and/or the target
* Correctly split the data into a training and a test set
* Normalize data
* Train a regression model
* Evaluate a regression model
* Detect over- and underfitting
* Apply L1 and L2 regularisation to reduce overfitting


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt                                             # To create plots
import seaborn as sns                                                       # Make seaborn plots
import numpy as np                                                          # To perform calculations quickly
import pandas as pd                                                         # To load in and manipulate data
from sklearn.linear_model import LinearRegression, Lasso, Ridge             # Linear models (plain, L1- and L2-regularised)
from sklearn.model_selection import train_test_split                        # Splitting in train and test set
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score # Metrics used to test the model
from sklearn.metrics import roc_curve                                       # Used to create ROC_curve
from sklearn.preprocessing import PolynomialFeatures                        # Used to construct higher order features
from sklearn.preprocessing import StandardScaler,MinMaxScaler, RobustScaler # Different scalers that can be used

## Life expectancy
The Global Health Observatory (GHO) data repository of the World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis. The dataset on life expectancy and health factors for 193 countries was collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website.

More information: https://www.kaggle.com/augustus0498/life-expectancy-who

**Metadata**
* Country - Country
* Year - Year
* Status - Developed or Developing status
* Lifeexpectancy - Life expectancy in years
* AdultMortality - Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
* infantdeaths - Number of Infant Deaths per 1000 population
* Alcohol - Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
* percentageexpenditure - Expenditure on health as a percentage of Gross Domestic Product per capita(%)
* HepatitisB - Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
* Measles - Measles - number of reported cases per 1000 population
* BMI - Average Body Mass Index of entire population
* under-fivedeaths - Number of under-five deaths per 1000 population
* Polio - Polio (Pol3) immunization coverage among 1-year-olds (%)
* Totalexpenditure - General government expenditure on health as a percentage of total government expenditure (%)
* Diphtheria - Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
* HIV/AIDS - Deaths per 1 000 live births HIV/AIDS (0-4 years)
* GDP - Gross Domestic Product per capita (in USD)
* Population - Population of the country
* thinness1-19years - Prevalence of thinness among children and adolescents for Age 10 to 19 (%)
* thinness5-9years - Prevalence of thinness among children for Age 5 to 9 (%)
* Incomecompositionofresources - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
* Schooling - Number of years of Schooling(years)

In [None]:
# Read the dataset from csv
data = pd.read_csv("data/")
# Show first 10 rows of the dataframe


In [None]:
# Show the number of rows and columns in the df (the shape of the df)

# Summarize the dataframe (describe)


In [None]:
# Due to a lot of missing data in the Population column, we will remove this column

# Also remove the Country column


In [None]:
# Remove all rows with NA values (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)

# Take a new look at the shape of the df (to see if/how many rows were removed)


In [None]:
# Take a look at all possible values of the status column
# Hint: use the "unique" function from pandas 


In [None]:
# Based on the information above: replace the possible values with 0 and 1
# Hint: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
# Take a look at dict-like replacement


In [None]:
# Remove outliers based on the z-score
# The z-score gives the number of standard deviations between a value and the mean of its column
# In this example: remove rows with a value more than 5 standard deviations (|z-score| > 5) from the mean

# Use the following function to determine the z-score for each value
from scipy.stats import zscore

# Remove the outliers (example code); np.abs also catches outliers far below the mean
data_no_outliers = data[(np.abs(zscore(data)) <= 5).all(axis=1)]

# Take a look at the new shape of the dataframe
data_no_outliers.shape
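
A z-score can be negative, so a filter has to look at its absolute value to catch outliers on both sides of the mean. The sketch below illustrates this on a made-up toy dataframe (the column names `a` and `b` and all values are assumptions for illustration, not part of the life expectancy data).

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy data (made up): 200 ordinary values plus one extreme low and one extreme high value
values = [10.0] * 100 + [12.0] * 100 + [-500.0, 500.0]
toy = pd.DataFrame({"a": values, "b": np.arange(len(values), dtype=float)})

# Column-wise z-scores (axis=0 is the default)
z = zscore(toy.to_numpy())

# One-sided filter: only removes values far ABOVE the mean
one_sided = toy[(z <= 5).all(axis=1)]

# Two-sided filter: removes values far above AND far below the mean
two_sided = toy[(np.abs(z) <= 5).all(axis=1)]

print(toy.shape, one_sided.shape, two_sided.shape)
```

In this toy example the one-sided filter still keeps the row containing -500, while the two-sided filter removes both extreme rows.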

In [None]:
# Look at the correlations between multiple features by displaying a correlation plot (in heatmap form)


# Formulate a few conclusions based on this plot.
# Which features have a strong correlation with Lifeexpectancy?
# Which features are strongly correlated with each other?

# If you had to remove a few features, which ones would you choose and why?

In [None]:
# Also take a look at the pairplot between features. (Takes some time to generate)



#Formulate a few conclusions based on this plot

In [None]:
# Split the data into features and targets



In [None]:
# Create dummy columns for the Year column
# We do this because we would like the Year column to be treated as a categorical column
# Take a look at https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
# Specify the year column with columns=["Year"] parameter
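
As a rough illustration of what dummy columns look like, the sketch below applies `pd.get_dummies` to a small made-up dataframe (the values are invented, and `toy_features` is a hypothetical name). Each distinct year becomes its own indicator column and the original `Year` column disappears.

```python
import pandas as pd

# Toy features (made-up values), with Year stored as an integer
toy_features = pd.DataFrame({
    "Year": [2000, 2001, 2000, 2002],
    "Alcohol": [1.2, 3.4, 0.5, 2.2],
})

# One indicator column per distinct year; the original Year column is dropped
encoded = pd.get_dummies(toy_features, columns=["Year"])
print(encoded.columns.tolist())
```

Without this step, a linear model would treat the year as a number and could only learn a single straight-line trend over time.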



In [None]:
# Take a look at the first rows of the features df


In [None]:
# Split the data into a training and a test set
# Take about 20% of the data as test set
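
A minimal sketch of a train/test split on made-up arrays is shown below. The names `X_toy` and `y_toy` are placeholders, and fixing `random_state` is an optional choice made here only so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 samples with 2 features each
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```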


In [None]:
# Create a linear regression model and fit the data



In [None]:
# Predict new values for the test set


In [None]:
# Evaluate the model using mean absolute error, mean squared error, R2
# Check to see if you have under- or overfitting by also calculating these scores for the training set

# Formulate a conclusion based on this model: is it a good model, or is it over- or underfitted?
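
One way to compare training and test performance is sketched below on synthetic data (the data, its coefficients, and the `scores` dictionary are assumptions for illustration, not results from the life expectancy dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (made up): a linear relationship with a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Compute the same metrics on both splits so they can be compared side by side
scores = {}
for split, (X_part, y_part) in {"train": (X_train, y_train), "test": (X_test, y_test)}.items():
    pred = model.predict(X_part)
    scores[split] = {
        "mae": mean_absolute_error(y_part, pred),
        "mse": mean_squared_error(y_part, pred),
        "r2": r2_score(y_part, pred),
    }
    print(split, scores[split])
```

As a rule of thumb, similar scores on both splits suggest the model generalises, much better scores on the training set than on the test set point to overfitting, and poor scores on both point to underfitting.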


## Model optimization
Until now we have not applied any optimizations such as normalization or regularization.

Try different regularization techniques (Ridge/Lasso), and also experiment with the alpha values of the models.
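
A possible starting point is sketched below on synthetic data (the data, the alpha grid, and the use of `make_pipeline` are choices made for this illustration, not a prescribed solution). Scaling is combined with the model in a pipeline so that the scaler is fitted on the training data only.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): 5 features on very different scales, only 2 of them matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) * np.array([1.0, 10.0, 100.0, 1.0, 1.0])
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Try both regularised models over a small, arbitrary grid of alpha values
results = {}
for name, Model in [("ridge", Ridge), ("lasso", Lasso)]:
    for alpha in [0.01, 1.0, 100.0]:
        model = make_pipeline(StandardScaler(), Model(alpha=alpha))
        model.fit(X_train, y_train)
        results[(name, alpha)] = r2_score(y_test, model.predict(X_test))

for key, r2 in results.items():
    print(key, round(r2, 3))
```

A very large alpha shrinks the coefficients so strongly that the model underfits; with Lasso, coefficients can become exactly zero.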

In [None]:
# Try different possibilities yourself

### Also experiment with higher-order features
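
As a small sketch of what higher-order features are, `PolynomialFeatures` expands each sample with powers and products of its features. The example below uses made-up data in which the target depends on the square of a single feature, so a plain linear model cannot fit it but a model on degree-2 features can.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion of one sample [x0, x1] -> [x0, x1, x0^2, x0*x1, x1^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(np.array([[2.0, 3.0]]))
print(expanded)

# Made-up data with a purely quadratic relationship
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = x[:, 0] ** 2

# A straight line cannot capture the curve
r2_linear = r2_score(y, LinearRegression().fit(x, y).predict(x))

# The same linear model on degree-2 features can
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
r2_poly = r2_score(y, LinearRegression().fit(x_poly, y).predict(x_poly))

print(round(r2_linear, 3), round(r2_poly, 3))
```

The number of generated features grows quickly with the degree and the number of input columns, which makes overfitting more likely, so it is worth combining higher-order features with regularisation.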

In [None]:
# Add in some higher order features and train some models