# Example of Linear Regression using GapMinder Data

As a very simple example of machine learning, we're going to see if we can predict life expectancy using some data from [GapMinder](https://www.gapminder.org/tools/#_data_/_lastModified:1526038652718&lastModified:1526038652718;&chart-type=bubbles), an organisation that aims to educate us about the true state of the world.
The input data set will have the following variables:
- Country - 136 countries are included 
- Continent
- Life expectancy
- GDP per capita, PPP, inflation adjusted
- Healthcare spend as a percentage of GDP
- Population per square km 
- The democracy index of the country (high is better). See https://en.wikipedia.org/wiki/Democracy_Index

Go to [GapMinder](https://www.gapminder.org/tools/#_data_/_lastModified:1526038652718&lastModified:1526038652718;&chart-type=bubbles) now and take a look at some of these variables in the data viewer.  
- Do any of the variables (visually) look like they have a bearing on life expectancy?  
- Are any of them surprising?
- We want the relationship between the variables to be more-or-less a straight line.  Does a log-linear display help? 
<img src="files/Screen Shot from gapminder.png" alt="GDP per Capita vs. Life Expectancy from Gapminder" title="GDP per Capita vs. Life Expectancy from Gapminder" style="width: 50pc;"/>

# Basic Linear Regression Theory
- Straight line equation (indicate dependent vs. independent variable)
- Least squares
- Expand to multiple dimensions
- Link to more detailed explanation
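
As a rough illustration of the points above, here is a minimal sketch using made-up numbers (not the GapMinder data). The straight line is `y = intercept + slope * x`, where `x` is the independent variable and `y` is the dependent variable. Least squares chooses the coefficients that make the sum of squared differences between the predicted and actual `y` as small as possible. With more than one independent variable, the same idea applies with one coefficient per column. All variable names here are chosen for the example only.

```python
import numpy as np

# Made-up illustration data: y is roughly 2 + 3*x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)

# Design matrix: a column of ones (the constant) next to the independent variable
A = np.column_stack([np.ones_like(x), x])

# Least squares: find the coefficients that minimise the sum of squared residuals
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coeffs

# Multiple dimensions: simply add more columns to the design matrix
x2 = rng.normal(size=x.size)
y_multi = 1.0 + 3.0 * x - 2.0 * x2 + rng.normal(scale=0.5, size=x.size)
A_multi = np.column_stack([np.ones_like(x), x, x2])
coeffs_multi, *_ = np.linalg.lstsq(A_multi, y_multi, rcond=None)

print("straight line: intercept=%.2f slope=%.2f" % (intercept, slope))
print("two variables:", np.round(coeffs_multi, 2))
```

The fitted coefficients should come out close to the values used to generate the data, but not exactly equal, because of the added noise.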

# How do we know which variables to use?
When we have a number of variables, it may not be immediately apparent which ones influence the final result.  We will usually find that we can, and should, drop one or more independent variables because:
- We don't want to include variables that aren't significant
<img src="files/GarbageInGarbageOut.png" alt="Garbage In - Garbage out" title="Garbage In - Garbage Out" style="width: 25pc;"/>
- When you are presenting your results you will have to explain why you included a variable; "because I had the data" is not a good enough reason!

There are several ways to achieve this, but we will concentrate on "backward elimination".  To understand this you need to know about the "P" value.  A proper explanation is [here](https://www.mathbootcamps.com/what-is-a-p-value) but for now we'll just say that the lower the P value is, the more significant the variable.

## Backward elimination
1. Choose a maximum P value: 5% is a good value
2. Run the linear regression using all the independent variables.
3. Look at the P values from the output, and choose the biggest.
4. If it is greater than 5%, drop that variable.
5. Rerun the linear regression.
6. Repeat steps 3-5 until all P-values are < 5%.

#### I will take you up to step 5, then the rest will be up to you!

# Coding
## Importing the Dataset
First we need to import some libraries:

In [2]:
# Standard library for numerical analysis
import numpy as np
# Standard library for data manipulation
import pandas as pd
# User-defined library for making plots
import sys
sys.path.append("python")
from plot_functions import make_plot

Next we will import the dataset with data for 2007, which has the following columns:
- Country - 136 countries are included
- Continent
- lifeexp - Life expectancy
- gdpPercap - GDP per capita, PPP, inflation adjusted
- health_spend - Healthcare spend as a percentage of GDP
- Pop density - population per square km
- Democracy - The democracy index of the country

In [None]:
dataset = pd.read_csv('https://raw.githubusercontent.com/DataForGood-Norway/GirlsCanDoIt/master/MachineLearning/Lab/datasets/gapminder_2007_emma.csv')

### Here is how the data looks:

<img src="files/Screen Shot dataset.png" alt="Data_table" title="Data table" style="width: 50pc;"/>



### Data preparation
To get ready for the regression we need to make sure that all the columns contain numbers only:
- Cells that are empty currently contain "Not a Number" (nan) - these will be replaced with an average
- Cells that contain words ("categorical variables") will be replaced with numerical values.

But first the data needs to be split into independent and dependent variables:

In [None]:
# Country, continent, population, GDP per cap, healthcare spend, pop density, democracy
X = dataset.iloc[:,[0,1,3,4,5,6,7]].values
# Will use the log of the GDP per capita because, from our first look, there seems to be a log-linear relationship
X[:,3] = np.log(dataset.iloc[:,4].values)
# The dependent variable is life expectancy, in column 2
y = dataset.iloc[:, 2].values
# Also making a list of continents for plotting purposes
cont = dataset.iloc[:,1].values

## Cleaning
- The cells where data is missing have "Not a Number" in them. The following code replaces these with the average for that column

In [None]:
# Fix the "NaNs" - replace with the average of that column
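# sklearn contains a module for filling in missing data
# (the older Imputer class in sklearn.preprocessing has been removed from recent versions)
from sklearn.impute import SimpleImputer
# Create object: missing values are replaced with the mean of their column
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
# Fit imputer object to the numeric columns of X (column 2 onwards)
imputer = imputer.fit(X[:,2:].astype(float))
# Replace missing data with the column averages
X[:,2:] = imputer.transform(X[:,2:].astype(float))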
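
To see what replacing missing values with the column average does, here is a tiny sketch with made-up numbers. It uses plain numpy rather than scikit-learn, and the names `values`, `col_means` and `filled` are for illustration only.

```python
import numpy as np

# Made-up table: each row is a country, each column is a numeric variable
values = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

# Column averages, calculated while ignoring the missing cells
col_means = np.nanmean(values, axis=0)

# Put each column's average into that column's missing cells
filled = np.where(np.isnan(values), col_means, values)
print(filled)
```

The missing cell in the first column becomes 2.0 (the average of 1 and 3) and the missing cell in the second column becomes 15.0 (the average of 10 and 20).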

## Categorical variables
Categorical variables are ones which have a name, not a number.  In this case that would be "Country name" and "Continent".

__Country name__: We only have each country listed once, and it makes no sense to use the name of the country in the regression, so we are simply going to number them 0-135

In [None]:
# now importing LabelEncoder class
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Encoding the Independent Variable
# Firstly assign the categorical variables a unique number using LabelEncoder
labelencoder_X = LabelEncoder()
# Country names are in column zero
# Each country will get a number 0-135 (for the full dataset)
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

__Continent__: There are 5 continents in this dataset: Africa, Americas, Asia, Europe and Oceania.  If we simply give them a number 0-4, the regression will treat Oceania (4) as "greater than" all the other continents, which makes no sense.  To fix this we use "one-hot encoding", which works like this:
- Number the continents 0-4
- Create a column for each continent -- these new columns are called __dummy variables__
- If the country is in that continent the column contains "1" otherwise it contains "0"
- Finally, since any of the five columns could be predicted from the other four (known as the __dummy variable trap__) we need to drop one.
    - Normally it would be the first column, but since Africa looks quite interesting in this dataset, I decided to drop Oceania
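
The steps above can be tried out on a handful of made-up rows. This sketch uses `pandas.get_dummies` instead of the scikit-learn encoder, purely because it shows the resulting columns by name. The `continents` series is invented for the example.

```python
import pandas as pd

# Made-up rows, one per country
continents = pd.Series(['Africa', 'Europe', 'Asia', 'Oceania', 'Americas', 'Africa'])

# One column per continent: 1 if the country is in that continent, otherwise 0
dummies = pd.get_dummies(continents).astype(int)

# Avoid the dummy variable trap by dropping one column (Oceania, as in the text)
dummies = dummies.drop(columns='Oceania')
print(dummies)
```

After dropping the column, a country in Oceania is represented by zeros in all four remaining columns, so no information is lost.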

In [None]:
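# now importing OneHotEncoder and ColumnTransformer classes
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# Each continent will get a number 0-4
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])
cont = labelencoder_X.fit_transform(cont)

# one hot encode continents (column 1); the new columns are placed first
# (the categorical_features argument has been removed from recent sklearn versions)
onehotencoder = ColumnTransformer([('continent', OneHotEncoder(), [1])],
                                  remainder = 'passthrough', sparse_threshold = 0)
X = onehotencoder.fit_transform(X).astype(float)
# Remove a continent (dummy variable trap)
# The obvious one to drop is Africa as it's the first column
# However this is also the most interesting
# so dropping Oceania (column 4) instead
X = X[:,[0,1,2,3,5,6,7,8,9,10]]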

We need to add a column of ones at the beginning, as the regression requires a constant:

In [None]:
# Add a column of ones to the beginning.  
# This is because the following procedure 
# will otherwise not have a constant in the regression model
X = np.append(arr = np.ones((len(X),1)).astype(int), values = X, axis = 1)



## Splitting into testing and training set
This is very important!  We will first try to create the model from the training set and then apply it to the test set to see if the model has worked.

In [None]:

# Splitting the dataset into the Training set and Test set
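# (train_test_split now lives in sklearn.model_selection; sklearn.cross_validation has been removed)
from sklearn.model_selection import train_test_split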
X_train, X_test, y_train, y_test, cont_train, cont_test = train_test_split(X, y, cont, test_size = 0.2, random_state = 0)

"""
Columns are now
0: constant, 1:is Africa 2: is Americas 3: is Asia 4:is Europe
5: country 6:population 7:log(GDP per cap) 
8:health spend 9:pop density 10: democracy score
"""
# The column names must be in the same order as the columns listed above
X_train = pd.DataFrame(X_train,columns = ['const', 'africa','americas','asia','europe','country',
                                          'population','logGDP','health_spend','pop_density','democracy'])
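
The point of holding back a test set is to check the model on data it has not seen. A minimal sketch of that check follows, using made-up data and hypothetical `_demo` names so that it does not overwrite any of the variables above. The score used here is R squared, where 1.0 is a perfect fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: three independent variables, the third has no real effect
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 3))
y_demo = 70.0 + X_demo @ np.array([4.0, -2.0, 0.0]) + rng.normal(size=100)

# Hold back 20% of the rows for testing
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Train on the training rows only
model_demo = LinearRegression().fit(Xd_train, yd_train)

# Compare R squared on data the model has seen with data it has not
r2_train = model_demo.score(Xd_train, yd_train)
r2_test = model_demo.score(Xd_test, yd_test)
print("R^2 train: %.3f, test: %.3f" % (r2_train, r2_test))
```

If the test score is much lower than the training score, the model has probably learned details of the training rows that do not generalise.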

## Training the regressor
This is where we get to the interesting part!

Firstly we will train the regressor using all the variables and see how it performs:


In [None]:
"""Start the training
"""

#train the regressor
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)
#Make predictions
y_pred = regressor.predict(X_train)

#plot the results
# X_train is a DataFrame, so columns are selected by position with .iloc
gdp_per_cap = np.exp(X_train.iloc[:,7].values)
make_plot('GDP per cap',gdp_per_cap,y_train,y_pred,cont_train)
#plot the results
make_plot('Health spend',X_train.iloc[:,8].values,y_train,y_pred,cont_train)


#### What can we say about the above plots?  
#### Is there a continent that stands out from the rest?

## Start backward elimination
As a reminder:
1. Choose a maximum P value: 5% is a good value
2. Run the linear regression using all the independent variables.
3. Look at the P values from the output, and choose the biggest.
4. If it is greater than 5%, drop that variable.
5. Rerun the linear regression.
6. Repeat steps 3-5 until all P-values are < 5%.

#### The code below is the first step and then you will have a chance to get involved and complete the other steps.


In [None]:
#Start backward elimination
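# OLS is provided by statsmodels.api (not statsmodels.formula.api)
import statsmodels.api as sm
# X_train is a DataFrame, so columns are selected by position with .iloc
X_opt = X_train.iloc[ : , [0,1,2,3,4,5,6,7,8,9,10]]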
regressor_OLS = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_OLS.summary()

#### Look at the table above.  We are particularly interested in the P-values for each variable.  What does this say about the significance of the different variables?  Are any of them surprising?

#### You can plot the results using the user-defined function make_plot, which sets some of the display parameters like colour and titles, without you having to type it all in every time.  The format is:
make_plot('independent_variable',x_data,actual_data,predicted_data,continents)

In [None]:
#Make predictions
regressor_new = LinearRegression()
regressor_new.fit(X_opt,y_train)
# Predict with the newly trained regressor, using the same columns it was trained on
y_pred = regressor_new.predict(X_opt)

#plot the results
gdp_per_cap = np.exp(X_train.iloc[:,7].values)
make_plot('GDP per cap',gdp_per_cap,y_train,y_pred,cont_train)
#plot the results
make_plot('Health spend',X_train.iloc[:,8].values,y_train,y_pred,cont_train)