### Seismic section (to move to other presentation)
<img src="http://vbpr.no/wp-content/uploads/2016/09/PCable-Norwegian-Sea-seismic-sample.jpg" alt="Seismic section" title="Seismic Section" style="width: 50pc;"/>

# Example of Linear Regression using GapMinder Data

# Basic Linear Regression Theory
- Straight line equation (indicate dependent vs. independent variable)
- Least squares
- Expand to multiple dimensions
- Link to more detailed explanation
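
As a minimal sketch of the points above, the snippet below fits a straight line `y = intercept + slope * x` by least squares with NumPy. The data points are made up for illustration and lie exactly on a known line, so the recovered coefficients can be checked by eye. Adding more columns to the design matrix `A` extends the same call to multiple dimensions.

```python
import numpy as np

# Made-up data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # independent variable
y = 1.0 + 2.0 * x                          # dependent variable

# Design matrix: a column of ones (for the intercept) plus the x values
A = np.column_stack([np.ones_like(x), x])

# Least squares: find the coefficients minimising ||A @ coef - y||^2
coef, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```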

# How do we know which variables to use?
## Backward elimination
1. Choose a maximum P-value: 5% is a common choice.
2. Run the linear regression using all the independent variables.
3. Look at the P-values in the output and find the largest.
4. If it is greater than 5%, drop that variable.
5. Rerun the linear regression.
6. Repeat steps 3-5 until all P-values are below 5%.
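
The steps above can be sketched as a small loop. This is an illustration under stated assumptions, not the code used later in this notebook: the data is synthetic, the helper names `ols_p_values` and `backward_elimination` are hypothetical, and the P-values use a normal approximation to the t distribution (reasonable for large samples) so that only NumPy is needed. A statistics package such as statsmodels reports exact P-values.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Fit ordinary least squares and return approximate two-sided p-values."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)             # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # covariance of the coefficients
    t_stats = beta / np.sqrt(np.diag(cov))
    # Assumption: normal approximation to the t distribution (large n)
    return np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

def backward_elimination(X, y, names, max_p=0.05):
    """Drop the variable with the largest p-value until all are <= max_p."""
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_p_values(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= max_p:
            break                                # every variable is significant
        del keep[worst]                          # drop it and rerun
    return [names[i] for i in keep]

# Synthetic data: y depends on x1 only; x2 is pure noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x1, x2])

selected = backward_elimination(X, y, ["const", "x1", "x2"])
print(selected)
```

The constant and `x1` are always retained here because their effect is large relative to the noise. Whether `x2` survives depends on the random draw: a pure-noise variable passes a 5% threshold about one time in twenty.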

# Coding
## Importing the Dataset
First, we need to import some libraries:

In [6]:
# Standard library for numerical analysis
import numpy as np
# Standard library for data manipulation
import pandas as pd
# User-defined library for making plots
from plot_functions import make_plot

Next we will import the dataset with data for 2007, which has the following columns:
- Country - 136 countries are included
- Continent
- lifeexp - Life expectancy
- Population
- gdpPercap - GDP per capita, PPP, inflation adjusted
- health_spend - Healthcare spend as a percentage of GDP
- Pop density - population per square km
- Democracy - The democracy index of the country (higher is better). See https://en.wikipedia.org/wiki/Democracy_Index

In [2]:
dataset = pd.read_csv('datasets/gapminder_2007_emma.csv')

Take a look at this data at the [original site](https://www.gapminder.org/tools/#_data_/_lastModified:1526038652718&lastModified:1526038652718;&chart-type=bubbles)
Do any of the variables (visually) look like they have a bearing on life expectancy?  Are any of them surprising?
<img src="files/Screen Shot from gapminder.png" alt="GDP per Capita vs. Life Expectancy from Gapminder" title="GDP per Capita vs. Life Expectancy from Gapminder" style="width: 50pc;"/>

### Here is how the data looks:

<img src="files/Screen Shot dataset.png" alt="Data_table" title="Data table" style="width: 50pc;"/>



The data now needs to be split into independent and dependent variables:

In [3]:
# Country, continent, population, GDP per cap, healthcare spend, pop density, democracy
X = dataset.iloc[:,[0,1,3,4,5,6,7]].values
# Use the log of GDP per capita because, from our first look, there seems to be a log-linear relationship
X[:,3] = np.log(dataset.iloc[:,4].values)
# The dependent variable is life expectancy, in column 2
y = dataset.iloc[:, 2].values
# Also making a list of continents for plotting purposes
cont = dataset.iloc[:,1].values
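
Selecting columns by position works, but it breaks silently if the column order in the file changes. As a hedged alternative, the sketch below does the same split by column name on a toy DataFrame. The values are made up, and the column names are taken from the list above; the headers in the real CSV may differ.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV; values are made up for illustration
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Continent": ["Africa", "Asia", "Europe"],
    "lifeexp": [55.0, 70.0, 80.0],
    "gdpPercap": [1000.0, 10000.0, 40000.0],
    "health_spend": [4.0, 5.5, 9.0],
})

# Independent variables, selected by name rather than by position
features = toy[["Country", "Continent", "gdpPercap", "health_spend"]].copy()
# Replace GDP per capita with its natural log
features["gdpPercap"] = np.log(features["gdpPercap"])

X_toy = features.values
y_toy = toy["lifeexp"].values   # dependent variable
```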

## Cleaning
- The cells where data is missing contain "Not a Number" (NaN). The following code replaces each one with the mean of its column.

In [4]:
# Replace missing values ("NaN") with the mean of that column
# sklearn.impute provides SimpleImputer
# (the older sklearn.preprocessing.Imputer has been removed from scikit-learn)
from sklearn.impute import SimpleImputer
# Create the imputer object; it works column by column
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
# Fit the imputer to the numeric columns of X (column 2 onwards)
imputer = imputer.fit(X[:,2:].astype(float))
# Replace the missing entries with the column means
X[:,2:] = imputer.transform(X[:,2:].astype(float))
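
To see what mean imputation does, here is a minimal sketch on a toy table using pandas only. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy table with one missing value in each column
toy = pd.DataFrame({
    "health_spend": [4.0, np.nan, 8.0],
    "democracy": [6.0, 7.0, np.nan],
})

# Column means; NaN entries are skipped by default
col_means = toy.mean()

# Fill each missing cell with the mean of its own column
filled = toy.fillna(col_means)
print(filled)
```

The missing `health_spend` value becomes 6.0 (the mean of 4.0 and 8.0) and the missing `democracy` value becomes 6.5 (the mean of 6.0 and 7.0).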

## Categorical variables
Categorical variables are ones which have a name, not a number.  In this case that would be "Country name" and "Continent".

__Country name__: We only have each country listed once, and it makes no sense to use the name of the country in the regression, so we are simply going to number them 0-135.

__Continent__: There are 5 continents in this dataset: Africa, Americas, Asia, Europe and Oceania.  If we simply give them a number 0-4, the regression will treat Oceania as "greater" than all the other continents, which makes no sense.  To fix this we use "one-hot encoding", which works like this:
- Number the continents 0-4
- Create a column for each continent -- these new columns are called "dummy variables"
- If the country is in that continent the column contains "1" otherwise it contains "0"
- Finally, since any of the five columns could be predicted from the other four (known as the "dummy variable trap") we need to drop one.
    - Normally it would be the first column, but since Africa looks quite interesting in this dataset, I decided to drop Oceania
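
The steps above can be sketched with `pandas.get_dummies` on a toy table. This is an illustration only; the notebook itself uses scikit-learn for the encoding.

```python
import pandas as pd

# Toy table with one country per continent
toy = pd.DataFrame({
    "Country": ["Kenya", "Brazil", "Japan", "France", "Australia"],
    "Continent": ["Africa", "Americas", "Asia", "Europe", "Oceania"],
})

# One 0/1 column per continent (the "dummy variables")
dummies = pd.get_dummies(toy["Continent"], dtype=int)

# Drop one column to avoid the dummy variable trap (here: Oceania)
dummies = dummies.drop(columns="Oceania")
print(dummies)
```

A country in Oceania is now represented by zeros in all four remaining columns, so no information is lost by dropping the fifth.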


In [7]:
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Firstly assign the categorical variables a unique number using LabelEncoder
labelencoder_X = LabelEncoder()
# Country names are in column zero
# Each country will get a number 0-135 (for the full dataset)
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])
cont = labelencoder_X.fit_transform(cont)
# if excluding Africa
#cont = cont + 1
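# One-hot encode the continents (column 1)
# OneHotEncoder no longer accepts categorical_features, so ColumnTransformer selects the column
from sklearn.compose import ColumnTransformer
onehotencoder = ColumnTransformer([('continent', OneHotEncoder(), [1])],
                                  remainder = 'passthrough', sparse_threshold = 0)
X = np.asarray(onehotencoder.fit_transform(X), dtype = float)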
#Remove a continent (dummy variable trap)
# The obvious one to drop is Africa as it's the first column
# However this is also the most interesting
# so dropping Oceania instead
X = X[:,[0,1,2,3,5,6,7,8,9,10]]

#Add a column of ones to the beginning.  
#This is because the following procedure 
#will otherwise not have a constant in the regression model
X = np.append(arr = np.ones((len(X),1)).astype(int), values = X, axis = 1)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, cont_train, cont_test = train_test_split(X, y, cont, test_size = 0.2, random_state = 0)

"""
Columns are now
0: constant, 1:is Africa 2: is Americas 3: is Asia 4:is Europe
5: country 6:population 7:log(GDP per cap) 
8:health spend 9:pop density 10: democracy score
"""


"""Start the training
"""

#train the regressor
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)
#Make predictions (on the training set, to plot the fit)
y_pred = regressor.predict(X_train)

#plot the results
gdp_per_cap = np.exp(X_train[:,7])
make_plot('GDP per cap',gdp_per_cap,y_train,y_pred,cont_train)
#plot the results
# make_plot('Health spend',X_train[:,8],y_train,y_pred,cont_train)
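
The plot above compares predictions with the training data only. A minimal sketch of checking the fit on held-out data follows, using synthetic data in place of the Gapminder set and plain NumPy in place of scikit-learn; the variable names mirror the ones used in this notebook but the block is self-contained.

```python
import numpy as np

# Synthetic data: a noisy straight line
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)
y = 50.0 + 3.0 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])   # constant column plus x

# Hold out the last 20% as a test set
split = int(0.8 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Fit on the training set, predict on the test set
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
y_pred = X_test @ coef

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"R^2 on the test set: {r2:.3f}")
```

With the scikit-learn regressor, `regressor.score(X_test, y_test)` returns the same R^2 statistic.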






In [None]:
#Start backward elimination
#The OLS class is provided by statsmodels.api
import statsmodels.api as sm
X_opt = X_train[ : , [0,1,2,3,4,5,6,7,8,9,10]]
regressor_OLS = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_OLS.summary()