Linear Regression example on Life Satisfaction GDP Dataset

In [26]:
# Standard imports and assertions to ensure we have the right versions of Python etc. 
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Scikit-Learn ≥0.20 is required
import sklearn
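# Compare the numeric (major, minor) parts; comparing version strings directly is lexicographic
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)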
import numpy as np
import pandas as pd

In [28]:
# To plot pretty figures directly within Jupyter
%matplotlib inline
import matplotlib.pyplot as plt

Load the data. The dataset is derived from the life satisfaction datasets used by Géron in chapter 1;
see https://github.com/ageron/handson-ml2/blob/master/01_the_machine_learning_landscape.ipynb
The code assumes the file is stored locally, in the same directory as this notebook.

In [31]:
# Load the data (the cells below expect the columns "Country", "GDP per capita" and "Life satisfaction")
country_stats = pd.read_csv("gdp.csv")

Let's have a quick look at the first few rows of the data.

In [34]:
country_stats.head()

Unnamed: 0,Country\tSubject Descriptor\tUnits\tScale\tCountry/Series-specific Notes\t2015\tEstimates Start After,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,Afghanistan\tGross domestic product per capita,current prices\tU.S. dollars\tUnits\tSee note...,current prices (National currency) Population...,
1,Albania\tGross domestic product per capita,current prices\tU.S. dollars\tUnits\tSee note...,current prices (National currency) Population...,995.383\t2010
2,Algeria\tGross domestic product per capita,current prices\tU.S. dollars\tUnits\tSee note...,current prices (National currency) Population...,318.135\t2014
3,Angola\tGross domestic product per capita,current prices\tU.S. dollars\tUnits\tSee note...,current prices (National currency) Population...,100.315\t2014
4,Antigua and Barbuda\tGross domestic product pe...,current prices\tU.S. dollars\tUnits\tSee note...,current prices (National currency) Population...,414.302\t2011
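
The literal \t sequences in the header row above suggest that the file is tab-separated rather than comma-separated, so pandas has fused several fields into one column. The sketch below illustrates the effect on a small in-memory sample; the sample rows and country names are made up for illustration, and whether sep="\t" is the right setting for gdp.csv itself is an assumption to check against the file.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample; the rows and values are invented for illustration
sample = (
    "Country\tSubject Descriptor\t2015\n"
    "Aland\tGDP per capita\t1234.5\n"
    "Borduria\tGDP per capita\t678.9\n"
)

# Default separator (comma): every line is read as one fused field
wrong = pd.read_csv(io.StringIO(sample))
print(wrong.shape)

# Tab separator: the three fields become three columns
right = pd.read_csv(io.StringIO(sample), sep="\t")
print(right.columns.tolist())
```

Checking country_stats.columns straight after loading is a quick way to confirm that the column names used in later cells actually exist.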


Next, we can plot the data (always a good idea, as we shall see).

In [37]:
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

KeyError: 'GDP per capita'

We can label each point in the plot with its country as follows.
The x, y and label values are extracted from the dataframe and converted into column arrays with NumPy's np.c_ object (an indexing helper rather than a function).

In [None]:
xvals = np.c_[country_stats["GDP per capita"]]
yvals = np.c_[country_stats["Life satisfaction"]]
country_labels = np.c_[country_stats["Country"]]
for i,lab in enumerate(country_labels):
    xi = xvals[i,0]
    yi = yvals[i,0]
    plt.scatter(xi, yi)
    plt.text(xi+0.05, yi+0.05, lab[0], fontsize=9)
plt.show()
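
As a small illustration of what np.c_ does in the cell above, the following sketch applies it to a short pandas Series. The Series values are arbitrary placeholders, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Arbitrary placeholder values, not from the life satisfaction data
s = pd.Series([1.0, 2.0, 3.0])

# np.c_ stacks its inputs as columns, so a 1-D Series becomes an (n, 1) array
col = np.c_[s]
print(col.shape)

# An equivalent way to get the same column vector
same = s.to_numpy().reshape(-1, 1)
print(np.array_equal(col, same))
```

This is why the loop indexes with xvals[i,0] and lab[0]: each row of the array holds a single value.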

We're going to try creating a very simple model using linear regression, as there appears to be a roughly linear relationship between GDP and Life Satisfaction.
The default y-axis range exaggerates the differences between countries; if we re-plot the data with the y-axis running from 0 to 10, the relationship becomes clearer.

In [41]:
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.axis([0, 60000, 0, 10])
plt.show()

KeyError: 'GDP per capita'

Our linear regression model is created using the LinearRegression class from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).
The X and y values are again extracted from the dataframe and represent the independent and dependent variables
(the values used to make a prediction, and the values we are trying to predict, respectively).
By fitting the model we create the line of best fit through this data, that is, the line which minimises the sum of the squared errors (the residuals).

In [44]:
import sklearn.linear_model
model = sklearn.linear_model.LinearRegression()
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]
# Train the model
model.fit(X, y)

KeyError: 'GDP per capita'
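
To make the "line of best fit" idea concrete, here is a minimal sketch on synthetic data (generated below, not the life satisfaction dataset). It fits LinearRegression and compares the result with np.polyfit, which is used here only as an independent least-squares check and is not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: y = 2 + 0.5x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.1, size=x.size)

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)

# Independent least-squares fit of a degree-1 polynomial (returns slope, then intercept)
slope, intercept = np.polyfit(x, y, deg=1)

print(model.coef_[0], slope)
print(model.intercept_, intercept)
```

Both approaches solve the same least-squares problem, so the fitted gradient and intercept should agree to within floating-point error.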

We can find out how well our model captures the relationship in the data by looking at the score value (R-squared, the proportion of the variance in y that the model explains).

In [47]:
model.score(X,y)

NameError: name 'X' is not defined
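
For reference, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this against model.score on synthetic data; the data is generated purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only (not the life satisfaction dataset)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(40, 1))
y = 1.5 + 0.8 * X[:, 0] + rng.normal(scale=0.5, size=40)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_pred) ** 2)    # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```

A value close to 1 means the line explains most of the variation in y; a value close to 0 means it explains very little.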

We can also find out the details of the model by looking at the coefficient and intercept values.
Remember that this is a very simple univariate regression model, so these values correspond to the gradient
and intercept of the line (m and c respectively) in the standard equation for a straight line (y = mx + c).

In [50]:
model.coef_

AttributeError: 'LinearRegression' object has no attribute 'coef_'

In [52]:
model.intercept_

AttributeError: 'LinearRegression' object has no attribute 'intercept_'
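
The shape of coef_ and intercept_ depends on the shape of y passed to fit, which is why a later cell indexes them as model.coef_[0][0] and model.intercept_[0]. The sketch below shows the difference on a tiny hand-made dataset that lies exactly on y = 2x + 1; the numbers are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset lying exactly on y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y_1d = np.array([3.0, 5.0, 7.0, 9.0])
y_2d = y_1d.reshape(-1, 1)  # column vector, as produced by np.c_

m1 = LinearRegression().fit(X, y_1d)
m2 = LinearRegression().fit(X, y_2d)

# 1-D target: coef_ has shape (n_features,) and intercept_ is a scalar
print(m1.coef_.shape, np.ndim(m1.intercept_))

# 2-D target: coef_ has shape (n_targets, n_features) and intercept_ has shape (n_targets,)
print(m2.coef_.shape, m2.intercept_.shape)
print(m2.coef_[0][0], m2.intercept_[0])
```

These attributes only exist after fit has been called, so an AttributeError at this point usually means that the fitting cell did not complete.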

We can then plot this line over the original data to illustrate the model we have created.

In [55]:
t0, t1 = model.intercept_[0], model.coef_[0][0]  # intercept (c) and gradient (m)
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction', figsize=(5,3))
plt.axis([0, 60000, 0, 10])
# generate 1000 values between 0 and 60000 to create the line
# (a separate name is used so that the training data X is not overwritten)
x_line = np.linspace(0, 60000, 1000)
plt.plot(x_line, t0 + t1*x_line, "r")
plt.show()

AttributeError: 'LinearRegression' object has no attribute 'intercept_'

Having created this model, we can use it to predict the life satisfaction score for a country that
does not appear in the data, provided we have a value for its GDP per capita.
For example, we might know that the GDP per capita for Cyprus is $22587, which we can feed into our model as follows:

In [58]:
# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new))

NotFittedError: This LinearRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Remember, this is just a prediction based on our model. Cyprus's actual life satisfaction score could well be different to this!
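
As a sketch of what predict does for a univariate model, the example below first shows that calling predict before fit raises NotFittedError, and then checks that the prediction equals intercept + gradient * x. The training values and the new x value are made up for illustration.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset lying exactly on y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([[3.0], [5.0], [7.0], [9.0]])
X_new = [[2.5]]  # predict expects a 2-D input: one row per sample

model = LinearRegression()

# Predicting before fitting fails, which is what happens when an earlier cell errors out
try:
    model.predict(X_new)
    raised = False
except NotFittedError:
    raised = True
print(raised)

model.fit(X, y)

# The prediction is the straight-line equation y = c + m * x
manual = model.intercept_[0] + model.coef_[0][0] * X_new[0][0]
predicted = model.predict(X_new)[0][0]
print(manual, predicted)
```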