In [None]:
%matplotlib inline



# What you will learn

## The problem

Every year since 2005, the [World Happiness Report](https://worldhappiness.report/ed/2018/) 
has analysed the results of the Gallup World Poll, 
which is carried out in 160 countries (covering 99% of the world’s population). 
The pollsters contact a random sample of people in each country and ask them over 
100 questions about their income, their health and their family. These questions include the 
following question about happiness:

<img src="file://../../images/lesson1/HappyQuestion.png">

People living in different countries give different answers. In the UK the average answer is 6.94, making the UK 17th in the world for happiness. 
The top ranked country --- rather surprisingly given a national stereotype of people who are reserved and don’t express their 
feelings very much --- is Finland, with a score of 7.82. In general, Scandinavian and Northern European countries are 
ranked highest. The USA is 16th (0.03 points ahead of the UK). China, with a score of 5.59 and at 72nd place, is 
roughly in the middle of the table of the countries surveyed. Other mid-ranked countries include Montenegro, Ecuador, 
Vietnam and Russia. Further down the table, we find many African countries (Uganda and Ethiopia are placed 117th and 131st, 
respectively) and Middle Eastern countries (Iran is at 110th and Yemen at 132nd).  
The unhappiest country in the world in 2022 is Afghanistan, with an average happiness score of only 2.40.

The question is: what makes people happy? One possible answer is that people are happier when they live longer. 
It is this relationship in data that we will explore in this lesson.

## The methods

Here we will learn about plotting and looking for relationships in data;
fitting straight lines through data points; understanding the slope and intercept of the line 
as parameters in a mathematical model; and showing that the parameters are the best possible fit to the data. 
These are all key data science skills and also the first steps towards machine learning. Specifically,
we will find out more about a method known as linear regression.

Before we get to linear regression, we are going to take a detour into another area of mathematics: 
calculus. When you studied calculus at school or university, you probably didn't associate it with finding statistical
relationships in data. But in machine learning, we are often interested in finding the minimum value of a function, and for that 
we need one of the key tools of calculus: differentiation.

## How to use this material

This material is taught as part of a 6 hour learning session. Your Kujenga instructor will have booked 
a time for an in-person or online two hour session. This means you have two hours of work to do on either side of the
session. Here is what you should do:

*Before coming to the class*: You should read through this entire page. At the section on differentiation, solve the paper and 
pencil exercise (if you get stuck, look [here](https://www.bbc.co.uk/bitesize/guides/zyj77ty/revision/1)), but otherwise you should 
simply read through and try to understand what we are doing. Once you have read through, you should 
download the [data](https://github.com/AfricaEuropeCoreAI/Kujenga/blob/main/course/lessons/data/HappinessData.csv)
for the exercise. You will need to have a Python environment set up on your computer or access via Google Colab (see here for
info on how to set that up). Then you can download this page as a Jupyter notebook or as Python code by clicking the links at
the bottom of this page. Run the code and focus on understanding what it does up to and including the section *Finding the best fit line*. 

*During class*: Your teacher will start by going through the theory for *Finding the best fit line*. 
Please ask them questions and actively engage! 

 


# Differentiation

Taking the derivative of a function is about finding an equation for the slope of the curve the function describes. 
When the derivative is zero, the slope is zero. For a recap on differentiation, 
[this page](https://www.bbc.co.uk/bitesize/guides/zyj77ty/revision/1) provides a quick review.

<img src="file://../../images/lesson1/ILLUSTRATION_DERIVATIVE.png">

Here is an example of how we make the calculation. Imagine you are asked to find the value 
of $m$ which minimises the function $(4-2m)^2$. To solve this problem, you can first multiply out 
the brackets to get

\begin{align}(4-2m)^2 = 16 - 16m + 4m^2\end{align}

You can then take a derivative in order to calculate the slope of the function, 
as follows.

\begin{align}\frac{d}{dm} \left( 16 - 16m + 4m^2 \right) = - 16 + 8m\end{align}

We then set this equal to zero and solve for $m$, because the function is at its minimum when it has slope zero.

\begin{align}- 16 + 8m = 0 \Rightarrow 8m = 16 \Rightarrow m = 2\end{align}

Problem solved. 
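
The result can also be checked numerically. The short sketch below is not part of the original lesson code: it assumes only NumPy, and the helper names `f` and `df_dm` are made up here for the function and for the derivative we calculated by hand.

```python
import numpy as np

def f(m):
    """The function we want to minimise."""
    return (4 - 2 * m) ** 2

def df_dm(m):
    """The derivative we calculated by hand."""
    return -16 + 8 * m

# Evaluate the function on a fine grid of candidate values for m
m_values = np.linspace(0, 4, 401)
m_at_minimum = m_values[np.argmin(f(m_values))]

print('Grid search finds the minimum near m = %.2f' % m_at_minimum)
print('The slope at m = 2 is %.1f' % df_dm(2))
```

The grid search and the calculus agree: the smallest value of the function is found where the derivative is zero.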

**Think yourself!**

Use the derivative to find the minimum of

\begin{align}(9-3m)^2\end{align}

Note that I used the letter $m$ for the variable I was trying to find, while
most often in school we use the letter $x$ for the variable we are trying to find. In maths it really doesn't 
matter what letter you use, as long as you are consistent, but we will later use $m$ for the slope of a line, so I wanted 
to start using it already now.

If you can solve the problem above, you have the mathematics needed to work through the rest of this lesson.
But, irrespective of whether you can solve the problem above or not, we recommend you have a look at 
[Khan Academy's Calculus 1 course](https://www.khanacademy.org/math/calculus-1). These calculus 
skills are part of the building blocks needed for the Kujenga course.

      
# A line through the data

We have already looked at how the [World Happiness Report](https://worldhappiness.report/ed/2018/) 
documents the happiness of people across the world. Now let's load in that data to Python.


In [None]:
from IPython.display import display
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import numpy as np

# Read in the data and shorten the variable names
happy = pd.read_csv("../data/HappinessData.csv", delimiter=';')
happy = happy.rename(columns={
    'Social support': 'SocialSupport',
    'Life Ladder': 'Happiness',
    'Perceptions of corruption': 'Corruption',
    'Log GDP per capita': 'LogGDP',
    'Healthy life expectancy at birth': 'LifeExp',
    'Freedom to make life choices': 'Freedom',
})

# We just look at the data for 2018 and display it in a table.
df = happy.loc[happy['Year'] == 2018]
display(df[['Country name','LifeExp','Happiness']])

## Creating the plot 
The code below plots the average life expectancy of 
each of these countries against their happiness (life ladder) scores. 



In [None]:
from pylab import rcParams
rcParams['figure.figsize'] = 14/2.54, 14/2.54
matplotlib.font_manager.FontProperties(family='Helvetica',size=11)


def plotData(df,x,y): 
    fig,ax=plt.subplots(num=1)
    ax.plot(x,y, data=df, linestyle='none', markersize=5, marker='o', color=[0.85, 0.85, 0.85])
    for country in ['United States','United Kingdom','Croatia','Benin','Finland','Yemen']:
        ci=np.where(df['Country name']==country)[0][0]
        ax.plot(  df.iloc[ci][x],df.iloc[ci][y], linestyle='none', markersize=7, marker='o', color='black')
        ax.text(  df.iloc[ci][x]+0.5,df.iloc[ci][y]+0.08,  country)
           
    ax.set_xticks(np.arange(30,90,step=5))
    ax.set_yticks(np.arange(11,step=1))
    ax.set_ylabel('Average Happiness (0-10)')
    ax.set_xlabel('Life Expectancy at Birth')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.set_xlim(47,78)
    ax.set_ylim(2.5,8.1) 
    return fig,ax

fig,ax=plotData(df,'LifeExp','Happiness')

plt.show()

Each circle in the plot is a country. 
The x-axis shows the life expectancy in the country and 
the y-axis shows the average rating of life satisfaction, 
on the 0 to 10 scale. In general, the higher the life expectancy of a country, 
the higher the happiness (life satisfaction) there. 

## Drawing a line

One way to quantify this relationship is to draw a straight line
through the points, showing how happiness increases with life expectancy. 
For example, imagine that for every 12 extra years which people live in a 
country they are one point happier. The equation for happiness in this case 
would then look like this,

\begin{align}\mbox{Happiness} = \frac{\mbox{Life Expectancy}}{12}\end{align}

In this case, if the average life expectancy in the country 
is 60 then the equation above predicts the happiness to be 60/12=5. 
If the life expectancy is 78 then average happiness is predicted to be 78/12=6.5. And so on...
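
The same arithmetic can be written as a tiny function. This is only an illustrative sketch: the name `predict_happiness` is made up here and does not appear in the lesson code.

```python
def predict_happiness(life_expectancy, m=1/12):
    """Predicted happiness for a line through the origin with slope m."""
    return m * life_expectancy

# The two examples worked through in the text
for years in [60, 78]:
    print('Life expectancy %d -> predicted happiness %.1f'
          % (years, predict_happiness(years)))
```

Changing the default value of `m` changes the slope of the line, and with it every prediction.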

We can draw this equation in the form of a straight line going 
through the cloud of country points, as shown below.



In [None]:
# Setup parameters: m is the slope of the line
# And calculate a line with that slope.
m=1/12
Life_Expectancy=np.arange(0.5,100,step=0.01)
Happiness= m*Life_Expectancy

# Plot the data and the line
fig,ax=plotData(df,'LifeExp','Happiness')
ax.plot(Life_Expectancy, Happiness, linestyle='-', color='black')
df=df.assign(Predicted=np.array(m*df['LifeExp']))
for country in ['United States','United Kingdom','Croatia','Benin','Finland','Yemen']:
    ci=np.where(df['Country name']==country)[0][0]
    ax.plot(  [df.iloc[ci]['LifeExp'],df.iloc[ci]['LifeExp']] ,[ df.iloc[ci]['Happiness'],df.iloc[ci]['Predicted']] ,linestyle=':', color='black')
plt.show()

**Try it yourself!**

Download the code by clicking on the link below and 
try changing the slope of the line above by 
changing the value 1/12 and replotting the line.
See if you can find a line that lies closer to the data points.


## The sum of squares

Each of the dotted lines above shows how far the line – which predicts that happiness is one 
twelfth of life expectancy – is from the data for each of the six highlighted countries.
For example, the USA has a happiness score of 6.88 and an 
average life expectancy of 68.3. The equation above predicts 

\begin{align}\mbox{Predicted USA Happiness} = \frac{\mbox{USA Life Expectancy}}{12} = \frac{\mbox{68.3}}{12} =  5.69\end{align}

This means that the squared distance between the prediction and reality is 

\begin{align}\left(6.88 - \frac{68.3}{12}\right)^2 \approx 1.412\end{align}
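
As a quick check of this arithmetic, the sketch below redoes the USA calculation using only the two numbers quoted in the text (the variable names are chosen here for illustration). Note that the squared distance of 1.412 comes from the unrounded prediction; squaring with the rounded value 5.69 would give roughly 1.416 instead.

```python
# Values for the USA quoted in the text above
usa_happiness = 6.88
usa_life_expectancy = 68.3
m = 1 / 12

# Prediction from the line, and squared distance to the measured value
usa_predicted = m * usa_life_expectancy
usa_squared_distance = (usa_happiness - usa_predicted) ** 2

print('Predicted happiness: %.2f' % usa_predicted)
print('Squared distance: %.3f' % usa_squared_distance)
```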

The table below shows the predicted value and the squared distance between 
prediction and reality for each country. We then sum these squared distances 
to get an overall measure of how far our predictions are from reality. This is done below.



In [None]:
df=df.assign(SquaredDistance=np.power((df['Predicted'] - df['Happiness']),2))
display(df[['Country name','Happiness','Predicted','SquaredDistance']])
             
Model_Sum_Of_Squares = np.sum(df['SquaredDistance'])

print('The model sum of squares is %.4f' % Model_Sum_Of_Squares)

# Finding the best fit line 
We have drawn a line. But the question is: what is the ‘best’ line?

## Sum of squares

Let’s start by formulating this problem mathematically. 
For each country $i$, 
we have two values: the life satisfaction, which I will call $y_i$ 
and life expectancy, which I will call $x_i$. For example, 
when $i=$USA then $x_i=68.3$ and $y_i=6.88$. 

Now, let’s denote the slope of the line as $m$ (in the plot above
$m=1/12$) and repeat the calculation we made above but with letters instead 
of numbers. First we note that the prediction (shown here with the numbers for the USA) is

\begin{align}\hat{y_i} = m \cdot x_i = \frac{1}{12} \cdot 68.3\end{align}

The little "hat" in $\hat{y_i}$ denotes that it is a prediction 
(rather than the measured value itself, which is $y_i$). 
The squared distance between the prediction and outcome is written as

\begin{align}( y_i - m \cdot x_i)^2\end{align}

I want to emphasise here that all I am doing is rewriting the same calculation I
did above with numbers, but now with the letters. The reason for doing this is that 
our aim is to find an equation for the value of $m$ which minimises the sum of squared 
distances.

The next step is to write out the sum

\begin{align}( y_1 - m \cdot x_1)^2 +  ( y_2 - m \cdot x_2)^2  + \dots + ( y_{136} - m \cdot x_{136})^2\end{align}

The above equation can be written in shorthand form (using the sum notation we met 
in the section on our average friend) as

\begin{align}\sum_i^n ( y_i - m \cdot x_i)^2\end{align}

where $n=136$ is the number of countries. 
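
To make the notation concrete, here is a minimal sketch of the sum of squares as a Python function. The four data points are made up for illustration (they are NOT taken from the World Happiness Report), and the function name `sum_of_squares` is my own choice.

```python
import numpy as np

# Made-up example values (NOT from the World Happiness Report)
x = np.array([55.0, 60.0, 65.0, 70.0])   # life expectancy x_i
y = np.array([4.5, 5.1, 5.2, 6.0])       # happiness y_i

def sum_of_squares(m, x, y):
    """Sum over all countries i of (y_i - m * x_i)^2."""
    return np.sum((y - m * x) ** 2)

# Try a few candidate slopes and compare their sums of squares
for m in [1/14, 1/12, 1/10]:
    print('m = %.4f gives sum of squares %.4f' % (m, sum_of_squares(m, x, y)))
```

Each choice of $m$ gives one number, so the sum of squares is a function of $m$ alone once the data are fixed.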

## Back to differentiation

We want to find the value of $m$ which minimises this sum of squares. But how do we do this? 

The answer is differentiation. The equation above is more complicated than the one we used in the section on differentiation, 
but although the algebra is more involved, we can use exactly the same logic to find the value of $m$ 
which minimises the sum of squares. We first take the derivative 

\begin{align}\frac{d}{dm} & \left( ( y_1 - m \cdot x_1)^2 +  ( y_2 - m \cdot x_2)^2  + \dots + ( y_{136} - m \cdot x_{136})^2  \right) \\ & = - 2 x_1 y_1 + 2 x_1^2 m  - 2 x_2 y_2 + 2 x_2^2 m  +  \dots - 2 x_{136} y_{136} + 2 x_{136}^2 m\end{align}

Although this particular step involves a lot of algebra, notice that we are doing exactly the same as in the example above.
Another thing that I find can confuse students (when I teach this in statistics) is that 
we differentiate with respect to $m$. 
In school, we often use the letter $x$ for the variable name and $m$ for a constant. Here it is the other way round. 
The data $x_i$ and $y_i$ are constants (measurements from countries) and $m$ is the variable we differentiate with respect to.
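
If you want to convince yourself that the derivative is right, you can compare it with a numerical slope. The sketch below uses made-up data (NOT from the World Happiness Report) and hypothetical helper names; it estimates the slope of the sum of squares by nudging $m$ a tiny amount up and down.

```python
import numpy as np

# Made-up example values (NOT from the World Happiness Report)
x = np.array([55.0, 60.0, 65.0, 70.0])
y = np.array([4.5, 5.1, 5.2, 6.0])

def sum_of_squares(m):
    return np.sum((y - m * x) ** 2)

def derivative(m):
    """The formula from the text: sum over i of (-2 x_i y_i + 2 x_i^2 m)."""
    return np.sum(-2 * x * y + 2 * x ** 2 * m)

m = 1 / 12
h = 1e-6  # a small step in m
numerical = (sum_of_squares(m + h) - sum_of_squares(m - h)) / (2 * h)

print('Derivative from the formula: %.4f' % derivative(m))
print('Numerical slope:             %.4f' % numerical)
```

The two numbers should agree closely, which is a useful sanity check whenever you differentiate by hand.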

We now write the sum above in shorthand as

\begin{align}- \sum_i^n 2 x_i y_i + \sum_i^n 2 x_i^2 m\end{align}

and we set it equal to zero (to find the point at which the sum of squares is minimised, and the slope is zero) to get

\begin{align}- \sum_i^n 2 x_i y_i + \sum_i^n 2 x_i^2 m = 0 \Rightarrow \sum_i^n 2 x_i y_i = \sum_i^n 2 x_i^2 m \Rightarrow \sum_i^n x_i y_i = m \sum_i^n x_i^2\end{align}

Dividing both sides by $\sum_i^n x_i^2$ gives

\begin{align}m = \frac{\sum_i^n x_i y_i}{\sum_i^n x_i^2}\end{align}
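
Here is a minimal sketch of this formula on a handful of made-up data points (NOT from the World Happiness Report; the variable names are illustrative). It also checks that nudging the slope slightly in either direction makes the sum of squares larger, as we would expect at a minimum.

```python
import numpy as np

# Made-up example values (NOT from the World Happiness Report)
x = np.array([55.0, 60.0, 65.0, 70.0])
y = np.array([4.5, 5.1, 5.2, 6.0])

# The formula from the text: m = sum(x_i * y_i) / sum(x_i^2)
m_best = np.sum(x * y) / np.sum(x ** 2)

def sum_of_squares(m):
    return np.sum((y - m * x) ** 2)

print('Best slope m = %.4f' % m_best)
for m in [m_best - 0.001, m_best, m_best + 0.001]:
    print('m = %.4f gives sum of squares %.5f' % (m, sum_of_squares(m)))
```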

Let's now use our newly found equation to calculate the line that best fits the data.



In [None]:
df=df.assign(SquaredLifEExp=np.power(df['LifeExp'],2))
df=df.assign(HappinessLifEExp=df['LifeExp'] * df['Happiness'])

m_best = np.sum(df['HappinessLifEExp'])/np.sum(df['SquaredLifEExp'])
print('The best fitting line has slope m = %.4f' % m_best)

Our initial guess of $m = 1/12 = 0.0833$ wasn't so far away from the best fitting value. 
But this new slope is slightly closer to the data. We can now plot this and recalculate 
the model sum of squares.




In [None]:
Life_Expectancy=np.arange(0.5,100,step=0.01)
Happiness= m_best*Life_Expectancy

fig,ax=plotData(df,'LifeExp','Happiness')
ax.plot(Life_Expectancy, Happiness, linestyle='-', color='black')
df=df.assign(Predicted=np.array(m_best*df['LifeExp']))
for country in ['United States','United Kingdom','Croatia','Benin','Finland','Yemen']:
    ci=np.where(df['Country name']==country)[0][0]
    ax.plot(  [df.iloc[ci]['LifeExp'],df.iloc[ci]['LifeExp']] ,[ df.iloc[ci]['Happiness'],df.iloc[ci]['Predicted']] ,linestyle=':', color='black')
 
plt.show()

df=df.assign(SquaredDistance=np.power((df['Predicted'] - df['Happiness']),2))
             
Model_Sum_Of_Squares = np.sum(df['SquaredDistance'])             
print('The model sum of squares is %.4f' % Model_Sum_Of_Squares)

Again, this sum of squares is slightly smaller than the value we got above 
for $m = 1/12$.


## Including the Intercept
An equation for a straight line usually has two components: a slope $m$,
which we have already seen, and an intercept $k$, which so far we have assumed is zero.
We can write the equation for a straight line as

\begin{align}y = k + m \times x\end{align}

We now look at how we can improve the fit of the model by
including this intercept.

We start by shifting the data so that it has a mean (average) of zero.
To do this we simply take away the mean value from both life expectancy and 
from happiness. We then replot the data.
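
To see what this shift does, here is a minimal sketch on made-up numbers (NOT from the World Happiness Report; the variable names are illustrative). After subtracting the mean, the values are spread around zero and their average is zero.

```python
import numpy as np

# Made-up example values (NOT from the World Happiness Report)
x = np.array([55.0, 60.0, 65.0, 70.0])   # life expectancy
y = np.array([4.5, 5.1, 5.2, 6.0])       # happiness

# Take away the mean from each variable
x_shifted = x - np.mean(x)
y_shifted = y - np.mean(y)

print('Shifted life expectancy:', x_shifted)
print('Shifted happiness:', np.round(y_shifted, 2))
print('Means after shifting: %.4f and %.4f'
      % (np.mean(x_shifted), np.mean(y_shifted)))
```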



In [None]:
df=df.assign(ShiftedLifeExp=df['LifeExp'] - np.mean(df['LifeExp']))
df=df.assign(ShiftedHappiness=df['Happiness'] - np.mean(df['Happiness']))

fig,ax=plotData(df,'ShiftedLifeExp','ShiftedHappiness')
ax.set_ylabel('Happiness (corrected for Mean Happiness)')
ax.set_xlabel('Life Expectancy (corrected for Mean Life Expectancy) ')
ax.set_xticks(np.arange(-30,30,step=5))
ax.set_yticks(np.arange(-5,5,step=1))
ax.set_xlim(-20,15)
ax.set_ylim(-3,3) 
plt.show()

This graph shows us that, for example, Yemen is almost 2.5 points below the world 
average for happiness and has a life expectancy 8 years shorter than the average over
all countries in the world. The United States life expectancy is around 3.5 years longer than 
the average and the citizens of the USA are about 1.3 points happier than average.
It is worth noting that the correction is for country averages and does not account for the size of the 
populations of these various countries. It does however give us a new way 
of seeing between-country differences.


Let's now try to find the best fit line which goes through these data points.



In [None]:
df=df.assign(SquaredLifEExp=np.power(df['ShiftedLifeExp'],2))
df=df.assign(HappinessLifEExp=df['ShiftedLifeExp'] * df['ShiftedHappiness'])

m_best = np.sum(df['HappinessLifEExp'])/np.sum(df['SquaredLifEExp'])
print('The best fitting line has slope m = %.4f' % m_best)

Life_Expectancy=np.arange(-50,50,step=0.01)
Happiness= m_best*Life_Expectancy

fig,ax=plotData(df,'ShiftedLifeExp','ShiftedHappiness')
ax.plot(Life_Expectancy, Happiness, linestyle='-', color='black')
ax.set_ylabel('Happiness (corrected for Mean Happiness)')
ax.set_xlabel('Life Expectancy (corrected for Mean Life Expectancy) ')
ax.set_xticks(np.arange(-30,30,step=5))
ax.set_yticks(np.arange(-5,5,step=1))
ax.set_xlim(-20,15)
ax.set_ylim(-3,3) 

plt.show()

This line appears to fit better than the one we fitted earlier! It lies 
closer to the points and better captures the relationship in the data.
To test whether this is indeed the case we can calculate the sum of squares
between this new line and the shifted data. This is done as follows.



In [None]:
df=df.assign(Predicted=np.array(m_best*df['ShiftedLifeExp']))       
df=df.assign(SquaredDistance=np.power((df['Predicted'] - df['ShiftedHappiness']),2))
            
Model_Sum_Of_Squares = np.sum(df['SquaredDistance'])             
print('The model sum of squares is %.4f' % Model_Sum_Of_Squares)

This new line through the data is better! It has a smaller sum of squares. 

To write this line in terms of the original (unshifted) data, we first note that the mean values are calculated as follows

\begin{align}\bar{x} = \frac{1}{n} \sum_i^n x_i \mbox{ and }  \bar{y} = \frac{1}{n} \sum_i^n y_i\end{align}


Using this notation, the equation for the line through the data is

\begin{align}\hat{y_i} - \bar{y} = m  (x_i - \bar{x})\end{align}

Just to remind you about the notation: the predicted value has a hat over it, while the mean values
have a bar over them. We can rearrange this equation to get 

\begin{align}\hat{y_i}  = m x_i + (\bar{y} - m\bar{x})\end{align}

Notice that this is an equation for a straight line, so we can write

\begin{align}\hat{y_i}  = m x_i + k  \mbox{ where } k = \bar{y} - m\bar{x}\end{align}
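
As a sanity check on this formula, the sketch below computes the slope from mean-shifted data and the intercept from $k = \bar{y} - m\bar{x}$, then compares both with NumPy's built-in straight-line fit `np.polyfit`. The data are made up (NOT from the World Happiness Report), and using `np.polyfit` as a cross-check is my addition rather than part of the lesson.

```python
import numpy as np

# Made-up example values (NOT from the World Happiness Report)
x = np.array([55.0, 60.0, 65.0, 70.0])
y = np.array([4.5, 5.1, 5.2, 6.0])

x_bar = np.mean(x)
y_bar = np.mean(y)

# Slope from the mean-shifted data, intercept from k = y_bar - m * x_bar
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
k = y_bar - m * x_bar

# Cross-check with NumPy's least squares fit of a degree 1 polynomial
slope_check, intercept_check = np.polyfit(x, y, 1)

print('Our formulas: m = %.4f, k = %.4f' % (m, k))
print('np.polyfit:   m = %.4f, k = %.4f' % (slope_check, intercept_check))
```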

Let's apply this to the data and plot the line again.



In [None]:
k_best = np.mean(df['Happiness']) - m_best*np.mean(df['LifeExp'])

Life_Expectancy=np.arange(0.5,100,step=0.01)
Happiness= m_best*Life_Expectancy + k_best

fig,ax=plotData(df,'LifeExp','Happiness')
ax.plot(Life_Expectancy, Happiness, linestyle='-', color='black')
df=df.assign(Predicted=np.array(m_best*df['LifeExp']+k_best))
for country in ['United States','United Kingdom','Croatia','Benin','Finland','Yemen']:
    ci=np.where(df['Country name']==country)[0][0]
    ax.plot(  [df.iloc[ci]['LifeExp'],df.iloc[ci]['LifeExp']] ,[ df.iloc[ci]['Happiness'],df.iloc[ci]['Predicted']] ,linestyle=':', color='black')
 
plt.show()

print('The slope of the line is m = %.4f and the intercept is k = %.4f' % (m_best,k_best))
print('An increase in life expectancy of %.4f years is associated with one extra point of happiness' % (1/m_best))

    
df=df.assign(SquaredDistance=np.power((df['Predicted'] - df['Happiness']),2))          
Model_Sum_Of_Squares = np.sum(df['SquaredDistance'])             
print('The model sum of squares is still %.4f' % Model_Sum_Of_Squares)

Now we have it. By shifting back to the original co-ordinates we
can find the best fitting line through the data. Notice that the sum of squares is unaffected by
shifting the line back again, since the distances from the points to the line are unaffected. 
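
The claim that shifting does not change the sum of squares can be checked directly. This is a small sketch on made-up data (NOT from the World Happiness Report), with illustrative variable names: it computes the sum of squares once in the shifted co-ordinates and once in the original co-ordinates.

```python
import numpy as np

# Made-up example values (NOT from the World Happiness Report)
x = np.array([55.0, 60.0, 65.0, 70.0])
y = np.array([4.5, 5.1, 5.2, 6.0])

x_bar, y_bar = np.mean(x), np.mean(y)
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
k = y_bar - m * x_bar

# Sum of squares in the shifted co-ordinates (line through the origin)
ss_shifted = np.sum(((y - y_bar) - m * (x - x_bar)) ** 2)

# Sum of squares in the original co-ordinates (line with intercept k)
ss_original = np.sum((y - (m * x + k)) ** 2)

print('Shifted:  %.6f' % ss_shifted)
print('Original: %.6f' % ss_original)
```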

We can say (roughly speaking) that for every 8 extra years of life expectancy,
a country's citizens are about 1 point happier on a scale of 0 to 10. It isn't 
the whole truth (see the word of warning below), but it isn't entirely misleading either. 



# Interpreting the data


Although there is a relationship between these two variables, this does not mean
that life expectancy causes happiness.

<img src="file://../../images/lesson1/BeckyExplains.png">

In the book you can learn more about the dangers of confusing correlation with causation.

