# Lab 9 Tasks - Solution

In this notebook we will use regression techniques to analyse a subset of the [2019 World Happiness Index Report](https://worldhappiness.report/) dataset. In this dataset each row represents a country, with the following features:

- *country*: name of the country for each row
- *gdp*: real GDP per capita
- *social_support*: amount of social support that is present in a country
- *health*: healthy life expectancy
- *freedom*: freedom to make life choices
- *generosity*: level of generosity of citizens
- *corruption*: perceptions of corruption in a country

## Task 1

Download the World Happiness Index data from the link below, and load the data into a Pandas DataFrame.

http://mlg.ucd.ie/modules/COMP41680/happiness2019.csv

In [None]:
import pandas as pd
# Pandas can download the data directly from the URL
df = pd.read_csv("http://mlg.ucd.ie/modules/COMP41680/happiness2019.csv", index_col="country")
df.head(10)

## Task 2 

Calculate basic summary statistics for the data. List the top 5 ranked countries for each measure.

In [None]:
df.describe()

In [None]:
from IPython.display import display
for column in df.columns:
    print("Top countries by %s" % column )
    display(df.nlargest(5, column)[[column]])

Generate a boxplot of the measures in the dataset. 

Do you see any **outliers** for any of the measures?

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# generate the box plot
df.boxplot(figsize=(12,7), fontsize=13)
plt.xlabel("Measure", fontsize=13)
plt.ylabel("Measure Score", fontsize=13);

In the plot above, potential outliers appear as circles beyond the whiskers (T-bars) of each boxplot, which by default extend up to 1.5 times the interquartile range beyond the box. For instance, we see some lower outliers for the measures *social_support*, *health* and *freedom*, while we see some upper outliers for *generosity* and *corruption*.
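The circles in a boxplot can also be listed numerically. The sketch below assumes the default whisker setting of 1.5 times the interquartile range, and uses a small made-up DataFrame in place of `df`. The helper name `iqr_outliers` and the toy values are illustrative only and are not part of the lab data.

```python
import pandas as pd

def iqr_outliers(frame, whis=1.5):
    """Return, per numeric column, the values lying outside the boxplot whiskers."""
    outliers = {}
    for column in frame.select_dtypes(include="number").columns:
        q1 = frame[column].quantile(0.25)
        q3 = frame[column].quantile(0.75)
        iqr = q3 - q1
        # anything beyond these bounds would be drawn as a circle
        lower, upper = q1 - whis * iqr, q3 + whis * iqr
        mask = (frame[column] < lower) | (frame[column] > upper)
        outliers[column] = frame.loc[mask, column]
    return outliers

# Hypothetical toy data (NOT the happiness dataset), used only to show the idea
toy = pd.DataFrame(
    {"health": [0.80, 0.82, 0.79, 0.81, 0.83, 0.20],
     "generosity": [0.10, 0.12, 0.11, 0.13, 0.12, 0.60]},
    index=["A", "B", "C", "D", "E", "F"],
)
for column, values in iqr_outliers(toy).items():
    print(column, "->", list(values.index))
```

With the lab data, calling `iqr_outliers(df)` should name the countries behind each circle in the boxplot.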

## Task 3

Calculate the correlations between the different variables in the data. 

Which pair of variables is: (i) the most highly correlated; (ii) the least correlated?

In [None]:
# calculate the pairwise correlations
df_c = df.corr()
df_c

In [None]:
# we could turn this into a sorted DataFrame to show us a ranking for the pairs of variables
# with the highest and lowest correlation
from itertools import combinations
rows = []
for v1, v2 in combinations(df_c.columns, 2):
    rows.append({"Variable 1": v1, "Variable 2": v2, "Correlation": df_c.loc[v1, v2]})
# show the ranked list
pd.DataFrame(rows).sort_values(by="Correlation", ascending=False)
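Note that "least correlated" refers to the correlation closest to zero, not the most negative one, so ranking by absolute value can help with part (ii). The sketch below is one possible way to do that. The toy DataFrame and the helper name `rank_correlations` are made up for illustration; with the lab data you would pass `df` instead.

```python
from itertools import combinations
import pandas as pd

def rank_correlations(frame):
    """Rank each pair of columns by the absolute value of their Pearson correlation."""
    corr = frame.corr()
    rows = [
        {"Variable 1": v1, "Variable 2": v2, "Correlation": corr.loc[v1, v2]}
        for v1, v2 in combinations(corr.columns, 2)
    ]
    ranked = pd.DataFrame(rows)
    ranked["Abs. Correlation"] = ranked["Correlation"].abs()
    return ranked.sort_values(by="Abs. Correlation", ascending=False).reset_index(drop=True)

# Hypothetical toy data (NOT the happiness dataset)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],    # rises with a
    "c": [5.0, 4.1, 2.9, 2.2, 0.9],    # falls as a rises
    "d": [1.0, -1.0, 0.0, -1.0, 1.0],  # no linear relationship with a
})
ranked = rank_correlations(toy)
print(ranked.iloc[0])   # most strongly correlated pair (positive or negative)
print(ranked.iloc[-1])  # pair with correlation closest to zero
```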

## Task 4

Apply a **simple linear regression** to learn (fit) the model, where *gdp* is the independent variable and *health* is the target variable that we would like to predict. Produce a plot of the regression line.

In [None]:
# Get the columns: scikit-learn expects a 2D feature matrix and a 1D target
x = df[["gdp"]].values
y = df["health"].values
# Now build the regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)
# with a 1D target, coef_ has one entry per feature and intercept_ is a scalar
m = model.coef_[0]
b = model.intercept_
print("Coefficient (slope): %.4f" % m)
print("Intercept: %.4f" % b)
# plot the data
plt.figure(figsize=(9,6))
plt.scatter(x, y)
# plot the regression line between the smallest and largest GDP values
x_min, x_max = x.min(), x.max()
plt.plot([x_min, x_max], [m*x_min + b, m*x_max + b], 'r')
plt.xlabel('GDP', fontsize=13)
plt.ylabel('Health', fontsize=13);
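For a single predictor, the least-squares line can also be computed by hand: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The sketch below checks this on made-up numbers (not the happiness data), and the helper name `fit_line` is hypothetical. Applied to the *gdp* and *health* columns, it should reproduce the values printed by `LinearRegression` up to rounding.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # slope = covariance(x, y) / variance(x)
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # the fitted line always passes through the point of means
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical toy values lying exactly on y = 0.5 * x + 0.2
x_toy = [0.0, 0.5, 1.0, 1.5]
y_toy = [0.2, 0.45, 0.7, 0.95]
slope, intercept = fit_line(x_toy, y_toy)
print("Coefficient (slope): %.4f" % slope)
print("Intercept: %.4f" % intercept)
```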

## Task 5

Repeat the process from Task 4, but this time use *generosity* as the **target** variable. What does a comparison of the two regression lines indicate?

In [None]:
# Get the columns: scikit-learn expects a 2D feature matrix and a 1D target
x = df[["gdp"]].values
y = df["generosity"].values
# Now build the regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)
# with a 1D target, coef_ has one entry per feature and intercept_ is a scalar
m = model.coef_[0]
b = model.intercept_
print("Coefficient (slope): %.4f" % m)
print("Intercept: %.4f" % b)
# plot the data
plt.figure(figsize=(9,6))
plt.scatter(x, y)
# plot the regression line between the smallest and largest GDP values
x_min, x_max = x.min(), x.max()
plt.plot([x_min, x_max], [m*x_min + b, m*x_max + b], 'r')
plt.xlabel('GDP', fontsize=14)
plt.ylabel('Generosity', fontsize=14);

A comparison of the two regression lines indicates a strong correlation between GDP per capita and healthy life expectancy, but far less correlation between GDP per capita and generosity (i.e. level of charitable donations).
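One way to put a number on this comparison is the coefficient of determination R², which for a simple linear regression equals the squared Pearson correlation between the two variables. The sketch below uses randomly generated stand-in data, so the values it prints say nothing about the real dataset. The helper name `r_squared` is hypothetical; with the lab data you would pass `df["gdp"]` together with `df["health"]` or `df["generosity"]`.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of the simple linear regression of y on x (squared Pearson correlation)."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

# Stand-in data (assumption): one target that follows x closely, one that is pure noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.5, size=150)
tight = 0.6 * x + rng.normal(0.0, 0.05, size=150)
noisy = rng.normal(0.2, 0.1, size=150)

print("R^2, closely related target: %.3f" % r_squared(x, tight))
print("R^2, unrelated target: %.3f" % r_squared(x, noisy))
```

For a fitted scikit-learn model, `model.score(x, y)` returns the same R² quantity, so it could be called right after `model.fit(x, y)` in Tasks 4 and 5.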