# Linear Regression on the World Population
© Explore Data Science Academy

Now that we know how to clean and explore a dataset, we will train a simple linear regression model on a set of data. We'll use the world population data from the Analyse Exam.


## Honour Code

I **MANGALISO**, **MAKHOBA**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.


## Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

In [2]:
df_pop = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/world_population.csv')
country_map_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/country_code_map.csv', index_col='Country Code')

In [3]:
df_pop.head()

| | Country Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ABW | 54211.0 | 55438.0 | 56225.0 | 56695.0 | 57032.0 | 57360.0 | 57715.0 | 58055.0 | 58386.0 | ... | 101353.0 | 101453.0 | 101669.0 | 102053.0 | 102577.0 | 103187.0 | 103795.0 | 104341.0 | 104822.0 | 105264.0 |
| 1 | AFG | 8996351.0 | 9166764.0 | 9345868.0 | 9533954.0 | 9731361.0 | 9938414.0 | 10152331.0 | 10372630.0 | 10604346.0 | ... | 27294031.0 | 28004331.0 | 28803167.0 | 29708599.0 | 30696958.0 | 31731688.0 | 32758020.0 | 33736494.0 | 34656032.0 | 35530081.0 |
| 2 | AGO | 5643182.0 | 5753024.0 | 5866061.0 | 5980417.0 | 6093321.0 | 6203299.0 | 6309770.0 | 6414995.0 | 6523791.0 | ... | 21759420.0 | 22549547.0 | 23369131.0 | 24218565.0 | 25096150.0 | 25998340.0 | 26920466.0 | 27859305.0 | 28813463.0 | 29784193.0 |
| 3 | ALB | 1608800.0 | 1659800.0 | 1711319.0 | 1762621.0 | 1814135.0 | 1864791.0 | 1914573.0 | 1965598.0 | 2022272.0 | ... | 2947314.0 | 2927519.0 | 2913021.0 | 2905195.0 | 2900401.0 | 2895092.0 | 2889104.0 | 2880703.0 | 2876101.0 | 2873457.0 |
| 4 | AND | 13411.0 | 14375.0 | 15370.0 | 16412.0 | 17469.0 | 18549.0 | 19647.0 | 20758.0 | 21890.0 | ... | 83861.0 | 84462.0 | 84449.0 | 83751.0 | 82431.0 | 80788.0 | 79223.0 | 78014.0 | 77281.0 | 76965.0 |


In [4]:
country_map_df.head()

| Country Code | Country Name |
| --- | --- |
| ABW | Aruba |
| AFG | Afghanistan |
| AGO | Angola |
| ALB | Albania |
| AND | Andorra |


## Questions

### Question 1

The world population data spans from 1960 to 2017. We'd like to build a predictive model that can give us the best guess at what the future or past population of a particular country was or might be.

First, however, we need to format our data so that sklearn's `Ridge` regression class can train on it. To do this, we will write a function that takes a country name as input and returns a 2-d numpy array containing the year and the measured population.

_**Function Specifications:**_
* Should take a `str` as input and return a numpy `array` type as output.
* The array should only have two columns containing the year and the population, in other words, it should have a shape `(?, 2)` where `?` is the length of the data.
* The values within the array should be of type `int`.

_**Hint:**_
You'll need to use both the population and country map dataframes given above.

In [35]:
### START FUNCTION
def get_year_pop(country_name):
    # Attach country names to the population table via the shared country code.
    merged = country_map_df.reset_index().merge(df_pop, on='Country Code')
    # Keep the requested country, transpose so that each year becomes a row,
    # and drop the first two rows ('Country Code' and 'Country Name').
    year_pop = merged[merged['Country Name'] == country_name].T.reset_index()[2:]
    # Cast the year labels and the float populations to int, as the spec requires
    # (this assumes the country has no missing values).
    return year_pop.to_numpy().astype(int)

### END FUNCTION

In [37]:
get_year_pop('Aruba')

array([[  1960,  54211],
       [  1961,  55438],
       [  1962,  56225],
        ...
       [  2016, 104822],
       [  2017, 105264]])

_**Expected Outputs:**_
```python
get_year_pop('Aruba')
```
```
array([[  1960,  54211],
       [  1961,  55438],
       [  1962,  56225],
        ...
       [  2016, 104822],
       [  2017, 105264]])
```

```python
get_year_pop('Aruba').shape == (58, 2)
```
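The specification can be sanity-checked on a small, made-up pair of dataframes. The sketch below is only an illustration under stated assumptions: the toy tables, the country names `Examplia` and `Examplib`, and the helper `year_pop_from_frames` are hypothetical and merely mimic the layout of `df_pop` and `country_map_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables that mimic the layout of df_pop and country_map_df.
toy_pop = pd.DataFrame({
    'Country Code': ['EXA', 'EXB'],
    '1960': [100.0, 200.0],
    '1961': [110.0, 210.0],
    '1962': [120.0, 220.0],
})
toy_map = pd.DataFrame(
    {'Country Name': ['Examplia', 'Examplib']},
    index=pd.Index(['EXA', 'EXB'], name='Country Code'),
)

def year_pop_from_frames(country_name, pop_df, map_df):
    """Return an int array of (year, population) rows for one country."""
    # Find the code whose name matches, then select that country's row.
    code = map_df.index[map_df['Country Name'] == country_name][0]
    row = pop_df.set_index('Country Code').loc[code]
    # Year labels are strings and populations are floats, so cast both to int.
    years = row.index.astype(int).to_numpy()
    return np.column_stack([years, row.to_numpy().astype(int)])

result = year_pop_from_frames('Examplia', toy_pop, toy_map)
print(result.shape, result.dtype)
```

Checking the shape and the dtype separately is useful because an array can have the right shape while still holding floats or Python objects.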

### Question 2

Now that we have our data, we need to split it into a training set and a testing set. But before we do that, we also need to split our data into the predictive features (denoted `X`) and the response (denoted `y`).

Write a function that will take as input a 2-d numpy array and return four variables in the form of `(X_train, y_train), (X_test, y_test)`, where `(X_train, y_train)` are the features + response of the training set, and `(X_test, y_test)` are the features + response of the testing set.

_**Function Specifications:**_
* Should take a 2-d numpy `array` as input.
* Should split the array such that X is the year, and y is the corresponding population.
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.
* Should use sklearn's train_test_split function with a `test_size = 0.2` and `random_state = 42`.

In [38]:
### START FUNCTION
def feature_response_split(arr):
    # Column 0 holds the year (feature); column 1 holds the population (response).
    X = arr[:, 0]
    y = arr[:, 1]

    # train_test_split is already imported in the Imports section.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    return (X_train, y_train), (X_test, y_test)

### END FUNCTION

In [39]:
data = get_year_pop('Aruba')
(X_train, y_train), (X_test, y_test) = feature_response_split(data)

_**Expected Outputs:**_
```python
data = get_year_pop('Aruba')
feature_response_split(data)
```
```
X_train == array([1996, 1991, 1968, 1977, 1966, 1964, 2001, 1979, 1990, 2009, 2010,
       2014, 1975, 1969, 1987, 1986, 1976, 1984, 1993, 2015, 2000, 1971,
       1992, 2016, 2003, 1989, 2013, 1961, 1981, 1962, 2005, 1999, 1995,
       1983, 2007, 1970, 1982, 1978, 2017, 1980, 1967, 2002, 1974, 1988,
       2011, 1998])

y_train == array([ 83200,  64622,  58386,  60366,  57715,  57032,  92898,  59980,
        62149, 101453, 101669, 103795,  60657,  58726,  61833,  62644,
        60586,  62836,  72504, 104341,  90853,  59440,  68235, 104822,
        97017,  61032, 103187,  55438,  60567,  56225, 100031,  89005,
        80324,  62201, 101220,  59063,  61345,  60103, 105264,  60096,
        58055,  94992,  60528,  61079, 102053,  87277])
        
X_test == array([1960, 1965, 1994, 1973, 2004, 2012, 1997, 1985, 2006, 1972, 2008,
       1963])
       
y_test == array([ 54211,  57360,  76700,  60243,  98737, 102577,  85451,  63026,
       100832,  59840, 101353,  56695])
 ```
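The split sizes in the expected output follow from how `train_test_split` rounds. The sketch below uses made-up population values (an assumption for illustration only) on the same 58 years to show that `test_size=0.2` yields 12 test rows and 46 training rows, and that each year stays paired with its own response.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-in data: 58 yearly rows with a made-up population column.
years = np.arange(1960, 2018)
toy = np.column_stack([years, 1000 * (years - 1959)])

X, y = toy[:, 0], toy[:, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With 58 rows, a 20% test split rounds up to 12 test rows, leaving 46 for training.
print(len(X_train), len(X_test))
# Each year lands in exactly one of the two sets, so this prints an empty set.
print(set(X_train) & set(X_test))
```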

### Question 3

Now that we have formatted our data, we can fit a model using sklearn's `Ridge()` class. We'll write a function that takes as input the features and response variables created in the last question and returns a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `Ridge` model.
* The returned model should be fitted to the data.

_**Hint:**_
You may need to reshape the data within the function. You can use `.reshape(-1, 1)` to do this.
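The reshape is needed because sklearn estimators expect the feature matrix to be 2-d, with one row per sample and one column per feature. A minimal sketch on made-up values shows the effect of `.reshape(-1, 1)`:

```python
import numpy as np

years = np.array([1960, 1961, 1962])   # 1-d array, shape (3,)
column = years.reshape(-1, 1)          # 2-d column, shape (3, 1)

print(years.shape, column.shape)
```

The `-1` asks numpy to infer the number of rows from the length of the array.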

In [45]:
### START FUNCTION
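def train_model(X_train, y_train):
    # Ridge is already imported in the Imports section.
    ridge = Ridge()
    # X must be 2-d for sklearn; y is reshaped too so that predictions
    # come back as a 2-d array.
    ridge.fit(X_train.reshape(-1, 1), y_train.reshape(-1, 1))

    return ridge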

### END FUNCTION

In [92]:
data = get_year_pop('Aruba')
(X_train, y_train), _ = feature_response_split(data)

train_model(X_train, y_train).predict([[2017]])

array([[104468.15547163]])

_**Expected Outputs:**_
```python
train_model(X_train, y_train).predict([[2017]]) == array([[104468.15547163]])
```
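Because `Ridge` adds an L2 penalty to ordinary least squares, its fitted slope can differ slightly from a plain linear regression. The sketch below compares the two on made-up, exactly linear data (an assumption for illustration, not the population data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Made-up, exactly linear data for illustration only.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 5.0

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha=1.0 is Ridge's default

# The L2 penalty shrinks the ridge slope slightly towards zero.
print(ols.coef_[0], ridge.coef_[0])
```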

### Question 4

We would now like to test our model using the testing data that we produced in Question 2. To achieve this, we'll use the mean squared error (MSE), which for your convenience is written as:
$$
MSE = \frac{1}{N}\sum_{i=1}^N (p_i - y_i)^2,
$$
where $p_i$ refers to the $i^{\rm th}$ prediction made from `X_test`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.
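The formula can be checked on a few made-up numbers. In the sketch below, `p` and `y` are illustrative values only; the manual calculation is compared with sklearn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up predictions and observed values, for illustration only.
p = np.array([10.0, 20.0, 30.0])
y = np.array([12.0, 18.0, 33.0])

# Direct translation of the formula: the average of the squared differences.
mse_manual = np.mean((p - y) ** 2)
mse_sklearn = mean_squared_error(y, p)

print(mse_manual, mse_sklearn)
```

Here the squared differences are 4, 4 and 9, so both values equal 17/3.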

_**Function Specifications:**_
* Should take a trained model and two `arrays` as input. This will be the `X_test` and `y_test` variables from Question 2. 
* Should return the mean square error over the input from the predicted values of `X_test` as compared to values of `y_test`.
* The output should be a `float` rounded to 2 decimal places.

In [51]:
### START FUNCTION
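def test_model(model, X_test, y_test):
    # mean_squared_error is not in the Imports section, so import it here.
    from sklearn.metrics import mean_squared_error

    # The model was trained on a 2-d feature column, so reshape X_test to match.
    y_pred = model.predict(X_test.reshape(-1, 1))

    return round(mean_squared_error(y_test.reshape(-1, 1), y_pred), 2)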

### END FUNCTION

In [95]:
data = get_year_pop('Aruba')
(X_train, y_train), (X_test, y_test) = feature_response_split(data)
lm = train_model(X_train, y_train)

test_model(lm, X_test, y_test)

42483684.58

_**Expected Outputs:**_
```python
test_model(lm, X_test, y_test) == 42483684.58
```
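An MSE is expressed in squared units, which makes the value above hard to read directly. One common way to interpret it, sketched here as an optional extra rather than part of the exercise, is to take the square root so that the error is back in units of people.

```python
import math

mse = 42483684.58          # value reported for Aruba above
rmse = math.sqrt(mse)      # root mean squared error, in units of people

print(round(rmse, 2))
```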