# Random Forest Regression on the World Population
© Explore Data Science Academy

For this coding challenge, we'll learn how simple decision tree regressors can be combined to create an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning) model known as a Random Forest. In this challenge, we train the model using world population data.

<img src="https://github.com/Explore-AI/Pictures/blob/master/population.png?raw=true" width=70%/>



## Honour Code

I **HUMPHERY**, **OJO**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Imports

In [1]:
import numpy as np
import pandas as pd
from numpy import array
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

In [2]:
population_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/world_population.csv', index_col='Country Code')
meta_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/metadata.csv', index_col='Country Code')

In [3]:
population_df.head()

Country Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
ABW,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,57715.0,58055.0,58386.0,58726.0,...,101353.0,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0,105264.0
AFG,8996351.0,9166764.0,9345868.0,9533954.0,9731361.0,9938414.0,10152331.0,10372630.0,10604346.0,10854428.0,...,27294031.0,28004331.0,28803167.0,29708599.0,30696958.0,31731688.0,32758020.0,33736494.0,34656032.0,35530081.0
AGO,5643182.0,5753024.0,5866061.0,5980417.0,6093321.0,6203299.0,6309770.0,6414995.0,6523791.0,6642632.0,...,21759420.0,22549547.0,23369131.0,24218565.0,25096150.0,25998340.0,26920466.0,27859305.0,28813463.0,29784193.0
ALB,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,1914573.0,1965598.0,2022272.0,2081695.0,...,2947314.0,2927519.0,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0
AND,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,19647.0,20758.0,21890.0,23058.0,...,83861.0,84462.0,84449.0,83751.0,82431.0,80788.0,79223.0,78014.0,77281.0,76965.0


In [4]:
meta_df.head()

Country Code,Region,Income Group,Special Notes
ABW,Latin America & Caribbean,High income,Mining is included in agriculture\r\r\r\nElect...
AFG,South Asia,Low income,Fiscal year end: March 20; reporting period fo...
AGO,Sub-Saharan Africa,Lower middle income,
ALB,Europe & Central Asia,Upper middle income,
AND,Europe & Central Asia,High income,WB-3 code changed from ADO to AND to align wit...


### Question 1

As we've seen previously, the world population data spans from 1960 to 2017. We'd like to build a predictive model that can give us the best guess at what the world population in a given year was. However, as a slight twist this time, we want to compute this estimate for only _countries within a given income group_. 

First, however, we need to organise our data so that sklearn's `RandomForestRegressor` class can train on it. To do this, we will write a function that takes an income group as input and returns a 2-d numpy array containing the year and the measured population.

_**Function Specifications:**_
* Should take a `str` argument, called `income_group_name` as input and return a numpy `array` type as output.
* Set the default argument of `income_group_name` to equal `'Low income'`.
* If the specified value of `income_group_name` does not exist, the function must raise a `ValueError`.
* The array should only have two columns containing the year and the population, in other words, it should have a shape `(?, 2)` where `?` is the length of the data.
* The values within the array should be of type `np.int64`. 
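
As a rough illustration of the specification above, the sketch below applies the same filter-then-sum idea to a tiny made-up dataset. The frames `toy_population` and `toy_meta` and the function `toy_total_pop_by_income` are hypothetical stand-ins, not the challenge data or the required solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three countries and two years (not the real dataset).
codes = pd.Index(["AAA", "BBB", "CCC"], name="Country Code")
toy_population = pd.DataFrame(
    {"1960": [10.0, 20.0, 30.0], "1961": [11.0, 22.0, 33.0]}, index=codes
)
toy_meta = pd.DataFrame(
    {"Income Group": ["Low income", "High income", "Low income"]}, index=codes
)


def toy_total_pop_by_income(income_group_name="Low income"):
    # Reject groups that never occur in the metadata.
    if income_group_name not in set(toy_meta["Income Group"]):
        raise ValueError(f"Unknown income group: {income_group_name!r}")
    # Country codes belonging to the requested group.
    group_codes = toy_meta.index[toy_meta["Income Group"] == income_group_name]
    # Total population per year over those countries.
    totals = toy_population.loc[group_codes].sum()
    years = np.array([int(year) for year in totals.index], dtype=np.int64)
    population = totals.to_numpy().astype(np.int64)
    # One [year, population] row per year.
    return np.column_stack((years, population))


result = toy_total_pop_by_income("Low income")
print(result.shape, result.dtype)
print(result)
```

With this toy data, the `'Low income'` group contains `AAA` and `CCC`, so each row holds a year and the sum of those two countries' populations.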

_**Further Reading:**_

Data types are associated with memory allocation. As such, your choice of data type affects the range and precision of the values your program can represent. For example, a 32-bit integer (`np.int32`) can only store values between -2147483648 and 2147483647; results outside this range overflow and, in array arithmetic, typically wrap around silently rather than raising an error. To avoid this, we can use a data type with a larger memory capacity, e.g. `np.int64`.

https://docs.scipy.org/doc/numpy/user/basics.types.html
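
The short sketch below illustrates the point about integer ranges. It uses `np.iinfo` to query the limits of each type and then doubles the 2017 high-income total that appears later in this notebook (1225514228); the variable names are illustrative only.

```python
import numpy as np

# Representable range of 32-bit and 64-bit signed integers.
print(np.iinfo(np.int32).min, np.iinfo(np.int32).max)  # -2147483648 2147483647
print(np.iinfo(np.int64).max)                          # 9223372036854775807

# The 2017 high-income total fits in 32 bits, but twice that value does not.
high_income_2017 = 1225514228
as_int32 = np.array([high_income_2017], dtype=np.int32)
as_int64 = np.array([high_income_2017], dtype=np.int64)

doubled_32 = as_int32 + as_int32  # overflows and wraps around
doubled_64 = as_int64 + as_int64  # fits comfortably

print(doubled_32)  # [-1843938840]
print(doubled_64)  # [2451028456]
```

The 32-bit result is silently wrong, which is why the specification asks for `np.int64` values.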

In [5]:
population_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 217 entries, ABW to ZWE
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1960    214 non-null    float64
 1   1961    214 non-null    float64
 2   1962    214 non-null    float64
 3   1963    214 non-null    float64
 4   1964    214 non-null    float64
 5   1965    214 non-null    float64
 6   1966    214 non-null    float64
 7   1967    214 non-null    float64
 8   1968    214 non-null    float64
 9   1969    214 non-null    float64
 10  1970    214 non-null    float64
 11  1971    214 non-null    float64
 12  1972    214 non-null    float64
 13  1973    214 non-null    float64
 14  1974    214 non-null    float64
 15  1975    214 non-null    float64
 16  1976    214 non-null    float64
 17  1977    214 non-null    float64
 18  1978    214 non-null    float64
 19  1979    214 non-null    float64
 20  1980    214 non-null    float64
 21  1981    214 non-null    float64
 22  1982 

In [6]:
meta_df.head()

Country Code,Region,Income Group,Special Notes
ABW,Latin America & Caribbean,High income,Mining is included in agriculture\r\r\r\nElect...
AFG,South Asia,Low income,Fiscal year end: March 20; reporting period fo...
AGO,Sub-Saharan Africa,Lower middle income,
ALB,Europe & Central Asia,Upper middle income,
AND,Europe & Central Asia,High income,WB-3 code changed from ADO to AND to align wit...


In [7]:
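### START FUNCTION
def get_total_pop_by_income(income_group_name='Low income'):
    # Reject income groups that do not appear in the metadata.
    if income_group_name not in set(meta_df['Income Group']):
        raise ValueError("Invalid income_group_name")

    # Country codes that belong to the requested income group.
    country_codes = meta_df[meta_df['Income Group'] == income_group_name].index
    # Sum the population over those countries for every year.
    totals = population_df.reindex(country_codes).sum()
    years = np.array([int(year) for year in totals.index], dtype='int64')
    population = totals.to_numpy(dtype='int64')
    # Stack into an (n_years, 2) array of [year, population] rows.
    return np.column_stack((years, population))

### END FUNCTION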

In [10]:
get_total_pop_by_income('High income')  

array([[      1960,  769889923],
       [      1961,  781225329],
       [      1962,  791207437],
       [      1963,  801108277],
       [      1964,  810900987],
       [      1965,  820309686],
       [      1966,  829088382],
       [      1967,  837479954],
       [      1968,  844905494],
       [      1969,  854059674],
       [      1970,  862276721],
       [      1971,  871169187],
       [      1972,  880246152],
       [      1973,  888486025],
       [      1974,  897803169],
       [      1975,  906573084],
       [      1976,  913843314],
       [      1977,  921330504],
       [      1978,  928906293],
       [      1979,  936836246],
       [      1980,  944587066],
       [      1981,  952368316],
       [      1982,  959759971],
       [      1983,  966754949],
       [      1984,  973423742],
       [      1985,  980143630],
       [      1986,  987194728],
       [      1987,  994242786],
       [      1988, 1001421456],
       [      1989, 1009036892],
       [  

_**Expected Outputs:**_
```python
get_total_pop_by_income('High income')
```
> ```
array([[      1960,  769889923],
       [      1961,  781225329],
       [      1962,  791207437],
       [      1963,  801108277],
       ...
       [      2015, 1211252041],
       [      2016, 1218629612],
       [      2017, 1225514228]])
```




### Question 2

Now that we have our data, we need to split it into the data points we will train on and the data points we will make our predictions on.

Unlike in the previous coding challenge, a friend of ours has indicated that sklearn has a bunch of built-in functionality for creating training and testing sets. In this case however, they have asked us to implement a k-fold cross validation split of the data using sklearn's `KFold` [class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) (which has already been imported into this notebook for your convenience). 

Using this knowledge, write a function which uses sklearn's `KFold` [class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) internally, and that takes as input a 2-d numpy array and an integer `K` corresponding to the number of splits. This function will then return a list of `K` tuples. Each tuple in this list should consist of a `train_indices` list and a `test_indices` list containing the training/testing data point indices for that particular K$^{th}$ split.

_**Function Specifications:**_
* Should take a 2-d numpy `array` and an integer `K` as input.
* Should use sklearn's `KFold` [class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).
* Should return a list of `K` `tuples` containing a list of training and testing indices corresponding to the data points that belong to a particular split. For example, given an array called `data` and an integer `K`, the function should return: 
>```
data_indices = [(list_of_train_indices_for_split_1, list_of_test_indices_for_split_1),
                  (list_of_train_indices_for_split_2, list_of_test_indices_for_split_2),
                  (list_of_train_indices_for_split_3, list_of_test_indices_for_split_3),
                                                   ...
                                                   ...
                  (list_of_train_indices_for_split_K, list_of_test_indices_for_split_K)]
```

* The `shuffle` argument in the KFold object should be set to `False`.

**_Hint_**: To see an example of how to use the `KFold` object, enter `help(KFold)` in a new notebook cell.
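
As a minimal sketch of how `KFold` behaves with `shuffle=False`, the example below splits a small made-up array of six rows into three consecutive folds. The array `toy_data` is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy array: six rows, two columns.
toy_data = np.arange(12).reshape(6, 2)

kf = KFold(n_splits=3, shuffle=False)
# kf.split yields one (train_indices, test_indices) pair per fold.
splits = [(train_idx, test_idx) for train_idx, test_idx in kf.split(toy_data)]

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
# train: [2 3 4 5] test: [0 1]
# train: [0 1 4 5] test: [2 3]
# train: [0 1 2 3] test: [4 5]
```

Each row index appears in exactly one test fold, and the remaining indices form the training set for that fold.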

In [11]:
help(KFold)

Help on class KFold in module sklearn.model_selection._split:

class KFold(_BaseKFold)
 |  KFold(n_splits=5, *, shuffle=False, random_state=None)
 |  
 |  K-Folds cross-validator
 |  
 |  Provides train/test indices to split data in train/test sets. Split
 |  dataset into k consecutive folds (without shuffling by default).
 |  
 |  Each fold is then used once as a validation while the k - 1 remaining
 |  folds form the training set.
 |  
 |  Read more in the :ref:`User Guide <k_fold>`.
 |  
 |  Parameters
 |  ----------
 |  n_splits : int, default=5
 |      Number of folds. Must be at least 2.
 |  
 |      .. versionchanged:: 0.22
 |          ``n_splits`` default value changed from 3 to 5.
 |  
 |  shuffle : bool, default=False
 |      Whether to shuffle the data before splitting into batches.
 |      Note that the samples within each split will not be shuffled.
 |  
 |  random_state : int, RandomState instance or None, default=None
 |      When `shuffle` is True, `random_state` affect

In [12]:
### START FUNCTION
def sklearn_kfold_split(data, K):
    # Consecutive (unshuffled) folds, as required by the specification.
    kf = KFold(n_splits=K, shuffle=False)
    # Each element is a (train_indices, test_indices) tuple for one split.
    return [(train_index, test_index) for train_index, test_index in kf.split(data)]

### END FUNCTION

In [13]:
data = get_total_pop_by_income('High income')
sklearn_kfold_split(data,4) 

[(array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
         32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
         49, 50, 51, 52, 53, 54, 55, 56, 57]),
  array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 30, 31,
         32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
         49, 50, 51, 52, 53, 54, 55, 56, 57]),
  array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
         17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 44, 45, 46, 47,
         48, 49, 50, 51, 52, 53, 54, 55, 56, 57]),
  array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
         17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
         34, 35, 36, 37, 38

_**Expected Outputs:**_
```python
data = get_total_pop_by_income('High income')
sklearn_kfold_split(data,4)
```
> ```
[(array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
         32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
         49, 50, 51, 52, 53, 54, 55, 56, 57]),
  array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 30, 31,
         32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
         49, 50, 51, 52, 53, 54, 55, 56, 57]),
  array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
         17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 44, 45, 46, 47,
         48, 49, 50, 51, 52, 53, 54, 55, 56, 57]),
  array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
         17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
         34, 35, 36, 37, 38, 39, 40, 41, 42, 43]),
  array([44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57]))]
 ```

### Question 3

Now that we have formatted our data, we can fit a model using sklearn's `RandomForestRegressor` class. We'll write a function that will take as input the data indices (consisting of the train and test indices for each split) that we created in the last question, train a different `RandomForestRegressor` on each split and return the model that obtains the best testing set performance across all K splits.

**Important Note:** Due to the random initialisation process used within sklearn's `RandomForestRegressor` class, you will need to fix the value of the `random_state` argument in order to get repeatable and predictable results.
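
To illustrate the note above, the sketch below fits two forests with the same `random_state` on made-up data and checks that their predictions agree. The data, the helper `fit_and_predict` and the choice of `n_estimators=20` are assumptions made only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical noisy linear data, generated with a fixed seed.
rng = np.random.default_rng(0)
X = np.arange(30).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=3.0, size=30)


def fit_and_predict(seed):
    # A fresh forest whose bootstrap sampling is controlled by `seed`.
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    model.fit(X, y)
    return model.predict([[12.5]])


first_run = fit_and_predict(42)
second_run = fit_and_predict(42)

# Same seed, same data: the two forests produce identical predictions.
print(np.array_equal(first_run, second_run))  # True
```

Without a fixed `random_state`, repeated runs may build different trees and so return slightly different predictions.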

_**Function Specifications:**_
* Should take a 2-d numpy array (i.e. the data) and `data_indices` (a list of `(train_indices,test_indices)` tuples) as input.
* For each `(train_indices,test_indices)` tuple in `data_indices` the function should:
    * Train a new `RandomForestRegressor` model on the portion of data indexed by `train_indices`
    * Evaluate the trained `RandomForestRegressor` model on the portion of data indexed by `test_indices` using the **mean squared error** (which has also been imported for your convenience).
* After training and evaluating the `RandomForestRegressor` models, the function should return the model that obtained the highest testing set `mean_squared_error` on its allocated data split across all trained models. Note that this selects the model with the largest test error, which is the selection the expected output below reflects.
* The `RandomForestRegressor` models should be trained with `random_state` equal to `42`; all other parameters should be left as default.

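**_Hint_**: for each tuple in the `data_indices` list, you can obtain `X_train`, `X_test`, `y_train`, `y_test` as follows (sklearn expects a 2-d feature array, so reshape `X_train` and `X_test` with `.reshape(-1, 1)` before fitting):  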
>```
    X_train, y_train = data[train_indices,0],data[train_indices,1]
    X_test, y_test = data[test_indices,0],data[test_indices,1]
```
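
The hint above can be sketched end to end on a small made-up array. The array `toy` and the index ranges are hypothetical; the only addition to the hint is the `.reshape(-1, 1)` call, which turns the 1-d year column into the 2-d feature matrix that sklearn estimators expect.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical [year, value] rows with a steadily increasing value.
toy = np.array([[2000 + i, 100 + 10 * i] for i in range(10)], dtype=np.int64)
train_indices = np.arange(0, 8)
test_indices = np.arange(8, 10)

# Reshape the single feature column to shape (n_samples, 1).
X_train, y_train = toy[train_indices, 0].reshape(-1, 1), toy[train_indices, 1]
X_test, y_test = toy[test_indices, 0].reshape(-1, 1), toy[test_indices, 1]

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)

print(X_train.shape)  # (8, 1)
print(predictions.max() <= y_train.max())  # True
print(mse)
```

Because a random forest averages training targets, its predictions cannot exceed the largest target seen during training, so a held-out fold at the edge of the year range tends to produce a large error.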



In [28]:
### START FUNCTION
def best_k_model(data, data_indices):
    selected_model = None
    highest_mse = -np.inf

    for train_indices, test_indices in data_indices:
        # sklearn expects a 2-d feature matrix, so reshape the year column.
        X_train, y_train = data[train_indices, 0].reshape(-1, 1), data[train_indices, 1]
        X_test, y_test = data[test_indices, 0].reshape(-1, 1), data[test_indices, 1]

        model = RandomForestRegressor(random_state=42)
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        # Keep the model with the highest test MSE, as the specification asks.
        if mse > highest_mse:
            selected_model = model
            highest_mse = mse
    return selected_model


### END FUNCTION

In [29]:
data = get_total_pop_by_income('High income')
data_indices = sklearn_kfold_split(data,5)

best_model = best_k_model(data,data_indices)
best_model.predict([[1960]])

array([8.85170916e+08])

_**Expected Outputs:**_
```python
best_model.predict([[1960]]) == array([8.85170916e+08])
```
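
When checking a prediction against a printed value such as the one above, an exact `==` comparison can fail because the displayed number is rounded. The sketch below uses a hypothetical unrounded prediction to show the difference between `==` and a tolerance-based check with `np.allclose`.

```python
import numpy as np

# Hypothetical unrounded prediction (the real value is not shown in the notebook).
prediction = np.array([885170916.37])
# The rounded value as it is displayed.
displayed = np.array([8.85170916e+08])

# Exact comparison fails because of the digits lost in display.
print(prediction == displayed)             # [False]
# Comparison within a relative tolerance succeeds.
print(np.allclose(prediction, displayed))  # True
```

`np.allclose` uses a default relative tolerance of `1e-05`, which is loose enough to absorb display rounding at this magnitude.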