<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Code_challenge.png" style="display: block; margin-left: auto; margin-right: auto;" />
</div>

# Code challenge: Decision trees
© ExploreAI Academy

In this code challenge, we will test our knowledge of the fundamental concepts of decision trees by implementing a decision tree regression model and analysing its RMSLE.



⚠️ **Note that this code challenge is graded and will contribute to your overall marks for this module. Submit this notebook for grading. Note that the names of the functions are different in this notebook, so transfer the code from your own notebook into this submission notebook.**

### Instructions

- **Do not add or remove cells in this notebook. Do not edit or remove the `### START FUNCTION` or `### END FUNCTION` comments. Do not add any code outside of the functions you are required to edit. Doing any of this will lead to a mark of 0%!**

- Answer the questions according to the specifications provided.

- Use the given cell in each question to see if your function matches the expected outputs.

- Do not hard-code answers to the questions.

- The use of StackOverflow, Google, and other online tools is permitted. However, copying a fellow student's code is not permissible and is considered a breach of the Honour code. Doing this will result in a mark of 0%.

We begin by importing the necessary packages for the challenges.

In [None]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

## The dataset

The dataset contains population data for various countries over the years from 1960 to 2017. Each row corresponds to a specific country, identified by a country code, and each column represents a year. The values within the dataset represent the population count for each country in the corresponding year.

In [None]:
population_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/world_population.csv', index_col='Country Code')
population_df.head()

## Analysis

### Challenge 1: Population growth

The world population data spans from 1960 to 2017. We'd like to build a predictive model that can give us the best guess at what the population growth rate in a given year might be. We will calculate the population growth rate as follows:

$$
Growth\_rate = \frac{current\_year\_population - previous\_year\_population}{previous\_year\_population}
$$

As such, we can only calculate the growth rate for the year 1961 onwards.
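
As a quick sanity check of the formula, here is a minimal sketch on made-up numbers. The `populations` dictionary and its values are hypothetical and are not taken from the dataset; the point is only to show why the first year has no growth rate.

```python
# Hypothetical population counts for three consecutive years (not real data).
populations = {1960: 1000, 1961: 1100, 1962: 1210}

years = sorted(populations)
growth_rates = {}
# Start from the second year: the first year has no previous year to compare with.
for prev_year, year in zip(years, years[1:]):
    prev_pop = populations[prev_year]
    growth_rates[year] = round((populations[year] - prev_pop) / prev_pop, 5)

print(growth_rates)  # {1961: 0.1, 1962: 0.1}
```

Three years of population data yield only two growth rates, which mirrors the 1960 to 2017 data producing rates for 1961 to 2017.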

Write a function that takes the `population_df` and a `country_code` as input and computes the population growth rate for a given country starting from the year 1961. This function must return a 2-D numpy array that contains the year and corresponding growth rate for the country.

_**Function Specifications:**_
* Should take a `population_df` and `country_code` string as input and return a numpy `array` as output.
* The array should only have two columns containing the year and the population growth rate, in other words, it should have a shape `(?, 2)` where `?` is the length of the data.
* The growth rates should be rounded to 5 decimal places.

In [None]:
### START FUNCTION
def get_population_growth_rate_by_country_year(df, country_code):
    # Population counts for the selected country, one value per year column.
    country_pop_year = df.loc[country_code].values
    pop_year = df.columns
    growth_rates = []
    # Pair each year's population with the following year's, so results start at 1961.
    for i, prev_pop in enumerate(country_pop_year[:-1]):
        pop_growth_rate = round((country_pop_year[i+1] - prev_pop) / prev_pop, 5)
        year = int(pop_year[i+1])
        growth_rates.append([year, pop_growth_rate])
    return np.array(growth_rates)

### END FUNCTION

Input:

In [None]:
get_population_growth_rate_by_country_year(population_df,'ABW')

Expected output:

```
array([[ 1.961e+03,  2.263e-02],
       [ 1.962e+03,  1.420e-02],
       [ 1.963e+03,  8.360e-03],
       [ 1.964e+03,  5.940e-03],
            ...       ....
       [ 2.015e+03,  5.260e-03],
       [ 2.016e+03,  4.610e-03],
       [ 2.017e+03,  4.220e-03]])
```

### Challenge 2: Even-odd train-test split

Now that we have our data, we need to divide it into two sets: the data we will train on and the data we will test on. In this scenario, we're separating the data so that the **training set contains growth rates for even years and the test set contains growth rates for odd years**. We also need to divide our data into the predictive features (`X`) and the response variable (`y`).

Write a function that will take a 2-D numpy array as input and return two tuples of the form `(X_train, y_train), (X_test, y_test)`, where `(X_train, y_train)` are the features and response variables of the training set, and `(X_test, y_test)` are the features and response variables of the testing set. The training and testing data consist of even and odd years, respectively.

_**Function Specifications:**_
* Should take a 2-d numpy `array` as input.
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.
* `(X_train, y_train)` should consist of data from even years and `(X_test, y_test)` should consist of data from odd years.
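
One way to picture the even-odd split is with a boolean mask. The sketch below uses a made-up array (`toy` and its values are hypothetical) and deliberately different variable names so that it does not clash with the names used in the graded cells. Note that the years are stored as floats once they share an array with the growth rates, and the modulo test still works on them.

```python
import numpy as np

# Hypothetical (year, growth rate) rows; the values are made up for illustration.
toy = np.array([[1961, 0.01], [1962, 0.02], [1963, 0.03], [1964, 0.04]])

years, rates = toy[:, 0], toy[:, 1]
even_mask = years % 2 == 0  # True where the year is even

toy_train = (years[even_mask], rates[even_mask])    # even years
toy_test = (years[~even_mask], rates[~even_mask])   # odd years

print(toy_train[0])  # [1962. 1964.]
print(toy_test[0])   # [1961. 1963.]
```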

In [None]:
### START FUNCTION
def feature_response_split(arr):
    X = arr[:, 0]
    y = arr[:, 1]
    
    X_train = X[X % 2 == 0]
    y_train = y[X % 2 == 0]

    X_test = X[X % 2 != 0]
    y_test = y[X % 2 != 0]
    
    return (X_train, y_train), (X_test, y_test)

### END FUNCTION

Input:

In [None]:
data = get_population_growth_rate_by_country_year(population_df,'ABW')
(X_train, y_train), (X_test, y_test) = feature_response_split(data)

Expected output:

```
y_train ==  array([ 0.01419604,  0.00594409,  0.00618898,  0.00570149,  0.00573851,
        0.00672948,  0.00473084, -0.00117052, -0.00435676,  0.00193398,
        0.01284528,  0.01020884, -0.00606099, -0.01219414,  0.01830187,
        0.05590975,  0.05787267,  0.03580499,  0.02136897,  0.02076288,
        0.02254085,  0.01772885,  0.00800752,  0.00131397,  0.00212906,
        0.00513459,  0.00589222,  0.00460988])
```

```
X_test == array([1961., 1963., 1965., 1967., 1969., 1971., 1973., 1975., 1977.,
       1979., 1981., 1983., 1985., 1987., 1989., 1991., 1993., 1995.,
       1997., 1999., 2001., 2003., 2005., 2007., 2009., 2011., 2013.,
       2015., 2017.])
```

```
y_test == array([ 0.02263378,  0.00835927,  0.00575116,  0.00589102,  0.00582331,
        0.00638301,  0.00673463,  0.00213125, -0.0036312 , -0.00204649,
        0.00783746,  0.01395387,  0.00302374, -0.01294617, -0.0007695 ,
        0.03979147,  0.0625632 ,  0.04724902,  0.02705529,  0.01979903,
        0.02250889,  0.02131758,  0.01310552,  0.00384798,  0.00098665,
        0.00377696,  0.00594675,  0.00526037,  0.00421667])
```

### Challenge 3

Now that we have formatted our data, we can fit a model using sklearn's `DecisionTreeRegressor` class. We'll write a function that takes as input the features and response variables that we created in Challenge 2 and returns a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)` as well as a `MaxDepth` int corresponding to the `max_depth` hyperparameter in decision trees.
* Should return an sklearn `DecisionTreeRegressor` model.
* The returned model should be fitted to the data.

_**Hint:**_
You may need to reshape the data within the function. You can use `.reshape(-1, 1)` to do this.
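
The hint above matters because scikit-learn estimators expect the features as a 2-D array of shape `(n_samples, n_features)`, while a single column sliced from an array is 1-D. A minimal sketch of what the reshape does, using a made-up `years_1d` array:

```python
import numpy as np

# Hypothetical years, stored as a flat vector of shape (3,).
years_1d = np.array([1962.0, 1964.0, 1966.0])

# -1 lets numpy infer the number of rows; 1 fixes a single feature column.
years_2d = years_1d.reshape(-1, 1)

print(years_1d.shape)  # (3,)
print(years_2d.shape)  # (3, 1)
```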

In [None]:
### START FUNCTION
def train_model(X_train, y_train, MaxDepth):
    d_tree = DecisionTreeRegressor(max_depth = MaxDepth).fit(X_train.reshape(-1,1), y_train)
    return d_tree

### END FUNCTION

Input:

In [None]:
data = get_population_growth_rate_by_country_year(population_df,'ABW')
(X_train, y_train), _ = feature_response_split(data)

train_model(X_train, y_train,3).predict([[2017]])

Expected output:

```
array([0.00451333])
```

### Challenge 4

Now we would like to test our model on the testing data that we produced in Challenge 2. This test will give the Root Mean Squared Logarithmic Error (RMSLE), which is determined by:

$$
RMSLE = \sqrt{\frac{1}{N}\sum_{i=1}^N [\log(1+p_i) - \log(1+y_i)]^2}
$$

* $p_i$ refers to the $i^{\rm th}$ prediction made from `X_test`
* $y_i$ refers to the $i^{\rm th}$ value in `y_test`
* $N$ is the length of `y_test`
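
To make the formula concrete, here is a small worked example using only the standard library. The `preds` and `actuals` values are hypothetical and chosen so the result can be checked by hand, and the sketch assumes the natural logarithm (which is what `np.log` computes).

```python
import math

# Hypothetical predictions and true values, chosen so the result is easy to verify by hand.
preds = [0.0, 0.0]
actuals = [0.0, math.e - 1]  # log(1 + (e - 1)) = 1

# Squared difference of the log-transformed values, one entry per observation.
squared_log_errors = [
    (math.log1p(p) - math.log1p(y)) ** 2 for p, y in zip(preds, actuals)
]
rmsle = math.sqrt(sum(squared_log_errors) / len(actuals))

print(round(rmsle, 3))  # 0.707, i.e. sqrt(1/2)
```

The squared log errors are 0 and 1, their mean is 1/2, and the square root of that is about 0.707.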

_**Function Specifications:**_
* Should take a trained model and two `arrays` as input. These will be the `y_test` and `X_test` variables from Challenge 2, passed in that order.
* Should return the RMSLE of the predictions made from `X_test` compared with the values of `y_test`.
* The output should be a `float` rounded to 3 decimal places.



In [None]:
### START FUNCTION
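def test_model(model, y_test, X_test):
    # The model expects a 2-D feature array, so reshape the years into one column.
    y_pred = model.predict(X_test.reshape(-1, 1))
    # Sum of squared differences between the log-transformed predictions and true values.
    log_pred = np.sum((np.log(1 + y_pred) - np.log(1 + y_test))**2)
    N = len(y_test)
    RMSLE = round(np.sqrt(log_pred/N),3)
    return RMSLE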

### END FUNCTION

Input:

In [None]:
data = get_population_growth_rate_by_country_year(population_df,'ABW')
(X_train, y_train), (X_test, y_test) = feature_response_split(data)
lm = train_model(X_train, y_train,3)
test_model(lm, y_test, X_test)

Expected output:

```
0.008
```

What does this value say about our model?
- ✍️ Your notes here


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png" style="width:200px" />
</div>