# Challenge 3

## Reading Materials


### [Kaggle Tutorials](https://www.kaggle.com/learn/overview)
Due: October 2, 2018
~~September 27, 2018~~

 * __Complete__ [Data Visualisation course](https://www.kaggle.com/learn/data-visualisation), if needed


## Activity 3: World Bank Data
Due: October 2, 2018
~~September 27, 2018~~

Activity 3 is based on the [World Bank Data](https://www.kaggle.com/gemartin/world-bank-data-1960-to-2016), which aggregates the population of various countries, along with fertility rate and life expectancy, from 1960 to 2016.
The goal of this activity is to explore a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and limit the fragility of a statistical model.
Relevant topics for this challenge include the following sections.
 * Linear Regression (Section 9.2)
 * Regularized Loss Minimization (Section 13.1)
 * Stable Rules Do Not Overfit (Section 13.2)


### Least Squares

The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems.
This technique is used extensively in the context of data fitting; the best fit in the least-squares sense minimizes the sum of squared residuals.
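In the current context, with only 40 training years and over 250 candidate predictor countries, the unregularized problem is underdetermined, so a regularized version of the least squares solution is highly desirable.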
Ridge regression, or Tikhonov regularization, adds the constraint that the $L_2$-norm of the parameter vector remains no greater than a given value.
An alternative regularized version of least squares is Lasso (least absolute shrinkage and selection operator), which uses the constraint that the $L_1$-norm of the parameter vector be no greater than a given value.
The latter constraint, which we will focus on, favors sparsity in the solution.
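
For reference, the two constrained problems can be written side by side. The symbols $\mathbf{y}$ (vector of observations), $X$ (matrix of predictors), and $t$ (the bound on the norm) are notation introduced here for illustration only.
\begin{align*}
\text{Ridge:} \quad & \min_{\boldsymbol{\alpha}} \| \mathbf{y} - X \boldsymbol{\alpha} \|_2^2 \quad \text{subject to } \| \boldsymbol{\alpha} \|_2 \leq t, \\
\text{Lasso:} \quad & \min_{\boldsymbol{\alpha}} \| \mathbf{y} - X \boldsymbol{\alpha} \|_2^2 \quad \text{subject to } \| \boldsymbol{\alpha} \|_1 \leq t.
\end{align*}
Software packages typically solve the equivalent penalized form, in which the norm is added to the objective with a weight; a larger weight corresponds to a smaller bound $t$.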


### Data

As mentioned above, the data we are using is based on the [World Bank Data](https://www.kaggle.com/gemartin/world-bank-data-1960-to-2016).
Specifically, we are using a cleaned up version of `country_population.csv` where incomplete rows have been removed and extraneous columns have been deleted.
The intent is to use years 1960 to 1999 to train a least squares model and, subsequently, explore its predictive power for years 2000 to 2016.
For a given country, the solution should be a population estimate based on a linear combination of (at most) five other countries.
Mathematically, the population of country $C_0$ is estimated from the populations of at most five other countries,
\begin{equation*}
\hat{C}_0(\text{year}) = \sum_i \alpha_i C_i(\text{year})
\end{equation*}
subject to $\| \boldsymbol{\alpha} \|_0 \leq 5$, where $\| \boldsymbol{\alpha} \|_0$ counts the nonzero entries of $\boldsymbol{\alpha}$.
For every country, the parameters $\boldsymbol{\alpha}$ must be derived based on populations from 1960 to 1999.
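
One possible way to obtain such a 5-sparse vector is sketched below: scan a grid of Lasso penalties from strong to weak, keep the last set of selected countries that has at most five members, and refit ordinary least squares on that set. This is a minimal sketch under stated assumptions, not a prescribed solution. The helper name `fit_sparse_predictor`, the penalty grid, and the synthetic data are all hypothetical; real population data would likely need rescaling before the grid is meaningful.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression


def fit_sparse_predictor(X, y, max_nonzero=5, alphas=None):
    """Return a coefficient vector with at most `max_nonzero` nonzero entries.

    X: (n_years, n_candidates) populations of the candidate predictor countries.
    y: (n_years,) population of the target country.
    """
    if alphas is None:
        # Hypothetical grid, scanned from strong to weak regularization.
        alphas = np.logspace(1, -4, 30)
    support = np.array([], dtype=int)
    for alpha in alphas:
        model = Lasso(alpha=alpha, fit_intercept=False, max_iter=50_000)
        model.fit(X, y)
        active = np.flatnonzero(model.coef_)
        if active.size > max_nonzero:
            break  # weaker penalties tend to select even more countries
        if active.size > 0:
            support = active
    coef = np.zeros(X.shape[1])
    if support.size > 0:
        # Refit least squares on the selected columns only (no intercept,
        # matching the model C_0 = sum_i alpha_i C_i).
        ols = LinearRegression(fit_intercept=False).fit(X[:, support], y)
        coef[support] = ols.coef_
    return coef


# Synthetic stand-in for the training data: 40 years, 20 candidate countries.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 20))
true_coef = np.zeros(20)
true_coef[[2, 7, 11]] = [1.5, -2.0, 0.7]
y_demo = X_demo @ true_coef + 0.01 * rng.normal(size=40)

alpha_hat = fit_sparse_predictor(X_demo, y_demo)
print(np.count_nonzero(alpha_hat), np.flatnonzero(alpha_hat))
```

The refit step removes the shrinkage that the Lasso penalty applies to the selected coefficients; whether that helps on the held-out years is something to check empirically.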


### Evaluation

The evaluation criterion is the average sum of squared residuals for populations from 2000 to 2016.
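
A local estimate of this score can be computed before submitting. The sketch below takes one reading of the criterion, namely the mean of the squared residuals over all year and country entries; the exact normalization used by the competition is not stated here, so treat the function (and its hypothetical name `average_squared_residual`) as an assumption.

```python
import numpy as np


def average_squared_residual(predicted, actual):
    """Mean of squared residuals over all (year, country) entries."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    if predicted.shape != actual.shape:
        raise ValueError("predicted and actual must have the same shape")
    return float(np.mean((predicted - actual) ** 2))


# Tiny hand-checkable example: 2 years x 2 countries (made-up numbers).
actual_demo = np.array([[10.0, 20.0], [12.0, 22.0]])
predicted_demo = np.array([[11.0, 20.0], [12.0, 19.0]])
score_demo = average_squared_residual(predicted_demo, actual_demo)
print(score_demo)
```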


### File Descriptions

 * `population_training.csv` – the training data
 * `population_training_kaggle.csv` – the training data in Kaggle format (40 x 259)
 * `population_testing.csv` – the test data
 * `population_testing_kaggle.csv` – the test data in Kaggle format (17 x 259)
 * `population_sample_kaggle.csv` – a sample Kaggle solution (17 x 259)
 * `population_parameters.csv` – a sample parameters file (259 x 259)


### Deliverables (Part 1)

Submissions are evaluated by comparing the submitted CSV to the ground-truth solution CSV using the squared-residual criterion described above.
Team numbers and compositions are available on GitHub under

* `ECEN689-Fall2018 -> Challenges -> 3Files -> README.md`

Documents to be submitted are as follows.

__Kaggle__: Every team should enter the Kaggle competition and submit a prediction file for years 2000 to 2016 in the Kaggle format, as specified in `population_sample_kaggle.csv`.

__GitHub__: Every team should commit and push the following files.
 1. A prediction file for years 2000 to 2016 (17 x 259)
    * `ECEN689-Fall2018 -> Challenges -> 3Files -> Team## -> population_prediction.csv`
 2. A parameter file with one column per country (259 x 259)
    * `ECEN689-Fall2018 -> Challenges -> 3Files -> Team## -> population_parameters.csv`
 3. Jupyter notebook code or Python code within the same `Team##` directory

Every column in `population_parameters.csv` should be 5-sparse, that is, contain at most five nonzero entries.
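
The sparsity requirement can be checked mechanically before committing. The sketch below counts nonzero entries per column on a small synthetic matrix; the helper name `columns_are_sparse` is hypothetical, and loading the real file (for example with `pd.read_csv`) would depend on its header and index layout, which is not assumed here.

```python
import numpy as np


def columns_are_sparse(parameters, max_nonzero=5):
    """Boolean array: True where a column has at most `max_nonzero` nonzeros."""
    parameters = np.asarray(parameters)
    return np.count_nonzero(parameters, axis=0) <= max_nonzero


# Synthetic 8 x 8 parameter matrix standing in for the 259 x 259 file.
params_demo = np.zeros((8, 8))
params_demo[[0, 2, 5], 1] = [0.4, 1.1, -0.3]  # column 1: three nonzeros
params_demo[:7, 3] = 1.0                      # column 3: seven nonzeros
ok = columns_are_sparse(params_demo)
print(np.flatnonzero(~ok))  # indices of columns that break the constraint
```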

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error

# Load the Kaggle-format training and testing data, dropping incomplete rows.
population_training_df = pd.read_csv('3Files/population_training_kaggle.csv', encoding='cp1252').dropna(axis=0)
population_testing_df = pd.read_csv('3Files/population_testing_kaggle.csv', encoding='cp1252').dropna(axis=0)
print(population_training_df.shape)
print(population_testing_df.shape)

# Convert the data frames to NumPy arrays for use with scikit-learn.
population_training_matrix = population_training_df.values
population_testing_matrix = population_testing_df.values
print(population_training_matrix.shape)
print(population_testing_matrix.shape)
```

One can use the `sklearn` package to perform ridge regression and lasso.
The main classes in this package that we care about are `Ridge()`, which can be used to fit ridge regression models, and `Lasso()`, which fits lasso models.

The `Ridge()` constructor has an `alpha` argument that sets the regularization strength and is used to tune the model.


```python
alphas = 10**np.linspace(10,-2,5)*0.5
alphas
```

Fitting the model once per `alpha` value yields one vector of ridge regression coefficients for each value; these vectors can be stored as the rows of a matrix `coefs`.
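
A minimal sketch of how `coefs` might be built is given below. It uses synthetic data (the arrays `X_toy` and `y_toy` are made up for illustration) rather than the population files, and simply refits `Ridge` for each value in `alphas`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data: 40 observations, 6 predictors.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(40, 6))
y_toy = X_toy @ np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0]) + 0.1 * rng.normal(size=40)

alphas = 10 ** np.linspace(10, -2, 5) * 0.5

# One row of coefficients per alpha value.
coefs = np.array([Ridge(alpha=a).fit(X_toy, y_toy).coef_ for a in alphas])

# Larger alpha shrinks the coefficient vector toward zero.
norms = np.linalg.norm(coefs, axis=1)
print(coefs.shape)
print(norms)
```

Since `alphas` is ordered from largest to smallest, the printed norms should grow from the first row to the last. Note that ridge shrinks coefficients but does not set them exactly to zero, which is why the 5-sparse requirement points toward the lasso.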

### Deliverables (Part 2)

The second part of Challenge 3 is an attempt to draw insights from the process.
One can use the file `population_parameters.csv` to build a graph where the nodes correspond to countries, and the (directed) edges represent coefficients in the sparse model fit created above.
The resulting graph may reveal properties about a connected world.
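
As an illustration, the parameter matrix can be flattened into an edge table with `Source`, `Target`, and `Weight` columns, a layout that graph tools such as Gephi are expected to accept through their spreadsheet import. The sketch assumes that column $j$ holds the coefficients used to predict country $j$, so that a nonzero entry in row $i$ gives an edge from predictor $i$ to target $j$; the helper name, country labels, and output file name are hypothetical.

```python
import numpy as np
import pandas as pd


def parameters_to_edges(parameters, names):
    """Turn a square coefficient matrix into a Source/Target/Weight edge table.

    Assumes column j holds the coefficients used to predict country j, so a
    nonzero entry in row i, column j becomes the directed edge i -> j.
    """
    parameters = np.asarray(parameters, dtype=float)
    rows, cols = np.nonzero(parameters)
    return pd.DataFrame({
        "Source": [names[i] for i in rows],
        "Target": [names[j] for j in cols],
        "Weight": parameters[rows, cols],
    })


# Toy 3-country example with made-up labels and coefficients.
names_demo = ["A", "B", "C"]
params_demo = np.array([
    [0.0, 0.5, 0.0],
    [1.2, 0.0, 0.0],
    [0.0, -0.3, 0.0],
])
edges_demo = parameters_to_edges(params_demo, names_demo)
print(edges_demo)
# edges_demo.to_csv("population_edges.csv", index=False)  # hypothetical output file
```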

The task consists of using a visualization tool (e.g. [Gephi](https://gephi.org/)) to explore the structural properties of the best predictors and advance a hypothesis on the nature of the structure.
Both the findings and a sample visualization should be submitted in a 4-page PDF report (single column, IEEE style).