In [None]:
# importing libraries, etc...

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

path = "https://raw.githubusercontent.com/LennardVaarten/ML-Workshops/main/data/"

The [Gapminder](https://www.gapminder.org/) dataset contains historical data (from the mid-19th century onwards) on hundreds of indicators, such as life expectancy and GDP, for countries around the world.
For our purposes, we will try to predict the life expectancy of countries based on several of these indicators. I have only included data from the year 2018.

In [None]:
life_expectancy = pd.read_csv(path+"life_expectancy.csv")

life_expectancy

In [None]:
# checking the number of missing values per feature

life_expectancy.isna().sum()

In [None]:
# Imputing missing values using the k-NN algorithm, with n_neighbors=3

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3).fit(life_expectancy.iloc[:,:-1])
life_expectancy.iloc[:,:-1] = imputer.transform(life_expectancy.iloc[:,:-1])
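
To get a feel for what the imputer does, here is a minimal sketch on a tiny made-up table (the `toy` DataFrame and its values are assumptions for illustration, not part of the Gapminder data). The missing value is filled with the average of that column over the nearest rows, where "nearest" is measured on the columns that are present.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data (not from the Gapminder file): one missing value in "b"
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 10.0],
    "b": [10.0, 20.0, np.nan, 100.0],
})

# With n_neighbors=2, the gap in row 2 is filled with the mean of "b"
# over the two rows whose "a" value is closest to that of row 2
toy_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy),
    columns=toy.columns,
)

print(toy_imputed)
```

The rows with `a` equal to 1 and 2 are the closest neighbors here, so the far-away row with `a` equal to 10 does not influence the imputed value.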

In [None]:
# voila: no more missing values!

life_expectancy.isna().sum()

In [None]:
life_expectancy

In [None]:
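# scaling the feature columns to the range [0, 1] (the target in column 0 and the last column are left as they are)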

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(life_expectancy.iloc[:,1:-1])
life_expectancy.iloc[:,1:-1] = scaler.transform(life_expectancy.iloc[:, 1:-1])

In [None]:
# splitting into training and test set

from sklearn.model_selection import train_test_split

features_train, features_test, target_train, target_test = train_test_split(
    life_expectancy.iloc[:,1:-1],
    life_expectancy.iloc[:,0],
    test_size=0.35,
    random_state=99)
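
In this notebook the scaler is fitted on all rows before the split. A common alternative is to split first and let a pipeline fit the scaler on the training rows only, so the test set has no influence on the preprocessing. Below is a minimal sketch of that idea, assuming synthetic stand-in data from `make_regression` (the names `X`, `y` and `pipe` are illustrative, not from the workshop code).

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption for illustration, not the Gapminder file)
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=99
)

# The pipeline fits MinMaxScaler on the training rows only, then applies
# that same fitted scaling to whatever data is passed to predict/score
pipe = make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=5))
pipe.fit(X_train, y_train)

test_r2 = pipe.score(X_test, y_test)
print("Test set score: {:.4f}".format(test_r2))
```

A pipeline like this can also be passed to `GridSearchCV` as the estimator, in which case the scaler is refitted inside every fold.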

In [None]:
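# recombining target and features so we can plot each feature against life expectancy (training set only)
train = pd.concat([target_train, features_train], axis=1)

fig, axes = plt.subplots(3,3, figsize=(18,16))

# one scatter plot per feature; the 3x3 grid has room for at most 9 features
for i in range(len(train.columns)-1):
    sns.scatterplot(data=train, ax=axes[i//3, i%3], x=train.columns[i+1], y=train.columns[0])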

fig.tight_layout(pad=2)

In [None]:
# example using Grid Search and Cross Validation to find the optimal parameters. Here, I have used 10 folds, but feel free to use 
# more or fewer in the model(s) you make below!

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

params = {
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"]
}

knn = GridSearchCV(estimator=KNeighborsRegressor(),
                   param_grid=params, cv=10) 

knn.fit(features_train, target_train)

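# for regressors, .score() returns R^2 (the coefficient of determination), computed with the best model found
print("Training set score: {:.4f}".format(knn.score(features_train, target_train)))
print("Test set score: {:.4f}".format(knn.score(features_test, target_test)))
print(knn.best_params_)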
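
Besides `best_params_`, a fitted grid search also keeps the cross-validated score of every parameter combination in `cv_results_`. The sketch below shows one way to turn that into a sorted table. It assumes synthetic stand-in data and a smaller grid than the one above, so the names `X`, `y`, `search` and `results` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption for illustration, not the Gapminder file)
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]},
    cv=5,
)
search.fit(X, y)

# One row per parameter combination, best mean cross-validation score first
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "param_weights",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results)
```

Looking at `std_test_score` next to `mean_test_score` gives a sense of whether the best setting is clearly better than the runners-up or only marginally so.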

Now it's your turn to use any of the models we've discussed and see how well they perform on this task. Since this dataset is much smaller than the MNIST (handwritten digits) dataset, it is very feasible, and practically a requirement, to use Grid Search and Cross Validation to build and test your models. Note that this is a regression problem, so classification models will not work on it. Perhaps even more important than choosing a model is trying out different parameter settings (e.g. n_neighbors and weights for k-Nearest Neighbors).

Below are the regression models we've discussed, along with the import statement and the parameters that we've covered during the sessions.

- **k-Nearest Neighbors Regression** (already imported in the cell above)
    - n_neighbors (a whole number above 0)
    - weights ("uniform", "distance")
- **Linear Regression** (from sklearn.linear_model import LinearRegression)
    - No parameters to tune
- **Ridge Regression** (from sklearn.linear_model import Ridge)
    - alpha (any number above 0)
- **Lasso Regression** (from sklearn.linear_model import Lasso)
    - alpha (any number above 0)
- **Decision Tree Regression** (from sklearn.tree import DecisionTreeRegressor)
    - max_depth (a whole number above 0)
    - min_samples_split (a whole number above 1)
- **Random Forest Regression** (from sklearn.ensemble import RandomForestRegressor)
    - n_estimators (a whole number above 0)
    - max_depth (a whole number above 0)
    - min_samples_split (a whole number above 1)
- **Gradient Boosting Regressor** (from sklearn.ensemble import GradientBoostingRegressor)
    - n_estimators (a whole number above 0)
    - max_depth (a whole number above 0)
    - min_samples_split (a whole number above 1)
    - learning_rate (a number between 0 and 1)
    - subsample (a number between 0 and 1)
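
As a starting point, here is a minimal sketch of how the same Grid Search pattern could be reused for two of the models in the list. It assumes synthetic stand-in data, and the grid values are arbitrary starting points rather than recommended settings. The names `candidates` and `fitted` are made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration, not the Gapminder file)
X, y = make_regression(n_samples=150, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=99
)

# One (estimator, parameter grid) pair per model; grid values are arbitrary examples
candidates = {
    "Ridge": (Ridge(), {"alpha": [0.01, 0.1, 1, 10, 100]}),
    "Decision Tree": (
        DecisionTreeRegressor(random_state=0),
        {"max_depth": [2, 4, 6, 8], "min_samples_split": [2, 5, 10]},
    ),
}

fitted = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator=estimator, param_grid=grid, cv=10)
    search.fit(X_train, y_train)
    fitted[name] = search
    print(name, search.best_params_)
    print("  Test set score: {:.4f}".format(search.score(X_test, y_test)))
```

On the real data you would swap `X_train`, `y_train`, `X_test` and `y_test` for `features_train`, `target_train`, `features_test` and `target_test` from the cells above.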
    
If you want to try even more parameter settings than we've discussed in class (models tend to have a lot), you can also consult the sklearn documentation. For example, [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html), you can find all possible parameters to tune for the KNeighborsRegressor.
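
You can also list a model's parameters, with their default values, straight from the notebook. A minimal sketch using `get_params()`, which every sklearn estimator provides (the variable name `all_params` is just for this example):

```python
from sklearn.neighbors import KNeighborsRegressor

# get_params() returns a dict mapping each parameter name to its current value;
# on a freshly created model these are the defaults
all_params = KNeighborsRegressor().get_params()

for name, default in sorted(all_params.items()):
    print(f"{name} = {default!r}")
```

Any name printed here can be used as a key in the `param_grid` dictionary passed to `GridSearchCV`.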

Good luck and feel free to share your model (and the results you obtain with it) on the Canvas discussion page!