# 0. Mount Google Drive

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

In [None]:
# Check the directory
!ls "/content/gdrive/My Drive/Colab Notebooks"

In [None]:
# Data directory
data_dir = '/content/gdrive/My Drive/Colab Notebooks/data'

!ls '$data_dir'

# 1. Prepare Environment

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# 2. Load Dataset

In this task, you will predict the GDP of each country
[[kaggle](https://www.kaggle.com/fernandol/countries-of-the-world)].

We have already preprocessed the data to simplify the next steps. You can also try the raw data. We also recommend checking the data visualization techniques in this [notebook](https://www.kaggle.com/mehmettek/data-science-with-world-countries).

First, download the [countries of the world_cleaner.csv](https://drive.google.com/file/d/1KXS-9AOsc1a9OG9r44EnJpHA-GCJuGPL/view?usp=sharing) file and upload it to your Google Drive. The recommended location is the `Colab Notebooks/data` folder.

Then run the following command to read the CSV file from your Google Drive.

In [None]:
data_path = os.path.join(data_dir, 'countries of the world_cleaner.csv')
df = pd.read_csv(data_path)

In [None]:
df

# 3. Data Preparation

In this section, we will prepare the dataset into a format that can be used to train models.

## 3.1 Feature Selection

How do we know which features can be used to predict a country's GDP?

* Domain Expert Knowledge
* Visual Inspection
* Feature Selection Algorithms (see more [link1](https://scikit-learn.org/stable/modules/feature_selection.html), [link2](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)) 
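
As one illustration of the third option, here is a minimal sketch of univariate feature selection with scikit-learn's `SelectKBest` and `f_regression`. It runs on synthetic data (an assumption for illustration, not the countries dataset), and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption): 5 features, only the first two drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Score each feature with a univariate F-test and keep the two best.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)

# Indices of the columns that were kept.
selected_idx = selector.get_support(indices=True)
print(f'Selected feature indices: {selected_idx}')
print(f'Reduced shape: {X_selected.shape}')
```

A univariate test only looks at one feature at a time, so treat its ranking as a starting point rather than a final answer.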




In [None]:
# Drop unused features
data_df = df.drop(columns=['Unnamed: 0'])

## 3.2 Deal with NaN values

In [None]:
# Investigate NaN in DataFrame
total = data_df.isnull().sum().sort_values(ascending=False)
percent = (data_df.isnull().sum()/data_df.isnull().count())
percent = percent.sort_values(ascending=False)
missing_data = pd.concat(
    [total, percent], axis=1, keys=['Total', 'Percent'])
print('Missing data:')
missing_data.head(20)

In [None]:
# Determine the values to be replaced for NaN
replace_values = {}
for column in data_df.columns:
  if data_df[column].isnull().any():
    if data_df[column].dtype == np.float64:
      # Use mean for float
      replace_values[column] = data_df[column].mean()
    elif data_df[column].dtype == object:
      # Use 'UNK' keyword for string
      replace_values[column] = 'UNK'
print(replace_values)

In [None]:
# Replace NaN values according to the `replace_values` dictionary
data_df = data_df.fillna(value=replace_values)
data_df

In [None]:
# Investigate NaN in DataFrame
total = data_df.isnull().sum().sort_values(ascending=False)
percent = (data_df.isnull().sum()/data_df.isnull().count())
percent = percent.sort_values(ascending=False)
missing_data = pd.concat(
    [total, percent], axis=1, keys=['Total', 'Percent'])
print('Missing data:')
missing_data.head(20)

## 3.3 Categorical Columns

Scikit-learn expects numerical arrays, so we have to convert our `str` data into numbers.

In [None]:
# Strip white-space in 'Region'
data_df['Region'] = data_df['Region'].str.strip()

# One-hot encoding for 'Region'
reg_df = pd.get_dummies(data_df['Region'], prefix='Region')
clean_df = pd.concat([data_df, reg_df], axis=1)
clean_df = clean_df.drop(columns=['Region'])
clean_df

# 4. Prepare Train/Valid/Test Sets

Here you will write code to extract the features and labels.
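
The general pattern is to drop the label column (and any non-numeric identifier columns) to get the features, and to select the label column on its own. The sketch below shows this on a toy DataFrame; every column name in it is hypothetical and does not come from the countries dataset.

```python
import pandas as pd

# Toy frame standing in for a cleaned DataFrame; all column names are hypothetical.
toy_df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'feature_1': [1.0, 2.0, 3.0],
    'feature_2': [0.5, 0.1, 0.9],
    'target': [10.0, 20.0, 30.0],
})

label_col = 'target'  # hypothetical label column

# Features: everything except the identifier and the label.
X_demo = toy_df.drop(columns=['name', label_col]).to_numpy(dtype=float)
# Labels: the label column only.
y_demo = toy_df[label_col].to_numpy(dtype=float)

print(X_demo.shape, y_demo.shape)
```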

In [None]:
# YOUR CODE HERE
X = None
y = None

Next we split the dataset into training/validation/test set.
* Training set: `X_train`, `y_train`
* Validation set: `X_valid`, `y_valid`
* Test set: `X_test`, `y_test`

The following is an example of how to split the dataset into a training set and a test set.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    random_state=42,
    test_size=0.20)  # 80:20
```

Here you will write the code to split `(X, y)` into `(X_train, y_train)`, `(X_valid, y_valid)` and `(X_test, y_test)` using an 80/10/10 split.
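
One common way to get three sets is to call `train_test_split` twice: first hold out 20% of the data, then split that hold-out in half. The sketch below shows the idea on synthetic arrays (an assumption for illustration); the short variable names are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 samples, 2 features.
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.arange(100, dtype=float)

# Step 1: keep 80% for training and hold out 20%.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

# Step 2: split the hold-out in half, giving 10% validation and 10% test.
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42)

print(X_tr.shape, X_va.shape, X_te.shape)
```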

In [None]:
from sklearn.model_selection import train_test_split

# YOUR CODE HERE

print(f'Training set: {X_train.shape}, {y_train.shape}')
print(f'Validation set: {X_valid.shape}, {y_valid.shape}')
print(f'Test set: {X_test.shape}, {y_test.shape}')

# 5. Model Selection

In this section, you will write code to train the model on the training set and evaluate it on the validation set.

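We will use the simplest ML model for regression problems, Linear Regression. We scale the features before training: ordinary least squares does not strictly need this, but scaling keeps the features on comparable ranges and matters for the regularized and kernel-based models in Section 7.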

Let's start by creating a scaler, which will be used for the training, validation and test sets.

In [None]:
from sklearn.preprocessing import RobustScaler

# Create a scaler
scaler = RobustScaler()

# Fit the scaler with the training set
scaler.fit(X_train)

# Scale the features in the training set
scaled_X_train = scaler.transform(X_train)

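# RobustScaler centers on the median and scales by the interquartile range,
# so the mean and std below are not expected to be exactly 0 and 1.
print(f'Mean: {np.mean(scaled_X_train, axis=0)}')
print(f'Std: {np.std(scaled_X_train, axis=0)}')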

Next, let's use the scaled data to train the linear regression model.
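
As a minimal sketch of the scikit-learn fitting API, the snippet below trains `LinearRegression` on synthetic, noise-free data (an assumption for illustration, not the countries dataset) and checks that the known coefficients are recovered. The `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y = 2*x0 - 1*x1 + 3, with no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 3.0

# Fit ordinary least squares on the features and targets.
demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

print(f'Coefficients: {demo_model.coef_}')
print(f'Intercept: {demo_model.intercept_}')
```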

In [None]:
from sklearn.linear_model import LinearRegression

# YOUR CODE HERE

Next, we use the trained model to predict the GDP of the countries in both the training and the validation sets.

The predictions for the training and the validation set are stored in `y_hat_train` and `y_hat_valid`.

Note that you should reuse the scaler that was **fitted to the training set only**. This prevents the scaler from observing unseen data.
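
The fit-once, transform-many pattern can be sketched as follows. The data is synthetic (an assumption for illustration), and `demo_scaler`, `X_tr` and `X_va` are hypothetical names chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Synthetic stand-ins (assumption) for a training and a validation set.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_va = rng.normal(loc=50.0, scale=10.0, size=(10, 3))

demo_scaler = RobustScaler()
demo_scaler.fit(X_tr)                     # statistics come from the training set only

scaled_tr = demo_scaler.transform(X_tr)
scaled_va = demo_scaler.transform(X_va)   # reuse the fitted scaler, do not refit

# The training medians are centered at zero; the validation medians need not be.
print(f'Train medians: {np.median(scaled_tr, axis=0)}')
print(f'Valid medians: {np.median(scaled_va, axis=0)}')
```

Calling `fit` or `fit_transform` again on the validation or test set would compute new statistics from data the model is not supposed to have seen.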

In [None]:
# YOUR CODE HERE

Then we measure the prediction performance on the training and the validation sets to investigate whether our model suffers from **underfitting** or **overfitting**.

Here, we use a common metric for regression problems: **mean squared error (MSE)**. Lower is better.
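
The cell below reports the mean squared error, which is the average of the squared differences between the true and predicted values. As a small worked example on made-up numbers (an assumption for illustration), the sketch below computes it by hand and with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only.
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)  # (1 + 0 + 4) / 3
mse_sklearn = mean_squared_error(y_true=y_true_demo, y_pred=y_pred_demo)

print(f'Manual MSE: {mse_manual:.4f}')
print(f'sklearn MSE: {mse_sklearn:.4f}')
```

Because the errors are squared, MSE is expressed in squared units of the target and penalizes large errors heavily.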

In [None]:
from sklearn.metrics import mean_squared_error

print('Training Set')
print(f'MSE: {mean_squared_error(y_true=y_train, y_pred=y_hat_train):.4f}')
print('')
print('Validation Set')
print(f'MSE: {mean_squared_error(y_true=y_valid, y_pred=y_hat_valid):.4f}')

**TODO**: Go back and adjust the parameters of the model to reduce overfitting and underfitting as much as you can.

Once you are happy with the performance, then we proceed to the next step.

# 6. Evaluation on Test Set

Once we have found the best model, we evaluate it on the test set to estimate its performance on **unseen** examples.

The predictions for the test set are stored in `y_hat_test`.

In [None]:
# YOUR CODE HERE

In [None]:
print('Test Set')
print(f'MSE: {mean_squared_error(y_true=y_test, y_pred=y_hat_test):.4f}')

# 7. Try Other Regressors

There are a large number of supervised ML algorithms that you can use. Please try the other regressors below and try to achieve the best performance on the test set.

* [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge), [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso), [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet): try to change `alpha`.
* [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html): try to change `n_estimators`, `max_depth`, `min_samples_leaf`.
* [Epsilon-Support Vector Regression](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html): try to change `C`, `gamma`.
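
Trying different parameter values usually means fitting one model per value and comparing the validation error. The sketch below does this for `Ridge` and `alpha` on synthetic data (an assumption for illustration); the grid of `alpha` values and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption): 5 features, 2 of them irrelevant.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

# Fit one model per alpha and record the validation MSE.
valid_mse = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha)
    model.fit(X_tr, y_tr)
    valid_mse[alpha] = mean_squared_error(y_va, model.predict(X_va))

# Pick the alpha with the lowest validation error.
best_alpha = min(valid_mse, key=valid_mse.get)
print(f'Validation MSE per alpha: {valid_mse}')
print(f'Best alpha: {best_alpha}')
```

The same loop works for the other models listed above by swapping the estimator and the parameter being varied.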

**TODO**: 
1. Try the other regressors as specified above.
1. Try other feature scalers mentioned in W6.1