## Class 02

In [None]:
# setup
%pip install pandas matplotlib seaborn scikit-learn numpy

In [126]:
import pathlib 
import pandas as pd 
import seaborn as sns
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

In [78]:
path = pathlib.Path.cwd()
datapath = path.parents[4] / "data" / "class_01" / "bikes.csv"
df = pd.read_csv(datapath)

## EXERCISES
1. Create a folder called `group-x` within `nbs/class_02`, `cd` into it and work within that today
2. Choose an outcome variable for a regression problem. On the basis of this, define **which of the evaluation metrics** could be suitable. Evaluation metrics can be computed using scikit-learn: https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics 
3. (a) If you are in the bike sharing group, split your dataset into a training/validation/test set using later time points as validation/test set. Validation and test set should be 15% of your data each. (b) If you are in the personality group, using sklearn's `train_test_split` function, create a 70/15/15 random split of your data.
    - Remember to set a seed (`random_state`) when you do so. Let's all use the same (the classic `random_state=42`)
    - Save these datasets as separate csv files in a subfolder called `data`
4. Look at your outcome and predictors: do you want to transform them in any way?
5. Estimate the performance of a dummy baseline (i.e., the mean model) on all splits
6. Now look at your predictors: do they need any preprocessing? Any transformations? Removal of "bad" data points?
7. Fit the other models using KNN (sklearn's `KNeighborsRegressor`: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) and linear models (`LinearRegression`: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). Save each fitted model object (with a meaningful name) using `pickle` (https://scikit-learn.org/stable/model_persistence.html) in a subfolder called `model`.
8. Once you are done, evaluate all models on both the training and the validation set and visualize the scores
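
Exercises 3(b) and 7 can be sketched as follows. This is a minimal sketch on synthetic data, not the class dataset: the arrays `X` and `y`, the model names `knn_k5` and `linear`, and the temporary `model` folder are all assumptions made for illustration. A 70/15/15 split is obtained by calling `train_test_split` twice: first holding out 30% of the rows, then halving that hold-out.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (hypothetical): 200 rows, 3 predictors, noisy linear outcome
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Step 1: hold out 30% of the rows; step 2: halve the hold-out into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42)

# Hypothetical model names; pick names that describe your own settings
models = {
    "knn_k5": KNeighborsRegressor(n_neighbors=5),
    "linear": LinearRegression(),
}

# A temporary directory keeps the sketch from writing into the repository
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model"
    model_dir.mkdir()
    reloaded = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        with open(model_dir / f"{name}.pkl", "wb") as f:
            pickle.dump(model, f)
        # Load the model back to check that the saved file is usable
        with open(model_dir / f"{name}.pkl", "rb") as f:
            reloaded[name] = pickle.load(f)

# R-squared on the validation set, computed from the reloaded models
val_r2 = {name: m.score(X_val, y_val) for name, m in reloaded.items()}
print(val_r2)
```

In the actual exercise, the temporary directory would be replaced by the `model` subfolder of your group folder.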

## 2. Creating an Outcome

In [79]:
df["proportion_casual_reg"] = df["casual"]/df["cnt"]

Which evaluation metrics would be suitable? We'll choose R-squared because we want a general, scale-free measure of how well the model fits the data, rather than the size of the error on individual data points. RMSE and MSE are scale-dependent, so they are less suitable as the main metric (we still report RMSE below for reference).
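
To illustrate what "scale-dependent" means here, the sketch below computes both metrics on the same predictions, once as proportions and once as percentages. The numbers are made up for illustration, and the helper `rmse` is a hypothetical convenience function, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only
y_true = np.array([0.10, 0.20, 0.30, 0.40, 0.50])
y_pred = np.array([0.12, 0.18, 0.33, 0.41, 0.46])


def rmse(y, y_hat):
    """Root mean squared error (hypothetical helper)."""
    return float(np.sqrt(mean_squared_error(y, y_hat)))


# Express the same quantities in percent instead of proportions
scale = 100
rmse_prop = rmse(y_true, y_pred)
rmse_pct = rmse(y_true * scale, y_pred * scale)
r2_prop = r2_score(y_true, y_pred)
r2_pct = r2_score(y_true * scale, y_pred * scale)

print(f"RMSE: {rmse_prop:.4f} (proportion) vs {rmse_pct:.4f} (percent)")
print(f"R2:   {r2_prop:.4f} (proportion) vs {r2_pct:.4f} (percent)")
```

RMSE grows by the same factor as the outcome, while R-squared is unchanged, which is why R-squared is easier to compare across differently scaled outcomes.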

## 3. Splitting our data

In [80]:
len_df = len(df)

# define the split percentage
split_size = 0.15

# get the absolute number of rows that equals to our split size (use int to rm. decimal)
n_rows = int(len_df * split_size)

# define test
test_df = df.iloc[-n_rows:, :]

# define train and val combined
train_val_df = df.iloc[:-n_rows, :]

# subset train from only the train and val 
train_df = train_val_df.iloc[:-n_rows, :]

# subset val from only the train and val
val_df = train_val_df.iloc[-n_rows:, :]
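
The slicing above relies on the rows already being in chronological order. A minimal sketch of the same split that sorts first is shown below. The helper name `chronological_split` and the toy frame are assumptions for illustration, and the sketch assumes the time column sorts chronologically (parsed datetimes or ISO-formatted date strings).

```python
import pandas as pd


def chronological_split(df, time_col, frac=0.15):
    """Return (train, val, test): latest rows go to test, the next-latest to val."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    n = int(len(ordered) * frac)
    if n == 0:
        # iloc[-0:] would select every row, so refuse to split tiny frames
        raise ValueError("frac is too small for the number of rows")
    test = ordered.iloc[-n:]
    val = ordered.iloc[-2 * n:-n]
    train = ordered.iloc[:-2 * n]
    return train, val, test


# Toy frame (hypothetical data), deliberately shuffled to show that sorting matters
toy = pd.DataFrame({
    "dteday": pd.date_range("2011-01-01", periods=100, freq="D"),
    "cnt": range(100),
}).sample(frac=1, random_state=42)

train, val, test = chronological_split(toy, "dteday")
print(len(train), len(val), len(test))
```

Every validation row is later than every training row, and every test row is later than every validation row, so no future information leaks into training.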

In [93]:
# save dataset
save_path = path.parents[2] / "nbs" / "class_02" / "group-RMDS" / "data"
save_path.mkdir(parents=True, exist_ok=True)

train_df.to_csv(save_path / "train_bikes.csv", index=False)
val_df.to_csv(save_path / "val_bikes.csv", index=False)
test_df.to_csv(save_path / "test_bikes.csv", index=False)

In [123]:
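# split into features (X) and outcome (y); `predictors` lists the columns to drop from X:
# the outcome itself, the counts it is computed from, and the date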
predictors = ["proportion_casual_reg", "registered", "casual", "cnt", "dteday"]
X_train = train_df.drop(predictors, axis=1)
y_train = train_df["proportion_casual_reg"].values

X_val = val_df.drop(predictors, axis=1)
y_val = val_df["proportion_casual_reg"].values

X_test = test_df.drop(predictors, axis=1)
y_test = test_df["proportion_casual_reg"].values

In [128]:
performances = []

mean_value = y_train.mean()
model_name = 'dummy'

for y, nsplit in zip([y_train, y_val, y_test],
                    ['train', 'val', 'test']):
    # the mean model predicts the training mean for every row
    y_pred = np.full(y.shape, mean_value)
    rmse = np.sqrt(mean_squared_error(y, y_pred))
    r2 = r2_score(y, y_pred)
    # built-in round() also works if a metric comes back as a plain Python float
    performances.append({'model': model_name,
                         'split': nsplit,
                         'rmse': round(float(rmse), 4),
                         'r2': round(float(r2), 4)})

In [130]:
performances

[{'model': 'dummy', 'split': 'train', 'rmse': 0.1426, 'r2': 0.0},
 {'model': 'dummy', 'split': 'val', 'rmse': 0.1224, 'r2': -0.0753},
 {'model': 'dummy', 'split': 'test', 'rmse': 0.12, 'r2': -0.0734}]
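
The pattern in these scores (R-squared of exactly 0 on the training set, slightly negative on the other splits) follows from predicting the training mean everywhere: on any other split, a constant that differs from that split's own mean does worse than the split's mean would. The sketch below reproduces this with scikit-learn's `DummyRegressor(strategy="mean")`, which is equivalent to the manual mean model. The data are synthetic and the shift between the two means is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Synthetic outcome (hypothetical): the validation mean is shifted relative to training
rng = np.random.default_rng(42)
y_train = rng.normal(loc=0.20, scale=0.10, size=500)
y_val = rng.normal(loc=0.15, scale=0.10, size=100)

# DummyRegressor ignores the features, so a placeholder column of zeros is enough
X_train_placeholder = np.zeros((len(y_train), 1))
X_val_placeholder = np.zeros((len(y_val), 1))

dummy = DummyRegressor(strategy="mean").fit(X_train_placeholder, y_train)

r2_train = r2_score(y_train, dummy.predict(X_train_placeholder))
r2_val = r2_score(y_val, dummy.predict(X_val_placeholder))

print(f"train R2: {r2_train:.4f}, val R2: {r2_val:.4f}")
```

Any model worth keeping should beat these baseline scores on the validation set.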