## Problem set 4

**Problem 0** (-2 points for every missing green OK sign. If you don't run the cell below, that's -14 points.)

Make sure you are in the DATA1030 environment.

In [1]:
from __future__ import print_function
from distutils.version import LooseVersion as Version
import sys

OK = '\x1b[42m[ OK ]\x1b[0m'
FAIL = "\x1b[41m[FAIL]\x1b[0m"

try:
    import importlib
except ImportError:
    print(FAIL, "Python version 3.7 is required,"
                " but %s is installed." % sys.version)

def import_version(pkg, min_ver, fail_msg=""):
    mod = None
    try:
        mod = importlib.import_module(pkg)
        if pkg in {'PIL'}:
            ver = mod.VERSION
        else:
            ver = mod.__version__
        if Version(ver) == min_ver:
            print(OK, "%s version %s is installed."
                  % (pkg, min_ver))
        else:
            print(FAIL, "%s version %s is required, but %s installed."
                  % (pkg, min_ver, ver))
    except ImportError:
        print(FAIL, '%s not installed. %s' % (pkg, fail_msg))
    return mod


# first check the python version
pyversion = Version(sys.version)
if pyversion >= "3.7":
    print(OK, "Python version is %s" % sys.version)
elif pyversion < "3.7":
    print(FAIL, "Python version 3.7 is required,"
                " but %s is installed." % sys.version)
else:
    print(FAIL, "Unknown Python version: %s" % sys.version)

    
print()
requirements = {'numpy': "1.18.5", 'matplotlib': "3.2.2",'sklearn': "0.23.1", 
                'pandas': "1.0.5",'xgboost': "1.1.1", 'shap': "0.35.0"}

# now the dependencies
for lib, required_version in list(requirements.items()):
    import_version(lib, required_version)

[42m[ OK ][0m Python version is 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:37:09) 
[Clang 10.0.1 ]

[42m[ OK ][0m numpy version 1.18.5 is installed.
[42m[ OK ][0m matplotlib version 3.2.2 is installed.
[42m[ OK ][0m sklearn version 0.23.1 is installed.
[42m[ OK ][0m pandas version 1.0.5 is installed.
[42m[ OK ][0m xgboost version 1.1.1 is installed.
[42m[ OK ][0m shap version 0.35.0 is installed.


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit, GroupKFold, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler

**Problem 1a** (3 points)

You will work with the diabetes dataset in Problem 1: you will split the data and preprocess it so it is ready for training an ML model. First, read the dataset into a pandas dataframe using the tab-delimited file linked at [this page](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html).


grading suggestion:
- 3 points if they read in the file correctly using the delimiter argument


In [3]:
# read in the data in this cell
df = pd.read_csv("https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", sep='\t')
df.head()

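| | AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | 2 | 32.1 | 101.0 | 157 | 93.2 | 38.0 | 4.0 | 4.8598 | 87 | 151 |
| 1 | 48 | 1 | 21.6 | 87.0 | 183 | 103.2 | 70.0 | 3.0 | 3.8918 | 69 | 75 |
| 2 | 72 | 2 | 30.5 | 93.0 | 156 | 93.6 | 41.0 | 4.0 | 4.6728 | 85 | 141 |
| 3 | 24 | 1 | 25.3 | 84.0 | 198 | 131.4 | 40.0 | 5.0 | 4.8903 | 89 | 206 |
| 4 | 50 | 1 | 23.0 | 101.0 | 192 | 125.4 | 52.0 | 4.0 | 4.2905 | 80 | 135 |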


**Problem 1b** (6 points)

Answer the following questions with 1-2 paragraphs.

Q1: Is the dataset IID or not? Why?

Q2: Please decide what fraction of points will be in each set and explain your decision in a paragraph or two.

Q3: Please explain in a paragraph or two why it is important to fit the preprocessors on the training set only.


grading suggestion:
- 2 points if they correctly argue that the dataset is IID.
- 2 points for a correct argument. 60-20-20 is best, I'd still be OK with 80-10-10 if they have a reasonable argument.
- 2 points for a good Q3 explanation
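
The point behind Q3 can be illustrated with a small sketch. It uses synthetic numbers rather than the diabetes data, and the variable names are made up for illustration. The scaler learns its mean and standard deviation from the training rows only, and the test rows are transformed with those same statistics, so no information from the test set reaches the preprocessing step.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a single continuous feature (not the diabetes data).
rng = np.random.RandomState(0)
X = rng.normal(loc=50.0, scale=10.0, size=(500, 1))

X_train, X_test = train_test_split(X, train_size=0.6, random_state=0)

# Fit on the training set only, then reuse the fitted statistics everywhere.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The training mean is zero by construction; the test mean is usually only close to zero.
train_mean = X_train_scaled.mean()
test_mean = X_test_scaled.mean()
print(train_mean, test_mean)
```

The small non-zero test mean is expected: it reflects how an unseen sample looks when it is measured against the training statistics, which is the situation the model faces after deployment.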


**Problem 1c** (11 points)

Based on your answers above, please perform a basic split and create training, validation, and test sets.

Now that you have three sets, you can preprocess the data. Please decide for each feature which preprocessor you will use (no need to write text). Fit those preprocessors on the training set, then transform the sets.

We discussed in class that it is important to split the data using several different random states so you can determine at the end of the ML pipeline how much uncertainty the random splitting causes in the test score. Please use 10 random states and split/preprocess the data 10 times.

Please make sure that your code is reproducible. The best way to check that is to print out which points are in, e.g., the training set and rerun the cell a couple of times. If the same points are in the same set after every rerun, your code is reproducible.

A couple of suggestions for how you could structure your code are available below.


One option:
```python

random_states = [...,...,...] # list of 10 numbers

for random_state in random_states:
    # whenever you need to set the random state, use `random_state`
    
    # split the data
    
    # preprocess the data
    
    # print stuff out to make sure your code is reproducible
    
```

Second option:
```python


for i in range(0,10):
    random_state = 42 * i # feel free to replace 42 with your magic number.
                          # the only important thing is that random_state has a different value in each iteration.
    
    # split the data
    
    # preprocess the data
    
    # print stuff out to make sure your code is reproducible
    
```
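
Instead of comparing printouts by eye, the reproducibility check can also be done programmatically. The sketch below is one possible way: it splits row indices rather than real data, `split_indices` is a hypothetical helper, and the dataset size and seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_indices(n_points, random_state):
    """Return the row indices that land in the train, validation, and test sets."""
    idx = np.arange(n_points)
    train, other = train_test_split(idx, train_size=0.6, random_state=random_state)
    val, test = train_test_split(other, train_size=0.5, random_state=random_state)
    return train, val, test

n_points = 442  # assumed dataset size, used only for this illustration

first_run = split_indices(n_points, random_state=42)
second_run = split_indices(n_points, random_state=42)
other_seed = split_indices(n_points, random_state=84)

# Same seed: every set should contain exactly the same rows in the same order.
same_seed_matches = all(np.array_equal(a, b) for a, b in zip(first_run, second_run))
# Different seed: the training rows should differ.
different_seed_matches = np.array_equal(first_run[0], other_seed[0])
print(same_seed_matches, different_seed_matches)
```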




grading suggestion:

- 3 points for correctly splitting with train_test_split and setting the random_state
- 2 points for correctly using a one-hot encoder on the gender (the `SEX` column) and either the standard scaler or the min-max scaler on the rest (deduct a point if they also preprocess the target variable; in regression, the target variable stays as is)
- 3 points if they fit the preprocessors to the training set and then transform everything
- 3 points for correctly looping through 10 random states
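
One compact way to apply different preprocessors to different columns is scikit-learn's `ColumnTransformer`. This is an alternative sketch, not the approach required by the problem: the frame below is made up, and only the column names `SEX`, `AGE`, and `BMI` are borrowed from the diabetes data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame with one categorical and two continuous columns.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'SEX': rng.choice([1, 2], size=200),
    'AGE': rng.randint(20, 80, size=200),
    'BMI': rng.normal(26.0, 4.0, size=200),
})

X_train, X_test = train_test_split(toy, train_size=0.6, random_state=0)

preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['SEX']),
        ('scale', StandardScaler(), ['AGE', 'BMI']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training set only, then transform the other set with the fitted object.
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)
print(X_train_prep.shape, X_test_prep.shape)
```

The output columns follow the order of the `transformers` list: the one-hot columns come first, then the scaled columns.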

In [4]:
y = df['Y']
X = df.loc[:, df.columns != 'Y']

scaler_col = list(X.columns)
scaler_col.remove('SEX')

for i in range(10):
    random_state = 42 * i
    
    # split the data
    X_train, X_other, y_train, y_other = train_test_split(X, y, train_size=0.6, random_state=random_state)
    X_val, X_test, y_val, y_test = train_test_split(X_other, y_other, train_size=0.5, random_state=random_state)
    
    # preprocess the data
    enc = OneHotEncoder(sparse=False, handle_unknown='ignore')
    scaler = StandardScaler()
    
    X_train_ohe = enc.fit_transform(X_train[['SEX']])
    X_val_ohe = enc.transform(X_val[['SEX']])
    X_test_ohe = enc.transform(X_test[['SEX']])
    
    X_train_scaler = scaler.fit_transform(X_train[scaler_col])
    X_val_scaler = scaler.transform(X_val[scaler_col])
    X_test_scaler = scaler.transform(X_test[scaler_col])
    
    X_train_concat = np.concatenate([X_train_ohe, X_train_scaler], axis=1)
    X_val_concat = np.concatenate([X_val_ohe, X_val_scaler], axis=1)
    X_test_concat = np.concatenate([X_test_ohe, X_test_scaler], axis=1)
    
    X_pre_col = list(enc.get_feature_names(['SEX'])) + scaler_col
    X_train_pre = pd.DataFrame(X_train_concat, columns=X_pre_col)
    X_val_pre = pd.DataFrame(X_val_concat, columns=X_pre_col)
    X_test_pre = pd.DataFrame(X_test_concat, columns=X_pre_col)
    
    # print stuff out to make sure your code is reproducible
    print(X_train_pre.head())

   SEX_1  SEX_2       AGE       BMI        BP        S1        S2        S3  \
0    0.0    1.0 -0.101364 -0.835511 -0.948322 -1.661115 -1.549201 -0.698947   
1    0.0    1.0  1.014762  1.252631  1.475618  0.269567  0.435301 -0.473754   
2    0.0    1.0  0.047453  1.275084 -0.066889  1.263300  1.308739 -1.224396   
3    0.0    1.0 -0.175772  3.587757  0.300374  0.610275  0.705039 -0.473754   
4    1.0    0.0 -0.994265  0.129974  0.226922 -0.780951 -0.367491 -0.398690   

         S4        S5        S6  
0 -0.798531  0.205609 -0.189521  
1  0.701191  0.443799  0.537180  
2  2.200913  1.393977  2.475052  
3  0.701191  0.679407  0.617925  
4 -0.048670 -0.806330 -0.431755  
   SEX_1  SEX_2       AGE       BMI        BP        S1        S2        S3  \
0    1.0    0.0 -0.285158  1.213237  1.090432  0.972804  0.610896 -0.385769   
1    1.0    0.0  1.412975  1.123179  1.520650 -0.313446 -0.822460 -0.690588   
2    0.0    1.0  1.335787 -0.587927  0.229994  1.315804  1.010142  0.604894   
3    

**Problem 2** 

We work with the [hand postures dataset](https://archive.ics.uci.edu/ml/datasets/Motion+Capture+Hand+Postures) in problem 2. This dataset has group structure. 14 users performing 5 hand postures with markers attached to a left-handed glove were recorded. Two different ML questions can be asked using this dataset. We will explore how the splitting and preprocessing differs for both questions in 2a and 2b.

In [5]:
df = pd.read_csv('data/Postures.csv', skiprows=[1])
df.replace("?", np.nan, inplace=True)    # replace ? as nan
df.head()

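| | Class | User | X0 | Y0 | Z0 | X1 | Y1 | Z1 | X2 | Y2 | ... | Z8 | X9 | Y9 | Z9 | X10 | Y10 | Z10 | X11 | Y11 | Z11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 54.26388 | 71.466776 | -64.807709 | 76.895635 | 42.4625 | -72.780545 | 36.621229 | 81.680557 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 0 | 56.527558 | 72.266609 | -61.935252 | 39.135978 | 82.53853 | -49.596509 | 79.223743 | 43.254091 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1 | 0 | 55.849928 | 72.469064 | -62.562788 | 37.988804 | 82.631347 | -50.606259 | 78.451526 | 43.567403 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1 | 0 | 55.329647 | 71.707275 | -63.688956 | 36.561863 | 81.868749 | -52.752784 | 86.32063 | 68.214645 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1 | 0 | 55.142401 | 71.435607 | -64.177303 | 36.175818 | 81.556874 | -53.475747 | 76.986143 | 42.426849 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |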


**Problem 2a** (10 points)

How would you prepare the data if we wanted to know how well we can predict the hand postures of a new, previously unseen user? Write down your reasoning (the usual 1-2 paragraphs are fine). Split the dataset into training, validation, and test sets, preprocess the sets, and loop through 10 random states similar to 1c. As usual, check for reproducibility!

Grading suggestion
- 6 points if they do a group split based on user ID and use the 'class' (the hand gesture) as the target variable 
    - it's OK if they use something other than GroupShuffleSplit as long as they split based on user ID
- 2 points for using the standard scaler on each of the 33 features and fitting on the train only
- 2 points for looping through 10 random states
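
The key property of a group-based split is that no user contributes rows to more than one set. The sketch below checks this on synthetic data. The six users, their 20 recordings each, and the three features are all hypothetical stand-ins for the postures data.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 6 hypothetical users with 20 recordings each.
groups = np.repeat(np.arange(6), 20)
X = np.random.RandomState(0).normal(size=(groups.size, 3))

# An integer test_size is interpreted as a number of groups (users), not rows.
gss = GroupShuffleSplit(n_splits=3, test_size=2, random_state=0)

overlaps = []
for train_idx, test_idx in gss.split(X, groups=groups):
    train_users = set(groups[train_idx].tolist())
    test_users = set(groups[test_idx].tolist())
    # Users shared by both sets; this should be empty for every split.
    overlaps.append(train_users & test_users)
    print(sorted(train_users), sorted(test_users))
```

Testing for a non-empty intersection, rather than for equality of the two user sets, is what detects a user that leaked into both sides.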

Add your explanation here:



In [6]:
# add your code here

# save processed train-cv-test sets in a list of dictionaries
data_after_split = []
# extract X, y and User
y = df['Class']
User = df['User']
X = df.iloc[:, 2:]
# init GroupShuffleSplit as follows
gss = GroupShuffleSplit(n_splits=10, train_size=.8, random_state=42)
for nth_split, (other_idx, test_idx) in enumerate(gss.split(X, y, User)):    # group by User
    # extract X_other, X_test, y_other, y_test, User_other, User_test
    X_other, y_other = X.iloc[other_idx], y.iloc[other_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]
    User_other = User.iloc[other_idx]
    User_test = User.iloc[test_idx]
    # init GroupKFold, 5 folds
    gkf = GroupKFold(n_splits=5)
    # create a temporary dictionary for kfold and test data
    random_split_tmp = {'kfold_train':[], 'kfold_cv':[], 'test': []}
    for train_idx, cv_idx in gkf.split(X_other, y_other, User_other):    # group by User_other
        # extract X_train, X_cv, y_train, y_cv, User_train, User_cv
        X_train, X_cv = X_other.iloc[train_idx], X_other.iloc[cv_idx]
        y_train, y_cv = y_other.iloc[train_idx], y_other.iloc[cv_idx]
        User_train = User_other.iloc[train_idx]
        User_cv = User_other.iloc[cv_idx]
        # check that no user appears in more than one of the splits
        if set(User_train) & set(User_cv) or\
           set(User_train) & set(User_test) or\
           set(User_cv) & set(User_test):
            raise ValueError('User exists in two or more of the splits')
        # init StandardScaler
        ss = StandardScaler()
        # fit on train only, then transform train, cv, and test;
        # use new names so X_test is not rescaled again in the next fold
        X_train_prep = ss.fit_transform(X_train)
        X_cv_prep = ss.transform(X_cv)
        X_test_prep = ss.transform(X_test)
        # save to temp dict
        random_split_tmp['kfold_train'].append([X_train_prep, y_train])
        random_split_tmp['kfold_cv'].append([X_cv_prep, y_cv])
        random_split_tmp['test'].append([X_test_prep, y_test])
    # save the splits for this random state
    data_after_split.append(random_split_tmp)
    print(f'User in kfold: {User_other.unique()}, user in test: {User_test.unique()}')
    print('*' * 80)

User in kfold: [ 1  2  4  5  6  7  8  9 11 13 14], user in test: [ 0 10 12]
********************************************************************************
User in kfold: [ 1  2  4  5  6  7  8  9 11 13 14], user in test: [ 0 10 12]
********************************************************************************
User in kfold: [ 0  2  4  6  7  8  9 10 11 12 13], user in test: [ 1  5 14]
********************************************************************************
User in kfold: [ 0  1  4  5  6  7  9 10 12 13 14], user in test: [ 2  8 11]
********************************************************************************
User in kfold: [ 0  1  2  4  5  6  7  8 10 12 13], user in test: [ 9 11 14]
********************************************************************************
User in kfold: [ 1  4  5  6  8  9 10 11 12 13 14], user in test: [0 2 7]
********************************************************************************
User in kfold: [ 1  4  5  7  8  9 10 11 12 13 14], user in te

**Problem 2b** (10 points)

How would you prepare the data if we wanted to identify a user based on hand postures? Follow the same steps as in 2a (explain your reasoning, split, preprocess, loop through 10 random states, check reproducibility).

Grading suggestion
- 6 points if they split based on the user as the target variable 
    - the perfect solution would be to do a stratified split based on the combination of class and user ID columns
    - it is, however, also good if they do a stratified split on the user ID. This is important because some users have a few hundred postures measured while other users have almost 10k postures.
    - a simple train_test_split is not OK.
- 2 points for using the standard scaler on each of the 33 features and fitting on the train only
- 2 points for looping through 10 random states
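
Why stratification matters for imbalanced users can be seen in a small sketch. The labels below are synthetic and deliberately imbalanced (a hypothetical user 0 with 900 rows and user 1 with 100 rows), not the real postures data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced target: user 0 has 900 rows, user 1 has 100.
users = np.array([0] * 900 + [1] * 100)
X = np.arange(users.size).reshape(-1, 1)

# stratify=users keeps each user's share (nearly) the same in both sets.
X_other, X_test, y_other, y_test = train_test_split(
    X, users, test_size=0.2, random_state=0, stratify=users)

share_all = (users == 1).mean()
share_other = (y_other == 1).mean()
share_test = (y_test == 1).mean()
print(share_all, share_other, share_test)
```

Without `stratify`, the share of the rare user in the test set would vary with the random state, which adds noise to the test score that has nothing to do with the model.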


In [8]:
# add your code here

# save processed train-cv-test sets in a list of dictionaries
data_after_split = []
# extract X and y (the user ID is the target here)
y = df['User']
X = df.iloc[:, 2:]
# set strata reference as class & user id
stratify_other_test_ref = df['Class'].astype(str) + "&" + df['User'].astype(str)
for r in range(10):    # loop through 10 random states
    X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=42*r, stratify=stratify_other_test_ref)
    # get strata reference for kfold
    stratify_kfold_ref = stratify_other_test_ref.loc[X_other.index]
    # init kfold
    skf = StratifiedKFold(n_splits=5)
    # create a temporary dictionary for kfold and test data
    random_split_tmp = {'kfold_train':[], 'kfold_cv':[], 'test': []}
    for train_index, cv_index in skf.split(X_other, stratify_kfold_ref):
        # extract X_train, X_cv, y_train, y_cv
        X_train, X_cv = X_other.iloc[train_index], X_other.iloc[cv_index]
        y_train, y_cv = y_other.iloc[train_index], y_other.iloc[cv_index]
        # init StandardScaler
        ss = StandardScaler()
        # fit on train only, then transform train, cv, and test;
        # use new names so X_test is not rescaled again in the next fold
        X_train_prep = ss.fit_transform(X_train)
        X_cv_prep = ss.transform(X_cv)
        X_test_prep = ss.transform(X_test)
        # save to temp dict
        random_split_tmp['kfold_train'].append([X_train_prep, y_train])
        random_split_tmp['kfold_cv'].append([X_cv_prep, y_cv])
        random_split_tmp['test'].append([X_test_prep, y_test])
    # save the splits for this random state
    data_after_split.append(random_split_tmp)
    print(y_test.iloc[:5])

25910     6
16228     2
65267    13
8305      0
53830    11
Name: User, dtype: int64
17023     2
25270     6
57502    12
10190     1
43975    10
Name: User, dtype: int64
59723    12
27283     6
29274     8
35314     8
57570    12
Name: User, dtype: int64
24797     6
64001    13
40741    10
8219      0
32388     8
Name: User, dtype: int64
37026     9
47274    10
73131    14
28673     8
3414      0
Name: User, dtype: int64
50498    11
56563    11
28933     8
11495     1
68504    13
Name: User, dtype: int64
28619     7
604       0
74272    14
41497    10
33372     8
Name: User, dtype: int64
14427    2
39171    9
16213    2
15106    2
20988    5
Name: User, dtype: int64
54546    11
24060     6
66064    13
23562     5
74710    14
Name: User, dtype: int64
29754     8
28838     8
13728     1
58217    12
77565    14
Name: User, dtype: int64
