## Problem set 4

**Problem 0** (-2 points for every missing green OK sign. If you don't run the cell below, that's -14 points.)

Make sure you are in the DATA1030 environment.

In [1]:
from __future__ import print_function
from distutils.version import LooseVersion as Version
import sys

OK = '\x1b[42m[ OK ]\x1b[0m'
FAIL = "\x1b[41m[FAIL]\x1b[0m"

try:
    import importlib
except ImportError:
    print(FAIL, "Python version 3.7 is required,"
                " but %s is installed." % sys.version)

def import_version(pkg, min_ver, fail_msg=""):
    mod = None
    try:
        mod = importlib.import_module(pkg)
        if pkg in {'PIL'}:
            ver = mod.VERSION
        else:
            ver = mod.__version__
        if Version(ver) == min_ver:
            # report the package passed in, not the global loop variable `lib`
            print(OK, "%s version %s is installed."
                  % (pkg, min_ver))
        else:
            print(FAIL, "%s version %s is required, but %s installed."
                  % (pkg, min_ver, ver))
    except ImportError:
        print(FAIL, '%s not installed. %s' % (pkg, fail_msg))
    return mod


# first check the python version
pyversion = Version(sys.version)
if pyversion >= "3.7":
    print(OK, "Python version is %s" % sys.version)
elif pyversion < "3.7":
    print(FAIL, "Python version 3.7 is required,"
                " but %s is installed." % sys.version)
else:
    print(FAIL, "Unknown Python version: %s" % sys.version)

    
print()
requirements = {'numpy': "1.18.5", 'matplotlib': "3.2.2",'sklearn': "0.23.1", 
                'pandas': "1.0.5",'xgboost': "1.1.1", 'shap': "0.35.0"}

# now the dependencies
for lib, required_version in list(requirements.items()):
    import_version(lib, required_version)

[ OK ] Python version is 3.7.6 | packaged by conda-forge | (default, Jun  1 2020, 18:33:30) 
[Clang 9.0.1 ]

[ OK ] numpy version 1.18.5 is installed.
[ OK ] matplotlib version 3.2.2 is installed.
[ OK ] sklearn version 0.23.1 is installed.
[ OK ] pandas version 1.0.5 is installed.
[ OK ] xgboost version 1.1.1 is installed.
[ OK ] shap version 0.35.0 is installed.


**Problem 1a** (3 points)

You will work with the diabetes dataset in Problem 1. You will split the data and preprocess it to get it ready for training an ML model. First, read the dataset into a pandas dataframe using the tab-delimited file linked at [this page](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html).


grading suggestion:
- 3 points if they read in the file correctly using the delimiter argument
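
As a rough illustration of the delimiter argument, the sketch below parses a small in-memory tab-separated table. The column names and rows are made up for the example (they are not the real diabetes data); with the actual file you would pass its path or URL to `pd.read_csv` instead of the `StringIO` object.

```python
import io
import pandas as pd

# Made-up rows in a tab-separated layout; NOT the real diabetes data.
sample_tsv = (
    "AGE\tSEX\tBMI\tY\n"
    "50\t1\t25.0\t100\n"
    "40\t2\t30.5\t200\n"
    "65\t1\t22.3\t120\n"
)

# sep='\t' (delimiter='\t' is an alias) tells pandas the columns are tab separated.
# Without it, read_csv assumes commas and would return a single column.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(df.shape)
print(df.head())
```

Printing the shape and the first rows is a quick way to confirm that the columns were separated correctly.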


In [5]:
# read in the data in this cell


**Problem 1b** (6 points)

Answer the following questions with 1-2 paragraphs.

Q1: Is the dataset IID or not? Why?

Q2: Please decide what fraction of points will be in each set (training, validation, and test) and explain your decision in a paragraph or two.

Q3: Please explain in a paragraph or two why it is important to fit the preprocessors on the training set only.


grading suggestion:
- 2 points for Q1 if they correctly argue that the dataset is IID.
- 2 points for Q2 if the split fractions are well argued. 60-20-20 is best; 80-10-10 is also acceptable if they have a reasonable argument.
- 2 points for a good Q3 explanation


**Problem 1c** (11 points)

Based on your answers above, please perform a basic split and create training, validation, and test sets.

Now that you have three sets, you can preprocess the data. Please decide for each feature which preprocessor you will use (no need to write text). Fit those preprocessors on the training set, then transform the sets.

We discussed in class that it is important to split the data using several different random states so you can determine at the end of the ML pipeline how much uncertainty in the test score the random splitting causes. Please use 10 random states and split/preprocess the data 10 times.

Please make sure that your code is reproducible. The best way to check that is to print out which points are in, e.g., the training set and rerun the cell a couple of times. If the same points are in the same set after every rerun, your code is reproducible.

A couple of suggestions for how you could structure your code are available below.


One option:
```python

random_states = [...,...,...] # list of 10 numbers

for random_state in random_states:
    # whenever you need to set the random state, use `random_state`
    
    # split the data
    
    # preprocess the data
    
    # print stuff out to make sure your code is reproducible
    
```

Second option:
```python


for i in range(0,10):
    random_state = 42 * i # feel free to replace 42 with your magic number.
                          # the only important thing is that random_state has a different value in each iteration.
    
    # split the data
    
    # preprocess the data
    
    # print stuff out to make sure your code is reproducible
    
```
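
The reproducibility check described above can also be automated instead of eyeballing printed indices. The following is a minimal sketch, assuming a 60/20/20 split and a hypothetical helper `split_indices` (not part of the assignment): it splits the row indices twice with the same random state and compares the results.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_indices(n_points, random_state):
    """Return the row indices assigned to train/val/test (60/20/20)."""
    idx = np.arange(n_points)
    idx_train, idx_other = train_test_split(
        idx, train_size=0.6, random_state=random_state)
    idx_val, idx_test = train_test_split(
        idx_other, train_size=0.5, random_state=random_state)
    return idx_train, idx_val, idx_test


# Same random state twice: the three index arrays should match exactly.
first = split_indices(100, random_state=42)
second = split_indices(100, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(first, second))
print("same split on rerun:", same)
```

If `same` is ever `False`, some step in the split is not controlled by the random state.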




grading suggestion:

- 3 points for correctly splitting with train_test_split and setting the random_state
- 2 points for correctly using a one-hot encoder on the gender and either the standard scaler or the min-max scaler on the rest (deduct a point if they also preprocess the target variable; in regression, the target variable stays as is)
- 3 points if they fit the preprocessors to the training set and then transform everything
- 3 points for correctly looping through 10 random states
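
One possible shape for such a solution is sketched below. It is not the reference solution: the dataframe is a synthetic stand-in with assumed column names (`AGE`, `SEX`, `BMI`, `Y`), the 60/20/20 fractions are one reasonable choice, and `sparse_threshold=0` is only there to force a dense output array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the diabetes table; names and values are assumptions.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "AGE": rng.randint(20, 80, size=200),
    "SEX": rng.choice([1, 2], size=200),
    "BMI": rng.normal(26, 4, size=200),
    "Y": rng.normal(150, 70, size=200),
})
X, y = df.drop(columns="Y"), df["Y"]  # the regression target stays as is

results = []
for i in range(10):
    random_state = 42 * i
    # 60% train, then split the remaining 40% evenly into validation and test
    X_train, X_other, y_train, y_other = train_test_split(
        X, y, train_size=0.6, random_state=random_state)
    X_val, X_test, y_val, y_test = train_test_split(
        X_other, y_other, train_size=0.5, random_state=random_state)

    preprocessor = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["SEX"]),
         ("scale", StandardScaler(), ["AGE", "BMI"])],
        sparse_threshold=0)  # always return a dense array

    # fit on the training set only, then apply the same transform to all sets
    X_train_prep = preprocessor.fit_transform(X_train)
    X_val_prep = preprocessor.transform(X_val)
    X_test_prep = preprocessor.transform(X_test)
    results.append((X_train_prep, X_val_prep, X_test_prep))
    print(random_state, X_train_prep.shape, X_val_prep.shape, X_test_prep.shape)
```

Because the scaler is fitted on the training rows only, the scaled training columns have zero mean, while the validation and test columns are only approximately centered.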

**Problem 2** 

We work with the [hand postures dataset](https://archive.ics.uci.edu/ml/datasets/Motion+Capture+Hand+Postures) in Problem 2. This dataset has group structure: 14 users were recorded performing 5 hand postures with markers attached to a left-handed glove. Two different ML questions can be asked using this dataset. We will explore how the splitting and preprocessing differ for the two questions in 2a and 2b.

**Problem 2a** (10 points)

How would you prepare the data if we wanted to know how well we can predict the hand postures of a new, previously unseen user? Write down your reasoning (the usual 1-2 paragraphs are fine). Split the dataset into training, validation, and test sets, preprocess the sets, and loop through 10 random states similar to 1c. As usual, check for reproducibility!

Grading suggestion
- 6 points if they do a group split based on user ID and use the 'class' (the hand gesture) as the target variable 
    - it's OK if they use something other than GroupShuffleSplit as long as they split based on user ID
- 2 points for using the standard scaler on each of the 33 features and fitting on the training set only
- 2 points for looping through 10 random states
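
A group split of this kind could look roughly like the sketch below. The data are synthetic (14 hypothetical users with 50 rows and 3 features each), the roughly 60/20/20 split over users is an assumption, and the missing values present in the real dataset are not handled here.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 14 hypothetical users, 50 rows each, 3 marker features.
rng = np.random.RandomState(0)
users = np.repeat(np.arange(14), 50)
X = rng.normal(size=(len(users), 3))
y = rng.randint(1, 6, size=len(users))  # posture class, the target variable

for i in range(10):
    random_state = 42 * i
    # hold out about 20% of the users for the test set
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=random_state)
    other_idx, test_idx = next(outer.split(X, y, groups=users))
    # split the remaining users into training and validation
    inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=random_state)
    tr, va = next(inner.split(X[other_idx], y[other_idx], groups=users[other_idx]))
    train_idx, val_idx = other_idx[tr], other_idx[va]

    # fit the scaler on the training rows only
    scaler = StandardScaler().fit(X[train_idx])
    X_train, X_val, X_test = (
        scaler.transform(X[idx]) for idx in (train_idx, val_idx, test_idx))

    # no user may appear in more than one set
    train_users, val_users, test_users = (
        set(users[idx]) for idx in (train_idx, val_idx, test_idx))
    print(random_state, sorted(test_users))
```

Printing the held-out users per random state doubles as the reproducibility check: the same users should be listed on every rerun.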

Add your explanation here:



In [3]:
# add your code here


**Problem 2b** (10 points)

How would you prepare the data if we wanted to identify a user based on hand postures? Follow the same steps as in 2a (explain your reasoning, split, preprocess, loop through 10 random states, check reproducibility).

Grading suggestion
- 6 points if they split with the user as the target variable 
    - the perfect solution would be to do a stratified split based on the combination of class and user ID columns
    - it is, however, also good if they do a stratified split on the user ID. This is important because some users have a few hundred postures measured while other users have almost 10k postures.
    - a simple train_test_split without stratification is not OK.
- 2 points for using the standard scaler on each of the 33 features and fitting on the training set only
- 2 points for saving each set into csv files
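
For illustration, a stratified split on the user ID might be sketched as follows. The data are synthetic with deliberately imbalanced, hypothetical user counts, and the 60/20/20 fractions are an assumption. Stratifying on a combined class-and-user label would follow the same pattern with a different array passed to `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in: hypothetical users with very different row counts.
rng = np.random.RandomState(0)
counts = {0: 400, 1: 100, 2: 40, 3: 20}
user = np.repeat(list(counts.keys()), list(counts.values()))
X = rng.normal(size=(len(user), 3))
y = user  # the user ID is the target variable here

for i in range(10):
    random_state = 42 * i
    # stratify keeps each user's share of rows roughly equal across the sets
    X_train, X_other, y_train, y_other = train_test_split(
        X, y, train_size=0.6, stratify=y, random_state=random_state)
    X_val, X_test, y_val, y_test = train_test_split(
        X_other, y_other, train_size=0.5, stratify=y_other,
        random_state=random_state)

    # fit the scaler on the training set only, then transform every set
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)

    print(random_state, np.bincount(y_train), np.bincount(y_val), np.bincount(y_test))
```

The printed per-user counts make it easy to confirm that even the rarest user appears in all three sets.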


In [4]:
# add your code here
