## Cleaning up the dataset

The dirty secret of ML is that you spend most of your time cleaning data. So you'll have to spend some time on that here. Do the following:

* Replace the 0 values with `np.nan` (**Note**: be aware that you shouldn't do this for all columns. Think about it.)
* Use [sklearn.impute.KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) to impute the missing values in the columns where you inserted NaNs. Those who have followed the BiBC Essentials Course might remember K-Nearest Neighbours (strictly speaking a classification/regression method rather than a clustering method). For each missing value, the imputer finds the (by default) 5 most similar samples that do have a value for that feature, judging similarity only on data that is _not_ missing, and sets the bmi/glucose level, etc. to the mean of the neighbours' values. Similarity is measured with a Euclidean distance that skips missing entries (`nan_euclidean`). We will discuss K-Nearest Neighbours in two days. For now, you can just use it. To do so, use `a = KNNImputer(missing_values = np.nan)` followed by `imputedData = a.fit_transform(nonImputedData)`.
* Note that this turns the DataFrame into a numpy array: this is not a problem but it's good to know.
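* Mean-normalise (i.e. subtract the mean and divide by the standard deviation) the features using the function provided below. Note that the function defaults to `mode="range"`, which divides by (max - min) instead, so pass `mode="SD"` to divide by the standard deviation. This should be done on all the data except the labels.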
* Put the class into a `np.array` (a column vector) called `diabetesClassLabels`
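
The first two steps and the label extraction can be sketched on a toy table. This is only an illustration under stated assumptions: the column names (`Pregnancies`, `Glucose`, `BMI`, `Outcome`), the values, and the choice of which columns treat 0 as "missing" are made up here and may not match the actual dataset. `n_neighbors=2` is used only because the toy table has six rows; the default is 5.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for the diabetes data (hypothetical column names and values).
toy = pd.DataFrame(
    {
        "Pregnancies": [0, 1, 8, 1, 0, 5],
        "Glucose": [148, 85, 0, 89, 137, 116],
        "BMI": [33.6, 26.6, 23.3, 0.0, 43.1, 25.6],
        "Outcome": [1, 0, 1, 0, 1, 0],
    }
)

featureColumns = ["Pregnancies", "Glucose", "BMI"]
# Assumption: a glucose level or BMI of 0 cannot be a real measurement,
# whereas 0 pregnancies (and a class label of 0) are perfectly valid.
zeroMeansMissing = ["Glucose", "BMI"]

# Work on a float copy of the features so that NaN can be stored.
nonImputedData = toy[featureColumns].astype(float)
nonImputedData[zeroMeansMissing] = nonImputedData[zeroMeansMissing].replace(0, np.nan)

# Fill each NaN with the mean of that feature over the 2 nearest samples.
imputer = KNNImputer(missing_values=np.nan, n_neighbors=2)
imputedData = imputer.fit_transform(nonImputedData)  # returns a NumPy array

# Class labels as a column vector of shape (nSamples, 1).
diabetesClassLabels = toy["Outcome"].to_numpy().reshape(-1, 1)

print(imputedData)
print(diabetesClassLabels.shape)
```

Because `fit_transform` returns a plain array, the column names are gone afterwards; keeping `featureColumns` around makes it easy to tell which column is which.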


In [None]:
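import numpy as np
from sklearn.impute import KNNImputer


def createNormalisedFeatures(featureArray, mode="range", printit=False):
    """Centre each feature (column) on its mean and scale it.

    mode="range" divides by (max - min); mode="SD" divides by the standard
    deviation. Returns [normalisedFeatures, featureMeans, featureScales].
    """
    # keepdims=True keeps the statistics as (1, nFeatures) arrays so that they
    # broadcast against the (nSamples, nFeatures) input.
    featureMeans = np.mean(featureArray, axis=0, keepdims=True)
    if printit:
        print(featureArray)
        print(featureMeans)
        print(featureArray - featureMeans)
    if mode == "range":
        featureRanges = np.max(featureArray, axis=0, keepdims=True) - np.min(
            featureArray, axis=0, keepdims=True
        )
        # broadcasting in action:
        normalisedFeatures = (featureArray - featureMeans) / featureRanges
        return [normalisedFeatures, featureMeans, featureRanges]
    elif mode == "SD":
        featureSDs = np.std(featureArray, axis=0, keepdims=True)
        # broadcasting in action:
        normalisedFeatures = (featureArray - featureMeans) / featureSDs
        return [normalisedFeatures, featureMeans, featureSDs]
    # Fail loudly instead of silently returning None for an unknown mode.
    raise ValueError(f'mode must be "range" or "SD", got {mode!r}')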
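

To see what the SD branch does, here is a small NumPy-only sketch that repeats the same arithmetic on made-up numbers. The arrays `trainFeatures` and `newFeatures` are hypothetical. Applying the stored means and SDs to new samples is an assumption about why the function returns them, not something stated in the exercise.

```python
import numpy as np

# Made-up data: rows are samples, columns are features on different scales.
trainFeatures = np.array(
    [[1.0, 200.0], [2.0, 220.0], [3.0, 240.0], [4.0, 260.0]]
)
newFeatures = np.array([[2.5, 230.0]])

# Per-feature statistics with shape (1, nFeatures).
trainMeans = np.mean(trainFeatures, axis=0, keepdims=True)
trainSDs = np.std(trainFeatures, axis=0, keepdims=True)

# Broadcasting stretches the (1, 2) statistics over all 4 rows.
normalisedTrain = (trainFeatures - trainMeans) / trainSDs

# New samples are scaled with the stored statistics, not with their own.
normalisedNew = (newFeatures - trainMeans) / trainSDs

print(normalisedTrain.mean(axis=0))  # approximately 0 for every feature
print(normalisedTrain.std(axis=0))   # approximately 1 for every feature
print(normalisedNew)
```

After this step both features live on a comparable scale, even though the second one was originally two orders of magnitude larger than the first.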
