# The Curse Of Dimensionality

In this notebook we're going to explore how serious the curse of dimensionality is.  We'll do that by examining it from an experimental perspective.
Let's take a look at the Boston housing data that ships with Keras.

In [1]:
import numpy as np
from keras.datasets import boston_housing
import pandas as pd

In [2]:
(X_train, y_train),(X_test, y_test) =  boston_housing.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz


The first thing you can do to get an idea of your data is to examine the shape and print a few rows of the data.

In [3]:
X_train.shape

(404, 13)

So we see that our training data has `404` samples and `13` features.  This is actually a pretty small dataset, and the fact that it has 13 features makes it a prime candidate to suffer from the curse of dimensionality.  13 features means 13 variables to deal with, making it a 13-dimensional dataset.  With only 404 points spread across 13 dimensions, the data is certainly very **sparse**.
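To get a feel for how sparse 404 points are in 13 dimensions, here is a small sketch that is not part of the original notebook. It counts how many cells you get if every feature is split into just two bins, and then uses random uniform points (an assumption, standing in for real data) to compare the nearest and farthest neighbour distances in 1 and 13 dimensions. The helper name `nearest_to_farthest_ratio` is made up for this illustration.

```python
import numpy as np

n_samples = 404    # rows in X_train
n_features = 13    # columns in X_train

# Split every feature into only two bins (say "low" and "high").
bins_per_feature = 2
n_cells = bins_per_feature ** n_features
print(n_cells)               # 8192 cells for only 404 samples
print(n_samples / n_cells)   # far fewer than one sample per cell on average

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(dim, n=n_samples):
    """Ratio of nearest to farthest distance from a random query point."""
    points = rng.random((n, dim))   # synthetic uniform data, not the housing data
    query = rng.random(dim)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

ratio_1d = nearest_to_farthest_ratio(1)
ratio_13d = nearest_to_farthest_ratio(13)
print(ratio_1d, ratio_13d)
```

In one dimension the nearest point is typically much closer than the farthest one, so the ratio is close to 0. In 13 dimensions the ratio is noticeably larger, meaning "near" and "far" neighbours start to look alike, which is one face of the curse.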

Before we search for the effects of the curse, let's just take a look at the features themselves.

## Boston Housing Features

    1. CRIM      per capita crime rate by town
    2. ZN        proportion of residential land zoned for lots over 
                 25,000 sq.ft.
    3. INDUS     proportion of non-retail business acres per town
    4. CHAS      Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
    5. NOX       nitric oxides concentration (parts per 10 million)
    6. RM        average number of rooms per dwelling
    7. AGE       proportion of owner-occupied units built prior to 1940
    8. DIS       weighted distances to five Boston employment centres
    9. RAD       index of accessibility to radial highways
    10. TAX      full-value property-tax rate per $10,000
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks 
                 by town
    13. LSTAT    % lower status of the population
    
    Target Value - Y labels
    14. MEDV     Median value of owner-occupied homes in $1000's
    
    http://lib.stat.cmu.edu/datasets/boston
    
#### Note: this dataset is OLD, from the 1970s, and obviously not racially sensitive at all.

In [4]:
## I like to put things into pandas dataframe, for a lot of reasons.  Here I do it just because it will make the dataset print out nicely.
cols = ["CRIM","ZN","INDUS","CHAS","NOX","RM","AGE",'DIS','RAD', 'TAX','PTRATIO','B','LSTAT']
X_traindf = pd.DataFrame(X_train, columns = cols)
X_traindf.head()

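|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.23247 | 0.0 | 8.14 | 0.0 | 0.538 | 6.142 | 91.7 | 3.9769 | 4.0 | 307.0 | 21.0 | 396.9 | 18.72 |
| 1 | 0.02177 | 82.5 | 2.03 | 0.0 | 0.415 | 7.61 | 15.7 | 6.27 | 2.0 | 348.0 | 14.7 | 395.38 | 3.11 |
| 2 | 4.89822 | 0.0 | 18.1 | 0.0 | 0.631 | 4.97 | 100.0 | 1.3325 | 24.0 | 666.0 | 20.2 | 375.52 | 3.26 |
| 3 | 0.03961 | 0.0 | 5.19 | 0.0 | 0.515 | 6.037 | 34.5 | 5.9853 | 5.0 | 224.0 | 20.2 | 396.9 | 8.01 |
| 4 | 3.69311 | 0.0 | 18.1 | 0.0 | 0.713 | 6.376 | 88.4 | 2.5671 | 24.0 | 666.0 | 20.2 | 391.43 | 14.65 |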


## Scale the variables

We need to standardize our data by removing the mean of each feature and scaling it to unit variance.  This addresses the problem of the features being on very different scales.  We'll discuss this in depth in the next lecture; for now, we'll just do it.
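As a rough picture of what standardization does, here is a minimal NumPy sketch on a made-up toy array (the values are invented for illustration). Each column has its mean subtracted and is divided by its standard deviation, and the statistics learned on the training rows are reused for new rows, which mirrors calling `fit_transform` on the training data and `transform` on the test data.

```python
import numpy as np

# Toy "training" data: two features on very different scales (made-up values).
toy_train = np.array([[1.0, 100.0],
                      [2.0, 300.0],
                      [3.0, 500.0]])
toy_test = np.array([[2.0, 200.0]])

# Learn the statistics from the training data only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)   # population standard deviation (ddof=0)

# Apply the same shift and scale to both sets.
toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std

print(toy_train_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_train_scaled.std(axis=0))   # approximately 1 for every column
print(toy_test_scaled)
```

Note that the test row is scaled with the training mean and standard deviation, so the scaled test data will not generally have exactly zero mean or unit variance.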

In [5]:
from sklearn.preprocessing import StandardScaler

In [6]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [7]:
X_train[:5]

array([[-0.27224633, -0.48361547, -0.43576161, -0.25683275, -0.1652266 ,
        -0.1764426 ,  0.81306188,  0.1166983 , -0.62624905, -0.59517003,
         1.14850044,  0.44807713,  0.8252202 ],
       [-0.40342651,  2.99178419, -1.33391162, -0.25683275, -1.21518188,
         1.89434613, -1.91036058,  1.24758524, -0.85646254, -0.34843254,
        -1.71818909,  0.43190599, -1.32920239],
       [ 0.1249402 , -0.48361547,  1.0283258 , -0.25683275,  0.62864202,
        -1.82968811,  1.11048828, -1.18743907,  1.67588577,  1.5652875 ,
         0.78447637,  0.22061726, -1.30850006],
       [-0.40149354, -0.48361547, -0.86940196, -0.25683275, -0.3615597 ,
        -0.3245576 , -1.23667187,  1.10717989, -0.51114231, -1.094663  ,
         0.78447637,  0.44807713, -0.65292624],
       [-0.0056343 , -0.48361547,  1.0283258 , -0.25683275,  1.32861221,
         0.15364225,  0.69480801, -0.57857203,  1.67588577,  1.5652875 ,
         0.78447637,  0.3898823 ,  0.26349695]])

##  Let's build a simple model

In [8]:
from keras import models
from keras import layers

In [9]:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1))  # 1 output for regression, also no activation function!

model.compile(optimizer = 'rmsprop', loss = 'mse', metrics =['mse'])

In [10]:
# We are going to set `verbose = 0` because we don't want 100 lines of epochs being complete.
model.fit(X_train, y_train, batch_size=16, epochs= 100, verbose=0)

<tensorflow.python.keras.callbacks.History at 0x22e5628fac0>

In [11]:
# evaluate returns the loss followed by each metric; here both are the MSE
test_loss, test_mse = model.evaluate(X_test, y_test)
print(test_loss)

16.964195251464844
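To read that number in more familiar units: the target `MEDV` is in thousands of dollars, so the MSE is in squared thousands of dollars, and its square root gives a typical error size. The helper `rmse_in_dollars` below is a name made up for this sketch, and the loss value is simply copied from the output above (your own run will give a slightly different number).

```python
import math

def rmse_in_dollars(mse_in_squared_thousands):
    """Convert an MSE on MEDV (in $1000's) into an RMSE in plain dollars."""
    return math.sqrt(mse_in_squared_thousands) * 1000

baseline_mse = 16.964195251464844   # copied from the evaluation output above
baseline_rmse = rmse_in_dollars(baseline_mse)
print(round(baseline_rmse))         # a bit over 4,100 dollars
```

So with all 13 features the model is typically off by roughly four thousand dollars on the median home value.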


## Can we do better?  

At this point we've preprocessed the features a bit and now have a basic result.
My question to you is -- can we do better?  The answer should be 'yes', but honestly I'm not sure how.
One hypothesis I have is that, with such a small dataset in 13 dimensions, not every feature is helping more than it is hurting.  One simple approach would be to rank the features -- see which ones are better and then remove the worse ones.
This will have the effect of keeping the best features while reducing the dimensionality of the data.

Fortunately for us, Scikit-Learn comes equipped with a few functions to do exactly this kind of feature ranking.

http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

In [12]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

SelectKBest will choose the K best features, using a predefined univariate metric (here `f_regression`).  Don't worry too much about what's going on under the hood here; let's just accept that there are some statistics we can use to assign a value to each feature.  SelectKBest will do that, and keep the `K` best that we ask for.  Go ahead and pick a value for K below.
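If you are curious anyway, the following is a simplified NumPy sketch of the idea, not scikit-learn's actual implementation. It assumes the score for each feature is an F-statistic derived from that feature's Pearson correlation with the target, and that the K highest-scoring columns are kept in their original order. The function names `f_scores` and `top_k_columns`, and the toy data, are invented for this illustration.

```python
import numpy as np

def f_scores(X, y):
    """Univariate F-statistic of each column of X against the target y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation between every column and the target.
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    dof = len(y) - 2
    return corr ** 2 / (1 - corr ** 2) * dof

def top_k_columns(X, y, k):
    """Indices of the k highest-scoring columns, in original column order."""
    scores = f_scores(X, y)
    return np.sort(np.argsort(scores)[-k:])

# Toy data: only columns 0 and 2 actually drive the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = 3 * X_toy[:, 0] + 2 * X_toy[:, 2] + 0.1 * rng.normal(size=200)

keep = top_k_columns(X_toy, y_toy, k=2)
print(keep)                    # the two informative columns
print(X_toy[:, keep].shape)    # (200, 2)
```

Because each feature is scored on its own, this kind of ranking cannot see interactions between features, which is one reason the "best K" set is not guaranteed to give the best model.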

In [13]:
selector = SelectKBest(f_regression, k = 6)  # PICK A VALUE FOR K
X_train_k = selector.fit_transform(X_train, y_train)  #we fit the selector on the training data only
X_test_k = selector.transform(X_test)  # we use the fit selector to transform the test data.

In [14]:
print (X_train_k.shape)
print (X_test_k.shape)

(404, 6)
(102, 6)


In [15]:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(X_train_k.shape[1],)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1))  # 1 output for regression, also no activation function!

model.compile(optimizer = 'rmsprop', loss = 'mse', metrics =['mse'])

model.fit(X_train_k, y_train, batch_size=16, epochs= 100, verbose=0)
test_loss, test_mse = model.evaluate(X_test_k, y_test)  # second value is the MSE metric, not an accuracy
print("loss is {}".format(test_loss))

loss is 14.817341804504395


##  Did it get better?

So you reduced the dimensionality of the data and now I want to ask you -- did it get better?

your answer here : 



## You should really be more systematic

We should really be more systematic in checking how many dimensions are appropriate.
The simplest thing to do is just stick it all in a for loop.  I've done that for you below.

In [16]:
# let's wrap our model creation and testing into a simple function.

def run_model(X_train_k, y_train, X_test_k, y_test, i ):
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(X_train_k.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))  # 1 output for regression, also no activation function!

    model.compile(optimizer = 'rmsprop', loss = 'mse', metrics =['mse'])

    model.fit(X_train_k, y_train, batch_size=16, epochs= 100, verbose=0)
    test_loss, test_mse = model.evaluate(X_test_k, y_test)  # second value is the MSE metric, not an accuracy
    print("i is : {}, loss is {}".format(i, test_loss))

This for loop will test all the values 1-13 of K.  For each K we keep only the K top-ranked features, so we get a result for every feature count and can see what actually does best.

In [17]:
for i in range(1,14):
    selector = SelectKBest(f_regression, k = i)
    X_train_k = selector.fit_transform(X_train, y_train)
    X_test_k = selector.transform(X_test)
    run_model(X_train_k, y_train, X_test_k, y_test, i)

i is : 1, loss is 31.74686050415039
i is : 2, loss is 22.788585662841797
i is : 3, loss is 18.066062927246094
i is : 4, loss is 17.9735107421875
i is : 5, loss is 17.755508422851562
i is : 6, loss is 16.664934158325195
i is : 7, loss is 19.0513858795166
i is : 8, loss is 19.476139068603516
i is : 9, loss is 20.748472213745117
i is : 10, loss is 18.970489501953125
i is : 11, loss is 19.642810821533203
i is : 12, loss is 15.545031547546387
i is : 13, loss is 15.787901878356934


## So in conclusion, did reducing the dimensionality of the data help?

Would you guess that it will always help? Always hurt? Why might it help and why might it not?

your answer here : 


##  The astute observer would notice...

That these results vary quite a bit, and not in any smooth pattern as K changes.  Why do you think this might be?
What happens if we run it all again?


In [18]:
for i in range(1,14):
    selector = SelectKBest(f_regression, k = i)
    X_train_k = selector.fit_transform(X_train, y_train)
    X_test_k = selector.transform(X_test)
    run_model(X_train_k, y_train, X_test_k, y_test, i)

i is : 1, loss is 30.786846160888672
i is : 2, loss is 24.661603927612305
i is : 3, loss is 17.31822967529297
i is : 4, loss is 16.864648818969727
i is : 5, loss is 17.915454864501953
i is : 6, loss is 15.788019180297852
i is : 7, loss is 18.180761337280273
i is : 8, loss is 23.037607192993164
i is : 9, loss is 18.716766357421875
i is : 10, loss is 21.004819869995117
i is : 11, loss is 21.62381362915039
i is : 12, loss is 17.74018669128418
i is : 13, loss is 17.024444580078125
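One way to compare the two runs side by side is to put the printed losses into arrays. The sketch below copies the losses from the two outputs above, rounded to four decimals, and looks at which K was best in each run and how far apart the runs are. The variable names are made up for this illustration.

```python
import numpy as np

ks = np.arange(1, 14)

# Test losses copied from the two loops above, rounded to four decimals.
run_1 = np.array([31.7469, 22.7886, 18.0661, 17.9735, 17.7555, 16.6649, 19.0514,
                  19.4761, 20.7485, 18.9705, 19.6428, 15.5450, 15.7879])
run_2 = np.array([30.7868, 24.6616, 17.3182, 16.8646, 17.9155, 15.7880, 18.1808,
                  23.0376, 18.7168, 21.0048, 21.6238, 17.7402, 17.0244])

best_k_run_1 = int(ks[run_1.argmin()])
best_k_run_2 = int(ks[run_2.argmin()])
print(best_k_run_1, best_k_run_2)   # 12 6

# Average the two runs and measure how much each K moved between them.
mean_loss = (run_1 + run_2) / 2
gap = np.abs(run_1 - run_2)
best_k_mean = int(ks[mean_loss.argmin()])
largest_gap_k = int(ks[gap.argmax()])
print(best_k_mean, largest_gap_k)   # 6 8
```

Two runs are far too few to draw firm conclusions. Repeating each K several times and reporting a mean and spread would give a more trustworthy picture.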


## Final Question
Are the answers the same?  Why are these results less stable than the results on the MNIST data?

your answer here : 

## Bonus Project

For the very determined amongst you I offer the following idea:

Remember our MNIST dataset.  If you observe it carefully you will notice that a lot of the pixel values are 0; in fact they are just empty background.  However, every single pixel is another dimension added to our dataset.  Can we remove some of the pixels and still keep the same information?
Perhaps you do this by trimming a one-pixel border (the first and last row and column) from each MNIST image.  Going from 28 x 28 (784,) --> 26 x 26 (676,) would drop 108 dimensions from the data.  In theory this should help our accuracy.

Would it?
You can set up the experiment and run it yourself.  You may notice that it does or does not help depending on your network.  For example, I am doubtful that lowering the dimension would help it get past 97% test accuracy, but if you set up a dumber network (6 node input) it would probably help in that case.
What I mean to say is that it would most likely help the bottom line, but you may not see it with a dense network that is already performing so well (97% accuracy).
I won't do this, but I do encourage you to experiment and find out!
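As a starting point for that experiment, the cropping step could look something like the sketch below. It uses a small synthetic array as a stand-in for the MNIST images (an assumption, so nothing needs to be downloaded). With the real data you would apply the same slicing to the `(n, 28, 28)` image arrays before flattening them.

```python
import numpy as np

# Stand-in for MNIST: 5 blank 28 x 28 images with a bright square in the middle.
images = np.zeros((5, 28, 28), dtype=np.uint8)
images[:, 4:24, 4:24] = 255

# Trim a one-pixel border: drop the first and last row and column.
cropped = images[:, 1:-1, 1:-1]
print(cropped.shape)    # (5, 26, 26)

# Flatten each image into one row of features for a dense network.
flat = cropped.reshape(len(cropped), -1)
print(flat.shape)       # (5, 676)

# Check that the removed border pixels were 0 in every image.
border_mask = np.ones((28, 28), dtype=bool)
border_mask[1:-1, 1:-1] = False
border_is_empty = bool((images[:, border_mask] == 0).all())
print(border_is_empty)  # True
```

Before dropping pixels from the real dataset it is worth running a check like the last one, since a pixel only carries no information if it is constant across all the images.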