# Exploration of Dimension Reduction
<span style="font-weight:bold; font-size:17pt; color:#666666;">Random Projection</span>
<hr>

This notebook evaluates random projection as a dimension-reduction technique.

It assumes the data (**features.h5**, the full feature set) is in a sub-directory of the **/data** folder. I've already added entries to the _.gitignore_ file so that the data files won't be committed to the repository. Note that _.gitignore_ should be updated for new versions of the data.

See the [data readme in the GitHub repository](https://github.com/BrianDavisMath/FDA-COVID19/tree/master/data) for more details.

<hr>

Reference material:

* [scikit-learn RP overview](https://scikit-learn.org/stable/modules/random_projection.html)
* [scikit-learn SparseRandomProjection](https://scikit-learn.org/stable/modules/generated/sklearn.random_projection.SparseRandomProjection.html#sklearn.random_projection.SparseRandomProjection)
* [Johnson–Lindenstrauss lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma)

<hr>

In [1]:
%pylab inline
%autosave 25

import pandas as pd

Populating the interactive namespace from numpy and matplotlib


Autosaving every 25 seconds


## Data location

In [2]:
data_loc = '../data/FDA-COVID19_files_v0.5/'

## Load the data from features.h5
<hr>

In [3]:
# Open the HDF5 store read-only; the context manager closes it after loading.
with pd.HDFStore(data_loc + 'features.h5', mode='r') as store:
    df_features = pd.DataFrame(store['df'])
print('rows: {:,}, columns: {:,}'.format(len(df_features), len(df_features.columns)))

rows: 135,363, columns: 8,617


## Drop non-feature columns

In [4]:
from sklearn.preprocessing import MinMaxScaler

# Keep the target and identifier columns for later.
df_y = df_features['activity']
df_ids = df_features[['pid', 'cid']]

# Select only the numeric columns, then drop cid and activity.
# pid is a string column, so select_dtypes already excludes it.
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
number_cols = list(df_features.select_dtypes(include=numerics).columns)
number_cols.remove('activity')
number_cols.remove('cid')

print('Number of rows: {:,}'.format(len(df_features)))
print('Number of columns: {:,}'.format(len(number_cols)))

df_data = df_features[number_cols]

Number of rows: 135,363
Number of columns: 8,614


## Scale the data

In [5]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(df_data)



## Random Projection

<span style="font-weight:bold; font-size:12pt; color:#666666;">Derive the target dimensionality from the Johnson–Lindenstrauss lemma</span>
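
For reference, this is the bound that scikit-learn documents for `johnson_lindenstrauss_min_dim`, which `SparseRandomProjection` applies when `n_components` is left at its default of `'auto'`. The symbols $k$ (target dimension), $n$ (number of samples) and $\varepsilon$ (the `eps` parameter, default 0.1) are shorthand introduced here, not names from the notebook:

$$
k \;\ge\; \frac{4 \ln n}{\varepsilon^{2}/2 - \varepsilon^{3}/3}
$$

The bound depends only on the number of samples and the allowed distortion, not on the number of original features, so for a large $n$ and a small $\varepsilon$ it can exceed the dimensionality of the data.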

In [10]:
import numpy as np
from sklearn.random_projection import SparseRandomProjection
rng = np.random.RandomState(42)

transformer = SparseRandomProjection(random_state=rng)
X_new = transformer.fit_transform(scaled)
print(X_new.shape)

# very few components are non-zero
np.mean(transformer.components_ != 0)

ValueError: eps=0.100000 and n_samples=135363 lead to a target dimension of 10127 which is larger than the original space with n_features=8614
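
The error arises because the automatically derived target dimension (10,127) exceeds the 8,614 available features. As a minimal sketch, the snippet below re-implements the bound that scikit-learn documents for `johnson_lindenstrauss_min_dim` in plain Python and checks a few distortion levels against the shape of this data set. The helper name `jl_min_dim` and the candidate `eps` values are illustrative assumptions, not part of the original notebook.

```python
import math


def jl_min_dim(n_samples, eps):
    """Minimum target dimension from the Johnson-Lindenstrauss bound.

    Hypothetical helper: mirrors the formula documented for
    sklearn.random_projection.johnson_lindenstrauss_min_dim.
    """
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return int(4 * math.log(n_samples) / denominator)


# Shape of the feature matrix reported earlier in the notebook.
n_samples, n_features = 135_363, 8_614

# Candidate distortion levels (assumed for illustration only).
for eps in (0.1, 0.2, 0.3, 0.5):
    k = jl_min_dim(n_samples, eps)
    verdict = 'fits' if k <= n_features else 'too large'
    print('eps={:.1f}: target dimension {:,} ({})'.format(eps, k, verdict))
```

Any `eps` marked as fitting could then be passed to the transformer, for example `SparseRandomProjection(eps=0.2, random_state=rng)`, or `n_components` could be set explicitly. Whether the resulting distortion is acceptable for this data still has to be evaluated separately.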