The goal of this topic is to determine the ideal projection when the dimension of the input may vary (in the number of columns). The number of columns is determined by feature generation/embeddings.

Let:

*  $W$ be the projection matrix
*  $X$ be the matrix representing the input data, which may vary in dimension from iteration to iteration
*  $Y$ be the result of the projection, i.e. $WX = Y$

WLOG fix the output dimension of $Y$ (call it $d$). Then we have $W \in \mathbb{R}^{d \times n}$, where $n$ is the current input dimension, which changes as features are generated or removed.
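
To make the shapes concrete, here is a minimal sketch. It assumes $X$ is stored with one feature per row, so that $X \in \mathbb{R}^{n \times m}$ for $m$ samples and $WX$ is well defined. The values of `d` and `m` and the helper name `project` are illustrative and not part of the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 6  # fixed output dimension d and number of samples m (illustrative)

def project(W, X):
    """Apply Y = W X, where X has shape (n, m) and W has shape (d, n)."""
    return W @ X

for n in (3, 5, 8):  # n changes as features are added or removed
    X = rng.normal(size=(n, m))
    W = rng.normal(size=(d, n))  # W has to be resized to match the current n
    Y = project(W, X)
    print(n, W.shape, Y.shape)   # Y keeps the shape (d, m) for every n
```

If the data is stored as samples by features instead (as `make_classification` returns it), the same projection would be written `X @ W.T`.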

In this setting, within the reversible-jump MCMC (RJMCMC) framework, the parameters we are estimating are:

*  all entries in $W$ ($d \times n$)
*  all relevant combinations of feature generation functions and their respective parameters

### Examples of feature vector generation functions

An example of an algorithm we should compare against is multivariate adaptive regression splines (MARS).

**Hinge Function**

The hinge function ($f_\text{hinge}$), which is similar to ReLU, can be defined as a feature generation function (_probably not the right way to write this out_):

$$ f_{\text{hinge}^+} (X_{j}, \theta) = (X_j-\theta \mathbf{1})_{+}$$

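Here $X_j$ represents the $j$th column, $\theta$ is the threshold (the knot, in MARS terminology), and $(\cdot)_{+}$ denotes the elementwise positive part, $\max(\cdot, 0)$. Likewise, we can define $f_{\text{hinge}^-} (X_{j}, \theta) = (\theta \mathbf{1} - X_j)_{+}$.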
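
As a small illustration, the two hinge functions can be written directly with NumPy. The function names `hinge_pos` and `hinge_neg` are chosen here for illustration and are not the API of the notebook's `hinge` module.

```python
import numpy as np

def hinge_pos(x_j, theta):
    """(x_j - theta)_+ applied elementwise to one column."""
    return np.maximum(x_j - theta, 0.0)

def hinge_neg(x_j, theta):
    """(theta - x_j)_+ applied elementwise to one column."""
    return np.maximum(theta - x_j, 0.0)

x_j = np.array([-1.0, 0.5, 2.0, 3.0])
print(hinge_pos(x_j, 1.0))  # non-zero only where x_j > theta
print(hinge_neg(x_j, 1.0))  # non-zero only where x_j < theta
```

The two pieces are mirror images: `hinge_pos(x, t) - hinge_neg(x, t)` recovers `x - t`.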

**Interaction**

An interaction term is defined as the elementwise product of two feature vectors (which can be the same feature vector, giving a squared term), (_probably not the right way to write this out_)

$$ f_\text{interaction} (X_1, X_2) = X_1 \odot X_2 $$
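
The sketch below reads the interaction as an elementwise product. That reading is an assumption, made because it is what yields features such as x^2 or x1*x2 with one value per sample. The function name `interaction` is illustrative and is not the API of the notebook's `interaction` module.

```python
import numpy as np

def interaction(x_a, x_b):
    """Elementwise product of two feature columns of equal length."""
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    if x_a.shape != x_b.shape:
        raise ValueError("feature vectors must have the same shape")
    return x_a * x_b

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, -1.0, 2.0])
print(interaction(x1, x2))  # one new feature value per sample
print(interaction(x1, x1))  # squares of x1
print(np.dot(x1, x2))       # by contrast, a dot product collapses to one scalar
```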

Replicating MARS in RJMCMC
--------------------------

[MARS](https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines) is a prime candidate for exploring this problem. 

In the RJMCMC space we have the following decisions:

1. GROW - we can grow the state by proposing a new feature vector generation transformation and its respective parameters, e.g. hinge, interaction, or more complex transforms which could be a composition of feature generation functions...
2. DESTROY - we can also delete a created feature

This would represent moving from one state to another.
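
The two moves can be sketched as a proposal function. This is only a rough illustration under several assumptions: the state is a plain list of feature specifications, the move is chosen with probability 0.5, and $\theta$ is drawn from a standard normal as a placeholder. None of these choices come from the original. The sketch also omits the acceptance step and the matching resize of $W$, both of which a full RJMCMC sampler needs.

```python
import random

def propose(state, n_columns, rng):
    """Return (move, new_state) for a GROW or DESTROY proposal.

    `state` is a list of feature specs such as ("hinge+", j, theta)
    or ("interaction", j, k); this representation is hypothetical.
    """
    # With no generated features, the only possible move is GROW.
    move = "GROW" if not state or rng.random() < 0.5 else "DESTROY"
    new_state = list(state)
    if move == "GROW":
        kind = rng.choice(["hinge+", "hinge-", "interaction"])
        j = rng.randrange(n_columns)
        if kind == "interaction":
            spec = (kind, j, rng.randrange(n_columns))
        else:
            spec = (kind, j, rng.gauss(0.0, 1.0))  # placeholder draw for theta
        new_state.append(spec)
    else:
        # DESTROY: remove one generated feature chosen uniformly at random.
        new_state.pop(rng.randrange(len(new_state)))
    return move, new_state

rng = random.Random(0)
state = []
for _ in range(5):
    move, state = propose(state, n_columns=50, rng=rng)
    print(move, len(state))
```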

---

This notebook shows a sample pipeline that is to be learnt by feature discovery. 

In this example we will use 3 custom components:

*  Restricted Boltzmann Machine (which can be replaced with an arbitrary feature reduction method)
*  Hinge search (which can be replaced with any kind of feature discovery method, e.g. tfidf)
*  Interaction term (to create features like x^2 or x1*x2)

With these pieces we will be able to replicate (theoretically) models like MARS.
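
Since the custom modules themselves are not shown here, the following sketch illustrates how a component of this kind could plug into a scikit-learn `Pipeline`. The class `HingePairs` is a hypothetical stand-in written for this example, with a fixed column and threshold. It is not the notebook's `Hinge` class, whose interface may differ.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

class HingePairs(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: append (x_j - theta)_+ and (theta - x_j)_+ for one column."""

    def __init__(self, column=0, theta=0.0):
        self.column = column
        self.theta = theta

    def fit(self, X, y=None):
        return self  # nothing is learnt; column and theta are fixed here

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        x_j = X[:, self.column]
        pos = np.maximum(x_j - self.theta, 0.0)
        neg = np.maximum(self.theta - x_j, 0.0)
        return np.column_stack([X, pos, neg])

X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_redundant=10, n_classes=3, random_state=0)
model = Pipeline([("hinge", HingePairs(column=0, theta=0.0)),
                  ("linear svm", LinearSVC())])
model.fit(X, y)
print(model.named_steps["hinge"].transform(X).shape)  # two columns more than X
```

In a feature discovery setting, `column` and `theta` would be chosen by a search or sampling procedure instead of being fixed by hand.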

In [62]:
import numpy as np
import pandas as pd

from hinge import Hinge, error_on_split
from interaction import Interaction
from rbm import *


from sklearn import datasets
from sklearn import preprocessing
from sklearn.pipeline import Pipeline

# using svm as per scikit-feature repo
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn import random_projection

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [63]:
from sklearn.datasets import make_classification

problem_setup = {
    'n_samples': 300, 
    'n_features': 50, 
    'n_informative': 10, 
    'n_redundant': 10, 
    'n_classes': 3, 
    'random_state': 0    
}

X, y = make_classification(**problem_setup)
X = X.astype(float)
n_samples, n_features = X.shape    # number of samples and number of features

In [64]:
# vanilla pipeline
pipeline = []
pipeline.append(('linear svm', LinearSVC()))
model = Pipeline(pipeline)

# split data into 3 folds (shuffled without a fixed random_state, so scores vary between runs)
kfold = KFold(n_splits=3, shuffle=True)
results = cross_val_score(model, X, y, cv=kfold)
print("Accuracy: {}".format(results.mean()))

Accuracy: 0.6900000000000001


In [65]:
from sklearn.decomposition import PCA

# pipeline with pca
pipeline = []
pipeline.append(('pca', PCA()))
pipeline.append(('linear svm', LinearSVC()))
model = Pipeline(pipeline)

# split data into 3 folds
kfold = KFold(n_splits=3, shuffle=True)
results = cross_val_score(model, X, y, cv=kfold)
print("Accuracy: {}".format(results.mean()))

Accuracy: 0.6999999999999998


In [66]:
# pipeline with factor analysis

from sklearn.decomposition import FactorAnalysis

# build the pipeline: factor analysis followed by a linear SVM
pipeline = []
pipeline.append(('factor analysis', FactorAnalysis()))
pipeline.append(('linear svm', LinearSVC()))
model = Pipeline(pipeline)

# split data into 3 folds
kfold = KFold(n_splits=3, shuffle=True)
results = cross_val_score(model, X, y, cv=kfold)
print("Accuracy: {}".format(results.mean()))

Accuracy: 0.7433333333333335
