# Synthetic Multilabel Classification

## Data Generation
 y <== L-dimensional binary indicator vector; each entry represents one label
 
 X <== y + N(0, $\sigma^2$), i.e. y plus Gaussian noise with standard deviation $\sigma$
 
 ## Train Process
 
* $y \in \{0,1\}^L$

* $\bar y = \mathrm{sign}(M\cdot y) \in \{0,1\}^{\bar L}$, where $M$ is an embedding matrix with i.i.d. Gaussian entries and the sign output is mapped from $\{-1,+1\}$ to $\{0,1\}$ (store $\bar Y$ in local files for Matlab)

* $\tilde y = \textbf{BCHencode}(\bar y) \in \{0,1\}^{\tilde L}$ (requires Matlab)

* Train a multi-label random forest on $X, \tilde y$

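**Note**: for the BCH code, we choose a message length of 67 and a codeword length of 511, which allows up to 87 bit errors per codeword to be corrected. (The constants in the code cells below set the message length to 45 and the codeword length to 255.)

The error correction rate is 87/511 ≈ 0.17.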
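
The rate quoted above is the number of correctable bit errors divided by the codeword length. As a quick check of that arithmetic, the sketch below recomputes it together with the code rate; the variable names `n`, `k`, and `t` are chosen here for illustration and do not appear elsewhere in the notebook.

```python
# BCH parameters quoted in the note above: codeword length n,
# message length k, and number of correctable bit errors t.
n, k, t = 511, 67, 87

code_rate = k / n          # fraction of codeword bits that carry message bits
correction_rate = t / n    # fraction of codeword bits that may flip and still be corrected

print(f"code rate       k/n = {code_rate:.3f}")
print(f"correction rate t/n = {correction_rate:.3f}")
```

The low code rate is the price paid for the high correction rate: most of the codeword is redundancy that the classifier is allowed to get wrong.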

## Implement general One vs All classifier

In [1]:
from pytictoc import TicToc
time = TicToc()

In [2]:
from util import OvsA
clf = OvsA()

## Training Process

In [3]:
import numpy as np
from numpy.random import binomial, normal, randint
np.random.seed(42)

In [4]:
# constants
SPARSE = 0.05 # sparsity of label vectors
SIGMA = 0. # standard deviation of noise
L = 500 # feature and label dimension
N = 10000 # number of data points
voter = 30 # number of nearest neighbors to search

L_bar = 45 # embedding dimension, also the message length for BCH code
L_tilde = 255 # codeword length for BCH encoder

In [5]:
# generate synthetic data
from sklearn.model_selection import train_test_split
y = binomial(1, SPARSE, size=(N, L)) # iid Bernoulli entries
X = y + normal(loc=0, scale=SIGMA, size=(N, L)) # add Gaussian noise with standard deviation SIGMA
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [6]:
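# source encoding (sign of a random projection) + KNN searcher
import faiss
M = normal(size=[L, L_bar]) # embedding matrix with iid Gaussian entries
y_train_bar = (np.sign(y_train.dot(M)) + 1) / 2 # map {-1, +1} to {0, 1}
y_test_bar = (np.sign(y_test.dot(M)) + 1) / 2
# exact (brute-force) L2 index over the embedded training labels
nn_index = faiss.index_factory(y_train_bar.shape[1], "Flat", faiss.METRIC_L2)
nn_index.add(y_train_bar.astype('float32'))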

Failed to load GPU Faiss: No module named swigfaiss_gpu
Faiss falling back to CPU-only.


In [9]:
# save y_bar to matlab file
from scipy.io import savemat, loadmat
savemat(file_name="../.temp/train/y_bar", mdict={'y_bars':[y_train_bar],
                                                 'y_test':y_test_bar,
                                                 'L_tilde':L_tilde
                                                })

In [10]:
y_train_bar.shape

(6700, 45)

----

Using **Matlab** to encode $\bar y$ into $\tilde y$ ...

----
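
The encoding step itself is delegated to Matlab in this notebook. To show what kind of output that step produces, here is a minimal sketch of systematic cyclic encoding over GF(2) for the toy (7, 4) single-error-correcting BCH code with generator polynomial $x^3 + x + 1$. It is an illustration only: the name `bch_encode_systematic` is hypothetical, the code is far smaller than the one used here, and this is not the Matlab implementation. The sketch assumes the message bits come first and the parity bits are appended.

```python
from itertools import product

def bch_encode_systematic(msg, gen, n):
    """Systematic cyclic encoding over GF(2): codeword = message bits + parity bits.

    msg : list of k message bits, highest-degree coefficient first
    gen : generator polynomial coefficients, highest degree first
    n   : codeword length
    """
    k = len(msg)
    assert len(gen) == n - k + 1, "generator degree must equal n - k"
    buf = list(msg) + [0] * (n - k)     # coefficients of m(x) * x^(n-k)
    for i in range(k):                  # polynomial long division by gen over GF(2)
        if buf[i]:
            for j, g in enumerate(gen):
                buf[i + j] ^= g
    parity = buf[k:]                    # remainder of the division
    return list(msg) + parity

GEN_7_4 = [1, 0, 1, 1]                  # x^3 + x + 1
messages = [list(m) for m in product([0, 1], repeat=4)]
codebook = [bch_encode_systematic(m, GEN_7_4, 7) for m in messages]

# For a linear code, the minimum distance is the minimum nonzero codeword weight.
min_weight = min(sum(c) for c in codebook if any(c))
print(bch_encode_systematic([1, 0, 0, 0], GEN_7_4, 7))
print("minimum distance:", min_weight)
```

Because the encoding is systematic, the first `k` bits of every codeword are the message itself, which is the property a prefix comparison between $\tilde y$ and $\bar y$ relies on.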

In [None]:
# load the y_tilde file generated by Matlab
from scipy.io import loadmat
y_tilde_mat = loadmat("../.temp/train/y_tilde")
y_tildes = y_tilde_mat['y_tildes'].astype('float')
y_train_tilde = y_tildes[0]
y_test_tilde = y_tilde_mat['y_test_tilde'].astype('float')

In [None]:
# sanity check: if the encoder is systematic with the parity bits appended,
# the first L_bar bits of each codeword equal the message y_bar
(y_train_tilde[:, :L_bar] == y_train_bar).all()

In [None]:
# fraction of 1-bits in the training codewords
y_train_tilde.mean()

In [None]:
# train the multi-label classifier (one-vs-all logistic regression here)
from pytictoc import TicToc
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
time = TicToc()
# alternative classifiers that were tried:
#clf = RandomForestClassifier(n_jobs=-1, n_estimators=48)
#clf = SVC()
#clf = LogisticRegression()
clf = OvsA(LogisticRegression)

time.tic()
clf.fit(X_train, y_train_tilde)
time.toc("train classifier")

In [None]:
# estimate the bit-flip probability when y_bar is predicted directly (no BCH code)
clf_ = OvsA(LogisticRegression)
clf_.fit(X_train, y_train_bar)
y_predict_bar = clf_.predict(X_test)
np.mean(y_predict_bar != y_test_bar)

## Testing Process

In [None]:
y_tilde_hat = clf.predict(X_test)

In [None]:
y_test_tilde.shape

In [None]:
y_tilde_hat.shape

In [None]:
# bit error rate of the predicted codewords
np.mean(y_test_tilde != y_tilde_hat)
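
The average bit error rate hides how the errors are spread across codewords, and a BCH decoder recovers a codeword only when the number of flipped bits in that codeword is at most its error-correcting capability. The sketch below shows a per-row check on synthetic stand-in arrays (in place of `y_test_tilde` and `y_tilde_hat`). The helper name `fraction_decodable` is hypothetical, and `t = 43` is an illustrative value that should be replaced by the capability of the code actually in use.

```python
import numpy as np

def fraction_decodable(y_true, y_pred, t):
    """Fraction of rows whose Hamming distance to the true codeword is at most t."""
    errors_per_row = (y_true != y_pred).sum(axis=1)
    return float((errors_per_row <= t).mean())

rng = np.random.default_rng(0)
true_codewords = rng.integers(0, 2, size=(1000, 255))     # stand-in for y_test_tilde
flips = rng.random(true_codewords.shape) < 0.10            # iid bit flips with probability 0.1
noisy_codewords = np.where(flips, 1 - true_codewords, true_codewords)  # stand-in for y_tilde_hat

t = 43  # illustrative value (assumption); use the capability of your own code
print("bit error rate:", (true_codewords != noisy_codewords).mean())
print("fraction of rows within t errors:",
      fraction_decodable(true_codewords, noisy_codewords, t))
```

A bit error rate below the correction rate is therefore necessary but not sufficient: rows whose error count exceeds `t` are still decoded incorrectly.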

In [None]:
from scipy.io import savemat, loadmat
savemat(file_name="../.temp/test/y_tilde_hat", 
        mdict={'y_tilde_hats':[y_tilde_hat],
               'L_bar':L_bar
              }
       )

In [None]:
y_tilde_hat.shape

----

Using **Matlab** to decode $\hat{\tilde y}$ into $\hat{\bar y}$ ...

----

In [None]:
# load the decoded y_bar_hat file generated by Matlab
from scipy.io import savemat, loadmat
y_bar_hats = loadmat("../.temp/test/y_bar_hat.mat")['y_bar_hats'].astype(int)
y_bar_hat = y_bar_hats[0]

In [None]:
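# use the KNN searcher to recover the predicted y_hat:
# each of the `voter` nearest training codes votes with its label vector,
# weighted by 1 / (dist**2 + 0.01) so that closer neighbors count more.
# Note: with METRIC_L2, faiss returns squared L2 distances in `dist`.
dist, ind = nn_index.search(np.ascontiguousarray(y_bar_hat.astype('float32')), voter)
y_hat = np.stack([
    np.sum(np.array([
        y_train[indij].astype('float32')/float(distij**2 + 0.01) for indij, distij in zip(indi, disti)
    ]), axis=0)
    for indi, disti in zip(ind, dist)
], axis=0)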
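
For readers without faiss, the same idea can be sketched with plain NumPy on a tiny example. This is a brute-force stand-in under stated assumptions, not the notebook's exact computation: the function name `knn_vote` and the parameter `eps` are hypothetical, and the weight used here is 1 / (squared distance + eps).

```python
import numpy as np

def knn_vote(codes_train, labels_train, codes_query, k, eps=0.01):
    """Distance-weighted k-nearest-neighbor vote over training label vectors."""
    # squared Euclidean distance between every query code and every training code
    diff = codes_query[:, None, :] - codes_train[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # indices of the k closest training codes for each query (unordered)
    nearest = np.argpartition(sq_dist, k - 1, axis=1)[:, :k]
    nearest_dist = np.take_along_axis(sq_dist, nearest, axis=1)
    weights = 1.0 / (nearest_dist + eps)    # closer neighbors get larger weights
    # weighted sum of the neighbors' label vectors, shape (n_query, n_labels)
    return np.einsum('qk,qkl->ql', weights, labels_train[nearest].astype(float))

codes_train = np.array([[0, 0, 0], [1, 1, 1], [1, 1, 0]], dtype=float)
labels_train = np.array([[1, 0], [0, 1], [0, 1]])
query = np.array([[1.0, 1.0, 1.0]])

votes = knn_vote(codes_train, labels_train, query, k=2)
print(votes)
```

The two nearest codes to the query both carry the second label, so all of the vote mass lands on that label; the resulting score vector plays the same role as `y_hat`.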

In [None]:
def precision_at_k(truth, vote, k=1):
    """Average fraction of the k highest-scored labels that are true labels."""
    assert truth.shape == vote.shape
    success = 0
    for i in range(truth.shape[0]):
        topk = np.argpartition(vote[i], -k)[-k:]  # indices of the k largest votes
        success += truth[i, topk].sum()
    return success / (float(truth.shape[0]) * k)

In [None]:
precision_at_k(y_test, y_hat, 1)
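
A small worked example may help when reading the p@k numbers. The sketch below is a vectorized variant of the metric, redefined so the block runs on its own; the toy `truth` and `vote` arrays are made up for illustration.

```python
import numpy as np

def precision_at_k(truth, vote, k=1):
    """Average fraction of the k highest-scored labels that are true labels."""
    assert truth.shape == vote.shape
    topk = np.argpartition(vote, -k, axis=1)[:, -k:]       # k largest scores per row
    hits = np.take_along_axis(truth, topk, axis=1).sum()   # true labels among them
    return hits / float(truth.shape[0] * k)

truth = np.array([[1, 0, 0],
                  [0, 1, 1]])
vote = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.1, 0.7]])

p1 = precision_at_k(truth, vote, k=1)   # top-1 picks label 0 and label 2: both correct
p2 = precision_at_k(truth, vote, k=2)   # top-2 adds one wrong label per row
print(p1, p2)
```

In both rows the top-scored label is correct, so p@1 is 1.0; widening to k=2 adds one incorrect label per row, which halves the precision to 0.5.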

In [None]:
# exact-match (subset) accuracy: a row counts only if every codeword bit is correct
from sklearn.metrics import accuracy_score
accuracy_score(y_test_tilde, y_tilde_hat)

### Simple Results

* random forest classifier for multi-label task

| $\sigma$ | p@1 | p@3 | p@5 |
| --- | --- | --- | --- |
| 0 | 0.126 | 0.107 | 0.099 |
| 0.1 | 0.112 | 0.091 | 0.085 |
| 0.4 | 0.063 | 0.061 | 0.059 |
     
* OvsA with logistic regression 

| $\sigma$ | p@1 | p@3 | p@5 |
| --- | --- | --- | --- |
| 0 | 0.126 | 0.107 | 0.099 |
| 0.01 | 0.112 | 0.091 | 0.085 |
| 0.05 | 0.063 | 0.061 | 0.059 |

In [None]:
y_test_tilde

In [None]:
y_tilde_hat