# Vectors of Locally Aggregated Concepts (VLAC)

**<a name="table">Table of Contents</a>**
 
1. [Loading Packages](#packages)
    
2. [Prepare Data](#prepare)

3. [Train Model and Transform Features](#train) 

4. [Transform Features](#transform)

5. [Feature Quality](#quality) 

    5.1 [TF-IDF](#tfidf)
    
    5.2 [VLAC](#vlac)

### <a name="packages">Loading Packages</a>
[Back to Table of Contents](#table)

In [10]:
from vlac import VLAC

import pickle
import warnings
import numpy as np
import gensim.models.keyedvectors as word2vec

from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, confusion_matrix
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def balanced_accuracy_score(y_true, y_pred, sample_weight=None,
                            adjusted=False):
    """Compute the balanced accuracy
    Balanced accuracy is used in binary and multiclass classification problems to
    deal with imbalanced datasets. It is defined as the average of recall
    obtained on each class.
    The best value is 1 and the worst value is 0 when ``adjusted=False``.
    Read more in the :ref:`User Guide <balanced_accuracy_score>`.
    Parameters
    ----------
    y_true : 1d array-like
        Ground truth (correct) target values.
    y_pred : 1d array-like
        Estimated targets as returned by a classifier.
    sample_weight : array-like of shape = [n_samples], optional
        Sample weights.
    adjusted : bool, default=False
        When true, the result is adjusted for chance, so that random
        performance would score 0, and perfect performance scores 1.
    Returns
    -------
    balanced_accuracy : float
    See also
    --------
    recall_score, roc_auc_score
    Notes
    -----
    Some literature promotes alternative definitions of balanced accuracy. Our
    definition is equivalent to :func:`accuracy_score` with class-balanced
    sample weights, and shares desirable properties with the binary case.
    See the :ref:`User Guide <balanced_accuracy_score>`.
    References
    ----------
    .. [1] Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. (2010).
           The balanced accuracy and its posterior distribution.
           Proceedings of the 20th International Conference on Pattern
           Recognition, 3121-24.
    .. [2] John. D. Kelleher, Brian Mac Namee, Aoife D'Arcy, (2015).
           `Fundamentals of Machine Learning for Predictive Data Analytics:
           Algorithms, Worked Examples, and Case Studies
           <https://mitpress.mit.edu/books/fundamentals-machine-learning-predictive-data-analytics>`_.
    Examples
    --------
    >>> from sklearn.metrics import balanced_accuracy_score
    >>> y_true = [0, 1, 0, 0, 1, 0]
    >>> y_pred = [0, 1, 0, 0, 0, 1]
    >>> balanced_accuracy_score(y_true, y_pred)
    0.625
    """
    C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
    with np.errstate(divide='ignore', invalid='ignore'):
        per_class = np.diag(C) / C.sum(axis=1)
    if np.any(np.isnan(per_class)):
        warnings.warn('y_pred contains classes not in y_true')
        per_class = per_class[~np.isnan(per_class)]
    score = np.mean(per_class)
    if adjusted:
        n_classes = len(per_class)
        chance = 1 / n_classes
        score -= chance
        score /= 1 - chance
    return score

### <a name="prepare">Prepare Data</a> 
[Back to Table of Contents](#table)

To prepare the data, load the word embeddings either through Gensim or as a plain dictionary (`word: vector`). It is advisable to load only the embeddings for words that actually occur in the collection of documents, since clustering a large number of word embeddings can be computationally expensive.

The collection of documents should be represented as a list of strings. 

In [5]:
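# Alternative: load the embeddings through Gensim
# model = word2vec.KeyedVectors.load_word2vec_format('PATH')

# Pickled word embeddings
with open('Data/r8_glove_1f.pickle', 'rb') as handle:
    model = pickle.load(handle)

# One document per line
with open('Data/r8_docs.txt', "r") as f:
    docs = f.readlines()

# One label per line; strip newlines so identical labels always compare equal
with open('Data/r8_labels.txt') as f:
    labels = [line.strip() for line in f]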
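
The advice above about loading only the embeddings that occur in the collection can be sketched as follows. This is a minimal illustration, assuming the embeddings are a plain `{word: vector}` dictionary and that whitespace tokenization is good enough; `filter_embeddings` is a hypothetical helper, not part of the `vlac` package, and the tokenization VLAC applies internally may differ.

```python
def filter_embeddings(embeddings, documents):
    """Keep only the embeddings for words that occur in the documents."""
    # Whitespace tokenization is an assumption made for this sketch
    vocabulary = {word for doc in documents for word in doc.split()}
    return {word: vec for word, vec in embeddings.items() if word in vocabulary}


# Toy data standing in for real embeddings and documents
toy_embeddings = {"oil": [0.1, 0.2], "price": [0.3, 0.4], "zebra": [0.5, 0.6]}
toy_docs = ["oil price rises\n", "price of oil\n"]

filtered = filter_embeddings(toy_embeddings, toy_docs)
print(sorted(filtered))  # ['oil', 'price']
```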

### <a name="train">Train Model & Transform Features</a> 
[Back to Table of Contents](#table)  
The model can be trained with `.fit_transform()`. If you use a FastText model, it is advisable to set `oov` to `True`: word embeddings are then created for out-of-vocabulary words when the VLAC procedure is applied (that is, after the word embeddings have been clustered).

Note: It is advised to save the resulting kmeans model as that will allow you to quickly transform new documents. 

In [6]:
vlac_model = VLAC(documents=docs, model=model, oov=False)
vlac_features, kmeans = vlac_model.fit_transform(num_concepts=30)
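
As a rough mental model of the features returned above, the sketch below shows a VLAD-style aggregation over word embeddings: each word is assigned to its nearest concept (cluster centroid), the residuals between word vectors and their centroid are summed per concept, and the per-concept sums are concatenated into one document vector. This is an assumption based on the method's name, not the `vlac` package's actual code; `vlac_vector`, `squared_distance`, the toy centroids, and the whitespace tokenization are all hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vlac_vector(document, embeddings, centroids):
    """Sum per-concept residuals (word vector - centroid) and concatenate them."""
    dim = len(centroids[0])
    residual_sums = [[0.0] * dim for _ in centroids]
    for word in document.split():  # whitespace tokenization is an assumption
        if word not in embeddings:
            continue  # skip out-of-vocabulary words
        vec = embeddings[word]
        # Index of the nearest centroid, i.e. the concept this word belongs to
        k = min(range(len(centroids)),
                key=lambda i: squared_distance(vec, centroids[i]))
        for d in range(dim):
            residual_sums[k][d] += vec[d] - centroids[k][d]
    # Concatenate: the result has num_concepts * embedding_dim entries
    return [value for concept in residual_sums for value in concept]


toy_embeddings = {"oil": [1.0, 0.0], "gas": [4.0, 0.0], "bank": [0.0, 5.0]}
toy_centroids = [[2.0, 0.0], [0.0, 4.0]]  # two hypothetical concepts

print(vlac_vector("oil gas bank unknown", toy_embeddings, toy_centroids))
# [1.0, 0.0, 0.0, 1.0]
```

Under this assumption, the length of each document vector is the number of concepts multiplied by the embedding dimension.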

### <a name="transform">Transform Features</a> 
[Back to Table of Contents](#table)  
Once the kmeans model has been trained, you can reuse it to transform documents into features without clustering the embeddings again.

In [None]:
# Replace `docs` and `model` with your new documents and their embeddings
vlac_model = VLAC(documents=docs, model=model, oov=False)
vlac_features = vlac_model.transform(kmeans=kmeans)
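
To follow the earlier advice about saving the kmeans model, one option is Python's `pickle` module, which this notebook already uses for the embeddings. The sketch below is an assumption about workflow rather than a feature of the `vlac` package: `save_model` and `load_model` are hypothetical helpers, and a plain dictionary stands in for the fitted kmeans object.

```python
import os
import pickle
import tempfile


def save_model(obj, path):
    """Serialize any picklable object (for example a fitted kmeans model)."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load an object previously written by save_model."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A dictionary stands in for the kmeans object returned by fit_transform
stand_in = {"n_clusters": 30}
path = os.path.join(tempfile.mkdtemp(), "kmeans.pickle")

save_model(stand_in, path)
restored = load_model(path)
print(restored == stand_in)  # True
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.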

## <a name="quality">Feature Quality</a> 
[Back to Table of Contents](#table)    
To test the quality of the features, one can use them to classify documents. Below is an example of how I did the classification. Because the data is somewhat imbalanced, balanced accuracy was used as the evaluation measure. In addition, 10-fold cross-validation was applied with LinearSVC, as text documents are typically linearly separable.
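
To see why balanced accuracy is a sensible choice for imbalanced data, consider a classifier that always predicts the majority class. The sketch below uses made-up labels (not the documents used here) and a simplified pure-Python helper, `mean_recall`, which is hypothetical and only mirrors the definition given earlier: the average of the recall obtained on each class.

```python
def mean_recall(y_true, y_pred):
    """Balanced accuracy: the average of recall obtained on each class."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of the samples that truly belong to this class
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


y_true = ["majority"] * 9 + ["minority"]
y_pred = ["majority"] * 10  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                     # 0.9
print(mean_recall(y_true, y_pred))  # 0.5
```

Plain accuracy rewards ignoring the minority class, while balanced accuracy drops to chance level for two classes.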

### <a name="tfidf">TF-IDF</a> 
[Back to Table of Contents](#table)  

In [14]:
X_counts = CountVectorizer().fit_transform(docs)
X_tfidf = TfidfTransformer().fit_transform(X_counts)
accuracy = make_scorer(balanced_accuracy_score)
cv_svc = cross_val_score(LinearSVC(random_state=42), X_tfidf, labels, cv=10, verbose=0,
                         scoring=accuracy)
np.mean(cv_svc)

0.9207233411148211

### <a name="vlac">VLAC</a> 
[Back to Table of Contents](#table)  

In [8]:
accuracy = make_scorer(balanced_accuracy_score)
cv_svc = cross_val_score(LinearSVC(random_state=42), vlac_features, labels, cv=10, verbose=0, 
                         scoring=accuracy)
np.mean(cv_svc)

0.9290098396883734