# Fine-Tuning GloVe with Mittens

## Installation and Setup

This notebook requires the `mittens` package. To install it:

`pip install --user mittens`

#### Notes

1. Limit the vocabulary with the `mincount` parameter of `build_weighted_matrix`. The default of 300 is likely too high for small corpora.
1. The notebook currently works on a sample for timing purposes. Undo the slice before the `build_weighted_matrix` call to use the full corpus.
1. `NltkPreprocessor` and `NltkTokenizer` can be replaced with other implementations. Nothing here depends heavily on `tatk`.
1. To use the TensorFlow (`tf`) implementation, set `USE_TF` to `True`. Otherwise the vectorized NumPy implementation is used.
1. The trials are independent runs, not iterative training: each trial fits a new model from scratch.
1. You can also use the `GloVe` implementation in `mittens`, shown in the last cell.

## Unlabeled IMDB Data

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="1"
import tensorflow as tf
from tensorflow.python.client import device_lib

In [2]:
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 18436180092534167647, name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 15863893197
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 11333334022390198729
 physical_device_desc: "device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 7712:00:00.0, compute capability: 6.0"]

In [5]:
import pathlib
import pandas as pd
import numpy as np
import csv
from collections import defaultdict
from operator import itemgetter
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (
    classification_report, accuracy_score, 
    confusion_matrix, f1_score)

USE_TF = True

if USE_TF:
    from mittens.tf_mittens import Mittens, GloVe # for tensorflow implementation
else:
    from mittens.np_mittens import Mittens, GloVe # for vectorized numpy

data_path = pathlib.Path.home() / "tatk" / "resources" / "data" / "imdb" / "aclImdb" / "train"

In [6]:
list(data_path.glob("*"))

[PosixPath('/home/alizaidi/tatk/resources/data/imdb/aclImdb/train/unsupBow.feat'),
 PosixPath('/home/alizaidi/tatk/resources/data/imdb/aclImdb/train/urls_pos.txt'),
 PosixPath('/home/alizaidi/tatk/resources/data/imdb/aclImdb/train/unsup'),
 PosixPath('/home/alizaidi/tatk/resources/data/imdb/aclImdb/train/urls_unsup.txt'),
 PosixPath('/home/alizaidi/tatk/resources/data/imdb/aclImdb/train/labeledBow.feat'),
 PosixPath('/home/alizaidi/tatk/resources/data/imdb/aclImdb/train/pos'),
 PosixPath('/home/alizaidi/tatk/resources/data/imdb/aclImdb/train/neg'),
 PosixPath('/home/alizaidi/tatk/resources/data/imdb/aclImdb/train/urls_neg.txt')]

In [7]:
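def read_from_directory(directory_path,
                        dirs_to_check, 
                        column_names):
    """
    Returns a DataFrame with a single row for each txt file in the
    subdirectories of `directory_path` named in `dirs_to_check`
    """
    
    import pathlib
    import pandas as pd
    
    # Accept a single name as well as a list; iterating over a bare
    # string would compare individual characters against each path.
    if isinstance(dirs_to_check, str):
        dirs_to_check = [dirs_to_check]
    
    source_path = pathlib.Path(directory_path)
    # Match on the directory name rather than on any substring of the full path
    match_dirs = [x for x in source_path.iterdir()
                  if x.is_dir() and x.name in dirs_to_check]
    files = sum([list(match.glob("*.txt")) for match in match_dirs], [])
    # Read each file whole: recent pandas versions reject sep="\n" in read_csv
    df = pd.concat([pd.DataFrame([x.read_text(encoding="utf-8").strip()]) for x in files])
    df.columns = column_names
    
    return df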

In [8]:
imdb_df = read_from_directory(str(data_path), dirs_to_check="unsup", column_names=["review"])

In [9]:
imdb_subset = imdb_df.sample(n=10**3)
imdb_df[:5]

| | review |
|---|---|
| 0 | Makers of erotic thrillers need to be careful,... |
| 0 | A ghoulish mixture of Liszt, murder, violence ... |
| 0 | This movie is severely underrated and was very... |
| 0 | Apart from the wooden acting, the heavy-handed... |
| 0 | Having read much of the criticism and praise o... |


In [12]:
train_df = read_from_directory(str(data_path), dirs_to_check=["pos", "neg"], column_names=["review"])

## Preprocess Reviews

In [10]:
from tatk.preprocessing.nltk_preprocessor import NltkPreprocessor

In [11]:
preprocessor = NltkPreprocessor(input_col='review', output_col='tokens')
imdb_df = preprocessor.tatk_transform(imdb_df)

F1 2018-05-17 17:15:11,303 INFO azureml.text:machine info {"os_type": "Linux", "is_dsvm": true} 
F1 2018-05-17 17:15:11,306 INFO azureml.text:8481512779234386476 NltkPreprocessor::tatk_transform ==> start 
F1 2018-05-17 17:15:30,764 INFO azureml.text:8481512779234386476 NltkPreprocessor::tatk_transform ==> end 	 Time taken: 0.32 mins 


In [11]:
preprocessed_text = imdb_df[preprocessor.get_output_col()].values

In [12]:
preprocessed_text[0]

'Makers of erotic thrillers need to be careful , as that is a genre that , if not handled carefully , can quickly fall prey to silliness and excess ( think " Fatal Attraction "). " Swimming Pool " is a thriller in the style of " The Deep End ," and more than once I was struck by similarities between the two in their respective tones and reliance on water as a recurring visual motif . Also , both films have a middle - aged female as the protagonist who becomes involved in covering up for the actions of a child ( in " The Deep End " a literal child , in " Swimming Pool " a figurative one ). Also , both films are completely unpredictable . Neither goes the direction in which the viewer thinks it \' s going to . However , " Swimming Pool " is much more abstract , and its ending leaves you wanting to watch the whole thing over immediately with an entirely different perspective on the action . This gimmick always makes for a memorable ending in movies that employ it , but too often it makes 

## Tokenizer and Weight Matrix

### Phrase Detector

In [13]:
from tatk.preprocessing.phrase_detector import PhraseDetector
phrase_detect = PhraseDetector(input_col="tokens", output_col="phrase_tokens")
imdb_phrase_df = phrase_detect.tatk_fit_transform(imdb_df)

PhraseDetector::tatk_fit_transform ==> start
PhraseDetector::tatk_fit_transform ==> end 	 Time taken: 7.23 mins


In [14]:
imdb_phrase_df.phrase_tokens.values[:4]

array([ 'Makers of erotic thrillers need to be careful , as that is a genre that , if not handled carefully , can quickly fall_prey to silliness and excess ( think " Fatal_Attraction "). " Swimming_Pool " is a thriller in the style of " The Deep End ," and more_than once I was struck_by similarities_between the two in their_respective tones and reliance_on water as a recurring visual motif . Also , both films have a middle -_aged female as the protagonist who becomes_involved in covering up for the actions of a child ( in " The Deep End " a literal child , in " Swimming_Pool " a figurative one ). Also , both films are completely unpredictable . Neither goes the direction in which the viewer thinks it \'_s going to . However , " Swimming_Pool " is much_more abstract , and its ending leaves you wanting to watch the whole_thing over immediately with an entirely_different perspective on the action . This gimmick always makes for a memorable ending in movies that employ it , but too_often i

### TATK Tokenizer

In [16]:
from tatk.preprocessing.nltk_tokenizer import NltkTokenizer
basic_tokenizer = NltkTokenizer()

In [15]:
preprocessed_text = imdb_phrase_df.phrase_tokens.values

### Build Weighted Matrix

In [17]:
def _window_based_iterator(toks, window_size, weighting_function):
    for i, w in enumerate(toks):
        yield w, w, 1
        left = max([0, i-window_size])
        for x in range(left, i):
            # Weight by distance from the focal word, not by absolute position
            yield w, toks[x], weighting_function(i - x)
        right = min([i+1+window_size, len(toks)])
        for x in range(i+1, right):
            yield w, toks[x], weighting_function(x - i)

def build_weighted_matrix(corpus, tokenizing_func=basic_tokenizer,
        mincount=300, vocab_size=None, window_size=10,
        weighting_function=lambda d: 1 / d):

    """Builds a count matrix based on a co-occurrence window of
    `window_size` elements before and `window_size` elements after the
    focal word, where the counts are weighted based on proximity to the
    focal word.

    Parameters
    ----------
    corpus : iterable of str
        Texts to tokenize.
    tokenizing_func : object with a `transform` method
        Called as `tokenizing_func.transform(corpus)`.
    mincount : int
        Only words with at least this many occurrences will be included.
    vocab_size : int or None
        If this is an int above 0, then, the top `vocab_size` words
        by frequency are included in the matrix, and `mincount`
        is ignored.
    window_size : int
        Size of the window before and after. (So the total window size
        is 2 times this value, with the focal word at the center.)
    weighting_function : function from ints to floats
        How to weight counts based on distance. The default is 1/d
        where d is the distance in words.

    Returns
    -------
    pd.DataFrame
        This is guaranteed to be a symmetric matrix, because of the
        way the counts are collected.

    """
    
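    # NOTE: the tokenizer output is not used below; the counts are built
    # from a whitespace split of `corpus`.
    tokenized_text = tokenizing_func.transform(corpus)
    tokens = [x.split(" ") for x in corpus]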

    # Counts for filtering:
    wc = defaultdict(int)
    for toks in tokens:
        for tok in toks:
            wc[tok] += 1
    if vocab_size:
        srt = sorted(wc.items(), key=itemgetter(1), reverse=True)
        vocab_set = {w for w, c in srt[: vocab_size]}
    else:
        vocab_set = {w for w, c in wc.items() if c >= mincount}
    vocab = sorted(vocab_set)
    n_words = len(vocab)

    # Weighted counts:
    counts = defaultdict(float)
    for toks in tokens:
        window_iter = _window_based_iterator(toks, window_size, weighting_function)
        for w, w_c, val in window_iter:
            if w in vocab_set and w_c in vocab_set:
                counts[(w, w_c)] += val

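    # Matrix: fill only the observed pairs. Indexing the defaultdict for
    # every vocabulary pair would insert a zero entry for each missing key.
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((n_words, n_words))
    for (w1, w2), val in counts.items():
        X[index[w1], index[w2]] = val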

    # DataFrame:
    X = pd.DataFrame(X, columns=vocab, index=pd.Index(vocab))
    return X
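
The distance weighting described in the docstring can be checked by hand on a toy corpus. The following is a minimal standalone sketch, not the notebook's function: `toy_cooccurrence` is a hypothetical helper that assumes whitespace tokenization, a `1/d` weight, and no vocabulary filtering.

```python
from collections import defaultdict

import numpy as np
import pandas as pd


def toy_cooccurrence(texts, window_size=2):
    """Distance-weighted co-occurrence counts for whitespace-tokenized texts."""
    counts = defaultdict(float)
    for text in texts:
        toks = text.split()
        for i, w in enumerate(toks):
            counts[(w, w)] += 1.0  # each token co-occurs with itself
            lo = max(0, i - window_size)
            hi = min(len(toks), i + 1 + window_size)
            for j in range(lo, hi):
                if j != i:
                    # Weight 1/d, where d is the distance in tokens
                    counts[(w, toks[j])] += 1.0 / abs(i - j)
    vocab = sorted({w for w, _ in counts})
    index = {w: k for k, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for (w1, w2), val in counts.items():
        X[index[w1], index[w2]] = val
    return pd.DataFrame(X, index=vocab, columns=vocab)


toy_X = toy_cooccurrence(["a b c", "a c"])
print(toy_X)
```

Here `a` and `c` are two tokens apart in the first text (weight 0.5) and adjacent in the second (weight 1.0), so their cell holds 1.5. Because the weight depends only on distance, the matrix is symmetric.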


In [20]:
X = build_weighted_matrix(preprocessed_text, vocab_size=10**4)

NltkTokenizer::transform ==> start
Time taken: 2.65 mins
NltkTokenizer::transform ==> end


In [21]:
X.shape

(10000, 10000)

## Train Mittens with External GloVe Initializations

### Parameters

In [22]:
# n_trials = 5
n_trials = 1

max_iter = 50000

embedding_dim = 300

eta = 0.05

embedding_path = pathlib.Path("/datapascal") / "data" / "trained_embeddings" / "glove"

In [23]:
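def get_random_rep(dim=embedding_dim, scale=0.5):
    """Random vector for words without a learned representation.

    NOTE: this helper is called below but is not defined in the
    notebook. This definition (uniform values in [-scale, scale]) is an
    assumed stand-in, not necessarily the original.
    """
    return np.random.uniform(low=-scale, high=scale, size=dim)

def create_glove_lookup(glove_filename):
    """Turns an external GloVe file into a defaultdict that returns
    the learned representation for words in the vocabulary and
    random representations for all others.
    """
    glove_lookup = glove2dict(glove_filename)
    glove_lookup = defaultdict(lambda : get_random_rep(), glove_lookup)
    return glove_lookup

def glove2dict(glove_filename):
    # GloVe files contain non-ASCII tokens, so set the encoding explicitly
    with open(glove_filename, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=' ', quoting=csv.QUOTE_NONE)
        data = {line[0]: np.array(list(map(float, line[1: ]))) for line in reader}
    return data

def create_lookup(X):
    """Map a dataframe to a lookup that returns random vector reps
    for new words, adding them to the lookup when this happens.
    """
    embedding_dim = X.shape[1]
    # Random vectors match the dimensionality of the rows in X
    data = defaultdict(lambda : get_random_rep(embedding_dim))
    for w, vals in X.iterrows():
        data[w] = vals.values
    return data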
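
The lookups above rely on `defaultdict` semantics that are easy to misread: indexing an unseen word creates and stores a random vector, while a membership test does not. A small illustration, where `toy_dim`, `known`, and `toy_lookup` are hypothetical names and the uniform sampler is an assumption:

```python
from collections import defaultdict

import numpy as np

toy_dim = 4  # hypothetical dimensionality for the demo
known = {"movie": np.ones(toy_dim)}
toy_lookup = defaultdict(lambda: np.random.uniform(-0.5, 0.5, toy_dim), known)

vec = toy_lookup["movie"]        # stored vector, no randomness involved
print("unseen" in toy_lookup)    # False: membership tests do not create entries
oov = toy_lookup["unseen"]       # indexing creates and stores a random vector
print("unseen" in toy_lookup)    # True
print(np.array_equal(oov, toy_lookup["unseen"]))  # True: repeat lookups agree
```

Because the random vector is stored on first access, a word gets the same representation every time it is looked up within a session.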

In [24]:
GLOVE_LOOKUP = create_glove_lookup(str(embedding_path / "glove.6B.300d.txt"))

In [None]:
import pickle

def experiment(train_data, test_data, lookup, label, trial_num):
    """Run a standard IMDB movie review experiment using `lookup` as 
    the basis for representing examples. The results are pickled to a 
    file called "results/imdb_{label}_trial{trial_num}.pickle".

    Relies on a `featurize(data, lookup)` helper returning `(X, y)`,
    which is not defined in this notebook and must be supplied.
    """        
    output_filename = "results/imdb_{}_trial{}.pickle".format(label, trial_num)            

    results = {}
    
    # Model:
    cv = GridSearchCV(
        RandomForestClassifier(), 
        param_grid={
            'n_estimators': [100, 200, 300, 400, 500],
            'max_features': ['sqrt', 'log2'],
            'max_depth': [3, 5, None]}, 
        refit=True, 
        n_jobs=-1)  
    
    # Featurize:
    X_train, y_train = featurize(train_data, lookup)
    X_test, y_test = featurize(test_data, lookup)
    
    # Fit with best estimator and predict:
    cv.fit(X_train, y_train)
    predictions = cv.predict(X_test) 
    
    # CV info:
    results['cv_results'] = cv.cv_results_
    results['best_params'] = cv.best_params_
    results['best_score'] = cv.best_score_
        
    # Test-set scoring:
    acc = accuracy_score(y_test, predictions)               
    results['accuracy'] = acc
    results['confusion_matrix'] = confusion_matrix(y_test, predictions)
    results['f1'] = f1_score(y_test, predictions, average=None)
    results['f1_macro'] = f1_score(y_test, predictions, average='macro')
    results['f1_micro'] = f1_score(y_test, predictions, average='micro')
    
    # Summary report:
    print("Accuracy: {0:0.04%}".format(acc))
    print("Best params:", cv.best_params_)
          
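    # Storage (create the output directory if it does not exist yet):
    os.makedirs(os.path.dirname(output_filename), exist_ok=True)
    with open(output_filename, 'wb') as f:
        pickle.dump(results, f)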
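
`experiment` calls a `featurize` helper that does not appear in the notebook. One plausible stand-in, offered only as a sketch, averages the word vectors of each review. The `review` and `label` column names, the whitespace tokenization, and the averaging itself are all assumptions here, not the original implementation.

```python
import numpy as np
import pandas as pd


def featurize(data, lookup, text_col="review", label_col="label"):
    """Hypothetical featurizer: mean word vector per review, plus labels."""
    dim = len(next(iter(lookup.values())))  # infer the embedding size
    features = []
    for text in data[text_col]:
        vectors = [lookup[tok] for tok in text.split()]
        # An empty review maps to the zero vector
        features.append(np.mean(vectors, axis=0) if vectors else np.zeros(dim))
    return np.vstack(features), data[label_col].values


toy_vectors = {"good": np.array([1.0, 0.0]), "bad": np.array([0.0, 1.0])}
toy_data = pd.DataFrame({"review": ["good good bad", "bad"], "label": [1, 0]})
X_toy, y_toy = featurize(toy_data, toy_vectors)
print(X_toy)
```

With a `defaultdict` lookup such as the ones built by `create_lookup`, unseen tokens would contribute random vectors instead of raising a `KeyError`.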

In [None]:
for trial_num in range(1, n_trials+1):
    mittens = Mittens(max_iter=max_iter, n=embedding_dim, learning_rate=eta, mittens=1.0)
    G_mittens = mittens.fit(
        X.values, 
        vocab=list(X.index), 
        initial_embedding_dict=GLOVE_LOOKUP)
    G_mittens = pd.DataFrame(G_mittens, index=X.index)
    G_mittens.to_csv("imdb10K_mittens_embedding_{}.csv".format(trial_num))
    mittens_lookup = create_lookup(G_mittens)
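
For intuition about the `mittens=1.0` argument: as I understand the Mittens paper, the objective adds to the GloVe loss a penalty that keeps each learned vector close to its pretrained vector, scaled by a weight that the `mittens` argument sets. The sketch below computes only that penalty term on toy arrays. `mittens_penalty` and its arguments are hypothetical names, and this is not the library's internal code.

```python
import numpy as np


def mittens_penalty(learned, pretrained, has_pretrained, mu):
    """mu times the summed squared distance to the pretrained vectors,
    counted only for rows that have a pretrained vector."""
    diff = (learned - pretrained)[has_pretrained]
    return mu * np.sum(diff ** 2)


learned = np.array([[1.0, 2.0], [5.0, 5.0], [3.0, 3.0]])
pretrained = np.array([[1.0, 1.0], [0.0, 0.0], [3.0, 2.0]])
has_pretrained = np.array([True, False, True])  # row 1 has no pretrained vector

print(mittens_penalty(learned, pretrained, has_pretrained, mu=1.0))  # 2.0
```

Under this reading, a larger weight pulls the embeddings harder toward the external GloVe vectors, and a weight of 0 would leave only the GloVe loss.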

## Train GloVe without Initialization

In [None]:
for trial_num in range(1, n_trials+1):    
    glove = GloVe(max_iter=max_iter, n=embedding_dim, learning_rate=eta)
    G = glove.fit(X.values)
    G = pd.DataFrame(G, index=X.index)
    G.to_csv("imdb_glove_embedding_{}.csv".format(trial_num))
    imdb_glove_lookup = create_lookup(G)    
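
A quick way to sanity-check a trained embedding is to list the nearest neighbors of a few words by cosine similarity. The sketch below assumes a DataFrame indexed by unique vocabulary words, like `G` above. `nearest_neighbors` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def nearest_neighbors(embedding_df, word, k=3):
    """Top-k words by cosine similarity to `word`, excluding the word itself."""
    vectors = embedding_df.values
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = vectors / norms[:, None]
    sims = unit @ unit[embedding_df.index.get_loc(word)]
    return pd.Series(sims, index=embedding_df.index).drop(word).nlargest(k)


toy_G = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    index=["good", "great", "bad"])
print(nearest_neighbors(toy_G, "good", k=2))
```

In the toy data, `great` points in almost the same direction as `good`, so it ranks first, while `bad` is orthogonal and ranks last.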