# Creating a dependency2vec model

This notebook serves as a tutorial for creating a dependency2vec embedding model.

## Prerequisites

- Introduction can be found at IDEA NAS/public/Presentation Slide/Senior Meeting/Dependency2vec.pdf
- Syntactic parser: I recommend spaCy (https://spacy.io/)
- dependency2vec Paper: "Dependency-Based Word Embeddings", Omer Levy and Yoav Goldberg, 2014. *Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics*
- dependency2vec source: https://bitbucket.org/yoavgo/word2vecf

## Jumping in

The data format required to make a dep2vec embedding differs from traditional word embeddings.

Whereas word2vec or fastText expects one text sample per line:
- user_mention user_mention user_mention i care obviously that's i'm making comment 
- he turn roadblocks super highways :) 
- what need know count me text #womensmarch

dependency2vec expects data in [CoNLL-U format](http://universaldependencies.org/format.html).

In CoNLL-U, each line describes one token using ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The ID field tracks the position of each token within its text sample, and HEAD holds the ID of the token's syntactic head (0 for the root). Samples are separated by a blank line.
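
To make the format concrete, here is a minimal sketch that formats one hand-annotated sample as CoNLL-U lines. The tokens, their tags, and the `to_conllu` helper are illustrative assumptions, not output from a real parser.

```python
# Hypothetical, hand-annotated tokens: (form, lemma, upos, xpos, head, deprel).
# HEAD is the 1-based ID of the governing token; 0 marks the root.
SAMPLE = [
    ("he", "he", "PRON", "PRP", 2, "nsubj"),
    ("turns", "turn", "VERB", "VBZ", 0, "root"),
    ("roadblocks", "roadblock", "NOUN", "NNS", 2, "dobj"),
]


def to_conllu(tokens):
    """Format one text sample as ten tab-separated CoNLL-U fields per token."""
    lines = []
    for token_id, (form, lemma, upos, xpos, head, deprel) in enumerate(tokens, start=1):
        # FEATS and MISC are left empty ("_"); DEPS repeats HEAD:DEPREL.
        fields = (token_id, form, lemma, upos, xpos, "_", head, deprel,
                  "{}:{}".format(head, deprel), "_")
        lines.append("\t".join(str(field) for field in fields))
    return lines


conllu_lines = to_conllu(SAMPLE)
print("\n".join(conllu_lines))
```

Each printed row is one token; a blank line would follow before the next sample starts again at ID 1.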

### That looks complex, ain't nobody got time for dat

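Never fear, we have the technology. The full code is in my repository; I've extracted only the key parts for brevity. The snippets below haven't been tested in this form, and they rely on project-specific helpers (such as `settings` and `CustomTwokenizer`) that aren't defined here.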

#### Prepping the data

In [None]:
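import spacy

# NOTE: `settings` and `CustomTwokenizer` are project-specific and live in the
# full repository; import them before running this cell.


def init_nlp_pipeline(parser, tokenizer=CustomTwokenizer):
    """Initialize spaCy nlp pipeline

    Args:
        parser (bool): Load the dependency parser if True, skip it if False.
        tokenizer: Tokenizer factory handed to spaCy as `create_make_doc`.

    Returns:
        nlp: spaCy language model
    """
    # `create_make_doc` and `parser=False` appear to be spaCy 1.x keyword
    # arguments; newer spaCy releases configure these differently.
    if parser is False:
        nlp = spacy.load(settings.SPACY_EN_MODEL, create_make_doc=tokenizer,
                         parser=False)
    else:
        nlp = spacy.load(settings.SPACY_EN_MODEL,
                         create_make_doc=tokenizer)
    return nlp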


def extract_lexical_features_test(nlp, tweet_list):
    """Provides tokenization, POS and dependency parsing
    Args:
        nlp  (spaCy model): Language processing pipeline
        tweet_list (list): Text samples to parse (assumed to be raw strings)
    Returns:
        staging (list): One list of CoNLL lines per text sample
    """
    staging = []

    # nlp.pipe streams the texts through the pipeline in batches.
    docs = nlp.pipe(tweet_list, batch_size=15000, n_threads=4)
    for doc in docs:
        parsed_doc = extract_conll_format(doc)
        staging.append(parsed_doc)

        ## Send the docs somewhere in order to write to disk later
    return staging


def extract_conll_format(doc):
    """Return the document in CoNLL format
    Args:
        doc (spaCy Doc): A container for accessing linguistic annotations.
    Returns:
        result: List of lines storing the CoNLL format for each word in a sentence.
    """
    # https://github.com/explosion/spaCy/issues/533#issuecomment-254774296
    # http://universaldependencies.org/docs/format.html
    result = []
    for sent in doc.sents:
        if result:
            # Blank line between sentences, since token IDs restart at 1.
            result.append("")
        for i, word in enumerate(sent):
            # Compare indices rather than identity: Token objects may be
            # re-created on each access, so `is` is not reliable.
            if word.head.i == word.i:
                head_idx = 0  # root of the sentence
            else:
                # Make the head index sentence-relative so it matches the
                # 1-based token IDs written in the first column.
                head_idx = word.head.i - sent[0].i + 1
            fields = (i + 1, word.lower_, word.lemma_, word.pos_, word.tag_,
                      "_", head_idx, word.dep_, str(head_idx) + ":" + word.dep_, "_")
            result.append("\t".join(str(x) for x in fields))
    return result

def prep_conll_file(collection, filename):
    """ Takes a MongoDB or other cursor
    and writes the CoNLL data to a text file.
    Args:
        collection (iterable): Documents with a "conllFormat" list of lines
        filename (str)
    """

    # The `with` block closes the file, so no explicit close() is needed.
    with open(settings.CONLL_PATH + filename, "a+") as _f:
        for count, doc in enumerate(collection, start=1):
            for entry in doc["conllFormat"]:
                _f.write(entry + "\n")
            # A blank line marks the end of each text sample.
            _f.write("\n")
            settings.logger.debug("Count %s", count)
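
Before training, it can be worth sanity-checking the file that was written. The sketch below is one way to do that, assuming the layout described earlier (ten tab-separated fields per token, blank line between samples); `check_conll_text` is a hypothetical helper, not part of the repository.

```python
def check_conll_text(text):
    """Return (number_of_samples, list_of_problems) for CoNLL-formatted text."""
    problems = []
    # Samples are separated by blank lines.
    blocks = [block for block in text.split("\n\n") if block.strip()]
    for sample_no, block in enumerate(blocks, start=1):
        rows = [line.split("\t") for line in block.strip("\n").split("\n")]
        for row in rows:
            if len(row) != 10:
                problems.append("sample {}: expected 10 fields, got {}".format(
                    sample_no, len(row)))
            elif not row[6].isdigit() or int(row[6]) > len(rows):
                # HEAD must be 0 (root) or the ID of a token in this sample.
                problems.append("sample {}: bad HEAD {!r} for token {}".format(
                    sample_no, row[6], row[0]))
    return len(blocks), problems


# Hand-written rows for illustration only.
good_text = "\n".join([
    "\t".join(["1", "he", "he", "PRON", "PRP", "_", "2", "nsubj", "2:nsubj", "_"]),
    "\t".join(["2", "turns", "turn", "VERB", "VBZ", "_", "0", "root", "0:root", "_"]),
    "",
    "\t".join(["1", "ok", "ok", "INTJ", "UH", "_", "0", "root", "0:root", "_"]),
    "",
])
bad_text = "\t".join(["1", "ok", "ok", "INTJ", "UH", "_", "7", "root", "7:root", "_"])

good_result = check_conll_text(good_text)
bad_result = check_conll_text(bad_text)
print(good_result)
print(bad_result)
```

A HEAD value larger than the number of tokens in its sample usually means the head index was computed relative to the whole document rather than the sentence.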

#### Training the model

The following function shells out to the helper scripts and compiled binaries that ship with the dep2vec (word2vecf) source, running the four steps from its readme in order. The full instructions are in that readme.

In [2]:
import subprocess

# NOTE: `notifiers` is a project-specific helper module from the full repository.
CONLL_DIR = "hatespeech_core/data/conll_data/"
D2V_DIR = CONLL_DIR + "dependency2vec/"
VOCAB_DIR = D2V_DIR + "vocab_data/"
EMBEDDING_DIR = "hatespeech_core/data/persistence/word_embeddings/"


def run_step(command):
    """Run one shell pipeline and echo whatever it prints to stdout."""
    output = subprocess.run(command, shell=True, check=True,
                            stdout=subprocess.PIPE, universal_newlines=True)
    for line in output.stdout.splitlines():
        print(line)


def dep2vec_model(input_data, filename, filter_count, min_count, dimensions):
    """ Train a dependency2vec model.
    Follows the same steps outlined in the dependency2vec readme
    """
    conll_file = CONLL_DIR + input_data
    vocab_file = VOCAB_DIR + "counted_vocabulary_" + input_data
    contexts_file = VOCAB_DIR + "dep.contexts_" + input_data
    cv_file = VOCAB_DIR + "cv_" + input_data
    wv_file = VOCAB_DIR + "wv_" + input_data
    vectors_file = "{}dim{}vecs_{}".format(EMBEDDING_DIR, dimensions, filename)

    time1 = notifiers.time()
    # Step 1: count the word forms (column 2) to build the vocabulary
    run_step("cut -f 2 {} | python {}scripts/vocab.py {} > {}".format(
        conll_file, D2V_DIR, filter_count, vocab_file))

    # Step 2: extract the word/dependency-context pairs
    run_step("cat {} | python {}scripts/extract_deps.py {} {} > {}".format(
        conll_file, D2V_DIR, vocab_file, filter_count, contexts_file))

    # Step 3: build the word (wv) and context (cv) vocabularies
    run_step("{}count_and_filter -train {} -cvocab {} -wvocab {} -min-count {}".format(
        D2V_DIR, contexts_file, cv_file, wv_file, min_count))

    # Step 4: train the embeddings
    run_step("{}word2vecf -train {} -cvocab {} -wvocab {} -size {} -negative 15 "
             "-threads 10 -output {}".format(
                 D2V_DIR, contexts_file, cv_file, wv_file, dimensions, vectors_file))

    time2 = notifiers.time()
    notifiers.send_job_completion(
        [time1, time2], ["dependency2vec", "dependency2vec " + filename])
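
Because every step runs through the shell, a path containing spaces or special characters can break the pipeline. The sketch below shows one way to do a dry run: it only builds the four command strings, quoting each path with `shlex.quote`, so they can be inspected before anything is executed. The `build_dep2vec_commands` helper, its directory arguments, and the example values are assumptions for illustration; the flags are the ones used in the function above.

```python
import shlex


def build_dep2vec_commands(conll_file, work_dir, tool_dir, output_file,
                           filter_count, min_count, dimensions):
    """Return the four training commands as shell-safe strings (nothing is run)."""
    q = shlex.quote
    vocab = work_dir + "/counted_vocabulary"
    contexts = work_dir + "/dep.contexts"
    cv, wv = work_dir + "/cv", work_dir + "/wv"
    return [
        # Step 1: count word forms from column 2
        "cut -f 2 {} | python {} {} > {}".format(
            q(conll_file), q(tool_dir + "/scripts/vocab.py"), int(filter_count), q(vocab)),
        # Step 2: extract word/context pairs
        "cat {} | python {} {} {} > {}".format(
            q(conll_file), q(tool_dir + "/scripts/extract_deps.py"), q(vocab),
            int(filter_count), q(contexts)),
        # Step 3: build word and context vocabularies
        "{} -train {} -cvocab {} -wvocab {} -min-count {}".format(
            q(tool_dir + "/count_and_filter"), q(contexts), q(cv), q(wv), int(min_count)),
        # Step 4: train the vectors
        "{} -train {} -cvocab {} -wvocab {} -size {} -negative 15 -threads 10 -output {}".format(
            q(tool_dir + "/word2vecf"), q(contexts), q(cv), q(wv), int(dimensions),
            q(output_file)),
    ]


# Example values are made up; note the space in the input path.
commands = build_dep2vec_commands(
    "my data/tweets.conll", "vocab_data", "dependency2vec",
    "vectors/dim200vecs_demo", filter_count=100, min_count=100, dimensions=200)
for command in commands:
    print(command)
```

Casting the numeric arguments with `int()` also keeps stray strings out of the command line.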


#### Loading the model

In [4]:
import sys
sys.path.append("../")
%load_ext autoreload
%autoreload 2

In [14]:
from gensim.models import KeyedVectors, Word2Vec
from modules.preprocessing import neural_embeddings
from modules.utils import file_ops, model_helpers, settings
import glob
from pprint import pprint

In [9]:
def get_embeddings(embedding_type, model_ids=None, load=False):
    """ Helper function for loading embedding models
    Args:
        embedding_type (str): dep2vec,ft:keyedVectors w2v:word2vec
        model_ids (list): List of ints referencing the models.
        load (bool): Only load the models when True.
    Returns:
        List of loaded word vectors, or None if nothing was loaded.
    """

    model_format = "kv" if embedding_type in ("dep2vec", "ft") else "w2v"
    # Filename pattern used to find the saved models of each type.
    patterns = {"dep2vec": "dim*", "ft": "*.vec", "w2v": "word2vec_*"}

    if model_ids and load:
        if embedding_type not in patterns:
            raise ValueError("Unknown embedding type: " + str(embedding_type))
        embeddings_ref = sorted(file_ops.get_model_names(
            glob.glob(settings.EMBEDDING_MODELS + patterns[embedding_type])))

        # Print the available models so the indices in model_ids can be checked.
        for idx, ref in enumerate(embeddings_ref):
            print(idx, ref)

        loaded_models = []
        for idx in model_ids:
            loaded_models.append(load_embedding(
                embeddings_ref[idx], model_format))
        return loaded_models
    else:
        print("Embedding models not loaded")

def load_embedding(filename, embedding_type):
    """ Load a fasttext or word2vec embedding
    Args:
        filename (str)
        embedding_type (str): kv:keyedVectors w2v:word2vec
    """
    if embedding_type == "kv":
        return KeyedVectors.load_word2vec_format(settings.EMBEDDING_MODELS + filename, binary=False, unicode_errors="ignore")
    elif embedding_type == "w2v":
        model = Word2Vec.load(settings.EMBEDDING_MODELS + filename)
        word_vectors = model.wv
        del model
        return word_vectors
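
The `kv` branch above loads vectors with `binary=False`, i.e. the plain-text word2vec layout: a header line with the vocabulary size and dimensionality, then one word per line followed by its values. A minimal sketch of reading that layout by hand, using a made-up two-word sample and a hypothetical `read_word2vec_text` helper, looks like this:

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 v2 ...' rows."""
    count, dim = (int(value) for value in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not match the declared dimensionality
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return count, dim, vectors


# Made-up sample standing in for a real vectors file.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
count, dim, vectors = read_word2vec_text(sample)
print(count, dim, sorted(vectors))
```

In practice gensim's loader is the better choice; the sketch is only meant to show what the file is expected to contain.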

In [13]:
dep_model_ids = [8,4]
loaded_embeddings = get_embeddings("dep2vec", model_ids=dep_model_ids, load=True)
dep_embeddings = {}
dep_embeddings["twitter"] = loaded_embeddings[0]
dep_embeddings["dstormer"] = loaded_embeddings[1]

0 dim200vecs_core_combined_corpus
1 dim200vecs_core_hate_corpus
2 dim200vecs_core_tweets_clean
3 dim200vecs_core_tweets_hs_keyword
4 dim200vecs_dstormer_conll
5 dim200vecs_inaug_conll
6 dim200vecs_manch_conll
7 dim200vecs_melvynhs_conll
8 dim200vecs_twitter_conll
9 dim200vecs_uselec_conll
10 dim200vecs_ustream_conll


In [18]:
target_word = "bomber"
pprint(dep_embeddings["twitter"].similar_by_word(target_word, topn=5))
print()
pprint(dep_embeddings["dstormer"].similar_by_word(target_word, topn=5))

[('blazer', 0.9450125694274902),
 ('windbreaker', 0.937454879283905),
 ('linen', 0.930109977722168),
 ('#shirt', 0.9292298555374146),
 ('hoody', 0.9283322095870972)]

[('physician', 0.9836385846138),
 ('preacher', 0.9834705591201782),
 ('cleric', 0.9810529947280884),
 ('pensioner', 0.978823721408844),
 ('actress', 0.9784078001976013)]
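
The scores in the two lists are cosine similarities, which is what gensim's `similar_by_word` ranks by. As a rough sketch of that ranking, with toy two-dimensional vectors that are entirely made up (the real models use 200 dimensions) and a hypothetical `most_similar` helper:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_similar(vectors, target, topn=5):
    """Rank every other word by cosine similarity to the target word."""
    scores = [(word, cosine(vectors[target], vec))
              for word, vec in vectors.items() if word != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors invented for illustration; they are not taken from either model.
toy_vectors = {
    "bomber": [1.0, 0.0],
    "blazer": [0.9, 0.1],
    "cleric": [0.0, 1.0],
}
ranking = most_similar(toy_vectors, "bomber", topn=2)
print(ranking)
```

Words whose vectors point in nearly the same direction as the target score close to 1, regardless of vector length.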
