# Evaluation of Poincare Embeddings

This notebook demonstrates how well poincare embeddings trained using this [implementation](https://github.com/TatsuyaShirakawa/poincare-embedding) perform on the tasks detailed in the [original paper](https://arxiv.org/pdf/1705.08039.pdf).

This is the list of tasks - 
1. WordNet reconstruction
2. WordNet link prediction
3. Link prediction in collaboration networks
4. Lexical entailment on HyperLex

A more detailed explanation of the tasks and the evaluation methodology is present in the individual evaluation subsections.

## 1. Setup

The following code clones the `poincare-embedding` repository containing the C++ implementation of the Poincare embeddings, and applies a patch containing minor additions to the implementation. Please set the variable `parent_directory` below to define the directory to which the repository is cloned.

In [1]:
# The poincare datasets, models and source code for external models are downloaded to this directory
parent_directory = './poincare/'
! mkdir -p {parent_directory}

In [2]:
import os
current_directory = os.getcwd()

In [28]:
# Clone repo
% cd {parent_directory}
np_repo_name = 'poincare-np-embedding'
! git clone https://github.com/nishnik/poincare_embeddings.git {np_repo_name}

cpp_repo_name = 'poincare-cpp-embedding'
! git clone https://github.com/TatsuyaShirakawa/poincare-embedding.git {cpp_repo_name}

/home/jayant/projects/gensim/docs/notebooks/poincare
Cloning into 'poincare-np-embedding'...
remote: Counting objects: 20, done.[K
remote: Compressing objects: 100% (18/18), done.[K
remote: Total 20 (delta 2), reused 20 (delta 2), pack-reused 0[K
Unpacking objects: 100% (20/20), done.
Checking connectivity... done.
Cloning into 'poincare-cpp-embedding'...
remote: Counting objects: 96, done.[K
remote: Total 96 (delta 0), reused 0 (delta 0), pack-reused 96[K
Unpacking objects: 100% (96/96), done.
Checking connectivity... done.
/home/jayant/projects/gensim/docs/notebooks/poincare/poincare-cpp-embedding


In [None]:
# Apply patches
% cd {cpp_repo_name}
! git apply ../poincare_burn_in_eps.patch

% cd ../{np_repo_name}
! git apply ../poincare_numpy.patch

% cd {current_directory}

Follow the instructions in the [README](https://github.com/TatsuyaShirakawa/poincare-embedding/blob/master/README.md) to compile the sources in the C++ repo in the `poincare` directory and create the binaries.

## 2. Training

### 2.1 Create the data

In [3]:
# These directories are auto created in the current directory for storing poincare datasets and models
data_directory = os.path.join(parent_directory, 'data')
models_directory = os.path.join(parent_directory, 'models')

# Create directories
! mkdir -p {data_directory}
! mkdir -p {models_directory}

In [4]:
# Prepare the WordNet data
wordnet_file = os.path.join(data_directory, 'wordnet_noun_hypernyms.tsv')
! python {parent_directory}/poincare-embedding/scripts/create_wordnet_noun_hierarchy.py {wordnet_file}

82115 nouns
743241 hypernyms


In [8]:
# Prepare the HyperLex data
hyperlex_url = "http://people.ds.cam.ac.uk/iv250/paper/hyperlex/hyperlex-data.zip"
! wget {hyperlex_url} -O {data_directory}/hyperlex-data.zip
! unzip {data_directory}/hyperlex-data.zip -d {data_directory}
hyperlex_file = os.path.join(data_directory, 'nouns-verbs', 'hyperlex-nouns.txt')

--2017-10-23 12:50:28--  http://people.ds.cam.ac.uk/iv250/paper/hyperlex/hyperlex-data.zip
Resolving people.ds.cam.ac.uk (people.ds.cam.ac.uk)... 131.111.3.47
Connecting to people.ds.cam.ac.uk (people.ds.cam.ac.uk)|131.111.3.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 183900 (180K) [application/zip]
Saving to: ‘./poincare/data/hyperlex-data.zip’


2017-10-23 12:50:28 (2.93 MB/s) - ‘./poincare/data/hyperlex-data.zip’ saved [183900/183900]

Archive:  ./poincare/data/hyperlex-data.zip
   creating: ./poincare/data/nouns-verbs/
  inflating: ./poincare/data/nouns-verbs/hyperlex-verbs.txt  
  inflating: ./poincare/data/nouns-verbs/hyperlex-nouns.txt  
   creating: ./poincare/data/splits/
   creating: ./poincare/data/splits/random/
  inflating: ./poincare/data/splits/random/hyperlex_training_all_random.txt  
  inflating: ./poincare/data/splits/random/hyperlex_test_all_random.txt  
  inflating: ./poincare/data/splits/random/hyperlex_dev_all_random.txt  
   creati

### 2.2 Training C++ embeddings

In [27]:
cpp_binary_path = os.path.join(parent_directory, cpp_repo_name, 'work', 'poincare_embedding')

In [5]:
from gensim.utils import check_output

def train_cpp_model(
    binary_path, data_file, output_file, dim, epochs, neg,
    num_threads, epsilon, burn_in, seed=0):
    """Train a poincare embedding using the c++ implementation
    
    Args:
        binary_path (str): Path to the compiled c++ implementation binary
        data_file (str): Path to tsv file containing relation pairs
        output_file (str): Path to output file containing model
        dim (int): Number of dimensions of the trained model
        epochs (int): Number of epochs to use
        neg (int): Number of negative samples to use
        num_threads (int): Number of threads to use for training the model
        epsilon (float): Constant used for clipping below a norm of one
        burn_in (int): Number of epochs to use for burn-in init (0 means no burn-in)
    
    Notes: 
        If `output_file` already exists, skips training
    """
    if os.path.exists(output_file):
        print('File %s exists, skipping' % output_file)
        return
    args = {
        'dim': dim,
        'max_epoch': epochs,
        'neg_size': neg,
        'num_thread': num_threads,
        'epsilon': epsilon,
        'burn_in': burn_in,
        'learning_rate_init': 0.1,
        'learning_rate_final': 0.0001,
    }
    cmd = [binary_path, data_file, output_file]
    for option, value in args.items():
        cmd.append("--%s" % option)
        cmd.append(str(value))
    
    return check_output(args=cmd)

In [7]:
model_sizes = [5, 10, 20, 50, 100, 200]
default_params = {
    'neg': 20,
    'epochs': 50,
    'threads': 8,
    'eps': 1e-6,
    'burn_in': 0,
}

non_default_params = {
    'neg': [10],
    'epochs': [100, 200],
    'threads': [1],
    'eps': [1e-5],
    'burn_in': [5, 10]
}

In [8]:
model_files = {}

In [35]:
def model_name_from_params(params, prefix):
    name = ['%s_%s' % (key, params[key]) for key in sorted(params.keys())]
    return '%s_%s' % (prefix, '_'.join(name))


def train_model_with_params(params, train_file, model_sizes, prefix, implementation):
    """Trains models with given params for multiple model sizes using the given implementation
    
    Args:
        params (dict): parameters to train the model with
        train_file (str): Path to tsv file containing relation pairs
        model_sizes (list): list of dimension sizes (integer) to train the model with
        prefix (str): prefix to use for the saved model filenames
        implementation (str): whether to use the python or c++ implementation,
                              allowed values: 'python', 'c++'
   
   Returns:
        tuple (model_name, model_files)
        model_files is a dict of (size, filename) pairs
        Example: ('cpp_model_epochs_50', {5: 'models/cpp_model_epochs_50_dim_5'})
    """
    files = {}
    if implementation == 'c++':
        model_name = cpp_model_name_from_params(params, prefix)
    elif implementation == 'python':
        model_name = np_model_name_from_params(params, prefix)
        
    for model_size in model_sizes:
        output_file_name = '%s_dim_%d' % (model_name, model_size)
        output_file = os.path.join(models_directory, output_file_name)
        print('Training model %s' % output_file)
        if implementation == 'c++':
            out = train_cpp_model(
                cpp_binary_path, train_file, output_file, model_size,
                params['epochs'], params['neg'], params['threads'],
                params['eps'], params['burn_in'], seed=0)
        elif implementation == 'python':
            train_external_numpy_model(
                python_script_path, train_file, output_file, model_size,
                params['epochs'], params['neg'], seed=0)
        else:
            raise ValueError('Given implementation %s not found' % implementation)
        files[model_size] = output_file
    return (model_name, files)

In [None]:
# Train models with default params
model_name, files = train_model_with_params(default_params, wordnet_file, model_sizes, 'cpp_model', 'c++')
model_files[model_name] = {}
for dim, filepath in files.items():
    model_files[model_name][dim] = filepath
# Train models with non-default params
for param, values in non_default_params.items():
    params = default_params.copy()
    for value in values:
        params[param] = value
        model_name, files = train_model_with_params(params, wordnet_file, model_sizes, 'cpp_model', 'c++')
        model_files[model_name] = {}
        for dim, filepath in files.items():
            model_files[model_name][dim] = filepath

### 2.3 Training [numpy embeddings](https://github.com/nishnik/poincare_embeddings) (non-gensim)

In [30]:
python_script_path = os.path.join(parent_directory, np_repo_name, 'poincare.py')

In [45]:
from gensim.utils import check_output

def np_model_name_from_params(params, prefix):
    param_keys = ['neg', 'epochs']
    name = ['%s_%s' % (key, params[key]) for key in sorted(param_keys)]
    return '%s_%s' % (prefix, '_'.join(name))

def train_external_numpy_model(
    script_path, data_file, output_file, dim, epochs, neg, seed=0):
    """Train a poincare embedding using an external numpy implementation
    
    Args:
        script_path (str): Path to the Python training script
        data_file (str): Path to tsv file containing relation pairs
        output_file (str): Path to output file containing model
        dim (int): Number of dimensions of the trained model
        epochs (int): Number of epochs to use
        neg (int): Number of negative samples to use
    
    Notes: 
        If `output_file` already exists, skips training
    """
    if os.path.exists(output_file):
        print('File %s exists, skipping' % output_file)
        return
    args = {
        'input-file': data_file,
        'output-file': output_file,
        'dimensions': dim,
        'epochs': epochs,
        'learning-rate': 0.01,
        'num-negative': neg,
    }
    cmd = ['python', script_path]
    for option, value in args.items():
        cmd.append("--%s" % option)
        cmd.append(str(value))
    
    return check_output(args=cmd)

In [46]:
np_model_files = {}

In [None]:
# Train models with default params
model_name, files = train_model_with_params(default_params, wordnet_file, model_sizes, 'np_model', 'python')
np_model_files[model_name] = {}
for dim, filepath in files.items():
    np_model_files[model_name][dim] = filepath

Training model ./poincare/models/np_model_epochs_50_neg_20_dim_5


## 3. Loading the embeddings

In [12]:
embeddings = {}

In [55]:
import pickle
import re

from gensim.models.keyedvectors import KeyedVectors
import numpy as np
from pygtrie import Trie
from scipy.spatial.distance import euclidean, pdist
from smart_open import smart_open

def transform_cpp_embedding_to_kv(input_file, output_file, encoding='utf8'):
    """Given a C++ embedding tsv filepath, converts it to a KeyedVector-supported file"""
    with smart_open(input_file, 'rb') as f:
        lines = [line.decode(encoding) for line in f]
    if not len(lines):
         raise ValueError("file is empty")
    first_line = lines[0]
    parts = first_line.rstrip().split("\t")
    model_size = len(parts) - 1
    vocab_size = len(lines)
    with open(output_file, 'w') as f:
        f.write('%d %d\n' % (vocab_size, model_size))
        for line in lines:
            f.write(line.replace('\t', ' '))

def transform_numpy_embedding_to_kv(input_file, output_file, encoding='utf8'):
    """Given a numpy poincare embedding pkl filepath, converts it to a KeyedVector-supported file"""
    np_embeddings = pickle.load(open(input_file, 'rb'))
    random_embedding = np_embeddings[list(np_embeddings.keys())[0]]
    
    model_size = random_embedding.shape[0]
    vocab_size = len(np_embeddings)
    with open(output_file, 'w') as f:
        f.write('%d %d\n' % (vocab_size, model_size))
        for key, vector in np_embeddings.items():
            vector_string = ' '.join('%.6f' % value for value in vector)
            f.write('%s %s\n' % (key, vector_string))
        
class PoincareEmbedding(object):
    """Load and perform distance operations on poincare embedding"""

    def __init__(self, keyed_vectors):
        """Initialize PoincareEmbedding via a KeyedVectors instance"""
        self.kv = keyed_vectors
        self.init_key_trie()
        
    def init_key_trie(self):
        """Setup trie containing vocab keys for quick prefix lookups"""
        self.key_trie = Trie()
        for key in self.kv.vocab:
            self.key_trie[key] = True
    
    @staticmethod
    def poincare_dist(vector_1, vector_2):
        """Return poincare distance between two vectors"""
        norm_1 = np.linalg.norm(vector_1)
        norm_2 = np.linalg.norm(vector_2)
        euclidean_dist = euclidean(vector_1, vector_2)
        return np.arccosh(
            1 + 2 * (
                (euclidean_dist ** 2) / ((1 - norm_1 ** 2) * (1 - norm_2 ** 2))
            )
        )
        
    @classmethod
    def load_poincare_cpp(cls, input_filename):
        """Load embedding trained via C++ Poincare model

        Args:
            filepath (str): Path to tsv file containing embedding

        Returns:
            PoincareEmbedding instance

        """
        keyed_vectors_filename = input_filename + '.kv'
        transform_cpp_embedding_to_kv(input_filename, keyed_vectors_filename)
        keyed_vectors = KeyedVectors.load_word2vec_format(keyed_vectors_filename)
        os.unlink(keyed_vectors_filename)
        return cls(keyed_vectors)

    @classmethod
    def load_poincare_numpy(cls, input_filename):
        """Load embedding trained via Python numpy Poincare model

        Args:
            filepath (str): Path to pkl file containing embedding

        Returns:
            PoincareEmbedding instance

        """
        keyed_vectors_filename = input_filename + '.kv'
        transform_numpy_embedding_to_kv(input_filename, keyed_vectors_filename)
        keyed_vectors = KeyedVectors.load_word2vec_format(keyed_vectors_filename)
        os.unlink(keyed_vectors_filename)
        return cls(keyed_vectors)
    
    def find_matching_keys(self, word):
        """Find all senses of given word in embedding vocabulary"""
        matches = self.key_trie.items('%s.' % word)
        matching_keys = [''.join(key_chars) for key_chars, value in matches]
        return matching_keys

    def get_vector(self, term):
        """Return vector for given term"""
        return self.kv.word_vec(term)
        
    def get_all_distances(self, term):
        """Return distances to all terms for given term, including itself"""
        term_vector = self.kv.word_vec(term)
        all_vectors = self.kv.syn0
        
        euclidean_dists = np.linalg.norm(term_vector - all_vectors, axis=1)
        norm = np.linalg.norm(term_vector)
        all_norms = np.linalg.norm(all_vectors, axis=1)
        return np.arccosh(
            1 + 2 * (
                (euclidean_dists ** 2) / ((1 - norm ** 2) * (1 - all_norms ** 2))
            )
        )
        
    def get_distance(self, term_1, term_2):
        """Returns distance between vectors for input terms

        Args:
            term_1 (str)
            term_2 (str)

        Returns:
            Poincare distance between the two terms (float)
        
        Note:
            Raises KeyError if either term_1 or term_2 is absent from vocabulary

        """
        vector_1, vector_2 = self.kv[term_1], self.kv[term_2]
        return self.poincare_dist(vector_1, vector_2)

### 3.1 C++ embeddings

In [14]:
for model_name, models in model_files.items():
    embeddings[model_name] = {}
    for model_size, model_file in models.items():
        embeddings[model_name][model_size] = PoincareEmbedding.load_poincare_cpp(model_file)

### 3.2 Numpy embeddings

In [56]:
for model_name, models in np_model_files.items():
    embeddings[model_name] = {}
    for model_size, model_file in models.items():
        embeddings[model_name][model_size] = PoincareEmbedding.load_poincare_numpy(model_file)

## 4. Evaluation

In [16]:
from prettytable import PrettyTable

def display_results(task_name, results):
    """Display evaluation results of multiple embeddings on a single task in a tabular format
    
    Args:
        task_name (str): name the task being evaluated
        results (dict): mapping between embeddings and corresponding results
    
    """
    data = PrettyTable()
    data.field_names = ["Model Description", "Metric"] + [str(dim) for dim in sorted(model_sizes)]
    for model_name, model_results in results.items():
        metrics = [metric for metric in model_results.keys()]
        dims = sorted([dim for dim in model_results[metrics[0]].keys()])
        row = [model_name, '\n'.join(metrics) + '\n']
        for dim in dims:
            scores = ['%.2f' % model_results[metric][dim] for metric in metrics]
            row.append('\n'.join(scores))
        data.add_row(row)
    data.align = 'r'
    data_cols = data.get_string().split('\n')[0].split('+')[1:-1]
    col_lengths = [len(col) for col in data_cols]
    header_col_1_length = col_lengths[0] + col_lengths[1] - 1
    header_col_2_length = sum(col_lengths[2:]) + len(col_lengths[2:-1]) - 2
    
    header_col_2_content = "Model Dimensions"
    header_col_2_left_margin = (header_col_2_length - len(header_col_2_content)) // 2
    header_col_2_right_margin = header_col_2_length - len(header_col_2_content) - header_col_2_left_margin
    header_col_2_string = "%s%s%s" % (
        " " * header_col_2_left_margin, header_col_2_content, " " * header_col_2_right_margin)
    header = PrettyTable()
    header.field_names = [" " * header_col_1_length, header_col_2_string]
    header_lines = header.get_string(start=0, end=0).split("\n")[:2]
    print('Results for %s task' % task_name)
    print("\n".join(header_lines))
    print(data)        

### 4.1 WordNet reconstruction

In [58]:
import csv
from collections import defaultdict, OrderedDict
import itertools


class ReconstructionEvaluation(object):
    """Evaluating reconstruction on given network for given embedding"""
    def __init__(self, filepath, embedding):
        """Initialize evaluation instance with tsv file containing relation pairs and embedding to be evaluated
        
        Args:
            filepath (str): path to tsv file containing relation pairs
            embedding (PoincareEmbedding instance): embedding to be evaluated
        
        Returns
            ReconstructionEvaluation instance

        """
        items = set()
        embedding_vocab = embedding.kv.vocab
        relations = defaultdict(set)
        with smart_open(filepath, 'r') as f:
            reader = csv.reader(f, delimiter='\t')
            for row in reader:
                assert len(row) == 2, 'Hypernym pair has more than two items'
                item_1_index = embedding_vocab[row[0]].index
                item_2_index = embedding_vocab[row[1]].index
                relations[item_1_index].add(item_2_index)
                items.update([item_1_index, item_2_index])
        self.items = items
        self.relations = relations
        self.embedding = embedding
    
    
    @staticmethod
    def get_positive_relation_ranks_and_avg_prec(all_distances, positive_relations):
        """
        Given a numpy array of all distances from an item and indices of its positive relations,
        compute ranks and Average Precision of positive relations
        
        Args:
            distances (numpy float array): np array of all distances for a specific item
            positive_relations (list): list of indices of positive relations for the item
        
        Returns:
            tuple of (ranks, avg_precision)
            `ranks` is a list of ranks (int) of positive relations in the same order as `positive_relations`
            `avg_precision` is a float representing the Average Precision of the ranking
        """
        positive_relation_distances = all_distances[positive_relations]
        negative_relation_distances = np.ma.array(all_distances, mask=False)
        negative_relation_distances.mask[positive_relations] = True
        # Compute how many negative relation distances are less than each positive relation distance, plus 1 for rank
        ranks = (negative_relation_distances < positive_relation_distances[:, np.newaxis]).sum(axis=1) + 1
        map_ranks = np.sort(ranks) + np.arange(len(ranks))
        avg_precision = ((np.arange(1, len(map_ranks) + 1) / np.sort(map_ranks)).mean())
        return list(ranks), avg_precision
    
    def evaluate(self, max_n=None):
        """Evaluate all defined metrics for the reconstruction task
            
        Args:
            max_n (int or None): Maximum number of positive relations to evaluate, all if max_n is None
        
        Returns:
            dict containing (metric_name, metric_value) pairs
            e.g. {'mean_rank': 50.3, 'MAP': 0.31}

        """
        mean_rank, map_ = self.evaluate_mean_rank_and_map(max_n)
        return {'mean_rank': mean_rank, 'MAP': map_}

    def evaluate_mean_rank_and_map(self, max_n=None):
        """Evaluate mean rank and MAP for reconstruction
            
        Args:
            max_n (int or None): Maximum number of positive relations to evaluate, all if max_n is None
        
        Returns:
            tuple of (mean_rank, MAP)

        """
        ranks = []
        avg_precision_scores = []
        for i, item in enumerate(self.items, start=1):
            if item not in self.relations:
                continue
            item_relations = list(self.relations[item])
            item_term = self.embedding.kv.index2word[item]
            item_distances = self.embedding.get_all_distances(item_term)
            positive_relation_ranks, avg_precision = self.get_positive_relation_ranks_and_avg_prec(item_distances, item_relations)
            ranks += positive_relation_ranks
            avg_precision_scores.append(avg_precision)
            if max_n is not None and i > max_n:
                break
        return np.mean(ranks), np.mean(avg_precision_scores)
    

In [18]:
reconstruction_results = {}
metrics = ['mean_rank', 'MAP']

In [None]:
for model_name, models in embeddings.items():
    reconstruction_results[model_name] = OrderedDict()
    for metric in metrics:
        reconstruction_results[model_name][metric] = {}
    for model_size, embedding in models.items():
        print('Evaluating model %s of size %d' % (model_name, model_size))
        eval_instance = ReconstructionEvaluation(wordnet_file, embedding)
        eval_result = eval_instance.evaluate(max_n=1000)
        for metric in metrics:
            reconstruction_results[model_name][metric][model_size] = eval_result[metric]

In [61]:
display_results('WordNet Reconstruction', reconstruction_results)

Results for WordNet Reconstruction task
+-----------------------------------------------------------------------+---------------------------------------------------------+
|                                                                       |                    Model Dimensions                     |
+-----------------------------------------------------------+-----------+---------+---------+---------+---------+--------+--------+
|                                         Model Description |    Metric |       5 |      10 |      20 |      50 |    100 |    200 |
+-----------------------------------------------------------+-----------+---------+---------+---------+---------+--------+--------+
|  cpp_model_burn_in_0_epochs_50_eps_1e-06_neg_20_threads_8 | mean_rank |  265.72 |  116.94 |   90.81 |   59.47 |  55.14 |  54.31 |
|                                                           |       MAP |    0.28 |    0.41 |    0.49 |    0.56 |   0.58 |   0.59 |
|                                   

### 4.2 WordNet link prediction

#### 4.2.1 Preparing data

In [17]:
import random

def train_test_split(data_file, test_ratio=0.1):
    """Creates train and test files from given data file, returns train/test file names
    
    Args:
        data_file (str): path to data file for which train/test split is to be created
        test_ratio (float): fraction of lines to be used for test data
    
    Returns
        (train_file, test_file): tuple of strings with train file and test file paths
    """
    root_nodes, leaf_nodes = get_root_and_leaf_nodes(data_file)
    test_line_candidates = []
    line_count = 0
    all_nodes = set()
    with open(data_file, 'rb') as f:
        for i, line in enumerate(f):
            node_1, node_2 = line.split()
            all_nodes.update([node_1, node_2])
            if (
                    node_1 not in leaf_nodes
                    and node_2 not in leaf_nodes
                    and node_1 not in root_nodes
                    and node_2 not in root_nodes
                    and node_1 != node_2
                ):
                test_line_candidates.append(i)
            line_count += 1

    num_test_lines = int(test_ratio * line_count)
    if num_test_lines > len(test_line_candidates):
        raise ValueError('Not enough candidate relations for test set')
    print('Choosing %d test lines from %d candidates' % (num_test_lines, len(test_line_candidates)))
    test_line_indices = set(random.sample(test_line_candidates, num_test_lines))
    train_line_indices = set(l for l in range(line_count) if l not in test_line_indices)
    
    train_filename = data_file + '.train'
    test_filename = data_file + '.test'
    train_set_nodes = set()
    with open(data_file, 'rb') as f:
        train_file = open(train_filename, 'wb')
        test_file = open(test_filename, 'wb')
        for i, line in enumerate(f):
            if i in train_line_indices:
                train_set_nodes.update(line.split())
                train_file.write(line)
            elif i in test_line_indices:
                test_file.write(line)
            else:
                raise AssertionError('Line %d not present in either train or test line indices' % i)
        train_file.close()
        test_file.close()
    assert len(train_set_nodes) == len(all_nodes), 'Not all nodes from dataset present in train set relations'
    return (train_filename, test_filename)

In [16]:
def get_root_and_leaf_nodes(data_file):
    """Return keys of root and leaf nodes from a file with transitive closure relations
    
    Args:
        data_file(str): file path containing transitive closure relations
    
    Returns:
        (root_nodes, leaf_nodes) - tuple containing keys of root and leaf nodes
    """
    root_candidates = set()
    leaf_candidates = set()
    with open(data_file, 'rb') as f:
        for line in f:
            nodes = line.split()
            root_candidates.update(nodes)
            leaf_candidates.update(nodes)
    
    with open(data_file, 'rb') as f:
        for line in f:
            node_1, node_2 = line.split()
            if node_1 == node_2:
                continue
            leaf_candidates.discard(node_1)
            root_candidates.discard(node_2)
    
    return (leaf_candidates, root_candidates)

In [28]:
wordnet_train_file, wordnet_test_file = train_test_split(wordnet_file)

Choosing 74324 test lines from 109577 candidates


#### 4.2.2 Training and loading models

In [19]:
# Training models for link prediction
lp_model_files = {}

In [None]:
# Train models with default params
model_name, files = train_model_with_params(default_params, wordnet_train_file, model_sizes, 'cpp_lp_model', 'c++')
lp_model_files[model_name] = {}
for dim, filepath in files.items():
    lp_model_files[model_name][dim] = filepath
# Train models with non-default params
for param, values in non_default_params.items():
    params = default_params.copy()
    for value in values:
        params[param] = value
        model_name, files = train_model_with_params(params, wordnet_train_file, model_sizes, 'cpp_lp_model', 'c++')
        lp_model_files[model_name] = {}
        for dim, filepath in files.items():
            lp_model_files[model_name][dim] = filepath

In [21]:
lp_embeddings = {}

In [22]:
for model_name, models in lp_model_files.items():
    lp_embeddings[model_name] = {}
    for model_size, model_file in models.items():
        lp_embeddings[model_name][model_size] = PoincareEmbedding.load_poincare_cpp(model_file)

#### 4.2.3 Evaluating models

In [59]:
class LinkPredictionEvaluation(object):
    """Evaluating reconstruction on given network for given embedding"""
    def __init__(self, train_path, test_path, embedding):
        """Initialize evaluation instance with tsv file containing relation pairs and embedding to be evaluated
        
        Args:
            train_path (str): path to tsv file containing relation pairs used for training
            test_path (str): path to tsv file containing relation pairs to evaluate
            embedding (PoincareEmbedding instance): embedding to be evaluated
        
        Returns
            LinkPredictionEvaluation instance

        """
        items = set()
        embedding_vocab = embedding.kv.vocab
        relations = {'known': defaultdict(set), 'unknown': defaultdict(set)}
        data_files = {'known': train_path, 'unknown': test_path}
        for relation_type, data_file in data_files.items():
            with smart_open(data_file, 'r') as f:
                reader = csv.reader(f, delimiter='\t')
                for row in reader:
                    assert len(row) == 2, 'Hypernym pair has more than two items'
                    item_1_index = embedding_vocab[row[0]].index
                    item_2_index = embedding_vocab[row[1]].index
                    relations[relation_type][item_1_index].add(item_2_index)
                    items.update([item_1_index, item_2_index])
        self.items = items
        self.relations = relations
        self.embedding = embedding
    
    
    @staticmethod
    def get_unknown_relation_ranks_and_avg_prec(all_distances, unknown_relations, known_relations):
        """
        Given a numpy array of distances and indices of known and unknown positive relations,
        compute ranks and Average Precision of unknown positive relations
        
        Args:
            all_distances (numpy float array): np array of all distances for a specific item
            unknown_relations (list): list of indices of unknown positive relations
            known_relations (list): list of indices of known positive relations
            
        Returns:
            tuple of (ranks, avg_precision)
            `ranks` is a list of ranks (int) of unknown relations in the same order as `unknown_relations`
            `avg_precision` is a float representing the Average Precision of the ranking
        """
        unknown_relation_distances = all_distances[unknown_relations]
        negative_relation_distances = np.ma.array(all_distances, mask=False)
        negative_relation_distances.mask[unknown_relations] = True
        negative_relation_distances.mask[known_relations] = True
        # Compute how many negative relation distances are less than each unknown relation distance, plus 1 for rank
        ranks = (negative_relation_distances < unknown_relation_distances[:, np.newaxis]).sum(axis=1) + 1
        map_ranks = np.sort(ranks) + np.arange(len(ranks))
        avg_precision = ((np.arange(1, len(map_ranks) + 1) / np.sort(map_ranks)).mean())
        return list(ranks), avg_precision
    
    def evaluate(self, max_n=None):
        """Evaluate all defined metrics for the reconstruction task
            
        Args:
            max_n (int or None): Maximum number of positive relations to evaluate, all if max_n is None
        
        Returns:
            dict containing (metric_name, metric_value) pairs
            e.g. {'mean_rank': 50.3, 'MAP': 0.31}

        """
        mean_rank, map_ = self.evaluate_mean_rank_and_map(max_n)
        return {'mean_rank': mean_rank, 'MAP': map_}

    def evaluate_mean_rank_and_map(self, max_n=None):
        """Evaluate mean rank and MAP for reconstruction
            
        Args:
            max_n (int or None): Maximum number of positive relations to evaluate, all if max_n is None
        
        Returns:
            tuple of (mean_rank, MAP)

        """
        ranks = []
        avg_precision_scores = []
        for i, item in enumerate(self.items, start=1):
            if item not in self.relations['unknown']:  # No positive relations to predict for this node
                continue
            unknown_relations = list(self.relations['unknown'][item])
            known_relations = list(self.relations['known'][item])
            item_term = self.embedding.kv.index2word[item]
            item_distances = self.embedding.get_all_distances(item_term)
            unknown_relation_ranks, avg_precision = self.get_unknown_relation_ranks_and_avg_prec(item_distances, unknown_relations, known_relations)
            ranks += unknown_relation_ranks
            avg_precision_scores.append(avg_precision)
            if max_n is not None and i > max_n:
                break
        return np.mean(ranks), np.mean(avg_precision_scores)


In [60]:
lp_results = {}
metrics = ['mean_rank', 'MAP']

In [None]:
for model_name, models in lp_embeddings.items():
    lp_results[model_name] = OrderedDict()
    for metric in metrics:
        lp_results[model_name][metric] = {}
    for model_size, embedding in models.items():
        print('Evaluating model %s of size %d' % (model_name, model_size))
        eval_instance = LinkPredictionEvaluation(wordnet_train_file, wordnet_test_file, embedding)
        eval_result = eval_instance.evaluate(max_n=1000)
        for metric in metrics:
            lp_results[model_name][metric][model_size] = eval_result[metric]

In [89]:
display_results('WordNet Link Prediction', lp_results)

Results for WordNet Link Prediction task
+--------------------------------------------------------------------------+-----------------------------------------------------+
|                                                                          |                  Model Dimensions                   |
+--------------------------------------------------------------+-----------+--------+--------+--------+--------+--------+--------+
|                                            Model Description |    Metric |      5 |     10 |     20 |     50 |    100 |    200 |
+--------------------------------------------------------------+-----------+--------+--------+--------+--------+--------+--------+
|  cpp_lp_model_burn_in_0_epochs_50_eps_1e-05_neg_20_threads_8 | mean_rank | 182.03 | 107.04 |  63.29 |  72.67 |  73.64 |  60.35 |
|                                                              |       MAP |   0.16 |   0.25 |   0.31 |   0.34 |   0.36 |   0.37 |
|                                         

### 4.3 HyperLex Lexical Entailment

In [62]:
from scipy.stats import spearmanr

class LexicalEntailmentEvaluation(object):
    """Evaluating reconstruction on given network for any embedding"""
    def __init__(self, filepath):
        """Initialize evaluation instance with HyperLex text file containing relation pairs
        
        Args:
            filepath (str): path to HyperLex text file
        
        Returns
            LexicalEntailmentEvaluation instance

        """
        expected_scores = {}
        with smart_open(filepath, 'r') as f:
            reader = csv.DictReader(f, delimiter=' ')
            for row in reader:
                word_1, word_2 = row['WORD1'], row['WORD2']
                expected_scores[(word_1, word_2)] = float(row['AVG_SCORE'])
        self.scores = expected_scores
        self.alpha = 1000
    
    def score_function(self, embedding, word_1, word_2):
        """Given an embedding and two terms, return the predicted score for them (extent to which term_1 is a type of term_2)"""
        try:
            word_1_terms = embedding.find_matching_keys(word_1)
            word_2_terms = embedding.find_matching_keys(word_2)
        except KeyError:
            raise ValueError("No matching terms found for either %s or %s" % (word_1, word_2))
        min_distance = np.inf
        min_term_1, min_term_2 = None, None
        for term_1 in word_1_terms:
            for term_2 in word_2_terms:
                distance = embedding.get_distance(term_1, term_2)
                if distance < min_distance:
                    min_term_1, min_term_2 = term_1, term_2
                    min_distance = distance
        assert min_term_1 is not None and min_term_2 is not None
        vector_1, vector_2 = embedding.get_vector(min_term_1), embedding.get_vector(min_term_2)
        norm_1, norm_2 = np.linalg.norm(vector_1), np.linalg.norm(vector_2)
        return -1 * (1 + self.alpha * (norm_2 - norm_1)) * distance
        
    def evaluate_spearman(self, embedding):
        """Evaluate spearman scores for lexical entailment for given embedding
            
        Args:
            embedding (PoincareEmbedding instance): embedding for which evaluation is to be done
        
        Returns:
            spearman correlation score (float)

        """
        predicted_scores = []
        expected_scores = []
        skipped = 0
        count = 0
        for (word_1, word_2), expected_score in self.scores.items():
            try:
                predicted_score = self.score_function(embedding, word_1, word_2)
            except ValueError:
                skipped += 1
                continue
            count += 1
            predicted_scores.append(predicted_score)
            expected_scores.append(expected_score)
        print('Skipped pairs: %d out of %d' % (skipped, len(self.scores)))
        spearman = spearmanr(expected_scores, predicted_scores)
        return spearman.correlation


In [66]:
entailment_results = {}
eval_instance = LexicalEntailmentEvaluation(hyperlex_file)

In [None]:
for model_name, models in embeddings.items():
    entailment_results[model_name] = {}
    entailment_results[model_name]['spearman'] = {}
    for model_size, embedding in models.items():
        print('Evaluating model %s of size %d' % (model_name, model_size))
        entailment_results[model_name]['spearman'][model_size] = eval_instance.evaluate_spearman(embedding)

In [68]:
display_results('Lexical Entailment (HyperLex)', entailment_results)

Results for Lexical Entailment (HyperLex) task
+----------------------------------------------------------------------+-----------------------------------------+
|                                                                      |            Model Dimensions             |
+-----------------------------------------------------------+----------+------+------+------+------+------+------+
|                                         Model Description |   Metric |    5 |   10 |   20 |   50 |  100 |  200 |
+-----------------------------------------------------------+----------+------+------+------+------+------+------+
|  cpp_model_burn_in_0_epochs_50_eps_1e-06_neg_20_threads_8 | spearman | 0.46 | 0.45 | 0.48 | 0.45 | 0.47 | 0.46 |
|                                                           |          |      |      |      |      |      |      |
| cpp_model_burn_in_0_epochs_100_eps_1e-06_neg_20_threads_8 | spearman | 0.46 | 0.45 | 0.46 | 0.47 | 0.46 | 0.47 |
|                                

### 4.4 Link Prediction for collaboration networks


In [68]:
# TODO - quite tricky, since the loss function used for training the model on this network is different
# Will require changes to how gradients are calculated in C++ code