# Lexicon Expansion

## Traveling through word vector spaces

This notebook provides tools to explore and interact with word embedding models. It allows you to create a lexicon (a list of conceptually related words) by traversing a vector space.
The notebook provides three methods for expansion:
- Unidirectional expansion
- Contrastive expansion
- Active-learning based expansion

See documentation below and [README](https://github.com/kasparvonbeelen/WordEmbeddingPlayground/tree/master/code/LexiconExpansion/README.md) for more information.

In [7]:
%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [8]:
%autoreload 2

In [9]:
from gensim.models.word2vec import Word2Vec
from modAL.uncertainty import entropy_sampling,margin_sampling, uncertainty_sampling
from modAL.models import ActiveLearner
from sklearn.svm import SVC
from scipy.spatial.distance import cosine, euclidean
from sklearn.model_selection import train_test_split
from utils import *
from sklearn.metrics import f1_score
from collections import defaultdict
import logging
import sys
import pickle
import random
import requests
import datetime
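import pprint
import numpy as np   # used below as np.array, np.delete, np.random
import pandas as pd  # used below to plot the F1 scores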
import seaborn as sns
sns.set()
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


In [10]:
from IPython.display import display, HTML
from ipyannotate import annotate
from ipyannotate.buttons import (
    ValueButton as Button,
    NextButton as Next,
    BackButton as Back
)

def show_html(word,verbose='link'):
    """Display `word` in large type, optionally with dictionary context.

    verbose='link' adds Wiktionary and Wikipedia links, verbose='insert' embeds
    the raw Wiktionary page, and any other value shows the word only.
    """
    link_template = '<a size="5" color="black" target="_blank" style="font-family:courier" href="{}">{}</a>'
    url = 'https://en.wiktionary.org/wiki/{}'.format(word)
    wiki_url = 'https://en.wikipedia.org/w/index.php?sort=relevance&search={}'
    
    if verbose=='insert':
        response = requests.get(url)
        description = response.content.decode("utf-8")
    elif verbose=='link':
        wiktionary = link_template.format(url,"Wiktionary")
        # hyphenated tokens are searched as separate words
        wikipedia = link_template.format(wiki_url.format('+'.join(word.split('-'))),"Wikipedia")
        description = f"{wiktionary}&nbsp;&nbsp;{wikipedia}"
    else:
        description = ''
        
    # <br> is the valid line-break tag (</br> is not)
    return display(HTML('<br><font size="6" color="black" style="font-family:georgia;">"{0}"</font><br><br>{1}'.format(word,description)))

## Load model

Change the `path` variable below to load a specific Word2Vec model.

In [11]:
# for large models it works only with numpy 1.17.0
# https://www.pythonanywhere.com/forums/topic/14613/
# pip3 install numpy==1.17.0
path = "/kbdata/Processed/Models/1890-1910-Katholiek.w2v.model"
model = Word2Vec.load(path)

2020-10-28 09:00:33,266 : INFO : loading Word2Vec object from /kbdata/Processed/Models/1890-1910-Katholiek.w2v.model
2020-10-28 09:00:33,961 : INFO : loading wv recursively from /kbdata/Processed/Models/1890-1910-Katholiek.w2v.model.wv.* with mmap=None
2020-10-28 09:00:33,962 : INFO : loading vectors from /kbdata/Processed/Models/1890-1910-Katholiek.w2v.model.wv.vectors.npy with mmap=None
2020-10-28 09:00:34,091 : INFO : setting ignored attribute vectors_norm to None
2020-10-28 09:00:34,093 : INFO : loading vocabulary recursively from /kbdata/Processed/Models/1890-1910-Katholiek.w2v.model.vocabulary.* with mmap=None
2020-10-28 09:00:34,093 : INFO : loading trainables recursively from /kbdata/Processed/Models/1890-1910-Katholiek.w2v.model.trainables.* with mmap=None
2020-10-28 09:00:34,094 : INFO : loading syn1neg from /kbdata/Processed/Models/1890-1910-Katholiek.w2v.model.trainables.syn1neg.npy with mmap=None
2020-10-28 09:00:34,222 : INFO : setting ignored attribute cum_table to None


## 1. Unidirectional Lexicon Expansion

## 1.1 Select Seed Words

Select the seed words; these provide the starting point of the expansion. More precisely, the cell below averages the vector representations of the seed words and saves the result as `core_init`, the vector whose neighbourhood we will explore.

In [12]:
core = {'vrouw','moeder'}

core_init = average_vector(core.copy(),model)

seen = core.copy()
peripheral = set()
    
rounds = 0
log = defaultdict(dict)
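
The averaging step can be illustrated with a minimal sketch. The implementation of `average_vector` in `utils` is not shown in this notebook, so the function below (`average_vector_sketch`, a hypothetical name) only assumes that it takes the element-wise mean of the seed vectors. The toy vectors are invented for illustration.

```python
import numpy as np

def average_vector_sketch(words, vectors):
    """Return the element-wise mean of the vectors of `words`.

    `vectors` is any mapping from word to 1-D array (a plain dict here;
    with gensim this role would be played by `model.wv`).
    Words missing from the mapping are skipped (an assumption of this sketch).
    """
    found = [np.asarray(vectors[w], dtype=float) for w in words if w in vectors]
    if not found:
        raise ValueError("none of the seed words are in the vocabulary")
    return np.mean(found, axis=0)

# Toy 3-dimensional "embeddings", invented for illustration only.
toy_vectors = {
    "vrouw": [1.0, 0.0, 2.0],
    "moeder": [3.0, 2.0, 0.0],
}

q_v = average_vector_sketch({"vrouw", "moeder", "onbekend"}, toy_vectors)
print(q_v)  # [2. 1. 1.]
```

The out-of-vocabulary word `onbekend` is ignored, so the result is the mean of the two known vectors.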

## 1.2 Annotation Cycle

### 1.2.1 Select Sampling Strategy
<img src='img/average_vector_st_2.png' width="500" align="left"/>

Lexicon Expansion provides a handful of functions to explore vector spaces and create word lists that track an underlying concept. Given a word list `L` at time `t`, we expand it by averaging the vector representations of the words in `L` (i.e. combining them into one vector) and using the result as a query vector `q_v`. The functions listed below allow you to navigate the area around `q_v`.

Select the sampling procedure; this determines how the vector space is navigated. The options are:

- `average`: sample the words closest to `q_v`;
- `query_tokens`: add selected tokens to `q_v` and sample the closest neighbours;
- `entropy`: sample neighbours of `q_v` based on the entropy scores entropy(`q_v`,`neighbour_n`);
- `distance`: select the neighbours with the highest distance to `q_v` (according to some distance metric, such as cosine distance).

Uncomment the code below if you want to change the sampling procedure.


The figure above visualizes the `average` sampling strategy. We start with two vectors ("machine" and "engine"), combine them into one vector by averaging them, query the neighbourhood of this averaged vector, and add "spinning-jenny" and "power-loom", after which we can repeat the whole procedure.
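
One step of the `average` strategy can be sketched as follows. This is an illustration under stated assumptions, not the code in `utils`: `nearest_to_query` is a hypothetical helper, similarity is assumed to be cosine similarity, and the 2-dimensional vectors are invented. With a gensim model, `model.wv.similar_by_vector` plays a comparable role.

```python
import numpy as np

def nearest_to_query(q_v, vectors, exclude=(), topn=2):
    """Rank words by cosine similarity to `q_v`, skipping words in `exclude`."""
    q = np.asarray(q_v, dtype=float)
    scores = {}
    for word, vec in vectors.items():
        if word in exclude:
            continue
        v = np.asarray(vec, dtype=float)
        scores[word] = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Invented 2-dimensional vectors, for illustration only.
toy = {
    "machine": [1.0, 0.0],
    "engine": [0.8, 0.6],
    "power-loom": [0.9, 0.3],
    "spinning-jenny": [0.6, 0.5],
    "cathedral": [-1.0, 0.2],
}

lexicon = {"machine", "engine"}
q_v = np.mean([toy[w] for w in lexicon], axis=0)   # the averaged query vector
neighbours = nearest_to_query(q_v, toy, exclude=lexicon, topn=2)
print([word for word, _ in neighbours])  # ['power-loom', 'spinning-jenny']
```

Adding the accepted neighbours to `lexicon` and recomputing `q_v` gives the next step of the expansion.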

In [13]:
sampling_procedure = sampling_options['average']
sampling_procedure

{'method': <function utils.average_all(words, model)>}

In [9]:
# sampling_procedure = sampling_options['query_tokens']
# sampling_procedure['args']['tokens'] = list(core)
# sampling_procedure['args']['merge'] = False
# sampling_procedure

In [10]:
# sampling_procedure =  sampling_options['entropy']
# sampling_procedure["args"]['init_vec'] = core_init
# sampling_procedure

### 1.2.2  Annotate Words

By running the cell below, you can annotate words. `Core` words will be used to update the query vector `q_v`; `Peripheral` words will be saved but not used for expansion. You can use the latter, for example, to keep track of words with OCR errors without allowing them to contaminate the word list.

In [11]:
log = update_log(log,rounds,seen,core,peripheral,sampling_procedure)

neighbours = expand_lexicon(core,model,**sampling_procedure)
neighbours = topn_new(neighbours,seen,topn=5)                     

buttons = [Button('Core',color='green'),
           Button('Peripheral',color='blue'),
           Button('Ignore',color='red'), Back(), Next()]   

annotations = annotate(list(neighbours.keys()), buttons=buttons, display = show_html)
annotations

2020-10-27 14:22:00,753 : INFO : precomputing L2-norms of word weight vectors


Using "average_all" as sample method.


Annotation(canvas=OutputCanvas(), progress=Progress(atoms=[<ipyannotate.progress.Atom object at 0x7fd06318c390…

### 1.2.3 Add Annotations to Lexicon

After each round of annotations, add the selected words (`Core` and `Peripheral`) to the lexicon.

In [None]:
core.update([t.output for t in annotations.tasks if t.value=='Core'])
peripheral.update([t.output for t in annotations.tasks if t.value=='Peripheral'])
seen.update([t.output for t in annotations.tasks])
rounds+=1
log = update_log(log,rounds,seen,core,peripheral,sampling_procedure)
print('Core Lexicon contains {0} tokens at stage {1}.\n'.format(len(core), rounds))  
print(', '.join(core))
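
The bookkeeping of one annotation round can be followed without the widget. In this minimal sketch, `top_unseen` is a hypothetical stand-in for `topn_new` (whose implementation lives in `utils` and may differ), the scores are invented, and a hard-coded `labels` dict replaces the button clicks.

```python
def top_unseen(neighbours, seen, topn=5):
    """Keep the `topn` highest-scoring words that are not in `seen`."""
    fresh = {w: s for w, s in neighbours.items() if w not in seen}
    ranked = sorted(fresh.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:topn])

# Demo state, prefixed to avoid overwriting the notebook's own variables.
demo_core = {"vrouw", "moeder"}
demo_peripheral = set()
demo_seen = set(demo_core)

# Scored neighbours as a plain dict (invented scores).
demo_neighbours = {"moeder": 0.95, "dochter": 0.90, "zuster": 0.85, "vrouvv": 0.80}
candidates = top_unseen(demo_neighbours, demo_seen, topn=3)

# Stand-in for the labels clicked in the annotation widget.
labels = {"dochter": "Core", "zuster": "Core", "vrouvv": "Peripheral"}

demo_core.update(w for w in candidates if labels.get(w) == "Core")
demo_peripheral.update(w for w in candidates if labels.get(w) == "Peripheral")
demo_seen.update(candidates)

print(sorted(demo_core))        # ['dochter', 'moeder', 'vrouw', 'zuster']
print(sorted(demo_peripheral))  # ['vrouvv']
```

Because every candidate ends up in `demo_seen`, none of them is offered again in a later round, whatever label it received.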

## 1.3 Inspect Results

At each stage of the annotation, you can inspect your lexicon visually. Running the cell below will project all words onto a 2D plane. The red dots represent the `Core` vocabulary, whereas the green dots show the surrounding tokens not yet in the lexicon. This can give you an idea of the words you are missing, and of the next destination for your travel through the word vector space.

In [None]:
#plot_travel_distance(log,model,core_init,method=np.mean)
plot_2d(log,model,figsize=(10,10),include_neighbours=True)
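
How `plot_2d` reduces the vectors to two dimensions is not shown in this notebook. As one possible approach, the sketch below assumes a PCA projection computed with NumPy's SVD; `project_2d` is a hypothetical helper and the embeddings are random placeholders.

```python
import numpy as np

def project_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    X = np.asarray(matrix, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 10))   # 6 words, 10 dimensions (random placeholders)
coords = project_2d(toy_embeddings)
print(coords.shape)  # (6, 2)
```

Each row of `coords` is an (x, y) position that could be passed to a scatter plot, with one colour for `Core` words and another for their neighbours.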

## 1.4. Save Annotations

After multiple rounds of annotation, you can save the log of your work for later analysis.

In [None]:
# format the timestamp once; colons are avoided because they are invalid in Windows filenames
ts = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
with open('logged_annotations_{}.pickle'.format(ts),'wb') as out_pickle:
    pickle.dump(log,out_pickle)
print('Annotations saved!')
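
A saved log can be read back with `pickle.load`. The sketch below writes a small stand-in log to a temporary directory and restores it. The per-round `'core'` key is an assumption about what `update_log` stores, and the file name is only an example.

```python
import pickle
import tempfile
from collections import defaultdict
from pathlib import Path

# Stand-in log: round number -> state of the lexicon (assumed structure).
demo_log = defaultdict(dict)
demo_log[0]["core"] = {"vrouw", "moeder"}
demo_log[1]["core"] = {"vrouw", "moeder", "dochter"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "logged_annotations_example.pickle"
    with open(path, "wb") as out_pickle:
        pickle.dump(demo_log, out_pickle)
    with open(path, "rb") as in_pickle:
        restored = pickle.load(in_pickle)

# Words added to the core lexicon in each round.
added = {r: restored[r]["core"] - restored[r - 1]["core"]
         for r in sorted(restored) if r > 0}
print(added)  # {1: {'dochter'}}
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.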

## 2. Bi-Directional Lexicon Expansion

<img src='img/projection.png' width="500" align="left">

## 2.2. Select Seed Words

Select the two opposite ends of the contrastive lexicons you want to create. In this example, `Core` and `Antipode` serve as two contrastive word lists, "feminine" and "masculine".

In [15]:
core = {'vrouw','vrouwen'}
antipode = {'man','mannen'}

In [None]:
periferal=None
seen=None

seen = core.copy().union(antipode)
periferal = set()
    
rounds = 0
log = defaultdict(dict)   

## 2.3 Annotation Cycle

### 2.3.1  Annotate Words

Assign sampled words to either the `Core` or the `Antipode` lexicon. This section consists of two rounds of annotation: the first focuses on `Core` words, the second on `Antipode` words.
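
One way to think about the two ends of the spectrum is as an axis running from the `Antipode` centroid to the `Core` centroid, onto which candidate words are projected. The implementation of `contrastive_expansion` is in `utils` and may work differently. The sketch below is only an assumption-based illustration, with a hypothetical `contrast_scores` helper and invented 2-dimensional vectors.

```python
import numpy as np

def contrast_scores(candidates, core_vecs, antipode_vecs):
    """Score candidates by their position along the antipode -> core axis."""
    axis = np.mean(core_vecs, axis=0) - np.mean(antipode_vecs, axis=0)
    axis = axis / np.linalg.norm(axis)          # unit-length contrast axis
    return {w: float(np.dot(v, axis)) for w, v in candidates.items()}

# Invented vectors, for illustration only.
core_vecs = np.array([[1.0, 0.2], [0.8, 0.0]])
antipode_vecs = np.array([[-1.0, 0.1], [-0.8, 0.1]])
candidates = {
    "dochter": np.array([0.9, 0.3]),
    "zoon": np.array([-0.9, 0.3]),
    "huis": np.array([0.0, 1.0]),
}

scores = contrast_scores(candidates, core_vecs, antipode_vecs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['dochter', 'huis', 'zoon']
```

Words at the top of `ranked` lean towards the `Core` end, words at the bottom towards the `Antipode` end, and words near zero are neutral with respect to the contrast.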

#### Explore the `Core` end of the spectrum

In [None]:
log[rounds]['timestamp'] = datetime.datetime.now()           
log[rounds]['seen'] = seen.copy(); log[rounds]['core'] = core.copy()
log[rounds]['periferal'] = periferal.copy(); log[rounds]['antipode'] = antipode.copy()

neighbours = contrastive_expansion(core,model,antipode,direction='core')
neighbours = topn_new(neighbours,seen,topn=5)    

buttons = [Button('Core',color='green'),Button('Antipode',color='green'),
           Button('Peripheral',color='blue'),Button('Ignore',color='red'),
           Back(), Next()]

annotations_core = annotate(list(neighbours.keys()), buttons=buttons, display = show_html)
annotations_core

#### Explore the `Antipode` end of the spectrum

In [None]:
neighbours = contrastive_expansion(core,model,antipode,direction='antipode')
neighbours = topn_new(neighbours,seen,reverse=False,topn=5)  

buttons = [Button('Antipode',color='green'),Button('Core',color='green'),
           Button('Peripheral',color='blue'),Button('Ignore',color='red'),
           Back(), Next()]

annotations_antipode = annotate(list(neighbours.keys()), buttons=buttons, display = show_html)
annotations_antipode

### 2.3.2. Add Annotations to Lexicons

Run these cells to update the lexicons and inspect the words harvested so far.

In [None]:
annotations = annotations_antipode.tasks + annotations_core.tasks
core.update([t.output for t in annotations if t.value=='Core'])
antipode.update([t.output for t in annotations if t.value=='Antipode'])
periferal.update([t.output for t in annotations if t.value=='Peripheral'])
seen.update([t.output for t in annotations])
rounds+=1
print('Core-lexicon at stage {0} contains {1} words.\nSize of Antipode-lexicon is {2} words'.format(rounds,len(core),len(antipode)))      


In [None]:
print('Core Lexicon at stage {}.\n'.format(rounds))
print(', '.join(core))
print('\n')
print('Antipode Lexicon at stage {}.\n'.format(rounds))
print(', '.join(antipode))

## 2.4. Save Annotations

Save the output for later analysis.

In [None]:
# format the timestamp once; colons are avoided because they are invalid in Windows filenames
ts = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
with open('logged_annotations_{}.pickle'.format(ts),'wb') as out_pickle:
    pickle.dump(log,out_pickle)

## 3. Active Learning


### 3.1 Define seed words

### 3.1.1 Collect annotated seed words

In [None]:
seed = 'machine'; neighbourhood = 1000

In [None]:

machine_neighbours = np.array([w for w,v in model.wv.most_similar(seed,topn=neighbourhood)])
seed_idx = np.random.choice(range(len(machine_neighbours)), size=50, replace=False)
seed_words = machine_neighbours[seed_idx]
print(seed_words)
buttons = [Button('Core',color='green'),Button('Antipode',color='red'),
           Button('Ignore',color='blue'),Back(), Next()]

annotations_init = annotate(seed_words, buttons=buttons, display = show_html)
annotations_init

### 3.1.2 Initialize learner

In [None]:
core_init = [t.output for t in annotations_init.tasks if t.value=='Core']
antipode_init = [t.output for t in annotations_init.tasks if t.value=='Antipode']
print(len(core_init),len(antipode_init))
seed_words_annotated = antipode_init + core_init
X_initial = np.array([model.wv[w] for w in seed_words_annotated])
y_initial = np.array([0]*len(antipode_init) + [1]*len(core_init))

In [None]:
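# query the neighbourhood once and reuse it for both the vectors and the words
pool_neighbours = model.wv.most_similar(seed,topn=neighbourhood)
X = np.array([model.wv[w] for w,v in pool_neighbours])
words = np.array([w for w,v in pool_neighbours])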

In [None]:
X_training, X_testing, y_training, y_testing = train_test_split(X_initial, y_initial, test_size=0.50, random_state=0) 
initial_idx = np.array([i for i,w in enumerate(words) if w in seed_words_annotated])
X_pool,y_pool = np.delete(X, initial_idx, axis=0),np.delete(words, initial_idx, axis=0)

In [None]:
# initializing the learner
learner = ActiveLearner(
    estimator=SVC(probability=True,kernel='linear'), # ,class_weight='balanced',C=10
    query_strategy=uncertainty_sampling,
    X_training=X_training, y_training=y_training
)
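
The learner above is configured with `uncertainty_sampling`. As a rough illustration of the idea (not modAL's code), the sketch below treats uncertainty as one minus the highest predicted class probability and picks the most uncertain pool items. `most_uncertain` is a hypothetical helper and the probabilities are invented.

```python
def most_uncertain(probabilities, n_instances=2):
    """Return the indices of the rows whose top class probability is lowest."""
    uncertainty = [1.0 - max(row) for row in probabilities]
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i], reverse=True)
    return order[:n_instances]

# Invented class probabilities [P(antipode), P(core)] for five pool words.
pool_probs = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.48, 0.52],
    [0.70, 0.30],
]

picked = most_uncertain(pool_probs, n_instances=2)
print(picked)  # [3, 1]
```

The words the classifier is least sure about (rows 3 and 1 here) are the ones sent to the annotator, which is why each round of labels tends to be informative.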

In [None]:
scores = []

### 3.2 Annotation cycle

In [None]:
query_idx, query_inst = learner.query(X_pool,10)
# despite its name, y_pool holds the words (not labels) of the unlabelled pool
to_label = [y_pool[qx] for qx in query_idx]

buttons = [Button('Core',color='green'),Button('Antipode',color='red'),
           Button('Ignore',color='blue'),Back(), Next()]

annotations = annotate(to_label, buttons=buttons, display=show_html)
annotations

In [None]:
X_pool,y_pool = np.delete(X_pool, query_idx, axis=0),np.delete(y_pool, query_idx, axis=0)
# words not marked 'Core' (including 'Ignore' and unlabelled ones) are treated as negatives
y_new = [{'Core':1, 'Antipode': 0}.get(a.value,0) for a in annotations.tasks]
learner.teach(query_inst,y_new)
y_pred = learner.predict(X_testing)
scores.append(f1_score(y_testing,y_pred))  # f1_score expects (y_true, y_pred)
print('Done updating model. Go to previous cell to annotate other examples.')
pd.Series(scores).plot()
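
The score tracked after each round is the binary F1 score, the harmonic mean of precision and recall for the `Core` class. A small hand-rolled version (the hypothetical `f1_binary`, with invented labels) shows what is being plotted:

```python
def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented labels: 1 = Core, 0 = Antipode.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

score = f1_binary(y_true, y_pred)
print(round(score, 3))  # 0.667
```

With only a handful of held-out seed words in `X_testing`, the curve can jump noticeably between rounds, so it is best read as a rough trend.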

### 3.3. Print results of trained model

In [None]:
probs= dict(zip(words,learner.predict_proba(X)[:,1]))
machine_words = sorted(probs.items(),key = lambda x : x[1], reverse=True)[:100]
#print('\n'.join([f'{e[0]},{round(e[1],2)}' for e in machine_words]))
print('\n'.join(['{0: <20}{1}'.format(e[0],round(e[1],2)) for e in machine_words]))
