In [None]:
# Distributional semantic models
## Some explora

# 1. Introduction

In this notebook, we will show you:

* How to load a distributional model in python (and where to download distributional models from)
* How to train your own Word2Vec model (simple version) and a pointer for training your own distributional model with control over hyperparameters and the option to compare unstable vectors (hyperwords code)
* How to get insights into the quality and content of the distributional word representations using:
    
    (a) Simple cosine similarity operations
    
    (b) Standard evaluations
    
    (c) Clustering

In addition, the notebook contains a small exercise for getting started on Dutch data.
    

**About this notebook:**

This notebook uses Python 3.6. It is recommended to run it using Anaconda (which includes most packages used here). 

Even though you can deal with embeddings using commonly used libraries such as numpy, the Gensim library is a very easy way of getting a first impression: https://radimrehurek.com/gensim/models/keyedvectors.html

I recommend installing it via pip:

`pip install gensim`

(Alternatively, you can use 'commandline magic' to install packages directly from the notebook as shown in the cell below.) 

In addition, this notebook uses:

* NLTK (Natural language processing toolkit)
* Spacy (optional)
* Numpy 
* Scikit learn

In [None]:
#You can use pip install from within the notebook like this:

%pip install gensim

# 2. Downloading or creating a distributional semantic model 



## 2.a Links for downloading existing models

Follow the links to browse available models. The sources listed below contain English models trained using different algorithms, data with different degrees of preprocessing and varying hyperparameter settings. Some resources also include models in other languages (even Dutch with a bit of luck). 

### Large and commonly used models (English):

* Google word2vec: to be downloaded from here (follow link in instructions): http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

* GloVe (trained on various corpora): https://nlp.stanford.edu/projects/glove/

* FastText embeddings (Facebook): https://fasttext.cc/docs/en/english-vectors.html

* Models with different algorithms, hyperparameters, dimensions and degrees of preprocessing (e.g. dependency parsing windows):  https://vecto.readthedocs.io/en/docs/tutorial/getting_vectors.html



### Various models in English & other languages:

* word2vec trained on Wikipedia for various languages (including Dutch): https://wikipedia2vec.github.io/wikipedia2vec/pretrained/

* Various algorithms and parameters for English and other languages: http://vectors.nlpl.eu/repository/#

* Word2vec wikipedia for English and German: https://github.com/idio/wiki2vec

* fastText for languages other than English: https://fasttext.cc/docs/en/crawl-vectors.html 


[TO DO: turn into table with specifications: Language, underlying corpus, underlying corpus size, algorithm, hyperparameters]


Gensim even lets you download models directly via their api. 
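
A minimal sketch of that API, assuming the `gensim.downloader` module and the model name `glove-wiki-gigaword-50` from the gensim-data collection. The helper name `load_pretrained` is made up for illustration. The import sits inside the function, so nothing is downloaded until you actually call it.

```python
def load_pretrained(name="glove-wiki-gigaword-50"):
    """Download (on first use) and load a pretrained model as KeyedVectors."""
    # Imported lazily: the download only starts when the function is called.
    import gensim.downloader as api
    return api.load(name)

# Example call (downloads the model the first time it is run):
# glove = load_pretrained()
# print(glove.most_similar('dog', topn=5))
```

The returned object can be used in the same way as the models loaded from file in section 3.a.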

## 2.b Creating your own model - the quick out-of-the box way 

### Required installations

* requests: used for downloading data - you can download data manually instead
* Gensim: Word2vec implementation for python
* An NLP package for preprocessing: The examples here use NLTK ([Natural language processing toolkit](http://www.nltk.org/install.html)). Alternatively, you can use [SpaCy](https://spacy.io/usage/models). If this is your first time using NLTK, please make sure to download the most important corpora and resources (included in 'nltk book') by running `nltk.download()` (after having imported nltk). 



In [None]:
#You can use pip install from within the notebook like this:

# add/remove packages you (don't) need to install here
%pip install requests
#%pip install nltk



In [None]:
# Downloading nltk resources (only run this if you haven't done this already)
import nltk 

nltk.download()

### Data preprocessing

**Step 0: Download a corpus**

In [None]:
# Download a text corpus using python:
# You can also do this manually by following this link: http://www.gutenberg.org/cache/epub/730/pg730.txt
import requests
import os

# project Gutenberg Oliver Twist as a .txt file:
url = 'http://www.gutenberg.org/cache/epub/730/pg730.txt'
r = requests.get(url)
# Access content and decode bytes to utf-8
text = r.content.decode('utf-8')


# create directory for data
if not os.path.isdir('../data/'):
    os.mkdir('../data')

# Write the text to a file and store it in our data directory (or do this step manually)
with open('../data/oliver_twist.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(text)

**Step 1: Preprocess the text**
    
There are several choices you can make in the preprocessing step. In general, you want to remove everything from the data that may introduce artifacts. You can also consider further regularization steps, such as replacing all numeric characters by the same representation or using lower case spelling for the entire corpus. ATTENTION: There is a trade-off between generalization and information (e.g. lower- and uppercase spelling can be a relevant distinction, consider apple vs Apple).

For smaller corpora, you will most likely want to remove punctuation and perhaps include some more regularization. You can even consider lemmatizing the text. If you inspect larger models (e.g. Google word2vec), you will notice that the vocabulary contains punctuation (and all kinds of other weird symbols). For such a large dataset, the noise introduced by these things can most likely be neglected. 
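
The regularization steps mentioned above can be sketched as follows. This is only an illustration using the standard library; the function name `normalize_token` and its two switches are made up here and are not used in the rest of the notebook.

```python
import re

def normalize_token(token, lowercase=True, collapse_digits=True):
    """Apply optional regularization steps to a single token."""
    if lowercase:
        token = token.lower()
    if collapse_digits:
        # Replace every digit by '0' so that all four-digit years share one form.
        token = re.sub(r'\d', '0', token)
    return token

tokens = ['Apple', 'sold', '1837', 'apples', 'in', '1999']
# Full regularization: 'Apple' and 'apple' can no longer be told apart.
print([normalize_token(t) for t in tokens])
# Keeping the original case preserves the apple vs Apple distinction.
print([normalize_token(t, lowercase=False) for t in tokens])
```

Switching `lowercase` off shows the trade-off: the vocabulary gets larger, but the company name stays distinct from the fruit.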

 Here, we do the following:
 
 * remove punctuation
 * set everything to lower case
 * cut the text in sentences, so it can be processed by the vanilla, out of the box word2vec implementation 

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize

import string

def remove_punct(tokens):
    
    punct = set(string.punctuation)
    # Keep only tokens that contain at least one non-punctuation character
    # (this also drops multi-character tokens such as '--' or '...'):
    tokens_clean = []
    for t in tokens:
        if not all(char in punct for char in t):
            tokens_clean.append(t)
            
    return tokens_clean

def preprocess(text):
    
    clean_sentences = []
    sentences = sent_tokenize(text.strip())

    for s in sentences:
        tokens = word_tokenize(s.lower())
        tokens_clean = remove_punct(tokens)
        clean_sentences.append(tokens_clean)
    return clean_sentences
    
test = "This is a test text. Let's see if this works! TEST."

preprocess(test)

In [None]:
# apply to your text corpus 

# load data:

with open('../data/oliver_twist.txt', encoding='utf-8') as infile:
    text_oliver = infile.read()

# clean: 
text_oliver_clean = preprocess(text_oliver)
print(text_oliver_clean[201:202])


**Step 2: Create word2vec model**

Gensim allows you to train your own model. Here, we use a toy example, which should run on your local machine. If you'd like to train on a larger corpus, you will most likely need to use a server.

In the cell below, we train a model on the Oliver Twist novel you've downloaded in the previous step:

In [None]:
from gensim.models import Word2Vec


# create directory for models
if not os.path.isdir('../models/'):
    os.mkdir('../models')

# Note: Gensim 4 and later call this parameter `vector_size`; older versions used `size`
oliver_w2v = Word2Vec(text_oliver_clean, vector_size=300, window=4, min_count=2)
# How to write out the model as a text file (vectors can be inspected easily)
oliver_w2v.wv.save_word2vec_format('../models/oliver.txt')
# How to write out the model as a binary file (cannot be inspected easily)
oliver_w2v.wv.save_word2vec_format('../models/oliver.bin', binary=True)


In the cell below, we train a model on the movie reviews corpus included in nltk:

In [None]:
from nltk.corpus import movie_reviews

mr = Word2Vec(movie_reviews.sents())
mr.wv.save_word2vec_format('../models/movies.bin', binary = True)

## 2.c Creating your own model with control over hyperparameters 

If you want to create your own embeddings, this repository is a good start: 

https://bitbucket.org/omerlevy/hyperwords/src/default/

Attention: the code requires Python 2. We're working on a Python 3 version. 

**This is optional and probably takes quite some time.**

# 3. Accessing word representations of different models


Models may be stored in different (and sometimes a little confusing) formats, but they all boil down to these components:

* a matrix of word vectors 
* a vocabulary
* a mapping between vectors in the matrix to the words in the vocabulary (often via indices)
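
To make the three components concrete, here is a toy sketch with made-up values (three words, four dimensions). The names `matrix`, `vocab`, `word2index` and `get_vector` are chosen for illustration only.

```python
import numpy as np

# Toy model with three words and four dimensions (values are made up).
vocab = ['cat', 'dog', 'idea']
matrix = np.array([
    [0.9, 0.1, 0.0, 0.3],
    [0.8, 0.2, 0.1, 0.3],
    [0.0, 0.7, 0.9, 0.1],
])
# Mapping from each word to the index of its row in the matrix.
word2index = {word: i for i, word in enumerate(vocab)}

def get_vector(word):
    """Look up the row of the matrix that belongs to a word."""
    return matrix[word2index[word]]

print(matrix.shape)
print(get_vector('dog'))
```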

Even though there is existing software for inspecting and manipulating vectors (e.g. in the Gensim Word2vec toolkit), you can easily write code yourself using numpy (my preferred way of working). This way, you don't have to rely on non-transparent implementations (remember the analogy example...). 

## 3.a Accessing models using the Word2vec toolkit

In [None]:
# How to load a stored model:
from gensim.models import KeyedVectors

#oliver_w2v = KeyedVectors.load_word2vec_format('../models/oliver.txt', binary=False) 
oliver_w2v = KeyedVectors.load_word2vec_format('../models/oliver.bin', binary=True) 

In [None]:
# Explore the word2vec model as a python object:
print('The model is represented internally as a...')
print(type(oliver_w2v))
print()
#####
# In Gensim 4 and later the vocabulary is `key_to_index`; older versions used `.vocab`
vocabulary = oliver_w2v.key_to_index
print('The model vocabulary is represented internally as a...')
print(type(vocabulary))
print('Some words from the model vocabulary:')
print(list(vocabulary.keys())[:20])
print('Index of a word in the vector matrix:')
print(vocabulary['man'])
print()
#####
# To access the vector of a particular word, you can simply do the following:
vec_word = oliver_w2v['day']
# This way, you access the vector as a numpy array
print('Representation of an individual word vector:')
print(type(vec_word))
print('Number of vector dimensions:', len(vec_word))


## 3.b Accessing a model without a specific package

In [None]:
# Alternatively, you can write your own code for loading your model as a numpy matrix. 
# I suggest doing this
import numpy as np

def load_model(path):
    
    matrix = []
    vocab = []
    word2index_dict = dict()
    
    with open(path, encoding='utf-8') as infile:
        lines = infile.read().split('\n')
        
    # the first line is a header (vocabulary size and number of dimensions)
    for line in lines[1:]:
        line_list = line.strip().split(' ')
        # skip empty lines (e.g. the one after the final line break)
        if len(line_list) < 2:
            continue
        word = line_list[0]
        word2index_dict[word] = len(vocab)
        vocab.append(word)
        vec = [float(v) for v in line_list[1:]]
        matrix.append(vec)
        
    return np.array(matrix), vocab, word2index_dict
        
matrix, vocab, word2index_dict = load_model('../models/oliver.txt')    
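
The loader above skips the first line of the file. That is because the word2vec text format starts with a header giving the number of words and the number of dimensions, followed by one word and its values per line. The sketch below writes a tiny file in this format (made-up values, temporary directory) and reads the header back; all names in it are illustrative.

```python
import os
import tempfile

# A tiny file in the word2vec text format (values are made up):
# header "<number of words> <number of dimensions>", then one word per line.
content = "2 3\nman 0.1 0.2 0.3\nwoman 0.2 0.1 0.4\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.txt')
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(content)
    with open(path, encoding='utf-8') as infile:
        # Read the header first, then the remaining non-empty lines.
        n_words, n_dims = (int(x) for x in infile.readline().split())
        rows = [line.split() for line in infile if line.strip()]

print('words:', n_words, 'dimensions:', n_dims)
for row in rows:
    # Each row holds the word followed by n_dims values.
    print(row[0], len(row) - 1)
```

Checking the header against what you actually read is a cheap sanity test when a downloaded model comes in an unfamiliar format.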


# 4. Inspecting word representations

Time to explore what the model represents! Play around with similarity, nearest neighbors and analogies and try to get a feeling for what the vectors can do. Feel free to load existing models and compare what they represent. The code snippets below continue with the Oliver Twist toy example - so don't be disappointed if it returns nonsense. 

## 4.a Simple vector operations 

**Gensim**

In [None]:
# How to load a stored model:
from gensim.models import KeyedVectors

oliver_w2v = KeyedVectors.load_word2vec_format('../models/oliver.txt', binary=False) 


In [None]:
# similarity

cos_man_woman = oliver_w2v.similarity('man', 'woman')
cos_man_dog = oliver_w2v.similarity('man', 'dog')

print('Man and woman should be more similar than man and dog:')
print('True!' if cos_man_woman > cos_man_dog else 'False')
print('man-woman', cos_man_woman)
print('man-dog', cos_man_dog)

In [None]:
# nearest neighbors 

# Tip: use the help function if you want to explore the arguments
#help(oliver_w2v.most_similar)
nearest_neighbors = oliver_w2v.most_similar('dog', topn=10)
for w, cos in nearest_neighbors:
    print(w, cos)

In [None]:
# Analogy

closest_to_predicted_vec = oliver_w2v.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

for word, cosine in closest_to_predicted_vec:
    print(word, cosine)
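
To see what the `positive` and `negative` arguments stand for, the analogy can be sketched by hand as a vector offset: subtract `man` from `king`, add `woman`, and look for the closest remaining word. The toy vectors and the function name `analogy` below are made up. Note that Gensim works on length-normalized vectors, so its scores on a real model will differ from this simplified version.

```python
import numpy as np

def cosine(vec1, vec2):
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

def analogy(a, b, c, vectors):
    """a is to b as c is to ...? Returns the word closest to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    # The three input words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors (made up): dimension 2 encodes 'female', dimension 3 'royal'.
toy = {
    'man':   np.array([1.0, 0.0, 0.0]),
    'woman': np.array([1.0, 1.0, 0.0]),
    'king':  np.array([1.0, 0.0, 1.0]),
    'queen': np.array([1.0, 1.0, 1.0]),
    'dog':   np.array([0.0, 1.0, -1.0]),
}
print(analogy('man', 'king', 'woman', toy))
```

Excluding the input words matters: on real models the nearest vector to the offset is often one of the input words themselves.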

**Numpy (generally applicable)**


Cosine similarity is one of the most important concepts to understand if you want to work with vectors. It is the dot product of two vectors divided by the product of their magnitudes (lengths), as shown below:

![Cosine similarity](../images/cosine.png "Cosine similarity")



In [None]:
import numpy as np

# Using numpy:

def normalize_vector(vec):
    
    vec = np.array(vec)
    # magnitude (Euclidean length) of the vector
    mag = np.sqrt(np.sum(vec ** 2))
    # divide every value by the magnitude to obtain a unit vector
    unit_vec = vec / mag
    
    return unit_vec
    
    
def get_cosine(vec1, vec2):

    vec1_norm = normalize_vector(vec1)
    vec2_norm = normalize_vector(vec2)

    # the dot product of two unit vectors is their cosine
    cos = np.dot(vec1_norm, vec2_norm)

    return cos
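
With a matrix, a vocabulary and a word-to-index mapping (the three components from section 3), nearest neighbors can be computed for all words at once. The following is a sketch on toy data. The function name `nearest_neighbors` and the toy values are assumptions for illustration, and the code assumes that no row of the matrix is a zero vector.

```python
import numpy as np

def nearest_neighbors(word, matrix, vocab, word2index, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    # Normalize every row to unit length; the dot product is then the cosine.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_matrix = matrix / norms
    cosines = unit_matrix @ unit_matrix[word2index[word]]
    # Sort indices from highest to lowest cosine and skip the query word itself.
    ranked = [i for i in np.argsort(-cosines) if vocab[i] != word]
    return [(vocab[i], float(cosines[i])) for i in ranked[:topn]]

toy_vocab = ['cat', 'dog', 'idea']
toy_matrix = np.array([[0.9, 0.1, 0.0],
                       [0.8, 0.2, 0.1],
                       [0.0, 0.7, 0.9]])
toy_index = {w: i for i, w in enumerate(toy_vocab)}

for neighbor, cos in nearest_neighbors('cat', toy_matrix, toy_vocab, toy_index, topn=2):
    print(neighbor, round(cos, 3))
```

The same function should work on the `matrix`, `vocab` and `word2index_dict` returned by `load_model` in section 3.b.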
    

## 4.b  Standard evaluations

Evaluation sets can be found here:

Similarity

* WordSim 353: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
* SimLex 999: https://fh295.github.io/simlex.html
* MEN: https://staff.fnwi.uva.nl/e.bruni/MEN
* Luong rare words: http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Rare%20Word



Analogy 

* Google test sets (combined): http://download.tensorflow.org/data/questions-words.txt
* Google test sets (semantic and morphological): https://bitbucket.org/omerlevy/hyperwords/src/default/testsets/analogy/
* BATS: http://vecto.space/projects/BATS/


Gensim already contains functions to run some standard evaluations. ATTENTION: If you already have a Gensim version installed, make sure to update it. (Mine was out of date and the evaluations did not run.) 

**Discussion question: How can we compare the scores of the similarity and relatedness evaluations? How would you test whether one model's correlation with human judgments is better than another model's?**



In [None]:
# Gensim has the evaluation methods built in
import gensim
from gensim.test.utils import datapath

# no access to actual pairs
# if you want to read up on the details, run:
#help(oliver_w2v.evaluate_word_pairs)
pearson, spearman, oov = oliver_w2v.evaluate_word_pairs(datapath('wordsim353.tsv'))

print('Pearson score', pearson)
print('Spearman Rho score', spearman)
print('Proportion of out-of-vocabulary words', oov)
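
Because the built-in method does not give access to the individual pairs, it can help to see what the score is: a Spearman correlation between human ratings and model cosines. Below is a hand-written sketch using only the standard library. The ratings and cosines are hypothetical, and the function names are made up for illustration.

```python
def rank(values):
    """Assign ranks (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(rank(x), rank(y))

# Hypothetical human ratings and model cosines for four word pairs.
human = [9.0, 7.5, 3.0, 1.0]
model = [0.80, 0.85, 0.30, 0.10]
print(spearman(human, model))
```

Since only the ranks matter, the different scales of human ratings and cosines are not a problem.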

In [None]:
# Analogy evaluation = actually gives you the model output sorted into correct and 
# incorrect predictions 
score, output= oliver_w2v.evaluate_word_analogies(datapath('questions-words.txt'))

## 4.c Clustering 


A nice way of inspecting word vectors is testing how they behave in clustering. Scikit learn offers a number of implementations of different [clustering algorithms](https://scikit-learn.org/stable/modules/clustering.html). 


Note: This is just to get a first impression. I recommend reading up on clustering evaluation for using larger sets without label annotations. Scikit learn is a good start. 

In [None]:
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from gensim.models import KeyedVectors
import numpy as np

In [None]:
fruits = ['apple', 'orange', 'lemon', 'strawberry', 'tomato']
vegetables = ['cucumber', 'pepper', 'carrot', 'zucchini', 'egg_plant']
animals = ['cat', 'dog', 'chicken', 'shrimp', 'lion', 'hamster', 'jaguar']
abstract_concepts = ['feeling', 'idea', 'thought', 'theory', 'anger', 'aggression']

In [None]:
from collections import defaultdict

def map_words(words, label, word_label_dict):
    for word in words:
        word_label_dict[word] = label

word_label_dict = dict()
map_words(animals, 'animal', word_label_dict)
map_words(fruits, 'fruit', word_label_dict)
map_words(vegetables, 'vegetable', word_label_dict)
map_words(abstract_concepts, 'abstract', word_label_dict)

for word, label in word_label_dict.items():
    print(word, label)

In [None]:
def get_all_vectors(word_label_dict, model):
    
    vecs = []
    words_in_vocab = []
    
    for word in word_label_dict.keys():
        if word in model:
            vec = model[word]
            vecs.append(vec)
            words_in_vocab.append(word)
        else:
            print(word, 'oov')
    
    return np.array(vecs), words_in_vocab


vecs, words_in_vocab = get_all_vectors(word_label_dict, oliver_w2v)

In [None]:
# Clustering doc: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans

# as many clusters as classes:

n_clusters = len(set(word_label_dict[word] for word in words_in_vocab))
print('number of clusters', n_clusters)
# abstract vs concrete?
#n_clusters = 2
y_pred = KMeans(n_clusters=n_clusters, init='random').fit_predict(vecs)

In [None]:
predicted_clusters = defaultdict(list)
for word, pred_label in zip(words_in_vocab, y_pred):
    predicted_clusters[pred_label].append(word)
    
for label, words in predicted_clusters.items():
    print(label, words)
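
When gold labels are available, as with the small word lists above, one simple way to put a number on the clustering is purity: the share of words that belong to the majority class of their cluster. The sketch below uses toy data and a made-up function name `purity`. Scikit learn's `metrics` module offers more principled scores, for example `adjusted_rand_score` and `homogeneity_score`.

```python
from collections import Counter, defaultdict

def purity(words, gold_labels, predicted_clusters):
    """Share of words that carry the majority gold label of their cluster."""
    members = defaultdict(list)
    for word, cluster in zip(words, predicted_clusters):
        members[cluster].append(gold_labels[word])
    # For each cluster, count the words with the most frequent gold label.
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in members.values())
    return majority_total / len(words)

# Toy example: five words, two predicted clusters.
toy_words = ['cat', 'dog', 'apple', 'lemon', 'idea']
toy_gold = {'cat': 'animal', 'dog': 'animal', 'apple': 'fruit',
            'lemon': 'fruit', 'idea': 'abstract'}
toy_pred = [0, 0, 1, 1, 1]
print(purity(toy_words, toy_gold, toy_pred))
```

Purity rises with the number of clusters (one cluster per word gives a perfect score), so it is only meaningful when comparing runs with the same `n_clusters`.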

# Exercise for Dutch

Spacy for Dutch: https://spacy.io/models/nl

Spacy quickstart: https://spacy.io/usage

1) Find a Dutch corpus

2) Preprocess it using Spacy for Dutch

3) Create a model (using Gensim or something else)

4) See if you can get an impression of what it captures

In [None]:
%pip install spacy


To download the Dutch models, please run the following line from the terminal. Make sure that the command `python` is linked to the same python version used by anaconda. You can use `python --version` to find out. 

`python -m spacy download nl_core_news_sm`

In [None]:
import spacy
nlp = spacy.load("nl_core_news_sm")
doc = nlp(u"Dit is een zin.")
# Accessing tokens:

for token in doc:
    print(token.text)
