# Training Word Vectors for SQL

## Goal

This notebook is part of a larger effort to translate artificial languages into each other:
https://github.com/fastai/fastai/blob/master/courses/dl2/translate.ipynb
http://course.fast.ai/lessons/lesson11.html

We want to train word vectors for two SQL dialects (...)

In [14]:
## Every notebook starts with the following three lines; they ensure that any edits to libraries 
## you make are reloaded here automatically, and also that any charts or images displayed are shown in this notebook.

%matplotlib inline
%reload_ext autoreload
%autoreload 2



## Get code from Github (TO DO)

In [1]:
## TODO: fetch SQL code from GitHub, e.g. by searching repositories by language with PyGithub:
## https://pygithub.readthedocs.io/en/latest/examples/MainClass.html#search-repositories-by-language

## Word Vector - Andrew Ng

https://www.coursera.org/learn/nlp-sequence-models/home/week/2

https://www.coursera.org/learn/nlp-sequence-models/lecture/6Oq70/word-representation



https://www.coursera.org/learn/nlp-sequence-models/lecture/qHMK5/using-word-embeddings

https://www.coursera.org/learn/nlp-sequence-models/lecture/S2mat/properties-of-word-embeddings

https://www.coursera.org/learn/nlp-sequence-models/lecture/K604Z/embedding-matrix

https://www.coursera.org/learn/nlp-sequence-models/lecture/APM5s/learning-word-embeddings

https://www.coursera.org/learn/nlp-sequence-models/lecture/8CZiw/word2vec

https://www.coursera.org/learn/nlp-sequence-models/lecture/Iwx0e/negative-sampling

https://www.coursera.org/learn/nlp-sequence-models/lecture/IxDTG/glove-word-vectors

## Word Vectors - Simple Example

In [30]:
import torch
from torch.autograd import Variable
import numpy as np
import torch.nn.functional as F  # provides log_softmax and nll_loss

We need a list of the words of each SQL dialect. All other tokens are variables. Variables must stay immutable: either the model learns to leave them unchanged, or we hardcode that.
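The split into dialect words and variables could be hardcoded roughly as follows. This is only a sketch: `SQL_KEYWORDS` is a deliberately tiny, hypothetical keyword list, `tokenize_sql` is a made-up helper name, and mapping every variable to a single `<VAR>` placeholder is just one possible design.

```python
import re

# Hypothetical, deliberately tiny keyword list; a real dialect has many more words.
SQL_KEYWORDS = {"select", "from", "where", "and", "or", "insert", "into", "values"}

IDENT_OR_NUMBER = r"[A-Za-z_][A-Za-z_0-9]*|\d+"

def tokenize_sql(statement):
    """Split a statement into tokens and replace every non-keyword by <VAR>."""
    # identifiers, numbers, or any single non-space symbol
    raw_tokens = re.findall(IDENT_OR_NUMBER + r"|\S", statement)
    tokens = []
    for tok in raw_tokens:
        lowered = tok.lower()
        if lowered in SQL_KEYWORDS:
            tokens.append(lowered)       # dialect word: keep it
        elif re.fullmatch(IDENT_OR_NUMBER, tok):
            tokens.append("<VAR>")       # variable: hide the concrete name
        else:
            tokens.append(tok)           # operators and punctuation
    return tokens

print(tokenize_sql("SELECT name FROM users WHERE id = 42"))
```

With a placeholder, the word vectors are trained only on the dialect words, and the original variable names would have to be restored after translation.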

Can we run over docs or tickets of SQL to create word vectors?

https://stackoverflow.com/search?q=postgres

https://lxml.de/
http://docs.python-requests.org/en/latest/

http://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/

http://nlp.fast.ai/category/classification.html
http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html

https://towardsdatascience.com/implementing-word2vec-in-pytorch-skip-gram-model-e6bae040d2fb

https://gist.github.com/mbednarski/da08eb297304f7a66a3840e857e060a0

https://adoni.github.io/2017/11/08/word2vec-pytorch/


https://gist.github.com/GavinXing/9954ea846072e115bb07d9758892382c

In [31]:
## following blog: https://towardsdatascience.com/implementing-word2vec-in-pytorch-skip-gram-model-e6bae040d2fb
## this should get replaced by sql statements of both databases
corpus = [
    'he is a king',
    'she is a queen',
    'he is a man',
    'she is a woman',
    'warsaw is poland capital',
    'berlin is germany capital',
    'paris is france capital',
]



In [32]:
## simplification: split each sentence on whitespace
def tokenize_corpus(corpus):
    tokens = [x.split() for x in corpus]
    return tokens

tokenized_corpus = tokenize_corpus(corpus)

tokenized_corpus

[['he', 'is', 'a', 'king'],
 ['she', 'is', 'a', 'queen'],
 ['he', 'is', 'a', 'man'],
 ['she', 'is', 'a', 'woman'],
 ['warsaw', 'is', 'poland', 'capital'],
 ['berlin', 'is', 'germany', 'capital'],
 ['paris', 'is', 'france', 'capital']]

In [33]:
vocabulary = []
for sentence in tokenized_corpus:
    for token in sentence:
        if token not in vocabulary:
            vocabulary.append(token)

word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
print(word2idx)
idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}
print(idx2word)

vocabulary_size = len(vocabulary)
print(vocabulary_size)

{'he': 0, 'is': 1, 'a': 2, 'king': 3, 'she': 4, 'queen': 5, 'man': 6, 'woman': 7, 'warsaw': 8, 'poland': 9, 'capital': 10, 'berlin': 11, 'germany': 12, 'paris': 13, 'france': 14}
{0: 'he', 1: 'is', 2: 'a', 3: 'king', 4: 'she', 5: 'queen', 6: 'man', 7: 'woman', 8: 'warsaw', 9: 'poland', 10: 'capital', 11: 'berlin', 12: 'germany', 13: 'paris', 14: 'france'}
15


In [34]:
## TO DO: what is this?

## We can now generate (center word, context word) pairs.
## Let's assume the context window is symmetric with size 2.

window_size = 2
idx_pairs = []
# for each sentence
for sentence in tokenized_corpus:
    indices = [word2idx[word] for word in sentence]
    # for each word, treated as center word
    for center_word_pos in range(len(indices)):
        # for each window position
        for w in range(-window_size, window_size + 1):
            context_word_pos = center_word_pos + w
            # make sure we do not step outside the sentence
            if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
                continue
            context_word_idx = indices[context_word_pos]
            idx_pairs.append((indices[center_word_pos], context_word_idx))

idx_pairs = np.array(idx_pairs) # it will be useful to have this as numpy array

This gives us a list of (center, context) index pairs:

array([[ 0,  1],
       [ 0,  2],
       ...

Which can be easily translated to words:

he is
he a
is he
is a
is king
a he
a is
a king
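The same pairs can be produced directly as words for a single sentence. The helper name `context_pairs` is an assumption made for this sketch; the windowing logic follows the loop above.

```python
def context_pairs(sentence, window_size=2):
    """Return (center, context) word pairs for one tokenized sentence."""
    pairs = []
    for center_pos, center in enumerate(sentence):
        for offset in range(-window_size, window_size + 1):
            context_pos = center_pos + offset
            # skip the center word itself and positions outside the sentence
            if offset == 0 or context_pos < 0 or context_pos >= len(sentence):
                continue
            pairs.append((center, sentence[context_pos]))
    return pairs

pairs = context_pairs("he is a king".split())
for center, context in pairs:
    print(center, context)
```

For the four-word sentence this yields ten pairs; the first eight are the ones listed above.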

### Input layer
The input layer is just the center word in one-hot encoding: a vector of length vocabulary_size.

In [35]:
def get_input_layer(word_idx):
    x = torch.zeros(vocabulary_size).float()
    x[word_idx] = 1.0
    return x

### Hidden layer 

The hidden layer produces our v vectors, so it needs embedding_dims neurons. To compute its value we define a weight matrix W1 of shape [embedding_dims, vocabulary_size]. There is no activation function, just a plain matrix multiplication.

x is one-hot, and multiplying a matrix by a one-hot vector is the same as selecting a single column from it. Each column of W1 therefore stores the v vector of one word.
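The column-selection claim is easy to check numerically. The sketch below uses NumPy with small, arbitrary sizes and a random stand-in for W1; none of the numbers come from the trained model.

```python
import numpy as np

vocabulary_size, embedding_dims = 4, 3
rng = np.random.default_rng(0)
W1 = rng.standard_normal((embedding_dims, vocabulary_size))  # random stand-in

word_idx = 2
x = np.zeros(vocabulary_size)
x[word_idx] = 1.0                      # one-hot vector for word 2

hidden = W1 @ x                        # matrix times one-hot vector
print(np.allclose(hidden, W1[:, word_idx]))  # compares with column 2 of W1
```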

In [36]:
embedding_dims = 5
W1 = Variable(torch.randn(embedding_dims, vocabulary_size).float(), requires_grad=True)
W2 = Variable(torch.randn(vocabulary_size, embedding_dims).float(), requires_grad=True)
num_epochs = 101
learning_rate = 0.001

for epo in range(num_epochs):
    loss_val = 0
    for data, target in idx_pairs:
        x = Variable(get_input_layer(data)).float()
        y_true = Variable(torch.from_numpy(np.array([target])).long())

        z1 = torch.matmul(W1, x)
        z2 = torch.matmul(W2, z1)
    
        log_softmax = F.log_softmax(z2, dim=0)

        loss = F.nll_loss(log_softmax.view(1,-1), y_true)
        loss_val += loss.item()  # .item() reads the scalar; indexing a 0-dim tensor fails in current PyTorch
        loss.backward()
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        W1.grad.data.zero_()
        W2.grad.data.zero_()
    if epo % 10 == 0:    
        print(f'Loss at epo {epo}: {loss_val/len(idx_pairs)}')



Loss at epo 0: 4.759732723236084
Loss at epo 10: 4.197955131530762
Loss at epo 20: 3.7988269329071045
Loss at epo 30: 3.516937732696533
Loss at epo 40: 3.3159615993499756
Loss at epo 50: 3.1684799194335938
Loss at epo 60: 3.0558013916015625
Loss at epo 70: 2.9662201404571533
Loss at epo 80: 2.8926713466644287
Loss at epo 90: 2.830836534500122
Loss at epo 100: 2.7779338359832764


### Output layer

The last layer must have vocabulary_size neurons because it produces a probability for each word. W2 therefore has shape [vocabulary_size, embedding_dims].

In [16]:
W2 = Variable(torch.randn(vocabulary_size, embedding_dims).float(), requires_grad=True)
z2 = torch.matmul(W2, z1)


On top of that we apply a softmax layer. PyTorch provides an optimized version combined with the logarithm, because a plain softmax followed by a log is not numerically stable:

In [17]:
log_softmax = F.log_softmax(z2, dim=0)


This is equivalent to computing the softmax and then applying the log. Now we can compute the loss; as usual, PyTorch provides everything we need.
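Both the equivalence and the stability problem can be checked numerically. The sketch below uses NumPy; `naive_log_softmax` and `stable_log_softmax` are hypothetical helper names, and the stable variant uses the common trick of subtracting the maximum score. It illustrates the idea and is not meant to describe PyTorch's internals.

```python
import numpy as np

def naive_log_softmax(z):
    # softmax first, log afterwards: exp() overflows for large scores
    with np.errstate(over="ignore", invalid="ignore"):
        e = np.exp(z)
        return np.log(e / e.sum())

def stable_log_softmax(z):
    # subtract the maximum before exponentiating (log-sum-exp trick)
    shifted = z - z.max()
    return shifted - np.log(np.exp(shifted).sum())

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])

print(naive_log_softmax(small), stable_log_softmax(small))  # both agree
print(naive_log_softmax(large))   # overflow turns every entry into nan
print(stable_log_softmax(large))  # still finite
```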

nll_loss computes the negative log-likelihood on the log-softmax output. y_true is the context word. We want its probability to be as high as possible, because the pair (x, y_true) comes from the training data, so the two really are center and context.

In [18]:
loss = F.nll_loss(log_softmax.view(1,-1), y_true)


### Backprop

In [None]:
loss.backward()

For optimization, SGD is used. It is so simple that it was faster to write it by hand than to create an optimizer object:

In [None]:
W1.data -= 0.01 * W1.grad.data
W2.data -= 0.01 * W2.grad.data
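For comparison, here is a sketch of the same update done once by hand and once with `torch.optim.SGD`. The toy loss and the names `w_manual` and `w_optim` are assumptions made for this example; plain SGD without momentum is assumed.

```python
import torch

torch.manual_seed(0)
learning_rate = 0.01

# two copies of the same toy parameter vector
w_manual = torch.randn(3, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)
optimizer = torch.optim.SGD([w_optim], lr=learning_rate)

# hand-written step: w <- w - lr * grad, then reset the gradient
loss_manual = (w_manual ** 2).sum()
loss_manual.backward()
with torch.no_grad():
    w_manual -= learning_rate * w_manual.grad
w_manual.grad.zero_()

# the same step through the optimizer object
optimizer.zero_grad()
loss_optim = (w_optim ** 2).sum()
loss_optim.backward()
optimizer.step()

print(torch.allclose(w_manual, w_optim))
```

The optimizer object becomes more convenient once there are many parameters or when momentum or weight decay are needed.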

The last step is to zero the gradients so the next pass starts clean:

In [20]:
W1.grad.data.zero_()
W2.grad.data.zero_()


### Training Loop

One potentially tricky thing is the definition of y_true. We do not create a one-hot vector explicitly; nll_loss takes the class index directly.
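Why an index is enough can be seen from the arithmetic. The sketch below is a NumPy stand-in, not a call to nll_loss itself, and the probabilities are made up: picking the log-probability at the target index gives the same number as multiplying with an explicit one-hot vector.

```python
import numpy as np

log_probs = np.log(np.array([0.1, 0.7, 0.2]))  # stand-in for a log_softmax output
target = 1                                      # class index, like the value in y_true

# loss from the index: just pick one entry
loss_from_index = -log_probs[target]

# loss from an explicit one-hot vector: same value, more work
one_hot = np.zeros_like(log_probs)
one_hot[target] = 1.0
loss_from_one_hot = -(one_hot * log_probs).sum()

print(loss_from_index, loss_from_one_hot)
```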

In [21]:
embedding_dims = 5
W1 = Variable(torch.randn(embedding_dims, vocabulary_size).float(), requires_grad=True)
W2 = Variable(torch.randn(vocabulary_size, embedding_dims).float(), requires_grad=True)
num_epochs = 100
learning_rate = 0.001

for epo in range(num_epochs):
    loss_val = 0
    for data, target in idx_pairs:
        x = Variable(get_input_layer(data)).float()
        y_true = Variable(torch.from_numpy(np.array([target])).long())

        z1 = torch.matmul(W1, x)
        z2 = torch.matmul(W2, z1)
    
        log_softmax = F.log_softmax(z2, dim=0)

        loss = F.nll_loss(log_softmax.view(1,-1), y_true)
        loss_val += loss.item()  # .item() reads the scalar; indexing a 0-dim tensor fails in current PyTorch
        loss.backward()
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        W1.grad.data.zero_()
        W2.grad.data.zero_()
    if epo % 10 == 0:    
        print(f'Loss at epo {epo}: {loss_val/len(idx_pairs)}')


### Extract Words

OK, we have trained the network. The last thing is to extract the vectors for the words. There are three ways to do this:

- Use vector v from W1
- Use vector u from W2
- Use average v and u
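The three options could look like this. The sketch uses random NumPy arrays with the same shapes as W1 and W2 as stand-ins (with trained PyTorch tensors one would first convert them, for example via `.detach().numpy()`); the helper name `word_vectors` and the four-word vocabulary are assumptions.

```python
import numpy as np

vocabulary = ["he", "is", "a", "king"]
word2idx = {w: i for i, w in enumerate(vocabulary)}
embedding_dims = 5

rng = np.random.default_rng(0)
# random stand-ins with the shapes of the trained matrices
W1 = rng.standard_normal((embedding_dims, len(vocabulary)))
W2 = rng.standard_normal((len(vocabulary), embedding_dims))

def word_vectors(word):
    """Return v (from W1), u (from W2) and their average for one word."""
    idx = word2idx[word]
    v = W1[:, idx]   # column of W1
    u = W2[idx, :]   # row of W2
    return v, u, (v + u) / 2

v, u, avg = word_vectors("king")
print(v.shape, u.shape, avg.shape)
```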

### List of Language Tokens

### List of Variables

### Build a word vector manually

Artificial languages such as SQL are less complex than natural languages, so building word vectors and a language model should be comparatively easy ().

One could consider building word vectors from documentation (?) or a log file (?). The issue is that we only want the SQL, not the log constructs or the natural language around it.
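A first, rough way to strip the surrounding text is a regular expression that keeps only spans starting with a statement keyword and ending at the next semicolon. Everything in this sketch is an assumption: the log excerpt is invented, real log formats differ between databases, and the pattern would break on semicolons inside string literals.

```python
import re

# Hypothetical log excerpt, invented for this example.
log_text = """
2018-06-01 12:00:01 UTC LOG:  statement: SELECT id, name FROM users WHERE id = 1;
2018-06-01 12:00:02 UTC LOG:  connection received: host=127.0.0.1
2018-06-01 12:00:03 UTC LOG:  statement: UPDATE users SET name = 'x' WHERE id = 1;
"""

# a statement keyword as a whole word, then everything up to the next semicolon
pattern = re.compile(r"\b(?:SELECT|INSERT|UPDATE|DELETE)\b[^;]*;", re.IGNORECASE)

statements = [m.group(0) for m in pattern.finditer(log_text)]
for s in statements:
    print(s)
```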

## Appendix

### Word Vectors fast.ai

#### Preparations

Goal: present words as numbers

First approach: go through a dictionary and give every word a number --> min. 22.40

Issue: a numbered list doesn't relate e.g. cat to kitten, but rather cat to catastrophe, because of the alphabetical order

Alternative: words as vectors, e.g. 

| word   | dogness | childness | catness |
|--------|---------|-----------|---------|
| puppy  | 0.9     | 1.0       | 0.0     |
| dog    | 1.0     | 0.2       | 0.0     |
| kitten | 0.0     | 1.0       | 0.9     |
| cat    | 0.0     | 0.2       | 1.0     |
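With these toy vectors, relatedness becomes something we can compute. The sketch below uses cosine similarity as the measure (an assumption for this toy example; `cosine_similarity` is a hand-written helper) on the values from the table.

```python
import numpy as np

# columns: dogness, childness, catness (values from the table)
toy = {
    "puppy":  np.array([0.9, 1.0, 0.0]),
    "dog":    np.array([1.0, 0.2, 0.0]),
    "kitten": np.array([0.0, 1.0, 0.9]),
    "cat":    np.array([0.0, 0.2, 1.0]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for other in ("dog", "kitten", "cat"):
    print("puppy", other, round(cosine_similarity(toy["puppy"], toy[other]), 3))
```

Puppy comes out closest to dog, then kitten (both are young animals), and furthest from cat, which is exactly the kind of relation a numbered dictionary cannot express.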

How to get a computer to calculate such a matrix?

- Training to get word embeddings can be very time consuming.
- Learning meaning from context: the words that appear near a word indicate what it means
- Take a text and, wherever a specific word shows up, replace it with its embedding vector from above



https://github.com/fastai/word-embeddings-workshop

https://www.youtube.com/watch?v=25nC0n9ERq4

In [2]:
import pickle
import numpy as np
import re
import json

In [3]:
np.set_printoptions(precision=4, suppress=True)

The dataset is available at http://files.fast.ai/models/glove/6B.100d.tgz. To download and unzip the files from the command line, you can run:

- wget http://files.fast.ai/models/glove_50_glove_100.tgz 
- Mac: curl -O http://files.fast.ai/models/glove_50_glove_100.tgz    
- tar xvzf glove_50_glove_100.tgz

You will need to update the path below to be accurate for where you are storing the data.

In [4]:
##!pwd
##!curl -O http://files.fast.ai/models/glove_50_glove_100.tgz    
##!tar xvzf glove_50_glove_100.tgz
##!ls -ltr

In [5]:
vecs = np.load("glove_vectors_100d.npy")
vecs50 = np.load("glove_vectors_50d.npy")

#### Checking Data

In [6]:
with open('words.txt') as f:
    content = f.readlines()
words = [x.strip() for x in content]

In [7]:
wordidx = json.load(open('wordsidx.txt'))

In [8]:
len(words)

400000

In [12]:
words[:10] ## the 1st 10 words

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

In [13]:
words[600:610] ## words 600 - 610

['together',
 'congress',
 'index',
 'australia',
 'results',
 'hard',
 'hours',
 'land',
 'action',
 'higher']

In [52]:
type(wordidx)

dict

In [53]:
wordidx['feminist']

11853

In [54]:
words[11853]

'feminist'

#### Word as Vectors

The word "feminist" (index 11853) is represented by a 100-dimensional vector:

In [55]:
type(vecs)

numpy.ndarray

In [56]:
vecs[11853]

array([ 0.296 ,  0.7626, -0.9866,  0.3776,  0.3194,  0.8286, -0.1686,
       -1.4558,  0.1965,  0.3854, -0.3348, -0.6503, -0.2528, -0.11  ,
       -0.1545,  0.5354, -0.4527, -0.0516,  0.1312,  0.0744,  0.5001,
        0.2151,  0.0688,  0.4347,  0.261 , -0.0371,  0.1385, -1.518 ,
        0.0641,  0.149 , -0.0314,  0.5038,  0.2839,  0.3457, -0.4411,
       -0.3459, -0.2118,  0.5651, -0.088 , -0.0438, -1.2228,  0.6039,
       -0.23  ,  0.2287, -0.2695, -0.9398,  0.2376,  0.3302, -0.2422,
        0.6359,  0.1347,  0.5542,  0.1432,  0.2861,  0.0216, -0.7437,
        0.3508,  0.362 ,  0.5566,  0.3403,  0.3613,  0.5185, -0.5437,
       -0.285 ,  1.1831, -0.1192,  0.2473,  0.0614,  0.4436, -0.244 ,
        0.2016,  0.5143, -0.4695, -0.0974, -0.9836, -0.3594,  0.3903,
       -0.517 , -0.1659, -1.2132, -1.3228,  0.0578,  0.7022,  0.3492,
       -0.9103, -0.381 , -0.1545,  0.4467, -0.009 , -0.9838,  1.0114,
       -0.227 ,  0.2697,  0.1566,  0.5613,  0.1175, -0.5755, -0.6324,
        0.1052,  1.2

This lets us do some useful calculations. For instance, we can see how far apart two words are using a distance metric:

In [57]:
from scipy.spatial.distance import cosine as dist

Smaller numbers mean two words are closer together, larger numbers mean they are further apart.
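The cosine distance used here is one minus the cosine of the angle between the two vectors. A hand-written NumPy version (the helper name `cosine_distance` is an assumption) shows the range of values to expect:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(angle between a and b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_distance([1, 2], [2, 4])         # same direction: about 0
orthogonal = cosine_distance([1, 0], [0, 1])   # unrelated directions: 1
opposite = cosine_distance([1, 0], [-1, 0])    # opposite directions: 2
print(same, orthogonal, opposite)
```

Only the direction of the vectors matters, not their length, so a distance close to 1 means the two words share almost nothing.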

The distance between similar words is low:

In [58]:
dist(vecs[wordidx["puppy"]], vecs[wordidx["dog"]])

0.27636247873306274

In [59]:
dist(vecs[wordidx["queen"]], vecs[wordidx["princess"]])

0.20527541637420654

And the distance between unrelated words is high:

In [60]:
dist(vecs[wordidx["celebrity"]], vecs[wordidx["dusty"]])

0.9883578838780522

In [61]:
dist(vecs[wordidx["kitten"]], vecs[wordidx["airplane"]])

0.8729851841926575

In [62]:
dist(vecs[wordidx["avalanche"]], vecs[wordidx["antique"]])

0.9621107056736946