## Skipgrams in Keras

- In this lecture, we will implement Skipgrams in `Keras`.

#### Loading in and preprocessing data
- Load the Alice in Wonderland text into `corpus` using the Keras `get_file` utility.
- `Keras` has some nice text preprocessing features too!
- Split the text into sentences.
- Use `Keras`' `Tokenizer` to tokenize sentences into words.
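
To make the tokenization step concrete, here is a minimal sketch of the idea in plain Python, so it runs without Keras. The helpers `build_word_index` and `texts_to_sequences` and the `FILTERS` string are hypothetical stand-ins written for this illustration. They only imitate the documented behaviour of the Keras `Tokenizer` (lowercasing, stripping filter characters, indexing words from 1 by descending frequency) and are not its implementation.

```python
from collections import Counter

# Characters to strip, loosely modelled on the `filters` argument of the Keras Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
STRIP_TABLE = str.maketrans({ch: " " for ch in FILTERS})


def split_words(text):
    """Lowercase, replace filter characters with spaces, split on whitespace."""
    return text.lower().translate(STRIP_TABLE).split()


def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # Index 0 is left unused so it can serve as a padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in split_words(t) if w in word_index] for t in texts]


toy_sentences = ["The cat sat.", "The cat ran away!"]
toy_index = build_word_index(toy_sentences)
toy_sequences = texts_to_sequences(toy_sentences, toy_index)
print(toy_index)
print(toy_sequences)
```

Each sentence becomes a list of integers of the same length as its word count, which is the format the skipgram step expects.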

In [1]:
# Imports
# Basics
from __future__ import print_function, division
import pandas as pd 
import numpy as np
import random
from IPython.display import SVG
%matplotlib inline

# nltk
from nltk import sent_tokenize

# keras
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Embedding, Reshape, Activation
from keras.utils import get_file
from keras.preprocessing.text import Tokenizer
from keras.utils import model_to_dot 
from keras.preprocessing.sequence import skipgrams

np.random.seed(1234)


In [2]:
# We'll use Alice in Wonderland

path = get_file('carrol-alice.txt', origin="http://www.gutenberg.org/files/11/11-0.txt")
corpus = open(path, encoding='utf-8').read()

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from keras.preprocessing.text import Tokenizer

# Split document into sentences first
corpus = corpus[corpus.index('\n\n')+2:]  # remove header.
sentences = sent_tokenize(corpus)

# Tokenize using Keras
base_filter='!"#$%&()*+,-./:;`<=>?@[\\]^_{|}~\t\n“”' + "'"
tokenizer = Tokenizer(filters=base_filter)
tokenizer.fit_on_texts(sentences)  # tokenizer assigns a unique integer index to each unique word in the text
                                   # also characters in base_filter will be removed from the text

# Convert tokenized sentences to sequence format
sequences = tokenizer.texts_to_sequences(sentences)
nb_samples = sum(len(s) for s in sequences)  # total number of word tokens across all sentences

print(len(sequences), tokenizer.document_count)

In [None]:
print("Word Index:", tokenizer.word_index)
print("Word Counts:", tokenizer.word_counts)
print("Document Count:", tokenizer.document_count)
print("Word Document Frequency:", tokenizer.word_docs)

In [None]:
# To see what the tokenizer did, compare a sentence with its encoded form:

print(sentences[324])  # this is a sentence
print(sentences[324].split())
print(sequences[324])  # this is the same sentence where words are encoded as numbers.
print(list(tokenizer.word_index[word.lower().replace(',', '').replace('.', '').replace('“', '').replace('”', '')] 
           for word in sentences[324].split()))

#### Skipgrams: Generating Input and Output Labels
- Now that we have sentences and a word tokenization, we are in a good position to create the training set for skipgrams.
- We need to generate `X_train` and `y_train`.

#### In the context of Word2Vec and skip-gram models:
- *Target Word* is the word for which you are trying to predict the context words.
- *Context words* are the words that appear in the vicinity (context) of the target word.

- *couples*: a list of pairs of word indices, where each pair is one skip-gram pair. For example, (2, 45) might represent a target word at index 2 and a context word at index 45. The pairs are generated from the specified window size around each target word.
- *labels*: a list of binary labels, one per skip-gram pair. A label of 1 indicates a true context word (a positive sample), and a label of 0 indicates a negative sample (a randomly selected word that is not the actual context). With `negative_samples=0`, as used in this notebook, every label is 1.
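
To see where the pairs come from, here is a minimal sketch of the windowing logic in plain Python. The function `skipgram_pairs` is a hypothetical helper written for this illustration. It covers only the positive pairs and leaves out the shuffling, negative sampling, and sampling table that the Keras `skipgrams` function also supports.

```python
def skipgram_pairs(sequence, window_size):
    """Return (target, context) pairs and labels for one encoded sentence."""
    couples, labels = [], []
    for i, target in enumerate(sequence):
        # The window covers up to `window_size` words on each side of position i.
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                couples.append((target, sequence[j]))
                labels.append(1)  # every pair generated here is a positive sample
    return couples, labels


toy_couples, toy_labels = skipgram_pairs([5, 12, 7, 3], window_size=1)
for pair, label in zip(toy_couples, toy_labels):
    print(pair, label)
```

Words at the edges of a sentence get fewer context words than words in the middle, so short sentences contribute fewer training pairs.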

In [None]:
# Let's first see how Keras' skipgrams function works.
# Word indices start at 1. Index 0 is reserved for padding, which lets sequences of
# different lengths share a fixed input size, so the vocabulary size passed below is
# len(tokenizer.word_index) + 1.

print(sentences[324], '\n') 
couples, labels = skipgrams(sequences[324], len(tokenizer.word_index) + 1,
    window_size=2, negative_samples=0, shuffle=True,
    categorical=False, sampling_table=None)

index_2_word = {val: key for key, val in tokenizer.word_index.items()}

for (w1, w2), l in zip(couples, labels):
    if w1 == tokenizer.word_index['temper']:
        print(f'{index_2_word[w1]:{len(index_2_word[w1])}} - {index_2_word[w2]:10} {l}')

In [7]:
# Function to generate the inputs and outputs for all windows

# Vocab size
vocab_size = len(tokenizer.word_index) + 1
# Dimension to reduce to
embedding_dim = 100
window_size = 2

def generate_data(sequences, window_size, vocab_size):
    for seq in sequences:
        X, y = [], []
        couples, _ = skipgrams(
            seq, vocab_size,
            window_size=window_size, negative_samples=0, shuffle=True,
            categorical=False, sampling_table=None)
        if not couples:
            continue
        for target, context in couples:
            X.append(target)
            y.append(keras.utils.to_categorical(context, vocab_size))
        X, y = np.array(X), np.array(y)
        X = X.reshape(len(X), 1)
        y = y.reshape(len(X), vocab_size)
        yield X, y
        
data_generator = generate_data(sequences, window_size, vocab_size)

In [8]:
X, y = next(data_generator)
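
The batch produced for one sentence has one row per skip-gram pair: the input holds the target index and the output is the one-hot encoding of the context index. The sketch below reproduces those shapes with plain lists. The helpers `one_hot` and `pairs_to_arrays` are hypothetical names for this illustration and stand in for `keras.utils.to_categorical` and the reshaping done in `generate_data`.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    row = [0.0] * vocab_size
    row[index] = 1.0
    return row


def pairs_to_arrays(couples, vocab_size):
    """Turn (target, context) pairs into inputs and one-hot outputs."""
    X = [[target] for target, _ in couples]                       # shape (n_pairs, 1)
    y = [one_hot(context, vocab_size) for _, context in couples]  # shape (n_pairs, vocab_size)
    return X, y


toy_pairs = [(1, 2), (2, 1), (2, 3), (3, 2)]
X_toy, y_toy = pairs_to_arrays(toy_pairs, vocab_size=4)
print(len(X_toy), len(X_toy[0]))  # rows and columns of the input
print(len(y_toy), len(y_toy[0]))  # rows and columns of the output
print(y_toy[0])
```

Note that the output grows with the vocabulary size, which is why the generator yields one sentence at a time instead of materialising the whole corpus.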

### Skipgrams: Creating the Model
* Lastly, we create the (shallow) network!
* In the Keras `Embedding` layer:
  * *input_dim* is the size of the vocabulary, i.e., the total number of distinct words in the dataset plus one for the reserved index 0. Every word index fed to the layer must be smaller than this value.
  * *output_dim* is the size of the dense embedding vector for each input word.
  * embeddings_initializer=*'glorot_uniform'* (also called Xavier initialization) draws the initial weights from a uniform distribution whose range depends on the layer's input and output sizes, which helps keep gradient magnitudes balanced early in training.
  * *input_length* is the length of each input sequence (i.e., how many tokens are expected in each input sample). Since this is a skip-gram model, each input is a single word, hence input_length=1.
* During training, each batch produces gradients only for the embedding rows whose word indices appear in that batch, so the layer does not need to see every index in every step.
* To get the weights of the Embedding layer:
  ```
  embedding_weights = model.get_weights()[0]
  ```
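
What the network computes for a single word can be sketched without Keras. The code below is a toy forward pass under stated assumptions: the sizes, the random weights, and the names `E`, `W`, `b`, `softmax`, and `forward` are all made up for this illustration and are not taken from the trained model. It shows that the embedding lookup is a row selection (equivalent to multiplying a one-hot vector by the embedding matrix), followed by a dense layer and a softmax over the vocabulary.

```python
import math
import random

random.seed(0)
toy_vocab_size, toy_embedding_dim = 5, 3  # toy sizes, not the notebook's values

# Embedding matrix: one row of length toy_embedding_dim per word index.
E = [[random.uniform(-1, 1) for _ in range(toy_embedding_dim)] for _ in range(toy_vocab_size)]
# Dense layer weights (toy_embedding_dim x toy_vocab_size) and bias.
W = [[random.uniform(-1, 1) for _ in range(toy_vocab_size)] for _ in range(toy_embedding_dim)]
b = [0.0] * toy_vocab_size


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(word_idx):
    """Embedding lookup, dense layer, softmax: one probability per vocabulary word."""
    hidden = E[word_idx]  # the lookup is just row selection
    scores = [sum(hidden[k] * W[k][j] for k in range(toy_embedding_dim)) + b[j]
              for j in range(toy_vocab_size)]
    return softmax(scores)


probs = forward(2)
print(probs, sum(probs))

# The same hidden vector, obtained by multiplying a one-hot vector by E.
one_hot_2 = [1.0 if i == 2 else 0.0 for i in range(toy_vocab_size)]
via_matmul = [sum(one_hot_2[i] * E[i][k] for i in range(toy_vocab_size))
              for k in range(toy_embedding_dim)]
print(via_matmul, E[2])
```

After training, the rows of the embedding matrix are the word vectors, which is why the notebook reads them from the first entry of `get_weights()`.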

In [None]:
# Create the Keras model and view it 
skipgram = Sequential()
skipgram.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, 
                       embeddings_initializer='glorot_uniform', input_length=1))
skipgram.add(Reshape((embedding_dim,)))
skipgram.add(Dense(units=vocab_size, activation='softmax'))

print('vocab_size:', vocab_size)
print('embedding_dim:', embedding_dim)
skipgram.summary()
SVG(model_to_dot(skipgram, show_shapes=True, dpi=65).create(prog='dot', format='svg'))

### Skipgrams: Compiling and Training
- Time to compile and train.
- We use categorical cross-entropy, a common loss for multi-class classification.
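
With a one-hot target, categorical cross-entropy reduces to the negative log of the probability the model assigns to the true context word. The sketch below computes it in plain Python. The function name and the `eps` clipping value are assumptions made for this illustration and are not read from the Keras source.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    # Clipping with eps (an assumed value) avoids log(0) for zero probabilities.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0.0, 0.0, 1.0, 0.0]         # the true context word has index 2
confident = [0.05, 0.05, 0.85, 0.05]  # most probability mass on the right word
uniform = [0.25, 0.25, 0.25, 0.25]    # no preference among the four words

loss_confident = categorical_crossentropy(y_true, confident)
loss_uniform = categorical_crossentropy(y_true, uniform)
print(loss_confident)
print(loss_uniform)
```

A model that spreads its probability uniformly over the vocabulary has a loss of log(vocabulary size), so that value is a useful reference point when reading the average loss printed during training.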

In [None]:
# Compile the Keras Model
skipgram.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the Skipgrams
for iteration in range(1, 11):
    loss, cnt = 0, 0
    for x, y in generate_data(sequences, window_size, vocab_size):
        loss += skipgram.train_on_batch(x, y)
        cnt += 1
    print('iteration {}, avg. loss is {}'.format(iteration, loss/cnt))

### Skipgrams: Looking at the vectors

To get the word vectors, we read the weights of the first (Embedding) layer.

Let's also write a function that gives the cosine similarity of two words.

In [11]:
word_vectors = skipgram.get_weights()[0]

def get_similarity(w1, w2):
    i1, i2 = tokenizer.word_index[w1], tokenizer.word_index[w2]
    v1, v2 = word_vectors[i1], word_vectors[i2]
    return np.dot(v1, v2)/(np.linalg.norm(v1)*np.linalg.norm(v2))
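
The similarity used above is cosine similarity: the dot product of two vectors divided by the product of their lengths. A minimal sketch on hand-picked toy vectors (not trained embeddings) shows how the value behaves. The function `cosine_similarity` is a plain-Python stand-in written for this illustration.

```python
import math


def cosine_similarity(v1, v2):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

The value depends only on the angle between the vectors, not on their lengths: it is close to 1 for vectors pointing the same way, 0 for orthogonal vectors, and close to -1 for opposite vectors.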

In [None]:
get_similarity('king', 'queen')

In [17]:
w1 = 'queen'
similar_words  = {word: get_similarity(w1, word) 
                  for word in tokenizer.word_index.keys() if word != w1}

In [None]:
sorted(similar_words.items(), key=lambda item: item[1])[-10:]

## <span style="color:orange">(100 points) Tune the model so that it does a better job of finding words similar to a given word</span>
### * (50 points) Show the 10 words most similar to queen, along with their similarity scores
### * (50 points) Show the 10 words least similar to queen, along with their similarity scores