<a href="https://colab.research.google.com/github/Northwind01/metaphors/blob/master/4_CBOW_fast.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CBOW

https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html

## 0. Set-up

### Imports

In [0]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [0]:
import os
import numpy as np
import _pickle as cPickle

### Get Google drive access

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Set the paths

In [0]:
root_path = 'gdrive/My Drive/metaphors/'
corpus_dir = root_path + 'data/Wikipedia/'
pickle_dir = root_path + 'data/pickles/'
CBOW_dir = root_path + 'models/CBOW/'

## 1. Get the corpus

### List of sentences

In [0]:
corpus_text_file = os.path.join(corpus_dir, 'wiki_en.txt')
# Read the whole corpus into memory; the context manager closes the file afterwards
with open(corpus_text_file, 'r') as corpus_file:
  corpus = corpus_file.read()
len(corpus)

327728373

In [0]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [0]:
sents = sent_tokenize(corpus)

In [0]:
len(sents)

2199884

In [0]:
sents[4]

'As anarchism does not offer a fixed body of doctrine from a single particular worldview, many anarchist types and traditions exist and varieties of anarchy diverge widely.'

In [0]:
sents[4][10]

's'

### List of articles (INFO)

In [0]:
import re

In [0]:
split_reg = re.compile('\n')
articles = split_reg.split(corpus)

In [0]:
articles[10]

'The \'\'\'Academy Awards\'\'\', more popularly known as \'\'\'the Oscars\'\'\', are awards for artistic and technical merit in the film industry. Given annually by the Academy of Motion Picture Arts and Sciences (AMPAS), the awards are an international recognition of excellence in cinematic achievements as assessed by the Academy\'s voting membership. The various category winners are awarded a copy of a golden statuette, officially called the "Academy Award of Merit", although more commonly referred to by its nickname "Oscar". The statuette depicts a knight rendered in Art Deco style. The award was originally sculpted by George Stanley from a design sketch by Cedric Gibbons. AMPAS first presented it in 1929 at a private dinner hosted by Douglas Fairbanks in the Hollywood Roosevelt Hotel in what would become known as the 1st Academy Awards. The Academy Awards ceremony was first broadcast by radio in 1930 and televised for the first time in 1953. It is the oldest worldwide entertainment

## 2. Tokenize the corpus

In [0]:
path = os.path.join(pickle_dir, 'CBOW', 'tokenizer.pickle')

In [0]:
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.utils import to_categorical

In [0]:
%%time
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(sents)

CPU times: user 1min 17s, sys: 175 ms, total: 1min 17s
Wall time: 1min 17s


In [0]:
%%time
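# word -> id map from the fitted tokenizer (ids start at 1); id 0 is reserved for padding
word2id = tokenizer.word_index
word2id['PAD'] = 0
id2word = {v:k for k, v in word2id.items()}
# Each sentence as a list of word ids
wids = [[word2id[w] for w in text.text_to_word_sequence(sent)] for sent in sents]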

CPU times: user 55.7 s, sys: 567 ms, total: 56.3 s
Wall time: 56.3 s


In [0]:
%%time
tok_sents = tokenizer.texts_to_sequences(sents)

CPU times: user 1min, sys: 587 ms, total: 1min 1s
Wall time: 1min 1s


## 3. Create tf records

https://medium.com/@TalPerry/getting-text-into-tensorflow-with-the-dataset-api-ffb832c8bec6

In [0]:
def create_example(sequence, target):
  '''Creates a SequenceExample from a tokenized sequence and target word'''
  ex = tf.train.SequenceExample()
  
  # Context feature: the target word id (the label)
  ex.context.feature["target"].int64_list.value.append(target[0])

  # Feature list for the sequential feature
  fl_tokens = ex.feature_lists.feature_list["seq"]

  # Prepend the start token (id 0)
  fl_tokens.feature.add().int64_list.value.append(0)

  for token in sequence[0]:
    # Add the context tokens one by one
    fl_tokens.feature.add().int64_list.value.append(token)

  # Append the end token (id 0)
  fl_tokens.feature.add().int64_list.value.append(0)
  
  return ex

In [0]:
import random

In [0]:
def create_record(sents, rec_id, window_size=2):
  '''Creates a TFRecord file from a list of tokenized sentences.

  window_size is the number of context words taken on each side of the target.
  The default of 2 is an ASSUMPTION (the value used in the tutorial linked at
  the top); this notebook never sets window_size explicitly.
  '''
  examples = []

  for sent in sents:
    sentence_length = len(sent)
    for index, word in enumerate(sent):
      start = index - window_size
      end = index + window_size + 1

      # Context = ids inside the window around the target, clipped at the
      # sentence boundaries and excluding the target itself
      context_words = [[sent[i]
                        for i in range(start, end)
                        if 0 <= i < sentence_length
                        and i != index]]
      label_word = [word]

      examples.append(create_example(context_words, label_word))

  random.shuffle(examples)

  path = os.path.join(CBOW_dir, 'examples_' + str(rec_id) + '.tfrecord')

  # TFRecordWriter opens the file itself and closes it when the block exits
  with tf.io.TFRecordWriter(path) as writer:
    for example in examples:
      writer.write(example.SerializeToString())

  print('Processed ' + path)
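
The window logic inside `create_record` can be checked in isolation. The following is a minimal pure-Python sketch of the same idea; the helper name `context_windows` and the toy word ids are hypothetical and not part of the notebook.

```python
def context_windows(sent, window_size):
    """Return (context ids, target id) pairs for one tokenized sentence."""
    pairs = []
    for index, target in enumerate(sent):
        start = max(0, index - window_size)
        end = min(len(sent), index + window_size + 1)
        # Everything inside the window except the target itself
        context = [sent[i] for i in range(start, end) if i != index]
        pairs.append((context, target))
    return pairs

# Toy sentence of five word ids (hypothetical values) with a window of 2
for context, target in context_windows([11, 12, 13, 14, 15], window_size=2):
    print(context, '->', target)
```

Only the middle word receives the full `2 * window_size` context ids; words near the sentence edges receive fewer, which is why the sequences stored in the records vary in length.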

In [0]:
len(tok_sents)

2199884

In [0]:
batch = 200000

In [0]:
for i in range(0, len(tok_sents), batch):
  # i/batch is a float, so the files are named examples_0.0.tfrecord, examples_1.0.tfrecord, ...
  create_record(tok_sents[i:i+batch], i/batch)
print('Processing complete!')

Processed gdrive/My Drive/metaphors/models/CBOW/examples_0.0.tfrecord
Processed gdrive/My Drive/metaphors/models/CBOW/examples_1.0.tfrecord
Processed gdrive/My Drive/metaphors/models/CBOW/examples_2.0.tfrecord
Processed gdrive/My Drive/metaphors/models/CBOW/examples_3.0.tfrecord
Processed gdrive/My Drive/metaphors/models/CBOW/examples_4.0.tfrecord
Processed gdrive/My Drive/metaphors/models/CBOW/examples_5.0.tfrecord
Processed gdrive/My Drive/metaphors/models/CBOW/examples_6.0.tfrecord
Processed gdrive/My Drive/metaphors/models/CBOW/examples_7.0.tfrecord
Processed gdrive/My Drive/metaphors/models/CBOW/examples_8.0.tfrecord
Processed gdrive/My Drive/metaphors/models/CBOW/examples_9.0.tfrecord
Processed gdrive/My Drive/metaphors/models/CBOW/examples_10.0.tfrecord
Processing complete!
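
The notebook writes the records but never reads them back. As a hedged sketch, a serialized `SequenceExample` with the layout used by `create_example` (a `target` context feature and a `seq` feature list) could be parsed as follows; the function name `parse_example` and the toy ids are assumptions for illustration.

```python
import tensorflow as tf

def parse_example(serialized):
    """Parse one serialized SequenceExample into (context ids, target id)."""
    context, sequence = tf.io.parse_single_sequence_example(
        serialized,
        context_features={"target": tf.io.FixedLenFeature([], tf.int64)},
        sequence_features={"seq": tf.io.FixedLenSequenceFeature([], tf.int64)},
    )
    return sequence["seq"], context["target"]

# Build a tiny example with the same layout: start token 0, two context ids, end token 0
ex = tf.train.SequenceExample()
ex.context.feature["target"].int64_list.value.append(7)
for token in [0, 3, 5, 0]:
    ex.feature_lists.feature_list["seq"].feature.add().int64_list.value.append(token)

seq, target = parse_example(ex.SerializeToString())
print(seq.numpy(), target.numpy())
```

To read the files themselves, the same function could be mapped over a `tf.data.TFRecordDataset` built from the file names. Because the sequences differ in length, `padded_batch` rather than `batch` would be needed to form batches.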


## 3. Build the CBOW model architecture

In [0]:
import tensorflow.keras.backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Lambda

In [0]:
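# Hyperparameters. embed_size and window_size are not set anywhere in this notebook;
# the values below are ASSUMPTIONS taken from the tutorial linked at the top.
vocab_size = len(word2id)  # includes the PAD entry at id 0
embed_size = 100
window_size = 2            # must match the window used to build the training examples

cbow = Sequential()
cbow.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2))
# Average the context-word embeddings into a single vector
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
# Predict the target word with a softmax over the whole vocabulary
cbow.add(Dense(vocab_size, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# View model summary (summary() prints the table itself and returns None)
cbow.summary()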
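
The model above is an embedding lookup, a mean over the context positions, and a softmax layer. A minimal numpy sketch of that forward pass, with toy sizes and random weights that are assumptions and not the notebook's values, may make the data flow easier to follow.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 10, 4                  # toy sizes, not the notebook's
E = rng.normal(size=(vocab_size, embed_size))   # embedding table (Embedding layer)
W = rng.normal(size=(embed_size, vocab_size))   # output weights (Dense layer)
b = np.zeros(vocab_size)                        # output bias

def cbow_forward(context_ids):
    """Return a probability distribution over the vocabulary for one context."""
    h = E[context_ids].mean(axis=0)      # look up and average the context embeddings
    logits = h @ W + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

probs = cbow_forward([1, 2, 4, 5])
print(probs.shape, probs.sum())
```

Training adjusts `E`, `W` and `b` so that the probability assigned to the true target word increases; the rows of `E` are the word embeddings extracted at the end of the notebook.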

In [0]:
# Visualize model structure
from IPython.display import SVG
from tensorflow.keras.utils import model_to_dot

SVG(model_to_dot(cbow, show_shapes=True, show_layer_names=False, dpi=65,
                 rankdir='TB').create(prog='dot', format='svg'))

In [0]:
tf.keras.models.save_model(
    cbow,
    CBOW_dir+'model',
    overwrite=True,
    include_optimizer=True,
    save_format=None,
    signatures=None,
    options=None
)

## 4. Train the model

https://www.tensorflow.org/tutorials/keras/save_and_load

### Save checkpoints during training

In [0]:
checkpoint_path = os.path.join(CBOW_dir, 'training_1/cp.ckpt')
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

### Actually train
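
The training loop in this section calls `generate_context_word_pairs`, which the notebook does not define. A minimal sketch of such a generator, modeled on the one in the tutorial linked at the top but written with plain numpy instead of the Keras helpers, might look like this. The left padding with id 0 and the dtypes are assumptions.

```python
import numpy as np

def generate_context_word_pairs(corpus, window_size, vocab_size):
    """Yield (x, y) for every word: padded context ids and one-hot target."""
    context_length = window_size * 2
    for words in corpus:
        for index, word in enumerate(words):
            start = max(0, index - window_size)
            end = min(len(words), index + window_size + 1)
            context = [words[i] for i in range(start, end) if i != index]

            # Left-pad the context with 0 (the PAD id) to a fixed length
            x = np.zeros((1, context_length), dtype=np.int32)
            if context:
                x[0, -len(context):] = context

            # One-hot encode the target word
            y = np.zeros((1, vocab_size), dtype=np.float32)
            y[0, word] = 1.0
            yield x, y

# Toy check with hypothetical ids
for x, y in generate_context_word_pairs([[1, 2, 3]], window_size=1, vocab_size=4):
    print(x, y)
```

Each yielded pair is a batch of one sample, which matches the `Train on 1 samples` message in the output of the training cell.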

In [0]:
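# generate_context_word_pairs is not defined in this notebook; it is assumed to be
# the generator from the tutorial linked at the top, yielding one
# (padded context, one-hot target) pair at a time.
for epoch in range(1, 6):
  loss = 0.
  i = 0
  for x, y in generate_context_word_pairs(corpus=wids, window_size=window_size, vocab_size=vocab_size):
    i += 1
    # fit() returns a History object, so read the scalar loss from it
    history = cbow.fit(x, y, epochs=1, callbacks=[cp_callback])  # Pass callback to training
    loss += history.history['loss'][0]
    if i % 10000 == 0:
      print('Processed {} (context, word) pairs'.format(i))

  print('Epoch:', epoch, '\tLoss:', loss)
  print()

Train on 1 samples

Epoch 00001: saving model to gdrive/My Drive/metaphors/models/CBOW/training_1/cp.ckpt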

In [0]:
from tensorflow.keras.models import save_model

filepath = os.path.join(CBOW_dir, 'cbow.tf')

# Save the full model (architecture, weights and optimizer state)
save_model(
    cbow,
    filepath,
    overwrite=True,
    include_optimizer=True,
    save_format=None,
    signatures=None,
    options=None
)

## 5. Get word embeddings

In [0]:
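import pandas as pd

# The first layer's weights are the embedding matrix; drop row 0 (PAD)
weights = cbow.get_weights()[0]
weights = weights[1:]
print(weights.shape)

# Row r of `weights` now belongs to word id r + 1, so label the rows in id order
pd.DataFrame(weights, index=[id2word[i] for i in range(1, len(id2word))]).head()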

In [0]:
from sklearn.metrics.pairwise import euclidean_distances

# compute pairwise distance matrix
distance_matrix = euclidean_distances(weights)
print(distance_matrix.shape)

# view contextually similar words
similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]-1].argsort()[1:6]+1] 
                   for search_term in ['god', 'jesus', 'noah', 'egypt', 'john', 'gospel', 'moses','famine']}

similar_words
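
A full pairwise distance matrix grows quadratically with the vocabulary, which may not fit in memory for a Wikipedia-sized vocabulary. As a hedged alternative, the distances can be computed only from each query word to all other words. The helper name `nearest_words` and the toy data are hypothetical; the row convention (row r holds word id r + 1) follows the `weights[1:]` slicing used above.

```python
import numpy as np

def nearest_words(embeddings, word2id, id2word, term, k=5):
    """Return the k words closest to `term`; row r of `embeddings` holds word id r + 1."""
    row = word2id[term] - 1
    # Distances from the query row to every row: O(vocab) memory, not O(vocab^2)
    dist = np.linalg.norm(embeddings - embeddings[row], axis=1)
    # Take the k + 1 closest rows and drop the query itself
    order = [int(r) for r in np.argsort(dist)[:k + 1] if r != row][:k]
    return [id2word[r + 1] for r in order]

# Toy data (hypothetical words and vectors)
toy_word2id = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
toy_id2word = {v: w for w, v in toy_word2id.items()}
toy_embeddings = np.array([[0., 0.], [1., 0.], [0., 3.], [5., 5.]])
print(nearest_words(toy_embeddings, toy_word2id, toy_id2word, 'a', k=2))  # ['b', 'c']
```

With the notebook's variables, the call would be `nearest_words(weights, word2id, id2word, 'god')`, repeated for each search term.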