<a href="https://colab.research.google.com/github/Otobi1/Back-to-Basics-A-Refresher-/blob/master/Back_to_Basics_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Embeddings 
# - these can capture the contextual, semantic and syntactic meaning in data.

# While one-hot encoding preserves structural information, it has two disadvantages:
# 1. the length of each vector grows linearly with the number of unique tokens in the vocab, which is a problem for a large corpus
# 2. the representation of each token preserves no relationship with respect to other tokens

# Embeddings address the shortcomings of one-hot encoding
# - the main idea is to have fixed-length representations for the tokens in a text, regardless of the number of tokens in the vocab
# - with one-hot encoding, each token is represented by an array of size vocab size, but with embeddings, each token has the shape embed dim
# - the values in the representation are not fixed binary values but learned floating-point numbers, allowing for fine-grained representations

# The objective here is to represent tokens in a way that captures their intrinsic semantic relationships,
# leveraging the low dimensionality while keeping the token representations interpretable.
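
# To make the size difference concrete, here is a minimal sketch comparing the two representations.
# The toy vocab, EMBED_DIM = 3 and the random embedding table are assumptions for illustration only;
# in a real model the table values are learned rather than drawn at random.

```python
import numpy as np

# Hypothetical toy vocabulary and sizes, chosen only for illustration
vocab = ["harry", "potter", "snape", "wand", "scar"]
VOCAB_SIZE = len(vocab)
EMBED_DIM = 3
token_to_index = {token: i for i, token in enumerate(vocab)}

# One-hot: one row per token, each of length VOCAB_SIZE
one_hot = np.eye(VOCAB_SIZE)

# Embedding table: one row per token, each of length EMBED_DIM
# (random here; these floats would normally be learned during training)
rng = np.random.default_rng(1234)
embedding_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

idx = token_to_index["snape"]
one_hot_vector = one_hot[idx]
embedding_vector = embedding_table[idx]

# Looking up a row is equivalent to multiplying the one-hot vector by the table
assert np.allclose(one_hot_vector @ embedding_table, embedding_vector)

print(one_hot_vector.shape)    # (5,) grows with the vocab
print(embedding_vector.shape)  # (3,) stays fixed at EMBED_DIM
```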

In [3]:
# Learning Embeddings 
# - We can learn embeddings by creating our own model in PyTorch, but first, we're going to use Gensim, a library that specialises in embeddings and topic modelling

import nltk
nltk.download('punkt');
import numpy as np
import re
import urllib.request # urlopen lives in the urllib.request submodule, which must be imported explicitly

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
SEED = 1234

In [5]:
# Set seed for reproducibility 

np.random.seed(SEED)

In [6]:
# Split text into sentences 

tokeniser = nltk.data.load("tokenizers/punkt/english.pickle")
book = urllib.request.urlopen(url = "https://raw.githubusercontent.com/GokuMohandas/madewithml/main/datasets/harrypotter.txt")
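# Note: str() on the raw bytes keeps the b'...' wrapper and leaves newlines as the two literal characters \n;
# book.read().decode("utf-8") would give clean text, but may change the sentence count printed below.
sentences = tokeniser.tokenize(str(book.read()))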
print (f"{len(sentences)} sentences")

12443 sentences


In [7]:
def preprocess(text):
  """Conditional preprocessing on our text."""
  # Lower
  text = text.lower()

  # Spacing and filters 
  text = re.sub(r"([-;.,!?<=>])", r" \1 ", text) # pad punctuation with spaces
  text = re.sub('[^A-Za-z0-9]+', ' ', text) # remove non alphanumeric characters 
  text = re.sub(' +', ' ', text) # remove multiple spaces
  text = text.strip()

  # Separate into word tokens
  text = text.split(" ")

  return text

In [8]:
# Preprocess sentences
print (sentences[11])
sentences = [preprocess(sentence) for sentence in sentences]
print (sentences[11])

Snape nodded, but did not elaborate.
['snape', 'nodded', 'but', 'did', 'not', 'elaborate']


In [10]:
# How do we learn the embeddings in the first place?
# The intuition behind embeddings is that the definition of a token depends NOT on the token itself, but on its context.
# - There are several ways of doing this:
# -- given the context words, predict the target word (CBOW - continuous bag of words)
# -- given the target word, predict the context words (skip-gram)
# -- given a sequence of words, predict the next word (LM - language modelling)

# All these approaches involve creating data to train the model on.
# Every word in a sentence becomes the target word, and its context words are determined by a window.
# We repeat this for every sentence in the corpus, which gives us the training data for the unsupervised task.

# The idea is that similar target words will appear with similar contexts, and we can learn this relationship by repeatedly training our model on (context, target) pairs.
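
# The windowing step described above can be sketched as follows.
# make_pairs is a hypothetical helper written for this note, and it uses a fixed symmetric window,
# which is a simplification and not Gensim's internal code.

```python
def make_pairs(tokens, window):
    """Return (target, context) pairs for every token in one sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` tokens on each side of the target
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["snape", "nodded", "but", "did", "not", "elaborate"]
pairs = make_pairs(tokens, window=1)
print(pairs[:3])
# [('snape', 'nodded'), ('nodded', 'snape'), ('nodded', 'but')]
```

# Skip-gram would use each pair as (input, label); CBOW would group all context words of a target into one input.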

In [11]:
# Word2Vec

# Working with large vocabs to learn embeddings can become complicated quickly.
# Here, we can use negative sampling, which only updates the correct class and a few randomly sampled incorrect classes.
# We can do this because of the large amount of training data, where we will see the same word as the target class multiple times.

import gensim
from gensim.models import KeyedVectors
from gensim.models import Word2Vec
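
# As a rough sketch of negative sampling, the loss for one (target, context) pair can be written with
# one positive term and a handful of negative terms, instead of a score for every word in the vocab.
# The function name, toy sizes and random vectors below are assumptions for illustration; this is not Gensim's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one (target, context) pair with k sampled negative vectors."""
    # Push the true context word towards a high score...
    positive_score = sigmoid(context_vec @ target_vec)
    # ...and each sampled incorrect word towards a low score
    negative_scores = sigmoid(-(negative_vecs @ target_vec))
    return -np.log(positive_score) - np.sum(np.log(negative_scores))

rng = np.random.default_rng(1234)
EMBED_DIM, NUM_NEGATIVES = 8, 5  # hypothetical toy sizes
target_vec = rng.normal(scale=0.1, size=EMBED_DIM)
context_vec = rng.normal(scale=0.1, size=EMBED_DIM)
negative_vecs = rng.normal(scale=0.1, size=(NUM_NEGATIVES, EMBED_DIM))

loss = negative_sampling_loss(target_vec, context_vec, negative_vecs)
print(loss)
```

# Only 1 + NUM_NEGATIVES output vectors take part in each update, however large the vocab is.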

In [12]:
EMBEDDING_DIM = 100
WINDOW = 5
MIN_COUNT = 3 # ignores all words with a total frequency lower than this
SKIP_GRAM = 1 # 1 = skip-gram, 0 = CBOW
NEGATIVE_SAMPLING = 20 # number of negative ("noise") words sampled per positive example

In [14]:
# Super fast because of optimised C code under the hood
# Note: `size` is the gensim 3.x argument name; gensim 4.0+ renamed it to `vector_size`

w2v = Word2Vec(sentences = sentences, size = EMBEDDING_DIM, window = WINDOW,
               min_count = MIN_COUNT, sg = SKIP_GRAM, negative = NEGATIVE_SAMPLING)
print (w2v)

Word2Vec(vocab=4937, size=100, alpha=0.025)


In [15]:
# Vector for each word
w2v.wv.get_vector("potter")

array([ 1.14101067e-01, -1.86282560e-01, -1.96112245e-01, -4.37779844e-01,
       -2.18696296e-01,  6.13153130e-02, -2.68156797e-01,  5.49259856e-02,
        3.87510538e-01,  3.57815892e-01, -4.57724869e-01, -2.98481286e-01,
       -2.44551778e-01, -1.48118595e-02, -1.39958128e-01, -3.70013803e-01,
        1.70049042e-01,  1.82302162e-01, -5.09113669e-01,  1.95511967e-01,
       -2.11835250e-01, -3.96204233e-01, -1.62837163e-01,  2.33413607e-01,
       -2.88501829e-01,  5.16816020e-01, -2.68738568e-02,  2.61840940e-01,
       -6.31084368e-02, -2.39665210e-01,  5.61609209e-01,  3.76967430e-01,
       -5.10030724e-02, -2.95698971e-01, -3.10349941e-01,  2.53110495e-03,
        2.99663901e-01, -2.41563305e-01, -2.20411971e-01, -1.10006832e-01,
       -9.20895562e-02, -1.83540471e-02, -1.59294441e-01, -2.67816428e-02,
        2.46242791e-01,  3.71243924e-01, -8.76362324e-02, -1.73952013e-01,
        4.29101586e-02, -4.38437164e-01,  1.54941484e-01,  3.31484169e-01,
        7.01173991e-02,  

In [16]:
# Get nearest neighbours (excluding itself)

w2v.wv.most_similar(positive = "scar", topn = 5)

[('pain', 0.9308451414108276),
 ('forehead', 0.9219275116920471),
 ('prickling', 0.9136929512023926),
 ('cold', 0.9049762487411499),
 ('mouth', 0.9016754031181335)]
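
# The scores returned by most_similar are cosine similarities between word vectors.
# A minimal sketch of that ranking, using hypothetical 3-dimensional toy vectors rather than the trained model:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbours(query, vectors, topn=2):
    """Rank every other token by cosine similarity to the query token."""
    scores = [
        (token, cosine_similarity(vectors[query], vec))
        for token, vec in vectors.items()
        if token != query  # exclude the query itself
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, for illustration only
vectors = {
    "scar": np.array([1.0, 0.9, 0.0]),
    "pain": np.array([0.9, 1.0, 0.1]),
    "forehead": np.array([0.8, 0.7, 0.3]),
    "broom": np.array([0.0, 0.1, 1.0]),
}

neighbours = nearest_neighbours("scar", vectors, topn=2)
print(neighbours)
```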

In [17]:
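# Saving and loading 
# Note: load_word2vec_format returns a KeyedVectors object (vectors only, no training state), so `w2v` is no longer a full Word2Vec model after this cell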

w2v.wv.save_word2vec_format("model.bin", binary = True)
w2v = KeyedVectors.load_word2vec_format("model.bin", binary = True)

In [18]:
# FastText

# What happens if a word doesn't exist in the vocab?
# We could assign an "UNK" (unknown) token, which is used for all OOV (out-of-vocab) words, or we could use FastText, which uses character-level n-grams to embed a word.
# This helps embed rare words, misspelled words and also words that don't exist in our corpus but are similar to words in our corpus.
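
# A minimal sketch of how a word can be broken into character n-grams.
# char_ngrams is a hypothetical helper; the "<" and ">" boundary markers and the 3 to 6 length range
# follow FastText's defaults (min_n = 3, max_n = 6), but this is not Gensim's internal code.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams("scar", min_n=3, max_n=4))
# ['<sc', 'sca', 'car', 'ar>', '<sca', 'scar', 'car>']
```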

In [19]:
from gensim.models import FastText

In [20]:
# Super fast because of the optimised C code under the hood
# Note: `size` is the gensim 3.x argument name; gensim 4.0+ renamed it to `vector_size`

ft = FastText(sentences = sentences, size = EMBEDDING_DIM, window = WINDOW, 
              min_count = MIN_COUNT, sg = SKIP_GRAM, negative = NEGATIVE_SAMPLING)
print (ft)

FastText(vocab=4937, size=100, alpha=0.025)


In [22]:
# This word doesn't exist in the vocab, so the word2vec model will raise a KeyError

# w2v.wv.most_similar(positive = "scarring", topn = 5) # uncomment to check

In [23]:
# FastText on the other hand will use n-grams to embed an OOV word

ft.wv.most_similar(positive = "scarring", topn = 10)

[('lightning', 0.971960186958313),
 ('shimmering', 0.9709751605987549),
 ('prickling', 0.9709533452987671),
 ('trembling', 0.9704004526138306),
 ('shivering', 0.9698933959007263),
 ('clearing', 0.9686659574508667),
 ('glittering', 0.9681605100631714),
 ('bearing', 0.9673503637313843),
 ('fluttering', 0.9671472311019897),
 ('sparkling', 0.9658246040344238)]
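
# One plausible reason so many of the neighbours above end in "ing" is that they share character n-grams with "scarring".
# The sketch below only counts shared n-grams using a hypothetical helper; it does not reproduce
# Gensim's n-gram hashing or vector arithmetic.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word wrapped in "<" and ">"."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

query = char_ngrams("scarring")
shared = {}
for neighbour in ["prickling", "bearing", "scar"]:
    shared[neighbour] = sorted(query & char_ngrams(neighbour))
    print(neighbour, shared[neighbour])

# prickling ['ing', 'ing>', 'ng>']
# bearing ['ing', 'ing>', 'ng>', 'rin', 'ring', 'ring>']
# scar ['<sc', '<sca', '<scar', 'car', 'sca', 'scar']
```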

In [24]:
# Save and load 
# Note: this reuses the filename "model.bin", so it overwrites the Word2Vec vectors saved earlier

ft.wv.save("model.bin")
ft = KeyedVectors.load("model.bin")