# Static word embeddings

Goal: find meaningful low-dimensional vector representations for the words of a vocabulary (here, English words)

- [Word2Vec - by Google AI - 2013](https://code.google.com/archive/p/word2vec/)
  - pre-trained vectors trained on a part of the Google News dataset (about 100 billion words)
  - model contains 300-dimensional vectors for 3 million words and phrases.  
- [GloVe - Global Vectors for Word Representation - 2014](https://nlp.stanford.edu/pubs/glove.pdf)
  - mapping words into a meaningful space where the distance between words is related to semantic similarity
  - words that occur in similar contexts tend to have similar vector representations
  - builds a word-word co-occurrence matrix where each entry represents how often a word appears in the context of another word across the corpus.
- [FastText - by Facebook AI Research - 2016](https://arxiv.org/pdf/1612.03651)
  - Uses subword information: unlike Word2Vec, FastText represents each word as a combination of character n-grams (subword units)
  - Out-of-vocabulary robustness: by using subword information, FastText can create embeddings for words it hasn't seen during training by combining n-grams.
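
The subword idea can be illustrated with a minimal sketch. It assumes the scheme described in the FastText paper: the word is wrapped in `<` and `>` boundary markers and split into character n-grams (lengths 3 to 6 are the usual defaults). The helper name `char_ngrams` is hypothetical, and the real library additionally hashes the n-grams into buckets, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, using '<' and '>' as boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Trigrams only, to keep the output short
print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares many n-grams with known words,
# so its vector can be composed from already trained n-gram vectors
shared = set(char_ngrams("where")) & set(char_ngrams("wherever"))
print(sorted(shared))
```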

- We can use the `gensim` module for experiments with `word2vec` embeddings

![Embeddings](https://miro.medium.com/v2/resize:fit:1200/1*mWerYTuy9xH4SlRY9fFg1A.jpeg)

In [1]:
#!pip install gensim
from gensim.downloader import load

model = load("word2vec-google-news-300")  


We can search for similar words, ranked by cosine similarity, with the
- `model.most_similar()` function for a key or
- `model.similar_by_vector()` function for a vector

- we can look at analogies, such as:
$$\text{man} - \text{woman} \approx \text{king} - \text{queen}$$
$$\text{queen} \approx \text{king} - \text{man} + \text{woman}$$
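
The analogy arithmetic can be tried on a toy example. The 2-dimensional vectors below are made up for illustration (one axis loosely standing for "royalty", the other for "gender"), and `cosine` and `most_similar_to_vector` are hypothetical helpers. With the real model, `model.most_similar(positive=["king", "woman"], negative=["man"])` runs the same kind of query.

```python
import numpy as np

# Made-up 2-d toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {
    "king": np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man": np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.7, 0.05]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar_to_vector(query, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    scores = {w: cosine(query, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)


# queen ~ king - man + woman; the query words themselves are excluded from the result
query = vectors["king"] - vectors["man"] + vectors["woman"]
best = most_similar_to_vector(query, exclude={"king", "man", "woman"})
print(best)  # queen
```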

### FastText embeddings

We can use Facebook AI's FastText embeddings via `pip install fasttext` and downloading and unzipping the embedding matrix

In [2]:
#!pip install fasttext
#!wget -qO- https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz | gunzip > cc.en.300.bin

In [3]:
import fasttext

model = fasttext.load_model("cc.en.300.bin")

We can use 
 - `model.get_word_vector` for embeddings
 - `model.get_analogies` for analogies, for example we can look for:
   - country/capital connections
   - gender-related connections
   - singular/plural grammar-related connections
   - present/past tense grammar-related connections
   - country-specific connections
   - cardinal number-related connections (`one, two, three`)

### Sentence embeddings and Classification problems

- Word embeddings can be used to generate features from textual data:
  - we can either keep the sequence of word vectors for each sentence, or
  - we can average the word vectors to get a single fixed-length vector per sentence
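
The averaging option can be sketched with a toy lookup table. The 3-dimensional vectors and the names `toy_vectors` and `average_vector` are hypothetical stand-ins for a real embedding model. Skipping unknown words and returning a zero vector for empty input are assumptions of this sketch.

```python
import numpy as np

# Hypothetical 3-d lookup table standing in for a real embedding model
toy_vectors = {
    "free": np.array([1.0, 0.0, 0.0]),
    "entry": np.array([0.0, 1.0, 0.0]),
    "win": np.array([0.0, 0.0, 1.0]),
}
DIM = 3


def average_vector(sentence, lookup=toy_vectors, dim=DIM):
    """Average the vectors of the known words; zero vector if none is known."""
    words = [w for w in sentence.lower().split() if w in lookup]
    if not words:
        return np.zeros(dim)
    return np.mean([lookup[w] for w in words], axis=0)


features = average_vector("Free entry win win")
print(features)  # one fixed-length feature vector, whatever the sentence length
```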

In [4]:
#!wget http://rs1.sze.hu/~katihi/SMSSpamCollection

import pandas as pd

df = pd.read_csv("SMSSpamCollection", header=None, sep="\t", names=["label", "text"])
df

|  | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [5]:
from tqdm import tqdm
import numpy as np

tqdm.pandas()

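def text_to_avg_vector(sentence):
    # Assumed completion: average the FastText vectors of the whitespace-split words
    words = str(sentence).split()
    if not words:
        # Empty message: fall back to a zero vector of the embedding size
        return np.zeros(model.get_dimension())
    return np.mean([model.get_word_vector(word) for word in words], axis=0)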
    

In [6]:
from sklearn.model_selection import train_test_split



### Simple classification model training with 300 features

In [7]:
from sklearn.linear_model import LogisticRegression



In [8]:
from sklearn.ensemble import GradientBoostingClassifier
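
A possible shape for the split-and-train step, sketched on synthetic data so that it runs without the embedding model. The random 300-dimensional features are only a stand-in for averaged sentence vectors, and the class shift of 2.0 in the first 20 dimensions is an arbitrary assumption that makes the toy problem learnable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for averaged embeddings: 1000 "messages" x 300 features
n_samples, n_features = 1000, 300
y = rng.integers(0, 2, size=n_samples)        # 0 = ham, 1 = spam (toy labels)
X = rng.normal(size=(n_samples, n_features))
X[:, :20] += y[:, None] * 2.0                 # shift class 1 in the first 20 dims

# Stratified split keeps the class ratio similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

`GradientBoostingClassifier` exposes the same `fit` / `predict` interface, so it can be swapped in for `LogisticRegression` without other changes.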



### T-SNE visualization

In [9]:
from sklearn.manifold import TSNE



In [10]:
import seaborn as sns
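
A minimal sketch of the projection step, again on synthetic data: two made-up clusters in 300 dimensions stand in for the sentence vectors of the two classes. The perplexity value is an arbitrary choice for this small sample and has to stay below the number of points.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two well-separated synthetic clusters in 300 dimensions, 30 points each
cluster_a = rng.normal(loc=0.0, size=(30, 300))
cluster_b = rng.normal(loc=3.0, size=(30, 300))
X = np.vstack([cluster_a, cluster_b])
labels = np.array(["ham"] * 30 + ["spam"] * 30)  # toy labels for colouring

# Project to 2 dimensions; perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)
```

The result can then be plotted, for example with `sns.scatterplot(x=X_2d[:, 0], y=X_2d[:, 1], hue=labels)`.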



## Dynamic embeddings - Transformer models

- Static embeddings assign one fixed vector to each word, so they cannot really capture the context or longer-range relations in the text
- Solution: attention-based Transformer models with tokenization

Tokenization:

- Vocabulary sizes of some popular models are roughly:
  - `BERT` - 30k
  - `llama-2` - 32k
  - `GPT3` - 50k

- Embedding dimensions:
  - `BERT` - 768 (BERT-base)
  - `llama` - 4096 (7B model)
  - `GPT3` - 12288 (175B model)
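
To get a feeling for these numbers, the size of the token-embedding table alone is roughly vocabulary size times embedding dimension. The sketch below uses the rounded figures from the lists above, so the results are only approximate.

```python
# (vocabulary size, embedding dimension), rounded values from the lists above
models = {
    "BERT": (30_000, 768),
    "llama": (32_000, 4096),
    "GPT3": (50_000, 12288),
}

# One row per token, one column per embedding dimension
embedding_params = {name: vocab * dim for name, (vocab, dim) in models.items()}

for name, count in embedding_params.items():
    print(f"{name}: about {count / 1e6:.1f}M parameters in the token-embedding table")
```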

Important tricks in the embedding: the embedding of a token can change depending on
 - the position of the token (positional encoding)
 - the context of the token, i.e., the same token in the same position can get a different embedding in a different context
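
As an illustration of the positional part, here is a sketch of the fixed sinusoidal encoding from the original Transformer paper. This is one possible scheme, chosen here as an assumption: BERT, for example, learns its position embeddings instead. The function name `sinusoidal_positions` is hypothetical and `d_model` is assumed to be even.

```python
import numpy as np


def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding of shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding


encoding = sinusoidal_positions(seq_len=4, d_model=8)

# The same token vector gives different model inputs at different positions
token_vector = np.ones(8)
input_at_pos0 = token_vector + encoding[0]
input_at_pos3 = token_vector + encoding[3]
print(np.allclose(input_at_pos0, input_at_pos3))  # False
```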

- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)

![Attention](https://miro.medium.com/v2/resize:fit:1400/1*eincSo4zd1LxcOrdbK_yLw.png)

- [Attention is All You Need](https://arxiv.org/pdf/1706.03762)

## Very Important Trick: Self-Attention!

![Transformer](https://arxiv.org/html/1706.03762v7/extracted/1706.03762v7/Figures/ModalNet-21.png)

![K-Q_V](https://www.lamsalashish.com.np/assets/blog/attention-is-all-you-need/image012.png)

Query - Key - Value matrices:
  - Query (Q): Represents the current token seeking information from other tokens
  - Key (K): Represents the tokens available to provide information
  - Value (V): Contains the actual information or content of the tokens

- similarity scores are computed between the Query and each Key
- these scores determine the attention weights, the relevance of each token
- final output: a weighted sum of the Value vectors, guided by the attention weights
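
These three steps can be sketched in a few lines of NumPy. This is a single-head version without masking; the division by the square root of the key dimension follows the "Attention is All You Need" paper. The random projection matrices `W_q`, `W_k`, `W_v` are placeholders for weights that would normally be learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return the attention output and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the value vectors


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 tokens with 8-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # placeholder weights

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape, weights.shape)
```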

![QKV](https://cdn-ilcabpl.nitrocdn.com/XTpGTaZWYQSxctfMHQPVOQKOsBspWTQi/assets/desktop/optimized/rev-a0beb67/lh3.googleusercontent.com/mOIfKc2jQ1pEZUhCMktSLcvZPBfEUAMGL-8Qp7sgE6f23S4i5tPXTN43GazvcjRywgE38t9ghmxt7nOqI-AgRLq9MVXFDqf7VuHm00aV9_6ofYqCpRMA_lTk1DOA-HFO1VCQ2uCjGOpXVEk72-nrzf7059KsCdOVfQWqxn4STCsKUDDOMpk0WlkWQt6VzQ)

### BERT - Bidirectional Encoder Representations from Transformers

![BERT](https://upload.wikimedia.org/wikipedia/commons/b/b5/BERT_embeddings_01.png)


Pre-training phase of BERT:

- Masked Language Modelling:
  - 15% of the tokens are randomly selected for the masked-prediction task
  - training objective: predict the original token at each selected position given its context
  - each selected token is
    - replaced with a [MASK] token with probability 80%,
    - replaced with a random word token with probability 10%,
    - not replaced with probability 10%.
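
The selection and replacement rule can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: each token is selected independently with probability 15%, and the function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted tokens, targets); targets are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            corrupted.append(token)
            targets.append(None)        # not part of the prediction task
            continue
        targets.append(token)           # the model has to recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")            # 80%: mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: random token
        else:
            corrupted.append(token)               # 10%: left unchanged
    return corrupted, targets


sentence = "the cat sat on the mat and looked at the dog".split()
corrupted, targets = mask_tokens(sentence, vocab=sentence, seed=1)
print(corrupted)
print(targets)
```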

![MLM](https://upload.wikimedia.org/wikipedia/commons/d/d1/BERT_masked_language_modelling_task.png)

- Next sentence prediction:
  - given two spans of text, the model predicts if these two spans appeared sequentially in the training corpus
  - outputs: either [IsNext] or [NotNext].



BERT was trained on the BookCorpus (800M words) and a filtered version of English Wikipedia (2,500M words) without lists, tables, and headers.
- Training BERT-BASE on 4 Cloud TPUs (16 TPU chips in total) took 4 days, at an estimated cost of 500 USD.
- Training BERT-LARGE on 16 Cloud TPUs (64 TPU chips in total) took 4 days.



#### Different Transformer architectures

- Encoder-only models (BERT) work well on classification problems, where we can look at the whole text at once
- Decoder-only models (GPT) work well for text generation (chatbots), where we have to predict the next word
- Encoder-decoder models (T5) work well for translation tasks, where
  - we have to predict the next word in the target language
  - but we can also look at the whole input text in the source language
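
One way to picture the difference is through the attention mask. In the sketch below (the helper name `attention_mask` is hypothetical), an encoder lets every position attend to every other position, while a decoder uses a causal, lower-triangular mask so that a position only sees itself and earlier positions.

```python
import numpy as np


def attention_mask(seq_len, causal):
    """1 = the position may be attended to, 0 = hidden."""
    full = np.ones((seq_len, seq_len), dtype=int)
    return np.tril(full) if causal else full


encoder_mask = attention_mask(4, causal=False)  # bidirectional: sees the whole text
decoder_mask = attention_mask(4, causal=True)   # causal: sees only the past

print(encoder_mask)
print(decoder_mask)
```

Row `i` of the mask lists which positions token `i` may use: the first token of a decoder sees only itself, the last one sees the whole prefix.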
    
![Encoder/Decoder](http://rs1.sze.hu/~katihi/encoderdecoder.png)

## Most of the functionalities are available via the `transformers` module both in PyTorch and TensorFlow

- `pip install transformers`
- we can simply load a tokenizer with `BertTokenizer.from_pretrained()`


In [13]:
# pip install transformers

from transformers import BertTokenizer, TFBertForSequenceClassification
from tensorflow.keras.optimizers import Adam

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

max_length = 200

#### Let's look at some tokenization 

- we can just apply `tokenizer()`, where we can specify:
    - truncation
    - padding (`max_length` for example)
    - max_length
    - tensor format (`tf`)
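
With the real tokenizer, these options correspond to a call such as `tokenizer(texts, truncation=True, padding="max_length", max_length=max_length, return_tensors="tf")`. What truncation and padding do to a single sequence can be sketched in plain Python. The helper `pad_or_truncate` is hypothetical and the token IDs are only illustrative. The real tokenizer also keeps the closing `[SEP]` token when truncating, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut the sequence to max_length, then pad it and build the attention mask."""
    ids = list(token_ids)[:max_length]   # truncation
    n_pad = max_length - len(ids)        # number of padding positions
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 0 marks padding
    }


short = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
long = pad_or_truncate(range(1, 11), max_length=6)
print(short)
print(long)
```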

## BERT Fine-tuning

- for classification tasks we can build on the pre-trained BERT model with `TFBertForSequenceClassification`
- we should specify a small learning rate in the optimizer
- use the SparseCategoricalCrossentropy loss:
  - it handles multi-class classification problems where the target labels are integer class indices instead of one-hot vectors
  - `from_logits=True`: the model outputs raw logits (the `softmax` has not been applied yet), so the loss applies it internally
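
What this loss computes can be sketched in NumPy. This is an illustration of the formula under the assumption of mean reduction over the batch, not the Keras implementation. The function name is hypothetical.

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean cross-entropy for integer labels, computed from raw logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))  # log-softmax
    picked = log_probs[np.arange(len(labels)), labels]    # log-probability of the true class
    return float(-picked.mean())


logits = np.array([[2.0, 0.0],   # fairly confident in class 0
                   [0.0, 0.0]])  # undecided
labels = np.array([0, 1])        # integer class indices, no one-hot encoding

loss = sparse_categorical_crossentropy_from_logits(logits, labels)
print(round(loss, 4))  # 0.41
```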

In [None]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy

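# Disabled by default; switch to `if 1:` to run the fine-tuning.
# Assumes `train_encodings` / `test_encodings` were produced by `tokenizer(...)` with
# return_tensors="tf", and `train_labels` / `test_labels` hold integer class labels.
if 0: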
    model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
        
    model.compile(optimizer=Adam(learning_rate=3e-5), 
                  loss=SparseCategoricalCrossentropy(from_logits=True), 
                  metrics=["accuracy"])
    
    history = model.fit(
        x={"input_ids": train_encodings["input_ids"], "attention_mask": train_encodings["attention_mask"]},
        y=train_labels,
        validation_data=(
            {"input_ids": test_encodings["input_ids"], "attention_mask": test_encodings["attention_mask"]},
            test_labels
        ),
        epochs=3,
        batch_size=32
    )