# Text Embeddings

 - Text has no inherent semantic distance between words; character-level measures such as the Hamming distance do not capture meaning.
 - Given the texts "This is the first document" and "Is this the first document?", how similar are they?
 - Transform text (Graph) into vectors!

## Bag-Of-Words 
 - Idea: create a vocabulary array of size x containing every distinct word in the corpus
 - For each of the y documents: count how often each word occurs
 - Result: a y×x matrix (one row per document, one column per word) containing the embeddings
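
The three steps above can be sketched in plain Python. This is a minimal illustration, not how `CountVectorizer` is implemented; the lowercasing and the two-or-more-character word pattern are assumptions chosen to mimic its default behaviour, and `docs`, `tokenize`, `vocabulary` and `bow_matrix` are names made up for this sketch.

```python
import re
from collections import Counter

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Assumption: lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct word, sorted so that the column order is stable.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

# One row per document, one column per vocabulary word.
bow_matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```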

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

### Our Documents

In [3]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

### Vectorizer counts occurrences in the corpus

In [4]:
bow_vectoricer: CountVectorizer = CountVectorizer()
bow_embedding = bow_vectoricer.fit_transform(corpus)

#### Returns the Feature names

In [5]:
bow_vectoricer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [6]:
bow_embedding.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
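
One possible way to answer the opening question of how similar two documents are is the cosine similarity between their count vectors. The sketch below hard-codes the matrix printed above so that it runs on its own; `cosine_similarity` here is a hand-written helper for illustration, not the scikit-learn function of the same name.

```python
import math

# Count matrix copied from the output above (rows = documents).
bow = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm

print(cosine_similarity(bow[0], bow[3]))  # first vs. fourth document
print(cosine_similarity(bow[0], bow[1]))  # first vs. second document
```

The first and fourth documents get a similarity of 1 because their count vectors are identical: bag-of-words ignores word order and punctuation.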

## TF-IDF

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_embeddings = tfidf_vectorizer.fit_transform(corpus)

In [9]:
tfidf_vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [10]:
tfidf_embeddings.toarray()

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])
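
The values above can be reproduced by hand. The sketch assumes scikit-learn's documented defaults for `TfidfVectorizer`: raw counts as term frequency, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The count matrix is copied from the Bag-Of-Words output so the block runs on its own, and the helper names are made up for illustration.

```python
import math

# Count matrix from the Bag-Of-Words section (rows = documents).
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each word occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    # Weight each count by its idf, then scale the row to unit length.
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
print([round(v, 8) for v in tfidf[0]])
```

Words that occur in every document ('is', 'the', 'this') receive the lowest weight, while rarer words such as 'first' or 'second' are weighted up.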

## Word2Vec

 - Get the co-occurrence of words in the documents.
 - The more often two words co-occur, the more similar their vectors should be.
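
To make the notion of co-occurrence concrete, the sketch below counts how often two words appear within two positions of each other. It is only an illustration of the idea: Word2Vec does not store such a table but trains vectors on (word, context word) pairs drawn from the same kind of window. The lowercased tokenization and the names `docs`, `window` and `cooccurrence` are assumptions for this example.

```python
import re
from collections import Counter

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
window = 2  # maximum distance at which two words still count as co-occurring

cooccurrence = Counter()
for doc in docs:
    words = re.findall(r"\w+", doc.lower())
    for i, word in enumerate(words):
        # Look at the neighbours to the right, up to `window` positions away.
        for other in words[i + 1 : i + 1 + window]:
            pair = tuple(sorted((word, other)))  # unordered pair
            cooccurrence[pair] += 1

print(cooccurrence.most_common(5))
```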

In [11]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

### Tokenize
 - We want the embeddings of the words
 - Thus we need to split the sentences into words

In [12]:
tokens = [document.split(" ") for document in corpus]

In [13]:
tokens

[['This', 'is', 'the', 'first', 'document.'],
 ['This', 'document', 'is', 'the', 'second', 'document.'],
 ['And', 'this', 'is', 'the', 'third', 'one.'],
 ['Is', 'this', 'the', 'first', 'document?']]
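
Note that splitting on spaces keeps case and punctuation, so `'document'`, `'document.'` and `'document?'` become three different vocabulary entries. A possible alternative, shown only as a sketch, lowercases the text and keeps only word characters. `normalized_tokens` is a hypothetical name, and the embeddings printed in the rest of this section were still produced from the unnormalized `tokens`.

```python
import re

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Lowercase each document and extract runs of word characters,
# which drops the trailing '.' and '?'.
normalized_tokens = [re.findall(r"\w+", doc.lower()) for doc in docs]

for sentence in normalized_tokens:
    print(sentence)
```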

In [14]:
word2vec = Word2Vec(sentences=tokens, vector_size=9, window=2, min_count=1, sg=1)

In [15]:
w2v_embedding = word2vec.wv

In [26]:
w2v_embedding

<gensim.models.keyedvectors.KeyedVectors at 0x7fba6c2ede80>

In [27]:
vocab = {word for line in tokens for word in line}

In [17]:
for word in vocab:
    if word in w2v_embedding:
        print(f"{word}: {w2v_embedding[word]}")

this: [-0.09205794 -0.10498687  0.08124185  0.05633624  0.07508548  0.00847628
  0.07056545 -0.0378374  -0.01051557]
is: [-0.04181524  0.08200561 -0.01703857 -0.05040681  0.07282279 -0.05400178
 -0.02017797  0.031962    0.01102082]
one.: [-0.01615424 -0.10232265  0.04856641  0.00635378  0.08269591 -0.00903735
 -0.02931856 -0.09726511 -0.00951822]
Is: [-0.08536667 -0.01677566  0.02745128 -0.00986536  0.06150129 -0.03047973
  0.02510632  0.06060106  0.09273084]
third: [ 0.03140464  0.0599922   0.07838441 -0.06337231  0.02067937  0.06765798
 -0.05332528 -0.03455647  0.07553378]
This: [-0.0032909  -0.08512489  0.10683048  0.0553562   0.10259048 -0.09064353
  0.04995332 -0.04596751  0.00916151]
document?: [ 0.09443673 -0.04957192  0.05022183 -0.07542852 -0.03941385  0.10444671
 -0.01754775  0.00355158 -0.04598591]
second: [ 0.07102815 -0.09577431  0.04073042  0.05766537  0.06379931  0.08296576
 -0.06852973  0.0122846   0.06719203]
document.: [ 0.06409526 -0.08357375 -0.04373448 -0.08346203 

## BERT Embeddings

In [18]:
from transformers import BertModel, BertTokenizer
import torch

In [19]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

In [20]:
input = tokenizer(corpus, return_tensors="pt", padding=True, truncation=True)


In [21]:
input

{'input_ids': tensor([[ 101, 2023, 2003, 1996, 2034, 6254, 1012,  102,    0],
        [ 101, 2023, 6254, 2003, 1996, 2117, 6254, 1012,  102],
        [ 101, 1998, 2023, 2003, 1996, 2353, 2028, 1012,  102],
        [ 101, 2003, 2023, 1996, 2034, 6254, 1029,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0]])}

In [None]:
with torch.no_grad():
    outputs = model(**input)
word_embeddings = outputs.last_hidden_state

#### Embedding of the first token (`[CLS]`, id 101) in the first sentence:

In [29]:
print(len(word_embeddings[0,0]))

768
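
BERT returns one 768-dimensional vector per token, so comparing whole documents needs one vector per sentence. One common option, sketched here as an assumption rather than something the notebook does, is to average the token vectors while skipping padding positions. To stay self-contained, the block uses NumPy with random numbers standing in for `outputs.last_hidden_state`; the attention mask is copied from the tokenizer output above, and `masked_mean_pool` is a hypothetical helper.

```python
import numpy as np

# Attention mask from the tokenizer output (1 = real token, 0 = padding).
attention_mask = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0],
])

# Stand-in for outputs.last_hidden_state with shape (sentences, tokens, hidden).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(4, 9, 768))

def masked_mean_pool(embeddings, mask):
    # Zero out padding positions, then divide by the number of real tokens.
    mask = mask[:, :, None].astype(embeddings.dtype)
    summed = (embeddings * mask).sum(axis=1)
    counts = mask.sum(axis=1)
    return summed / counts

sentence_embeddings = masked_mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings.shape)  # (4, 768)
```

The same arithmetic would apply to the real tensors after converting them to arrays; other pooling choices, such as using the `[CLS]` vector alone, are equally possible.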
