# Main breakthroughs in Word embeddings, from Word Counts to LLMs:

**Low-tech beginnings:**

* **Word Counts:** Basic frequency of words in a corpus served as initial representations. Useful for simple tasks but ignored semantic relationships.
* **Term Frequency (TF):** Normalizes a word's raw count by the length of the document, so the weight reflects how often the word appears relative to the document's size. This basic approach laid the foundation for more sophisticated techniques.
* **N-grams:** These are sequences of n consecutive words (e.g., bigrams for pairs, trigrams for triplets). By analyzing n-gram frequencies, we can capture some local context and word relationships, going beyond individual word counts.


**Distributional Semantics:**

* **Word2Vec (2013):** First major breakthrough. Learned word embeddings by predicting surrounding words, capturing semantic similarities.
* **GloVe (2014):** Learns embeddings from global word co-occurrence statistics over the whole corpus rather than from local context windows alone.
* **FastText (2016):** Incorporates subword information, handling rare words and morphological variants.

**Contextual Embeddings:**

* **ELMo (2018):** Uses bi-directional LSTMs to capture word meaning based on surrounding context.
* **BERT (2018):** Pre-trained transformer model on large unlabeled text, learning contextualized representations.
* **XLNet (2019):** Replaces BERT's masked language modeling with permutation language modeling, an autoregressive objective that still captures bidirectional context.

**Towards Understanding and Generation:**

* **GPT-3 (2020):** Generative Pre-trained Transformer 3, a large language model (LLM) with impressive text generation capabilities.
* **LaMDA (2021):** Language Model for Dialogue Applications, focuses on factual consistency and grounding in conversation.
* **PaLM (2022):** Pathways Language Model, pushes the boundaries of LLM size and performance, demonstrating progress in reasoning and question answering.

# Import Libraries

In [1]:
import pandas as pd
import math
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import ngrams
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist, MLEProbDist
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import keras

**Define Documents list**

In [2]:
Doc_1= "The cat in the hat"
Doc_2= "The quick brown fox"
Doc_3= "The hat is blue"

Docs =[Doc_1,Doc_2,Doc_3]

# **Term Frequency (TF)**
* **TF(t,d) is the term frequency of term t in document d (how often the term appears in the document).**

In [3]:
#get distinct words for each document
lst = []
for d in Docs:
    lst.extend(d.lower().split(' '))
wrds = set(lst) # remove duplicate words
wrds

{'blue', 'brown', 'cat', 'fox', 'hat', 'in', 'is', 'quick', 'the'}

In [4]:
#form a dataframe to represent TF for each word in each Document where columns are words and rows are documents
def count_wrd_Doc(wrd,doc):
    i=0
    for w in doc.lower().split(' '):
        if wrd == w:
            i = i+1
    return i/len(doc.lower().split(' '))
    
tf_df = pd.DataFrame(columns=list(wrds)) #empty dataframe initialized with words column headers
freq_lst=[] #empty list for each column to save word frequencies in each document
for c in tf_df.columns:
    freq_lst=[]#empty the list
    for d in Docs:
        freq_lst.append(count_wrd_Doc(c,d))#append the frequency of word in document d
    tf_df[c]=freq_lst #assign values to column
tf_df #display the dataframe of TF for each word in each document

Unnamed: 0,brown,in,quick,is,blue,fox,hat,the,cat
0,0.0,0.2,0.0,0.0,0.0,0.0,0.2,0.4,0.2
1,0.25,0.0,0.25,0.0,0.0,0.25,0.0,0.25,0.0
2,0.0,0.0,0.0,0.25,0.25,0.0,0.25,0.25,0.0


**Document Frequency (DF)**
* **Document Frequency DF(t) is the number of documents in which the term t appears.**

In [5]:
df_df = pd.DataFrame(columns=list(wrds)) #empty dataframe initialized with words column headers
for c in df_df.columns:
    df_df[c] = [sum(1 for doc in Docs if c in doc.lower().split(' '))]
df_df #display the dataframe of DF for each word 

Unnamed: 0,brown,in,quick,is,blue,fox,hat,the,cat
0,1,1,1,1,1,1,2,3,1


**Inverse Document Frequency (IDF)**
* **IDF(t,D) is the inverse document frequency of term t in the entire document set D (logarithmically scaled inverse fraction of the documents that contain the term).**

In [6]:
idf_df = pd.DataFrame(columns=list(wrds)) #empty dataframe initialized with words column headers
for c in idf_df.columns:
    N = 3 #No of documents
    df = df_df[c].iloc[0] # DF of word
    idf_df[c] = [math.log((N+1) / (df+1))+1]#smoothed IDF = ln((N+1)/(DF(word)+1)) + 1, the same form scikit-learn uses with smooth_idf=True
idf_df #display the dataframe of idf for each word 

Unnamed: 0,brown,in,quick,is,blue,fox,hat,the,cat
0,1.693147,1.693147,1.693147,1.693147,1.693147,1.693147,1.287682,1.0,1.693147
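
The values above follow from the smoothed formula used in the cell, written here with the natural logarithm (which is what `math.log` computes), where N is the number of documents and DF(t) is the document frequency of term t:

$$
\mathrm{IDF}(t, D) = \ln\frac{N + 1}{\mathrm{DF}(t) + 1} + 1
$$

For example, with N = 3, the word "hat" (DF = 2) gives ln(4/3) + 1 ≈ 1.287682, and the word "the" (DF = 3) gives ln(4/4) + 1 = 1.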


**Term Frequency - Inverse Document Frequency (TF-IDF)= TF * IDF**

In [7]:
tfidf_df = pd.DataFrame(columns=list(wrds)) #empty dataframe initialized with words column headers
tfidf_lst=[]  #empty list for each column 
for c in tfidf_df.columns:
    tfidf_lst=[] #empty list for each column
    for i in range(0,len(Docs)):
        tf_idf_d1 = tf_df[c].iloc[i]*idf_df[c].iloc[0] #multiply the tf of the word in the i-th document by the idf of the word
        tfidf_lst.append(tf_idf_d1)
    tfidf_df[c]=tfidf_lst#assign tfidf values for each word
tfidf_df #display the dataframe of tf-idf for all words

Unnamed: 0,brown,in,quick,is,blue,fox,hat,the,cat
0,0.0,0.338629,0.0,0.0,0.0,0.0,0.257536,0.4,0.338629
1,0.423287,0.0,0.423287,0.0,0.0,0.423287,0.0,0.25,0.0
2,0.0,0.0,0.0,0.423287,0.423287,0.0,0.321921,0.25,0.0


**L2 Normalization**
* **L2 normalization, also known as Euclidean normalization or L2 norm normalization, is a technique used to scale vectors (or arrays) in such a way that their Euclidean norm becomes equal to 1.**

In [8]:
normalized_df = pd.DataFrame(columns=tfidf_df.columns)

# Apply L2 normalization to each document's TF-IDF values
for i,row in enumerate(tfidf_df.iterrows()):
    # Extract TF-IDF values    
    tfidf_values_list = list(tfidf_df.iloc[i].values)
    # Calculate L2 norm
    l2_norm = math.sqrt(sum(val**2 for val in tfidf_values_list))
    # Normalize TF-IDF values using L2 norm
    normalized_tfidf = [val / l2_norm for val in list(tfidf_df.iloc[i].values)]
    new_row = pd.Series(normalized_tfidf, index=tfidf_df.columns)
    normalized_df.loc[len(normalized_df)] = new_row
    
normalized_df

Unnamed: 0,brown,in,quick,is,blue,fox,hat,the,cat
0,0.0,0.501651,0.0,0.0,0.0,0.0,0.381519,0.592567,0.501651
1,0.546454,0.0,0.546454,0.0,0.0,0.546454,0.0,0.322745,0.0
2,0.0,0.0,0.0,0.584483,0.584483,0.0,0.444514,0.345205,0.0
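
One practical consequence of L2 normalization is that the dot product of two normalized rows equals their cosine similarity. The sketch below illustrates this with the non-zero values copied from the table above; the dictionary layout and the helper name `dot` are illustrative choices, not part of the notebook.

```python
# Non-zero entries of the L2-normalized TF-IDF rows shown above,
# stored as sparse {term: weight} dictionaries (an illustrative layout).
doc_vectors = {
    "Doc_1": {"in": 0.501651, "hat": 0.381519, "the": 0.592567, "cat": 0.501651},
    "Doc_2": {"brown": 0.546454, "quick": 0.546454, "fox": 0.546454, "the": 0.322745},
    "Doc_3": {"is": 0.584483, "blue": 0.584483, "hat": 0.444514, "the": 0.345205},
}

def dot(u, v):
    """Dot product of two sparse vectors; only shared terms contribute."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)

# Each row has unit length, so its dot product with itself is ~1.
self_sim = dot(doc_vectors["Doc_1"], doc_vectors["Doc_1"])

# For unit-length vectors the dot product is the cosine similarity.
sim_1_2 = dot(doc_vectors["Doc_1"], doc_vectors["Doc_2"])  # share only "the"
sim_1_3 = dot(doc_vectors["Doc_1"], doc_vectors["Doc_3"])  # share "the" and "hat"

print(round(self_sim, 3), round(sim_1_2, 3), round(sim_1_3, 3))
```

Doc_1 comes out closer to Doc_3 (about 0.37) than to Doc_2 (about 0.19), because Doc_1 and Doc_3 share "hat" in addition to "the".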


**TfidfVectorizer Python Library**

In [9]:
# Create the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(norm='l2',smooth_idf=True)

# Fit the documents and transform them into a TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(Docs)

# Get the feature names (terms) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

df_tfidf = pd.DataFrame(data=tfidf_matrix.toarray(), columns=feature_names)
df_tfidf

Unnamed: 0,blue,brown,cat,fox,hat,in,is,quick,the
0,0.0,0.0,0.501651,0.0,0.381519,0.501651,0.0,0.0,0.592567
1,0.0,0.546454,0.0,0.546454,0.0,0.0,0.0,0.546454,0.322745
2,0.584483,0.0,0.0,0.0,0.444514,0.0,0.584483,0.0,0.345205


* **By default (`smooth_idf=True`), the TfidfVectorizer in scikit-learn adds 1 to both the numerator and the denominator of the IDF fraction, as if one extra document contained every term, which prevents division by zero. It also adds 1 to the resulting IDF, so that a term present in all documents (such as "the" here) keeps a non-zero weight instead of being ignored.**
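
A minimal sketch comparing the unsmoothed IDF with the smoothed variant used in the manual calculation earlier. The function names `idf_plain` and `idf_smooth` are illustrative helpers, not library functions.

```python
import math

N = 3  # number of documents in the toy corpus
doc_freq = {"the": 3, "hat": 2, "cat": 1}  # document frequencies from the DF table

def idf_plain(df, n):
    """Textbook IDF: ln(n / df). It is zero when a term is in every document."""
    return math.log(n / df)

def idf_smooth(df, n):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, always at least 1."""
    return math.log((1 + n) / (1 + df)) + 1

for term, df in doc_freq.items():
    print(term, round(idf_plain(df, N), 6), round(idf_smooth(df, N), 6))
```

With the plain formula, "the" gets an IDF of 0 and would vanish from every TF-IDF vector; the smoothed form gives it 1.0, matching the value in the IDF table.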

# Unigram
* **A unigram, in the context of natural language processing (NLP) and linguistics, is a single token, typically one word. Unigram analysis is the simplest form of linguistic analysis, in which text is broken down into individual words.**
* **Unigrams are the building blocks for more complex linguistic analyses, such as bigrams (pairs of consecutive words), trigrams (triplets of consecutive words), and n-grams in general.**

In [10]:
#Unigram counts C(w) per document; dividing by the document length m gives the probability P(w)=C(w)/m, the same idea as TF
def count_wrd_Doc(wrd,doc):
    i=0
    for w in doc.lower().split(' '):
        if wrd == w:
            i = i+1
    return i
    
unigram_df = pd.DataFrame(columns=list(wrds)) #empty dataframe initialized with words column headers
freq_lst=[] #empty list for each column to save word frequencies in each document
for c in unigram_df.columns:
    freq_lst=[]#empty the list
    for d in Docs:
        freq_lst.append(count_wrd_Doc(c,d))#append the count of the word in document d
    unigram_df[c]=freq_lst #assign values to column
unigram_df #display the dataframe of unigram counts for each word in each document

Unnamed: 0,brown,in,quick,is,blue,fox,hat,the,cat
0,0,1,0,0,0,0,1,2,1
1,1,0,1,0,0,1,0,1,0
2,0,0,0,1,1,0,1,1,0


**Unigrams python function**

In [11]:
for d in Docs:
    words = word_tokenize(d.lower())
    result = list(ngrams(words, 1))
    # Calculate frequency distribution of unigrams
    ngram_freq = FreqDist(result)
    for word, frequency in ngram_freq.items():
        print(f"{word}: {frequency}")

('the',): 2
('cat',): 1
('in',): 1
('hat',): 1
('the',): 1
('quick',): 1
('brown',): 1
('fox',): 1
('the',): 1
('hat',): 1
('is',): 1
('blue',): 1


**Bigram**
* **A bigram, in the context of natural language processing (NLP) and linguistics, refers to an ordered pair of consecutive words within a text or sequence of words. It is a type of n-gram, where "n" represents the number of words in the sequence.**

In [12]:
#get bi-grams of input sentence
def bi_lst(doc):
    tokens = doc.lower().split(' ')
    pairs = []
    for j in range(0,len(tokens)-1):
        pairs.append(tokens[j:j+2])
    return pairs

lst = []
for d in Docs:
    lst.extend(bi_lst(d))
unique_list = []
for item in lst: #keep the first occurrence of each bigram, preserving order
    if item not in unique_list:
        unique_list.append(item)

def count_biwrd_Doc(st,doc):
    i=0    
    for s in bi_lst(doc):
        if s == st.split(' '):
            i = i+1
    return i
bigram_df = pd.DataFrame(columns=list((' '.join(x) for x in unique_list))) #empty dataframe initialized with bigram column headers
freq_lst=[] #empty list for each column to save bigram counts in each document
for c in bigram_df.columns:
    freq_lst=[]#empty the list
    for d in Docs:
        freq_lst.append(count_biwrd_Doc(c,d))#append the count of the bigram in document d
    bigram_df[c]=freq_lst #assign values to column
bigram_df #display the dataframe of bigram counts for each document

Unnamed: 0,the cat,cat in,in the,the hat,the quick,quick brown,brown fox,hat is,is blue
0,1,1,1,1,0,0,0,0,0
1,0,0,0,0,1,1,1,0,0
2,0,0,0,1,0,0,0,1,1


**Bigrams python function**

In [13]:
for d in Docs:
    words = word_tokenize(d.lower())
    result = list(ngrams(words, 2))
    # Calculate frequency distribution of bigrams
    ngram_freq = FreqDist(result)
    for word, frequency in ngram_freq.items():
        print(f"{word}: {frequency}")

('the', 'cat'): 1
('cat', 'in'): 1
('in', 'the'): 1
('the', 'hat'): 1
('the', 'quick'): 1
('quick', 'brown'): 1
('brown', 'fox'): 1
('the', 'hat'): 1
('hat', 'is'): 1
('is', 'blue'): 1
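
Bigram counts are commonly turned into conditional probabilities with the maximum-likelihood estimate P(w2 | w1) = C(w1 w2) / C(w1 as a first word). The sketch below applies this to the three documents pooled together; it uses only the standard library, and the helper name `bigram_prob` is an illustrative choice rather than an NLTK function.

```python
from collections import Counter

docs = ["The cat in the hat", "The quick brown fox", "The hat is blue"]

bigram_counts = Counter()   # C(w1 w2)
history_counts = Counter()  # how often w1 starts a bigram

for doc in docs:
    tokens = doc.lower().split()
    # Pair each token with its successor; bigrams never cross documents.
    for first, second in zip(tokens, tokens[1:]):
        bigram_counts[(first, second)] += 1
        history_counts[first] += 1

def bigram_prob(first, second):
    """Maximum-likelihood estimate of P(second | first)."""
    if history_counts[first] == 0:
        return 0.0  # unseen history: no estimate available
    return bigram_counts[(first, second)] / history_counts[first]

p_hat_given_the = bigram_prob("the", "hat")
p_cat_given_the = bigram_prob("the", "cat")
print(p_hat_given_the, p_cat_given_the)
```

In this corpus "the" starts four bigrams, two of which are "the hat", so the estimate for P(hat | the) is 0.5.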


# Skip-gram & Continuous Bag of Words (CBOW)
* These models use shallow neural networks to learn word embeddings. Skip-gram predicts context words given a target word, while CBOW predicts a target word given its context.

**CBOW**
* Objective: The main objective of CBOW is to predict a target word given its context (surrounding words). It learns to represent words in a continuous vector space based on their distributional semantics.
* Architecture: CBOW uses a neural network with a single hidden layer. The input and output layers each have as many units as there are words in the vocabulary, and the hidden layer has a much smaller dimension, often referred to as the embedding dimension.
* Input and Output: The input to the CBOW model is a set of context words represented as one-hot vectors (binary vectors with a 1 at the index corresponding to the word's position in the vocabulary). The output is the target word's one-hot vector.
* Context Window: The context window is a fixed-size window of surrounding words used to predict the target word. The model is trained to predict the target word based on the words within this context window.
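
The input side of CBOW described above can be sketched as follows: each position yields a (context words, target word) example, and the context one-hot vectors are averaged into a single input vector. This is a minimal illustration with a window of 2; the helper names `cbow_examples` and `average_one_hot` are assumptions made for this sketch.

```python
def cbow_examples(tokens, window_size=2):
    """Return (context_words, target_word) for every position in tokens."""
    examples = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, target))
    return examples

def average_one_hot(context, vocab):
    """Average the one-hot vectors of the context words over the vocabulary."""
    vec = [0.0] * len(vocab)
    for word in context:
        vec[vocab.index(word)] += 1.0 / len(context)
    return vec

tokens = "the cat in the hat".split()
vocab = sorted(set(tokens))  # ['cat', 'hat', 'in', 'the']
examples = cbow_examples(tokens, window_size=2)

context, target = examples[2]  # the example centered on "in"
cbow_input = average_one_hot(context, vocab)
print(context, "->", target, cbow_input)
```

A trained CBOW model would multiply this averaged vector by the embedding matrix and then score every vocabulary word as a candidate target; only the data preparation is shown here.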

In [14]:
def count_wrd_Doc(wrd,doc):
    i=0
    for w in doc.lower().split(' '):
        if wrd == w:
            i = i+1
    return i
    
#note: this cell builds raw word counts per document (a bag-of-words matrix); it is not a trained CBOW model
cw_df = pd.DataFrame(columns=list(wrds)) #empty dataframe initialized with words column headers
freq_lst=[] #empty list for each column to save word counts in each document
for c in cw_df.columns:
    freq_lst=[]#empty the list
    for d in Docs:
        freq_lst.append(count_wrd_Doc(c,d))#append the count of the word in document d
    cw_df[c]=freq_lst #assign values to column
cw_df #display the dataframe

Unnamed: 0,brown,in,quick,is,blue,fox,hat,the,cat
0,0,1,0,0,0,0,1,2,1
1,1,0,1,0,0,1,0,1,0
2,0,0,0,1,1,0,1,1,0


In [15]:
# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the documents to create the Bag of Words representation
X_bow = vectorizer.fit_transform(Docs)
feature_names = vectorizer.get_feature_names_out()
# Print the Bag of Words representation
print("Bag of Words representation:")
print(X_bow.toarray())
print("Feature names:")
print(feature_names)

Bag of Words representation:
[[0 0 1 0 1 1 0 0 2]
 [0 1 0 1 0 0 0 1 1]
 [1 0 0 0 1 0 1 0 1]]
Feature names:
['blue' 'brown' 'cat' 'fox' 'hat' 'in' 'is' 'quick' 'the']


**Skip-Grams**
* **Objective:** The main objective of the Skip-gram model is to learn distributed representations (word embeddings) of words in a continuous vector space. It does so by predicting the context words based on a given target word.
* **Architecture:** Skip-gram uses a neural network with a single hidden layer. The input and output layers each have as many units as there are words in the vocabulary, and the hidden layer has a much smaller dimension, often referred to as the embedding dimension.
* **Input and Output:** The input to the Skip-gram model is a one-hot vector representing a target word (the word for which embeddings are being learned). The output is a probability distribution over the vocabulary, representing the likelihood of each word being a context word.
* **Context Window:** During training, a context window is defined around the target word. The target word is used to predict each context word within this window, so the window supplies the local context information for each target word.
* **Training Objective:** Skip-gram is trained in a self-supervised way, with labels taken from the text itself. The model aims to minimize the cross-entropy loss between the predicted probability distribution over the vocabulary and the actual distribution (one-hot vector of the true context word).
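
The forward pass and loss described in these bullets can be sketched in plain Python. The weights here are random and untrained, the embedding size of 4 is arbitrary, and the names `W_in`, `W_out`, and `skipgram_loss` are assumptions made for illustration.

```python
import math
import random

def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["blue", "brown", "cat", "fox", "hat", "in", "is", "quick", "the"]
dim = 4  # embedding dimension (hidden layer size), chosen arbitrarily
rng = random.Random(0)
# Input (embedding) matrix and output matrix, one row per vocabulary word.
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def skipgram_loss(target, context):
    """Return the predicted distribution and the cross-entropy loss for one pair."""
    # Multiplying a one-hot input by W_in just selects the target word's row.
    hidden = W_in[vocab.index(target)]
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    probs = softmax(scores)
    # Cross-entropy against a one-hot label reduces to -log p(context).
    loss = -math.log(probs[vocab.index(context)])
    return probs, loss

probs, loss = skipgram_loss("cat", "hat")
print(len(probs), round(sum(probs), 6), loss > 0)
```

Training would repeat this for every (target, context) pair and adjust both matrices by gradient descent; the full softmax over the vocabulary is what makes this costly for large vocabularies.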

In [16]:
# Generate training pairs (target word, context word)
window_size = 3 # number of neighboring words taken on each side of the target word
training_pairs = []
context_words = [] # for window_size=3 the context positions are i-3, i-2, i-1, i+1, i+2, i+3
for d in Docs:
    t=[]
    c=[]
    for i, target_word in enumerate(d.lower().split(' ')):
        start = max(0, i - window_size)
        end = min(len(d.lower().split(' ')), i + window_size + 1)
        c = [d.lower().split(' ')[j] for j in range(start, end) if j != i]
        for context_word in c:
            t.append((target_word, context_word))
        print("document:",d.lower())
        print("target:",target_word)
        print("context words",c,",window_size",window_size)
        print("----------------------------------------")
    training_pairs.append(t)    

document: the cat in the hat
target: the
context words ['cat', 'in', 'the'] ,window_size 3
----------------------------------------
document: the cat in the hat
target: cat
context words ['the', 'in', 'the', 'hat'] ,window_size 3
----------------------------------------
document: the cat in the hat
target: in
context words ['the', 'cat', 'the', 'hat'] ,window_size 3
----------------------------------------
document: the cat in the hat
target: the
context words ['the', 'cat', 'in', 'hat'] ,window_size 3
----------------------------------------
document: the cat in the hat
target: hat
context words ['cat', 'in', 'the'] ,window_size 3
----------------------------------------
document: the quick brown fox
target: the
context words ['quick', 'brown', 'fox'] ,window_size 3
----------------------------------------
document: the quick brown fox
target: quick
context words ['the', 'brown', 'fox'] ,window_size 3
----------------------------------------
document: the quick brown fox
target: brown

In [17]:
training_pairs # the (target, context) training pairs for each document, formed according to the context window size

[[('the', 'cat'),
  ('the', 'in'),
  ('the', 'the'),
  ('cat', 'the'),
  ('cat', 'in'),
  ('cat', 'the'),
  ('cat', 'hat'),
  ('in', 'the'),
  ('in', 'cat'),
  ('in', 'the'),
  ('in', 'hat'),
  ('the', 'the'),
  ('the', 'cat'),
  ('the', 'in'),
  ('the', 'hat'),
  ('hat', 'cat'),
  ('hat', 'in'),
  ('hat', 'the')],
 [('the', 'quick'),
  ('the', 'brown'),
  ('the', 'fox'),
  ('quick', 'the'),
  ('quick', 'brown'),
  ('quick', 'fox'),
  ('brown', 'the'),
  ('brown', 'quick'),
  ('brown', 'fox'),
  ('fox', 'the'),
  ('fox', 'quick'),
  ('fox', 'brown')],
 [('the', 'hat'),
  ('the', 'is'),
  ('the', 'blue'),
  ('hat', 'the'),
  ('hat', 'is'),
  ('hat', 'blue'),
  ('is', 'the'),
  ('is', 'hat'),
  ('is', 'blue'),
  ('blue', 'the'),
  ('blue', 'hat'),
  ('blue', 'is')]]

In [18]:
# Initialize word vectors randomly
embedding_dim = 10
learning_rate = 0.01
epochs = 500
word_vectors = []
vocab=[]
for d in Docs:
    vocab.append((list(set(d.lower().split(' ')))))
for v in vocab:
    word_vectors.append({word: np.random.rand(embedding_dim) for word in v})#initialize random values vector for each word 
    
for i in range(0,len(training_pairs)):
    # Train the Skip-gram model
    for epoch in range(epochs):
    
        for target_word, context_word in training_pairs[i]:
            # Forward pass
            input_vector = word_vectors[i][target_word]
            output_vector = word_vectors[i][context_word]

            # Score-based error: -log(exp(u.v)) simplifies to -(u.v), so this is an unnormalized score rather than a true negative log likelihood
            error = -np.log(np.exp(np.dot(input_vector, output_vector)))

            # Update step: subtract learning_rate * input_vector * sigmoid(u.v) from both vectors;
            # this shrinks the vectors toward zero, as the printed vectors show
            gradient = input_vector * np.exp(np.dot(input_vector, output_vector)) / (1 + np.exp(np.dot(input_vector, output_vector)))
            word_vectors[i][target_word] -= learning_rate * gradient
            word_vectors[i][context_word] -= learning_rate * gradient

        if epoch % 500 == 0:
            print(f"Epoch {epoch}, Loss: {error}")
    print('-------------------------------')
    #word vectors
    for word, vector in word_vectors[i].items():
        print(f"Vector for '{word}': {vector}")
    print('-------------------------------')

Epoch 0, Loss: -1.5547117209204573
-------------------------------
Vector for 'cat': [-5.65290452e-06 -3.01434705e-04 -1.76163224e-04 -2.30386202e-04
  4.97224867e-04  8.31000930e-04 -2.24089526e-04 -4.23949785e-04
  2.42732156e-05 -2.53179441e-04]
Vector for 'the': [-1.97857776e-05  4.90639859e-05  1.92712368e-05  9.16633056e-05
 -1.31928845e-04 -2.17141981e-04  1.50333863e-05  8.35862131e-05
 -2.42806385e-06  5.97708621e-05]
Vector for 'in': [ 1.60375577e-04 -2.11180454e-04 -2.44764123e-05 -5.58840109e-04
  4.44964232e-04  8.06192995e-04  7.62748022e-05 -2.90721732e-04
 -3.82902910e-05 -2.77866749e-04]
Vector for 'hat': [-1.71487143e-04  6.35114868e-04  2.47629529e-04  9.11440123e-04
 -9.89152092e-04 -1.77102955e-03  1.77768794e-04  8.20374061e-04
  3.38888153e-05  6.20319984e-04]
-------------------------------
Epoch 0, Loss: -2.2503771454682244
-------------------------------
Vector for 'brown': [ 1.42927057e-03  1.84843773e-03  6.59118891e-05 -2.69540572e-03
 -1.31570351e-03  7.12
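
The loop above treats every pair as a positive example and applies the same update to both vectors. A common alternative, shown here as a sketch and not as the notebook's method, is skip-gram with negative sampling: each true pair is pushed toward a sigmoid score of 1 and a few randomly drawn words toward 0. The hyperparameters, the uniform negative sampler, the window of 1, and all function names are assumptions for this toy example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_pairs(docs, window_size):
    """Build (target, context) pairs within each document."""
    pairs = []
    for doc in docs:
        tokens = doc.lower().split()
        for i, target in enumerate(tokens):
            start = max(0, i - window_size)
            end = min(len(tokens), i + window_size + 1)
            pairs.extend((target, tokens[j]) for j in range(start, end) if j != i)
    return pairs

def train_sgns(pairs, vocab, dim=10, lr=0.05, epochs=200, negatives=2, seed=0):
    """Logistic-loss training with uniformly sampled negatives."""
    rng = random.Random(seed)
    w_in = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    w_out = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    for _ in range(epochs):
        for target, context in pairs:
            candidates = [w for w in vocab if w not in (target, context)]
            samples = [(context, 1.0)]
            samples += [(rng.choice(candidates), 0.0) for _ in range(negatives)]
            for word, label in samples:
                u, v = w_in[target], w_out[word]
                # Gradient of the logistic loss with respect to the score u.v
                g = sigmoid(dot(u, v)) - label
                for k in range(dim):
                    u_k, v_k = u[k], v[k]  # use pre-update values for both steps
                    u[k] -= lr * g * v_k
                    v[k] -= lr * g * u_k
    return w_in, w_out

docs = ["The cat in the hat", "The quick brown fox", "The hat is blue"]
pairs = make_pairs(docs, window_size=1)
vocab = sorted({w for pair in pairs for w in pair})
w_in, w_out = train_sgns(pairs, vocab)

score_pos = sigmoid(dot(w_in["fox"], w_out["brown"]))  # observed neighbors
score_neg = sigmoid(dot(w_in["fox"], w_out["cat"]))    # never neighbors
print(score_pos > score_neg)
```

Keeping separate input and output vectors and including negative examples gives the model something to discriminate, so the vectors do not simply collapse or align with one another.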

# TensorFlow Skip-gram

In [19]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Dot, Dense,Flatten
from tensorflow.keras.models import Model
from nltk.tokenize import word_tokenize

# Create a vocabulary with unique words and indices
vocab = []
for doc in Docs:
    wrds=list(set(doc.lower().split()))
    vocab.append({w:wrds.index(w) for w in wrds})
reverse_vocab = []
for v in vocab:
    reverse_vocab.append({idx: word for word, idx in v.items()})

# Generate training pairs (target word, context word)
training=[]
for d,v in zip(Docs,vocab):
    training_pairs=[]
    for i, target_word in enumerate(d.lower().split(' ')):
        start = max(0, i - window_size)
        end = min(len(d.lower().split(' ')), i + window_size + 1)
        context_words = [d.lower().split(' ')[j] for j in range(start, end) if j != i]
        for context_word in context_words:
            training_pairs.append((v[target_word], v[context_word]))
    training.append(training_pairs)
    
    
# Define the Skip-gram model using TensorFlow
for i,v,t_p in zip(range(0,len(Docs)),vocab,training):
    target_word_input = tf.keras.layers.Input(shape=(1,), name="target_word")
    context_word_input = tf.keras.layers.Input(shape=(1,), name="context_word")

    embedding_layer = Embedding(input_dim=len(v), output_dim=embedding_dim)
    target_word_embedding = embedding_layer(target_word_input)#obtain respective embeddings
    context_word_embedding = embedding_layer(context_word_input)#obtain respective embeddings
    dot_product = Dot(axes=2, normalize=False)([target_word_embedding, context_word_embedding])#dot product over the embedding axis: one similarity score per (target, context) pair
    output_layer = Dense(1, activation='sigmoid')(Flatten()(dot_product))  # Flatten the (batch, 1, 1) score to (batch, 1)

    model = Model(inputs=[target_word_input, context_word_input], outputs=output_layer)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    
    
    # Train the Skip-gram model
    target_words = np.array([pair[0] for pair in t_p], dtype=np.int32)
    context_words = np.array([pair[1] for pair in t_p], dtype=np.int32)

    labels = np.array([1] * len(t_p), dtype=np.float32)# Positive labels for all training pairs (no negative samples are generated here)
    
    model.fit({'target_word': target_words, 'context_word': context_words}, labels, epochs=500,verbose=0)

    # Access word vectors
    word_vectors = embedding_layer.get_weights()[0]
    for idx, word in reverse_vocab[i].items():
        print(f"Vector for '{word}': {word_vectors[idx]}")
    
    keras.utils.plot_model(model,show_shapes=True,show_layer_names=True)
    print("-----------------------------------------------------------")

Vector for 'cat': [-0.43010348  0.38208094  0.43516183  0.34945223 -0.26633     0.36079276
  0.38095602  0.34150225  0.23537134  0.42937216]
Vector for 'the': [-0.34411123  0.38560492  0.3979685   0.37494144 -0.41802138  0.4021703
  0.39510897  0.36356258  0.37729284  0.34837696]
Vector for 'in': [-0.3807053   0.37619838  0.33505103  0.35508528 -0.43122908  0.3351781
  0.33790073  0.39544725  0.41365743  0.3455321 ]
Vector for 'hat': [-0.36376128  0.36160553  0.40696654  0.40687633 -0.3848791   0.38903594
  0.39137614  0.3218226   0.36984816  0.35859945]
-----------------------------------------------------------
Vector for 'brown': [-0.37116742 -0.38867804  0.39639646  0.3804858   0.34291995  0.3749708
 -0.34248227  0.3728325  -0.40337723 -0.38095927]
Vector for 'the': [-0.37745386 -0.3032146   0.38352677  0.38538253  0.41359064  0.39850843
 -0.3676237   0.4071769  -0.3573647  -0.30913416]
Vector for 'quick': [-0.384397   -0.3783502   0.40433174  0.39939916  0.3999617   0.37938774
 -0

* **Count of Words (Bag-of-Words Model):**  Early NLP models represented documents using a Bag-of-Words (BoW) model, which counts the occurrence of words in a document without considering their order. This approach provides a basic representation of documents but does not capture semantic relationships.
* **Term Frequency-Inverse Document Frequency (TF-IDF):** A weighting scheme that considers not only the count of words in a document but also their importance in the entire corpus. It helps identify words that are significant to a particular document but not frequent across all documents.
* **Latent Semantic Analysis (LSA):** also known as Latent Semantic Indexing (LSI), applies singular value decomposition (SVD) to the term-document matrix. It reduces the dimensionality of the space and captures latent semantic relationships between words and documents, improving representation.
* **Skip-gram and Continuous Bag of Words (CBOW):** Word2Vec, introduced by Mikolov et al., includes two models: Skip-gram and Continuous Bag of Words (CBOW). These models use shallow neural networks to learn word embeddings. Skip-gram predicts context words given a target word, while CBOW predicts a target word given its context. Word2Vec significantly improves word embeddings' quality and captures semantic relationships.
* **Global Vectors for Word Representation (GloVe):** introduces a global approach by training on aggregated global word co-occurrence statistics. It leverages a matrix factorization technique to capture the relationships between words in a more efficient manner, producing high-quality word embeddings.
* **FastText and Subword Embeddings:** FastText, from Bojanowski et al. (with Mikolov as a co-author), extends word embeddings to the subword level. It represents words as bags of character n-grams, enabling the generation of embeddings for out-of-vocabulary words and capturing morphological information.
* **Transformer Architecture:** introduced by Vaswani et al., revolutionizes NLP by employing self-attention mechanisms. It enables models like BERT (Bidirectional Encoder Representations from Transformers) to learn contextualized word embeddings, considering the entire input sequence bidirectionally.
* **Contextualized Word Embeddings (BERT, GPT, ELMo):** BERT, GPT (Generative Pre-trained Transformer), and ELMo (Embeddings from Language Models) introduce contextualized embeddings by considering surrounding words and contexts. These models capture rich contextual information, leading to state-of-the-art performance in various NLP tasks.
* **Transfer Learning and Fine-Tuning:** Pre-trained language models, such as BERT and GPT, can be fine-tuned on specific downstream tasks. This transfer learning approach significantly reduces the need for large labeled datasets and improves performance on specific tasks.
* **Multimodal Embeddings:** Embeddings have been extended beyond text to include multimodal information, combining textual and visual features. Models like CLIP (Contrastive Language-Image Pre-training) learn joint representations of text and images, enabling cross-modal understanding.
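
The character n-gram idea behind FastText mentioned in the list above can be sketched in a few lines. The word is wrapped in boundary markers `<` and `>` before slicing, as in the FastText paper; the function name `char_ngrams` and the choice of n = 3 are illustrative.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

known = char_ngrams("where")
unseen = char_ngrams("whereby")  # hypothetical out-of-vocabulary word
shared = sorted(set(known) & set(unseen))

print(known)   # ['<wh', 'whe', 'her', 'ere', 're>']
print(shared)  # ['<wh', 'ere', 'her', 'whe']
```

Because a word vector is built from the vectors of its n-grams, an unseen word still receives a representation from the n-grams it shares with known words.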


**Key breakpoints:**

* **From word counts to context:** Moving beyond simple frequency to considering surrounding words for richer representations.
* **Pre-training on large corpora:** Utilizing massive amounts of text data to learn general language understanding.
* **Bi-directional and attention mechanisms:** Capturing complex relationships between words in a sentence.
* **Transformers and self-attention:** Enabling efficient learning of long-range dependencies.
* **LLMs approaching human-level performance on certain tasks:** Highlighting the potential of language models for natural communication and problem-solving.

**Future directions:**

* **Explainability and interpretability:** Understanding how LLMs work and make decisions.
* **Addressing biases and fairness:** Ensuring models are inclusive and represent diverse perspectives.
* **Combining symbolic and neural approaches:** Integrating logic and reasoning with language understanding.
