# Visualize word vectors - reduced to three dimensions

The following Python packages are used to vectorize text and visualize the result:  
- **flair** is an NLP package that is very powerful and well documented: https://flairnlp.github.io/docs/intro  
- **torch** is the famous **PyTorch** - a powerful deep learning Python package: https://pytorch.org
- **numpy** is one of the most used packages for mathematical/vectorization purposes: https://numpy.org
- **scikit-learn** (sklearn) is a well known and powerful Machine Learning package: https://scikit-learn.org/stable/index.html
- **Plotly** is a useful package to visualize data, especially in 3D: https://plotly.com/python

In [None]:
# install Plotly
%pip install plotly

In [None]:
# WordEmbeddings will be initialized to get word vectors
from flair.embeddings import WordEmbeddings, TransformerWordEmbeddings
from flair.data import Sentence

# Data handling and Tools
from typing import Union
from torch import Tensor
import numpy as np

# Visualization
import plotly.graph_objects as go

# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

The **Vectorizer** class embeds the text with `get_vectors()`, reduces the dimensionality of the vectors with Principal Component Analysis (PCA) in `dimension_reduction_pca()`, and returns the plot-ready data with `get_data()`.

In [None]:
class Vectorizer(object):
    def __init__(
            self,
            listtoembed: list,
            color: str,
            word: Union[str, list] = None,
            embedding: Union[TransformerWordEmbeddings, WordEmbeddings] = None,
            label: str = None,
    ):
        # Accept a single word or a list of words to look for in each text.
        if isinstance(word, str):
            word = [word]
        elif word is not None and not isinstance(word, list):
            raise ValueError('"word" must be a string or a list of strings.')
        self.word = word
        self.listtoembed = listtoembed
        self.embedding = embedding
        self.label = label
        self.color = color

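    def get_vectors(self):
        """Return one [label, legend, embedding] entry per embedded word."""
        results = []
        label = self.label or ''
        # Lower-case the target words once for case-insensitive matching.
        targets = None if self.word is None else [w.lower() for w in self.word]

        for i, element in enumerate(self.listtoembed):
            to_embed = Sentence(str(element))
            self.embedding.embed(to_embed)
            if targets is None:
                # No target word given: use the first token of the text.
                results.append([element, element, to_embed[0].embedding])
                continue
            for token in to_embed:
                text = token.text.lower()
                if text in targets:
                    # One entry per matching token, labeled with the matched
                    # word, the text index, and the word's index in `word`.
                    j = targets.index(text)
                    results.append([f'{self.word[j]}{i}_{j}_{label}', str(element), token.embedding])
        return results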

    @staticmethod
    def dimension_reduction_pca(
            word_vectors,
            dims: int = 3,
            random_state: int = 42
    ):
        word_vectors_np = []
        for w in word_vectors:
            if isinstance(w, Tensor):
                word_vectors_np.append(w.detach().cpu().numpy())
            else:
                word_vectors_np.append(w)
        return PCA(random_state=random_state).fit_transform(np.array(word_vectors_np))[:, :dims]

    def get_data(
            self,
            dims: int = 3,
            random_state: int = 42,
            perplexity: int = 5,
            learning_rate: int = 500,
            n_iter: int = 10000
    ):
        results = self.get_vectors()
        word_vecs = []
        data = []
        for r in results:
            word_vecs.append(r[2])
        
        reduced_vec = self.dimension_reduction_pca(word_vecs, dims, random_state)
        
        for elem in zip(reduced_vec, results):
            try:
                z = elem[0][2].tolist()
            except IndexError:
                # Fewer than three components are available: keep the plot flat.
                z = 0.0

            item = {
                'Label': str(elem[1][0]),
                'Legend': str(elem[1][1]),
                'X': elem[0][0].tolist(),
                'Y': elem[0][1].tolist(),
                'Z': z,
                'Color': self.color
            }
            data.append(item)

        return data
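
To make the reduction step less of a black box, here is a minimal sketch of what PCA does, written with plain numpy instead of scikit-learn. The function `pca_reduce` and the random `fake_vectors` are hypothetical stand-ins, not part of the notebook, and the signs of the resulting axes may differ from what `sklearn.decomposition.PCA` returns.

```python
import numpy as np


def pca_reduce(vectors: np.ndarray, dims: int = 3) -> np.ndarray:
    """Project row vectors onto their first `dims` principal components."""
    # Center the data so the components describe variance around the mean.
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dims].T


# Hypothetical stand-in for five 100-dimensional word vectors.
rng = np.random.default_rng(42)
fake_vectors = rng.normal(size=(5, 100))

reduced = pca_reduce(fake_vectors, dims=3)
print(reduced.shape)  # (5, 3)
```

Each row of `reduced` corresponds to one input vector, so its three values can be used directly as X, Y and Z coordinates.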

The function **visualize()** takes the prepared data and draws it as a *Plotly* 3D scatter plot, adding one trace per data point.

In [None]:
def visualize(data):
    layout = go.Layout(
        autosize=False,
        width=800,
        height=800
    )

    fig = go.Figure(layout=layout)

    for d in data:
        fig.add_trace(
            go.Scatter3d(
                x=[d['X']],
                y=[d['Y']],
                z=[d['Z']],
                mode='markers',
                marker_color=d['Color'],
                text=d['Legend'],
                name=d['Label']
            )
        )

    fig.show()
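
The records that `get_data()` produces are plain dictionaries, so they can be reshaped without Plotly. The following sketch, which assumes the record keys used above, collects the per-point records into one list per field. The helper `to_columns` and the sample coordinates are hypothetical and only serve as an illustration.

```python
def to_columns(data: list) -> dict:
    """Collect per-point records into one list per field."""
    columns = {'X': [], 'Y': [], 'Z': [], 'Legend': [], 'Label': []}
    for record in data:
        for key in columns:
            columns[key].append(record[key])
    return columns


# Hypothetical records in the same format that get_data() returns.
sample = [
    {'Label': 'king', 'Legend': 'king', 'X': 0.1, 'Y': 0.2, 'Z': 0.3, 'Color': 'gold'},
    {'Label': 'queen', 'Legend': 'queen', 'X': 0.4, 'Y': 0.5, 'Z': 0.6, 'Color': 'gold'},
]

cols = to_columns(sample)
print(cols['X'])      # [0.1, 0.4]
print(cols['Label'])  # ['king', 'queen']
```

Lists like these could be passed to a single `go.Scatter3d(x=cols['X'], y=cols['Y'], z=cols['Z'], text=cols['Legend'])` trace if one legend entry per group is preferred over one per point.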

# Initializing the pretrained embeddings for vectorizing

Here we create two embedders: `bertEmbedding` and `gloveEmbedding`.  

For this we use the **flair** library. The class `TransformerWordEmbeddings` (see: https://flairnlp.github.io/docs/tutorial-embeddings/transformer-embeddings) can load transformer models hosted on https://huggingface.co/, which offers many powerful model architectures.  

The `WordEmbeddings` from flair are the classic word embeddings (see: https://flairnlp.github.io/docs/tutorial-embeddings/classic-word-embeddings). There is a list of available models. **glove** (GloVe) is a static word embedding model, similar in spirit to *word2vec*: after training, it provides one fixed vector for each word in its vocabulary, regardless of context.
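
What "one fixed vector per word" means in practice can be shown with a toy lookup table. The sketch below is not how flair works internally; the dictionary `static_vectors`, its three-dimensional values, and the helper `embed_static` are made up for illustration.

```python
# Hypothetical 3-dimensional vectors; real static embeddings have many more dimensions.
static_vectors = {
    'bat': [0.2, -0.1, 0.7],
    'cave': [0.3, 0.0, 0.6],
    'baseball': [-0.5, 0.4, 0.1],
}


def embed_static(sentence: str) -> list:
    """Look up each known word; the surrounding words are ignored."""
    tokens = sentence.lower().replace('.', '').split()
    return [(t, static_vectors[t]) for t in tokens if t in static_vectors]


animal = dict(embed_static('The bat flew out of the cave.'))
sport = dict(embed_static('He swung the bat at the baseball.'))

# The vector for 'bat' is identical in both sentences.
print(animal['bat'] == sport['bat'])  # True
```

A contextual model such as BERT would instead produce a different vector for `bat` in each of the two sentences.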

In [None]:
# load the BERT embedding and the GloVe word embedding with flair
bertEmbedding = TransformerWordEmbeddings('bert-base-uncased')
gloveEmbedding = WordEmbeddings('glove')

# Visualize static word embeddings (GloVe)

Here we use the **GloVe** model that was initialized as `gloveEmbedding` above.

In [None]:
# list of words to embed
words = [
    'queen',
    'king',
    'bat', # polysemic
    'baseball',
    'cave'
]

In [None]:
# use the Vectorizer with Glove-WordEmbedding
glove = Vectorizer(listtoembed=words, embedding=gloveEmbedding, label='glove', color='gold')
glove_data = glove.get_data()

In [None]:
# now visualize the data
visualize(glove_data)

# Disambiguation with BERT

Since the **BERT** architecture takes the context of a sentence into account when embedding a word, we pass it eleven sentences that contain the polysemic word **arm**. Have a look at where the word is positioned depending on its specific context.

In [None]:
# Feel free to change the sentences or add more.
sentences = [
    "She puts her arm around his shoulder.",
    "His horse had been wounded under him and his own arm slightly grazed by a bullet.",
    "He shrugged into one arm of his pajamas.",
    "They had to arm all the troops because of the threat.",
    "Take care, this arm is very powerful and lethal.",
    "Joe grabbed Bob's arm.",
    "Pumpkin Green strolled into the inn, arm in arm with Billy Langstrom's female friend, Melissa.",
    "She sat on the arm of the sofa.",
    "It was a bad idea to arm the bomb.",
    "Military intervention and crisis handling will become an EU task, as the WEU defence alliance takes on more of a role as an arm of EU defence policy.",
    "The end of the verge is driven into the balance, which has one straight arm."
]

In [None]:
bert = Vectorizer(word='arm', listtoembed=sentences, embedding=bertEmbedding, label='bert', color='navy')
bert_data = bert.get_data()

visualize(bert_data)
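
The visual impression from the plot can be complemented with a number. A common choice is cosine similarity, sketched here with numpy on made-up vectors. The function `cosine_similarity` and the toy vectors `limb_1`, `limb_2` and `weapon` are hypothetical and do not come from the BERT model.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy vectors standing in for contextual embeddings of "arm".
limb_1 = [0.9, 0.1, 0.0]
limb_2 = [0.8, 0.2, 0.1]
weapon = [0.1, 0.9, 0.3]

# Vectors pointing in similar directions score higher than dissimilar ones.
print(cosine_similarity(limb_1, limb_2) > cosine_similarity(limb_1, weapon))  # True
```

To apply this to real token embeddings, the tensors would first need to be converted with `.detach().cpu().numpy()`, as is done in `dimension_reduction_pca()`.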

# Use a Latin-trained RoBERTa embedding

We found this model on huggingface.co: https://huggingface.co/pstroe/roberta-base-latin-cased3

In [None]:
latinEmbedding = TransformerWordEmbeddings('pstroe/roberta-base-latin-cased3')

In [None]:
sentences = [
    "Sanctus pater noster, qui in caelis es.",
    "In ecclesia cantamus Sanctus, Sanctus, Sanctus.",
    "Sanctus Ioannes Baptista est patronus multorum.",
    "In libro sacro, sancti viri exempla virtutis nobis praebent.",
    "Sanctus Dei Genitrix, ora pro nobis peccatoribus.",
    "Abbatia est locum sanctum, ubi monachi orant et laborant.",
    "Sanctus animus est placidus et pacificus.",
    "Duae urbes in Italia sunt nominibus Sanctus Petrus et Sanctus Franciscus.",
    "Sanctus Aloysius Gonzaga erat iuvenis sanctus et devotus.",
    "Vita sancti Francisci Assisiensis est nobis exemplum paupertatis et humilitatis."
]

latinbert = Vectorizer(
    word=['sanctus','sanctum','sancti'], 
    listtoembed=sentences, 
    embedding=latinEmbedding, 
    label='bert', 
    color='navy'
)
latinbert_data = latinbert.get_data()

visualize(latinbert_data)