# Embeddings Generation

In this part, we generate sentence embeddings to represent text data in a vector space suitable for clustering. The core idea is to transform cleaned 15min.lt news articles and their mBART-generated summaries into numerical vectors that capture their semantic meaning.

We use the sentence-transformers library, which provides pre-trained transformer models optimized for semantic similarity and clustering tasks. The embeddings are generated with the pretrained multilingual model paraphrase-multilingual-MiniLM-L12-v2, which supports Lithuanian. The resulting vectors are the input for clustering (Agglomerative Clustering) and visualization (UMAP). The selected model offers a balanced trade-off between computational efficiency and embedding quality.

## Import modules
* SentenceTransformer - library for text vectorization (turns texts into numeric vectors)

In [None]:
from sentence_transformers import SentenceTransformer
import pandas as pd
import torch
import numpy as np

* device = "cuda" if torch.cuda.is_available() else "cpu" - the dataset contains 2927 summaries, so using a GPU is highly recommended
* model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" - a model that supports Lithuanian, is efficient, and produces 384-dimensional vectors
* texts = df["mbart_summary"].tolist() - converts the pandas Series to a Python list of strings, the input format we pass to the SentenceTransformer model

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device in use: {device}")

df = pd.read_csv("/content/mbart_summary_dataset.csv")

model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
model = SentenceTransformer(model_name, device=device)

texts = df["mbart_summary"].tolist()
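
If the CSV contains empty cells, pandas reads them as NaN, and those values would end up in the list passed to the model. The sketch below shows one possible way to guard against that. `clean_texts` is a hypothetical helper, not part of the original notebook, and replacing missing values with an empty string (rather than dropping the rows) is an assumption made here to keep the list aligned with the DataFrame rows.

```python
import math


def clean_texts(values):
    """Return a list of strings, replacing missing values with an empty string."""
    cleaned = []
    for value in values:
        # pandas represents missing cells as float NaN; None may also appear
        if value is None or (isinstance(value, float) and math.isnan(value)):
            cleaned.append("")
        else:
            cleaned.append(str(value).strip())
    return cleaned


# Illustrative sample, not taken from the dataset
sample = ["Pirmas tekstas. ", float("nan"), None, "Antras tekstas."]
print(clean_texts(sample))
```

In the notebook this could be applied as `texts = clean_texts(df["mbart_summary"].tolist())`. Whether empty summaries should be kept or filtered out before clustering is a separate design choice.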

### Convert texts into numeric vectors (embeddings) using the SentenceTransformer model
* texts - mBART-generated summaries
* batch_size=32 - number of texts encoded per batch
* show_progress_bar=True - displays a progress bar showing how many texts have been processed
* convert_to_numpy=True - returns the results as a NumPy array (not PyTorch tensors) for later use with UMAP
* normalize_embeddings=True - scales every vector to unit length, so all vectors share the same scale for UMAP and clustering
* np.save("/content/mbart_embeddings.npy", embeddings) - saves the embeddings in .npy format, which is convenient to load later for clustering and UMAP

**After this step, each of our articles is represented as a numerical vector. These vectors will serve as the basis for subsequent clustering and visualization tasks.**
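
To illustrate what normalization does, here is a minimal NumPy sketch of L2 normalization on toy vectors. The 2-D values are made up for readability and stand in for the real 384-dimensional embeddings. Once vectors have unit length, their dot product equals their cosine similarity, which is why normalized embeddings are convenient for distance-based methods.

```python
import numpy as np

# Toy 2-D vectors standing in for real embeddings (illustrative values only)
raw = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

# L2-normalize each row: divide it by its Euclidean norm
norms = np.linalg.norm(raw, axis=1, keepdims=True)
unit = raw / norms

# After normalization every row has length 1
row_lengths = np.linalg.norm(unit, axis=1)

# For unit vectors, the dot product is the cosine similarity
cosine = unit @ unit.T

print(row_lengths)
print(np.round(cosine, 3))
```

Rows 0 and 2 point in the same direction, so their similarity is 1 even though their original lengths differ.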

In [None]:
embeddings = model.encode(
    texts,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
)

# Output summary of generated embeddings: total number and embedding dimension
print(f"Generated {embeddings.shape[0]} embeddings, length of the vector: {embeddings.shape[1]}")

np.save("/content/mbart_embeddings.npy", embeddings)
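
The saved file can be read back in a later notebook with `np.load`. The sketch below demonstrates the save/load round trip on a stand-in array. The random values, the temporary path, and the float32 dtype are assumptions for illustration, and the shape (2927, 384) assumes that every summary in the dataset was encoded.

```python
import os
import tempfile

import numpy as np

# Stand-in array with the shape described in this section (2927 summaries x 384 dimensions)
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(2927, 384)).astype(np.float32)

# Hypothetical path in a temporary directory; the notebook itself saves to /content/mbart_embeddings.npy
path = os.path.join(tempfile.mkdtemp(), "embeddings_demo.npy")
np.save(path, fake_embeddings)

# Load the array back and check that nothing changed
loaded = np.load(path)
print(loaded.shape, loaded.dtype)
print(np.array_equal(loaded, fake_embeddings))
```

Checking the shape right after loading is a cheap way to confirm that the number of rows still matches the number of rows in the DataFrame before clustering.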