Below are the packages that are not preinstalled in Google Colab and therefore need to be installed.

In [1]:
!pip install fasttext
!pip install transformers torch
!pip install sacremoses

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73.4/73.4 kB 3.8 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-3.0.1-py3-none-any.whl.metadata (10.0 kB)
Using cached pybind11-3.0.1-py3-none-any.whl (293 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... done
  Created wheel for fasttext: filename=fasttext-0.9.3-cp312-cp312-linux_x86_64.whl size=4498213 sha256=e98992e9b984bdf7cafdd6af09ad229ecb2bf2fb6e21b017cad49a0307eee324
  Stored in directory: /root/.cache/pip/wheels/20/27/95/a7baf1b435f1cbde017cabd

Below are all the required imports.

In [2]:
from datasets import load_dataset
from google.colab import userdata
from google.colab import drive
import fasttext
import numpy
import random
import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity



Loading the data requires a Hugging Face access token. The token is stored as a Google Colab secret (environment variable) and read from there in the code.

In [3]:
HF_TOKEN = userdata.get("HF_TOKEN")

The datasets are downloaded from Hugging Face: https://huggingface.co/datasets/NASK-PIB/PL-Guard

In [4]:
PL_Guard_dataset = load_dataset("NASK-PIB/PL-Guard", "test", token=HF_TOKEN)
PL_Guard_adversarial_dataset = load_dataset("NASK-PIB/PL-Guard", "test_adversarial", token=HF_TOKEN)

README.md:   0%|          | 0.00/1.96k [00:00<?, ?B/s]

test/data.parquet:   0%|          | 0.00/485k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/900 [00:00<?, ? examples/s]

test_adversarial/data.parquet:   0%|          | 0.00/1.08M [00:00<?, ?B/s]

Generating test_adversarial split:   0%|          | 0/900 [00:00<?, ? examples/s]

In [None]:
print(PL_Guard_dataset)
print(PL_Guard_dataset["test"].features)
print(PL_Guard_adversarial_dataset)
print(PL_Guard_adversarial_dataset["test_adversarial"].features)

# Representing text as vectors using embedding

The **fasttext** Python package, created by **Facebook's AI Research (FAIR) lab**, is a wrapper around the C++ implementation of the fastText embedding algorithm (paper by Bojanowski et al.). The downloaded models were pretrained using this algorithm. The PL-Guard dataset is in Polish, so the model for that language was chosen. According to the official website (https://fasttext.cc/docs/en/crawl-vectors.html), the models provided by Facebook were trained on Common Crawl and Wikipedia data and are distributed in .bin format. They were trained using **CBOW** with position-weights, in dimension 300, with **character n-grams** of length 5, a window of size 5 and **10 negatives**.

---


* **CBOW** (Continuous Bag of Words) is a training objective (or model architecture) that predicts a target word from its surrounding context words. In fastText, the prediction error provides the signal that tells the model how to adjust the word and subword embeddings.
* A **character n-gram** is a substring of length n taken from a word; more generally, an n-gram is a contiguous sequence of n items of some type (here, characters). Splitting words into n-grams helps to embed words that are misspelled or are built from subwords.
* **10 negatives** refers to **negative sampling**. Negative sampling in fastText is a training technique that efficiently approximates the prediction of the target word by updating embeddings only for the correct word and a small number of randomly chosen “negative” words. This gives a fast learning signal without computing over the entire vocabulary, as a full softmax would.
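
To make the character n-gram idea concrete, here is a minimal sketch in plain Python. It is a simplification and not the fastText implementation: the function name `char_ngrams` is hypothetical, and real fastText additionally hashes n-grams into buckets and keeps a vector for the whole word. The `<` and `>` boundary markers follow the description in the Bojanowski et al. paper.

```python
def char_ngrams(word: str, n: int = 5) -> list[str]:
    """Return the character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    if len(wrapped) <= n:
        # The word is too short to split, so keep it as a single unit.
        return [wrapped]
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


correct = set(char_ngrams("przykład"))
misspelled = set(char_ngrams("przyklad"))  # "ł" typed as "l"
shared = correct & misspelled

print(sorted(shared))  # ['<przy', 'przyk']
```

The misspelled word still shares some n-grams with the correct one, which is why subword-based embeddings of the two tend to stay related.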


A Google Colab VM does not persist downloaded files across sessions, so large models are cached in Google Drive and loaded from there to avoid repeated downloads.

In [None]:
drive.mount("/content/drive") # attaches google drive as the folder to the google colab virtual machine filesystem
!mkdir -p /content/drive/MyDrive/models

!if [ ! -f "/content/drive/MyDrive/models/cc.pl.300.bin" ]; then \
    echo "Downloading and unzipping fastText model..."; \
    wget -O "/content/drive/MyDrive/models/cc.pl.300.bin.gz" https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.pl.300.bin.gz; \
    gunzip -f "/content/drive/MyDrive/models/cc.pl.300.bin.gz"; \
    echo "fastText model downloaded and unzipped."; \
else \
    echo "fastText model already exists."; \
fi

fasttext_model_path = "/content/drive/MyDrive/models/cc.pl.300.bin"
fasttext_embedding_model = fasttext.load_model(fasttext_model_path)

Mounted at /content/drive
fastText model already exists.


Setting the random seed for Python's random module and for PyTorch makes the results more reproducible. If a GPU is available, its random number generators are seeded as well.

In [5]:
random_seed = 42
random.seed(random_seed)

torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)
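
The effect of seeding can be checked with a small sketch that uses only the standard library. The helper `draw` is hypothetical and exists only for illustration. Note that NumPy keeps its own generator, which the cell above does not seed.

```python
import random


def draw(seed: int, count: int = 3) -> list[int]:
    """Seed the generator, then draw `count` integers from the range 0..100."""
    random.seed(seed)
    return [random.randint(0, 100) for _ in range(count)]


first_run = draw(42)
second_run = draw(42)

# The same seed reproduces the same sequence of draws.
print(first_run == second_run)  # True
```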

Since the PL-Guard dataset contains whole sentences, each sentence is mapped to an embedding vector using the **bag-of-words embedding** technique: embed each word and then average the word vectors. Each word in the sentence is lowercased because **cc.pl.300.bin** was trained on lowercase text.

In [None]:
fasttext_embedding_vectors = []
for row in PL_Guard_dataset["test"]:
  sentence_embedding_vectors = []
  for word in row["text"].split():
    sentence_embedding_vectors.append(fasttext_embedding_model.get_word_vector(word.lower()))
  fasttext_embedding_vectors.append(numpy.mean(sentence_embedding_vectors, axis=0))

900
[-0.03624863 -0.00731831  0.02759482  0.01410999]
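
The averaging step can be illustrated on toy data. The sketch below assumes a hypothetical dictionary `toy_vectors` in place of the fastText model and a hypothetical helper `sentence_vector`. Unlike fastText, which builds vectors for unknown words from subwords, the toy lookup simply skips them. It also shows one possible way to handle an empty text, which would otherwise average to NaN.

```python
import numpy

# Hypothetical 3-dimensional word vectors standing in for get_word_vector.
toy_vectors = {
    "ala": numpy.array([1.0, 0.0, 2.0]),
    "ma": numpy.array([3.0, 2.0, 0.0]),
    "kota": numpy.array([2.0, 4.0, 1.0]),
}


def sentence_vector(text: str, dim: int = 3) -> numpy.ndarray:
    """Average the vectors of the lowercased, whitespace-split words."""
    vectors = [toy_vectors[word] for word in text.lower().split() if word in toy_vectors]
    if not vectors:
        # Averaging an empty list would give NaN, so fall back to zeros.
        return numpy.zeros(dim)
    return numpy.mean(vectors, axis=0)


print(sentence_vector("Ala ma kota"))  # [2. 2. 1.]
print(sentence_vector(""))             # [0. 0. 0.]
```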


In [None]:
drive.mount("/content/drive")
!mkdir -p /content/drive/MyDrive/embeddings
numpy.save("/content/drive/MyDrive/embeddings/fasttext_embedding_vectors.npy", fasttext_embedding_vectors)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
drive.mount("/content/drive")
fasttext_embedding_vectors = numpy.load("/content/drive/MyDrive/embeddings/fasttext_embedding_vectors.npy")

In [None]:
print(len(fasttext_embedding_vectors))
print(fasttext_embedding_vectors[0][:4])
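
The separate save and load cells could be folded into one helper that computes the embeddings only when no cached file exists. This is a sketch under assumptions: `load_or_compute` and `fake_embeddings` are hypothetical names, and a temporary directory stands in for the Google Drive folder.

```python
import os
import tempfile

import numpy


def load_or_compute(path, compute):
    """Load an array from `path` if it exists, otherwise compute and save it."""
    if os.path.exists(path):
        return numpy.load(path)
    array = numpy.asarray(compute())
    numpy.save(path, array)
    return array


calls = []


def fake_embeddings():
    # Stand-in for an expensive embedding loop.
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_embedding_vectors.npy")
    first = load_or_compute(path, fake_embeddings)   # computes and saves
    second = load_or_compute(path, fake_embeddings)  # loads from disk

print(len(calls))  # 1
```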

**BERT** (Bidirectional Encoder Representations from Transformers) is a transformer-based neural network that generates context-dependent embeddings for each token by analyzing the entire sentence using stacked attention layers. **AutoTokenizer** automatically detects the type of the pretrained model, selects the correct tokenizer for it and instantiates it with its config and vocabulary. For the original BERT this is a **WordPiece** tokenizer. **AutoModel** downloads the model architecture and trained weights and builds the neural network in PyTorch. The most often used checkpoints are "bert-base-uncased" and "bert-base-cased", the latter of which preserves letter case. Both are English-focused, so the Polish "herbert-base-cased" was chosen.  
Sources:  
https://www.geeksforgeeks.org/nlp/how-to-generate-word-embedding-using-bert/  
https://huggingface.co/allegro/herbert-base-cased

In [None]:
BERT_tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
BERT_model = AutoModel.from_pretrained("allegro/herbert-base-cased")

pytorch_model.bin:   0%|          | 0.00/654M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/654M [00:00<?, ?B/s]

Some weights of the model checkpoint at allegro/herbert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.sso.sso_relationship.bias', 'cls.sso.sso_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Explanation of the parameters passed to BERT_tokenizer:
- padding appends additional [PAD] tokens to shorter sentences so that all sentences in a batch have the same length
- truncation cuts the sentence if it is longer than the model's maximum input length
- return_tensors tells the tokenizer to output tokens in a format suitable for PyTorch

For each sentence, BERT_tokenizer outputs a dictionary:
- "input_ids": tensor([id for each token passed])
- "token_type_ids": tensor([zeros]) - tells the model which token belongs to which sentence (segment). This is because BERT was trained on two sentences at a time.
- "attention_mask": tensor([ones]) - ones mark real tokens; [PAD] tokens, which don't need attention, are marked with zeros
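
How padding and the attention mask relate can be shown with a toy sketch in plain Python. It does not call the Hugging Face tokenizer: `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption for illustration only.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad token-id lists to a common length and build matching attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        padding = longest - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids; a real tokenizer would produce these from text.
batch = pad_batch([[5, 17, 9], [5, 8]])
print(batch["input_ids"])       # [[5, 17, 9], [5, 8, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```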

Wrapping the call in with torch.no_grad(): runs the model for inference only. Inside this block PyTorch does not track gradients, so the trained model produces outputs without any data being stored for backpropagation, and its parameters are not changed.

The last hidden state is what the model outputs: a separate vector for each passed token. To get the embedding for the whole sentence we use mean pooling. Another possibility would be to take only the vector assigned to the first token, the special [CLS] token, which captures a kind of summary of the whole sentence. Each torch.Tensor is converted to an ndarray so that vectors embedded via different methods share the same format.
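
When several sentences are encoded in one padded batch, a plain mean would also average the padding positions. The following sketch shows a mask-aware variant of mean pooling. It uses NumPy arrays with toy values rather than real model outputs, and `masked_mean_pool` is a hypothetical helper, not part of transformers.

```python
import numpy


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring positions where the mask is 0."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, hidden)
    counts = numpy.clip(mask.sum(axis=1), 1e-9, None)                 # avoid division by zero
    return summed / counts


# Toy batch: 2 sentences, 3 token positions, hidden size 2.
token_embeddings = numpy.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0]],  # last position is padding
])
attention_mask = numpy.array([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)
# [[3. 4.]
#  [3. 3.]]
```

With one sentence per call and no padding, as in the loop that follows, the plain mean and the masked mean give the same result.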

In [None]:
BERT_embedding_vectors = []
for row in PL_Guard_dataset["test"]:
  inputs = BERT_tokenizer(row["text"], return_tensors="pt", padding=True, truncation=True)
  with torch.no_grad():
    outputs = BERT_model(**inputs)
  token_embeddings = outputs.last_hidden_state
  sentence_embedding = token_embeddings.mean(dim=1).squeeze()
  BERT_embedding_vectors.append(sentence_embedding.numpy())

In [None]:
drive.mount('/content/drive')
!mkdir -p /content/drive/MyDrive/embeddings
numpy.save("/content/drive/MyDrive/embeddings/BERT_embedding_vectors.npy", BERT_embedding_vectors)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
drive.mount('/content/drive')
BERT_embedding_vectors = numpy.load("/content/drive/MyDrive/embeddings/BERT_embedding_vectors.npy")

In [None]:
print(BERT_embedding_vectors[0].device)
print(len(BERT_embedding_vectors))
print(BERT_embedding_vectors[0].shape)
print(BERT_embedding_vectors[0][:4])
print(inputs["input_ids"])
print(inputs["token_type_ids"])
print(inputs["attention_mask"])

cpu
900
torch.Size([1, 768])
tensor([[-5.2318e-02,  2.2354e-02, -1.2902e-01,  1.0401e-01,  3.0761e-01,
         -2.0457e-01, -1.2662e-01,  3.2543e-01, -6.4237e-03, -1.2980e-03,
          9.3004e-02,  2.2632e-01,  5.4255e-01,  6.5196e-02,  1.0874e-01,
          9.6080e-02,  9.0223e-02,  2.6781e-01, -1.5139e-01, -5.1023e-02,
          2.0876e-01,  1.0214e-01,  1.1614e-01,  3.7391e-02,  2.0110e-01,
         -1.8436e-01,  1.6819e-01,  1.9441e-02,  1.7304e-01,  2.1117e-01,
          1.0623e-01, -5.4748e-02, -2.8980e-02,  3.2231e-01, -1.0826e-02,
         -2.0313e-02, -1.0872e-01,  9.1013e-02,  4.5888e-01, -4.3054e-03,
          2.7931e-02,  4.1409e-02, -1.0492e-01, -1.6580e-02,  1.4951e-01,
         -9.7196e-03,  2.1507e-02,  1.4988e-01,  1.8104e-01, -1.0383e-01,
         -1.4252e-01,  1.8439e-01, -2.5939e-02,  4.3357e-01,  2.9369e-01,
         -1.2090e-01,  3.2472e-02,  2.9223e-02,  8.9340e-02,  3.8044e-01,
          3.7646e-01,  2.8562e-01,  1.1881e-01,  1.0438e-01,  9.0697e-02,
         

**SBERT** (Sentence-BERT) performs tokenization and pooling internally. Its architecture is designed to output one vector per sentence. It is the best-known implementation of the sentence transformer and was introduced by Nils Reimers and Iryna Gurevych in 2019. It is well suited to tasks like semantic textual similarity (STS), that is, measuring how similar two pieces of text are in meaning rather than counting shared words. It gains this ability by modifying the BERT architecture to use a siamese or triplet network configuration. In a siamese setup, two identical transformer-based encoders with shared weights process two different sentences independently and produce two sentence embeddings. The similarity between the two embeddings is then computed using Euclidean distance or cosine similarity. The model is trained so that semantically similar sentences have embeddings that are close together in this space, while semantically dissimilar sentences are far apart. In a triplet setup, the model processes three sentences: an anchor, a positive and a negative. Its objective is to make the distance between the anchor and the positive smaller than the distance between the anchor and the negative.

Sources:  
https://arxiv.org/pdf/1908.10084  
https://en.wikipedia.org/wiki/Triplet_loss  
https://www.geeksforgeeks.org/nlp/sentence-transformer/  
https://en.wikipedia.org/wiki/Siamese_neural_network  
https://huggingface.co/Voicelab/sbert-base-cased-pl
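
The triplet objective described above can be written as max(||a - p|| - ||a - n|| + margin, 0). A minimal NumPy sketch on toy 2-dimensional vectors follows. The function name `triplet_loss` and the example vectors are made up for illustration, and the margin of 1.0 is only a default chosen for this sketch.

```python
import numpy


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(||a - p|| - ||a - n|| + margin, 0) with Euclidean distance."""
    positive_distance = numpy.linalg.norm(anchor - positive)
    negative_distance = numpy.linalg.norm(anchor - negative)
    return float(max(positive_distance - negative_distance + margin, 0.0))


anchor = numpy.array([0.0, 0.0])
positive = numpy.array([0.0, 1.0])       # distance 1 from the anchor
far_negative = numpy.array([3.0, 4.0])   # distance 5 from the anchor
near_negative = numpy.array([0.0, 1.5])  # distance 1.5 from the anchor

print(triplet_loss(anchor, positive, far_negative))   # 0.0
print(triplet_loss(anchor, positive, near_negative))  # 0.5
```

The loss is zero once the negative is at least one margin farther from the anchor than the positive, and positive otherwise.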

In [6]:
SBERT_model = SentenceTransformer("Voicelab/sbert-base-cased-pl")
SBERT_embedding_vectors = []
for row in PL_Guard_dataset["test"]:
  SBERT_embedding_vectors.append(SBERT_model.encode(row["text"]))



config.json:   0%|          | 0.00/679 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/498M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/129 [00:00<?, ?B/s]

In [7]:
drive.mount("/content/drive")
!mkdir -p /content/drive/MyDrive/embeddings
numpy.save("/content/drive/MyDrive/embeddings/SBERT_embedding_vectors.npy", SBERT_embedding_vectors)

Mounted at /content/drive


In [None]:
drive.mount('/content/drive')
SBERT_embedding_vectors = numpy.load("/content/drive/MyDrive/embeddings/SBERT_embedding_vectors.npy")

In [8]:
print(len(SBERT_embedding_vectors))
print(SBERT_embedding_vectors[0].shape)
print(SBERT_embedding_vectors[0][:4])

900
(768,)
[ 0.13900119  0.17306335 -0.3774487   0.25774908]


The similarity metrics between PL_Guard_dataset and PL_Guard_adversarial_dataset are computed using each of the embedding methods.  
Sources:  
https://www.geeksforgeeks.org/python/how-to-calculate-cosine-similarity-in-python/  


In [None]:
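# cosine_similarity expects 2-D arrays of shape (n_samples, n_features).
# Illustrative choice: compare the first two SBERT sentence embeddings.
sentence_embedding = numpy.asarray(SBERT_embedding_vectors[0]).reshape(1, -1)
example_sentence_embedding = numpy.asarray(SBERT_embedding_vectors[1]).reshape(1, -1)
similarity_score = cosine_similarity(sentence_embedding, example_sentence_embedding)

# Print the similarity score
print("Cosine Similarity Score:", similarity_score[0][0])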
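
To compare two whole embedding matrices, one option is a row-wise cosine similarity. The sketch below assumes that row i of the adversarial embeddings corresponds to row i of the original embeddings, which this notebook does not confirm. The function name `rowwise_cosine` and the toy matrices are hypothetical.

```python
import numpy


def rowwise_cosine(a, b):
    """Cosine similarity between row i of `a` and row i of `b`."""
    a = numpy.asarray(a, dtype=float)
    b = numpy.asarray(b, dtype=float)
    dots = (a * b).sum(axis=1)
    norms = numpy.linalg.norm(a, axis=1) * numpy.linalg.norm(b, axis=1)
    return dots / norms


# Toy stand-ins for two aligned embedding matrices.
original = numpy.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
adversarial = numpy.array([[2.0, 0.0], [1.0, -1.0], [0.0, -3.0]])

scores = rowwise_cosine(original, adversarial)
print(scores)  # approximately 1, 0 and -1
```

A score of 1 means the two vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.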

# Dimensionality reduction

PCA (Principal Component Analysis) finds a new set of axes, called principal components, that capture the directions of greatest variation (change) in the data. First, the data in each dimension needs to be standardized so that the values in every dimension are on the same scale. The PCA algorithm finds the principal components by calculating the eigenvectors and eigenvalues of the covariance matrix.
- covariance measures whether two variables change in the same direction or not
- the covariance matrix stores the covariance between each pair of features
- an eigenvector of a matrix is a direction that does not change when the matrix multiplies it
- an eigenvalue measures how much the eigenvector is stretched or shrunk. It represents how much of the variance in the data lies in the direction of its eigenvector

The PCA algorithm selects the eigenvectors with the highest eigenvalues (as many as the number of dimensions wanted in the result) and projects the data onto these vectors.

Sources:  
https://en.wikipedia.org/wiki/Covariance  
https://www.geeksforgeeks.org/maths/covariance-matrix/  
https://www.geeksforgeeks.org/data-analysis/principal-component-analysis-pca/  
https://www.geeksforgeeks.org/engineering-mathematics/eigen-values/
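
The PCA steps described above (standardize, build the covariance matrix, take the eigenvectors with the largest eigenvalues, project) can be sketched with NumPy alone. This is an illustrative implementation on synthetic data, not the method used later in the notebook. The function name `pca` and the generated data are assumptions.

```python
import numpy


def pca(data, n_components):
    """Project `data` onto its top principal components via the covariance matrix."""
    data = numpy.asarray(data, dtype=float)
    # Standardize each feature to zero mean and unit variance.
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    covariance = numpy.cov(standardized, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = numpy.linalg.eigh(covariance)
    order = numpy.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    return standardized @ components, eigenvalues[order]


rng = numpy.random.default_rng(42)
x = rng.normal(size=200)
# The second feature follows the first closely; the third is independent noise.
data = numpy.column_stack([x, x + 0.1 * rng.normal(size=200), rng.normal(size=200)])

projected, top_eigenvalues = pca(data, n_components=2)
print(projected.shape)  # (200, 2)
print(top_eigenvalues)
```

Because two of the three features are strongly correlated, the first eigenvalue should come out clearly larger than the second, meaning a single component captures most of what those two features share.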