## Testing LZ78 Embeddings: Simple Experiment

In [None]:
from sys import stdout
from tqdm import tqdm
from lz_embed.transformer_based import LZPlusEmbeddingModel, WeightType, EmbeddingType, \
    TokenizedLZPlusEmbedding
from datasets import load_dataset
import numpy as np
import matplotlib.pyplot as plt
import torch
from sklearn.metrics import ndcg_score
import mteb

from model2vec import StaticModel
from transformers import AutoTokenizer, AutoModel
import regex as re
from model2vec.distill.inference import create_output_embeddings_from_model_and_tokens

In [None]:
%load_ext autoreload
%autoreload 2

### `TokenizedLZPlusEmbedding`
This does the following:
1. Computes embeddings for all tokens in (e.g.) the BERT vocabulary and performs PCA, just like the Potion models. 

    _Note_: this step currently falls short of the Potion models, which additionally fine-tune the stored embeddings. We could do the same, but it would slow down the initial experiment cycle considerably.

2. Trains an LZ78 SPA (sequential probability assignment) on some data, after lowercasing the text and dropping every character that isn't a letter, number, or space.

3. Computes embeddings by taking the weighted average of the stored embeddings over the tokens in the input sequences.
Here, we have the option to use uniform weighting, Zipf weighting (which should be equivalent to the Potion models, minus the finetuning), or LZ log loss weighting.
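
As a rough illustration of step 3, the sketch below pools toy token embeddings under the three weighting schemes. It is not the `lz_embed` implementation: the function name, the `log(1 + rank)` form of the Zipf weights, and the `-log2 p` form of the log loss weights are all assumptions made for this example.

```python
import numpy as np

def weighted_mean_embedding(token_vectors, weights):
    """Weighted average of per-token vectors (weights must be non-negative)."""
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * token_vectors).sum(axis=0) / weights.sum()

# Toy input: three tokens with 2-dimensional stored embeddings.
token_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Uniform weighting: every token counts equally.
uniform = np.ones(3)

# Zipf-style weighting (assumed form): log(1 + rank), where rank is the token's
# position in a frequency-sorted vocabulary, so rarer tokens weigh more.
vocab_ranks = np.array([1, 50, 5000])
zipf = np.log1p(vocab_ranks)

# Log loss weighting (assumed form): -log2 p(token), where p is the probability
# a sequential model assigned to the token, so surprising tokens weigh more.
token_probs = np.array([0.5, 0.25, 0.125])
log_loss = -np.log2(token_probs)  # 1, 2, and 3 bits

for name, w in [("uniform", uniform), ("zipf", zipf), ("log_loss", log_loss)]:
    print(name, weighted_mean_embedding(token_vectors, w))
```

The only thing that changes between the three schemes is the weight vector, which is why the weight type can be switched on an already-trained model.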

In theory, LZ log loss weighting should be the best of the three, but it currently does not beat Zipf weighting, probably because the LZ tree needs to be trained on more data. It does beat uniform weighting, which is at least a step in the right direction.

**Some future steps**:
- Train LZ on more data
- Try replacing the BERT tokenizer with an LZ-based tokenizer
- Train LZ on tokens instead of letters

In [None]:
model = TokenizedLZPlusEmbedding( 
    inner_model_name="BAAI/bge-base-en-v1.5",
    output_dir="object",
    compute_device="cuda:7",
    weight_type=WeightType.LOG_LOSS,
    pca_dim=256,
    pca=True,
)
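
For intuition about the `pca=True, pca_dim=256` arguments, here is a minimal PCA sketch in NumPy. It assumes the per-token embeddings form a `(vocab_size, 768)` matrix (768 is the output width of `BAAI/bge-base-en-v1.5`); the helper name and the random stand-in data are hypothetical, and the class may well compute its projection differently.

```python
import numpy as np

def pca_reduce(embeddings, dim):
    """Project the rows of `embeddings` onto their top `dim` principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

rng = np.random.default_rng(0)
vocab_embeddings = rng.normal(size=(1000, 768))  # stand-in for per-token encoder outputs
reduced = pca_reduce(vocab_embeddings, dim=256)
print(reduced.shape)  # (1000, 256)
```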

### Very Simple Model Training

In [None]:
dataset = load_dataset("salesforce/Wikitext", "wikitext-2-v1")

In [None]:
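EPOCHS = 20
# Build the training corpus once instead of rebuilding the list every epoch.
train_texts = [row["text"] for row in dataset["train"]]
stdout.flush()
for _ in tqdm(range(EPOCHS)):
    model.train_spa(train_texts)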

In [None]:
# model.spa.prune(5)

In [None]:
print(f"The LZ tree has {model.spa.get_total_nodes() / 1e6} million nodes")
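
To make the node count more concrete, the toy sketch below normalizes text as described earlier (lowercase; letters, numbers, and spaces only) and grows an LZ78 prefix tree one character at a time. Both function names are hypothetical, restricting to ASCII is an assumption, and the real SPA also keeps per-node statistics that are omitted here.

```python
import re

def normalize(text):
    """Lowercase and keep only ASCII letters, digits, and spaces (ASCII is an assumption)."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())

def lz78_node_count(text):
    """Build an LZ78 prefix tree over characters; return its node count, root included."""
    root = {}
    node = root
    nodes = 1
    for ch in text:
        if ch in node:
            node = node[ch]  # still inside a known phrase: walk down the tree
        else:
            node[ch] = {}    # unseen continuation: add one leaf ...
            nodes += 1
            node = root      # ... and start the next phrase at the root
    return nodes

sample = normalize("Hello, World! Hello, world?")
print(sample)  # hello world hello world
print(lz78_node_count(sample))
```

Each new phrase adds exactly one node, so the tree keeps growing across passes over the same data, only more slowly as phrases get longer.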

## MTEB Evaluation
As the `LZPlusEmbeddingModel` class inherits from `SentenceTransformer`, any MTEB task can be evaluated using the `mteb` library's interface.

Below are a few tasks from the benchmark, each with a very high-level description of how it is scored.
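
Every task is run once per weight type with otherwise identical code. A small helper along the following lines could collapse that repetition; `score_per_weight_type` and the `evaluate` callback are hypothetical names, and the commented-out wrapper only reuses the `mteb` calls that already appear in this notebook.

```python
def score_per_weight_type(model, weight_types, evaluate):
    """Set each weight type on `model`, call `evaluate(model)`, and collect the scores."""
    scores = {}
    for weight_type in weight_types:
        model.weight_type = weight_type
        scores[weight_type] = evaluate(model)
    return scores

# Hypothetical usage with the objects defined in this notebook:
#
# def evaluate_aila(model):
#     tasks = mteb.get_tasks(tasks=["AILAStatutes"])
#     results = mteb.MTEB(tasks=tasks).run(
#         model, output_folder="results/test", overwrite_results=True
#     )
#     return results[0].scores["test"][0]["main_score"] * 100
#
# score_per_weight_type(
#     model, [WeightType.LOG_LOSS, WeightType.ZIPF, WeightType.UNIFORM], evaluate_aila
# )
```

Passing the evaluation step in as a callback keeps the helper independent of which split name (`"test"`, `"test_2021"`, ...) a given task reports.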

### AILA Statutes (Retrieval)
We are given some documents and queries. The embedding model is scored based on whether the relevant documents for each query are close to the query in embedding space.
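
The "closeness" idea can be sketched as cosine-similarity ranking followed by a ranking metric. The example assumes the headline metric is nDCG@10, which is what I understand MTEB to report for retrieval tasks; the helper names and toy vectors are illustrative only.

```python
import numpy as np

def cosine_scores(query, docs):
    """Cosine similarity between one query vector and each row of `docs`."""
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ query

def ndcg_at_k(relevance, scores, k=10):
    """nDCG@k for a single query: `relevance` labels ranked by descending `scores`."""
    relevance = np.asarray(relevance, dtype=float)
    order = np.argsort(-np.asarray(scores))[:k]
    discounts = 1.0 / np.log2(np.arange(2, order.size + 2))
    dcg = float((relevance[order] * discounts).sum())
    ideal = np.sort(relevance)[::-1][:k]  # best possible ordering of the labels
    idcg = float((ideal * discounts[: ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

query = np.array([1.0, 0.0])
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = cosine_scores(query, docs)

print(ndcg_at_k([1, 0, 0], scores))  # relevant document is ranked first
print(ndcg_at_k([0, 1, 0], scores))  # relevant document is ranked last
```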

In [None]:
model.spa.set_inference_config(
    lb=1e-3,
    gamma=1/model.charmap.alphabet_size(),
    ensemble_type="entropy",
    ensemble_n=20,
    backshift_ctx_len=20
)

In [None]:
model.weight_type = WeightType.LOG_LOSS

tasks = mteb.get_tasks(tasks=["AILAStatutes"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)
print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)

In [None]:
model.weight_type = WeightType.ZIPF

tasks = mteb.get_tasks(tasks=["AILAStatutes"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)
print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)

In [None]:
model.weight_type = WeightType.UNIFORM

tasks = mteb.get_tasks(tasks=["AILAStatutes"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)
print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)

### ArXivHierarchicalClusteringP2P (Clustering)
We are given articles from arXiv, and the embedding model is scored based on how well embeddings of the articles can be hierarchically clustered (compared to ground-truth "topic" labels for the articles).
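
Agreement between a clustering and ground-truth labels is commonly summarized by the V-measure, which I believe is the main score MTEB reports for clustering tasks (an assumption here, not something this notebook verifies). The sketch below only scores a given cluster assignment; it does not perform the clustering itself, and the function names are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a list of labels."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """H(target | given) for two aligned label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(c / n * log(c / given_counts[g]) for (g, _), c in joint.items())

def v_measure(true_labels, cluster_ids):
    """Harmonic mean of homogeneity and completeness."""
    h_true, h_clus = entropy(true_labels), entropy(cluster_ids)
    homogeneity = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_labels, cluster_ids) / h_true
    completeness = 1.0 if h_clus == 0 else 1.0 - conditional_entropy(cluster_ids, true_labels) / h_clus
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

topics = ["cs", "cs", "math", "math"]
print(v_measure(topics, [0, 0, 1, 1]))  # clusters match the topics exactly
print(v_measure(topics, [0, 1, 0, 1]))  # clusters carry no information about the topics
```

The score is invariant to how the clusters are numbered, so only the grouping matters.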

In [None]:
model.weight_type = WeightType.LOG_LOSS

tasks = mteb.get_tasks(tasks=["ArXivHierarchicalClusteringP2P"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)
print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)

In [None]:
model.weight_type = WeightType.ZIPF

tasks = mteb.get_tasks(tasks=["ArXivHierarchicalClusteringP2P"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)
print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)

In [None]:
model.weight_type = WeightType.UNIFORM

tasks = mteb.get_tasks(tasks=["ArXivHierarchicalClusteringP2P"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)
print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)

### DBpediaClassification
A classification task over DBpedia encyclopedia articles, scored by accuracy. Classification appears to be performed with k-nearest neighbors in embedding space.
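
If classification is indeed nearest-neighbor based, the scoring would look roughly like the sketch below: embed the training and test texts, label each test point by majority vote among its most similar training points, and report accuracy. The choice of cosine similarity, `k=3`, and the toy vectors are all assumptions.

```python
import numpy as np

def knn_predict(train_vecs, train_labels, test_vecs, k=3):
    """Label each test vector by majority vote among its k most cosine-similar training vectors."""
    train = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    test = test_vecs / np.linalg.norm(test_vecs, axis=1, keepdims=True)
    sims = test @ train.T                          # (n_test, n_train) cosine similarities
    nearest = np.argsort(-sims, axis=1)[:, :k]     # indices of the k closest training points
    preds = []
    for row in nearest:
        labels, counts = np.unique(train_labels[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Toy embeddings: class 0 points lie near the x-axis, class 1 points near the y-axis.
train_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
train_labels = np.array([0, 0, 0, 1, 1, 1])
test_vecs = np.array([[1.0, 0.2], [0.1, 1.0]])
test_labels = np.array([0, 1])

preds = knn_predict(train_vecs, train_labels, test_vecs)
accuracy = float((preds == test_labels).mean())
print(accuracy)
```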

In [None]:
model.weight_type = WeightType.LOG_LOSS

tasks = mteb.get_tasks(tasks=["DBpediaClassification"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)

print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)

In [None]:
model.weight_type = WeightType.ZIPF

tasks = mteb.get_tasks(tasks=["DBpediaClassification"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)

print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)

In [None]:
model.weight_type = WeightType.UNIFORM

tasks = mteb.get_tasks(tasks=["DBpediaClassification"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)

print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)

### TweetTopicSingleClassification

In [None]:
model.weight_type = WeightType.LOG_LOSS

tasks = mteb.get_tasks(tasks=["TweetTopicSingleClassification"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)
print("SCORE: ", results[0].scores["test_2021"][0]["main_score"] * 100)

In [None]:
model.weight_type = WeightType.ZIPF

tasks = mteb.get_tasks(tasks=["TweetTopicSingleClassification"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)

print("SCORE: ", results[0].scores["test_2021"][0]["main_score"] * 100)

In [None]:
model.weight_type = WeightType.UNIFORM

tasks = mteb.get_tasks(tasks=["TweetTopicSingleClassification"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)

print("SCORE: ", results[0].scores["test_2021"][0]["main_score"] * 100)

### PoemSentimentClassification

In [None]:
model.weight_type = WeightType.LOG_LOSS

tasks = mteb.get_tasks(tasks=["PoemSentimentClassification"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)
print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)

In [None]:
model.weight_type = WeightType.ZIPF

tasks = mteb.get_tasks(tasks=["PoemSentimentClassification"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)

print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)

In [None]:
model.weight_type = WeightType.UNIFORM

tasks = mteb.get_tasks(tasks=["PoemSentimentClassification"])
evaluation = mteb.MTEB(tasks=tasks)

# If this doesn't actually run, you'll have to delete a JSON file in results/test
results = evaluation.run(
    model, output_folder=f"results/test",
    show_progress_bar=True,
    overwrite_results=True
)

print("SCORE: ", results[0].scores["test"][0]["main_score"] * 100)