<a href="https://colab.research.google.com/github/Hadrien-Cornier/cool-nn-stuff/blob/main/mteb_8_tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

MTEB (Massive Text Embedding Benchmark) evaluates text embedding models on 8 tasks:

1. Bitext mining
2. Classification
3. Clustering
4. Pair classification
5. Reranking
6. Retrieval
7. Semantic Textual Similarity (STS)
8. Summarization

Each of the 8 tasks is explained below, with a toy code example for each:

# Bitext Mining

Bitext mining involves finding parallel sentences in two different languages. This task is crucial for machine translation and cross-lingual information retrieval.

Example code:

In [2]:
!pip install sentence_transformers

In [4]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('LaBSE')

# Sentences in two languages
english_sentences = ["Hello, how are you?", "The weather is nice today."]
french_sentences = ["Bonjour, comment allez-vous?", "Il fait beau aujourd'hui.", "J'aime le chocolat."]

# Encode sentences
en_embeddings = model.encode(english_sentences)
fr_embeddings = model.encode(french_sentences)

# Find best matches
for en_emb in en_embeddings:
    scores = util.cos_sim(en_emb, fr_embeddings)[0]
    best_match = french_sentences[scores.argmax()]
    print(f"Best match: {best_match}")

Best match: Bonjour, comment allez-vous?
Best match: Il fait beau aujourd'hui.
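
To turn matches like these into a score, the predictions have to be compared against a known alignment; the MTEB paper reports F1 as the main metric for this task, alongside accuracy, precision and recall. The sketch below only computes plain matching accuracy. The similarity values and the `gold` alignment are hypothetical stand-ins, not outputs of the model above.

```python
def best_matches(similarity):
    """For each row, return the index of the highest-scoring column."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity]


def matching_accuracy(predicted, gold):
    """Fraction of source sentences whose predicted match equals the gold match."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical cosine similarities: rows are English sentences,
# columns are French sentences (values invented for illustration).
similarity = [
    [0.91, 0.12, 0.08],
    [0.10, 0.88, 0.15],
]
# Assumed gold alignment: English sentence i translates to French sentence gold[i].
gold = [0, 1]

predicted = best_matches(similarity)
print(predicted, matching_accuracy(predicted, gold))
```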


# Classification

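Classification assigns one of a set of predefined labels to each input text. In MTEB, the texts are first embedded and a logistic regression classifier is trained on those embeddings; the toy example here uses bag-of-words counts and Naive Bayes instead, to stay lightweight.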

Example code:


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Sample data
X = ["This movie is great", "I hated this book", "The food was delicious"]
y = ["positive", "negative", "positive"]

# Create pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])

# Train the model
text_clf.fit(X, y)

# Predict
print(text_clf.predict(["I loved this restaurant"]))

['positive']


# Clustering

Clustering groups similar data points together without predefined labels. In text clustering, we group similar documents.

Example code:


In [6]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Sample data
documents = [
    "The cat is on the mat",
    "The dog is in the yard",
    "The sun is shining bright",
    "The moon is full tonight"
]

# Vectorize the text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Perform clustering. No random_state is set, so cluster IDs and
# assignments can differ between runs.
kmeans = KMeans(n_clusters=2, n_init=10)
kmeans.fit(X)

# Print cluster assignments
for doc, cluster in zip(documents, kmeans.labels_):
    print(f"Document: {doc} | Cluster: {cluster}")

Document: The cat is on the mat | Cluster: 1
Document: The dog is in the yard | Cluster: 1
Document: The sun is shining bright | Cluster: 1
Document: The moon is full tonight | Cluster: 0
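
Cluster IDs are arbitrary, so a clustering is scored with a metric that ignores label names. The MTEB paper uses V-measure, the harmonic mean of homogeneity and completeness. Below is a minimal pure-Python sketch of that metric; the `true_topics` labels are hypothetical and are not part of the example above.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a label sequence (natural log)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """H(labels | given) for two aligned label sequences."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness."""
    h_true = entropy(true_labels)
    h_cluster = entropy(cluster_labels)
    homogeneity = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    completeness = 1.0 if h_cluster == 0 else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_cluster
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical gold topics for the four documents, and the cluster IDs printed above.
true_topics = ["animal", "animal", "sky", "sky"]
clusters = [1, 1, 1, 0]
print(f"V-measure: {v_measure(true_topics, clusters):.4f}")
```

A perfect clustering scores 1.0 regardless of how the cluster IDs are numbered, and putting every document in one cluster scores 0.0 when the gold labels differ.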





# Pair Classification

Pair classification determines the relationship between two pieces of text, such as whether they are paraphrases or duplicates.

Example code:


In [8]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample pairs
pairs = [
    ("The cat is on the mat", "A feline is resting on a rug"),
    ("I love pizza", "Pizza is my favorite food"),
    ("The sun is shining", "It's raining outside")
]

# Classify pairs with a fixed cosine-similarity threshold (0.5 is an arbitrary cutoff)
for sent1, sent2 in pairs:
    emb1 = model.encode(sent1)
    emb2 = model.encode(sent2)
    similarity = util.cos_sim(emb1, emb2).item()  # 1x1 tensor -> float
    label = "Paraphrase" if similarity > 0.5 else "Not paraphrase"
    print(f"Pair: '{sent1}' - '{sent2}' | Label: {label}")

Pair: 'The cat is on the mat' - 'A feline is resting on a rug' | Label: Paraphrase
Pair: 'I love pizza' - 'Pizza is my favorite food' | Label: Paraphrase
Pair: 'The sun is shining' - 'It's raining outside' | Label: Not paraphrase
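
A fixed cutoff such as 0.5 is a guess. As described in the MTEB paper, the benchmark instead picks the best binary threshold on the similarity scores and reports average precision as the main metric. The sketch below shows only the threshold search, using hypothetical scores and labels rather than real model outputs.

```python
def best_accuracy_threshold(scores, labels):
    """Return (threshold, accuracy) maximizing accuracy of `score >= threshold`.

    Only the observed scores are tried as candidate thresholds; on ties the
    lowest threshold is kept.
    """
    best_threshold, best_accuracy = None, -1.0
    for threshold in sorted(set(scores)):
        predictions = [score >= threshold for score in scores]
        accuracy = sum(p == bool(l) for p, l in zip(predictions, labels)) / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


# Hypothetical cosine similarities and gold labels (1 = paraphrase, 0 = not).
scores = [0.62, 0.81, 0.35, 0.15]
labels = [1, 1, 0, 0]

threshold, accuracy = best_accuracy_threshold(scores, labels)
print(f"Best threshold: {threshold} | Accuracy: {accuracy}")
```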


# Reranking

Reranking reorders a list of candidate items, typically search results, by their relevance to a query.

Example code:


In [9]:
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "What is the capital of France?"
documents = [
    "Paris is the capital of France.",
    "London is the capital of England.",
    "France is a country in Europe.",
    "The Eiffel Tower is in Paris."
]

# Score query-document pairs
pairs = [[query, doc] for doc in documents]
scores = model.predict(pairs)

# Rerank documents
ranked_results = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)

for doc, score in ranked_results:
    print(f"Score: {score:.4f} | Document: {doc}")

Score: 8.5007 | Document: Paris is the capital of France.
Score: -1.1092 | Document: France is a country in Europe.
Score: -2.8622 | Document: London is the capital of England.
Score: -5.5860 | Document: The Eiffel Tower is in Paris.
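
Once the documents are ordered, the ranking can be scored against relevance labels. The MTEB paper reports MAP as the main reranking metric, with MRR@k as a secondary one. Here is a minimal sketch of both for a single query; the relevance list is hypothetical, and averaging over all queries gives MAP and MRR.

```python
def reciprocal_rank(ranked_relevance):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_relevance):
    """Mean of precision@rank taken at each relevant item's position."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical relevance labels in ranked order (1 = relevant, 0 = not relevant).
ranked_relevance = [0, 1, 0, 1]
print(reciprocal_rank(ranked_relevance), average_precision(ranked_relevance))
```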


# Retrieval

Retrieval involves finding the most relevant documents from a large corpus given a query.

Example code:


In [10]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample corpus
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "A journey of a thousand miles begins with a single step",
    "To be or not to be, that is the question",
    "All that glitters is not gold"
]

# Encode corpus
corpus_embeddings = model.encode(corpus)

# Query
query = "What animal is mentioned?"
query_embedding = model.encode(query)

# Retrieve most similar documents
cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
top_results = sorted(range(len(cos_scores)), key=lambda i: cos_scores[i], reverse=True)[:2]

for idx in top_results:
    print(f"Score: {cos_scores[idx]:.4f} | Document: {corpus[idx]}")

Score: 0.3885 | Document: The quick brown fox jumps over the lazy dog
Score: 0.0829 | Document: To be or not to be, that is the question
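
Retrieval quality is usually summarized with a rank-aware metric; the MTEB paper uses nDCG@10 as the main one. The sketch below is a simplified version that assumes `relevances` lists the graded relevance of every judged document for the query, in the order the system ranked them. The example values are hypothetical.

```python
from math import log2


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(rel / log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels in ranked order: the only relevant document is ranked second.
relevances = [0, 1, 0, 0]
print(f"nDCG@10: {ndcg_at_k(relevances, 10):.4f}")
```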



# Semantic Textual Similarity (STS)

STS measures the degree of semantic similarity between two pieces of text.

Example code:


In [None]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample sentence pairs
sentence_pairs = [
    ("The cat is on the mat", "There is a cat on the rug"),
    ("I love pizza", "Pizza is my favorite food"),
    ("The sun is shining", "It's a sunny day")
]

for sent1, sent2 in sentence_pairs:
    emb1 = model.encode(sent1)
    emb2 = model.encode(sent2)
    similarity = util.cos_sim(emb1, emb2)
    print(f"Sentence 1: {sent1}")
    print(f"Sentence 2: {sent2}")
    print(f"Similarity: {similarity[0][0]:.4f}\n")
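
Raw similarity values only become a benchmark score once they are compared with human ratings. The MTEB paper uses Spearman correlation between the cosine similarities and the gold scores as the main STS metric. Below is a minimal pure-Python sketch of Spearman correlation; the `human_scores` and `model_similarities` values are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gold ratings (0-5 scale) and model cosine similarities for four pairs.
human_scores = [4.5, 3.0, 1.0, 0.5]
model_similarities = [0.82, 0.74, 0.30, 0.41]
print(f"Spearman: {spearman(model_similarities, human_scores):.4f}")
```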


# Summarization

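Summarization involves creating a concise version of a longer text while preserving its key information. MTEB itself does not generate summaries: it checks how well an embedding model's similarity scores for machine-written summaries agree with human judgments. The example here only illustrates summarization itself.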

Example code:

In [11]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = """
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.
It is named after the engineer Gustave Eiffel, whose company designed and built the tower.
Constructed from 1887 to 1889 as the entrance arch to the 1889 World's Fair, it was initially
criticized by some of France's leading artists and intellectuals for its design, but it has
become a global cultural icon of France and one of the most recognizable structures in the world.
"""

summary = summarizer(text, max_length=50, min_length=10, do_sample=False)
print(summary[0]['summary_text'])

The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.Constructed from 1887 to 1889 as the entrance arch to the 1889 World's Fair, it was initiallycriticized by
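
Generating a summary is not what MTEB measures. As described in the MTEB paper, each machine-written summary is embedded and scored by its highest cosine similarity to any human-written summary, and those scores are then correlated with human quality ratings. The sketch below shows only the scoring step, using hypothetical two-dimensional embeddings in place of real model outputs.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def summary_score(machine_embedding, human_embeddings):
    """Highest cosine similarity between a machine summary and any human summary."""
    return max(cosine(machine_embedding, h) for h in human_embeddings)


# Hypothetical embeddings (invented values, 2 dimensions for readability).
human_embeddings = [[0.0, 1.0], [1.0, 0.0]]
machine_embedding = [1.0, 0.0]
print(summary_score(machine_embedding, human_embeddings))
```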


