# Practical Assignment 5: Implementing the BM25 Model


## Assignment goal
Learn to implement the BM25 (Best Matching 25) model, which scores the relevance of documents to a text query.

## Theory
BM25 is a probabilistic ranking model that improves on classic TF-IDF by normalizing for document length and adding tunable parameters.

### BM25 formula:

\[
BM25(t, d) = IDF(t) \cdot \frac{TF(t, d) \cdot (k_1 + 1)}{TF(t, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{avgdl})}
\]

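Where:
- **TF(t, d)**: the number of occurrences of term t in document d.
- **|d|**: the length of document d, in tokens.
- **avgdl**: the average document length across the corpus.
- **k1** and **b**: model hyperparameters (commonly k1 = 1.5, b = 0.75). k1 controls term-frequency saturation; b controls the strength of length normalization.
- **IDF(t)**: the inverse document frequency weight of the term: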

\[
IDF(t) = \ln\left( \frac{N - n_t + 0.5}{n_t + 0.5} + 1 \right)
\]

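Where:
- **N**: the total number of documents.
- **n_t**: the number of documents that contain term t.

The score of a document for a multi-term query is the sum of BM25(t, d) over the query terms t.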
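To make the role of k1 more concrete, here is a minimal sketch of the term-frequency part of the formula (everything except the IDF factor). The function name `tf_component` and the sample lengths are illustrative assumptions, not part of the assignment code.

```python
def tf_component(tf, doc_len, avgdl, k1=1.5, b=0.75):
    """Term-frequency part of BM25, i.e. the fraction that multiplies IDF(t)."""
    length_norm = 1 - b + b * (doc_len / avgdl)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# For a document of exactly average length the normalization factor is 1,
# so the component depends only on the term frequency.
for tf in (1, 2, 5, 50):
    print(tf, round(tf_component(tf, doc_len=10, avgdl=10), 4))
# 1 1.0
# 2 1.4286
# 5 1.9231
# 50 2.4272
```

The values grow ever more slowly and never exceed k1 + 1 = 2.5, so repeating a term many times cannot dominate the score the way raw term frequency can.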


## Part 1. Data preparation

In [5]:

import string
import spacy
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download required resources
nltk.download('punkt')
nltk.download('stopwords')

# Load spaCy model for lemmatization
nlp = spacy.load("en_core_web_sm")

# Document corpus
documents = [
    "Natural language processing enables machines to understand human language.",
    "Boolean retrieval is a basic model in information retrieval.",
    "Language models are essential for processing and analyzing text.",
    "Understanding Boolean operators is crucial for search engines.",
    "Machine learning techniques help improve ranking models.",
    "Search engines rely on efficient retrieval algorithms.",
    "Statistical language models are commonly used in NLP tasks.",
    "Deep learning models have transformed the field of artificial intelligence.",
    "Computational linguistics focuses on the use of algorithms to process natural language.",
    "Search engine optimization techniques influence web ranking.",
    "Vector space models represent text as numerical vectors for similarity calculations.",
    "Data mining extracts useful patterns from large datasets.",
    "Reinforcement learning allows AI systems to learn through trial and error.",
    "Large language models like GPT have revolutionized NLP applications.",
    "Information retrieval techniques help in document ranking and search efficiency.",
    "Text classification assigns predefined categories to text documents.",
    "Named entity recognition identifies proper nouns in texts.",
    "Sentiment analysis determines whether a text is positive, negative, or neutral.",
    "Machine translation enables conversion between different languages.",
    "Speech recognition systems convert spoken language into text.",
    "Optical character recognition extracts text from images and scanned documents.",
    "Knowledge graphs organize information in a structured format for better search results.",
    "Bayesian models are widely used for probabilistic text analysis.",
    "Information retrieval systems index and rank web documents for search engines.",
    "Topic modeling discovers hidden themes in large collections of documents.",
    "Summarization techniques generate concise versions of long documents.",
    "Word embeddings represent words as dense vectors in continuous space.",
    "Text-to-speech technology enables computers to read text aloud.",
    "Tokenization splits text into words, sentences, or subword units.",
    "Entity linking connects recognized entities to structured knowledge bases.",
    "Natural language inference determines the logical relationship between sentences.",
    "Text generation models produce human-like textual content.",
    "Co-reference resolution finds pronoun references in texts.",
    "Parsing techniques analyze the grammatical structure of sentences.",
    "Syntax and semantics are crucial for natural language understanding.",
    "Bag-of-words and TF-IDF are traditional text representation methods.",
    "BM25 improves document ranking based on term frequency and document length.",
    "Recurrent neural networks process sequential text data efficiently.",
    "Transformers revolutionized NLP with attention mechanisms.",
    "Self-supervised learning enables models to learn from unlabeled data.",
    "Autoencoders compress and reconstruct textual data.",
    "Text clustering groups similar documents together.",
    "Hierarchical clustering arranges text documents in tree structures.",
    "Graph neural networks enhance NLP tasks with graph representations.",
    "Causal inference determines cause-effect relationships in textual data.",
    "Cross-lingual embeddings enable multilingual text analysis.",
    "Universal sentence encoders generate vector representations of entire sentences.",
    "Biomedical text mining extracts insights from medical literature.",
    "Legal text analysis aids in contract understanding and law enforcement.",
    "Aspect-based sentiment analysis focuses on opinions related to specific features.",
    "Neural machine translation surpasses traditional phrase-based methods.",
    "Language modeling predicts the probability of text sequences.",
    "Corpus-based analysis studies language usage patterns.",
    "Multi-document summarization generates summaries from multiple sources.",
    "Extractive summarization selects key sentences from a document.",
    "Abstractive summarization generates new sentences to summarize content.",
    "Dialogue systems enable human-computer conversations.",
    "Conversational AI assists users with natural language queries.",
    "Question-answering systems retrieve precise answers from documents.",
    "Topic detection and tracking monitor emerging news trends.",
    "Stance detection determines a text's position on a given topic.",
    "Lexical semantics focuses on word meanings and relationships.",
    "Discourse analysis studies text structure beyond individual sentences.",
    "Code-switching detection identifies multilingual text shifts.",
    "Morphological analysis processes word inflections and derivations.",
    "Text anonymization removes personally identifiable information.",
    "Bias detection ensures fairness in AI-generated text.",
    "Explainable AI helps understand model decisions in NLP tasks.",
    "Zero-shot learning enables models to generalize without specific training data.",
    "Few-shot learning adapts NLP models with minimal training examples.",
    "Meta-learning helps models learn how to learn new tasks efficiently.",
    "Adversarial attacks on NLP models exploit weaknesses in text understanding.",
    "Neural retrieval models enhance search engines with deep learning.",
    "Hybrid search combines lexical and semantic retrieval techniques.",
    "Multimodal NLP integrates text, images, and speech.",
    "Commonsense reasoning enables AI to infer implicit knowledge.",
    "Hyperparameter tuning optimizes NLP model performance.",
    "Continual learning allows AI models to evolve over time.",
    "Human-in-the-loop NLP improves AI outputs with expert feedback.",
    "Sentiment-aware chatbots respond based on user emotions.",
    "Legal document retrieval improves efficiency in law firms.",
    "Real-time speech-to-text applications enhance accessibility.",
    "Scientific text mining uncovers trends in research publications.",
    "Paraphrase generation produces alternative versions of text.",
    "Fact-checking systems verify claims against trusted sources.",
    "Opinion mining analyzes subjective expressions in text.",
    "Text matching models compare similarity between documents.",
    "AI-assisted content writing speeds up editorial processes.",
    "Structured prediction in NLP predicts interdependent labels.",
    "Causal modeling in NLP infers relationships between variables.",
    "E-commerce search optimization enhances online shopping.",
    "Personalized search tailors results to user preferences.",
    "Intent classification improves chatbot understanding.",
    "Speech synthesis generates human-like voice output.",
    "AI-powered grammar correction improves writing quality.",
    "Code summarization condenses programming logic into readable text.",
    "Text watermarking techniques detect unauthorized content use.",
    "Robust NLP models generalize well to unseen data.",
    "Lifelong learning ensures AI adaptability in evolving domains.",
    "Memory-augmented neural networks store external knowledge for reasoning.",
    "Neurosymbolic AI blends rule-based and neural NLP methods."
]

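# Text preprocessing function
stop_words = set(stopwords.words('english'))

def preprocess(text):
    """Lowercase, strip punctuation, lemmatize, and drop stop words; return a list of tokens."""
    text = text.lower()
    # Remove every ASCII punctuation character (this also removes hyphens and apostrophes)
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    # Lemmatize with spaCy, then keep alphanumeric tokens that are not stop words
    lemmatized_tokens = [token.lemma_ for token in nlp(" ".join(tokens))]
    return [word for word in lemmatized_tokens if word.isalnum() and word not in stop_words]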

# Apply preprocessing
processed_documents = [preprocess(doc) for doc in documents]
processed_documents


[nltk_data] Downloading package punkt to /Users/aikei/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/aikei/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[['natural',
  'language',
  'processing',
  'enable',
  'machine',
  'understand',
  'human',
  'language'],
 ['boolean', 'retrieval', 'basic', 'model', 'information', 'retrieval'],
 ['language', 'model', 'essential', 'processing', 'analyze', 'text'],
 ['understand', 'boolean', 'operator', 'crucial', 'search', 'engine'],
 ['machine', 'learn', 'technique', 'help', 'improve', 'rank', 'model'],
 ['search', 'engine', 'rely', 'efficient', 'retrieval', 'algorithm'],
 ['statistical', 'language', 'model', 'commonly', 'use', 'nlp', 'task'],
 ['deep',
  'learning',
  'model',
  'transform',
  'field',
  'artificial',
  'intelligence'],
 ['computational',
  'linguistic',
  'focus',
  'use',
  'algorithm',
  'process',
  'natural',
  'language'],
 ['search', 'engine', 'optimization', 'technique', 'influence', 'web', 'rank'],
 ['vector',
  'space',
  'model',
  'represent',
  'text',
  'numerical',
  'vector',
  'similarity',
  'calculation'],
 ['data', 'mining', 'extract', 'useful', 'pattern', 'l
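One side effect of `preprocess` is worth keeping in mind: punctuation is removed before tokenization, so hyphenated terms are glued into a single string rather than split into parts. The snippet below is a small standard-library sketch of that one step only; it does not run NLTK or spaCy, and how the lemmatizer then treats the merged string is not shown here.

```python
import string

# Same translation table as the one built inside preprocess()
strip_punct = str.maketrans('', '', string.punctuation)

for phrase in ("text-to-speech", "bag-of-words", "a text's position"):
    print(repr(phrase), "->", repr(phrase.translate(strip_punct)))
# 'text-to-speech' -> 'texttospeech'
# 'bag-of-words' -> 'bagofwords'
# "a text's position" -> 'a texts position'
```

If the parts of a hyphenated term should be searchable on their own, one option would be to replace punctuation with a space instead of deleting it. That is a design choice, not something the assignment requires.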

## Part 2. Computing the BM25 parameters

In [6]:

from math import log

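def compute_idf(corpus):
    """Return a dict mapping each term to its BM25 IDF over a corpus of token lists."""
    N = len(corpus)
    # Document frequency n_t: the number of documents that contain each term
    df = {}
    for doc in corpus:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    # IDF(t) = ln((N - n_t + 0.5) / (n_t + 0.5) + 1)
    return {word: log((N - freq + 0.5) / (freq + 0.5) + 1) for word, freq in df.items()}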

def compute_tf(doc):
    tf = {}
    for word in doc:
        tf[word] = tf.get(word, 0) + 1
    return tf

# Compute parameters
doc_lengths = [len(doc) for doc in processed_documents]
avgdl = sum(doc_lengths) / len(doc_lengths)
idf = compute_idf(processed_documents)

idf, avgdl


({'processing': 3.708682081410116,
  'human': 4.219507705176107,
  'enable': 2.4849066497880004,
  'natural': 2.920224721045846,
  'understand': 3.372209844788903,
  'machine': 3.120895416507997,
  'language': 2.0992441689760155,
  'retrieval': 2.6100697927420065,
  'information': 2.920224721045846,
  'basic': 4.219507705176107,
  'boolean': 3.708682081410116,
  'model': 1.6045479271399086,
  'essential': 4.219507705176107,
  'text': 1.1134273744532504,
  'analyze': 3.372209844788903,
  'crucial': 3.708682081410116,
  'engine': 2.920224721045846,
  'search': 2.2735975561207935,
  'operator': 4.219507705176107,
  'learn': 2.75317063638268,
  'help': 3.120895416507997,
  'technique': 2.6100697927420065,
  'rank': 2.920224721045846,
  'improve': 2.920224721045846,
  'efficient': 4.219507705176107,
  'algorithm': 3.708682081410116,
  'rely': 4.219507705176107,
  'nlp': 1.9508241638577424,
  'task': 3.120895416507997,
  'commonly': 4.219507705176107,
  'use': 3.120895416507997,
  'statistic
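The IDF values in the output above can be checked by hand against the formula from the theory section. The sketch below assumes N = 101, which is the number of strings in the `documents` list; the helper name `idf_value` is illustrative only.

```python
from math import log

def idf_value(N, n_t):
    """BM25 IDF for a term that occurs in n_t out of N documents."""
    return log((N - n_t + 0.5) / (n_t + 0.5) + 1)

N = 101  # assumed corpus size: the number of strings in `documents`

print(idf_value(N, 1))  # ln(68)   ≈ 4.2195, the value listed for 'human'
print(idf_value(N, 2))  # ln(40.8) ≈ 3.7087, the value listed for 'processing'

# Thanks to the "+ 1" inside the logarithm, the weight stays positive
# even for a term that occurs in every document.
print(idf_value(N, N) > 0)  # True
```

Rarer terms get larger weights, which is why 'human' and 'basic' outrank frequent terms such as 'text' or 'model' in the table.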

## Part 3. Implementing BM25

In [7]:

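def bm25_score(query, doc, idf, k1=1.5, b=0.75):
    """Sum the BM25 scores of the query terms for one tokenized document.

    Note: relies on the module-level `avgdl` computed in Part 2.
    """
    tf = compute_tf(doc)
    # Length normalization depends only on the document, so compute it once
    length_norm = 1 - b + b * (len(doc) / avgdl)
    score = 0
    for term in query:
        if term in tf:
            term_tf = tf[term]
            numerator = term_tf * (k1 + 1)
            denominator = term_tf + k1 * length_norm
            score += idf.get(term, 0) * (numerator / denominator)
    return score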

# Testing BM25 with a query
query = preprocess("language models retrieval")
scores = [bm25_score(query, doc, idf) for doc in processed_documents]

scores


[2.847059462988167,
 5.586081169241199,
 3.925560790102218,
 0,
 1.5900925404089186,
 2.7663506406212592,
 3.6704245997545195,
 1.5900925404089186,
 1.9533752388054766,
 0,
 1.4071775670816966,
 0,
 0,
 3.4464288991048475,
 2.4287054264789667,
 0,
 0,
 0,
 2.224938761337277,
 2.080332059345601,
 0,
 0,
 1.5900925404089186,
 2.2890133717669077,
 0,
 0,
 0,
 0,
 0,
 0,
 2.080332059345601,
 1.5900925404089186,
 0,
 0,
 2.224938761337277,
 0,
 0,
 0,
 0,
 1.5900925404089186,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 2.224938761337277,
 2.224938761337277,
 0,
 0,
 0,
 0,
 2.080332059345601,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1.493053660299371,
 1.4071775670816966,
 1.493053660299371,
 1.493053660299371,
 1.493053660299371,
 3.9217590867783376,
 2.5865556504650518,
 0,
 0,
 1.7006220287649407,
 1.5900925404089186,
 0,
 0,
 2.5865556504650518,
 0,
 0,
 0,
 0,
 0,
 1.7006220287649407,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1.5900925404089186,
 0,
 0,
 0]

## Part 4. Testing the model

In [8]:

queries = [
    "natural language processing",
    "Boolean retrieval",
    "models text"
]

for query in queries:
    processed_query = preprocess(query)
    scores = [bm25_score(processed_query, doc, idf) for doc in processed_documents]
    sorted_results = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

    print(f"Query: {query}")
    for rank, (doc_index, score) in enumerate(sorted_results):
        print(f"{rank + 1}. Document {doc_index + 1}: {documents[doc_index]} (score: {score:.4f})")
    print()


Query: natural language processing
1. Document 1: Natural language processing enables machines to understand human language. (score: 9.0153)
2. Document 3: Language models are essential for processing and analyzing text. (score: 6.1557)
3. Document 35: Syntax and semantics are crucial for natural language understanding. (score: 5.3200)
4. Document 31: Natural language inference determines the logical relationship between sentences. (score: 4.9742)
5. Document 58: Conversational AI assists users with natural language queries. (score: 4.9742)
6. Document 9: Computational linguistics focuses on the use of algorithms to process natural language. (score: 4.6707)
7. Document 19: Machine translation enables conversion between different languages. (score: 2.2249)
8. Document 52: Language modeling predicts the probability of text sequences. (score: 2.2249)
9. Document 53: Corpus-based analysis studies language usage patterns. (score: 2.2249)
10. Document 7: Statistical language models are commonly used
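The loop above prints every document in the corpus, including those with a zero score. A possible refinement is to keep only the best few matches. The helper `top_k` and the sample scores below are hypothetical and only sketch the idea.

```python
import heapq

def top_k(scores, k=5):
    """Return up to k (doc_index, score) pairs with a positive score, best first."""
    positive = [(i, s) for i, s in enumerate(scores) if s > 0]
    return heapq.nlargest(k, positive, key=lambda pair: pair[1])

# Made-up scores, standing in for the output of bm25_score over a corpus
example_scores = [0, 2.5, 0, 4.0, 1.0, 0, 3.2]

for rank, (doc_index, score) in enumerate(top_k(example_scores, k=3), start=1):
    print(f"{rank}. Document {doc_index + 1} (score: {score:.4f})")
# 1. Document 4 (score: 4.0000)
# 2. Document 7 (score: 3.2000)
# 3. Document 2 (score: 2.5000)
```

`heapq.nlargest` avoids sorting the whole list when only a handful of results is needed, which matters more as the corpus grows.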

## Part 5. Improvements


### Improvements made:
1. **Lemmatization**: tokens are lemmatized with spaCy.
2. **Larger test corpus**: additional documents were added to the corpus for testing.
3. **Parameter tuning**: the values of k1 and b can be changed to tune ranking quality.
4. **Extended testing**: additional queries are used to assess search quality.
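To illustrate the parameter-tuning point, the sketch below shows how changing b can flip a ranking. It is a self-contained toy example: the corpus is made up, and `bm25_explicit` is a hypothetical variant of the scoring function that takes `avgdl` as an explicit argument instead of reading a global variable.

```python
from math import log

def idf_table(corpus):
    """BM25 IDF for every term in a corpus of token lists."""
    n_docs = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: log((n_docs - n + 0.5) / (n + 0.5) + 1) for t, n in df.items()}

def bm25_explicit(query, doc, idf, avgdl, k1=1.5, b=0.75):
    """BM25 score of one document; avgdl is passed in rather than read globally."""
    length_norm = 1 - b + b * len(doc) / avgdl
    score = 0.0
    for term in query:
        tf = doc.count(term)
        if tf:
            score += idf.get(term, 0.0) * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

# Made-up corpus: a short document with one match, a long one with two, one without
toy_corpus = [
    ["bm25", "ranking"],
    ["bm25", "bm25", "ranking", "model", "text", "search", "engine", "index"],
    ["text", "search"],
]
toy_idf = idf_table(toy_corpus)
toy_avgdl = sum(len(d) for d in toy_corpus) / len(toy_corpus)

for b in (0.0, 0.75, 1.0):
    scores = [bm25_explicit(["bm25"], d, toy_idf, toy_avgdl, b=b) for d in toy_corpus]
    best = max(range(len(scores)), key=scores.__getitem__)
    print(f"b={b}: best document index = {best}")
# b=0.0: best document index = 1
# b=0.75: best document index = 0
# b=1.0: best document index = 0
```

With b = 0 document length is ignored, so the long document wins on its higher term frequency. With stronger length normalization the short, focused document ranks first.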
