# Compare BM25 vs TF-IDF

### Goal

Compare how BM25 and TF-IDF score a corpus of documents against a set of section-heading keywords. Both methods assign each document a relevance score for the same keyword query, and the two sets of scores are then compared side by side.

This notebook will:

1. Load your corpus of documents  
2. Build a TF-IDF vectorizer and a BM25 index  
3. Define a “query” of section-heading keywords  
4. Compute and compare scores from both methods  
5. Display results in a table and a simple plot  


In [None]:
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import matplotlib.pyplot as plt



In [None]:
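# 1. Load the corpus: one string per document.
# The list is left empty here; fill it in before running the cells below,
# since an index cannot be built from an empty corpus.
corpus = [

]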

In [None]:
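# 2a. Build the BM25 index (rank_bm25 expects pre-tokenized documents)
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized, k1=1.5, b=0.75)

# 2b. Build the TF-IDF matrix
# Note: TfidfVectorizer lowercases and tokenizes with its own default pattern
# (word tokens of two or more characters), so its vocabulary can differ
# slightly from the whitespace split used for BM25 above.
tfidf = TfidfVectorizer(sublinear_tf=True, smooth_idf=True)
tfidf_matrix = tfidf.fit_transform(corpus)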
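
To make the roles of `k1` and `b` concrete, here is a minimal pure-Python sketch of a textbook Okapi BM25 scorer. It is an illustration only: the function name `okapi_bm25_sketch` is hypothetical, and the IDF uses the common `log(1 + ...)` variant, so the numbers will not necessarily match what `rank_bm25` returns.

```python
import math
from collections import Counter


def okapi_bm25_sketch(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query (illustrative sketch)."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        # b controls how strongly longer-than-average documents are penalized.
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 controls how quickly repeated occurrences of a term saturate.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

Called on a list of token lists, it returns one score per document, with `0.0` for documents that contain none of the query terms. Raising `k1` lets repeated terms keep adding to the score for longer, and setting `b=0` switches off length normalization entirely.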


In [None]:
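# 3. Define the query of section-heading keywords
query = "abstract introduction methods results discussion references"
q_tokens = query.lower().split()   # same tokenization as the BM25 corpus
q_vec    = tfidf.transform([query])

# 4. Score every document against the query with both methods
tfidf_scores = cosine_similarity(q_vec, tfidf_matrix)[0]   # cosine similarity, bounded
bm25_scores  = bm25.get_scores(q_tokens)                   # raw BM25 scores, unbounded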
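
Steps 4 and 5 of the outline (compare the scores, then show a table and a plot) could be sketched as follows. The helper names `compare_scores` and `plot_scores` are hypothetical, and the choice to compare by rank and by min-max-scaled score is an assumption made because cosine similarities and raw BM25 scores live on different scales.

```python
import numpy as np
import pandas as pd


def compare_scores(tfidf_scores, bm25_scores, labels=None):
    """Return one row per document with both scores and their ranks (1 = best)."""
    tfidf_scores = np.asarray(tfidf_scores, dtype=float)
    bm25_scores = np.asarray(bm25_scores, dtype=float)
    if labels is None:
        labels = [f"doc_{i}" for i in range(len(tfidf_scores))]
    table = pd.DataFrame({"doc": labels, "tfidf": tfidf_scores, "bm25": bm25_scores})
    for col in ("tfidf", "bm25"):
        # Rank in descending score order; tied scores share the best rank.
        table[f"{col}_rank"] = table[col].rank(ascending=False, method="min").astype(int)
    return table.sort_values("bm25", ascending=False).reset_index(drop=True)


def plot_scores(table):
    """Scatter the min-max-scaled scores; points near the diagonal mean agreement."""
    import matplotlib.pyplot as plt  # imported lazily so the table helper works alone

    def scale(s):
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else s * 0.0

    scaled = table[["tfidf", "bm25"]].apply(scale)
    fig, ax = plt.subplots()
    ax.scatter(scaled["tfidf"], scaled["bm25"])
    ax.plot([0, 1], [0, 1], linestyle="--")  # reference line for perfect agreement
    ax.set_xlabel("TF-IDF cosine similarity (scaled)")
    ax.set_ylabel("BM25 score (scaled)")
    return ax
```

In the notebook this would be used as `table = compare_scores(tfidf_scores, bm25_scores)` followed by `plot_scores(table)`. The expression `table["tfidf_rank"].corr(table["bm25_rank"])` then gives a rough rank-agreement figure (it equals Spearman's rho when there are no tied scores).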
