### Practical: how keyword-based search works

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re

In [34]:
documents = [
    "satyajeet is taking a session",
    "everybody is attending the session",
    "this is on rag",
    "session also covers advanced rag"
]

query="satyajeet"


In [35]:
# Learn the vocabulary and IDF weights, then encode each document as a TF-IDF vector
vector = TfidfVectorizer()
X = vector.fit_transform(documents)


In [36]:
X.toarray()[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.38044393, 0.        , 0.        , 0.59603894, 0.38044393,
       0.59603894, 0.        , 0.        ])

In [37]:
# Encode the query with the vocabulary learned from the documents
y = vector.transform([query])

In [38]:
y.toarray()

array([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]])

In [39]:
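# Cosine similarity between every document vector and the query vector, shape (4, 1)
similarities = cosine_similarity(X, y)
# Document indices sorted by descending similarity
ranked_indices = np.argsort(similarities, axis=0)[::-1].flatten()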

In [40]:
ranked_indices

array([0, 3, 2, 1])

In [41]:
ranked_documents = [documents[i] for i in ranked_indices]
for i, doc in enumerate(ranked_documents):
    print(f"Rank {i+1}: {doc}")

Rank 1: satyajeet is taking a session
Rank 2: session also covers advanced rag
Rank 3: this is on rag
Rank 4: everybody is attending the session
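Only the first document contains the query term, so the other three have a similarity of exactly 0 and their relative order in the ranking carries no information. The following is a minimal sketch that prints the score next to each document and drops zero-score matches; the names `scores`, `order` and `hits` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "satyajeet is taking a session",
    "everybody is attending the session",
    "this is on rag",
    "session also covers advanced rag",
]
query = "satyajeet"

vector = TfidfVectorizer()
X = vector.fit_transform(documents)
y = vector.transform([query])

# One similarity score per document, flattened to a 1-D array.
scores = cosine_similarity(X, y).ravel()

# Sort by descending score; a stable sort keeps the original order among ties.
order = np.argsort(-scores, kind="stable")

# Keep only documents that actually share a term with the query.
hits = [(documents[i], float(scores[i])) for i in order if scores[i] > 0]
for rank, (doc, score) in enumerate(hits, start=1):
    print(f"Rank {rank}: {doc} (score={score:.3f})")
```

Filtering on `scores[i] > 0` is one possible design choice; a real search system might instead use a small positive threshold or return the top N results.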


### How BM25 differs from TF-IDF

#### 1) Handling Document Length
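**TF-IDF:**
- In its basic form (raw term count × IDF), it has no document-length component. (scikit-learn's `TfidfVectorizer` L2-normalizes each vector by default, which partly offsets this.)
- Long documents that repeat a term can therefore get unfairly high scores, even if they are less relevant.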

Example, for the query "machine learning":
> Doc 1: "machine learning is important as it is widely used in industries."

> Doc 2: "machine learning is important. machine learning is widely used in industries."
- With raw term counts, TF-IDF scores Doc 2 twice as high as Doc 1 because "machine learning" occurs twice, even though Doc 1 is more concise and arguably just as relevant.

**BM25:**
- Normalizes term frequency by the document's length relative to the average document length in the corpus; the parameter b controls how strong this normalization is.
- This avoids over-rewarding long documents that merely repeat terms.
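The length normalization can be illustrated with the term-frequency component of the standard Okapi BM25 formula. This is a minimal sketch, not a full BM25 implementation: the IDF factor is omitted, the function name `bm25_tf` is hypothetical, the values `k1=1.5` and `b=0.75` are commonly used defaults assumed here, and the document lengths are made-up numbers.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Term-frequency component of BM25 (the IDF factor is left out)."""
    # b controls how strongly document length is penalised:
    # b=0 ignores length, b=1 normalises fully.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# Hypothetical numbers: the term occurs twice in both documents,
# but one document is 10 tokens long and the other 100.
avg_len = 50
short_doc = bm25_tf(tf=2, doc_len=10, avg_doc_len=avg_len)
long_doc = bm25_tf(tf=2, doc_len=100, avg_doc_len=avg_len)
print(f"short: {short_doc:.3f}, long: {long_doc:.3f}")
```

With the same term count, the shorter document receives the larger value, because two occurrences in 10 tokens are stronger evidence than two occurrences in 100 tokens.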

#### 2) TF Saturation
**TF-IDF:**
- Treats term frequency linearly: with raw counts, a term appearing 10 times contributes 10 times as much to the score as a term appearing once.

Example:
> Query: "artificial intelligence"

> Doc 1: "artificial intelligence is a field of study."

> Doc 2: "artificial intelligence artificial intelligence artificial intelligence in every line."

- TF-IDF would unfairly boost Doc 2 due to repeated occurrences of the query terms.

**BM25:**
- Introduces a saturation mechanism using the parameter k1, ensuring diminishing returns for repeated term occurrences.
- In the above case, BM25 recognizes that after a certain frequency, additional occurrences of "artificial intelligence" do not significantly increase relevance.
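The saturation effect can be seen by comparing a linear term frequency with the BM25 term-frequency component. This is a minimal sketch assuming a document of exactly average length (so the length normalization equals 1) and `k1=1.5`; the function name `saturated_tf` is hypothetical.

```python
def saturated_tf(tf, k1=1.5):
    """BM25 term-frequency component for a document of average length."""
    # As tf grows, the value approaches the upper bound k1 + 1.
    return tf * (k1 + 1) / (tf + k1)


for tf in (1, 2, 5, 10, 100):
    print(f"tf={tf:>3}  linear={tf:>3}  bm25={saturated_tf(tf):.3f}")
```

Going from 1 to 10 occurrences multiplies the linear term frequency by 10, while the BM25 component only rises from 1.0 to roughly 2.17 and can never exceed `k1 + 1 = 2.5`. A smaller `k1` makes the curve flatten sooner.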