# Text Retrieval Formal Definition

### Steps

1. Define the document collection (C)
2. Define the user query (q)
3. Tokenize documents and query (Bag-of-Words representation)
4. Build the vocabulary V (all unique words)
5. Define scoring function f(q, d): number of query words appearing in a document
6. Compute f(q, d) for each document
7. Document Selection strategy
8. Document Ranking strategy
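
The steps above can be summarized in set notation. This is a sketch in the usual textbook style: the symbols $w_i$, $N$, $M$, $m$, $\theta$, $R_{\text{sel}}$ and $R_{\text{rank}}$ are notation assumed here for illustration, while $C$, $q$, $V$ and $f(q, d)$ are the names used in this notebook.

```latex
\[
\begin{aligned}
V &= \{w_1, w_2, \dots, w_N\} && \text{(vocabulary)} \\
q &= q_1, \dots, q_m, \quad q_j \in V && \text{(query as a word sequence)} \\
C &= \{d_1, \dots, d_M\} && \text{(document collection)} \\
f(q, d) &= \bigl|\{\, j : q_j \text{ occurs in } d \,\}\bigr| && \text{(scoring function used here)} \\
R_{\text{sel}}(q) &= \{\, d \in C : f(q, d) \ge \theta \,\} && \text{(document selection)} \\
R_{\text{rank}}(q) &= C \text{ sorted by } f(q, d) \text{ in descending order} && \text{(document ranking)}
\end{aligned}
\]
```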


### 1. Define the document collection (C)
#### A small set of documents representing our text collection.

In [1]:
documents = {
    "d1": "climate change is real and affecting the planet",
    "d2": "renewable energy helps to reduce global warming",
    "d3": "python programming is useful for data science",
    "d4": "global efforts are needed to stop climate change"
}

### 2. Define the user query (q)
#### The user's input query to retrieve relevant documents.

In [2]:
query_text = "climate change and global warming"

### 3. Tokenize documents and query (Bag-of-Words representation)
#### Convert all texts into lowercase word lists for easy comparison.



In [3]:
def tokenize(text):
    # Lowercase the text and split on whitespace; punctuation is not stripped.
    return text.lower().split()
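
Whitespace splitting works for the sample documents because they contain no punctuation. With real text, a token such as "change," would not match "change". A minimal sketch of a regex-based alternative follows; `tokenize_regex` is a hypothetical helper that is not part of this notebook, and the pattern `[a-z0-9]+` is an assumption that suits plain English text only.

```python
import re

def tokenize_regex(text):
    # Lowercase the text, then keep only runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize_regex("Climate change, and global warming!"))
# ['climate', 'change', 'and', 'global', 'warming']
```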

### 4. Build the vocabulary V (all unique words)
#### Collect all unique words from documents and query.

In [4]:
vocabulary = set()         # V: every unique word seen so far
tokenized_documents = {}   # doc_id -> list of tokens
for doc_id, text in documents.items():
    tokens = tokenize(text)
    tokenized_documents[doc_id] = tokens
    vocabulary.update(tokens)

In [5]:
query_tokens = tokenize(query_text)
vocabulary.update(query_tokens)

### 5. Define scoring function f(q, d): number of query words appearing in a document
#### A function that counts how many query words appear in a document.

In [6]:
def f(query_tokens, document_tokens):
    # Count the query tokens that occur in the document (a repeated query word counts each time).
    return sum(1 for word in query_tokens if word in document_tokens)
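
Because `f` iterates over the query tokens, a word that appears twice in the query is counted twice. If distinct query words should be counted instead, one possible variant uses set intersection. This is only a sketch: `f_unique` is a hypothetical name, not part of this notebook, and it gives the same scores as `f` whenever the query has no repeated words.

```python
def f_unique(query_tokens, document_tokens):
    # Count the distinct query words that occur in the document.
    return len(set(query_tokens) & set(document_tokens))

query = ["climate", "climate", "change"]
doc = ["climate", "change", "is", "real"]
print(f_unique(query, doc))  # 2 (counting each query token instead would give 3)
```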

### 6. Compute f(q, d) for each document
#### Calculate relevance scores for all documents based on the query.

In [7]:
scores = {doc_id: f(query_tokens, tokens) for doc_id, tokens in tokenized_documents.items()}
scores

{'d1': 3, 'd2': 2, 'd3': 0, 'd4': 3}

#### Vocabulary, query, and tokenized documents

In [8]:
print("Vocabulary (V): ", vocabulary, "\n")
print("Query (q):", query_tokens)
print("\nTokenized Documents:")

for doc_id, tokens in tokenized_documents.items():
    print(f"{doc_id}: {tokens}")

Vocabulary (V):  {'data', 'to', 'useful', 'science', 'renewable', 'the', 'helps', 'stop', 'reduce', 'for', 'warming', 'is', 'energy', 'efforts', 'global', 'change', 'affecting', 'python', 'climate', 'programming', 'real', 'needed', 'and', 'planet', 'are'} 

Query (q): ['climate', 'change', 'and', 'global', 'warming']

Tokenized Documents:
d1: ['climate', 'change', 'is', 'real', 'and', 'affecting', 'the', 'planet']
d2: ['renewable', 'energy', 'helps', 'to', 'reduce', 'global', 'warming']
d3: ['python', 'programming', 'is', 'useful', 'for', 'data', 'science']
d4: ['global', 'efforts', 'are', 'needed', 'to', 'stop', 'climate', 'change']
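
To see where each score comes from, the matching query words can be listed per document. The sketch below repeats the tokenized data printed above so that it runs on its own; the name `matches` is introduced here for illustration only.

```python
query_tokens = ["climate", "change", "and", "global", "warming"]
tokenized_documents = {
    "d1": ["climate", "change", "is", "real", "and", "affecting", "the", "planet"],
    "d2": ["renewable", "energy", "helps", "to", "reduce", "global", "warming"],
    "d3": ["python", "programming", "is", "useful", "for", "data", "science"],
    "d4": ["global", "efforts", "are", "needed", "to", "stop", "climate", "change"],
}

# For each document, keep the query words that occur in it (in query order).
matches = {
    doc_id: [word for word in query_tokens if word in tokens]
    for doc_id, tokens in tokenized_documents.items()
}

for doc_id, words in matches.items():
    print(f"{doc_id}: {len(words)} match(es) {words}")
```

Note that one of d1's three matches is the function word "and", so this simple count treats it as equal in weight to "climate" or "warming".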


### 7. Document Selection strategy
#### Select documents with scores above a certain threshold (e.g., ≥ 2).

In [9]:
selection_threshold = 2  # select documents whose score is at least 2
R_selection = {doc_id: score for doc_id, score in scores.items() if score >= selection_threshold}

In [10]:
print(f"Document Selection (Score ≥ {selection_threshold}):")
for doc_id, score in R_selection.items():
    print(f"{doc_id}: Score = {score}")

Document Selection (Score ≥ 2):
d1: Score = 3
d2: Score = 2
d4: Score = 3
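
Selection makes a yes/no decision per document, so the result depends entirely on the chosen threshold. The sketch below reuses the scores computed above; `select` is a hypothetical helper, not part of this notebook, and the thresholds 1 to 3 are picked only to illustrate the effect.

```python
def select(scores, threshold):
    # Keep only the documents whose score reaches the threshold.
    return {doc_id: s for doc_id, s in scores.items() if s >= threshold}

scores = {"d1": 3, "d2": 2, "d3": 0, "d4": 3}

for threshold in (1, 2, 3):
    print(threshold, sorted(select(scores, threshold)))
# 1 ['d1', 'd2', 'd4']
# 2 ['d1', 'd2', 'd4']
# 3 ['d1', 'd4']
```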


### 8. Document Ranking strategy
#### Sort all documents by their f(q, d) score in descending order.

In [11]:
# sorted() is stable, so documents with equal scores keep their original order (d1 before d4).
R_ranking = dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))

In [12]:
print("All documents ranked by relevance to query:")
for doc_id, score in R_ranking.items():
    print(f"{doc_id}: Score = {score}")

All documents ranked by relevance to query:
d1: Score = 3
d4: Score = 3
d2: Score = 2
d3: Score = 0
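
With ranking, the cutoff is left to the user, who can stop reading at any position. A minimal sketch of a top-k cutoff with an explicit tie-break follows; `rank_top_k` is a hypothetical helper, not part of this notebook, and breaking ties by document ID is an assumption made here so that the order does not depend on dictionary insertion order.

```python
def rank_top_k(scores, k):
    # Sort by score descending; break ties by document ID ascending.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

scores = {"d1": 3, "d2": 2, "d3": 0, "d4": 3}
print(rank_top_k(scores, 2))  # [('d1', 3), ('d4', 3)]
```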
