# **Information Retrieval (IR) in AI: A Step-by-Step Guide**

## **What is Information Retrieval?**
Information Retrieval (IR) is the process of **finding relevant information** from large datasets (documents, web pages, databases) in response to a user query. It powers:
- Search engines (Google, Bing)
- Retrieval-Augmented Generation (RAG) systems
- Document search tools

---

## **Why is IR Important?**
1. **Efficiency** → Finds relevant items quickly without scanning every document.
2. **Precision** → Ranks the most relevant results first.
3. **Scalability** → Index structures keep queries fast as collections grow to terabytes of data.

---

## **Step-by-Step Information Retrieval Process**

### **1. Document Collection**
- **Goal:** Gather raw data (web pages, PDFs, databases).
- **Sources:**  
  - Web crawlers (for search engines)  
  - Internal databases (for enterprise search)  
  - APIs (e.g., PubMed for medical papers)  

#### **Example:**
A search engine crawls Wikipedia to build its index.

---

### **2. Preprocessing**
Raw text is cleaned and standardized:
- **Tokenization**: Split text into words/tokens (`"Hello world!" → ["Hello", "world"]`).  
- **Stopword Removal**: Discard common words (`"the", "and"`).  
- **Stemming/Lemmatization**: Reduce words to root forms (`"running" → "run"`).  
- **Lowercasing**: Ensure case insensitivity (`"Apple" → "apple"`).  

#### **Code (Python):**
```python
# One-time setup (run once): nltk.download('punkt') and nltk.download('stopwords').
# Newer NLTK releases may also ask for the 'punkt_tab' resource.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The quick brown foxes are jumping over the lazy dogs."
tokens = word_tokenize(text.lower())  # Lowercase + tokenize
tokens = [word for word in tokens if word.isalpha()]  # Drop punctuation and numeric tokens
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]  # Stemming ("lazy" → "lazi")

print(stemmed_tokens)
```
**Output:**  
`['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']`

---

### **3. Indexing (Building Searchable Structures)**
Convert documents into a search-optimized format:
- **Inverted Index**: Maps words → document IDs (like a book index).  
  ```
  "apple" → [Doc1, Doc3, Doc5]
  "banana" → [Doc2, Doc4]
  ```
- **Vector Index**: Stores embeddings for semantic search (used in RAG).

#### **Tools:**
- **Elasticsearch** (for keyword search)  
- **FAISS** (for vector similarity search)  

#### **Code (Inverted Index in Python):**
```python
from collections import defaultdict

documents = {
    "Doc1": "apple banana",
    "Doc2": "banana orange",
    "Doc3": "apple orange",
}

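# Map each token to the list of documents that contain it.
inverted_index = defaultdict(list)

for doc_id, text in documents.items():
    # dict.fromkeys() drops repeated tokens while keeping first-seen order,
    # so a document ID is recorded at most once per token.
    for token in dict.fromkeys(text.lower().split()):
        inverted_index[token].append(doc_id)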

print(dict(inverted_index))
```
**Output:**  
`{'apple': ['Doc1', 'Doc3'], 'banana': ['Doc1', 'Doc2'], 'orange': ['Doc2', 'Doc3']}`
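
To show how such an index is used at query time, here is a minimal sketch of a Boolean AND lookup. The function name `boolean_and_search` is an illustrative assumption, and the index stores sets instead of lists so that posting lists can be intersected directly.

```python
from collections import defaultdict

documents = {
    "Doc1": "apple banana",
    "Doc2": "banana orange",
    "Doc3": "apple orange",
}

# Build the inverted index: token -> set of document IDs.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.split():
        inverted_index[token].add(doc_id)


def boolean_and_search(query, index):
    """Return the sorted IDs of documents that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return []
    # Look up each token's posting set (empty if unseen) and intersect them.
    postings = [index.get(token, set()) for token in tokens]
    return sorted(set.intersection(*postings))


print(boolean_and_search("apple orange", inverted_index))  # ['Doc3']
print(boolean_and_search("banana", inverted_index))        # ['Doc1', 'Doc2']
print(boolean_and_search("apple kiwi", inverted_index))    # []
```

Only the posting lists of the query terms are touched, which is why lookups stay cheap even when the collection is large.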

---

### **4. Query Processing**
- **Tokenize/Normalize** the user query (same as preprocessing).  
- **Expand Query** (Optional):  
  - Add synonyms (`"car" → ["auto", "vehicle"]`).  
  - Use spell check (`"googel" → "google"`).  

#### **Example:**
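Query: `"best smartphones 2024"`  
Processed: `["best", "smartphon", "2024"]` (after lowercasing and stemming; this query contains no stopwords, and `2024` survives only if the pipeline does not filter out numeric tokens)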
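
The optional expansion steps can be sketched with the standard library alone. The `SYNONYMS` table, the `VOCABULARY` list, and both function names are hypothetical placeholders; a production system would more likely draw on a thesaurus, query logs, or embeddings.

```python
import difflib

# Hypothetical synonym table and vocabulary, for illustration only.
SYNONYMS = {"car": ["auto", "vehicle"]}
VOCABULARY = ["google", "car", "cheap", "search"]


def correct_spelling(token, vocabulary=VOCABULARY):
    """Return the closest vocabulary word, or the token itself if none is close."""
    matches = difflib.get_close_matches(token, vocabulary, n=1)
    return matches[0] if matches else token


def expand_query(query, synonyms=SYNONYMS):
    """Normalize the query, fix likely typos, then append known synonyms."""
    tokens = [correct_spelling(t) for t in query.lower().split()]
    expanded = list(tokens)
    for token in tokens:
        for synonym in synonyms.get(token, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


print(correct_spelling("googel"))  # google
print(expand_query("Cheap car"))   # ['cheap', 'car', 'auto', 'vehicle']
```

Expansion tends to raise recall at the cost of precision, so the added terms are often given a lower weight than the original query terms.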

---

### **5. Retrieval & Ranking**
#### **A. Keyword Search (TF-IDF/BM25)**
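- **TF-IDF**: Weights a term by how often it appears in a document (term frequency) and how rare it is across the corpus (inverse document frequency).  
- **BM25**: A refinement of TF-IDF that saturates term frequency (repeated occurrences add diminishing weight) and normalizes for document length.  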
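
To make the TF-IDF weighting concrete, here is a small from-scratch sketch. It assumes one common variant (term count divided by document length, times `log(N / df)`); libraries differ in smoothing and normalization, so their numbers will not match exactly.

```python
import math
from collections import Counter

corpus = [
    "apple banana fruit",
    "banana orange fruit",
    "apple orange juice",
]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def tf_idf(term, doc_tokens):
    """TF-IDF using length-normalized term count and a plain log IDF."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    idf = math.log(n_docs / doc_freq[term]) if doc_freq[term] else 0.0
    return tf * idf


for i, doc in enumerate(tokenized, start=1):
    print(f"Doc{i}: juice={tf_idf('juice', doc):.3f}, apple={tf_idf('apple', doc):.3f}")
```

In the third document, "juice" (found in one document) gets a weight of about 0.366, while "apple" (found in two) gets about 0.135: the rarer term counts for more.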

#### **B. Semantic Search (Embeddings)**
- Encode query/documents into vectors.  
- Use **cosine similarity** to rank by meaning.  
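
The ranking step can be sketched in plain Python. The three-dimensional vectors below are made-up values for illustration; real embeddings come from a trained model and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, not produced by any real model).
doc_vectors = {
    "Doc1": [0.9, 0.1, 0.0],
    "Doc2": [0.1, 0.9, 0.1],
    "Doc3": [0.7, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda doc_id: cosine_similarity(query_vector, doc_vectors[doc_id]),
    reverse=True,
)
print(ranked)  # ['Doc1', 'Doc3', 'Doc2']
```

This brute-force comparison scores every document; vector indexes exist to avoid that cost on large collections by searching approximately.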

#### **Code (BM25 with `rank_bm25`):**
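```python
from rank_bm25 import BM25Okapi

# The last three documents are filler. On a corpus of only three documents,
# terms that appear in most of them get a negative IDF in BM25Okapi, which
# makes the scores misleading.
corpus = [
    "apple banana fruit",
    "banana orange fruit",
    "apple orange juice",
    "dog cat pet",
    "car bus train",
    "sun moon star",
]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "apple fruit"
tokenized_query = query.split()
doc_scores = bm25.get_scores(tokenized_query)  # One score per document

print("BM25 Scores:", doc_scores.round(2))
```
**Output (default parameters, values rounded):**  
`BM25 Scores: [1.18 0.59 0.59 0.   0.   0.  ]`  
*(Doc1 is most relevant: it matches both query terms. Doc2 and Doc3 each match one term and tie; the filler documents score 0.)*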

---

### **6. Re-Ranking (Optional)**
Improve results with:  
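- **Cross-Encoders**: Transformer models (e.g., BERT-based) that read the query and a document together and output one relevance score for the pair. They are more accurate than comparing separate embeddings but slower, so they are usually applied only to the top candidates.  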
- **Learning-to-Rank (LTR)**: ML models trained to optimize ranking.  

#### **Code (Cross-Encoder Re-Ranking):**
```python
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "What is AI?"
docs = [
    "Artificial Intelligence (AI) simulates human thinking.",
    "AI is used in chatbots like ChatGPT.",
]
scores = cross_encoder.predict([(query, doc) for doc in docs])

print("Cross-Encoder Scores:", scores)
```
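**Output (illustrative; exact values depend on the model and library version):**  
`Cross-Encoder Scores: [8.92, 5.31]`  
*(Higher scores mean more relevant; here the first doc ranks above the second)*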

---

### **7. Evaluation Metrics**
Measure IR system quality:  
- **Precision@K**: Fraction of the top-K results that are relevant.  
- **Recall@K**: Fraction of all relevant docs that appear in the top-K.  
- **Mean Reciprocal Rank (MRR)**: Average, over all queries, of `1 / rank` of the first relevant result.  

#### **Example:**
- **Precision@3**: If 2 of top 3 results are relevant → `2/3 = 0.67`.  
- **MRR**: For one query, the first relevant doc at position 2 gives a reciprocal rank of `1/2 = 0.5`; MRR is the mean of these values across all queries.  
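
These metrics are short enough to write out directly. The sketch below assumes binary relevance (a doc is either relevant or not) and uses made-up document IDs; the function names are illustrative.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant docs that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Query 1: the first relevant doc is at position 2.
ranked_1, relevant_1 = ["D4", "D1", "D2", "D7"], {"D1", "D2", "D9"}
# Query 2: the first relevant doc is at position 1.
ranked_2, relevant_2 = ["D5", "D6"], {"D5"}

print(round(precision_at_k(ranked_1, relevant_1, 3), 2))  # 0.67
print(round(recall_at_k(ranked_1, relevant_1, 3), 2))     # 0.67
print(reciprocal_rank(ranked_1, relevant_1))              # 0.5
print(mean_reciprocal_rank([(ranked_1, relevant_1), (ranked_2, relevant_2)]))  # 0.75
```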

---

## **Key IR Algorithms**
| **Algorithm** | **Type**       | **Best For**                     |
|---------------|----------------|----------------------------------|
| **TF-IDF**    | Keyword        | Simple term matching             |
| **BM25**      | Keyword        | Keyword ranking with length normalization |
| **Word2Vec**  | Semantic       | Word-level similarity            |
| **BERT**      | Semantic       | Context-aware search             |

---

## **Advanced IR Techniques**
1. **Query Expansion**: Add synonyms/spelling variants.  
2. **Dense Retrieval**: Use transformers (e.g., DPR, ANCE).  
3. **Hybrid Search**: Combine keyword + vector search.  
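
One common way to combine keyword and vector results is Reciprocal Rank Fusion (RRF), sketched below. This is only one option among several (weighted score blending is another), and the constant `k=60` is a conventional default assumed here, not a value taken from this guide.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each doc earns 1 / (k + rank) from every list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword ranker and a vector search.
keyword_ranking = ["Doc1", "Doc3", "Doc2"]
vector_ranking = ["Doc3", "Doc2", "Doc1"]

print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
# ['Doc3', 'Doc1', 'Doc2']
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.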

---

## **Real-World IR Systems**
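1. **Google Search**: Combines classic term matching with link analysis (PageRank) and neural language models such as BERT, among many other ranking signals.  
2. **RAG pipelines**: Commonly pair a vector index (e.g., FAISS) with a keyword ranker such as BM25.  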
3. **Spotlight (Apple)**: Semantic + location-aware search.  

---

## **Summary: IR Pipeline**
1. **Collect** → Crawl data.  
2. **Preprocess** → Clean/normalize text.  
3. **Index** → Build inverted/vector indexes.  
4. **Query** → Process user input.  
5. **Retrieve & Rank** → Fetch candidate docs and sort them by relevance (BM25/embeddings).  
6. **Re-Rank** (optional) → Refine the top results (cross-encoders/LTR).  
7. **Evaluate** → Measure performance.  

