In [1]:
def build_inverted_index_with_positions(docs):
    """
    Builds an inverted index mapping each term to a dict of:
      doc_id → sorted list of positions where the term occurs.
    """
    index = {}
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            token = word.lower()
            postings = index.setdefault(token, {})
            postings.setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase, docs):
    """
    Returns the list of doc_ids where the exact phrase appears.
    """
    terms = [t.lower() for t in phrase.split()]
    # Start from the posting list of the first term
    first_term = terms[0]
    if first_term not in index:
        return []
    candidate_docs = set(index[first_term].keys())

    for term in terms[1:]:
        if term not in index:
            return []
        candidate_docs &= set(index[term].keys())

    results = []
    for doc_id in candidate_docs:
        # For each position p of the first term, check subsequent positions
        positions = index[first_term][doc_id]
        for p in positions:
            if all((p + i) in index[terms[i]][doc_id] for i in range(1, len(terms))):
                results.append(doc_id)
                break
    return sorted(results)

# Example usage:
docs = [
    "Data science is fun",
    "Machine learning makes data science powerful",
    "Science requires data and learning."
]
idx = build_inverted_index_with_positions(docs)
print(phrase_query(idx, "data science", docs))  # → [0, 1]


[0, 1]


### Explanation of the Code in index.ipynb

This code demonstrates how to build an **inverted index** with positional information and perform **phrase queries** on a collection of documents. Below is a detailed explanation of the code:

---

### **1. Building an Inverted Index with Positions**
```python
def build_inverted_index_with_positions(docs):
    """
    Builds an inverted index mapping each term to a dict of:
      doc_id → sorted list of positions where the term occurs.
    """
    index = {}
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            token = word.lower()
            postings = index.setdefault(token, {})
            postings.setdefault(doc_id, []).append(pos)
    return index
```

#### **What is an Inverted Index?**
An **inverted index** is a data structure that maps terms (words) to the documents in which they appear. This implementation also records where each term occurs:
- Each term is mapped to a dictionary where:
  - **Key**: `doc_id` (the 0-based index of the document in `docs`).
  - **Value**: A list of 0-based word positions where the term occurs in that document, in ascending order because positions are appended as the text is read.

#### **How It Works:**
1. **Input**: A list of documents (`docs`), where each document is a string.
2. **Processing**:
   - Iterate over each document (`doc_id`) and split its text into words.
   - Convert each word to lowercase (`token`) to ensure case-insensitivity.
   - Use `setdefault` to:
     - Create an empty postings dictionary for the term the first time it is seen.
     - Create an empty position list for the document the first time the term occurs in it.
   - Append the term's position to that document's position list.
3. **Output**: A dictionary representing the inverted index.

#### **Example Output:**
For the documents:
```python
docs = [
    "Data science is fun",
    "Machine learning makes data science powerful",
    "Science requires data and learning."
]
```
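The resulting index is shown below. Positions are 0-based word offsets, and because the text is split on whitespace only, the final word of the third document keeps its period and is indexed as `learning.`:
```python
{
    'data': {0: [0], 1: [3], 2: [2]},
    'science': {0: [1], 1: [4], 2: [0]},
    'is': {0: [2]},
    'fun': {0: [3]},
    'machine': {1: [0]},
    'learning': {1: [1]},
    'makes': {1: [2]},
    'powerful': {1: [5]},
    'requires': {2: [1]},
    'and': {2: [3]},
    'learning.': {2: [4]}
}
```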
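
Punctuation-bearing keys such as `learning.` mean that a query for `"and learning"` finds nothing in the third document. One possible remedy, sketched here under the assumption that runs of word characters are an acceptable definition of a token, is to tokenize with a regular expression. The names `tokenize` and `build_index_normalized` are hypothetical and are not part of the notebook's code.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of word characters (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())

def build_index_normalized(docs):
    """Hypothetical variant of build_inverted_index_with_positions that uses tokenize()."""
    index = {}
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(tokenize(text)):
            index.setdefault(token, {}).setdefault(doc_id, []).append(pos)
    return index

docs = [
    "Data science is fun",
    "Machine learning makes data science powerful",
    "Science requires data and learning."
]
normalized = build_index_normalized(docs)
print(normalized["learning"])      # → {1: [1], 2: [4]}
print("learning." in normalized)   # → False
```

For this to help, the query phrase would need to be tokenized with the same function. Note also that `\w+` splits contractions such as "don't" into two tokens, so a real system would likely need a more careful tokenizer.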

---

### **2. Performing a Phrase Query**
```python
def phrase_query(index, phrase, docs):
    """
    Returns the list of doc_ids where the exact phrase appears.
    """
    terms = [t.lower() for t in phrase.split()]
    # Start from the posting list of the first term
    first_term = terms[0]
    if first_term not in index:
        return []
    candidate_docs = set(index[first_term].keys())

    for term in terms[1:]:
        if term not in index:
            return []
        candidate_docs &= set(index[term].keys())

    results = []
    for doc_id in candidate_docs:
        # For each position p of the first term, check subsequent positions
        positions = index[first_term][doc_id]
        for p in positions:
            if all((p + i) in index[terms[i]][doc_id] for i in range(1, len(terms))):
                results.append(doc_id)
                break
    return sorted(results)
```

#### **What is a Phrase Query?**
A **phrase query** searches for an exact sequence of words (a phrase) in the documents. For example, the phrase `"data science"` should match documents where the words "data" and "science" appear consecutively.

#### **How It Works:**
1. **Input**:
   - `index`: The inverted index built earlier.
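   - `phrase`: The phrase to search for (e.g., `"data science"`). It must contain at least one word; an empty phrase makes `terms[0]` raise `IndexError`.
   - `docs`: The list of original documents. The function accepts this argument but never reads it; the index alone is enough to answer the query.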
2. **Processing**:
   - Split the phrase into individual terms and convert them to lowercase.
   - Return an empty list immediately if any term is missing from the index.
   - Collect the documents that contain the first term, then narrow them down by intersecting with the document IDs of every other term.
   - For each candidate document:
     - Take each position `p` of the first term and check whether term `i` of the phrase occurs at position `p + i`.
     - If every term lines up for some `p`, add the document ID to the results and move on to the next document.
3. **Output**: A sorted list of document IDs where the phrase appears.
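
The position check above tests `p + i` against a Python list, which is a linear scan for every lookup. An alternative formulation, sketched below as an illustration rather than as the notebook's method, shifts the positions of term `i` back by `i` and intersects the resulting sets, so whatever survives is a start position of the phrase. The helper name `phrase_start_positions` is hypothetical, and the index is hand-built for the two terms of the sample query.

```python
def phrase_start_positions(index, phrase):
    """Map doc_id -> sorted start positions of the phrase (hypothetical helper)."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}
    # Only documents that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    matches = {}
    for doc_id in sorted(candidates):
        # Shift term i's positions back by i; a phrase start survives every intersection.
        starts = set(index[terms[0]][doc_id])
        for i, term in enumerate(terms[1:], start=1):
            starts &= {p - i for p in index[term][doc_id]}
        if starts:
            matches[doc_id] = sorted(starts)
    return matches

# Hand-built positional index for the two query terms in the three sample documents.
index = {
    "data": {0: [0], 1: [3], 2: [2]},
    "science": {0: [1], 1: [4], 2: [0]},
}
print(phrase_start_positions(index, "data science"))  # → {0: [0], 1: [3]}
```

Unlike `phrase_query`, this sketch returns where each match starts and treats an empty phrase as having no matches.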

#### **Example Query:**
For the phrase `"data science"`:
- The terms `"data"` and `"science"` must appear consecutively in the same document.
- The result will be:
```python
[0, 1]
```
This means the phrase appears in documents 0 and 1.

---

### **3. Example Usage**
```python
docs = [
    "Data science is fun",
    "Machine learning makes data science powerful",
    "Science requires data and learning."
]
idx = build_inverted_index_with_positions(docs)
print(phrase_query(idx, "data science", docs))  # → [0, 1]
```

#### **Steps:**
1. Build the inverted index using `build_inverted_index_with_positions(docs)`.
2. Perform a phrase query using `phrase_query(idx, "data science", docs)`.

#### **Output:**
```
[0, 1]
```
This indicates that the phrase `"data science"` appears in documents 0 and 1.

---

### **Summary**
1. **Inverted Index**:
   - Efficiently maps terms to the documents and positions where they appear.
   - Enables fast lookups for terms and their occurrences.
2. **Phrase Query**:
   - Searches for exact phrases by leveraging the positional information in the inverted index.
   - Filters candidate documents and checks for consecutive term positions.

Positional inverted indexes like this one underpin **information retrieval systems** such as search engines, where efficient text search is critical.