![](./lab%20header%20image.png)

<div style="text-align: center;">
    <h3>Assignment No. 02</h3>
</div>

<img src="./Student%20Information.png" style="width: 100%;" alt="Student Information">

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: start;">
    <strong>Q. How can we implement a basic search engine from scratch using Python that tokenizes text, builds an inverted index, and retrieves relevant documents based on search queries?</strong>
</div>

A search engine is a system designed to retrieve relevant documents from a collection based on user queries. At its core, a search engine operates by creating an index of the documents, processing user input (queries), and retrieving relevant documents based on those queries. Here's how each component works:

**1. Text Tokenization:**
Tokenization is the process of breaking text into smaller units, typically words or terms, so that documents and queries can be compared in the same form. A simple tokenizer splits text on whitespace and punctuation and converts it to lowercase for uniformity. Stop-word removal is a common follow-up step, although it is not applied in the function below.

In [1]:
import re

def tokenize(text):
    # Lowercase the text, then extract runs of word characters (letters, digits, underscore)
    return re.findall(r'\b\w+\b', text.lower())
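
As a quick check of the tokenizer, the sketch below re-declares `tokenize` so that it runs on its own and applies it to a short sentence. The second call is an illustrative assumption added here to show that `\w` also matches digits and underscores, so such tokens are kept whole.

```python
import re

def tokenize(text):
    # Lowercase the text, then extract runs of word characters
    return re.findall(r'\b\w+\b', text.lower())

tokens = tokenize("Hello World! This is a sample document.")
print(tokens)
# ['hello', 'world', 'this', 'is', 'a', 'sample', 'document']

# Underscores and digits count as word characters
print(tokenize("snake_case v2"))
# ['snake_case', 'v2']
```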

**2. Inverted Index:**
An inverted index is a data structure used to map terms (tokens) to the documents in which they appear. This allows the search engine to quickly retrieve all documents containing a given term. The index is structured as a dictionary, where each term points to a list of document IDs (or other identifiers).

In [2]:
from collections import defaultdict

def build_inverted_index(documents):
    # Map each token to the list of IDs of the documents that contain it
    inverted_index = defaultdict(list)

    for doc_id, document in enumerate(documents):
        # set() drops repeated tokens, so each doc_id is appended at most once per token
        for token in set(tokenize(document)):
            inverted_index[token].append(doc_id)

    return inverted_index
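
To make the structure concrete, the following self-contained sketch builds an index over two tiny documents and prints each term with its list of document IDs. The documents and the variable names `docs` and `index` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

def build_inverted_index(documents):
    inverted_index = defaultdict(list)
    for doc_id, document in enumerate(documents):
        for token in set(tokenize(document)):
            inverted_index[token].append(doc_id)
    return inverted_index

# Two tiny hypothetical documents, used only for illustration
docs = ["red apple", "green apple pie"]
index = build_inverted_index(docs)

for term in sorted(index):
    print(term, index[term])
# apple [0, 1]
# green [1]
# pie [1]
# red [0]
```

Because documents are processed in order of their IDs, each list ends up sorted in ascending order.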


**3. Search and Retrieval:**
When a user enters a query, the search engine tokenizes it and looks up each term in the inverted index. The function below returns only the documents that contain all of the query terms (a boolean AND) and does not rank them. Simple ranking can be added by counting how many query terms each document matches, or by more advanced weighting such as TF-IDF (Term Frequency-Inverse Document Frequency).

In [3]:
def search(query, inverted_index):
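    query_tokens = tokenize(query)
    matching_documents = None  # None means no query token has been processed yet

    for token in query_tokens:
        # A token missing from the index matches no documents (AND semantics)
        postings = set(inverted_index.get(token, ()))
        if matching_documents is None:
            matching_documents = postings
        else:
            matching_documents &= postings

    # An empty query, or no common documents, yields an empty set
    return matching_documents or set()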

# Sample documents
documents = [
    "Hello World! This is a sample document.",
    "This is another sample document.",
    "Sample search engine implementation using Python."
]

# Build the inverted index
inverted_index = build_inverted_index(documents)

# Search query
query = "sample document"
results = search(query, inverted_index)

# Output search results
print(f"Documents matching '{query}': {results}")


Documents matching 'sample document': {0, 1}


##### Explanation:
- **Tokenization**: Text is tokenized into lowercase words without special characters or punctuation.
- **Inverted Index**: The index maps each token to the list of document IDs in which it appears.
- **Search**: The query is tokenized, and the search engine returns the IDs of the documents that contain all of the query terms.

##### Enhancements:
- **Stopword Removal**: Exclude common words like 'the', 'is', 'and', etc., to improve search relevance.
- **Ranking**: Use TF-IDF or cosine similarity to rank documents based on query relevance.
- **Stemming/Lemmatization**: Apply stemming (e.g., "running" → "run") or lemmatization for better search accuracy.
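
The first two enhancements can be sketched with the standard library alone. The example below is a minimal illustration, not a definitive implementation: the `STOP_WORDS` set is a small made-up list, `rank_tf_idf` is a hypothetical helper name, and the weighting `tf * (log(N / df) + 1)` is only one of several TF-IDF variants. Stemming is left out because it normally relies on an external library.

```python
import math
import re
from collections import Counter

# A small illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"a", "an", "and", "is", "the", "this"}

def tokenize(text):
    # Lowercase, split into words, and drop stop words
    words = re.findall(r'\b\w+\b', text.lower())
    return [w for w in words if w not in STOP_WORDS]

def rank_tf_idf(query, documents):
    # Term counts for every document
    doc_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)

    scores = {}
    for term in set(tokenize(query)):
        # Document frequency: number of documents containing the term
        df = sum(1 for counts in doc_counts if term in counts)
        if df == 0:
            continue
        # The +1 keeps terms found in every document from scoring zero
        idf = math.log(n_docs / df) + 1.0
        for doc_id, counts in enumerate(doc_counts):
            if term in counts:
                # Term frequency normalized by document length
                tf = counts[term] / sum(counts.values())
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf

    # Highest score first; ties broken by document ID
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

documents = [
    "Hello World! This is a sample document.",
    "This is another sample document.",
    "Sample search engine implementation using Python.",
]

ranked = rank_tf_idf("sample document", documents)
print([doc_id for doc_id, _ in ranked])
# [1, 0, 2]
```

Document 1 ranks first because, after stop-word removal, it is the shortest document containing both query terms, so each match carries a larger normalized term frequency. Unlike the boolean AND search, this sketch also returns document 2, which matches only one of the two terms.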

<div style="float: right; border: 1px solid black; display: inline-block; padding: 10px; text-align: center">
    <br>
    <br>
    <span style="font-weight: bold;">Signature of Lab Incharge</span>
    <br>
    <span>(Prof. Rupali Sharma)</span> 
</div>