In [2]:
# Define the documents
document1 = "The quick brown fox jumped over the lazy dog ."
document2 = "The lazy dog slept in the sun ."

# Step 1: Tokenize the documents
# Convert each document to lowercase and split it into words
tokens1 = document1.lower().split()
tokens2 = document2.lower().split()

# Combine the tokens into a list of unique terms
terms = list(set(tokens1 + tokens2))

# Step 2: Build the inverted index
# Create an empty dictionary to store the inverted index
inverted_index = {}

# For each term, find the documents that contain it
for term in terms:
    documents = []
    if term in tokens1:
        documents.append("Document 1")
    if term in tokens2:
        documents.append("Document 2")
    inverted_index[term] = documents

# Step 3: Print the inverted index
print("Inverted Index:")
for term, documents in inverted_index.items():
    print(term, "->", ", ".join(documents))

# Step 4: Search Query
query = input("\nEnter your search query: ").lower()  # Get the search query from the user
query_terms = query.split()  # Split query into individual terms

# Find the documents for the query
result_docs = set()  # To store the matching documents

# Iterate over the query terms and retrieve documents
for term in query_terms:
    if term in inverted_index:
        result_docs.update(inverted_index[term])  # Add documents that contain the query term

# Step 5: Display the results
if result_docs:
    print("\nDocuments matching the query:")
    for doc in result_docs:
        print(doc)
else:
    print("\nNo documents found for the query.")


Inverted Index:
fox -> Document 1
. -> Document 1, Document 2
sun -> Document 2
the -> Document 1, Document 2
jumped -> Document 1
quick -> Document 1
dog -> Document 1, Document 2
lazy -> Document 1, Document 2
brown -> Document 1
in -> Document 2
slept -> Document 2
over -> Document 1



Enter your search query:  in



Documents matching the query:
Document 2


# Explanation:
1. Inverted Index Construction: The inverted index is created the same way as before, mapping each word to the documents it appears in.
2. Search Input: We prompt the user to input a search query.
   - The query is converted to lowercase and split into individual terms.
3. Search Query Processing: We look up each query term in the inverted index to find the documents that contain it.
4. Result Display: If any matching documents are found, they are displayed. Otherwise, a message indicates that no matches were found.

# How it Works:
- Inverted Index: The program builds an index of words (terms) and the documents where those words appear.
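- Search: The user enters a search query, and the program checks which documents contain any of the words in the query.
- Results: The documents containing at least one term from the query are displayed. Because the lookups are combined with a set union, a document does not need to contain every query term to match.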
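
As a minimal sketch of the same indexing idea for any number of documents, the index can be built in a single pass over the tokens instead of checking every term against every document. The function name `build_inverted_index` and the `docs` dictionary are illustrative assumptions and do not appear in the notebook.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the set of document names that contain it
    index = defaultdict(set)
    for name, text in docs.items():
        for token in text.lower().split():
            index[token].add(name)
    # Convert the sets to sorted lists for stable, readable output
    return {term: sorted(names) for term, names in index.items()}

docs = {
    "Document 1": "The quick brown fox jumped over the lazy dog .",
    "Document 2": "The lazy dog slept in the sun .",
}
index = build_inverted_index(docs)
print(index["lazy"])  # ['Document 1', 'Document 2']
print(index["fox"])   # ['Document 1']
```

Collecting names in a set per term avoids duplicate entries when a word such as "the" occurs several times in one document.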

# Alternative: Short Version


In [5]:
# Define the documents
document1 = "The quick brown fox jumped over the lazy dog."
document2 = "The lazy dog slept in the sun."

# Step 1: Tokenize the documents
# Convert each document to lowercase and split it into words
tokens1 = document1.lower().split()
tokens2 = document2.lower().split()

# Combine the tokens into a list of unique terms
terms = list(set(tokens1 + tokens2))

# Step 2: Build the inverted index
# Create an empty dictionary to store the inverted index
inverted_index = {}

# For each term, find the documents that contain it
for term in terms:
    documents = []
    if term in tokens1:
        documents.append("Document 1")
    if term in tokens2:
        documents.append("Document 2")
    inverted_index[term] = documents

# Step 3: Print the inverted index
for term, documents in inverted_index.items():
    print(term, "->", ", ".join(documents))


dog -> Document 2
the -> Document 1, Document 2
brown -> Document 1
slept -> Document 2
in -> Document 2
fox -> Document 1
lazy -> Document 1, Document 2
sun. -> Document 2
over -> Document 1
jumped -> Document 1
quick -> Document 1
dog. -> Document 1
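
In the output above, `dog` and `dog.` are indexed as separate terms because `split()` leaves the trailing period attached to the last word of each sentence. One possible fix, sketched here as an assumption and not as the notebook's own code, strips punctuation from both ends of every token. The helper name `tokenize` is hypothetical.

```python
import string

def tokenize(text):
    # Lowercase, split on whitespace, then strip leading/trailing punctuation
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    # Drop tokens that consisted of punctuation only (now empty strings)
    return [tok for tok in tokens if tok]

print(tokenize("The quick brown fox jumped over the lazy dog."))
# ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
print(tokenize("The lazy dog slept in the sun."))
# ['the', 'lazy', 'dog', 'slept', 'in', 'the', 'sun']
```

With tokens produced this way, `dog` would map to both documents and `sun` would be indexed without the period.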


In [None]:
'''
Let’s break down the code and explain each technical term and the entire procedure in detail.

1. Tokenization
Definition: Tokenization is the process of breaking a string of text into smaller units, typically words or phrases, called tokens.
Why it's important: Tokenization is the first step in most text processing tasks because it helps convert raw text into structured data (tokens) that can be analyzed.
In the code, tokens1 = document1.lower().split() splits the input document1 into a list of individual words by:

Converting the text to lowercase using .lower() to ensure case insensitivity.
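Using .split() with no arguments to break the text into words on any run of whitespace (spaces, tabs, or newlines). Punctuation that touches a word stays attached to it.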
Similarly, tokens2 = document2.lower().split() does the same for the second document.
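
For illustration, this is what the two calls produce on the first document of the first cell, where the period is separated from the last word by a space:

```python
document1 = "The quick brown fox jumped over the lazy dog ."

# lower() normalizes case; split() with no arguments splits on whitespace
tokens1 = document1.lower().split()
print(tokens1)
# ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.']

# A word can occur more than once; set() keeps a single copy of each
print(len(tokens1), len(set(tokens1)))  # 10 9
```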

2. Inverted Index
Definition: An inverted index is a data structure used to store a mapping from words (terms) to their occurrences in a set of documents. It's called "inverted" because it flips the relationship between documents and terms: instead of storing which terms appear in a document, we store which documents contain a specific term.
Why it's important: The inverted index is central to information retrieval systems like search engines because it allows for fast and efficient searching. When you search for a word, the inverted index immediately tells you which documents contain that word.
In the code:

The terms list is created by combining the unique tokens from both document1 and document2. This list contains all the distinct words in both documents.
For each term, we then check whether it appears in each document and record which document(s) the term appears in. This is stored in the inverted_index dictionary, where the keys are terms and the values are lists of documents containing the term.
Example:

If the term "lazy" appears in both document1 and document2, the inverted index entry would look like:
inverted_index['lazy'] = ['Document 1', 'Document 2']
3. Stopwords
Definition: Stopwords are commonly used words (like "the", "is", "in", etc.) that don't carry much meaningful information in the context of text analysis. Removing stopwords is a common step in text preprocessing because they can introduce noise in text analysis tasks.
Why it's important: Removing stopwords helps reduce the size of the dataset, focusing only on the meaningful terms that are relevant for tasks like information retrieval or text classification.
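In the version of the code that includes this step, stopwords are removed from both the documents and the search query using the stopwords.words('english') list provided by NLTK (the Natural Language Toolkit). The code filters out any word in the document or query that appears in the stopwords list. The two cells shown earlier skip this step, which is why "the" and "in" still appear in their printed indexes.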

Example:

If "the" appears in a document, it will be ignored when building the inverted index or when processing a query.
4. Preprocessing the Text
Definition: Preprocessing is the step where raw text is cleaned and normalized before analysis. Common preprocessing steps include converting text to lowercase, removing punctuation, tokenizing, and removing stopwords.
Why it's important: Preprocessing ensures that the text is in a consistent format, making it easier to work with and reducing variations in words (e.g., "dog" and "Dog" should be treated the same).
The preprocess_text function in the code performs:

Lowercasing: text = text.lower() converts the text to lowercase, which helps with case-insensitive matching.
Punctuation Removal: text = text.translate(str.maketrans("", "", string.punctuation)) removes punctuation (e.g., commas, periods, etc.) to avoid treating punctuation as part of the words.
Stopword Removal: A list comprehension filters out stopwords from the tokens.
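
A minimal sketch of such a preprocess_text function is shown here. To stay self-contained it uses a small hand-written stopword set, which is an assumption made for illustration, in place of NLTK's stopwords.words('english') list (that list has to be downloaded before use).

```python
import string

# Illustrative stand-in for NLTK's English stopword list (assumption, not the full list)
STOPWORDS = {"the", "is", "in", "a", "an", "of", "and"}

def preprocess_text(text):
    # 1. Lowercase for case-insensitive matching
    text = text.lower()
    # 2. Remove every punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenize on whitespace and drop stopwords
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(preprocess_text("The lazy dog slept in the sun."))
# ['lazy', 'dog', 'slept', 'sun']
```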
5. Query Processing
Definition: Query processing refers to how a user's search query is handled, from parsing the query to identifying the relevant documents that match the query terms.
Why it's important: The query is typically processed in the same way as the documents to ensure consistency. For example, if both documents and the query are converted to lowercase and stopwords are removed, the search will be accurate even if the user enters the query in a different case or includes unnecessary words.
In the code:

The query is first converted to lowercase using query = input().lower().
The query is then split into individual words (tokens) using query_terms = query.split().
Stopwords are removed from the query using the preprocess_text function, which ensures that common words (like "the", "in") don't affect the search results.
After processing the query, the terms are checked against the inverted index, and the relevant documents are retrieved.

6. Search Results (Union of Documents)
Definition: When a user submits a query, the system looks up each query term in the inverted index and collects the documents that contain it. The code combines these lookups with a set union, so a document matches if it contains at least one query term. Requiring every term to be present would need an intersection of the per-term document sets instead.
Why it's important: This is the heart of the search functionality: retrieving documents based on the user's query.
In the code:

For each term in the query (query_terms), the code checks if the term exists in the inverted index (if term in inverted_index).
If the term is found, the corresponding documents are added to the result_docs set using result_docs.update(inverted_index[term]). Using a set ensures that duplicate document entries are avoided.
Finally, the code prints the documents that match the query terms. If no documents match, it prints a message indicating no results were found.
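
Because update() merges the document lists, the lookup described here behaves like an OR search. The sketch below contrasts it with an AND search built on set intersection. The function names search_any and search_all are hypothetical, and the small index is copied by hand from the output printed by the first cell.

```python
# Hand-written excerpt of the index printed by the first cell
inverted_index = {
    "lazy": ["Document 1", "Document 2"],
    "fox": ["Document 1"],
    "sun": ["Document 2"],
}

def search_any(query, index):
    # Union: a document matches if it contains at least one query term
    result = set()
    for term in query.lower().split():
        result.update(index.get(term, []))
    return sorted(result)

def search_all(query, index):
    # Intersection: a document matches only if it contains every query term
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

print(search_any("lazy fox", inverted_index))  # ['Document 1', 'Document 2']
print(search_all("lazy fox", inverted_index))  # ['Document 1']
print(search_all("fox sun", inverted_index))   # []
```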

7. Set Operations
Definition: A set is a data structure that stores unique elements. In Python, sets provide efficient membership tests and operations like union, intersection, and difference.
Why it's important: Using a set to collect matching documents avoids duplicates, ensuring that each document is listed only once in the search results.
In the code, the search results are stored in a set (result_docs), and the update() method is used to add the documents that contain each query term. This ensures that a document is listed only once, even if several query terms match it.

Overall Procedure
Tokenization: Split each document into words (tokens).
Preprocessing: Clean and normalize the text by lowercasing it, removing punctuation, and removing stopwords.
Inverted Index Construction: Build an inverted index that maps each unique word (term) to the documents where it appears.
Query Processing: Accept a query from the user, preprocess it in the same way as the documents, and split it into terms.
Search: Check which documents contain the query terms by looking up each term in the inverted index.
Display Results: Show the documents that match the query terms, or inform the user if no documents were found.
This procedure is a basic implementation of a search engine's core functionality, where documents are indexed and queried for relevant results.
'''