# <center> Information Retrieval - ASSIGNMENT 1</center>

## Group No : 33

## Group Member Name - ID - Contribution:

1. ABHILASH DIXIT - 2023DC04284 - 100%
2. PANKE VARAD MANOJ - 2023DC04294 - 100%
3. PRAJAKTA PRATAP BHOSALE - 2023DC04090 - 100%
4. SAYANTAN GUPTA - 2023DC04350 - 100%


Title: Designing a Phrase-Based Academic Content Search Engine Using a Positional Index

Domain:
Education (Academic Notes, Research Papers, Lecture Transcripts)

Objective:
Universities and online learning platforms store large volumes of unstructured academic text such as 
lecture notes, course materials, and research papers. These documents often contain important multi-word 
academic concepts (e.g., "neural networks", "quantum entanglement", "curriculum design"). Traditional 
keyword search may return irrelevant results if terms are scattered or out of order. Therefore, 
a phrase-based search system using a positional index is essential for retrieving documents that contain 
exact phrases, enhancing the learning experience for students and researchers.


1. Preprocess documents (tokenization, stop word removal, lemmatization). (2)

In [38]:
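# Install required packages for text extraction and processing
# (python-docx provides the `docx` module; pandas is used for CSV files and the results table)
!pip install nltk pymupdf python-docx pandas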






In [None]:
# Import the Natural Language Toolkit library (NLTK)
import nltk
import os

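# Create local nltk_data folder and register it so NLTK can find the downloads
nltk_data_path = os.path.abspath("nltk_data")
os.makedirs(nltk_data_path, exist_ok=True)
if nltk_data_path not in nltk.data.path:
    nltk.data.path.append(nltk_data_path)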

# Download the Punkt tokenizer models
# This is used for sentence splitting and word tokenization
nltk.download('punkt', download_dir=nltk_data_path)
nltk.download('punkt_tab', download_dir=nltk_data_path)

# Download the list of common English stopwords
# These are words like "is", "the", "in", etc. which are often removed during preprocessing
nltk.download('stopwords', download_dir=nltk_data_path)

# Download the WordNet lexical database
# Required for lemmatization — converting words to their base (dictionary) form
nltk.download('wordnet', download_dir=nltk_data_path)


In [None]:
# --- Imports ---
import os                            # For file path handling
import fitz                          # PyMuPDF: used to extract text from PDF files
import docx                          # For reading .docx Word documents
import pandas as pd                  # For reading and processing CSV files
import re                            # For regular expressions (text cleaning)

from nltk.tokenize import TreebankWordTokenizer   # Alternative to word_tokenize (no dependency on 'punkt')
from nltk.corpus import stopwords                 # For English stopword list
from nltk.stem import WordNetLemmatizer           # For lemmatizing words (e.g., networks → network; nouns by default)

# --- Initialize Preprocessing Tools ---

tokenizer = TreebankWordTokenizer()               # Use Treebank tokenizer for consistent token splits
stop_words = set(stopwords.words('english'))      # Load NLTK's built-in list of English stopwords
lemmatizer = WordNetLemmatizer()                  # Initialize WordNet lemmatizer for word normalization

# --- Preprocessing Function ---

def preprocess_text(text):
    """
    Cleans and tokenizes text using:
    - Lowercasing
    - Treebank tokenizer
    - Stopword removal
    - Lemmatization
    Returns a list of cleaned tokens.
    """
    tokens = tokenizer.tokenize(text.lower())     # Tokenize and lowercase the text
  
    # Return cleaned, lemmatized tokens with only alphabetic strings of length > 2
    return [
        lemmatizer.lemmatize(t)
        for t in tokens
        if t.isalpha() and len(t) > 2 and t not in stop_words
    ]

# --- PDF Extraction Function ---

def extract_pdf(path):
    """
    Extracts and cleans text from a PDF file using PyMuPDF.
    Applies:
    - Whitespace normalization
    - Removal of non-text characters (except punctuation)
    """
    text = ""
    with fitz.open(path) as doc:
        for page in doc:
            text += page.get_text()

    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace (including newlines) into one space
    text = re.sub(r'[^a-zA-Z0-9.,;:()\'"?!\- ]', '', text)  # Drop every character outside this ASCII allow-list
    return text

# --- DOCX Extraction Function ---

def extract_docx(path):
    """
    Extracts and returns full text from a Word (.docx) file,
    concatenating all paragraphs with newline characters.
    """
    doc = docx.Document(path)
    return "\n".join([para.text for para in doc.paragraphs])

# --- TXT Extraction Function ---

def extract_txt(path):
    """
    Reads and returns plain text from a .txt file.
    """
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

# --- CSV Extraction Function ---

def extract_csv(path):
    """
    Loads a CSV file using pandas and flattens all cell values into a single string.
    Converts non-string entries to string and joins by whitespace.
    Useful for large content-based datasets.
    """
    df = pd.read_csv(path)
    return " ".join(df.astype(str).apply(lambda row: " ".join(row), axis=1))


In [4]:
# --- Define file paths for all 10 documents in various formats (PDF, DOCX, TXT, CSV) ---
# These files represent academic content used in the search engine application.
file_paths = {
    "1_notes_ml.pdf": "datasets/1_notes_ml.pdf",
    "2_research_paper.pdf": "datasets/2_research_paper.pdf",
    "3_IR_lecture_transcript.docx": "datasets/3_IR_lecture_transcript.docx",
    "4_DMML_curriculum_outline.docx": "datasets/4_DMML_curriculum_outline.docx",
    "5_arxiv_edu.csv": "datasets/5_arxiv_edu.csv",
    "6_ML_syllabus.pdf": "datasets/6_ML_syllabus.pdf",
    "7_mooc_math.txt": "datasets/7_mooc_math.txt",
    "8_online_courses.csv": "datasets/8_online_courses.csv",
    "9_edu_policy.pdf": "datasets/9_edu_policy.pdf",
    "10_student_perf_report.pdf": "datasets/10_student_perf_report.pdf"
}

# --- Dictionary to store cleaned and tokenized output for each file ---
preprocessed_outputs = {}
text = ""
# --- Loop over each file and preprocess its content ---
for name, path in file_paths.items():
    
    # Use appropriate extractor function based on file extension
    if name.endswith(".pdf"):
        text = extract_pdf(path)
    elif name.endswith(".docx"):
        text = extract_docx(path)
    elif name.endswith(".txt"):
        text = extract_txt(path)
    elif name.endswith(".csv"):
        text = extract_csv(path)
    else:
        text = ""  # Fallback if extension not recognized

    # Preprocess the extracted text (tokenization, stop word removal, lemmatization)
    tokens = preprocess_text(text)

    # Store token list in dictionary with filename as key
    preprocessed_outputs[name] = tokens

    # Print summary info for each file
    print(f"✅ {name}: {len(tokens)} tokens")
    print(f"Sample tokens: {tokens[:30]}\n")  # Show first 30 tokens as preview


✅ 1_notes_ml.pdf: 18184 tokens
Sample tokens: ['machine', 'learning', 'lecture', 'note', 'year', 'sem', 'department', 'computer', 'science', 'engineering', 'malla', 'reddy', 'college', 'engineering', 'technology', 'autonomous', 'institution', 'ugc', 'india', 'recognized', 'ugc', 'act', 'affiliated', 'jntuh', 'hyderabad', 'approved', 'aicte', 'accredited', 'nba', 'naac']

✅ 2_research_paper.pdf: 2255 tokens
Sample tokens: ['neural', 'network', 'approach', 'ordinal', 'regression', 'jianlin', 'cheng', 'school', 'electrical', 'engineering', 'computer', 'science', 'university', 'central', 'florida', 'orlando', 'usa', 'abstract', 'ordinal', 'regression', 'important', 'type', 'learning', 'property', 'sication', 'describe', 'simple', 'eective', 'approach', 'adapt']

✅ 3_IR_lecture_transcript.docx: 6030 tokens
Sample tokens: ['information', 'retrieval', 'recording', 'may', 'maheswari', 'started', 'transcription', 'maheswari', 'let', 'share', 'good', 'afternoon', 'let', 'begin', 'love', 'allowin
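
Tokens such as 'sication' and 'eective' in the sample for `2_research_paper.pdf` look like "sification" and "effective" with the "fi" and "ff" ligature characters removed by the ASCII allow-list in `extract_pdf`. That reading is an assumption based only on the printed tokens. A minimal sketch of one possible remedy is to apply Unicode NFKC normalization before the regex cleanup; `normalize_extracted_text` is a hypothetical helper and is not part of the notebook above.

```python
import re
import unicodedata

def normalize_extracted_text(text):
    """Hypothetical helper: fold ligatures before applying the ASCII allow-list."""
    # NFKC replaces compatibility characters such as U+FB01 (fi) and U+FB00 (ff)
    # with their plain-letter equivalents
    text = unicodedata.normalize("NFKC", text)
    # Same cleanup steps as extract_pdf
    text = re.sub(r'\s+', ' ', text)
    return re.sub(r'[^a-zA-Z0-9.,;:()\'"?!\- ]', '', text)

print(normalize_extracted_text("classi\ufb01cation is e\ufb00ective"))
# classification is effective
```

Using this in `extract_pdf` would change the token counts and positions reported in this notebook, so the recorded outputs would need to be regenerated.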

2. Build a Positional Index that captures each term in the corpus, the list of document IDs (file names) where the term appears, and the position(s) of the term within each document. Use dictionaries and posting lists for efficient storage and retrieval. (2)


In [5]:
from collections import defaultdict  # Import defaultdict for nested dictionary structure

# --- Initialize Positional Index ---
# This creates a structure like:
# {
#     'machine': {'doc1.txt': [0, 5, 9], 'doc2.txt': [2]},
#     'learning': {'doc1.txt': [1, 6], 'doc2.txt': [3]}
# }
# The outer dict maps each term to a dictionary of document IDs and positions
positional_index = defaultdict(lambda: defaultdict(list))

# --- Build Positional Index from Preprocessed Tokens ---
# Loop through each document and its list of tokens
for doc_id, tokens in preprocessed_outputs.items():
    for position, token in enumerate(tokens):
        # Add the token's position to the index under its document
        positional_index[token][doc_id].append(position)

# --- Convert defaultdict to regular dict ---
# Makes the data easier to print, view, or serialize (e.g., save to JSON)
positional_index = {
    term: dict(postings)
    for term, postings in positional_index.items()
}

# --- Display a Sample of the Positional Index ---
# Print the first 5 terms and their document-wise positions for verification
sample_terms = list(positional_index.keys())[:5]
for term in sample_terms:
    print(f"Term: {term}")
    print(f"Postings: {positional_index[term]}")
    print("---")


Term: machine
Postings: {'1_notes_ml.pdf': [0, 45, 56, 73, 127, 238, 252, 257, 261, 282, 303, 313, 372, 422, 425, 427, 474, 477, 491, 587, 606, 753, 756, 781, 784, 864, 874, 886, 1015, 1392, 1537, 1683, 2364, 2669, 2702, 2835, 2942, 3001, 3004, 3009, 3168, 3176, 3178, 3832, 3846, 4139, 4309, 4324, 5392, 5401, 5426, 5431, 5478, 5552, 5591, 5752, 5789, 5848, 6019, 6026, 6558, 8325, 8328, 8344, 8780, 8788, 8814, 8959, 9291, 9647, 9653, 10772, 11047, 11735, 15908, 15915], '2_research_paper.pdf': [58, 155, 190, 226, 428, 556, 1066, 1333, 1347, 1376, 1395, 1448, 1507, 1676, 1717, 1739, 1753, 1763, 1774, 1810, 1819, 1834, 1873, 2007, 2052, 2102, 2116, 2148, 2162, 2198, 2215, 2249], '3_IR_lecture_transcript.docx': [3819, 3839, 5414, 5423, 5486, 5494, 5514, 5532, 5537, 5570], '5_arxiv_edu.csv': [124, 393, 649, 1949, 2075, 2301, 2466, 2734, 2801, 2827, 2894, 2990, 3103, 3332, 3355, 4332, 4336, 4361, 4384, 4402], '6_ML_syllabus.pdf': [59, 65, 95, 159, 213, 265], '8_online_courses.csv': [0, 5, 12,
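
Because the real postings are long, the structure is easier to check on a toy corpus. The sketch below repeats the same indexing logic on two made-up token lists; `build_positional_index`, `toy_docs`, and the document IDs `d1` and `d2` are illustrative names only.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of token lists."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    # Convert to plain dicts for readable printing
    return {term: dict(postings) for term, postings in index.items()}

# Made-up documents, already preprocessed into tokens
toy_docs = {
    "d1": ["neural", "network", "model", "neural"],
    "d2": ["network", "neural"],
}
toy_index = build_positional_index(toy_docs)
print(toy_index["neural"])   # {'d1': [0, 3], 'd2': [1]}
print(toy_index["network"])  # {'d1': [1], 'd2': [0]}
```

In `d1`, "neural" at position 0 is followed by "network" at position 1, so the phrase "neural network" occurs there. In `d2` the order is reversed, so it does not.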

3. Phrase Query Processor - Implement a function that takes a phrase query (e.g., "deep learning", "supply chain management") and returns all documents that contain the exact phrase using positional information. (2)


In [6]:
def phrase_query_processor(phrase, positional_index, preprocessed_outputs):
    """
    Returns a list of document names that contain the exact phrase using positional information.
    Positions refer to the preprocessed token stream, so the phrase must be
    contiguous after stop word removal and lemmatization.
    `preprocessed_outputs` is not used here; it is kept so existing calls still work.
    """
    # Preprocess the phrase query (tokenize, lemmatize, remove stopwords)
    phrase_tokens = preprocess_text(phrase)
    if not phrase_tokens:
        return []

    # Every term must be in the index; otherwise no document can match
    if any(term not in positional_index for term in phrase_tokens):
        return []

    # Candidate docs are those that contain all terms of the phrase
    candidate_docs = set(positional_index[phrase_tokens[0]].keys())
    for term in phrase_tokens[1:]:
        candidate_docs &= set(positional_index[term].keys())
    if not candidate_docs:
        return []

    # For each candidate doc, check for the phrase using positions
    result_docs = []
    for doc in candidate_docs:
        # Position sets give constant-time membership tests for the follow-up terms
        position_sets = [set(positional_index[term][doc]) for term in phrase_tokens]
        # A match needs term i at position (start + i) for every i
        for pos in positional_index[phrase_tokens[0]][doc]:
            if all((pos + i) in position_sets[i] for i in range(1, len(phrase_tokens))):
                result_docs.append(doc)
                break  # Only need to find one match per doc
    return result_docs
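
One consequence of indexing positions after preprocessing is that "exact phrase" means adjacent in the filtered token list, not in the raw text. The sketch below illustrates this with a small stand-in stop word set; `TOY_STOP_WORDS`, `toy_preprocess`, and `contains_phrase` are hypothetical and only mimic the behaviour of `preprocess_text` and the positional check.

```python
# Hypothetical stand-in for NLTK's English stop word list
TOY_STOP_WORDS = {"of", "the", "for"}

def toy_preprocess(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in TOY_STOP_WORDS]

def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

doc = toy_preprocess("Design of the curriculum for first year")
print(doc)  # ['design', 'curriculum', 'first', 'year']
print(contains_phrase(doc, toy_preprocess("design curriculum")))  # True
```

The raw text never contains "design curriculum" as written, yet the query matches because the removed stop words no longer occupy positions. Storing positions from the unfiltered token stream would be one way to avoid this, at the cost of handling gaps in the query.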

In [7]:
# List of academic phrases to search for
example_phrases = [
    "deep learning",
    "neural network",
    "machine learning",
    "student performance",
    "supply chain management",   
    "linear regression",
    "curriculum design"
]

# Run phrase search and print results
for phrase in example_phrases:
    matching_docs = phrase_query_processor(phrase, positional_index, preprocessed_outputs)
    print(f"📌 Phrase: '{phrase}' → Found in: {matching_docs}")


📌 Phrase: 'deep learning' → Found in: ['3_IR_lecture_transcript.docx', '5_arxiv_edu.csv', '8_online_courses.csv']
📌 Phrase: 'neural network' → Found in: ['6_ML_syllabus.pdf', '5_arxiv_edu.csv', '2_research_paper.pdf', '8_online_courses.csv', '1_notes_ml.pdf']
📌 Phrase: 'machine learning' → Found in: ['6_ML_syllabus.pdf', '5_arxiv_edu.csv', '2_research_paper.pdf', '3_IR_lecture_transcript.docx', '10_student_perf_report.pdf', '8_online_courses.csv', '1_notes_ml.pdf']
📌 Phrase: 'student performance' → Found in: ['10_student_perf_report.pdf', '2_research_paper.pdf']
📌 Phrase: 'supply chain management' → Found in: ['8_online_courses.csv']
📌 Phrase: 'linear regression' → Found in: ['6_ML_syllabus.pdf', '8_online_courses.csv', '1_notes_ml.pdf']
📌 Phrase: 'curriculum design' → Found in: ['8_online_courses.csv']


4. Evaluate your system:

- Select 3–5 phrase queries.
- For each query, define a set of relevant documents manually.
- Run your phrase query processor to get the retrieved documents.
- Calculate Precision and Recall for each query.
- Compare the results with a non-positional keyword-based search, and explain the difference in performance. (4)

4.1 Select 3–5 phrase queries<br>
4.2 For each query, define a set of relevant documents manually

In [8]:
phrase_queries = {
    "machine learning": ["1_notes_ml.pdf", "6_ML_syllabus.pdf", "8_online_courses.csv"],
    "neural network": ["2_research_paper.pdf", "8_online_courses.csv"],
    "curriculum design": ["4_DMML_curriculum_outline.docx"],
    "student performance": ["2_research_paper.pdf", "10_student_perf_report.pdf", "8_online_courses.csv"],
    "online education platform": ["7_mooc_math.txt", "8_online_courses.csv"],
    "data science specialization": ["8_online_courses.csv"],
    "support vector machine": ["1_notes_ml.pdf"]
}

Non-positional keyword-based search

In [9]:
def keyword_search(query, preprocessed_outputs):
    """
    Basic keyword-based search:
    Returns documents containing all keywords (not necessarily in order).
    """
    keywords = set(preprocess_text(query))
    result = []
    for doc, tokens in preprocessed_outputs.items():
        if keywords.issubset(set(tokens)):
            result.append(doc)
    return result

4.3 Calculate Precision and Recall for each query

In [10]:
def precision_recall(retrieved, relevant):
    """
    Compute Precision and Recall for retrieved vs. relevant document sets.
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0
    recall = true_positives / len(relevant_set) if relevant_set else 0
    return precision, recall
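
As a worked check of the two formulas, the sketch below recomputes precision and recall by hand for the "student performance" query, using the relevant set defined in 4.2 and the documents the phrase search returned in section 3. The variable names are illustrative.

```python
# Relevant documents for "student performance" (from the manual judgments in 4.2)
relevant = {"2_research_paper.pdf", "10_student_perf_report.pdf", "8_online_courses.csv"}
# Documents returned by the phrase search for the same query (output in section 3)
retrieved = {"10_student_perf_report.pdf", "2_research_paper.pdf"}

true_positives = len(relevant & retrieved)   # 2 documents are both retrieved and relevant
precision = true_positives / len(retrieved)  # 2 / 2
recall = true_positives / len(relevant)      # 2 / 3

print(round(precision, 2), round(recall, 2))  # 1.0 0.67
```

Every retrieved document is relevant, so precision is 1.0, but one relevant document (`8_online_courses.csv`) is missed, so recall is 2/3.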

4.4 Compare the results with a non-positional keyword-based search, and explain the difference in performance

Collect evaluation data

In [11]:
detailed_results = []

for phrase, relevant_docs in phrase_queries.items():
    retrieved_phrase = phrase_query_processor(phrase, positional_index, preprocessed_outputs)
    retrieved_keyword = keyword_search(phrase, preprocessed_outputs)

    prec_p, rec_p = precision_recall(retrieved_phrase, relevant_docs)
    prec_k, rec_k = precision_recall(retrieved_keyword, relevant_docs)

    detailed_results.append({
        "Phrase Query": phrase,
        "Relevant Documents": ", ".join(relevant_docs),
        "Retrieved (Phrase Search)": ", ".join(retrieved_phrase),
        "Retrieved (Keyword Search)": ", ".join(retrieved_keyword),
        "Precision (Phrase)": round(prec_p, 2),
        "Recall (Phrase)": round(rec_p, 2),
        "Precision (Keyword)": round(prec_k, 2),
        "Recall (Keyword)": round(rec_k, 2)
    })

# Fix display settings for clean output
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", 1000)
pd.set_option('display.float_format', '{:.2f}'.format)

# Create DataFrame to view or export
df_detailed_eval = pd.DataFrame(detailed_results)
df_detailed_eval

| | Phrase Query | Relevant Documents | Retrieved (Phrase Search) | Retrieved (Keyword Search) | Precision (Phrase) | Recall (Phrase) | Precision (Keyword) | Recall (Keyword) |
|---|---|---|---|---|---|---|---|---|
| 0 | machine learning | 1_notes_ml.pdf, 6_ML_syllabus.pdf, 8_online_courses.csv | 6_ML_syllabus.pdf, 5_arxiv_edu.csv, 2_research_paper.pdf, 3_IR_lecture_transcript.docx, 10_student_perf_report.pdf, 8_online_courses.csv, 1_notes_ml.pdf | 1_notes_ml.pdf, 2_research_paper.pdf, 3_IR_lecture_transcript.docx, 5_arxiv_edu.csv, 6_ML_syllabus.pdf, 8_online_courses.csv, 10_student_perf_report.pdf | 0.43 | 1.0 | 0.43 | 1.0 |
| 1 | neural network | 2_research_paper.pdf, 8_online_courses.csv | 6_ML_syllabus.pdf, 5_arxiv_edu.csv, 2_research_paper.pdf, 8_online_courses.csv, 1_notes_ml.pdf | 1_notes_ml.pdf, 2_research_paper.pdf, 5_arxiv_edu.csv, 6_ML_syllabus.pdf, 8_online_courses.csv | 0.4 | 1.0 | 0.4 | 1.0 |
| 2 | curriculum design | 4_DMML_curriculum_outline.docx | 8_online_courses.csv | 8_online_courses.csv, 10_student_perf_report.pdf | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | student performance | 2_research_paper.pdf, 10_student_perf_report.pdf, 8_online_courses.csv | 10_student_perf_report.pdf, 2_research_paper.pdf | 2_research_paper.pdf, 8_online_courses.csv, 10_student_perf_report.pdf | 1.0 | 0.67 | 1.0 | 1.0 |
| 4 | online education platform | 7_mooc_math.txt, 8_online_courses.csv | | 7_mooc_math.txt, 8_online_courses.csv, 10_student_perf_report.pdf | 0.0 | 0.0 | 0.67 | 1.0 |
| 5 | data science specialization | 8_online_courses.csv | 8_online_courses.csv | 1_notes_ml.pdf, 5_arxiv_edu.csv, 8_online_courses.csv | 1.0 | 1.0 | 0.33 | 1.0 |
| 6 | support vector machine | 1_notes_ml.pdf | 6_ML_syllabus.pdf, 5_arxiv_edu.csv, 2_research_paper.pdf, 8_online_courses.csv, 1_notes_ml.pdf | 1_notes_ml.pdf, 2_research_paper.pdf, 5_arxiv_edu.csv, 6_ML_syllabus.pdf, 8_online_courses.csv | 0.2 | 1.0 | 0.2 | 1.0 |


**Performance Comparison: Phrase Search vs. Non-Positional Keyword Search**

Based on the evaluation results above, phrase-based search and non-positional keyword-based search differ as follows:

- **Precision:** Phrase search only returns documents in which the query terms are adjacent and in order, so its result set is always a subset of the keyword result set. In this evaluation that raised precision for one query ("data science specialization": 1.0 vs. 0.33), left it unchanged for five queries, and lowered it for one ("online education platform": 0.0 vs. 0.67, because phrase search retrieved nothing). Precision is low for both methods on "machine learning" (0.43), "neural network" (0.4), and "support vector machine" (0.2) because the manually defined relevant sets are smaller than the sets of documents retrieved.

- **Recall:** Phrase search never has higher recall than keyword search. If a relevant document contains the query words but not as a contiguous phrase (e.g., the words are separated or appear in a different order), phrase search will not retrieve it. This happened for "student performance" (0.67 vs. 1.0) and "online education platform" (0.0 vs. 1.0); recall was identical for the other five queries.

- **Example from Output:**
    - For "data science specialization", phrase search returns only `8_online_courses.csv`, while keyword search also returns `1_notes_ml.pdf` and `5_arxiv_edu.csv`, where the three words are present but not together.
    - For "machine learning", "neural network", and "support vector machine", both methods return the same documents, so the positional check makes no difference to precision or recall on those queries.
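
The difference between the two matching rules can be reproduced on a toy corpus. The sketch below is illustrative only: `toy_corpus`, `keyword_match`, and `phrase_match` are made-up names, and the token lists stand in for preprocessed documents.

```python
def keyword_match(tokens, query_tokens):
    """True if every query term occurs somewhere in the document."""
    return set(query_tokens).issubset(tokens)

def phrase_match(tokens, query_tokens):
    """True if the query terms occur adjacently and in order."""
    n = len(query_tokens)
    return any(tokens[i:i + n] == query_tokens
               for i in range(len(tokens) - n + 1))

# Made-up preprocessed documents
toy_corpus = {
    "doc_a": ["data", "science", "specialization", "course"],
    "doc_b": ["science", "course", "data", "analysis", "specialization"],
}
query = ["data", "science", "specialization"]

keyword_hits = [d for d, t in toy_corpus.items() if keyword_match(t, query)]
phrase_hits = [d for d, t in toy_corpus.items() if phrase_match(t, query)]
print(keyword_hits)  # ['doc_a', 'doc_b']
print(phrase_hits)   # ['doc_a']
```

If only `doc_a` is judged relevant, keyword search scores precision 0.5 and phrase search 1.0, with recall 1.0 for both. This is the same pattern as the "data science specialization" row in the table.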

**Conclusion:**
Phrase-based search is more effective for retrieving documents that match the user's intent exactly, making it suitable for academic and technical search tasks where phrase integrity is important. Keyword search is broader and may be useful for exploratory search, but it can return less relevant results. The choice between the two depends on whether precision or recall is more important for the user's needs.