# 🧪 Lab 2: Document Preprocessing, Indexing, and Relevance Feedback

## 📚 Modules Covered
- Module 3: Document Preprocessing, Indexing & Searching  
- Module 4: Evaluation of IR Systems  
- Module 5: Relevance Feedback and Query Expansion  

## 🔧 Objective
Implement and evaluate core IR techniques including tokenization, normalization, inverted indexing, relevance feedback, and query expansion. Build a mini IR pipeline and test its performance using precision, recall, and feedback mechanisms.

## <font color='red'>Submission: Submit both .ipynb file and .ipynb converted to PDF</font>
## <font color='blue'>Submissions with following cases will get a zero</font>
* ## <font color='red'>Code or commented text truncated from the pdf version of the notebook</font>
* ### <font color='blue'>Any compilation error in the notebook</font>
* ### <font color='blue'>Missing output for any of the programming cells. There should be an output for every code cell</font>
  
## ✅ Submission Checklist
* Preprocessing code and inverted index output
* Evaluation metrics for at least 3 queries
* Expanded queries and feedback results
* Answers to reflection questions


## 🛠️ Part A: Document Preprocessing and Indexing
### Task 1: Tokenization and Normalization

- Load a small corpus of 5–10 sample documents (e.g., news articles or Wikipedia snippets).
- Apply tokenization using NLTK or spaCy.
- Normalize tokens:
  - Lowercasing  
  - Removing punctuation  
  - Handling Unicode and spelling variants  

In [1]:
#code/program Task 1

# Using wikipedia articles
import wikipedia

titles = ["Python (programming language)", "Badminton", "Soccer", "Computer Science", "Data science"]
documents = {}
for title in titles:
    page = wikipedia.page(title)
    documents[title] = page.summary

# Made documents a dictionary where key is title and value is the summary
documents

{'Python (programming language)': 'Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically type-checked and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.\nGuido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language. Python 3.0, released in 2008, was a major revision and not completely backward-compatible with earlier versions. Beginning with Python 3.5, capabilities and keywords for typing were added to the language, allowing optional static typing. Currently only versions in the 3.x series are supported. \nPython has gained widespread use in the machine learning community. It is widely taught as an introductory programming language. Since 2003, Python has consistently ranked in the top ten of the most popular progra

In [2]:
# Tokenizing Text using NLTK
import nltk
tokenized_documents = {}
for title, summary in documents.items():
    tokenized_documents[title] = nltk.word_tokenize(summary)

In [3]:
# This is a dictionary of lists, where the key is the title,
# and the value is a list of tokenized terms per document
print(tokenized_documents["Badminton"])

['Badminton', 'is', 'a', 'racquet', 'sport', 'played', 'using', 'racquets', 'to', 'hit', 'a', 'shuttlecock', 'across', 'a', 'net', '.', 'Although', 'it', 'may', 'be', 'played', 'with', 'larger', 'teams', ',', 'the', 'most', 'common', 'forms', 'of', 'the', 'game', 'are', '``', 'singles', "''", '(', 'with', 'one', 'player', 'per', 'side', ')', 'and', '``', 'doubles', "''", '(', 'with', 'two', 'players', 'per', 'side', ')', '.', 'Badminton', 'is', 'often', 'played', 'as', 'a', 'casual', 'outdoor', 'activity', 'in', 'a', 'yard', 'or', 'on', 'a', 'beach', ';', 'professional', 'games', 'are', 'played', 'on', 'a', 'rectangular', 'indoor', 'court', '.', 'Points', 'are', 'scored', 'by', 'striking', 'the', 'shuttlecock', 'with', 'the', 'racquet', 'and', 'landing', 'it', 'within', 'the', 'other', 'team', "'s", 'half', 'of', 'the', 'court', ',', 'within', 'the', 'set', 'boundaries', '.', 'Each', 'side', 'may', 'only', 'strike', 'the', 'shuttlecock', 'once', 'before', 'it', 'passes', 'over', 'the',

In [4]:
# Normalizing tokens (FIRST, LOWERCASING)
lowercased_tokenized_documents = {}
for title, tokenized_doc in tokenized_documents.items():
    lowercased_tokenized_documents[title.lower()] = [token.lower() for token in tokenized_doc]

# Using badminton document as the reference document to show in this report
print(lowercased_tokenized_documents["badminton"])

['badminton', 'is', 'a', 'racquet', 'sport', 'played', 'using', 'racquets', 'to', 'hit', 'a', 'shuttlecock', 'across', 'a', 'net', '.', 'although', 'it', 'may', 'be', 'played', 'with', 'larger', 'teams', ',', 'the', 'most', 'common', 'forms', 'of', 'the', 'game', 'are', '``', 'singles', "''", '(', 'with', 'one', 'player', 'per', 'side', ')', 'and', '``', 'doubles', "''", '(', 'with', 'two', 'players', 'per', 'side', ')', '.', 'badminton', 'is', 'often', 'played', 'as', 'a', 'casual', 'outdoor', 'activity', 'in', 'a', 'yard', 'or', 'on', 'a', 'beach', ';', 'professional', 'games', 'are', 'played', 'on', 'a', 'rectangular', 'indoor', 'court', '.', 'points', 'are', 'scored', 'by', 'striking', 'the', 'shuttlecock', 'with', 'the', 'racquet', 'and', 'landing', 'it', 'within', 'the', 'other', 'team', "'s", 'half', 'of', 'the', 'court', ',', 'within', 'the', 'set', 'boundaries', '.', 'each', 'side', 'may', 'only', 'strike', 'the', 'shuttlecock', 'once', 'before', 'it', 'passes', 'over', 'the',

In [5]:
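# REMOVING PUNCTUATION AND NON-ASCII TOKENS
# isalpha() drops punctuation tokens as well as tokens containing digits, hyphens or apostrophes;
# isascii() drops (rather than normalizes) any token containing a non-ASCII character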
normalized_tokenized_documents = {}

for title, lowercased_doc in lowercased_tokenized_documents.items():
    normalized_tokenized_documents[title] = [
        token for token in lowercased_doc if token.isalpha() and token.isascii()
    ]

print(normalized_tokenized_documents["badminton"])

['badminton', 'is', 'a', 'racquet', 'sport', 'played', 'using', 'racquets', 'to', 'hit', 'a', 'shuttlecock', 'across', 'a', 'net', 'although', 'it', 'may', 'be', 'played', 'with', 'larger', 'teams', 'the', 'most', 'common', 'forms', 'of', 'the', 'game', 'are', 'singles', 'with', 'one', 'player', 'per', 'side', 'and', 'doubles', 'with', 'two', 'players', 'per', 'side', 'badminton', 'is', 'often', 'played', 'as', 'a', 'casual', 'outdoor', 'activity', 'in', 'a', 'yard', 'or', 'on', 'a', 'beach', 'professional', 'games', 'are', 'played', 'on', 'a', 'rectangular', 'indoor', 'court', 'points', 'are', 'scored', 'by', 'striking', 'the', 'shuttlecock', 'with', 'the', 'racquet', 'and', 'landing', 'it', 'within', 'the', 'other', 'team', 'half', 'of', 'the', 'court', 'within', 'the', 'set', 'boundaries', 'each', 'side', 'may', 'only', 'strike', 'the', 'shuttlecock', 'once', 'before', 'it', 'passes', 'over', 'the', 'net', 'play', 'ends', 'once', 'the', 'shuttlecock', 'has', 'struck', 'the', 'floor'
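
The `isascii()` filter above discards any token that contains an accented character. If such tokens should be kept, one possible alternative is to fold accents to their base letters before filtering. The following is a minimal sketch using only the standard library; `fold_to_ascii` and the sample tokens are hypothetical and do not come from the lab corpus.

```python
import unicodedata

def fold_to_ascii(token):
    """Decompose accented characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical tokens for illustration only
sample_tokens = ["café", "naïve", "zoë", "python"]
folded = [fold_to_ascii(t) for t in sample_tokens]

# Same filter as the notebook cell, applied after folding
kept = [t for t in folded if t.isalpha() and t.isascii()]
print(kept)  # ['cafe', 'naive', 'zoe', 'python']
```

Folding is lossy for characters that have no decomposition (for example "ø"), so those tokens would still be removed by the `isascii()` check.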

### Task 2: Stop Word Removal and Stemming

- Remove stop words using a standard list (e.g., NLTK’s English stopwords).
- Apply stemming (Porter or Snowball) and compare with lemmatization.

In [6]:
#code/program Task 2
# Stop word removal
from nltk.corpus import stopwords
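# Note: this rebinds the name `stopwords` from the NLTK corpus reader to the word collection itself.
# A set makes the `not in` membership tests below faster than a list.
stopwords = set(stopwords.words('english'))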

removed_stopwords_tokenized_documents = {}
for title, normalized_doc in normalized_tokenized_documents.items():
    removed_stopwords_tokenized_documents[title] = [
        token for token in normalized_doc if token not in stopwords
    ]

print(removed_stopwords_tokenized_documents["badminton"])

['badminton', 'racquet', 'sport', 'played', 'using', 'racquets', 'hit', 'shuttlecock', 'across', 'net', 'although', 'may', 'played', 'larger', 'teams', 'common', 'forms', 'game', 'singles', 'one', 'player', 'per', 'side', 'doubles', 'two', 'players', 'per', 'side', 'badminton', 'often', 'played', 'casual', 'outdoor', 'activity', 'yard', 'beach', 'professional', 'games', 'played', 'rectangular', 'indoor', 'court', 'points', 'scored', 'striking', 'shuttlecock', 'racquet', 'landing', 'within', 'team', 'half', 'court', 'within', 'set', 'boundaries', 'side', 'may', 'strike', 'shuttlecock', 'passes', 'net', 'play', 'ends', 'shuttlecock', 'struck', 'floor', 'ground', 'fault', 'called', 'umpire', 'service', 'judge', 'absence', 'opposing', 'side', 'shuttlecock', 'feathered', 'informal', 'matches', 'plastic', 'projectile', 'flies', 'differently', 'balls', 'used', 'many', 'sports', 'particular', 'feathers', 'create', 'much', 'higher', 'drag', 'causing', 'shuttlecock', 'decelerate', 'rapidly', 'sh

In [7]:
# Porter Stemmer
from nltk.stem import PorterStemmer
porter = PorterStemmer()

porter_tokenized_documents = {}
for title, no_stopword_doc in removed_stopwords_tokenized_documents.items():
    porter_tokenized_documents[title] = [porter.stem(token) for token in no_stopword_doc]

print(porter_tokenized_documents["badminton"])

['badminton', 'racquet', 'sport', 'play', 'use', 'racquet', 'hit', 'shuttlecock', 'across', 'net', 'although', 'may', 'play', 'larger', 'team', 'common', 'form', 'game', 'singl', 'one', 'player', 'per', 'side', 'doubl', 'two', 'player', 'per', 'side', 'badminton', 'often', 'play', 'casual', 'outdoor', 'activ', 'yard', 'beach', 'profession', 'game', 'play', 'rectangular', 'indoor', 'court', 'point', 'score', 'strike', 'shuttlecock', 'racquet', 'land', 'within', 'team', 'half', 'court', 'within', 'set', 'boundari', 'side', 'may', 'strike', 'shuttlecock', 'pass', 'net', 'play', 'end', 'shuttlecock', 'struck', 'floor', 'ground', 'fault', 'call', 'umpir', 'servic', 'judg', 'absenc', 'oppos', 'side', 'shuttlecock', 'feather', 'inform', 'match', 'plastic', 'projectil', 'fli', 'differ', 'ball', 'use', 'mani', 'sport', 'particular', 'feather', 'creat', 'much', 'higher', 'drag', 'caus', 'shuttlecock', 'deceler', 'rapidli', 'shuttlecock', 'also', 'high', 'top', 'speed', 'compar', 'ball', 'racquet

In [8]:
# Lemmatization
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()

lemma_tokenized_documents = {}
for title, no_stopword_doc in removed_stopwords_tokenized_documents.items():
    lemma_tokenized_documents[title] = [lemma.lemmatize(token) for token in no_stopword_doc]

print(lemma_tokenized_documents["badminton"])

['badminton', 'racquet', 'sport', 'played', 'using', 'racquet', 'hit', 'shuttlecock', 'across', 'net', 'although', 'may', 'played', 'larger', 'team', 'common', 'form', 'game', 'single', 'one', 'player', 'per', 'side', 'double', 'two', 'player', 'per', 'side', 'badminton', 'often', 'played', 'casual', 'outdoor', 'activity', 'yard', 'beach', 'professional', 'game', 'played', 'rectangular', 'indoor', 'court', 'point', 'scored', 'striking', 'shuttlecock', 'racquet', 'landing', 'within', 'team', 'half', 'court', 'within', 'set', 'boundary', 'side', 'may', 'strike', 'shuttlecock', 'pass', 'net', 'play', 'end', 'shuttlecock', 'struck', 'floor', 'ground', 'fault', 'called', 'umpire', 'service', 'judge', 'absence', 'opposing', 'side', 'shuttlecock', 'feathered', 'informal', 'match', 'plastic', 'projectile', 'fly', 'differently', 'ball', 'used', 'many', 'sport', 'particular', 'feather', 'create', 'much', 'higher', 'drag', 'causing', 'shuttlecock', 'decelerate', 'rapidly', 'shuttlecock', 'also', 

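Comparing the two outputs, the Porter stemmer is the more aggressive of the two: it strips suffixes by rule, so many of its outputs are not dictionary words. For example, "doubles" becomes "doubl" and "activity" becomes "activ". The lemmatizer returns real words ("double", "activity"), but because `lemmatize()` was called without a part-of-speech tag it treats every token as a noun, so verb forms such as "played" and "using" are left unchanged, whereas Porter reduces them to "play" and "use". The lemmatization cell also took slightly longer to run.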

### Task 3: Build an Inverted Index

- Create a dictionary mapping each term to a list of document IDs.
- Store term positions for phrase search support.

In [9]:
#code/program Task 3
# This inverted index would need to have the term as keys, 
# and another dictionary as values with the document title as keys,
# and a list of word-level indexes as values
inverted_index = {}

for title, tokens in lemma_tokenized_documents.items():
    for position, token in enumerate(tokens):
        if token not in inverted_index:
            inverted_index[token] = {}
        if title not in inverted_index[token]:
            inverted_index[token][title] = []
        inverted_index[token][title].append(position)

inverted_index

{'python': {'python (programming language)': [0, 11, 28, 34, 42, 56, 69]},
 'programming': {'python (programming language)': [1, 15, 22, 32, 66, 75, 78],
  'computer science': [67],
  'data science': [116]},
 'language': {'python (programming language)': [2, 33, 47, 67, 76],
  'badminton': [119],
  'computer science': [68, 141]},
 'design': {'python (programming language)': [3],
  'computer science': [22, 91, 104]},
 'philosophy': {'python (programming language)': [4]},
 'emphasizes': {'python (programming language)': [5]},
 'code': {'python (programming language)': [6], 'data science': [117]},
 'readability': {'python (programming language)': [7]},
 'use': {'python (programming language)': [8, 59],
  'soccer': [14, 92, 96, 113]},
 'significant': {'python (programming language)': [9]},
 'indentation': {'python (programming language)': [10]},
 'dynamically': {'python (programming language)': [12]},
 'support': {'python (programming language)': [13]},
 'multiple': {'python (programming l
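
Storing positions is what makes phrase search possible: a phrase matches a document when its terms occur at consecutive positions. The sketch below illustrates this on a toy corpus with the same index layout as above (term → document → list of positions). `build_positional_index`, `phrase_search`, and the toy documents are hypothetical names introduced for illustration.

```python
def build_positional_index(corpus):
    """Map each term to {doc_id: [positions]}, mirroring the notebook's index layout."""
    index = {}
    for doc_id, tokens in corpus.items():
        for position, token in enumerate(tokens):
            index.setdefault(token, {}).setdefault(doc_id, []).append(position)
    return index

def phrase_search(phrase_tokens, index):
    """Return doc ids in which the tokens appear at consecutive positions."""
    if not phrase_tokens:
        return []
    first, rest = phrase_tokens[0], phrase_tokens[1:]
    matches = []
    for doc_id, positions in index.get(first, {}).items():
        for start in positions:
            # Every following term must sit exactly `offset` positions after `start`
            if all(start + offset in index.get(term, {}).get(doc_id, [])
                   for offset, term in enumerate(rest, start=1)):
                matches.append(doc_id)
                break
    return matches

# Toy corpus for illustration only
toy_corpus = {
    "d1": ["python", "programming", "language"],
    "d2": ["programming", "python", "language"],
}
toy_index = build_positional_index(toy_corpus)
print(phrase_search(["programming", "language"], toy_index))  # ['d1']
```

Because the index is built after stop word removal and lemmatization, positions refer to the filtered token stream, so a query phrase would need to pass through the same preprocessing before being matched.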

## 📈 Part B: Evaluation of IR Systems
### Task 4: Implement Precision, Recall, and F1

- Define a set of queries and manually label relevant documents.
- Retrieve documents using keyword matching or TF-IDF ranking.
- Compute:
  - Precision  
  - Recall  
  - F1 Score  
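
Task 4 allows TF-IDF ranking as an alternative to keyword matching. The following is a minimal sketch of one common variant (raw term frequency multiplied by `log(N / df)`), not the method used in the cells below. `tfidf_rank` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_rank(query_tokens, corpus):
    """Score each document by the sum of tf * log(N / df) over the query terms."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term
    df = Counter()
    for tokens in corpus.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in corpus.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query_tokens
            if term in df  # terms absent from the corpus contribute nothing
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy corpus for illustration only
toy_corpus = {
    "d1": ["ball", "kick", "goal", "ball"],
    "d2": ["ball", "racquet", "net"],
    "d3": ["code", "language", "code"],
}
ranking = tfidf_rank(["kick", "ball"], toy_corpus)
print([doc_id for doc_id, _ in ranking])  # ['d1', 'd2', 'd3']
```

Unlike a raw keyword count, this weighting gives the rarer term "kick" more influence than "ball", which appears in two of the three toy documents.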

In [10]:
#code/program Task 4
# This dictionary maps 5 chosen queries to their relevant documents
queries_mapped_to_relevant_documents = {"how to start programming":["python (programming language)",
                                                                    "computer science","data science"],
                                        "what is a good sport to play":["badminton","soccer"],
                                        "how to code using computer":["python (programming language)",
                                                                      "computer science","data science"],
                                        "how to kick a ball":["soccer"],
                                        "coding project ideas":["python (programming language)",
                                                                "computer science","data science"],
                                        "best world cup goals":["soccer"],
                                        "sports that require cardio":["soccer","badminton"],
                                        "what language should i learn":["python (programming language)",
                                                                        "computer science"],
                                        "what is a fun game i can play with friends":["soccer","badminton"],
                                        "what do professional daily routines look like":["soccer","badminton",
                                                                                         "data science"]}
query_list = list(queries_mapped_to_relevant_documents.keys())

In [11]:
# Keyword Matching algorithm
def keyword(query, corpus):
    query_words = query.lower().split()
    res = []

    for title, token_list in corpus.items():
        # count the occurrences of each query word (using lemmatized documents for more accurate matching)
        keyword_count = 0
        for word in query_words:
            keyword_count += token_list.count(word)

        if keyword_count > 0:
            res.append(title)
    
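    # NOTE: res holds title strings, so x[1] is the second character of each title;
    # this sort orders by that character, not by keyword_count.
    res.sort(key=lambda x:x[1], reverse=True)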
    return res

In [12]:
# RESULTS OF KEYWORD SEARCH PER QUERY
for query in queries_mapped_to_relevant_documents.keys():
    print(f"{query}: {keyword(query, lemma_tokenized_documents)}")

how to start programming: ['python (programming language)', 'computer science', 'data science']
what is a good sport to play: ['soccer', 'badminton']
how to code using computer: ['python (programming language)', 'computer science', 'badminton', 'data science']
how to kick a ball: ['soccer', 'badminton']
coding project ideas: []
best world cup goals: ['soccer', 'badminton']
sports that require cardio: ['badminton']
what language should i learn: ['python (programming language)', 'computer science', 'badminton']
what is a fun game i can play with friends: ['soccer', 'badminton']
what do professional daily routines look like: ['badminton', 'data science']


In [13]:
#### PRECISION, RECALL AND F1 SCORE

# "how to start programming"
Q1_precision = 3/3
Q1_recall = 3/3
Q1_F1 = 2*(Q1_precision*Q1_recall) / (Q1_precision+Q1_recall)

# "what is a good sport to play"
Q2_precision = 2/2
Q2_recall = 2/2
Q2_F1 = 2*(Q2_precision*Q2_recall) / (Q2_precision+Q2_recall)

# "how to code using computer"
Q3_precision = 3/4
Q3_recall = 3/3
Q3_F1 = 2*(Q3_precision*Q3_recall) / (Q3_precision+Q3_recall)

# "how to kick a ball"
Q4_precision = 1/2
Q4_recall = 1/1
Q4_F1 = 2*(Q4_precision*Q4_recall) / (Q4_precision+Q4_recall)

# "coding project ideas"
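# Nothing was retrieved, so precision (0/0) is undefined and is reported as 0 by convention
Q5_precision = 0/1
Q5_recall = 0/3
# The +1 in the denominator only avoids division by zero; the numerator is 0, so F1 is 0
Q5_F1 = 2*(Q5_precision*Q5_recall) / (Q5_precision+Q5_recall+1)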

# best world cup goals
Q6_precision = 1/2
Q6_recall = 1/1
Q6_F1 = 2*(Q6_precision*Q6_recall) / (Q6_precision+Q6_recall)

# sports that require cardio
Q7_precision = 1/1
Q7_recall = 1/2
Q7_F1 = 2*(Q7_precision*Q7_recall) / (Q7_precision+Q7_recall)

# what language should i learn
Q8_precision = 2/3
Q8_recall = 2/2
Q8_F1 = 2*(Q8_precision*Q8_recall) / (Q8_precision+Q8_recall)

# what is a fun game i can play with friends
Q9_precision = 2/2
Q9_recall = 2/2
Q9_F1 = 2*(Q9_precision*Q9_recall) / (Q9_precision+Q9_recall)

# what do professional daily routines look like
Q10_precision = 2/2
Q10_recall = 2/3
Q10_F1 = 2*(Q10_precision*Q10_recall) / (Q10_precision+Q10_recall)

# Average metrics
avg_precision = (Q1_precision + Q2_precision + Q3_precision + Q4_precision + Q5_precision + 
                 Q6_precision + Q7_precision + Q8_precision + Q9_precision + Q10_precision) / 10
avg_recall = (Q1_recall + Q2_recall + Q3_recall + Q4_recall + Q5_recall + Q6_recall + Q7_recall + 
              Q8_recall + Q9_recall + Q10_recall) / 10
avg_F1 = (Q1_F1 + Q2_F1 + Q3_F1 + Q4_F1 + Q5_F1 + Q6_F1 + Q7_F1 + Q8_F1 + Q9_F1 + Q10_F1) / 10

print(f"Q1-{query_list[0]}: precision = {Q1_precision:.3f}, recall = {Q1_recall:.3f}, f1 = {Q1_F1:.3f}")
print(f"Q2-{query_list[1]}: precision = {Q2_precision:.3f}, recall = {Q2_recall:.3f}, f1 = {Q2_F1:.3f}")
print(f"Q3-{query_list[2]}: precision = {Q3_precision:.3f}, recall = {Q3_recall:.3f}, f1 = {Q3_F1:.3f}")
print(f"Q4-{query_list[3]}: precision = {Q4_precision:.3f}, recall = {Q4_recall:.3f}, f1 = {Q4_F1:.3f}")
print(f"Q5-{query_list[4]}: precision = {Q5_precision:.3f}, recall = {Q5_recall:.3f}, f1 = {Q5_F1:.3f}")
print(f"Q6-{query_list[5]}: precision = {Q6_precision:.3f}, recall = {Q6_recall:.3f}, f1 = {Q6_F1:.3f}")
print(f"Q7-{query_list[6]}: precision = {Q7_precision:.3f}, recall = {Q7_recall:.3f}, f1 = {Q7_F1:.3f}")
print(f"Q8-{query_list[7]}: precision = {Q8_precision:.3f}, recall = {Q8_recall:.3f}, f1 = {Q8_F1:.3f}")
print(f"Q9-{query_list[8]}: precision = {Q9_precision:.3f}, recall = {Q9_recall:.3f}, f1 = {Q9_F1:.3f}")
print(f"Q10-{query_list[9]}: precision = {Q10_precision:.3f}, recall = {Q10_recall:.3f}, f1 = {Q10_F1:.3f}")
print(f"AVERAGES: precision = {avg_precision:.3f}, recall = {avg_recall:.3f}, f1 = {avg_F1:.3f}")

Q1-how to start programming: precision = 1.000, recall = 1.000, f1 = 1.000
Q2-what is a good sport to play: precision = 1.000, recall = 1.000, f1 = 1.000
Q3-how to code using computer: precision = 0.750, recall = 1.000, f1 = 0.857
Q4-how to kick a ball: precision = 0.500, recall = 1.000, f1 = 0.667
Q5-coding project ideas: precision = 0.000, recall = 0.000, f1 = 0.000
Q6-best world cup goals: precision = 0.500, recall = 1.000, f1 = 0.667
Q7-sports that require cardio: precision = 1.000, recall = 0.500, f1 = 0.667
Q8-what language should i learn: precision = 0.667, recall = 1.000, f1 = 0.800
Q9-what is a fun game i can play with friends: precision = 1.000, recall = 1.000, f1 = 1.000
Q10-what do professional daily routines look like: precision = 1.000, recall = 0.667, f1 = 0.800
AVERAGES: precision = 0.742, recall = 0.817, f1 = 0.746
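
The fractions above were entered by hand from the keyword search results. As a sketch of how they could instead be computed from the retrieved and relevant lists, the helper below (a hypothetical `precision_recall_f1`) uses set intersection and returns 0 whenever a denominator would be zero, which is a convention assumed here rather than something the lab prescribes. The example reuses the Q3 lists shown earlier.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1; zero denominators yield 0.0 by convention."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Q3 "how to code using computer": retrieved list from the keyword search output,
# relevant list from the manual labels
retrieved = ["python (programming language)", "computer science", "badminton", "data science"]
relevant = ["python (programming language)", "computer science", "data science"]

p, r, f1 = precision_recall_f1(retrieved, relevant)
print(f"precision = {p:.3f}, recall = {r:.3f}, f1 = {f1:.3f}")
# precision = 0.750, recall = 1.000, f1 = 0.857
```

With this helper an empty result list such as Q5 needs no special-casing, since all three values come out as 0.0.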


### Task 5: Precision@k and MAP

- Rank documents using cosine similarity.
- Evaluate Precision@5 and MAP across multiple queries.
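
For reference, the metrics in this task can be written as follows. The notation is an assumption introduced here: $R_q$ is the set of documents labelled relevant for query $q$, $n$ is the length of the ranked list, and $\mathrm{rel}_q(k)$ is 1 if the document at rank $k$ is relevant and 0 otherwise.

$$
P@k = \frac{1}{k}\sum_{i=1}^{k} \mathrm{rel}_q(i),
\qquad
AP(q) = \frac{1}{|R_q|}\sum_{k=1}^{n} P@k \cdot \mathrm{rel}_q(k),
\qquad
MAP = \frac{1}{|Q|}\sum_{q \in Q} AP(q)
$$

As a worked example, a ranking whose relevance pattern is (1, 0, 1) with $|R_q| = 2$ gives $AP = \frac{1}{2}\left(\frac{1}{1} + \frac{2}{3}\right) \approx 0.833$. Note that when the ranked list contains every document of a five-document corpus, $P@5$ reduces to $|R_q| / 5$ regardless of the ordering, so a smaller $k$ is more informative on a corpus of this size.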

In [14]:
#code/program Task 5
import math
# COSINE SIMILARITY
def cosine_similarity(query, corpus):
    # Find Term frequency for each document and term, including query
    freq_dict  = {}
    document_titles = list(corpus.keys())
    
    for idx, title in enumerate(document_titles):
        list_words = corpus[title]
        for word in list_words:
            if word not in freq_dict:
                freq_dict[word] = [0] * len(corpus)
            freq_dict[word][idx] += 1

    # Cosine Similarity
    # prepare document and query vectors
    vocabulary = freq_dict.keys()

    document_vectors = []
    for title in document_titles:
        words = corpus[title]
        vector = [words.count(term) for term in vocabulary]
        document_vectors.append(vector)

    query_vector = [query.split().count(term) for term in vocabulary]
    
    # Calculate similarity between the query and each document
    cosine_similarities = []
    for idx,doc_vec in enumerate(document_vectors):
        dot_prod = sum(x*y for x,y in zip(doc_vec, query_vector))
        doc_norm = math.sqrt(sum(x*x for x in doc_vec))
        query_norm = math.sqrt(sum(x*x for x in query_vector))
        # Division by 0 error handling
        if doc_norm == 0 or query_norm == 0:
            cs = 0
        else:
            cs = round(dot_prod / (doc_norm*query_norm),3)
        cosine_similarities.append((document_titles[idx],cs))

    cosine_similarities.sort(key=lambda x: x[1], reverse=True)

    return cosine_similarities

In [15]:
# Cosine Similarity on first query
cosine_similarity("how to start programming", lemma_tokenized_documents)

[('python (programming language)', 0.499),
 ('computer science', 0.048),
 ('data science', 0.045),
 ('badminton', 0.0),
 ('soccer', 0.0)]

In [16]:
# Precision@5 for all queries
def precision_at_5(relevant_docs_for_query, search_results, k=5):
    topk = search_results[:k]

    topk_docs = [doc for doc, score in topk]

    rel_topk = sum(1 for doc in topk_docs if doc in relevant_docs_for_query)

    precision_k = rel_topk / k

    return precision_k

for query in queries_mapped_to_relevant_documents.keys():
    relevant_docs_for_query = queries_mapped_to_relevant_documents[query]
    print(f"{query} PRECISION@5: {precision_at_5(relevant_docs_for_query,
                                  cosine_similarity(query, lemma_tokenized_documents))}")

how to start programming PRECISION@5: 0.6
what is a good sport to play PRECISION@5: 0.4
how to code using computer PRECISION@5: 0.6
how to kick a ball PRECISION@5: 0.2
coding project ideas PRECISION@5: 0.6
best world cup goals PRECISION@5: 0.2
sports that require cardio PRECISION@5: 0.4
what language should i learn PRECISION@5: 0.4
what is a fun game i can play with friends PRECISION@5: 0.4
what do professional daily routines look like PRECISION@5: 0.6


In [17]:
# AVERAGE PRECISION
def avg_precision(relevant_docs_for_query, search_results):
    if len(relevant_docs_for_query) == 0:
        return 0

    precision_list = []
    relevant_doc_count = 0

    for idx, (doc, score) in enumerate(search_results):
        if doc in relevant_docs_for_query:
            relevant_doc_count += 1
            precision_at_k = relevant_doc_count / (idx+1)
            precision_list.append(precision_at_k)

    if len(precision_list) == 0:
        return 0

    avg_precision = sum(precision_list) / len(relevant_docs_for_query)
    return avg_precision

# MAP
APs = []
for query in queries_mapped_to_relevant_documents.keys():
    relevant_docs_for_query = queries_mapped_to_relevant_documents[query]
    search_results = cosine_similarity(query, lemma_tokenized_documents)
    average_precision = avg_precision(relevant_docs_for_query, search_results)
    APs.append(average_precision)

MAP = sum(APs) / len(APs)
print(f"MAP SCORE = {MAP}")

MAP SCORE = 0.945


## 🔁 Part C: Relevance Feedback and Query Expansion
### Task 6: Pseudo Relevance Feedback

- Assume top-k retrieved documents are relevant.
- Extract frequent terms and expand the query.
- Re-run retrieval and compare performance metrics.

In [18]:
#code/program Task 6
# Using three queries for PRF
queries_used_3 = {"how to kick a ball":["soccer"],
                  "how to code using computer":["python (programming language)",
                                                "computer science","data science"],
                  "what do professional daily routines look like":["soccer","badminton","data science"]}
query_3_list = list(queries_used_3.keys())

for query in queries_used_3.keys():
    relevant_docs = queries_used_3[query]
    # Initial values
    init_res = cosine_similarity(query, lemma_tokenized_documents)

    # Using top 3 docs as relevant
    top3 = init_res[:3]
    terms = []
    for doc, _ in top3:
        terms.extend(lemma_tokenized_documents[doc])

    # get top 5 new terms
    query_tokens = set(query.split())
    term_frequency = {}
    for term in terms:
        if term not in query_tokens:
            term_frequency[term] = term_frequency.get(term, 0) + 1

    # sort by frequency
    sorted_terms = sorted(term_frequency.items(), key=lambda x:x[1], reverse=True)
    new_terms = [term for term, freq in sorted_terms[:5]]
    expanded_query = query + " " +" ".join(new_terms)
    print(expanded_query)

how to kick a ball game sport team shuttlecock played
how to code using computer science data programming language python
what do professional daily routines look like science data sport shuttlecock programming
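
The expansion loop above can be packaged as a reusable function. The sketch below is one possible refactoring using `collections.Counter`; `expand_query` is a hypothetical name, and unlike the cell above it takes a plain list of ranked document ids rather than `(doc, score)` tuples. The toy corpus is for illustration only.

```python
from collections import Counter

def expand_query(query, ranked_docs, corpus, top_docs=3, top_terms=5):
    """Pseudo relevance feedback: append the most frequent new terms
    from the top-ranked documents to the original query."""
    query_tokens = set(query.split())
    counts = Counter(
        term
        for doc_id in ranked_docs[:top_docs]   # assume these documents are relevant
        for term in corpus[doc_id]
        if term not in query_tokens            # only count terms not already in the query
    )
    new_terms = [term for term, _ in counts.most_common(top_terms)]
    return " ".join([query] + new_terms)

# Toy corpus for illustration only
toy_corpus = {
    "d1": ["ball", "goal", "team", "goal"],
    "d2": ["ball", "team", "net", "goal"],
    "d3": ["code", "language"],
}
expanded = expand_query("kick ball", ["d1", "d2", "d3"], toy_corpus, top_docs=2, top_terms=2)
print(expanded)  # kick ball goal team
```

Because the feedback documents are assumed rather than known to be relevant, terms from a wrongly ranked document can enter the query, as "shuttlecock" does for the soccer query above.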


In [19]:
def recall_at_3(relevant_docs_for_query, search_results, k=3):
    topk = search_results[:k]
    topk_docs = [doc for doc, score in topk]
    rel_topk = sum(1 for doc in topk_docs if doc in relevant_docs_for_query)
    if len(relevant_docs_for_query) == 0:
        recall_k = 0
    else:
        recall_k = rel_topk / len(relevant_docs_for_query)
    return recall_k

In [20]:
queries_used_ref_3 = {"how to kick a ball game sport team shuttlecock played":["soccer"],
        "how to code using computer science data programming language python":["python (programming language)",
                                                                            "computer science","data science"],
        "what do professional daily routines look like science data sport shuttlecock programming":["soccer",
                                                                                "badminton","data science"]}
# USING PRECISION@3 BY MAKING K=3
query_3_list = list(queries_used_3.keys())
query_3_ref_list = list(queries_used_ref_3.keys())

print(f"{query_3_list[0]} P@3:{precision_at_5(queries_used_3[query_3_list[0]], 
                                    cosine_similarity(query_3_list[0], lemma_tokenized_documents),k=3)}")
print(f"{query_3_ref_list[0]} P@3:{precision_at_5(queries_used_ref_3[query_3_ref_list[0]], 
                                    cosine_similarity(query_3_ref_list[0], lemma_tokenized_documents),k=3)}")
print()
print(f"{query_3_list[1]} P@3:{precision_at_5(queries_used_3[query_3_list[1]], 
                                    cosine_similarity(query_3_list[1], lemma_tokenized_documents),k=3)}")
print(f"{query_3_ref_list[1]} P@3:{precision_at_5(queries_used_ref_3[query_3_ref_list[1]], 
                                    cosine_similarity(query_3_ref_list[1], lemma_tokenized_documents),k=3)}")
print()
print(f"{query_3_list[2]} P@3:{precision_at_5(queries_used_3[query_3_list[2]], 
                                    cosine_similarity(query_3_list[2], lemma_tokenized_documents),k=3)}")
print(f"{query_3_ref_list[2]} P@3:{precision_at_5(queries_used_ref_3[query_3_ref_list[2]], 
                                    cosine_similarity(query_3_ref_list[2], lemma_tokenized_documents),k=3)}")

how to kick a ball P@3:0.3333333333333333
how to kick a ball game sport team shuttlecock played P@3:0.3333333333333333

how to code using computer P@3:1.0
how to code using computer science data programming language python P@3:1.0

what do professional daily routines look like P@3:0.6666666666666666
what do professional daily routines look like science data sport shuttlecock programming P@3:0.6666666666666666


In [21]:
# RECALL@3
print(f"{query_3_list[0]} R@3:{recall_at_3(queries_used_3[query_3_list[0]], 
                                    cosine_similarity(query_3_list[0], lemma_tokenized_documents),k=3)}")
print(f"{query_3_ref_list[0]} R@3:{recall_at_3(queries_used_ref_3[query_3_ref_list[0]], 
                                    cosine_similarity(query_3_ref_list[0], lemma_tokenized_documents),k=3)}")
print()
print(f"{query_3_list[1]} R@3:{recall_at_3(queries_used_3[query_3_list[1]], 
                                    cosine_similarity(query_3_list[1], lemma_tokenized_documents),k=3)}")
print(f"{query_3_ref_list[1]} R@3:{recall_at_3(queries_used_ref_3[query_3_ref_list[1]], 
                                    cosine_similarity(query_3_ref_list[1], lemma_tokenized_documents),k=3)}")
print()
print(f"{query_3_list[2]} R@3:{recall_at_3(queries_used_3[query_3_list[2]], 
                                    cosine_similarity(query_3_list[2], lemma_tokenized_documents),k=3)}")
print(f"{query_3_ref_list[2]} R@3:{recall_at_3(queries_used_ref_3[query_3_ref_list[2]], 
                                    cosine_similarity(query_3_ref_list[2], lemma_tokenized_documents),k=3)}")

how to kick a ball R@3:1.0
how to kick a ball game sport team shuttlecock played R@3:1.0

how to code using computer R@3:1.0
how to code using computer science data programming language python R@3:1.0

what do professional daily routines look like R@3:0.6666666666666666
what do professional daily routines look like science data sport shuttlecock programming R@3:0.6666666666666666


### Task 7: Query Expansion Techniques

- Apply synonym expansion using WordNet.
- Use local analysis to extract terms from top-ranked documents.
- Compare original vs. expanded query results.

In [22]:
#code/program Task 7
# SYNONYM EXPANSION USING WORDNET
from nltk.corpus import wordnet

# We will once again use these three queries
queries_used_3 = {
    "how to kick a ball":["soccer"],
    "how to code using computer":["python (programming language)","computer science","data science"],
    "what do professional daily routines look like":["soccer","badminton","data science"]
}

def query_synonym_expansion(query):
    tokens = query.lower().split()
    # Remove stop words from the whitespace-split tokens
    tokens = [token for token in tokens if token not in stopwords]
    expanded_query = []

    for token in tokens:
        expanded_query.append(token)

        synonyms = []
        for set_of_synonyms in wordnet.synsets(token):
            for lemma in set_of_synonyms.lemmas():
                synonym_words = lemma.name().lower()
                if synonym_words != token:
                    synonyms.append(synonym_words)

        expanded_query.extend(list(set(synonyms)))
    return " ".join(expanded_query)

In [23]:
synonym_expansion = dict()
for query in queries_used_3.keys():
    synonyms = query_synonym_expansion(query)
    synonym_expansion[synonyms] = queries_used_3[query]
synonym_expansion

{'kick kicking flush quetch plain thrill bitch kick_back rush charge bang boot recoil gripe sound_off kvetch give_up complain beef squawk ball musket_ball lucille_ball orb formal orchis clod bollock glob chunk ballock egg lump nut testicle globe clump testis': ['soccer'],
 'code encipher codification encrypt cypher inscribe write_in_code computer_code cipher using apply victimization victimisation expend practice utilise habituate use employ exploitation utilize computer computing_machine estimator calculator computing_device reckoner electronic_computer figurer information_processing_system data_processor': ['python (programming language)',
  'computer science',
  'data science'],
 'professional pro professional_person master daily day-to-day day-by-day casual everyday day_by_day day-after-day routines modus_operandi act procedure subprogram function subroutine turn routine bit number look take_care see wait spirit aspect depend face appear feeling reckon flavour calculate bet tone sm

In [24]:
# LOCAL ANALYSIS
local_analysis = dict()
for query in queries_used_3.keys():
    relevant_docs = queries_used_3[query]
    # Initial values
    init_res = cosine_similarity(query, lemma_tokenized_documents)

    # Using top 3 docs as relevant
    top3 = init_res[:3]
    terms = []
    for doc, _ in top3:
        terms.extend(lemma_tokenized_documents[doc])

    # get top 5 new terms
    query_tokens = set(query.split())
    term_frequency = {}
    for term in terms:
        if term not in query_tokens:
            term_frequency[term] = term_frequency.get(term, 0) + 1

    # sort by frequency
    sorted_terms = sorted(term_frequency.items(), key=lambda x:x[1], reverse=True)
    new_terms = [term for term, freq in sorted_terms[:5]]
    expanded_query = query + " " +" ".join(new_terms)
    local_analysis[expanded_query] = queries_used_3[query]
local_analysis

{'how to kick a ball game sport team shuttlecock played': ['soccer'],
 'how to code using computer science data programming language python': ['python (programming language)',
  'computer science',
  'data science'],
 'what do professional daily routines look like science data sport shuttlecock programming': ['soccer',
  'badminton',
  'data science']}

In [25]:
# PRECISION@3: precision_at_5 accepts k, so we pass k=3
queries_syn = list(synonym_expansion.keys())
query_local = list(local_analysis.keys())

# Report the synonym-expanded and locally expanded version of each query side by side
for i, (syn_query, local_query) in enumerate(zip(queries_syn, query_local), start=1):
    if i > 1:
        print()
    print(f"QUERY {i} RESULTS:")
    syn_ranking = cosine_similarity(syn_query, lemma_tokenized_documents)
    print(f"{syn_query} PRECISION@3: {precision_at_5(synonym_expansion[syn_query], syn_ranking, k=3)}")
    print()
    local_ranking = cosine_similarity(local_query, lemma_tokenized_documents)
    print(f"{local_query} PRECISION@3: {precision_at_5(local_analysis[local_query], local_ranking, k=3)}")

QUERY 1 RESULTS:
kick kicking flush quetch plain thrill bitch kick_back rush charge bang boot recoil gripe sound_off kvetch give_up complain beef squawk ball musket_ball lucille_ball orb formal orchis clod bollock glob chunk ballock egg lump nut testicle globe clump testis PRECISION@3: 0.3333333333333333

how to kick a ball game sport team shuttlecock played PRECISION@3: 0.3333333333333333

QUERY 2 RESULTS:
code encipher codification encrypt cypher inscribe write_in_code computer_code cipher using apply victimization victimisation expend practice utilise habituate use employ exploitation utilize computer computing_machine estimator calculator computing_device reckoner electronic_computer figurer information_processing_system data_processor PRECISION@3: 0.6666666666666666

how to code using computer science data programming language python PRECISION@3: 1.0

QUERY 3 RESULTS:
professional pro professional_person master daily day-to-day day-by-day casual everyday day_by_day day-after-day r

In [26]:
# RECALL@3
# Report the synonym-expanded and locally expanded version of each query side by side
for i, (syn_query, local_query) in enumerate(zip(queries_syn, query_local), start=1):
    if i > 1:
        print()
    print(f"QUERY {i} RESULTS:")
    syn_ranking = cosine_similarity(syn_query, lemma_tokenized_documents)
    print(f"{syn_query} RECALL@3: {recall_at_3(synonym_expansion[syn_query], syn_ranking, k=3)}")
    print()
    local_ranking = cosine_similarity(local_query, lemma_tokenized_documents)
    print(f"{local_query} RECALL@3: {recall_at_3(local_analysis[local_query], local_ranking, k=3)}")

QUERY 1 RESULTS:
kick kicking flush quetch plain thrill bitch kick_back rush charge bang boot recoil gripe sound_off kvetch give_up complain beef squawk ball musket_ball lucille_ball orb formal orchis clod bollock glob chunk ballock egg lump nut testicle globe clump testis RECALL@3: 1.0

how to kick a ball game sport team shuttlecock played RECALL@3: 1.0

QUERY 2 RESULTS:
code encipher codification encrypt cypher inscribe write_in_code computer_code cipher using apply victimization victimisation expend practice utilise habituate use employ exploitation utilize computer computing_machine estimator calculator computing_device reckoner electronic_computer figurer information_processing_system data_processor RECALL@3: 0.6666666666666666

how to code using computer science data programming language python RECALL@3: 1.0

QUERY 3 RESULTS:
professional pro professional_person master daily day-to-day day-by-day casual everyday day_by_day day-after-day routines modus_operandi act procedure subpr

## 🧠 Reflection Questions

<font color='blue'>Please answer the following questions in markdown cells following each question:</font>

### Q1: How did normalization affect retrieval performance?

Normalization had a strong effect on retrieval performance. First, lowercasing merged tokens that differed only in case (for example, "Badminton" and "badminton" were no longer separate tokens), which reduces the index size. Second, the tokenized lists contained punctuation tokens such as "." and ",". These carry no information for document retrieval, so removing them saves space and improves retrieval performance. Finally, we handled Unicode variants: if they are not handled properly, a query term and a document term that look identical can fail to match, and relevant documents are missed. Keeping a single, consistent representation of each character keeps retrieval consistent.
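
To make the effect on index size concrete, here is a minimal sketch of the three normalization steps using only the standard library. The function name `normalize_tokens` and the sample tokens are assumptions made for illustration; they are not the notebook's Task 1 code.

```python
import string
import unicodedata

def normalize_tokens(tokens):
    """Unicode-normalize and lowercase tokens, dropping punctuation-only ones."""
    normalized = []
    for token in tokens:
        # NFKC maps equivalent Unicode spellings to one canonical form
        token = unicodedata.normalize("NFKC", token).lower()
        # Skip tokens made up entirely of punctuation, such as "." and ","
        if all(ch in string.punctuation for ch in token):
            continue
        normalized.append(token)
    return normalized

# Hypothetical sample: two case variants, two Unicode spellings of the same word
tokens = ["Badminton", "badminton", ",", "Cafe\u0301", "caf\u00e9", "."]
normalized = normalize_tokens(tokens)
vocabulary = sorted(set(normalized))
print(len(set(tokens)), "distinct raw tokens ->", len(vocabulary), "index terms")
```

In this toy input, six distinct raw tokens collapse to two index terms, which is the same merging effect described above.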

### Q2: What are the trade-offs between stemming and lemmatization?



When stemming and lemmatizing our documents, we noticed that stemming is less precise than lemmatization. For example, "doubles" was stemmed to "doubl", which is not a real word. Many words ending in "e", such as "double", had the "e" cut off by the stemmer, which is not ideal. The lemmatization algorithm was more accurate, but the trade-off was that it ran more slowly. This makes sense because it is POS-aware. Since our corpus was small, this was not a big issue, but it may become one when a large corpus is used for search and retrieval.
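
The trade-off can be illustrated with a deliberately simplified sketch. `toy_stem` and `toy_lemmatize` are hypothetical stand-ins written for this example: the first is a crude suffix stripper (not the Porter algorithm), and the second is a tiny lookup table (not WordNet). They only show why rule-based stripping is cheap but can produce non-words, while dictionary-based lemmatization returns real words at the cost of needing a lexicon.

```python
def toy_stem(word):
    """Crude suffix stripping: no dictionary needed, but may return non-words."""
    for suffix in ("ing", "es", "s", "e"):
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-lexicon; a real lemmatizer consults a full dictionary plus POS tags
TOY_LEMMAS = {"doubles": "double", "playing": "play", "played": "play"}

def toy_lemmatize(word):
    """Dictionary lookup: returns a real word, but only for words the lexicon knows."""
    return TOY_LEMMAS.get(word, word)

for word in ("doubles", "double", "playing"):
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

Both "doubles" and "double" end up as "doubl" under the toy stemmer, so they still match each other in the index even though the stem itself is not a word.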

### Q3: How did relevance feedback improve recall?



Normally, recall should improve after relevance feedback. With our small corpus, however, adding terms to the query may not change anything: the original queries were probably already ranking the relevant documents well by cosine similarity, so there was little left to gain. On a larger corpus, we would expect relevance feedback to improve both recall and precision.
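
As a rough illustration of how feedback can raise recall, the sketch below applies a Rocchio-style update to sparse term-weight dictionaries. Rocchio is one common formulation and is an assumption here; the feedback cell earlier in this notebook may work differently. The function name, the weights `alpha`, `beta`, `gamma`, and the sample vectors are all chosen for illustration.

```python
from collections import defaultdict

def rocchio(query_vec, relevant_vecs, nonrelevant_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones."""
    updated = defaultdict(float)
    for term, weight in query_vec.items():
        updated[term] += alpha * weight
    # Add the relevant centroid and subtract the non-relevant centroid
    for vecs, coeff in ((relevant_vecs, beta), (nonrelevant_vecs, -gamma)):
        for vec in vecs:
            for term, weight in vec.items():
                updated[term] += coeff * weight / len(vecs)
    # Terms whose weight is not positive are dropped from the new query
    return {term: weight for term, weight in updated.items() if weight > 0}

# Hypothetical term weights for a "kick ball" query
new_query = rocchio(
    {"kick": 1.0, "ball": 1.0},
    relevant_vecs=[{"ball": 1.0, "team": 1.0, "goal": 1.0}],
    nonrelevant_vecs=[{"shuttlecock": 1.0, "racquet": 1.0}],
)
print(new_query)
```

The updated query gains "team" and "goal", so it can now match relevant documents that never mention "kick" or "ball". With only five documents and k=3, there are few such unseen documents left to find, which is consistent with the small change observed here.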

### Q4: Which expansion technique yielded the best precision?

For our scenario and documents, local analysis gave the best results. On query 2, precision@3 and recall@3 both reached 1.0 with local analysis, compared with 0.67 for synonym expansion, and it also did better than our base query. This is most likely because WordNet synonyms are not tailored to our small corpus, so few of the added synonyms actually occur in the documents. Local analysis works well in this case because it assumes the top-k documents are relevant and takes its expansion terms from the most frequent terms in those documents. Overall, both expansion techniques are solid, but in this case local analysis is better.
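
For reading the numbers above, a short worked example of precision@k and recall@k may help. The helper names `precision_at_k` and `recall_at_k` are hypothetical (the notebook uses its own `precision_at_5` and `recall_at_3`), the ranking is assumed to be a list of `(document, score)` pairs, and the scores are made up for illustration.

```python
def precision_at_k(relevant, ranked, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = [doc for doc, _score in ranked[:k]]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k

def recall_at_k(relevant, ranked, k):
    """Fraction of all relevant documents that appear in the top k."""
    top_k = [doc for doc, _score in ranked[:k]]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

# Query 1 has a single relevant document; the scores below are invented
relevant = {"soccer"}
ranked = [("soccer", 0.42), ("badminton", 0.31), ("data science", 0.05)]

p3 = precision_at_k(relevant, ranked, k=3)
r3 = recall_at_k(relevant, ranked, k=3)
print(f"PRECISION@3: {p3:.4f}  RECALL@3: {r3:.1f}")
```

With only one relevant document, precision@3 cannot exceed 1/3 no matter how the query is expanded, so query 1 cannot separate the two techniques. Query 2, which has three relevant documents, is the more informative comparison.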

## 🧮 Evaluation Criteria

| Component                    | Points |
|-----------------------------|--------|
| Preprocessing implementation | 20     |
| Inverted index construction  | 15     |
| Evaluation metrics           | 20     |
| Relevance feedback & QE      | 25     |
| Reflection answers           | 20     |
| **Total**                    | **100** |