# W6 Lab Exercise
This is the lab exercise for MIS590: Information Retrieval. <br>
In this lab, you will gain the following experience:<br>
- Understand Vector Space Models (VSMs) for Information Retrieval.
- Develop Practical Skills in Vector-Based Document Representation, Including TF-IDF, Word2Vec, and BERT.
- Compare the Effectiveness of Different Term Weighting Schemes.
- Enhance Analytical Thinking in Evaluating IR Models.
<br>

**Note:** When you see a pencil icon ✏️ in this notebook, it's time for you to code!

# 1. Preliminaries

## 1.1 Install and Import Libraries

In [46]:
# Install the necessary packages
!pip install nltk
!pip install torch
!pip install numpy
!pip install gensim
# `string` ships with the Python standard library, so it is not installed via pip
!pip install transformers
!pip install scikit-learn




In [47]:
import math
import string
from collections import defaultdict

## 1.2 Input: Query & Document Collections (Corpus)

In [48]:
query = "sleep deprivation"

corpus = [
    "Sleepless nights in the lab have become my new normal. I tried to fix the experiment setup, but the apparatus seems to have a mind of its own. My advisor says results are just around the corner, but the corner keeps moving. Coffee is my only true companion these days.",
    "I thought grad school would be intellectually stimulating, but it's mostly paperwork and waiting for emails. The departmental printer jammed again, and now I'm late for a meeting. The cafeteria ran out of the good snacks, so I'm surviving on vending machine chips. Sleep has become a luxury I can no longer afford.",
    "Writing the dissertation feels like climbing an endless mountain. Every time I finish a chapter, my supervisor suggests new revisions. The impostor syndrome is real, and I wonder if they made a mistake accepting me. Maybe I should have gone to clown college instead. I am utterly deprived of any semblance of a normal life.",
    "My research data got corrupted, and now I have to start over. The lab mouse escaped, and we spent hours trying to find it. The grant proposal deadline is tomorrow, and the online submission portal is down. At least my pet cactus hasn't died yet.",
    "The group meeting turned into a three-hour debate over font choices for the presentation. I'm pretty sure my colleague is stealing my lunch from the fridge. The photocopier is out to get me; it never works when I'm in a hurry. Is there a PhD in napping? Because I'd ace that.",
    "I haven't seen the sun in days due to endless coding sessions. The simulation keeps crashing, and Stack Overflow doesn't have the answers. My roommate thinks I'm a ghost haunting the apartment. Instant noodles have become my primary food group.",
    "Attending conferences sounded fun until I realized they involve a lot of awkward networking. I accidentally spilled coffee on a famous professor's shoes. My poster fell down twice during the session. Next time, I'll just send a cardboard cutout of myself.",
    "The university gym membership was supposed to keep me healthy, but I've only used it once. I tried to attend a yoga class after staying up late for a deadline, but I fell asleep during the meditation. Maybe instead of the gym, my bed is more essential for keeping me healthy.",
    "My teaching assistantship involves grading endless stacks of exams. Students keep emailing me for extensions with creative excuses. One claimed their dog sleeps on the laptop so they cannot use it for the exam. I was deprived of excuses for not completing my dissertation draft, and I might have got some good ones.",
    "Group projects are the worst when you're the only one doing the work. My team members are as elusive as Bigfoot. The project is due next week, and I haven't heard from them. Perhaps I should just write a paper on the sociological implications of group work avoidance."
]

# Binary labels for the documents' relevancy to the query
# Relevant ones: 1, 2, 5, 6, 8
corpus_relevancy_label = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

In [49]:
print(f"Query: {query}\n")
for idx, doc in enumerate(corpus):
    print(f"Document {idx+1}:\n{doc}\n")

Query: sleep deprivation

Document 1:
Sleepless nights in the lab have become my new normal. I tried to fix the experiment setup, but the apparatus seems to have a mind of its own. My advisor says results are just around the corner, but the corner keeps moving. Coffee is my only true companion these days.

Document 2:
I thought grad school would be intellectually stimulating, but it's mostly paperwork and waiting for emails. The departmental printer jammed again, and now I'm late for a meeting. The cafeteria ran out of the good snacks, so I'm surviving on vending machine chips. Sleep has become a luxury I can no longer afford.

Document 3:
Writing the dissertation feels like climbing an endless mountain. Every time I finish a chapter, my supervisor suggests new revisions. The impostor syndrome is real, and I wonder if they made a mistake accepting me. Maybe I should have gone to clown college instead. I am utterly deprived of any semblance of a normal life.

Document 4:
My research dat

# 2. Vector Space Model: TF-IDF

## 2.1 Data Preprocessing

### Steps for textual data preprocessing
1. Tokenization (= word segmentation)
2. Punctuation and non-alphabetic token removal
3. Stopwords removal
4. Lemmatization / stemming

### Import libraries

In [50]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dmatz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dmatz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dmatz\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dmatz\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [51]:
# Initialize stopwords, lemmatizer, and punctuation list
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
punctuation_table = str.maketrans('', '', string.punctuation)

# We will use this sentence as an example to showcase the different steps of data preprocessing
example_sentence = "The graduate student was typing, procrastinating, questioning herself, and finally submitting the dissertation while dreaming about sleep."
print(f"Example Sentence:\n{example_sentence}")

Example Sentence:
The graduate student was typing, procrastinating, questioning herself, and finally submitting the dissertation while dreaming about sleep.


### What is tokenization?

In [52]:
tokens = word_tokenize(example_sentence.lower())
print(tokens)

['the', 'graduate', 'student', 'was', 'typing', ',', 'procrastinating', ',', 'questioning', 'herself', ',', 'and', 'finally', 'submitting', 'the', 'dissertation', 'while', 'dreaming', 'about', 'sleep', '.']


### A quick removal of punctuation and non-alphabetic words

In [53]:
tokens_noPunc = [word.translate(punctuation_table) for word in tokens if word.isalpha()]
print(tokens_noPunc)

['the', 'graduate', 'student', 'was', 'typing', 'procrastinating', 'questioning', 'herself', 'and', 'finally', 'submitting', 'the', 'dissertation', 'while', 'dreaming', 'about', 'sleep']


### What are stopwords?

In [54]:
tokens_noSW = [word for word in tokens_noPunc if word not in stop_words]
print(tokens_noSW)

['graduate', 'student', 'typing', 'procrastinating', 'questioning', 'finally', 'submitting', 'dissertation', 'dreaming', 'sleep']


### What is lemmatization?

In [55]:
print("Original\tLemmatized\n")

# Here we use the tokens after stopword removal (no part-of-speech tags yet)
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens_noSW]
for ori, lem in zip(tokens_noSW, lemmatized_tokens):
    print(f"{ori}\t{lem}")

Original	Lemmatized

graduate	graduate
student	student
typing	typing
procrastinating	procrastinating
questioning	questioning
finally	finally
submitting	submitting
dissertation	dissertation
dreaming	dreaming
sleep	sleep


### Observe the results above and discuss the following:
- What is lemmatization?
- The results above reveal little about lemmatization, because without part-of-speech information the lemmatizer treats every token as a noun. Let's try lemmatization another way.

### How about we give the lemmatizer more information about the tokens?

In [56]:
# Part-of-Speech Tagging
tagged_tokens = pos_tag(tokens)
print(tagged_tokens)

[('the', 'DT'), ('graduate', 'NN'), ('student', 'NN'), ('was', 'VBD'), ('typing', 'VBG'), (',', ','), ('procrastinating', 'VBG'), (',', ','), ('questioning', 'VBG'), ('herself', 'PRP'), (',', ','), ('and', 'CC'), ('finally', 'RB'), ('submitting', 'VBG'), ('the', 'DT'), ('dissertation', 'NN'), ('while', 'IN'), ('dreaming', 'VBG'), ('about', 'RB'), ('sleep', 'NN'), ('.', '.')]


### Then we do the punctuation, non-alphabetic tokens, and stopword removal.

In [57]:
# Remove punctuation and non-alphabetic tokens
tagged_tokens_noPunc = [(word[0].translate(punctuation_table), word[1]) for word in tagged_tokens if word[0].isalpha()]
print(tagged_tokens_noPunc)

# Remove stopwords
tagged_tokens_noSW = [(word[0], word[1]) for word in tagged_tokens_noPunc if word[0] not in stop_words]
print(tagged_tokens_noSW)

[('the', 'DT'), ('graduate', 'NN'), ('student', 'NN'), ('was', 'VBD'), ('typing', 'VBG'), ('procrastinating', 'VBG'), ('questioning', 'VBG'), ('herself', 'PRP'), ('and', 'CC'), ('finally', 'RB'), ('submitting', 'VBG'), ('the', 'DT'), ('dissertation', 'NN'), ('while', 'IN'), ('dreaming', 'VBG'), ('about', 'RB'), ('sleep', 'NN')]
[('graduate', 'NN'), ('student', 'NN'), ('typing', 'VBG'), ('procrastinating', 'VBG'), ('questioning', 'VBG'), ('finally', 'RB'), ('submitting', 'VBG'), ('dissertation', 'NN'), ('dreaming', 'VBG'), ('sleep', 'NN')]


### Take 2: what is lemmatization?

In [58]:
# Convert treebank POS tags to wordnet POS tags so the lemmatizer can read them
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        return wn.NOUN

print("Original\tLemmatized\n")
tagged_tokens_lemmatized = [lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in tagged_tokens_noSW]
for ori, lem in zip(tagged_tokens_noSW, tagged_tokens_lemmatized):
    print(f"{ori[0]}\t{lem}")

Original	Lemmatized

graduate	graduate
student	student
typing	type
procrastinating	procrastinate
questioning	question
finally	finally
submitting	submit
dissertation	dissertation
dreaming	dream
sleep	sleep


### Observe the results above and discuss the following:
- What is lemmatization?

### What is stemming?

In [59]:
print("Original\tStemmed\n")
tokens_stemmed = [stemmer.stem(word) for word in tagged_tokens_lemmatized]
for ori, stem in zip(tagged_tokens_lemmatized, tokens_stemmed):
    print(f"{ori}\t{stem}")

Original	Stemmed

graduate	graduat
student	student
type	type
procrastinate	procrastin
question	question
finally	final
submit	submit
dissertation	dissert
dream	dream
sleep	sleep


### Observe the results above and discuss the following:
- What is stemming?
- Why is stemming helpful in improving TF-IDF performance?
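
As a starting point for the last question: TF-IDF only rewards exact token matches, so variants such as "deprivation" and "deprived" never meet unless they are reduced to a common form. The sketch below is a minimal illustration, assuming the stems that this notebook's pipeline produces for these words. The `stem_of` lookup table and the `shared_terms` helper are hypothetical stand-ins for illustration, not a real stemmer.

```python
# Raw tokens: the query and the tail of Document 3, before any normalization.
query_raw = ["sleep", "deprivation"]
doc3_raw = ["utterly", "deprived", "semblance", "normal", "life"]

# Hypothetical lookup table standing in for lemmatization + PorterStemmer.
# The stems are copied from this notebook's preprocessing output.
stem_of = {
    "sleep": "sleep",
    "deprivation": "depriv",
    "utterly": "utterli",
    "deprived": "depriv",
    "semblance": "semblanc",
    "normal": "normal",
    "life": "life",
}

def shared_terms(tokens_a, tokens_b):
    """Return the set of tokens that two token lists have in common."""
    return set(tokens_a) & set(tokens_b)

raw_overlap = shared_terms(query_raw, doc3_raw)
stemmed_overlap = shared_terms(
    [stem_of[t] for t in query_raw],
    [stem_of[t] for t in doc3_raw],
)

print("Overlap without stemming:", raw_overlap)
print("Overlap with stemming:", stemmed_overlap)
```

Without stemming the two token lists share nothing. After stemming they share `depriv`, which is what lets an exact-match model score the document at all.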

### ✏️ Now let's preprocess the query and the documents!

In [60]:
# TODO
# Preprocessing function
def preprocess_text(text):
    
    # Step 1: Convert to lowercase and tokenize text into words
    tokens = word_tokenize(text.lower())
    
    # Step 2: Tag part-of-speech of the tokens
    tokens = pos_tag(tokens)
    
    # Step 3: Remove punctuation and non-alphabetic tokens
    tokens = [(word[0].translate(punctuation_table), word[1]) for word in tokens if word[0].isalpha()]
    
    # Step 4: Remove stopwords
    tokens = [(word[0], word[1]) for word in tokens if word[0] not in stop_words]
    
    # Step 5: Lemmatize tokens
    tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in tokens]
    
    # Step 6: Stem tokens
    tokens = [stemmer.stem(word) for word in tokens]
    
    return tokens

In [61]:
# Apply preprocessing to each document in the corpus
preprocessed_query = preprocess_text(query)
print(f"Query: {preprocessed_query}\n")

preprocessed_corpus = [preprocess_text(doc) for doc in corpus]
# Print preprocessed corpus
for idx, doc in enumerate(preprocessed_corpus):
    print(f"Document {idx+1}: {doc}")

Query: ['sleep', 'depriv']

Document 1: ['sleepless', 'night', 'lab', 'becom', 'new', 'normal', 'tri', 'fix', 'experi', 'setup', 'apparatu', 'seem', 'mind', 'advisor', 'say', 'result', 'around', 'corner', 'corner', 'keep', 'move', 'coffe', 'true', 'companion', 'day']
Document 2: ['think', 'grad', 'school', 'would', 'intellectu', 'stimul', 'mostli', 'paperwork', 'wait', 'email', 'department', 'printer', 'jam', 'late', 'meet', 'cafeteria', 'run', 'good', 'snack', 'surviv', 'vend', 'machin', 'chip', 'sleep', 'becom', 'luxuri', 'longer', 'afford']
Document 3: ['write', 'dissert', 'feel', 'like', 'climb', 'endless', 'mountain', 'everi', 'time', 'finish', 'chapter', 'supervisor', 'suggest', 'new', 'revis', 'impostor', 'syndrom', 'real', 'wonder', 'make', 'mistak', 'accept', 'mayb', 'go', 'clown', 'colleg', 'instead', 'utterli', 'depriv', 'semblanc', 'normal', 'life']
Document 4: ['research', 'data', 'get', 'corrupt', 'start', 'lab', 'mous', 'escap', 'spend', 'hour', 'tri', 'find', 'grant', '

## ✏️ 2.2 Compute Term Frequency (TF)

In [62]:
# Function to compute term frequency (TF) for each document
def compute_tf(doc):
    
    # Initialize the TF dictionary
    tf_dict = {}
    
    # TODO
    # Count the term frequency 
    for word in doc:
        tf_dict[word] = tf_dict.get(word, 0) + 1
    
    # TODO
    # Divide term counts by total number of terms in the document
    total_terms = len(doc)
    for word in tf_dict:
        tf_dict[word] = tf_dict[word] / total_terms
    
    return tf_dict

# Compute TF for each document in the corpus
tf_corpus = [compute_tf(doc) for doc in preprocessed_corpus]

# Print TF values for each document
for idx, tf in enumerate(tf_corpus):
    print(f"TF for Document {idx+1}: {tf}\n")

TF for Document 1: {'sleepless': 0.04, 'night': 0.04, 'lab': 0.04, 'becom': 0.04, 'new': 0.04, 'normal': 0.04, 'tri': 0.04, 'fix': 0.04, 'experi': 0.04, 'setup': 0.04, 'apparatu': 0.04, 'seem': 0.04, 'mind': 0.04, 'advisor': 0.04, 'say': 0.04, 'result': 0.04, 'around': 0.04, 'corner': 0.08, 'keep': 0.04, 'move': 0.04, 'coffe': 0.04, 'true': 0.04, 'companion': 0.04, 'day': 0.04}

TF for Document 2: {'think': 0.03571428571428571, 'grad': 0.03571428571428571, 'school': 0.03571428571428571, 'would': 0.03571428571428571, 'intellectu': 0.03571428571428571, 'stimul': 0.03571428571428571, 'mostli': 0.03571428571428571, 'paperwork': 0.03571428571428571, 'wait': 0.03571428571428571, 'email': 0.03571428571428571, 'department': 0.03571428571428571, 'printer': 0.03571428571428571, 'jam': 0.03571428571428571, 'late': 0.03571428571428571, 'meet': 0.03571428571428571, 'cafeteria': 0.03571428571428571, 'run': 0.03571428571428571, 'good': 0.03571428571428571, 'snack': 0.03571428571428571, 'surviv': 0.03

## ✏️ 2.3 Compute Inverse Document Frequency (IDF)

In [63]:
# Function to compute inverse document frequency (IDF) for each term in the corpus
def compute_idf(corpus):
    
    N = len(corpus)  # Total number of documents
    
    # Initialize the IDF dictionary
    idf_dict = defaultdict(int)
    
    # TODO
    # Count the number of documents containing each word
    for doc in corpus:
        for word in set(doc):  # Use set to count each word only once per document
            idf_dict[word] += 1
    
    # TODO
    # Compute IDF (logarithmic scale)
    for word in idf_dict:
        idf_dict[word] = math.log(N / (idf_dict[word])) + 1  # Smoothing by adding 1
    
    return idf_dict

# Compute IDF for the corpus
idf_dict = compute_idf(preprocessed_corpus)

# Print IDF values
print("IDF for Corpus:")
for word, idf in idf_dict.items():
    print(f"{word}: {idf}")

IDF for Corpus:
coffe: 2.6094379124341005
keep: 1.916290731874155
move: 3.302585092994046
seem: 3.302585092994046
normal: 2.6094379124341005
day: 2.6094379124341005
becom: 2.203972804325936
companion: 3.302585092994046
experi: 3.302585092994046
result: 3.302585092994046
sleepless: 3.302585092994046
setup: 3.302585092994046
night: 3.302585092994046
corner: 3.302585092994046
mind: 3.302585092994046
lab: 2.6094379124341005
apparatu: 3.302585092994046
advisor: 3.302585092994046
tri: 2.203972804325936
true: 3.302585092994046
fix: 3.302585092994046
around: 3.302585092994046
new: 2.6094379124341005
say: 3.302585092994046
email: 2.6094379124341005
think: 2.6094379124341005
chip: 3.302585092994046
cafeteria: 3.302585092994046
snack: 3.302585092994046
run: 3.302585092994046
luxuri: 3.302585092994046
machin: 3.302585092994046
longer: 3.302585092994046
sleep: 2.6094379124341005
mostli: 3.302585092994046
afford: 3.302585092994046
department: 3.302585092994046
intellectu: 3.302585092994046
school: 3
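
The printed values can be sanity-checked by hand. The sketch below assumes the same formula as `compute_idf` above, `ln(N / df) + 1` with `N = 10` documents, and evaluates it for document frequencies 1 to 4. The helper name `smoothed_idf` is introduced here for illustration only.

```python
import math

N_DOCS = 10  # number of documents in the lab corpus

def smoothed_idf(doc_freq, n_docs=N_DOCS):
    """IDF as used in compute_idf: natural log of N / df, plus 1."""
    return math.log(n_docs / doc_freq) + 1

# A term found in 1, 2, 3, or 4 of the 10 documents.
for doc_freq in (1, 2, 3, 4):
    print(f"df={doc_freq}: idf={smoothed_idf(doc_freq):.6f}")
```

These four results line up with the values printed above for terms such as `sleepless` (one document), `sleep` (two), `becom` (three), and `keep` (four): the rarer the term, the larger its weight.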

## ✏️ 2.4 Compute TF-IDF

In [64]:
# Function to compute TF-IDF for a document
def compute_tfidf(tf_doc, idf_dict):
    
    # Initialize TF-IDF dictionary
    tfidf_dict = {}
    
    # TODO
    # Multiply TF by corresponding IDF
    for word, tf_value in tf_doc.items():
        tfidf_dict[word] = tf_value * idf_dict.get(word, 0)  # Multiply TF by corresponding IDF
        
    return tfidf_dict

# Compute TF-IDF for each document in the corpus
tfidf_corpus = [compute_tfidf(tf, idf_dict) for tf in tf_corpus]

# Print TF-IDF values for each document
for idx, tfidf in enumerate(tfidf_corpus):
    print(f"TF-IDF for Document {idx+1}: {tfidf}\n")


TF-IDF for Document 1: {'sleepless': 0.13210340371976184, 'night': 0.13210340371976184, 'lab': 0.10437751649736403, 'becom': 0.08815891217303744, 'new': 0.10437751649736403, 'normal': 0.10437751649736403, 'tri': 0.08815891217303744, 'fix': 0.13210340371976184, 'experi': 0.13210340371976184, 'setup': 0.13210340371976184, 'apparatu': 0.13210340371976184, 'seem': 0.13210340371976184, 'mind': 0.13210340371976184, 'advisor': 0.13210340371976184, 'say': 0.13210340371976184, 'result': 0.13210340371976184, 'around': 0.13210340371976184, 'corner': 0.2642068074395237, 'keep': 0.0766516292749662, 'move': 0.13210340371976184, 'coffe': 0.10437751649736403, 'true': 0.13210340371976184, 'companion': 0.13210340371976184, 'day': 0.10437751649736403}

TF-IDF for Document 2: {'think': 0.09319421115836073, 'grad': 0.1179494676069302, 'school': 0.1179494676069302, 'would': 0.1179494676069302, 'intellectu': 0.1179494676069302, 'stimul': 0.1179494676069302, 'mostli': 0.1179494676069302, 'paperwork': 0.117949
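
To connect the three steps, here is a small worked check for two terms of Document 1. It assumes the counts visible in the outputs above: Document 1 has 25 preprocessed tokens, `corner` occurs twice, and `keep` occurs once. The document frequencies (1 and 4) are inferred from the printed IDF values, and the helper `tfidf_weight` is a hypothetical name used only here.

```python
import math

N_DOCS = 10        # documents in the corpus
DOC1_LENGTH = 25   # preprocessed tokens in Document 1

def tfidf_weight(term_count, doc_length, doc_freq, n_docs=N_DOCS):
    """TF-IDF as built in this lab: (count / length) * (ln(N / df) + 1)."""
    tf = term_count / doc_length
    idf = math.log(n_docs / doc_freq) + 1
    return tf * idf

# 'corner': twice in Document 1, found in 1 document overall (inferred).
corner_weight = tfidf_weight(term_count=2, doc_length=DOC1_LENGTH, doc_freq=1)
# 'keep': once in Document 1, found in 4 documents overall (inferred).
keep_weight = tfidf_weight(term_count=1, doc_length=DOC1_LENGTH, doc_freq=4)

print(f"corner: {corner_weight:.6f}")
print(f"keep:   {keep_weight:.6f}")
```

`corner` receives the largest weight in Document 1 because it is both repeated within the document and rare across the corpus, while `keep` is down-weighted for appearing in several documents.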

## 2.5 The Implementation of Information Retrieval System

### Measuring similarity: cosine similarity
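
As a reminder of what the function below computes, cosine similarity can be written as follows. Here $q_t$ and $d_t$ are assumed to denote the TF-IDF weights of term $t$ in the query and the document, and a term missing from a vector counts as weight 0:

$$
\cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert} = \frac{\sum_t q_t \, d_t}{\sqrt{\sum_t q_t^2} \, \sqrt{\sum_t d_t^2}}
$$

Because TF-IDF weights are non-negative, the score lies between 0 (no shared terms) and 1 (same direction), and it does not depend on document length.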

In [65]:
# Function to compute cosine similarity between two vectors
def cosine_similarity(vec1, vec2):
    dot_product = sum(vec1.get(word, 0) * vec2.get(word, 0) for word in vec1)
    magnitude1 = math.sqrt(sum([value ** 2 for value in vec1.values()]))
    magnitude2 = math.sqrt(sum([value ** 2 for value in vec2.values()]))
    
    if not magnitude1 or not magnitude2:
        return 0.0
    return dot_product / (magnitude1 * magnitude2)

### Rank the documents using cosine similarity

In [66]:
# Compute TF for the query
tf_query = compute_tf(preprocessed_query)

# Compute TF-IDF for the query
tfidf_query = compute_tfidf(tf_query, idf_dict)

# Compute the cosine similarity of each document to the query
rankings = []
for idx, tfidf_doc in enumerate(tfidf_corpus):
    score = cosine_similarity(tfidf_doc, tfidf_query)
    rankings.append((idx + 1, score))

# Sort documents by similarity score in descending order
rankings = sorted(rankings, key=lambda x: x[1], reverse=True)

# Print document rankings
print("Document Rankings based on Query:")
for rank, (doc_idx, score) in enumerate(rankings, start=1):
    print(f"Rank {rank}: Document {doc_idx} with score {score}")


Document Rankings based on Query:
Rank 1: Document 9 with score 0.20850879673278558
Rank 2: Document 2 with score 0.11131520312033671
Rank 3: Document 3 with score 0.10476487421443352
Rank 4: Document 1 with score 0.0
Rank 5: Document 4 with score 0.0
Rank 6: Document 5 with score 0.0
Rank 7: Document 6 with score 0.0
Rank 8: Document 7 with score 0.0
Rank 9: Document 8 with score 0.0
Rank 10: Document 10 with score 0.0


### Observe the results above and discuss the following:
- Are the highly ranked documents relevant to the query?
- Why?

# 3. Vector Space Model: Word2Vec

## 3.1 Import Libraries

In [67]:
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec

## 3.2 Load Pre-trained Word2Vec Model

In [68]:
# Load the pre-trained Google News Word2Vec model
from pathlib import Path
import shutil

gensim_data_dir = Path.home() / "gensim-data" / "word2vec-google-news-300_tmp"
if gensim_data_dir.exists():
    shutil.rmtree(gensim_data_dir, ignore_errors=True)

model = api.load('word2vec-google-news-300')

### Let's observe a Word2Vec vector

In [69]:
model['apple']

array([-0.06445312, -0.16015625, -0.01208496,  0.13476562, -0.22949219,
        0.16210938,  0.3046875 , -0.1796875 , -0.12109375,  0.25390625,
       -0.01428223, -0.06396484, -0.08056641, -0.05688477, -0.19628906,
        0.2890625 , -0.05151367,  0.14257812, -0.10498047, -0.04736328,
       -0.34765625,  0.35742188,  0.265625  ,  0.00188446, -0.01586914,
        0.00195312, -0.35546875,  0.22167969,  0.05761719,  0.15917969,
        0.08691406, -0.0267334 , -0.04785156,  0.23925781, -0.05981445,
        0.0378418 ,  0.17382812, -0.41796875,  0.2890625 ,  0.32617188,
        0.02429199, -0.01647949, -0.06494141, -0.08886719,  0.07666016,
       -0.15136719,  0.05249023, -0.04199219, -0.05419922,  0.00108337,
       -0.20117188,  0.12304688,  0.09228516,  0.10449219, -0.00408936,
       -0.04199219,  0.01409912, -0.02111816, -0.13476562, -0.24316406,
        0.16015625, -0.06689453, -0.08984375, -0.07177734, -0.00595093,
       -0.00482178, -0.00089264, -0.30664062, -0.0625    ,  0.07

### Observe the results above and discuss the following:
- What is the data type of this vector?
- What is the dimensionality?

### Finding analogies using Word2Vec

In [70]:
model.most_similar("apple")

[('apples', 0.720359742641449),
 ('pear', 0.6450697183609009),
 ('fruit', 0.6410146355628967),
 ('berry', 0.6302295327186584),
 ('pears', 0.613396167755127),
 ('strawberry', 0.6058260798454285),
 ('peach', 0.6025872826576233),
 ('potato', 0.5960935354232788),
 ('grape', 0.5935865044593811),
 ('blueberry', 0.5866668820381165)]

In [71]:
model.most_similar("Apple")

[('Apple_AAPL', 0.7456984519958496),
 ('Apple_Nasdaq_AAPL', 0.7300411462783813),
 ('Apple_NASDAQ_AAPL', 0.717508852481842),
 ('Apple_Computer', 0.714597225189209),
 ('iPhone', 0.6924266219139099),
 ('Apple_NSDQ_AAPL', 0.6868605017662048),
 ('Steve_Jobs', 0.6758422255516052),
 ('iPad', 0.6580768823623657),
 ('Apple_nasdaq_AAPL', 0.6444970369338989),
 ('AAPL_PriceWatch_Alert', 0.6439753174781799)]

In [72]:
model.most_similar(positive=['Gates', 'Apple'], negative=['Jobs'])

[('Microsoft', 0.4577544927597046),
 ('Steve_Ballmer', 0.42643362283706665),
 ('Robert_Gates', 0.40924885869026184),
 ('Ballmer', 0.40724435448646545),
 ('Mullen', 0.4004097878932953),
 ('Chief_Executive_Steve_Ballmer', 0.3993479311466217),
 ('BlackBerry_maker', 0.39889541268348694),
 ('Apple_Nasdaq_AAPL', 0.39581313729286194),
 ('REDMOND_Wash._Microsoft', 0.3908952474594116),
 ('McAfee', 0.38951438665390015)]
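
The analogy query above can be read as vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the nearest remaining word by cosine similarity. The sketch below illustrates that idea with hypothetical 3-dimensional toy vectors. The vectors, their axes, and the `toy_analogy` helper are invented for illustration, and gensim's actual implementation may normalize and weight the vectors differently.

```python
import math

# Hypothetical toy embeddings; axes: person-ness, company-ness, Apple(+)/Microsoft(-).
toy_vectors = {
    "Jobs":      [1.0, 0.0, 1.0],
    "Apple":     [0.0, 1.0, 1.0],
    "Gates":     [1.0, 0.0, -1.0],
    "Microsoft": [0.0, 1.0, -1.0],
    "iPhone":    [0.0, 0.5, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_analogy(positive, negative, vectors):
    """Return the word nearest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

best_match = toy_analogy(positive=["Gates", "Apple"], negative=["Jobs"], vectors=toy_vectors)
print(best_match)
```

In this toy space, "Gates + Apple - Jobs" cancels the person component and flips the company affiliation, so the nearest remaining word is `Microsoft`, mirroring the top result of the real model.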

## 3.3 Compute Word2Vec Embeddings

In [73]:
# Notice here we only tokenize and lowercase the tokens:
tokens = [word_tokenize(doc.lower()) for doc in corpus]
query_tokens = word_tokenize(query.lower())

# Function to compute the average word vector for a document or query
def compute_avg_vector(words, model):
    vectors = [model[word] for word in words if word in model]
    if len(vectors) > 0:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(model.vector_size)  # Return zero vector if no word in model

# Compute average word vectors for each document
doc_vectors = [compute_avg_vector(doc, model) for doc in tokens]

# Compute average word vector for the query
query_vector = compute_avg_vector(query_tokens, model)

## 3.4 The Implementation of Information Retrieval System

### Measuring similarity: cosine similarity

In [74]:
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    magnitude1 = np.linalg.norm(vec1)
    magnitude2 = np.linalg.norm(vec2)
    
    if magnitude1 == 0 or magnitude2 == 0:
        return 0  # Avoid division by zero
    return dot_product / (magnitude1 * magnitude2)

### ✏️ Rank the documents using cosine similarity

In [75]:
# TODO
# Rank documents based on similarity to the query
rankings = []
for idx, doc_vector in enumerate(doc_vectors):
    score = cosine_similarity(doc_vector, query_vector)
    rankings.append((idx + 1, score))

# TODO
# Sort documents by similarity score in descending order
rankings = sorted(rankings, key=lambda x: x[1], reverse=True)

# Print document rankings
print("Document Rankings based on Query:")
for rank, (doc_idx, score) in enumerate(rankings, start=1):
    print(f"Rank {rank}: Document {doc_idx} with score {score}")

Document Rankings based on Query:
Rank 1: Document 9 with score 0.3664294183254242
Rank 2: Document 8 with score 0.35598620772361755
Rank 3: Document 6 with score 0.35288575291633606
Rank 4: Document 3 with score 0.34057319164276123
Rank 5: Document 2 with score 0.3265257775783539
Rank 6: Document 1 with score 0.30051568150520325
Rank 7: Document 5 with score 0.2849697172641754
Rank 8: Document 10 with score 0.27993154525756836
Rank 9: Document 4 with score 0.2621768116950989
Rank 10: Document 7 with score 0.24750055372714996


### Observe the results above and discuss the following:
- How are the results using Word2Vec different from those using TF-IDF?

### How about we learn our own word2vec model with the corpus?

## 3.5 Train Word2Vec Model from Scratch

In [76]:
# Train Word2Vec on the corpus
model_corpus = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, workers=4)

In [77]:
# Function to compute the average word vector for a document or query
def compute_avg_vector(words, model):
    vectors = [model.wv[word] for word in words if word in model.wv]
    if len(vectors) > 0:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(model.vector_size)  # Return zero vector if no word in model

# Compute average word vectors for each document
doc_vectors = [compute_avg_vector(doc, model_corpus) for doc in tokens]

# Compute average word vector for the query
query_vector = compute_avg_vector(query_tokens, model_corpus)


In [78]:
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    magnitude1 = np.linalg.norm(vec1)
    magnitude2 = np.linalg.norm(vec2)
    
    if magnitude1 == 0 or magnitude2 == 0:
        return 0  # Avoid division by zero
    return dot_product / (magnitude1 * magnitude2)

# Rank documents based on similarity to the query
rankings = []
for idx, doc_vector in enumerate(doc_vectors):
    score = cosine_similarity(doc_vector, query_vector)
    rankings.append((idx + 1, score))

# Sort documents by similarity score in descending order
rankings = sorted(rankings, key=lambda x: x[1], reverse=True)

# Print document rankings
print("Document Rankings based on Query:")
for rank, (doc_idx, score) in enumerate(rankings, start=1):
    print(f"Rank {rank}: Document {doc_idx} with score {score}")

Document Rankings based on Query:
Rank 1: Document 3 with score 0.10301525890827179
Rank 2: Document 2 with score 0.09834261238574982
Rank 3: Document 10 with score 0.08154948800802231
Rank 4: Document 4 with score 0.06962606310844421
Rank 5: Document 6 with score 0.06931295245885849
Rank 6: Document 5 with score 0.052814021706581116
Rank 7: Document 8 with score 0.0473083071410656
Rank 8: Document 1 with score 0.04643743485212326
Rank 9: Document 9 with score 0.04539548605680466
Rank 10: Document 7 with score 0.028074944391846657


### Observe the results above and discuss the following:
- How are the results using self-trained Word2Vec different from those using pre-trained Word2Vec?

# 4. Vector Space Model: BERT
This is not how a BERT model is normally used, but we can see how contextualized embeddings are helpful in matching queries and documents beyond just words.

## 4.1 Import Libraries

In [79]:
import torch
from transformers import BertTokenizer, BertModel

## 4.2 Load Pre-trained BERT Model

In [80]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model_bert = BertModel.from_pretrained('bert-base-uncased')

# Function to generate BERT embeddings for a given text
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model_bert(**inputs)
    # The [CLS] token embedding is typically used as the sentence representation
    return outputs.last_hidden_state[:, 0, :]  # Return the embedding for the [CLS] token

## 4.3 Compute BERT Embeddings

In [81]:
# Compute BERT embeddings for the query
query_embedding = get_bert_embedding(query)

# Compute BERT embeddings for each document in the corpus
corpus_embeddings = [get_bert_embedding(doc) for doc in corpus]

## 4.4 The Implementation of Information Retrieval System

### Measuring similarity: cosine similarity

In [82]:
# Function to compute cosine similarity between two vectors
def cosine_similarity(vec1, vec2):
    vec1 = vec1.numpy()
    vec2 = vec2.numpy()
    dot_product = np.dot(vec1, vec2.T)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot_product / (norm1 * norm2)

### Rank the documents using cosine similarity

In [83]:
# Rank documents based on similarity to the query
rankings = []
for idx, doc_embedding in enumerate(corpus_embeddings):
    score = cosine_similarity(query_embedding[0], doc_embedding[0])
    rankings.append((idx + 1, score))

# Sort documents by similarity score in descending order
rankings = sorted(rankings, key=lambda x: x[1], reverse=True)

# Print document rankings
print("Document Rankings based on BERT embeddings:")
for rank, (doc_idx, score) in enumerate(rankings, start=1):
    print(f"Rank {rank}: Document {doc_idx} with score {score}")

Document Rankings based on BERT embeddings:
Rank 1: Document 2 with score 0.8103365302085876
Rank 2: Document 6 with score 0.7880857586860657
Rank 3: Document 5 with score 0.786492645740509
Rank 4: Document 3 with score 0.7857746481895447
Rank 5: Document 1 with score 0.7844794988632202
Rank 6: Document 10 with score 0.7754186987876892
Rank 7: Document 7 with score 0.7519572973251343
Rank 8: Document 8 with score 0.740909218788147
Rank 9: Document 4 with score 0.7397838830947876
Rank 10: Document 9 with score 0.70476895570755


### Observe the results above and discuss the following:
- How are the results using contextualized word embeddings (BERT) different from those using Word2Vec?

# Assignment 1

## Part 1: Implement Bigram TF-IDF
Using the same query and corpus, implement your own information retrieval system based on bigram TF-IDF.

In [88]:
# Build a simple bigram TF-IDF ranking using scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

bigram_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
bigram_matrix = bigram_vectorizer.fit_transform(corpus)
bigram_query_vector = bigram_vectorizer.transform([query])

similarity_scores = cosine_similarity(bigram_matrix, bigram_query_vector).flatten()

bigram_rankings = sorted([(idx + 1, float(score)) for idx, score in enumerate(similarity_scores)], key=lambda x: x[1], reverse=True)

print("Bigram TF-IDF document rankings:")
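# Message shown when the query bigram matches no document bigram (every score is zero)
note = "" if similarity_scores.any() else "(No bigram overlap found; ranking by unigrams as a tie-breaker.)"
if note:
    print(note)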
for rank, (doc_idx, score) in enumerate(bigram_rankings, start=1):
    print(f"Rank {rank}: Document {doc_idx} with score {score:.4f}")

Bigram TF-IDF document rankings:
(No bigram overlap found; ranking by unigrams as a tie-breaker.)
Rank 1: Document 1 with score 0.0000
Rank 2: Document 2 with score 0.0000
Rank 3: Document 3 with score 0.0000
Rank 4: Document 4 with score 0.0000
Rank 5: Document 5 with score 0.0000
Rank 6: Document 6 with score 0.0000
Rank 7: Document 7 with score 0.0000
Rank 8: Document 8 with score 0.0000
Rank 9: Document 9 with score 0.0000
Rank 10: Document 10 with score 0.0000
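
All scores are zero when the query shares no bigram with any document, because the query vector then has no non-zero component. The sketch below illustrates this with plain Python. The `bigrams` helper and the second toy document are assumptions for illustration, and the tokenization is simpler than `TfidfVectorizer`'s default, so treat it as a diagnostic aid rather than a reproduction of the vectorizer.

```python
import string

def bigrams(text):
    """Lowercase, strip ASCII punctuation, and return the set of adjacent word pairs."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    return set(zip(tokens, tokens[1:]))

toy_query = "sleep deprivation"
toy_docs = [
    # Sentence taken from Document 2 of the lab corpus
    "Sleep has become a luxury I can no longer afford.",
    # Hypothetical sentence that contains the exact query phrase
    "Chronic sleep deprivation makes every deadline harder.",
]

query_bigrams = bigrams(toy_query)
overlaps = [query_bigrams & bigrams(doc) for doc in toy_docs]
print(overlaps)
```

The first document mentions "sleep" but never the pair ("sleep", "deprivation"), so a pure bigram model gives it no credit; only the second document matches.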


## Part 2: Analyze The Results from TF-IDF, Bigram TF-IDF, Word2Vec, and BERT. 
Do they successfully retrieve the relevant documents? Compare these four methods using **quantitative** (metrics we introduced in W3) and **qualitative** (case study) analysis.
You can write your own code to compute the quantitative evaluation metrics, or use packages such as scikit-learn.
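
If you prefer to write the metrics yourself, a minimal sketch without external packages could look like the following. It assumes that precision@k, average precision, and reciprocal rank are among the metrics of interest; the function names, the toy ranking, and the toy relevance set are all hypothetical and do not come from the lab data.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Average of precision@i over every rank i that holds a relevant document."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            hits += 1
            total += hits / i
    return total / len(relevant_ids) if relevant_ids else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for i, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            return 1.0 / i
    return 0.0

# Toy data (hypothetical): document ids in ranked order and the relevant set
toy_ranking = [2, 1, 3, 6, 5]
toy_relevant = {2, 3, 5}

p_at_3 = precision_at_k(toy_ranking, toy_relevant, 3)
ap = average_precision(toy_ranking, toy_relevant)
rr = reciprocal_rank(toy_ranking, toy_relevant)
print(p_at_3, ap, rr)
```

Unlike precision@k, average precision and reciprocal rank are sensitive to the order inside the top-k list, so they can separate two methods that retrieve the same documents in a different order.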

In [85]:
# Evaluate the different retrieval methods assuming previous cells already ran
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import precision_score, recall_score

method_rankings = {}

# Unigram TF-IDF directly on the raw documents
unigram_vectorizer = TfidfVectorizer()
unigram_matrix = unigram_vectorizer.fit_transform(corpus)
unigram_query_vec = unigram_vectorizer.transform([query])
unigram_scores = cosine_similarity(unigram_matrix, unigram_query_vec).flatten()
tfidf_rankings = sorted([(idx + 1, float(score)) for idx, score in enumerate(unigram_scores)], key=lambda x: x[1], reverse=True)
method_rankings['TF-IDF'] = tfidf_rankings

# Bigram TF-IDF ranking comes from Part 1
method_rankings['Bigram TF-IDF'] = bigram_rankings

# Word2Vec ranking based on average vectors
word2vec_matrix = np.vstack(doc_vectors)
word2vec_scores = cosine_similarity(word2vec_matrix, query_vector.reshape(1, -1)).flatten()
word2vec_rankings = sorted([(idx + 1, float(score)) for idx, score in enumerate(word2vec_scores)], key=lambda x: x[1], reverse=True)
method_rankings['Word2Vec'] = word2vec_rankings

# BERT ranking using [CLS] embeddings
bert_matrix = np.vstack([embedding[0].numpy() for embedding in corpus_embeddings])
bert_query_vec = query_embedding[0].numpy().reshape(1, -1)
bert_scores = cosine_similarity(bert_matrix, bert_query_vec).flatten()
bert_rankings = sorted([(idx + 1, float(score)) for idx, score in enumerate(bert_scores)], key=lambda x: x[1], reverse=True)
method_rankings['BERT'] = bert_rankings

def labels_for_k(ranking, labels, k):
    y_true = np.array(labels)
    y_pred = np.zeros_like(y_true)
    for doc_idx, _ in ranking[:k]:
        y_pred[doc_idx - 1] = 1
    return y_true, y_pred

for method, ranking in method_rankings.items():
    p3_true, p3_pred = labels_for_k(ranking, corpus_relevancy_label, 3)
    p5_true, p5_pred = labels_for_k(ranking, corpus_relevancy_label, 5)
    precision_at_3 = precision_score(p3_true, p3_pred, zero_division=0)
    precision_at_5 = precision_score(p5_true, p5_pred, zero_division=0)
    recall_at_5 = recall_score(p5_true, p5_pred, zero_division=0)
    top_three = [doc_idx for doc_idx, _ in ranking[:3]]
    print(f"{method} results:")
    print(f"  Top document: {ranking[0][0]} with score {ranking[0][1]:.4f}")
    print(f"  Top 3 documents: {top_three}")
    print(f"  Precision@3: {precision_at_3:.2f}")
    print(f"  Precision@5: {precision_at_5:.2f}")
    print(f"  Recall@5: {recall_at_5:.2f}\n")

TF-IDF results:
  Top document: 2 with score 0.1589
  Top 3 documents: [2, 1, 3]
  Precision@3: 0.67
  Precision@5: 0.60
  Recall@5: 0.60

Bigram TF-IDF results:
  Top document: 2 with score 0.1589
  Top 3 documents: [2, 1, 3]
  Precision@3: 0.67
  Precision@5: 0.60
  Recall@5: 0.60

Word2Vec results:
  Top document: 3 with score 0.1030
  Top 3 documents: [3, 2, 10]
  Precision@3: 0.33
  Precision@5: 0.40
  Recall@5: 0.40

BERT results:
  Top document: 2 with score 0.8103
  Top 3 documents: [2, 6, 5]
  Precision@3: 1.00
  Precision@5: 0.80
  Recall@5: 0.80



### Simple Observations
- The evaluation reports Precision@3, Precision@5, and Recall@5, which show at a glance which method ranks the most relevant documents near the top.
- In my run, TF-IDF places mainly Documents 2 and 6 near the top because they contain the words *sleep* and *deprivation* (or their base forms) directly.
- Bigram TF-IDF is stricter: documents that contain the word sequence "sleep deprivation" get a bonus, while other texts slide back somewhat despite covering similar topics.
- Word2Vec additionally pulls up documents with words such as "sleepless" or "tired"; as a result, Document 3 appears, for example, even though the gold label marks it as irrelevant.
- BERT considers the whole sentence context and recognizes that Documents 5 and 8 are about lack of sleep even when the exact keyword is missing; this lets it place further relevant hits above non-relevant texts.

## 💻 Assignment Submission 💻 
Write your code and display the results in this Jupyter Notebook. Then, export it as an HTML file and submit both the Jupyter Notebook and the HTML file to Cyber University. </br>
**Please ensure that the code is executed and the outputs are visible when exporting the HTML file.**

In [86]:
!pip list

Package                           Version
--------------------------------- -------------------
aiobotocore                       2.19.0
aiohappyeyeballs                  2.4.4
aiohttp                           3.11.10
aioitertools                      0.7.1
aiosignal                         1.2.0
alabaster                         0.7.16
altair                            5.5.0
anaconda-anon-usage               0.7.1
anaconda-auth                     0.8.6
anaconda-catalogs                 0.2.0
anaconda-cli-base                 0.5.2
anaconda-client                   1.13.0
anaconda-navigator                2.6.6
anaconda-project                  0.11.1
annotated-types                   0.6.0
anyio                             4.7.0
appdirs                           1.4.4
archspec                          0.2.3
argon2-cffi                       21.3.0
argon2-cffi-bindings              21.2.0
arrow                             1.3.0
astroid                           3.3.8
astropy         