# SCS 3546: Deep Learning
> **Assignment 3: Contextualized Word Embeddings**

### Your name & student number:

<pre> Nataliia Kobrii </pre>

<pre> qq577503 </pre>

## **Assignment Description**
***

Search engines are a standard tool for finding relevant content. Computing the similarity between pieces of text is an important factor in producing better search results.

### **Objectives**

**Your goal in this assignment is to calculate the textual similarity between queries and the provided sample documents, using a variety of NLP approaches.**

In achieving the above goal, you will also:
- Demonstrate how to preprocess text and embed textual data.
- Compare the results of textual similarity scoring between traditional and deep-learning based NLP methods.

### **Data and Queries**

You will use the document repository provided by `sample_repository.json`, which you can download from the following link, or from the assignment description in Quercus: https://q.utoronto.ca/courses/286389/files/21993451/download?download_frd=1

The queries you will run against these sample documents are the following:

- Query 1: “fruits”
- Query 2: “vegetables”
- Query 3: “healthy foods in Canada”

### **Techniques to Demonstrate**

The techniques you will use to compute the similarity scores are:
- 1. TF-IDF.
- 2. Semantic similarity using GloVe word vectors.
- 3. Semantic similarity using a BERT-based model.


### **Feel Free to Choose Your Own Approach**

How you go about demonstrating each of the above techniques is up to you. You are not expected to use any particular library. The code below is just meant to provide you with some guidance to get started. You **do**, however, need to demonstrate obtaining similarity scores **with all 3 techniques above**, but how you go about doing this is totally up to you. The evaluation will be based on your ability to obtain results using all three techniques, plus your discussion/comparison of any differences you observe.



## **Grade Allocation**
***
15 points total

- Experiment 1 (TF-IDF), implementation: 2 marks
- Experiment 2 (GloVe), implementation: 3 marks
- Experiment 3 (BERT), implementation: 3 marks
- Comparison and Discussion: 3 marks
  - Compare all three techniques and interpret your findings. Do your best to explain the differences you observe in terms of concepts learned in class (not just the _what_, but also the _how_ and _why_ one technique produces different results from another).
- Text Pre-Processing: 2 marks
 - Cleaning and standardization (e.g. lemmatization, stemming) in Experiment 1
 - Basic text cleaning (e.g. removal of special characters or tags) in Experiments 2 and 3.
- Clarity: 2 marks
 - The marks for clarity are awarded for code documentation, clean code (e.g. avoiding repetition by building re-usable functions)  and how well you explained/supported your answers, including the use of visualizations.


# Setup and Data Import
***
You can use the code snippets below to help you load and extract the document repository.


In [None]:
# Load sample_repository.json from current folder
import json

with open('sample_repository.json', 'r', encoding='utf-8') as in_file:
    repo_data = json.load(in_file)

print("File loaded successfully.")
print("Number of records:", len(repo_data['data']))


File loaded successfully.
Number of records: 32


In [None]:
# this will unpack the json file contents into a list of titles and documents
import json

with open('sample_repository.json') as in_file:
    repo_data = json.load(in_file)

titles = [item[0] for item in repo_data['data']]
documents = [item[1] for item in repo_data['data']]


In [None]:
# let's take a look at some of these documents and titles;
# here we print the last five entries
# (the loop variable is named `i` to avoid shadowing the built-in `id`)
for i in range(-5, 0):
  print(f"Document title: {titles[i]}")
  print(f"Document contents: {documents[i]}")
  print("\n") # adds blank lines between entries

Document title: botany
Document contents: Botany, also called plant science(s), plant biology or phytology, is the science of plant life and a branch of biology. A botanist, plant scientist or phytologist is a scientist who specialises in this field. 


Document title: Ford Bronco 
Document contents: The Ford Bronco is a model line of sport utility vehicles manufactured and marketed by Ford. ... The first SUV model developed by the company, five generations of the Bronco were sold from the 1966 to 1996 model years. A sixth generation of the model line is sold from the 2021 model year. the Ford Bronco will be available in Canada, with first deliveries beginning in spring of 2021. The Bronco will come in six versions in Canada: Base, Big Bend, Black Diamond, Outer Banks, Wildtrak and Badlands. 


Document title: List of fruit dishes
Document contents: Fruit dishes are those that use fruit as a primary ingredient. Condiments prepared with fruit as a primary ingredient are also included in

# Experiment 1: TF-IDF
***

**T**erm **F**requency - **I**nverse **D**ocument **F**requency (TF-IDF) is a traditional NLP technique that scores the words two pieces of text have in common: a term counts for more the more often it appears in a text, and for less the more documents in the collection contain it. For this experiment, you are free to use the TF-IDF implementation provided by scikit-learn.
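
To make the weighting concrete, here is a minimal from-scratch sketch in pure Python. It is not the scikit-learn implementation used in the cells below; it only assumes the smoothed IDF form `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents for `smooth_idf=True`, followed by L2 normalization. The helper names (`tfidf_vectors`, `cosine`) and the toy texts are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one L2-normalized TF-IDF dict per tokenized text."""
    n_docs = len(token_lists)
    # Document frequency: in how many texts does each term occur?
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF (assumed form, see the lead-in above).
    idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1.0 for t, df_t in df.items()}
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(u, v):
    """Dot product of two sparse dict vectors (the cosine when both have unit length)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Hypothetical toy corpus; the first entry plays the role of the query.
toy_texts = ["fruits", "fresh fruits and berries", "a fruit serving bowl"]
toy_vectors = tfidf_vectors([text.split() for text in toy_texts])
toy_scores = [cosine(toy_vectors[0], doc_vec) for doc_vec in toy_vectors[1:]]
print(toy_scores)
```

The second toy document scores exactly zero because it contains "fruit" but not the query token "fruits", which is the exact-match behaviour that the preprocessing step later addresses.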


In [None]:
!pip install nltk




In [None]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Download the NLTK resources used below (skipped if already present)
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/natalikobrii/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/natalikobrii/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Query terms to test
queries = ["fruits", "vegetables", "healthy foods in Canada"]

def rank_tfidf_raw(query, docs, titles, top_k=5):
    """Rank documents using TF-IDF on raw text."""
    vectorizer = TfidfVectorizer(stop_words=list(stop_words))
    vectors = vectorizer.fit_transform([query] + docs)

    # First vector is the query
    query_vec = vectors[0:1]
    doc_vecs = vectors[1:]

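    # Compute cosine similarity (TfidfVectorizer L2-normalizes each row by default,
    # so the linear kernel, i.e. the dot product, equals the cosine)
    scores = linear_kernel(query_vec, doc_vecs).flatten()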

    # Top-k highest scores
    top_idx = scores.argsort()[::-1][:top_k]

    results = []
    for idx in top_idx:
        results.append({
            "title": titles[idx],
            "score": float(scores[idx]),
            "snippet": docs[idx][:250] + ("..." if len(docs[idx]) > 250 else "")
        })
    return results


# Top-5 similar documents for each query
for q in queries:
    print("=" * 80)
    print(f"TF-IDF (RAW) — Query: {q!r}")
    ranked = rank_tfidf_raw(q, documents, titles)
    for r in ranked:
        print(f"{r['title']} | score = {r['score']:.4f}")
        print(r["snippet"], "\n")

TF-IDF (RAW) — Query: 'fruits'
Food classes | score = 0.1646
To a botanist, a fruit is an entity that develops from the fertilized ovary of a flower. This means that tomatoes, squash, pumpkins, cucumbers, peppers, eggplants, corn kernels, and bean and pea pods are all fruits; so are apples, pears, peaches, apr... 

Canada's Food Guide | score = 0.0733
Canada's Food Guide is a nutrition guide produced by Health Canada to promote Healthy behaviours and habits, and lifestyles in Canada - this is to increase the number of healthy people in Canada. In 2007, it was reported to be the second most request... 

fruit serving bowl | score = 0.0000
A fruit serving bowl is a round dish or container typically used to prepare and serve food. The interior of a bowl is characteristically shaped like a spherical cap, with the edges and the bottom forming a seamless curve. This makes bowls especially ... 

Neuro linguistic programming | score = 0.0000
Neuro linguistic programming (NLP) is a pseudoscient

## TF-IDF after preprocessing

In [26]:
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("wordnet")
nltk.download("omw-1.4")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [27]:
# Load stopwords
stop_words = set(stopwords.words("english"))

lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """
    Lowercase, remove non-letters, remove stopwords, lemmatize.
    """
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = word_tokenize(text)

    tokens = [
        lemmatizer.lemmatize(t)
        for t in tokens
        if t not in stop_words and len(t) > 1
    ]

    return " ".join(tokens)

cleaned_documents = [clean_text(doc) for doc in documents]

print("Original example:\n", documents[0][:200], "...\n")
print("Cleaned example:\n", cleaned_documents[0][:200], "...")

Original example:
 Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranate variety from India. The Deep Red arils & the pleasing Red but rugged skin enhances the appearance whilst promoting she ...

Cleaned example:
 fresh pomegranate anushka avni international bhagwa premium pomegranate variety india deep red aril pleasing red rugged skin enhances appearance whilst promoting shelf life fruit bhagwa widely known s ...


In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def rank_tfidf_clean(query, cleaned_docs, original_docs, titles, top_k=5):
    q_clean = clean_text(query)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([q_clean] + cleaned_docs)

    query_vec = tfidf_matrix[0:1]
    doc_vecs = tfidf_matrix[1:]

    scores = linear_kernel(query_vec, doc_vecs).flatten()
    top_idx = scores.argsort()[::-1][:top_k]

    results = []
    for idx in top_idx:
        results.append({
            "title": titles[idx],
            "score": float(scores[idx]),
            "snippet": original_docs[idx][:250] + ("..." if len(original_docs[idx]) > 250 else "")
        })
    return results


queries = ["fruits", "vegetables", "healthy foods in Canada"]

for q in queries:
    print("=" * 80)
    print(f"TF-IDF (CLEANED) — Query: {q!r}")
    for r in rank_tfidf_clean(q, cleaned_documents, documents, titles):
        print(f"{r['title']}  |  score = {r['score']:.4f}")
        print(r["snippet"], "\n")

TF-IDF (CLEANED) — Query: 'fruits'
List of fruit dishes  |  score = 0.4542
Fruit dishes are those that use fruit as a primary ingredient. Condiments prepared with fruit as a primary ingredient are also included in this list. 

Food classes  |  score = 0.2542
To a botanist, a fruit is an entity that develops from the fertilized ovary of a flower. This means that tomatoes, squash, pumpkins, cucumbers, peppers, eggplants, corn kernels, and bean and pea pods are all fruits; so are apples, pears, peaches, apr... 

Pomegranate Bhagwa  |  score = 0.1458
Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranate variety from India. The Deep Red arils & the pleasing Red but rugged skin enhances the appearance whilst promoting shelf life of the fruit. Bhagwa is widely known for i... 

fruit serving bowl  |  score = 0.0925
A fruit serving bowl is a round dish or container typically used to prepare and serve food. The interior of a bowl is characteristically shaped like a sp

What impact did the text cleaning / preprocessing have on your results?

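After preprocessing, the TF-IDF results became noticeably more relevant. For the query “fruits”, “List of fruit dishes” moved to the top of the ranking (score 0.4542) and “fruit serving bowl” rose from 0.0000 to 0.0925, because lemmatization maps the plural “fruits” to “fruit”, so the query and the documents now share a token. Note that the WordNet lemmatizer, called here without part-of-speech tags, leaves an adjective such as “fruity” unchanged.  
Stopwords were already removed in the raw run, so the gain comes mainly from lemmatization, lowercasing, and stripping non-letter characters, which improve matching between the query and documents even when the exact word form differs.  
Overall, preprocessing led to more stable and meaningful similarity scores.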
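
The effect of normalization on exact matching can be illustrated with a small pure-Python sketch. The function `naive_singular` is a hypothetical stand-in that only strips a plural "s"; it is not the WordNet lemmatizer used above, and `shared_terms` is likewise an illustrative helper rather than part of the assignment code.

```python
def naive_singular(token):
    """Hypothetical stand-in for a lemmatizer: strips a plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def shared_terms(query, doc, normalize=None):
    """Return the set of terms that the query and the document have in common."""
    q_tokens = query.lower().split()
    d_tokens = doc.lower().split()
    if normalize is not None:
        q_tokens = [normalize(t) for t in q_tokens]
        d_tokens = [normalize(t) for t in d_tokens]
    return set(q_tokens) & set(d_tokens)

title = "fruit serving bowl"
raw_overlap = shared_terms("fruits", title)                    # raw tokens
clean_overlap = shared_terms("fruits", title, naive_singular)  # normalized tokens
print(raw_overlap, clean_overlap, naive_singular("fruity"))
```

Without normalization the query and the title share no token, so any exact-match score is zero; after normalization they share "fruit". A plural-only rule leaves "fruity" untouched.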

# Experiment 2: Semantic matching using GloVe embeddings
***

In [2]:
!pip install -q "gensim>=4.3.0"

import gensim
print(gensim.__version__)

4.4.0


In [9]:
import json
import logging
from re import sub
from multiprocessing import cpu_count

import numpy as np

import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

In [4]:
# optional, but it helps
import logging

# Initialize logging.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)

In [5]:
import nltk

# Import and download stopwords from NLTK.
nltk.download('stopwords')  # Download stopwords list.
stopwords = set(nltk.corpus.stopwords.words("english"))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [6]:
def preprocess(doc):
    # Tokenize, clean up input document string
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    # you may decide to add additional steps here
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

In [8]:
# Load test data
with open('sample_repository.json') as in_file:
    repo_data = json.load(in_file)

titles = [item[0] for item in repo_data['data']]
documents = [item[1] for item in repo_data['data']]

In [11]:
query_s = 'Your queries here'

# Preprocess the documents, including the query string
corpus = [preprocess(document) for document in documents]
query = preprocess(query_s)

queries = ['fruits', 'vegetables', 'healthy foods in Canada']

In [12]:
# Download and load the GloVe word vector embeddings
if 'glove' not in locals():  # only load if not already in memory
    glove = api.load("glove-wiki-gigaword-50")

similarity_index = WordEmbeddingSimilarityIndex(glove)



In [13]:
# Build the term dictionary, TF-idf model
# Keep in mind that the search query must be in the dictionary as well, in case the terms do not overlap with the documents
dictionary = Dictionary(corpus+[query])
tfidf = TfidfModel(dictionary=dictionary)

# Create the term similarity matrix.
# The nonzero_limit enforces sparsity by limiting the number of non-zero terms in each column.
# In my case, I got best results by removing the default value of 100
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)  # , nonzero_limit=None)

100%|██████████| 569/569 [00:10<00:00, 53.97it/s]


In [14]:
# Compute similarity measure between the query and the documents.
query_tf = tfidf[dictionary.doc2bow(query)]

index = SoftCosineSimilarity(
            tfidf[[dictionary.doc2bow(document) for document in corpus]],
            similarity_matrix)

doc_similarity_scores = index[query_tf]
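
The cells above delegate the scoring to gensim's `SoftCosineSimilarity`. As a rough illustration of the idea behind the soft cosine measure, the pure-Python sketch below evaluates `a^T S b / (sqrt(a^T S a) * sqrt(b^T S b))`, where `S` holds term-to-term similarities. It is not gensim's implementation, and the similarity value of 0.6 between "fruit" and "apple" is a hypothetical number chosen for the example; in the notebook those values come from the GloVe vectors.

```python
import math

def soft_dot(a, b, term_sim):
    """Compute a^T S b for bag-of-words dicts a, b and a term-similarity function."""
    return sum(wa * wb * term_sim(ta, tb) for ta, wa in a.items() for tb, wb in b.items())

def soft_cosine(a, b, term_sim):
    """Soft cosine: a^T S b / (sqrt(a^T S a) * sqrt(b^T S b))."""
    denom = math.sqrt(soft_dot(a, a, term_sim)) * math.sqrt(soft_dot(b, b, term_sim))
    return soft_dot(a, b, term_sim) / denom if denom else 0.0

# Hypothetical term similarities (assumption for illustration only).
TOY_SIMILARITY = {frozenset(["fruit", "apple"]): 0.6}

def toy_term_sim(t1, t2):
    """Identical terms are fully similar; unknown pairs are treated as unrelated."""
    if t1 == t2:
        return 1.0
    return TOY_SIMILARITY.get(frozenset([t1, t2]), 0.0)

def exact_term_sim(t1, t2):
    """Identity similarity: reduces the soft cosine to the ordinary cosine."""
    return 1.0 if t1 == t2 else 0.0

query_bow = {"fruit": 1.0}
doc_bow = {"apple": 1.0, "pie": 1.0}
plain_score = soft_cosine(query_bow, doc_bow, exact_term_sim)
soft_score = soft_cosine(query_bow, doc_bow, toy_term_sim)
print(plain_score, soft_score)
```

With the identity similarity the two texts share no term and score zero; once related terms are allowed to contribute, the score becomes positive even though no word is shared.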



In [15]:
import numpy as np

for q_s in queries:
    print('=' * 80)
    print(f"GloVe Soft-Cosine — Query: {q_s!r}")

    # Preprocess documents and the query
    corpus_q = [preprocess(doc) for doc in documents]
    query_q = preprocess(q_s)

    # Build dictionary + TF-IDF model
    dictionary_q = Dictionary(corpus_q + [query_q])
    tfidf_q = TfidfModel(dictionary=dictionary_q)

    # Build similarity matrix using GloVe vectors
    similarity_matrix_q = SparseTermSimilarityMatrix(similarity_index, dictionary_q, tfidf_q)

    # Compute similarities between query and documents
    query_tf_q = tfidf_q[dictionary_q.doc2bow(query_q)]
    index_q = SoftCosineSimilarity(
        tfidf_q[[dictionary_q.doc2bow(doc) for doc in corpus_q]],
        similarity_matrix_q
    )

    doc_similarity_scores_q = np.array(index_q[query_tf_q])

    # Top 5 most similar documents
    top_idx = doc_similarity_scores_q.argsort()[::-1][:5]

    for idx in top_idx:
        score = float(doc_similarity_scores_q[idx])
        print(f"{titles[idx]} | score = {score:.4f}")
        snippet = documents[idx][:250] + ('...' if len(documents[idx]) > 250 else '')
        print(snippet, '\n')

GloVe Soft-Cosine — Query: 'fruits'


100%|██████████| 568/568 [00:08<00:00, 64.13it/s]


Food classes | score = 0.8839
To a botanist, a fruit is an entity that develops from the fertilized ovary of a flower. This means that tomatoes, squash, pumpkins, cucumbers, peppers, eggplants, corn kernels, and bean and pea pods are all fruits; so are apples, pears, peaches, apr... 

fruit serving bowl | score = 0.8437
A fruit serving bowl is a round dish or container typically used to prepare and serve food. The interior of a bowl is characteristically shaped like a spherical cap, with the edges and the bottom forming a seamless curve. This makes bowls especially ... 

List of fruit dishes | score = 0.8437
Fruit dishes are those that use fruit as a primary ingredient. Condiments prepared with fruit as a primary ingredient are also included in this list. 

Pomegranate Bhagwa | score = 0.8092
Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranate variety from India. The Deep Red arils & the pleasing Red but rugged skin enhances the appearance whilst promoting

100%|██████████| 568/568 [00:10<00:00, 53.56it/s]


Food classes | score = 0.8961
To a botanist, a fruit is an entity that develops from the fertilized ovary of a flower. This means that tomatoes, squash, pumpkins, cucumbers, peppers, eggplants, corn kernels, and bean and pea pods are all fruits; so are apples, pears, peaches, apr... 

Canada's Food Guide | score = 0.8104
Canada's Food Guide is a nutrition guide produced by Health Canada to promote Healthy behaviours and habits, and lifestyles in Canada - this is to increase the number of healthy people in Canada. In 2007, it was reported to be the second most request... 

List of fruit dishes | score = 0.7603
Fruit dishes are those that use fruit as a primary ingredient. Condiments prepared with fruit as a primary ingredient are also included in this list. 

Small Onions | score = 0.7505
We are one of the leading organizations engaged in delivering our customers with Fresh Onions. We manufacture this in bulk requirements for our clients. Malaysia, Singapore, Philippines, Vietnam Qualit

100%|██████████| 568/568 [00:10<00:00, 53.87it/s]

Canada's Food Guide | score = 0.9388
Canada's Food Guide is a nutrition guide produced by Health Canada to promote Healthy behaviours and habits, and lifestyles in Canada - this is to increase the number of healthy people in Canada. In 2007, it was reported to be the second most request... 

Diet | score = 0.6835
In nutrition, the diet of an organism is the sum of foods it eats, which is largely determined by the availability and palatability of foods. 

fruit serving bowl | score = 0.5887
A fruit serving bowl is a round dish or container typically used to prepare and serve food. The interior of a bowl is characteristically shaped like a spherical cap, with the edges and the bottom forming a seamless curve. This makes bowls especially ... 

About Us | score = 0.5879
Anushka Avni International (AAI) takes pleasure in presenting itself as one of the renowned Suppliers and Exporter. We have huge assortment of agro products available with us. We feel proud when buyers come to us recognizin




# Experiment 3: BERT Model
***
Use a BERT model to obtain sentence embeddings and calculate the similarity between queries and documents.

> Hint: see the Module 07 jupyter notebook for examples of how to work with BERT.
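
As a rough sketch of what "sentence embedding plus similarity" involves, the pure-Python example below pools per-token vectors into one sentence vector and ranks candidates by cosine similarity. It assumes masked mean pooling, a common choice for sentence-embedding models; the two-dimensional token vectors and the helper names (`mean_pool`, `cosine`, `rank_by_cosine`) are hypothetical, and a real model produces the token vectors itself.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_cosine(query_vec, doc_vecs, top_k=5):
    """Return (index, score) pairs for the top_k most similar document vectors."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]

# Hypothetical 2-d token vectors; the third token is padding and is masked out.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]
toy_sentence = mean_pool(toy_tokens, toy_mask)
toy_ranking = rank_by_cosine(toy_sentence, [[1.0, 0.0], [1.0, 1.0], [-1.0, -1.0]], top_k=2)
print(toy_sentence, toy_ranking)
```

Because the whole text is reduced to a single vector, the ranking no longer depends on any individual word being shared between the query and the document.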

In [16]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load BERT model
bert_model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed all documents once
doc_embeddings = bert_model.encode(documents, show_progress_bar=True)


In [17]:
queries = ["fruits", "vegetables", "healthy foods in Canada"]

def rank_bert(query_s, titles, docs, doc_embs, top_k=5):
    """Compute BERT similarity and return top-k most similar documents."""

    # Encode query
    query_emb = bert_model.encode([query_s])

    # Compare query vs document embeddings
    scores = cosine_similarity(query_emb, doc_embs).flatten()

    # Top-k highest scores
    top_idx = scores.argsort()[::-1][:top_k]

    results = []
    for idx in top_idx:
        results.append({
            "title": titles[idx],
            "score": float(scores[idx]),
            "snippet": docs[idx][:250] + ("..." if len(docs[idx]) > 250 else "")
        })
    return results

for q in queries:
    print("=" * 80)
    print(f"BERT — Query: {q!r}")
    ranked = rank_bert(q, titles, documents, doc_embeddings, top_k=5)
    for r in ranked:
        print(f"{r['title']} | score = {r['score']:.4f}")
        print(r["snippet"], "\n")

BERT — Query: 'fruits'
Food classes | score = 0.6041
To a botanist, a fruit is an entity that develops from the fertilized ovary of a flower. This means that tomatoes, squash, pumpkins, cucumbers, peppers, eggplants, corn kernels, and bean and pea pods are all fruits; so are apples, pears, peaches, apr... 

List of fruit dishes | score = 0.5436
Fruit dishes are those that use fruit as a primary ingredient. Condiments prepared with fruit as a primary ingredient are also included in this list. 

Tomatoes | score = 0.4571
Fresh Tomatoes from Anushka Avni International We have emerged as one of the reputed organization actively participating in exporting and suppling Red Tomatoes. Product Details: Taste enhancer Free from preservatives Pure Useful for chutney Packing :... 

fruit serving bowl | score = 0.4364
A fruit serving bowl is a round dish or container typically used to prepare and serve food. The interior of a bowl is characteristically shaped like a spherical cap, with the edges an

# Technique Comparison
***

1. TF-IDF on raw text

The raw TF-IDF approach relied entirely on exact word matching. As a result, documents were ranked highly only if they contained the same surface form as the query: for the query “fruits”, “fruit serving bowl” received a score of 0.0000, because the singular “fruit” is a different token from the plural “fruits”. Variation of this kind influenced the scores and pushed relevant documents lower in the ranking.
This method is straightforward, but it has no notion of meaning.

2. TF-IDF after preprocessing

Once the text was cleaned, the results became more consistent and more relevant. Lemmatization mapped the plural “fruits” to “fruit”, so the query now matched documents that use the singular form, and “List of fruit dishes” rose to the top with a score of 0.4542. Stopwords were already removed in the raw run, so the improvement comes mainly from lemmatization and from stripping non-letter characters.
Although the method still depends on matching specific words, the preprocessing step significantly improved the quality of the rankings.

3. Semantic similarity using embeddings

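The embedding-based methods delivered the most intuitive and meaningful results. Instead of focusing on exact words, they captured the conceptual similarity between the query and the documents. For example, for the query “fruits” the BERT-based model ranked “Tomatoes” third (0.4571), even though the displayed snippet never uses the word “fruit”.
This happens because embeddings model the relationships between words in a continuous vector space, so related terms lie close together and can contribute to the score without any term overlap. The two methods use that space differently. GloVe assigns each word a single static vector regardless of context, and its soft-cosine scores fell into a narrow, high range (0.81 to 0.88 for the top four “fruits” results), which makes the ranking less discriminative. The BERT-based sentence model encodes each text as a whole, so the same word can contribute differently depending on its context.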
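
One simple way to quantify how much the techniques agree is the overlap between their top-ranked titles. The sketch below uses the top four titles for the query 'fruits' as they appear in the outputs above (only four results are visible for each run); the Jaccard overlap and the helper name `jaccard` are choices made here for illustration, not part of the assignment.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two title lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Top-4 titles for the query 'fruits', copied from the outputs shown earlier.
top4_fruits = {
    "tfidf_clean": ["List of fruit dishes", "Food classes", "Pomegranate Bhagwa", "fruit serving bowl"],
    "glove": ["Food classes", "fruit serving bowl", "List of fruit dishes", "Pomegranate Bhagwa"],
    "bert": ["Food classes", "List of fruit dishes", "Tomatoes", "fruit serving bowl"],
}

overlaps = {
    (m1, m2): jaccard(top4_fruits[m1], top4_fruits[m2])
    for m1, m2 in combinations(top4_fruits, 2)
}
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

For this query, cleaned TF-IDF and GloVe retrieve the same four documents in a different order, while the BERT-based model swaps “Pomegranate Bhagwa” for “Tomatoes”. A set-based overlap ignores rank order, so it complements rather than replaces a look at the scores themselves.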

## Conclusion

TF-IDF is useful when the task involves direct keyword matching, and preprocessing clearly enhances its performance. However, it remains limited by its focus on surface-level word frequency.
In contrast, the embedding-based method provides a deeper and more flexible understanding of language. It can identify related ideas even when the vocabulary differs, resulting in more human-like and semantically aligned rankings.