# Lab 4: Document Representation and Similarity Measurement

Note: This lab session is graded. Complete all exercises and submit your work under **Canvas->Lab4** (https://utexas.instructure.com/courses/1382133/assignments/6619548) no later than **02/08/2023, 11:59PM**.

For extracting representations from text, we will be using the following libraries:

1. spaCy for text pre-processing (tokenization, lemmatization, stopword removal)
2. scikit-learn's vectorizers
3. Gensim's word vectors


References:
1. [ spaCy ] https://spacy.io/usage/processing-pipelines
2. [ Gensim ] https://radimrehurek.com/gensim/auto_examples/index.html
3. [ scikit-learn ] https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

## 1. Representing Documents with Bag-of-words approach

We are interested in featurizing (or vectorizing) a given text corpus so that we can compute document similarity or perform similarity-based search.

For example, given the query `query = "Context is captured better through deep learning based text encoders"`, we want to identify which documents in the text corpus are most similar to the query, and possibly rank them in order of similarity.

In [1]:
corpus = [
    # Technology
    "The latest smartphone model boasts a revolutionary camera system that enhances low-light photography.",
    "Artificial intelligence algorithms are reshaping industries by automating routine tasks and streamlining operations.",
    "Quantum computing holds the promise of solving complex problems exponentially faster than classical computers.",

    # Environment
    "Deforestation in the Amazon rainforest continues to pose a significant threat to biodiversity and indigenous communities.",
    "Renewable energy sources such as solar and wind power are crucial for reducing carbon emissions and combating climate change.",
    "Plastic pollution in oceans is a pressing environmental issue, with millions of marine animals suffering from ingestion or entanglement.",

    # Health
    "Vaccination campaigns are essential for preventing the spread of infectious diseases and achieving herd immunity.",
    "Mental health awareness initiatives aim to reduce stigma and promote access to support services for individuals struggling with psychological disorders.",
    "Regular exercise and a balanced diet are key components of maintaining a healthy lifestyle and preventing chronic illnesses like heart disease and diabetes."
]

query = "the latest advancements in artificial intelligence for image recognition"

## 1.1. Apply Pre-processing Techniques

Text pre-processing helps mitigate data sparsity and keeps the vocabulary small. We apply pre-processing to both the corpus and the query. If you are curious, you can disable pre-processing and check whether it actually helped.

We will apply the following basic preprocessing steps:

1. Lowercasing
2. Tokenization
3. Lemmatization
4. Stopword removal
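
As a rough, dependency-free illustration of three of these steps (lowercasing, tokenization, stopword removal), here is a minimal sketch. It is not the spaCy pipeline used in this lab: `simple_preprocess` and `TOY_STOPWORDS` are hypothetical names, the stopword list is a tiny hand-written one, and lemmatization is skipped entirely.

```python
import re

# Hypothetical mini stopword list, for illustration only (NLTK's list is much longer).
TOY_STOPWORDS = {"the", "a", "in", "for", "is", "and", "of", "to"}

def simple_preprocess(sentence, stopwords=TOY_STOPWORDS):
    """Lowercase, tokenize on runs of letters, and drop stopwords (no lemmatization)."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # Compare token *strings* against the stopword set.
    kept = [tok for tok in tokens if tok not in stopwords]
    return " ".join(kept)

print(simple_preprocess("The latest advancements in artificial intelligence for image recognition"))
# latest advancements artificial intelligence image recognition
```

Note that the membership test works here because both the tokens and the stopwords are plain strings.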


In [2]:
#%pip install spacy

In [3]:
import spacy
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

# get a list of stopwords from NLTK
stops = set(stopwords.words('english'))

# Load SpaCy English language model
# this is a pipeline capable of applying morphological, lexical and syntax analysis on text

nlp_pipeline = spacy.load("en_core_web_sm")

def pre_process_a_single_sentence(sentence):
  # Lower case text
  sentence = sentence.lower()

  processed_sentence = []

  # Tokenize, and lemmatize the text
  doc = nlp_pipeline(sentence)

  for token in doc:
    # here token is an object that contains various information about each token
    # information such as lemma, pos, parse labels are available

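    # we intend to check here whether tokens are stopwords and, if not, retain their lemma
    # NOTE: `token` is a spaCy Token object while `stops` holds strings, so this test
    # never matches and no stopwords are removed (see the outputs below);
    # comparing `token.text` against `stops` would apply the filter
    if token not in stops: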
      lemmatized_token = token.lemma_
      processed_sentence.append(lemmatized_token)
  processed_sentence = " ".join (processed_sentence)
  return processed_sentence

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now, apply the pre-processing steps to the corpus as well as the query.

In [4]:
pre_processed_corpus = [pre_process_a_single_sentence(sentence) for sentence in corpus]

pre_processed_query = pre_process_a_single_sentence(query)

# sanity check: print 5 processed documents

print (pre_processed_corpus[:5])
print (pre_processed_query)

['the late smartphone model boast a revolutionary camera system that enhance low - light photography .', 'artificial intelligence algorithm be reshape industry by automate routine task and streamline operation .', 'quantum computing hold the promise of solve complex problem exponentially fast than classical computer .', 'deforestation in the amazon rainforest continue to pose a significant threat to biodiversity and indigenous community .', 'renewable energy source such as solar and wind power be crucial for reduce carbon emission and combat climate change .']
the late advancement in artificial intelligence for image recognition


## 1.2. Extract Bag-of-words (BoW) representations / vectors based on the corpus

We construct the vocabulary by applying the `fit()` method to the corpus, and then use that vocabulary to build a BoW presence/absence vector for each text.
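
To make the idea concrete, here is a minimal plain-Python sketch of what fitting a vocabulary and building a presence/absence vector amounts to. The helper names `build_vocabulary` and `binary_bow` and the two-document toy corpus are made up for illustration, and the sketch splits on whitespace, which is a simplification of how `CountVectorizer` tokenizes (its default pattern also drops punctuation and single-character tokens).

```python
def build_vocabulary(documents):
    """Map each distinct whitespace-separated token to a column index (sorted order)."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def binary_bow(text, vocabulary):
    """Return a presence/absence vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok in text.split():
        if tok in vocabulary:
            vector[vocabulary[tok]] = 1
    return vector

toy_corpus = ["quantum computing hold promise", "plastic pollution in ocean"]
toy_vocab = build_vocabulary(toy_corpus)

print(toy_vocab)
# 'image' and 'recognition' never appeared in the toy corpus, so they leave no trace.
print(binary_bow("quantum image recognition", toy_vocab))
# [0, 0, 0, 0, 0, 0, 0, 1]
```

The second print shows why a query can only match documents through words that were seen when the vocabulary was built.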

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# Define the N for N-grams
N = 1  # Change this to the desired value for N-grams

# Initialize the CountVectorizer with N-gram range
vectorizer = CountVectorizer(ngram_range=(N, N), lowercase = False, binary = True)

# Fit the vectorizer on the corpus to build the vocabulary
vectorizer.fit(pre_processed_corpus)

# Retrieve the learned vocabulary
vocab = vectorizer.get_feature_names_out()

# check the vocabulary size
print (len(vocab))

# sanity check: check the list of vocabulary
print (vocab)

120
['access' 'achieve' 'aim' 'algorithm' 'amazon' 'and' 'animal' 'artificial'
 'as' 'automate' 'awareness' 'balanced' 'be' 'biodiversity' 'boast' 'by'
 'camera' 'campaign' 'carbon' 'change' 'chronic' 'classical' 'climate'
 'combat' 'community' 'complex' 'component' 'computer' 'computing'
 'continue' 'crucial' 'deforestation' 'diabete' 'diet' 'disease'
 'disorder' 'emission' 'energy' 'enhance' 'entanglement' 'environmental'
 'essential' 'exercise' 'exponentially' 'fast' 'for' 'from' 'health'
 'healthy' 'heart' 'herd' 'hold' 'illness' 'immunity' 'in' 'indigenous'
 'individual' 'industry' 'infectious' 'ingestion' 'initiative'
 'intelligence' 'issue' 'key' 'late' 'lifestyle' 'light' 'like' 'low'
 'maintain' 'marine' 'mental' 'million' 'model' 'ocean' 'of' 'operation'
 'or' 'photography' 'plastic' 'pollution' 'pose' 'power' 'press' 'prevent'
 'problem' 'promise' 'promote' 'psychological' 'quantum' 'rainforest'
 'reduce' 'regular' 'renewable' 'reshape' 'revolutionary' 'routine'
 'service' '

Once the vocabulary is built, we can convert any text into a BoW vector with the `transform()` method. Words that are not in the vocabulary are ignored.

In [6]:
bow_transformed_corpus = []

for sentence in pre_processed_corpus:
  transformed_vector = vectorizer.transform([sentence])
  bow_transformed_corpus.append(transformed_vector.toarray()[0])

bow_transformed_query = vectorizer.transform([pre_processed_query]).toarray()[0]

# sanity check : print a few items from the bow_transformed_corpus and bow_transformed_query
print ("Transformed Corpus Samples", bow_transformed_corpus[:5])
print ("Transformed Query", bow_transformed_query)

Transformed Corpus Samples [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 0]), array([0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0,

Now we can match the query vector with the corpus vectors in a pairwise manner to see which ones are most similar to the query vector.

In [7]:
# let's define a similarity function.
# we take 1 / euclidean_distance between two vectors as the similarity

import numpy as np
def euclidean_distance_based_similarity (vector1, vector2):
    """
    Compute a similarity score as the inverse of the Euclidean distance between two vectors.

    Parameters:
        vector1 (array-like): First vector.
        vector2 (array-like): Second vector.

    Returns:
        float: 1 / Euclidean distance between the two vectors (larger means more similar).
            Identical vectors have distance 0, so the score is infinite in that case.
    """
    return 1 / (np.linalg.norm(np.array(vector1) - np.array(vector2)))


similarity_scores = {}

for i, document_vector in enumerate(bow_transformed_corpus):
  sim = euclidean_distance_based_similarity(document_vector, bow_transformed_query)
  similarity_scores[i] = sim

ranked_documents = sorted(similarity_scores.items(),key = lambda x: x[1] ,reverse = True)

# Let's print the top 5 documents based on ranked score

print (f"Query: {query}")
for document_idx, score in ranked_documents[:5]:
  print (f"Document: {corpus[document_idx]}, Score: {score}")



Query: the latest advancements in artificial intelligence for image recognition
Document: The latest smartphone model boasts a revolutionary camera system that enhances low-light photography., Score: 0.2581988897471611
Document: Artificial intelligence algorithms are reshaping industries by automating routine tasks and streamlining operations., Score: 0.2581988897471611
Document: Deforestation in the Amazon rainforest continues to pose a significant threat to biodiversity and indigenous communities., Score: 0.25
Document: Vaccination campaigns are essential for preventing the spread of infectious diseases and achieving herd immunity., Score: 0.24253562503633297
Document: Quantum computing holds the promise of solving complex problems exponentially faster than classical computers., Score: 0.23570226039551587
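
These scores can be checked by hand. With presence/absence vectors every coordinate differs by either 0 or 1, so the squared Euclidean distance is simply the number of vocabulary terms that occur in exactly one of the two texts. The sketch below reproduces the score of the artificial intelligence document under that reading; `inverse_euclidean_binary` is a hypothetical helper, and the query terms are limited to those present in the fitted vocabulary ('advancement', 'image' and 'recognition' do not appear in the vocabulary listing above).

```python
import math

def inverse_euclidean_binary(terms_a, terms_b):
    """Inverse Euclidean distance between the binary BoW vectors of two term sets.

    Assumes the two sets differ in at least one term (otherwise the distance is 0).
    """
    mismatches = len(set(terms_a) ^ set(terms_b))  # terms found in exactly one text
    return 1 / math.sqrt(mismatches)

# Pre-processed second document, as printed earlier (punctuation dropped).
doc_terms = ("artificial intelligence algorithm be reshape industry by automate "
             "routine task and streamline operation").split()
# Query terms that are part of the fitted vocabulary.
query_terms = "the late in artificial intelligence for".split()

mismatched = len(set(doc_terms) ^ set(query_terms))
score = inverse_euclidean_binary(doc_terms, query_terms)
print(mismatched, round(score, 4))
# 15 0.2582
```

The document has 13 terms and the query 6, with 2 shared, which leaves 15 mismatched terms and a score of 1/√15. This also shows a weakness of the measure: a long document is penalized for every extra word it contains, whether or not it is relevant to the query.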


### Exercise E1. Repeat the above steps using the BoW-count method:

Hint: Use `binary = False`

`vectorizer = CountVectorizer(ngram_range=(N, N), lowercase = False, binary = False)`
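
As a small illustration of what changes between the two settings, the plain-Python sketch below counts terms instead of only recording their presence. The helper `count_bow` and the four-word vocabulary are hypothetical, and whitespace splitting stands in for the vectorizer's own tokenizer.

```python
from collections import Counter

def count_bow(text, vocabulary):
    """Return term counts for each vocabulary entry, in vocabulary order."""
    counts = Counter(tok for tok in text.split() if tok in vocabulary)
    return [counts[tok] for tok in vocabulary]

toy_vocabulary = ["biodiversity", "pose", "threat", "to"]  # hypothetical, tiny
text = "continue to pose a significant threat to biodiversity"

count_vector = count_bow(text, toy_vocabulary)
binary_vector = [min(c, 1) for c in count_vector]

print(count_vector)   # [1, 1, 1, 2]  ('to' occurs twice)
print(binary_vector)  # [1, 1, 1, 1]
```

The two representations only differ for texts in which some word repeats, so on short sentences the rankings may change very little.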


In [8]:
# Exercise E1
bow_c_vectorizer = CountVectorizer(ngram_range=(N, N), lowercase = False, binary = False)
bow_c_vectorizer.fit(pre_processed_corpus)

bow_c_vocab = bow_c_vectorizer.get_feature_names_out()

# now transform the corpus and the query
bow_c_transformed_corpus = []

for sentence in pre_processed_corpus:
  bow_c_transformed_vector = bow_c_vectorizer.transform([sentence])
  bow_c_transformed_corpus.append(bow_c_transformed_vector.toarray()[0])

bow_c_transformed_query = bow_c_vectorizer.transform([pre_processed_query]).toarray()[0]

#  print items to make sure it worked
print ("Transformed Corpus Samples", bow_c_transformed_corpus[:5])
print ("Transformed Query", bow_c_transformed_query)

Transformed Corpus Samples [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 0]), array([0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0,

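There's a 2 in the full output (it is cut off in the truncated display above), so the vectorizer is now counting terms rather than recording presence/absence. It worked.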

In [9]:
# part 2 of E1 - get similarity (C+P from above)
# we take 1 / euclidean_distance between two vectors as the similarity
c_similarity_scores = {}

for i, document_vector in enumerate(bow_c_transformed_corpus):
  s = euclidean_distance_based_similarity(document_vector, bow_c_transformed_query)
  c_similarity_scores[i] = s

c_ranked_documents = sorted(c_similarity_scores.items(),key = lambda x: x[1], reverse = True)

# Let's print the top 5 documents based on ranked score

print (f"Query: {query}")
for document_idx, score in c_ranked_documents[:5]:
  print (f"Document: {corpus[document_idx]}, Score: {score}")


Query: the latest advancements in artificial intelligence for image recognition
Document: The latest smartphone model boasts a revolutionary camera system that enhances low-light photography., Score: 0.2581988897471611
Document: Artificial intelligence algorithms are reshaping industries by automating routine tasks and streamlining operations., Score: 0.2581988897471611
Document: Vaccination campaigns are essential for preventing the spread of infectious diseases and achieving herd immunity., Score: 0.24253562503633297
Document: Quantum computing holds the promise of solving complex problems exponentially faster than classical computers., Score: 0.23570226039551587
Document: Deforestation in the Amazon rainforest continues to pose a significant threat to biodiversity and indigenous communities., Score: 0.22941573387056174


### Exercise E2: Repeat the above steps using TF-IDF method:

Hint: Use `TfidfVectorizer` instead of `CountVectorizer`

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Try different queries and explain your observations.
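
For intuition about what TF-IDF weighting does, here is a minimal plain-Python sketch. It follows the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, but it is only an approximation of `TfidfVectorizer`: tokenization is a plain whitespace split, and `fit_idf`, `tfidf_vector` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(documents):
    """Smoothed inverse document frequency for every token seen in the documents."""
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc.split()))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector (token -> weight), L2-normalized; unseen tokens are ignored."""
    counts = Counter(tok for tok in text.split() if tok in idf)
    weights = {tok: count * idf[tok] for tok, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else {}

toy_docs = [
    "the late smartphone model",
    "the artificial intelligence algorithm",
    "the quantum computing promise",
]
idf = fit_idf(toy_docs)
vec = tfidf_vector("the late artificial intelligence", idf)

print(round(idf["the"], 3), round(idf["late"], 3))  # 1.0 1.693
print({tok: round(w, 3) for tok, w in vec.items()})
```

A word such as "the" that occurs in every document gets the smallest weight, while rarer words count for more. Because every vector is scaled to unit length, scores computed from these vectors live on a different scale than scores computed from raw counts.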

In [10]:
# Exercise E2 - Use TF-IDF Method
from sklearn.feature_extraction.text import TfidfVectorizer
# mostly the same code as before; note that binary = True makes the term-frequency part 0/1
tfidf_vectorizer = TfidfVectorizer(ngram_range=(N, N), lowercase = False, binary = True)

# Fit the vectorizer on the corpus to learn the vocabulary and IDF weights
tfidf_vectorizer.fit(pre_processed_corpus)

# Retrieve the learned vocabulary
vocab = tfidf_vectorizer.get_feature_names_out()

# now transform the corpus and the query
tfidf_transformed_corpus = []

for sentence in pre_processed_corpus:
  tfidf_transformed_vector = tfidf_vectorizer.transform([sentence])
  tfidf_transformed_corpus.append(tfidf_transformed_vector.toarray()[0])

tfidf_transformed_query = tfidf_vectorizer.transform([pre_processed_query]).toarray()[0]

#  print items to make sure it worked
print ("Transformed Corpus Samples", tfidf_transformed_corpus[:5])
print ("Transformed Query", tfidf_transformed_query)

Transformed Corpus Samples [array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.28374061,
       0.        , 0.28374061, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.28374061, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.28374061,
       0.        , 0.28374061, 0.        , 0.28374061, 0.        ,
       0.        , 0.        , 0. 

In [11]:
# part 2 of E2 - get similarity (C+P from above)
# we take 1 / euclidean_distance between two vectors as the similarity
t_similarity_scores = {}

for i, document_vector in enumerate(tfidf_transformed_corpus):
  t_similarity_scores[i] = euclidean_distance_based_similarity(document_vector, tfidf_transformed_query)

t_ranked_documents = sorted(t_similarity_scores.items(),key = lambda x: x[1], reverse = True)

# Let's print the top 5 documents based on ranked score

print (f"Query: {query}")
for document_idx, score in t_ranked_documents[:5]:
  print (f"Document: {corpus[document_idx]}, Score: {score}")

Query: the latest advancements in artificial intelligence for image recognition
Document: Artificial intelligence algorithms are reshaping industries by automating routine tasks and streamlining operations., Score: 0.8284922651973035
Document: The latest smartphone model boasts a revolutionary camera system that enhances low-light photography., Score: 0.7839844906920026
Document: Deforestation in the Amazon rainforest continues to pose a significant threat to biodiversity and indigenous communities., Score: 0.7672970118364889
Document: Vaccination campaigns are essential for preventing the spread of infectious diseases and achieving herd immunity., Score: 0.7587451350178317
Document: Plastic pollution in oceans is a pressing environmental issue, with millions of marine animals suffering from ingestion or entanglement., Score: 0.7380234489141645


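The scores here are much higher than with count vectorization, but the two sets of scores are not directly comparable. `TfidfVectorizer` L2-normalizes every vector by default, so the Euclidean distance between any two non-negative unit-length vectors is at most √2 and the inverse-distance score cannot drop below about 0.707. The ranking is the more meaningful change: the artificial intelligence document now ranks first on its own instead of tying with the smartphone document.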

## 1.3. Extract Word Vectors based on GloVe and compute similarity between query and documents

- In all the above implementations, each word is mapped to a vocabulary index and each sentence to an N-hot (or count) vector.

- This does not effectively capture relationships between words and phrases.

- In principle, words should be "known by the company they keep". For example, the word "cat" should be more closely related to "dog" than to "Wednesday".

- We thus vectorize the corpus and the query using word embeddings, i.e., representations that capture the semantic association between words.

- Vectorization using word embeddings allows us to perform semantic search.

- We will use GloVe embeddings (http://nlp.stanford.edu/data/glove.6B.zip) as our source of pre-trained word embeddings.
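
The points above can be illustrated with a toy example before loading real embeddings. The 3-dimensional vectors below are invented for illustration and are not GloVe values; `average_embedding` and `inverse_euclidean` are hypothetical helpers that mirror, in plain Python, the averaging and similarity steps used in this lab.

```python
import math

# Invented 3-dimensional vectors for illustration only; these are NOT GloVe values.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "wednesday": [0.1, 0.0, 0.9],
}

def average_embedding(sentence, embeddings, dim=3):
    """Average the vectors of the known words; return a zero vector if none are known."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def inverse_euclidean(u, v):
    """Inverse Euclidean distance; assumes u and v are not identical."""
    return 1 / math.dist(u, v)

cat, dog, wednesday = (toy_embeddings[w] for w in ("cat", "dog", "wednesday"))
print(round(inverse_euclidean(cat, dog), 2))        # 5.77
print(round(inverse_euclidean(cat, wednesday), 2))  # 0.72
print(average_embedding("cat dog", toy_embeddings))
```

In a BoW representation "cat" and "dog" would be just as unrelated as "cat" and "Wednesday", since each occupies its own dimension. With dense vectors, related words can sit close together, and averaging gives a single vector for a whole sentence.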


In [12]:
%pip install gensim



Note: you may need to restart the kernel to use updated packages.


In [13]:
from gensim.models import KeyedVectors
import gensim.downloader as api

# Load pre-trained GloVe embeddings
word_vectors = api.load("glove-wiki-gigaword-50")

# Function to generate average word vectors for a sentence
def average_word_embeddings(sentence):
    words = sentence.split()
    embeddings = []
    for word in words:
        if word in word_vectors:
            embeddings.append(word_vectors[word])
    if len(embeddings) > 0:
        # at least one word in the sentence has a word vector
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(word_vectors.vector_size)



In [14]:
word_vector_transformed_corpus = []

for sentence in pre_processed_corpus:
  transformed_vector = average_word_embeddings(sentence)
  word_vector_transformed_corpus.append(transformed_vector)

word_vector_transformed_query = average_word_embeddings(pre_processed_query)

# sanity check : print a few items from word_vector_transformed_corpus and word_vector_transformed_query
print ("Word Vector Transformed Corpus Samples", word_vector_transformed_corpus[:5])
print ("Word Vector Transformed Query", word_vector_transformed_query)

Word Vector Transformed Corpus Samples [array([ 0.1861708 ,  0.23128964,  0.1962533 ,  0.07322185,  0.09015149,
       -0.01049213, -0.29616618, -0.6643501 ,  0.0113958 ,  0.25757927,
        0.2509965 , -0.06853063, -0.17922047,  0.22037098, -0.10514338,
        0.11703007, -0.368467  ,  0.1734872 , -0.31144908, -0.42928684,
        0.09016962,  0.04219813, -0.21167749, -0.18209696,  0.02590456,
       -1.3121846 , -0.4729219 ,  0.05249795,  0.16609696,  0.12275426,
        2.8859878 , -0.0463383 , -0.27610356, -0.39248237,  0.10850608,
        0.02044093, -0.02673562,  0.31013373, -0.21586457, -0.22675025,
        0.13021228,  0.16225356, -0.03704819,  0.01487388,  0.04129475,
       -0.01695319,  0.19756524, -0.21693762,  0.06474212, -0.01510743],
      dtype=float32), array([ 6.65031433e-01, -5.51267922e-01, -7.34305009e-02,  1.08917437e-01,
       -9.01242867e-02,  1.31470293e-01, -5.51946349e-02, -3.14387828e-01,
        4.34026450e-01, -1.80350065e-01,  1.47496939e-01,  2.392611

Now, let's use the vectors to compute the similarity between the query and the documents and see which documents are most similar to the query.

For similarity measurement, we will use the same `euclidean_distance_based_similarity()` function.

In [15]:
word_vector_based_similarity_scores = {}

for i, vector in enumerate(word_vector_transformed_corpus):
  sim = euclidean_distance_based_similarity(vector, word_vector_transformed_query)
  word_vector_based_similarity_scores[i] = sim

ranked_documents = sorted(word_vector_based_similarity_scores.items(),key = lambda x: x[1] ,reverse = True)

# Let's print the top 5 documents based on ranked score

print (f"Query: {query}")
for document_idx, score in ranked_documents[:5]:
  print (f"Document: {corpus[document_idx]}, Score: {score}")


Query: the latest advancements in artificial intelligence for image recognition
Document: The latest smartphone model boasts a revolutionary camera system that enhances low-light photography., Score: 0.7373770039611364
Document: Deforestation in the Amazon rainforest continues to pose a significant threat to biodiversity and indigenous communities., Score: 0.6037419499710674
Document: Mental health awareness initiatives aim to reduce stigma and promote access to support services for individuals struggling with psychological disorders., Score: 0.6014855932843852
Document: Vaccination campaigns are essential for preventing the spread of infectious diseases and achieving herd immunity., Score: 0.577208812792024
Document: Regular exercise and a balanced diet are key components of maintaining a healthy lifestyle and preventing chronic illnesses like heart disease and diabetes., Score: 0.5741223590009982


### Exercise E3: Repeat section 1.3 with `glove-wiki-gigaword-300`

Hint: Use `word_vectors = api.load("glove-wiki-gigaword-300")`

Do you see any improved results? What could be the reason behind getting better results? Comment.

Also feel free to try other queries and share your observations.

In [16]:
# Exercise E3
n_word_vectors = api.load("glove-wiki-gigaword-300")

# Function to generate average word vectors for a sentence
def average_word_embeddings(sentence):
    words = sentence.split()
    embeddings = []
    for word in words:
        if word in n_word_vectors:
            embeddings.append(n_word_vectors[word])
    if len(embeddings) > 0:
        # at least one word in the sentence has a word vector
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(n_word_vectors.vector_size)
    
word_vector_transformed_corpus = []
for sentence in pre_processed_corpus:
  transformed_vector = average_word_embeddings(sentence)
  word_vector_transformed_corpus.append(transformed_vector)

word_vector_transformed_query = average_word_embeddings(pre_processed_query)

# sanity check : print a few items from word_vector_transformed_corpus and word_vector_transformed_query
print ("Word Vector Transformed Corpus Samples", word_vector_transformed_corpus[:5])
print ("Word Vector Transformed Query", word_vector_transformed_query)

KeyboardInterrupt: 

In [None]:
word_vector_based_similarity_scores = {}

for i, vector in enumerate(word_vector_transformed_corpus):
  sim = euclidean_distance_based_similarity(vector, word_vector_transformed_query)
  word_vector_based_similarity_scores[i] = sim

ranked_documents = sorted(word_vector_based_similarity_scores.items(),key = lambda x: x[1] ,reverse = True)

# Let's print the top 5 documents based on ranked score

print (f"Query: {query}")
for document_idx, score in ranked_documents[:5]:
  print (f"Document: {corpus[document_idx]}, Score: {score}")


Query: the latest advancements in artificial intelligence for image recognition
Document: The latest smartphone model boasts a revolutionary camera system that enhances low-light photography., Score: 0.4489006370416574
Document: Artificial intelligence algorithms are reshaping industries by automating routine tasks and streamlining operations., Score: 0.4023113287992021
Document: Mental health awareness initiatives aim to reduce stigma and promote access to support services for individuals struggling with psychological disorders., Score: 0.40202061581758824
Document: Plastic pollution in oceans is a pressing environmental issue, with millions of marine animals suffering from ingestion or entanglement., Score: 0.3953461497640631
Document: Quantum computing holds the promise of solving complex problems exponentially faster than classical computers., Score: 0.39100700592005466


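The scores are lower than with the 50-dimensional vectors, by nearly 0.3 for the top document (0.449 vs. 0.737). That does not necessarily mean the results are worse: Euclidean distances tend to be larger in higher-dimensional spaces, so inverse-distance scores from 50- and 300-dimensional vectors are not directly comparable. Looking at the ranking instead, the artificial intelligence document moves from outside the top 5 to second place, which suggests the 300-dimensional embeddings match the query better.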
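
One possible reason for the lower scores is simply the number of dimensions. The sketch below is a rough illustration with random Gaussian vectors, not GloVe embeddings, so it only shows the general scaling effect: the typical Euclidean distance grows with dimensionality, and the inverse-distance score shrinks accordingly. `mean_pairwise_distance` is a hypothetical helper.

```python
import math
import random

def mean_pairwise_distance(dim, n_vectors=50, seed=0):
    """Average Euclidean distance between random standard-normal vectors of size `dim`."""
    rng = random.Random(seed)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_vectors)]
    distances = [
        math.dist(vectors[i], vectors[j])
        for i in range(n_vectors)
        for j in range(i + 1, n_vectors)
    ]
    return sum(distances) / len(distances)

for dim in (50, 300):
    distance = mean_pairwise_distance(dim)
    # Prints the dimension, the mean distance, and the corresponding inverse-distance score.
    print(dim, round(distance, 2), round(1 / distance, 3))
```

For this reason it is safer to compare the rankings produced by the two embedding sizes than to compare their raw scores.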