# Task 3 (25 points):

### In this task, use any of the pre-trained word embeddings. The Word2vec embedding link provided with the lecture notes can be useful to get started. Write your own code/function that uses these embeddings and outputs the cosine similarity and a dissimilarity score for any pair of words (read as user input). The dissimilarity score should be defined by you. You can either come up with your own idea for a dissimilarity score or refer to the literature (cite the paper you used). In either case, clearly describe how this score helps determine the dissimilarity between the two words.

### Note: Dissimilarity has been an important metric for recommender systems that try to introduce ‘Novelty and Diversity’ in assortments (as opposed to only accuracy). You might find different metrics of dissimilarity in the recommender-systems literature.

In [None]:
import numpy as np
import gensim.downloader as api

# Load the pre-trained Word2vec Google News vectors (300 dimensions; downloaded on first use)
word_vectors = api.load('word2vec-google-news-300')

def cosine_similarity(vector1, vector2):
    """Return the cosine of the angle between two non-zero vectors."""
    dot_product = np.dot(vector1, vector2)
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)
    similarity = dot_product / (norm_vector1 * norm_vector2)
    return similarity

def dissimilarity_score(word1, word2):
    """Return 1 - cosine similarity of the two words' embeddings (range 0 to 2)."""
    vector1 = word_vectors[word1]
    vector2 = word_vectors[word2]
    similarity = cosine_similarity(vector1, vector2)
    return 1 - similarity

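# Read the two words from the user
word1 = input("Enter first word: ").strip()
word2 = input("Enter second word: ").strip()

# The pre-trained vocabulary is finite, so check for unknown words before the lookup
missing = [word for word in (word1, word2) if word not in word_vectors]
if missing:
    print(f"Not in the embedding vocabulary: {missing}")
else:
    # Calculate the cosine similarity and the dissimilarity score
    similarity = cosine_similarity(word_vectors[word1], word_vectors[word2])
    dissimilarity = dissimilarity_score(word1, word2)

    print(f"Cosine Similarity between '{word1}' and '{word2}': {similarity}")
    print(f"Dissimilarity Score between '{word1}' and '{word2}': {dissimilarity}")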

Enter first word: hello
Enter second word: second
Cosine Similarity between 'hello' and 'second': 0.028079349547624588
Dissimilarity Score between 'hello' and 'second': 0.9719206504523754

The dissimilarity score I defined is 1 - cosine similarity, a quantity commonly called cosine distance. It converts cosine similarity into a measure where larger values mean the words are less alike. Cosine similarity itself measures how closely two vectors agree in direction, ranging from -1 to 1, where 1 indicates exactly the same direction, -1 indicates completely opposite directions, and 0 indicates orthogonality (no directional relationship).

When the vectors of two words point in exactly the same direction (cosine similarity is 1), the dissimilarity score is 0, indicating that the two words are very close or identical in semantics. When the vectors point in exactly opposite directions, the cosine similarity is -1 and the dissimilarity score reaches its maximum of 2. This case is rare in practice: Word2vec components can be positive or negative, but two 300-dimensional word vectors almost never point in exactly opposite directions. For most word pairs the cosine similarity falls between 0 and 1 (slightly negative values do occur), so the dissimilarity score usually lies between 0 and 1, with a higher score indicating greater semantic dissimilarity between the two words.
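The three boundary cases described above can be checked with a small sketch. It is self-contained and uses only the standard library; the 2-D vectors are toy values chosen for illustration, not real word embeddings, and the helper names are local to this example.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def dissimilarity(v1, v2):
    """Dissimilarity as defined above: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(v1, v2)

# Toy vectors (assumed for illustration only)
same = dissimilarity([1.0, 0.0], [2.0, 0.0])        # same direction, different length
orthogonal = dissimilarity([1.0, 0.0], [0.0, 3.0])  # perpendicular directions
opposite = dissimilarity([1.0, 0.0], [-1.0, 0.0])   # opposite directions

print(same, orthogonal, opposite)
```

The first pair also shows that the score ignores vector length: only the direction of the two vectors matters.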

### How It Helps Determine Dissimilarity
Using the complement of cosine similarity (1 minus the similarity) as the dissimilarity score gives a direct way to measure and compare the semantic difference between two words. This measurement is useful in several application scenarios, such as:

- Recommendation Systems: By calculating the dissimilarity between items, recommendation systems can ensure that the content recommended to users is not only similar to their known preferences but also includes items that offer novelty and diversity.
- Text Analysis: In text analysis, the dissimilarity score can help identify different topics or concepts within documents or corpora, exploring the diversity of textual content by comparing the dissimilarity of words.
- Semantic Search: In semantic search applications, the dissimilarity score can be used to exclude results that are not semantically related to the query, thereby improving the accuracy and relevance of search results.
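As an illustration of the recommendation use case, one possible way to score the diversity of a recommended list is to average the pairwise dissimilarity of its items. This is a minimal sketch, not a method taken from the task: the function name `intra_list_diversity` and the toy item vectors are assumptions made for the example, and in practice the vectors would come from item or word embeddings.

```python
import math
from itertools import combinations

def cosine_dissimilarity(v1, v2):
    """1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / (norm1 * norm2)

def intra_list_diversity(vectors):
    """Mean dissimilarity over all unordered pairs of items in the list."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two items: no pairs to compare
    return sum(cosine_dissimilarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy item embeddings (illustration only)
similar_items = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
mixed_items = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(intra_list_diversity(similar_items))
print(intra_list_diversity(mixed_items))
```

A list whose items all point in the same direction scores 0, while a list with more varied items scores higher, so the score could be used to compare candidate recommendation lists.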