# **Similar Words**

This notebook searches for the words most similar to a specified word. The word is first split into tokens by the GPT-2 tokenizer, and the embedding of its first token is looked up in the model's token-embedding matrix (`model.wte.weight`) to obtain a vector. The same procedure is applied to the 50,000 most common English words (the cutoff is arbitrary). The vector of the specified word is then compared with every vector in that set using cosine similarity. Sorting the words by this score in descending order yields the list of most similar words, which is stored in a dataframe together with the scores.
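As a minimal sketch of the ranking step described above, the snippet below scores a few words against a query by cosine similarity (the dot product of two vectors divided by the product of their norms) and sorts them in descending order. The three-dimensional vectors and the names `toy_vectors`, `cosine` and `rank_by_similarity` are made up for illustration; they are not GPT-2 embeddings or part of the notebook.

```python
import numpy as np

# Hypothetical toy vectors standing in for word embeddings (not GPT-2 values)
toy_vectors = {
    "banana": np.array([1.0, 0.0, 0.0]),
    "mango": np.array([0.9, 0.1, 0.0]),
    "car": np.array([0.0, 1.0, 0.0]),
    "monkey": np.array([0.5, 0.5, 0.0]),
}

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_similarity(query, vectors):
    # Score every other word against the query, most similar first
    scores = {w: cosine(vectors[query], v) for w, v in vectors.items() if w != query}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank_by_similarity("banana", toy_vectors)
print([word for word, _ in ranking])
```

With real embeddings only the source of the vectors changes; the scoring and sorting stay the same.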

In [3]:
from transformers import GPT2Tokenizer, GPT2Model
from numpy import dot
from numpy.linalg import norm
import pandas as pd

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

In [4]:
# Compute the cosine similarity between a given word and each word in a list
def cosine_similarity(base_word,llista_paraules):
  # Look up each word's token embeddings in GPT-2's embedding matrix (model.wte);
  # the tokenizer only supplies the token ids, with a leading space added to each word
  base_word = model.wte.weight[tokenizer.encode(base_word,add_prefix_space=True),:]
  words = [model.wte.weight[tokenizer.encode(word,add_prefix_space=True),:] for word in llista_paraules]

  # Cosine similarity between the given word's vector and each vector of the word list.
  # Only the first token's embedding ([0]) is used, so a word that splits into
  # several tokens is represented by its first sub-word alone.
  cos_list = []
  base_word = base_word[0].detach().numpy()
  for word in words:
    word = word[0].detach().numpy()
    cos_list.append(dot(base_word, word)/(norm(base_word)*norm(word)))

  return cos_list
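The loop above can also be expressed as a single matrix operation. The following is a sketch rather than the notebook's method: it assumes the first-token embeddings have already been stacked into a 2-D NumPy array with one row per word, and `cosine_to_all` is a hypothetical helper name. A random matrix stands in for the real embeddings so that the snippet runs without downloading the model.

```python
import numpy as np

def cosine_to_all(base_vec, word_matrix):
    # Normalise the query and every row, then a single matrix-vector
    # product gives all cosine similarities at once
    base_unit = base_vec / np.linalg.norm(base_vec)
    row_norms = np.linalg.norm(word_matrix, axis=1, keepdims=True)
    return (word_matrix / row_norms) @ base_unit

# Random stand-ins for embeddings: 5 words, 768 dimensions (GPT-2 small's width)
rng = np.random.default_rng(0)
word_matrix = rng.normal(size=(5, 768))
base_vec = rng.normal(size=768)

vectorised = cosine_to_all(base_vec, word_matrix)

# Reference values computed one word at a time, as in the loop version
looped = [
    np.dot(base_vec, w) / (np.linalg.norm(base_vec) * np.linalg.norm(w))
    for w in word_matrix
]
print(np.allclose(vectorised, looped))
```

For 50,000 words this avoids a Python-level loop over the vocabulary, at the cost of holding the whole embedding matrix in memory at once.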

In [5]:
# Example
# Words with a higher cosine similarity, such as "blue", "red" or "color", are closer to "yellow" than words such as "car" or "happy"
paraules = ['blue','color','car','red','happy']
cosine_similarity('yellow',paraules)

[0.652698, 0.44690767, 0.2955696, 0.63572675, 0.26336986]
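Because the function keeps only the first token's embedding, any two words whose tokenization starts with the same token receive identical scores. The sketch below illustrates this with a hypothetical sub-word split and made-up two-dimensional vectors (neither is taken from GPT-2), and contrasts it with averaging all token vectors, which is one common alternative and not what this notebook does.

```python
import numpy as np

# Hypothetical sub-word splits and embeddings, for illustration only
toy_tokens = {"fruit": ["fruit"], "fruitless": ["fruit", "less"]}
toy_embedding = {"fruit": np.array([1.0, 0.0]), "less": np.array([0.0, 1.0])}

def first_token_vector(word):
    # Mirrors the notebook's function: keep the first token only
    return toy_embedding[toy_tokens[word][0]]

def mean_token_vector(word):
    # Alternative: average the vectors of all tokens in the word
    return np.mean([toy_embedding[t] for t in toy_tokens[word]], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([1.0, 0.0])
first = [cosine(query, first_token_vector(w)) for w in toy_tokens]
pooled = [cosine(query, mean_token_vector(w)) for w in toy_tokens]
print(first)   # both words get the same score
print(pooled)  # the two words are now distinguished
```

This is plausibly why near-duplicates that share a prefix can end up next to each other when the full vocabulary is ranked.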

In [39]:
# Create the word dataframe from the csv
df = pd.read_csv('data/unigram_freq.csv', on_bad_lines='skip', header=None)
word_df = df[0][:50000]
word_df = pd.DataFrame(data= {'word': word_df})

# Delete rows whose word was parsed as NaN
word_df = word_df.dropna()

# Define the word to analyze and drop it from the dataframe
base_word = 'banana'
word_df = word_df.drop(word_df.loc[word_df['word'] == base_word].index.to_list())

# Compute the cosine similarity between the defined word and the words of the dataframe
cos_list = cosine_similarity(base_word,word_df['word'])

# Add the cosine similarity value list to the dataframe
word_df['cosine similarity'] = cos_list

In [40]:
# Print the first one hundred words of the sorted list
word_df.sort_values(by=['cosine similarity'],ascending=False,ignore_index=True)['word'][:100].to_list()

['bananas',
 'mango',
 'pineapple',
 'strawberry',
 'strawberrynet',
 'avocado',
 'coconut',
 'peach',
 'potato',
 'peanut',
 'tomato',
 'strawberries',
 'fruit',
 'fruitless',
 'apple',
 'appleby',
 'cocoa',
 'gorilla',
 'chocolate',
 'cucamonga',
 'cucumbers',
 'cucina',
 'cucumber',
 'lemon',
 'lemonade',
 'monkey',
 'pumpkin',
 'raspberry',
 'cinnamon',
 'oranges',
 'chicken',
 'yogurt',
 'almond',
 'potatoes',
 'bacon',
 'tomatoes',
 'orangeburg',
 'orange',
 'fruity',
 'caramel',
 'carrot',
 'carrots',
 'citrus',
 'vegetable',
 'burger',
 'rice',
 'maize',
 'spinach',
 'bunnyteens',
 'bunny',
 'grape',
 'grapevine',
 'grapefruit',
 'broccoli',
 'pizza',
 'coffee',
 'coffeehouse',
 'tropical',
 'mushroom',
 'fructose',
 'peanuts',
 'pancakes',
 'berries',
 'jungle',
 'monkeys',
 'beanies',
 'bean',
 'beanie',
 'sandwich',
 'fascist',
 'garlic',
 'vegetables',
 'almonds',
 'biscotti',
 'biscayne',
 'biscuit',
 'vanilla',
 'frog',
 'peel',
 'dessert',
 'roasted',
 'bamboo',
 'tuna',