<a href="https://colab.research.google.com/github/PalemSandeepSrinivas/Word-Embedding/blob/main/Word_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

## Create a data folder and loading packages

In [2]:
# Make data directory if it doesn't exist
# Fasttext and Glove data Links

# https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
# https://nlp.stanford.edu/data/glove.6B.zip

# Making a Directory with name data and loding file into it
!mkdir -p data
!wget -nc https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip -P data
!wget -nc https://nlp.stanford.edu/data/glove.6B.zip -P data

--2023-07-08 13:09:15--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.35.7.38, 13.35.7.128, 13.35.7.82, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.35.7.38|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘data/wiki-news-300d-1M.vec.zip’


2023-07-08 13:09:22 (99.5 MB/s) - ‘data/wiki-news-300d-1M.vec.zip’ saved [681808098/681808098]

--2023-07-08 13:09:22--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-07-08 13:09:23--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.

## Unzip the two file on to the local space

In [3]:


import zipfile

fast_text_path = "/content/data/wiki-news-300d-1M.vec.zip"
glove_text_path = "/content/data/glove.6B.zip"

with zipfile.ZipFile(fast_text_path, 'r') as zip_ref:
    zip_ref.extractall("/content/")

with zipfile.ZipFile(glove_text_path,'r') as zip_ref:
    zip_ref.extractall("/content/")

## Loading the model into variables

In [4]:
from gensim.models import KeyedVectors

# Load FastText word embeddings
fasttext_model = KeyedVectors.load_word2vec_format('/content/wiki-news-300d-1M.vec', binary=False)

# Load GloVe word embeddings
glove_model = KeyedVectors.load_word2vec_format('/content/glove.6B.300d.txt', binary=False,no_header=True)

## Lowering the words to small case and applying the Fast_text model on to the words

In [9]:
# Get the word vectors

word1 = "Microsoft".lower()
word2 = "Facebook".lower()
word3 = "Dolphin".lower()

# First Using Fast_text

microsoft_vector_fasttext = fasttext_model[word1]
facebook_vector_fasttext = fasttext_model[word2]
Dolhin_vector_fasttext = fasttext_model[word3]
# print(microsoft_vector_fasttext,facebook_vector_fasttext)

## Applying the Glove model on to the words

In [10]:
# Using Glove

microsoft_vector_glove = glove_model[word1]
facebook_vector_glove = glove_model[word2]
Dolhpin_vector_fasttext = glove_model[word3]
# print(microsoft_vector_glove,facebook_vector_glove)

##  Calculate cosine similarity between Microsoft and Facebook

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim_fasttext = cosine_similarity([microsoft_vector_fasttext], [facebook_vector_fasttext])[0][0]
cosine_sim_glove = cosine_similarity([microsoft_vector_glove], [facebook_vector_glove])[0][0]

# Print the cosine similarity scores
print("Cosine similarity using FastText:", round(cosine_sim_fasttext,4))
print()
print("Cosine similarity using GloVe:", round(cosine_sim_glove,4))


Cosine similarity using FastText: 0.4398

Cosine similarity using GloVe: 0.4328


##  Calculate cosine similarity between Microsoft and Dolphin

In [12]:
cosine_sim_fasttext = cosine_similarity([microsoft_vector_fasttext], [Dolhpin_vector_fasttext])[0][0]
cosine_sim_glove = cosine_similarity([microsoft_vector_glove], [Dolhpin_vector_fasttext])[0][0]

# Print the cosine similarity scores
print("Cosine similarity using FastText:", round(cosine_sim_fasttext,4))
print()
print("Cosine similarity using GloVe:", round(cosine_sim_glove,4))

Cosine similarity using FastText: -0.0054

Cosine similarity using GloVe: -0.0433


## Insights between Microsoft and Facebook

* Similarity Scores: The cosine similarity scores indicate the similarity between the word vectors of 'Microsoft' and 'Facebook' in the respective embedding spaces. A cosine similarity score ranges from -1 to 1, where a score closer to 1 indicates higher similarity and a score closer to -1 indicates dissimilarity.

* Similarity Comparison: Both FastText and GloVe embeddings show relatively high cosine similarity scores for 'Microsoft' and 'Facebook'. The scores of 0.4398 (FastText) and 0.4328 (GloVe) suggest that there is some level of similarity between the word vectors of these two words.

## Insights between Microsoft and Dolphin


* The cosine similarity scores for both FastText and GloVe are close to zero, indicating a weak similarity or dissimilarity between the words 'Microsoft' and 'Dolphin'.
* The negative values suggest that the vectors of these words have an opposite or orthogonal direction in the embedding space.
* Since 'Microsoft' and 'Dolphin' are unrelated words, it aligns with our expectation that they would have low cosine similarity scores.
* FastText and GloVe are different word embedding models trained on different data sources, so their cosine similarity scores may vary.

* Based on these insights, we can conclude that 'Microsoft' and 'Dolphin' are dissimilar words according to the embedding models, and they do not exhibit any significant semantic or contextual similarity.