# Word Embeddings (Word2Vec, Sent2Vec, and Doc2Vec)

## Due date

April 30, 2018

## Assignment description

In this assignment, you will implement a semantic search engine using the word2vec algorithm. You will use pre-trained word embeddings and build a search engine that can retrieve documents related to a given query based on semantic similarity.

### Objective

1. Familiarize yourself with the word2vec algorithm: Start by reading about the word2vec algorithm and its applications in NLP. You can use the resources provided in the course or search for additional materials online.

2. Choose a pre-trained word embedding model: There are many pre-trained word embedding models available online, such as Google's Word2Vec, Stanford's GloVe, and Facebook's fastText. Choose one that you find suitable for your task and download it. See the lecture notebooks for links to code that can be used to load the models.

3. Preprocess the data: Use the news dataset on which you performed Exploratory Data Analysis (EDA) in the previous assignment as the document collection for your search engine.

4. Map the documents to vectors: Use the pre-trained word embedding model to map the words in each document to vectors. You can do this by averaging the vectors of the individual words in each document or using a more sophisticated technique such as doc2vec.

5. Implement the search engine: Given a query, map it to a vector using the same technique you used for the documents. Then, retrieve the documents that are most similar to the query vector based on cosine similarity or another distance metric.

6. Write a brief summary of your algorithm and document its usage with some examples.

### Outcomes

The student will be able to:

1. Implement a semantic search engine using word embeddings.
2. Use pre-trained word embedding models.
3. Map documents to vectors using word embeddings.
4. Discover how cosine similarity can be used to cluster documents.

## Submission medium

A well-documented Jupyter notebook.

## Dataset

The dataset used in this assignment is the same as the one used in the EDA assignment. That is, the input for this assignment is the output you created in the EDA assignment. You can download the preprocessed dataset from the following link:

In [1]:
import pandas as pd
import numpy as np

data_source = 'https://raw.githubusercontent.com/JamesMTucker/DATA_340_NLP/master/Notebooks/data/news-2023-02-01.csv'

articles = pd.read_csv(data_source)

### Dataset description

In [8]:
articles.head()

| Unnamed: 0 | source | title | text |
| --- | --- | --- | --- |
| 0 | politicususa | Prosecutors Pay Attention: Stormy Daniels Than... | Manhattan prosecutors are likely to notice tha... |
| 1 | politicususa | Investigators Push For Access To Trump Staff C... | Print\nInvestigators looking into Donald Trump... |
| 2 | politicususa | The End Is Near For George Santos As He Steps ... | The AP reported:\nRepublican Rep. George Santo... |
| 3 | politicususa | Rachel Maddow Cuts Trump To The Bone With Stor... | Rachel Maddow showed how Trump committed a cri... |
| 4 | vox | Alec Baldwin has been formally charged with in... | Candles are placed in front of a photo of cine... |


## Preprocessing

Clean, deduplicate, and tokenize the documents. You should be able to repurpose your code from the EDA assignment to do this.
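As a rough illustration of the deduplicate and tokenize steps, the following minimal sketch works on a plain list of strings rather than the DataFrame. The helper names `tokenize` and `deduplicate` and the sample sentences are hypothetical and are not part of the EDA assignment code.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def deduplicate(texts):
    """Drop exact duplicates, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique

# Made-up sample documents (assumption: not taken from the news dataset).
raw_docs = [
    "Investigators push for access to staff.",
    "Investigators push for access to staff.",
    "The Senate voted on Tuesday!",
]
tokenized_docs = [tokenize(doc) for doc in deduplicate(raw_docs)]
```

On a DataFrame, the same idea would typically be expressed with `drop_duplicates` on the text column followed by applying a tokenizer to each row.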

In [13]:
# Drop rows with duplicate article text, keeping the first occurrence.
# .copy() makes df2 an independent DataFrame, which avoids SettingWithCopyWarning.
df2 = articles.drop_duplicates(subset='text', keep='first').copy()
df2 = df2.drop(index=40)  # remove the row with index label 40
df2 = df2.reset_index(drop=True)


0       Manhattan prosecutors are likely to notice tha...
1       Print\nInvestigators looking into Donald Trump...
2       The AP reported:\nRepublican Rep. George Santo...
3       Rachel Maddow showed how Trump committed a cri...
4       Candles are placed in front of a photo of cine...
                              ...                        
1057    White House bids farewell to Klain, as Zients ...
1058    Lawmakers clash over allowing guns in Natural ...
1059    Pizza Shop Employee Gets Rude Awakening After ...
1060    President Joe Biden boards Air Force One at th...
1061    Russian politicians and companies offer reward...
Name: text, Length: 1062, dtype: object

In [31]:
## YOUR CODE HERE
import re

def clean_text(text):
    """Lowercase the text, strip HTML tags, and collapse non-alphanumeric runs to spaces."""
    text = text.lower()
    text = re.sub(r"<[^>]*>", "", text)  # remove HTML tags
    text = re.sub(r"[^a-z0-9]+", " ", text)  # replace runs of non-alphanumeric characters with a space
    return text.strip()

# Cast to str first so missing values do not break clean_text.
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised.
df2 = df2.assign(text=df2['text'].astype(str).apply(clean_text))


## Word embeddings

Load the pre-trained word embedding model. You can use the code provided in the lecture notebooks to load the model. Vectorize the documents using the pre-trained word embedding model. You can do this by averaging the vectors of the individual words in each document or using a more sophisticated technique such as doc2vec (see SpaCy and Gensim packages).
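The averaging approach mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_vectors` is an invented 3-dimensional vocabulary standing in for a real pre-trained model, and `document_vector` is a hypothetical helper name. Tokens missing from the vocabulary are simply skipped.

```python
# Toy 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "court": [0.9, 0.1, 0.0],
    "judge": [0.8, 0.2, 0.0],
    "pizza": [0.0, 0.1, 0.9],
}

def document_vector(tokens, vectors, dim=3):
    """Average the vectors of in-vocabulary tokens; return zeros if none match."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all token vectors.
    return [sum(component) / len(known) for component in zip(*known)]

# "the" and "left" are out of vocabulary and do not contribute to the average.
doc_vec = document_vector(["the", "judge", "left", "court"], toy_vectors)
```

With a gensim `KeyedVectors` object, the same pattern should carry over, since it supports both `token in kv` and `kv[token]`.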

In [37]:
## YOUR CODE HERE
from gensim import corpora, models, similarities

# Tokenize each cleaned article once by splitting on whitespace.
tokenized_docs = [doc.split() for doc in df2['text']]

dictionary = corpora.Dictionary(tokenized_docs)  # map each unique token to an integer id
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]  # bag-of-words vectors

In [41]:
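tfidf = models.TfidfModel(corpus)  # fit TF-IDF weights on the bag-of-words corpus
tfidf_corpus = tfidf[corpus]  # re-weight every document with TF-IDF
lsi = models.LsiModel(tfidf_corpus, id2word=dictionary)  # LSI projection (gensim default: num_topics=200)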

In [42]:
lsi_corpus = lsi[tfidf_corpus]
# MatrixSimilarity uses cosine similarity; see
# https://tedboy.github.io/nlps/generated/generated/gensim.similarities.MatrixSimilarity.html
index_lsi = similarities.MatrixSimilarity(lsi_corpus)


## Search engine

Write a search engine that can retrieve documents related to a given query based on semantic similarity. Given a query, map it to a vector using the same technique you used for the documents. Then, retrieve the documents that are most similar to the query vector based on cosine similarity or another distance metric.
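The retrieval step described above can be sketched in plain Python as follows. The function names `cosine_similarity` and `rank_documents` and the toy vectors are assumptions made for illustration. A real implementation would operate on the document vectors produced in the previous section.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (document index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy document vectors and a toy query vector (invented values).
doc_vecs = [[0.85, 0.15, 0.0], [0.0, 0.1, 0.9], [0.5, 0.5, 0.1]]
ranking = rank_documents([0.9, 0.1, 0.0], doc_vecs, top_k=2)
```

Here the query points in nearly the same direction as the first document, so that document ranks first, followed by the third.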

In [48]:
## YOUR CODE HERE
def find_similar_docs(text, num_results=10):
    """Return the num_results documents most similar to text as (document, score) pairs."""
    # Send the query through the same pipeline used to build the index.
    text = clean_text(text)
    vec_bow = dictionary.doc2bow(text.split())
    vec_tfidf = tfidf[vec_bow]
    vec_lsi = lsi[vec_tfidf]
    sims = index_lsi[vec_lsi]  # cosine similarity to every indexed document
    ranked = sorted(enumerate(sims), key=lambda item: item[1], reverse=True)

    # Slicing avoids an IndexError when num_results exceeds the number of documents.
    return [(df2['text'].iloc[index], score) for index, score in ranked[:num_results]]

In [49]:
# The top result should be the first document itself, with a cosine similarity of about 1.0.
find_similar_docs(df2['text'][0])

[('manhattan prosecutors are likely to notice that stormy daniels has publicly thanked trump for posting confirmation of her story about the illegal hush money payments daniels tweeted thanks for just admitting that i was telling the truth about everything guess i ll take my horse face back to bed now mr former president btw that s the correct way to use quotation marks pic twitter com vsg867kwk8 as the manhattan district attorney is presenting evidence in front of a criminal grand jury about hush money payments donald trump was on his social media platform confirming that the story is true trump is trying to claim that the whole illegal plot was michael cohen s fault and the use of the statement advice of counsel is a signal that he is going to be trying to throw lawyers under the bus donald trump always thinks he is his best pr person and legal counsel but every time he makes a public statement he digs the hole deeper trump is also his own worst enemy because if there were ever a tim

## Extra credit

Based on the results of your search engine, write a k-means clustering algorithm that clusters the documents into groups based on their semantic similarity, and report some topic words that describe each cluster. As tips, look into k-means++, DBSCAN, and agglomerative clustering. For choosing the number of clusters, see, for example, this blog post on the silhouette method: https://towardsdatascience.com/silhouette-method-better-than-elbow-method-to-find-optimal-clusters-378d62ff6891
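As a starting point, here is a minimal sketch of plain k-means (Lloyd's algorithm) with random initialization. It is not k-means++ and does not extract topic words. The function names and the 2-dimensional toy points are assumptions for illustration. It uses squared Euclidean distance, which for unit-length vectors orders points the same way as cosine similarity.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids from k randomly chosen points (plain k-means, not k-means++).
    centroids = [list(point) for point in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Two well-separated toy groups (invented values standing in for document vectors).
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(points, k=2)
```

The cluster numbering depends on the random initialization, so only the grouping of points is meaningful, not the label values themselves.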

In [None]:
## YOUR CODE HERE