# Word Embeddings (Word2Vec, Sent2Vec, and Doc2Vec)

## Due date

April 30, 2018

## Assignment description

In this assignment, you will implement a semantic search engine using the word2vec algorithm. You will use pre-trained word embeddings and build a search engine that can retrieve documents related to a given query based on semantic similarity.

### Objective

1. Familiarize yourself with the word2vec algorithm: Start by reading about the word2vec algorithm and its applications in NLP. You can use the resources provided in the course or search for additional materials online.

2. Choose a pre-trained word embedding model: There are many pre-trained word embedding models available online, such as Google's Word2Vec, Stanford's GloVe, and Facebook's fastText. Choose one that you find suitable for your task and download it. See the lecture notebooks for links to code that can be used to load the models.

3. Preprocess the data: Use the news dataset on which you performed exploratory data analysis (EDA) in the previous assignment as the document collection for your search engine.

4. Map the documents to vectors: Use the pre-trained word embedding model to map the words in each document to vectors. You can do this by averaging the vectors of the individual words in each document or using a more sophisticated technique such as doc2vec.

5. Implement the search engine: Given a query, map it to a vector using the same technique you used for the documents. Then, retrieve the documents that are most similar to the query vector based on cosine similarity or another distance metric.

6. Write a brief summary of your algorithm and document its usage with some examples.

### Outcomes

The student will be able to:

1. Implement a semantic search engine using word embeddings.
2. Use pre-trained word embedding models.
3. Map documents to vectors using word embeddings.
4. Discover how cosine similarity can be used to cluster documents.

## Submission medium

A well-documented Jupyter notebook.

## Dataset

The dataset used in this assignment is the same as the one used in the EDA assignment. That is, the input for this assignment is the output you created in the EDA assignment. You can download the preprocessed dataset from the following link:

In [2]:
import pandas as pd
import numpy as np

data_source = 'https://raw.githubusercontent.com/JamesMTucker/DATA_340_NLP/master/Notebooks/data/news-2023-02-01.csv'

articles = pd.read_csv(data_source)

### Dataset description

In [4]:
articles.describe()

| | source | title | text |
|---|---|---|---|
| count | 11587 | 11586 | 11419 |
| unique | 20 | 716 | 1062 |
| top | politicususa | Nicolle Wallace Devastates Trump And Shows Why... | Contact Us\nThis material may not be published... |
| freq | 720 | 127 | 698 |


## Preprocessing

Clean, deduplicate, and tokenize the documents. You should be able to repurpose your code from the EDA assignment to do this.
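
The `describe()` output above shows far fewer unique texts (1,062) than non-null rows (11,419), so deduplication matters for this dataset. As a minimal sketch, assuming that simple lowercase regex tokenization is acceptable for your embedding model, the three steps could look like the following. The names `tokenize` and `preprocess` and the sample sentences are hypothetical and not part of the course code.

```python
import re

# Matches runs of letters/digits, optionally followed by an apostrophe suffix ("don't").
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def preprocess(texts):
    """Drop missing and duplicate documents, then tokenize the rest."""
    seen = set()
    unique_texts, token_lists = [], []
    for text in texts:
        if not isinstance(text, str):  # skips NaN / None entries
            continue
        # Normalize whitespace and case so near-identical copies compare equal.
        key = " ".join(text.split()).lower()
        if not key or key in seen:
            continue
        seen.add(key)
        unique_texts.append(text)
        token_lists.append(tokenize(text))
    return unique_texts, token_lists

# Made-up sample documents for illustration.
sample = [
    "The Senate voted today.",
    "the senate  voted today.",   # duplicate up to case and whitespace
    None,                         # missing text
    "Markets rallied on Friday.",
]
docs, tokens = preprocess(sample)
```

On the real data you could pass `articles['text']` to `preprocess`. The most frequent text in the table above is a "Contact Us" boilerplate page, so you may also want to filter out such non-article rows before tokenizing.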

In [5]:
## YOUR CODE HERE

## Word embeddings

Load the pre-trained word embedding model using the code provided in the lecture notebooks, then vectorize the documents with it. You can do this by averaging the vectors of the individual words in each document or by using a more sophisticated technique such as doc2vec (see the spaCy and Gensim packages).
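
The averaging approach can be sketched as follows. This is an illustration under stated assumptions, not the course's reference solution: `toy_vectors` is an invented 3-dimensional lookup table standing in for a real pre-trained model, and `document_vector` is a hypothetical helper name. Out-of-vocabulary words are simply skipped, which is one common choice among several.

```python
import numpy as np

# Toy 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "senate": np.array([0.9, 0.1, 0.0]),
    "vote":   np.array([0.8, 0.2, 0.1]),
    "market": np.array([0.1, 0.9, 0.2]),
    "stocks": np.array([0.0, 0.8, 0.3]),
}

def document_vector(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "unknownword" is not in the vocabulary, so only two vectors are averaged.
doc_vec = document_vector(["senate", "vote", "unknownword"], toy_vectors, dim=3)
```

A Gensim `KeyedVectors` object supports the same `word in vectors` and `vectors[word]` operations, so it should be usable in place of the toy dictionary, with `dim` set to its `vector_size`.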

In [6]:
## YOUR CODE HERE

## Search engine

Write a search engine that can retrieve documents related to a given query based on semantic similarity. Given a query, map it to a vector using the same technique you used for the documents. Then, retrieve the documents that are most similar to the query vector based on cosine similarity or another distance metric.
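
The retrieval step described above can be sketched with NumPy alone. The function name `cosine_search`, the `top_k` parameter, and the toy vectors are assumptions made for this example. Documents with an all-zero vector (for instance, those with no in-vocabulary words) are given a score of 0 rather than causing a division by zero.

```python
import numpy as np

def cosine_search(query_vec, doc_matrix, top_k=3):
    """Return (row index, cosine similarity) pairs for the top_k most similar rows."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    denom = doc_norms * query_norm
    # Divide only where the denominator is non-zero; other scores stay at 0.
    scores = np.divide(
        doc_matrix @ query_vec,
        denom,
        out=np.zeros(len(doc_matrix)),
        where=denom > 0,
    )
    order = np.argsort(-scores)[:top_k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy document vectors (one row per document), invented for illustration.
doc_matrix = np.array([
    [0.85, 0.15, 0.05],   # politics-like document
    [0.05, 0.85, 0.25],   # finance-like document
    [0.00, 0.00, 0.00],   # document with no known words
])
query_vec = np.array([0.9, 0.1, 0.0])
results = cosine_search(query_vec, doc_matrix, top_k=3)
```

In the notebook, `doc_matrix` would be the stacked document vectors and `query_vec` the vector of the tokenized query, built with the same averaging technique as the documents.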

In [7]:
## YOUR CODE HERE

## Extra credit

Based on the results of your search engine, write a k-means clustering algorithm that groups the documents by semantic similarity, and report a few topic words that describe each cluster. Some tips: look into k-means++ initialization and, as alternatives to k-means, DBSCAN and agglomerative clustering. For choosing the number of clusters, see this blog post on the silhouette method: https://towardsdatascience.com/silhouette-method-better-than-elbow-method-to-find-optimal-clusters-378d62ff6891
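
As a starting point, here is a minimal NumPy sketch of Lloyd's k-means algorithm with k-means++ seeding. It is one possible shape for a solution, not a reference implementation: the function name `kmeans`, its parameters, and the toy data are all assumptions. The rows are L2-normalized first, so that Euclidean distance between them orders pairs the same way cosine similarity does. The seeding step assumes the points are not all identical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm with k-means++ seeding. Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # k-means++ seeding: pick each new centroid with probability proportional
    # to its squared distance from the nearest centroid chosen so far.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Assignment step: find the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy 2-dimensional "document vectors" forming two obvious groups.
X = np.array([
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
])
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each row
labels, centroids = kmeans(X, k=2)
```

For topic words, one option is to look up the vocabulary words closest to each centroid, for example with Gensim's `KeyedVectors.similar_by_vector`. If scikit-learn is available, `sklearn.metrics.silhouette_score` can be used to compare different values of `k`.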

In [8]:
## YOUR CODE HERE