#Word2Vec

In machine learning we often come across datasets that contain text with useful information. However, extracting meaning from that text in a form a machine learning model can use is challenging, which is where word embeddings come in.

##Embedding Spaces

In a deep neural network, each layer transforms its input into a new, often high-dimensional, representation of the data. If you inspect these representations in a trained network, you may begin to notice patterns or groupings. These patterns form during training and place similar inputs close together, which helps the network distinguish between different classes.

<img src='https://miro.medium.com/v2/resize:fit:720/format:webp/1*jYu2qwF4w3h7Xa93B05Ocw.png'>

##Word Embeddings

A word embedding is simply a vector representation of a word. Below is a basic example with made-up values.

<img src="https://miro.medium.com/v2/resize:fit:1200/1*sAJdxEsDjsPMioHyzlN3_A.png">

Notice in the bottom example that man and woman, and king and queen, share a similar spatial pattern. This is one of the main goals of an embedding: we want the meaning of each word, and its relationship to other words, to be reflected in its position in the space. These are called the semantic relationships between words.
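To make the idea of a shared spatial pattern concrete, here is a minimal sketch using made-up 2-D vectors. The values and the meaning assigned to each axis are assumptions chosen for illustration; real embeddings have many more dimensions and only approximate this behavior.

```python
import numpy as np

# Made-up 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative values only)
embeddings = {
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
}

# The offset from "man" to "woman" is the same as the offset from "king" to "queen"
gender_offset = embeddings["woman"] - embeddings["man"]
predicted_queen = embeddings["king"] + gender_offset

print(np.allclose(predicted_queen, embeddings["queen"]))  # True for these toy values
```

In a trained model the match is rarely exact, so in practice you would look for the nearest vector to the result rather than an equal one.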

##How Word2Vec works

Word2Vec is a shallow neural network that consists of an input layer, a single hidden layer, and an output layer that uses the softmax activation function. The model can be trained in the two ways shown below.

<img src="https://community.alteryx.com/t5/image/serverpage/image-id/45458iDEB69E518EBA3AD9?v=v2">

###Skip-Gram

In a skip-gram model the network is given some text and, for each word in the text, has to predict the surrounding context words. Doing so trains the weights of the hidden (embedding) layer. When training is done, we discard the output layer and use the hidden layer's weights as the word embeddings.
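As a rough illustration of what the skip-gram training data looks like, the sketch below turns a token list into (center word, context word) pairs. The helper `skipgram_pairs` is a hypothetical name written for this example and is not part of any library; the actual training loop inside a Word2Vec implementation is more involved.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each word and its neighbors within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "old", "man", "and", "the", "sea"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs[:3])  # [('the', 'old'), ('old', 'the'), ('old', 'man')]
```

Each pair is one training example: the network sees the center word as input and is asked to assign high probability to the context word.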

###CBOW (Continuous Bag of Words)

CBOW is the reverse of skip-gram: we try to predict a word from its surrounding context words. As before, the output layer is discarded after training and the hidden layer's weights are used as the embeddings.
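The CBOW setup can be sketched the same way: each training example pairs a list of context words with the single word to predict. The helper `cbow_examples` is a hypothetical name used only for this illustration.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples for each position in `tokens`."""
    examples = []
    for i, target in enumerate(tokens):
        # Words up to `window` positions to the left and right of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples

tokens = ["the", "old", "man", "and", "the", "sea"]
examples = cbow_examples(tokens, window=1)
print(examples[2])  # (['old', 'and'], 'man')
```

Compared with skip-gram, the input and output roles are swapped: the context words go in together and the model predicts the word in the middle.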

#Book Recommendation System

Below I load a dataset from GoodReads, one of the largest book review websites. It consists of 52194 books, each with a description as well as the author, ISBN, title, etc.

Book recommendation systems often rely on ratings, authors, etc. to make a recommendation. We, however, will use JUST the description of each book. We will do this by first cleaning the text, then creating word embeddings, and finally using the cosine similarity score to find similar books.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

import tensorflow as tf

In [2]:
dataset = pd.read_csv("https://raw.githubusercontent.com/ShawnPatrick-Barhorst/Word2Vec_Recommendation/refs/heads/main/goodreads_books.csv")

In [3]:
dataset = dataset[['title', 'link', 'author', 'description']]
dataset = dataset[dataset['description'].notna()]

In [4]:
dataset

| | title | link | author | description |
|---|---|---|---|---|
| 0 | Inner Circle | https://www.goodreads.com//book/show/630104.In... | Kate Brian, Julian Peploe | Reed Brennan arrived at Easton Academy expecti... |
| 1 | A Time to Embrace | https://www.goodreads.com//book/show/9487.A_Ti... | Karen Kingsbury | Ideje az ölelésnek Történet a reményről,... |
| 2 | Take Two | https://www.goodreads.com//book/show/6050894-t... | Karen Kingsbury | Filmmakers Chase Ryan and Keith Ellison have c... |
| 4 | The Millionaire Next Door: The Surprising Secr... | https://www.goodreads.com//book/show/998.The_M... | Thomas J. Stanley, William D. Danko | The incredible national bestseller that is cha... |
| 5 | Black Sheep | https://www.goodreads.com//book/show/311164.Bl... | Georgette Heyer | With her high-spirited intelligence and good l... |
| ... | ... | ... | ... | ... |
| 52194 | The Stranger I Married | https://www.goodreads.com//book/show/15743072-... | Sylvia Day | The unabridged, downloadable audiobook edition... |
| 52195 | The Opposite of Loneliness: Essays and Stories | https://www.goodreads.com//book/show/18143905-... | Marina Keegan | An affecting and hope-filled posthumous collec... |
| 52196 | Sadako will leben | https://www.goodreads.com//book/show/1466878.S... | Karl Bruckner | 6. August 1945, 8 Uhr 15 Minuten - die kleine ... |
| 52197 | Confessions | https://www.goodreads.com//book/show/630103.Co... | Kate Brian | Sometimes the truth hurts.... Reed Brennan... |


In [5]:
dataset['description'][42]

'The "New York Times" Number One bestseller from 1976 is back in this great new package. As the day begins at First Mercantile American Bank, so do the high-stake risks, the public scandals, and the private affairs. It is the inside world where secret million-dollar deals are made, manipulated, and sweetened with sex by the men and women who play to win.'

##Data Cleaning and the NLTK Library

The NLTK library provides a variety of natural language tools that will help us clean our text. The goal of cleaning is to get rid of non-text characters and to make different forms of the same word match each other. For example, run and running mean much the same thing, so we reduce running to its stem, run, to make them identical. This shrinks the vocabulary our model has to learn.

One of these tools is a list of stopwords such as [a, is, but, and]. These are words that carry little meaning on their own and mainly serve to connect other words. The model gains little from them, so they are removed.

Another tool is the stemmer, which snips off various suffixes so that related words match.

In [6]:
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import PorterStemmer
nltk.download('stopwords')
import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Here I use a function to make everything lowercase and to remove all non-letter characters as well as the possessive "'s", to ensure it doesn't get recognized as a word later.

In [7]:
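def clean(row):
  # Replace every run of characters that is not a letter or apostrophe with a space, then lowercase
  sub = re.sub("[^A-Za-z']+", ' ', str(row)).lower()
  # Drop the possessive "'s" so it is not tokenized as a separate word
  cleaned = re.sub(r"'s\b", "", sub)
  return cleaned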

In [8]:
brief_cleaning = [clean(row) for row in dataset['description']]

In this section of code I build a new list of cleaned descriptions. For each description I tokenize the text, remove stopwords, and stem the remaining words. Once this is done the data is clean.

In [None]:
stop_words = set(stopwords.words('english'))  # set membership checks are faster than list lookups
stemmer = PorterStemmer()
cleaned_descriptions = []

cleaned = [clean(row) for row in dataset['description']]
for description in cleaned:
  tokens = word_tokenize(description)
  # Keep only non-stopword tokens, reduced to their stems
  cleaned_tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
  cleaned_descriptions.append(cleaned_tokens)


In [None]:
dataset['description'][0]

In [None]:
cleaned_descriptions[0]

##Word2Vec

The gensim library provides an implementation of Word2Vec. Pretrained models exist, such as the vectors Google trained on a large news corpus, but the code below does not load one: it creates a new model and trains it from scratch on our book descriptions.

Training on our own data means the vocabulary matches the stemmed words that actually appear in this dataset. This is done below.

In [None]:
import multiprocessing

from gensim.models import Word2Vec
from time import time

In [None]:
cores = multiprocessing.cpu_count()

In [None]:
w2v_model = Word2Vec(min_count=20,      # Ignore words that occur fewer than "min_count" times in the corpus
                     window=2,          # Maximum distance between the current word and a context word, on each side
                     sample=6e-5,       # Threshold above which high-frequency words are randomly downsampled
                     alpha=0.03,        # Initial learning rate
                     min_alpha=0.0007,  # Learning rate decays linearly to this value during training
                     negative=20,       # Negative sampling: number of random "noise" words drawn per positive example
                     workers=max(1, cores-1))  # Worker threads; keep at least one

w2v_model.build_vocab(cleaned_descriptions)
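The `sample` parameter above can be made more concrete with a small sketch. It assumes the subsampling rule used in the original word2vec C code, which gensim follows as far as I know: a word that makes up a fraction `f` of all tokens is kept with probability `sqrt(sample / f) + sample / f`, capped at 1. The function name `keep_probability` is hypothetical and exists only for this illustration.

```python
import math

def keep_probability(word_fraction, sample=6e-5):
    """Probability of keeping one occurrence of a word during training.

    word_fraction: share of all corpus tokens taken up by this word (0 < value <= 1).
    Assumes the word2vec-style rule sqrt(sample / f) + sample / f, capped at 1.
    """
    ratio = sample / word_fraction
    return min(1.0, math.sqrt(ratio) + ratio)

print(keep_probability(0.05))   # a very common word: most occurrences are dropped
print(keep_probability(1e-5))   # a rare word: always kept
```

Lowering `sample` makes the downsampling more aggressive, which reduces the influence of very frequent words on the embeddings.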

In [None]:
t = time()
w2v_model.train(cleaned_descriptions, total_examples=w2v_model.corpus_count, epochs=10, report_delay=1)
print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
#w2v_model.init_sims(replace=True)

##The Recommender

Now that our model is trained we can create a vector representation of each description. To do this we look up the Word2Vec vector for each word in the description and then average those vectors. This is done below.

In [None]:
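sentence_vectors = []
for description in cleaned_descriptions:
  # Keep only words that made it into the model's vocabulary (min_count filters out rare ones)
  word_vectors = [w2v_model.wv[word] for word in description if word in w2v_model.wv]
  if word_vectors:
    sentence_vectors.append(np.mean(word_vectors, axis=0))
  else:
    # No known words: store NaN explicitly so the row can be dropped later
    sentence_vectors.append(np.nan)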
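The averaging step above can be illustrated on a toy vocabulary. The vectors and the helper `average_vector` below are made up for this sketch and stand in for the trained model's lookup; the point is only to show how unknown words are skipped and what happens when no word is known.

```python
import numpy as np

# Made-up 3-D word vectors standing in for a trained model's vocabulary
toy_vectors = {
    "space":   np.array([1.0, 0.0, 2.0]),
    "mission": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors):
    """Mean of the vectors for known tokens, or NaN when no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.nan
    return np.mean(known, axis=0)

print(average_vector(["space", "mission", "zzz"], toy_vectors))  # "zzz" is skipped
print(average_vector(["zzz"], toy_vectors))                      # nan
```

Averaging discards word order, so two descriptions that use the same words in a different order end up with the same vector.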

In [None]:
len(sentence_vectors)

In some cases the description had no usable words, for example an empty string that did not show up as a NaN value, or text whose words are all missing from the model's vocabulary. The average for those rows is NaN, so we have to remove them again.

In [None]:
dataset['vector'] = sentence_vectors
dataset = dataset[dataset['vector'].notna()]

Here we have a new dataset with a vector representation of each description. This is great!

In [None]:
dataset

##Cosine Similarity

Cosine similarity measures how similar two vectors are by taking the cosine of the angle between them.

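The reason we use this rather than Euclidean distance is that cosine similarity depends only on the direction of the vectors, not on their length. Two descriptions that point the same way in the embedding space score as similar even if their vectors have very different magnitudes. Essentially, cosine similarity normalizes each vector by its length before comparing them.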
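For two vectors $a$ and $b$ the score is $\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The short sketch below, using made-up vectors and a hypothetical helper `cosine_sim`, shows why this differs from Euclidean distance: scaling a vector changes its distance to the original but not its cosine similarity.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                        # same direction as a, ten times longer
c = np.array([3.0, -1.0, 0.5])    # a different direction

print(cosine_sim(a, b))           # approximately 1.0: identical direction
print(np.linalg.norm(a - b))      # large Euclidean distance despite identical direction
print(cosine_sim(a, c))           # noticeably lower: different direction
```

Scores range from -1 (opposite directions) to 1 (same direction), with 0 meaning the vectors are orthogonal.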

<img src="https://miro.medium.com/v2/resize:fit:1400/1*FTVRr_Wqz-3_k6Mk6G4kew.png">

Below I create a quick example that compares the first book against every other book. I then store the scores in a new column of a new dataset, which I sort to find the highest similarities.

In [None]:
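from sklearn.metrics.pairwise import cosine_similarity

# Embedding of the first book, reshaped to (1, n_features) as scikit-learn expects
vector1 = np.array(dataset['vector'][0]).reshape(1, -1)

# Stack every description vector into one (n_books, n_features) matrix
all_vectors = np.vstack(dataset['vector'].tolist())

# One call scores the first book against every book; flatten to a 1-D array of scalars
similarities = cosine_similarity(vector1, all_vectors).ravel()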

In [None]:
similarity_df = dataset.copy()  # copy so the original dataset is not modified
similarity_df['similarity'] = similarities

This is what we will wrap in a function shortly. For now, you can see the similarity score between the first book, "Inner Circle", and the other books.

In [None]:
similarity_df

In [None]:
sorted_df = similarity_df.sort_values(by='similarity', ascending=False)

By sorting them we can get the most similar books in our list and recommend them to a person.

In [None]:
sorted_df

Below I put it all into a single function and try it on a few books. You can see that it prints the top 5 most similar books.

In [None]:
def recommend(title, n=5):

  #Check if book exists (exact title match)
  matches = dataset.index[dataset['title'] == title]
  if len(matches) == 0:
    print("Book not found")
    return

  #Get embedding of the first matching book and reshape to (1, n_features)
  vector1 = np.array(dataset['vector'][matches[0]]).reshape(1, -1)

  #Stack every embedding into one matrix and score them all in a single call
  all_vectors = np.vstack(dataset['vector'].tolist())
  similarities = cosine_similarity(vector1, all_vectors).ravel()

  #Create copy of original dataframe then add similarity scores
  similarity_df = dataset.copy()
  similarity_df['similarity'] = similarities

  #Sort values and drop the query book itself
  sorted_df = similarity_df.sort_values(by='similarity', ascending=False)
  sorted_df = sorted_df[sorted_df['title'] != title]

  #Print top N books
  for rank, similar_title in enumerate(sorted_df['title'][:n], start=1):
    print(f"{rank}: {similar_title}")

In [None]:
recommend('The Old Man and the Sea')

In [None]:
recommend('The Martian')

In [None]:
recommend('Darkly Dreaming Dexter')

In [None]:
recommend('A Clockwork Orange')

In [None]:
recommend('V for Vendetta')