#### Imports

In [1]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import nltk
from nltk.tokenize import sent_tokenize
from gensim.models import word2vec
from bs4 import BeautifulSoup  # To clean the text from the html
import re  # Regular Expressions

import warnings
warnings.filterwarnings("ignore")

#### Dataset

We will work with a dataset of reviews of women's clothing.

In [2]:
data = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
data.head(2)

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses


#### Data preparation

We will only need the column with the review text since our task is to train the w2v model.

In [3]:
data = pd.DataFrame(data=data['Review Text']).rename(columns={'Review Text': 'review'})
len(data)

23486

Helper functions that turn raw review text into sentences and tokens:

In [4]:
from nltk.corpus import stopwords  # needed only when remove_stopwords=True

def review_to_wordlist(review, remove_stopwords=False):
    # Strip URLs, then HTML markup, then every character that is not a letter
    review = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", " ", review)
    review_text = BeautifulSoup(review, "html.parser").get_text()
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    words = review_text.lower().split()
    if remove_stopwords:
        stops = set(stopwords.words("english"))  # a set gives fast membership tests
        words = [w for w in words if w not in stops]
    return words

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
def review_to_sentences(review, tokenizer=tokenizer, remove_stopwords=False):
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    return sentences
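
To make the cleaning steps concrete, here is a minimal sketch of the regex part of `review_to_wordlist` applied to a made-up string. It skips the BeautifulSoup step and the sentence splitter and uses a simplified URL pattern; the name `simple_wordlist` and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

def simple_wordlist(text):
    """Regex-only stand-in for review_to_wordlist (no HTML parsing)."""
    # Drop URLs (simplified pattern), then replace every non-letter with a space.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-zA-Z]", " ", text)
    return text.lower().split()

sample = "Love it! It's sooo pretty, see https://example.com/item?id=1"
tokens = simple_wordlist(sample)
print(tokens)  # ['love', 'it', 'it', 's', 'sooo', 'pretty', 'see']
```

Note that the apostrophe in "It's" is replaced by a space, which is why tokens such as 's' and 'm' show up in the processed reviews.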

In [5]:
data["review"].head(2)

0    Absolutely wonderful - silky and sexy and comf...
1    Love this dress!  it's sooo pretty.  i happene...
Name: review, dtype: object

In [6]:
sentences = list(tqdm(map(review_to_sentences, data["review"].astype(str)), total=len(data)))


Each review is now a list of sentences, and each sentence is a list of lowercase word tokens.

In [7]:
print(len(sentences))
print(sentences[:2])

23486
[[['absolutely', 'wonderful', 'silky', 'and', 'sexy', 'and', 'comfortable']], [['love', 'this', 'dress'], ['it', 's', 'sooo', 'pretty'], ['i', 'happened', 'to', 'find', 'it', 'in', 'a', 'store', 'and', 'i', 'm', 'glad', 'i', 'did', 'bc', 'i', 'never', 'would', 'have', 'ordered', 'it', 'online', 'bc', 'it', 's', 'petite'], ['i', 'bought', 'a', 'petite', 'and', 'am'], ['i', 'love', 'the', 'length', 'on', 'me', 'hits', 'just', 'a', 'little', 'below', 'the', 'knee'], ['would', 'definitely', 'be', 'a', 'true', 'midi', 'on', 'someone', 'who', 'is', 'truly', 'petite']]]


In [8]:
flat_sentences = [item for sublist in sentences for item in sublist]
flat_sentences[:3]

[['absolutely', 'wonderful', 'silky', 'and', 'sexy', 'and', 'comfortable'],
 ['love', 'this', 'dress'],
 ['it', 's', 'sooo', 'pretty']]
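
As an aside, the same one-level flattening can be written with `itertools.chain.from_iterable`. The sketch below uses a small hand-made `nested` list in the same shape as `sentences` (reviews, then sentences, then tokens); the variable names are illustrative.

```python
from itertools import chain

# Two "reviews": the first has one sentence, the second has two.
nested = [
    [["absolutely", "wonderful"]],
    [["love", "this", "dress"], ["it", "s", "sooo", "pretty"]],
]

# Remove only the outer (review) level; each sentence stays a token list.
flat = list(chain.from_iterable(nested))
print(flat)
```

Either form produces the list of token lists that `Word2Vec` expects as its training corpus.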

#### Model Training

In [9]:
%%time
model = word2vec.Word2Vec(flat_sentences, workers=4, vector_size=300, min_count=10, window=10, sample=1e-3, epochs=5)

CPU times: total: 10.7 s
Wall time: 3.08 s


Our model has about 3.4k unique words in its vocabulary.

In [10]:
print(len(model.wv.index_to_key))

3364
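
The vocabulary is much smaller than the number of distinct tokens in the reviews because `min_count=10` discards rare words. A minimal sketch of that filtering on a toy corpus (the sentences and the lower threshold are made up for illustration):

```python
from collections import Counter

# Toy corpus in the same shape as flat_sentences: a list of token lists.
toy_sentences = [
    ["love", "this", "dress"],
    ["this", "dress", "runs", "small"],
    ["love", "this", "top"],
]
min_count = 2  # the notebook uses min_count=10

# Count every token, then keep only those seen at least min_count times.
counts = Counter(token for sentence in toy_sentences for token in sentence)
vocab = sorted(word for word, count in counts.items() if count >= min_count)
print(vocab)  # ['dress', 'love', 'this']
```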


#### Model Results

Now that the model has been trained on the women's clothing reviews, we can perform arithmetic on the word vectors.

For example, it is quite entertaining that if you subtract the vector for "sexy" from the sum of the vectors for "small" and "dress", the nearest words are "large", "medium", and so on.

In [11]:
print(*model.wv.most_similar(positive=["dress", "small"], negative=["sexy"], topn=3))

('large', 0.6205908060073853) ('l', 0.5897884368896484) ('medium', 0.589218258857727)


Here the analogy "woman" is to "women" as "heel" is to ... recovers the plural of "heel":

In [12]:
print(*model.wv.most_similar(positive=["heel", "women"], negative=["woman"], topn=1))

('heels', 0.6841935515403748)
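
The arithmetic behind these queries can be sketched in plain Python. As far as I understand gensim's behavior, `most_similar` averages the unit-normalized `positive` vectors together with the negated `negative` ones and ranks the remaining words by cosine similarity. The 3-dimensional vectors below are hypothetical, hand-made so that the axes loosely mean (person, shoe, plural); they are not taken from the trained model.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical toy embeddings, not values from the trained model.
toy = {
    "woman": [1.0, 0.0, 0.0],
    "women": [1.0, 0.0, 1.0],
    "heel": [0.0, 1.0, 0.0],
    "heels": [0.0, 1.0, 1.0],
    "sandal": [0.0, 0.9, 0.1],
    "man": [0.9, 0.0, 0.1],
}
positive, negative = ["heel", "women"], ["woman"]

# Positive words count with weight +1, negative words with weight -1.
signed = [unit(toy[w]) for w in positive]
signed += [[-x for x in unit(toy[w])] for w in negative]
query = unit([sum(col) / len(signed) for col in zip(*signed)])

# Rank every word that was not part of the query by cosine similarity.
candidates = [w for w in toy if w not in positive + negative]
scores = {w: sum(a * b for a, b in zip(unit(toy[w]), query)) for w in candidates}
best = max(scores, key=scores.get)
print(best)  # heels
```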


The words most similar to "usa" in the model are:

In [13]:
print(*model.wv.most_similar("usa", topn=3))

('cheaply', 0.8089789152145386) ('poorly', 0.6855989098548889) ('well', 0.6430597901344299)


If we give the model a list of words and ask it to pick out the odd one (the word most distant from the others), the result is easy to explain:

In [14]:
print(model.wv.doesnt_match("large tall small elegant tight".split()))

elegant
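
A rough sketch of the idea behind `doesnt_match`, assuming it works the way I understand it: normalize each word vector, take the mean direction, and return the word whose cosine similarity to that mean is lowest. The 2-dimensional vectors are hypothetical, with axes loosely meaning (size/fit, style).

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical toy embeddings, not values from the trained model.
toy = {
    "large": [1.0, 0.1],
    "tall": [0.9, 0.2],
    "small": [1.0, 0.0],
    "elegant": [0.1, 1.0],
    "tight": [0.8, 0.3],
}

units = {word: unit(vec) for word, vec in toy.items()}
# Mean direction of all unit vectors, itself normalized.
mean = unit([sum(col) / len(units) for col in zip(*units.values())])
# Cosine similarity of each word to the mean direction.
scores = {word: sum(a * b for a, b in zip(vec, mean)) for word, vec in units.items()}
odd_one_out = min(scores, key=scores.get)
print(odd_one_out)  # elegant
```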


The main limitation of the model is that its notion of word meaning depends entirely on the training texts. For example, the literal sense of "star" as a celestial body is far away in this model, while the figurative sense (a rating) is close, because that is how the word is used in our reviews.

In [15]:
print(*model.wv.most_similar("star", topn=10), sep='\n')

('stars', 0.7087801098823547)
('major', 0.674744725227356)
('rating', 0.6625295281410217)
('minor', 0.6463609933853149)
('four', 0.6169639229774475)
('negative', 0.6142308115959167)
('reason', 0.608757734298706)
('five', 0.5718724727630615)
('solved', 0.5523854494094849)
('gave', 0.5501779317855835)

"Star" is about 1.6 times as similar to "rating" (0.66) as it is to "shine":

In [16]:
model.wv.similarity('star', 'shine')

0.40836945

And its similarity to "sky" is almost eight times lower than its similarity to "rating":

In [17]:
model.wv.similarity('star', 'sky')

0.08556241
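
As a quick check on these comparisons, the ratios can be computed directly from the similarity values printed in the cells above (the variable names are illustrative):

```python
# Cosine similarities copied from the outputs of cells 15, 16 and 17.
sim_rating = 0.6625295281410217
sim_shine = 0.40836945
sim_sky = 0.08556241

print(round(sim_rating / sim_shine, 2))  # 1.62
print(round(sim_rating / sim_sky, 2))    # 7.74
```

Keep in mind that these are ratios of cosine similarities, not of distances, so they are only a rough indication of how much closer one word is than another.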

#### Model Retraining

We can take a pre-trained model and continue training it on other data, which is very convenient. This lets us add new words to the existing vocabulary and update the vectors of words already in it.

To start, let's pick a couple of words whose vectors are likely to change after further training and record their current nearest neighbors. Since we will be training on Alice in Wonderland, the neighbors of "illusion" and "white" may change because of how often the book refers to them.

In [18]:
print(*model.wv.most_similar("illusion", topn=10), sep='\n')

('hiding', 0.6951606869697571)
('empire', 0.6687750220298767)
('bulge', 0.6579235196113586)
('appearance', 0.6348645091056824)
('volume', 0.6276013851165771)
('defined', 0.6257961392402649)
('effect', 0.6207180023193359)
('creating', 0.6204461455345154)
('fullness', 0.6187328696250916)
('accentuate', 0.6186639666557312)


In [19]:
print(*model.wv.most_similar("white", topn=10), sep='\n')

('black', 0.8461518883705139)
('navy', 0.767144501209259)
('colored', 0.7560238242149353)
('grey', 0.7489383816719055)
('cream', 0.748173177242279)
('gray', 0.7364131808280945)
('ivory', 0.732673704624176)
('red', 0.7277088165283203)
('brown', 0.7240748405456543)
('beige', 0.6978108286857605)


Let's process the text as before, splitting it into sentences and tokens.

In [20]:
with open("alice_in_wonderland.txt", 'r', encoding='utf-8') as f:
    text = f.read()

text = re.sub('\n', ' ', text)
sents = sent_tokenize(text)

punct = r'!"#$%&()*+,-./:;<=>?@[\]^_`{|}~„“«»†*—/\-‘’'  # raw string avoids invalid-escape warnings
clean_sents = []

for sent in sents:
    s = [w.lower().strip(punct) for w in sent.split()]
    clean_sents.append(s)

print(clean_sents[:2])

[["alice's", 'adventures', 'in', 'wonderland', "alice's", 'adventures', 'in', 'wonderland', 'lewis', 'carroll', 'the', 'millennium', 'fulcrum', 'edition', '3.0', 'chapter', 'i', 'down', 'the', 'rabbit-hole', 'alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', 'and', 'of', 'having', 'nothing', 'to', 'do', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', 'and', 'what', 'is', 'the', 'use', 'of', 'a', "book,'", 'thought', 'alice', 'without', 'pictures', 'or', "conversation?'"], ['so', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', 'as', 'well', 'as', 'she', 'could', 'for', 'the', 'hot', 'day', 'made', 'her', 'feel', 'very', 'sleepy', 'and', 'stupid', 'whether', 'the', 'pleasure', 'of', 'making', 'a', 'daisy-chain', 'would', 'be', 'worth', 'the', 'trouble', 'of', 'getting', 'up', 'and', 

Save the model so that it can be loaded later for further training.

In [21]:
model_path = "clothes_reviews.model"
model.save(model_path)

Load the model again, this time into a different variable, extend its vocabulary with the new text, and train it on Alice in Wonderland.

In [22]:
model_retrained = word2vec.Word2Vec.load(model_path)

model_retrained.build_vocab(clean_sents, update=True)
model_retrained.train(clean_sents, total_examples=model_retrained.corpus_count, epochs=5)

(70570, 132360)
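
The tuple returned by `train()` can be read with a little arithmetic. This sketch assumes the tuple follows gensim's documented convention of (effective word count, raw word count), where the effective count excludes tokens that are out of vocabulary or dropped by downsampling; treat that reading as an assumption to verify against the gensim documentation.

```python
# Values copied from the output above; the meaning of each field is an assumption.
effective_words, raw_words = 70570, 132360
epochs = 5

tokens_per_epoch = raw_words // epochs
trained_fraction = round(effective_words / raw_words, 3)

print(tokens_per_epoch)   # 26472
print(trained_fraction)   # 0.533
```

Under that assumption, the book contributes about 26k tokens per pass, and roughly half of them actually take part in training.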

Let's compare the nearest neighbors of words known to both models before and after retraining:

In [23]:
orig_words = [word for word, _ in model.wv.most_similar("illusion", topn=10)]
new_words = [word for word, _ in model_retrained.wv.most_similar("illusion", topn=10)]
print('Original model | Retrained model')
print(' '*14, '|')
for orig_word, new_word in zip(orig_words, new_words):
    print(f'{orig_word:>14} | {new_word}')

Original model | Retrained model
               |
        hiding | hiding
        empire | empire
         bulge | bulge
    appearance | volume
        volume | defined
       defined | creating
        effect | fullness
      creating | accentuate
      fullness | favors
    accentuate | silhouette


In [24]:
orig_words = [word for word, _ in model.wv.most_similar("white", topn=10)]
new_words = [word for word, _ in model_retrained.wv.most_similar("white", topn=10)]
print('Original model | Retrained model')
print(' '*14, '|')
for orig_word, new_word in zip(orig_words, new_words):
    print(f'{orig_word:>14} | {new_word}')

Original model | Retrained model
               |
         black | black
          navy | navy
       colored | colored
          grey | grey
         cream | cream
          gray | ivory
         ivory | gray
           red | brown
         brown | red
         beige | turquoise
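
The two columns for "illusion" can also be compared programmatically. The sketch below hard-codes the words from the printed output rather than querying the models, so it runs on its own; the variable names are illustrative.

```python
# Top-10 neighbors of "illusion", copied from the comparison output above.
original = ["hiding", "empire", "bulge", "appearance", "volume",
            "defined", "effect", "creating", "fullness", "accentuate"]
retrained = ["hiding", "empire", "bulge", "volume", "defined",
             "creating", "fullness", "accentuate", "favors", "silhouette"]

# Words that left the top 10 and words that entered it after retraining.
dropped = [w for w in original if w not in retrained]
entered = [w for w in retrained if w not in original]

print(dropped)  # ['appearance', 'effect']
print(entered)  # ['favors', 'silhouette']
```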


Some neighbors have moved up the list, and a few words that were previously outside the top 10 have entered it.

The model has also picked up new associations, such as "white" with "rabbit", a pairing that is mentioned in the book.

In [25]:
model_retrained.wv.similarity('white', 'rabbit')

0.41129318

#### Conclusion

The *word2vec* model produces logical, interpretable word embeddings on which vector arithmetic gives meaningful results, and it trains relatively quickly. Especially worth noting is the ability to continue training an existing model for a specific class of tasks. The specificity of the resulting embeddings is both an advantage and a disadvantage: some familiar word senses may be missing because they do not occur in the texts the model was trained on.