Import gensim to use Word2Vec and Doc2Vec, and logging 

In [1]:
import gensim, logging

Add logging configuration

In [2]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Load a pre-built Word2Vec model provided by Google

In [3]:
gmodel = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

2023-02-26 21:54:48,723 : INFO : loading projection weights from GoogleNews-vectors-negative300.bin
2023-02-26 21:55:10,905 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from GoogleNews-vectors-negative300.bin', 'binary': True, 'encoding': 'utf8', 'datetime': '2023-02-26T21:55:10.905765', 'gensim': '4.3.0', 'python': '3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19044-SP0', 'event': 'load_word2vec_format'}


Printing the vector of the word cat

In [4]:
# gmodel['cat']
len(gmodel["cat"]) # 300

300

Getting the similarity between some words

In [5]:
gmodel.similarity("cat", "dog")

0.76094574

Setup the Doc2Vec model

In [6]:
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec

Define a function to format the data for the TaggedDocument

In [7]:
def extract_words(sent):
    sent = sent.lower()
    sent = re.sub(r'<[^>]+>', ' ', sent) # strip html tags
    sent = re.sub(r'(\w)\'(\w)', '\1\2', sent) # remove apostrophes
    sent = re.sub(r'\W', ' ', sent) # remove punctuation
    sent = re.sub(r'\s+', ' ', sent) # remove repeated spaces
    sent = sent.strip()
    return sent.split()

In [9]:
# unsupervised training data
import re
import os
unsup_sentences = []

# source: https://ai.stanford.edu/~amaas/data/sentiment, data from IMDB
for dirname in ['train/pos', "train/neg", "train/unsup", "test/pos", "test/neg"]:
    for fname in sorted(os.listdir("aclImdb/"+dirname)):
        if fname[-4:] == ".txt":
            with open("aclImdb/"+dirname+"/"+fname, encoding="UTF-8") as f:
                sent = f.read()
                words = extract_words(sent)
                unsup_sentences.append(TaggedDocument(words, [dirname+"/"+fname]))

# source: http://cs.cornell.edu/people/pabo/movie-review-data/
for dirname in ["txt_sentoken/pos", "txt_sentoken/neg"]:
    for fname in sorted(os.listdir(dirname)):
        if fname[-4:] == ".txt":
            with open(dirname + "/" + fname, encoding="UTF-8") as f:
                for i, sent in enumerate(f):
                    words = extract_words(sent)
                    unsup_sentences.append(TaggedDocument(words, ["%s/%s-%d" % (dirname, fname, i)]))

# source: https://nlp.stanford.edu/sentiment/, data from Rottn Tomatoes
with open("stanfordSentimentTreebank/original_rt_snippets.txt", encoding="UTF-8") as f:
    for i, line in enumerate(f):
        words = extract_words(sent)
        unsup_sentences.append(TaggedDocument(words, ["rt-%d" % i]))

In [11]:
len(unsup_sentences)
print(unsup_sentences[0]) # first sentence

TaggedDocument<['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', 'such', 'as', 'teachers', 'my', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'bromwell', 'hig', 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', 'teachers', 'the', 'scramble', 'to', 'survive', 'financially', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', 'pomp', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', 'all', 'remind', 'me', 'of', 'the', 'schools', 'i', 'knew', 'and', 'their', 'students', 'when', 'i', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', 'i', 'immediately', 'recalled', 'at', 'high', 'a', 'classic', 'line', 'inspector', 'here', 'to', 'sack', 'one', 'of', 'your', 'teachers', 'student', 'welcome', 'to', '

Create a shuffeling class

In [18]:
import random
class PermuteSentences(object):
    def __init__(self, sents):
        self.sents = sents
    
    def __iter__(self):
        shuffeled = list(self.sents)
        random.shuffle(shuffeled)
        for sent in shuffeled:
            yield sent

Lets shuffle the sentences and fit them into our Doc2Vec model

In [20]:
permuter = PermuteSentences(unsup_sentences)
model = Doc2Vec(permuter, dm=0, hs=0, vector_size=50)

2023-02-26 22:21:39,715 : INFO : collecting all words and their counts
2023-02-26 22:21:39,836 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2023-02-26 22:21:40,124 : INFO : PROGRESS: at example #10000, processed 1401401 words (4873031 words/s), 45400 word types, 10000 tags
2023-02-26 22:21:40,392 : INFO : PROGRESS: at example #20000, processed 2839828 words (5397229 words/s), 61709 word types, 20000 tags
2023-02-26 22:21:40,659 : INFO : PROGRESS: at example #30000, processed 4243392 words (5258399 words/s), 73168 word types, 30000 tags
2023-02-26 22:21:40,925 : INFO : PROGRESS: at example #40000, processed 5642275 words (5282198 words/s), 82696 word types, 40000 tags
2023-02-26 22:21:41,211 : INFO : PROGRESS: at example #50000, processed 7090241 words (5072916 words/s), 91075 word types, 50000 tags
2023-02-26 22:21:41,487 : INFO : PROGRESS: at example #60000, processed 8519346 words (5202009 words/s), 98556 word types, 60000 tags
2023-02-26 22:2

After the training, we free up some memory

In [23]:
model.delete_temporary_training_data(keep_interface=True)

AttributeError: 'Doc2Vec' object has no attribute 'delete_temporary_training_data'