# I.  Semantic Analysis:  Introduction

So far in this course we've dealt exclusively with data analysis methods rooted in descriptive statistics. In this module, we will move beyond descriptive methods to implement three machine learning techniques designed to model textual semantics:  Term Frequency Inverse Document Frequency (TFIDF), Topic Modeling, and Word Embedding.

The goal of each of these processes is to program the computer to 'intuit' the semantic meaning of texts.  The machine learning approach taken by each of these methods is, however, fundamentally different.  TFIDF is based on a conceptual model of language importance; Topic Modeling represents a 'Bag of Words' (or Bayesian, to be more formal) approach to the problem of semantics; and Word Embeddings implement a neural network (Deep Learning) approach to the problem.  We'll explain more about each of these approaches as they are described.  Taken as a group, they represent three primary branches of machine learning that dominate the current state of the art in advanced textual analysis.

# II.  Environment Setup

You'll need to prepare your environment to perform the tasks in this notebook.  When executed, the code cells below will load the packages and modules needed to perform the activities presented in this course module.  Comments in the code identify each of the packages being loaded.  In each case, you can refer to the package documentation for more specific information about the package being used.  You must run the code cells below to properly prepare your environment to perform the text mining and analysis tasks presented in this module.

In [None]:
# update Colab environment to latest version of NLTK
# documentation: https://www.nltk.org/
!pip install nltk -U

In [None]:
# we also need to install the ldavis/gensim connector package
!pip install pyLDAvis

In [None]:
# import the base nltk package 
# https://www.nltk.org/
import nltk

# load the nltk tokenize module
from nltk.tokenize import word_tokenize

# import nltk stopword module
from nltk.corpus import stopwords

# import the nltk porter stemmer
from nltk.stem.porter import PorterStemmer

# import the nltk lemmatizer
from nltk.stem import WordNetLemmatizer

# download the punkt model for nltk 
# https://www.kite.com/python/docs/nltk.punkt
nltk.download('punkt')

# download the nltk stopword list
nltk.download('stopwords')

# import regular expression package 
# https://docs.python.org/3/library/re.html
import re

# import numpy 
# https://numpy.org/
import numpy as np

# import pandas 
# https://pandas.pydata.org/
import pandas as pd
from pprint import pprint

# import the os package 
# https://docs.python.org/3/library/os.html 
import os

# import main Gensim package
# https://radimrehurek.com/gensim/
import gensim

# import gensim corpora module
import gensim.corpora as corpora

# import gensim simple_preprocess function
from gensim.utils import simple_preprocess

# import gensim models module
from gensim import models

# import gensim language models
from gensim.models import CoherenceModel

# import pyLDAvis package for model visualization
# https://pyldavis.readthedocs.io/en/latest
import pyLDAvis

# import the matplotlib package for plotting
# https://matplotlib.org/stable/contents.htm
import matplotlib.pyplot as plt

# set up matplotlib to render plots inline in the notebook
%matplotlib inline

You also need to set up some system configuration to control error messages and logging.

In [None]:
# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

# disable deprecation warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

# III.  Load a Working Corpus

Before you can load files for analysis, you must mount your Google Drive in this environment.

In [None]:
from google.colab import drive
drive.mount('/gdrive/')

Once your Google Drive has successfully mounted, you can choose to work with the sample data provided for the course or to work with a corpus of your own.  For those who choose to work with course sample data, we'll be working with a small, randomly selected subset of eebo-tcp text.  If you want to work with your own corpus, put all of the text files that you want to model in a single directory inside the "data_my" directory of the Course Home Directory. Note that the machine learning processes we will cover in this module can be quite computationally intensive, so during class time you won't want to work with a corpus of more than a couple hundred documents.  Depending on whether you intend to work on the sample data or on your own corpus, please follow the appropriate instructions below:

1.   To load the course sample corpus, in the code cell below uncomment (remove the hashtag at the start of the line) the line that reads, "working_file_directory_path = '/gdrive/MyDrive/rbs_digital_approaches_2021/data_class/eebo-tcp/'" and then run the cell.
2.   To load a corpus of your own, replace the "\<directory_name\>" substring in the line that reads, "working_file_directory_path = '/gdrive/MyDrive/rbs_digital_approaches_2021/data_my/\<directory_name\>/'" with the name of your directory, uncomment the line, and then run the cell.


In [25]:
# working_file_directory_path = "/gdrive/MyDrive/rbs_digital_approaches_2021/data_class/eebo-tcp/"
# working_file_directory_path = "/gdrive/MyDrive/rbs_digital_approaches_2021/data_my/<directory_name>/"

Now that we've defined a set of texts with which we want to work, we can begin loading the corpus.

In [26]:
# first, we have to get a list of all files in the
# designated corpus directory
file_list = os.listdir(working_file_directory_path) 

Print the list to make sure that you retrieved it successfully.

In [None]:
print(file_list)

['B06649.headed.txt', 'B06707.headed.txt', 'B06669.headed.txt', 'B07103.headed.txt', 'B06716.headed.txt', 'B06667.headed.txt', 'B06774.headed.txt', 'B06872.headed.txt', 'B06876.headed.txt', 'B06758.headed.txt', 'B06672.headed.txt', 'B06789.headed.txt', 'B06569.headed.txt', 'B06608.headed.txt', 'B06674.headed.txt', 'B06761.headed.txt', 'B06682.headed.txt', 'B06575.headed.txt', 'B06645.headed.txt', 'B06792.headed.txt', 'B06605.headed.txt', 'B06795.headed.txt', 'B06699.headed.txt', 'B06777.headed.txt', 'B06632.headed.txt', 'B06782.headed.txt', 'B06802.headed.txt', 'B06688.headed.txt', 'B06712.headed.txt', 'B06788.headed.txt', 'B31385.headed.txt', 'B06739.headed.txt', 'B06556.headed.txt', 'B06784.headed.txt', 'B06677.headed.txt', 'B06646.headed.txt', 'B06762.headed.txt', 'B06558.headed.txt', 'B25542.headed.txt', 'B06656.headed.txt', 'B06787.headed.txt', 'B06767.headed.txt', 'B06614.headed.txt', 'B06563.headed.txt', 'B06624.headed.txt', 'B06634.headed.txt', 'B06694.headed.txt', 'B06597.head

Now that we know which files we need to load, we can begin reading the files into memory.  Below, we'll create an empty list to hold the full text of the documents in our corpus and then loop through each document, open the file, and append its contents to the end of the list.

In [27]:
# define an empty list object to hold the texts
text_collection = []
# loop through our list of filenames and process each one
for nextfile in file_list:
  # join the corpus working directory path to the filename to create
  # a full path name to the file
  next_file_path = os.path.join(working_file_directory_path, nextfile)
  # open the designated file
  next_file_object = open(next_file_path, "r", encoding='windows-1252')
  # read the contents into a variable
  next_text = next_file_object.read()
  # append the contents to the end of the corpus text list
  text_collection.append(next_text)
  # close the file object
  next_file_object.close()
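
The same loop can be wrapped in a small reusable function. The sketch below is illustrative only: the name `load_corpus`, the `.txt` filter, and the sorted file order are assumptions that are not part of the course code. It uses a `with` block so that each file is closed automatically.

```python
import os

def load_corpus(directory_path, encoding="windows-1252"):
    """Read every .txt file in directory_path; return (filenames, texts)."""
    # hypothetical helper: sort the names so document order is reproducible
    filenames = sorted(
        name for name in os.listdir(directory_path) if name.endswith(".txt")
    )
    texts = []
    for name in filenames:
        # os.path.join supplies the path separator if it is missing
        full_path = os.path.join(directory_path, name)
        with open(full_path, "r", encoding=encoding) as handle:
            texts.append(handle.read())
    return filenames, texts
```

Keeping the filenames alongside the texts makes it easier to trace a result back to the document it came from.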

Take a look at a text from the list.  Feel free to change the reference index and look at multiple texts.

In [None]:
print(text_collection[1])

The World turn'd up-side down OR,  Money grown Trouble some. Shewing the vanity of youngmen, who spend their youthfull days in rioting and want onness, which is undoubtely the High-way to want and Beggary, as you may plainly see in these following lines, wherein the Extravagant doth not only lament his mispent time, but also gives advice to others, to prevent tjose miseries which befell him by his profuse spending till too Late he sees his error. Tune of, Packingtons Pound.       I Am a young blade that had money good store But now by debauchery grown very poor When I had enough to have served my turn Oh then in my pocket my money did burn Then straitway I hunted to find out good fellows, And could not endure to be out of an Alehouse, But by Whoring and Drinking I now am undone, And now I am laugh'd at, by every one. And when I was drunk I must needs have a whore, By which means I quickly consumed my store; For I met with a Wench with her powderde locks, And she for my love furnish me 

Create a "list of lists" where each top-level item in the list is a text and each text is represented by an ordered list of the words that constitute it.

In [29]:
# parse each text into tokens
tokens = [simple_preprocess(next_doc, deacc=True) for next_doc in text_collection]


In [12]:
# view the token list for a few texts. (Change the index number
# to change the text you are viewing.)
print(tokens[23])
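
To make the shape of `tokens` concrete, here is a rough, standard-library stand-in for what `simple_preprocess(..., deacc=True)` does: lowercase the text, strip accents, split on non-letters, and drop very short or very long tokens. It is an approximation for illustration only (the name `rough_preprocess` and the regular expression are assumptions), not Gensim's actual implementation.

```python
import re
import unicodedata

def rough_preprocess(text, min_len=2, max_len=15):
    # strip accents: decompose characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFD", text.lower())
    plain = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # keep runs of letters only, then filter by token length
    words = re.findall(r"[^\W\d_]+", plain)
    return [w for w in words if min_len <= len(w) <= max_len]

sample_texts = ["The World turn'd up-side down", "I am a young blade"]
# one inner list of tokens per text: the "list of lists"
sample_tokens = [rough_preprocess(t) for t in sample_texts]
print(sample_tokens)
```

Note that single-letter words such as "I" and "a" disappear because of the minimum token length.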



Now we have the entire corpus stored in a list of lists, where each item at the top level of the list is an individual text, and each individual text is defined by an ordered list of the words that belong to it.  

The next thing we need to do is to start creating a Document Term Matrix (DTM) for the corpus.  The DTM is a standard, base data structure for representing texts for Bayesian, Bag of Words (BOW) approaches to analysis, which are based on analyzing the frequency with which words co-occur across all texts in a given corpus.  BOW approaches ignore grammar and syntax and model semantics based simply on frequency of co-occurrence.  This approach may seem quite naive; however, as you will see, it works quite well.
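
As a toy illustration of the idea (not the Gensim code used in this notebook), the sketch below builds a vocabulary and a small document-term matrix from two made-up token lists using only the standard library. The variable names and sample documents are invented for the example.

```python
from collections import Counter

toy_docs = [["king", "and", "queen", "and", "court"],
            ["king", "of", "the", "court"]]

# the vocabulary: every distinct word, in a fixed (alphabetical) order
toy_vocab = sorted({word for doc in toy_docs for word in doc})

# one row per document, one column per vocabulary word
toy_dtm = []
for doc in toy_docs:
    counts = Counter(doc)
    toy_dtm.append([counts[word] for word in toy_vocab])

print(toy_vocab)
for row in toy_dtm:
    print(row)
```

Word order is discarded: each row records only how often each vocabulary word occurs in that document.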

In [30]:
# our first step is to create a 'dictionary' for the corpus,
# which is a unique accounting of all words that appear in all documents.
gensim_dictionary = corpora.Dictionary()
# next, we create a 'corpus' representation of the texts, which
# involves calculating the word frequencies in each individual text
# of every word in the dictionary 
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

In [None]:
# get a high-altitude view of our dictionary
print(gensim_dictionary)

Dictionary(79940 unique tokens: ['accidents', 'achilles', 'action', 'ad', 'ages']...)


In [None]:
# and let's also look at the corpus.  The code output is a list of ID numbers
# from the dictionary paired with the frequency with which that item appears
# in the text.
print(gensim_corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 3), (8, 1), (9, 1), (10, 1), (11, 1), (12, 5), (13, 19), (14, 1), (15, 1), (16, 1), (17, 1), (18, 3), (19, 3), (20, 1), (21, 1), (22, 1), (23, 1), (24, 5), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 11), (33, 7), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 3), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 3), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 2), (78, 1), (79, 1), (80, 1), (81, 1), (82, 4), (83, 1), (84, 1), (85, 1), (86, 1), (87, 2), (88, 1), (89, 2), (90, 3), (91, 1), (92, 1), (93, 1), (94, 1), (95, 2), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 2), (104, 1), (105, 1), (106, 1), (107, 1), (108, 2), (109, 3), (110, 3

In [None]:
# and here's another view that replaces the vocabulary ID with the word
# itself from the dictionary
bow_word_frequencies = [[(gensim_dictionary[word_id], frequency) for word_id, frequency in doc] for doc in gensim_corpus]
print(bow_word_frequencies[:50])



# IV.  Term Frequency Inverse Document Frequency (TFIDF)

TFIDF is one of the earliest attempts to semantically differentiate texts, and it is still one of the most widely used.  The primary goal of TFIDF is to identify the words that are most distinctive of a single text as compared to the words that appear in every other text in the corpus under analysis.

This is accomplished, as the name implies, by weighting the frequency of a word in a single text by the inverse of the frequency with which the term appears across all documents in the corpus.  Happily, the Gensim package contains functions for calculating TFIDF, so you don't have to program the math to take advantage of the process.
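
The arithmetic can be sketched in a few lines. The version below uses one common formulation, tf(t, d) × log2(N / df(t)), on invented token lists. Gensim's `TfidfModel` offers further options such as normalization, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

toy_docs = [["the", "king", "the", "crown"],
            ["the", "queen"],
            ["the", "king", "sword"]]

n_docs = len(toy_docs)
# document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

def toy_tfidf(doc):
    # term frequency in this document times log-scaled inverse document frequency
    term_freq = Counter(doc)
    return {word: count * math.log2(n_docs / doc_freq[word])
            for word, count in term_freq.items()}

for doc in toy_docs:
    print({word: round(score, 2) for word, score in toy_tfidf(doc).items()})
```

A word that appears in every document, like "the" here, scores 0 no matter how often it occurs, while a word found in only one document scores highest.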

In [None]:
# instantiate a gensim tfidf model object using the gensim corpus
# note the 'smartirs' parameter.  The name refers to SMART
# (System for the Mechanical Analysis and Retrieval of Text), an
# information retrieval system developed at Cornell University in
# the 1960s.  The three letters of 'ntc' select the weighting scheme:
# n = raw term frequency, t = inverse document frequency, and
# c = cosine normalization of each document vector.
#
# NOTE:  TfidfModel is a Class that is defined in the Gensim package.
# Instantiating it with a corpus computes the document frequencies;
# the weights for each text are calculated later, when the model
# is applied to the corpus.

tfidf = models.TfidfModel(gensim_corpus, smartirs='ntc')

In [None]:
# here we actually invoke our tfidf object, running it on
# our gensim corpus.  For each text, we add the results
# (stored in the 'sent' variable) to a list

# create an empty tfidf list
tfidf_list = []

# run the tfidf model on the corpus, then
# loop through all tfidf text representations, storing each
# in a variable called 'sent' for the duration of that loop
for sent in tfidf[gensim_corpus]:
    # append the data from the tfidf text representation stored in the sent
    # variable to the tfidf_list list
    tfidf_list.append([[gensim_dictionary[id], np.around(frequency, decimals=2)] for id, frequency in sent])

Now we have a list, where each row represents a single text in our corpus and contains a list of pairs, one for each word, where the first item in the pair is the word and the second is its TFIDF score.  The higher the score, the more distinctive the word is of the individual text as compared to all other texts in the corpus.  To look at different texts, change the index that you are looking at and re-run the code cell.

In [None]:
print(tfidf_list[0])

[['accidents', 0.06], ['achilles', 0.06], ['action', 0.04], ['ad', 0.03], ['ages', 0.03], ['air', 0.06], ['alas', 0.03], ['all', 0.0], ['allow', 0.04], ['almost', 0.03], ['alone', 0.02], ['am', 0.01], ['an', 0.03], ['and', 0.01], ['appetite', 0.04], ['are', 0.0], ['argument', 0.04], ['arran', 0.09], ['as', 0.0], ['at', 0.0], ['attaques', 0.09], ['austin', 0.07], ['back', 0.03], ['banisht', 0.07], ['be', 0.0], ['bed', 0.03], ['better', 0.01], ['biast', 0.09], ['bloom', 0.08], ['bode', 0.08], ['breast', 0.04], ['build', 0.05], ['but', 0.02], ['by', 0.01], ['caesar', 0.05], ['calling', 0.03], ['can', 0.01], ['cato', 0.04], ['chain', 0.06], ['chamber', 0.05], ['charles', 0.04], ['city', 0.03], ['clouds', 0.05], ['colour', 0.03], ['common', 0.01], ['confess', 0.04], ['conscience', 0.02], ['constellat', 0.09], ['correspondency', 0.07], ['could', 0.01], ['cramm', 0.09], ['crime', 0.03], ['croud', 0.08], ['crown', 0.07], ['dar', 0.07], ['dark', 0.05], ['dead', 0.02], ['departure', 0.05], ['des

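The above code gives us a view of the TFIDF data for each text, but it is difficult to find the most important words because the words are not ordered by score.  The code below sorts the list for a single text by its TFIDF values.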

In [None]:
# Python program to sort a list of
# [word, score] pairs by the second item using sort()

# Function to sort the list by the second item of each pair
def Sort_Tuple(tup):

    # reverse=True sorts in descending order, so the
    # highest-scoring words come first; the key lambda
    # selects the second element of each sublist
    tup.sort(key=lambda x: x[1], reverse=True)
    return tup


# printing the sorted list of pairs
print(Sort_Tuple(tfidf_list[0]))

# V. Topic Modelling

Topic Modelling is one of the most used and least understood semantic text analysis methods in the Digital Humanities / Cultural Heritage domain.  Because it is so poorly understood, it is almost always applied incorrectly or poorly and/or misinterpreted after the fact.

In [None]:
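# fit an LDA topic model to the bag-of-words corpus
lda_model = gensim.models.ldamodel.LdaModel(corpus=gensim_corpus,        # bag-of-words corpus
                                            id2word=gensim_dictionary,   # maps word IDs back to words
                                            num_topics=20,               # number of topics to find
                                            random_state=100,            # seed for reproducible results
                                            update_every=1,              # update the model after every chunk
                                            chunksize=100,               # documents per training chunk
                                            passes=10,                   # full passes over the corpus
                                            alpha='auto',                # learn the document-topic prior from the data
                                            per_word_topics=True)        # also track likely topics for each word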

In [None]:
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

(0, '0.001*"clipped" + 0.001*"swear" + 0.001*"payment" + 0.001*"pass" + 0.001*"crowns" + 0.001*"loans" + 0.001*"receivers" + 0.001*"boat" + 0.001*"exchequer" + 0.001*"payments"')
(1, '0.000*"and" + 0.000*"the" + 0.000*"of" + 0.000*"to" + 0.000*"that" + 0.000*"in" + 0.000*"his" + 0.000*"he" + 0.000*"was" + 0.000*"be"')
(2, '0.000*"and" + 0.000*"the" + 0.000*"of" + 0.000*"to" + 0.000*"that" + 0.000*"in" + 0.000*"he" + 0.000*"for" + 0.000*"was" + 0.000*"his"')
(3, '0.004*"blazon" + 0.001*"royal" + 0.001*"cheerful" + 0.000*"tempteth" + 0.000*"arms" + 0.000*"cryest" + 0.000*"soit" + 0.000*"diev" + 0.000*"pense" + 0.000*"droit"')
(4, '0.000*"and" + 0.000*"of" + 0.000*"the" + 0.000*"in" + 0.000*"that" + 0.000*"was" + 0.000*"he" + 0.000*"his" + 0.000*"to" + 0.000*"kynge"')
(5, '0.048*"of" + 0.037*"the" + 0.026*"and" + 0.025*"to" + 0.025*"was" + 0.022*"in" + 0.020*"he" + 0.019*"his" + 0.015*"ye" + 0.015*"that"')
(6, '0.019*"my" + 0.016*"and" + 0.013*"for" + 0.012*"to" + 0.011*"me" + 0.010*"the"
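
Each topic comes back as a single formatted string, which is awkward to work with. As a convenience, the sketch below splits one such string into (word, weight) pairs with a regular expression. The helper name `parse_topic_string` is hypothetical, and the sample string is copied from the start of topic 0 above.

```python
import re

def parse_topic_string(topic_string):
    # each term looks like 0.001*"clipped"; capture the weight and the word
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

sample = '0.001*"clipped" + 0.001*"swear" + 0.001*"payment"'
print(parse_topic_string(sample))
```

Gensim's `lda_model.show_topic(topic_id, topn=10)` returns (word, weight) pairs directly, which avoids the string parsing when the model object is at hand.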

In [None]:
!pip install pyLDAvis -U

In [None]:
import pyLDAvis

In [None]:
from pyLDAvis import gensim_models

In [None]:
lda_viz = pyLDAvis.gensim_models.prepare(lda_model, gensim_corpus, gensim_dictionary)
pyLDAvis.display(lda_viz)


# VI.  Model Evaluation

In [61]:
! pip install tmtoolkit
import tmtoolkit


In [68]:
# function to create a dtm
def base_dtm(docs):
    vocab = set()
    for doc in docs:
        vocab |= set(doc.split(' '))
        
    counts = [dict.fromkeys(vocab, 0) for doc in docs]
    for idx, doc in enumerate(docs):
        for word in doc.split(' '):
            counts[idx][word] += 1
            
    dtm = [[count for count in doc.values()] for doc in counts]
    return dtm, list(vocab)


In [69]:
dtm, vocab = base_dtm(text_collection)

In [71]:
dtm = np.asarray(dtm)

In [72]:
dtm

array([[  57,    0,    0, ...,    0,    0,    0],
       [   7,    0,    0, ...,    0,    0,    0],
       [  25,    0,    0, ...,    0,    0,    0],
       ...,
       [ 350,    0,    0, ...,    0,    0,    0],
       [ 814,    0,    0, ...,    0,    0,    0],
       [1105,    0,    0, ...,    0,    0,    0]])

In [75]:
from scipy.sparse import coo_matrix

In [73]:
def np_dtm(docs):
    vocab = set()
    n_nonzero = 0
    for doc in docs:
        split_doc = doc.split(' ')
        unique_terms = set(split_doc)
        vocab |= unique_terms
        n_nonzero += len(unique_terms)
        
    docnames = np.array(range(0, len(docs)))
    vocab = np.array(list(vocab))    
    vocab_sorter = np.argsort(vocab)
    
    ndocs = len(docnames)
    nvocab = len(vocab)
    
    data = np.empty(n_nonzero, dtype=np.intc)
    rows = np.empty(n_nonzero, dtype=np.intc)
    cols = np.empty(n_nonzero, dtype=np.intc)
    
    idx = 0
    for docname, doc in zip(docnames, docs):
        doc = doc.split(' ')
        term_indices = vocab_sorter[np.searchsorted(vocab, doc, sorter=vocab_sorter)]
        unique_indices, counts = np.unique(term_indices, return_counts=True)
        
        n_vals = len(unique_indices)
        idx_end = idx + n_vals
        
        data[idx:idx_end] = counts
        cols[idx:idx_end] = unique_indices
        doc_idx = np.where(docnames == docname)
        rows[idx:idx_end] = np.repeat(doc_idx, n_vals)
        
        idx = idx_end
        
    dtm = coo_matrix((data, (rows, cols)), shape=(ndocs, nvocab), dtype=np.intc)
    return dtm, vocab
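
The function above avoids storing the zeros by handing three parallel arrays to `coo_matrix`: the counts, their row (document) indices, and their column (word) indices. The toy sketch below shows that layout with plain Python lists, without SciPy; the sample matrix is invented.

```python
dense = [[2, 0, 0, 1],
         [0, 0, 3, 0]]

# keep only the non-zero cells as parallel value / row / column lists
data, rows, cols = [], [], []
for r, row in enumerate(dense):
    for c, value in enumerate(row):
        if value != 0:
            data.append(value)
            rows.append(r)
            cols.append(c)

print(data, rows, cols)

# rebuild the dense matrix from the three lists to confirm nothing was lost
rebuilt = [[0] * len(dense[0]) for _ in dense]
for value, r, c in zip(data, rows, cols):
    rebuilt[r][c] = value
```

Because most words are absent from most documents, a real DTM is overwhelmingly zeros, so this representation saves a great deal of memory.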

In [76]:
dtm, vocab = np_dtm(text_collection)

In [62]:
var_params = [{'n_topics': k, 'alpha': 1/k} for k in range(20, 121, 10)]
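
The list comprehension above produces one parameter dictionary per candidate topic count. The snippet below repeats it and inspects the result, to show that it covers k = 20, 30, and so on up to 120, with alpha set to 1/k in each case.

```python
var_params = [{'n_topics': k, 'alpha': 1/k} for k in range(20, 121, 10)]

print(len(var_params))             # number of candidate models
print(var_params[0])               # first parameter set
print(var_params[-1]['n_topics'])  # largest topic count tried
```

Every entry in this list means fitting one more LDA model, so a long list can take a long time to evaluate.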

In [66]:
from tmtoolkit.topicmod import tm_lda
from tmtoolkit.topicmod.tm_lda import evaluate_topic_models
from tmtoolkit.topicmod.evaluate import results_by_parameter

In [67]:
const_params = {
    'n_iter': 1000,
    'eta': 0.1,       # "eta" aka "beta"
    'random_state': 20191122  # to make results reproducible
}

In [77]:
eval_results = evaluate_topic_models(dtm,
                                     varying_parameters=var_params,
                                     constant_parameters=const_params,
                                     return_models=True)


# VII.  Word Embeddings

In [23]:
! pip install --upgrade gensim

Requirement already up-to-date: gensim in /usr/local/lib/python3.7/dist-packages (4.0.1)


In [18]:
! pip install word2vec

In [22]:
import numpy

import scipy

import six

import smart_open

import word2vec

In [33]:
#model = gensim.models.Word2Vec(tokens, min_count=1,size= 50,workers=3, window =3, sg = 1)

embeddings_model = gensim.models.Word2Vec(tokens, window=5, min_count=3)


In [41]:
print(embeddings_model.wv['god'])

[ 4.27930541e-02  3.81329000e-01 -3.98281962e-01  9.64392185e-01
 -8.72463882e-01 -1.65362489e+00 -3.53933096e-01  1.64274943e+00
 -3.33928429e-02 -8.93075168e-01 -1.78375438e-01 -1.35138321e+00
 -1.45301604e+00  1.33670008e+00  7.14526922e-02  1.11690365e-01
  8.77326131e-01 -5.19665948e-04 -1.68028259e+00 -3.90429050e-01
 -3.62992287e-01  1.31916091e-01 -6.06979370e-01  5.45018911e-01
  4.41591799e-01 -8.63319814e-01  4.23939705e-01  2.41257966e-01
  3.97634476e-01  5.53256333e-01 -3.08707952e-01 -1.27587879e+00
  1.16316521e+00  8.42916369e-02 -8.10926914e-01  7.56874502e-01
  1.91269314e+00 -1.99423778e+00 -1.09833038e+00 -1.40286815e+00
 -4.28900212e-01  1.44551963e-01 -1.25229239e+00  1.04692131e-01
 -4.60093059e-02  4.72747773e-01 -1.93228114e+00  1.60589725e-01
  2.20460743e-01  1.66349754e-01  5.02303004e-01  5.28400719e-01
  3.58899981e-01 -1.46616328e+00 -9.85918567e-02  5.18002629e-01
 -4.60181385e-01 -8.87220919e-01  9.42828000e-01 -2.62365758e-01
  4.56117094e-01 -6.22347

In [48]:
embeddings_model.build_vocab(tokens, progress_per=10000)

In [50]:
embeddings_model.train(tokens, total_examples=embeddings_model.corpus_count, epochs=30, report_delay=1)

(17289685, 51873480)

In [51]:
embeddings_model.wv.most_similar(positive=["king"])

[('kings', 0.6769992709159851),
 ('earl', 0.5980961918830872),
 ('william', 0.5974116325378418),
 ('charles', 0.5859288573265076),
 ('lord', 0.5813938975334167),
 ('elizabeth', 0.5806944966316223),
 ('council', 0.5676884651184082),
 ('queen', 0.5656291246414185),
 ('supreame', 0.5543076992034912),
 ('prince', 0.5534899830818176)]

In [52]:
embeddings_model.wv.most_similar(negative=["king"])

[('delyberacyon', 0.4829302132129669),
 ('conestable', 0.4333772659301758),
 ('leders', 0.4314591586589813),
 ('thesame', 0.38099753856658936),
 ('syngynge', 0.37595027685165405),
 ('pryncypally', 0.3750309646129608),
 ('kepe', 0.36959633231163025),
 ('whanne', 0.36798468232154846),
 ('pleadyng', 0.3655109703540802),
 ('ryng', 0.3649815320968628)]

In [59]:
embeddings_model.wv.similarity("king", "lord")

0.58139384
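
The `similarity` score is the cosine of the angle between the two word vectors. Here is a minimal sketch of that calculation using invented three-dimensional vectors in place of real embeddings; the names `toy_king`, `toy_lord`, and `toy_ring` are made up for the example.

```python
import math

def cosine_similarity(vec_a, vec_b):
    # dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

toy_king = [1.0, 2.0, 0.0]
toy_lord = [2.0, 4.0, 0.0]    # points in the same direction as toy_king
toy_ring = [0.0, 0.0, 5.0]    # at a right angle to toy_king

print(cosine_similarity(toy_king, toy_lord))
print(cosine_similarity(toy_king, toy_ring))
```

Vectors pointing the same way score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0.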