<a class="anchor" id="first-part"></a>


Author: Bao Cai

Course: Machine Learning for Descriptive Problems

Topic: NLP-Unsupervised

Start Date: 2020-03-11

Last Save: 2020-03-12

[1. Topic analysis](#topic-part)

There are quantitative measures for intrinsically evaluating various aspects of clustering results and topic models, and they can also be evaluated extrinsically by how well the clusters/topics serve as features for some supervised task. In this exercise, however, we will focus on qualitative evaluation of the results in terms of their descriptiveness. As in the example code, you may limit yourself to the first 1000 documents of the corpus when performing clustering, in order to simplify the task and speed up experimentation, but use the whole corpus to calculate tf-idf features.

a. Experiment with different setups of the tf-idf feature extraction and clustering (k-means or hierarchical) in order to obtain meaningful results. When you arrive at a good configuration, describe it and motivate your chosen setup/parameters.

b. Inspect the keywords of the clusters. List the first 10 clusters out of all (i.e. not cherry-picked examples) and provide as descriptive a label as possible for each of them.

c. Select one or two good clusters (that can be clearly interpreted) and one or two bad clusters (that might be difficult to interpret or distinguish). Motivate your choice (clusters may, for instance, be overlapping, too broad/narrow or incoherent).

d. Repeat the experiment in (a) with LDA topic modelling instead (on the whole corpus), and explain briefly how the results compare to your previously chosen clustering setup. A few concrete examples may be helpful. Do your best to make sure the lists of topic keywords are informative through appropriate post-processing.

[2. Word vectors](#word-vector)

a. Choose about 5 words (arbitrary) to use as seed words in the following experiment. Train word2vec vectors on the corpus while trying out variations on the parameters. Evaluate the vector models by inspecting the most similar words for each of the seed words, and try to identify qualitative differences between different parameter choices. Which parameters seem to have the most interesting effect? At what values? Motivate. Finally, study the qualitative effect of increasing the training data, by similarly comparing vectors trained with the best setup on the texts from the awards_2020 directory against vectors trained on the whole set of abstracts (1990-2002).

b. Repeat the experiment with ELMo from the lecture, with a different target word and different sentences. Choose a word that can have multiple senses, and construct 10 sentences that express 2-3 different senses of the word. Produce ELMo embeddings for the target word in each sentence and measure the similarity between the vectors. Evaluate in how many cases the measured similarities can be used to successfully distinguish between the different senses. Comment on the results, e.g. are you able to identify a particular way in which the model fails?

[top](#first-part)

In [11]:
import numpy
import os
import re
import binascii
import itertools
import heapq
import gensim
import logging
import numpy as np
import pandas as pd
import tensorflow_hub as hub
import tensorflow as tf
from time import time
from collections import Counter
from gensim import corpora, models
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

In [2]:
# Functions
def get_fnames(path='./Data'):
    """Return the paths of all .txt files under `path`, searched recursively."""
    fnames = []
    for root, _, files in os.walk(path):
        for fname in files:
            if fname.endswith('.txt'):
                fnames.append(os.path.join(root, fname))
    return fnames


def read_file(fname):
    with open(fname, 'rt', encoding='latin-1') as f:
        # skip all lines until abstract
        for line in f:
            if "Abstract    :" in line:
                break

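        # get abstract as a single string; strip() already removes the newline,
        # so no character is lost if the last line has no trailing newline
        abstract = ' '.join([line.strip() for line in f])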
        abstract = re.sub(' +', ' ', abstract)
        return abstract


def print_clusters(matrix, clusters, features, n_keywords=10):
    for cluster in range(min(clusters), max(clusters)+1):
        cluster_docs = [i for i, c in enumerate(clusters) if c == cluster]
        print("Cluster: %d (%d docs)" % (cluster, len(cluster_docs)))
        
        # Keep scores for top n terms
        new_matrix = np.zeros((len(cluster_docs), matrix.shape[1]))
        for cluster_i, doc_vec in enumerate(matrix[cluster_docs].toarray()):
            for idx, score in heapq.nlargest(n_keywords, enumerate(doc_vec), key=lambda x:x[1]):
                new_matrix[cluster_i][idx] = score

        # Aggregate scores for kept top terms
        keywords = heapq.nlargest(n_keywords, zip(new_matrix.sum(axis=0), features))
        print(', '.join([w for s,w in keywords]))
        print()


def vectorize_cluster(
    documents,
    sample_size=None,
    ngram_range=(1, 1),
    analyzer='word',
    max_df=1.0,
    min_df=0.0,
    max_features=None,
    use_idf=True,
    sublinear_tf=True,
    n_clusters=10,
    tol=0.0001
):
    vectorizer = TfidfVectorizer(
        ngram_range=ngram_range,
        analyzer=analyzer,
        max_df=max_df,
        min_df=min_df,
        max_features=max_features,
        use_idf=use_idf,
        sublinear_tf=sublinear_tf
    )
    token_matrix = vectorizer.fit_transform(documents)
    print(
        'The shape of the token matrix is:',
        token_matrix.shape  # sparse matrices expose .shape; no dense copy needed
    )
    print()
    # Note: scikit-learn >= 1.0 offers get_feature_names_out();
    # get_feature_names() was removed in 1.2
    features = vectorizer.get_feature_names()
    # Sorted by ascending IDF, i.e. the most widespread tokens come first
    print('The top 20 tokens are:')
    for feature, idf in sorted(
        zip(features, vectorizer.idf_),
        key=lambda x:x[1]
    )[:20]:
        print('{:.2f}\t{}'.format(idf, feature))
    print()
    print('For the first 5 docs, the top tokens are:')
    for i in range(5):
        print('\nDocument {}, top terms by TF-IDF'.format(i))
        # Densify only the current row instead of the whole matrix
        for feature, score in sorted(
            zip(features, token_matrix[i].toarray().ravel()),
            key=lambda x: -x[1]
        )[:5]:
            print('{:.2f}\t{}'.format(score, feature))
    print()
    if not sample_size:
        matrix_to_cluster = token_matrix
    else:
        matrix_to_cluster = token_matrix[:sample_size]
    km = KMeans(
        n_clusters=n_clusters,
        tol=tol,
        random_state=44,
        verbose=False
    )
    km.fit(matrix_to_cluster)
    print_clusters(matrix_to_cluster, km.labels_, features)
    return (
        vectorizer,
        token_matrix,
        features,
        matrix_to_cluster,
        km
    )

In [3]:
%%time
documents = [read_file(file) for file in get_fnames('./Data/Arcada_BigDataAnalytic/')]

CPU times: user 1.23 s, sys: 254 ms, total: 1.48 s
Wall time: 2.25 s


<a class="anchor" id="topic-part"></a>
### Topic analysis

[top](#first-part)

In [27]:
%%time
default_setup = vectorize_cluster(
    documents,
    sample_size=None,
    ngram_range=(1, 1),
    analyzer='word',
    max_df=1.0,
    min_df=0.0,
    max_features=None,
    use_idf=True,
    sublinear_tf=True,
    n_clusters=10,
    tol=0.0001
)

The shape of the token matrix is: (9923, 53816)

The top 20 tokens are:
1.03	the
1.03	of
1.03	and
1.04	to
1.05	in
1.13	this
1.14	for
1.18	is
1.19	will
1.26	be
1.29	on
1.31	with
1.33	that
1.40	are
1.40	research
1.41	by
1.43	as
1.51	from
1.55	an
1.61	these

For the first 5 docs, the top tokens are:

Document 0, top terms by TF-IDF
0.34	trafficking
0.22	drug
0.20	database
0.19	discovery
0.18	mislocalization

Document 1, top terms by TF-IDF
0.25	nmr
0.23	optically
0.20	ingap
0.18	hayes
0.17	inp

Document 2, top terms by TF-IDF
0.25	fabric
0.19	textiles
0.18	pesticides
0.18	weapons
0.18	drapes

Document 3, top terms by TF-IDF
0.33	thundersnow
0.19	rawinsonde
0.18	lightning
0.16	snow
0.15	forecasting

Document 4, top terms by TF-IDF
0.20	dots
0.19	tutorials
0.18	helium
0.17	condensates
0.17	quantum
Cluster: 0 (1505 docs)
polymer, magnetic, spin, nano, laser, films, quantum, optical, nanoscale, nanoparticles

Cluster: 1 (1494 docs)
species, birds, political, social, genetic, populations, evol

In [48]:
%%time
# For this run, I'll take out the most and the least frequent words:
# max_df=0.95 drops near-universal terms (essentially stop words),
# min_df=0.001 drops words that are too niche to be descriptive,
# and max_features=10000 caps the vocabulary at the most frequent remaining terms.
# Also more clusters (n_clusters=20), because too many things got put together with 10.
picky_setup = vectorize_cluster(
    documents,
    sample_size=None,
    ngram_range=(1, 1),
    analyzer='word',
    max_df=0.95,
    min_df=0.001,
    max_features=10000,
    use_idf=True,
    sublinear_tf=True,
    n_clusters=20,
    tol=0.0001
)

The shape of the token matrix is: (9923, 10000)

The top 20 tokens are:
1.13	this
1.14	for
1.18	is
1.19	will
1.26	be
1.29	on
1.31	with
1.33	that
1.40	are
1.40	research
1.41	by
1.43	as
1.51	from
1.55	an
1.61	these
1.61	project
1.61	at
1.84	have
1.88	which
1.92	new

For the first 5 docs, the top tokens are:

Document 0, top terms by TF-IDF
0.36	trafficking
0.24	drug
0.22	database
0.20	discovery
0.19	diseases

Document 1, top terms by TF-IDF
0.27	nmr
0.24	optically
0.15	wells
0.15	gaas
0.15	heterostructures

Document 2, top terms by TF-IDF
0.29	fabric
0.21	textiles
0.20	pesticides
0.20	weapons
0.16	protect

Document 3, top terms by TF-IDF
0.21	lightning
0.18	snow
0.17	forecasting
0.17	synoptic
0.17	dangerous

Document 4, top terms by TF-IDF
0.21	dots
0.20	tutorials
0.19	helium
0.18	quantum
0.17	path

Cluster: 0 (427 docs)
phase, fuel, sensor, market, polymer, optical, cell, devices, manufacturing, drug

Cluster: 1 (225 docs)
available, not, zooplankton, zoology, zones, zone, zonal, zno, z

In [51]:
%%time
# Same idea as picky_setup, but with a stricter min_df=0.01:
# a term must now occur in at least 1% of the documents to be kept,
# which removes even more niche words
pickier_setup = vectorize_cluster(
    documents,
    sample_size=None,
    ngram_range=(1, 1),
    analyzer='word',
    max_df=0.95,
    min_df=0.01,
    max_features=10000,
    use_idf=True,
    sublinear_tf=True,
    n_clusters=20,
    tol=0.0001
)

The shape of the token matrix is: (9923, 2378)

The top 20 tokens are:
1.13	this
1.14	for
1.18	is
1.19	will
1.26	be
1.29	on
1.31	with
1.33	that
1.40	are
1.40	research
1.41	by
1.43	as
1.51	from
1.55	an
1.61	these
1.61	project
1.61	at
1.84	have
1.88	which
1.92	new

For the first 5 docs, the top tokens are:

Document 0, top terms by TF-IDF
0.28	drug
0.26	database
0.24	discovery
0.23	diseases
0.21	protein

Document 1, top terms by TF-IDF
0.15	chemistry
0.15	nanostructures
0.15	associates
0.15	detected
0.15	academia

Document 2, top terms by TF-IDF
0.23	medical
0.18	testing
0.17	workers
0.17	could
0.17	personnel

Document 3, top terms by TF-IDF
0.18	observations
0.17	collection
0.17	vertical
0.16	weather
0.16	summer

Document 4, top terms by TF-IDF
0.22	quantum
0.22	path
0.21	nanostructures
0.18	semiconductor
0.17	integral

Cluster: 0 (398 docs)
chemistry, molecules, reactions, organic, metal, complexes, professor, compounds, reaction, chemical

Cluster: 1 (631 docs)
solar, stars, wind, mag

I also tried an even stricter setup (`min_df=0.01`). The result is a bit better: the clusters are still not perfectly clear, but I can identify the topics much more quickly.
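
To make the effect of `min_df` concrete: with 9923 documents, `min_df=0.001` requires a term to occur in at least 10 documents, while `min_df=0.01` requires at least 100, which is why the vocabulary drops from the 10000-term cap to 2378 terms. Below is a minimal pure-Python sketch of that document-frequency filter. It is an illustration under my own assumptions, not scikit-learn's code: `filter_vocabulary` and the toy corpus are made up for this example.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_df=0.01, max_df=0.95):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # Count in how many documents each term occurs (not how often overall)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )


# Toy corpus, invented for illustration
docs = [
    "the solar wind".split(),
    "the stars and the solar flux".split(),
    "the coral reef".split(),
    "the reef species".split(),
]
print(filter_vocabulary(docs, min_df=0.5, max_df=0.95))
# → ['reef', 'solar']
```

Here `the` is removed by `max_df` (it occurs in every document) and the one-off words are removed by `min_df`, leaving only terms shared by some, but not all, documents.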

In [49]:
%%time
# Same parameters as picky_setup,
# but with bigrams as well as single words (ngram_range=(1, 2))
two_words_setup = vectorize_cluster(
    documents,
    sample_size=None,
    ngram_range=(1, 2),
    analyzer='word',
    max_df=0.95,
    min_df=0.001,
    max_features=10000,
    use_idf=True,
    sublinear_tf=True,
    n_clusters=20,
    tol=0.0001
)

The shape of the token matrix is: (9923, 10000)

The top 20 tokens are:
1.13	this
1.14	for
1.18	is
1.19	will
1.22	of the
1.26	be
1.29	on
1.31	with
1.33	that
1.40	are
1.40	research
1.40	in the
1.41	by
1.43	as
1.51	from
1.55	an
1.57	will be
1.61	these
1.61	project
1.61	at

For the first 5 docs, the top tokens are:

Document 0, top terms by TF-IDF
0.19	drug
0.18	the database
0.17	database
0.16	discovery
0.15	diseases

Document 1, top terms by TF-IDF
0.20	nmr
0.13	and undergraduate
0.12	graduate and
0.11	gaas
0.11	heterostructures

Document 2, top terms by TF-IDF
0.28	fabric
0.16	protect
0.15	medical
0.15	military
0.13	polymerization

Document 3, top terms by TF-IDF
0.19	lightning
0.16	snow
0.15	forecasting
0.12	winter
0.12	cloud

Document 4, top terms by TF-IDF
0.17	dots
0.16	helium
0.14	quantum
0.14	path
0.14	statistical mechanics

Cluster: 0 (351 docs)
mantle, seismic, fault, deformation, magma, subduction, lithosphere, crust, the mantle, rocks

Cluster: 1 (774 docs)
galaxies, stars, st

This one is good too, but some of the clusters get quite messy, so I'll go with the `pickier_setup`.
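
For reference, `ngram_range=(1, 2)` means that every single word and every pair of adjacent words becomes a candidate feature, which is how entries such as `of the` and `the mantle` end up in the output above. A minimal sketch of that expansion, assuming whitespace-tokenized input (`word_ngrams` is a hypothetical helper, not the scikit-learn implementation):

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams with n in the inclusive range, joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the document
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("deformation of the mantle".split()))
# → ['deformation', 'of', 'the', 'mantle', 'deformation of', 'of the', 'the mantle']
```

Since function-word bigrams like `of the` are very frequent, they compete with informative terms for the `max_features` slots, which may be one reason the bigram clusters look noisier.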

In [62]:
%%time
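# Only the tokenizer is used below, so the documents are not passed to the
# constructor (its first positional parameter is `input`, not the corpus).
# Note that build_tokenizer() only splits text with token_pattern; it does not
# lowercase or apply max_df/min_df/max_features.
new_vectorizer = TfidfVectorizer(
    ngram_range=(1, 1),
    analyzer='word',
    max_df=0.95,
    min_df=0.01,
    max_features=10000,
    use_idf=True,
    sublinear_tf=True
)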
word_tokenizer = new_vectorizer.build_tokenizer()
tokenized_text = [word_tokenizer(doc) for doc in documents]

dictionary = corpora.Dictionary(tokenized_text)
lda_corpus = [dictionary.doc2bow(text) for text in tokenized_text]
lda_model = models.LdaModel(lda_corpus, id2word=dictionary, num_topics=20)

# Function words to hide from the printed topic keywords (case-sensitive)
stop_terms = set(
    "this This that That these These have will Will "
    "the of and to for in or The is be may an a with at are on by as from can "
    "In it It has Has also not Not new".split()
)

for i, topic in lda_model.show_topics(num_topics=20, num_words=50, formatted=False):
    print("Topic", i)
    printed_terms = 0
    for term, score in topic:
        if printed_terms >= 10:
            break
        if term in stop_terms:
            continue
        printed_terms += 1
        print("%.4f\t%s" % (score, term))
    print()

Topic 0
0.0058	reef
0.0044	coral
0.0044	corals
0.0036	rig
0.0035	preK
0.0030	WGBH
0.0028	PM
0.0025	reefs
0.0020	job
0.0012	judicial

Topic 1
0.0374	Available
0.0111	routing
0.0052	protocol
0.0050	query
0.0028	terminals
0.0025	adaptivity
0.0023	box
0.0022	RPI
0.0020	volatility
0.0019	Internet

Topic 2
0.0088	research
0.0035	project
0.0034	their
0.0027	workshop
0.0025	which
0.0024	study
0.0022	information
0.0022	development
0.0021	how
0.0021	data

Topic 3
0.0132	girls
0.0093	children
0.0083	science
0.0036	their
0.0034	project
0.0033	research
0.0033	students
0.0023	geoscience
0.0020	about
0.0019	fMRI

Topic 4
0.0090	data
0.0042	project
0.0028	such
0.0028	model
0.0026	which
0.0026	between
0.0026	research
0.0025	models
0.0025	how
0.0024	information

Topic 5
0.0088	theory
0.0058	problems
0.0053	study
0.0053	which
0.0049	systems
0.0048	equations
0.0043	project
0.0040	such
0.0034	geometry
0.0034	mathematical

Topic 6
0.0060	project
0.0041	Phase
0.0037	high
0.0034	DNA
0.0033	research
0.0030	dev

LDA is rather quick and presents the results quite nicely. With a little filtering of noise words, the topics are more meaningful than the raw output.
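
The stop list in the cell above is case-sensitive and still lets words such as `their`, `which` and `such` through. One possible refinement, sketched here under my own assumptions, is to compare lowercased terms against a larger stop list. `top_keywords` is a hypothetical helper, and the scores in the example are invented; it expects the `(term, score)` pairs that `show_topics(formatted=False)` yields for one topic.

```python
def top_keywords(topic_terms, stop_terms, n=10):
    """Return the first n (term, score) pairs whose lowercased term is not a stop term.

    topic_terms is assumed to be sorted by descending score.
    """
    stop = {term.lower() for term in stop_terms}
    kept = []
    for term, score in topic_terms:
        if term.lower() in stop:
            continue
        kept.append((term, score))
        if len(kept) == n:
            break
    return kept


# Hypothetical scores, for illustration only
topic = [
    ("The", 0.0500),
    ("reef", 0.0058),
    ("their", 0.0050),
    ("coral", 0.0044),
    ("Which", 0.0030),
]
print(top_keywords(topic, ["the", "their", "which"], n=2))
# → [('reef', 0.0058), ('coral', 0.0044)]
```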

<a class="anchor" id="word-vector"></a>
### Word vectors

[top](#first-part)

In [5]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [7]:
# Only the vectorizer's default tokenizer is needed for word2vec
new_vectorizer = TfidfVectorizer()
word_tokenizer = new_vectorizer.build_tokenizer()
tokenized_text = [word_tokenizer(doc) for doc in documents]

In [23]:
# CBOW baseline (sg=0); note that gensim >= 4.0 renames `size` to `vector_size`
vectors = gensim.models.Word2Vec(tokenized_text, size=100, window=5, min_count=5, sg=0, workers=4)

for seed in ['silicon', 'flux', 'stratosphere', 'music', 'pitch']:
    print("Most similar to:", seed)
    print(vectors.wv.most_similar(seed))
    print()

2020-03-12 18:34:57,449 : INFO : collecting all words and their counts
2020-03-12 18:34:57,450 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-03-12 18:34:57,855 : INFO : collected 63538 word types from a corpus of 2656274 raw words and 9923 sentences
2020-03-12 18:34:57,856 : INFO : Loading a fresh vocabulary
2020-03-12 18:34:57,908 : INFO : effective_min_count=5 retains 20752 unique words (32% of original 63538, drops 42786)
2020-03-12 18:34:57,908 : INFO : effective_min_count=5 leaves 2585188 word corpus (97% of original 2656274, drops 71086)
2020-03-12 18:34:57,966 : INFO : deleting the raw counts dictionary of 63538 items
2020-03-12 18:34:57,968 : INFO : sample=0.001 downsamples 26 most-common words
2020-03-12 18:34:57,969 : INFO : downsampling leaves estimated 2005620 word corpus (77.6% of prior 2585188)
2020-03-12 18:34:58,017 : INFO : estimated required memory for 20752 words and 100 dimensions: 26977600 bytes
2020-03-12 18:34:58,018 : INFO : res

Most similar to: silicon
[('nanoparticles', 0.9142018556594849), ('films', 0.9072830677032471), ('thin', 0.9039884805679321), ('fibers', 0.9034739136695862), ('composites', 0.8895953893661499), ('film', 0.8893566131591797), ('amorphous', 0.882348895072937), ('aluminum', 0.8821524381637573), ('doped', 0.8797479867935181), ('oxide', 0.8784518241882324)]

Most similar to: flux
[('fluxes', 0.9093732833862305), ('thickness', 0.8973891735076904), ('concentration', 0.8948220014572144), ('melting', 0.8926695585250854), ('melt', 0.8925642371177673), ('clouds', 0.8725210428237915), ('accumulation', 0.8716062307357788), ('momentum', 0.8660423755645752), ('precipitation', 0.8651330471038818), ('dust', 0.8647468090057373)]

Most similar to: stratosphere
[('troposphere', 0.9066334366798401), ('porewaters', 0.9060134291648865), ('thickening', 0.9047499299049377), ('basaltic', 0.8800020217895508), ('plume', 0.879138708114624), ('thermosphere', 0.8738328218460083), ('mesosphere', 0.8577731847763062), (

In [24]:
# Skip-gram (sg=1), other parameters unchanged
vectors = gensim.models.Word2Vec(tokenized_text, size=100, window=5, min_count=5, sg=1, workers=4)

for seed in ['silicon', 'flux', 'stratosphere', 'music', 'pitch']:
    print("Most similar to:", seed)
    print(vectors.wv.most_similar(seed))
    print()

2020-03-12 18:36:14,779 : INFO : collecting all words and their counts
2020-03-12 18:36:14,780 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-03-12 18:36:15,170 : INFO : collected 63538 word types from a corpus of 2656274 raw words and 9923 sentences
2020-03-12 18:36:15,171 : INFO : Loading a fresh vocabulary
2020-03-12 18:36:15,373 : INFO : effective_min_count=5 retains 20752 unique words (32% of original 63538, drops 42786)
2020-03-12 18:36:15,374 : INFO : effective_min_count=5 leaves 2585188 word corpus (97% of original 2656274, drops 71086)
2020-03-12 18:36:15,437 : INFO : deleting the raw counts dictionary of 63538 items
2020-03-12 18:36:15,439 : INFO : sample=0.001 downsamples 26 most-common words
2020-03-12 18:36:15,440 : INFO : downsampling leaves estimated 2005620 word corpus (77.6% of prior 2585188)
2020-03-12 18:36:15,489 : INFO : estimated required memory for 20752 words and 100 dimensions: 26977600 bytes
2020-03-12 18:36:15,490 : INFO : res

Most similar to: silicon
[('carbide', 0.8787307739257812), ('nitride', 0.8575783967971802), ('wafers', 0.8548914194107056), ('SOI', 0.8428232669830322), ('SiC', 0.8398675322532654), ('LEDs', 0.8305456042289734), ('semiconducting', 0.8267442584037781), ('nanocomposite', 0.8259124159812927), ('germanium', 0.8222129344940186), ('wafer', 0.8169799447059631)]

Most similar to: flux
[('fluxes', 0.8159793615341187), ('heat', 0.786853551864624), ('latent', 0.7678372263908386), ('canopies', 0.7632498741149902), ('denitrification', 0.7597169280052185), ('firn', 0.7594362497329712), ('transports', 0.7594219446182251), ('momentum', 0.759409487247467), ('Hg', 0.7592573165893555), ('longwave', 0.7577768564224243)]

Most similar to: stratosphere
[('mesosphere', 0.8974559903144836), ('troposphere', 0.8770797252655029), ('tropopause', 0.8468611240386963), ('transports', 0.8443900942802429), ('thermosphere', 0.8377493619918823), ('midlatitude', 0.8325599431991577), ('remineralization', 0.829300284385681

This makes more sense to me, though `music` is still a weak spot, since none of these abstracts is about classical music.
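
The scores printed by `most_similar` are cosine similarities between word vectors. As a reminder of what is being ranked, here is a minimal pure-Python sketch with made-up 3-dimensional vectors (the real vectors here have 100 dimensions; the helper names and numbers are assumptions for illustration only):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, best first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for illustration
toy_vectors = {
    "silicon": [1.0, 0.2, 0.0],
    "carbide": [0.9, 0.3, 0.1],
    "music": [0.0, 0.1, 1.0],
}
print(nearest_words("silicon", toy_vectors, topn=2))
```

Because only the direction of the vectors matters, a rare word with few training examples can still score high against a seed word if its contexts happen to point the same way, which may explain some of the odd neighbours.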

In [25]:
# Skip-gram with a narrower window (3) and a higher min_count (10)
vectors = gensim.models.Word2Vec(tokenized_text, size=100, window=3, min_count=10, sg=1, workers=4)

for seed in ['silicon', 'flux', 'stratosphere', 'music', 'pitch']:
    print("Most similar to:", seed)
    print(vectors.wv.most_similar(seed))
    print()

2020-03-12 18:38:59,749 : INFO : collecting all words and their counts
2020-03-12 18:38:59,752 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-03-12 18:39:00,141 : INFO : collected 63538 word types from a corpus of 2656274 raw words and 9923 sentences
2020-03-12 18:39:00,142 : INFO : Loading a fresh vocabulary
2020-03-12 18:39:00,188 : INFO : effective_min_count=10 retains 13531 unique words (21% of original 63538, drops 50007)
2020-03-12 18:39:00,189 : INFO : effective_min_count=10 leaves 2538020 word corpus (95% of original 2656274, drops 118254)
2020-03-12 18:39:00,222 : INFO : deleting the raw counts dictionary of 63538 items
2020-03-12 18:39:00,224 : INFO : sample=0.001 downsamples 26 most-common words
2020-03-12 18:39:00,224 : INFO : downsampling leaves estimated 1955256 word corpus (77.0% of prior 2538020)
2020-03-12 18:39:00,250 : INFO : estimated required memory for 13531 words and 100 dimensions: 17590300 bytes
2020-03-12 18:39:00,250 : INFO : 

Most similar to: silicon
[('carbide', 0.8942508697509766), ('SiC', 0.8837494850158691), ('LEDs', 0.8702294826507568), ('nitride', 0.8609273433685303), ('SOI', 0.8567335605621338), ('wafers', 0.8488566875457764), ('GaN', 0.8457410931587219), ('semiconducting', 0.8403136730194092), ('doped', 0.8397879600524902), ('nanotubes', 0.8320607542991638)]

Most similar to: flux
[('fluxes', 0.8522653579711914), ('momentum', 0.8182167410850525), ('longwave', 0.8127259016036987), ('transports', 0.7976531982421875), ('salinity', 0.7946630716323853), ('O2', 0.7905195951461792), ('moisture', 0.7879533767700195), ('steep', 0.7848137617111206), ('outflow', 0.7828040719032288), ('thermosphere', 0.7814702987670898)]

Most similar to: stratosphere
[('mesosphere', 0.8834695816040039), ('troposphere', 0.8829984664916992), ('thermosphere', 0.875001072883606), ('EPS', 0.8548007011413574), ('Archean', 0.8498013019561768), ('overlying', 0.8486577868461609), ('carbonaceous', 0.848038375377655), ('transports', 0.84

Not much different, but the results still look pretty good. Especially for `pitch`: `tunability` now shows up, which is actually what I was looking for.
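
To put a number on "not much different", one option is the Jaccard overlap between two neighbour lists. The sketch below uses the `silicon` neighbours copied from the two skip-gram runs above (`window=5, min_count=5` versus `window=3, min_count=10`); `neighbour_overlap` is a hypothetical helper of mine, not part of gensim.

```python
def neighbour_overlap(words_a, words_b):
    """Jaccard overlap: shared words divided by all distinct words."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a | set_b)


# Neighbours of 'silicon', copied from the outputs of the two skip-gram runs
window5 = ['carbide', 'nitride', 'wafers', 'SOI', 'SiC', 'LEDs',
           'semiconducting', 'nanocomposite', 'germanium', 'wafer']
window3 = ['carbide', 'SiC', 'LEDs', 'nitride', 'SOI', 'wafers',
           'GaN', 'semiconducting', 'doped', 'nanotubes']

print(round(neighbour_overlap(window5, window3), 2))
# → 0.54
```

With a live model, the word lists could be built as `[w for w, _ in vectors.wv.most_similar('silicon')]`. Here 7 of the 13 distinct words are shared, so roughly half of the neighbourhood is stable across the two settings.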

In [26]:
# Same window and min_count as the previous run, but with 1000-dimensional vectors
vectors = gensim.models.Word2Vec(tokenized_text, size=1000, window=3, min_count=10, sg=1, workers=4)

for seed in ['silicon', 'flux', 'stratosphere', 'music', 'pitch']:
    print("Most similar to:", seed)
    print(vectors.wv.most_similar(seed))
    print()

2020-03-12 18:41:55,212 : INFO : collecting all words and their counts
2020-03-12 18:41:55,213 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-03-12 18:41:55,600 : INFO : collected 63538 word types from a corpus of 2656274 raw words and 9923 sentences
2020-03-12 18:41:55,601 : INFO : Loading a fresh vocabulary
2020-03-12 18:41:55,644 : INFO : effective_min_count=10 retains 13531 unique words (21% of original 63538, drops 50007)
2020-03-12 18:41:55,645 : INFO : effective_min_count=10 leaves 2538020 word corpus (95% of original 2656274, drops 118254)
2020-03-12 18:41:55,680 : INFO : deleting the raw counts dictionary of 63538 items
2020-03-12 18:41:55,681 : INFO : sample=0.001 downsamples 26 most-common words
2020-03-12 18:41:55,682 : INFO : downsampling leaves estimated 1955256 word corpus (77.0% of prior 2538020)
2020-03-12 18:41:55,707 : INFO : estimated required memory for 13531 words and 1000 dimensions: 115013500 bytes
2020-03-12 18:41:55,708 : INFO 

2020-03-12 18:42:47,472 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-03-12 18:42:47,501 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-03-12 18:42:47,504 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-03-12 18:42:47,542 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-03-12 18:42:47,543 : INFO : EPOCH - 5 : training on 2656274 raw words (1955442 effective words) took 8.8s, 221779 effective words/s
2020-03-12 18:42:47,543 : INFO : training on a 13281370 raw words (9776255 effective words) took 49.6s, 196927 effective words/s
2020-03-12 18:42:47,548 : INFO : precomputing L2-norms of word weight vectors


Most similar to: silicon
[('carbide', 0.8683127164840698), ('SiC', 0.8661246299743652), ('LEDs', 0.858808159828186), ('wafers', 0.8447800874710083), ('GaN', 0.8418112397193909), ('wafer', 0.826755702495575), ('nitride', 0.8264325857162476), ('conductive', 0.8259358406066895), ('ultrathin', 0.8248092532157898), ('nanocomposite', 0.8241363763809204)]

Most similar to: flux
[('fluxes', 0.8358415961265564), ('momentum', 0.8162760138511658), ('transports', 0.7871787548065186), ('O2', 0.7836939096450806), ('haze', 0.7819864749908447), ('vorticity', 0.7814186215400696), ('mesosphere', 0.7802945375442505), ('diapycnal', 0.7741315364837646), ('outflow', 0.7718337178230286), ('asthenosphere', 0.77115797996521)]

Most similar to: stratosphere
[('mesosphere', 0.9094099998474121), ('thermosphere', 0.8826446533203125), ('troposphere', 0.8724294900894165), ('H2O', 0.8540778160095215), ('circulating', 0.8538209199905396), ('diapycnal', 0.853712260723114), ('remineralization', 0.8531094789505005), ('as

I like this setup the most, mainly because of the stricter boundaries (narrower window, higher `min_count`) and, somehow, the increased dimensionality. Its results also match my expectations best: these are the words I would want to see appear together.

Moving on to the whole dataset.

In [28]:
%%time
documents_full = [read_file(file) for file in get_fnames('../abstracts/')]

CPU times: user 12.9 s, sys: 2.01 s, total: 14.9 s
Wall time: 14.9 s


In [29]:
# Only the vectorizer's default tokenizer is needed for word2vec
full_vectorizer = TfidfVectorizer()
full_word_tokenizer = full_vectorizer.build_tokenizer()
full_tokenized_text = [full_word_tokenizer(doc) for doc in documents_full]

In [31]:
# Preferred setup from the runs above, trained on the full set of abstracts
vectors_full = gensim.models.Word2Vec(full_tokenized_text, size=1000, window=3, min_count=10, sg=1, workers=4)

for seed in ['silicon', 'flux', 'stratosphere', 'music', 'pitch']:
    print("Most similar to:", seed)
    print(vectors_full.wv.most_similar(seed))
    print()

2020-03-12 18:55:40,081 : INFO : collecting all words and their counts
2020-03-12 18:55:40,082 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-03-12 18:55:40,302 : INFO : PROGRESS: at sentence #10000, processed 1535501 words, keeping 46762 word types
2020-03-12 18:55:40,533 : INFO : PROGRESS: at sentence #20000, processed 3047602 words, keeping 65659 word types
2020-03-12 18:55:40,842 : INFO : PROGRESS: at sentence #30000, processed 4977757 words, keeping 93131 word types
2020-03-12 18:55:41,218 : INFO : PROGRESS: at sentence #40000, processed 7438658 words, keeping 117620 word types
2020-03-12 18:55:41,623 : INFO : PROGRESS: at sentence #50000, processed 9998121 words, keeping 137906 word types
2020-03-12 18:55:41,899 : INFO : PROGRESS: at sentence #60000, processed 11734386 words, keeping 148537 word types
2020-03-12 18:55:42,215 : INFO : PROGRESS: at sentence #70000, processed 13706872 words, keeping 165794 word types
2020-03-12 18:55:42,654 : INFO : 

2020-03-12 18:56:45,138 : INFO : EPOCH 1 - PROGRESS: at 46.48% examples, 172085 words/s, in_qsize 8, out_qsize 2
2020-03-12 18:56:46,179 : INFO : EPOCH 1 - PROGRESS: at 47.32% examples, 172221 words/s, in_qsize 6, out_qsize 1
2020-03-12 18:56:47,192 : INFO : EPOCH 1 - PROGRESS: at 48.10% examples, 172265 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:56:48,214 : INFO : EPOCH 1 - PROGRESS: at 48.93% examples, 172183 words/s, in_qsize 8, out_qsize 0
2020-03-12 18:56:49,219 : INFO : EPOCH 1 - PROGRESS: at 49.97% examples, 172434 words/s, in_qsize 8, out_qsize 0
2020-03-12 18:56:50,233 : INFO : EPOCH 1 - PROGRESS: at 50.71% examples, 172386 words/s, in_qsize 8, out_qsize 0
2020-03-12 18:56:51,244 : INFO : EPOCH 1 - PROGRESS: at 51.44% examples, 172106 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:56:52,257 : INFO : EPOCH 1 - PROGRESS: at 52.09% examples, 171817 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:56:53,290 : INFO : EPOCH 1 - PROGRESS: at 52.88% examples, 171498 words/s, in_qsiz

2020-03-12 18:57:56,491 : INFO : EPOCH 2 - PROGRESS: at 3.33% examples, 175759 words/s, in_qsize 8, out_qsize 0
2020-03-12 18:57:57,519 : INFO : EPOCH 2 - PROGRESS: at 4.46% examples, 175054 words/s, in_qsize 7, out_qsize 1
2020-03-12 18:57:58,545 : INFO : EPOCH 2 - PROGRESS: at 5.58% examples, 176234 words/s, in_qsize 8, out_qsize 0
2020-03-12 18:57:59,561 : INFO : EPOCH 2 - PROGRESS: at 6.67% examples, 174047 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:58:00,590 : INFO : EPOCH 2 - PROGRESS: at 8.10% examples, 173224 words/s, in_qsize 8, out_qsize 0
2020-03-12 18:58:01,704 : INFO : EPOCH 2 - PROGRESS: at 9.36% examples, 172427 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:58:02,725 : INFO : EPOCH 2 - PROGRESS: at 10.44% examples, 172919 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:58:03,807 : INFO : EPOCH 2 - PROGRESS: at 11.61% examples, 172930 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:58:04,809 : INFO : EPOCH 2 - PROGRESS: at 12.78% examples, 174042 words/s, in_qsize 7, o

2020-03-12 18:59:11,795 : INFO : EPOCH 2 - PROGRESS: at 67.54% examples, 178887 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:59:12,872 : INFO : EPOCH 2 - PROGRESS: at 68.39% examples, 178622 words/s, in_qsize 8, out_qsize 0
2020-03-12 18:59:13,911 : INFO : EPOCH 2 - PROGRESS: at 69.14% examples, 178491 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:59:14,967 : INFO : EPOCH 2 - PROGRESS: at 70.12% examples, 178494 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:59:15,994 : INFO : EPOCH 2 - PROGRESS: at 71.09% examples, 178445 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:59:17,019 : INFO : EPOCH 2 - PROGRESS: at 72.00% examples, 178400 words/s, in_qsize 8, out_qsize 0
2020-03-12 18:59:18,053 : INFO : EPOCH 2 - PROGRESS: at 72.97% examples, 178325 words/s, in_qsize 7, out_qsize 0
2020-03-12 18:59:19,075 : INFO : EPOCH 2 - PROGRESS: at 74.03% examples, 178520 words/s, in_qsize 8, out_qsize 0
2020-03-12 18:59:20,125 : INFO : EPOCH 2 - PROGRESS: at 75.21% examples, 179023 words/s, in_qsiz

2020-03-12 19:00:24,130 : INFO : EPOCH 3 - PROGRESS: at 32.93% examples, 185007 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:00:25,145 : INFO : EPOCH 3 - PROGRESS: at 33.72% examples, 186062 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:00:26,145 : INFO : EPOCH 3 - PROGRESS: at 34.41% examples, 186105 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:00:27,176 : INFO : EPOCH 3 - PROGRESS: at 35.18% examples, 185827 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:00:28,177 : INFO : EPOCH 3 - PROGRESS: at 35.98% examples, 186291 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:00:29,187 : INFO : EPOCH 3 - PROGRESS: at 36.86% examples, 187258 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:00:30,209 : INFO : EPOCH 3 - PROGRESS: at 37.68% examples, 188113 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:00:31,238 : INFO : EPOCH 3 - PROGRESS: at 38.59% examples, 188727 words/s, in_qsize 6, out_qsize 1
2020-03-12 19:00:32,263 : INFO : EPOCH 3 - PROGRESS: at 39.77% examples, 188582 words/s, in_qsiz

2020-03-12 19:01:39,539 : INFO : EPOCH 3 - PROGRESS: at 93.46% examples, 175227 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:01:40,568 : INFO : EPOCH 3 - PROGRESS: at 94.22% examples, 175146 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:01:41,607 : INFO : EPOCH 3 - PROGRESS: at 95.07% examples, 175116 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:01:42,628 : INFO : EPOCH 3 - PROGRESS: at 95.84% examples, 175078 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:01:43,638 : INFO : EPOCH 3 - PROGRESS: at 96.59% examples, 175255 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:01:44,656 : INFO : EPOCH 3 - PROGRESS: at 97.41% examples, 175599 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:01:45,671 : INFO : EPOCH 3 - PROGRESS: at 98.27% examples, 176008 words/s, in_qsize 8, out_qsize 1
2020-03-12 19:01:46,711 : INFO : EPOCH 3 - PROGRESS: at 99.10% examples, 176373 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:01:47,605 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-

2020-03-12 19:02:51,769 : INFO : EPOCH 4 - PROGRESS: at 58.79% examples, 186962 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:02:52,775 : INFO : EPOCH 4 - PROGRESS: at 59.61% examples, 187336 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:02:53,845 : INFO : EPOCH 4 - PROGRESS: at 60.40% examples, 187479 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:02:54,894 : INFO : EPOCH 4 - PROGRESS: at 61.03% examples, 187137 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:02:55,895 : INFO : EPOCH 4 - PROGRESS: at 61.66% examples, 186810 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:02:56,919 : INFO : EPOCH 4 - PROGRESS: at 62.35% examples, 186321 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:02:57,931 : INFO : EPOCH 4 - PROGRESS: at 63.01% examples, 186093 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:02:58,965 : INFO : EPOCH 4 - PROGRESS: at 63.72% examples, 185595 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:03:00,092 : INFO : EPOCH 4 - PROGRESS: at 64.77% examples, 185209 words/s, in_qsiz

2020-03-12 19:04:04,225 : INFO : EPOCH 5 - PROGRESS: at 23.77% examples, 193889 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:04:05,240 : INFO : EPOCH 5 - PROGRESS: at 24.56% examples, 192388 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:04:06,252 : INFO : EPOCH 5 - PROGRESS: at 25.22% examples, 190790 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:04:07,291 : INFO : EPOCH 5 - PROGRESS: at 25.87% examples, 189737 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:04:08,353 : INFO : EPOCH 5 - PROGRESS: at 26.56% examples, 189175 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:04:09,384 : INFO : EPOCH 5 - PROGRESS: at 27.27% examples, 188613 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:04:10,393 : INFO : EPOCH 5 - PROGRESS: at 27.96% examples, 188248 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:04:11,472 : INFO : EPOCH 5 - PROGRESS: at 28.66% examples, 187498 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:04:12,490 : INFO : EPOCH 5 - PROGRESS: at 29.32% examples, 186887 words/s, in_qsiz

2020-03-12 19:05:19,756 : INFO : EPOCH 5 - PROGRESS: at 84.78% examples, 179300 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:05:20,818 : INFO : EPOCH 5 - PROGRESS: at 85.50% examples, 179113 words/s, in_qsize 8, out_qsize 0
2020-03-12 19:05:21,821 : INFO : EPOCH 5 - PROGRESS: at 86.24% examples, 178898 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:05:22,898 : INFO : EPOCH 5 - PROGRESS: at 86.91% examples, 178673 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:05:23,980 : INFO : EPOCH 5 - PROGRESS: at 87.82% examples, 178465 words/s, in_qsize 8, out_qsize 1
2020-03-12 19:05:24,982 : INFO : EPOCH 5 - PROGRESS: at 88.70% examples, 178471 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:05:26,007 : INFO : EPOCH 5 - PROGRESS: at 89.43% examples, 178356 words/s, in_qsize 8, out_qsize 1
2020-03-12 19:05:27,029 : INFO : EPOCH 5 - PROGRESS: at 90.30% examples, 178339 words/s, in_qsize 7, out_qsize 0
2020-03-12 19:05:28,080 : INFO : EPOCH 5 - PROGRESS: at 91.11% examples, 178299 words/s, in_qsiz

Most similar to: silicon
[('germanium', 0.6562955379486084), ('arsenide', 0.5950453281402588), ('polysilicon', 0.5918377637863159), ('silicide', 0.5918266773223877), ('nitride', 0.5912208557128906), ('phosphide', 0.5898659229278564), ('oxynitride', 0.5877647399902344), ('indium', 0.5833597183227539), ('SiGe', 0.5799596905708313), ('hydrogenated', 0.5767875909805298)]

Most similar to: flux
[('fluxes', 0.5494881868362427), ('Poynting', 0.478647381067276), ('PON', 0.43299418687820435), ('diapycnal', 0.43154287338256836), ('downdraft', 0.42625388503074646), ('sinking', 0.424627423286438), ('irradiances', 0.41862404346466064), ('TCO2', 0.417788565158844), ('Joule', 0.4175970256328583), ('shortwave', 0.41750848293304443)]

Most similar to: stratosphere
[('troposphere', 0.7435741424560547), ('mesosphere', 0.7202921509742737), ('stratospheric', 0.6767431497573853), ('tropospheric', 0.6421911716461182), ('mesopause', 0.6389365792274475), ('mesospheric', 0.6287567615509033), ('MLT', 0.615911126

Well, with much more training data, the similar words are clearly better than before. The similarity scores went down, though, probably because the vocabulary is significantly larger.
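To make the comparison a bit less impressionistic, the neighbour lists from two models can be compared by overlap and by average score. This is only a minimal sketch: the helper names `neighbour_overlap` and `mean_score` are hypothetical (not part of the notebook or of gensim), and the small-corpus list is a placeholder that should be swapped for the real output of the smaller model.

```python
from statistics import mean

def neighbour_overlap(a, b):
    """Words that appear in both (word, score) neighbour lists, sorted alphabetically."""
    return sorted({w for w, _ in a} & {w for w, _ in b})

def mean_score(neighbours):
    """Average similarity score of a (word, score) neighbour list."""
    return mean(score for _, score in neighbours)

# First three neighbours of 'silicon' from the full-corpus run printed above.
full_corpus = [('germanium', 0.6562955379486084),
               ('arsenide', 0.5950453281402588),
               ('polysilicon', 0.5918377637863159)]

# PLACEHOLDER values, not real results: replace with the list from the smaller model.
small_corpus = [('germanium', 0.9), ('wafer', 0.88), ('oxide', 0.87)]

print("Shared neighbours:", neighbour_overlap(small_corpus, full_corpus))
print("Mean score (small): %.3f" % mean_score(small_corpus))
print("Mean score (full):  %.3f" % mean_score(full_corpus))
```

A low overlap together with a lower mean score would fit the guess that the larger vocabulary spreads similarity over more candidate words, though it does not prove it.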

Now, for that part with ELMo

In [None]:
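import tensorflow as tf
import tensorflow_hub as hub

# Load ELMo model (takes a little while); hub.Module needs graph mode, so eager execution is turned off
tf.compat.v1.disable_eager_execution()
elmo = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True)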

In [8]:
def elmo_vectors(sents):
    embeddings = elmo(sents, signature="default", as_dict=True)["elmo"]
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        return sess.run(embeddings)

In [22]:
sents = [
    'The pillow was there to break the fall.',
    "He's going to break the ladder",
    'Give me a break',
    "I'm taking a break after this",
    'Her water is about to break',
    'I will break this into pieces',
    'Tea break is a common thing in England',
    'All it needs is one break',
    'break a leg',
    'See me in my office during the break'
]

target = 'break'

elmo_vecs = elmo_vectors(sents)
word_vecs = []
for i, sent in enumerate(sents):
    word_vecs.append(elmo_vecs[i][sent.split().index(target)])
    print("Sentence:", sent)
    print("Vector for '%s':" % target, word_vecs[-1])
    print()

print("Word vector size:", word_vecs[0].shape)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


Sentence: The pillow was there to break the fall.
Vector for 'break': [-0.11922167  0.00357817 -0.33186203 ... -0.15633361  0.6160723
 -0.154829  ]

Sentence: He's going to break the ladder
Vector for 'break': [-0.35467988 -0.31914636  0.3277204  ...  0.22955832  0.14098051
 -0.24496101]

Sentence: Give me a break
Vector for 'break': [-0.38369322 -0.30650288 -0.15357974 ...  0.20365655  0.23261338
  0.19071695]

Sentence: I'm taking a break after this
Vector for 'break': [-0.20114338 -0.4249664   0.3780563  ... -0.22613028  0.5801039
 -0.0413116 ]

Sentence: Her water is about to break
Vector for 'break': [-0.13044037  0.17287034  0.536402   ...  0.20365655  0.23261338
  0.19071695]

Sentence: I will break this into pieces
Vector for 'break': [-0.04437125 -0.4148057   0.18563403 ... -0.01195487  0.37412658
 -0.5906379 ]

Sentence: Tea break is a common thing in England
Vector for 'break': [ 0.33905056  0.06670371  0.17880782 ... -0.23717158  0.5685922
  0.02736001]

Sentence: All it ne

The function just doesn't cope with capital letters or the end of a sentence, does it?
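If the problem is that `sent.split().index(target)` raises a `ValueError` whenever the target is capitalised or has punctuation attached (for example `Break` or `break.`), a more forgiving lookup could look like the sketch below. The helper `target_index` is hypothetical, and it assumes that the ELMo `default` signature splits sentences on whitespace, so that positions from `sentence.split()` line up with the returned vectors.

```python
import string

def target_index(sentence, target):
    """Index of `target` among whitespace tokens, ignoring case and edge punctuation."""
    wanted = target.lower()
    for i, token in enumerate(sentence.split()):
        # Strip punctuation only from the ends of the token, then compare case-insensitively.
        if token.strip(string.punctuation).lower() == wanted:
            return i
    raise ValueError("%r not found in %r" % (target, sentence))

print(target_index("Break a leg.", "break"))
print(target_index("Her water is about to break.", "break"))
```

Note that this only fixes the lookup. The model would still see `Break` and `break.` as different surface tokens, so the vectors themselves may differ from those of plain `break`.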

In [25]:
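from sklearn.metrics.pairwise import cosine_similarity

ref = 5  # index of the reference sentence ("I will break this into pieces")
print("Similarities between '%s' vector in sentences:" % target)
for i in range(len(sents)):
    # cosine_similarity expects 2-D arrays, hence the reshape to a single row
    sim = cosine_similarity(word_vecs[ref].reshape(1, -1), word_vecs[i].reshape(1, -1))[0][0]
    print("Sent %d-%d:" % (ref, i), sim)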

Similarities between 'break' vector in sentences:
Sent 5-0: 0.7517743
Sent 5-1: 0.8202457
Sent 5-2: 0.527225
Sent 5-3: 0.5016476
Sent 5-4: 0.6245154
Sent 5-5: 1.0000001
Sent 5-6: 0.4141451
Sent 5-7: 0.50315714
Sent 5-8: 0.77526045
Sent 5-9: 0.5151755


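Pretty much okay. The exception is the 5th sentence (`i=4`): ELMo seems to treat `to break` at the end of a sentence differently from `to break` in the middle of one, although both are verbs. One possible clue is that the printed vectors for the two sentences ending in `break` (`i=2` and `i=4`) have identical trailing components, so position in the sentence appears to affect part of the representation. Not so much a failure as an ambiguous result.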
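The pattern is easier to see when the similarities are averaged per word class. The sketch below reuses the values printed above; the verb/noun labels are my own hand-assigned assumption about how `break` is used in each sentence, and `mean_by_group` is a hypothetical helper.

```python
from statistics import mean

# Cosine similarities to sentence 5 ("I will break this into pieces"), as printed above.
sims = [0.7517743, 0.8202457, 0.527225, 0.5016476, 0.6245154,
        1.0000001, 0.4141451, 0.50315714, 0.77526045, 0.5151755]

# ASSUMPTION: hand-assigned word class of 'break' in each of the ten sentences.
pos = ['verb', 'verb', 'noun', 'noun', 'verb',
       'verb', 'noun', 'noun', 'verb', 'noun']

REFERENCE = 5  # the reference sentence itself is skipped (self-similarity)

def mean_by_group(values, labels, skip):
    """Average the values per label, leaving out the entry at index `skip`."""
    groups = {}
    for i, (value, label) in enumerate(zip(values, labels)):
        if i != skip:
            groups.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in groups.items()}

averages = mean_by_group(sims, pos, REFERENCE)
for label, value in sorted(averages.items()):
    print("%s: %.3f" % (label, value))
```

With these labels the verb uses come out clearly closer to the reference than the noun uses, and sentence 4 (0.62) sits between the two groups, which matches the "ambiguous" reading.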


[top](#first-part)