
In [None]:
%matplotlib inline


Doc2Vec Model
=============

Introduces Gensim's Doc2Vec model and demonstrates its use on the
`Lee Corpus <https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf>`__.




# Imports


In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
# CH3 imports

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Doc2Vec is a `core_concepts_model` that represents each
`core_concepts_document` as a `core_concepts_vector`.  This
tutorial introduces the model and demonstrates how to train and assess it.

Here's a list of what we'll be doing:

0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec
1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)
2. Train a Doc2Vec `core_concepts_model` model using the training corpus
3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`
4. Assess the model
5. Test the model on the test corpus

Review: Bag-of-words
--------------------

.. Note:: Feel free to skip these review sections if you're already familiar with the models.

You may be familiar with the `bag-of-words model
<https://en.wikipedia.org/wiki/Bag-of-words_model>`_ from the
`core_concepts_vector` section.
This model transforms each document to a fixed-length vector of integers.
For example, given the sentences:

- ``John likes to watch movies. Mary likes movies too.``
- ``John also likes to watch football games. Mary hates football.``

The model outputs the vectors:

- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``
- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``

Each vector has 11 elements, where each element counts the number of times a
particular word occurred in the document.
The order of elements is arbitrary.
In the example above, the order of the elements corresponds to the words:
``["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"]``.
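As a minimal sketch of how these counts arise, the two vectors can be reproduced with the standard library. The ``bag_of_words`` helper and the regex tokenizer are illustrative assumptions, not part of Gensim.

```python
import re
from collections import Counter

# Word order used in the example above
vocab = ["John", "likes", "to", "watch", "movies", "Mary", "too",
         "also", "football", "games", "hates"]

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in `text` (hypothetical helper)."""
    counts = Counter(re.findall(r"\w+", text))  # keep runs of word characters, dropping punctuation
    return [counts[word] for word in vocab]

doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games. Mary hates football."

print(bag_of_words(doc1, vocab))
print(bag_of_words(doc2, vocab))
```

Because the counts are indexed by a fixed vocabulary, every document maps to a vector of the same length regardless of how long the document is.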

Bag-of-words models are surprisingly effective, but have several weaknesses.

First, they lose all information about word order: "John likes Mary" and
"Mary likes John" correspond to identical vectors. A partial solution exists: bag
of `n-grams <https://en.wikipedia.org/wiki/N-gram>`__
models represent documents as fixed-length vectors over word phrases of
length n, which captures local word order. However, these models suffer from data
sparsity and high dimensionality.
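A small standard-library sketch shows the difference. The ``ngrams`` helper is a hypothetical name introduced here for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "John likes Mary".split()
b = "Mary likes John".split()

# Unigram counts ignore order, so the two sentences look identical
same_unigrams = Counter(ngrams(a, 1)) == Counter(ngrams(b, 1))
# Bigram counts keep local order, so the two sentences differ
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

print(same_unigrams, same_bigrams)
```

The cost is visible even here: three distinct words produce four distinct bigrams across the two sentences, and the number of possible n-grams grows quickly with vocabulary size.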

Second, the model does not attempt to learn the meaning of the underlying
words, and as a consequence, the distance between vectors doesn't always
reflect the difference in meaning.  The ``Word2Vec`` model addresses this
second problem.

Review: ``Word2Vec`` Model
--------------------------

``Word2Vec`` is a more recent model that embeds words in a lower-dimensional
vector space using a shallow neural network. The result is a set of
word-vectors where vectors close together in vector space have similar
meanings based on context, and word-vectors distant to each other have
differing meanings. For example, ``strong`` and ``powerful`` would be close
together and ``strong`` and ``Paris`` would be relatively far.

Gensim's :py:class:`~gensim.models.word2vec.Word2Vec` class implements this model.

With the ``Word2Vec`` model, we can calculate the vectors for each **word** in a document.
But what if we want to calculate a vector for the **entire document**\ ?
We could average the vectors for each word in the document - while this is quick and crude, it can often be useful.
However, there is a better way...
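The averaging baseline can be sketched as follows. The three-dimensional word vectors are made-up values for illustration; in practice they would come from a trained ``Word2Vec`` model, and ``average_vector`` is a hypothetical helper name.

```python
# Toy word vectors (invented values; a trained model would supply these)
word_vectors = {
    "strong":   [0.9, 0.1, 0.0],
    "powerful": [0.8, 0.2, 0.0],
    "paris":    [0.0, 0.1, 0.9],
}

def average_vector(tokens, word_vectors):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no known tokens to average")
    return [sum(dim) / len(known) for dim in zip(*known)]

doc_vector = average_vector(["strong", "powerful", "unknown"], word_vectors)
print(doc_vector)
```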

Introducing: Paragraph Vector
-----------------------------

.. Important:: In Gensim, we refer to the Paragraph Vector model as ``Doc2Vec``.

Le and Mikolov in 2014 introduced the `Doc2Vec algorithm <https://cs.stanford.edu/~quocle/paragraph_vector.pdf>`__,
which usually outperforms such simple-averaging of ``Word2Vec`` vectors.

The basic idea is: act as if a document has another floating word-like
vector, which contributes to all training predictions, and is updated like
other word-vectors, but we will call it a doc-vector. Gensim's
:py:class:`~gensim.models.doc2vec.Doc2Vec` class implements this algorithm.

There are two implementations:

1. Paragraph Vector - Distributed Memory (PV-DM)
2. Paragraph Vector - Distributed Bag of Words (PV-DBOW)

.. Important::
  Don't let the implementation details below scare you.
  They're advanced material: if it's too much, then move on to the next section.

PV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training
a neural network on the synthetic task of predicting a center word based on an
average of both context word-vectors and the full document's doc-vector.

PV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training
a neural network on the synthetic task of predicting a target word just from
the full document's doc-vector. (It is also common to combine this with
skip-gram training, using both the doc-vector and nearby word-vectors to
predict a single target word, but only one at a time.)

Prepare the Training and Test Data
----------------------------------

For this tutorial, we'll be training our model using the `Lee Background
Corpus
<https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf>`_
included in gensim. This corpus contains 314 documents selected from the
Australian Broadcasting Corporation’s news mail service, which provides text
e-mails of headline stories and covers a number of broad topics.

And we'll test our model by eye using the much shorter `Lee Corpus
<https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf>`_
which contains 50 documents.




In [None]:
# Fix the gensim version problem by upgrading to the latest release
!pip install --upgrade gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/44/52/f1417772965652d4ca6f901515debcd9d6c5430969e8c02ee7737e6de61c/gensim-4.0.1-cp37-cp37m-manylinux1_x86_64.whl (23.9MB)
Installing collected packages: gensim
  Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.0.1


In [None]:
# Dataset for Challenge 3: a pipe-delimited text file
url = 'https://raw.githubusercontent.com/DGuilherme/AAUTIA2/main/Doc2Vec/dataset/dataset.txt'
train_data = pd.read_csv(url, delimiter="|")
# The dataset is now stored in a pandas DataFrame
print(train_data)

                            ID  ...                                             Resumo
0    57025eb6eb1ec9f5515f7c33   ...   A damper for use in submerged hydrophone susp...
1    57025eb6eb1ec9f5515f7c34   ...   A flexible longitudinally continuous tape con...
2    57025eb6eb1ec9f5515f7c35   ...   The method of packaging perishable products i...
3    57025eb6eb1ec9f5515f7c36   ...   An improvement in a tampon for absorbing mens...
4    57025eb6eb1ec9f5515f7c37   ...   A heavy duty motor vehicle has a main frame w...
..                         ...  ...                                                ...
495  57025eb6eb1ec9f5515f7e22   ...   A peristaltic pump comprising a portable hous...
496  57025eb6eb1ec9f5515f7e23   ...   A gas compressor or blower comprising a cylin...
497  57025eb6eb1ec9f5515f7e24   ...   A rotary cell pump is provided for the convey...
498  57025eb6eb1ec9f5515f7e25   ...   An overspeed shutoff device for use in a rota...
499  57025eb6eb1ec9f5515f7e26   ...   A rot

In [None]:
train_data.describe()

Unnamed: 0,ID,Titulo,Resumo
count,500,500,500
unique,500,497,500
top,57025eb6eb1ec9f5515f7cd9,Heat exchanger,A small portable motorized attachment for pro...
freq,1,2,1


In [None]:

import gensim

def tagData(dataframe):
  # Tokenize each abstract ('Resumo') and tag it with its row index
  for index, row in dataframe.iterrows():
    tokens = gensim.utils.simple_preprocess(row['Resumo'])
    yield gensim.models.doc2vec.TaggedDocument(tokens, [index])


# Despite its name, `vocabulary` holds the list of tagged training documents
vocabulary = list(tagData(train_data))



In [None]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)  # Create initial empty model

2021-04-08 22:23:34,817 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3)', 'datetime': '2021-04-08T22:23:34.817860', 'gensim': '4.0.1', 'python': '3.7.10 (default, Feb 20 2021, 21:17:23) \n[GCC 7.5.0]', 'platform': 'Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'created'}


Build a vocabulary



In [None]:
model.build_vocab(vocabulary) # Add data to the model

2021-04-08 22:25:07,519 : INFO : collecting all words and their counts
2021-04-08 22:25:07,521 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2021-04-08 22:25:07,542 : INFO : collected 5094 word types and 500 unique tags from a corpus of 500 examples and 56523 words
2021-04-08 22:25:07,543 : INFO : Creating a fresh vocabulary
2021-04-08 22:25:07,561 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=2 retains 3187 unique words (62.563800549666276%% of original 5094, drops 1907)', 'datetime': '2021-04-08T22:25:07.561467', 'gensim': '4.0.1', 'python': '3.7.10 (default, Feb 20 2021, 21:17:23) \n[GCC 7.5.0]', 'platform': 'Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'prepare_vocab'}
2021-04-08 22:25:07,563 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=2 leaves 54616 word corpus (96.62615218583585%% of original 56523, drops 1907)', 'datetime': '2021-04-08T22:25:07.563091', 'gensim': '4.0.1', 'python': '3.7.10 (default

Essentially, the vocabulary is a list (accessible via
``model.wv.index_to_key``) of all of the unique words extracted from the training corpus
that occur at least ``min_count`` times.
Additional attributes for each word are available using the ``model.wv.get_vecattr()`` method.
For example, to see how many times ``penalty`` appeared in the training corpus:




In [None]:
print(f"Word 'penalty' appeared {model.wv.get_vecattr('penalty', 'count')} times in the training corpus.")

Word 'penalty' appeared 4 times in the training corpus.
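To make the effect of ``min_count=2`` concrete, here is a standard-library sketch of the kind of filtering reported in the vocabulary log above. It runs on an invented toy corpus and is an illustration only, not Gensim's actual implementation.

```python
from collections import Counter

# Toy tokenized corpus (invented for illustration)
corpus = [
    ["rotary", "pump", "with", "rotor"],
    ["gas", "compressor", "with", "piston"],
    ["rotary", "cell", "pump"],
]
min_count = 2

# Count every word across all documents
counts = Counter(token for doc in corpus for token in doc)

# Keep only words that occur at least `min_count` times
retained = sorted(word for word, count in counts.items() if count >= min_count)
dropped = len(counts) - len(retained)

print(retained, dropped)
```

Words seen only once carry too little context to learn a useful vector from, which is why dropping them removes many word types but only a small share of the corpus.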


Next, train the model on the corpus.
If optimized Gensim (with BLAS library) is being used, this should take no more than 3 seconds.
If the BLAS library is not being used, this should take no more than 2
minutes, so use optimized Gensim with BLAS if you value your time.




In [None]:
model.train(vocabulary, total_examples=model.corpus_count, epochs=model.epochs)

2021-04-08 22:25:16,460 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 3 workers on 3187 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=5', 'datetime': '2021-04-08T22:25:16.460197', 'gensim': '4.0.1', 'python': '3.7.10 (default, Feb 20 2021, 21:17:23) \n[GCC 7.5.0]', 'platform': 'Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'train'}
2021-04-08 22:25:16,603 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-04-08 22:25:16,612 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-04-08 22:25:16,614 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-04-08 22:25:16,618 : INFO : EPOCH - 1 : training on 56523 raw words (40040 effective words) took 0.1s, 275971 effective words/s
2021-04-08 22:25:16,724 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-04-08 22:25:16,735 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-04-08 22:25:16,

Now, we can use the trained model to infer a vector for any piece of text
by passing a list of words to the ``model.infer_vector`` function. This
vector can then be compared with other vectors via cosine similarity.
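For reference, cosine similarity can be written in a few lines. This is a plain-Python sketch with invented two-dimensional vectors standing in for inferred doc-vectors; the ``cosine_similarity`` helper is not a Gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (invented values)
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The score depends only on direction, not on vector length, so it ranges from -1 (opposite) through 0 (unrelated) to 1 (same direction).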




In [None]:
vector = model.infer_vector(['overspeed', 'shutoff', 'device', 'for', 'use'])
print(vector)

[-0.0380471   0.07107193 -0.16104157 -0.05671706 -0.11206745 -0.25842693
  0.09781607  0.21397597 -0.3734844   0.05311184 -0.0563758   0.00350833
  0.08645441  0.10875712 -0.2533288   0.09351682  0.24101683  0.13937809
 -0.23678514 -0.05698922 -0.12249822  0.0262745   0.20443738 -0.06480797
  0.25651255  0.06226546  0.09046153 -0.11489527  0.00764522 -0.05704122
  0.12138305  0.3093298  -0.04058864  0.02917185 -0.00899924  0.20694292
  0.15399466 -0.222546    0.0661717   0.03404893  0.21823987  0.02295602
  0.09952864 -0.18332772  0.30713084  0.04149257 -0.13677527 -0.25025985
  0.2873318   0.15527335]


Note that ``infer_vector()`` does *not* take a string, but rather a list of
string tokens, which should have already been tokenized the same way as the
``words`` property of original training document objects.

Also note that because the underlying training/inference algorithms are an
iterative approximation problem that makes use of internal randomization,
repeated inferences of the same text will return slightly different vectors.




Assessing the Model
-------------------

To assess our new model, we'll first infer new vectors for each document of
the training corpus, compare the inferred vectors with the training corpus,
and then return the rank of each document based on self-similarity.
Basically, we're pretending that the training corpus is new, unseen data
and then seeing how it compares with the trained model. The expectation is
that we've likely overfit our model (i.e., all of the ranks will be less than
2) and so we should be able to find similar documents very easily.
Additionally, we'll keep track of the second-ranked document for each
inference, for a comparison of less similar documents.
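The rank computation can be illustrated with a hand-written similarity list. The ``sims`` values below are invented and ``self_rank`` is a hypothetical helper; in the tutorial the list comes from ``model.dv.most_similar``.

```python
# (doc_id, similarity) pairs sorted from most to least similar (invented values)
sims = [(7, 0.91), (3, 0.55), (0, 0.12)]

def self_rank(doc_id, sims):
    """Position of `doc_id` in the ranking; 0 means the document is its own nearest match."""
    return [other_id for other_id, _ in sims].index(doc_id)

print(self_rank(7, sims))  # document 7 is ranked first
print(self_rank(3, sims))  # document 3 is ranked second
```

A rank of 0 for a training document means the vector inferred from its text landed closest to the vector learned for that same document during training.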




In [None]:
ranks = []
second_ranks = []
for doc_id in range(len(vocabulary)):
    inferred_vector = model.infer_vector(vocabulary[doc_id].words)  # infer a vector for this document
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))  # similarity to every training document
    rank = [docid for docid, sim in sims].index(doc_id)  # position of the document in its own ranking
    ranks.append(rank)

    second_ranks.append(sims[1])

Let's count how each document ranks with respect to the training corpus.

NB. Results vary between runs due to random seeding and the very small corpus.



In [None]:
import collections

counter = collections.Counter(ranks)
print(counter)

Counter({0: 498, 1: 2})


In this run, 498 of the 500 inferred documents (99.6%) are found to be most
similar to themselves, and the remaining 2 (0.4%) are mistakenly most similar to
another document. Checking the inferred vector against a
training vector is a sort of 'sanity check' as to whether the model is
behaving in a usefully consistent manner, though not a real 'accuracy' value.
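The share of self-matches can be computed directly from the counter. The counts below are copied from the output above and will differ between runs.

```python
from collections import Counter

# Rank counts reported in the output above
counter = Counter({0: 498, 1: 2})

total = sum(counter.values())
self_match_rate = counter[0] / total

print(f"{self_match_rate:.1%} of documents rank themselves first")
```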

This is great and not entirely surprising. We can take a look at an example:




In [None]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(vocabulary[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(vocabulary[sims[index][0]].words)))

Document (499): «rotary positive displacement pump of the internally meshing screw type having stator rotor and an inlet or outlet chamber at one end of the rotor has drive comprising connecting member rigidly secured to the rotor and extending through the chamber which connecting member beyond the end of the chamber remote from the rotor is joined by connecting rod with two universal joints to drive shaft connecting member in the chamber being supported by bearing flexibly carried on resilient support member which is sealed to the outer race of the bearing and also to the chamber wall»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (499, 0.9197404980659485): «rotary positive displacement pump of the internally meshing screw type having stator rotor and an inlet or outlet chamber at one end of the rotor has drive comprising connecting member rigidly secured to the rotor and extending through the chamber which connecting member beyond the end of the cham

Notice above that the most similar document (usually the same text) has a
similarity score approaching 1.0. However, the similarity score for the
second-ranked documents should be significantly lower (assuming the documents
are in fact different) and the reasoning becomes obvious when we examine the
text itself.

We can run the next cell repeatedly to see a sampling of other target-document
comparisons.




In [None]:
print(train_data.iloc[499]['Titulo'],' | ',train_data.iloc[499]['Resumo'])

print(train_data.iloc[496]['Titulo'],' | ',train_data.iloc[496]['Resumo'])

print(train_data.iloc[181]['Titulo'],' | ',train_data.iloc[181]['Resumo'])





 Rotary displacement pumps   |   A rotary positive displacement pump of the internally-meshing screw type having a stator, a rotor and an inlet or outlet chamber at one end of the rotor has a drive comprising a connecting member rigidly secured to the rotor and extending through the chamber, which connecting member, beyond the end of the chamber remote from the rotor, is joined by a connecting rod with two universal joints to a drive shaft, connecting member in the chamber being supported by a bearing flexibly carried on a resilient support member which is sealed to the outer race of the bearing and also to the chamber wall.
 Apparatus for use as a gas compressor or gas blower   |   A gas compressor or blower comprising a cylinder having air intake and exhaust ports; a piston which is rotatably and reciprocably movable in the cylinder; valve means for admitting air into and exhausting air from at least one chamber lying to one side of the piston; a piston shaft to which the piston is s

In [None]:
# Pick a random document from the corpus and infer a vector from the model
import random
doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Train Document (79): «dozens of people were injured some seriously and others were trapped after roof collapsed at south african shopping centre burying some children on skating rink witnesses and police said initial reports spoke of up to people trapped at the kolonnade shopping centre in the north of the capital but pretoria emergency spokesman johan pieterse later said police had rescued four people from the rubble and could not immediately locate any more police and police dogs are still inside but can find anyone else just now he added by pm local time am aedt two hours after the collapse injured people mostly adults had been taken to hospitals around the city by ambulance or helicopter mr pieterse said most of those injured were those standing at the glass wall watching people ice skate he added some of the injured included children skating on the ice rink who were partially buried in rubble when the roof gave way witnesses said some square metres of roofing were believed to have

Testing the Model
-----------------

Using the same approach as above, we'll infer the vector for a randomly chosen
test document and compare the document to our model by eye.




In [None]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Test Document (8): «hunan province remained on high alert last night as thunderstorms threatened to exacerbate the flood crisis now entering its fifth day and with already dead and hundreds of thousands evacuated on the flood frontline at dongting lake the water level peaked at just under on saturday night then eased about cm during the day under hot sun with temperatures reaching but with the lake still brimming at dangerously high levels and spilling over the top of its banks in some places locals were fearful that thunderstorm and high winds forecast to hit the region last night would damage the dikes about km of dikes around the lake are all that stand between million people in the surrounding farmland and disaster»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (10, 0.6962526440620422): «work is continuing this morning to restore power supplies to tens of thousands of homes that were blacked out during wild storms that struck south east queensland 

Conclusion
----------

Let's review what we've seen in this tutorial:

0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec
1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)
2. Train a Doc2Vec `core_concepts_model` model using the training corpus
3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`
4. Assess the model
5. Test the model on the test corpus

That's it! Doc2Vec is a great way to explore relationships between documents.

Additional Resources
--------------------

If you'd like to know more about the subject matter of this tutorial, check out the links below.

* `Word2Vec Paper <https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf>`_
* `Doc2Vec Paper <https://cs.stanford.edu/~quocle/paragraph_vector.pdf>`_
* `Dr. Michael D. Lee's Website <http://faculty.sites.uci.edu/mdlee>`_
* `Lee Corpus <http://faculty.sites.uci.edu/mdlee/similarity-data/>`__
* `IMDB Doc2Vec Tutorial <doc2vec-IMDB.ipynb>`_


