# Unsupervised Learning with the Paragraph Vector Algorithm

## Unsupervised Learning

Unsupervised learning is the training of an artificial intelligence (AI) algorithm using information that is neither classified nor labeled and allowing the algorithm to act on that information without guidance.

<img src="./unsupervised_examples.png">

### Unsupervised Techniques
* Clustering
* Principal Component Analysis (PCA)
* Anomaly detection
* Autoencoders
* Deep Belief Nets
* Hebbian Learning
* Generative Adversarial Networks (GANs)
* Self-Organizing Maps


## Representation

<img src="./representation.png">

## Paragraph Vector algorithm (also known as Document To Vector/D2V)

<img src="./para_vec_new.png">

In [12]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [13]:
import gensim
import os
import collections
import smart_open
import random

## What is it?

Doc2Vec is an NLP tool for representing documents as vectors and is a generalization of the Word2Vec method. This tutorial serves as an introduction to Doc2Vec and presents ways to train and assess a Doc2Vec model.

## Resources

* [Word2Vec Paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
* [Doc2Vec Paper](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
* [Dr. Michael D. Lee's Website](http://faculty.sites.uci.edu/mdlee)
* [Lee Corpus](http://faculty.sites.uci.edu/mdlee/similarity-data/)
* [IMDB Doc2Vec Tutorial](doc2vec-IMDB.ipynb)

## Getting Started

To get going, we'll need a set of documents to train our Doc2Vec model. In theory, a document could be anything from a short 140-character tweet or a single paragraph (e.g., a journal article abstract) to a news article or a book. In NLP parlance, a collection or set of documents is often referred to as a <b>corpus</b>.

For this tutorial, we'll be training our model using the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf) included in gensim. This corpus contains 300 documents selected from the Australian Broadcasting Corporation’s news mail service, which provides text e-mails of headline stories and covers a number of broad topics.

And we'll test our model by eye using the much shorter [Lee Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf) which contains 50 documents.

In [14]:
# Set file names for train and test data (both ship with gensim's test data)
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')

## Define a Function to Read and Preprocess Text

Below, we define a function to open the train/test file (with Latin-1, i.e. ISO-8859-1, encoding), read the file line by line, pre-process each line using a simple gensim pre-processing tool (i.e., tokenize text into individual words, remove punctuation, set to lowercase, etc.), and return a list of words. Note that, for a given file (aka corpus), each line constitutes a single document and the length of each line (i.e., document) can vary. Also, to train the model, we'll need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number.

In [15]:
def read_corpus(fname, tokens_only=False):
    # smart_open.open replaces the deprecated smart_open.smart_open
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])
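
The pre-processing and tagging described above can be approximated without gensim. The following is a minimal sketch, not gensim's implementation: `simple_tokenize`, `tag_lines`, and the `TaggedDoc` namedtuple are hypothetical names introduced here for illustration, and the token-length limits (2 to 15 characters) are assumed to mirror the defaults of `simple_preprocess`.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument, for illustration only.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

def simple_tokenize(line, min_len=2, max_len=15):
    """Lowercase a line and keep alphabetic tokens of moderate length."""
    tokens = re.findall(r"[a-z]+", line.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def tag_lines(lines):
    """Pair each tokenized line with its zero-based line number as the tag."""
    return [TaggedDoc(simple_tokenize(line), [i]) for i, line in enumerate(lines)]

docs = tag_lines([
    "Hundreds of people have been forced to vacate.",
    "A new blaze near Goulburn, south-west of Sydney.",
])
print(docs[0])
print(docs[1].tags)
```

Single-character tokens such as "a" are dropped by the length filter, which is consistent with the tokenized output shown below.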

In [16]:
train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

Let's take a look at the training corpus:

In [17]:
train_corpus[:2]

[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', '

And the testing corpus looks like this:

In [18]:
print(test_corpus[:2])

[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to'

Notice that the testing corpus is just a list of lists and does not contain any tags.

## Training the Model

### Instantiate a Doc2Vec Object 

Now, we'll instantiate a Doc2Vec model with a vector size of 48 dimensions that iterates over the training corpus 40 times. We set the minimum word count to 2 in order to discard words with very few occurrences. (Without a variety of representative examples, retaining such infrequent words can often make a model worse!) Typical iteration counts in published 'Paragraph Vectors' results, using 10s-of-thousands to millions of docs, are 10-20. More iterations take more time and eventually reach a point of diminishing returns.

However, this is a very small dataset (300 documents) with shortish documents (a few hundred words). Adding training passes can sometimes help with such small datasets.

In [19]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=48, min_count=2, epochs=40)

### Build a Vocabulary

In [20]:
model.build_vocab(train_corpus)

2019-02-22 14:27:36,331 : INFO : collecting all words and their counts
2019-02-22 14:27:36,338 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2019-02-22 14:27:36,375 : INFO : collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
2019-02-22 14:27:36,376 : INFO : Loading a fresh vocabulary
2019-02-22 14:27:36,401 : INFO : effective_min_count=2 retains 3955 unique words (56% of original 6981, drops 3026)
2019-02-22 14:27:36,403 : INFO : effective_min_count=2 leaves 55126 word corpus (94% of original 58152, drops 3026)
2019-02-22 14:27:36,431 : INFO : deleting the raw counts dictionary of 6981 items
2019-02-22 14:27:36,433 : INFO : sample=0.001 downsamples 46 most-common words
2019-02-22 14:27:36,435 : INFO : downsampling leaves estimated 42390 word corpus (76.9% of prior 55126)
2019-02-22 14:27:36,470 : INFO : estimated required memory for 3955 words and 48 dimensions: 3553820 bytes
2019-02-22 14:27:36,478 : INFO : res

Essentially, the vocabulary is a dictionary (accessible via `model.wv.vocab`) of all of the unique words extracted from the training corpus along with their counts (e.g., `model.wv.vocab['penalty'].count` gives the count for the word `penalty`).
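
To make the idea of a counted vocabulary with a minimum word count concrete, here is a small sketch using only the standard library. It is an illustration under simplifying assumptions, not how gensim builds its vocabulary internally (gensim also handles downsampling and other bookkeeping); `build_vocab` and `toy_docs` are hypothetical names.

```python
from collections import Counter

def build_vocab(documents, min_count=2):
    """Count every word across all documents and drop the rare ones."""
    counts = Counter(word for doc in documents for word in doc)
    return {word: count for word, count in counts.items() if count >= min_count}

# Two tiny made-up documents, already tokenized.
toy_docs = [
    ["fire", "service", "says", "fire"],
    ["rural", "fire", "service"],
]
vocab = build_vocab(toy_docs, min_count=2)
print(vocab)
```

Here "says" and "rural" each occur once and are discarded, just as `min_count=2` discards single-occurrence words in the log above.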

### Train model

In [21]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

2019-02-22 14:27:37,352 : INFO : training model with 3 workers on 3955 vocabulary and 48 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2019-02-22 14:27:37,513 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-02-22 14:27:37,522 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-02-22 14:27:37,547 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-02-22 14:27:37,549 : INFO : EPOCH - 1 : training on 58152 raw words (42772 effective words) took 0.2s, 227759 effective words/s
2019-02-22 14:27:37,713 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-02-22 14:27:37,731 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-02-22 14:27:37,745 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-02-22 14:27:37,746 : INFO : EPOCH - 2 : training on 58152 raw words (42568 effective words) took 0.2s, 224132 effective words/s
2019-02-22 14:27:37,918 : INFO : worker 

2019-02-22 14:27:41,685 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-02-22 14:27:41,697 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-02-22 14:27:41,710 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-02-22 14:27:41,712 : INFO : EPOCH - 21 : training on 58152 raw words (42790 effective words) took 0.2s, 171269 effective words/s
2019-02-22 14:27:41,917 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-02-22 14:27:41,922 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-02-22 14:27:41,941 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-02-22 14:27:41,942 : INFO : EPOCH - 22 : training on 58152 raw words (42561 effective words) took 0.2s, 192684 effective words/s
2019-02-22 14:27:42,086 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-02-22 14:27:42,100 : INFO : worker thread finished; awaiting finish of 1 more threads
2019

### Inferring a Vector

One important thing to note is that you can now infer a vector for any piece of text without having to re-train the model by passing a list of words to the `model.infer_vector` function. This vector can then be compared with other vectors via cosine similarity.

In [22]:
model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])

array([-0.10797174,  0.19715056,  0.0972636 , -0.06634831, -0.05573022,
       -0.22814459, -0.01247532, -0.19113724, -0.2335291 , -0.13367285,
       -0.04636728, -0.13835841, -0.13360482,  0.17536612,  0.05678298,
        0.03886212,  0.09036909,  0.08988244, -0.3042306 , -0.06391596,
        0.06544841, -0.04593709, -0.01873477,  0.2733597 ,  0.06608206,
       -0.01115317, -0.2598342 ,  0.06903029,  0.05787918, -0.03917908,
        0.06389979, -0.24206126, -0.11864331, -0.2917862 , -0.08169527,
       -0.06216768,  0.1248833 , -0.08139188,  0.18219376,  0.30983114,
       -0.07212948, -0.0461818 ,  0.01162641, -0.01393203, -0.20548104,
       -0.19370623, -0.02727796, -0.03386957], dtype=float32)

Note that `infer_vector()` does *not* take a string, but rather a list of string tokens, which should have already been tokenized the same way as the `words` property of the original training document objects.

Also note that because the underlying training/inference algorithms are iterative approximations that make use of internal randomization, repeated inferences of the same text will return slightly different vectors.
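
Cosine similarity, mentioned above as the way to compare vectors, is the dot product of two vectors divided by the product of their lengths. A minimal standard-library sketch follows; `cosine_similarity` is a hypothetical helper written for illustration, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # vectors pointing the same way
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Because only the direction matters, two slightly different inferred vectors for the same text should still score close to 1.0 against each other.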

## Assessing Model

To assess our new model, we'll first infer new vectors for each document of the training corpus, compare the inferred vectors with the vectors learned for the training corpus, and then return the rank of each document based on self-similarity. Basically, we're pretending that the training corpus is new unseen data and then seeing how it compares with the trained model. The expectation is that we've likely overfit our model (i.e., all of the ranks will be less than 2), so we should be able to find similar documents very easily. Additionally, we'll keep track of the second-ranked documents for a comparison of less similar documents.
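
The notion of a self-similarity rank can be illustrated on toy data. In this sketch the similarity lists and their scores are made up, and `self_rank` is a hypothetical helper; it assumes each list holds `(doc_id, score)` pairs sorted from most to least similar, which is the shape the cell below works with.

```python
def self_rank(doc_id, sims):
    """Return the position of doc_id in a list of (doc_id, score) pairs
    sorted from most to least similar."""
    return [candidate for candidate, _ in sims].index(doc_id)

# Hypothetical similarity results for two documents (scores are invented).
toy_sims = {
    0: [(0, 0.95), (2, 0.40), (1, 0.10)],  # document 0 is closest to itself
    1: [(2, 0.81), (1, 0.79), (0, 0.05)],  # document 1 is narrowly beaten by 2
}
ranks = [self_rank(doc_id, sims) for doc_id, sims in toy_sims.items()]
print(ranks)
```

A rank of 0 means the document's inferred vector is closest to its own trained vector; a rank of 1 means one other document was closer.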

In [12]:
ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
    
    second_ranks.append(sims[1])

2019-02-14 17:54:02,191 : INFO : precomputing L2-norms of doc weight vectors


Let's count how each document ranks with respect to the training corpus:

In [13]:
collections.Counter(ranks)  # Results vary between runs due to random seeding and very small corpus

Counter({0: 292, 1: 8})

Basically, more than 95% of the inferred documents are most similar to themselves (292 of 300 in the run above), and the rest are mistakenly most similar to another document. Checking an inferred vector against a training vector is a sort of 'sanity check' as to whether the model is behaving in a usefully consistent manner, though not a real 'accuracy' value.
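
The shares can be worked out directly from the counter. The sketch below simply copies the counts from the run shown above; since results vary between runs, the exact figures are specific to that run.

```python
from collections import Counter

rank_counts = Counter({0: 292, 1: 8})  # counts copied from the run shown above
total = sum(rank_counts.values())

# Fraction of documents at each rank.
shares = {rank: count / total for rank, count in rank_counts.items()}
for rank, share in sorted(shares.items()):
    print("rank {}: {:.1%}".format(rank, share))
```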

This is great and not entirely surprising. We can take a look at an example:

In [14]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not v

Notice above that the most similar document (usually the same text) has a similarity score approaching 1.0. However, the similarity score for the second-ranked document should be significantly lower (assuming the documents are in fact different), and the reason becomes obvious when we examine the text itself.

We can run the next cell repeatedly to see a sampling of other target-document comparisons.

In [15]:
# Pick a random document from the corpus and infer a vector from the model
doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Train Document (167): «turning grief into defiance americans have paused in remembrance three months after the deadly september attacks as resolute president george bush forecast certain victory in his war on terrorism at am new york time am aedt the exact moment when hijacked airliner steered on suicide mission sliced into one of the twin towers of the world trade centre ceremonies in washington new york and around the world honoured some people killed on an unprecedented day of horror today the wrongs are being righted justice is being done mr bush said we still have far to go and many dangers still lie ahead yet there can be no doubt how this conflict will end in new york firefighters police officers and community leaders assembled in the wreckage strewn crater where the world trade centre stood until its signature towers were levelled on the bright sunny morning of september under grey skies lone tenor sung let there be peace on earth before priest rabbi and an imam addressed solem

## Testing the Model

Using the same approach as above, we'll infer the vector for a randomly chosen test document and compare the document to our model by eye.

In [16]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Test Document (15): «the bush administration has drawn up plans to escalate the war of words against iraq with new campaigns to step up pressure on baghdad and rally world opinion behind the us drive to oust president saddam hussein this week the state department will begin mobilising iraqis from across north america europe and the arab world training them to appear on talk shows write opinion articles and give speeches on reasons to end president saddam rule»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d48,n5,w5,mc2,s0.001,t3):

MOST (12, 0.707869291305542): «president general pervez musharraf says pakistan wants to defuse the brewing crisis with india but was prepared to respond vigorously to any attack pakistan stands for peace pakistan wants peace pakistan wants to reduce tension he said let the two countries move towards peace and harmony however pakistan has taken all counter measures if any war is thrust on pakistan the pakistan armed forces and the million people of pakista



## Wrapping Up

That's it! Doc2Vec is a great way to explore relationships between documents.