# Unsupervised methods

Most of the methods we've looked at so far are *supervised*: learning requires examples of both the text **input** (documents, words, etc.) and the **correct outputs** (classes, tags, etc.).

By contrast, *unsupervised* methods do not require labelled outputs, but instead aim to identify commonalities or structure in their input.

## Example: language models



Statistical language models estimate the "probability" of text, i.e. $P(w_1, w_2, \ldots, w_n)$.

Basic N-gram approach:

* Assume (incorrectly) that the probability of $w_i$ depends only on the N-1 previous words
    * e.g. for bigrams $P(w_i|w_{i-1},w_{i-2}, \ldots) = P(w_i|w_{i-1})$
* Count occurrences of N-grams in a large amount of (unlabelled) text
    * e.g. [Google Web 5-grams](https://catalog.ldc.upenn.edu/LDC2006T13): 1,024,908,267,229 words (~1 trillion)
* Estimate probabilities from counts and apply smoothing
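
The steps above can be sketched for bigrams as follows. This is a minimal illustration, not a production language model: the toy corpus, the `<s>`/`</s>` boundary markers, and the choice of add-alpha smoothing are all assumptions made for the example.

```python
from collections import Counter

# Toy corpus (hypothetical); a real model would count over far more text.
corpus = [
    "the dog runs in the yard",
    "the cat runs in the yard",
    "a dog eats",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    # Boundary markers are an assumption of this sketch.
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

vocab = sorted(unigram_counts)


def bigram_prob(prev, word, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | prev)."""
    numerator = bigram_counts[(prev, word)] + alpha
    denominator = unigram_counts[prev] + alpha * len(vocab)
    return numerator / denominator


print(bigram_prob("the", "yard"))  # seen twice after "the"
print(bigram_prob("the", "dog"))   # seen once after "the"
print(bigram_prob("the", "eats"))  # never seen after "the", still nonzero
```

Smoothing is what keeps the unseen bigram from getting probability zero; without it, any text containing a single unseen N-gram would be assigned zero probability overall.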

---

A model that can estimate $P(w_i|w_{i-1},w_{i-2}, \ldots)$ can be used to generate text: given a "prompt" of words, pick the most likely next word, and repeat.
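
As a rough sketch of this generation loop, the example below uses raw bigram counts from a toy sentence in place of a real model. The text, the function name `generate`, and the `max_words` parameter are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Toy training text (hypothetical).
tokens = "a dog runs in the yard and a dog runs in the house".split()

# For each word, count which words follow it.
next_counts = defaultdict(Counter)
for prev, word in zip(tokens, tokens[1:]):
    next_counts[prev][word] += 1


def generate(prompt, max_words=5):
    """Greedily extend the prompt with the most frequent next word."""
    words = prompt.split()
    for _ in range(max_words):
        candidates = next_counts.get(words[-1])
        if not candidates:
            break  # the last word was never seen with a successor
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


print(generate("a", max_words=3))  # a dog runs in
```

Always picking the single most likely word tends to produce repetitive text, so sampling from the predicted distribution is a common alternative to this greedy choice.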

Recent example from the [OpenAI GPT-2](https://openai.com/blog/better-language-models/) neural language model:

<div style="margin:1em">
<table style="font-size:100%; text-align:left">
  <tr>
    <td><b>Prompt (human input)</b>:</td>
    <td><i>In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.</i></td>
  </tr>
  <tr>
    <td><b>System output</b>:</td>
    <td><p>The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.</p>

<p>Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.</p>

<p>Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.</p>

<p>Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.</p>

<p>Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.</p>

<p>While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”</p>

<p>[...]</p>
</td>
</tr>
</table>
</div>

## Distributional semantics

Methods that aim to derive (representations of) meaning from text statistics. Based on a comparatively old idea (mandatory quotes):

<div style="text-align:center; font-size:120%; margin-top:1em"><i>If A and B have almost identical environments we say that they are synonyms</i> (Harris 1954)</div>

<div style="text-align:center; font-size:120%; margin-top:1em"><i>You shall know a word by the company it keeps</i> (Firth 1957)</div>

<div style="text-align:center; font-size:120%; margin-top:1em"><i>Words which are similar in meaning occur in similar contexts</i> (Rubenstein and Goodenough, 1965)</div>

---

Example:

<div style="margin:1em">
<table style="font-size:100%; text-align:left">
    <tr style="background-color:lightgray"><td>A large</td><td><b>___</b></td><td>runs in the yard</td></tr>
    <tr style="background-color:lightgreen"><td>A large</td><td><b>dog</b></td><td>runs in the yard</td></tr>
    <tr style="background-color:lightgreen"><td>A large</td><td><b>cat</b></td><td>runs in the yard</td></tr>
    <tr style="background-color:yellow"><td>A large</td><td><b>hat</b></td><td>runs in the yard</td></tr>
    <tr style="background-color:orangered"><td>A large</td><td><b>the</b></td><td>runs in the yard</td></tr>
</table>
</div>

Basic approach:

* Create a **word-context matrix** that records occurrences of words in context (for various definitions of context)
    * Most typically **word-word matrix**
* Process a large corpus of text, summing up (word, context) counts

<img src="figs/word_context_example.png" width="60%">

* Apply smoothing and/or statistics measuring association strength (e.g. PMI, TF-IDF)
* Perform dimensionality reduction

The process creates **word vectors** that aim to reflect word relationships: e.g. `similarity(cat,dog)` > `similarity(cat,hat)`
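
A minimal sketch of this pipeline is shown below, under several assumptions that are not part of the description above: a toy corpus, a symmetric context window of two words, positive PMI as the association measure, and truncated SVD for dimensionality reduction.

```python
import numpy as np

# Toy corpus (hypothetical).
corpus = [
    "a large dog runs in the yard",
    "a large cat runs in the yard",
    "a dog eats",
    "a cat eats",
    "the hat is in the yard",
]
window = 2  # context = up to 2 words on either side (an assumption)

sentences = [s.split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# Step 1: sum up (word, context) counts into a word-word matrix.
counts = np.zeros((len(vocab), len(vocab)))
for tokens in sentences:
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[index[word], index[tokens[j]]] += 1

# Step 2: reweight counts with positive pointwise mutual information.
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    pmi = np.log2(counts * total / (row * col))
ppmi = np.where(counts > 0, np.maximum(pmi, 0.0), 0.0)

# Step 3: reduce dimensionality with a truncated SVD.
k = 3
u, s, _ = np.linalg.svd(ppmi)
vectors = u[:, :k] * s[:k]


def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


print(cosine(ppmi[index["cat"]], ppmi[index["dog"]]))
print(cosine(ppmi[index["cat"]], ppmi[index["hat"]]))
```

In this toy corpus *cat* and *dog* occur in exactly the same contexts, so their rows are identical, while *hat* shares only part of its context with them.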

Hypothetical example with a *very* small corpus:

In [1]:
words = ['a', 'cat', 'dog', 'eats', 'runs', 'the', 'yard']

# Row i holds the context counts for words[i]; columns follow the same word order.
word_word_matrix = [
    [  0, 43, 57,  0,  0,  0, 12 ],
    [  0,  0,  0, 13,  9,  0,  0 ],
    [  0,  0,  0,  9, 11,  0,  0 ],
    [ 21,  0,  0,  0,  0, 16,  0 ],
    [  5,  0,  0,  0,  0,  2,  0 ],
    [  0, 33, 38,  0,  0,  0,  7 ],
    [  0,  0,  0,  0,  0,  0,  0 ],    
]

In [2]:
from pandas import DataFrame


DataFrame(word_word_matrix, index=words, columns=words)

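|      | a  | cat | dog | eats | runs | the | yard |
|------|----|-----|-----|------|------|-----|------|
| a    | 0  | 43  | 57  | 0    | 0    | 0   | 12   |
| cat  | 0  | 0   | 0   | 13   | 9    | 0   | 0    |
| dog  | 0  | 0   | 0   | 9    | 11   | 0   | 0    |
| eats | 21 | 0   | 0   | 0    | 0    | 16  | 0    |
| runs | 5  | 0   | 0   | 0    | 0    | 2   | 0    |
| the  | 0  | 33  | 38  | 0    | 0    | 0   | 7    |
| yard | 0  | 0   | 0   | 0    | 0    | 0   | 0    |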


Now, similar words should have similar vectors (rows):

In [3]:
import numpy as np


# Map each word to its row of the matrix as a NumPy vector.
vec = {w: np.array(row) for w, row in zip(words, word_word_matrix)}

for w in ['dog', 'cat', 'eats', 'runs']:
    print(w, '\t', vec[w])

dog 	 [ 0  0  0  9 11  0  0]
cat 	 [ 0  0  0 13  9  0  0]
eats 	 [21  0  0  0  0 16  0]
runs 	 [5 0 0 0 0 2 0]


Vector similarity is typically measured using *cosine similarity*, the dot product of normalized (length 1) vectors:

$$cos(a,b) = \frac{a \cdot b}{\|a\|\|b\|} = \frac{\sum_i{a_i b_i}}{\sqrt{\sum_i{a_i^2}}\sqrt{\sum_i{b_i^2}}}$$

(i.e. for normalized vectors, this is just the sum of the elementwise products)

As the name suggests, this corresponds to the cosine of the angle between the vectors:

* if $a$ and $b$ point in the same direction, $cos(a,b) = 1$
* if $a$ and $b$ are orthogonal (at a 90 degree angle), $cos(a,b) = 0$
* if $a$ and $b$ point in opposite directions, $cos(a,b) = -1$
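
The three cases can be checked numerically. The sketch below uses a helper named `cosine` and hand-picked 2D vectors, both chosen purely for illustration.

```python
import numpy as np


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


a = np.array([1.0, 2.0])

same_direction = cosine(a, 3 * a)               # scaling does not change the angle
orthogonal = cosine(a, np.array([-2.0, 1.0]))   # dot product is zero
opposite = cosine(a, -a)

print(same_direction, orthogonal, opposite)
```

Note that count vectors contain no negative values, so for them the cosine always falls between 0 and 1.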


In [4]:
from math import sqrt


def cos(a, b):
    return np.dot(a,b)/(sqrt(np.dot(a,a))*sqrt(np.dot(b,b)))


for a, b in [['dog', 'cat'], ['dog', 'eats'], ['eats', 'runs']]:
    print('sim({:4s},{:4s}) = {:.2f}'.format(a, b, cos(vec[a], vec[b])))

sim(dog ,cat ) = 0.96
sim(dog ,eats) = 0.00
sim(eats,runs) = 0.96
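
As a sanity check, the first two values can be worked out by hand from the row vectors printed earlier, using only the dimensions where the vectors are nonzero:

$$cos(\text{dog},\text{cat}) = \frac{9 \cdot 13 + 11 \cdot 9}{\sqrt{9^2 + 11^2}\sqrt{13^2 + 9^2}} = \frac{216}{\sqrt{202}\sqrt{250}} \approx 0.96$$

$$cos(\text{dog},\text{eats}) = \frac{0}{\sqrt{202}\sqrt{697}} = 0$$

The second value is zero because *dog* and *eats* have no context in common, so every term of the dot product vanishes.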


It's not very surprising that an artificially constructed example can be made to work. Can we do this with real data?

## Neural word representations

Count-based word vectors are simple and effective, but recent results (e.g. [Baroni et al. 2014](http://www.aclweb.org/anthology/P/P14/P14-1023.pdf)) indicate that *prediction-based* methods perform better on many tasks.

In prediction-based models, the task is to predict a word given its context (or vice versa):

<img src="figs/cbow_and_skipgram.png" width="60%">

<div style="text-align:center; font-size:80%">(figure adapted from <a href="https://arxiv.org/pdf/1301.3781.pdf">Mikolov et al. 2013</a>)</div>

The word vectors are learned as a "side effect" of performing the prediction task.
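
To make the prediction task more concrete, the sketch below shows one way the (word, context) training pairs for a skip-gram style model could be generated. The function name `skipgram_pairs` and the fixed symmetric window are assumptions for illustration; the sketch covers only pair extraction and leaves out the model and its training (for example negative sampling).

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs: each word paired with its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


tokens = "a large dog runs in the yard".split()
for target, context in skipgram_pairs(tokens, window=1):
    print(target, "->", context)
```

Each pair is one training example: given the target word, the model is asked to predict the context word. Words that appear with similar contexts are thereby pushed towards similar vectors.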

---

We'll be working with the [word2vec](https://github.com/tmikolov/word2vec) implementation of the skip-gram model. Before going into details, let's give this a quick try:

In [8]:
%matplotlib inline
from lib import wvlib

wv = wvlib.load("/course_data/textmine/GoogleNews-vectors-negative300.100K.bin")

In [12]:
print(wv['dog'].shape)
print(wv['dog'][:100])

(300,)
[ 0.0512695  -0.0223389  -0.172852    0.161133   -0.0844727   0.057373
  0.0585938  -0.0825195  -0.0153809  -0.0634766   0.179688   -0.423828
 -0.022583   -0.166016   -0.0251465   0.107422   -0.199219    0.15918
 -0.1875     -0.120117    0.155273   -0.0991211   0.142578   -0.164062
 -0.0893555   0.200195   -0.149414    0.320312    0.328125    0.0244141
 -0.097168   -0.0820312  -0.036377   -0.0859375  -0.0986328   0.00778198
 -0.0134277   0.0527344   0.148438    0.333984    0.0166016  -0.212891
 -0.0150757   0.0524902  -0.107422   -0.0888672   0.249023   -0.0703125
 -0.0159912   0.0756836  -0.0703125   0.119141    0.229492    0.0141602
  0.115234    0.00750732  0.275391   -0.244141    0.296875    0.0349121
  0.242188    0.135742    0.142578    0.0175781   0.0292969  -0.121582
  0.0228271  -0.0476074  -0.155273    0.00314331  0.345703    0.122559
 -0.195312    0.0810547  -0.0683594  -0.0147095   0.214844   -0.121094
  0.157227   -0.207031    0.136719   -0.129883    0.0529785  -0.2

In [13]:
for a, b in [['dog', 'cat'], ['dog', 'eats'], ['eats', 'runs']]:
    print('sim({:4s},{:4s}) = {:.2f}'.format(a, b, cos(wv[a], wv[b])))

sim(dog ,cat ) = 0.76
sim(dog ,eats) = 0.22
sim(eats,runs) = 0.26


Seems OK, if not quite ideal. What are the words with the largest similarity?

In [15]:
wv.nearest('dog')

[('dogs', 0.86804897),
 ('puppy', 0.8106429),
 ('pit_bull', 0.7803961),
 ('pooch', 0.7627376),
 ('cat', 0.7609456),
 ('golden_retriever', 0.75009006),
 ('German_shepherd', 0.74651736),
 ('Rottweiler', 0.7437614),
 ('beagle', 0.7418621),
 ('pup', 0.7406911)]
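
Conceptually, a nearest-neighbour query like this can be answered by comparing the query vector against every other vector with cosine similarity. The sketch below does this by brute force with NumPy on made-up 3-dimensional vectors; the function and the data are hypothetical and say nothing about how `wvlib` implements `nearest` internally.

```python
import numpy as np

# Made-up toy vectors (hypothetical values, for illustration only).
toy_vectors = {
    "dog": np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.1]),
    "cat": np.array([0.7, 0.3, 0.2]),
    "yard": np.array([0.0, 0.2, 0.9]),
}


def nearest(word, vectors, n=3):
    """Return the n words most cosine-similar to `word`, most similar first."""
    names = [w for w in vectors if w != word]
    matrix = np.array([vectors[w] for w in names])
    # Normalize rows to length 1 so that a dot product equals the cosine.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = vectors[word] / np.linalg.norm(vectors[word])
    sims = matrix @ query
    order = np.argsort(-sims)[:n]
    return [(names[i], float(sims[i])) for i in order]


for name, sim in nearest("dog", toy_vectors):
    print("{:6s} {:.2f}".format(name, sim))
```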

(To be continued ...)