In [1]:
%run Latex_macros.ipynb

 

$$
\newcommand{\V}{\mathbf{V}}
\newcommand{\v}{\mathbf{v}}
\newcommand{\offset}{o}
\newcommand{\o}{o}
\newcommand{\E}{\mathbf{E}}
$$


# Natural Language Processing

The datasets for Machine Learning have historically been mainly numeric.

But non-numeric data, such as images and text, is an abundant and potentially rich source of insight.

We have illustrated many of the concepts in this course with Image data.

We will briefly dive into the world of text.
  
*Natural Language Processing* is the set of tools/techniques that facilitate using text as raw material.

## The world of text

- SEC filings
- analyst reports
- news articles
- tweets

We will approach text mainly from a Deep Learning perspective
- lots of data
- minimal pre-processing
- "feature engineering" by the Neural Network

That is not to discount more "classical" methods for NLP
- Part of speech
- Stemming
- Lemmatization
- n-grams

All of these are potentially useful as pre-processing steps for Deep Learning.

However, if our data sets are big enough, it may be counter-productive to preprocess.

# Issues with text

There are several big issues to tackle regarding text data
- Words are categorical variables
- Token sequences (sentences/paragraphs) are variable length
- Token sequences: order matters

We are using the term "token" rather than word
- tokens may include punctuation, special characters
- tokens may be characters rather than entire words

## Notation
- $\w$ is a sequence of $n_\w$ tokens $\w_{(1)}, \ldots, \w_{(n_\w)}$
- each token is an element of vocabulary $\V: \w_\tp \in \V, 1 \le \tt \le ||\w||$
    - token $j$ in vocabulary $\V$ is denoted $\V_j$
- We define two pseudo-tokens to denote the start/end of the sentence
    - $\w_{(0)} = \text{<START>}$
    - $\w_{(n_\w+1)} =\text{<END>}$

We need a function to convert a token into a numeric vector:
$$\text{rep}: \text{token} \mapsto \mathbb{R}^{n_\V}
$$


One Hot Encoding (OHE) and word embeddings are examples of such a function.

- For OHE: $n_\V = ||\V||$
- For Word Embeddings: $n_\V$ is the dimension of the embedding vector

We will extend $\text{rep}$ to sequences $\w$:
$$\text{rep}(\w) = \left[ \text{rep}(\w_\tp) | 1 \le \tt \le ||\w||  \right]$$

# Issue 1: Words are categorical variables

We address the first issue relating to text: words are *categorical variables*.

By now, we should know to **not** treat categorical variables as ordinals.

Let's review the reason.

Treating a word as an ordinal 
$$\text{rep}(w) \in \mathbb{R}^1$$ 
would imply
- "apple" < "orange" is a sensible statement
- that this ordering is meaningful to a Machine Learning model

**Example**

Linear regression:
$$
\y = \Theta^T \text{rep}(w)
$$

Predict $\y$ given feature vector (attributes) $\text{rep}(w)$
- by learning parameters $\Theta$

Suppose that we tried to encode word $w$ with an integer:  $\text{rep}(w) = I_w$.
- $I_\text{apple} = 10 * I_\text{orange}$
    - means "apple" has 10 times the impact on prediction $\hat{\y}$ as "orange"
    - impact is $\Theta * I_w$
- Re-encoding "apple" with a value 10 times larger would make it 10 times more important
       

## Sparse Representation of words by One Hot Encoding (OHE)

So the natural way of representing a word is as a categorical variable
- indicator per word: $\text{Is}_\text{apple}$
- One Hot Encoding

OHE is a *sparse* representation
- length of $\text{rep}(w)$ is $| \V |$, yet only a single non-zero element

The problem is that there are lots of words !
- $|\V|$ is large !
- $\text{rep}(\w)$ length is $|\w| |\V|$
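To make the size of the sparse representation concrete, here is a minimal sketch of One Hot Encoding with NumPy. The four-word vocabulary and the helper names `ohe` and `rep` are assumptions made for illustration; a real vocabulary would have many thousands of entries.

```python
import numpy as np

# Hypothetical toy vocabulary; position j in the list plays the role of V_j.
vocab = ["apple", "orange", "pie", "the"]
token_to_index = {token: j for j, token in enumerate(vocab)}

def ohe(token):
    """One-hot vector of length |V| with a single 1 at the token's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[token_to_index[token]] = 1
    return vec

def rep(tokens):
    """Encode a token sequence as a (|w|, |V|) array of one-hot rows."""
    return np.stack([ohe(token) for token in tokens])

sentence = ["the", "apple", "pie"]
encoded = rep(sentence)
print(encoded.shape)   # (sequence length, vocabulary size)
print(encoded.size)    # total number of entries: |w| * |V|
```

Each row contains exactly one non-zero entry, so almost all of the $|\w| |\V|$ entries are zeros.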

# Issue 2: Word-streams are variable length

We are already familiar with two ways of dealing with variable length input
- Use a Recurrent model, which handles sequences of arbitrary length
- Convert to a fixed length representation using pooling

## Fixed length representation via pooling

One way to deal with a sequence $\w$ of words is to convert it to a vector  of **fixed length**.

Once the length is fixed
- Classical and Deep Learning models taking fixed length inputs
can work as usual.

Doing so usually involves losing the ordering information.

### Bag of Words (BOW): Pooling

We define a *reduction* operation $\text{CBOW}$
- convert a sequence $\w$ of $||\w||$ elements
- each element of length $|| \text{rep}(\w_\tp) ||$
- to a fixed length vector of length $|| \text{rep}(\w_\tp) ||$

This will necessarily lose token order: this method is called *Bag of Words (BOW)*

There are many operators to achieve the reduction, which we will group under the name *pooling*

#### Sum/Average
$$
\text{CBOW}(\w) = \sum_{\tt=1}^{||\w ||} {  \text{rep}(\w_\tp) }
$$

Since $\text{rep}(\w_\tp)$ is a vector, the addition operation is element-wise.

So the composite vector for the sequence is the sum of the vectors of each element in the sequence.

We can easily turn the Sum into an average by dividing by $||\w||$
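The following sketch illustrates Sum and Average pooling on two sequences of different lengths. The token vectors are made-up numbers and the function names `pool_sum` and `pool_mean` are illustrative, not taken from any library.

```python
import numpy as np

def pool_sum(token_vectors):
    """Element-wise sum over the sequence axis: (|w|, n) -> (n,)."""
    return token_vectors.sum(axis=0)

def pool_mean(token_vectors):
    """Element-wise average over the sequence axis: (|w|, n) -> (n,)."""
    return token_vectors.mean(axis=0)

# Two sequences of different length; each row is rep(w_t) for one token.
short_seq = np.array([[1.0, 0.0, 2.0],
                      [0.0, 1.0, 1.0]])
long_seq = np.array([[1.0, 0.0, 2.0],
                     [0.0, 1.0, 1.0],
                     [3.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

print(pool_sum(short_seq), pool_mean(short_seq))
print(pool_sum(long_seq), pool_mean(long_seq))

# Reversing the token order leaves the pooled vector unchanged.
print(np.array_equal(pool_sum(long_seq), pool_sum(long_seq[::-1])))
```

Both sequences are reduced to vectors of length 3, and the last line shows the price: the pooled vector cannot distinguish a sequence from its reversal.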

#### Count vectorization:  

In the special case that 
$$\text{rep}(\w_\tp) = \text{OHE}(\w_\tp)
$$

$\text{CBOW}(\w)_j$ is equal to the number of occurrences in sequence $\w$ of the $j^{th}$ word in $\V$.

This is often called *Count Vectorization*.
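As a quick check of this equivalence, the sketch below counts tokens directly and also sums their one-hot vectors. The vocabulary, the review, and the helper names are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical toy vocabulary and tokenized review.
vocab = ["a", "good", "movie", "bad"]
review = ["a", "good", "good", "movie"]

def count_vectorize(tokens, vocab):
    """Entry j is the number of occurrences of vocab[j] in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def ohe(token, vocab):
    """One-hot list of length |V| for a single token."""
    return [1 if word == token else 0 for word in vocab]

counts = count_vectorize(review, vocab)

# Summing the one-hot vectors column by column gives the same result.
summed_ohe = [sum(column) for column in zip(*(ohe(t, vocab) for t in review))]

print(counts)
print(summed_ohe)
```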

#### TF-IDF

Count Vectorization is simple but ignores a basic fact of language
- word "importance" is often inversely correlated with frequency in the corpus

In English: 
- the words "a", "the" and "is" are extremely high frequency (so high counts in most $\w$).
- but are so common as to convey little meaning

On the other hand, a rare word (or sequence of words) may be very distinctive ("Machine Learning").


*Term Frequency, Inverse Document Frequency (TF-IDF)* is based on the idea that a word that is
- *infrequent* in the wide corpus
- but frequent in a particular document in the corpus

is very meaningful in the context of that document.

So a document in which "Machine Learning" occurs a disproportionately high number of times
(relative to the broad corpus) is likely to be dealing with the subject of Machine Learning.

**Note** A similar idea is behind many Web search algorithms (Google).

TF-IDF is similar to the Count Vectorizer, but with modified counts 
that are the product of 
- the frequency of a word within a single document
- the inverse of the frequency of the word relative to all documents

- $v$ is a word
- $d$ is a document (collection of words) in set of documents $D$

$$
\begin{array}{llll}
\text{tf}(v,d) & = & \text{frequency of word } v \text{ in document } d & \text{(Term Frequency)}\\
\text{df}(v) & = & \text{number of documents that contain word } v \\
\text{idf}(v) & = & \log \left( \frac{ ||D|| } { \text{df}(v) } \right) + 1 & \text{(Inverse Document Frequency)} \\
\\
\text{tf-idf}(v,d) & = & \text{tf}(v, d) * \text{idf}(v) \\
\end{array}
$$
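The definitions above translate almost line for line into code. This is a minimal sketch on a made-up three-document corpus; it assumes the natural logarithm (the formula does not fix a base) and that term frequency is a raw count. Library implementations may apply additional smoothing or normalization.

```python
import math
from collections import Counter

# Hypothetical corpus D: each document is a list of tokens.
docs = [
    ["the", "movie", "is", "good"],
    ["the", "plot", "is", "thin"],
    ["machine", "learning", "is", "the", "topic"],
]

def tf(word, doc):
    """Raw count of word in a single document."""
    return Counter(doc)[word]

def df(word, docs):
    """Number of documents that contain word."""
    return sum(1 for doc in docs if word in doc)

def idf(word, docs):
    """log(|D| / df) + 1; undefined if the word appears in no document."""
    return math.log(len(docs) / df(word, docs)) + 1

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

doc = docs[2]
print(tf_idf("the", doc, docs))      # appears in every document
print(tf_idf("machine", doc, docs))  # appears in only one document
```

Both words occur once in the third document, but "machine" receives the larger weight because it appears in fewer documents.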

## Detour: Sentiment classification notebook on Colab : simple model

Classification task
- Input: Movie review, as sequence of characters
- Label: Positive/Negative

[NLP notebook: examine the data](https://colab.research.google.com/github/kenperry-public/ML_Fall_2019/blob/master/Keras_examples_imdb_cnn.ipynb#scrollTo=shHO2IU80XJ7&line=20&uniqifier=1)

[NLP notebook: simple model](https://colab.research.google.com/drive/15KZrB_qR63Q3KjLVdaT2BV-NgsdcXo4F#scrollTo=QtvUFJZJ7Oqi)

## Back from the detour:  summary of Simple Model

- One Hot Encoded words
    - OHE via an Embedding layer
        - Use Embedding as pedagogical device; world's *slowest* way to perform OHE !
- Variable length sequence of words
- Global Pooling to reduce to fixed length


# Issue 3: Ordering matters, first attempt using Convolution

Not every text problem requires the complete ordering of words.

We will briefly discuss a method that retains only local (partial) ordering of words.

## Neural n-grams using Conv1d

An *n-gram* is a sequence of $n$ consecutive tokens that encapsulates a single concept (*phrase*)


An n-gram captures
- multi-token concept
    - "New York City" versus [ "New", "York", "City" ]
- ordering information
    - [ "hard", "not", "easy" ] versus [ "easy", "not", "hard" ]
    
NLP can be enhanced by replacing the subsequence of related words by the n-gram.

How does one identify n-grams ? There are two approaches.

The first is statistical
- joint frequency of the phrase's tokens being higher than the product assuming independence

- $\pr{\text{"New York City"}} > \pr{\text{"New"}} \pr{\text{"York"}} \pr{\text{"City"}}$

That is: the frequency of the phrase is greater than the joint probability of its components,
assuming independence.


The other way is: use Machine Learning !
- Discover consecutive $n$ tokens that are useful for some task

Using one dimensional convolution with kernel size $n$
- the convolution encodes each group of $n$ consecutive tokens (assuming stride 1)
- using multiple kernels: we can create an n-gram per kernel that captures some concept

That is: we have created a new feature, per kernel, at each location of the text sequence.

n-grams can capture partial ordering of words (within span $n$).

So creating n-grams (with varying $n$) 
- before applying Pooling 
- retains local ordering for features within span of $n$ tokens
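The mechanics can be sketched directly in NumPy. This is an illustrative implementation under stated assumptions (stride 1, no padding, random weights standing in for learned kernels, global max pooling as the reduction); the name `conv1d` here is a local helper, not a library call.

```python
import numpy as np

def conv1d(seq, kernels):
    """One dimensional convolution over the time axis.

    seq:     (T, d) array, one d-dimensional vector per token
    kernels: (K, n, d) array, K kernels each spanning n consecutive tokens
    returns: (T - n + 1, K) array, one feature per kernel per location
    """
    T, _ = seq.shape
    K, n, _ = kernels.shape
    out = np.empty((T - n + 1, K))
    for t in range(T - n + 1):
        window = seq[t:t + n]  # the n-gram starting at position t
        # Dot product of every kernel with the window.
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

rng = np.random.default_rng(0)
seq = rng.normal(size=(7, 4))         # 7 tokens, 4 features each
kernels = rng.normal(size=(5, 3, 4))  # 5 kernels over 3-grams

features = conv1d(seq, kernels)       # shape (5, 5): locations x kernels
pooled = features.max(axis=0)         # global max pooling: shape (5,)
print(features.shape, pooled.shape)
```

The pooled vector has one entry per kernel regardless of the sequence length, while each entry still depends on the order of tokens within its 3-token window.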

<table>
    <tr>
        <th><center>One dimensional convolution</center></th>
    </tr>
    <tr>
        <td><img src="images/NLP_conv1d.jpg" width=1000></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>One dimensional convolution, multiple kernels</center></th>
    </tr>
    <tr>
        <td><img src="images/NLP_conv1d_multi_kernel.jpg" width=1000></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>Global Pooling</center></th>
    </tr>
    <tr>
        <td><img src="images/NLP_Global_Pooling.jpg" width=600></td>
    </tr>
</table>

## Detour: Sentiment classification notebook on Colab : Neural n-grams

[NLP notebook: neural n-grams](https://colab.research.google.com/github/kenperry-public/ML_Fall_2019/blob/master/Keras_examples_imdb_cnn.ipynb#scrollTo=LPChvdnzUY9f)

## Back from the detour:  summary of Neural n-grams

- One dimensional convolution over time dimension
    - 3-grams
- Global Pooling

# Issue 1 revisited: Sparse versus dense representation of categoricals

## Dense representation of words: Embeddings

Sparse encodings, such as OHE
- convert a token into a vector of features
- where the features are orthogonal: only one is active at a time

This is called a *discrete* representation.

Discrete representations have major drawbacks
- they are long
    -  $\text{rep}(\w)$ length is $||\w|| * ||\V||$
- there is no meaningful metric of "distance" between the representation of words

To illustrate the "lack of distance" issue, let 

$$
\text{OHE}(w)
$$

denote the One Hot Encoding of word $w$.

Using dot product (cosine similarity) as a measure of similarity

| word   | OHE(word) | Similarity |
| ---    | ---       | :---:        |
| dog   | [1,0,0,0]   | OHE(word) $\cdot$ OHE(dog)  = 1  |
| dogs  | [0,1,0,0]   | OHE(word) $\cdot$ OHE(dog)  = 0  |
| cat   | [0,0,1,0]   | OHE(word) $\cdot$ OHE(dog)  = 0  |
| apple | [0,0,0,1]   | OHE(word) $\cdot$ OHE(dog)  = 0  |


Each pair of distinct words has 0 similarity
- no recognition of plural form
- no recognition of commonality (pets)

This is due to the fact that only a single "feature" of the OHE is active (non-zero).

However, it's possible that, in reality, there are many "dimensions" to a word, for example
- singular/plural
- entity type, e.g., Person
- positive/negative

- "Cats", "Dogs", "Apples"
    - related by being plural form
- "Cat", "Dog"
    - related by being animals
- "good", "bad"
    - related by being "opposites"

Thus it is not unreasonable to represent a word as a short *dense vector* of features 
- each feature (vector element) captures a concept
- numeric value of element encodes the strength of the word's relation to the concept

Ideally the features would be independent.

This is called a *continuous* word representation.


# Doing math with words

Let's explore the implication and power of dense vector representation of words.

Let $\v_w$ be the dense vector/embedding for word $w$
- captures multiple aspects of a word
- where each element of the vector is a nearly-independent aspect
- then we can perform interesting mathematical manipulations on word vectors


| $w$   | $\v_w$ |
| ---    | ---       | 
| cat   | [.7, .5, .01 ]   
| cats   | [.7, .5, .95 ]  
| dog   | [.7, .2, .01 ]   
| dogs   | [.7, .2, .95 ]
| apple   | [.1, .4, .01 ]   
| apples   | [.1, .4, .95 ]

Does the last dimension encode "plural form" ?
$$
\v_\text{cats} - \v_\text{cat} \approx \v_\text{dogs} - \v_\text{dog} \approx \v_\text{apples} - \v_\text{apple}
$$

If so:
$$
\v_\text{apples} \approx \v_\text{apple} + (\v_\text{cats} - \v_\text{cat})
$$
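Using the illustrative vectors from the table above, the arithmetic can be checked numerically. The numbers are the made-up values from the table, not real embeddings.

```python
import numpy as np

# Toy dense vectors copied from the table.
vectors = {
    "cat":    np.array([0.7, 0.5, 0.01]),
    "cats":   np.array([0.7, 0.5, 0.95]),
    "dog":    np.array([0.7, 0.2, 0.01]),
    "dogs":   np.array([0.7, 0.2, 0.95]),
    "apple":  np.array([0.1, 0.4, 0.01]),
    "apples": np.array([0.1, 0.4, 0.95]),
}

# The "plural" direction, estimated from a single singular/plural pair.
plural_offset = vectors["cats"] - vectors["cat"]

# The same offset is recovered from the other pairs.
print(np.allclose(plural_offset, vectors["dogs"] - vectors["dog"]))

# Adding the offset to "apple" lands on "apples".
predicted = vectors["apple"] + plural_offset
print(np.allclose(predicted, vectors["apples"]))
```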


## Word analogies

king:man :: ? : woman


Let
- $\v_w$ be the dense vector for word $w$
- $d(\v_{w}, \v_{w'})$ be some measure of the distance between the two vectors $\v_{w}, \v_{w'}$
    - e.g., ( $1 - \text{cosine similarity}$ )

Using the distance metric,  define the set of words in vocabulary $\V$ that are "closest" to a word $w$.

Let
- $\text{wv}_{n',d}(\v_w)$ be the dense vectors of the $n'$ words in $\V$ closest to word $w$
$$
\text{wv}_{n',d}(\v_w) = \{ \v_{w'} | \text{rank}_V( d(\v_{w}, \v_{w'}) ) \le n' \}
$$
- $N_{n',d}(w)$ be the set of $n'$ words associated with $\text{wv}_{n',d}(\v_w)$


$$
N_{n',d}(w) = \{ w' | \v_{w'} \in \text{wv}_{n',d}(\v_w) \}
$$

We can define approximate equality of two words $w, w'$ if they are among the closest words 

$$
w \approx_{n',d} w' \; \; \text{if } w' \in N_{n',d}(w) 
$$

That is: 
- word $w$ is approximately equal to word $w'$
- if $w'$ is among the $n'$ words closest to $w$ according to distance metric $d$.

Finally, we can define word analogies:

a:b :: c:d

means

$$
\v_a - \v_b  \approx_{n',d}  \v_c - \v_d 
$$

So to solve the word analogy for $c$:
$$
\v_c \approx_{n',d}  \v_a - \v_b + \v_d
$$

To be concrete:
$$
\v_\text{king} - \v_\text{man} + \v_\text{woman} \approx_{n',d} \v_\text{queen}
$$
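A small sketch of solving an analogy by nearest-neighbor search is given below. The three-dimensional embeddings are hand-made for illustration (they are not GloVe or word2vec vectors), and excluding the query words from the candidates is an assumption that mirrors common practice.

```python
import numpy as np

# Hypothetical embeddings: dimensions loosely mean (royalty, gender, fruit).
embeddings = {
    "king":  np.array([0.9,  0.8, 0.0]),
    "queen": np.array([0.9, -0.8, 0.0]),
    "man":   np.array([0.1,  0.8, 0.0]),
    "woman": np.array([0.1, -0.8, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine_distance(u, v):
    """d(u, v) = 1 - cosine similarity."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nearest_words(query, n_prime, exclude=()):
    """The n' words whose vectors are closest to the query vector."""
    candidates = [w for w in embeddings if w not in exclude]
    ranked = sorted(candidates, key=lambda w: cosine_distance(query, embeddings[w]))
    return ranked[:n_prime]

def solve_analogy(a, b, d, n_prime=1):
    """Solve a:b :: c:d for c, using v_c ~ v_a - v_b + v_d."""
    query = embeddings[a] - embeddings[b] + embeddings[d]
    return nearest_words(query, n_prime, exclude={a, b, d})

print(solve_analogy("king", "man", "woman"))
```

With real embeddings the query vector rarely coincides with any word vector, which is why the definition uses the $n'$ closest words rather than exact equality.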

## Why does adding 2 word vectors work
- Mikolov
    - Vector for a word reflects its context
    - Vector is log probability
        - so sum of log probabilities is log of product of probabilities
            - product is like a logical "and"

## GloVe: Pre-trained embeddings

Fortunately, you don't have to create your own word-embeddings from scratch.

There are a number of pre-computed embeddings freely available.

GloVe is a family of word embeddings that have been trained on large corpora
- GloVe6b
    - Trained on 6 Billion tokens
    - 400K words
    - Corpus:  Wikipedia (2014) + GigaWord5 (version 5, news wires 1994-2010)
    - Many different dense vector lengths to choose from
        - 50, 100, 200, 300

We will illustrate the power of word embeddings using GloVe6b vectors of length $100$.

$
\begin{array}{lll}
\text{king - man + woman} &  \approx_{n',d} & \text{queen } \\
\text{man - boy + girl} &  \approx_{n',d} & \text{woman } \\
\text{Paris - France + Germany} &  \approx_{n',d} & \text{Berlin } \\
\text{Einstein - science + art} &  \approx_{n',d} & \text{Picasso} \\
\end{array}
$

You can see that the dense vectors seem to encode "concepts", that we can manipulate mathematically.

You may discover some unintended bias

$
\begin{array}{lll}
\text{doctor - man + woman} &  \approx_{n',d} & \text{nurse } \\
\text{mechanic - man + woman} &  \approx_{n',d} & \text{teacher } \\
\end{array}
$

### Domain specific embeddings

Do we speak Wikipedia English in this room ?

Here are the neighborhoods of some financial terms, according to GloVe:

$
\begin{array}{lll}
N(\text{bull}) & =  & [ \text{cow, elephant, dog, wolf, pit, bear, rider, lion, horse}] \\
N(\text{short}) & =  & [ \text{rather, instead, making, time, though, well, longer, shorter, long}] \\
N(\text{strike}) & =  & [ \text{workers, struck, action, blow, striking, protest, stoppage, walkout, strikes}] \\
N(\text{FX}) & =  & [ \text{showtime, cnbc, ff, nickelodeon, hbo, wb, cw, vh1}] \\
\end{array}
$

It may be desirable to create word embeddings on a narrow (domain specific) corpus.

This is not difficult provided you have enough data.

# Obtaining Dense Vectors: Transfer Learning

How do we obtain Dense Vector representation of words ?

We learn them !

Suppose we had a task T 
that involves mapping a sequence of words to an outcome.

To be concrete: mapping a movie review to an indicator of Positive/Negative sentiment.



Ignoring for the moment the issue of converting variable length sequences to a fixed length
- inputs are OHE of words
- target is Positive/Negative label

- Logistic Regression from  sentence representation to binary target Positive/Negative

One could also ask
- can we map the OHE of a word $\w_\tp$ (length $|\V|$)
- to a shorter, dense vector $\mathbf{e}_\tp$ of length $n_e$
- and use the dense vector in the Logistic Regression
 
This mapping can be represented by an $(|\V| \times n_e)$ matrix
$\E$ 

$$
\mathbf{e}_\tp = \text{OHE}(\w_\tp)^T \E 
$$

Using Machine Learning, 
- we solve for both the Logistic Regression parameters $\W$ *and* $\mathbf{E}$
- when solving the Classification Task via Logistic Regression.

The matrix $\mathbf{E}$ is called 
- an *embedding matrix* for words 
- and $\mathbf{e}_\tp$ is called an *embedding*  or *word vector* for word $\w_\tp$.
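Multiplying a one-hot vector by the embedding matrix simply selects one row of the matrix. The sketch below checks this with a random matrix standing in for a learned one; the vocabulary and the size `n_e = 3` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and embedding size.
vocab = ["good", "bad", "movie", "plot", "the"]
n_e = 3

# Random stand-in for a learned (|V| x n_e) embedding matrix.
E = rng.normal(size=(len(vocab), n_e))

def ohe(j, size):
    """One-hot vector of the given size with a 1 at index j."""
    vec = np.zeros(size)
    vec[j] = 1.0
    return vec

j = vocab.index("movie")
via_matmul = ohe(j, len(vocab)) @ E   # OHE(w)^T E
via_lookup = E[j]                     # row j of E

print(np.allclose(via_matmul, via_lookup))
```

This is why embedding layers are usually implemented as a table lookup by integer index rather than as a matrix product with an explicit one-hot vector.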

*Word embeddings* have become an important component of Deep Learning for NLP.

<table>
    <tr>
        <th><center>Word prediction: Neural Net</center></th>
    </tr>
    <tr>
        <td><img src="images/w2v_word_prediction_layers.jpg" width=800></td>
    </tr>
</table>

In other words
- we have learned a dense vector representation of words $\mathbf{E}$
- that is useful for a particular classification task



Might it be possible that the dense vector representation of words for this task
- is useful for other tasks involving words ?
- this is Transfer Learning

The problem with this approach is having a large enough training set for the task $T$.

We will show how to solve this problem using semi-supervised learning
- word prediction problems

# Word prediction problems: high-level

Let's explore how to create generally useful (as opposed to task  specific) word embeddings.

In the absence of labelled data (needed for Supervised Learning)
- we can create a Semi-Supervised Learning task

From unlabelled sequence $\w$ define the *word prediction* problem
as
- predict a target word given a "context" sequence of words

For example:
- given prefix $\w_{(1)} \ldots \w_{(\tt-1)}$
- predict $\w_\tp$.

The intuition is that if you can predict the occurrence of a word from its neighbors,
then you have somehow captured dimensions of its meaning.

This is often referred to as "a word is known by the company it keeps".

- "I ate an apple"
- "I ate a  blueberry"
- "I ate a pie"

"apple", "blueberry", "pie" concept: things that you eat


Word embeddings can be obtained as a by-product of this *word prediction* problem.


Let $\w$ be the sequence of $n_\w$ words 

A *word prediction* is a mapping 
- from input $\w$
- to a probability distribution $\hat{\y}$ over all words in vocabulary $\V$
    - $\hat{\y}_j = \pr{\V_j}$
    - That is: it assigns a probability to each word in the vocabulary

Here are some simple word prediction problems:
$
\begin{array}{ll}
\text{predict next word from context}  & \pr{\w_\tp | \w_{(\tt-\offset)} \ldots \w_{(\tt-1)} } \\
\text{predict a surrounding word}      & \pr{\w_{(\tt')} | \w_\tp }, \quad \tt' \in \{ \tt - \offset, \ldots, \tt + \offset \} - \{ \tt \} \\
\text{predict center word from context} & \pr{ \w_\tp | \w_{(\tt-\offset)} \ldots \w_{(\tt-1)} \w_{(\tt+1)} \ldots \w_{(\tt+\offset)} } \\
\end{array}
$

# Word prediction problems, in detail

## Background: Language models

A *Language Model* takes a sequence of words and produces a *probability* that
the sequence represents a sentence in the language.

- We will show how we can obtain this probability by predicting the probability of word $\w_\tp$
conditional on the first $(\tt-1)$ words in the sequence.

- We will then simplify this to a problem that involves predicting word $\w_\tp$ from
a *small* window in the neighborhood of position $\tt$.

Two variants of the window-based approach are the basis for a popular word embedding technique:
word2vec.

Let $\w$ be the sequence of $n_\w$ words in a sentence.

A *language model* 
- maps $\w$ into a probability $\pr{\w}$ that $\w$ represents
a sentence in the language.

We can compute this probability via the chained probability
$$
\pr{\w} = \pr{\w_{(1)} | \w_{(0)}} \; 
\pr{\w_{(2)} | \w_{(0)} \w_{(1)}} \ldots \;
\pr{\w_{(n_\w+1)} | \w_{(0)} \ldots \w_{(n_\w)}}
$$

or

$$
\pr{\w} = \prod_{\tt=1}^{n_\w+1} { \pr{\w_\tp | \w_{(0)} \ldots \w_{(\tt-1)} } }
$$

That is
- the probability of $\w$ being a valid sequence of tokens ("sentence")
- is the chained probability of token $\w_\tp$ following prefix $\w_{(0)} \ldots \w_{(\tt-1)}$
    - for each $1 \le \tt \le n_\w + 1$

We can determine these probabilities via a *maximum likelihood* estimate by counting 
word sequence occurrences in our text corpus.

$
\pr{\w_\tp | \w_{(0)} \ldots \w_{(\tt-1)} } = \frac{ \text{count}_\text{Corpus}( \w_{(0)} \ldots \w_\tp ) } {\text{count}_\text{Corpus}( \w_{(0)} \ldots \w_{(\tt-1)} )  }
$


The estimate via counting is an unrealistic ideal
- we don't have *all* possible sentences in the language; the corpus is a subset
- may not have *true* probability for rare words in language when corpus is small
- computationally expensive
- Out of Vocabulary (OOV) problem
    - tokens appearing at inference (test) time that were *not* in training


## Window based models

It is more realistic to *approximate*
$$\pr{\w_\tp | \w_{(0)} \ldots \w_{(\tt-1)} }$$
by conditioning $\w_\tp$ on a **fixed length** prefix ending at $\tt-1$


- Unigram (1-gram) approximation
$
\pr{\w_\tp | \w_{(0)} \ldots \w_{(\tt-1)} } \approx \pr{\w_\tp}
$

    - That is, the conditional probability of $\w_\tp$ is just the unconditional probability.

- Bigram (2-gram) approximation
$
\pr{\w_\tp | \w_{(0)} \ldots \w_{(\tt-1)} } \approx \pr{\w_\tp | \w_{(\tt-1)} }
$
- n-gram approximation
$
\pr{\w_\tp | \w_{(0)} \ldots \w_{(\tt-1)} } \approx \pr{\w_\tp | \w_{(\tt-(n-1))} \ldots \w_{(\tt-1)} }
$

You can probably see the weakness of the unigram model
- Doesn't respect word order
$$\pr{ \text{["New", "York"]} } = \pr{ \text{["York", "New" ]} }$$

and how increasing the window improves the approximation.

The assumption that the next token can be predicted solely from a fixed-length prefix is called a *Markov* assumption.
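As an illustration of the bigram approximation combined with maximum likelihood estimation by counting, here is a minimal sketch on a made-up three-sentence corpus. The corpus and the function names are assumptions; no smoothing is applied, so unseen bigrams get probability zero.

```python
from collections import Counter

# Hypothetical tokenized corpus.
corpus = [
    ["i", "ate", "an", "apple"],
    ["i", "ate", "a", "pie"],
    ["i", "like", "a", "pie"],
]
START, END = "<START>", "<END>"

context_counts = Counter()  # how often each token appears as a one-token prefix
bigram_counts = Counter()   # how often each (previous, current) pair appears
for sentence in corpus:
    tokens = [START] + sentence + [END]
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens[:-1], tokens[1:]))

def bigram_prob(prev, word):
    """Maximum likelihood estimate of P(word | prev) by counting."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_prob(sentence):
    """Chained probability under the bigram (one-token prefix) approximation."""
    tokens = [START] + sentence + [END]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob(["i", "ate", "a", "pie"]))
print(sentence_prob(["pie", "a", "ate", "i"]))  # same words, reversed order
```

Unlike the unigram model, the bigram model assigns different probabilities to the two orderings. The zero for the reversed sentence also shows the sparsity problem of pure counting.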

# Word prediction problem for word2vec

word2vec is based on one of two prediction problems.
- predict center word given surrounding words as context
- predict which words can occur on either side of a given center word

The problems are framed as:
Predict target word $w_t$ given conditional word $w_c$
- $p(w_t|w_c) $

The first prediction problem is called *Skip gram*: predict surrounding words
- conditional word is a "center word" that is surrounded by other words: $w_c = \w_\tp$
- target word is any surrounding word at a position $\tt'$ within a window of $\tt$
    - predict probability of any word $w_t \in \V$ being equal to $\w_{(\tt')}$ 
    - where $\tt' \in \{ \tt - o, \ldots, \tt + o \} - \{ \tt \}$

The set of training examples (example/label pairs) associated with $w_c = \w_\tp$ is
$$\{ (\w_\tp, \w_{(\tt')}) | \; \tt' \in \{ \tt - o, \ldots, \tt + o \} - \{ \tt \} \}$$

    

The second prediction problem is called *CBOW*
- center word $\w_\tp$
- conditional word $w_c$ is any word that surrounds $\w_\tp$
- target word is the center word $w_t = \w_\tp$
    - predict probability that $w_t = \w_\tp$ is the center word 
    - given a surrounding word $w_c \in \{ \w_{(\tt')} | \tt' \in \{ \tt - o, \ldots, \tt + o \} - \{ \tt \} \}$
    

The set of training examples (example/label pairs) associated with target $\w_\tp$ is
$$\{ (\w_{(\tt')}, \w_\tp) | \;  \tt' \in \{ \tt - o, \ldots, \tt + o \} - \{ \tt \} \}$$
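The two sets of training examples can be generated with a few lines of code. In this sketch the function names are illustrative, and clipping the window at the sequence boundaries (rather than padding with start/end tokens) is an assumption.

```python
def skipgram_pairs(tokens, o):
    """(center, surrounding) pairs for every position t and offset up to o."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - o)
        hi = min(len(tokens), t + o + 1)
        for t_prime in range(lo, hi):
            if t_prime != t:
                pairs.append((center, tokens[t_prime]))
    return pairs

def cbow_pairs(tokens, o):
    """(surrounding, center) pairs: the Skip gram pairs with roles swapped."""
    return [(context, center) for center, context in skipgram_pairs(tokens, o)]

tokens = ["i", "ate", "an", "apple"]
print(skipgram_pairs(tokens, o=1))
print(cbow_pairs(tokens, o=1))
```

Each pair is one (example, label) item for the classifier: the first element is the conditional word $w_c$ and the second is the target word $w_t$.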


## word2vec derivation

We frame the prediction problem as a
multinomial (one class per word in vocabulary) classification problem

$\y = \W \x'$

where 
- $\y$ is a (OHE) target word 
- $\x'$ is derived from (one or more) conditional words.

- $\W$ is $(||\V|| \times ||\V||)$
    - $\W^\ip$, denoting row $i$ of $\W$, is the "template" for target word $\V_i$
        - there are $||\V||$ such targets, one per word in $\V$
    - $\W^\ip$ is length $||\V||$, the size of the OHE

We want to obtain an embedding matrix $\E$ of dimension $(|\V| \times n_e)$
- $\E^{(j)}$, the $j^{th}$ row of $\E$ is the dense vector of $\V_j$, the $j^{th}$ word in vocabulary $\V$.
- so $\x' = \o *  \E$
    - where $\o = \text{OHE}(\x)$
    - $\x'$ is now length $n_e$ rather than $||\V||$
    - correspondingly, $\W$ is now $(||\V|| \times n_e)$

The regression solves for both $\W$ (as usual) *and* $\E$.

That is: we find the embedding $\E$ that is best suited for the classification task.

The matrix $\E$ would then be a map from word $\V_j$ to embedding $\e_j$.

This embedding matrix would hopefully "transfer" to other NLP tasks.

Let's not overlook $\W$, the matrix of classifier parameters
- $\W^{(j)}$, row $j$ of $\W$, is the "template" for target word $\V_j$
    - multinomial regression has one target per vocabulary word
    - it produces a probability distribution $\hat{\y}$ over the words in $\V$
    - $\hat{\y}_j$ is the probability associated with word $\V_j$
    - length of $\W^{(j)}$ is $n_e$, same as embedding vector
   
    

In some sense, $\W^{(j)}$ can *also* be thought of as a dense representation of $\V_j$
- $\E^{(j)}$ is  representation of $\V_j$ when it is a conditional word (independent variable)
-  $\W^{(j)}$ is representation of $\V_j$ when it is a target word (dependent variable)
    - i.e., is a template for target $\V_j$

For simplicity, it is common
- to use the average of $\E$ and $\W$ (which are the same size) as the embedding

#### Objective function

Maximize average log probability over the $T$ examples in training set:

$$\frac{1}{T} \sum_{\tt=1}^T { \sum_{  \tt' \in \{ \tt - o, \ldots, \tt + o \} - \{ \tt \} } { \log( p(\w_{(\tt')}|\w_\tp) )} } $$
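The objective can be evaluated numerically as in the sketch below. It assumes the probabilities come from a softmax over the scores $\W \x'$ (the usual choice for multinomial classification), uses small random matrices in place of trained $\E$ and $\W$, and clips windows at the sequence boundaries. Training would adjust $\E$ and $\W$ to increase this value.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_e = 6, 3

# Random stand-ins for the parameters to be learned.
E = rng.normal(scale=0.1, size=(vocab_size, n_e))  # conditional-word embeddings
W = rng.normal(scale=0.1, size=(vocab_size, n_e))  # target-word templates

def log_softmax(scores):
    """Numerically stable log of the softmax distribution."""
    shifted = scores - scores.max()
    return shifted - np.log(np.exp(shifted).sum())

def log_prob(target, center):
    """log p(target | center): one score per vocabulary word, then softmax."""
    scores = W @ E[center]  # shape (|V|,)
    return log_softmax(scores)[target]

def objective(token_ids, o):
    """Average over positions t of the summed log probabilities in the window."""
    T = len(token_ids)
    total = 0.0
    for t in range(T):
        for t_prime in range(max(0, t - o), min(T, t + o + 1)):
            if t_prime != t:
                total += log_prob(token_ids[t_prime], token_ids[t])
    return total / T

token_ids = [0, 1, 2, 3, 4, 5, 1, 2]  # a hypothetical sequence of vocabulary indices
print(objective(token_ids, o=2))
```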

## Detour: Sentiment classification notebook on Colab : Learned embeddings

[NLP notebook: learned embeddings](https://colab.research.google.com/github/kenperry-public/ML_Fall_2019/blob/master/Keras_examples_imdb_cnn.ipynb#scrollTo=f5XrUD3X8KgP)

## Back from the detour:  summary of Learned embeddings

# Issues 2,3 revisited: Variable length, ordered token sequences

We should already have some idea of how to deal with variable length sequences.

Recurrent Neural Networks take sequence inputs and create representations (hidden states) that are
fixed length "encodings".


Recall that the RNN produces a sequence of hidden states $\h_{(0)}, \ldots , \h_{(||\w||)}$.

Hidden state $\h_\tp$ is effectively a summary or encoding of $\w_{(0)} \w_{(1)} \dots \w_\tp$
- it is computed after having seen the prefix of $\w$ ending at $\tt$.

So $\h_\tp$ is a fixed length encoding of $\w_{(0)} \w_{(1)} \dots \w_\tp$.

This gives us a way to convert variable length $\w$ into fixed length $\h_{(||\w||)}$
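A minimal sketch of this idea follows. It assumes a simple tanh recurrence with random (untrained) weights and a zero initial state; the names `rnn_encode`, `W_x`, `W_h`, and `b` are illustrative and not tied to any particular library.

```python
import numpy as np

def rnn_encode(token_vectors, W_x, W_h, b):
    """Run a simple recurrent update and return the final hidden state."""
    h = np.zeros(W_h.shape[0])               # initial hidden state
    for x in token_vectors:                   # one step per token, in order
        h = np.tanh(W_x @ x + W_h @ h + b)    # new state summarizes the prefix so far
    return h

rng = np.random.default_rng(0)
n_features, n_hidden = 4, 5
W_x = rng.normal(size=(n_hidden, n_features))
W_h = rng.normal(size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

short_seq = rng.normal(size=(3, n_features))  # 3 tokens
long_seq = rng.normal(size=(8, n_features))   # 8 tokens

h_short = rnn_encode(short_seq, W_x, W_h, b)
h_long = rnn_encode(long_seq, W_x, W_h, b)
print(h_short.shape, h_long.shape)
```

Both sequences are encoded as vectors of length `n_hidden`. Unlike pooling, the result depends on token order, because each step feeds the previous state into the next update.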

