#### TF-IDF
TF means term frequency: how often a word appears in a document.  The inverse document frequency (IDF) reduces the contribution of terms that appear in many documents. For example, let $f_{t,d}$ denote the raw count of times the term t occurs in document d, and let the IDF be $\log (N/D(t))$, where N is the total number of documents and D(t) is the number of documents containing the term t.  Then tf-idf would be
$$
f_{t,d} \log \frac{N}{D(t)}
$$
I just don't get the IDF calculation for "document".

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
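# the vocabulary is sorted alphabetically, which fixes the column order
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())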
print(X.shape)
print(vectorizer.vocabulary_)
print(vectorizer.idf_)


['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]


In [14]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
import numpy as np
corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']
vocabulary = ['this', 'document', 'first', 'is', 'second', 'the',
              'and', 'one']
pipe = Pipeline([('count', CountVectorizer(vocabulary=vocabulary)),
                 ('tfid', TfidfTransformer())]).fit(corpus)

pipe['count'].transform(corpus).toarray()



array([[1, 1, 1, 1, 0, 1, 0, 0],
       [1, 2, 0, 1, 1, 1, 0, 0],
       [1, 0, 0, 1, 0, 1, 1, 1],
       [1, 1, 1, 1, 0, 1, 0, 0]], dtype=int64)

In [15]:
pipe['tfid'].idf_

array([1.        , 1.22314355, 1.51082562, 1.        , 1.91629073,
       1.        , 1.91629073, 1.91629073])

In [18]:
# above makes sense except for the 'document' entry, which I do not get
pipe.transform(corpus).shape

(4, 8)
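A possible explanation for the 'document' entry: with its default `smooth_idf=True`, scikit-learn does not use $\log N/D(t)$ but the smoothed form $\ln\frac{1+N}{1+D(t)}+1$ (natural log). For 'document', N=4 and D=3, so the value is $\ln(5/4)+1 \approx 1.2231$. The sketch below is a plain NumPy check of that formula against the count matrix printed above; it does not call scikit-learn itself.

```python
import numpy as np

# Count matrix copied from the CountVectorizer output above
# (columns: this, document, first, is, second, the, and, one).
counts = np.array([[1, 1, 1, 1, 0, 1, 0, 0],
                   [1, 2, 0, 1, 1, 1, 0, 0],
                   [1, 0, 0, 1, 0, 1, 1, 1],
                   [1, 1, 1, 1, 0, 1, 0, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # D(t): number of documents containing each term

# Smoothed IDF: ln((1 + N) / (1 + D(t))) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(np.round(idf, 8))
```

The same formula gives 1.0 for terms that occur in every document (this, is, the), which is why those entries are exactly 1.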

#### word2vec
Words are represented as vectors in a real vector space.  What is needed is to learn a mapping from words, as indexed in some dictionary, to a relatively low-dimensional real vector space.  Such a mapping can be represented as a matrix of size N x D, where N is the number of words and D is the dimensionality of the embedding space; you find the vector representation of a word by multiplying its one-hot (elementary basis) vector by the matrix, which just selects a row.  The key is learning the matrix, which can be done via a simple neural network trained to predict which words surround a given word.  The basic structure involves word sequences, like "every dog must sleep on the bed".  In the continuous bag of words (CBOW) model, you try to predict the target word, sleep, given the context words "every dog must ... on the bed".  In the skip-gram model, you try to predict the context words given the target word.  The (target, context) pairs are like bigrams, but skipping words.
The learning is done by a neural network, with a set of weights mapping the word into the single hidden layer, and a set of weights from the hidden layer to each of the context word positions.  The input weights are an NxD matrix, and the output weights are a DxN matrix.  As a practical matter, a softmax over all N words is expensive, and there are two common workarounds.  Hierarchical softmax replaces it with a sequence of binary decisions down a tree whose leaves are the words, so each decision cuts off a substantial subset of the vocabulary.  Negative sampling instead trains a binary classifier to tell the true context word apart from a few randomly chosen incorrect words.
 
The skip-gram model, given a sequence of training words $w_1,\dots,w_T$, maximizes the average log probability of the context words given each target word:
$$
\frac{1}{T} \sum_{t=1}^T \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} | w_t) 
$$
where c is the context size, the number of surrounding words considered on each side.  Training also subsamples frequent words, randomly dropping occurrences of words that are used a lot.  Afterwards you can just use the input weights as the embedding.
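To make the double sum concrete, here is a minimal sketch that enumerates the (target, context) pairs the objective ranges over. The function name `skipgram_pairs` is made up for illustration, and frequency-based subsampling is left out.

```python
def skipgram_pairs(tokens, c):
    """Return (target, context) pairs for every position t and offset j
    with -c <= j <= c, j != 0, skipping offsets that fall outside the sentence."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-c, c + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            pairs.append((target, tokens[t + j]))
    return pairs

sentence = "every dog must sleep on the bed".split()
pairs = skipgram_pairs(sentence, c=2)
print([ctx for tgt, ctx in pairs if tgt == "sleep"])   # ['dog', 'must', 'on', 'the']
```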

Ng:  Given a training sentence ("I want a glass of orange juice to go along with my cereal"), come up with context-to-target pairs by randomly picking a context word, like orange, and then randomly picking a target word within some window of it (juice).  Pick a vocab size, N, and learn a mapping from context C (orange) to target T (juice).  Start with a one-hot vector for the context word, multiply by the embedding matrix to get the embedded input $e_c$, feed this into a softmax unit, and then let it output the predicted target word.
$$
p(t|c) = \exp(\theta_t^T e_c) / \sum_{j=1}^N \exp(\theta_j^T e_c)
$$
where $\theta_t$ is the parameter vector associated with output word t, and the loss is computed against the one-hot encoding of the true target y.  
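A small NumPy sketch of this softmax, under the assumption of a toy vocabulary with random (untrained) parameters; the names `E`, `theta`, and `p_target_given_context` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10, 4                      # toy vocabulary size and embedding dimension
E = rng.normal(size=(N, D))       # embedding matrix, one row per word
theta = rng.normal(size=(N, D))   # output parameters, one row theta_t per target word

def p_target_given_context(c):
    """Softmax over all N candidate targets for the context word with index c."""
    e_c = E[c]                    # same result as one_hot(c) @ E
    scores = theta @ e_c          # theta_j^T e_c for every j
    scores = scores - scores.max()   # stabilise the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

p = p_target_given_context(3)
loss = -np.log(p[7])              # cross-entropy against a one-hot target with index 7
```

The denominator sums over the whole vocabulary, which is the cost that hierarchical softmax and negative sampling are meant to avoid.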

#### GloVe
For GloVe, the embedding is determined by probabilities derived from a co-occurrence matrix.  The co-occurrence matrix, X, has a number in cell $(i,j)$ indicating the number of times word i appears in the context of word j, and the probability of i appearing in the context of j is that entry divided by the total count for word j (the sum of column j, which equals the row sum when X is symmetric).  GloVe establishes relationships by examining the ratio of word co-occurrence probabilities in the presence of other words.  Two words that always co-occur with the same words are alike.  So GloVe involves estimating co-occurrence via a model.  

$$
F(w_i,w_j, w_k)  =  \frac{ P(i|k) }{ P(j|k) } 
$$

By imposing a number of criteria related to linearity, they end up with the model minimizing

$$
\sum_{i,j=1}^N f(X_{i,j}) \left( w_i^T \tilde w_j +b_i+\tilde b_j -\log X_{i,j} \right)^2
$$
where f is a weighting function, and $\tilde w_j$, $\tilde b_j$ are a separate context vector and bias for word j.  Alternatively, the fitting could be done via matrix factorization.  
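As a rough sketch of this objective in NumPy: the weighting function below is the one proposed in the GloVe paper, $f(x)=(x/x_{max})^{\alpha}$ capped at 1, with $x_{max}=100$ and $\alpha=0.75$. The toy matrix and the function names are assumptions for illustration, and zero counts are skipped because $\log 0$ is undefined.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: (x / x_max)^alpha below x_max, and 1 above it."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted squared error over the nonzero co-occurrence counts."""
    i, j = np.nonzero(X)                       # only cells with X_ij > 0 contribute
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])
    return np.sum(glove_weight(X[i, j]) * err ** 2)

rng = np.random.default_rng(0)
X = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])                   # toy symmetric co-occurrence counts
N, D = X.shape[0], 2
W, W_tilde = rng.normal(size=(N, D)), rng.normal(size=(N, D))
b, b_tilde = np.zeros(N), np.zeros(N)
loss = glove_loss(X, W, W_tilde, b, b_tilde)
```

In practice the loss would be minimized over `W`, `W_tilde`, `b`, and `b_tilde` by gradient descent; the sketch only evaluates it.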


#### Applications overview
Part of speech tagging -- potential to solve via a hidden Markov model, where the word is observed and the tags are hidden.  This is a latent, or hidden, variable model, and one algorithm for decoding it is the Viterbi algorithm, which asks what is the most likely sequence of hidden states given the observed states (input is observations, output is hidden states).  We need data from which we can estimate the likelihood of a word given a part of speech, and the transition probabilities between parts of speech.
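A minimal Viterbi sketch in NumPy, working in log space. The two tags and all probabilities in the toy example are made-up numbers, not estimates from any corpus.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for the observation indices in obs.

    start_p[s]    : P(first state = s)
    trans_p[r, s] : P(next state = s | current state = r)
    emit_p[s, o]  : P(observation = o | state = s)
    """
    n_states, T = len(start_p), len(obs)
    log_delta = np.zeros((T, n_states))            # best log-prob of a path ending in s at t
    backptr = np.zeros((T, n_states), dtype=int)   # previous state on that best path

    log_delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        # scores[r, s]: best path ending in r at t-1, followed by the move r -> s
        scores = log_delta[t - 1][:, None] + np.log(trans_p)
        backptr[t] = scores.argmax(axis=0)
        log_delta[t] = scores.max(axis=0) + np.log(emit_p[:, obs[t]])

    # Trace back from the best final state
    path = [int(log_delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

tags = ["NOUN", "VERB"]                            # hypothetical two-tag tagset
start_p = np.array([0.6, 0.4])
trans_p = np.array([[0.7, 0.3],
                    [0.4, 0.6]])
emit_p = np.array([[0.5, 0.4, 0.1],                # P(word index | NOUN)
                   [0.1, 0.3, 0.6]])               # P(word index | VERB)
path = viterbi([0, 1, 2], start_p, trans_p, emit_p)
print([tags[s] for s in path])                     # ['NOUN', 'NOUN', 'VERB']
```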

Logistic regression can only map words to tags, without context.  Recurrent neural networks are like feed-forward networks, but with a feedback loop that allows taking into account data from the past, e.g. $h(t) = \sigma ( W_x^T x(t)+W_h^T h(t-1)+b)$

One type of NN, not implemented in TF, is the recursive neural network, sometimes called a tree neural network, in which you build a tree recursively until you go from sentence to word, and then compute values for nodes based on the values of their children.  Since weights are shared across nodes, we need two weight matrices, one which tells us how to incorporate the left child and one to incorporate the right child.
$$
h_{node}= f(W_L X_L+W_R X_R +b)
$$
The X inputs could be either a node or a word, so the input and output of a node must be the same size.  As a practical matter, this is implemented by making parse trees into sequences and using recurrent neural networks.  For example, arrange the elements in a list so that children always come before parents, and create separate arrays of parent, relation, and word indexes.  Then go through the arrays in parallel, one element at a time, which works since a parent's value depends on its children, which are always to the left.

#### Recurrent NN
The most basic recurrent unit is the simple recurrent unit, which has input and output, and a hidden layer, with weights connecting every unit of the hidden layer to every other unit.  For example, if the hidden layer has M units, then the hidden weights are MxM, e.g. $h(t) = f  ( W_x^T x(t)+W_h^T h(t-1)+b)$.  The initial value h(0) matters to the final output, and additional recurrent units can be added, their number being a hyperparameter.

The idea is that for each time step you calculate an output (label, etc.), so for each x(t) there is a y(t).  This could be unsupervised, like predicting the next word given all previous words in the sentence.  Then you are trying to find the joint probability for a sequence.  You could think of this as a deep NN with a number of layers corresponding to the number of time steps, and shared weights between the layers.  Note that because we may multiply many things together to calculate a gradient, we may have an exploding or vanishing gradient problem, for which mitigation methods exist.
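The unrolled recurrence can be sketched in a few lines of NumPy. This is only the forward pass with tanh as the nonlinearity f and random toy weights; the output layer and training are left out.

```python
import numpy as np

def simple_rnn_forward(X, W_x, W_h, b, h0):
    """Apply h(t) = tanh(W_x^T x(t) + W_h^T h(t-1) + b) along a sequence.

    X is (T, D); W_x is (D, M); W_h is (M, M); b and h0 are (M,).
    Returns all hidden states, shape (T, M).
    """
    h, states = h0, []
    for x_t in X:                       # the same weights are reused at every time step
        h = np.tanh(W_x.T @ x_t + W_h.T @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, D, M = 5, 3, 4                       # toy sequence length, input size, hidden size
X = rng.normal(size=(T, D))
W_x = rng.normal(scale=0.1, size=(D, M))
W_h = rng.normal(scale=0.1, size=(M, M))
b = np.zeros(M)
H = simple_rnn_forward(X, W_x, W_h, b, h0=np.zeros(M))
print(H.shape)                          # (5, 4)
```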

To add word embeddings, just add an embedding layer to the RNN ahead of the input layer.  

A simple modification of the simple unit 
$$
f(x(t),h(t-1))=f( W_x^T x(t)+W_h^T h(t-1)+b)
$$
is the rated recurrent unit
$$
h(t) = zf(x(t),h(t-1))+(1-z)h(t-1)
$$
which has a parameter that consists of a matrix z.

Gated recurrent units include an update gate z, 
$$
z_t = \sigma ( W_{xz} x_t+W_{hz} h_{t-1} +b_z)
$$
a reset gate which controls how much of the previous state we will consider when creating the new hidden value.
$$
r_t = \sigma ( W_{xr} x_t+W_{hr} h_{t-1} +b_r)
$$
A candidate update value depending on the reset gate
$$
\hat h_t =g(W_{xh}x_t+W_{hh}(r_t \odot h_{t-1}) +b_h)
$$
A final value depending on the candidate and the update gate value
$$
h_t=(1-z_t)h_{t-1}+z_t \hat h_t
$$
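The four GRU equations translate almost line for line into NumPy. In this sketch g is taken to be tanh, the reset gate multiplies $h_{t-1}$ elementwise before the hidden weights are applied, and the parameter dictionary `p` with random toy weights is an illustrative convention rather than any library's API.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update; p maps names like 'W_xz' to weight arrays."""
    z = sigmoid(p["W_xz"] @ x_t + p["W_hz"] @ h_prev + p["b_z"])            # update gate
    r = sigmoid(p["W_xr"] @ x_t + p["W_hr"] @ h_prev + p["b_r"])            # reset gate
    h_hat = np.tanh(p["W_xh"] @ x_t + p["W_hh"] @ (r * h_prev) + p["b_h"])  # candidate
    return (1 - z) * h_prev + z * h_hat

rng = np.random.default_rng(0)
D, M = 3, 4                                   # toy input and hidden sizes
p = {}
for gate in "zrh":
    p[f"W_x{gate}"] = rng.normal(scale=0.1, size=(M, D))
    p[f"W_h{gate}"] = rng.normal(scale=0.1, size=(M, M))
    p[f"b_{gate}"] = np.zeros(M)

h = np.zeros(M)
for x_t in rng.normal(size=(5, D)):           # run over a toy sequence of length 5
    h = gru_step(x_t, h, p)
```

When the update gate is close to zero the unit simply copies $h_{t-1}$ forward, which is how it can carry information across many time steps.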

Long short-term memory (LSTM) has three gates (input, output, forget) and a memory cell. The input and forget gates are like the update gate, determining how much of the input will be considered and how much of the cell value will be remembered:
$$
i_t = \sigma ( W_{xi} x_t+W_{hi} h_{t-1} +W_{ci} c_{t-1}+b_i), \qquad f_t = \sigma ( W_{xf} x_t+W_{hf} h_{t-1} +W_{cf} c_{t-1}+b_f) 
$$
and a candidate memory gate like a rated recurrent unit combining the old cell value with a scaling of the input:
$$
c_t = f_tc_{t-1} +i_t \tanh( W_{xc} x_t+W_{hc} h_{t-1} +b_c) 
$$
An output unit that combines all these values, including the current cell value 
$$
o_t = \sigma ( W_{xo} x_t+W_{ho} h_{t-1} +W_{co} c_{t}+b_o)
$$
and the hidden unit value, which scales the squashed cell value by the output gate.
$$
h_t = o_t \tanh(c_t)
$$
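A sketch of one LSTM step following the equations above, including the peephole terms $W_{ci}$, $W_{cf}$, $W_{co}$. The peephole weights are written as full matrices to match the notation here (they are often taken to be diagonal), and the parameter dictionary `p` with random toy weights is purely illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update with peephole connections; returns (h_t, c_t)."""
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] @ c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c + p["b_o"])
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, M = 3, 4                                   # toy input and hidden sizes
p = {}
for gate in "ifco":                           # input, forget, cell, output
    p[f"W_x{gate}"] = rng.normal(scale=0.1, size=(M, D))
    p[f"W_h{gate}"] = rng.normal(scale=0.1, size=(M, M))
    p[f"b_{gate}"] = np.zeros(M)
for gate in "ifo":                            # peephole weights from the cell
    p[f"W_c{gate}"] = rng.normal(scale=0.1, size=(M, M))

h, c = np.zeros(M), np.zeros(M)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_step(x_t, h, c, p)
```

Note that the output gate looks at the current cell value $c_t$, while the input and forget gates look at $c_{t-1}$.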

#### Sequence models (Ng, Coursera)
Slightly different notation, where the initial hidden units are set as $a_0=0$, the input is denoted x, and the output is denoted y.   
$$
a_i=g(W_{aa}a_{i-1} +W_{ax}x_i+b_a)=g(W_a[a_{i-1}:x_i]+b_a); \qquad y_i=g(W_{ya}a_{i} +b_y)
$$
where in the second version of a, matrices and vectors are concatenated appropriately for multiplication ($W_{aa}$ and $W_{ax}$ side by side in $W_a$, $a_{i-1}$ stacked on $x_i$).  Loss is computed at each time step for backpropagation.  

In the GRU, the memory cell and gates can be multidimensional, with vector values being changed separately, like there is one bit for plurality, and one for his/her, etc.  http://dprogrammer.org/rnn-lstm-gru





All the RNNs so far go only in one direction.  Bidirectional RNNs also have a backward recurrent layer, whose units are connected to each other backward in time.  The forward units are informed from the beginning of the sequence to the end, the backward units from the end to the beginning, and both combine their information to inform the output layer (y).  There is no connection between the forward and backward layers, so the graph remains acyclic.


 https://www.researchgate.net/figure/Structure-of-a-bidirectional-RNN_fig2_318332317

Bidirectional RNNs with LSTM blocks are common.  Bidirectional RNNs need the entire sequence for processing.  



RNNs can be extended to deep RNNs by stacking layers (from the Coursera video), where units receive inputs from the layer below and from the previous time step (the left).

Sequence to sequence models take one sequence, like a sentence, to another sequence, like a translation.  One example is an encoder and decoder network, which, in the case of machine translation, can be thought of as estimating the conditional probability of an output sentence given an input sentence, and choosing the most likely one.

Beam search with a beam width of three, in the first step of the translation, chooses the three most likely initial words, and then chooses second words based on those.  At each stage, it maintains the three most likely sentence fragments.
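A minimal beam search sketch in plain Python. The callback `next_log_probs` stands in for the decoder and is hypothetical, as is the toy bigram table, which is chosen so that greedy decoding (beam width 1) misses the most likely sentence.

```python
import math

def beam_search(next_log_probs, start, end, beam_width=3, max_len=10):
    """Keep the beam_width highest-scoring partial sequences at each step.

    next_log_probs(seq) returns a dict {token: log p(token | seq)}.
    Returns a list of (sequence, log-probability), best first.
    """
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                  # finished hypotheses are carried over unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda cand: cand[1], reverse=True)[:beam_width]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams

# Toy bigram "decoder": the next-token distribution depends only on the last token.
table = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.95, "w": 0.05},
    "x": {"</s>": 1.0}, "y": {"</s>": 1.0}, "z": {"</s>": 1.0}, "w": {"</s>": 1.0},
}

def toy_model(seq):
    return {tok: math.log(prob) for tok, prob in table[seq[-1]].items()}

best_wide = beam_search(toy_model, "<s>", "</s>", beam_width=3)[0]
best_greedy = beam_search(toy_model, "<s>", "</s>", beam_width=1)[0]
print(best_wide[0])     # ['<s>', 'b', 'z', '</s>']
print(best_greedy[0])   # ['<s>', 'a', 'x', '</s>']
```

With width 3 the search keeps "b" alive after the first step and finds the sentence with probability 0.38, while the greedy search commits to "a" and ends at 0.33.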

Attention models:  One version begins with a bidirectional RNN to compute a set of features for each input word, which inform a set of attention parameters (alpha) for each output word.  An image is available at https://towardsdatascience.com/attention-networks-c735befb5e9f.  The alpha parameters can be computed as a softmax over inputs, with scores determined by a one-layer NN.



From LZ prog RNNS
Sequence to sequence models are two RNNs joined together, one which encodes the input, and another which decodes to produce output.  When considering the encoder, only the last state of the sequence matters for the decoding (the summary vector).  For the decoder, you pass in the summary vector and an initial input, and each subsequent unit takes in the hidden state and the previous output.  Example tasks are machine translation, Q&A, and chatbots, although seq2seq only memorizes responses and repeats them.
Decoding details:  Keras works with constant-size data, so the decoder needs its own input sequence.  Rather than use the previous step's output as the new input, you can use teacher forcing, where you pass the correct previous word into the next unit rather than the generated one.  At test time, you go back to passing in the previous output, meaning the decoder input length during prediction is 1.  The solution is to have two different models that share weights, one with the fixed sentence length for training and one of length one for prediction.  
One of the chief advantages of seq2seq is that the input and output size can be separately determined based on the problem.  However, the whole input must be mapped to one vector, which is limiting for long sentences, and we lose all other information in the encoder's hidden states.

Attention models
RNNs which produce output only at the end are good for short sentences, but not longer ones, as the relevant input may be lost by the time you reach the end.  One alternative is to add a max-pooling layer connected to all the recurrent units, which is like saying pick the most important feature.  Alternatively, you could take a softmax over the recurrent unit values to decide which one matters most.  With attention models, we still have an encoder and a decoder.  The encoder is a bidirectional LSTM.

Every unit of the decoder is connected to the encoder via a term consisting of a linear combination of the hidden states of the encoder.  This is referred to as the context.  Because we have two different sequences, we have two different time indices, $t,t^\prime$, with t referring to the output sequence and $t^\prime$ to the input sequence.  The context for output step t is 
$$
c(t)=\sum_{t^\prime=1}^{T_x} \alpha(t,t^\prime) h(t^\prime)
$$
and the attention weights are computed by a small neural network g, followed by a softmax over $t^\prime$ so that they sum to one:
$$
\alpha(t,t^\prime)=\frac{\exp g(s_{t-1},h_{t^\prime})}{\sum_{\tau=1}^{T_x}\exp g(s_{t-1},h_{\tau})}\qquad t\le T_y; \, t^\prime \le T_x
$$
where $s_{t-1}$ is the previous decoder state.  You can use teacher forcing by concatenating the context with the correct previous output word, rather than the word the decoder generated at the previous step.
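A sketch of the context computation for a single decoder step. The scoring network g is assumed here to be one tanh hidden layer applied to the concatenation of $s_{t-1}$ and $h_{t^\prime}$, followed by a softmax over input positions; the parameter names `W1`, `b1`, `w2` and the toy sizes are illustrative.

```python
import numpy as np

def attention_context(s_prev, H, W1, b1, w2):
    """Context vector and attention weights for one decoder step.

    s_prev : previous decoder state s_{t-1}, shape (M_dec,)
    H      : encoder hidden states h(t'), shape (T_x, M_enc)
    W1, b1, w2 : parameters of the assumed one-hidden-layer scoring network g
    """
    T_x = H.shape[0]
    # Pair s_{t-1} with every encoder state and score each pair
    pairs = np.concatenate([np.tile(s_prev, (T_x, 1)), H], axis=1)
    scores = np.tanh(pairs @ W1 + b1) @ w2          # shape (T_x,)
    # Softmax over input positions t' gives weights that sum to one
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    return alpha @ H, alpha                         # weighted sum of encoder states

rng = np.random.default_rng(0)
T_x, M_enc, M_dec, K = 6, 4, 3, 5                   # toy sizes
H = rng.normal(size=(T_x, M_enc))
s_prev = rng.normal(size=M_dec)
W1 = rng.normal(size=(M_dec + M_enc, K))
b1 = np.zeros(K)
w2 = rng.normal(size=K)
context, alpha = attention_context(s_prev, H, W1, b1, w2)
```

The same scoring weights are reused at every decoder step; only $s_{t-1}$ changes, so each output word gets its own set of weights over the input.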

Memory networks
The example case for this area is the bAbI dataset, which is a story and Q&A dataset.  These networks are not deep, so they are fast.  The simplest thing to do is grab all the word vectors in the sentence and sum them up, resulting in one vector for each sentence.  Now, with your vector for each sentence, map the question into a vector, but with its own embedding, so that it can learn to be close to the relevant sentence.  Then dot each sentence vector with the question vector and take the softmax.  This gives you the relevant sentence.  Now we pass it through a dense logistic regression layer and then take the softmax over the possible vocabulary to get a single-word answer.  If you need to deal with two supporting facts, differing in time, you double the network, taking the answer from the first block and adding it to a second block, again consisting of all sentences.  Each block is called a hop.
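A rough single-hop sketch of the steps just described, with random untrained weights. It assumes words are given as integer indices, that the "relevant sentence" is taken as the softmax-weighted sum of sentence vectors, and that the dense layer is a single matrix `W_out`; all names and sizes are illustrative.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def single_hop_answer(story, question, E_story, E_question, W_out):
    """story: list of sentences (lists of word indices); question: list of word indices."""
    sent_vecs = np.stack([E_story[s].sum(axis=0) for s in story])   # one vector per sentence
    q_vec = E_question[question].sum(axis=0)        # the question has its own embedding
    weights = softmax(sent_vecs @ q_vec)            # relevance of each sentence
    relevant = weights @ sent_vecs                  # soft selection of the relevant sentence
    return softmax(W_out @ relevant), weights       # distribution over answer words

rng = np.random.default_rng(0)
V, D = 12, 5                                        # toy vocabulary and embedding sizes
E_story = rng.normal(size=(V, D))
E_question = rng.normal(size=(V, D))
W_out = rng.normal(size=(V, D))
story = [[0, 1, 2], [3, 4, 5, 6], [7, 1, 8]]        # three sentences of word indices
question = [9, 10, 3]
answer_probs, weights = single_hop_answer(story, question, E_story, E_question, W_out)
```

A second hop would repeat the weighting step with the first hop's output added to the question vector.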