# Knowledge Graphs and Its Associates

---

## Entities

* **[Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)** - is typically a problem of detection entities of type LOCation, ORGanization, or PERson.
<br>Note: each of those entities might span several words/tokens in a given text.
<br>Typically, NER is solved as follows: linguistic grammar-based, statistical models or machine learning.
<br>Conditional Random Fields are used for the tasks.
<br>
Examples:
  * Kaggle competition [dataset](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus).
  * [DataTurk](https://dataturks.com/projects/Mohan/Best%20Buy%20E-commerce%20NER%20dataset) - search queries with entities.
  * [Stanford NER](https://nlp.stanford.edu/software/CRF-NER.shtml)
  * [Spacy](https://en.wikipedia.org/wiki/SpaCy)
  * Apache [OpenNLP](http://opennlp.apache.org/index.html)

* **[Coreference resolution](https://en.wikipedia.org/wiki/Coreference#Coreference_resolution)**
<br>
Another important problem with regards to the entities - is conreference resolution.
<br>Coreference denotes a presence of several expressions in a text that refer to the same person/thing.
<br>
For example:
<br>
> **The food** was salty so guests did not enjoy **it**.
<br>
> **The student** was absent for 3 month. **Such person** won't pass any exam.

* **[Entity linking](https://en.wikipedia.org/wiki/Entity_linking) or Disambiguation**
<br>
Entity disambiguation is a problem of matching the identity of the entity mentioned in the text to actual entity.
<br>Typically, supervised learning is used to solve this problem where anchor texts are leveraged in the training data.
<br>Further, several explorations were made with [clean unambiguous training data](http://www.aclweb.org/anthology/C10-1145), or with [topically related texts that potentially have similar types](https://www.cc.gatech.edu/~zha/CSE8801/query-annotation/p457-kulkarni.pdf)
<br>of entities in them, etc.
<br>
For example:
<br>
> Donald **Trump** and **Trump** card.
<br>
>**Paris** Hilton and **Paris** city.

## Relations

Entities could have several various relationships between each other that are important to extract. For example:

* Types of relationships:

  * *is-a* - the relationship between two entities in which one entity inherits from the other

  * *Hypernyms* - a word with a broad meaning constituting a category into which words with more specific meanings fall;
  
  * *Hyponyms* - is a word or phrase whose semantic field[1] is included within that of another word.
  <br>Hyponyms denotes a subset of the hypernym.
  
  * *Meronymy* denotes a constituent part of, or a member of something. Meronym is a part of a whole!

  * *Synsets* - A set of one or more synonyms that are interchangeable in some context without changing the truth value of the
  <br>proposition in which they are embedded;

  * *Metonymy* - the substitution of the name of an attribute or adjunct for that of the thing meant, for example suit for business executive.
  
  * *Anything*: who is married on who, what causes what, etc.
  
* Relation extraction

  * Methodologies: [Regex](http://www.aclweb.org/anthology/D08-1003), [Rule based](http://iswc2012.semanticweb.org/sites/default/files/76490257.pdf), [Wikipedia categories](http://pages.cs.wisc.edu/~anhai/papers/kcs-sigmod13.pdf), [Distant supervision](https://web.stanford.edu/~jurafsky/mintz.pdf), [Bayesian networks](http://aclweb.org/anthology/D17-1192), [Factor graphs](https://cs.stanford.edu/people/czhang/zhang.thesis.pdf).

### Knowledge Bases/Knowledge Graphs

* [WordNet](https://en.wikipedia.org/wiki/WordNet)
<br>
WordNet is a lexical database for English. It groups English words into synonyms (synsets), provides their short
<br>
descriptions and usages. Moreover, it contains several relations between the enties of synsets.

* [OmegaWiki](http://www.omegawiki.org/Meta:Main_Page), [BabelNet](https://babelnet.org/)
<br>
OmegaWiki aims at creating dictionaries of all words of all languages. BabelNet is a multilingual encyclopedic dictionary.

* <a href="https://en.wikipedia.org/wiki/Taxonomy_(general)">**Taxonomy**</a>
<br>
Taxonomy refers to the hierarchical categorization where relatively well-defined classes are nested under broader categories.

* [Folksonomy](https://en.wikipedia.org/wiki/Folksonomy)
<br>
Folksonomy is a relatively new system where users apply tags to online items.
<br>
As opposed to Taxonomy, Folksonomy does not derive a hierarchical structure betwen the tags but rather only assigns them.

* <a href="https://en.wikipedia.org/wiki/Ontology_(information_science)">**Ontology**</a>
<br>
Ontology is a representation, naming, definitions, categories, properties, relations of the concepts/entities
<br>
for several or all domains.

* [DBPedia](https://en.wikipedia.org/wiki/DBpedia)
<br>
DBPedia aims at extracting structured content from the Wikipedia. It describes about 4.5M entiuties,
<br>
with about 1.5M persons, 700K places, 240K organizations, etc.

* <a href="https://en.wikipedia.org/wiki/YAGO_(database)">YAGO</a>
<br>
YAGO is an open sourced knowlege base that was developed in Max Planck Institute.
<br>
This knowledge base contains over 10M entities and about 120M facts about those entities.
<br>
YAGO extracts information from Wikipedia boxes, WordNet and [GeoNames](https://en.wikipedia.org/wiki/GeoNames).

* [**Knowledge Bases**](https://en.wikipedia.org/wiki/Knowledge_base) and [Knowledge Graphs](https://en.wikipedia.org/wiki/Knowledge_Graph)
<br>
KB or KG are technology to store complex structured and unstructured information.
<br>
One of the main example of the Knowledge Graph is [Google Knowledge Graph](https://en.wikipedia.org/wiki/Knowledge_Graph) that was in part
<br>
powered by [Freebase](https://en.wikipedia.org/wiki/Freebase) (Freebase is a large collaborative knowledge base that contain the
<br>
data composed mainly by its community members.).
<br>
Another example: Knowledge Graphs with [DeepDive](https://meta.wikimedia.org/wiki/Research:Wikipedia_Knowledge_Graph_with_DeepDive).

* <a href="https://en.wikipedia.org/wiki/Commonsense_knowledge_(artificial_intelligence)">Commonsense knowledge</a>
<br>
Common sense knowledge consists of facts about everyday life, e.g., The sky is blue, a lemon is yellow and sour.
<br>
A large corpus of this data called [Open Mind Common Sense](https://en.wikipedia.org/wiki/Open_Mind_Common_Sense) (OMCS) was created by MIT.

# Sequence Models

---

Sequential modeling could be either conditioned on sequential data or require to generate sequential data.

Example 1: read a news paper article and pridict the gender of the writer; read the movie review and predict the sentiment.

Example 2: POS tagging, machine translation, summarization, generation, image generation, text to speech, etc.

## HMMs and CRFs


### HMM

* **Hidden Markov Model (HMM)**
<br>
Is a statistical Markov model in which we assume that the process that is being modeled is Markov process
<br>(probability of the event depends only on the previous state/event) with unobserved states.

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/HiddenMarkovModel.svg/750px-HiddenMarkovModel.svg.png width=50%>

> (c) [Wikipedia](https://en.wikipedia.org/wiki/Hidden_Markov_model)

> So in the *Markov process* the states are directly observed, thus, we have only state transition probabilities.
<br>In the *[Hidden Markov Model](http://cs229.stanford.edu/section/cs229-hmm.pdf)* the state is not directly visible, but the output
<br>(dependent on the state) is visible. Thus, each state has a probability distribution over the possible output tokens.
<br>
HMM are well known and used for the following applications: part of speech tagging, reinforcement learning and temporal pattern recognition
<br>(speech, handwriting, gestures, etc.), bioinformatics, etc.
<br><br>
HMM can be seen as a generalization of a mixture model where the hidden variables are related through
<br>a Markov process rather than independent (as we have seen in the Topic Modeling).
<br><br>
For example, consider part-of-speech tagging problem, where POS are the hidden states that have some observed word representation.
<br>As a result, we need the entire sequence of the states to be computed - [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm)
<br>(dynamic programming + backtracking) can be used.

<img src="http://cse24-iiith.virtual-labs.ac.in/exp7/Exp5/viterbi-4.gif" width=50%>

> (c) [NLP Lab of IIIT-H](http://cse24-iiith.virtual-labs.ac.in/exp7/index.html)

### CRF

* ** [Conditional Random Fields](https://www.research.ed.ac.uk/portal/files/10482724/crftut_fnt.pdf) (CRF)**
<br>
Is also a statistical modeling method that is usually applied for structured prediction (predict the sequence of labels; parsing; named entity recognition,
<br>shallow parsing, etc.). [CRF](https://en.wikipedia.org/wiki/Conditional_random_field) is a discriminative undirected probabilistic graphical models
<br>which somehow encodes known relationships between observations and construct consistent interpretations.
<br>CRF is defined on top of observations X and random variables Y.
<br>In particular, in conditional random field (X, Y) - random variable Y that is conditioned on X follows Markov property (memoryless).
<br>
In the context of CRFs, we are defining feature functions that takes as input:
<br>(1) sentence, (2) position of the word, (3) label of the i-th word, (4) label of the previous word - and outputs a real value (feature value).
<br>We can convert those features and feature scores to the probabilities by using exponentiation and normalization.
<br>Label sequence modeled as a normalized product of feature functions.
<br>
In order to express probability $P(y | x, w)$ of the sequence through the feature functions, exponentiation is used.
$$ P (y | x, w) = \frac{ \sum_{i=1}^{n} \sum_{j} w_j f_j(y_{i-1}, y_i, x, i ) }{\sum_{y \in Y} \sum_{i=1}^{n} \sum_{j} w_j f_j(y_{i-1}, y_i, x, i )} $$
<br>
To learn the weights of the features, we need to compute the gradient of the log probability p(labels | sentence).
<br>More intuitive explanation in this [post](http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/).
<br><br>
Two options are possible once it comes to the inference over CRFs, either the graph is a chain or tree - then we can use HMM algorithms,
<br>or if the graph contains loops - we can use some approximations (e.g. [Loopy belief propagation](https://en.wikipedia.org/wiki/Belief_propagation#Approximate_algorithm_for_general_graphs)).
<br><br>
You can play with a [CRF based NER](http://nlp.stanford.edu:8080/ner/) system provided in a [demo](https://nlp.stanford.edu/software/CRF-NER.shtml) by Stanford.

<img src="https://www.codeproject.com/KB/recipes/559535/gerative-discriminative.png" width=60%>

> Moreover, CRF could be used beyond sequences, but also in the form of trees, and thus enabling computation of the constituency parsing
<br>(one that is described in the preprocessing section).

More information on structured predictors [here](http://lxmls.it.pt/2018/strlearn.pdf).

# Deep Learning

---

For the sake of completeness, we should also consider Reccurent Neural Networds (RNN) that are used to handle textual inputs as sequences.

When we talk about Deep Learning, what we actually mean - is deep (many many layers) neural network training and inference.

## **Reccurent Neural Networks (RNN)** for sequence modeling

You can read up more on the attention models in the very intuitive explanations from Chris Dyer from
<br>DeepMind and his [presentation](http://lxmls.it.pt/2018/lxmls-dl3.pdf) at LxMLs 2018 in Lisbon.

### Language modeling

Language models assings probabilities to the sequences of the words. Typically a chain rule is used.

$ p(w_1, \dots, w_l)  = p(w_1) \times p(w_2 | w_1) \times p(w_3 | w_1, w_2) \times \dots $

Basically we need to represent an arbitrary lond history. 

For this, we could train a neural network that builds a representation of sequences of unbound length.

$\hat y = Wx + b$  - linear regression where we need to minimize the prediction error on the given dataset.

If we assume $h = g(Vx + c)$, $\hat y = Wh + b$ - non-linear regression.

$h$ - is kind of induced features in a linear classifier.

### RNN

A typical Feed-forward Neural network would have the structure as follows (just as described above in fact):

$h = g(Vx + c)$ 

$\hat y = Wh + b$

On the contrary Recurrent Neural Network:

$h_t = g(Vx_t + Uh_{t-1} + c)$

$\hat y_t = Wh_t + b$

In a nutshell, [RNN](http://www.deeplearningbook.org/contents/rnn.html) accepts an input vector $X$ and produces an output vector $Y$.
<br>What is interesting here is that the output is produced not only by the provided $X$, but also by the whole history the model was fed with.
<br>After each iteration (each input), RNN updates a hidden state (vector) $h$, which for example could be a combination of the two:
<br>weighed sum of the input and weighted sum of the hidden state.
<br>To go even deeper, we could define two networks, one that accepts normal input (encoder) as described earlier,
<br>and the other that accepts the output of the first network as an input (decoder).

[Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) shows several options of the RNN applied to the text
<br>(if interested you could check Andrej's [vanilla char-level language RNN](https://gist.github.com/karpathy/d4dee566867f8291f086) implementation).

<img src="http://karpathy.github.io/assets/rnn/diags.jpeg" wirth=60%>
<br>
`Each rectangle is a vector and arrows represent functions (e.g. matrix multiply). Input vectors are in red, output vectors are in blue and green vectors hold the RNN's state (more on this soon). From left to right: (1) Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification). (2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). Notice that in every case are no pre-specified constraints on the lengths sequences because the recurrent transformation (green) is fixed and can be applied as many times as we like.`

#### How do we train RNN?


Let's consider **many-to-many** example. For each of the many outputs we actually know what would be the idea outcome (e.g., next word prediction in a sentence) and each of those output, thus, produced a error/cost. We could formulate a total cost/loss as a sum of the outcomes cotsts $F$. 
Similarly, parmeters are updated based on the propagated errors from each of the outputs.


<img src="https://docs.google.com/drawings/d/e/2PACX-1vQKS_hnAXySuEFOXK3ME8UW96uD0iFN_ggMA_WrPvduLAIDIloL6ow1ttDDkVXiQg-ydYFTHn-az4P6/pub?w=689&h=575">

$\frac{dF}{dU} = \sum_{t=1}^{4} \frac{dF}{dh_{t}} \frac{fh_{t}}{dU}$

#### Read and summarize

Let's consider another example: many-to-one, or "read and summarize" a sequence into a single vector.

<img src="https://docs.google.com/drawings/d/e/2PACX-1vQgGXBEeO9TQ_F_ouOasobFJaMR__iQobAZJYL4H4_Ay4En_5mMMt68CjZ8Xs6ODSYe8O6qT3u81dpW/pub?w=473&h=562" width=45%>
<img src="https://docs.google.com/drawings/d/e/2PACX-1vTfE3Zh6IRjvuxe8YDsR_dcN09QD7yr5R-mq_vB7-57PahL_o8Z7PCWZhVg4lmNknquJvxEJJ1RLJHF/pub?w=454&h=648" width=45%>

$h_t = g(Vx_t + Uh_{t-1} + c)$

$\hat h = max(h_t)$

$\hat y_t = W \hat h_t + b$

#### How do we actually select the output $y'$?

<img src="https://docs.google.com/drawings/d/e/2PACX-1vQSfbYA8T5ULcgfwDSi3k6jDbz-z11nkon45gsrc2jxyKWBUyEkmIVpwnrQAJymIWIag6nENAKwRsAQ/pub?w=473&h=315">

Where in in fact $y'$ is first a vector of the size of the vocabulary. Firther we apply softmax and, thus, estimate something like $P(W|w_1, w_2, w_3, w_4)$. In $y'$ each dimention correspond to a word in a closed vocabulary V. And aprobability of each word is estimated as follows:

$u = Wh + b$

$p_i = \frac{\exp{u_i}}{\sum_j \exp{u_j}}$

### LSTM

LSTM ([Colah's blog intuitive explanations](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)) is conceptually similar to the RNN explained above,
<br>though it adds to each of the reccurent unit a memory notion, thus, each recurrent unit becomes slightly more complicated.

$c_t = c_{t-1} + f(x_t)$, where $f(x_t) = tanh(Wx_t + b)$.

$h_t = g(c_t)$

<img  src="https://docs.google.com/drawings/d/e/2PACX-1vTr8daJ-L_O4s3iQD5MeNIWqDxACz4fFdfNVUBDhpnWZ86VMcgLm67UTo9zou57d18pMqtDkL99cLq6/pub?w=450&h=553">

Note $\frac{dc_t}{dc_{t-1}} \sim I$

So to summarize the image:

$c_t = f_t \cdot c_{t-1} + i_t \cdot f([x_t;h_{t-1}]) $

$h_t = g(c_t)$

$f_t = \sigma(f_f([x_t;h_{t-1}]))$  - forget gate

$i_t = \sigma(f_i([x_t;h_{t-1}]))$  - input gate

To make everything above an actual LSTM - we also add the output gate: 
$o_t = \sigma(f_o([x_t;h_{t-1}]))$  - output gate

And thus,  $h_t = o_t \cdot g(c_t)$

Moreover, we can tune the balance between input and forget gate by allowing $f_t   = 1 - i_t$.

Nice LSTM illustration published by [Shi Yun on Medium](https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714).
<br>
<img src="https://cdn-images-1.medium.com/max/1600/1*S0Y1A3KXYO7_eSug_KsK-Q.png" width=80%>

Check out more closely each separate unit.
<br>
<img src=https://cdn-images-1.medium.com/max/1600/1*laH0_xXEkFE0lKJu54gkFQ.png width=70%>

### Conditional Language Models

Similar to the previously discussed ones, but here we will be modeling $P(W | w_1, \dots w_k, X)$ - probability of a sequence of words
<br>given some conditioning context $X$.

Examples:

* A sentence in english $\rightarrow$ a sentence in chinese.
* A sentence in english $\rightarrow$ a sentence in french.
* A document $\rightarrow$ it's summary.
* Speech $\rightarrow$ text.
* Question and a document $\rightarrow$ answer.
* Question and image $\rightarrow$ answer.
* An image $\rightarrow$ its textual description.
* Meteo mesures $\rightarrow$ weather report.
* Topic $\rightarrow$ a document on this topic.

#### Encoder-Decoder models

<img src="https://docs.google.com/drawings/d/e/2PACX-1vQva9Ns8Ip_837SQr-yM4IJykLcjy3cmxdJxan1iRConAs0n0BkNw1xG10iXaqZ0y5SvLiXZGMuAiTC/pub?w=704&h=575" width=60%>

So here we need to encode our input sentence (for example), or in other words $c = embed (x)$. Simplest approach - get the mean of each word embeddings.
<br>
The decoded received the input sentence $s = Vc$

As a result, out recurrent decoder will be as follows:

$h_t = g(W[h_{t-1}; w_{t-1}] + s + b)$

$u_t = Ph_t + b'$

And thus, $P(W_t | x, w) = softmax(u_t)$

Recall, unconditional RNN:
<br>
$h_t = g(W[h_{t-1}; w_{t-1}] + b)$

Typically, we need to find the most probably output given the input, i.e., $w^* = \arg max p(w | x)$.

Greedy search is used in such cases: beam search.
Where we preserve N softmax outputs, where N is the size of the beam.

Potential issues and modification:
0. Does not care about the word order if we represent a sentence as a mean.
1. Previous architecture needs to store a lot of state to reproduce what is needed on the decoding step. We could try endocing the sequence backwards.
2. Gradients and memory cells will be eventually forgotten for long sequences. We could represent the embedding of the input and the outputs as matrices.

So the $embed(x)$ would be something like concatened internal states of the input rather than sum them up.

Another approach: we could run the encoding step both forward and backword, stack the token representation together and then merge those into a matrix.

### Attention

Now having some smart/sophisticated representations of the input we can pass it all to the network.
<br>
Here at each output position $t$, TNN would receive two inputs (in addition to any recurrent inputs),
<br>
(1) previously generated output
<br>
(2) encoding of a view of the whole input matrix (weighted sum of the columns based on how important those at the current step - ATTENTION).

(c) [distill.pub](https://distill.pub/2016/augmented-rnns/#attentional-interfaces)
<br>
`The attention distribution is usually generated with content-based attention. The attending RNN generates a query describing what it wants to focus on. Each item is dot-producted with the query to produce a score, describing how well it matches the query. The scores are fed into a softmax to create the attention distribution.`
<br>
<img src=https://distill.pub/2016/augmented-rnns/assets/rnn_preview_ai.svg width=50%>

You can read up more on the attention models in the very intuitive explanations from Chris Dyer from
<br>DeepMind and his [presentation](http://lxmls.it.pt/2018/lxmls-dl3.pdf) at LxMLs 2018 in Lisbon.

### Exercise

Get inspired by the example [here](https://gist.github.com/siemanko/b18ce332bde37e156034e5d3f60f8a23).

But first you might need to load the w2v model below.

In [0]:
#@title Data preparation and preprocessing { display-mode: "form" }
import pandas as pd
import nltk
import re

!pip install --upgrade gensim

# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove.6B.zip

#    Note: it might take several minutes. Be patient!
from gensim.scripts.glove2word2vec import glove2word2vec

glove2word2vec(glove_input_file='glove.6B.50d.txt',
               word2vec_output_file="gensim_glove_vectors.txt")

from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt")

Requirement already up-to-date: gensim in /usr/local/lib/python3.6/dist-packages (3.7.3)


  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [0]:
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import random
import tensorflow as tf
import tensorflow.contrib.layers as layers

map_fn = tf.map_fn

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
negative_ids = movie_reviews.fileids('neg')
positive_ids = movie_reviews.fileids('pos')
negative_reviews = [
    " ".join(movie_reviews.words(fileids=[f])) 
    for f in negative_ids
]
positive_reviews = [
    " ".join(movie_reviews.words(fileids=[f])) 
    for f in positive_ids
]

texts = negative_reviews + positive_reviews
labels = np.array([0] * len(negative_reviews) + [1] * len(positive_reviews))

from sklearn.utils import shuffle
texts, labels = shuffle(texts, labels, random_state=0)

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [0]:
# Prepare dataset here

In [0]:
#@title Dataset Preparation
import numpy as np
num_dimension = model.vector_size  # Get dimensionality out of the model.
document_embeddings = np.zeros((20, 2000, 50))
c = 0
for i, document in enumerate(texts):
  document_word_embeddings = \
      np.array([model[token] for token in document if token in model])
  try:
    document_embeddings[:,i,:] = document_word_embeddings[:20, :]
  except Exception as e:
    c = c + 1

input_labels = np.transpose(np.repeat(labels, 20).reshape((1, 2000, 20)))

In [0]:
# Define the graph here

In [0]:
#@title Graph Definition
################################################################################
##                           GRAPH DEFINITION                                 ##
################################################################################

INPUT_SIZE    = 50       # embedding size of 50
RNN_HIDDEN    = 8
OUTPUT_SIZE   = 1       # positive or negative
TINY          = 1e-6    # to avoid NaNs in logs
LEARNING_RATE = 0.01

USE_LSTM = True

tf.reset_default_graph()

inputs  = tf.placeholder(tf.float32, (None, None, INPUT_SIZE))  # (embedding, batch, in)
outputs = tf.placeholder(tf.float32, (None, None, OUTPUT_SIZE)) # (embedding, batch, out)

cell = tf.nn.rnn_cell.BasicLSTMCell(RNN_HIDDEN, state_is_tuple=True)

batch_size    = tf.shape(inputs)[1]
initial_state = cell.zero_state(batch_size, tf.float32)

# Given inputs (time, batch, input_size) outputs a tuple
#  - outputs: (time, batch, output_size)
#  - states:  (time, batch, hidden_size)
rnn_outputs, rnn_states = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state, time_major=True)

final_projection = lambda x: layers.linear(x, num_outputs=OUTPUT_SIZE, activation_fn=tf.nn.sigmoid)

predicted_outputs = map_fn(final_projection, rnn_outputs)

error = -(outputs * tf.log(predicted_outputs + TINY) + (1.0 - outputs) * tf.log(1.0 - predicted_outputs + TINY))
error = tf.reduce_mean(error)

train_fn = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE).minimize(error)

accuracy = tf.reduce_mean(tf.cast(tf.abs(outputs - predicted_outputs) < 0.5, tf.float32))

W0615 10:29:57.034386 140029374015360 deprecation.py:323] From <ipython-input-6-370d0e046f38>:15: BasicLSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
W0615 10:29:57.067802 140029374015360 deprecation.py:323] From <ipython-input-6-370d0e046f38>:23: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
W0615 10:29:57.131539 140029374015360 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument in

In [0]:
# Write your training loop here

In [0]:
#@title Training Loop
NUM_WORDS = 20
TRAINING_SIZE = 1200
BATCH_SIZE = 5
ITERATIONS_PER_EPOCH = int( TRAINING_SIZE / BATCH_SIZE )

valid_x = document_embeddings[:, TRAINING_SIZE:2000, :]
valid_y = input_labels[:, TRAINING_SIZE:2000, :]

session = tf.Session()
session.run(tf.global_variables_initializer())

for epoch in range(100):
    epoch_error = 0
    for i in range(ITERATIONS_PER_EPOCH):
        x = document_embeddings[:, i*BATCH_SIZE : i+BATCH_SIZE*(i+1), :]
        y = input_labels[:, i*BATCH_SIZE : i+BATCH_SIZE*(i+1), :]
        epoch_error += session.run([error, train_fn], {
            inputs: x,
            outputs: y,
        })[0]
    epoch_error /= ITERATIONS_PER_EPOCH
    valid_accuracy = session.run(accuracy, {
        inputs:  valid_x,
        outputs: valid_y,
    })
    if epoch % 10 == 0:
        print ("Epoch %d, train error: %.2f, valid accuracy: %.1f %%" % (epoch, epoch_error, valid_accuracy * 100.0))

Epoch 0, train error: 0.69, valid accuracy: 50.4 %
Epoch 10, train error: 0.69, valid accuracy: 49.4 %
Epoch 20, train error: 0.69, valid accuracy: 50.8 %
Epoch 30, train error: 0.68, valid accuracy: 51.3 %
Epoch 40, train error: 0.66, valid accuracy: 51.2 %
Epoch 50, train error: 0.66, valid accuracy: 50.5 %
Epoch 60, train error: 0.65, valid accuracy: 50.6 %
Epoch 70, train error: 0.64, valid accuracy: 50.8 %
Epoch 80, train error: 0.63, valid accuracy: 51.7 %
Epoch 90, train error: 0.62, valid accuracy: 51.5 %


## Convolutation Neural Network

Initially convolutions as part of the neural networks was used in the context of images.
<br>
Even without any machine learning, for images were found that *filters* could emphasize one or another property of the image.
<br>Convolutation is simply an element-wise multiplication of two matrices followed by a sum.

Typically convolution can be seen as follows:
<br>
<img src="http://cs231n.github.io/assets/cnn/depthcol.jpeg" width=50%>
<br>
<img src="https://www.pyimagesearch.com/wp-content/uploads/2016/06/convolutions_kernel_sliding.jpg" width=30%>
 (image produced by Adrian Rosebrock [here](https://www.pyimagesearch.com/2016/07/25/convolutions-with-opencv-and-python/))


([c](http://cs231n.github.io/convolutional-networks/)) `Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input - see discussion of depth columns in text below.`

A great [animation](https://www.pyimagesearch.com/2016/07/25/convolutions-with-opencv-and-python/) of the multidimentional convolution in practice:
<img src="https://deeplearning4j.org/img/karpathy-convnet-labels.png" width=60%>

In particular, linear filters were shown to highlight vertical or horizontal borders of the picture (more details [here](http://cs231n.github.io/convolutional-networks/)).
<img src="http://cs231n.github.io/assets/cnn/weights.jpeg" width=50%>

Read more on the convolutation in the math world [here](https://deeplearning4j.org/convolutionalnetwork#define).

### Source http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow

<img src="http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM.png" width=60%>

In [0]:
#@title Imports
import tensorflow as tf
import numpy as np
import os
import time
import datetime
import re
from tensorflow.contrib import learn

In [0]:
#@title Flags
flags = tf.app.flags 
FLAGS = flags.FLAGS

tf.app.flags.DEFINE_string('f', '', 'kernel')

# Data loading params
flags.DEFINE_float("dev_sample_percentage", .1, "Percentage of the training data to use for validation")
flags.DEFINE_string("positive_data_file", "rt-polarity.pos", "Data source for the positive data.")
flags.DEFINE_string("negative_data_file", "rt-polarity.neg", "Data source for the negative data.")

# Model Hyperparameters
flags.DEFINE_integer("embedding_dim", 128, "Dimensionality of character embedding (default: 128)")
flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')")
flags.DEFINE_integer("num_filters", 128, "Number of filters per filter size (default: 128)")
flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)")
flags.DEFINE_float("l2_reg_lambda", 0.0, "L2 regularization lambda (default: 0.0)")

# # Training parameters
flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)")
flags.DEFINE_integer("num_epochs", 200, "Number of training epochs (default: 200)")
flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)")
flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)")
flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)")
# # Misc Parameters
flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement")
flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")

FLAGS = flags.FLAGS

In [0]:
#@title TextCNN
import tensorflow as tf
import numpy as np
 
class TextCNN(object):
    """
    A CNN for text classification.
    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
    """
    def __init__(
      self,
        sequence_length, # The length of our sentences. Remember that we padded
        # all our sentences to have the same length (59 for our data set).
        num_classes, # Number of classes in the output layer,
        # two in our case (positive and negative).
        vocab_size, # The size of our vocabulary. This is needed to define the
        #size of our embedding layer, which will have shape
        #[vocabulary_size, embedding_size].
        embedding_size, # The dimensionality of our embeddings.
        filter_sizes, #  The number of words we want our convolutional filters
        # to cover. We will have num_filters for each size specified here.
        # For example, [3, 4, 5] means that we will have filters that slide over
        # 3, 4 and 5 words respectively, for a total of 3 * num_filters filters.
        num_filters, # The number of filters per filter size 
        l2_reg_lambda=0.0
    ):
      
      # Placeholders for input, output and dropout
      # tf.placeholder creates a placeholder variable that we feed to the network
      # when we execute it at train or test time. The second argument is the shape
      # of the input tensor. None means that the length of that dimension could be
      # anything.
      self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
      self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
      self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
      
      # Keeping track of l2 regularization loss (optional)
      l2_loss = tf.constant(0.0)
    
      # tf.device("/cpu:0") forces an operation to be executed on the CPU.
      with tf.device('/cpu:0'), tf.name_scope("embedding"):
        # W is our embedding matrix that we learn during training.
        # We initialize it using a random uniform distribution.
        # tf.nn.embedding_lookup creates the actual embedding operation.
        W = tf.Variable(
            tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
            name="W")
        # Result of embedding_lookup is 3-dimensional tensor of shape
        # [None, sequence_length, embedding_size].
        self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
        self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

      
      pooled_outputs = []
      for i, filter_size in enumerate(filter_sizes):
        with tf.name_scope("conv-maxpool-%s" % filter_size):
          # Convolution Layer
          filter_shape = [filter_size, embedding_size, 1, num_filters]
          # W is our filter matrix
          W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
          b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
          conv = tf.nn.conv2d(
            self.embedded_chars_expanded,
            W,
            strides=[1, 1, 1, 1],
            padding='VALID', # slide the filter over our sentence without
            # padding the edges, performing a narrow convolution that gives
            # us an output of shape [1, sequence_length - filter_size + 1, 1, 1]
            name="conv")
          # Apply nonlinearity
          # h is the result of applying the nonlinearity to the convolution output
          h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
          # Max-pooling over the outputs
          # Result [batch_size, 1, 1, num_filters]
          #  where last dimension corresponds to our features.
          pooled = tf.nn.max_pool(
            h,
            ksize=[1, sequence_length - filter_size + 1, 1, 1],
            strides=[1, 1, 1, 1],
            padding='VALID',
            name="pool")
          pooled_outputs.append(pooled)

      # Combine all the pooled features
      num_filters_total = num_filters * len(filter_sizes)
      # [batch_size, num_filters_total]
      self.h_pool = tf.concat(pooled_outputs, 3)
      # Using -1 in tf.reshape tells TensorFlow to flatten the dimension when possible.
      self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])
    
      # Add dropout
      with tf.name_scope("dropout"):
        # The fraction of neurons we keep enabled is defined by the dropout_keep_prob input to our network. 
        self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

      with tf.name_scope("output"):
        W = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
        l2_loss += tf.nn.l2_loss(W)
        l2_loss += tf.nn.l2_loss(b)
        self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
        self.predictions = tf.argmax(self.scores, 1, name="predictions")

      # Calculate mean cross-entropy loss (http://cs231n.github.io/linear-classify/#softmax)
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

      # Calculate Accuracy
      with tf.name_scope("accuracy"):
        correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
        self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")

In [0]:
#@title Training function
def train(x_train, y_train, vocab_processor, x_dev, y_dev):
  # A Session is the environment you are executing graph operations in, and it
  # contains state about Variables and queues. Each session operates on a single graph.
  # A Graph contains operations and tensors.
  # You can use the same graph in multiple sessions, but not multiple graphs in one session.
  with tf.Graph().as_default():
    session_conf = tf.ConfigProto(
      allow_soft_placement=FLAGS.allow_soft_placement,
      log_device_placement=FLAGS.log_device_placement)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        cnn = TextCNN(
            sequence_length=x_train.shape[1],
            num_classes=y_train.shape[1],
            vocab_size=len(vocab_processor.vocabulary_),
            embedding_size=FLAGS.embedding_dim,
            filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),
            num_filters=FLAGS.num_filters,
            l2_reg_lambda=FLAGS.l2_reg_lambda)

        # Define Training procedure
        global_step = tf.Variable(0, name="global_step", trainable=False)
        optimizer = tf.train.AdamOptimizer(1e-3)
        grads_and_vars = optimizer.compute_gradients(cnn.loss)
        train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

        # Keep track of gradient values and sparsity (optional)
        grad_summaries = []
        for g, v in grads_and_vars:
            if g is not None:
                grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)
                sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
                grad_summaries.append(grad_hist_summary)
                grad_summaries.append(sparsity_summary)
        grad_summaries_merged = tf.summary.merge(grad_summaries)

        # Output directory for models and summaries
        timestamp = str(int(time.time()))
        out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
        print("Writing to {}\n".format(out_dir))

        # Summaries for loss and accuracy
        loss_summary = tf.summary.scalar("loss", cnn.loss)
        acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)

        # Train Summaries
        train_summary_op = tf.summary.merge([loss_summary, acc_summary, grad_summaries_merged])
        train_summary_dir = os.path.join(out_dir, "summaries", "train")
        train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

        # Dev summaries
        dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
        dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
        dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)

        # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it
        checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
        checkpoint_prefix = os.path.join(checkpoint_dir, "model")
        if not os.path.exists(checkpoint_dir):
            os.makedirs(checkpoint_dir)
        saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)

        # Write vocabulary
        vocab_processor.save(os.path.join(out_dir, "vocab"))

        # Initialize all variables
        sess.run(tf.global_variables_initializer())

        def train_step(x_batch, y_batch):
            """
            A single training step
            """
            feed_dict = {
              cnn.input_x: x_batch,
              cnn.input_y: y_batch,
              cnn.dropout_keep_prob: FLAGS.dropout_keep_prob
            }
            _, step, summaries, loss, accuracy = sess.run(
                [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
                feed_dict)
            time_str = datetime.datetime.now().isoformat()
            print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
            train_summary_writer.add_summary(summaries, step)

        def dev_step(x_batch, y_batch, writer=None):
            """
            Evaluates model on a dev set
            """
            feed_dict = {
              cnn.input_x: x_batch,
              cnn.input_y: y_batch,
              cnn.dropout_keep_prob: 1.0
            }
            step, summaries, loss, accuracy = sess.run(
                [global_step, dev_summary_op, cnn.loss, cnn.accuracy],
                feed_dict)
            time_str = datetime.datetime.now().isoformat()
            print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
            if writer:
                writer.add_summary(summaries, step)
                
        def batch_iter(data, batch_size, num_epochs, shuffle=True):
          """
          Generates a batch iterator for a dataset.
          """
          data = np.array(data)
          data_size = len(data)
          num_batches_per_epoch = int((len(data)-1)/batch_size) + 1
          for epoch in range(num_epochs):
              # Shuffle the data at each epoch
              if shuffle:
                  shuffle_indices = np.random.permutation(np.arange(data_size))
                  shuffled_data = data[shuffle_indices]
              else:
                  shuffled_data = data
              for batch_num in range(num_batches_per_epoch):
                  start_index = batch_num * batch_size
                  end_index = min((batch_num + 1) * batch_size, data_size)
                  yield shuffled_data[start_index:end_index]

        # Generate batches
        batches = batch_iter(
            list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
        # Training loop. For each batch...
        for batch in batches:
            x_batch, y_batch = zip(*batch)
            train_step(x_batch, y_batch)
            current_step = tf.train.global_step(sess, global_step)
            if current_step % FLAGS.evaluate_every == 0:
                print("\nEvaluation:")
                dev_step(x_dev, y_dev, writer=dev_summary_writer)
                print("")
            if current_step % FLAGS.checkpoint_every == 0:
                path = saver.save(sess, checkpoint_prefix, global_step=current_step)
                print("Saved model checkpoint to {}\n".format(path))


In [0]:
#@title Preprocessing
def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()

def load_data_and_labels(positive_data_file, negative_data_file):
    """
    Loads MR polarity data from files, splits the data into words and generates labels.
    Returns split sentences and labels.
    """
    # Load data from files
    positive_examples = list(open(positive_data_file, "r", encoding='utf-8').readlines())
    positive_examples = [s.strip() for s in positive_examples]
    negative_examples = list(open(negative_data_file, "r", encoding='utf-8').readlines())
    negative_examples = [s.strip() for s in negative_examples]
    # Split by words
    x_text = positive_examples + negative_examples
    x_text = [clean_str(sent) for sent in x_text]
    # Generate labels
    positive_labels = [[0, 1] for _ in positive_examples]
    negative_labels = [[1, 0] for _ in negative_examples]
    y = np.concatenate([positive_labels, negative_labels], 0)
    return [x_text, y]

def preprocess():
    # Data Preparation
    # ==================================================

    # Load data
    print("Loading data...")
    x_text, y = load_data_and_labels(FLAGS.positive_data_file, FLAGS.negative_data_file)

    # Build vocabulary
    max_document_length = max([len(x.split(" ")) for x in x_text])
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
    x = np.array(list(vocab_processor.fit_transform(x_text)))

    # Randomly shuffle data
    np.random.seed(10)
    shuffle_indices = np.random.permutation(np.arange(len(y)))
    x_shuffled = x[shuffle_indices]
    y_shuffled = y[shuffle_indices]

    # Split train/test set
    dev_sample_index = -1 * int(FLAGS.dev_sample_percentage * float(len(y)))
    x_train, x_dev = x_shuffled[:dev_sample_index], x_shuffled[dev_sample_index:]
    y_train, y_dev = y_shuffled[:dev_sample_index], y_shuffled[dev_sample_index:]

    del x, y, x_shuffled, y_shuffled

    print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_)))
    print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev)))
    return x_train, y_train, vocab_processor, x_dev, y_dev

In [0]:
!wget https://github.com/dennybritz/cnn-text-classification-tf/blob/master/data/rt-polaritydata/rt-polarity.neg
!wget https://github.com/dennybritz/cnn-text-classification-tf/blob/master/data/rt-polaritydata/rt-polarity.pos

--2019-07-21 07:46:52--  https://github.com/dennybritz/cnn-text-classification-tf/blob/master/data/rt-polaritydata/rt-polarity.neg
Resolving github.com (github.com)... 192.30.253.113
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘rt-polarity.neg’

rt-polarity.neg         [ <=>                ]   1.59M  --.-KB/s    in 0.09s   

2019-07-21 07:46:58 (17.9 MB/s) - ‘rt-polarity.neg’ saved [1672916]

--2019-07-21 07:46:59--  https://github.com/dennybritz/cnn-text-classification-tf/blob/master/data/rt-polaritydata/rt-polarity.pos
Resolving github.com (github.com)... 192.30.253.113
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘rt-polarity.pos’

rt-polarity.pos         [ <=>                ]   1.61M  --.-KB/s    in 0.09s   

2019-07-21 07:46:59 (17.7 MB/s) - ‘rt-pol

In [0]:
x_train, y_train, vocab_processor, x_dev, y_dev = preprocess()

Loading data...


W0721 07:47:05.210405 140372627502976 deprecation.py:323] From <ipython-input-5-ae7cd7e93c1f>:50: VocabularyProcessor.__init__ (from tensorflow.contrib.learn.python.learn.preprocessing.text) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
W0721 07:47:05.213084 140372627502976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/learn/python/learn/preprocessing/text.py:154: CategoricalVocabulary.__init__ (from tensorflow.contrib.learn.python.learn.preprocessing.categorical_vocabulary) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
W0721 07:47:05.216866 140372627502976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/learn/python/learn/preprocessing/text.py:170: tokenizer (from tensorflow.contrib.learn.python.learn.preprocessing.text) is deprecated and will be removed in

Vocabulary Size: 34838
Train/Dev split: 39978/4441


In [0]:
train(x_train, y_train, vocab_processor, x_dev, y_dev)

## Misc

* [GAN](https://arxiv.org/abs/1406.2661) and nice explanation of [GANs](https://deeplearning4j.org/generative-adversarial-network)
  * [MaskGan](https://arxiv.org/abs/1801.07736)
  * [Wasserstein GAN](https://arxiv.org/abs/1701.07875)
* [Capsule Networks](https://medium.com/ai%C2%B3-theory-practice-business/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b)
<br>
If you want to know more about the capsule nets and how exactly they became possible only now - read the initial two papers
<br>by [Hinton et al.](https://openreview.net/pdf?id=HJWLfGWRb) and [Sabour et al.](https://arxiv.org/pdf/1710.09829.pdf)


---

## More advanced metrics

### [Text Simplification](https://docs.google.com/presentation/d/1niuedwfJ4n7cG7rKxFq6qHchkFCAyv4biSHrxzXImIg/edit#slide=id.p)


# [Smart Reply Journey](https://ai.googleblog.com/2015/11/computer-respond-to-this-email.html)

[Preso](https://docs.google.com/presentation/d/114LMqq1-IemD7jGNq5eIVn5PRa3aKIAFm9g3k2LSjfc/edit?ts=5b4b93a3)

---