# Deep Learning for Text and Sequences

# 6.1. Working with text data
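This chapter deals with sequence data, of which text is the most widespread form.
Text can be understood as a sequence of characters or a sequence of words, but working at the word level is most common.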

**Converting text into numeric tensors:**
* Each word -> a vector.
* Each character -> a vector.
* Each n-gram of words/characters -> a vector.<br>
N-grams are overlapping groups of multiple consecutive words or characters.

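The units produced by these different ways of breaking down text are called tokens, and the breaking-down process itself is called tokenization.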
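
As a rough illustration of the three levels of tokenization listed above, here is a minimal plain-Python sketch. The helper names (`word_tokens`, `char_tokens`, `ngrams`) and the simple punctuation-stripping rule are assumptions made for this example, not part of Keras.

```python
# Hypothetical helpers illustrating word-, character-, and n-gram-level tokens.
def word_tokens(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return cleaned.split()

def char_tokens(text):
    """Treat every character as a token."""
    return list(text)

def ngrams(tokens, n):
    """Return all overlapping groups of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = word_tokens("The cat sat on the mat.")
print(words)               # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(char_tokens("cat"))  # ['c', 'a', 't']
print(ngrams(words, 2))    # five overlapping word pairs
```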

**Two common ways of associating a vector with a token:**
* One-hot encoding
* Token embedding (usually applied only to words, hence the name word embedding)

**Bag of words**
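* Split the text into a set of 2-grams (grams made of one or two consecutive words, which may overlap); such a set is called a bag-of-2-grams.
Because it is a set, it has no order.
* This family of tokenization methods is called bag-of-words.
* Extracting n-grams is a form of feature engineering; it is useful in non-deep-learning models such as logistic regression and random forests.
* Deep learning does not rely on it; it uses hierarchical feature learning instead.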
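
To see what "no order" means in practice, the sketch below builds bags of n-grams as Python sets. `bag_of_ngrams` is a hypothetical helper and the two sentences are made up for illustration.

```python
def bag_of_ngrams(tokens, max_n):
    """Return the set of all n-grams of length 1..max_n; position is discarded."""
    bag = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            bag.add(" ".join(tokens[i:i + n]))
    return bag

a = "dog bites man".split()
b = "man bites dog".split()

# With single words only, the two sentences are indistinguishable.
print(bag_of_ngrams(a, 1) == bag_of_ngrams(b, 1))  # True

# Adding 2-grams recovers a little local word order.
print(bag_of_ngrams(a, 2) == bag_of_ngrams(b, 2))  # False
print(sorted(bag_of_ngrams(a, 2)))
```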

## One-hot encoding of words and characters
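Each word is assigned a unique integer index, and this index is then turned into a binary vector of size N (N is the size of the vocabulary). The vector is all zeros except for a 1 at the position of the index.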
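
The same idea can be sketched by hand in plain Python before turning to Keras. The variable names and the choice to leave index 0 unused are assumptions of this sketch.

```python
samples = ["The cat sat on the mat.", "The dog ate my homework."]

# Build the vocabulary: each distinct word gets an integer index starting at 1.
# Index 0 is left unused here (an assumed convention, often reserved for padding).
token_index = {}
for sample in samples:
    for word in sample.lower().rstrip(".").split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

vocab_size = len(token_index) + 1  # +1 for the unused index 0

def one_hot(word):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * vocab_size
    vector[token_index[word]] = 1
    return vector

print(len(token_index))  # 9
print(one_hot("cat"))    # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```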



In [1]:
# Keras built-in for one-hot encoding
from keras.preprocessing.text import Tokenizer
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
#only take into account the 1000 most common words
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)
sequences = tokenizer.texts_to_sequences(samples)
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
word_index = tokenizer.word_index
print('Found %s unique tokens.' %len(word_index))

Using TensorFlow backend.


Found 9 unique tokens.


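**One-hot hashing trick**:
* When the vocabulary is too large, do not explicitly assign an index to each word; instead, hash words into vectors of fixed size.
* Works well for online encoding, since no explicit word index has to be built first.
* The drawback is possible hash collisions: two different words may end up with the same hash.
* Collisions become less likely when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed.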
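
A minimal sketch of the hashing trick follows. It uses `hashlib.md5` only to get a hash that is stable across runs (Python's built-in `hash()` is salted per process for strings); the helper names and the dimensionality of 1000 are assumptions for illustration.

```python
import hashlib

DIMENSIONALITY = 1000  # assumed size of the hashing space

def hashed_index(word, dim=DIMENSIONALITY):
    """Map a word to a slot in [0, dim) using a deterministic hash."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % dim

def hashed_one_hot(text, dim=DIMENSIONALITY):
    """Encode a text as a binary vector without keeping a word index."""
    vector = [0] * dim
    for word in text.lower().split():
        vector[hashed_index(word, dim)] = 1
    return vector

text = "the cat sat on the mat"
vec = hashed_one_hot(text)
print(len(vec), sum(vec))  # 1000 slots, at most 5 of them set

# With a tiny hashing space, collisions are unavoidable:
# 5 distinct words cannot occupy more than 2 slots.
print(sum(hashed_one_hot(text, dim=2)))
```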

## Using word embeddings

One-hot encoding takes up too much space, especially when the vocabulary is large.
Comparison:

<h4 style='padding: 10px'>Differences between one-hot word vectors and word embeddings</h4><table class='table table-striped'> <thead> <tr> <th> </th> <th>One-hot word vectors</th> <th>Word embeddings</th> </tr> </thead> <tbody> <tr> <th scope='row'>1</th> <td>Sparse</td> <td>Dense</td> </tr> <tr> <th scope='row'>2</th> <td>High-dimensional</td> <td>Lower-dimensional</td> </tr> <tr> <th scope='row'>3</th> <td>Hardcoded</td> <td>Learned from data</td> </tr> </tbody> </table>

Embeddings can either be learned by the neural network itself, or borrowed from elsewhere (pretrained embeddings).


### Learn word embeddings with the embedding layer.
**Embedding layer:** A dictionary that maps integer indices (which stand for specific words) to dense vectors. <br>
Word index -> Embedding layer -> Corresponding word vector

Like any other layer, its weights are randomly initialized. During training, they are gradually adjusted via backpropagation.
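
Conceptually, the lookup performed by an embedding layer can be sketched in NumPy as row indexing into a weight matrix. The sizes and indices below are arbitrary values chosen for this sketch, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4  # assumed toy sizes

# Randomly initialized weights, as at the start of training.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_indices = np.array([3, 1, 3])        # a "sentence" of three word indices
vectors = embedding_matrix[word_indices]  # the lookup itself
print(vectors.shape)                      # (3, 4)

# The lookup is equivalent to multiplying one-hot vectors by the matrix,
# but avoids ever materializing the sparse one-hot vectors.
one_hot = np.eye(vocab_size)[word_indices]
assert np.allclose(one_hot @ embedding_matrix, vectors)
```

Note that the same index (3) always maps to the same vector, which is what makes the layer behave like a dictionary.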



In [4]:
# Loading the IMDB data
from keras.datasets import imdb
from keras import preprocessing

max_features = 10000
maxlen = 20

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = max_features)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen = maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test,maxlen = maxlen)
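# Pad or truncate every review to exactly maxlen tokens.
# With the default truncating='pre', longer reviews keep their LAST maxlen tokens.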

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
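
To make the padding step concrete, the following plain-Python sketch mimics what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`). `pad_or_truncate` is a hypothetical helper written for this note; it handles one sequence at a time and is not the Keras implementation.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic 'pre' padding and 'pre' truncation for a single sequence."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]  # drop tokens from the front, keep the end
    return [value] * (maxlen - len(sequence)) + sequence  # pad at the front

print(pad_or_truncate([1, 2, 3, 4, 5], 3))  # [3, 4, 5]
print(pad_or_truncate([1, 2], 3))           # [0, 1, 2]
```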


In [5]:
# Using an Embedding layer and classifier on the IMDB data
from keras.models import Sequential
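from keras.layers import Embedding, Flatten, Dense

model = Sequential()
# Embedding maps each of the max_features word indices to an 8-dimensional vector
model.add(Embedding(max_features, 8, input_length=maxlen))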
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train,
                   epochs=10,
                   batch_size=32,
                   validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Two commonly used precomputed databases of word embeddings: Word2Vec and GloVe (short for Global Vectors).
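
Pretrained vectors such as GloVe are typically distributed as text files with one word per line followed by its coefficients. The sketch below parses that layout into an embedding matrix. The two "pretrained" vectors and the `word_index` dictionary are made up for illustration; a real file would be read from disk and have many more dimensions.

```python
import io
import numpy as np

# Made-up 3-dimensional vectors in the GloVe text layout (word, then floats).
fake_glove_file = io.StringIO(
    "cat 0.1 0.2 0.3\n"
    "dog 0.4 0.5 0.6\n"
)

# Parse each line into a word -> vector mapping.
embeddings_index = {}
for line in fake_glove_file:
    word, *coefs = line.split()
    embeddings_index[word] = np.asarray(coefs, dtype="float32")

word_index = {"cat": 1, "dog": 2, "homework": 3}  # hypothetical tokenizer output
embedding_dim = 3

# Row i holds the vector of the word with index i; row 0 is unused.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # words without a pretrained vector stay all-zero

print(embedding_matrix.shape)  # (4, 3)
```

A matrix built this way has the shape an embedding layer with 4 input indices and 3 output dimensions would expect for its weights.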

## Putting it all together: from raw text to word embeddings

Using pretrained word embeddings

<font color='red'>Unfinished!</font>

## Wrapping up
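In this section:
* Turning text into a data type that a neural network can process.
* Two ways of obtaining embeddings:
  * Task-specific token embeddings, learned with the Keras Embedding layer.
  * Pretrained word embeddings, useful for small NLP problems.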

# 6.2. Understanding recurrent neural networks


The underlying idea: maintain an internal model of what is being processed, built from past information and constantly updated as new information comes in.

A recurrent neural network (RNN) processes sequences by iterating through the sequence elements and maintaining a state containing information relative to what it has seen so far.

In the NumPy example given, the output of the previous timestep (the state) is combined with the current input to compute the current output (W and U are two parameter matrices):

In [0]:
state_t = 0
for input_t in input_sequence:
  output_t = activation(dot(W, input_t) + dot(U, state_t) + b)
  state_t = output_t

The final output is a 2D tensor of shape (timesteps, output_features), where each timestep is the output of the loop at time t. The output at the last timestep reflects information about the entire sequence.
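
The loop above can be made runnable in NumPy roughly as follows. The sizes, the random stand-in inputs and weights, and the choice of `tanh` as the activation are assumptions of this sketch.

```python
import numpy as np

timesteps, input_features, output_features = 100, 32, 64  # assumed sizes
rng = np.random.default_rng(0)

inputs = rng.random((timesteps, input_features))  # random stand-in data
state_t = np.zeros(output_features)               # initial state: all zeros

# Random parameter matrices and bias (untrained).
W = rng.random((output_features, input_features))
U = rng.random((output_features, output_features))
b = rng.random(output_features)

successive_outputs = []
for input_t in inputs:
    # Combine the current input with the previous state.
    output_t = np.tanh(W @ input_t + U @ state_t + b)
    successive_outputs.append(output_t)
    state_t = output_t  # the output becomes the state for the next step

# Stack the per-timestep outputs into a (timesteps, output_features) tensor.
final_output_sequence = np.stack(successive_outputs)
print(final_output_sequence.shape)  # (100, 64)
```

The last row of `final_output_sequence` equals the final `state_t`, which is why the last output alone is often enough to summarize the sequence.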

## A recurrent layer in Keras

In [0]:
from keras.layers import SimpleRNN
from keras.models import Sequential

model = Sequential()
model.add(Em)

## Understanding the LSTM and GRU layers

## A concrete LSTM example in Keras

## Wrapping up

# 6.3. Advanced use of recurrent neural networks

## A temperature-forecasting problem

## Preparing the data

## A common-sense, non-machine-learning baseline

## A basic machine-learning approach

## A first recurrent baseline

## Using recurrent dropout to fight overfitting

## Stacking recurrent layers

## Using bidirectional RNNs

## Going even further

## Wrapping up

# 6.4. Sequence processing with convnets

## Understanding 1D convolution for sequence data

## 1D pooling for sequence data

## Implementing a 1D convnet

## Combining CNNs and RNNs to process long sequences

## Wrapping up

# Chapter summary