# Chapter 2: Deep Learning and Language: The Basics

## 2.1: Basic Architectures of Deep Learning

### 2.1.1: Deep Multilayer Perceptrons

Artificial neurons are mathematical functions that receive weighted input from their afferents.
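As a minimal sketch of that idea, the snippet below computes a weighted sum of the inputs, adds a bias, and passes the result through a sigmoid. The function name `neuron` and all numeric values are assumptions chosen for illustration; they do not come from the chapter.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs and weights, for illustration only.
activation = neuron(inputs=[1.0, 0.0, 1.0], weights=[0.5, -0.3, 0.2], bias=-0.7)
print(activation)  # the weighted sum is 0 here, so the sigmoid returns about 0.5
```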

In [1]:
from keras.models import Sequential
from keras.utils import np_utils
from keras.preprocessing.text import Tokenizer
from keras.layers.core import Dense, Activation, Dropout
from keras.layers import LSTM, Convolution1D, Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import sys



Using TensorFlow backend.


1. `.fit_on_texts`: creates a vocabulary index based on word frequency, mapping each word to its index.
    * e.g. given `"the cat sat on the mat"`, it creates a dictionary such that `word_index['sat']` returns `3`.
    * The most frequent words get the lowest indices, so `word_index["the"]` returns `1` (index `0` is reserved and never assigned to a word).
    * Often, the first few indices will be stop words because they appear a lot.
2. `.texts_to_matrix`: creates a numpy matrix with one row per document and one column per vocabulary index; with `mode='binary'` a cell marks whether the word occurs in the document.
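The two steps above can be sketched in plain Python. This is a simplified re-implementation for illustration, not the Keras code: it only lowercases and splits on whitespace, and the helper names `build_word_index` and `texts_to_binary_matrix` are made up for this example.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank by descending frequency (0 is left unused)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() keeps first-seen order among equally frequent words
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_binary_matrix(texts, word_index):
    """One row per document, one column per index; 1 if the word occurs."""
    n_cols = len(word_index) + 1  # column 0 stays empty
    matrix = []
    for text in texts:
        row = [0] * n_cols
        for word in text.lower().split():
            if word in word_index:
                row[word_index[word]] = 1
        matrix.append(row)
    return matrix

docs = ["the cat sat on the mat", "the dog sat"]  # toy documents
word_index = build_word_index(docs)
matrix = texts_to_binary_matrix(docs, word_index)
print(word_index)
print(matrix)
```

Across these two toy documents "the" occurs three times and "sat" twice, so they receive indices 1 and 2.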

In [2]:
data = pd.read_csv('../raaijmakers-master-code/code/pos_neg.txt', sep='\t',  encoding = "ISO-8859-1")
docs = data['text']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
X_train = tokenizer.texts_to_matrix(docs, mode='binary')
Y_train = np_utils.to_categorical(data['label'])  # from [1, 1, 0] --> [(0,1), (0, 1), (1, 0)]

input_dim = X_train.shape[1]
nb_classes = Y_train.shape[1]

In [3]:

tokenizer.word_index['bad']

856

Building the model
1. The Sequential model serves as a container for the stacked layers.
2. Add a dense layer to receive inputs (of dimensionality defined above) and return an output of 128 dimensions
3. Add another layer that receives, as inputs, the outputs of the Dense layer in 2 and uses a ReLU activation function to determine its output.
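To make steps 2 and 3 concrete, here is a rough sketch of what a dense layer followed by a ReLU computes for a single input vector. The tiny sizes (3 inputs, 2 outputs) and the weight values are assumptions for illustration; the model in this section maps the vocabulary-sized input to 128 outputs.

```python
def dense(inputs, weights, biases):
    """One output per weight row: dot(row, inputs) + bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Pass positive values through unchanged and clip negatives to 0."""
    return [max(0.0, v) for v in values]

x = [1.0, 0.0, 1.0]                          # a binary bag-of-words vector
W = [[0.5, -1.0, 0.25], [-1.0, 0.5, -0.5]]   # hypothetical weights, one row per output unit
b = [0.0, 0.5]                               # hypothetical biases
hidden = relu(dense(x, W, b))
print(hidden)
```

The second unit's pre-activation is negative, so the ReLU sets it to 0.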

In [4]:
model = Sequential()
model.add(Dense(128, input_dim=input_dim))
model.add(Activation('relu'))

Adding more layers....

In [5]:
model.add(Dense(128))
model.add(Activation('relu'))

model.add(Dense(128))
model.add(Activation('relu'))

model.add(Dense(128))
model.add(Activation('relu'))

model.add(Dense(128))
model.add(Activation('relu'))

model.add(Dense(nb_classes))
model.add(Activation('softmax'))

Once you are finished adding layers, `compile` by specifying:
 * the loss(error) function
 * a numerical optimizer algorithm to carry out the gradient descent process
 * a metric to evaluate performance.
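For intuition about the loss function named in the list above, the sketch below computes binary cross-entropy for one sample, averaged over its output units. It is a hand-written illustration under that averaging assumption, not the Keras implementation, and the `eps` clipping value is also an assumption.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the output units of one sample."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

uninformed = binary_crossentropy([0, 1], [0.5, 0.5])    # model has no preference yet
confident = binary_crossentropy([0, 1], [0.01, 0.99])   # model is confidently right
print(uninformed, confident)
```

An undecided two-class model scores about 0.693 (the natural log of 2), which is a useful reference point when reading the loss values reported during the first training epochs.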

Next, you can fit the model specifying
 * `validation_split`: the proportion of the training data held out from training and used for validation
 * the number of training epochs
 * batch size
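Keras takes the validation portion from the end of the data, before any shuffling. The sketch below mimics that behaviour with plain lists; `split_for_validation` is a hypothetical helper written for this illustration, not a Keras function.

```python
def split_for_validation(samples, validation_split):
    """Hold out the trailing fraction of the samples for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

samples = list(range(200))  # stand-in for 200 documents
train_part, val_part = split_for_validation(samples, 0.1)
print(len(train_part), len(val_part))
```

One consequence worth keeping in mind: if the input file happens to be sorted by label, the held-out tail may contain examples of a single class only.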

In [6]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, Y_train, epochs=10, batch_size=32, validation_split=0.1, shuffle=True,verbose=2)



Train on 180 samples, validate on 20 samples
Epoch 1/10
 - 2s - loss: 0.6926 - accuracy: 0.4889 - val_loss: 0.7528 - val_accuracy: 0.0000e+00
Epoch 2/10
 - 0s - loss: 0.6563 - accuracy: 0.5556 - val_loss: 0.8085 - val_accuracy: 0.0000e+00
Epoch 3/10
 - 0s - loss: 0.5394 - accuracy: 0.8389 - val_loss: 0.7981 - val_accuracy: 0.2500
Epoch 4/10
 - 0s - loss: 0.2616 - accuracy: 0.9722 - val_loss: 0.8000 - val_accuracy: 0.3500
Epoch 5/10
 - 0s - loss: 0.0376 - accuracy: 1.0000 - val_loss: 0.3560 - val_accuracy: 0.8500
Epoch 6/10
 - 0s - loss: 0.0017 - accuracy: 1.0000 - val_loss: 0.1179 - val_accuracy: 0.9000
Epoch 7/10
 - 0s - loss: 7.2622e-05 - accuracy: 1.0000 - val_loss: 0.0600 - val_accuracy: 1.0000
Epoch 8/10
 - 0s - loss: 8.0652e-06 - accuracy: 1.0000 - val_loss: 0.0376 - val_accuracy: 1.0000
Epoch 9/10
 - 0s - loss: 1.8716e-06 - accuracy: 1.0000 - val_loss: 0.0274 - val_accuracy: 1.0000
Epoch 10/10
 - 0s - loss: 8.3067e-07 - accuracy: 1.0000 - val_loss: 0.0225 - val_accuracy: 1.0000


<keras.callbacks.callbacks.History at 0x7fa627a15d50>

### 2.1.2: Spatial and Temporal Operators

**Spatial Filtering: Convolutional Neural Networks**
The goal of spatial filtering is to deal with the _structure_ of input data by filtering out irrelevant data and only letting valuable information propagate. CNNs do this with convolution.

### 2.1.2a: Spatial Filters: Convolutional Neural Networks

A CNN applies a set of weights to its input data, essentially as a sliding dot product. Convolutions are easiest to picture in image processing, where _convolving_ an image with a Gaussian filter results in a "smoothed" or blurred image.

<img src="images/Screen Shot 2022-04-08 at 10.58.46 AM.png">

Instead of defining a Gaussian filter by hand, CNNs initialize the filter weights randomly and, through training, learn the weights that improve accuracy.

Filters are applied as a sliding window. The dimensionality of the output is reduced to (number of horizontal window moves) x (number of vertical window moves).
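The sliding-window computation can be sketched in a few lines of plain Python. As in most CNN libraries, the kernel is not flipped here, so strictly speaking this is a cross-correlation; the image and kernel values are made up for illustration.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and take a dot product at every position."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1      # number of vertical window moves
    out_cols = len(image[0]) - k_cols + 1   # number of horizontal window moves
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k_rows) for j in range(k_cols)))
        output.append(row)
    return output

image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0]]
kernel = [[1, 0],
          [0, 1]]
feature_map = convolve2d(image, kernel)
print(feature_map)
```

A 3x4 input and a 2x2 filter leave 2 vertical and 3 horizontal window positions, so the output is 2x3.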

**max pooling**:
A filter that returns the largest value (within the filter window) of the input it is applied to.
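A one-dimensional version of max pooling, as a small sketch. The function name, the default of non-overlapping windows, and the input values are assumptions for this example.

```python
def max_pool_1d(values, pool_size, stride=None):
    """Return the largest value inside each window; windows do not overlap by default."""
    stride = stride or pool_size
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, stride)]

pooled = max_pool_1d([1, 3, 2, 8, 5, 4], pool_size=2)
print(pooled)
```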


**CNNs for text**


1. Define the maximum number of words to keep, based on word frequency. Only the most common `num_words - 1` words will be kept.
2. Create the vocabulary, i.e. the `word_index` dictionary.
3. `.texts_to_sequences`: converts each text to a sequence of integers found in the vocabulary dictionary.
4. `pad_sequences`: pads the sequences from step 3 with zeros (at the front, by default) so they are all the same length.
5. `pd.get_dummies`: converts a binary array into dummy codes, e.g. from [1, 1, 0] --> [(0,1), (0, 1), (1, 0)].
6. `train_test_split`: splits arrays or matrices into random train and test subsets.
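Steps 1, 3 and 4 can be imitated in plain Python to see what the resulting arrays look like. This is an illustrative sketch rather than the Keras code; the helper names and the toy `word_index` are assumptions.

```python
def texts_to_sequences(texts, word_index, num_words=None):
    """Replace words by their index, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            seq = [i for i in seq if i < num_words]
        sequences.append(seq)
    return sequences

def pad_to_longest(sequences, value=0):
    """Left-pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [[value] * (max_len - len(s)) + s for s in sequences]

# Toy vocabulary, ordered by frequency as a fitted tokenizer would produce it.
word_index = {"the": 1, "sat": 2, "cat": 3, "on": 4, "mat": 5, "dog": 6}
sequences = texts_to_sequences(["the cat sat on the mat", "the dog sat"],
                               word_index, num_words=5)
padded = pad_to_longest(sequences)
print(padded)
```

With `num_words=5` only indices 1 to 4 survive, so "mat" and "dog" disappear from the sequences before padding.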

In [7]:
max_words = 1000
tokenizer = Tokenizer(num_words=max_words, split=' ')

tokenizer.fit_on_texts(data['text'].values)

X = tokenizer.texts_to_sequences(data['text'].values)
X = pad_sequences(X)
Y = pd.get_dummies(data['label']).values

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2, random_state = 36)



**creating an embedding layer**
Embeddings can be thought of as an alternative to one-hot encoding combined with dimensionality reduction. They will be discussed in chapter 3.
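Conceptually, an embedding layer is a lookup table with one trainable row per word index. The sketch below uses toy sizes and random initial values; the names `embedding_table` and `embed` are assumptions for this illustration.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4  # toy sizes, far smaller than 1000 x 100

# One row per word index; training would adjust these values.
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(sequence):
    """Look up the dense vector for every word index in the sequence."""
    return [embedding_table[i] for i in sequence]

embedded = embed([0, 0, 1, 2])  # a padded sequence of 4 word indices
print(len(embedded), len(embedded[0]))
```

A sequence of 4 indices becomes a 4 x 4 matrix here; in the same way, a padded sequence of word indices becomes a (sequence length) x (embedding size) matrix in the model.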

* The model below contains 3 convolutional layers, each of which specifies the dimensionality of its output (number of filters) and the size of its filter (kernel size).
* The flatten layer reshapes the 47x16 output of the final convolutional layer into a 752-dimensional array, which is fed to the dropout layer.
* The dropout layer randomly sets a specified fraction of its input units to 0 during training, which helps prevent overfitting.
* The Dense layer has one output unit per class, matching the two-column dummy coding of the labels.
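The layer sizes can be checked with a little arithmetic. The sketch below assumes the sequence length of 47 reported in the model summary for this data, and uses a hypothetical helper `conv1d_params` to count weights and biases.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights for every (kernel position, input channel, filter) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

seq_len, vocab, emb_dim = 47, 1000, 100

embedding_params = vocab * emb_dim
conv_params = [conv1d_params(3, 100, 64),
               conv1d_params(3, 64, 32),
               conv1d_params(3, 32, 16)]
flattened = seq_len * 16  # "same" padding keeps all 47 positions

print(embedding_params, conv_params, flattened)
```

These numbers agree with the `Param #` column and the flattened output shape in the summary printed by `model.summary()`.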

<img src="images/Screen Shot 2022-04-08 at 11.45.26 AM.png ">


In [8]:
embedding_vector_length = 100

model = Sequential()
model.add(Embedding(max_words, embedding_vector_length, input_length=X.shape[1]))
model.add(Convolution1D(64, 3, padding="same"))
model.add(Convolution1D(32, 3, padding="same"))
model.add(Convolution1D(16, 3, padding="same"))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(2,activation='sigmoid'))
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 47, 100)           100000    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 47, 64)            19264     
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 47, 32)            6176      
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 47, 16)            1552      
_________________________________________________________________
flatten_1 (Flatten)          (None, 752)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 752)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 2)                

In [10]:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=6, batch_size=64)

# Evaluation on the test set
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Accuracy: 58.75%


### 2.1.2b: Temporal Filtering: Recurrent Neural Networks

<img src="images/rnn.png ">

Backpropagation can fix weights and biases, but earlier memory states can be improved by unrolling a cell over multiple time steps.

<img src="images/rnn_unrolled.png">
We consider the updates across the whole unrolled cell as one training step. If the final output (Y5) does not match the training label, the cell's states can be changed retroactively. Note that this is all the same cell (with the same weights and biases).
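A minimal sketch of unrolling, assuming a single-unit cell with a tanh activation (a common textbook form, not necessarily the exact cell in the figure). The weight values are hypothetical.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One update of a single-unit recurrent cell: h_t = tanh(w_x*x_t + w_h*h_{t-1} + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def unroll(inputs, w_x, w_h, b, h0=0.0):
    """Apply the same cell, with the same weights, once per time step."""
    states, h = [], h0
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# A single pulse at the first time step, followed by silence.
states = unroll([1.0, 0.0, 0.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
print(states)
```

The first input keeps echoing through the later hidden states, but more weakly at each step, which hints at why plain RNNs struggle with long sequences.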

* The code below begins with a list of words (just one) as data.
* The set of unique characters in the data (6 of them) comprises the alphabet, which the label encoder 'fits' such that the unique letters are arranged alphabetically as labels.

* Next, the label encoder uses the `.fit_transform` method on the alphabet to assign an integer to each label; this behaves like `numpy.unique(ar=alphabet, return_inverse=True)`.
* These integers are then dummy coded to make one-hot vectors.
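The letter -> integer -> one-hot pipeline described above can be reproduced without scikit-learn. This is a plain-Python sketch; `char_to_int` and `one_hot` are names invented for the example.

```python
data = ["character"]

# Unique letters, sorted alphabetically: these play the role of the encoder's labels.
alphabet = sorted(set(c for word in data for c in word))
char_to_int = {c: i for i, c in enumerate(alphabet)}

def one_hot(char):
    """Return a vector with a single 1.0 at the letter's integer index."""
    vec = [0.0] * len(alphabet)
    vec[char_to_int[char]] = 1.0
    return vec

encoded = [one_hot(c) for c in data[0]]
print(alphabet)
print(encoded[0])  # the vector for "c"
```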

_Training Data_
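Training data consists of X and Y, where X is the first 8 (out of a total of 9) characters of the word in the data, and Y is the last 8, so each character in X is paired with the character that follows it.
e.g. $$ X_1 = \text{"c"},\ Y_1 = \text{"h"}, \\
X_2 = \text{"h"},\ Y_2 = \text{"a"}, \dots $$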
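The pairing of each character with its successor can be written compactly, shown here on raw characters before any one-hot encoding:

```python
word = "character"

# Pair every character (except the last) with the character that follows it.
pairs = list(zip(word[:-1], word[1:]))
X_chars = [x for x, _ in pairs]
Y_chars = [y for _, y in pairs]

print(X_chars)
print(Y_chars)
```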

_Test Data_
Same as Training

After defining sample size and sample length, the code then bootstraps the training data by simply repeating it 256 times (the sample size).
The training X and Y go from a list of 8 arrays with 6 elements each to a 256 x 8 x 6 tensor (sample size x length of training data x number of unique characters in the alphabet), in other words 256 identical 8x6 matrices.

Test data are also reshaped from a list of 8 arrays with 6-elements each, to a 1x8x6 tensor.
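To see the resulting shapes without NumPy, the sketch below repeats one 8 x 6 sample with nested lists. The `shape` helper is hypothetical and only inspects the first element at each nesting level.

```python
def shape(nested):
    """Report the size of each nesting level, assuming a regular (non-ragged) list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

sample = [[0.0] * 6 for _ in range(8)]   # stand-in for 8 one-hot vectors of length 6
sample_size = 256

X_train = [[row[:] for row in sample] for _ in range(sample_size)]  # 256 identical copies
X_test = [sample]                                                   # a single sequence

print(shape(X_train), shape(X_test))
```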


In [1]:
from keras.models import Sequential
from keras.layers import SimpleRNN, TimeDistributed, Dense
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np


# Basically onehot vectors of each letter in the data. Because there are 6 unique letters, vectors will be length 6
data = ['character']
alphabet = np.array(list(set([c for w in data for c in w]))) #an array of the unique letters in the data
enc = LabelEncoder()
enc.fit(alphabet) # once the encoder is fit, it has a classes_ attribute containing unique labels
int_enc = enc.fit_transform(alphabet) #assigns an integer to each label in the encoder
onehot_encoder = OneHotEncoder(sparse=False)
int_enc = int_enc.reshape(len(int_enc), 1) # from [int, int, int] to -> [[int], [int], [int]]
onehot_encoded = onehot_encoder.fit_transform(int_enc) # transforms the array of integer encodings into dummy codes.
# every unique letter in the data comprises the alphabet (which will become the labels).
# The label encoder assigns an integer to each letter in the alphabet (int_enc)
# the one-hot encoder transforms integers to one-hot vectors.
# letter -> integer -> one-hot vector


# Since the data consists of one word,
# X_train holds the first n-1 letters of the word (as one-hot vectors).
# For each letter in X_train, Y_train holds the letter that follows it.
#X_train = [[c],[h],[a],[r],[a],[c],[t],[e]]
#Y_train = [[h],[a],[r],[a],[c],[t],[e],[r]]
#...the letters above are represented as one-hot vectors.

X_train = []
Y_train = []
for w in data:
    for i in range(len(w)-1):
        X_train.extend(onehot_encoder.transform([enc.transform([w[i]])]))
        Y_train.extend(onehot_encoder.transform([enc.transform([w[i+1]])]))

X_test = []
Y_test = []

test_data = ['character']

for w in test_data:
    for i in range(len(w)-1):
        X_test.extend(onehot_encoder.transform([enc.transform([w[i]])]))
        Y_test.extend(onehot_encoder.transform([enc.transform([w[i+1]])]))

sample_size = 256
sample_len = len(X_train)

# Training X and Y go from a list of 8 six-element arrays to a tensor of shape
# (256 (sample_size), 8 (length of training data), 6 (number of unique characters in alphabet))
X_train = np.array([X_train * sample_size]).reshape(sample_size, sample_len, len(alphabet))
Y_train = np.array([Y_train * sample_size]).reshape(sample_size, sample_len, len(alphabet))


test_len = len(X_test)
X_test = np.array([X_test]).reshape(1, test_len, len(alphabet))
Y_test = np.array([Y_test]).reshape(1, test_len, len(alphabet))



In [2]:
# First column corresponds to the letter "a", there is a 1.0 in rows 2 and 4 because those are the indices of
#[[c],[h],[a],[r],[a],[c],[t],[e]] containing an "a".
X_train[1]

array([[0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 0.]])

Building the RNN

* Starts by initializing an instance of the Sequential model class
* Add a SimpleRNN layer defining input dimensionality to be alphabet length (since input will be each of the 8 rows of 6 elements)
* Add a Densely connected layer that is where weights are applied horizontally at every temporal slice of input.
* Compile the model defining loss function and optimizer, then fit on the training data and predict the labels of the test data.
* The predictions are probability values for each letter in each subsequent time point, i.e.:
    - $p(character_{i+1})={a:.3,r:.5,e:.6,...}$
    - $p(character_{i+2})={a:.8,r:.3,e:.2,...}$
* We take the max of these predictions and use `enc.inverse_transform` for go from integer back to string character
* Finally we print the value of the loss function (crossentropy) and evaluation metric (accuracy) for the model
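The decoding step (take the highest-scoring column, then map the integer back to a letter) can be sketched without Keras or scikit-learn. The `alphabet` list stands in for the fitted label encoder, and the three rows are rounded from the first rows of the predictions printed for this model.

```python
alphabet = ["a", "c", "e", "h", "r", "t"]  # sorted labels, as a fitted encoder would hold them

def decode(prediction_rows):
    """For every time step, pick the column with the highest score and return its letter."""
    decoded = []
    for row in prediction_rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax
        decoded.append(alphabet[best])
    return decoded

preds = [[0.35, 0.31, 0.25, 0.60, 0.27, 0.26],
         [0.87, 0.05, 0.06, 0.04, 0.03, 0.03],
         [0.04, 0.01, 0.02, 0.02, 0.94, 0.01]]
print(decode(preds))
```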


In [3]:
model = Sequential()
# Keras 2 argument names: units is the first positional argument, and epochs replaces nb_epoch
model.add(SimpleRNN(100, input_shape=(None, len(alphabet)), return_sequences=True))  # 100 hidden units
model.add(TimeDistributed(Dense(len(alphabet), activation="sigmoid")))  # one score per letter at every time step
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='adam')
model.fit(X_train, Y_train, epochs=10, batch_size=32)

preds = model.predict(X_test)[0]
for p in preds:
    m=np.argmax(p)
    print(enc.inverse_transform([m])[0])
print(model.evaluate(X_test, Y_test, batch_size=32))

  
  


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
h
a
r
a
c
t
e
r
[0.06813694536685944, 1.0]


In [4]:
# Each row is a subsequent prediction. Each column represents an integer that represents a character in the alphabet
preds

array([[0.35140693, 0.3050174 , 0.24562523, 0.59697104, 0.27344728,
        0.25665528],
       [0.8689045 , 0.05291989, 0.05919242, 0.03941149, 0.03320614,
        0.03217295],
       [0.03738743, 0.01297069, 0.02276149, 0.01652685, 0.9423271 ,
        0.01211572],
       [0.9748498 , 0.01938614, 0.00741681, 0.00337327, 0.01314285,
        0.0118539 ],
       [0.02507344, 0.97005314, 0.01903486, 0.02007443, 0.02824026,
        0.01363108],
       [0.01787403, 0.02216244, 0.00761157, 0.01090309, 0.01122886,
        0.96163845],
       [0.02933225, 0.01139101, 0.9580256 , 0.01354322, 0.01815313,
        0.00988832],
       [0.01280633, 0.00583234, 0.01528642, 0.01604271, 0.97839177,
        0.00858387]], dtype=float32)

While this simple RNN does well on this tiny sequence of letters, simple RNNs fail on longer sequences because they blindly re-use hidden states in their entirety.

**Long Short Term Memory Networks**
An LSTM gates the information passed from the past into the present.
LSTM cells encode contextual information: they make this information available at local positions in the time-distributed case, and globally (e.g. for an entire sequence of words) in the non-time-distributed case.
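To illustrate the gating, here is a sketch of one step of a single-unit LSTM cell using the standard forget/input/output gate equations. It is a scalar toy version, not the Keras layer, and the parameter values are hypothetical, picked so that the forget gate stays open and the input gate stays nearly shut.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell; p holds (w_x, w_h, b) per gate."""
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # the new information itself
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: forget gate biased open, input gate biased shut.
params = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
          "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}

h, c = 0.0, 1.0             # start with something stored in the cell state
for x in [1.0] * 50:        # a long run of identical inputs
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

With these settings the cell state is still close to its starting value after 50 steps, which is the kind of long-range memory a plain recurrent cell has trouble maintaining.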

In [5]:
from keras.models import Sequential
from keras.layers import  LSTM, TimeDistributed, Dense
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

np.random.seed(1234)

data = ['xyzaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaxyz',
       'pqraaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaapqr']

test_data = ['xyzaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaxyz',
            'pqraaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaapqr']

enc = LabelEncoder()
alphabet = np.array(list(set([c for w in data for c in w])))
enc.fit(alphabet)
int_enc = enc.fit_transform(alphabet)
onehot_encoder = OneHotEncoder(sparse=False)
int_enc = int_enc.reshape(len(int_enc), 1)
onehot_encoded = onehot_encoder.fit_transform(int_enc) # dummy coding


X_train=[]
y_train=[]

for w in data:
    for i in range(len(w)-1):
        X_train.extend(onehot_encoder.transform([enc.transform([w[i]])]))
        y_train.extend(onehot_encoder.transform([enc.transform([w[i+1]])]))

X_test=[]
y_test=[]

for w in test_data:
    for i in range(len(w)-1):
        X_test.extend(onehot_encoder.transform([enc.transform([w[i]])]))
        print(i,w[i],onehot_encoder.transform([enc.transform([w[i]])]))
        y_test.extend(onehot_encoder.transform([enc.transform([w[i+1]])]))

sample_size=512
sample_len=len(X_train)

X_train = np.array([X_train*sample_size]).reshape(sample_size,sample_len,len(alphabet))
y_train = np.array([y_train*sample_size]).reshape(sample_size,sample_len,len(alphabet))

test_len=len(X_test)
X_test= np.array([X_test]).reshape(1,test_len,len(alphabet))
y_test= np.array([y_test]).reshape(1,test_len,len(alphabet))

model=Sequential()
model.add(LSTM(100, input_shape=(None, len(alphabet)), return_sequences=True))  # 100 hidden units
model.add(TimeDistributed(Dense(len(alphabet), activation="sigmoid")))
model.compile(loss="binary_crossentropy",metrics=["accuracy"], optimizer = "adam")

n=1

while True:
    # evaluate returns [loss, accuracy]; stop once test accuracy reaches 100%
    score = model.evaluate(X_test, y_test, batch_size=32)
    print("[Iteration %d] score=%f" % (n, score[1]))
    if score[1] == 1.0:
        break
    n += 1
    model.fit(X_train, y_train, epochs=1, batch_size=32)





0 x [[0. 0. 0. 0. 1. 0. 0.]]
1 y [[0. 0. 0. 0. 0. 1. 0.]]
2 z [[0. 0. 0. 0. 0. 0. 1.]]
3 a [[1. 0. 0. 0. 0. 0. 0.]]
4 a [[1. 0. 0. 0. 0. 0. 0.]]
5 a [[1. 0. 0. 0. 0. 0. 0.]]
6 a [[1. 0. 0. 0. 0. 0. 0.]]
7 a [[1. 0. 0. 0. 0. 0. 0.]]
8 a [[1. 0. 0. 0. 0. 0. 0.]]
9 a [[1. 0. 0. 0. 0. 0. 0.]]
10 a [[1. 0. 0. 0. 0. 0. 0.]]
11 a [[1. 0. 0. 0. 0. 0. 0.]]
12 a [[1. 0. 0. 0. 0. 0. 0.]]
13 a [[1. 0. 0. 0. 0. 0. 0.]]
14 a [[1. 0. 0. 0. 0. 0. 0.]]
15 a [[1. 0. 0. 0. 0. 0. 0.]]
16 a [[1. 0. 0. 0. 0. 0. 0.]]
17 a [[1. 0. 0. 0. 0. 0. 0.]]
18 a [[1. 0. 0. 0. 0. 0. 0.]]
19 a [[1. 0. 0. 0. 0. 0. 0.]]
20 a [[1. 0. 0. 0. 0. 0. 0.]]
21 a [[1. 0. 0. 0. 0. 0. 0.]]
22 a [[1. 0. 0. 0. 0. 0. 0.]]
23 a [[1. 0. 0. 0. 0. 0. 0.]]
24 a [[1. 0. 0. 0. 0. 0. 0.]]
25 a [[1. 0. 0. 0. 0. 0. 0.]]
26 a [[1. 0. 0. 0. 0. 0. 0.]]
27 a [[1. 0. 0. 0. 0. 0. 0.]]
28 a [[1. 0. 0. 0. 0. 0. 0.]]
29 a [[1. 0. 0. 0. 0. 0. 0.]]
30 a [[1. 0. 0. 0. 0. 0. 0.]]
31 a [[1. 0. 0. 0. 0. 0. 0.]]
32 a [[1. 0. 0. 0. 0. 0. 0.]]
33 a [[1. 0. 0. 0. 0



[Iteration 1] score=0.317927




Epoch 1/1
[Iteration 2] score=0.973389
Epoch 1/1




[Iteration 3] score=0.971989
Epoch 1/1
[Iteration 4] score=0.971989
Epoch 1/1
[Iteration 5] score=0.971989
Epoch 1/1
[Iteration 6] score=0.971989
Epoch 1/1
[Iteration 7] score=0.971989
Epoch 1/1
[Iteration 8] score=0.971989
Epoch 1/1
[Iteration 9] score=0.971989
Epoch 1/1
[Iteration 10] score=0.971989
Epoch 1/1
[Iteration 11] score=0.971989
Epoch 1/1
[Iteration 12] score=0.971989
Epoch 1/1
[Iteration 13] score=0.971989
Epoch 1/1
[Iteration 14] score=0.971989
Epoch 1/1
[Iteration 15] score=0.974790
Epoch 1/1
[Iteration 16] score=0.974790
Epoch 1/1
[Iteration 17] score=0.976190
Epoch 1/1
[Iteration 18] score=0.981793
Epoch 1/1
[Iteration 19] score=0.978992
Epoch 1/1
[Iteration 20] score=0.981793
Epoch 1/1
[Iteration 21] score=0.983193
Epoch 1/1
[Iteration 22] score=0.983193
Epoch 1/1
[Iteration 23] score=0.983193
Epoch 1/1
[Iteration 24] score=0.983193
Epoch 1/1
[Iteration 25] score=0.983193
Epoch 1/1
[Iteration 26] score=0.984594
Epoch 1/1
[Iteration 27] score=0.985994
Epoch 1/1
[Iterat

KeyboardInterrupt: 

In [7]:
preds=model.predict(X_test)[0]
for p in preds:
    m=np.argmax(p)
    print(enc.inverse_transform([m]))

print(model.evaluate(X_test,y_test,batch_size=32))

['y']
['z']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['x']
['y']
['z']
['q']
['r']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['a']
['p']
['q']
['r']
[0.006394200958311558, 0.9971988797187805]


## 2.2: Deep Learning and NLP: A New Paradigm