# Chapter 2: Deep Learning and Language: The Basics

## 2.1: Basic Architectures of Deep Learning

### 2.1.1: Deep Multilayer Perceptrons

Artificial neurons are mathematical functions that receive weighted input from their afferents (the upstream units that feed into them).
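As a minimal sketch of that idea, the snippet below computes a single neuron's output as a weighted sum of its inputs. The bias term and the ReLU activation are assumptions added for illustration; the sentence above only specifies the weighted input.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs, followed by a ReLU activation (assumed here)."""
    z = np.dot(weights, inputs) + bias
    return max(0.0, float(z))

x = np.array([1.0, 2.0, 3.0])     # signals arriving from three afferents
w = np.array([0.5, -1.0, 0.25])   # one weight per afferent
output = neuron(x, w, bias=1.0)
print(output)  # 0.5 - 2.0 + 0.75 + 1.0 = 0.25
```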

In [29]:
from keras.models import Sequential
from keras.utils import np_utils
from keras.preprocessing.text import Tokenizer
from keras.layers.core import Dense, Activation, Dropout
from keras.layers import LSTM, Convolution1D, Flatten
from keras.layers.embeddings import Embedding
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import sys



1. `.fit_on_texts`: creates a vocabulary index based on word frequency, mapping each word to its index.
    * e.g. given `"the cat sat on the mat"`, it creates a dictionary such that `word_index['sat']` returns `3`.
    * The most frequent words get the lowest indices, so `word_index["the"]` returns `1`. Index `0` is reserved and never assigned to a word.
    * Often, the first few indices will be stop words because they appear a lot.
2. `.texts_to_matrix`: creates a NumPy matrix with one row per document and one column per vocabulary index (column `0` is the reserved, unused index).
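The two steps can be sketched in plain Python. This is a simplified stand-in for what the Keras `Tokenizer` does, not its actual implementation: the helper names `build_word_index` and `texts_to_binary_matrix` are hypothetical, and punctuation filtering is left out. It follows the Keras convention that indices start at 1 because 0 is reserved.

```python
from collections import Counter
import numpy as np

def build_word_index(texts):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by count; ties keep first-seen order
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_binary_matrix(texts, word_index):
    """One row per document, one column per index (column 0 stays empty)."""
    matrix = np.zeros((len(texts), len(word_index) + 1))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            if word in word_index:
                matrix[row, word_index[word]] = 1.0
    return matrix

docs_demo = ["the cat sat on the mat"]
word_index = build_word_index(docs_demo)
matrix = texts_to_binary_matrix(docs_demo, word_index)
print(word_index)    # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(matrix.shape)  # (1, 6)
```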

In [20]:
data = pd.read_csv('../raaijmakers-master-code/code/pos_neg.txt', sep='\t',  encoding = "ISO-8859-1")
docs = data['text']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
X_train = tokenizer.texts_to_matrix(docs, mode='binary')
Y_train = np_utils.to_categorical(data['label'])  # from [1, 1, 0] --> [(0,1), (0, 1), (1, 0)]

input_dim = X_train.shape[1]
nb_classes = Y_train.shape[1]

In [21]:

tokenizer.word_index['bad']

856

Building the model
1. The Sequential model serves as a container for the stacked layers.
2. Add a dense layer that receives the inputs (of the dimensionality defined above) and returns an output of 128 dimensions.
3. Add an activation layer that takes the outputs of the Dense layer from step 2 and applies a ReLU activation function to them.

In [22]:
model = Sequential()
model.add(Dense(128, input_dim=input_dim))
model.add(Activation('relu'))

Adding four more hidden layers and a softmax output layer:

In [23]:
model.add(Dense(128))
model.add(Activation('relu'))

model.add(Dense(128))
model.add(Activation('relu'))

model.add(Dense(128))
model.add(Activation('relu'))

model.add(Dense(128))
model.add(Activation('relu'))

model.add(Dense(nb_classes))
model.add(Activation('softmax'))
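To make the stacked layers concrete, here is a rough NumPy sketch of the forward pass such a network performs: five Dense + ReLU blocks followed by a Dense + softmax output. The weights are random rather than trained, and the toy `input_dim` of 20 is an assumption chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    return x @ w + b                 # affine transformation of a Dense layer

def relu(z):
    return np.maximum(z, 0.0)        # zero out negative activations

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

input_dim, hidden, nb_classes = 20, 128, 2        # toy sizes (assumed)
sizes = [input_dim] + [hidden] * 5 + [nb_classes]
params = [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.integers(0, 2, size=(4, input_dim)).astype(float)  # 4 binary bag-of-words rows
h = x
for w, b in params[:-1]:             # the five hidden Dense + ReLU blocks
    h = relu(dense(h, w, b))
w_out, b_out = params[-1]
probs = softmax(dense(h, w_out, b_out))  # one probability per class
print(probs.shape)  # (4, 2)
```

Each row of `probs` sums to 1, which is what lets the softmax output be read as class probabilities.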

Once you are finished adding layers, `compile` the model by specifying:
 * the loss (error) function
 * a numerical optimizer algorithm to carry out the gradient descent process
 * a metric to evaluate performance.

Next, fit the model, specifying:
 * `validation_split`: the proportion of the training data held out and used for validation. Keras takes it from the end of the arrays, before any shuffling.
 * `epochs`: the number of training epochs
 * `batch_size`: the number of samples per gradient update
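The sketch below imitates how a validation split carves the held-out samples off the end of the arrays. The function name `split_for_validation` and the sorted toy labels are hypothetical. It illustrates one consequence worth knowing: if the rows happen to be sorted by label, the validation set can end up containing a single class.

```python
import numpy as np

def split_for_validation(x, y, validation_split):
    """Hold out the last fraction of the samples, without shuffling first."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = np.arange(200).reshape(200, 1)
y = np.array([0] * 100 + [1] * 100)   # hypothetical labels, sorted by class

(x_tr, y_tr), (x_val, y_val) = split_for_validation(x, y, 0.1)
print(len(x_tr), len(x_val))  # 180 20
print(y_val.mean())           # 1.0: every validation sample is class 1
```

Shuffling the rows before calling `fit` is one way to avoid such a skewed validation set.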

In [24]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, Y_train, epochs=10, batch_size=32, validation_split=0.1, shuffle=True,verbose=2)



Train on 180 samples, validate on 20 samples
Epoch 1/10
 - 2s - loss: 0.6927 - accuracy: 0.5500 - val_loss: 0.7872 - val_accuracy: 0.0000e+00
Epoch 2/10
 - 0s - loss: 0.6473 - accuracy: 0.5556 - val_loss: 0.8588 - val_accuracy: 0.0000e+00
Epoch 3/10
 - 0s - loss: 0.5316 - accuracy: 0.6667 - val_loss: 1.0495 - val_accuracy: 0.0500
Epoch 4/10
 - 0s - loss: 0.2952 - accuracy: 0.9556 - val_loss: 1.1513 - val_accuracy: 0.2000
Epoch 5/10
 - 0s - loss: 0.0693 - accuracy: 1.0000 - val_loss: 0.4489 - val_accuracy: 0.7500
Epoch 6/10
 - 0s - loss: 0.0040 - accuracy: 1.0000 - val_loss: 0.0989 - val_accuracy: 0.9500
Epoch 7/10
 - 0s - loss: 1.8451e-04 - accuracy: 1.0000 - val_loss: 0.0394 - val_accuracy: 1.0000
Epoch 8/10
 - 0s - loss: 1.4050e-05 - accuracy: 1.0000 - val_loss: 0.0212 - val_accuracy: 1.0000
Epoch 9/10
 - 0s - loss: 2.3749e-06 - accuracy: 1.0000 - val_loss: 0.0141 - val_accuracy: 1.0000
Epoch 10/10
 - 0s - loss: 8.3746e-07 - accuracy: 1.0000 - val_loss: 0.0109 - val_accuracy: 1.0000


<keras.callbacks.callbacks.History at 0x7fa88e1ace50>

### 2.1.2: Spatial and Temporal Operators

**Spatial Filtering: Convolutional Neural Networks**

The goal of spatial filtering is to deal with the _structure_ of the input data by filtering out irrelevant data and letting only valuable information propagate. CNNs do this with convolution.

### 2.1.2a: Spatial Filters: Convolutional Neural Networks

A CNN applies a set of weights to the input data, essentially as a sliding dot product. Convolutions are easiest to imagine in image processing, where _convolving_ an image with a Gaussian filter results in a "smoothed" or blurred image.

<img src="images/Screen Shot 2022-04-08 at 10.58.46 AM.png" />

Instead of using a hand-defined filter such as a Gaussian, CNNs initialize their filter weights randomly and, through training, learn the weights that improve accuracy.

Filters are applied as a sliding window. Without padding, the dimensionality of the output is reduced to (number of horizontal window positions) x (number of vertical window positions).
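A minimal NumPy sketch of the sliding dot product, assuming a stride of 1 and no padding. The helper name `slide_filter` is hypothetical, and a 3x3 averaging kernel stands in for a Gaussian blur to keep the numbers easy to check.

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide the kernel over the image, taking a dot product at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out_h, out_w = ih - kh + 1, iw - kw + 1   # number of window positions
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                # simple averaging (smoothing) filter
smoothed = slide_filter(image, kernel)
print(smoothed.shape)  # (3, 3): a 5x5 input shrinks to 3x3
```

In a CNN the kernel values would be learned during training rather than fixed in advance.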

**Max pooling**:
A filter that returns the largest value within each window of the input it is applied to.
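A small sketch of max pooling over a 1D input, assuming non-overlapping windows (stride equal to the window size). The helper name `max_pool_1d` is hypothetical.

```python
import numpy as np

def max_pool_1d(values, window):
    """Return the largest value in each non-overlapping window."""
    n = len(values) // window                 # drop any incomplete trailing window
    return values[:n * window].reshape(n, window).max(axis=1)

x = np.array([1, 5, 2, 8, 3, 3, 9, 0])
pooled = max_pool_1d(x, 2)
print(pooled)  # [5 8 3 9]
```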


**CNNs for text**


1. Define the maximum number of words to keep, based on word frequency. Only the most common `num_words - 1` words will be kept.
2. Create the vocabulary, i.e. the `word_index` dictionary.
3. `.texts_to_sequences`: converts each text to a sequence of integers looked up in the vocabulary dictionary.
4. `pad_sequences`: pads the sequences from step 3 with zeros (at the front, by default) so they are all the same length.
5. `pd.get_dummies`: converts the label column into dummy (one-hot) codes, e.g. from [1, 1, 0] --> [(0,1), (0, 1), (1, 0)].
6. `train_test_split`: splits arrays or matrices into random train and test subsets.
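The preprocessing steps above can be imitated without Keras. This is a simplified sketch: the vocabulary, the helper names `to_sequence` and `pad`, and the toy labels are all hypothetical. It shows how words beyond the `num_words` cutoff are dropped, how shorter sequences are padded with leading zeros, and how labels become one-hot rows.

```python
import numpy as np

# Hypothetical vocabulary ordered by frequency (index 0 is reserved for padding).
word_index = {"the": 1, "sat": 2, "cat": 3, "dog": 4, "on": 5, "mat": 6}
num_words = 5  # keep indices 1..4, i.e. the num_words - 1 most common words

def to_sequence(text):
    """Replace each known, frequent-enough word by its integer index."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

def pad(sequences):
    """Left-pad every sequence with zeros up to the longest length."""
    longest = max(len(seq) for seq in sequences)
    return np.array([[0] * (longest - len(seq)) + seq for seq in sequences])

texts = ["the cat sat", "the dog sat on the mat"]
sequences = [to_sequence(t) for t in texts]
X_demo = pad(sequences)

labels = np.array([1, 0])
Y_demo = np.eye(2, dtype=int)[labels]   # one-hot rows, like the dummy codes

print(sequences)  # [[1, 3, 2], [1, 4, 2, 1]]
print(X_demo)     # [[0 1 3 2], [1 4 2 1]] (shown flattened; "on" and "mat" were dropped)
print(Y_demo)     # [[0 1], [1 0]]
```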

In [26]:
max_words = 1000
tokenizer = Tokenizer(num_words=max_words, split=' ')

tokenizer.fit_on_texts(data['text'].values)

X = tokenizer.texts_to_sequences(data['text'].values)
X = pad_sequences(X)
Y = pd.get_dummies(data['label']).values

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2, random_state = 36)



**Creating an embedding layer**

Embeddings can be thought of as an alternative to one-hot encoding that also reduces dimensionality. They will be discussed in Chapter 3.

* The model below contains 3 convolutional layers, each of which specifies the dimensionality of its output (the number of filters) and the size of its filter (kernel size).
* The flatten layer reshapes the 47x16 output of the final convolutional layer into a 752-dimensional array, which is fed to the dropout layer.
* The dropout layer randomly sets a specified fraction of its input units to 0 during training, which helps prevent overfitting.
* The final Dense layer has two output units, one per class, matching the one-hot encoded labels.
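The flatten and dropout steps can be sketched in NumPy. The 47x16 shape is the one reported for the last convolutional layer by `model.summary()` in this notebook, and the values here are random placeholders. The rescaling of surviving units by `1 / (1 - rate)` follows the usual "inverted dropout" convention, which is also what Keras applies during training.

```python
import numpy as np

rng = np.random.default_rng(0)

conv_out = rng.normal(size=(47, 16))   # placeholder for the last Conv1D output
flat = conv_out.reshape(-1)            # flatten: 47 * 16 = 752 values
print(flat.shape)  # (752,)

def dropout(x, rate, rng):
    """Zero a random fraction of units and rescale the rest (training only)."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

dropped = dropout(flat, 0.2, rng)
print(np.mean(dropped == 0.0))         # roughly 0.2 of the units are zeroed
```

At inference time dropout is switched off and the input passes through unchanged.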

<img src="images/Screen Shot 2022-04-08 at 11.45.26 AM.png" />


In [30]:
embedding_vector_length = 100

model = Sequential()

model.add(Embedding(max_words, embedding_vector_length, input_length=X.shape[1]))
model.add(Convolution1D(64, 3, padding="same"))
model.add(Convolution1D(32, 3, padding="same"))
model.add(Convolution1D(16, 3, padding="same"))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(2,activation='sigmoid'))
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 47, 100)           100000    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 47, 64)            19264     
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 47, 32)            6176      
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 47, 16)            1552      
_________________________________________________________________
flatten_1 (Flatten)          (None, 752)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 752)               0         
_________________________________________________________________
dense_13 (Dense)             (None, 2)                
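The parameter counts reported for the embedding and convolutional layers can be checked by hand. The helper functions below are hypothetical and only encode the standard formulas: an embedding stores one vector per vocabulary entry, and a 1D convolution has `kernel_size * in_channels` weights plus one bias per filter.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def conv1d_params(kernel_size, in_channels, filters):
    """Weights for every filter plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

print(embedding_params(1000, 100))  # 100000
print(conv1d_params(3, 100, 64))    # 19264
print(conv1d_params(3, 64, 32))     # 6176
print(conv1d_params(3, 32, 16))     # 1552
```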

In [33]:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=3, batch_size=64)

# Evaluation on the test set
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))



Epoch 1/3
Epoch 2/3
Epoch 3/3
Accuracy: 45.00%


## 2.2: Deep Learning and NLP: A New Paradigm