# Chapter 1: Deep Learning for Natural Language Processing

## 1.1 A Selection of Machine Learning Methods for NLP

**Goal of classification algorithms:**
- Separate data according to _classes_ (linearly, in the simplest case).

**Classes:**
- Labels indicating a (usually exclusive) category that a data point belongs to.

**Input Space:**
- The vector representations of data presented to machine learning algorithms.

**Feature Space:**
- The result of processing, manipulating and abstracting the input space during learning.
- Can also be constructed externally (e.g. by pre-processing: converting raw data to features).

**Output Space:**
- Class labels that separate data points based on class boundaries.



### 1.1.1 The Perceptron

**Given:** A vector of features describing aspects of something (e.g., words in a document).

**Goal:** To create a function that maps from features to a binary label.
$$F(\text{feature vector}) \rightarrow \{0, 1\}$$

The single-layer perceptron does this from a weighted combination of input values, $x_1, \dots , x_n$, based on a threshold $\theta$ and a bias, $b$.

**Perceptron Decision Function**

The weights, $w_1, w_2, \dots, w_n$, are learned from _annotated_ training data (input vectors labeled with output labels).

**Neuron**:
 - The threshold unit which receives the summed and weighted inputs, $v$. It outputs 1 if $v + b$ exceeds the threshold $\theta$, and 0 otherwise.

$$\text{input space} = [10, 20, 30]$$
$$\text{weights} = [3, 5, 7]$$
$$\text{feature space} = (10 \times 3) + (20 \times 5) + (30 \times 7)$$
$$v = 340$$
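
The worked numbers above can be turned into a minimal Python sketch of the decision step. The function names, the zero bias and the two threshold values are assumptions chosen for illustration; the notes do not fix a particular threshold.

```python
def weighted_sum(inputs, weights):
    """Sum of each input multiplied by its weight."""
    return sum(x * w for x, w in zip(inputs, weights))


def perceptron_decision(inputs, weights, bias=0.0, threshold=0.0):
    """Return 1 if the weighted sum plus bias exceeds the threshold, else 0."""
    return 1 if weighted_sum(inputs, weights) + bias > threshold else 0


inputs = [10, 20, 30]
weights = [3, 5, 7]

v = weighted_sum(inputs, weights)
print(v)  # 340

# Hypothetical thresholds: one below and one above the weighted sum
print(perceptron_decision(inputs, weights, threshold=300.0))  # 1
print(perceptron_decision(inputs, weights, threshold=400.0))  # 0
```

The same inputs flip from label 1 to label 0 purely because the threshold moves, which is why the weights and the bias have to be learned together.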

In [1]:
from sklearn.linear_model import Perceptron
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Make a subselection for two newsgroups of interest
categories = ['alt.atheism', 'sci.med']

# Train a simple perceptron on a vector representation of the documents in these two classes.
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)

perceptron = Perceptron(max_iter=100)

# Make the vector representations: raw word counts per document
cv = CountVectorizer()
X_train_counts = cv.fit_transform(train.data)

# Reweight the counts by tf-idf, giving vectors that can be fed into an ML algorithm
tfidf_tf = TfidfTransformer()
X_train_tfidf = tfidf_tf.fit_transform(X_train_counts)

perceptron.fit(X_train_tfidf, train.target)

test_docs = ['Religion is widespread, even in modern times',
             'His kidney failed and he died',
             'The Pope is a controversial leader of the Catholic church',
             'White blood cells fight off infections',
             'The reverend had a heart attack in church']

X_test_counts = cv.transform(test_docs) # count vectors of test data
X_test_tfidf = tfidf_tf.transform(X_test_counts) # TF_IDF vectors of test data

pred = perceptron.predict(X_test_tfidf)

for doc, category in zip(test_docs, pred):
    print('%r => %s' % (doc, train.target_names[category]))



'Religion is widespread, even in modern times' => alt.atheism
'His kidney failed and he died' => sci.med
'The Pope is a controversial leader of the Catholic church' => sci.med
'White blood cells fight off infections' => sci.med
'The reverend had a heart attack in church' => sci.med


### 1.1.2 Support Vector Machines
A binary classifier that implicitly maps data in feature space to higher dimensions in which the data becomes separable by a linear _hyperplane_. This mapping is carried out by a kernel function.

**Kernel Functions**:
- Transform the original input space to an alternative representation that implicitly has a higher dimensionality.
- The migration from a lower- to a higher-dimensional space takes the form of a similarity function applied to two feature vectors.
- A kernel function takes two vectors, mixes in a constant (a kernel parameter) and adds some kernel-specific ingredients to produce a specific form of a dot product of the two vectors.

e.g. Quadratic Polynomial Kernel - using a constant, $c$, and inputs $x=(x_1, x_2)$ and $y=(y_1, y_2)$:
$$K(x, y) = (c + x^Ty)^2$$
$$K(x, y) = (c + x_1y_1 + x_2y_2)^2$$
$$K(x, y) = c^2 + x_{1}^2y_{1}^2 + x_{2}^2y_{2}^2 + 2cx_{1}y_{1} + 2cx_{2}y_{2} + 2x_1y_1x_2y_2$$

i.e. we go from a 2- to a 6-dimensional space (one dimension per term separated by +). So we're _implicitly_ computing the dot product of two vectors:
$$\langle c, x_{1}^2, x_{2}^2, \sqrt{2c}\,x_1, \sqrt{2c}\,x_2, \sqrt{2}\,x_{1}x_2 \rangle$$
$$\langle c, y_{1}^2, y_{2}^2, \sqrt{2c}\,y_1, \sqrt{2c}\,y_2, \sqrt{2}\,y_{1}y_2 \rangle$$

In the transformed space, the two classes are separated with maximally wide boundaries (margins). The data points determining the position and slope of these boundaries are called **support vectors**.

During training, SVMs learn weights that maximize the margins with the least error.
At test time, a new input is evaluated against the support vectors, and depending on which side of the hyperplane it lands, it receives a positive or negative label.



In [9]:
import numpy as np

c = 3
x = [2, 5]
y = [3, 6]

# Kernel computed directly in the original 2-dimensional space
k = (c + np.dot(x, y))**2

# Explicit 6-dimensional feature map whose dot product equals the kernel
x_ = [c, x[0]**2, x[1]**2, np.sqrt(2*c)*x[0], np.sqrt(2*c)*x[1], np.sqrt(2)*x[0]*x[1]]
y_ = [c, y[0]**2, y[1]**2, np.sqrt(2*c)*y[0], np.sqrt(2*c)*y[1], np.sqrt(2)*y[0]*y[1]]
print(k, np.isclose(k, np.dot(x_, y_)))

1521 True

### 1.1.3 Memory-based Learning

* Rather than replacing training data with support vectors, memory-based methods keep all training data.
* During classification (test), input data is matched with training data by applying similarity measures.


In [None]:
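def delta(x, y):
    # Mismatch indicator: 0 if the two feature values are equal, 1 otherwise
    return 0 if x == y else 1

def IB1(a, b):
    # Overlap distance: the number of positions at which a and b differ
    return sum(delta(a_i, b_i) for a_i, b_i in zip(a, b))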

This metric computes the distance between two feature vectors on the basis of **feature value overlap**: the more positions at which the values differ, the larger the distance.
Most memory-based learning algorithms extend these distance metrics with feature weighting.
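
A minimal sketch of how such a distance could drive classification: keep every training example in memory, then label a new input by a vote among its nearest stored neighbours. The helper names (`overlap_distance`, `knn_classify`), the value of `k` and the toy feature tuples are hypothetical and not taken from any specific memory-based learner.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the positions at which two equal-length feature vectors differ."""
    return sum(1 for a_i, b_i in zip(a, b) if a_i != b_i)


def knn_classify(query, memory, k=1):
    """Label a query by majority vote among its k nearest stored examples."""
    neighbours = sorted(memory, key=lambda item: overlap_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical stored examples: (previous word type, suffix, casing) -> word class
memory = [
    (("det", "ing", "lower"), "noun"),
    (("pron", "ing", "lower"), "verb"),
    (("det", "ed", "lower"), "adj"),
    (("pron", "ed", "lower"), "verb"),
]

query = ("pron", "ing", "upper")
print(knn_classify(query, memory, k=1))
print(knn_classify(query, memory, k=3))
```

Nothing is learned at training time here; all the work happens when a query arrives, which is the defining trade-off of memory-based methods.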


## 1.2 Deep Learning
Which questions can deep learning solve for natural language processing?
* Deep learning can handle a large number of parameters, each of which encodes some aspect of the input data.

Deep learning learns hierarchical representations of data (lower levels feed into higher ones).
- Layers = complex functions processing inputs with weights that encode the importance of information for labeling purposes.
- Output layers = produce a label.
- Hidden layers = layers in between input and output layers, ordered hierarchically from specific (close to the input layer) to abstract (closer to the output layer).
    - It's hard to come up with human-understandable interpretations of hidden layer representations.

Training: estimating the weights throughout the whole neural network.

Backprop (backpropagation) involves the stepwise minimization of error (gradient descent).
- The goal is to tune the network weights until the slope of the error function (predicted minus observed) is close to zero.
    - The chain rule from calculus computes the slope of functions applied to other functions, $f(g(x))$.
  $$\text{activation function}_{\text{output layer}_N}( \text{activation function}_{\text{hidden layer}_{N-1}}(x))$$

  Here the hidden layer $N-1$ sits just before the output layer, and $x$ refers to the outputs of the preceding hidden layer, $N-2$. Because the chain rule multiplies one slope per layer, the error signal can shrink with every layer it passes through, meaning weight adaptations hardly reach the hidden layers closer to the input layer.

**One solution: Restricted Boltzmann Machines** - essentially treating each layer as a complete network of its own, so that every layer receives a training signal.
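
To see why the error signal can fade on its way back to the input, here is a small sketch that multiplies one slope per layer, as the chain rule does for nested functions. It assumes sigmoid activations, ignores the weights entirely, and places every layer at input 0, where the sigmoid slope is at its largest (0.25). It is a best-case illustration under those assumptions, not a simulation of a real network.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_slope(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_slope(pre_activations):
    """Multiply the per-layer slopes, as the chain rule does for f(g(h(...)))."""
    slope = 1.0
    for z in pre_activations:
        slope *= sigmoid_slope(z)
    return slope


# Assumed toy setting: every layer receives the input 0
for depth in (1, 5, 10):
    print(depth, chained_slope([0.0] * depth))
```

Even in this best case the product falls below one millionth after ten layers, which gives some intuition for why a ten-layer sigmoid network can be slow to train.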

**Rectified Linear Unit (ReLU) Activation function**
Determines whether the input to a neuron is above some threshold (zero) in order to "activate" it and propagate information to the next layer.
$$y = \max\left(0, \sum_i{(input_i \times weight_i)} + bias\right)$$
That is, ReLU returns the maximum of the two values: zero or the weighted input.

Derivative of the ReLU function:
$$ReLU'(x) = \begin{cases}
            1, &\text{if}\ x > 0 \\
             0, & \text{else}
             \end{cases}$$
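
The ReLU formula and its derivative translate almost directly into Python. This is a sketch with made-up inputs, weights and bias; the function names are illustrative only.

```python
def relu(x):
    """Return x if it is positive, otherwise 0."""
    return max(0.0, x)


def relu_derivative(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def relu_neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through ReLU."""
    return relu(sum(i * w for i, w in zip(inputs, weights)) + bias)


# Hypothetical values: the first neuron stays silent, the second one fires
print(relu_neuron([1.0, 2.0], [0.5, -1.0], bias=0.25))  # 0.0
print(relu_neuron([1.0, 2.0], [0.5, 1.0], bias=0.25))   # 2.75
print(relu_derivative(-1.25), relu_derivative(2.75))    # 0.0 1.0
```

Because the slope is exactly 1 for active units, multiplying ReLU slopes across layers does not shrink the error signal the way a bounded slope does.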

**Sigmoid Activation function**
$$sigmoid(x) = \frac{1}{1 + e^{-x}}$$



### Multi-layer Perceptron
**Scenario**: You want to train a deep network on a sentiment labeling task. The task consists of labeling texts with sentiment labels: 1 for positive sentiment, and 0 for negative. You are unsure about which activation function you should choose. Can you find out experimentally the best option?

In [2]:
from keras.models import Sequential
from keras.utils import np_utils
from keras.preprocessing.text import  Tokenizer
from keras.layers.core import Dense, Activation
import pandas as pd
import sys
from glob import glob
import os
'''
dfolders = glob(os.path.join('review_polarity/txt_sentoken', '*'))
xx = pd.concat([pd.read_csv(file, sep='\t', header=None) for file in glob(os.path.join(dfolders[0], '*.txt'))])
xx.columns=['text']
xx['label'] = 0

yy = pd.concat([pd.read_csv(file, sep='\t', header=None) for file in glob(os.path.join(dfolders[1], '*.txt'))])
yy.columns = ['text']
yy['label'] = 1
data  = pd.concat([xx,yy])
docs = data['text']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
'''

Using TensorFlow backend.

In [4]:
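# NOTE: [0] selects only the first review file in each folder, and header=0
# uses that file's first line as the column header, so the data set is small.
a = pd.read_csv(glob(os.path.join('review_polarity/txt_sentoken/neg', '*'))[0], sep='\t', header=0)
a.columns = ['text']
a['label'] = 0  # negative sentiment
b = pd.read_csv(glob(os.path.join('review_polarity/txt_sentoken/pos', '*'))[0], sep='\t', header=0)
b.columns = ['text']
b['label'] = 1  # positive sentiment

data = pd.concat([a, b])
docs = data['text']

# Binary bag-of-words matrix: one row per line of text, one column per token index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
X_train = tokenizer.texts_to_matrix(docs, mode='binary')
# One-hot encode the 0/1 labels
Y_train = np_utils.to_categorical(data['label'])

input_dim = X_train.shape[1]
nb_classes = Y_train.shape[1]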

In [41]:


model = Sequential()
model.add(Dense(128, input_dim=input_dim))  # first hidden layer: 128 neurons
model.add(Activation('sigmoid'))
# Nine more hidden layers of 128 sigmoid units each (ten hidden layers in total)
for _ in range(9):
    model.add(Dense(128))
    model.add(Activation('sigmoid'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Training...')
model.fit(x=X_train, y=Y_train, epochs=10, batch_size=32, validation_split=0.1, shuffle=False, verbose=2)

Training...
Train on 95 samples, validate on 11 samples
Epoch 1/10
 - 2s - loss: 1.0353 - accuracy: 0.3368 - val_loss: 0.6842 - val_accuracy: 1.0000
Epoch 2/10
 - 0s - loss: 0.7592 - accuracy: 0.3263 - val_loss: 1.0705 - val_accuracy: 0.0000e+00
Epoch 3/10
 - 0s - loss: 0.6969 - accuracy: 0.6632 - val_loss: 1.3206 - val_accuracy: 0.0000e+00
Epoch 4/10
 - 0s - loss: 0.6842 - accuracy: 0.6632 - val_loss: 1.4261 - val_accuracy: 0.0000e+00
Epoch 5/10
 - 0s - loss: 0.6737 - accuracy: 0.6632 - val_loss: 1.4226 - val_accuracy: 0.0000e+00
Epoch 6/10
 - 0s - loss: 0.6592 - accuracy: 0.6632 - val_loss: 1.3544 - val_accuracy: 0.0000e+00
Epoch 7/10
 - 0s - loss: 0.6448 - accuracy: 0.6632 - val_loss: 1.2581 - val_accuracy: 0.0000e+00
Epoch 8/10
 - 0s - loss: 0.6352 - accuracy: 0.6632 - val_loss: 1.1597 - val_accuracy: 0.0000e+00
Epoch 9/10
 - 0s - loss: 0.6325 - accuracy: 0.6632 - val_loss: 1.0767 - val_accuracy: 0.0000e+00
Epoch 10/10
 - 0s - loss: 0.6352 - accuracy: 0.6632 - val_loss: 1.0168 - va

<keras.callbacks.callbacks.History at 0x7fc2e2ff06d0>

### Results with sigmoid activation function
Note that the data inputs used here are different. Training accuracy does not improve above 0.66. _Now let's try using a ReLU activation function instead._

In [6]:

model = Sequential()
model.add(Dense(128, input_dim=input_dim))  # first hidden layer: 128 neurons
model.add(Activation('relu'))
# Nine more hidden layers of 128 ReLU units each (ten hidden layers in total)
for _ in range(9):
    model.add(Dense(128))
    model.add(Activation('relu'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Training...')
model.fit(x=X_train, y=Y_train, epochs=10, batch_size=32, validation_split=0.1, shuffle=False, verbose=2)

Training...
Train on 95 samples, validate on 11 samples
Epoch 1/10
 - 2s - loss: 0.7023 - accuracy: 0.3263 - val_loss: 0.7226 - val_accuracy: 0.0000e+00
Epoch 2/10
 - 0s - loss: 0.6872 - accuracy: 0.6632 - val_loss: 0.7438 - val_accuracy: 0.0000e+00
Epoch 3/10
 - 0s - loss: 0.6774 - accuracy: 0.6632 - val_loss: 0.7797 - val_accuracy: 0.0000e+00
Epoch 4/10
 - 0s - loss: 0.6605 - accuracy: 0.6632 - val_loss: 0.8516 - val_accuracy: 0.0000e+00
Epoch 5/10
 - 0s - loss: 0.6257 - accuracy: 0.6632 - val_loss: 1.0057 - val_accuracy: 0.0000e+00
Epoch 6/10
 - 0s - loss: 0.5624 - accuracy: 0.6632 - val_loss: 1.3371 - val_accuracy: 0.0000e+00
Epoch 7/10
 - 0s - loss: 0.4742 - accuracy: 0.6632 - val_loss: 1.9082 - val_accuracy: 0.0000e+00
Epoch 8/10
 - 0s - loss: 0.3876 - accuracy: 0.6632 - val_loss: 2.5019 - val_accuracy: 0.0000e+00
Epoch 9/10
 - 0s - loss: 0.3208 - accuracy: 0.6632 - val_loss: 2.8874 - val_accuracy: 0.0000e+00
Epoch 10/10
 - 0s - loss: 0.2819 - accuracy: 0.6632 - val_loss: 3.1514 

<keras.callbacks.callbacks.History at 0x7fdba3f9ba10>

## 1.3 Vector Representations of Language
Vector: a point in some multi-dimensional space.

Much of machine learning comes down to measuring the distances between points in these spaces.


### 1.3.1 Representational vectors
Describe text across a number of human-interpretable feature dimensions.

_Bag-of-words approach_:
Every dimension can be interpreted as representing a clear feature: the presence (binary) of a certain word from an indexed lexicon.

Essentially:
* Create a list of the unique words in a corpus (the lexicon).
* Give each one a unique number identifier.

Then:
* Each sentence in the corpus can be represented as a binary vector of length $N$, where $N$ is the total number of unique words.
* Each position in the vector corresponds to one word in the lexicon and records whether or not that word is present in the sentence.

These values can also be counts (the number of times a word appears in a sentence) rather than binary indicators.
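
The steps above can be sketched in plain Python before handing the job to a library. The function names, the whitespace tokenization, the lower-casing and the toy corpus are all assumptions made for this illustration; ids are assigned in order of first appearance.

```python
def build_lexicon(sentences):
    """Assign each unique lower-cased word an integer id, in order of first appearance."""
    lexicon = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            lexicon.setdefault(word, len(lexicon))
    return lexicon


def bag_of_words(sentence, lexicon, binary=True):
    """Encode a sentence as a presence (binary) or count vector over the lexicon."""
    vector = [0] * len(lexicon)
    for word in sentence.lower().split():
        if word in lexicon:
            idx = lexicon[word]
            vector[idx] = 1 if binary else vector[idx] + 1
    return vector


corpus = ["the cat sat on the mat", "the dog sat"]
lexicon = build_lexicon(corpus)

print(lexicon)
print(bag_of_words(corpus[0], lexicon))                # [1, 1, 1, 1, 1, 0]
print(bag_of_words(corpus[0], lexicon, binary=False))  # [2, 1, 1, 1, 1, 0]
```

The binary and count vectors differ only in the first position, where "the" occurs twice in the first sentence.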



In [10]:
train_dat = pd.DataFrame({
    "text":["Natural language is hard for computers", "Computers are capable of learning natural language"],
    "label": [0, 1]
})

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer="word",
                     tokenizer=None,
                     preprocessor=None,
                     stop_words=None,
                     max_features=1000)

doc_cv = cv.fit_transform(train_dat['text']).toarray()
print(doc_cv)
print(cv.vocabulary_)

[[0 0 1 1 1 1 1 0 1 0]
 [1 1 1 0 0 0 1 1 1 1]]
{'natural': 8, 'language': 6, 'is': 5, 'hard': 4, 'for': 3, 'computers': 2, 'are': 0, 'capable': 1, 'of': 9, 'learning': 7}


### 1.3.2 Operational vectors
Typically not human-interpretable; the vector values are derived by some algorithm.

**tf.idf**:
* Vector values represent word weights computed as $\text{term frequency} \times \text{inverse document frequency}$, which expresses a degree of saliency.
* _term frequency_ - the number of times a word appears in a document.
* _inverse document frequency_ - the inverse of the fraction of documents in a collection that contain the word (usually log-scaled), so words that occur in many documents receive lower weights.
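
As a rough worked example, the weight of a word can be computed by hand on a tiny corpus. This sketch assumes one simple variant, raw term counts multiplied by an unsmoothed log-scaled idf; library implementations often add smoothing or normalization, so their numbers will differ. The function name and the three toy documents are hypothetical.

```python
import math


def tf_idf(term, doc_tokens, corpus_tokens):
    """Raw term count times log(number of documents / documents containing the term)."""
    tf = doc_tokens.count(term)
    df = sum(1 for doc in corpus_tokens if term in doc)
    idf = math.log(len(corpus_tokens) / df) if df else 0.0
    return tf * idf


corpus = [
    "natural language is hard for computers".split(),
    "computers are capable of learning natural language".split(),
    "computers are fast".split(),
]

# "computers" occurs in every document, so it carries no weight
print(tf_idf("computers", corpus[0], corpus))
# "hard" occurs in one document only, so it is the most salient word here
print(tf_idf("hard", corpus[0], corpus))
print(tf_idf("natural", corpus[0], corpus))
```

Words shared by all documents drop to zero while rare words score highest, which is the saliency effect described above.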

**Neural word embeddings**
* "word2vec models": embeddings from neural networks that predict a word given its context, or the context given a word.
* "distributional semantic similarity": words that occur in similar contexts receive vectors that are close in distance.
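
Closeness between embedding vectors is commonly measured with cosine similarity, which is the measure assumed in this sketch. The three-dimensional vectors below are invented purely for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, not the output of any trained model
embeddings = {
    "doctor": [0.9, 0.1, 0.3],
    "nurse": [0.8, 0.2, 0.4],
    "guitar": [0.1, 0.9, 0.2],
}

print(cosine_similarity(embeddings["doctor"], embeddings["nurse"]))
print(cosine_similarity(embeddings["doctor"], embeddings["guitar"]))
```

With these made-up values, "doctor" lands much closer to "nurse" than to "guitar", which is the kind of pattern distributional similarity is meant to capture.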



## 1.4 Vector Sanitization

### 1.4.1 The Hashing Trick
Because large vectors can consume compute resources unnecessarily, we use **feature hashing** to reduce their dimensionality.
* A hashing function maps every feature to an index, and the hashing trick algorithm updates the information at those indices only.

In [None]:

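import zlib
import numpy as np

# Placeholder lexicon mapping integer positions to words (example entries are assumptions)
InverseLexicon = {0: 'natural', 1: 'language', 2: 'computers'}

def hash_function(feature):
    # Assumed hash: CRC32 is deterministic across runs, unlike Python's built-in hash() for strings
    return zlib.crc32(feature.encode('utf-8'))

def feat_hash(featureV, vecSize):
    outputV = np.zeros(vecSize, dtype=int)  # np.array(vecSize) would create a 0-d array
    for f in range(len(featureV)):
        if featureV[f] == 1:
            dim = hash_function(InverseLexicon[f])
            outputV[dim % vecSize] += 1
    return outputV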

### 1.4.2 Vector Normalization
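* Scaling a vector to length one by dividing each element by the vector's norm. For the Euclidean (L2) norm this is the square root of the sum of squared elements; dividing by the sum of absolute values instead gives L1 normalization, where the absolute values sum to one.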
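
A short sketch of both variants, using plain Python lists. The function names are illustrative, and returning an all-zero vector unchanged is an assumption made here to avoid dividing by zero.

```python
import math


def l2_normalize(vector):
    """Divide each element by the Euclidean length of the vector."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


def l1_normalize(vector):
    """Divide each element by the sum of absolute values."""
    total = sum(abs(v) for v in vector)
    return [v / total for v in vector] if total else list(vector)


v = [3.0, 4.0]
print(l2_normalize(v))  # [0.6, 0.8]
print(l1_normalize(v))
```

After L2 normalization the dot product of two vectors equals their cosine similarity, which is one reason this step is often applied before comparing document vectors.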

## Summary