# Deep Learning for NLP
## 1. Overview
* Neural networks are a core example of deep learning
* Deep learning is loosely defined as a network with more than 3 hidden layers, so that the analysis is fine-grained and applies many steps of processing
* The broad process is that you have an input layer (raw input data) which is passed through hidden layers (multiple stages of abstraction) before being output in the output layer as the final values
* Hidden layers apply weights and a custom activation function to the inputs in order to transform them
* Activation functions come in many flavours and there is much to be said for each, but **ReLU** is a commonly used one:
    * ReLU (Rectified Linear Unit) is defined as max(0, x)
    * Any value below 0 is set to 0 and any value of 0 or above keeps its value
    * This zeroes out negative activations while leaving positive values linear, which also keeps the function cheap to compute
* [Setting up TensorFlow](https://docs.anaconda.com/anaconda/user-guide/tasks/tensorflow/)
* [My Colab Tutorials](https://colab.research.google.com/drive/1eiKBIh4dvJEEI7AZ-cFXmphz8LtbJMCP?authuser=1)
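
To make the description of a hidden layer concrete, here is a minimal pure-Python sketch of one layer applying weights, a bias and ReLU to its inputs. The weights, biases and feature values are made-up numbers chosen for illustration, and `dense_forward` is a hypothetical helper, not a Keras function.

```python
def relu(x):
    """Rectified Linear Unit: negative inputs become 0, others pass through."""
    return max(0.0, x)


def dense_forward(inputs, weights, biases):
    """One hidden layer: per node, weighted sum of inputs plus bias, then ReLU."""
    return [
        relu(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


# illustrative values only (not learned from any data)
features = [1.0, 2.0]
weights = [[0.5, -1.0],  # node 1: 0.5*1 - 1.0*2 = -1.5
           [1.0, 1.0]]   # node 2: 1.0*1 + 1.0*2 = 3.0
biases = [0.0, -1.0]

hidden = dense_forward(features, weights, biases)
print(hidden)  # [0.0, 2.0]
```

Node 1 produces a negative weighted sum, so ReLU clips it to 0, while node 2's positive sum passes through unchanged.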

In [5]:
# load libraries
import numpy
from sklearn.datasets import load_iris

# load iris data
iris = load_iris()

# check data (custom 'bunch' data type, subclass of dictionary)
type(iris)

sklearn.utils.Bunch

In [6]:
# load libraries
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# extract features and labels
# note that data is sorted by class (i.e. targets occur in order 0, 1, 2)
# if we left it like this, an unshuffled split could leave whole classes out of the training set
# as such, we shuffle our data in the train_test_split step, which is essential for a good model
X, y = iris.data, iris.target

# perform one-hot encoding of labels
# class 0 -> [1,0,0]
# class 1 -> [0,1,0]
# class 2 -> [0,0,1]
y = to_categorical(y)

# split data into train and test (train_test_split shuffles the data before splitting by default)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# create scaler object (rescales each feature to (x - min) / (max - min))
scaler = MinMaxScaler()

# fit scaler to training features
scaler.fit(X_train)

# scale features to occur between 0 and 1 only
# for ANN models, this helps prevent weights and biases growing too large
# it also prevents over/under interpreting raw values of different feature scales
scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)
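
As a rough illustration of the scaling step, the sketch below re-implements min-max scaling in plain Python on made-up numbers. `fit_min_max` and `transform_min_max` are hypothetical helpers written for this note (they are not scikit-learn functions), and the sketch assumes every column has at least two distinct values so the denominator is never zero.

```python
def fit_min_max(rows):
    """Learn the per-column (min, max) from the training rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def transform_min_max(rows, stats):
    """Apply (x - min) / (max - min) column by column."""
    return [
        [(x - lo) / (hi - lo) for x, (lo, hi) in zip(row, stats)]
        for row in rows
    ]


# illustrative data: 3 training rows and 1 test row, 2 features each
train = [[4.0, 2.0], [6.0, 3.0], [8.0, 4.0]]
test = [[5.0, 5.0]]

stats = fit_min_max(train)                   # fitted on training data only
scaled_train = transform_min_max(train, stats)
scaled_test = transform_min_max(test, stats)

print(scaled_train)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(scaled_test)   # [[0.25, 1.5]]
```

Because the minimum and maximum come from the training data alone, a test value can land outside the 0 to 1 range, as the second test feature does here.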

## 2. Designing Networks
* Keras allows you to design neural networks and utilize the TensorFlow framework to run them
* Dense layers are fully connected layers (every node connects to every node in the previous layer); here they serve as our hidden and output layers, defining the transformations that will occur
* Below, input_dim=4 means we are passing in 4 dimensions of inputs (because our features have 4 columns)
* The activation function used is ReLU (described above)
* The number of nodes is set at 8; this can be any number, but a multiple of the input dimensions is often sensible
* The output layer is given a softmax function, which converts its outputs into class probabilities and is therefore useful for our data, where we are trying to predict one of three output classes (it will return e.g. [0.2, 0.3, 0.5] probabilities for each class value)
* Finally, the loss function, optimizer and metrics are defined when the model is compiled
    * The loss function measures model error, giving the optimizer a value to minimize
        * We've specified categorical cross-entropy as we want to predict categorical outputs
    * Optimizers update the network's weights to reduce the loss (often in an adaptive manner)
        * Adam is a modern, general-purpose optimizer which enhances standard gradient descent by adding momentum and per-parameter adaptive learning rates, which can help it avoid stalling at a poor local minimum
    * Metrics allow you to define which scores you will assess as your model trains and is evaluated
* [Loss Functions in NN](https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/)
* [Optimizers in NN](https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c#:~:text=Adam%20%5B1%5D%20is%20an%20adaptive,for%20training%20deep%20neural%20networks.&text=The%20algorithms%20leverages%20the,learning%20rates%20for%20each%20parameter.)
* [ReLU Explained](https://colab.research.google.com/drive/15AHQG7WTpfuIvUUJ3E8fb-tZUTP8K9Th?authuser=1)
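
The softmax and categorical cross-entropy ideas above can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the Keras internals; the raw scores `[1.0, 2.0, 3.0]` are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_prob):
    """Negative log of the probability assigned to the true (one-hot) class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t > 0)


probs = softmax([1.0, 2.0, 3.0])                    # illustrative raw scores
loss = categorical_crossentropy([0, 0, 1], probs)   # true class is index 2

print([round(p, 3) for p in probs])  # [0.09, 0.245, 0.665]
print(round(loss, 3))                # 0.408
```

The loss shrinks towards 0 as the probability assigned to the true class approaches 1, which is what the optimizer pushes for during training.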

In [8]:
# load libraries
from keras.models import Sequential # allows you to build sequential layers in neural network
from keras.layers import Dense # allows dense/hidden layers to be added

# build ANN
model = Sequential() # create network instance
model.add(Dense(8, input_dim=4, activation='relu')) # create first hidden layer (4 input features)
model.add(Dense(8, activation='relu')) # create second hidden layer (input size is inferred from the previous layer)
model.add(Dense(3, activation='softmax')) # create output layer e.g. [0.2, 0.3, 0.5]
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) # define loss, optimizer and metrics

# check output
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 8)                 40        
_________________________________________________________________
dense_5 (Dense)              (None, 8)                 72        
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 27        
Total params: 139
Trainable params: 139
Non-trainable params: 0
_________________________________________________________________
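
The `Param #` column in the summary can be reproduced by hand: each node in a dense layer has one weight per input plus one bias. A small sketch, where `dense_params` is a hypothetical helper written for this note:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs per unit) plus one bias per unit."""
    return (n_inputs + 1) * n_units


# (inputs, units) for the three layers in the model above
layer_shapes = [(4, 8), (8, 8), (8, 3)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layer_shapes]

print(counts)       # [40, 72, 27]
print(sum(counts))  # 139
```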


In [11]:
# fit model to training data
# epochs = number of full passes through the training data (weights are updated along the way)
# verbose = 2 = 1 line per epoch (i.e. shows accuracy at each iteration)
model.fit(scaled_X_train, y_train, epochs=150, verbose=2)

Epoch 1/150
 - 0s - loss: 0.7328 - accuracy: 0.6900
Epoch 2/150
 - 0s - loss: 0.7145 - accuracy: 0.7400
Epoch 3/150
 - 0s - loss: 0.6851 - accuracy: 0.7500
Epoch 4/150
 - 0s - loss: 0.6600 - accuracy: 0.7300
Epoch 5/150
 - 0s - loss: 0.6384 - accuracy: 0.6900
Epoch 6/150
 - 0s - loss: 0.6218 - accuracy: 0.6700
Epoch 7/150
 - 0s - loss: 0.6072 - accuracy: 0.6700
Epoch 8/150
 - 0s - loss: 0.5944 - accuracy: 0.6700
Epoch 9/150
 - 0s - loss: 0.5836 - accuracy: 0.6700
Epoch 10/150
 - 0s - loss: 0.5739 - accuracy: 0.6700
Epoch 11/150
 - 0s - loss: 0.5646 - accuracy: 0.6700
Epoch 12/150
 - 0s - loss: 0.5556 - accuracy: 0.6700
Epoch 13/150
 - 0s - loss: 0.5478 - accuracy: 0.6700
Epoch 14/150
 - 0s - loss: 0.5402 - accuracy: 0.6700
Epoch 15/150
 - 0s - loss: 0.5321 - accuracy: 0.6700
Epoch 16/150
 - 0s - loss: 0.5250 - accuracy: 0.6900
Epoch 17/150
 - 0s - loss: 0.5181 - accuracy: 0.7100
Epoch 18/150
 - 0s - loss: 0.5109 - accuracy: 0.7300
Epoch 19/150
 - 0s - loss: 0.5046 - accuracy: 0.7500
...

<keras.callbacks.callbacks.History at 0x2de3f537488>

## 3. Evaluating ANNs
* We can use our trained model to make predictions on new data (or our test data)
* In this case, we will be predicting output classes based on predicted probabilities (i.e. max probability of a specific class)
* As always, we can use confusion matrices, classification reports and accuracy scores to evaluate the accuracy of our model

In [13]:
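# make predictions of classes
# note that output is the index position of the highest probability, not a class label
# although in our case, these are one and the same because the iris classes are labelled 0, 1 and 2
# (predict_classes was removed in TensorFlow 2.6; model.predict(...).argmax(axis=1) gives the same result there)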
model.predict_classes(scaled_X_train)

array([1, 2, 1, 0, 2, 1, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 1, 2, 0, 1, 2, 0,
       2, 2, 1, 1, 2, 1, 0, 1, 2, 0, 0, 1, 1, 0, 2, 0, 0, 2, 1, 2, 2, 2,
       2, 1, 0, 0, 2, 2, 0, 0, 0, 2, 2, 0, 2, 2, 0, 1, 1, 2, 1, 2, 0, 2,
       1, 2, 1, 1, 1, 0, 1, 1, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 1, 2, 2,
       1, 2, 1, 1, 2, 2, 0, 1, 2, 0, 1, 2], dtype=int64)
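
Turning probabilities into class predictions is an argmax over each row, and the same operation undoes one-hot encoding. A minimal pure-Python sketch with made-up probabilities follows; `one_hot` and `argmax` are hypothetical helpers for illustration.

```python
def one_hot(label, n_classes):
    """Encode an integer class as a one-hot list, e.g. 1 -> [0.0, 1.0, 0.0]."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# illustrative softmax outputs for three samples
probabilities = [[0.2, 0.3, 0.5],
                 [0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]]

predicted = [argmax(row) for row in probabilities]
print(predicted)               # [2, 0, 1]
print(argmax(one_hot(1, 3)))   # 1
```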

In [22]:
# load libraries
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# make predictions on test data
y_pred = model.predict_classes(scaled_X_test)

# convert y_test from one hot encoded to actual class values
y_test.argmax(axis=1)

# confusion matrix
confusion_matrix(y_test.argmax(axis=1), y_pred)

array([[19,  0,  0],
       [ 0, 14,  1],
       [ 0,  0, 16]], dtype=int64)

In [23]:
# classification report
print(classification_report(y_test.argmax(axis=1), y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.93      0.97        15
           2       0.94      1.00      0.97        16

    accuracy                           0.98        50
   macro avg       0.98      0.98      0.98        50
weighted avg       0.98      0.98      0.98        50



In [24]:
# accuracy score
accuracy_score(y_test.argmax(axis=1), y_pred)

0.98
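
The accuracy, precision and recall figures reported above can all be derived from the confusion matrix. The sketch below recomputes them in plain Python, assuming the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
# confusion matrix copied from the output above (rows = true, columns = predicted)
cm = [[19, 0, 0],
      [0, 14, 1],
      [0, 0, 16]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

# accuracy: correct predictions (the diagonal) over all predictions
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

# recall: correct predictions for a class over all true members of that class (row sum)
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# precision: correct predictions for a class over everything predicted as that class (column sum)
precision = [
    cm[i][i] / sum(cm[r][i] for r in range(n_classes))
    for i in range(n_classes)
]

print(accuracy)                          # 0.98
print([round(r, 2) for r in recall])     # [1.0, 0.93, 1.0]
print([round(p, 2) for p in precision])  # [1.0, 1.0, 0.94]
```

The single misclassified sample (a true class 1 predicted as class 2) is what lowers recall for class 1 and precision for class 2.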

## 4. Saving/Loading Models
* We can save our model along with all of its trained weights, coefficients etc.
* We can then load any models which we have saved for re-use
* Note that if your model took a really long time to train on lots of complex data, be very careful not to overwrite existing models with identical names or you will lose all your valuable work

In [25]:
# save model (all weights etc.)
model.save('nlp_ann_model.h5')

# load model
from keras.models import load_model
new_model = load_model('nlp_ann_model.h5')
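
One possible way to guard against overwriting a saved model is to pick an unused filename before saving. The sketch below uses only the standard library. `next_free_path` is a hypothetical helper, and the demo uses an empty file in a temporary directory as a stand-in for a real saved model.

```python
import tempfile
from pathlib import Path


def next_free_path(path):
    """Return `path` if unused, otherwise append _1, _2, ... before the suffix."""
    path = Path(path)
    if not path.exists():
        return path
    counter = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1


# demo in a temporary directory; touch() stands in for model.save(...)
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nlp_ann_model.h5"
    first = next_free_path(target)    # nothing saved yet, so the name is free
    target.touch()                    # pretend a model was saved here
    second = next_free_path(target)   # original name is taken now

print(first.name, second.name)  # nlp_ann_model.h5 nlp_ann_model_1.h5
```

In the notebook this could be used as `model.save(str(next_free_path('nlp_ann_model.h5')))`.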

## 5. Recurrent Neural Networks
* 
* [Keras Punctuation String to Exclude](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer)

In [32]:
# load libraries
import spacy

# prevent spacy complaining
import logging
logger = logging.getLogger("spacy")
logger.setLevel(logging.ERROR)

# create nlp object (disable components we don't need, only interested in tokenization)
nlp = spacy.load('en_core_web_lg', disable=['parser', 'tagger', 'ner'])

# function to read files
def read_file(filepath):
    # open file
    with open(filepath) as f:
        # read into text object
        str_text = f.read()
        
    # return all text
    return str_text

# set max number of characters above the length of Moby Dick in its entirety (spacy raises an error on longer texts otherwise)
nlp.max_length = 1198623

# extract unique tokens from text (without special chars/punctuation etc.)
def separate_punctuation(doc_text):
    # special characters to exclude
    filters = '\n\n \n\n\n!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    
    # return all tokens as lower case from entire document, excluding special characters
    return [token.text.lower() for token in nlp(doc_text) if token.text not in filters]

# read in first 4 chapters of Moby Dick
d = read_file('NLP Course Files/TextFiles/moby_dick_four_chapters.txt')

# extract tokens
tokens = separate_punctuation(d)

# check # of tokens
len(tokens)

11429

## 6. LSTM and GRU
* 

In [None]:
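# length of each sequence (assumed: 25 input words plus 1 word to predict)
train_len = 25 + 1

# list to hold the token sequences
text_sequences = []

# slide a window of train_len tokens across the text, one token at a time
for i in range(train_len, len(tokens)):
    # tokens from position i - train_len up to (but not including) i
    seq = tokens[i - train_len:i]

    text_sequences.append(seq)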
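
To see what the sliding window produces, here is a toy version on six tokens with a window of 3 + 1. Two things here are assumptions rather than part of the notebook: the last token of each window is treated as the word to predict, and the upper bound is `len(tokens) + 1`, so the final window is included (the loop above stops one window earlier).

```python
def make_windows(tokens, window_len):
    """All consecutive windows of `window_len` tokens, moving one token at a time."""
    return [
        tokens[i - window_len:i]
        for i in range(window_len, len(tokens) + 1)
    ]


# toy token list (illustrative; the real list holds 11,429 tokens)
toy_tokens = ["call", "me", "ishmael", "some", "years", "ago"]

windows = make_windows(toy_tokens, 3 + 1)

# assumed split: first 3 tokens are the input, the last token is the target
pairs = [(w[:-1], w[-1]) for w in windows]

for inputs, target in pairs:
    print(inputs, "->", target)
# ['call', 'me', 'ishmael'] -> some
# ['me', 'ishmael', 'some'] -> years
# ['ishmael', 'some', 'years'] -> ago
```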