# Neural Networks II

In [None]:
# !pip install keras_nlp
# !pip install tensorflow_datasets
# !pip install transformers

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

from transformers import pipeline


## Neural Networks for Text Data

Neural networks are extremely flexible, which allows you to use them for many kinds of data. We've already seen this with images, which are stored in a 2-dimensional format. They can also be used with text data for supervised learning tasks such as sentiment analysis.

Let's take a look at an example. We'll bring in a dataset of IMDB reviews from the `tensorflow_datasets` package. This is a set of movie reviews, each labeled as either positive or negative. This data can be used to train a neural network model, which can then be applied to new movie reviews to determine their sentiment.

In [None]:
train_data, test_data = tfds.load(name="imdb_reviews", split=["train", "test"], 
                                  batch_size=-1, as_supervised=True)

train_examples, train_labels = tfds.as_numpy(train_data)
test_examples, test_labels = tfds.as_numpy(test_data)

Note that each split is unpacked into the review text and its labels. For our purposes, `train_examples` is the "X" data with which we will train the model, and `train_labels` is the "y" data, which indicates whether each review was positive or negative.

Let's take a look at how many observations there are.

In [None]:
print(f"Training entries: {len(train_examples)}, test entries: {len(test_examples)}")

If we take a look at the data, we can see that it is the raw text of each review.

In [None]:
train_examples[:2]

The labels are 0 or 1, with 0 representing negative and 1 representing positive.

In [None]:
train_labels[:10]

## Pre-Built Models

Neural networks can take a long time to build and train. This is especially true for complicated models with complicated data, such as text or image data. Luckily for us, people have taken the time to train models and model layers that we can reuse. We'll take one that has already been built to take raw text data and convert it into text embedding vectors. You can think of this layer as doing something similar to the text processing that we've done before, such as tokenization. It has the added benefit that the embeddings were learned from a large body of text, so they carry information about how words are used rather than only which words appear.

In [None]:
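# URL of a pre-trained 50-dimensional text embedding on TensorFlow Hub
embedding_url = "https://tfhub.dev/google/nnlm-en-dim50/2"
# trainable=True lets the embedding weights be fine-tuned along with the rest of the model
hub_layer = hub.KerasLayer(embedding_url, input_shape=[], dtype=tf.string, trainable=True)
# Embed the first three reviews to check the output
hub_layer(train_examples[:3])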
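
To build intuition for what an embedding layer produces, here is a toy sketch. It is not how the `nnlm-en-dim50` model is implemented: the vocabulary, the 4-dimensional vectors, and the `embed_sentence` helper are all made up for illustration, and the real layer uses learned 50-dimensional vectors. The idea it shows is that each piece of text, whatever its length, is mapped to one fixed-size numeric vector.

```python
import numpy as np

# Hypothetical toy vocabulary with made-up 4-dimensional word vectors.
toy_vectors = {
    "great": np.array([0.9, 0.1, 0.0, 0.2]),
    "movie": np.array([0.1, 0.8, 0.1, 0.0]),
    "boring": np.array([-0.8, 0.1, 0.3, 0.1]),
}

def embed_sentence(text, vectors, dim=4):
    """Average the vectors of known words; unknown words are skipped."""
    tokens = text.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

short_vec = embed_sentence("Great movie", toy_vectors)
long_vec = embed_sentence("a boring boring movie", toy_vectors)

# Both texts end up as vectors of the same length.
print(short_vec.shape, long_vec.shape)
```

Because every text becomes a vector of the same size, the layers that follow can be ordinary Dense layers.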

Now, let's build the full model. We'll use the layer that we downloaded as the first layer, then add a simple Dense layer for now. Note that here, we create an empty model object and then add layers one at a time. This is equivalent to creating the model all at once from a list of layers.

In [None]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

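After that, we compile our model. The final `Dense(1)` layer has no activation, so it outputs raw scores (logits) rather than probabilities. That is why the loss uses `from_logits=True` and the accuracy metric uses a threshold of 0, which corresponds to a probability of 0.5.

In [None]:
model.compile(optimizer='adam',
              loss=tf.losses.BinaryCrossentropy(from_logits=True),
              # The model outputs logits, so threshold at 0.0 rather than 0.5
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.0, name='accuracy')])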
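
The NumPy sketch below illustrates what `from_logits=True` implies: each raw score (logit) is squashed into a probability with a sigmoid, and binary cross-entropy is then computed on those probabilities. The helper names `sigmoid` and `bce_from_logits` and the sample numbers are made up for illustration, and Keras uses its own numerically stable implementation rather than this direct formula.

```python
import numpy as np

def sigmoid(z):
    """Map a logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logits(y_true, logits):
    """Mean binary cross-entropy, computed from raw logits."""
    p = sigmoid(logits)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0, 1.0])   # made-up labels
logits = np.array([2.0, -1.0, 0.0])  # made-up model outputs

print(sigmoid(logits))
print(bce_from_logits(y_true, logits))
```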

## Train-Validation Split

One possible pitfall when training machine learning models is something called **overfitting**. Overfitting happens when you train a model that is too specific to the data that you are training with, and it ends up not generalizing to new data. 

To avoid issues with overfitting, we can use a **validation set** to see when our model stops improving and starts overfitting. To do this, we simply split our training data again, designating one part as the main training data and the rest as the validation data. Then, we use the validation data to calculate our accuracy as we go, so that we can see when the model stops improving on new data.

In [None]:
x_val = train_examples[:10000]
partial_x_train = train_examples[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
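
The slices above take the first 10,000 rows as the validation set, which works when the rows are already in random order. If you are not sure about the ordering, one alternative is to shuffle the indices first. This is a sketch using made-up stand-in arrays (`examples`, `labels`) and a hypothetical helper `shuffled_split`; it is not part of the workflow used in the rest of this notebook.

```python
import numpy as np

def shuffled_split(x, y, n_val, seed=0):
    """Return (x_train, y_train, x_val, y_val) after a random shuffle."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

# Made-up stand-in data: 10 "reviews" and their labels.
examples = np.array([f"review {i}" for i in range(10)])
labels = np.array([0, 1] * 5)

x_tr, y_tr, x_va, y_va = shuffled_split(examples, labels, n_val=3)
print(len(x_tr), len(x_va))
```

Fixing the `seed` keeps the split reproducible, so repeated runs compare models on the same validation rows.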

Finally, we fit the model using `model.fit` as before. We also give it the validation data so that we can see the performance on the validation set after each epoch.

In [None]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()
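
One way to read these curves programmatically is to find the epoch with the lowest validation loss; training beyond that point mostly fits the training data more closely. The sketch below uses a made-up `example_history` dictionary shaped like `history.history`, and `best_epoch` is a hypothetical helper name.

```python
def best_epoch(history_dict, key="val_loss"):
    """Return the 1-based epoch with the smallest value for `key`."""
    values = history_dict[key]
    return min(range(len(values)), key=values.__getitem__) + 1

# Made-up numbers for illustration only: training loss keeps falling,
# while validation loss bottoms out and then rises again.
example_history = {
    "loss": [0.60, 0.45, 0.33, 0.25, 0.19],
    "val_loss": [0.58, 0.47, 0.41, 0.43, 0.48],
}

print(best_epoch(example_history))
```

Keras also provides a `tf.keras.callbacks.EarlyStopping` callback that can stop training automatically based on a monitored metric such as `val_loss`.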

## Changes to the Model

We can make changes to the model to add more layers and use a different number of epochs. This is part of the overall process for finding the model that has the best performance in terms of accuracy. In reality, we would do these steps many, many times, tuning our model so that it is as good as possible. 

In [None]:
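model = tf.keras.Sequential()
# Note: hub_layer was already fine-tuned while training the first model,
# so this model does not start from the original pre-trained weights.
model.add(hub_layer)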
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.compile(optimizer='adam',
              loss=tf.losses.BinaryCrossentropy(from_logits=True),
              # The model outputs logits, so threshold at 0.0 rather than 0.5
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.0, name='accuracy')])

In [None]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=5,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

In [None]:
history_dict = history.history
history_dict.keys()

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)


In [None]:
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

## Evaluation on Test Data

We can now evaluate the model on the test data to estimate how well it would perform on new data. Note that we shouldn't rely on the validation results for this, because the validation set was also used to decide how to build our model.

In [None]:
results = model.evaluate(test_examples, test_labels)

print(results)
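
`model.evaluate` returns the loss followed by each compiled metric, so here `results` holds the test loss and the test accuracy. To make the accuracy number concrete, the sketch below computes it by hand from made-up logits and labels. `accuracy_from_logits` is a hypothetical helper; with the real model, the logits would come from `model.predict`.

```python
import numpy as np

def accuracy_from_logits(y_true, logits, threshold=0.0):
    """Fraction of examples whose thresholded logit matches the label."""
    predicted = (np.asarray(logits) > threshold).astype(int)
    return float(np.mean(predicted == np.asarray(y_true)))

y_true = [1, 0, 1, 0]             # made-up labels
logits = [1.7, -0.4, -0.2, -2.1]  # made-up model outputs

# Three of the four made-up predictions match their labels.
print(accuracy_from_logits(y_true, logits))
```

A logit above 0 corresponds to a predicted probability above 0.5, which is why the cutoff here is 0.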

## Other Pre-Built Models

The [Hugging Face Hub](https://huggingface.co/models) has many models that have been pre-trained for you to use. You can access them with the `pipeline` function for text analysis tasks, such as sentiment analysis that is more advanced than the VADER method.

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

We can apply this to our own data fairly easily, without needing to train anything. The text data just has to be in a list. For example, to get the sentiment of each of our NYT abstracts, we can pass a list of those abstracts to `sentiment_pipeline`.

In [None]:
nyt_2021 = pd.read_csv('nyt_2021.csv').dropna()
abstracts = nyt_2021.abstract.tolist()

In [None]:
abstracts[:10]

In [None]:
# Only do 10 here for speed
abstract_sentiments = sentiment_pipeline(abstracts[:10])

In [None]:
abstract_sentiments
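
The pipeline returns one dictionary per input, with a `label` and a `score`. A common next step is to put those results next to the original text. The sketch below uses a made-up `sample_output` list in that shape instead of calling the model, and `to_signed_score` is a hypothetical helper. The `POSITIVE`/`NEGATIVE` label names are those used by the default English sentiment model; treat them as an assumption if you use a different model.

```python
def to_signed_score(result):
    """Turn a {'label', 'score'} dict into one number in [-1, 1]."""
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]

sample_texts = ["I love you", "I hate you"]
sample_output = [
    {"label": "POSITIVE", "score": 0.99},  # made-up scores
    {"label": "NEGATIVE", "score": 0.98},
]

for text, result in zip(sample_texts, sample_output):
    print(f"{to_signed_score(result):+.2f}  {text}")
```

A single signed number per text is convenient for sorting or for adding as a new column in a DataFrame.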

## Other Types of Sentiment

Another nice thing about the Hub is that it also has models pre-trained for different types of sentiment analysis. For example, let's take the `distilbert-base-uncased-emotion` model, which provides scores for emotions such as joy or anger.

In [None]:
classifier = pipeline("text-classification",
                      model='bhadresh-savani/distilbert-base-uncased-emotion', 
                      top_k=None)


In [None]:
prediction = classifier("I love using transformers. The best part is wide range of support and its easy to use")
print(prediction)

In [None]:
prediction = classifier(abstracts[:10])
# Show the emotion scores for the first abstract
print(prediction[0])
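
With `top_k=None`, the classifier returns a score for every emotion label rather than only the best one. To reduce that to a single label per text, you can take the entry with the highest score. The sketch below assumes each prediction is a list of `label`/`score` dictionaries (the exact nesting can vary between `transformers` versions) and uses made-up scores; `top_emotion` is a hypothetical helper.

```python
def top_emotion(scores):
    """Return the (label, score) pair with the highest score."""
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]

# Made-up scores in the assumed output shape for one text.
sample_prediction = [
    {"label": "joy", "score": 0.91},
    {"label": "love", "score": 0.05},
    {"label": "anger", "score": 0.02},
    {"label": "sadness", "score": 0.02},
]

print(top_emotion(sample_prediction))
```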