# Neural Networks

In [None]:
#%pip install transformers
#%pip install tensorflow-hub

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 


import os
# suppressing some warnings
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras
from tensorflow.keras import layers


## Neural Networks for Text Data


Neural networks are extremely flexible, which allows you to use them for all kinds of data. They can also be used for text data to do tasks such as sentiment analysis using supervised learning.

Go ahead and run this (it will take a moment to finish) and we'll talk about it in a moment:

In [None]:


module_url = 'https://www.kaggle.com/models/google/universal-sentence-encoder/TensorFlow2/universal-sentence-encoder/2'
embedding_model = hub.load(module_url)
# check to see if it works - should return: TensorShape([1, 512])
embedding_model(["This is a text"]).shape

# Reviews

We'll start by reloading the IMDB movie review corpus that we used a couple of weeks ago. Just to refresh your memory: this is a subset of a larger corpus of user-generated IMDB reviews. The dataset contains the full text of each review, along with a numeric label that is equal to 0 if the review is negative and 1 if the review is positive.


In [None]:
reviews = pd.read_csv("imdb_reviews.csv")
reviews.head()

Just as in previous classes, we're going to evaluate a model here by creating separate training and testing datasets. We'll also convert these datasets to tensors to make them easier to work with in TensorFlow.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
                                     reviews['text'],
                                     reviews['label'],
                                     test_size=0.20, # 20% of observations for validation
                                     random_state = 999) # this is a random process, so you want to set a random seed! 
print(f"Training entries: {len(X_train)}, test entries: {len(X_test)}")

## Embeddings

In a previous class, we trained a Naive Bayes classifier to distinguish positive from negative IMDB reviews with fairly high accuracy (~84%). Now, we're going to try to do the same task using a neural network with an embedding model instead. In the most general sense, an embedding is a low-dimensional representation of a higher-dimensional data set. Text embeddings can be thought of as a kind of spatial model for text, where words or sentences with similar meanings appear "close" to one another in space.

The methods for inferring embeddings will vary from one model to the next, but a typical approach for getting word embeddings is to take a bunch of text and train a neural network model with a single hidden layer to predict a randomly held out word using the surrounding context:


<div style='display: block;margin-left: auto;margin-right: auto; width: 50%;'>
    <img src='https://lilianweng.github.io/posts/2017-10-15-word-embedding/word2vec-cbow.png' width="600">C-BOW model <a href='https://lilianweng.github.io/posts/2017-10-15-word-embedding/'>(source)</a></img>
</div>

The weights from the hidden layer in this model are used as the word embeddings. Terms with similar meanings will appear in similar contexts, so they should also end up with similar values in this embedding space. This method does a surprisingly good job at picking up on nuanced semantic relationships between terms, and since the training data consists entirely of regular English-language texts, it's comparatively easy to train these models on a ton of data and then reuse them for multiple different machine learning tasks. <a href='https://projector.tensorflow.org/'>There's a good visual representation of some word embedding models here</a>.
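To make the training setup concrete, here is a minimal sketch of how (context, target) pairs could be built for a CBOW-style model. The helper name `cbow_pairs`, the window size, and the toy sentence are illustrative assumptions and not part of any library; real implementations add sampling and other details that are skipped here.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        # words up to `window` positions to the left and right of the target
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

# toy sentence, split on whitespace (an assumption; real tokenizers are smarter)
tokens = "the rattail cactus is native to mexico".split()
pairs = cbow_pairs(tokens, window=2)
for context, target in pairs[:3]:
    print(context, "->", target)
```

Each pair asks the network to guess the held-out word from its neighbors, which is why words that show up in similar contexts end up with similar hidden-layer weights.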

Transformer-based models take this one step further by modeling context and word order. So, we can use a pre-trained sentence embedding model to convert a full sentence, paragraph, or document into a series of numbers that functions as a context-aware representation of its meaning.

The `embedding_model` object we loaded earlier is a pre-trained sentence embedding model that is built for general-purpose applications. It takes a list of strings as input and returns, for each string, a vector of 512 numbers that represents that text's "location" in a 512-dimensional space. Here's an example of embedding a sentence and then getting the first 10 elements of the embedding vector:

In [None]:
# embedding a sentence about cacti and looking at the first 10 elements

embedding_model(["The rattail cactus is native to Mexico."])[0][:10]

Note that we don't really need to do much (or any) pre-processing of the texts here. The model is trained on mostly unprocessed text, so things like stripping punctuation or removing stopwords often do more harm than good.

To illustrate why the embedding is useful, we can use a little code from the <a href='https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder'>online documentation</a> that will allow us to visualize the similarities between the embeddings produced by different sentences:

In [None]:
def plot_similarity(labels, features, rotation):
  corr = np.inner(features, features)
  sns.set(font_scale=1)
  g = sns.heatmap(
      corr,
      xticklabels=labels,
      yticklabels=labels,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
  g.set_xticklabels(labels, rotation=rotation)
  g.set_title("Semantic Textual Similarity")

def run_and_plot(messages_):
  message_embeddings_ = embedding_model(messages_)
  plot_similarity(messages_, message_embeddings_, 90)

Below are some sentences from different Wikipedia entries. The first two are from the entry on *Citizen Kane*, the last two are from entries on cacti. Note that the sentences within each group share very few terms, but take a look at their similarities as measured by the inner products of their respective embeddings:

In [None]:
run_and_plot([
    # two sentences from the Wikipedia entry for citizen kane
    "Citizen Kane is often cited as the greatest film ever.",
    "Hollywood had shown interest in Welles in 1936.",
    # sentences from entries on cacti
    "The rattail cactus is native to Mexico.",
    "Prickly pears are frequently found around California."])


In essence, text embeddings give us a more flexible way to represent text that can account for nuanced aspects of meaning and context, so that sentences about the same general idea are "close" in the embedding space even if they share none of the exact same terms. Feeding these inputs, instead of a simple bag of words, into a machine learning model can allow us to make more effective use of the same data. While 512 features may seem like a lot to handle, note that this is far more compact than the term-document matrices that we were working with before, and it theoretically captures more information.
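The heatmap above scores similarity with a plain inner product. That matches cosine similarity only to the extent that the embedding vectors have unit length (the TensorFlow Hub tutorial describes them as approximately normalized). The sketch below makes the normalization explicit on made-up 3-dimensional vectors; `similarity_matrix` and `toy_embeddings` are hypothetical names used only for illustration.

```python
import numpy as np

def similarity_matrix(vectors):
    """Cosine similarity between every pair of row vectors."""
    vectors = np.asarray(vectors, dtype=float)
    # scale each row to unit length so the inner product equals the cosine
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.inner(unit, unit)

# hypothetical 3-dimensional "embeddings" (the real ones have 512 dimensions)
toy_embeddings = [
    [0.9, 0.1, 0.0],   # film sentence A
    [0.8, 0.2, 0.1],   # film sentence B
    [0.1, 0.9, 0.2],   # cactus sentence
]
sims = similarity_matrix(toy_embeddings)
print(np.round(sims, 2))
```

In this toy example the two "film" vectors point in nearly the same direction, so their similarity is much higher than the similarity between either of them and the "cactus" vector.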

## Fitting the model

Now, let's fit a model to predict movie reviews that uses the embedding model. We'll use the embedding layer as our input layer and then include one hidden layer and a sigmoid output layer that will return our predicted probability of a review being positive.

In [None]:
# wrap the pre-trained encoder as a frozen (non-trainable) Keras layer
embed = hub.KerasLayer(embedding_model, trainable=False, name='sentence_embedder')


In [None]:
model = keras.Sequential([
    embed,   
    layers.Dense(128, activation='relu', name='hidden_layer'),  # avoid spaces in layer names
    layers.Dense(1, activation='sigmoid', name='output')    
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
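To see what the two `Dense` layers compute, here is a rough NumPy sketch of a single forward pass. The weights are random stand-ins (an assumption for illustration, not the trained values), and `x` is a fake 512-dimensional embedding rather than real encoder output.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for one 512-dimensional sentence embedding
x = rng.normal(size=(1, 512))

# randomly initialised weights with the same shapes as the two Dense layers
W1, b1 = rng.normal(scale=0.05, size=(512, 128)), np.zeros(128)
W2, b2 = rng.normal(scale=0.05, size=(128, 1)), np.zeros(1)

hidden = np.maximum(0, x @ W1 + b1)            # ReLU hidden layer
prob = 1 / (1 + np.exp(-(hidden @ W2 + b2)))   # sigmoid output, always in (0, 1)

# weights plus biases across both layers
n_params = W1.size + b1.size + W2.size + b2.size
print(prob.shape, n_params)
```

Because the embedding layer is frozen, these two layers hold all of the trainable parameters: 512 × 128 + 128 for the hidden layer and 128 + 1 for the output.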

With the model compiled, we train it for 20 epochs.

In [None]:
# Converting to tensor objects (data type used by tensorflow)
X_train_tensor = tf.convert_to_tensor(X_train)
y_train_tensor = tf.convert_to_tensor(y_train)
X_test_tensor = tf.convert_to_tensor(X_test)
y_test_tensor = tf.convert_to_tensor(y_test)
history = model.fit(X_train_tensor, 
                    y_train_tensor,
                    epochs=20,
                    batch_size=500,
                    validation_data=(X_test_tensor, 
                                     y_test_tensor),
                    verbose=1
                   )


If you're looking at the output above, you should notice that the model is being trained over multiple "epochs". Each epoch represents a single pass through the entire data set, and you'll usually see rapid improvement during the first few iterations, followed by a gradual leveling off in accuracy. Since we're also including data for the "validation_data" argument we can compare improvements on the training data accuracy against improvement on the test data accuracy.
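As a quick sanity check on the progress bar, the number of weight updates per epoch follows from the training-set size and the batch size. The sketch below uses a hypothetical training-set size; `steps_per_epoch` is an illustrative helper, not a Keras function.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches (weight updates) in one pass over the data."""
    # the last batch may be smaller than batch_size, so round up
    return math.ceil(n_examples / batch_size)

# hypothetical training-set size; substitute len(X_train) in the notebook
n_train = 4000
steps = steps_per_epoch(n_train, batch_size=500)
total_updates = steps * 20   # 20 epochs, as in the call to model.fit
print(steps, total_updates)
```

A larger batch size means fewer, smoother updates per epoch, while a smaller one means more frequent but noisier updates.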

We'll visualize these results below:

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

When watching the progress of a model like this, you should pay attention to things like big disparities between training and test performance, because this can indicate overfitting. You might also want to watch out for cases where the results were still rapidly improving when the training stopped.
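One way to turn those checks into numbers is to summarize the training history directly. This is a minimal sketch using made-up accuracy values; `summarize_history` is a hypothetical helper, and in the notebook you would pass it `history.history`.

```python
def summarize_history(history_dict):
    """Report the best validation epoch and the final train/validation gap."""
    train = history_dict["accuracy"]
    val = history_dict["val_accuracy"]
    # index of the first epoch with the highest validation accuracy
    best_epoch = max(range(len(val)), key=lambda i: val[i])
    return {
        "best_epoch": best_epoch + 1,          # report epochs starting from 1
        "best_val_accuracy": val[best_epoch],
        "final_gap": train[-1] - val[-1],      # a large positive gap suggests overfitting
    }

# made-up numbers for illustration only
example = {
    "accuracy":     [0.70, 0.80, 0.86, 0.90, 0.94],
    "val_accuracy": [0.72, 0.79, 0.83, 0.83, 0.82],
}
summary = summarize_history(example)
print(summary)
```

In this invented example, training accuracy keeps climbing after validation accuracy has stalled, which is the pattern to watch for.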

Now we can generate some predictions and look at our results:

In [None]:
# generate predictions from test data
preds = model.predict(X_test_tensor).flatten()>=.5
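The prediction step does two things: it flattens the column of probabilities into a 1-D array and then applies a 0.5 cutoff. Here is a small sketch on hypothetical probabilities (not real model output), with an extra cast to integers so the result lines up with the 0/1 labels.

```python
import numpy as np

# hypothetical sigmoid outputs with shape (n_reviews, 1)
probs = np.array([[0.91], [0.08], [0.50], [0.37]])

flat = probs.flatten()                 # shape (4,)
labels = (flat >= 0.5).astype(int)     # True/False -> 1/0
print(labels)
```

The 0.5 cutoff is a convention rather than a requirement; a different threshold trades false positives against false negatives.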



<h2 style="color:red;font-weight:bold">Question 1 </h2>

<span style="color:red;font-weight:bold">Create a confusion matrix and classification report from the predictions and assess the quality of the classifier</span>

How does this do? Does it outperform the Naive Bayes classifier? Why might this be?

## Changes to the Model

We can make changes to the model to add more layers, use more nodes, train it for longer, or even use a different kind of model. This is part of the overall process for finding the model that has the best performance in terms of accuracy. In reality, we would do these steps many, many times, tuning our model so that it is as good as possible.


<h2 style="color:red;font-weight:bold">Question 2 </h2>

<span style="color:red;font-weight:bold">Change something about the model above and compare your results.</span>

(Some options are: add an additional hidden layer, run the same model for more epochs, add more nodes to one or more of the layers, or add <a href='https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout'>dropout</a>)

### Pre-built models from Hugging Face

In reality, the full IMDB reviews corpus is much larger than the subset we've been using here, and a real-world application would use all of it. Since training on the full data takes a while, we can instead use a pre-made model that was trained on this data set to get a sense of how well we could do with more fine-tuning.

The [Hugging Face Hub](https://huggingface.co/models) has many models that have been pre-trained for you to use. One of the models hosted there is a <a href='https://huggingface.co/aychang/roberta-base-imdb'>sentiment classifier that was trained on the entire IMDB corpus</a>. We can load this model and see how it performs on our held-out data.


In [None]:
from transformers import pipeline
# load pipeline for IMDB classification
tiny_bert = pipeline("text-classification", "arnabdhar/tinybert-imdb", 
                    # encoder expects a specific text length, so this allows texts to be truncated
                     truncation =True,
                     padding =True,
                     max_length = 500
                    )        
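The `truncation`, `padding`, and `max_length` arguments control how texts of different lengths are fit into one batch. The sketch below imitates that idea on hypothetical token ids; `pad_and_truncate` is an illustrative helper, and the real tokenizer also handles special tokens and attention masks that are skipped here.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad shorter ones to a common length."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    return [seq + [pad_id] * (longest - len(seq)) for seq in truncated]

# hypothetical token ids; a real tokenizer produces these from text
batch = [[5, 8, 2], [7, 1, 9, 4, 3, 6], [11]]
padded = pad_and_truncate(batch, max_length=4)
print(padded)
```

Truncation means that very long reviews are only partly read by the model, which is worth keeping in mind when interpreting its mistakes.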




In [None]:
# converting to list since that's the input format this model uses
predicted = tiny_bert(X_test.tolist())

# formatting the test labels to match the output from this model
y_test_text = ["POSITIVE" if i ==1 else "NEGATIVE" for i in y_test ]

predicted_labels = [i.get('label') for i in predicted]
pd.crosstab(predicted_labels, y_test_text)


In [None]:
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, classification_report, balanced_accuracy_score
print(classification_report(y_test_text, predicted_labels))

print("cohens kappa: ", cohen_kappa_score(y_test_text, predicted_labels))
print("balanced accuracy: ", balanced_accuracy_score(y_test_text, predicted_labels))


In [None]:
# same report as above, with readable class names;
# string labels are sorted alphabetically, so NEGATIVE comes before POSITIVE
print(classification_report(y_test_text, predicted_labels,
                            # add target_names to show labels in the report:
                            target_names=['Negative', 'Positive']))


# add cohen's kappa and balanced accuracy
print("cohens kappa: ", cohen_kappa_score(y_test_text, predicted_labels))
print("balanced accuracy: ", balanced_accuracy_score(y_test_text, predicted_labels))
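For reference, both extra metrics can be computed by hand from the four cells of a binary confusion matrix. This is a sketch with made-up counts; `kappa_and_balanced_accuracy` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def kappa_and_balanced_accuracy(tn, fp, fn, tp):
    """Compute Cohen's kappa and balanced accuracy from a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    observed = (tn + tp) / total   # plain accuracy
    # agreement expected by chance, from the actual and predicted class totals
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    # balanced accuracy: average of the recall on each class
    balanced = (tn / (tn + fp) + tp / (tp + fn)) / 2
    return kappa, balanced

# made-up counts for illustration only
kappa, balanced = kappa_and_balanced_accuracy(tn=40, fp=10, fn=5, tp=45)
print(round(kappa, 3), round(balanced, 3))
```

Kappa discounts the agreement you would expect from chance alone, and balanced accuracy weights both classes equally, so both are more informative than raw accuracy when the classes are uneven.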

## Other Types of Sentiment

The nice thing about these models is that they can be optimized for all sorts of other tasks. For example, let's take the Distilbert-base-uncased-emotion model. This provides scores for emotions such as joy or anger. Here's an example of getting the emotions expressed in the reviews from our test set:

In [None]:
emotion_classifier = pipeline("text-classification",
                              model='bhadresh-savani/distilbert-base-uncased-emotion',
                              # top_k=None returns a score for every label
                              # (it replaces the deprecated return_all_scores=True)
                              top_k=None,
                              truncation=True,
                              padding=True,
                              max_length=500,
                              )

In [None]:
emotions = emotion_classifier(X_test.tolist())
result = [{i.get('label'):i.get('score') for i in j} for j in emotions]
emotions_frame = pd.DataFrame(result)
emotions_frame['review'] = X_test.tolist()
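The reshaping step is easier to follow on a tiny example. The sketch below uses a hypothetical pipeline output for two texts (the labels and scores are invented) and shows how each list of label/score dictionaries becomes one row keyed by emotion.

```python
# hypothetical pipeline output: one list of label/score dicts per text
fake_output = [
    [{"label": "joy", "score": 0.90}, {"label": "anger", "score": 0.10}],
    [{"label": "joy", "score": 0.20}, {"label": "anger", "score": 0.80}],
]

# one dict per text, keyed by emotion label
rows = [{item["label"]: item["score"] for item in scores} for scores in fake_output]

# position of the text with the highest "joy" score
most_joyful = max(range(len(rows)), key=lambda i: rows[i]["joy"])
print(rows, most_joyful)
```

Passing a list of dictionaries like `rows` to `pd.DataFrame` gives one column per emotion, which is what makes lookups such as `idxmax` on a single emotion possible.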

In [None]:
emotions_frame.review[emotions_frame['joy'].idxmax()]

In [None]:
# stack the per-review label/score lists into one long table
emotion_prediction = pd.concat([pd.DataFrame(i) for i in emotions])
emotion_prediction.groupby('label').agg({'score':['min','max','median','mean']})

You can check out some other options on the Hugging Face <a href='https://huggingface.co/models'>models page.</a>

# BERTopic

[BERTopic](https://maartengr.github.io/BERTopic/api/bertopic.html) is a transformer-based method for topic modeling that works especially well on short texts. You can read more about the algorithm [here](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).

You can try it out with the code below, but be warned that it will take a while to run. For the sake of a speedier model, we'll work with just the headlines and first few sentences from the Fox/CNN news articles from a couple of weeks ago. The BERTopic package also provides several built-in functions that make it easy to visualize and explore results, some of which are demonstrated below.

In [None]:
# Uncomment to install packages:
#%pip install bertopic
#%pip install sentence-transformers
#%pip install umap-learn

In [None]:
# import BERTopic (this is very slow sometimes)
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from nltk.tokenize import sent_tokenize


# Loading the 4,000 articles sampled from CNN and Fox
articles = pd.read_csv('https://github.com/Neilblund/APAN/raw/main/news_sample.csv')
# doing some truncation and combining here just to make the model run faster
docs= [' '.join(sent_tokenize(i + " "+ j)[:5]).strip() for i, j in  zip(articles.headline, articles.text)]
# embedding the texts (typically the slowest part)
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)
# setting the UMAP model (this makes results reproducible)
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=119)
# Train BERTopic
topic_model = BERTopic(umap_model = umap_model).fit(docs, embeddings)
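After the documents are clustered, the top terms for each topic come from a class-based TF-IDF weighting, as described on the algorithm page linked above. The sketch below is a simplified NumPy version of that idea on an invented count matrix; `c_tf_idf`, the vocabulary, and the counts are illustrative assumptions, and the library's own implementation includes options that are left out here.

```python
import numpy as np

def c_tf_idf(counts):
    """Weight terms per topic from a (n_topics, n_terms) matrix of term counts."""
    counts = np.asarray(counts, dtype=float)
    tf = counts / counts.sum(axis=1, keepdims=True)   # term frequency within each topic
    avg_words = counts.sum() / counts.shape[0]        # average number of words per topic
    term_freq = counts.sum(axis=0)                    # frequency of each term across topics
    idf = np.log(1 + avg_words / term_freq)           # down-weights terms common to all topics
    return tf * idf

# invented counts: rows are topics, columns are the terms below
vocab = ["film", "cactus", "the"]
counts = [[4, 0, 6],
          [0, 4, 6]]
scores = c_tf_idf(counts)
top_terms = [vocab[i] for i in scores.argmax(axis=1)]
print(top_terms)
```

Even though "the" is the most frequent term in both toy topics, it appears everywhere, so the distinctive terms end up with the highest scores.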

From here, we can try out some of the built-in functions for exploring the results of the topic model:

In [None]:
# view all topics and the top terms associated with each one
topic_model.topic_labels_

In [None]:
# view representative documents for topic 0
topic_model.get_representative_docs(0)

In [None]:
topic_info = topic_model.get_topic_info()
topic_info.head(n=15)

In [None]:
# heatmap showing textual similarity of each topic 
topic_model.visualize_heatmap()


In [None]:
# visualize distance between topics
topic_model.visualize_topics()

In [None]:
# selected topics associated with each outlet (normalized by their overall frequency)
topics_per_class = topic_model.topics_per_class(docs, articles.source)
topic_model.visualize_topics_per_class(topics_per_class,  topics=[1, 3, 8, 11,12, 13], normalize_frequency=True)

In [None]:
# documents in space 
topic_model.visualize_documents(articles.headline, embeddings=embeddings)


In [None]:
# terms for selected topics
topic_model.visualize_barchart([0, 1, 2, 3,11, 10], n_words=10)

In [None]:
# view topics over time (this model also allows topics themselves to vary slightly depending on the time period)
# see https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html
topics_over_time = topic_model.topics_over_time(docs, articles.date, nr_bins=15)

In [None]:
# top 5 topics
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=5)
# or try this to see specific topics:
#topic_model.visualize_topics_over_time(topics_over_time, topics = [0, 1, 2])


In [None]:
# get topic distributions across individual terms
topic_distr, topic_token_distr = topic_model.approximate_distribution(docs, calculate_tokens=True)



In [None]:
# plot the topic-word distribution for the first document:
df = topic_model.visualize_approximate_distribution(docs[0], topic_token_distr[0])
df