# Text classification with RNNs
- Description of the dataset `GoEmotions`:

The GoEmotions dataset contains 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral. The emotion categories are admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise.
## Preamble: installing and importing packages

In [1]:
import numpy as np 
import scipy as sp
import pandas as pd 
import matplotlib.pyplot as plt
import skimage.io
import seaborn as sns
import plotly.express as px

from IPython.display import display, HTML

import tensorflow as tf
import tensorflow_datasets as tfds

import os
import urllib

Information about the dataset

In [2]:
builder = tfds.builder('goemotions')
builder.info.description

'The GoEmotions dataset contains 58k carefully curated Reddit comments labeled\nfor 27 emotion categories or Neutral. The emotion categories are admiration,\namusement, anger, annoyance, approval, caring, confusion, curiosity, desire,\ndisappointment, disapproval, disgust, embarrassment, excitement, fear,\ngratitude, grief, joy, love, nervousness, optimism, pride, realization, relief,\nremorse, sadness, surprise.'

In [3]:
builder.info.features

FeaturesDict({
    'admiration': tf.bool,
    'amusement': tf.bool,
    'anger': tf.bool,
    'annoyance': tf.bool,
    'approval': tf.bool,
    'caring': tf.bool,
    'comment_text': Text(shape=(), dtype=tf.string),
    'confusion': tf.bool,
    'curiosity': tf.bool,
    'desire': tf.bool,
    'disappointment': tf.bool,
    'disapproval': tf.bool,
    'disgust': tf.bool,
    'embarrassment': tf.bool,
    'excitement': tf.bool,
    'fear': tf.bool,
    'gratitude': tf.bool,
    'grief': tf.bool,
    'joy': tf.bool,
    'love': tf.bool,
    'nervousness': tf.bool,
    'neutral': tf.bool,
    'optimism': tf.bool,
    'pride': tf.bool,
    'realization': tf.bool,
    'relief': tf.bool,
    'remorse': tf.bool,
    'sadness': tf.bool,
    'surprise': tf.bool,
})
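Because each emotion is a separate boolean feature, a comment can carry several labels at once, so the task is multi-label. The sketch below shows one way to collapse the per-emotion flags into a single multi-hot vector. It uses a plain Python dict in place of a real dataset element; the names `EMOTIONS` and `to_multi_hot` and the sample values are hypothetical and not part of the notebook.

```python
# Label names in the same alphabetical order as the feature dictionary
# (the 27 emotions plus 'neutral'); 'comment_text' is deliberately left out.
EMOTIONS = [
    'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring',
    'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval',
    'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief',
    'joy', 'love', 'nervousness', 'neutral', 'optimism', 'pride',
    'realization', 'relief', 'remorse', 'sadness', 'surprise',
]


def to_multi_hot(example, label_names=EMOTIONS):
    """Turn the boolean label fields of one example into a list of 0.0/1.0."""
    return [1.0 if example[name] else 0.0 for name in label_names]


# Hypothetical example: a comment tagged with both 'approval' and 'joy'.
example = {name: False for name in EMOTIONS}
example['approval'] = True
example['joy'] = True
example['comment_text'] = b"made-up comment used only for illustration"

labels = to_multi_hot(example)
print(len(labels), sum(labels))
```

Inside a `tf.data` pipeline the same idea would normally be expressed with tensor operations in a `.map` call instead of a Python list comprehension.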

## Load a training dataset

In [4]:
train_tfds = tfds.load('goemotions', split='train', shuffle_files=True)
train_tfds


<PrefetchDataset element_spec={'admiration': TensorSpec(shape=(), dtype=tf.bool, name=None), 'amusement': TensorSpec(shape=(), dtype=tf.bool, name=None), 'anger': TensorSpec(shape=(), dtype=tf.bool, name=None), 'annoyance': TensorSpec(shape=(), dtype=tf.bool, name=None), 'approval': TensorSpec(shape=(), dtype=tf.bool, name=None), 'caring': TensorSpec(shape=(), dtype=tf.bool, name=None), 'comment_text': TensorSpec(shape=(), dtype=tf.string, name=None), 'confusion': TensorSpec(shape=(), dtype=tf.bool, name=None), 'curiosity': TensorSpec(shape=(), dtype=tf.bool, name=None), 'desire': TensorSpec(shape=(), dtype=tf.bool, name=None), 'disappointment': TensorSpec(shape=(), dtype=tf.bool, name=None), 'disapproval': TensorSpec(shape=(), dtype=tf.bool, name=None), 'disgust': TensorSpec(shape=(), dtype=tf.bool, name=None), 'embarrassment': TensorSpec(shape=(), dtype=tf.bool, name=None), 'excitement': TensorSpec(shape=(), dtype=tf.bool, name=None), 'fear': TensorSpec(shape=(), dtype=tf.bool, name=

In [5]:
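# Caution: this reassigns train_tfds to a one-example dataset, so every
# later cell that uses train_tfds (cardinality, adapt) sees only this example.
train_tfds = train_tfds.take(1)

for example in train_tfds:  # example is a dict mapping each feature name to a tf.Tensor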
  print(list(example.keys()))
  text = example["comment_text"]
  print(text)

['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'comment_text', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'neutral', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise']
tf.Tensor(b"It's just wholesome content, from questionable sources", shape=(), dtype=string)


In [6]:
train_tfds.cardinality()

<tf.Tensor: shape=(), dtype=int64, numpy=1>

In [7]:
#comments_labels = train_tfds[['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'neutral', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise']]
#comments_labels.head()

## TODO Loading validation data

In [8]:
val_tfds = tfds.load('goemotions', split='validation', shuffle_files=True)
val_tfds

<PrefetchDataset element_spec={'admiration': TensorSpec(shape=(), dtype=tf.bool, name=None), 'amusement': TensorSpec(shape=(), dtype=tf.bool, name=None), 'anger': TensorSpec(shape=(), dtype=tf.bool, name=None), 'annoyance': TensorSpec(shape=(), dtype=tf.bool, name=None), 'approval': TensorSpec(shape=(), dtype=tf.bool, name=None), 'caring': TensorSpec(shape=(), dtype=tf.bool, name=None), 'comment_text': TensorSpec(shape=(), dtype=tf.string, name=None), 'confusion': TensorSpec(shape=(), dtype=tf.bool, name=None), 'curiosity': TensorSpec(shape=(), dtype=tf.bool, name=None), 'desire': TensorSpec(shape=(), dtype=tf.bool, name=None), 'disappointment': TensorSpec(shape=(), dtype=tf.bool, name=None), 'disapproval': TensorSpec(shape=(), dtype=tf.bool, name=None), 'disgust': TensorSpec(shape=(), dtype=tf.bool, name=None), 'embarrassment': TensorSpec(shape=(), dtype=tf.bool, name=None), 'excitement': TensorSpec(shape=(), dtype=tf.bool, name=None), 'fear': TensorSpec(shape=(), dtype=tf.bool, name=

Check how many validation examples we have (the dataset is not batched, so the cardinality counts individual comments):

In [9]:
val_tfds.cardinality()

<tf.Tensor: shape=(), dtype=int64, numpy=5426>

# Text encoding layers

## Text vectorization

#### TODO Creating and fitting the vectorizer

In [10]:
VOCAB_SIZE = 1000
# set the max_tokens argument to VOCAB_SIZE
encoder = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    ngrams=None,
    output_mode='int',
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
    idf_weights=None,
    sparse=False,
    ragged=False
    )

encoder

<keras.layers.preprocessing.text_vectorization.TextVectorization at 0x7ff6eb2316a0>

Next we fit `encoder` (our vectorizer) on the training texts. The fit is **unsupervised**: it uses only the texts, not the labels. The encoder must be fully fitted before any downstream NN model is trained, since it defines the token-to-index mapping those models work on.
In Keras, preprocessing layers are fitted with `.adapt` rather than `.fit`.

`.adapt` must receive a version of the dataset that contains only the comment text, without labels. One way to get it is to convert the dataset to a pandas DataFrame with the function `tfds.as_dataframe` and select the text column:



In [17]:
df = tfds.as_dataframe(train_tfds)
# Emotion label columns (every feature except the text)
comments_labels = df[['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'neutral', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise']]
# Text-only column used to fit the vectorizer
train_tfds_txt = df["comment_text"]
#y = comments_labels.values
# Alternative that stays in tf.data (each element is a dict of features):
#train_tfds_txt = train_tfds.map(lambda example: example["comment_text"])

Now we can pass the text-only data to the layer's `.adapt` method:

In [18]:
encoder.adapt(train_tfds_txt)
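To make the fitting step less opaque, here is a framework-free sketch of what this kind of vocabulary fitting amounts to for the settings used above (lowercasing, punctuation stripping, whitespace splitting, integer output). It is an illustration under those assumptions, not the Keras implementation: the helpers `standardize`, `build_vocabulary` and `encode` and the toy corpus are hypothetical, and the ordering of equally frequent words may differ from what Keras produces.

```python
import string
from collections import Counter

PAD, UNK = "", "[UNK]"  # reserved entries at indices 0 and 1


def standardize(text):
    """Lowercase the text and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocabulary(texts, max_tokens):
    """Count whitespace-separated tokens and keep the most frequent ones.

    Two slots are reserved for padding and unknown tokens, so at most
    max_tokens - 2 actual words are kept.
    """
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return [PAD, UNK] + words


def encode(text, vocab):
    """Map each token to its vocabulary index; unseen tokens map to 1."""
    index = {w: i for i, w in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text).split()]


# Toy corpus, labels are never used: the fit is unsupervised.
texts = ["The cat sat.", "The cat ran!", "the dog"]
vocab = build_vocabulary(texts, max_tokens=4)
print(vocab)                          # ['', '[UNK]', 'the', 'cat']
print(encode("The bird sat", vocab))  # [2, 1, 1]
```

With `max_tokens=4` only the two most frequent words survive, so both `bird` (never seen) and `sat` (seen, but too rare to make the cut) are encoded as the unknown index 1.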

#### Checking the vocabulary
The `.adapt` method sets the layer's **vocabulary**. The next cell lists its first 50 tokens (fewer if the vocabulary is smaller).
- The first token is the empty string, which corresponds to zero-padded sequence positions (index 0).
- The second token, `[UNK]`, stands for any unknown token; all of them are encoded as 1.
- The remaining tokens are words sorted by decreasing frequency of appearance in the text corpus.

In [19]:
vocab = np.array(encoder.get_vocabulary())
vocab[:50]

array(['', '[UNK]', 'wholesome', 'sources', 'questionable', 'just', 'its',
       'from', 'content'], dtype='<U12')
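The reserved index 0 matters once comments of different lengths are grouped into a batch: shorter sequences are filled with zeros so that the batch is rectangular. The following minimal sketch illustrates that padding step in plain Python; `pad_batch` and the sample sequences are hypothetical and only meant to show the idea.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_len = max((len(s) for s in sequences), default=0)
    return [list(s) + [pad_id] * (max_len - len(s)) for s in sequences]


# Two already-encoded comments of different lengths (made-up indices).
batch = pad_batch([[2, 1, 1], [2, 3]])
print(batch)  # [[2, 1, 1], [2, 3, 0]]
```

Since 0 never denotes a real word, a downstream layer can treat those positions as padding and ignore them.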