# DL Lab 3.1 - Introduction to Natural Language Processing

In this section, we will explore different neural network architectures for dealing with sequential data such as text, i.e., natural language. In recent years, **Natural Language Processing (NLP)** has experienced fast growth as a field, both because of improvements to the language model architectures and because they've been trained on increasingly large text corpora. As a result, their ability to "understand" text has vastly improved, and large pre-trained models such as BERT and GPT have become widely used.

We will focus on the fundamental aspects of **representing text as tensors** in TensorFlow, and on classical NLP architectures, such as **bag-of-words**, **embeddings**, **recurrent neural networks**, and **Transformers**.

## Today's Learning Objectives

* Understand how **text** is processed for NLP tasks
* Learn how to build **text classification** models
* Learn about **Recurrent Neural Networks** (RNNs)

***

**Instructions**

- You'll be using Python 3 in the IPython-based Google Colaboratory
- Lines encapsulated in "<font color='green'>`### START YOUR CODE HERE ###`</font>" and "<font color='green'>`### END YOUR CODE HERE ###`</font>", or marked by "<font color='green'>`# TODO`</font>", denote the code fragments to be completed by you.
- There's no need to write any other code.
- After writing your code, you can run the cell by either pressing `SHIFT`+`ENTER` or by clicking on the play symbol on the left side of the cell.
- We may specify "<font color='green'>`(≈ X LOC)`</font>" in the "<font color='green'>`# TODO`</font>" comments to tell you about how many lines of code you need to write. This is just a rough estimate, so don't feel bad if your code is longer or shorter.
- If you get stuck, check your Lecture and Lab notes and use the [discussion forum](https://moodle.tu-ilmenau.de/mod/forum/view.php?id=3371) in Moodle.

Let's get started!

***

**Note**: Training DNNs is a computationally expensive process. Most of the computations can be parallelized very efficiently, making them a perfect fit for GPU acceleration. To enable a GPU for your Colab session, follow these steps:
- Click '*Runtime*' -> '*Change runtime type*'
- In the pop-up window for '*Hardware accelerator*', select '*GPU*'
- Click '*Save*'

## Natural Language Tasks

There are several NLP tasks that we can solve using neural networks:

* **Text Classification** is used when we need to classify a text fragment into one of several predefined classes. Examples include e-mail spam detection, news categorization, assigning a support request to a category, and more.
* **Intent Classification** is one specific case of text classification, where we want to map an input utterance in the conversational AI system into one of the intents that represent the actual meaning of the phrase, or intent of the user.
* **Sentiment Analysis** can be framed as a regression task, where we want to understand the degree of positivity of a given piece of text. We may want to label text in a dataset from most negative (-1) to most positive (+1), and train a model that will output a number representing the positivity of the input text.
* **Named Entity Recognition (NER)** is the task of extracting entities from text, such as dates, addresses, people names, etc. Together with intent classification, NER is often used in dialog systems to extract parameters from the user's utterance.
* A similar task of **Keyword Extraction** can be used to find the most meaningful words inside a text, which can then be used as tags.
* **Text Summarization** extracts the most meaningful pieces of text, giving the user a compressed version of the original text.
* **Question Answering** is the task of extracting an answer from a piece of text. This model takes a text fragment and a question as input, and finds the exact place within the text that contains the answer. For example, given the text "John is a 22 year old student who loves to use Microsoft Learn" and the question "How old is John?", the model should provide the answer "22".

In this Lab, we focus on the **Text Classification** task. However, we will learn all the important concepts that we need to handle more difficult tasks in the future.

# 1 - Download a Text Dataset

In this module, we will start with a simple text classification task based on the **[AG_NEWS](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)** dataset: we'll classify news headlines into one of 4 categories: World, Sports, Business and Sci/Tech.

To load the dataset, we will use the **[TensorFlow Datasets](https://www.tensorflow.org/datasets)** API.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds

dataset = tfds.load('ag_news_subset')

Once you've acquired a new dataset to work with, what should you do first?

Explore it? Inspect it? Verify it? Become one with it?

**All correct.**

In [None]:
dataset.keys()

We can access the training and test portions of the dataset by using `dataset['train']` and `dataset['test']` respectively:

In [None]:
ds_train = dataset['train']
ds_test = dataset['test']

print(f"Length of train dataset = {len(ds_train)}")
print(f"Length of test dataset = {len(ds_test)}")

Let's print out the first 10 news headlines from our dataset:

In [None]:
classes = ['World', 'Sports', 'Business', 'Sci/Tech']
num_classes = len(classes)

for i,x in zip(range(10), ds_train):
    print(f"{x['label']} ({classes[x['label']]}) -> {x['title']} {x['description']}")

Let's check how many examples of each label we have in the train split:

In [None]:
samples_per_class = np.zeros(num_classes)
for x in ds_train:
  samples_per_class[x['label']] += 1

for class_idx, samples in zip(range(num_classes), samples_per_class):
  print(f"label: {class_idx}, samples: {samples:.0f}")

Wonderful! We've got a training set and a test set containing text and labels. Our labels are in numerical form (`0`, `1`, `2`, `3`) but our texts are in string form. If we want to solve Natural Language Processing (NLP) tasks with neural networks, we need some way to represent text as tensors.

Finally, let us define two helper functions to better deal with this dataset:

In [None]:
def extract_text(x):
    return x['title'] + ' ' + x['description']

def tupelize(x):
    return (extract_text(x), x['label'])

# 2 - Representing Text as Tensors

In NLP, there are two main concepts for turning text into numbers:

* **Tokenization** - A direct mapping from a word, character, or sub-word to a numerical value. There are three main levels of tokenization:
  1. Using **word-level tokenization** with the sentence "I love TensorFlow" might result in "I" being `0`, "love" being `1` and "TensorFlow" being `2`. In this case, every word in a sequence is considered a single **token**.
  2. **Character-level tokenization**, such as converting the letters a-z to values `1-26`. In this case, every character in a sequence is considered a single **token**.
  3. **Sub-word tokenization** is in between word-level and character-level tokenization. It involves breaking individual words into smaller parts and then converting those smaller parts into numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". After doing this, these sub-words would then be mapped to a numerical value. In this case, every word could be considered multiple **tokens**.

* **Embeddings** - An embedding is a representation of natural language which can be learned. Representation comes in the form of a **feature vector**. For example, the word "dance" could be represented by the 5-dimensional vector `[-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]`. It's important to note here, the size of the feature vector is tuneable. There are two ways to use embeddings:
  1. **Create your own embedding** - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as [`tf.keras.layers.Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)) and an embedding representation will be learned during model training.
  2. **Reuse a pre-learned embedding** - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpuses of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.
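
To make the tokenization levels above more concrete, here is a minimal plain-Python sketch of word-level and character-level tokenization for the sentence "I love TensorFlow". It is only an illustration: splitting on whitespace and skipping non-letter characters are simplifying assumptions, and real tokenizers also deal with punctuation, casing, and unknown tokens.

```python
sentence = "I love TensorFlow"

# Word-level: every whitespace-separated word is one token.
word_tokens = sentence.split()
word_to_id = {}
for word in word_tokens:
    # assign the next free id the first time a word is seen
    word_to_id.setdefault(word, len(word_to_id))
word_ids = [word_to_id[word] for word in word_tokens]

# Character-level: map the letters a-z to 1-26 (other characters are skipped here).
char_to_id = {chr(ord("a") + i): i + 1 for i in range(26)}
char_ids = [char_to_id[c] for c in sentence.lower() if c in char_to_id]

print(word_tokens, word_ids)
print(char_ids)
```

The same sentence yields 3 word-level tokens but 15 character-level tokens, which hints at the trade-off: character-level vocabularies are tiny, but the sequences become much longer.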


> 🤔 **Question:** *What level of tokenization should we use? What embedding should we choose?*

It depends on your problem. You could try character-level tokenization/embeddings and word-level tokenization/embeddings and see which perform best.

If you're looking for pre-trained word embeddings, [Word2vec embeddings](http://jalammar.github.io/illustrated-word2vec/), [GloVe embeddings](https://nlp.stanford.edu/projects/glove/) and many of the options available on [TensorFlow Hub](https://tfhub.dev/s?module-type=text-embedding) are great places to start.

> ⭐ **Note:** Much like searching for a pre-trained computer vision model, you can search for pre-trained word embeddings to use for your problem. Try searching for something like "use pre-trained word embeddings in TensorFlow".

We'll practice **word-level tokenization** (mapping our words to numbers) first.
To do so, we need to do two things:

* Use a **tokenizer** to split text into **tokens**.
* Build a **vocabulary** of those tokens.

Both of those steps can be handled using the **[TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization)** layer.

### Limiting vocabulary size

In the AG News dataset, the vocabulary is rather large: more than 100k words. Generally speaking, we don't need words that appear only rarely in the text: only a few sentences contain them, so the model can learn little from them. Thus, it makes sense to limit the vocabulary to a smaller size by passing the `max_tokens` argument to the vectorizer constructor.

Let's instantiate the vectorizer object, and then call the `adapt` method to go through the text and build a vocabulary:

In [None]:
from tensorflow.keras import layers

vocab_size = 50000

vectorizer = layers.TextVectorization(max_tokens=vocab_size)
vectorizer.adapt(ds_train.take(1000).map(extract_text))

> ⭐ **Note:** We are using only a subset of the whole dataset (`ds_train.take(1000)`) to build the vocabulary. We do this to speed up execution and not keep you waiting. However, we take the risk that some words from the whole dataset will not be included in the vocabulary, so they will be mapped to the unknown token and effectively ignored during training. Thus, using the full vocabulary size and running through the whole dataset during `adapt` should increase the final accuracy, but not significantly.

Now we can access the actual vocabulary:

In [None]:
# Get the unique words in the vocabulary
vocab = vectorizer.get_vocabulary()

# Length of the vocabulary
vocab_size = len(vocab)
print(f"Number of words in vocab: {vocab_size}")

# most common tokens (notice the [UNK] token for "unknown" words)
top_5_words = vocab[:5]
print(f"Top 5 most common words: {top_5_words}")

# least common tokens
bottom_5_words = vocab[-5:]
print(f"Bottom 5 least common words: {bottom_5_words}")

Using the vectorizer, we can easily encode any text into a sequence of numbers:

In [None]:
# Create sample sentence and tokenize it
sample_sentence = "I like deep learning"
vectorizer([sample_sentence])

# 3 - BoW as a Simple Language Model

Because words represent meaning, sometimes we can figure out the meaning of a piece of text by just looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like *weather* and *snow* are likely to indicate *weather forecast*, while words like *stocks* and *dollar* would count towards *financial news*.

The **bag-of-words** (BoW) vector representation is the simplest traditional vector representation to understand. Each word is linked to a vector index, and the vector element at that index contains the number of occurrences of the word in a given document.

> ⭐ **Note**: BoW is essentially the sum of all one-hot-encoded vectors for individual words in the text.
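
The note above can be checked with a few lines of NumPy. This is a small sketch with a made-up five-word vocabulary and hand-picked token ids (both are assumptions for illustration, not output of the vectorizer used in this lab).

```python
import numpy as np

toy_vocab = ["[UNK]", "i", "like", "deep", "learning"]
token_ids = [1, 2, 3, 3, 4]  # "i like deep deep learning"

# One row per token, one column per vocabulary entry.
one_hot = np.eye(len(toy_vocab))[token_ids]  # shape: (num_tokens, vocab_size)

# Summing over the token axis gives the word counts, i.e. the BoW vector.
bow = one_hot.sum(axis=0)                    # shape: (vocab_size,)
print(bow)
```

The entry for "deep" is 2 because the word occurs twice, while the word order is no longer recoverable from `bow`.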

In [None]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import SparseCategoricalAccuracy

def build_bow_model(vectorizer, vocab_size, num_classes):

  input = layers.Input(shape=(1,), dtype=tf.string)
  x = vectorizer(input)
  x = tf.one_hot(x, vocab_size)
  x = tf.reduce_sum(x, axis=1)
  output = layers.Dense(num_classes, activation='softmax')(x)

  model = tf.keras.models.Model(input, output)
  model.compile(
      loss=SparseCategoricalCrossentropy(),
      optimizer=Adam(),
      metrics=[SparseCategoricalAccuracy()]
  )

  model.summary()  # summary() prints the table itself

  return model


In [None]:
bow_model = build_bow_model(vectorizer, vocab_size, num_classes)

In the model `summary`, in the *Output Shape* column, the first tensor dimension `None` corresponds to the minibatch size, and the second corresponds to the length of the token sequence. All token sequences in the minibatch have different lengths. We'll discuss how to deal with it when implementing RNNs.

Here is an example computation of a BoW vector:

In [None]:
sample_sentence = tf.convert_to_tensor(["I like deep learning"], dtype=tf.string)

In [None]:
for layer_idx, layer in enumerate(bow_model.layers[1:-1], start=1):

  # define model with auxiliary output
  aux_output_model = tf.keras.models.Model(bow_model.input, layer.output)

  # get auxiliary output
  out = aux_output_model(sample_sentence).numpy()

  print(f"Layer {layer_idx} - {layer.name}: output shape: {out.shape}")
  print(f"output = {out}")
  if layer_idx == 1:
    indices = out # get word_indices
  elif layer_idx == 3:
    print(f"output[indices] = {out[0,indices]}\n")
  print()

In [None]:
# optimize the datasets for training
BATCHSIZE = 128
AUTOTUNE = tf.data.AUTOTUNE

ds_train_opt = ds_train.map(tupelize).cache().shuffle(1000).batch(BATCHSIZE).prefetch(AUTOTUNE)
ds_test_opt = ds_test.map(tupelize).cache().batch(1000).prefetch(AUTOTUNE)

Now that we have learned how to build the bag-of-words representation of our text, let's train a classifier that uses it:

In [None]:
bow_history = bow_model.fit(
    ds_train_opt,
    validation_data=ds_test_opt,
    epochs=5
)

Since we have 4 classes, random guessing would yield roughly 25% accuracy (assuming balanced classes), so an accuracy above 80% is a good result.

In [None]:
# @title define `plot_history()`
from matplotlib import pyplot as plt

def plot_history(history):
  fig, (ax1, ax2) = plt.subplots(2,1, sharex=True, dpi=150)
  ax1.plot(history.history['loss'], label='training')
  ax1.plot(history.history['val_loss'], label='validation')
  ax1.set_ylabel('Loss')
  ax1.set_yscale('log')
  if 'lr' in history.history:
    ax1b = ax1.twinx()
    ax1b.plot(history.history['lr'], 'g-', linewidth=1)
    ax1b.set_yscale('log')
    ax1b.set_ylabel('Learning Rate', color='g')
  ax1.legend()

  key = None
  for k in sorted(history.history.keys()):
    if 'acc' in k and 'val_' not in k:
      key = k
      break
  if key:
    ax2.plot(history.history[key], label='training')
    ax2.plot(history.history['val_'+key], label='validation')
    ax2.set_ylabel('Accuracy')
    ax2.set_xlabel('Epochs')
  plt.show()

In [None]:
plot_history(bow_history)

# 4 - Addition of Embedding Layer


When training the classifier based on BoW, we operated on high-dimensional bag-of-words vectors with length `vocab_size`, and we were explicitly converting from low-dimensional positional representation vectors into sparse one-hot representation. This one-hot representation, however, is not memory-efficient. In addition, each word is treated independently from each other, i.e. one-hot encoded vectors do not express any semantic similarity between words.

The idea of **embedding** is to represent words using lower-dimensional dense vectors that reflect the semantic meaning of the word.
So, an embedding layer takes a word as input, and produces an output vector of specified `embedding_size`. In a sense, it is very similar to a `Dense` layer, but instead of taking a one-hot encoded vector as input, it's able to take a word number.
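
The similarity to a `Dense` layer can be verified numerically. The sketch below uses a random matrix as a stand-in for learned embedding weights (the sizes and the token id are arbitrary assumptions) and shows that multiplying a one-hot vector by the matrix gives the same result as simply selecting one row.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embedding_size = 6, 3

# Stand-in for the learned weights: one row per token in the vocabulary.
embedding_matrix = rng.normal(size=(toy_vocab_size, toy_embedding_size))

token_id = 4
one_hot = np.eye(toy_vocab_size)[token_id]

via_matmul = one_hot @ embedding_matrix  # a bias-free dense layer applied to a one-hot input
via_lookup = embedding_matrix[token_id]  # an embedding lookup: pick one row

print(np.allclose(via_matmul, via_lookup))
```

The lookup avoids ever materializing the large one-hot vector, which is why it is the more memory-efficient formulation.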

By using an embedding layer as the first layer in our network, we can switch from bag-of-words to an **embedding bag** model, where we first convert each word in our text into the corresponding embedding, and then compute some aggregate function over all those embeddings, such as `sum`, `average` or `max`.  
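
The aggregation step can be sketched in NumPy as follows. The embedding matrix is random and the token ids are made up, so this only illustrates the mechanics, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(10, 4))  # 10 tokens, 4-dimensional embeddings

token_ids = np.array([3, 7, 1, 7])
embedded = embedding_matrix[token_ids]       # shape: (num_tokens, embedding_size)

# Three common ways to collapse the token axis into one fixed-size vector.
bag_sum = embedded.sum(axis=0)
bag_mean = embedded.mean(axis=0)
bag_max = embedded.max(axis=0)

# Reversing the token order does not change the aggregate.
shuffled = embedding_matrix[token_ids[::-1]]
print(np.allclose(bag_sum, shuffled.sum(axis=0)))
```

Whatever the sequence length, the result has `embedding_size` entries, which is what lets a `Dense` classifier follow directly. The price is that word order is lost.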

In [None]:
def build_embedding_bag_model(vectorizer, vocab_size, embedding_size, num_classes):

  embedding_layer = layers.Embedding(
      input_dim=vocab_size, # size of the vocabulary (number of distinct token ids)
      output_dim=embedding_size, # size of each embedding vector
  )

  input = layers.Input(shape=(1,), dtype=tf.string)
  x = vectorizer(input)
  x = embedding_layer(x)
  x = tf.reduce_sum(x, axis=1)  # sum the embeddings over the token axis
  # or: layers.Lambda(lambda x: tf.reduce_sum(x,axis=1))(x)
  output = layers.Dense(num_classes, activation='softmax')(x)

  model = tf.keras.models.Model(input, output)
  model.compile(
      loss=SparseCategoricalCrossentropy(),
      optimizer=Adam(),
      metrics=[SparseCategoricalAccuracy()]
  )

  model.summary()  # summary() prints the table itself

  return model


In [None]:
emb_bag_model = build_embedding_bag_model(vectorizer, vocab_size, 128, num_classes)

In [None]:
emb_bag_history = emb_bag_model.fit(
    ds_train_opt,
    validation_data=ds_test_opt,
    epochs=5
)

In [None]:
plot_history(emb_bag_history)

We will later discuss how to build meaningful word embeddings and discuss their cool properties, but for now let's just think of embeddings as a way to reduce the dimensionality of a word vector.

# 5 - A simple RNN

In previous sections, we have been using rich semantic representations of text and a simple linear classifier on top of the embeddings. What this architecture does is to capture the aggregated meaning of words in a sentence, but it does not take into account the order of words, because the aggregation operation on top of embeddings removed this information from the original text. Because these models are unable to model word ordering, they cannot solve more complex or ambiguous tasks such as text generation or question answering.

To capture the meaning of a text sequence, we'll use a neural network architecture called **recurrent neural network**, or RNN. When using an RNN, we pass our sentence through the network one token at a time, and the network produces some **state**, which we then pass to the network along with the next token.

Because state vectors are passed through the network, the RNN is able to learn sequential dependencies between words. For example, when the word *not* appears somewhere in the sequence, it can learn to negate certain elements within the state vector.

Let's see how recurrent neural networks can help us classify our news dataset.

In the case of a simple RNN, each recurrent unit is a small dense network, which takes in an input vector and a state vector, applies a linear transformation followed by a non-linearity (`tanh` by default in Keras), and produces a new state vector. In Keras, this can be represented by the [`SimpleRNN`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN) layer.
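
The recurrence can be written out in a few lines of NumPy. This is a minimal sketch with random weights and a `tanh` activation; the names `W_x`, `W_h`, and `b` are chosen for this illustration and are not taken from the Keras implementation.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b):
    """Run a tanh RNN over inputs of shape (timesteps, input_dim); return the final state."""
    h = np.zeros(W_h.shape[0])  # initial state
    for x_t in inputs:
        # the new state depends on the current input AND the previous state
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

rng = np.random.default_rng(0)
input_dim, hidden_size = 4, 3
W_x = rng.normal(size=(input_dim, hidden_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

sequence = rng.normal(size=(5, input_dim))  # 5 timesteps of 4-dimensional inputs
h_forward = simple_rnn(sequence, W_x, W_h, b)
h_reversed = simple_rnn(sequence[::-1], W_x, W_h, b)
print(h_forward)
print(h_reversed)
```

Unlike the embedding bag, feeding the same vectors in reverse order produces a different final state, so the model can in principle exploit word order.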

While we can pass one-hot encoded tokens to the RNN layer directly, this is not a good idea because of their high dimensionality. Therefore, we will use an embedding layer as before to lower the dimensionality of word vectors, followed by an RNN layer, and finally a `Dense` classifier.

> ⭐ **Note:** In cases where the dimensionality isn't so high, for example when using character-level tokenization, it might make sense to pass one-hot encoded tokens directly into the RNN cell.

In [None]:
def build_RNN_model(embedding_size, hidden_size, num_classes, max_vocab_size = 10000):

  vectorizer = layers.TextVectorization(max_tokens=max_vocab_size)
  # It's a new vectorizer, so we need to train it first:
  print('Training vectorizer')
  vectorizer.adapt(ds_train.take(1000).map(extract_text))

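  # size the embedding table from this vectorizer's own vocabulary, not the global `vocab_size`
  embedding_layer = layers.Embedding(vectorizer.vocabulary_size(), embedding_size)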

  input = layers.Input(shape=(1,), dtype=tf.string)
  x = vectorizer(input)
  x = embedding_layer(x)
  x = layers.SimpleRNN(hidden_size)(x)
  output = layers.Dense(num_classes, activation='softmax')(x)

  model = tf.keras.models.Model(input, output)
  model.compile(
      loss=SparseCategoricalCrossentropy(),
      optimizer=Adam(),
      metrics=[SparseCategoricalAccuracy()]
  )

  model.summary()  # summary() prints the table itself

  return model


In [None]:
rnn_model = build_RNN_model(128, 16, num_classes)

In [None]:
rnn_model_history = rnn_model.fit(
    ds_train_opt,
    validation_data=ds_test_opt,
    epochs=5
)

In [None]:
plot_history(rnn_model_history)

# 6 - LSTM: Long short-term memory

One of the main problems of RNNs is **vanishing gradients**. Sequences can be pretty long, and RNNs may have a hard time propagating the gradients all the way back to the first state of the network during backpropagation. When this happens, the network cannot learn relationships between distant tokens. One way to avoid this problem is to introduce **explicit state management** by using **gates**. The two most common architectures that introduce gates are **long short-term memory** (LSTM) and **gated recurrent unit** (GRU). We'll cover LSTMs here.
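
To get a feeling for how quickly gradients can vanish, consider a deliberately tiny model: a scalar RNN $h_t = \tanh(w \cdot h_{t-1} + x_t)$. The sketch below applies the chain rule step by step to compute $\partial h_T / \partial h_0$. The weight `w=0.9` and the constant input `0.5` are arbitrary choices for illustration.

```python
import math

def gradient_wrt_initial_state(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        # chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

grads = {
    steps: gradient_wrt_initial_state(w=0.9, inputs=[0.5] * steps)
    for steps in (1, 10, 50)
}
for steps, grad in grads.items():
    print(f"{steps:>2} steps: gradient = {grad:.3e}")
```

Each step multiplies the gradient by a factor smaller than 1, so the influence of the first state on the last one shrinks exponentially with the sequence length.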


An LSTM network is organized in a manner similar to an RNN, but there are **two states** that are passed across time: the actual (cell) state $c$, and the hidden vector $h$. At each unit, the hidden vector $h_{t-1}$ is combined with input $x_t$, and together they control what happens to the state $c_t$ and output $h_{t}$ through **gates**. Each gate has sigmoid activation (output in the range $[0,1]$), which can be thought of as a soft element-wise mask when multiplied by the state vector. LSTMs have the following gates:
* **forget gate** which determines which components of the vector $c_{t-1}$ we need to forget, and which to pass through.
* **input gate** which determines how much information from the input vector and previous hidden vector should be incorporated into the state vector.
* **output gate** which takes the new state vector and decides which of its components will be used to produce the new hidden vector $h_t$.
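
The three gates can be put together into a single LSTM step. The NumPy sketch below follows the standard LSTM formulation. Stacking the four weight blocks in the order forget, input, output, candidate is an assumption made for readability here and is not the layout Keras uses internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. Shapes: W (input_dim, 4*H), U (H, 4*H), b (4*H,)."""
    z = x_t @ W + h_prev @ U + b
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)             # forget gate: which parts of c_prev to keep
    i = sigmoid(i)             # input gate: how much new information to write
    o = sigmoid(o)             # output gate: which parts of the state to expose
    g = np.tanh(g)             # candidate values computed from x_t and h_prev
    c_t = f * c_prev + i * g   # update the cell state
    h_t = o * np.tanh(c_t)     # new hidden vector
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_size = 4, 3
W = rng.normal(size=(input_dim, 4 * hidden_size))
U = rng.normal(size=(hidden_size, 4 * hidden_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

Note that the cell state is updated additively (`f * c_prev + i * g`). When the forget gate is close to 1, information can flow across many timesteps almost unchanged, which helps against vanishing gradients.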

The components of the state $c$ can be thought of as flags that can be switched on and off. For example, when we encounter the name *Alice* in the sequence, we guess that it refers to a woman, and raise the flag in the state that says we have a female noun in the sentence. When we further encounter the words *and Tom*, we will raise the flag that says we have a plural noun. Thus by manipulating state we can keep track of the grammatical properties of the sentence.

> ⭐ **Note:** Here's a great resource for understanding the internals of LSTMs: [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah.

While the internal structure of an LSTM cell may look complex, Keras hides this implementation inside the [`LSTM`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) layer, so the only thing we need to do in the example above is to replace the recurrent layer:

In [None]:
def build_lstm_model(embedding_size, hidden_size, num_classes, max_vocab_size = 10000):

  vectorizer = layers.TextVectorization(max_tokens=max_vocab_size)
  # It's a new vectorizer, so we need to train it first:
  print('Training vectorizer')
  vectorizer.adapt(ds_train.take(1000).map(extract_text))

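  # size the embedding table from this vectorizer's own vocabulary, not the global `vocab_size`
  embedding_layer = layers.Embedding(vectorizer.vocabulary_size(), embedding_size)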

  input = layers.Input(shape=(1,), dtype=tf.string)
  x = vectorizer(input)
  x = embedding_layer(x)
  x = layers.LSTM(hidden_size)(x)
  output = layers.Dense(num_classes, activation='softmax')(x)

  model = tf.keras.models.Model(input, output)
  model.compile(
      loss=SparseCategoricalCrossentropy(),
      optimizer=Adam(),
      metrics=[SparseCategoricalAccuracy()]
  )

  model.summary()  # summary() prints the table itself

  return model


In [None]:
lstm_model = build_lstm_model(128, 16, num_classes)

In [None]:
lstm_model_history = lstm_model.fit(
    ds_train_opt,
    validation_data=ds_test_opt,
    epochs=5
)

In [None]:
plot_history(lstm_model_history)

# 7 - Bidirectional and multilayer RNNs

In our examples so far, the recurrent networks operate from the beginning of a sequence until the end. This feels natural to us because it follows the same direction in which we read or listen to speech. However, for scenarios which require random access of the input sequence, it makes more sense to run the recurrent computation in both directions. RNNs that allow computations in both directions are called **bidirectional** RNNs, and they can be created by wrapping the recurrent layer with a special [`Bidirectional` layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional).

> ⭐ **Note:** The `Bidirectional` layer is actually a wrapper that creates two copies of the layer within it, and sets the `go_backwards` property of one of those copies to `True`, making it go in the opposite direction along the sequence.
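
Conceptually, the wrapper behaves like the NumPy sketch below: two independent recurrent networks, one reading the sequence forwards and one backwards, whose final states are concatenated (concatenation is the default merge mode of the Keras `Bidirectional` layer). The simple `tanh` RNN and its random weights are stand-ins chosen for brevity.

```python
import numpy as np

def rnn_last_state(inputs, W_x, W_h):
    """Final state of a tanh RNN run over inputs of shape (timesteps, input_dim)."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

rng = np.random.default_rng(0)
input_dim, hidden_size = 4, 3

# Two independent sets of weights: one per direction.
fwd = (rng.normal(size=(input_dim, hidden_size)), rng.normal(size=(hidden_size, hidden_size)))
bwd = (rng.normal(size=(input_dim, hidden_size)), rng.normal(size=(hidden_size, hidden_size)))

sequence = rng.normal(size=(6, input_dim))
h_fwd = rnn_last_state(sequence, *fwd)        # reads first -> last
h_bwd = rnn_last_state(sequence[::-1], *bwd)  # reads last -> first
output = np.concatenate([h_fwd, h_bwd])
print(output.shape)
```

This is why the output of a bidirectional layer with `hidden_size` units has `2 * hidden_size` features.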

Recurrent networks, unidirectional or bidirectional, capture patterns within a sequence, and store them into state vectors or return them as output. As with convolutional networks, we can build another recurrent layer following the first one to capture higher level patterns, built from lower level patterns extracted by the first layer. This leads us to the notion of a **multi-layer RNN**, which consists of two or more recurrent networks, where the output of the previous layer is passed to the next layer as input.

Keras makes constructing these networks an easy task, because you just need to add more recurrent layers to the model. For all layers except the last one, we need to specify the `return_sequences=True` parameter, because we need the layer to return all intermediate states, and not just the final state of the recurrent computation.
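
The role of `return_sequences=True` becomes clearer when looking at tensor shapes. In the NumPy sketch below (a plain `tanh` RNN with random weights, used only as a stand-in), the first layer hands over its state at every timestep so that the second layer still receives a sequence.

```python
import numpy as np

def rnn_all_states(inputs, W_x, W_h):
    """Return the hidden state at every timestep, shape (timesteps, hidden_size)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
sequence = rng.normal(size=(6, 4))  # 6 timesteps, 4 features

layer1 = (rng.normal(size=(4, 5)), rng.normal(size=(5, 5)))
layer2 = (rng.normal(size=(5, 3)), rng.normal(size=(3, 3)))

states1 = rnn_all_states(sequence, *layer1)  # like return_sequences=True: (6, 5)
states2 = rnn_all_states(states1, *layer2)   # second layer consumes a sequence
final_state = states2[-1]                    # like return_sequences=False: (3,)

print(states1.shape, final_state.shape)
```

If the first layer returned only its final state, the second layer would receive a single vector instead of a sequence and would have nothing to iterate over.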

Let's build a two-layer bidirectional LSTM for our classification problem.

> ⭐ **Note:** This code again takes quite a long time to run, but it should give us the highest accuracy we have seen so far, so it is worth waiting for the result.

In [None]:
def build_bidirectional_two_layer_LSTM_model(embedding_size, hidden_size, num_classes, max_vocab_size = 10000):

  vectorizer = layers.TextVectorization(max_tokens=max_vocab_size)
  # It's a new vectorizer, so we need to train it first:
  print('Training vectorizer')
  vectorizer.adapt(ds_train.take(1000).map(extract_text))

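  # size the embedding table from this vectorizer's own vocabulary, not the global `vocab_size`
  embedding_layer = layers.Embedding(vectorizer.vocabulary_size(), embedding_size)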

  input = layers.Input(shape=(1,), dtype=tf.string)
  x = vectorizer(input)
  x = embedding_layer(x)
  x = layers.Bidirectional( layers.LSTM(hidden_size, return_sequences=True) )(x)
  x = layers.Bidirectional( layers.LSTM(hidden_size) )(x)
  output = layers.Dense(num_classes, activation='softmax')(x)

  model = tf.keras.models.Model(input, output)
  model.compile(
      loss=SparseCategoricalCrossentropy(),
      optimizer=Adam(),
      metrics=[SparseCategoricalAccuracy()]
  )

  model.summary()  # summary() prints the table itself

  return model


In [None]:
bidir_lstm = build_bidirectional_two_layer_LSTM_model(128, 16, num_classes)

In [None]:
bidir_lstm_history = bidir_lstm.fit(
    ds_train_opt,
    validation_data=ds_test_opt,
    epochs=5
)

In [None]:
plot_history(bidir_lstm_history)

***

# Congratulations!

You've learned how to represent **text as tensors**, use **word embeddings**, and train **text classification** models of increasing complexity, from **bag-of-words** classifiers and **simple RNNs** to **bidirectional multilayer LSTMs**.

Until now, we've focused on using RNNs to classify sequences of text. But they can handle many more tasks, such as **text generation** and machine translation; we'll consider those tasks in the next unit.

***