![image.png](attachment:image.png)

Have you ever wondered how Gmail autocompletes your sentences, or what powers the WhatsApp suggestions when you’re typing a message? The technology behind these helpful writing hints is machine learning. In this course, you'll build and train machine learning models for different natural language generation tasks. For example, you'll train a model on the literary works of Shakespeare and generate text in the style of his writing. You'll also learn how to create a neural translation model to translate English sentences into French. Finally, you'll train a seq2seq model to generate your own natural language autocomplete sentences, just like Gmail!

# Introduction to Sequential Data
The order of words in sentences is important (unless Yoda you are called). That’s why in this chapter, you’ll learn how to represent your data sequentially and use a neural network architecture to model your text data. You'll learn how to create and train a recurrent network to generate new text, character by character. You'll also use the names dataset to build your own baby name generator, using a very simple recurrent neural network and the Keras package.


Natural Language Generation is a subtopic of Natural Language Processing, or NLP. It deals with tasks that generate text automatically.

### Handling Sequential Data

Natural Language Generation systems can produce text for different applications, for example, the generation of sentences in a certain style, machine translation which is nothing but the generation of text in a different language, auto-completion of a sentence given part of that sentence as input, generation of textual summaries, automated chatbots, etc. In this course, you'll learn how to build systems for some of these tasks.

This section will introduce sequential data and ways to use that data for text generation. Throughout this chapter, you'll also be using what you learned to generate baby names from scratch. Sequential data is any kind of data where the order matters. Examples of these could be time-series data, text data from documents, DNA sequences, etc. We'll deal with text data in this course.

Text data refers to any data used in spoken or written language, and the order in which the words appear matters. For example, consider the sentence "I am learning Mathematics", where each word has its place: if the order is changed randomly, the meaning changes or the sentence becomes nonsensical. Thus, each word in a sentence depends on the words that came before it. This is true for characters as well.

Now, let's go through an example of a sequential text dataset. The dataset we will be using throughout this chapter is the names dataset which contains people's names where each word is a name. The names are independent, but the characters inside the names are ordered. Each name can be thought of as an ordered sequence of characters that follows some unknown pattern. More formally, the sequence of characters follows some probability distribution which is not known to us. Our goal is to guess this distribution from the existing names and generate new names that are similar to the names in this dataset.

The names dataset is a DataFrame with a single column having a name in each row as shown here.

Our goal is to train a model that will predict a new character given a set of characters as input. So, the model must understand when a name starts and ends. You can use special characters that are not used in any name in this dataset to mark the start and the end. These are called the start and the end token respectively. We'll use the 'tab' and the 'newline' character for this purpose.
![image.png](attachment:image.png)

The start token can be prepended to each name using a lambda function as shown here.
![image-3.png](attachment:image-3.png)

The end token can be appended similarly.
![image-4.png](attachment:image-4.png)
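A minimal sketch of what these two steps might look like, assuming the names live in a pandas DataFrame column called `name` (the column name and the three sample names are illustrative, not taken from the actual dataset):

```python
import pandas as pd

# Toy stand-in for the names dataset; the column name "name" is an assumption.
names = pd.DataFrame({"name": ["john", "mary", "anna"]})

# Prepend the start token (tab) to every name.
names["name"] = names["name"].apply(lambda x: "\t" + x)

# Append the end token (newline) to every name.
names["name"] = names["name"].apply(lambda x: x + "\n")

# repr() makes the otherwise invisible tab and newline characters visible.
print(repr(names["name"][0]))
```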

Machine learning models deal with numbers. So, we need to convert these sequences of characters into suitable integer representations. For this, we need to create the vocabulary which is a set of all unique characters used in all the names. We only have lowercase letters in our names dataset. So our vocabulary consists of all the lowercase letters plus the start and the end token. The function get_vocabulary() here is iterating over all the characters in each name and adding the character to the vocabulary if it is not already there.
![image-5.png](attachment:image-5.png)
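One possible way to write such a function is sketched below. The name `get_vocabulary` comes from the text above, but the body and the sample names are assumptions made for illustration:

```python
def get_vocabulary(names):
    """Collect every unique character that appears in the given names."""
    vocabulary = set()
    for name in names:
        for char in name:
            # Add the character only if it has not been seen before.
            if char not in vocabulary:
                vocabulary.add(char)
    return vocabulary


# Two illustrative names that already carry the start and end tokens.
names = ["\tjohn\n", "\tmary\n"]
vocabulary = get_vocabulary(names)
print(sorted(vocabulary))
```

Because a set ignores duplicates anyway, the explicit membership check is not strictly needed; it is kept here only to mirror the description.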

One trick to map these characters to integers is to sort the vocabulary and assign numbers in order. So, the tab character can be mapped to 0, newline can be mapped to 1, 'a' can be mapped to 2, 'b' can be mapped to 3 and so on. We can use the sorted function on the vocabulary to get a sorted list of characters. Then we can enumerate the sorted list, which generates tuples of index and character pairs. These can be saved in a dictionary as shown.
![image-6.png](attachment:image-6.png)

We can similarly save the reverse mapping where 2 is mapped to a, 3 is mapped to b and so on in another dictionary.
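Both mappings can be sketched in a few lines. The dictionary names `char_to_idx` and `idx_to_char` and the tiny vocabulary are assumptions chosen for illustration:

```python
# A tiny illustrative vocabulary: start token, end token, and three letters.
vocabulary = {"\t", "\n", "a", "b", "c"}

# Sorting puts tab first and newline second, followed by the letters.
sorted_vocab = sorted(vocabulary)

# Character -> integer mapping.
char_to_idx = {char: idx for idx, char in enumerate(sorted_vocab)}

# Integer -> character mapping (the reverse lookup).
idx_to_char = {idx: char for idx, char in enumerate(sorted_vocab)}

print(char_to_idx)
print(idx_to_char)
```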

### Introduction to Recurrent Neural Networks

Now, you'll learn about recurrent neural networks which are specially designed to make use of the order information present in sequential data.

**Feedforward neural networks** accept a fixed-sized input and produce a fixed-sized output using a fixed number of hidden layers in between. They assume that input samples are independent of each other, which makes them a poor fit for sequential data: to predict the next character in a word, you need to know which characters came before it in the sequence. Recurrent neural networks address this concern.
![image-7.png](attachment:image-7.png)

**Recurrence** They are called recurrent because they perform the same computations for every element in the sequence and the output depends on whatever elements came before. At each time-step, a recurrent neuron produces an output along with a hidden state. The state can be thought of as a memory of the network. It consolidates all the history information from the input data. The history and the current input are used together to predict the output. So, the current input and the hidden state from the previous time-step serve as the input for this timestep.
![image-8.png](attachment:image-8.png)
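The recurrence can be sketched in plain NumPy. This is an illustrative simple RNN step, not the course's code: the weight names `W_x`, `W_h`, `b`, the tanh non-linearity, and the layer sizes are all assumptions, and the hidden state doubles as the output here.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3  # illustrative sizes

# The same weights are reused at every time-step.
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)


def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)


# Three one-hot input vectors standing in for a three-element sequence.
sequence = np.eye(input_size)[[0, 2, 1]]

h = np.zeros(hidden_size)  # empty memory before the first time-step
for x_t in sequence:
    h = rnn_step(x_t, h)   # the new state depends on the input and the old state

print(h)
```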

**RNN for baby name generator** Recall the last lesson where we started processing the names dataset to be able to generate new names from scratch. The idea is to generate the next character given the current character and the history as input. Suppose we want to generate the name "john". In the first time-step, we need to input the tab character which should generate j as output. In the second time-step, we need to input j to the network which should return o and the state will keep track that the characters tab and j are already encountered. This will continue until every character of the name is processed. The inputs, outputs, and states are represented by vectors. At each time-step, the network transforms the input vector into the output vector and the state vector is updated to reflect characters already encountered.
![image-9.png](attachment:image-9.png)

**Encoding of the characters** Remember the character to integer mapping we created in the last lesson, and how machine learning models consume numeric values. Each character can be represented by a vector of length equal to the vocabulary size. The vector will have a 1 at the index which is the mapping of that character. All other positions will have zeros. This is called one-hot encoding. The vectors for each character will look like this.
![image-10.png](attachment:image-10.png)
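A small sketch of one-hot encoding, assuming a five-character vocabulary and a helper named `one_hot` (both are illustrative):

```python
import numpy as np

# Illustrative mapping; the real one covers every character in the dataset.
char_to_idx = {"\t": 0, "\n": 1, "a": 2, "b": 3, "c": 4}
vocab_size = len(char_to_idx)


def one_hot(char):
    """Return a vector of zeros with a single 1 at the character's index."""
    vector = np.zeros(vocab_size)
    vector[char_to_idx[char]] = 1
    return vector


print(one_hot("a"))  # [0. 0. 1. 0. 0.]
```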

The **number of time-steps** will be the length of the name. As the names have different lengths, the time-step can be made equal to the length of the longest name with shorter names padded with zero after the newline. The get_max_len function here figures out the length of the longest name by iterating over all the names, saving the lengths of the names in a list and finding out the maximum.
![image-11.png](attachment:image-11.png)
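The description above could translate into something like the following. The function name `get_max_len` is from the text; the body and the sample names are assumptions:

```python
def get_max_len(names):
    """Return the length of the longest name, start and end tokens included."""
    length_list = []
    for name in names:
        length_list.append(len(name))
    return max(length_list)


names = ["\tjohn\n", "\tanna\n", "\talexander\n"]
max_len = get_max_len(names)
print(max_len)
```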

The **input and the target vectors** are three dimensional. The first dimension is the number of names in the dataset, the second being the number of time steps which is the length of the longest name. The third dimension is the size of each one-hot encoded vector which is the vocabulary size.
![image-12.png](attachment:image-12.png)

Let's first define the input vector as a 3-dimensional zero vector. The first dimension of this vector is the number of names in the dataset, the second dimension is the length of the longest name which defines our step size and the third dimension is the size of the vocabulary. To fill this vector with data, we need to convert each character of each name to its one-hot encoded vector. The loop here, first iterates over each name and then each character of each name and converts each character to its one-hot encoding using the character to integer mappings.
![image-13.png](attachment:image-13.png)

The target vector can also be defined and initialized similarly as shown here.
![image-14.png](attachment:image-14.png)
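The sketch below builds both tensors for two toy names. It assumes that the target is the input shifted by one character, so that the model learns to predict the next character at every time-step; the variable names and the sample names are illustrative.

```python
import numpy as np

names = ["\tjo\n", "\tamy\n"]  # toy names with start and end tokens
vocabulary = sorted(set("".join(names)))
char_to_idx = {char: idx for idx, char in enumerate(vocabulary)}
max_len = max(len(name) for name in names)

# Shape: (number of names, number of time-steps, vocabulary size).
input_data = np.zeros((len(names), max_len, len(vocabulary)), dtype="float32")
target_data = np.zeros_like(input_data)

for n_idx, name in enumerate(names):
    for c_idx, char in enumerate(name):
        # One-hot encode the character at this time-step.
        input_data[n_idx, c_idx, char_to_idx[char]] = 1
        # Assumption: the target at step t is the input character at step t + 1.
        if c_idx > 0:
            target_data[n_idx, c_idx - 1, char_to_idx[char]] = 1

print(input_data.shape, target_data.shape)
```

Positions beyond the end of a shorter name stay all-zero, which is the zero padding mentioned earlier.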

**Build and compile recurrent neural network** Now, let's build the network using Keras. First, we create a sequential model, then add an RNN layer of 50 units. We set `return_sequences` to true so that the RNN layer outputs a vector for every time-step and not just one vector for the last time-step. This output sequence is then passed to a dense layer with softmax activation to generate the output. The softmax activation predicts probability values for each character in the vocabulary. The TimeDistributed wrapper layer applies the same dense layer at every time-step of the three-dimensional input. We can now compile this model using categorical cross-entropy loss and the Adam optimizer. Categorical cross-entropy loss is used when we have more than two labels. Here the output will be a character from the vocabulary, so the number of labels is the size of the vocabulary. Adam is an adaptive-learning-rate optimizer that often converges faster than plain stochastic gradient descent.
![image-15.png](attachment:image-15.png)
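A sketch of such a model in Keras follows. `SimpleRNN` is assumed as the RNN layer, and `max_len` and `vocab_size` are placeholder values rather than the real dataset's dimensions:

```python
from tensorflow.keras.layers import Dense, Input, SimpleRNN, TimeDistributed
from tensorflow.keras.models import Sequential

max_len, vocab_size = 12, 28  # placeholder values, not the real dataset's

model = Sequential([
    Input(shape=(max_len, vocab_size)),
    # return_sequences=True keeps one output vector per time-step.
    SimpleRNN(50, return_sequences=True),
    # The same softmax layer is applied at every time-step.
    TimeDistributed(Dense(vocab_size, activation="softmax")),
])

model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```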

**Check model summary** We can verify the architecture of our model using the model summary as shown here.
![image-16.png](attachment:image-16.png)

### Inference Using Recurrent Neural Networks

In the last set of exercises, we built a recurrent neural network using Keras and compiled it. Our input and target vectors are also ready. In this lesson, you'll learn how to train this network and get predictions from the trained model.

**Understanding training** You must be wondering what training means. You can think of any neural network model as a black box which given an input, produces an output, often called prediction. Each input and target pair x,y tells the network that the ideal output should be y when the input is x. When you provide an input x, the network will do some internal computations and produce an output, say z, which will be different from y. The whole purpose of training is to reduce this difference or error by adjusting the internal parameters of the network. After the network iterates over the full dataset several times, it'll start to produce output similar to what was present in the target examples.

**Input and target vectors for training** Remember that our input and target vectors are three-dimensional vectors whose first dimension is the number of samples or names in the dataset, the second dimension is the number of time steps which we defined as the length of the longest name and the third dimension is the size of the one-hot encoded vectors which is the size of the vocabulary. We need to use these vectors to train the model we built.

**Train recurrent network** We can use the Keras fit function to train the model. We need to pass the input data and the target data. In addition, we need to specify the batch size and the number of epochs. It is more efficient to adjust the parameters of the network after accumulating the error over a set of samples than to adjust them after every single sample. The number of samples after which the model adjusts the parameters is specified by the batch size. We also need to iterate over the full dataset a number of times to get the best result. The number of epochs specifies how many times the full dataset will be iterated over.
![image-18.png](attachment:image-18.png)
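To make the roles of batch size and epochs concrete, here is a self-contained sketch that trains a small model on random toy data. The data, the `SimpleRNN` layer, and the values `batch_size=4` and `epochs=2` are assumptions for illustration, not the settings used in the course:

```python
import numpy as np
from tensorflow.keras.layers import Dense, Input, SimpleRNN, TimeDistributed
from tensorflow.keras.models import Sequential

n_names, max_len, vocab_size = 8, 6, 5  # toy dimensions

# Random one-hot sequences standing in for the encoded names.
rng = np.random.default_rng(0)
indices = rng.integers(0, vocab_size, size=(n_names, max_len))
input_data = np.eye(vocab_size, dtype="float32")[indices]
# Toy targets: the input shifted by one time-step.
target_data = np.roll(input_data, -1, axis=1)

model = Sequential([
    Input(shape=(max_len, vocab_size)),
    SimpleRNN(50, return_sequences=True),
    TimeDistributed(Dense(vocab_size, activation="softmax")),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")

# 8 samples with batch_size=4 means two weight updates per epoch.
history = model.fit(input_data, target_data, batch_size=4, epochs=2, verbose=0)
print(history.history["loss"])
```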

**Predict first character** Now that the model is trained, we can use it for predictions. We trained the model in such a way that it'll produce the next character given the current character as input. The first character is the tab character, which is the start token. We can feed the tab character to the network and get the next character as output. We can create a three-dimensional zero vector for the output sequence and initialize it to contain the tab character. We can use the `predict_proba` method to get the probability distribution for the next character in the sequence (in recent Keras versions, where `predict_proba` is no longer available, `predict` returns the same softmax probabilities). As we want to generate the first character after tab, we need to slice the probability distribution list to get the probability distribution for the first character. Now, we can find the next character by sampling the vocabulary randomly using this probability distribution.
![image-19.png](attachment:image-19.png)

**Predict second character using the first** We can use the generated first character to predict the second character in the sequence. The same process can be used to sample the second character given the tab and the first character.
![image-20.png](attachment:image-20.png)

**Generate baby names** We can keep on generating characters in this manner until the end token or newline is encountered. We can also put a constraint on the maximum length of the names and stop when the number of generated characters reaches this maximum. We can create a function that does this inside a while loop as shown. In this function, the maximum length is set to be 10. So, it keeps on generating the characters until a newline is encountered or it reaches a maximum of 10 characters. We can also put this whole thing inside another loop to generate more names.
![image-21.png](attachment:image-21.png)
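The generation loop can be sketched independently of Keras. In the sketch below, `dummy_next_char_probs` is a hypothetical stand-in for the trained model: in practice it would one-hot encode the sequence so far, call the model, and slice out the distribution for the current time-step. The vocabulary and function names are illustrative.

```python
import numpy as np

vocabulary = ["\t", "\n", "a", "b", "n"]  # toy vocabulary
idx_to_char = {idx: char for idx, char in enumerate(vocabulary)}


def dummy_next_char_probs(sequence):
    """Hypothetical stand-in for the model: never emits tab, otherwise uniform."""
    return np.array([0.0, 0.25, 0.25, 0.25, 0.25])


def generate_name(next_char_probs, max_len=10, seed=None):
    """Sample characters until a newline appears or max_len is reached."""
    rng = np.random.default_rng(seed)
    sequence = "\t"  # every name starts with the start token
    name = ""
    while len(name) < max_len:
        probs = next_char_probs(sequence)
        # Sample an index from the vocabulary using the predicted distribution.
        next_idx = int(rng.choice(len(vocabulary), p=probs))
        next_char = idx_to_char[next_idx]
        if next_char == "\n":  # end token: stop generating
            break
        sequence += next_char
        name += next_char
    return name


# Wrap the function in another loop to generate several names.
generated = [generate_name(dummy_next_char_probs, seed=s) for s in range(5)]
print(generated)
```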

**Cool baby names** These are the ten names generated by our model. Check how similar some of them can be to actual human names. You can train the model for more epochs using a bigger dataset to make it even more accurate.
![image-22.png](attachment:image-22.png)


# Write Like Shakespeare

In this chapter, you’ll find out how to overcome the limitations of recurrent neural networks when input sequences span long intervals. To avoid vanishing and exploding gradient problems you'll be introduced to long short term memory (LSTM) networks that are more effective when working with long-term dependencies. You'll work on a fun project where you'll build and train a simple LSTM model using selected literary works of Shakespeare to generate new text in the unique writing style of Shakespeare.

### Limitations of Recurrent Neural Networks

Welcome back. In the previous chapter, you learned about recurrent neural networks and how they work on sequential data. However, recurrent neural networks are not very effective for longer sequences and we need a different kind of recurrence to handle long sequences. In this chapter, you'll get to know the limitations of simple recurrent neural networks and get introduced to long short term memory.

**Simple neural networks** Neural networks can be thought of as a set of nodes or neurons arranged in layers and nodes in different layers are connected by weights. The first layer gets the input data. Other layers get input from the previous layer.
![image.png](attachment:image.png)

**Computations in neural network** Each node multiplies the incoming inputs with the weights and adds them up. This is a linear transformation. A non-linear transformation is then applied to generate the output which is called the activation from that node. In theory, the combinations of linear and non-linear transformations can approximate any function which makes neural networks very powerful.
![image-2.png](attachment:image-2.png)
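As a small numeric illustration, the computation inside one node can be written as follows. The input values, the weights, and the choice of a sigmoid as the non-linear transformation are assumptions:

```python
import numpy as np

inputs = np.array([0.5, -1.0, 2.0])   # values arriving from the previous layer
weights = np.array([0.4, 0.3, -0.2])  # one weight per incoming connection

# Linear transformation: multiply inputs by weights and add them up.
z = np.dot(weights, inputs)

# Non-linear transformation (a sigmoid here) produces the activation.
activation = 1.0 / (1.0 + np.exp(-z))

print(z, activation)
```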

**Gradient and training** We can define the error at each output node to be a function of the expected output and the predicted output. It can be represented by the squared difference of the actual and the predicted output. In each training iteration, the errors for all training samples are added. Training a neural network is nothing but adjusting the values of the weights so that the error gets reduced. The gradient is defined to be the rate of change of the error with respect to the weights. If we can calculate the gradients for each weight, we can adjust the weight to reduce the error. The gradient values are multiplied by a small fraction and then subtracted from the weights. This fraction is called the learning rate, which influences how the weight values converge to the optimal value. The adjustment of the weight values can be repeated for several iterations until we reach sufficient accuracy.
![image-3.png](attachment:image-3.png)
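The update rule can be tried out on the smallest possible case: a single weight `w`, a single training sample, and a squared error. The sample values and the learning rate of 0.1 are illustrative assumptions:

```python
x, y = 2.0, 4.0      # one training sample: input and expected output
w = 0.0              # initial weight
learning_rate = 0.1  # illustrative value

errors = []
for _ in range(20):
    prediction = w * x
    error = (y - prediction) ** 2
    # Derivative of the squared error with respect to w.
    gradient = -2 * x * (y - prediction)
    # Move the weight a small step against the gradient.
    w = w - learning_rate * gradient
    errors.append(error)

print(w, errors[0], errors[-1])
```

The weight approaches 2.0, the value at which `w * x` equals `y`, and the error shrinks with every iteration.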

**Chain rule** The gradient values for each weight are calculated using a simple rule from calculus called the chain rule, which says that if z is a function of y and y is, in turn, a function of x, then the derivative of z with respect to x can be found by multiplying the derivative of z with respect to y and the derivative of y with respect to x. Gradient values in the output layer can be found by differentiation. For intermediate layers, the chain rule is applied.
![image-4.png](attachment:image-4.png)
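Written as a formula, the rule stated above reads:

$$
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
$$

As a worked example (the functions are chosen purely for illustration), take $z = y^2$ and $y = 3x + 1$. Then $\frac{dz}{dy} = 2y$ and $\frac{dy}{dx} = 3$, so

$$
\frac{dz}{dx} = 2y \cdot 3 = 6(3x + 1).
$$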

**Back-propagation** The gradient values for each layer can be found by backpropagation. This course will give you the intuition behind backpropagation and deep learning without delving much into the mathematics. The gradient in the output layer can be calculated by differentiating the error with respect to the weights. By the chain rule, the gradient for other layers will be the product of the gradient values of the subsequent layers. Intuitively, the gradient is calculated at the output layer and back-propagated towards the input layer.
![image-5.png](attachment:image-5.png)

**Vanishing and exploding gradients** The takeaway here is that the gradient for an internal weight is the product of many gradient values from the subsequent layers. As recurrent neural networks work over many time steps, the gradient values are propagated back from the last time-step towards the first. If the gradient values are small fractions, which they usually are, then the gradient becomes smaller and smaller as you move from the last time step toward the first; it eventually approaches zero and the neuron stops learning. This is known as the vanishing gradient problem. Conversely, if the gradient values are greater than one, the gradient becomes bigger and bigger with every time-step we move backward, resulting in gradient explosion.
![image-6.png](attachment:image-6.png)
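A quick calculation shows how fast this happens. The sketch assumes, for simplicity, that the same factor is multiplied in at each of 50 time-steps; the factors 0.5 and 1.5 are illustrative:

```python
time_steps = 50

small_factor = 0.5  # a per-step gradient value below one
large_factor = 1.5  # a per-step gradient value above one

# The back-propagated gradient is a product of one factor per time-step.
vanishing = small_factor ** time_steps
exploding = large_factor ** time_steps

print(f"product of {time_steps} factors of {small_factor}: {vanishing:.2e}")
print(f"product of {time_steps} factors of {large_factor}: {exploding:.2e}")
```

The first product is practically zero and the second runs into the hundreds of millions, even though neither factor looks extreme on its own.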

**Remedies** We can set a fixed number of time-steps till which we want to backpropagate to reduce the effect of the vanishing gradients. To remedy exploding gradient problems, we can clip the gradient values at each node. Both these workarounds will result in suboptimal training and reduce the prediction performance.
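Clipping by value can be illustrated with NumPy. The gradient values and the threshold of 1.0 are assumptions for illustration:

```python
import numpy as np

gradients = np.array([0.3, -7.5, 12.0, -0.05])  # illustrative gradient values
clip_value = 1.0                                # illustrative threshold

# Values outside [-clip_value, clip_value] are pulled back to the boundary.
clipped = np.clip(gradients, -clip_value, clip_value)

print(clipped)
```

In Keras, optimizers accept `clipvalue` and `clipnorm` arguments that apply this kind of clipping during training.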

### Introduction to Long Short Term Memory

In the previous lesson, you learned how simple recurrent neural networks struggle when modeling long sequences. In this lesson, you'll be introduced to long-short term memory, which is far less prone to the vanishing gradient problem and as a result can handle longer sequences more effectively.

**Long-term dependencies** We already know that RNNs can learn the future from the context of the past. Sometimes we only need the recent past to predict the future. For example, suppose we want to predict the last word in the sentence "The birds are flying in the sky". Here, to predict the word "sky" we only need to remember the last few words. Most often, we need context from further past to predict the future. Consider a text where the first few words are "I was born in Germany" followed by a lot of other words or sentences and finally it ends in "I can speak German". From the recent context, it is pretty evident that the last word would be the name of a language, however, to know which language it is we need to go further back to the beginning of the text. RNNs struggle to model such long-term dependencies because of vanishing and exploding gradient problems.

**Long-short term memory** networks are specifically designed to handle long-term dependencies. Remember that simple recurrent neural networks just have one state to capture historical information. This is not sufficient to capture the long term dependencies. So, long-short term memory networks use an additional state to capture the long term dependencies. Thus they'll have two states - one to capture the short term history and the other to capture the long term history. These states are called the hidden and the cell states respectively. At each time-step, a long short term memory node will accept the input and the hidden and cell states from the last time-step. Depending on the input data, it may forget or add new information in these hidden and cell states and pass them to the next time-step. The hidden state can also be used as the output if needed.

**Write like Shakespeare** To understand how effective LSTMs are to capture long term dependencies, let's deep dive into a case study where we'll generate text that imitates Shakespeare's unique style of writing based on a dataset of selected literary works of Shakespeare. You can get the vocabulary by creating a set out of the text data and sorting it as shown. Then, you can iterate over the vocabulary and create mappings of characters to integers and vice versa similar to how we did in the previous chapter.

**Input and target data** Effectively, the problem here is to generate the next character given a sequence of characters as input. So, the input to our model will be a sequence of characters and the output will be the next character in the sequence.

To create the training data for this problem, we'll divide the text into sequences of fixed length, say 40, and for each sequence find out the next character. The sequences of length 40 will be our inputs and the next character will be our target output. The loop here iterates over the full text and finds out sequences of length 40 and the next character in the sequence as shown and adds them to the input and target data.
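A sketch of this loop is shown below. The sample text, the variable names `sentences` and `next_chars`, and the step size of 1 are assumptions; the real input would be the full Shakespeare text:

```python
# Short illustrative text standing in for the Shakespeare dataset.
text = ("shall i compare thee to a summer's day? "
        "thou art more lovely and more temperate")

seq_len = 40  # length of each input sequence
step = 1      # assumption: slide the window one character at a time

sentences = []   # input sequences of length seq_len
next_chars = []  # the character that follows each sequence

for i in range(0, len(text) - seq_len, step):
    sentences.append(text[i:i + seq_len])
    next_chars.append(text[i + seq_len])

print(len(sentences))
print(repr(sentences[0]), "->", repr(next_chars[0]))
```

A larger step produces fewer, less overlapping training examples.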

**Create input and target vectors** Now, we need to convert these input and target data into vectors so that they can be fed to the LSTM network. For this, we'll create two vectors one for the input data and the other for the target data.

**Initialize input and target vector** We can fill these vectors by first iterating over all the sequences in the input data and then over the characters of each sequence and finding out the one-hot encoding of the character using the character to integer map as shown.
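The following sketch creates and fills both vectors for a short sample text. The text, the shortened sequence length of 10, and the names `x` and `y` are illustrative assumptions:

```python
import numpy as np

text = "to be, or not to be, that is the question"  # illustrative text
seq_len = 10  # shortened from 40 so the toy text yields several sequences

vocabulary = sorted(set(text))
char_to_idx = {char: idx for idx, char in enumerate(vocabulary)}

sentences = [text[i:i + seq_len] for i in range(len(text) - seq_len)]
next_chars = [text[i + seq_len] for i in range(len(text) - seq_len)]

# Input: (number of sequences, sequence length, vocabulary size).
x = np.zeros((len(sentences), seq_len, len(vocabulary)), dtype="float32")
# Target: (number of sequences, vocabulary size), one next character each.
y = np.zeros((len(sentences), len(vocabulary)), dtype="float32")

for s_idx, sentence in enumerate(sentences):
    for c_idx, char in enumerate(sentence):
        x[s_idx, c_idx, char_to_idx[char]] = 1
    y[s_idx, char_to_idx[next_chars[s_idx]]] = 1

print(x.shape, y.shape)
```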

**Create LSTM network in Keras** Now that our data is preprocessed, it's time to build the LSTM network. We need to create a sequential model and add an LSTM layer of 128 units. This will be followed by a dense layer with softmax activation. The output layer will predict a probability distribution over the vocabulary, so the size of the dense layer is the vocabulary size.

**Compile the model** We can go ahead and compile our model now using categorical cross-entropy loss and adam optimizer. We can verify the model architecture by checking the model summary.
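Putting the last two steps together, a sketch of the network might look like this. The sequence length of 40 matches the text above, while `vocab_size = 60` is a placeholder for the real vocabulary size:

```python
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Sequential

seq_len, vocab_size = 40, 60  # vocab_size is a placeholder value

model = Sequential([
    Input(shape=(seq_len, vocab_size)),
    # Only the final hidden state is returned, since one character is predicted.
    LSTM(128),
    # One output unit per character in the vocabulary.
    Dense(vocab_size, activation="softmax"),
])

model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```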

### Inference Using Long Short Term Memory

## Summary

Congratulations. This is the end. You have come a long way and developed conceptual understanding as well as hands-on experience of some very important natural language generation tasks.

**Section 1: Generate language using RNNs.** You learned how to generate short sequences using recurrent neural networks and built a model to generate innovative baby names.

**Section 2: Generate language using LSTMs.** You learned the limitations of simple recurrent neural networks for long sequences and used long short term memory networks to generate longer texts. You worked on language generation in Shakespeare's style of writing.

**Further studies.** You can learn more about recurrent neural networks, long short term memory networks, encoder-decoder architecture from these resources.
![image-2.png](attachment:image-2.png)

**Advanced concepts.** In recent times advanced concepts like attention and transformers emerged which are being heavily used to understand language data. You can read about them from these resources.
![image-3.png](attachment:image-3.png)

## Sources

Halder, Biswanath (2022). Natural Language Generation in Python, Datacamp. Available from https://app.datacamp.com/learn/courses/introduction-to-tensorflow-in-python

![image.png](attachment:image.png)