# ![Imgur](https://i.imgur.com/LqgoIet.png)  Roboyogi: the world's first robot yoga teacher 



Using *textgenrnn*\* to train a machine to produce yoga routines

\*https://github.com/minimaxir/textgenrnn




---



## Introduction

I work with neural networks in my master's research to predict properties of stars, so I wanted a project related to that work which could also teach me about modern neural network architectures and methods I haven't yet used. Becoming familiar with other architectures might give me more creative insight when tackling problems in my research. Instead of sticking to the pure science route, however, I wanted to choose a project that I could have a little fun with, one leaning more towards art than science. A project in the field of natural language processing (NLP) seemed to suit these goals well.

Two domains humans are particularly proficient in are image analysis and language analysis. Neural networks have made enormous progress in the image domain, but there’s still a lot of work to be done with language. Nevertheless, there have been some impressive results published, like the now famous article [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), which showed how a recurrent neural network (RNN) could produce novel texts ranging from Shakespeare to the Linux kernel. The use of neural networks in NLP (and, more generally, on data with structure in time) has fascinated me since I read that article, so I wanted to get my hands dirty with some of the modern methods used for this.

The project I decided to work on was using an RNN to produce text in the form of yoga routines. Why yoga? Well, the ancient Indian practice of yoga has in large part been warped and modified by our modern culture, often leaving today’s average practitioner oblivious to and separated from its meditative and spiritual roots. To take this separation a step further and complete yoga's transition into our digital world, I wanted to create the world’s first robot yoga instructor.

## Recurrent Neural Networks

RNNs are a type of neural network (NN) that takes an input vector (e.g. a word) and a hidden state (which only the NN sees) and produces a new hidden state. This new hidden state retains information about, i.e. has memory of, the last input vector, so that when a new input vector is sent through the NN, its output depends on the previous input vector. Chaining this process together means that words can be sent through the NN sequentially, each time producing a new hidden state which contains information about the previous words in the sentence. The output prediction of the RNN is thus dependent on information from all previous hidden states:

![Imgur](https://i.imgur.com/gWbgaLd.png)
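The update described above can be sketched in a few lines of NumPy. This is only an illustration of a generic "vanilla" RNN step, not the LSTM layers that textgenrnn actually uses; the weight names (`W_xh`, `W_hh`, `b_h`), the toy sizes, and the random inputs are all assumptions made for the example.

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN update: mix the current input with the previous hidden state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3  # toy sizes, chosen only for illustration

# Randomly initialised weights; a real network would learn these during training
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

sequence = rng.normal(size=(5, input_size))  # five stand-in "word" vectors
h = np.zeros(hidden_size)                    # initial hidden state
hidden_states = []
for x in sequence:
    h = rnn_step(x, h, W_xh, W_hh, b_h)      # h now depends on every input seen so far
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)
print(hidden_states.shape)  # (5, 3)
```

Note that the hidden state has the same size after every step, no matter how many inputs have been processed.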

There is a problem with this "vanilla" RNN architecture: the hidden state remains fixed in size regardless of how many words are used as input to the system. For this to occur, there must be some loss of information. If the input sequence is long, and the output relies heavily on the first word, it might be difficult for the important information to travel through to the last hidden state.

To fix this issue, modern RNN architectures incorporate ***attention***. The output of the RNN no longer depends on only the last hidden state: rather, it considers a weighted sum of all previous hidden states. Throughout training, therefore, the RNN learns to pay attention to particular hidden states, those containing more potent information. In the case of a sentence, this translates to the RNN focusing on important words, leading to more coherent sentences and a better understanding of the structure of language.  

![Imgur](https://i.imgur.com/m67PI5J.png)

Above: RNN with attention
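As a rough sketch of the weighted sum described above, the snippet below scores each hidden state, turns the scores into weights with a softmax, and averages the hidden states using those weights. It is a simplified stand-in rather than textgenrnn's attention layer; `score_vector` is a hypothetical learned parameter, and the hidden states are random placeholders.

```python
import numpy as np

def attention_context(hidden_states, score_vector):
    """Return a weighted sum of hidden states and the attention weights used."""
    scores = hidden_states @ score_vector            # one relevance score per time step
    scores = scores - scores.max()                   # stabilise the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()  # weights are positive and sum to 1
    context = weights @ hidden_states                # weighted sum over time steps
    return context, weights

rng = np.random.default_rng(1)
hidden_states = rng.normal(size=(5, 3))  # placeholder: 5 time steps, hidden size 3
score_vector = rng.normal(size=3)        # hypothetical learned scoring parameters

context, weights = attention_context(hidden_states, score_vector)
print(weights.round(3), context.shape)
```

Time steps with larger weights contribute more to `context`, which is what "paying attention" to a particular word amounts to here.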

When applied to generating text, the output of an RNN is a probability for every word in its vocabulary, indicating how likely that word is to come next in the sentence. During training, the network is penalized when it assigns a low probability to the word that actually follows (equivalently, when it puts a high probability on words which almost never follow the preceding words). The loss function used for this task is the ***categorical cross entropy***, which generalizes the binary cross entropy loss function from two classes to many, with one output node per word.
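For a single prediction, the categorical cross entropy reduces to the negative log of the probability the network assigned to the word that actually came next. The toy vocabulary and probabilities below are made up purely to illustrate this.

```python
import numpy as np

def categorical_cross_entropy(probs, target_index, eps=1e-12):
    """Loss for one prediction: -log of the probability given to the true next word."""
    return -np.log(probs[target_index] + eps)

vocab = ["inhale", "exhale", "banana"]  # toy vocabulary (illustrative only)
probs = np.array([0.70, 0.25, 0.05])    # made-up network output, sums to 1

loss_likely = categorical_cross_entropy(probs, vocab.index("inhale"))
loss_unlikely = categorical_cross_entropy(probs, vocab.index("banana"))
print(loss_likely < loss_unlikely)  # True
```

The loss is small when the true word was given a high probability and grows quickly as that probability approaches zero.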

Once the RNN is trained, we can provide it with a sequence of words and it will output a *probability distribution* indicating how likely each word in its vocabulary is to be the next word. In practice, the next word is *randomly sampled* from the 3-5 most probable words. This random sampling helps produce diverse, more creative output, but because of its random nature it can sometimes produce nonsense. This is countered by increasing the probability of the most probable words and decreasing the probability of the least probable words. This tweaking of probabilities is controlled by a parameter called the ***temperature***: as the temperature approaches 0, only the most probable word is used (giving uncreative output), while a temperature of 1.0 leaves the network's probabilities unchanged and gives more random output. Typically a temperature somewhere in between is chosen to balance coherent sentences against creativity.
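The sampling procedure above can be sketched as follows. This is an illustrative implementation, not textgenrnn's own sampling code; the function names and the example probabilities are assumptions.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Sharpen (temperature < 1) or keep (temperature = 1) a probability distribution."""
    logits = np.log(np.asarray(probs, dtype=float) + 1e-12) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    scaled = np.exp(logits)
    return scaled / scaled.sum()

def sample_next_word(probs, temperature=0.5, top_n=3, rng=None):
    """Sample the index of the next word from the top_n most probable candidates."""
    probs = np.asarray(probs, dtype=float)
    if temperature <= 0:
        return int(np.argmax(probs))  # limit case: always take the most probable word
    if rng is None:
        rng = np.random.default_rng()
    top = np.argsort(probs)[-top_n:]  # indices of the top_n most probable words
    return int(rng.choice(top, p=apply_temperature(probs[top], temperature)))

probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over a 4-word vocabulary
rng = np.random.default_rng(42)
greedy = sample_next_word(probs, temperature=0)
samples = [sample_next_word(probs, temperature=0.5, top_n=3, rng=rng) for _ in range(200)]
print(greedy, sorted(set(samples)))
```

With `top_n=3`, the least probable word (index 3) can never be drawn, and lowering the temperature concentrates the samples further on index 0.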

Armed with our RNN architecture, we require a dataset to train it with.



---



## Data collection


**For the Python scripts referenced here, please see** https://github.com/Spiffical/roboyogi

![YouTube logo](https://www.gstatic.com/images/icons/material/product/2x/youtube_64dp.png)

There are many websites that provide yoga videos, but for this project I chose to use YouTube for a few important reasons:

*  There are many *thousands* of free yoga routines
*  The videos come with transcribed audio
*  YouTube provides an API with a simple Python client to automate the process of collecting video information




The process of creating a training dataset of yoga video captions for textgenrnn is as follows:

### 1. Searching YouTube for appropriate videos

Using the YouTube API, I wrote the script [search_yt_vids.py](https://github.com/Spiffical/roboyogi/blob/master/search_yt_vids.py), which takes as input a few search parameters such as the length of the video, words in the title, and how many search results should be returned. The output of the script is a text file (in JSON format for readability) containing video metadata. The following is an example output:

```
[
    {
        "title": "Total Body Yoga - Deep Stretch | Yoga With Adriene",
        "id": "GLy2rYHwUqY",
        "channelId": "UCFKE7WVJfvaHW5q283SxchA",
        "channelTitle": "Yoga With Adriene"
    },
    {
        "title": "Dedicate - Day 1 - Discern  |  Yoga With Adriene",
        "id": "IHkpIh7nj3M",
        "channelId": "UCFKE7WVJfvaHW5q283SxchA",
        "channelTitle": "Yoga With Adriene"
    }
]
```



This file can be examined and entries which are obviously not yoga related can be easily pruned by hand. 
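To illustrate how entries like those above can be derived, the sketch below flattens a response shaped like the YouTube Data API v3 `search.list` result into the same four fields. It is not the code from `search_yt_vids.py`; the function name is hypothetical, and `sample_response` is a hand-written stand-in for a real API response.

```python
import json

def extract_video_metadata(search_response):
    """Flatten a search.list-style response into title/id/channel records."""
    videos = []
    for item in search_response.get("items", []):
        video_id = item.get("id", {}).get("videoId")
        if video_id is None:
            continue  # skip channels and playlists, which carry no videoId
        snippet = item.get("snippet", {})
        videos.append({
            "title": snippet.get("title"),
            "id": video_id,
            "channelId": snippet.get("channelId"),
            "channelTitle": snippet.get("channelTitle"),
        })
    return videos

# Hand-written stand-in for an API response (not fetched from YouTube)
sample_response = {
    "items": [
        {"id": {"kind": "youtube#video", "videoId": "GLy2rYHwUqY"},
         "snippet": {"title": "Total Body Yoga - Deep Stretch | Yoga With Adriene",
                     "channelId": "UCFKE7WVJfvaHW5q283SxchA",
                     "channelTitle": "Yoga With Adriene"}},
        {"id": {"kind": "youtube#channel", "channelId": "UCFKE7WVJfvaHW5q283SxchA"},
         "snippet": {"title": "Yoga With Adriene"}},
    ]
}

videos = extract_video_metadata(sample_response)
print(json.dumps(videos, indent=4))
```

Writing the result with `json.dumps(..., indent=4)` gives a file that is easy to read and prune by hand.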

### 2. Collecting captions of the YouTube videos

The website http://www.diycaptions.com provides an excellent web-based interface for collecting captions of YouTube videos given their video IDs. This works well if you have only a handful of videos, but is quite limiting if, as in my case, you require captions for hundreds of videos.

Using [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (a Python package used for parsing HTML), I wrote the script [get_yt_captions.py](https://github.com/Spiffical/roboyogi/blob/master/get_yt_captions.py), which interfaces with diycaptions.com and, given the text file produced in the previous step, automates the collection of the ***human-generated*** captions from the videos. I emphasize the human-generated part because YouTube also provides automated captions, but these lack punctuation and are far from perfect (because they are produced by a neural network and, as we shall see, neural networks are not yet as good as humans at understanding language). The output of the script is a csv file containing the captions, and metadata about them, for every video.

By analyzing the metadata of the captions, the dataset can be further pruned by, for example, looking at the number of characters in the caption (there are videos found in the previous step which are >30 minutes in length but contain very few words, likely not videos with a yoga routine). 
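A minimal sketch of this kind of length-based pruning is shown below. The column names (`id`, `caption`), the threshold, and the two example rows are assumptions for illustration; the real csv produced by `get_yt_captions.py` may be laid out differently.

```python
import csv
import io

def prune_short_captions(rows, min_chars):
    """Keep only rows whose caption has at least min_chars characters."""
    return [row for row in rows if len(row["caption"]) >= min_chars]

# Tiny in-memory stand-in for the captions csv (hypothetical columns and content)
csv_text = (
    "id,caption\n"
    "vid_a,\"Welcome to class. Let's begin in a comfortable seat and take a deep breath in.\"\n"
    "vid_b,[Music]\n"
)
rows = list(csv.DictReader(io.StringIO(csv_text)))

kept = prune_short_captions(rows, min_chars=40)
print([row["id"] for row in kept])  # ['vid_a']
```

For real videos the threshold would be far larger, chosen by looking at the distribution of caption lengths.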

**NOTE**: the script sometimes needs to be executed multiple times because retrieving information from diycaptions.com fails for various reasons, despite my best efforts at debugging this. 
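One way to reduce the need for manual re-runs is to retry each failed request a few times before giving up. The helper below is a generic sketch and is not part of `get_yt_captions.py`; `fetch` stands for any function that downloads the captions for one video ID, and `flaky_fetch` is a fake used only to demonstrate the behaviour.

```python
import time

def fetch_with_retries(fetch, video_id, max_attempts=3, wait_seconds=2.0):
    """Call fetch(video_id), retrying on failure with a growing pause between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(video_id)
        except Exception:
            if attempt == max_attempts:
                raise                           # out of attempts: surface the error
            time.sleep(wait_seconds * attempt)  # wait a little longer each time

calls = {"count": 0}

def flaky_fetch(video_id):
    """Fake fetcher that fails twice, then succeeds (for demonstration only)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "captions for " + video_id

result = fetch_with_retries(flaky_fetch, "GLy2rYHwUqY", wait_seconds=0)
print(result, calls["count"])  # captions for GLy2rYHwUqY 3
```

Catching every `Exception` is deliberately broad here; in practice it would be better to catch only the network or parsing errors that are known to be transient.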

### 3. Generating the training dataset

The final step is to produce a text file where each line contains the caption of a video from the pruned list, a format which *textgenrnn* can use. I wrote a simple script [generate_trainingset.py](https://github.com/Spiffical/roboyogi/blob/master/generate_trainingset.py) which takes the csv file from the previous step and performs this task.
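The core of this step is making sure each caption occupies exactly one line. The sketch below shows one way to do that by collapsing all whitespace (including line breaks) inside a caption; it is an assumption about the approach rather than a copy of `generate_trainingset.py`, and the example captions are made up.

```python
def captions_to_training_lines(captions):
    """Collapse whitespace so that each non-empty caption becomes a single line."""
    lines = []
    for caption in captions:
        flattened = " ".join(caption.split())  # removes newlines, tabs, repeated spaces
        if flattened:                          # drop captions that are empty
            lines.append(flattened)
    return lines

# Made-up captions, including one that is only whitespace
captions = [
    "Inhale, reach up.\nExhale, fold forward.",
    "   ",
    "Welcome back.\r\nLet's begin.",
]

training_text = "\n".join(captions_to_training_lines(captions)) + "\n"
print(training_text)
```

The resulting string can be written to a text file and used with the `line_delimited` style of training, where every line is treated as a separate text.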

The end result of this data collection was the human-generated captions for **625 yoga videos**, a corpus of text **10MB** in size (which is decently large for a task like this). We can now train a model on this dataset!




---



## Training the RNN model

Import textgenrnn and other needed packages

```python
!pip install -q textgenrnn
from google.colab import files
from textgenrnn import textgenrnn
from datetime import datetime
import os
```



#### Mount Google Drive directory where training text file is stored, and where checkpointed model will be saved (and not erased when the runtime on Colab disconnects!)

```python
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
```



```python
google_drive_dir = "/content/gdrive/My Drive/UVic/ASTR511-computing/roboyogi/"
```

Define path to the text file containing all the yoga video captions

```python
text_file_name = "all_captions.txt"
text_file_path = google_drive_dir + text_file_name
```

#### Define the base network dictionaries

```python
model_cfg = {
    'word_level': True,         # set to True for training a word level model
    'rnn_size': None,           # number of LSTM cells of each layer
    'rnn_layers': None,         # number of LSTM layers
    'rnn_bidirectional': True,  # consider text both forwards and backward
    'max_length': None,         # number of words to consider before predicting the next
    'max_words': None,          # maximum number of words to model; the rest will be ignored
}

train_cfg = {
    'line_delimited': True,     # set to True if each text has its own line in the source file
    'save_epochs': 5,           # save the model every given number of epochs
    'num_epochs': 5,            # number of epochs to train for
    'gen_epochs': 5,            # generate sample text from the model every given number of epochs
    'train_size': 0.8,          # proportion of input data to train on (the rest is validation)
    'dropout': 0.2,             # model generalizes better with dropout, better scores on validation
    'validation': True,         # if train_size < 1.0, test on validation dataset
    'is_csv': False             # set to True if file is a csv
}
```

#### Mess with different hyperparameter settings

With the task of text generation, we want the trained network to produce output sentences which are ***similar*** to the text it is trained on but ***not exactly identical***; a successfully trained model will be *creative* in its output. In my case this means it will learn the common set of poses which exist in yoga routines and so be able to produce realistic and physically possible yoga poses, but which are *novel in their sequence*. 

Because of this, comparing a trained model's output to a test set makes little sense, since we don't want to mimic the text exactly, and there is therefore no single quantitative metric that defines an ideally trained model. Rather, we want to monitor the loss on the training and validation sets throughout training to ensure the model is not over- or under-fitting the data (i.e. if both the training and validation losses remain high, this indicates underfitting, and if the training loss is significantly lower than the validation loss, this indicates overfitting. In either case, we want to make sure the validation loss does not start increasing).

With all this in mind, I'll train a few models and determine which set of hyperparameters gives the lowest loss on the validation set. I'll then use these hyperparameters for further testing of the model.
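As a small illustration of this selection criterion, the helpers below pick the epoch with the lowest validation loss and flag the point where the validation loss starts rising while the training loss keeps falling. The loss values are invented for the example and are not taken from the training runs in this project.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch number with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

def is_overfitting(train_losses, val_losses):
    """True if validation loss rose in the last epoch while training loss still fell."""
    return (len(val_losses) >= 2
            and val_losses[-1] > val_losses[-2]
            and train_losses[-1] < train_losses[-2])

# Invented per-epoch losses, for illustration only
train_losses = [5.1, 4.2, 3.6, 3.1, 2.7]
val_losses = [5.0, 4.3, 3.9, 3.8, 4.0]

print(best_epoch(val_losses))                    # 4
print(is_overfitting(train_losses, val_losses))  # True
```

In this made-up run, the checkpoint saved closest to epoch 4 would be the one to keep.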

____________

I'll start with a model containing two RNN LSTM layers, each with 256 units. 

```python
model_cfg['rnn_size'] = 256
model_cfg['rnn_layers'] = 2
model_cfg['max_length'] = 20
model_cfg['max_words'] = 15000

model_save_name = 'fulltext_wordlevel_size%s_layers%s_maxlength%s_maxwords%s_drop%.1f' % (
    model_cfg['rnn_size'],
    model_cfg['rnn_layers'],
    model_cfg['max_length'],
    model_cfg['max_words'],
    train_cfg['dropout'])

# Model will be saved to my Google Drive account
model_save_path = google_drive_dir + model_save_name
```

```python
textgen = textgenrnn(name=model_save_path)

train_function = textgen.train_from_file

# Note: model_cfg['max_words'] is set above but is not passed to this call
train_function(
    file_path=text_file_path,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    save_epochs=train_cfg['save_epochs'],
    batch_size=1024,
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=100,
    word_level=model_cfg['word_level'])
```

```
625 texts collected.
Training new model w/ 2-layer, 256-cell Bidirectional LSTMs
Training on 1,940,734 word sequences.
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
####################
Temperature: 0.2
####################
```
hey everyone and welcome to yoga with adriene . i ' m adriene and this is benji and today we have an awesome yoga for the feet . this is a great way to build strength and strength , and to help you to feel like you ' re doing a great job . so , i ' m going to put the blanket underneath the bum , so you can see that my toes are coming in and out . so , i ' m going to bring the palms together and i ' m gonna bring my hands to the tops of the thighs . i ' m gonna draw the heels up towards the sky . and then i ' m going to inhale

**After 5 epochs, the results are looking very promising. Let's try increasing the number of layers to 3 to see whether a larger/deeper network will give better results.**

```python
model_cfg['rnn_size'] = 256
model_cfg['rnn_layers'] = 3
model_cfg['max_length'] = 20
model_cfg['max_words'] = 15000

model_save_name = 'fulltext_wordlevel_size%s_layers%s_maxlength%s_maxwords%s_drop%.1f' % (
    model_cfg['rnn_size'],
    model_cfg['rnn_layers'],
    model_cfg['max_length'],
    model_cfg['max_words'],
    train_cfg['dropout'])

model_save_path = google_drive_dir + model_save_name
```

```python
textgen = textgenrnn(name=model_save_path)

train_function = textgen.train_from_file

train_function(
    file_path=text_file_path,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    save_epochs=train_cfg['save_epochs'],
    batch_size=1024,
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=100,
    word_level=model_cfg['word_level'])
```

```
625 texts collected.
Training new model w/ 3-layer, 256-cell Bidirectional LSTMs
Training on 1,941,262 word sequences.
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
####################
Temperature: 0.2
####################
```
hey everyone , welcome to yoga with adriene . i ' m adriene and this is benji and today we have a yoga for beginners practice . this is a great practice for you . this is a great place to do . so , this is a great practice that you can do to just kind of feel like you ' re not holding your breath . so we ' re gonna start to wake up the body . so , we ' re gonna start to wake up the muscles of the legs . so , start with the hands on the floor , roll the shoulders back and down . so , you ' re just gonna hold your legs up . and then come back up . and then we ' re gonna do some of this exercise . so , we ' re gonna do some standing poses , so we ' re gonna start off in the middle of the boat . so , you ' re just gonna take your hands to your belly . and then you '

**The bigger/deeper model seems to give very similar results at the end of 5 epochs (according to the score on the validation set), so a more complex model does not seem necessary. Let us see what happens when we decrease the size of each layer.**

```python
model_cfg['rnn_size'] = 128
model_cfg['rnn_layers'] = 3
model_cfg['max_length'] = 20
model_cfg['max_words'] = 15000
train_cfg['num_epochs'] = 10

model_save_name = 'fulltext_wordlevel_size%s_layers%s_maxlength%s_maxwords%s_drop%.1f' % (
    model_cfg['rnn_size'],
    model_cfg['rnn_layers'],
    model_cfg['max_length'],
    model_cfg['max_words'],
    train_cfg['dropout'])

model_save_path = google_drive_dir + model_save_name
```

```python
textgen = textgenrnn(name=model_save_path)

train_function = textgen.train_from_file

train_function(
    file_path=text_file_path,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    save_epochs=train_cfg['save_epochs'],
    batch_size=1024,
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=100,
    word_level=model_cfg['word_level'])
```

```
625 texts collected.
Training new model w/ 3-layer, 128-cell Bidirectional LSTMs
Training on 1,940,806 word sequences.
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
####################
Temperature: 0.2
####################
```
hey everyone , welcome to yoga with adriene . i ' m adriene and today we have a nice , beautiful , beautiful , beautiful , beautiful , beautiful , beautiful , organic , yummy , long , smooth , deep breaths . so , we ' re gonna take a second to just notice how you feel . and then we ' ll take it to the other side . so , right hand comes to the center line . and then we ' re gonna come to a nice , low lunge . so we ' re gonna come to a nice low lunge here . so we ' re not just collapsing here , but we ' re not collapsing in the right shoulder here . we ' re not just collapsing . we ' re not just kind of cranking into the shape , but we ' re not just kind of cranking into the pose , but we ' re not just kind of cranking the hips , but we ' re just kind of givi



---

**After 10 epochs, the validation score here did not reach as low as the previous models after 5 epochs, indicating the previous architectures to be a better fit. Since the two previous architectures performed similarly, I will continue training the simpler network with 2 layers of 256 units each to see how low the validation loss will go. I will train for more epochs and checkpoint the model weights more frequently in order to select the best performing model.**

```python
train_cfg['num_epochs'] = 15  # Train for 15 epochs this time
train_cfg['save_epochs'] = 2  # Save model after every 2nd epoch
model_cfg['rnn_size'] = 256
model_cfg['rnn_layers'] = 2
model_cfg['max_length'] = 20
model_cfg['max_words'] = 15000

model_save_name = 'fulltext_wordlevel_size%s_layers%s_maxlength%s_maxwords%s_drop%.1f_2' % (
    model_cfg['rnn_size'],
    model_cfg['rnn_layers'],
    model_cfg['max_length'],
    model_cfg['max_words'],
    train_cfg['dropout'])

model_save_path = google_drive_dir + model_save_name
```

```python
textgen = textgenrnn(name=model_save_path)

train_function = textgen.train_from_file

train_function(
    file_path=text_file_path,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    save_epochs=train_cfg['save_epochs'],
    batch_size=1024,
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=100,
    word_level=model_cfg['word_level'])
```

```
625 texts collected.
Training new model w/ 2-layer, 256-cell Bidirectional LSTMs
Training on 1,941,803 word sequences.
Epoch 1/15
Epoch 2/15
Saving Model Weights — Epoch #2
Epoch 3/15
Epoch 4/15
```

**Let's generate some text and see some machine-generated yoga routines!**

```python
model_cfg['rnn_size'] = 256
model_cfg['rnn_layers'] = 2
model_cfg['max_length'] = 20
model_cfg['max_words'] = 15000

model_save_name = 'fulltext_wordlevel_size%s_layers%s_maxlength%s_maxwords%s_drop%.1f' % (
    model_cfg['rnn_size'],
    model_cfg['rnn_layers'],
    model_cfg['max_length'],
    model_cfg['max_words'],
    train_cfg['dropout'])

# Path of the model previously saved to my Google Drive account
model_save_path = google_drive_dir + model_save_name

# Files written by textgenrnn during training
model_weights = model_save_path + '_weights.hdf5'
vocab_path = model_save_path + '_vocab.json'
config_path = model_save_path + '_config.json'

# Load the trained model from its weights, vocabulary, and config
textgen = textgenrnn(weights_path=model_weights,
                     vocab_path=vocab_path,
                     config_path=config_path)

# Generate 5 texts, sampling from the 3 most probable words at each step
textgen.generate(n=5, return_as_list=False, prefix=None,
                 temperature=[0.6, 0.5, 0.5, 0.2, 0.2],
                 max_gen_length=600, top_n=3)
```

hey , welcome back to yoga with tim . today we ' re going to learn a core strength , a foot , but it ' s just a little bit of an awareness in your feet , so you find that lift up through the front body , grounding through the back body . and then we ' ll slowly release . come back to all fours . walk the knees underneath the hip points , curl the toes under , and send it up and back , downward facing dog . ( deep breath ) this time , we ' ll drop the right heel , lift the left leg up high . bend the right knee , and then step it up into your lunge . pivot on the back foot . strong warrior i here . inhale , open the chest . exhale , plant the palms . step the right toes back . lower the left knee . inhale , open up through the left arm . exhale , left hand to the heart . inhale , reach up . exhale , twist . inhale , reach the arms up . exhale , bend the knee over the ankle . keep the length in the knee , reach past your fingers . make sure your left knee is over the ankle , and then str

**The bias in the training set is clearly identifiable, given that almost all of the examples begin the same way. Roboyogi has clearly understood how to start a yoga practice, however, by welcoming everyone and setting up the intro to the routine. The sentences generated are mostly coherent, and the sequences of moves for the most part could be performed by a human. Unfortunately, the long term structure sort of breaks down, revealing the inability of this method to generate long yoga routines with coherent structure throughout.**

## Conclusions and future work

Roboyogi is quite limited in its ability to produce yoga routines, for a few reasons:

1.  The dataset is currently biased towards whoever is dominating the YouTube yoga market, and therefore the output mostly reflects their style of teaching.
2.  I would have liked to spend more time tweaking hyperparameters.
3.  The dataset is limited in size.

Unfortunately, there does not currently seem to be a simple way to remove the first limitation. The second limitation can be addressed by working on this project further.

The third limitation presents an interesting problem. The output of Roboyogi is currently not perfect (sometimes incoherent and repeats itself), likely due to not having access to enough data to train on. One way of mitigating this limitation is to adopt the [strategy implemented by a team at OpenAI](https://openai.com/blog/better-language-models/) which produced incredible results: pre-train a model on a massive dataset of general text from the internet, allowing for a larger and more complex architecture which can accommodate longer input sequences. This pre-trained network would have a deeper understanding of language and could produce much longer and more coherent yoga routines. 

Despite these limitations, Roboyogi can generate pretty creative output, and certainly it can generate novel and coherent sequences of yoga poses. This project was a lot of fun and I consider it a complete success given the constraints!