# Generating New TED Talk Descriptions with RNNs

intro


First we'll import our libraries and data and take a peek:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data_path = "ted_main.csv"
data = pd.read_csv(data_path)
data.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869


That looks good, but we don't need all those variables. Let's trim our dataset to just what we need, the description and name of the talk:

In [8]:
input_features = ["description", "name"]

data2 = data[input_features]
data2.head()

Unnamed: 0,description,name
0,Sir Ken Robinson makes an entertaining and pro...,Ken Robinson: Do schools kill creativity?
1,With the same humor and humanity he exuded in ...,Al Gore: Averting the climate crisis
2,New York Times columnist David Pogue takes aim...,David Pogue: Simplicity sells
3,"In an emotionally charged talk, MacArthur-winn...",Majora Carter: Greening the ghetto
4,You've never seen data presented like this. Wi...,Hans Rosling: The best stats you've ever seen


## Preprocessing:

Next, we need to encode the words as integers. We'll create two dictionaries that translate words to integer ids and back again. First, however, we'll combine all of the text in our input fields into a single sequence of words.

Before building these lookup tables, we must also separate punctuation from the words it is attached to; otherwise "creativity" and "creativity?" would be counted as two different words. The function token_lookup() supplies the mapping used to replace each punctuation symbol with a token such as ||Period||.


In [27]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenize dictionary where the key is the punctuation and the value is the token
    """
    token_dict = {"." : "||Period||",
                  "," : "||Comma||",
                  '"' : "||Quotation_Mark||",
                  ";" : "||Semicolon||",
                  "!" : "||Exclamation_Mark||",
                  "?" : "||Question_Mark||",
                  "(" : "||Left_Parentheses||",
                  ")" : "||Right_Parentheses||",
                  "--" : "||Dash||",
                  "\n" : "||Return||"}
    
    
    return token_dict


def create_jumbo_string(descriptions, names):
    """
    Create a single long string of all input text so as to facilitate in the creation of a lookup table
    """
    jumbo_string = ""
    
    for description in descriptions:
        jumbo_string = jumbo_string + description + " "
        
    for name in names:
        jumbo_string = jumbo_string + name + " "
        
        
    # Replace each punctuation symbol with its token, padded with spaces
    # so that split() later treats the token as a separate word
    token_dict = token_lookup()
    for key, token in token_dict.items():
        jumbo_string = jumbo_string.replace(key, " {} ".format(token))
        
        
    # Lowercase everything (this also lowercases the tokens) and split into words:
    jumbo_string = jumbo_string.lower()
    jumbo_string = jumbo_string.split()
    
    
    return jumbo_string


def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary words
    """
    from collections import Counter
    
    counts = Counter(text)
    vocab = sorted(counts, key = counts.get, reverse = True)
    vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
    int_to_vocab = {ii: word for ii, word in enumerate(vocab, 1)}
    
    return vocab_to_int, int_to_vocab
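
To make the preprocessing steps concrete, here is a minimal sketch that applies the same idea to a made-up sentence. The sample text and the reduced token dictionary are assumptions for illustration only; the logic mirrors create_jumbo_string() and create_lookup_tables() above.

```python
from collections import Counter

# Hypothetical sample text standing in for the TED descriptions.
sample = "Do schools kill creativity? Yes, schools do."

# A subset of the punctuation tokens defined in token_lookup().
tokens = {".": "||Period||", ",": "||Comma||", "?": "||Question_Mark||"}

# Pad each token with spaces so that split() treats it as its own word.
for symbol, token in tokens.items():
    sample = sample.replace(symbol, " {} ".format(token))

# Lowercasing also lowercases the tokens, e.g. "||period||".
words = sample.lower().split()
print(words)

# The most frequent words get the smallest ids; ids start at 1.
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}
int_to_vocab = {i: word for word, i in vocab_to_int.items()}

# Encode the word sequence as a list of integer ids.
encoded = [vocab_to_int[word] for word in words]
print(encoded)
```

Note that because the text is lowercased after the punctuation is replaced, the tokens stored in the vocabulary are lowercase ("||period||"), while the values in token_dict keep their original capitalization ("||Period||").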
    
    
    
    

In [33]:
descriptions = data2["description"]
names = data2["name"]

# Create single string of all input text
text = create_jumbo_string(descriptions, names)

# Generate lookup tables and token dictionary
vocab_to_int, int_to_vocab = create_lookup_tables(text)
token_dict = token_lookup()

# Encode the text as integer ids, using the vocab_to_int lookup table we just defined:

input_text = [vocab_to_int[word] for word in text]

In [46]:
print(input_text)

[2682, 1292, 2155, 93, 31, 1208, 4, 3683, 251, 169, 12, 324, 31, 303, 198, 11, 9539, 44, 620, 81, 9540, 43, 440, 3, 14, 2, 304, 576, 4, 333, 19, 9541, 8, 9, 31, 3108, 953, 1, 9, 898, 1938, 3684, 66, 1293, 130, 11, 1209, 27, 1753, 204, 91, 4632, 1, 21, 1939, 5, 3109, 7, 3110, 5, 39, 1, 4633, 1210, 1371, 12, 121, 2393, 3, 39, 465, 429, 2394, 110, 3685, 170, 4634, 32, 9542, 1145, 9543, 6005, 1, 4, 1754, 3111, 657, 6, 854, 11, 117, 23, 182, 3, 7, 194, 173, 70, 1, 19, 6006, 60, 441, 3, 8, 31, 4635, 4636, 25, 1, 9544, 556, 3686, 2683, 409, 40, 358, 12, 899, 621, 8, 2, 557, 3687, 10, 4, 63, 13, 3688, 1940, 3112, 89, 21, 6007, 703, 1094, 3, 954, 395, 504, 94, 3689, 64, 18, 3, 14, 2, 6008, 4, 6009, 6, 5, 9545, 1, 1146, 3113, 855, 1611, 6010, 1294, 26, 2, 3690, 9, 410, 38, 3, 9, 2395, 4637, 704, 2, 9, 856, 1000, 9, 11, 9546, 3691, 2684, 10, 4, 9547, 898, 1938, 8, 2, 1147, 2685, 3, 48, 111, 174, 6011, 9548, 3692, 20, 1372, 6012, 3114, 59, 162, 1, 23, 6013, 240, 5, 658, 7, 1148, 955, 40, 96, 1295,
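
The second lookup table makes it possible to go the other way. As a rough sketch, the helper below (decode is a hypothetical name, not part of the notebook) turns a list of ids back into readable text, using a made-up miniature vocabulary in place of the real int_to_vocab.

```python
# Hypothetical miniature lookup table; the real one comes from create_lookup_tables().
int_to_vocab = {1: "do", 2: "schools", 3: "kill", 4: "creativity",
                5: "||question_mark||", 6: "yes", 7: "||comma||", 8: "||period||"}

# A subset of token_lookup(): keys are punctuation symbols, values are tokens.
token_dict = {".": "||Period||", ",": "||Comma||", "?": "||Question_Mark||"}


def decode(ids, int_to_vocab, token_dict):
    """Map ids back to words and turn punctuation tokens back into symbols."""
    # The corpus was lowercased after tokenization, so match lowercase tokens.
    token_to_symbol = {token.lower(): symbol for symbol, token in token_dict.items()}
    text = " ".join(int_to_vocab[i] for i in ids)
    for token, symbol in token_to_symbol.items():
        # Drop the space before the token so punctuation attaches to the previous word.
        text = text.replace(" " + token, symbol)
    return text


print(decode([1, 2, 3, 4, 5, 6, 7, 2, 1, 8], int_to_vocab, token_dict))
```

This simple version attaches every symbol to the preceding word, which suits periods and commas; opening parentheses and quotation marks would need their own spacing rule.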

In [36]:
# Save the lookup tables and input_text
import pickle

# The with-block closes the file once the data has been written
with open('preprocess.p', 'wb') as f:
    pickle.dump((input_text, vocab_to_int, int_to_vocab, token_dict), f)
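
The saved tuple can be restored later by unpacking it in the same order it was written. The sketch below shows the round trip with small made-up values and a temporary file, so it does not touch the real preprocess.p; the loaded_* names are placeholders.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the real preprocessing outputs.
input_text = [1, 2, 3]
vocab_to_int = {"do": 1, "schools": 2, "kill": 3}
int_to_vocab = {1: "do", 2: "schools", 3: "kill"}
token_dict = {".": "||Period||"}

# Write to a temporary directory so this example does not overwrite preprocess.p.
path = os.path.join(tempfile.mkdtemp(), "preprocess.p")
with open(path, "wb") as f:
    pickle.dump((input_text, vocab_to_int, int_to_vocab, token_dict), f)

# Load the tuple back, unpacking in the same order it was saved.
with open(path, "rb") as f:
    loaded_text, loaded_v2i, loaded_i2v, loaded_tokens = pickle.load(f)

print(loaded_text == input_text)
```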

## Input

Use TensorFlow to create input placeholders.

In [38]:
import tensorflow as tf  # this notebook uses the TensorFlow 1.x API (tf.placeholder, tf.contrib)

def get_inputs():
    """
    Returns TF placeholders for input, targets, and learning rate
    """
    inputs = tf.placeholder(tf.int32, [None, None], name = "inputs")
    labels = tf.placeholder(tf.int32, [None, None], name = "labels")
    learning_rate = tf.placeholder(tf.float32, name = "learning_rate")
    
    return inputs, labels, learning_rate

## Build RNN Cell and Initialize

In [39]:
def build_init_cell(batch_size, rnn_size):
    """
    Create RNN cell and initialize it
    Outputs:
        Tuple(cell, initialized state)
    """
    
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    cell = tf.contrib.rnn.MultiRNNCell([lstm])
    initial_state = cell.zero_state(batch_size, tf.float32)
    initial_state = tf.identity(initial_state, name = "initial_state")
    
    return (cell, initial_state)

## Word Embedding

Apply embedding to input_data using TensorFlow. Return the embedded sequence.

In [40]:
def get_embed(input_data, vocab_size, embed_dim):
    """
    Create an embedding for input_data
    Inputs:
        input_data: TF placeholder for text input
        vocab_size: Number of words in our vocabulary
        embed_dim: Number of embedding dimensions
        
    Return: 
        Embedded Input
    """
    
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, input_data)
    
    return embed
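
Conceptually, an embedding lookup is row indexing into a matrix with one row per word id. The pure-Python sketch below illustrates the resulting shape without TensorFlow; the sizes and the input ids are made up, and the nested lists stand in for tensors.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

vocab_size, embed_dim = 10, 4

# One row of embed_dim values between -1 and 1 per word id,
# similar in spirit to the tf.random_uniform initializer above.
embedding = [[random.uniform(-1, 1) for _ in range(embed_dim)]
             for _ in range(vocab_size)]

# A batch of 2 sequences with 3 word ids each.
input_data = [[1, 5, 2],
              [7, 7, 0]]

# Lookup is plain row indexing: each id is replaced by its embedding row.
embed = [[embedding[word_id] for word_id in sequence] for sequence in input_data]

# Resulting shape: (batch_size, seq_length, embed_dim)
print(len(embed), len(embed[0]), len(embed[0][0]))
```

Because rows are indexed from 0 while the ids produced by create_lookup_tables() start at 1, vocab_size presumably needs to be at least one larger than the number of distinct words so that the largest id still has a row.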

## Build the Neural Network!

In [43]:
def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    """
    Build the neural network
    return: Logits, FinalState
    """
    
    # Embed the word ids, then run the embedded sequence through the RNN
    embed = get_embed(input_data, vocab_size, embed_dim)
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, dtype = tf.float32)
    final_state = tf.identity(final_state, name = "final_state")
    
    # Linear layer (no activation) mapping each RNN output to one logit per vocabulary word
    logits = tf.contrib.layers.fully_connected(outputs, vocab_size, activation_fn = None)
    
    return (logits, final_state) 

## Build a Method to Get Batches:

In [None]:
def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target
        int_text: Text with words replaced by their ids
        batch_size: the size of the batch
        seq_length: Length of sequence
    """
    
    n_batches = len(int_text)//(batch_size * seq_length)
    
    # Drop the last few words to make only full batches:
    x = np.array(int_text[: n_batches * batch_size * seq_length])
    y = np.array(int_text[1: n_batches * batch_size * seq_length + 1])
    y[-1] = x[0]
    
    x_batches = 