# Synthetic Friends Script

This notebook will build a model to generate a scene from the TV show 'Friends' using an NLP model. 

This project will include the following steps:

* Import and prepare the data
* Create the model
* Train the model
* Generate the synthetic scene


## Import required libraries and data

First we'll need to import the libraries we'll need as well as our data.

The data used in this project is the friends.csv file from a 2020 edition of tidy tuesday that can be found <a href="https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-09-08">here</a>.

In [109]:
import pandas as pd
import numpy as np
import csv
import random
import string
import json
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.models import Sequential, load_model

In [42]:
random.seed(2021)

#episode_names = json.loads('episode_names.json').read()

data = pd.read_csv('Data/corpus_data.csv').dropna()
data = data.loc[:, ['text', 'speaker', 'scene']]

data.head()


Unnamed: 0,text,speaker,scene
0,"[Scene: Monica and Rachel's, everyone is eatin...",Scene Directions,1
1,"Oh hey, you guys, look! Ugly Naked Guy is putt...",Phoebe Buffay,1
2,(They all run and join her at the window.),Scene Directions,1
3,I'd say from the looks of it; our naked buddy ...,Rachel Green,1
4,"Ironically, most of the boxes seem to be label...",Ross Geller,1


## Prepare the data

With the data loaded, I can now create the data we'll use to train the model. This will mean converting the data into a selection of lines, organised by scene. Once we've generated this data, we'll then visualise the first 10 scenes.

In [44]:
corpus = []

for i in range(0, len(data)):
    if i == 0:
        scene = data.iloc[i,2]
        scene_text = data.iloc[i,1] + ": " + data.iloc[i,0] + " "
    else:
        if scene == data.iloc[i,2]:
            scene_text = scene_text + data.iloc[i,1] + ": " + data.iloc[i,0] + " "
        else:
            corpus = corpus + [scene_text]
            scene = data.iloc[i,2]
            scene_text = data.iloc[i,1] + ": " + data.iloc[i,0] + " "
            
corpus[:10]

["Scene Directions: [Scene: Monica and Rachel's, everyone is eating some Chinese food.] Phoebe Buffay: Oh hey, you guys, look! Ugly Naked Guy is putting stuff in boxes! Scene Directions: (They all run and join her at the window.) Rachel Green: I'd say from the looks of it; our naked buddy is moving. Ross Geller: Ironically, most of the boxes seem to be labeled clothes. Rachel Green: Ohh, I'm gonna miss that big old squishy butt. Chandler Bing: And we're done with the chicken fried rice. Ross Geller: Hey! Hey! If he's moving, maybe I should try to get his place! #ALL#: Good idea! Yes! Ross Geller: It would be so cool to live across from you guys! Joey Tribbiani: Hey, yeah! Then we could do that telephone thing! Y'know, you have a can, we have a can and it's connected by a string! Chandler Bing: Or we can do the actual telephone thing. ",
 "Scene Directions: [Scene: Ugly Naked Guy's apartment, Ross, Rachel, and Phoebe are checking out the place. Luckily, Ugly Naked Guy is nowhere to be s

We'll also need to clean the data by converting it all to lowercase and removing all punctuation. This will make it easier for the model to process and work with.

In [76]:
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

corpus = [clean_text(x) for x in corpus]
corpus[:10]

['scene directions scene monica and rachels everyone is eating some chinese food phoebe buffay oh hey you guys look ugly naked guy is putting stuff in boxes scene directions they all run and join her at the window rachel green id say from the looks of it our naked buddy is moving ross geller ironically most of the boxes seem to be labeled clothes rachel green ohh im gonna miss that big old squishy butt chandler bing and were done with the chicken fried rice ross geller hey hey if hes moving maybe i should try to get his place all good idea yes ross geller it would be so cool to live across from you guys joey tribbiani hey yeah then we could do that telephone thing yknow you have a can we have a can and its connected by a string chandler bing or we can do the actual telephone thing ',
 'scene directions scene ugly naked guys apartment ross rachel and phoebe are checking out the place luckily ugly naked guy is nowhere to be seen ross geller oh my god i love this apartment isnt it perfect

The final step in preparing the data will be to tokenise this data. This means converting the words within the corpus into a method more readable for machine learning model.

In [77]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)

Finally we'll have to pad the sequences to make their lengths equal.

In [78]:
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)

## Prepare the model

With this data we can then build and train a model to begin to generate the scenes. The model we'll be using is an LSTM model from the Keras library.

In [79]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 1618, 10)          35300     
                                                                 
 lstm_3 (LSTM)               (None, 100)               44400     
                                                                 
 dropout_3 (Dropout)         (None, 100)               0         
                                                                 
 dense_3 (Dense)             (None, 3530)              356530    
                                                                 
Total params: 436,230
Trainable params: 436,230
Non-trainable params: 0
_________________________________________________________________


Finally, we can fit the model. This will take a long time so don't run this unless you have a day to spare. When we are re-running the notebook, we should simply load the model saved at the end rather than retraining the model by running this cell.

In [89]:
model.fit(predictors, label, epochs=100, verbose=10)

Epoch 1/100
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: closure mismatch, requested ('self', 'step_function'), but source function had ()
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: closure mismatch, requested ('self', 'step_function'), but source function had ()
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoc

<keras.callbacks.History at 0x22a4b0ecee0>

## Generating a scene

With the model finally trained we can then begin to set up the function necessary to generate a scene.

In [86]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted = np.argmax(model.predict(token_list), axis=-1)
        
        output_word = ""
        for word,index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " "+output_word
    return seed_text.title()

To determine how long the scene should be, we'll take a look at the word length of the average scene and use that.

In [99]:
sum(map(len, corpus))/len(corpus)

1681.393103448276

And finally, we can then generate the scene text with these inputs. For the scene text, we'll use 'Scene Directions' as it is how all the scenes start. It also ensures that we'll see how the model describes the scene it's writing.

In [103]:
print(generate_text("Scene Directions", 1681, model, max_sequence_len))

Scene Directions Scene Monica And Chandlers Apartment Chandler Is Been As Chandler Enters Monica And Monica Are Sitting In The Kitchen Table Rachel Green Hey Phoebe Buffay Hey Ross Geller Hi Rachel Green Hi Monica Geller Hi Rachel Green Hi Monica Geller Hi Rachel Green Hi Monica Geller Hi Rachel Green Hi Monica Geller Oh My God Monica Geller Whatwhats The Matter Was Our Late Ross Geller Oh My God Rachel Green Oh My God Rachel Green Oh Yeah But You Have To Do You Were In An Country Monica Geller Thats Right Chandler Bing Oh My God Rachel Green Oh My God Rachel Green Oh Yeah But You Have To Do You After Me Monica Geller Well I Know I Shouldnt Be So Lucky To Mess Monica Geller Ill See You Getting Married Chandler Bing I Cant Believe You Guys Mean We Win We Have To Do It Monica Geller What Will Colbert What Do You Say That Ross Geller Oh Yeah Chandler You Dont Wanna Get Off The Plane Ross Geller I Know It Was Incredible But It Was A Couple Of Years Or Ago They Are A Big More Richard Burke 

## Saving the Model

Finally, we'll save the model so that we can share it with others. When running this notebook, we should load thsi pretrained model rather than training another model.

In [107]:
model.save('Friends_writer.h5')