## Preparation (40 points total)
### [20 points] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).   

In [92]:
import numpy as np
import pandas as pd
import keras
from keras.preprocessing import sequence

df = pd.read_csv('./south-park-dialogue/All-seasons.csv')
df = pd.DataFrame(df.drop(['Character', 'Season', 'Episode'], axis=1).values, index=df['Character'])

y_string = df['Character'].values
uniques, y_ints, counts = np.unique(y_string, return_inverse=True,return_counts=True)
num_classes = len(uniques)

X_preprep = df.drop(['Character', 'Season', 'Episode'], axis=1).values
    
X_preprep = X_preprep.flatten()

y_ohe = keras.utils.to_categorical(y_ints, num_classes)

print(counts)
print(uniques)

                                                                 0
Character                                                         
Stan                    You guys, you guys! Chef is going away. \n
Kyle                                   Going away? For how long?\n
Stan                                                    Forever.\n
Chef                                             I'm sorry boys.\n
Stan             Chef said he's been bored, so he joining a gro...
Chef                                                        Wow!\n
Mrs. Garrison    Chef?? What kind of questions do you think adv...
Chef                What's the meaning of life? Why are we here?\n
Mrs. Garrison             I hope you're making the right choice.\n
Cartman          I'm gonna miss him.  I'm gonna miss Chef and I...
Stan             Dude, how are we gonna go on? Chef was our fuh...
Mayor McDaniels  And we will all miss you, Chef,  but we know y...
Jimbo                                                   Bye-by

KeyError: 'Character'

In [44]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

NUM_TOP_WORDS = None
padding = 100

tokenizer = Tokenizer(num_words=NUM_TOP_WORDS)
tokenizer.fit_on_texts(X_preprep)
sequences = tokenizer.texts_to_sequences(X_preprep)

X = pad_sequences(sequences, maxlen=padding)


In [51]:
embeddings_index = {}
word_index = tokenizer.word_index
embed_size = 100

f = open('glove.6B/glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

### [10 points] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.


### [10 points] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. Convince me that your cross validation method is a realistic mirroring of how an algorithm would be used in practice. 

In [46]:
from sklearn.model_selection import train_test_split

# X_train, X_test, y_train_ohe, y_test_ohe = train_test_split(X, y_ohe, test_size=0.2, stratify=y_string, random_state=8)


## Modeling (50 points total)
### [25 points] Investigate at least two different recurrent network architectures (perhaps LSTM and GRU). Adjust hyper-parameters of the networks as needed to improve generalization performance. 

In [60]:
from keras.layers import Embedding

embedding_layer = Embedding(len(word_index) + 1,
                            embed_size,
                            weights=[embedding_matrix],
                            input_length=padding,
                            trainable=False)

In [64]:
%%time
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

lstm_rnn = Sequential()
lstm_rnn.add(embedding_layer)
lstm_rnn.add(LSTM(100, dropout=.2, recurrent_dropout=.2))
lstm_rnn.add(Dense(num_classes, activation='sigmoid'))

lstm_rnn.compile(loss='categorical_crossentropy', 
              optimizer='rmsprop', 
              metrics=['accuracy'])
print(lstm_rnn.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 100)          2684000   
_________________________________________________________________
lstm_3 (LSTM)                (None, 20)                9680      
_________________________________________________________________
dense_3 (Dense)              (None, 3950)              82950     
Total params: 2,776,630
Trainable params: 92,630
Non-trainable params: 2,684,000
_________________________________________________________________
None
Wall time: 742 ms


In [66]:
lstm_rnn.fit(X, y_ohe, validation_data=(X, y_ohe), epochs=3, batch_size=64)

Train on 70896 samples, validate on 70896 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x19e2db91eb8>

In [67]:
lstm_rnn.fit(X, y_ohe, validation_data=(X, y_ohe), epochs=3, batch_size=64)

Train on 70896 samples, validate on 70896 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x19e2fa7a160>

In [68]:
from keras.layers import GRU

gru_rnn = Sequential()
gru_rnn.add(embedding_layer)
gru_rnn.add(GRU(100, dropout=.2, recurrent_dropout=.2))
gru_rnn.add(Dense(num_classes, activation='sigmoid'))

gru_rnn.compile(loss='categorical_crossentropy', 
              optimizer='rmsprop', 
              metrics=['accuracy'])
print(gru_rnn.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 100)          2684000   
_________________________________________________________________
gru_1 (GRU)                  (None, 20)                7260      
_________________________________________________________________
dense_4 (Dense)              (None, 3950)              82950     
Total params: 2,774,210
Trainable params: 90,210
Non-trainable params: 2,684,000
_________________________________________________________________
None


In [69]:
gru_rnn.fit(X, y_ohe, validation_data=(X, y_ohe), epochs=3, batch_size=64)

Train on 70896 samples, validate on 70896 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x19e31076eb8>

In [84]:
def data_prep(words):
    tokened = np.array(tokenizer.texts_to_sequences(words))
    return pad_sequences(tokened, maxlen=padding)

sent = ['hello, my name is Stan']
blegh = gru_rnn.predict(data_prep(sent))
print()

(1, 3950)


## Exceptional Work (10 points total)
You have free reign to provide additional analyses.
### One idea: Use more than a single chain of LSTMs or GRUs (i.e., use multiple parallel chains). 
Another Idea: Try to create a RNN for generating novel text. 