# Preparing the Data

In this notebook I'll explore the datasets, convert them into sequence matrices, and save the results as CSV or pickle files. The goal is a Q&A-like data structure for each dataset (Twitter and the Cornell Movie DB), available both as word IDs and as English words. The raw data is downloaded from https://github.com/suriyadeepan/datasets. For the Twitter data I also reuse some snippets from that repository's code instead of Keras' word tokenizer; the Cornell data is tokenized with Keras.

## 1. Twitter Data

This dataset contains questions and their answers. All I have to do is filter out the unnecessary data and organize the rest.

In [1]:
def decode(sequence, lookup, separator=''):
    # ID 0 is used for padding, so falsy elements are skipped
    return separator.join([lookup[element] for element in sequence if element])
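
As a quick check of `decode`, the sketch below runs it on a made-up lookup table. The `toy_lookup` dictionary and the ID sequence are hypothetical stand-ins for the real vocabulary; only the function itself comes from the cell above.

```python
def decode(sequence, lookup, separator=''):
    # ID 0 is the padding ID, so falsy elements are skipped
    return separator.join([lookup[element] for element in sequence if element])

# Hypothetical vocabulary: ID 0 is reserved for padding
toy_lookup = {0: '_', 1: 'how', 2: 'are', 3: 'you'}
# A sequence padded with zeros up to a fixed length of 6
toy_sequence = [1, 2, 3, 0, 0, 0]

decoded = decode(toy_sequence, toy_lookup, separator=' ')
print(decoded)
```

The padding zeros never reach the lookup, so the padding entry in the table is not used when decoding.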

In [2]:
import pandas as pd
import numpy as np
import pickle
from tqdm import tqdm_notebook
import keras

# preprocessed data
from data.twitter import data
import data_utils

Using TensorFlow backend.


In [3]:
metadata, idx_q, idx_a = data.load_data(PATH='data/twitter/')
# Default train/test/validation split
(trainX, trainY), (testX, testY), (validX, validY) = data_utils.split_dataset(idx_q, idx_a)
# Ratio [1, 0, 0]: keep every pair in the first split
(X, Y), _, _ = data_utils.split_dataset(idx_q, idx_a, [1, 0, 0])
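
The idea behind a ratio-based split can be illustrated with a small sketch. The `split_by_ratio` helper below is hypothetical and is not the actual `data_utils.split_dataset` code. It assumes the splits are taken in order and that its default ratio is arbitrary. It only shows why a ratio of `[1, 0, 0]` would leave the other two splits empty.

```python
def split_by_ratio(x, y, ratio=(0.7, 0.15, 0.15)):
    """Hypothetical helper: slice x and y into train/test/valid parts, in order."""
    n = len(x)
    train_end = int(n * ratio[0])
    test_end = train_end + int(n * ratio[1])
    train = (x[:train_end], y[:train_end])
    test = (x[train_end:test_end], y[train_end:test_end])
    # The validation split takes whatever is left over
    valid = (x[test_end:], y[test_end:])
    return train, test, valid

# Made-up ID sequences, for illustration only
toy_questions = [[1, 2], [3, 4], [5, 6], [7, 8]]
toy_answers = [[9], [10], [11], [12]]

(all_q, all_a), (test_q, test_a), (valid_q, valid_a) = split_by_ratio(
    toy_questions, toy_answers, ratio=(1, 0, 0))
print(len(all_q), len(test_q), len(valid_q))
```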

In [4]:
len(X)

267518

In [5]:
twitter_seq = [X, Y]
# Use a context manager so the file handle is closed after writing
with open('data/twitter_seq.pickle', 'wb') as f:
    pickle.dump(twitter_seq, f)

In [6]:
twitter_lang = pd.DataFrame(columns=["question", "answer"])

In [7]:
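metadata['idx2w'][1] = ''  # make ID 1 decode to an empty string
rows = []
for idx in tqdm_notebook(range(len(X))):
    question = data_utils.decode(X[idx], metadata['idx2w'], ' ')
    answer = data_utils.decode(Y[idx], metadata['idx2w'], ' ')
    rows.append([question, answer])

# Build the frame once; appending row by row with .loc is very slow at this size
twitter_lang = pd.DataFrame(rows, columns=["question", "answer"])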


In [8]:
twitter_lang.to_csv('data/twitter_dialogue.csv')

In [9]:
with open('data/twitter_metadata.pickle', 'wb') as f:
    pickle.dump(metadata, f)
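
To read these files back in a later notebook, a round trip like the one below should be enough. This is a minimal sketch that uses a temporary directory and made-up stand-in data (`toy_seq`, `toy_metadata`) rather than the real files, so the shapes are assumptions.

```python
import os
import pickle
import tempfile

# Stand-in data shaped like twitter_seq ([X, Y]) and metadata (hypothetical values)
toy_seq = [[[1, 2, 0]], [[3, 0, 0]]]
toy_metadata = {'idx2w': ['_', '', 'hello', 'there']}

with tempfile.TemporaryDirectory() as tmp_dir:
    seq_path = os.path.join(tmp_dir, 'twitter_seq.pickle')
    meta_path = os.path.join(tmp_dir, 'twitter_metadata.pickle')

    # Write both objects, closing each file after the dump
    with open(seq_path, 'wb') as f:
        pickle.dump(toy_seq, f)
    with open(meta_path, 'wb') as f:
        pickle.dump(toy_metadata, f)

    # Read them back; the sequence file unpacks into questions and answers
    with open(seq_path, 'rb') as f:
        loaded_X, loaded_Y = pickle.load(f)
    with open(meta_path, 'rb') as f:
        loaded_metadata = pickle.load(f)

print(loaded_X, loaded_Y, loaded_metadata['idx2w'][2])
```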

## 2. Cornell Movie Database

This dataset is larger and somewhat better. I'll organize it into a CSV file the same way, except that the word IDs come from Keras' tokenizer.

In [10]:
from data.cornell import prepare_data

In [11]:
id2line = prepare_data.get_id2line()
convs = prepare_data.get_conversations()
questions, answers = prepare_data.gather_dataset(convs,id2line)
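
The cell above turns the raw corpus into parallel `questions` and `answers` lists. One plausible way to pair up lines is sketched below. The `pair_up` helper and the toy IDs are hypothetical, and the alternating-turn logic is an assumption about what `gather_dataset` does, not a copy of its code.

```python
def pair_up(conversations, id_to_line):
    """Hypothetical helper: split each conversation into question/answer turns."""
    questions, answers = [], []
    for line_ids in conversations:
        # Drop the last line of an odd-length conversation so every question has an answer
        usable = line_ids if len(line_ids) % 2 == 0 else line_ids[:-1]
        for position, line_id in enumerate(usable):
            if position % 2 == 0:
                questions.append(id_to_line[line_id])
            else:
                answers.append(id_to_line[line_id])
    return questions, answers

# Made-up IDs and lines, for illustration only
toy_id2line = {'A1': 'Hi.', 'A2': 'Hello.', 'A3': 'How are you?',
               'B1': 'Leaving?', 'B2': 'Yes.'}
toy_convs = [['A1', 'A2', 'A3'], ['B1', 'B2']]

toy_questions, toy_answers = pair_up(toy_convs, toy_id2line)
print(list(zip(toy_questions, toy_answers)))
```

Under this assumption the two lists always have the same length, which is what the column assignments in the next cells rely on.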

In [12]:
cornell_lang = pd.DataFrame(columns=["question", "answer"])

In [13]:
cornell_lang['question'] = questions
cornell_lang['answer'] = answers

In [14]:
cornell_lang[:10]

| | question | answer |
|---|---|---|
| 0 | Can we make this quick? Roxanne Korrine and A... | Well, I thought we'd start with pronunciation,... |
| 1 | Not the hacking and gagging and spitting part.... | Okay... then how 'bout we try out some French ... |
| 2 | You're asking me out. That's so cute. What's ... | Forget it. |
| 3 | No, no, it's my fault -- we didn't have a prop... | Cameron. |
| 4 | The thing is, Cameron -- I'm at the mercy of a... | Seems like she could get a date easy enough... |
| 5 | Why? | Unsolved mystery. She used to be really popul... |
| 6 | Gosh, if only we could find Kat a boyfriend... | Let me see what I can do. |
| 7 | C'esc ma tete. This is my head | Right. See? You're ready for the quiz. |
| 8 | I don't want to know how to say that though. ... | That's because it's such a nice one. |
| 9 | How is our little Find the Wench A Date plan p... | Well, there's someone I think might be -- |


In [15]:
tokenizer = keras.preprocessing.text.Tokenizer(num_words=3000, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~©', lower=True,
                                               split=' ', char_level=False, oov_token=None, document_count=0)

In [16]:
tokenizer.fit_on_texts(questions + answers)

In [17]:
X = tokenizer.texts_to_sequences(questions)
Y = tokenizer.texts_to_sequences(answers)

In [18]:
decode(X[6], tokenizer.index_word, separator=' ')

'gosh if only we could find a boyfriend'
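
Note that "Kat" from question 6 is missing in the decoded text. The likely cause is `num_words=3000`: with no `oov_token` set, `texts_to_sequences` drops every word whose frequency rank falls outside the cap. The pure-Python sketch below imitates that behaviour on a toy corpus. `build_word_index` and `to_sequence` are hypothetical helpers, not Keras functions, and tokenization is reduced to lowercasing and splitting on spaces.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, most frequent first, IDs from 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Ties are broken alphabetically here, which is a simplification
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only words whose ID is below num_words; rarer words are dropped."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

toy_texts = ['we could find a boyfriend', 'we could find kat', 'we could find a date']
toy_index = build_word_index(toy_texts)
index_to_word = {i: w for w, i in toy_index.items()}

sequence = to_sequence('we could find kat a boyfriend', toy_index, num_words=5)
decoded_words = [index_to_word[i] for i in sequence]
print(' '.join(decoded_words))
```

If dropped words should stay visible, setting `oov_token` on the tokenizer replaces them with a placeholder instead of removing them.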

In [19]:
cornell_lang.to_csv('data/cornell_dialogue.csv')
cornell_seq = [X, Y]
# Use context managers so the file handles are closed after writing
with open('data/cornell_seq.pickle', 'wb') as f:
    pickle.dump(cornell_seq, f)
with open('data/cornell_tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f)
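
`texts_to_sequences` returns lists of variable-length ID lists, so `X` and `Y` are not yet rectangular. If fixed-size sequence matrices are needed later, they can be padded with the same 0 ID that `decode` ignores. A minimal sketch, assuming right-padding and truncation to a hypothetical `max_len`:

```python
def pad_to_matrix(sequences, max_len, pad_id=0):
    """Hypothetical helper: right-pad (or truncate) each ID list to max_len."""
    matrix = []
    for seq in sequences:
        trimmed = list(seq[:max_len])
        matrix.append(trimmed + [pad_id] * (max_len - len(trimmed)))
    return matrix

# Made-up ID sequences of different lengths
toy_sequences = [[5, 2, 9], [7], [1, 2, 3, 4, 5, 6]]
padded = pad_to_matrix(toy_sequences, max_len=4)
for row in padded:
    print(row)
```

Keras' `pad_sequences` with `padding='post'` and `truncating='post'` should give the same layout as a NumPy array.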