## Prepare Movies data for training

To prepare the data for training, I need to convert it into the format required to train my RNN model.


By the end of the execution, I will have 4 files that will be used for training and testing:
- Training Files:
    - train.from (chatbot input)
    - train.to (chatbot output)
- Test Files:
    - test.from (chatbot input)
    - test.to (chatbot output)

In [20]:
import pandas as pd

In [21]:
def get_lines():
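    '''
    1. Read from 'movie_lines.txt'
    2. Create a dictionary with ( key = line_id, value = text )
    '''
    # Read the whole file and drop the last element, which is empty when the
    # file ends with a newline. The with-block closes the file afterwards.
    with open('../data/raw_data/movies/movie_lines.txt',
              encoding='utf-8',
              errors='ignore') as f:
        lines = f.read().split('\n')[:-1]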
    id2line = {}
    for line in lines:
        _line = line.split(' +++$+++ ')
        if len(_line) == 5:
            id2line[_line[0]] = _line[4]
    return id2line
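
To make the parsing step concrete, here is a minimal sketch of how a single line is split into its five fields. The sample strings and the `parse_line` helper are illustrative assumptions and not part of the notebook; only the ` +++$+++ ` separator and the field positions are taken from `get_lines`.

```python
# Field separator used by get_lines().
SEP = ' +++$+++ '

def parse_line(line):
    """Hypothetical helper: return (line_id, text), or None if the line is malformed."""
    fields = line.split(SEP)
    if len(fields) != 5:
        return None
    # fields[0] is the line id and fields[4] is the spoken text.
    return fields[0], fields[4]

# Illustrative samples in the layout that get_lines() expects.
good = 'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!'
# No trailing space after the last separator, so the split yields only 4 fields.
bad = 'L1046 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++'

parsed_good = parse_line(good)
parsed_bad = parse_line(bad)
print(parsed_good)
print(parsed_bad)
```

Lines that do not split into exactly five fields are skipped, which is why later steps cannot assume that every line id is present in the dictionary.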

In [22]:
def get_orderd_conversations():
    '''
    1. Read from 'movie_conversations.txt'
    2. Create a list of [list of line_id's]
    '''
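    # Read the whole file and drop the last element, which is empty when the
    # file ends with a newline. The with-block closes the file afterwards.
    with open('../data/raw_data/movies/movie_conversations.txt',
              encoding='utf-8',
              errors='ignore') as f:
        conv_lines = f.read().split('\n')[:-1]
    convs = []
    for line in conv_lines:
        # The last field is a bracketed, quoted list of line ids. Strip the
        # brackets, quotes and spaces, then split on commas.
        _line = line.split(' +++$+++ ')[-1][1:-1].replace("'", "").replace(" ", "")
        convs.append(_line.split(','))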
    return convs
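
As a rough illustration of the string handling above, the sketch below runs the same steps on one sample line. The sample string is an assumption about the file layout, inferred from the way the function strips brackets and quotes. The `ast.literal_eval` variant is an alternative I am adding for comparison, not what the notebook does.

```python
import ast

SEP = ' +++$+++ '

# Illustrative sample; the last field is assumed to be a Python-style list literal.
sample_conv = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

# Same steps as in the function above: take the last field, drop the
# surrounding brackets, remove quotes and spaces, split on commas.
raw = sample_conv.split(SEP)[-1]
ids = raw[1:-1].replace("'", "").replace(" ", "").split(',')

# Alternative: let Python parse the list literal directly.
ids_alt = ast.literal_eval(raw)

print(ids)
print(ids == ids_alt)
```

Both approaches give the same list of ids for well-formed input; `ast.literal_eval` raises an exception on malformed input instead of silently producing odd tokens.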

In [23]:
def create_dataset(convs, id2line):
    '''
    Get lists of all conversations as Questions and Answers
    1. [questions]
    2. [answers]
    '''
    questions = []
    answers = []

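    for conv in convs:
        # Drop the trailing line of odd-length conversations so that every
        # question has an answer.
        if len(conv) % 2 != 0:
            conv = conv[:-1]
        # Walk the conversation in (question, answer) pairs.
        for i in range(0, len(conv), 2):
            # Skip pairs with a line id that get_lines() did not parse; this
            # avoids a KeyError and keeps the two lists aligned.
            if conv[i] in id2line and conv[i + 1] in id2line:
                questions.append(id2line[conv[i]])
                answers.append(id2line[conv[i + 1]])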
                
    df = pd.DataFrame({'parent':questions,
                       'comment':answers})
    return df
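
The pairing rule can be checked on a toy conversation. This is only a sketch: the ids and texts are made up, and the `zip` over two slices is an equivalent formulation I am using for illustration, not the loop from the notebook.

```python
# Made-up lookup table and conversation with an odd number of lines.
toy_id2line = {
    'L1': 'Hi.',
    'L2': 'Hello.',
    'L3': 'How are you?',
    'L4': 'Fine.',
    'L5': 'Good.',
}
toy_conv = ['L1', 'L2', 'L3', 'L4', 'L5']

# Even positions are questions, odd positions are answers. zip() stops at the
# shorter slice, so the unpaired last line ('L5') is dropped.
pairs = [
    (toy_id2line[q], toy_id2line[a])
    for q, a in zip(toy_conv[0::2], toy_conv[1::2])
]

for question, answer in pairs:
    print(question, '->', answer)
```

Each line is used exactly once, either as a question or as an answer, so a conversation of five lines yields two pairs.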

In [25]:
line_dict = get_lines()
convs_lst = get_orderd_conversations()
data = create_dataset(convs_lst, line_dict)

test_size = 30000

test_files = data[:test_size]
train_files = data[test_size:]

# Files are opened in 'w' mode so that re-running this cell overwrites the
# previous output instead of appending duplicate lines.

# test files
# create a file for all parent text only
with open('../data/datasets/movies/test.from', 'w', encoding='utf8') as f:
    for content in test_files['parent'].values:
        f.write(str(content)+'\n')

# create a file for all comment text only
with open('../data/datasets/movies/test.to', 'w', encoding='utf8') as f:
    for content in test_files['comment'].values:
        f.write(str(content)+'\n')


# train files
# create a file for all parent text only
with open('../data/datasets/movies/train.from', 'w', encoding='utf8') as f:
    for content in train_files['parent'].values:
        f.write(str(content)+'\n')

# create a file for all comment text only
with open('../data/datasets/movies/train.to', 'w', encoding='utf8') as f:
    for content in train_files['comment'].values:
        f.write(str(content)+'\n')
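
Since line i of a `.from` file is meant to pair with line i of the matching `.to` file, a quick check that both files have the same number of lines can catch misalignment early. The sketch below is self-contained and writes to a temporary directory; `write_lines` and `count_lines` are hypothetical helpers, and the sample texts are made up.

```python
import os
import tempfile

def write_lines(path, items):
    """Hypothetical helper: write one item per line, overwriting the file."""
    with open(path, 'w', encoding='utf8') as f:
        for item in items:
            f.write(str(item) + '\n')

def count_lines(path):
    """Hypothetical helper: count the lines in a text file."""
    with open(path, encoding='utf8') as f:
        return sum(1 for _ in f)

# Made-up parallel data: parents[i] is answered by comments[i].
parents = ['Hi.', 'How are you?']
comments = ['Hello.', 'Fine.']

with tempfile.TemporaryDirectory() as tmp:
    from_path = os.path.join(tmp, 'test.from')
    to_path = os.path.join(tmp, 'test.to')
    write_lines(from_path, parents)
    write_lines(to_path, comments)
    n_from = count_lines(from_path)
    n_to = count_lines(to_path)

print(n_from == n_to)
```

The same `count_lines` check could be pointed at the real `train.from`/`train.to` and `test.from`/`test.to` files after the cell above has run.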
