## Prepare Reddit data for training

To prepare the data for training, I need to convert it into the format that my RNN model expects.


By the end of the execution, I will have several files for training and 2 files for testing:
- Training Files:
    - train(serial).from (chatbot input)
    - train(serial).to (chatbot output)
- Test Files:
    - test.from (chatbot input)
    - test.to (chatbot output)
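
Each `.from` file is meant to line up with its `.to` file, so that line N of one pairs with line N of the other. A minimal sketch of reading such a pair back and checking the alignment is shown here; the helper name `read_pairs` and the sample sentences are made up for illustration, and the files are written to a temporary directory instead of the real dataset folder.

```python
import tempfile
from pathlib import Path


def read_pairs(from_path, to_path):
    """Return (input, output) pairs from two line-aligned text files."""
    with open(from_path, encoding='utf8') as f_from:
        inputs = [line.rstrip('\n') for line in f_from]
    with open(to_path, encoding='utf8') as f_to:
        outputs = [line.rstrip('\n') for line in f_to]
    # the two files only make sense together if they have the same number of lines
    if len(inputs) != len(outputs):
        raise ValueError('files are not aligned: {} vs {} lines'.format(len(inputs), len(outputs)))
    return list(zip(inputs, outputs))


# hypothetical sample data, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    from_path = Path(tmp) / 'test.from'
    to_path = Path(tmp) / 'test.to'
    from_path.write_text('how are you?\nwhat is this?\n', encoding='utf8')
    to_path.write_text('fine, thanks\na chatbot\n', encoding='utf8')
    pairs = read_pairs(from_path, to_path)
```

A check like this could be run on the generated files before training, since a single stray line break in either file would shift every pair after it.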

In [18]:
import sqlite3
import re
import pandas as pd

In [21]:
def clean_comments(df):
    
    # drop pairs where the parent or the comment starts with the markup '# **['
    # (a heading that opens with bold, bracketed text)
    pattern = r'^# \*\*\[\S+'
    unwanted = df['parent'].str.match(pattern, na=False) | df['comment'].str.match(pattern, na=False)
    df = df[~unwanted].copy()
            
    # remove URLs, including the Markdown link text in front of them: '[link text](http...'
    url_pattern = r'(\[[\s \w]+\]\()?http\S+'
    df['parent'] = df['parent'].apply(lambda x: re.sub(url_pattern, '', x).strip())
    df['comment'] = df['comment'].apply(lambda x: re.sub(url_pattern, '', x).strip())

    # collapse runs of the 'newlinechar' token into a single ' newlinechar '
    newline_pattern = r'(?:\s*\bnewlinechar\b\s*)+'
    df['parent'] = df['parent'].apply(lambda x: re.sub(newline_pattern, ' newlinechar ', x).strip())
    df['comment'] = df['comment'].apply(lambda x: re.sub(newline_pattern, ' newlinechar ', x).strip())
    
    return df
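
To see what the cleaning patterns do, they can be tried on single strings without a dataframe. The sketch below is only an illustration: the helper name `clean_text` and the sample strings are made up, and `NEWLINE_RUN` is one possible pattern for collapsing repeated `newlinechar` tokens, not necessarily the exact substitution used in the notebook.

```python
import re

# patterns for the three cleaning steps (NEWLINE_RUN is an assumed variant)
HEADING_LINK = r'^# \*\*\[\S+'
URL = r'(\[[\s \w]+\]\()?http\S+'
NEWLINE_RUN = r'(?:\s*\bnewlinechar\b\s*)+'


def clean_text(text):
    """Return the cleaned text, or None if the text should be dropped."""
    # texts that start with '# **[' are dropped entirely
    if re.match(HEADING_LINK, text):
        return None
    # remove URLs together with any Markdown link text in front of them
    text = re.sub(URL, '', text).strip()
    # collapse runs of the newline token into a single one
    return re.sub(NEWLINE_RUN, ' newlinechar ', text).strip()


dropped = clean_text('# **[Weekly thread](http://example.com)**')
no_url = clean_text('see [my link](http://example.com/a) for details')
collapsed = clean_text('first line newlinechar newlinechar second line')
```

Removing a link in the middle of a sentence leaves a double space behind (`'see  for details'`), because `strip()` only trims the ends of the string.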

In [22]:
timeframes = ['2018-06']

# one database per timeframe (month); add more timeframes to process more months
for timeframe in timeframes:
    
    # establish a connection
    connection = sqlite3.connect('../data/raw_data/reddit/{}.db'.format(timeframe))
    c = connection.cursor()
    
    # limit is the size of the chunk that we pull from the database at a time
    limit = 50000
    # timestamp of the last row fetched so far
    last_unix = 0
    cur_length = limit
    counter = 0
    # set to True once the test files have been written
    test_done = False
    
    # help in naming training files
    train_size = 0
    file_name = 1

    # as long as cur_length equals limit, there is still more data to pull
    while cur_length == limit:

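        # fetch the next chunk and save it in a dataframe
        df = pd.read_sql("SELECT * FROM parent_reply WHERE unix > ? AND parent <> 'None' AND score > 0 ORDER BY unix ASC LIMIT ?", connection, params=(last_unix, limit))

        # length of the fetched chunk (before cleaning); a short chunk ends the loop
        cur_length = len(df)

        # stop if nothing is left: an empty chunk has no last timestamp to read
        if cur_length == 0:
            break

        # last fetched unix, cast to a plain int so sqlite3 can bind it as a parameter
        last_unix = int(df['unix'].iloc[-1])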

        # clean text
        df = clean_comments(df)
        
        # the first chunk goes to the test files, every later chunk to the training files
        
        # test files
        if not test_done:
            # create a file for all parent text only
            with open('../data/datasets/reddit/test.from','a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(str(content)+'\n')
                    
            # create a file for all comment text only
            with open('../data/datasets/reddit/test.to','a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content)+'\n')

            test_done = True

        # train files
        else:
            train_size += len(df)
            
            # create a file for all parent text only
            with open('../data/datasets/reddit/'+'train'+str(file_name)+'.from','a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(str(content)+'\n')
                    
            # create a file for all comment text only
            with open('../data/datasets/reddit/'+'train'+str(file_name)+'.to','a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content)+'\n')
            
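            # stop after 1,000,000 training rows; start a new file once every 100,000 rows
            if train_size >= 1000000:
                break
            elif train_size >= file_name * 100000:
                file_name += 1
                print(train_size,'rows completed so far')

    # close the database before moving on to the next timeframe
    connection.close()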

100000 rows completed so far
200000 rows completed so far
300000 rows completed so far
400000 rows completed so far
500000 rows completed so far
600000 rows completed so far
700000 rows completed so far
800000 rows completed so far
900000 rows completed so far
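
The chunked reading used above can be sketched in isolation with an in-memory SQLite database. This is a simplified illustration under stated assumptions: the generator name `iter_chunks`, the reduced four-column `parent_reply` schema, and the seven sample rows are made up, and plain `sqlite3` is used instead of pandas to keep the example small.

```python
import sqlite3


def iter_chunks(connection, limit):
    """Yield lists of (unix, parent, comment) rows, at most `limit` rows each."""
    last_unix = 0
    while True:
        rows = connection.execute(
            "SELECT unix, parent, comment FROM parent_reply "
            "WHERE unix > ? AND parent <> 'None' AND score > 0 "
            "ORDER BY unix ASC LIMIT ?",
            (last_unix, limit),
        ).fetchall()
        # an empty chunk means the table is exhausted
        if not rows:
            break
        yield rows
        # continue after the timestamp of the last row in this chunk
        last_unix = rows[-1][0]
        # a short chunk means there is nothing left to fetch
        if len(rows) < limit:
            break


# hypothetical, simplified table with seven rows
connection = sqlite3.connect(':memory:')
connection.execute(
    "CREATE TABLE parent_reply (unix INTEGER, parent TEXT, comment TEXT, score INTEGER)"
)
connection.executemany(
    "INSERT INTO parent_reply VALUES (?, ?, ?, ?)",
    [(i, 'parent {}'.format(i), 'reply {}'.format(i), 1) for i in range(1, 8)],
)
chunk_sizes = [len(chunk) for chunk in iter_chunks(connection, limit=3)]
connection.close()
```

With seven rows and a limit of 3, the chunks have 3, 3, and 1 rows. One caveat of paging with `unix > ?`: rows that share the last timestamp of a chunk but did not fit into it would be skipped by the next query.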
