## Implementing an RNN

Source: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/

In [1]:
import csv
import itertools
import operator
import numpy as np
import nltk
import sys
from datetime import datetime
from utils import *
import os
import gzip
from six.moves.urllib.request import urlretrieve

import matplotlib.pyplot as plt
%matplotlib inline

The training text is a set of Reddit comments exported from a [dataset available on Google's BigQuery](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_08).

In [2]:
def download_file(url, filename, expected_size_in_bytes, force=False):
  """Download a file if not present, and make sure it's the right size."""
  if force or not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_size_in_bytes:
    print('Found and verified', filename)
  else:
    raise Exception('Failed to verify {0}. Expected {1}B but found {2}B!'.format(filename, expected_size_in_bytes, statinfo.st_size))
  return filename

In [3]:
dataset_compressed_filename = download_file('https://github.com/SebastienBoisard/DeepLearningTutorials/'+
                                            'raw/master/Language_model_with_RNN/data/',
                                            'reddit-comments-2015-08.data.gz', 
                                             3152770)

Found and verified reddit-comments-2015-08.data.gz


In [4]:
def decompress_file(compressed_filename):
    # Split the gzipped file name into a base name and its extension
    file_name, file_extension = os.path.splitext(compressed_filename)
    
    if file_extension != '.gz':
        # Build a single message string; passing several arguments to Exception would print a tuple
        raise Exception("Can't decompress file '{0}' because it is not a .gz file!".format(compressed_filename))
       
    with gzip.open(compressed_filename, 'rb') as f:
        file_content = f.read()    
       
        with open(file_name, 'wb') as outfile:
            outfile.write(file_content)
    
    return file_name
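
`decompress_file` above reads the entire decompressed content into memory before writing it out. For larger archives, a streaming copy may be preferable. The following is a minimal sketch of that alternative, not part of the original tutorial; the name `decompress_file_streaming` is hypothetical, and it raises `ValueError` rather than a bare `Exception`.

```python
import gzip
import os
import shutil


def decompress_file_streaming(compressed_filename):
    """Decompress a .gz file chunk by chunk and return the output file name."""
    # Split "name.data.gz" into ("name.data", ".gz")
    file_name, file_extension = os.path.splitext(compressed_filename)

    if file_extension != '.gz':
        raise ValueError("Can't decompress {!r}: not a .gz file".format(compressed_filename))

    # copyfileobj copies in fixed-size chunks, so memory use stays bounded
    with gzip.open(compressed_filename, 'rb') as src, open(file_name, 'wb') as dst:
        shutil.copyfileobj(src, dst)

    return file_name
```

It takes the same argument and returns the same value as `decompress_file`, so it could be swapped in without changing the calling cell.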

In [5]:
dataset_filename = decompress_file(dataset_compressed_filename)
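
The cell that follows reads the decompressed CSV, skips its `body` header row, lowercases each comment, splits it into sentences, and wraps every sentence in start and end tokens. As a rough illustration of that transformation on an in-memory string, here is a small sketch. It makes two assumptions: `naive_sent_split` is a hypothetical regex-based stand-in for `nltk.sent_tokenize` (so the example runs without NLTK's tokenizer data), and `wrap_sentences` is an illustrative name that does not appear in the notebook.

```python
import csv
import io
import re

sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"


def naive_sent_split(text):
    """Hypothetical stand-in for nltk.sent_tokenize: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def wrap_sentences(csv_text):
    """Return every sentence of every CSV row, lowercased and wrapped in start/end tokens."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader)  # skip the "body" header row
    wrapped = []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        for sentence in naive_sent_split(row[0].lower()):
            wrapped.append("%s %s %s" % (sentence_start_token, sentence, sentence_end_token))
    return wrapped


sample = 'body\n"It\'s a slight PPR league. Would it be dumb not to grab a QB?"\n'
for line in wrap_sentences(sample):
    print(line)
```

The regex splitter is much cruder than NLTK's tokenizer (it would break on abbreviations such as "e.g. this"), so it only serves to show the shape of the output.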

In [6]:
vocabulary_size = 8000
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"


# Read the data file, parse and tokenize the sentences, and wrap each sentence in SENTENCE_START and SENTENCE_END tokens.
def extract_sentences(data_filename):
    print("Reading CSV data file:", data_filename)
    with open(data_filename, 'r', encoding='utf-8') as f:
        # Read the data file as a CSV file
        reader = csv.reader(f, skipinitialspace=True)

        # Skip the header row of the data file (which is always "body")
        next(reader)

        # Split full comments into sentences
        sentences = itertools.chain(*[nltk.sent_tokenize(x[0].lower()) for x in reader])

        # Append SENTENCE_START and SENTENCE_END at the beginning and end of each sentence
        sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]

        return sentences

# Note: this replaces the file name returned by decompress_file above
dataset_filename = "exemple.data"
sentences = extract_sentences(dataset_filename)

print("Parsed %d sentences." % (len(sentences)))
print("sentences=", sentences)


Reading CSV data file: exemple.data
Parsed 192 sentences.
sentences= ["SENTENCE_START i joined a new league this year and they have different scoring rules than i'm used to. SENTENCE_END", "SENTENCE_START it's a slight ppr league- .2 ppr. SENTENCE_END", 'SENTENCE_START standard besides 1 points for 15 yards receiving, .2 points per completion, 6 points per td thrown, and some bonuses for rec/rush/pass yardage. SENTENCE_END', 'SENTENCE_START my question is, is it wildly clear that qb has the highest potential for points? SENTENCE_END', 'SENTENCE_START i put in the rules at a ranking site and noticed that top qbs had 300 points more than the top rb/wr. SENTENCE_END', 'SENTENCE_START would it be dumb not to grab a qb in the first round? SENTENCE_END', 'SENTENCE_START in your scenario, a person could just not run the mandatory background check on the buyer and still sell the gun to the felon. SENTENCE_END', "SENTENCE_START there's no way to enforce it. SENTENCE_END", "SENTENCE_START an hon