# Scraping web data for use with a Recurrent Neural Network

- The data is scraped from a columnist's email Q&A column, which has been published once a week for several years. The articles are run through a neural network that should learn to imitate the style of the column.
- Step 1: Crawl this site for all articles: http://deadspin.com/tag/funbag
- Step 2: Parse out just the text content, keeping some HTML tags, and save each article as its own HTML file
- Step 3: Concatenate into one large file for feeding into the RNN
- Output: The combined file is 10 MB, and the run took about 11 hours on my laptop. The neural network's output for 200,000 characters is saved in neural_net_output.html
- To do: The output is recognizably in the style of the column, but it could be improved by experimenting with different parameters for the neural network.

## Scrape the URLs for each HTML page

Create a Scrapy spider in funbag_spider.py and run it from the shell here. The scraped URLs are written to funbag_urls.json.

In [98]:
! scrapy runspider funbag_spider.py -o funbag_urls.json &>/dev/null 

Load URLs from JSON file.

In [1]:
import json

with open('funbag_urls.json') as url_file:    
    url_data = json.load(url_file)

# Take the first value of each scraped item, which holds the article URL
all_urls = [str(list(place.values())[0]) for place in url_data]

['http://adequateman.deadspin.com/whats-the-best-store-to-daydream-about-robbing-1796232076', 'http://adequateman.deadspin.com/should-you-ask-people-their-politics-before-dating-them-1796052163', 'http://adequateman.deadspin.com/is-an-unbeaten-playoff-run-more-impressive-than-73-wins-1795847201']
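To make the shape of the data concrete, here is a minimal sketch of the same extraction on an inline sample. It assumes the Scrapy JSON feed is a list of single-key objects. The key name `url` and the example.com addresses are hypothetical, since the spider's item fields aren't shown here.

```python
import json

# Hypothetical sample of a Scrapy JSON feed: one object per scraped item.
# The "url" key and the example.com addresses are assumptions for illustration.
sample_feed = '''
[
  {"url": "http://example.com/funbag-one"},
  {"url": "http://example.com/funbag-two"}
]
'''

def extract_urls(items):
    """Return the first value of each item as a plain string."""
    # list(...) keeps this working on Python 3, where dict.values() is a view
    return [str(list(item.values())[0]) for item in items]

sample_urls = extract_urls(json.loads(sample_feed))
print(sample_urls)
```

If an item ever carried more than one field, picking the value by its key name would be safer than relying on position.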


## Parse, sanitize, and save each of the HTML pages

In [25]:
import os
import io  # io.open writes unicode text correctly

from bs4 import BeautifulSoup
import urllib2

def oneFullPageParse(url, outfile):
    """Fetch one article and save its <p> and <blockquote> elements as HTML."""
    funbag = urllib2.urlopen(url)
    soup = BeautifulSoup(funbag, 'html5lib')
    # The full_page_files directory must already exist
    out_path = os.path.join(os.getcwd(), 'full_page_files', outfile)
    # The with block closes the file even if a write raises an error
    with io.open(out_path, 'w', encoding='utf-8') as fo:
        for anchor in soup.find_all(['p', 'blockquote'], class_=''):
            # Skip <p> tags nested in a <blockquote>; the blockquote already contains them
            if anchor.parent.name == 'blockquote':
                continue
            fo.write(anchor.prettify(formatter="html"))

In [28]:
import time
start_time = time.time()

#loop through all urls to generate cleaned up pages
for idx, row in enumerate(all_urls):
    my_outfile = str(idx) + '.html'
    oneFullPageParse(row, my_outfile)

end_time = time.time()
total_time = end_time - start_time
print 'Total time:', total_time

Total time: 586.253000021


Concatenate all of the saved .html files into one file with a single shell command. The combined file is the input for torch-rnn's preprocessing step.

In [1]:
! cat full_page_files/*.html > all.html
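One thing to keep in mind: a shell glob expands in lexicographic order, so `10.html` sorts before `2.html`. Order probably matters little for a character-level model, but if the crawl order should be preserved, a small Python sketch like the following could be used instead. The function name `concatenate_pages` is hypothetical, and the sketch assumes the files are named `<index>.html` as in the loop above.

```python
import io
import os

def concatenate_pages(src_dir, dest_path):
    """Join <index>.html files from src_dir into dest_path, in numeric order."""
    names = [n for n in os.listdir(src_dir)
             if n.endswith('.html') and n[:-5].isdigit()]
    names.sort(key=lambda n: int(n[:-5]))  # 2.html comes before 10.html
    with io.open(dest_path, 'w', encoding='utf-8') as out:
        for name in names:
            with io.open(os.path.join(src_dir, name), encoding='utf-8') as page:
                out.write(page.read())
    return names
```

A call such as `concatenate_pages('full_page_files', 'all.html')` would then produce the combined file.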

## Train character level language model using LSTM RNN

See:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

https://github.com/karpathy/char-rnn

https://github.com/jcjohnson/torch-rnn

Of these, torch-rnn is the best one to use.
Here's an installation guide for OS X:
http://www.jeffreythompson.org/blog/2016/03/25/torch-rnn-mac-install/

There was an installation issue when I tried to train the model. It was fixed by following the comment from 'tbornt' here: https://github.com/deepmind/torch-hdf5/issues/83#issuecomment-254427843

Run the preprocessor from the shell. This assumes the concatenated file has been copied to data/funbag.txt:

In [None]:
! python scripts/preprocess.py --input_txt data/funbag.txt --output_h5 data/funbag.h5 --output_json data/funbag.json
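For intuition, the core idea of this step is to map every distinct character to an integer and split the encoded text into training, validation, and test portions. The sketch below illustrates that idea in plain Python. It is not the torch-rnn script itself: the function name, the split fractions, and the in-memory return value are assumptions for illustration, whereas the real script writes the HDF5 and JSON files named in the command.

```python
def encode_characters(text, val_frac=0.1, test_frac=0.1):
    """Map characters to integer ids and split the sequence three ways."""
    # Ids start at 1, which suits Lua's 1-based indexing
    token_to_idx = {}
    for ch in text:
        if ch not in token_to_idx:
            token_to_idx[ch] = len(token_to_idx) + 1
    encoded = [token_to_idx[ch] for ch in text]

    # Hypothetical split fractions; the remainder goes to training
    val_size = int(val_frac * len(encoded))
    test_size = int(test_frac * len(encoded))
    train_size = len(encoded) - val_size - test_size
    splits = {
        'train': encoded[:train_size],
        'val': encoded[train_size:train_size + val_size],
        'test': encoded[train_size + val_size:],
    }
    return token_to_idx, splits

vocab, splits = encode_characters(u'<p>funbag funbag</p>')
print(len(vocab), [len(splits[k]) for k in ('train', 'val', 'test')])
```

Because the HTML tags are encoded character by character like everything else, the model has to learn to open and close them on its own.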

Train the model from shell using torch/lua:

In [None]:
! th train.lua -input_h5 data/funbag.h5 -input_json data/funbag.json

Sample from a saved checkpoint and write the output to a file to view it.

In [2]:
! th sample.lua -checkpoint cv/checkpoint_159500.t7 -length 200000 -gpu -1 -temperature 0.7 > outputs/funbag_159500_output3.html
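The `-temperature` flag controls how adventurous the sampling is: the network's scores are divided by the temperature before being turned into probabilities, so values below 1 favour the most likely characters and values above 1 flatten the distribution. Here is a minimal sketch of that calculation with made-up scores for three characters. It illustrates the general technique, not torch-rnn's internal code.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Convert raw scores to probabilities after dividing by temperature."""
    scaled = [s / temperature for s in scores]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # made-up scores for three candidate characters
for t in (1.0, 0.7):
    print(t, [round(p, 3) for p in softmax_with_temperature(scores, t)])
```

At 0.7 the top-scoring character gets a larger share of the probability than at 1.0, which tends to give more conservative and more repetitive text.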