# Code used for scraping the data

The text transcripts of the trial are hosted on the following website: http://simpson.walraven.org

The content is organized by date, with one folder for each month in which the trial took place. This project analyzes only the __Criminal Trial Transcripts__, ignoring the _juror interviews_, the _motions and court orders_, the _preliminary hearings_, the _grand jury proceedings_ and the _Civil Trial_.

In [1]:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

In [2]:
# Every month has a different URL (folder) containing the corresponding transcripts
urls = ['http://simpson.walraven.org/oj-feb.html',
        'http://simpson.walraven.org/oj-mar.html', 
        'http://simpson.walraven.org/oj-apr.html',
        'http://simpson.walraven.org/oj-may.html',
        'http://simpson.walraven.org/oj-jun.html',
        'http://simpson.walraven.org/oj-jul.html',
        'http://simpson.walraven.org/oj-aug.html',
        'http://simpson.walraven.org/oj-sep.html']

The scraping was done separately for each month, to keep better control over the content. The following cells show an example for the month of __January__; the same code was used for the other months.

In [3]:
url = 'http://simpson.walraven.org/oj-jan.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
soup.findAll('a')

[<a href="jan11.html">January 11</a>,
 <a href="jan12.html">January 12</a>,
 <a href="jan13.html">January 13</a>,
 <a href="jan23.html">January 23</a>,
 <a href="jan24.html">January 24</a>,
 <a href="jan25.html">January 25</a>,
 <a href="jan26.html">January 26</a>,
 <a href="jan30.html">January 30</a>,
 <a href="jan31.html">January 31</a>,
 <a href="index.html">previous</a>]

In [4]:
# Retrieve one of the links as an example
one_a_tag = soup.findAll("a")[0]
link = one_a_tag["href"]
link

'jan11.html'
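The list of `<a>` tags shown above also contains the `index.html` navigation link, which is not a transcript. A minimal sketch of how the daily pages could be separated from such links is given below. The helper `keep_transcript_links` and the pattern `DAY_PAGE` are hypothetical names, and the pattern only assumes the naming seen in the January listing (three lowercase letters followed by the day number); other months may need a different rule.

```python
import re

# Assumed naming scheme, inferred from the January listing only:
# three lowercase letters for the month, then the day number, then ".html".
DAY_PAGE = re.compile(r'[a-z]{3}\d{1,2}\.html')


def keep_transcript_links(hrefs):
    """Return only the hrefs that look like daily transcript pages."""
    return [h for h in hrefs if DAY_PAGE.fullmatch(h)]


# In the notebook, the hrefs would come from:
#   [a['href'] for a in soup.findAll('a')]
hrefs = ['jan11.html', 'jan12.html', 'jan31.html', 'index.html']
print(keep_transcript_links(hrefs))
# prints ['jan11.html', 'jan12.html', 'jan31.html']
```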

The following cells retrieve the page behind the link defined above and save it locally with a _.txt_ extension.

In [5]:
download_url = 'http://simpson.walraven.org/'+ link
download_url

'http://simpson.walraven.org/jan11.html'

In [6]:
# Save the page locally under the link name plus a .txt extension
urllib.request.urlretrieve(download_url, './' + link + '.txt')

('./jan11.html.txt', <http.client.HTTPMessage at 0x676adf0>)
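The two steps above (building the absolute URL and choosing the local file name) can be grouped in a small helper. This is only a sketch: `download_target` and its `out_dir` parameter are hypothetical, and `urllib.parse.urljoin` is used here in place of plain string concatenation. The file name follows the same convention as the cell above.

```python
from urllib.parse import urljoin

BASE_URL = 'http://simpson.walraven.org/'


def download_target(link, base_url=BASE_URL, out_dir='.'):
    """Return (absolute URL, local file path) for a relative link."""
    url = urljoin(base_url, link)          # resolve the link against the site root
    path = out_dir + '/' + link + '.txt'   # e.g. ./jan11.html.txt
    return url, path


print(download_target('jan11.html'))
# prints ('http://simpson.walraven.org/jan11.html', './jan11.html.txt')
```

The returned pair can be passed directly to `urllib.request.urlretrieve(url, path)`.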

The following cell loops over all the links for the month of January, retrieving and saving the content of each one. Wrapping it in an outer loop over the monthly URLs would retrieve the content of the whole trial.

A different file is saved for each day of the trial. 

In [16]:
# To download the whole month, loop through all the <a> tags of the index page
line_count = 0  # number of <a> tags processed so far
for one_a_tag in soup.findAll('a'):  # 'a' tags are links
    if line_count < 19:  # only the first 19 links are considered
        link = one_a_tag['href']
        if link != 'index.html':  # skip the "previous" navigation link
            download_url = 'http://simpson.walraven.org/' + link
            urllib.request.urlretrieve(download_url, './' + link + '.txt')
            time.sleep(1)  # pause for a second between requests
    # move on to the next link
    line_count += 1
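The outer loop over the months mentioned earlier could be sketched as follows. This is an illustration under stated assumptions, not the code that was actually run: `scrape_months`, `list_links` and `download` are hypothetical names, and the two callables are passed in so that the loop logic can be tried without network access.

```python
import time
from urllib.parse import urljoin


def scrape_months(month_urls, list_links, download, pause=1.0):
    """Download every daily transcript linked from each monthly index page.

    list_links(index_url) -> list of href strings found on that page
    download(url, path)   -> saves the page at `url` to `path`
    Both callables are hypothetical hooks supplied by the caller.
    """
    saved = []
    for index_url in month_urls:              # outer loop: one index page per month
        for link in list_links(index_url):    # inner loop: one link per trial day
            if link == 'index.html':          # skip the "previous" navigation link
                continue
            target = './' + link + '.txt'     # same naming as the January example
            download(urljoin(index_url, link), target)
            saved.append(target)
            time.sleep(pause)                 # pause between requests
    return saved
```

In the notebook, `list_links` could wrap `requests.get` followed by `soup.findAll('a')`, `download` could be `urllib.request.urlretrieve`, and `month_urls` could be the January URL followed by the `urls` list defined earlier.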