# Gathering Legislation Corpus
The aim of this notebook is to collect a corpus of Australian legislation for use in natural language processing (NLP). The robots.txt file for legislation.gov.au was checked first to confirm that scraping the relevant pages is allowed:

User-agent: *
Crawl-delay: 10
Disallow: /Current/
Disallow: /Search/
Disallow: /Images/
Disallow: /Scripts/
Disallow: /Content/
Disallow: /Account/
Sitemap: https://www.legislation.gov.au/sitemap
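
The same check can be done programmatically. As a minimal sketch, Python's standard-library `urllib.robotparser` can parse the rules listed above and report whether a given path may be fetched. The rules are copied from the listing; the sample URLs are only illustrative.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt listing above
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /Current/
Disallow: /Search/
Disallow: /Images/
Disallow: /Scripts/
Disallow: /Content/
Disallow: /Account/
Sitemap: https://www.legislation.gov.au/sitemap
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = 'https://www.legislation.gov.au'
# Paths used by this notebook: the browse index and the act detail pages
details_allowed = parser.can_fetch('*', base + '/Details/C2019C00083')
browse_allowed = parser.can_fetch('*', base + '/Browse/ByTitle/Acts/InForce/0/0/Principal')
# A path that the rules disallow
search_allowed = parser.can_fetch('*', base + '/Search/')
# Requested delay between requests, in seconds
crawl_delay = parser.crawl_delay('*')

print(details_allowed, browse_allowed, search_allowed, crawl_delay)
```

For a live check, `parser.set_url(base + '/robots.txt')` followed by `parser.read()` would fetch the current rules instead of the copied text.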

In [72]:
import requests 
from bs4 import BeautifulSoup
import time
import pandas as pd
import numpy as np
import regex as re

# random() adds jitter to the request delay; sample() selects a random sample without replacement
from random import seed, sample, random
# seed random number generator
seed(1)

from datetime import datetime, timedelta

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

## Collect Alphabetical Names of all Pages in Index
The main page for Acts InForce contains an alphabetical index. Each index entry leads to a new page listing the acts whose titles start with that prefix.

In [None]:
# Collect main page
url='https://www.legislation.gov.au/Browse/ByTitle/Acts/InForce/0/0/Principal'
html_doc = requests.get(url, headers=headers).text # .text returns the page body as a string rather than the Response object
# html_doc holds the page's full HTML markup
soup = BeautifulSoup(html_doc, 'html.parser')

# Make a list of urls that point to each alphabetical link in the main page
pages=[]
for page in soup.findAll('a', {'class': 'CategoryLetter'}):
    page='https://www.legislation.gov.au/Browse/Results/ByTitle/Acts/InForce/'+str(page.text)+'/0/0/principal'
    pages.append(page)

## Get the Serial Numbers of Each Act

In [74]:
detail_links={} # maps each act title (key) to its serial number (value)
broken=[] # index pages that could not be processed
sleep_time=1+random() # delay between requests, to avoid flooding the site

for page in pages:
    try:
        print('Working on page '+str(page))
        html_doc = requests.get(page, headers=headers).text
        soup = BeautifulSoup(html_doc, 'html.parser')
        
        for link in soup.findAll('a', {'class': 'LegBookmark'}):
            # the serial number is everything after "/Details/" in the link target
            detail_links[link.text] = link.get('href').partition("/Details/")[2]
        time.sleep(sleep_time) # pause once per page request, not once per link
    
    except Exception as e: 
        print(e)
        time.sleep(sleep_time)
        broken.append(page)
        print('something went wrong with page '+str(page))

Working on page https://www.legislation.gov.au/Browse/Results/ByTitle/Acts/InForce/Ab/0/0/principal
Working on page https://www.legislation.gov.au/Browse/Results/ByTitle/Acts/InForce/Ac/0/0/principal
Working on page https://www.legislation.gov.au/Browse/Results/ByTitle/Acts/InForce/Nu/0/0/principal
Working on page https://www.legislation.gov.au/Browse/Results/ByTitle/Acts/InForce/Oc/0/0/principal


Working on page https://www.legislation.gov.au/Browse/Results/ByTitle/Acts/InForce/Of/0/0/principal
Working on page https://www.legislation.gov.au/Browse/Results/ByTitle/Acts/InForce/Ol/0/0/principal
Working on page https://www.legislation.gov.au/Browse/Results/ByTitle/Acts/InForce/Wo/0/0/principal
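
The robots.txt listing above asks for a crawl delay of 10 seconds, which is longer than the `sleep_time` used here. One way to enforce a minimum gap between requests is a small throttle object that is called before each `requests.get`. This is only a sketch: `Throttle` and its parameters are hypothetical names that are not part of the notebook, and the clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time

class Throttle:
    """Enforce a minimum number of seconds between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In the scraping loops, `throttle = Throttle(10)` would be created once and `throttle.wait()` called immediately before each `requests.get(...)`, in place of the fixed `time.sleep(sleep_time)`.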


## Save Dictionary of Act Names and Serial Numbers
These files should only need to be generated once. Subsequent analysis should load files instead of generating new files.
Dates and times have been used in the filenames to avoid overwriting older data.

In [78]:
# Saved using NumPy (the dictionary and list are stored as .npy files)
np.save(datetime.now().strftime("%Y%m%d-%H-%M-%S_detail_links"), detail_links)
np.save(datetime.now().strftime("%Y%m%d-%H-%M-%S_pages"), pages)

In [84]:
# Converted to a DataFrame and saved as CSV
df_links=pd.DataFrame.from_dict(detail_links, orient='index')
df_links.to_csv(datetime.now().strftime("%Y%m%d-%H-%M-%S_detail_links.csv"), sep=',', index=True)
df_links.head()

Unnamed: 0,0
Aboriginal and Torres Strait Islander Act 2005,C2019C00083
Aboriginal and Torres Strait Islander Heritage Protection Act 1984,C2016C00937
Aboriginal and Torres Strait Islander Land and Sea Future Fund Act 2018,C2019C00329
Aboriginal Land Grant (Jervis Bay Territory) Act 1986,C2016C01003
Aboriginal Land (Lake Condah and Framlingham Forest) Act 1987,C2016C00958


In [88]:
df_pages=pd.DataFrame(pages)
df_pages.to_csv(datetime.now().strftime("%Y%m%d-%H-%M-%S_pages.csv"), sep=',', index=False)
df_pages.head()

Unnamed: 0,0
0,https://www.legislation.gov.au/Browse/Results/...
1,https://www.legislation.gov.au/Browse/Results/...
2,https://www.legislation.gov.au/Browse/Results/...
3,https://www.legislation.gov.au/Browse/Results/...
4,https://www.legislation.gov.au/Browse/Results/...
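
Since later analysis should load these files rather than scrape again, a loader along the following lines may help. It is a sketch under stated assumptions: `load_latest_detail_links` is a hypothetical helper, and it assumes the CSV layout shown above (a header row, then one `title,serial` row per act) with all timestamped files in one directory. Because the `%Y%m%d-%H-%M-%S` prefix sorts chronologically as plain text, the newest file is the last name in sorted order.

```python
import csv
from pathlib import Path

def load_latest_detail_links(directory='.'):
    """Return {act title: serial} from the newest *_detail_links.csv in directory."""
    candidates = sorted(Path(directory).glob('*_detail_links.csv'))
    if not candidates:
        raise FileNotFoundError('no *_detail_links.csv file found in ' + str(directory))
    latest = candidates[-1]  # timestamp prefixes sort chronologically

    with open(latest, newline='', encoding='utf-8') as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row written by pandas
        # the csv module handles titles that pandas quoted because they contain commas
        return {row[0]: row[1] for row in reader if len(row) >= 2}
```

Calling `detail_links = load_latest_detail_links()` at the top of a later session would then stand in for the index-scraping cells.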


## Collecting the Text of Each Act
Finally, we can gather the full text of each act from its details page.

In [271]:
acts={} #Create a blank dictionary to hold the act titles as keys and act text as values

for act_title, serial in detail_links.items():
    
    try:
        print('Working on page   '+str(act_title)+'   '+str(serial))
        
        url='https://www.legislation.gov.au/Details/'+str(serial)
        html_doc = requests.get(url, headers=headers).text # .text returns the page body as a string rather than the Response object
        # html_doc holds the page's full HTML markup
        soup = BeautifulSoup(html_doc, 'html.parser')
        
        text_of_act=''
        # for loop needed to capture all fragments of text that may exist in multiple instances of 'lang': 'EN-AU'
        for fragment in soup.findAll('div', {'lang': 'EN-AU'}):
            text_of_act += ' ' + fragment.text
        acts[act_title]= text_of_act
        time.sleep(sleep_time)
            
    except Exception as e: 
        print(e)
        time.sleep(sleep_time)
        print('something went wrong with act '+str(act_title)+' '+str(serial))
        
df_acts=pd.DataFrame.from_dict(acts, orient='index')
# to_csv does not create directories, so the acts/ folder must already exist
df_acts.to_csv(datetime.now().strftime("acts/%Y%m%d-%H-%M-%S_acts.csv"), sep=',', index=True)

Working on page   Aboriginal Land Rights and Other Legislation Amendment Act 2013   C2013A00093
Working on page   AIDC Sale Act 1997   C2004A05165
Working on page   Aircraft Noise Levy Act 1995   C2012C00908
Working on page   Airports (On-Airport Activities Administration) Validation Act 2010   C2010A00080
Working on page   Airspace (Consequentials and Other Measures) Act 2007   C2007A00039
Working on page   Antarctic Treaty Act 1960   C2008C00398
Working on page   Anti-Money Laundering and Counter-Terrorism Financing (Transitional Provisions and Consequential Amendments) Act 2006   C2007C00287
Working on page   Ashmore and Cartier Islands Acceptance Act 1933   C2008C00341
Working on page   Asian Development Bank (Additional Subscription) Act 2009   C2009A00109
Working on page   Australia Council (Consequential and Transitional Provisions) Act 2013   C2013A00072
Working on page   Family Law Amendment (Validation of Certain Orders and Other Measures) Act 2012   C2013C00166
Working on pa

Working on page   Family Law (Divorce Fees Validation) Act 2007   C2007A00023
Working on page   Family Trust Distribution Tax (Secondary Liability) Act 1998   C2011C00492
Working on page   Federal Circuit Court of Australia Legislation Amendment Act 2012   C2013C00139
Working on page   New Business Tax System (Former Subsidiary Tax Imposition) Act 1999   C2013C00447
Working on page   New Business Tax System (Franking Deficit Tax) Act 2002   C2004C01224


Working on page   New Business Tax System (Over-franking Tax) Act 2002   C2004A00981
Working on page   New Business Tax System (Untainting Tax) Act 2006   C2006A00081
Working on page   New Business Tax System (Venture Capital Deficit Tax) Act 2003   C2013C00448
Working on page   Trans-Tasman Proceedings Act 2010   C2013C00646


Working on page   Wool International Privatisation Act 1999   C2013C00624
Working on page   Work Health and Safety (Transitional and Consequential Provisions) Act 2011   C2011A00146


## Check for Blank Acts

In [273]:
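acts_blank={} # acts whose scraped text came back empty
for title in df_acts.index:
    if df_acts.loc[title,0] =='':
        acts_blank[title]=detail_links[title]

df_acts_unblank=df_acts.drop(index=list(acts_blank)) # keep only the acts that have text
len(acts_blank), df_acts_unblank.shape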

(3, (1119, 1))

In [274]:
acts_blank

{'Australian Heritage Council (Consequential and Transitional Provisions) Act 2003': 'C2004A01170',
 'Defence Legislation Amendment (Enhancement of the Reserves and Modernisation) Act 2001': 'C2004C01249',
 'Financial Sector Reform (Amendments and Transitional Provisions) Act (No. 1) 1999': 'C2004C01110'}

These three acts returned no text, so they will have to be entered manually or left out of the corpus.
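
The check above compares against the empty string, so an act whose page yielded only whitespace would not be flagged. A slightly broader check, sketched below, treats whitespace-only text as blank too. `find_blank_acts` is a hypothetical helper, and the act text in the usage lines is placeholder text rather than real act content.

```python
def find_blank_acts(acts, detail_links):
    """Return {title: serial} for acts whose text is empty or only whitespace."""
    return {
        title: detail_links[title]
        for title, text in acts.items()
        if not text.strip()
    }

# Usage with placeholder text (not real act content)
sample_links = {
    'AIDC Sale Act 1997': 'C2004A05165',
    'Australian Heritage Council (Consequential and Transitional Provisions) Act 2003': 'C2004A01170',
}
sample_acts = {
    'AIDC Sale Act 1997': ' placeholder text standing in for the act ',
    'Australian Heritage Council (Consequential and Transitional Provisions) Act 2003': ' \n ',
}
blank = find_blank_acts(sample_acts, sample_links)
print(blank)
```

The returned serial numbers can be used to revisit those detail pages by hand.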