![iCor logo light](../logo.png)

# iCOR

iCOR is a company that builds legal registers that companies use to evaluate their environmental performance. They do this by manually searching environmental legislation and summarising it, together with an outline of what has changed.

At the moment iCOR do this process by hand. They would like to be able to automate this. Can we develop a business idea around this?

# Extracting Legislation Content

This notebook will take you through the process of scraping a webpage and extracting the relevant text from it. Here we are going to focus on legislation from [legislation.gov.uk](https://www.legislation.gov.uk/), starting with the [Hazardous Waste Regulations](https://www.legislation.gov.uk/uksi/2005/894/contents).

First we need to import libraries. We will be using the packages `bs4` and `requests` for web scraping.

In [2]:
# For web scraping
import requests
from bs4 import BeautifulSoup

# For text manipulation
import re

# To create a dataframe
import pandas as pd

# To create directories
from pathlib import Path

## Web Scraping

We will extract the text of the [Hazardous Waste Regulations](https://www.legislation.gov.uk/uksi/2005/894/contents). The contents page mostly consists of links to other webpages, but we can use the `Print Options` dropdown to have the content of every linked page printed onto [one webpage](https://www.legislation.gov.uk/uksi/2005/894/data.xht?view=snippet&wrap=true). We will use this webpage instead.
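The single-page link differs from the contents link only in its ending, so it can be derived rather than copied by hand. A minimal sketch, assuming other instruments on the site follow the same URL pattern as this one (we have only checked it against the Hazardous Waste Regulations); the helper name `single_page_url` is our own:

```python
from urllib.parse import urlsplit, urlunsplit


def single_page_url(contents_url: str) -> str:
    """Turn a '.../contents' URL into the single-page 'data.xht' URL.

    Assumption: the target page uses the same pattern as the
    Hazardous Waste Regulations link used in this notebook.
    """
    parts = urlsplit(contents_url)
    path = parts.path.rstrip("/")
    # Drop the trailing '/contents' segment if it is present
    if path.endswith("/contents"):
        path = path[: -len("/contents")]
    # Same query string as the 'one webpage' link above
    query = "view=snippet&wrap=true"
    return urlunsplit((parts.scheme, parts.netloc, path + "/data.xht", query, ""))


contents = "https://www.legislation.gov.uk/uksi/2005/894/contents"
print(single_page_url(contents))
```

It is worth opening the generated link in a browser before scraping it, since the pattern is an assumption rather than something the site documents here.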

We will scrape the information using the `requests` package: `requests.get()` makes a request to the webpage, and the `content` attribute of the response holds the raw content of the webpage.

In [3]:
url = "https://www.legislation.gov.uk/uksi/2005/894/data.xht?view=snippet&wrap=true"
res = requests.get(url)
html_data = res.content
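The cell above assumes the server answers successfully. A slightly more defensive sketch, assuming we would rather get an exception than silently parse an error page; the function name `fetch_html` and its `session` and `timeout` parameters are our own choices and are not used elsewhere in this notebook:

```python
import requests


def fetch_html(url, session=None, timeout=30.0):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    # Use the supplied session if there is one, otherwise the requests module itself
    http = requests if session is None else session
    res = http.get(url, timeout=timeout)  # give up if the server takes too long
    res.raise_for_status()                # raise an exception for 4xx/5xx responses
    return res.content
```

Calling `html_data = fetch_html(url)` would then replace the last two lines of the cell above. Passing a `requests.Session()` as `session` reuses the connection when scraping several pages.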

Next we can use `BeautifulSoup` to parse the raw content in `html_data` into a structured HTML representation, which is often referred to as a tree. We can then make use of the many methods `BeautifulSoup` has to offer to extract information and navigate through the tree.

In [4]:
soup = BeautifulSoup(html_data, "html.parser") # parsing html data so goes into the form of html instead of plain text.
print(soup.prettify(formatter="html")[0:1000]) # restrict print-out to first 1000 characters

<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title xmlns:atom="http://www.w3.org/2005/Atom">
   The Hazardous Waste (England and Wales)Regulations 2005 No. 894
  </title>
  xmlns:atom="http://www.w3.org/2005/Atom"
  <meta content="2024-05-16" name="DC.Date.Modified" scheme="W3CDTF"/>
  xmlns:atom="http://www.w3.org/2005/Atom"
  <meta content="2020-12-31" name="DC.Date.Valid" scheme="W3CDTF"/>
  <style media="screen, print" type="text/css" xmlns:atom="http://www.w3.org/2005/Atom">
   @import "/styles/legislation.css";@import "/styles/secondarylegislation.css";&#xD;
  </style>
 </head>
 <body>
  <div class="LegSnippet">
   <a name="content">
   </a>
   <div class="DocContainer" xmlns:atom="http://www.w3.org/2005/Atom">
    <div class="LegClearFix LegPrelims">
     <p class="LegBanner">
    

Now that we have the document in the form of a tree we can use tags to find specific content by using either the `find()` method to find only the first instance of the tag, or the `find_all()` method to find all instances of the tag. Tags also often have additional information we can filter on, such as classes and ids. We can pass these to the filter to pinpoint the specific content we want to extract.
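To see the difference between `find()`, `find_all()` and filtering on a class or id, here is a small sketch on a hand-written snippet. The classes `LegSnippet` and `LegBanner` appear in the printed tree above; the id `made` and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hand-written example HTML, not taken from the real page
sample = """
<div class="LegSnippet">
  <p class="LegBanner">Statutory Instruments</p>
  <p id="made">Made</p>
  <p>23rd March 2005</p>
</div>
"""
mini_soup = BeautifulSoup(sample, "html.parser")

first_p = mini_soup.find("p")                      # first <p> only
all_p = mini_soup.find_all("p")                    # list of every <p>
banner = mini_soup.find("p", class_="LegBanner")   # filter on class (note the underscore)
made = mini_soup.find(id="made")                   # filter on id

print(first_p.get_text())
print(len(all_p))
print(banner.get_text())
print(made.get_text())
```

`find()` returns `None` when nothing matches, while `find_all()` returns an empty list, so it is worth checking the result before calling `get_text()` on it.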

In [5]:
content = soup.find_all('p')                            # extracting all paragraphs
text_content = [c.text for c in content]                # extracting the plain text from the content
text_content = '\n'.join(text_content)                   # joining the text together
for string in ["\t", "\r", "\xa0"]:                     # removing unwanted characters such as \t for tab.
    text_content = text_content.replace(string, " ")
text_content = re.sub(' +', ' ', text_content)           # removing multiple spaces

print(text_content[0:1000])

Statutory Instruments
ENVIRONMENTAL PROTECTION,ENGLAND AND WALES
Made
23rd March 2005
Laid before Parliament
24th March 2005
Coming into force in accordance with regulation 1(1)
The Secretary of State, being a Minister designated M1 for the purposes of section 2(2) of the European Communities Act 1972 M2 in relation to measures relating to the prevention, reduction and elimination of pollution caused by waste, in exercise of the powers conferred on her by section 2(2) of that Act and section 156 of the Environmental Protection Act 1990 M3, makes the following Regulations: 
Marginal Citations
M1S.I. 1992/2870. The National Assembly for Wales is designated in relation to the controlled management of hazardous waste in Wales (see S.I. 2001/3495). The designations in relation to waste for National Assembly for Wales are shortly to be brought into line with those of the Secretary of State.
M21972 c. 68.
M31990 c. 43. The relevant functions of the Secretary of State in so far as they relate 

In [6]:
len(text_content)

132462

We can see that we have _a lot_ of text, with around 132,500 characters! This isn't too surprising since the webpage we are scraping is very long.

Alternatively, if we are unsure where the main body of text resides within the large html tree, then we can use the `get_text()` method to extract all text from the tree. This is fine for our use case _but_ on other webpages this will extract text such as the names of navigation tabs, any information at the bottom of the webpage, and other unneeded information. We also lose all of the structure of the webpage.
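The trade-off is easier to see on a tiny page. The sketch below uses made-up HTML with a navigation bar and a footer (none of it comes from legislation.gov.uk) and compares the two approaches:

```python
from bs4 import BeautifulSoup

# Made-up page with navigation and footer text around one paragraph
page = """
<html><body>
  <nav><a href="/">Home</a> <a href="/help">Help</a></nav>
  <p>Regulation text.</p>
  <footer>Contact us</footer>
</body></html>
"""
demo = BeautifulSoup(page, "html.parser")

# Paragraphs only: keeps just the text inside <p> tags
paragraph_text = "\n".join(p.get_text() for p in demo.find_all("p"))

# Everything: also picks up the navigation links and the footer
everything = " ".join(demo.get_text(separator=" ").split())

print(paragraph_text)
print(everything)
```

The paragraph-based version returns only the regulation text, while `get_text()` on the whole tree also returns the navigation and footer wording.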

In [7]:
all_text = soup.get_text(separator=' ')                    # extracting all of the text from the tree
for string in ["\n", "\t", "\r", "\xa0"]:                  # removing unwanted characters such as \t for tab.
    all_text = all_text.replace(string, " ")
all_text = re.sub(' +', ' ', all_text)                     # removing multiple spaces

print(all_text[0:1000])

 The Hazardous Waste (England and Wales)Regulations 2005 No. 894 xmlns:atom="http://www.w3.org/2005/Atom" xmlns:atom="http://www.w3.org/2005/Atom" Statutory Instruments 2005 No. 894 ENVIRONMENTAL PROTECTION,ENGLAND AND WALES The Hazardous Waste (England and Wales)Regulations 2005 Made 23rd March 2005 Laid before Parliament 24th March 2005 Coming into force in accordance with regulation 1(1) The Secretary of State, being a Minister designated M1 for the purposes of section 2(2) of the European Communities Act 1972 M2 in relation to measures relating to the prevention, reduction and elimination of pollution caused by waste, in exercise of the powers conferred on her by section 2(2) of that Act and section 156 of the Environmental Protection Act 1990 M3 , makes the following Regulations: Marginal Citations M1 S.I. 1992/2870 . The National Assembly for Wales is designated in relation to the controlled management of hazardous waste in Wales (see S.I . 2001/3495). The designations in relatio

In [8]:
len(all_text)

140744

We can see that this time we extract more text: roughly 8,300 extra characters (140,744 compared with 132,462).
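Both extraction cells repeat the same clean-up steps, so they could share one function. A minimal sketch, where the name `clean_text` and the `keep_newlines` flag are hypothetical and only mirror what the two cells above already do:

```python
import re


def clean_text(text, keep_newlines=True):
    """Replace unwanted whitespace characters with spaces, then squeeze repeated spaces."""
    unwanted = ["\t", "\r", "\xa0"]      # tab, carriage return, non-breaking space
    if not keep_newlines:
        unwanted.append("\n")            # the get_text() cell also flattens newlines
    for char in unwanted:
        text = text.replace(char, " ")
    return re.sub(" +", " ", text)       # collapse runs of spaces into one


print(repr(clean_text("Made\t\t23rd\xa0March\r\n2005")))
print(repr(clean_text("Made\t\t23rd\xa0March\r\n2005", keep_newlines=False)))
```

With `keep_newlines=True` this matches the paragraph-based cell, and with `keep_newlines=False` it matches the `get_text()` cell.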

We can then save this text along with the relevant url that it came from into a csv file. We will also extract the title of the legislation so we know what legislation the text is referring to.

In [9]:
content = soup.find('title')
title = content.get_text()
print(title)

The Hazardous Waste (England and Wales)Regulations 2005 No. 894
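`find()` returns `None` when a page has no `<title>` tag, in which case `content.get_text()` would raise an error. A small sketch of a guard, assuming we would rather fall back to a placeholder than stop; the function name `extract_title` and the default value are our own:

```python
from bs4 import BeautifulSoup


def extract_title(soup, default="Untitled"):
    """Return the page title with surrounding whitespace removed, or a default."""
    tag = soup.find("title")
    if tag is None:                  # the page has no <title> tag
        return default
    return tag.get_text(strip=True)


with_title = BeautifulSoup("<html><head><title> Example Regulations </title></head></html>", "html.parser")
without_title = BeautifulSoup("<html><body><p>No title here</p></body></html>", "html.parser")

print(extract_title(with_title))
print(extract_title(without_title))
```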


In [10]:
# create a dictionary with the information where the keys are the column names and the values are the information, i.e. 'column_name':[info].
info_for_df = {'title':[title], 'url':[url], 'text':[all_text]} 

# create a dataframe from the dictionary
df = pd.DataFrame(info_for_df)
df

|   | title | url | text |
|---|---|---|---|
| 0 | The Hazardous Waste (England and Wales)Regulat... | https://www.legislation.gov.uk/uksi/2005/894/d... | The Hazardous Waste (England and Wales)Regula... |


In [11]:
# save dataframe
Path('data/raw').mkdir(parents=True, exist_ok=True)

df.to_csv('data/raw/legislation.csv', index=False)

The same steps can be repeated for other legislation pages, with each one added to the table as a new row. Try to find other links to environmental legislation that we could apply this to. We can then build up the CSV file with other legislation that we can then use in `02-text_summarisation.ipynb`.

To do this we need to:
1. read in the current csv file with `df = pd.read_csv(...)`
2. perform the web scraping on the url
3. add a new row to the dataframe with this new information
4. repeat steps 2 and 3 until you have extracted all of the legislation that you want
5. save the dataframe.
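The steps above can be sketched as one function. This is only an outline under a few assumptions: `update_register` is a hypothetical name, `scrape` stands for any function you write that takes a URL and returns a dictionary with `title`, `url` and `text` (for example, by wrapping the cells in this notebook), and skipping URLs that are already saved is our own design choice:

```python
from pathlib import Path

import pandas as pd

COLUMNS = ["title", "url", "text"]


def update_register(csv_path, urls, scrape):
    """Add one row per new URL to the CSV file and return the updated dataframe."""
    csv_path = Path(csv_path)

    # 1. read in the current csv file (or start an empty table)
    if csv_path.exists():
        df = pd.read_csv(csv_path)
    else:
        df = pd.DataFrame(columns=COLUMNS)

    # 2-4. scrape each url that is not already in the table
    known = set(df["url"])
    new_rows = []
    for url in urls:
        if url in known:
            continue                      # already scraped, skip it
        new_rows.append(scrape(url))      # scrape() is supplied by the caller
        known.add(url)

    if new_rows:
        new_df = pd.DataFrame(new_rows, columns=COLUMNS)
        df = new_df if df.empty else pd.concat([df, new_df], ignore_index=True)

    # 5. save the dataframe
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_path, index=False)
    return df
```

A call would look like `update_register('data/raw/legislation.csv', list_of_urls, scrape)`, where `list_of_urls` holds the single-page links you have collected.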