# WEB SCRAPING NI ASSEMBLY WEBSITE
Here we will build a web scraper that extracts data not only from the current year but also from previous years.

Defining the URL:

In [688]:
url = 'http://aims.niassembly.gov.uk/officialreport/reports.aspx'

## DEPENDENCIES REQUIRED

In [689]:
pip install selenium



In [690]:
import pandas as pd
import numpy as np

import requests
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.keys import Keys    
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from tqdm import tqdm
import time

To run Selenium in this notebook environment we need some extra setup: install a Chromium driver and configure Chrome to run headless.

This isn't required if we run the scraper on a local or virtual machine.

In [691]:
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

# Run Chrome without a display; the other two flags work around container limits.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

Reading package lists... Done
Building dependency tree       
Reading state information... Done
chromium-chromedriver is already the newest version (83.0.4103.61-0ubuntu0.18.04.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
cp: '/usr/lib/chromium-browser/chromedriver' and '/usr/bin/chromedriver' are the same file


By inspecting the NI Assembly Hansard website we can see that each Assembly session can be viewed on the website itself and can also be downloaded in PDF format. Let's view the page source of this website.

![alt text](https://i.imgur.com/1s7qO7R.png)

Here we initialize a driver (a virtual browser) so that we can navigate within the NI Assembly website.

In [692]:
# chromedriver is on the PATH, so no explicit driver path is needed.
driver = webdriver.Chrome(options=options)
driver.get(url)
# print(driver.page_source) # uncomment to see the whole page source of the website

We will create a placeholder DataFrame (a single empty row) to hold all the data extracted by the scraper.

In [693]:
ni_df = pd.DataFrame(columns=['date', 'minister_name', 'statement'], index=[0])
ni_df

Unnamed: 0,date,minister_name,statement
0,,,


Next we will extract the link addresses of all the NI Assembly sessions that are available on a single page. These are stored in a list, which can later be iterated over with a for-loop to extract the data (date, minister_name, his/her statements) from individual sessions.
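The filtering idea can be tried without a browser. The sketch below is only an illustration: it uses the standard-library `HTMLParser` instead of Selenium, and the markup, ids, and hrefs in `sample_html` are made up rather than copied from the site. It keeps the href and id of every `<a>` whose text is exactly 'View'.

```python
from html.parser import HTMLParser


class ViewLinkParser(HTMLParser):
    """Collect (href, id) for every <a> element whose text is exactly 'View'."""

    def __init__(self):
        super().__init__()
        self._current = None  # attributes of the <a> currently being read
        self._text = []       # text fragments seen inside that <a>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = dict(attrs)
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            if "".join(self._text) == "View" and "href" in self._current:
                self.links.append((self._current["href"], self._current.get("id")))
            self._current = None


# Hypothetical markup for illustration only.
sample_html = """
<a id="view_1" href="report.aspx?docID=111">View</a>
<a id="pdf_1" href="report_111.pdf">PDF</a>
<a id="view_2" href="report.aspx?docID=222">View</a>
<link href="style.css">
"""

parser = ViewLinkParser()
parser.feed(sample_html)
view_links = parser.links
print(view_links)
```

Elements that carry an href but are not 'View' links (the PDF link and the stylesheet here) are dropped, which is the same reason the Selenium cell filters on the link text.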

In [694]:
# Collect every element on the page that carries an href attribute.
all_views = driver.find_elements(By.XPATH, '//*[@href]')

# Spot-check one element and the total number of elements found.
print(all_views[200].get_attribute("href"))
print(all_views[200].get_attribute("id"))
print(len(all_views))

# Keep only the links whose text is exactly 'View'; each opens one session report.
final_view_full = [v for v in all_views if v.get_attribute('textContent') == 'View']

# Option 1: the report URLs (used by Method 1).
final_view_href = [v.get_attribute("href") for v in final_view_full]
for href in final_view_href:
    print(href)
print("Number of NI Assembly session links available on the page: ", len(final_view_href))

# Option 2: the element ids (used by Method 2 to click each link).
final_view_id = [v.get_attribute("id") for v in final_view_full]
for view_id in final_view_id:
    print(view_id)
print("Number of NI Assembly session links available on the page: ", len(final_view_id))

http://aims.niassembly.gov.uk/officialreport/report.aspx?&eveDate=2020/05/12&docID=300902
ctl00_MainContentPlaceHolder_OfficialReportsGridView_ctl13_HTMLViewButton1
292
http://aims.niassembly.gov.uk/officialreport/report.aspx?&eveDate=2020/07/21&docID=304884
http://aims.niassembly.gov.uk/officialreport/report.aspx?&eveDate=2020/07/07&docID=304152
http://aims.niassembly.gov.uk/officialreport/report.aspx?&eveDate=2020/07/06&docID=304151
http://aims.niassembly.gov.uk/officialreport/report.aspx?&eveDate=2020/06/30&docID=303726
http://aims.niassembly.gov.uk/officialreport/report.aspx?&eveDate=2020/06/23&docID=302713
http://aims.niassembly.gov.uk/officialreport/report.aspx?&eveDate=2020/06/16&docID=302204
http://aims.niassembly.gov.uk/officialreport/report.aspx?&eveDate=2020/06/09&docID=301801
http://aims.niassembly.gov.uk/officialreport/report.aspx?&eveDate=2020/06/02&docID=301413
http://aims.niassembly.gov.uk/officialreport/report.aspx?&eveDate=2020/06/01&docID=301412
http://aims.niassembl

We were able to extract 30 links from a single page, and a manual cross-check of the links confirmed that this count is accurate. We will therefore reuse this code snippet for all the sub-pages and when we iterate through different years.
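Each report URL printed above carries an `eveDate` and a `docID` in its query string, so the date of a session can be read from the link before the page is opened. The sketch below assumes that `eveDate` is the sitting date in year/month/day order; `parse_report_url` is a hypothetical helper, not part of the original notebook.

```python
from datetime import datetime
from urllib.parse import parse_qs, urlparse


def parse_report_url(report_url):
    """Return (sitting_date, doc_id) read from a report URL's query string."""
    query = parse_qs(urlparse(report_url).query)
    # Assumption: eveDate is formatted as YYYY/MM/DD.
    sitting_date = datetime.strptime(query["eveDate"][0], "%Y/%m/%d").date()
    doc_id = int(query["docID"][0])
    return sitting_date, doc_id


# URL taken from the cell output above.
example = "http://aims.niassembly.gov.uk/officialreport/report.aspx?&eveDate=2020/05/12&docID=300902"
sitting_date, doc_id = parse_report_url(example)
print(sitting_date.isoformat(), doc_id)
```

A value obtained this way could fill the date column of `ni_df` without any extra page parsing.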

We can iterate over the session reports in two ways.
1. By href: take the href of every link whose text is 'View', then open each URL in a for loop inside a try/except block and extract the essential data.
2. By id: take the id of every such link, then pass it to a clicker routine in which the virtual browser clicks into the session report page and the HTML is parsed to extract the required information.

Both methods are timed with tqdm, and the faster one will be implemented.

In [695]:
# METHOD 1 (using href): open each report URL directly
for link in tqdm(final_view_href, total=len(final_view_href)):
    # Launch a new browser for each link and always close it afterwards.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(link)
        ni_text = driver.find_element(By.TAG_NAME, "main")
        # print(ni_text.text)
    except Exception as exc:
        print("error!", exc)
        break
    finally:
        driver.quit()

  0%|          | 0/30 [00:00<?, ?it/s]
  3%|▎         | 1/30 [00:06<03:17,  6.82s/it]
  7%|▋         | 2/30 [00:13<03:08,  6.73s/it]
 10%|█         | 3/30 [00:18<02:47,  6.21s/it]
 13%|█▎        | 4/30 [00:24<02:42,  6.24s/it]
 17%|█▋        | 5/30 [00:30<02:32,  6.11s/it]
 20%|██        | 6/30 [00:37<02:30,  6.26s/it]
 23%|██▎       | 7/30 [00:44<02:29,  6.51s/it]
 27%|██▋       | 8/30 [00:49<02:13,  6.07s/it]
 30%|███       | 9/30 [00:54<01:59,  5.68s/it]
 33%|███▎      | 10/30 [00:59<01:53,  5.67s/it]
 37%|███▋      | 11/30 [01:04<01:44,  5.49s/it]
 40%|████      | 12/30 [01:10<01:37,  5.43s/it]

In [696]:
# Start a fresh browser on the listing page for Method 2.
driver = webdriver.Chrome(options=options)
driver.get(url)

In [697]:
# METHOD 2 (using id and clicker): click each 'View' link in the same browser
try:
    for view_id in tqdm(final_view_id, total=len(final_view_id)):
        # Wait up to 10 s for the link to be present on the listing page.
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, view_id))
        )
        element.click()
        ni_text = driver.find_element(By.TAG_NAME, "main")
        # print(ni_text.text)
        driver.back()  # return to the listing page for the next link
except Exception as exc:
    print("error!", exc)
    driver.quit()

  0%|          | 0/30 [00:00<?, ?it/s]
  3%|▎         | 1/30 [00:01<00:32,  1.11s/it]
  7%|▋         | 2/30 [00:02<00:35,  1.29s/it]
 10%|█         | 3/30 [00:03<00:32,  1.20s/it]
 13%|█▎        | 4/30 [00:05<00:32,  1.27s/it]
 17%|█▋        | 5/30 [00:06<00:30,  1.22s/it]
 20%|██        | 6/30 [00:07<00:27,  1.16s/it]
 23%|██▎       | 7/30 [00:09<00:30,  1.31s/it]
 27%|██▋       | 8/30 [00:10<00:27,  1.26s/it]
 30%|███       | 9/30 [00:11<00:24,  1.18s/it]
 33%|███▎      | 10/30 [00:12<00:26,  1.34s/it]
 37%|███▋      | 11/30 [00:13<00:23,  1.21s/it]
 40%|████      | 12/30 [00:14<00:19,  1.09s/it]

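We can see that Method 2 takes significantly less time: roughly 1.1 to 1.3 s per session, against roughly 5.4 to 6.8 s for Method 1. Part of the gap likely comes from Method 1 launching a new browser for every link.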
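As a rough check on the comparison, the average time per session can be computed from the last progress line shown for each method (12 sessions in 01:10 for Method 1, 12 sessions in 00:14 for Method 2). This is simple arithmetic on the displayed output, not a benchmark; `seconds_per_item` is a hypothetical helper.

```python
def seconds_per_item(elapsed, done):
    """Average seconds per item from a tqdm-style 'MM:SS' elapsed time."""
    minutes, seconds = elapsed.split(":")
    return (int(minutes) * 60 + int(seconds)) / done


method_1 = seconds_per_item("01:10", 12)  # Method 1: 12 sessions in 70 s
method_2 = seconds_per_item("00:14", 12)  # Method 2: 12 sessions in 14 s
speedup = method_1 / method_2

print(f"Method 1: {method_1:.2f} s/session")
print(f"Method 2: {method_2:.2f} s/session")
print(f"Method 2 is about {speedup:.1f}x faster on these 12 sessions")
```

Both outputs are truncated at 12 of 30 sessions, so the ratio describes only that part of the run.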

## WORK IN PROGRESS

In [698]:
# # Gave page info
# try:
#     element = WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.TAG_NAME, "main"))
#         )
#     element.click()
#     ni_text = driver.find_element_by_tag_name("main")      # tag_name
#     print(ni_text.text)
#     driver.back()

# except:
#     print("error!")
#     driver.quit()

In [699]:
# try:
#     element = WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.ID, "ctl00_MainContentPlaceHolder_OfficialReportsGridView_ctl02_HTMLViewButton2"))
#     )
#     element.click()

#     # main = WebDriverWait(driver, 10).until(
#     #     EC.presence_of_element_located((By.CLASS_NAME, "col-10 col-med-12 last"))   #class_name
#     # )
#     # print(main.text)


#     # print(driver.page_source) # uncomment to see the whole page source of the website

#     # driver.back()
#     # driver.forward()
# except:
#     print("error!")
#     driver.quit()

In [700]:
# try:
#     ni_text = driver.find_element_by_tag_name("main")      # tag_name
#     print(ni_text.text)
# except:
#     print("error!")
#     driver.quit()

In [701]:
# try:
#     ni_text = driver.find_elements_by_tag_name("div")
#     print(ni_text[130].get_attribute('innerHTML'))
# except:
#     print("error!")
#     driver.quit()

In [702]:
# print(driver.page_source) # uncomment to see the whole page source of the website

# print(driver.current_url)
# url = driver.current_url
# response = requests.get(url)
# soup = BeautifulSoup(response.text, "html.parser")
# print(soup.text)
