# Web Scraping Seminar
By Clayton Boone and Pedro Romero

# What is web scraping?

Web scraping is the process of extracting data from websites. Sometimes when we are asked to build a model for analysis we are given data to use, but at other times we are forced to collect our own. Scraping lets us get the data we need by reading a website's HTML code. Reading the HTML shows us how the page is built, where the information is contained, and which tags hold the information we need. With web scraping we can refresh our own data whenever new data is loaded onto the website, retrieve only the information we are interested in without having to clean data we do not need, and combine information from multiple sources into a single data set. It also gives us access to a very large amount of data that we might not otherwise have had at our disposal.

# What tools do we have for web scraping?

The first web scraping tool we are going to talk about is Beautiful Soup. Beautiful Soup is used to pull data out of an HTML or XML file. It turns these files into a structure that is easier to read, so that we can see how the file is laid out, where the information we want is located, and which tags we should ask for to get it. The commands we will discuss in this seminar are .get() and .content from the requests library, which download a page, and .find(), .find_all(), and .get_text() from Beautiful Soup, which search the parsed page and extract text. The other tool we will discuss is Selenium, which is used to open and navigate websites from code. Selenium is mainly used to test a website, making sure that each route through the site is usable, but it can also open a website so that we can gather information from it. Selenium drives the browser using only code and the HTML/XML tags and ids of the page. We will use Selenium in this presentation to open websites and gather information from them.

# Guided example

First, we will do a guided example that shows how to scrape the seven day forecast from the website of the National Weather Service. The first thing we need to do is request the content of the page we want to scrape. The requests.get() command sends a request to the URL and returns the page. Next we use Beautiful Soup to parse the HTML of the National Weather Service page so that we can view a complete and readable version of it. By printing the variable weather_soup we can view the HTML and see which tags are included and what information each one contains. Before starting any scraping it is important to inspect the HTML, look at how it is built, and note the names of the ids and classes, so that it is easier to select the data you are trying to scrape. In this example we want to scrape all the data from the seven day forecast section of the page. Looking at the HTML, we can see that all the information we need is contained in the element whose id is seven-day-forecast. To get it we use the find command, which searches for and returns the first element with that id. Looking further into this element, we notice that each day is contained in an element with the class tombstone-container. To get the information in each tombstone-container we do a find_all search within the seven-day-forecast element. It is important to note that find_all returns a list, here stored in the variable seven_list, and we cannot extract information from the list as a whole. To collect text or data we must iterate over the items of the list and apply the extraction commands to each item.
http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.Woor6KinHIU      
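
The difference between find and find_all can be tried offline before touching the real page. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree (find and findall) in place of Beautiful Soup so that it runs without a network connection or extra packages, and the HTML snippet is made up to mimic the structure described above.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet that mimics the structure described above.
SNIPPET = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Today</p>
    <p class="temp">High: 51 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="temp">Low: 43 F</p>
  </div>
</div>
"""

root = ET.fromstring(SNIPPET)

# find() returns only the first matching element (or None if nothing matches).
first_day = root.find(".//div[@class='tombstone-container']")

# findall() returns a list of every match, so we loop to reach each element.
all_days = root.findall(".//div[@class='tombstone-container']")

first_period = first_day.find("p[@class='period-name']").text
periods = [day.find("p[@class='period-name']").text for day in all_days]

print(first_period)
print(periods)
```

The same pattern applies with Beautiful Soup: a single result can be queried directly, while a list of results has to be looped over.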

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib
import urllib.request
from urllib.error import HTTPError
import io
import re
from math import floor
import random
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.keys import Keys


weather_page = requests.get('http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.WoW2W6inHIU')
weather_soup = BeautifulSoup(weather_page.content, 'html.parser')
seven_day = weather_soup.find(id = 'seven-day-forecast')
seven_list = seven_day.find_all(class_ = 'tombstone-container')

The next box of code shows how to gather text and other information once you have used a find_all statement to select your section of the HTML. For this first example we use the first element of seven_list to collect the day of the week, the description of the weather for that day, the high or low temperature, and the short description. We collect the day of the week by using the find() command to find the first element with the class period-name and then using the get_text() command to get the text contained between its opening and closing tags. We follow the same process for the temperature and short description; the only thing that differs is the class name we search for. The full description is not stored as text. It is stored in the title attribute of the img tag, so we first find the first 'img' tag in the container and then read its title with dis['title'].

In [2]:
today = seven_list[0]
period = today.find(class_ ='period-name').get_text()
dis = today.find('img')
short_dis = today.find(class_ = 'short-desc').get_text()
tem = today.find(class_ = 'temp').get_text()
print(period, dis['title'], tem, short_dis)


Today Today: Increasing clouds, with a high near 51. East southeast wind 5 to 13 mph becoming west in the afternoon.  High: 51 °F IncreasingClouds


The next box of code shows the steps needed to create a dataframe that contains the day, the description, the high or low temperature, and the short description for each day of the week. We iterate through each tombstone-container in seven_list. In each iteration we find the day of the week, description, short description, and temperature and append them to their lists.

In [3]:
seven_title = []
seven_dis = []
seven_short_dis = []
seven_temp = []
for i in seven_list:
    title = i.find(class_ = 'period-name').get_text()
    seven_title.append(title)
    dis = i.find('img')
    seven_dis.append(dis['title'])
    short = i.find(class_ = 'short-desc').get_text()
    seven_short_dis.append(short)
    temp = i.find(class_ = 'temp').get_text()
    seven_temp.append(temp)
chart = pd.DataFrame([seven_dis,seven_short_dis,seven_temp], index = ['Discription', 'Short Discription', 'Temp']\
                     ,columns = seven_title)
chart

Unnamed: 0,Today,Tonight,Wednesday,WednesdayNight,Thursday,ThursdayNight,Friday,FridayNight,Saturday
Discription,"Today: Increasing clouds, with a high near 51....","Tonight: A 30 percent chance of showers, mainl...","Wednesday: A 30 percent chance of showers, mai...","Wednesday Night: Partly cloudy, with a low aro...","Thursday: Mostly sunny, with a high near 54. B...","Thursday Night: Mostly clear, with a low aroun...","Friday: Mostly sunny, with a high near 56.","Friday Night: Mostly clear, with a low around 43.","Saturday: Mostly sunny, with a high near 57."
Short Discription,IncreasingClouds,ChanceShowers,ChanceShowers,Partly Cloudy,Mostly Sunnyand Breezy,Mostly Clearand Breezythen PartlyCloudy,Mostly Sunny,Mostly Clear,Mostly Sunny
Temp,High: 51 °F,Low: 43 °F,High: 54 °F,Low: 44 °F,High: 54 °F,Low: 42 °F,High: 56 °F,Low: 43 °F,High: 57 °F
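
Some short descriptions in the table above run their words together (for example IncreasingClouds). A likely cause, which is an assumption since the page's HTML is not reproduced here, is that the words are separated only by a line-break tag, so joining the text pieces with no separator glues them together. The sketch below shows the effect on a made-up snippet using the standard library rather than Beautiful Soup. In Beautiful Soup the corresponding fix is to pass a separator to get_text, as in the get_text(" ") call used in the OAI code later in this notebook.

```python
import xml.etree.ElementTree as ET

# Made-up snippet: the two words are separated only by a <br/> tag.
short_desc = ET.fromstring('<p class="short-desc">Increasing<br/>Clouds</p>')

# Joining the text pieces with nothing between them glues the words together.
glued = "".join(short_desc.itertext())

# Joining with a space keeps the words apart.
spaced = " ".join(short_desc.itertext())

print(glued)
print(spaced)
```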


# Using An API to Scrape Data

Next, we will discuss how to use an API to gather our data. The website we will be collecting data from is the arXiv, and we will use the arXiv API to do it. API stands for Application Programming Interface and is defined as a set of clearly defined methods of communication between software components. An API makes it easier to navigate a website and to search through the sections you need. Before using an API it is very important to read its documentation and learn how it is set up. The documentation for the arXiv API explains how each section/subject can be selected and how to choose how many articles to retrieve. Here is a link to the API documentation for the arXiv website: https://arxiv.org/help/api/user-manual#_calling_the_api. The arXiv API only allows us to pull 1,000 papers per subject. To pull a larger number of articles the documentation suggests using the OAI-PMH interface, which we will discuss later in this presentation. In the code block below we collect the id, date published, title, subject, and authors for the first 10 papers in the category stat.AP. We only get the first 10 papers because the API by default returns only the first 10 results of a query. We will explain how to request more papers in the next example. In this example we use urlopen to open the URL we want to retrieve data from. Next we use Beautiful Soup to parse the XML that the API returns so that we can view it and see its components. First we collect every element tagged entry.
It is important to note that each entry holds the information available for a single paper. Next, we iterate through the entries and collect the date, id, authors, title, and subject for each paper. The first rows of a dataframe containing the scraped information are shown below the code.

In [4]:
import urllib.request
url_api = "http://export.arxiv.org/api/query?search_query=cat:stat.AP"

rd = urllib.request.urlopen(url_api)
soup = BeautifulSoup(rd,"xml")
entry = soup.findAll('entry')
id_list =[]
date_list = []
title_list = []
au_list = []
sub_list = []
for i in entry:
    au_new= []
    id_list.append(i.find('id').get_text())
    date_list.append(i.find('published').get_text())
    title_list.append(i.find('title').get_text())
    cat = i.find('category')
    sub_list.append(cat['term'])
    for j in i.find_all('name'):
        au_new.append(j.get_text())
    au_list.append(au_new)
paper_id = pd.DataFrame(id_list)
date = pd.DataFrame(date_list)
title = pd.DataFrame(title_list)
authors = pd.DataFrame(au_list)
subject = pd.DataFrame(sub_list)
date.rename(columns={0:'dates'},inplace = True)
paper_id.rename(columns= {0:'id'},inplace=True)
title.rename(columns = {0:'titles'},inplace = True)
subject.rename(columns={0:'subjects'},inplace = True)
use_data = pd.concat([paper_id,date,title,authors,subject],axis = 1)
use_data.head()



Unnamed: 0,id,dates,titles,0,1,2,3,subjects
0,http://arxiv.org/abs/0704.1711v2,2007-04-13T07:49:11Z,"Dynamical Equilibrium, trajectories study in a...",Patrick Letrémy,Marie Cottrell,Patrice Gaubert,Joseph Rynkiewicz,stat.AP
1,http://arxiv.org/abs/0704.3474v1,2007-04-26T05:03:08Z,Missing Data: A Comparison of Neural Network a...,Fulufhelo V. Nelwamondo,Shakir Mohamed,Tshilidzi Marwala,,stat.AP
2,http://arxiv.org/abs/0704.3862v1,2007-04-29T21:06:10Z,An Integrated Human-Computer System for Contro...,Tshilidzi Marwala,Monica Lagazio,Thando Tettey,,stat.AP
3,http://arxiv.org/abs/0705.0569v1,2007-05-04T07:24:17Z,Mixed models for longitudinal left-censored re...,Rodolphe Thiébaut,Hélène Jacqmin-Gadda,,,stat.AP
4,http://arxiv.org/abs/0705.2515v1,2007-05-17T11:29:08Z,Finite Element Model Updating Using Bayesian A...,Tshilidzi Marwala,Lungile Mdlazi,Sibusiso Sibisi,,stat.AP
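
In the table above the authors spill into numbered columns (0, 1, 2, 3), one per author, so the table grows wider with the longest author list. One possible alternative is to keep a single authors column by joining the names. The sketch below shows that idea on a made-up feed that uses the same element names as the arXiv response. It uses the standard library's xml.etree.ElementTree instead of Beautiful Soup so that it runs offline, and the ids and titles are placeholders, not real papers.

```python
import xml.etree.ElementTree as ET

# Made-up feed with the same element names as the arXiv response (Atom format).
FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://example.org/abs/0001</id>
    <published>2007-04-13T07:49:11Z</published>
    <title>First example paper</title>
    <author><name>Author One</name></author>
    <author><name>Author Two</name></author>
    <category term="stat.AP"/>
  </entry>
  <entry>
    <id>http://example.org/abs/0002</id>
    <published>2007-04-26T05:03:08Z</published>
    <title>Second example paper</title>
    <author><name>Author Three</name></author>
    <category term="stat.AP"/>
  </entry>
</feed>
"""
# ElementTree needs the Atom namespace spelled out when searching.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

rows = []
for entry in ET.fromstring(FEED).findall("atom:entry", ATOM):
    names = [name.text for name in entry.findall("atom:author/atom:name", ATOM)]
    rows.append({
        "id": entry.findtext("atom:id", namespaces=ATOM),
        "date": entry.findtext("atom:published", namespaces=ATOM),
        "title": entry.findtext("atom:title", namespaces=ATOM),
        "authors": ";".join(names),  # one string per paper, however many authors
        "subject": entry.find("atom:category", ATOM).get("term"),
    })

for row in rows:
    print(row)
```

A list of dictionaries like rows can be passed directly to pd.DataFrame, which gives one row per paper and a fixed set of columns.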


The code below shows an example where we choose how many papers to scrape from each section. Remember that the API only allows us to collect 1,000 papers from each subject, so max_results should not be greater than 1,000. In this example we select the first 3 papers from each of the first 3 entries in the category list, which are astro-ph.CO, astro-ph.EP, and astro-ph.GA. The first rows of the resulting dataframe are shown below the code.

In [5]:
url_1 = 'http://export.arxiv.org/help/api/user-manual#detailed_examples'
rd_1 =urllib.request.urlopen(url_1)
soup_2 = BeautifulSoup(rd_1,'xml')
table = soup_2.findAll('table')
tab_cat = table[-1]
cati = []
fin_cat = []
tab_cat = tab_cat.findAll('td')
for cat in tab_cat:
    cati.append(cat.contents[0].strip())
cou = len(cati)
for i in range(cou):
    if (i%2==1 or i == 0):
        pass
    else:
        fin_cat.append(cati[i])
extra_list = ["econ.EM","eess.AS", "eess.IV", "eess.SP","q-fin.CP", "q-fin.EC", "q-fin.GN", "q-fin.MF", "q-fin.PM",\
             "q-fin.PR", "q-fin.RM", "q-fin.ST", "q-fin.TR"]
all_cat = fin_cat + extra_list
data_list = []
title = []
new_title = []
au_list_2 = []
subject = []
date = []
for t in all_cat[:3]:  # first 3 categories only; loop over all_cat for every subfield
    url_3 = 'http://export.arxiv.org/api/query?search_query=cat:'+ t + '&start=0&max_results=3'
    # requests the first 3 papers of this category
    rd_3 = urllib.request.urlopen(url_3)  # open the url
    soup_3 = BeautifulSoup(rd_3, 'html.parser')  # parse the XML returned for this url
    entry = soup_3.findAll('entry')  # find every entry tag together with the data inside it
    for i in entry:  # iterate through each entry
        au_new_2 = []
        data_list.append(i.find('id').get_text())
        date.append(i.find('published').get_text())
        title.append(i.find('title').get_text())
        cat_2 = i.find('category')
        subject.append(cat_2['term'])
        for j in i.find_all('name'):
            au_new_2.append(j.get_text())
        au_list_2.append(au_new_2)
id_ = pd.DataFrame(data_list)
paper_date = pd.DataFrame(date)
paper_title = pd.DataFrame(title)
paper_authors = pd.DataFrame(au_list_2)
paper_subject = pd.DataFrame(subject)
id_.rename(columns= {0:'id'},inplace=True)
paper_title.rename(columns = {0:'titles'},inplace = True)
paper_subject.rename(columns={0:'subjects'},inplace = True)
paper_date.rename(columns={0:'date'},inplace = True)
data_test = pd.concat([id_,paper_date,paper_title,paper_authors,paper_subject],axis = 1)
data_test.head()
     


Unnamed: 0,id,date,titles,0,1,2,3,4,5,6,...,15,16,17,18,19,20,21,22,23,subjects
0,http://arxiv.org/abs/0901.0173v1,2009-01-01T10:33:18Z,Non-Minimal Quintessence With Nearly Flat Pote...,Anjan A Sen,Gaveshna Gupta,Sudipta Das,,,,,...,,,,,,,,,,astro-ph.CO
1,http://arxiv.org/abs/0901.0189v1,2009-01-01T18:36:01Z,Robust determination of the major merger fract...,C. López-Sanjuan,M. Balcells,C. E. García-Dabó,M. Prieto,D. Cristóbal-Hornillos,M. C. Eliche-Moral,D. Abreu,...,,,,,,,,,,astro-ph.CO
2,http://arxiv.org/abs/0901.0245v2,2009-01-02T16:05:48Z,"Neutrino Masses, Dark Energy and the Gravitati...",R. Benton Metcalf,,,,,,,...,,,,,,,,,,astro-ph.CO
3,http://arxiv.org/abs/0901.0282v2,2009-01-02T21:04:22Z,HAT-P-11b: A Super-Neptune Planet Transiting a...,G. Á. Bakos,G. Torres,A. Pál,J. Hartman,Géza Kovács,R. W. Noyes,D. W. Latham,...,A. Howard,S. Vogt,Gábor Kovács,J. Fernandez,A. Moór,R. P. Stefanik,J. Lázár,I. Papp,P. Sári,astro-ph.EP
4,http://arxiv.org/abs/0901.0482v1,2009-01-05T13:53:42Z,Physical collisions of moonlets and clumps wit...,Sebastien Charnoz,,,,,,,...,,,,,,,,,,astro-ph.EP
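
The start and max_results parameters in the query URL above can also be used to fetch a larger result set in several smaller requests. The sketch below only builds the URLs and does not contact the arXiv. The helper name page_urls and the page size are our own choices for illustration, not part of the API.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"

def page_urls(category, total, page_size):
    """Return query URLs that together cover `total` results for one category."""
    urls = []
    for start in range(0, total, page_size):
        params = {
            "search_query": "cat:" + category,
            "start": start,
            # The last page may need fewer than page_size results.
            "max_results": min(page_size, total - start),
        }
        # safe=":" keeps the colon in "cat:..." readable instead of "%3A".
        urls.append(BASE_URL + "?" + urlencode(params, safe=":"))
    return urls

urls = page_urls("stat.AP", total=25, page_size=10)
for u in urls:
    print(u)
```

Each URL can then be opened with urlopen and parsed exactly as in the loop above, ideally with a short pause between requests.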


# Using OAI to Scrape Data

Using the API can be very useful when scraping a small amount of data, but it is much less useful when you want to collect a large amount. OAI stands for Open Archives Initiative, and its protocol for metadata harvesting (OAI-PMH) can be used to harvest a large set of information from a website. It is important to read the OAI documentation to see the setup and requirements for using it. The link for the OAI documentation is http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#ListMetadataFormats. For our semester project we wanted to collect all the articles available on the arXiv and had to figure out a way to scrape a very large amount of data. After doing some research on techniques that have been used to collect all the articles from the arXiv, Tu converted code from another source to fit our needs. The code below shows how we collected all the articles for each section. First, we collected all the sections (sets) that are included in the arXiv. Then we collected all the papers from each section. At the beginning of the while loop you will notice a try/except statement. This statement handles pages that do not open on the first try: when the server answers with a 503 error, the code waits for the number of seconds given in the retry-after header (30 seconds if the header is missing) and then requests the same URL again. For each paper we collected the title, the date created, the paper id, the authors, and the subject field. At the end of the while loop you will notice an if/else statement. It checks whether the XML contains a resumptionToken element. If it does, there are more papers to collect, and we build the next URL from that token. This process is repeated until all papers are collected from that section.
This code is set to find all the papers in the econ section, but it can be changed to collect all the papers in the arXiv by looping over all_cat.

In [6]:
url = "http://export.arxiv.org/oai2?verb=ListSets"
u = urllib.request.urlopen(url, data = None)
f = io.TextIOWrapper(u,encoding='utf-8')
text = f.read()
soup = BeautifulSoup(text, 'xml')
all_cat = [sp.text for sp in soup.findAll("setSpec")]

f = open("all_cat_v01.txt", "w")
f.write(",".join(all_cat))
f.close()
def scrape(cat):
    
    # Initialization
    df = pd.DataFrame(columns=("doi", "date", "title", "authors", "category"))
    base_url = "http://export.arxiv.org/oai2?verb=ListRecords&"
    url = base_url + "set={}&metadataPrefix=arXiv".format(cat)
    
    # while loop in order to loop through all the results
    while True:
        # print url to keep track of stuff
        print(url)
        # accessing the url
        try:
            u = urllib.request.urlopen(url, data = None)
        except HTTPError as e:
            # In case of an error that requires us to wait
            if e.code == 503:
                to = int(e.hdrs.get("retry-after", 30))
                print("Got 503. Retrying after {0:d} seconds.".format(to))
                time.sleep(to)
                continue # url is unchanged, so this retries the same request
            else:
                raise

        # reading the file
        f = io.TextIOWrapper(u,encoding='utf-8')
        text = f.read()
        soup = BeautifulSoup(text, 'xml')

        # collecting the data
        for record in soup.findAll("record"):
            try:
                doi = record.find("identifier").text
            except:
                doi = np.nan
            
            try:
                date = record.find("created").text
            except:
                date = np.nan
            
            try:
                title = record.find("title").text
            except:
                title = np.nan
            
            try:
                authors = ";".join([author.get_text(" ") for author in record.findAll("author")])
            except:
                authors = np.nan
            
            try:
                category = record.find("setSpec").text
            except:
                category = np.nan
                
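            # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
            row = pd.DataFrame([{"doi":doi, "date":date, "title":title, "authors":authors, "category":category}])
            df = pd.concat([df, row], ignore_index=True)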
                

        # Seeing if there is still data

        token = soup.find("resumptionToken")
        if token is None or not token.text:  # the last page carries an empty token
            break
        else:
            url = base_url + "resumptionToken=%s"%(token.text)
        
    return(df)
master_df = pd.DataFrame(columns=("doi", "date", "title", "authors", "category"))
# for i in all_cat:
#     print("----------------",i,"-------------------")
df = scrape('econ')
master_df = pd.concat([master_df, df], ignore_index = True)
master_df.head()


http://export.arxiv.org/oai2?verb=ListRecords&set=econ&metadataPrefix=arXiv
Got 503. Retrying after 10 seconds.
http://export.arxiv.org/oai2?verb=ListRecords&set=econ&metadataPrefix=arXiv


Unnamed: 0,doi,date,title,authors,category
0,oai:arXiv.org:0704.3649,2007-04-27,Quantile and Probability Curves Without Crossing,Chernozhukov Victor MIT;Fernandez-Val Ivan Bos...,econ
1,oai:arXiv.org:0704.3686,2007-04-27,Improving Estimates of Monotone Functions by R...,Chernozhukov Victor MIT;Fernandez-Val Ivan Bos...,econ
2,oai:arXiv.org:0708.1627,2007-08-12,Rearranging Edgeworth-Cornish-Fisher Expansions,Chernozhukov Victor;Fernandez-Val Ivan;Galicho...,econ
3,oai:arXiv.org:0806.4730,2008-06-28,Improving Point and Interval Estimates of Mono...,Chernozhukov Victor;Fernandez-Val Ivan;Galicho...,econ
4,oai:arXiv.org:0904.0951,2009-04-06,Inference on Counterfactual Distributions,Chernozhukov Victor;Fernandez-Val Ivan;Melly B...,econ
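
The control flow of the scrape function (retry the same URL after a 503, then follow resumption tokens until none is left) can be tested without contacting the server. The sketch below is a simplified illustration under several assumptions: the Busy exception stands in for the HTTP 503 error, fake_fetch and its two tiny XML pages stand in for the arXiv server, the sleep function is passed in so that the example does not really wait, and the standard library's ElementTree is used in place of Beautiful Soup.

```python
import xml.etree.ElementTree as ET

BASE_URL = "http://export.arxiv.org/oai2?verb=ListRecords&"

class Busy(Exception):
    """Stand-in for an HTTP 503 response that carries a retry delay."""
    def __init__(self, retry_after):
        super().__init__("503")
        self.retry_after = retry_after

def harvest(fetch, first_url, sleep):
    """Collect <identifier> texts from every page, following resumption tokens."""
    identifiers = []
    url = first_url
    while True:
        try:
            text = fetch(url)
        except Busy as err:
            sleep(err.retry_after)  # wait, then retry the SAME url
            continue
        root = ET.fromstring(text)
        identifiers += [el.text for el in root.iter("identifier")]
        token = root.find(".//resumptionToken")
        # A missing or empty token means this was the last page.
        if token is None or not token.text:
            return identifiers
        url = BASE_URL + "resumptionToken=" + token.text

# Fake server: two pages, and the very first request answers "busy" once.
FIRST_URL = BASE_URL + "set=econ&metadataPrefix=arXiv"
PAGES = {
    FIRST_URL: "<r><identifier>a</identifier><resumptionToken>t1</resumptionToken></r>",
    BASE_URL + "resumptionToken=t1": "<r><identifier>b</identifier><resumptionToken/></r>",
}
calls = []
waits = []

def fake_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        raise Busy(retry_after=10)
    return PAGES[url]

ids = harvest(fake_fetch, FIRST_URL, waits.append)
print(ids, waits, len(calls))
```

Passing fetch and sleep in as arguments is a design choice made for this example: it lets the loop be checked quickly, and the real versions (a function wrapping urlopen, and time.sleep) can be supplied when harvesting for real.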


The code below is included to show the makeup of the XML and to show why we chose the tags that we chose.

In [7]:
url_oai = 'http://export.arxiv.org/oai2?verb=ListRecords&set=cs&metadataPrefix=arXiv'
u = urllib.request.urlopen(url_oai, data = None)
f = io.TextIOWrapper(u,encoding='utf-8')
text = f.read()
soup_oai_test = BeautifulSoup(text, 'xml')
soup_oai_test

<?xml version="1.0" encoding="utf-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2018-02-20T18:07:48Z</responseDate>
<request metadataPrefix="arXiv" set="cs" verb="ListRecords">http://export.arxiv.org/oai2</request>
<ListRecords>
<record>
<header>
<identifier>oai:arXiv.org:0704.0002</identifier>
<datestamp>2008-12-13</datestamp>
<setSpec>cs</setSpec>
</header>
<metadata>
<arXiv xsi:schemaLocation="http://arxiv.org/OAI/arXiv/ http://arxiv.org/OAI/arXiv.xsd">
<id>0704.0002</id><created>2007-03-30</created><updated>2008-12-13</updated><authors><author><keyname>Streinu</keyname><forenames>Ileana</forenames></author><author><keyname>Theran</keyname><forenames>Louis</forenames></author></authors><title>Sparsity-certifying Graph Decompositions</title><categories>math.CO cs.CG</categories><comments>To appear i
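
The XML above also explains the shape of the authors column in the earlier dataframe: each author element holds a keyname and a forenames element, so joining the text inside an author with a space gives the last name followed by the first name. The sketch below repeats the extraction on a trimmed copy of the record shown above. As a simplification the namespace attributes are removed, and the standard library's ElementTree is used instead of Beautiful Soup so that the example runs offline. The helper author_name is our own name for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record shown above, with the namespace attributes removed.
RECORD = """
<record>
  <header>
    <identifier>oai:arXiv.org:0704.0002</identifier>
    <setSpec>cs</setSpec>
  </header>
  <metadata>
    <arXiv>
      <created>2007-03-30</created>
      <authors>
        <author><keyname>Streinu</keyname><forenames>Ileana</forenames></author>
        <author><keyname>Theran</keyname><forenames>Louis</forenames></author>
      </authors>
      <title>Sparsity-certifying Graph Decompositions</title>
    </arXiv>
  </metadata>
</record>
"""
record = ET.fromstring(RECORD)

def author_name(author):
    """Join the non-empty text pieces of one <author> (keyname, then forenames)."""
    parts = [piece.strip() for piece in author.itertext()]
    return " ".join(p for p in parts if p)

row = {
    "doi": record.findtext(".//identifier"),
    "date": record.findtext(".//created"),
    "title": record.findtext(".//title"),
    "authors": ";".join(author_name(a) for a in record.iter("author")),
    "category": record.findtext(".//setSpec"),
}
print(row)
```

Note that the column named doi holds the OAI identifier of the record, as in the scrape function above.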

# Selenium

Selenium is a collection of open source APIs used to automate the testing of a website. Selenium can be used with a number of browsers, but for this presentation we will use the Firefox browser. Selenium allows us to test websites and to make sure that we can move around within a website without any trouble. In this presentation we use the Selenium webdriver, which lets us control a browser from code and connect to the websites we would like to open. The webdriver uses a get command to go to the website the user wants. In this first example we pass the Facebook URL to the get command, so the webdriver opens a Firefox browser on that page, and we then use the webdriver to log into Facebook. In order to type in the username we must first tell the webdriver where the username field is located. We find its location by inspecting the username field on the Facebook page and looking at its HTML. The HTML shows that the username field has the id 'email', so we use the find_element_by_id command to locate the field and then the send_keys command to type the username into it. We use the same steps to type in the password, but instead of 'email' we use 'pass' as the id because that is the id of the password field. To submit the form we use the send_keys command again, this time passing Keys.RETURN to press Enter. When you are done using the webdriver it is very important to close the driver, which can be done with the .close() command.

In [8]:
user = "im not gonna use my real user name"
pwd = "im not gonna use my own password"
driver = webdriver.Firefox()
driver.get("http://www.facebook.com")
assert "Facebook" in driver.title
elem = driver.find_element_by_id("email")
elem.send_keys(user)
elem = driver.find_element_by_id("pass")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)

In this next example we open a Firefox browser and then load the Google website. We type the word hockey into the search box and submit the search. The next page we are brought to contains the results for the hockey search. We then tell the webdriver to select the Wikipedia page for hockey and, on that page, to select the inline hockey entry in the table of contents. We then follow the link to the main article for inline hockey and use Beautiful Soup to extract all the text from that article. When using Selenium to move through websites it is important to include wait statements. Here implicitly_wait tells the driver to keep looking for an element for up to the given number of seconds before giving up. If you do not use waits, the code may run faster than the pages load, and you will get an error because the elements you ask for cannot be found yet.

In [9]:
driver_2 = webdriver.Firefox()
driver_2.get('https://www.google.com/')
assert 'Google' in driver_2.title
select = driver_2.find_element_by_id('lst-ib')
select.send_keys('hockey')
select.send_keys(Keys.RETURN)
driver_2.implicitly_wait(5)
hockey_wik = driver_2.find_element_by_link_text('Hockey - Wikipedia')
hockey_wik.click()
inline= driver_2.find_element_by_css_selector('li.toclevel-2:nth-child(4)')
inline.click()
driver_2.implicitly_wait(10)
main_inline_link = driver_2.find_element_by_link_text('Roller in-line hockey')
main_inline_link.click()
url = driver_2.current_url
parge = requests.get(url)
soup = BeautifulSoup(parge.content, 'html.parser')
for i in soup.find_all('p'):
    print(i.get_text())

Roller in-line hockey is a team sport played on a wood, asphalt, cement or sport tile surface, in which players use a hockey stick to shoot a hard plastic hockey puck into their opponent's goal to score points.[1] It is considered a contact sport but body checking is prohibited. However, there are exceptions to that with the NRHL which involves fighting. Inline hockey teams are composed of up to four lines of players including two forwards and two defensemen on each line. There are five players including the goalie from each team on the rink at a time. It is the goalie's job to prevent the other team's players from scoring. Teams normally consist of 16 players that sit on the bench until it is their turn to play.[2] As the name suggests it is played on inline skates.
Inline hockey is a very fast paced and free flowing game, this is because it does not have the same rules as ice hockey. There are no blue lines or defensive zones in roller hockey unlike ice hockey. This means that, accor

In [10]:
#the pages must be open first
driver.close()
driver_2.close()
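
The idea behind the wait statements discussed above is polling: keep checking for something until it appears or a time limit runs out. The sketch below illustrates that idea in plain Python and is not Selenium's own implementation. The function wait_until and the simulated find_link are our own names, and no browser is involved.

```python
import time

def wait_until(condition, timeout, poll=0.5, clock=time.monotonic, sleep=time.sleep):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError("condition not met within {} seconds".format(timeout))
        sleep(poll)

# Simulated page: the "link" only appears on the third check.
attempts = []

def find_link():
    attempts.append(1)
    return "Hockey - Wikipedia" if len(attempts) >= 3 else None

found = wait_until(find_link, timeout=5, poll=0.01)
print(found, len(attempts))
```

Selenium also provides explicit waits through its WebDriverWait class, which applies the same kind of polling to a condition on the page.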

# Helpful Sources

1) https://economictimes.indiatimes.com/definition/selenium-web-driver

2) https://www.guru99.com/introduction-webdriver-comparison-selenium-rc.html

3) https://www.dataquest.io/blog/web-scraping-tutorial-python/

4) https://www.crummy.com/software/BeautifulSoup/bs4/doc/