# Scraping with Selenium
In this notebook I give a short explanation of how you can use Selenium for web scraping. I think one of the big advantages of Selenium is how intuitive it is: you can write a script that does things exactly the way you would do them by hand, and then scale it up to run on many more websites than you could visit by hand in the same time. I will go through two short examples. The first is a small scraper that returns the weekly number of new Covid infections as published by the RIVM; the second scrapes the names of all senators of the United States of America.

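First, however, let's load the packages the script needs and set a working directory. The code in this notebook uses the Selenium 3 API (for example `find_element_by_xpath` and the `executable_path` argument). Selenium 4 replaced these with `find_element(By.XPATH, ...)` and a `Service` object, so the cells need adapting if you run a newer version.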

In [2]:
#Packages
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from fake_useragent import UserAgent
import time
import os
import pandas as pd
#Working directory
os.chdir('/home/timothy/Desktop/University/1Studium/Communication Science/Kurse/s3/Data Journalism')

## RIVM
The RIVM is the Dutch National Institute for Public Health and the Environment, so its Covid numbers are credible. Let's say we are interested in those numbers and usually check the website every day. Now we no longer have much time, so we would like to automate the process: we run our script and get the number in return.

Let's start by saving the starting URL as a variable.

In [3]:
url = 'https://www.rivm.nl/'

Next, we set up our browser and send the first GET request to the URL we specified. The most important part of the code below is the executable path: it tells Selenium where to find the geckodriver executable, which you need to download to your laptop first. If everything works, a new Mozilla Firefox window should appear.

In [5]:
profile = webdriver.FirefoxProfile()
options = Options()
#options.headless = True  # uncomment to run Firefox without a visible window
ua = UserAgent(verify_ssl=False)
# note: the switches below are Chrome command-line switches, so Firefox may simply ignore them
options.add_argument('--disable-infobars')
options.add_argument('--disable-extensions')
options.add_argument('--profile-directory=Default')
options.add_argument('--incognito')
options.add_argument('--disable-plugins-discovery')
options.add_argument('--start-maximized')
userAgent = ua.random  # pick a random user agent string
options.add_argument(f'user-agent={userAgent}')
# executable_path must point to the geckodriver file you downloaded
browser = webdriver.Firefox(options=options, firefox_profile=profile, executable_path=r'presentation/geckodriver.exe')
wait = WebDriverWait(browser, 300)  # explicit wait object (not used further in this notebook)
browser.get(url)  # open url
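
Since the same browser setup is needed again for the senators example, it could be wrapped in a small helper. The sketch below is only one way to do that, under a few assumptions: `make_browser` and its parameters are hypothetical names, it keeps the Selenium 3 style arguments used in this notebook, and it leaves out the random user agent and the extra switches for brevity.

```python
def make_browser(headless=False, driver_path=r'presentation/geckodriver.exe'):
    """Start Firefox with the Selenium 3 style arguments used in this notebook."""
    # Imported inside the function so that defining the helper does not require Selenium
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.headless = headless  # True runs Firefox without a visible window
    # driver_path is an assumption: point it at your own geckodriver file
    browser = webdriver.Firefox(options=options, executable_path=driver_path)
    browser.implicitly_wait(20)  # same 20-second implicit wait as in the notebook
    return browser
```

A cell could then start with `browser = make_browser()` followed by `browser.get(url)`, and `make_browser(headless=True)` would give the invisible variant.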

Now that the website is open, we will use XPath notation to tell the computer which link to click on.

In [6]:
browser.implicitly_wait(20)  # if an element cannot be found immediately (e.g. the page loads slowly), keep trying for up to 20 seconds
browser.find_element_by_xpath('/html/body/div[2]/div/div[3]/main/div[2]/article/div/div/div/div/div/div[2]/div[2]/div/div/div/div/div/ul/li[1]/a').click() #this is the link we want to click on
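
A long position-based XPath like the one above depends on every step of the page layout staying the same. The sketch below illustrates that with Python's standard-library `xml.etree.ElementTree`, which understands a small subset of XPath. The markup is a made-up, well-formed toy page (not the real RIVM site), and the `id="topics"` attribute is an assumption for illustration only.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed page (hypothetical markup, not the real RIVM site)
PAGE = """
<html><body>
  <div><div>
    <ul id="topics">
      <li><a href="/covid">Coronavirus</a></li>
      <li><a href="/flu">Influenza</a></li>
    </ul>
  </div></div>
</body></html>
""".strip()

POSITION_PATH = "./body/div/div/ul/li[1]/a"         # every step must match the layout
ATTRIBUTE_PATH = ".//ul[@id='topics']/li[1]/a"      # only relies on the id of the list

root = ET.fromstring(PAGE)
by_position = root.find(POSITION_PATH)
by_attribute = root.find(ATTRIBUTE_PATH)
print(by_position.text, by_attribute.text)

# Simulate a redesign that wraps the list in one more <div>
REDESIGNED = PAGE.replace('<ul id="topics">', '<div><ul id="topics">').replace('</ul>', '</ul></div>')
new_root = ET.fromstring(REDESIGNED)

broken = new_root.find(POSITION_PATH)       # None: the position-based path no longer matches
still_ok = new_root.find(ATTRIBUTE_PATH)    # the attribute-based path still finds the link
print(broken, still_ok.text)
```

The same idea carries over to Selenium: if the link you need has a stable attribute, an XPath built on that attribute tends to survive small layout changes better than a path counted from `/html/body`.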

Now that we are on the correct page, we scroll down so that the data we are interested in is in view, and then use XPath notation again to tell the computer which number we want. After that we are done, so we close the browser window and print a statement that tells us how many new infections there were last week.

In [7]:
browser.execute_script("window.scrollTo(0, 1000);") #scrolling down

#these are the numbers we are interested in
new_cases_last_week = browser.find_element_by_xpath('/html/body/div[2]/div/div[3]/main/div[2]/article/div/div[2]/div/div/div/div[2]/div/table/tbody/tr[2]/td[2]').text

#closing the browser
browser.quit()

#print out what we want to know
print(f'There were {new_cases_last_week} new cases last week')

Done. There were 36.931 new cases last week
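
Note that `.text` returns a string, and the RIVM writes numbers the Dutch way, with a dot as the thousands separator. If you want to calculate with the value, it has to be converted first. A minimal sketch, assuming the scraped value is always a whole number (`parse_dutch_int` is a hypothetical helper name):

```python
def parse_dutch_int(text):
    """Convert a Dutch-formatted whole number such as '36.931' to an int."""
    # Assumption: dots are thousands separators and there is no decimal part
    cleaned = text.strip().replace(".", "")
    return int(cleaned)


new_cases = parse_dutch_int("36.931")
print(new_cases)
```

Calling `float("36.931")` directly would silently give roughly 36.9 instead of 36931, which is why the dots are removed before converting.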


## US Senators
In the second example, we will use Selenium to scrape the US senators. Let's start again by specifying the URL and starting the browser. We will also create empty lists in which we can later save the information.

In [9]:
url = 'https://www.senate.gov/senators/index.htm'
###Settings for the browser
profile = webdriver.FirefoxProfile()
options = Options()         
#options.headless = True 
ua = UserAgent(verify_ssl=False)
options.add_argument('--disable-infobars')
options.add_argument('--disable-extensions')
options.add_argument('--profile-directory=Default')
options.add_argument('--incognito')
options.add_argument('--disable-plugins-discovery')
options.add_argument('--start-maximized')
userAgent = ua.random
options.add_argument(f'user-agent={userAgent}')
browser = webdriver.Firefox(options=options, firefox_profile=profile, executable_path=r'presentation/geckodriver.exe')  # executable_path must point to your geckodriver file
wait = WebDriverWait(browser, 300)  
browser.get(url) # open url   

#empty lists in which we will save the data
Name = []
State = []
Party = []

Next we will scroll down the page so we can see the actual table.

In [10]:
#scroll down the page
browser.execute_script("window.scrollTo(0, 700);")
browser.implicitly_wait(20)

Unfortunately, there are too many names to fit in the window at once. In Selenium this can cause problems, because the scraper can only reliably get data it can see. So let's reduce the number of senators shown at once from 100 to 10.

In [11]:
#select 10 per page
browser.find_element_by_xpath('//select[@aria-controls = "listOfSenators"]').click()
browser.find_element_by_xpath('//select[@aria-controls = "listOfSenators"]/option').click()

Next we have to extract the information we are interested in from this table. Since we only see 10 names per page, we will have to extract information in the same way several times, so let's write a function that does this for us.

In [12]:
def getting_senators():
    """Append name, state and party of every senator on the current page to the lists."""
    #table rows carry the class "odd" or "even": collect all odd rows first, then all even rows
    for row_class in ('odd', 'even'):
        rows = browser.find_elements_by_xpath(f'//tr[@class = "{row_class}"]')

        for senator in rows:
            Name.append(senator.find_element_by_xpath('./td/a').text)
            State.append(senator.find_element_by_xpath('./td[2]').text)
            Party.append(senator.find_element_by_xpath('./td[3]').text)
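
One side effect of collecting the odd rows first and the even rows second is that the senators do not end up in the order shown on the page. The sketch below illustrates this on a toy table parsed with the standard-library `xml.etree.ElementTree`. The four rows are taken from the notebook's output, but their arrangement (alphabetical, alternating odd and even) is an assumption about the page.

```python
import xml.etree.ElementTree as ET

# Toy table: row order on the page is assumed to be alphabetical
TABLE = """
<table>
  <tr class="odd"><td><a>Alexander, Lamar (R-TN)</a></td><td>Tennessee</td><td>Republican</td></tr>
  <tr class="even"><td><a>Baldwin, Tammy (D-WI)</a></td><td>Wisconsin</td><td>Democrat</td></tr>
  <tr class="odd"><td><a>Barrasso, John (R-WY)</a></td><td>Wyoming</td><td>Republican</td></tr>
  <tr class="even"><td><a>Bennet, Michael F. (D-CO)</a></td><td>Colorado</td><td>Democrat</td></tr>
</table>
""".strip()
table = ET.fromstring(TABLE)


def names(rows):
    """Return the text of the link in the first cell of each row."""
    return [row.find("./td/a").text for row in rows]


# Two passes, as in getting_senators(): all odd rows, then all even rows
two_pass = names(table.findall(".//tr[@class='odd']")) + names(table.findall(".//tr[@class='even']"))

# One pass over every row keeps the order shown on the page
one_pass = names(table.findall(".//tr"))

print(two_pass)
print(one_pass)
```

For this scraper the order does not matter much, because every row keeps its own name, state and party together. If page order is important, a single XPath that matches both classes at once, such as `//tr[@class = "odd" or @class = "even"]`, would be one option to try in Selenium.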


Next, let's apply this function to the first 3 pages and then quit the browser.

In [13]:
#extract info
getting_senators()  
#click next
browser.find_element_by_xpath('//a[@id = "listOfSenators_next"]').click()
#wait 2 seconds
time.sleep(2)
getting_senators()
browser.find_element_by_xpath('//a[@id = "listOfSenators_next"]').click()
time.sleep(2)
getting_senators()
browser.quit()

The last thing we will do is put the 3 lists together into one pandas dataframe.

In [15]:
Senators = pd.DataFrame(list(zip(Name, State, Party)),
                  columns =['Name', 'State', 'Party']) 

print(f'There are {len(Senators)} Senators in our dataset')
print(Senators.head())

There are 30 Senators in our dataset
                       Name      State       Party
0   Alexander, Lamar (R-TN)  Tennessee  Republican
1     Barrasso, John (R-WY)    Wyoming  Republican
2  Blackburn, Marsha (R-TN)  Tennessee  Republican
3         Blunt, Roy (R-MO)   Missouri  Republican
4      Boozman, John (R-AR)   Arkansas  Republican
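
The `Name` column still bundles several pieces of information: last name, first name, party letter and state abbreviation. If you need them separately, a regular expression can split them. This is a minimal sketch that assumes every entry follows the `Last, First (P-ST)` pattern seen in the output above; `split_name` and `NAME_PATTERN` are hypothetical names.

```python
import re

# Assumed pattern: "Last, First (P-ST)", e.g. "Bennet, Michael F. (D-CO)"
NAME_PATTERN = re.compile(
    r"^(?P<last>[^,]+),\s*(?P<first>.+?)\s*\((?P<party>[A-Z]+)-(?P<state>[A-Z]{2})\)$"
)


def split_name(raw):
    """Split a scraped name into its parts; return None if the pattern does not match."""
    match = NAME_PATTERN.match(raw.strip())
    if match is None:
        return None
    return match.groupdict()


print(split_name("Bennet, Michael F. (D-CO)"))
```

Returning `None` for entries that do not match makes it easy to spot rows where the assumption about the format breaks.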


### Scaling up and going headless
We successfully scraped the first 30 senators. But usually when we scrape, we want all the information, so let's scale it up to all 100 senators. Furthermore, while watching the scraper work is nice for demonstration purposes, it is usually a lot easier to run it invisibly in the background. So this time we set the option `options.headless = True`. And instead of copy-pasting the same data extraction commands multiple times, we simply put them in a loop.

In [17]:
#url
url = 'https://www.senate.gov/senators/index.htm'
#settings for the browser
profile = webdriver.FirefoxProfile()
options = Options()         
options.headless = True  #this is the line that makes it now invisible
ua = UserAgent(verify_ssl=False)
options.add_argument('--disable-infobars')
options.add_argument('--disable-extensions')
options.add_argument('--profile-directory=Default')
options.add_argument('--incognito')
options.add_argument('--disable-plugins-discovery')
options.add_argument('--start-maximized')
userAgent = ua.random
options.add_argument(f'user-agent={userAgent}')
browser = webdriver.Firefox(options=options, firefox_profile=profile, executable_path=r'presentation/geckodriver.exe')  # executable_path must point to your geckodriver file
wait = WebDriverWait(browser, 300)  
browser.get(url) # open url 
#empty lists
Name = []
State = []
Party = []


#scroll down the page
browser.implicitly_wait(20)
browser.execute_script("window.scrollTo(0, 700);")

#select 10 per page
browser.find_element_by_xpath('//select[@aria-controls = "listOfSenators"]').click()
browser.find_element_by_xpath('//select[@aria-controls = "listOfSenators"]/option').click()

#define the data extraction function
def getting_senators():
    """Append name, state and party of every senator on the current page to the lists."""
    #table rows carry the class "odd" or "even": collect all odd rows first, then all even rows
    for row_class in ('odd', 'even'):
        rows = browser.find_elements_by_xpath(f'//tr[@class = "{row_class}"]')

        for senator in rows:
            Name.append(senator.find_element_by_xpath('./td/a').text)
            State.append(senator.find_element_by_xpath('./td[2]').text)
            Party.append(senator.find_element_by_xpath('./td[3]').text)


#a for loop that repeats the same steps 10 times (and clicks next 9 times)
for page in range(10):
    getting_senators()
    if page < 9:
        browser.find_element_by_xpath('//a[@id = "listOfSenators_next"]').click()
    else:
        print('Scraping complete')

browser.quit()

#store in Pandas dataframe
Senators_data = pd.DataFrame(list(zip(Name, State, Party)),
                  columns =['Name', 'State', 'Party']) 



Scraping complete
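
The loop above hard-codes 10 pages, which only works because we know there are 100 senators shown 10 per page. A more general pattern is to keep going until there is no next page. The sketch below shows that pattern without a browser: `scrape_all_pages` and `FakePager` are hypothetical names, and the fake pager merely stands in for the three browser actions (reading rows, checking for a next page, clicking next).

```python
def scrape_all_pages(read_rows, has_next, click_next, max_pages=50):
    """Collect rows page by page until there is no next page (or max_pages is reached)."""
    collected = []
    for _ in range(max_pages):
        collected.extend(read_rows())
        if not has_next():
            break
        click_next()
    return collected


class FakePager:
    """Stand-in for the browser: a list of items shown per_page at a time."""

    def __init__(self, items, per_page=10):
        self.items = items
        self.per_page = per_page
        self.page = 0

    def read_rows(self):
        start = self.page * self.per_page
        return self.items[start:start + self.per_page]

    def has_next(self):
        return (self.page + 1) * self.per_page < len(self.items)

    def click_next(self):
        self.page += 1


pager = FakePager(list(range(25)))
rows = scrape_all_pages(pager.read_rows, pager.has_next, pager.click_next)
print(len(rows))
```

With Selenium, `has_next` could for instance inspect the "next" link and check whether the page marks it as disabled. How that is marked depends on the site's markup, so it would need to be checked on the actual page. The `max_pages` cap is a safety net against an endless loop if that check never turns false.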


Now let's look at our dataset again.

In [20]:
#let's also create separate dfs for Republicans and Democrats
republicans = Senators_data[Senators_data['Party'] == 'Republican']
democrats = Senators_data[Senators_data['Party'] == 'Democrat']
print(f'There are {len(Senators_data)} Senators in our Dataset, {len(republicans)} are Republicans and {len(democrats)} are Democrats.')

print(Senators_data.head(10))

There are 100 Senators in our Dataset, 53 are Republicans and 45 are Democrats.
                         Name        State       Party
0     Alexander, Lamar (R-TN)    Tennessee  Republican
1       Barrasso, John (R-WY)      Wyoming  Republican
2    Blackburn, Marsha (R-TN)    Tennessee  Republican
3           Blunt, Roy (R-MO)     Missouri  Republican
4        Boozman, John (R-AR)     Arkansas  Republican
5       Baldwin, Tammy (D-WI)    Wisconsin    Democrat
6   Bennet, Michael F. (D-CO)     Colorado    Democrat
7  Blumenthal, Richard (D-CT)  Connecticut    Democrat
8      Booker, Cory A. (D-NJ)   New Jersey    Democrat
9          Braun, Mike (R-IN)      Indiana  Republican
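
The two filters account for 53 + 45 = 98 of the 100 rows, so the `Party` column must contain at least one other value. Counting every distinct value avoids having to guess the labels in advance. A minimal sketch with the standard-library `Counter`, using a short made-up list in place of the scraped `Party` list (the label `'Independent'` is an assumption):

```python
from collections import Counter

# Made-up sample standing in for the scraped Party list
parties = ["Republican", "Democrat", "Republican", "Independent", "Democrat", "Republican"]

counts = Counter(parties)
for party, n in counts.most_common():
    print(party, n)
```

On the real data, `Counter(Party)` would show which labels make up the remaining two rows.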
