# STA 220 Data & Web Technologies for Data Analysis

### Lecture 8, 2/1/24, Scraping

### Announcements

 - 

### Today's topics
- Scraping with JavaScript
- GraphQL
    
### Resources
- Mitchell: Scraping with Python, Chapters 9 and 10
- [GraphQL](https://www.mobilelive.ca/blog/graphql-vs-rest-what-you-didnt-know) (Attention: This is infotainment!)

### Scraping from `ratemyprofessors.com`

We are interested in retrieving information from the website `ratemyprofessors.com`. By navigating with our browser, we find that all professors at UCD can be retrieved as follows. 

In [1]:
import requests

In [2]:
endpoint = 'https://www.ratemyprofessors.com/search/professors/1073?'
params = {'q':'*'}

In [3]:
result = requests.get(endpoint, params=params)
result.raise_for_status()  # raises an HTTPError for 4xx/5xx responses

In [4]:
import lxml
from bs4 import BeautifulSoup

In [5]:
html = BeautifulSoup(result.text,'lxml')
print(html.prettify()) 

<!DOCTYPE html>
<!-- SSR -->
<html>
 <head>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="#000000" name="theme-color"/>
  <meta content="https://www.ratemyprofessors.com/build/thumbnail.svg" name="thumbnail"/>
  <link href="/build/manifest.json" rel="manifest"/>
  <link href="/static/css/main.1773c5b7.css" rel="stylesheet" type="text/css"/>
  <!-- Google Optimize Anti-flicker snippet -->
  <style>
   .async-hide { opacity: 0 !important}
  </style>
  <script>
   (function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;
        h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};
        (a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;
        })(window,document.documentElement,'async-hide','dataLayer',4000,
        {'OPT-MLW3VTZ':true});
  </script>
  <!-- Google Optimize -->
  <script async="" src="https://www.googleoptimize.com/optimize.js?id=OPT-MLW3VTZ">
  </script>
  <script async=""

In [6]:
result.text.find("Lynn")

50792

In [7]:
result.text.find("Stylianos")

-1

As we have already seen in the Wikipedia example, the HTML rendered by the browser does not coincide with the HTML returned by the request: `"Lynn"` is found in the response, while `"Stylianos"` is not (`find` returns `-1`). Apparently, some information is fetched only when the *browser* executes JavaScript. 

The running of scripts is a client-side operation run in the browser itself, rather
than on a web server. 

JavaScript is, by far, the most common and most well-supported client-side scripting
language on the Web today. It can be used to collect information for user tracking,
submit forms without reloading the page, embed multimedia, and even power entire
online games. Even deceptively simple-looking pages can often contain multiple
pieces of JavaScript. You can find it embedded between `<script>` tags in the page’s
source code.
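
To get a feel for how much JavaScript a page ships, we can list its `<script>` tags. The following is a minimal sketch using only the standard library. `sample_html` is a made-up miniature page (in the notebook, one would feed `result.text` instead), and `ScriptCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the `src` of every <script> tag (None for inline scripts)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            # attrs is a list of (name, value) pairs
            self.sources.append(dict(attrs).get("src"))


# Made-up miniature page, loosely modelled on the output shown earlier.
sample_html = """
<html><head>
  <script>window.dataLayer = [];</script>
  <script async="" src="https://www.googleoptimize.com/optimize.js?id=OPT-MLW3VTZ"></script>
</head><body><div id="root"></div></body></html>
"""

collector = ScriptCollector()
collector.feed(sample_html)
print(len(collector.sources), "script tags found")
print(collector.sources)
```

Inline scripts show up as `None`, external ones with their URL. A page whose body is little more than an empty container plus scripts is a strong hint that the content is rendered client-side.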

Since we are interested in the rendered HTML displayed by the browser, we have to render it artificially first and then return the rendered HTML as a string. This can be achieved with `Selenium`. 

Selenium is a powerful web scraping tool developed originally for website testing.
These days it’s also used when the accurate portrayal of websites—as they appear in a
browser—is required. Selenium works by automating browsers to load the website,
retrieve the required data, and even take screenshots or assert that certain actions
happen on the website.

In [9]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [10]:
url = result.url
url

'https://www.ratemyprofessors.com/search/professors/1073?q=%2A'
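
The `%2A` in this URL is the percent-encoded `*` from `params`; `requests` performs this encoding for us. As a small sketch of what happens, the standard library can reproduce and reverse the encoding. Here the endpoint is written without the trailing `?`, which is added by hand.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "https://www.ratemyprofessors.com/search/professors/1073"
params = {"q": "*"}

# Encode the parameters into a query string: '*' becomes '%2A'.
query_string = urlencode(params)
url = endpoint + "?" + query_string
print(url)

# Decoding goes the other way and recovers the original parameters.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```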

In [None]:
# driver.get(url)

We have already seen that it takes a while to load the page in the browser. We don't have time for this, so we set a page-load timeout and stop loading once it expires. 

In [11]:
from selenium.common.exceptions import TimeoutException
driver.set_page_load_timeout(20) # twenty seconds should be enough

try:
    driver.get(url)
except TimeoutException:
    driver.execute_script("window.stop();")

Only the first few professors are displayed. To see the others, we have to click the `show more` button or, better, specify that we are only interested in statistics professors. 

How do we navigate on this page? First, we need to get rid of the cookies banner. Using developer tools, we can find the 'close' button for the cookies banner: 

    "/html/body/div[5]/div/div/button"

See the [docs](https://www.selenium.dev/selenium/docs/api/py/index.html)!

In [12]:
button=driver.find_element("xpath", "/html/body/div[5]/div/div/button")
button.click()

In [13]:
button=driver.find_element("xpath", '//*[@id="bx-close-inside-1177612"]')
button.click()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="bx-close-inside-1177612"]"}
  (Session info: chrome=122.0.6261.94); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception

Next, we should select the statistics professors. To do so, we need to access the dropdown menu. From the developer tools, we find that it is coded as a `div` element rather than a `select` element, so we cannot use Selenium's `Select` helper to access the dropdown. 

First, we need to find which `div` actually opens the dropdown. 

In [14]:
driver.set_page_load_timeout(4) # four seconds should be enough

try:
    driver.find_element("xpath", '//div[@class=" css-1l6bn5c-control"]').click()
except TimeoutException:
    driver.execute_script("window.stop();")

ElementClickInterceptedException: Message: element click intercepted: Element <div class=" css-1l6bn5c-control">...</div> is not clickable at point (273, 22). Other element would receive the click: <div>...</div>
  (Session info: chrome=122.0.6261.94)

Let's see what the HTML of the dropdown menu looks like. 

In [None]:
import lxml.html as lx
from lxml import etree

html = lx.fromstring(driver.page_source)

In [None]:
driver.page_source.find("Statistics")

In [None]:
dropdown = html.xpath('//div[@class=" css-1hwfws3"]')[0]

In [None]:
dropdown

In [None]:
print(BeautifulSoup(etree.tostring(dropdown),'lxml').prettify())

We ought to select the element with `id="react-select-3-option-86"`. 

In [None]:
driver.find_element("xpath", '//div[@id="react-select-3-option-86"]').click()

We learn that there are 102 professors in the Statistics department, but only 8 are shown. Further investigation shows that we can locate the button that loads more results via its class attribute, which contains `PaginationButton`. 

In [None]:
button=driver.find_element("xpath", "//button[contains(@class, 'PaginationButton')]")
button.click()

In [None]:
import time

In [None]:
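from selenium.common.exceptions import WebDriverException

# keep clicking until the button can no longer be found or clicked
while True: 
    try: 
        time.sleep(0.2)  # give the page a moment to load the next batch
        button=driver.find_element("xpath", "//button[contains(@class, 'PaginationButton')]")
        button.click()
    except WebDriverException:  # includes NoSuchElementException
        break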
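
The loop above relies on a fixed pause of 0.2 seconds, which may be too short on a slow connection and wastefully long on a fast one. A common alternative is to poll for a condition until it holds or a deadline passes. Selenium ships `WebDriverWait` together with `expected_conditions` for this purpose. The sketch below only illustrates the idea in plain Python: `wait_until` is a hypothetical helper, and `button_appears_on_third_try` is a stand-in for "is the button there yet?".

```python
import time


def wait_until(condition, timeout=5.0, poll=0.2):
    """Call `condition()` every `poll` seconds until it returns a truthy value.

    Returns that value, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Stand-in for a page element that only shows up on the third look.
calls = {"n": 0}


def button_appears_on_third_try():
    calls["n"] += 1
    return "button" if calls["n"] >= 3 else None


found = wait_until(button_appears_on_third_try, timeout=2.0, poll=0.01)
missing = wait_until(lambda: None, timeout=0.05, poll=0.01)
print(found, missing)
```

With a real driver, the condition would wrap a `find_element` call and return `None` when the element is not found.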

In [None]:
html = lx.fromstring(driver.page_source)

We don't need the browser anymore. We can close it. 

In [None]:
driver.quit()

In [None]:
html.xpath('//a/@href')

Since we do not need visual confirmation of what the browser does, we can run it in headless mode next time. 

In [None]:
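#from selenium.webdriver.chrome.options import Options
#chrome_options = Options()
#chrome_options.add_argument("--headless")
#driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)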

Let's retrieve name and link for now. Any further analysis can be performed similarly to our previous case studies. 

In [None]:
links = html.xpath('//a[@class = "TeacherCard__StyledTeacherCard-syjs0d-0 dLJIlx"]/@href')
links[1:10]

In [None]:
names = html.xpath('//div[@class = "CardName__StyledCardName-sc-1gyrgim-0 cJdVEK"]')
names = [name.text for name in names]
names[1:10]

In [None]:
import pandas as pd

df=pd.DataFrame({'name': names, 'link': links})
df

So far, so good. Next, we will see how these steps could have been achieved more easily. 

We have seen that the HTML was rendered after some JavaScript code had been executed. However, the information we retrieved must have come from a query to some database. To see which script queried which resource, we can use the network tab (or the performance tab) in the developer tools. 

As it turns out, the information is fetched via *GraphQL*. GraphQL is an API like the ones we have seen before, but it is not a REST API. Facebook developed it as an internal technology for its applications and later released it publicly as open source. Since then, it has become a popular choice in the software development community for building web services.

As a query language, GraphQL defines specifications of how a client application can request the needed data from a remote server. As a result, the server application returns a response to the requested client query. The exciting thing to notice here is that the client application can also query exactly what it needs, without relying on the server-side application to define a query. 

GraphQL has become fairly common. Its advantage is that, thanks to specific queries, it avoids some problems of REST APIs, namely 
 - Multiple roundtrips with REST
 - Over-fetching and Under-fetching Problems with REST
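
To make the over-fetching point concrete, here is a toy illustration that does not touch any server. The field names are taken from the query used later in this lecture; the values in `RECORD` are made up, and `rest_style` and `graphql_style` are hypothetical helpers that merely mimic the two response styles.

```python
# A made-up teacher record; the values are purely illustrative.
RECORD = {
    "firstName": "Jane",
    "lastName": "Doe",
    "avgRating": 4.5,
    "numRatings": 12,
    "department": "Statistics",
}


def rest_style(record):
    """A REST endpoint typically returns the whole resource."""
    return dict(record)


def graphql_style(record, selection):
    """Return only the requested fields, like a GraphQL selection set."""
    return {field: record[field] for field in selection}


full = rest_style(RECORD)
slim = graphql_style(RECORD, ["firstName", "lastName"])
print(len(full), "fields vs", len(slim), "fields")
print(slim)
```

If we only need names, the REST-style response carries three fields we throw away. With many records and nested resources, this adds up, and missing fields would require additional roundtrips.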

Let's see how the GraphQL request is made. 

In [None]:
endpoint = 'https://www.ratemyprofessors.com/graphql'
headers = {
    "Authorization": "Basic dGVzdDp0ZXN0", 
}

In [None]:
# first query
data = {
    "query":"query TeacherSearchResultsPageQuery(\n  $query: TeacherSearchQuery!\n  $schoolID: ID\n) {\n  search: newSearch {\n    ...TeacherSearchPagination_search_1ZLmLD\n  }\n  school: node(id: $schoolID) {\n    __typename\n    ... on School {\n      name\n    }\n    id\n  }\n}\n\nfragment TeacherSearchPagination_search_1ZLmLD on newSearch {\n  teachers(query: $query, first: 8, after: \"\") {\n    didFallback\n    edges {\n      cursor\n      node {\n        ...TeacherCard_teacher\n        id\n        __typename\n      }\n    }\n    pageInfo {\n      hasNextPage\n      endCursor\n    }\n    resultCount\n    filters {\n      field\n      options {\n        value\n        id\n      }\n    }\n  }\n}\n\nfragment TeacherCard_teacher on Teacher {\n  id\n  legacyId\n  avgRating\n  numRatings\n  ...CardFeedback_teacher\n  ...CardSchool_teacher\n  ...CardName_teacher\n  ...TeacherBookmark_teacher\n}\n\nfragment CardFeedback_teacher on Teacher {\n  wouldTakeAgainPercent\n  avgDifficulty\n}\n\nfragment CardSchool_teacher on Teacher {\n  department\n  school {\n    name\n    id\n  }\n}\n\nfragment CardName_teacher on Teacher {\n  firstName\n  lastName\n}\n\nfragment TeacherBookmark_teacher on Teacher {\n  id\n  isSaved\n}\n",
    "variables":{
        "query":{
            "text":"",
            "schoolID":"U2Nob29sLTEwNzM=",
            "fallback":True,
            "departmentID":"RGVwYXJ0bWVudC0xNDA=", 
        },
        "schoolID":"U2Nob29sLTEwNzM="
    }
} 
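
The cryptic IDs in this request are not random: they are Base64-encoded strings, and so is the token in the `Authorization` header. A short sketch to decode them, using only the standard library (`b64decode_text` and `encode_id` are hypothetical helper names; whether the server accepts IDs built for other numbers is an assumption we have not checked):

```python
import base64


def b64decode_text(value):
    """Decode a Base64 string into readable text."""
    return base64.b64decode(value).decode("utf-8")


def encode_id(kind, number):
    """Build an ID in the same '<kind>-<number>' layout and Base64-encode it."""
    return base64.b64encode(f"{kind}-{number}".encode("utf-8")).decode("ascii")


school_id = b64decode_text("U2Nob29sLTEwNzM=")
department_id = b64decode_text("RGVwYXJ0bWVudC0xNDA=")
credentials = b64decode_text("dGVzdDp0ZXN0")

print(school_id)      # the 1073 matches the school number in the search URL
print(department_id)
print(credentials)    # HTTP Basic auth uses the layout 'user:password'
```

Knowing the layout, `encode_id("School", 1073)` reproduces the `schoolID` used above.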

In [None]:
response = requests.post(endpoint, headers = headers, json=data)
response.raise_for_status()
result = response.json()
result

In [None]:
def fetch_info(dic): 
    name = dic['node']['firstName'] + " " + dic['node']['lastName']
    lid = "/professor?tid=" + str(dic['node']['legacyId'])
    return name, lid
    
prof_list = result['data']['search']['teachers']['edges']
    
[fetch_info(prof) for prof in prof_list]
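
The links returned by `fetch_info` are relative. If absolute URLs are needed, `urljoin` from the standard library can resolve them against the site's base URL. A minimal sketch; the `tid` value below is a made-up placeholder, not a real professor ID.

```python
from urllib.parse import urljoin

base = "https://www.ratemyprofessors.com/graphql"
relative = "/professor?tid=123456"  # hypothetical ID, for illustration only

absolute = urljoin(base, relative)
print(absolute)
```

Because the relative link starts with `/`, it replaces the whole path of the base URL.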

Using developer tools, we find that the subsequent requests can be made using a different data layout. Watch out, the `query` value has changed! 

In [None]:
cursor = result['data']['search']['teachers']['pageInfo']['endCursor']
cursor

In [None]:
def new_data(cursor):
    data = {
        "query":"query TeacherSearchPaginationQuery(\n  $count: Int!\n  $cursor: String\n  $query: TeacherSearchQuery!\n) {\n  search: newSearch {\n    ...TeacherSearchPagination_search_1jWD3d\n  }\n}\n\nfragment TeacherSearchPagination_search_1jWD3d on newSearch {\n  teachers(query: $query, first: $count, after: $cursor) {\n    didFallback\n    edges {\n      cursor\n      node {\n        ...TeacherCard_teacher\n        id\n        __typename\n      }\n    }\n    pageInfo {\n      hasNextPage\n      endCursor\n    }\n    resultCount\n    filters {\n      field\n      options {\n        value\n        id\n      }\n    }\n  }\n}\n\nfragment TeacherCard_teacher on Teacher {\n  id\n  legacyId\n  avgRating\n  numRatings\n  ...CardFeedback_teacher\n  ...CardSchool_teacher\n  ...CardName_teacher\n  ...TeacherBookmark_teacher\n}\n\nfragment CardFeedback_teacher on Teacher {\n  wouldTakeAgainPercent\n  avgDifficulty\n}\n\nfragment CardSchool_teacher on Teacher {\n  department\n  school {\n    name\n    id\n  }\n}\n\nfragment CardName_teacher on Teacher {\n  firstName\n  lastName\n}\n\nfragment TeacherBookmark_teacher on Teacher {\n  id\n  isSaved\n}\n",
        "variables":{
            "count":8,
            "cursor": cursor, 
            "query":{
                "text":"",
                "schoolID":"U2Nob29sLTEwNzM=",
                "fallback":True,
                "departmentID":"RGVwYXJ0bWVudC0xNDA=", 
            }
        }
    } 
    return data
data = new_data(cursor)

In [None]:
response = requests.post(endpoint, headers = headers, json=data)
response.raise_for_status()
result = response.json()
result

In [None]:
prof_list = result['data']['search']['teachers']['edges']
[fetch_info(prof) for prof in prof_list]

In [None]:
cursor = result['data']['search']['teachers']['pageInfo']['endCursor']
cursor

In [None]:
flag = result['data']['search']['teachers']['pageInfo']['hasNextPage']
flag

Let's formalize this. 

In [None]:
def fetch_profs(): 
    endpoint = 'https://www.ratemyprofessors.com/graphql'
    headers = {
        "Authorization": "Basic dGVzdDp0ZXN0", 
    }
    
    # first query
    data = {
        "query":"query TeacherSearchResultsPageQuery(\n  $query: TeacherSearchQuery!\n  $schoolID: ID\n) {\n  search: newSearch {\n    ...TeacherSearchPagination_search_1ZLmLD\n  }\n  school: node(id: $schoolID) {\n    __typename\n    ... on School {\n      name\n    }\n    id\n  }\n}\n\nfragment TeacherSearchPagination_search_1ZLmLD on newSearch {\n  teachers(query: $query, first: 8, after: \"\") {\n    didFallback\n    edges {\n      cursor\n      node {\n        ...TeacherCard_teacher\n        id\n        __typename\n      }\n    }\n    pageInfo {\n      hasNextPage\n      endCursor\n    }\n    resultCount\n    filters {\n      field\n      options {\n        value\n        id\n      }\n    }\n  }\n}\n\nfragment TeacherCard_teacher on Teacher {\n  id\n  legacyId\n  avgRating\n  numRatings\n  ...CardFeedback_teacher\n  ...CardSchool_teacher\n  ...CardName_teacher\n  ...TeacherBookmark_teacher\n}\n\nfragment CardFeedback_teacher on Teacher {\n  wouldTakeAgainPercent\n  avgDifficulty\n}\n\nfragment CardSchool_teacher on Teacher {\n  department\n  school {\n    name\n    id\n  }\n}\n\nfragment CardName_teacher on Teacher {\n  firstName\n  lastName\n}\n\nfragment TeacherBookmark_teacher on Teacher {\n  id\n  isSaved\n}\n",
        "variables":{
            "query":{
                "text":"",
                "schoolID":"U2Nob29sLTEwNzM=",
                "fallback":True,
                "departmentID":"RGVwYXJ0bWVudC0xNDA=", 
            },
            "schoolID":"U2Nob29sLTEwNzM="
        }
    } 
    
    response = requests.post(endpoint, headers = headers, json=data)
    response.raise_for_status()
    result = response.json()
    
    teachers = result['data']['search']['teachers']
    df = [fetch_info(prof) for prof in teachers['edges']]
    cursor = teachers['pageInfo']['endCursor']
    flag = teachers['pageInfo']['hasNextPage']  # only continue if there are more pages
    
    while flag: 
        data = new_data(cursor)
        response = requests.post(endpoint, headers = headers, json=data)
        response.raise_for_status()
        result = response.json()
            
        teachers = result['data']['search']['teachers']
        df.extend([fetch_info(prof) for prof in teachers['edges']])
        cursor = teachers['pageInfo']['endCursor']
        flag = teachers['pageInfo']['hasNextPage']
        
    return df

In [None]:
df = fetch_profs()

In [None]:
pd.DataFrame(df)
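
The pagination logic itself (follow `endCursor` while `hasNextPage` is true) can be tried out without any network access. The sketch below separates the loop from the HTTP request: `paginate` is a hypothetical helper, and `fake_fetch_page` is an in-memory stand-in for the server that serves 20 made-up names in pages of 8. It is an illustration of the pattern, not the site's actual behavior.

```python
def paginate(fetch_page):
    """Collect all items from a cursor-paginated source.

    `fetch_page(cursor)` must return (items, end_cursor, has_next_page);
    the first call receives cursor=None.
    """
    items, cursor, has_next = fetch_page(None)
    collected = list(items)
    while has_next:
        items, cursor, has_next = fetch_page(cursor)
        collected.extend(items)
    return collected


# In-memory stand-in for the server: 20 made-up names, 8 per page.
FAKE_NAMES = [f"Prof {i}" for i in range(20)]
PAGE_SIZE = 8


def fake_fetch_page(cursor):
    start = 0 if cursor is None else int(cursor)
    end = start + PAGE_SIZE
    # The cursor is simply the index where the next page starts.
    return FAKE_NAMES[start:end], str(end), end < len(FAKE_NAMES)


all_names = paginate(fake_fetch_page)
print(len(all_names))
```

Checking `has_next` before entering the loop avoids one superfluous request when everything fits on the first page.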

### Summary 

- `Selenium` is very useful to remote-control a browser
- Internally, information is usually handled via APIs anyway