<a href="https://colab.research.google.com/github/AIDA-UIUC/datascience-workshops/blob/master/07_webscraping/07_REST_Webscraping_(Solution).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIDA Data Science Workshop #7

Today we'll be learning about HTTP and web scraping!



In [None]:
# some package installation
!pip install vaderSentiment



### What is HTTP?
- HTTP (Hypertext Transfer Protocol) is a protocol designed to enable communication between clients and servers. For example, a web browser may be the client, and an application on the computer that hosts a website may be the server.

### What is a GET request?
- One of the most common requests used to "GET" data from a specific source 

### What is a POST request?
- Another commonly used request that allows the user to send or "POST" data to a specific source 

### What is a DELETE request?
- Allows the user to "DELETE" data from a specific source


### More information here:
https://www.w3schools.com/tags/ref_httpmethods.asp




Today we will only be working with a GET request.

(Simple demo of HTTP GET and POST requests)
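As a rough stand-in for a live demo, the sketch below builds (but never sends) one request of each type using only Python's standard library. The URL `https://example.com/api/reviews` and the form fields are made up for illustration. With the `requests` library used later in this workshop, the sending equivalents would be `requests.get()`, `requests.post()`, and `requests.delete()`.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only to illustrate the three request types
base_url = "https://example.com/api/reviews"

# GET: parameters travel in the URL's query string
get_req = Request(base_url + "?" + urlencode({"page": 1}), method="GET")

# POST: data travels in the request body (as bytes)
post_body = urlencode({"title": "Great food"}).encode("utf-8")
post_req = Request(base_url, data=post_body, method="POST")

# DELETE: identifies the resource to remove, usually via the URL
delete_req = Request(base_url + "/42", method="DELETE")

# Nothing is sent over the network here; we only inspect the request objects
for req in (get_req, post_req, delete_req):
    print(req.get_method(), req.full_url, req.data)
```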

## Anatomy of an HTTP request
```
GET /httpgallery/introduction/ HTTP/1.1
Accept: */*
Accept-Language: en-gb
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko
Host: www.httpwatch.com
Connection: Keep-Alive
```
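To make the structure of the request above concrete, here is a minimal sketch that splits a raw request into its request line (method, path, protocol version) and its headers. The helper name `parse_request` is made up for this example, and the parser only handles simple, well-formed requests without a body.

```python
from typing import Dict, Tuple

# A shortened copy of the request shown above; lines are separated by CRLF
raw_request = (
    "GET /httpgallery/introduction/ HTTP/1.1\r\n"
    "Accept: */*\r\n"
    "Accept-Language: en-gb\r\n"
    "Host: www.httpwatch.com\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
)


def parse_request(raw: str) -> Tuple[str, str, str, Dict[str, str]]:
    """Split a raw HTTP request into (method, path, version, headers)."""
    head = raw.split("\r\n\r\n", 1)[0]      # everything before the blank line
    lines = head.split("\r\n")
    method, path, version = lines[0].split(" ")  # the request line
    headers = {}
    for line in lines[1:]:
        name, value = line.split(":", 1)    # split on the first colon only
        headers[name.strip()] = value.strip()
    return method, path, version, headers


method, path, version, headers = parse_request(raw_request)
print(method, path, version)
print(headers["Host"])
```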

## Webscraping with BS4

In [None]:
# We'll start with a pretty simple HTML document that you can see below
from IPython.display import display, HTML

html_doc = """
<html>
    <head>
        <title>
            The Dormouse's story
        </title>
        </head>
    <body>

    <p class="title">
        <b>
            The Dormouse's story
        </b>
    </p>

    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">
        This is another story p block
    </p>

    </body>
</html>
"""

display(HTML(html_doc))

In [None]:
# In core Python, you could do something like the following to get the title of html_doc...
# (html_doc is a string, so we can call .find() on it)

print(html_doc[html_doc.find("<title>") + len("<title>"): html_doc.find("</title>")])

# This is pretty confusing to read, right? And it's really too much effort for us programmers


            The Dormouse's story
        


#### So if using core Python is too much work, what can we use?


In [None]:
# Beautiful Soup!
from bs4 import BeautifulSoup

# Create a soup object!

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup) #let's print out this new HTML soup and see what it looks like


The raw printout is pretty ugly, right? Luckily, the soup object we've created has a built-in "pretty print" method, `prettify()`, which you can call like this:


In [None]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
        and they lived at the bottom of a well.
  </p>
  <p class="story">
   This is another story p block
  </p>
 </body>
</html>



Great! That's much easier to read but it's still just raw HTML text and no one likes that...

How do we use BS4 to navigate the soup?

In [None]:
# Let's take the previous example of finding the title of the HTML doc
# How do we do this in BS4?

print(soup.title)

<title>
            The Dormouse's story
        </title>


In [None]:
# Hmm, it's still wrapped in those <title> tags

print(soup.title.get_text()) # Perfect!


            The Dormouse's story
        


Note that BS4 creates objects for every kind of HTML tag: "a", "p", "title", "div", etc. so be aware of what you're actually getting when you try to get something from your HTML soup!
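As a small illustration of what you get back, the sketch below uses a cut-down copy of the story paragraph from `html_doc` (stored under the made-up name `sample_html` so it doesn't overwrite anything above). `find()` returns a single `Tag` (or `None`), `find_all()` returns a list-like collection of `Tag` objects, and attributes are read with dictionary-style indexing.

```python
from bs4 import BeautifulSoup

# A reduced copy of the "story" paragraph from html_doc
sample_html = """
<p class="story">
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

paragraph = sample_soup.find("p", class_="story")    # one Tag, or None if nothing matches
links = sample_soup.find_all("a", class_="sister")   # every matching Tag

names = [link.get_text() for link in links]          # text inside each <a>
hrefs = [link["href"] for link in links]             # attribute lookup works like a dict
second_link = sample_soup.find(id="link2")           # search by id attribute

print(type(paragraph).__name__, names, hrefs)
```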

### **STOP!** A bit of legality before we start looking at actual websites...

Many sites publish a `robots.txt`, a file located on the website that specifies how robots (automated clients) may access the site. As data scientists running scrapers, which are effectively high-speed data access machines, we have to respect the rules of these websites. Scraping some sites may even be illegal, for example due to privacy concerns.

Luckily, if the site implements a robots.txt, which we can access through `/robots.txt`, then we can inspect it to see if we are allowed to do what we want to do. (Read more about it here: http://www.robotstxt.org/)

In [None]:
# example robots.txt
import requests

r = requests.get("https://www.tripadvisor.com/robots.txt")
print(r.text)

# Hi there,
#
# If you're sniffing around this file, and you're not a robot, we're looking to meet curious folks such as yourself.
#
# Think you have what it takes to join the best white-hat SEO growth hackers on the planet?
#
# Run - don't crawl - to apply to join TripAdvisor's elite SEO team
#
# Email seoRockstar@tripadvisor.com
#
# Or visit https://careers.tripadvisor.com/search-results?keywords=seo
#
#
Sitemap: https://www.tripadvisor.com/sitemap/2/en_US/sitemap_en_US_index.xml
Sitemap: https://www.tripadvisor.com/sitemap/2/en_US/sitemap_en_US_location_photo_direct_link_index.xml
Sitemap: https://www.tripadvisor.com/sitemap/2/en_US/sitemap_en_US_show_user_reviews_index.xml
Sitemap: https://www.tripadvisor.com/sitemap/vr/en_US/sitemap_en_US_rentals_index.xml
Sitemap: https://www.tripadvisor.com/sitemap/vr/en_US/sitemap_en_US_vacation_rental_review_index.xml
Sitemap: https://www.tripadvisor.com/sitemap/vr/en_US/sitemap_en_US_vacation_rentals_index.xml
Sitemap: https://www.tripadvisor

In [None]:
# robots.txt parser
# https://stackoverflow.com/questions/33596722/how-does-obey-robots-txt-using-python-2-7

from urllib import robotparser
from typing import Optional, Tuple


def can_access_site(base_url: str, access_path: str, agent_name: str = "") -> Tuple[Optional[bool], Optional[tuple]]:
    """
    Take a base_url and an access_path (and optionally an agent_name) and
    return whether the agent is allowed to access that path on the website,
    together with the request rate listed for that agent (None if the
    robots.txt does not specify one). Returns (None, None) if the robots.txt
    could not be read.

    agent_name is the name of the robot. This can technically be set to anything,
    but some websites have specific permissions for certain agents (see the
    robots.txt for more detail).
    """
    parser = robotparser.RobotFileParser()
    parser.set_url(base_url + '/robots.txt')

    try:
        parser.read()
    except Exception:
        # robots.txt could not be fetched (e.g. the host is unreachable)
        return (None, None)

    return (parser.can_fetch(agent_name, access_path), parser.request_rate(agent_name))


print(can_access_site("https://www.tripadvisor.com", "/Restaurant_Review-g35790-d2010890-Reviews-Golden_Harbor-Champaign_Champaign_Urbana_Illinois.html"))
print(can_access_site("https://www.facebook.com", "/"))
print(can_access_site("https://www.reddit.com", "/r/datascience"))
print(can_access_site("https://www.aida.acm.illinois.edu", "/"))

(True, None)
(False, None)
(True, None)
(None, None)
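To see how the rules themselves are evaluated without touching the network, here is a minimal sketch that feeds a made-up `robots.txt` straight into `RobotFileParser.parse()`. The rules, the paths, and the agent name `workshop-bot` are all hypothetical.

```python
from urllib import robotparser

# A made-up robots.txt: everyone is kept out of /private/ and asked to wait 5 seconds
example_robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

example_parser = robotparser.RobotFileParser()
example_parser.parse(example_robots_txt.splitlines())  # parse from text instead of a URL

allowed_public = example_parser.can_fetch("workshop-bot", "/reviews/page1.html")
allowed_private = example_parser.can_fetch("workshop-bot", "/private/data.html")
delay = example_parser.crawl_delay("workshop-bot")     # seconds to wait between requests

print(allowed_public, allowed_private, delay)
```

If a site lists a crawl delay or request rate, a scraper should pause at least that long between requests.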


### Back to webscraping: we don't usually have a page's entire HTML sitting in a string like this, right?

#### Nope!
BS4 itself can't download pages from websites, so you'll need another library, such as `requests`, to do that.

In [None]:
# Here's the page for Golden Harbor on TripAdvisor 

golden_harbor_link = "https://www.tripadvisor.com/Restaurant_Review-g35790-d2010890-Reviews-Golden_Harbor-Champaign_Champaign_Urbana_Illinois.html"

# we're doing a GET request here (requests.post() also exists)
r = requests.get(golden_harbor_link)

soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <link href="https://static.tacdn.com/favicon.ico?v2" id="favicon" rel="icon" type="image/x-icon"/>
  <link color="#000000" href="https://static.tacdn.com/img2/brand_refresh/application_icons/mask-icon.svg" rel="mask-icon" sizes="any"/>
  <meta content="#34e0a1" name="theme-color"/>
  <meta content="telephone=no" name="format-detection"/>
  <script type="text/javascript">
   window.taRollupsAreAsync = true;
  </script>
  <link crossorigin="" href="https://static.tacdn.com/css2/webfonts/TripSans/TripSans.css?v1.002" rel="stylesheet"/>
  <title>
   GOLDEN HARBOR, Champaign - Menu, Prices &amp; Restaurant Reviews - Tripadvisor
  </title>
  <meta content="TripAdvisor" property="al:ios:app_name"/>
  <meta content="284876795" property="al:ios:app_store_id"/>
  <meta content="284876795" name="twitter:app:id:ipad" property="twitter:app

In [None]:
# sometimes div classes are just not intuitively named 
# here's an instance of one of the reviews
print(soup.find('div', class_="ui_column is-9").prettify())

# Looks like the useful tags we can get are 
# "quote": -------------------------------- a div tag
# "partial_entry": ------------------------ a p tag
# "prw_rup prw_reviews_stay_date_hsx": ---- a div tag
# "ui_bubble_rating..." ------------------- a span tag (this last one's a bit weird)

<div class="ui_column is-9">
 <span class="ui_bubble_rating bubble_50">
 </span>
 <span class="ratingDate" title="February 1, 2020">
  Reviewed February 1, 2020
 </span>
 <div class="quote">
  <a class="title " href="/ShowUserReviews-g35790-d2010890-r742485483-Golden_Harbor-Champaign_Champaign_Urbana_Illinois.html" id="rn742485483" onclick="(ta.prwidgets.getjs(this,'handlers')).reviewClick(this.href, '0');">
   <span class="noQuotes">
    Chinese food
   </span>
  </a>
 </div>
 <div class="prw_rup prw_reviews_text_summary_hsx" data-prwidget-init="handlers" data-prwidget-name="reviews_text_summary_hsx">
  <div class="entry">
   <p class="partial_entry">
    the best Chinese in Champaign-Urbana.....very fresh and many choices on the menue....highly recommend.
   </p>
  </div>
 </div>
 <div class="prw_rup prw_reviews_stay_date_hsx" data-prwidget-init="" data-prwidget-name="reviews_stay_date_hsx">
  <span class="stay_date_label">
   Date of visit:
  </span>
  December 2019
 </div>
 <div cl

### Great! Now we can get started actually using this data.


In [None]:
# Luckily there's a handy-dandy find_all() function in bs4

soups = soup.find_all('div', class_="ui_column is-9")
print(soups)

[<div class="ui_column is-9"><span class="ui_bubble_rating bubble_50"></span><span class="ratingDate" title="February 1, 2020">Reviewed February 1, 2020 </span><div class="quote"><a class="title " href="/ShowUserReviews-g35790-d2010890-r742485483-Golden_Harbor-Champaign_Champaign_Urbana_Illinois.html" id="rn742485483" onclick="(ta.prwidgets.getjs(this,'handlers')).reviewClick(this.href, '0');"><span class="noQuotes">Chinese food</span></a></div><div class="prw_rup prw_reviews_text_summary_hsx" data-prwidget-init="handlers" data-prwidget-name="reviews_text_summary_hsx"><div class="entry"><p class="partial_entry">the best Chinese in Champaign-Urbana.....very fresh and many choices on the menue....highly recommend.</p></div></div><div class="prw_rup prw_reviews_stay_date_hsx" data-prwidget-init="" data-prwidget-name="reviews_stay_date_hsx"><span class="stay_date_label">Date of visit:</span> December 2019</div><div class="prw_rup prw_reviews_vote_line_hsx" data-prwidget-deferred="deferred/

#### One thing to keep an eye out for.

In [None]:
# You'll notice something weird about the length of this list of reviews

print(len(soups))

10


#### Be careful of "next page" buttons and the like when you're trying to scrape things like reviews from a page.

In [None]:
# Here's some code we wrote to go through every review page on TripAdvisor
# We're not too worried about runtime here since the number of reviews is small, but be careful with larger datasets
link_front_half = "https://www.tripadvisor.com/Restaurant_Review-g35790-d2010890-Reviews-"
link_back_half = "-Golden_Harbor-Champaign_Champaign_Urbana_Illinois.html"

page = 0
while True:
    review_page_num = f"or{page}0"
    golden_harbor_link = link_front_half + review_page_num + link_back_half
    r = requests.get(golden_harbor_link, allow_redirects = True)

    # r.history lists any redirects that happened during the request
    # if there was a redirect, we must have gone past the last available page
    # (with an exception for page 0, which redirects to the first page)
    if page != 0 and len(r.history) > 0:
        break
    
    soup = BeautifulSoup(r.content, 'html.parser')
    reviews = soup.find_all('div', class_="ui_column is-9")
    print(f"Page {page}: Adding {len(reviews)} reviews")
    soups += reviews
    page += 1

Page 0: Adding 10 reviews
Page 1: Adding 10 reviews
Page 2: Adding 10 reviews
Page 3: Adding 10 reviews
Page 4: Adding 10 reviews
Page 5: Adding 10 reviews
Page 6: Adding 10 reviews
Page 7: Adding 10 reviews
Page 8: Adding 10 reviews
Page 9: Adding 10 reviews
Page 10: Adding 10 reviews
Page 11: Adding 6 reviews
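The same "keep going until there is nothing left" idea can be sketched independently of TripAdvisor. The version below is an assumption-laden variant, not the loop above: it stops when a page comes back empty rather than on a redirect, it pauses between requests, and it takes the page-fetching function as a parameter so it can be tried offline. `collect_all_pages` and `fake_fetch` are made-up names.

```python
import time
from typing import Callable, List


def collect_all_pages(fetch_page: Callable[[int], List[str]],
                      delay_seconds: float = 1.0,
                      max_pages: int = 1000) -> List[str]:
    """Call fetch_page(0), fetch_page(1), ... until a page comes back empty."""
    items: List[str] = []
    for page in range(max_pages):       # hard upper bound so we can never loop forever
        page_items = fetch_page(page)
        if not page_items:              # an empty page means we ran past the end
            break
        items.extend(page_items)
        time.sleep(delay_seconds)       # be polite: pause between requests
    return items


# Stand-in for a real fetcher: 26 fake reviews served 10 per page
fake_reviews = [f"review {i}" for i in range(26)]


def fake_fetch(page: int) -> List[str]:
    return fake_reviews[page * 10:(page + 1) * 10]


all_reviews = collect_all_pages(fake_fetch, delay_seconds=0)
print(len(all_reviews))
```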


In [None]:
# Cool! That matches the review count
len(soups)

126

### Now it's just time to put everything together...

In [None]:
# Let's take a look at how we can get the values we want

soup = soups[0] # Just using the first soup as an example

soup.span.clear()

# .contents[-1] is the bare text that follows the "Date of visit:" label span
date_of_visit = soup.find("div", class_="prw_rup prw_reviews_stay_date_hsx").contents[-1].strip()

title = soup.find('div', class_="quote").get_text() # these two tags were nice

review_text = soup.find("p", class_="partial_entry").get_text()

# the rating is encoded in the span's class list, e.g. ['ui_bubble_rating', 'bubble_50']
rating = soup.find("span", class_= lambda value: value and value.startswith("ui_bubble_rating"))['class']
rating = rating[1][7:]       # 'bubble_50' -> '50'
rating = float(rating) / 10  # '50' -> 5.0

print(date_of_visit)

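The date-of-visit lookup is easier to follow on a tiny example. The sketch below rebuilds just that `div` from the review markup shown earlier (under the made-up name `stay_snippet`) and shows why `.contents[-1]` is used: `get_text()` would include the "Date of visit:" label, whereas the last child of the `div` is only the text that follows the label.

```python
from bs4 import BeautifulSoup

# Reduced copy of the date-of-visit block from a review
stay_snippet = """
<div class="prw_rup prw_reviews_stay_date_hsx">
  <span class="stay_date_label">Date of visit:</span> December 2019
</div>
"""
stay_div = BeautifulSoup(stay_snippet, "html.parser").find(
    "div", class_="prw_rup prw_reviews_stay_date_hsx"
)

full_text = stay_div.get_text()          # includes the label text as well
children = stay_div.contents             # whitespace string, <span> Tag, trailing text
stay_date = children[-1].strip()         # the trailing text is the date itself

print(repr(stay_date))
```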

In [None]:
# Let's loop through each one of the soups first and put the useful parts into a list

data_tuples = []

for soup in soups:
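    try:
        # the last child of this div is the text after the "Date of visit:" label
        date_of_visit = soup.find("div", class_="prw_rup prw_reviews_stay_date_hsx").contents[-1].strip()
    except (AttributeError, IndexError, TypeError):
        # the date-of-visit block may be missing or empty for a review
        date_of_visit = ""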

    title = soup.find('div', class_="quote").get_text() # these two tags were nice
    review_text = soup.find("p", class_="partial_entry").get_text()
    rating = soup.find("span", class_= lambda value: value and value.startswith("ui_bubble_rating"))['class'] # something weird also happens here!
    rating = rating[1][7:]
    rating = float(rating) / 10

    to_add = (date_of_visit, title, review_text, rating)
    data_tuples.append(to_add)
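The rating step relies on the class list looking like `['ui_bubble_rating', 'bubble_50']`, with the number always in second position. A slightly more defensive sketch is shown below; the helper name `rating_from_classes` is made up, and the second and third inputs in the usage lines are hypothetical examples.

```python
import re
from typing import List, Optional


def rating_from_classes(classes: List[str]) -> Optional[float]:
    """Turn a class list such as ['ui_bubble_rating', 'bubble_50'] into 5.0."""
    for name in classes:
        match = re.fullmatch(r"bubble_(\d+)", name)  # look for 'bubble_<digits>' anywhere in the list
        if match:
            return int(match.group(1)) / 10          # '50' -> 5.0
    return None                                      # no rating class found


example_rating = rating_from_classes(["ui_bubble_rating", "bubble_50"])
half_step_rating = rating_from_classes(["bubble_45", "ui_bubble_rating"])  # hypothetical input
missing_rating = rating_from_classes(["quote"])                            # hypothetical input
print(example_rating, half_step_rating, missing_rating)
```

Searching the whole list instead of indexing `[1]` means the helper still works if the order of the classes changes.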

In [None]:
import pandas as pd
import numpy as np

golden_harbor_data = pd.DataFrame(data_tuples, columns = ["date_of_visit", "title", "review_text", "rating"])

golden_harbor_data = golden_harbor_data.fillna("") # takes care of blank values
golden_harbor_data

Unnamed: 0,date_of_visit,title,review_text,rating
0,December 2019,Chinese food,the best Chinese in Champaign-Urbana.....very ...,5.0
1,August 2019,One of the best Chinese restaurants in Champaign,Great chi ese food. Try the whole fish and som...,4.0
2,September 2019,Excellent,I've heard so much about Golden Harbor and it ...,5.0
3,July 2019,Quite authentic,We visited with a large group and with an indi...,4.0
4,July 2019,Authentic Chinese,Old-fashioned Chinese family restaurant. You ...,4.0
...,...,...,...,...
121,December 2011,Excellent food,A review earlier on this page calls the Golden...,4.0
122,November 2011,The most authentic chinese restaurant in town,When you go to a chinese restaurant and the pl...,4.0
123,September 2011,Great authentic Chinese in a Facility that cou...,This restaurant is the real thing. Authentic C...,4.0
124,August 2011,Great food-no atmosphere.,I ate lunch here with a work group of ten peop...,3.0


In [None]:
# Let's do some sentiment analysis!
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
analyzer = SentimentIntensityAnalyzer()

sentiments = []

for review in golden_harbor_data["review_text"]:
    vs = analyzer.polarity_scores(review)
    print(f"review: '{review[:75]}...'\nsentiment: {vs}\n")
    # look up the compound score by key instead of relying on the dict's ordering
    sentiments.append(vs["compound"])

golden_harbor_data["sentiment"] = sentiments

review: 'the best Chinese in Champaign-Urbana.....very fresh and many choices on the...'
sentiment: {'neg': 0.0, 'neu': 0.526, 'pos': 0.474, 'compound': 0.8402}

review: 'Great chi ese food. Try the whole fish and some spicy dish. The home made t...'
sentiment: {'neg': 0.05, 'neu': 0.762, 'pos': 0.188, 'compound': 0.7534}

review: 'I've heard so much about Golden Harbor and it lived up to all expectations....'
sentiment: {'neg': 0.0, 'neu': 0.913, 'pos': 0.087, 'compound': 0.7346}

review: 'We visited with a large group and with an individual who lives in China. Sh...'
sentiment: {'neg': 0.0, 'neu': 0.895, 'pos': 0.105, 'compound': 0.5709}

review: 'Old-fashioned Chinese family restaurant.  You fill out a slip with your ord...'
sentiment: {'neg': 0.0, 'neu': 0.891, 'pos': 0.109, 'compound': 0.4927}

review: 'They have two menus, one for your standard American Chinese food, and one f...'
sentiment: {'neg': 0.0, 'neu': 0.847, 'pos': 0.153, 'compound': 0.8042}

review: 'The food is exquis

In [None]:
golden_harbor_data

Unnamed: 0,date_of_visit,title,review_text,rating,sentiment
0,December 2019,Chinese food,the best Chinese in Champaign-Urbana.....very ...,5.0,0.8402
1,August 2019,One of the best Chinese restaurants in Champaign,Great chi ese food. Try the whole fish and som...,4.0,0.7534
2,September 2019,Excellent,I've heard so much about Golden Harbor and it ...,5.0,0.7346
3,July 2019,Quite authentic,We visited with a large group and with an indi...,4.0,0.5709
4,July 2019,Authentic Chinese,Old-fashioned Chinese family restaurant. You ...,4.0,0.4927
...,...,...,...,...,...
121,December 2011,Excellent food,A review earlier on this page calls the Golden...,4.0,0.2263
122,November 2011,The most authentic chinese restaurant in town,When you go to a chinese restaurant and the pl...,4.0,0.2732
123,September 2011,Great authentic Chinese in a Facility that cou...,This restaurant is the real thing. Authentic C...,4.0,0.6297
124,August 2011,Great food-no atmosphere.,I ate lunch here with a work group of ten peop...,3.0,-0.1027
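The compound score is a number in [-1, 1]. If you want a label instead, the thresholds suggested in the VADER documentation are 0.05 and -0.05. The helper below is a small sketch of that convention; the name `label_sentiment` is made up, and the sample values are taken from the table above.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to 'positive', 'neutral', or 'negative'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores for the first and last rows shown above
first_label = label_sentiment(0.8402)
last_label = label_sentiment(-0.1027)
print(first_label, last_label)
```

With the DataFrame from this notebook, this could be applied as `golden_harbor_data["sentiment"].apply(label_sentiment)`.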


In [None]:
# write data to a CSV file
golden_harbor_data.to_csv("golden_harbor_data.csv")

In [None]:
def insert_newlines(string: str, every: int = 50, newline_char: str = "<br>") -> str:
    """
    Insert `newline_char` into a string roughly every `every` characters, at
    the next available space character.

    We use this to insert <br> (HTML line breaks) into our review_text strings,
    which keeps the plotly hover text readable (next cell).
    """
    for i in range(every, len(string), every):
        next_whitespace = string.find(" ", i)
        if next_whitespace != -1:
            string = string[:next_whitespace] + newline_char + string[(next_whitespace+1):]
    return string


golden_harbor_data["review_text_display"] = golden_harbor_data["review_text"].apply(insert_newlines)
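As an alternative sketch (not the approach used above), the standard library's `textwrap` module can do the line splitting. Note the difference in behavior: `textwrap.wrap` breaks *before* a line would exceed the width, while `insert_newlines` breaks at the first space *after* it. The name `wrap_for_hover` is made up for this example.

```python
import textwrap


def wrap_for_hover(text: str, width: int = 50, newline_char: str = "<br>") -> str:
    """Wrap text to at most `width` characters per line, joined by `newline_char`."""
    return newline_char.join(textwrap.wrap(text, width=width))


# The first review from the scraped data
sample_review = ("the best Chinese in Champaign-Urbana.....very fresh and many "
                 "choices on the menue....highly recommend.")
wrapped_review = wrap_for_hover(sample_review)
print(wrapped_review)
```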

In [None]:
# let's plot the data now!

import plotly.express as px


fig = px.scatter(golden_harbor_data,
                 x = "rating", y = "sentiment",
                 hover_name = "title",
                 hover_data = ["review_text_display"])
fig.update_layout(title = f"Ratings vs. Sentiment (n={len(golden_harbor_data)})",
                  xaxis_title = "Ratings ([1,5])",
                  yaxis_title = "Sentiment ([-1,1])")
fig.show()
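To put a rough number on what the scatter plot shows, one option is the Pearson correlation between rating and sentiment. The sketch below computes it by hand using only the nine rows visible in the table printed earlier, so treat the result as illustrative rather than as a figure for the full dataset. The function name `pearson` is made up.

```python
import math
from typing import List

# (rating, sentiment) pairs from the nine rows visible in the table above
visible_ratings = [5.0, 4.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0]
visible_sentiments = [0.8402, 0.7534, 0.7346, 0.5709, 0.4927,
                      0.2263, 0.2732, 0.6297, -0.1027]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


sample_correlation = pearson(visible_ratings, visible_sentiments)
print(round(sample_correlation, 3))
```

On the full DataFrame, `golden_harbor_data["rating"].corr(golden_harbor_data["sentiment"])` would compute the same statistic.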

## Now you try!
Websites that were specifically built for web scraping practice: http://toscrape.com/

**Exercise**: Get the book titles, price, and availability from all books on this site: http://books.toscrape.com/

Hint: The site has a piece of text at the top telling you how many results there are in total, which might help you determine how to write your loop to go through pages (and yes, there are also pages on this site; the format for accessing each page is pretty simple though).

In [None]:
# Your code here!


## Other honorable mentions

There are other libraries that can do more (or less) for you in terms of webscraping, and here are some of them:

- Scrapy
- MechanicalSoup
- Selenium

Overall, web scraping is a great tool when no API is available for accessing data from a website, but make sure that you abide by each site's rules (such as its `robots.txt`) and don't unintentionally DDoS anybody by sending too many requests too quickly.