# Web Scraping in Python

Presentation material: https://github.com/NoahBres/Workshop-2022-04-09-WebScraping

In [None]:
# Install packages
import sys
!{sys.executable} -m pip install requests beautifulsoup4 pandas seaborn matplotlib

## HTML Review

HTML = HyperText Markup Language

Describes the structure of a web page via tags.

Tags are written like so:

```html
<tag>Content</tag>
```

Tags can have tags nested inside them.

```html
<tag>
  <tag>Content</tag>
</tag>
```

HTML has a number of predefined tags: `body`, `p`, `h1`, `head`, `title`, etc.
A full list of tags can be found on MDN: https://developer.mozilla.org/en-US/docs/Web/HTML/Element

Tags can also have attributes describing them:
```html
<img src="/dog-on-log.jpg" />
```

Here's a simple HTML page:
```html
<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <button>Click me!</button>
  </body>
</html>
```
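
To make the nesting concrete, here is a minimal sketch that walks the sample page with Python's built-in `html.parser` module and records each opening tag together with its depth. `OutlineParser` is a name chosen for illustration, and the sketch assumes every tag is explicitly closed (void elements such as `<img>` would need extra handling).

```python
from html.parser import HTMLParser

SAMPLE_PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <button>Click me!</button>
  </body>
</html>"""


class OutlineParser(HTMLParser):
    """Collects (depth, tag) pairs for every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then step one level deeper.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag brings us back up one level.
        self.depth -= 1


parser = OutlineParser()
parser.feed(SAMPLE_PAGE)
parser.close()

# Print the tag tree, indented by nesting depth.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The printed outline shows `head` and `body` nested inside `html`, with `title`, `h1`, `p`, and `button` one level further down.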

## Import Modules
- BeautifulSoup
  - HTML/XML Parser
  - Allows us to read/parse HTML/XML
- Requests
  - HTTP Requests library
  - Allows us to fetch data from the web and download webpages

In [None]:
from bs4 import BeautifulSoup
import requests

## Downloading a simple page
Use the requests library to fetch/download a page and print its content.

In [None]:
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page.status_code

In [None]:
page.content

## Parsing the page
Use BeautifulSoup to parse the page and read its contents.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

## Find all the `<p>` tags
Read all the `<p>` tags from the page and print them.

In [None]:
soup.find_all('p')

In [None]:
soup.find_all('p')[0].get_text()
# Same as soup.find('p').get_text()

## Find elements via selectors
Request a new page, then find elements by class and CSS selector instead of by tag name alone.

In [None]:
page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

Select via tag and class

In [None]:
soup.find_all('p', class_="outer-text")

In [None]:
soup.select("div p")
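
The two approaches can be compared on a small inline document, so the result does not depend on the network. The HTML below is a made-up fragment, not the contents of `ids_and_classes.html`; only the `outer-text` class name is borrowed from the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only.
html = """
<div>
  <p class="inner-text" id="first">First inner paragraph.</p>
</div>
<p class="outer-text" id="second">First outer paragraph.</p>
<p class="outer-text">Second outer paragraph.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Filter by tag name and class using keyword arguments.
by_class = soup.find_all("p", class_="outer-text")

# The same query written as a CSS selector.
by_selector = soup.select("p.outer-text")

# Descendant selector: every <p> nested anywhere inside a <div>.
inside_div = soup.select("div p")

# Look up a single element by its id attribute.
first = soup.find(id="first")

print([p.get_text() for p in by_class])
print([p.get_text() for p in inside_div])
print(first.get_text())
```

`find_all` takes the tag name and attribute filters as Python arguments, while `select` accepts a CSS selector string, which tends to be more compact for nested queries such as `div p`.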

## A real-world sample
Here's a sandbox site we can play around with and run some actual stats on.
http://books.toscrape.com/catalogue/category/books_1/index.html

In [None]:
page = requests.get("http://books.toscrape.com/catalogue/category/books_1/index.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

Notice in the HTML that each item has a `product_pod` class

In [None]:
products = soup.select(".product_pod")
products

Let's extract all the titles from it

In [None]:
titles = [x.text for x in soup.select(".product_pod h3")]
titles

Extract the prices from the page

In [None]:
prices = [float(x.text[1:]) for x in soup.select('.price_color')]
prices
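
Slicing off the first character works as long as every price starts with exactly one currency symbol. A slightly more defensive sketch pulls the number out with a regular expression instead; `parse_price` is a helper name introduced here, not something provided by BeautifulSoup.

```python
import re

# Matches the first decimal number in a string, e.g. "51.77" in "£51.77".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text):
    """Return the first number found in `text` as a float.

    Raises ValueError if the text contains no number.
    """
    match = _NUMBER.search(text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return float(match.group())


print(parse_price("£51.77"))
print(parse_price("  Â£13.99 "))  # stray characters from a mis-decoded page
```

In the cell above this would read `prices = [parse_price(x.text) for x in soup.select('.price_color')]`.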

Extracting availability

In [None]:
instock = [x.text.strip() for x in soup.select(".instock")]
instock
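
The three list comprehensions above rely on every product having exactly one title, price, and availability element, so that the lists line up by position. An alternative sketch walks each `.product_pod` once and reads all three fields from it. The fragment below is a hand-written approximation of one entry on the site, included so the example runs offline; the assumption that the link's `title` attribute holds the untruncated title should be checked against the live page.

```python
from bs4 import BeautifulSoup

# Hand-written approximation of a single product entry (assumption).
html = """
<article class="product_pod">
  <h3><a href="a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
    <p class="instock availability">
      In stock
    </p>
  </div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for pod in soup.select(".product_pod"):
    link = pod.select_one("h3 a")
    records.append({
        # Prefer the title attribute, fall back to the visible link text.
        "title": link.get("title", link.get_text()),
        # Drop the leading currency symbol, as in the cell above.
        "price": float(pod.select_one(".price_color").get_text()[1:]),
        "availability": pod.select_one(".instock").get_text().strip(),
    })

print(records)
```

A list of dictionaries like `records` can be passed directly to `pandas.DataFrame`, and a product with a missing field fails loudly instead of silently shifting the columns out of alignment.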

## Scraping all the pages
Loop through the URLs by page number and repeat the same scraping steps.
The requests run one after another, so this is slow; downloading the pages concurrently with threads would speed it up.

In [None]:
# Pages 2-50: range() excludes its upper bound, and page 1 was scraped above
for page_number in range(2, 51):
    url = "http://books.toscrape.com/catalogue/category/books_1/page-" + str(page_number) + ".html"
    results = requests.get(url)
    soup = BeautifulSoup(results.content, 'html.parser')
    titles.extend([x.text for x in soup.select(".product_pod h3")])
    prices.extend([float(x.text[1:]) for x in soup.select('.price_color')])
    instock.extend([x.text.strip() for x in soup.select(".instock")])
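
As a rough illustration of the threading idea, the sketch below downloads pages with a thread pool from the standard library's `concurrent.futures`. The helper names (`page_url`, `download_all`) and the `max_workers` value are assumptions made for this example; the download function is passed in as an argument so the sketch can be exercised without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor

BASE = "http://books.toscrape.com/catalogue/category/books_1/page-{}.html"


def page_url(number):
    """Build the catalogue URL for a given page number."""
    return BASE.format(number)


def download_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a pool of worker threads.

    Results come back in the same order as `urls`, so titles and prices
    parsed from them stay in page order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


# Stand-in for a real download so the example runs offline.
def fake_fetch(url):
    return f"<html><!-- contents of {url} --></html>"


urls = [page_url(n) for n in range(2, 51)]  # pages 2 through 50
pages = download_all(urls, fake_fetch)
print(len(pages), urls[0], urls[-1])
```

With `requests` installed, `fetch` could be `lambda url: requests.get(url, timeout=10).content`, and each returned page can then be parsed with `BeautifulSoup` exactly as in the loop above. Keep the worker count modest so the site is not flooded with requests.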

## Making use of our data

In [None]:
import pandas
d = {"title": titles, "price": prices, "availability": instock}
books = pandas.DataFrame(data=d)
books.head()

Run some basic analysis

In [None]:
books['price'].describe().round()

Let's plot our data

In [None]:
import seaborn
import matplotlib.pyplot as plt

seaborn.set(style = 'darkgrid', color_codes = True)
f, ax = plt.subplots(figsize=(13, 3))
seaborn.despine(f, left=True, bottom=True)

boxplt = seaborn.boxplot(x=books["price"])

## RTS Fetch Intercept Example

- In the browser's developer tools, open the Network tab and right-click the selected request
- Select "Copy as cURL"
- Open up [https://curlconverter.com](https://curlconverter.com)
- Paste the copied cURL into the "cURL" field
- Go to the Python tab and it'll show the equivalent Python code for this request with all the headers

🚨 THE FOLLOWING SAMPLE WON'T WORK FOR YOU🚨

The RTS application appears to expire its access credentials after a short time, so you have to copy the cURL command yourself and use your own headers for this to work.

Otherwise, the request will return `No API access permitted`.

The headers in the following sample have already expired.
You can probably reverse engineer how they are generated if you care. I suspect the app calls the `gettime` method and then uses the data from that.

Sample:

In [None]:
import requests

headers = {
    'Connection': 'keep-alive',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
    'Accept': 'application/json',
    'X-Date': 'Fri, 08 Apr 2022 23:36:00 GMT',
    'sec-ch-ua-mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    'X-Request-ID': '4710baeb042156e238641cedf72802eddc82db50a5a3638782dcca0e9c2c4bbb',
    'sec-ch-ua-platform': '"macOS"',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.riderts.app/map',
    'Accept-Language': 'en-US,en;q=0.9',
}

params = {
    'requestType': 'getvehicles',
    'rt': '20',
    'key': 'Qskvu4Z5JDwGEVswqdAVkiA5B',
    'format': 'json',
    'xtime': '1649460961927',
}

response = requests.get('https://www.riderts.app/bustime/api/v3/getvehicles', headers=headers, params=params)

print(response.content)
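
One observation that may help with the reverse engineering: the `xtime` parameter in the sample looks like a Unix timestamp in milliseconds. The sketch below decodes it with the standard library and formats it in the same HTTP-date shape as the `X-Date` header. It only inspects the captured values; how the application derives `X-Request-ID` is not known here, so this does not produce a working request.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

XTIME = "1649460961927"  # value of params['xtime'] in the sample above

# Interpret the value as milliseconds since the Unix epoch (assumption).
moment = datetime.fromtimestamp(int(XTIME) / 1000, tz=timezone.utc)

# HTTP-date string, the same shape as the X-Date header.
as_http_date = format_datetime(moment, usegmt=True)

print(moment.isoformat())
print(as_http_date)
```

The decoded time falls one second after the `X-Date` value in the captured headers, which is consistent with both being generated at request time.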