# Web Scraping

**Web Scraping** is the process of programmatically fetching webpages and extracting their content. Some popular services, such as SkyScanner, collect data from many sources and display it all together for their users to compare.

Web scraping is often considered an option if there is no API, but beware that it lies in a grey area in terms of ethics and legality. You should not scrape data that isn't publicly available, and many websites have implemented protection measures to prevent web scrapers from taking their data. But in many cases, it is a perfectly acceptable data collection method, especially as a learning exercise. 

If you want to check a website's policy on scraping, most websites have a file called [`robots.txt`](https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/), found at the base URL + "/robots.txt". It contains rules that tell bots crawling the site what they should and shouldn't do.
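
As a minimal sketch of reading those rules in code, Python's standard library includes `urllib.robotparser`. The `robots.txt` content and the `example.com` URLs below are made up for illustration, and the file is inlined so the example runs without a network request:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow a request.
allowed_home = parser.can_fetch("*", "https://example.com/index.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/data.html")

print(allowed_home, allowed_private)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.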

## HTML

The main content of a webpage is displayed in **HTML** (HyperText Markup Language) files. Web browsers format the HTML elements, but additional styling and functionality can be added with **CSS** (Cascading Style Sheets) and **JavaScript**. The code for CSS and JavaScript is commonly written in separate files and imported into the HTML to save space, but not always!

We can use the inspector tool in a web browser to look at the HTML for any webpage. You could also consider using a tool like [Live DOM Viewer](https://software.hixie.ch/utilities/js/live-dom-viewer/). **DOM** = Document Object Model, which is the tree of elements that make up an HTML document.

If you want to create your own `.html` file and experiment:
  - Create a file called `index.html`, and generate the boilerplate for an HTML document with the Emmet shortcut `!` (available in editors such as VS Code)
  - The `<head>` tag holds metadata, the `<body>` tag holds elements that render on the page
  <!-- - Add `<style>` tag to demonstrate CSS (and also how a separate file could be linked)
  - Add `<script>` tag to demonstrate JS (and also how a separate file could be linked) -->
  - Explore some basic tags and common attributes (`<h1>`, `<p>`, `<a>`, `<img>`)
  - Explore HTML tag nesting (`<div>`, `<ul>`)
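
To make the idea of nesting concrete, here is a small sketch that prints the tag tree of a document using only Python's built-in `html.parser`. The HTML string and the `TreePrinter` class are invented for this illustration:

```python
from html.parser import HTMLParser

# A minimal, hypothetical HTML document held in a string.
HTML_DOC = """\
<!DOCTYPE html>
<html>
  <head><title>My page</title></head>
  <body>
    <h1>Hello</h1>
    <ul>
      <li><a href="https://example.com/">A link</a></li>
    </ul>
  </body>
</html>
"""

class TreePrinter(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Indent the tag name by two spaces per level of nesting.
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

printer = TreePrinter()
printer.feed(HTML_DOC)
print("\n".join(printer.outline))
```

This simple depth counter assumes every tag is closed. Elements such as `<img>` have no closing tag, so a real document would need extra handling.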

Now that we know what we're looking for, we can look at some Python libraries to help us achieve it! Choosing which library will depend on the website you are looking at, and how the page content is generated and displayed. 

There are two main types of webpages you'll find online: **static** and **dynamic**. Static refers to HTML pages that are "finished" on arrival: all the content is already available when the document is received. Dynamic pages use JavaScript to build the HTML _after_ the browser receives the document, meaning there will be a short time (often barely perceptible to humans) in which the page doesn't have the content.

There is also the question of accessing the exact data you are after - will you need to interact with elements on the webpage to navigate, filter, search, etc? 

## Beautiful Soup

[**Beautiful Soup**](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is our recommended library for static sites. It is a lightweight package designed for parsing HTML and XML files. You should add `beautifulsoup4` (bs4) to your Conda environment for this project. You will also need to add [`requests`](https://pypi.org/project/requests/) to make the HTTP requests, and (optional) [`lxml`](https://lxml.de/) to help process the HTML/XML. 

In [None]:
import requests
import bs4

response = requests.get("https://example.com/")
# soup = bs4.BeautifulSoup(response.text, 'lxml') # with lxml package
soup = bs4.BeautifulSoup(response.text, 'html.parser') 
soup

Once we have some "soupified" HTML, we can target specific elements inside and collect the data content they are holding. There are actually many maaaany ways we can do this (refer to documentation), and we should specify if we want a single or multiple results: 

In [None]:
# target first <p> element inside the body:
soup.body.p

# target first <p> element in the document:
soup.find('p')
soup.select_one('p')

# target all <p> elements in the document:
soup.find_all('p')
soup.select('p')

The `select` and `select_one` methods accept [CSS Selectors](https://www.freecodecamp.org/news/css-selectors-cheat-sheet/), while `find` and `find_all` take [multiple arguments](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find_all), including the tag name and attributes. If you ever want to practise CSS selectors, [here](https://flukeout.github.io/) is a great resource!

In [None]:
# these both target <p class="red">this is a paragraph</p>
soup.find_all("p", attrs={"class": "red"})
soup.select("p.red")

Once you have selected an element, you can access the [attributes](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.Tag.attrs) and [text](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text), and you can also [navigate](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree) to its surrounding elements (children, parents, siblings) through the DOM tree. 
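
A short sketch of those three ideas (attributes, text, navigation) on a made-up HTML snippet, inlined so no request is needed:

```python
import bs4

# Hypothetical snippet, inlined so the example runs without a request.
html = """
<div id="card">
  <h2>Title</h2>
  <a class="link external" href="https://example.com/">Read <b>more</b></a>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
link = soup.find("a")

href = link["href"]            # a single attribute, dictionary-style
classes = link["class"]        # class is multi-valued, so this is a list
text = link.get_text()         # text of the element and its descendants
parent_id = link.parent["id"]  # step up to the enclosing <div>
heading = link.find_previous_sibling("h2").get_text()  # step sideways

print(href, classes, text, parent_id, heading)
```

Indexing with `link["href"]` raises a `KeyError` if the attribute is missing, while `link.get("href")` returns `None` instead.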

The website [books.toscrape.com](https://books.toscrape.com/) is a dummy website built for the purpose of practising web scraping. We can build a small dataset by taking some of the content. I want to make a dataframe from the books, collecting their title, rating, price, and the URL for the cover image.

In [None]:
response = requests.get("https://books.toscrape.com/")
soup = bs4.BeautifulSoup(response.text, 'html.parser')

By inspecting the HTML, I can see that the cards holding the data for each book are `<li>` (list) elements nested inside an `<ol>` (ordered list) element with a class of "row", and that the elements holding the content I want are grouped into an `<article>` element with a class of "product_pod". It is important to make sure I am not going to end up with any other rogue elements in my list of books, as this could break my code. The more specific the combination, the less likely the chance of overlapping selectors. 

In [None]:
# break the selector down to show nesting
# book_cards = soup.select("article.product_pod")
book_cards = soup.find_all("article", attrs={"class": "product_pod"})
book_cards

This returns a list of the article elements, which I can loop over and extract the data I want. To get the title, I'll need to look at the `<a>` inside the `<h3>` (there is more than one `<a>` element in the card, so I have to be specific when selecting). You'll notice though that the text cuts off the end of the title if it's too long, so we want to take the value from the `title` attribute, rather than the text.

In [None]:
titles = []

for book in book_cards:
  # title = book.select_one('h3 > a')['title']
  # find() replaces the legacy findChild() alias, which newer bs4 versions deprecate
  title = book.find('h3').find('a')['title']
  titles.append(title)

titles

To find the rating, I need to understand their system. It seems they are using a class of "One" to "Five" to determine how the stars display, so the second class on the `<p>` element with a first class of "star-rating" is what tells me the rating. The class attribute will be a list of all classes applied to the element, and since I only want the second, I will use the index. At this point, I could also write a simple function to convert the string to an integer for my data:

In [None]:
ratings = []

for book in book_cards:
  # rating = book.select_one('.star-rating')['class'][1]
  rating = book.find(attrs={'class': 'star-rating'})['class'][1]
  ratings.append(rating)
  
ratings

In [None]:
def convert_rating(string):
  return ["One", "Two", "Three", "Four", "Five"].index(string) + 1

ratings_int = [convert_rating(string) for string in ratings]
ratings_int

To access the price, I can target the `<p>` element with a class of "price_color", and extract the text. At this point I could also convert this to a float, after first removing the currency symbol.

In [None]:
prices = []

for book in book_cards:
  price = book.find("p", attrs={"class": "price_color"}).get_text()
  # strip the currency symbol (which may arrive mis-decoded as 'Â£') before converting
  formatted_price = float(price.lstrip('Â£'))
  prices.append(formatted_price)
  
prices
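
The `Â£` in the cell above is most likely an encoding mismatch rather than part of the price. One plausible cause (an assumption about how this site's response is decoded, not something confirmed here) is that UTF-8 bytes are being read as ISO-8859-1 (Latin-1). The sketch below reproduces the effect without any request, and `parse_price` is a hypothetical helper that copes with both forms:

```python
# "£" is two bytes in UTF-8; read as Latin-1, each byte becomes its own character.
pound_bytes = "£".encode("utf-8")
misdecoded = pound_bytes.decode("latin-1")
print(pound_bytes, misdecoded)

# Reversing the wrong decoding recovers the intended symbol.
repaired = misdecoded.encode("latin-1").decode("utf-8")

def parse_price(text):
    """Convert a price string such as '£51.77' to a float, ignoring leading symbols."""
    return float(text.lstrip("Â£"))

print(parse_price("£51.77"), parse_price(misdecoded + "51.77"))
```

If this is the cause, setting `response.encoding = "utf-8"` before reading `response.text`, or passing `response.content` (the raw bytes) to `BeautifulSoup`, are two common ways to get clean text.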

Finally, to access the cover image URL, I will need to target the `<img>` tag and extract it from the `src` attribute. You'll notice though that it isn't a complete URL. This is because the images are hosted on the same domain, so we can create a full URL that can be linked from anywhere by adding the website's base URL to the string we get from the image source (always double check your `/` symbols when concatenating URL strings):

In [None]:
covers = []

base_url = "https://books.toscrape.com/"

for book in book_cards:
  cover_extension = book.find("img")["src"]
  covers.append(base_url + cover_extension)
  
covers
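
As an alternative to concatenating strings by hand, the standard library's `urllib.parse.urljoin` resolves a relative path against the URL of the page it came from. The `src` values below are hypothetical examples of the two shapes a relative link commonly takes:

```python
from urllib.parse import urljoin

page_url = "https://books.toscrape.com/catalogue/page-2.html"

# Hypothetical src values: one relative to the home page, one climbing up with "..".
from_home = urljoin("https://books.toscrape.com/", "media/cache/cover.jpg")
from_catalogue = urljoin(page_url, "../media/cache/cover.jpg")

print(from_home)
print(from_catalogue)
```

Both calls resolve to the same absolute URL, which is why `urljoin` is usually safer than `+` once you start scraping pages nested at different levels.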

Once I have my lists, I can create a DataFrame and export it to [CSV](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html):

In [None]:
import pandas as pd

books_df = pd.DataFrame({
  'title': titles,
  'price': prices,
  'rating': ratings,
  'cover': covers
})

books_df.to_csv("out.csv", index=False)
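
The four separate loops above could also be folded into one pass that builds a dictionary per book. This is only a sketch: `parse_book` and `RATINGS` are hypothetical names, and the HTML is a simplified stand-in modelled on one book card rather than the site's exact markup:

```python
import bs4

# Simplified, hypothetical markup modelled on one book card.
html = """
<article class="product_pod">
  <img src="media/cache/cover.jpg" alt="A Light in the Attic">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/a-light/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_book(card, base_url="https://books.toscrape.com/"):
    """Extract one book's fields from its card into a dictionary."""
    return {
        "title": card.select_one("h3 > a")["title"],
        "price": float(card.select_one("p.price_color").get_text().lstrip("Â£")),
        "rating": RATINGS[card.select_one("p.star-rating")["class"][1]],
        "cover": base_url + card.select_one("img")["src"],
    }

soup = bs4.BeautifulSoup(html, "html.parser")
records = [parse_book(card) for card in soup.select("article.product_pod")]
print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`, and keeping each book's fields together means the columns cannot drift out of alignment if one lookup fails.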

This has only collected data from the _first_ page though. If we look at the website, we can navigate to page 2 with the click of a button. Notice the URL indicates which page we are on - this is a great opportunity to loop and gather not just the first 20, but all 1000 books listed on the website! This site is nice enough to indicate how many pages we have to iterate over, but many websites won't give you this. It is also likely to change as stock levels rise and fall! So we have to come up with a dynamic solution.

If the total page number is displayed on the website, you can scrape that value and `for` loop over a [range](https://www.w3schools.com/python/ref_func_range.asp). If it isn't displayed, then we're going to need to use a `while` loop until there are no more pages. If we open [page 51](https://books.toscrape.com/catalogue/page-51.html), we get an error 404 page. This error status code is accessible in the response object returned by `requests.get()`, so we can use it as the loop's exit condition.
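
For the first case, here is a rough sketch of turning a scraped pager label into a `range`. The text "Page 1 of 50" is an assumed shape for the label, hard-coded here instead of being scraped:

```python
import re

# Hypothetical pager text in the "Page X of Y" shape.
pager_text = "Page 1 of 50"

# Capture the current and total page numbers; fall back to a single page.
match = re.search(r"Page\s+(\d+)\s+of\s+(\d+)", pager_text)
total_pages = int(match.group(2)) if match else 1

base_url = "https://books.toscrape.com/"
urls = [f"{base_url}catalogue/page-{n}.html" for n in range(1, total_pages + 1)]

print(len(urls), urls[0], urls[-1])
```

Remember that `range` excludes its end value, hence `total_pages + 1`.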

We can build the URL for each page with an f-string (Python's older [`str.format`](https://www.w3schools.com/python/ref_string_format.asp) method works too):

In [None]:
page_number = 1
while True:
  url = f"{base_url}catalogue/page-{page_number}.html"
  response = requests.get(url)
  print(page_number, response)
  # stop as soon as a page is missing (e.g. 404 after the last page)
  if response.status_code != 200:
    break
  # here would be the rest of my scraping code
  page_number += 1
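
The same stop-when-a-page-is-missing idea can be wrapped in a function so it is testable without touching the network. In this sketch `scrape_pages`, `fetch`, and `fake_fetch` are all hypothetical names. `fetch` stands in for a function that returns a page's HTML, or `None` when the status code is not 200:

```python
import time

def scrape_pages(fetch, delay_seconds=0.0, max_pages=1000):
    """Call fetch(page_number) for pages 1, 2, 3... until it returns None.

    max_pages is a safety limit so a misbehaving site cannot loop forever.
    """
    pages = []
    for page_number in range(1, max_pages + 1):
        html = fetch(page_number)
        if html is None:
            break
        pages.append(html)
        time.sleep(delay_seconds)  # pause between requests to be polite
    return pages

# Stand-in for a real request: pretend the site has exactly three pages.
def fake_fetch(page_number):
    return f"<html>page {page_number}</html>" if page_number <= 3 else None

pages = scrape_pages(fake_fetch)
print(len(pages))
```

A real `fetch` could call `requests.get()` and return `response.text` only when `response.status_code == 200`. Setting `delay_seconds` to a second or so keeps the load on the server low.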

Be aware that Beautiful Soup is only going to work for static webpages. If we try to get the HTML for a dynamic page (take the LMS for example, which is a React App, meaning the page content is generated in the browser when you open it), you'll get a mostly empty page, and some JavaScript nonsense. 

In [None]:
response = requests.get("https://lms.codeacademyberlin.com/")
soup = bs4.BeautifulSoup(response.text, 'html.parser') # built-in parser, works even if lxml isn't installed
soup
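
The same effect can be reproduced offline. The two HTML strings below are invented for illustration: one arrives with its content in place, the other relies on a script to create it. Beautiful Soup only parses the markup and never runs the script:

```python
import bs4

# Hypothetical pages: one ships its content, the other builds it with JavaScript.
static_html = "<html><body><div id='root'><p>Hello!</p></div></body></html>"
dynamic_html = """
<html><body>
  <div id="root"></div>
  <script>
    const p = document.createElement("p");
    p.textContent = "Hello!";
    document.getElementById("root").appendChild(p);
  </script>
</body></html>
"""

static_p = bs4.BeautifulSoup(static_html, "html.parser").find("p")
dynamic_p = bs4.BeautifulSoup(dynamic_html, "html.parser").find("p")

print(static_p)   # the <p> element
print(dynamic_p)  # None: the script was never run, so no <p> exists
```

Checking for `None` before calling `.get_text()` is a good habit, since a missing element is often the first sign that a page is dynamic.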

## Your Task! 

You will use BeautifulSoup to scrape data from the [Financial Times](https://www.ft.com/). Use the search feature to find results specific to your chosen cryptocurrency. You can also add additional filters; notice how they are added to the URL! The purpose of this exercise is to collect data, so don't worry if the data you're collecting is not the most relevant.