# Web Scraping in Python

## Setting up

First, install the packages we will use today. We'll need:
- requests, to get the HTML content of a web page
- BeautifulSoup, which is used to parse HTML into a manipulable object in Python
- pandas, to work with dataframes

In [None]:
!pip install beautifulsoup4 requests pandas

Now let's import them:

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Scraping a single page

### Downloading HTML

Let's start by scraping some content from the R-bloggers website, which publishes news and tutorials about R.

The first step is to download the HTML for a webpage using `requests.get`.

In [None]:
response = requests.get(
	url='https://www.r-bloggers.com'
)

We can view the HTML code that we get:

In [None]:
print(response.text)

What you saw above is the actual HTML code underlying the webpage in your browser. However, at this point it is just one long string. To make it easier to work with, we will use BeautifulSoup to turn it into an object that lets us extract information easily.

In [None]:
# Name the parser explicitly so results are consistent and no warning is raised
webpage = BeautifulSoup(response.text, 'html.parser')

### Scraping text

Let's store the text of the page's logo in a new variable, using a CSS selector to find the element. (Hint: SelectorGadget can make this easier!)

In [None]:
page_logo = webpage.select_one('.logo-wrap').get_text()

Print the variable to see what we've downloaded:

In [None]:
print(page_logo)

Chances are you'll want only the text, so you can use `.strip()` to get rid of the surrounding spaces/empty lines.

In [None]:
print(page_logo.strip())

Changing `select_one` to `select` returns all of the elements that match the selector, as a list-like object, instead of only the first one. Let's do this for the article titles on the page.

In [None]:
title_elements = webpage.select('h3 a')
title_elements
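
To see the difference on something small, here is a minimal sketch using a made-up HTML snippet (the markup and link paths are invented for illustration, not taken from R-bloggers). It also shows that `select_one` returns `None` when nothing matches, which is why calling `.get_text()` on a missing element raises an error.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page used only for illustration
html = """
<div class="logo-wrap"> Example Blog </div>
<h3><a href="/post-1">First post</a></h3>
<h3><a href="/post-2">Second post</a></h3>
"""
soup = BeautifulSoup(html, 'html.parser')

# select_one: only the first matching element
first = soup.select_one('h3 a')

# select: every matching element, in document order
all_links = soup.select('h3 a')

# No match: select_one gives None, select gives an empty list
missing = soup.select_one('.no-such-class')

print(first.get_text())
print([link.get_text() for link in all_links])
print(missing)
```

Checking for `None` before calling `.get_text()` is a simple way to avoid a crash when a page does not contain the element you expect.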

To turn this into a list of texts instead of a list of elements, you can use a list comprehension. If you are not familiar with list comprehensions, think of them as a more compact way of writing a for loop (see below for a comparison):

In [None]:
# For loop version:
titles = []
for title_element in title_elements:
	titles.append(title_element.get_text())
	
# List comprehension version (more compact):
titles = [title_element.get_text() for title_element in title_elements]

titles

### Scraping images

We can do the same thing for images by choosing their HTML selectors and using the `src` attribute:

In [None]:
image_elements = webpage.select('img')
images = [image_element.get('src') for image_element in image_elements]

Let's see the first two elements in the list of images:

In [None]:
images[:2]

To see the images themselves, you could follow the links. In a scraping scenario, we often want to download the images, so we will do that.

Note that common image file types are: `.png`, `.jpg`, `.jpeg`, `.gif` and `.webp`. The last one is a recent addition and becoming more common.

In [None]:
image_binary = requests.get(images[0]).content

# Note the mode 'wb', which tells Python to write the content as binary
# This is needed to save things that are not text
with open('r_bloggers.webp', 'wb') as f:
	f.write(image_binary)
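
The file name above is hard-coded, which assumes the first image really is a WebP file. One possible alternative, sketched below, is to derive the name from the URL itself using the standard library. The helper name `filename_from_url` and the example URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

def filename_from_url(url, default='image'):
    """Return the last path segment of a URL, ignoring any query string."""
    path = urlparse(url).path
    name = posixpath.basename(path)
    # Fall back to a default when the URL has no file name (e.g. ends in '/')
    return name or default

# Hypothetical image URLs, for illustration only
print(filename_from_url('https://example.com/images/logo.png?ver=2'))
print(filename_from_url('https://example.com/'))
```

You could then pass the result to `open(..., 'wb')` so that each saved file keeps the extension it had on the website.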

## Exercise 1: basic web scraping

On your own, repeat the steps above on the Wikipedia homepage.

First, download the HTML for <https://en.wikipedia.org/wiki/Main_Page>:

In [None]:
# Code goes here

Next, view the HTML code:

In [None]:
# Code goes here

Scrape the headings for the homepage features ('From today's featured article', 'In the news', etc.)

In [None]:
# Code goes here

Scrape the text of the homepage features:

In [None]:
# Code goes here

## More advanced scraping

Often, when undertaking a web scraping project, we find we'll need to download content from multiple pages or multiple locations.

The Connosr database contains a variety of whisky reviews, ratings, and information. The website is structured with information nested under the main URL, www.connosr.com.

[https://www.connosr.com/](https://www.connosr.com)

Here is the structure of the webpage: 

![StructureWebsite](images/WebsiteStructure.jpg)

We're interested in scraping data about Scottish whisky, located in the [Scotch Whiskys sub-folder](https://www.connosr.com/scotch-whisky). Let's save the URL in a new variable:


In [None]:
whiskypage = 'https://www.connosr.com/scotch-whisky'

## Level 1: Home Page

### Extracting information from links

Let's get the links to all of the Scottish whisky distilleries listed on the page. Once we have the list, we'll be able to use a spider to 'crawl' through our links one at a time, extracting information about each distillery.

We'll use the CSS selector to grab the HTML nodes for the names and the corresponding HTML attributes:

In [None]:
def get_dist_links(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	name_elements = html_soup.select('.name')
	dist_links = [name_element['href'] for name_element in name_elements]
	return dist_links

dist_links = get_dist_links(whiskypage)

We can take a peek at the first few links to see that nothing is wrong.

In [None]:
dist_links[:10]

How long is the list of links?

In [None]:
len(dist_links)

You may notice that these are not full links: they are missing the `http...` part at the front. We need the full URLs to work with, so we will add the site's address in front of each link, using either a loop or a list comprehension:

In [None]:
full_links = ['http://www.connosr.com' + dist_link for dist_link in dist_links]
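
Plain string concatenation works as long as every link is a path that starts with `/`. As a hedged alternative, the standard library's `urljoin` also copes with links that are already complete. The example paths below are made up for illustration.

```python
from urllib.parse import urljoin

base_url = 'https://www.connosr.com'

# One relative path and one already-absolute URL (both illustrative)
example_links = [
    '/scotch-whisky/aberfeldy/',
    'https://www.connosr.com/scotch-whisky/',
]

# urljoin resolves relative paths against the base and leaves absolute URLs alone
example_full_links = [urljoin(base_url, link) for link in example_links]
print(example_full_links)
```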

We can also extract the names of the distilleries. We might write a function for this, similar to the one above.

In [None]:
def get_dist_names(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	name_elements = html_soup.select('.name')
	dist_names = [name_element.get_text() for name_element in name_elements]
	return dist_names

In [None]:
dist_names = get_dist_names(whiskypage)

Note that in practice, you can save some time and space by doing all of these steps in one loop!
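
As a rough sketch of what "one loop" could look like, the function below parses the page once and collects the link and the name together. It takes the HTML as a string so it can be tried without downloading anything; the function name `parse_distilleries` and the sample markup are assumptions made for this example, not the actual structure of the Connosr page.

```python
from bs4 import BeautifulSoup

def parse_distilleries(html, base_url='http://www.connosr.com'):
    """Collect the full link and the name of every '.name' element in one pass."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for element in soup.select('.name'):
        rows.append({
            'full_link': base_url + element['href'],
            'name': element.get_text(strip=True),
        })
    return rows

# Made-up markup that only imitates a list of distillery links
sample_html = """
<a class="name" href="/scotch-whisky/aberfeldy/">Aberfeldy</a>
<a class="name" href="/scotch-whisky/aberlour/">Aberlour</a>
"""
distillery_rows = parse_distilleries(sample_html)
print(distillery_rows)
```

With a real page you would pass in `requests.get(url).text`, so the page is downloaded and parsed only once instead of once per function.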

### Cleaning scraped text

Data scraped from the web often needs some cleaning. If you print `dist_names`, you'll see that each distillery name is preceded by an extra letter.

Using regex, we'll go ahead and remove the extra letters:

In [None]:
import re

# Compile the pattern in advance to speed things up
regex_pattern = re.compile(r'^\w+\s')

# Create a new list of dist names without the extra letter (and leading space).
cleaned_names = []
for dist_name in dist_names:
	cleaned_name = regex_pattern.sub('', dist_name)
	cleaned_names.append(cleaned_name)

Let's see how it looks:

In [None]:
cleaned_names[:10]
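
It is worth checking a pattern like this on a few strings before trusting it. The sketch below uses made-up names (not scraped from Connosr) to show that `^\w+\s` removes the whole first word, which would also shorten a name that has no extra letter in front of it. A stricter pattern that only removes a single leading character is shown for comparison; whether you need it depends on what the scraped strings actually look like.

```python
import re

# Made-up examples: two with an extra leading letter, one without
sample_names = ['A Aberfeldy', 'G Glen Grant', 'Glen Grant']

first_word = re.compile(r'^\w+\s')     # pattern used above: first word + one space
single_letter = re.compile(r'^\w\s+')  # stricter: exactly one character + spaces

with_first_word = [first_word.sub('', name) for name in sample_names]
with_single_letter = [single_letter.sub('', name) for name in sample_names]

print(with_first_word)     # the last name loses 'Glen'
print(with_single_letter)  # the last name is left untouched
```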

Now, we have a list of the Scottish distilleries in the Connosr database. We might also be interested in seeing how community members have rated them.

We can write a function that scrapes the average rating of each distillery from the page, again using CSS selectors:

In [None]:
def get_rate_dist(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	avg_rating_elements = html_soup.select('.not-small-phone')
	average_rating_texts = [avg_rating_element.get_text() for avg_rating_element in avg_rating_elements]
	return average_rating_texts

rate_dist = get_rate_dist(whiskypage)

We'll also want to remove the `Average rating: ` prefix from each distillery's rating, and turn the remaining text into an actual rating (i.e. a numeric value).

In [None]:
cleaned_rate_dist = [float(rate_text.replace('Average rating: ', '')) for rate_text in rate_dist]

That gives you an error! That's because some ratings are not numerical values (`~`). If you look at the website, these appear to be distilleries that have no ratings. To handle this, we have to decide what to do with these values. A sensible choice is to assign them some equivalent of "NA"; in this case, we will use `None`.

In [None]:
cleaned_rate_dist = []
for rate_text in rate_dist:
	rating = rate_text.replace('Average rating: ', '')
	if rating == '~':
		cleaned_rate_dist.append(None)
	else:
		cleaned_rate_dist.append(float(rating))
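
If other non-numeric placeholders turn up besides `~`, a slightly more general approach is to try the conversion and fall back to `None` when it fails. This is only a sketch: the helper name `parse_rating` and the sample strings are hypothetical.

```python
def parse_rating(rate_text, prefix='Average rating: '):
    """Return the rating as a float, or None if the text is not a number."""
    rating = rate_text.replace(prefix, '').strip()
    try:
        return float(rating)
    except ValueError:
        # Covers '~', empty strings, and any other non-numeric text
        return None

# Hypothetical rating strings, for illustration only
sample_ratings = ['Average rating: 85', 'Average rating: ~', 'Average rating: 87.5']
parsed_ratings = [parse_rating(text) for text in sample_ratings]
print(parsed_ratings)
```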


To save the information we've extracted, we can merge it into a dataframe:

In [None]:
distillery_df = pd.DataFrame(
	zip(full_links, cleaned_names, cleaned_rate_dist),
	columns=['full_link', 'cleaned_name', 'rating']
)
distillery_df
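
One thing to keep in mind is that `zip` stops at the shortest input, so if one selector matched fewer elements than another, rows would be dropped without any warning. A quick length check before building the dataframe can catch this. The helper below is a hypothetical sketch, shown with short made-up lists.

```python
def check_same_length(**columns):
    """Raise a ValueError if the given lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths

# Made-up data standing in for the scraped lists
example_links = ['/a', '/b', '/c']
example_names = ['Aberfeldy', 'Aberlour', 'Ardbeg']
example_ratings = [85.0, None, 88.0]

lengths = check_same_length(
    links=example_links, names=example_names, ratings=example_ratings
)
print(lengths)
```

In the lesson's code, you could call it as `check_same_length(links=full_links, names=cleaned_names, ratings=cleaned_rate_dist)` before creating `distillery_df`.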

## Exercise 2: Level 1 scraping

On your own, use what we've learned to scrape a list of whisky distilleries for another region on Connosr.

First, save the URL for the page.


In [None]:
# Code goes here

Next, download the links by grabbing the HTML nodes for the names of distilleries and the corresponding HTML attributes. (Hint: we've already defined the function `get_dist_links` in the previous step, so you'll just need to use it on the new URL!)

In [None]:
# Code goes here

According to the Connosr database, how many whisky distilleries are in the new region you've explored?

In [None]:
# Code goes here

## Level 2: Distilleries pages

From this point in the lesson, we'll be scraping multiple pages for information, so you may find that the blocks of code take longer to run.

We're going to repeat the process of writing a function, this time to download the links to the reviews for specific bottles:

In [None]:
def get_bottle_links(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	name_elements = html_soup.select('.name')
	bottle_links = [name_element['href'] for name_element in name_elements]
	return bottle_links

bottle_links = []
for full_link in full_links:
	bottle_links += get_bottle_links(full_link)
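
When a loop like this requests many pages in a row, it is considerate to pause between requests, and useful to keep going if a single page fails. The sketch below shows one way this might be done. The function name `crawl`, its parameters, and the `fake_fetch` stand-in are all hypothetical; the fetch function is passed in as an argument so the example runs without touching the network.

```python
import time

def crawl(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between calls and recording failures."""
    results = {}
    failures = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        try:
            results[url] = fetch(url)
        except Exception as error:
            failures[url] = error  # remember the problem and move on
    return results, failures

# A stand-in for a real download function, used only for this demonstration
def fake_fetch(url):
    if 'broken' in url:
        raise RuntimeError('simulated download failure')
    return [url + '/bottle-1']

results, failures = crawl(
    ['https://example.com/a', 'https://example.com/broken'],
    fake_fetch,
    delay_seconds=0,
)
print(results)
print(list(failures))
```

In the lesson's setting, you could call `crawl(full_links, get_bottle_links)` and then combine the lists stored in `results`.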

### Completing partial URLs

The links are incomplete, with only part of the URL path. Let's fix this:

In [None]:
full_bottle_links = ['http://www.connosr.com' + bottle_link for bottle_link in bottle_links]

Let's look at some of them to see if we got it right:

In [None]:
full_bottle_links[:10]

### Extracting information from links

Now we have the links of each bottle page. We also want to get the name of each bottle:

In [None]:
def get_bottle_names(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	name_elements = html_soup.select('.name')
	bottle_names = [name_element.get_text() for name_element in name_elements]
	return bottle_names

bottle_names = []
for full_link in full_links:
	bottle_names += get_bottle_names(full_link)

## Level 3: Reviews

### Downloading reviews

The full list of reviewed bottles includes 3,508 observations. To save time, we are going to work on a subset of the links. (If you want to work on the full list, note that each function can take up to 10 minutes to run. To use the full list, replace `test_links` with `full_bottle_links`.)

In [None]:
test_links = full_bottle_links[:100]

In [None]:
def get_bottle_reviews(url):
	html = requests.get(url).text
	html_soup = BeautifulSoup(html, 'html.parser')
	p_elements = html_soup.select('.simple-review-content p')
	bottle_reviews = [p_element.get_text() for p_element in p_elements]
	return bottle_reviews

bottle_review_list = []
for full_bottle_link in test_links:
	bottle_review_list.append(get_bottle_reviews(full_bottle_link))

Now, we have a very long list of reviews! Let's have a look at the reviews for the first bottle:

In [None]:
bottle_review_list[0]

### Merging scraped data

We may want to merge the reviews for each bottle together. We can do this by joining the strings.

In [None]:
reviews_by_bottle = [' '.join(bottle_reviews) for bottle_reviews in bottle_review_list]

We can also create a dataframe for further data manipulation.

In [None]:
with_bottle_names = pd.DataFrame(
	zip(bottle_names[:100], reviews_by_bottle),
	columns=['bottle_name', 'review']
)
with_bottle_names

# The end!