# Introduction
Hello!  In this tutorial we will scrape a more complicated web page from Wikipedia.  This is a continuation of [Part 1](https://onefortheroad.github.io/python/tutorial/2017/04/29/web-scraping-part-1/), where we learned the basics of web scraping.

When we left off Part 1, we had a *pandas* dataframe containing the Top 100 Canadian Beers. I'd like to add some **geospatial** information to our beer list so I can plan a pilgrimage to these fantastic breweries.  (Actually, we'll use this geospatial information in a future tutorial on visualization.)  Wikipedia's [List of Breweries in Canada](https://en.wikipedia.org/wiki/List_of_breweries_in_Canada) is a fine place to start.  Let's go!

# 1. Import Libraries

In [3]:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

# 2. Download the web page

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_breweries_in_Canada'
page = requests.get(url)

# 3. Examine the HTML
Looking at the [wiki](https://en.wikipedia.org/wiki/List_of_breweries_in_Canada), the breweries are listed by province.  The HTML for the breweries in Alberta looks like this:

```html
<h3><span class="mw-headline" id="Alberta">Alberta</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=List_of_breweries_in_Canada&amp;action=edit&amp;section=2" title="Edit section: Alberta">edit</a><span class="mw-editsection-bracket">]</span></span></h3>
<ul>
<li>Alley Kat Brewing Company (<a href="/wiki/Edmonton" title="Edmonton">Edmonton</a>)</li>
...
</ul>
```

We can start thinking of the structure, and hence our parse logic, as follows:
- A heading `<h3>` containing a `<span>` with class `mw-headline` gives the province
- Each province is followed by an unordered list `<ul>` of breweries
- Each list item `<li>` represents an individual brewery
- Repeat for each province

# 4. Parse the HTML
We'll first turn our `page` object into a Beautiful Soup object, then start looking for the headings denoting provinces:

In [115]:
soup = BeautifulSoup(page.content, 'lxml')

provinces = soup.find_all(lambda tag: tag.name == 'h3' and tag.find(class_='mw-headline'))
# Print the list of provinces
for i, province in enumerate(provinces, start=1):
    print(i, province.contents[0].string)

1 Alberta
2 British Columbia
3 Manitoba
4 Newfoundland & Labrador
5 Northwest Territories
6 Nova Scotia
7 New Brunswick
8 Ontario
9 Prince Edward Island
10 Saskatchewan
11 Quebec
12 Yukon


Nice! We pass a lambda function to `find_all()` because we want only the `<h3>` tags that contain a particular child, and the tag-name and attribute filters of `find_all()` cannot express that condition on their own.  Our lambda function handles it in a single line.
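
To make the filter easier to read, the lambda can also be written as a named function. The sketch below is only an illustration: `sample_html`, `sample_soup`, and `is_province_heading` are hypothetical names, the snippet is a made-up miniature of the page, and it uses the built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one province heading and one unrelated <h3>
sample_html = """
<h3><span class="mw-headline" id="Alberta">Alberta</span></h3>
<ul><li>Alley Kat Brewing Company (Edmonton)</li></ul>
<h3>Navigation menu</h3>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')


def is_province_heading(tag):
    """Return True for an <h3> that contains an element with class mw-headline."""
    return tag.name == 'h3' and tag.find(class_='mw-headline') is not None


# Searching by tag name alone also picks up the unrelated heading
all_h3 = sample_soup.find_all('h3')
# Passing the function keeps only headings with the mw-headline child
province_h3 = sample_soup.find_all(is_province_heading)

print(len(all_h3))                # 2
print(len(province_h3))           # 1
print(province_h3[0].get_text())  # Alberta
```

A named function behaves the same as the lambda here; it is mostly a matter of taste, though it is easier to reuse and to document.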

>Why didn't we just search for all `<h3>` tags, or for every element with the `mw-headline` class?  Either search would have turned up other results along with the ones we want, leading to additional steps to strip out the ones we don't want.  Try it out as an exercise!
{:.blockquote}

Next, using the first province in the list (Alberta), let's get the list of breweries which are in the `<li>` tags:

In [116]:
brewers_by_province = provinces[0].find_next_sibling('ul').find_all('li')
# Truncate the printed list to first 5 brewers
for i, brewery in enumerate(brewers_by_province[:5], start=1):
    print(i, brewery.text)

1 Alley Kat Brewing Company (Edmonton)
2 Amber's Brewing Company (Edmonton)
3 Banded Peak Brewing Company (Calgary)
4 Banff Ave. Brewing Co. (Banff)
5 Bent Stick Brewing Co. (Edmonton)


Let's break down the first line of that cell.  Starting from the first province, `provinces[0]`, we moved sideways in the tree with `find_next_sibling()` to the next unordered list `<ul>`.  Inside that `<ul>` we then gathered all the `<li>` tags using `find_all()`.
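
The same two steps can be repeated for every province. Here is a minimal sketch of that loop, run on a small hand-written snippet rather than the live page. The names `sample_html`, `sample_soup`, and `breweries_by_province` are assumptions made for this example, and "Example Province", "Example Brewery", and "Example City" are placeholders, not real entries from the wiki.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: two province headings, each followed by a list
sample_html = """
<h3><span class="mw-headline" id="Alberta">Alberta</span></h3>
<ul>
<li>Alley Kat Brewing Company (<a href="/wiki/Edmonton">Edmonton</a>)</li>
<li>Banff Ave. Brewing Co. (<a href="/wiki/Banff">Banff</a>)</li>
</ul>
<h3><span class="mw-headline" id="Example_Province">Example Province</span></h3>
<ul>
<li>Example Brewery (<a href="/wiki/Example_City">Example City</a>)</li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

breweries_by_province = {}
headings = sample_soup.find_all(
    lambda tag: tag.name == 'h3' and tag.find(class_='mw-headline'))
for heading in headings:
    province_name = heading.find(class_='mw-headline').get_text()
    # find_next_sibling() returns None when no <ul> follows the heading
    brewery_list = heading.find_next_sibling('ul')
    if brewery_list is None:
        continue
    breweries_by_province[province_name] = [
        li.get_text() for li in brewery_list.find_all('li')]

for province_name, breweries in breweries_by_province.items():
    print(province_name, breweries)
```

The `None` check is a defensive choice: a heading without a following list is skipped instead of raising an `AttributeError`.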

## Extract Data
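We can now use regular expressions (*regex*) to extract the brewery name and city into separate variables. Here is the text of the first brewer in our list, followed by a test string with two bracketed parts to try our patterns on: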

In [118]:
brewers_by_province[0].text

'Alley Kat Brewing Company (Edmonton)'

In [185]:
# Goal: extract the leading name and the last bracketed part
test = 'Hello Sam (Sharks) (2017)'
# Raw strings keep the backslashes intact for the regex engine
brackets = r'\([^)]+\)'
name = r'^[^(]+'
print(re.search(name, test))
print(re.findall(brackets, test))


<_sre.SRE_Match object; span=(0, 10), match='Hello Sam '>
['(Sharks)', '(2017)']
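
The two patterns can be combined into a small helper that returns the cleaned-up name and the last bracketed value. This is only a sketch of one possible approach: the function name `split_name_and_last_bracket` and the constants are hypothetical, and the bracket pattern gains a capture group so that `findall()` returns the text without the parentheses.

```python
import re

# Everything from the start of the string up to the first "("
NAME_PATTERN = re.compile(r'^[^(]+')
# Text inside each pair of parentheses (the group drops the brackets themselves)
BRACKET_PATTERN = re.compile(r'\(([^)]+)\)')


def split_name_and_last_bracket(text):
    """Return (name, last bracketed value); either part is None if missing."""
    name_match = NAME_PATTERN.search(text)
    name = name_match.group().strip() if name_match else None
    bracketed = BRACKET_PATTERN.findall(text)
    last = bracketed[-1] if bracketed else None
    return name, last


print(split_name_and_last_bracket('Hello Sam (Sharks) (2017)'))
# ('Hello Sam', '2017')
print(split_name_and_last_bracket('Alley Kat Brewing Company (Edmonton)'))
# ('Alley Kat Brewing Company', 'Edmonton')
```

Calling `strip()` removes the trailing space seen in the match above (`'Hello Sam '`), and taking `bracketed[-1]` picks the last bracketed part when there is more than one.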
