![title](http://www.sixfeetup.com/blog/an-introduction-to-beautifulsoup/image_preview)

Last time, we saw how to obtain data from web pages that  give us their data in a clean and structured way. However, most web pages don't have these fancy APIs that let us get their data easily. And some,  don't give us all the data we need on their APIs. For all other purposes, we have to go down (and dirty) into the web pages themselves. First, I hope you read about HTML markups, and the structure of a web page, as mentioned in the previous chapter. If those weren't enough, take a look at some examples of HTML pages right here: https://www.w3schools.com/html/html_examples.asp.


Now that we are at it, open this page: https://en.wikipedia.org/wiki/List_of_football_clubs_in_England,  right click on your browser and select View Page Source. This will open the source markup of the web page! Here you can see the actual, coded details of the visuals of the page. Navigate to the section that says body (use CTRL-F to search for this). The body section is the one where the action happens. Study the details and the structure of the web page. When ready, continue on with the rest of the notebook. This notebook will be divided with the steps needed to obtain data from a webpage, organize it in a meaningful way, then do your analysis as always.

## Downloading the HTML of the page

Now before you start complaining about disk space and downloading things into your computer, hear me out. We'll not be downloading the page itself, but all the HTML content it contains as a huge string. So it will only be saved right here, in our Python environment. To download this, we'll use the **urllib** package, which contains a lot of useful functions to work with the web.  

In [5]:
#Import the request package specifically, from urllib.
from urllib import request
#Open the webpage.
page_connection = request.urlopen(" https://en.wikipedia.org/wiki/List_of_football_clubs_in_England")
#Read the html as string.
page_content= page_connection.read()
page_content[0:1000]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of football clubs in England - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_football_clubs_in_England","wgTitle":"List of football clubs in England","wgCurRevisionId":774113887,"wgRevisionId":774113887,"wgArticleId":539282,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from August 2013","Use British English from August 2013","Football clubs in England","Lists of association football clubs","Association football in England lists","Lists of football clubs in England"],"wgBreakFrames":false,"wgPageContentLanguag

Now, we have all the content of the web page in the page_content string! But, this string is very difficult to work with. First of all, it's huge!

In [6]:
print(len(page_content))

384803


384803 characters. Considering the average word has approximately 6 characters and counting spaces, we would have approximately 55000 words to work with! And that's insane, we can't just go looking for certain words like that.

However, there's a very useful library we can use for working with HTML pages in Python, which is called **Beautiful Soup**. Beautiful Soup is a tool that let's us read HTML page content, and search for the content based on the HTML markup tags.   For example, tables in HTML are contained explicit hyperlinks are represented in HTML by the link tag like  <!link>  but without the "!". To get all the elements that have the link tag on that wikipedia page with Beautiful Soup, we would do this:

## Searching through the content of the page.

In [7]:
#Initialize the Beautiful Soup parser.
from bs4 import BeautifulSoup
HTML_Parser = BeautifulSoup(page_content,"lxml")

#Find all links.
all_links = HTML_Parser.find_all("link")
all_links

[<link href="/w/load.php?debug=false&amp;lang=en&amp;modules=ext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.sectionAnchor%7Cmediawiki.skinning.interface%7Cskins.vector.styles%7Cwikibase.client.init&amp;only=styles&amp;skin=vector" rel="stylesheet"/>,
 <link href="/w/load.php?debug=false&amp;lang=en&amp;modules=site.styles&amp;only=styles&amp;skin=vector" rel="stylesheet"/>,
 <link href="android-app://org.wikipedia/http/en.m.wikipedia.org/wiki/List_of_football_clubs_in_England" rel="alternate"/>,
 <link href="/w/index.php?title=List_of_football_clubs_in_England&amp;action=edit" rel="alternate" title="Edit this page" type="application/x-wiki"/>,
 <link href="/w/index.php?title=List_of_football_clubs_in_England&amp;action=edit" rel="edit" title="Edit this page"/>,
 <link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>,
 <link href="/static/favicon/wikipedia.ico" rel="shortcu

**Nice**, we got all elements that have the link tag, and it's on a list-like object. However, what if we want to get URLs in the links? We would first iterate through all the links we found, and extract the attribute **"href"** which contain the URL that leads to the webpage in the link.

In [15]:
def Get_hrefs(links):
    all_urls = []
    #Iterate through the link elements.
    for link in links:
    #Gets the content of the href attribute of the link and add it to the list of urls.
        url = link.get("href")
        all_urls.append(url)
        
    return all_urls

all_urls = Get_hrefs(all_links)

Excellent, but it's still not enough. Some of this URLs aren't complete, which means we can't open them if we just copy paste them on the browser. So let's create a function that extracts all clean urls from a list of urls. In this context, by **clean** we mean that they contain "http" in their name.

In [16]:
def Get_Actual_Urls(urls):
    #Let's use a list comprehension to be fancier.
    #Gets url that contain http.
    actual_urls = [url for url in urls if "http" in url]
    return actual_urls

real_urls = Get_Actual_Urls(all_urls)
real_urls

['android-app://org.wikipedia/http/en.m.wikipedia.org/wiki/List_of_football_clubs_in_England',
 'https://en.wikipedia.org/wiki/List_of_football_clubs_in_England']

Close enough. One is reserved for their mobile app, but there's little we can do about that, except for better pattern matching. You probably get the idea.   

#### Your turn now.

Use the Beautiful Soup html parser we've been using  until now to find all the tables that the page contains. 

**Hint** The html markup for tables is called table. You find them with the find_all function.

In [11]:
#Your code here.

## Expanding how we search the pages.

There are a couple of tricks that we can use to narrow down our searches  using Beautiful Soup. One of them  is by combining searching for a tag, while verifyting if a tag contains an attribute. One practical example, is that in webpages, most links won't even be found in link tags. They'll be inside  "**a**" tags, inside their "**href**" attributes. Let's search for these  on the wikipedia page.

In [13]:
#Search like before, but specifying that we want those a tags that have an href attribute.
a_withhref = HTML_Parser.find_all("a",attrs = {"href":True})
a_withhref

[<a href="#mw-head">navigation</a>,
 <a href="#p-search">search</a>,
 <a href="/wiki/Association_football" title="Association football">football</a>,
 <a href="/wiki/English_football_league_system" title="English football league system">English football league system</a>,
 <a href="/wiki/English_Football_League" title="English Football League">English Football League</a>,
 <a href="/wiki/English_football_league_system" title="English football league system">English football league system</a>,
 <a href="/w/index.php?title=List_of_football_clubs_in_England&amp;action=edit&amp;section=1" title="Edit section: By league and division">edit</a>,
 <a href="/wiki/2016%E2%80%9317_Premier_League" title="2016–17 Premier League">Premier League</a>,
 <a href="/wiki/English_Football_League" title="English Football League">English Football League</a>,
 <a href="/wiki/2016%E2%80%9317_EFL_Championship" title="2016–17 EFL Championship">Football League Championship</a>,
 <a href="/wiki/2016%E2%80%9317_EFL

Now we can do the same as before, and extract the content of the hrefs. 

In [17]:
hrefs = Get_hrefs(a_withhref)
hrefs

['#mw-head',
 '#p-search',
 '/wiki/Association_football',
 '/wiki/English_football_league_system',
 '/wiki/English_Football_League',
 '/wiki/English_football_league_system',
 '/w/index.php?title=List_of_football_clubs_in_England&action=edit&section=1',
 '/wiki/2016%E2%80%9317_Premier_League',
 '/wiki/English_Football_League',
 '/wiki/2016%E2%80%9317_EFL_Championship',
 '/wiki/2016%E2%80%9317_EFL_League_One',
 '/wiki/2016%E2%80%9317_EFL_League_Two',
 '/wiki/National_League_(division)',
 '/wiki/2016%E2%80%9317_National_League',
 '/wiki/2016%E2%80%9317_National_League',
 '/wiki/2016%E2%80%9317_National_League',
 '/wiki/2016%E2%80%9317_Isthmian_League',
 '/wiki/2016%E2%80%9317_Northern_Premier_League',
 '/wiki/2016%E2%80%9317_Southern_Football_League',
 '/wiki/2016%E2%80%9317_Combined_Counties_Football_League',
 '/wiki/2016%E2%80%9317_Eastern_Counties_Football_League',
 '/wiki/2016%E2%80%9317_Essex_Senior_Football_League',
 '/wiki/2016%E2%80%9317_Hellenic_Football_League',
 '/wiki/2016%E2%

Of course, most of these aren't usable, since they don't contain the full URL. However, this was just a demonstration of how we can narrow down our search on Beautiful Soup.

One other useful thing, is getting the **text** inside an html markup. Since, that's what we are actually seeing when we open the page. We can get  the text using the get_text attribute of the resultset.

In [21]:
#Find the span tag that has the attribute class, and the class name is mw-headline.
headline = HTML_Parser.find_all("span", attrs = {"class":"mw-headline"})
#Go through all headlines, and print the text of the headline.
for head in headline:
    print(head.get_text())


By league and division
Alphabetically
Key
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
Y
0–9
Clubs in Levels 1–10 last season
See also


Or another example:

In [36]:
#A more complex example.

div = HTML_Parser.find("div", attrs = {"id":"mw-content-text"})
for li in div.find_all("li"):
    if li.find("a"):
        for a in li.find_all("a",attrs = {"href":True,"title":True}):
            print(a.get_text())

Premier League
English Football League
Football League Championship
Football League One
Football League Two
Football League Championship
Football League One
Football League Two
National League
National League
National League North
National League South
National League
National League North
National League South
Isthmian League
Northern Premier League
Southern League
Isthmian League
Northern Premier League
Southern League
Combined Counties League
Eastern Counties League
Essex Senior League
Hellenic League
Midland League
Northern Counties East League
Northern League
North West Counties League
Southern Counties East League
Southern Combination League
Spartan South Midlands League
United Counties League
Wessex League
Western League
Combined Counties League
Eastern Counties League
Essex Senior League
Hellenic League
Midland League
Northern Counties East League
Northern League
North West Counties League
Southern Counties East League
Southern Combination League
Spartan South Midlands League
U

See if you can figure out which parts  of the page I scrapped with that cell! 

## A practical example.

Let's do an example, where we extract the information of all England football clubs that start with the letter A, and create a Pandas dataframe with it.  

First, let's inspect the page, and see what makes the tables with these names unique. Don't worry, I did that for you already. These tables are inside a "**table**" tag, and the class of this table is called "**wikitable sortable**".

In [42]:
ALetter_table = HTML_Parser.find("table",attrs = {"class":"wikitable sortable"})
ALetter_table

<table class="wikitable sortable" style="font-size:90%" width="100%">
<tr>
<th width="20%">Club</th>
<th width="30%">League/Division</th>
<th width="3%">Lvl</th>
<th width="10%">Nickname</th>
<th width="37%">Change from 2015–16</th>
</tr>
<tr>
<td><a href="/wiki/AC_London_F.C." title="AC London F.C.">AC London</a></td>
<td><a href="/wiki/2016%E2%80%9317_Combined_Counties_Football_League" title="2016–17 Combined Counties Football League">Combined Counties League Division One</a></td>
<td>10</td>
<td></td>
<td style="background:skyblue">Transferred from Kent Invicta League</td>
</tr>
<tr>
<td><a href="/wiki/A.F.C._Aldermaston" title="A.F.C. Aldermaston">A.F.C. Aldermaston</a></td>
<td><a href="/wiki/2016%E2%80%9317_Hellenic_Football_League" title="2016–17 Hellenic Football League">Hellenic League Division One East</a></td>
<td>10</td>
<td>Atom Men</td>
<td style="background:lightgreen">Promoted from Thames Valley Premier League Premier</td>
</tr>
<tr>
<td><a href="/wiki/A.F.C._Blackpool"

For our convenience the first result of this search is the first table, which contains the teams that start with the letter A. If we instead wanted the teams that start with another letter, we would use the find_all method, and select the one that corresponds with the number of the letter of the alphabet - 1.

Now, let's extract the headers of the table that we will use as column names on our dataframe.

In [43]:
headers = []
for table_header in ALetter_table.find_all("th"):
    headers.append(table_header.get_text())
    
headers

['Club', 'League/Division', 'Lvl', 'Nickname', 'Change from 2015–16']

And what's missing? Of course, the data itself. We'll iterate through each row of the table, and get the data that is inside the row. 

In [47]:
#Create the lists that we'll populate later.
Clubs = []
Leagues = []
Lvls = []
Nicknames = []
Changes = []

#Helper function to get the text.
def Get_Data_Text(table_data):
    a = table_data.find("a")
    text = a.get_text()
    #If it doesn't have text, return an empty string. Else return the text.
    if text is None:
        return " "
    
    return text

#Get all the table rows.
rows = ALetter_table.find_all("tr")
#Skip the first row, since it contains the headers.
for i in range(1,len(rows)-1):
    #Find all the data rows.
    table_data = rows[i].find_all("td")
    Clubs.append(Get_Data_Text(table_data[0]))
    Leagues.append(Get_Data_Text(table_data[1]))
    Lvls.append(table_data[2].get_text())
    nickname = table_data[3].get_text()
    #In case the club doesn't have a nickname.
    if nickname:
        Nicknames.append(nickname)
    else:
        Nicknames.append(" ")
    
    #Same with changes, check if the team has changes, this means the length of the list is 5.
    if len(table_data) == 5:
        Changes.append(table_data[4].get_text())
    else:
        Changes.append(" ")

Leagues[0:10]       

['Combined Counties League Division One',
 'Hellenic League Division One East',
 'North West Counties League Division One',
 'Premier League',
 'West Midlands (Regional) League Premier Division',
 'Southern Counties East League Premier Division',
 'North West Counties League Premier Division',
 'Southern League Division One Central',
 'Northern Counties East League Division One',
 'National League North']

Great, now to join it all into a dataframe.

In [55]:
#Dictionary to convert to a dataframe.
A_Clubs = { }
information_list = [Clubs,Leagues,Lvls,Nicknames,Changes]

#Fill the dictionary
for i in range(0,len(information_list)-1):
    A_Clubs[headers[i]] = information_list[i]

import pandas as pd
A_Clubs_df = pd.DataFrame(A_Clubs)
A_Clubs_df.set_index("Club",inplace= True)
A_Clubs_df.head(10)

Unnamed: 0_level_0,League/Division,Lvl,Nickname
Club,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AC London,Combined Counties League Division One,10,
A.F.C. Aldermaston,Hellenic League Division One East,10,Atom Men
A.F.C. Blackpool,North West Counties League Division One,10,Mechanics
A.F.C. Bournemouth,Premier League,1,Cherries
A.F.C. Bridgnorth,West Midlands (Regional) League Premier Division,10,Meadow Men
A.F.C. Croydon Athletic,Southern Counties East League Premier Division,9,Rams
A.F.C. Darwen,North West Counties League Premier Division,9,Salmoners
A.F.C. Dunstable,Southern League Division One Central,8,ODs
A.F.C. Emley,Northern Counties East League Division One,10,Pewits
A.F.C. Fylde,National League North,6,Coasters


And there you have it. We took information right out of a webpage, and created a pandas dataframe from it. We can now perform the same operations that we have done a lot of times before. Like:

In [56]:
A_Clubs_df.Lvl.mean()

1.4226776196916476e+93

#### Now your turn.

Create a pandas dataframe with the information of the last table of the page, Clubs in level 1-10 last season. 

In [57]:
#Your cells below.

## Exercise

Scrape the gross median income table from  this page: https://en.wikipedia.org/wiki/Median_income and create a pandas dataframe with it. Then, copy the dataframe and sort it by  per-capita income. Then compare the two tables. How do the rankings of the countries differ? 

In [58]:
#Your cells below.

## Further reading.

Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/