# Webscraping Tutorial - Using BeautifulSoup to store and process HTML tags in Python

SODA 308, April 16th, 2021

Hojin Ryoo

This tutorial teaches you how to use BeautifulSoup to scrape information from webpages. To create our "soup", we first need the raw HTML text of the site we wish to scrape. Let's install the packages we need in order to create the soup.

In [17]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [18]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


Here we import the packages we need so we can use them in our code. We will use the requests package to get the site's raw content, and BeautifulSoup to search for HTML tags.

### Web Scraping

In [19]:
import requests
from bs4 import BeautifulSoup

For this example we will be gathering the top 100 universities ranked by topuniversities.com. In order to do this we define the url and get the raw content of that page and assign it to the variable 'page'. ![base_site](https://drive.google.com/uc?export=view&id=1axalcr7loA15942f6ItPeKupsWFyJCbH)



In [20]:
url = "https://www.topuniversities.com/where-to-study/north-america/united-states/ranked-top-100-us-universities"
page = requests.get(url).content
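
The request above assumes the download always succeeds. As a minimal sketch of a more defensive version, the helper below (the name `fetch_html` and the 10 second timeout are assumptions for illustration, not part of the original tutorial) stops early if the server returns an error status instead of handing an error page to BeautifulSoup.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising an error on HTTP failure.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    # timeout keeps the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content
```

With this helper, `page = fetch_html(url)` could stand in for the `requests.get(url).content` call.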

Now that we have the raw content, we create a soup object from the BeautifulSoup library and parse it with an HTML parser. Soup objects let you access a tag as an attribute named after that tag, which is useful when looking for certain HTML tags. Here we access the title tag of the soup object.

In [21]:
soup = BeautifulSoup(page, 'html.parser')
soup.title

<title>Ranked: The Top 100 US Universities | Top Universities</title>
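
To see this attribute-style access without downloading anything, here is a small sketch on a made-up HTML string (the page content is an assumption for illustration). It also shows that asking for a tag that does not exist returns None rather than raising an error.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page used only for illustration
html = "<html><head><title>Example Page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title       # first <title> tag in the document
print(title_tag.name)        # the tag's name
print(title_tag.string)      # the text inside the tag
print(soup.h1)               # there is no <h1> tag, so this is None
```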

It may be helpful to actually use the inspect tool in the browser to see what tags are used to contain the elements you are hoping to scrape from a website. ![inspect](https://drive.google.com/uc?export=view&id=1_M1UCx0rfkfS7A6Fd4lwAtrMcfeAURFN)



After inspecting the website, we see that the university names are under the 'td' tag, which marks a cell in an HTML table. We also see that the names are links. Therefore we are looking for link elements that are enclosed in 'td' table cells. We can do this by using the find_all function, which searches the parsed HTML for a specific tag, in this case 'td'. It returns all 'td' elements in a list. Below we show the first 5 results.

In [22]:
td_list = soup.find_all('td')
td_list[0:5]

[<td colspan="2" width="389">
 <h2 style="text-align: center;"><strong>Top 100 US universities</strong></h2>
 </td>,
 <td width="44">
 <p><strong>Rank</strong></p>
 </td>,
 <td width="343">
 <p><strong>University</strong></p>
 </td>,
 <td width="44">
 <p>  1</p>
 </td>,
 <td width="343">
 <p><a href="https://www.topuniversities.com/universities/harvard-university">Harvard University</a></p>
 </td>]

After inspecting the results of the function, we see that the elements of the table that are university names are enclosed in the 'a' tag, which is used to identify links. We can use the text attribute of the tag to find the text within the tag. 

After we find the text of the link, aka the university name, we append it to a list of all the university names. 

In [23]:
# Gathering Top 100 University Names
uni_names = []
# For each table cell, check whether it contains a link.
for td in td_list:
  raw = td.find('a')
  # If no link is found, continue to the next cell.
  if raw is None:
    continue
  # Otherwise, add the text of the link to the list of university names.
  else:
    uni = raw.text
    uni_names.append(uni)
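
As an alternative sketch of the same idea, BeautifulSoup's select method accepts a CSS selector, so "links inside table cells" can be written as 'td a'. The table below is a made-up stand-in for the real page (the university names and URLs are placeholders).

```python
from bs4 import BeautifulSoup

# A made-up table that mimics the structure seen in the inspect tool
html = """
<table>
  <tr><td><p><strong>Rank</strong></p></td><td><p><strong>University</strong></p></td></tr>
  <tr><td><p>1</p></td><td><p><a href="https://example.com/a">Example University A</a></p></td></tr>
  <tr><td><p>2</p></td><td><p><a href="https://example.com/b">Example University B</a></p></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# 'td a' matches every <a> tag nested anywhere inside a <td> tag
names = [a.get_text(strip=True) for a in soup.select("td a")]
print(names)
```

Cells without a link, such as the rank numbers and the header row, simply produce no match, so no explicit None check is needed.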

In [24]:
# Print out the list of top 100 universities
for uni in uni_names:
  print(uni)

Harvard University
Stanford University
Massachusetts Institute of Technology (MIT)
University of California, Berkeley (UCB)
Columbia University
University of California, Los Angeles (UCLA)
Yale University
University of Pennsylvania
Princeton University
Cornell University
New York University
University of Chicago
Duke University
Johns Hopkins University
University of Southern California
Northwestern University
Carnegie Mellon University
University of Michigan
California Institute of Technology (Caltech)
Brown University
Boston University
Rice University
Georgetown University
University of Washington
University of Texas at Austin
University of California, San Diego (UCSD)
Emory University
University of California, Davis
Washington University in St. Louis
University of Rochester
Vanderbilt University
Georgia Institute of Technology
University of Illinois at Urbana-Champaign
George Washington University
Tufts University
University of Florida
Dartmouth College
University of North Carolina, 

### Cleaning Web Scraped Text

Scrolling down, we also see that the site organizes the universities by state. Suppose we want to collect the information of how many of the top 100 universities are in each state. We can attempt to collect this information as well by inspecting the site and seeing where the information we want is stored in the HTML tags. ![base_site_2](https://drive.google.com/uc?export=view&id=1ljt3V-TnaSpaK13tttUgGzxkWBGsCmUK) ![inspect_2](https://drive.google.com/uc?export=view&id=1fw7e0UJNERYG_b1vcka-KqfSBoM2p0zJ)

After inspecting the site, we see that the counts and the state names are in the 'strong' tag, which marks important text and is usually rendered in bold. We can now gather those tags through the find_all function.

In [25]:
strong_list = soup.find_all('strong')
strong_list[0:4]

[<strong>Top 100 US universities</strong>,
 <strong>Rank</strong>,
 <strong>University</strong>,
 <strong style="font-size: 14px;">
 <div class="media media-element-container media-default">
 <div class="file file-image" id="file-319658">
 <div class="content">
 <img alt="" class="media-element file-default" height="466" src="/sites/default/files/arizonastate.jpg" style="" title="" width="700"/>
 </div>
 </div>
 </div></strong>]

Then, we will loop through the list of strong tag elements and get the text attribute of each element.

In [26]:
# Gathering the counts of universities of the top 100 in each state
state_counts = []
# For each strong tag element, scrape the text, and append it to the state_counts list.
for s in strong_list:
  sc = s.text
  state_counts.append(sc)

In [27]:
# Printing the list of counts of each state
count = 0
for sc in state_counts:
  print(sc)
  count += 1
  if count == 15:
    break

Top 100 US universities
Rank
University








Alabama - 3 
Alaska – 2 
Arizona - 3
Arkansas - 1 
California - 37
Colorado - 5
Connecticut - 5
Delaware - 1
Florida - 12
Georgia - 5
Hawaii - 4


After inspecting the results, it's clear that something is off. Text isn't always clean when scraped from a website: it may contain characters that are invisible when printed, such as newlines ('\n') or other characters that Python displays with a backslash escape. A good way to see them is to use the repr function in your print statements.

In [28]:
# Printing the list of counts of each state
count = 0
for sc in state_counts:
  print(repr(sc))
  count += 1
  if count == 15:
    break

'Top 100 US universities'
'Rank'
'University'
'\n\n\n\n\n\n\n'
'Alabama - 3\xa0'
'Alaska – 2 '
'Arizona - 3'
'Arkansas - 1\xa0'
'California - 37'
'Colorado - 5'
'Connecticut - 5'
'Delaware - 1'
'Florida - 12'
'Georgia - 5'
'Hawaii - 4'
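
To find out what an unfamiliar character in the repr output actually is, one option is Python's built-in unicodedata module. The sketch below (the helper name `non_ascii_report` is an assumption for illustration) lists every non-ASCII character in a string together with its official Unicode name, using two of the strings from the output above.

```python
import unicodedata


def non_ascii_report(text):
    """Return (character, Unicode name) pairs for every non-ASCII character."""
    return [(ch, unicodedata.name(ch)) for ch in text if ord(ch) > 127]


for sample in ['Alabama - 3\xa0', 'Alaska – 2 ']:
    print(repr(sample), non_ascii_report(sample))
```

This shows that '\xa0' is a non-breaking space, and that the Alaska entry uses an en dash rather than the ordinary hyphen used by the other states.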


Upon further review, we see that there are some strings made up only of newline characters, as well as extraneous '\xa0' characters (non-breaking spaces) at the end of some strings. Cleaning text is fairly common in web scraping, so we will walk through an example of that process now.

In [29]:
# Removing extraneous escape characters and strings from the state counts list. 
state_counts_clean = []
for sc in state_counts:
  # Remove newline escape character
  newline = sc.replace('\n', '')
  # Remove \xa0 character
  tag = newline.replace('\xa0', '')
  # If after cleaning the string is empty, do not append it
  if tag == "":
    continue
  # Otherwise, append the string to a new clean counts list
  else:
    state_counts_clean.append(tag)
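
A more general sketch of the same cleaning step, under the assumption that all we care about is whitespace: calling split with no arguments breaks a string on any run of whitespace (Python counts both '\n' and '\xa0' as whitespace), and joining the pieces with single spaces normalizes the result. The helper name `normalize_whitespace` is made up for illustration.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace (including non-breaking spaces) to single spaces."""
    return " ".join(text.split())


# A few raw strings copied from the repr output above
raw = ['Rank', '\n\n\n\n\n\n\n', 'Alabama - 3\xa0', 'Alaska – 2 ']
cleaned = [normalize_whitespace(s) for s in raw]
# Drop strings that are empty after cleaning
cleaned = [s for s in cleaned if s]
print(cleaned)
```

Unlike the replace calls, this approach also removes ordinary leading and trailing spaces, such as the one after 'Alaska – 2 '.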

In [30]:
# Printing the results:
# Removing the extraneous results using list slicing.
prelim = state_counts_clean[3:-2]
count = 0
for scc in prelim:
  print(repr(scc))
  count += 1
  if count == 15:
    break

'Alabama - 3'
'Alaska – 2 '
'Arizona - 3'
'Arkansas - 1'
'California - 37'
'Colorado - 5'
'Connecticut - 5'
'Delaware - 1'
'Florida - 12'
'Georgia - 5'
'Hawaii - 4'
'Idaho - '
'2 '
'Illinois - 12'
'Indiana - 8'


After this round of cleaning, and after removing some elements of the list through slicing, we still have a couple of entries where the state and its number ended up in separate strings. How do we take care of this issue?

This format of states being assigned numbers lends itself very well to a dictionary format. Therefore we will make a dictionary, to organize the data into clean key-value pairs. The keys will be the state name and the values will be the number of schools that are in the top 100 that are in that state.

In [31]:
school_counts = {}
for scc in prelim:
  # Splitting the string into the state names and the counts.
  splitted1 = scc.split('-')
  splitted2 = scc.split('–')
  # Checking which separator actually splits the string in two, and accepting the one that does.
  if len(splitted1) == 2:
    splitted = splitted1
  elif len(splitted2) == 2:
    splitted = splitted2
  else:
    continue
  # Removing extra spaces from the string.
  p1 = splitted[0].strip()
  p2 = splitted[1].strip()
  # Hard-coding two exceptions (for example, Idaho's count landed in a separate string above).
  if p1 == 'Idaho':
    school_counts[p1] = 2
  elif p1 == 'Virginia':
    school_counts[p1] = 10
  else:
    school_counts[p1] = p2
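
As an alternative sketch that avoids hard-coding the exceptions, a regular expression can accept either dash and convert the count to an integer. It assumes that a broken entry always looks like the Idaho case above, where the state arrives with an empty count and the number follows as the very next string. The function name `parse_state_counts` and the pattern are illustrative assumptions, not the tutorial's method.

```python
import re

# State name, then a hyphen or an en dash, then an optional number
PAIR = re.compile(r'^(.+?)\s*[-–]\s*(\d*)$')


def parse_state_counts(entries):
    """Build a {state: count} dictionary from strings like 'Alabama - 3'."""
    counts = {}
    pending = None  # state whose number is expected in the next string
    for entry in entries:
        text = entry.strip()
        if pending is not None and text.isdigit():
            # This string is the missing number for the previous state
            counts[pending] = int(text)
            pending = None
            continue
        pending = None
        match = PAIR.match(text)
        if match is None:
            continue  # not a "state - count" entry, skip it
        state, number = match.group(1), match.group(2)
        if number:
            counts[state] = int(number)
        else:
            pending = state
    return counts


sample = ['Hawaii - 4', 'Idaho - ', '2 ', 'Illinois - 12', 'Alaska – 2 ']
print(parse_state_counts(sample))
```

Storing every count as an integer also keeps the dictionary values a single type, which makes later sorting or summing simpler.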

In [32]:
for key in school_counts:
  print("{}:({})".format(key, school_counts[key]))

Alabama:(3)
Alaska:(2)
Arizona:(3)
Arkansas:(1)
California:(37)
Colorado:(5)
Connecticut:(5)
Delaware:(1)
Florida:(12)
Georgia:(5)
Hawaii:(4)
Idaho:(2)
Illinois:(12)
Indiana:(8)
Iowa:(3)
Kansas:(2)
Kentucky:(2)
Louisiana:(2)
Maine:(0)
Maryland:(3)
Massachusetts:(16)
Michigan:(7)
Minnesota:(2)
Mississippi:(2)
Missouri:(6)
Montana:(2)
Nebraska:(2)
Nevada:(2)
New Hampshire:(2)
New Jersey:(12)
New Mexico:(2)
New York:(36)
North Carolina:(8)
North Dakota:(0)
Ohio:(5)
Oklahoma:(4)
Oregon:(4)
Pennsylvania:(15)
Rhode Island:(3)
South Carolina:(2)
South Dakota:(1)
Tennessee:(3)
Texas:(18)
Utah:(2)
Vermont:(1)
Virginia:(10)
Washington:(6)
Washington D.C.:(6)
West Virginia:(1)
Wisconsin:(4)
Wyoming:(1)
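
Once the data is in a dictionary it is easy to work with. As a small sketch, the helper below (the name `rank_states` is an assumption for illustration) sorts states from most to fewest schools. Because the dictionary built above holds most counts as strings and the two exceptions as integers, each value is converted with int first. The sample uses a few values from the output above.

```python
def rank_states(counts):
    """Return (state, count) pairs sorted from most to fewest schools."""
    # Convert every value to int so that strings and ints compare correctly
    as_ints = {state: int(value) for state, value in counts.items()}
    return sorted(as_ints.items(), key=lambda pair: pair[1], reverse=True)


sample = {'California': '37', 'Idaho': 2, 'New York': '36', 'Texas': '18'}
for state, count in rank_states(sample):
    print("{}:({})".format(state, count))
```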


And there you have it! You have now learned how to scrape a webpage using BeautifulSoup, and you have some ideas on how to clean the data that you are able to gather.

### Web Scraping Use Cases

Web scraping is very useful for research and information gathering purposes. Copying and pasting over information from websites can be a long and difficult task, especially if there is a lot of data to be collected. It helps to automate that process and make it simple and easy to collect information from websites that you may require.

In research projects, web scraping should be used when you need a lot of information from an online database or table that is too large, or too awkwardly formatted, to copy directly.