# Introduction to Web Scraping using Beautiful Soup

Suppose you want to extract information from websites for further analysis, for example, collecting headlines from a particular news agency for sentiment analysis. You will need tools for the job. Web scraping, or web data extraction, is the process of extracting information from websites. Many tools can be used to scrape data, such as Beautiful Soup, Scrapy, urllib3, and Selenium. In this tutorial we will use Beautiful Soup to extract the data.

### Beautiful Soup

Beautiful Soup is a Python library used to parse HTML and XML documents. It represents a document as a tree of objects that you can navigate and search, and it automatically converts incoming documents to Unicode, which makes web data easier to handle.
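
As a minimal sketch of that tree structure, the snippet below parses a small hand-written HTML string (an assumption for illustration, not taken from a real site) and moves from a tag to its parent and to its child elements.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration
html = (
    "<html><body><h1>Courses</h1>"
    "<ul><li>Data Mining</li><li>Machine Learning</li></ul>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached by name, walking down the tree
heading = soup.body.h1
print(heading.name)         # h1
print(heading.parent.name)  # body

# Collect the text of every <li> under the <ul>
items = [li.get_text() for li in soup.ul.find_all("li")]
print(items)                # ['Data Mining', 'Machine Learning']
```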

### Install Beautiful Soup

In [None]:
!pip install beautifulsoup4


### Sample HTML Tree

Below is a sample HTML tree (truncated) for the page https://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html (a publicly hosted page from Purdue University, used here to generate the tree).

![Sample HTML](image-20230119-090623.png)

Do you also want to generate a tree for a web page?

Install the HTML Tree Generator extension for Chrome, then click the extension's icon to generate the tree for a particular web page.

### Get Complete HTML Data

In [None]:
from bs4 import BeautifulSoup
import requests

website = requests.get('https://www.colorado.edu/program/data-science/faculty')
print(website)
print(website.text)

<Response [200]>
<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!--><html  lang="en" dir="ltr"
  xmlns:og="http://ogp.me/ns#"><!--<![endif]-->

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="https://www.colorado.edu/program/data-science/profiles/express/themes/cumodern/favicon.ico" type="image/vnd.microsoft.icon" />
<link href="https://www.colorado.edu/program/data-science/feed/rss.xml" rel="alternate" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<link rel="apple-touch-icon" sizes="57x57" href="https://www.colorado.edu/program/data-science/profiles/express/themes/ucb/apple-icon-57x57.png" /

We now have the complete HTML of the CU Boulder Data Science faculty page.

print(website) prints the status of the response, and print(website.text) prints the HTML data.

### Title of the Page

In [None]:
soup = BeautifulSoup(website.text, 'html.parser')
print(soup.title)

<title>Our People | Master of Science in Data Science | University of Colorado Boulder</title>
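
Printing soup.title shows the whole tag, including the markup. If only the text is needed, a small sketch like the following may help. It uses an inline HTML string that mirrors the title above, so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Inline HTML that mirrors the <title> tag printed above
html = (
    "<html><head><title>Our People | Master of Science in Data Science | "
    "University of Colorado Boulder</title></head><body></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) returns the text without the surrounding tag
title_text = soup.title.get_text(strip=True)
print(title_text)
```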


### Get all URLs

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

#main
http://www.colorado.edu
https://www.colorado.edu/search
https://calendar.colorado.edu
https://www.colorado.edu/map
/program/data-science/
/program/data-science/
/program/data-science/big-opportunities-in-big-data-Landing-Page
/program/data-science/campus-overview
/program/data-science/coursera-overview
/program/data-science/faculty
/program/data-science/about
/program/data-science/student-jobs
/program/data-science/employers
/program/data-science/graduation-information
/program/data-science/news
/program/data-science/
/program/data-science/big-opportunities-in-big-data-Landing-Page
/program/data-science/campus-overview
/program/data-science/coursera-overview
/program/data-science/faculty
/program/data-science/about
/program/data-science/student-jobs
/program/data-science/employers
/program/data-science/graduation-information
/program/data-science/news
None
None
/program/data-science/jane-wall
/program/data-science/jane-wall
None
/program/data-science/bobby-schnabel
/program/data-
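
The output above contains None entries (anchors without an href attribute), a fragment link (#main), and relative paths. One possible way to clean this up is sketched below. The HTML snippet is a hand-written assumption modelled on that output, and base_url is the page address used earlier.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.colorado.edu/program/data-science/faculty"

# Hypothetical snippet modelled on the kinds of links printed above
html = '''
<a href="#main">Skip to content</a>
<a href="/program/data-science/news">News</a>
<a name="jane+wall"></a>
<a href="https://calendar.colorado.edu">Calendar</a>
'''
soup = BeautifulSoup(html, "html.parser")

links = []
# href=True keeps only <a> tags that actually carry an href attribute
for a in soup.find_all("a", href=True):
    href = a["href"]
    if href.startswith("#"):
        continue  # skip in-page fragment links
    # urljoin turns relative paths into absolute URLs and leaves absolute ones alone
    links.append(urljoin(base_url, href))

print(links)
```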

### Objects of BeautifulSoup

So what, technically, is BeautifulSoup? We know that it is used to scrape data, and we have already extracted some. But how does it work internally? The HTML is passed to the BeautifulSoup constructor, which converts it into a tree of Python objects. There are four kinds of objects:

- BeautifulSoup
- Tag
- NavigableString
- Comment

### BeautifulSoup

In [None]:
from bs4 import BeautifulSoup

soup=BeautifulSoup(website.text,'html.parser')

print(type(soup))

<class 'bs4.BeautifulSoup'>


The BeautifulSoup object represents the complete document that we are trying to scrape.

### Tag

In [None]:

# Import Beautiful Soup
from bs4 import BeautifulSoup
   
# Create dummy HTML page
soup = BeautifulSoup('''
    <html>
        <b>Welcome to Data Mining</b>
    </html>
    ''', "html.parser")
   
# Get the tag
tag = soup.b
   
# Print the output
print(type(tag))

<class 'bs4.element.Tag'>


A Tag object corresponds to an HTML or XML tag in the original document.

A tag has two main features:

- Name: accessed with the .name attribute. It returns the name of the tag, such as b or p.
- Attributes: a tag can have various attributes such as “class”, “href”, “id”, etc., which are accessed like dictionary keys.

In [None]:
# Import Beautiful Soup
from bs4 import BeautifulSoup
   
soup = BeautifulSoup('''
    <html>
        <p class="course">Data Mining</p>
    </html>
    ''', "html.parser")
   
# Get the tag
tag = soup.p
 
print(tag["class"])
 
# modifying class
tag["class"] = "lecture"
print(tag)
 

['course']
<p class="lecture">Data Mining</p>
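
The name and attribute features described above can also be inspected directly. The sketch below uses a hand-written anchor tag (an assumption for illustration) and shows .name, the .attrs dictionary, and .get(), which returns None for a missing attribute instead of raising a KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical tag, used only for illustration
soup = BeautifulSoup(
    '<a class="profile" href="/program/data-science/jane-wall" id="jw">Jane Wall</a>',
    "html.parser",
)
tag = soup.a

print(tag.name)          # a
print(tag.attrs)         # all attributes as a dictionary
print(tag["id"])         # jw
print(tag.get("title"))  # None, because the tag has no title attribute
```

Note that "class" comes back as a list, because an HTML element can carry several classes.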


### NavigableString Object

In [None]:
from bs4 import BeautifulSoup

soup=BeautifulSoup('<div>Welcome to Data Mining</div>','html.parser')

print(type(soup.text))

<class 'str'>


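The text within a tag is stored as a NavigableString. Note that soup.text, used above, returns a plain Python str; accessing soup.div.string returns the NavigableString object itself.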
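
A short sketch of the difference, reusing the same one-line HTML as the cell above:

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup('<div>Welcome to Data Mining</div>', 'html.parser')

# .string returns the text node itself rather than a plain str
text_node = soup.div.string
print(type(text_node))             # <class 'bs4.element.NavigableString'>

# NavigableString subclasses str, so the usual string operations still work
print(isinstance(text_node, str))  # True
print(text_node.upper())
```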

### Comments

In [None]:
from bs4 import BeautifulSoup

soup=BeautifulSoup('<div><!-- This is the comments section --></div>','html.parser')

print(soup.div)

<div><!-- This is the comments section --></div>


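As the name suggests, a Comment object holds a comment from the HTML document. It is a special type of NavigableString. Printing soup.div, as above, shows the comment inside its enclosing tag.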
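
To see the Comment object itself rather than the enclosing tag, something like the following sketch can be used. It reuses the same HTML string; the lambda passed to find_all is just one way to collect every comment in a document.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup('<div><!-- This is the comments section --></div>', 'html.parser')

# The only child of <div> is the comment, so .string returns it
comment = soup.div.string
print(type(comment))
print(comment.strip())

# Collect every comment in the document by filtering text nodes on their type
all_comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(len(all_comments))
```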

### Search in a parse tree - find() method

Use find() when you know that only one element matches, for example a single element with a particular class ‘x’. It returns the first matching tag.

In [1]:
from bs4 import BeautifulSoup

soup=BeautifulSoup('<div class="dm">data mining</div><div class="dc">data center</div><div class="ml">machine learning</div>','html.parser')

data = soup.find("div",{"class":"dm"}).text

print(data)

data mining


In [2]:
from bs4 import BeautifulSoup

soup=BeautifulSoup('<div class="dm">data mining</div><div class="dm">data center</div><div class="ml">machine learning</div>','html.parser')

data = soup.find("div",{"class":"dm"}).text

print(data)

data mining


Observe that in the 2nd snippet two elements have the class dm, and find() returns only the first one.
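
One detail worth keeping in mind: when nothing matches, find() returns None, so calling .text on the result raises an AttributeError. A minimal sketch of a guard, using a hand-written snippet that deliberately lacks the class being searched for:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="dm">data mining</div><div class="ml">machine learning</div>',
    'html.parser',
)

# There is no element with class "dc" in this document
match = soup.find("div", {"class": "dc"})
print(match)  # None

# Guard against None before reading the text
data = match.text if match is not None else None

# A class that does exist still works as before
found = soup.find("div", {"class": "dm"}).text
print(found)  # data mining
```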

### Search in a parse tree - find_all() method

The find_all() method returns a list of all elements that match a given tag name and attributes. Unlike find(), it lets you reach any matching element, not just the first one.

In [9]:
from bs4 import BeautifulSoup

soup=BeautifulSoup('<div class="dm">data mining</div><div class="dm">data center</div><div class="ml">machine learning</div>','html.parser')

data = soup.find_all("div",{"class":"dm"})
print(data)
print(data[1].text)

[<div class="dm">data mining</div>, <div class="dm">data center</div>]
data center
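
As a small variation on the cell above, the sketch below uses the class_ keyword argument instead of a dictionary, loops over every match, and uses limit to cap the number of results.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="dm">data mining</div>'
    '<div class="dm">data center</div>'
    '<div class="ml">machine learning</div>',
    'html.parser',
)

# class_ (with a trailing underscore) avoids the Python keyword "class"
texts = [div.get_text() for div in soup.find_all("div", class_="dm")]
print(texts)            # ['data mining', 'data center']

# limit stops the search after the given number of matches
first_only = soup.find_all("div", class_="dm", limit=1)
print(len(first_only))  # 1
```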


## Sample program to extract data from CU Boulder Data Science faculty website

Let's perform data extraction on our faculty website and prepare a data frame with names of the faculty and their positions.

In [11]:
# Get the HTML page
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.colorado.edu/program/data-science/faculty")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE html>

<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!--><html dir="ltr" lang="en" xmlns:og="http://ogp.me/ns#"><!--<![endif]-->
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="https://www.colorado.edu/program/data-science/profiles/express/themes/cumodern/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<link href="https://www.colorado.edu/program/data-science/feed/rss.xml" rel="alternate"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="https://www.colorado.edu/program/data-science/profiles/express/themes/ucb/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
<link href="https://www.c

In [12]:
# Use prettify to view the DOM structure
print(soup.prettify())

                  <div class="person-job-titles-grid">
                   Faculty Director
                  </div>
                  <div class="person-departments-grid">
                   Data Science
                  </div>
                 </div>
                </div>
               </div>
              </div>
              <h2 class="people-list-group-title people-list-grid-group-title">
               Interdisciplinary Leadership
              </h2>
              <div class="people-list-wrapper-grid">
               <a id="bobby+schnabel" name="bobby+schnabel">
               </a>
               <div class="person-view-mode-grid grid-person clearfix col-lg-4 col-md-4 col-sm-6 col-xs-12">
                <a href="/program/data-science/bobby-schnabel">
                 <img alt="Bobby Schnabel" class="image-large_square_thumbnail" height="600" src="https://www.colorado.edu/program/data-science/sites/default/files/styles/large_square_thumbnail/public/people/bobby_schnabel.png?ito

Now let's try to find the faculty roles, such as Leadership, Management, etc.

In [14]:
soup.find_all(class_="people-list-group-title people-list-grid-group-title")

[<h2 class="people-list-group-title people-list-grid-group-title">Program Leadership</h2>,
 <h2 class="people-list-group-title people-list-grid-group-title">Interdisciplinary Leadership</h2>,
 <h2 class="people-list-group-title people-list-grid-group-title">Program Management</h2>,
 <h2 class="people-list-group-title people-list-grid-group-title">Faculty</h2>,
 <h2 class="people-list-group-title people-list-grid-group-title">Lecturer</h2>,
 <h2 class="people-list-group-title people-list-grid-group-title">Student Assistant</h2>,
 <h2 class="people-list-group-title people-list-grid-group-title">Lead Course Facilitator</h2>,
 <h2 class="people-list-group-title people-list-grid-group-title">Course Facilitator</h2>]

Wondering how I figured out the class? The structure is hard to read from printed output. Instead, you can right-click on the web page and choose 'Inspect element' to open the browser's developer tools, where the structure is much easier to view.
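
When a class attribute holds several class names, as in the search above, passing the full string to class_ matches only elements whose class attribute is exactly that string, in that order. The sketch below illustrates this with two hand-written headings (the second one, with the classes reversed, is hypothetical) and shows a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Two headings with the same classes in different order (second one is hypothetical)
html = '''
<h2 class="people-list-group-title people-list-grid-group-title">Faculty</h2>
<h2 class="people-list-grid-group-title people-list-group-title">Lecturer</h2>
'''
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: order matters
exact = soup.find_all(class_="people-list-group-title people-list-grid-group-title")
print(len(exact))

# Matching a single class name finds both headings
single = soup.find_all(class_="people-list-group-title")
print(len(single))

# A CSS selector requires both classes but ignores their order
by_css = soup.select("h2.people-list-group-title.people-list-grid-group-title")
print(len(by_css))
```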

In [30]:
contents = soup.find(id="content")

In [32]:
# Get the tags that hold the job titles
job_tags = contents.select(".person-job-titles-grid")
jobs = [pt.get_text() for pt in job_tags]

In [33]:
# Get the tags that hold the names
names = contents.select('strong')
names = [nm.get_text() for nm in names]

In [35]:
# Create a dataframe
import pandas as pd
faculty = pd.DataFrame({
    "Names": names,
    "Position": jobs,
})

In [36]:
faculty

|   | Names | Position |
|---|---|---|
| 0 | Jane Wall | Faculty Director |
| 1 | Bobby Schnabel | Department External Chair • Professor |
| 2 | Brian Zaharatos | Director of the Professional Master’s Degree •... |
| 3 | Nick Dokkin | Residential Graduate Advisor |
| 4 | Josh Kawinski | Program and Marketing Coordinator |
| 5 | Mika Puseman | Coursera Course Coordinator |
| 6 | Kaitlyn Rye | Coursera Graduate Program Advisor |
| 7 | Jem Corcoran | Associate Professor |
| 8 | Anne Dougherty | Senior Instructor & University of Colorado Tea... |
| 9 | Ioana Fleming | Senior Instructor • Chair of Undergraduate Edu... |
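
Building the names and positions as two independent lists works only if every person has exactly one name and one job title; otherwise the lists fall out of step or differ in length. A possible alternative, sketched below, extracts both fields from each person card. The markup is a simplified, hypothetical stand-in that borrows class names from the prettify() output, and the people in it are placeholders.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup; the second card deliberately has no job title
html = '''
<div id="content">
  <div class="person-view-mode-grid">
    <strong>Person A</strong>
    <div class="person-job-titles-grid">
      Faculty Director
    </div>
  </div>
  <div class="person-view-mode-grid">
    <strong>Person B</strong>
  </div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.select("#content .person-view-mode-grid"):
    name_tag = card.find("strong")
    job_tag = card.select_one(".person-job-titles-grid")
    rows.append({
        # strip=True removes the whitespace that surrounds the text in the page source
        "Names": name_tag.get_text(strip=True) if name_tag else None,
        "Position": job_tag.get_text(strip=True) if job_tag else None,
    })

print(rows)
```

The resulting list of dictionaries can be passed straight to pd.DataFrame(rows), and a missing job title shows up as an empty cell rather than shifting the rows that follow.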


Congratulations! You have successfully completed a tutorial on Web scraping! 

This tutorial was prepared by Ajay Sadananda.