There are mainly two ways to extract data from a website:

1. Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API which allows retrieval of data posted on Facebook.

2. Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping or web harvesting or web data extraction.

This session walks through the steps involved in web scraping, using Beautiful Soup, a Python library for parsing HTML. Steps involved in web scraping:

# Step 1: Installing the required third-party libraries

The examples in this file use the requests, bs4 (Beautiful Soup) and html5lib libraries. After installation, let's first look at a few basic examples in files 9I7 and 9I8 from our list of files related to BeautifulSoup.

Then we will come back to this file.
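A quick way to confirm the installation is to ask Python whether each library can be found. This is only a sketch: the helper name missing_packages is made up for illustration, and the list assumes the import names used in this file (requests, bs4, html5lib). On PyPI the corresponding packages are requests, beautifulsoup4 and html5lib.

```python
import importlib.util

# Import names used later in this file; "bs4" is the import name of Beautiful Soup.
REQUIRED = ["requests", "bs4", "html5lib"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

If anything is reported as not installed, install it with pip before running the cells below.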

# Step 2: Accessing the HTML content from webpage 

In [1]:
import requests
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
print(r.content)

b'<!DOCTYPE html>\n<html class="no-js" dir="ltr" lang="en-US">\n<head>\n  <meta charset="utf-8">\n  <meta http-equiv="content-type" content="text/html; charset=utf-8" />\n  <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n  <meta name="viewport" content="width=device-width,initial-scale=1.0" />\n  <title>Inspirational Quotes - Motivational Quotes - | The Foundation for a Better Life</title>\n<meta name="description" content="Find the perfect quotation from our hand-picked collection of inspiring quotes by hundreds of authors.">\n<meta name="keywords" content="pass, it, on, passiton, values, kindness">\n<meta name="twitter:site_name" content="The Foundation for a Better Life">\n<meta name="twitter:site" content="@passiton_values">\n<meta name="twitter:card" content="summary">\n<meta name="twitter:description" content="Thank you for visiting.">\n<meta name="twitter:image" content="https://www.passiton.com/passiton_fbl.jpg">\n<meta property="og:url" content="https://www.passiton.c

Let us try to understand this piece of code.

First of all import the requests library.

Then, specify the URL of the webpage you want to scrape.

Send an HTTP request to the specified URL and save the server's response in a response object called r.

Finally, print r.content to get the raw HTML content of the webpage. It is of bytes type, which is why the output above starts with b'. (r.text gives the same content decoded to a string.)
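The difference between raw bytes and decoded text can be seen without any network access. The snippet below is a small sketch in which raw is a shortened, hand-written stand-in for r.content, and UTF-8 is assumed as the encoding because the page declares it in its meta tag.

```python
# A short stand-in for r.content; real responses are much longer.
raw = b'<!DOCTYPE html>\n<html lang="en-US">\n<head>\n  <meta charset="utf-8">\n</head>\n</html>'

print(type(raw))              # <class 'bytes'>

# Decoding turns the bytes into a regular Python string.
text = raw.decode("utf-8")
print(type(text))             # <class 'str'>
print(text.splitlines()[0])   # <!DOCTYPE html>
```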

# Step 3: Parsing the HTML content 

In [5]:
#This will not run on online IDE
import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib') # If this line causes an error, run 'pip install html5lib' or install html5lib
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" dir="ltr" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1.0" name="viewport"/>
  <title>
   Inspirational Quotes - Motivational Quotes - | The Foundation for a Better Life
  </title>
  <meta content="Find the perfect quotation from our hand-picked collection of inspiring quotes by hundreds of authors." name="description"/>
  <meta content="pass, it, on, passiton, values, kindness" name="keywords"/>
  <meta content="The Foundation for a Better Life" name="twitter:site_name"/>
  <meta content="@passiton_values" name="twitter:site"/>
  <meta content="summary" name="twitter:card"/>
  <meta content="Thank you for visiting." name="twitter:description"/>
  <meta content="https://www.passiton.com/passiton_fbl.jpg" name="twitter:image"/>
  <meta content="https://www.passiton.com/ins

A really nice thing about the BeautifulSoup library is that it is built on top of HTML parsing libraries such as html5lib, lxml and html.parser. So the BeautifulSoup object can be created and the parser library specified at the same time. In the example above, BeautifulSoup(r.content, 'html5lib') takes two arguments: r.content, the raw HTML content, and 'html5lib', the parser we want to use.

soup.prettify(), which we printed above, gives a visual representation of the parse tree created from the raw HTML content.
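To see the parser argument in isolation, here is a minimal sketch that parses a hand-written HTML string instead of a downloaded page. It uses 'html.parser', which ships with Python, so it should run even where html5lib is not installed. The HTML string is invented for illustration.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Quotes</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser; 'html.parser' needs no extra install.
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)   # Quotes
print(soup.prettify())     # the same document, one tag per line and indented
```

Swapping "html.parser" for "html5lib" or "lxml" changes only which library builds the tree. The rest of the code stays the same.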

# Step 4: Searching and navigating through the parse tree

Now, we would like to extract some useful data from the HTML content. The soup object contains all the data in a nested structure that can be extracted programmatically. In our example, we are scraping a webpage consisting of some quotes. So, we would like to create a program to save those quotes (and all relevant information about them).

In [2]:
#Python program to scrape website
#and save quotes from website
import requests
from bs4 import BeautifulSoup
import csv

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

quotes=[] # a list to store quotes

table = soup.find('div', attrs = {'id':'all_quotes'})

for row in table.findAll('div', attrs = {'class':'col-6 col-lg-4 text-center margin-30px-bottom sm-margin-30px-top'}):
    quote = {}
    #quote['theme'] = row.h5['class']
    #quote['url'] = row.a['href']
    quote['lines'] = row.img['alt'].split(" #")[0]
    #quote['author'] = row.img['alt'].split(" #")[1]
    quotes.append(quote)
    
filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    #w = csv.DictWriter(f,['theme','url','lines'])
    w = csv.DictWriter(f,['lines'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)

Before moving on, we recommend going through the HTML content of the webpage, which we printed using the soup.prettify() method, and trying to find a pattern or a way to navigate to the quotes.

Notice that all the quotes are inside a div container whose id is ‘all_quotes’. So, we find that div element (called table in the code above) using the find() method: table = soup.find('div', attrs = {'id':'all_quotes'})

The first argument is the HTML tag you want to search for, and the second argument is a dictionary that specifies additional attributes associated with that tag. The find() method returns the first matching element. You can print table to get a sense of what this piece of code does.
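The behaviour of find() described here can be checked on a small hand-written fragment. This is a sketch only: the HTML below is invented, and the class name quote and the id no_such_id are placeholders rather than anything taken from the real page.

```python
from bs4 import BeautifulSoup

html = """
<div class="row" id="all_quotes">
  <div class="quote">first</div>
  <div class="quote">second</div>
</div>
<div id="footer">footer text</div>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("div", attrs={"id": "all_quotes"})
first = soup.find("div", attrs={"class": "quote"})       # only the first match
missing = soup.find("div", attrs={"id": "no_such_id"})   # None when nothing matches

print(table["id"])        # all_quotes
print(first.get_text())   # first
print(missing)            # None
```

Because find() returns None when nothing matches, it can be worth checking the result before calling methods on it.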

In [16]:
table

<div class="row" id="all_quotes">

          <div class="col-6 col-lg-4 text-center margin-30px-bottom sm-margin-30px-top">

            <a href="/inspirational-quotes/6651-for-its-not-light-that-is-needed-but-fire-its"><img alt="For it's not light that is needed, but fire; it's not the gentle shower, but thunder. We need the storm, the whirlwind and the earthquake in our hearts. #&lt;Author:0x00005616c3b5c658&gt;" class="margin-10px-bottom shadow" height="310" src="https://assets.passiton.com/quotes/quote_artwork/6651/medium/20230419_wednesday_quote.jpg" width="310"/></a>
            <h5 class="value_on_red"><a href="/inspirational-quotes/6651-for-its-not-light-that-is-needed-but-fire-its">INSPIRATION</a></h5>

          </div>


          <div class="col-6 col-lg-4 text-center margin-30px-bottom sm-margin-30px-top">

            <a href="/inspirational-quotes/7997-we-must-not-allow-the-clock-and-the-calendar-to"><img alt="We must not allow the clock and the calendar to blind us to th

Now, in the table element, one can notice that each quote is inside a div container whose class is 'col-6 col-lg-4 text-center margin-30px-bottom sm-margin-30px-top'. So, we iterate through each div container with that class. Here, we use the findAll() method (an older alias of find_all()), which takes the same arguments as find() but returns a list of all matching elements. Each quote is then handled through the loop variable row, and the text of the quote is taken from the alt attribute of its img tag.
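The loop can be tried on a hand-written fragment that mimics the structure of the page. Everything in the HTML below (quotes, authors, URLs) is made up for illustration. The sketch also matches on the single class col-6 through the class_ keyword, which is an alternative to passing the full class string and is less sensitive to small changes in the page's styling classes.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the structure of the page; not real site data.
html = """
<div class="row" id="all_quotes">
  <div class="col-6 col-lg-4 text-center">
    <a href="/inspirational-quotes/1-example"><img alt="First example quote. #Author One"/></a>
  </div>
  <div class="col-6 col-lg-4 text-center">
    <a href="/inspirational-quotes/2-example"><img alt="Second example quote. #Author Two"/></a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", attrs={"id": "all_quotes"})

quotes = []
# class_ with one class name matches any element that carries that class
for row in table.find_all("div", class_="col-6"):
    quotes.append({
        "url": row.a["href"],                      # link of the first <a> in the row
        "lines": row.img["alt"].split(" #")[0],    # text before the " #" separator
    })

print(quotes)
```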

In [14]:
quote['lines'][0:100] #Last quote fetched and stored in the file

"If it wasn't hard, everyone would do it. It's the hard that makes it great."

Finally, consider the last part of the program, starting at with open(filename, 'w', newline='') as f.

Here we create a CSV file called inspirational_quotes.csv and save all the quotes in it for further use.
So, this was a simple example of how to create a web scraper in Python.
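The CSV step can be tried on its own, without scraping anything. In the sketch below an in-memory io.StringIO buffer stands in for the opened file, and the two quotes are placeholder data. It also reads the rows back with csv.DictReader to confirm that the data survives the round trip.

```python
import csv
import io

quotes = [
    {"lines": "First example quote."},
    {"lines": "Second example quote, with a comma."},
]

buffer = io.StringIO()   # stands in for the file opened with open(filename, 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=["lines"])
writer.writeheader()
writer.writerows(quotes)   # writes one row per dictionary

# Read the data back to confirm the round trip
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[1]["lines"])   # Second example quote, with a comma.
```

The csv module quotes the field that contains a comma, so the quote comes back in one piece.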

# Another example: downloading an image from the web

In [3]:
# import the requests library
import requests

# URL of the image to be downloaded
image_url = "https://www.python.org/static/community_logos/python-logo-master-v3-TM.png"

# send an HTTP request to the server and save
# the HTTP response in a response object called r
r = requests.get(image_url)

# write the contents of the response (r.content)
# to a new file in binary mode, saving it as a png file
with open("python_logo.png", 'wb') as f:
    f.write(r.content)

This small piece of code downloads the Python logo image from the web. Now check your local directory (the folder where this script resides), and you will find the file python_logo.png.

All we need is the URL of the image source. (You can get the URL of the image source by right-clicking on the image and selecting the View Image option.)

# Downloading large files: a PDF example

The HTTP response content (r.content) is a single bytes object that holds the whole file in memory. For large files, loading everything at once can use too much memory. To overcome this problem, we make some changes to our program:

Instead of reading the whole file into one object, we use the r.iter_content method to load data in chunks, specifying the chunk size.

 r = requests.get(URL, stream = True)
 
Setting the stream parameter to True downloads only the response headers at first and keeps the connection open. This avoids reading the content into memory all at once for large responses. A chunk of at most the given size is loaded each time r.iter_content is iterated.

Here is an example:

In [72]:
import requests
file_url = "https://africau.edu/images/default/sample.pdf"

r = requests.get(file_url, stream = True)

with open("python.pdf","wb") as pdf:
    for chunk in r.iter_content(chunk_size=1024):

        # writing one chunk at a time to pdf file
        if chunk:
            pdf.write(chunk)
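The chunk-writing loop can also be pulled into a small helper and tried without a network connection. This is a sketch under stated assumptions: write_chunks is a hypothetical helper name, and the list fake_chunks stands in for what r.iter_content(chunk_size=...) would yield from a streamed response.

```python
import os
import tempfile


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:               # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


# Stand-in for r.iter_content(chunk_size=4): three chunks, one of them empty
fake_chunks = [b"%PDF", b"", b"-1.4"]
path = os.path.join(tempfile.gettempdir(), "chunk_demo.bin")
print(write_chunks(fake_chunks, path))   # 8
```

With a real download, the first argument would be r.iter_content(chunk_size=1024) from a response requested with stream=True.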


Advantages of using the Requests library to download web files:

1. One can download whole web directories by iterating recursively through the website.

2. The method is browser-independent and is typically much faster than downloading files by hand.

3. One can scrape a web page to get all the file URLs on it and then download all the files in a single run.
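Collecting the file URLs from a page could look like the sketch below. It is only an illustration: find_file_links is a hypothetical helper, the HTML and the example.com base URL are made up, and a file is recognised simply by the ending of its link, which will not catch every case on real sites.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_file_links(html, base_url, extension=".pdf"):
    """Return absolute URLs of all links in html that end with extension."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):      # only <a> tags that have an href
        href = a["href"]
        if href.lower().endswith(extension):
            links.append(urljoin(base_url, href))  # resolve relative links
    return links


html = """
<a href="/docs/report.pdf">Report</a>
<a href="notes.PDF">Notes</a>
<a href="/about.html">About</a>
<a>No href here</a>
"""
print(find_file_links(html, "https://example.com/files/"))
```

Each URL in the returned list can then be passed to the chunked download shown earlier.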

# BeautifulSoup – Scraping Paragraphs from HTML

# Explanation: 

After importing the modules urllib.request and bs4, we store the URL to be read in a variable. The urllib.request.urlopen() function sends the request to the server and opens the URL. The BeautifulSoup() function parses the HTML that comes back. The loop over find_all("p") visits every paragraph tag <p></p>, and the get_text() method collects the text between the tags.

Below is the implementation:

In [21]:
# importing modules
import urllib.request
from bs4 import BeautifulSoup

# providing url
url = "https://en.wikipedia.org/wiki/The_Times_of_India"

# opening the url for reading
html = urllib.request.urlopen(url)

# parsing the html file
htmlParse = BeautifulSoup(html, 'html.parser')

# getting all the paragraphs
for para in htmlParse.find_all("p"):
    print(para.get_text())





The Times of India, also known by its abbreviation TOI, is an Indian English-language daily newspaper and digital news media owned and managed by The Times Group. It is the third-largest newspaper in India by circulation and largest selling English-language daily in the world.[1][2][3][4][5][6] It is the oldest English-language newspaper in India, and the second-oldest Indian newspaper still in circulation, with its first edition published in 1838.[7] It is nicknamed as "The Old Lady of Bori Bunder",[8][9] and is an Indian "newspaper of record".[10][11]

Near the beginning of the 20th century, Lord Curzon, the Viceroy of India, called TOI "the leading paper in Asia".[12][13] In 1991, the BBC ranked TOI among the world's six best newspapers.[14][15]

It is owned and published by Bennett, Coleman & Co. Ltd. (B.C.C.L.), which is owned by the Sahu Jain family. In the Brand Trust Report India study 2019, TOI  was rated as the most trusted English newspaper in India.[16] Reuters rated TO

In [25]:
# same task as above, using requests instead of urllib
import requests
from bs4 import BeautifulSoup

# fetch the HTML of a page and return it as text
def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata("https://en.wikipedia.org/wiki/The_Times_of_India")

soup = BeautifulSoup(htmldata, 'html.parser')

# print the text of every paragraph
for data in soup.find_all("p"):
    print(data.get_text())





The Times of India, also known by its abbreviation TOI, is an Indian English-language daily newspaper and digital news media owned and managed by The Times Group. It is the third-largest newspaper in India by circulation and largest selling English-language daily in the world.[1][2][3][4][5][6] It is the oldest English-language newspaper in India, and the second-oldest Indian newspaper still in circulation, with its first edition published in 1838.[7] It is nicknamed as "The Old Lady of Bori Bunder",[8][9] and is an Indian "newspaper of record".[10][11]

Near the beginning of the 20th century, Lord Curzon, the Viceroy of India, called TOI "the leading paper in Asia".[12][13] In 1991, the BBC ranked TOI among the world's six best newspapers.[14][15]

It is owned and published by Bennett, Coleman & Co. Ltd. (B.C.C.L.), which is owned by the Sahu Jain family. In the Brand Trust Report India study 2019, TOI  was rated as the most trusted English newspaper in India.[16] Reuters rated TO
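The printed paragraphs still contain citation markers such as [7] and may include empty paragraphs. If cleaner text is wanted, one possible approach is sketched below. The HTML is a hand-written stand-in for the downloaded page, and the regular expression assumes that markers always look like a number in square brackets.

```python
import re

from bs4 import BeautifulSoup

# Hand-written stand-in for the downloaded page
html = """
<p>  </p>
<p>The paper was first published in 1838.<sup>[7]</sup> It is widely read.<sup>[8][9]</sup></p>
<p>A second paragraph.</p>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = []
for p in soup.find_all("p"):
    text = p.get_text().strip()
    text = re.sub(r"\[\d+\]", "", text)   # drop markers like [7]
    if text:                              # skip paragraphs that are empty
        paragraphs.append(text)

print(paragraphs)
```

Storing the paragraphs in a list instead of printing them makes it easier to save or process the text later.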

# Note: Web scraping is considered illegal in many cases. It may also cause your IP address to be blocked permanently by a website.