# Web Scraping with BeautifulSoup

We first import the requests library and define the URL we want to scrape.
We then call requests.get() to send an HTTP GET request to that URL and store the server's response in the object r. Finally, we print r.content, which holds the raw HTML of the page as bytes.

In [1]:
import requests
URL = "https://www.geeksforgeeks.org/data-structures/"
r = requests.get(URL)
print(r.content)

b'<!doctype html><html lang=en-us prefix="og: http://ogp.me/ns#"><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1,maximum-scale=1"><link rel="shortcut icon" href=https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_favicon.png type=image/x-icon><meta name=theme-color content="#308D46"><meta name=image property="og:image" content="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_200x200-min.png"><meta property="og:image:type" content="image/png"><meta property="og:image:width" content="200"><meta property="og:image:height" content="200"><script defer src=https://apis.google.com/js/platform.js></script><script async src=//cdnjs.cloudflare.com/ajax/libs/require.js/2.1.14/require.min.js></script><title>Data Structures - GeeksforGeeks</title><link rel=profile href=http://gmpg.org/xfn/11><link rel=pingback href><script type=application/ld+json>\r\n    {\r\n        "@context" : "http://schema.org",\r\n        "@type" : "Organization",\r\n    
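
The request above assumes the server answers promptly and successfully. As a minimal sketch of a more defensive fetch, the helper below adds a timeout and raises on HTTP error statuses. The name fetch_html and its get parameter are hypothetical (they are not part of requests); the get parameter exists only so the function can be exercised without network access.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Fetch a page and return its raw bytes, raising on HTTP error statuses.

    fetch_html and the `get` parameter are illustrative names, not requests APIs.
    """
    # The timeout stops the call from waiting indefinitely on an unresponsive server.
    r = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    r.raise_for_status()
    return r.content
```

Calling fetch_html(URL) would then stand in for requests.get(URL).content in the cell above.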

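Next, we create a BeautifulSoup object by passing two arguments:
- r.content: the raw HTML content
- 'html5lib': the name of the HTML parser to use (the html5lib package must be installed separately)

soup.prettify() returns the parse tree as an indented string, which makes the nesting of the document easier to read.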

In [2]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-us" prefix="og: http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width,initial-scale=1,maximum-scale=1" name="viewport"/>
  <link href="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_favicon.png" rel="shortcut icon" type="image/x-icon"/>
  <meta content="#308D46" name="theme-color"/>
  <meta content="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_200x200-min.png" name="image" property="og:image"/>
  <meta content="image/png" property="og:image:type"/>
  <meta content="200" property="og:image:width"/>
  <meta content="200" property="og:image:height"/>
  <script defer="" src="https://apis.google.com/js/platform.js">
  </script>
  <script async="" src="//cdnjs.cloudflare.com/ajax/libs/require.js/2.1.14/require.min.js">
  </script>
  <title>
   Data Structures - GeeksforGeeks
  </title>
  <link href="http://gmpg.org/xfn/11" rel="profile"/>
  <link href="" rel="pingback"/>
  <script type="application/l
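
To see what the parse tree offers without fetching a page, here is a small sketch that parses an inline HTML snippet. The snippet is made up for illustration, and the sketch assumes Python's built-in 'html.parser' rather than 'html5lib' so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# A made-up document standing in for r.content.
html = b'<html><head><title>Data Structures</title></head><body><p class="intro">Hello</p></body></html>'

# 'html.parser' ships with Python; 'html5lib' would be passed the same way.
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the soup or looked up with find().
title_text = soup.title.string
intro_text = soup.find("p", attrs={"class": "intro"}).text

print(title_text)
print(intro_text)
print(soup.prettify())
```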

Now, we will scrape a webpage for inspirational quotes and extract some useful data.

In [3]:
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

quotes = [] # a list to store the extracted quotes

table = soup.find('div', attrs = {'id': 'all_quotes'})

# print(table.prettify())

for row in table.find_all('div', 
                          attrs = {'class': "col-6 col-lg-4 text-center margin-30px-bottom sm-margin-30px-top"}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.img['alt'].split(" #")[0]
    quote['author'] = row.img['alt'].split(" #")[1]
    quotes.append(quote)

print(quotes)

[{'theme': 'HARD WORK', 'url': '/inspirational-quotes/7750-a-dream-doesnt-become-reality-through-magic-it', 'img': 'https://assets.passiton.com/quotes/quote_artwork/7750/medium/20220419_tuesday_quote.jpg?1650051929', 'lines': "A dream doesn't become reality through magic; it takes sweat, determination, and hard work.", 'author': '<Author:0x000055a505e939c8>'}, {'theme': 'HARD WORK', 'url': '/inspirational-quotes/6188-far-and-away-the-best-prize-that-life-offers-is', 'img': 'https://assets.passiton.com/quotes/quote_artwork/6188/medium/20220418_monday_quote.jpg?1650051893', 'lines': 'Far and away the best prize that life offers is the chance to work hard at work worth doing. ', 'author': '<Author:0x000055a5049a01d8>'}, {'theme': 'LIVE LIFE', 'url': '/inspirational-quotes/4171-twenty-years-from-now-you-will-be-more', 'img': 'https://assets.passiton.com/quotes/quote_artwork/4171/medium/20220415_friday_quote.jpg?1649602334', 'lines': "Twenty years from now you will be more disappointed by t

Here we fetch the HTML content and build a BeautifulSoup object from it, then create an empty list to hold the quotes we collect.

Next, we find the 'div' element with id = 'all_quotes' and store it in table.
Within table, we locate each quote by its class attribute and store its properties (theme, URL, image, text, and author) in a dictionary, which we then append to the quotes list.
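
The extraction steps above can be sketched on a self-contained snippet. The markup below, including the class name quote-card, is an assumption made up to mimic the structure described, not the real page. The sketch also swaps the two split(" #") calls for str.partition, so a missing " #" separator yields an empty author instead of an IndexError, and it returns an empty list if the container is not found.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div id="all_quotes">
  <div class="quote-card">
    <h5>HARD WORK</h5>
    <a href="/inspirational-quotes/1-example"><img src="https://example.com/1.jpg" alt="First line. #Some Author"></a>
  </div>
  <div class="quote-card">
    <h5>LIVE LIFE</h5>
    <a href="/inspirational-quotes/2-example"><img src="https://example.com/2.jpg" alt="No author here"></a>
  </div>
</div>
"""


def extract_quotes(html):
    """Return one dict per quote found inside the div with id 'all_quotes'."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"id": "all_quotes"})
    if container is None:
        # find() returns None when nothing matches, e.g. after a site redesign.
        return []

    quotes = []
    for row in container.find_all("div", attrs={"class": "quote-card"}):
        # partition() splits on the first " #" and never raises if it is absent.
        lines, _, author = row.img["alt"].partition(" #")
        quotes.append({
            "theme": row.h5.text,
            "url": row.a["href"],
            "img": row.img["src"],
            "lines": lines,
            "author": author,
        })
    return quotes


quotes_sample = extract_quotes(SAMPLE_HTML)
print(quotes_sample)
```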

In [5]:
import csv

filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline = '') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)

Here, we write the collected quotes to a CSV file, with a header row naming the fields followed by one row per quote.
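
As a quick check that the file can be read back, the sketch below round-trips a quote through csv.DictWriter and csv.DictReader. It writes to an in-memory io.StringIO buffer instead of a file, and the sample row and the helper names quotes_to_csv and csv_to_quotes are made up for illustration.

```python
import csv
import io

FIELDS = ["theme", "url", "img", "lines", "author"]


def quotes_to_csv(quotes, f):
    """Write a header row and one row per quote dict to the open text file f."""
    w = csv.DictWriter(f, fieldnames=FIELDS)
    w.writeheader()
    w.writerows(quotes)


def csv_to_quotes(f):
    """Read the rows back as a list of dicts keyed by the header row."""
    return list(csv.DictReader(f))


# A made-up quote; the comma and double quotes exercise CSV quoting.
sample = [{
    "theme": "HARD WORK",
    "url": "/q/1",
    "img": "https://example.com/1.jpg",
    "lines": 'Work hard, "they" said.',
    "author": "Some Author",
}]

buffer = io.StringIO(newline="")  # in-memory stand-in for the CSV file
quotes_to_csv(sample, buffer)
buffer.seek(0)
round_tripped = csv_to_quotes(buffer)
print(round_tripped == sample)
```

With a real file, the same helpers would be called inside with open(filename, 'w', newline = '') and with open(filename, newline = '') blocks.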