# Web Scraping using BeautifulSoup 

The internet is a massive source of data, and with web scraping and Python we can access much of it!

In fact, web scraping is often the only way we can access data. There is a lot of information out there that isn’t available in convenient CSV formats or easy-to-connect APIs. And websites themselves are often valuable sources of data — consider, for example, the kinds of analysis you could do if you could download every post on Facebook.

To access those sorts of on-page datasets, we’ll have to use web scraping.


In this tutorial we’ll learn to scrape web pages with Python using BeautifulSoup and requests.

## Load webpages into Python through requests

When we scrape the web, we write code that sends a request to the server that’s hosting the page we specified. The server will return the source code — HTML, mostly — for the page (or pages) we requested.

So far, we’re essentially doing the same thing a web browser does — sending a server request with a specific URL and asking the server to return the code for that page.

But unlike a web browser, our web scraping code won’t interpret the page’s source code and display the page visually. Instead, we’ll write some custom code that filters through the page’s source code looking for specific elements we’ve specified, and extracting whatever content we’ve instructed it to extract.

For example, if we wanted to get all of the data from inside a table that was displayed on a web page, our code would be written to go through these steps in sequence:

1. Request the content (source code) of a specific URL from the server.
2. Download the content that is returned.
3. Identify the elements of the page that are part of the table we want.
4. Extract and (if necessary) reformat those elements into a dataset we can analyze or use in whatever way we require.

If that all sounds very complicated, don’t worry! Python and Beautiful Soup have built-in features designed to make this relatively straightforward.
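
The four steps can be sketched end to end on a tiny page. This is only an illustration: the HTML string, the `cities` table id, and the sample rows are made up for the example, and the hard-coded string stands in for the request and download steps so the sketch runs without a network connection.

```python
from bs4 import BeautifulSoup

# Steps 1-2 stand-in: in a real scraper this string would come from
# requests.get(url).text; a hard-coded page keeps the sketch offline.
html = """
<html><body>
  <table id="cities">
    <tr><th>City</th><th>State</th></tr>
    <tr><td>Thiruvananthapuram</td><td>Kerala</td></tr>
    <tr><td>Chennai</td><td>Tamil Nadu</td></tr>
  </table>
</body></html>
"""

# Step 3: parse the source and locate the table we want.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="cities")

# Step 4: turn each row into a list of cell strings.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# Use the first row as column names and build one dict per data row.
header, records = rows[0], rows[1:]
dataset = [dict(zip(header, record)) for record in records]
print(dataset)
```

A list of dictionaries like `dataset` can be handed to most analysis tools directly, which is why it is a convenient target format here.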

One thing that’s important to note: from a server’s perspective, requesting a page via web scraping is the same as loading it in a web browser. When we use code to submit these requests, we might be “loading” pages much faster than a regular user, and thus quickly eating up the website owner’s server resources.
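
One common way to go easier on a server is to pause between requests. The sketch below is one possible approach, not a standard recipe: `Throttle`, `fetch_all`, and `fake_fetch` are hypothetical names invented for this example, and the 0.2-second interval is an arbitrary value chosen to keep the demo fast. A stand-in fetch function is used so that nothing is actually requested.

```python
import time


class Throttle:
    """Enforce a minimum pause between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests
        self._last_call = None

    def wait(self):
        # Sleep only for the part of the interval that has not yet passed.
        if self._last_call is not None:
            elapsed = time.monotonic() - self._last_call
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()


def fetch_all(urls, fetch, throttle):
    """Call fetch(url) for each URL, pausing between calls."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results


# Stand-in for a real download. In a real scraper you might pass
# something like `lambda u: req.get(u, timeout=10)` instead.
def fake_fetch(url):
    return f"<html>{url}</html>"


start = time.monotonic()
pages = fetch_all(["page-a", "page-b", "page-c"], fake_fetch, Throttle(0.2))
elapsed = time.monotonic() - start
print(len(pages), "pages fetched")
```

For real sites, a longer interval (a second or more) is a more considerate default, and it is worth checking the site's terms and `robots.txt` before scraping.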

## Loading page source code into Python

Link: https://en.wikipedia.org/wiki/Thiruvananthapuram

In [1]:
import requests as req

In [2]:
URL = 'https://en.wikipedia.org/wiki/Thiruvananthapuram'


In [3]:
r = req.get(URL)

In [4]:
print(r.content[:1000])

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Thiruvananthapuram - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"e4b562dc-14c9-4577-9d2d-fba822bba447","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Thiruvananthapuram","wgTitle":"Thiruvananthapuram","wgCurRevisionId":1078622472,"wgRevisionId":1078622472,"wgArticleId":56142,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with dead external links","Articles with dead external links from February 2022","Harv and Sfn no-target errors","Articl
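
The `b'...'` prefix in the output above shows that `r.content` is a `bytes` object rather than a string. The short sketch below decodes a hard-coded copy of the first few bytes to show the difference. The `raw` and `text` names are just for illustration, and UTF-8 is assumed because the page declares `charset="UTF-8"`.

```python
# The first bytes of the response shown above, copied by hand.
raw = b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>'

print(type(raw).__name__)    # raw network data is bytes

# Decoding turns the bytes into a regular Python string.
text = raw.decode("utf-8")
print(type(text).__name__)
print(text.splitlines()[0])  # first line of the page source
```

With requests, `r.text` gives the already decoded string, while `r.content` gives the raw bytes. Beautiful Soup accepts either.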

In [9]:
from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise Beautiful Soup picks one and warns.
soup = BeautifulSoup(r.content, "html.parser")

In [10]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Thiruvananthapuram - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"e4b562dc-14c9-4577-9d2d-fba822bba447","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Thiruvananthapuram","wgTitle":"Thiruvananthapuram","wgCurRevisionId":1078622472,"wgRevisionId":1078622472,"wgArticleId":56142,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with dead external links","Articles with dead external links from February 2022","Harv and Sfn no-target erro

## Access elements and attributes inside HTML pages

In [11]:
import requests as req
from bs4 import BeautifulSoup
URL = 'https://en.wikipedia.org/wiki/Thiruvananthapuram'
r = req.get(URL)
soup = BeautifulSoup(r.content, "html.parser")
soup.prettify()[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Thiruvananthapuram - Wikipedia\n  </title>\n  <script>\n   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"e4b562dc-14c9-4577-9d2d-fba822bba447","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Thiruvananthapuram","wgTitle":"Thiruvananthapuram","wgCurRevisionId":1078622472,"wgRevisionId":1078622472,"wgArticleId":56142,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with dead external links","Articles with dead external links from February 2022","Harv and Sfn no-ta

In [12]:
h1_tag = soup.h1  # the first <h1> element on the page
print(h1_tag)

<h1 class="firstHeading mw-first-heading" id="firstHeading">Thiruvananthapuram</h1>


In [13]:
print(soup.title)

<title>Thiruvananthapuram - Wikipedia</title>
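
Once we have a tag, we can read its name, text, and attributes. The sketch below rebuilds the heading and title printed above inside a minimal hand-written page (the surrounding `<html>` wrapper is an assumption made for the example) so that it runs without fetching anything.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the heading and title printed above.
html = (
    "<html><head><title>Thiruvananthapuram - Wikipedia</title></head>"
    '<body><h1 class="firstHeading mw-first-heading" id="firstHeading">'
    "Thiruvananthapuram</h1></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

h1_tag = soup.h1
print(h1_tag.name)           # the tag's name
print(h1_tag.get_text())     # the text inside the tag
print(h1_tag["id"])          # a single-valued attribute, read like a dict key
print(h1_tag["class"])       # class can hold several values, so it is a list
print(h1_tag.get("style"))   # .get() returns None when the attribute is absent
print(soup.title.string)     # the text of the <title> tag
```

Indexing with `h1_tag["style"]` would raise a `KeyError` here, so `.get()` is the safer choice when an attribute may be missing.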


In [14]:
print(soup.get_text())




Thiruvananthapuram - Wikipedia

Thiruvananthapuram

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search
This article is about the city. For the district, see Thiruvananthapuram district. For the urban agglomeration area of Thiruvananthapuram, see Thiruvananthapuram metropolitan area.


Metropolis in Kerala, IndiaThiruvananthapuram
TrivandrumMetropolisClockwise, from top: View of Kulathoor, Padmanabhaswamy Temple, Niyamasabha Mandiram, East Fort, Technopark, Kanakakkunnu Palace, Thiruvananthapuram Central and Kovalam Beach

SealNickname(s): Evergreen City of IndiaGod's Own Capital[1]ThiruvananthapuramShow map of IndiaThiruvananthapuramShow map of KeralaCoordinates: 08°29′15″N 76°57′9″E﻿ / ﻿8.48750°N 76.95250°E﻿ / 8.48750; 76.95250Coordinates: 08°29′15″N 76°57′9″E﻿ / ﻿8.48750°N 76.95250°E﻿ / 8.48750; 76.95250Country IndiaState KeralaDistrictThiruvananthapuramGovernment • TypeMunicipal Corporation • BodyThiruvananthapuram M

In [15]:
for link in soup.find_all('a'):
    print(link.get('href'))

None
#mw-head
#searchInput
/wiki/Thiruvananthapuram_district
/wiki/Thiruvananthapuram_metropolitan_area
/wiki/Metropolis
/wiki/File:Trivandrum_Montage.jpg
/wiki/Padmanabhaswamy_Temple
/wiki/Niyamasabha_Mandiram
/wiki/East_Fort
/wiki/Technopark,_Trivandrum
/wiki/Kanakakkunnu_Palace
/wiki/Thiruvananthapuram_Central
/wiki/Kovalam_Beach
/wiki/File:Seal_of_Corporation_of_Thiruvananthapuram_by_dhevilal.svg
#cite_note-distcourthistory-1
/wiki/File:India_location_map.svg
/wiki/File:India_Kerala_location_map.svg
//geohack.toolforge.org/geohack.php?pagename=Thiruvananthapuram&params=08_29_15_N_76_57_9_E_type:city(957730)_region:IN
/wiki/Geographic_coordinate_system
//geohack.toolforge.org/geohack.php?pagename=Thiruvananthapuram&params=08_29_15_N_76_57_9_E_type:city(957730)_region:IN
/wiki/India
/wiki/States_and_union_territories_of_India
/wiki/Kerala
/wiki/List_of_districts_of_India
/wiki/Thiruvananthapuram_district
/wiki/Thiruvananthapuram_Corporation
/wiki/Arya_Rajendran
#cite_note-gulfnews-2


https://www.thehindu.com/news/national/kerala/vizhinjam-in-historical-perspective/article7468781.ece
#cite_ref-vizhis2_37-0
https://www.thehindu.com/news/national/kerala/shedding-light-on-vizhinjams-golden-past/article5981994.ece
#cite_ref-38
https://books.google.com/books?id=POltAAAAMAAJ
#cite_ref-kanthalloor_39-0
http://www.newindianexpress.com/cities/thiruvananthapuram/2018/apr/17/chronicles-of-kanthalloor-sala-which-got-lost-in-the-mists-of-time-1802832.html
#cite_ref-40
https://books.google.com/books?id=GpNECgAAQBAJ
/wiki/ISBN_(identifier)
/wiki/Special:BookSources/9781317321279
#cite_ref-askh_41-0
#cite_ref-askh_41-1
#cite_ref-askh_41-2
#cite_ref-askh_41-3
https://dcbookstore.com/books/a-survey-of-kerala-history
/wiki/ISBN_(identifier)
/wiki/Special:BookSources/9788126415786
#cite_ref-:302_42-0
#cite_ref-:3_43-0
https://www.britannica.com/topic/Pandya-dynasty
#cite_ref-FOOTNOTEKeay2011215_44-0
#CITEREFKeay2011
/wiki/Category:Harv_and_Sfn_template_errors
#cite_ref-majumdar407_45-0

/wiki/Napier_Museum
/wiki/Kanakakkunnu_Palace
/wiki/International_Film_Festival_of_Kerala
/wiki/Kerala_State_Chalachitra_Academy
/wiki/Culture_of_Thiruvananthapuram
/wiki/Thiruvananthapuram_(Lok_Sabha_constituency)
/wiki/Attingal_(Lok_Sabha_constituency)
/wiki/Thiruvananthapuram_Golf_Club
/wiki/Chandrasekharan_Nair_Stadium
/wiki/University_Stadium_(Thiruvananthapuram)
/wiki/Kerala_Soil_Museum
/wiki/Napier_Museum
/wiki/Template:Kerala
/wiki/Template_talk:Kerala
https://en.wikipedia.org/w/index.php?title=Template:Kerala&action=edit
/wiki/File:Flag_of_Kerala.png
/wiki/States_and_territories_of_India
/wiki/Kerala
/wiki/List_of_state_and_union_territory_capitals_in_India
None
/wiki/List_of_districts_in_Kerala
/wiki/Thiruvananthapuram_district
/wiki/Kollam_district
/wiki/Pathanamthitta_district
/wiki/Alappuzha_district
/wiki/Kottayam_district
/wiki/Idukki_district
/wiki/Ernakulam_district
/wiki/Thrissur_district
/wiki/Palakkad_district
/wiki/Malappuram_district
/wiki/Kozhikode_district
/wiki
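
The output above mixes several kinds of values: `None` for anchors without an `href`, `#...` fragments that point within the same page, site-relative paths such as `/wiki/Kerala`, protocol-relative URLs starting with `//`, and full URLs. A minimal sketch of one way to clean this up follows, using a handful of anchors copied from the output (with the geohack query string trimmed). The `BASE_URL` and `absolute_links` names are introduced just for this example.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/Thiruvananthapuram"

# A few anchors taken from the page, one for each case.
html = """
<a id="top"></a>
<a href="#mw-head">Jump to navigation</a>
<a href="/wiki/Kerala">Kerala</a>
<a href="//geohack.toolforge.org/geohack.php">Coordinates</a>
<a href="https://books.google.com/books?id=POltAAAAMAAJ">Book</a>
"""
soup = BeautifulSoup(html, "html.parser")

absolute_links = []
# href=True matches only <a> tags that actually carry an href attribute.
for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.startswith("#"):
        continue  # skip same-page fragment links
    # urljoin resolves relative and protocol-relative URLs against the page URL.
    absolute_links.append(urljoin(BASE_URL, href))

for url in absolute_links:
    print(url)
```

Resolving every link against the page URL gives addresses that can be passed straight back to `req.get`.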

In [16]:
print(soup.img)

<img alt="Clockwise, from top: View of Kulathoor, Padmanabhaswamy Temple, Niyamasabha Mandiram, East Fort, Technopark, Kanakakkunnu Palace, Thiruvananthapuram Central and Kovalam Beach" data-file-height="1257" data-file-width="747" decoding="async" height="454" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/31/Trivandrum_Montage.jpg/270px-Trivandrum_Montage.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/31/Trivandrum_Montage.jpg/405px-Trivandrum_Montage.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/31/Trivandrum_Montage.jpg/540px-Trivandrum_Montage.jpg 2x" width="270"/>


In [17]:
tables = soup.find_all("table")
print(len(tables))

12


In [18]:
print(tables[3])

<table class="nowraplinks mw-collapsible mw-collapsed navbox-inner" style="border-spacing:0;background:transparent;color:inherit"><tbody><tr><th class="navbox-title" colspan="2" scope="col"><style data-mw-deduplicate="TemplateStyles:r1063604349">.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-pars

In [19]:
print(tables[1]["style"])

width:100%; text-align:center; line-height: 1.2em; margin:auto;


In [20]:
lists = soup.find_all("li")

In [21]:
print(len(lists))

1184


In [24]:
lists[5]

<li>KL-21 <a href="/wiki/Nedumangad" title="Nedumangad">Nedumangad</a></li>

In [22]:
child = list(lists[5].children)

In [26]:
print(child)

['KL-21 ', <a href="/wiki/Nedumangad" title="Nedumangad">Nedumangad</a>]
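
As the output shows, the children of a tag can be plain text as well as nested tags. The sketch below separates the two kinds for the same list item, using a hard-coded copy of it. The variable names `code`, `place`, and `target` are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# The list item printed above, copied as a string.
html = '<li>KL-21 <a href="/wiki/Nedumangad" title="Nedumangad">Nedumangad</a></li>'
item = BeautifulSoup(html, "html.parser").li

code = place = target = None
for child in item.children:
    if isinstance(child, NavigableString):
        # Plain text sitting directly inside the <li>.
        code = str(child).strip()
    elif isinstance(child, Tag):
        # A nested element, here the <a> link.
        place = child.get_text()
        target = child["href"]

print(code, place, target)
```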


## Search for elements with given classes and attributes

In [29]:
import requests as req
from bs4 import BeautifulSoup

In [30]:
URL = 'https://en.wikipedia.org/wiki/Thiruvananthapuram'
r = req.get(URL)
soup = BeautifulSoup(r.content, "html.parser")

In [31]:
links = soup.find_all("a")

In [32]:
print(len(links))

2780


In [37]:
links[0:5]

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a href="/wiki/Thiruvananthapuram_district" title="Thiruvananthapuram district">Thiruvananthapuram district</a>,
 <a href="/wiki/Thiruvananthapuram_metropolitan_area" title="Thiruvananthapuram metropolitan area">Thiruvananthapuram metropolitan area</a>]

In [38]:
attr_filter = {"class": "mw-jump-link"}

In [39]:
print(attr_filter)

{'class': 'mw-jump-link'}


In [42]:
soup.find_all("a", attrs=attr_filter)

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>]
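
Beautiful Soup offers more than one way to express the same class filter. The sketch below compares the dictionary form used above with the `class_` keyword and a CSS selector, and adds a search on the `title` attribute. It runs on four anchors copied from the earlier `links[0:5]` output rather than on the live page.

```python
from bs4 import BeautifulSoup

html = """
<a id="top"></a>
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a class="mw-jump-link" href="#searchInput">Jump to search</a>
<a href="/wiki/Thiruvananthapuram_district" title="Thiruvananthapuram district">Thiruvananthapuram district</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Three equivalent ways to find <a> tags with a given class.
by_dict = soup.find_all("a", attrs={"class": "mw-jump-link"})
by_keyword = soup.find_all("a", class_="mw-jump-link")  # class is reserved in Python
by_css = soup.select("a.mw-jump-link")

# Any other attribute can be matched through the attrs dictionary.
by_title = soup.find_all("a", attrs={"title": "Thiruvananthapuram district"})

print([a.get_text() for a in by_keyword])
print(by_title[0]["href"])
```

Which form to use is mostly a matter of taste. CSS selectors tend to be more compact once a search combines several tags or classes.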