# Intro to web scraping

The first step in web scraping is to pick a website and download its HTML.

Real HTML from websites tends to be long and too chaotic for a total beginner, so we start with a dummy HTML document and learn the basics of extracting information with Beautiful Soup.

- You can learn about HTML here: https://www.w3schools.com/html/
- You can use Code Beautify to make your HTML cleaner and more readable: https://codebeautify.org/htmlviewer

In [100]:
import pandas as pd

In [1]:
html_doc = """ <!DOCTYPE html><html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></html>"""

In [2]:
html_doc

' <!DOCTYPE html><html><head><title>The Dormouse\'s story</title></head><body><p class="title"><b>The Dormouse\'s story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></html>'

In [3]:
from bs4 import BeautifulSoup

#### "creating the soup"

In [4]:
# parse the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

In [5]:
soup

 <!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></body></html>

In [6]:
import pprint

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


#### accessing single elements

We can access HTML tags by appending a dot `.` and the tag name to `soup`. If the document contains several instances of the tag, only the first one is returned.

In [8]:
soup.title

<title>The Dormouse's story</title>

In [9]:
soup.title.parent

<head><title>The Dormouse's story</title></head>

In [10]:
soup.html.body.p

<p class="title"><b>The Dormouse's story</b></p>

<b>Searching with the find() function</b>

In [11]:
soup.find("p").get_text()

"The Dormouse's story"

In [12]:
# this method only retrieves the first element of the specified tag
soup.p

<p class="title"><b>The Dormouse's story</b></p>
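As a small sketch of the rule above (the shortened HTML string is an assumption made for brevity, not the full dummy document), dot access and `find()` return the same first match, and both give `None` when the tag does not exist:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the dummy document (assumption, for brevity).
html = '<html><body><p class="title"><b>Title</b></p><p class="story">...</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p            # dot access: first <p> only
same_p = soup.find("p")     # find(): also the first <p> only
missing = soup.table        # there is no <table> in this document

print(first_p == same_p)    # the two lookups agree
print(missing)              # None, so check before calling .get_text() on it
```

Because a missing tag yields `None`, chaining something like `soup.table.get_text()` would fail with an `AttributeError`; checking for `None` first avoids that.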

#### finding all elements of a tag with the powerful find_all()

In [13]:
p_tags = soup.find_all("p")
p_tags

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

To get the text out of an HTML element, we can use the `get_text()` method.

In [14]:
for p in p_tags:
    print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...
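Notice that the names come out glued to the surrounding words ("wereElsie"). One possible fix, sketched here on a shortened copy of the story paragraph (the abbreviated text and the `#` hrefs are assumptions), is to pass a separator and `strip=True` to `get_text()`:

```python
from bs4 import BeautifulSoup

# Abbreviated version of the "story" paragraph (assumption, for brevity).
html = (
    '<p class="story">their names were<a href="#">Elsie</a>,'
    '<a href="#">Lacie</a> and<a href="#">Tillie</a>;and they lived in a well.</p>'
)
p = BeautifulSoup(html, "html.parser").p

glued = p.get_text()                            # text pieces are concatenated as-is
spaced = p.get_text(separator=" ", strip=True)  # each piece is stripped, then joined with a space

print(glued)
print(spaced)
```

The separator is inserted between every piece of text, so punctuation that sits in its own text node ends up surrounded by spaces; further cleaning may still be needed.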


## Return the 3 names of the sisters

In [15]:
a_tags = soup.find_all('a')

In [16]:
for a in a_tags:
    print(a.get_text())

Elsie
Lacie
Tillie
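Besides the text, we often want an attribute such as the link target. A minimal sketch, using two links from the dummy document plus one hypothetical `<a>` without an `href` (added only to show the difference between the two access styles):

```python
from bs4 import BeautifulSoup

# Two links copied from the dummy document; the third one is hypothetical.
html = (
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a class="sister" id="link4">Nobody</a>'
)
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a"):
    # a.get("href") returns None when the attribute is missing,
    # whereas a["href"] would raise a KeyError.
    links.append((a.get_text(), a.get("href")))

print(links)
```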


## Using css selectors
Another way to find content is `select()`, which takes CSS selectors.

Let's first learn the syntax of CSS selectors by playing this game: https://flukeout.github.io/

Everyone should reach level 6!

More on CSS: https://www.w3schools.com/css/css_howto.asp

In [17]:
soup.select("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [18]:
for a in soup.select('a'):
    print(a.get_text())

Elsie
Lacie
Tillie


In [19]:
soup.select('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [20]:
soup.select("p")[0]

<p class="title"><b>The Dormouse's story</b></p>

With CSS selectors, you can search directly by CSS class!

In [21]:
soup.select(".title")

[<p class="title"><b>The Dormouse's story</b></p>]

<b>Comparing with find_all()</b>

In [22]:
soup.find_all("a", class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [23]:
soup.select("a.sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

<b>You can search directly using id attributes</b>

In [24]:
soup.select("#link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
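Since an id should be unique, a list with one element is a little awkward. A short sketch of `select_one()`, which returns the element itself (or `None`), using two links from the dummy document; `#link9` is a made-up id that does not exist:

```python
from bs4 import BeautifulSoup

html = (
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

as_list = soup.select("#link2")        # a list holding one element
as_tag = soup.select_one("#link2")     # the element itself
not_found = soup.select_one("#link9")  # None: no element has this id

print(as_tag.get_text())
```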

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, works on a single element, while `select()` returns a list that may hold several elements, so it is common to iterate over the output of `select()`.

In [25]:
print(soup.select(".story"))

[<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>, <p class="story">...</p>]


In [26]:
for p in soup.select("p.story"):
    print(p.get_text())

Once upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...
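To make the point concrete, here is a sketch (on a shortened two-paragraph document, which is an assumption) of what happens when `get_text()` is called on the whole result, and how a list comprehension collects the texts instead:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the two "story" paragraphs.
html = '<p class="story">Once upon a time</p><p class="story">...</p>'
soup = BeautifulSoup(html, "html.parser")
stories = soup.select("p.story")

# Calling get_text() on the list of results fails...
try:
    stories.get_text()
    failed = False
except AttributeError:
    failed = True

# ...so we call it on each element instead.
texts = [p.get_text() for p in stories]
print(failed, texts)
```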




Exercise: using the `geography` HTML document defined below, write code to print the following contents (only the human-readable text, without the HTML tags):

1. All the "fun facts". 

2. The names of all the places. 

3. The content (name and fact) of all the cities (only cities, not countries!) 

4. The names (not facts!) of all the cities (not countries!)

In [36]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [45]:
soup = BeautifulSoup(geography, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  Geography
 </head>
 <body>
  <div class="city">
   <h2>
    London
   </h2>
   <p>
    London is the most popular tourist destination in the world.
   </p>
  </div>
  <div class="city">
   <h2>
    Paris
   </h2>
   <p>
    Paris was originally a Roman City called Lutetia.
   </p>
  </div>
  <div class="country">
   <h2>
    Spain
   </h2>
   <p>
    Spain produces 43,8% of all the world's Olive Oil.
   </p>
  </div>
 </body>
</html>



In [47]:
soup.select('p')

[<p>London is the most popular tourist destination in the world.</p>,
 <p>Paris was originally a Roman City called Lutetia.</p>,
 <p>Spain produces 43,8% of all the world's Olive Oil.</p>]

In [49]:
for h in soup.select('h2'):
    print(h.get_text())

London
Paris
Spain


In [50]:
soup.find_all("div", class_="city")

[<div class="city">
 <h2>London</h2>
 <p>London is the most popular tourist destination in the world.</p>
 </div>, <div class="city">
 <h2>Paris</h2>
 <p>Paris was originally a Roman City called Lutetia.</p>
 </div>]

In [51]:
for h in soup.find_all("div", class_="city"):
    print(h.select("h2"))
    print(h.select("p"))

[<h2>London</h2>]
[<p>London is the most popular tourist destination in the world.</p>]
[<h2>Paris</h2>]
[<p>Paris was originally a Roman City called Lutetia.</p>]


In [57]:
for p in soup.find_all("div", class_="city"):
    print(p.get_text())


London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



In [62]:
for p in soup.find_all(class_="city"):
    print(p.get_text())


London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



In [77]:
for h in soup.select("div.city h2"):
    print(h.get_text())

London
Paris
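The cells above answer each question separately. As one possible alternative (a sketch, not the only way to solve the exercise), we can walk over every `<div>` once and store one record per place; the `places` list and its keys are names chosen here for illustration:

```python
from bs4 import BeautifulSoup

geography = """
<html><body>
<div class="city"><h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p></div>
<div class="city"><h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p></div>
<div class="country"><h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p></div>
</body></html>
"""
soup = BeautifulSoup(geography, "html.parser")

places = []
for div in soup.find_all("div"):
    places.append({
        "kind": div["class"][0],               # "city" or "country"
        "name": div.h2.get_text(strip=True),   # first <h2> inside this div
        "fact": div.p.get_text(strip=True),    # first <p> inside this div
    })

city_names = [p["name"] for p in places if p["kind"] == "city"]
print(city_names)
```

Keeping name and fact together in one record means the four questions become simple filters on `places`.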


## Use case: 





In [78]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [79]:
# 2. find url and store it in a variable
url = "https://www.singtotheworld.com/100-most-popular-karaoke-songs-of-all-time"

#### using the requests package

In [80]:
# 3. download html with a get request 
response = requests.get(url)

In [81]:
response.status_code # 200 status code means OK!

200

### HTTP Response status codes 
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
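Instead of inspecting `status_code` by eye, we can let requests raise an error for 4xx and 5xx responses. A minimal sketch, where `fetch_html` is a hypothetical helper name and the 10-second timeout is an arbitrary assumption:

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx codes
    return response.content
```

The returned bytes can be passed straight to `BeautifulSoup(..., "html.parser")`, as in the next cell.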

In [82]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
soup


<!DOCTYPE html >

<html class="" id="pagehtml" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head id="scriptsDynamic"><title>
	Top 100 Most Popular Karaoke songs of all time
</title>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=5" name="viewport"/>
<meta content="en" name="language">
<meta content="index,follow" name="robots"/>
<meta content="nositelinkssearchbox,notranslate" name="google"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="ddd0c8df208a4b396dc247e4a168f61eebdbdfad0199402898bdf321a55806fa" name="ahrefs-site-verification"/>
<link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link color="#80118e" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#80118e" name="m

In [83]:
print(soup.prettify())

<!DOCTYPE html >
<html class="" id="pagehtml" lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head id="scriptsDynamic">
  <title>
   Top 100 Most Popular Karaoke songs of all time
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=5" name="viewport"/>
  <meta content="en" name="language">
   <meta content="index,follow" name="robots"/>
   <meta content="nositelinkssearchbox,notranslate" name="google"/>
   <meta content="telephone=no" name="format-detection"/>
   <meta content="ie=edge" http-equiv="x-ua-compatible"/>
   <meta content="ddd0c8df208a4b396dc247e4a168f61eebdbdfad0199402898bdf321a55806fa" name="ahrefs-site-verification"/>
   <link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
   <link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
   <link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
   <link color="#80118e" href="/safari-pinned-tab.svg" rel="mask-

#### Building the dataframe

In [88]:
list_of_songs=[]
for s in soup.select("span.size3 a"):
    list_of_songs.append(s.get_text())

### Cleaning the data

In [97]:
title=[]
artist=[]
for song in list_of_songs:
    title.append(song.split("by")[0])
    artist.append(song.split("by")[-1])

In [98]:
title

['Shotgun ',
 'Sweet Caroline ',
 'Shallow ',
 'A Million Dreams ',
 'Perfect ',
 'This Is Me ',
 'Someone You Loved ',
 'Rewrite The Stars ',
 'Angels ',
 'The Greatest Show ',
 'The Greatest Showman',
 'My Way ',
 'Is This The Way To Amarillo ',
 'Bohemian Rhapsody ',
 'Never Enough ',
 'Your Song ']

In [99]:
artist

[' George Ezra',
 ' Neil Diamond',
 ' Lady Gaga & Bradley Cooper (A Star is Born)',
 ' The Greatest Showman',
 ' Ed Sheeran',
 ' The Greatest Showman',
 ' Lewis Capaldi',
 ' The Greatest Showman',
 ' Robbie Williams',
 ' The Greatest Showman',
 'The Greatest Showman',
 ' Frank Sinatra',
 ' Tony Christie',
 ' Queen',
 ' The Greatest Showman',
 ' Elton John']
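The lists above still carry leading and trailing spaces, and an entry without an artist ends up in both lists. Splitting on the bare substring `"by"` would also cut inside words. A possible, more defensive sketch follows; `split_song` is a hypothetical helper, the first two sample strings are reconstructed from the output above, and the third is invented to show the substring problem:

```python
def split_song(entry):
    """Split 'Title by Artist' on the last ' by '; artist is None if absent."""
    title, sep, artist = entry.rpartition(" by ")
    if not sep:                       # no " by " found: keep the whole entry as title
        return entry.strip(), None
    return title.strip(), artist.strip()

samples = [
    "Shotgun by George Ezra",         # reconstructed from the output above
    "The Greatest Showman",           # entry without an artist
    "Baby Love by The Supremes",      # hypothetical: "by" also occurs inside "Baby"
]
for s in samples:
    print(split_song(s))
```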

In [101]:
df=pd.DataFrame({"title":title, "artist":artist})

In [102]:
df

| | title | artist |
|---|---|---|
| 0 | Shotgun | George Ezra |
| 1 | Sweet Caroline | Neil Diamond |
| 2 | Shallow | Lady Gaga & Bradley Cooper (A Star is Born) |
| 3 | A Million Dreams | The Greatest Showman |
| 4 | Perfect | Ed Sheeran |
| 5 | This Is Me | The Greatest Showman |
| 6 | Someone You Loved | Lewis Capaldi |
| 7 | Rewrite The Stars | The Greatest Showman |
| 8 | Angels | Robbie Williams |
| 9 | The Greatest Show | The Greatest Showman |


In [103]:
url_2="https://gutenberg.org/ebooks/search/?query=&submit_search=Search"
response_2=requests.get(url_2)

In [105]:
soup = BeautifulSoup(response_2.content, "html.parser")

In [106]:
soup

<!DOCTYPE html>
<!--

DON'T USE THIS PAGE FOR SCRAPING.

Seriously. You'll only get your IP blocked.

Download https://www.gutenberg.org/feeds/catalog.rdf.bz2 instead,
which contains *all* Project Gutenberg metadata in one RDF/XML file.

--><html lang="en">
<head>
<style>
.icon   { background: transparent url(/pics/sprite.png) 0 0 no-repeat; }
.page_content a.subtle_link:link {color:currentColor; text-decoration: none;}
.page_content a.subtle_link:hover {color:#003366}
</style>
<link href="/gutenberg/pg-desktop-one.css" rel="stylesheet" type="text/css"/>
<link href="/gutenberg/new_nav.css" rel="stylesheet" type="text/css"/>
<link href="/gutenberg/style.css" rel="stylesheet" type="text/css"/>
<script>//
var canonical_url   = "http://www.gutenberg.org/ebooks/search/?query=&submit_search=Search";
var lang            = "en";
var msg_load_more   = "Load More Results…";
var page_mode       = "screen";
var dialog_title    = "";
var dialog_message  = "";
//</script>
<script src="/js/pg-two.js"

In [115]:
title_2=[]
for s in soup.select(".title")[2:]:
    title_2.append(s.get_text())

In [125]:
author=[]
for s in soup.select(".subtitle")[2:]:
    author.append(s.get_text())


In [130]:
len(author)

23

In [129]:
author

['William Shakespeare',
 'Silvanus P. Thompson',
 'Nathaniel Hawthorne',
 'Lewis Carroll',
 'Bram Stoker',
 'Charles Dickens',
 'F. Scott Fitzgerald',
 'Henrik Ibsen',
 'Oscar Wilde',
 'United States. Office of Strategic Services',
 'Franz Kafka',
 'Jonathan Swift',
 'Oscar Wilde',
 'Robert Louis Stevenson',
 'Herman Melville',
 'Charlotte Perkins Gilman',
 'Charles Dickens',
 'Charlotte Brontë',
 'Arthur Conan Doyle',
 'Charles Dickens',
 'Niccolò Machiavelli',
 'Mark Twain',
 'Homer']

In [127]:
len(title_2)

25
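The two lists have different lengths (25 titles, 23 authors), so zipping them would pair some titles with the wrong author. One way to avoid this, sketched on hypothetical markup (the `li.booklink` wrapper and the sample entries are assumptions, and the real page layout may differ), is to loop over each result item and look up both fields inside it:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: every result is one <li> holding its own title and,
# optionally, a subtitle with the author.
html = """
<ul>
  <li class="booklink"><span class="title">Romeo and Juliet</span><span class="subtitle">William Shakespeare</span></li>
  <li class="booklink"><span class="title">An Anonymous Pamphlet</span></li>
  <li class="booklink"><span class="title">Dracula</span><span class="subtitle">Bram Stoker</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

books = []
for item in soup.select("li.booklink"):
    title_tag = item.select_one(".title")
    author_tag = item.select_one(".subtitle")   # None when the item has no author
    books.append({
        "title": title_tag.get_text(strip=True),
        "author": author_tag.get_text(strip=True) if author_tag else None,
    })

print(books)
```

Each record keeps its own title and author together, so a missing author shows up as `None` instead of shifting the rest of the list.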

In [131]:
wiki="https://en.wikipedia.org/wiki/2023_in_film"

In [132]:
response_3=requests.get(wiki)

In [133]:
soup = BeautifulSoup(response_3.content, "html.parser")

In [134]:
soup


<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>2023 in film - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientp

In [153]:
soup.select("tbody tr td i a")

[<a href="/wiki/Category:2023" title="Category:2023">+...</a>,
 <a href="/wiki/Barbie_(film)" title="Barbie (film)">Barbie</a>,
 <a href="/wiki/The_Super_Mario_Bros._Movie" title="The Super Mario Bros. Movie">The Super Mario Bros. Movie</a>,
 <a href="/wiki/Oppenheimer_(film)" title="Oppenheimer (film)">Oppenheimer</a>,
 <a href="/wiki/Guardians_of_the_Galaxy_Vol._3" title="Guardians of the Galaxy Vol. 3">Guardians of the Galaxy Vol. 3</a>,
 <a href="/wiki/Fast_X" title="Fast X">Fast X</a>,
 <a href="/wiki/Spider-Man:_Across_the_Spider-Verse" title="Spider-Man: Across the Spider-Verse">Spider-Man: Across the Spider-Verse</a>,
 <a href="/wiki/Full_River_Red" title="Full River Red">Full River Red</a>,
 <a href="/wiki/The_Wandering_Earth_2" title="The Wandering Earth 2">The Wandering Earth 2</a>,
 <a href="/wiki/The_Little_Mermaid_(2023_film)" title="The Little Mermaid (2023 film)">The Little Mermaid</a>,
 <a href="/wiki/Mission:_Impossible_%E2%80%93_Dead_Reckoning_Part_One" title="Missio

In [154]:
title_3=[]
for s in soup.select("tbody tr td i a"):
    title_3.append(s.get_text())

In [156]:
title_3[1:11]

['Barbie',
 'The Super Mario Bros. Movie',
 'Oppenheimer',
 'Guardians of the Galaxy Vol. 3',
 'Fast X',
 'Spider-Man: Across the Spider-Verse',
 'Full River Red',
 'The Wandering Earth 2',
 'The Little Mermaid',
 'Mission: Impossible – Dead Reckoning Part One']

In [169]:
import re

In [188]:
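money=[]
# raw string, so that \$ reaches the regex engine as an escaped dollar sign;
# `string=` replaces the deprecated `text=` argument
result = soup.find_all("td", string=re.compile(r"^\$"))
for s in result:
    money.append(s.get_text(strip=True))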


In [182]:
result

[<td>$1,441,740,954
 </td>, <td>$1,361,990,276
 </td>, <td>$950,682,319
 </td>, <td>$845,555,777
 </td>, <td>$714,414,576
 </td>, <td>$690,516,673
 </td>, <td>$673,596,577
 </td>, <td>$604,460,538
 </td>, <td>$569,543,411
 </td>, <td>$567,418,180
 </td>]

In [189]:
money

['$1,441,740,954',
 '$1,361,990,276',
 '$950,682,319',
 '$845,555,777',
 '$714,414,576',
 '$690,516,673',
 '$673,596,577',
 '$604,460,538',
 '$569,543,411',
 '$567,418,180']
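These values are still strings, so they cannot be sorted or summed as numbers. A small sketch of one way to convert them (`to_int` is a hypothetical helper; the sample list holds the first three values from the output above):

```python
# First three values copied from the output above.
money = ['$1,441,740,954', '$1,361,990,276', '$950,682,319']

def to_int(amount):
    """Drop the dollar sign and thousands separators, then convert to int."""
    return int(amount.replace("$", "").replace(",", ""))

income = [to_int(m) for m in money]
print(income)
```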

In [190]:
frame=pd.DataFrame({"title":title_3[1:11], "income":money})
frame

| | title | income |
|---|---|---|
| 0 | Barbie | $1,441,740,954 |
| 1 | The Super Mario Bros. Movie | $1,361,990,276 |
| 2 | Oppenheimer | $950,682,319 |
| 3 | Guardians of the Galaxy Vol. 3 | $845,555,777 |
| 4 | Fast X | $714,414,576 |
| 5 | Spider-Man: Across the Spider-Verse | $690,516,673 |
| 6 | Full River Red | $673,596,577 |
| 7 | The Wandering Earth 2 | $604,460,538 |
| 8 | The Little Mermaid | $569,543,411 |
| 9 | Mission: Impossible – Dead Reckoning Part One | $567,418,180 |


In [192]:
soup.select("tbody tr td:nth-child(4)")

[<td>$1,441,740,954
 </td>, <td>$1,361,990,276
 </td>, <td>$845,555,777
 </td>, <td>$714,414,576
 </td>, <td>$690,516,673
 </td>, <td>$673,596,577
 </td>, <td>$569,543,411
 </td>, <td>$567,418,180
 </td>, <td><a href="/wiki/New_York_City" title="New York City">New York City</a>, New York, U.S.
 </td>, <td><a class="mw-redirect" href="/wiki/Beverly_Hills" title="Beverly Hills">Beverly Hills</a>, California, U.S.
 </td>, <td><a href="/wiki/Los_Angeles" title="Los Angeles">Los Angeles</a>, California, U.S.
 </td>, <td><a href="/wiki/Barcelona" title="Barcelona">Barcelona</a>, Catalonia, Spain
 </td>, <td><a href="/wiki/Zaragoza" title="Zaragoza">Zaragoza</a>, Aragon, Spain
 </td>, <td><a href="/wiki/Almer%C3%ADa" title="Almería">Almería</a>, Andalusia, Spain
 </td>, <td><a href="/wiki/Seville" title="Seville">Seville</a>, Andalusia, Spain
 </td>, <td style="text-align:center;"><sup class="reference" id="cite_ref-58"><a href="#cite_note-58">[58]</a></sup>
 </td>, <td>Beverly Hills, Califor

In [228]:
presidents="https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

In [229]:
response_4=requests.get(presidents)

In [230]:
soup = BeautifulSoup(response_4.content, "html.parser")

In [208]:
joe=soup.select("td b a")[-1]["href"]

In [211]:
base_url="https://en.wikipedia.org"



response=requests.get(base_url+joe)
soup=BeautifulSoup(response.content, "html.parser")
soup

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Joe Biden - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref
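Here the relative link is turned into a full URL by string concatenation. As a sketch of a slightly more forgiving alternative, `urljoin` from the standard library handles both relative and already-absolute links; the `example.com` address is only a placeholder:

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org"

# A relative href, as found in the list of presidents.
absolute = urljoin(base_url, "/wiki/George_Washington")

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, "https://example.com/page")

print(absolute)
print(already_absolute)
```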

In [232]:
# collect the href of the first 46 matching links (select() runs only once)
all_presidents=[a["href"] for a in soup.select("td b a")[:46]]

In [233]:
all_presidents

['/wiki/George_Washington',
 '/wiki/John_Adams',
 '/wiki/Thomas_Jefferson',
 '/wiki/James_Madison',
 '/wiki/James_Monroe',
 '/wiki/John_Quincy_Adams',
 '/wiki/Andrew_Jackson',
 '/wiki/Martin_Van_Buren',
 '/wiki/William_Henry_Harrison',
 '/wiki/John_Tyler',
 '/wiki/James_K._Polk',
 '/wiki/Zachary_Taylor',
 '/wiki/Millard_Fillmore',
 '/wiki/Franklin_Pierce',
 '/wiki/James_Buchanan',
 '/wiki/Abraham_Lincoln',
 '/wiki/Andrew_Johnson',
 '/wiki/Ulysses_S._Grant',
 '/wiki/Rutherford_B._Hayes',
 '/wiki/James_A._Garfield',
 '/wiki/Chester_A._Arthur',
 '/wiki/Grover_Cleveland',
 '/wiki/Benjamin_Harrison',
 '/wiki/Grover_Cleveland',
 '/wiki/William_McKinley',
 '/wiki/Theodore_Roosevelt',
 '/wiki/William_Howard_Taft',
 '/wiki/Woodrow_Wilson',
 '/wiki/Warren_G._Harding',
 '/wiki/Calvin_Coolidge',
 '/wiki/Herbert_Hoover',
 '/wiki/Franklin_D._Roosevelt',
 '/wiki/Harry_S._Truman',
 '/wiki/Dwight_D._Eisenhower',
 '/wiki/John_F._Kennedy',
 '/wiki/Lyndon_B._Johnson',
 '/wiki/Richard_Nixon',
 '/wiki/Geral

In [247]:
base_url="https://en.wikipedia.org"


In [296]:
name=[]
date=[]
spouse=[]
children=[]
for i in all_presidents:
    # download and parse each president's page
    response=requests.get(base_url+i)
    soup=BeautifulSoup(response.content, "html.parser")
    # each selector returns a (possibly empty) list of matches
    name.append([s.get_text() for s in soup.select("h1 span")])
    date.append([h.get_text() for h in soup.select("tbody span.bday")])
    spouse.append([w.get_text() for w in soup.select("tbody tr div.marriage-display-ws div a")])
    # caution: the row position (28) may differ between pages, so this selector is fragile
    children.append([ch.get_text() for ch in soup.select("tbody tr:nth-child(28) a")])

In [298]:
date

[[],
 [],
 ['1743-04-13'],
 ['1751-03-16'],
 ['1758-04-28'],
 ['1767-07-11'],
 ['1767-03-15'],
 ['1782-12-05'],
 ['1773-02-09'],
 ['1790-03-29'],
 ['1795-11-02'],
 ['1784-11-24'],
 ['1800-01-07'],
 ['1804-11-23'],
 ['1791-04-23'],
 ['1809-02-12'],
 ['1808-12-29'],
 ['1822-04-27'],
 ['1822-10-04'],
 ['1831-11-19'],
 ['1829-10-05'],
 ['1837-03-18'],
 ['1833-08-20'],
 ['1837-03-18'],
 ['1843-01-29'],
 ['1858-10-27'],
 ['1857-09-15'],
 ['1856-12-28'],
 ['1865-11-02'],
 ['1872-07-04'],
 ['1874-08-10'],
 ['1882-01-30'],
 ['1884-05-08'],
 ['1890-10-14'],
 ['1917-05-29'],
 ['1908-08-27'],
 ['1913-01-09'],
 ['1913-07-14'],
 ['1924-10-01'],
 ['1911-02-06'],
 ['1924-06-12'],
 ['1946-08-19'],
 ['1946-07-06'],
 ['1961-08-04'],
 ['1946-06-14'],
 ['1942-11-20']]
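Every entry is a list, and the first two are empty because the selector found nothing on those pages. Before building a dataframe, it may help to flatten the result to one value per president. A minimal sketch, using the first four entries from the output above:

```python
# First four entries copied from the output above.
date = [[], [], ['1743-04-13'], ['1751-03-16']]

# Take the first match when there is one, otherwise None.
birthdays = [d[0] if d else None for d in date]
print(birthdays)
```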

In [249]:
name

[['George Washington'],
 ['John Adams'],
 ['Thomas Jefferson'],
 ['James Madison'],
 ['James Monroe'],
 ['John Quincy Adams'],
 ['Andrew Jackson'],
 ['Martin Van Buren'],
 ['William Henry Harrison'],
 ['John Tyler'],
 ['James K. Polk'],
 ['Zachary Taylor'],
 ['Millard Fillmore'],
 ['Franklin Pierce'],
 ['James Buchanan'],
 ['Abraham Lincoln'],
 ['Andrew Johnson'],
 ['Ulysses S. Grant'],
 ['Rutherford B. Hayes'],
 ['James A. Garfield'],
 ['Chester A. Arthur'],
 ['Grover Cleveland'],
 ['Benjamin Harrison'],
 ['Grover Cleveland'],
 ['William McKinley'],
 ['Theodore Roosevelt'],
 ['William Howard Taft'],
 ['Woodrow Wilson'],
 ['Warren G. Harding'],
 ['Calvin Coolidge'],
 ['Herbert Hoover'],
 ['Franklin D. Roosevelt'],
 ['Harry S. Truman'],
 ['Dwight D. Eisenhower'],
 ['John F. Kennedy'],
 ['Lyndon B. Johnson'],
 ['Richard Nixon'],
 ['Gerald Ford'],
 ['Jimmy Carter'],
 ['Ronald Reagan'],
 ['George H. W. Bush'],
 ['Bill Clinton'],
 ['George W. Bush'],
 ['Barack Obama'],
 ['Donald Trump'],
 [

In [277]:
response=requests.get(base_url+'/wiki/Bill_Clinton')
soup=BeautifulSoup(response.content, "html.parser")
soup

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Bill Clinton - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientp

In [246]:
[i.get_text() for i in soup.select("h1 span")]

['Bill Clinton']

In [278]:
[h.get_text() for h in soup.select("tbody span.bday")]


['1946-08-19']

In [303]:
soup.select("tbody tr div.marriage-display-ws div a")

[<a class="mw-redirect" href="/wiki/Neilia_Hunter" title="Neilia Hunter">Neilia Hunter</a>,
 <a class="mw-redirect" href="/wiki/Jill_Jacobs" title="Jill Jacobs">Jill Jacobs</a>]

In [311]:
soup.find("th", string="Children").parent.find_all("a")

[<a href="/wiki/Beau_Biden" title="Beau Biden">Beau</a>,
 <a href="/wiki/Hunter_Biden" title="Hunter Biden">Hunter</a>,
 <a class="mw-redirect" href="/wiki/Naomi_Christina_Biden" title="Naomi Christina Biden">Naomi</a>,
 <a href="/wiki/Ashley_Biden" title="Ashley Biden">Ashley</a>]
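Looking up a row by its label, as in the last cell, does not depend on the row's position. A sketch of that idea wrapped in a function, tested on a hypothetical, heavily simplified infobox (real Wikipedia markup is more complex, and `row_links` is a name invented here):

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified infobox.
html = """
<table class="infobox"><tbody>
  <tr><th>Born</th><td><span class="bday">1942-11-20</span></td></tr>
  <tr><th>Children</th><td><a href="/wiki/Beau_Biden">Beau</a>, <a href="/wiki/Hunter_Biden">Hunter</a></td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

def row_links(soup, label):
    """Return the link texts in the table row whose header text equals `label`."""
    header = soup.find("th", string=label)
    if header is None:            # this page has no such row
        return []
    return [a.get_text() for a in header.parent.find_all("a")]

print(row_links(soup, "Children"))
print(row_links(soup, "Spouse"))
```

Returning an empty list for a missing label keeps a loop over many pages from crashing on the first page that lacks the row.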