# Intro to web scraping

The first step of web scraping is to identify a website and download the html code from it. 

Real html from websites tends to be long and a bit too chaotic for a total beginner. Here we will start with a dummy html document and learn the basics of extracting info with beautifulsoup.

- You can learn about Html here https://www.w3schools.com/html/
- You can use codebeautify to make your html more readable and clean https://codebeautify.org/htmlviewer

In [1]:
html_doc = """ <!DOCTYPE html><html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></html>"""

In [2]:
html_doc

' <!DOCTYPE html><html><head><title>The Dormouse\'s story</title></head><body><p class="title"><b>The Dormouse\'s story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></html>'

In [3]:
from bs4 import BeautifulSoup

#### "creating the soup"

In [4]:
# parse the element
soup = BeautifulSoup(html_doc, 'html.parser') 

In [5]:
soup

 <!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></body></html>

In [6]:
import pprint

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



#### accessing single elements

We can access html tags by appending to the `soup` a dot `.` and the name of the corresponding tag. In case of having multiple instances of the tag, only the first one will be retrieved.  

In [8]:
soup.title

<title>The Dormouse's story</title>

In [9]:
soup.title.parent #shows the title inside of the parent -> it's inside head

<head><title>The Dormouse's story</title></head>

In [10]:
soup.html.body.p

<p class="title"><b>The Dormouse's story</b></p>

<b> searching using find() function

In [11]:
soup.find("p")

<p class="title"><b>The Dormouse's story</b></p>

In [12]:
soup.find("p").get_text() #retrieves only the text part of the element

"The Dormouse's story"

In [13]:
# this method only retrieves the first element of the specified tag
soup.p  #same as find('p')

<p class="title"><b>The Dormouse's story</b></p>

#### finding all elements of a tag with the powerful find_all()

In [14]:
p_tags = soup.find_all("p")
p_tags #brings back all the paragraphs

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

To get the `text`from the corresponding html code, we can use the function: get_text()

In [15]:
for p in p_tags:  #in order to get text out of the list above we HAVE TO make a for loop
    print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...


## Return the 3 names of the sisters

In [19]:
a_tags = soup.find_all('a')

In [20]:
a_tags

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [21]:
a_tags = soup.find_all('a')

In [22]:
for a in a_tags:
    print(a.get_text())

Elsie
Lacie
Tillie


## Using css selectors
Another way to find contents using select(). 

Let's learn first the syntax of css selectors playing this game: https://flukeout.github.io/

Everyone should reach level 6!

https://www.w3schools.com/css/css_howto.asp

In [23]:
soup.select("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [24]:
for a in soup.select('a'):
    print(a.get_text())

Elsie
Lacie
Tillie


In [25]:
soup.select('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [28]:
soup.select("p")[0]

<p class="title"><b>The Dormouse's story</b></p>

using css selector, you can search directly using Css classes!

In [29]:
soup.select(".title") #gives all elements from the title class

[<p class="title"><b>The Dormouse's story</b></p>]

In [30]:
soup.select('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [31]:
soup.select('p.story')  #gives us paragraphs that are in story 

[<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

<b> comparing to find_all() ..

In [32]:
soup.find_all("a", class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [33]:
soup.select("a.sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

<b>  You can searc directly using id attributes

In [34]:
soup.select("#link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [36]:
soup.select("#link2")[0].get_text()

'Lacie'

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, can only be applied to single elements, while `select()` might return multiple elements. It's common to iterate through the output of `select()`

In [None]:
print(soup.select(".story"))

In [None]:
for p in soup.select("p.story"):
    print(p.get_text())



Write code to print the following contents (not including the html tags, only human-readable text): 

1. All the "fun facts". 

2. The names of all the places. 

3. The content (name and fact) of all the cities (only cities, not countries!) 

4. The names (not facts!) of all the cities (not countries!)

In [37]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [38]:
soup = BeautifulSoup(geography, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  Geography
 </head>
 <body>
  <div class="city">
   <h2>
    London
   </h2>
   <p>
    London is the most popular tourist destination in the world.
   </p>
  </div>
  <div class="city">
   <h2>
    Paris
   </h2>
   <p>
    Paris was originally a Roman City called Lutetia.
   </p>
  </div>
  <div class="country">
   <h2>
    Spain
   </h2>
   <p>
    Spain produces 43,8% of all the world's Olive Oil.
   </p>
  </div>
 </body>
</html>



In [39]:
# Your code goes here

soup.select('p')

[<p>London is the most popular tourist destination in the world.</p>,
 <p>Paris was originally a Roman City called Lutetia.</p>,
 <p>Spain produces 43,8% of all the world's Olive Oil.</p>]

In [40]:
fun_facts = soup.select('p')

In [41]:
for p in fun_facts:
    print(p.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


In [None]:
#retrieve name of all places 

In [42]:
soup.find_all('h2')

[<h2>London</h2>, <h2>Paris</h2>, <h2>Spain</h2>]

In [43]:
places = soup.find_all('h2')

In [44]:
for h2 in places:
    print(h2.get_text())

London
Paris
Spain


In [None]:
#The content (name and fact) of all the cities (only cities, not countries!)

In [50]:
soup.find_all("div", class_="city")

[<div class="city">
 <h2>London</h2>
 <p>London is the most popular tourist destination in the world.</p>
 </div>,
 <div class="city">
 <h2>Paris</h2>
 <p>Paris was originally a Roman City called Lutetia.</p>
 </div>]

In [46]:
#The names (not facts!) of all the cities (not countries!)

In [52]:
for p in soup.select ('div.city h2'):
    print(p.get_text())

London
Paris


## Use case: 





In [53]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [54]:
# 2. find url and store it in a variable
url = "https://www.singtotheworld.com/100-most-popular-karaoke-songs-of-all-time"

# <b> using request package

In [55]:
# 3. download html with a get request 
response = requests.get(url)

In [56]:
response.status_code # 200 status code means OK!

200

### HTTP Response status codes 
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status

In [57]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
soup


<!DOCTYPE html >

<html class="" id="pagehtml" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head id="scriptsDynamic"><title>
	Top 100 Most Popular Karaoke songs of all time
</title>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=5" name="viewport"/>
<meta content="en" name="language">
<meta content="index,follow" name="robots"/>
<meta content="nositelinkssearchbox,notranslate" name="google"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="ddd0c8df208a4b396dc247e4a168f61eebdbdfad0199402898bdf321a55806fa" name="ahrefs-site-verification"/>
<link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link color="#80118e" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#80118e" name="m

#### Building the dataframe

In [59]:
import pprint
print(soup.prettify())

<!DOCTYPE html >
<html class="" id="pagehtml" lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head id="scriptsDynamic">
  <title>
   Top 100 Most Popular Karaoke songs of all time
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=5" name="viewport"/>
  <meta content="en" name="language">
   <meta content="index,follow" name="robots"/>
   <meta content="nositelinkssearchbox,notranslate" name="google"/>
   <meta content="telephone=no" name="format-detection"/>
   <meta content="ie=edge" http-equiv="x-ua-compatible"/>
   <meta content="ddd0c8df208a4b396dc247e4a168f61eebdbdfad0199402898bdf321a55806fa" name="ahrefs-site-verification"/>
   <link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
   <link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
   <link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
   <link color="#80118e" href="/safari-pinned-tab.svg" rel="mask-

##getting the titles and artists


In [62]:
soup.select('span a')  #whitespace after span because of a class

[<a href="/terms-and-conditions" target="_blank">Terms of Use</a>,
 <a href="/privacy-policy" target="_blank">Privacy Policy</a>,
 <a href="/custom-disc/artist/george-ezra">Shotgun by George Ezra</a>,
 <a href="https://www.singtotheworld.com/custom-disc/artist/neil-diamond">Sweet Caroline by Neil Diamond</a>,
 <a href="https://www.singtotheworld.com/custom-disc/artist/lady-gaga-and-bradley-cooper-a-star-is-born">Shallow by Lady Gaga &amp; Bradley Cooper (A Star is Born)</a>,
 <a href="/custom-disc/artist/the-greatest-showman">A Million Dreams by The Greatest Showman</a>,
 <a href="/custom-disc/artist/ed-sheeran">Perfect by Ed Sheeran</a>,
 <a href="/custom-disc/artist/the-greatest-showman">This Is Me by The Greatest Showman</a>,
 <a href="/custom-disc/artist/lewis-capaldi">Someone You Loved by Lewis Capaldi</a>,
 <a href="/custom-disc/artist/the-greatest-showman">Rewrite The Stars by The Greatest Showman</a>,
 <a href="/custom-disc/artist/robbie-williams">Angels by Robbie Williams</a>,

In [63]:
soup.select('span.size3 a')

[<a href="/custom-disc/artist/george-ezra">Shotgun by George Ezra</a>,
 <a href="https://www.singtotheworld.com/custom-disc/artist/neil-diamond">Sweet Caroline by Neil Diamond</a>,
 <a href="https://www.singtotheworld.com/custom-disc/artist/lady-gaga-and-bradley-cooper-a-star-is-born">Shallow by Lady Gaga &amp; Bradley Cooper (A Star is Born)</a>,
 <a href="/custom-disc/artist/the-greatest-showman">A Million Dreams by The Greatest Showman</a>,
 <a href="/custom-disc/artist/ed-sheeran">Perfect by Ed Sheeran</a>,
 <a href="/custom-disc/artist/the-greatest-showman">This Is Me by The Greatest Showman</a>,
 <a href="/custom-disc/artist/lewis-capaldi">Someone You Loved by Lewis Capaldi</a>,
 <a href="/custom-disc/artist/the-greatest-showman">Rewrite The Stars by The Greatest Showman</a>,
 <a href="/custom-disc/artist/robbie-williams">Angels by Robbie Williams</a>,
 <a href="/custom-disc/artist/the-greatest-showman">The Greatest Show by The Greatest Showman</a>,
 <a data-mce-href="custom-disc

In [65]:
#get text

for k in soup.select('span.size3 a'):
    print (k.get_text())

Shotgun by George Ezra
Sweet Caroline by Neil Diamond
Shallow by Lady Gaga & Bradley Cooper (A Star is Born)
A Million Dreams by The Greatest Showman
Perfect by Ed Sheeran
This Is Me by The Greatest Showman
Someone You Loved by Lewis Capaldi
Rewrite The Stars by The Greatest Showman
Angels by Robbie Williams
The Greatest Show by The Greatest Showman
The Greatest Showman
My Way by Frank Sinatra
Is This The Way To Amarillo by Tony Christie
Bohemian Rhapsody by Queen
Never Enough by The Greatest Showman
Your Song by Elton John


In [None]:
#make one list for name and one list for artist

In [66]:
songs = []
for k in soup.select('span.size3 a'):
    songs.append (k.get_text())

In [68]:
title = []
artist = []
for song in songs:
    title.append(song.split('by')[0])
            
    

In [69]:
title

['Shotgun ',
 'Sweet Caroline ',
 'Shallow ',
 'A Million Dreams ',
 'Perfect ',
 'This Is Me ',
 'Someone You Loved ',
 'Rewrite The Stars ',
 'Angels ',
 'The Greatest Show ',
 'The Greatest Showman',
 'My Way ',
 'Is This The Way To Amarillo ',
 'Bohemian Rhapsody ',
 'Never Enough ',
 'Your Song ']

In [70]:
for song in songs:
    artist.append(song.split('by')[-1])

In [71]:
artist

[' George Ezra',
 ' Neil Diamond',
 ' Lady Gaga & Bradley Cooper (A Star is Born)',
 ' The Greatest Showman',
 ' Ed Sheeran',
 ' The Greatest Showman',
 ' Lewis Capaldi',
 ' The Greatest Showman',
 ' Robbie Williams',
 ' The Greatest Showman',
 'The Greatest Showman',
 ' Frank Sinatra',
 ' Tony Christie',
 ' Queen',
 ' The Greatest Showman',
 ' Elton John']

In [73]:
songs.remove('The Greatest Showman')

In [74]:
songs

['Shotgun by George Ezra',
 'Sweet Caroline by Neil Diamond',
 'Shallow by Lady Gaga & Bradley Cooper (A Star is Born)',
 'A Million Dreams by The Greatest Showman',
 'Perfect by Ed Sheeran',
 'This Is Me by The Greatest Showman',
 'Someone You Loved by Lewis Capaldi',
 'Rewrite The Stars by The Greatest Showman',
 'Angels by Robbie Williams',
 'The Greatest Show by The Greatest Showman',
 'My Way by Frank Sinatra',
 'Is This The Way To Amarillo by Tony Christie',
 'Bohemian Rhapsody by Queen',
 'Never Enough by The Greatest Showman',
 'Your Song by Elton John']

In [75]:
len(title)

16

In [76]:
len(artist)

16

## Example 2 scraping gutenberg.org

In [78]:
url = "https://gutenberg.org/ebooks/search/?query=&submit_search=Search"

In [79]:
##sending request
response = requests.get(url)
response.status_code

200

In [81]:
soup = BeautifulSoup(response.content)

In [95]:
#Selecting book titles
title = []
for i in soup.select('.title')[2:]:
    title.append(i.get_text())

In [96]:
title

['Frankenstein; Or, The Modern Prometheus',
 'Pride and Prejudice',
 'Romeo and Juliet',
 'Calculus Made Easy\r',
 'The Scarlet Letter',
 "Alice's Adventures in Wonderland",
 'Dracula',
 'A Christmas Carol in Prose; Being a Ghost Story of Christmas',
 'The Great Gatsby',
 "A Doll's House : a play",
 'The Picture of Dorian Gray',
 'Simple Sabotage Field Manual',
 'Metamorphosis',
 'A Modest Proposal\r',
 'The Importance of Being Earnest: A Trivial Comedy for Serious People',
 'The Strange Case of Dr. Jekyll and Mr. Hyde',
 'Moby Dick; Or, The Whale',
 'The Yellow Wallpaper',
 'A Tale of Two Cities',
 'Jane Eyre: An Autobiography',
 'The Adventures of Sherlock Holmes',
 'Great Expectations',
 'The Prince',
 'Adventures of Huckleberry Finn',
 'The Iliad']

In [None]:
#Select the authors

In [88]:
soup.select('.subtitle')

[<span class="subtitle">Mary Wollstonecraft Shelley</span>,
 <span class="subtitle">Jane Austen</span>,
 <span class="subtitle">William Shakespeare</span>,
 <span class="subtitle">Silvanus P. Thompson</span>,
 <span class="subtitle">Nathaniel Hawthorne</span>,
 <span class="subtitle">Lewis Carroll</span>,
 <span class="subtitle">Bram Stoker</span>,
 <span class="subtitle">Charles Dickens</span>,
 <span class="subtitle">F. Scott Fitzgerald</span>,
 <span class="subtitle">Henrik Ibsen</span>,
 <span class="subtitle">Oscar Wilde</span>,
 <span class="subtitle">United States. Office of Strategic Services</span>,
 <span class="subtitle">Franz Kafka</span>,
 <span class="subtitle">Jonathan Swift</span>,
 <span class="subtitle">Oscar Wilde</span>,
 <span class="subtitle">Robert Louis Stevenson</span>,
 <span class="subtitle">Herman Melville</span>,
 <span class="subtitle">Charlotte Perkins Gilman</span>,
 <span class="subtitle">Charles Dickens</span>,
 <span class="subtitle">Charlotte Brontë<

In [93]:
authors = []
for i in soup.select('.subtitle'):
    authors.append(i.get_text())

In [94]:
authors
            

['Mary Wollstonecraft Shelley',
 'Jane Austen',
 'William Shakespeare',
 'Silvanus P. Thompson',
 'Nathaniel Hawthorne',
 'Lewis Carroll',
 'Bram Stoker',
 'Charles Dickens',
 'F. Scott Fitzgerald',
 'Henrik Ibsen',
 'Oscar Wilde',
 'United States. Office of Strategic Services',
 'Franz Kafka',
 'Jonathan Swift',
 'Oscar Wilde',
 'Robert Louis Stevenson',
 'Herman Melville',
 'Charlotte Perkins Gilman',
 'Charles Dickens',
 'Charlotte Brontë',
 'Arthur Conan Doyle',
 'Charles Dickens',
 'Niccolò Machiavelli',
 'Mark Twain',
 'Homer']

In [100]:
book_df = pd.DataFrame({'title': title, 'author': authors})

In [101]:
book_df

Unnamed: 0,title,author
0,"Frankenstein; Or, The Modern Prometheus",Mary Wollstonecraft Shelley
1,Pride and Prejudice,Jane Austen
2,Romeo and Juliet,William Shakespeare
3,Calculus Made Easy\r,Silvanus P. Thompson
4,The Scarlet Letter,Nathaniel Hawthorne
5,Alice's Adventures in Wonderland,Lewis Carroll
6,Dracula,Bram Stoker
7,A Christmas Carol in Prose; Being a Ghost Stor...,Charles Dickens
8,The Great Gatsby,F. Scott Fitzgerald
9,A Doll's House : a play,Henrik Ibsen


### Cleaning the data

In [None]:
# your code here

In [102]:
url = "https://en.wikipedia.org/wiki/2023_in_film"


In [103]:
response = requests.get(url)
response.status_code

200

In [106]:
soup = BeautifulSoup(response.content)


In [107]:
import pprint
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   2023 in film - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-lim

In [127]:
soup.select('tbody tr td i a ')

[<a href="/wiki/Category:2023" title="Category:2023">+...</a>,
 <a href="/wiki/Barbie_(film)" title="Barbie (film)">Barbie</a>,
 <a href="/wiki/The_Super_Mario_Bros._Movie" title="The Super Mario Bros. Movie">The Super Mario Bros. Movie</a>,
 <a href="/wiki/Oppenheimer_(film)" title="Oppenheimer (film)">Oppenheimer</a>,
 <a href="/wiki/Guardians_of_the_Galaxy_Vol._3" title="Guardians of the Galaxy Vol. 3">Guardians of the Galaxy Vol. 3</a>,
 <a href="/wiki/Fast_X" title="Fast X">Fast X</a>,
 <a href="/wiki/Spider-Man:_Across_the_Spider-Verse" title="Spider-Man: Across the Spider-Verse">Spider-Man: Across the Spider-Verse</a>,
 <a href="/wiki/Full_River_Red" title="Full River Red">Full River Red</a>,
 <a href="/wiki/The_Wandering_Earth_2" title="The Wandering Earth 2">The Wandering Earth 2</a>,
 <a href="/wiki/The_Little_Mermaid_(2023_film)" title="The Little Mermaid (2023 film)">The Little Mermaid</a>,
 <a href="/wiki/Mission:_Impossible_%E2%80%93_Dead_Reckoning_Part_One" title="Missio

In [128]:
for i in soup.select('tbody tr td i a '):
    print (i.get_text())

+...
Barbie
The Super Mario Bros. Movie
Oppenheimer
Guardians of the Galaxy Vol. 3
Fast X
Spider-Man: Across the Spider-Verse
Full River Red
The Wandering Earth 2
The Little Mermaid
Mission: Impossible – Dead Reckoning Part One
Still: A Michael J. Fox Movie
The Fire Within
The Horseman on the Roof
The Accidental Tourist
The Party Animal
Bruce Almighty
The Right Stuff
The Way of the Dragon
Fist of Fury
Sayonara
The Art of Love
The Terminator
To Be or Not to Be
Apollo 13
A Beautiful Mind
Wavelength
La Région Centrale
Creature
Second Thoughts
The Exorcist
The French Connection
Miracle
2012
Klute
Down and Out in Beverly Hills
Close Encounters of the Third Kind
A Christmas Story
Sun Valley
Warriors of Heaven and Earth
American Gigolo
Sixteen Candles
Wedding Night
The Blood on Satan's Claw
The Hunchback of Notre Dame
Recess: School's Out
All That Jazz
Dream Lover
Trainspotting
Billy Elliot
A Room with a View
Naked Lunch
The Horizontal Lieutenant
McHale's Navy
The Little Shop of Horrors
Tales

In [129]:
title = []
for i in soup.select('tbody tr td i a ')[1:11]:
    title.append(i.get_text())

In [130]:
title

['Barbie',
 'The Super Mario Bros. Movie',
 'Oppenheimer',
 'Guardians of the Galaxy Vol. 3',
 'Fast X',
 'Spider-Man: Across the Spider-Verse',
 'Full River Red',
 'The Wandering Earth 2',
 'The Little Mermaid',
 'Mission: Impossible – Dead Reckoning Part One']

In [142]:
soup.select('table tbody tr td')

[<td style="text-align:left; width:15%;">
 </td>,
 <td style="text-align:center"><a href="/wiki/List_of_years_in_film" title="List of years in film">List of years in film</a>
 </td>,
 <td style="text-align:right; width:15%;">
 </td>,
 <td style="text-align:center"><style data-mw-deduplicate="TemplateStyles:r1129693374">.mw-parser-output .hlist dl,.mw-parser-output .hlist ol,.mw-parser-output .hlist ul{margin:0;padding:0}.mw-parser-output .hlist dd,.mw-parser-output .hlist dt,.mw-parser-output .hlist li{margin:0;display:inline}.mw-parser-output .hlist.inline,.mw-parser-output .hlist.inline dl,.mw-parser-output .hlist.inline ol,.mw-parser-output .hlist.inline ul,.mw-parser-output .hlist dl dl,.mw-parser-output .hlist dl ol,.mw-parser-output .hlist dl ul,.mw-parser-output .hlist ol dl,.mw-parser-output .hlist ol ol,.mw-parser-output .hlist ol ul,.mw-parser-output .hlist ul dl,.mw-parser-output .hlist ul ol,.mw-parser-output .hlist ul ul{display:inline}.mw-parser-output .hlist .mw-empty-li

In [None]:
#select the gross revenue - one way to do it

In [148]:
import re

In [149]:
result = soup.find_all("td", text=re.compile("^\$"))

  result = soup.find_all("td", text=re.compile("^\$"))


In [150]:
result

[<td>$1,441,740,954
 </td>,
 <td>$1,361,990,276
 </td>,
 <td>$950,682,319
 </td>,
 <td>$845,555,777
 </td>,
 <td>$714,414,576
 </td>,
 <td>$690,516,673
 </td>,
 <td>$673,596,577
 </td>,
 <td>$604,460,538
 </td>,
 <td>$569,543,411
 </td>,
 <td>$567,418,180
 </td>]

In [157]:
money = []
for i in result:
    money.append(i.get_text(strip=True))

In [158]:
money

['$1,441,740,954',
 '$1,361,990,276',
 '$950,682,319',
 '$845,555,777',
 '$714,414,576',
 '$690,516,673',
 '$673,596,577',
 '$604,460,538',
 '$569,543,411',
 '$567,418,180']

In [146]:
#another way

In [153]:
gross_rev = []
for gross in soup.select("table tbody tr td"):
    if re.match("\$", gross.get_text()):
        gross_rev.append(gross.get_text(strip=True))
gross_rev

['$1,441,740,954',
 '$1,361,990,276',
 '$950,682,319',
 '$845,555,777',
 '$714,414,576',
 '$690,516,673',
 '$673,596,577',
 '$604,460,538',
 '$569,543,411',
 '$567,418,180']

In [159]:
#another way -> select the 4th child of the group td
#-> copy selector -> right click on element, copy selector -> then look up where to find

In [162]:
soup.select("tbody tr td:nth-child(4)")

[<td>$1,441,740,954
 </td>,
 <td>$1,361,990,276
 </td>,
 <td>$845,555,777
 </td>,
 <td>$714,414,576
 </td>,
 <td>$690,516,673
 </td>,
 <td>$673,596,577
 </td>,
 <td>$569,543,411
 </td>,
 <td>$567,418,180
 </td>,
 <td><a href="/wiki/New_York_City" title="New York City">New York City</a>, New York, U.S.
 </td>,
 <td><a class="mw-redirect" href="/wiki/Beverly_Hills" title="Beverly Hills">Beverly Hills</a>, California, U.S.
 </td>,
 <td><a href="/wiki/Los_Angeles" title="Los Angeles">Los Angeles</a>, California, U.S.
 </td>,
 <td><a href="/wiki/Barcelona" title="Barcelona">Barcelona</a>, Catalonia, Spain
 </td>,
 <td><a href="/wiki/Zaragoza" title="Zaragoza">Zaragoza</a>, Aragon, Spain
 </td>,
 <td><a href="/wiki/Almer%C3%ADa" title="Almería">Almería</a>, Andalusia, Spain
 </td>,
 <td><a href="/wiki/Seville" title="Seville">Seville</a>, Andalusia, Spain
 </td>,
 <td style="text-align:center;"><sup class="reference" id="cite_ref-58"><a href="#cite_note-58">[58]</a></sup>
 </td>,
 <td>Beverl

##scraping presidents of US

In [164]:
url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
response = requests.get(url)
response.status_code
soup = BeautifulSoup(response.content)

200

In [167]:
soup.select("td b a")

[<a href="/wiki/George_Washington" title="George Washington">George Washington</a>,
 <a href="/wiki/John_Adams" title="John Adams">John Adams</a>,
 <a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>,
 <a href="/wiki/James_Madison" title="James Madison">James Madison</a>,
 <a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>,
 <a href="/wiki/John_Quincy_Adams" title="John Quincy Adams">John Quincy Adams</a>,
 <a href="/wiki/Andrew_Jackson" title="Andrew Jackson">Andrew Jackson</a>,
 <a href="/wiki/Martin_Van_Buren" title="Martin Van Buren">Martin Van Buren</a>,
 <a href="/wiki/William_Henry_Harrison" title="William Henry Harrison">William Henry Harrison</a>,
 <a href="/wiki/John_Tyler" title="John Tyler">John Tyler</a>,
 <a href="/wiki/James_K._Polk" title="James K. Polk">James K. Polk</a>,
 <a href="/wiki/Zachary_Taylor" title="Zachary Taylor">Zachary Taylor</a>,
 <a href="/wiki/Millard_Fillmore" title="Millard Fillmore">Millard Fillmore</a>,
 

In [168]:
for i in soup.select("td b a"):
    print(i.get_text())

George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe
John Quincy Adams
Andrew Jackson
Martin Van Buren
William Henry Harrison
John Tyler
James K. Polk
Zachary Taylor
Millard Fillmore
Franklin Pierce
James Buchanan
Abraham Lincoln
Andrew Johnson
Ulysses S. Grant
Rutherford B. Hayes
James A. Garfield
Chester A. Arthur
Grover Cleveland
Benjamin Harrison
Grover Cleveland
William McKinley
Theodore Roosevelt
William Howard Taft
Woodrow Wilson
Warren G. Harding
Calvin Coolidge
Herbert Hoover
Franklin D. Roosevelt
Harry S. Truman
Dwight D. Eisenhower
John F. Kennedy
Lyndon B. Johnson
Richard Nixon
Gerald Ford
Jimmy Carter
Ronald Reagan
George H. W. Bush
Bill Clinton
George W. Bush
Barack Obama
Donald Trump
Joe Biden


In [None]:
#i want the hyperlink

In [170]:
for i in soup.select("td b a"):  #here are the hyperlinks. how to get them?
    print(i)

<a href="/wiki/George_Washington" title="George Washington">George Washington</a>
<a href="/wiki/John_Adams" title="John Adams">John Adams</a>
<a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>
<a href="/wiki/James_Madison" title="James Madison">James Madison</a>
<a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>
<a href="/wiki/John_Quincy_Adams" title="John Quincy Adams">John Quincy Adams</a>
<a href="/wiki/Andrew_Jackson" title="Andrew Jackson">Andrew Jackson</a>
<a href="/wiki/Martin_Van_Buren" title="Martin Van Buren">Martin Van Buren</a>
<a href="/wiki/William_Henry_Harrison" title="William Henry Harrison">William Henry Harrison</a>
<a href="/wiki/John_Tyler" title="John Tyler">John Tyler</a>
<a href="/wiki/James_K._Polk" title="James K. Polk">James K. Polk</a>
<a href="/wiki/Zachary_Taylor" title="Zachary Taylor">Zachary Taylor</a>
<a href="/wiki/Millard_Fillmore" title="Millard Fillmore">Millard Fillmore</a>
<a href="/wiki/Franklin_Pie

In [172]:
for i in soup.select("td b a"):  #like this because the link is in hfref
    print(i["href"])

/wiki/George_Washington
/wiki/John_Adams
/wiki/Thomas_Jefferson
/wiki/James_Madison
/wiki/James_Monroe
/wiki/John_Quincy_Adams
/wiki/Andrew_Jackson
/wiki/Martin_Van_Buren
/wiki/William_Henry_Harrison
/wiki/John_Tyler
/wiki/James_K._Polk
/wiki/Zachary_Taylor
/wiki/Millard_Fillmore
/wiki/Franklin_Pierce
/wiki/James_Buchanan
/wiki/Abraham_Lincoln
/wiki/Andrew_Johnson
/wiki/Ulysses_S._Grant
/wiki/Rutherford_B._Hayes
/wiki/James_A._Garfield
/wiki/Chester_A._Arthur
/wiki/Grover_Cleveland
/wiki/Benjamin_Harrison
/wiki/Grover_Cleveland
/wiki/William_McKinley
/wiki/Theodore_Roosevelt
/wiki/William_Howard_Taft
/wiki/Woodrow_Wilson
/wiki/Warren_G._Harding
/wiki/Calvin_Coolidge
/wiki/Herbert_Hoover
/wiki/Franklin_D._Roosevelt
/wiki/Harry_S._Truman
/wiki/Dwight_D._Eisenhower
/wiki/John_F._Kennedy
/wiki/Lyndon_B._Johnson
/wiki/Richard_Nixon
/wiki/Gerald_Ford
/wiki/Jimmy_Carter
/wiki/Ronald_Reagan
/wiki/George_H._W._Bush
/wiki/Bill_Clinton
/wiki/George_W._Bush
/wiki/Barack_Obama
/wiki/Donald_Trump
/w

In [None]:
#how to get the soup for joe biden

In [175]:
joe = soup.select("td b a")[-1]["href"]
joe

'/wiki/Joe_Biden'

In [181]:
base_url = "https://en.wikipedia.org"

response = requests.get(base_url+joe)
soup = BeautifulSoup(response.content)
soup

#find the scraping path

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Joe Biden - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref

In [183]:
george = soup.select("td b a")[0]["href"]
base_url = "https://en.wikipedia.org"

response = requests.get(base_url+george)
soup = BeautifulSoup(response.content)
soup


<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Incumbent - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref

In [None]:
#to do: go to main page ->name of president, born, spouse, children

In [189]:
url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
response = requests.get(url)
response.status_code
soup = BeautifulSoup(response.content)


In [190]:
all_presidents = []
for i in soup.select("td b a"):
    all_presidents.append(i.get_text())

In [191]:
all_presidents

['George Washington',
 'John Adams',
 'Thomas Jefferson',
 'James Madison',
 'James Monroe',
 'John Quincy Adams',
 'Andrew Jackson',
 'Martin Van Buren',
 'William Henry Harrison',
 'John Tyler',
 'James K. Polk',
 'Zachary Taylor',
 'Millard Fillmore',
 'Franklin Pierce',
 'James Buchanan',
 'Abraham Lincoln',
 'Andrew Johnson',
 'Ulysses S. Grant',
 'Rutherford B. Hayes',
 'James A. Garfield',
 'Chester A. Arthur',
 'Grover Cleveland',
 'Benjamin Harrison',
 'Grover Cleveland',
 'William McKinley',
 'Theodore Roosevelt',
 'William Howard Taft',
 'Woodrow Wilson',
 'Warren G. Harding',
 'Calvin Coolidge',
 'Herbert Hoover',
 'Franklin D. Roosevelt',
 'Harry S. Truman',
 'Dwight D. Eisenhower',
 'John F. Kennedy',
 'Lyndon B. Johnson',
 'Richard Nixon',
 'Gerald Ford',
 'Jimmy Carter',
 'Ronald Reagan',
 'George H. W. Bush',
 'Bill Clinton',
 'George W. Bush',
 'Barack Obama',
 'Donald Trump',
 'Joe Biden']

In [None]:
#mw-content-text > div.mw-content-ltr.mw-parser-output > table.infobox.vcard > tbody > tr:nth-child(35) > td

In [195]:
url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
response = requests.get(url)
response.status_code
soup = BeautifulSoup(response.content)

In [197]:
all_presidents = []

for i in soup.select("td b a"):  
    all_presidents.append(i["href"])

In [198]:
all_presidents

['/wiki/George_Washington',
 '/wiki/John_Adams',
 '/wiki/Thomas_Jefferson',
 '/wiki/James_Madison',
 '/wiki/James_Monroe',
 '/wiki/John_Quincy_Adams',
 '/wiki/Andrew_Jackson',
 '/wiki/Martin_Van_Buren',
 '/wiki/William_Henry_Harrison',
 '/wiki/John_Tyler',
 '/wiki/James_K._Polk',
 '/wiki/Zachary_Taylor',
 '/wiki/Millard_Fillmore',
 '/wiki/Franklin_Pierce',
 '/wiki/James_Buchanan',
 '/wiki/Abraham_Lincoln',
 '/wiki/Andrew_Johnson',
 '/wiki/Ulysses_S._Grant',
 '/wiki/Rutherford_B._Hayes',
 '/wiki/James_A._Garfield',
 '/wiki/Chester_A._Arthur',
 '/wiki/Grover_Cleveland',
 '/wiki/Benjamin_Harrison',
 '/wiki/Grover_Cleveland',
 '/wiki/William_McKinley',
 '/wiki/Theodore_Roosevelt',
 '/wiki/William_Howard_Taft',
 '/wiki/Woodrow_Wilson',
 '/wiki/Warren_G._Harding',
 '/wiki/Calvin_Coolidge',
 '/wiki/Herbert_Hoover',
 '/wiki/Franklin_D._Roosevelt',
 '/wiki/Harry_S._Truman',
 '/wiki/Dwight_D._Eisenhower',
 '/wiki/John_F._Kennedy',
 '/wiki/Lyndon_B._Johnson',
 '/wiki/Richard_Nixon',
 '/wiki/Geral

In [200]:
url = "https://en.wikipedia.org//wiki/George_Washington"
response = requests.get(url)
response.status_code
soup1 = BeautifulSoup(response.content)

In [201]:
soup1.select("h1 span")

[<span class="mw-page-title-main">George Washington</span>]

In [205]:
for i in soup1.select("h1 span"):
    print(i.get_text())

George Washington


In [203]:
soup.select("h1 span")

[<span class="mw-page-title-main">List of presidents of the United States</span>]

In [206]:
#create a loop that creates a soup for every url and then gets the info

In [None]:
name = []
for i in all_presidents:
    response = requests.get(url+i)
    soup = BeautifulSoup(response.content)

##lösung -> make a loop that loops trough all the sub url and extracts info

In [208]:
response = requests.get("https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States")
soup = BeautifulSoup(response.content)

In [210]:
pres_hrefs = []
for i in soup.select("td b a"):
    pres_hrefs.append(i["href"])

In [212]:
pres_hrefs

['/wiki/George_Washington',
 '/wiki/John_Adams',
 '/wiki/Thomas_Jefferson',
 '/wiki/James_Madison',
 '/wiki/James_Monroe',
 '/wiki/John_Quincy_Adams',
 '/wiki/Andrew_Jackson',
 '/wiki/Martin_Van_Buren',
 '/wiki/William_Henry_Harrison',
 '/wiki/John_Tyler',
 '/wiki/James_K._Polk',
 '/wiki/Zachary_Taylor',
 '/wiki/Millard_Fillmore',
 '/wiki/Franklin_Pierce',
 '/wiki/James_Buchanan',
 '/wiki/Abraham_Lincoln',
 '/wiki/Andrew_Johnson',
 '/wiki/Ulysses_S._Grant',
 '/wiki/Rutherford_B._Hayes',
 '/wiki/James_A._Garfield',
 '/wiki/Chester_A._Arthur',
 '/wiki/Grover_Cleveland',
 '/wiki/Benjamin_Harrison',
 '/wiki/Grover_Cleveland',
 '/wiki/William_McKinley',
 '/wiki/Theodore_Roosevelt',
 '/wiki/William_Howard_Taft',
 '/wiki/Woodrow_Wilson',
 '/wiki/Warren_G._Harding',
 '/wiki/Calvin_Coolidge',
 '/wiki/Herbert_Hoover',
 '/wiki/Franklin_D._Roosevelt',
 '/wiki/Harry_S._Truman',
 '/wiki/Dwight_D._Eisenhower',
 '/wiki/John_F._Kennedy',
 '/wiki/Lyndon_B._Johnson',
 '/wiki/Richard_Nixon',
 '/wiki/Geral

In [213]:
#FIRST TEST ON GEORGE WASHINGTON

response = requests.get("https://en.wikipedia.org/wiki/George_Washington")
soup = BeautifulSoup(response.content)
soup.select ( "div.fn")[0].get_text()

'George Washington'

In [217]:
response = requests.get("https://en.wikipedia.org/wiki/George_Washington")
soup = BeautifulSoup(response.content)
soup.select ("span.bday")[0].get_text()

'2019-03-02'

In [225]:
response = requests.get("https://en.wikipedia.org/wiki/Joe_Biden")
soup = BeautifulSoup(response.content)
soup.select ("td.infobox-data div.marriage-display-ws div a")[-1].get_text()

'Jill Jacobs'

In [228]:
len(soup.find("th", string="Children").parent.find_all("a")) #show how many children

4

In [231]:
soup.find("th", string="Children").parent.select("td.infobox-data")[0].get_text(strip=True)

'BeauHunterNaomiAshley'

In [237]:
import pandas as pd

In [238]:
#ACTUAL LOOP

name = []
born = []
spouse = []
children = []

for href in pres_hrefs:
    response = requests.get(base_url+href)
    soup= BeautifulSoup(response.content)
    name.append(soup.select ( "div.fn")[0].get_text())  #loop to append the name
    try:
        born.append(soup.select ("span.bday")[0].get_text())
    except:
         born.append("NA")
    try:
        spouse.append(soup.select ("td.infobox-data div.marriage-display-ws div a")[-1].get_text())
    except:
        spouse.append("NA")
    try:
        childrend.append(soup.find("th", string="Children").parent.select("td.infobox-data")[0].get_text(strip=True))
    except:
        children.append(0)
            
    

In [240]:
presidents_df = pd.DataFrame({"name":name,"born":born, "spouse":spouse,"children":children})

In [241]:
presidents_df

Unnamed: 0,name,born,spouse,children
0,George Washington,2019-03-02,Martha Dandridge,0
1,John Adams,2019-02-23,Abigail Smith,0
2,Thomas Jefferson,1743-04-13,Martha Wayles,0
3,James Madison,1751-03-16,Dolley Payne,0
4,James Monroe,1758-04-28,Elizabeth Kortright,0
5,John Quincy Adams,1767-07-11,Louisa Johnson,0
6,Andrew Jackson,1767-03-15,Rachel Donelson,0
7,Martin Van Buren,1782-12-05,Hannah Hoes,0
8,William Henry Harrison,1773-02-09,Anna Symmes,0
9,John Tyler,1790-03-29,Julia Gardiner,0
