# Intro to web scraping

The first step of web scraping is to identify a website and download its HTML code.

Real HTML from websites tends to be long and a bit too chaotic for a total beginner. Here we start with a dummy HTML document and learn the basics of extracting information with BeautifulSoup.

- You can learn about HTML here: https://www.w3schools.com/html/
- You can use Code Beautify to make your HTML cleaner and more readable: https://codebeautify.org/htmlviewer

In [None]:
html_doc = """ <!DOCTYPE html><html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></html>"""

In [None]:
html_doc

In [None]:
from bs4 import BeautifulSoup

#### "creating the soup"

In [None]:
# parse the HTML document; naming the parser explicitly avoids a bs4 warning
soup = BeautifulSoup(html_doc, "html.parser")

In [None]:
soup

In [None]:
print(soup.prettify())

#### accessing single elements

We can access an HTML tag by appending a dot `.` and the tag's name to `soup`. If the document contains several instances of that tag, only the first one is retrieved.

In [None]:
soup.title

In [None]:
soup.title.parent

In [None]:
soup.html.body.p
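As a small side-by-side sketch of the dot notation: on a tiny stand-in document (the `demo_html` string and `demo_soup` name are made up for illustration, not part of the notebook), attribute-style access gives the same first match as an explicit `find()` call, and a tag that does not exist simply yields `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document, used only for this illustration
demo_html = "<html><body><p>first</p><p>second</p></body></html>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

first_via_dot = demo_soup.p            # attribute-style access
first_via_find = demo_soup.find("p")   # explicit method call

print(first_via_dot == first_via_find)  # both refer to the first <p>
print(first_via_dot.get_text())
print(demo_soup.table)                  # no <table> in the document, so None
```

Because a missing tag gives `None`, chaining further (for example `demo_soup.table.get_text()`) would raise an `AttributeError`, which is worth keeping in mind when scraping real pages.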

<b>Searching with the `find()` function</b>

In [None]:
soup.find("p").get_text()

In [None]:
# this method only retrieves the first element of the specified tag
soup.p

#### finding all elements of a tag with the powerful find_all()

In [None]:
p_tags = soup.find_all("p")
p_tags

To get the text from the corresponding HTML code, we can use the `get_text()` method.

In [None]:
for p in p_tags:
    print(p.get_text())

## Return the 3 names of the sisters

In [None]:
print(soup.prettify())

In [None]:
a_tags = soup.find_all('a')

In [None]:
for name in a_tags:
    print(name.text)
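Besides the text, each `<a>` tag also carries attributes such as `href`. The sketch below uses a made-up snippet (`demo_html` is an assumption, with one link deliberately missing its `href`) to show two ways of reading an attribute: square brackets, which fail on a missing attribute, and `.get()`, which returns `None` instead.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> has no href on purpose
demo_html = (
    '<p><a href="http://example.com/elsie" id="link1">Elsie</a> '
    '<a id="no-href">Nobody</a></p>'
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

links = []
for a in demo_soup.find_all("a"):
    # a["href"] would raise KeyError on the second tag; .get() returns None
    links.append((a.get_text(), a.get("href")))

print(links)
```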

## Using CSS selectors
Another way to find content is the `select()` method, which takes a CSS selector.

Let's first learn the syntax of CSS selectors by playing this game: https://flukeout.github.io/

Everyone should reach level 6!

More background on CSS: https://www.w3schools.com/css/css_howto.asp

In [None]:
soup.select("a")

In [None]:
for a in soup.select('a'):
    print(a.get_text())

In [None]:
soup.select('p')

In [None]:
soup.select("p")[0]

With CSS selectors, you can search directly by CSS class!

In [None]:
soup.select(".title")

<b>Compared with `find_all()`:</b>

In [None]:
soup.find_all("a", class_="sister")

In [None]:
soup.select("a.sister")

<b>You can also search directly by `id` attribute</b>

In [None]:
soup.select("#link2")

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

However, `get_text()` can only be applied to a single element, while `select()` may return several. It's common to iterate over the output of `select()`:

In [None]:
print(soup.select(".story"))

In [None]:
for p in soup.select("p.story"):
    print(p.get_text())
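When only the first match is needed, bs4 also offers `select_one()`, which returns a single element rather than a list. The following is a minimal sketch on a made-up snippet (`demo_html` and the `p.nope` selector are illustrative assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two paragraphs of the same class
demo_html = '<p class="story">one</p><p class="story">two</p>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

matches = demo_soup.select("p.story")      # list-like collection of tags
texts = [p.get_text() for p in matches]    # apply get_text() to each element

first = demo_soup.select_one("p.story")    # a single tag
missing = demo_soup.select_one("p.nope")   # nothing matches, so None

print(texts)
print(first.get_text())
print(missing)
```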



Using the `geography` document defined further down, write code to print the following contents (only the human-readable text, not the HTML tags):

1. All the "fun facts". 

2. The names of all the places. 

3. The content (name and fact) of all the cities (only cities, not countries!) 

4. The names (not facts!) of all the cities (not countries!)

In [None]:
# 1. all the fun facts
for p in soup.select("p"):
    print(p.get_text())

print()

# 2. the names of all the places
for place in soup.select("h2"):
    print(place.get_text())

In [None]:
# 3. the content (name and fact) of all the cities
for city in soup.find_all("div", class_="city"):
    print(city.get_text())

In [None]:
# 4. the names of all the cities
for city in soup.select(".city h2"):
    print(city.get_text())

In [None]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [None]:
soup = BeautifulSoup(geography, 'html.parser')

print(soup.prettify())

In [None]:
# Your code goes here

## Use case: scraping timeout.com

In [None]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [None]:
# 2. find url and store it in a variable
url = "https://www.timeout.com/film/best-movies-of-all-time"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
soup

In [None]:
top100m = []
for title in soup.select("h3"):
    top100m.append(title.get_text())
top100m

In [None]:
# drop the last <h3> collected (presumably not a movie title)
top100m.pop()

In [None]:
top100m

In [None]:
best_movies = pd.DataFrame({"movie_title":top100m})

In [None]:
best_movies

### scraping gutenberg.org

In [None]:
url = "https://gutenberg.org/ebooks/search/?sort_order=downloads"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
books = []
for book in soup.select("li.booklink span.title"):
    books.append(book.get_text())

In [None]:
books

In [None]:
authors = []
for author in soup.select("li.booklink span.subtitle"):
    authors.append(author.get_text())
authors

In [None]:
best_books_df = pd.DataFrame({"books":books, "author":authors})
best_books_df
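Collecting titles and authors in two separate passes works as long as every entry has both. If one entry lacks a subtitle, the two lists end up with different lengths and `pd.DataFrame` raises a `ValueError`. One possible alternative, sketched here on made-up markup that only mimics the `li.booklink` structure (the HTML below is an assumption, not the real page), is to loop over each entry once and look up both fields inside it:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second entry deliberately has no subtitle
demo_html = """
<ul>
  <li class="booklink"><span class="title">Book A</span><span class="subtitle">Author A</span></li>
  <li class="booklink"><span class="title">Book B</span></li>
</ul>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

rows = []
for item in demo_soup.select("li.booklink"):
    title = item.select_one("span.title")
    subtitle = item.select_one("span.subtitle")
    rows.append({
        "books": title.get_text(strip=True) if title is not None else None,
        "author": subtitle.get_text(strip=True) if subtitle is not None else None,
    })

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, and each title stays aligned with its own author.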

# Using the requests package

In [None]:
# 3. download html with a get request 
response = requests.get(url)

In [None]:
response.status_code # 200 status code means OK!

### HTTP Response status codes 
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
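To get a feel for the status code families without making a request, Python's standard library ships the official reason phrases in `http.HTTPStatus`. The helper below, `describe_status`, is a hypothetical name introduced only for this sketch:

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '404 Not Found (client error)'."""
    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirect"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "informational"
    # HTTPStatus(code) raises ValueError for codes it does not know
    return f"{code} {HTTPStatus(code).phrase} ({family})"


for code in (200, 404, 503):
    print(describe_status(code))
```

With `requests`, calling `response.raise_for_status()` is a common way to stop early on a 4xx or 5xx response instead of parsing an error page.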

In [None]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
soup

#### Building the dataframe

In [None]:
#your code here

### Cleaning the data

In [None]:
# your code here

In [129]:
url = "https://www.nytimes.com/books/best-sellers/hardcover-nonfiction/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [130]:
response.status_code

200

In [138]:
titles = []
for name in soup.select("h3.css-5pe77f"):
    titles.append(name.get_text())
titles

['THE WAGER',
 'OATH AND HONOR',
 'OUTLIVE',
 'THE WOMAN IN ME',
 'THE KINGDOM, THE POWER, AND THE GLORY',
 'FRIENDS, LOVERS, AND THE BIG TERRIBLE THING',
 'MADNESS',
 'ELON MUSK',
 'READ WRITE OWN',
 'THE GRIFT',
 'THE IN-BETWEEN',
 'FIND ME THE VOTES',
 'LEGACY',
 "I'M GLAD MY MOM DIED",
 'OUR HIDDEN CONVERSATIONS']

In [141]:
authors = []
for author in soup.select("p.css-hjukut"):
    authors.append(author.get_text())

In [142]:
authors

['by David Grann',
 'by Liz Cheney',
 'by Peter Attia with Bill Gifford',
 'by Britney Spears',
 'by Tim Alberta',
 'by Matthew Perry',
 'by Antonia Hylton',
 'by Walter Isaacson',
 'by Chris Dixon',
 'by Clay Cane',
 'by Hadley Vlahos',
 'by Michael Isikoff and Daniel Klaidman',
 'by Uché Blackstock',
 'by Jennette McCurdy',
 'by Michele Norris']

Retrieve the description of each book.

Create a DataFrame with 3 columns:
- title
- author
- description

Clean the data with the following steps:
- Convert the titles to title case (first letter of each word capitalized, the rest lower case)
- Remove the leading "by" from the authors

In [144]:

camel_cased_words = [word.title() for word in titles]
camel_cased_words

['The Wager',
 'Oath And Honor',
 'Outlive',
 'The Woman In Me',
 'The Kingdom, The Power, And The Glory',
 'Friends, Lovers, And The Big Terrible Thing',
 'Madness',
 'Elon Musk',
 'Read Write Own',
 'The Grift',
 'The In-Between',
 'Find Me The Votes',
 'Legacy',
 "I'M Glad My Mom Died",
 'Our Hidden Conversations']
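Note the `"I'M Glad My Mom Died"` entry in the output: `str.title()` treats the apostrophe as a word boundary and capitalizes the letter after it. One possible alternative is `string.capwords()` from the standard library, which splits on whitespace only. The list `titles_demo` below is a two-item stand-in for the scraped titles:

```python
import string

# Two titles copied from the scraped list, as a small stand-in
titles_demo = ["I'M GLAD MY MOM DIED", "THE WAGER"]

with_title = [t.title() for t in titles_demo]
with_capwords = [string.capwords(t) for t in titles_demo]

print(with_title)
print(with_capwords)
```

The trade-off is that `capwords()` also lowercases the part after a hyphen, so `'THE IN-BETWEEN'` becomes `'The In-between'`.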

In [145]:
# split only on the first "by" so that names containing "by" stay intact
authors_noby = [author.split("by", 1)[1].strip() for author in authors]
authors_noby

['David Grann',
 'Liz Cheney',
 'Peter Attia with Bill Gifford',
 'Britney Spears',
 'Tim Alberta',
 'Matthew Perry',
 'Antonia Hylton',
 'Walter Isaacson',
 'Chris Dixon',
 'Clay Cane',
 'Hadley Vlahos',
 'Michael Isikoff and Daniel Klaidman',
 'Uché Blackstock',
 'Jennette McCurdy',
 'Michele Norris']

In [147]:
descriptions = []
for desc in soup.select("p.css-14lubdp"):
    descriptions.append(desc.get_text())
descriptions

['The survivors of a shipwrecked British vessel on a secret mission during an imperial war with Spain have different accounts of events.',
 'The former congresswoman from Wyoming recounts how she helped lead the Select Committee to Investigate the Jan. 6. Attack on the United States Capitol.',
 'A look at recent scientific research on aging and longevity.',
 'The Grammy Award-winning pop star details her personal and professional experiences, including the years she spent under a conservatorship overseen by her father.',
 'The author of “American Carnage” looks at divisions within the American evangelical movement.',
 'The late actor, known for playing Chandler Bing on “Friends,” shares stories from his childhood and his struggles with sobriety.',
 'A Peabody and Emmy award-winning journalist unearths the 93-year-old history of a segregated asylum in Maryland.',
 'The author of “The Code Breaker” traces Musk’s life and summarizes his work on electric vehicles, private space exploration

In [148]:
nybs = {"title":camel_cased_words,"authors":authors_noby, "description":descriptions}

In [149]:
nybs_df = pd.DataFrame(nybs)

In [150]:
nybs_df

|   | title | authors | description |
|---|---|---|---|
| 0 | The Wager | David Grann | The survivors of a shipwrecked British vessel ... |
| 1 | Oath And Honor | Liz Cheney | The former congresswoman from Wyoming recounts... |
| 2 | Outlive | Peter Attia with Bill Gifford | A look at recent scientific research on aging ... |
| 3 | The Woman In Me | Britney Spears | The Grammy Award-winning pop star details her ... |
| 4 | The Kingdom, The Power, And The Glory | Tim Alberta | The author of “American Carnage” looks at divi... |
| 5 | Friends, Lovers, And The Big Terrible Thing | Matthew Perry | The late actor, known for playing Chandler Bin... |
| 6 | Madness | Antonia Hylton | A Peabody and Emmy award-winning journalist un... |
| 7 | Elon Musk | Walter Isaacson | The author of “The Code Breaker” traces Musk’s... |
| 8 | Read Write Own | Chris Dixon | A technology entrepreneur describes three eras... |
| 9 | The Grift | Clay Cane | An overview of Black Republicanism from the ti... |


## Scraping US Presidents

In [151]:
url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [152]:
response.status_code

200

In [160]:
soup.select("td b a")[0]["href"]

'/wiki/George_Washington'

In [161]:
presidents_href = []
for item in soup.select("td b a"):
    presidents_href.append(item["href"])
presidents_href

['/wiki/George_Washington',
 '/wiki/John_Adams',
 '/wiki/Thomas_Jefferson',
 '/wiki/James_Madison',
 '/wiki/James_Monroe',
 '/wiki/John_Quincy_Adams',
 '/wiki/Andrew_Jackson',
 '/wiki/Martin_Van_Buren',
 '/wiki/William_Henry_Harrison',
 '/wiki/John_Tyler',
 '/wiki/James_K._Polk',
 '/wiki/Zachary_Taylor',
 '/wiki/Millard_Fillmore',
 '/wiki/Franklin_Pierce',
 '/wiki/James_Buchanan',
 '/wiki/Abraham_Lincoln',
 '/wiki/Andrew_Johnson',
 '/wiki/Ulysses_S._Grant',
 '/wiki/Rutherford_B._Hayes',
 '/wiki/James_A._Garfield',
 '/wiki/Chester_A._Arthur',
 '/wiki/Grover_Cleveland',
 '/wiki/Benjamin_Harrison',
 '/wiki/Grover_Cleveland',
 '/wiki/William_McKinley',
 '/wiki/Theodore_Roosevelt',
 '/wiki/William_Howard_Taft',
 '/wiki/Woodrow_Wilson',
 '/wiki/Warren_G._Harding',
 '/wiki/Calvin_Coolidge',
 '/wiki/Herbert_Hoover',
 '/wiki/Franklin_D._Roosevelt',
 '/wiki/Harry_S._Truman',
 '/wiki/Dwight_D._Eisenhower',
 '/wiki/John_F._Kennedy',
 '/wiki/Lyndon_B._Johnson',
 '/wiki/Richard_Nixon',
 '/wiki/Geral
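The `href` values above are relative paths, so they need the site's address in front before they can be requested. String concatenation works; as an alternative sketch, `urllib.parse.urljoin` from the standard library resolves a relative link against the page it was found on and leaves links that are already absolute untouched. The names `base`, `hrefs_demo` and `full_urls` are illustrative:

```python
from urllib.parse import urljoin

# The page the links were scraped from
base = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

# Two of the scraped paths, as a small stand-in for presidents_href
hrefs_demo = ["/wiki/George_Washington", "/wiki/John_Adams"]

full_urls = [urljoin(base, href) for href in hrefs_demo]
print(full_urls)

# An absolute URL passes through unchanged
print(urljoin(base, "https://example.com/page"))
```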

In [163]:
name = []
political_party = []
number_of_c = []
occupation = []
for prez in presidents_href:
    url = "https://en.wikipedia.org" + prez
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # your code goes here

prez_dict = {"name": name,
             "party": political_party,
             "children": number_of_c,
             "occupation": occupation}

prez_df = pd.DataFrame(prez_dict)
prez_df.head()



<Response [200]>

Create a DataFrame of US Presidents with columns:
- Name of President
- Political Party
- Number of Children
- Occupation

In [166]:
url = "https://en.wikipedia.org/wiki/George_Washington"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [171]:
soup.find("th", string = "Political party").parent.find("a").get_text()

'Independent'
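The lookup above follows a pattern: find the `<th>` with a given label, step up to its row, then read the neighbouring cell. Not every page has every row, so it can help to wrap the pattern in a small function that returns `None` when the label is missing. The helper name `infobox_value` and the `demo_html` table are assumptions made for this sketch, and it reads the whole `<td>` rather than only the first link:

```python
from bs4 import BeautifulSoup


def infobox_value(soup, label):
    """Return the text of the <td> in the row whose <th> is `label`, or None."""
    header = soup.find("th", string=label)
    if header is None:
        return None  # the page has no row with this label
    cell = header.parent.find("td")
    return cell.get_text(strip=True) if cell is not None else None


# Hypothetical one-row table imitating the structure used above
demo_html = """
<table>
  <tr><th>Political party</th><td><a href="/wiki/Independent">Independent</a></td></tr>
</table>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

print(infobox_value(demo_soup, "Political party"))
print(infobox_value(demo_soup, "Children"))
```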

In [174]:
president_soups = []
for href in presidents_href:
    url = "https://en.wikipedia.org" + href
    response = requests.get(url)
    print(response.status_code)
    soup = BeautifulSoup(response.content, "html.parser")
    president_soups.append(soup)

200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200


In [175]:
soup.select("div.fn")

[<div class="fn" style="font-size:125%;">Joe Biden</div>]

In [180]:
name = []
for soup in president_soups:
    name.append(soup.select("div.fn")[0].get_text())
print(name, len(name))

['George Washington', 'John Adams', 'Thomas Jefferson', 'James Madison', 'James Monroe', 'John Quincy Adams', 'Andrew Jackson', 'Martin Van Buren', 'William Henry Harrison', 'John Tyler', 'James K. Polk', 'Zachary Taylor', 'Millard Fillmore', 'Franklin Pierce', 'James Buchanan', 'Abraham Lincoln', 'Andrew Johnson', 'Ulysses S. Grant', 'Rutherford B. Hayes', 'James A. Garfield', 'Chester A. Arthur', 'Grover Cleveland', 'Benjamin Harrison', 'Grover Cleveland', 'William McKinley', 'Theodore Roosevelt', 'William Howard Taft', 'Woodrow Wilson', 'Warren G. Harding', 'Calvin Coolidge', 'Herbert Hoover', 'Franklin D. Roosevelt', 'Harry S. Truman', 'Dwight D. Eisenhower', 'John F. Kennedy', 'Lyndon B. Johnson', 'Richard Nixon', 'Gerald Ford', 'Jimmy Carter', 'Ronald Reagan', 'George H. W. Bush', 'Bill Clinton', 'George W. Bush', 'Barack Obama', 'Donald Trump', 'Joe Biden'] 46


In [189]:
party = []
for soup in president_soups:
    party.append(soup.find("th", string = "Political party").parent.find("a").get_text())
party
print(len(party))

46


In [195]:
children = []
for soup in president_soups:
    try:
        children.append(soup.find("th", string="Children").parent.find("td").get_text(strip=True))
    except AttributeError:  # find() returned None: no "Children" row in the infobox
        children.append(0)
print(len(children))

46


In [196]:
children

[0,
 '6, includingAbigail,John Quincy,Charles, andThomas',
 '6 with Martha Wayles, including:Martha Jefferson RandolphMary Jefferson EppesUp to 6 withSally Hemings,[a]including:Madison HemingsEston Hemings',
 0,
 '3, includingElizaandMaria',
 '4, includingGeorge,John IIandCharles',
 '2, includingLyncoya',
 '5, includingAbraham IIandJohn',
 '10, includingJohn',
 '15',
 0,
 '6, includingSarah,Mary, andRichard',
 'MillardMary',
 '3',
 0,
 'RobertEdwardWillieTad',
 'MarthaCharlesMaryRobertAndrew Jr.',
 'FrederickUlysses Jr.NellieJesse II',
 '8, includingWebb C. HayesandRutherford P. Hayes',
 '7, includingHal,James,andAbram',
 'WilliamChester IIEllen',
 '6, includingRuth,Esther,Richard, andFrancis',
 'RussellMaryElizabeth',
 '6, includingRuth,Esther,Richard, andFrancis',
 '2',
 0,
 'RobertHelenCharles II',
 'MargaretJessieEleanor',
 'Elizabeth(withNan Britton)',
 '2, includingJohn',
 'Herbert Jr.Allan',
 0,
 'Margaret',
 'DoudJohn',
 '4, includingCaroline,John\xa0Jr., andPatrick',
 'LyndaLu
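The scraped values mix plain numbers, numbers followed by names, and runs of names with no number at all. As a first cleaning step, one option is to keep the leading integer when there is one and mark everything else as unknown. The function name `leading_count` and the `samples` list (copied from the output above) are illustrative:

```python
import re


def leading_count(value):
    """Return the integer a scraped 'Children' value starts with, else None."""
    if isinstance(value, int):
        return value  # the 0 fallback appended when no row was found
    match = re.match(r"\d+", value)
    return int(match.group()) if match else None


samples = [
    0,
    "15",
    "6, includingAbigail,John Quincy,Charles, andThomas",
    "MillardMary",
]
counts = [leading_count(s) for s in samples]
print(counts)
```

Entries that come back as `None` (concatenated names such as `'MillardMary'`) cannot be counted reliably from the flattened string and would need another approach, such as counting list items in the original HTML. Also note that the `0` fallback means "no Children row found", which is not necessarily the same as having no children.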

In [203]:
soup.find("th", string = "Occupation").parent.select("ul")

[<ul><li>Politician</li>
 <li>lawyer</li>
 <li>author</li></ul>]

In [204]:
prez_occ = []
for job in soup.find("th", string = "Occupation").parent.select("li"):
    prez_occ.append(job.get_text(strip=True))
prez_occ


['Politician', 'lawyer', 'author']

In [205]:
occupation = []
for soup in president_soups:
    try:
        prez_occ = []
        for job in soup.find("th", string="Occupation").parent.select("li"):
            prez_occ.append(job.get_text(strip=True))
        occupation.append(prez_occ)
    except AttributeError:  # find() returned None: no "Occupation" row in the infobox
        occupation.append("President")
print(len(occupation))

46


In [206]:
occupation

[['Planter', 'military officer', 'statesman', 'surveyor'],
 ['Politician', 'lawyer'],
 ['Politician', 'lawyer'],
 'President',
 ['Politician', 'lawyer'],
 ['Politician', 'lawyer'],
 ['Politician', 'lawyer', 'general'],
 ['Politician', 'lawyer'],
 ['Soldier', 'politician'],
 'President',
 ['Politician', 'lawyer'],
 'President',
 ['Politician', 'lawyer'],
 'President',
 ['Politician', 'lawyer'],
 ['Politician', 'lawyer'],
 'President',
 ['Military officer', 'politician'],
 ['Politician', 'lawyer'],
 ['Politician', 'lawyer', 'amateur mathematician'],
 'President',
 ['Politician', 'lawyer'],
 ['Politician', 'lawyer'],
 ['Politician', 'lawyer'],
 'President',
 ['Author',
  'conservationist',
  'explorer',
  'historian',
  'naturalist',
  'police commissioner',
  'politician',
  'soldier'],
 ['Politician', 'lawyer'],
 ['Academic', 'politician'],
 ['Journalist', 'politician'],
 ['Politician', 'lawyer'],
 'President',
 'President',
 ['Farmer', 'haberdasher', 'politician'],
 ['Military officer'
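The `occupation` list now holds a mix of lists and plain strings, which is awkward to store in a single DataFrame column. A minimal sketch of one way to even this out is to join each list into a comma-separated string. The helper name `occupations_to_text` and the two sample values are illustrative:

```python
def occupations_to_text(value):
    """Join a list of occupations into one string; pass plain strings through."""
    if isinstance(value, list):
        return ", ".join(value)
    return value


# One list entry and one fallback string, as in the output above
samples = [["Politician", "lawyer"], "President"]
normalized = [occupations_to_text(v) for v in samples]
print(normalized)
```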

In [None]:
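name = []
party = []
occupation = []
children = []
for soup in president_soups:
    name.append(soup.select("div.fn")[0].get_text())
    party.append(soup.find("th", string="Political party").parent.find("a").get_text())
    try:
        jobs = soup.find("th", string="Occupation").parent.select("li")
        occupation.append([job.get_text(strip=True) for job in jobs])
    except AttributeError:  # no "Occupation" row in the infobox
        occupation.append("President")
    try:
        children.append(soup.find("th", string="Children").parent.find("td").get_text(strip=True))
    except AttributeError:  # no "Children" row in the infobox
        children.append(0)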