So far, we have practiced web scraping when all the information is in a single table on a single page of a site. What happens when we want to scrape information from multiple pages?

Go to https://www.imdb.com/search/title/ and enter the following parameters, leaving all other fields blank or at their default values:

- Title Type: Feature film

- Release date: From 1990 to 1992

- User Rating: 7.5 to "-"

The page you get should be familiar. There's a list of movies, and each movie has its title, release year, crew, etc. You could inspect the page and build the code to collect the data.

However, the results we obtained contain 631 movies, and each page only shows 50 of them (you can change the settings to show up to 250 movies per page, but that still won't fit all the results on a single page).

The way to automate web scraping in these cases is to look at the URLs. The one we've obtained is the following:

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,

If you scroll down and click on "Next", the URL is now: https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt

Click again on "Next" and here's the new URL: https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=101&ref_=adv_nxt

The pattern is clear: our search options are in the parameters title_type, release_date and user_rating. Then we have the start parameter, which increases in steps of 50, and the ref_ parameter, which takes the value "adv_nxt".
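As a minimal sketch of that pattern, the query string can be assembled from a dictionary of parameters with the standard library instead of by hand. The helper name `build_search_url` is hypothetical; the parameter names and values are the ones visible in the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"

def build_search_url(start):
    """Return the search URL for the results page that begins at `start`."""
    params = {
        "title_type": "feature",
        "release_date": "1990-01-01,1992-12-31",
        "user_rating": "7.5,",
        "start": start,
        "ref_": "adv_nxt",
    }
    # safe="," keeps the commas as they are instead of encoding them as %2C
    return BASE_URL + "?" + urlencode(params, safe=",")

print(build_search_url(51))
```

Changing only the `start` argument then produces the URL of any other results page.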

Let's do some requests:

In [10]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests

In [11]:
# 2. url: we start with the 'second' page
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt"

In [12]:
# 3. download html with a get request
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [13]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
#soup

Now we'll build a loop that replaces the 51 with each of the other start values (jumping by 50) up to the end of the results. For simplicity, we hard-code the total number of results (631) and generate the list of values to iterate through with range:



In [14]:
iterations = range(1, 631, 50)

for i in iterations:
    start_at= str(i)
    url = "https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=" + start_at + "&ref_=adv_nxt"
    print(url)

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=1&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=101&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=151&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=201&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=251&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=301&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating

Respectful scraping:

Before starting with the actual scraping, though, there's something we need to note when sending massive, automated requests to websites: it's rude.

We just have 13 of them, which is not too many, but it's still a good practice to let a few seconds pass in between requests. Some pages don't like being scraped and will block your IP if they detect it's sending automated requests. Others might have a small server for the traffic they handle, and sending too many requests might crash the site.

The sleep function from the time module will help us with that. Here's how it works, waiting a random number of seconds (between 1 and 4) after each iteration of a for loop:

In [15]:
# To make it more "human", we can randomize the waiting time:
from time import sleep
from random import randint

for i in range(5):
    print(i)
    wait_time = randint(1,4)
    print("I will sleep for " + str(wait_time) + " seconds.")
    sleep(wait_time)

0
I will sleep for 4 seconds.
1
I will sleep for 4 seconds.
2
I will sleep for 4 seconds.
3
I will sleep for 4 seconds.
4
I will sleep for 2 seconds.


We will now scrape all the pages and store each response in a list, waiting a few seconds between requests:

In [16]:
pages = []

for i in iterations:

    # assemble the url:
    start_at= str(i)
    url = "https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=" + start_at + "&ref_=adv_nxt"

    # download html with a get request:
    response = requests.get(url)

    # monitor the process by printing the status code
    print("Status code: " + str(response.status_code))

    # store response into "pages" list
    pages.append(response)

    # respectful nap:
    wait_time = randint(1,4)
    print("I will sleep for " + str(wait_time) + " second/s.")
    sleep(wait_time)

Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 3 second/s.
Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 3 second/s.
Status code: 200
I will sleep for 3 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 2 second/s.
Status code: 200
I will sleep for 3 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 2 second/s.


Note that if you print the pages object after running the code above, you'll just see the response status codes (each item is displayed as <Response [200]>), but the HTML is still accessible and you can parse it the same way we've always done:

In [25]:
#BeautifulSoup(pages[0].content, "html.parser")
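Since the same "request, store, nap" pattern comes up whenever several pages are scraped, it could be wrapped in a small function. This is only a sketch: the name `fetch_all` and its parameters are hypothetical, and unlike the loop above it skips the nap after the last request. Passing the fetching and pausing functions as arguments makes it possible to try the logic without sending any real request.

```python
from random import randint
from time import sleep

def fetch_all(urls, fetch, pause=sleep, min_wait=1, max_wait=4):
    """Call fetch(url) for every URL, napping a random time between calls."""
    responses = []
    for position, url in enumerate(urls):
        responses.append(fetch(url))
        # no need to wait after the last request
        if position < len(urls) - 1:
            pause(randint(min_wait, max_wait))
    return responses

# Dry run with stand-in functions: nothing is downloaded and nothing sleeps
naps = []
demo = fetch_all(["page-1", "page-2", "page-3"], fetch=str.upper, pause=naps.append)
print(demo, len(naps))
```

With real pages, the call would look like `fetch_all(list_of_urls, requests.get)`.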

Now it's time to build the code that collects all 631 movie titles and their synopses in a dataframe.

#### titles

In [19]:
# Parse just the first page, for testing purposes.
# Each page is a separate item in the "pages" list, so select it by index.
soup = BeautifulSoup(pages[0].content, "html.parser")

# Paste the Selector from the first movie title copied from Chrome Dev Tools
soup.select("#main > div > div.lister.list.detail.sub-list > div > div:nth-child(1) > div.lister-item-content > h3 > a")

# Trim the selection: now it grabs all the titles
soup.select("div.lister-item-content > h3 > a")

[<a href="/title/tt0099685/">GoodFellas - Drei Jahrzehnte in der Mafia</a>,
 <a href="/title/tt0102926/">Das Schweigen der Lämmer</a>,
 <a href="/title/tt0105236/">Reservoir Dogs: Wilde Hunde</a>,
 <a href="/title/tt0101921/">Grüne Tomaten</a>,
 <a href="/title/tt0105695/">Erbarmungslos</a>,
 <a href="/title/tt0099487/">Edward mit den Scherenhänden</a>,
 <a href="/title/tt0103064/">Terminator 2: Tag der Abrechnung</a>,
 <a href="/title/tt0099785/">Kevin - Allein zu Haus</a>,
 <a href="/title/tt0104257/">Eine Frage der Ehre</a>,
 <a href="/title/tt0100802/">Total Recall - Die totale Erinnerung</a>,
 <a href="/title/tt0099348/">Der mit dem Wolf tanzt</a>,
 <a href="/title/tt0104691/">Der letzte Mohikaner</a>,
 <a href="/title/tt0099674/">Der Pate 3</a>,
 <a href="/title/tt0106308/">Armee der Finsternis</a>,
 <a href="/title/tt0105323/">Der Duft der Frauen</a>,
 <a href="/title/tt0099810/">Jagd auf Roter Oktober</a>,
 <a href="/title/tt0104952/">Mein Vetter Winnie</a>,
 <a href="/title/tt

#### synopsis

In [20]:
# Paste the Selector from the first movie synopsis copied from Chrome Dev Tools
soup.select("#main > div > div.lister.list.detail.sub-list > div > div:nth-child(1) > div.lister-item-content > p:nth-child(4)")



[<p class="text-muted">
 The story of <a href="/name/nm1453737">Henry Hill</a> and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.</p>]

In [21]:
# Trim the selection: now it grabs all the synopses
soup.select("div.lister-item-content > p:nth-child(4)")

[<p class="text-muted">
 The story of <a href="/name/nm1453737">Henry Hill</a> and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.</p>,
 <p class="text-muted">
 A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.</p>,
 <p class="text-muted">
 When a simple jewelry heist goes horribly wrong, the surviving criminals begin to suspect that one of them is a police informant.</p>,
 <p class="text-muted">
 A housewife who is unhappy with her life befriends an old lady in a nursing home and is enthralled by the tales she tells of people she used to know.</p>,
 <p class="text-muted">
 Retired Old West gunslinger William Munny reluctantly takes on one last job, with the help of his old partner Ned Logan and a young man, The "Schofield Kid."</p>,
 <p class="text-muted">
 A

There are many ways to do this. The one we'll follow is:

- Loop through the pages we collected, parse them ("create the soup") and store the parsed pages in a list. 

- For each parsed page, select the "blocks of HTML elements" that contain all the information of each movie (the title, the synopsis and other stuff). 

- For each one of the "blocks" we collected in the previous step: 

    - Get the movie titles and store them in a list 

    - Get the synopsis and store them in a list

In [22]:
pages_parsed = []
titles = []
synopsis = []

for i in range(len(pages)):
    # parse all pages
    pages_parsed.append(BeautifulSoup(pages[i].content, "html.parser"))
    # select only the info about the movies
    movies_html = pages_parsed[i].select("div.lister-item-content")
    # for each movie, store its title and synopsis in the lists
    for j in range(len(movies_html)):
        titles.append(movies_html[j].select("h3 > a")[0].get_text())
        synopsis.append(movies_html[j].select("p:nth-child(4)")[0].get_text().strip())


# Checking our output:
print(len(titles)) # expected: 631 (the run shown below returned 540)
print(len(synopsis))  # expected: 631 (the run shown below returned 540)

# Note: in your output the movie titles might be in English:
titles[0:3] # output: ['El silencio de los corderos', 'Uno de los nuestros', 'Solo en casa']
synopsis[0:3] #output: ['A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.', 'The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.', 'An eight-year-old troublemaker must protect his house from a pair of burglars when he is accidentally left home alone by his family during Christmas vacation.']

540
540


['The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.',
 'A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.',
 'When a simple jewelry heist goes horribly wrong, the surviving criminals begin to suspect that one of them is a police informant.']
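The two lists can then be combined into the dataframe mentioned earlier. A minimal sketch, assuming pandas is installed and that both lists have the same length; the short sample lists below are stand-ins copied from the outputs shown above so that the snippet runs on its own, and the column names are just one possible choice.

```python
import pandas as pd

# Stand-ins for the full "titles" and "synopsis" lists
titles_sample = [
    "Das Schweigen der Lämmer",
    "Reservoir Dogs: Wilde Hunde",
]
synopsis_sample = [
    "A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.",
    "When a simple jewelry heist goes horribly wrong, the surviving criminals begin to suspect that one of them is a police informant.",
]

# fail early if the two lists got out of step while scraping
assert len(titles_sample) == len(synopsis_sample)

movies_df = pd.DataFrame({"title": titles_sample, "synopsis": synopsis_sample})
print(movies_df.shape)
```

Each movie becomes a row, with one column per list.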

#### Scraping presidents

Our objective is to create a dataframe with information about the presidents of the United States. To do this, we will go through these steps:

1. Scrape this [list of presidents of the United States](https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States).


In [27]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd


# 2. find url and store it in a variable
url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

# 3. download html with a get request
response = requests.get(url)
response.status_code # 200 status code means OK!

# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
#soup

In [28]:
# this solution is not very elegant, but works. 
# The CSS selector we copied had an "nth-child" that we could iterate 
# to find presidents, but some elements were empty, so we concatenate 
# each new element with "+" instead of appending as usual:

presidents = []

for i in range(95):
    presidents = presidents + soup.select("tbody > tr:nth-child(" + str(i) + ") > td:nth-child(4) > b > a")

# check the output:
presidents

[<a href="/wiki/George_Washington" title="George Washington">George Washington</a>,
 <a href="/wiki/John_Adams" title="John Adams">John Adams</a>,
 <a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>,
 <a href="/wiki/James_Madison" title="James Madison">James Madison</a>,
 <a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>,
 <a href="/wiki/John_Quincy_Adams" title="John Quincy Adams">John Quincy Adams</a>,
 <a href="/wiki/Andrew_Jackson" title="Andrew Jackson">Andrew Jackson</a>,
 <a href="/wiki/Martin_Van_Buren" title="Martin Van Buren">Martin Van Buren</a>,
 <a href="/wiki/William_Henry_Harrison" title="William Henry Harrison">William Henry Harrison</a>,
 <a href="/wiki/John_Tyler" title="John Tyler">John Tyler</a>,
 <a href="/wiki/James_K._Polk" title="James K. Polk">James K. Polk</a>,
 <a href="/wiki/Zachary_Taylor" title="Zachary Taylor">Zachary Taylor</a>,
 <a href="/wiki/Millard_Fillmore" title="Millard Fillmore">Millard Fillmore</a>,
 

2. Collect all the links to the Wikipedia page of each president.


In [30]:
# we can access the links searching for the attribute "href"
# in each element
presidents[0]["href"]

'/wiki/George_Washington'

In [34]:
# Now we assemble the full URL from the link and send a new request
# (the href already starts with "/", so the base URL needs no trailing slash)
url = "https://en.wikipedia.org" + presidents[0]["href"]
response = requests.get(url)
response.status_code

# parse & store html
soup = BeautifulSoup(response.content, "html.parser")
#soup.find("table", {"class":"infobox vcard"})
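When gluing a base URL and a relative link together by hand, it is easy to end up with a doubled or a missing slash. As an alternative sketch, `urljoin` from the standard library resolves a relative `href` against a base URL; the variable names below are illustrative.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/"
href = "/wiki/George_Washington"  # the value returned by presidents[0]["href"]

full_url = urljoin(base_url, href)
print(full_url)  # https://en.wikipedia.org/wiki/George_Washington
```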

3. Scrape the Wikipedia page of each president.


In this step we could very well store the whole Wikipedia page for each president, or just the tiny, final pieces of information. Storing the infoboxes is a middle ground (we don't keep too much noise but retain the flexibility of deciding later which specific elements to extract).

When sending multiple requests, remember to be respectful by spacing the requests a few seconds from each other. We will also print the status code to monitor that everything is going well:

In [32]:
# request each president's page and keep only its infobox

presi_soups = []

for presi in presidents:
    # send request (the href already starts with "/")
    url = "https://en.wikipedia.org" + presi["href"]
    response = requests.get(url)
    print(presi.get_text(), response.status_code)
    
    # parse & store html
    soup = BeautifulSoup(response.content, "html.parser")
    presi_soups.append(soup.find("table", {"class":"infobox vcard"}))
    
    # respectful nap:
    wait_time = randint(1,2)
    print("I will sleep for " + str(wait_time) + " second/s.")
    sleep(wait_time)

George Washington 200
I will sleep for 2 second/s.
John Adams 200
I will sleep for 1 second/s.
Thomas Jefferson 200
I will sleep for 2 second/s.
James Madison 200
I will sleep for 1 second/s.
James Monroe 200
I will sleep for 2 second/s.
John Quincy Adams 200
I will sleep for 1 second/s.
Andrew Jackson 200
I will sleep for 2 second/s.
Martin Van Buren 200
I will sleep for 1 second/s.
William Henry Harrison 200
I will sleep for 1 second/s.
John Tyler 200
I will sleep for 2 second/s.
James K. Polk 200
I will sleep for 1 second/s.
Zachary Taylor 200
I will sleep for 2 second/s.
Millard Fillmore 200
I will sleep for 2 second/s.
Franklin Pierce 200
I will sleep for 2 second/s.
James Buchanan 200
I will sleep for 1 second/s.
Abraham Lincoln 200
I will sleep for 2 second/s.
Andrew Johnson 200
I will sleep for 1 second/s.
Ulysses S. Grant 200
I will sleep for 1 second/s.
Rutherford B. Hayes 200
I will sleep for 2 second/s.
James A. Garfield 200
I will sleep for 1 second/s.
Chester A. Arthur 20

4. Find and store information about each president.


We extracted the infoboxes: now it's time to extract specific pieces of information from them. Let's test what we can get from a single president and then assemble a loop for all of them, as usual.

Here, we will use [the string argument](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument) in the find function, since Wikipedia's tags and classes are not always helpful for locating elements. The string argument allows us to locate elements by their actual content.

In [33]:
#Birthday
presi_soups[-1].find("span", {"class":"bday"}).get_text()

#Political party
presi_soups[-1].find("th", string="Political party").parent.find("a").get_text()

#Number of sons/daughters
len(presi_soups[-1].find("th", string="Children").parent.find_all("li"))

4
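An infobox may lack one of these rows, and `find` returns `None` when nothing matches, so chaining `.parent` onto a missing row raises an `AttributeError`. Below is a minimal sketch of a more defensive lookup, run on a tiny hand-written table rather than on a real Wikipedia page. The helper name `row_for` and the HTML snippet are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for an infobox (not real Wikipedia markup)
html = """
<table class="infobox vcard">
  <tr><th>Born</th><td><span class="bday">1900-01-01</span></td></tr>
  <tr><th>Political party</th><td><a href="/wiki/Example_Party">Example Party</a></td></tr>
</table>
"""
infobox = BeautifulSoup(html, "html.parser").table

def row_for(infobox, label):
    """Return the <tr> whose header cell reads exactly `label`, or None."""
    header = infobox.find("th", string=label)
    return header.parent if header is not None else None

party_row = row_for(infobox, "Political party")
party = party_row.find("a").get_text() if party_row is not None else None

# The "Children" row is absent in this sample, so we fall back to None
children_row = row_for(infobox, "Children")
n_children = len(children_row.find_all("li")) if children_row is not None else None

print(party, n_children)
```

Returning `None` for missing values keeps a loop over all the infoboxes from stopping at the first incomplete one.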

5. Organize the information in a dataframe where we have each president as a row and each variable we collected as a column.
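One possible way to do this, sketched under assumptions: collect one dictionary per president while looping over the infoboxes, then hand the list of dictionaries to pandas. The records below are placeholders, not scraped values, and the column names are hypothetical.

```python
import pandas as pd

# Placeholder records: in practice each dict would be filled from one infobox
records = [
    {"name": "President A", "birthday": "1900-01-01", "party": "Example Party", "children": 2},
    {"name": "President B", "birthday": "1910-05-15", "party": None, "children": None},
]

# each dict becomes a row, each key becomes a column
presidents_df = pd.DataFrame(records)

# turn the birthday strings into proper dates
presidents_df["birthday"] = pd.to_datetime(presidents_df["birthday"])

print(presidents_df)
```

Missing values stay empty in the dataframe, so they can be inspected or filled in later.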