# Web Scraping multiple pages 

We have practiced web scraping when all the information we wanted was on a single table of a site. What happens when we want to scrape information from multiple pages?

## First example - IMDB 

Go to https://www.imdb.com/search/title/ and enter the following parameters, leaving all other fields blank or at their default values:

- Title Type: Feature film

- Release date: From 1990 to 1992

- User Rating: 7.5 to "-"

The page you get should be familiar. There's a list of movies, and each movie has its title, release year, crew, etc. You could inspect the page and build the code to collect the data.

Note that the query returns hundreds of movies, while each page shows only 50 of them (you can change the settings to show up to 250 movies per page, but that still won't be the complete list).

One way to automate multi-page web scraping is to look at the URLs. The first page of results has this URL:

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,

Now scroll down and click on "Next". The URL becomes:

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt

Can you see the pattern?

Our search options are in the parameters title_type, release_date and user_rating. Then we have the start parameter, which increases in steps of 50, and the ref_ parameter, which takes the value "adv_nxt".
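To make the pattern explicit, the query string can be split into its parameters. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs

# URL of the second results page, copied from above
second_page = (
    "https://www.imdb.com/search/title/?title_type=feature"
    "&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt"
)

# parse_qs returns a dict that maps each parameter name to a list of values
params = parse_qs(urlparse(second_page).query)

for name, values in params.items():
    print(name, "=", values[0])
```

Only start needs to change from page to page; the other parameters stay fixed.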

In [1]:
#  import libraries
from bs4 import BeautifulSoup
import requests

In [2]:
#  url: this time, start with the 'second' page
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt"

In [4]:
# download html with a request, check response code 
response = requests.get(url)
response.status_code

200

In [None]:
#  parse html (create the 'soup')

soup = BeautifulSoup(response.content,'html.parser')

# check that the html code looks as expected 

print(soup.prettify())

Now we need to build a list of start values that jumps by 50, up to the total number of movies we want to scrape.

In [6]:
# define iterations 

iterations = range(1,537,50)
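The upper bound above is hard-coded. As a sketch, the same start values can be derived from the total number of results and the page size. The names total_movies and page_size are assumptions for illustration, and 537 is the count this notebook reports later for this query.

```python
# Assumed values: 537 results in total, 50 results per page
total_movies = 537
page_size = 50

# range() excludes its stop value, so add 1 to keep a page that
# would start exactly at the last movie
iterations = range(1, total_movies + 1, page_size)

print(list(iterations))
```

For this query the result is the same as range(1, 537, 50); the version with + 1 is only safer when the total happens to be one more than a multiple of the page size.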

In [12]:
# check the iterations work

for i in iterations:
    start_at = str(i)
    url="https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=" + start_at + "&ref_=adv_nxt"
    print(url)

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=1&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=101&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=151&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=201&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=251&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=301&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating

In [None]:
# create the url string for the page search, populate with the iterations
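One possible way to fill in this cell, as a sketch: keep the fixed search parameters in a dictionary and let urllib.parse.urlencode assemble the query string. The helper build_url and the names base and search_params are hypothetical. Passing safe="," keeps the commas unescaped so the result matches the URL shown in the browser.

```python
from urllib.parse import urlencode

base = "https://www.imdb.com/search/title/"

# Fixed search options, in the same order as in the browser URL
search_params = {
    "title_type": "feature",
    "release_date": "1990-01-01,1992-12-31",
    "user_rating": "7.5,",
}

def build_url(start):
    """Return the results-page URL that begins at movie number `start`."""
    query = dict(search_params, start=start, ref_="adv_nxt")
    return base + "?" + urlencode(query, safe=",")

urls = [build_url(start) for start in range(1, 537, 50)]
print(urls[1])
```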


In [None]:
# test the urls 


### Respectful scraping:

Before starting with the actual scraping, though, there's something we need to note when sending automated requests to websites: it's good practice to let a few seconds pass in between requests. 

Some pages don't like being scraped and will block your IP if they detect you are sending automated requests. Others might have a small server for the traffic they handle, and sending too many requests might crash the site.

The sleep function from the time module will help us with that.

In [18]:
from time import sleep
# randint is needed below to pick a random waiting time
from random import randint

#simple example 
for i in range(5):
    print(i)
    wait_time = randint(1,4)
    print('I will sleep for...' + str(wait_time) + ' seconds now')
    sleep(wait_time)



0
I will sleep for...2 seconds now
1
I will sleep for...3 seconds now
2
I will sleep for...4 seconds now
3
I will sleep for...3 seconds now
4
I will sleep for...4 seconds now


In [14]:
# To make it more "human", we can randomize the waiting time:
from random import randint
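The waiting logic can also be wrapped in a small helper so it is easy to reuse. This is a sketch, not part of the original notebook: polite_pause and its sleeper parameter are hypothetical names, and random.uniform is used instead of randint so that the wait does not have to be a whole number of seconds.

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=4.0, sleeper=time.sleep):
    """Sleep for a random time in [min_seconds, max_seconds] and return it.

    `sleeper` defaults to time.sleep; passing another callable makes the
    helper easy to try out without actually waiting.
    """
    wait_time = random.uniform(min_seconds, max_seconds)
    sleeper(wait_time)
    return wait_time

# Dry run: record the waits in a list instead of sleeping
recorded_waits = []
waited = polite_pause(1, 4, sleeper=recorded_waits.append)
print(f"would have slept for {waited:.2f} seconds")
```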

### Assembling the script to send and store multiple requests

In [None]:
"""ingredients for our multi page scraper:
    + iterations
    + url list with iterations
    + sleepy time + random gaps (to look human)"""

In [20]:
pages = []

#assemble urls
for i in iterations:
    start_at = str(i)
    url="https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=" + start_at + "&ref_=adv_nxt"
    #download html with get request
    response = requests.get(url)
    #monitor the status codes for each page
    print('status=' + str(response.status_code))
    #store pages into a list
    pages.append(response)
    #respectful nap time
    print(i)
    wait_time = randint(1,4)
    print('I will sleep for...' + str(wait_time) + ' seconds now')
    sleep(wait_time)

status=200
1
I will sleep for...4 seconds now
status=200
51
I will sleep for...4 seconds now
status=200
101
I will sleep for...4 seconds now
status=200
151
I will sleep for...2 seconds now
status=200
201
I will sleep for...2 seconds now
status=200
251
I will sleep for...2 seconds now
status=200
301
I will sleep for...4 seconds now
status=200
351
I will sleep for...1 seconds now
status=200
401
I will sleep for...2 seconds now
status=200
451
I will sleep for...1 seconds now
status=200
501
I will sleep for...2 seconds now
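Every request above came back with status 200, but that will not always be the case. The following sketch shows one way to stop as soon as a page fails instead of storing an unusable response. The function fetch_pages and the FakeResponse class are hypothetical; the fake responses only exist so the example runs without touching the network.

```python
def fetch_pages(urls, fetch, pause):
    """Download each URL in turn and stop at the first non-200 status.

    fetch: callable taking a URL and returning an object with .status_code
    pause: callable taking no arguments, run after each successful download
    """
    collected = []
    for url in urls:
        response = fetch(url)
        if response.status_code != 200:
            print(f"stopping: {url} returned status {response.status_code}")
            break
        collected.append(response)
        pause()
    return collected


class FakeResponse:
    """Stand-in for a real response object so the sketch runs offline."""
    def __init__(self, status_code):
        self.status_code = status_code


# Pretend site: the third page fails
fake_site = {"page-1": 200, "page-2": 200, "page-3": 503, "page-4": 200}
naps = []
collected = fetch_pages(
    fake_site,
    fetch=lambda url: FakeResponse(fake_site[url]),
    pause=lambda: naps.append("nap"),
)
print(len(collected), "pages stored")
```

With the real site, fetch would be requests.get and pause would be something like lambda: sleep(randint(1, 4)).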


In [25]:
BeautifulSoup(pages[0].content, 'html.parser')


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Feature Film,
Released between 1990-01-01 and 1992-12-31,
User Rating at least 7.5
(Sorted by Popularity Ascending) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
 

Note: if you print the pages object after running the code above, you'll only see the response codes, but the HTML is still accessible and you can parse it the same way as before.

### Build code to collect the relevant information from the Request 

This is what we need:

##### Parse just the first page, for testing purposes
- soup=BeautifulSoup(pages[0].content, "html.parser")

##### title and synopsis

- soup.select("div.lister-item-content > h3 > a")
- soup.select("div.lister-item-content > p:nth-child(4)")

#### titles

In [33]:
# Parse just the first page, for testing purposes

soup = BeautifulSoup(pages[0].content, "html.parser")

# Paste the Selector from the first movie title copied from Chrome Dev Tools
#main > div > div.lister.list.detail.sub-list > div > div:nth-child(47) > div.lister-item-content > h3 > a

# Trim the selection
soup.select("h3 > a")

[<a href="/title/tt0103064/">Terminator 2: Tag der Abrechnung</a>,
 <a href="/title/tt0099685/">GoodFellas - Drei Jahrzehnte in der Mafia</a>,
 <a href="/title/tt0099674/">Der Pate 3</a>,
 <a href="/title/tt0105236/">Reservoir Dogs: Wilde Hunde</a>,
 <a href="/title/tt0102926/">Das Schweigen der Lämmer</a>,
 <a href="/title/tt0104257/">Eine Frage der Ehre</a>,
 <a href="/title/tt0104691/">Der letzte Mohikaner</a>,
 <a href="/title/tt0100802/">Total Recall - Die totale Erinnerung</a>,
 <a href="/title/tt0101507/">Boyz n the Hood - Jungs im Viertel</a>,
 <a href="/title/tt0105695/">Erbarmungslos</a>,
 <a href="/title/tt0099785/">Kevin - Allein zu Haus</a>,
 <a href="/title/tt0104952/">Mein Vetter Winnie</a>,
 <a href="/title/tt0099348/">Der mit dem Wolf tanzt</a>,
 <a href="/title/tt0103074/">Thelma &amp; Louise</a>,
 <a href="/title/tt0105323/">Der Duft der Frauen</a>,
 <a href="/title/tt0099810/">Jagd auf Roter Oktober</a>,
 <a href="/title/tt0099487/">Edward mit den Scherenhänden</a>,

#### synopsis

In [47]:
# Selector for the movie synopsis, copied from Chrome Dev Tools and trimmed
soup.select('p:nth-child(4)')

[<p class="text-muted">
 A cyborg, identical to the one who failed to kill Sarah Connor, must now protect her ten year old son, John Connor, from a more advanced and powerful cyborg.</p>,
 <p class="text-muted">
 The story of <a href="/name/nm1453737">Henry Hill</a> and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.</p>,
 <p class="text-muted">
 Follows Michael Corleone, now in his 60s, as he seeks to free his family from crime and find a suitable successor to his empire.</p>,
 <p class="text-muted">
 When a simple jewelry heist goes horribly wrong, the surviving criminals begin to suspect that one of them is a police informant.</p>,
 <p class="text-muted">
 A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.</p>,
 <p class="text-muted">
 Military lawyer Lieute

In [None]:
# Trim the selection


### combine all the code 

There are many approaches to do this. The one we'll follow is: 

- Loop through the pages we collected, parse them ("create the soup") and store the parsed pages in a list. 

- For each parsed page, select the "blocks of HTML elements" that contain all the information of each movie (the title, the synopsis and other stuff). 

- For each one of the "blocks" we collected in the previous step: 

    - Get the movie titles and store them in a list 

    - Get the synopsis and store them in a list

In [50]:
titles = []
synopsis = []
pages_parsed = []

for i in range(len(pages)):
    pages_parsed.append(BeautifulSoup(pages[i].content, 'html.parser'))
    movies_html = pages_parsed[i].select('div.lister-item-content')
    # for each movie, store title and synopsis in the list
    for j in range(len(movies_html)):
        titles.append(movies_html[j].select('h3 > a')[0].get_text())
        synopsis.append(movies_html[j].select('p:nth-child(4)')[0].get_text())
        
print(len(titles))
print(len(synopsis))

537
537
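Indexing with [0] raises an IndexError if a movie block has no match for a selector, for example when a synopsis paragraph is missing. As a sketch of a more defensive variant, select_one returns None in that case, so both lists stay the same length. The HTML below is made up to mimic the structure of the IMDb list and is not real page content.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second movie has no synopsis paragraph
html = """
<div class="lister-item-content">
  <h3><span>1.</span><a href="/title/tt0000001/">First Movie</a></h3>
  <p class="text-muted">120 min | Drama</p>
  <div class="ratings-bar">8.0</div>
  <p class="text-muted">
Synopsis of the first movie.</p>
</div>
<div class="lister-item-content">
  <h3><span>2.</span><a href="/title/tt0000002/">Second Movie</a></h3>
  <p class="text-muted">95 min | Comedy</p>
  <div class="ratings-bar">7.6</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
titles, synopses = [], []

for movie in soup.select("div.lister-item-content"):
    # select_one returns the first match, or None when nothing matches
    title_tag = movie.select_one("h3 > a")
    synopsis_tag = movie.select_one("p:nth-child(4)")
    titles.append(title_tag.get_text(strip=True) if title_tag else None)
    synopses.append(synopsis_tag.get_text(strip=True) if synopsis_tag else None)

print(titles)
print(synopses)
```

Passing strip=True to get_text also removes the leading newline from each synopsis.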


In [77]:
synopsis[0:4]

['\nA cyborg, identical to the one who failed to kill Sarah Connor, must now protect her ten year old son, John Connor, from a more advanced and powerful cyborg.',
 '\nThe story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.',
 '\nFollows Michael Corleone, now in his 60s, as he seeks to free his family from crime and find a suitable successor to his empire.',
 '\nWhen a simple jewelry heist goes horribly wrong, the surviving criminals begin to suspect that one of them is a police informant.']

In [None]:
# check the output and identify any wrangling steps we missed 

In [91]:
# strip the \n from each synopsis
# (replace is a string method, so apply it to each element rather than to the list itself)

synopsis_clean = [s.replace('\n','') for s in synopsis]

synopsis_clean

-----------

## Second example - Scraping presidents

Our objective is to create a dataframe with information about the presidents of the United States. To do this, we will go through 5 steps:

1. Scrape this [list of presidents of the United States](https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States).


In [66]:
# 1. import libraries

import pandas as pd

# 2. find url and store it in a variable

url = 'https://www.wikiwand.com/en/List_of_presidents_of_the_United_States'

# 3. download html with a get request
response = requests.get(url)
response.status_code

# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, 'html.parser')

# 4.2. check that the html code looks like it should
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en" ng-app="wikiwand" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <!--header generic -->
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="//wikiwand-19431.kxcdn.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="//wikiwand-19431.kxcdn.com/img/wikiwand_icon_apple.png" rel="apple-touch-icon">
   <link href="https://www.wikiwand.com/en/List_of_presidents_of_the_United_States" rel="canonical"/>
   <!--this is the original url for google fonts-->
   <link href="//fonts.googleapis.com/css?family=Lato:300,400,700,300italic,400italic,700italic|Lora:400,700,400italic,700italic|Merriweather:400italic,400,300italic,300,700,700italic|Open+Sans:300italic,400italic,700italic,700,300,400&amp;subset=latin,cyrillic-ext,greek-ext,greek,

2. Collect all the links to the Wikipedia page of each president.


In [None]:
# we can access the links searching for the attribute "href"
# in each element
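As a sketch of what this step could look like: find_all("a", href=True) keeps only the anchors that actually have an href, and urljoin turns relative links into full URLs. The HTML snippet and the president names are placeholders, and the real page will need a narrower selection so that only the links to presidents are kept.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Placeholder HTML standing in for part of the presidents table
html = """
<table>
  <tr><td><b><a href="/en/Example_President_One">Example President One</a></b></td></tr>
  <tr><td><b><a href="/en/Example_President_Two">Example President Two</a></b></td></tr>
  <tr><td><a>Anchor without a link</a></td></tr>
</table>
"""

base_url = "https://www.wikiwand.com/en/List_of_presidents_of_the_United_States"
soup = BeautifulSoup(html, "html.parser")

# href=True skips anchors that have no href attribute
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(links)
```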


In [None]:
# Now, we just assemble a new request to the link
# send request


# parse & store html


3. Scrape the Wikipedia page of each president.


In this step we could store the whole Wikipedia page for each president, or just the tiny, final pieces of information. Storing the infoboxes (the summary boxes on each president's page) is a middle ground: we don't keep too much noise, but we retain the flexibility of deciding later which specific elements to extract.

When sending multiple requests, remember to be respectful by spacing the requests a few seconds apart. We will also print the status code of each response to monitor that everything is going well:

In [None]:
# loop over the links collected in step 2


    # send request
 
   
    # parse & store html
    
    # respectful nap:
 

4. Find and store information about each president.


We extracted the 'infoboxes'; now it's time to extract specific information from them. First test what we can get from a single president, and then assemble a loop for all of them.

Here, we will use [the string argument](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument) in the find function, since Wikipedia's tags and classes are not always helpful for locating elements. The string argument allows us to locate elements by their actual content.
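A minimal sketch of the idea on a made-up infobox: find the header cell whose text is exactly "Born", then read the data cell next to it. The HTML, the values in it and the helper infobox_value are placeholders for illustration; real infoboxes are messier and may need extra cleaning.

```python
from bs4 import BeautifulSoup

# Made-up infobox with placeholder values
html = """
<table class="infobox">
  <tr><th>Born</th><td>January 1, 1900</td></tr>
  <tr><th>Political party</th><td>Example Party</td></tr>
</table>
"""

infobox = BeautifulSoup(html, "html.parser")

def infobox_value(box, label):
    """Return the text of the cell next to the header `label`, or None."""
    header = box.find("th", string=label)  # matches on the tag's text content
    if header is None:
        return None
    cell = header.find_next_sibling("td")
    return cell.get_text(" ", strip=True) if cell else None

born = infobox_value(infobox, "Born")
party = infobox_value(infobox, "Political party")
children = infobox_value(infobox, "Children")  # not present in this infobox

print(born, "|", party, "|", children)
```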

In [None]:
#Birthday

#Political party

#Number of children


# collect with a loop 

5. Organize the information in a dataframe where we have each president as a row and each variable we collected as a column.
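One common way to do this, sketched with placeholder records: collect one dictionary per president inside the loop and pass the list of dictionaries to pd.DataFrame. The column names and values below are illustrative only and are not scraped data.

```python
import pandas as pd

# Placeholder records; in practice each dict is built from one infobox
records = [
    {"name": "Example President One", "birthday": "January 1, 1900", "party": "Example Party"},
    {"name": "Example President Two", "birthday": "February 2, 1910", "party": "Other Party"},
]

# Each dict becomes a row, and each key becomes a column
presidents_df = pd.DataFrame(records)
print(presidents_df)
```

If a key is missing from some records, pandas fills that cell with NaN, which makes gaps in the scraped data easy to spot.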