### Lab | Web Scraping Multiple Pages

Practice web scraping

As you've seen, web scraping lets you collect all sorts of information from the internet. Here are a few small challenges you can try to gain more experience:

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: url = 'https://en.wikipedia.org/wiki/Python'
- Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
- Create a Python list with the names on the FBI's Ten Most Wanted list: url = 'https://www.fbi.gov/wanted/topten'
- Display the 20 latest earthquakes (date, time, latitude, longitude, and region name) reported by the EMSC as a pandas DataFrame: url = 'https://www.emsc-csem.org/Earthquake/'
- List all language names and their number of related articles in the order they appear on wikipedia.org: url = 'https://www.wikipedia.org/'
- Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'
- Display the top 10 languages by number of native speakers as a pandas DataFrame: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [98]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from time import sleep
pd.set_option('display.max_rows', None)


In [99]:
# Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

url = 'https://en.wikipedia.org/wiki/Python'
response = requests.get(url)
response.status_code

200
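Checking `status_code` by eye works for a single request. When the same fetch is repeated for several pages, it can help to wrap it in a small function that raises on a failed request. The sketch below is one possible way to do that; `get_soup` and its `http` parameter are hypothetical names introduced here, not part of the lab.

```python
import requests
from bs4 import BeautifulSoup


def get_soup(url, http=requests, timeout=10):
    """Fetch `url` and return the parsed page.

    `http` is a hypothetical hook: anything with a requests-style `.get()`
    (the `requests` module itself, or a `requests.Session`).
    """
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of failing silently.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


# Example call (needs network access, so it is left commented out):
# soup = get_soup("https://en.wikipedia.org/wiki/Python")
```

Passing the HTTP client in as an argument is only a convenience: it lets the function be tried out with a stand-in object when no network is available.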

In [100]:
soup = BeautifulSoup(response.content, "html.parser")
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Python - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"d1d836f0-272f-468c-9ab7-a06f576a5904","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Python","wgTitle":"Python","wgCurRevisionId":1087251762,"wgRevisionId":1087251762,"wgArticleId":46332325,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Disambiguation pages with short descriptions","Short description is different from Wikidata","All article disambiguation pages","All disambiguation pages","Animal common na

In [101]:
links = soup.select("div > ul > li > a")
for link in links:
    print(link)

soup.find_all('a')

<a href="#Snakes"><span class="tocnumber">1</span> <span class="toctext">Snakes</span></a>
<a href="#Computing"><span class="tocnumber">2</span> <span class="toctext">Computing</span></a>
<a href="#People"><span class="tocnumber">3</span> <span class="toctext">People</span></a>
<a href="#Roller_coasters"><span class="tocnumber">4</span> <span class="toctext">Roller coasters</span></a>
<a href="#Vehicles"><span class="tocnumber">5</span> <span class="toctext">Vehicles</span></a>
<a href="#Weaponry"><span class="tocnumber">6</span> <span class="toctext">Weaponry</span></a>
<a href="#Other_uses"><span class="tocnumber">7</span> <span class="toctext">Other uses</span></a>
<a href="#See_also"><span class="tocnumber">8</span> <span class="toctext">See also</span></a>
<a href="/wiki/Pythonidae" title="Pythonidae">Pythonidae</a>
<a href="/wiki/Python_(mythology)" title="Python (mythology)">Python (mythology)</a>
<a href="/wiki/Python_(programming_language)" title="Python (programming language)

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/Python" title="wiktionary:Python">Python</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/python" title="wiktionary:python">python</a>,
 <a href="#Snakes"><span class="tocnumber">1</span> <span class="toctext">Snakes</span></a>,
 <a href="#Computing"><span class="tocnumber">2</span> <span class="toctext">Computing</span></a>,
 <a href="#People"><span class="tocnumber">3</span> <span class="toctext">People</span></a>,
 <a href="#Roller_coasters"><span class="tocnumber">4</span> <span class="toctext">Roller coasters</span></a>,
 <a href="#Vehicles"><span class="tocnumber">5</span> <span class="toctext">Vehicles</span></a>,
 <a href="#Weaponry"><span class="tocnumber">6</span> <span class="toctext">Weaponry</span></a>,
 <a href="#Other_uses"><span class="tocnumber">7</span> <sp

In [102]:
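from urllib.parse import urljoin

urls = []

# Resolve each href against the page URL so that fragment links such as
# '#Snakes' and relative paths such as '/wiki/Pythonidae' both become valid URLs.
for link in links:
    urls.append(urljoin(url, link['href']))

urls

['https://en.wikipedia.org/wiki/Python#Snakes',
 'https://en.wikipedia.org/wiki/Python#Computing',
 'https://en.wikipedia.org/wiki/Python#People',
 'https://en.wikipedia.org/wiki/Python#Roller_coasters',
 'https://en.wikipedia.org/wiki/Python#Vehicles',
 'https://en.wikipedia.org/wiki/Python#Weaponry',
 'https://en.wikipedia.org/wiki/Python#Other_uses',
 'https://en.wikipedia.org/wiki/Python#See_also',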
 'https://en.wikipedia.org/wiki/Pythonidae',
 'https://en.wikipedia.org/wiki/Python_(mythology)',
 'https://en.wikipedia.org/wiki/Python_(programming_language)',
 'https://en.wikipedia.org/wiki/CMU_Common_Lisp',
 'https://en.wikipedia.org/wiki/PERQ#PERQ_3',
 'https://en.wikipedia.org/wiki/Python_of_Aenus',
 'https://en.wikipedia.org/wiki/Python_(painter)',
 'https://en.wikipedia.org/wiki/Python_of_Byzantium',
 'https://en.wikipedia.org/wiki/Python_of_Catana',
 'https://en.wikipedia.org/wiki/Python_Anghelo',
 'https://en.wikipedia.org/wiki/Python_(Efteling)',
 'https://en.wikipedia.org/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 'https://en.wikipedia.org/wiki/Python_(Coney_Island,_Cinc
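The hrefs on this page come in several shapes: same-page fragments such as `#Snakes`, site-relative paths such as `/wiki/Pythonidae`, and absolute URLs such as the Wiktionary links. A minimal sketch of turning such a mix into a clean list of absolute URLs, using only the standard library; `resolve_links` and the sample list are illustrative names and data, not part of the lab.

```python
from urllib.parse import urljoin, urldefrag

PAGE_URL = "https://en.wikipedia.org/wiki/Python"

# A small hand-written sample of href values in the shapes seen on the page.
sample_hrefs = [
    "#Snakes",                                 # same-page fragment
    "/wiki/Pythonidae",                        # site-relative path
    "/wiki/PERQ#PERQ_3",                       # relative path with a fragment
    "https://en.wiktionary.org/wiki/Python",   # already absolute
    "#Snakes",                                 # duplicate
]


def resolve_links(page_url, hrefs, skip_same_page=True):
    """Return absolute URLs in first-seen order, without duplicates."""
    resolved = []
    seen = set()
    for href in hrefs:
        absolute = urljoin(page_url, href)
        without_fragment, _fragment = urldefrag(absolute)
        # A link whose target is the page itself is only a table-of-contents jump.
        if skip_same_page and without_fragment == page_url:
            continue
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


clean_urls = resolve_links(PAGE_URL, sample_hrefs)
print(clean_urls)
```

Whether to drop the same-page fragment links depends on the goal; passing `skip_same_page=False` keeps them.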

### Scraping Multiple Pages for a Music List

In [103]:
url = 'https://musicbrainz.org/series/b3484a66-a4de-444d-93d3-c99a73656905'
response = requests.get(url)
response.status_code

200

In [104]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())



<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="/static/images/favicons/apple-touch-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
  <link href="/static/images/favicons/apple-touch-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>
  <link href="/static/images/favicons/apple-touch-icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/>
  <link href="/static/images/favicons/apple-touch-icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
  <link href="/static/images/favicons/apple-touch-icon-114x114.png" rel="apple-touch-icon" sizes="114x114"/>
  <link href="/static/images/favicons/apple-touch-icon-120x120.png" rel="apple-touch-icon" sizes="120x120"/>
  <link href="/static/images/favicons/apple-touch-icon-144x144.png" rel="apple-touch-icon" sizes="144x144"/>
  <link href="/static/images/favicons/apple-touch-icon-15

In [105]:
for artist in soup.select("a[href*=artist]"):
    print(artist.get_text())



Bob Dylan
The Beatles
The Rolling Stones
John Lennon
Marvin Gaye
Aretha Franklin
The Beach Boys
Chuck Berry
The Beatles
Nirvana
Ray Charles
The Who
Sam Cooke
The Beatles
Bob Dylan
The Clash
The Beatles
The Jimi Hendrix Experience
Chuck Berry
Elvis Presley
The Beatles
Bruce Springsteen
The Ronettes
The Beatles
The Impressions
The Beach Boys
Otis Redding
Derek and the Dominos
The Beatles
The Beatles
Johnny Cash
The Tennessee Two
Led Zeppelin
The Rolling Stones
Ike & Tina Turner
The Righteous Brothers
The Doors
U2
Bob Marley & The Wailers
The Rolling Stones
Buddy Holly
Martha and the Vandellas
The Band
The Kinks
Little Richard
Ray Charles
Elvis Presley
David Bowie
Simon & Garfunkel
The Jimi Hendrix Experience
Eagles
Smokey Robinson
Grandmaster Flash & The Furious Five
Melle Mel
Duke Bootee
Prince
The Revolution
Sex Pistols
Percy Sledge
The Kingsmen
Little Richard
Procol Harum
Michael Jackson
Bob Dylan
Al Green
Jerry Lee Lewis
Bo Diddley
Buffalo Springfield
The Beatles
Cream
Bob Marley & T

In [106]:
for song in soup.select("a[href*=recording]"):
    print(song.get_text())

Like a Rolling Stone
Strawberry Fields Forever
(I Can’t Get No) Satisfaction
Imagine
What’s Going On
Respect
Good Vibrations
Johnny B. Goode
Hey Jude
Smells Like Teen Spirit
What’d I Say
My Generation
A Change Is Gonna Come
Yesterday
Blowin’ in the Wind
London Calling
I Want to Hold Your Hand
Purple Haze
Maybellene
Hound Dog
Let It Be
Born to Run
Be My Baby
In My Life
People Get Ready
God Only Knows
(Sittin’ on) the Dock of the Bay
Layla
A Day in the Life
Help!
I Walk the Line
Stairway to Heaven
Sympathy for the Devil
River Deep—Mountain High
You’ve Lost That Lovin’ Feelin’
Light My Fire
One
No Woman, No Cry
Gimme Shelter
That’ll Be the Day
Dancing in the Street
The Weight
Waterloo Sunset
Tutti Frutti
Georgia on My Mind
Heartbreak Hotel
“Heroes”
Bridge Over Troubled Water
All Along the Watchtower
Hotel California
The Tracks of My Tears
The Message
When Doves Cry
Anarchy in the U.K.
When a Man Loves a Woman
Louie Louie
Long Tall Sally
A Whiter Shade of Pale
Billie Jean
The Times They Ar

In [107]:
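# find_all() matches tag names, so find_all('href') finds nothing; filter <a> tags by their href attribute instead.
for song in soup.find_all('a', href=lambda h: h and 'recording' in h):
    print(song.get_text())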

In [108]:
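song_name = []
artist_name = []

# Request pages 1 through 5 of the series and collect the link texts.
for page in range(1, 6):
    r = requests.get(f'https://musicbrainz.org/series/b3484a66-a4de-444d-93d3-c99a73656905?page={page}')
    soup = BeautifulSoup(r.content, 'html.parser')

    # Recording links hold the song titles.
    for song in soup.select("a[href*=recording]"):
        song_name.append(song.get_text(strip=True))

    # Artist links hold the performer names; a song may list more than one,
    # so the two lists can end up with different lengths.
    for artist in soup.select("a[href*=artist]"):
        artist_name.append(artist.get_text(strip=True))

    sleep(1)  # pause between requests to avoid hammering the server

print(song_name)
print(artist_name)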

['Like a Rolling Stone', 'Strawberry Fields Forever', '(I Can’t Get No) Satisfaction', 'Imagine', 'What’s Going On', 'Respect', 'Good Vibrations', 'Johnny B. Goode', 'Hey Jude', 'Smells Like Teen Spirit', 'What’d I Say', 'My Generation', 'A Change Is Gonna Come', 'Yesterday', 'Blowin’ in the Wind', 'London Calling', 'I Want to Hold Your Hand', 'Purple Haze', 'Maybellene', 'Hound Dog', 'Let It Be', 'Born to Run', 'Be My Baby', 'In My Life', 'People Get Ready', 'God Only Knows', '(Sittin’ on) the Dock of the Bay', 'Layla', 'A Day in the Life', 'Help!', 'I Walk the Line', 'Stairway to Heaven', 'Sympathy for the Devil', 'River Deep—Mountain High', 'You’ve Lost That Lovin’ Feelin’', 'Light My Fire', 'One', 'No Woman, No Cry', 'Gimme Shelter', 'That’ll Be the Day', 'Dancing in the Street', 'The Weight', 'Waterloo Sunset', 'Tutti Frutti', 'Georgia on My Mind', 'Heartbreak Hotel', '“Heroes”', 'Bridge Over Troubled Water', 'All Along the Watchtower', 'Hotel California', 'The Tracks of My Tears'
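Collecting songs and artists in two separate passes gives two flat lists, and a song with several credited artists makes them drift out of step. One possible alternative is to walk the page row by row so that each song stays with its own artists. The sketch below assumes each entry sits in its own `<tr>` containing one recording link and one or more artist links; that layout, the sample HTML, and the `rows_to_records` helper are assumptions made for illustration and should be checked against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup; the real page structure may differ.
sample_html = """
<table>
  <tr><th>#</th><th>Recording</th><th>Artist</th></tr>
  <tr><td>31</td>
      <td><a href="/recording/r1">I Walk the Line</a></td>
      <td><a href="/artist/a1">Johnny Cash</a> and
          <a href="/artist/a2">The Tennessee Two</a></td></tr>
  <tr><td>32</td>
      <td><a href="/recording/r2">Stairway to Heaven</a></td>
      <td><a href="/artist/a3">Led Zeppelin</a></td></tr>
</table>
"""


def rows_to_records(soup):
    """Return one {'song', 'artist'} dict per table row that has a recording link."""
    records = []
    for row in soup.select("tr"):
        song_link = row.select_one('a[href*="/recording/"]')
        if song_link is None:
            continue  # header rows and other rows without a recording
        artists = [a.get_text(strip=True) for a in row.select('a[href*="/artist/"]')]
        records.append({
            "song": song_link.get_text(strip=True),
            "artist": " / ".join(artists),  # keep all credited artists together
        })
    return records


records = rows_to_records(BeautifulSoup(sample_html, "html.parser"))
print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`, which avoids the need to pad or trim columns afterwards.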

In [109]:
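d = {'song': song_name, 'artist': artist_name}
# The two lists can differ in length, so wrap each in a Series and let pandas pad the shorter one with NaN.
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
df = df.dropna()  # drop the padded rows
df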

| | song | artist |
|---|---|---|
| 0 | Like a Rolling Stone | Bob Dylan |
| 1 | Strawberry Fields Forever | The Beatles |
| 2 | (I Can’t Get No) Satisfaction | The Rolling Stones |
| 3 | Imagine | John Lennon |
| 4 | What’s Going On | Marvin Gaye |
| 5 | Respect | Aretha Franklin |
| 6 | Good Vibrations | The Beach Boys |
| 7 | Johnny B. Goode | Chuck Berry |
| 8 | Hey Jude | The Beatles |
| 9 | Smells Like Teen Spirit | Nirvana |
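The first rows line up, but wrapping the lists in `pd.Series` and calling `dropna()` only hides a length mismatch; it does not realign the columns. The sketch below uses a few names from the output above to show what pandas does with lists of unequal length. The pairing shown here is an illustration of the mechanism, not a check of the real DataFrame.

```python
import pandas as pd

# Two songs, but three artist links (one song credits two artists).
songs = ["I Walk the Line", "Stairway to Heaven"]
artists = ["Johnny Cash", "The Tennessee Two", "Led Zeppelin"]

# Plain lists of different lengths are rejected outright.
try:
    pd.DataFrame({"song": songs, "artist": artists})
    raised = False
except ValueError:
    raised = True

# Series are aligned on their index, so the shorter column is padded with NaN.
padded = pd.DataFrame({"song": pd.Series(songs), "artist": pd.Series(artists)})

# dropna() removes the padded row, but every row after the extra artist is shifted.
trimmed = padded.dropna()

print(padded)
print(trimmed)
```

In `trimmed`, "Stairway to Heaven" ends up next to "The Tennessee Two", which is why it is worth spot-checking rows further down the real DataFrame rather than only the first ten.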
