# Lab | Web Scraping Multiple Pages

### Instructions

#### Prioritize the MVP

---------------------------------------------------------------------------------------------------------
In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

--------------------------------------------------------------------------------------------------------

#### Expand the project

--------------------------------------------------------------------------------------------------------
If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs
        
--------------------------------------------------------------------------------------------------------

In [1]:
# Importing the libraries used in this lab
from bs4 import BeautifulSoup
import pandas as pd
import requests
import math

----------------------------------------------------------------------------------------------
For this lab we'll use the NPR best songs from 2021 and 2022 to feed the other_songs dataframe:

- https://www.npr.org/2021/12/02/1054377950/the-100-best-songs-of-2021-page-1
- https://www.npr.org/2022/12/15/1135802083/100-best-songs-2022-page-1
    
----------------------------------------------------------------------------------------------

###### 1. NPR - 100 Best Songs of 2021

In [2]:
# First, check if we can make the connection without problems:
header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}

r = requests.get("https://www.npr.org/2021/12/02/1054377950/the-100-best-songs-of-2021-page-1", headers=header)
r.status_code

200

In [3]:
# Getting the site's html
soup = BeautifulSoup(r.content, 'html.parser')
soup

<!DOCTYPE html>
<html class="no-js" lang="en"><head><!-- OneTrust Cookies Consent Notice start for npr.org -->
<script src="https://cdn.cookielaw.org/consent/82089dfe-410c-4e1b-a7f9-698174b62a86/OtAutoBlock.js" type="text/javascript"></script>
<script charset="UTF-8" data-domain-script="82089dfe-410c-4e1b-a7f9-698174b62a86" src="https://cdn.cookielaw.org/scripttemplates/otSDKStub.js" type="text/javascript"></script>
<script type="text/javascript">
function OptanonWrapper() {
    NPR_OptanonWrapper = true;
    document.dispatchEvent(new CustomEvent('npr:DataConsentAvailable'));
    
    OneTrust.OnConsentChanged(function() {
        document.dispatchEvent(new CustomEvent('npr:DataConsentChanged'));
    });
 }
</script>
<!-- OneTrust Cookies Consent Notice end for npr.org -->
<script ccpa-opt-out-geo="us" ccpa-opt-out-ids="C0004" ccpa-opt-out-lspa="false" charset="UTF-8" src="https://cdn.cookielaw.org/opt-out/otCCPAiab.js" type="text/javascript"></script><script class="optanon-category-C

In [4]:
# Searching for the artists and songs
soup.find_all("h3", attrs={'class': 'edTag'})

[<h3 class="edTag">Taylor Swift</h3>,
 <h3 class="edTag">"All Too Well (10 Minute Version) (Taylor's Version) (From The Vault)"<em> </em></h3>,
 <h3 class="edTag">Rhiannon Giddens &amp; Francesco Turrisi</h3>,
 <h3 class="edTag">"Avalon"</h3>,
 <h3 class="edTag">Anna B Savage</h3>,
 <h3 class="edTag">"Baby Grand"<em> </em></h3>,
 <h3 class="edTag">Moneybagg Yo</h3>,
 <h3 class="edTag">"Wockesha"<em> </em></h3>,
 <h3 class="edTag">Snail Mail</h3>,
 <h3 class="edTag">"Automate"<em> </em></h3>,
 <h3 class="edTag">Farruko</h3>,
 <h3 class="edTag">"Pepas"<em> </em></h3>,
 <h3 class="edTag">Carly Pearce</h3>,
 <h3 class="edTag">"29"<em> </em></h3>,
 <h3 class="edTag">Phoebe Bridgers</h3>,
 <h3 class="edTag">"That Funny Feeling"<em> </em></h3>,
 <h3 class="edTag">Yola</h3>,
 <h3 class="edTag">"Starlight"</h3>,
 <h3 class="edTag">Bachelor</h3>,
 <h3 class="edTag">"Back Of My Hand"</h3>,
 <h3 class="edTag">Tinashe (feat. Jeremih)</h3>,
 <h3 class="edTag">"X"</h3>,
 <h3 class="edTag">Pom Pom Squ

In [5]:
# Get all artists/songs
[artist_song.get_text(strip=True) for artist_song in soup.find_all("h3", attrs={'class': 'edTag'})]

['Taylor Swift',
 '"All Too Well (10 Minute Version) (Taylor\'s Version) (From The Vault)"',
 'Rhiannon Giddens & Francesco Turrisi',
 '"Avalon"',
 'Anna B Savage',
 '"Baby Grand"',
 'Moneybagg Yo',
 '"Wockesha"',
 'Snail Mail',
 '"Automate"',
 'Farruko',
 '"Pepas"',
 'Carly Pearce',
 '"29"',
 'Phoebe Bridgers',
 '"That Funny Feeling"',
 'Yola',
 '"Starlight"',
 'Bachelor',
 '"Back Of My Hand"',
 'Tinashe (feat. Jeremih)',
 '"X"',
 'Pom Pom Squad',
 '"Drunk Voicemail"',
 'Nas',
 '"Moments"',
 'Will Liverman',
 '"The Rain"',
 'Indigo De Souza',
 '"Way Out"',
 'Isaiah Rashad',
 '"Headshots (4r Da Locals)"',
 'STAYC',
 '"Stereotype"',
 'Khemmis',
 '"Avernal Gate"',
 'Yuta Orisaka',
 '"Orca"',
 'UNIIQU3',
 '"Microdosing"',
 'Next >']

In [6]:
# Filtering only the artists (entries without quotation marks, excluding the 'Next >' pagination label)
artists_page1 = [artist.get_text(strip=True) for artist in soup.find_all("h3", attrs={'class': 'edTag'}) if '"' not in artist.get_text() and artist.get_text(strip=True) != 'Next >']
artists_page1

['Taylor Swift',
 'Rhiannon Giddens & Francesco Turrisi',
 'Anna B Savage',
 'Moneybagg Yo',
 'Snail Mail',
 'Farruko',
 'Carly Pearce',
 'Phoebe Bridgers',
 'Yola',
 'Bachelor',
 'Tinashe (feat. Jeremih)',
 'Pom Pom Squad',
 'Nas',
 'Will Liverman',
 'Indigo De Souza',
 'Isaiah Rashad',
 'STAYC',
 'Khemmis',
 'Yuta Orisaka',
 'UNIIQU3']

In [7]:
len(artists_page1)

20

In [8]:
# Filtering only the songs 
songs_page1 = [song.get_text(strip=True).replace('"', '') for song in soup.find_all("h3", attrs={'class': 'edTag'}) if '"' in song.get_text()]
songs_page1

["All Too Well (10 Minute Version) (Taylor's Version) (From The Vault)",
 'Avalon',
 'Baby Grand',
 'Wockesha',
 'Automate',
 'Pepas',
 '29',
 'That Funny Feeling',
 'Starlight',
 'Back Of My Hand',
 'X',
 'Drunk Voicemail',
 'Moments',
 'The Rain',
 'Way Out',
 'Headshots (4r Da Locals)',
 'Stereotype',
 'Avernal Gate',
 'Orca',
 'Microdosing']

In [9]:
len(songs_page1)

20

-----------------------------------------------------------------------------------------------------
However, we need to scrape multiple pages to get all the artists and songs.

So we'll create a for loop to fetch that data.

-----------------------------------------------------------------------------------------------------
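
The multi-page idea can be sketched independently of the website. This is a minimal sketch, not the notebook's implementation: `scrape_pages`, `fetch`, `parse` and `delay_seconds` are hypothetical names, and the fake pages below only stand in for real HTML so the example runs without network access.

```python
import time

def scrape_pages(urls, fetch, parse, delay_seconds=1.0):
    """Fetch each URL in order, parse it, and collect all parsed items."""
    items = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # short pause between consecutive requests
        html = fetch(url)
        items.extend(parse(html))
    return items

# Stand-ins (assumptions) so the sketch runs without network access
fake_pages = {"page-1": "a,b", "page-2": "c"}
collected = scrape_pages(
    urls=["page-1", "page-2"],
    fetch=lambda url: fake_pages[url],
    parse=lambda html: html.split(","),
    delay_seconds=0,
)
print(collected)  # ['a', 'b', 'c']
```

In this notebook, `fetch` could wrap `requests.get(url, headers=header, timeout=10)` followed by `raise_for_status()`, and `parse` could wrap the BeautifulSoup list comprehensions used for page 1.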

In [10]:
# Getting all the artists and songs from each page
artists_2021 = []
songs_2021 = []

# Each page of the list has its own ID in the URL (pages 1 to 5, in order)
page_ids_2021 = [1054377950, 1054378275, 1054379062, 1054379661, 1054380365]

for page, page_id in enumerate(page_ids_2021, start=1):
    r = requests.get(f"https://www.npr.org/2021/12/02/{page_id}/the-100-best-songs-of-2021-page-{page}", headers=header)
    soup = BeautifulSoup(r.content, 'html.parser')
    h3_tags = soup.find_all("h3", attrs={'class': 'edTag'})
    # Artists: no quotation marks, and not a pagination label ('Next >' / '< Previous')
    artists_2021.extend([tag.get_text(strip=True) for tag in h3_tags if '"' not in tag.get_text() and 'Next >' not in tag.get_text() and '< Previous' not in tag.get_text()])
    # Songs: titles are wrapped in quotation marks, which we strip out
    songs_2021.extend([tag.get_text(strip=True).replace('"', '') for tag in h3_tags if '"' in tag.get_text()])

In [11]:
artists_2021

['Taylor Swift',
 'Rhiannon Giddens & Francesco Turrisi',
 'Anna B Savage',
 'Moneybagg Yo',
 'Snail Mail',
 'Farruko',
 'Carly Pearce',
 'Phoebe Bridgers',
 'Yola',
 'Bachelor',
 'Tinashe (feat. Jeremih)',
 'Pom Pom Squad',
 'Nas',
 'Will Liverman',
 'Indigo De Souza',
 'Isaiah Rashad',
 'STAYC',
 'Khemmis',
 'Yuta Orisaka',
 'UNIIQU3',
 'Tyler, The Creator (feat. YoungBoy Never Broke Again & Ty Dolla $ign)',
 'Lithuanian National Symphony Orchestra',
 'Saudade',
 'Yotuel, Gente De Zona, Descemer Bueno, Maykel Osorbo and El Funky',
 'Meet Me @ The Altar',
 'Zao',
 'Silvana Estrada',
 'Porter Robinson',
 'Recap',
 'SOPHIE',
 'Daniel Bachman',
 'Cardi B',
 'Azealia Banks',
 'Durand Jones & The Indications',
 'The Baylor Project (feat. Dianne Reeves & Jazzmeia Horn)',
 'Tion Wayne x Russ Millions (feat. ArrDee, E1, ZT, Bugzy Malone, Buni, Fivio Foreign & Darkoo)',
 'Robert Plant & Alison Krauss',
 'Lana Del Rey',
 'Doja Cat (feat. SZA)',
 'Mannequin Pussy',
 'Kacey Musgraves',
 'Camila C

In [12]:
songs_2021

["All Too Well (10 Minute Version) (Taylor's Version) (From The Vault)",
 'Avalon',
 'Baby Grand',
 'Wockesha',
 'Automate',
 'Pepas',
 '29',
 'That Funny Feeling',
 'Starlight',
 'Back Of My Hand',
 'X',
 'Drunk Voicemail',
 'Moments',
 'The Rain',
 'Way Out',
 'Headshots (4r Da Locals)',
 'Stereotype',
 'Avernal Gate',
 'Orca',
 'Microdosing',
 'WUSYANAME',
 'Patria y Vida',
 'Hit Like A Girl',
 'Ship Of Theseus',
 'Marchita',
 'Musician',
 'Hedera',
 'UNISIL',
 'Coronach',
 'Up',
 'F*** Him All Night',
 'Witchoo',
 'We Swing',
 'Body (Remix)',
 'Go Your Way',
 'White Dress',
 'Kiss Me More',
 'To Lose You',
 'good wife',
 "Don't Go Yet",
 'La Perla',
 'Gyalis',
 'Iced Coffee',
 'Why the Bright Stars Glow',
 'Sacude',
 'Bussifame',
 'Notice',
 'SAD GIRLZ LUV MONEY (Remix)',
 'WHOLE LOTTA MONEY (Remix)',
 'Mi Conga Es De Akokán',
 'Jackie',
 'Go Down Deh',
 'Woman',
 'Motorbike',
 'Amoeba',
 'fitt',
 'Jordan',
 'JUNO',
 'RIGHT NOW',
 'Knife Talk',
 'Bunny is a Rider',
 'Nausicaa',
 'F

In [13]:
len(artists_2021)

101

In [14]:
len(songs_2021)

99

-----------------------------------------------------------------------------------------------------
One entry, Lithuanian National Symphony Orchestra - Saudade, wasn't enclosed in quotation marks on the website, so both the artist and the song title ended up in the artists list (101 artists, 99 songs). We'll therefore remove both from the artists list and then append the artist and the song manually to their respective lists, which keeps every artist aligned with its song.

-------------------------------------------------------------------------------------------------
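
The mismatch above comes from using quotation marks to tell songs from artists. A possible alternative, sketched here under the assumption that every artist heading is immediately followed by its song heading (as in the output of `In [5]`), is to pair the headings by position. `pair_entries` and `NAV_LABELS` are hypothetical names, and the pagination labels are assumed to match the ones filtered out in the loop above.

```python
# Hypothetical helper: pairs heading texts by position instead of by quotation marks.
# Assumption: headings alternate artist, song, artist, song, ...
NAV_LABELS = {"Next >", "< Previous", "< Previous•Next >"}

def pair_entries(heading_texts):
    """Return (artist, song) tuples from a flat list of heading texts."""
    # Drop pagination labels, which share the same tag and class as the entries
    entries = [text.strip() for text in heading_texts if text.strip() not in NAV_LABELS]
    if len(entries) % 2 != 0:
        raise ValueError(f"Expected an even number of headings, got {len(entries)}")
    artists = entries[0::2]
    songs = [song.replace('"', '') for song in entries[1::2]]
    return list(zip(artists, songs))

# Sample heading texts taken from the outputs above
texts = ['Snail Mail', '"Automate"',
         'Lithuanian National Symphony Orchestra', 'Saudade',
         'Next >']
pairs = pair_entries(texts)
print(pairs)
```

With this approach an unquoted title such as Saudade stays attached to its artist, and an odd number of headings raises an error instead of silently producing lists of different lengths.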

In [16]:
# The element to remove
element_to_remove = 'Lithuanian National Symphony Orchestra'

# Removing 'Lithuanian National Symphony Orchestra' from the artists list
artists_2021 = [artist for artist in artists_2021 if artist != element_to_remove]

len(artists_2021)

100

In [17]:
# The element to remove
element_to_remove = 'Saudade'

# Removing 'Saudade' from the artists list
artists_2021 = [artist for artist in artists_2021 if artist != element_to_remove]

len(artists_2021)

99

In [18]:
# Add 'Saudade' to the songs list
songs_2021.append('Saudade')

# Add 'Lithuanian National Symphony Orchestra' to the artists list
artists_2021.append('Lithuanian National Symphony Orchestra')

In [19]:
len(artists_2021)

100

In [20]:
len(songs_2021)

100

###### 2. NPR - 100 Best Songs of 2022

In [21]:
# First, check if we can make the connection without problems:
header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}

r = requests.get("https://www.npr.org/2022/12/15/1135802083/100-best-songs-2022-page-1", headers=header)
r.status_code

200

In [22]:
# Getting the site's html
soup = BeautifulSoup(r.content, 'html.parser')
soup

<!DOCTYPE html>
<html class="no-js" lang="en"><head><!-- OneTrust Cookies Consent Notice start for npr.org -->
<script src="https://cdn.cookielaw.org/consent/82089dfe-410c-4e1b-a7f9-698174b62a86/OtAutoBlock.js" type="text/javascript"></script>
<script charset="UTF-8" data-domain-script="82089dfe-410c-4e1b-a7f9-698174b62a86" src="https://cdn.cookielaw.org/scripttemplates/otSDKStub.js" type="text/javascript"></script>
<script type="text/javascript">
function OptanonWrapper() {
    NPR_OptanonWrapper = true;
    document.dispatchEvent(new CustomEvent('npr:DataConsentAvailable'));
    
    OneTrust.OnConsentChanged(function() {
        document.dispatchEvent(new CustomEvent('npr:DataConsentChanged'));
    });
 }
</script>
<!-- OneTrust Cookies Consent Notice end for npr.org -->
<script ccpa-opt-out-geo="us" ccpa-opt-out-ids="C0004" ccpa-opt-out-lspa="false" charset="UTF-8" src="https://cdn.cookielaw.org/opt-out/otCCPAiab.js" type="text/javascript"></script><script class="optanon-category-C

In [23]:
# Searching for the artists and songs
soup.find_all("h3", attrs={'class': 'edTag'})

[<h3 class="edTag">Little Simz</h3>,
 <h3 class="edTag">"Gorilla"<em> </em></h3>,
 <h3 class="edTag">Ian William Craig</h3>,
 <h3 class="edTag">"Attention For It Radiates"</h3>,
 <h3 class="edTag">Viking Ding Dong x Ravi B</h3>,
 <h3 class="edTag">"Leave It Alone (Remix)"</h3>,
 <h3 class="edTag">Adeem the Artist</h3>,
 <h3 class="edTag">"Middle of a Heart"<em> </em></h3>,
 <h3 class="edTag">Zahsosaa, D STURDY &amp; DJ Crazy</h3>,
 <h3 class="edTag">"Shakedhat"</h3>,
 <h3 class="edTag">Gabriels</h3>,
 <h3 class="edTag">"If You Only Knew"</h3>,
 <h3 class="edTag">DOMi &amp; JD BECK</h3>,
 <h3 class="edTag">"SMiLE"</h3>,
 <h3 class="edTag">Rema</h3>,
 <h3 class="edTag">"Calm Down"</h3>,
 <h3 class="edTag">Pigeon Pit</h3>,
 <h3 class="edTag">"milk crates"</h3>,
 <h3 class="edTag">Tyler Childers</h3>,
 <h3 class="edTag">"Angel Band (Jubilee Version)"</h3>,
 <h3 class="edTag">Straw Man Army</h3>,
 <h3 class="edTag">"Human Kind"</h3>,
 <h3 class="edTag">Guitarricadelafuente</h3>,
 <h3 class=

In [24]:
# Filtering only the artists (entries without quotation marks, excluding the 'Next >' pagination label)
artists_page1 = [artist.get_text(strip=True) for artist in soup.find_all("h3", attrs={'class': 'edTag'}) if '"' not in artist.get_text() and artist.get_text(strip=True) != 'Next >']
artists_page1

['Little Simz',
 'Ian William Craig',
 'Viking Ding Dong x Ravi B',
 'Adeem the Artist',
 'Zahsosaa, D STURDY & DJ Crazy',
 'Gabriels',
 'DOMi & JD BECK',
 'Rema',
 'Pigeon Pit',
 'Tyler Childers',
 'Straw Man Army',
 'Guitarricadelafuente',
 'Mary Halvorson',
 'Leyla McCalla',
 'The Mountain Goats',
 'NewJeans',
 'Joyce',
 'Ayra Starr',
 'Disclosure feat. RAYE',
 'Ari Lennox']

In [25]:
# Filtering only the songs 
songs_page1 = [song.get_text(strip=True).replace('"', '') for song in soup.find_all("h3", attrs={'class': 'edTag'}) if '"' in song.get_text()]
songs_page1

['Gorilla',
 'Attention For It Radiates',
 'Leave It Alone (Remix)',
 'Middle of a Heart',
 'Shakedhat',
 'If You Only Knew',
 'SMiLE',
 'Calm Down',
 'milk crates',
 'Angel Band (Jubilee Version)',
 'Human Kind',
 'Quien encendió la luz',
 'Night Shift',
 'Dodinin',
 'Bleed Out',
 'Hype Boy',
 'Feminina',
 'Rush',
 'Waterfall',
 'POF']

In [26]:
len(artists_page1)

20

In [27]:
len(songs_page1)

20

In [28]:
# Getting all the artists and songs from each page
artists_2022 = []
songs_2022 = []

# Each page of the list has its own ID in the URL (pages 1 to 5, in order)
page_ids_2022 = [1135802083, 1135802978, 1135803422, 1135803900, 1135804266]

for page, page_id in enumerate(page_ids_2022, start=1):
    r = requests.get(f"https://www.npr.org/2022/12/15/{page_id}/100-best-songs-2022-page-{page}", headers=header)
    soup = BeautifulSoup(r.content, 'html.parser')
    h3_tags = soup.find_all("h3", attrs={'class': 'edTag'})
    # Artists: no quotation marks, and not a pagination label ('Next >' / '< Previous')
    artists_2022.extend([tag.get_text(strip=True) for tag in h3_tags if '"' not in tag.get_text() and 'Next >' not in tag.get_text() and '< Previous' not in tag.get_text()])
    # Songs: titles are wrapped in quotation marks, which we strip out
    songs_2022.extend([tag.get_text(strip=True).replace('"', '') for tag in h3_tags if '"' in tag.get_text()])

In [29]:
artists_2022

['Little Simz',
 'Ian William Craig',
 'Viking Ding Dong x Ravi B',
 'Adeem the Artist',
 'Zahsosaa, D STURDY & DJ Crazy',
 'Gabriels',
 'DOMi & JD BECK',
 'Rema',
 'Pigeon Pit',
 'Tyler Childers',
 'Straw Man Army',
 'Guitarricadelafuente',
 'Mary Halvorson',
 'Leyla McCalla',
 'The Mountain Goats',
 'NewJeans',
 'Joyce',
 'Ayra Starr',
 'Disclosure feat. RAYE',
 'Ari Lennox',
 'The 1975',
 'Anna Tivel',
 'Caroline Shaw & Attacca Quartet',
 'Beth Orton',
 'DJ Python',
 'Patricia Brennan',
 'Black Sherif',
 'Madison Cunningham',
 'La Doña',
 'Julia Jacklin',
 'Molly Tuttle & Golden Highway',
 'Black Country, New Road',
 'Khruangbin & Leon Bridges',
 'Nduduzo Makhathini',
 'KAROL G',
 'Vince Staples',
 'Tove Lo',
 'Jazmine Sullivan',
 'Denzel Curry',
 'beabadoobee',
 'Björk',
 'Hyd',
 'Sky Ferreira',
 'Regina Spektor',
 'Tommy McLain feat. Elvis Costello',
 'Yahritza y Su Esencia',
 'Víkingur Ólafsson',
 'Lil Yachty',
 'LF System',
 'Soccer Mommy',
 'LE SSERAFIM',
 'Flo Milli',
 'Kendri

In [30]:
songs_2022

['Gorilla',
 'Attention For It Radiates',
 'Leave It Alone (Remix)',
 'Middle of a Heart',
 'Shakedhat',
 'If You Only Knew',
 'SMiLE',
 'Calm Down',
 'milk crates',
 'Angel Band (Jubilee Version)',
 'Human Kind',
 'Quien encendió la luz',
 'Night Shift',
 'Dodinin',
 'Bleed Out',
 'Hype Boy',
 'Feminina',
 'Rush',
 'Waterfall',
 'POF',
 'Part of the Band',
 'Black Umbrella',
 'First Essay (Nimrod)',
 'Friday Night',
 'Angel',
 'Unquiet Respect',
 'Kwaku the Traveller',
 'Life According To Raechel',
 'Penas Con Pan',
 'Love, Try Not To Let Go',
 'Crooked Tree',
 'The Place Where He Inserted the Blade',
 'B-Side',
 'Unonkanyamba',
 'PROVENZA',
 'When Sparks Fly',
 '2 Die 4',
 'BPW',
 'Walkin',
 'Talk',
 'Atopos',
 'Afar',
 "Don't Forget",
 'Up The Mountain',
 'I Ran Down Every Dream',
 'Estas En Mi Pasado',
 'Schumann: Study in Canonic Form, Op. 56, No. 1',
 'Poland',
 'Afraid to Feel',
 'Shotgun',
 'Antifragile',
 'Bed Time',
 'The Heart Part 5',
 'Home Maker',
 'Love Song (Come Back)'

In [31]:
len(artists_2022)

100

In [32]:
len(songs_2022)

100
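
The length checks above can be folded into a small helper that refuses to pair lists of different sizes. This is only a sketch: `build_rows` is a hypothetical name, and the `Year` column is an optional addition that is not part of the original dataframe.

```python
def build_rows(artists, songs, year):
    """Pair artists with songs and tag each row with the list's year."""
    if len(artists) != len(songs):
        raise ValueError(
            f"{year}: {len(artists)} artists but {len(songs)} songs; lists are misaligned"
        )
    return [
        {"Artist": artist, "Song": song, "Year": year}
        for artist, song in zip(artists, songs)
    ]

# Small sample taken from the 2022 outputs above
rows = build_rows(
    ["Little Simz", "Ian William Craig"],
    ["Gorilla", "Attention For It Radiates"],
    2022,
)
print(rows[0])  # {'Artist': 'Little Simz', 'Song': 'Gorilla', 'Year': 2022}
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, so the rows for 2021 and 2022 could be concatenated before building the dataframe.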

In [33]:
# Creating a DataFrame from the lists
other_songs = pd.DataFrame({
    'Artist': artists_2021 + artists_2022,
    'Song': songs_2021 + songs_2022
})

other_songs

| | Artist | Song |
|---|---|---|
| 0 | Taylor Swift | All Too Well (10 Minute Version) (Taylor's Ver... |
| 1 | Rhiannon Giddens & Francesco Turrisi | Avalon |
| 2 | Anna B Savage | Baby Grand |
| 3 | Moneybagg Yo | Wockesha |
| 4 | Snail Mail | Automate |
| ... | ... | ... |
| 195 | ROSALÍA | SAOKO |
| 196 | Alex G | Runner |
| 197 | Bad Bunny | El Apagón |
| 198 | Beyoncé | ALIEN SUPERSTAR |


In [36]:
# Saving the dataframe with the NPR 100 Best songs from 2021 and 2022 into a csv file
other_songs.to_csv(r'C:\Users\mafal\Documents\ironhack\projects\project-song-recommender\other_songs.csv', index=False)