# Lab | Web Scraping Single Page

**Business goal:**
Check the case_study_gnod.md file.

Make sure you've understood the big picture of your project:

- the goal of the company (Gnod),
- their current product (Gnoosic),
- their strategy, and
- how your project fits into this context.

Re-read the business case and the e-mail from the CTO, take a look at the flowchart, and create an initial Trello board with the tasks you think you'll have to accomplish.

**Instructions - Scraping popular songs**
Your product will take a song as an input from the user and will output another song (the recommendation). \
In most cases, the recommended song will have to be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will enjoy a recommendation of another currently popular song more.

You have to find data on the internet about currently popular songs. Billboard maintains a weekly Top 100 of "hot" songs here: https://www.billboard.com/charts/hot-100.

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [4]:
df = pd.read_html('https://en.wikipedia.org/wiki/Soup')
df[1]

Unnamed: 0,0,1
0,Hungarian goulash soup,Hungarian goulash soup
1,Type,Soup
2,Main ingredients,"Liquid (stock, juice, water), meat or vegetabl..."
3,Variations,"Clear soup, thick soup"
4,Cookbook: Soup Media: Soup,Cookbook: Soup Media: Soup


### New link

In [5]:
song_name = []
artist = []

r = requests.get('http://www.popvortex.com/music/charts/top-100-songs.php')
r.status_code

200
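
Checking `status_code` by eye works in a notebook, but a reusable fetch step can fail loudly instead. The following is a minimal sketch, not part of the original lab: `fetch_html` and its `session` and `timeout` parameters are illustrative names, and the injectable session is an assumption made so the function can be tried without network access.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the raw bytes of a page, raising on HTTP errors."""
    # The session is injectable so the function can be exercised offline.
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    # raise_for_status() turns 4xx/5xx responses into requests.HTTPError.
    response.raise_for_status()
    return response.content
```

With this helper, `html = fetch_html('http://www.popvortex.com/music/charts/top-100-songs.php')` would stand in for the `requests.get` call and the manual status check.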

In [6]:
html = r.content
html

b'<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><title>iTunes Top 100 Songs Chart 2022</title><meta name="viewport" content="width=device-width, initial-scale=1"><meta name="description" content="iTunes top 100 songs chart list. The most popular hit music and trending songs of 2022. Chart of today\'s current iTunes top 100 songs is updated daily."><meta property="og:title" content="iTunes Top 100 Songs Chart 2022"/><meta property="og:description" content="Chart of the top 100 songs on iTunes. Chart list of the top 100 song downloads of 2022 is updated daily."/><meta property="og:type" content="article"/><meta property="og:image" content="http://www.popvortex.com/images/logo-facebook.png"/><meta property="og:site_name" content="PopVortex"/><meta property="og:url" content="http://www.popvortex.com/music/charts/top-100-songs.php"/><meta property="fb:admins" content="100000239962942"/><meta property="fb:app_id" content="178831188827052"/><link rel="shortcut icon" href="/favico

In [7]:
soup = BeautifulSoup(html, 'html.parser')
soup

<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"/><title>iTunes Top 100 Songs Chart 2022</title><meta content="width=device-width, initial-scale=1" name="viewport"/><meta content="iTunes top 100 songs chart list. The most popular hit music and trending songs of 2022. Chart of today's current iTunes top 100 songs is updated daily." name="description"/><meta content="iTunes Top 100 Songs Chart 2022" property="og:title"><meta content="Chart of the top 100 songs on iTunes. Chart list of the top 100 song downloads of 2022 is updated daily." property="og:description"><meta content="article" property="og:type"><meta content="http://www.popvortex.com/images/logo-facebook.png" property="og:image"/><meta content="PopVortex" property="og:site_name"/><meta content="http://www.popvortex.com/music/charts/top-100-songs.php" property="og:url"/><meta content="100000239962942" property="fb:admins"/><meta content="178831188827052" property="fb:app_id"/><link href="/favicon.png" rel="shortcut i

In [8]:
pretty_soup = soup.prettify()
pretty_soup

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   iTunes Top 100 Songs Chart 2022\n  </title>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <meta content="iTunes top 100 songs chart list. The most popular hit music and trending songs of 2022. Chart of today\'s current iTunes top 100 songs is updated daily." name="description"/>\n  <meta content="iTunes Top 100 Songs Chart 2022" property="og:title">\n   <meta content="Chart of the top 100 songs on iTunes. Chart list of the top 100 song downloads of 2022 is updated daily." property="og:description">\n    <meta content="article" property="og:type">\n     <meta content="http://www.popvortex.com/images/logo-facebook.png" property="og:image"/>\n     <meta content="PopVortex" property="og:site_name"/>\n     <meta content="http://www.popvortex.com/music/charts/top-100-songs.php" property="og:url"/>\n     <meta content="100000239962942" property="fb:admins"/>\n     <meta content=

In [9]:
soup.select("div")

[<div class="container"><nav aria-label="Main" class="navbar navbar-default navbar-fixed-top"><div class="container"><div class="navbar-header"><button class="navbar-toggle" data-target="#bs-example-navbar-collapse-1" data-toggle="collapse" type="button"><span class="sr-only">Toggle navigation</span> <span class="icon-bar"></span><span class="icon-bar"></span><span class="icon-bar"></span></button><a class="navbar-brand" href="/"><img alt="PopVortex - Music, Books, Movies, Televsion" height="80" src="/images/popvortex-logo.png" style="border-width: 0px" width="126"/></a> </div><div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1"><ul class="nav navbar-nav"><li class="dropdown active"><a class="dropdown-toggle" data-toggle="dropdown" href="/music/">Music <b class="caret"></b></a><ul class="dropdown-menu"><li><a href="/music/">Music News and Reviews</a></li><li><a href="/music/charts/">iTunes Music Charts</a></li><li class="divider"></li><li><a href="/music/charts/"><st

In [10]:
soup.select("cite.title")

[<cite class="title">Running Up That Hill (A Deal with God)</cite>,
 <cite class="title">Hold My Hand</cite>,
 <cite class="title">As It Was</cite>,
 <cite class="title">About Damn Time</cite>,
 <cite class="title">First Class</cite>,
 <cite class="title">You Proof</cite>,
 <cite class="title">Danger Zone</cite>,
 <cite class="title">AA</cite>,
 <cite class="title">I Like You (A Happier Song) [feat. Doja Cat]</cite>,
 <cite class="title">She Had Me At Heads Carolina</cite>,
 <cite class="title">Voices in My Head</cite>,
 <cite class="title">Heartless</cite>,
 <cite class="title">Wasted On You</cite>,
 <cite class="title">Jiggle Jiggle</cite>,
 <cite class="title">About Damn Time</cite>,
 <cite class="title">Project</cite>,
 <cite class="title">Unstoppable</cite>,
 <cite class="title">Bam Bam (feat. Ed Sheeran)</cite>,
 <cite class="title">I Ain't Worried</cite>,
 <cite class="title">Cooped Up (feat. Roddy Ricch)</cite>,
 <cite class="title">Distraction</cite>,
 <cite class="title">'Til

In [11]:
soup.select("cite.title")[0].get_text()

'Running Up That Hill (A Deal with God)'

In [12]:
soup.select("em.artist")

[<em class="artist">Kate Bush</em>,
 <em class="artist">Lady Gaga</em>,
 <em class="artist">Harry Styles</em>,
 <em class="artist">Lizzo</em>,
 <em class="artist">Jack Harlow</em>,
 <em class="artist">Morgan Wallen</em>,
 <em class="artist">Kenny Loggins</em>,
 <em class="artist">Walker Hayes</em>,
 <em class="artist">Post Malone</em>,
 <em class="artist">Cole Swindell</em>,
 <em class="artist">Falling In Reverse</em>,
 <em class="artist">Kanye West</em>,
 <em class="artist">Morgan Wallen</em>,
 <em class="artist">Duke &amp; Jones &amp; Louis Theroux</em>,
 <em class="artist">Lizzo</em>,
 <em class="artist">Chase McDaniel</em>,
 <em class="artist">Sia</em>,
 <em class="artist">Camila Cabello</em>,
 <em class="artist">OneRepublic</em>,
 <em class="artist">Post Malone</em>,
 <em class="artist">Polo G</em>,
 <em class="artist">Cody Johnson</em>,
 <em class="artist">Kane Brown</em>,
 <em class="artist">Kanye West</em>,
 <em class="artist">Nathan Evans</em>,
 <em class="artist">Flo Rida</em

In [13]:
soup.select("em.artist")[0].get_text()

'Kate Bush'

In [14]:
# run each selector once and collect the text of every matching tag
for title_tag in soup.select("cite.title"):
    song_name.append(title_tag.get_text())

for artist_tag in soup.select("em.artist"):
    artist.append(artist_tag.get_text())
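
The titles and artists are collected independently, so the resulting lists only line up if both selectors match the same number of tags. A minimal sketch of a guard for that, assuming the selectors used above; `extract_pairs` is a hypothetical helper and `sample_html` is made-up markup for illustration, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page is much larger.
sample_html = """
<ul>
  <li><cite class="title">Song A</cite><em class="artist">Artist A</em></li>
  <li><cite class="title">Song B</cite><em class="artist">Artist B</em></li>
</ul>
"""


def extract_pairs(soup, title_selector, artist_selector):
    """Return (title, artist) tuples, failing loudly if the counts differ."""
    titles = [tag.get_text(strip=True) for tag in soup.select(title_selector)]
    artists = [tag.get_text(strip=True) for tag in soup.select(artist_selector)]
    if len(titles) != len(artists):
        raise ValueError(f"{len(titles)} titles but {len(artists)} artists")
    return list(zip(titles, artists))


pairs = extract_pairs(BeautifulSoup(sample_html, "html.parser"), "cite.title", "em.artist")
```

A list of tuples like `pairs` can be passed straight to `pd.DataFrame(pairs, columns=["song", "artist"])`.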

In [15]:
print(song_name)
print()
print(artist)

['Running Up That Hill (A Deal with God)', 'Hold My Hand', 'As It Was', 'About Damn Time', 'First Class', 'You Proof', 'Danger Zone', 'AA', 'I Like You (A Happier Song) [feat. Doja Cat]', 'She Had Me At Heads Carolina', 'Voices in My Head', 'Heartless', 'Wasted On You', 'Jiggle Jiggle', 'About Damn Time', 'Project', 'Unstoppable', 'Bam Bam (feat. Ed Sheeran)', "I Ain't Worried", 'Cooped Up (feat. Roddy Ricch)', 'Distraction', "'Til You Can't", 'Like I Love Country Music', 'Stronger', 'Wellerman (Sea Shanty)', 'Right Round (feat. Ke$ha)', 'Cold Heart (PNAU Remix)', 'Low (feat. T-Pain)', 'In Jesus Name (God Of Possible)', 'Daylight', 'WAIT FOR U (feat. Drake & Tems)', 'The Monster (feat. Rihanna)', 'Fall In Love', 'Numb Little Bug', 'Fancy Like', 'Big Energy', 'True Love', 'Wrapped Around Your Finger', 'Whiskey On You', 'The Way I Are (feat. Keri Hilson & D.O.E.)', 'Heat Waves', 'Wild Ones (feat. Sia)', 'Nerve Flip', 'Take My Name', 'Love/Hate Letter To Alcohol (feat. Fleet Foxes)', 'Shi

In [16]:
# each list becomes a column
billboard = pd.DataFrame({"song":song_name,
                       "artist":artist,
                      })

billboard.head()

Unnamed: 0,song,artist
0,Running Up That Hill (A Deal with God),Kate Bush
1,Hold My Hand,Lady Gaga
2,As It Was,Harry Styles
3,About Damn Time,Lizzo
4,First Class,Jack Harlow


In [17]:
billboard.tail()

Unnamed: 0,song,artist
95,Whistle,Flo Rida
96,Happier Than Ever,Kelly Clarkson
97,Shoop,Salt-N-Pepa
98,Bones,Imagine Dragons
99,Drunk (And I Don't Wanna Go Home),Elle King & Miranda Lambert


# Lab | Web Scraping Multiple Pages

#### Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!

In [18]:
r = requests.get('https://www.officialcharts.com/charts/singles-chart/')
r.status_code

200

In [19]:
html2 = r.content
html2

b'\r\n\r\n<!doctype html>\r\n<!--[if lt IE 7]><html class="no-js ie6 oldie" lang="en"><![endif]-->\r\n<!--[if IE 7]><html class="no-js ie7 oldie" lang="en"><![endif]-->\r\n<!--[if IE 8]><html class="no-js ie8 oldie" lang="en"><![endif]-->\r\n<!--[if gt IE 8]><!-->\r\n<html class="no-js" lang="en">\r\n<!--<![endif]-->\r\n\r\n<head>\r\n    \r\n\r\n<meta charset="utf-8" />\r\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\r\n\r\n<title>Official Singles Chart Top 100 | Official Charts Company</title>\r\n<meta name="description" content="The Official UK Top 40 chart is compiled by the Official Charts Company, based on official sales of sales of downloads, CD, vinyl, audio streams and video streams. The Top 40 is broadcast on BBC Radio 1 and MTV, the full Top 100 is published exclusively on OfficialCharts.com." />\r\n<meta name="keywords" content="Top 40, UK Top 40, Charts, Top 40 UK, UK Charts, UK singles chart, Music Charts, Official UK Top 40, Charts 2012, Hit 40 UK, UK 

In [20]:
soup2 = BeautifulSoup(html2, 'html.parser')
soup2


<!DOCTYPE html>

<!--[if lt IE 7]><html class="no-js ie6 oldie" lang="en"><![endif]-->
<!--[if IE 7]><html class="no-js ie7 oldie" lang="en"><![endif]-->
<!--[if IE 8]><html class="no-js ie8 oldie" lang="en"><![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<title>Official Singles Chart Top 100 | Official Charts Company</title>
<meta content="The Official UK Top 40 chart is compiled by the Official Charts Company, based on official sales of sales of downloads, CD, vinyl, audio streams and video streams. The Top 40 is broadcast on BBC Radio 1 and MTV, the full Top 100 is published exclusively on OfficialCharts.com." name="description"/>
<meta content="Top 40, UK Top 40, Charts, Top 40 UK, UK Charts, UK singles chart, Music Charts, Official UK Top 40, Charts 2012, Hit 40 UK, UK Chart, Official Singles Chart, Official Albums Chart, Number 1, Number One" name="k

In [21]:
soup2.select("div")

[<div style="display:none;">
 <img alt="Quantcast" border="0" height="1" src="//pixel.quantserve.com/pixel/p-qFYsnYXTfYNA0.gif" width="1"/>
 </div>,
 <div id="fb-root"></div>,
 <div id="container">
 <!-- Top Navigation elements -->
 <header class="site-header">
 <div class="inner-area">
 <a class="logo" href="/"><img alt="Official Charts " loading="lazy" src="/img/header/logo.png"/></a>
 <a class="site-nav-toggle icon-menutoggle" href=""></a>
 <!-- Mobile Site Nav -->
 <nav class="site-nav-mobile">
 <ul>
 <li><a class="search-mobile-on" href="">Search</a></li>
 <li><a href="/news/latest-news/">News</a></li>
 <li><a href="/new-releases/">New Releases</a></li>
 <li><a href="/charts/">Charts</a></li>
 <li><a href="/archive/">Archive</a></li>
 <li><a href="/artists/">Artists</a></li>
 <!-- Shop link is different -->
 <li><a href="https://shop.officialcharts.com/" target="_blank">Shop</a></li>
 </ul>
 </nav>
 <!-- Social bar -->
 <nav class="site-nav-social">
 <ul>
 <li><a class="icon-twitt

In [22]:
print(soup2.select("div.title")[0].get_text(strip=True))
print(soup2.select("div.artist")[0].get_text(strip=True))

AS IT WAS
HARRY STYLES
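
The `strip=True` argument matters when the text inside a tag is surrounded by indentation or line breaks. A small sketch of the difference, using a made-up snippet (the nesting shown here is an assumption, not a copy of the Official Charts markup):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a title wrapped in a link, with whitespace around it.
snippet = '<div class="title">\n  <a href="#">AS IT WAS</a>\n</div>'
tag = BeautifulSoup(snippet, "html.parser").select("div.title")[0]

raw_text = tag.get_text()              # keeps the surrounding newlines and spaces
clean_text = tag.get_text(strip=True)  # strips each text fragment, drops empty ones
```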


In [23]:
song_name_2 = []
artist_name_2 = []

# run each selector once and collect the stripped text of every matching tag
for title_tag in soup2.select("div.title"):
    song_name_2.append(title_tag.get_text(strip=True))

for artist_tag in soup2.select("div.artist"):
    artist_name_2.append(artist_tag.get_text(strip=True))
    
print(song_name_2)
print()
print(artist_name_2)

['AS IT WAS', 'GO', 'ABOUT DAMN TIME', 'LATE NIGHT TALKING', 'FIRST CLASS', 'MUSIC FOR A SUSHI RESTAURANT', 'IFTK', 'RUNNING UP THAT HILL', '2STEP', 'PERU', 'SPACE MAN', 'BAM BAM', 'CRAZY WHAT LOVE CAN DO', 'WAIT FOR U', 'WHERE DID YOU GO', 'POTION', 'GREEN GREEN GRASS', '21 REASONS', "SHE'S ALL I WANNA BE", 'PRINCE ANDREW IS A SWEATY N**CE', 'BIG ENERGY', "JE M'APPELLE", 'FLOWERS', 'HOLD MY HAND', 'BAD HABITS', 'COOPED UP', 'DIE HARD', 'WHAT WOULD YOU DO', 'WHERE ARE YOU NOW', 'REMIND ME', 'TRUE LOVE', 'SHIVERS', 'IN THE STARS', 'HEAT WAVES', 'BABY', '1989', 'N95', 'BALLING', 'STARLIGHT', "DON'T FORGET MY LOVE", '10 THINGS I HATE ABOUT YOU', 'BABA (TOMA TUSSI)', 'MAKE ME FEEL GOOD', 'LAST LAST', 'DIE YOUNG', 'DANDELIONS', 'NO EXCUSES', 'SEVENTEEN GOING UNDER', 'DOWN UNDER', 'BAD LIFE', 'MIDDLE OF THE NIGHT', 'MIXED EMOTIONS', 'COLD HEART', 'HOUSE ON FIRE', 'BMW', 'BRAZIL', 'MR BRIGHTSIDE', 'CHARMER', 'FREAKY DEAKY', 'THOUSAND MILES', 'CHURCHILL DOWNS', 'STAY THE NIGHT', 'USED TO THIS'

In [24]:
# each list becomes a column
billboard_2 = pd.DataFrame({"song":song_name_2,
                       "artist":artist_name_2,
                      })
print(len(billboard_2))
billboard_2.head()

100


Unnamed: 0,song,artist
0,AS IT WAS,HARRY STYLES
1,GO,CAT BURNS
2,ABOUT DAMN TIME,LIZZO
3,LATE NIGHT TALKING,HARRY STYLES
4,FIRST CLASS,JACK HARLOW
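
This chart is in upper case while the first one is in title case, so the same song appears under two different spellings. One possible way to pool the charts is sketched below, assuming a case-insensitive match on song plus artist is good enough; `chart_a`, `chart_b`, `hot_songs` and the `key` column are illustrative names, and the rows are copied from the outputs above.

```python
import pandas as pd

# Two tiny stand-ins for the scraped frames (rows copied from the outputs above).
chart_a = pd.DataFrame({"song": ["As It Was", "First Class"],
                        "artist": ["Harry Styles", "Jack Harlow"]})
chart_b = pd.DataFrame({"song": ["AS IT WAS", "GO"],
                        "artist": ["HARRY STYLES", "CAT BURNS"]})

hot_songs = pd.concat([chart_a, chart_b], ignore_index=True)

# Build a case-insensitive key so the same song from two charts counts once.
hot_songs["key"] = (hot_songs["song"].str.lower().str.strip()
                    + "|" + hot_songs["artist"].str.lower().str.strip())
hot_songs = (hot_songs.drop_duplicates(subset="key")
                      .drop(columns="key")
                      .reset_index(drop=True))
```

`drop_duplicates` keeps the first occurrence, so the spelling from whichever chart is concatenated first is the one that survives.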


#### Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.

In [37]:
# read_html returns a list of DataFrames, one per table on the page; keep the one at index 1
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_songs_recorded_by_Mumford_%26_Sons')
df_mumf = tables[1]

In [39]:
print(df_mumf.columns)
df_mumf = df_mumf.drop(['Notes', 'Unnamed: 4'], axis=1)
df_mumf

Index(['Title', 'Album(s) / EP(s) / single(s)', 'First released', 'Notes',
       'Unnamed: 4'],
      dtype='object')


Unnamed: 0,Title,Album(s) / EP(s) / single(s),First released
0,"""42""",Delta,2018
1,"""After All""",—,—
2,"""After The Storm""",Sigh No More,2009
3,"""Awake My Soul""",Mumford & Sons / Sigh No More,2008
4,"""Babel""",Babel / Babel - Single,2012
...,...,...,...
82,"""Wilder Mind""",Wilder Mind,2015
83,"""Wild Heart""",Delta,2018
84,"""Winter Winds (My Head Told My Heart)""",The Cave and the Open Sea / Sigh No More / Win...,2009
85,"""Woman""",Woman - Single / Delta,2018
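
The titles in this table carry literal double quotes, and unreleased songs use "—" as a placeholder. A minimal cleaning sketch on a stand-in frame with rows copied from the output above; `sample` is an illustrative name, and treating "—" as a missing value is an assumption about what is wanted downstream:

```python
import pandas as pd

# Stand-in for df_mumf with a few rows copied from the output above.
sample = pd.DataFrame({
    "Title": ['"42"', '"After All"', '"Babel"'],
    "First released": ["2018", "—", "2012"],
})

# Remove the literal double quotes around each song title.
sample["Title"] = sample["Title"].str.strip('"')

# Convert years to numbers; the "—" placeholder cannot be parsed and becomes NaN.
sample["First released"] = pd.to_numeric(sample["First released"], errors="coerce")
```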


#### Top 500 songs

In [160]:
r = requests.get('https://musicbrainz.org/series/b3484a66-a4de-444d-93d3-c99a73656905')
r.status_code

200

In [161]:
html_3 = r.content

In [196]:
# first, parse the first page so the results (and their total number) can be inspected
soup = BeautifulSoup(html_3, 'html.parser')

In [167]:
soup.select("a[href*=artist]")[0].get_text(strip=True)

'Bob Dylan'

In [None]:
soup.select("a[href*=artist]")

In [168]:
soup.select("a[href*=recording]")[0].get_text(strip=True)

'Like a Rolling Stone'

In [170]:
song_name_3 = []
artist_name_3 = []

for song in soup.select("a[href*=recording]"):
    song_name_3.append(song.get_text(strip=True))
    
for artist_link in soup.select("a[href*=artist]"):
    artist_name_3.append(artist_link.get_text(strip=True))

In [None]:
total_results = 500
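
With `total_results` known, the number of pages to request can be derived instead of hard-coded. A minimal sketch, assuming 100 results per page (inferred from 500 songs being spread over pages 1 to 5 in this notebook); `per_page`, `n_pages` and `page_urls` are illustrative names:

```python
import math

total_results = 500   # value from the cell above
per_page = 100        # assumption: inferred from 500 songs over pages 1 to 5

base_url = "https://musicbrainz.org/series/b3484a66-a4de-444d-93d3-c99a73656905"
n_pages = math.ceil(total_results / per_page)

# Pages are numbered from 1, so the range runs to n_pages inclusive.
page_urls = [f"{base_url}?page={page}" for page in range(1, n_pages + 1)]
```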

In [186]:
song_name_3 = []
artist_name_3 = []

for page in range(1,6):
    r = requests.get(f'https://musicbrainz.org/series/b3484a66-a4de-444d-93d3-c99a73656905?page={page}')
    soup = BeautifulSoup(r.content, 'html.parser')
    
    for song in soup.select("a[href*=recording]"):
        song_name_3.append(song.get_text(strip=True))
        
    for artist_link in soup.select("a[href*=artist]"):
        artist_name_3.append(artist_link.get_text(strip=True))


In [183]:
print(len(song_name_3))
len(artist_name_3)

500


513
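
The counts differ because some recordings credit more than one artist, so the flat artist selector returns more links than there are songs. One way to avoid lining the lists up by hand is to walk the table row by row and join every artist link found in the same row. The sketch below assumes each song sits in its own `<tr>` with its artist links in that row; the markup and the second song title are made up for illustration and are not copied from MusicBrainz.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup for illustration; real rows carry more columns.
sample_html = """
<table>
  <tr>
    <td><a href="/recording/1">Like a Rolling Stone</a></td>
    <td><a href="/artist/1">Bob Dylan</a></td>
  </tr>
  <tr>
    <td><a href="/recording/2">Placeholder Song</a></td>
    <td><a href="/artist/2">Prince</a> and <a href="/artist/3">The Revolution</a></td>
  </tr>
</table>
"""

rows = []
for tr in BeautifulSoup(sample_html, "html.parser").select("tr"):
    recording = tr.select_one("a[href*=recording]")
    if recording is None:
        continue  # header or layout row without a recording link
    # Join every credited artist in this row into a single string.
    credited = [a.get_text(strip=True) for a in tr.select("a[href*=artist]")]
    rows.append({"song": recording.get_text(strip=True),
                 "artist": " & ".join(credited)})
```

`pd.DataFrame(rows)` then yields one row per song, with joined credits such as "Prince & The Revolution".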

In [200]:
songs = pd.DataFrame(song_name_3)
artists = pd.DataFrame(artist_name_3)

In [201]:
artists[0]

Unnamed: 0,0
0,Bob Dylan
1,The Beatles
2,The Rolling Stones
3,John Lennon
4,Marvin Gaye
...,...
508,The Miracles
509,The Rolling Stones
510,Weezer
511,Brook Benton


In [209]:
artists.loc[31] = artists[31:33].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[31]

0    The Tennessee Two & Led Zeppelin
Name: 31, dtype: object

In [216]:
artists.loc[52] = artists[52:55].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[52]

0    Grandmaster Flash & The Furious Five & Melle M...
Name: 52, dtype: object

In [220]:
artists.loc[55] = artists[55:57].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[55]

0    Prince & The Revolution
Name: 55, dtype: object

In [232]:
artists.loc[114] = artists[114:116].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[114]

0    Hank Williams & His Drifting Cowboys
Name: 114, dtype: object

In [239]:
artists.loc[147] = artists[147:149].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[147]

0    Prince & The Revolution
Name: 147, dtype: object

In [224]:
artists.iloc[56]

0    The Revolution
Name: 56, dtype: object

In [252]:
artists.loc[218] = artists[218:220].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[218]

0    Hank Williams & His Drifting Cowboys
Name: 218, dtype: object

In [257]:
artists.loc[243] = artists[243:245].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[243]

0    Patsy Cline & Jim Reeves
Name: 243, dtype: object

In [262]:
artists.loc[293] = artists[293:295].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[293]

0    Run‐D.M.C. & Aerosmith
Name: 293, dtype: object

In [268]:
artists.loc[297] = artists[297:299].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[297]

0    Eminem & Dido
Name: 297, dtype: object

In [274]:
artists.loc[315] = artists[315:317].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[315]

0    Frankie Lymon & The Teenagers
Name: 315, dtype: object

In [280]:
artists.loc[358] = artists[358:360].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[358]

0    2Pac & Dr. Dre
Name: 358, dtype: object

In [285]:
artists.loc[473] = artists[473:475].apply(lambda x: ' & '.join(x.astype(str)))
artists.loc[473]

0    The Revolution
Name: 474, dtype: object

In [288]:
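# drop() returns a new DataFrame, so assign the result back to keep the change
artists = artists.drop(artists.index[[32,53,54,56,115,148,219,244,294,298,316,359,474]])
artists

Unnamed: 0,0
0,Bob Dylan
1,The Beatles
2,The Rolling Stones
3,John Lennon
4,Marvin Gaye
...,...
508,The Miracles
509,The Rolling Stones
510,Weezer
511,Brook Benton


In [293]:
# concat aligns on the index, so renumber the remaining artist rows to match songs
top_500 = pd.concat([songs, artists.reset_index(drop=True)], axis=1)
top_500.columns = ['song','artist']
top_500.head()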

Unnamed: 0,song,artist
0,Like a Rolling Stone,Bob Dylan
1,Strawberry Fields Forever,The Beatles
2,(I Can’t Get No) Satisfaction,The Rolling Stones
3,Imagine,John Lennon
4,What’s Going On,Marvin Gaye


In [294]:
billboard.head()

Unnamed: 0,song,artist
0,Running Up That Hill (A Deal with God),Kate Bush
1,Hold My Hand,Lady Gaga
2,As It Was,Harry Styles
3,About Damn Time,Lizzo
4,First Class,Jack Harlow


In [295]:
billboard_2.head()

Unnamed: 0,song,artist
0,AS IT WAS,HARRY STYLES
1,GO,CAT BURNS
2,ABOUT DAMN TIME,LIZZO
3,LATE NIGHT TALKING,HARRY STYLES
4,FIRST CLASS,JACK HARLOW
