## Back to your challenge (now and homework)
For your in-class/homework challenge, please scrape the top 500 best-selling albums of the 2010 decade. Your data must include the following datapoints:

- Name of album
- Name of artist
- Number of albums sold
- The link to the page that breaks down sales by country (found by clicking album title)

# Skills required to accomplish this
- List Comprehension - a foundational skill that is like a more concise for loop.
- BeautifulSoup - a package that can parse html.
- Headers - to disguise your requests.
- zip() - a method to zip several related lists together.
- tuple - yet another data type in Python
Of course, you will still need

In [47]:
## import libraries
import requests  # Makes HTTP requests to fetch web pages from URLs
from bs4 import BeautifulSoup  # Parses HTML content into navigable Python objects for web scraping
import pandas as pd  # Creates and manipulates DataFrames for organizing scraped data into tables
import time  # Adds delays between requests to avoid overwhelming the server
from random import uniform  # Generates random time intervals to make scraping delays less predictable

In [48]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

In [49]:
# set url and scrape the website, check that it worked
url = "https://bestsellingalbums.org/decade/2010"
response = requests.get(url, headers=headers)
response

<Response [200]>

In [50]:
# convert the response.txt into BeautifulSoup object 

soup = BeautifulSoup(response.text, "html.parser")

In [51]:
# target and extract album name element from the html 
album_name=soup.find_all("div", class_="album")
album_name

[<div class="album"><a href="https://bestsellingalbums.org/album/1034">21</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/1035">25</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/30524">CHRISTMAS</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/45488">1989</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/23318">PURPOSE</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/12876">DIVIDE</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/42961">FROZEN</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/23977">TEENAGE DREAM</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/12880">X</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/6777">DOO-WOPS &amp; HOOLIGANS</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/13756">RECOVERY</a></div

In [52]:
# for loop to extract and store the string element in a list

album_name_lc =[album.get_text() for album in album_name]

album_name_lc

['21',
 '25',
 'CHRISTMAS',
 '1989',
 'PURPOSE',
 'DIVIDE',
 'FROZEN',
 'TEENAGE DREAM',
 'X',
 'DOO-WOPS & HOOLIGANS',
 'RECOVERY',
 'NIGHT VISIONS',
 'IN THE LONELY HOUR',
 'UNORTHODOX JUKEBOX',
 'RED',
 '+',
 'VIEWS',
 'BEAUTY BEHIND THE MADNESS',
 'WHEN WE ALL FALL ASLEEP, WHERE DO WE GO?',
 'BORN THIS WAY',
 'MAP OF THE SOUL: 7',
 'BEERBONGS & BENTLEYS',
 'TAKE CARE',
 'SPEAK NOW',
 'PRISM',
 'BORN TO DIE',
 'LOUD',
 'ANTI',
 'BLURRYFACE',
 "HOLLYWOOD'S BLEEDING",
 'SCORPION',
 'STONEY',
 'TAKE ME HOME',
 'THE GREATEST SHOWMAN',
 'BEYONCÉ',
 'THE TRUTH ABOUT LOVE',
 'REPUTATION',
 '?',
 'TRAVELLER',
 'STARBOY',
 'UP ALL NIGHT',
 'MIDNIGHT MEMORIES',
 'MAP OF THE SOUL: PERSONA',
 'GOODBYE & GOOD RIDDANCE',
 'A HEAD FULL OF DREAMS',
 'THE HEIST',
 'THE MARSHALL MATHERS LP 2',
 'LOVER',
 'WATCH THE THRONE',
 "THIS ONE'S FOR YOU"]

In [53]:
# extract and store artist name from html 
artist_name=soup.find_all("div", class_="artist")
artist_name

[<div class="artist"><a href="https://bestsellingalbums.org/artist/218" title="ADELE album sales">ADELE</a></div>,
 <div class="artist"><a href="https://bestsellingalbums.org/artist/218" title="ADELE album sales">ADELE</a></div>,
 <div class="artist"><a href="https://bestsellingalbums.org/artist/8822" title="MICHAEL BUBLÉ album sales">MICHAEL BUBLÉ</a></div>,
 <div class="artist"><a href="https://bestsellingalbums.org/artist/12748" title="TAYLOR SWIFT album sales">TAYLOR SWIFT</a></div>,
 <div class="artist"><a href="https://bestsellingalbums.org/artist/6646" title="JUSTIN BIEBER album sales">JUSTIN BIEBER</a></div>,
 <div class="artist"><a href="https://bestsellingalbums.org/artist/3645" title="ED SHEERAN album sales">ED SHEERAN</a></div>,
 <div class="artist"><a href="https://bestsellingalbums.org/artist/12207" title="SOUNDTRACK album sales">SOUNDTRACK</a></div>,
 <div class="artist"><a href="https://bestsellingalbums.org/artist/6828" title="KATY PERRY album sales">KATY PERRY</a></di

In [54]:
# list comprehension to extract and store string element of artist name

artist_name_lc =[artist.get_text() for artist in artist_name]

artist_name_lc

['ADELE',
 'ADELE',
 'MICHAEL BUBLÉ',
 'TAYLOR SWIFT',
 'JUSTIN BIEBER',
 'ED SHEERAN',
 'SOUNDTRACK',
 'KATY PERRY',
 'ED SHEERAN',
 'BRUNO MARS',
 'EMINEM',
 'IMAGINE DRAGONS',
 'SAM SMITH',
 'BRUNO MARS',
 'TAYLOR SWIFT',
 'ED SHEERAN',
 'DRAKE',
 'THE WEEKND',
 'BILLIE EILISH',
 'LADY GAGA',
 'BTS (방탄소년단)',
 'POST MALONE',
 'DRAKE',
 'TAYLOR SWIFT',
 'KATY PERRY',
 'LANA DEL REY',
 'RIHANNA',
 'RIHANNA',
 'TWENTY ONE PILOTS',
 'POST MALONE',
 'DRAKE',
 'POST MALONE',
 'ONE DIRECTION',
 'SOUNDTRACK',
 'BEYONCÉ',
 'P!NK',
 'TAYLOR SWIFT',
 'XXXTENTACION',
 'CHRIS STAPLETON',
 'THE WEEKND',
 'ONE DIRECTION',
 'ONE DIRECTION',
 'BTS (방탄소년단)',
 'JUICE WRLD',
 'COLDPLAY',
 'MACKLEMORE & RYAN LEWIS',
 'EMINEM',
 'TAYLOR SWIFT',
 'JAY-Z & KANYE WEST',
 'LUKE COMBS']

In [55]:
# extract and store artist name from html 
sales=soup.find_all("div", class_="sales")
sales

[<div class="sales">Sales: 30,000,000</div>,
 <div class="sales">Sales: 23,000,000</div>,
 <div class="sales">Sales: 15,000,000</div>,
 <div class="sales">Sales: 14,748,116</div>,
 <div class="sales">Sales: 14,000,000</div>,
 <div class="sales">Sales: 13,787,460</div>,
 <div class="sales">Sales: 12,632,083</div>,
 <div class="sales">Sales: 12,134,000</div>,
 <div class="sales">Sales: 11,879,785</div>,
 <div class="sales">Sales: 11,270,000</div>,
 <div class="sales">Sales: 10,873,795</div>,
 <div class="sales">Sales: 9,616,263</div>,
 <div class="sales">Sales: 9,321,352</div>,
 <div class="sales">Sales: 8,976,749</div>,
 <div class="sales">Sales: 8,889,124</div>,
 <div class="sales">Sales: 7,705,000</div>,
 <div class="sales">Sales: 7,687,247</div>,
 <div class="sales">Sales: 7,584,588</div>,
 <div class="sales">Sales: 7,256,516</div>,
 <div class="sales">Sales: 7,166,944</div>,
 <div class="sales">Sales: 7,130,621</div>,
 <div class="sales">Sales: 7,116,118</div>,
 <div class="sales">S

In [56]:
# list comprehension to extract string element for sales

sales_lc =[sale.get_text() for sale in sales]

sales_lc

['Sales: 30,000,000',
 'Sales: 23,000,000',
 'Sales: 15,000,000',
 'Sales: 14,748,116',
 'Sales: 14,000,000',
 'Sales: 13,787,460',
 'Sales: 12,632,083',
 'Sales: 12,134,000',
 'Sales: 11,879,785',
 'Sales: 11,270,000',
 'Sales: 10,873,795',
 'Sales: 9,616,263',
 'Sales: 9,321,352',
 'Sales: 8,976,749',
 'Sales: 8,889,124',
 'Sales: 7,705,000',
 'Sales: 7,687,247',
 'Sales: 7,584,588',
 'Sales: 7,256,516',
 'Sales: 7,166,944',
 'Sales: 7,130,621',
 'Sales: 7,116,118',
 'Sales: 6,920,000',
 'Sales: 6,917,500',
 'Sales: 6,692,500',
 'Sales: 6,674,983',
 'Sales: 6,673,000',
 'Sales: 6,537,235',
 'Sales: 6,500,000',
 'Sales: 6,461,665',
 'Sales: 6,433,983',
 'Sales: 6,371,355',
 'Sales: 6,334,619',
 'Sales: 6,318,119',
 'Sales: 6,290,833',
 'Sales: 6,231,084',
 'Sales: 6,186,524',
 'Sales: 6,182,852',
 'Sales: 6,157,000',
 'Sales: 6,070,666',
 'Sales: 6,046,188',
 'Sales: 6,020,087',
 'Sales: 6,010,031',
 'Sales: 6,002,713',
 'Sales: 6,000,000',
 'Sales: 5,858,500',
 'Sales: 5,790,318',
 '

In [57]:
# extract and save just the number

sales_number =[sale.split(" ")[1] for sale in sales_lc]
sales_number

['30,000,000',
 '23,000,000',
 '15,000,000',
 '14,748,116',
 '14,000,000',
 '13,787,460',
 '12,632,083',
 '12,134,000',
 '11,879,785',
 '11,270,000',
 '10,873,795',
 '9,616,263',
 '9,321,352',
 '8,976,749',
 '8,889,124',
 '7,705,000',
 '7,687,247',
 '7,584,588',
 '7,256,516',
 '7,166,944',
 '7,130,621',
 '7,116,118',
 '6,920,000',
 '6,917,500',
 '6,692,500',
 '6,674,983',
 '6,673,000',
 '6,537,235',
 '6,500,000',
 '6,461,665',
 '6,433,983',
 '6,371,355',
 '6,334,619',
 '6,318,119',
 '6,290,833',
 '6,231,084',
 '6,186,524',
 '6,182,852',
 '6,157,000',
 '6,070,666',
 '6,046,188',
 '6,020,087',
 '6,010,031',
 '6,002,713',
 '6,000,000',
 '5,858,500',
 '5,790,318',
 '5,686,733',
 '5,550,000',
 '5,490,000']

In [58]:
# extract and save link for sales by country 
links=soup.find_all("div", class_="album")
links

[<div class="album"><a href="https://bestsellingalbums.org/album/1034">21</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/1035">25</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/30524">CHRISTMAS</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/45488">1989</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/23318">PURPOSE</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/12876">DIVIDE</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/42961">FROZEN</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/23977">TEENAGE DREAM</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/12880">X</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/6777">DOO-WOPS &amp; HOOLIGANS</a></div>,
 <div class="album"><a href="https://bestsellingalbums.org/album/13756">RECOVERY</a></div

In [59]:
# extract the a element 
links_a = [link.find("a") for link in links]
links_a

[<a href="https://bestsellingalbums.org/album/1034">21</a>,
 <a href="https://bestsellingalbums.org/album/1035">25</a>,
 <a href="https://bestsellingalbums.org/album/30524">CHRISTMAS</a>,
 <a href="https://bestsellingalbums.org/album/45488">1989</a>,
 <a href="https://bestsellingalbums.org/album/23318">PURPOSE</a>,
 <a href="https://bestsellingalbums.org/album/12876">DIVIDE</a>,
 <a href="https://bestsellingalbums.org/album/42961">FROZEN</a>,
 <a href="https://bestsellingalbums.org/album/23977">TEENAGE DREAM</a>,
 <a href="https://bestsellingalbums.org/album/12880">X</a>,
 <a href="https://bestsellingalbums.org/album/6777">DOO-WOPS &amp; HOOLIGANS</a>,
 <a href="https://bestsellingalbums.org/album/13756">RECOVERY</a>,
 <a href="https://bestsellingalbums.org/album/19810">NIGHT VISIONS</a>,
 <a href="https://bestsellingalbums.org/album/39978">IN THE LONELY HOUR</a>,
 <a href="https://bestsellingalbums.org/album/6778">UNORTHODOX JUKEBOX</a>,
 <a href="https://bestsellingalbums.org/album/4

In [60]:
# extract and store the link
links_lc = [link.get("href") for link in links_a]
links_lc

['https://bestsellingalbums.org/album/1034',
 'https://bestsellingalbums.org/album/1035',
 'https://bestsellingalbums.org/album/30524',
 'https://bestsellingalbums.org/album/45488',
 'https://bestsellingalbums.org/album/23318',
 'https://bestsellingalbums.org/album/12876',
 'https://bestsellingalbums.org/album/42961',
 'https://bestsellingalbums.org/album/23977',
 'https://bestsellingalbums.org/album/12880',
 'https://bestsellingalbums.org/album/6777',
 'https://bestsellingalbums.org/album/13756',
 'https://bestsellingalbums.org/album/19810',
 'https://bestsellingalbums.org/album/39978',
 'https://bestsellingalbums.org/album/6778',
 'https://bestsellingalbums.org/album/45494',
 'https://bestsellingalbums.org/album/12875',
 'https://bestsellingalbums.org/album/12457',
 'https://bestsellingalbums.org/album/47839',
 'https://bestsellingalbums.org/album/5207',
 'https://bestsellingalbums.org/album/25786',
 'https://bestsellingalbums.org/album/6859',
 'https://bestsellingalbums.org/album/36

In [61]:
# combine into a single data frame

top_albums_1 = list(zip(album_name_lc, artist_name_lc, links_lc))
df_1=pd.DataFrame(top_albums_1)
df_1.columns =["Album", "Artist", "Link"]
df_1

Unnamed: 0,Album,Artist,Link
0,21,ADELE,https://bestsellingalbums.org/album/1034
1,25,ADELE,https://bestsellingalbums.org/album/1035
2,CHRISTMAS,MICHAEL BUBLÉ,https://bestsellingalbums.org/album/30524
3,1989,TAYLOR SWIFT,https://bestsellingalbums.org/album/45488
4,PURPOSE,JUSTIN BIEBER,https://bestsellingalbums.org/album/23318
5,DIVIDE,ED SHEERAN,https://bestsellingalbums.org/album/12876
6,FROZEN,SOUNDTRACK,https://bestsellingalbums.org/album/42961
7,TEENAGE DREAM,KATY PERRY,https://bestsellingalbums.org/album/23977
8,X,ED SHEERAN,https://bestsellingalbums.org/album/12880
9,DOO-WOPS & HOOLIGANS,BRUNO MARS,https://bestsellingalbums.org/album/6777


In [62]:
## import libraries
import requests  # Makes HTTP requests to fetch web pages from URLs
from bs4 import BeautifulSoup  # Parses HTML content into navigable Python objects for web scraping
import pandas as pd  # Creates and manipulates DataFrames for organizing scraped data into tables
import time  # Adds delays between requests to avoid overwhelming the server
from random import uniform  # Generates random time intervals to make scraping delays less predictable

In [63]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

In [64]:
# repeat for top 500 albums of 2010s 

base_url = "https://bestsellingalbums.org/decade/2010"
df_scrape_list = []
broken_links = []

for i, number in enumerate(range(2,11), start = 2):
    url = f"{base_url}-{number}"
    print(f"Scraping page {i}, url: {url}")
    try:
        response = requests.get(url, headers=headers)
        print(f"Status code: {response.status_code}")
        soup = BeautifulSoup(response.text, "html.parser")
        album_name=soup.find_all("div", class_="album")
        album_name_lc =[album.get_text() for album in album_name]
        artist_name=soup.find_all("div", class_="artist")
        artist_name_lc =[artist.get_text() for artist in artist_name]
        sales = soup.find_all("div", class_="sales")
        sales_lc = [sale.get_text() for sale in sales]
        sales_number = [sale.split(" ")[1] for sale in sales_lc]
        links=soup.find_all("div", class_="album")
        links_a = [link.find("a") for link in links]
        links_lc = [link.get("href") for link in links_a] 
        top_albums = list(zip(album_name_lc, artist_name_lc, sales_number, links_lc))
        df_scrape = pd.DataFrame(top_albums, columns=["Album", "Artist", "Sales", "Link"])
        df_scrape_list.append(df)

    except Exception as e:
        print(f"Encountered an issue: {e} at {url}")
        broken_links.append(url)
    finally:
        snoozer = uniform(5,8)
        print(f"Snoozing for {snoozer} seconds before next scrape")
        time.sleep(snoozer)

print(f"done scraping all urls")
final_df = pd.concat(df_scrape_list, ignore_index=True)
print(final_df)

Scraping page 2, url: https://bestsellingalbums.org/decade/2010-2
Status code: 200
Snoozing for 7.2912111778577895 seconds before next scrape
Scraping page 3, url: https://bestsellingalbums.org/decade/2010-3
Status code: 200
Snoozing for 5.97046498923487 seconds before next scrape
Scraping page 4, url: https://bestsellingalbums.org/decade/2010-4
Status code: 200
Snoozing for 7.9195462400171595 seconds before next scrape
Scraping page 5, url: https://bestsellingalbums.org/decade/2010-5
Status code: 200
Snoozing for 7.5427411832491496 seconds before next scrape
Scraping page 6, url: https://bestsellingalbums.org/decade/2010-6
Status code: 200
Snoozing for 6.418596636926709 seconds before next scrape
Scraping page 7, url: https://bestsellingalbums.org/decade/2010-7
Status code: 200
Snoozing for 5.555555597795449 seconds before next scrape
Scraping page 8, url: https://bestsellingalbums.org/decade/2010-8
Status code: 200
Snoozing for 6.935852072726977 seconds before next scrape
Scraping pa

In [65]:
combined_df=pd.concat([df_1, final_df], ignore_index=True)
combined_df

Unnamed: 0,Album,Artist,Link,Sales
0,21,ADELE,https://bestsellingalbums.org/album/1034,
1,25,ADELE,https://bestsellingalbums.org/album/1035,
2,CHRISTMAS,MICHAEL BUBLÉ,https://bestsellingalbums.org/album/30524,
3,1989,TAYLOR SWIFT,https://bestsellingalbums.org/album/45488,
4,PURPOSE,JUSTIN BIEBER,https://bestsellingalbums.org/album/23318,
...,...,...,...,...
495,UNDER PRESSURE,LOGIC,https://bestsellingalbums.org/album/27268,1060000
496,THE STRANGE CASE OF,HALESTORM,https://bestsellingalbums.org/album/17960,1060000
497,UNCAGED,ZAC BROWN BAND,https://bestsellingalbums.org/album/56701,1055000
498,FUTURE,FUTURE,https://bestsellingalbums.org/album/16036,1050371
