Here is the goal of this project:
Please scrape the top 500 best-selling albums of the 2010s. Your data must include the following data points:

Name of album
Name of artist
Number of albums sold
The link to the page that breaks down sales by country (found by clicking album title)

This is an example of a single album and its fields on the main webpage:

Ranking: 6
Album Name: DIVIDE
Artist Name: ED SHEERAN
Sales: 13,787,460
Rank in 2017: 1 (NOT NEEDED)
Rank in 2010's: 6 (NOT NEEDED)
Overall rank: 159 (NOT NEEDED)
The link to the page that breaks down sales by country (found by clicking album title)

In [None]:
Now let’s see how the HTML maps to each value.

Album Name: DIVIDE
<div class="album"><a href="https://bestsellingalbums.org/album/12876">DIVIDE</a></div>
Artist Name: ED SHEERAN
<div class="artist"><a title="ED SHEERAN album sales" href="https://bestsellingalbums.org/artist/3645">ED SHEERAN</a></div>
Sales: 13,787,460
<div class="sales">Sales: 13,787,460</div>
All of these elements are nested inside a parent div with the class "album_card".
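
To check this mapping without touching the live site, the snippets above can be fed to a parser offline. The following is only a sketch: it uses Python's built-in html.parser as a dependency-free stand-in for BeautifulSoup, the sample card is assembled by hand from the three snippets, and `CardParser` is a hypothetical helper name, not part of any library.

```python
from html.parser import HTMLParser

# Sample card assembled from the snippets above (assumption: the three divs
# sit directly inside the "album_card" container).
SAMPLE_CARD = """
<div class="album_card">
  <div class="album"><a href="https://bestsellingalbums.org/album/12876">DIVIDE</a></div>
  <div class="artist"><a title="ED SHEERAN album sales" href="https://bestsellingalbums.org/artist/3645">ED SHEERAN</a></div>
  <div class="sales">Sales: 13,787,460</div>
</div>
"""


class CardParser(HTMLParser):
    """Hypothetical helper: collects album, artist, sales and link from one card."""

    FIELDS = ("album", "artist", "sales")

    def __init__(self):
        super().__init__()
        self.current = None  # class of the field div we are inside, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") in self.FIELDS:
            self.current = attrs["class"]
        elif tag == "a" and self.current == "album":
            # The album link is the href of the <a> inside div.album
            self.record["album_link"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div":
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current and text:
            key = "sales" if self.current == "sales" else f"{self.current}_name"
            # Drop the "Sales: " label so only the figure remains
            self.record[key] = text.replace("Sales: ", "")


parser = CardParser()
parser.feed(SAMPLE_CARD)
print(parser.record)
```

The resulting dictionary has the same four keys the scraper below collects: album_name, artist_name, sales and album_link.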

In [13]:
# Test the approach by scraping a single page
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import random

# Let's assign a variable to our URL 
base_url = "https://bestsellingalbums.org/decade/2010"

# Headers to mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

# A snoozer to pause between requests
def snoozer():
    delay = random.uniform(2, 5)
    print(f"Sleeping for {delay:.2f} seconds...")
    time.sleep(delay)

# Scrape one page and return a list of album records
def scrape_page(url):
    print(f"Fetching page: {url}")
    # A timeout keeps the request from hanging indefinitely if the server stalls
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        print(f"Failed to fetch page: {response.status_code}")
        return []

    soup = BeautifulSoup(response.text, "html.parser")
    snoozer()

    albums_data = []
    album_cards = soup.find_all("div", class_="album_card")

    for card in album_cards:
        album_tag = card.find("div", class_="album").find("a")
        album_name = album_tag.get_text(strip=True)
        album_link = album_tag["href"]

        artist_tag = card.find("div", class_="artist").find("a")
        artist_name = artist_tag.get_text(strip=True)

        sales_tag = card.find("div", class_="sales")
        sales = sales_tag.get_text(strip=True).replace("Sales: ", "") if sales_tag else "N/A"

        albums_data.append({
            "album_name": album_name,
            "artist_name": artist_name,
            "sales": sales,
            "album_link": album_link
        })

    return albums_data

# Now, let's run the function
albums = scrape_page(base_url)

# Create and display the DataFrame
df = pd.DataFrame(albums)
print(df.head())
pd.set_option('display.max_rows', None)
df

Fetching page: https://bestsellingalbums.org/decade/2010
Sleeping for 4.28 seconds...
  album_name    artist_name       sales  \
0         21          ADELE  30,000,000   
1         25          ADELE  23,000,000   
2  CHRISTMAS  MICHAEL BUBLÉ  15,000,000   
3       1989   TAYLOR SWIFT  14,748,116   
4    PURPOSE  JUSTIN BIEBER  14,000,000   

                                  album_link  
0   https://bestsellingalbums.org/album/1034  
1   https://bestsellingalbums.org/album/1035  
2  https://bestsellingalbums.org/album/30524  
3  https://bestsellingalbums.org/album/45488  
4  https://bestsellingalbums.org/album/23318  


Unnamed: 0,album_name,artist_name,sales,album_link
0,21,ADELE,30000000,https://bestsellingalbums.org/album/1034
1,25,ADELE,23000000,https://bestsellingalbums.org/album/1035
2,CHRISTMAS,MICHAEL BUBLÉ,15000000,https://bestsellingalbums.org/album/30524
3,1989,TAYLOR SWIFT,14748116,https://bestsellingalbums.org/album/45488
4,PURPOSE,JUSTIN BIEBER,14000000,https://bestsellingalbums.org/album/23318
5,DIVIDE,ED SHEERAN,13787460,https://bestsellingalbums.org/album/12876
6,FROZEN,SOUNDTRACK,12632083,https://bestsellingalbums.org/album/42961
7,TEENAGE DREAM,KATY PERRY,12134000,https://bestsellingalbums.org/album/23977
8,X,ED SHEERAN,11879785,https://bestsellingalbums.org/album/12880
9,DOO-WOPS & HOOLIGANS,BRUNO MARS,11270000,https://bestsellingalbums.org/album/6777
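
The scraper stores sales as text such as "13,787,460", which cannot be sorted or summed as a number. A minimal sketch of one way to convert it follows; `parse_sales` is a hypothetical helper name, and returning None for non-numeric values like the "N/A" fallback is an assumption about how missing data should be handled.

```python
def parse_sales(raw):
    """Hypothetical helper: '13,787,460' -> 13787460, or None if not numeric."""
    # Remove the label (if still present) and the thousands separators
    digits = raw.replace("Sales:", "").replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


print(parse_sales("13,787,460"))
print(parse_sales("Sales: 30,000,000"))
print(parse_sales("N/A"))
```

In the notebook this could be applied to the whole column with something like df["sales"].map(parse_sales).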


In [11]:
print(len(df))

50
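
With 50 albums per page, reaching 500 albums should take 10 pages. The sketch below works that out and builds the page URLs up front. The page size of 50 comes from the output above, the `-N` suffix for later pages is the pattern the next cell uses, and `page_urls` is a hypothetical helper name.

```python
import math


def page_urls(base_url, total_albums, albums_per_page):
    """Hypothetical helper: list the page URLs needed to cover total_albums."""
    pages = math.ceil(total_albums / albums_per_page)
    # Page 1 is the bare URL; later pages append "-<page number>"
    return [base_url if n == 1 else f"{base_url}-{n}" for n in range(1, pages + 1)]


urls = page_urls("https://bestsellingalbums.org/decade/2010", 500, 50)
print(len(urls))
print(urls[0])
print(urls[-1])
```

The next cell takes a different route and loops until it has 500 records, which also copes with a page size that is not known in advance.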


In [17]:
# The single-page test works, so now loop over pages until we have the top 500 albums
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import random

# Base URL for the decade
base_url = "https://bestsellingalbums.org/decade/2010"

# Headers to mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

# Snoozer to avoid hitting the server too quickly
def snoozer():
    delay = random.uniform(2, 5)
    print(f"Sleeping for {delay:.2f} seconds...")
    time.sleep(delay)

# List to store all albums
all_albums = []
page_number = 1

# Loop through pages until we reach 500 albums
while len(all_albums) < 500:
    # Build URL for each page
    url = base_url if page_number == 1 else f"{base_url}-{page_number}"
    print(f"\nFetching page: {url}")

    # A timeout keeps the request from hanging indefinitely if the server stalls
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        print(f"Failed to fetch page: {response.status_code}")
        break

    soup = BeautifulSoup(response.text, "html.parser")
    snoozer()

    album_cards = soup.find_all("div", class_="album_card")

    if not album_cards:
        print("No more albums found, stopping.")
        break

    for card in album_cards:
        album_tag = card.find("div", class_="album").find("a")
        album_name = album_tag.get_text(strip=True)
        album_link = album_tag["href"]

        artist_tag = card.find("div", class_="artist").find("a")
        artist_name = artist_tag.get_text(strip=True)

        sales_tag = card.find("div", class_="sales")
        sales = sales_tag.get_text(strip=True).replace("Sales: ", "") if sales_tag else "N/A"

        all_albums.append({
            "album_name": album_name,
            "artist_name": artist_name,
            "sales": sales,
            "album_link": album_link
        })

        if len(all_albums) >= 500:
            break

    page_number += 1

# Convert to DataFrame
df = pd.DataFrame(all_albums)

# Display sample results
print(f"\n Total albums scraped: {len(df)}")
print("\n Sample of scraped albums:\n")
df_sample = df.head(100)
df_sample


Fetching page: https://bestsellingalbums.org/decade/2010
Sleeping for 4.00 seconds...

Fetching page: https://bestsellingalbums.org/decade/2010-2
Sleeping for 4.15 seconds...

Fetching page: https://bestsellingalbums.org/decade/2010-3
Sleeping for 3.49 seconds...

Fetching page: https://bestsellingalbums.org/decade/2010-4
Sleeping for 4.41 seconds...

Fetching page: https://bestsellingalbums.org/decade/2010-5
Sleeping for 2.92 seconds...

Fetching page: https://bestsellingalbums.org/decade/2010-6
Sleeping for 2.04 seconds...

Fetching page: https://bestsellingalbums.org/decade/2010-7
Sleeping for 3.88 seconds...

Fetching page: https://bestsellingalbums.org/decade/2010-8
Sleeping for 4.30 seconds...

Fetching page: https://bestsellingalbums.org/decade/2010-9
Sleeping for 2.74 seconds...

Fetching page: https://bestsellingalbums.org/decade/2010-10
Sleeping for 2.08 seconds...

 Total albums scraped: 500

 Sample of scraped albums:



Unnamed: 0,album_name,artist_name,sales,album_link
0,21,ADELE,30000000,https://bestsellingalbums.org/album/1034
1,25,ADELE,23000000,https://bestsellingalbums.org/album/1035
2,CHRISTMAS,MICHAEL BUBLÉ,15000000,https://bestsellingalbums.org/album/30524
3,1989,TAYLOR SWIFT,14748116,https://bestsellingalbums.org/album/45488
4,PURPOSE,JUSTIN BIEBER,14000000,https://bestsellingalbums.org/album/23318
5,DIVIDE,ED SHEERAN,13787460,https://bestsellingalbums.org/album/12876
6,FROZEN,SOUNDTRACK,12632083,https://bestsellingalbums.org/album/42961
7,TEENAGE DREAM,KATY PERRY,12134000,https://bestsellingalbums.org/album/23977
8,X,ED SHEERAN,11879785,https://bestsellingalbums.org/album/12880
9,DOO-WOPS & HOOLIGANS,BRUNO MARS,11270000,https://bestsellingalbums.org/album/6777
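
Before treating the scrape as finished, it may be worth checking the records against the goal: 500 albums, each with a name, artist, sales figure and link. The following is a rough sketch of such a check on a list of dictionaries like all_albums. `check_records` and `REQUIRED` are hypothetical names, and flagging repeated links as a problem is an assumption about what counts as a duplicate.

```python
REQUIRED = ("album_name", "artist_name", "sales", "album_link")


def check_records(records, expected_count):
    """Hypothetical helper: return a list of human-readable problems."""
    problems = []
    if len(records) != expected_count:
        problems.append(f"expected {expected_count} records, got {len(records)}")

    seen_links = set()
    for i, rec in enumerate(records):
        # A field counts as missing if the key is absent or the value is empty
        missing = [key for key in REQUIRED if not rec.get(key)]
        if missing:
            problems.append(f"record {i} is missing {missing}")

        link = rec.get("album_link")
        if link and link in seen_links:
            problems.append(f"record {i} repeats link {link}")
        seen_links.add(link)
    return problems


# Two records copied from the output above
sample = [
    {"album_name": "21", "artist_name": "ADELE", "sales": "30,000,000",
     "album_link": "https://bestsellingalbums.org/album/1034"},
    {"album_name": "25", "artist_name": "ADELE", "sales": "23,000,000",
     "album_link": "https://bestsellingalbums.org/album/1035"},
]
print(check_records(sample, 2))
print(check_records(sample + [sample[0]], 3))
```

In the notebook the call would be check_records(all_albums, 500); an empty list means nothing was flagged.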
