- Author: **Võ Bách Khôi** | 19127037
- - -
This notebook collects the URLs of SoundCloud users and of the playlists that belong to those users.

First, import the required libraries.

In [1]:
from selenium.webdriver import Edge
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import re
import requests
import requests_cache
import pandas as pd
import time

To get the list of users, we visit the "**Top 50**" chart and the "**New & hot**" chart for *every music genre*.

In [2]:
top50_url = 'https://soundcloud.com/charts/top?genre=all-music'
newhot_url = 'https://soundcloud.com/charts/new?genre=all-music'
start_urls = [top50_url, newhot_url]

# Delays, in seconds
scroll_pause_time = 0.5
load_time = 2
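
Both start URLs follow the same pattern: the chart kind is the last path segment and the genre is the `genre` query parameter. The sketch below uses only the standard library to split a chart URL into those two parts, which could be handy for tagging each scraped row with the chart it came from. `parse_chart_url` is a hypothetical helper and is not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs

def parse_chart_url(url):
    """Return (chart_kind, genre) for a SoundCloud chart URL."""
    parts = urlparse(url)
    # The last path segment is the chart kind, e.g. 'top' or 'new'
    chart_kind = parts.path.rstrip('/').split('/')[-1]
    # parse_qs maps each query key to a list of values
    genre = parse_qs(parts.query).get('genre', [None])[0]
    return chart_kind, genre

print(parse_chart_url('https://soundcloud.com/charts/top?genre=all-music'))
# ('top', 'all-music')
print(parse_chart_url('https://soundcloud.com/charts/new?genre=all-music'))
# ('new', 'all-music')
```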

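Use the **Selenium** library with the **Microsoft Edge** web driver. The calls below follow the Selenium 3 API; the `find_element_by_*` helpers were removed in Selenium 4.3 in favor of `find_element(By.ID, ...)` and similar.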

In [3]:
edge_path = 'D:\\Apps\\Microsoft Edge\\msedgedriver.exe'
edge = Edge(edge_path)
edge.get(top50_url)
time.sleep(load_time)

# Accept all cookies
edge.find_element_by_id('onetrust-accept-btn-handler').click()
time.sleep(load_time * 10)

Use **Selenium** to walk through every genre of both charts, then use **BeautifulSoup** to parse the page source and extract each user's *name and profile URL*.

In [4]:
data = []

for start_url in start_urls:
    edge.get(start_url)
    time.sleep(load_time)

    # Open the genre dropdown and collect the URL of every genre chart
    genre_urls = []
    edge.find_element_by_css_selector('div.l-content > div > div.chartsMain__filters > div:nth-child(3) > button').click()
    dropdown = edge.find_elements_by_class_name('linkMenu__group')
    for option in dropdown:
        genres = option.find_elements_by_tag_name('a')
        genre_urls += [genre.get_attribute('href') for genre in genres]

    # Collect every user that appears in each genre chart
    for genre_url in genre_urls:
        edge.get(genre_url)
        time.sleep(load_time)

        # Scroll to the bottom a few times so the whole chart gets loaded
        for _ in range(3):
            edge.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(scroll_pause_time)

        # Parse the page source
        soup = BeautifulSoup(edge.page_source, 'html.parser')

        # Each chart item yields one [username, url] row
        for track in soup.find_all('li', {'class': 'chartTracks__item'}):
            track_info = track.find('div', {'class': 'chartTrack__details'})
            username = track_info.find('div').text.strip()
            user_url = track_info.find('a')['href']
            data.append([username, user_url])


Let's convert the result to a **DataFrame** so it is easier to read :D

In [5]:
urls_df = pd.DataFrame(data=data, columns=['username', 'url'])
urls_df.drop_duplicates(inplace=True, ignore_index=True)
urls_df

| | username | url |
|---|---|---|
| 0 | lilcandypaint *Candypaint | /lilcandypaint |
| 1 | Nardo Wick | /nardo-wick |
| 2 | Mohamed Ramadan | /mohamedramadanofficial |
| 3 | Lil Tjay | /liltjay |
| 4 | muhammed fawzii | /muhammed-fawzii |
| ... | ... | ... |
| 2024 | MyApple.pl | /myapplepl |
| 2025 | Tecnocast | /tecnoblog |
| 2026 | The Tech Guy | /the_tech_guy |
| 2027 | This Week in Tech | /this-week-in-tech |
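
The notebook calls `drop_duplicates`, presumably because the same artist can appear in several genre charts and so `data` ends up with repeated rows. As a rough illustration of what that call does, here is a plain-Python sketch that keeps the first occurrence of each row and preserves order. `drop_duplicate_rows` is a hypothetical helper, and the repeated row in the sample is made up for the example.

```python
def drop_duplicate_rows(rows):
    """Remove repeated rows, keeping the first occurrence and the original order."""
    # Lists are not hashable, so each row is converted to a tuple first;
    # dict keys keep insertion order (Python 3.7+).
    unique = dict.fromkeys(tuple(row) for row in rows)
    return [list(row) for row in unique]

rows = [
    ['Nardo Wick', '/nardo-wick'],
    ['Lil Tjay', '/liltjay'],
    ['Nardo Wick', '/nardo-wick'],  # illustrative duplicate, e.g. from a second chart
]
print(drop_duplicate_rows(rows))
# [['Nardo Wick', '/nardo-wick'], ['Lil Tjay', '/liltjay']]
```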


Next, to get the URL of each user's playlist listing, we simply append `"/sets"` to the end of that user's URL.

In [6]:
requests_cache.install_cache(expire_after=None)

In [7]:
urls_df['playlist_sets'] = urls_df.url.apply(lambda row: 'https://soundcloud.com' + row + '/sets')
urls_df['playlist_sets']

0               https://soundcloud.com/lilcandypaint/sets
1                  https://soundcloud.com/nardo-wick/sets
2       https://soundcloud.com/mohamedramadanofficial/...
3                     https://soundcloud.com/liltjay/sets
4             https://soundcloud.com/muhammed-fawzii/sets
                              ...                        
2024                https://soundcloud.com/myapplepl/sets
2025                https://soundcloud.com/tecnoblog/sets
2026             https://soundcloud.com/the_tech_guy/sets
2027        https://soundcloud.com/this-week-in-tech/sets
2028              https://soundcloud.com/macmagazine/sets
Name: playlist_sets, Length: 2029, dtype: object
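
Plain string concatenation works here because every scraped path starts with `/` and has no trailing slash. As a slightly more defensive alternative, the sketch below builds the same URL with `urllib.parse.urljoin`. `BASE_URL` and `playlist_sets_url` are hypothetical names introduced for this example.

```python
from urllib.parse import urljoin

BASE_URL = 'https://soundcloud.com'

def playlist_sets_url(user_path):
    """Build the URL of a user's playlist listing from a path such as '/liltjay'."""
    # Strip a trailing slash so the result never contains '//sets'
    return urljoin(BASE_URL, user_path.rstrip('/') + '/sets')

print(playlist_sets_url('/liltjay'))      # https://soundcloud.com/liltjay/sets
print(playlist_sets_url('/nardo-wick/'))  # https://soundcloud.com/nardo-wick/sets
```

With the DataFrame from the notebook, this could be applied as `urls_df.url.apply(playlist_sets_url)`.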

OK, next we need the URL of each individual playlist listed on those playlist pages.

In [8]:
playlist_col = []
max_num_playlists = 5

for sets_url in urls_df['playlist_sets']:
    edge.get(sets_url)

    # Pause between requests so we do not hit the site too often in a short period
    time.sleep(load_time)

    # Re-parse the page until the main content container has been rendered
    soup = BeautifulSoup(edge.page_source, 'html.parser')
    content = soup.find('div', {'class': 'userMain__content'})
    while content is None:
        time.sleep(load_time)
        soup = BeautifulSoup(edge.page_source, 'html.parser')
        content = soup.find('div', {'class': 'userMain__content'})

    playlists = content.find_all('li', {'class': 'soundList__item'})

    # Keep at most max_num_playlists playlist URLs per user
    playlist_urls = []
    for playlist in playlists[:max_num_playlists]:
        link = playlist.find('a', {'class': 'sc-link-primary soundTitle__title sc-link-dark sc-text-h4'})
        playlist_urls.append(link['href'])

    # Store the user's playlists as one comma-separated string
    playlist_col.append(','.join(playlist_urls))
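
The `while` loop in the cell above keeps re-parsing the page until the `userMain__content` container shows up, with no upper bound. If a page never renders that container (for example because of a network error), the loop would not end. A minimal sketch of a bounded variant follows. `wait_for` and its `timeout` and `interval` parameters are hypothetical, and the `fetch` callable stands in for "parse the page source and look for the container".

```python
import time

def wait_for(fetch, timeout=30.0, interval=2.0):
    """Call fetch() until it returns something other than None or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None  # give up; the caller decides how to handle a missing page
        time.sleep(interval)

# Stand-in for the page lookup: returns None twice, then a value
attempts = iter([None, None, 'content'])
print(wait_for(lambda: next(attempts), timeout=1.0, interval=0.01))  # content
print(wait_for(lambda: None, timeout=0.05, interval=0.01))           # None
```

In the notebook, `fetch` could be a lambda that builds a new `BeautifulSoup` from `edge.page_source` and returns the result of `find('div', {'class': 'userMain__content'})`. A user whose page times out could then be recorded with an empty playlist string.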

In [9]:
playlist_col

['/lilcandypaint/sets/its-still-purple,/lilcandypaint/sets/s8tans-last-dance-1,/lilcandypaint/sets/s8tans-last-dance,/lilcandypaint/sets/internet',
 '',
 '',
 '/liltjay/sets/no-comparison',
 '/muhammed-fawzii/sets/music,/muhammed-fawzii/sets/kr3htfwd3vtn',
 '',
 '',
 '/olivertree/sets/when-im-down',
 '',
 '/neikedmusic/sets/neiked-sometimes-remixes,/neikedmusic/sets/call-me-remixes,/neikedmusic/sets/sexual-remixes',
 '/sleepy-hallow-880571354/sets/dont-sleep',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '/lildurk/sets/only-the-family-lil-durk,/lildurk/sets/family-over-everything,/lildurk/sets/lil-durk-presents-only-the-1,/lildurk/sets/lil-durk-presents-only-the,/lildurk/sets/just-cause-yall-waited-1',
 '',
 '/aidenschlecht/sets/leaks',
 '/polo-g/sets/polo-g',
 '',
 '/tiktoktunes/sets/tiktok-songs-2021-lyrics,/tiktoktunes/sets/ttt',
 '/nba-youngboy/sets/sincerely-kentrell-1,/nba-youngboy/sets/sincerely-kentrell,/nba-youngboy/sets/the-complete-collection,/nba-youngboy/sets/until-i-retu

Now add the playlist URLs to the DataFrame!

In [10]:
urls_df['playlist'] = playlist_col

In [11]:
urls_df

| | username | url | playlist_sets | playlist |
|---|---|---|---|---|
| 0 | lilcandypaint *Candypaint | /lilcandypaint | https://soundcloud.com/lilcandypaint/sets | /lilcandypaint/sets/its-still-purple,/lilcandy... |
| 1 | Nardo Wick | /nardo-wick | https://soundcloud.com/nardo-wick/sets | |
| 2 | Mohamed Ramadan | /mohamedramadanofficial | https://soundcloud.com/mohamedramadanofficial/... | |
| 3 | Lil Tjay | /liltjay | https://soundcloud.com/liltjay/sets | /liltjay/sets/no-comparison |
| 4 | muhammed fawzii | /muhammed-fawzii | https://soundcloud.com/muhammed-fawzii/sets | /muhammed-fawzii/sets/music,/muhammed-fawzii/s... |
| ... | ... | ... | ... | ... |
| 2024 | MyApple.pl | /myapplepl | https://soundcloud.com/myapplepl/sets | /myapplepl/sets/myapple-weekly,/myapplepl/sets... |
| 2025 | Tecnocast | /tecnoblog | https://soundcloud.com/tecnoblog/sets | /tecnoblog/sets/tecnoinvest |
| 2026 | The Tech Guy | /the_tech_guy | https://soundcloud.com/the_tech_guy/sets | |
| 2027 | This Week in Tech | /this-week-in-tech | https://soundcloud.com/this-week-in-tech/sets | |


In [12]:
urls_df.to_csv('urls.csv')
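
Each `playlist` cell holds a user's playlist paths joined by commas, and users without playlists get an empty string. When the CSV is loaded again, those empty cells typically come back from `pandas.read_csv` as `NaN`, and `''.split(',')` returns `['']` rather than an empty list. The sketch below shows one way to turn a cell back into a list while handling both cases. `split_playlists` is a hypothetical helper that is not in the original notebook.

```python
def split_playlists(cell):
    """Turn a comma-joined playlist cell back into a list of playlist paths."""
    # Empty strings and non-string values (such as NaN) mean "no playlists"
    if not isinstance(cell, str) or not cell:
        return []
    return cell.split(',')

print(split_playlists('/muhammed-fawzii/sets/music,/muhammed-fawzii/sets/kr3htfwd3vtn'))
# ['/muhammed-fawzii/sets/music', '/muhammed-fawzii/sets/kr3htfwd3vtn']
print(split_playlists(''))            # []
print(split_playlists(float('nan')))  # []
```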

In [13]:
# Close the Edge browser and end the WebDriver session
edge.quit()

**References**:
- [Scrape Soundcloud using Selenium from scratch](https://shubhamchauhan125.medium.com/scrap-soundcloud-data-using-selenium-from-scratch-94761ef33f4e)