# Lab | Web Scraping Single Page

### Instructions - Scraping popular songs

Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song should be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will prefer a recommendation that is also popular right now.

You have found data on the internet about currently popular songs.

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [261]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from random import randint
from time import sleep
import random

In [262]:
url = 'https://www.popvortex.com/music/charts/top-100-songs.php'

In [263]:
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [264]:
soup = BeautifulSoup(response.content, "html.parser")

In [265]:
soup.select('#chart-position-1 > div.chart-content.col-xs-12.col-sm-8 > p')

[<p class="title-artist"><cite class="title">Unholy</cite><em class="artist">Sam Smith &amp; Kim Petras</em></p>]

In [266]:
top100_songs = []
top100_artist = []
year = []

num_iter = len(soup.select(".title-artist"))
song = soup.select(".title")
artist = soup.select(".artist")

for i in range(num_iter):
    top100_songs.append(song[i].get_text())
    top100_artist.append(artist[i].get_text())
    year.append(2022)

#print(top100_songs)
#print(top100_artist)
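Indexing separate `song` and `artist` lists by position only works while every chart entry has exactly one title and one artist. A minimal sketch of a row-by-row alternative, assuming the same `title-artist`, `title`, and `artist` classes seen in the output above; the HTML string and the `parse_chart` helper are made up for illustration and are not part of the lab:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the markup printed above
SAMPLE_HTML = """
<p class="title-artist"><cite class="title">Unholy</cite><em class="artist">Sam Smith &amp; Kim Petras</em></p>
<p class="title-artist"><cite class="title">Song without artist</cite></p>
"""

def parse_chart(html):
    """Return one dict per chart entry, keeping title and artist paired."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for entry in soup.select(".title-artist"):
        # look up title and artist inside this entry only
        title = entry.select_one(".title")
        artist = entry.select_one(".artist")
        rows.append({
            "title": title.get_text(strip=True) if title else None,
            "artist": artist.get_text(strip=True) if artist else None,
        })
    return rows

rows = parse_chart(SAMPLE_HTML)
```

Passing the result to `pd.DataFrame(rows)` would give the same `title` and `artist` columns, with a missing artist showing up as an empty cell instead of shifting the rows below it.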

In [267]:
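# NOTE: the 'artist' column must come from top100_artist; passing top100_songs
# a second time repeats the title in both columns, as in the output below
top100_songs_df = pd.DataFrame({'title':top100_songs,
                                'artist':top100_artist,
                                'year':year})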

In [268]:
top100_songs_df.head()

|  | title | artist | year |
|---|---|---|---|
| 0 | Unholy | Unholy | 2022 |
| 1 | I'm Good (Blue) | I'm Good (Blue) | 2022 |
| 2 | wait in the truck | wait in the truck | 2022 |
| 3 | Thank God | Thank God | 2022 |
| 4 | Everywhere | Everywhere | 2022 |


In [269]:
# ask the user for a song they like
music = input("\nEnter your music? ")

# get a random song from the top 100
random_music = random.choice(top100_songs)
    
if music in top100_songs_df['title'].values:
    print('Great choice. Here is another music from the Top 100: ',random_music)
else:
    print('Oh, bad luck! Try again tomorrow or listen to one of the musics from the Top 100: ', random_music)



Enter your music? Thank God
Great choice. Here is another music from the Top 100:  Build a Boat
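The check above is case-sensitive, and `random.choice` can return the very song the user typed. A minimal sketch of a variant that ignores case and surrounding spaces and never recommends the input song; `recommend` and the short `titles` list are illustrative names, not part of the lab:

```python
import random

def recommend(user_song, titles, rng=random):
    """Return a different title from `titles` if `user_song` is in it, else None."""
    wanted = user_song.strip().lower()
    normalized = [t.strip().lower() for t in titles]
    if wanted not in normalized:
        return None
    # every title except the one the user entered
    candidates = [t for t, n in zip(titles, normalized) if n != wanted]
    return rng.choice(candidates) if candidates else None

titles = ["Unholy", "Thank God", "Everywhere"]
pick = recommend("  thank god ", titles)
```

Returning `None` for a song that is not in the chart leaves the caller free to print the "bad luck" message or fall back to another recommender.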


# Lab | Web Scraping Multiple Pages

#### Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

 - Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!

In [270]:
url = "https://playback.fm/charts/top-100-songs/2000"

In [271]:
response = requests.get(url)
response.status_code

200

In [272]:
soup = BeautifulSoup(response.content, "html.parser")

In [273]:
iterations = range(2000, 2022)
for i in iterations:
    print(i)


2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021


In [274]:
pages = []

iterations = range(2000, 2022)

for i in iterations:
    year = str(i)
    url = "https://playback.fm/charts/top-100-songs/" + year
    
    response = requests.get(url)

    # monitor the process by printing the status code
    #print("Status code: " + str(response.status_code))

    # store response into "pages" list
    pages.append(response)

    # respectful nap:
    wait_time = randint(1,2)
    sleep(wait_time)
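The loop above stores every response, including failed ones, and a single network error stops the whole run. A minimal sketch of a more defensive version, assuming the same year-based URL pattern; `fetch_pages` is a hypothetical helper, and `get` stands for any callable that returns an object with a `status_code` attribute, such as `requests.get`:

```python
import time

def fetch_pages(urls, get, pause=1.0):
    """Fetch each URL with `get`; return (url, response) pairs for status 200 only."""
    pages = []
    for url in urls:
        try:
            response = get(url)
        except Exception as exc:  # e.g. a timeout or connection error
            print(f"skipping {url}: {exc}")
            continue
        status = getattr(response, "status_code", None)
        if status == 200:
            pages.append((url, response))
        else:
            print(f"skipping {url}: status {status}")
        time.sleep(pause)  # respectful pause between requests
    return pages

# Build the same year URLs as the loop above
urls = [f"https://playback.fm/charts/top-100-songs/{year}" for year in range(2000, 2022)]
```

With `requests` imported, `fetch_pages(urls, lambda u: requests.get(u, timeout=10))` would run it against the real site. Keeping the URL next to each response also makes it easy to recover the year later.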
    

In [275]:
new_song_list = []
new_artist_list = []
year = []


for i in range(len(pages)):
    # parse all pages
    soup = BeautifulSoup(pages[i].content, "html.parser")
    
    songs_catalogue = soup.select('#myTable')
    
    song = soup.select(".song a")
    artist = soup.select(".artist")
    
    
  
    
    for j in range(len(song)):
        new_song_list.append(song[j].get_text().replace('\n','')) 
        new_artist_list.append(artist[j].get_text().replace('\n',''))
        year.append(i)
            
#print(new_song_list)
#print(new_artist_list)
#print(year)


In [276]:
# Turn new list into a dataframe

top100_new_list_df = pd.DataFrame({'title':new_song_list,
                                  'artist':new_artist_list,
                                  'year':year})
top100_new_list_df.head()

|  | title | artist | year |
|---|---|---|---|
| 0 | Music | Madonna | 0 |
| 1 | Beautiful Day | U2 | 0 |
| 2 | Bye, Bye, Bye | N Sync | 0 |
| 3 | Stan | Eminem | 0 |
| 4 | Oops!... I Did it Again | Britney Spears | 0 |


In [277]:
# the loop stored the page index (0 for 2000, 1 for 2001, ...), so add 2000 to get the chart year
top100_new_list_df['year'] = top100_new_list_df['year'] + 2000

In [278]:
full_list_df = pd.concat([top100_songs_df,top100_new_list_df], axis=0)

In [279]:
full_list_df.shape

(2298, 3)

In [280]:
full_list_df = full_list_df.drop_duplicates()

In [281]:
full_list_df.shape

(2296, 3)
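`drop_duplicates()` compares whole rows, so a song that appears with different capitalization or under two different years is kept twice. A minimal sketch of a looser comparison on title and artist only, assuming both columns hold strings; `drop_duplicate_songs` and the sample rows are illustrative:

```python
import pandas as pd

def drop_duplicate_songs(df):
    """Drop rows whose title and artist match after trimming and lowercasing."""
    key = pd.DataFrame({
        "title": df["title"].str.strip().str.lower(),
        "artist": df["artist"].str.strip().str.lower(),
    })
    # keep the first occurrence of each (title, artist) pair
    keep = ~key.duplicated().to_numpy()
    return df[keep].reset_index(drop=True)

sample = pd.DataFrame({
    "title": ["Music", "music ", "Stan"],
    "artist": ["Madonna", "Madonna", "Eminem"],
    "year": [2000, 2001, 2000],
})
unique_songs = drop_duplicate_songs(sample)
```

The `reset_index(drop=True)` call also removes the repeated index labels that `pd.concat(..., axis=0)` leaves behind.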

In [283]:
full_list_df

|  | title | artist | year |
|---|---|---|---|
| 0 | Unholy | Unholy | 2022 |
| 1 | I'm Good (Blue) | I'm Good (Blue) | 2022 |
| 2 | wait in the truck | wait in the truck | 2022 |
| 3 | Thank God | Thank God | 2022 |
| 4 | Everywhere | Everywhere | 2022 |
| ... | ... | ... | ... |
| 2193 | Leave Before You Love Me | Marshmello & Jonas Brothers | 2021 |
| 2194 | Beggin | Maneskin | 2021 |
| 2195 | Famous Friends | Chris Young + Kane Brown | 2021 |
| 2196 | Lil Bit | Nelly & Florida Georgia Line | 2021 |


 - Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.

In [286]:
def get_page(url):
    return requests.get(url, timeout=1.0).text

In [293]:
# For more details about using Beautiful Soup
#     https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/

def scraping_music(url,df):
    
    #df = pd.DataFrame()
    
    #Request the content (source code) of a specific URL from the server
    response = requests.get(url) 
    print(response.status_code)
    
    soup = BeautifulSoup(response.content, "html.parser")
    
    
    #print out the HTML content of the page, formatted nicely, using the prettify
    #print(soup.prettify())
    
    
    
    # Identify the elements of the page that are part of the table we want to extract
    songs = []
    artists = []
    year = []

    # nr of times an element appears
    num_iter = len(soup.select(".title-artist")) #change according to the website
    
    # get titles
    song = soup.select(".title") ##change according to the website
    
    # get artist
    artist = soup.select(".artist") ##change according to the website
    
    
    
    # Extract those elements into a dataset
    for i in range(num_iter):
        songs.append(song[i].get_text())
        artists.append(artist[i].get_text())
        year.append(2022)
    
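    # use the extracted strings in `artists`; the Tag list `artist` would display as [Sam Smith & Kim Petras]
    df = pd.DataFrame({'title':songs,'artist':artists,'year':year})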
    
    
    return df

In [294]:
scraping_music('https://www.popvortex.com/music/charts/top-100-songs.php','top100')

200


|  | title | artist | year |
|---|---|---|---|
| 0 | Unholy | [Sam Smith & Kim Petras] | 2022 |
| 1 | I'm Good (Blue) | [David Guetta & Bebe Rexha] | 2022 |
| 2 | wait in the truck | [HARDY & Lainey Wilson] | 2022 |
| 3 | A Thousand Years | [Christina Perri] | 2022 |
| 4 | Son Of A Sinner | [Jelly Roll] | 2022 |
| ... | ... | ... | ... |
| 95 | Sand In My Boots | [Morgan Wallen] | 2022 |
| 96 | Betty (Get Money) | [Yung Gravy] | 2022 |
| 97 | Half Of Me (feat. Riley Green) | [Thomas Rhett] | 2022 |
| 98 | Boom Clap | [Charli XCX] | 2022 |


In [400]:
def billboard(url):
    
    #df = pd.DataFrame()
    
    #Request the content (source code) of a specific URL from the server
    response = requests.get(url) 
    print(response.status_code)
    
    soup = BeautifulSoup(response.content, "html.parser")
    
    
    #print out the HTML content of the page, formatted nicely, using the prettify
    #print(soup.prettify())
    
    
    
    # Identify the elements of the page that are part of the table we want to extract
    songs = []
    artists = []
    year = []

    # nr of times an element appears
    num_iter = len(soup.select("#title-of-a-story")) #change according to the website
    
    # get titles
    song = soup.select(".c-title") ##change according to the website
    
    # get artist
    artist = soup.select(".c-label") ##change according to the website
    
    
    
    # Extract those elements into a dataset
    for i in range(num_iter):
        songs.append(song[i].get_text().replace('\n','').replace('\t',''))
        artists.append(artist[i].get_text().replace('\n','').replace('\t',''))
        #year.append(2022)
    
    df = pd.DataFrame({'title':songs,'artist':artists})
    
    
    return df

In [403]:
billboard_df = billboard('https://www.billboard.com/charts/billboard-200/')
billboard_df

200


|  | title | artist |
|---|---|---|
| 0 | Un Verano Sin Ti | 1 |
| 1 | Imprint/Promotion Label: | 13 |
| 2 | Bad Bunny Replaces Himself at No. 1 on Hot Lat... | 22 |
| 3 | Gains in Weekly Performance | 1 |
| 4 | Additional Awards | Bad Bunny |
| ... | ... | ... |
| 411 | Printemps Opens Podcast Studio in Paris Flagship | 50 |
| 412 | NCAA Athletes Can Finally Make Money, but They... | DJ Khaled |
| 413 | Follow Us | 24 |
| 414 | Have a Tip? | 1 |


# Lab | API wrappers - Create your collection of songs & audio features

To move forward with the project, you need to create a collection of songs with their audio features - as large as possible!

These are the songs that we will cluster. And, later, when the user inputs a song, we will find the cluster to which the song belongs and recommend a song from the same cluster. The more songs you have, the more accurate and diverse recommendations you'll be able to give. Although... you might want to make sure the collected songs are "curated" in a certain way. Try to find playlists of songs that are diverse, but also that meet certain standards.

The process of sending hundreds or thousands of requests can take some time - it's normal if you have to wait a few minutes (or, if you're ambitious, even hours) to get all the data you need.

One way to collect as many songs as possible is to start with all the songs in a big, diverse playlist, then go to every artist in that playlist and grab every song from every album by that artist. The number of songs you collect per playlist will grow very quickly!
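The artist-by-artist expansion described above could be sketched as follows. This is a rough outline under stated assumptions: `sp` is an authenticated `spotipy.Spotify` client like the one created further down, `artist_track_names` is a hypothetical helper name, and the artist ID would come from a playlist item, for example `track['track']['artists'][0]['id']`:

```python
def artist_track_names(sp, artist_id):
    """Collect track names from every album of one artist, following pagination."""
    names = []
    albums = sp.artist_albums(artist_id, album_type="album", limit=50)
    while albums:
        for album in albums["items"]:
            tracks = sp.album_tracks(album["id"])
            while tracks:
                names.extend(track["name"] for track in tracks["items"])
                tracks = sp.next(tracks)  # returns None when there are no more pages
        albums = sp.next(albums)
    return names
```

This makes at least one request per album, so for a long list of artists it is worth adding a short `sleep` between artists and saving intermediate results to disk.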

In [408]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

In [411]:
# read the credentials file; the with-block closes it automatically
with open("secrets_spotify.txt", "r") as secrets_file:
    string = secrets_file.read()

In [412]:
secrets_dict = {}  # maps each key (e.g. 'cid') to its value

for line in string.split('\n'):
    if len(line) > 0:  # skip empty lines
        key, value = line.split(':', 1)  # split on the first colon only
        secrets_dict[key.strip()] = value.strip()

In [413]:

#Initialize SpotiPy with user credentials
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=secrets_dict['cid'],
                                                           client_secret=secrets_dict['csecret']))

In [475]:
playlist = sp.user_playlist_tracks("spotify", "6FTVlz76p3Bmmw4vwKJgHy")

In [487]:
playlist.keys()
playlist['items'][0].keys() # explore items
playlist['items'][0]['track'].keys() #explore info about track
playlist['items'][0]['track']['uri']



'spotify:track:5LLfL25W8ELqVXOLBhkJOP'

In [522]:
def playlist_songs(playlist_id):
    
    df = pd.DataFrame()
    df_audio = pd.DataFrame()
    
    playlist = sp.user_playlist_tracks("spotify", playlist_id)['items']
    
    
    music_list = []
    artist_list = []
    album_list = []
    audio_features_list = []
    
    for track in playlist:
        music_list.append(track['track']['name']) #list all songs
        artist_list.append(track['track']['artists'][0]['name']) #list all artists. QUESTION = how to get multiple artists???
        album_list.append(track['track']['album']['name']) #list the name of the album
        
        audio_features_list.append(sp.audio_features(track['track']['uri'])) #audio features
    
    
    sleep(randint(1,3))
    
    #create df columns
    df['music'] = music_list
    df['artist'] = artist_list
    df['album'] = album_list
    
    #create df for audio features
    df_audio['audio_features'] = audio_features_list
    
    playlist_db = pd.concat([df,df_audio], axis = 1).reset_index()
    
    return playlist_db
    


In [533]:
playlist_db = playlist_songs('6FTVlz76p3Bmmw4vwKJgHy')
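Two things in `playlist_songs` are worth a second look: a single request returns at most 100 playlist items, and only the first artist of each track is kept. A minimal sketch that follows pagination and joins all credited artists, assuming `sp` is the `spotipy.Spotify` client created above; `playlist_rows` is a hypothetical helper name:

```python
def playlist_rows(sp, playlist_id):
    """Return one dict per track, covering every page of the playlist."""
    rows = []
    page = sp.playlist_items(playlist_id, additional_types=("track",))
    while page:
        for item in page["items"]:
            track = item.get("track")
            if not track:  # removed or unavailable tracks can come back as None
                continue
            rows.append({
                "music": track["name"],
                # join every credited artist instead of keeping only the first
                "artist": ", ".join(a["name"] for a in track["artists"]),
                "album": track["album"]["name"],
                "uri": track["uri"],
            })
        page = sp.next(page)  # returns None after the last page
    return rows
```

`pd.DataFrame(playlist_rows(sp, '6FTVlz76p3Bmmw4vwKJgHy'))` would then give the `music`, `artist`, and `album` columns in one step, with the `uri` column ready for the audio-features lookup.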

In [534]:
playlist_db #HELP: how can I use flatten function for audio_features??

|  | index | music | artist | album | audio_features |
|---|---|---|---|---|---|
| 0 | 0 | Miniyamba | Yeahman | Shika Shika / Botanas Series | [{'danceability': 0.702, 'energy': 0.508, 'key... |
| 1 | 1 | We | Marava | We EP | [{'danceability': 0.867, 'energy': 0.4, 'key':... |
| 2 | 2 | Quién Me Escucha | Sandra Bernardo | Quién Me Escucha | [{'danceability': 0.746, 'energy': 0.53, 'key'... |
| 3 | 3 | Kompaz | Rodrigo Gallardo | Harabe Daydreams I | [{'danceability': 0.788, 'energy': 0.493, 'key... |
| 4 | 4 | Monday (feat. LEJ, Akhenaton & Blundetto) | Biga*Ranx | 1988 | [{'danceability': 0.945, 'energy': 0.46, 'key'... |
| ... | ... | ... | ... | ... | ... |
| 94 | 94 | MPFree Now | Session Victim | 10.000 Hours | [{'danceability': 0.788, 'energy': 0.715, 'key... |
| 95 | 95 | Luna | Sandra Bernardo | Es el Momento | [{'danceability': 0.624, 'energy': 0.687, 'key... |
| 96 | 96 | Gobi | Tunnelvisions | Midnight Voyage | [{'danceability': 0.689, 'energy': 0.59, 'key'... |
| 97 | 97 | Desire | Joe Turner | Desire | [{'danceability': 0.598, 'energy': 0.807, 'key... |


In [535]:
def flatten(data, col_list):
    for column in col_list:
        flattened = pd.DataFrame(dict(data[column])).transpose()
        columns = [str(col) for col in flattened.columns]
        flattened.columns = [column + '_' + colname for colname in columns]
        data = pd.concat([data, flattened], axis=1)
        data = data.drop(column, axis=1)
    return data

In [536]:
df_to_flat = playlist_db[['audio_features']]
flat_list = ['audio_features']
flat = flatten(df_to_flat,flat_list)
flat

|  | audio_features_0 |
|---|---|
| 0 | {'danceability': 0.702, 'energy': 0.508, 'key'... |
| 1 | {'danceability': 0.867, 'energy': 0.4, 'key': ... |
| 2 | {'danceability': 0.746, 'energy': 0.53, 'key':... |
| 3 | {'danceability': 0.788, 'energy': 0.493, 'key'... |
| 4 | {'danceability': 0.945, 'energy': 0.46, 'key':... |
| ... | ... |
| 94 | {'danceability': 0.788, 'energy': 0.715, 'key'... |
| 95 | {'danceability': 0.624, 'energy': 0.687, 'key'... |
| 96 | {'danceability': 0.689, 'energy': 0.59, 'key':... |
| 97 | {'danceability': 0.598, 'energy': 0.807, 'key'... |
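The `flatten` call leaves a single `audio_features_0` column because each cell holds a list with one dictionary inside, so there is only one list position to spread out. One possible way to get a column per feature is to take the dictionary out of each list first. A minimal sketch, assuming every cell is a one-element list as in the output above; `expand_audio_features` is a hypothetical helper, and the sample values are copied from the first two rows shown earlier:

```python
import pandas as pd

def expand_audio_features(df, column="audio_features"):
    """Replace a column of one-element lists of dicts with one column per feature."""
    # take the single dict out of each list; use {} when no features came back
    records = [(cell[0] if cell and cell[0] else {}) for cell in df[column]]
    features = pd.DataFrame(records, index=df.index).add_prefix(column + "_")
    return pd.concat([df.drop(columns=column), features], axis=1)

sample = pd.DataFrame({
    "music": ["Miniyamba", "We"],
    "audio_features": [
        [{"danceability": 0.702, "energy": 0.508}],
        [{"danceability": 0.867, "energy": 0.4}],
    ],
})
flat_sample = expand_audio_features(sample)
```

Calling `expand_audio_features(playlist_db)` on the real data would keep `music`, `artist`, and `album` and add columns such as `audio_features_danceability` and `audio_features_energy`.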
