## Songs Dataset

19CSE353 Mining of Massive Datasets

Group 16

To build a clean song dataset, we use song data available from free music data APIs. The dataset is a collection of playlists, each identified by a unique `playlist_URI`. A playlist contains multiple songs, and a song can appear in multiple playlists. Each song should have the following features:
- `track_name` - Name of the song
- `id` - Unique ID of song
- `popularity` - Popularity score of a song
- `main_artist_name` - Name of the main artist
- `main_artist_id` - Unique ID of the artist
- `main_artist_pop` - Popularity score of the main artist
- `duration_ms` - Duration (in ms)
- `danceability` - A measure of how suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity
- `speechiness` - The presence of spoken words in a track. Tracks with high speechiness are likely to contain spoken words or a spoken-word-like section
- `acousticness` - A measure of the acoustic characteristics of a track. Tracks with high acousticness are likely to be acoustic recordings or to have a strong acoustic presence
- `instrumentalness` - A measure of the absence of vocals in a track. Tracks with high instrumentalness have no vocals or very few vocals
- `liveness` - A measure of the presence of a live audience in the recording. Tracks with high liveness are likely to be live recordings
- `lyrics` - Lyrics of the song

In [1]:
from spotipy import SpotifyClientCredentials, Spotify
import pandas as pd
import lyricsgenius

### 1. Extracting Features

The lyrics of each song are extracted from the Genius API (https://genius.com/api-clients).

The other features in the dataset are extracted from the Spotify API (https://developer.spotify.com).


In [2]:
# client ID
cid = '8a6426547acc4865aafacb74e5f6f8f2'

# secret key
secret = '8c68bb042c7143d2a02235da08959f65'

To authenticate without signing in to a user account, all we need are the client ID and the client secret. We can then create our `Spotify` object as follows:

In [3]:
#Authentication - without user
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = Spotify(client_credentials_manager = client_credentials_manager)
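
A common alternative to keeping the client ID and secret in the notebook is to read them from environment variables. The sketch below is not part of the original notebook: `load_credentials` is a hypothetical helper, and the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are assumed here because spotipy documents them for its own credential lookup.

```python
import os


def load_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables."""
    # Hypothetical helper; `env` defaults to the process environment.
    env = os.environ if env is None else env
    names = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return env[names[0]], env[names[1]]
```

With both variables set, `cid, secret = load_credentials()` could replace the hardcoded assignments before `SpotifyClientCredentials` is created.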

As the dataset should contain playlists from different users, playlist URIs were collected from several users. `track_uris` contains the track URIs of all songs in each playlist.

In [4]:
# mine = ['spotify:playlist:4uoauPzaCVqSttRQTqwZ2L', 'spotify:playlist:37i9dQZF1EQnqst5TRi17F', 'spotify:playlist:37i9dQZEVXbNG2KDcFcKOF', 'spotify:playlist:37i9dQZF1DX5Ejj0EkURtP',  'spotify:playlist:37i9dQZF1DXdPec7aLTmlC', 'spotify:playlist:4QxrT89woB5Wspx9ckSdGd',  'spotify:playlist:37i9dQZF1Eplih7obxH51u', 'spotify:playlist:0uRHOYgXiR3l9BwssW7IMH', 'spotify:playlist:37i9dQZF1E35wuCpDmWrap', 'spotify:playlist:37i9dQZF1E378oGuP4xCRG', 'spotify:playlist:37i9dQZF1E37mRrWXJqsgG', 'spotify:playlist:37i9dQZF1E36D4rGaPzxmb']
# sanky = ['spotify:playlist:0pYKCDEitaEtWRzYJX0Hwy']
# abhinav = ['spotify:playlist:37i9dQZF1E39ogSZU3u7TI', 'spotify:playlist:37i9dQZF1E38jgqdwoJ7FT', 'spotify:playlist:37i9dQZF1E34ZJaNzb1BvY', 'spotify:playlist:37i9dQZEVXcHeRa2NxjPb9']
# kuppesh = ['spotify:playlist:37i9dQZF1E38EzM8DOLYrX', 'spotify:playlist:37i9dQZF1E36VBVVxXKvCu', 'spotify:playlist:37i9dQZF1E36N5EV3mfscl', 'spotify:playlist:37i9dQZF1E35V6ccrOcTbQ', 'spotify:playlist:37i9dQZF1E387U3JrdbQ6M', 'spotify:playlist:37i9dQZF1DXcmMuW52BXP0', 'spotify:playlist:37i9dQZF1DX3YSRoSdA634', 'spotify:playlist:37i9dQZF1EIWkN17HuTXyC', 'spotify:playlist:37i9dQZF1DX1uHCeFHcn8X', 'spotify:playlist:37i9dQZF1DX4WYpdgoIcn6', 'spotify:playlist:0Blc7H0vclyoZYpkfN5oFn', 'spotify:playlist:7J8BhUluUUOFKMv9GkmfbI']
# sai = ['spotify:playlist:1mXn9gbgE549RBvI81m9AY']
# surya = ['spotify:playlist:3nDh41jdYfMypr5oivf9h5', 'spotify:playlist:0xXEyvUVKJZvgIWpa5cFlf', 'spotify:playlist:37i9dQZF1EIYW4IQAlhUJP', 'spotify:playlist:1pLd88z1nhEGQEHbt66lME', 'spotify:playlist:37i9dQZF1DXbYM3nMM0oPk']
# mahesh = ['spotify:playlist:2zMU7NGivRghXZJok2K8ES', 'spotify:playlist:4CCMYzUqYgU2i7opaB7fiM', 'spotify:playlist:6MoR16aokakIrchsWepo2x']
# srinithi = ['spotify:playlist:37i9dQZF1EQncLwOalG3K7', 'spotify:playlist:37i9dQZF1DWX83CujKHHOn', 'spotify:playlist:0wO7kqupC0YZBLlJsQwceS']
# siva = ['spotify:playlist:4Oj0QeN1Fbc7WVtUXEUf5g', 'spotify:playlist:5ocBqbbt61Uhgc3av7UkDT', 'spotify:playlist:72lmW37G35cATUdAiDPKdj', 'spotify:playlist:37i9dQZF1DXe9YJxYnhkr3', 'spotify:playlist:1pLk4wLoJaWXClz8nqih7G', 'spotify:playlist:4rIoHGE82kPRFP78nJ4HNG', 'spotify:playlist:3vaFOIAhoVXb1nnw0uhylC', 'spotify:playlist:6DuM2G1tnsfdpvm5kwHkqV', 'spotify:playlist:4RAoH95SBNTXdOuwdTawHT', 'spotify:playlist:1RxqaSgJvhbM5LBUzIICek']
# hari = ['spotify:playlist:6HgmymGxMWO1sbE3rydKyF', 'spotify:playlist:37i9dQZF1E36eS8SGYuyO0', 'spotify:playlist:37i9dQZF1E39rSG4iUwSQ2', 'spotify:playlist:37i9dQZF1E3a8UkUxaTox8', 'spotify:playlist:37i9dQZF1E39DManSPlYaT', 'spotify:playlist:37i9dQZF1E37I5iTRVDUy9']
# yashika = ['spotify:playlist:37i9dQZF1EVKuMoAJjoTIw']
# tarun = ['spotify:playlist:4jotW1rQxUQiYUzg2IjZWE', 'spotify:playlist:4Le64nQ9yMjj10cz6mBm8I', 'spotify:playlist:0cNYHRGCv0cyQjeLblv2Oa', 'spotify:playlist:4ZrYMTU1C6MXVTFyJWXDCY', 'spotify:playlist:6ySrUrNk0prJL8szfuiuIA', 'spotify:playlist:1km1zNQfIq8QViMuZskXCA', 'spotify:playlist:56xS80pvoIwoJ0nKYG1JcL', 'spotify:playlist:7aQsuG5GIEg4gkcqLRhapC', 'spotify:playlist:5Zu0NFDjBuDtV7QebwBGCa', 'spotify:playlist:2WF6UcoNpIqObGkChVUtj0', 'spotify:playlist:1jjh8pwHoOOOVm1MVB3QUg', 'spotify:playlist:4sXRtJgQNMcJPuOC34Gmux']
# nan = ['spotify:playlist:1MpFBN6y1cROZHGo7AmTFO']

In [42]:
# dict containing all track uris in all playlists {'1NDoImNoAtIVAB8stHbSY4': ['spotify:track:4cktbXiXOapiLBMprHFErI', 'spotify:track:4cktbXiXOapiLBMprHFasdfS']}
track_uris = {}

# dict containing all track details for each playlist {'1NDoImNoAtIVAB8stHbSY4': [{'added_at': '2022-10-03T13:50:27Z','added_by': {'external_urls': {'spotify': 'https://open.spotify.com/user/ecq...}}
tracks = {}

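# `tarun` is one of the lists in the previous cell; uncomment it there before running this loop
for playlist_URI in tarun:
    # Fetch each playlist once and reuse the response for both dicts
    items = sp.playlist_tracks(playlist_URI)["items"]
    track_uris[playlist_URI] = [x["track"]["uri"] for x in items]
    tracks[playlist_URI] = items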
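
`sp.playlist_tracks` returns one page of results at a time, so reading only `["items"]` from the first response can miss songs in long playlists. The following is a minimal sketch of following the `next` links of a paged response. `collect_items` is a hypothetical helper that does not appear in the original notebook, and it assumes each page is a dict with `items` and `next` keys.

```python
def collect_items(first_page, fetch_next):
    """Gather "items" from a paged response by following its "next" links."""
    items = []
    page = first_page
    while page is not None:
        items.extend(page["items"])
        # Stop when the page carries no link to a following page.
        page = fetch_next(page) if page.get("next") else None
    return items
```

With spotipy, this could be called as `collect_items(sp.playlist_tracks(playlist_URI), sp.next)`, since `Spotify.next` takes the previous page and returns the following one.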

Extract audio features such as `danceability`, `speechiness`, `acousticness`, `instrumentalness`, and `liveness`.

In [43]:
# dict containing the audio features of every track in each playlist {'1NDoImNoAtIVAB8stHbSY4': [{'danceability': 0.775, 'energy': 0.327, 'key': 11,...}]}
features = {}
for k in track_uris:
    features[k] = sp.audio_features(track_uris[k])
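
The audio-features endpoint accepts a limited number of track IDs per request (100, according to the spotipy documentation), so longer URI lists need to be split into batches. This is a hedged sketch of one way to do that. `chunked` and `batched_features` are hypothetical helpers, and `fetch` stands in for any callable that maps a list of URIs to a list of results.

```python
def chunked(values, size=100):
    """Yield consecutive slices of `values` holding at most `size` elements."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def batched_features(uris, fetch, size=100):
    """Call `fetch` once per chunk and concatenate the results in order."""
    results = []
    for chunk in chunked(uris, size):
        results.extend(fetch(chunk))
    return results
```

In the loop above, `features[k] = batched_features(track_uris[k], sp.audio_features)` would keep every request within the limit.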

Add `track_name`, `id`, `popularity`, `main_artist_name`, `main_artist_id`, `main_artist_pop`, `duration_ms` to the existing features.

In [44]:
l = []
for i in tracks:
    for j, item in enumerate(tracks[i]):
        track = item["track"]
        main_artist = track["artists"][0]
        d = {
            'playlist_uri': i,
            'track_name': track["name"],
            'id': track["id"],
            'popularity': track["popularity"],
            'main_artist_name': main_artist['name'],
            'main_artist_id': main_artist['id'],
            # One extra API call per song to look up the artist's popularity
            'main_artist_pop': sp.artist(main_artist['id'])['popularity'],
            'duration_ms': track["duration_ms"],
        }
        # The audio-features entry can be None for a track; skip the merge in that case
        d.update(features[i][j] or {})
        l.append(d)

Convert the features into a `pandas` `DataFrame` named `songs` and drop unwanted columns.

In [45]:
songs = pd.DataFrame(l)
songs.drop(columns=['energy', 'key', 'loudness', 'mode', 'valence', 'tempo', 'type', 'uri', 'track_href', 'analysis_url', 'time_signature'], inplace=True)

The file `songs.csv` now contains the features of all songs.

In [2]:
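# Uncomment to append the current batch of songs to songs.csv
# songs.to_csv('./songs.csv', index=False, mode='a', header=False)
# Reload the accumulated dataset from disk
songs = pd.read_csv('./songs.csv')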

### 2. Extracting Lyrics

The lyrics of each song are extracted from the Genius API (https://genius.com/api-clients).

Sign in at https://genius.com/api-clients and create a new API client to obtain the client access token, which is used to fetch song lyrics and song information.

In [3]:
# client access token
token = 'oHHffzOEjEPpZcWQvAfRuEdMvTUNYi7iA4ke-Cyez6WxGOkXtzjO-1zxnhWlfq5J'

In [16]:
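# Unique (track, artist) pairs to look up on Genius
ly = songs[['track_name', 'main_artist_name']].drop_duplicates()
# Skip the first 30 rows and renumber the remainder from 0
ly = ly.drop(ly.index[:30]).reset_index(drop=True)
ly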

Unnamed: 0,track_name,main_artist_name
0,Gin and Juice,Snoop Dogg
1,Real Muthaphuckkin' G's,Eazy-E
2,It Ain't Hard to Tell,Nas
3,Notorious Thugs - 2014 Remaster,The Notorious B.I.G.
4,Still D.R.E.,Dr. Dre
...,...,...
2197,Newyork Nagaram,A.R. Rahman
2198,Deep Blue,Wyatt Pike
2199,Everything I Want,Jordy Searcy
2200,Waterfall,Morningsiders


Create a `Genius` object to get the song lyrics. Look up each song with the `Genius.search_song` method, then read its text from the `lyrics` attribute of the returned song.

In [17]:
genius = lyricsgenius.Genius(token, skip_non_songs=True, excluded_terms=["(Remix)", "(Live)"], remove_section_headers=True, timeout=10)
def search_songs(row):
    song = genius.search_song(row['track_name'], row['main_artist_name'])  # None when no match is found
    return song.lyrics.split('Lyrics\n')[-1][:-7].split('\n') if song else None  # drop header and 7-character footer
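
The slicing in `search_songs` removes a header ending in `Lyrics\n` and a fixed seven-character footer, which is fragile if the footer length varies. The sketch below shows a pattern-based alternative. `clean_lyrics` is a hypothetical helper, and the footer format (optional digits followed by `Embed`) is an assumption about the raw text, not something the notebook confirms.

```python
import re


def clean_lyrics(raw):
    """Split raw lyrics text into lines, dropping the header and footer."""
    # Keep only the text after the last "Lyrics\n" header, as the notebook does.
    body = raw.split("Lyrics\n")[-1]
    # Assumed footer: optional digits followed by "Embed" at the very end.
    body = re.sub(r"\d*Embed\s*$", "", body)
    return body.split("\n")
```

Text without the assumed footer passes through unchanged apart from the line split, so the helper would not trim real lyric characters the way a fixed-length slice can.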

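Search for lyrics in batches of five rows: each run fetches the lyrics of the first five songs in `ly`, appends them to `lyrics.csv`, and removes them from `ly`.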

In [24]:
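# Fetch lyrics for the first five rows only; the remaining rows stay NaN
ly['lyrics'] = ly.head(5).apply(search_songs, axis=1)
# Append this batch to lyrics.csv, then remove it from the rows still to process
ly.head(5).to_csv('./lyrics.csv', mode='a', header=False, index=False)
ly.drop(ly.index[:5], inplace=True)
# Preview the next five rows waiting for lyrics
ly.reset_index(drop=True).head(5)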

Searching for "Lay Low" by Snoop Dogg...
Done.
Searching for "Ghetto Gospel" by 2Pac...
Done.
Searching for "Stan" by Eminem...
Done.
Searching for "No Vaseline" by Ice Cube...
Done.
Searching for "Hustlin'" by Rick Ross...
Done.


Unnamed: 0,track_name,main_artist_name,lyrics
0,Window Shopper,50 Cent,
1,Can't C Me,2Pac,
2,Bitch Please II,Eminem,
3,Izzo (H.O.V.A.),JAY-Z,
4,Machine Gun Funk - 2006 Remaster,The Notorious B.I.G.,


Now, the file `lyrics.csv` contains the song lyrics data.