# Visualizing Music! - Data Extraction & Cleaning

## Data Acquisition and Cleaning Code

I have divided the project into two broad categories based on the datasets it requires. Together, these datasets should be comprehensive enough to help us answer our research question.

The two broad categories of data that we extract are:

1) Top-performing songs of 2019, along with the properties of each track.

2) Popular songs from different eras and their listening trends over the past 3 years.

To access Spotify data, we created a developer account, generated a client ID and client secret, and explored the various endpoints in the [Spotify Web API Console](https://developer.spotify.com/console/).

#### Here I demonstrate how I extracted Spotify data for the top songs of 2019 and their features.

### Data Acquisition

In [2]:
# Set up the Spotify API client
import os

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Read the credentials from environment variables instead of hard-coding them
cid = os.environ["SPOTIPY_CLIENT_ID"]
secret = os.environ["SPOTIPY_CLIENT_SECRET"]
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

I use the `search` method provided by the Spotipy library to extract the tracks of 2019. A sample query and its output are shown below.

In [9]:
output = sp.search(q='year:2019', type='track')
print(output)

{'tracks': {'href': 'https://api.spotify.com/v1/search?query=year%3A2019&type=track&offset=0&limit=10', 'items': [{'album': {'album_type': 'single', 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/0Y5tJX1MQlPlqiwlOH1tJY'}, 'href': 'https://api.spotify.com/v1/artists/0Y5tJX1MQlPlqiwlOH1tJY', 'id': '0Y5tJX1MQlPlqiwlOH1tJY', 'name': 'Travis Scott', 'type': 'artist', 'uri': 'spotify:artist:0Y5tJX1MQlPlqiwlOH1tJY'}], 'available_markets': ['AD', 'AE', 'AR', 'AT', 'AU', 'BE', 'BG', 'BH', 'BO', 'BR', 'CA', 'CH', 'CL', 'CO', 'CR', 'CY', 'CZ', 'DE', 'DK', 'DO', 'DZ', 'EC', 'EE', 'EG', 'ES', 'FI', 'FR', 'GB', 'GR', 'GT', 'HK', 'HN', 'HU', 'ID', 'IE', 'IL', 'IN', 'IS', 'IT', 'JO', 'JP', 'KW', 'LB', 'LI', 'LT', 'LU', 'LV', 'MA', 'MC', 'MT', 'MX', 'MY', 'NI', 'NL', 'NO', 'NZ', 'OM', 'PA', 'PE', 'PH', 'PL', 'PS', 'PT', 'PY', 'QA', 'RO', 'SA', 'SE', 'SG', 'SK', 'SV', 'TH', 'TN', 'TR', 'TW', 'US', 'UY', 'VN', 'ZA'], 'external_urls': {'spotify': 'https://open.spotify.com/album
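The raw response is deeply nested, so it can help to look at only the fields used in the rest of this post. The sketch below works on a hand-trimmed sample whose values are copied from the output and table shown here; the real response carries many more keys, and the helper name `summarize_track` is hypothetical.

```python
# Hand-trimmed stand-in for one sp.search(...) response; the real payload
# contains many more keys (album, available_markets, external_urls, ...).
sample_output = {
    "tracks": {
        "items": [
            {
                "artists": [{"id": "0Y5tJX1MQlPlqiwlOH1tJY", "name": "Travis Scott"}],
                "name": "HIGHEST IN THE ROOM",
                "id": "3eekarcy7kvN4yt5ZFzltW",
                "popularity": 99,
            }
        ]
    }
}


def summarize_track(item):
    """Pull out the four fields kept for each track (hypothetical helper)."""
    return {
        "artist_name": item["artists"][0]["name"],  # first listed artist only
        "track_name": item["name"],
        "track_id": item["id"],
        "popularity": item["popularity"],
    }


for item in sample_output["tracks"]["items"]:
    print(summarize_track(item))
```

Only the first listed artist is read, so a collaboration is recorded under whichever artist Spotify lists first.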

Create empty lists and collect the artist name, track name, popularity, and track ID of 10,000 tracks by paging through results like the one shown above.

In [10]:
artist_name = []
track_name = []
popularity = []
track_id = []

# Request 50 tracks per call (200 calls in total) and keep four fields per track
for offset in range(0, 10000, 50):
    track_results = sp.search(q='year:2019', type='track', limit=50, offset=offset)
    for t in track_results['tracks']['items']:
        artist_name.append(t['artists'][0]['name'])
        track_name.append(t['name'])
        track_id.append(t['id'])
        popularity.append(t['popularity'])

In [11]:
print("Number of tracks extracted: " + str(len(track_id)))

Number of tracks extracted: 10000
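The paging logic above can also be separated from the Spotify client so it can be tried without network access. This is a minimal sketch, not the code used for this project: `collect_tracks` and `fake_search` are hypothetical names, and `fake_search` only imitates the shape of a search response.

```python
def collect_tracks(search_page, total, page_size=50):
    """Call search_page(limit, offset) repeatedly and gather the returned items."""
    items = []
    for offset in range(0, total, page_size):
        page = search_page(limit=page_size, offset=offset)
        batch = page["tracks"]["items"]
        if not batch:  # no more results, so stop paging early
            break
        items.extend(batch)
    return items


def fake_search(limit, offset, available=120):
    """Stand-in for sp.search that pretends only `available` tracks exist."""
    ids = range(offset, min(offset + limit, available))
    return {"tracks": {"items": [{"id": f"track-{n}"} for n in ids]}}


tracks = collect_tracks(fake_search, total=10000)
print(len(tracks))  # 120 with the stand-in above
```

With a real client, the same helper could be called as `collect_tracks(lambda limit, offset: sp.search(q='year:2019', type='track', limit=limit, offset=offset), total=10000)`. The search endpoint may limit how large `offset` can be, so checking the final count, as the cell above does, is worthwhile.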


### Cleaning Code

![SpongeBob cleaning GIF](https://tenor.com/view/sponge-bob-cleaning-cats-household-chores-gif-4998775.gif)


Combine the lists into a single DataFrame.

In [13]:
import pandas as pd

tracks_df = pd.DataFrame({'artist_name':artist_name,'track_name':track_name,'track_id':track_id,'popularity':popularity})
print(tracks_df.shape)
tracks_df.head()

(10000, 4)


|   | artist_name | track_name | track_id | popularity |
|---|---|---|---|---|
| 0 | Travis Scott | HIGHEST IN THE ROOM | 3eekarcy7kvN4yt5ZFzltW | 99 |
| 1 | Post Malone | Circles | 21jGcNKet2qwijlDFuPiPb | 100 |
| 2 | Juice WRLD | Bandit (with YoungBoy Never Broke Again) | 6Gg1gjgKi2AK4e0qzsR7sd | 94 |
| 3 | blackbear | hot girl bummer | 7aiClxsDWFRQ0Kzk5KI5ku | 96 |
| 4 | YNW Melly | 223's (feat. 9lokknine) | 4sjiIpEv617LDXaidKioOI | 92 |


Often, a single track has more than one entry in the dataset. This happens because the same track can have different `track_id` values, for example when it was released both as a single and as part of an album. It can also happen when the track appears under the profiles of two artists who collaborated on it.

I check for these duplicates by grouping the rows by artist name and track name and then checking which groups contain more than one row.

In [23]:
grouped = tracks_df.groupby(['artist_name','track_name'], as_index=True).size()
print("Tracks grouped by artist name and track name:")
print("======================")
print(grouped.head(10))
print("======================")
print("All the tracks with duplicate values:")
print("======================")
print(grouped[grouped > 1].head(10))

Tracks grouped by artist name and track name:
artist_name  track_name                                                  
$NOT         Beretta (feat. Wifisfuneral)                                    1
             Billy Boy                                                       1
             Megan                                                           1
             Vision (feat. Lil Tracy)                                        1
$uicideBoy$  Aliens Are Ghosts ($UICIDEBOY$ X TRAVIS BARKER)                 2
             DON'T TRUST ANYONE! ($UICIDEBOY$ X TRAVIS BARKER)               1
             Individuality Was So Last Year ($UICIDEBOY$ X TRAVIS BARKER)    1
             Killing 2 Birds With 22 Stones ($UICIDEBOY$ X TRAVIS BARKER)    1
             Sour Grapes ($UICIDEBOY$ X TRAVIS BARKER)                       1
(G)I-DLE     Senorita                                                        1
dtype: int64
All the tracks with duplicate values:
artist_name          track_name        

In [24]:
print("Number of duplicate values: " + str(grouped[grouped > 1].count()))

Number of duplicate values: 485
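The figure printed above counts artist/track pairs, not rows. A pair that occurs three times adds one to this count but leaves two surplus rows, so the number of rows removed by deduplication can be larger than the number of duplicated pairs. The toy data below is made up purely to illustrate the difference.

```python
import pandas as pd

# Toy data (hypothetical): "Song A" appears three times, "Song B" twice, "Song C" once.
toy_df = pd.DataFrame({
    "artist_name": ["X", "X", "X", "Y", "Y", "Z"],
    "track_name": ["Song A", "Song A", "Song A", "Song B", "Song B", "Song C"],
})

keys = ["artist_name", "track_name"]
sizes = toy_df.groupby(keys).size()

# Pairs that occur more than once (what the cell above counts)
duplicated_pairs = int((sizes > 1).sum())
# Rows beyond the first occurrence of each pair (what deduplication removes)
extra_rows = int(toy_df.duplicated(subset=keys).sum())

print(duplicated_pairs, extra_rows)  # 2 3
```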


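Hence, 485 artist/track pairs appear more than once. The code below drops the extra entries, keeping the first occurrence of each pair. Because some pairs occur more than twice, this removes 520 rows (10,000 - 9,480) rather than 485.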

In [27]:
tracks_df.drop_duplicates(subset=['artist_name','track_name'], inplace=True)
print("Number of tracks left in the dataframe: " + str(len(tracks_df)))

Number of tracks left in the dataframe: 9480
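By default, `drop_duplicates` keeps the first occurrence of each pair, which here means whichever version the search happened to return first. If the most popular version of a track is preferred instead, one option is to sort by popularity before dropping. This is an alternative sketch, not what the cell above does, and the track IDs and popularity values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): "Song A" exists as a single and as an album track.
toy_df = pd.DataFrame({
    "artist_name": ["X", "X", "Y"],
    "track_name": ["Song A", "Song A", "Song B"],
    "track_id": ["single-id", "album-id", "b-id"],
    "popularity": [60, 85, 70],
})

deduped = (
    toy_df.sort_values("popularity", ascending=False)  # most popular first
    .drop_duplicates(subset=["artist_name", "track_name"], keep="first")
    .sort_index()  # restore the original row order
)
print(deduped["track_id"].tolist())  # ['album-id', 'b-id']
```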


## In later parts of the project, I will extract the necessary features, segment the data as needed, and then analyze it.

# Stay Tuned!

![Thank you GIF](https://tenor.com/view/thanks-thank-you-dwight-happy-tears-of-joy-gif-3553882.gif)
