# I ANALYSED MY SPOTIFY DATA
---

![](https://developer.spotify.com/assets/branding-guidelines/logo@2x.png)

### **COVID-19 CHANGED MY MUSIC TASTE**

After six months in lockdown, as one of the privileged few able to experiment with new habits and lifestyles during quarantine, I looked back at my streaming history to see how my music taste had changed. Here is what I found:
* I listened to **964** artists over 12 months, **313** of whom I discovered only after the lockdown began.
* My top artists and tracks changed during the lockdown.
* During the lockdown, I streamed **BTS**'s music the most.

I then dug deeper into BTS to understand what had changed my preferences so drastically. BTS's music has **high 'danceability' and energy**, which acted as an escape from the monotony of quarantine.

Lastly, I extracted data on other [top K-Pop artists](https://www.koreaboo.com/lists/top-25-followed-kpop-artists-spotify/) to find tracks that match my preferred audio features.

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import matplotlib.style as style
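#Matplotlib 3.6 renamed the bundled seaborn styles to 'seaborn-v0_8-*'; use whichever name this version provides
style.use('seaborn-v0_8-poster' if 'seaborn-v0_8-poster' in style.available else 'seaborn-poster')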
sns.set_style('darkgrid')

In [None]:
#Getting the data
mydata = pd.read_csv('../input/mydata/spotify_data.csv', index_col=0,parse_dates=['Date'])
btsdata = pd.read_csv('../input/mydata/bts.csv', index_col='index')

In [None]:
#Normalise track names and convert milliseconds to minutes and hours (vectorised)
mydata['trackName'] = mydata['trackName'].str.lower()
mydata['minPlayed'] = mydata['msPlayed'] / 60000
mydata['hrsPlayed'] = mydata['msPlayed'] / 3600000

In [None]:
mydata.head()

In [None]:
btsdata.head()

In [None]:
#Dropping unnecessary columns
btsdata.drop(columns=['time_signature','key','artist_name'], inplace=True)

In [None]:
btsdata['song_name'] = btsdata['song_name'].str.lower()
btsdata['duration_min'] = btsdata['duration_ms'] / 60000

In [None]:
btsdata.head()

In [None]:
#Number of artists streamed during Sept 2019-Sept 2020
mydata.artistName.nunique()

In [None]:
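#Number of artists discovered during March 2020-Sept 2020
lockdown_start = '2020-03-01'
pre_artist = mydata[mydata.Date < lockdown_start].artistName.unique()
#One combined mask avoids the reindexing warning from chained boolean filters; >= also counts 1 March itself
post_mask = (mydata.Date >= lockdown_start) & ~mydata.artistName.isin(pre_artist)
mydata[post_mask].artistName.nunique()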
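
The same count can be framed another way: an artist is "new" if their first stream falls on or after the cutoff date. Below is a minimal sketch of that idea on a made-up toy table. The rows and the `new_artists` helper are illustrative assumptions and are not part of the original data.

```python
import pandas as pd

# Toy streaming history; these rows are invented for illustration only.
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2019-10-01", "2020-01-15", "2020-03-01", "2020-04-10", "2020-08-05"]
    ),
    "artistName": ["Artist A", "Artist B", "Artist A", "Artist C", "Artist D"],
})

def new_artists(history, cutoff):
    """Return the artists whose first stream falls on or after `cutoff`."""
    first_seen = history.groupby("artistName")["Date"].min()
    return sorted(first_seen[first_seen >= pd.Timestamp(cutoff)].index)

discovered = new_artists(toy, "2020-03-01")
print(discovered)
```

Artist A is streamed again after the cutoff but was first heard in 2019, so it is not counted as a discovery.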

VISUALIZING MY SPOTIFY DATA
---
---

In [None]:
#Draw a vertical line at the mean of the plotted values
def plotMean(data, mycolor, mylinestyle):
    plt.axvline(np.mean(data), color=mycolor, linestyle=mylinestyle, linewidth=1.5, label='Mean({})'.format(round(np.mean(data),2)))
    plt.legend(loc='best')
#Draw a vertical line at the median of the plotted values
def plotMedian(data, mycolor, mylinestyle):
    plt.axvline(np.median(data), color=mycolor, linestyle=mylinestyle, linewidth=1.5, label='Median({})'.format(round(np.median(data),2)))
    plt.legend(loc='best')
#Annotate the maximum of the series at horizontal position x
def plotLabel(data,x):
    plt.annotate("Max: {}".format(round(data.max(),2)), (x, data.max()),bbox=dict(fc='yellow'))
#Horizontal bar chart with mean and median reference lines
def plotBar(data,palette):
    sns.barplot(x=data,y=data.keys(),palette = palette)
    plotMean(data,'r','-')
    plotMedian(data,'g','--')
    plt.show()

In [None]:
data = mydata.groupby(['Date','trackName'], as_index = False).size().groupby('Date').size()
data.plot.line()
plotLabel(data,'2020-01-19')
plt.title('Number of tracks streamed over time', fontweight='bold')
plt.ylabel('Number of tracks')
plt.show()

In [None]:
#Total hours streamed per day
data = mydata.groupby('Date')['hrsPlayed'].sum()
data.plot.line()
plotLabel(data,'2020-01-19')
plt.title('Number of hours streamed over time', fontweight='bold')
plt.ylabel('Hours played')
plt.show()
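
The annotation position in the two plots above is hard-coded as `'2020-01-19'`. As a hedged alternative, the peak day can be looked up from the data itself with `idxmax`. The sketch below uses invented numbers, and the names `toy`, `daily_hours`, `peak_date` and `peak_hours` are illustrative only.

```python
import pandas as pd

# Toy streaming log; values are invented for illustration only.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-18", "2020-01-19", "2020-01-19", "2020-01-20"]),
    "msPlayed": [1_800_000, 3_600_000, 5_400_000, 900_000],
})
toy["hrsPlayed"] = toy["msPlayed"] / 3_600_000

# Total hours per day, then the day with the largest total.
daily_hours = toy.groupby("Date")["hrsPlayed"].sum()
peak_date = daily_hours.idxmax()
peak_hours = daily_hours.max()
print(peak_date.date(), round(peak_hours, 2))
```

With the real data, something like `plotLabel(data, data.idxmax())` would place the label at the actual peak rather than at a fixed date.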

## TOP TRACKS PRE-LOCKDOWN

In [None]:
plotBar(mydata[mydata.Date < '2020-03-01'].trackName.value_counts()[:15],'inferno')

## TOP TRACKS POST-LOCKDOWN

In [None]:
#>= so that streams on 1 March itself are not dropped from both halves
plotBar(mydata[mydata.Date >= '2020-03-01'].trackName.value_counts()[:15], 'inferno')

## TOP STREAMED ARTISTS (SEPT'19-SEPT'20)

In [None]:
mydata.groupby(['artistName'])['hrsPlayed'].sum().sort_values(ascending=False)[:15].plot.pie(figsize=(10,10), autopct='%1.0f%%')
plt.title('Top 15 artists based on hours played in percentage', fontweight='bold')
plt.ylabel('')
plt.show()

## ANALYSING STREAMING ACTIVITY OF BTS'S MUSIC
Streaming history shows high activity after July 2020.

In [None]:
plt.figure(figsize=(15,8))
#Minutes per day spent on BTS, including solo and alias artist names
data = mydata[mydata.artistName.isin(['BTS','V','RM','BTSYOUNG4EVER'])].groupby('Date')['minPlayed'].sum()
plt.scatter(x=data.keys(), y=data,c=data, cmap='autumn_r',s= 250, edgecolors='black')
plt.ylabel('Minutes played')
plt.show()
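
To check the "high activity after July 2020" observation numerically rather than by eye, the daily minutes can be rolled up by month. This is a minimal sketch on invented rows. The artist list mirrors the one used above, while `toy`, `group` and `monthly` are illustrative names.

```python
import pandas as pd

# Toy streaming log; values are invented for illustration only.
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-05-10", "2020-06-20", "2020-07-15", "2020-08-01", "2020-08-02"]
    ),
    "artistName": ["BTS", "Other", "BTS", "RM", "BTS"],
    "minPlayed": [3.0, 4.0, 10.0, 5.0, 12.0],
})

group = ["BTS", "V", "RM", "BTSYOUNG4EVER"]
bts = toy[toy["artistName"].isin(group)]

# Sum minutes per calendar month; months with no BTS streams simply do not appear.
monthly = bts.groupby(bts["Date"].dt.to_period("M"))["minPlayed"].sum()
monthly.index = monthly.index.astype(str)  # plain 'YYYY-MM' labels
print(monthly)
```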

## TOP ARTISTS BEFORE JULY-2020

In [None]:
plotBar(mydata[mydata.Date < '2020-07-01'].groupby(['artistName'])['hrsPlayed'].sum().sort_values(ascending=False)[:20],'viridis')

## TOP ARTISTS AFTER JULY-2020

In [None]:
#>= so that streams on 1 July itself are not dropped from both halves
plotBar(mydata[mydata.Date >= '2020-07-01'].groupby(['artistName'])['hrsPlayed'].sum().sort_values(ascending=False)[:20],'viridis')

## TOP BTS TRACKS

In [None]:
plotBar(mydata[mydata.artistName.isin(['BTS','V','RM','BTSYOUNG4EVER'])].groupby(['trackName'])['minPlayed'].sum().sort_values(ascending=False)[:20],'nipy_spectral')

In [None]:
from wordcloud import WordCloud 
#Remove spaces inside each artist name so that it stays a single word in the cloud
cloud = ' '.join(name.replace(' ', '') for name in mydata['artistName'].unique())
plt.figure(figsize=(12,8))
wordcloud = WordCloud(background_color='white',max_font_size=50).generate(cloud)
plt.imshow(wordcloud)
plt.axis('off')
plt.title('ARTISTS', fontweight='bold')
plt.show()

UNDERSTANDING BTS's MUSIC:
---
---

![](https://ibighit.com/bts/images/bts/profile/profile-kv.png)

The audio features and their values are provided by Spotify.
They are explained in the [Spotify for Developers](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/) documentation.

In [None]:
#My preferred values of audio features (one row; each plotFeatures call below adds a column)
preference = pd.DataFrame()

In [None]:
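def plotFeatures(feature):
    #Mean value of the feature for each BTS song that appears in my streaming history
    streamed = mydata[mydata.msPlayed>1].trackName.tolist()
    data = btsdata[btsdata.song_name.isin(streamed)].groupby(['song_name'])[feature].mean().sort_values(ascending=False)
    sns.barplot(x=data,y=data.keys(),palette = 'gnuplot')
    plotMean(data,'r','--')
    #Side effect: store the overall mean as my preferred value for this feature
    preference[feature]= [round(np.mean(data),2)]
    plt.show()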

In [None]:
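#Correlate numeric columns only (text columns such as song_name cannot be correlated) and centre the colours on 0
sns.heatmap(btsdata.select_dtypes('number').corr(), annot=True, center=0)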
plt.show()

HIGH POSITIVE CORRELATION:
 1. Acousticness - Speechiness
 2. Liveness - Speechiness
 
HIGH NEGATIVE CORRELATION:
 1. Acousticness - Energy (visualized below)

In [None]:
sns.lmplot(x='acousticness',y='energy',data=btsdata, height=7,line_kws={'color': 'red'})
plt.title('Acousticness - Energy', fontweight='bold')
plt.show()
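
The number behind a heatmap cell like "Acousticness - Energy" is a Pearson correlation coefficient. As a small illustration of how such a value is obtained, the sketch below computes it with NumPy on invented feature values (not taken from `btsdata`).

```python
import numpy as np

# Invented feature values for five hypothetical tracks.
acousticness = np.array([0.05, 0.10, 0.30, 0.60, 0.85])
energy = np.array([0.92, 0.88, 0.70, 0.45, 0.30])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(acousticness, energy)[0, 1]
print(round(r, 3))
```

A value close to -1 means that, in this toy data, more acoustic tracks are consistently less energetic, which is the pattern the regression line above slopes along.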

In [None]:
plotFeatures('danceability')

In [None]:
plotFeatures('energy')

In [None]:
plotFeatures('speechiness')

In [None]:
plotFeatures('acousticness')

In [None]:
plotFeatures('instrumentalness')

In [None]:
plotFeatures('liveness')

In [None]:
plotFeatures('tempo')

In [None]:
plotFeatures('valence')

In [None]:
preference

**PREFERENCES:**
 1. High danceability
 2. High energy
 3. Low speechiness
 4. Low acousticness
 5. Low instrumentalness
 6. Low liveness
 7. High tempo
 8. High valence
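
The preference values are just averages of each feature over the songs that show up in the streaming history. A minimal sketch of that calculation without the plotting, on an invented two-feature table (`features`, `streamed` and `toy_preference` are illustrative names):

```python
import pandas as pd

# Invented audio features for three hypothetical songs.
features = pd.DataFrame({
    "song_name": ["song a", "song b", "song c"],
    "danceability": [0.60, 0.80, 0.20],
    "energy": [0.70, 0.90, 0.10],
})
streamed = ["song a", "song b"]  # track names that appear in the history

# Keep only streamed songs, average each feature, and shape it as one row.
listened = features[features["song_name"].isin(streamed)]
toy_preference = listened[["danceability", "energy"]].mean().round(2).to_frame().T
print(toy_preference)
```

"song c" was never streamed, so its low values do not pull the averages down.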

# FINDING K-POP SONGS BASED ON MY PREFERENCES
---
Inspiration: https://www.kaggle.com/ahmadal/spotify-extensive-analysis-song-recommender

In [None]:
kpop = pd.read_csv('../input/mydata/kpop.csv', index_col=0)
kpop['song_name'] = kpop['song_name'].str.strip().str.lower()
kpop.head()

In [None]:
#Features
kpop_features = kpop.loc[:,['acousticness','danceability','energy','instrumentalness','liveness','speechiness','tempo', 'valence']]
kpop_features.head()

## EUCLIDEAN DISTANCE TO FIND TRACKS WITH SIMILAR VALUES OF FEATURES

> The basis of many measures of similarity and dissimilarity is euclidean distance. The distance between vectors X and Y is defined as follows:
>
> ![image](http://www.analytictech.com/mb876/handouts/image001.gif)
>
> In other words, euclidean distance is the square root of the sum of squared differences between corresponding elements of the two vectors. Note that the formula treats the values of X and Y seriously: no adjustment is made for differences in scale.
> Euclidean distance is only appropriate for data measured on the same scale.
> In order to compute similarities or dissimilarities among rows, we do not need to (in fact, must not) try to adjust for differences in scale. Hence, Euclidean distance is usually the right measure for comparing cases.
>
> [Source](http://www.analytictech.com/mb876/handouts/distance_and_correlation.htm#:~:text=The%20basis%20of%20many%20measures,elements%20of%20the%20two%20vectors.)
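
For reference, the definition described in the quote can be written out for two vectors $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$. This is the standard textbook form, restated here rather than copied from the image:

$$
d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$

In this notebook, $n = 8$ (one component per audio feature), $X$ is a K-Pop track and $Y$ is my preference vector.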

In [None]:
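from sklearn.metrics.pairwise import euclidean_distances
#preference was filled in a different column order, so reorder it to match kpop_features
#Despite its name, 'Similarity' holds a distance: smaller means more similar
kpop['Similarity'] = euclidean_distances(kpop_features, preference[kpop_features.columns].to_numpy()).squeeze()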
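
One caveat, given the quoted note that Euclidean distance suits data on a common scale: `tempo` is measured in beats per minute, while the other seven features used here lie between 0 and 1, so unscaled distances tend to be dominated by tempo. The sketch below shows min-max scaling as one possible refinement. It is not what this notebook does, the numbers are invented, and `min_max_scale` and `distances` are hypothetical helpers.

```python
import numpy as np

# Columns: danceability, energy, tempo (BPM). Values are invented.
tracks = np.array([
    [0.80, 0.90, 100.0],  # track 0: similar mood, different tempo
    [0.20, 0.10, 121.0],  # track 1: very different mood, near-identical tempo
    [0.50, 0.50, 160.0],  # track 2: in between
])
target = np.array([0.80, 0.85, 120.0])  # hypothetical preference vector

def min_max_scale(values, lo, hi):
    """Map each column onto [0, 1] using the given per-column bounds."""
    return (values - lo) / (hi - lo)

def distances(rows, point):
    """Euclidean distance from every row to a single point."""
    return np.sqrt(((rows - point) ** 2).sum(axis=1))

raw = distances(tracks, target)

# Scale tracks and target with the same bounds so they stay comparable.
lo, hi = tracks.min(axis=0), tracks.max(axis=0)
scaled = distances(min_max_scale(tracks, lo, hi), min_max_scale(target, lo, hi))

print(raw.argmin(), scaled.argmin())
```

On the raw values the closest track is the one that merely shares the tempo. After scaling, the track with the matching danceability and energy wins instead.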

In [None]:
kpop.sort_values(by= 'Similarity', inplace= True)
similar = kpop[['artist_name', 'song_name', 'Similarity']]
similar = similar.drop_duplicates(subset=['artist_name', 'song_name'])

# **SIMILAR K-POP TRACKS**

In [None]:
similar.head(20)