<a href="https://colab.research.google.com/github/Soot3/Spotify-Analysis/blob/main/Spotify_SA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Spotify publishes the Top 200 most-streamed songs per region; the charts are collated [here](https://spotifycharts.com/).

In [1]:
import pandas as pd

The charts are updated every week. We will look at the weekly charts from October 2019 to October 2020, approximately a year's worth of data.

In [17]:
# Compiling the csv files
import os, glob

path = "/content/"

all_files = glob.glob(os.path.join(path, "*.csv"))

all_df = []
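for f in all_files:
    # skiprows=1 drops the first line of each file; the column names are read from the next line
    df = pd.read_csv(f, skiprows=1)
    # Derive the week label from the file name by stripping the prefix and the extension.
    # Plain str.replace is literal, so the '.' in '.csv' is not treated as a regex wildcard.
    week = os.path.basename(f)
    df['Week'] = week.replace('regional-za-weekly-', '').replace('.csv', '')
    all_df.append(df)
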
merged_df = pd.concat(all_df, ignore_index=True, sort=True)

In [18]:
merged_df.head()

Unnamed: 0,Artist,Position,Streams,Track Name,URL,Week
0,Roddy Ricch,1,90253,The Box,https://open.spotify.com/track/0nbXyq5TXYPCO7p...,2020-02-21--2020-02-28
1,Tones And I,2,72898,Dance Monkey,https://open.spotify.com/track/1rgnBhdG2JDFTbY...,2020-02-21--2020-02-28
2,The Weeknd,3,71787,Blinding Lights,https://open.spotify.com/track/0sf12qNH5qcw8qp...,2020-02-21--2020-02-28
3,Future,4,64670,Life Is Good (feat. Drake),https://open.spotify.com/track/5yY9lUy8nbvjM1U...,2020-02-21--2020-02-28
4,Kabza De Small,5,64600,eMcimbini - Live,https://open.spotify.com/track/2YXf32CaC2PzXIg...,2020-02-21--2020-02-28


In [19]:
merged_df['Week'].nunique()

54

The dataset covers 54 weekly charts: roughly a year's worth of Top 200 Spotify songs streamed in South Africa.
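To check which dates the charts actually span, the `Week` label can be split into start and end dates. This is a minimal sketch on a toy frame: the label format matches the notebook's, but the second row and the names `weeks`, `week_start`, and `week_end` are illustrative assumptions.

```python
import pandas as pd

# Toy frame with the same "start--end" label format as merged_df['Week'] (rows are illustrative).
weeks = pd.DataFrame({"Week": ["2020-02-21--2020-02-28", "2019-10-04--2019-10-11"]})

# Split each label on "--" into two columns and parse them as dates.
bounds = weeks["Week"].str.split("--", expand=True)
weeks["week_start"] = pd.to_datetime(bounds[0])
weeks["week_end"] = pd.to_datetime(bounds[1])

# The earliest start and the latest end give the period covered.
span = (weeks["week_start"].min(), weeks["week_end"].max())
print(span)
```

Applied to `merged_df`, the same split would also allow the rows to be sorted chronologically rather than in the order the files were read.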

In [3]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10800 entries, 0 to 10799
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Artist      10800 non-null  object
 1   Position    10800 non-null  int64 
 2   Streams     10800 non-null  int64 
 3   Track Name  10800 non-null  object
 4   URL         10800 non-null  object
 5   Week        10800 non-null  object
dtypes: int64(2), object(4)
memory usage: 506.4+ KB


In [4]:
songs = merged_df.groupby(['Track Name', 'Artist'], as_index=False)['Streams'].sum()
songs.sort_values(by='Streams', ascending=False).head(10)

Unnamed: 0,Track Name,Artist,Streams
186,Dance Monkey,Tones And I,3413872
111,Blinding Lights,The Weeknd,3061373
911,The Box,Roddy Ricch,2571056
158,Circles,Post Malone,2547789
573,Memories,Maroon 5,2405041
833,Someone You Loved,Lewis Capaldi,2279987
756,Roses - Imanbek Remix,SAINt JHN,2247756
219,Don't Start Now,Dua Lipa,2244422
91,Beautiful People (feat. Khalid),Ed Sheeran,2097508
728,ROCKSTAR (feat. Roddy Ricch),DaBaby,1964058


These songs have the highest total streams over the period, a sign of sustained dominance on the chart.

In [5]:
songs_avg = merged_df.groupby(['Track Name', 'Artist'], as_index=False)['Streams'].mean()
songs_avg.sort_values(by='Streams', ascending=False).head(10)

Unnamed: 0,Track Name,Artist,Streams
989,WAP (feat. Megan Thee Stallion),Cardi B,97671.818182
494,Laugh Now Cry Later (feat. Lil Durk),Drake,84061.2
587,Mood (feat. iann dior),24kGoldn,76964.2
499,"Lemonade (feat. Gunna, Don Toliver & NAV)",Internet Money,73855.75
728,ROCKSTAR (feat. Roddy Ricch),DaBaby,72742.888889
111,Blinding Lights,The Weeknd,65135.595745
1044,Xola Moya Wam',Nomcebo Zikode,64290.2
186,Dance Monkey,Tones And I,63219.851852
947,Toosie Slide,Drake,61276.62069
467,John Vuli Gate,Mapara A Jazz,58451.5


Ranking by average weekly streams favours newer songs: they have not yet gone through the decline in streams that would pull their average down.
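One way to see this effect is to put the number of weeks on the chart next to the total and the mean. The sketch below uses made-up numbers and hypothetical song names; only the column names `Track Name`, `Artist`, and `Streams` come from the notebook.

```python
import pandas as pd

# Toy chart data (made-up numbers): one long-running song and one recent entry.
toy = pd.DataFrame({
    "Track Name": ["Old Hit"] * 4 + ["New Hit"] * 2,
    "Artist": ["A"] * 4 + ["B"] * 2,
    "Streams": [90000, 60000, 40000, 30000, 80000, 85000],
})

# Total, mean, and number of charted weeks per song in a single pass.
summary = (
    toy.groupby(["Track Name", "Artist"], as_index=False)
       .agg(total_streams=("Streams", "sum"),
            mean_streams=("Streams", "mean"),
            weeks_on_chart=("Streams", "count"))
       .sort_values("mean_streams", ascending=False)
)
print(summary)
```

In this toy data the recent song leads on the mean while the older song leads on the total, which mirrors the difference between the two rankings above.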

In [6]:
songs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1118 entries, 0 to 1117
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Track Name  1118 non-null   object
 1   Artist      1118 non-null   object
 2   Streams     1118 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 34.9+ KB


1,118 distinct songs (track name and artist pairs) charted over the year.

The data from the Spotify Charts site only contains the number of streams per week. To get other information on the songs in the dataset, we need Spotify's API.

In [7]:
!pip install spotipy
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

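# Read the API credentials from environment variables instead of hardcoding secrets in the notebook
import os
authorization={"client_id": os.environ["SPOTIPY_CLIENT_ID"], "client_secret": os.environ["SPOTIPY_CLIENT_SECRET"]}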
client_credentials_manager = SpotifyClientCredentials(client_id=authorization['client_id'],client_secret=authorization['client_secret'])

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

Collecting spotipy
  Downloading https://files.pythonhosted.org/packages/7a/cd/e7d9a35216ea5bfb9234785f3d8fa7c96d0e33999c2cb72394128f6b4cce/spotipy-2.16.1-py3-none-any.whl
Installing collected packages: spotipy
Successfully installed spotipy-2.16.1


In [8]:
def getTrackFeatures(track_url):
  """Return metadata and audio features for one track as a flat list."""
  meta = sp.track(track_url)
  features = sp.audio_features(track_url)
  # Genres are attached to artists rather than tracks, so look up the first credited artist
  artist_url = meta['artists'][0]['external_urls']['spotify']
  artist_data = sp.artist(artist_url)

  # meta
  name = meta['name']
  album = meta['album']['name']
  artist = meta['album']['artists'][0]['name']
  # Use the artist's first listed genre, or 'Missing' when none is listed
  if artist_data['genres']:
    genre = artist_data['genres'][0]
  else:
    genre = 'Missing'
  release_date = meta['album']['release_date']
  length = meta['duration_ms']
  popularity = meta['popularity']

  # features (audio_features returns a list with one entry per requested track)
  acousticness = features[0]['acousticness']
  danceability = features[0]['danceability']
  energy = features[0]['energy']
  instrumentalness = features[0]['instrumentalness']
  liveness = features[0]['liveness']
  loudness = features[0]['loudness']
  speechiness = features[0]['speechiness']
  tempo = features[0]['tempo']
  time_signature = features[0]['time_signature']

  # The order here must match the column names used when building the DataFrame
  track = [name, album, artist, genre, release_date, length, popularity, danceability, acousticness, energy, instrumentalness, liveness, loudness, speechiness, tempo, time_signature]
  return track

In [9]:
# loop over the track URLs, one API lookup per chart row
tracks = []
for url in merged_df['URL']:
  tracks.append(getTrackFeatures(url))

# create dataset
df = pd.DataFrame(tracks, columns = ['name', 'album', 'artist', 'artist_top_genre', 'release_date', 'length', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature'])
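The loop above makes one lookup per chart row, even though most songs appear in many weeks. A possible refinement is to look up each distinct URL once and reuse the result. The helper `fetch_once_per_url` and the stand-in `fake_fetch` below are hypothetical names introduced for this sketch; they are not part of spotipy or the notebook.

```python
from typing import Callable, Dict, Iterable, List


def fetch_once_per_url(urls: Iterable[str], fetch: Callable[[str], list]) -> List[list]:
    """Call `fetch` once per distinct URL and return results in the original order."""
    cache: Dict[str, list] = {}
    results = []
    for url in urls:
        if url not in cache:
            cache[url] = fetch(url)  # only the first occurrence triggers a lookup
        results.append(cache[url])
    return results


# Stand-in for getTrackFeatures so the sketch runs without network access.
calls = []

def fake_fetch(url: str) -> list:
    calls.append(url)
    return [url.upper()]

rows = fetch_once_per_url(["a", "b", "a", "a"], fake_fetch)
print(len(rows), len(calls))
```

In the notebook this would be called as `tracks = fetch_once_per_url(merged_df['URL'], getTrackFeatures)`, assuming repeated chart entries for a song share the same URL. The output keeps one row per chart entry, so the later column-wise concat is unaffected.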

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10800 entries, 0 to 10799
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              10800 non-null  object 
 1   album             10800 non-null  object 
 2   artist            10800 non-null  object 
 3   artist_top_genre  10800 non-null  object 
 4   release_date      10800 non-null  object 
 5   length            10800 non-null  int64  
 6   popularity        10800 non-null  int64  
 7   danceability      10800 non-null  float64
 8   acousticness      10800 non-null  float64
 9   energy            10800 non-null  float64
 10  instrumentalness  10800 non-null  float64
 11  liveness          10800 non-null  float64
 12  loudness          10800 non-null  float64
 13  speechiness       10800 non-null  float64
 14  tempo             10800 non-null  float64
 15  time_signature    10800 non-null  int64  
dtypes: float64(8), int64(3), object(5)
memor

In [50]:
merged_df = pd.read_csv('/content/SA_Spotify.csv')
df = pd.read_csv('/content/SA_Spotify_data.csv')

In [51]:
# Adding the streams and week columns from the previous dataset
concat_df = pd.concat([merged_df,df],axis=1)
concat_df.drop(columns=['name', 'artist', 'URL'], inplace=True)
concat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10800 entries, 0 to 10799
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Artist            10800 non-null  object 
 1   Position          10800 non-null  int64  
 2   Streams           10800 non-null  int64  
 3   Track Name        10800 non-null  object 
 4   Week              10800 non-null  object 
 5   album             10800 non-null  object 
 6   artist_top_genre  10800 non-null  object 
 7   release_date      10800 non-null  object 
 8   length            10800 non-null  int64  
 9   popularity        10800 non-null  int64  
 10  danceability      10800 non-null  float64
 11  acousticness      10800 non-null  float64
 12  energy            10800 non-null  float64
 13  instrumentalness  10800 non-null  float64
 14  liveness          10800 non-null  float64
 15  loudness          10800 non-null  float64
 16  speechiness       10800 non-null  float6

In [52]:
concat_df.head()

Unnamed: 0,Artist,Position,Streams,Track Name,Week,album,artist_top_genre,release_date,length,popularity,danceability,acousticness,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature
0,Roddy Ricch,1,90253,The Box,2020-02-21--2020-02-28,Please Excuse Me For Being Antisocial,melodic rap,2019-12-06,196652,89,0.896,0.104,0.586,0.0,0.79,-6.687,0.0559,116.971,4
1,Tones And I,2,72898,Dance Monkey,2020-02-21--2020-02-28,Dance Monkey,australian pop,2019-05-10,209754,70,0.825,0.688,0.593,0.000161,0.17,-6.401,0.0988,98.078,4
2,The Weeknd,3,71787,Blinding Lights,2020-02-21--2020-02-28,Blinding Lights,canadian contemporary r&b,2019-11-29,201573,34,0.513,0.00147,0.796,0.000209,0.0938,-4.075,0.0629,171.017,4
3,Future,4,64670,Life Is Good (feat. Drake),2020-02-21--2020-02-28,Life Is Good (feat. Drake),atl hip hop,2020-01-10,237735,85,0.676,0.0706,0.609,0.0,0.152,-5.831,0.481,142.037,4
4,Kabza De Small,5,64600,eMcimbini - Live,2020-02-21--2020-02-28,Scorpion Kings Live,afro house,2020-02-03,407787,49,0.844,0.00646,0.564,0.163,0.0231,-10.817,0.0855,112.996,4


The combined data has more features to explore, including audio features and each song's popularity on Spotify.
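The column-wise `pd.concat` used to build `concat_df` pairs rows purely by index position, so it is only correct if both frames are in the same row order. A small sanity check along these lines could be run before concatenating. The frames `charts` and `features` are toy stand-ins for `merged_df` and `df`, with values taken from the first two rows shown above.

```python
import pandas as pd

# Toy stand-ins for merged_df and df.
charts = pd.DataFrame({"Track Name": ["The Box", "Dance Monkey"], "Streams": [90253, 72898]})
features = pd.DataFrame({"name": ["The Box", "Dance Monkey"], "popularity": [89, 70]})

# A positional concat is only safe when lengths and indexes match.
assert len(charts) == len(features)
assert charts.index.equals(features.index)

# Count rows where the chart name and the API name disagree.
mismatches = int((charts["Track Name"] != features["name"]).sum())
print(mismatches)
```

A non-zero count is a hint rather than proof of misalignment, since the chart and the API may spell some titles differently.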

In [53]:
concat_df['artist_top_genre'].nunique()

107

In [54]:
concat_df['artist_top_genre'].value_counts(ascending=False).head(10)

dance pop           1841
pop                 1546
afro house           826
chicago rap          378
alternative r&b      339
melodic rap          334
electropop           333
dfw rap              323
rap                  295
canadian hip hop     247
Name: artist_top_genre, dtype: int64

In [55]:
len(concat_df[concat_df['artist_top_genre'] == 'Missing'])

122

122 of the 10,800 chart entries (about 1.1%) have no genre, because the track's artist has no genres listed on Spotify.
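Because the genre comes from the artist, it can help to count how many distinct artists are behind the missing entries as well as the share of rows affected. This is a sketch on made-up rows; only the `'Missing'` placeholder and the column names come from the notebook.

```python
import pandas as pd

# Toy frame using the same 'Missing' placeholder as the notebook (rows are made up).
toy = pd.DataFrame({
    "Artist": ["A", "A", "B", "C"],
    "artist_top_genre": ["Missing", "Missing", "pop", "afro house"],
})

missing = toy["artist_top_genre"] == "Missing"
missing_share = missing.mean()                          # fraction of chart rows without a genre
missing_artists = toy.loc[missing, "Artist"].nunique()  # distinct artists behind those rows
print(missing_share, missing_artists)
```

The same two lines applied to `concat_df` would show whether the 122 rows come from a handful of artists or from many.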

In [56]:
concat_df.to_csv('SA_Spotify_data.csv', index=False, encoding='utf-8')