# Analyzing Audio Features from Spotify

Who doesn't like music? Let's analyze the [Spotify audio features dataset](https://www.kaggle.com/tomigelo/spotify-audio-features) from Kaggle.

First, let's import cuDF from the RAPIDS library to help us explore the data.

In [2]:
# Libraries being used
import cudf as cf

### Initial data overview
Now on to loading the data into a GPU dataframe so we can run some basic analysis and see what we're working with!

In [5]:
# Read CSV file into environment
spotifyDF = cf.read_csv("data/SpotifyAudioFeaturesApril2019.csv")

# Display number of rows and columns
print("Number of rows: " + str(spotifyDF.shape[0]))
print("Number of columns: " + str(spotifyDF.shape[1]))

# Display column names
print("\nColumn Names: " + str(list(spotifyDF.columns)))

# Display top 3 rows of data
print("\n\nTop 3 Rows of data: ")
print(spotifyDF.head(3))
print("************************")


Number of rows: 130663
Number of columns: 17

Column Names: ['artist_name', 'track_id', 'track_name', 'acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'speechiness', 'tempo', 'time_signature', 'valence', 'popularity']


Top 3 Rows of data: 
  artist_name                track_id  \
0          YG  2RM4jf1Xa9zPgMGRDiht8O   
1          YG  1tHDG53xJNGsItRA3vfVgs   
2       R3HAB  6Wosx2euFPMT14UXiWudMy   

                                       track_name  acousticness  danceability  \
0  Big Bank feat. 2 Chainz, Big Sean, Nicki Minaj       0.00582         0.743   
1                    BAND DRUM (feat. A$AP Rocky)       0.02440         0.846   
2                                   Radio Silence       0.02500         0.603   

   duration_ms  energy  instrumentalness  key  liveness  loudness  mode  \
0       238373   0.339               0.0    1    0.0812    -7.678     1   
1       214800   0.557               0.0    8    0.2


### Interpreting the initial output

The dataset has 130663 rows and 17 columns. The columns cover not only track information such as song title and artist, but also audio features such as "energy", "liveness", "acousticness" and even "danceability"!

Depending on the size of your display, the top 3 rows may wrap across several blocks because there are so many columns. Still, looking at just the row at index 0 shows that each row describes one song, which means there are more than 130k songs in this dataset!
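When a wide row wraps like this, it can be easier to read one record vertically, one column per line. The snippet below is a minimal plain-Python sketch of that idea rather than anything from cuDF: `first_row` and `format_record` are hypothetical names, and the values are copied from row 0 of the output above (columns after `mode` are left out because the printed output is cut off before them).

```python
# Row 0 as printed above. Only the columns visible in the output are
# included; the rest are omitted rather than guessed.
first_row = {
    "artist_name": "YG",
    "track_id": "2RM4jf1Xa9zPgMGRDiht8O",
    "track_name": "Big Bank feat. 2 Chainz, Big Sean, Nicki Minaj",
    "acousticness": 0.00582,
    "danceability": 0.743,
    "duration_ms": 238373,
    "energy": 0.339,
    "instrumentalness": 0.0,
    "key": 1,
    "liveness": 0.0812,
    "loudness": -7.678,
    "mode": 1,
}


def format_record(record):
    """Return one 'column  value' line per field, with the names left-aligned."""
    width = max(len(name) for name in record)
    return [f"{name:<{width}}  {value}" for name, value in record.items()]


for line in format_record(first_row):
    print(line)
```

On the GPU dataframe itself, something like `spotifyDF.head(1).to_pandas().T` should give a similar vertical view, assuming that copying a single row back to the CPU is acceptable.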

### Most popular songs

Let's see the top 10 most popular songs.

*(Reminder: this dataset was created in April 2019, so the ranking reflects what was popular around that time.)*

In [17]:
# Top 10 songs sorted by popularity

top10Songs = spotifyDF.sort_values('popularity', ascending=False).head(10)
top10Songs

|  | artist_name | track_id | track_name | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99393 | Daddy Yankee | 5w9c2J52mkdntKOmRLeM2m | Con Calma | 0.11 | 0.737 | 193227 | 0.86 | 2e-06 | 8 | 0.0574 | -2.652 | 0 | 0.0593 | 93.989 | 4 | 0.656 | 100 |
| 99364 | Post Malone | 6MWtB6iiXyIwun0YzU6DFP | Wow. | 0.163 | 0.833 | 149520 | 0.539 | 2e-06 | 11 | 0.101 | -7.399 | 0 | 0.178 | 99.947 | 4 | 0.385 | 98 |
| 99366 | Jonas Brothers | 4y3OI86AEP6PQoDE6olYhO | Sucker | 0.0427 | 0.842 | 181040 | 0.734 | 0.0 | 1 | 0.106 | -5.065 | 0 | 0.0588 | 137.958 | 4 | 0.952 | 98 |
| 99368 | Billie Eilish | 2Fxmhks0bxGSBdJ92vM42m | bad guy | 0.308 | 0.708 | 194088 | 0.418 | 0.093 | 0 | 0.1 | -10.998 | 1 | 0.368 | 131.926 | 4 | 0.578 | 98 |
| 123909 | Post Malone | 3KkXRkHbMCARz0aVfEt68P | Sunflower - Spider-Man: Into the Spider-Verse | 0.52 | 0.744 | 158040 | 0.494 | 0.0 | 2 | 0.0716 | -5.563 | 1 | 0.0699 | 89.972 | 4 | 0.927 | 98 |
| 99365 | Ariana Grande | 4kV4N9D1iKVxx1KLvtTpjS | break up with your girlfriend, i'm bored | 0.0421 | 0.726 | 190440 | 0.554 | 0.0 | 5 | 0.106 | -5.29 | 0 | 0.0917 | 169.999 | 4 | 0.335 | 97 |
| 99372 | Ariana Grande | 14msK75pk3pA33pzPVNtBF | 7 rings | 0.578 | 0.725 | 178640 | 0.321 | 0.0 | 1 | 0.0884 | -10.744 | 0 | 0.323 | 70.142 | 4 | 0.319 | 96 |
| 99375 | Sam Smith | 6Qs4SXO9dwPj5GKvVOv8Ki | Dancing With A Stranger (with Normani) | 0.45 | 0.741 | 171030 | 0.52 | 2e-06 | 8 | 0.222 | -7.513 | 1 | 0.0656 | 102.998 | 4 | 0.347 | 96 |
| 123873 | Halsey | 5p7ujcrUXASCNwRaWNHR1C | Without Me | 0.297 | 0.752 | 201661 | 0.488 | 9e-06 | 6 | 0.0936 | -7.05 | 1 | 0.0705 | 136.041 | 4 | 0.533 | 96 |
| 123879 | Marshmello | 2dpaYNEQHiRxtZbfNsse99 | Happier | 0.191 | 0.687 | 214290 | 0.792 | 0.0 | 5 | 0.167 | -2.749 | 1 | 0.0452 | 100.015 | 4 | 0.671 | 96 |


With artists like Daddy Yankee, Post Malone, Jonas Brothers, and Billie Eilish behind the five most popular songs, we can assume that everything is working so far!

### Predicting similar songs based on acoustical features

Let's use the following features to find songs similar to ones we already know and like:

* acousticness
* danceability
* energy
* instrumentalness
* liveness
* loudness
* speechiness
* valence

All of the features above are continuous numerical values, so each song can be treated as a point in an eight-dimensional feature space. We can use the K-Nearest Neighbors (KNN) algorithm to find the songs whose points lie closest to a song we pick. KNN is available in the cuML package of the RAPIDS library.

*Note that we've left out features such as `time_signature`, `tempo`, and `popularity`. Time signature and tempo are arguably acoustical features (popularity is not), but I don't believe they carry information that's necessary for this task.*
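To make the nearest-neighbor idea concrete before bringing in cuML, here is a small CPU-only sketch in plain Python. It is not the cuML implementation: `SONGS`, `min_max_scale`, and `nearest_songs` are hypothetical names, and the feature values are copied from six rows of the top-10 output above. The min-max scaling step is an assumption added here, because `loudness` is on a different scale from the other features and would otherwise dominate the Euclidean distance.

```python
import math

# Feature columns from the list above, in a fixed order.
FEATURES = ["acousticness", "danceability", "energy", "instrumentalness",
            "liveness", "loudness", "speechiness", "valence"]

# track_name -> feature values in FEATURES order, copied from the top-10 output.
SONGS = {
    "Con Calma": [0.11, 0.737, 0.86, 2e-06, 0.0574, -2.652, 0.0593, 0.656],
    "Wow.": [0.163, 0.833, 0.539, 2e-06, 0.101, -7.399, 0.178, 0.385],
    "Sucker": [0.0427, 0.842, 0.734, 0.0, 0.106, -5.065, 0.0588, 0.952],
    "bad guy": [0.308, 0.708, 0.418, 0.093, 0.1, -10.998, 0.368, 0.578],
    "7 rings": [0.578, 0.725, 0.321, 0.0, 0.0884, -10.744, 0.323, 0.319],
    "Happier": [0.191, 0.687, 0.792, 0.0, 0.167, -2.749, 0.0452, 0.671],
}


def min_max_scale(rows):
    """Rescale every column to [0, 1] so no single feature dominates the distance."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # Fall back to a span of 1.0 for constant columns to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(value - low) / span for value, low, span in zip(row, lows, spans)]
            for row in rows]


def nearest_songs(query, k=2):
    """Return the k track names closest to `query` by Euclidean distance."""
    names = list(SONGS)
    scaled = dict(zip(names, min_max_scale(list(SONGS.values()))))
    distances = [(math.dist(scaled[query], scaled[name]), name)
                 for name in names if name != query]
    return [name for _, name in sorted(distances)[:k]]


print(nearest_songs("bad guy", k=2))  # → ['7 rings', 'Wow.']
```

This brute-force version compares the query against every other song, which is fine for six rows but is exactly the kind of work that benefits from a GPU once all 130k songs are involved.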
