# Analyzing Audio Features from Spotify

Who doesn't like music? Let's analyze the [Spotify audio features dataset](https://www.kaggle.com/tomigelo/spotify-audio-features) from Kaggle.

First, let's import cuDF from the RAPIDS library to help us explore the data.

In [None]:
# Libraries being used
import cudf as cf

### Initial Data overview
Now on to loading the data into a GPU DataFrame so we can run some basic analysis and see what we're working with!

In [None]:
# Read CSV file into environment
spotifyDF = cf.read_csv("data/SpotifyAudioFeaturesApril2019.csv")

# Display number of rows and columns
print("Number of rows: " + str(spotifyDF.shape[0]))
print("Number of columns: " + str(spotifyDF.shape[1]))

# Display column names
print("\nColumn Names: " + str(list(spotifyDF.columns)))

# Display top 3 rows of data
print("\n\nTop 3 Rows of data: ")
print(spotifyDF.head(3))
print("************************")



### Interpreting the initial output

So we can see that the dataset has 130663 rows and 17 columns, with the columns representing not only track information such as song title and artist, but also actual acoustical features such as "energy", "liveness", "acousticness" and even "danceability"!

Depending on the size of your display, the top 3 rows may be shown collapsed because there are many columns. However, if we look at just the top row at index 0, we can see that each row holds information for one song, meaning that there are more than 130k songs in this dataset!

### Most popular songs

Let's see the top 10 most popular songs.

*(Reminder that this dataset was created in April 2019, so the popular songs reflect that period.)*

In [None]:
# Top 10 songs sorted by popularity

top10Songs = spotifyDF.sort_values('popularity', ascending=False).head(10)
top10Songs

With artists like 'Daddy Yankee', 'Post Malone', 'Jonas Brothers', and 'Billie Eilish' among the top 5 most popular songs, we can assume that everything's working so far!

### Predicting similar songs based on acoustical features

Let's use the following features to find songs similar to ones we already know/like: 

* acousticness
* danceability
* energy
* instrumentalness
* liveness
* loudness
* speechiness
* valence

All the above features are continuous numerical values. We can use the K-Nearest Neighbors (K-NN) algorithm to find other songs whose feature values lie close to those of a song we pick. The K-NN implementation can be found in the cuML package of the RAPIDS library.

*Note that we've left out some features like time signature, tempo, and popularity. While they are somewhat acoustical features (except popularity), I don't believe they convey information necessary for the task.*
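To make the idea of "nearby" songs concrete, here is a minimal pure-Python sketch of a brute-force nearest-neighbor search. It assumes Euclidean distance, and the three-feature vectors are made-up values, not rows from the dataset. The helper name `k_nearest` is hypothetical and is not part of cuML.

```python
import math

# Made-up feature vectors: (acousticness, danceability, energy)
songs = [
    (0.90, 0.30, 0.20),  # index 0
    (0.10, 0.80, 0.90),  # index 1
    (0.85, 0.35, 0.25),  # index 2
    (0.15, 0.75, 0.85),  # index 3
]

def k_nearest(query, points, k):
    """Return (distance, index) pairs for the k points closest to query."""
    scored = [(math.dist(query, p), i) for i, p in enumerate(points)]
    return sorted(scored)[:k]

# Query with song 0; it is part of the data, so it is its own closest match
neighbors = k_nearest(songs[0], songs, k=2)
print(neighbors)
```

Because the query song is itself in the data, it comes back as the closest match at distance 0. Asking for k neighbors therefore yields k - 1 genuinely different songs.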


In [None]:
# Import Libraries
from cuml.neighbors import NearestNeighbors

We'll use the indices of the spotifyDF dataframe as the unique identifiers for the songs.

*Note that we could also use the track_id column, but this calls for re-indexing the DataFrame.*
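For reference, re-indexing by an ID column could look roughly like the sketch below. It uses pandas as a CPU stand-in so it runs without a GPU. cuDF exposes a similarly named `set_index`, but treat the exact behavior on the GPU as an assumption. The rows and ID strings are made up and are not real Spotify IDs.

```python
import pandas as pd

# Made-up rows; the track_id values are placeholders, not real Spotify IDs
df = pd.DataFrame({
    "track_id": ["id_a", "id_b", "id_c"],
    "artist_name": ["Artist A", "Artist B", "Artist C"],
    "energy": [0.2, 0.9, 0.5],
})

# Re-index so rows are addressed by track_id instead of by position
by_id = df.set_index("track_id")

# Look up one feature value for one track by its ID
print(by_id.loc["id_b", "energy"])
```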

In [None]:
# Extract only columns being used
nnDF = spotifyDF[['acousticness', 'danceability', 'energy',
                  'instrumentalness', 'liveness', 'loudness',
                  'speechiness', 'valence']]

# Create a song/artist DF to be used as a lookup
artistDF = spotifyDF[['artist_name', 'track_name']]

# Create NN Model
nnSpotifyModel = NearestNeighbors()
nnSpotifyModel = nnSpotifyModel.fit(nnDF)

We now have a fitted model, 'nnSpotifyModel', that can be used to find the nearest neighbors of any song we want.

Let's find songs similar to a song by 'John Mayer'.

We need to: 

1. Find the index of a John Mayer song
2. Extract the features associated with the selected song
3. Call kneighbors() on nnSpotifyModel with the extracted features

In [None]:
# Determine the indices of John Mayer songs
print(spotifyDF[spotifyDF['artist_name'] == 'John Mayer'][['artist_name','track_name']])

In [None]:
# Extract the features of John Mayer - XO, at index 74564
nnFeatures = nnDF[nnDF.index == 74564]
nnFeatures

In [None]:
# Find the 3 nearest neighbors of the song
# (recent cuML releases name this argument n_neighbors; older ones used k)
nnSpotifyModel.kneighbors(nnFeatures, n_neighbors=3)
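The call above returns raw results rather than song names. Here is a minimal sketch of the lookup step, assuming the result is a `(distances, indices)` pair that has already been converted to plain Python lists. All values and names below are made up for illustration, and `describe_neighbors` is a hypothetical helper, not a cuML function.

```python
# Made-up stand-ins for the (distances, indices) pair returned by kneighbors(),
# already converted to plain Python lists: one inner list per query row
distances = [[0.0, 0.12, 0.15]]
indices = [[2, 0, 3]]

# Made-up lookup table playing the role of artistDF (row position -> names)
lookup = [
    ("Artist A", "Track 1"),
    ("Artist B", "Track 2"),
    ("Artist C", "Track 3"),
    ("Artist D", "Track 4"),
]

def describe_neighbors(dist_row, idx_row, lookup, query_index=None):
    """Pair each neighbor index with its names and distance, skipping the query itself."""
    results = []
    for idx, dist in zip(idx_row, dist_row):
        if idx == query_index:
            continue  # the query song is its own nearest neighbor
        artist, track = lookup[idx]
        results.append((artist, track, dist))
    return results

# The query song sits at row 2 of the made-up lookup table
matches = describe_neighbors(distances[0], indices[0], lookup, query_index=2)
for artist, track, dist in matches:
    print(f"{artist} - {track} (distance {dist:.2f})")
```

With the real model, the same idea applies by using `artistDF` as the lookup and 74564 as the query index. How the returned objects are converted to host lists depends on the cuML version and is left as an assumption here.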