The Pandas library contains several methods and functions for cleaning, manipulating and analyzing data. While NumPy is suited for working with homogenous numerical array data, Pandas is designed for working with tabular or heterogenous data. 

Let is import the Pandas library to use its methods and functions.

In [2]:
import pandas as pd

## Pandas data stuctures - Series and DataFrame

A DataFrame is a two-dimensional object - conmprising of tabular data organized in rows and columns, where individual columns can be of different value types (numeric / string / boolean etc.). A DataFrame has row indices which refer to individual rows, and column names that refer to indvidual columns. By default, the row indices are integers starting from zero. However, both the row indices and column names can be customized by the user.

Let us read the spotify data - *spotify_data.csv*, using the Pandas function `read_csv()`.

In [109]:
spotify_data = pd.read_csv('./Datasets/spotify_data.csv')
spotify_data.head()

Unnamed: 0,artist_followers,genres,artist_name,artist_popularity,track_name,track_popularity,duration_ms,explicit,release_year,danceability,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,16996777,rap,Juice WRLD,96,All Girls Are The Same,0,165820,1,2021,0.673,...,0,-7.226,1,0.306,0.0769,0.000338,0.0856,0.203,161.991,4
1,16996777,rap,Juice WRLD,96,Lucid Dreams,0,239836,1,2021,0.511,...,6,-7.23,0,0.2,0.349,0.0,0.34,0.218,83.903,4
2,16996777,rap,Juice WRLD,96,Hear Me Calling,0,189977,1,2021,0.699,...,7,-3.997,0,0.106,0.308,3.6e-05,0.121,0.499,88.933,4
3,16996777,rap,Juice WRLD,96,Robbery,0,240527,1,2021,0.708,...,2,-5.181,1,0.0442,0.348,0.0,0.222,0.543,79.993,4
4,5988689,rap,Roddy Ricch,88,Big Stepper,0,175170,0,2021,0.753,...,8,-8.469,1,0.292,0.0477,0.0,0.197,0.616,76.997,4


The object `spotify_data` is a pandas DataFrame:

In [46]:
type(spotify_data)

pandas.core.frame.DataFrame

A Series is a one-dimensional object, containing a sequence of values, where each value has an index. Each column of a DataFrame is Series as shown in the example below.

In [47]:
#Extracting movie titles from the movie_ratings DataFrame
spotify_songs = movie_ratings['track_name']
spotify_songs

0                      Eno Ide
1            Ee Tanuvu Ninnade
2            Munjaane Manjalli
3         Gudugudiya Sedi Nodo
4                        Ambar
                  ...         
245618         Coming Up Roses
245619               Young Kid
245620                Apricots
245621    Time I Love to Waste
245622                 Call me
Name: track_name, Length: 245623, dtype: object

In [48]:
#The object movie_titles is a Series
type(spotify_songs)

pandas.core.series.Series

## Data manipulations with Pandas

### Sub-setting data

In the chapter on reading data, we learned about operators `loc` and `iloc` that can be used to subset data based on axis labels and position of rows/columns respectively. However, usually we are not aware of the relevant row indices, and we may want to subset data based on some condition(s). For example, suppose we wish to analyze only those songs whose track popularity is higher than 50. 

**Q:** Do we need to subset rows or columns in this case? \
**A:** Rows, as songs correspond to rows, while features of songs correspond to columns. 

As we need to subset rows, the filter must be applied at the starting index. As we don't need to subset any specific features of the songs, there is no subsetting to be done on the columns. A `:` at the ending index means that all columns need to selected.

In [125]:
popular_songs = spotify_data.loc[spotify_data.track_popularity>=50,:]
popular_songs.head()

Unnamed: 0,artist_followers,genres,artist_name,artist_popularity,track_name,track_popularity,duration_ms,explicit,release_year,danceability,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
181,1277325,hip hop,Dave,77,Titanium,69,127750,1,2021,0.959,...,0,-8.687,0,0.437,0.152,1e-06,0.105,0.51,121.008,4
191,1123869,rap,Jay Wheeler,85,Viendo el Techo,64,188955,0,2021,0.741,...,11,-6.029,0,0.229,0.306,0.000327,0.1,0.265,179.972,4
208,3657199,rap,Polo G,91,RAPSTAR,89,165926,1,2021,0.789,...,6,-6.862,1,0.242,0.41,0.0,0.129,0.437,81.039,4
263,1461700,pop & rock,Teoman,67,Gecenin Sonuna Yolculuk,52,280600,0,2021,0.686,...,11,-7.457,0,0.0268,0.119,0.000386,0.108,0.56,100.932,4
293,299746,pop & rock,Lars Winnerbäck,62,Själ och hjärta,55,271675,0,2021,0.492,...,2,-6.005,0,0.0349,0.000735,0.000207,0.0953,0.603,142.042,4


Suppose we wish to analyze only *track_name, release year* and *track_popularity* of songs. Then, we can subset the revelant columns:

In [129]:
relevant_columns = spotify_data.loc[:,['track_name','release_year','track_popularity']]
relevant_columns.head()

Unnamed: 0,track_name,release_year,track_popularity
0,All Girls Are The Same,2021,0
1,Lucid Dreams,2021,0
2,Hear Me Calling,2021,0
3,Robbery,2021,0
4,Big Stepper,2021,0


### Sorting data

Sorting dataset is a very common operation. The [sort_values()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) function of Pandas can be used to sort a Pandas DataFrame or Series. Let us sort the spotify data in decreasing order of *track_popularity*:

In [133]:
spotify_sorted = spotify_data.sort_values(by = 'track_popularity', ascending = False)
spotify_sorted.head()

Unnamed: 0,artist_followers,genres,artist_name,artist_popularity,track_name,track_popularity,duration_ms,explicit,release_year,danceability,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
2398,1444702,pop,Olivia Rodrigo,88,drivers license,99,242014,1,2021,0.585,...,10,-8.761,1,0.0601,0.721,1.3e-05,0.105,0.132,143.874,4
2442,177401,hip hop,Masked Wolf,85,Astronaut In The Ocean,98,132780,0,2021,0.778,...,4,-6.865,0,0.0913,0.175,0.0,0.15,0.472,149.996,4
3133,1698014,pop,Kali Uchis,88,telepatía,97,160191,0,2020,0.653,...,11,-9.016,0,0.0502,0.112,0.0,0.203,0.553,83.97,4
6702,31308207,pop,The Weeknd,96,Save Your Tears,97,215627,1,2020,0.68,...,0,-5.487,1,0.0309,0.0212,1.2e-05,0.543,0.644,118.051,4
6703,31308207,pop,The Weeknd,96,Blinding Lights,96,200040,0,2020,0.514,...,1,-5.934,1,0.0598,0.00146,9.5e-05,0.0897,0.334,171.005,4


Drivers license is the most popular song!

In [148]:
#| echo: false

from jupyterquiz import display_quiz
import json
with open("./Datasets/questions_Pandas.json", "r") as file:
    questions=json.load(file)
display_quiz(questions)




### Unique values, value counts and membership

The Pandas function [unique](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) provides the unique values of a Series. For example, let us find the number of unique genres of songs in the spotify dataset: 

In [149]:
spotify_data.genres.unique()

array(['rap', 'pop', 'miscellaneous', 'metal', 'hip hop', 'rock',
       'pop & rock', 'hoerspiel', 'folk', 'electronic', 'jazz', 'country',
       'latin'], dtype=object)

The Pandas function [value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) provides the number of observations of each value of a Series. For example, let us find the number of songs of each genre in the spotify dataset:

In [153]:
spotify_data.genres.value_counts()

pop              70441
rock             49785
pop & rock       43437
miscellaneous    35848
jazz             13363
hoerspiel        12514
hip hop           7373
folk              2821
latin             2125
rap               1798
metal             1659
country           1236
electronic         790
Name: genres, dtype: int64

More than half the songs in the dataset are *pop, rock* or *pop & rock*.

The Pandas function [isin()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html) provides a boolean Series indicating the position of certain values in a Series. The function is helpful in sub-setting data. For example, let us subset the songs that are either *latin, rap,* or *metal*:

In [155]:
latin_rap_metal_songs = spotify_data.loc[spotify_data.genres.isin(['latin','rap','metal']),:]
latin_rap_metal_songs.head()

Unnamed: 0,artist_followers,genres,artist_name,artist_popularity,track_name,track_popularity,duration_ms,explicit,release_year,danceability,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,16996777,rap,Juice WRLD,96,All Girls Are The Same,0,165820,1,2021,0.673,...,0,-7.226,1,0.306,0.0769,0.000338,0.0856,0.203,161.991,4
1,16996777,rap,Juice WRLD,96,Lucid Dreams,0,239836,1,2021,0.511,...,6,-7.23,0,0.2,0.349,0.0,0.34,0.218,83.903,4
2,16996777,rap,Juice WRLD,96,Hear Me Calling,0,189977,1,2021,0.699,...,7,-3.997,0,0.106,0.308,3.6e-05,0.121,0.499,88.933,4
3,16996777,rap,Juice WRLD,96,Robbery,0,240527,1,2021,0.708,...,2,-5.181,1,0.0442,0.348,0.0,0.222,0.543,79.993,4
4,5988689,rap,Roddy Ricch,88,Big Stepper,0,175170,0,2021,0.753,...,8,-8.469,1,0.292,0.0477,0.0,0.197,0.616,76.997,4


## Operations between DataFrame and Series

Let us learn arithematic operations between DataFrame and Series with the help of an example.

**Example:** Spotify recommends songs based on songs listened by the user. Suppose you have listned to the song *drivers license*. Spotify intends to recommend you 5 songs that are *similar* to *drivers license*. Which songs should it recommend?

Let us see what information do we have about songs that can help us identify songs similar to drivers license. The [columns](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html) attribute of DataFrame will display all the columns names. The description of some of the column names relating to audio features is [here](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features).

In [54]:
spotify_data.columns

Index(['artist_followers', 'genres', 'artist_name', 'artist_popularity',
       'track_name', 'track_popularity', 'duration_ms', 'explicit',
       'release_year', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'time_signature'],
      dtype='object')

**Solution approach:** We have serveral features of a song. Let us find songs similar to *drivers license* in terms of *danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, time_signature* and *tempo*. Note that we are considering only audio features for simplicity.

To find the songs most similar to *drivers license*, we need to define a measure that quantifies the similarity. Let us define simililarity of a song with *drivers license* as the Euclidean distance of the song from *drivers license*, where the coordinates of a song are: (danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, time_signature, tempo). Thus, similarity can be formulated as:

$$Similarity_{DL-S} = \sqrt{(danceability_{DL}-danceability_{S})^2+(energy_{DL}-energy_{S})^2 +...+ (tempo_{DL}-tempo_{S})^2)},$$

where the subscript *DL* stands for *drivers license* and *S* stands for any song. The top 5 songs with the least value of $Similarity_{DL-S}$ will be the most similar to *drivers lincense* and should be recommended.

Let us subset the columns that we need to use to compute the Euclidean distance.

In [110]:
audio_features = spotify_data[['danceability', 'energy', 'key', 'loudness','mode','speechiness',
                               'acousticness', 'instrumentalness', 'liveness','valence', 'tempo', 'time_signature']]

In [111]:
audio_features.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,0.673,0.529,0,-7.226,1,0.306,0.0769,0.000338,0.0856,0.203,161.991,4
1,0.511,0.566,6,-7.23,0,0.2,0.349,0.0,0.34,0.218,83.903,4
2,0.699,0.687,7,-3.997,0,0.106,0.308,3.6e-05,0.121,0.499,88.933,4
3,0.708,0.69,2,-5.181,1,0.0442,0.348,0.0,0.222,0.543,79.993,4
4,0.753,0.597,8,-8.469,1,0.292,0.0477,0.0,0.197,0.616,76.997,4


In [112]:
#Distribution of values of audio_features
audio_features.describe()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0
mean,0.568357,0.580633,5.240326,-9.432548,0.670928,0.111984,0.383938,0.071169,0.223756,0.552302,119.33506,3.884177
std,0.159444,0.236631,3.532546,4.449731,0.469877,0.198068,0.321142,0.209555,0.198076,0.250017,29.864219,0.458082
min,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.462,0.405,2.0,-11.99,0.0,0.0332,0.07,0.0,0.0981,0.353,96.09925,4.0
50%,0.579,0.591,5.0,-8.645,1.0,0.0431,0.325,1.1e-05,0.141,0.56,118.002,4.0
75%,0.685,0.776,8.0,-6.131,1.0,0.0753,0.671,0.00222,0.292,0.76,137.929,4.0
max,0.988,1.0,11.0,3.744,1.0,0.969,0.996,1.0,1.0,1.0,243.507,5.0


Note that the audio features differ in terms of scale. Some features like *key* have a wide range of [0,11], while others like *danceability* have a very narrow range of [0,0.988]. If we use them directly, features like *danceability* will have a much higher influence on $Similarity_{DL-S}$ as compared to features like *key*. Assuming we wish all the features to have equal weight in quantifying a song's similarity to *drivers license*, we should scale the features, so that their values are comparable.

Let us scale the value of each column to a standard uniform distribtion: $U[0,1]$.

For scaling the values of a column to $U[0,1]$, we need to subtract the minimum value of the column from each value, and divide by the range of values of the column. For example, *danceability* can be standardized as follows:

In [113]:
#Scaling danceability to U[0,1]
danceability_value_range = audio_features.danceability.max()-audio_features.danceability.min()
danceability_std = (audio_features.danceability-audio_features.danceability.min())/danceability_value_range
danceability_std

0         0.681174
1         0.517206
2         0.707490
3         0.716599
4         0.762146
            ...   
243185    0.621457
243186    0.797571
243187    0.533401
243188    0.565789
243189    0.750000
Name: danceability, Length: 243190, dtype: float64

However, it will be cumbersome to repeat the above code for each audio feature. We can instead write a function that scales values of a column to $U[0,1]$, and apply the function on all the audio features.

In [114]:
#Function to scale a column to U[0,1]
def scale_uniform(x):
    return (x-x.min())/(x.max()-x.min())

We will use the Pandas function [apply()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) to apply the above function to the DataFrame `audio_features`.

In [115]:
#Scaling all audio features to U[0,1]
audio_features_scaled = audio_features.apply(scale_uniform)

**lambda function:** Note that one line functions can be conveniently written as lambda functions in Python. These functions do not require a name, and can be defined using the keyword `lambda`. The above two blocks of code can be concisely written as:

In [156]:
audio_features_scaled = audio_features.apply(lambda x: (x-x.min())/(x.max()-x.min()))

In [116]:
#All the audio features are scaled to U[0,1]
audio_features_scaled.describe()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0,243190.0
mean,0.57526,0.580633,0.476393,0.79329,0.670928,0.115566,0.38548,0.071169,0.223756,0.552302,0.490068,0.776835
std,0.16138,0.236631,0.321141,0.069806,0.469877,0.204405,0.322431,0.209555,0.198076,0.250017,0.122642,0.091616
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.467611,0.405,0.181818,0.753169,0.0,0.034262,0.070281,0.0,0.0981,0.353,0.394647,0.8
50%,0.586032,0.591,0.454545,0.805644,1.0,0.044479,0.326305,1.1e-05,0.141,0.56,0.484594,0.8
75%,0.69332,0.776,0.727273,0.845083,1.0,0.077709,0.673695,0.00222,0.292,0.76,0.566427,0.8
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Since we need to find the Euclidean distance from the song *drivers license*, let us find the index of the row containing features of *drivers license.

In [117]:
drivers_license_index = spotify_data[spotify_data.track_name=='drivers license'].index[0]

Now, we'll subtract the audio features of *drivers license* from all other songs:

In [118]:
songs_minus_DL = audio_features_scaled-audio_features_scaled.loc[drivers_license_index,:]

Now, let us square the difference computed above. We'll use the in-built python function `pow()` to square the difference:

In [119]:
songs_minus_DL_sq = songs_minus_DL.pow(2)
songs_minus_DL_sq.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,0.007933,0.008649,0.826446,0.00058,0.0,0.064398,0.418204,1.0556e-07,0.000376,0.005041,0.005535,0.0
1,0.00561,0.0169,0.132231,0.000577,1.0,0.020844,0.139498,1.7161e-10,0.055225,0.007396,0.060654,0.0
2,0.013314,0.063001,0.07438,0.005586,1.0,0.002244,0.171942,5.3824e-10,0.000256,0.134689,0.050906,0.0
3,0.015499,0.064516,0.528926,0.003154,0.0,0.000269,0.140249,1.7161e-10,0.013689,0.168921,0.068821,0.0
4,0.028914,0.025921,0.033058,2.1e-05,0.0,0.057274,0.456981,1.7161e-10,0.008464,0.234256,0.075428,0.0


Now, we'll sum the squares of differences from all audio features to compute the similarity of all songs to *drivers license*.

In [120]:
distance_squared = songs_minus_DL_sq.sum(axis = 1)
distance_squared.head()

0    1.337163
1    1.438935
2    1.516317
3    1.004043
4    0.920316
dtype: float64

Now, we'll sort these distances to find the top 5 songs closest to drivers's license.

In [121]:
distances_sorted = distance_squared.sort_values()
distances_sorted.head()

2398      0.000000
81844     0.008633
4397      0.011160
130789    0.015018
143744    0.015058
dtype: float64

Using the indices of the top 5 distances, we will identify the top 5 songs most similar to *drivers license*:

In [122]:
spotify_data.loc[distances_sorted.index[0:6],:]

Unnamed: 0,artist_followers,genres,artist_name,artist_popularity,track_name,track_popularity,duration_ms,explicit,release_year,danceability,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
2398,1444702,pop,Olivia Rodrigo,88,drivers license,99,242014,1,2021,0.585,...,10,-8.761,1,0.0601,0.721,1.3e-05,0.105,0.132,143.874,4
81844,2264501,pop,Jay Chou,74,安靜,49,334240,0,2001,0.513,...,10,-7.853,1,0.0281,0.688,8e-06,0.116,0.123,143.924,4
4397,25457,pop,Terence Lam,60,拼命無恙 in Bb major,52,241062,0,2020,0.532,...,10,-9.69,1,0.0269,0.674,0.0,0.117,0.19,151.996,4
130789,176266,pop,Alan Tam,54,從後趕上,8,258427,0,1988,0.584,...,10,-11.889,1,0.0282,0.707,2e-06,0.107,0.124,140.147,4
143744,396326,pop & rock,Laura Branigan,64,How Am I Supposed to Live Without You,40,263320,0,1983,0.559,...,10,-8.26,1,0.0355,0.813,8.3e-05,0.134,0.185,139.079,4
35627,1600562,pop,Tiziano Ferro,68,Non Me Lo So Spiegare,44,240040,0,2014,0.609,...,11,-7.087,1,0.0352,0.706,0.0,0.13,0.207,146.078,4


We can see the top 5 songs most similar to *drivers license* in the *track_name* column above. Interestingly, three of the five songs are Asian! These songs indeed sound similart to *drivers license*!

## Correlation

Correlation may refer to any kind of association between two random variables. However, in this book, we will always consider correlation as the linear association between two random variables, or the Pearson's correlation coefficient. Note that correlation does not imply causalty and vice-versa.

The Pandas function [corr()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) provides the pairwise correlation between all columns of a DataFrame, or between two Series. The function [corrwith()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corrwith.html#pandas.DataFrame.corrwith) provides the pairwise correlation of a DataFrame with another DataFrame or Series.

In [159]:
#Pairwise correlation amongst all columns
spotify_data.corr()

Unnamed: 0,artist_followers,artist_popularity,track_popularity,duration_ms,explicit,release_year,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
artist_followers,1.0,0.577861,0.197426,0.040435,0.082857,0.098589,-0.01012,0.080085,-0.000119,0.123771,0.004313,-0.059933,-0.107475,-0.033986,0.002425,-0.053317,0.016524,0.030826
artist_popularity,0.577861,1.0,0.285565,-0.097996,0.092147,0.062007,0.038784,0.039583,-0.011005,0.045165,0.018758,0.236942,-0.075715,-0.066679,0.099678,-0.034501,-0.032036,-0.033423
track_popularity,0.197426,0.285565,1.0,0.060474,0.193685,0.568329,0.158507,0.217342,0.013369,0.29635,-0.022486,-0.056537,-0.284433,-0.124283,-0.090479,-0.038859,0.058408,0.071741
duration_ms,0.040435,-0.097996,0.060474,1.0,-0.024226,0.067665,-0.145779,0.07599,0.00771,0.078586,-0.034818,-0.332585,-0.13396,0.067055,-0.034631,-0.155354,0.051046,0.085015
explicit,0.082857,0.092147,0.193685,-0.024226,1.0,0.215656,0.138522,0.104734,0.011818,0.12441,-0.06035,0.077268,-0.129363,-0.039472,-0.024283,-0.032549,0.006585,0.043538
release_year,0.098589,0.062007,0.568329,0.067665,0.215656,1.0,0.204743,0.338096,0.021497,0.430054,-0.071338,-0.032968,-0.369038,-0.149644,-0.04516,-0.070025,0.079382,0.089485
danceability,-0.01012,0.038784,0.158507,-0.145779,0.138522,0.204743,1.0,0.137615,0.020128,0.142239,-0.05113,0.198509,-0.143936,-0.179213,-0.114999,0.50535,-0.125061,0.111015
energy,0.080085,0.039583,0.217342,0.07599,0.104734,0.338096,0.137615,1.0,0.030824,0.747829,-0.053374,-0.043377,-0.678745,-0.131269,0.12605,0.348158,0.20596,0.170854
key,-0.000119,-0.011005,0.013369,0.00771,0.011818,0.021497,0.020128,0.030824,1.0,0.024674,-0.139688,-0.003533,-0.023179,-0.0066,-0.011566,0.024206,0.008336,0.007738
loudness,0.123771,0.045165,0.29635,0.078586,0.12441,0.430054,0.142239,0.747829,0.024674,1.0,-0.028151,-0.173444,-0.49302,-0.269008,0.002959,0.209588,0.171926,0.14603


**Q:** Which audio feature is the most correlated with *track_popularity*?

In [168]:
spotify_data.corrwith(spotify_data.track_popularity).sort_values(ascending = False)

track_popularity     1.000000
release_year         0.568329
loudness             0.296350
artist_popularity    0.285565
energy               0.217342
artist_followers     0.197426
explicit             0.193685
danceability         0.158507
time_signature       0.071741
duration_ms          0.060474
tempo                0.058408
key                  0.013369
mode                -0.022486
valence             -0.038859
speechiness         -0.056537
liveness            -0.090479
instrumentalness    -0.124283
acousticness        -0.284433
dtype: float64

Loudness is the audio feature having the highest correlation with *track_popularity*.