# Analysis of the Top 50 Spotify tracks

The objectives of this analysis of the Top 50 Spotify tracks are to develop skills in using the Pandas library: to practice working with data from Kaggle, performing exploratory data analysis, and reading, querying, and filtering data with Pandas.

## Preparing for analysis: importing and cleaning the data

### Importing Python libraries and the Spotify Top 50 data into a Pandas dataframe

The first step of the analysis was to download the Spotify Top 50 data from Kaggle and import it into a Pandas dataframe.

In [1]:
import numpy as np
import pandas as pd


In [2]:
spotify = pd.read_csv("/Users/user/PycharmProjects/Spotify/data/spotifytoptracks.csv", index_col=0)

The basic variables of the dataframe are characterised below.

In [3]:
spotify.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist            50 non-null     object 
 1   album             50 non-null     object 
 2   track_name        50 non-null     object 
 3   track_id          50 non-null     object 
 4   energy            50 non-null     float64
 5   danceability      50 non-null     float64
 6   key               50 non-null     int64  
 7   loudness          50 non-null     float64
 8   acousticness      50 non-null     float64
 9   speechiness       50 non-null     float64
 10  instrumentalness  50 non-null     float64
 11  liveness          50 non-null     float64
 12  valence           50 non-null     float64
 13  tempo             50 non-null     float64
 14  duration_ms       50 non-null     int64  
 15  genre             50 non-null     object 
dtypes: float64(9), int64(2), object(5)
memory usag

### Checking for missing values

The function below was designed to check whether any cell of the dataframe has a missing value.

In [4]:
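def check_missing_values():
    # Boolean mask: True wherever a cell of the dataframe is missing
    missing = spotify.isnull()
    found = False
    for column in missing.columns:
        # Index labels of the rows where this column has a missing value
        for index in missing.index[missing[column]]:
            found = True
            print(f'The missing value is in the index value {index} of the column {column}.')
    # Report a clean dataframe only if nothing was found
    if not found:
        print('There are no missing values in the dataframe.')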
    


In [5]:
np.where(spotify.isnull())
#print('There are no missing values in the dataframe.')

(array([], dtype=int64), array([], dtype=int64))

It appeared that there were no missing values.

In [6]:
check_missing_values()


There are no missing values in the dataframe.
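A more compact way to get the same information is to count missing cells per column. The sketch below uses a small hypothetical dataframe named `toy` (not the Spotify data) so that it runs on its own; the column names are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Spotify dataframe
toy = pd.DataFrame({
    "track_name": ["a", "b", "c"],
    "energy": [0.5, np.nan, 0.7],
    "genre": ["Pop", "Pop", None],
})

# Number of missing cells in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# (row label, column name) of every missing cell
missing_cells = [
    (row, col)
    for col in toy.columns
    for row in toy.index[toy[col].isnull()]
]
print(missing_cells)
```

On the real data, a column of zeros from `spotify.isnull().sum()` would confirm that nothing is missing.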


### Checking for duplicates

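We applied the Pandas function 'duplicated' to each column to count repeated values. The columns 'track_name' and 'track_id' contain no repeated values, so there are no duplicate tracks in the dataframe (see below). Repeated values in the other columns are expected: for example, one artist can have several tracks in the top 50.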

In [9]:
duplicates_sum = spotify.apply(lambda x: x.duplicated(keep = False).sum())
print(duplicates_sum)

artist              17
album                9
track_name           0
track_id             0
energy               2
danceability         4
key                 48
loudness             0
acousticness         2
speechiness          4
instrumentalness    32
liveness             5
valence              6
tempo                0
duration_ms          0
genre               40
dtype: int64
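The per-column counts above can be complemented by a row-level check. The following is a minimal sketch on a hypothetical dataframe `toy`; it contrasts whole-row duplicates, duplicates of an identifier column, and the per-column count with `keep=False` used in this analysis.

```python
import pandas as pd

# Hypothetical toy data: the last row repeats the first one
toy = pd.DataFrame({
    "artist": ["X", "X", "Y", "X"],
    "track_id": ["t1", "t2", "t3", "t1"],
    "key": [1, 1, 5, 1],
})

# Rows identical to an earlier row in every column
full_row_duplicates = int(toy.duplicated().sum())

# Rows that repeat an already seen track_id
repeated_ids = int(toy.duplicated(subset=["track_id"]).sum())

# Per-column count with keep=False: every member of a repeated group is counted
artist_repeats = int(toy["artist"].duplicated(keep=False).sum())

print(full_row_duplicates, repeated_ids, artist_repeats)
```

The third number is larger than the first two because `keep=False` marks all occurrences of a repeated value, not only the later ones.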


### Checking for outliers

The next step of the analysis was to check for outliers. We designed a function for finding outliers following the advice provided by this article (https://hersanyagci.medium.com/detecting-and-handling-outliers-with-pandas-7adbfcd5cad8). We used Tukey's rule (also known as the IQR rule): we calculated the interquartile range of the data (IQR = Q3 - Q1) and set the lower boundary at Q1 - 1.5 * IQR and the upper boundary at Q3 + 1.5 * IQR. Then we looped over each value of the numerical variables to check whether it falls outside these boundaries.

In [14]:
def find_outliers(x,coef):
    Q1 = spotify.iloc[:,x].quantile(0.25)
    Q3 = spotify.iloc[:,x].quantile(0.75)
    IQR = Q3 - Q1
    lower_lim = Q1 - coef*IQR
    upper_lim = Q3 + coef*IQR
    for data in range(len(spotify.iloc[:,0])):
        if spotify.iloc[data,x] < lower_lim:
            print(f'Outlier for the index value {data} of the feature {spotify.columns[x]}: {round(spotify.iloc[data,x],3)}.')
            count_lower.append(data)
        elif spotify.iloc[data,x] > upper_lim:
            print(f'Outlier for the index value {data} of the feature {spotify.columns[x]}: {round(spotify.iloc[data,x],3)}.')
            count_upper.append(data)
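The same rule can also be written without an explicit loop. The sketch below is an alternative formulation, not the function used in this analysis; the helper name `tukey_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def tukey_outliers(series, coef=1.5):
    """Return the values of `series` that fall outside the Tukey fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - coef * iqr
    upper = q3 + coef * iqr
    # Boolean mask selects values below the lower or above the upper fence
    return series[(series < lower) | (series > upper)]

# Hypothetical danceability-like values with one unusually low score
values = pd.Series([0.70, 0.72, 0.75, 0.78, 0.80, 0.35])
print(tukey_outliers(values))
```

Returning the outlying values (with their index labels) instead of printing them makes it easy to count them or to reuse the labels later.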
        

The initial analysis revealed that 34 values of the feature variables could be characterized as outliers (see below). 

In [15]:
print('Outliers in the dataframe:\n')
count_lower = []    
count_upper = []
for i in range(4,15):
    find_outliers(i, 1.5)
print(f'\nIn total, there are {len(count_lower) + len(count_upper)} outliers.')
    

Outliers in the dataframe:

Outlier for the index value 16 of the feature danceability: 0.459.
Outlier for the index value 44 of the feature danceability: 0.351.
Outlier for the index value 47 of the feature danceability: 0.464.
Outlier for the index value 24 of the feature loudness: -14.454.
Outlier for the index value 1 of the feature acousticness: 0.688.
Outlier for the index value 7 of the feature acousticness: 0.731.
Outlier for the index value 9 of the feature acousticness: 0.751.
Outlier for the index value 18 of the feature acousticness: 0.837.
Outlier for the index value 24 of the feature acousticness: 0.902.
Outlier for the index value 44 of the feature acousticness: 0.934.
Outlier for the index value 47 of the feature acousticness: 0.866.
Outlier for the index value 19 of the feature speechiness: 0.487.
Outlier for the index value 26 of the feature speechiness: 0.375.
Outlier for the index value 27 of the feature speechiness: 0.375.
Outlier for the index value 29 of the feat

Such a high number of outliers seemed too many for a dataframe of 50 observations, so we increased the coefficient value from 1.5 to 2.5. After running the function again, the number of outliers fell to 16. However, closer examination of the data showed that the values of instrumentalness are concentrated at or near zero, so the interquartile range is very small and almost every non-zero value of this feature is interpreted as an outlier. Thus, it makes no sense to treat any value of this feature as an outlier.
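The behaviour of instrumentalness can be reproduced with a small worked example. The values below are hypothetical: when most observations are zero, both quartiles are zero, the interquartile range collapses to zero, and the upper boundary sits at zero no matter which coefficient is used.

```python
import pandas as pd

# Hypothetical zero-heavy feature: eight zeros and two non-zero values
zero_heavy = pd.Series([0.0] * 8 + [0.001, 0.13])

q1 = zero_heavy.quantile(0.25)
q3 = zero_heavy.quantile(0.75)
iqr = q3 - q1
upper = q3 + 2.5 * iqr

# Every non-zero value lies above the upper boundary
flagged = zero_heavy[zero_heavy > upper]
print(iqr, upper, len(flagged))
```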

In [16]:
print('Outliers in the dataframe:\n')
count_lower = []    
count_upper = []
for i in range(4,15):
    find_outliers(i, 2.5)
print(f'\nIn total, there are {len(count_lower) + len(count_upper)} outliers.')

Outliers in the dataframe:

Outlier for the index value 44 of the feature danceability: 0.351.
Outlier for the index value 44 of the feature acousticness: 0.934.
Outlier for the index value 19 of the feature speechiness: 0.487.
Outlier for the index value 0 of the feature instrumentalness: 0.0.
Outlier for the index value 1 of the feature instrumentalness: 0.0.
Outlier for the index value 3 of the feature instrumentalness: 0.004.
Outlier for the index value 10 of the feature instrumentalness: 0.0.
Outlier for the index value 12 of the feature instrumentalness: 0.002.
Outlier for the index value 24 of the feature instrumentalness: 0.657.
Outlier for the index value 26 of the feature instrumentalness: 0.13.
Outlier for the index value 33 of the feature instrumentalness: 0.0.
Outlier for the index value 34 of the feature instrumentalness: 0.002.
Outlier for the index value 41 of the feature instrumentalness: 0.001.
Outlier for the index value 48 of the feature instrumentalness: 0.001.
Out

### Dropping rows with outliers from the dataframe


The article https://hersanyagci.medium.com/detecting-and-handling-outliers-with-pandas-7adbfcd5cad8 recommends three ways to handle outliers in the data: dropping the outliers, winsorizing, and log transformation. We decided to drop the outliers.
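For comparison only, winsorizing keeps every row and caps extreme values at chosen percentiles instead of removing them. The sketch below is not part of this analysis; the 5th and 95th percentiles and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical values with one extreme observation
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Cap everything below the 5th and above the 95th percentile
lower = values.quantile(0.05)
upper = values.quantile(0.95)
capped = values.clip(lower=lower, upper=upper)

print(capped)
```

The number of observations is unchanged, which matters for a dataset as small as 50 rows.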

We designed a function which drops outliers from the dataframe based on Tukey's rule. Because it makes no sense to treat values of instrumentalness as outliers, values of the feature 'instrumentalness' were excluded from the condition for dropping outliers.

In [17]:
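def drop_outliers(x, df, coef):
    # Values of 'instrumentalness' are never treated as outliers
    if df.columns[x] == "instrumentalness":
        return df
    # The boundaries are computed from the original dataframe
    Q1 = spotify.iloc[:,x].quantile(0.25)
    Q3 = spotify.iloc[:,x].quantile(0.75)
    IQR = Q3 - Q1
    lower_lim = Q1 - coef*IQR
    upper_lim = Q3 + coef*IQR
    # Drop the rows whose value lies outside the boundaries
    outlier_mask = (df.iloc[:,x] < lower_lim) | (df.iloc[:,x] > upper_lim)
    return df.drop(df[outlier_mask].index)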


After running the function with the coefficient of 2.5, a new dataframe of 46 observations without outliers was created.

In [18]:
spotify_clean = pd.DataFrame(spotify)
for i in range(4,15):
    spotify_clean = drop_outliers(i, spotify_clean, 2.5)
spotify_clean.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 46 entries, 0 to 49
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist            46 non-null     object 
 1   album             46 non-null     object 
 2   track_name        46 non-null     object 
 3   track_id          46 non-null     object 
 4   energy            46 non-null     float64
 5   danceability      46 non-null     float64
 6   key               46 non-null     int64  
 7   loudness          46 non-null     float64
 8   acousticness      46 non-null     float64
 9   speechiness       46 non-null     float64
 10  instrumentalness  46 non-null     float64
 11  liveness          46 non-null     float64
 12  valence           46 non-null     float64
 13  tempo             46 non-null     float64
 14  duration_ms       46 non-null     int64  
 15  genre             46 non-null     object 
dtypes: float64(9), int64(2), object(5)
memory usag

## General information about the dataframe

### Number of observations

In [19]:
print(f'The number of observations in the dataframe is {spotify.shape[0]}.')
print(f'The number of observations after cleaning of the data in the dataframe is {spotify_clean.shape[0]}.')


The number of observations in the dataframe is 50.
The number of observations after cleaning of the data in the dataframe is 46.


### Number of features

In [20]:
print(f'The number of features in the dataframe is {spotify.shape[1]}.')


The number of features in the dataframe is 16.


### Number of categorical features

In [21]:
def count_cat_features():
    count = 0
    for column in spotify.columns:
        if spotify[column].dtypes == "object":
            count += 1
    return count


In [22]:
print(f'The number of categorical features in the dataframe is {count_cat_features()}.')


The number of categorical features in the dataframe is 5.


### Number of numerical features

In [23]:
def count_num_features():
    count = 0
    for column in spotify.columns:
        if spotify[column].dtypes != "object":
            count += 1
    return count


In [24]:
print(f'The number of numerical features in the dataframe is {count_num_features()}.')


The number of numerical features in the dataframe is 11.
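The two counting functions can also be expressed with `select_dtypes`, which selects columns by type. This is a minimal sketch on a hypothetical dataframe `toy`; selecting on "number" rather than on "object" is a design choice that avoids depending on how text columns are stored.

```python
import pandas as pd

# Hypothetical toy data with two text and two numerical columns
toy = pd.DataFrame({
    "artist": ["X", "Y"],
    "genre": ["Pop", "Hip-Hop/Rap"],
    "energy": [0.5, 0.7],
    "key": [1, 5],
})

# Columns holding numbers, and all remaining columns
numerical = toy.select_dtypes(include="number").columns
categorical = toy.select_dtypes(exclude="number").columns

print(f"Numerical features: {len(numerical)}, categorical features: {len(categorical)}")
```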


## Selecting values of variables based on conditions

The following analysis answers questions about the categorical variables track_name, artist, album, and genre based on various conditions (values of other variables). We decided to analyse the initial dataframe rather than the dataframe obtained after handling the outliers. The results of the analysis are presented below:

### Are there any artists that have more than 1 popular track? If yes, which and how many?

In [25]:
def count_artists_tracks():
    artists = pd.Series(spotify.loc[:,'artist'].value_counts())
    print(f'These artists have more than one popular track:\n')
    count = 0
    for index, value in artists.items():
        if artists[index] > 1:
            print(index, value)
            count += 1
    print(f'\nThe number of artists having more than 1 popular track is {count}.')
    

In [26]:
count_artists_tracks()


These artists have more than one popular track:

Billie Eilish 3
Dua Lipa 3
Travis Scott 3
Justin Bieber 2
Harry Styles 2
Lewis Capaldi 2
Post Malone 2

The number of artists having more than 1 popular track is 7.
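The loop over `value_counts` can be replaced by a boolean filter on the counts. The sketch below uses a hypothetical Series of artist names; it is an alternative formulation rather than the code used above.

```python
import pandas as pd

# Hypothetical artist column
artists = pd.Series(["A", "B", "A", "C", "B", "A", "D"], name="artist")

# Number of tracks per artist, sorted from most to fewest
counts = artists.value_counts()

# Keep only artists with more than one track
repeat_artists = counts[counts > 1]

print(repeat_artists)
print(f"The number of artists having more than 1 popular track is {len(repeat_artists)}.")
```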


### Who was the most popular artist?

In [27]:
def find_most_popular_artist():
    artists = pd.DataFrame(spotify.loc[:,'artist'].value_counts())
    popular_artists = artists[artists.loc[:,"artist"] == spotify.loc[:,'artist'].value_counts().max()]
    print(f'The most popular artists have the highest number of tracks in the top 50. Such artists are:\n')
    [print(x) for x in popular_artists.index]
    

In [28]:
find_most_popular_artist()


The most popular artists have the highest number of tracks in the top 50. Such artists are:

Billie Eilish
Dua Lipa
Travis Scott
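The function above selects a column called 'artist' from the frame built out of `value_counts`. In pandas 2.0 and later the counts are named 'count' instead of taking the name of the original column, so that lookup may fail on a newer installation. A sketch that does not depend on the column name, using a hypothetical Series of artist names, could look like this:

```python
import pandas as pd

# Hypothetical artist column with a tie for first place
artists = pd.Series(["A", "B", "A", "C", "B"], name="artist")

counts = artists.value_counts()

# All artists whose track count equals the maximum (ties included)
most_popular = sorted(counts[counts == counts.max()].index)

print(most_popular)
```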


### How many artists in total have their songs in the top 50?

In [29]:
number_of_artists = spotify.loc[:,'artist'].value_counts().count()
print(f'In total, {number_of_artists} artists have their songs in the top 50.')


In total, 40 artists have their songs in the top 50.


### Are there any albums that have more than 1 popular track? If yes, which and how many?

In [30]:
def count_albums_tracks():
    albums = pd.Series(spotify.loc[:,'album'].value_counts())
    print(f'These albums have more than one popular track:\n')
    count = 0
    for index, value in albums.items():
        if albums[index] > 1:
            print(index, value)
            count+=1
    print(f'The number of albums having more than 1 popular track is {count}.')
    

In [31]:
count_albums_tracks()


These albums have more than one popular track:

Future Nostalgia 3
Hollywood's Bleeding 2
Fine Line 2
Changes 2
The number of albums having more than 1 popular track is 4.


### How many albums in total have their songs in the top 50?

In [35]:
number_of_albums = spotify.loc[:,'album'].value_counts().count()
print(f'In total, {number_of_albums} albums have their songs in the top 50.')


In total, 45 albums have their songs in the top 50.


### Which tracks have a danceability score above 0.7?

In [36]:
def find_track_danceability_above():
    tracks_danceability = spotify.loc[:,"track_name"][spotify.loc[:,"danceability"]>0.7]
    print(f'These {tracks_danceability.count()} tracks have a danceability score above 0.7:\n')
    [print(x) for x in tracks_danceability.values]


In [37]:
find_track_danceability_above()

These 32 tracks have a danceability score above 0.7:

Dance Monkey
The Box
Roses - Imanbek Remix
Don't Start Now
ROCKSTAR (feat. Roddy Ricch)
death bed (coffee for your head)
Falling
Tusa
Blueberry Faygo
Intentions (feat. Quavo)
Toosie Slide
Say So
Memories
Life Is Good (feat. Drake)
Savage Love (Laxed - Siren Beat)
Breaking Me
everything i wanted
Señorita
bad guy
WAP (feat. Megan Thee Stallion)
Sunday Best
Godzilla (feat. Juice WRLD)
Break My Heart
Dynamite
Supalonely (feat. Gus Dapperton)
Sunflower - Spider-Man: Into the Spider-Verse
Hawái
Ride It
goosebumps
RITMO (Bad Boys For Life)
THE SCOTTS
SICKO MODE


### Which tracks have a danceability score below 0.4?

In [38]:
def find_tracks_danceability_below():
    tracks_danceability = spotify.loc[:,"track_name"][spotify.loc[:,"danceability"]<0.4]
    print(f'These {tracks_danceability.count()} tracks have a danceability score below 0.4:\n')
    [print(x) for x in tracks_danceability.values]


In [39]:
find_tracks_danceability_below()


These 1 tracks have a danceability score below 0.4:

lovely (with Khalid)
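The danceability and loudness questions in this part of the analysis all follow the same pattern: build a boolean mask on one feature and select the track names. A single helper could cover all of them. The function name `tracks_where`, its parameters, and the toy data below are hypothetical.

```python
import pandas as pd

def tracks_where(df, feature, low=None, high=None):
    """Return track names whose `feature` is above `low` and/or below `high`."""
    mask = pd.Series(True, index=df.index)
    if low is not None:
        mask &= df[feature] > low
    if high is not None:
        mask &= df[feature] < high
    return df.loc[mask, "track_name"].tolist()

# Hypothetical toy data
toy = pd.DataFrame({
    "track_name": ["a", "b", "c"],
    "danceability": [0.80, 0.35, 0.60],
})

print(tracks_where(toy, "danceability", low=0.7))
print(tracks_where(toy, "danceability", high=0.4))
```

With the real dataframe, `tracks_where(spotify, "loudness", low=-5)` would correspond to the question about loudness above -5.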


### Which tracks have their loudness above -5?

In [40]:
def find_track_loudness_above():
    tracks_loudness = spotify.loc[:,"track_name"][spotify.loc[:,"loudness"]> -5]
    print(f'These {tracks_loudness.count()} tracks have their loudness score above -5:\n')
    [print(x) for x in tracks_loudness.values]


In [41]:
find_track_loudness_above()

These 19 tracks have their loudness score above -5:

Don't Start Now
Watermelon Sugar
Tusa
Circles
Before You Go
Say So
Adore You
Mood (feat. iann dior)
Break My Heart
Dynamite
Supalonely (feat. Gus Dapperton)
Rain On Me (with Ariana Grande)
Sunflower - Spider-Man: Into the Spider-Verse
Hawái
Ride It
goosebumps
Safaera
Physical
SICKO MODE


### Which tracks have their loudness below -8?

In [42]:
def find_track_loudness_below():
    tracks_loudness = spotify.loc[:,"track_name"][spotify.loc[:,"loudness"]< -8]
    print(f'These {tracks_loudness.count()} tracks have their loudness score below -8:\n')
    [print(x) for x in tracks_loudness.values]


In [43]:
find_track_loudness_below()

These 9 tracks have their loudness score below -8:

death bed (coffee for your head)
Falling
Toosie Slide
Savage Love (Laxed - Siren Beat)
everything i wanted
bad guy
HIGHEST IN THE ROOM
lovely (with Khalid)
If the World Was Ending - feat. Julia Michaels


### Which track is the longest?

In [44]:
track_longest = spotify.loc[:,("track_name","duration_ms")].sort_values(by = ["duration_ms"], ascending=False, ignore_index=True).set_index("track_name")
print(f'The longest track is {track_longest.index[0]}.')

The longest track is SICKO MODE.


### Which track is the shortest?

In [45]:
track_shortest = spotify.loc[:,("track_name","duration_ms")].sort_values(by = ["duration_ms"], ascending=True, ignore_index=True).set_index("track_name")
print(f'The shortest track is {track_shortest.index[0]}.')

The shortest track is Mood (feat. iann dior).
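Sorting the whole dataframe is not necessary to find a single extreme value. As a sketch, `idxmax` and `idxmin` return the index label of the largest and smallest value, which can then be used to look up the track name. The toy data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "track_name": ["a", "b", "c"],
    "duration_ms": [200000, 312820, 140526],
})

# Index label of the extreme value, then the matching track name
longest = toy.loc[toy["duration_ms"].idxmax(), "track_name"]
shortest = toy.loc[toy["duration_ms"].idxmin(), "track_name"]

print(f"The longest track is {longest}.")
print(f"The shortest track is {shortest}.")
```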


### Which genre is the most popular?

In [46]:
def find_most_popular_genre():
    genres = pd.DataFrame(spotify.loc[:,'genre'].value_counts())
    popular_genres = genres[genres.loc[:,"genre"] == spotify.loc[:,'genre'].value_counts().max()]
    print(f'The most popular genre has the highest number of tracks in the top 50. Such genre is:\n')
    [print(x) for x in popular_genres.index]
    

In [47]:
find_most_popular_genre()


The most popular genre has the highest number of tracks in the top 50. Such genre is:

Pop


### Which genres have just one song on the top 50?

In [48]:
def find_genres_one_song():
    genres = pd.DataFrame(spotify.loc[:,'genre'].value_counts())
    genres_tracks = genres[genres.loc[:,"genre"] == 1]
    print(f'These genres have just one song on the top 50:\n')
    [print(x) for x in genres_tracks.index]
    

In [49]:
find_genres_one_song()


These genres have just one song on the top 50:

Nu-disco
R&B/Hip-Hop alternative
Pop/Soft Rock
Pop rap
Hip-Hop/Trap
Dance-pop/Disco
Disco-pop
Dreampop/Hip-Hop/R&B
Alternative/reggaeton/experimental
Chamber pop


### How many genres in total are represented in the top 50?

In [50]:
genres = spotify.loc[:,'genre'].value_counts().count()
print(f'In total, {genres} genres are represented in the top 50.')


In total, 16 genres are represented in the top 50.
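Counting distinct values and finding values that occur once can both be done directly on the column. The following sketch uses a hypothetical Series of genre labels.

```python
import pandas as pd

# Hypothetical genre column
genres = pd.Series(
    ["Pop", "Pop", "Hip-Hop/Rap", "Nu-disco", "Hip-Hop/Rap", "Chamber pop"],
    name="genre",
)

# Number of distinct genres
n_genres = genres.nunique()

# Genres represented by exactly one track
counts = genres.value_counts()
single_track_genres = sorted(counts[counts == 1].index)

print(n_genres, single_track_genres)
```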


## Finding correlations between features

The analysis of correlations between variables in Python can be performed with Pandas, Numpy or Scipy correlation tools (https://realpython.com/numpy-scipy-pandas-correlation-python/). Unlike the Pandas and Numpy tools, the Scipy correlation methods can also calculate p-values indicating the statistical significance of correlation coefficients. However, as this project aims to apply Pandas and Numpy skills, we chose the Pandas library for the correlation analysis.


Correlations can be calculated only for numerical variables. To construct a correlation matrix, we created a dataframe consisting of the numerical variables. We also dropped the numerical variable 'key', as it does not define any feature of the tracks in the top 50. Then we constructed a correlation matrix from the remaining variables, saved it into a dataframe, and designed functions to find out which features are strongly positively correlated, strongly negatively correlated, and not correlated. The outputs of the functions are statements about relationships between features. Because the matrix is symmetric, each relationship is printed twice.

In [51]:
spotify_corr = spotify.iloc[:,4:15].drop("key",axis=1)
corr_matrix = pd.DataFrame(spotify_corr.corr())
corr_matrix

| | energy | danceability | loudness | acousticness | speechiness | instrumentalness | liveness | valence | tempo | duration_ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| energy | 1.0 | 0.152552 | 0.79164 | -0.682479 | 0.074267 | -0.385515 | 0.069487 | 0.393453 | 0.075191 | 0.081971 |
| danceability | 0.152552 | 1.0 | 0.167147 | -0.359135 | 0.226148 | -0.017706 | -0.006648 | 0.479953 | 0.168956 | -0.033763 |
| loudness | 0.79164 | 0.167147 | 1.0 | -0.498695 | -0.021693 | -0.553735 | -0.069939 | 0.406772 | 0.102097 | 0.06413 |
| acousticness | -0.682479 | -0.359135 | -0.498695 | 1.0 | -0.135392 | 0.352184 | -0.128384 | -0.243192 | -0.241119 | -0.010988 |
| speechiness | 0.074267 | 0.226148 | -0.021693 | -0.135392 | 1.0 | 0.028948 | -0.142957 | 0.053867 | 0.215504 | 0.366976 |
| instrumentalness | -0.385515 | -0.017706 | -0.553735 | 0.352184 | 0.028948 | 1.0 | -0.087034 | -0.203283 | 0.018853 | 0.184709 |
| liveness | 0.069487 | -0.006648 | -0.069939 | -0.128384 | -0.142957 | -0.087034 | 1.0 | -0.033366 | 0.025457 | -0.090188 |
| valence | 0.393453 | 0.479953 | 0.406772 | -0.243192 | 0.053867 | -0.203283 | -0.033366 | 1.0 | 0.045089 | -0.039794 |
| tempo | 0.075191 | 0.168956 | 0.102097 | -0.241119 | 0.215504 | 0.018853 | 0.025457 | 0.045089 | 1.0 | 0.130328 |
| duration_ms | 0.081971 | -0.033763 | 0.06413 | -0.010988 | 0.366976 | 0.184709 | -0.090188 | -0.039794 | 0.130328 | 1.0 |


### Which features are strongly positively correlated?

According to the rule of thumb for interpreting correlation coefficients, the correlation between two variables is considered strong if the absolute value of r is greater than 0.75 (see https://www.statology.org/what-is-a-strong-correlation/). Thus, to find which features are strongly positively correlated, we looped over all values of the correlation matrix and selected values higher than 0.75, excluding the coefficients of 1 that each variable has with itself.


In [3]:
def find_pos_correlation(corr_matrix, coef):
    for i in range(len(corr_matrix.iloc[0,:])):
        for j in range(len(corr_matrix.iloc[:,0])):
            if (corr_matrix.iloc[j,i] > coef) & (corr_matrix.iloc[j,i] != 1):
                print(f'There is a strong positive correlation between features {corr_matrix.iloc[:,i].name} and {corr_matrix.iloc[j,:].name}. The value of the correlation coeficient is {round(corr_matrix.iloc[j,i],3)}.')

In [4]:
find_pos_correlation(corr_matrix, 0.75)


NameError: name 'corr_matrix' is not defined

The recorded run of this cell raised a NameError because 'corr_matrix' was not defined in that session. Reading the correlation matrix above, just one pair of features is strongly positively correlated: 'energy' and 'loudness' (coefficient value 0.792).

### Which features are strongly negatively correlated?

The rule of thumb indicates that a strong negative correlation between two variables corresponds to a correlation coefficient lower than -0.75.

In [54]:
def find_negative_correlation(data):
    for i in range(len(corr_matrix.iloc[0,:])):
        for j in range(len(corr_matrix.iloc[:,0])):
            if (corr_matrix.iloc[j,i] < data) & (corr_matrix.iloc[j,i] != 1):
                print(f'There is a strong negative correlation between features {corr_matrix.iloc[:,i].name} and {corr_matrix.iloc[j,:].name}. The value of the correlation coeficient is {round(corr_matrix.iloc[j,i], 3)}.')
                      

In [55]:
find_negative_correlation(-0.75)


The function prints nothing, so no features in the Top 50 tracks dataframe are strongly negatively correlated according to the 'lower than -0.75' criterion.
However, the function 'find_negative_correlation' makes it easy to change the threshold that defines a strong relationship.

In [56]:
find_negative_correlation(-0.5)

There is a strong negative correlation between features energy and acousticness. The value of the correlation coeficient is -0.682.
There is a strong negative correlation between features loudness and instrumentalness. The value of the correlation coeficient is -0.554.
There is a strong negative correlation between features acousticness and energy. The value of the correlation coeficient is -0.682.
There is a strong negative correlation between features instrumentalness and loudness. The value of the correlation coeficient is -0.554.


If we relax the criterion to 'lower than -0.5', we find negative relationships between 'energy' and 'acousticness' (coefficient value -0.682) and between 'loudness' and 'instrumentalness' (coefficient value -0.554).
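Each relationship is reported twice because the correlation matrix is symmetric and the loops visit both halves of it. One way to report every pair once is to iterate over unordered pairs of column names. The sketch below is an alternative formulation; the helper `correlated_pairs` is hypothetical, and the small matrix reuses three coefficients from the correlation matrix above.

```python
from itertools import combinations

import pandas as pd

def correlated_pairs(corr, low, high):
    """Return (feature_a, feature_b, r) for each unordered pair with low < r < high."""
    pairs = []
    for a, b in combinations(corr.columns, 2):
        r = corr.loc[a, b]
        if low < r < high:
            pairs.append((a, b, round(float(r), 3)))
    return pairs

# Three features with coefficients copied from the correlation matrix
names = ["energy", "loudness", "acousticness"]
corr = pd.DataFrame(
    [[1.0, 0.79164, -0.682479],
     [0.79164, 1.0, -0.498695],
     [-0.682479, -0.498695, 1.0]],
    index=names,
    columns=names,
)

print(correlated_pairs(corr, 0.75, 1.0))   # strong positive correlations
print(correlated_pairs(corr, -1.0, -0.5))  # negative correlations below -0.5
```

Because `combinations` never pairs a feature with itself, the separate check for coefficients equal to 1 is no longer needed.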

### Which features are not correlated?

We set an arbitrary criterion of non-correlation: a coefficient higher than -0.25 and lower than 0.25. According to this criterion, 34 pairs of features are not correlated (see below).

In [57]:
def find_no_correlation():
    count = 0
    for i in range(len(corr_matrix.iloc[0,:])):
        for j in range(len(corr_matrix.iloc[:,0])):
            if (corr_matrix.iloc[j,i] > - 0.25) & (corr_matrix.iloc[j,i] < 0.25) & (corr_matrix.iloc[j,i] != 1):
                count+=1
                print(f'There is no correlation between features {corr_matrix.iloc[:,i].name} and {corr_matrix.iloc[j,:].name}. The value of the correlation coeficient is {round(corr_matrix.iloc[j,i],3)}.')
    print(f'\nThe total number of relationships where there are no correlations between features is {int(count/2)}.')

In [58]:
find_no_correlation()

There is no correlation between features energy and danceability. The value of the correlation coeficient is 0.153.
There is no correlation between features energy and speechiness. The value of the correlation coeficient is 0.074.
There is no correlation between features energy and liveness. The value of the correlation coeficient is 0.069.
There is no correlation between features energy and tempo. The value of the correlation coeficient is 0.075.
There is no correlation between features energy and duration_ms. The value of the correlation coeficient is 0.082.
There is no correlation between features danceability and energy. The value of the correlation coeficient is 0.153.
There is no correlation between features danceability and loudness. The value of the correlation coeficient is 0.167.
There is no correlation between features danceability and speechiness. The value of the correlation coeficient is 0.226.
There is no correlation between features danceability and instrumentalness. Th

## Comparing variables between groups

The following analysis compares features between genres of music. To answer these questions, we applied the Pandas groupby function and calculated the means of the features in different genre groups.

### How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

The table below indicates that the Hip-Hop/Rap (mean value 0.766) and Dance/Electronic (mean value 0.755) genres have higher danceability scores than the Pop (mean value 0.678) and Alternative/Indie (mean value 0.662) genres.

In [100]:
danceability_genre = pd.DataFrame(spotify.loc[:,("danceability","genre")].groupby("genre").describe())
print(f'{danceability_genre.loc[["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"],[("danceability","mean"),("danceability","std")]]}')

                  danceability          
                          mean       std
genre                                   
Pop                   0.677571  0.109853
Hip-Hop/Rap           0.765538  0.085470
Dance/Electronic      0.755000  0.094744
Alternative/Indie     0.661750  0.211107
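The same kind of table can be produced by aggregating only the statistics of interest instead of calling `describe` and then selecting from its output. This is a minimal sketch on hypothetical data; the genre labels and values are illustrative.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Nu-disco"],
    "danceability": [0.6, 0.8, 0.7, 0.9, 0.5],
})

selected = ["Pop", "Hip-Hop/Rap"]

# Mean and standard deviation per genre, restricted to the selected genres
summary = (
    toy.groupby("genre")["danceability"]
    .agg(["mean", "std"])
    .loc[selected]
)

print(summary)
```

Selecting the column before aggregating gives flat column names ('mean', 'std') rather than the two-level header produced by `describe`.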


In [71]:
spotify.loc[:,("loudness","genre")].groupby("genre").describe()

| genre | loudness count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Electro-pop | 2.0 | -8.8985 | 2.922472 | -10.965 | -9.93175 | -8.8985 | -7.86525 | -6.832 |
| Alternative/Indie | 4.0 | -5.421 | 0.774502 | -6.401 | -5.8595 | -5.2685 | -4.83 | -4.746 |
| Alternative/reggaeton/experimental | 1.0 | -4.074 | NaN | -4.074 | -4.074 | -4.074 | -4.074 | -4.074 |
| Chamber pop | 1.0 | -10.109 | NaN | -10.109 | -10.109 | -10.109 | -10.109 | -10.109 |
| Dance-pop/Disco | 1.0 | -3.434 | NaN | -3.434 | -3.434 | -3.434 | -3.434 | -3.434 |
| Dance/Electronic | 5.0 | -5.338 | 1.479047 | -7.567 | -5.652 | -5.457 | -4.258 | -3.756 |
| Disco-pop | 1.0 | -4.41 | NaN | -4.41 | -4.41 | -4.41 | -4.41 | -4.41 |
| Dreampop/Hip-Hop/R&B | 1.0 | -4.368 | NaN | -4.368 | -4.368 | -4.368 | -4.368 | -4.368 |
| Hip-Hop/Rap | 13.0 | -6.917846 | 1.891808 | -8.82 | -8.52 | -7.648 | -5.616 | -3.37 |
| Hip-Hop/Trap | 1.0 | -7.509 | NaN | -7.509 | -7.509 | -7.509 | -7.509 | -7.509 |


### How does the loudness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

The table below indicates that the Alternative/Indie (mean value -5.421) and Dance/Electronic (mean value -5.338) genres have higher loudness scores than the Pop (mean value -6.460) and Hip-Hop/Rap (mean value -6.918) genres.

In [101]:
loudness_genre = pd.DataFrame(spotify.loc[:,("loudness","genre")].groupby("genre").describe())
print(f'{loudness_genre.loc[["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"],[("loudness","mean"),("loudness","std")]]}')

                   loudness          
                       mean       std
genre                                
Pop               -6.460357  3.014281
Hip-Hop/Rap       -6.917846  1.891808
Dance/Electronic  -5.338000  1.479047
Alternative/Indie -5.421000  0.774502


### How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

The table below indicates that the Alternative/Indie genre has the highest acousticness score (mean value 0.583) of the four genres, followed by the Pop genre (mean value 0.324). The Hip-Hop/Rap (mean value 0.189) and Dance/Electronic (mean value 0.099) genres are the least acoustic.

In [104]:
acousticness_genre = pd.DataFrame(spotify.loc[:,("acousticness","genre")].groupby("genre").describe())
print(f'{acousticness_genre.loc[["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"],[("acousticness","mean"),("acousticness","std")]]}')


                  acousticness          
                          mean       std
genre                                   
Pop                   0.323843  0.318142
Hip-Hop/Rap           0.188741  0.186396
Dance/Electronic      0.099440  0.095828
Alternative/Indie     0.583500  0.204086


## Conclusions


The basic exploratory data analysis of the Top 50 Spotify tracks revealed various characteristics of the most popular Spotify tracks: the most popular artists are Billie Eilish, Dua Lipa, and Travis Scott; the most popular genre is Pop; energy is strongly positively correlated with loudness; Alternative/Indie music is the most acoustic; Hip-Hop/Rap and Dance/Electronic are the most danceable; and Dance/Electronic and Alternative/Indie are the loudest of the four genres compared. The analysis could be improved by applying tests of statistical significance, such as calculating p-values of the correlation coefficients and conducting non-parametric Kruskal-Wallis tests to compare groups of observations. Such an analysis would require the Python library Scipy. Data visualisations with the Matplotlib and Seaborn libraries could also be added.

## Resources:

1. Stojiljković M. NumPy, SciPy, and Pandas: Correlation With Python, Real Python, https://realpython.com/numpy-scipy-pandas-correlation-python/
2. YAĞCI H.E. Detecting and Handling Outliers with Pandas, Medium, January 15 2021, https://hersanyagci.medium.com/detecting-and-handling-outliers-with-pandas-7adbfcd5cad8
3. What is Considered to Be a “Strong” Correlation?, January 22 2020, https://www.statology.org/what-is-a-strong-correlation/ 