# Spotify data analysis project

## Dataset import

This cell reads the given dataset and prints a short summary of it.

In [26]:
import pandas as pd
import numpy as np

file = r"D:\IT_projects\Turing_Colledge\Modul1\Sprint2\project\spotifytoptracks.csv"
raw_df = pd.read_csv(file, index_col=0)
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, 0 to 49
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist            50 non-null     object 
 1   album             50 non-null     object 
 2   track_name        50 non-null     object 
 3   track_id          50 non-null     object 
 4   energy            50 non-null     float64
 5   danceability      50 non-null     float64
 6   key               50 non-null     int64  
 7   loudness          50 non-null     float64
 8   acousticness      50 non-null     float64
 9   speechiness       50 non-null     float64
 10  instrumentalness  50 non-null     float64
 11  liveness          50 non-null     float64
 12  valence           50 non-null     float64
 13  tempo             50 non-null     float64
 14  duration_ms       50 non-null     int64  
 15  genre             50 non-null     object 
dtypes: float64(9), int64(2), object(5)
memory usage: 6.

The dataset info shows 50 observations of 16 features,  
stored in 3 different dtypes (float64, int64 and object).
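
The dtype count quoted above can also be derived instead of being read off the printout. This is a minimal sketch on a hypothetical four-column frame (`toy_df` and its values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical frame mirroring a few of the dataset's column types.
toy_df = pd.DataFrame(
    {
        "artist": ["A", "B", "C"],
        "energy": [0.73, 0.59, 0.58],
        "key": [1, 6, 10],
        "duration_ms": [200040, 209755, 196653],
    }
)

# Count how many columns share each dtype.
dtype_counts = toy_df.dtypes.value_counts()
print(dtype_counts)
print(
    f"{toy_df.shape[0]} observations, {toy_df.shape[1]} features, "
    f"{dtype_counts.size} distinct dtypes"
)
```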

In [27]:
raw_df.head(3)

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap


The .head() function was used to check how the dataframe  
looks and whether there are any anomalies to worry about.

## Data cleaning

The dataframe was cleaned by:  
- stripping whitespace from the column labels;  
- changing the dtype of the "key" column to object (because its  
values are labels rather than true numeric quantities);  
- checking for duplicates;  
- checking for NaN values;  
- checking for empty cells.

In [28]:
raw_df.columns = raw_df.columns.str.strip()
raw_df['key'] = raw_df['key'].astype('object')
duplicates = raw_df.duplicated().any()
nans = raw_df.isna().any().any()
empties = (raw_df == "").any().any()
# Report each check on its own so that one finding does not hide the others.
if duplicates:
    print("There are duplicated rows in dataframe")
if nans:
    print("There are NaN values in dataframe")
if empties:
    print("There are empty cells in dataframe")
if not (duplicates or nans or empties):
    print("There are no duplicates, NaN values or empty cells in dataframe\n")

numeric_columns = [
    "energy",
    "danceability",
    "loudness",
    "acousticness",
    "speechiness",
    "instrumentalness",
    "liveness",
    "valence",
    "tempo",
    "duration_ms",
]
numeric_df = raw_df[numeric_columns]
numeric_df.head(3)

There are no duplicates, NaN values or empty cells in dataframe



Unnamed: 0,energy,danceability,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
0,0.73,0.514,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040
1,0.593,0.825,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755
2,0.586,0.896,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653


The script found no duplicates, NaN values or empty cells in the dataframe.  
A dataframe containing only the numeric columns was created for the following analysis.
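
The effect of the first two cleaning steps can be seen on a small example. The sketch below uses a hypothetical frame (`toy_df`, with deliberately padded column labels) to show that stripping the labels and casting "key" to object removes it from the numeric selection:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: note the stray spaces in the column labels.
toy_df = pd.DataFrame({" key ": [1, 6, 10], "tempo ": [171.0, 98.1, 117.0]})

# Strip whitespace from the labels, then treat "key" as a label, not a quantity.
toy_df.columns = toy_df.columns.str.strip()
toy_df["key"] = toy_df["key"].astype("object")

# Object columns are excluded when selecting numeric dtypes.
numeric_cols = list(toy_df.select_dtypes(include=np.number).columns)
print(list(toy_df.columns))
print(numeric_cols)
```

This is why "key" is reported among the categorical features later on, even though it is stored as integers in the CSV.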

### Outliers for numeric value columns

To look for outliers, the min and max rows were selected from the table  
returned by .describe(), which shows the range of every numeric column.

In [29]:
stats = numeric_df.describe()
min_out = stats.loc[["min", "max"]]
min_out

Unnamed: 0,energy,danceability,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
min,0.225,0.351,-14.454,0.00146,0.029,0.0,0.0574,0.0605,75.801,140526.0
max,0.855,0.935,-3.28,0.934,0.487,0.657,0.792,0.925,180.067,312820.0
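
The min/max table shows the range of each column but does not by itself flag individual values. One common rule of thumb, which is not the method used in this notebook, is the 1.5 × IQR fence. A minimal sketch on hypothetical values (the series `toy` is made up for illustration):

```python
import pandas as pd

# Hypothetical durations in seconds; 312.0 is deliberately far from the rest.
toy = pd.Series([140.0, 150.0, 160.0, 170.0, 180.0, 312.0], name="duration_s")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = toy[(toy < lower) | (toy > upper)]
print(outliers)
```

The same steps could be applied per column of `numeric_df`; whether a flagged value is a real anomaly still needs a manual look.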


## Exploratory analysis

Answers to the questions that required no more than one line of code  
were printed from a single code cell.

In [30]:
shp = raw_df.shape
catf = raw_df.dtypes
catf_list = list(raw_df.select_dtypes(include=["object"]).columns)
numf_list = list(raw_df.select_dtypes(include=np.number).columns)
the_most = raw_df["artist"][0]
total_art = raw_df["artist"].unique()
total_albs = raw_df["album"].unique()
# idxmax/idxmin return index labels, which is what .loc expects.
track_longest = raw_df.loc[raw_df["duration_ms"].idxmax(), "track_name"]
track_shortest = raw_df.loc[raw_df["duration_ms"].idxmin(), "track_name"]


print(f"There are {shp[0]} observations and {shp[1]} features")
print("\nThese are categorical features:", ", ".join(catf_list))
print("\nThese are numeric features:", ", ".join(numf_list))
print(f"\n{the_most} is the most popular artist")
print(f"\nIn total there are {len(total_art)} artists that reached TOP 50")
print(f"\nIn total there are {len(total_albs)} albums that reached TOP 50")
print(f"\nThe longest track is named: {track_longest}")
print(f"\nThe shortest track is named: {track_shortest}")

There are 50 observations and 16 features

These are categorical features: artist, album, track_name, track_id, key, genre

These are numeric features: energy, danceability, loudness, acousticness, speechiness, instrumentalness, liveness, valence, tempo, duration_ms

The Weeknd is the most popular artist

In total there are 40 artists that reached TOP 50

In total there are 45 albums that reached TOP 50

The longest track is named: SICKO MODE

The shortest track is named: Mood (feat. iann dior)
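
Looking up the longest or shortest track depends on whether a row position or an index label is used. The two coincide only while the index is 0, 1, 2, and so on. A small sketch with a hypothetical frame whose labels differ from the positions:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index labels do not match row positions.
toy_df = pd.DataFrame(
    {"track_name": ["short", "long", "medium"], "duration_ms": [100, 300, 200]},
    index=[10, 20, 30],
)

# np.argmax returns a position, so it pairs with .iloc ...
pos = int(np.argmax(toy_df["duration_ms"].to_numpy()))
by_position = toy_df["track_name"].iloc[pos]

# ... while Series.idxmax returns an index label, so it pairs with .loc.
label = toy_df["duration_ms"].idxmax()
by_label = toy_df.loc[label, "track_name"]

print(pos, label, by_position, by_label)
```

Mixing the two (a position passed to `.loc`) would raise a `KeyError` here, because there is no row labelled 1.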


### Artists with more than 1 popular track

First, rows whose "artist" value occurs more than once were selected, because  
an artist who appears several times has several tracks in the list.  
Second, these rows were grouped by the "artist" column to count  
how many tracks each artist has.

In [31]:
dupls = raw_df[raw_df.duplicated(subset=["artist"], keep=False)]
dupls_no = dupls.groupby("artist").size().reset_index(name="no_of_tracks")

print(
    f"These are {len(dupls_no)} artists that have more than 1 popular track:\n",
    dupls_no,
)

These are 7 artists that have more than 1 popular track:
           artist  no_of_tracks
0  Billie Eilish             3
1       Dua Lipa             3
2   Harry Styles             2
3  Justin Bieber             2
4  Lewis Capaldi             2
5    Post Malone             2
6   Travis Scott             3


Seven artists have 2 or 3 songs in the Spotify TOP 50.
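
The same count can be obtained in a single pass with `value_counts()`. This is an alternative sketch on hypothetical artist names, not the approach used in the cell above:

```python
import pandas as pd

# Hypothetical artist column: "A" appears three times, "B" twice, "C" once.
toy_df = pd.DataFrame({"artist": ["A", "B", "A", "C", "B", "A"]})

# Count tracks per artist, then keep artists with more than one track.
counts = toy_df["artist"].value_counts()
multi = counts[counts > 1]
print(multi)
```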

### Albums with more than 1 popular track

The idea is the same, but slightly different code was used for variety.  
The raw_df dataframe was grouped by the "album" column  
and then filtered to keep only the albums where no_of_tracks is higher than 1.

In [32]:
albs = raw_df.groupby("album").size().reset_index(name="no_of_tracks")
albs_no = albs[albs["no_of_tracks"] > 1].reset_index(drop=True)

print(
    f"These are {len(albs_no)} albums that have more than 1 popular track: \n", albs_no
)

These are 4 albums that have more than 1 popular track: 
                   album  no_of_tracks
0               Changes             2
1             Fine Line             2
2      Future Nostalgia             3
3  Hollywood's Bleeding             2


Four albums have 2 or 3 tracks in the Spotify TOP 50.

### Danceability observation

To get the tracks with danceability above 0.7 and below 0.4, the  
"track_name" column was taken and the .where() function was used to keep  
only the rows where "danceability" is above 0.7 or below 0.4,  
respectively. The resulting NaN values were then dropped and  
.reset_index() was applied.

In [33]:
danceab1 = (
    raw_df["track_name"]
    .where(raw_df["danceability"] > 0.7)
    .dropna()
    .reset_index(drop=True)
)
danceab2 = (
    raw_df["track_name"]
    .where(raw_df["danceability"] < 0.4)
    .dropna()
    .reset_index(drop=True)
)

print(f"These are {len(danceab1)} tracks where danceability is > 0.7:\n", danceab1)
print(f"These are {len(danceab2)} tracks where danceability is < 0.4:\n", danceab2)

These are 32 tracks where danceability is > 0.7:
 0                                      Dance Monkey
1                                           The Box
2                             Roses - Imanbek Remix
3                                   Don't Start Now
4                      ROCKSTAR (feat. Roddy Ricch)
5                  death bed (coffee for your head)
6                                           Falling
7                                              Tusa
8                                   Blueberry Faygo
9                          Intentions (feat. Quavo)
10                                     Toosie Slide
11                                           Say So
12                                         Memories
13                       Life Is Good (feat. Drake)
14                 Savage Love (Laxed - Siren Beat)
15                                      Breaking Me
16                              everything i wanted
17                                         Señorita
18            

There are 32 tracks with danceability > 0.7  
and only 1 track with danceability < 0.4.
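
An equivalent selection can be written with a boolean mask passed to `.loc`, which avoids creating NaN placeholders and dropping them afterwards. A minimal sketch on a hypothetical four-track frame:

```python
import pandas as pd

# Hypothetical tracks and danceability scores.
toy_df = pd.DataFrame(
    {
        "track_name": ["t1", "t2", "t3", "t4"],
        "danceability": [0.82, 0.35, 0.55, 0.71],
    }
)

# Select the track names directly with a boolean mask.
high = toy_df.loc[toy_df["danceability"] > 0.7, "track_name"].reset_index(drop=True)
low = toy_df.loc[toy_df["danceability"] < 0.4, "track_name"].reset_index(drop=True)

print(high.tolist())
print(low.tolist())
```

One difference from `.where().dropna()`: the mask keeps a row even if its track name is missing, whereas `dropna()` would silently remove it.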

### Loudness observation

The loudness observation follows the same approach as danceability;  
only the column and the threshold values in the .where() function differ.

In [34]:
loudn1 = (
    raw_df["track_name"].where(raw_df["loudness"] > -5).dropna().reset_index(drop=True)
)
loudn2 = (
    raw_df["track_name"].where(raw_df["loudness"] < -8).dropna().reset_index(drop=True)
)

print(f"These are {len(loudn1)} tracks where loudness is > -5:\n", loudn1)
print(f"These are {len(loudn2)} tracks where loudness is < -8:\n", loudn2)

These are 19 tracks where loudness is > -5:
 0                                   Don't Start Now
1                                  Watermelon Sugar
2                                              Tusa
3                                           Circles
4                                     Before You Go
5                                            Say So
6                                         Adore You
7                            Mood (feat. iann dior)
8                                    Break My Heart
9                                          Dynamite
10                 Supalonely (feat. Gus Dapperton)
11                  Rain On Me (with Ariana Grande)
12    Sunflower - Spider-Man: Into the Spider-Verse
13                                            Hawái
14                                          Ride It
15                                       goosebumps
16                                          Safaera
17                                         Physical
18                 

The tracks are listed in the output above; 19 tracks have loudness above -5.

### Genres observation

To get the most popular genre, the occurrences of each value in the "genre"  
column were counted with .value_counts() and .idxmax() returned the most frequent one.  
The least popular genres could have been found the same way,  
but a different method was chosen: filtering the counts for genres that appear exactly once.  
The total number of genres was obtained by taking the unique values  
of the "genre" column and counting them with the len() function.

In [35]:
most_genre = raw_df["genre"].value_counts()
least_genre = most_genre[most_genre == 1].index
no_genre = len(raw_df["genre"].unique())

print(
    f"The most popular genre with {most_genre.iloc[0]} tracks is: {most_genre.idxmax()}"
)
print(f"\nGenres with only 1 song in TOP 50 are:\n", ", \n".join(least_genre))
print(f"\nIn total there are {no_genre} genres represented in TOP 50")

The most popular genre with 14 tracks is: Pop

Genres with only 1 song in TOP 50 are:
 Nu-disco, 
R&B/Hip-Hop alternative, 
Pop/Soft Rock, 
Pop rap, 
Hip-Hop/Trap, 
Dance-pop/Disco, 
Disco-pop, 
Dreampop/Hip-Hop/R&B, 
Alternative/reggaeton/experimental, 
Chamber pop

In total there are 16 genres represented in TOP 50


Pop is the most popular genre with 14 tracks, 10 genres have only one track, and 16 genres are represented in total.
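
`.idxmax()` returns a single label even when several genres share the top count. If ties matter, the counts can be compared against their maximum instead. A sketch on hypothetical genre labels:

```python
import pandas as pd

# Hypothetical genre column with a tie for first place.
toy = pd.Series(["Pop", "Rap", "Pop", "Rock", "Rap", "Jazz"], name="genre")

counts = toy.value_counts()

# All genres sharing the highest count, not just the first one.
top = counts[counts == counts.max()].index.tolist()
# Genres that appear exactly once.
singles = counts[counts == 1].index.tolist()
# Number of distinct genres.
n_genres = toy.nunique()

print(top, singles, n_genres)
```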

### Features correlation

Feature correlations were computed with the .corr() function using Pearson's method.  
Values above 0.5 were treated as highly positive correlation  
and values below -0.5 as highly negative correlation. Pairs whose  
absolute correlation is below 0.1 were treated as not correlated.

In [36]:
corr = numeric_df.corr(method="pearson")
high_corr = (
    corr[(corr > 0.5) & (corr < 1)].dropna(axis=0, how="all").dropna(axis=1, how="all")
)
low_corr = corr[corr < -0.5].dropna(axis=0, how="all").dropna(axis=1, how="all")
not_corr = (
    corr[(np.abs(corr) < 0.1) & (corr < 1)]
    .dropna(axis=0, how="all")
    .dropna(axis=1, how="all")
)

print("These are highly positively correlated features: \n", high_corr)
print("\nThese are highly negatively correlated features: \n", low_corr)
print("\nThese are not correlated features: \n", not_corr)

These are highly positively correlated features: 
            energy  loudness
energy        NaN   0.79164
loudness  0.79164       NaN

These are highly negatively correlated features: 
                     energy  loudness  acousticness  instrumentalness
energy                 NaN       NaN     -0.682479               NaN
loudness               NaN       NaN           NaN         -0.553735
acousticness     -0.682479       NaN           NaN               NaN
instrumentalness       NaN -0.553735           NaN               NaN

These are not correlated features: 
                     energy  danceability  loudness  acousticness  speechiness  \
energy                 NaN           NaN       NaN           NaN     0.074267   
danceability           NaN           NaN       NaN           NaN          NaN   
loudness               NaN           NaN       NaN           NaN    -0.021693   
acousticness           NaN           NaN       NaN           NaN          NaN   
speechiness       0.07426

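The correlation tables are provided above. The only highly positively correlated pair is  
energy-loudness, while the highly negatively correlated pairs are acousticness-energy and  
instrumentalness-loudness. Every other pair falls below the 0.5 threshold in absolute value,  
and only the pairs in the third table (absolute value below 0.1) are treated as not correlated.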
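
Because a correlation matrix is symmetric, every pair shows up twice in the tables above. A compact alternative is to keep only the cells above the diagonal and list each pair once. The sketch below uses a hypothetical four-column frame (`toy_df`; columns `a` to `d` are made up) with the same thresholds:

```python
import numpy as np
import pandas as pd

# Hypothetical data: b rises with a, c falls as a rises, d is unrelated.
toy_df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [1.1, 2.0, 3.2, 3.9, 5.1],
        "c": [5.0, 4.1, 2.9, 2.0, 1.2],
        "d": [2.0, 1.0, 3.0, 1.0, 2.0],
    }
)

corr = toy_df.corr(method="pearson")

# Keep only the cells above the diagonal so each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

strong_pos = pairs[pairs > 0.5]
strong_neg = pairs[pairs < -0.5]
uncorrelated = pairs[pairs.abs() < 0.1]

print("positive:", list(strong_pos.index))
print("negative:", list(strong_neg.index))
print("uncorrelated:", list(uncorrelated.index))
```

The result is a Series indexed by (feature, feature) tuples, which is easier to scan than a sparse matrix full of NaN cells.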

### Danceability comparison between genres

In [37]:
genres = ["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"]
danc_df = raw_df.groupby("genre").agg(
    {"danceability": "mean", "loudness": "mean", "acousticness": "mean"}
)
print(danc_df.loc[genres])

summary = """- Average danceability is the highest for hip-hop music,
with dance/electronic music close behind. The lowest
score belongs to alternative music, about 0.1 below
hip-hop.
   - Average loudness is the largest in absolute value
for hip-hop music (-6.92). The smallest absolute value belongs
to dance/electronic music (-5.34), with alternative music
close behind (-5.42).
   - Average acousticness is the highest for alternative music
and the lowest for dance/electronic music.

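In short, dance/electronic music combines the loudest average
(loudness closest to zero) with high danceability and the lowest
acousticness. Hip-hop, however, is the most danceable genre while
having the most negative loudness, so these four averages do not show
a consistent link between loudness and danceability. """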

                   danceability  loudness  acousticness
genre                                                  
Pop                    0.677571 -6.460357      0.323843
Hip-Hop/Rap            0.765538 -6.917846      0.188741
Dance/Electronic       0.755000 -5.338000      0.099440
Alternative/Indie      0.661750 -5.421000      0.583500


- Average danceability is the highest for hip-hop music,  
with dance/electronic music close behind. The lowest  
score belongs to alternative music, about 0.1 below  
hip-hop.  
- Average loudness is the largest in absolute value  
for hip-hop music (-6.92). The smallest absolute value belongs  
to dance/electronic music (-5.34), with alternative music  
close behind (-5.42).  
- Average acousticness is the highest for alternative music  
and the lowest for dance/electronic music.  
  
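In short, dance/electronic music combines the loudest average  
(loudness closest to zero) with high danceability and the lowest  
acousticness. Hip-hop, however, is the most danceable genre while  
having the most negative loudness, so these four averages do not show  
a consistent link between loudness and danceability.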
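
The ordering of the genre averages can be checked programmatically rather than by eye. The sketch below re-enters the four rows of the table printed above (the frame `means` is rebuilt by hand here, not taken from `danc_df`) and asks for the highest and lowest genre per feature. It assumes loudness is measured in decibels, where values closer to zero are louder:

```python
import pandas as pd

# Genre means copied from the table printed above.
means = pd.DataFrame(
    {
        "danceability": [0.677571, 0.765538, 0.755000, 0.661750],
        "loudness": [-6.460357, -6.917846, -5.338000, -5.421000],
        "acousticness": [0.323843, 0.188741, 0.099440, 0.583500],
    },
    index=["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"],
)

# For each feature, the genre with the highest and the lowest mean.
ranking = pd.DataFrame({"highest": means.idxmax(), "lowest": means.idxmin()})
print(ranking)
```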