In [1]:
import pandas as pd
import numpy as np

# Context
# Imagine that you are a data analyst working for Spotify. Your team is responsible for content analysis, and in this quarter, you have decided to analyze Spotify's top hits and quantify what makes a hit song. Your team's product manager has many ideas and prepared a list of questions (requirements) that she wants you to get answers to. After reviewing the list of over 20 questions, you are not in a good mood - you'll have to work a couple of days to get all the answers. Luckily, a few days ago, an experienced data scientist working in your team queried the top 50 tracks for her machine learning project and agreed to share the data with you. This is a significant help - your SQL skills are not too sharp yet, and you don't yet know where to find all the relevant tables in your data warehouse. With this dataset, you are confident that you'll be able to answer all your PM's questions and maybe even look into some things that she didn't ask for.

# Provide clear explanations in your notebook. Your explanations should inform the reader what you are trying to achieve, what results you got, and what these results mean

#### Download the data from Spotify Top 50 Tracks of 2020 dataset (for now the file cannot simply be shared via a link, because Kaggle requires a login)
#### Load the data using Pandas.

## First we load the Spotify top tracks CSV file with pandas read_csv()

In [2]:
df = pd.read_csv("spotifytoptracks.csv")

## and check how the data looks by displaying df

In [3]:
df

Unnamed: 0.1,Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco
5,5,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.69,0.746,11,-7.956,0.247,0.164,0.0,0.101,0.497,89.977,181733,Hip-Hop/Rap
6,6,Harry Styles,Fine Line,Watermelon Sugar,6UelLqGlWMcVH1E5c4H7lY,0.816,0.548,0,-4.209,0.122,0.0465,0.0,0.335,0.557,95.39,174000,Pop
7,7,Powfu,death bed (coffee for your head),death bed (coffee for your head),7eJMfftS33KTjuF7lTsMCx,0.431,0.726,8,-8.765,0.731,0.135,0.0,0.696,0.348,144.026,173333,Hip-Hop/Rap
8,8,Trevor Daniel,Nicotine,Falling,2rRJrJEo19S2J82BDsQ3F7,0.43,0.784,10,-8.756,0.123,0.0364,0.0,0.0887,0.236,127.087,159382,R&B/Hip-Hop alternative
9,9,Lewis Capaldi,Divinely Uninspired To A Hellish Extent,Someone You Loved,7qEHsqek33rTcFNT9PFqLf,0.405,0.501,1,-5.679,0.751,0.0319,0.0,0.105,0.446,109.891,182161,Alternative/Indie


## We have a column named "Unnamed: 0" which may be meant as the song ranking, but it starts from 0. It is clearer to remove "Unnamed: 0" and turn the dataframe index into the ranking instead. So we use pandas drop():
## "Unnamed: 0" - the column we want to remove
## axis="columns" - remove from columns, not rows
## inplace=True - apply the removal to the original dataframe instead of creating a new one.

In [4]:
df.drop("Unnamed: 0", axis="columns", inplace=True)
df

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco
5,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.69,0.746,11,-7.956,0.247,0.164,0.0,0.101,0.497,89.977,181733,Hip-Hop/Rap
6,Harry Styles,Fine Line,Watermelon Sugar,6UelLqGlWMcVH1E5c4H7lY,0.816,0.548,0,-4.209,0.122,0.0465,0.0,0.335,0.557,95.39,174000,Pop
7,Powfu,death bed (coffee for your head),death bed (coffee for your head),7eJMfftS33KTjuF7lTsMCx,0.431,0.726,8,-8.765,0.731,0.135,0.0,0.696,0.348,144.026,173333,Hip-Hop/Rap
8,Trevor Daniel,Nicotine,Falling,2rRJrJEo19S2J82BDsQ3F7,0.43,0.784,10,-8.756,0.123,0.0364,0.0,0.0887,0.236,127.087,159382,R&B/Hip-Hop alternative
9,Lewis Capaldi,Divinely Uninspired To A Hellish Extent,Someone You Loved,7qEHsqek33rTcFNT9PFqLf,0.405,0.501,1,-5.679,0.751,0.0319,0.0,0.105,0.446,109.891,182161,Alternative/Indie
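
## A column called "Unnamed: 0" typically appears when a dataframe's index was written to the CSV file as an unnamed first column. Assuming that is what happened here (the source does not confirm it), a minimal sketch with a made-up two-row frame shows how read_csv's index_col parameter avoids the extra column at load time:

```python
import io

import pandas as pd

# Hypothetical toy data standing in for the real file
original = pd.DataFrame({"artist": ["A", "B"], "energy": [0.7, 0.5]})

# to_csv() writes the index as a first column with an empty header
csv_text = original.to_csv()

# Default read: the empty header is turned into "Unnamed: 0"
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

## Either approach (dropping the column afterwards or using index_col) leaves the same feature columns.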


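## Now we want the index to act as the ranking: adding 1 to the index makes it run from 1 to 50 instead of 0 to 49. Note that df.index.rename('Rank') returns a new index rather than changing the existing one, so the index stays unnamed in the output; assigning df.index.name = 'Rank' would store the name.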

In [5]:
df.index.rename('Rank')
df.index += 1
df.index

RangeIndex(start=1, stop=51, step=1)
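
## The difference between calling rename() on an index and assigning to its name can be sketched on a made-up dataframe (the `toy` frame and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data, not the Spotify dataset
toy = pd.DataFrame({"track": ["a", "b", "c"]})

# rename() returns a new Index; the result is discarded here,
# so the dataframe's own index keeps no name
toy.index.rename("Rank")
name_after_rename = toy.index.name

# Shift the 0-based positions to 1-based ranks
toy.index += 1

# Assigning to .name changes the existing index
toy.index.name = "Rank"

print(name_after_rename)
print(toy.index)
```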

In [6]:
df

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
1,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
2,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
3,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
4,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
5,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco
6,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.69,0.746,11,-7.956,0.247,0.164,0.0,0.101,0.497,89.977,181733,Hip-Hop/Rap
7,Harry Styles,Fine Line,Watermelon Sugar,6UelLqGlWMcVH1E5c4H7lY,0.816,0.548,0,-4.209,0.122,0.0465,0.0,0.335,0.557,95.39,174000,Pop
8,Powfu,death bed (coffee for your head),death bed (coffee for your head),7eJMfftS33KTjuF7lTsMCx,0.431,0.726,8,-8.765,0.731,0.135,0.0,0.696,0.348,144.026,173333,Hip-Hop/Rap
9,Trevor Daniel,Nicotine,Falling,2rRJrJEo19S2J82BDsQ3F7,0.43,0.784,10,-8.756,0.123,0.0364,0.0,0.0887,0.236,127.087,159382,R&B/Hip-Hop alternative
10,Lewis Capaldi,Divinely Uninspired To A Hellish Extent,Someone You Loved,7qEHsqek33rTcFNT9PFqLf,0.405,0.501,1,-5.679,0.751,0.0319,0.0,0.105,0.446,109.891,182161,Alternative/Indie


## Lower-case feature names don't look "right" to me, so we capitalize them with columns.str.capitalize()

In [7]:
df.columns = df.columns.str.capitalize()
df

Unnamed: 0,Artist,Album,Track_name,Track_id,Energy,Danceability,Key,Loudness,Acousticness,Speechiness,Instrumentalness,Liveness,Valence,Tempo,Duration_ms,Genre
1,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
2,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
3,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
4,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
5,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco
6,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.69,0.746,11,-7.956,0.247,0.164,0.0,0.101,0.497,89.977,181733,Hip-Hop/Rap
7,Harry Styles,Fine Line,Watermelon Sugar,6UelLqGlWMcVH1E5c4H7lY,0.816,0.548,0,-4.209,0.122,0.0465,0.0,0.335,0.557,95.39,174000,Pop
8,Powfu,death bed (coffee for your head),death bed (coffee for your head),7eJMfftS33KTjuF7lTsMCx,0.431,0.726,8,-8.765,0.731,0.135,0.0,0.696,0.348,144.026,173333,Hip-Hop/Rap
9,Trevor Daniel,Nicotine,Falling,2rRJrJEo19S2J82BDsQ3F7,0.43,0.784,10,-8.756,0.123,0.0364,0.0,0.0887,0.236,127.087,159382,R&B/Hip-Hop alternative
10,Lewis Capaldi,Divinely Uninspired To A Hellish Extent,Someone You Loved,7qEHsqek33rTcFNT9PFqLf,0.405,0.501,1,-5.679,0.751,0.0319,0.0,0.105,0.446,109.891,182161,Alternative/Indie


## Perform data cleaning by:

##### Handling missing values.

## Using pandas isnull() and any(), we can check for each feature whether at least one value is missing

In [8]:
df.isnull().any()

Artist              False
Album               False
Track_name          False
Track_id            False
Energy              False
Danceability        False
Key                 False
Loudness            False
Acousticness        False
Speechiness         False
Instrumentalness    False
Liveness            False
Valence             False
Tempo               False
Duration_ms         False
Genre               False
dtype: bool

## The result is False for every feature, which means there are no missing values, so we don't need pandas dropna() to remove rows or features.
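
## For comparison, here is a small sketch of what the same check looks like when values are missing. The `toy` frame is made up for illustration; isnull().sum() is used here as a complementary check that counts missing values per feature:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries
toy = pd.DataFrame({
    "energy": [0.7, np.nan, 0.5],
    "genre": ["Pop", "Pop", None],
})

# True for every column that has at least one missing value
any_missing = toy.isnull().any()

# Number of missing values per column
missing_per_column = toy.isnull().sum()

# dropna() keeps only the rows without any missing value
cleaned = toy.dropna()

print(missing_per_column)
print(len(cleaned))
```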

##### Removing duplicate samples and features.

## drop_duplicates() returns a copy of the dataframe without repeated rows. Comparing that copy with the original using equals() tells us whether any row was removed. The built-in all() is not suitable here, because it iterates over the column names of a dataframe rather than its values. The returned value is True, so there are no duplicate rows.

In [9]:
df.equals(df.drop_duplicates())

True
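
## As a sanity check on how duplicates are detected, here is a minimal sketch on made-up dataframes (`toy` and `all_false` are assumptions for illustration). It shows that Python's built-in all() looks at a dataframe's column labels rather than its values, while duplicated() and equals() look at the rows themselves:

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first
toy = pd.DataFrame({"artist": ["A", "B", "A"], "track": ["x", "y", "x"]})

# Iterating a dataframe yields its column labels, so all() is True
# here even though every value in the frame is False
all_false = pd.DataFrame({"a": [False, False]})
labels_only = all(all_false)

# duplicated() marks each row that repeats an earlier row
n_duplicates = int(toy.duplicated().sum())

# equals() compares shape, index and values, so it notices dropped rows
unchanged = toy.equals(toy.drop_duplicates())

print(labels_only, n_duplicates, unchanged)
```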

#### Treating the outliers

## The dataframe has a 'Track_id' feature, which is just a Spotify-specific identifier that we don't need, so we remove it with pandas drop():
## ["Track_id"] - the column we want to remove
## axis="columns" - remove from columns, not rows
## inplace=True - apply the removal to the original dataframe instead of creating a new one.

In [10]:
df.drop(["Track_id"], axis="columns", inplace=True)
df

Unnamed: 0,Artist,Album,Track_name,Energy,Danceability,Key,Loudness,Acousticness,Speechiness,Instrumentalness,Liveness,Valence,Tempo,Duration_ms,Genre
1,The Weeknd,After Hours,Blinding Lights,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
2,Tones And I,Dance Monkey,Dance Monkey,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
3,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
4,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
5,Dua Lipa,Future Nostalgia,Don't Start Now,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco
6,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),0.69,0.746,11,-7.956,0.247,0.164,0.0,0.101,0.497,89.977,181733,Hip-Hop/Rap
7,Harry Styles,Fine Line,Watermelon Sugar,0.816,0.548,0,-4.209,0.122,0.0465,0.0,0.335,0.557,95.39,174000,Pop
8,Powfu,death bed (coffee for your head),death bed (coffee for your head),0.431,0.726,8,-8.765,0.731,0.135,0.0,0.696,0.348,144.026,173333,Hip-Hop/Rap
9,Trevor Daniel,Nicotine,Falling,0.43,0.784,10,-8.756,0.123,0.0364,0.0,0.0887,0.236,127.087,159382,R&B/Hip-Hop alternative
10,Lewis Capaldi,Divinely Uninspired To A Hellish Extent,Someone You Loved,0.405,0.501,1,-5.679,0.751,0.0319,0.0,0.105,0.446,109.891,182161,Alternative/Indie


In [11]:
df.describe()

Unnamed: 0,Energy,Danceability,Key,Loudness,Acousticness,Speechiness,Instrumentalness,Liveness,Valence,Tempo,Duration_ms
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.6093,0.71672,5.72,-6.2259,0.256206,0.124158,0.015962,0.196552,0.55571,119.69046,199955.36
std,0.154348,0.124975,3.709007,2.349744,0.26525,0.116836,0.094312,0.17661,0.216386,25.414778,33996.122488
min,0.225,0.351,0.0,-14.454,0.00146,0.029,0.0,0.0574,0.0605,75.801,140526.0
25%,0.494,0.6725,2.0,-7.5525,0.0528,0.048325,0.0,0.09395,0.434,99.55725,175845.5
50%,0.597,0.746,6.5,-5.9915,0.1885,0.07005,0.0,0.111,0.56,116.969,197853.5
75%,0.72975,0.7945,8.75,-4.2855,0.29875,0.1555,2e-05,0.27125,0.72625,132.317,215064.0
max,0.855,0.935,11.0,-3.28,0.934,0.487,0.657,0.792,0.925,180.067,312820.0
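
## describe() gives the quartiles needed for a common rule of thumb for spotting outliers: values more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies that rule to made-up durations; the 1.5 factor and the `durations` values are assumptions for illustration, not something taken from this dataset:

```python
import pandas as pd

# Hypothetical durations in milliseconds; the last one is unusually long
durations = pd.Series([180000, 190000, 200000, 210000, 220000, 400000])

# First and third quartiles, as reported by describe() under 25% and 75%
q1, q3 = durations.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = durations[(durations < lower) | (durations > upper)]
print(outliers.tolist())
```

## Whether a flagged value should be removed is a separate decision; in a top-50 list an unusually long track is still a valid observation.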


### Perform exploratory data analysis. Your analysis should provide answers to these questions:

## Using Python tuple unpacking, we can save the two values of the pandas shape attribute (rows, columns) to separate variables and print them

#### How many observations are there in this dataset?

In [12]:
observations, features = df.shape
print(f"Number of observations = {observations}")

Number of observations = 50


#### How many features does this dataset have?

In [13]:
print(f"Number of features = {features}")

Number of features = 15


## With pandas select_dtypes() we can filter columns by type. Non-numeric values have type 'object', so:
## include=object - we select the columns of 'object' type
## .columns.values - we take the names of the selected columns as an array

In [14]:
not_numeric = df.select_dtypes(include=object).columns.values
print(not_numeric)

['Artist' 'Album' 'Track_name' 'Genre']


## 'Track_name' is expected to be different for every row, so it identifies a track rather than placing it in a category. The categorical features are therefore 'Artist', 'Album' and 'Genre'. Using numpy delete() we can remove the value at a specific position, and np.where() finds the position where the value equals 'Track_name'.

#### Which of the features are categorical?

In [15]:
categorical = np.delete(not_numeric, np.where(not_numeric == "Track_name"))
print(categorical)

['Artist' 'Album' 'Genre']
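
## The same split can be sketched without hard-coding 'Track_name'. The example below uses a made-up frame and treats a non-numeric column as categorical when at least one of its values repeats; that criterion is an assumption for illustration, not a general definition:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Spotify dataset
toy = pd.DataFrame({
    "Artist": ["A", "A", "B"],
    "Track_name": ["x", "y", "z"],
    "Energy": [0.7, 0.5, 0.6],
})

# Everything that is not numeric, regardless of how text is stored
non_numeric = toy.select_dtypes(exclude=np.number)

# Keep the columns in which at least one value repeats
categorical = [
    column
    for column in non_numeric.columns
    if non_numeric[column].nunique() < len(non_numeric)
]

numeric = toy.select_dtypes(include=np.number).columns.tolist()
print(categorical, numeric)
```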


#### Which of the features are numeric?

## As with the categorical features, we can use include=np.number to find the columns that hold numeric values

In [18]:
print(df.select_dtypes(include=np.number).columns.values)

['Energy' 'Danceability' 'Key' 'Loudness' 'Acousticness' 'Speechiness'
 'Instrumentalness' 'Liveness' 'Valence' 'Tempo' 'Duration_ms']


## Selecting the column with ["Artist"] and calling value_counts() counts how many times each artist appears in the top songs list. Then a boolean mask, artist_more_than_1[artist_more_than_1 > 1], keeps only the artists with more than 1 track.

#### Are there any artists that have more than 1 popular track? If yes, which and how many?

In [19]:
artist_more_than_1 = df['Artist'].value_counts()
artist_more_than_1[artist_more_than_1 > 1]

Billie Eilish    3
Dua Lipa         3
Travis Scott     3
Justin Bieber    2
Harry Styles     2
Lewis Capaldi    2
Post Malone      2
Name: Artist, dtype: int64

## As we can see, there are 3 artists with 3 tracks each, so using max() we find the maximum count and filter on it:
## artist_more_than_1 == artist_more_than_1.max() keeps all artists who share the maximum number of tracks

#### Who was the most popular artist?

In [20]:
artist_more_than_1[artist_more_than_1==artist_more_than_1.max()]

Billie Eilish    3
Dua Lipa         3
Travis Scott     3
Name: Artist, dtype: int64
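
## Comparing against max() matters when several artists are tied, because idxmax() returns only one label. A minimal sketch with made-up artist names (the `artists` series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical artist column: A and B are tied with 3 tracks each
artists = pd.Series(["A", "B", "A", "C", "B", "A", "D", "B"], name="Artist")

counts = artists.value_counts()

# Boolean mask: artists with more than one track
repeated = counts[counts > 1]

# All artists that share the highest count
top = counts[counts == counts.max()]

# idxmax() returns a single label even when there is a tie
single_label = counts.idxmax()

print(sorted(top.index), single_label)
```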

## pandas unique() returns the distinct artists as a numpy array, so its 'size' attribute gives the total number of unique artists

#### How many artists in total have their songs in the top 50?

In [21]:
df['Artist'].unique().size

40

## We do the same as for artists to find the albums that have more than 1 track

#### Are there any albums that have more than 1 popular track? If yes, which and how many?

In [22]:
album_more_than_1 = df['Album'].value_counts()
album_more_than_1[album_more_than_1 > 1]

Future Nostalgia        3
Hollywood's Bleeding    2
Fine Line               2
Changes                 2
Name: Album, dtype: int64

#### How many albums in total have their songs in the top 50?

In [23]:
df['Album'].unique().size

45

## Using pandas .loc we can access rows and columns by labels or boolean arrays, and that's what we do here:
## df['Danceability'] > 0.7 - boolean array that keeps only the rows where Danceability > 0.7
## 'Track_name' - here we select the column 'Track_name'.

#### Which tracks have a danceability score above 0.7?

In [24]:
df.loc[df['Danceability'] > 0.7, 'Track_name']

2                                      Dance Monkey
3                                           The Box
4                             Roses - Imanbek Remix
5                                   Don't Start Now
6                      ROCKSTAR (feat. Roddy Ricch)
8                  death bed (coffee for your head)
9                                           Falling
11                                             Tusa
14                                  Blueberry Faygo
15                         Intentions (feat. Quavo)
16                                     Toosie Slide
18                                           Say So
19                                         Memories
20                       Life Is Good (feat. Drake)
21                 Savage Love (Laxed - Siren Beat)
23                                      Breaking Me
25                              everything i wanted
26                                         Señorita
27                                          bad guy
28          

#### Which tracks have a danceability score below 0.4?

In [25]:
df.loc[df['Danceability'] < 0.4, 'Track_name']

45    lovely (with Khalid)
Name: Track_name, dtype: object

#### Which tracks have their loudness above -5?

In [26]:
df.loc[df['Loudness'] > -5, 'Track_name']

5                                   Don't Start Now
7                                  Watermelon Sugar
11                                             Tusa
13                                          Circles
17                                    Before You Go
18                                           Say So
22                                        Adore You
24                           Mood (feat. iann dior)
32                                   Break My Heart
33                                         Dynamite
34                 Supalonely (feat. Gus Dapperton)
36                  Rain On Me (with Ariana Grande)
38    Sunflower - Spider-Man: Into the Spider-Verse
39                                            Hawái
40                                          Ride It
41                                       goosebumps
44                                          Safaera
49                                         Physical
50                                       SICKO MODE
Name: Track_

#### Which tracks have their loudness below -8?

In [27]:
df.loc[df['Loudness'] < -8, 'Track_name']

8                   death bed (coffee for your head)
9                                            Falling
16                                      Toosie Slide
21                  Savage Love (Laxed - Siren Beat)
25                               everything i wanted
27                                           bad guy
37                               HIGHEST IN THE ROOM
45                              lovely (with Khalid)
48    If the World Was Ending - feat. Julia Michaels
Name: Track_name, dtype: object

## idxmax() returns the index label of the maximum value. So df['Duration_ms'] selects the Duration_ms column, idxmax() finds the index of the longest track, and we use that index on df["Track_name"] and df["Artist"] to get the track name and artist.

#### Which track is the longest?

In [28]:
longest_track = df["Track_name"][df['Duration_ms'].idxmax()]
artist = df["Artist"][df['Duration_ms'].idxmax()]
print(f"{longest_track} by {artist}")

SICKO MODE by Travis Scott
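
## The lookup can also be done in one step by passing the label from idxmax() to .loc together with the columns of interest. A minimal sketch on a made-up frame (the `toy` values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with a 1-based index, like the ranking above
toy = pd.DataFrame(
    {
        "Track_name": ["x", "y", "z"],
        "Artist": ["A", "B", "C"],
        "Duration_ms": [200000, 310000, 150000],
    },
    index=[1, 2, 3],
)

# Index label of the longest track
longest_idx = toy["Duration_ms"].idxmax()

# Fetch both fields for that row in a single .loc call
longest_track, longest_artist = toy.loc[longest_idx, ["Track_name", "Artist"]]

# idxmin() works the same way for the shortest track
shortest = toy.loc[toy["Duration_ms"].idxmin()]

print(f"{longest_track} by {longest_artist}")
```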


## To find shortest track instead of idxmax() we use idxmin()

#### Which track is the shortest?

In [30]:
shortest_track = df["Track_name"][df['Duration_ms'].idxmin()]
artist = df["Artist"][df['Duration_ms'].idxmin()]
print(f"{shortest_track} by {artist}")

Mood (feat. iann dior) by 24kGoldn


## df['Genre'] selects the 'Genre' column, value_counts() counts the occurrences of every genre, and idxmax() returns the index label (here the genre name) with the highest count

#### Which genre is the most popular?

In [31]:
most_popular_genre = df['Genre'].value_counts().idxmax()
most_popular_genre_number = df['Genre'].value_counts().max()
print(f"{most_popular_genre} {most_popular_genre_number}")

Pop 14


## With df['Genre'].value_counts() we count the occurrences of each genre, then we create a boolean filter with 'genres == 1' and apply it to the series of genre counts.

#### Which genres have just one song on the top 50?

In [32]:
genres = df['Genre'].value_counts()
one_song = genres == 1
genres[one_song]

Nu-disco                              1
R&B/Hip-Hop alternative               1
Pop/Soft Rock                         1
Pop rap                               1
Hip-Hop/Trap                          1
Dance-pop/Disco                       1
Disco-pop                             1
Dreampop/Hip-Hop/R&B                  1
Alternative/reggaeton/experimental    1
Chamber pop                           1
Name: Genre, dtype: int64

## Again using unique() and the numpy 'size' attribute, we can find the total number of genres.

#### How many genres in total are represented in the top 50?

In [33]:
df['Genre'].unique().size

16

## To find the correlation between features, we can use df.corr(); the numeric_only=True option restricts the calculation to numeric features.

In [34]:
df.corr(numeric_only=True)

Unnamed: 0,Energy,Danceability,Key,Loudness,Acousticness,Speechiness,Instrumentalness,Liveness,Valence,Tempo,Duration_ms
Energy,1.0,0.152552,0.062428,0.79164,-0.682479,0.074267,-0.385515,0.069487,0.393453,0.075191,0.081971
Danceability,0.152552,1.0,0.285036,0.167147,-0.359135,0.226148,-0.017706,-0.006648,0.479953,0.168956,-0.033763
Key,0.062428,0.285036,1.0,-0.009178,-0.113394,-0.094965,0.020802,0.278672,0.120007,0.080475,-0.003345
Loudness,0.79164,0.167147,-0.009178,1.0,-0.498695,-0.021693,-0.553735,-0.069939,0.406772,0.102097,0.06413
Acousticness,-0.682479,-0.359135,-0.113394,-0.498695,1.0,-0.135392,0.352184,-0.128384,-0.243192,-0.241119,-0.010988
Speechiness,0.074267,0.226148,-0.094965,-0.021693,-0.135392,1.0,0.028948,-0.142957,0.053867,0.215504,0.366976
Instrumentalness,-0.385515,-0.017706,0.020802,-0.553735,0.352184,0.028948,1.0,-0.087034,-0.203283,0.018853,0.184709
Liveness,0.069487,-0.006648,0.278672,-0.069939,-0.128384,-0.142957,-0.087034,1.0,-0.033366,0.025457,-0.090188
Valence,0.393453,0.479953,0.120007,0.406772,-0.243192,0.053867,-0.203283,-0.033366,1.0,0.045089,-0.039794
Tempo,0.075191,0.168956,0.080475,0.102097,-0.241119,0.215504,0.018853,0.025457,0.045089,1.0,0.130328


## Now we save the correlations to the variable 'correlations' and reshape the matrix into a series of feature pairs with correlations.unstack(). The filter (correlations > .75) keeps only values higher than .75, which we treat as strongly positive, and (correlations != 1) excludes the correlation of a feature with itself. Each pair appears twice because the correlation matrix is symmetric.

#### Which features are strongly positively correlated?

In [35]:
correlations = df.corr(numeric_only=True)
correlations = correlations.unstack()
correlations[(correlations > .75) & (correlations != 1)]

Energy    Loudness    0.79164
Loudness  Energy      0.79164
dtype: float64
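
## Since the matrix is symmetric, each pair is listed in both orders. One way to list every pair once is to keep only the part of the matrix above the diagonal before reshaping. The sketch below uses made-up values; the `toy` frame is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Energy and Loudness rise together, Key does not
toy = pd.DataFrame({
    "Energy": [0.2, 0.4, 0.6, 0.8, 0.9],
    "Loudness": [-9.0, -7.5, -6.0, -4.5, -4.0],
    "Key": [3, 11, 0, 7, 5],
})

corr = toy.corr()

# Boolean mask that is True strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Cells outside the mask become NaN and are dropped after stacking,
# leaving one entry per pair and no self-correlations
pairs = corr.where(upper).stack().dropna()

strong = pairs[pairs > 0.75]
print(strong)
```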

## Same as for the strongly positively correlated features, but here we use a threshold of -0.75 for 'strongly negatively correlated'

#### Which features are strongly negatively correlated?

In [36]:
correlations = df.corr(numeric_only=True)
correlations = correlations.unstack()
correlations[(correlations < -0.75) & (correlations != 1)]

Series([], dtype: float64)

## For uncorrelated features we can treat values between -0.1 and 0.1 as no correlation. We don't need 2 separate filters: with abs() we take absolute values and keep those lower than 0.1.

#### Which features are not correlated?

In [37]:
correlations = df.corr(numeric_only=True)
correlations = correlations.unstack()
correlations[(abs(correlations) < 0.1) & (correlations != 1)]

Energy            Key                 0.062428
                  Speechiness         0.074267
                  Liveness            0.069487
                  Tempo               0.075191
                  Duration_ms         0.081971
Danceability      Instrumentalness   -0.017706
                  Liveness           -0.006648
                  Duration_ms        -0.033763
Key               Energy              0.062428
                  Loudness           -0.009178
                  Speechiness        -0.094965
                  Instrumentalness    0.020802
                  Tempo               0.080475
                  Duration_ms        -0.003345
Loudness          Key                -0.009178
                  Speechiness        -0.021693
                  Liveness           -0.069939
                  Duration_ms         0.064130
Acousticness      Duration_ms        -0.010988
Speechiness       Energy              0.074267
                  Key                -0.094965
             

## Using pandas groupby() we can group rows by the values of a column. Here we set variables that we will reuse later: 'metric' is the metric we want information about and 'genres' is the list of genres we are interested in. So:
## df.groupby("Genre") - we group by "Genre"
## [metric] - we select the column whose values we want to aggregate.
## mean() - find the mean of the metric within each group.
## [genres] - keep only the genres we are interested in.
## sort_values() - sort the mean values; the default order is ascending.

#### How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [38]:
metric = 'Danceability'
genres = ["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"]          
df.groupby("Genre")[metric].mean()[genres].sort_values()

Genre
Alternative/Indie    0.661750
Pop                  0.677571
Dance/Electronic     0.755000
Hip-Hop/Rap          0.765538
Name: Danceability, dtype: float64
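
## The same comparison can be sketched for several metrics at once by filtering the genres first and passing a list of columns to groupby(). The `toy` frame and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, not the Spotify dataset
toy = pd.DataFrame({
    "Genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Rock"],
    "Danceability": [0.6, 0.8, 0.9, 0.7, 0.5],
    "Loudness": [-6.0, -5.0, -7.0, -8.0, -4.0],
})

genres = ["Pop", "Hip-Hop/Rap"]

# Keep only the rows that belong to the genres of interest
selected = toy[toy["Genre"].isin(genres)]

# One row per genre, one column per metric
summary = selected.groupby("Genre")[["Danceability", "Loudness"]].mean()
print(summary)
```

## This gives one table instead of one cell per metric; sorting by a specific metric can still be done with sort_values() on that column.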

## Here we just need to change metric to 'Loudness'.

#### How does the loudness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [41]:
metric = 'Loudness'
df.groupby("Genre")[metric].mean()[genres].sort_values()

Genre
Hip-Hop/Rap         -6.917846
Pop                 -6.460357
Alternative/Indie   -5.421000
Dance/Electronic    -5.338000
Name: Loudness, dtype: float64

## Here we just need to change metric to 'Acousticness'.

#### How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [44]:
metric = 'Acousticness'
df.groupby("Genre")[metric].mean()[genres].sort_values()

Genre
Dance/Electronic     0.099440
Hip-Hop/Rap          0.188741
Pop                  0.323843
Alternative/Indie    0.583500
Name: Acousticness, dtype: float64

In [47]:
df.describe()

Unnamed: 0,Energy,Danceability,Key,Loudness,Acousticness,Speechiness,Instrumentalness,Liveness,Valence,Tempo,Duration_ms
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.6093,0.71672,5.72,-6.2259,0.256206,0.124158,0.015962,0.196552,0.55571,119.69046,199955.36
std,0.154348,0.124975,3.709007,2.349744,0.26525,0.116836,0.094312,0.17661,0.216386,25.414778,33996.122488
min,0.225,0.351,0.0,-14.454,0.00146,0.029,0.0,0.0574,0.0605,75.801,140526.0
25%,0.494,0.6725,2.0,-7.5525,0.0528,0.048325,0.0,0.09395,0.434,99.55725,175845.5
50%,0.597,0.746,6.5,-5.9915,0.1885,0.07005,0.0,0.111,0.56,116.969,197853.5
75%,0.72975,0.7945,8.75,-4.2855,0.29875,0.1555,2e-05,0.27125,0.72625,132.317,215064.0
max,0.855,0.935,11.0,-3.28,0.934,0.487,0.657,0.792,0.925,180.067,312820.0



### Provide suggestions about how your analysis can be improved. 
### Try other correlation thresholds instead of 0.75, for example 0.7 or 0.65.
### Pandas has 3 correlation methods (the default is pearson), so the other methods could be tested as well:
### kendall : Kendall Tau correlation coefficient
### spearman : Spearman rank correlation
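
## To illustrate why the choice of method can matter, here is a minimal sketch on made-up data where y grows with x but not linearly. Pearson measures linear association, while Spearman works on ranks; the `toy` frame is an assumption for illustration:

```python
import pandas as pd

# Hypothetical toy data: y = x squared, monotonic but not linear
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})

# Same call as in the analysis above, with the method switched
pearson_xy = toy.corr(method="pearson").loc["x", "y"]
spearman_xy = toy.corr(method="spearman").loc["x", "y"]

print(round(pearson_xy, 3), round(spearman_xy, 3))
```

## On such data the rank-based coefficient reaches 1 while Pearson stays below it, so a fixed threshold such as 0.75 can select different feature pairs depending on the method.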