### Top 50 Spotify Tracks of 2020

**Importing and reading the data**

In [33]:
# import zipfile
# zip_dir = 'E:/archive.zip'
# extr_dir = 'Spottify_data'
# with zipfile.ZipFile(zip_dir, 'r') as zip_file:
#     zip_file.extractall(extr_dir)

In [77]:
import pandas as pd
import numpy as np

In [78]:
df = pd.read_csv(r'Spottify_data\spotifytoptracks.csv', index_col = 0)

**Inspecting the data**

<span style='color: blue;'>What is the size of the table, i.e. how many observations (rows) and features (columns)?</span>

In [79]:
df.shape

(50, 16)

<span style='color: blue;'>What are the names of the columns?</span>

In [80]:
df.columns

Index(['artist', 'album', 'track_name', 'track_id', 'energy', 'danceability',
       'key', 'loudness', 'acousticness', 'speechiness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms', 'genre'],
      dtype='object')

<span style='color: blue;'>Get a glimpse of what it looks like</span>

In [81]:
df.head(3)

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap


<span style='color: blue;'>What kind of data is stored in each column?</span>

In [82]:
df.dtypes

artist               object
album                object
track_name           object
track_id             object
energy              float64
danceability        float64
key                   int64
loudness            float64
acousticness        float64
speechiness         float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms           int64
genre                object
dtype: object

<span style='color: blue;'>Which features are categorical, and how many unique entries does each of them have?</span>

In [83]:
print("Categorical features:", df.select_dtypes(include='object').nunique())

Categorical features: artist        40
album         45
track_name    50
track_id      50
genre         16
dtype: int64


<span style='color: blue;'>Which features are numerical?</span>

In [84]:
print("Numerical features:", df.select_dtypes(include = 'number').columns.tolist())

Numerical features: ['energy', 'danceability', 'key', 'loudness', 'acousticness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']


<span style='color: blue;'>Is there any missing data?</span>

In [156]:
# Count missing cells once and reuse the result in both branches
missing_count = np.count_nonzero(df.isna())
if missing_count == 0:
    print("There is no missing data")
else:
    print(f'There are {missing_count} missing data cases')

There is no missing data
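If there were gaps, a total count alone would not say where they are. A minimal sketch of a per-column breakdown follows; the `toy` frame and its values are hypothetical stand-ins for the real table, made up only to show the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Spotify table (values are made up)
toy = pd.DataFrame({
    "artist": ["A", "B", None],
    "energy": [0.73, np.nan, 0.59],
    "tempo": [171.0, 98.1, 117.0],
})

# Number of missing cells in each column
missing_per_column = toy.isna().sum()

# Total number of missing cells in the whole frame
total_missing = int(missing_per_column.sum())

# Show only the columns that actually have gaps
print(missing_per_column[missing_per_column > 0])
print(f"Total missing cells: {total_missing}")
```

On the real data the same two lines would be `df.isna().sum()` and its `.sum()`.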


<span style='color: blue;'>Are there any duplicates? </span> <br>In this case, track name and/or track ID should be checked. 

In [158]:
df.duplicated(subset=['track_name','track_id']).sum()

0

There are none, but I already knew that from the unique counts above: `track_name` and `track_id` each have 50 unique entries for 50 rows.

<span style='color: blue;'>Are there any outliers?</span> <br> My preferred way to check is with boxplots. That works just fine as long as the dataframe doesn't have too many features. If it were big, I would probably go for the interquartile range instead.

In [133]:
import plotly.express as px

In [142]:
numerical_df = df.select_dtypes(include='number')

In [143]:
px.box(numerical_df, title="Box Plots for Numerical Columns")

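This is not very clear: on a single shared axis, features with very different scales (for example `duration_ms` versus `energy`) are hard to read. Each feature needs its own subplot.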

In [144]:
from plotly.subplots import make_subplots

In [145]:
# One subplot row per numerical feature; len(numerical_df) would give the number of tracks (50) instead
fig = make_subplots(rows = len(numerical_df.columns), cols = 1)

In [146]:
for i, col in enumerate(numerical_df, start=1):
    fig.add_trace(px.box(numerical_df, y=col).data[0], row=i, col=1)
    fig.update_yaxes(title_text=col, row=i, col=1)

In [152]:
fig.update_layout(
    title="Box Plots for Numerical Columns",
    showlegend=False,
    height=400 * len(numerical_df.columns))

OK, there's a bunch of outliers. Are they "real" though? They could simply be a consequence of the different genres rather than mistakes. Honestly, in a real-life situation I would leave them untouched unless there was something crazy, like a duration of 100 milliseconds.
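For a wider table, the interquartile-range check mentioned earlier could be sketched roughly as below. The helper name `iqr_outlier_counts`, the conventional 1.5 multiplier, and the `toy` values (including the deliberately absurd 100 ms duration) are my own assumptions for illustration, not part of the dataset.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()

# Hypothetical toy data; the 100 ms duration is an intentionally unrealistic value
toy = pd.DataFrame({
    "duration_ms": [200040, 209755, 196653, 176219, 183290, 100],
    "tempo": [171.0, 98.1, 117.0, 122.0, 124.0, 120.0],
})

counts = iqr_outlier_counts(toy)
print(counts)
```

This only flags candidates. Whether a flagged value is an error or just an unusual track still needs a human look.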

In [159]:
df.head()

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco


In [169]:
np.count_nonzero(df['genre'] == 'Hip-Hop/Rap')  # just checking how to select this correctly

13

In [170]:
print(f"There are {np.count_nonzero(df['genre'] == 'Hip-Hop/Rap')} Hip-Hop/Rap songs in TOP50")

There are 13 Hip-Hop/Rap songs in TOP50
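Counting one genre at a time gets tedious with 16 genres. A small sketch of counting them all at once with `value_counts` follows; the `toy` frame is a hypothetical stand-in with made-up rows.

```python
import pandas as pd

# Hypothetical toy frame; the genre labels mimic the real column, the rows are made up
toy = pd.DataFrame({
    "genre": ["Hip-Hop/Rap", "Pop", "Hip-Hop/Rap", "R&B/Soul", "Pop", "Hip-Hop/Rap"],
})

# Tracks per genre, most frequent first
genre_counts = toy["genre"].value_counts()

# Same single-genre count as before, using a boolean mask and sum
n_hiphop = int((toy["genre"] == "Hip-Hop/Rap").sum())

print(genre_counts)
print(f"There are {n_hiphop} Hip-Hop/Rap songs in the toy data")
```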


How to check how long a calculation takes

In [160]:
numbers = pd.Series(np.random.randint(0, 1000, 10000))

In [161]:
%%timeit -n 100
total = 0
for number in numbers:
    total+=number
total/len(numbers)

711 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [162]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

42.1 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
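In the runs above, the vectorized sum is much faster than the Python loop. The `%%timeit` magic only works inside IPython/Jupyter, so here is a rough sketch of the same comparison with the standard-library `timeit` module. The function names `loop_mean` and `vectorized_mean` are my own, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean(values):
    """Mean computed with an explicit Python loop."""
    total = 0
    for value in values:
        total += value
    return total / len(values)

def vectorized_mean(values):
    """Mean computed with a single vectorized NumPy sum."""
    return values.to_numpy().sum() / len(values)

runs = 20
# timeit.timeit returns the total time for `number` calls, so divide to get a per-call average
loop_time = timeit.timeit(lambda: loop_mean(numbers), number=runs) / runs
vec_time = timeit.timeit(lambda: vectorized_mean(numbers), number=runs) / runs

print(f"loop:       {loop_time * 1e6:.1f} µs per call")
print(f"vectorized: {vec_time * 1e6:.1f} µs per call")
```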
