# DATA EXPLORATION

Data exploration is used to visually represent and analyze data.

Each row of a dataset is an independent observation, and together the rows make up the entire dataset. These rows are called **instances**.
Each column represents a specific property of the instances. Columns are also called **features** or **variables**.

A feature can be:

- **Categorical**: When its value is discrete and drawn from a finite set of values (e.g., name, grade, etc.)
- **Continuous**: When it can take any value between a lower and an upper bound, so the number of possible values is infinite. These are usually stored as floating-point numbers
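
As a rough sketch of how this distinction shows up in pandas, the snippet below splits the columns of a small frame by dtype. The frame `songs` and the rule "non-numeric means categorical, float means continuous" are assumptions made for illustration. Integer-coded columns such as a musical key can still be categorical, so the dtype is only a first hint.

```python
import pandas as pd

# Illustrative frame; the values are copied from the first rows of the dataset shown later.
songs = pd.DataFrame({
    "artist": ["Britney Spears", "blink-182", "Faith Hill"],
    "genre": ["pop", "rock, pop", "pop, country"],
    "danceability": [0.751, 0.434, 0.529],
    "tempo": [95.053, 148.726, 136.859],
})

# Heuristic (an assumption, not a rule): non-numeric columns are treated as
# categorical, floating-point columns as continuous.
categorical_cols = songs.select_dtypes(exclude="number").columns.tolist()
continuous_cols = songs.select_dtypes(include="float64").columns.tolist()

print(categorical_cols)
print(continuous_cols)
```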

A **class** or **response** is a dependent variable, which depends on independent variable(s) in the dataset. If the dependent variable is discrete, it is called **class**. If it is continuous, it is called **response**.
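
A minimal sketch of separating the dependent variable from the independent ones is shown below. Treating `popularity` as the response is an assumption made only for this example, and the names `target`, `X` and `y` are illustrative.

```python
import pandas as pd

# Illustrative frame; the values are copied from the first rows of the dataset shown later.
songs = pd.DataFrame({
    "danceability": [0.751, 0.434, 0.529],
    "energy": [0.834, 0.897, 0.496],
    "popularity": [77, 79, 66],
})

# Assumption: popularity is the dependent variable (a response, since it is numeric).
target = "popularity"
X = songs.drop(columns=target)  # independent variables
y = songs[target]               # dependent variable

print(list(X.columns))
print(y.tolist())
```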

**Dimensionality** of a dataset is the number of features contained in the dataset. The higher the dimensionality, the more detail each instance carries, and the more computational capacity is needed to work with the dataset.
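
In pandas, one way to read off the dimensionality is the second element of `DataFrame.shape`. The tiny frame below is a stand-in for a real dataset, and the variable names are illustrative.

```python
import pandas as pd

# Stand-in frame with two instances and three features.
songs = pd.DataFrame({
    "artist": ["Bon Jovi", "*NSYNC"],
    "year": [2000, 2000],
    "tempo": [119.992, 172.656],
})

# shape is a (rows, columns) tuple: instances first, features second.
n_instances, n_features = songs.shape

print(n_instances, n_features)
```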

In [1]:
# The info method summarizes the contents of the dataset
# From the output below, we can see that:
# The dataset has 2000 rows and 18 columns; of the columns listed below, 9 hold decimal values, 5 hold integer values, 1 holds boolean values and 2 hold string values
import pandas as pd
spotify_artists = pd.read_csv("data/songs_normalize.csv")
spotify_artists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist            2000 non-null   object 
 1   song              2000 non-null   object 
 2   duration_ms       2000 non-null   int64  
 3   explicit          2000 non-null   bool   
 4   year              2000 non-null   int64  
 5   popularity        2000 non-null   int64  
 6   danceability      2000 non-null   float64
 7   energy            2000 non-null   float64
 8   key               2000 non-null   int64  
 9   loudness          2000 non-null   float64
 10  mode              2000 non-null   int64  
 11  speechiness       2000 non-null   float64
 12  acousticness      2000 non-null   float64
 13  instrumentalness  2000 non-null   float64
 14  liveness          2000 non-null   float64
 15  valence           2000 non-null   float64
 16  tempo             2000 non-null   float64


In [2]:
# To preview the dataset, we use the head method
# It shows the first 5 instances to help us understand the main contents of the dataset
spotify_artists.head()

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,0,0.0437,0.3,1.8e-05,0.355,0.894,95.053,pop
1,blink-182,All The Small Things,167066,False,1999,79,0.434,0.897,0,-4.918,1,0.0488,0.0103,0.0,0.612,0.684,148.726,"rock, pop"
2,Faith Hill,Breathe,250546,False,1999,66,0.529,0.496,7,-9.007,1,0.029,0.173,0.0,0.251,0.278,136.859,"pop, country"
3,Bon Jovi,It's My Life,224493,False,2000,78,0.551,0.913,0,-4.063,0,0.0466,0.0263,1.3e-05,0.347,0.544,119.992,"rock, metal"
4,*NSYNC,Bye Bye Bye,200560,False,2000,65,0.614,0.928,8,-4.806,0,0.0516,0.0408,0.00104,0.0845,0.879,172.656,pop


In [4]:
# To summarize a feature, select it by its column label and call the describe method
# From the output returned, we can see that:
# There are 2000 values in the "song" column, of which 1879 are unique; "Sorry" is the most frequent value, occurring 5 times
spotify_artists[["song"]].describe()

Unnamed: 0,song
count,2000
unique,1879
top,Sorry
freq,5


In [5]:
# The value_counts method shows how often each value occurs
spotify_artists[["song"]].value_counts()

# We can use the normalize parameter to get the same result as proportions of the total instead of counts

# spotify_artists[["song"]].value_counts(normalize=True)

song                      
Sorry                         5
Rise                          3
Closer                        3
Mercy                         3
It's My Life                  3
                             ..
Hey Baby                      1
Heroes (we could be)          1
Hero (feat. Josey Scott)      1
Here's to Never Growing Up    1
Échame La Culpa               1
Length: 1879, dtype: int64
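
To turn the proportions returned by `normalize=True` into percentages, they can be multiplied by 100. The sketch below uses a hypothetical four-row column of titles rather than the real dataset, and `titles` and `shares` are illustrative names.

```python
import pandas as pd

# Hypothetical mini-column of song titles, not the real dataset.
titles = pd.Series(["Sorry", "Sorry", "Rise", "Closer"], name="song")

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
shares = titles.value_counts(normalize=True) * 100

print(shares)
```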

In [7]:
# We can check the mean value of a feature by selecting it and calling the mean method
spotify_artists[["duration_ms"]].mean()

duration_ms    228748.1245
dtype: float64

In [9]:
# We can compute the mean value for each group by using the groupby method and passing it the feature to group by; the result is sorted by the group label
spotify_artists.groupby("song")[["duration_ms"]].mean()

# We can also compute several aggregations at once by using the agg method and passing a list of function names

# spotify_artists.groupby("song")[["duration_ms"]].agg(["mean", "median", "min", "max"])

Unnamed: 0_level_0,duration_ms
song,Unnamed: 1_level_1
#SELFIE - Original Mix,183750.0
#thatPOWER,279506.0
'Till I Collapse,297786.0
(When You Gonna) Give It Up to Me (feat. Keyshia Cole) - Radio Version,243880.0
...Ready For It?,208186.0
...,...
no tears left to cry,205920.0
oui,238320.0
rockstar (feat. 21 Savage),218146.0
we fell in love in october,184153.0
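
A small sketch of the multi-aggregation call is shown below. The titles and durations are hypothetical, and the column is selected with single brackets (an assumption that differs from the cell above) so that the result has flat column names.

```python
import pandas as pd

# Hypothetical durations; song "A" appears twice, as duplicated titles do in the dataset.
tracks = pd.DataFrame({
    "song": ["A", "A", "B"],
    "duration_ms": [200000, 220000, 180000],
})

# One row per song, one column per aggregation.
stats = tracks.groupby("song")["duration_ms"].agg(["mean", "median", "min", "max"])

print(stats)
```

Selecting with double brackets, as in `[["duration_ms"]]`, also works but produces two-level column labels.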


In [10]:
# We can use the sort_values method to sort the data
spotify_artists.groupby("song")[["duration_ms"]].mean().sort_values(by="duration_ms")
# As we can see, the list has been sorted in ascending order according to the song duration

Unnamed: 0_level_0,duration_ms
song,Unnamed: 1_level_1
Old Town Road,113000.0
Panini,114893.0
Jocelyn Flores,119133.0
changes,121886.0
Gucci Gang,124055.0
...,...
Days Go By,432146.0
LoveStoned / I Think She Knows (Interlude),444333.0
What Goes Around.../...Comes Around (Interlude),448573.0
Another Chance,452906.0
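
To list the longest songs first, `sort_values` accepts `ascending=False`. The sketch below reuses four of the mean durations from the output above in a standalone Series; `mean_duration` and `longest` are illustrative names.

```python
import pandas as pd

# Mean durations copied from the sorted output above.
mean_duration = pd.Series(
    {
        "Old Town Road": 113000.0,
        "Panini": 114893.0,
        "Days Go By": 432146.0,
        "Another Chance": 452906.0,
    },
    name="duration_ms",
)

# Sort in descending order and keep the two longest songs.
longest = mean_duration.sort_values(ascending=False).head(2)

print(longest)
```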
