# 📘 Notebook 01: Data Description

### Why my data is in **.CSV**, instead of **.JSON** or **.XLSX**

**CSV** stores data in *rows and columns*. The ML model I use expects **tabular data**, not nested or document-style data.  
CSV files are supported almost everywhere: they can be loaded in **Python (using the pandas library)**, **R**, **SQL databases**, and **Spark**, with no special software or schema parsing required.

CSV files are typically more compact than the equivalent .JSON, which repeats every field name in every record, and faster to read and write than .XLSX, which is a zipped bundle of XML files that must be unpacked and parsed. This matters most for large datasets.
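
As a rough illustration of the size point, the sketch below serializes the same small table to CSV and to record-oriented JSON in memory and compares the lengths. The three rows are made up for illustration (only the column names are borrowed from this dataset), so the exact ratio on real data will differ.

```python
import pandas as pd

# Tiny made-up table; column names borrowed from the dataset for illustration
sample = pd.DataFrame({
    "track_name": ["A", "B", "C"],
    "popularity": [73, 55, 57],
    "danceability": [0.676, 0.420, 0.438],
})

csv_text = sample.to_csv(index=False)
json_text = sample.to_json(orient="records")

# JSON repeats every column name in every record; CSV states them once in the header
print(len(csv_text), len(json_text))
```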

---

#### Why **CSV** is preferred over **JSON**

**JSON** is great for **APIs**, but not for **tabular analysis**. It is designed for transmitting structured data over the internet, not for direct use in machine learning models.
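
To make this concrete, here is a minimal sketch of the extra step that nested JSON needs before it becomes a table. The two records are invented for illustration; `pd.json_normalize` is the pandas helper that flattens nested objects into columns.

```python
import pandas as pd

# Hypothetical API-style records with a nested "features" object
records = [
    {"track_name": "Song A", "features": {"energy": 0.46, "tempo": 87.9}},
    {"track_name": "Song B", "features": {"energy": 0.17, "tempo": 77.5}},
]

# Nested keys become dotted column names such as "features.energy"
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

With CSV this flattening step is unnecessary, because the file is already one row per record and one column per field.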

---

#### Why **CSV** is preferred over **XLSX**

**XLSX** is better for **human interaction** (formatting, formulas, multiple sheets), but it is slower to parse and awkward in automated pipelines. This makes **CSV** the better fit for machine processing.


<p align="center">
  <img src="../assets/dividerlines.png" width="600"/>
</p>

### Loading the CSV file in Python
The file is loaded with the Python library pandas.

<span style="color:#003d80; font-style: italic;">

**Nerdy information:** Why did Wes McKinney create pandas?  
He was working as a *quant* at a finance firm and became fed up with *NumPy* (great for numerical arrays, not so good for real-world analysis).  

For *financial data*, Wes McKinney needed:  
- *Labeled data* (rows and columns)  
- *Time series handling*  
- *Data cleaning*  
- *Missing data support*  
- *Fast data alignment*, etc.  

So he created *pandas* in **2008**. It was designed to provide **DataFrame** and **Series** structures, bringing *R-like data analysis capabilities* to Python.

</span>


In [1]:
import pandas as pd 

### 📋 **This Notebook Covers:**
1. Importing necessary libraries
2. Loading the dataset
3. Understanding dataset structure (shape, columns, data types)
4. Detailed description of each feature

## Loading the Dataset

In [2]:
df = pd.read_csv('../data/dataset.csv')
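
The `Unnamed: 0` column that appears in the outputs further down is what pandas produces when a CSV header has a cell with no name, most likely a row index that was saved along with the data. A minimal sketch, using a small in-memory CSV as a stand-in for the real file, of how `index_col=0` would avoid that extra column. This notebook does not do this; it is shown only as an option.

```python
import io
import pandas as pd

# Small stand-in for a CSV whose first header cell is empty (a saved index)
csv_text = ",track_id,popularity\n0,abc,73\n1,def,55\n"

plain = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())    # includes 'Unnamed: 0'
print(indexed.columns.tolist())  # only the named columns
```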

## Data Structure Overview

In [3]:
print(f"Number of Rows -> Tracks: {df.shape[0]:,}")
print(f"Number of Columns -> Features: {df.shape[1]}")

Number of Rows -> Tracks: 114,000
Number of Columns -> Features: 21


In [4]:
df.head()

,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


In [5]:
df.tail()

,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
113995,113995,2C3TZjDRiAzdyViavDJ217,Rainy Lullaby,#mindfulness - Soft Rain for Mindful Meditatio...,Sleep My Little Boy,21,384999,False,0.172,0.235,...,-16.393,1,0.0422,0.64,0.928,0.0863,0.0339,125.995,5,world-music
113996,113996,1hIz5L4IB9hN3WRYPOCGPw,Rainy Lullaby,#mindfulness - Soft Rain for Mindful Meditatio...,Water Into Light,22,385000,False,0.174,0.117,...,-18.318,0,0.0401,0.994,0.976,0.105,0.035,85.239,4,world-music
113997,113997,6x8ZfSoqDjuNa5SVP5QjvX,Cesária Evora,Best Of,Miss Perfumado,22,271466,False,0.629,0.329,...,-10.895,0,0.042,0.867,0.0,0.0839,0.743,132.378,4,world-music
113998,113998,2e6sXL2bYv4bSz6VTdnfLs,Michael W. Smith,Change Your World,Friends,41,283893,False,0.587,0.506,...,-10.889,1,0.0297,0.381,0.0,0.27,0.413,135.96,4,world-music
113999,113999,2hETkH7cOfqmz3LqZDHZf5,Cesária Evora,Miss Perfumado,Barbincor,22,241826,False,0.526,0.487,...,-10.204,0,0.0725,0.681,0.0,0.0893,0.708,79.198,4,world-music


---

## Column Names and Data Types

- **Numerical columns**: Can be used directly in ML models
- **Categorical columns**: Need encoding before use in models
- **Incorrect types**: May need conversion (e.g., string dates to datetime)
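
A minimal sketch of how the first two groups could be separated and how a categorical column could be encoded. The rows are made up; the column names and dtypes mirror a few of this dataset's features. One-hot encoding with `pd.get_dummies` is just one possible encoding, chosen here for illustration.

```python
import pandas as pd

# Made-up rows; column names and dtypes mirror a few of the dataset's features
sample = pd.DataFrame({
    "popularity": [73, 55],
    "energy": [0.461, 0.166],
    "explicit": [False, True],
    "track_genre": ["acoustic", "rock"],
})

# Numeric columns (int and float; bool is not counted as "number")
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything that is neither numeric nor boolean is treated as categorical here
categorical_cols = sample.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# One-hot encode the categorical columns (one new column per category value)
encoded = pd.get_dummies(sample, columns=categorical_cols)

print(numeric_cols, categorical_cols)
print(encoded.columns.tolist())
```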

In [6]:
print("The Features in the dataset are:")

# Number the columns starting from 1
for i, col in enumerate(df.columns, start=1):
    print(f"{i}.{col}")


The Features in the dataset are:
1.Unnamed: 0
2.track_id
3.artists
4.album_name
5.track_name
6.popularity
7.duration_ms
8.explicit
9.danceability
10.energy
11.key
12.loudness
13.mode
14.speechiness
15.acousticness
16.instrumentalness
17.liveness
18.valence
19.tempo
20.time_signature
21.track_genre


---

### Datatypes of Features

In [7]:
print(df.dtypes)

Unnamed: 0            int64
track_id             object
artists              object
album_name           object
track_name           object
popularity            int64
duration_ms           int64
explicit               bool
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
time_signature        int64
track_genre          object
dtype: object


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        114000 non-null  int64  
 1   track_id          114000 non-null  object 
 2   artists           113999 non-null  object 
 3   album_name        113999 non-null  object 
 4   track_name        113999 non-null  object 
 5   popularity        114000 non-null  int64  
 6   duration_ms       114000 non-null  int64  
 7   explicit          114000 non-null  bool   
 8   danceability      114000 non-null  float64
 9   energy            114000 non-null  float64
 10  key               114000 non-null  int64  
 11  loudness          114000 non-null  float64
 12  mode              114000 non-null  int64  
 13  speechiness       114000 non-null  float64
 14  acousticness      114000 non-null  float64
 15  instrumentalness  114000 non-null  float64
 16  liveness          11

In [9]:
df.describe()

,Unnamed: 0,popularity,duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0
mean,56999.5,33.238535,228029.2,0.5668,0.641383,5.30914,-8.25896,0.637553,0.084652,0.31491,0.15605,0.213553,0.474068,122.147837,3.904035
std,32909.109681,22.305078,107297.7,0.173542,0.251529,3.559987,5.029337,0.480709,0.105732,0.332523,0.309555,0.190378,0.259261,29.978197,0.432621
min,0.0,0.0,0.0,0.0,0.0,0.0,-49.531,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,28499.75,17.0,174066.0,0.456,0.472,2.0,-10.013,0.0,0.0359,0.0169,0.0,0.098,0.26,99.21875,4.0
50%,56999.5,35.0,212906.0,0.58,0.685,5.0,-7.004,1.0,0.0489,0.169,4.2e-05,0.132,0.464,122.017,4.0
75%,85499.25,50.0,261506.0,0.695,0.854,8.0,-5.003,1.0,0.0845,0.598,0.049,0.273,0.683,140.071,4.0
max,113999.0,100.0,5237295.0,0.985,1.0,11.0,4.532,1.0,0.965,0.996,1.0,1.0,0.995,243.372,5.0


---
## Missing Values

In [10]:
df.isnull().any()

Unnamed: 0          False
track_id            False
artists              True
album_name           True
track_name           True
popularity          False
duration_ms         False
explicit            False
danceability        False
energy              False
key                 False
loudness            False
mode                False
speechiness         False
acousticness        False
instrumentalness    False
liveness            False
valence             False
tempo               False
time_signature      False
track_genre         False
dtype: bool
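
`isnull().any()` only says whether a column has missing values. A small sketch of how the count per column and the affected rows could be inspected as well; the three rows below are made up for illustration, with one row deliberately left incomplete.

```python
import pandas as pd

# Made-up rows; the second row has missing text fields on purpose
sample = pd.DataFrame({
    "artists": ["Gen Hoshino", None, "Kina Grannis"],
    "track_name": ["Comedy", None, "Can't Help Falling In Love"],
    "popularity": [73, 0, 71],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()

# Rows that have at least one missing value
rows_with_missing = sample[sample.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```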

---
## Reproducible Random Sample of the Data

In [11]:
df.sample(10, random_state=42)

,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
113186,113186,6KwkVtXm8OUp2XffN5k7lY,Hillsong Worship,No Other Name,No Other Name,50,440247,False,0.369,0.598,...,-6.984,1,0.0304,0.00511,0.0,0.176,0.0466,148.014,4,world-music
42819,42819,2dp5I5MJ8bQQHDoFaNRFtX,Internal Rot,Grieving Birth,Failed Organum,11,93933,False,0.171,0.997,...,-3.586,1,0.118,0.00521,0.801,0.42,0.0294,122.223,4,grindcore
59311,59311,5avw06usmFkFrPjX8NxC40,Zhoobin Askarieh;Ali Sasha,Noise A Noise 20.4-1,"Save the Trees, Pt. 1",0,213578,False,0.173,0.803,...,-10.071,0,0.144,0.613,0.00191,0.195,0.0887,75.564,3,iranian
91368,91368,75hT0hvlESnDJstem0JgyR,Bryan Adams,All I Want For Christmas Is You,Merry Christmas,0,151387,False,0.683,0.511,...,-5.598,1,0.0279,0.406,0.000197,0.111,0.598,109.991,3,rock
61000,61000,4bY2oZGA5Br3pTE1Jd1IfY,Nogizaka46,バレッタ TypeD,月の大きさ,57,236293,False,0.555,0.941,...,-3.294,0,0.0481,0.484,0.0,0.266,0.813,92.487,4,j-idol
96815,96815,2zQt5C0AIv27RhfJCRZdZ4,BaianaSystem,Duas Cidades,"Jah Jah Revolta, Pt. 2",38,309493,False,0.776,0.8,...,-4.704,0,0.0438,0.00631,0.0731,0.35,0.583,143.989,4,samba
18939,18939,6BctgCJXlgxYeR0ObhLdtR,Joe DeRosa,You Let Me Down,Please Stop Communicating,21,496201,False,0.627,0.815,...,-6.959,0,0.911,0.785,0.0,0.744,0.532,119.197,5,comedy
72760,72760,1LDQFdGTEXOnycDC8CJ5p1,Cane Hill,A Form of Protest,A Form of Protest,54,189405,False,0.395,0.933,...,-3.6,0,0.116,9e-06,0.0125,0.328,0.194,92.53,4,metalcore
25788,25788,2DDR5F7bHFJBiX6lPPsT8O,Kano,I'm Ready (Disco Mix - Original 12 Inch Version),I'm Ready - Radio Edit,29,209226,False,0.914,0.766,...,-5.662,1,0.0387,0.192,0.65,0.0747,0.9,126.632,4,disco
87169,87169,2ScU6iEvgb0TIuKiyem9rg,Charlie Brown Jr.,Acústico (Ao Vivo),Proibida Pra Mim (Grazon) - Ao Vivo,48,151666,False,0.345,0.976,...,-4.74,1,0.0487,0.000587,0.0507,0.487,0.818,181.121,4,r-n-b
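
The `random_state=42` argument is what makes this sample repeatable: the same seed returns the same rows every time the cell runs. A minimal sketch on a made-up 100-row frame, which stands in for the real dataset:

```python
import pandas as pd

# Made-up frame standing in for the real dataset
sample = pd.DataFrame({"popularity": range(100)})

first = sample.sample(10, random_state=42)
second = sample.sample(10, random_state=42)

# Same seed, same rows in the same order
print(first.index.equals(second.index))
```

Without `random_state`, each call would generally return a different set of rows.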


---

`df.loc[0]` selects the row whose index label is `0` and returns it as a Series. Because this DataFrame has the default `RangeIndex`, label `0` is also the first row.

What it does:

- Uses label-based indexing with `.loc[]`

- Returns all columns for the row where the index label is `0`

- Output is a pandas Series with column names as the index
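
The difference between label-based and position-based access only shows up when the index is not the default 0, 1, 2, ... A small sketch with a made-up frame and a deliberately shuffled index, comparing `.loc[]` with its positional counterpart `.iloc[]`:

```python
import pandas as pd

# Made-up frame; the index labels are deliberately not 0, 1, 2
sample = pd.DataFrame(
    {"track_genre": ["acoustic", "rock", "pop"]},
    index=[10, 20, 0],
)

by_label = sample.loc[0]      # row whose index label is 0 (the last row here)
by_position = sample.iloc[0]  # first row by position (its label is 10)

print(by_label["track_genre"], by_position["track_genre"])
```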


In [12]:
df.loc[0]

Unnamed: 0                               0
track_id            5SuOikwiRyPMVoIQDJUgSV
artists                        Gen Hoshino
album_name                          Comedy
track_name                          Comedy
popularity                              73
duration_ms                         230666
explicit                             False
danceability                         0.676
energy                               0.461
key                                      1
loudness                            -6.746
mode                                     0
speechiness                          0.143
acousticness                        0.0322
instrumentalness                  0.000001
liveness                             0.358
valence                              0.715
tempo                               87.917
time_signature                           4
track_genre                       acoustic
Name: 0, dtype: object

In [13]:
df['track_genre'].unique()

array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'brazil',
       'breakbeat', 'british', 'cantopop', 'chicago-house', 'children',
       'chill', 'classical', 'club', 'comedy', 'country', 'dance',
       'dancehall', 'death-metal', 'deep-house', 'detroit-techno',
       'disco', 'disney', 'drum-and-bass', 'dub', 'dubstep', 'edm',
       'electro', 'electronic', 'emo', 'folk', 'forro', 'french', 'funk',
       'garage', 'german', 'gospel', 'goth', 'grindcore', 'groove',
       'grunge', 'guitar', 'happy', 'hard-rock', 'hardcore', 'hardstyle',
       'heavy-metal', 'hip-hop', 'honky-tonk', 'house', 'idm', 'indian',
       'indie-pop', 'indie', 'industrial', 'iranian', 'j-dance', 'j-idol',
       'j-pop', 'j-rock', 'jazz', 'k-pop', 'kids', 'latin', 'latino',
       'malay', 'mandopop', 'metal', 'metalcore', 'minimal-techno', 'mpb',
       'new-age', 'opera', 'pagode', 'party', 'piano', 'pop-film', 'pop',
       'pow

In [14]:
# Another option: Get value counts to see frequency of each genre
df['track_genre'].value_counts()

track_genre
acoustic       1000
afrobeat       1000
alt-rock       1000
alternative    1000
ambient        1000
               ... 
techno         1000
trance         1000
trip-hop       1000
turkish        1000
world-music    1000
Name: count, Length: 114, dtype: int64

## Mean Popularity for Each Track Genre

In [15]:
print("Track_genre : Mean Popularity")
df.groupby('track_genre')['popularity'].mean().round(2).sort_values(ascending=False)


Track_genre : Mean Popularity


track_genre
pop-film          59.28
k-pop             56.90
chill             53.65
sad               52.38
grunge            49.59
                  ...  
chicago-house     12.34
detroit-techno    11.17
latin              8.30
romance            3.24
iranian            2.21
Name: popularity, Length: 114, dtype: float64
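
The same `groupby` pattern can return several statistics at once, which can help when a mean alone hides skew within a genre. A minimal sketch on made-up rows; the choice of `mean`, `median`, and `count` is only an illustration, not something this notebook computes.

```python
import pandas as pd

# Made-up rows for illustration
sample = pd.DataFrame({
    "track_genre": ["pop", "pop", "rock", "rock"],
    "popularity": [100, 98, 0, 50],
})

# One row per genre, one column per statistic, sorted by mean popularity
summary = (
    sample.groupby("track_genre")["popularity"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)

print(summary)
```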

---
## Identifying Target Variable

In [16]:
target_variable = 'popularity'

print(f"Data Type of Target Variable: {df[target_variable].dtype}")

Data Type of Target Variable: int64


In [17]:
print(f"Minimum value of Target: {df[target_variable].min()}")
print(f"Mean value of Target: {df[target_variable].mean():.2f}")
print(f"Median value of Target: {df[target_variable].median():.2f}")
print(f"Maximum value of Target: {df[target_variable].max()}")

Minimum value of Target: 0
Mean value of Target: 33.24
Median value of Target: 35.00
Maximum value of Target: 100


---
## Sorting Target 

In [18]:
print("Sorting Target in ascending order")
df.sort_values('popularity', ascending=True).drop(['track_id', 'album_name', 'track_name'], axis=1).head(10)


Sorting Target in ascending order


,Unnamed: 0,artists,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
32,32,Chord Overstreet,0,234186,False,0.593,0.455,6,-8.192,1,0.0388,0.366,0.0,0.0914,0.564,202.019,4,acoustic
42,42,Brandi Carlile;Lucius,0,230098,False,0.568,0.686,1,-6.635,1,0.033,0.15,2e-06,0.0881,0.725,172.075,4,acoustic
43,43,Brandi Carlile;Lucius,0,230098,False,0.568,0.686,1,-6.635,1,0.033,0.15,2e-06,0.0881,0.725,172.075,4,acoustic
44,44,Brandi Carlile,0,193943,False,0.476,0.666,6,-3.438,1,0.0446,0.314,0.0,0.342,0.498,148.155,4,acoustic
45,45,Brandi Carlile;Lucius,0,230098,False,0.568,0.686,1,-6.635,1,0.033,0.15,2e-06,0.0881,0.725,172.075,4,acoustic
46,46,Brandi Carlile;Lucius,0,230098,False,0.568,0.686,1,-6.635,1,0.033,0.15,2e-06,0.0881,0.725,172.075,4,acoustic
47,47,Brandi Carlile;Lucius,0,230098,False,0.568,0.686,1,-6.635,1,0.033,0.15,2e-06,0.0881,0.725,172.075,4,acoustic
37019,37019,Anitta,0,193805,False,0.813,0.733,4,-5.417,0,0.0847,0.15,0.00186,0.0909,0.397,91.988,4,funk
19979,19979,Randall King,0,206539,False,0.674,0.86,7,-5.498,1,0.033,0.00469,0.00345,0.115,0.637,116.05,4,country
37021,37021,Anitta,0,193805,False,0.813,0.733,4,-5.417,0,0.0847,0.15,0.00186,0.0909,0.397,91.988,4,funk


In [19]:
print("Sorting Target in descending order")
df.sort_values('popularity', ascending=False).drop(['track_id', 'album_name', 'track_name'], axis=1).head(10)


Sorting Target in descending order


,Unnamed: 0,artists,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
81051,81051,Sam Smith;Kim Petras,100,156943,False,0.714,0.472,2,-7.375,1,0.0864,0.013,5e-06,0.266,0.238,131.121,4,pop
20001,20001,Sam Smith;Kim Petras,100,156943,False,0.714,0.472,2,-7.375,1,0.0864,0.013,5e-06,0.266,0.238,131.121,4,dance
51664,51664,Bizarrap;Quevedo,99,198937,False,0.621,0.782,2,-5.548,1,0.044,0.0125,0.033,0.23,0.55,128.033,4,hip-hop
89411,89411,Manuel Turizo,98,162637,False,0.835,0.679,7,-5.329,0,0.0364,0.583,2e-06,0.218,0.85,124.98,4,reggaeton
30003,30003,David Guetta;Bebe Rexha,98,175238,True,0.561,0.965,7,-3.673,0,0.0343,0.00383,7e-06,0.371,0.304,128.04,4,edm
20008,20008,David Guetta;Bebe Rexha,98,175238,True,0.561,0.965,7,-3.673,0,0.0343,0.00383,7e-06,0.371,0.304,128.04,4,dance
81210,81210,David Guetta;Bebe Rexha,98,175238,True,0.561,0.965,7,-3.673,0,0.0343,0.00383,7e-06,0.371,0.304,128.04,4,pop
88410,88410,Manuel Turizo,98,162637,False,0.835,0.679,7,-5.329,0,0.0364,0.583,2e-06,0.218,0.85,124.98,4,reggae
67356,67356,Manuel Turizo,98,162637,False,0.835,0.679,7,-5.329,0,0.0364,0.583,2e-06,0.218,0.85,124.98,4,latin
68303,68303,Manuel Turizo,98,162637,False,0.835,0.679,7,-5.329,0,0.0364,0.583,2e-06,0.218,0.85,124.98,4,latino
