# 📓 Notebook 02: Exploratory Data Analysis (EDA) - Statistical Analysis

This notebook focuses on **statistical exploration** of the dataset before any modeling or feature engineering steps.  
The goal is to understand the **structure, quality, and statistical behavior** of the data using quantitative measures.

Whereas the previous notebook focused on *data format and structure*, this notebook answers questions like:
- How many values are missing or duplicated, and what percentage of the data do they represent?
- How are numerical features distributed?
- Are there outliers or extreme values?
- How do key features behave across different groups (e.g., genres)?
- Do features use their full expected ranges?

---

## 📦 Libraries Used for Statistical Analysis

### Why **NumPy** is Used

**NumPy** (Numerical Python) is the foundation of **numerical computing in Python**.
Some background:
- It was created in **2005** by **Travis Oliphant**
- It evolved from earlier libraries called **Numeric** and **Numarray**
- NumPy was created to provide:
  - Fast numerical computation
  - Efficient handling of large multi-dimensional arrays
  - Vectorized operations written in C (much faster than Python loops)

Most scientific Python libraries, including **pandas, SciPy, scikit-learn**, and **matplotlib**, are built **on top of NumPy**.

In short:
> NumPy provides the *numerical backbone* of the Python data science ecosystem.
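
To make "vectorized operations" concrete, here is a minimal sketch comparing a Python loop with the equivalent NumPy expression. The array values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Toy data standing in for a numeric column (values are made up)
durations_ms = np.array([180_000, 240_000, 210_500, 95_000], dtype=np.float64)

# Loop version: converts each element one at a time in Python
minutes_loop = [d / 60_000 for d in durations_ms]

# Vectorized version: one expression applied to the whole array in compiled code
minutes_vec = durations_ms / 60_000

print(minutes_vec)
print(np.allclose(minutes_loop, minutes_vec))
```

Both versions give the same numbers; the vectorized form is shorter and, on large arrays, typically much faster.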

---

### Why **pandas** Exists (and How It Differs from NumPy)

**pandas** adds features that NumPy does not provide out of the box:
- Column names (labeled axes)
- Mixed data types within one table
- First-class handling of missing values
- Group-by operations
- Time-series indexing

#### Key Differences Between NumPy and pandas

| Aspect | NumPy | pandas |
|------|------|-------|
| Core structure | `ndarray` | `DataFrame` / `Series` |
| Data types | Mostly homogeneous | Heterogeneous (mixed types) |
| Column labels | ❌ No | ✅ Yes |
| Missing values | Limited support | Native support |
| Grouped operations | Manual | Built-in (`groupby`) |
| Use case | Low-level numerical computing | High-level data analysis |

**Relationship between them**:
> pandas uses NumPy internally, but adds **labels, alignment, and data-awareness**.
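
A small sketch of the differences listed in the table. The mini-table below is hypothetical: the column names mirror the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "track_genre": ["pop", "pop", "jazz", "jazz"],
    "popularity": [80, 60, np.nan, 20],   # one missing value
    "explicit": [True, False, False, False],
})

# Labeled, mixed-type columns: string, float, and bool side by side
print(toy.dtypes)

# Missing values are skipped by default in aggregations
genre_mean = toy.groupby("track_genre")["popularity"].mean()
print(genre_mean)
```

Doing the same with a plain `ndarray` would require tracking column positions and masking the missing entry by hand.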

---

## 🔍 Scope of Statistical Analysis in This Notebook

This notebook performs a **systematic statistical audit** of the dataset using the following analyses:

### ✅ This notebook covers

1. **Missing Values Analysis**  
   Count and percentage of missing values per feature.

2. **Duplicate Rows Check**  
   Identification of exact duplicate records.

3. **Unique Values Count per Column**  
   Helps distinguish categorical, identifier, and continuous variables.

4. **Correlation Matrix (Numerical Features)**  
   Measures linear relationships between numerical features.

5. **Skewness Analysis**  
   Examines the asymmetry of feature distributions.

6. **Outlier Detection using IQR Method**  
   Identifies extreme values based on interquartile range.

7. **Feature-wise Descriptive Statistics**  
   Minimum, maximum, and range for numerical features.

8. **Grouped Statistics by Genre**  
   Mean values of popularity, energy, danceability, and other features grouped by genre.

9. **Explicit vs Non-Explicit Comparison**  
   Statistical comparison of audio features and popularity.

10. **Key Distribution Analysis**  
    Frequency analysis of musical keys.

11. **Duration Analysis**  
    Statistical patterns in track length (milliseconds).

12. **Percentile Analysis**  
    Analysis at 25th, 50th, 75th, 90th, and 99th percentiles.

13. **Variance and Standard Deviation Analysis**  
    Measures feature spread and variability.

14. **Zero and Near-Zero Value Detection**  
    Identifies features with low-information or constant values.

15. **Coefficient of Variation (CV)**  
    Relative variability normalized by mean (useful for feature comparison).

---

## 🎯 Outcome of This Notebook

By the end of this notebook, we will have:
- A statistically validated understanding of the dataset
- Identified potential data quality issues
- Insights into feature distributions and variability
- A strong foundation for **EDA visualization and modeling** in subsequent notebooks

---


**Note:** This notebook focuses on **statistical analysis only**. Visualizations are in Notebook 03.

<p align="center">
 <img src="../assets/dividerlines.png" width="600"/>
</p>

In [1]:
# pandas: Data manipulation and analysis
# numpy: Numerical computations (percentiles, IQR calculations)
import pandas as pd
import numpy as np

df = pd.read_csv('../data/dataset.csv')

---
## 1️⃣ Missing Values Analysis

**Why?** Understanding missing data helps us decide:
- Which columns need imputation (filling missing values)
- Which columns might need to be dropped
- Data quality assessment

In [2]:

# Total number of rows in dataset
total = len(df)

# Count of missing values per column
missing_count = df.isnull().sum()

# Percentage of missing values per column
missing_pct = (missing_count / total) * 100

# Combine into a DataFrame for better readability
missing_df = pd.DataFrame({
    'Missing_Count': missing_count,
    'Missing_Percentage': missing_pct.round(2)
})

# Sort by missing count (descending) to see problematic columns first
missing_df = missing_df.sort_values('Missing_Count', ascending=False)

print("📊 MISSING VALUES ANALYSIS")
print("=" * 50)
print(missing_df)
print("=" * 50)
print(f"\n🔍 Columns with missing values: {(missing_count > 0).sum()}")
print(f"📈 Total missing values: {missing_count.sum():,}")

📊 MISSING VALUES ANALYSIS
                  Missing_Count  Missing_Percentage
artists                       1                 0.0
track_name                    1                 0.0
album_name                    1                 0.0
Unnamed: 0                    0                 0.0
track_id                      0                 0.0
popularity                    0                 0.0
duration_ms                   0                 0.0
explicit                      0                 0.0
danceability                  0                 0.0
energy                        0                 0.0
key                           0                 0.0
loudness                      0                 0.0
mode                          0                 0.0
speechiness                   0                 0.0
acousticness                  0                 0.0
instrumentalness              0                 0.0
liveness                      0                 0.0
valence                       0    

In [51]:
# Drop rows with any missing values
df = df.dropna()

# Verify
print(f"Rows after dropping missing: {len(df):,}")

Rows after dropping missing: 113,999


---
## 2️⃣ Duplicate Rows Check

**Why?** Duplicate entries can:
- Bias our model (same song counted multiple times)
- Inflate dataset size artificially
- Affect statistical calculations

In [52]:
# Count total duplicates
duplicate_count = df.duplicated().sum()

# Percentage of duplicates
duplicate_pct = (duplicate_count / len(df)) * 100

print("DUPLICATE ROWS ANALYSIS")
print(f"Total duplicate rows: {duplicate_count:,}")
print(f"Percentage of dataset: {duplicate_pct:.2f}%")

# Check duplicates based on track_id (expected to be unique per track)
track_id_duplicates = df['track_id'].duplicated().sum()
print(f"\nDuplicate track_ids: {track_id_duplicates:,} 😱")

DUPLICATE ROWS ANALYSIS
Total duplicate rows: 0
Percentage of dataset: 0.00%

Duplicate track_ids: 24,259 😱


### My Finding:
- 114,000 rows ÷ 114 genres = 1,000 tracks per genre in the raw file
- The 113,999 remaining rows contain only 89,740 unique `track_id`s (113,999 - 24,259) → the same song appears under multiple genres
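
The two checks above disagree (0 duplicate rows versus 24,259 repeated `track_id`s) because they compare different things. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy rows: the same track listed under two genres (values are made up)
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "b2"],
    "track_name": ["Song A", "Song A", "Song B"],
    "track_genre": ["chill", "soul", "jazz"],
})

# Full-row check: the rows differ in track_genre, so nothing is flagged
full_row_dupes = toy.duplicated().sum()

# Key-based check: the second "a1" is flagged as a repeat
id_dupes = toy["track_id"].duplicated().sum()

print(full_row_dupes, id_dupes)
```

`DataFrame.duplicated()` compares every column, so a row only counts as a duplicate when all of its values match an earlier row.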

In [53]:
# Check: Same track_id but different genres?
df.groupby('track_id')['track_genre'].nunique().value_counts()

track_genre
1    73441
2    11424
3     2955
4     1361
5      431
6      104
7       21
8        2
9        1
Name: count, dtype: int64

In [54]:

# group by track_id and count unique genres
multi_genre_tracks = (
    df.groupby('track_id')['track_genre']
      .nunique()
      .reset_index(name='genre_count')
)

# keep only tracks with more than 1 genre
multi_genre_tracks = multi_genre_tracks[multi_genre_tracks['genre_count'] > 1]

print(f"Number of tracks appearing in multiple genres: {len(multi_genre_tracks)}")

multi_genre_details = (
    df[df['track_id'].isin(multi_genre_tracks['track_id'])]
    .groupby(['track_id', 'track_name', 'artists'])['track_genre']
    .unique()
    .reset_index()
)

multi_genre_details.head(10)


Number of tracks appearing in multiple genres: 16299


| | track_id | track_name | artists | track_genre |
|---|---|---|---|---|
| 0 | 001APMDOl3qtx1526T11n1 | Better | Pink Sweat$;Kirby | [chill, soul] |
| 1 | 001YQlnDSduXd5LgBd66gT | El Tiempo Es Dinero - Remasterizado 2007 | Soda Stereo | [punk-rock, ska] |
| 2 | 003vvx7Niy0yvhvHt4a68B | Mr. Brightside | The Killers | [alt-rock, alternative, rock] |
| 3 | 004h8smbIoAkUNDJvVKwkG | Lovemark | Ouse;Powfu | [emo, sad] |
| 4 | 006rHBBNLJMpQs8fRC2GDe | Agora Estou Sofrendo - Ao Vivo | Calcinha Preta;Gusttavo Lima | [forro, pagode, sertanejo] |
| 5 | 006tmNZLXEXPqdb23wwSN1 | Yemyeşil Bir Deniz | İlhan İrem | [j-pop, j-rock, jazz, turkish] |
| 6 | 00970cTs7LnxWt0d5Qk08m | Sleigh Ride | Ella Fitzgerald | [blues, jazz] |
| 7 | 00B7SBwrjbycLMOgAmeIU8 | Reach Out | Red Hot Chili Peppers | [alt-rock, funk, metal] |
| 8 | 00EsQxsJv6vy7hEQN3jZWG | Beginning Middle End - Always and Forever Mix)... | Leah Nobel | [singer-songwriter, songwriter] |
| 9 | 00GVRTIWMjYwwHEjTLclgf | Home | Robert Hood | [chicago-house, detroit-techno] |


In [55]:
#Sort by songs with the highest number of genres
multi_genre_sorted = (
    df.groupby(['track_id', 'track_name', 'artists'])['track_genre']
      .nunique()
      .reset_index(name='genre_count')
      .sort_values(by='genre_count', ascending=False)
)

multi_genre_sorted.head(10)

| | track_id | track_name | artists | genre_count |
|---|---|---|---|---|
| 74275 | 6S3JlDAGk3uu3NtZbPnuhS | Baby Blue - Remastered 2010 | Badfinger | 9 |
| 25739 | 2Ey6v4Sekh3Z0RUSISRosD | Layla | Derek & The Dominos | 8 |
| 31723 | 2kkvB3RNRzwjFdGhaUA0tz | Layla | Derek & The Dominos | 8 |
| 88605 | 7tbzfR8ZvZzJEzy6v0d6el | Liggi | Ritviz | 7 |
| 7396 | 0e5LcankE0UyJUuCoq1uH2 | The Joker | Steve Miller Band | 7 |
| 49041 | 4GPQDyw9hC1DiZVh0ouDVL | Keep My Name Outta Your Mouth | The Black Keys | 7 |
| 59519 | 5BI1XqMJK91dsEq0Bfe0Ov | Show Me The Way | Peter Frampton | 7 |
| 58364 | 54zCdkbIALAnv8Ihi3XWlD | Stay Alive | José González | 7 |
| 35737 | 36NwMJRaCy7x77xYGJiG2M | Midnight Rider | Allman Brothers Band | 7 |
| 29766 | 2aaClnypAakdAmLw74JXxB | Arise | Sepultura | 7 |
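
Because the same `track_id` is listed under several genres, it is worth checking whether its popularity score is identical in every listing. The sketch below shows one possible check on made-up rows; it illustrates the idea and is not a result from the real dataset.

```python
import pandas as pd

# Toy rows (made up): "a1" has one popularity value, "b2" has two
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "b2", "b2", "c3"],
    "track_genre": ["chill", "soul", "rock", "alt-rock", "jazz"],
    "popularity": [70, 70, 55, 40, 10],
})

# Number of distinct popularity values recorded for each track_id
pop_variants = toy.groupby("track_id")["popularity"].nunique()

# Tracks whose popularity differs between genre listings
inconsistent = pop_variants[pop_variants > 1]

print(f"Tracks with varying popularity: {len(inconsistent)}")
print(inconsistent.index.tolist())
```

If the same check on `df` returns no rows, popularity is track-specific and keeping any one listing per track loses no popularity information.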


"For songs that appear in multiple genres (same track_id), does their popularity score vary across different genre listings, or is it consistent regardless of genre? This will help us understand if popularity is track-specific or genre-dependent in duplicate entries." 

In [56]:
for col in df.columns:
    duplicate_count = df[col].duplicated().sum()
    duplicate_pct = (duplicate_count / len(df)) * 100
    
    print(
        f"{col:20s} → "
        f"Duplicates: {duplicate_count:8,} "
        f"({duplicate_pct:6.2f}%)"
    )

Unnamed: 0           → Duplicates:        0 (  0.00%)
track_id             → Duplicates:   24,259 ( 21.28%)
artists              → Duplicates:   82,562 ( 72.42%)
album_name           → Duplicates:   67,410 ( 59.13%)
track_name           → Duplicates:   40,391 ( 35.43%)
popularity           → Duplicates:  113,898 ( 99.91%)
duration_ms          → Duplicates:   63,303 ( 55.53%)
explicit             → Duplicates:  113,997 (100.00%)
danceability         → Duplicates:  112,825 ( 98.97%)
energy               → Duplicates:  111,916 ( 98.17%)
key                  → Duplicates:  113,987 ( 99.99%)
loudness             → Duplicates:   94,519 ( 82.91%)
mode                 → Duplicates:  113,997 (100.00%)
speechiness          → Duplicates:  112,510 ( 98.69%)
acousticness         → Duplicates:  108,938 ( 95.56%)
instrumentalness     → Duplicates:  108,653 ( 95.31%)
liveness             → Duplicates:  112,277 ( 98.49%)
valence              → Duplicates:  112,209 ( 

My Observation:
- Repeated values within a column are not the same thing as duplicate rows.
- Repeated values within individual features such as `artists`, `loudness`, and `tempo` are expected and fine.
- Categorical columns (`explicit`, `mode`, `key`, `track_genre`) naturally have very high repeat percentages because they take only a few possible values.

### Note
- ✔️ This analysis shows value repetition, not row duplication.
- ❌ It does NOT mean the dataset is flawed.

#### Example:
A song like **“Better” by Pink Sweat$ (feat. Kirby)** appears multiple times in the dataset because it is associated with more than one genre, such as:

- chill  
- soul  

As a result, the same track is repeated once for each genre it belongs to.


### Options I Have

#### Option A: Split Data
I can split my data into unique songs and duplicated songs (with multiple genres) and perform EDA on each separately.
- Dataset 1: `unique_songs.csv` (73,441 songs) → Songs in **1 genre only**
- Dataset 2: `multi_genre_songs.csv` (16,299 songs) → Songs in 2+ genres
> This would *complicate* my workflow

#### Option B: Drop the Duplicates
Deduplicating still leaves **89,740** unique tracks, which is a *good* number of samples to feed to the model. To keep the analysis simple and focused, I am dropping the duplicates. Note that `keep='first'` retains only the first genre listed for each track.

In [57]:
# number of rows before deduplication
rows_before = len(df)

# drop duplicates, keeping the first occurrence
df_dedup = df.drop_duplicates(subset='track_id', keep='first')

# number of rows after deduplication
rows_after = len(df_dedup)

print(f"Rows before deduplication: {rows_before:,}")
print(f"Rows after deduplication:  {rows_after:,}")
print(f"Tracks removed:            {rows_before - rows_after:,}")


Rows before deduplication: 113,999
Rows after deduplication:  89,740
Tracks removed:            24,259


In [58]:
# After deduplication, reset the index so row labels are contiguous again
df_dedup = df_dedup.reset_index(drop=True)
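
Why the reset matters: `drop_duplicates` keeps the original row labels, so the index has gaps afterwards. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy rows (made up): "a1" and "c3" each appear under two genres
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "b2", "c3", "c3"],
    "track_genre": ["chill", "soul", "jazz", "rock", "metal"],
})

deduped = toy.drop_duplicates(subset="track_id", keep="first")
gapped_index = deduped.index.tolist()   # original labels survive, leaving gaps
print(gapped_index)

deduped = deduped.reset_index(drop=True)
print(deduped.index.tolist())           # contiguous labels again
print(deduped["track_genre"].tolist())  # only the first genre per track is kept
```

With a gapped index, label-based access such as `deduped.loc[1]` would raise a `KeyError` even though the frame has three rows.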


---
## 3️⃣ Unique Values Count

**Why?** Helps identify:
- High cardinality columns (many unique values - may need special handling)
- Low cardinality columns (few categories - good for encoding)
- Constant columns (only 1 value - useless for prediction)

In [59]:
# Count how many unique values exist in each column in the new dataset
# Get unique counts for all columns
unique_counts = df_dedup.nunique()

# Create DataFrame with unique counts and percentage
unique_df = pd.DataFrame({
    'Unique_Count': unique_counts,
    'Unique_Percentage': ((unique_counts / len(df_dedup)) * 100).round(2)
})

# Sort by unique count
unique_df = unique_df.sort_values('Unique_Count', ascending=False)

print("📊 UNIQUE VALUES PER COLUMN")
print(unique_df)


📊 UNIQUE VALUES PER COLUMN
                  Unique_Count  Unique_Percentage
Unnamed: 0               89740             100.00
track_id                 89740             100.00
track_name               73608              82.02
duration_ms              50696              56.49
album_name               46589              51.92
tempo                    45652              50.87
artists                  31437              35.03
loudness                 19480              21.71
instrumentalness          5346               5.96
acousticness              5061               5.64
energy                    2083               2.32
valence                   1790               1.99
liveness                  1722               1.92
speechiness               1489               1.66
danceability              1174               1.31
track_genre                113               0.13
popularity                 101               0.11
key                         12               0.01
time_signature      

---
## 4️⃣ Correlation Matrix

**Why?** Correlation tells us:
- Which features are related to our target (popularity)
- Which features are related to each other (multicollinearity)
- Potential feature engineering opportunities
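
One caveat before reading the matrix: `DataFrame.corr()` computes Pearson correlation by default, which only captures *linear* relationships. The sketch below uses synthetic data (not the dataset) to show a variable that is fully determined by `x` yet has a Pearson correlation of zero with it.

```python
import numpy as np
import pandas as pd

x = np.arange(-5, 6)             # symmetric around zero
toy = pd.DataFrame({
    "x": x,
    "linear": 2 * x + 1,         # perfectly linear in x
    "quadratic": x ** 2,         # fully determined by x, but not linearly
})

corr = toy.corr()                # Pearson by default
print(corr["x"].round(3))
```

A near-zero value in the matrix therefore means "no linear relationship", not necessarily "no relationship".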

In [None]:
# Select only numerical columns
numerical_cols = df_dedup.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Remove 'Unnamed: 0' if present (it's just an index)
if 'Unnamed: 0' in numerical_cols:
    numerical_cols.remove('Unnamed: 0')

# Calculate correlation matrix
correlation_matrix = df_dedup[numerical_cols].corr()

print("📊 CORRELATION MATRIX")
print(correlation_matrix.round(2))



📊 CORRELATION MATRIX
                  popularity  duration_ms  danceability  energy   key  \
popularity              1.00        -0.02          0.06    0.01  0.00   
duration_ms            -0.02         1.00         -0.06    0.06  0.01   
danceability            0.06        -0.06          1.00    0.14  0.04   
energy                  0.01         0.06          0.14    1.00  0.05   
key                     0.00         0.01          0.04    0.05  1.00   
loudness                0.07         0.00          0.27    0.76  0.04   
mode                   -0.02        -0.04         -0.06   -0.08 -0.14   
speechiness            -0.05        -0.06          0.11    0.14  0.02   
acousticness           -0.04        -0.11         -0.18   -0.73 -0.05   
instrumentalness       -0.13         0.12         -0.19   -0.18 -0.01   
liveness               -0.01         0.01         -0.13    0.19 -0.00   
valence                -0.01        -0.15          0.49    0.26  0.03   
tempo                   0.0

In [71]:
# CORRELATION WITH TARGET VARIABLE (POPULARITY)

popularity_corr = correlation_matrix['popularity'].drop('popularity')

print("🎯 CORRELATION WITH TARGET (popularity)")
print("Sorted by absolute correlation strength:")

# Sort by absolute correlation (descending)
sorted_corr = sorted(popularity_corr.items(), key=lambda x: abs(x[1]), reverse=True)

# Define rank-based thresholds from the data itself.
# Because every correlation here is very small, the labels below are RELATIVE
# ranks among the features, not absolute measures of correlation strength.
corr_values = [abs(corr) for _, corr in sorted_corr]

# Roughly: top quarter -> "Weak", next quarter -> "Very Weak",
# next quarter -> "Negligible", remainder -> "None"
if len(corr_values) >= 4:
    thresholds = [
        sorted(corr_values)[-int(len(corr_values)*0.25)],  # Top 25% threshold
        sorted(corr_values)[-int(len(corr_values)*0.50)],  # Top 50% threshold
        sorted(corr_values)[-int(len(corr_values)*0.75)],  # Top 75% threshold
    ]
else:
    # Fallback when there are fewer than four features
    thresholds = [0.1, 0.05, 0.02]

for feature, corr in sorted_corr:
    abs_corr = abs(corr)
    
    # Dynamic thresholding
    if abs_corr >= thresholds[0]:
        strength = "🟡 Weak"
    elif abs_corr >= thresholds[1]:
        strength = "🟢 Very Weak"
    elif abs_corr >= thresholds[2]:
        strength = "🔵 Negligible"
    else:
        strength = "⚪ None"

    print(f"{feature:20} : {corr:+.4f}  {strength}")




🎯 CORRELATION WITH TARGET (popularity)
Sorted by absolute correlation strength:
instrumentalness     : -0.1275  🟡 Weak
loudness             : +0.0717  🟡 Weak
danceability         : +0.0643  🟡 Weak
speechiness          : -0.0471  🟢 Very Weak
acousticness         : -0.0388  🟢 Very Weak
time_signature       : +0.0369  🟢 Very Weak
duration_ms          : -0.0232  🔵 Negligible
mode                 : -0.0162  🔵 Negligible
liveness             : -0.0139  🔵 Negligible
energy               : +0.0137  ⚪ None
valence              : -0.0115  ⚪ None
tempo                : +0.0073  ⚪ None
key                  : +0.0034  ⚪ None
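
The labels in this output are rank-based, so it helps to see where the cut-offs actually fall. The sketch below replays the same indexing on the absolute correlations printed above; the fixed cutoff of 0.1 at the end is an assumed, illustrative value, not a threshold used in this notebook.

```python
# Absolute correlations copied from the output above
abs_corrs = [0.1275, 0.0717, 0.0643, 0.0471, 0.0388, 0.0369, 0.0232,
             0.0162, 0.0139, 0.0137, 0.0115, 0.0073, 0.0034]

ranked = sorted(abs_corrs)
n = len(ranked)                              # 13 features

# Same indexing as the cell above: int() truncates, so 13 * 0.25 -> 3, etc.
thresholds = [ranked[-int(n * q)] for q in (0.25, 0.50, 0.75)]
print(thresholds)

# For contrast, a fixed cutoff (0.1 is an assumed, illustrative value)
above_fixed = [c for c in abs_corrs if c >= 0.1]
print(len(above_fixed))
```

Under a fixed cutoff only `instrumentalness` would stand out, which is a reminder that "Weak" here means "among the strongest in this set", not a meaningful correlation in absolute terms.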


In [72]:
# Group features by strength
groups = {"🟡 Weak": [], "🟢 Very Weak": [], "🔵 Negligible": [], "⚪ None": []}
for feature, corr in sorted_corr:
    abs_corr = abs(corr)
    if abs_corr >= thresholds[0]:
        groups["🟡 Weak"].append((feature, corr))
    elif abs_corr >= thresholds[1]:
        groups["🟢 Very Weak"].append((feature, corr))
    elif abs_corr >= thresholds[2]:
        groups["🔵 Negligible"].append((feature, corr))
    else:
        groups["⚪ None"].append((feature, corr))

for strength_label, features in groups.items():
    if features:
        print(f"\n{strength_label}:")
        for feature, corr in features:
            print(f"  • {feature:20} : {corr:+.4f}")


🟡 Weak:
  • instrumentalness     : -0.1275
  • loudness             : +0.0717
  • danceability         : +0.0643

🟢 Very Weak:
  • speechiness          : -0.0471
  • acousticness         : -0.0388
  • time_signature       : +0.0369

🔵 Negligible:
  • duration_ms          : -0.0232
  • mode                 : -0.0162
  • liveness             : -0.0139

⚪ None:
  • energy               : +0.0137
  • valence              : -0.0115
  • tempo                : +0.0073
  • key                  : +0.0034


---
## 5️⃣ Skewness Analysis

**Why?** Skewness measures asymmetry of distribution:
- **Skewness = 0**: Symmetric (e.g., a normal distribution)
- **Skewness > 0**: Right-skewed (tail on right, most values on left)
- **Skewness < 0**: Left-skewed (tail on left, most values on right)

Highly skewed features may need transformation (log, sqrt) for better model performance.
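
As a rough illustration of how a log transform reduces right skew, the sketch below uses synthetic log-normal data as a stand-in for a column like `duration_ms`. The data and the random seed are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed data (log-normal); not taken from the dataset
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

skew_raw = raw.skew()
skew_log = np.log1p(raw).skew()   # log1p(x) = log(1 + x), safe at x = 0

print(f"skew before: {skew_raw:+.3f}")
print(f"skew after log1p: {skew_log:+.3f}")
```

`log1p` only applies to non-negative values, so a feature like `loudness` (mostly negative) would need a different treatment.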

In [73]:
# SKEWNESS ANALYSIS
# Calculate skewness for all numerical columns
# Rule of thumb: |skewness| > 1 is highly skewed

skewness = df_dedup[numerical_cols].skew().sort_values(key=abs, ascending=False)

print("📊 SKEWNESS ANALYSIS")
print("Interpretation:")
print("  |skew| > 1.0    →    🟥 Highly skewed")
print("  0.5 < |skew| ≤ 1.0 → 🟨 Moderately skewed")
print("  |skew| ≤ 0.5    →    🟩 Fairly symmetric")


# Group features by skewness level
highly_skewed = []
moderately_skewed = []
fairly_symmetric = []

for feature, skew in skewness.items():
    abs_skew = abs(skew)
    if abs_skew > 1.0:
        highly_skewed.append((feature, skew))
    elif abs_skew > 0.5:
        moderately_skewed.append((feature, skew))
    else:
        fairly_symmetric.append((feature, skew))

# Print Highly Skewed first (strongest issues)
if highly_skewed:
    print("\n🟥 HIGHLY SKEWED (|skew| > 1.0):")
    for feature, skew in highly_skewed:
        print(f"  {feature:20} : {skew:+.4f}")
    
# Print Moderately Skewed
if moderately_skewed:
    print("\n🟨 MODERATELY SKEWED (0.5 < |skew| ≤ 1.0):")
    for feature, skew in moderately_skewed:
        print(f"  {feature:20} : {skew:+.4f}")

# Print Fairly Symmetric
if fairly_symmetric:
    print("\n🟩 FAIRLY SYMMETRIC (|skew| ≤ 0.5):")
    for feature, skew in fairly_symmetric:
        print(f"  {feature:20} : {skew:+.4f}")



📊 SKEWNESS ANALYSIS
Interpretation:
  |skew| > 1.0    →    🟥 Highly skewed
  0.5 < |skew| ≤ 1.0 → 🟨 Moderately skewed
  |skew| ≤ 0.5    →    🟩 Fairly symmetric

🟥 HIGHLY SKEWED (|skew| > 1.0):
  duration_ms          : +11.0728
  speechiness          : +4.5458
  time_signature       : -3.9988
  liveness             : +2.0621
  loudness             : -1.9599
  instrumentalness     : +1.5640

🟨 MODERATELY SKEWED (0.5 < |skew| ≤ 1.0):
  acousticness         : +0.6558
  mode                 : -0.5697
  energy               : -0.5600

🟩 FAIRLY SYMMETRIC (|skew| ≤ 0.5):
  danceability         : -0.3983
  tempo                : +0.1827
  valence              : +0.1276
  popularity           : +0.0709
  key                  : -0.0001


In [74]:
# Summary statistics
print("📈 SUMMARY STATISTICS:")
print(f"Total features analyzed     : {len(skewness)}")
print(f"🟥 Highly skewed            : {len(highly_skewed)} features")
print(f"🟨 Moderately skewed        : {len(moderately_skewed)} features")
print(f"🟩 Fairly symmetric         : {len(fairly_symmetric)} features")

📈 SUMMARY STATISTICS:
Total features analyzed     : 14
🟥 Highly skewed            : 6 features
🟨 Moderately skewed        : 3 features
🟩 Fairly symmetric         : 5 features


---
## 6️⃣ Outlier Detection (IQR Method)

**Why?** Outliers can:
- Distort statistical measures (mean, std)
- Negatively impact model training
- Sometimes indicate data errors

**IQR Method:**
- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 - Q1
- Outliers: values < Q1 - 1.5×IQR or > Q3 + 1.5×IQR
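
The rule above can be traced by hand on a tiny made-up series before applying it to the real columns. The quartiles below rely on pandas' default linear interpolation.

```python
import pandas as pd

# Ten made-up values with one obvious extreme
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)        # 3.25 with pandas' default linear interpolation
q3 = values.quantile(0.75)        # 7.75
iqr = q3 - q1                     # 4.5

lower_bound = q1 - 1.5 * iqr      # -3.5
upper_bound = q3 + 1.5 * iqr      # 14.5

outliers = values[(values < lower_bound) | (values > upper_bound)]
print(outliers.tolist())
```

Only `100` falls outside the interval from -3.5 to 14.5, so it is the single value flagged.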

In [75]:
# Identify outliers in each numerical column using the IQR method

# Select audio features for outlier analysis (excluding IDs and indices)
audio_features = ['popularity', 'duration_ms', 'danceability', 'energy', 'loudness',
                  'speechiness', 'acousticness', 'instrumentalness', 'liveness', 
                  'valence', 'tempo']

print("📊 OUTLIER DETECTION (IQR Method)")
print(f"{'Feature':<20} {'Q1':>10} {'Q3':>10} {'IQR':>10} {'Outliers':>10} {'%':>8}")

outlier_summary = []

for col in audio_features:
    if col in df_dedup.columns:
        Q1 = df_dedup[col].quantile(0.25)
        Q3 = df_dedup[col].quantile(0.75)
        IQR = Q3 - Q1
        
        # Define outlier bounds
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        # Count outliers
        outliers = df_dedup[(df_dedup[col] < lower_bound) | (df_dedup[col] > upper_bound)][col].count()
        outlier_pct = (outliers / len(df_dedup)) * 100
        
        print(f"{col:<20} {Q1:>10.3f} {Q3:>10.3f} {IQR:>10.3f} {outliers:>10,} {outlier_pct:>7.2f}%")
        
        outlier_summary.append({
            'Feature': col,
            'Outlier_Count': outliers,
            'Outlier_Percentage': outlier_pct
        })

# Create summary DataFrame
outlier_df = pd.DataFrame(outlier_summary).sort_values('Outlier_Count', ascending=False)
print(f"\nFeature with most outliers: {outlier_df.iloc[0]['Feature']} ({outlier_df.iloc[0]['Outlier_Count']:,} outliers)")

📊 OUTLIER DETECTION (IQR Method)
Feature                      Q1         Q3        IQR   Outliers        %
popularity               19.000     49.000     30.000         11    0.01%
duration_ms          173040.000 264293.000  91253.000      4,225    4.71%
danceability              0.450      0.692      0.242        474    0.53%
energy                    0.457      0.853      0.396          0    0.00%
loudness                -10.322     -5.108      5.214      5,026    5.60%
speechiness               0.036      0.086      0.050     10,644   11.86%
acousticness              0.017      0.625      0.608          0    0.00%
instrumentalness          0.000      0.098      0.098     19,613   21.86%
liveness                  0.098      0.279      0.181      6,981    7.78%
valence                   0.249      0.682      0.433          0    0.00%
tempo                    99.263    140.077     40.814        514    0.57%

Feature with most outliers: instrumentalness (19,613 outliers)


---
## 7️⃣ Feature-wise Statistics (Min, Max, Range)

**Why?** Understanding the range of values helps:
- Identify potential data errors (impossible values)
- Understand feature scales (important for scaling later)
- Spot anomalies

In [76]:
# Calculate min, max, and range for each numerical feature

print("📊 FEATURE-WISE STATISTICS")
print(f"{'Feature':<20} {'Min':>12} {'Max':>12} {'Range':>12} {'Mean':>12}")
print("-" * 80)
for col in audio_features:
    if col in df_dedup.columns:
        min_val = df_dedup[col].min()
        max_val = df_dedup[col].max()
        range_val = max_val - min_val
        mean_val = df_dedup[col].mean()
        
        print(f"{col:<20} {min_val:>12.3f} {max_val:>12.3f} {range_val:>12.3f} {mean_val:>12.3f}")


📊 FEATURE-WISE STATISTICS
Feature                       Min          Max        Range         Mean
--------------------------------------------------------------------------------
popularity                  0.000      100.000      100.000       33.199
duration_ms              8586.000  5237295.000  5228709.000   229144.366
danceability                0.000        0.985        0.985        0.562
energy                      0.000        1.000        1.000        0.634
loudness                  -49.531        4.532       54.063       -8.499
speechiness                 0.000        0.965        0.965        0.087
acousticness                0.000        0.996        0.996        0.328
instrumentalness            0.000        1.000        1.000        0.173
liveness                    0.000        1.000        1.000        0.217
valence                     0.000        0.995        0.995        0.469
tempo                       0.000      243.372      243.372      122.058


### **What This Means for Preprocessing:**

| Issue                                       | Solution                                    |
|---------------------------------------------|---------------------------------------------|
| `duration_ms` has huge range (8K to 5M)     | Scaling needed (StandardScaler or MinMaxScaler) |
| `loudness` has negative values              | Scaling needed                                  |
| Features on different scales                | Scaling will be essential before modeling   |
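
To show what the two scaling options in the table do, here is a minimal sketch that applies the underlying formulas directly in pandas. It is a hand-rolled illustration on made-up values, not the scikit-learn classes themselves; the formulas match what `StandardScaler` and `MinMaxScaler` compute with their default settings.

```python
import pandas as pd

# Toy values on very different scales (made up, loosely shaped like the dataset)
toy = pd.DataFrame({
    "duration_ms": [8_586.0, 200_000.0, 250_000.0, 5_237_295.0],
    "loudness": [-49.5, -8.0, -5.0, 4.5],
})

# Standardization: zero mean, unit variance per column (population std, ddof=0)
standardized = (toy - toy.mean()) / toy.std(ddof=0)

# Min-max scaling: each column mapped onto [0, 1]
min_max = (toy - toy.min()) / (toy.max() - toy.min())

print(standardized.round(3))
print(min_max.round(3))
```

Both methods handle negative values such as `loudness` without any special casing; the very long track still dominates `duration_ms` after either transform, which is why skew and outliers are examined separately.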

---
## 8️⃣ Grouped Statistics by Genre

**Why?** Different genres may have different characteristics:
- Do some genres have higher popularity?
- Are certain genres more danceable/energetic?
- This can inform feature engineering decisions

In [77]:
# Calculate mean statistics for key features grouped by genre

# Key features to analyze by genre
key_features = ['popularity', 'danceability', 'energy', 'valence', 'tempo']

# Group by genre and calculate mean
genre_stats = df_dedup.groupby('track_genre')[key_features].mean().round(3)

# Sort by popularity
genre_stats_sorted = genre_stats.sort_values('popularity', ascending=False)

print("📊 MEAN STATISTICS BY GENRE (Top 15 by Popularity)")
print(genre_stats_sorted.head(15))

📊 MEAN STATISTICS BY GENRE (Top 15 by Popularity)
                   popularity  danceability  energy  valence    tempo
track_genre                                                          
k-pop                  59.424         0.642   0.683    0.569  119.530
pop-film               59.097         0.591   0.600    0.529  116.953
metal                  56.422         0.481   0.841    0.425  129.480
chill                  53.739         0.666   0.430    0.408  115.383
latino                 51.789         0.755   0.712    0.622  121.420
sad                    51.110         0.702   0.479    0.440  119.359
grunge                 50.587         0.455   0.805    0.401  129.985
indian                 49.765         0.586   0.555    0.448  115.207
anime                  48.777         0.538   0.674    0.435  123.608
emo                    48.500         0.601   0.668    0.441  126.998
reggaeton              48.270         0.743   0.737    0.674  121.952
sertanejo              47.861        

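Observation: K-pop (59.4) and pop-film (59.1) have the highest mean popularity. Most of the top genres show mid-range danceability, energy, and valence, which hints that mainstream appeal favors balanced, accessible music. Metal and grunge (mean energy above 0.8) are clear exceptions, so the pattern is suggestive rather than conclusive.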

In [78]:
print("\n MEAN STATISTICS BY GENRE (Bottom 10 by Popularity)")
print(genre_stats_sorted.tail(10))


 MEAN STATISTICS BY GENRE (Bottom 10 by Popularity)
                popularity  danceability  energy  valence    tempo
track_genre                                                       
idm                 15.522         0.527   0.556    0.303  123.340
kids                14.771         0.779   0.614    0.682  121.795
grindcore           14.522         0.272   0.926    0.217  119.162
classical           13.362         0.386   0.197    0.392  108.026
chicago-house       12.334         0.766   0.733    0.587  123.909
detroit-techno      11.131         0.723   0.708    0.469  126.408
latin                9.855         0.727   0.724    0.624  121.286
jazz                 9.790         0.489   0.309    0.487  115.741
romance              3.550         0.432   0.299    0.395  109.817
iranian              2.225         0.300   0.545    0.153  114.618


---
## 9️⃣ Explicit vs Non-Explicit Stats

**Why?** Explicit content flag might influence:
- Popularity (radio play, streaming restrictions)
- Audio characteristics (energy, speechiness)

In [79]:
# Compare statistics between explicit and non-explicit tracks

# Count of explicit vs non-explicit
explicit_counts = df_dedup['explicit'].value_counts()

print("üìä EXPLICIT CONTENT DISTRIBUTION")
print("=" * 50)
print(f"Non-Explicit (False): {explicit_counts.get(False, 0):,} tracks ({explicit_counts.get(False, 0)/len(df_dedup)*100:.1f}%)")
print(f"Explicit (True):      {explicit_counts.get(True, 0):,} tracks ({explicit_counts.get(True, 0)/len(df_dedup)*100:.1f}%)")
print("=" * 50)

# Compare mean statistics
explicit_stats = df_dedup.groupby('explicit')[key_features].mean().round(3)

print("\n📊 MEAN STATISTICS: EXPLICIT vs NON-EXPLICIT")
print("=" * 60)
print(explicit_stats)
print("=" * 60)

# Calculate difference
if True in explicit_stats.index and False in explicit_stats.index:
    diff = explicit_stats.loc[True] - explicit_stats.loc[False]
    print("\n📈 DIFFERENCE (Explicit - Non-Explicit):")
    print("-" * 40)
    for feat, val in diff.items():
        direction = "↑" if val > 0 else "↓" if val < 0 else "="
        print(f"  {feat}: {val:+.3f} {direction}")

📊 EXPLICIT CONTENT DISTRIBUTION
Non-Explicit (False): 82,036 tracks (91.4%)
Explicit (True):      7,704 tracks (8.6%)

📊 MEAN STATISTICS: EXPLICIT vs NON-EXPLICIT
          popularity  danceability  energy  valence    tempo
explicit                                                    
False         32.853         0.556   0.627    0.470  122.096
True          36.886         0.631   0.719    0.467  121.658

📈 DIFFERENCE (Explicit - Non-Explicit):
----------------------------------------
  popularity: +4.033 ↑
  danceability: +0.075 ↑
  energy: +0.092 ↑
  valence: -0.003 ↓
  tempo: -0.438 ↓


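Insight: Explicit songs are slightly more popular (+4 points on average) and score higher on danceability (+0.075) and energy (+0.092); valence and tempo are nearly identical.
- This binary feature requires no transformation and may provide a useful signal for the model.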
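
To put the +4 point gap in context: measured against the overall popularity standard deviation of about 20.6, it is roughly 0.2 standard deviations. A standardized effect size such as Cohen's d makes that comparison explicit. The sketch below is a generic implementation on made-up numbers; `cohens_d` and the toy lists are illustrative names, not part of this notebook.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two samples, using the pooled std."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Made-up popularity scores, only to show the call.
explicit_pop = [40, 35, 45, 30]
clean_pop = [30, 35, 25, 40, 20]

d = cohens_d(explicit_pop, clean_pop)
print(f"Cohen's d: {d:.3f}")
```

On the real data the two inputs would presumably be the `popularity` values of the explicit and non-explicit rows of `df_dedup`.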

---
## 🔟 Key Distribution Analysis

**Why?** Musical key (C, C#, D, etc.) might affect:
- Song mood and feel
- Popularity in certain genres

Key mapping: 0=C, 1=C#/Db, 2=D, 3=D#/Eb, 4=E, 5=F, 6=F#/Gb, 7=G, 8=G#/Ab, 9=A, 10=A#/Bb, 11=B

In [81]:
# Analyze distribution of musical keys and their relationship with popularity

# Key mapping
key_names = {0: 'C', 1: 'C#/Db', 2: 'D', 3: 'D#/Eb', 4: 'E', 5: 'F',
             6: 'F#/Gb', 7: 'G', 8: 'G#/Ab', 9: 'A', 10: 'A#/Bb', 11: 'B'}

# Count and popularity by key
key_stats = df_dedup.groupby('key').agg({
    'track_id': 'count',
    'popularity': 'mean'
}).round(2)

key_stats.columns = ['Track_Count', 'Avg_Popularity']
key_stats['Key_Name'] = key_stats.index.map(key_names)
key_stats['Percentage'] = (key_stats['Track_Count'] / len(df_dedup) * 100).round(2)

# Reorder columns
key_stats = key_stats[['Key_Name', 'Track_Count', 'Percentage', 'Avg_Popularity']]
key_stats = key_stats.sort_values('Track_Count', ascending=False)

print("📊 MUSICAL KEY DISTRIBUTION")
print("=" * 60)
print(key_stats)

# Most and least popular keys
most_popular_key = key_stats.sort_values('Avg_Popularity', ascending=False).iloc[0]
least_popular_key = key_stats.sort_values('Avg_Popularity', ascending=True).iloc[0]

print(f"\n🎵 Most common key: {key_stats.iloc[0]['Key_Name']} ({key_stats.iloc[0]['Percentage']}%)")
print(f"🔝 Highest avg popularity: {most_popular_key['Key_Name']} ({most_popular_key['Avg_Popularity']})")
print(f"🔻 Lowest avg popularity: {least_popular_key['Key_Name']} ({least_popular_key['Avg_Popularity']})")

📊 MUSICAL KEY DISTRIBUTION
    Key_Name  Track_Count  Percentage  Avg_Popularity
key                                                  
7          G        10550       11.76           32.61
0          C        10352       11.54           32.72
2          D         9327       10.39           33.68
9          A         8998       10.03           32.87
1      C#/Db         8576        9.56           32.86
5          F         7308        8.14           32.97
4          E         7133        7.95           34.06
11         B         7129        7.94           33.88
6      F#/Gb         6139        6.84           33.48
10     A#/Bb         5889        6.56           32.84
8      G#/Ab         5570        6.21           33.79
3      D#/Eb         2769        3.09           33.25

🎵 Most common key: G (11.76%)
🔝 Highest avg popularity: E (34.06)
🔻 Lowest avg popularity: G (32.61)


- Conclusion: Musical key has a negligible correlation (+0.003) with popularity. Average popularity differs by only about 1.5 points between keys (32.61 for G vs 34.06 for E), which is not meaningful. This feature is unlikely to improve model performance significantly.
- I will consider dropping `key` in preprocessing to simplify the model.
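
One caveat on the correlation figure: `key` is stored as an integer from 0 to 11, but it is a nominal category, so a linear correlation on the codes assumes an ordering that does not exist musically. The correlation ratio (eta squared) is one alternative that only uses group membership. The sketch below is a generic implementation on toy data; `eta_squared` and the toy frames are illustrative names, not part of this notebook.

```python
import pandas as pd

def eta_squared(df, cat_col, num_col):
    """Share of the variance in num_col explained by group membership in cat_col."""
    grand_mean = df[num_col].mean()
    groups = df.groupby(cat_col)[num_col]
    # Between-group sum of squares: group size times squared distance of the group mean.
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((df[num_col] - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Toy case 1: both keys have the same mean popularity, so key explains nothing.
no_effect = pd.DataFrame({"key": [0, 0, 1, 1], "popularity": [10, 20, 10, 20]})
# Toy case 2: popularity is fully determined by key.
full_effect = pd.DataFrame({"key": [0, 0, 1, 1], "popularity": [10, 10, 20, 20]})

print(eta_squared(no_effect, "key", "popularity"))
print(eta_squared(full_effect, "key", "popularity"))
```

A value close to 0 on the real data would support dropping `key`; a clearly larger value would suggest one-hot encoding it instead.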

---
## 1️⃣1️⃣ Duration Analysis

**Why?** Song duration might affect:
- Streaming counts (shorter songs might get more replays)
- Radio play eligibility
- User engagement

In [83]:
# Analyze song duration patterns

# Convert milliseconds to minutes for easier interpretation
df_dedup['duration_min'] = df_dedup['duration_ms'] / 60000

print("📊 DURATION ANALYSIS")
print("=" * 50)
print(f"Minimum duration: {df_dedup['duration_min'].min():.2f} minutes ({df_dedup['duration_ms'].min()/1000:.0f} seconds)")
print(f"Maximum duration: {df_dedup['duration_min'].max():.2f} minutes")
print(f"Average duration: {df_dedup['duration_min'].mean():.2f} minutes")
print(f"Median duration:  {df_dedup['duration_min'].median():.2f} minutes")
print(f"Std deviation:    {df_dedup['duration_min'].std():.2f} minutes")
print("=" * 50)

# Duration categories
print("\n📊 DURATION DISTRIBUTION:")
print("-" * 40)
print(f"  < 2 min (short):    {(df_dedup['duration_min'] < 2).sum():,} tracks ({(df_dedup['duration_min'] < 2).sum()/len(df_dedup)*100:.1f}%)")
print(f"  2-3 min (standard): {((df_dedup['duration_min'] >= 2) & (df_dedup['duration_min'] < 3)).sum():,} tracks ({((df_dedup['duration_min'] >= 2) & (df_dedup['duration_min'] < 3)).sum()/len(df_dedup)*100:.1f}%)")
print(f"  3-4 min (standard): {((df_dedup['duration_min'] >= 3) & (df_dedup['duration_min'] < 4)).sum():,} tracks ({((df_dedup['duration_min'] >= 3) & (df_dedup['duration_min'] < 4)).sum()/len(df_dedup)*100:.1f}%)")
print(f"  4-5 min (medium):   {((df_dedup['duration_min'] >= 4) & (df_dedup['duration_min'] < 5)).sum():,} tracks ({((df_dedup['duration_min'] >= 4) & (df_dedup['duration_min'] < 5)).sum()/len(df_dedup)*100:.1f}%)")
print(f"  > 5 min (long):     {(df_dedup['duration_min'] >= 5).sum():,} tracks ({(df_dedup['duration_min'] >= 5).sum()/len(df_dedup)*100:.1f}%)")

# Remove temporary column
df_dedup.drop('duration_min', axis=1, inplace=True)

📊 DURATION ANALYSIS
Minimum duration: 0.14 minutes (9 seconds)
Maximum duration: 87.29 minutes
Average duration: 3.82 minutes
Median duration:  3.55 minutes
Std deviation:    1.88 minutes

📊 DURATION DISTRIBUTION:
----------------------------------------
  < 2 min (short):    5,514 tracks (6.1%)
  2-3 min (standard): 20,478 tracks (22.8%)
  3-4 min (standard): 32,186 tracks (35.9%)
  4-5 min (medium):   17,970 tracks (20.0%)
  > 5 min (long):     13,592 tracks (15.1%)


#### Understanding how duration relates to popularity.

In [86]:
# Create duration in minutes (temporary)
df_dedup['duration_min'] = df_dedup['duration_ms'] / 60000

# Create duration categories
df_dedup['duration_category'] = pd.cut(
    df_dedup['duration_min'],
    bins=[0, 2, 3, 4, 5, df_dedup['duration_min'].max()],
    labels=['Short (<2)', '2–3 min', '3–4 min', '4–5 min', 'Long (>5)'],
    include_lowest=True
)


In [92]:
#Which duration range has higher average / median popularity?
print("mean, median, std are of Target Variable : Popularity")
duration_popularity_stats = (
    df_dedup
    .groupby('duration_category', observed=True)['popularity']
    .agg(['count', 'mean', 'median', 'std'])
)

duration_popularity_stats


mean, median, std are of Target Variable : Popularity


                   count       mean  median        std
duration_category                                     
Short (<2)          5557  27.946014    25.0  18.043802
2–3 min            20518  32.600546    32.0  21.478855
3–4 min            32175  34.938462    36.0  21.379562
4–5 min            17924  34.766905    35.0  19.512619
Long (>5)          13566  30.057497    28.0  18.751167


- Popularity peaks for standard-length tracks (3–5 minutes, mean ≈ 35) and drops for both short (<2 min, mean ≈ 28) and long (>5 min, mean ≈ 30) tracks, indicating a meaningful non-linear relationship with the target variable.
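
The per-category counts here differ slightly from the earlier duration distribution (for example 5,557 vs 5,514 short tracks). The likely cause is bin edges: `pd.cut` closes bins on the right by default, so a track of exactly 2.00 minutes lands in "Short", while the earlier boolean masks (`>= 2`) put it in the 2-3 minute band. A small sketch of the difference, using made-up durations and ASCII labels chosen for illustration:

```python
import numpy as np
import pandas as pd

# Made-up durations in minutes, including values that sit exactly on bin edges.
minutes = pd.Series([1.5, 2.0, 2.5, 3.0, 4.0, 5.0, 6.2])
bins = [0, 2, 3, 4, 5, np.inf]
labels = ["Short (<2)", "2-3 min", "3-4 min", "4-5 min", "Long (>=5)"]

# Default behaviour: intervals are (a, b], so 2.0 falls into the first bin.
right_closed = pd.cut(minutes, bins=bins, labels=labels, include_lowest=True)

# right=False gives [a, b), which matches masks such as (x >= 2) & (x < 3).
left_closed = pd.cut(minutes, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"minutes": minutes, "right_closed": right_closed, "left_closed": left_closed}))
```

Using `np.inf` as the last edge keeps the longest track inside the final bin when `right=False`; with the column maximum as the last edge, that track would fall outside every bin.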


---
## 1️⃣2️⃣ Percentile Analysis

**Why?** Percentiles help to understand:
- Distribution of values across the dataset
- Where most of the data lies
- Extreme values (1st and 99th percentiles)

In [95]:
# Calculate key percentiles for important features

percentiles = [1, 5, 10, 25, 50, 75, 90, 95, 99]

print("📊 PERCENTILE ANALYSIS")
print("=" * 100)

# Calculate percentiles for key features
percentile_df = df_dedup[audio_features].quantile([p/100 for p in percentiles]).round(3)
percentile_df.index = [f"{p}th" for p in percentiles]

print(percentile_df)


📊 PERCENTILE ANALYSIS
      popularity  duration_ms  danceability  energy  loudness  speechiness  \
1th          0.0     61871.85         0.123   0.027   -28.542        0.026   
5th          0.0    112041.90         0.237   0.142   -18.862        0.028   
10th         0.0    136853.00         0.318   0.251   -14.695        0.030   
25th        19.0    173040.00         0.450   0.457   -10.322        0.036   
50th        33.0    213295.50         0.576   0.676    -7.185        0.049   
75th        49.0    264293.00         0.692   0.853    -5.108        0.086   
90th        60.0    332718.60         0.782   0.942    -3.723        0.183   
95th        67.0    394001.40         0.825   0.970    -3.000        0.284   
99th        78.0    546010.98         0.904   0.993    -1.628        0.619   

      acousticness  instrumentalness  liveness  valence    tempo  
1th          0.000             0.000     0.041    0.033   65.165  
5th          0.000             0.000     0.061    0.064   76

In [94]:
# Specific insight for popularity
print("\n🎯 POPULARITY PERCENTILE INSIGHTS:")
print("-" * 50)
print(f"  • 1% of songs have popularity ≤ {df_dedup['popularity'].quantile(0.01):.0f}")
print(f"  • 50% of songs have popularity ≤ {df_dedup['popularity'].quantile(0.50):.0f} (median)")
print(f"  • 90% of songs have popularity ≤ {df_dedup['popularity'].quantile(0.90):.0f}")
print(f"  • Top 1% of songs have popularity > {df_dedup['popularity'].quantile(0.99):.0f}")


🎯 POPULARITY PERCENTILE INSIGHTS:
--------------------------------------------------
  • 1% of songs have popularity ≤ 0
  • 50% of songs have popularity ≤ 33 (median)
  • 90% of songs have popularity ≤ 60
  • Top 1% of songs have popularity > 78
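
The percentile table reports a popularity of 0 at the 1st, 5th and 10th percentiles, which means at least 10% of tracks sit at exactly zero. Measuring that share directly is a quick way to see how zero-inflated the target is. A minimal sketch on made-up values (`zero_share` and `toy_popularity` are illustrative names):

```python
import pandas as pd

def zero_share(values: pd.Series) -> float:
    """Fraction of entries that are exactly zero."""
    return float((values == 0).mean())

# Made-up popularity scores, only to show the call.
toy_popularity = pd.Series([0, 0, 19, 33, 49, 60, 67, 78, 0, 45])

share = zero_share(toy_popularity)
print(f"{share:.1%} of toy tracks have popularity 0")
```

On the real data the input would presumably be `df_dedup['popularity']`.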


---
## 1️⃣3️⃣ Variance & Standard Deviation

**Why?** These measure spread/dispersion:
- High variance = data points spread far from mean
- Low variance = data points clustered near mean
- Important for feature scaling decisions

In [100]:
# Calculate variance and std for numerical features

variance = df_dedup[audio_features].var().round(4)
std_dev = df_dedup[audio_features].std().round(4)
mean_vals = df_dedup[audio_features].mean().round(4)

spread_df = pd.DataFrame({
    'Mean': mean_vals,
    'Variance': variance,
    'Std_Dev': std_dev
}).sort_values('Variance', ascending=False)

print("📊 VARIANCE & STANDARD DEVIATION")
print(spread_df)
print("=" * 60)

print("INTERPRETATION:")
print("-" * 50)
print(f"Highest variance: {spread_df.index[0]} ({spread_df.iloc[0]['Variance']:.2f})")
print(f"Lowest variance: {spread_df.index[-1]} ({spread_df.iloc[-1]['Variance']:.4f})")

📊 VARIANCE & STANDARD DEVIATION
                         Mean      Variance      Std_Dev
duration_ms       229144.3656  1.275675e+10  112945.7803
tempo                122.0581  9.070729e+02      30.1177
popularity            33.1988  4.235628e+02      20.5806
loudness              -8.4990  2.726420e+01       5.2215
acousticness           0.3283  1.145000e-01       0.3383
instrumentalness       0.1734  1.049000e-01       0.3238
valence                0.4695  6.910000e-02       0.2629
energy                 0.6345  6.580000e-02       0.2566
liveness               0.2170  3.800000e-02       0.1949
danceability           0.5622  3.120000e-02       0.1767
speechiness            0.0874  1.280000e-02       0.1133
INTERPRETATION:
--------------------------------------------------
Highest variance: duration_ms (12756749295.82)
Lowest variance: speechiness (0.0128)


Features with very different scales may need normalization.

### Notes to Myself:
-  **Observations (Feature Scaling & Normalization)**

The features exhibit very different numeric scales, ranging from:

- `duration_ms` with a standard deviation of ~113,000  
- to `speechiness` with a standard deviation of ~0.11  

`popularity`, my target variable, also operates on a scale very different from most input features and should not be used as a scaling reference.

Because of these scale differences, two model families need normalization if I choose them:

- **Distance-based models** (KNN, K-means): distances would be dominated by high-magnitude features such as `duration_ms`.  
- **Gradient-based models** (linear/logistic regression, SVM, neural networks): training converges more slowly and regularization penalizes features unevenly when scales differ.  

*[Very IMP] Tree-based models (Decision Trees, Random Forests, Gradient Boosting) are not sensitive to feature scale, so normalization is optional if I choose them.*
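
A minimal sketch of z-score standardization, assuming a simple positional train/test split on made-up values. The point it illustrates is that the mean and standard deviation are computed on the training rows only and then reused for the test rows, so no information leaks from the test set. scikit-learn's `StandardScaler` applies the same idea; plain pandas is used here to keep the example dependency-free.

```python
import numpy as np
import pandas as pd

# Made-up values on two very different scales.
toy = pd.DataFrame({
    "duration_ms": [180_000, 240_000, 200_000, 300_000, 210_000, 260_000],
    "speechiness": [0.03, 0.05, 0.30, 0.04, 0.08, 0.06],
})

# Illustrative split: first four rows for training, last two for testing.
train, test = toy.iloc[:4], toy.iloc[4:]

# Fit the scaling parameters on the training rows only.
mu = train.mean()
sigma = train.std(ddof=0)  # population std

# Apply the same parameters to both splits.
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled.round(3))
print(test_scaled.round(3))
```

After scaling, both columns have mean 0 and standard deviation 1 on the training rows, so neither dominates a distance or gradient computation.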

---
## 1️⃣4️⃣ Zero/Near-Zero Values Count

Some Spotify features have many zeros:
- `instrumentalness`: Most songs have vocals (value near 0)
- `speechiness`: Most songs aren't speech-heavy
- This affects distribution and modeling

In [101]:
# Count how many values are exactly 0 or very close to 0

print("📊 ZERO & NEAR-ZERO VALUES ANALYSIS")
print("=" * 70)
print(f"{'Feature':<18} {'Exactly 0':>12} {'< 0.001':>12} {'< 0.01':>12} {'< 0.1':>12}")
print("-" * 70)
zero_one_features = ['danceability', 'energy', 'speechiness', 'acousticness', 
                     'instrumentalness', 'liveness', 'valence']
for col in zero_one_features:
    exact_zero = (df_dedup[col] == 0).sum()
    near_zero_001 = (df_dedup[col] < 0.001).sum()
    near_zero_01 = (df_dedup[col] < 0.01).sum()
    near_zero_1 = (df_dedup[col] < 0.1).sum()
    
    print(f"{col:<18} {exact_zero:>12,} {near_zero_001:>12,} {near_zero_01:>12,} {near_zero_1:>12,}")

📊 ZERO & NEAR-ZERO VALUES ANALYSIS
Feature               Exactly 0      < 0.001       < 0.01        < 0.1
----------------------------------------------------------------------
danceability                157          157          157          624
energy                        1          111          463        3,080
speechiness                 157          157          157       70,976
acousticness                 39        9,503       19,185       36,984
instrumentalness         29,924       54,417       61,225       67,399
liveness                      2            2            4       23,746
valence                     176          313          384        7,465


### Interpretation Notes for future:
- **High zero/near-zero counts** indicate sparse or skewed features
- **Transformation candidates:**
  - `instrumentalness` → binary / binning (29,924 exact zeros)
  - `speechiness` → binning or quantile transform (heavy near-zero skew)
  - `acousticness` → log transformation with a small offset (continuous with many near-zero values; the 39 exact zeros rule out a plain log)
  - `liveness`, `energy` → optional log/power transform (if using linear models)
  - `danceability`, `valence` → no transformation required
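
A minimal sketch of two of the transformations listed above, on made-up values. The 0.5 threshold for the binary flag and the offset `eps = 1e-3` are assumptions chosen for illustration, not values decided in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up values that mimic the zero-heavy shape of the real features.
toy = pd.DataFrame({
    "instrumentalness": [0.0, 0.0, 0.0004, 0.85, 0.92],
    "acousticness":     [0.0, 0.002, 0.05, 0.40, 0.99],
})

# Binary flag: 1 if the track is mostly instrumental (threshold is an assumption).
toy["is_instrumental"] = (toy["instrumentalness"] > 0.5).astype(int)

# Log with a small offset: a plain log is undefined at 0, so eps keeps it finite.
eps = 1e-3
toy["acousticness_log"] = np.log(toy["acousticness"] + eps)

print(toy)
```

The choice of `eps` controls how far the near-zero values are spread out, so it is worth checking the transformed histogram before settling on a value.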

---

## 1️⃣5️⃣ Coefficient of Variation (CV)
CV = (Std Dev / Mean) × 100
- Allows comparison of variability across features with different scales
- CV > 100% indicates very high variability
- Useful for comparing spread regardless of units

In [102]:

# CV = (std / mean) * 100, a relative variability measure

# Calculate CV for features with non-zero mean
cv_data = []

for col in audio_features:
    mean = df_dedup[col].mean()
    std = df_dedup[col].std()
    
    if mean != 0:
        cv = (std / mean) * 100
    else:
        cv = np.nan
    
    cv_data.append({
        'Feature': col,
        'Mean': mean,
        'Std_Dev': std,
        'CV (%)': cv
    })

cv_df = pd.DataFrame(cv_data).sort_values('CV (%)', ascending=False)

print("📊 COEFFICIENT OF VARIATION (CV)")
print("Interpretation:")
print("  CV < 20%   → Low variability")
print("  CV 20-50%  → Moderate variability")
print("  CV > 50%   → High variability")
print("  CV > 100%  → Very high variability")

for _, row in cv_df.iterrows():
    cv = row['CV (%)']
    if pd.isna(cv):
        flag = "Cannot calculate (mean=0)"
    elif cv > 100:
        flag = "🔴 Very high"
    elif cv > 50:
        flag = "🟡 High"
    elif cv > 20:
        flag = "🟢 Moderate"
    else:
        flag = "✅ Low"
    
    print(f"{row['Feature']:<18} CV: {cv:>8.2f}%  {flag}")


📊 COEFFICIENT OF VARIATION (CV)
Interpretation:
  CV < 20%   → Low variability
  CV 20-50%  → Moderate variability
  CV > 50%   → High variability
  CV > 100%  → Very high variability
instrumentalness   CV:   186.75%  🔴 Very high
speechiness        CV:   129.55%  🔴 Very high
acousticness       CV:   103.06%  🔴 Very high
liveness           CV:    89.82%  🟡 High
popularity         CV:    61.99%  🟡 High
valence            CV:    55.99%  🟡 High
duration_ms        CV:    49.29%  🟢 Moderate
energy             CV:    40.44%  🟢 Moderate
danceability       CV:    31.43%  🟢 Moderate
tempo              CV:    24.67%  🟢 Moderate
loudness           CV:   -61.44%  ✅ Low


## Observation: CV Analysis Insights

 **VERY HIGH VARIABILITY (>100%)**
1. **instrumentalness, speechiness, acousticness**  
   - Insight: Extreme skew with many near-zero values mixed with few high values  
   - Preprocessing: Logarithmic/power transformations, binning, or quantile encoding  
   - Visualization: Use log-scale histograms or violin plots  
   - Modeling: Consider tree-based models (handle skew better) or create binary flags

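**HIGH VARIABILITY (50-100%)**
2. **liveness, popularity, valence**  
   - Insight: Good spread but skewed distributions  
   - Preprocessing: Moderate scaling (StandardScaler/RobustScaler) for `liveness` and `valence`; `popularity` is the target and is not scaled as an input  
   - Feature selection: `liveness` and `valence` are candidates worth testing; high variability alone does not guarantee predictive power  
   - Visualization: Box plots + distribution overlays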

**MODERATE VARIABILITY (20-50%)**
3. **duration_ms, energy, danceability, tempo**  
   - Insight: Balanced distributions with reasonable spread  
   - Scaling: Standard normalization works well  
   - Modeling: Reliable features for most algorithms  
   - Encoding: Can be used directly without heavy transformation

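**LOW VARIABILITY (<20%)**
4. **loudness** (negative CV due to negative mean)  
   - Insight: The "Low" flag is an artifact of the sign. The mean is negative (-8.50), so the CV comes out as -61.44% and falls through to the last branch; in absolute terms the spread (std 5.22) is about 61% of the mean, which would fall in the high band  
   - Preprocessing: Standard mean-centered scaling (e.g. StandardScaler)  
   - Feature selection: Judge by its relationship with the target rather than by CV  
   - Note: CV is hard to interpret for loudness because decibels are a logarithmic scale with negative values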
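
The negative CV for loudness comes from dividing by a negative mean. One common workaround is to divide by the absolute value of the mean, which gives about 61.44% for the reported loudness statistics (std 5.2215, mean -8.4990). The sketch below shows that variant on made-up values; `coefficient_of_variation` and `toy_loudness` are illustrative names, and the result should still be read with care for a decibel-scaled feature.

```python
import pandas as pd

def coefficient_of_variation(values: pd.Series) -> float:
    """CV in percent, using |mean| so a negative mean cannot flip the sign."""
    mean = values.mean()
    if mean == 0:
        return float("nan")  # CV is undefined when the mean is zero
    return float(values.std() / abs(mean) * 100)

# Made-up loudness values in dB, only to show the call.
toy_loudness = pd.Series([-12.0, -8.0, -6.0, -10.0, -4.0])

cv = coefficient_of_variation(toy_loudness)
print(f"CV: {cv:.2f}%")
```

With this variant, the threshold checks (`cv > 50`, `cv > 100`) would classify loudness by the size of its spread rather than by the sign of its mean.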

**ACTIONABLE RECOMMENDATIONS**
5. **Prioritization Strategy**:
   - High CV features → Transform first (log, bin, normalize)  
   - Moderate CV features → Scale appropriately  
   - Low CV features → Check for information value  
   - Encoding: High CV features may benefit from binning  
   - Feature selection: Use CV as variability filter alongside correlation

---
# Summary

"During EDA, I discovered 24,259 duplicate tracks (same song in multiple genres). To ensure model integrity and avoid data leakage, I removed the duplicates, keeping 89,741 unique tracks for modeling."

In [None]:
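# Drop the temporary helper column from the duration analysis (ignored if already removed)
df_dedup = df_dedup.drop(columns=['duration_min'], errors='ignore')
df_dedup = df_dedup.reset_index(drop=True)

# Save the deduplicated dataset for later steps
df_dedup.to_csv('../data/spotify_dedup.csv', index=False)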