

# **Assignment - step 1** (Group 16)

## *1 - Descriptive statistics*

---
### Descriptive Statistics — Project Intro

Descriptive statistics are the first, essential pass over a dataset to understand **"what we have"** before modeling or drawing conclusions. They summarize data without testing hypotheses.

**What we cover**
- **Size** and **data types**
- **Completeness**: missing values
- **Central tendency**: mean and median
- **Dispersion**: standard deviation, quartiles, IQR, min and max
- **Distribution shape**: skewness and kurtosis
- **Categorical features**: cardinality and category frequencies

**Why this matters for our project**
- Exposes data-quality issues like duplicates, outliers, and inconsistent labels  
- Guides cleaning steps such as imputation, type fixes, and category grouping  
- Informs Univariate and Bivariate Analysis and the choice of transformations  
  - High skew in a numeric feature may suggest a log transform  
  - Many rare categories may need consolidation  
  - Large dispersion or extreme values may call for robust methods

By documenting these summaries in our repo with small tables and brief notes, we create a reproducible baseline that justifies preprocessing decisions and ensures every teammate starts from the same, well-understood dataset.
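Several of the measures listed above (IQR, skewness, kurtosis) are not reported by `describe()`. Below is a minimal sketch of how they could be gathered into one table. The helper name `shape_summary` and the toy columns are illustrative assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd

def shape_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: median, IQR, skewness, and excess kurtosis per numeric column."""
    numeric = frame.select_dtypes(include=np.number)
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    return pd.DataFrame({
        "median": numeric.median(),
        "iqr": q3 - q1,
        "skewness": numeric.skew(),
        "excess_kurtosis": numeric.kurt(),  # pandas reports excess (Fisher) kurtosis
    })

# Toy data (made up): one symmetric column, one right-skewed column, one text column
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 1.0, 2.0, 50.0],
    "label": list("abcde"),
})
summary = shape_summary(toy)
print(summary)
```

On the real data the same helper could be called as `shape_summary(df)`; non-numeric columns are skipped automatically.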


In [41]:
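import pandas as pd
import numpy as np
from IPython.display import display  # explicit import so the cell also runs outside Jupyter

# The file is semicolon-delimited
df = pd.read_csv('group_16.csv', sep=";")

print("Dimension (lines, columns):", df.shape)

print("\nSample (5 lines):")
display(df.sample(5, random_state=42))  # fixed seed keeps the sample reproducible

# Count rows that are exact duplicates of an earlier row
rows_dup = df.duplicated().sum()
print("\nNumber of duplicated rows:", rows_dup)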

Dimension (lines, columns): (3000, 49)

Sample (5 lines):


| | duration_1 | duration_2 | duration_3 | duration_4 | duration_5 | loudness_level | popularity_level | tempo_class | time_signature | key_mode | ... | is_instrumental | is_dance_hit | temp_zscore | resonance_factor | timbre_index | echo_constant | distorted_movement | signal_power | target_class | target_regression |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1801 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.2218 | -0.6396 | ... | 0.0 | 0.0 | 0.3456 | -0.3284 | 0.2137 | 1 | 0.393 | 0.684 | class_89 | 1.424 |
| 1190 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.2218 | 1.3302 | ... | 0.0 | 0.0 | 0.729 | -0.625 | 0.7659 | 1 | -1.3242 | 0.723 | class_89 | -1.4902 |
| 1817 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 4.0 | 3.0 | 0.0 | 0.2218 | 0.486 | ... | 0.0 | 0.0 | 1.9447 | -1.463 | 0.0794 | 1 | 0.7099 | 0.374 | class_89 | 0.348 |
| 251 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.2218 | -1.2024 | ... | 0.0 | 0.0 | -0.3738 | 2.2269 | 0.6335 | 1 | 1.4244 | 0.806 | class_68 | -0.5487 |
| 2505 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.2218 | -1.2024 | ... | 0.0 | 0.0 | -0.4703 | 1.2775 | 0.2601 | 1 | -0.8171 | 0.803 | class_90 | -1.4902 |



Number of duplicated rows: 0


In [42]:
# NUMERIC VARIABLES
df_numeric_cols = df.select_dtypes(include=np.number).copy()

print("\nNumeric columns:")
print(df_numeric_cols.describe())

# Per-column overview: dtype, missing-value count, and cardinality
data_info = pd.DataFrame({
    'Data Type': df.dtypes,
    'Missing Values': df.isnull().sum(),
    'Unique Values': df.nunique()
})

data_info


Numeric columns:
       duration_1  duration_2  duration_3  duration_4  duration_5  \
count  3,000.0000  3,000.0000  3,000.0000  3,000.0000  3,000.0000   
mean       0.0370      0.1493      0.3400      0.4680      0.0057   
std        0.1888      0.3565      0.4738      0.4991      0.0751   
min        0.0000      0.0000      0.0000      0.0000      0.0000   
25%        0.0000      0.0000      0.0000      0.0000      0.0000   
50%        0.0000      0.0000      0.0000      0.0000      0.0000   
75%        0.0000      0.0000      1.0000      1.0000      0.0000   
max        1.0000      1.0000      1.0000      1.0000      1.0000   

       loudness_level  popularity_level  tempo_class  time_signature  \
count      3,000.0000        3,000.0000   3,000.0000      3,000.0000   
mean           1.6810            1.6930       1.0190          0.0870   
std            1.3381            1.0446       0.2805          0.7860   
min            0.0000            0.0000       0.0000         -6.7127   


| Column | Data Type | Missing Values | Unique Values |
|---|---|---|---|
| duration_1 | float64 | 0 | 2 |
| duration_2 | float64 | 0 | 2 |
| duration_3 | float64 | 0 | 2 |
| duration_4 | float64 | 0 | 2 |
| duration_5 | float64 | 0 | 2 |
| loudness_level | float64 | 0 | 5 |
| popularity_level | float64 | 0 | 5 |
| tempo_class | float64 | 0 | 5 |
| time_signature | float64 | 0 | 4 |
| key_mode | float64 | 0 | 24 |


In [43]:
# CATEGORICAL VARIABLES (every column with a non-numeric dtype)
df_categorical_cols = df.select_dtypes(exclude=np.number).copy()

print("\nCategoric columns:")
print(df_categorical_cols.describe())


Categoric columns:
       focus_factor target_class
count          3000         3000
unique          992            3
top             0.0     class_68
freq           1499         1000
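`focus_factor` is listed among the non-numeric columns even though its most frequent value is `0.0` and it has 992 distinct values, which suggests a numeric feature that pandas parsed as text. The cause is not visible in this output; stray text tokens or decimal commas are only guesses. Below is a minimal sketch of how such a column could be inspected and coerced, using a made-up series rather than the real data:

```python
import pandas as pd

# Made-up stand-in for a numeric column that was read as text
raw = pd.Series(["0.0", "1.25", "0,5", "unknown", "3.0"], name="focus_factor")

# Values that cannot be parsed become NaN instead of raising an error
as_number = pd.to_numeric(raw, errors="coerce")

# List the offending tokens so they can be handled explicitly
bad_tokens = raw[as_number.isna() & raw.notna()].unique().tolist()
print("Unparseable tokens:", bad_tokens)

# Assumption: a decimal comma is one possible cause; normalise it and retry
retry = pd.to_numeric(raw.str.replace(",", ".", regex=False), errors="coerce")
print("Still missing after comma fix:", int(retry.isna().sum()))
```

If decimal commas did turn out to be the cause, passing `decimal=","` to `pd.read_csv` would be an alternative to fixing the column after loading.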


## *2 - Univariate Analysis*

---

Univariate analysis examines **one variable at a time** to understand its distribution, quality, and quirks before any modeling. For **numeric** features we look at histograms, density plots, and boxplots, together with the median, spread, skewness, and outliers. For **categorical** features we review frequency tables, dominant categories, rare levels, and missing values.  
**Why it matters for our project:** it reveals issues that drive preprocessing and modeling choices. Highly skewed numeric features may need a log transform or robust methods. Zero-inflated variables may benefit from an added zero/non-zero indicator feature. Categorical features with many rare levels are candidates for consolidation or target encoding, since naive one-hot encoding would create many sparse columns. The goal is to decide cleaning, transformations, and encoding strategies feature by feature.
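The checks described above can be sketched as two small helpers. The function names, the 1.5 × IQR outlier rule, and the 5% rarity threshold are illustrative assumptions, and the data is made up.

```python
import numpy as np
import pandas as pd

def numeric_profile(s: pd.Series) -> dict:
    """Hypothetical helper: skewness, share of zeros, and count of 1.5*IQR outliers."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    return {
        "skewness": float(s.skew()),
        "zero_share": float((s == 0).mean()),
        "iqr_outliers": int(outliers.sum()),
    }

def consolidate_rare(s: pd.Series, min_share: float = 0.05) -> pd.Series:
    """Hypothetical helper: replace levels rarer than min_share with 'other'."""
    shares = s.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    return s.where(~s.isin(rare), "other")

# Made-up zero-inflated, right-skewed feature
values = pd.Series([0, 0, 0, 0, 1, 2, 3, 100], dtype=float)
profile = numeric_profile(values)
print(profile)

# log1p handles zeros and should reduce the right skew
log_skew = float(np.log1p(values).skew())
print("Skewness after log1p:", round(log_skew, 3))

# Made-up categorical feature with two rare levels
levels = pd.Series(["a"] * 60 + ["b"] * 37 + ["c"] * 2 + ["d"])
grouped = consolidate_rare(levels)
print(grouped.value_counts())
```

`np.log1p` is used instead of `np.log` because the toy feature contains zeros. For features with negative values, such as standardized columns, a log transform does not apply directly.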


## *3 - Bivariate Analysis*

---
Bivariate analysis studies **relationships between two variables**, especially **feature ↔ target**.  
For **numeric ↔ numeric**, we use scatter plots and correlation (Pearson for linear, Spearman for monotonic) to spot strength, direction, nonlinearity, and multicollinearity.  
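For **numeric ↔ categorical**, we compare distributions across groups with box or violin plots and with a test chosen by the number of groups and the shape of the data: a t-test for two groups, ANOVA for three or more, or Kruskal–Wallis when normality is doubtful.  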
For **categorical ↔ categorical**, we build contingency tables and compute chi-square and Cramér’s V.  
**Why it matters for our project:** it highlights predictive signals, warns about redundant features, and guides modeling. Strong yet nonlinear patterns suggest splines or tree-based models. High correlation between features flags multicollinearity, which can be controlled with feature selection or regularization. Clear differences in the target across the levels of a categorical feature indicate that the feature carries signal, which informs encoding and feature engineering.
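As a minimal sketch of two of the measures named above, the block below contrasts Pearson with Spearman on a monotonic but nonlinear pair and computes Cramér's V from a contingency table. It uses only pandas and NumPy; the `cramers_v` helper (without bias correction) and all data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V from a contingency table, no bias correction.

    Assumes both variables have at least two observed levels.
    """
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Made-up numeric pair: y is a monotonic but nonlinear function of x
x = pd.Series(np.arange(1, 21, dtype=float))
y = x ** 3
pearson = x.corr(y)                 # measures linear association
spearman = x.rank().corr(y.rank())  # Pearson on ranks is Spearman
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")

# Made-up categorical pairs: one identical, one independent by construction
colour = pd.Series(["red", "blue"] * 10)
same = colour.copy()
unrelated = pd.Series(["s"] * 10 + ["t"] * 10)
v_same = cramers_v(colour, same)
v_unrelated = cramers_v(colour, unrelated)
print(f"V (identical): {v_same:.3f}, V (independent): {v_unrelated:.3f}")
```

A gap between Spearman and Pearson, as in the toy pair, is one hint that a relationship is monotonic but not linear.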

