# Relationship Analysis & Feature Vetting for Machine Learning

This notebook focuses on how variables relate to each other and how those relationships inform:
- Feature selection
- Model choice
- Data preprocessing decisions

We move beyond single-variable analysis and into:
- Pairwise relationships
- Correlation structure
- Group comparisons
- Statistical significance
- Practical significance (effect size)
- Feature usefulness for classification

This is the bridge between Exploratory Data Analysis (EDA) and Modeling.


In [2]:
# Load libraries and the data set
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats  # statistical tests and linear regression

df = pd.read_csv("Music_Data.csv")


## Scatterplots (Form, Direction, Strength, Linearity)

We use scatterplots to visually assess:
- Form (linear, curved, clustered, random)
- Direction (positive, negative, none)
- Strength (tight vs diffuse)
- Linearity (can we use linear models?)

If the relationship is roughly linear: linear regression is viable.  
If it is curved: consider polynomial features or tree-based models.  
If it is a shapeless cloud: the feature likely has weak predictive power on its own.

In [None]:
plt.figure()
plt.scatter(df["tempo"], df["loudness"])
plt.xlabel("Tempo")
plt.ylabel("Loudness")
plt.title("Tempo vs Loudness")
plt.show()
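
A rough numeric companion to the visual "linear or curved?" check is to fit a straight line and a quadratic to the same points and compare how much variance each explains. The sketch below uses synthetic data (the values and the helper name `r_squared` are illustrative assumptions, not part of `Music_Data.csv`), so treat it as a pattern rather than a result.

```python
import numpy as np

# Synthetic curved relationship (hypothetical data for illustration only)
rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, 200)
y = 0.5 * x ** 2 + rng.normal(0, 0.5, 200)

def r_squared(x, y, degree):
    """Fraction of variance in y explained by a polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    predicted = np.polyval(coeffs, x)
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_linear = r_squared(x, y, 1)
r2_quadratic = r_squared(x, y, 2)

print(f"R^2 straight line: {r2_linear:.3f}")
print(f"R^2 quadratic:     {r2_quadratic:.3f}")
```

A large gap between the two values suggests the relationship is curved and a purely linear model would miss most of the signal.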


Pair plots allow us to quickly detect:
- Redundant features
- Strong relationships
- Completely uninformative features

In [None]:
sns.pairplot(df[["tempo", "loudness", "energy", "danceability"]])
plt.show()

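Covariance shows whether two variables move together (positive) or in opposite directions (negative).
However, its magnitude depends on the units of both variables, so it is hard to interpret directly.
This is why we prefer correlation, which rescales covariance to a unitless value between -1 and 1.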

In [None]:
cov_matrix = df[["tempo", "loudness", "energy", "danceability"]].cov()
cov_matrix
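
To make the unit dependence concrete, here is a minimal sketch on synthetic data (the column names `tempo_bpm`, `tempo_bps`, and `loudness_db` are invented for the example). Re-expressing tempo in beats per second changes the covariance by a factor of 60 but leaves the correlation untouched.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not from Music_Data.csv)
rng = np.random.default_rng(0)
tempo_bpm = rng.normal(120, 20, size=500)
loudness_db = -10 + 0.05 * tempo_bpm + rng.normal(0, 2, size=500)
demo = pd.DataFrame({"tempo_bpm": tempo_bpm, "loudness_db": loudness_db})

# Same tempo, different unit: beats per second instead of beats per minute
demo["tempo_bps"] = demo["tempo_bpm"] / 60

cov_bpm = demo["tempo_bpm"].cov(demo["loudness_db"])
cov_bps = demo["tempo_bps"].cov(demo["loudness_db"])
corr_bpm = demo["tempo_bpm"].corr(demo["loudness_db"])
corr_bps = demo["tempo_bps"].corr(demo["loudness_db"])

# Correlation is covariance divided by the product of the standard deviations
corr_manual = cov_bpm / (demo["tempo_bpm"].std() * demo["loudness_db"].std())

print(f"cov  (bpm): {cov_bpm:.3f}   cov  (bps): {cov_bps:.3f}")
print(f"corr (bpm): {corr_bpm:.3f}   corr (bps): {corr_bps:.3f}")
```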

Highly correlated features (for example |r| > 0.9, a common rule of thumb) often carry redundant information.
Keeping both can:
- Increase model complexity
- Increase overfitting risk
- Make linear-model coefficients unstable and hard to interpret

In practice, we often drop one.

In [None]:
corr_matrix = df[["tempo", "loudness", "energy", "danceability"]].corr()

plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
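
Reading a heatmap by eye gets tedious as the number of features grows. One possible way to list redundant pairs programmatically is sketched below on synthetic data. The 0.9 cutoff is the rule of thumb from the text, and the frame `demo` is made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: "loudness" is built to be nearly a rescaled copy of "energy"
rng = np.random.default_rng(1)
n = 300
energy = rng.uniform(0, 1, n)
demo = pd.DataFrame({
    "energy": energy,
    "loudness": -20 + 15 * energy + rng.normal(0, 0.5, n),
    "tempo": rng.normal(120, 20, n),  # unrelated to the other two
})

corr = demo.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs above the threshold are candidates for dropping one member
redundant_pairs = pairs[pairs > 0.9]
print(redundant_pairs)
```

Which member of a pair to drop is a judgment call; a common choice is to keep the one that is easier to interpret or cheaper to collect.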

The line plotted below is the **baseline linear model**, fit by ordinary least squares.
The vertical distances from each point to the line are **residuals**.
The algorithm's job is to choose the slope and intercept that minimize the sum of squared residuals.


In [None]:
x = df["tempo"]
y = df["loudness"]

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

plt.figure()
plt.scatter(x, y)
plt.plot(x, slope*x + intercept)
plt.xlabel("Tempo")
plt.ylabel("Loudness")
plt.title("Least Squares Regression Line: Tempo vs Loudness")
plt.show()

# R^2: fraction of the variance in loudness explained by a linear fit on tempo
r_value**2
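
The link between residuals and R² can be checked directly. The sketch below, on synthetic data with invented coefficients, computes the residuals by hand and confirms that `1 - SS_res / SS_tot` matches the squared correlation returned by `stats.linregress`.

```python
import numpy as np
from scipy import stats

# Synthetic linear relationship with noise (hypothetical values)
rng = np.random.default_rng(2)
x = rng.normal(120, 20, 200)
y = -15 + 0.04 * x + rng.normal(0, 1.5, 200)

fit = stats.linregress(x, y)
predicted = fit.slope * x + fit.intercept
residuals = y - predicted  # vertical distance from each point to the line

ss_res = np.sum(residuals ** 2)        # what least squares minimizes
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r_squared = 1 - ss_res / ss_tot

print(f"R^2 from residuals: {r_squared:.4f}")
print(f"R^2 from rvalue:    {fit.rvalue ** 2:.4f}")
```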

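Here genre is the classification target, and we compare one feature across two genres.<br>
Mean difference, percent change, and fold change are gut-check metrics.<br>
If the differences are tiny → the model will struggle to separate the genres with this feature.<br>
If the differences are large → the feature is a strong candidate.<br>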

In [None]:
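rock = df[df["genre"] == "Rock"]["loudness"]
jazz = df[df["genre"] == "Jazz"]["loudness"]

# Mean difference (in the feature's own units)
mean_diff = rock.mean() - jazz.mean()

# Percent change relative to the Jazz mean
# (if loudness is in dB the means are negative, which flips the sign; read with care)
percent_change = (mean_diff / jazz.mean()) * 100

# Fold change (only meaningful when both means have the same sign)
fold_change = rock.mean() / jazz.mean()

# A cell displays only its last expression, so print all three
print(f"mean diff: {mean_diff:.3f} | % change: {percent_change:.1f}% | fold change: {fold_change:.2f}")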

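Cohen's d expresses the mean difference in units of the pooled standard deviation.
Conventional interpretation of |d|:
- ~0.2 → small effect (weak feature)
- ~0.5 → medium effect
- ≥0.8 → large effect ("gold mine" feature)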

In [None]:
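def cohens_d(a, b):
    # Pooled variance weights each group's variance by its degrees of freedom,
    # so unequal group sizes are handled correctly
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

d = cohens_d(rock, jazz)
d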


## Hypothesis Testing – Is the Difference Real?

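Welch's t-test (parametric, compares means without assuming equal variances)
- p < 0.05 → the difference is statistically significant at the 5% level
- p ≥ 0.05 → the data are consistent with no difference; the gap may be noise

Mann-Whitney U (non-parametric, rank-based)
- This test makes no normality assumption, so it is safer for skewed, messy, real-world data (like audio features).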

In [None]:
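# Welch's t-test (equal_var=False: does not assume equal variances)
t_stat, p_val = stats.ttest_ind(rock, jazz, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_val:.3g}")

# Mann-Whitney U
u_stat, p_val_u = stats.mannwhitneyu(rock, jazz, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_val_u:.3g}")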
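
Statistical significance and practical significance can disagree, which is why the effect size is checked alongside the p-value. The following sketch uses two large synthetic samples whose true means differ by only 0.03 standard deviations (an invented value): the test flags the difference as significant, yet Cohen's d stays far below the "small effect" mark.

```python
import numpy as np
from scipy import stats

# Two large synthetic groups with a tiny true difference (hypothetical values)
rng = np.random.default_rng(3)
a = rng.normal(0.00, 1.0, 200_000)
b = rng.normal(0.03, 1.0, 200_000)

t_stat, p_val = stats.ttest_ind(a, b, equal_var=False)

# Cohen's d; with equal group sizes the pooled variance is the plain average
pooled_std = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_std

print(f"p-value:   {p_val:.3g}")
print(f"Cohen's d: {d:.3f}")
```

With enough rows almost any difference becomes "significant", so a feature like this would still be a weak separator despite its p-value.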

## ANOVA – Multi-Group Comparison
ANOVA asks: is the variation between genres larger than the variation within genres?<br>

If NO → this feature is likely noise for separating genres.<br>
If YES → the feature is a strong candidate.<br>

In [None]:
groups = [group["tempo"].values for name, group in df.groupby("genre")]

f_stat, p_val = stats.f_oneway(*groups)
f_stat, p_val
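
The F-statistic and p-value say whether the group means differ, not by how much. One common effect size for ANOVA is eta squared, the share of total variation that lies between groups. The sketch below computes it by hand on synthetic data (the genre means are invented) and rebuilds the F-statistic from the same sums of squares as a consistency check.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic three-genre data set (hypothetical means and spread)
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "genre": np.repeat(["Rock", "Jazz", "Pop"], 100),
    "tempo": np.concatenate([
        rng.normal(130, 15, 100),
        rng.normal(110, 15, 100),
        rng.normal(120, 15, 100),
    ]),
})

groups = [g["tempo"].to_numpy() for _, g in demo.groupby("genre")]
f_stat, p_val = stats.f_oneway(*groups)

# Split total variation into between-group and within-group parts
grand_mean = demo["tempo"].mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Eta squared: fraction of total variation explained by genre
eta_squared = ss_between / (ss_between + ss_within)

# F is the ratio of the two mean squares
k, n_total = len(groups), len(demo)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

print(f"F = {f_stat:.2f}, p = {p_val:.3g}, eta^2 = {eta_squared:.3f}")
```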


## Normalization – Preparing Fair Comparisons
Min-Max scaling maps every feature onto the range [0, 1].
This allows us to:
- Compare them visually
- Prevent features with large numeric ranges from dominating distance- or gradient-based models

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(df[["tempo", "loudness", "energy"]])

# Reuse df's index so the scaled rows stay aligned with the original rows
scaled_df = pd.DataFrame(scaled_features, columns=["tempo_scaled", "loudness_scaled", "energy_scaled"], index=df.index)
scaled_df.head()
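
With its default settings, `MinMaxScaler` applies `(x - min) / (max - min)` to each column independently. A small sketch with made-up values shows the manual formula and the scaler agreeing:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny hand-made frame (hypothetical values) so the arithmetic is easy to follow
demo = pd.DataFrame({
    "tempo": [60.0, 90.0, 120.0, 180.0],
    "loudness": [-20.0, -12.0, -8.0, -4.0],
})

# Manual min-max scaling, column by column
manual = (demo - demo.min()) / (demo.max() - demo.min())

# Same transformation via scikit-learn (default feature_range is (0, 1))
sk_scaled = MinMaxScaler().fit_transform(demo)

print(manual)
```

For example, a tempo of 90 becomes (90 - 60) / (180 - 60) = 0.25.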

At this point, we can:

- Eliminate deadweight features (no group difference, no correlation, no effect size)
- Remove redundant features (highly correlated)
- Prioritize high-signal features (large effect size + significant tests)
- Choose appropriate model families:
    - Linear relationships → Linear / Logistic Regression
    - Nonlinear relationships → Trees, kernel SVMs, Neural Nets
    - Weak separation → Expect a lower performance ceiling

This step prevents:
- Garbage-in-garbage-out modeling
- Overfitting
- Wasted training time
- Misleading accuracy

In [None]:
# Rank features by how strongly they separate genres (one-way ANOVA F-statistic)
feature_scores = []

for col in ["tempo", "loudness", "energy", "danceability"]:
    groups = [group[col].values for name, group in df.groupby("genre")]
    f_stat, p_val = stats.f_oneway(*groups)
    feature_scores.append((col, f_stat, p_val))

rank_df = pd.DataFrame(feature_scores, columns=["Feature", "F-statistic", "p-value"])
rank_df.sort_values(by="F-statistic", ascending=False)
