## Engineering Notebook

---

In this notebook, we analyze the extracted features, assessing the necessity of normalization. We also investigate possible feature selection techniques to reduce the dimensionality of the data.
The sections are organized as follows:

1. [Load Data](#Load-Data)
2. [Feature Analysis](#2.-Feature-Analysis)
    1. [Visualize Features](#2.1.-Visualize-Features)
    2. [Feature correlation](#2.2.-Feature-Correlation)
3. [Covariance analysis](#3-covariance-matrix-of-the-groups)
4. [Feature Selection](#4.-Feature-Selection)
5. [Outliers Detection](#5.-Outliers-Detection)
6. [Feature distribution](#6-Feature-Distribution)
7. [PCA](#7-PCA)
8. [Save the data](#8-Save-Data)



In [79]:
# import all the functions
import numpy as np
import os
import sys

sys.path.append("../")
import pandas as pd
from utils import remove_highly_correlated_features
from scipy.stats import spearmanr


In [80]:
def get_features_not_correlated_with_target(
    data_df: pd.DataFrame, threshold: float = 0.41
) -> pd.Index:
    # Importa le librerie necessarie
    p_values = []
    correlazione = []
    features = []
    # Calcola i coefficienti di correlazione di Kendall e i valori p per ogni coppia di colonne nel dataframe
    for col1 in data_df.columns:
        correlation, p_value = spearmanr(data_df[col1], data_df["label"])
        p_values.append(p_value)
        correlazione.append(correlation)
        features.append(col1)

    correlazione_df = pd.DataFrame(
        {"Feature": features, "Correlazione": correlazione, "P-value": p_values}
    )
    features_to_drop = correlazione_df[
        np.abs(correlazione_df["Correlazione"]) <= threshold
    ].index

    return features_to_drop


def remove_features_highly_correlated(
    data_df: pd.DataFrame, threshold: float = 0.7, max_corr_count: int = 2
) -> pd.Index:
    correlation_matrix = data_df.corr(method="spearman")
    correlation_matrix_no_target = data_df.drop(columns=["label"]).corr(
        method="spearman"
    )
    features_to_drop = remove_highly_correlated_features(
        correlation_matrix,
        correlation_matrix_no_target,
        threshold=threshold,
        max_corr_count=max_corr_count,
    )
    return features_to_drop


def get_samples(file_path: str, names: list):
    dataset = []
    if ("posterior" in file_path) or ("both" in file_path):
        data = np.load(file_path, allow_pickle=True).item()
        data = data["train_bal"]
        X = data["X"]
        y = data["y"].reshape(-1, 1)
        dataset = np.concatenate((X, y), axis=1)
    else:
        data_list = []
        for name in names:
            data = np.load(file_path, allow_pickle=True).item()
            data = data[name]
            X = data["X"]
            y = data["y"].reshape(-1, 1)
            data_combined = np.concatenate((X, y), axis=1)
            data_list.append(data_combined)
        dataset = np.concatenate(data_list, axis=0)
    return dataset

### 1. Load Data <a id='Load-Data'></a>


In [81]:
# paths to the features and the labels
FEATURE_RAW_DIR = "../../features/raw/"
FEATURE_BAL_PRIOR_DIR = "../../features/balanced/priori/"
FEATURE_BAL_POSTERIOR_DIR = "../../features/balanced/posteriori/"
FEATURE_BAL_BOTH_DIR = "../../features/balanced/both/"

feature_files = {
    "30 MFCC": "30mfcc",
    "12  Chroma": "12chroma",
    "70 CQT": "70cqt",
    "40 RMS": "41rms",
    "40 Zero Crossing Rates": "41zcr",
    "40 Spectral Centroid": "41sc",
    "60 Spectral Bandwidth": "61sb",
    "40 Spectral Rolloff": "41sr",
}

feature_names = {
    "30 MFCC": [f"MFCC {i}" for i in range(1, 31)],
    "12  Chroma": [f"Chroma {i}" for i in range(1, 13)],
    "70 CQT": [f"CQT {i}" for i in range(1, 71)],
    "40 RMS": [f"RMS {i}" for i in range(1, 42)],
    "40 Zero Crossing Rates": [f"Zero Crossing Rates {i}" for i in range(1, 42)],
    "40 Spectral Centroid": [f"Spectral Centroid {i}" for i in range(1, 42)],
    "60 Spectral Bandwidth": [f"Spectral Bandwidth {i}" for i in range(1, 62)],
    "40 Spectral Rolloff": [f"Spectral Rolloff {i}" for i in range(1, 42)],
}
names = ["artifacts", "extrahls", "murmurs", "normals", "extrastoles"]

INTERVAL=2
SR=4000

In [86]:
# List of folders to process, each corresponding to a different feature set.
FOLDERS = [
    FEATURE_RAW_DIR,
    FEATURE_BAL_PRIOR_DIR,
    FEATURE_BAL_POSTERIOR_DIR,
    FEATURE_BAL_BOTH_DIR,
]

# Iterate over each folder in FOLDERS
for folder in FOLDERS:
    # Iterate over each feature name and corresponding file in feature_files
    for feature_name, feature_file in feature_files.items():
        print(f"Processing {feature_name} in {folder}")

        # Construct the full file path for the current feature file
        file_path = os.path.join(
            folder, f"full_data_{INTERVAL}s_{SR}hz_{feature_file}.npy"
        )

        dataset = []  # Initialize an empty list to store dataset
        data = None  # Initialize data to None

        # Check if the file path indicates posterior or both types of data
        if ("posterior" in file_path) or ("both" in file_path):
            # Load the data from the file and extract the "train_bal" subset
            data = np.load(file_path, allow_pickle=True).item()
            datam = data["train_bal"]

            # Separate features (X) and labels (y), then concatenate them into one array
            X = datam["X"]
            y = datam["y"].reshape(-1, 1)
            dataset = np.concatenate((X, y), axis=1)
        else:
            # Initialize a list to store data from multiple names
            data_list = []

            # Iterate over each name in names
            for name in names:
                # Load the data from the file and extract the subset for the current name
                data = np.load(file_path, allow_pickle=True).item()
                datam = data[name]

                # Separate features (X) and labels (y), then concatenate them into one array
                X = datam["X"]
                y = datam["y"].reshape(-1, 1)
                data_combined = np.concatenate((X, y), axis=1)
                data_list.append(data_combined)  # Add the combined data to the list

            # Concatenate all data from the list into one dataset
            dataset = np.concatenate(data_list, axis=0)

        # Convert the dataset into a pandas DataFrame with feature names and label
        data_df = pd.DataFrame(dataset, columns=feature_names[feature_name] + ["label"])

        # Identify features that are not correlated with the target
        print("Features not correlated with the target")
        features_not_correlated = get_features_not_correlated_with_target(
            data_df, threshold=0.3
        )
        print(len(features_not_correlated))

        # Identify features that are highly correlated with each other
        print("Features highly correlated with each other")
        features_highly_correlated = remove_features_highly_correlated(
            data_df, threshold=0.85, max_corr_count=6
        )
        print(len(features_highly_correlated))

        # Combine features to drop from both uncorrelated and highly correlated sets
        features_to_drop = set(features_not_correlated).union(
            set(features_highly_correlated)
        )
        print("Features to drop")
        print(len(features_to_drop))

        # Drop the identified features from the dataset
        indexes = data_df.columns.get_indexer(features_to_drop)
        print(indexes)
        print("Removing features from the dataset")
        filtered_data = data
        # Check if the file path indicates posterior or both types of data
        if ("posterior" in file_path) or ("both" in file_path):
            # Load the data from the file and extract the "train_bal" subset
            for key in filtered_data.keys():
                filtered_data[key]["X"] = np.delete(
                    filtered_data[key]["X"], indexes, axis=1
                )
                print(filtered_data[key]["X"].shape)
        else:
            # Iterate over each name in names
            for name in names:
                # Load the data from the file and extract the subset for the current name
                print(filtered_data.keys())
                filtered_data[name]["X"] = np.delete(
                    filtered_data[name]["X"], indexes, axis=1
                )
                print(filtered_data[name]["X"].shape)

        print("Saving filtered data")
        # Construct the full file path for the current feature file
        save_file_path = os.path.join(
            folder, f"full_data_filtered_{INTERVAL}s_{SR}hz_{feature_file}.npy"
        )
        np.save(save_file_path, filtered_data)

Processing 30 MFCC in ../../features/raw/
Features not correlated with the target
17
Features highly correlated with each other
Removing 0 features
0
Features to drop
17
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
Removing features from the dataset
dict_keys(['artifacts', 'extrahls', 'murmurs', 'normals', 'extrastoles'])
(970, 29)
dict_keys(['artifacts', 'extrahls', 'murmurs', 'normals', 'extrastoles'])
(55, 29)
dict_keys(['artifacts', 'extrahls', 'murmurs', 'normals', 'extrastoles'])
(534, 29)
dict_keys(['artifacts', 'extrahls', 'murmurs', 'normals', 'extrastoles'])
(984, 29)
dict_keys(['artifacts', 'extrahls', 'murmurs', 'normals', 'extrastoles'])
(113, 29)
Saving filtered data
Processing 12  Chroma in ../../features/raw/
Features not correlated with the target
1
Features highly correlated with each other
Removing 0 features
0
Features to drop
1
[-1]
Removing features from the dataset
dict_keys(['artifacts', 'extrahls', 'murmurs', 'normals', 'extrastoles'])
(970, 11)
dict_ke

### 2. Feature Analysis


#### Compute the correlation coefficient between the features and the target variable.
