# Data mining project 22/23

## Import libraries

In [1]:
import os
os.environ["OMP_NUM_THREADS"] = "10" # to avoid possible memory leak with KMeans

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# zscore
from scipy.stats import zscore
# kmeans, bisecting kmeans, dbscan, hierarchical (sklearn)
from sklearn.cluster import KMeans, BisectingKMeans, DBSCAN, AgglomerativeClustering
# scaling, normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# distance matrix (dbscan elbow, hierarchical)
from scipy.spatial.distance import pdist, squareform
# hierarchical (scipy)
from scipy.cluster.hierarchy import linkage, dendrogram

## Theory

To understand each variable, let's first review some audio concepts. The variables of the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) are listed after these definitions.

A **sample** is the value of an audio wave over a certain (small) interval.

The **sample rate** is the number of samples taken per second. It is a frequency, measured in hertz (Hz) and often quoted in kilohertz (kHz). Remember that the more often you take samples of the original audio, the closer to the original you can get.

To understand the difference between a sample and a frame, consider these formulas:

* Sample rate = number of samples / second
* Frame = 1 sample from each channel (PCM)
* Frame Size = Sample size * Channels
* Frame Rate = frames / second

For PCM, which is a digital representation of an analog signal, the sample rate and the frame rate are the same, since a frame consists of one sample from each channel.
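
As a quick check of the formulas above, the following minimal sketch computes the frame size and the duration of a PCM clip. The function names and the numeric values (16-bit mono audio at 48 kHz) are illustrative assumptions, not values read from the dataset.

```python
def frame_size_bytes(sample_width_bytes: int, channels: int) -> int:
    """Frame size = sample size * number of channels."""
    return sample_width_bytes * channels


def duration_ms(frame_count: int, frame_rate_hz: int) -> float:
    """Duration in milliseconds = frames / (frames per second) * 1000."""
    return 1000 * frame_count / frame_rate_hz


# Hypothetical clip: 16-bit (2-byte) mono audio sampled at 48 kHz.
example_frame_size = frame_size_bytes(sample_width_bytes=2, channels=1)
example_duration = duration_ms(frame_count=96_000, frame_rate_hz=48_000)
print(example_frame_size, example_duration)
```

With one channel the frame size equals the sample width, which matches the relation between `sample_width` and `frame_width` described in the variable list.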

The sample size is the size of an individual sample, also called **bit depth** or **sample width**. It indicates how many bits (or bytes) of information a sample contains and is an important factor in the quality/resolution of the audio.

An **audio channel** is the path through which a signal or data is delivered, i.e., it's where a sound signal is conveyed from the player source to the speaker. With one channel we talk about **mono**; with two channels we talk about **stereo**. For instance, in stereo sound there are two audio sources: one speaker on the left and one on the right. Each of these is represented by one channel.

An **audio frame** is a data record that contains the samples of all the channels available in an audio signal at the same point in time.

The **zero-crossing rate** is the rate at which a signal changes from positive to zero or negative, and from negative to zero or positive. It's a measure of the smoothness of the signal. The zero-crossing rate can be utilized as a basic pitch detection algorithm for monophonic tonal signals and is a key feature to classify percussive sounds.
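
To make the definition concrete, here is a minimal sketch that counts sign changes in a toy signal and divides by the clip duration. The helper names and the toy values are assumptions made for illustration; the dataset's own `zero_crossings_sum` may have been computed with a different convention (for example in how exact zeros are handled).

```python
import numpy as np


def count_zero_crossings(signal):
    """Count sign changes between consecutive samples (0 is treated as positive)."""
    negative = np.signbit(np.asarray(signal, dtype=float))
    return int(np.count_nonzero(negative[:-1] != negative[1:]))


def zero_crossing_rate(signal, duration_s):
    """Zero crossings per second for a clip lasting duration_s seconds."""
    return count_zero_crossings(signal) / duration_s


toy_signal = [1.0, -1.0, 2.0, 3.0, -0.5]  # three sign changes
toy_crossings = count_zero_crossings(toy_signal)
toy_rate = zero_crossing_rate(toy_signal, duration_s=0.5)
print(toy_crossings, toy_rate)
```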

**Mel-Frequency Cepstral Coefficients** (MFCCs) are a small set of features (usually about 10-20) that concisely describe the overall shape of a spectral envelope.

**Spectral Centroid** indicates where the center of mass of the spectrum is located, and it is a good predictor of the 'brightness' of a sound, which depends on the distribution of total power between high and low frequencies. It can also be seen as the amplitude-weighted mean of the frequency components.
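
The "amplitude-weighted mean" reading can be sketched in a few lines. The function name and the three-bin toy spectra below are hypothetical; a real spectral centroid is computed per analysis frame from a full magnitude spectrum.

```python
import numpy as np


def spectral_centroid(frequencies_hz, magnitudes):
    """Amplitude-weighted mean of the frequency components."""
    frequencies_hz = np.asarray(frequencies_hz, dtype=float)
    magnitudes = np.asarray(magnitudes, dtype=float)
    return float(np.sum(frequencies_hz * magnitudes) / np.sum(magnitudes))


# Toy spectra over the same three frequency bins.
balanced = spectral_centroid([100, 200, 300], [1, 2, 1])
brighter = spectral_centroid([100, 200, 300], [1, 1, 4])  # more energy at 300 Hz
print(balanced, brighter)
```

Moving energy towards the higher bin raises the centroid, which is the sense in which it tracks brightness.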

**STFT chromagram**: the Fourier transform converts a time-dependent signal into a frequency-dependent one; applying it to local sections of an audio signal gives the short-time Fourier transform (STFT). The chroma feature, or chromagram, of an audio signal represents the intensity of the twelve distinct pitch classes used to study music.

* 'Modality': media file types (audio-only).
* 'Actor' and 'Sex': identifier of the actor (01 to 24) and the actor's sex (M or F).
* 'statement': phrase repeated by the actors ("Kids are talking by the door", "Dogs are sitting by the door").
* 'repetition': number of repetitions (1st repetition, 2nd repetition).
* 'vocal channel': type of channel (speech or song).
* 'Emotion': the emotion of the speaker (neutral, calm, happy, sad, angry, fearful, disgust, surprised).
* 'Emotional intensity': level of emotion of each expression (normal, strong). NOTE: There is no strong intensity for the 'neutral' emotion.
* sample_width: number of bytes of storage needed to save the sample (1 means 8-bit, 2 means 16-bit).
* frame_rate: frequency of samples used (in Hertz).
* frame_width: Number of bytes for each frame. One frame contains a sample for each channel.
* length_ms: audio file length in milliseconds.
* frame_count: number of frames in the audio file.
* intensity: loudness in dBFS, which is dB relative to the maximum possible loudness.
* zero_crossings_sum: sum of the zero-crossing rate.
* 'mean', 'std', 'min', 'max', 'kur', 'skew': statistics of the original audio signal.
* mfcc_ 'mean', 'std', 'min', 'max': statistics of the Mel-Frequency Cepstral Coefficients.
* sc_ 'mean', 'std', 'min', 'max', 'kur', 'skew': statistics of the spectral centroid.
* stft_ 'mean', 'std', 'min', 'max', 'kur', 'skew': statistics of the stft chromagram.

Measures to understand the shape of the data:

**Skewness** (skew) measures the asymmetry of the distribution.

**Kurtosis** (kur) measures the heaviness of the distribution tails, i.e., provides an indication of the presence of outliers.
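
As an illustration of these two measures, the sketch below compares a symmetric toy series with one that has a single large value. The data are made up; note that pandas reports *excess* kurtosis (a normal distribution scores 0), and it is an assumption that the dataset's `kur` columns use the same convention.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 100])  # one extreme value

symmetric_skew = symmetric.skew()   # close to 0 for a symmetric sample
right_skew = right_tailed.skew()    # positive: long tail to the right
symmetric_kurt = symmetric.kurt()   # negative: flatter than a normal distribution
right_kurt = right_tailed.kurt()    # large: the extreme value makes the tail heavy

print(symmetric_skew, right_skew)
print(symmetric_kurt, right_kurt)
```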

### Data semantics

## Classification of the variables
* **Nominal/Categorical:** actor, sex, modality, statement, repetition, vocal channel, emotion.
* **Ordinal:**  emotional intensity.
* **Numeric:**   /
* **Ratio-Scaled:** length_ms, zero_crossings_sum, frame_rate, frame_width, sample_width, stft_mean, stft_std, stft_min, stft_max, stft_kur, stft_skew, sc_mean, sc_std, sc_min, sc_max, sc_kur, sc_skew, mfcc_mean, mfcc_std, mfcc_min, mfcc_max, 'std', 'min', 'max', 'kur', 'skew'.

Does the classification of the variables depend on the real data?
**The data contained in the dataset are enough.**

What does "characteristics of variables" mean? Is it just about the classification and the domain?
**Characteristics means statistics such as standard deviation, mean and median, whether the variable is continuous or categorical, etc.**

For emotion, is it nominal or ordinal? What about partially ordered sets?
**Emotion is nominal, and we treat partially ordered sets as unordered.**

Possible tuples of features we can analyze:

* EI with SW (sample width; this may not make sense, since SW always has the value 2)


**Distributions in Claudio's notebook**:
* emotion x sex (s/c)
* emotion x intensity (box plot)
* emotion (pie chart - histo)
* audio length (hist)
* vocal channel x sex (stacked chart)
* sex with  emotional intensity (EI)
* SW - Frame Width (it doesn't make sense: SW always has the same value and frame width almost always has the same value)
* Frame rate - ZC (it doesn't make sense: frame rate always has the same value)
* Frame count - ZC
* Length_ms - Frame count
* EI with statement (stat), length, intensity (I), Zero-crossing sum (ZC)

**Tasks for next meeting**
1. To understand the meaning of each variable.
2. To think about the characteristics of each variable, for instance _what does it mean that one sc_skew is higher or lower than the others?_
3. If you have time, to think about the statistical analysis we can associate to pairs of variables.


## Data understanding and preparation

In [None]:
df = pd.read_csv("ravdess_features.csv") # read csv file (dataset)

In [None]:
df

In [None]:
df.shape # shape of dataset (rows, columns)

In [None]:
df.iloc[0]

In [None]:
df.describe().round(2) # some descriptive statistics

In [None]:
df.T

In [None]:
df["stft_mean"].describe()

In [None]:
nunique = df.nunique()
nunique

In [None]:
to_delete = []
for key, value in nunique.items():
    if(value == 1):
        print("To delete: ", key)
        to_delete.append(key)
    if(value > 1 and value < 100):
        print("To evaluate: ", key)

In [None]:
df = df.drop(columns=to_delete)

In [None]:
df.shape

In [None]:
df["emotion"].unique()

In [None]:
def get_emotion_positivity(x):
    if x in ('fearful', 'angry', 'sad', 'disgust'):
        return -1
    if x in ('happy', 'surprised'): # is calm positive?
        return 1
    return 0

df["emotion_positivity"] = df["emotion"].map(get_emotion_positivity)
df["emotion_positivity"]

In [None]:
print("Emotion positivity mean:", df["emotion_positivity"].mean())
print("Emotion positivity std:", df["emotion_positivity"].std())

In [None]:
df["length"] = df["length_ms"] / 1000
df = df.drop(columns=["length_ms"])

In [None]:
df["zero_crossings_sum"]

In [None]:
df["zero_crossings_rate"] = df["zero_crossings_sum"] / df["length"]
df["zero_crossings_rate"]

In [None]:
df = df.drop(columns=["zero_crossings_sum"])

### MFCC statistics

In [None]:
max_mfccMean = df["mfcc_mean"].max()
min_mfccMean = df["mfcc_mean"].min()
max_mfccStd = df["mfcc_std"].max()
min_mfccStd = df["mfcc_std"].min()
min_mfccMin = df["mfcc_min"].min()
max_mfccMax = df["mfcc_max"].max()
print("Max value of mfcc_mean is: ", round(max_mfccMean, 2))
print("Min value of mfcc_mean is: ", round(min_mfccMean, 2))
print("Max value of mfcc_std is: ", round(max_mfccStd, 2))
print("Min value of mfcc_std is: ", round(min_mfccStd, 2))
print("Min value of mfcc_min is: ", round(min_mfccMin, 2))
print("Max value of mfcc_max is: ", round(max_mfccMax, 2))

### SC statistics

In [None]:
max_scMean = df["sc_mean"].max()
min_scMean = df["sc_mean"].min()
max_scStd = df["sc_std"].max()
min_scStd = df["sc_std"].min()
min_scMin = df["sc_min"].min()
max_scMax = df["sc_max"].max()
max_scKur = df["sc_kur"].max()
min_scKur = df["sc_kur"].min()
max_scSkew = df["sc_skew"].max()
min_scSkew = df["sc_skew"].min()
print("Max value of sc_mean is: ", round(max_scMean, 2))
print("Min value of sc_mean is: ", round(min_scMean, 2))
print("Max value of sc_std is: ", round(max_scStd, 2))
print("Min value of sc_std is: ", round(min_scStd, 2))
print("Min value of sc_min is: ", round(min_scMin, 2))
print("Max value of sc_max is: ", round(max_scMax, 2))
print("Max value of sc_kur is: ", round(max_scKur, 2))
print("Min value of sc_kur is: ", round(min_scKur, 2))
print("Max value of sc_skew is: ", round(max_scSkew, 2))
print("Min value of sc_skew is: ", round(min_scSkew, 2))

### STFT chromagram statistics

In [None]:
max_stftMean = df["stft_mean"].max()
min_stftMean = df["stft_mean"].min()
max_stftStd = df["stft_std"].max()
min_stftStd = df["stft_std"].min()
min_stftMin = df["stft_min"].min()
#max_stftMax = df["stft_max"].max()
max_stftKur = df["stft_kur"].max()
min_stftKur = df["stft_kur"].min()
max_stftSkew = df["stft_skew"].max()
min_stftSkew = df["stft_skew"].min()
print("Max value of stft_mean is: ", round(max_stftMean, 2))
print("Min value of stft_mean is: ", round(min_stftMean, 2))
print("Max value of stft_std is: ", round(max_stftStd, 2))
print("Min value of stft_std is: ", round(min_stftStd, 2))
print("Min value of stft_min is: ", round(min_stftMin, 2))
#print("Max value of stft_max is: ", round(max_stftMax, 2))
print("Max value of stft_kur is: ", round(max_stftKur, 2))
print("Min value of stft_kur is: ", round(min_stftKur, 2))
print("Max value of stft_skew is: ", round(max_stftSkew, 2))
print("Min value of stft_skew is: ", round(min_stftSkew, 2))

### Checking syntactic accuracy

Here we check whether there are strange values or entries outside the domain of each variable. At the same time, we check the mode of each variable. As a result, we find that the values of all nominal/categorical and ordinal variables seem to be syntactically accurate.
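
The cell that performed this check is not shown in this section. A minimal sketch of such a check could look as follows; the helper `out_of_domain` and the toy data (with a deliberate misspelling) are hypothetical, and the allowed domain is taken from the variable descriptions above.

```python
import pandas as pd


def out_of_domain(series, allowed):
    """Return the distinct non-missing values of `series` that are not in `allowed`."""
    values = series.dropna().unique()
    return sorted(v for v in values if v not in allowed)


# Toy column with one deliberately misspelled entry.
toy = pd.DataFrame({"emotional_intensity": ["normal", "strong", "normal", "strnog"]})
allowed_intensity = {"normal", "strong"}

print(toy["emotional_intensity"].value_counts())  # frequency of each value
toy_mode = toy["emotional_intensity"].mode().tolist()
unexpected = out_of_domain(toy["emotional_intensity"], allowed_intensity)
print(toy_mode, unexpected)
```

On the real data the same helper would be applied to each nominal or ordinal column with its own allowed set.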

### Checking semantic accuracy

In [None]:
actors = df[["actor", "sex"]]
actors.value_counts()

In [None]:
emo_EI = df[["emotion", "emotional_intensity"]]
emo_EI.value_counts()

We check whether there are semantic inconsistencies in the pairs actor-sex (for instance, the same actor appearing with different sexes) and emotion-emotional_intensity (there must be no _strong_ emotional_intensity values with _neutral_ emotion values). There seem to be no semantic inconsistencies.
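
The two checks can also be expressed as explicit tests rather than read off the value counts. The sketch below runs on a small made-up frame that uses the same column names; it is one possible formulation, not the check the notebook actually ran.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values are made up).
toy = pd.DataFrame({
    "actor": [1, 1, 2, 2],
    "sex": ["M", "M", "F", "F"],
    "emotion": ["neutral", "happy", "neutral", "angry"],
    "emotional_intensity": ["normal", "strong", "normal", "strong"],
})

# An actor is inconsistent if it appears with more than one distinct sex.
sexes_per_actor = toy.groupby("actor")["sex"].nunique()
inconsistent_actors = sexes_per_actor[sexes_per_actor > 1].index.tolist()

# Rows that combine the neutral emotion with strong intensity should not exist.
neutral_strong = toy[(toy["emotion"] == "neutral") & (toy["emotional_intensity"] == "strong")]

print(inconsistent_actors, len(neutral_strong))
```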

### NaN values

In [None]:
df.isna().sum()

In this way we check whether there are NaN (missing) values and how many there are.

In [None]:
(df == 0).sum() # doing with "0" or 0.0 doesn't change anything

### Charts and relations

In [None]:
emotion_sex = pd.crosstab(df["sex"], df["emotion"])
emotion_sex

In [None]:
sns.set_context("notebook", font_scale = 1.1, rc = {"font.size": 16, "axes.titlesize": 16, "axes.labelsize": 16}) # set the
# context of chart, in this case this is a notebook, and some other size and scale
plt.rcParams["figure.figsize"] = [15, 10] # it configures the size of chart (x and y axes)

# stacked chart of emotion per sex
emotion_sex.plot(kind="bar", stacked = True)
plt.title("Distribution of emotion per sex")
plt.xlabel("Sex")
plt.ylabel("Count")
plt.xticks(rotation = 0)
#plt.savefig("stacked_vocalChannel.png") # it saves the chart as .png image
plt.show()

In [None]:
sns.set_context("notebook", font_scale = 1.1, rc = {"font.size": 16, "axes.titlesize": 16, "axes.labelsize": 16}) # set the
# context of chart, in this case this is a notebook, and some other size and scale
plt.rcParams["figure.figsize"] = [15, 10] # it configures the size of chart (x and y axes)

# grouped (non-stacked) chart of emotion per sex
emotion_sex.plot(kind="bar", stacked = False)
plt.title("Distribution of emotion per sex")
plt.xlabel("Sex")
plt.ylabel("Count")
plt.xticks(rotation = 0)
#plt.savefig("stacked_vocalChannel.png") # it saves the chart as .png image
plt.show()

In [None]:
vocal_channel = pd.crosstab(df["sex"], df["vocal_channel"])
vocal_channel

In [None]:
# stacked chart of vocal channel per sex
vocal_channel.plot(kind="bar", stacked = True)
plt.title("Distribution of vocal channel per sex")
plt.xlabel("Sex")
plt.ylabel("Count")
plt.xticks(rotation = 0)
#plt.savefig("stacked_vocalChannel.png") # it saves the chart as .png image
plt.show()

In [None]:
sns.boxplot(x = "emotion", y = "intensity", data = df)
plt.title("Emotion per intensity")
plt.xlabel("Emotion")
plt.ylabel("Intensity")
#plt.savefig("boxplot_emotionIntensity.png")
plt.show()

In [None]:
sizes = df["emotion"].value_counts()
labels = sizes.index # same order as sizes; df["emotion"].unique() may be ordered differently

In [None]:
plt.pie(sizes, labels = labels, autopct = '%1.1f%%', shadow = False, startangle = 180, textprops = {'fontsize': 13})
plt.title("Emotion distribution")
plt.tight_layout() # it fixes padding around the chart
#plt.savefig("pie_emotion.png")
plt.show()

In [None]:
sns.histplot(df["length"])
plt.title("Distribution of audio length")
plt.xlabel("Audio length")
plt.ylabel("Count")
plt.axvline(df['length'].mean(), color = "k", linestyle = "--") # mean (black dashed line)
plt.axvline(df['length'].median(), color = "r", linestyle = "--") # median (red dashed line)
plt.tight_layout()
#plt.savefig("histplot_lenghtms.png")
plt.show()

In [None]:
emoInt_sex = pd.crosstab(df["sex"], df["emotional_intensity"])
emoInt_sex

In [None]:
# stacked chart of emotional intensity per sex
emoInt_sex.plot(kind="bar", stacked = True)
plt.title("Distribution of emotional intensity per sex")
plt.xlabel("Sex")
plt.ylabel("Count")
plt.xticks(rotation = 0)
#plt.savefig("stacked_vocalChannel.png") # it saves the chart as .png image
plt.show()

In [None]:
sns.regplot(x = "frame_count", y = "zero_crossings_rate", fit_reg = False, data = df)
plt.title("Frame count per zero crossings rate")
plt.xlabel("Frame count")
plt.ylabel("Zero crossings rate")
#plt.savefig("scatter_fc-zcs.png")
plt.show()

Here we can notice missing or incorrect frame counts, because there are some 0 values on the _frame count_ axis.

**Calculate the outliers with interpolation.**

In [None]:
sns.regplot(x = "length", y = "frame_count", fit_reg = False, data = df)
plt.title("Length (in s) per frame count")
plt.xlabel("Length (in s)")
plt.ylabel("Frame count")
#plt.savefig("scatter_lms-fc.png")
plt.show()

In [None]:
df = df.drop(columns=["frame_count"])

Also here we can see that there are some incorrect or missing values on the _frame count_ axis.

**Calculate the outliers with interpolation.**

In [None]:
EI_stat = pd.crosstab(df["emotional_intensity"], df["statement"])
EI_stat

In [None]:
EI_stat.plot(kind="bar", stacked = True)
plt.title("Distribution of emotional intensity per statement")
plt.xlabel("Emotional intensity")
plt.ylabel("Count")
plt.xticks(rotation = 0)
#plt.savefig("stacked_EI-stat.png")
plt.show()

In [None]:
sns.boxplot(x = "emotional_intensity", y = "length", data = df)
plt.title("Emotional intensity per length")
plt.xlabel("Emotional intensity")
plt.ylabel("Length")
#plt.savefig("boxplot_EI-length.png")
plt.show()

In [None]:
sns.boxplot(x = "emotional_intensity", y = "intensity", data = df)
plt.title("Emotional intensity per intensity")
plt.xlabel("Emotional intensity")
plt.ylabel("Intensity")
#plt.savefig("boxplot_EI-intensity.png")
plt.show()

In [None]:
sns.boxplot(x = "emotional_intensity", y = "zero_crossings_rate", data = df)
plt.title("Emotional intensity per zero crossings rate")
plt.xlabel("Emotional intensity")
plt.ylabel("Zero crossings rate")
#plt.savefig("boxplot_EI-ZCS.png")
plt.show()

In [None]:
plt.hist(df.emotion)
plt.show()

In [None]:
em_intensity_em = pd.crosstab(df["emotion"],df["emotional_intensity"])
em_intensity_em.plot(kind="bar", stacked = True)
plt.show()

In [None]:
df['statement'] = df['statement'].replace(['Dogs are sitting by the door'], '0')
df['statement'] = df['statement'].replace(['Kids are talking by the door'], '1')

In [None]:
em_stat = pd.crosstab(df.emotion,df.statement)
em_stat.plot(kind = 'bar', stacked = True)
plt.show()

In [None]:
em_sex = pd.crosstab(df.emotion,df.sex)
em_sex.plot(kind="bar",stacked = True)
plt.show()

In [None]:
sns.boxplot(x = "emotional_intensity", y = "mfcc_min", data = df)
plt.title("Emotional intensity per minimum mfcc")
plt.xlabel("Emotional intensity")
plt.ylabel("Minimum mfcc")
#plt.savefig("boxplot_EI-minMFCC.png")
plt.show()

In [None]:
sns.boxplot(x = "sex", y = "stft_mean", data = df)
plt.title("stft_mean per sex")
plt.xlabel("Sex")
plt.ylabel("Stft mean")
#plt.savefig("boxplot_EI-minMFCC.png")
plt.show()

In [None]:
sns.regplot(x = "length", y = "zero_crossings_rate", fit_reg = False, data = df)
plt.title("Length per zero crossings rate")
plt.xlabel("Length")
plt.ylabel("Zero crossings rate")
#plt.savefig("scatter_fc-zcs.png")
plt.show()

In [None]:
density = df["mean"].plot.density(color='green')

In [None]:
df["std"].plot.kde(color='green')
plt.show()

In [None]:
df["min"].plot.kde(color='green')
plt.show()

In [None]:
#plt.scatter(df["max"],df["stft_max"])
#plt.show()

In [None]:
plt.scatter(df["max"],df["sc_max"])
plt.show()

In [None]:
plt.scatter(df["length"],df["sc_max"])
plt.show()

In [None]:
plt.scatter(df["mfcc_mean"],df["stft_skew"])
plt.show()

In [None]:
max_features = df[["max","sc_max"]]
fig, ax = plt.subplots(1,1)
for s in max_features.columns:
    df[s].plot(kind='density', ax=ax, label=s)
ax.legend() # tell the curves apart
plt.show()

In [None]:
df["sc_max"].plot.kde(color='green')
plt.show()

In [None]:
df["max"].plot.kde(color='green')
plt.show()

In [None]:
df["zero_crossings_rate"].plot.kde(color='green')
plt.show()

In [None]:
df["intensity"].plot.kde(color='green')
plt.show()

In [None]:
skew_features = df[["sc_skew","stft_skew","skew"]]
fig, ax = plt.subplots(1,1)
for s in skew_features.columns:
    df[s].plot(kind='density', ax=ax, label=s)
ax.legend() # tell the curves apart
plt.show()

In [None]:
kur_features = df[["sc_kur","stft_kur"]]
fig, ax = plt.subplots(1,1)
for s in kur_features.columns:
    df[s].plot(kind='density', ax=ax, label=s)
ax.legend() # tell the curves apart
plt.show()

In [None]:
df["kur"].plot.kde(color='green')
plt.show()

Frame count: interpolation

Missing values: 
Vocal channel...


## Outliers - Single attribute

**Definition of outliers computation**

In [None]:
def outliers(variable):
    # Tukey's rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
    Q1 = df[variable].quantile(0.25, interpolation = 'linear')
    Q3 = df[variable].quantile(0.75, interpolation = 'linear')
    IQR = Q3 - Q1
    lower_fence = Q1 - 1.5*IQR
    upper_fence = Q3 + 1.5*IQR
    is_outlier = (df[variable] < lower_fence) | (df[variable] > upper_fence)
    return df.loc[is_outlier, variable].round(2)
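
To see the fences at work on numbers that can be checked by hand, here is a self-contained variant of the same rule applied to a toy series. The function name `iqr_outliers` and the data are made up for illustration.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


toy = pd.Series([1, 2, 3, 4, 5, 100])
# Q1 = 2.25, Q3 = 4.75, IQR = 2.5, so the fences are -1.5 and 8.5.
flagged = iqr_outliers(toy)
print(flagged.tolist())
```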

In [None]:
sns.boxplot(x = "length", data = df)
plt.title("Length (in s)")
plt.xlabel("Length (in s)")
#plt.savefig("boxplot_length.png")
plt.show()

In [None]:
print(outliers("length"))

In [None]:
sns.boxplot(x = "zero_crossings_rate", data = df)
plt.title("zero_crossings_rate")
plt.xlabel("zero_crossings_rate")
#plt.savefig("boxplot_length.png")
plt.show()

In [None]:
sns.boxplot(x = "frame_width", data = df)
plt.title("Frame width")
plt.xlabel("Frame width")
#plt.savefig("boxplot_frameWidth.png")
plt.show()

In [None]:
print(outliers("frame_width"))

In [None]:
sns.boxplot(x = "stft_mean", data = df)
plt.title("STFT mean")
plt.xlabel("STFT mean")
#plt.savefig("boxplot_stftMean.png")
plt.show()

In [None]:
print(outliers("stft_mean"))

In [None]:
sns.boxplot(x = "stft_std", data = df)
plt.title("STFT std")
plt.xlabel("STFT std")
#plt.savefig("boxplot_stftSTD.png")
plt.show()

In [None]:
print(outliers("stft_std"))

In [None]:
sns.boxplot(x = "stft_min", data = df)
plt.title("STFT min")
plt.xlabel("STFT min")
#plt.savefig("boxplot_stftMin.png")
plt.show()

In [None]:
print(outliers("stft_min")) # if the output is too long to visualize look at the "Length" property

In [None]:
sns.boxplot(x = "stft_kur", data = df)
plt.title("STFT kurtosis")
plt.xlabel("STFT kurtosis")
#plt.savefig("boxplot_stftKur.png")
plt.show()

In [None]:
print(outliers("stft_kur"))

In [None]:
sns.boxplot(x = "stft_skew", data = df)
plt.title("STFT skewness")
plt.xlabel("STFT skewness")
#plt.savefig("boxplot_stftSkew.png")
plt.show()

In [None]:
print(outliers("stft_skew"))

In [None]:
sns.boxplot(x = "sc_mean", data = df)
plt.title("SC mean")
plt.xlabel("SC mean")
#plt.savefig("boxplot_scMean.png")
plt.show()

In [None]:
print(outliers("sc_mean"))

In [None]:
sns.boxplot(x = "sc_std", data = df)
plt.title("SC std")
plt.xlabel("SC std")
#plt.savefig("boxplot_scSTD.png")
plt.show()

In [None]:
print(outliers("sc_std"))

In [None]:
sns.boxplot(x = "sc_max", data = df)
plt.title("SC max")
plt.xlabel("SC max")
#plt.savefig("boxplot_scMax.png")
plt.show()

In [None]:
print(outliers("sc_max"))

In [None]:
sns.boxplot(x = "sc_kur", data = df)
plt.title("SC kurtosis")
plt.xlabel("SC kurtosis")
#plt.savefig("boxplot_scKur.png")
plt.show()

In [None]:
print(outliers("sc_kur"))

In [None]:
sns.boxplot(x = "sc_skew", data = df)
plt.title("SC skewness")
plt.xlabel("SC skewness")
#plt.savefig("boxplot_scSkew.png")
plt.show()

In [None]:
print(outliers("sc_skew"))

In [None]:
sns.boxplot(x = "mfcc_mean", data = df)
plt.title("MFCC mean")
plt.xlabel("MFCC mean")
#plt.savefig("boxplot_mfccMean.png")
plt.show()

In [None]:
print(outliers("mfcc_mean"))

In [None]:
sns.boxplot(x = "mfcc_std", data = df)
plt.title("MFCC std")
plt.xlabel("MFCC std")
#plt.savefig("boxplot_mfccSTD.png")
plt.show()

In [None]:
print(outliers("mfcc_std"))

In [None]:
sns.boxplot(x = "mfcc_min", data = df)
plt.title("MFCC min")
plt.xlabel("MFCC min")
#plt.savefig("boxplot_mfccMin.png")
plt.show()

In [None]:
print(outliers("mfcc_min"))

In [None]:
sns.boxplot(x = "mfcc_max", data = df)
plt.title("MFCC max")
plt.xlabel("MFCC max")
#plt.savefig("boxplot_mfccMax.png")
plt.show()

In [None]:
print(outliers("mfcc_max"))

In [None]:
sns.boxplot(x = "mean", data = df)
plt.title("Mean")
plt.xlabel("Mean")
#plt.savefig("boxplot_mean.png")
plt.show()

In [None]:
sns.boxplot(x = "std", data = df)
plt.title("Std")
plt.xlabel("Std")
#plt.savefig("boxplot_STD.png")
plt.show()

In [None]:
print(outliers("std"))

In [None]:
sns.boxplot(x = "min", data = df)
plt.title("Min")
plt.xlabel("Min")
#plt.savefig("boxplot_min.png")
plt.show()

In [None]:
print(outliers("min"))

In [None]:
sns.boxplot(x = "max", data = df)
plt.title("Max")
plt.xlabel("Max")
#plt.savefig("boxplot_max.png")
plt.show()

In [None]:
print(outliers("max"))

In [None]:
sns.boxplot(x = "kur", data = df)
plt.title("Kurtosis")
plt.xlabel("Kurtosis")
#plt.savefig("boxplot_kur.png")
plt.show()

In [None]:
print(outliers("kur"))

In [None]:
sns.boxplot(x = "skew", data = df)
plt.title("Skewness")
plt.xlabel("Skewness")
#plt.savefig("boxplot_skew.png")
plt.show()

In [None]:
print(outliers("skew"))

**Maybe at this point we can calculate outliers for categorical/nominal attributes, i.e., values that occur with a frequency much lower than that of all the other values.**
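
One possible way to do this is to flag categories whose relative frequency falls below a chosen share. The helper name, the 5% cut-off, and the toy data below are assumptions; a suitable threshold would have to be decided per attribute.

```python
import pandas as pd


def rare_categories(series, min_share=0.05):
    """Return the categories whose relative frequency is below `min_share`."""
    shares = series.value_counts(normalize=True)
    return sorted(shares[shares < min_share].index)


toy = pd.Series(["a"] * 50 + ["b"] * 48 + ["c"] * 2)  # "c" covers 2% of the rows
rare = rare_categories(toy)
print(rare)
```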

# Transformation

In [None]:
df[["emotional_intensity","intensity"]]

In [None]:
df.sort_values("intensity")["intensity"].dropna()

In [None]:
sns.scatterplot(x = "emotional_intensity", y = "intensity", data = df)
plt.title("Emotional intensity per intensity")
plt.xlabel("Emotional intensity")
plt.ylabel("Intensity")
#plt.savefig("boxplot_EI-intensity.png")
plt.show()

In [None]:
def densityPlot(columns):
    # overlay the density of each listed column on a single set of axes
    fig, ax = plt.subplots(1,1)
    for s in columns:
        df[s].plot(kind = 'density', ax = ax, label = s)
    ax.legend()
    plt.show()

In [None]:
df['zero_crossings_rate_normalized'] = zscore(df['zero_crossings_rate'])
df = df.drop(columns=["zero_crossings_rate"])

In [None]:
densityPlot(["zero_crossings_rate_normalized"])

In [None]:
df['sc_mean_normalized'] = zscore(df['sc_mean'])
df = df.drop(columns=["sc_mean"])

In [None]:
densityPlot(["sc_mean_normalized", "zero_crossings_rate_normalized"])

## Pairwise correlations and eventual elimination of variables

We dropped the _frame_count_ column because _length_ already measures the duration of the audio and, at a fixed frame rate, _frame_count_ is proportional to _length_.

In [None]:
df_num = df.copy() # copy of all numerical and ratio-scaled attributes

In [None]:
df_num = df_num.drop(columns = ["vocal_channel", "emotion", "emotional_intensity", "statement", "repetition", "actor", "sex", "channels", "frame_width"])
df_num.T

In [None]:
#sns.pairplot(df_num)
#plt.savefig("pairplot_df_num.png")
#plt.show()

In [None]:
f = plt.figure(figsize=(19, 15))
plt.matshow(df_num.corr("spearman"), fignum=f.number)
plt.xticks(range(df_num.select_dtypes(['number']).shape[1]), df_num.select_dtypes(['number']).columns, fontsize=14, rotation=90)
plt.yticks(range(df_num.select_dtypes(['number']).shape[1]), df_num.select_dtypes(['number']).columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);

In [None]:
corr = df_num[["intensity", "mfcc_min", "min"]].corr("spearman")
corr.style.background_gradient(cmap='coolwarm')

In [None]:
eps = 0.90
print("CORRELATION > ",eps,":")
corr = (df_num.corr())
for index, row in corr.iterrows():
    for k, v in row.items():
        if(v > eps and index != k):
            print(index, k, corr[index][k])
    
print("")
print("CORRELATION < -",eps,":")
neg_corr = (df_num.corr())
for index, row in neg_corr.iterrows():
    for k, v in row.items():
        if(v  < -eps and index != k):
            print(index, k, v)
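
The loop above prints each pair twice because the correlation matrix is symmetric, and it uses the default Pearson coefficient, whereas the heatmap uses Spearman. As a hedged alternative, the sketch below reports each pair once by scanning only the upper triangle. The helper name and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame, threshold=0.9, method="pearson"):
    """List (col_i, col_j, r) for every unordered column pair with |r| > threshold."""
    corr = frame.corr(method=method)
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # strictly above the diagonal
    cols = corr.columns
    pairs = []
    for i, j in zip(*np.where(upper)):
        r = corr.iloc[i, j]
        if abs(r) > threshold:
            pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],   # almost a multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],    # weakly related to both
})
pairs = high_corr_pairs(toy, threshold=0.9)
print(pairs)
```

Passing `method="spearman"` would make the listing consistent with the heatmap.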

In [None]:
df_num = df_num.drop(columns=["mfcc_min", "max", "mfcc_std", "stft_skew", "min"])

In [None]:
df_norm = pd.DataFrame()
for v in df_num:
    df_norm[v] = zscore(df_num[v].dropna())
df_norm

In [None]:
df_num.describe()

In [None]:
desc = df_num.describe()
for v in desc:
    print(v + ": "  + str(desc[v]["75%"] - desc[v]["25%"]))

IQR threshold > 15: "mfcc_max", "sc_std", "sc_min", "sc_max"
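
The selection can also be made programmatically instead of reading it off the printed values. The following sketch assumes the criterion is "interquartile range above the threshold"; the helper name and toy columns are made up.

```python
import pandas as pd


def columns_with_large_iqr(frame, threshold):
    """Return the columns whose interquartile range exceeds `threshold`."""
    iqr = frame.quantile(0.75) - frame.quantile(0.25)
    return iqr[iqr > threshold].index.tolist()


toy = pd.DataFrame({
    "narrow": [0.0, 0.5, 1.0, 1.5, 2.0],        # IQR = 1
    "wide": [0.0, 50.0, 100.0, 150.0, 200.0],   # IQR = 100
})
wide_columns = columns_with_large_iqr(toy, threshold=15)
print(wide_columns)
```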

In [None]:
#'''
df_num["intensity_normalized"] = zscore(df_num["intensity"].dropna())
df_num["mfcc_mean_normalized"] = zscore(df_num["mfcc_mean"])
df_num["mfcc_max_normalized"] = zscore(df_num["mfcc_max"])
df_num["sc_std_normalized"] = zscore(df_num["sc_std"])
df_num["sc_min_normalized"] = zscore(df_num["sc_min"])
df_num["sc_max_normalized"] = zscore(df_num["sc_max"])
df_num["stft_min_normalized"] = zscore(df_num["stft_min"])
df_num["mean_normalized"] = zscore(df_num["mean"])
df_num["std_normalized"] = zscore(df_num["std"])
df_num["kur_normalized"] = zscore(df_num["kur"])
df_num = df_num.drop(columns=["intensity", "mfcc_mean", "mfcc_max", "sc_std", "sc_min", "sc_max", "stft_min", "mean", "std", "kur"])
#'''

In [None]:
desc = df_norm.describe()
for v in desc:
    print(v + ": "  + str(desc[v]["75%"] - desc[v]["25%"]))

In [None]:
df_num.T

## Analysis by centroid-based methods

### KMeans, Bisecting KMeans

**SSE, KMeans and Bisecting KMeans plot function**

In [None]:
def drawSSEKMeansPlots (df, column_indices, xlabel, ylabel, variables, nclusters) :
    # normalization (zscore)
    X = variables.values
    scaler = StandardScaler()
    scaler.fit(X)
    X_scal = scaler.transform(X)
    
    # SSE plot
    sse_list = []
    for k in range(2, 18): # k = 2, ..., 17 (16 values)
        kmeans = KMeans(n_clusters = k, n_init = 10, max_iter = 100)
        kmeans.fit(X_scal)
        sse_list.append(kmeans.inertia_)
    sns.lineplot(x = range(2, 18), y = sse_list, marker = 'o')
    plt.title("SSE " + xlabel + ' vs ' + ylabel, fontsize = 15)
    plt.xlabel('k (number of clusters)')
    plt.ylabel('SSE')
    plt.show()
    
    # KMeans plot
    kmeans = KMeans(n_clusters = nclusters, n_init = 10, max_iter = 100)
    kmeans.fit(X_scal)
    centers = kmeans.cluster_centers_
    centers = scaler.inverse_transform(centers)
    sns.scatterplot(data = df, x = xlabel, y = ylabel, hue = kmeans.labels_, style = kmeans.labels_, palette = "bright")
    plt.title("KMeans " + xlabel + " vs " + ylabel, fontsize = 15)
    plt.legend()
    plt.scatter(centers[:, 0], centers[:, 1], c = 'black', marker = '*', s = 200)
    plt.show()
    
    # Bisecting KMeans plot
    bkmeans = BisectingKMeans(n_clusters = nclusters)
    bkmeans.fit(X_scal)
    centersbis = bkmeans.cluster_centers_
    centersbis = scaler.inverse_transform(centersbis)
    sns.scatterplot(data = df, x = xlabel, y = ylabel, hue = bkmeans.labels_, palette = "bright", style = bkmeans.labels_)
    plt.title("Bisecting KMeans " + xlabel + " vs " + ylabel, fontsize = 15)
    plt.scatter(centersbis[:, 0], centersbis[:, 1], c = 'black', marker = '*', s = 200)
    plt.show()

In [None]:
xvar = "zero_crossings_rate_normalized" # first attribute
yvar = "sc_mean_normalized" # second attribute
indices = [8, 9] # column indices of attributes
nclust = 6 # number of clusters for KMeans and Bisecting KMeans
drawSSEKMeansPlots(df_num, indices, xvar, yvar, df_num[[xvar, yvar]], nclust)

According to the SSE plot, six clusters seem to be a good compromise between SSE and number of clusters, because from that point onwards the decrease in SSE becomes roughly linear. The shape of the scatterplot is, more or less, globular. We can also see that Bisecting KMeans, which splits clusters hierarchically, produces a different partition from KMeans.
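
The elbow reasoning is easier to see on data where the right answer is known. The sketch below builds three well-separated synthetic blobs (an assumption made purely for illustration, unrelated to the dataset) and records the SSE, exposed by scikit-learn as `inertia_`, for several values of k.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic, well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

ks = list(range(1, 7))
sse = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(model.inertia_)  # sum of squared distances to the closest centroid

for k, value in zip(ks, sse):
    print(k, round(value, 1))
```

The SSE drops sharply up to k = 3 and only marginally afterwards, which is the "elbow" one looks for in the plots of this section.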

In [None]:
xvar = "skew"
yvar = "length"
indices = [5, 7]
nclust = 6
drawSSEKMeansPlots(df_num, indices, xvar, yvar, df_num[[xvar, yvar]], nclust)

Here too, Bisecting KMeans produces a different clustering, more compact than that of KMeans, which tends to have a Christmas-tree shape instead. Again, according to the SSE plot, six clusters seem to be a good choice, and the boundary of the scatterplot tends to be globular.

In [None]:
xvar = "std_normalized"
yvar = "kur_normalized"
indices = [18, 19]
nclust = 5
drawSSEKMeansPlots(df_num, indices, xvar, yvar, df_num[[xvar, yvar]], nclust)

In this case the shape of the scatterplot is non-globular, but according to the SSE plot five clusters seem to be enough to reach a good compromise. Again, we can notice the difference in clustering between KMeans and Bisecting KMeans.

In [None]:
xvar = "sc_min_normalized"
yvar = "stft_min_normalized"
indices = [13, 16]
nclust = 4
drawSSEKMeansPlots(df_num, indices, xvar, yvar, df_num[[xvar, yvar]], nclust)

Here too the shape is irregular, but four clusters seem to be enough. In Bisecting KMeans the top cluster covers more area than the corresponding KMeans cluster.

In [None]:
xvar = "mfcc_max_normalized"
yvar = "sc_max_normalized"
indices = [12, 15]
nclust = 5
drawSSEKMeansPlots(df_num, indices, xvar, yvar, df_num[[xvar, yvar]], nclust)

# Hierarchical Clustering

## Samples

In [None]:
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [None]:
x_label = "mfcc_mean"
y_label = "kur"

train_df = df_norm[[x_label, y_label]]
train_set = train_df.values
train_set

In [None]:
plt.scatter(train_df[x_label], train_df[y_label])
plt.show()

In [None]:
dist = pdist(train_set, 'euclidean')
dist = squareform(dist)

In [None]:
# a non-None distance_threshold makes scikit-learn build the full tree and store distances_;
# here the tree is cut at linkage distance 5 (Euclidean distance is the default metric)
model = AgglomerativeClustering(distance_threshold=5, n_clusters=None, linkage='complete')
model = model.fit(train_set)

In [None]:
plt.title("Hierarchical Clustering Dendrogram")
plot_dendrogram(model, truncate_mode="lastp")
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()

In [None]:
sns.scatterplot(data=df_norm, x=x_label, y=y_label, hue=model.labels_, 
                style=model.labels_, palette="bright")
plt.show()

#### Choosing the number of clusters

In [None]:
model = AgglomerativeClustering(n_clusters=10, linkage='complete') # Euclidean distance is the default metric
model.fit(train_set)

In [None]:
sns.scatterplot(data=df_norm, x=x_label, y=y_label, hue=model.labels_, 
                style=model.labels_, palette="bright")
plt.show()

#### Precomputed distance matrix

In [None]:
sns.scatterplot(data=df_norm, x=x_label, y=y_label, hue=model.labels_, 
                style=model.labels_, palette="bright")
plt.show()

In [None]:
# scikit-learn >= 1.2 calls this parameter `metric`; older versions call it `affinity`
model = AgglomerativeClustering(n_clusters=20, metric='precomputed', linkage='complete')
model.fit(dist)

In [None]:
sns.scatterplot(data=df_norm, x=x_label, y=y_label, hue=model.labels_, 
                style=model.labels_, palette="bright")
plt.show()
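
The notebook imports `linkage` from SciPy but does not use it directly. As a hedged sketch of the SciPy route, the following builds a complete-linkage tree from a condensed distance matrix and cuts it into a fixed number of clusters with `fcluster`. The four toy points are made up so that the expected grouping is obvious.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Two obvious groups: two points near the origin, two near (10, 10).
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

condensed = pdist(points, metric="euclidean")      # condensed pairwise distances
Z = linkage(condensed, method="complete")          # (n - 1) x 4 linkage matrix
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into at most 2 clusters
print(labels)
```

The matrix `Z` can be passed straight to `dendrogram`, without the helper that rebuilds a linkage matrix from a fitted scikit-learn model.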

## Real Use

In [None]:
# linkage: {'ward', 'complete', 'average', 'single'}
from itertools import combinations

def allHierarchicalClusters(datas, t = "scatterplot", linkage = "complete"):
    # cluster every unordered pair of columns exactly once
    for x_label, y_label in combinations(datas.columns, 2):
        train_set = datas[[x_label, y_label]].values

        # Euclidean distance is the default metric
        model = AgglomerativeClustering(distance_threshold=5, n_clusters=None, linkage=linkage)
        model = model.fit(train_set)

        if t == "dendrogram":
            plt.title("Hierarchical Clustering Dendrogram")
            plot_dendrogram(model, truncate_mode="lastp")
            plt.xlabel("Number of points in node (or index of point if no parenthesis).")
            plt.show()
        else:
            sns.scatterplot(data=datas, x=x_label, y=y_label, hue=model.labels_,
                            style=model.labels_, palette="bright")
            plt.show()

In [None]:
allHierarchicalClusters(df_norm.drop(columns=["emotion_positivity"]), "scatterplot", "ward")