# Feature Extraction

#### This script performs feature extraction and data processing on audio files stored in a directory. In detail:

1) Calls the function "directory_feature_extraction" from the MidTermFeatures module of the pyAudioAnalysis library, passing the path to the audio files directory and the parameters for the extraction process (a mid-term window and step of 1 sec each, and a short-term window and step of 0.1 sec each).

2) Stores the extracted features, the corresponding audio file names, and the feature names in the variables "mid_term_features", "wav_file_list", and "mid_feature_names" respectively.

3) Normalizes the extracted features: calculates the mean and standard deviation of each feature across the files, then subtracts the mean and divides by the standard deviation.

4) Creates a pandas dataframe from the normalized features, sets the column names to the mid_feature_names, and exports the dataframe to a CSV file with a name of your choice (e.g. "name_of_your_choice.csv").
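
The normalization in step 3 can be sketched on a toy array. This is a minimal illustration rather than the notebook's exact code: the helper name `zscore`, the `eps` threshold, and the toy values are assumptions, and the guard against a zero standard deviation is an addition that the cell below does not have.

```python
import numpy as np

def zscore(features, eps=1e-12):
    """Subtract the per-column mean and divide by the per-column std.

    Hypothetical helper: columns whose std is (almost) zero are divided
    by 1 instead, so a constant feature becomes 0 rather than NaN.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    safe_std = np.where(std < eps, 1.0, std)
    return (features - mean) / safe_std

# Toy matrix: 3 "files" (rows) x 3 "features" (columns); the last column is constant
toy = np.array([[1.0, 10.0, 5.0],
                [2.0, 20.0, 5.0],
                [3.0, 30.0, 5.0]])
normalized = zscore(toy)
print(normalized)
```

Each non-constant column ends up with zero mean and unit standard deviation, which puts features with very different ranges on a comparable scale.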

Function mid_feature_extraction() from the MidTermFeatures.py file extracts a number of statistics (e.g. mean and standard deviation) over the short-term feature sequences. The total number of short-term features implemented in pyAudioAnalysis is 34. In addition, the delta features are optionally computed (they are enabled by default, but can be disabled by setting the deltas argument in feature_extraction() to false). So, the total number of short-term features, including the deltas, is 68.
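
To make the idea of mid-term statistics concrete, here is a minimal NumPy sketch. It is an illustration only, not the pyAudioAnalysis implementation: the function name `mid_term_stats` and the toy data are hypothetical. It assumes that the values passed in the cell below (1, 1, 0.1, 0.1) are the mid-term window, mid-term step, short-term window, and short-term step in seconds, so that each mid-term window covers 10 short-term frames.

```python
import numpy as np

def mid_term_stats(short_feats, frames_per_window, frames_per_step):
    """Mean and std of every short-term feature over mid-term windows.

    short_feats has shape (n_features, n_frames); the result has shape
    (2 * n_features, n_windows): first all the means, then all the stds.
    """
    n_feats, n_frames = short_feats.shape
    columns = []
    for start in range(0, n_frames, frames_per_step):
        window = short_feats[:, start:start + frames_per_window]
        columns.append(np.concatenate([window.mean(axis=1), window.std(axis=1)]))
    return np.stack(columns, axis=1)

# Toy input: 2 short-term features over 20 frames (2 sec at a 0.1 sec step)
short = np.arange(40, dtype=float).reshape(2, 20)
stats = mid_term_stats(short, frames_per_window=10, frames_per_step=10)
print(stats.shape)
```

With two statistics per short-term feature, every mid-term vector is twice as long as the short-term one.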

A table of the exact short term features can be found in:

https://github.com/tyiannak/pyAudioAnalysis/wiki/3.-Feature-Extraction

The extracted features also include beat extraction. Tempo induction is a rather important task in music information retrieval. This library provides a baseline method for estimating the beats per minute (BPM) rate of a music signal.



In [1]:
import os
import pandas as pd
from pyAudioAnalysis import MidTermFeatures as mtf
import numpy as np

# Input is the directory where the audio files are located; there is one directory per composer and category (Mozart-Piano, Beethoven-Symphonies, etc.)
# Arguments after the path: mid-term window, mid-term step, short-term window, short-term step (in seconds)
mid_term_features, wav_file_list, mid_feature_names = mtf.directory_feature_extraction('C:/Users/user/OneDrive/Υπολογιστής/Εργασία machine learning dataset/Bethoveen/Symphonies/Testing', 1, 1, 0.1, 0.1)
# Z-score normalization: per-feature mean and standard deviation across the files
m = mid_term_features.mean(axis=0)
s = np.std(mid_term_features, axis=0)
mid_term_features_2 = (mid_term_features - m) / s

# DataFrame from the normalized features, with the feature names as columns
df = pd.DataFrame(mid_term_features_2, columns=mid_feature_names)


# dataFrame to CSV 
df.to_csv('Bethoveen_symphonies_testing2.csv', index=False)



Analyzing file 1 of 250: C:/Users/user/OneDrive/Υπολογιστής/Εργασία machine learning dataset/Bethoveen/Symphonies/Testing\chunk10.wav
Analyzing file 2 of 250: C:/Users/user/OneDrive/Υπολογιστής/Εργασία machine learning dataset/Bethoveen/Symphonies/Testing\chunk100.wav
Analyzing file 3 of 250: C:/Users/user/OneDrive/Υπολογιστής/Εργασία machine learning dataset/Bethoveen/Symphonies/Testing\chunk101.wav
Analyzing file 4 of 250: C:/Users/user/OneDrive/Υπολογιστής/Εργασία machine learning dataset/Bethoveen/Symphonies/Testing\chunk102.wav
Analyzing file 5 of 250: C:/Users/user/OneDrive/Υπολογιστής/Εργασία machine learning dataset/Bethoveen/Symphonies/Testing\chunk103.wav
Analyzing file 6 of 250: C:/Users/user/OneDrive/Υπολογιστής/Εργασία machine learning dataset/Bethoveen/Symphonies/Testing\chunk104.wav
Analyzing file 7 of 250: C:/Users/user/OneDrive/Υπολογιστής/Εργασία machine learning dataset/Bethoveen/Symphonies/Testing\chunk105.wav
Analyzing file 8 of 250: C:/Users/user/OneDrive/Υπολογισ

## Merging the produced dataframes and adding labels to them

1) Adds a new column named "label" to each of the dataframes and assigns each a unique label.
2) Merges the multiple dataframes into one dataframe using "pd.concat" function.
3) Saves the merged dataframe to a new CSV file.

The code can be adapted to create exactly the CSV file you want. For example, to add the label to a single file such as Bethoveen_piano_testing.csv, you can use:

`df8 = pd.read_csv('Bethoveen_piano_testing.csv')`

`df8["label"] = 4`

`merged_df= pd.concat([df8])`

`merged_df.to_csv("Bethoveen_piano_testing.csv", index=False)`
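
The labelling and merging steps can also be written as one small helper. The sketch below is an assumption-laden illustration: the helper name `label_and_merge`, the toy column values, and the in-memory frames (used here instead of CSV files) are all hypothetical.

```python
import pandas as pd

def label_and_merge(frames_by_label):
    """Add a "label" column to each frame and concatenate them.

    frames_by_label maps an integer label to a DataFrame of features.
    The input frames are copied, so they are left unchanged.
    """
    labelled = []
    for label, frame in frames_by_label.items():
        frame = frame.copy()
        frame["label"] = label
        labelled.append(frame)
    return pd.concat(labelled, ignore_index=True)

# Toy stand-ins for the per-composer feature files
schubert = pd.DataFrame({"mfcc_1_mean": [0.1, 0.2], "bpm": [1.0, -1.0]})
mozart = pd.DataFrame({"mfcc_1_mean": [0.3], "bpm": [0.5]})

merged = label_and_merge({1: schubert, 3: mozart})
print(merged)
```

In practice each frame would come from `pd.read_csv(...)` and the result would be written with `to_csv(..., index=False)`, as in the snippets above.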

The code below creates a "Piano.csv" file for training and a "Piano_testing.csv" file for testing.


In [None]:
import pandas as pd

# Read CSV
df1 = pd.read_csv("Schubert_piano_training.csv")
df2 = pd.read_csv("Schumann_piano_training.csv")
df3 = pd.read_csv("Mozart_piano_training.csv")
df4 = pd.read_csv("Bethoveen_piano_training.csv")

# New column with the labels
df1["label"] = 1
df2["label"] = 2
df3["label"] = 3
df4["label"] = 4

# Merge dataframe
merged_df = pd.concat([df1, df2, df3, df4])

# Save merged to CSV 
merged_df.to_csv("Piano.csv", index=False)

# Same for "testing" 
df5 = pd.read_csv('Schubert_piano_testing.csv')
df6 = pd.read_csv('Schumann_piano_testing.csv')
df7 = pd.read_csv('Mozart_piano_testing.csv')
df8 = pd.read_csv('Bethoveen_piano_testing.csv')

df5["label"] = 1
df6["label"] = 2
df7["label"] = 3
df8["label"] = 4

merged_df_2 = pd.concat([df5,df6,df7,df8])
merged_df_2.to_csv("Piano_testing.csv", index=False)



## Training and test input

This code performs the following tasks:

1) Loads the training data and the test data.
2) Selects the top N features using a feature selector of your choice (RFE around a linear SVM, or SelectKBest).
3) Separates the label column from the training and test data, so that the remaining columns are the features and the label column is the target.
4) Fits a classifier using one of several options (SVM, Logistic Regression, Random Forest, KNN, Neural Network, or Naive Bayes) on the training data, restricted to the selected features.
5) Evaluates the classifier on the training data with shuffle-split cross-validation (10 random splits, with 25% of the samples held out in each).
6) Prints the mean F1 score, its standard deviation (sigma), and an approximate 95% interval (mean ± 2 sigma) over the 10 splits.
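
One possible variation on steps 2, 4 and 5 is sketched below. It is not what the cell that follows does: it wraps the selector and the classifier in a scikit-learn pipeline, so that the feature selection is refit inside every cross-validation split instead of once on the full training set. The synthetic data from `make_classification`, the choice of `SelectKBest` with `f_classif`, and the `GaussianNB` classifier are assumptions made to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the feature table: 200 samples, 30 features, 4 classes
X_toy, y_toy = make_classification(n_samples=200, n_features=30,
                                   n_informative=8, n_classes=4,
                                   random_state=0)

# The selector is part of the pipeline, so each split selects its own features
pipeline = make_pipeline(SelectKBest(f_classif, k=10), GaussianNB())

# Same splitting scheme as in the notebook: 10 random splits, 25% held out
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = cross_val_score(pipeline, X_toy, y_toy, cv=cv, scoring="f1_macro")

# Approximate 95% interval: mean plus or minus two standard deviations
low = scores.mean() - 2 * scores.std()
high = scores.mean() + 2 * scores.std()
print(f"mean f1: {scores.mean():.3f}, sigma f1: {scores.std():.3f}, "
      f"95% conf: {low:.3f} - {high:.3f}")
```

When the features are selected once on all the training data, the held-out part of every split has already influenced the selection, which can make the cross-validation score optimistic. Keeping the selector inside the pipeline avoids that.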

In [4]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.metrics import accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
import pandas as pd

# Read training data
train_data = pd.read_csv("Piano.csv")

# Define the features and target
X = train_data.drop("label", axis=1)
y = train_data["label"]

# Select the top N features with recursive feature elimination (RFE) around a linear SVM
selector = RFE(SVC(kernel="linear"), n_features_to_select=10)
# A different selector that can be used: SelectKBest with mutual information
#selector = SelectKBest(mutual_info_classif, k=20)
selector.fit(X, y)

# Get the selected features
selected_features = selector.get_support(indices=True)

# Print names of features
print(X.columns[selected_features])

# test data
test_data = pd.read_csv("Piano_testing.csv")

# Drop the label column to get the test features; Y_test holds the true test labels
X_test = test_data.drop("label", axis=1)
Y_test = test_data["label"]

# Keep only the features selected in the previous step, for both training and test data
selected_features_name = X.columns[selected_features]
X = X[selected_features_name]
X_test = X_test[selected_features_name]

# Train a classifier on the training data
#clf = RandomForestClassifier()
#clf = LogisticRegression()
#clf = SVC()
#clf = KNeighborsClassifier(n_neighbors=6)
clf = MLPClassifier(max_iter=20000)
#clf = GaussianNB()

clf.fit(X, y)

# Shuffle-split cross-validation: 10 random splits, 25% of the samples held out in each
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
p1 = cross_val_score(clf, X, y, cv=cv, scoring='f1_macro')


print(f"mean f1: {p1.mean():.3f}, sigma f1: {p1.std():.3f}, 95% conf: {p1.mean()-2*p1.std():.3f} - {p1.mean()+2*p1.std():.3f}")











Index(['energy_entropy_mean', 'spectral_centroid_mean', 'spectral_spread_mean',
       'spectral_entropy_mean', 'mfcc_1_mean', 'mfcc_2_mean',
       'delta mfcc_8_std', 'delta mfcc_10_std', 'delta mfcc_11_std', 'ratio'],
      dtype='object')
mean f1: 0.656, sigma f1: 0.012, 95% conf: 0.631 - 0.681


## Testing

1) Predicts the class labels for test data using the chosen classifier of the previous step.
2) Calculates the accuracy of the predicted labels.
3) Calculates the F1 score of the model using the true labels of the test data and the predicted labels.
4) Prints the F1 score and accuracy of the model.
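
To show what the two reported numbers mean, here is a small pure-Python sketch of accuracy and macro-averaged F1 on made-up labels. The helper names `accuracy` and `macro_f1` and the toy label lists are hypothetical. The assumption is that scikit-learn's `f1_score(..., average='macro')` computes the same unweighted mean of the per-class F1 scores.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return sum(f1_per_class) / len(f1_per_class)

# Made-up labels for 8 samples and 4 classes (1 to 4, as in the label column)
toy_true = [1, 1, 2, 2, 3, 3, 4, 4]
toy_pred = [1, 2, 2, 2, 3, 4, 4, 4]

print("Accuracy:", accuracy(toy_true, toy_pred))
print("Macro F1:", macro_f1(toy_true, toy_pred))
```

Because the macro average gives every class the same weight, a class that is predicted poorly lowers the score regardless of how many samples it has.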

In [5]:
Y_pred = clf.predict(X_test)
accuracy = accuracy_score(Y_test, Y_pred)
f1 = f1_score(Y_test, Y_pred, average='macro')
print("F1 score of the model: ", f1)
print("Accuracy of the model: ", accuracy)



F1 score of the model:  0.4038741857700229
Accuracy of the model:  0.407
