#### Your objective is to classify fMRI brain images, recorded while a subject listened to music, into five genres: label 0=Ambient Music, 1=Country Music, 2=Heavy Metal, 3=Rock 'n Roll, 4=Classical Symphonic. The data consists of train_data.csv, train_labels.csv, and test_data.csv, for a one-person subset of a larger 20-subject study, linked above.

#### The training data (train_data.csv) consists of 160 event-related brain images (trials), corresponding to twenty 6-second music clips, four clips in each of the five genres, repeated in-order eight times (runs). The labels (train_labels.csv) give the correct musical genre, as listed above, for each of the 160 trials.

#### There are 22036 features in each brain image, corresponding to blood-oxygenation levels at each 2mm-cubed 3D location within a section of the auditory cortex. In human brain imaging, there are often many more features (brain sites) than samples (trials), thus making the task a relatively challenging multiway classification problem.

#### The testing data (test_data.csv) consists of 40 event-related brain images corresponding to novel 6-second music clips in the five genres. The test data is in randomized order with no labels. You must predict, using only the given brain images, the correct genre labels (0-4) for the 40 test trials.


# Final Project

# "Classifying The Brain on Music"

Michael Casey, https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2017.01179/full


## **1. Multi-Class Genre Classifier** [[12 points]](https://)

#### Build a multi-class classifier for the 5 music genres. Your goal is to train a model to classify brain images into corresponding genre categories. You are free to choose any machine learning models from the class.

#### **1-1. Hyper-parameter Search.** [[4 points]](https://) Demonstrate your hyperparameter search process using cross-validation. Provide details for at least one hyperparameter with 10 different possible values.
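
One way to satisfy the "10 different possible values" requirement is a logarithmic grid over the inverse regularization strength `C` of a logistic regression. The sketch below is only an illustration: `X_demo` and `y_demo` are synthetic stand-ins for `train_data` and `train_labels` (an assumption, with far fewer features than the real 22036), and the grid bounds are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption): 8 runs x 20 trials, 4 trials per genre per run.
rng = np.random.default_rng(0)
y_demo = np.tile(np.repeat(np.arange(5), 4), 8)
X_demo = rng.normal(size=(160, 50)) + 0.5 * y_demo[:, None]  # class-dependent shift

# Ten logarithmically spaced candidate values for C.
param_grid = {"C": np.logspace(-4, 2, 10)}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=5000), param_grid, scoring="accuracy", cv=cv
)
search.fit(X_demo, y_demo)

# Report the mean cross-validated accuracy for every candidate value.
for C, score in zip(search.cv_results_["param_C"],
                    search.cv_results_["mean_test_score"]):
    print(f"C={float(C):.4g}  mean CV accuracy={score:.3f}")
print("Best C:", search.best_params_["C"])
```

Printing the whole `cv_results_` table, rather than only the best value, is what demonstrates the search process.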

#### **1-2. Model Training and Testing.** [[4 points]](https://) Following the hyperparameter search, train your model with the best combination of hyperparameters. Run the model on the test set and submit the results to the Kaggle competition. To get full marks, your model should outperform the baseline model, which is provided in Kaggle. You **must** show your test accuracy computed by Kaggle in this report.

#### **1-3. Model Analysis.** [[4 points]](https://) Conduct a thorough analysis of your model, including:

#### **1-3-1. Confusion Matrix:** Split the training set into train/validation sets. The data is organized into eight runs, in order, with each run repeating the same 20 music trials. You should split the data by run. Retrain your model using the best hyperparameter combination. Present the confusion matrix on the validation set.
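
A run-wise split can be sketched as follows, assuming the 160 rows are stored run by run (rows 0-19 are run 0, rows 20-39 are run 1, and so on), as the description above suggests. The data here are synthetic stand-ins, the within-run label order is an assumption, and `C=0.01` is a placeholder rather than a tuned value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data (assumption): 8 runs x 20 trials, 5 genres.
rng = np.random.default_rng(0)
n_runs, trials_per_run = 8, 20
y_demo = np.tile(np.repeat(np.arange(5), 4), n_runs)
X_demo = rng.normal(size=(160, 50)) + 0.5 * y_demo[:, None]

# Run index of every trial, derived from the row order.
runs = np.repeat(np.arange(n_runs), trials_per_run)

# Hold out the last two runs (40 trials) for validation; train on the first six.
val_mask = runs >= 6
X_tr, y_tr = X_demo[~val_mask], y_demo[~val_mask]
X_val, y_val = X_demo[val_mask], y_demo[val_mask]

model = LogisticRegression(C=0.01, max_iter=5000).fit(X_tr, y_tr)

# Rows are true genres, columns are predicted genres.
cm = confusion_matrix(y_val, model.predict(X_val), labels=np.arange(5))
print(cm)
```

Splitting by run keeps every repetition of a clip within a run on the same side of the split, which a random `train_test_split` does not guarantee.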

#### **1-3-2. Example Examination:** Examine four validation samples where your model fails to classify into the correct category. Display the true label and the predicted label. Looking at the confusion matrix, how might you explain your results from the perspectives of human brain data and music genre similarity?
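
Misclassified validation trials can be listed with a boolean comparison between predictions and labels. This is a minimal sketch on synthetic stand-in data (an assumption); the weak class signal is chosen only so that some errors are likely to occur, and the `GENRES` list simply spells out the label mapping given at the top of this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

GENRES = ["Ambient Music", "Country Music", "Heavy Metal",
          "Rock 'n Roll", "Classical Symphonic"]

# Synthetic stand-in data (assumption) with a deliberately weak class signal.
rng = np.random.default_rng(1)
y_demo = np.tile(np.repeat(np.arange(5), 4), 8)
X_demo = rng.normal(size=(160, 50)) + 0.15 * y_demo[:, None]
runs = np.repeat(np.arange(8), 20)

# Same run-wise split as for the confusion matrix: last two runs for validation.
val_mask = runs >= 6
model = LogisticRegression(max_iter=5000).fit(X_demo[~val_mask], y_demo[~val_mask])
y_val = y_demo[val_mask]
y_pred = model.predict(X_demo[val_mask])

# Positions (within the validation set) where the prediction is wrong.
wrong = np.flatnonzero(y_pred != y_val)
for i in wrong[:4]:
    print(f"validation trial {i}: true = {GENRES[y_val[i]]}, "
          f"predicted = {GENRES[y_pred[i]]}")
```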


---

## **A. Data Download**

#### For your convenience, we have provided code to download the dataset, which includes true labels, training data (features), training labels, and testing data (features).


#### **A-1. Download Features and Labels.**

#### Run the following code to download the brain features and labels of the music clips.


In [None]:
import numpy as np
!pip install gdown

In [None]:
# Recent gdown releases accept the file ID directly; the --id flag is deprecated.
!gdown 1aFDPryEDcT5wg0k8NhWYpF8lulGmot5J # train data
!gdown 11kgAdB_hkEcC4npCEWJcAOOmGe3495yY # train labels
!gdown 1wXq56F6RIUtDzPceZegZAMA-JGW21Gqu # test data

In [1]:
# Data Import Method 1, with pandas
import pandas as pd

train_data = pd.read_csv("../train_data.csv", header=None)
train_labels = pd.read_csv("../train_labels.csv", header=None)
test_data = pd.read_csv("../test_data.csv", header=None)

print('train_data.shape: {}'.format(train_data.shape))
print('train_labels.shape: {}'.format(train_labels.shape))
print('test_data.shape: {}'.format(test_data.shape))

train_data.shape: (160, 22036)
train_labels.shape: (160, 1)
test_data.shape: (40, 22036)


#### Data exploration


In [None]:
print("\nFirst few rows of the dataset:\n")
train_data.head(2)

In [None]:
print("\nDescriptive statistics for numerical columns:\n")
train_data.describe()

In [None]:
print("\nInformation about the dataset:\n")
print(train_data.info())

print("\nShape of the dataset (rows, columns):\n")
print(train_data.shape)

print("\nData types of each column:\n")
print(train_data.dtypes)

print("\nNumber of trials per genre label:\n")
print(train_labels[0].value_counts())

print("\nNumber of missing values in each column:\n")
print(train_data.isnull().sum())

#### Step 1: Split the data into training and hold-out sets


In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_data, train_labels, test_size=0.3, random_state=0) # 70% to train

In [3]:
X_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22026,22027,22028,22029,22030,22031,22032,22033,22034,22035
118,-2.314397,-3.039203,-1.922149,-2.733876,-4.540047,-1.237867,-2.188299,-3.827710,-0.692026,-0.160446,...,-3.509140,-3.225268,-1.987008,-1.737181,-0.602613,-0.251485,-0.501410,-2.042777,-4.013156,-4.239245
95,-1.307402,-0.978637,-1.899713,-1.711495,-1.775354,-2.644640,-2.332035,-2.357386,-2.515284,-3.280651,...,-4.762068,-3.882897,-1.901563,0.797085,0.506390,0.288586,-5.840486,-4.511325,-3.731364,-3.447817
55,0.331377,0.738699,1.551261,1.786645,1.339275,2.847958,2.920124,2.422327,2.576581,3.669024,...,3.582624,3.730723,2.714881,1.313629,0.517216,-2.136820,2.836182,4.440277,4.521889,3.791714
109,-0.263944,-1.224477,0.527221,-1.141954,-1.276649,1.748549,-0.860848,-1.041420,4.880188,3.418812,...,5.961148,6.062754,3.741080,0.162845,0.907009,-0.092087,7.276143,5.977975,5.233674,5.221163
18,1.045658,0.158404,1.430457,-0.088408,-3.513397,1.624462,-0.278195,-4.185827,1.390569,2.215716,...,1.625480,2.924401,3.821515,0.884437,0.617778,0.485366,-1.133856,0.996806,1.658188,3.123640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9,-1.776406,-1.757771,-2.242440,-1.522027,-0.109872,-2.347088,-0.985578,-0.174761,-2.351364,-1.260209,...,0.895446,-2.435271,-2.270254,2.291000,2.656067,2.051590,2.743168,3.810833,1.372110,-2.026703
103,-3.380314,-2.831648,-2.556751,-1.693914,-1.438040,-2.200296,-0.444568,0.689061,-2.497783,-2.217416,...,-11.526276,-12.696812,-6.558849,-0.673435,-3.066120,-2.289176,-3.955882,-5.050390,-8.200123,-9.937320
67,-0.168596,0.214571,-0.359398,0.949367,0.722810,-0.642784,0.963331,0.915781,-1.779895,-0.446726,...,4.444767,6.237965,5.752860,1.637919,1.536919,1.328895,6.336852,4.586332,5.405281,7.623970
117,-0.560468,-1.993829,1.343158,-0.582546,-0.861013,2.389700,0.666740,0.433878,2.166772,2.245840,...,2.941923,3.062529,1.659200,1.940238,3.031249,2.814192,0.097506,1.515483,2.866854,2.546117


#### Step 2: Normalize the features using StandardScaler


In [79]:
from sklearn.preprocessing import StandardScaler

# Note: in the runs below, standardizing before PCA lowered hold-out accuracy
# (0.688 with scaling vs. 0.750 without).

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [6]:
X_train_scaled

array([[-1.50297059, -1.61024443, -1.17233137, ..., -0.56837978,
        -1.09890874, -1.20940757],
       [-0.83243981, -0.43640166, -1.15877359, ..., -1.32022538,
        -1.0143281 , -0.9682414 ],
       [ 0.25877859,  0.54191367,  0.92661449, ...,  1.40616335,
         1.4629022 ,  1.23780979],
       ...,
       [-0.07413977,  0.24333353, -0.22797708, ...,  1.45064755,
         1.72805405,  2.40558616],
       [-0.33507725, -1.01472619,  0.80086001, ...,  0.51535911,
         0.96614001,  0.85824792],
       [-1.03285619,  0.13103251, -1.6253256 , ..., -0.10399166,
         0.26068912,  0.27101714]])

#### Step 3: One-hot encode the target variable


In [78]:
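from sklearn.preprocessing import OneHotEncoder

# One-hot encode the integer genre labels (sparse_output requires scikit-learn >= 1.2).
# Note: scikit-learn classifiers accept integer labels directly, so the
# grid searches below use y_train[0] rather than these encoded arrays.
encoder = OneHotEncoder(sparse_output=False)
y_train_onehot = encoder.fit_transform(y_train.values.reshape(-1, 1))
y_test_onehot = encoder.transform(y_test.values.reshape(-1, 1))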

#### Step 4: Train a logistic regression classifier with a grid search over C


In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_validate
from sklearn.multioutput import MultiOutputClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, accuracy_score

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=5000)

param_grid = {
    'C': [10, 15, 20, 25, 30, 60, 100, 200]
}
grid_search = GridSearchCV(logreg, param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train[0].tolist())

print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

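# GridSearchCV (refit=True by default) has already refit best_estimator_ on all of
# X_train, so the explicit fit below is redundant but harmless.
best_logreg = grid_search.best_estimator_
best_logreg.fit(X_train, y_train[0].tolist())
# "Test" here means the 30% hold-out split of train_data, not the Kaggle test set.
test_acc = accuracy_score(y_test, best_logreg.predict(X_test))
print(f"Test Accuracy: {test_acc:.3f}")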

Best Parameters:  {'C': 10}
Best Score:  0.6600790513833992
Test Accuracy: 0.708


In [80]:
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pca = PCA(n_components=90)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

logreg = LogisticRegression(max_iter=5000)

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
grid_search = GridSearchCV(logreg, param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train_pca, y_train[0].tolist())

print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Train the final model with the best parameters
best_logreg = grid_search.best_estimator_
best_logreg.fit(X_train_pca, y_train[0].tolist())
test_acc = accuracy_score(y_test, best_logreg.predict(X_test_pca))
print(f"Test Accuracy: {test_acc:.3f}")

Best Parameters:  {'C': 0.01}
Best Score:  0.7596837944664031
Test Accuracy: 0.750
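
In the cell above, PCA is fitted on all of `X_train` before the grid search, so every cross-validation fold has already seen the validation rows through the PCA projection, and the reported "Best Score" may be optimistic. A possible alternative, sketched here on synthetic stand-in data (an assumption), is to put PCA and the classifier into one `Pipeline` so that PCA is refitted inside each fold. The value `n_components=20` is illustrative only; with 112 training rows and 5 folds, each training fold has about 89 rows, so `n_components=90` would probably need to be lowered in this setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 160 trials, 5 genres, 200 features.
rng = np.random.default_rng(0)
y_demo = np.tile(np.repeat(np.arange(5), 4), 8)
X_demo = rng.normal(size=(160, 200)) + 0.5 * y_demo[:, None]

# PCA is a pipeline step, so it is fitted on the training part of each fold only.
pipe = Pipeline([
    ("pca", PCA(n_components=20)),
    ("logreg", LogisticRegression(max_iter=5000)),
])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {"logreg__C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

print("Best Parameters: ", search.best_params_)
print("Best Score: ", search.best_score_)
```

The same pattern extends to `StandardScaler` by adding it as a first pipeline step, and `pca__n_components` can be added to `param_grid` to tune the number of components as well.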


In [81]:
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pca = PCA(n_components=90)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

logreg = LogisticRegression(max_iter=10000)

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
grid_search = GridSearchCV(logreg, param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train_pca, y_train[0].tolist())

print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Train the final model with the best parameters
best_logreg = grid_search.best_estimator_
best_logreg.fit(X_train_pca, y_train[0].tolist())
test_acc = accuracy_score(y_test, best_logreg.predict(X_test_pca))
print(f"Test Accuracy: {test_acc:.3f}")

Best Parameters:  {'C': 1}
Best Score:  0.6869565217391305
Test Accuracy: 0.688
