# Simple classification example from CSV data

This notebook shows how to use the CSV data from the NEMO dataset for a simple classification task. Running the notebook does not require any code from the `nemo` package.

Set the `csv_data_path` variable to the local folder that contains the CSV files.

In [1]:
csv_data_path = (
    "./../../data/empe_csv"  # Set this to the folder where you have the csv files
)

In [2]:
from collections import defaultdict
import numpy as np
import pandas as pd
from pathlib import Path

## Load data

In [3]:
epochs_df = pd.read_csv(Path(csv_data_path) / "epochs.csv", sep=";")
epochs_metadata = pd.read_csv(
    Path(csv_data_path) / "epochs_metadata.csv", sep=";"
)

## Construct X and y

In this example, the only features are the mean values of each channel within an epoch, which gives one feature per HbO or HbR channel. For more advanced feature extraction, see `nemo.feature_extraction.create_datasets_from_epochs_df`.

In [4]:
X, y = defaultdict(list), defaultdict(list)
chs = [c for c in epochs_df.columns if " hbo" in c or " hbr" in c]

for epoch in epochs_df["epoch"].unique():
    epoch_df = epochs_df[epochs_df["epoch"] == epoch]
    subject = epoch_df["subject"].iloc[0]

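    # Extract the mean of each channel over all samples in the epoch
    X[subject].append(epoch_df[chs].mean(axis=0))

    # Get the label from the metadata. The labels are also in epochs_df; this
    # just shows how to use the metadata dataframe. The lookup relies on the
    # row index of epochs_metadata matching the epoch number.
    y[subject].append(epochs_metadata.loc[epoch, "value"])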

u, c = np.unique(np.concatenate([*y.values()]), return_counts=True)
print(
    f"""
Created X and y for {len(X)} subjects.
X size: {np.concatenate([*X.values()]).shape}
class counts: {dict(zip(u, c))}
"""
)


Created X and y for 31 subjects.
X size: (1203, 48)
class counts: {0: 301, 1: 300, 2: 301, 3: 301}
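
The per-epoch loop above can also be written as a single `groupby`, which may be faster on large files. The sketch below is only an illustration on a toy dataframe: the channel names (`S1_D1 hbo`, `S1_D1 hbr`), the subject ids and the values are made up, and only the `epoch` and `subject` columns and the ` hbo`/` hbr` naming pattern are taken from the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for epochs_df: 2 subjects, 3 epochs, 2 samples per epoch.
# Channel names and values are hypothetical.
toy_df = pd.DataFrame(
    {
        "epoch": [0, 0, 1, 1, 2, 2],
        "subject": ["s1", "s1", "s1", "s1", "s2", "s2"],
        "S1_D1 hbo": [1.0, 3.0, 2.0, 4.0, 0.0, 2.0],
        "S1_D1 hbr": [0.5, 1.5, 1.0, 1.0, -1.0, 1.0],
    }
)
toy_chs = [c for c in toy_df.columns if " hbo" in c or " hbr" in c]

# One row per (subject, epoch), one column per channel mean.
epoch_means = toy_df.groupby(["subject", "epoch"], sort=False)[toy_chs].mean()

# Split into one feature array per subject, mirroring the X dictionary above.
X_toy = {
    subject: group.to_numpy()
    for subject, group in epoch_means.groupby(level="subject", sort=False)
}
print({subject: features.shape for subject, features in X_toy.items()})
```

Each array has one row per epoch and one column per channel, so it can be passed to scikit-learn in the same way as the lists built by the loop.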



## Train and evaluate models

In [5]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

clf = LinearDiscriminantAnalysis(shrinkage="auto", solver="lsqr")


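def get_cv(y, seed=1):
    """Return a stratified CV with as many folds as the rarest class has samples.

    Each test fold then holds one sample of the rarest class.
    """
    _, label_counts = np.unique(y, return_counts=True)
    cv = StratifiedKFold(n_splits=np.min(label_counts), shuffle=True, random_state=seed)
    return cv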


subject_scores = []
for subject in X:
    subject_score = cross_val_score(
        clf, X[subject], y[subject], cv=get_cv(y[subject])
    ).mean()
    subject_scores.append(subject_score)

print(f"Mean subject score: {np.mean(subject_scores):.3f}")

Mean subject score: 0.355
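
To put the mean score in context: the printed class counts show four roughly balanced classes, so a classifier that ignores the features should score about 0.25. The sketch below shows one way to check such a baseline with scikit-learn's `DummyClassifier`. It is an illustration on synthetic noise, not on the NEMO data: the sample count, the number of folds and the names `X_noise` and `y_noise` are assumptions made for the example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for one subject: 40 epochs, 48 features, 4 balanced classes.
y_noise = np.repeat(np.arange(4), 10)
X_noise = rng.normal(size=(40, 48))  # pure noise, so there is no class signal

# Fixed 5 folds for the toy data (an assumption; the notebook uses get_cv).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Baseline that always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent")
baseline_score = cross_val_score(baseline, X_noise, y_noise, cv=cv).mean()

# Same classifier settings as in the cell above.
lda = LinearDiscriminantAnalysis(shrinkage="auto", solver="lsqr")
lda_score = cross_val_score(lda, X_noise, y_noise, cv=cv).mean()

print(f"Baseline score: {baseline_score:.3f}")
print(f"LDA score on noise: {lda_score:.3f}")
```

Running the same comparison per subject on the real `X` and `y` would show how far each subject's score lies above that subject's own baseline, which can differ from 0.25 if the classes are not balanced within a subject.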
