# Full Tutorial: BCI Classification Pipeline (pyLittleEegle)

This notebook presents a complete workflow for BCI data analysis, ranging from database selection to classification using Riemannian Geometry.

We will use the core modules of the **Eegle-Python** package:
1.  **`Database`**: To explore and filter the FII BCI corpus.
2.  **`InOut`**: To load, filter, and standardize EEG recordings.
3.  **`BCI`**: To perform covariance encoding and cross-validation.

We will also leverage **`pyriemann`** and **`scikit-learn`** for the classification models.

In [None]:
# import modules and package
import sys
import os
import numpy as np
from pyriemann.classification import MDM, TSclassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Eegle modules 
sys.path.append(os.path.abspath('src'))
from BCI import crval, encode
from Database import selectDB
from InOut import readNY

## 1. Database Selection and Exploration

The first step is to identify relevant databases within your local corpus. Instead of loading the heavy signal data immediately, we use the `Database` module to scan the metadata.

Here, we use **`selectDB`** with specific criteria:
* **Corpus**: The path to your folder containing the NY-formatted databases.
* **Paradigm**: `'MI'` (Motor Imagery).
* **Classes**: We only want databases containing *at least* the classes `'feet'` and `'right_hand'`.

> **Note:** Please modify the `corpusDir` variable to point to your own data directory.
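
The "at least" criterion amounts to a subset test on each database's class list. Here is a minimal sketch of that logic, assuming hypothetical metadata records: the dictionaries and the `has_classes` helper are illustrative only and are not part of the `Database` module.

```python
# Hypothetical metadata records; the real infoDB objects carry many more fields.
catalog = [
    {"dbName": "dbA", "paradigm": "MI",
     "classes": ["left_hand", "right_hand", "feet", "tongue"]},
    {"dbName": "dbB", "paradigm": "MI",
     "classes": ["left_hand", "right_hand"]},
    {"dbName": "dbC", "paradigm": "P300",
     "classes": ["target", "nontarget"]},
]


def has_classes(record, paradigm, required):
    """True if the record matches the paradigm and offers every required class."""
    return record["paradigm"] == paradigm and set(required) <= set(record["classes"])


selected = [r["dbName"] for r in catalog
            if has_classes(r, "MI", ["feet", "right_hand"])]
print(selected)  # ['dbA']
```

Only `dbA` is kept: `dbB` has no `'feet'` class and `dbC` uses a different paradigm. Extra classes such as `'tongue'` do not exclude a database.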

In [None]:
# databases selection
corpusDir = "C:\\Users\\doumif\\work\\OfficeWork\\BCI Databases\\NY"
classes = ["feet", "right_hand"]
DBs = selectDB(corpusDir, "MI", classes)

## 2. Metadata Inspection

The `selectDB` function returns a list of `infoDB` objects. These objects are lightweight and contain all necessary information to understand the structure of a database without loading the signals.

By inspecting an element of this list (here `DBs[1]`, the second database, since Python indexing starts at 0), we get a formatted summary including:

* **Experimental Context:** The database name, condition, and BCI paradigm (e.g., 'MI').
* **Subject Statistics:** Total number of subjects (`nSubjects`) and the range of sessions per subject (min, max).
* **Signal Specifications:** Sampling rate (`sr`), number of electrodes (`nSensors`), sensor names, and sensor type (e.g., 'wet', 'dry').
* **Time parameters:** Window length (`wl`) in samples and offset.
* **Class Balance:** A detailed statistical breakdown of **nTrials per class** (Mean ± Standard Deviation, Min, Max), which is crucial to check if the data is balanced.

> **Note:** Additional metadata fields (DOI, hardware, software, authors, etc.) are stored in the object and can be accessed via dot notation (e.g., `DBs[1].doi` or `DBs[1].description`).
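
To make the class-balance figures concrete, the sketch below computes the same kind of summary (mean ± standard deviation, min, max) from made-up trial counts. The numbers and the `balance_summary` helper are assumptions for illustration. It uses the sample standard deviation, which may differ from the convention used in the `infoDB` summary.

```python
from statistics import mean, stdev

# Hypothetical trial counts per class, one entry per session.
n_trials = {
    "feet": [72, 72, 70, 68],
    "right_hand": [72, 71, 72, 60],
}


def balance_summary(counts):
    """Return (mean, sample standard deviation, min, max) of a list of counts."""
    return mean(counts), stdev(counts), min(counts), max(counts)


for label, counts in n_trials.items():
    m, s, lo, hi = balance_summary(counts)
    print(f"{label}: {m:.1f} ± {s:.1f} (min {lo}, max {hi})")
```

A large spread, or a minimum far below the mean for one class, is a hint that some sessions are unbalanced and that plain accuracy should be read with care.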

In [None]:
DBs[1]

## 3. Single File Pipeline (Session-Specific)

Before launching complex loops, it is best practice to test the pipeline on a single file (one session of one subject). Here are the three key steps:

1.  **Loading (`readNY`)**:
    * We load the `.npz` file.
    * `classes=classes`: We keep only "feet" and "right_hand" (other classes are ignored).
    * `bandPass=(8, 32)`: We apply a direct band-pass filter (8-32 Hz) to isolate motor rhythms (Mu/Beta bands).

2.  **Encoding (`encode`)**:
    * We transform EEG epochs into Covariance Matrices.
    * `covType='scm'`: We use the *Sample Covariance Matrix*. Other options include `'lwf'` (Ledoit-Wolf) or `'oas'` (Oracle Approximating Shrinkage). See `pyriemann` for more options.
    * The output has the shape expected by `pyriemann`: `(n_trials, n_channels, n_channels)`.

3.  **Validation (`crval`)**:
    * We use an **MDM** (Minimum Distance to Mean) classifier, known for being robust and fast.
    * We perform a 10-fold Cross-Validation.
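
The encoding step can be pictured with a short NumPy sketch. It is an illustration on synthetic data, not the code behind `encode`: the `sample_covariances` helper is hypothetical, and it assumes epochs are stored as `(n_trials, n_channels, n_samples)`.

```python
import numpy as np


def sample_covariances(epochs):
    """Sample covariance matrix of each epoch.

    epochs: array of shape (n_trials, n_channels, n_samples).
    Returns an array of shape (n_trials, n_channels, n_channels).
    """
    # Remove the temporal mean of every channel, then average outer products.
    centered = epochs - epochs.mean(axis=-1, keepdims=True)
    n_samples = epochs.shape[-1]
    return centered @ centered.transpose(0, 2, 1) / (n_samples - 1)


rng = np.random.default_rng(0)
epochs = rng.standard_normal((20, 8, 256))  # synthetic data, not real EEG
covs = sample_covariances(epochs)
print(covs.shape)  # (20, 8, 8)
```

Each trial becomes one symmetric positive-definite matrix, which is the object Riemannian classifiers such as MDM operate on.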

In [None]:
# Cross validation for a single file
clf = MDM(metric='riemann', n_jobs=4)
file = DBs[1].files[0]
o = readNY(file, classes=classes, bandPass = (8, 32))
𝐂 = encode(o, paradigm = 'MI', covType='scm')
res = crval(clf, 𝐂, o.y, n_folds = 10, shuffle = True, random_state = 42)
display(res)

## 4. Full Database Analysis

Now that the pipeline is validated, we can process an entire database (all subjects/sessions).

For this example, we switch the classification strategy to a method that often yields higher performance:
* **Tangent Space (`TSclassifier`)**: Projects the covariance matrices into the Tangent Space (Euclidean).
* **Logistic Regression**: A standard linear classifier applied to the tangent vectors, using L1 regularization (`lasso`) for feature selection.
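
The tangent-space projection named above can be sketched with plain NumPy. This illustrates the standard log-map at a reference matrix and is not the `pyriemann` implementation. `tangent_vector` and `_apply_to_eigenvalues` are hypothetical helper names. The reference point `G` is passed by hand here, whereas `TSclassifier` estimates it from the training data.

```python
import numpy as np


def _apply_to_eigenvalues(M, func):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(M)
    return (V * func(w)) @ V.T


def tangent_vector(C, G):
    """Map the SPD matrix C to a vector in the tangent space at the SPD matrix G."""
    G_isqrt = _apply_to_eigenvalues(G, lambda w: 1.0 / np.sqrt(w))
    # Whiten C by the reference point, then take the matrix logarithm.
    S = _apply_to_eigenvalues(G_isqrt @ C @ G_isqrt, np.log)
    # Keep the upper triangle; off-diagonal terms are weighted by sqrt(2) so the
    # Euclidean norm of the vector equals the Frobenius norm of S.
    rows, cols = np.triu_indices_from(S)
    weights = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return weights * S[rows, cols]


C = np.diag([np.exp(1.0), np.exp(2.0)])
print(tangent_vector(C, np.eye(2)))
```

With the identity as reference and a diagonal `C`, the result is approximately `[1, 0, 2]`, the logarithms of the diagonal entries. An `n × n` matrix yields `n(n+1)/2` features, which is why a sparsity-inducing penalty such as L1 can be useful when the number of channels is large.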

We store the average accuracy (`avgAcc`) of each file to calculate the global performance of the database.

In [None]:
# Cross-validation for a full database 
DB = DBs[1] # here BNCI2014001 will be the example
accDB = np.zeros(len(DB.files))
print(f"Database name: {DB.dbName}-{DB.condition}")
clf = TSclassifier(clf=LogisticRegression(penalty='l1', solver='saga', max_iter=1000, n_jobs=4))
for f, file in enumerate(DB.files):
    nf = len(DB.files)
    print(f"file {f+1} of {nf}")
    o = readNY(file, classes=classes, bandPass = (8, 32))
    𝐂 = encode(o, paradigm = 'MI', covType='scm')
    res = crval(clf, 𝐂, o.y, n_folds = 10, shuffle = True, random_state = 89)
    accDB[f] = res.avgAcc

print(f"\nAverage accuracy for {DB.dbName}-{DB.condition} : ", np.round(np.mean(accDB)*100, decimals = 2), "% ±", np.round(np.std(accDB)*100, decimals=2))

## 5. Large-Scale Benchmark (Multi-Database)

Finally, we can launch an evaluation across all databases selected by `selectDB`.

* We iterate through every database in `DBs`.
* Inside, we iterate through every file.
* We use a **LinearSVC** (Support Vector Machine) in the Tangent Space.
* The results (accuracy per file) are saved into text files (`.txt`) named after the database for further statistical analysis.

> **Warning:** Depending on the size of your data and CPU cores, this cell might take some time to execute.

In [None]:
# Cross-validation for all databases matching the requested classes
accDBs = np.zeros(len(DBs))  # mean accuracy of each database
clf = TSclassifier(clf=LinearSVC(max_iter=1000))

for db, DB in enumerate(DBs):
    nf = len(DB.files)
    accDB = np.zeros(nf)
    print(f"Database name: {DB.dbName}-{DB.condition}")
    for f, file in enumerate(DB.files):
        print(f"file {f+1} of {nf}")
        o = readNY(file, classes=classes, bandPass = (8, 32))
        𝐂 = encode(o, paradigm = 'MI', covType='scm')
        res = crval(clf, 𝐂, o.y, n_folds = 10, shuffle = True, random_state = 89)
        accDB[f] = res.avgAcc
    accDBs[db] = np.mean(accDB)  # keep the per-database mean for later comparison
    print(f"\nAverage accuracy for {DB.dbName}-{DB.condition} : ", np.round(np.mean(accDB)*100, decimals = 2), "% ±", np.round(np.std(accDB)*100, decimals=2))
    np.savetxt(f"acc_{DB.dbName}-{DB.condition}.txt", accDB, fmt="%.5f")
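
For the follow-up statistical analysis, the saved files can be read back with `np.loadtxt`. The sketch below is one possible way to do it. `summarize_accuracy_files` is a hypothetical helper, and the demo writes synthetic accuracies to a temporary folder rather than using real results.

```python
import tempfile
from pathlib import Path

import numpy as np


def summarize_accuracy_files(folder, pattern="acc_*.txt"):
    """Map each results file name to the (mean, std) of its accuracies, in percent."""
    summary = {}
    for path in sorted(Path(folder).glob(pattern)):
        # One accuracy per line; atleast_1d covers files with a single value.
        acc = np.atleast_1d(np.loadtxt(path))
        summary[path.name] = (100 * acc.mean(), 100 * acc.std())
    return summary


# Demo on synthetic numbers written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    np.savetxt(Path(tmp) / "acc_demo-A.txt", [0.70, 0.80, 0.90], fmt="%.5f")
    demo_summary = summarize_accuracy_files(tmp)
print(demo_summary)
```

Pointing `folder` at the directory where the benchmark cell was run should collect one entry per database, ready for comparison across classifiers or conditions.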