# Data Analysis with Autoencoders

**Creation:**

This notebook is based on Leonie's Autoencoders notebook.

She did an incredible job there. Because the training and test accuracies are vastly different (99% and 80%), I want to check for overfitting. Since I do not want to change her notebook, I copied it and left the original untouched.
Also, I will only cover the CEBRA part.

A longer preprocessing part has been added; the existing code was then slightly adjusted and used to create the new results.

**Idea:**

In the dataset, the mice were shown a small set of images, each of them numerous times.
My theory is that the classifier might learn to recognize the individual images, but that is not what we are interested in: we want to find differences between the responses to animal images and the responses to non-animal images.

Because of this, I want the test dataset to contain only images that do not occur in the training dataset. This will make the resulting accuracy more meaningful.

If the final accuracy is significantly above 50%, we might have found what we were looking for.
However, the resulting accuracy fluctuated around 50%, so it is not meaningful.
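To make "significantly above 50%" concrete, one option is a one-sided exact binomial test on the number of correctly classified test presentations. The sketch below is only an illustration: the function name `binomial_p_value` and the example counts are assumptions, not part of the original notebook. It also treats presentations as independent trials, which is optimistic here because the test presentations come from only two images.

```python
from math import comb


def binomial_p_value(n_correct: int, n_total: int, chance: float = 0.5) -> float:
    """Return P(X >= n_correct) for X ~ Binomial(n_total, chance).

    This is the p-value of a one-sided exact binomial test of
    "accuracy is not better than chance".
    """
    return sum(
        comb(n_total, k) * chance**k * (1 - chance) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


# Illustrative counts only (assumption, not a result from this notebook).
n_correct, n_total = 30, 40
p_value = binomial_p_value(n_correct, n_total)
print(f"accuracy = {n_correct / n_total:.2f}, one-sided p-value = {p_value:.4f}")
```

A small p-value (for example below 0.05) would suggest that the accuracy is unlikely under pure guessing; a single run with a handful of held-out images should still be interpreted with care.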

**Requirements:**

This notebook only works if you have the mouse data CSV files from `neural_data_analysis.ipynb`.

For the second part of this notebook you need to install CEBRA.

**Don't use `pip install cebra`.** Instead, clone https://github.com/AdaptiveMotorControlLab/CEBRA, navigate into the `CEBRA` directory and run `pip install .`.

You can also try installing it another way, but you need the fix from this [commit](https://github.com/AdaptiveMotorControlLab/CEBRA/commit/5f46c3257952a08dfa9f9e1b149a85f7f12c1053), otherwise you will get errors with newer versions of scikit-learn.

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.decomposition import PCA
from helper_functions import load_data_from_mouse_csv

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random

### Create Feature and Label Vector using helper function:

In [2]:
# To make sure that CEBRA does not simply memorize the individual images, the test dataset will contain only images that do not occur in the training dataset.
# To prepare this, first print every image used in the dataset and whether it shows an animal or not.
feature_vector, label_vector, image_names_list = load_data_from_mouse_csv(labels=["animal_in_image"], additional_information=["image_name"])

label_counts = {}

for image_name, label in zip(image_names_list[0][0], label_vector[0]):
    if image_name not in label_counts:
        label_counts[image_name] = {"Animal": 0, "No Animal": 0}
    
    if label == 1:
        label_counts[image_name]["Animal"] += 1
    else:
        label_counts[image_name]["No Animal"] += 1

for image_name in label_counts:
    print(f"Image {image_name}: {label_counts[image_name]['Animal']} Animals and {label_counts[image_name]['No Animal']} non Animals")

Image omitted: 0 Animals and 306 non Animals
Image im077: 0 Animals and 21 non Animals
Image im054: 17 Animals and 0 non Animals
Image im063: 0 Animals and 21 non Animals
Image im000: 18 Animals and 0 non Animals
Image im073: 0 Animals and 19 non Animals
Image im066: 0 Animals and 19 non Animals
Image im069: 0 Animals and 18 non Animals
Image im075: 0 Animals and 16 non Animals
Image im045: 17 Animals and 0 non Animals
Image im035: 17 Animals and 0 non Animals
Image im031: 0 Animals and 20 non Animals
Image im106: 0 Animals and 17 non Animals
Image im062: 0 Animals and 20 non Animals
Image im061: 0 Animals and 21 non Animals
Image im065: 0 Animals and 19 non Animals
Image im085: 0 Animals and 18 non Animals


In [3]:
# Next, exclude all omitted presentations, randomly choose one animal image and one non-animal image, and move all presentations of these two images out of the training data into the test dataset.

animal_image_names =    [key for key, value in label_counts.items() if value["Animal"]    > 10 and key != "omitted"]
no_animal_image_names = [key for key, value in label_counts.items() if value["No Animal"] > 10 and key != "omitted"]

test_image_animal =    random.choice(animal_image_names)
test_image_no_animal = random.choice(no_animal_image_names)

X_train = []
X_test =  []
y_train = []
y_test =  []

for traces, label, image_name in zip(feature_vector[0], label_vector[0], image_names_list[0][0]):
    if image_name != "omitted":
        if image_name != test_image_animal and image_name != test_image_no_animal:
            X_train.append(traces)
            y_train.append(label)
        else:
            X_test.append(traces)
            y_test.append(label)

X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = pd.Series(y_train)
y_test =  pd.Series(y_test)

print("Data split completed.")
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape:  {X_test.shape}, y_test shape: {y_test.shape}")

X_train_size = sum(x != "omitted" and x != test_image_animal and x != test_image_no_animal for x in image_names_list[0][0])
X_test_size = sum(x == test_image_animal or x == test_image_no_animal for x in image_names_list[0][0])

print(f"expected sizes: {X_train_size} and {X_test_size}")

Data split completed.
X_train shape: (261, 1105), y_train shape: (261,)
X_test shape:  (37, 1105), y_test shape: (37,)
expected sizes: 261 and 37
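
Because the two test images are chosen at random, a single split can give quite different accuracies from run to run. One way to reduce this dependence is to loop over every possible pair of one animal image and one non-animal image. The following is a minimal sketch under stated assumptions: `held_out_image_splits` is a hypothetical helper, and the toy lists stand in for `image_names_list[0][0]` and `label_vector[0]`.

```python
from itertools import product


def held_out_image_splits(image_names, labels, skip=("omitted",)):
    """Yield (train_idx, test_idx, held_out_pair) for every animal/non-animal image pair.

    Each split holds out all presentations of one animal image and one
    non-animal image, so no test image ever appears in the training set.
    Presentations whose name is in `skip` are dropped from both sets.
    """
    animal = sorted({n for n, lab in zip(image_names, labels) if lab == 1 and n not in skip})
    no_animal = sorted({n for n, lab in zip(image_names, labels) if lab != 1 and n not in skip})

    for pair in product(animal, no_animal):
        train_idx = [i for i, n in enumerate(image_names) if n not in skip and n not in pair]
        test_idx = [i for i, n in enumerate(image_names) if n in pair]
        yield train_idx, test_idx, pair


# Toy stand-in data (assumption): 2 animal images, 2 non-animal images, some omissions.
toy_names = ["im_a", "im_c", "omitted", "im_b", "im_a", "im_d", "im_c", "im_b", "omitted", "im_d"]
toy_labels = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]

for train_idx, test_idx, pair in held_out_image_splits(toy_names, toy_labels):
    print(pair, "train:", train_idx, "test:", test_idx)
```

In the notebook, the CEBRA model, the scaler and the classifier would be refit inside the loop for every split, and the test accuracies would be averaged over all pairs.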


### Feature and Label Vectors

For the feature vector I used the traces.

For the labels I used `animal_in_image` [yes/no].


# CEBRA


In [4]:
from cebra import CEBRA
import cebra.integrations.plotly
import plotly.io as pio
import plotly.express as px

In [5]:
cebra_model = CEBRA(model_architecture='offset10-model',
                        batch_size=512,
                        learning_rate=3e-4,
                        temperature=1,
                        output_dimension=3,
                        max_iterations=50,
                        distance='cosine',
                        conditional='time_delta',
                        device='cuda_if_available',
                        verbose=True,
                        time_offsets=10)

# 30 iterations are not enough for training meaningfully, 50 might already overfit.

In [6]:
cebra_model.fit(X_train, y_train)

pos: -0.8950 neg:  6.7989 total:  5.9039 temperature:  1.0000: 100%|█| 50/50 [00:03<00:00, 


In [7]:
X_cebra_projected = cebra_model.transform(X_train) # Project data into latent space
X_cebra_projected_test = cebra_model.transform(X_test)

In [8]:
# Create the interactive embedding plot
fig = cebra.integrations.plotly.plot_embedding_interactive(
    X_cebra_projected,
    embedding_labels=y_train,  
    title="CEBRA-Behavior",
    cmap="rainbow",
    showlegend=True,
    discrete=True
)
fig.update_layout(
    legend=dict(
        itemsizing="constant",  # Uniform item sizing
        traceorder="normal"    # Normal or reversed order
    )
)

# Show the plot
fig.show(renderer="iframe")

[Interactive plot: CEBRA embedding of the training data ("CEBRA-Behavior"), colored by label]

In [15]:
# Normalize data
results = []

scaler = StandardScaler()
X_cebra_projected_scaled = scaler.fit_transform(X_cebra_projected)
X_cebra_projected_test_scaled = scaler.transform(X_cebra_projected_test)

# Hyperparameters to tune
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Regularization strength
    'penalty': ['l1', 'l2'],       # Regularization type
    'solver': ['liblinear', 'saga'] # Solvers that support both l1 and l2
}

# Logistic regression with balanced class weights.
# The `solver` and `C` values set here are overridden by the grid search below.
clf = LogisticRegression(max_iter=1000, solver='lbfgs', class_weight='balanced', C=0.1)

# Tune hyperparameters (to avoid overfitting).
# Note: cv=5 ignores image identity, so the same image can appear in several folds.
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_cebra_projected_scaled, y_train)
best_model = grid_search.best_estimator_

# Evaluate on training and test sets
y_pred_train = best_model.predict(X_cebra_projected_scaled)
clf_cebra_train_acc = accuracy_score(y_train, y_pred_train)

y_pred_test = best_model.predict(X_cebra_projected_test_scaled)
clf_cebra_test_acc = accuracy_score(y_test, y_pred_test)

results.append({
    "Model": "CEBRA", 
    "Reconstruction Train MSE": None,
    "Reconstruction Test MSE": None,
    "CLF Train Accuracy": clf_cebra_train_acc,
    "CLF Test Accuracy": clf_cebra_test_acc,
})

results_df = pd.DataFrame(results)
results_df

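|   | Model | Reconstruction Train MSE | Reconstruction Test MSE | CLF Train Accuracy | CLF Test Accuracy |
|---|-------|--------------------------|-------------------------|--------------------|-------------------|
| 0 | CEBRA | None                     | None                    | 0.988506           | 0.648649          |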


When the images are separated and only a few training iterations are used to prevent overfitting, the test accuracy fluctuates around 50% between runs (the run shown above reached about 65%), which means that nothing interesting has been learned.
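
The held-out test set here has 37 presentations from two images, so the two classes cannot be exactly the same size and the chance level is not exactly 50%. The sketch below shows one way to make the baseline explicit: the accuracy of always predicting the majority class, and a balanced accuracy that weights both classes equally. The helper names and all toy labels are assumptions for illustration; scikit-learn's `DummyClassifier(strategy="most_frequent")` and `balanced_accuracy_score` offer the same ideas.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return sum(label == majority_label for label in y_test) / len(y_test)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        predictions_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(p == cls for p in predictions_for_cls) / len(predictions_for_cls))
    return sum(recalls) / len(recalls)


# Toy labels (assumption): imbalanced training set, one held-out image per class.
toy_y_train = [0] * 200 + [1] * 60
toy_y_test = [0] * 20 + [1] * 17
toy_y_pred = [0] * 20 + [1] * 4 + [0] * 13  # hypothetical classifier output

print(f"majority baseline: {majority_baseline_accuracy(toy_y_train, toy_y_test):.3f}")
print(f"balanced accuracy: {balanced_accuracy(toy_y_test, toy_y_pred):.3f}")
```

Comparing the test accuracy against such a baseline, rather than against a fixed 50%, makes it easier to tell whether a classifier that mostly predicts one class is being mistaken for one that has learned something.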