<a href="https://colab.research.google.com/github/JensBlack/IEECR_Hackathon24/blob/main/IEECR_Hackathon_24.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IEECR HACKATHON 2024

## Dataset: https://data.caltech.edu/records/s0vdx-0k302

## Short description

Multi-agent behavior modeling aims to understand the interactions that occur between agents. We present a multi-agent dataset from behavioral neuroscience, the Caltech Mouse Social Interactions (CalMS21) Dataset. Our dataset consists of trajectory data of social interactions, recorded from videos of freely behaving mice in a standard resident-intruder assay.

## More information and paper:

https://arxiv.org/abs/2104.02710

# Prepping everything (just run once)

In [1]:
#@title Install all necessary libraries (run cell)

! pip install numpy pandas matplotlib scipy scikit-learn



In [17]:
#@title Download and unzip dataset (run cell)

!wget https://data.caltech.edu/records/s0vdx-0k302/files/task1_classic_classification.zip
print("Unpacking file...")
#unpack zip file
!unzip /content/task1_classic_classification.zip


In [9]:
#@title Required functions for conversion (just run)

import json
import os
import numpy as np
import argparse
import pandas as pd
import pickle

"""Taken from calms21_convert_to_npy.py from CalMS21 repository"""

'''
Script for converting CalMS21 .json files into .npy files.
The .npy files have the same dictionary layout, except the entries are
numpy arrays instead of lists.
If treba features are not appended, the final dictionary 'keypoint' entries will have shape:
sequence_length x 2 x 2 x 7.
If treba features are appended, the final dictionary 'features' entries will have shape:
sequence_length x 60 (2x2x7 + 32).
'''

def convert_to_array(dictionary, feature_dictionary = None):
    # Convert dictionary values (lists) to numpy arrays, until depth 3.
    # If feature dictionary is not None, also concatenate the dictionary values.
    converted = {}

    # First key is the group name for the sequences
    for groupname in dictionary.keys():

        converted[groupname] = {}
        # Next key is the sequence id
        for sequence_id in dictionary[groupname].keys():

            converted[groupname][sequence_id] = {}

            # If not adding features, add keypoints, scores, and annotations & metadata (if available)
            if feature_dictionary is None:
                converted[groupname][sequence_id]['keypoints'] = np.array(dictionary[groupname][sequence_id]['keypoints'])
            else:
                keypoints = np.array(dictionary[groupname][sequence_id]['keypoints'])
                converted[groupname][sequence_id]['features'] = np.concatenate([keypoints.reshape(keypoints.shape[0], -1),
                                                feature_dictionary[groupname][sequence_id]['features']], axis = -1)

            converted[groupname][sequence_id]['scores'] = np.array(dictionary[groupname][sequence_id]['scores'])

            if 'annotations' in dictionary[groupname][sequence_id].keys():
                converted[groupname][sequence_id]['annotations'] = np.array(dictionary[groupname][sequence_id]['annotations'])

            if 'metadata' in dictionary[groupname][sequence_id].keys():
                converted[groupname][sequence_id]['metadata'] = dictionary[groupname][sequence_id]['metadata']

    return converted


def json_save_to_npy(input_name, output_name, feature_name = None):
    with open(input_name, 'r') as fp:
        input_data = json.load(fp)

    if feature_name is not None:
        with open(feature_name, 'r') as fp:
            features_data = json.load(fp)

        input_data = convert_to_array(input_data, features_data)
    else:
        input_data = convert_to_array(input_data)

    print("Saving " + output_name)
    np.save(output_name, input_data, allow_pickle=True)


def convert_all_calms21(input_directory, output_directory, parse_treba = False):

    calms21_files = []

    # find all files beginning with calms21 in the input directory and ending with .json
    for root, dirs, files in os.walk(input_directory):

        for name in files:
            if name.startswith('calms21_') and name.endswith('.json'):
                calms21_files.append(os.path.join(root, name))

    for single_file in calms21_files:
        if not parse_treba:
            # Parse keypoints only.
            file_name = single_file.split('/')[-1].split('.')[0]
            npy_output_name = os.path.join(output_directory, file_name)
            json_save_to_npy(single_file, npy_output_name)

        else:
            # Parse keypoints and concatenate with treba features.
            file_name = single_file.split('/')[-1].split('.')[0]
            npy_output_name = os.path.join(output_directory, file_name + '_features')

            current_root = single_file.rsplit('/', 1)[0]
            treba_feature_name = os.path.join(current_root, 'taskprog_features' + file_name[7:] + '.json')

            json_save_to_npy(single_file, npy_output_name, treba_feature_name)


""" Adapted from A-SOiD repository """
def convert_data_format(data):
    """Convert the original data into a format that can be used for training.

    Args:
        data: np.array
            The original data format.

    Returns:
        collection: list
            A list of dataframes, each containing the pose data for one sequence.
        targets: list
            A list of targets for each sequence.
    """
    data_dict = dict(enumerate(data.flatten(), 1))
    file_names = list(data_dict[1]['annotator-id_0'].keys())
    pose_estimates = [data_dict[1]['annotator-id_0'][j]['keypoints']
                      for j in file_names]

    targets = [data_dict[1]['annotator-id_0'][j]['annotations']
                   for j in file_names]

    keypoint_names = ['nose', 'ear_left', 'ear_right', 'neck', 'hip_left', 'hip_right', 'tail_base']
    collection = []
    for i in range(len(pose_estimates)):
        single_sequence = pose_estimates[i]
        resident_pose = single_sequence[:, 0 , :]
        intruder_pose = single_sequence[:, 1 , :]

        resident_pose_2d = resident_pose.reshape(resident_pose.shape[0],
                                                        resident_pose.shape[1] *
                                                        resident_pose.shape[2],
                                                        order='F')
        intruder_pose_2d = intruder_pose.reshape(intruder_pose.shape[0],
                                                        intruder_pose.shape[1] *
                                                        intruder_pose.shape[2],
                                                        order='F')

        # merge resident and intruder
        pose = np.concatenate((resident_pose_2d, intruder_pose_2d), axis=1)

        # convert to dataframe
        keypoints_idx = pd.MultiIndex.from_product([[file_names[i]], ['resident', 'intruder'], keypoint_names, list('xy')],
                                               names=["source", 'animal', 'keypoints', 'coords'])


        df_keypoints = pd.DataFrame(pose, columns= keypoints_idx)

        collection.append(df_keypoints)

    if targets:
        return collection, targets
    else:
        return collection


Saving /content/task1_classic_classification/calms21_task1_test
Saving /content/task1_classic_classification/calms21_task1_train
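
The `order='F'` reshape in `convert_data_format` is easy to misread. The following is a minimal sketch on a synthetic array (not CalMS21 data), assuming the `sequence_length x 2 x 2 x 7` layout from the docstring above is ordered as mouse, coordinate, keypoint. It suggests that the flattened columns interleave x and y per keypoint, which is the order the `MultiIndex` expects. The value encoding and the `column_labels` helper are only for illustration.

```python
import numpy as np

keypoint_names = ['nose', 'ear_left', 'ear_right', 'neck', 'hip_left', 'hip_right', 'tail_base']

# Synthetic sequence: 3 frames, 2 mice, 2 coordinates (x, y), 7 keypoints.
# Each value encodes 100 * coordinate_index + keypoint_index so it can be traced.
n_frames = 3
toy = np.zeros((n_frames, 2, 2, 7))
for c in range(2):
    for k in range(7):
        toy[:, :, c, k] = 100 * c + k

# Same steps as in convert_data_format, for the first mouse only.
resident = toy[:, 0, :]  # shape (n_frames, 2, 7)
flat = resident.reshape(resident.shape[0],
                        resident.shape[1] * resident.shape[2],
                        order='F')

# Column j of the flattened array holds coordinate j % 2 of keypoint j // 2.
column_labels = [f"{keypoint_names[j // 2]}_{'xy'[j % 2]}" for j in range(flat.shape[1])]

print(flat.shape)
print(column_labels[:4])
print(flat[0, :4])
```

With the default C order, the columns would instead contain all x values followed by all y values, which would not match the column labels.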


In [13]:
#@title Run conversion code

#convert to numpy
calms21_directory = "/content/task1_classic_classification"
convert_all_calms21(input_directory=calms21_directory, output_directory=calms21_directory, parse_treba=False)

# pose
train_data_path = f"{calms21_directory}/calms21_task1_train.npy"
test_data_path = f"{calms21_directory}/calms21_task1_test.npy"

train_npy = np.load(train_data_path, allow_pickle=True)
train_data, train_targets = convert_data_format(train_npy)

test_npy = np.load(test_data_path, allow_pickle=True)
test_data, test_targets = convert_data_format(test_npy)

# save

base_path = "data"
# create directories for train and test data
# train
os.makedirs(f"{base_path}/train", exist_ok=True)
# test
os.makedirs(f"{base_path}/test", exist_ok=True)

# save train data as pickle file
with open(f"{base_path}/train/train_data.pkl", "wb") as f:
    pickle.dump(train_data, f)
with open(f"{base_path}/train/train_targets.pkl", "wb") as f:
    pickle.dump(train_targets, f)

# save test data and targets
with open(f"{base_path}/test/test_data.pkl", "wb") as f:
    pickle.dump(test_data, f)
with open(f"{base_path}/test/test_targets.pkl", "wb") as f:
    pickle.dump(test_targets, f)

print("Data conversion completed! Data can be found in {}".format(base_path))

Data conversion completed! Data can be found in data
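
Before moving on, it can be worth checking that every pose sequence has exactly one label per frame. The sketch below is one possible sanity check. `check_alignment` is a hypothetical helper, not part of the CalMS21 or A-SOiD code, and the toy inputs stand in for `train_data` and `train_targets`.

```python
import numpy as np
import pandas as pd


def check_alignment(pose_collection, label_collection):
    """Return indices of sequences whose frame count differs from their label count."""
    if len(pose_collection) != len(label_collection):
        raise ValueError("Number of pose sequences and label arrays differ")
    return [i for i, (df, labels) in enumerate(zip(pose_collection, label_collection))
            if len(df) != len(labels)]


# Synthetic stand-ins: two sequences with 28 pose columns each.
toy_poses = [pd.DataFrame(np.zeros((5, 28))), pd.DataFrame(np.zeros((3, 28)))]
# The second label array is deliberately one label too short.
toy_labels = [np.array([3, 1, 1, 1, 1]), np.array([3, 3])]

mismatched = check_alignment(toy_poses, toy_labels)
print("Sequences with mismatched lengths:", mismatched)
```

On the converted data, the same call would be `check_alignment(train_data, train_targets)`, and an empty list would indicate that frames and labels line up.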


In [15]:
#@title Delete downloaded data to avoid confusion

!rm task1_classic_classification.zip
!rm -rf task1_classic_classification

print("Get started with the baseline notebook")

rm: cannot remove 'task1_classic_classification.zip': No such file or directory


# Data structure and content

In [27]:
#@title Load converted data

#load the data
with open('data/train/train_data.pkl', 'rb') as f:
    train_pose_collection = pickle.load(f)

#load the labels
with open('data/train/train_targets.pkl', 'rb') as f:
    train_label_collection = pickle.load(f)

# load the test data
with open('data/test/test_data.pkl', 'rb') as f:
    test_pose_collection = pickle.load(f)

# load the test labels
with open('data/test/test_targets.pkl', 'rb') as f:
    test_label_collection = pickle.load(f)

# Base parameters

bodyparts = ['nose', 'ear_left', 'ear_right', 'neck', 'hip_left', 'hip_right', 'tail_base']
label_to_num = {'attack': 0, 'investigation': 1, 'mount': 2, 'other': 3}
num_to_label = {0: 'attack', 1: 'investigation', 2: 'mount', 3: 'other'}

print("Data loaded")

Data loaded


In [25]:
#@title Pose data explained:

print("Train consists of {} sequences".format(len(train_pose_collection)))
print("Test consists of {} sequences".format(len(test_pose_collection)))

print("Each sequence is a DataFrame containing the x, y coordinates of 7 body parts, for two mice (Resident, Intruder)")
print("The body parts are: ", *bodyparts)
print("Each sequence consists of a varying number of frames, e.g. {} for sequence 0".format(len(train_pose_collection[0])))

print("Here is an example of the first 5 frames of the first sequence:")
train_pose_collection[0].head()

Train consists of 70 sequences
Test consists of 19 sequences
Each sequence is a DataFrame containing the x, y coordinates of 7 body parts, for two mice (Resident, Intruder)
The body parts are:  nose ear_left ear_right neck hip_left hip_right tail_base
Each sequence consists of a varying number of frames, e.g. 21364 for sequence 0
Here is an example of the first 5 frames of the first sequence:


source (identical for every column): task1/train/mouse001_task1_annotator1

| | resident nose x | resident nose y | resident ear_left x | resident ear_left y | resident ear_right x | resident ear_right y | resident neck x | resident neck y | resident hip_left x | resident hip_left y | ... | intruder ear_right x | intruder ear_right y | intruder neck x | intruder neck y | intruder hip_left x | intruder hip_left y | intruder hip_right x | intruder hip_right y | intruder tail_base x | intruder tail_base y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 831.659204 | 202.914433 | 805.659204 | 250.914433 | 775.659204 | 189.914433 | 780.659204 | 225.914433 | 711.659204 | 278.914433 | ... | 897.915924 | 193.216902 | 866.915924 | 179.216902 | 796.915924 | 152.216902 | 840.915924 | 102.216902 | 766.915924 | 97.216902 |
| 1 | 833.050439 | 201.895063 | 809.050439 | 251.895063 | 778.050439 | 193.895063 | 783.050439 | 229.895063 | 723.050439 | 287.895063 | ... | 899.907019 | 201.539977 | 869.907019 | 188.539977 | 799.907019 | 153.539977 | 846.907019 | 105.539977 | 766.907019 | 98.539977 |
| 2 | 838.718976 | 179.862692 | 816.718976 | 244.862692 | 776.718976 | 193.862692 | 787.718976 | 225.862692 | 730.718976 | 286.862692 | ... | 897.195703 | 205.902935 | 868.195703 | 193.902935 | 800.195703 | 150.902935 | 860.195703 | 112.902935 | 777.195703 | 99.902935 |
| 3 | 826.757507 | 175.148063 | 815.757507 | 235.148063 | 774.757507 | 187.148063 | 785.757507 | 218.148063 | 743.757507 | 282.148063 | ... | 886.788861 | 206.420539 | 856.788861 | 193.420539 | 794.788861 | 147.420539 | 856.788861 | 113.420539 | 786.788861 | 97.420539 |
| 4 | 822.045709 | 174.457936 | 812.045709 | 222.457936 | 768.045709 | 178.457936 | 779.045709 | 211.457936 | 749.045709 | 278.457936 | ... | 876.578644 | 201.366469 | 848.578644 | 190.366469 | 789.578644 | 143.366469 | 862.578644 | 120.366469 | 793.578644 | 95.366469 |
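
The four column levels (`source`, `animal`, `keypoints`, `coords`) can be used to pull out individual body parts or coordinates. The following sketch builds a small synthetic DataFrame with the same column layout (random values, hypothetical `toy_sequence` source name) and shows one way to select from it with `DataFrame.xs`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

bodyparts = ['nose', 'ear_left', 'ear_right', 'neck', 'hip_left', 'hip_right', 'tail_base']

# Synthetic DataFrame with the same four column levels as the converted sequences.
columns = pd.MultiIndex.from_product(
    [['toy_sequence'], ['resident', 'intruder'], bodyparts, list('xy')],
    names=['source', 'animal', 'keypoints', 'coords'],
)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, len(columns))), columns=columns)

# The 'source' level is constant within a sequence, so drop it for shorter selections.
df = df.droplevel('source', axis=1)

# x and y of the resident's nose: one row per frame, columns 'x' and 'y'.
resident_nose = df.xs(('resident', 'nose'), level=['animal', 'keypoints'], axis=1)

# All x coordinates of the intruder: one column per body part.
intruder_x = df.xs(('intruder', 'x'), level=['animal', 'coords'], axis=1)

print(resident_nose.shape, list(resident_nose.columns))
print(intruder_x.shape, list(intruder_x.columns))
```

The same calls should work on any element of `train_pose_collection`, since each of them carries these column levels.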


In [26]:
#@title Behavioral labels explained:

print("For each sequence, there is a corresponding label array. Furthermore, each time step in the sequence has a single, exclusive label.")
print("The labels are: ", *label_to_num.keys(), " and are represented as integers in the data, e.g. 'attack' is represented as 0")
print(label_to_num)
print("Example: the first sequence is labeled as: ", train_label_collection[0][:5])
print("Which translates to: ", [num_to_label[i] for i in train_label_collection[0][:5]])


For each sequence, there is a corresponding label array. Furthermore, each time step in the sequence has a single, exclusive label.
The labels are:  attack investigation mount other  and are represented as integers in the data, e.g. 'attack' is represented as 0
{'attack': 0, 'investigation': 1, 'mount': 2, 'other': 3}
Example: the first sequence is labeled as:  [3 1 1 1 1]
Which translates to:  ['other', 'investigation', 'investigation', 'investigation', 'investigation']
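
Since every frame carries exactly one integer label, the class distribution can be counted directly from the label arrays. The sketch below uses two short synthetic label arrays in place of `train_label_collection`, so the printed numbers say nothing about the real dataset.

```python
import numpy as np

label_to_num = {'attack': 0, 'investigation': 1, 'mount': 2, 'other': 3}
num_to_label = {num: label for label, num in label_to_num.items()}

# Synthetic label arrays standing in for train_label_collection.
toy_label_collection = [np.array([3, 1, 1, 1, 1]), np.array([3, 3, 0, 2, 3])]

# Concatenate all sequences and count how often each class occurs.
all_labels = np.hstack(toy_label_collection)
counts = np.bincount(all_labels, minlength=len(label_to_num))
fractions = counts / counts.sum()

for num, count in enumerate(counts):
    print("{}: {} frames ({:.0f}%)".format(num_to_label[num], count, fractions[num] * 100))
```

Knowing the class frequencies helps when interpreting the overall accuracy reported later, because a frequent class can dominate that number.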


# Baseline Classification

## Import model, run training and test on validation

In [None]:
from sklearn.model_selection import train_test_split
# import random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# flatten the data and labels
X = np.vstack([df.values for df in train_pose_collection])
y = np.hstack(train_label_collection)

# split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# train the model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# predict the validation set
y_pred = clf.predict(X_val)

# calculate the accuracy
accuracy = accuracy_score(y_val, y_pred)
print("Validation accuracy: {:.2f}%".format(accuracy * 100))

print("And what about the separate classes?")
for i in range(4):
    print("Class {}: {:.2f}%".format(i, accuracy_score(y_val[y_val == i], y_pred[y_val == i]) * 100))
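
The split above shuffles individual frames. Neighboring video frames tend to be very similar, so frames from the same sequence can end up in both the training and the validation set, which may make the validation accuracy look better than it would on unseen recordings. One possible alternative, sketched here on synthetic data, is to split by sequence with scikit-learn's `GroupShuffleSplit`. The toy arrays and the `groups` variable are assumptions for illustration and are not part of the baseline.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)

# Synthetic stand-ins: 5 sequences of different lengths with 28 features per frame.
toy_sequences = [rng.random((n, 28)) for n in (30, 20, 25, 15, 10)]
toy_labels = [rng.integers(0, 4, size=len(seq)) for seq in toy_sequences]

X = np.vstack(toy_sequences)
y = np.hstack(toy_labels)

# One group id per frame: the index of the sequence the frame came from.
groups = np.hstack([np.full(len(seq), i) for i, seq in enumerate(toy_sequences)])

# Hold out roughly 20% of the sequences (not frames) for validation.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(X, y, groups=groups))

X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]

print("Training sequences:", np.unique(groups[train_idx]))
print("Validation sequences:", np.unique(groups[val_idx]))
```

With the real data, `groups` could be built the same way from `train_pose_collection`, and the classifier would then be fitted on `X_train`, `y_train` as before.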


## Testing on the test set

In [None]:
# predict the test set
X_test = np.vstack([df.values for df in test_pose_collection])
y_test = np.hstack(test_label_collection)

y_pred = clf.predict(X_test)

# calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Test accuracy: {:.2f}%".format(accuracy * 100))

print("And what about the separate classes?")
for i in range(4):
    print("Class {}: {:.2f}%".format(i, accuracy_score(y_test[y_test == i], y_pred[y_test == i]) * 100))
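
The per-class number printed above is the fraction of frames of a class that were predicted as that class, which is the same quantity as per-class recall. As a rough sketch on synthetic labels (not real predictions), the snippet below compares the manual computation with `recall_score` and also prints a confusion matrix and a classification report, which additionally show precision and where the misclassified frames end up.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, recall_score

num_to_label = {0: 'attack', 1: 'investigation', 2: 'mount', 3: 'other'}
labels = sorted(num_to_label)

# Synthetic ground truth and predictions standing in for y_test and y_pred.
y_true = np.array([0, 0, 1, 1, 1, 2, 3, 3, 3, 3])
y_hat = np.array([0, 3, 1, 1, 3, 2, 3, 3, 3, 1])

# Per-class accuracy as computed in the notebook: restrict to frames of class i.
manual_recall = np.array([np.mean(y_hat[y_true == i] == i) for i in labels])

# The same quantity from scikit-learn.
per_class_recall = recall_score(y_true, y_hat, labels=labels, average=None)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=labels)

print("Manual per-class accuracy:", manual_recall)
print("recall_score:             ", per_class_recall)
print(cm)
print(classification_report(y_true, y_hat, labels=labels,
                            target_names=[num_to_label[i] for i in labels]))
```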

