
# Spine Parameter Data Compiler

## With this script, data on reconstructed dendrites and spines can be compiled into two data sets: <br>
### - number and distribution of spines <br> 
### - morphology of spines <br>

For information on reconstruction of dendrites and spines, we would like to refer to our protocol in Appendix 1 of this publication. <br>

The following input, compiled into one folder, is needed: <br>
1. This script, stored in that folder
2. An overview file linking Animal Number to the Condition, with a name ending in _Overview.csv
3. A folder containing data for number and distribution of spines, called AllS
4. A folder containing data for morphology of spines, called SelS <br>
(if no differentiation is being made between AllS and SelS data, please change every "SelS" in this script to "AllS")

Folders containing data are named in the example folder in the following manner: <br>
ExperimentNumber_AnimalNumber_SliceNumber_Left/RightHemisphere_ImagedChannels_SO/SRHippocampalLayer_<br>
Deconvolved_AllSpines/SelectedSpines_ReactivationStatus
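
The naming scheme above can be checked with a short sketch that splits a folder name on "_" and labels the parts. The field names and the example folder name are assumptions made for illustration; they are not taken from the example folder. The sketch also assumes that no individual field contains an underscore.

```python
# Field names are hypothetical labels for the positions in the naming scheme.
FIELDS = [
    "experiment", "animal", "slice", "hemisphere", "channels",
    "layer", "deconvolved", "spine_set", "reactivation_status",
]

def parse_folder_name(folder_name):
    """Split a folder name on '_' and map each part to its field label."""
    parts = folder_name.split("_")
    if len(parts) < len(FIELDS):
        raise ValueError(f"Expected at least {len(FIELDS)} parts, got {len(parts)}")
    # zip stops after the last known field, so extra trailing parts are ignored
    return dict(zip(FIELDS, parts))

# Hypothetical folder name following the scheme
example = "E01_A12_S3_L_Ch_SO_Dec_AllS_RFP+"
info = parse_folder_name(example)
print(info["animal"], info["layer"], info["spine_set"], info["reactivation_status"])
```

The positions matter because the open_csv function later builds its unique key from parts 1, 2, 3, 5 and 8 of the split file name.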

## 1. Import Packages

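#### 1.1 Install Pandas package if required
The final .xlsx files are written by pandas through the openpyxl package, so it is installed here as well.

In [36]:
!pip install pandas openpyxl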



#### 1.2 Import Python Packages

In [37]:
import pandas as pd
import os
import shutil

## 2. Prepare data structures

#### 2.1 Create 'Processed_data' folder
First, the current working directory of the script is stored in the variable *script_folder*. <br>
Next, Python creates the *Processed_data* folder within the same directory.

In [38]:
script_folder = os.getcwd()
folder_script = 'Processed_data'
base_directory = os.path.join(script_folder, folder_script)
os.makedirs(base_directory, exist_ok=True)

#### 2.2 Create required subfolders
Within the 'Processed_data' folder, five subfolders are created: <br>
'Meta_data' - this subfolder will contain the meta data file in which subject number is linked to the experimental condition<br>
'Dendrite' - this subfolder will contain data files on dendrite level - spine density and dendrite length <br> 
'Morph' - this subfolder will contain data files related to spine morphology <br>
'Distr' - this subfolder will contain data files related to distribution of spines along the dendrite <br>
'Output' - this subfolder will contain the combined data files resulting of this pipeline

In [39]:
folder_to_make = ['Meta_data', 'Dendrite', 'Morph', 'Distr', 'Output']
for folder in folder_to_make:
    sub_directory = os.path.join(base_directory, folder)
    os.makedirs(sub_directory, exist_ok=True) 

#### 2.3 Setting up paths to subfolders
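Creates a path variable for each subfolder in 'Processed_data'. <br>
Note that *directory_distr_dendrite* points to a folder that is not created in 2.2; it is only passed as a placeholder argument in 3.2.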

In [40]:
directory_morph = os.path.join(base_directory, "Morph") 
directory_distr = os.path.join(base_directory, "Distr") 
directory_dendrite = os.path.join(base_directory, "Dendrite") 
directory_distr_dendrite = os.path.join(base_directory, "Distr_and_Dendrite")
directory_meta_data = os.path.join(base_directory, "Meta_data") 
directory_output = os.path.join(base_directory, "Output") 

#### 2.4 Specify document / subfolder logic
For each subfolder, the dictionary below specifies which files are gathered in it. <br>
Files are recognised by specific combinations of strings in their file name and are then associated with the directory of the subfolder. <br>

The unique prefix before the variable name (here: RFP+) should be adjusted, if needed. <br>

A second unique key (here: AllS or SelS) is added to choose the correct file. <br>
If no differentiation is made between morphological data and data on number and distribution, this second key can be the same for all variables.

In [41]:
file_destination = {"Meta_data": [["_Overview."], directory_meta_data, ""], 
                    "Dendrite": [["RFP+_Dendrite_Spine_Density.", "RFP+_Dendrite_Length."], directory_dendrite, "AllS"], 
                    "Morph": [["RFP+_Spine_Length.", "RFP+_Spine_Part_Max_Diameter_Head.", "RFP+_Spine_Part_Max_Diameter_Neck.",
                              "RFP+_Spine_Part_Volume_Head.", "RFP+_Spine_Part_Volume_Neck.", "RFP+_Spine_Volume.", 
                              "RFP+_Spine_Part_Mean_Diameter_Head.", "RFP+_Spine_Part_Mean_Diameter_Neck."], directory_morph, 
                              "SelS"], 
                    "Distr": [["RFP+_Spine_Attachment_Pt_Distance.", "RFP+_Spine_Attachment_Pt_Diameter."], directory_distr, 
                              "AllS"]}

#### Function: Find files and copy to the right subfolder
The aim of this function is to copy the correct files into their subfolders based on the unique combination of parts of the file name <br>

*Arguments*: <br>
**input_folder**: folder that will be inspected to find required files <br>
**destination_folder** : folder where files are copied to <br>
**name_in_filename**: string that must be in the document name to be copied <br>
**name2_in_filename**: second string that must be in the document name to be copied <br>

In [42]:
def copy_data_to_subfolder(input_folder, destination_folder, name_in_filename, name2_in_filename):
    for root, dirs, files in os.walk(input_folder):
        for file in files:
            if file.endswith('.csv') and name_in_filename in file and name2_in_filename in file: 
                file_path = os.path.join(root, file)
                try:
                    shutil.copy(file_path, destination_folder)
                except shutil.SameFileError:
                    pass
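
The matching rule (a .csv file whose name contains both search strings) can be tried out on a temporary folder. This is a minimal sketch: the function copy_matching_csv is a hypothetical variant that additionally returns the copied names, and all file names are invented for illustration.

```python
import os
import shutil
import tempfile

def copy_matching_csv(input_folder, destination_folder, name_in_filename, name2_in_filename):
    """Copy every .csv whose name contains both search strings; return the copied names."""
    copied = []
    for root, _dirs, files in os.walk(input_folder):
        for file in files:
            if file.endswith(".csv") and name_in_filename in file and name2_in_filename in file:
                try:
                    shutil.copy(os.path.join(root, file), destination_folder)
                    copied.append(file)
                except shutil.SameFileError:
                    # Source and destination are the same file, nothing to do
                    pass
    return sorted(copied)

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "source")
    target = os.path.join(tmp, "target")
    os.makedirs(source)
    os.makedirs(target)
    # Hypothetical file names: only the first one satisfies all three conditions
    names = [
        "E01_A12_S3_L_Ch_SO_Dec_SelS_RFP+_Spine_Length.csv",
        "E01_A12_S3_L_Ch_SO_Dec_AllS_RFP+_Spine_Length.csv",
        "E01_A12_S3_L_Ch_SO_Dec_SelS_RFP+_Spine_Volume.csv",
        "E01_A12_S3_L_Ch_SO_Dec_SelS_RFP+_Spine_Length.txt",
    ]
    for name in names:
        open(os.path.join(source, name), "w").close()
    copied = copy_matching_csv(source, target, "RFP+_Spine_Length.", "SelS")
    target_files = sorted(os.listdir(target))

print(target_files)
```

The trailing dot in the search string "RFP+_Spine_Length." keeps the match from also catching longer variable names that start with the same words.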

#### Function: Fill subfolders
The function uses the file_destination dictionary from 2.4, which provides the subfolders as keys and the search strings, the subfolder path and the second search string as values. <br> It loops through the dictionary, searches for the required documents and copies them to the subfolders using the copy_data_to_subfolder function.<br>
*Arguments*: <br>
**destination_dict**: dictionary that maps each subfolder to its search strings and its path (see 2.4) <br>
**input_folder**: folder that will be inspected to find required files <br>

In [43]:
def fill_subfolders(destination_dict, input_folder):
    for key in destination_dict.keys():
        for name_in_filename in destination_dict[key][0]:
            destination_folder = destination_dict[key][1]
            name2_in_filename = destination_dict[key][2]
            copy_data_to_subfolder(input_folder, destination_folder, name_in_filename, name2_in_filename)

#### 2.5 Fill subfolders with morph, number and dendrite data
In this step, the functions specified above are used to copy the desired files into their respective subfolders.

In [44]:
fill_subfolders(file_destination, script_folder)

## 3. Merge and save data

#### Function: Remove second last value in unique key
This function accepts a string and converts it into a new string. <br>
For example 'aaa_bbb_ccc' --> 'aaa_ccc' <br>
This is needed later when merging df_distr and df_dendrite based on key.

In [45]:
def remove_second_last(value):
    # Split the string by underscores
    parts = value.split('_')
    # Remove the second last element
    if len(parts) > 1:  # Check to ensure there's a second last element
        parts.pop(-2)
    # Join the remaining parts back into a string
    return '_'.join(parts)
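
A quick check of the function on a few inputs, including a key in the format used later in this script. The key values are invented for illustration.

```python
def remove_second_last(value):
    """Drop the second-to-last underscore-separated part of a string."""
    parts = value.split("_")
    if len(parts) > 1:
        parts.pop(-2)
    return "_".join(parts)

# Hypothetical inputs: a generic string, a spine-level key, and two edge cases
examples = ["aaa_bbb_ccc", "a12_s3_l_so_rfp+_7_100042", "aaa_bbb", "single"]
results = [remove_second_last(value) for value in examples]
for before, after in zip(examples, results):
    print(before, "-->", after)
```

For a spine-level key the removed part is the spine ID, which leaves the dendrite-level key. With only two parts, the first part is removed; a string without underscores is returned unchanged.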

#### Function: Create a list with all name info from a file and the filename
This function separates the meta data contained in the file name based on "_" <br>
(other separators can also be used). <br> 
The information in the file name is used in a later step to create a unique identification key for each dendrite/spine.


In [46]:
def create_document_info(directory):
    data_files = []
    # Iterate over files in directory
    for name in os.listdir(directory):
        # Split the file name into its parts and keep the full name as last element
        split_name = name.split('_')
        split_name.append(name)
        data_files.append(split_name)
    return data_files

#### Function: Open a csv file and return it as a Pandas dataframe
This function opens a csv file, optionally skips two rows at the top of the file (`skiprows=[1, 2]`, required for Imaris files) and creates a unique key for each dendrite/spine. <br>
Values on dendrite level (e.g. spine density and dendrite length) only contain a Filament ID. <br>
Values on spine level (e.g. spine head volume and spine attachment Pt distance) contain a Filament ID and a spine ID. <br>

*Arguments*: <br>
**file_name_data**: input from create_document_info function <br>
**directory** : path to folder of the file <br>
**remove_id_key**: option to leave the spine ID out of the key, so that the key identifies the dendrite (Filament ID) only; requires key_creation=True <br>
**key_creation**: option to switch off creation of unique_ID and key in the resulting dataframe <br>
**skip2rows**: option to skip two rows at the top of the input csv file <br>

In [47]:
def open_csv(file_name_data, directory, remove_id_key=False, key_creation=True, skip2rows=True): # remove_id_key only works if key_creation = True
    document_path = os.path.join(directory, file_name_data[-1])
    if skip2rows:
        df = pd.read_csv(document_path, skiprows=[1, 2])
    else:
        df = pd.read_csv(document_path)
    if key_creation:
        df['unique_ID'] = f"{file_name_data[1]}_{file_name_data[2]}_{file_name_data[3]}_{file_name_data[5]}_{file_name_data[8]}"
        if remove_id_key:
            df['key'] = df['unique_ID'] + '_' + df['FilamentID'].astype(str)
        else:
            df['key'] = df['unique_ID'] + '_' + df['ID'].astype(str) + '_' + df['FilamentID'].astype(str)
        df = df.iloc[:, [0, -1]]
    return df
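
How the key is assembled can be sketched on a small dataframe. The file name, the value column "Spine Length" and the numbers are invented for illustration; only the columns ID and FilamentID and the positions 1, 2, 3, 5 and 8 are taken from the function above.

```python
import pandas as pd

# Hypothetical file name following the naming scheme of this notebook
file_name = "E01_A12_S3_L_Ch_SO_Dec_SelS_RFP+_Spine_Length.csv"
parts = file_name.split("_")

# Animal, slice, side, layer and reactivation status sit at these positions
unique_id = "_".join(parts[i] for i in (1, 2, 3, 5, 8))

# Toy data standing in for an Imaris export
df = pd.DataFrame({
    "Spine Length": [1.2, 0.8],
    "ID": [7, 8],
    "FilamentID": [100042, 100042],
})

# Spine-level key (remove_id_key=False) and dendrite-level key (remove_id_key=True)
df["spine_key"] = unique_id + "_" + df["ID"].astype(str) + "_" + df["FilamentID"].astype(str)
df["dendrite_key"] = unique_id + "_" + df["FilamentID"].astype(str)
print(df[["spine_key", "dendrite_key"]])
```

Two spines on the same dendrite get different spine-level keys but share one dendrite-level key, which is what the merge in 3.2 relies on.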

#### Function: Add the condition of the animal to the target dataframe based on the animal/condition table
This function returns the dataframe with the condition added in the last column matched by the subject number (here: animal) based on the provided meta data file.

In [48]:
def add_animal_condition_to_df(target_df, directory_meta_data):
    meta_data_document_data = create_document_info(directory_meta_data)[0]
    df_condition = open_csv(meta_data_document_data, directory_meta_data, key_creation=False, skip2rows=False)
    for index, row in target_df.iterrows():
        animal = row['key'].split('_')[0]
        try:
            # The lookup raises IndexError when the animal is missing from the overview file
            condition_animal_value = df_condition.loc[df_condition['animal'] == animal, 'condition'].values[0]
        except IndexError:
            condition_animal_value = None
        target_df.at[index, 'Condition'] = condition_animal_value
    
    return target_df
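
The same lookup can also be written without a row-by-row loop. The following is a sketch of an alternative, not the approach used in this notebook; the toy data are invented, and only the column names animal, condition and key are taken from the function above.

```python
import pandas as pd

# Toy stand-in for the _Overview.csv file
df_condition = pd.DataFrame({
    "animal": ["A12", "A13"],
    "condition": ["control", "treated"],
})

# Toy target dataframe; the last animal is deliberately missing from the overview
target_df = pd.DataFrame({
    "key": [
        "A12_S3_L_SO_RFP+_7_100042",
        "A13_S1_R_SR_RFP+_2_100001",
        "A99_S1_R_SR_RFP+_2_100001",
    ]
})

# The animal is the first underscore-separated part of the key
animal = target_df["key"].str.split("_").str[0]

# Map each animal to its condition; unknown animals become missing values
lookup = df_condition.set_index("animal")["condition"]
target_df["Condition"] = animal.map(lookup)
print(target_df)
```

This variant assumes that each animal appears only once in the overview file and that the animal identifiers are stored as text in both tables.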

#### Function: Add columns to the dataframe that can help slicing the data
Columns that are added are *animal*, *image*, *dendrite* and *reactivation_status*.

If the file names follow a different order, rename the columns to match the order of the fields used for the key in the open_csv function (see function above).

In [49]:
def add_slicing_columns_to_dataframe(df):
    df[['animal', 'slice', 'side', 'layer', 'reactivation_status', 'id', 'dendrite_id']] = df['key'].str.split('_', expand=True)
    df['image'] = df['animal'] + '_' + df['slice'] + '_' + df['side'] + '_' + df['layer']
    df['dendrite'] = df['image'] + '_' + df['dendrite_id']
    df = df.drop(columns=['slice', 'side', 'layer', 'id', 'dendrite_id'])
    
    return df
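
Applied to a single spine-level key, the splitting works as sketched below. The key value is invented for illustration.

```python
import pandas as pd

# Hypothetical spine-level key with seven underscore-separated parts
df = pd.DataFrame({"key": ["a12_s3_l_so_rfp+_7_100042"]})

columns = ["animal", "slice", "side", "layer", "reactivation_status", "id", "dendrite_id"]
df[columns] = df["key"].str.split("_", expand=True)

# Image = animal + slice + side + layer; dendrite = image + Filament ID
df["image"] = df["animal"] + "_" + df["slice"] + "_" + df["side"] + "_" + df["layer"]
df["dendrite"] = df["image"] + "_" + df["dendrite_id"]
df = df.drop(columns=["slice", "side", "layer", "id", "dendrite_id"])
print(df.iloc[0].to_dict())
```

The split expects exactly seven parts. Dendrite-level keys have only six parts, so they do not fit this column list, which is consistent with the dendrite data being stored without slicing columns in 3.1.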

#### Function: Merge data from a folder and provide output in a dataframe
For example, this function can loop through all files in the Morph folder and merge them into one dataframe. 
Then the function uses the add_animal_condition_to_df function to add the condition of the animal. Optionally, auxiliary columns are added for easier data slicing later. Lastly, the function converts all strings to lower case. 

*Arguments*: <br>
**directory** : path to folder of the file <br>
**remove_id_key**: option to remove the id from the key in the resulting dataframe <br>
**add_slicing_columns**: option to add auxiliary columns for example for slicing, calculations or data manipulations  <br>

In [50]:
def create_dataframe(directory, remove_id_key=False, add_slicing_columns=False):
    first_iteration = True
    datafiles = create_document_info(directory)
    for datafile in datafiles:
        if first_iteration:
            df = open_csv(datafile, directory, remove_id_key)
            first_iteration = False
        else:
            df_add = open_csv(datafile, directory, remove_id_key)
            df = pd.concat([df, df_add])
    df = df.groupby('key').agg(lambda x: x.dropna().tolist()[0] if x.dropna().tolist() else None).reset_index()
    df = add_animal_condition_to_df(df, directory_meta_data)
    if add_slicing_columns:
        df = add_slicing_columns_to_dataframe(df)
    df = df.applymap(lambda x: x.lower() if isinstance(x, str) else x)
    return df
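
The central step is stacking the files and then collapsing the rows that share a key, keeping the first non-missing value per column. A minimal sketch on toy data (column names and values are invented for illustration):

```python
import pandas as pd

# Two toy files: each holds one variable plus the key
length = pd.DataFrame({"Spine Length": [1.2, 0.8], "key": ["a_1", "a_2"]})
volume = pd.DataFrame({"Spine Volume": [0.05], "key": ["a_1"]})

# Stacking leaves missing values wherever a file lacks a column
stacked = pd.concat([length, volume])

# Collapse to one row per key, taking the first non-missing value in each column
combined = stacked.groupby("key").agg(
    lambda x: x.dropna().tolist()[0] if x.dropna().tolist() else None
).reset_index()

# Possible shorter alternative: GroupBy.first() also skips missing values
alternative = stacked.groupby("key").first().reset_index()
print(combined)
```

Spine a_2 has no volume in the toy data, so its volume stays missing in the combined table.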


#### Function: Consolidate data and save output
This function calls the create_dataframe function and saves the consolidated data as .xlsx file in the output directory.

*Arguments*: <br>
**input_directory**: path to the folder containing the files to merge (not used if create_df == False) <br>
**output_directory**: path to output folder where the files are saved <br>
**name**: name of the Excel file  <br>
**create_df**: option to block automatic df creation <br>
**df**: option to manually input a df, if create_df == False <br>
**remove_id_key**: option to remove the id from the key in the resulting dataframe <br>
**add_slicing_columns**: option to add auxiliary columns for example for slicing, calculations or data manipulations  <br>

In [51]:
def store_data(input_directory, output_directory, name, create_df=True, df=None, remove_id_key=False, add_slicing_columns=False):
    if create_df:
        df = create_dataframe(input_directory, remove_id_key, add_slicing_columns)
    df.to_excel(os.path.join(output_directory, f"{name}.xlsx"), index=False)
    return df

#### 3.1 Consolidate and save morph, number and dendrite data
Here, the final .xlsx data files are created based on the specific subfolders and then stored in the output folder.

In [52]:
df_morph = store_data(directory_morph, directory_output, 'morph_data', add_slicing_columns=True)
df_distr = store_data(directory_distr, directory_output, 'distr_data', add_slicing_columns=True)
df_dendrite = store_data(directory_dendrite, directory_output, 'dendrite_data', remove_id_key=True)

#### 3.2 Create separate file that has the number data enriched with the dendrite data
In a last step, the data frame on the distribution of spines is merged with the data frame on dendrite level, so that finally only two data files are needed for statistical analysis. <br>
The first data file can be used for analysis on the number and distribution of spines. <br>
The second data file can be used for analysis on the morphology of spines.
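
The merge can be illustrated on toy data: the spine ID is dropped from each spine-level key, and the resulting dendrite-level key is used for a left join. Column names and values are invented for illustration, and drop_spine_id is a hypothetical stand-in for the remove_second_last function.

```python
import pandas as pd

# Toy spine-level data: two spines on the same dendrite
df_distr = pd.DataFrame({
    "key": ["a12_s3_l_so_rfp+_7_100042", "a12_s3_l_so_rfp+_8_100042"],
    "Spine Attachment Pt Distance": [3.1, 9.4],
})

# Toy dendrite-level data, keyed without a spine ID
df_dendrite = pd.DataFrame({
    "key_dendrite": ["a12_s3_l_so_rfp+_100042"],
    "Dendrite Length": [52.0],
})

def drop_spine_id(key):
    """Remove the second-to-last part (the spine ID) from a spine-level key."""
    parts = key.split("_")
    parts.pop(-2)
    return "_".join(parts)

df_distr["key_dendrite"] = df_distr["key"].apply(drop_spine_id)

# The left join keeps every spine and repeats the dendrite values for each of them
merged = pd.merge(df_distr, df_dendrite, on="key_dendrite", how="left")
merged = merged.drop(columns=["key_dendrite"])
print(merged)
```

With a left join, spines whose dendrite is missing from the dendrite table are kept with missing dendrite values instead of being dropped.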

In [53]:
df_distr['key_dendrite'] = df_distr['key'].apply(remove_second_last)
df_distr = df_distr.drop(columns=['Condition'])
df_dendrite.rename(columns={'key': 'key_dendrite'}, inplace=True)
merged_df = pd.merge(df_distr, df_dendrite, on='key_dendrite', how='left')
df_distr_dendrite = merged_df.drop(columns=['key_dendrite'])
store_data(directory_distr_dendrite, directory_output, 'distr_dendrite_data', create_df=False, df=df_distr_dendrite)

The final data files can be found in the output folder and imported into R. <br> 
See R script in this repository to continue with analysis on spine parameters. 