# Jupyter Notebook UI to analyze baseline data from tap-habituation experiments!

### Beginner Essentials:
1. Press Shift-Enter to run each cell. After a cell runs, you should see the output "done step #". If not, an error has occurred.
2. When entering or revising code, make sure you close all your quotation marks '' and brackets (), [], {}.
3. Don't leave any commas (,) hanging! Make sure an object always follows a comma. If there is nothing after a comma, remove the comma.
4. Learning to code? The code is annotated to help you understand how it works!

**Run all cells/steps sequentially, even the ones that do not need input**

## Step-by-Step Analysis of the Jupyter Notebook

| Step | Purpose | Key Actions |
|------|---------|-------------|
| **1. Import Packages** | Load required Python libraries for data analysis | Imports `pandas`, `numpy`, `matplotlib`, etc. | 
| **2. Pick Filepath** | Select the folder containing experimental data files (.dat or .trv) | Input required: Uses `FileChooser` widget to select directory | 
| **3. User-Defined Variables** | Set experiment parameters | Defines: `bins`  | 
| **4. Construct Filelist** | Find all data files in selected folder | Sets working directory and scans `folder_path` using `os.walk`; displays the number of `.dat` files found in the folder |
| **5. Process Data Function** | Define functions to load and label raw data | - `ProcessData()`: Loads files and names the columns<br>- `assign_taps()`, `insert_plates()`: Add tap and plate labels |
| **6.1 Process Data** | Identify the strains | - Checks `filelist` for unique strain names (e.g., "N2") and builds `StrainNames` | 
| **6.2 Process Data** | Apply processing to all strains | - Runs `ProcessData()` for each strain<br>- Assigns taps and plates | 
| **7. Grouping & Naming** | Combine data from all strains | - Concatenates DataFrames<br>- Assigns dataset names (e.g., "N2") | 
| **Output CSV** | Save processed data | Exports `Baseline_data` to CSV |

### Key Notes:
- User Input Required: Steps 2 (file selection), 3 (parameters), 6.1 (strain verification)
- Output: Final CSV contains all analyzed tap response data

# 1. Importing Packages Required (No input required, just run)

In [1]:
import pandas as pd #<- package used to import and organize data
import numpy as np #<- package used for numerical arrays and calculations
import seaborn as sns #<- package used to plot graphs
from matplotlib import pyplot as plt #<- package used to plot graphs
import os #<- package used to work with system filepaths
from ipywidgets import widgets #<- widget tool to generate button
from IPython.display import display #<- displays button
from ipyfilechooser import FileChooser #<- widget used to pick a folder
# from tkinter import Tk, filedialog #<- Tkinter is a GUI package
from tqdm.notebook import tqdm #<- progress bar for loops
# import dask.dataframe as dd
# import pingouin as pg
pd.set_option('display.max_columns', 50)
print("done step 1")

done step 1


## 2. Pick filepath (just run and click button from output)

Run the following cell and click the 'Select' button to pick a filepath.

**Important: Later on, this script matches each strain name against the full file path of each file to import and group data. That means if the folder you select (or any folder above it) has a strain name in its name, the script will group files incorrectly.**

(e.g., if your folder has "N2" in its name, this script sees all files inside this folder as having the "N2" search key)

**An easy fix is to rename your folder to something else (write your strains in lower case, or use just the date).**
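The pitfall above comes from testing whether the strain name is a substring of the whole path. Here is a minimal sketch with hypothetical paths. The stricter comparison against the genotype folder (third path component from the end, as in step 6.1) is an illustrative alternative, not what the notebook does.

```python
from pathlib import PurePosixPath

# Hypothetical paths: the top-level folder name contains "N2"
paths = [
    "/data/N2_screen/N2/20220815_101538/N2_plate_a.dat",
    "/data/N2_screen/hipr-1_ok1081/20220815_111111/hipr-1_plate_b.dat",
]

# Substring test against the whole path (the approach used in ProcessData)
substring_hits = [p for p in paths if "N2" in p]

# Illustrative alternative (an assumption, not the notebook's method):
# compare only the genotype folder, third component from the end
def genotype_of(path):
    return PurePosixPath(path).parts[-3]

exact_hits = [p for p in paths if genotype_of(p) == "N2"]

print(len(substring_hits))  # 2: the mutant file is wrongly matched
print(len(exact_hits))      # 1
```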

In [2]:
starting_directory = '/Users'
chooser = FileChooser(starting_directory)
display(chooser)

FileChooser(path='/Users', filename='', title='', show_hidden=False, select_desc='Select', change_desc='Change…

In [56]:
print(chooser.selected_path)
folder_path=chooser.selected_path

/Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022


In [57]:
screens = ['PD_Screen', 'ASD_Screen', 'G-Proteins_Screen', 'Glia_Genes_Screen', 'Neuron_Genes_Screen', 'Miscellaneous']

screen_chooser = widgets.Select(options=screens, value=screens[0], description='Screen:')
display(screen_chooser)

Select(description='Screen:', options=('PD_Screen', 'ASD_Screen', 'G-Proteins_Screen', 'Glia_Genes_Screen', 'N…

In [58]:
Screen=screen_chooser.value
print(Screen)

PD_Screen


# 3. User Defined Variables (Add input here)

Here, we add some constants to help you blaze through this code.

3.1: Setting time bins


3.2: Setting view range for your graph
- Top, bottom = y axis view range
- left, right = x axis view range



In [59]:
# Setting 1s Bins
bins = np.linspace(0,1200,1201) # np.linspace(start, stop, number of points): 1201 bin edges, 1 s apart
print(bins)


print("done step 3")

[0.000e+00 1.000e+00 2.000e+00 ... 1.198e+03 1.199e+03 1.200e+03]
done step 3
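A quick check of what the `bins` array contains may help. The third argument of `np.linspace` is the number of points, not a step size. This sketch only inspects the array; the comparison with `np.arange` is added for illustration.

```python
import numpy as np

bins = np.linspace(0, 1200, 1201)  # 1201 evenly spaced points from 0 to 1200 inclusive

print(len(bins))          # 1201 bin edges, i.e. 1200 one-second bins
print(np.diff(bins)[:3])  # spacing between neighbouring edges

# The same edges written with an explicit step of 1
same = np.array_equal(bins, np.arange(0, 1201, dtype=float))
print(same)
```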


# 4. Construct filelist from folder path (No input required, just run)

In [60]:
os.chdir(folder_path) # setting your working directory so that your images will be saved here

filelist = list() # empty list
for root, dirs, files in os.walk(folder_path): # this for loop goes through your folder 
    for name in files:
        if name.endswith('.dat'): # and takes out all files with a .dat (file that contains your data)
            if "_" in name.split(".")[-2]:
                filepath = os.path.join(root, name) # Notes down the file path of each data file
                filelist.append(filepath) # saves it into the list

if not filelist:
    raise FileNotFoundError("No .dat files found in the selected folder!")
else:
    print(f"Number of .dat files to process: {len(filelist)}")
    # print(f"Example of first and last file saved: {filelist[0]}, {filelist[-1]}") 

print('done step 4')

Number of .dat files to process: 13
done step 4
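The filename test in step 4 is compact, so here is a small sketch of what it accepts and rejects. The helper name `passes_step4_filter` is hypothetical; the two file names are taken from this notebook.

```python
def passes_step4_filter(name):
    # Same test as the step 4 loop: a .dat file whose second-to-last
    # dot-separated chunk contains an underscore
    return name.endswith(".dat") and "_" in name.split(".")[-2]

plate_file = "N2_10x2_f72h20C_600s31x10s10s_B0811ab.dat"
numbered_file = "N2_10x1_n96h20C_360sA0401_ka.00065.dat"

print(passes_step4_filter(plate_file))     # True: chunk before ".dat" has underscores
print(passes_step4_filter(numbered_file))  # False: chunk before ".dat" is "00065"
```

Files with an extra numeric suffix before `.dat` are therefore left out of `filelist`.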


# 5. Process Data Function (No input required, just run)

In [132]:
def ProcessData(strain, experiment_counter): 
    """
    Loads and labels all .dat files whose path contains the given strain.

    Parameters: 
        strain (str): keyword to match in the file paths
        experiment_counter (int): number assigned to the next file (plate) loaded

    Returns:
        dict: 'N' (number of files matched for the strain),
              'Confirm' (DataFrame of all matched files with named columns plus
              "Plate_id", "Date", "Screen", "Experiment" and "plate"),
              'experiment_counter' (the next unused experiment number)

    """
    strain_filelist = [x for x in filelist if strain in x] # Goes through the list and filters for keyword
    Strain_N = len(strain_filelist) # Finds the number of plates per strain
    if Strain_N == 0:
        raise AssertionError ('{} is not a good identifier'.format(strain))

    print(f'Strain {strain}')
    print(f'Number of plates: {Strain_N}') 

    # visiting files in this strain
    df_list=[]
    for file in strain_filelist:
        if file.split('/')[-1].startswith('._'): # skip macOS "._" metadata files
            continue
        try:
            print(f"File: {file}")
            df= pd.read_csv(file, sep=' ', header = None, encoding_errors='ignore')
            df['Plate_id'] = file.split('/')[-1].split('_')[-1].split('.')[0] # last "_" chunk of the file name
            df['Date'] = file.split('/')[-2].split('_')[0] # date part of the timestamp folder
            df['Screen'] = file.split('/')[-4] # folder above the strain folder
            df['Experiment'] = experiment_counter
            experiment_counter = 1+experiment_counter
            df_list.append(df)
        except Exception as e: # report the file and keep going
            print(f"error in file {file}: {e}")
    DF_Total = pd.concat(df_list, ignore_index = True)
    DF_Total = DF_Total.rename( 
                {0:'Time',
                1:'n',
                2:'Number',
                3:'Instantaneous Speed',
                4:'Interval Speed',
                5:'Bias',
                6:'Tap',
                7:'Puff',
                8:'x',
                9:'y',
                10:'Morphwidth',
                11:'Midline',
                12:'Area',
                13:'Angular Speed',
                14:'Aspect Ratio',
                15:'Kink',
                16:'Curve',
                17:'Crab',
                18:'Pathlength'}, axis=1)

    # check function here for NaN Columns
    DF_Total['plate'] = 0

    print("---------------------------------------------------------------------------------------------------------------------------------------------------------------------------")

    return{
            'N': Strain_N,
            'Confirm':DF_Total,
            'experiment_counter': experiment_counter
            # 'Final': DF_Final
    }



def assign_taps(df, tolerances):
    """
    Assigns tap number to each row in the DataFrame based on time tolerances.

    Parameters:
        df (pd.DataFrame): The DataFrame to modify
        tolerances (list of tuples): Each tuple is (lower, upper) time range

    Returns:
        None
    """
    df['taps'] = np.nan
    for taps, tolerance in enumerate(tolerances): #[(99, 101), (109,111), ...]
        tap_lower,tap_upper = tolerance
        TimesInTapRange = df['Time'].between(tap_lower,tap_upper, inclusive="both")
        df.loc[TimesInTapRange,'taps'] = int(taps)+1 # rows inside this window get tap number taps+1 (counting from 1)
    # df.bfill(inplace=True)


def insert_plates(df):   
    """
    Inserts a plate column into a dataframe, in place.
    The counter increases by one on every row where taps == 1.
    
    Parameters:
        df (pd.DataFrame): dataframe with a 'taps' column
    
    Returns: 
        None (df is modified in place)
    """
    df['plate']=(df['taps'] ==1).cumsum()


print('done step 5')

done step 5
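To see what the two helpers produce, here is a toy run on made-up times (two plates stacked, two tap windows). It re-implements `assign_taps` compactly so the block is self-contained. Note that `(df['taps'] == 1).cumsum()` increases on every row inside the first-tap window, which is consistent with the `plate` values 75, 76, 77, ... in the step 7 output. The `plate_runs` column is a possible alternative that counts each run of first-tap rows once; it is a sketch, not the notebook's method.

```python
import numpy as np
import pandas as pd

def assign_taps(df, tolerances):
    # Label rows whose Time falls inside the i-th window with tap number i+1
    df["taps"] = np.nan
    for i, (lo, hi) in enumerate(tolerances):
        in_window = df["Time"].between(lo, hi, inclusive="both")
        df.loc[in_window, "taps"] = i + 1

# Toy data: the same six time points recorded on two plates, stacked
times = [597.0, 599.0, 601.0, 605.0, 609.0, 611.0]
toy = pd.DataFrame({"Time": times * 2})
assign_taps(toy, [(598, 602), (608, 612)])

# As in insert_plates(): increases on every row where taps == 1
toy["plate_rowwise"] = (toy["taps"] == 1).cumsum()

# Alternative sketch: increase only on the first row of each run of taps == 1
is_first = toy["taps"].eq(1)
starts = is_first & ~is_first.shift(fill_value=False)
toy["plate_runs"] = starts.cumsum()

print(toy)
```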


# 6.1 Process Data

Create a dictionary `StrainNames` that maps a number (starting at 1) to each genotype/strain name found in the file paths. The wild-type strain (N2) is moved to the front.

In [133]:
genotype=[]
for f in filelist:
    genotype.append(f.split('/')[-3])

genotypes=np.unique(genotype).tolist()

if Screen =="Neuron_Genes_Screen":
    genotypes.insert(0, genotypes.pop(genotypes.index("N2_XJ1")))
    genotypes.insert(0, genotypes.pop(genotypes.index("N2_N2")))
else:
    genotypes.insert(0, genotypes.pop(genotypes.index("N2")))

nstrains = list(range(1, len(genotypes) + 1))
StrainNames = {nstrains[i]: genotypes[i] for i in range(len(nstrains))}

print(f"Number of genotypes/strains in the experiment: {len(genotypes)}")

# Display the first 5 Strain names in the experiment
for k in list(StrainNames)[:5]:
    print(f"{k}: {StrainNames[k]}")


print("done step 6.1")

# <---------------- Test element to use for dictionary building -------------------
# s = '/Users/Joseph/Desktop/OnFoodOffFoodTest/N2_OnFood/20220401_163048/N2_10x1_n96h20C_360sA0401_ka.00065.dat'
# slist=s.split('/')[5]
# print(slist)
# print(list(range(1,5+1)))

Number of genotypes/strains in the experiment: 3
1: N2
2: hipr-1_ok1081
3: hipr-1_tm10120
done step 6.1
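The path splitting used in steps 5 and 6.1 can be traced on one of the file paths from this dataset. This is only a worked illustration; it assumes forward-slash path separators, as the notebook code does.

```python
path = ("/Users/gurmehak/Documents/RankinLab/Test_Datasets/"
        "PDScreen_TapHab_August15_2022/N2/20220815_101538/"
        "N2_10x2_f72h20C_600s31x10s10s_B0811ab.dat")

parts = path.split("/")

genotype = parts[-3]                               # strain folder
date = parts[-2].split("_")[0]                     # date part of the timestamp folder
plate_id = parts[-1].split("_")[-1].split(".")[0]  # last "_" chunk of the file name
screen_folder = parts[-4]                          # folder above the strain folder

print(genotype)       # N2
print(date)           # 20220815
print(plate_id)       # B0811ab
print(screen_folder)  # PDScreen_TapHab_August15_2022
```

Because the positions are counted from the end of the path, the folder layout (screen folder / strain folder / timestamp folder / file) has to be the same for every file.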


# 6.2 Process Data (just run this cell)

Pass each strain through `ProcessData()` function 

In [202]:
DataLists = [0] # placeholder at index 0 because we want indexing to start at 1,
                # so that DataLists[1] holds the first strain in StrainNames

experiment_counter = 1

# the loop below goes through the dictionary in step 6.1 and processes data
# and appends all data into a list of dataframes
for s in tqdm(StrainNames.values()): 
    if not s == '':
        result = ProcessData(s, experiment_counter)
        DataLists.append(result['Confirm'])
        experiment_counter = result['experiment_counter'] 


# Taps
number_of_taps = 30 # Taps in your experiment (N)
ISI = 10  # interstimulus interval (ISI) in your experiment, in seconds
first_tap = 600 # when is your first tap (in seconds)? check your TRV files

# Here, open up one of the trv files to determine the times for each of these taps. 

# Record tap numbers 1 to N+1, e.g., if number_of_taps = 30, taps = [1, 2, 3, ..., 31]
taps = np.arange(1, number_of_taps+2).tolist()

# Assign a tolerance window of +/- 2 s around each of the N taps
lower = np.arange(first_tap-2, first_tap-2+(number_of_taps*ISI), ISI) # np.arange(start, stop (excluded), step)
upper = np.arange(first_tap+2, first_tap+2+(number_of_taps*ISI), ISI) # np.arange(start, stop (excluded), step)
tolerances = [(int(l), int(u)) for l, u in zip(lower, upper)]
tolerances.append((1188,1191)) # hard-coded window for the (N+1)th tap; adjust to your protocol


# the loop below assigns taps and plates to the processed data
for df in DataLists[1:]: 
    assign_taps(df, tolerances)
    insert_plates(df)


print('done step 6.2')

  0%|          | 0/3 [00:00<?, ?it/s]

Strain N2
Number of plates: 5
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_101538/N2_10x2_f72h20C_600s31x10s10s_B0811ab.dat
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_102652/N2_10x2_f96h20C_600s31x10s10s_A0811aa.dat
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_122801/N2_10x2_f96h20C_600s31x10s10s_A0811ad.dat
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_121502/N2_10x2_f72h20C_600s31x10s10s_B0811ae.dat
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_103433/N2_10x2_f72h20C_600s31x10s10s_C0811ac.dat
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Strain hipr-1_ok1081
Number of plates: 4
File: /Users/gurmehak/
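The tolerance windows built in step 6.2 follow from simple arithmetic on `first_tap`, `ISI`, and `number_of_taps`. The sketch below reproduces the calculation with the values used in this notebook and prints the first and last computed windows. The variable `next_regular` is added for illustration only: it shows where one further, regularly spaced tap would fall, whereas the notebook appends a hard-coded `(1188, 1191)` window for the (N+1)th tap.

```python
import numpy as np

number_of_taps = 30  # N
ISI = 10             # seconds between taps
first_tap = 600      # time of the first tap in seconds

lower = np.arange(first_tap - 2, first_tap - 2 + number_of_taps * ISI, ISI)
upper = np.arange(first_tap + 2, first_tap + 2 + number_of_taps * ISI, ISI)
tolerances = [(int(l), int(u)) for l, u in zip(lower, upper)]

print(len(tolerances))  # 30
print(tolerances[0])    # (598, 602)
print(tolerances[-1])   # (888, 892)

# Illustration only: time of one more tap at the same spacing
next_regular = first_tap + number_of_taps * ISI
print(next_regular)     # 900
```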

# Convert float64 data to float16 to reduce memory load (float32 is a higher-precision alternative)

In plain English:

float16 = about 3 significant decimal digits

float32 = about 6 to 7 significant decimal digits

float64 = about 15 to 16 significant decimal digits

More digits = more memory that the computer has to keep track of. Fewer digits also means coarser rounding of the stored values.
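The trade-off can be checked directly with `np.finfo`, which reports the bit width and approximate decimal precision of each float type. The second half of this sketch rounds one `Time` value from the output below (1199.842) to each type, to show how coarse float16 becomes for values of this size.

```python
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # bits = storage size, precision = approximate number of reliable decimal digits
    print(dtype.__name__, info.bits, info.precision)

t = 1199.842
t16 = float(np.float16(t))  # float16 can only store whole numbers in this range
t32 = float(np.float32(t))

print(t16)  # 1200.0
print(t32)  # close to 1199.842
```

If time resolution matters downstream, converting to `np.float32` instead of `np.float16` is one option.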

In [203]:
# optional: run this cell if memory load needs to be reduced (comment it out to keep float64)

for n in tqdm(DataLists[1:]):
    print(n)
    TestData=n
    TestData[TestData.select_dtypes(np.float64).columns]=TestData.select_dtypes(np.float64).astype(np.float16)
    

  0%|          | 0/3 [00:00<?, ?it/s]

            Time   n  Number  Instantaneous Speed  Interval Speed  Bias  Tap  \
0          0.009  13       0               0.0000          0.0000  0.00    0   
1          0.061  13       0               0.0000          0.0000  0.00    0   
2          0.089  12       0               0.0000          0.0000  0.00    0   
3          0.136  12       0               0.0000          0.0000  0.00    0   
4          0.168  12       0               0.0000          0.0000  0.00    0   
...          ...  ..     ...                  ...             ...   ...  ...   
134125  1199.842  48      30               0.1738          0.1245  0.64    0   
134126  1199.884  48      30               0.1828          0.1286  0.00    0   
134127  1199.924  47      28               0.1896          0.1323  0.00    0   
134128  1200.110  47      28               0.0000          0.0000  0.00    0   
134129  1200.163  47      28               0.0000          0.0000  0.00    0   

        Puff        x        y  Morphwi

# 7. Grouping Data and Naming



In [236]:
# DataLists[1][DataLists[1]['Time']>605]
DataLists[1][(DataLists[1]['taps']>0) & (DataLists[1]['Time']>601)].head(60)



Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Puff,x,y,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,taps
12145,601.5,16,14,0.234375,0.125854,-0.713867,0,0,28.3125,25.390625,0.113708,1.061523,0.142822,11.703125,0.318115,43.3125,33.59375,0.024307,6.4375,B0811ab,20220815,PDScreen_TapHab_August15_2022,1,75,1.0
12146,601.5,16,14,0.244751,0.128052,-0.713867,0,0,28.3125,25.390625,0.112,1.070312,0.141724,13.796875,0.315918,42.3125,34.0,0.026199,6.429688,B0811ab,20220815,PDScreen_TapHab_August15_2022,1,76,1.0
12147,601.5,16,14,0.250488,0.132568,-0.713867,0,0,28.3125,25.390625,0.114197,1.0625,0.143188,15.5,0.322998,45.09375,34.09375,0.027695,6.421875,B0811ab,20220815,PDScreen_TapHab_August15_2022,1,77,1.0
12148,601.5,16,14,0.246582,0.134644,-0.713867,0,0,28.3125,25.390625,0.116089,1.067383,0.146118,15.796875,0.333984,43.0,35.0,0.029694,6.414062,B0811ab,20220815,PDScreen_TapHab_August15_2022,1,78,1.0
12149,601.5,16,14,0.241089,0.127686,-0.713867,0,0,28.3125,25.390625,0.117493,1.067383,0.147461,15.101562,0.320068,44.8125,35.5,0.028793,6.40625,B0811ab,20220815,PDScreen_TapHab_August15_2022,1,79,1.0
12150,601.5,16,14,0.24646,0.126831,-0.713867,0,0,28.3125,25.390625,0.114502,1.066406,0.144043,16.296875,0.331055,42.1875,35.6875,0.029907,6.398438,B0811ab,20220815,PDScreen_TapHab_August15_2022,1,80,1.0
12151,601.5,16,14,0.249146,0.130127,-0.713867,0,0,28.296875,25.390625,0.11322,1.0625,0.142822,18.90625,0.342041,46.3125,35.59375,0.032806,6.390625,B0811ab,20220815,PDScreen_TapHab_August15_2022,1,81,1.0
12152,601.5,16,14,0.248413,0.131592,-0.713867,0,0,28.296875,25.390625,0.114197,1.06543,0.142944,22.0,0.335938,44.59375,35.3125,0.034485,6.386719,B0811ab,20220815,PDScreen_TapHab_August15_2022,1,82,1.0
12153,601.5,16,14,0.243652,0.133057,-0.713867,0,0,28.296875,25.390625,0.114197,1.064453,0.143311,24.296875,0.340088,47.0,35.59375,0.034302,6.378906,B0811ab,20220815,PDScreen_TapHab_August15_2022,1,83,1.0
12154,601.5,16,14,0.238892,0.130859,-0.713867,0,0,28.296875,25.390625,0.118225,1.075195,0.147705,25.5,0.342041,49.5,36.3125,0.031097,6.371094,B0811ab,20220815,PDScreen_TapHab_August15_2022,1,84,1.0


In [230]:
df_base=pd.concat(df.assign(dataset=StrainNames.get(i+1)) for i, df in enumerate(DataLists[1:]))

df_base[['Gene', 'Allele']] = df_base['dataset'].str.split(pat='_', n=1, expand=True)

df_base['Allele'] = df_base['Allele'].fillna('N2')

df_base['Screen']=Screen

# df_base['taps'] = df_base['taps'].astype(float)

# df_base['taps'] = df_base['taps'].ffill()

df_base.head()



Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Puff,x,y,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,taps,dataset,Gene,Allele
0,0.009003,13,0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,B0811ab,20220815,PD_Screen,1,0,,N2,N2,N2
1,0.061005,13,0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,B0811ab,20220815,PD_Screen,1,0,,N2,N2,N2
2,0.088989,12,0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,B0811ab,20220815,PD_Screen,1,0,,N2,N2,N2
3,0.135986,12,0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,B0811ab,20220815,PD_Screen,1,0,,N2,N2,N2
4,0.167969,12,0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,B0811ab,20220815,PD_Screen,1,0,,N2,N2,N2


## 7.1 Baseline Data

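This step takes the combined dataframe of all strains (`df_base`, built in Step 7), drops unwanted columns and rows with missing values, and filters for the 490 s to 590 s time window.

The result is stored in `Baseline_data`. In the run shown here it is empty (shape `(0, 24)`): `.dropna()` removes every row whose `taps` value is NaN, and no row between 490 s and 590 s falls inside a tap tolerance window.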

In [205]:
Baseline_data = df_base.drop(columns=["Puff", "x","y", "Experiment"]).dropna().reset_index(drop=True)

Baseline_data = Baseline_data[((Baseline_data.Time<=590)&(Baseline_data.Time >=490))] 

Baseline_data.head()



Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,plate,taps,dataset,Gene,Allele


In [194]:
Baseline_data.shape

(0, 24)
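A toy example shows how `.dropna()` interacts with the `taps` column in the baseline window. The data are made up. The `alternative` frame is one possible way to keep baseline rows (filter by time first and ignore `taps` when checking for missing values); it is a sketch under that assumption, not the notebook's method.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Time": [500.0, 550.0, 600.0, 610.0],
    "Bias": [0.1, 0.2, 0.3, 0.4],
    "taps": [np.nan, np.nan, 1.0, 2.0],  # rows before the first tap have no tap label
})

# Same order as step 7.1: drop rows with any NaN, then filter by time
as_written = toy.dropna().reset_index(drop=True)
as_written = as_written[(as_written.Time <= 590) & (as_written.Time >= 490)]

# Alternative sketch: filter by time first, ignore 'taps' when dropping NaN rows
in_window = toy[toy["Time"].between(490, 590)]
check_cols = [c for c in in_window.columns if c != "taps"]
alternative = in_window.dropna(subset=check_cols).reset_index(drop=True)

print(as_written.shape)   # (0, 3)
print(alternative.shape)  # (2, 3)
```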

## 7.2 Post Stimulus Data 

In [206]:
# similar filters as baseline data

Post_stimulus_data_pre = df_base.drop(columns=["Puff", "x","y"]).dropna().reset_index(drop=True)

Post_stimulus_data_pre = Post_stimulus_data_pre[((Post_stimulus_data_pre.Time>598))]

Post_stimulus_data_pre['Time'] = round(Post_stimulus_data_pre['Time']).astype('int')

In [207]:
Post_stimulus_data_pre




Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,taps,dataset,Gene,Allele
6,598,16,14,0.080994,0.109680,0.213989,0,0.105591,1.101562,0.141357,3.699219,0.246948,37.50000,26.796875,0.011101,6.558594,B0811ab,20220815,PD_Screen,1,7,1.0,N2,N2,N2
7,598,16,14,0.080505,0.118103,0.285889,0,0.109070,1.105469,0.145264,3.800781,0.248047,37.09375,26.703125,0.009399,6.562500,B0811ab,20220815,PD_Screen,1,8,1.0,N2,N2,N2
8,598,16,14,0.072876,0.112305,0.285889,0,0.106628,1.102539,0.141846,3.699219,0.248047,36.31250,27.203125,0.008797,6.562500,B0811ab,20220815,PD_Screen,1,9,1.0,N2,N2,N2
9,598,16,14,0.065491,0.098511,0.285889,0,0.105591,1.099609,0.140259,3.099609,0.250000,37.90625,27.296875,0.007301,6.566406,B0811ab,20220815,PD_Screen,1,10,1.0,N2,N2,N2
10,598,16,14,0.087830,0.127075,0.285889,0,0.105103,1.096680,0.140015,4.000000,0.256104,37.90625,27.093750,0.011299,6.566406,B0811ab,20220815,PD_Screen,1,11,1.0,N2,N2,N2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36237,1191,22,17,0.174194,0.122070,-0.881836,0,0.103027,0.961426,0.121338,5.101562,0.270996,44.00000,31.500000,0.013702,22.125000,A0811cb,20220815,PD_Screen,13,332,31.0,hipr-1_tm10120,hipr-1,tm10120
36238,1191,22,17,0.168701,0.120178,-0.824219,0,0.102783,0.960938,0.120667,5.699219,0.270996,44.31250,31.296875,0.014503,22.125000,A0811cb,20220815,PD_Screen,13,332,31.0,hipr-1_tm10120,hipr-1,tm10120
36239,1191,22,17,0.159302,0.121277,-0.824219,0,0.102112,0.965332,0.120056,5.898438,0.266113,44.81250,32.593750,0.014999,22.109375,A0811cb,20220815,PD_Screen,13,332,31.0,hipr-1_tm10120,hipr-1,tm10120
36240,1191,22,17,0.160889,0.128174,-0.824219,0,0.101379,0.970215,0.119324,6.000000,0.269043,47.09375,32.187500,0.014603,22.109375,A0811cb,20220815,PD_Screen,13,332,31.0,hipr-1_tm10120,hipr-1,tm10120


In [211]:
# Add continuous tap numbers from 1 to 31 for each experiment
# E.g., Experiment 1 has taps 1-31, Experiment 2 has taps 1-31 and so on..

Post_stimulus_data_pre['Tap_num'] = Post_stimulus_data_pre.groupby(['Experiment'])['Tap'].cumsum()

Post_stimulus_data_pre.head()



Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,taps,dataset,Gene,Allele,Tap_num
6,598,16,14,0.080994,0.10968,0.213989,0,0.105591,1.101562,0.141357,3.699219,0.246948,37.5,26.796875,0.011101,6.558594,B0811ab,20220815,PD_Screen,1,7,1.0,N2,N2,N2,0
7,598,16,14,0.080505,0.118103,0.285889,0,0.10907,1.105469,0.145264,3.800781,0.248047,37.09375,26.703125,0.009399,6.5625,B0811ab,20220815,PD_Screen,1,8,1.0,N2,N2,N2,0
8,598,16,14,0.072876,0.112305,0.285889,0,0.106628,1.102539,0.141846,3.699219,0.248047,36.3125,27.203125,0.008797,6.5625,B0811ab,20220815,PD_Screen,1,9,1.0,N2,N2,N2,0
9,598,16,14,0.065491,0.098511,0.285889,0,0.105591,1.099609,0.140259,3.099609,0.25,37.90625,27.296875,0.007301,6.566406,B0811ab,20220815,PD_Screen,1,10,1.0,N2,N2,N2,0
10,598,16,14,0.08783,0.127075,0.285889,0,0.105103,1.09668,0.140015,4.0,0.256104,37.90625,27.09375,0.011299,6.566406,B0811ab,20220815,PD_Screen,1,11,1.0,N2,N2,N2,0
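The `Tap_num` column is a running count of the `Tap` flag within each experiment. A toy example with made-up flags shows how the count restarts for each experiment. It assumes that each tap is flagged on exactly one row; if a tap is flagged on several rows, each of them increases the count.

```python
import pandas as pd

toy = pd.DataFrame({
    "Experiment": [1, 1, 1, 1, 1, 2, 2, 2],
    "Tap":        [0, 1, 0, 1, 0, 1, 0, 0],
})

# Running total of Tap flags, computed separately for each experiment
toy["Tap_num"] = toy.groupby("Experiment")["Tap"].cumsum()

print(toy["Tap_num"].tolist())  # [0, 1, 1, 2, 2, 1, 1, 1]
```

Rows before the first flagged tap get `Tap_num` 0, as in the output above.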


In [212]:
# Create windows from 7s to 9.5s after each tap ("Tap"=1) for each experiment
# and concatenate all these windows into a single dataframe

Post_stimulus_data = []

for exp in Post_stimulus_data_pre['Experiment'].unique(): # loop through each experiment separately 
    df = Post_stimulus_data_pre[Post_stimulus_data_pre['Experiment'] == exp]  
    tap_times = df[df['Tap'] == 1]['Time']  # get times where a tap occurred

    for t in tap_times: 
        window = df[(df['Time'] >= t + 7) & (df['Time'] <= t + 9.5)]
        Post_stimulus_data.append(window)

Post_stimulus_data = pd.concat(Post_stimulus_data)

Post_stimulus_data.head()




Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,taps,dataset,Gene,Allele,Tap_num
93,608,17,14,0.117981,0.13269,0.285889,0,0.112915,1.078125,0.143921,8.203125,0.36792,47.8125,34.1875,0.021698,6.496094,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
94,608,17,14,0.125854,0.14209,0.285889,0,0.111572,1.073242,0.142334,8.898438,0.358887,45.0,35.40625,0.022202,6.503906,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
95,608,17,14,0.097473,0.108582,0.285889,0,0.111023,1.076172,0.140381,7.601562,0.36499,44.90625,35.6875,0.015297,6.507812,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
96,608,17,14,0.102417,0.114624,0.285889,0,0.116211,1.082031,0.145874,8.0,0.343018,43.90625,35.6875,0.016006,6.515625,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
97,608,17,14,0.111084,0.123779,0.356934,0,0.110596,1.069336,0.140381,8.203125,0.349121,40.90625,34.6875,0.016296,6.519531,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
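The window selection can be tried on a toy experiment with made-up taps at 600 s and 610 s. Because `Time` was rounded to whole seconds in the previous cell, the upper bound `t + 9.5` effectively selects rows up to `t + 9`.

```python
import pandas as pd

# Toy experiment: one row per second, taps flagged at 600 s and 610 s
df = pd.DataFrame({"Time": list(range(600, 621)), "Tap": 0})
df.loc[df["Time"].isin([600, 610]), "Tap"] = 1

windows = []
for t in df.loc[df["Tap"] == 1, "Time"]:
    # keep rows from 7 s to 9.5 s after this tap
    windows.append(df[(df["Time"] >= t + 7) & (df["Time"] <= t + 9.5)])

out = pd.concat(windows)
print(out["Time"].tolist())  # [607, 608, 609, 617, 618, 619]
```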


In [213]:
# Post_stimulus_data[(Post_stimulus_data['taps']==Post_stimulus_data['Tap_num'])==False]

In [227]:
# Aggregate columns by "Experiment" + "Tap_num" (plus plate metadata and "taps") by taking their means

Post_stimulus_data3 = Post_stimulus_data.groupby(['Experiment', 'Tap_num','Screen','Date','Plate_id','Gene','Allele','dataset', 'taps']).agg({
    'Time': 'min', # take minimum value of time instead of mean
    'n': 'mean',
    'Number': 'mean',
    'Instantaneous Speed': 'mean',
    'Interval Speed' : 'mean',
    'Bias': 'mean',
    'Tap': 'mean',
    'Morphwidth': 'mean',
    'Midline': 'mean',
    'Area': 'mean',
    'Angular Speed': 'mean',
    'Aspect Ratio': 'mean',
    'Kink': 'mean',
    'Curve': 'mean',
    'Crab': 'mean',
    'Pathlength': 'mean'
})

Post_stimulus_data3 = Post_stimulus_data3.reset_index()

Post_stimulus_data3

NotImplementedError: agg function failed [how->min,dtype->int64]

In [228]:
Post_stimulus_data.head(60)



Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,taps,dataset,Gene,Allele,Tap_num
93,608,17,14,0.117981,0.13269,0.285889,0,0.112915,1.078125,0.143921,8.203125,0.36792,47.8125,34.1875,0.021698,6.496094,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
94,608,17,14,0.125854,0.14209,0.285889,0,0.111572,1.073242,0.142334,8.898438,0.358887,45.0,35.40625,0.022202,6.503906,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
95,608,17,14,0.097473,0.108582,0.285889,0,0.111023,1.076172,0.140381,7.601562,0.36499,44.90625,35.6875,0.015297,6.507812,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
96,608,17,14,0.102417,0.114624,0.285889,0,0.116211,1.082031,0.145874,8.0,0.343018,43.90625,35.6875,0.016006,6.515625,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
97,608,17,14,0.111084,0.123779,0.356934,0,0.110596,1.069336,0.140381,8.203125,0.349121,40.90625,34.6875,0.016296,6.519531,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
98,608,17,14,0.118774,0.140015,0.356934,0,0.109009,1.056641,0.137939,8.398438,0.340088,41.09375,35.6875,0.014801,6.527344,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
99,608,17,14,0.126831,0.14856,0.356934,0,0.112122,1.072266,0.143066,8.796875,0.326904,41.6875,36.8125,0.013397,6.535156,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
100,608,17,14,0.129883,0.14624,0.356934,0,0.111328,1.068359,0.141357,8.796875,0.325928,44.3125,36.59375,0.014702,6.539062,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
101,608,17,14,0.123718,0.132324,0.356934,0,0.111572,1.068359,0.142578,7.0,0.326904,47.90625,36.3125,0.0159,6.546875,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1
102,609,17,14,0.125732,0.123108,0.356934,0,0.11261,1.075195,0.143799,6.398438,0.322021,48.6875,35.3125,0.017899,6.554688,B0811ab,20220815,PD_Screen,1,93,2.0,N2,N2,N2,1


In [216]:
# Post_stimulus_data[(Post_stimulus_data['taps']==Post_stimulus_data['Tap_num'])==False]

In [217]:
Post_stimulus_data2 = Post_stimulus_data.drop(columns = 'Tap_num')

# Aggregate columns by "Experiment" + "taps" (plus plate metadata) by taking their means

Post_stimulus_data2 = Post_stimulus_data2.groupby(['Experiment', 'Screen','Date','Plate_id','Gene','Allele','dataset', 'taps']).agg({
    'Time': 'min', # take minimum value of time instead of mean
    'n': 'mean',
    'Number': 'mean',
    'Instantaneous Speed': 'mean',
    'Interval Speed' : 'mean',
    'Bias': 'mean',
    'Tap': 'mean',
    'Morphwidth': 'mean',
    'Midline': 'mean',
    'Area': 'mean',
    'Angular Speed': 'mean',
    'Aspect Ratio': 'mean',
    'Kink': 'mean',
    'Curve': 'mean',
    'Crab': 'mean',
    'Pathlength': 'mean'
})

Post_stimulus_data2 = Post_stimulus_data2.reset_index()

Post_stimulus_data2

NotImplementedError: agg function failed [how->min,dtype->int64]

In [None]:
Post_stimulus_data2.head(60)

Unnamed: 0,Experiment,Screen,Date,Plate_id,Gene,Allele,dataset,taps,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength
0,1,PD_Screen,20220815,B0811ab,N2,N2,N2,1.0,607,17.0,14.0,0.122842,0.135588,0.315071,0.0,0.1121,1.070025,0.141235,6.899816,0.358959,49.077206,36.746323,0.015419,6.450597
1,1,PD_Screen,20220815,B0811ab,N2,N2,N2,2.0,608,17.085106,14.659574,0.176925,0.134718,0.483097,0.0,0.113351,1.052589,0.139311,11.627327,0.325424,51.404919,33.878658,0.023303,6.539145
2,1,PD_Screen,20220815,B0811ab,N2,N2,N2,3.0,618,17.016393,13.967213,0.231305,0.134137,0.478255,0.0,0.108357,1.067079,0.137991,16.064037,0.303077,52.779202,30.996925,0.026821,6.89895
3,1,PD_Screen,20220815,B0811ab,N2,N2,N2,4.0,628,14.471698,12.415094,0.235759,0.117166,0.782908,0.0,0.10961,1.048681,0.136799,18.905365,0.317574,58.645046,31.220224,0.026794,7.813016
4,1,PD_Screen,20220815,B0811ab,N2,N2,N2,5.0,638,14.894737,12.0,0.258007,0.114002,0.973718,0.0,0.101728,1.072985,0.135522,16.989653,0.305227,45.543312,29.243694,0.027449,8.733827
5,1,PD_Screen,20220815,B0811ab,N2,N2,N2,6.0,648,15.695652,11.934783,0.262106,0.126116,0.94406,0.0,0.104321,1.078104,0.137921,12.387228,0.274043,37.398777,28.37466,0.02555,10.175611
6,1,PD_Screen,20220815,B0811ab,N2,N2,N2,7.0,658,16.361702,12.361702,0.252597,0.118103,0.984479,0.0,0.104204,1.085106,0.138072,11.306184,0.278372,38.798538,28.71476,0.024964,11.303857
7,1,PD_Screen,20220815,B0811ab,N2,N2,N2,8.0,668,17.0,12.372093,0.224649,0.135646,0.908442,0.0,0.101561,1.093114,0.135038,11.498183,0.283178,43.256905,29.24564,0.023855,10.638808
8,1,PD_Screen,20220815,B0811ab,N2,N2,N2,9.0,678,16.472727,12.163636,0.20105,0.122383,0.846191,0.0,0.101206,1.085352,0.134952,6.86005,0.239415,35.952557,27.113352,0.016693,9.671875
9,1,PD_Screen,20220815,B0811ab,N2,N2,N2,10.0,688,17.107143,12.0,0.210935,0.110816,0.901908,0.0,0.102663,1.073033,0.135141,8.048201,0.262939,40.632256,27.80385,0.018694,10.202706


In [218]:
Post_stimulus_data4 = Post_stimulus_data.drop(columns = 'taps')

# Aggregate columns by "Experiment" + "Tap_num" (plus plate metadata) by taking their means

Post_stimulus_data4 = Post_stimulus_data4.groupby(['Experiment', 'Screen','Date','Plate_id','Gene','Allele','dataset', 'Tap_num']).agg({
    'Time': 'min', # take minimum value of time instead of mean
    'n': 'mean',
    'Number': 'mean',
    'Instantaneous Speed': 'mean',
    'Interval Speed' : 'mean',
    'Bias': 'mean',
    'Tap': 'mean',
    'Morphwidth': 'mean',
    'Midline': 'mean',
    'Area': 'mean',
    'Angular Speed': 'mean',
    'Aspect Ratio': 'mean',
    'Kink': 'mean',
    'Curve': 'mean',
    'Crab': 'mean',
    'Pathlength': 'mean'
})

Post_stimulus_data4 = Post_stimulus_data4.reset_index()

Post_stimulus_data4

Unnamed: 0,Experiment,Screen,Date,Plate_id,Gene,Allele,dataset,Tap_num,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength
0,1,PD_Screen,20220815,B0811ab,N2,N2,N2,1,608,17.000000,14.000000,0.122196,0.130085,0.339172,0.0,0.111816,1.070923,0.141838,7.893555,0.332550,45.111328,35.310547,0.016444,6.543701
1,1,PD_Screen,20220815,B0811ab,N2,N2,N2,2,618,17.033333,14.966667,0.221281,0.156195,0.514119,0.0,0.108453,1.053288,0.136275,19.366667,0.305599,52.711460,30.558855,0.026352,6.675911
2,1,PD_Screen,20220815,B0811ab,N2,N2,N2,3,628,16.136364,13.000000,0.239746,0.095556,0.496227,0.0,0.109808,1.085982,0.141502,12.572443,0.273177,49.167614,29.379972,0.024515,7.243431
3,1,PD_Screen,20220815,B0811ab,N2,N2,N2,4,638,13.000000,12.000000,0.242992,0.114581,0.959889,0.0,0.099840,1.072627,0.134377,18.418259,0.312265,48.396992,28.658566,0.026236,8.048322
4,1,PD_Screen,20220815,B0811ab,N2,N2,N2,5,648,15.687500,12.000000,0.263863,0.134298,0.929962,0.0,0.103476,1.076324,0.137432,12.731445,0.268684,36.601562,28.312988,0.025394,9.686279
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371,13,PD_Screen,20220815,A0811cb,hipr-1,tm10120,hipr-1_tm10120,25,848,21.806452,18.000000,0.153730,0.080751,0.872685,0.0,0.091167,0.962749,0.109030,7.951865,0.248504,33.544857,26.682964,0.012239,16.497480
372,13,PD_Screen,20220815,A0811cb,hipr-1,tm10120,hipr-1_tm10120,26,858,23.967742,18.000000,0.149867,0.086879,0.849546,0.0,0.092630,0.951392,0.108032,7.712891,0.242975,40.512096,25.210182,0.013087,16.477318
373,13,PD_Screen,20220815,A0811cb,hipr-1,tm10120,hipr-1_tm10120,27,868,24.733333,21.400000,0.154093,0.078680,0.916960,0.0,0.088470,0.966064,0.108742,6.230078,0.232874,37.823959,25.526562,0.010944,14.628646
374,13,PD_Screen,20220815,A0811cb,hipr-1,tm10120,hipr-1_tm10120,28,878,23.064516,21.000000,0.147772,0.082057,0.857028,0.0,0.088336,0.956653,0.107849,4.712954,0.224196,37.407257,25.116432,0.009828,12.583165
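The aggregation above lists every measurement column by hand. As a minimal sketch, the same mapping can be built programmatically so that only the exceptions (here `Time`) are spelled out. The miniature frame and the names `group_cols`, `special`, and `agg_map` are hypothetical and used for illustration only; they are not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of Post_stimulus_data: two rows per tap
df = pd.DataFrame({
    "Experiment": [1, 1, 1, 1],
    "Tap_num": [1, 1, 2, 2],
    "Time": [608, 609, 618, 619],
    "Instantaneous Speed": [0.10, 0.14, 0.20, 0.24],
    "Bias": [0.2, 0.4, 0.5, 0.7],
})

group_cols = ["Experiment", "Tap_num"]  # grouping keys
special = {"Time": "min"}               # columns that should not use the mean

# Every remaining column falls back to the mean
agg_map = {col: special.get(col, "mean") for col in df.columns if col not in group_cols}

summary = df.groupby(group_cols).agg(agg_map).reset_index()
print(summary)
```

With this pattern, adding a new measurement column to the data does not require editing the aggregation cell, although an explicit dictionary as used above is easier to audit.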


In [219]:
Post_stimulus_data4.head(60)

Unnamed: 0,Experiment,Screen,Date,Plate_id,Gene,Allele,dataset,Tap_num,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength
0,1,PD_Screen,20220815,B0811ab,N2,N2,N2,1,608,17.0,14.0,0.122196,0.130085,0.339172,0.0,0.111816,1.070923,0.141838,7.893555,0.33255,45.111328,35.310547,0.016444,6.543701
1,1,PD_Screen,20220815,B0811ab,N2,N2,N2,2,618,17.033333,14.966667,0.221281,0.156195,0.514119,0.0,0.108453,1.053288,0.136275,19.366667,0.305599,52.71146,30.558855,0.026352,6.675911
2,1,PD_Screen,20220815,B0811ab,N2,N2,N2,3,628,16.136364,13.0,0.239746,0.095556,0.496227,0.0,0.109808,1.085982,0.141502,12.572443,0.273177,49.167614,29.379972,0.024515,7.243431
3,1,PD_Screen,20220815,B0811ab,N2,N2,N2,4,638,13.0,12.0,0.242992,0.114581,0.959889,0.0,0.09984,1.072627,0.134377,18.418259,0.312265,48.396992,28.658566,0.026236,8.048322
4,1,PD_Screen,20220815,B0811ab,N2,N2,N2,5,648,15.6875,12.0,0.263863,0.134298,0.929962,0.0,0.103476,1.076324,0.137432,12.731445,0.268684,36.601562,28.312988,0.025394,9.686279
5,1,PD_Screen,20220815,B0811ab,N2,N2,N2,6,658,15.0,12.0,0.267924,0.122331,0.983398,0.0,0.10399,1.090039,0.138656,11.942708,0.277848,35.797916,28.089062,0.026446,11.172396
6,1,PD_Screen,20220815,B0811ab,N2,N2,N2,7,668,18.6875,13.0,0.207993,0.124386,0.826752,0.0,0.103668,1.083435,0.13752,7.256592,0.24073,33.936523,26.118164,0.016663,11.794922
7,1,PD_Screen,20220815,B0811ab,N2,N2,N2,8,678,15.62963,11.555556,0.203315,0.128685,0.850694,0.0,0.102749,1.06854,0.130597,9.196325,0.265657,40.670139,28.693865,0.018685,10.033854
8,1,PD_Screen,20220815,B0811ab,N2,N2,N2,9,688,16.066667,12.0,0.212834,0.115068,0.891797,0.0,0.100594,1.089811,0.137142,6.9,0.250008,39.21302,27.360416,0.017139,9.748959
9,1,PD_Screen,20220815,B0811ab,N2,N2,N2,10,698,19.451613,12.0,0.221054,0.103268,0.948967,0.0,0.098271,1.079448,0.132557,15.571446,0.300356,44.880039,29.63508,0.021087,10.983367


In [220]:
Post_stimulus_data4.drop(columns='Tap_num', inplace=True)
Post_stimulus_data2.drop(columns='taps', inplace=True)


In [221]:
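# Compare the two frames column by column.
# `==` between two Series requires identical index labels, so this raises if the frames are not aligned.
for i in Post_stimulus_data2.columns:
    (Post_stimulus_data2[i] == Post_stimulus_data4[i]).all()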

ValueError: Can only compare identically-labeled Series objects
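The `ValueError` above is what pandas raises when `==` is applied to two Series whose index labels do not match. A minimal sketch of a safer comparison follows: it restricts the check to columns present in both frames and compares rows by position. The helper name `compare_shared_columns` and the toy frames are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

def compare_shared_columns(left, right):
    """Return {column: bool} for columns present in both frames.

    Rows are compared by position, so the original index labels are ignored.
    Series.equals also requires matching dtypes (int vs float counts as different).
    Raises ValueError with a clear message when the row counts differ.
    """
    if len(left) != len(right):
        raise ValueError(f"row counts differ: {len(left)} vs {len(right)}")
    shared = [c for c in left.columns if c in right.columns]
    a = left[shared].reset_index(drop=True)
    b = right[shared].reset_index(drop=True)
    return {c: bool(a[c].equals(b[c])) for c in shared}

# Toy frames with different index labels and one extra column
x = pd.DataFrame({"Time": [608, 618], "n": [17.0, 16.0]}, index=[93, 94])
y = pd.DataFrame({"Time": [608, 618], "n": [17.0, 15.5], "extra": [0, 1]})

result = compare_shared_columns(x, y)
print(result)
```

For floating-point measurements, an exact match may be too strict; `numpy.allclose` on the aligned columns is a common alternative.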

In [222]:
Post_stimulus_data2.equals(Post_stimulus_data4)  # returns True/False


False

In [223]:
print(Post_stimulus_data2.columns.equals(Post_stimulus_data4.columns))  # True?
print(Post_stimulus_data2.shape == Post_stimulus_data4.shape)           # True?


False
False


In [224]:
Post_stimulus_data2.shape

(10914, 24)

In [225]:
Post_stimulus_data4.shape

(376, 23)
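The shapes explain why `equals` returned `False`: one frame has 10914 rows and 24 columns, the other 376 rows and 23 columns. Judging from the printed headers, `plate` appears only in `Post_stimulus_data2`. A small diagnostic like the sketch below can report such differences in one call. The function name `describe_mismatch` and the toy frames are hypothetical.

```python
import pandas as pd

def describe_mismatch(left, right):
    """Summarise how two frames differ in shape and column names."""
    return {
        "shape_left": left.shape,
        "shape_right": right.shape,
        "only_in_left": sorted(set(left.columns) - set(right.columns)),
        "only_in_right": sorted(set(right.columns) - set(left.columns)),
        "same_column_order": list(left.columns) == list(right.columns),
    }

# Toy stand-ins: a row-level frame and an aggregated frame
raw = pd.DataFrame({"Time": [608, 608, 618], "Bias": [0.2, 0.4, 0.5], "plate": [93, 93, 93]})
agg = pd.DataFrame({"Time": [608, 618], "Bias": [0.3, 0.5]})

report = describe_mismatch(raw, agg)
for key, value in report.items():
    print(f"{key}: {value}")
```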

In [226]:
Post_stimulus_data2.head()



Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,dataset,Gene,Allele
93,608,17,14,0.117981,0.13269,0.285889,0,0.112915,1.078125,0.143921,8.203125,0.36792,47.8125,34.1875,0.021698,6.496094,B0811ab,20220815,PD_Screen,1,93,N2,N2,N2
94,608,17,14,0.125854,0.14209,0.285889,0,0.111572,1.073242,0.142334,8.898438,0.358887,45.0,35.40625,0.022202,6.503906,B0811ab,20220815,PD_Screen,1,93,N2,N2,N2
95,608,17,14,0.097473,0.108582,0.285889,0,0.111023,1.076172,0.140381,7.601562,0.36499,44.90625,35.6875,0.015297,6.507812,B0811ab,20220815,PD_Screen,1,93,N2,N2,N2
96,608,17,14,0.102417,0.114624,0.285889,0,0.116211,1.082031,0.145874,8.0,0.343018,43.90625,35.6875,0.016006,6.515625,B0811ab,20220815,PD_Screen,1,93,N2,N2,N2
97,608,17,14,0.111084,0.123779,0.356934,0,0.110596,1.069336,0.140381,8.203125,0.349121,40.90625,34.6875,0.016296,6.519531,B0811ab,20220815,PD_Screen,1,93,N2,N2,N2


In [186]:
Post_stimulus_data4.head()

Unnamed: 0,Experiment,Screen,Date,Plate_id,Gene,Allele,dataset,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength
0,1,PD_Screen,20220815,B0811ab,N2,N2,N2,607,17.0,14.0,0.122529,0.13292,0.326756,0.0,0.111963,1.07046,0.141528,7.381629,0.346154,47.154358,36.05019,0.015916,6.495739
1,1,PD_Screen,20220815,B0811ab,N2,N2,N2,617,17.081967,14.983607,0.213095,0.146495,0.536105,0.0,0.111345,1.048124,0.137155,16.41291,0.313805,53.698257,31.87039,0.026602,6.605213
2,1,PD_Screen,20220815,B0811ab,N2,N2,N2,627,16.641509,13.0,0.240483,0.105637,0.465415,0.0,0.108905,1.082731,0.14042,12.745283,0.289238,51.318398,30.573704,0.026129,7.16819
3,1,PD_Screen,20220815,B0811ab,N2,N2,N2,637,13.155172,12.0,0.237614,0.124159,0.974037,0.0,0.104987,1.04568,0.133888,21.080751,0.331943,57.469288,30.725754,0.027398,8.138604
4,1,PD_Screen,20220815,B0811ab,N2,N2,N2,647,16.129032,12.0,0.267568,0.124225,0.957157,0.0,0.103453,1.074865,0.137006,14.169733,0.283302,39.685482,29.018145,0.026916,9.523942


In [85]:
print('done step 7')

done step 7


# Save dataframes as `.csv`

In [None]:
Baseline_data.to_csv(f"{Screen}_baseline_output.csv")
print('saved Baseline data as .csv!')

In [None]:
Post_stimulus_data.to_csv(f"{Screen}_post_stimulus.csv")
print('saved Post stimulus data as .csv!')
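By default, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. If the index carries no information, passing `index=False` avoids this. The sketch below demonstrates the difference with an in-memory buffer and a toy frame; whether the index should be kept for these outputs is a choice left to the analysis.

```python
import io
import pandas as pd

df = pd.DataFrame({"Gene": ["N2", "hipr-1"], "Bias": [0.34, 0.87]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```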

# Done!