# Jupyter Notebook UI to analyze baseline data from tap-habituation experiments!

### Beginner Essentials:
1. Shift-Enter to run each cell. After you run, you should see an output "done step #". If not, an error has occured
2. When inputting your own code/revising the code, make sure you close all your quotation marks '' and brackets (), [], {}.
3. Don't leave any commas (,) hanging! (make sure an object always follows a comma. If there is nothing after a comma, remove the comma!
4. Learning to code? Each line of code is annotated to help you understand how this code works!

**Run all cells/steps sequentially, even the ones that do not need input**

## Step-by-Step Analysis of the Jupyter Notebook

| Step | Purpose | Key Actions |
|------|---------|-------------|
| **1. Import Packages** | Load required Python libraries for data analysis | Imports `pandas`, `numpy`, `matplotlib`, etc. | 
| **2. Pick Filepath** | Select the folder containing experimental data files (.dat or .trv) | Input required: Uses `FileChooser` widget to select directory | 
| **3. User-Defined Variables** | Set experiment parameters | Defines: `bin`  | 
| **4. Construct Filelist** | Find all files in selected folder | Sets working directory and scans `folder_path` using; Displays no. of `.trv` files found in the folder |
| **5. Process Data Function** | Define functions to load, clean, and analyze raw data | - `ProcessData()`: Loads files, calculates metrics (reversal probability, speed) |
| **6.1 Process Data** | Apply processing to all strains| - Checks `filelist` for unique strain names (e.g., "N2") <br>- Runs `ProcessData()` for each strain | 
| **7. Grouping & Naming** | Combine data from all strains | - Concatenates DataFrames<br>- Assigns dataset names (e.g., "N2") | 
| **Output CSV** | Save processed data | Exports `Baseline_data` to CSV |

### Key Notes:
- User Input Required: Steps 2 (file selection), 3 (parameters), 6.1 (strain verification)
- Output: Final CSV contains all analyzed tap response data

# 1. Importing Packages Required (No input required, just run)

In [None]:
import pandas as pd #<- package used to import and organize data
import numpy as np #<- package used to import and organize data
import seaborn as sns #<- package used to plot graphs
from matplotlib import pyplot as plt #<- package used to plot graphs
import os #<- package used to work with system filepaths
from ipywidgets import widgets #<- widget tool to generate button
from IPython.display import display #<- displays button
from ipyfilechooser import FileChooser
# from tkinter import Tk, filedialog #<- Tkinter is a GUI package
from tqdm.notebook import tqdm
# import dask.dataframe as dd
# import pingouin as pg
pd.set_option('display.max_columns', 50)
print("done step 1")

done step 1


## 2. Pick filepath (just run and click button from output)

Run the following cell and click the button 'Select Folder' to pick a filepath.

**Important: Later on, this script uses the total file path for each file to import and group data. That means if your folder has whatever your strain is named, the script will not work.**

(ex. if your folder has "N2" in it this script sees all files inside this folder as having the "N2" search key)

**An easy fix is to just rename your folder to something else (make your strains lower-case, or just have the date)**

In [2]:
starting_directory = '/Users'
chooser = FileChooser(starting_directory)
display(chooser)

FileChooser(path='/Users', filename='', title='', show_hidden=False, select_desc='Select', change_desc='Change…

In [3]:
print(chooser.selected_path)
folder_path=chooser.selected_path

/Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022


In [4]:
screens = ['PD_Screen', 'ASD_Screen', 'G-Proteins_Screen', 'Glia_Genes_Screen', 'Neuron_Genes_Screen', 'Miscellaneous']

screen_chooser = widgets.Select(options=screens, value=screens[0], description='Screen:')
display(screen_chooser)

Select(description='Screen:', options=('PD_Screen', 'ASD_Screen', 'G-Proteins_Screen', 'Glia_Genes_Screen', 'N…

In [5]:
Screen=screen_chooser.value
print(Screen)

PD_Screen


# 3. User Defined Variables (Add input here)

Here, we add some constants to help you blaze through this code.

3.1: Setting time bins


3.2: Setting view range for your graph
- Top, bottom = y axis view range
- left, right = x axis view range



In [6]:
# Setting 1s Bins
bins = np.linspace(0,1200,1201) # np.linspace(start, end, steps in between)
print(bins)


print("done step 3")

[0.000e+00 1.000e+00 2.000e+00 ... 1.198e+03 1.199e+03 1.200e+03]
done step 3


# 4. Construct filelist from folder path (No input required, just run)

In [7]:
os.chdir(folder_path) # setting your working directory so that your images will be saved here

filelist = list() # empty list
for root, dirs, files in os.walk(folder_path): # this for loop goes through your folder 
    for name in files:
        if name.endswith('.dat'): # and takes out all files with a .dat (file that contains your data)
            if "_" in name.split(".")[-2]:
                filepath = os.path.join(root, name) # Notes down the file path of each data file
                filelist.append(filepath) # saves it into the list

if not filelist:
    raise FileNotFoundError("No .dat files found in the selected folder!")
else:
    print(f"Number of .dat files to process: {len(filelist)}")
    # print(f"Example of first and last file saved: {filelist[0]}, {filelist[-1]}") 

print('done step 4')

Number of .dat files to process: 13
done step 4


# 5. Process Data Function (No input required, just run)

In [None]:
def ProcessData(strain, experiment_counter): 
    """
    Filters and processes .dat files matching the given strain.

    Parameters: 
        strain (str): keyword to match in the files

    Returns:
        dict: N (Plate number) and Dataframe with required columns 
              ("time", "dura", "dist", "prob", "speed", "plate", "Date",
              "Plate_id", "Screen")

    """
    strain_filelist = [x for x in filelist if strain in x] # Goes through the list and filters for keyword
    Strain_N = len(strain_filelist) # Finds the number of plates per strain
    if Strain_N == 0:
        raise AssertionError ('{} is not a good identifier'.format(strain))
    else:
        pass
        print(f'Strain {strain}')
        print(f'Number of plates: {Strain_N}') 
        
        # visiting files in this strain
        strain_filelist = [file for file in filelist if strain in file]
        df_list=[]
        for i, file in enumerate(strain_filelist):
            if file.split('/')[-1].startswith('._'):
                pass
            else:
                try:
                    print(f"File: {file}")
                    df= pd.read_csv(file, sep=' ', header = None, encoding_errors='ignore')
                    df['Plate_id'] = file.split('/')[-1].split('_')[-1].split('.')[0]
                    df['Date'] = file.split('/')[-2].split('_')[0]
                    df['Screen'] = file.split('/')[-4]
                    df['Experiment'] = experiment_counter
                    experiment_counter = 1+experiment_counter
                    df_list.append(df)
                except:
                    print(f"error in file {file}")
                    pass
        DF_Total = pd.concat(df_list, ignore_index = True)
        DF_Total = DF_Total.rename( 
                    {0:'Time',
                    1:'n',
                    2:'Number',
                    3:'Instantaneous Speed',
                    4:'Interval Speed',
                    5:'Bias',
                    6:'Tap',
                    7:'Puff',
                    8:'x',
                    9:'y',
                    10:'Morphwidth',
                    11:'Midline',
                    12:'Area',
                    13:'Angular Speed',
                    14:'Aspect Ratio',
                    15:'Kink',
                    16:'Curve',
                    17:'Crab',
                    18:'Pathlength'}, axis=1)
        
        # check function here for NaN Columns
        DF_Total['plate'] = 0

        print("---------------------------------------------------------------------------------------------------------------------------------------------------------------------------")

    return{
            'N': Strain_N,
            'Confirm':DF_Total,
            'experiment_counter': experiment_counter
            # 'Final': DF_Final
    }



def assign_taps(df, tolerances):
    """
    Assigns tap number to each row in the DataFrame based on time tolerances.

    Parameters:
        df (pd.DataFrame): The DataFrame to modify
        tolerances (list of tuples): Each tuple is (lower, upper) time range

    Returns:
        None
    """
    df['taps'] = 0
    for taps, tolerance in enumerate(tolerances): #[(99, 101), (109,111), ...]
        tap_lower,tap_upper = tolerance
        TimesInTapRange = df['Time'].between(tap_lower,tap_upper, inclusive="both")
        df.loc[TimesInTapRange,'taps'] = int(taps) # set the tap to i where times are between


def insert_plates(df):   
    """
    Inserts a plate column into a dataframe.
    
    Parameters:
        df (pd.DataFrame): any dataframe
    
    Returns: 
        pd.DataFrame: dataframe with a plate column
    """
    df['plate']=(df['taps'] ==1).cumsum()


print('done step 5')

done step 5
done step 5


# 6.1 Process Data

Create a dictionary `StrainNames` that contains all the genotype/strain names from each file path

In [106]:
genotype=[]
for f in filelist:
    genotype.append(f.split('/')[-3])

genotypes=np.unique(genotype).tolist()

if Screen =="Neuron_Genes_Screen":
    genotypes.insert(0, genotypes.pop(genotypes.index("N2_XJ1")))
    genotypes.insert(0, genotypes.pop(genotypes.index("N2_N2")))
else:
    genotypes.insert(0, genotypes.pop(genotypes.index("N2")))

nstrains = list(range(1, len(genotypes) + 1))
StrainNames = {nstrains[i]: genotypes[i] for i in range(len(nstrains))}

print(f"Number of genotypes/strains in the experiment: {len(genotypes)}")

# Display the first 5 Strain names in the experiment
for k in list(StrainNames)[:5]:
    print(f"{k}: {StrainNames[k]}")


print("done step 6.1")

# <---------------- Test element to use for dictionary buidling -------------------
# s = '/Users/Joseph/Desktop/OnFoodOffFoodTest/N2_OnFood/20220401_163048/N2_10x1_n96h20C_360sA0401_ka.00065.dat'
# slist=s.split('/')[5]
# print(slist)
# print(list(range(1,5+1)))

Number of genotypes/strains in the experiment: 3
1: N2
2: hipr-1_ok1081
3: hipr-1_tm10120
done step 6.1


# 6.2 Process Data (just run this cell)

Pass each strain through `ProcessData()` function 

In [107]:
DataLists = [0] # generates empty list at index 0 because we want indexing to start at 1 
                # when I say #1, I want the first point, not the second point

experiment_counter = 1

# the loop below goes through the dictionary in step 6.1 and processes data
# and appends all data into a list of dataframes
for s in tqdm(StrainNames.values()): 
    if not s == '':
        result = ProcessData(s, experiment_counter)
        DataLists.append(result['Confirm'])
        experiment_counter = result['experiment_counter'] 


# Taps
number_of_taps = 30 # Taps in your experiment (N)
ISI = 10  # ISI in your experiment
first_tap = 600 # when is your first tap? check your TRV files

# Here, open up one of the trv files to determine the times for each of these taps. 

# Record number of taps (N+1), e.g., if number_of_taps = 30, taps = [1, 2, 3, ..., 31]
taps = np.arange(1, number_of_taps+2).tolist()

# Assign tolerance to each tap
lower = np.arange(first_tap-2, first_tap-2+(number_of_taps*ISI), ISI) # (first tap, last tap+10s, ISI)
upper = np.arange(first_tap+2, first_tap+2+(number_of_taps*ISI), ISI) # (first tap, last tap+10s, ISI)
tolerances = [(int(l), int(u)) for l, u in zip(lower, upper)]
tolerances.append((1188,1191)) # (N+1)th tap


# the loop below assigns taps and plates to the processed data
for df in DataLists[1:]: 
    assign_taps(df, tolerances)
    insert_plates(df)


print('done step 6.2')

  0%|          | 0/3 [00:00<?, ?it/s]

Strain N2
Number of plates: 5
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_101538/N2_10x2_f72h20C_600s31x10s10s_B0811ab.dat
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_102652/N2_10x2_f96h20C_600s31x10s10s_A0811aa.dat
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_122801/N2_10x2_f96h20C_600s31x10s10s_A0811ad.dat
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_121502/N2_10x2_f72h20C_600s31x10s10s_B0811ae.dat
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_103433/N2_10x2_f72h20C_600s31x10s10s_C0811ac.dat
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Strain hipr-1_ok1081
Number of plates: 4
File: /Users/gurmehak/

In [112]:
for df in DataLists[1:]:
    # Check for weird values
    print("Max time:", df['Time'].max())
    print("Min time:", df['Time'].min())
    print("Inf count:", np.isinf(df['Time']).sum())
    print("NaNs:", df['Time'].isna().sum())



Max time: 1200.0
Min time: 0.009
Inf count: 0
NaNs: 0
Max time: 1200.0
Min time: 0.007
Inf count: 0
NaNs: 0
Max time: 1200.0
Min time: 0.007
Inf count: 0
NaNs: 0


In [119]:
numeric_cols = DataLists[1].select_dtypes(include=[np.number])
max_vals = numeric_cols.max()

# Show only columns with huge values
print(max_vals[max_vals > 1e2])

Time             1200.0000
Angular Speed     122.1875
Kink              115.8125
plate             458.0000
dtype: float64


In [124]:
DataLists[2]

  has_large_values = (abs_vals > 1e6).any()


Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Puff,x,y,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,taps
0,0.012001,3,0,0.000000,0.000000,0.000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.00000,0.000000,0.00000,A0811bc,20220815,PDScreen_TapHab_August15_2022,6,0,0
1,0.053986,3,0,0.000000,0.000000,0.000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.00000,0.000000,0.00000,A0811bc,20220815,PDScreen_TapHab_August15_2022,6,0,0
2,0.086975,3,0,0.000000,0.000000,0.000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.00000,0.000000,0.00000,A0811bc,20220815,PDScreen_TapHab_August15_2022,6,0,0
3,0.129028,3,0,0.000000,0.000000,0.000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.00000,0.000000,0.00000,A0811bc,20220815,PDScreen_TapHab_August15_2022,6,0,0
4,0.171021,3,0,0.000000,0.000000,0.000,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.00000,0.000000,0.00000,A0811bc,20220815,PDScreen_TapHab_August15_2022,6,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119166,1200.000000,10,10,0.138428,0.136719,0.375,0,0,20.500000,30.484375,0.119690,0.981934,0.135498,9.796875,0.391113,53.59375,33.68750,0.016006,48.59375,C0811be,20220815,PDScreen_TapHab_August15_2022,9,393,0
119167,1200.000000,10,10,0.141846,0.146851,0.000,0,0,20.515625,30.484375,0.116089,0.985352,0.134888,9.203125,0.398926,49.18750,33.40625,0.016006,48.59375,C0811be,20220815,PDScreen_TapHab_August15_2022,9,393,0
119168,1200.000000,10,10,0.000000,0.000000,0.000,0,0,20.515625,30.484375,0.114502,0.979980,0.133301,0.000000,0.383057,48.81250,32.90625,0.000000,48.59375,C0811be,20220815,PDScreen_TapHab_August15_2022,9,393,0
119169,1200.000000,10,10,0.000000,0.000000,0.000,0,0,20.515625,30.484375,0.117310,0.982910,0.134399,0.000000,0.377930,51.09375,32.40625,0.000000,48.59375,C0811be,20220815,PDScreen_TapHab_August15_2022,9,393,0


In [126]:
numeric_only

NameError: name 'numeric_only' is not defined

In [127]:
df

  has_large_values = (abs_vals > 1e6).any()


Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Puff,x,y,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,taps
0,0.007000,18,0,0.000000,0.000000,0.000000,0,0,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000,B0811cc,20220815,PDScreen_TapHab_August15_2022,10,0,0
1,0.053986,18,0,0.000000,0.000000,0.000000,0,0,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000,B0811cc,20220815,PDScreen_TapHab_August15_2022,10,0,0
2,0.094971,18,0,0.000000,0.000000,0.000000,0,0,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000,B0811cc,20220815,PDScreen_TapHab_August15_2022,10,0,0
3,0.135010,19,0,0.000000,0.000000,0.000000,0,0,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000,B0811cc,20220815,PDScreen_TapHab_August15_2022,10,0,0
4,0.177979,19,0,0.000000,0.000000,0.000000,0,0,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000,B0811cc,20220815,PDScreen_TapHab_August15_2022,10,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110728,1200.000000,23,17,0.095825,0.085876,0.293945,0,0,29.359375,34.25,0.089722,0.963379,0.108765,7.398438,0.281982,44.50000,28.296875,0.005699,22.125,A0811cb,20220815,PDScreen_TapHab_August15_2022,13,308,0
110729,1200.000000,23,17,0.094727,0.089417,0.000000,0,0,29.359375,34.25,0.090027,0.961426,0.109009,6.101562,0.283936,45.09375,28.796875,0.005100,22.125,A0811cb,20220815,PDScreen_TapHab_August15_2022,13,308,0
110730,1200.000000,23,17,0.000000,0.000000,0.000000,0,0,29.359375,34.25,0.090393,0.961914,0.109070,0.000000,0.287109,43.59375,28.593750,0.000000,22.125,A0811cb,20220815,PDScreen_TapHab_August15_2022,13,308,0
110731,1200.000000,24,17,0.000000,0.000000,0.000000,0,0,29.375000,34.25,0.090088,0.962402,0.109070,0.000000,0.287109,43.50000,28.500000,0.000000,22.125,A0811cb,20220815,PDScreen_TapHab_August15_2022,13,308,0


In [132]:
DataLists[3].isna().sum()

Time                   0
n                      0
Number                 0
Instantaneous Speed    0
Interval Speed         0
Bias                   0
Tap                    0
Puff                   0
x                      0
y                      0
Morphwidth             0
Midline                0
Area                   0
Angular Speed          0
Aspect Ratio           0
Kink                   0
Curve                  0
Crab                   0
Pathlength             0
Plate_id               0
Date                   0
Screen                 0
Experiment             0
plate                  0
taps                   0
dtype: int64

In [128]:
print(DataLists[1].max(numeric_only=True))
print(DataLists[1].mean(numeric_only=True))  # this is where overflow usually shows

Time                   1200.000000
n                        56.000000
Number                   46.000000
Instantaneous Speed       1.030273
Interval Speed            1.446289
Bias                      1.000000
Tap                       1.000000
Puff                      0.000000
x                        39.250000
y                        43.406250
Morphwidth                0.232544
Midline                   1.336914
Area                      0.265381
Angular Speed           122.187500
Aspect Ratio              0.675781
Kink                    115.812500
Curve                    50.687500
Crab                      0.704590
Pathlength               37.781250
Experiment                5.000000
plate                   458.000000
taps                     30.000000
dtype: float64
Time                          NaN
n                       28.176292
Number                  20.456035
Instantaneous Speed      0.000000
Interval Speed           0.000000
Bias                          NaN
Tap        

  return count.astype(dtype, copy=False)
  return umr_sum(a, axis, dtype, out, keepdims, initial, where)


# Convert float64 data to float32 to reduce memory load (can also convert to 16 if needed)

For plain english:

float16 = 4 decimal points

float32 = 8 decimal points

float64 = 16 decimal points

more decimal points = more data/memory that computer has to keep track of

In [129]:
# commented out this section in case memory load needs to be reduced

for n in tqdm(DataLists[1:]):
    print(n)
    TestData=n
    TestData[TestData.select_dtypes(np.float64).columns]=TestData.select_dtypes(np.float64).astype(np.float16)
    

  0%|          | 0/3 [00:00<?, ?it/s]

               Time   n  Number  Instantaneous Speed  Interval Speed  \
0          0.009003  13       0             0.000000        0.000000   
1          0.061005  13       0             0.000000        0.000000   
2          0.088989  12       0             0.000000        0.000000   
3          0.135986  12       0             0.000000        0.000000   
4          0.167969  12       0             0.000000        0.000000   
...             ...  ..     ...                  ...             ...   
134125  1200.000000  48      30             0.173828        0.124512   
134126  1200.000000  48      30             0.182739        0.128540   
134127  1200.000000  47      28             0.189575        0.132324   
134128  1200.000000  47      28             0.000000        0.000000   
134129  1200.000000  47      28             0.000000        0.000000   

            Bias  Tap  Puff          x          y  Morphwidth   Midline  \
0       0.000000    0     0   0.000000   0.000000    0.00000

  has_large_values = (abs_vals > 1e6).any()
  has_large_values = (abs_vals > 1e6).any()
  has_large_values = (abs_vals > 1e6).any()


# 7. Grouping Data and Naming

This step takes all the individual strain data (processed in Step 6) and combines them into single dataframe, filters for time window 490s - 590s, drops unwanted columns. 

The final processed data `Baseline_data` is ready for analysis.

In [111]:
df=pd.concat(df.assign(dataset=StrainNames.get(i+1)) for i, df in enumerate(DataLists[1:]))

df[['Gene', 'Allele']] = df['dataset'].str.split(pat='_', n=1, expand=True)

df['Allele'] = df['Allele'].fillna('N2')

df['Screen']=Screen

Baseline_data = df.drop(columns=["Tap", "Puff", "x","y", "Experiment"]).dropna().reset_index(drop=True)

Baseline_data = Baseline_data[((Baseline_data.Time<=590)&(Baseline_data.Time >=490))] 

Baseline_data.head()

  has_large_values = (abs_vals > 1e6).any()


Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,plate,taps,dataset,Gene,Allele
9772,490.0,14,12,0.088074,0.132446,0.25,0.107178,1.088867,0.141113,7.199219,0.230957,30.203125,25.0,0.008598,11.882812,B0811ab,20220815,PD_Screen,0,0,N2,N2,N2
9773,490.0,14,12,0.098877,0.142944,0.25,0.108887,1.094727,0.144043,7.800781,0.233032,29.703125,24.90625,0.007,11.890625,B0811ab,20220815,PD_Screen,0,0,N2,N2,N2
9774,490.0,14,12,0.093872,0.13855,0.25,0.111084,1.09668,0.145752,7.300781,0.227051,31.90625,25.09375,0.005299,11.890625,B0811ab,20220815,PD_Screen,0,0,N2,N2,N2
9775,490.0,14,12,0.082275,0.119507,0.25,0.107788,1.09082,0.1427,6.101562,0.224976,31.09375,25.0,0.006599,11.890625,B0811ab,20220815,PD_Screen,0,0,N2,N2,N2
9776,490.0,14,12,0.073608,0.102417,0.25,0.105896,1.087891,0.140137,5.300781,0.218994,30.90625,24.703125,0.006401,11.898438,B0811ab,20220815,PD_Screen,0,0,N2,N2,N2


In [110]:
Baseline_data.shape

(30602, 23)

## Creating Post Stimulus Data 

In [None]:
# similar filters as baseline data

Post_stimulus_data_pre = df.drop(columns=["Puff", "x","y"]).dropna().reset_index(drop=True)

Post_stimulus_data_pre = Post_stimulus_data_pre[((Post_stimulus_data_pre.Time>598))]

Post_stimulus_data_pre['Time'] = round(Post_stimulus_data_pre['Time']).astype('int')

In [69]:
# Add continuous tap numbers from 1 to 31 for each experiment
# E.g., Experiment 1 has taps 1-31, Experiment 2 has taps 1-31 and so on..

Post_stimulus_data_pre['Tap_num'] = Post_stimulus_data_pre.groupby(['Experiment'])['Tap'].cumsum()

Post_stimulus_data_pre.head()

  has_large_values = (abs_vals > 1e6).any()


Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,taps,dataset,Gene,Allele,Tap_num
6,598,16,14,0.080994,0.10968,0.213989,0,0.105591,1.101562,0.141357,3.699219,0.246948,37.5,26.796875,0.011101,6.558594,B0811ab,20220815,PD_Screen,1,0,0.0,N2,N2,N2,0
7,598,16,14,0.080505,0.118103,0.285889,0,0.10907,1.105469,0.145264,3.800781,0.248047,37.09375,26.703125,0.009399,6.5625,B0811ab,20220815,PD_Screen,1,0,0.0,N2,N2,N2,0
8,598,16,14,0.072876,0.112305,0.285889,0,0.106628,1.102539,0.141846,3.699219,0.248047,36.3125,27.203125,0.008797,6.5625,B0811ab,20220815,PD_Screen,1,0,0.0,N2,N2,N2,0
9,598,16,14,0.065491,0.098511,0.285889,0,0.105591,1.099609,0.140259,3.099609,0.25,37.90625,27.296875,0.007301,6.566406,B0811ab,20220815,PD_Screen,1,0,0.0,N2,N2,N2,0
10,598,16,14,0.08783,0.127075,0.285889,0,0.105103,1.09668,0.140015,4.0,0.256104,37.90625,27.09375,0.011299,6.566406,B0811ab,20220815,PD_Screen,1,0,0.0,N2,N2,N2,0


In [70]:
# Create windows from 7s to 9.5s post a tap ("Tap"=1) for each experiment
# and concatenate all these wondows into a single dataframe

Post_stimulus_data = []

for exp in Post_stimulus_data_pre['Experiment'].unique(): # loop through each experiment separately 
    df = Post_stimulus_data_pre[Post_stimulus_data_pre['Experiment'] == exp]  
    tap_times = df[df['Tap'] == 1]['Time']  # get times where tap occured

    for t in tap_times: 
        window = df[(df['Time'] >= t + 7) & (df['Time'] <= t + 9.5)]
        Post_stimulus_data.append(window)

Post_stimulus_data = pd.concat(Post_stimulus_data)

Post_stimulus_data.head()


  has_large_values = (abs_vals > 1e6).any()


Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,taps,dataset,Gene,Allele,Tap_num
93,608,17,14,0.117981,0.13269,0.285889,0,0.112915,1.078125,0.143921,8.203125,0.36792,47.8125,34.1875,0.021698,6.496094,B0811ab,20220815,PD_Screen,1,1,1.0,N2,N2,N2,1
94,608,17,14,0.125854,0.14209,0.285889,0,0.111572,1.073242,0.142334,8.898438,0.358887,45.0,35.40625,0.022202,6.503906,B0811ab,20220815,PD_Screen,1,2,1.0,N2,N2,N2,1
95,608,17,14,0.097473,0.108582,0.285889,0,0.111023,1.076172,0.140381,7.601562,0.36499,44.90625,35.6875,0.015297,6.507812,B0811ab,20220815,PD_Screen,1,3,1.0,N2,N2,N2,1
96,608,17,14,0.102417,0.114624,0.285889,0,0.116211,1.082031,0.145874,8.0,0.343018,43.90625,35.6875,0.016006,6.515625,B0811ab,20220815,PD_Screen,1,4,1.0,N2,N2,N2,1
97,608,17,14,0.111084,0.123779,0.356934,0,0.110596,1.069336,0.140381,8.203125,0.349121,40.90625,34.6875,0.016296,6.519531,B0811ab,20220815,PD_Screen,1,5,1.0,N2,N2,N2,1


In [83]:
Post_stimulus_data[(Post_stimulus_data['taps']==Post_stimulus_data['Tap_num'])==False]

  has_large_values = (abs_vals > 1e6).any()


Unnamed: 0,index,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,taps,dataset,Gene,Allele,Tap_num
7231,24191,778,11,11,0.185669,0.127808,0.90918,0,0.111511,1.023438,0.133301,12.796875,0.308105,58.68750,32.406250,0.020294,16.875000,C0811be,20220815,PD_Screen,9,393,18.0,hipr-1_ok1081,hipr-1,ok1081,17
7232,24192,778,11,11,0.190552,0.123413,0.90918,0,0.111694,1.019531,0.133423,14.203125,0.319092,56.31250,31.093750,0.018402,16.890625,C0811be,20220815,PD_Screen,9,393,18.0,hipr-1_ok1081,hipr-1,ok1081,17
7233,24193,778,11,11,0.219971,0.138916,0.90918,0,0.113281,1.010742,0.134888,16.500000,0.311035,55.81250,31.093750,0.026306,16.890625,C0811be,20220815,PD_Screen,9,393,18.0,hipr-1_ok1081,hipr-1,ok1081,17
7234,24194,778,11,11,0.225586,0.151123,0.90918,0,0.114990,1.027344,0.137451,16.500000,0.297119,55.18750,31.796875,0.019699,16.906250,C0811be,20220815,PD_Screen,9,393,18.0,hipr-1_ok1081,hipr-1,ok1081,17
7235,24195,778,11,11,0.219604,0.161743,0.90918,0,0.114685,1.035156,0.137573,15.601562,0.294922,54.18750,32.312500,0.014603,16.906250,C0811be,20220815,PD_Screen,9,393,18.0,hipr-1_ok1081,hipr-1,ok1081,17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7599,25304,889,10,9,0.199463,0.107300,1.00000,0,0.113220,1.024414,0.136353,9.203125,0.332031,51.59375,34.000000,0.034607,28.406250,C0811be,20220815,PD_Screen,9,393,29.0,hipr-1_ok1081,hipr-1,ok1081,28
7600,25305,889,10,9,0.188721,0.104797,1.00000,0,0.113403,1.019531,0.135864,8.796875,0.320068,50.90625,32.406250,0.029907,28.421875,C0811be,20220815,PD_Screen,9,393,29.0,hipr-1_ok1081,hipr-1,ok1081,28
7601,25306,889,10,9,0.177002,0.101807,1.00000,0,0.111389,1.010742,0.134521,9.000000,0.305908,49.68750,31.500000,0.025406,28.421875,C0811be,20220815,PD_Screen,9,393,29.0,hipr-1_ok1081,hipr-1,ok1081,28
7602,25307,889,10,9,0.203613,0.116211,1.00000,0,0.110291,1.015625,0.133545,10.898438,0.309082,49.31250,32.093750,0.027893,28.437500,C0811be,20220815,PD_Screen,9,393,29.0,hipr-1_ok1081,hipr-1,ok1081,28


In [87]:
# Aggregate columns by "Experiment" + "Tap_num" by taking their means

Post_stimulus_data = Post_stimulus_data.groupby(['Experiment', 'Tap_num','Screen','Date','Plate_id','Gene','Allele','dataset']).agg({
    'Time': 'min', # take minimum valu of time instead of mean
    'n': 'mean',
    'Number': 'mean',
    'Instantaneous Speed': 'mean',
    'Interval Speed' : 'mean',
    'Bias': 'mean',
    'Tap': 'mean',
    'Morphwidth': 'mean',
    'Midline': 'mean',
    'Area': 'mean',
    'Angular Speed': 'mean',
    'Aspect Ratio': 'mean',
    'Kink': 'mean',
    'Curve': 'mean',
    'Crab': 'mean',
    'Pathlength': 'mean'
})

Post_stimulus_data = Post_stimulus_data.reset_index()

Post_stimulus_data

Unnamed: 0,Experiment,Tap_num,Screen,Date,Plate_id,Gene,Allele,dataset,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength
0,1,1,PD_Screen,20220815,B0811ab,N2,N2,N2,608,17.000000,14.000000,0.122196,0.130085,0.339172,0.0,0.111816,1.070923,0.141838,7.893555,0.332550,45.111328,35.310547,0.016444,6.543701
1,1,2,PD_Screen,20220815,B0811ab,N2,N2,N2,618,17.033333,14.966667,0.221281,0.156195,0.514119,0.0,0.108453,1.053288,0.136275,19.366667,0.305599,52.711460,30.558855,0.026352,6.675911
2,1,3,PD_Screen,20220815,B0811ab,N2,N2,N2,628,16.136364,13.000000,0.239746,0.095556,0.496227,0.0,0.109808,1.085982,0.141502,12.572443,0.273177,49.167614,29.379972,0.024515,7.243431
3,1,4,PD_Screen,20220815,B0811ab,N2,N2,N2,638,13.000000,12.000000,0.242992,0.114581,0.959889,0.0,0.099840,1.072627,0.134377,18.418259,0.312265,48.396992,28.658566,0.026236,8.048322
4,1,5,PD_Screen,20220815,B0811ab,N2,N2,N2,648,15.687500,12.000000,0.263863,0.134298,0.929962,0.0,0.103476,1.076324,0.137432,12.731445,0.268684,36.601562,28.312988,0.025394,9.686279
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371,13,25,PD_Screen,20220815,A0811cb,hipr-1,tm10120,hipr-1_tm10120,848,21.806452,18.000000,0.153730,0.080751,0.872685,0.0,0.091167,0.962749,0.109030,7.951865,0.248504,33.544857,26.682964,0.012239,16.497480
372,13,26,PD_Screen,20220815,A0811cb,hipr-1,tm10120,hipr-1_tm10120,858,23.967742,18.000000,0.149867,0.086879,0.849546,0.0,0.092630,0.951392,0.108032,7.712891,0.242975,40.512096,25.210182,0.013087,16.477318
373,13,27,PD_Screen,20220815,A0811cb,hipr-1,tm10120,hipr-1_tm10120,868,24.733333,21.400000,0.154093,0.078680,0.916960,0.0,0.088470,0.966064,0.108742,6.230078,0.232874,37.823959,25.526562,0.010944,14.628646
374,13,28,PD_Screen,20220815,A0811cb,hipr-1,tm10120,hipr-1_tm10120,878,23.064516,21.000000,0.147772,0.082057,0.857028,0.0,0.088336,0.956653,0.107849,4.712954,0.224196,37.407257,25.116432,0.009828,12.583165


In [None]:
print('done step 7')

# Save dataframe as `.csv`

In [None]:
Baseline_data.to_csv(f"{Screen}_baseline_output.csv")
print('saved Baseline data as .csv!')

In [None]:
Post_stimulus_data.to_csv(f"{Screen}_post_stimulus.csv")
print('saved Post stimulus data as .csv!')

# Done!