# Jupyter Notebook UI to analyze baseline data from tap-habituation experiments!

### Beginner Essentials:
1. Shift-Enter to run each cell. After a cell runs, you should see the output "done step #". If not, an error has occurred.
2. When inputting your own code/revising the code, make sure you close all your quotation marks '' and brackets (), [], {}.
3. Don't leave any commas (,) hanging! Make sure an object always follows a comma. If there is nothing after a comma, remove the comma!
4. Learning to code? Each line of code is annotated to help you understand how this code works!

**Run all cells/steps sequentially, even the ones that do not need input**

## Step-by-Step Analysis of the Jupyter Notebook

| Step | Purpose | Key Actions |
|------|---------|-------------|
| **1. Import Packages** | Load required Python libraries for data analysis | Imports `pandas`, `numpy`, `matplotlib`, etc. | 
| **2. Pick Filepath** | Select the folder containing experimental data files (`.dat`) | Input required: uses the `FileChooser` widget to select a directory | 
| **3. User-Defined Variables** | Set experiment parameters | Defines `bins` (1 s time bins) | 
| **4. Construct Filelist** | Find all files in the selected folder | Sets the working directory and scans `folder_path` with `os.walk`; displays the number of `.dat` files found |
| **5. Process Data Function** | Define the function that loads and labels raw data | - `ProcessData()`: loads files, names the columns, and adds `Plate_id`, `Date`, `Screen`, and `Experiment` |
| **6.1 Process Data** | Apply processing to all strains| - Checks `filelist` for unique strain names (e.g., "N2") <br>- Runs `ProcessData()` for each strain | 
| **7. Grouping & Naming** | Combine data from all strains | - Concatenates DataFrames<br>- Assigns dataset names (e.g., "N2") | 
| **Output CSV** | Save processed data | Exports `Baseline_data` and `Post_stimulus_data` to CSV |

### Key Notes:
- User Input Required: Steps 2 (file selection), 3 (parameters), 6.1 (strain verification)
- Output: two CSV files, one with baseline data (490 s to 590 s) and one with post-stimulus data averaged per tap

# 1. Importing Packages Required (No input required, just run)

In [6]:
import pandas as pd #<- package used to import and organize data
import numpy as np #<- package used to import and organize data
import seaborn as sns #<- package used to plot graphs
from matplotlib import pyplot as plt #<- package used to plot graphs
import os #<- package used to work with system filepaths
from ipywidgets import widgets #<- widget tool to generate button
from IPython.display import display #<- displays button
from ipyfilechooser import FileChooser
# from tkinter import Tk, filedialog #<- Tkinter is a GUI package
from tqdm.notebook import tqdm
# import dask.dataframe as dd
# import pingouin as pg
pd.set_option('display.max_columns', 50)
print("done step 1")

done step 1


# 2. Pick filepath (just run and click button from output)

Run the following cell and click the button 'Select Folder' to pick a filepath.

**Important: Later on, this script matches each strain name against the full path of every file to import and group data. If the name of the folder you select contains a strain name, every file inside it matches that strain and the grouping will be wrong.**

(ex. if your folder has "N2" in its name, this script sees all files inside this folder as having the "N2" search key)

**An easy fix is to rename your folder to something else (write your strains in lower-case, or just use the date)**
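
The matching problem described above can be reproduced on a few made-up paths. This is a minimal sketch: the paths and the helper `match_by_folder` are hypothetical and are not part of this notebook, which matches on the full path.

```python
# Hypothetical paths; only the layout (.../<strain>/<timestamp>/<file>.dat) mirrors this notebook
paths = [
    "/data/N2_screen/N2/20240724_025822/N2_5x4_A0724.dat",
    "/data/N2_screen/ced-10_n3246/20240725_101500/ced-10_5x4_A0725.dat",
]

# Substring match on the full path: the folder name "N2_screen" makes every file match "N2"
substring_hits = [p for p in paths if "N2" in p]

def match_by_folder(path, strain):
    """Return True when the strain folder (third component from the end) equals `strain`."""
    return path.split("/")[-3] == strain

# Matching on the strain folder only keeps the one real N2 plate
folder_hits = [p for p in paths if match_by_folder(p, "N2")]

print(len(substring_hits))  # 2
print(len(folder_hits))     # 1
```

Renaming the top folder, as suggested above, avoids the problem without changing any code.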

In [7]:
starting_directory = '/Volumes'
chooser = FileChooser(starting_directory)
display(chooser)

FileChooser(path='/Volumes', filename='', title='', show_hidden=False, select_desc='Select', change_desc='Chan…

In [9]:
print(chooser.selected_path)
folder_path=chooser.selected_path

/Volumes/RankinLabMehak_SSD/Glia_Genes_Screen_2025


In [10]:
screens = ['PD_Screen', 'ASD_Screen', 'G-Proteins_Screen', 'Glia_Genes_Screen', 'Neuron_Genes_Screen', 'Miscellaneous', 'ASD_WGS_Screen']

screen_chooser = widgets.Select(options=screens, value=screens[0], description='Screen:')
display(screen_chooser)

Select(description='Screen:', options=('PD_Screen', 'ASD_Screen', 'G-Proteins_Screen', 'Glia_Genes_Screen', 'N…

In [11]:
Screen=screen_chooser.value
print(Screen)

Glia_Genes_Screen


# 3. User Defined Variables (Add input here)

Here, we add some constants to help you blaze through this code.

3.1: Setting time bins


3.2: Setting view range for your graph
- Top, bottom = y axis view range
- left, right = x axis view range



In [12]:
# Setting 1s Bins
bins = np.linspace(0,1200,1201) # np.linspace(start, stop, number of points): 1201 edges make 1200 one-second bins
print(bins)


print("done step 3")

[0.000e+00 1.000e+00 2.000e+00 ... 1.198e+03 1.199e+03 1.200e+03]
done step 3
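
As a rough illustration of how bin edges like these can be used, the sketch below assigns a few time stamps to 1 s bins. The time stamps are toy values, and `pd.cut` is one possible way to apply the edges; this notebook does not call it in the cells shown here.

```python
import numpy as np
import pandas as pd

# 1201 evenly spaced edges from 0 to 1200 give 1200 one-second bins
bins = np.linspace(0, 1200, 1201)
step = bins[1] - bins[0]

# Toy time stamps (assumed values, not taken from an experiment)
times = pd.Series([0.007, 0.052, 1.5, 599.9, 1199.8])

# Label each time stamp with the right edge of the bin it falls into
binned = pd.cut(times, bins=bins, labels=bins[1:])

print(step)                           # 1.0
print(binned.astype(float).tolist())  # [1.0, 1.0, 2.0, 600.0, 1200.0]
```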


# 4. Construct filelist from folder path (No input required, just run)

In [13]:
os.chdir(folder_path) # setting your working directory so that your output files will be saved here

filelist = list() # empty list
for root, dirs, files in os.walk(folder_path): # this for loop goes through your folder 
    for name in files:
        if name.endswith('.dat'): # and takes out all files with a .dat (file that contains your data)
            if "_" in name.split(".")[-2]:
                filepath = os.path.join(root, name) # Notes down the file path of each data file
                filelist.append(filepath) # saves it into the list

if not filelist:
    raise FileNotFoundError("No .dat files found in the selected folder!")
else:
    print(f"Number of .dat files to process: {len(filelist)}")
    # print(f"Example of first and last file saved: {filelist[0]}, {filelist[-1]}") 

print('done step 4')

Number of .dat files to process: 317
done step 4
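
The two filename checks in the loop above can be tried on their own. In this sketch the helper name `keep` is hypothetical, the first two file names are modeled on names that appear elsewhere in this notebook, and the third is made up. What the numbered files contain is not stated here, so treat that reading as an assumption.

```python
names = [
    "N2_5x4_f96h20C_600s30x10s10s_A0724.dat",      # plate-level file: kept
    "N2_10x1_n96h20C_360sA0401_ka.00065.dat",      # numbered file: piece before ".dat" has no "_"
    "N2_5x4_f96h20C_600s30x10s10s_A0724.summary",  # hypothetical non-.dat file
]

def keep(name):
    """Same two checks as Step 4: a .dat extension and an underscore in the piece before it."""
    return name.endswith(".dat") and "_" in name.split(".")[-2]

kept = [n for n in names if keep(n)]
print(kept)  # ['N2_5x4_f96h20C_600s30x10s10s_A0724.dat']
```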


# 5. Process Data Function (No input required, just run)

In [14]:
def ProcessData(strain, experiment_counter): 
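    """
    Loads and labels all .dat files whose path contains the given strain keyword.

    Parameters: 
        strain (str): keyword to match in the file paths
        experiment_counter (int): number assigned to the next file loaded

    Returns:
        dict: 'N' (number of plates for this strain), 'Confirm' (DataFrame
              with the named columns plus "Plate_id", "Date", "Screen",
              "Experiment", "plate"), and 'experiment_counter' (the counter
              after the last file loaded)

    """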
    strain_filelist = [x for x in filelist if strain in x] # Goes through the list and filters for keyword
    Strain_N = len(strain_filelist) # Finds the number of plates per strain
    if Strain_N == 0:
        raise AssertionError('{} is not a good identifier'.format(strain))
    else:
        print(f'Strain {strain}')
        print(f'Number of plates: {Strain_N}') 
        
        # visiting files in this strain
        strain_filelist = [file for file in filelist if strain in file]
        df_list=[]
        for i, file in enumerate(strain_filelist):
            if file.split('/')[-1].startswith('._'):
                pass
            else:
                try:
                    print(f"File: {file}")
                    df= pd.read_csv(file, sep=' ', header = None, encoding_errors='ignore')
                    df['Plate_id'] = file.split('/')[-2]+"_"+ file.split('/')[-1].split('_')[-1].split('.')[0]
                    df['Date'] = file.split('/')[-2].split('_')[0]
                    df['Screen'] = file.split('/')[-4]
                    df['Experiment'] = experiment_counter
                    experiment_counter = 1+experiment_counter
                    df_list.append(df)
                except Exception as err: # report the unreadable file and continue with the next one
                    print(f"error in file {file}: {err}")
        DF_Total = pd.concat(df_list, ignore_index = True)
        DF_Total = DF_Total.rename( 
                    {0:'Time',
                    1:'n',
                    2:'Number',
                    3:'Instantaneous Speed',
                    4:'Interval Speed',
                    5:'Bias',
                    6:'Tap',
                    7:'Puff',
                    8:'x',
                    9:'y',
                    10:'Morphwidth',
                    11:'Midline',
                    12:'Area',
                    13:'Angular Speed',
                    14:'Aspect Ratio',
                    15:'Kink',
                    16:'Curve',
                    17:'Crab',
                    18:'Pathlength'}, axis=1)
        
        # check function here for NaN Columns
        DF_Total['plate'] = 0

        print("---------------------------------------------------------------------------------------------------------------------------------------------------------------------------")

    return{
            'N': Strain_N,
            'Confirm':DF_Total,
            'experiment_counter': experiment_counter
            # 'Final': DF_Final
    }


print('done step 5')

done step 5
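
`ProcessData()` derives its labels from the position of each folder in the file path. The sketch below applies the same `split` calls to one path of the form printed in Step 6.2. The variable names are chosen for illustration only.

```python
path = ("/Volumes/RankinLabMehak_SSD/Glia_Genes_Screen_2025/N2/"
        "20240724_025822/N2_5x4_f96h20C_600s30x10s10s_A0724.dat")

parts = path.split("/")

# Last underscore-separated piece of the file name, without the extension
plate_suffix = parts[-1].split("_")[-1].split(".")[0]

plate_id = parts[-2] + "_" + plate_suffix  # timestamp folder + plate suffix
date = parts[-2].split("_")[0]             # first half of the timestamp folder
screen_folder = parts[-4]                  # folder two levels above the timestamp folder
genotype = parts[-3]                       # strain folder (used in Step 6.1)

print(plate_id)       # 20240724_025822_A0724
print(date)           # 20240724
print(screen_folder)  # Glia_Genes_Screen_2025
print(genotype)       # N2
```

Because every label depends on folder depth, the layout `<screen>/<strain>/<timestamp>/<file>.dat` has to be the same for all plates.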


# 6.1 Process Data

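Create a dictionary `StrainNames` that maps a 1-based index to each genotype/strain name, taken from the strain folder in each file path. The wild-type control (`N2`) is moved to the front of the list.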

In [15]:
genotype=[]
for f in filelist:
    genotype.append(f.split('/')[-3])

genotypes=np.unique(genotype).tolist()

if Screen =="Neuron_Genes_Screen":
    genotypes.insert(0, genotypes.pop(genotypes.index("N2_XJ1")))
    genotypes.insert(0, genotypes.pop(genotypes.index("N2_N2")))
else:
    genotypes.insert(0, genotypes.pop(genotypes.index("N2")))

nstrains = list(range(1, len(genotypes) + 1))
StrainNames = {nstrains[i]: genotypes[i] for i in range(len(nstrains))}

print(f"Number of genotypes/strains in the experiment: {len(genotypes)}")

# Display the first 5 Strain names in the experiment
for k in list(StrainNames)[:5]:
    print(f"{k}: {StrainNames[k]}")


print("done step 6.1")

# <---------------- Test element to use for dictionary building -------------------
# s = '/Users/Joseph/Desktop/OnFoodOffFoodTest/N2_OnFood/20220401_163048/N2_10x1_n96h20C_360sA0401_ka.00065.dat'
# slist=s.split('/')[5]
# print(slist)
# print(list(range(1,5+1)))

Number of genotypes/strains in the experiment: 25
1: N2
2: AMshABLATE_nsIs109
3: ced-10_n3246
4: ced-5_n2002
5: delm-1_ok1226
done step 6.1
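
The reordering and numbering in this step can be seen on a short list. This is a sketch with a handful of strain names taken from the output above; `dict(enumerate(..., start=1))` is used as a compact stand-in for the dictionary comprehension in the cell.

```python
import numpy as np

# Strain folder names as they might be collected from the file paths (with repeats)
genotype = ["ced-10_n3246", "N2", "AMshABLATE_nsIs109", "N2", "delm-1_ok1226"]

# np.unique sorts the names; uppercase letters sort before lowercase ones
genotypes = np.unique(genotype).tolist()

# Move the wild-type control to the front
genotypes.insert(0, genotypes.pop(genotypes.index("N2")))

# 1-based numbering, same result as the comprehension over `nstrains`
StrainNames = dict(enumerate(genotypes, start=1))

print(StrainNames)
```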


# 6.2 Process Data (just run this cell)

Pass each strain through `ProcessData()` function 

In [16]:
DataLists = [0] # placeholder at index 0 because we want indexing to start at 1,
                # so that DataLists[1] is the first strain, matching the keys of StrainNames

experiment_counter = 1

# the loop below goes through the dictionary in step 6.1 and processes data
# and appends all data into a list of dataframes
for s in tqdm(StrainNames.values()): 
    if not s == '':
        result = ProcessData(s, experiment_counter)
        DataLists.append(result['Confirm'])
        experiment_counter = result['experiment_counter'] 

print('done step 6.2')

  0%|          | 0/25 [00:00<?, ?it/s]

Strain N2
Number of plates: 75
File: /Volumes/RankinLabMehak_SSD/Glia_Genes_Screen_2025/mgl-2_tm355/20241024_171133/N2_5x4_f96h20C_600s30x10s10s_B1024.dat
File: /Volumes/RankinLabMehak_SSD/Glia_Genes_Screen_2025/N2/20240724_025822/N2_5x4_f96h20C_600s30x10s10s_A0724.dat
File: /Volumes/RankinLabMehak_SSD/Glia_Genes_Screen_2025/N2/20240724_035049/N2_5x4_f96h20C_600s30x10s10s_A0724.dat
File: /Volumes/RankinLabMehak_SSD/Glia_Genes_Screen_2025/N2/20240724_094826/N2_5x4_f96h20C_600s31x10s10s_B0724.dat
File: /Volumes/RankinLabMehak_SSD/Glia_Genes_Screen_2025/N2/20240724_095505/N2_5x4_f96h20C_600s31x10s10s_C0724.dat
File: /Volumes/RankinLabMehak_SSD/Glia_Genes_Screen_2025/N2/20240724_103519/N2_5x4_f96h20C_600s31x10s10s_B0724.dat
File: /Volumes/RankinLabMehak_SSD/Glia_Genes_Screen_2025/N2/20240724_104235/N2_5x4_f96h20C_600s31x10s10s_C0724.dat
File: /Volumes/RankinLabMehak_SSD/Glia_Genes_Screen_2025/N2/20240727_144831/N2_5x4_f96h20C_600s31x10s10s_B0727.dat
File: /Volumes/RankinLabMehak_SSD/Glia_G

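# Convert float64 data to float16 to reduce memory load (use float32 if more precision is needed)

In plain English:

float16 = about 3 significant digits (2 bytes per value)

float32 = about 7 significant digits (4 bytes per value)

float64 = about 15 to 16 significant digits (8 bytes per value)

More digits = more memory that the computer has to keep track of. The trade-off is precision: in float16, values between 1024 and 2048 are stored in steps of 1, so a `Time` of 1199.799 becomes 1200.0.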
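
The size and precision of each float type can be checked directly with NumPy. The sketch below uses `np.finfo` and one `Time` value from the data; it is an illustration and not part of the processing pipeline.

```python
import numpy as np

# Bytes per value and NumPy's reported decimal precision for each type
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(dtype.__name__, info.bits // 8, "bytes,", info.precision, "decimal digits")

# The same time stamp stored at two precisions
t16 = float(np.float16(1199.799))
t32 = float(np.float32(1199.799))

print(t16)  # 1200.0
print(t32)  # close to 1199.799
```

If sub-second timing matters for later steps, converting to `np.float32` instead keeps the time stamps close to their original values at half the memory of float64.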

In [17]:
# optional: run this cell when memory load needs to be reduced

for n in tqdm(DataLists[1:]):
    print(n)
    TestData=n
    TestData[TestData.select_dtypes(np.float64).columns]=TestData.select_dtypes(np.float64).astype(np.float16)
    

  0%|          | 0/25 [00:00<?, ?it/s]

             Time   n  Number  Instantaneous Speed  Interval Speed   Bias  \
0           0.007   0       0               0.0000          0.0000  0.000   
1           0.052   0       0               0.0000          0.0000  0.000   
2           0.098   0       0               0.0000          0.0000  0.000   
3           0.138   0       0               0.0000          0.0000  0.000   
4           0.176   0       0               0.0000          0.0000  0.000   
...           ...  ..     ...                  ...             ...    ...   
1638676  1199.799  76      28               0.1016          0.0852  0.115   
1638677  1199.848  76      28               0.1184          0.1005  0.000   
1638678  1199.926  75      28               0.0865          0.0726  0.000   
1638679  1200.007  75      28               0.0000          0.0000  0.000   
1638680  1200.087  76      28               0.0000          0.0000  0.000   

         Tap  Puff        x        y  Morphwidth  Midline      Area  \
0   

# 7. Grouping Data and Naming

In [18]:
base=pd.concat(df.assign(dataset=StrainNames.get(i+1)) for i, df in enumerate(DataLists[1:]))

base[['Gene', 'Allele']] = base['dataset'].str.split(pat='_', n=1, expand=True)

base['Screen']=Screen

base['Allele'] = base['Allele'].fillna('N2')
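
The gene and allele columns come from splitting the dataset name at the first underscore. A minimal sketch on three strain names from this experiment:

```python
import pandas as pd

base = pd.DataFrame({"dataset": ["N2", "ced-10_n3246", "AMshABLATE_nsIs109"]})

# Split at the first underscore only; names without one get a missing Allele
base[["Gene", "Allele"]] = base["dataset"].str.split(pat="_", n=1, expand=True)

# The wild-type control has no allele, so it is labeled "N2"
base["Allele"] = base["Allele"].fillna("N2")

print(base)
```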

### Creating `Baseline_data` 

This step takes all the individual strain data (processed in Step 6) combined into a single dataframe, drops unwanted columns, and keeps only the 490 s to 590 s time window.


In [19]:
Baseline_data = base.drop(columns=["Tap", "Puff", "x","y", "Experiment"]).dropna().reset_index(drop=True)

Baseline_data = Baseline_data[((Baseline_data.Time<=590)&(Baseline_data.Time >=490))] 

Baseline_data.head()



Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,plate,dataset,Gene,Allele
12178,490.0,14,12,0.027802,0.019897,0.083008,0.067078,0.784668,0.071655,4.800781,0.343018,52.3125,30.40625,0.0042,3.167969,20241024_171133_B1024,20241024,Glia_Genes_Screen,0,N2,N2,N2
12179,490.0,14,12,0.026993,0.019897,0.083008,0.067993,0.782227,0.072021,3.900391,0.341064,53.0,29.703125,0.004799,3.167969,20241024_171133_B1024,20241024,Glia_Genes_Screen,0,N2,N2,N2
12180,490.0,14,12,0.024597,0.016205,0.083008,0.068298,0.778809,0.072083,4.5,0.336914,52.5,29.59375,0.006001,3.167969,20241024_171133_B1024,20241024,Glia_Genes_Screen,0,N2,N2,N2
12181,490.0,14,12,0.019501,0.013802,0.083008,0.065796,0.775879,0.07074,4.199219,0.333984,53.09375,30.5,0.004902,3.167969,20241024_171133_B1024,20241024,Glia_Genes_Screen,0,N2,N2,N2
12182,490.0,14,12,0.026993,0.0186,0.083008,0.065308,0.767578,0.069275,4.300781,0.339111,52.1875,30.09375,0.003401,3.167969,20241024_171133_B1024,20241024,Glia_Genes_Screen,0,N2,N2,N2


In [20]:
Baseline_data.shape

(710261, 22)
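
The baseline window includes both end points. The sketch below applies the same mask to toy time stamps and compares it with `Series.between`, an equivalent spelling that this notebook does not use; the column values are made up.

```python
import pandas as pd

# Toy data (assumed values) around the edges of the baseline window
df = pd.DataFrame({
    "Time": [489.9, 490.0, 540.2, 590.0, 590.1],
    "Instantaneous Speed": [0.02, 0.03, 0.03, 0.02, 0.04],
})

# Mask as written in the cell above
mask = (df.Time <= 590) & (df.Time >= 490)

# Equivalent form; `between` is inclusive on both ends by default
alt = df["Time"].between(490, 590)

print(df[mask]["Time"].tolist())  # [490.0, 540.2, 590.0]
print(mask.equals(alt))           # True
```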

## Creating Post Stimulus Data 

In [21]:
# similar filters as baseline data

Post_stimulus_data_pre = base.drop(columns=["Puff", "x","y"]).dropna().reset_index(drop=True)

Post_stimulus_data_pre = Post_stimulus_data_pre[((Post_stimulus_data_pre.Time>599))]

Post_stimulus_data_pre['Time'] = round(Post_stimulus_data_pre['Time']).astype('int')

# Add continuous tap numbers from 1 to 31 for each experiment
# E.g., Experiment 1 has taps 1-31, Experiment 2 has taps 1-31 and so on..
Post_stimulus_data_pre['Tap_num'] = Post_stimulus_data_pre.groupby(['Experiment'])['Tap'].cumsum()

Post_stimulus_data_pre.head()



Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,dataset,Gene,Allele,Tap_num
14894,600,15,12,0.021606,0.0112,0.0,0,0.065308,0.770508,0.068726,3.699219,0.363037,49.8125,30.59375,0.0065,2.84375,20241024_171133_B1024,20241024,Glia_Genes_Screen,1,0,N2,N2,N2,0
14895,600,15,12,0.023407,0.014999,0.0,0,0.066589,0.78418,0.070862,4.699219,0.358887,49.6875,31.0,0.007801,2.84375,20241024_171133_B1024,20241024,Glia_Genes_Screen,1,0,N2,N2,N2,0
14896,600,15,12,0.015701,0.006901,0.0,0,0.06781,0.794922,0.072327,2.400391,0.361084,49.1875,30.703125,0.005402,2.84375,20241024_171133_B1024,20241024,Glia_Genes_Screen,1,0,N2,N2,N2,0
14897,600,15,12,0.023804,0.013397,0.0,0,0.066711,0.79834,0.071716,3.5,0.374023,50.8125,31.0,0.006699,2.84375,20241024_171133_B1024,20241024,Glia_Genes_Screen,1,0,N2,N2,N2,0
14898,600,15,12,0.024994,0.014,0.0,0,0.066284,0.782227,0.070557,2.599609,0.366943,49.5,31.0,0.005901,2.84375,20241024_171133_B1024,20241024,Glia_Genes_Screen,1,0,N2,N2,N2,0
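
The tap numbering relies on a cumulative sum of the `Tap` column within each experiment. A small sketch with toy rows shows the pattern, including the `Tap_num` of 0 for rows recorded before the first tap:

```python
import pandas as pd

# Toy rows (assumed values): two experiments, two taps each
df = pd.DataFrame({
    "Experiment": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "Time":       [600, 600, 601, 610, 611, 600, 605, 610, 612],
    "Tap":        [0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Running count of taps, restarted for every experiment
df["Tap_num"] = df.groupby("Experiment")["Tap"].cumsum()

print(df["Tap_num"].tolist())  # [0, 1, 1, 2, 2, 1, 1, 2, 2]
```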


In [22]:
# Create windows from 7s to 9.5s after each tap ("Tap" == 1) for each experiment
# and concatenate all these windows into a single dataframe

Post_stimulus_data = []

for exp in Post_stimulus_data_pre['Experiment'].unique(): # loop through each experiment separately 
    df = Post_stimulus_data_pre[Post_stimulus_data_pre['Experiment'] == exp]  
    tap_times = df[df['Tap'] == 1]['Time']  # get times where tap occurred

    for t in tap_times: 
        window = df[(df['Time'] >= t + 7) & (df['Time'] <= t + 9.5)]
        Post_stimulus_data.append(window)

Post_stimulus_data = pd.concat(Post_stimulus_data)

Post_stimulus_data.head()



Unnamed: 0,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength,Plate_id,Date,Screen,Experiment,plate,dataset,Gene,Allele,Tap_num
15065,607,15,12,0.052612,0.058289,0.083008,0,0.066223,0.77832,0.073914,15.398438,0.363037,61.0,35.6875,0.011703,2.626953,20241024_171133_B1024,20241024,Glia_Genes_Screen,1,0,N2,N2,N2,1
15066,607,15,12,0.061188,0.060089,0.0,0,0.065613,0.777344,0.073059,14.703125,0.362061,60.1875,35.3125,0.012497,2.628906,20241024_171133_B1024,20241024,Glia_Genes_Screen,1,0,N2,N2,N2,1
15067,607,15,12,0.059204,0.063721,0.0,0,0.063599,0.770996,0.071655,13.203125,0.370117,59.09375,34.8125,0.012299,2.628906,20241024_171133_B1024,20241024,Glia_Genes_Screen,1,0,N2,N2,N2,1
15068,607,15,12,0.055695,0.068115,0.083008,0,0.06311,0.766602,0.071472,11.203125,0.364014,59.6875,34.90625,0.009201,2.628906,20241024_171133_B1024,20241024,Glia_Genes_Screen,1,0,N2,N2,N2,1
15069,607,15,12,0.055786,0.070496,0.083008,0,0.065613,0.786621,0.073547,8.796875,0.360107,64.375,35.59375,0.007198,2.630859,20241024_171133_B1024,20241024,Glia_Genes_Screen,1,0,N2,N2,N2,1
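
The window selection can be followed on toy data with one row per second. Because `Time` was rounded to whole seconds earlier, the 7 s to 9.5 s window picks up the rows at 7, 8, and 9 s after each tap. The tap times here are assumed values.

```python
import pandas as pd

# Toy experiment: one row per second, taps at 600 s and 610 s
seconds = list(range(600, 622))
df = pd.DataFrame({
    "Experiment": 1,
    "Time": seconds,
    "Tap": [1 if t in (600, 610) else 0 for t in seconds],
})

windows = []
for exp, group in df.groupby("Experiment"):
    tap_times = group.loc[group["Tap"] == 1, "Time"]
    for t in tap_times:
        # Rows between 7 s and 9.5 s after this tap
        windows.append(group[(group["Time"] >= t + 7) & (group["Time"] <= t + 9.5)])

result = pd.concat(windows)
print(result["Time"].tolist())  # [607, 608, 609, 617, 618, 619]
```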


In [23]:
# Aggregate columns by "Experiment" + "Tap_num" by taking their means

Post_stimulus_data = Post_stimulus_data.groupby(['Experiment', 'Tap_num','Screen','Date','Plate_id','Gene','Allele','dataset', "plate"]).agg({
    'Time': 'min', # take minimum value of time instead of mean
    'n': 'mean',
    'Number': 'mean',
    'Instantaneous Speed': 'mean',
    'Interval Speed' : 'mean',
    'Bias': 'mean',
    'Tap': 'mean',
    'Morphwidth': 'mean',
    'Midline': 'mean',
    'Area': 'mean',
    'Angular Speed': 'mean',
    'Aspect Ratio': 'mean',
    'Kink': 'mean',
    'Curve': 'mean',
    'Crab': 'mean',
    'Pathlength': 'mean'
})

Post_stimulus_data = Post_stimulus_data.reset_index()

Post_stimulus_data

Unnamed: 0,Experiment,Tap_num,Screen,Date,Plate_id,Gene,Allele,dataset,plate,Time,n,Number,Instantaneous Speed,Interval Speed,Bias,Tap,Morphwidth,Midline,Area,Angular Speed,Aspect Ratio,Kink,Curve,Crab,Pathlength
0,1,1,Glia_Genes_Screen,20241024,20241024_171133_B1024,N2,N2,N2,0,607,15.000000,12.000000,0.058951,0.058327,0.256647,0.0,0.063646,0.774335,0.072058,6.153131,0.326810,66.139114,34.680946,0.008325,2.666646
1,1,2,Glia_Genes_Screen,20241024,20241024_171133_B1024,N2,N2,N2,0,617,15.984127,12.000000,0.095949,0.080637,0.552998,0.0,0.069399,0.790535,0.077552,13.373078,0.399368,71.857635,36.909225,0.013644,2.884301
2,1,3,Glia_Genes_Screen,20241024,20241024_171133_B1024,N2,N2,N2,0,627,17.000000,13.467742,0.108341,0.068823,0.586418,0.0,0.075289,0.814493,0.083328,9.304876,0.346762,73.954636,30.787550,0.012437,2.725775
3,1,4,Glia_Genes_Screen,20241024,20241024_171133_B1024,N2,N2,N2,0,637,20.000000,17.000000,0.112061,0.064840,0.828928,0.0,0.081280,0.815705,0.085755,7.869519,0.297792,50.762096,31.686745,0.012143,2.732737
4,1,5,Glia_Genes_Screen,20241024,20241024_171133_B1024,N2,N2,N2,0,647,20.528302,19.000000,0.095652,0.054642,0.808170,0.0,0.078201,0.848771,0.088958,5.326356,0.269462,50.684551,30.131191,0.007745,3.019273
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9837,318,26,Glia_Genes_Screen,20250313,20250313_220015_C0313,ztf-16,ok1916,ztf-16_ok1916,0,867,86.400000,68.542857,0.121849,0.069592,0.824344,0.0,0.109338,0.947949,0.123141,7.622991,0.287667,43.341072,29.637947,0.012086,5.264286
9838,318,27,Glia_Genes_Screen,20250313,20250313_220015_C0313,ztf-16,ok1916,ztf-16_ok1916,0,877,80.150000,59.150000,0.114047,0.067533,0.772021,0.0,0.110248,0.939062,0.122620,7.770215,0.298352,46.813282,30.241796,0.011915,5.338965
9839,318,28,Glia_Genes_Screen,20250313,20250313_220015_C0313,ztf-16,ok1916,ztf-16_ok1916,0,887,79.463415,52.560976,0.120032,0.071334,0.756407,0.0,0.109735,0.939132,0.123718,8.224466,0.302889,43.856709,30.286966,0.012758,5.445217
9840,318,29,Glia_Genes_Screen,20250313,20250313_220015_C0313,ztf-16,ok1916,ztf-16_ok1916,0,897,90.636364,62.212121,0.111446,0.067183,0.787243,0.0,0.110977,0.944440,0.124626,6.387547,0.290638,42.818180,29.170929,0.011620,4.668324
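
The aggregation above collapses each post-tap window into one row. A reduced sketch with two columns and toy values shows the mix of `min` for `Time` and `mean` for a measurement:

```python
import pandas as pd

# Toy windows (assumed values): two taps, two rows each
df = pd.DataFrame({
    "Experiment": [1, 1, 1, 1],
    "Tap_num": [1, 1, 2, 2],
    "Time": [607, 608, 617, 618],
    "Instantaneous Speed": [0.05, 0.07, 0.10, 0.12],
})

# One row per experiment and tap: earliest time in the window, mean speed
summary = (
    df.groupby(["Experiment", "Tap_num"])
      .agg({"Time": "min", "Instantaneous Speed": "mean"})
      .reset_index()
)

print(summary)
```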


In [25]:
print('done step 7')

done step 7


# Save dataframe as `.csv`

In [26]:
Baseline_data.to_csv(f"{Screen}_baseline_output.csv")
print('saved Baseline data as .csv!')

saved Baseline data as .csv!


In [27]:
Post_stimulus_data.to_csv(f"{Screen}_post_stimulus.csv")
print('saved Post stimulus data as .csv!')

saved Post stimulus data as .csv!
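
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `pd.read_csv`. The sketch below demonstrates this with an in-memory buffer and toy values. Whether the index is needed downstream is not stated in this notebook, so `index=False` is shown only as an option.

```python
import io
import pandas as pd

df = pd.DataFrame({"Time": [490.0, 490.5], "n": [14, 14]})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the row index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Time', 'n']
print(cols_without_index)  # ['Time', 'n']
```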


# Done!