# Jupyter Notebook UI to generate TAP data for MWT_Dashboards!

### Beginner Essentials:
1. Press Shift-Enter to run each cell. After a cell runs, you should see the output "done step #". If not, an error has occurred.
2. When inputting your own code/revising the code, make sure you close all your quotation marks '' and brackets (), [], {}.
3. Don't leave any commas (,) hanging! Make sure an object always follows a comma. If there is nothing after a comma, remove the comma!
4. Learning to code? Each line of code is annotated to help you understand how this code works!

**Run all cells/steps sequentially, even the ones that do not need input**


## Step-by-Step Analysis of the Jupyter Notebook

| Step | Purpose | Key Actions |
|------|---------|-------------|
| **1. Import Packages** | Load required Python libraries for data analysis | Imports `pandas`, `numpy`, `matplotlib`, etc. | 
| **2. Pick Filepath** | User input: select folder containing `.trv` files | Uses `FileChooser` widget to select directory | 
| **3. User-Defined Variables** | Set experiment parameters | Defines: `number_of_taps`, `ISI`,`first_tap`; Calculates `tolerances` (time windows for taps) | 
| **4. Construct Filelist** | Find all `.trv` files in selected folder | Sets working directory and scans `folder_path` with `os.walk`; displays the number of `.trv` files found in the folder |
| **5. Process Data Function** | Define functions to load and clean `.trv` data for use in step 6 | - `ProcessData()`: Loads files, calculates metrics (reversal probability, speed)<br>- `assign_taps()`: Labels data with tap numbers <br>- `insert_plates()`: Numbers plates by counting rows where the tap number is 1 |
| **6. Process Data** | Apply processing to all strains | - 6.1: Reads unique strain names (e.g., "N2") from the folder names in `filelist` <br>- 6.2: Runs `ProcessData()`, `assign_taps()` and `insert_plates()` for each strain | 
| **7. Grouping & Naming** | Combine data from all strains | - Concatenates DataFrames<br>- Assigns dataset names (e.g., "N2") | 
| **Output CSV** | Save processed data | Exports `TotalConcatenated` to CSV (e.g., `PD_Screen_tap_output.csv`) |

### Key Notes:
- User Input Required: Steps 2 (file selection), 3 (parameters), 6.1 (strain verification)
- Output: Final CSV contains all analyzed tap response data

## 1. Importing Packages Required (No input required, just run)

In [1]:
import pandas as pd #<- package used to import and organize data
import numpy as np #<- package used to import and organize data
import math
import os #<- package used to work with system file paths
import seaborn as sns #<- package used to plot graphs
from matplotlib import pyplot as plt #<- another package used to plot graphs
from itertools import cycle #<- package used to iterate down rows (used in step 5 to add tap column)
import ipywidgets as widgets #<- widget tool to generate button and tab for graphs
from IPython.display import display #<- displays widgets
from ipyfilechooser import FileChooser
# from tkinter import Tk, filedialog #<- Tkinter is a GUI package
print("done step 1")

done step 1


## 2. Pick filepath (just run and click button from output)

Run the following cell and click the button 'Select Folder' to pick a filepath.

**Important: Later on, this script matches each strain keyword against the full file path of every file to import and group data. That means if the folder you select has a strain name in its own name, the grouping will not work.**

(e.g., if your folder has "N2" in its name, this script sees all files inside this folder as having the "N2" search key)

**An easy fix is to rename your folder to something else (write your strain names in lower case, or just use the date)**
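The pitfall above comes from testing whether the keyword appears anywhere in the full path. Here is a minimal sketch with made-up paths (the folder and file names are hypothetical, not taken from a real dataset):

```python
# Hypothetical paths: the parent folder name happens to contain "N2".
paths = [
    "/data/N2_screen_2022/N2/20220815_101538/N2_10x2_B0811ab.trv",
    "/data/N2_screen_2022/geneX/20220815_102652/geneX_10x2_A0811aa.trv",
]

# Substring match on the whole path, the same kind of test used later on.
matches = [p for p in paths if "N2" in p]
print(len(matches))  # both files match, including the geneX plate

# After renaming the parent folder, only the real N2 file matches.
renamed = [p.replace("N2_screen_2022", "screen_2022") for p in paths]
matches_renamed = [p for p in renamed if "N2" in p]
print(len(matches_renamed))
```

Renaming the parent folder is the smallest change that removes the false match, which is why it is the fix suggested here.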

In [2]:
starting_directory = '/Volumes'
chooser = FileChooser(starting_directory)
display(chooser)

FileChooser(path='/Volumes', filename='', title='', show_hidden=False, select_desc='Select', change_desc='Chan…

In [9]:
print(chooser.selected_path)
folder_path=chooser.selected_path

/Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2


In [10]:
screens = ['PD_Screen', 'ASD_Screen', 'G-Proteins_Screen', 'Glia_Genes_Screen', 'Neuron_Genes_Screen']

screen_chooser = widgets.Select(options=screens, value=screens[0], description='Screen:')
display(screen_chooser)

Select(description='Screen:', options=('PD_Screen', 'ASD_Screen', 'G-Proteins_Screen', 'Glia_Genes_Screen', 'N…

In [11]:
Screen=screen_chooser.value

## 3. User-Defined Variables (Add input here)

Here, we add some constants to help you blaze through this code.

3.1: Number of taps: how many taps does your experiment have at the regular ISI? Enter that number (N). The code below adds one extra tap window after the series, giving N+1 taps in total.


3.2: Change your ISI number. This will be reflected in the name/title of the output figure.


**Note:** If you have different ISIs in the same folder, come back and change this value when you graph the second set of data with the other ISI (data with the same ISI are generally graphed together). If you change the ISI mid-analysis, rerun this cell and then skip straight to step 8.

In [12]:
# 3.1 Input
number_of_taps = 30 # Taps in your experiment (N)

# 3.2 Input
ISI = 10  # ISI in your experiment
first_tap = 600 # when is your first tap? check your TRV files

In [13]:
# Here, open up one of the trv files to determine the times for each of these taps. 

# Record number of taps (N+1), e.g., if number_of_taps = 30, taps = [1, 2, 3, ..., 31]
taps = np.arange(1, number_of_taps+2).tolist()

# Assign a tolerance window (tap time +/- 2 s, one tap every ISI) to each of the N taps
lower = np.arange(first_tap-2, first_tap-2+(number_of_taps*ISI), ISI) # (start, stop (exclusive), step)
upper = np.arange(first_tap+2, first_tap+2+(number_of_taps*ISI), ISI) # (start, stop (exclusive), step)
tolerances = [(int(l), int(u)) for l, u in zip(lower, upper)]
tolerances.append((1188,1191)) # (N+1)th tap: hard-coded window, change it if your last tap is at a different time

# Display taps with tolerances 
for i in taps:
    print(f"Tap {i}, tolerance: {tolerances[i-1]}")

print("done step 3")

Tap 1, tolerance: (598, 602)
Tap 2, tolerance: (608, 612)
Tap 3, tolerance: (618, 622)
Tap 4, tolerance: (628, 632)
Tap 5, tolerance: (638, 642)
Tap 6, tolerance: (648, 652)
Tap 7, tolerance: (658, 662)
Tap 8, tolerance: (668, 672)
Tap 9, tolerance: (678, 682)
Tap 10, tolerance: (688, 692)
Tap 11, tolerance: (698, 702)
Tap 12, tolerance: (708, 712)
Tap 13, tolerance: (718, 722)
Tap 14, tolerance: (728, 732)
Tap 15, tolerance: (738, 742)
Tap 16, tolerance: (748, 752)
Tap 17, tolerance: (758, 762)
Tap 18, tolerance: (768, 772)
Tap 19, tolerance: (778, 782)
Tap 20, tolerance: (788, 792)
Tap 21, tolerance: (798, 802)
Tap 22, tolerance: (808, 812)
Tap 23, tolerance: (818, 822)
Tap 24, tolerance: (828, 832)
Tap 25, tolerance: (838, 842)
Tap 26, tolerance: (848, 852)
Tap 27, tolerance: (858, 862)
Tap 28, tolerance: (868, 872)
Tap 29, tolerance: (878, 882)
Tap 30, tolerance: (888, 892)
Tap 31, tolerance: (1188, 1191)
done step 3
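The window for the last tap is typed in by hand in the cell above, so it has to be edited whenever `first_tap`, `ISI` or `number_of_taps` changes. Below is a minimal sketch of deriving every window from parameters instead. The names `build_tolerances`, `recovery_delay` and `half_window` are hypothetical, and the 300 s delay is only inferred from the example output (last regular tap near 890 s, final tap near 1190 s). Note that a symmetric window gives `(1188, 1192)` rather than the `(1188, 1191)` used above.

```python
def build_tolerances(first_tap, isi, number_of_taps, recovery_delay, half_window=2):
    """Return (lower, upper) time windows for N regular taps plus one later tap."""
    tap_times = [first_tap + i * isi for i in range(number_of_taps)]
    tap_times.append(tap_times[-1] + recovery_delay)  # the (N+1)th tap
    return [(t - half_window, t + half_window) for t in tap_times]


# Assumed values matching the example experiment in this notebook.
windows = build_tolerances(first_tap=600, isi=10, number_of_taps=30, recovery_delay=300)
print(len(windows))
print(windows[0], windows[29], windows[30])
```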


## 4. Constructing Filelist From Source File/Select File (Just run)

In [14]:
#folder_path = '/Users/Joseph/Desktop/AVR14_10sISI' #- manual folder path if the file chooser is acting up

os.chdir(folder_path) #<- setting your working directory so that your images will be saved here

filelist = list() #<- empty list
for root, dirs, files in os.walk(folder_path): #<- this for loop goes through your folder 
    for name in files:
        if name.endswith('.trv'): # filters files with .trv (file that contains your data)
            filepath = os.path.join(root, name) #<- Notes down the file path of each data file
            filelist.append(filepath) #<- saves it into the list

if not filelist:
    raise FileNotFoundError("No .trv files found in the selected folder!")
else:
    print(f"Number of .trv files to process: {len(filelist)}")
    # print(f"Example of first and last file saved: {filelist[0]}, {filelist[-1]}") 

print('done step 4')

Number of .trv files to process: 5
done step 4
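The scan in step 4 keeps every file ending in `.trv`, including the `._` sidecar files that macOS can leave on external drives; those are only skipped later, inside `ProcessData()`. As an alternative sketch (not the notebook's method), `pathlib` can filter them out during the scan. The folder layout below is created in a temporary directory purely for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: strain folder / run folder / data files.
    plate_dir = Path(tmp) / "N2" / "20220815_101538"
    plate_dir.mkdir(parents=True)
    (plate_dir / "N2_10x2_B0811ab.trv").write_text("")
    (plate_dir / "._N2_10x2_B0811ab.trv").write_text("")  # sidecar file
    (plate_dir / "notes.txt").write_text("")

    # Recursive search for .trv files, dropping names that start with "._".
    found = sorted(
        str(p) for p in Path(tmp).rglob("*.trv") if not p.name.startswith("._")
    )
    print(len(found))
```

Filtering at this stage would also keep the reported file count in line with the number of plates that are actually loaded.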


## 5. Process Data Function (Just Run)

In [51]:
def ProcessData(strain): 
    """
    Filters and processes .trv files matching the given strain.

    Parameters: 
        strain (str): keyword to match in the files

    Returns:
        dict: 'N' (number of plates), 'Confirm' (DataFrame with all columns) and 'Final' (DataFrame with the columns "time", "dura", "dist", "prob", "speed", "plate", "Date", "Plate_id", "Screen")

    """
    strain_filelist = [x for x in filelist if strain in x] #<- goes through the list and filters for keyword
    Strain_N = len(strain_filelist) # Finds the number of plates per strain
    if Strain_N == 0:
        raise AssertionError ('{} is not a good identifier as number of plates = 0'.format(strain))
    else:
        print(f'Strain {strain}')
        print(f'Number of plates: {Strain_N}')

        # visiting files in this strain (strain_filelist was already filtered above)
        df_list=[]
        for file in strain_filelist:
            if file.split('/')[-1].startswith('._'):
                pass
            else:
                print(f"File: {file}")
                df= pd.read_csv(file, sep=' ', header = None, encoding_errors='ignore')
                df['Plate_id'] = file.split('/')[-1].split('_')[-1].split('.')[0]
                df['Date'] = file.split('/')[-2].split('_')[0]
                df['Screen'] = file.split('/')[-4]
                df_list.append(df)
        DF_Total = pd.concat(df_list, ignore_index = True)

    # for f in strain_filelist:
    #     DF_Total = pd.concat(pd.read_csv(f, sep=' ', skiprows = 4, header = None))
    #     DF_Total = pd.concat([pd.read_csv(f, sep=' ', header = None, encoding_errors='ignore') for f in strain_filelist],
    #                   ignore_index=True) #<- imports your data files
    #     DF_Total = DF_Total.dropna(axis = 1) #<- cleans your data
        
        # column names for trv files
        DF_Total = DF_Total.rename( #<- more cleaning
                    {0:'time',
                    2:'rev_before',
                    3:'no_rev',
                    4:'stim_rev',
                    7:'dist',
                    8:'dist_std',
                    9:'dist_stderr',
                    11:'dist_0th',
                    12:'dist_1st',
                    13:'dist_2nd',
                    14:'dist_3rd',
                    15:'dist_100th',
                    18:'dura',
                    19:'dura_std',
                    20:'dura_stderr',
                    22:'dura_0th',
                    23:'dura_1st',
                    24:'dura_2nd',
                    25:'dura_3rd',
                    26:'dura_100th'}, axis=1)
        
        # check function here for NaN Columns
        DF_Total['plate'] = 0

        # Calculate reversal probability 
        DF_Total['prob'] = DF_Total['stim_rev']/ (DF_Total['no_rev'] + DF_Total['stim_rev']) 

        # Calculate speed
        DF_Total['speed'] = DF_Total['dist']/DF_Total['dura']

        DF_Total_rows = int(DF_Total.shape[0])
        print(f'This strain/treatment has {DF_Total_rows} total taps') # Total rows across all plates; expect plates x (number_of_taps + 1). Check if you are missing taps!

        DF_Final = DF_Total[["time", "dura", "dist", "prob", "speed", "plate", "Date","Plate_id","Screen"]].copy()

        print("---------------------------------------------------------------------------------------------------------------------------------------------------------------------------")

    return{
            'N': Strain_N,
            'Confirm':DF_Total,
            'Final': DF_Final}



def assign_taps(df, tolerances):
    """
    Assigns tap number to each row in the DataFrame based on time tolerances.

    Parameters:
        df (pd.DataFrame): The DataFrame to modify
        tolerances (list of tuples): Each tuple is (lower, upper) time range

    Returns:
        None: df is modified in place (a 'taps' column is added)
    """
    df['taps'] = np.nan
    for tap_index, tolerance in enumerate(tolerances): # e.g. [(598, 602), (608, 612), ...]
        tap_lower, tap_upper = tolerance
        TimesInTapRange = df['time'].between(tap_lower, tap_upper, inclusive="both")
        df.loc[TimesInTapRange, 'taps'] = tap_index + 1 # tap numbers start at 1; rows outside every window stay NaN



    
def insert_plates(df):   
    """
    Numbers the plates in place: a new plate starts at every row where taps == 1.
    
    Parameters:
        df (pd.DataFrame): DataFrame that already has a 'taps' column
    
    Returns: 
        None: df is modified in place (the 'plate' column is overwritten)
    """
    df['plate']=(df['taps'] ==1).cumsum()


            
print('done step 5')

done step 5
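To see what `assign_taps()` and `insert_plates()` do, here is the same logic applied to a small made-up table (the times and the two tolerance windows are illustrative values only):

```python
import numpy as np
import pandas as pd

# Toy data: two plates, each with taps near 600 s and 610 s, plus one stray row.
toy = pd.DataFrame({"time": [599.9, 610.1, 650.0, 600.2, 609.8]})
toy_tolerances = [(598, 602), (608, 612)]

# Same idea as assign_taps(): label rows whose time falls inside a window.
toy["taps"] = np.nan
for tap_number, (lower, upper) in enumerate(toy_tolerances, start=1):
    in_window = toy["time"].between(lower, upper, inclusive="both")
    toy.loc[in_window, "taps"] = tap_number

# Same idea as insert_plates(): a new plate starts at every row labelled tap 1.
toy["plate"] = (toy["taps"] == 1).cumsum()
print(toy)
```

The stray row at 650 s keeps `NaN` in `taps` because it falls in no window. One consequence of the cumulative count is that a plate with no row inside the tap 1 window would be numbered together with the plate before it, so the plate counts are worth checking against the number of files.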


## 6.1 Process Data (PLEASE READ, Add input here)

This is the hardest part - from your naming convention, pick a unique identifier for each group.

This means the identifier should appear in every file name for that strain and in no files of any other strain. If you followed a consistent naming convention, this should be easy.

**Be careful and look closely at your naming structure. You want an identifier that is unique across the entire file path for each group of files. An easy mistake is to have the strain name in the overall folder name: if you then use the strain name as a keyword, it will match all files in that folder!**

For example, if all your N2 files share a pattern like "N2_5x4", as in the following path:
'/Users/Joseph/Desktop/AVR14_10sISI_TapHab_0710_2019/N2/20190710_141740/N2_5x4_f94h20c_100s30x10s10s_C0710ab.trv'
then set that identifier as the strain keyword:
Strain_1 = 'N2_5x4'

#### Depending on how many strains you are running for comparison, you may need to add/delete some lines!

* You are not naming your data groups here, we have a step for that later!
* Here, you want to note down ALL the strains you have in the folder
* If you have just 2 strains, add hashtags (#) in front of the lines you do not need.
* If you need more strains, just add more lines, following the same format!
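The code in this notebook reads the strain, date and plate ID from fixed positions in each file path by splitting on `/`. The sketch below does the same with `pathlib` on the example path above, which makes it easier to see which folder level supplies which label. It assumes the strain folder / run folder / file layout shown in that path.

```python
from pathlib import PurePosixPath

path = PurePosixPath(
    "/Users/Joseph/Desktop/AVR14_10sISI_TapHab_0710_2019/N2/"
    "20190710_141740/N2_5x4_f94h20c_100s30x10s10s_C0710ab.trv"
)

genotype = path.parts[-3]             # strain folder
date = path.parts[-2].split("_")[0]   # first part of the run folder name
plate_id = path.stem.split("_")[-1]   # last part of the file name, extension removed
print(genotype, date, plate_id)
```

If your folders are nested differently, these positions shift, so it is worth printing them for one file before processing everything.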

In [None]:
genotype=[]
for f in filelist:
    genotype.append(f.split('/')[-3])

genotypes=np.unique(genotype)

StrainNames=dict(enumerate(genotypes,1))

print(f"Number of genotypes/strains in the experiment: {len(genotypes)}")

# Display the first 5 Strain names in the experiment
for k in list(StrainNames)[:5]:
    print(f"{k}: {StrainNames[k]}")

Number of genotypes in the experiment: 1
1: N2


In [53]:
print(list(StrainNames.values()))
print(type(list(StrainNames.values())))
print(type(list(StrainNames.values())[0]))
print('done step 6.1')

[np.str_('N2')]
<class 'list'>
<class 'numpy.str_'>
done step 6.1


## 6.2 Process Data (just run this cell)

In [54]:
# with open('/Volumes/JOSEPH/PD_Screen/F53B2.5_ok226/20220524_141642/ZE1_10x2_f72h20C_600s31x10s10s_B0520ba.trv') as f:
#     print(f)

In [None]:
# import threading

DataLists = [0]  # placeholder at index 0 so that strain numbering starts at 1:
                 # DataLists[1] is the first strain, DataLists[2] the second, and so on



# the loop below goes through the dictionary in step 6.1 and processes the data
for s in list(StrainNames.values()):
    if s != '':
        # threading.Thread(target=DataLists.append(ProcessData(s)['Final'])).start()
        DataLists.append(ProcessData(s)['Final']) # appends all data into a list of dataframes



# the loop below assigns taps and plates to the processed data
for df in DataLists[1:]: 
    assign_taps(df, tolerances)
    insert_plates(df)



print('done step 6.2')

Strain N2
Number of plates: 5
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_101538/N2_10x2_f72h20C_600s31x10s10s_B0811ab.trv
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_102652/N2_10x2_f96h20C_600s31x10s10s_A0811aa.trv
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_122801/N2_10x2_f96h20C_600s31x10s10s_A0811ad.trv
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_121502/N2_10x2_f72h20C_600s31x10s10s_B0811ae.trv
File: /Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022/N2/20220815_103433/N2_10x2_f72h20C_600s31x10s10s_C0811ac.trv
This strain/treatment has 155 total taps
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
done step 6.2


In [56]:
# DataLists.append(ProcessData('vps-35_ok1880')['Final'])

In [64]:
# Let's take a look at the processed data for the first strain (displaying first 5 rows only)
print(DataLists[1].head())
print("")
print(f"Shape of the first dataframe (Strain_1): {DataLists[1].shape}")

      time  dura   dist      prob     speed  plate      Date Plate_id  \
0  599.985  2.83  0.696  0.928571  0.245936      1  20220815  B0811ab   
1  609.993  2.98  0.746  0.857143  0.250336      1  20220815  B0811ab   
2  619.699  1.97  0.536  0.800000  0.272081      1  20220815  B0811ab   
3  629.956  2.57  0.686  0.900000  0.266926      1  20220815  B0811ab   
4  639.957  1.34  0.383  0.909091  0.285821      1  20220815  B0811ab   

                          Screen  taps  
0  PDScreen_TapHab_August15_2022   1.0  
1  PDScreen_TapHab_August15_2022   2.0  
2  PDScreen_TapHab_August15_2022   3.0  
3  PDScreen_TapHab_August15_2022   4.0  
4  PDScreen_TapHab_August15_2022   5.0  

Shape of the first dataframe (Strain_1): (155, 10)
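The `speed` and `prob` columns are simple ratios, so they can be checked by hand against the table above. In this small sketch the `dist`, `dura` and `speed` values are copied from the first displayed row; the reversal counts are hypothetical, chosen only to illustrate the formula used in `ProcessData()`.

```python
# Values copied from the first displayed row.
dist, dura, shown_speed = 0.696, 2.83, 0.245936
speed = dist / dura
print(round(speed, 6))

# Reversal probability: stim_rev / (no_rev + stim_rev).
# Hypothetical counts, for illustration only.
stim_rev, no_rev = 13, 1
prob = stim_rev / (no_rev + stim_rev)
print(round(prob, 6))
```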


## 7. Grouping Data and Naming (Optional: Add input here)

Here, you get to name your data groups/strains! Put the name you want between the quotation marks for each strain.

For example: if your Strain1 is N2 and you want the group to be called N2,
your line should look like:

DataLists[x].assign(dataset = 'N2')

## Go back to step 6.1 to check which strain is which item on the DataLists.
In this example, the first item on DataLists is N2.


## Remember: Put your name in quotes. (ex: 'N2' and not N2)

As default, the names are set to the unique identifier labels.

## Depending on the number of strains you are comparing, you may have to delete/add lines of code (following the same format). 
## Remember to add/delete commas too.

# If you want to change your groups, you do that here. 
For example, if you have 5 strains in your folder but only want to compare between 2 or 3 strains, designate that here and follow through with steps 6 and 7. Once you are done, come back to step 6 and change your groups again (You are going to have to change your graph titles for the second run-through though)!
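As a small illustration of the naming step, the sketch below labels two made-up tables with `assign(dataset=...)` and stacks them. The strain names and values are hypothetical stand-ins for the processed DataFrames.

```python
import pandas as pd

# Hypothetical stand-ins for two processed strain tables.
frames = {
    "N2": pd.DataFrame({"time": [600.0, 610.0], "prob": [0.9, 0.8]}),
    "geneX_a1": pd.DataFrame({"time": [600.1, 610.2], "prob": [0.7, 0.5]}),
}

# Add a 'dataset' column holding the group name, then combine all groups.
labelled = [df.assign(dataset=name) for name, df in frames.items()]
combined = pd.concat(labelled, ignore_index=True)
print(combined)
```

Here `ignore_index=True` gives every row its own label. That is a choice made for this sketch; without it, each group keeps its original row labels, so they repeat across groups.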

In [58]:
# Label each strain's DataFrame with its name, then stack them in one call.
# Row labels are kept as they are, so they repeat across strains.
TotalConcatenated = pd.concat(
    [DataLists[d].assign(dataset=StrainNames.get(d)) for d in range(1, len(genotypes) + 1)])

print(TotalConcatenated)

#if TotalConcatenated["taps"].loc[ind] is not 1:
#   TotalConcatenated["taps"].loc[ind:indices[c+1]] = list(range(1,len(TotalConcatenated["taps"].loc[ind:indices[c+1]])+1))
# missing_taps(TotalConcatenated, accurate_taps, tolerances)

print('done step 7')

         time  dura   dist      prob     speed  plate      Date Plate_id  \
0     599.985  2.83  0.696  0.928571  0.245936      1  20220815  B0811ab   
1     609.993  2.98  0.746  0.857143  0.250336      1  20220815  B0811ab   
2     619.699  1.97  0.536  0.800000  0.272081      1  20220815  B0811ab   
3     629.956  2.57  0.686  0.900000  0.266926      1  20220815  B0811ab   
4     639.957  1.34  0.383  0.909091  0.285821      1  20220815  B0811ab   
..        ...   ...    ...       ...       ...    ...       ...      ...   
150   859.968  1.35  0.309  0.315789  0.228889      5  20220815  C0811ac   
151   869.969  1.16  0.288  0.357143  0.248276      5  20220815  C0811ac   
152   879.969  2.01  0.533  0.444444  0.265174      5  20220815  C0811ac   
153   889.967  0.87  0.202  0.342857  0.232184      5  20220815  C0811ac   
154  1189.966  2.01  0.561  0.607143  0.279104      5  20220815  C0811ac   

                            Screen  taps dataset  
0    PDScreen_TapHab_August15_2022  

## Setting Colour Palette - Only run the below cell ONCE

The following code sets the colour palette for the whole experiment and then designates one colour for each strain. After this, if you take away some strains as you are graphing, the colours will still match accordingly.

In [59]:
# print(TotalConcatenated['dataset'].str.split("_", n=1, expand=True))
TotalConcatenated['Screen']=Screen
print(TotalConcatenated)

         time  dura   dist      prob     speed  plate      Date Plate_id  \
0     599.985  2.83  0.696  0.928571  0.245936      1  20220815  B0811ab   
1     609.993  2.98  0.746  0.857143  0.250336      1  20220815  B0811ab   
2     619.699  1.97  0.536  0.800000  0.272081      1  20220815  B0811ab   
3     629.956  2.57  0.686  0.900000  0.266926      1  20220815  B0811ab   
4     639.957  1.34  0.383  0.909091  0.285821      1  20220815  B0811ab   
..        ...   ...    ...       ...       ...    ...       ...      ...   
150   859.968  1.35  0.309  0.315789  0.228889      5  20220815  C0811ac   
151   869.969  1.16  0.288  0.357143  0.248276      5  20220815  C0811ac   
152   879.969  2.01  0.533  0.444444  0.265174      5  20220815  C0811ac   
153   889.967  0.87  0.202  0.342857  0.232184      5  20220815  C0811ac   
154  1189.966  2.01  0.561  0.607143  0.279104      5  20220815  C0811ac   

        Screen  taps dataset  
0    PD_Screen   1.0      N2  
1    PD_Screen   2.0     

In [60]:
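# Split "Gene_Allele" labels into two columns. reindex() guarantees two columns even
# when no label contains an underscore (e.g. only 'N2'); without it the assignment
# fails with "ValueError: Columns must be same length as key".
split_names = TotalConcatenated['dataset'].str.split('_', n=1, expand=True).reindex(columns=[0, 1])
TotalConcatenated[['Gene', 'Allele']] = split_names
print(TotalConcatenated)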
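`str.split(..., expand=True)` returns only as many columns as the longest split produces. A `dataset` column that contains only labels without an underscore (such as `N2`) therefore yields a single column, which cannot be assigned to the two targets `Gene` and `Allele`. This is a minimal sketch of that behaviour and one way to force two columns, using a made-up Series:

```python
import pandas as pd

labels = pd.Series(["N2", "N2"])  # no underscore anywhere

pieces = labels.str.split("_", n=1, expand=True)
print(pieces.shape)  # a single column, so a two-column assignment would fail

# Force two columns: the missing second piece becomes NaN.
pieces = pieces.reindex(columns=[0, 1])
pieces.columns = ["Gene", "Allele"]
print(pieces)
```

The `NaN` in `Allele` can then be filled with a default label, as the next cell does with `fillna('N2')`.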

In [None]:
TotalConcatenated['Allele']=TotalConcatenated['Allele'].fillna('N2')

In [None]:
TotalConcatenated=TotalConcatenated.dropna()

In [None]:
TotalConcatenated.to_csv(f"{TotalConcatenated['Screen'].iloc[0]}_tap_output.csv")
print('done')
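The output file name is built from the first value in the `Screen` column. The sketch below shows that lookup by position with `.iloc[0]`, which works whatever the row labels are, followed by a quick write-and-read check. The table is a made-up stand-in, and the CSV is written to an in-memory buffer rather than to disk.

```python
import io

import pandas as pd

# Hypothetical two-row table with the columns needed for this illustration.
table = pd.DataFrame({"taps": [1.0, 2.0], "Screen": ["PD_Screen", "PD_Screen"]})

# Position-based lookup of the first Screen value.
filename = f"{table['Screen'].iloc[0]}_tap_output.csv"
print(filename)

# Write to an in-memory buffer, then read it back to check the contents.
buffer = io.StringIO()
table.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.shape)
```

`index=False` is used here so the row labels are not written as an extra column; whether you want that in the real export depends on how the CSV is read afterwards.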


In [None]:
# A debugging cell to test for strain 'XJ1' (which is the old N2)
print(TotalConcatenated[TotalConcatenated['Allele']=='XJ1'])

# Done!