# Jupyter Notebook UI to analyze baseline data from tap-habituation experiments!

Version 1.3 - Joseph Liang, Rankin Lab
Updated:
1. Upgraded folder path selection application
2. Upgraded dataset management (fewer moving parts for the end user)
3. Output changed from tif to png

## Known bug: In Step 2, an empty window is displayed on Mac. This may also apply to Linux/Windows.

### Beginner Essentials:
1. Press Shift-Enter to run each cell. After it runs, you should see the output "done step #". If not, an error has occurred.
2. When inputting your own code or revising the code, make sure you close all your quotation marks '' and brackets (), [], {}.
3. Don't leave any commas (,) hanging! Make sure an object always follows a comma. If there is nothing after a comma, remove the comma!
4. Learning to code? Each line of code is annotated to help you understand how this code works!

**Run all cells/steps sequentially, even the ones that do not need input**

# 1. Importing Packages Required (No input required, just run)

In [1]:
import pandas as pd #<- package used to import and organize data
import numpy as np #<- package used to import and organize data
import seaborn as sns #<- package used to plot graphs
from matplotlib import pyplot as plt #<- package used to plot graphs
import os #<- package used to work with system filepaths
from ipywidgets import widgets #<- widget tool to generate button
from IPython.display import display #<- displays button
from ipyfilechooser import FileChooser #<- folder picker widget used in step 2
# from tkinter import Tk, filedialog #<- Tkinter is a GUI package
from tqdm.notebook import tqdm #<- progress bar for long loops
# import dask.dataframe as dd
# import pingouin as pg
pd.set_option('display.max_columns', 50)
print("done step 1")

done step 1


# 2. Pick filepath (just run and click button from output)

Run the following cell and click the 'Select' button to pick a folder.

**Important: Later on, this script searches the full file path of each file to import and group data. That means if the name of your top-level folder contains a strain name, the script will not work correctly.**

(ex. if your folder has "N2" in its name, this script treats every file inside that folder as matching the "N2" search key)

**An easy fix is to rename your folder to something else (write your strain names in lower-case, or just use the date)**
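To see why the folder name matters, here is a minimal sketch of the substring test this notebook applies later. The paths and the `match` helper are made up for illustration; only the `key in path` check mirrors the notebook.

```python
# Hypothetical file paths: the parent folder name contains "N2".
paths_bad = [
    "/data/N2_vs_mutants/N2/20220815_101500/N2_10x1_a.dat",
    "/data/N2_vs_mutants/cat-2_e1112/20220815_103000/cat-2_10x1_b.dat",
]
# The same files after renaming the parent folder so it holds no strain name.
paths_good = [p.replace("N2_vs_mutants", "screen_20220815") for p in paths_bad]

def match(paths, key):
    """Return the paths whose full text contains the search key."""
    return [p for p in paths if key in p]

print(len(match(paths_bad, "N2")))   # both files match, including the mutant
print(len(match(paths_good, "N2")))  # only the N2 file matches
```

With the original folder name, the mutant file is wrongly counted as N2; after the rename, each file matches only its own strain.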

In [2]:
starting_directory = '/Users'
chooser = FileChooser(starting_directory)
display(chooser)

FileChooser(path='/Users', filename='', title='', show_hidden=False, select_desc='Select', change_desc='Change…

In [3]:
print(chooser.selected_path)
folder_path=chooser.selected_path

/Users/gurmehak/Documents/RankinLab/Test_Datasets/PDScreen_TapHab_August15_2022


In [4]:
screens = ['PD_Screen', 'ASD_Screen', 'G-Proteins_Screen', 'Glia_Genes_Screen', 'Neuron_Genes_Screen']

screen_chooser = widgets.Select(options=screens, value=screens[0], description='Screen:')
display(screen_chooser)

Select(description='Screen:', options=('PD_Screen', 'ASD_Screen', 'G-Proteins_Screen', 'Glia_Genes_Screen', 'N…

In [5]:
Screen=screen_chooser.value

# 3. User Defined Variables (Add input here)

Here, we add some constants to help you blaze through this code.

3.1: Setting time bins


3.2: Setting view range for your graph
- Top, bottom = y axis view range
- left, right = x axis view range



In [6]:
# Setting 1s Bins
bins = np.linspace(0,1200,1201) # np.linspace(start, stop, number of evenly spaced points)
print(bins)


print("done step 3")

[0.000e+00 1.000e+00 2.000e+00 ... 1.198e+03 1.199e+03 1.200e+03]
done step 3
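As a rough illustration of how these bin edges could be used, the sketch below labels a few time stamps with one-second bins using `pd.cut`. The `demo` table and the `time_bin` column name are assumptions for this example, not part of the cells above.

```python
import numpy as np
import pandas as pd

# 1201 evenly spaced edges from 0 to 1200 s give 1200 one-second bins.
bins = np.linspace(0, 1200, 1201)

# Hypothetical time stamps (seconds) standing in for the "Time" column.
demo = pd.DataFrame({"Time": [0.70, 1.27, 1.41, 599.90, 1199.50]})

# Label each row with the right edge of the bin it falls in.
demo["time_bin"] = pd.cut(demo["Time"], bins, labels=bins[1:])
print(demo)
```

By default `pd.cut` uses bins that are open on the left, so a time stamp of exactly 0 would be left unlabelled unless `include_lowest=True` is passed.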


# 4. Construct filelist from folder path (No input required, just run)

In [8]:
os.chdir(folder_path) # setting your working directory so that your images will be saved here

filelist = list() # empty list
for root, dirs, files in os.walk(folder_path): # this for loop goes through your folder 
    for name in files:
        if name.endswith('.dat'): # and takes out all files with a .dat (file that contains your data)
            if "_" in name.split(".")[-2]: # keeps names like 'a_b.dat'; skips names like 'a_b.00065.dat'
                filepath = os.path.join(root, name) # Notes down the file path of each data file
                filelist.append(filepath) # saves it into the list

if not filelist:
    raise FileNotFoundError("No .dat files found in the selected folder!")
else:
    print(f"Number of .dat files to process: {len(filelist)}")
    # print(f"Example of first and last file saved: {filelist[0]}, {filelist[-1]}") 

print('done step 4')

Number of .dat files to process: 13
done step 4
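The file search in step 4 can be tried on a throwaway folder, as in the sketch below. The `find_dat_files` helper and the file names are hypothetical; unlike the cell above, this version also skips "._" files up front and sorts the result.

```python
import os
import tempfile

def find_dat_files(folder):
    """Collect plate-level .dat files under folder, sorted for a stable order."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.endswith(".dat") or name.startswith("._"):
                continue  # not a data file, or a macOS "._" sidecar file
            if "_" in name.split(".")[-2]:  # same test as the cell above
                found.append(os.path.join(root, name))
    return sorted(found)

# Build a temporary folder tree with one wanted file and three unwanted ones.
with tempfile.TemporaryDirectory() as tmp:
    plate_dir = os.path.join(tmp, "N2", "20220815_101500")
    os.makedirs(plate_dir)
    names = ["N2_10x1_a.dat", "N2_10x1_a.00065.dat", "._N2_10x1_a.dat", "notes.txt"]
    for name in names:
        open(os.path.join(plate_dir, name), "w").close()
    demo_files = [os.path.basename(p) for p in find_dat_files(tmp)]

print(demo_files)
```

Only `N2_10x1_a.dat` survives: the numbered file fails the underscore test, the "._" file is skipped by the extra check, and the text file has the wrong extension.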


# 5. Process Data Function (No input required, just run)

In [None]:
def ProcessData(strain): 
    """
    Filters and processes .dat files matching the given strain.

    Parameters: 
        strain (str): keyword to match in the files

    Returns:
        dict: 'N' (number of plates for this strain) and 'Confirm' (DataFrame of all matching .dat files with named columns plus "Plate_id", "Date", "Screen", and "plate")

    """
    strain_filelist = [x for x in filelist if strain in x] # Goes through the list and filters for keyword
    Strain_N = len(strain_filelist) # Finds the number of plates per strain
    if Strain_N == 0:
        raise AssertionError ('{} is not a good identifier'.format(strain))
    else:
        print(f'Strain {strain}')
        print(f'Number of plates: {Strain_N}')

        # visiting files in this strain (strain_filelist was already filtered above)
        df_list=[]
        for file in strain_filelist:
            if file.split('/')[-1].startswith('._'):
                pass
            else:
                try:
                    print(f"File: {file}")
                    df= pd.read_csv(file, sep=' ', header = None, encoding_errors='ignore')
                    df['Plate_id'] = file.split('/')[-1].split('_')[-1].split('.')[0]
                    df['Date'] = file.split('/')[-2].split('_')[0]
                    df['Screen'] = file.split('/')[-4]
                    df_list.append(df)
                except Exception as err: # catch read/parse errors without hiding Ctrl-C interrupts
                    print(f"error in file {file}: {err}")
        DF_Total = pd.concat(df_list, ignore_index = True)

    # for f in strain_filelist:
    #     DF_Total = pd.concat(pd.read_csv(f, sep=' ', skiprows = 4, header = None))
    #     DF_Total = pd.concat([pd.read_csv(f, sep=' ', header = None) for f in strain_filelist],
    #                   ignore_index=True) #<- imports your data files
    #     DF_Total = DF_Total.dropna(axis = 1) #<- cleans your data

        # defining column names for .dat files
        DF_Total = DF_Total.rename( 
                    {0:'Time',
                    1:'n',
                    2:'Number',
                    3:'Instantaneous Speed',
                    4:'Interval Speed',
                    5:'Bias',
                    6:'Tap',
                    7:'Puff',
                    8:'x',
                    9:'y',
                    10:'Morphwidth',
                    11:'Midline',
                    12:'Area',
                    13:'Angular Speed',
                    14:'Aspect Ratio',
                    15:'Kink',
                    16:'Curve',
                    17:'Crab',
                    18:'Pathlength'}, axis=1)
        
        #check function here for NaN Columns
        DF_Total['plate'] = 0

        # Calculate reversal probability 
        # DF_Total['prob'] = DF_Total['stim_rev']/ (DF_Total['no_rev'] + DF_Total['stim_rev']) #<- calculate prob
        
        # Calculate speed
        # DF_Total['speed'] = DF_Total['dist']/DF_Total['dura'] #<- calculate speed
        
        
        # DF_Total_rows = int(DF_Total.shape[0])
        # print(f'this strain/treatment has {DF_Total_rows} total taps') #<- Outputs as the second number. Check if you are missing taps!
        # DF_Final = DF_Total[["time", "dura", "dist", "prob", "speed", "plate"]].copy()

    return{
            'N': Strain_N,
            'Confirm':DF_Total
            # 'Final': DF_Final
    }


print('done step 5')

done step 5
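`ProcessData` reads the plate ID, date, and screen from fixed positions in each file path. The sketch below shows the same lookups with `pathlib` instead of splitting on '/'. The `path_metadata` helper, the "Strain" key, and the example path are assumptions for illustration.

```python
from pathlib import PurePosixPath

def path_metadata(filepath):
    """Pull plate, date, strain and screen labels out of one file path."""
    parts = PurePosixPath(filepath).parts
    return {
        "Plate_id": parts[-1].split("_")[-1].split(".")[0],  # last "_" chunk of the file name
        "Date": parts[-2].split("_")[0],                     # date part of the run folder
        "Strain": parts[-3],                                 # strain folder
        "Screen": parts[-4],                                 # screen folder
    }

# Hypothetical path laid out as <screen>/<strain>/<date_time>/<file>.dat
example = "/data/PD_Screen/N2/20220815_101500/N2_10x1_f94h20c_100s30x10s10s_C0815ab.dat"
info = path_metadata(example)
print(info)
```

`PurePosixPath` is used here so the example behaves the same on any machine; for real files, `pathlib.Path` would follow the separator of the operating system the notebook runs on.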


# 6.1 Pick strain identifiers

This is the hardest part: from your naming convention, pick a unique identifier for each group.

This means the identifier should appear in the names of all files for that strain, but not in the names of any other files. If you followed a consistent naming convention, this should be easy.

**Be careful and look closely at your naming structure. The identifier must be unique within the entire file path for the same group of files. An easy mistake is to have the strain name in the overall folder name; in that case, using your strain name as a keyword would include all files in that folder!**

For example, if all your N2 files share a pattern like "N2_5x4", as in the following path:
'/Users/Joseph/Desktop/AVR14_10sISI_TapHab_0710_2019/N2/20190710_141740/N2_5x4_f94h20c_100s30x10s10s_C0710ab.trv'
then you need to set that identifier as the strain keyword:
'Strain_1' = 'N2_5x4'

#### Depending on how many strains you are running for comparison, you may need to add/delete some lines!

* You are not naming your data groups here; we have a step for that later!
* Here, you want to note down ALL the strains you have in the folder.
* If you have just 2 strains, add hashtags (#) in front of the lines you do not need.
* If you need more strains, just add more lines, following the same format!

In [None]:
strainnames=[]
for f in filelist:
    strainnames.append(f.split('/')[-3])
ustrainnames=list(set(strainnames))
# print(ustrainnames)

if Screen =="Neuron_Genes_Screen":
    ustrainnames.insert(0, ustrainnames.pop(ustrainnames.index("N2_XJ1")))
    ustrainnames.insert(0, ustrainnames.pop(ustrainnames.index("N2_N2")))
else:
    ustrainnames.insert(0, ustrainnames.pop(ustrainnames.index("N2")))


nstrains=list(range(1,len(ustrainnames)+1))
print(nstrains)

StrainNames = {n: name for n, name in zip(nstrains, ustrainnames)} # maps 1, 2, 3, ... to strain names
print(StrainNames)
print("done step 6.1")

# <---------------- Test element to use for dictionary building -------------------
# s = '/Users/Joseph/Desktop/OnFoodOffFoodTest/N2_OnFood/20220401_163048/N2_10x1_n96h20C_360sA0401_ka.00065.dat'
# slist=s.split('/')[5]
# print(slist)
# print(list(range(1,5+1)))

['egl-30_n715', 'ebp-2_gk756', 'gpa-12_pk322', 'ubc-16_ok3177', 'dat-1_ok157', 'nhr-99_gk791', 'ser-1_ok345', 'aex-6_sa24', 'gipc-1_hc192', 'F39E9.6_ok3515', 'smg-1_r861', 'egl-19_n582', 'M195.2_ok1503', 'mbk-1_pk1389', 'W04B5.5_ok1309', 'rnf-5_tm794', 'che-14_ok193', 'jkk-1_km2', 'rap-2_gk11', 'pept-2_ok1192', 'dgk-1_nu62', 'ina-1_gm144', 'ric-4_gk322', 'npr-8_ok1446', 'F09A5.2_ok1900', 'citk-1_ok2328', 'mek-1_ks54', 'atn-1_ok84', 'tli-1_ok1724', 'trhr-1_ok1381', 'dop-2xdop-1xdop-3_vs105xvs100xvs106', 'gpa-13_pk1270', 'qui-1_ok3571', 'mics-1_ok1451', 'abl-1_ok171', 'pes-9_ok1037', 'ZK1248.15_ok2612', 'inx-10_ok2714', 'egl-10_md176', 'ubc-1_gk14', 'lin-3_e1417', 'egl-30_n686', 'C03B1.5_ok2345', 'Y106G6H.14_ok3081', 'cdh-3_pk87', 'snb-5_ok1434', 'flp-17_ok3587', 'atl-1_ok1063', 'F07A11.4_ok3152', 'rab-3_js49', 'cex-1_ok3163', 'F47G4.4_ok2219', 'mrp-4_cd8', 'vab-1_e2', 'itsn-1_ok268', 'atm-1_gk186', 'inx-11_ok2783', 'nimk-1_ok3082', 'strd-1_ok2283', 'C03A3.3_ok2834', 'cerk-1_ok1252', 'oc
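Because the order of a Python `set` of strings can differ from one session to the next, the strain numbering above may change between runs. The sketch below is one possible way to get a stable order with the control strains first. The `order_strains` helper and the demo names are hypothetical, not part of the notebook.

```python
def order_strains(names, controls=("N2",)):
    """Return unique names sorted alphabetically, with controls moved to the front."""
    unique = sorted(set(names))
    present = [c for c in controls if c in unique]  # controls that actually occur
    rest = [n for n in unique if n not in present]
    return present + rest

# Hypothetical strain folder names, with repeats as they would come from many files.
demo_names = ["egl-30_n715", "N2", "dat-1_ok157", "N2", "egl-30_n715"]
ordered = order_strains(demo_names)

# Number the strains from 1, matching the StrainNames layout used in this notebook.
demo_strain_names = dict(enumerate(ordered, start=1))
print(demo_strain_names)
```

Unlike a direct `.index("N2")` lookup, this version does not raise an error when a control strain is missing from the folder; it simply leaves it out.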

# The cell below is for testing/debugging. You do not need to run it (commented out)

In [None]:


# DF_Read = pd.read_csv('/Users/Joseph/Desktop/OnFoodOffFoodTest/N2_OnFood/20220401_163048/N2_10x1_n96h20C_360sA0401_ka.00065.dat'
#                       , sep=' ', header = None, index_col=False) #<- imports and cleans data
# DF_Read["worm"]=1
# # print(DF_Read)
# DF_Total = DF_Read #<- more data cleaning
# DF_Total = DF_Total.rename( #<- more data cleaning
#             {0:'time',
#             1:'speed',
#             2: "x",
#             3: "y",
#             4: "angularspeed",
#             5: "pathlength",
#             6: "kink",
#             7: "bias",
#             8: "curve",
#             9: "dir",
#             10: "crab",
#             11: "length",
#             12: "midline",
#             13: "width",
#             14: "morphwidth",
#             15: "area"
#             }, axis=1)
# DF_Total["x_0"] = DF_Total.x - DF_Total.x.iloc[0]
# DF_Total["y_0"] = DF_Total.y - DF_Total.y.iloc[0]

# DF_Total["x_test"] = DF_Total.iloc[:,2] - DF_Total.iloc[0,2]
# DF_Total["y_test"] = DF_Total.iloc[:,3] - DF_Total.iloc[0,3]
# DF_Total["distance"]= 0
# print(DF_Total)

# A_i = np.array(DF_Total['pathlength'][1:])
# A_i_1 = np.array(DF_Total['pathlength'][0:-1])
# result = np.abs(A_i - A_i_1).tolist()
# result.insert(0,0)
# curr_sum = 0
# new_list = []
# for i in range(len(result)):
#     curr_sum += result[i]
#     new_list.append(curr_sum)
# DF_Total["distance"]=new_list
# print(DF_Total)



        


# A_i = np.array(DF_Total.iloc[1:,5])
# A_i_1 = np.array(DF_Total.iloc[0:-1,5])
# result = np.abs(A_i - A_i_1).tolist()
# result.insert(0,0)
# curr_sum = 0
# new_list = []
# for i in range(len(result)):
#     curr_sum += result[i]
#     new_list.append(curr_sum)
# print(A_i)
# print(A_i_1)
# print(len(A_i))
# print(len(A_i_1))
# # resultS=pd.Series(result)
# # print(resultS.cumsum())
# # print(new_list)

# curr_sum = 0
# new_list = []
# for i in range(len(result)):
#     curr_sum += result[i]
#     new_list.append(curr_sum)
    
    
# import matplotlib.pyplot as plt
# plt.plot(new_list)

# 6.2 Process Data (just run this cell)

In [9]:
DataLists = [0] #<- list with a placeholder 0 in position 0, so real data starts at index 1
# we want indexing to start at 1 (when I say #1 I want the first point, not the second point)

for s in tqdm(StrainNames.values()): #<- goes through the dictionary in step 6.1 and processes data
    if not s == '':
        DataLists.append(ProcessData(s)['Confirm']) #<- appends all data into a list of dataframes

# print(DataLists[2])
print('done step 6.2')

  0%|          | 0/486 [00:00<?, ?it/s]

this strain/treatment has 426 plates
now working on strain N2_N2
this strain/treatment has 331 plates
now working on strain N2_XJ1
this strain/treatment has 16 plates
now working on strain egl-30_n715
this strain/treatment has 4 plates
now working on strain ebp-2_gk756
this strain/treatment has 4 plates
now working on strain gpa-12_pk322
this strain/treatment has 4 plates
now working on strain ubc-16_ok3177
this strain/treatment has 4 plates
now working on strain dat-1_ok157
this strain/treatment has 4 plates
now working on strain nhr-99_gk791
this strain/treatment has 4 plates
now working on strain ser-1_ok345
this strain/treatment has 4 plates
now working on strain aex-6_sa24
this strain/treatment has 4 plates
now working on strain gipc-1_hc192
this strain/treatment has 4 plates
now working on strain F39E9.6_ok3515
this strain/treatment has 4 plates
now working on strain smg-1_r861
this strain/treatment has 4 plates
now working on strain egl-19_n582
this strain/treatment has 4 plates

# Convert float64 data to float32 to reduce memory load (can also convert to 16 if needed)

In plain English:

float16 = about 3 significant digits

float32 = about 6 to 7 significant digits

float64 = about 15 to 16 significant digits

more digits = more data/memory that the computer has to keep track of
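The precision figures can be checked with `np.finfo`, and the memory saving measured on a small table, as in this sketch. The `speeds` table is made-up data; the real conversion cells below remain commented out.

```python
import numpy as np
import pandas as pd

# Guaranteed significant decimal digits for each float type.
for dtype in (np.float16, np.float32, np.float64):
    print(dtype.__name__, np.finfo(dtype).precision)

# Hypothetical column of speeds to compare memory use before and after downcasting.
speeds = pd.DataFrame({"Instantaneous Speed": np.linspace(0, 1, 10_000)})
before = speeds.memory_usage(index=False).sum()

# Downcast every float64 column to float32.
float_cols = speeds.select_dtypes("float64").columns
speeds = speeds.astype({col: "float32" for col in float_cols})
after = speeds.memory_usage(index=False).sum()

print(before, after)
```

Going from float64 to float32 halves the memory used by those columns, at the cost of keeping fewer significant digits.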

In [10]:
print(DataLists[1])

            Time   n  Number  Instantaneous Speed  Interval Speed   Bias  Tap  \
0          0.702  34       0               0.0000          0.0000  0.000    0   
1          1.265  34       0               0.0000          0.0000  0.000    0   
2          1.410  34       0               0.0000          0.0000  0.000    0   
3          1.553  35       0               0.0000          0.0000  0.000    0   
4          1.707  34       0               0.0000          0.0000  0.000    0   
...          ...  ..     ...                  ...             ...    ...  ...   
9512932  899.914  32      15               0.2845          0.1418  0.733    0   
9512933  899.950  32      15               0.2940          0.1407  0.000    0   
9512934  899.977  33      15               0.0000          0.0000  0.000    0   
9512935  900.020  33      15               0.0000          0.0000  0.000    0   
9512936  900.062  33      15               0.0000          0.0000  0.000    0   

         Puff        x     

In [None]:
#No need to run here
# for n in tqdm(DataLists[1:]):
# #     print(n)
#     TestData=n
#     TestData[TestData.select_dtypes(np.float64).columns] = TestData.select_dtypes(np.float64).astype(np.float16)
#     print("done this strain")

In [19]:
#No need to run here

# # print(TotalConcatenated.dtypes)
# TotalConcatenated['time_bin'] = TotalConcatenated['time_bin'].astype(np.float16)
# # print(TotalConcatenated.dtypes)
# # TotalConcatenated.dtypes
# # Test Cell
# # DataLists[1].to_csv("test.csv")
# Test = TotalConcatenated.reset_index(drop=True)
# print(Test)

# 7. Grouping Data and Naming (Optional: Add input here)

Here, you get to name your data groups/strains! Name your groups however you like between the quotation marks for each strain.

For example: If your Strain1 is N2 and you wish for the group to be called N2,
your line should look like:

DataLists[x].assign(dataset = 'N2')

## Go back to step 6.1 to check which strain is which item on the DataLists.
In this example, the first item on DataLists is AQ2028_b.


## Remember: Put your name in quotes. (ex: 'N2' and not N2)

By default, the names are set to the unique identifier labels.

## Depending on the number of strains you are comparing, you may have to delete/add lines of code (following the same format). 
## Remember to add/delete commas too.

# If you want to change your groups, you do that here. 
For example, if you have 5 strains in your folder but only want to compare 2 or 3 of them, designate that here and follow through with steps 6 and 7. Once you are done, come back to step 6 and change your groups again (you will have to change your graph titles for the second run-through, though)!
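The grouping step can be sketched on toy data as follows. The `demo_frames` tables and `demo_labels` dictionary are hypothetical stand-ins for `DataLists[1:]` and `StrainNames`; the pandas calls are the same kind used in the cell below.

```python
import pandas as pd

# Hypothetical per-strain tables standing in for DataLists[1:].
demo_frames = [
    pd.DataFrame({"Time": [489.9, 500.0, 590.0], "Instantaneous Speed": [0.10, 0.12, 0.11]}),
    pd.DataFrame({"Time": [495.0, 600.0], "Instantaneous Speed": [0.20, 0.21]}),
]
demo_labels = {1: "N2", 2: "dat-1_ok157"}

# Tag each table with its group name, then stack them into one table.
combined = pd.concat(
    (df.assign(dataset=demo_labels[i]) for i, df in enumerate(demo_frames, start=1)),
    ignore_index=True,
)

# Split "gene_allele" on the first underscore; names without one get allele "N2".
combined[["Gene", "Allele"]] = combined["dataset"].str.split("_", n=1, expand=True)
combined["Allele"] = combined["Allele"].fillna("N2")

# Keep only rows inside the 490 to 590 s window (both ends included).
baseline = combined[combined["Time"].between(490, 590)]
print(baseline)
```

To compare only a subset of strains, one option under these assumptions is to leave the unwanted tables out of the list before concatenating.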

In [11]:
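# Label each strain's dataframe with its name from StrainNames, then stack them
TotalConcatenated=pd.concat(df.assign(dataset=StrainNames.get(i+1)) for i,df in enumerate(DataLists[1:]))
# Split 'gene_allele' labels on the first underscore; labels without an allele (e.g. 'N2') get 'N2'
TotalConcatenated[['Gene', 'Allele']] = TotalConcatenated['dataset'].str.split('_', n=1, expand=True)
TotalConcatenated['Allele']=TotalConcatenated['Allele'].fillna('N2')
# Keep the baseline window, 490 s to 590 s inclusive
Baseline_data=TotalConcatenated[((TotalConcatenated.Time<=590)&(TotalConcatenated.Time >=490))] ### future changes to be made
# Drop unused columns; reset_index() keeps the old row numbers in a new 'index' column
Baseline_data=Baseline_data.drop(columns=["plate", "Tap", "Puff", "x","y"]).reset_index()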
# TotalConcatenated=TotalConcatenated.dropna()
# TotalConcatenated = TotalConcatenated.reset_index(drop=True)
print(Baseline_data)
# TotalConcatenated.to_csv("tap_baseline_output.csv")
# print("done output")
print('done step 7')

         index     Time   n  Number  Instantaneous Speed  Interval Speed  \
0        14362  490.007  44      35               0.1804          0.1032   
1        14363  490.040  44      35               0.1776          0.1044   
2        14364  490.070  44      35               0.1732          0.1022   
3        14365  490.107  44      35               0.1701          0.0984   
4        14366  490.140  44      35               0.1756          0.0983   
...        ...      ...  ..     ...                  ...             ...   
7298207  94871  589.859  74      55               0.0880          0.0826   
7298208  94872  589.890  73      55               0.0904          0.0826   
7298209  94873  589.921  75      55               0.0901          0.0800   
7298210  94874  589.957  75      55               0.0891          0.0782   
7298211  94875  589.992  75      55               0.0913          0.0803   

          Bias  Morphwidth  Midline      Area  Angular Speed  Aspect Ratio  \
0        

In [12]:
Baseline_data['Screen']=Screen
print(Baseline_data)

         index     Time   n  Number  Instantaneous Speed  Interval Speed  \
0        14362  490.007  44      35               0.1804          0.1032   
1        14363  490.040  44      35               0.1776          0.1044   
2        14364  490.070  44      35               0.1732          0.1022   
3        14365  490.107  44      35               0.1701          0.0984   
4        14366  490.140  44      35               0.1756          0.0983   
...        ...      ...  ..     ...                  ...             ...   
7298207  94871  589.859  74      55               0.0880          0.0826   
7298208  94872  589.890  73      55               0.0904          0.0826   
7298209  94873  589.921  75      55               0.0901          0.0800   
7298210  94874  589.957  75      55               0.0891          0.0782   
7298211  94875  589.992  75      55               0.0913          0.0803   

          Bias  Morphwidth  Midline      Area  Angular Speed  Aspect Ratio  \
0        

In [13]:
Baseline_data.to_csv(f"{Baseline_data.Screen[0]}_baseline_output.csv")
print('done')

done


# Done!