# About the dashboard project

During early attempts, I built several iterations of the dashboard as an R Shiny application.

- [See old Github Repository](https://github.com/Samuelhsu3/At-Risk-Dashboard)


I eventually switched to a Python Dash/JavaScript-based website for improved functionality and styling flexibility. The current code reads from a periodically updated JSON file that aggregates the metadata from every study within the At-Risk project ([Updating the JSON](DashBoard_Guide.ipynb#Json-File)). Earlier approaches that I tested included sourcing from multiple local Excel sheets, hard-coding the data, and using APIs.


- [See current Github Repository](https://github.com/Samuelhsu3/Dashapp.git)
     - *Note that this is Private*

R-Shiny application: [Old Data Location Tracker](https://samuelhsu03.shinyapps.io/DMP6/)

---

## Early Stages of Data Migration 

I started moving MRI files from data8 (the Baycrest server) to Compute Canada (CC) on May 30, 2023.

- I attempted to move entire directories at first, but their size posed an issue.
- I then moved each participant folder one at a time using the `rsync` command. These commands were run simultaneously on two lab computers (Fishing-Owl and Megapode) to speed up the process. I checked the sizes at both Baycrest and Compute Canada to ensure that no errors occurred during transfer.
    - *A few folders differed in size by a few KB but were identical in all other respects (these are noted on the Excel sheet).*
    
    - To further expedite the process, I wrote R code that generated the command lines from the participant IDs I entered (since the destination folders were named after the IDs, the code could determine the path from each unique input). This way I could paste the next command line as soon as the previous task finished running.
    - [Example of a Command Line Generator](Legacy_Code/Data_Transfer.ipynb)
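
The command-line generation described above can be sketched roughly as follows. This is a Python illustration rather than the original R code, and the source path, remote host, destination path, and ID format are all hypothetical placeholders, since the real ones are not given here.

```python
import shlex

# Hypothetical locations: the real source and destination paths are not
# recorded in this document.
SOURCE_ROOT = "/data8/mri"
DEST_HOST = "user@cc.example.org"
DEST_ROOT = "/scratch/user/mri"


def build_rsync_command(participant_id: str) -> str:
    """Return one rsync command line that copies a single participant folder."""
    source = f"{SOURCE_ROOT}/{participant_id}/"
    destination = f"{DEST_HOST}:{DEST_ROOT}/{participant_id}/"
    # -a preserves permissions and timestamps, -v lists each transferred file.
    parts = ["rsync", "-av", source, destination]
    return " ".join(shlex.quote(part) for part in parts)


def build_commands(participant_ids):
    """Return one command per non-empty ID, in the order given."""
    return [build_rsync_command(pid.strip()) for pid in participant_ids if pid.strip()]


if __name__ == "__main__":
    for command in build_commands(["P001", "P002"]):
        print(command)
```

Because the destination folder is derived from the ID, only the list of IDs needs to change between batches.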

---

## AI

- I decided to test the capabilities of generative AI some time after all the MRI files had been transferred. At first it had trouble handling the IDs, but once properly guided (by feeding it examples and correcting its output) it worked smoothly and generated unique command lines in batches. While tracking the dates, I realized that the formats were not standardized and were inconsistent both between and within tracking sheets. I tried using ChatGPT to fix this, but it could not deliver consistent results, so I ended up using Excel formulas to format the dates correctly.

---

## Compression and Sorting 

- Every file had to be tarred because the projects directory has a limit on the number of files. My early attempts at moving directories from scratch to projects were inefficient and needlessly complicated: I created a new folder in projects for every existing folder in scratch, then ran the move command (later the copy command) for every single file, pasting the pre-generated lines in bulk. I thought this would ensure that all files were moved, but later realized that it introduced more room for error. Furthermore, the commands were occasionally interrupted by connection timeouts. If an interruption occurred, I deleted all files associated with that batch just in case. Commands were later run in tmux windows to avoid the timeout issues.

- I also tested many versions of bash scripts for compression and moving. The initial attempts were again inefficient, as I had to specify both the filename and the folder name for the tar script. In addition, the compressed files were left inside each folder. This was troublesome because the next script had to go into each folder again to find the compressed version (and only the compressed version) before moving the files out and sorting them into their corresponding folders in projects.
    - The structure looked something like this:

        for {list of filename} in {list of folder name}  tar.gz.
        
        
    - This was redundant, and I realized that the script could simply loop through each folder and tar every file inside, as long as the structure is defined.
    
    - Meanwhile, the script for moving/copying to projects looked something like this: 
    
        determine current pathway based on {list of filenames}, cp -r to {list of pathways}.
        
        - *Note that these scripts were not saved.*
        
    - Currently, one script does both tasks by creating a separate directory that stores all the compressed files while leaving the original folder structure untouched. This entire directory can then be copied to projects. I also wrote a script that extracts the participant IDs from the filenames in order to sort the files into their individual folders (on scratch), as opposed to explicitly matching paths with filenames.
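
A rough Python sketch of the two ideas above: writing the archives into a separate mirrored directory, and sorting files by an ID found in the filename. It is not the bash script that is actually in use. The ID pattern, the function names, and the choice of one archive per item inside each folder are assumptions made for illustration, and `archive_root` is assumed to live outside `source_root`.

```python
import re
import shutil
import tarfile
from pathlib import Path

# Hypothetical ID format (e.g. "P001"); the real naming scheme is not shown here.
ID_PATTERN = re.compile(r"P\d{3}")


def compress_tree(source_root: Path, archive_root: Path) -> list:
    """Tar every item inside each folder of source_root into a mirrored
    folder under archive_root, leaving the originals untouched."""
    created = []
    for folder in sorted(p for p in source_root.iterdir() if p.is_dir()):
        out_dir = archive_root / folder.name
        out_dir.mkdir(parents=True, exist_ok=True)
        for item in sorted(folder.iterdir()):
            target = out_dir / (item.name + ".tar.gz")
            with tarfile.open(target, "w:gz") as archive:
                # arcname stores only the item name, not the full path.
                archive.add(item, arcname=item.name)
            created.append(target)
    return created


def participant_id(filename: str):
    """Return the first ID found in a file name, or None if there is none."""
    match = ID_PATTERN.search(filename)
    return match.group(0) if match else None


def sort_by_id(flat_dir: Path, sorted_root: Path) -> None:
    """Move each file in flat_dir into sorted_root/<ID>/ based on its name."""
    for path in sorted(p for p in flat_dir.iterdir() if p.is_file()):
        pid = participant_id(path.name)
        if pid is None:
            continue  # leave unmatched files in place for manual review
        dest = sorted_root / pid
        dest.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(dest / path.name))
```

Keeping the archives in their own tree means the whole tree can be copied in one step, and nothing has to be searched for inside the original folders afterwards.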
    
---


## Renaming

- To ensure that I had not missed any files during data migration, I retrieved the list of filenames from the source (where the data was being copied from) and compared it with the existing tracking sheets. This was difficult and time-consuming, as the files on the existing tracking sheets were not ordered in any specific way (alphabetically, by size, and so on).

- While moving Viewpoint files, I noticed that the existing tracking sheet and the actual files found in rooms 806 and 804 did not match up (files that were supposed to exist did not, and files that were not supposed to exist were present). Day 3 eye tracking had the same problem (this was not an issue for other studies, since I created new tracking sheets for them).

    - Rather than manually checking hundreds of files against the existing tracking sheets, I tried using ChatGPT to find the discrepancies. Unfortunately, ChatGPT proved unreliable: I kept catching mistakes despite trying different approaches, so I used simple Python/R scripts to compare the lists of names instead.
    
        - [Example of such code](Legacy_Code/File_Matching.ipynb)
        
        - Later on I realized that the files had not been incorrectly tracked or missed; rather, they had been incorrectly named. I renamed these files manually or with loops and added notes for record keeping. Files that were missing because of early termination or calibration failures were also recorded, and files with the wrong dates in their names were fixed.
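
The list comparison can be sketched as a pair of set differences. This is a minimal illustration, not the code in the linked notebook. The function name and the example filenames are hypothetical.

```python
def compare_file_lists(tracked, actual):
    """Return (missing, unexpected).

    missing:    names on the tracking sheet that were not found on disk
    unexpected: names found on disk that are not on the tracking sheet
    """
    tracked_set = {name.strip() for name in tracked if name.strip()}
    actual_set = {name.strip() for name in actual if name.strip()}
    missing = sorted(tracked_set - actual_set)
    unexpected = sorted(actual_set - tracked_set)
    return missing, unexpected


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    tracked = ["P001_day1.edf", "P002_day1.edf", "P003_day1.edf"]
    actual = ["P001_day1.edf", "P002_dayl.edf", "P003_day1.edf"]
    missing, unexpected = compare_file_lists(tracked, actual)
    print("Tracked but not found:", missing)
    print("Found but not tracked:", unexpected)
```

A misnamed file shows up once in each list, which is how a naming error can be told apart from a file that is truly missing.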

---


## Checking Size


- In the beginning I typed commands for individual files. This became tedious, so I ran scripts that found the size of every file and wrote the results to a txt file (the output file let me search with Ctrl+F). Even so, I still had to manually match every file to its size on the tracking sheets. This was not necessary for studies with new tracking sheets, as I could order the sheet in the same way as the output and simply paste the entire column.
    - The newer file-checking scripts simply echo the sizes in the terminal in that order.


- While double-checking the files in projects, I noticed that some files were significantly smaller than their original size (at Baycrest and on scratch). This turned out to be a result of the servers at CC handling storage differently, not data loss. Cryptographic hashes were checked to ensure that the files were unaffected by copying and moving.
     - I also counted the files inside each directory and noted the numbers down (for MRI). Just to be safe, I copied every single directory out of projects and back into my scratch, untarred them, and checked their file sizes and counts. There were no issues aside from a few small differences, which have been noted on the tracking sheet (e.g., untar 7.8 GB). *Note that this step was only done with MRI files.*
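
A minimal Python sketch of this kind of check, comparing sizes, hashes, and file counts between an original directory and its copy. The hash algorithm that was actually used is not recorded here, so SHA-256 is an assumption, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def summarize(root: Path) -> dict:
    """Map each file's path relative to root to (size in bytes, SHA-256 digest)."""
    return {
        str(path.relative_to(root)): (path.stat().st_size, sha256_of(path))
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differences(original: Path, copy: Path) -> list:
    """Return relative paths that are missing on one side or differ in content."""
    a, b = summarize(original), summarize(copy)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

`len(summarize(root))` gives the file count for a directory. The size recorded here is the byte length of the content, which can differ from the disk usage a server reports when storage is handled differently.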

---

### Checking Dates 


- Dates were mainly copied from existing tracking sheets, which went relatively quickly. I also wrote formulas to extract the dates for files that already had dates in the filename (MRI and Neuropsych files). To make sure no errors were made, I double-checked every single file against the Session Dates (SD) tracking sheet. I did notice quite a few discrepancies, most of which were errors on the SD sheet.
    - To ensure that the dates were correct everywhere, the following were used: 
        1. [SD sheet](https://docs.google.com/spreadsheets/d/1KEDzGKlqu408hcJDGIXMCQXd6AT4uBbrUJPNAIOSwIE/edit#gid=0)
        2. Testing sheets [Like this one](https://docs.google.com/spreadsheets/d/12vJetlT-zT9Scr3jGxi8zwx0wsFJ6uRUemVkzrigMmM/edit#gid=974392866)
        3. Calibration sheets [Like this one](https://docs.google.com/spreadsheets/d/1Q7-dNEIsVCbJwY2UyHCWSu6Ir_Up8aHOYF6v-ke3Og4/edit#gid=0)
        4. Google Calendar
        5. The actual file itself (opened with the `less` command, or the physical copy).
        
- I was able to fix a number of date mistakes on either the SD sheet or the existing tracking sheets for Oddity, Viewpoint, and Neuropsych. *For most, I simply overwrote the date if it was an obvious mistake.*
- I also noted down the files that had incorrect dates in their filenames and fixed them.

- For Neuropsych files without dates in the filename, I opened them in Dropbox or found the physical copy (if the dates were blurred out digitally) to match with the right version. Some files did not have the dates written on them at all and the dates were assumed depending on the folder the physical copy was stored in.  

- For all eye-tracking files, I wrote a script that ran `head` on each EDF file and used `awk` to pull out the dates, to ensure that they matched what was tracked. I noticed that the cohort one resting-state eye-tracking sessions were actually done on day 3.
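
The header check on the EDF files could look roughly like this in Python. It is a sketch, not the original `head`/`awk` script, and it assumes the text preamble at the start of each file contains a line of the form `** DATE: Tue Mar  5 10:15:30 2019`. If the headers are laid out differently, the pattern needs adjusting. The function names are hypothetical.

```python
import re
from datetime import datetime

# Assumption: the preamble holds a line such as
# "** DATE: Tue Mar  5 10:15:30 2019".
DATE_LINE = re.compile(rb"\*\* DATE: (\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})")


def header_date(header: bytes):
    """Return the recording date found in the header bytes, or None."""
    match = DATE_LINE.search(header)
    if match is None:
        return None
    # Collapse the double space used to pad single-digit days.
    text = " ".join(match.group(1).decode("ascii").split())
    # %a and %b expect English day and month names (the default C locale).
    return datetime.strptime(text, "%a %b %d %H:%M:%S %Y").date()


def read_header(path, size: int = 2048) -> bytes:
    """Read only the first bytes of a file, like `head -c`."""
    with open(path, "rb") as handle:
        return handle.read(size)
```

Comparing `header_date(read_header(path))` with the tracked date for each file flags the sessions whose recorded day does not match the sheet.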

---

### Finding Ages


- Some ages were already recorded on the old tracking sheets (Oddity and Viewpoint) and were simply copied. For MRI and Neuropsych, I wrote Python code that read from two sheets (the birthdays and the test dates) and output the list of ages in the same format, so the output could be pasted in as given. However, this workflow required you to update the test-date sheet, download it, run the code, download the result again, and open the sheet back up to update the ages. Now this can be done in one click by navigating to the sync age tab in the custom menu.
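
The age calculation itself is small. A minimal sketch, assuming ages are recorded in completed years (the format on the tracking sheets is not specified here); the function name is hypothetical.

```python
from datetime import date


def age_on(birth_date: date, test_date: date) -> int:
    """Return the age in completed years on test_date."""
    if test_date < birth_date:
        raise ValueError("test date precedes birth date")
    # The birthday has been reached this year if (month, day) of the test
    # date is on or after (month, day) of the birth date.
    had_birthday = (test_date.month, test_date.day) >= (birth_date.month, birth_date.day)
    return test_date.year - birth_date.year - (0 if had_birthday else 1)


if __name__ == "__main__":
    # Hypothetical dates, for illustration only.
    print(age_on(date(1950, 6, 15), date(2023, 6, 14)))  # day before the birthday
    print(age_on(date(1950, 6, 15), date(2023, 6, 15)))  # on the birthday
```

Comparing (month, day) pairs avoids the off-by-one errors that come from dividing a day count by 365.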

---

## Automatic sheet update


- This was achieved using Google Apps Script. When a new participant is tested with eye tracking, three separate tracking sheets need to be updated.
    - Previously, after updating the USB/Drive sheet and the CC tracker, you had to find exactly where the new rows should be inserted on the JSON sheet, or create new rows and replace the entire section. You then had to download the Excel sheet, `cd` to the directory, and run `python csv_to_json_converter.py`.
    - This was repetitive and time-consuming, not to mention the room for error it introduced. With Google Apps Script, these functions can be triggered from the custom tabs on the Google Sheet itself.

- After updating the original eye-tracking sheet (for USB and Google Drive), you can click the sync button, which creates a new row with assumed information in the Compute Canada tracking sheet. Note that four new templates are created for the Encoding sheet (4 blocks), while three are created for Retrieval (3 blocks). It is important to sync only after you have confirmed that the information entered in the first sheet is correct. I tested versions that updated immediately upon edit, but these created multiple new rows if you made a mistake and tried to go back.

- When finished, you can click the commit button, which triggers a function that regenerates or updates the JSON file sheet. This sheet can then be downloaded as JSON instead of CSV. This saves you the trouble of manually finding where to insert the new rows for the study.
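
For context, the manual conversion step that this replaces can be sketched as follows. This is not the actual `csv_to_json_converter.py`, which is not shown here. The `Study` grouping column, the other column names, and the output layout are assumptions made for illustration.

```python
import csv
import io
import json


def csv_to_json(csv_text: str, group_by: str = "Study") -> str:
    """Group CSV rows by one column and return them as a JSON string."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row.pop(group_by)
        grouped.setdefault(key, []).append(row)
    return json.dumps(grouped, indent=2)


if __name__ == "__main__":
    # Hypothetical columns and values, for illustration only.
    sample = (
        "Study,ID,Date\n"
        "MRI,P001,2023-05-30\n"
        "MRI,P002,2023-06-01\n"
        "Oddity,P001,2023-06-02\n"
    )
    print(csv_to_json(sample))
```

Grouping by a column means new rows can be appended anywhere in the sheet, since their position in the output depends on the value in that column rather than on row order.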

---


## Automating Data Transfer


- I first tried PowerShell scripting, as it had more administrative privileges, but realized that it has limitations when it comes to data transfer. I initially made it sync as soon as a new file was detected, but errors made while moving files on the local end could then cause unnecessary downstream effects on the remote server.
- I also tried using Windows Task Scheduler to run this script regularly, but decided it was better to have someone execute the code only when it is needed. The current code is written in bash, and different versions are stored in different folders on the Fishing-Owl computer. Since each project has a different path, a different script is needed for each. The code automatically transfers the files added within the last 24 hours to Compute Canada. This reduces human error in data transfer and speeds up data backup.
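
The "added within the last 24 hours" selection can be sketched in Python as below. The scripts actually in use are written in bash and are not reproduced here. This sketch uses modification time as a stand-in for "newly added", which is an assumption, and the function name is hypothetical.

```python
import time
from pathlib import Path


def recent_files(root: Path, max_age_hours: float = 24.0, now: float = None) -> list:
    """Return files under root whose modification time falls within the window."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_mtime >= cutoff
    )


if __name__ == "__main__":
    # Lists candidates only; the transfer itself is left to rsync.
    for path in recent_files(Path(".")):
        print(path)
```

Printing the candidate list before transferring keeps a person in the loop, in line with the decision to run the transfer on demand rather than on every file change.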
