# Verify the presence of data files in a ReArm visit

The purpose of this notebook is to verify the presence of data files in a ReArm visit. The checks performed are as follows:

- Are all the expected data files for the visit present in the directory?
- Are there any extra files in the directory?

The results of the checks are stored in two files:

- `goodFiles.log`: Contains the list of files that are present in the directory and expected for the visit.
- `checkFilesInRearmVisit.log`: Contains information about any problems encountered during the checks.

Please note that this notebook assumes a specific directory structure and naming convention for the data files. The details of the directory structure and naming convention can be found in the subsequent cells.


## Location of the data files

The data files are accessed via the `dat/ReArm.lnk` symlink, which points to the directory where the data files are stored (use `ln -s source_file myfile` to generate the symlink). This symlink is ignored by Git (as specified in the `.gitignore` file), so the data files are never shared through Git.

The directory structure is as follows:

    dat
    └── ReArm.lnk --> Rearm       # Symlink to the directory where the data files are stored
        ├── ...
        ├── ReArm_C1P02
        │   ├── ...
        │   ├── ReArm_C1P02_V1
        │   │   ├── ...
        │   │   ├── ReArm_C1P02_20210306_V1_Accelerometry
        │   │   ├── ReArm_C1P02_20210306_V1_Armeo
        │   │   ├── ReArm_C1P02_20210306_V1_Circle
        │   │   ├── ReArm_C1P02_20210306_V1_Reaching
        │   │   └── ...
        │   ├── ReArm_C1P02_V2
        │   ├── ReArm_C1P02_V3
        │   └── ...
        ├── ...
        ...



## Location of the check files

The check files are `*.log` files located directly at the root of the directory that is checked.

For example:

TODO: modify the example below to match the actual directory structure

```
...
└── ReArm_C1P02 
    ├── checkFilesInRearmVisit.log
    ├── goodFiles.log
    ├── ...
    ├── Accelerometry
    ├── ...
    └── Scan
    ...
```


## Structure of the name of the data files

The data files are named according to the following structure:

```
<project>_<participant>_<date>_<visit>_<record>_<specific>.<extension>
```

where:

- `<project>` is the name of the project (`ReArm`)
- `<participant>` is the participant ID (e.g., `C1P02` as C1 for the center and P02 for the participant)
- `<date>` is the date of the recording (e.g., `20190131` as YYYYMMDD)
- `<visit>` is the visit label (e.g., `V1`, `V2`, `V3`)
- `<record>` is the name of the record (e.g., `r`, `c`, `a`, `ac`, `Scan`)
- `<specific>` is a string that depends on the type of `<record>`
- `<extension>` depends on the type of data (csv, {easy, oxy3, oxy4}, xdf, cwa, pdf)
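
The naming structure above can be sketched as a regular expression. This is only an illustration: `FILENAME_PATTERN` and `parse_rearm_filename` are hypothetical names that do not exist in the notebook, and the participant format (`C<digits>P<digits>`) is an assumption based on the `C1P02` example.

```python
import re

# Hypothetical pattern derived from the naming structure described above.
# Assumption: the participant ID always looks like C<digits>P<digits>.
FILENAME_PATTERN = re.compile(
    r"^(?P<project>ReArm)"
    r"_(?P<participant>C\d+P\d+)"
    r"_(?P<date>\d{8})"
    r"_(?P<visit>[A-Za-z0-9]+)"
    r"_(?P<record>rca|ac|r|c|a|Scan)"
    r"(?:_(?P<specific>[^.]+))?"  # optional: some files have no <specific>
    r"\.(?P<extension>[A-Za-z0-9]+)$"
)


def parse_rearm_filename(file_name):
    """Split a data file name into its tokens, or return None if it does not fit."""
    match = FILENAME_PATTERN.match(file_name)
    return match.groupdict() if match else None


tokens = parse_rearm_filename("ReArm_C1P02_20210306_V1_r_l_m_mau_np.csv")
print(tokens["record"], tokens["specific"], tokens["extension"])
```

Files without a `<specific>` part (e.g., `ReArm_C1P02_20210306_V1_c.xdf`) yield `None` for that token.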




The following table shows the different types of `<visit>` :

|`<visit>`  | Content   |
| :--      | :------  |   
| `Incl` | Inclusion  |
| `V1` | Just before the start of therapy  |
| `V2` | Just after the end of therapy  |
| `V3` | 3 months after the end of therapy  |
| `Reed` | During the therapy  |


The following table shows the different types of records and the corresponding `<record>`:

| `<record>` | directory | Content   |
| :-- | :-----------     | :------  |   
| `r` | `Reaching`       | Reaching task |
| `c` | `Circle`         | Circular Steering task  | 
| `a` | `Armeo`          | Armeo's Ladybug Game|
| `rca` | `Reaching`     | r + c + a in the same file |
| `ac` | `Accelerometry` | Wrist accelerometer at home |
| `Scan` | `Scan` | Scanned paper (manually filled data sheets) |
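
As a minimal sketch, the table above can be expressed as a dictionary. The directory tree shown earlier suggests that record directories are named `<prefix>_<directory>` (e.g., `ReArm_C1P02_20210306_V1_Reaching`); the helper `record_directory_name` and its `prefix` parameter are hypothetical and only illustrate that convention.

```python
# Mapping taken from the table above; `rca` files live in the `Reaching` directory.
RECORD_DIRECTORIES = {
    "r": "Reaching",
    "c": "Circle",
    "a": "Armeo",
    "rca": "Reaching",
    "ac": "Accelerometry",
    "Scan": "Scan",
}


def record_directory_name(prefix, record):
    """Return the directory name for a record, e.g. <prefix>_Reaching.

    `prefix` (hypothetical parameter) is the part shared by all record
    directories of a visit, such as "ReArm_C1P02_20210306_V1".
    """
    return f"{prefix}_{RECORD_DIRECTORIES[record]}"


print(record_directory_name("ReArm_C1P02_20210306_V1", "c"))
```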




The following table shows the different types of records and the corresponding `<specific>`:

| `<record>` | `<specific>` | `<extension>` | Content |
|--------|--------------------------|-------------|-------------|
| `r` | `_k` | `.csv`| Kinect MoCap  |
| `r` | `_k_m` | `.csv` | Kinect markers |
| `r` | `_l_m_mau_np` | `.csv` | `l` markers for `mau` and `np` |
| `r` | `_l_m_mau_p` | `.csv` | `l` markers for `mau` and `p` |
| `r` | `_l_m_sau_np` | `.csv` | `l` markers for `sau` and `np` |
| `r` | `_l_m_sau_p` | `.csv` | `l` markers for `sau` and `p` |
| `r` | none | `.easy` | Oxysoft csv export  |
| `r` | none | `.oxy4` | Oxysoft binary data |
| `r` | none | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `c` | `_k` | `.csv` | Kinect MoCap  |
| `c` | `_k_m` | `.csv` | Kinect markers |
| `c` | `_l_m_np` | `.csv` | `l` markers for `np` |
| `c` | `_l_m_p` | `.csv` | `l` markers for `p` |
| `c` | `_l_np` | `.csv` | `l` mouse mocap for `np` |
| `c` | `_l_p` | `.csv` | `l` mouse mocap for `p` |
| `c` | none | `.easy` | Oxysoft csv export  |
| `c` | none | `.oxy4` | Oxysoft binary data |
| `c` | none | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `a` | none | `.easy` | Oxysoft csv export  |
| `a` | none | `.oxy4` | Oxysoft binary data |
| `a` | none | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `ac` | `_p_<acc-ref>` | `.cwa` | Accelerometer data for `p` |
| `ac` | `_np_<acc-ref>` | `.cwa` | Accelerometer data for `np` |
| | | |
| `Scan` | `_Acti1` | `.pdf` | Activity questionnaire  NOT CHECKED |
| `Scan` | `_SetupW` | `.pdf` | Setup for clinical tests NOT CHECKED|
| `Scan` | `_V1` | `.pdf` | Results of clinical tests at V1 NOT CHECKED|
| `Scan` | `_DataSheet` | `.pdf` | Complementary information (lab notebook) NOT CHECKED|

The following table shows how to interpret the items in a `<specific>`:

| `<specific>` | Content |
|--------| -------------|
| `mau` |  maximal arm use   |
| `sau` |  spontaneous arm use   |
| `np` |  non-paretic arm   |
| `p` |  paretic arm   |
| `<acc-ref>` |   `<XXX>_<accelerometerID>`, where `<XXX>` =  [arm + 0 + visitID ] , with arm in [1=np, 2=p] and visitID in [1,2,3] |
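
The construction of `<acc-ref>` can be illustrated with a short sketch. The function name `acc_ref` is hypothetical, and the accelerometer ID used in the example (`1234567`) is made up for illustration.

```python
ARM_CODES = {"np": 1, "p": 2}  # 1 = non-paretic, 2 = paretic (from the table above)


def acc_ref(arm, visit_id, accelerometer_id):
    """Build `<XXX>_<accelerometerID>`, where XXX = arm code + "0" + visit number."""
    if arm not in ARM_CODES:
        raise ValueError(f"unknown arm: {arm!r}")
    if visit_id not in (1, 2, 3):
        raise ValueError(f"unknown visit: {visit_id!r}")
    return f"{ARM_CODES[arm]}0{visit_id}_{accelerometer_id}"


# Non-paretic arm at visit V1, with a made-up accelerometer ID
print(acc_ref("np", 1, "1234567"))  # prints 101_1234567
```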

DO NOT CHECK:
- `Visuals` directory
- `Scan` directory
- `Incl` directory
- `Reed` directory

TO CHECK directories:
- `V1`
- `V2`
- `V3`

TO CHECK only ANONYMOUS DATA in `V1`, `V2`, `V3` directories:
- `Accelerometry`
- `Circle`
- `Reaching`
- `Armeo`
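
The rules listed above can be summarized in a small predicate. This is only a sketch of the lists as written; the names `CHECKED_VISITS`, `CHECKED_RECORD_DIRECTORIES` and `is_checked` are hypothetical and are not used by the classes defined later.

```python
# Visits and record directories to check, copied from the lists above
CHECKED_VISITS = {"V1", "V2", "V3"}
CHECKED_RECORD_DIRECTORIES = {"Accelerometry", "Circle", "Reaching", "Armeo"}


def is_checked(visit, record_directory):
    """Return True if the files of this visit and record directory are checked."""
    return visit in CHECKED_VISITS and record_directory in CHECKED_RECORD_DIRECTORIES


print(is_checked("V1", "Circle"))  # a checked visit and a checked directory
print(is_checked("Incl", "Circle"))  # the inclusion visit is not checked
```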

# Script to check the data files

The script checks for the presence of the data files and reports the missing and unexpected files.  
It follows an object-oriented approach: it verifies that all the expected files are present in the directory.  

I use two classes, `ExpectedRearmVisit` and `CheckFilesInRearmVisit`, which are defined in the cells below.


In [None]:
# doRunTests should be set to False for production use and to True for testing (default).
# It can already be set to False from outside (e.g., by the test script).

if "doRunTests" not in globals():
    doRunTests = True
    # pathToData = "dat/ReArm.lnk/ReArm_C1P02/ReArm_C1P02_20210306_V1"
    pathToData = "dat/ReArm.lnk/ReArm_C1P07/ReArm_C1P07_20210716_V1"
    # pathToData = "dat/ReArm.lnk/ReArm_C1P07/ReArm_C1P07_20211116_V3"

## class ExpectedRearmVisit

- `ExpectedRearmVisit`: Represents the expected files and directories for a ReArm visit. 
- Usage:
    - `expectedVisit = ExpectedRearmVisit(path_to_visit_directory)`
    - `expectedVisit.print()`: Prints the expected files and directories in the visit.
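
The expected file names are built as glob patterns in which each `?` stands for exactly one character. The sketch below illustrates this with `fnmatch.fnmatchcase` from the standard library, which applies the same wildcard rules to plain strings; `PATTERN` and `matches_expected` are hypothetical names used only for this example.

```python
from fnmatch import fnmatchcase

# Pattern for the Kinect MoCap file of the reaching task: the participant,
# date and visit may vary, but each `?` matches exactly one character.
PATTERN = "ReArm_C?P??_????????_V?_r_k.csv"


def matches_expected(file_name, pattern=PATTERN):
    """Return True if the file name fits the expected pattern."""
    return fnmatchcase(file_name, pattern)


print(matches_expected("ReArm_C1P02_20210306_V1_r_k.csv"))
print(matches_expected("ReArm_C1P02_2021030_V1_r_k.csv"))  # 7-digit date
```

Because the pattern contains no `*`, a name with a missing or extra character never matches.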



In [None]:
import os

from more_itertools import last


class ExpectedRearmVisit:
    """
    Represents the expected files and directories for a ReArm visit.

    Attributes:
        totalExpectedFiles (int): The total number of expected files.
        expectedFiles (list): A list of expected files organized by record directory.
        expectedDirectories (list): A list of expected directories.

    Methods (public):
        print(): Print the expected files and directories.
    """

    # Class variables (shared among all instances)
    totalExpectedFiles = 0
    expectedFiles = []
    expectedDirectories = []

    # All the variables below should be considered:
    # - protected => _variableName
    # - constant (should not be modified by instances)

    _records = {
        # NOTE: this list also defines the directories to check
        # inFileLabel : meaning
        "r": "Reaching",
        "c": "Circle",
        "ac": "Accelerometry",
        # "Scan": "Scan",
        # "a": "Armeo",
    }

    # a pattern to search for files with glob
    _begOfFileName = "ReArm_C?P??_????????_V?"

    # end of file names in the reaching directory
    # each element is a list of two elements: [name, extension]
    _reachingFiles = [
        ["_k_m", ".csv"],
        ["_k", ".csv"],
        ["_l_m_mau_np", ".csv"],
        ["_l_m_mau_p", ".csv"],
        ["_l_m_sau_np", ".csv"],
        ["_l_m_sau_p", ".csv"],
        ["", ".easy"],
        ["", ".oxy4"],
        ["", ".xdf"],
    ]
    # end of file names in the circle directory
    # each element is a list of two elements: [name, extension]
    _circleFiles = [
        ["_k", ".csv"],
        ["_k_m", ".csv"],
        ["_l_m_np", ".csv"],
        ["_l_m_p", ".csv"],
        ["_l_np", ".csv"],
        ["_l_p", ".csv"],
        ["", ".easy"],
        ["", ".oxy4"],
        ["", ".xdf"],
    ]
    # end of file names in the armeo directory
    _armeoFiles = [
        ["_a.Data", ".csv"],
        ["_a.Dump", ".csv"],
        ["_a.easy", ".easy"],
        ["_a", ".oxy4"],
    ]
    # end of file names in the accelerometer directory
    # we use a pattern for the name in glob search
    # see the description of <acc-ref> in the documentation
    _accelerometerFiles = [
        ["_p_???_???????", ".cwa"],
        ["_np_???_???????", ".cwa"],
    ]
    # end of file names in the scan directory
    _scanFiles = [
        "_Acti1.pdf",
        "_SetupW.pdf",
        "_V1.pdf",
        "_DataSheet.pdf",
    ]

    # Dictionaries linking the label in the file to the meaning
    _visits = {
        # inFileLabel : meaning
        "V1": "Visit 1",
        "V2": "Visit 2",
        "V3": "Visit 3",
    }

    _endOfFileNames = {
        # inFileLabel : meaning
        "r": _reachingFiles,
        "c": _circleFiles,
        "ac": _accelerometerFiles,
        "a": _armeoFiles,
        "Scan": _scanFiles,
    }

    def __init__(self, user_input_visit_abspath):
        self.__set_recordsPaths(user_input_visit_abspath)
        self.__set_totalExpectedFiles()
        self.__set_expectedFiles()

    def __set_recordsPaths(self, user_input_visit_abspath):
        visit_abspath = os.path.abspath(user_input_visit_abspath)
        # we want the relative path from the project directory, 2 levels up
        project_abspath = os.path.normpath(visit_abspath + "/../..")
        visit_relpath = os.path.relpath(visit_abspath, project_abspath)
        lastDirectory = last(os.path.split(visit_abspath))
        # The date token of the record directories might not be coherent
        # with the date in the visit_relpath: change it to ????????
        tokens = lastDirectory.split("_")
        tokens[2] = "????????"
        lastDirectory = "_".join(tokens)

        self.project_abspath = project_abspath
        self.visit_abspath = visit_abspath
        self._recordsPaths = {}
        for recordKey, recordName in self._records.items():
            # we want patient/visit/record, where record is visit_recordName
            self._recordsPaths[recordKey] = (
                visit_relpath + "/" + lastDirectory + "_" + recordName
            )

    def __set_totalExpectedFiles(self):
        self.totalExpectedFiles = 0
        for record_key, record_name in self._records.items():
            self.totalExpectedFiles += len(self._endOfFileNames[record_key])

    def __set_expectedFiles(self):
        # the list matches the directory structure of the ReArm visit
        # each element is a list of two elements:
        #   1) the record directory
        #   2) a list of file names for that record
        self.expectedFiles = []
        self.expectedDirectories = []
        for record_key, record_relpath in self._recordsPaths.items():
            # sub list of file names for this record
            fileListForRecord = []
            # for endOfFileName in self._endOfFileNames[recordText]:
            #     fileListForRecord.append(
            #         self._begOfFileName + recordText + endOfFileName
            #     )
            for endOfFileName in self._endOfFileNames[record_key]:
                fileListForRecord.append(
                    self._begOfFileName
                    + "_"
                    + record_key
                    + endOfFileName[0]
                    + endOfFileName[1]
                )
            self.expectedFiles.append([record_relpath, fileListForRecord])
            self.expectedDirectories.append(record_relpath)

        # sanity check: the number of file names built above (second element
        # of each sublist) must match the total computed from _endOfFileNames
        total_expected_files = sum(len(sublist[1]) for sublist in self.expectedFiles)

        if total_expected_files != self.totalExpectedFiles:
            print(
                "Error: total number of expected files does not match the sum of the files in the sublists"
            )

    def print(self):
        print("Expected files: {:d}".format(self.totalExpectedFiles))
        for record_relpath, record_fileList in self.expectedFiles:
            print("{:s}".format(record_relpath))
            for fileName in record_fileList:
                print("   {:s}".format(fileName))
        # print("Expected directories: {:d}".format(len(self.expectedDirectories)))
        # for directory in self.expectedDirectories:
        #     print("   {:s}".format(directory))


if doRunTests:
    ExpectedRearmVisit(pathToData).print()

## class CheckFilesInRearmVisit

- `CheckFilesInRearmVisit`: checks if the files are present in the directory
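
The core idea of the check can be sketched in memory, without touching the file system: each expected pattern must match exactly one file, and any file that is not a unique match is reported. This is a simplified illustration for a single directory, not the class itself; `classify_files` is a hypothetical name.

```python
from fnmatch import fnmatchcase


def classify_files(patterns, present_names):
    """Sort file names into good, problematic and unexpected ones.

    Simplified sketch: one directory, names given as plain strings.
    """
    good, problems = [], []
    for pattern in patterns:
        found = [name for name in present_names if fnmatchcase(name, pattern)]
        if len(found) == 1:
            good.append(found[0])
        elif not found:
            problems.append((pattern, "File not found"))
        else:
            problems.append((pattern, "Multiple files found"))
    # every file that is not a unique match for some pattern is unexpected
    unexpected = [name for name in present_names if name not in good]
    return good, problems, unexpected


patterns = ["ReArm_C?P??_????????_V?_c_k.csv", "ReArm_C?P??_????????_V?_c.xdf"]
present = ["ReArm_C1P02_20210306_V1_c_k.csv", "notes.txt"]
good, problems, unexpected = classify_files(patterns, present)
print(good, problems, unexpected)
```

In this example the `.xdf` file is reported as not found and `notes.txt` as unexpected.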


In [None]:
import os
import glob


class CheckFilesInRearmVisit:
    """
    A class for checking files in a ReArm visit.

    Attributes:
        visit_path_user_input (str): The path to the visit directory as given by the user (relative or absolute).
        visit_abspath (str): The absolute path to the visit directory.
        visit_relpath (str): The relative path to the visit directory from the current working directory.
        resPath (str): The path to the 'res' directory in the root of the project.
        logFiles_abspath (str): The path where the check result files will be saved.
        logFile_fullname (str): The full file name (including path) of the log file.

    Methods (public):
        run_check(): Run the check, print and save the problems, and save the list of good files.

    Methods (protected, called by run_check or kept for debugging):
        _printAllFiles(): Print all the files in the visit directory.
        _printAndSaveProblems(): Print and save the problems encountered during the check.
        _saveGoodFiles(): Save the list of good files.

    """

    # path variables initialized to empty strings

    visit_path_user_input = ""  # the path to the visit directory (rel or abs)
    visit_abspath = ""
    visit_relpath = ""
    project_abspath = ""
    patient_abspath = ""
    # res directory in the root of the project
    resPath = ""
    # where to save the check result files
    logFiles_abspath = ""
    logFile_fullname = ""
    # name of files containing the check results
    checkRes_logFile_name = "checkFilesInRearmVisit.log"
    goodFiles_logFile_name = "goodFiles.log"
    badFiles_logFile_name = "badFiles.log"
    kinectDelay_logFile_name = "kinect_to_mouse_delay.tsv"
    # NOTE: these files will not be tagged as good or bad (if present in the visit directory)
    logFiles = {
        "checkRes": checkRes_logFile_name,
        "goodFiles": goodFiles_logFile_name,
        "badFiles": badFiles_logFile_name,
        "kinectDelay": kinectDelay_logFile_name,
    }

    # lists of files
    goodFilesList = []
    badFilesList = []
    allFilesList = []

    def __init__(self, user_input_dataDirectory):
        # all the analysis relies on the visit directory being given
        self.visit_path_user_input = user_input_dataDirectory

        # reset the output lists
        self.goodFilesList = []
        self.badFilesList = []

        self._set_all_paths()
        self._set_allFilesList()

    def run_check(self):
        self._expectedRearmVisit = ExpectedRearmVisit(self.visit_abspath)
        self._checkFilesByVisit()
        self._printAndSaveProblems()
        self._saveGoodFiles()

    def _set_allFilesList(self):
        """
        Sets the list of all files in the visit directory.
        """
        dataDirectory = self.visit_abspath
        allFiles = glob.glob(os.path.join(dataDirectory, "**/*"), recursive=True)
        allFiles = [f for f in allFiles if not os.path.isdir(f)]
        self.allFilesList = allFiles

    def _set_all_paths(self):
        self._set_visit_abspath()
        self._set_relVisitPath()
        # ensure results are saved next to the data
        self.logFiles_abspath = self.visit_abspath
        self._set_logFullFileName()

    def _getPathAfterLastLnk(self, path):
        """
        return the path after the last directory ending with .lnk (if any)
        otherwise return the path as is
        """
        # NOTE: do not name this variable `str`, which would shadow the builtin
        parts = path.split(os.sep)
        # keep the parts after the last part ending with .lnk (if any)
        idx = -1
        for i, part in enumerate(parts):
            if part.endswith(".lnk"):
                idx = i
        if idx >= 0:
            parts = parts[idx + 1 :]

        path = os.sep.join(parts)
        return path

    def _set_logFullFileName(self):
        logFile_fullname = os.path.join(
            self.logFiles_abspath, self.checkRes_logFile_name
        )
        logFile_fullname = os.path.normpath(logFile_fullname)
        self.logFile_fullname = logFile_fullname

    def _set_visit_abspath(self):
        dataDirectory = self.visit_path_user_input
        # we expect a full path, but the user may have given a relative path
        if not os.path.isdir(self.visit_path_user_input):
            # most likely a relative path from the current directory
            dataDirectory = os.path.join(os.getcwd(), "..", dataDirectory)
            if not os.path.isdir(dataDirectory):
                raise NotADirectoryError(f"{dataDirectory} is not a directory")
            dataDirectory = os.path.normpath(dataDirectory)
        # ensure we have an absolute path
        dataDirectory = os.path.abspath(dataDirectory)
        self.visit_abspath = dataDirectory

    def _set_relVisitPath(self):
        fromDirectory = os.getcwd()
        relpathToDataDirectory = os.path.relpath(self.visit_abspath, fromDirectory)
        self.visit_relpath = relpathToDataDirectory

    def find_first_difference(self, str1, str2):
        min_length = min(len(str1), len(str2))

        for i in range(min_length):
            if str1[i] != str2[i]:
                return f"First difference at position {i}: '{str1[i]}' != '{str2[i]}'"

        if len(str1) != len(str2):
            return f"Strings differ in length. Extra character at position {min_length}: '{str1[min_length:]}' != '{str2[min_length:]}'"

        return None  # "Strings are identical"

    def _checkFilesByVisit(self):
        """
        Checks the files in the data directory against the expected files.
        It then saves the good files and the bad files in two lists:
        - self.goodFilesList
        - self.badFilesList

        Returns:
            None
        """

        # simpler names for the attributes
        visit_abspath = self.visit_abspath
        expectedVisit = self._expectedRearmVisit
        goodFiles = self.goodFilesList
        badFiles = self.badFilesList
        checkResultFiles = self.logFiles.values()

        for directory, filesInDirectory in expectedVisit.expectedFiles:
            for fileName in filesInDirectory:
                # fullFnamePattern = os.path.join(visit_abspath, directory, fileName)
                # fullFnamePattern = os.path.normpath(fullFnamePattern)

                # foundFiles = glob.glob(fullFnamePattern)

                # V2 with path and pattern
                project_abspath = os.path.normpath(self.visit_abspath + "/../..")
                glob_path = os.path.join(project_abspath, directory)
                glob_pattern = fileName
                fullFnamePattern = os.path.join(glob_path, glob_pattern)
                fullFnamePattern = os.path.normpath(fullFnamePattern)
                foundFiles = glob.glob(fullFnamePattern)
                # foundFiles = glob.glob(glob_path + os.sep + glob_pattern)

                if len(foundFiles) == 1:
                    goodFiles.append(foundFiles[0])
                else:
                    # we have a problem: either 0 or multiple files found
                    message = ""
                    if len(foundFiles) == 0:
                        message += "    File not found"
                    if len(foundFiles) > 1:
                        message += "    Multiple files found:"
                        for problem in foundFiles:
                            message += "\n      " + problem
                    badFiles.append([fullFnamePattern, message])

        # badFiles holds [file name or pattern, message] pairs, so a file name
        # must be compared with the first element of each pair
        badFileNames = [badFile[0] for badFile in badFiles]

        for fullname in self.allFilesList:
            if fullname not in goodFiles and fullname not in badFileNames:
                if os.path.basename(fullname) not in checkResultFiles:
                    if self._is_file_in_dirs_to_check(fullname):
                        badFiles.append([fullname, "    File not expected"])

        self.goodFilesList = goodFiles
        self.badFilesList = badFiles

    def _is_file_in_dirs_to_check(self, fullname):
        """
        Check if the file is in the list of directories to check.
        """
        lastDirectory = last(os.path.split(os.path.dirname(fullname)))
        last_token = lastDirectory.split("_")[-1]
        expectedVisit = self._expectedRearmVisit
        if last_token in expectedVisit._records.values():
            return True
        return False

    def _printAllFiles(self):
        """
        Prints all the files in the data directory.
        """
        nbFiles = len(self.allFilesList)
        print("All {} files:".format(nbFiles))
        for file in self.allFilesList:
            print(f"  {file}")

    def _printAndSaveProblems(self):
        """
        Prints the problems encountered during file checking.
        It also writes the problems to a log file.

        Returns:
            None
        """
        nExpectedFiles = self._expectedRearmVisit.totalExpectedFiles
        nBadFiles = len(self.badFilesList)
        visitDirectory = os.path.basename(self.visit_abspath)
        msg = []
        msg.append(f"In {visitDirectory}:")
        if len(self.badFilesList) == 0:
            msg.append(f"- Found all the {nExpectedFiles} expected files.")
        else:
            msg.append(f"- Problem with {nBadFiles} files:")
            # define a set of possible problems from the messages
            possibleProblems = set()
            for problem in self.badFilesList:
                possibleProblems.add(problem[1])
            # print the problems by type
            for problem in possibleProblems:
                msg.append(f" {problem}:")
                for file in self.badFilesList:
                    if file[1] == problem:
                        file_msg = os.path.relpath(file[0], self.visit_abspath)
                        msg.append(f"    {file_msg}")
        msg = "\n".join(msg)
        print(msg)
        print(f"See {self.logFile_fullname}")
        with open(self.logFile_fullname, "w") as f:
            f.write(msg)

    def _saveGoodFiles(self):
        """
        Save the list of good files to a log file.
        """
        saveFile = self.goodFiles_logFile_name
        saveFile = os.path.join(self.logFiles_abspath, saveFile)
        print(f"See {saveFile}")
        with open(saveFile, "w") as f:
            for goodFile in self.goodFilesList:
                goodFile = os.path.relpath(goodFile, self.visit_abspath)
                f.write(f"{goodFile}\n")

## Run the check of the data files

In [None]:
if doRunTests:
    # pathToData = "dat/ReArm.lnk/ReArm_C1P02/ReArm_C1P02_20210306_V1"
    # pathToData = "dat/ReArm.lnk/ReArm_C1P02/ReArm_C1P02_20210419_V2"
    # pathToData = "dat/ReArm.lnk/ReArm_C1P02/ReArm_C1P02_20210715_V3"

    pathToData = "dat/ReArm.lnk/ReArm_C1P07/ReArm_C1P07_20210716_V1"
    # pathToData = "dat/ReArm.lnk/ReArm_C1P07/ReArm_C1P07_20210820_V2"
    # pathToData = "dat/ReArm.lnk/ReArm_C1P07/ReArm_C1P07_20211116_V3"

    CheckFilesInRearmVisit(pathToData).run_check()

# Testing the class `CheckFilesInRearmVisit`

To test the class `CheckFilesInRearmVisit`, I ran some manual tests:

- Delete or rename one or several files (e.g., with a wrong token[5] in the name)
  - This is detected as one or several missing files
- Duplicate a file with a wrong date in the name
  - This is detected as a duplicate file
- Manually verify that the errors are correctly detected (after running `renameFilesInVisit.ipynb`)
  - in ReArm_C1P02
  - in ReArm_C1P07
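
The first two manual tests could also be reproduced on throwaway files. The sketch below is an assumption about how such a test might look, not part of the notebook: it works in a temporary directory, uses made-up file names, and only counts the glob matches for one expected pattern. `simulate_manual_tests` is a hypothetical name.

```python
import glob
import os
import tempfile


def simulate_manual_tests():
    """Count the matches with one file, with a wrong-date duplicate, and with no file."""
    pattern = "ReArm_C?P??_????????_V?_r_k.csv"
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "ReArm_C1P02_20210306_V1_r_k.csv")
        duplicate = os.path.join(tmp, "ReArm_C1P02_20210307_V1_r_k.csv")

        def count():
            # escape the directory so that only the file name acts as a pattern
            return len(glob.glob(os.path.join(glob.escape(tmp), pattern)))

        open(original, "w").close()
        one_file = count()  # expected file present: exactly one match
        open(duplicate, "w").close()
        with_duplicate = count()  # duplicate with another date: two matches
        os.remove(original)
        os.remove(duplicate)
        no_file = count()  # files deleted: no match
    return one_file, with_duplicate, no_file


print(simulate_manual_tests())  # prints (1, 2, 0)
```

One match corresponds to a good file, two matches to a duplicate, and zero matches to a missing file.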


