# Verify the presence of data files in a ReArm visit

The purpose of this notebook is to verify the presence of data files in a ReArm visit. The checks performed are as follows:

- Are all the expected data files for the visit present in the directory?
- Are there any extra files in the directory?

The results of the checks are stored in two files:

- `goodFiles.log`: Contains the list of files that are present in the directory and expected for the visit.
- `checkFilesInRearmVisit.log`: Contains information about any problems encountered during the checks.

Please note that this notebook assumes a specific directory structure and naming convention for the data files. The details of the directory structure and naming convention can be found in the subsequent cells.


## Location of the data files

The data files are accessed via the `dat/ReArm.lnk` symlink, which points to the directory where the data files are stored. This symlink is ignored by Git (as specified in the `.gitignore` file). This is important because we do not want to share the data files in Git.

The directory structure is as follows:

    dat
    └── ReArm.lnk 
        ├── ReArm_C1P02
        │   ├── Accelerometry
        │   ├── Armeo
        │   ├── Circle
        │   ├── Reaching
        │   └── Scan
        ├── ...
        ...
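
As an illustration of this layout, the visit directories behind the symlink could be listed as follows. This is only a sketch: the function name `list_visit_directories` is hypothetical, and the pattern `ReArm_C?P??` assumes that every visit directory is named like `ReArm_C1P02`.

```python
from pathlib import Path


def list_visit_directories(data_root):
    """Return the sorted visit directories found directly under data_root.

    data_root is typically the dat/ReArm.lnk symlink.
    """
    root = Path(data_root)
    # is_dir() follows symlinks and is False for a broken symlink
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory (broken symlink?)")
    # ReArm_C?P??: one character for the center, two for the participant
    # (assumed naming of the visit directories)
    return sorted(p for p in root.glob("ReArm_C?P??") if p.is_dir())
```

For example, `list_visit_directories("dat/ReArm.lnk")` would return one `Path` per visit directory when run from the root of the project.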



## Location of the check files

The check files are `*.log` files located directly at the root of the directory that is checked.

For example:

```
...
└── ReArm_C1P02 
    ├── checkFilesInRearmVisit.log
    ├── goodFiles.log
    ├── ...
    ├── Accelerometry
    ├── ...
    └── Scan
    ...
```
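
A downstream script could read `goodFiles.log` back to get the validated files of a visit. The sketch below assumes the format written by `saveGoodFiles` (one path per line, relative to the visit directory); the function name `read_good_files` is hypothetical.

```python
import os


def read_good_files(visit_directory, log_name="goodFiles.log"):
    """Return the paths listed in the goodFiles.log of a visit directory."""
    log_file = os.path.join(visit_directory, log_name)
    with open(log_file) as f:
        # one path per line, relative to the visit directory; skip empty lines
        relative_paths = [line.strip() for line in f if line.strip()]
    # join each relative path to the visit directory
    return [
        os.path.normpath(os.path.join(visit_directory, p)) for p in relative_paths
    ]
```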


## Structure of the name of the data files

The data files are named according to the following structure:

```
<project>_<participant>_<date>_<visit>_<record>_<record-specific-file>.<extension>
```

where:

- `<project>` is the name of the project (`ReArm`)
- `<participant>` is the participant ID (e.g., `C1P02`: `C1` for the center and `P02` for the participant)
- `<date>` is the date of the recording as YYYYMMDD (e.g., `20190131`)
- `<visit>` is the visit label (`1`, `2`, `3`, `Inc`, `Reed`)
- `<record>` is the name of the record (`r`, `c`, `a`, `ac`, `Scan`)
- `<record-specific-file>` is a string that depends on the type of record; it is omitted, together with its leading underscore, for some files (e.g., `.easy`, `.oxy4`, `.xdf`)
- `<extension>` depends on the type of record (`csv`, `easy`, `oxy3`, `oxy4`, `xdf`, `cwa`, `pdf`)
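
To make the naming structure concrete, here is a minimal sketch that splits a file name into these components with a regular expression. The names `FILE_NAME_PATTERN` and `parse_file_name` are hypothetical, and the pattern assumes that the center is one digit and the participant two digits (as in `C1P02`).

```python
import re

# Assumption: participant IDs look like C<digit>P<two digits>, e.g. C1P02
FILE_NAME_PATTERN = re.compile(
    r"^(?P<project>ReArm)"
    r"_(?P<participant>C\dP\d{2})"
    r"_(?P<date>\d{8})"
    r"_(?P<visit>1|2|3|Inc|Reed)"
    r"_(?P<record>r|c|ac|a|Scan)"
    r"(?:_(?P<specific>[^.]+))?"  # record-specific part, may be absent
    r"\.(?P<extension>[A-Za-z0-9]+)$"
)


def parse_file_name(file_name):
    """Return the components of a data file name as a dict, or None if it does not match."""
    match = FILE_NAME_PATTERN.match(file_name)
    return match.groupdict() if match else None


info = parse_file_name("ReArm_C1P02_20190131_1_r_l_m_mau_np.csv")
print(info["participant"], info["visit"], info["record"], info["specific"])
```

For a file without a record-specific part, such as `ReArm_C1P02_20190131_1_r.easy`, the `specific` entry is `None`.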




The following table shows the different types of `<visit>`:

|`<visit>`  | Content   |
| :--      | :------  |   
| `Inc` | Inclusion  |
| `1` | Just before the start of therapy  |
| `2` | Just after the end of therapy  |
| `3` | 3 months after the end of therapy  |
| `Reed` | During the therapy  |


The following table shows the different types of records and the corresponding `<record>`:

| `<record>` | directory | Record   |
| :-- | :-----------     | :------  |   
| `r` | `Reaching`       | Reaching task |
| `c` | `Circle`         | Circular Steering task  | 
| `a` | `Armeo`          | Armeo's Ladybug Game|
| `ac` | `Accelerometry` | Wrist accelerometer at home |
| `Scan` | `Scan` | Scanned paper (manually filled data sheet) |




The following table shows, for each type of record, the end of the file name, that is, the `<record-specific-file>` with its leading underscore (if any) followed by the `<extension>`:

| `<record>` | `<record-specific-file>` | Content |
|--------|--------------------------|-------------|
| `r` | `_k.csv` | Kinect MoCap  |
| `r` | `_k_m.csv` | Kinect markers |
| `r` | `_l_m_mau_np.csv` | l markers for `mau` and `np` |
| `r` | `_l_m_mau_p.csv` | l markers for `mau` and `p` |
| `r` | `_l_m_sau_np.csv` | l markers for `sau` and `np` |
| `r` | `_l_m_sau_p.csv` | l markers for `sau` and `p` |
| `r` | `.easy` | Oxysoft csv export  |
| `r` | `.oxy4` | Oxysoft binary data |
| `r` | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `c` | `_k.csv` | Kinect MoCap  |
| `c` | `_k_m.csv` | Kinect markers |
| `c` | `_l_m_np.csv` | l markers for `np` |
| `c` | `_l_m_p.csv` | l markers for `p` |
| `c` | `_l_np.csv` | l mouse mocap for `np` |
| `c` | `_l_p.csv` | l mouse mocap for `p` |
| `c` | `.easy` | Oxysoft csv export  |
| `c` | `.oxy4` | Oxysoft binary data |
| `c` | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `a` | `.easy` | Oxysoft csv export  |
| `a` | `.oxy4` | Oxysoft binary data |
| `a` | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `ac` | `_p.cwa` | Accelerometer data for `p` |
| `ac` | `_np.cwa` | Accelerometer data for `np` |
| | | |
| `Scan` | `_Acti1.pdf` | Activity questionnaire  |
| `Scan` | `_SetupW.pdf` | Setup for clinical tests |
| `Scan` | `_V1.pdf` | Results of clinical tests at V1 |
| `Scan` | `_DataSheet.pdf` | Complementary information (lab notebook) |

The following table shows how to interpret the items that make up a `<record-specific-file>`:

| Item | Meaning |
|--------| -------------|
| `mau` |  maximal arm use   |
| `sau` |  spontaneous arm use   |
| `np` |  non-paretic arm   |
| `p` |  paretic arm   |


# Script to check the data files

The script checks that every expected data file is present in the visit directory and reports the missing, duplicated, and unexpected files.  
It follows an object-oriented approach.  

I use two classes, `ExpectedRearmVisit` and `CheckFilesInRearmVisit`, which are defined in the cells below.


In [None]:
# Set this to False for production use and to True for testing (the default).
# It can already be set to False from outside (e.g., by the test script).

if "doRunTests" not in globals():
    doRunTests = True

## class ExpectedRearmVisit

- `ExpectedRearmVisit`: Represents the expected files and directories for a ReArm visit. 


In [None]:
class ExpectedRearmVisit:
    """
    Represents the expected files and directories for a ReArm visit.

    Attributes:
        totalExpectedFiles (int): The total number of expected files.
        expectedFiles (list): A list of expected files organized by record directory.

    Methods:
        set_totalExpectedFiles: Sets the totalExpectedFiles attribute.
        set_expectedFiles: Sets the expectedFiles attribute.
        print: Prints the expected files and directories.

    """

    # Class variables (shared among all instances)
    totalExpectedFiles = 0
    expectedFiles = []

    # All the variables below should be considered:
    # - protected => _variableName
    # - constant (should not be modified by instances)

    # a pattern to search for files with glob
    _begOfFileName = "ReArm_C?P??_????????_?_"

    # end of file names in the reaching directory
    _reachingFiles = [
        "_k.csv",
        "_k_m.csv",
        "_l_m_mau_np.csv",
        "_l_m_mau_p.csv",
        "_l_m_sau_np.csv",
        "_l_m_sau_p.csv",
        ".easy",
        ".oxy4",
        ".xdf",
    ]
    # end of file names in the circle directory
    _circleFiles = [
        "_k.csv",
        "_k_m.csv",
        "_l_m_np.csv",
        "_l_m_p.csv",
        "_l_np.csv",
        "_l_p.csv",
        ".easy",
        ".oxy4",
        ".xdf",
    ]
    # end of file names in the armeo directory
    _armeoFiles = [
        ".easy",
        ".oxy4",
    ]
    # end of file names in the accelerometer directory
    _accelerometerFiles = [
        "_p.cwa",
        "_np.cwa",
    ]
    # end of file names in the scan directory
    _scanFiles = [
        "_Acti1.pdf",
        "_SetupW.pdf",
        "_V1.pdf",
        "_DataSheet.pdf",
    ]

    # Dictionaries linking infFileLabel to the meaning
    _visits = {
        # infFileLabel : meaning
        "1": "Visit 1",
        "2": "Visit 2",
        "3": "Visit 3",
    }
    _records = {
        # infFileLabel : record directory
        "r": "Reaching",
        "c": "Circle",
        "a": "Armeo",
        "ac": "Accelerometry",
        "Scan": "Scan",
    }
    _endOfFileNames = {
        # infFileLabel : list of end of file names
        "r": _reachingFiles,
        "c": _circleFiles,
        "a": _armeoFiles,
        "ac": _accelerometerFiles,
        "Scan": _scanFiles,
    }

    def __init__(self):
        self.__set_totalExpectedFiles()
        self.__set_expectedFiles()

    def __set_totalExpectedFiles(self):
        # sum the number of expected file endings over all the record types
        self.totalExpectedFiles = sum(
            len(self._endOfFileNames[record]) for record in self._records
        )

    def __set_expectedFiles(self):
        # the list matches the directory structure of the ReArm visit
        # each element is a list of two elements:
        #   1) the record directory
        #   2) a list of file names for that record
        self.expectedFiles = []
        for recordText, recordDirectory in self._records.items():
            # sub list of file names for this record
            fileListForRecord = []
            for endOfFileName in self._endOfFileNames[recordText]:
                fileListForRecord.append(
                    self._begOfFileName + recordText + endOfFileName
                )
            self.expectedFiles.append([recordDirectory, fileListForRecord])

    def print(self):
        print("Expected files: {:d}".format(self.totalExpectedFiles))
        for recordDirectory, fileListForRecord in self.expectedFiles:
            print("{:s}".format(recordDirectory))
            for fileName in fileListForRecord:
                print("   {:s}".format(fileName))


if doRunTests:
    ExpectedRearmVisit().print()

## class CheckFilesInRearmVisit

- `CheckFilesInRearmVisit`: checks if the files are present in the directory
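
The core rule of the check is that each expected pattern must match exactly one file. The sketch below illustrates this rule on plain file names; it uses `fnmatchcase` as a stand-in for the `glob` call of the class, and the function name `classify` is hypothetical.

```python
from fnmatch import fnmatchcase


def classify(pattern, file_names):
    """Classify a pattern by the number of file names it matches."""
    found = [name for name in file_names if fnmatchcase(name, pattern)]
    if len(found) == 1:
        return "good", found
    if len(found) == 0:
        return "File not found", found
    return "Multiple files found", found


pattern = "ReArm_C?P??_????????_?_ac_p.cwa"
status, found = classify(pattern, ["ReArm_C1P02_20190131_1_ac_p.cwa"])
print(status, found)
```

Note that `?` matches exactly one character, so a pattern of this form matches the one-character visit labels (`1`, `2`, `3`) but not `Inc` or `Reed`.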


In [None]:
import os
import glob


class CheckFilesInRearmVisit:
    """
    Checks the files of a ReArm visit against the expected files.

    Attributes:
        visitPath (str): The visit directory as given by the user.
        absVisitPath (str): The absolute path to the visit directory.
        relVisitPath (str): The path to the visit directory relative to the current working directory.
        resPath (str): The path to the 'res' directory in the root of the project.
        checkLogPath (str): The path where the check result files are saved.
        checkLogFullFname (str): The full file name (including path) of the log file.
        goodFilesList (list): The files that are present and expected.
        badFilesList (list): [file or pattern, message] pairs describing the problems.
        allFilesList (list): All the files found in the visit directory.

    Methods:
        __init__(self, dataDirectory): Runs the whole check and saves the results.
        set_allFilesList(self): Sets the list of all files in the visit directory.
        defineAllPaths(self): Defines all the paths required for the check.
        getPathAfterLastLnk(self, path): Returns the path after the last directory ending with '.lnk'.
        set_logFullFileName(self): Sets the full file name of the log file.
        set_absVisitPath(self): Sets the absolute path to the visit directory.
        set_relVisitPath(self): Sets the relative path to the visit directory.
        set_resPath(self): Checks if the 'res' directory exists in the root of the project and creates it if not.
        define_logPath_in_res(self): Returns the log path inside the 'res' directory (currently unused).
        checkFilesByVisit(self): Checks the files in the visit directory against the expected files.
        printAllFiles(self): Prints all the files in the visit directory.
        printAndSaveProblems(self): Prints the problems and writes them to the log file.
        saveGoodFiles(self): Saves the list of good files.
    """

    # path variables initialized to empty strings
    visitPath = ""
    absVisitPath = ""
    relVisitPath = ""
    # res directory in the root of the project
    resPath = ""
    # where to save the check result files
    checkLogPath = ""
    checkLogFullFname = ""
    # name of files containing the check results
    checkLogFname = "checkFilesInRearmVisit.log"
    goodFilesFname = "goodFiles.log"
    badFilesFname = "badFiles.log"
    kinectToMouseDelayFname = "kinect_to_mouse_delay.tsv"
    # NOTE: these files will not be tagged as good or bad (if present in the visit directory)
    checkResultFiles = [
        checkLogFname,
        goodFilesFname,
        badFilesFname,
        kinectToMouseDelayFname,
    ]
    # lists of good and bad files
    goodFilesList = []
    badFilesList = []
    allFilesList = []
    
    def __init__(self, dataDirectory):
        self.visitPath = dataDirectory
        self.defineAllPaths()
        self.set_allFilesList()
        self.expectedVisit = ExpectedRearmVisit()
        self.checkFilesByVisit()
        self.printAndSaveProblems()
        self.saveGoodFiles()

    def set_allFilesList(self):
        """
        Sets the list of all files in the visit directory.
        """
        dataDirectory = self.absVisitPath
        allFiles = glob.glob(os.path.join(dataDirectory, "**/*"), recursive=True)
        allFiles = [f for f in allFiles if not os.path.isdir(f)]
        self.allFilesList = allFiles

    def defineAllPaths(self):
        self.set_absVisitPath()
        self.set_relVisitPath()
        # check results are saved next to the data
        self.checkLogPath = self.absVisitPath  # = self.define_logPath_in_res()
        self.set_logFullFileName()

    def getPathAfterLastLnk(self, path):
        """
        return the path after the last directory ending with .lnk (if any)
        otherwise return the path as is
        """
        parts = path.split(os.sep)
        # keep the parts after the last part ending with .lnk (if any)
        idx = -1
        for i, part in enumerate(parts):
            if part.endswith(".lnk"):
                idx = i
        if idx >= 0:
            parts = parts[idx + 1 :]

        path = os.sep.join(parts)
        return path

    def set_logFullFileName(self):
        logFileName = os.path.join(self.checkLogPath, self.checkLogFname)
        logFileName = os.path.normpath(logFileName)
        self.checkLogFullFname = logFileName

    def set_absVisitPath(self):
        dataDirectory = self.visitPath
        # we expect a full path, but the user may have given a relative path
        if not os.path.isdir(self.visitPath):
            # most likely a path relative to the parent of the current directory
            # (e.g., the notebook is run from a subdirectory of the project)
            dataDirectory = os.path.join(os.getcwd(), "..", dataDirectory)
            if not os.path.isdir(dataDirectory):
                raise NotADirectoryError(f"{dataDirectory} is not a directory")
            dataDirectory = os.path.normpath(dataDirectory)
        # ensure we have an absolute path
        dataDirectory = os.path.abspath(dataDirectory)
        self.absVisitPath = dataDirectory

    def set_relVisitPath(self):
        fromDirectory = os.getcwd()
        relpathToDataDirectory = os.path.relpath(self.absVisitPath, fromDirectory)
        self.relVisitPath = relpathToDataDirectory

    def set_resPath(self):
        """
        check if the res directory exists in the root of the project
        if not, create it
        """
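        # NOTE: notebooks might be run from subdirectories
        # go up in directories until we reach the root of the project
        # (the directory that contains `dat`), without changing the working directory
        root = os.getcwd()
        while not os.path.isdir(os.path.join(root, "dat")):
            parent = os.path.dirname(root)
            if parent == root:
                # reached the file system root without finding `dat`
                raise NotADirectoryError("no 'dat' directory found above " + os.getcwd())
            root = parent
        # create the res directory if it does not exist
        resPath = os.path.join(root, "res")
        if not os.path.isdir(resPath):
            os.mkdir(resPath)
        self.resPath = resPath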

    def define_logPath_in_res(self):
        """
        The path is:
        - the res directory in the root of the project
        - the name of the class
        - the path after the last directory ending with .lnk in the data directory
        """
        # NOTE: log file in res make sense, but it is easier to find the log file if it is next to the data
        self.set_resPath()
        className = self.__class__.__name__
        afterLNK = self.getPathAfterLastLnk(self.absVisitPath)
        logPath = os.path.join(self.resPath, className, afterLNK)
        logPath = os.path.normpath(logPath)
        if not os.path.isdir(logPath):
            print(f"Creating directory {logPath}")
            os.makedirs(logPath)
        return logPath

    def checkFilesByVisit(self):
        """
        Checks the files in the data directory against the expected files.
        It then saves the good files and the bad files in two lists:
        - self.goodFilesList
        - self.badFilesList

        Returns:
            None
        """

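        absPathToData = self.absVisitPath
        expectedVisit = self.expectedVisit
        # start from fresh lists: goodFilesList and badFilesList are class-level
        # lists, so appending to them would share the results between instances
        goodFiles = []
        badFiles = []
        checkResultFiles = self.checkResultFiles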

        for directory, filesInDirectory in expectedVisit.expectedFiles:
            for fileName in filesInDirectory:
                fullFnamePattern = os.path.join(absPathToData, directory, fileName)
                fullFnamePattern = os.path.normpath(fullFnamePattern)

                foundFiles = glob.glob(fullFnamePattern)
                if len(foundFiles) == 1:
                    goodFiles.append(foundFiles[0])
                else:
                    # we have a problem
                    message = ""
                    if len(foundFiles) == 0:
                        message += "    File not found"
                    if len(foundFiles) > 1:
                        message += "    Multiple files found:"
                        for problem in foundFiles:
                            message += "\n      " + problem
                    badFiles.append([fullFnamePattern, message])

        allFiles = self.allFilesList
        for file in allFiles:
            # badFiles holds [pattern, message] pairs, so only goodFiles is searched
            if file not in goodFiles:
                if os.path.basename(file) not in checkResultFiles:
                    badFiles.append([file, "    File not expected"])

        self.goodFilesList = goodFiles
        self.badFilesList = badFiles

    def printAllFiles(self):
        """
        Prints all the files in the data directory.
        """
        nbFiles = len(self.allFilesList)
        print("All {} files:".format(nbFiles))
        for file in self.allFilesList:
            print(f"  {file}")

    def printAndSaveProblems(self):
        """
        Prints the problems encountered during file checking.
        It also writes the problems to the log file specified by `self.checkLogFullFname`.

        Returns:
            None
        """
        nExpectedFiles = self.expectedVisit.totalExpectedFiles
        nBadFiles = len(self.badFilesList)
        visitDirectory = os.path.basename(self.absVisitPath)
        msg = []
        msg.append(f"In {visitDirectory}:")
        if len(self.badFilesList) == 0:
            msg.append(f"- Found all the {nExpectedFiles} expected files.")
        else:
            msg.append(f"- Problem with {nBadFiles} files:")
            # define the set of possible problems from the messages
            possibleProblems = set()
            for problem in self.badFilesList:
                possibleProblems.add(problem[1])
            # print the problems by type (sorted for a reproducible log)
            for problem in sorted(possibleProblems):
                msg.append(f" {problem}:")
                for file in self.badFilesList:
                    if file[1] == problem:
                        file_msg = os.path.relpath(file[0], self.absVisitPath)
                        msg.append(f"    {file_msg}")
        msg = "\n".join(msg)
        print(msg)
        print(f"See {self.checkLogFullFname}")
        with open(self.checkLogFullFname, "w") as f:
            f.write(msg)

    def saveGoodFiles(self):
        saveFile = self.goodFilesFname
        saveFile = os.path.join(self.checkLogPath, saveFile)
        print(f"See {saveFile}")
        with open(saveFile, "w") as f:
            for goodFile in self.goodFilesList:
                goodFile = os.path.relpath(goodFile, self.absVisitPath)
                f.write(f"{goodFile}\n")

## Run the check of the data files

In [None]:
if doRunTests:
    pathToData = "dat/ReArm.lnk/ReArm_C1P02"
    C1P02_V1 = CheckFilesInRearmVisit(pathToData)

# Testing the class `CheckFilesInRearmVisit`

To test the class `CheckFilesInRearmVisit`, I ran some manual tests:

- Delete or rename one or several files (e.g., with a wrong token[5] in the name)
  - This is detected as one or several missing files
- Duplicate a file with a different date in the name
  - This is detected as a duplicate file (`Multiple files found`)
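
These manual tests could be automated on a temporary directory. The sketch below does not call `CheckFilesInRearmVisit`; it only reproduces the underlying `glob` counts for one expected pattern, and the helper names (`touch`, `count_matches`, `simulate_manual_tests`) are hypothetical.

```python
import glob
import os
import tempfile


def touch(path):
    """Create an empty file, creating its parent directories if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w"):
        pass


def count_matches(directory, pattern):
    """Count the files in directory whose name matches the glob pattern."""
    # escape the directory so that only the file name part acts as a pattern
    return len(glob.glob(os.path.join(glob.escape(directory), pattern)))


def simulate_manual_tests():
    """Return the match counts for an intact, a duplicated, and a renamed file."""
    pattern = "ReArm_C?P??_????????_?_ac_p.cwa"
    with tempfile.TemporaryDirectory() as visit_dir:
        record_dir = os.path.join(visit_dir, "Accelerometry")
        original = os.path.join(record_dir, "ReArm_C1P02_20190131_1_ac_p.cwa")
        touch(original)
        intact = count_matches(record_dir, pattern)

        # duplicate the file with a different date in the name
        duplicate = os.path.join(record_dir, "ReArm_C1P02_20190201_1_ac_p.cwa")
        touch(duplicate)
        duplicated = count_matches(record_dir, pattern)

        # remove the duplicate, then rename the original with a wrong item
        os.remove(duplicate)
        renamed_file = os.path.join(record_dir, "ReArm_C1P02_20190131_1_ac_x.cwa")
        os.rename(original, renamed_file)
        renamed = count_matches(record_dir, pattern)
    return intact, duplicated, renamed


print(simulate_manual_tests())
```

The function returns `(1, 2, 0)`: one match for the intact file, two after the duplication, and none after the renaming.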


