# Check data files for the ReArm project

The purpose of this notebook is to verify the presence of data files in a ReArm visit. The checks performed are as follows:

- Are all the expected data files for the visit present in the directory?
- Are there any extra files in the directory?

The results of the checks are stored in two files:

- `goodFiles.log`: Contains the list of files that are present in the directory and expected for the visit.
- `checkFilesInRearmVisit.log`: Contains information about any problems encountered during the checks.

Please note that this notebook assumes a specific directory structure and naming convention for the data files. The details of the directory structure and naming convention can be found in the subsequent cells.


## Location of the data files

The data files are accessed via the `dat/ReArm.lnk` symlink, which points to the directory where the data files are stored. This symlink is ignored by Git (as specified in the `.gitignore` file). This is important because we do not want to share the data files in Git.

The directory structure is as follows:

    dat
    └── ReArm.lnk 
        ├── ReArm_C1P02
        │   ├── Accelerometry
        │   ├── Armeo
        │   ├── Circle
        │   ├── Reaching
        │   └── Scan
        ├── ...
        ...
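
As a minimal sketch of how this layout can be explored, the snippet below lists the participant directories found behind the symlink. The helper name `list_participant_dirs` is hypothetical and not part of the notebook; it only assumes that the link resolves to a directory containing one sub-directory per participant.

```python
import os


def list_participant_dirs(link_path):
    """Return the sorted names of the directories found behind link_path."""
    # realpath resolves the symlink (it also works on a plain directory)
    target = os.path.realpath(link_path)
    return sorted(
        entry
        for entry in os.listdir(target)
        if os.path.isdir(os.path.join(target, entry))
    )
```

For example, `list_participant_dirs("dat/ReArm.lnk")` would return a list such as `["ReArm_C1P02", ...]` when run from the root of the project.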



## Structure of the name of the data files

The data files are named according to the following structure:

```
<project>_<participant>_<date>_<visit>_<record>_<record-specific-file>.<extension>
```

where:

- `<project>` is the name of the project (`ReArm`)
- `<participant>` is the participant ID (e.g., `C1P02`: `C1` for the center and `P02` for the participant)
- `<date>` is the date of the recording (e.g., `20190131`, formatted as YYYYMMDD)
- `<visit>` is the visit label (`1`, `2`, `3`, `Inc`, `Reed`)
- `<record>` is the name of the record (`r`, `c`, `a`, `ac`, `Scan`)
- `<record-specific-file>` is a string that depends on the type of record
- `<extension>` depends on the type of record (`csv`, `easy`, `oxy3`, `oxy4`, `xdf`, `cwa`, `pdf`)
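
The naming structure can be illustrated with a small parser. This is only a sketch: the regular expression `FILE_NAME_RE` and the function `parse_file_name` are hypothetical names, and the pattern assumes a one-digit center, a two-digit participant number, and an optional record-specific part (files such as `..._r.easy` have none).

```python
import re

# Hypothetical pattern built from the naming structure described above
FILE_NAME_RE = re.compile(
    r"^(?P<project>ReArm)_"
    r"(?P<participant>C\dP\d{2})_"      # assumed: C + 1 digit, P + 2 digits
    r"(?P<date>\d{8})_"                 # YYYYMMDD
    r"(?P<visit>1|2|3|Inc|Reed)_"
    r"(?P<record>r|c|ac|a|Scan)"
    r"(?:_(?P<specific>\w+))?"          # optional record-specific part
    r"\.(?P<extension>\w+)$"
)


def parse_file_name(file_name):
    """Return the components of a data file name, or None if it does not match."""
    match = FILE_NAME_RE.match(file_name)
    return match.groupdict() if match else None


parts = parse_file_name("ReArm_C1P02_20190131_1_r_l_m_mau_np.csv")
print(parts)
```

A name that does not follow the structure (wrong project, malformed date, unknown record) yields `None`, which makes the parser usable as a quick sanity check on individual files.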




The following table shows the different types of `<visit>` :

|`<visit>`  | Content   |
| :--      | :------  |   
| `Inc` | Inclusion  |
| `1` | Just before the start of therapy  |
| `2` | Just after the end of therapy  |
| `3` | 3 months after the end of therapy  |
| `Reed` | During the therapy  |


The following table shows the different types of records and the corresponding `<record>`:

| `<record>` | directory | Record   |
| :-- | :-----------     | :------  |   
| `r` | `Reaching`       | Reaching task |
| `c` | `Circle`         | Circular Steering task  | 
| `a` | `Armeo`          | Armeo's Ladybug Game|
| `ac` | `Accelerometry` | Wrist accelerometer at home |
| `Scan` | `Scan` | Scanned paper (manually filled data sheet) |




The following table shows the different types of records and the corresponding `<record-specific-file>`:

| `<record>` | `<record-specific-file>` | Content |
|--------|--------------------------|-------------|
| `r` | `_k.csv` | Kinect MoCap  |
| `r` | `_k_m.csv` | Kinect markers |
| `r` | `_l_m_mau_np.csv` | l markers for `mau` and `np` |
| `r` | `_l_m_mau_p.csv` | l markers for `mau` and `p` |
| `r` | `_l_m_sau_np.csv` | l markers for `sau` and `np` |
| `r` | `_l_m_sau_p.csv` | l markers for `sau` and `p` |
| `r` | `.easy` | Oxysoft csv export  |
| `r` | `.oxy4` | Oxysoft binary data |
| `r` | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `c` | `_k.csv` | Kinect MoCap  |
| `c` | `_k_m.csv` | Kinect markers |
| `c` | `_l_m_np.csv` | l markers for `np` |
| `c` | `_l_m_p.csv` | l markers for `p` |
| `c` | `_l_np.csv` | l mouse mocap for `np` |
| `c` | `_l_p.csv` | l mouse mocap for `p` |
| `c` | `.easy` | Oxysoft csv export  |
| `c` | `.oxy4` | Oxysoft binary data |
| `c` | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `a` | `.easy` | Oxysoft csv export  |
| `a` | `.oxy4` | Oxysoft binary data |
| `a` | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `ac` | `_p.cwa` | Accelerometer data for `p` |
| `ac` | `_np.cwa` | Accelerometer data for `np` |
| | | |
| `Scan` | `_Acti1.pdf` | Activity questionnaire  |
| `Scan` | `_SetupW.pdf` | Setup for clinical tests |
| `Scan` | `_V1.pdf` | Results of clinical tests at V1 |
| `Scan` | `_DataSheet.pdf` | Complementary information (lab notebook) |

The following table shows how to interpret the items in a `<record-specific-file>`:

| `<record-specific-file>` | Content |
|--------| -------------|
| `mau` |  maximal arm use   |
| `sau` |  spontaneous arm use   |
| `np` |  non-paretic arm   |
| `p` |  paretic arm   |
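
The abbreviations in the table above can be expanded programmatically. The sketch below is illustrative only: `TOKEN_MEANINGS` and `describe_tokens` are hypothetical names, and the input is assumed to be the record-specific part without its extension (e.g., `l_m_mau_np`). Tokens not listed in the table are ignored.

```python
# Meanings taken from the table above
TOKEN_MEANINGS = {
    "mau": "maximal arm use",
    "sau": "spontaneous arm use",
    "np": "non-paretic arm",
    "p": "paretic arm",
}


def describe_tokens(record_specific):
    """Return the meanings of the known tokens, in their order of appearance."""
    tokens = record_specific.split("_")
    return [TOKEN_MEANINGS[t] for t in tokens if t in TOKEN_MEANINGS]


print(describe_tokens("l_m_mau_np"))
```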


# Script to check the data files

The script checks for the presence of the data files and prints information about missing or unexpected files.  
It takes an object-oriented approach to checking that all expected files are present in the directory.  

I use two classes, which are defined in the cells below:


## class ExpectedFilesInRearmVisit

- `ExpectedFilesInRearmVisit`: stores the expected file names and sub-directories for a ReArm visit
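
The expected file names are stored as wildcard patterns, where each `?` stands for exactly one character. The sketch below shows how such a pattern behaves, using `fnmatch.fnmatchcase` from the standard library as a stand-in for the matching that `glob` performs on file names. The example file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Same wildcard style as the class: prefix + record + end of file name
pattern = "ReArm_C?P??_????????_?_" + "r" + "_k.csv"

# A one-character visit label matches the single "?" of the pattern
matches_visit_1 = fnmatchcase("ReArm_C1P02_20190131_1_r_k.csv", pattern)

# A three-letter visit label such as "Inc" does not match a single "?"
matches_visit_inc = fnmatchcase("ReArm_C1P02_20190131_Inc_r_k.csv", pattern)

print(matches_visit_1, matches_visit_inc)
```

With this prefix, only one-character visit labels (`1`, `2`, `3`) can match; files of an `Inc` or `Reed` visit would need a different pattern.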


In [None]:
class ExpectedFilesInRearmVisit:
    """
    a class to hold the expected file names for a ReArm visit
    """

    # Class variables (shared among all instances)
    begOfFileName = "ReArm_C?P??_????????_?_"
    reachingFiles = [
        "_k.csv",
        "_k_m.csv",
        "_l_m_mau_np.csv",
        "_l_m_mau_p.csv",
        "_l_m_sau_np.csv",
        "_l_m_sau_p.csv",
        ".easy",
        ".oxy4",
        ".xdf",
    ]
    circleFiles = [
        "_k.csv",
        "_k_m.csv",
        "_l_m_np.csv",
        "_l_m_p.csv",
        "_l_np.csv",
        "_l_p.csv",
        ".easy",
        ".oxy4",
        ".xdf",
    ]
    armeoFiles = [
        ".easy",
        ".oxy4",
    ]
    accelerometerFiles = [
        "_p.cwa",
        "_np.cwa",
    ]
    scanFiles = [
        "_Acti1.pdf",
        "_SetupW.pdf",
        "_V1.pdf",
        "_DataSheet.pdf",
    ]
    visits = {
        # text : meaning
        "1": "Visit 1",
        "2": "Visit 2",
        "3": "Visit 3",
    }
    records = {
        # text : directory
        "r": "Reaching",
        "c": "Circle",
        "a": "Armeo",
        "ac": "Accelerometry",
        "Scan": "Scan",
    }
    endOfFileNames = {
        "r": reachingFiles,
        "c": circleFiles,
        "a": armeoFiles,
        "ac": accelerometerFiles,
        "Scan": scanFiles,
    }

    def __init__(self):
        self.setTotalExpectedFiles()
        self.setExpectedFileList()

    def setTotalExpectedFiles(self):
        # total number of expected files, summed over all record types
        self.totalExpectedFiles = 0
        for record in self.records:
            self.totalExpectedFiles += len(self.endOfFileNames[record])

    def setExpectedFileList(self):
        self.expectedFileList = []
        for recordText, recordDirectory in self.records.items():
            fileListForRecord = []
            for endOfFileName in self.endOfFileNames[recordText]:
                fileListForRecord.append(
                    self.begOfFileName + recordText + endOfFileName
                )
            self.expectedFileList.append([recordDirectory, fileListForRecord])

    def print(self):
        print("Expected files: {:d}".format(self.totalExpectedFiles))
        for recordDirectory, fileListForRecord in self.expectedFileList:
            print("{:s}".format(recordDirectory))
            for fileName in fileListForRecord:
                print("   {:s}".format(fileName))


# ExpectedFilesInRearmVisit().print()

## class CheckFilesInRearmVisit

- `CheckFilesInRearmVisit`: checks if the files are present in the directory
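
The logic of the check can be sketched independently of the file system: each expected pattern should match exactly one file, and any file not claimed by a pattern is unexpected. The function `classify_files` below is a hypothetical, simplified stand-in for the class, working on a plain list of file names rather than on a directory.

```python
from fnmatch import fnmatchcase


def classify_files(file_names, patterns):
    """Sort file names into good, missing, duplicated, and unexpected."""
    good, missing, duplicated = [], [], {}
    for pattern in patterns:
        matches = [name for name in file_names if fnmatchcase(name, pattern)]
        if len(matches) == 1:
            good.append(matches[0])        # exactly one file: as expected
        elif not matches:
            missing.append(pattern)        # no file for this pattern
        else:
            duplicated[pattern] = matches  # several files for one pattern
    # files matched by some pattern, whether good or duplicated
    claimed = set(good)
    for matches in duplicated.values():
        claimed.update(matches)
    unexpected = [name for name in file_names if name not in claimed]
    return good, missing, duplicated, unexpected
```

Keeping the four categories separate makes it straightforward to report each kind of problem under its own heading.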


In [None]:
import os
import glob


class CheckFilesInRearmVisit:
    """
    a class to check that the expected files are present in a ReArm visit
    """

    def __init__(self, dataDirectory):
        self.dataDirectory = dataDirectory
        self.defineAllPaths()
        self.set_allFilesList()
        self.goodFilesList = []
        self.badFilesList = []
        self.expectedFiles = ExpectedFilesInRearmVisit()

    def set_allFilesList(self):
        dataDirectory = self.absPathToData
        allFiles = glob.glob(os.path.join(dataDirectory, "**/*"), recursive=True)
        allFiles = [f for f in allFiles if not os.path.isdir(f)]
        self.allFilesList = allFiles

    def defineAllPaths(self):
        self.set_absPathToData()
        self.set_relPathToData()
        self.set_resPath()
        self.set_logPath()
        self.set_logFullFileName()

    def getPathAfterLastLnk(self, path):
        """
        return the path after the last directory ending with .lnk
        """
        # NOTE: do not name this variable "str", it would shadow the builtin
        parts = path.split(os.sep)
        # keep the parts after the last part ending with .lnk (if any)
        idx = -1
        for i, part in enumerate(parts):
            if part.endswith(".lnk"):
                idx = i
        if idx >= 0:
            parts = parts[idx + 1 :]

        return os.sep.join(parts)
    
    def set_logFullFileName(self):
        logFileName = "checkFilesInRearmVisit.log"
        self.set_logPath()
        logFileName = os.path.join(self.logPath, logFileName)
        logFileName = os.path.normpath(logFileName)
        self.logFullFileName = logFileName

    def set_absPathToData(self):
        dataDirectory = self.dataDirectory
        # we expect a full path, but the user may have given a relative path
        if not os.path.isdir(self.dataDirectory):
            # most likely a relative path from the current directory
            dataDirectory = os.path.join(os.getcwd(), "..", dataDirectory)
            if not os.path.isdir(dataDirectory):
                raise NotADirectoryError(f"{dataDirectory} is not a directory")
            dataDirectory = os.path.normpath(dataDirectory)
        # ensure we have an absolute path
        dataDirectory = os.path.abspath(dataDirectory)
        self.absPathToData = dataDirectory

    def set_relPathToData(self):
        fromDirectory = os.getcwd()
        relpathToDataDirectory = os.path.relpath(self.absPathToData, fromDirectory)
        self.relPathToData = relpathToDataDirectory

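    def set_resPath(self):
        """
        check if the res directory exists in the root of the project
        if not, create it
        """
        # NOTE: notebooks might be run from subdirectories
        # go up in directories until we reach the root of the project,
        # i.e., the first directory that contains "dat"
        # (no os.chdir, so the working directory is never left changed)
        root = os.getcwd()
        while not os.path.isdir(os.path.join(root, "dat")):
            parent = os.path.dirname(root)
            if parent == root:
                # reached the filesystem root: stop instead of looping forever
                raise FileNotFoundError("No 'dat' directory found in any parent directory")
            root = parent
        # create the res directory if it does not exist
        resPath = os.path.join(root, "res")
        os.makedirs(resPath, exist_ok=True)
        self.resPath = resPath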

    def set_logPath(self):
        """
        return: the path where to save the check result files
        The path is:
        - the res directory in the root of the project
        - the name of the class
        - the path after the last directory ending with .lnk in the data directory
        """

        self.set_resPath()
        className = self.__class__.__name__
        afterLNK = self.getPathAfterLastLnk(self.absPathToData)
        logPath = os.path.join(self.resPath, className, afterLNK)
        logPath = os.path.normpath(logPath)  # clean up the path
        if not os.path.isdir(logPath):
            print(f"Creating directory {logPath}")
            os.makedirs(logPath)
        self.logPath = logPath

    def checkFilesByVisit(self):
        absPathToData = self.absPathToData
        expected = self.expectedFiles
        goodFileList = self.goodFilesList
        badFileList = self.badFilesList
        # files already reported as duplicates (badFileList holds
        # [pattern, message] pairs, so it cannot be searched by file name)
        reportedFiles = set()

        for directory, fileListInDirectory in expected.expectedFileList:
            for fileName in fileListInDirectory:
                fullFnamePattern = os.path.join(absPathToData, directory, fileName)
                fullFnamePattern = os.path.normpath(fullFnamePattern)

                foundFiles = glob.glob(fullFnamePattern)
                if len(foundFiles) == 1:
                    goodFileList.append(foundFiles[0])
                else:
                    # we have a problem
                    message = ""
                    if len(foundFiles) == 0:
                        message += "    File not found"
                    if len(foundFiles) > 1:
                        message += "    Multiple files found:"
                        for problem in foundFiles:
                            message += "\n      " + problem
                        reportedFiles.update(foundFiles)
                    badFileList.append([fullFnamePattern, message])

        # any remaining file is neither expected nor already reported
        for file in self.allFilesList:
            if file not in goodFileList and file not in reportedFiles:
                badFileList.append([file, "    File not expected"])

        self.goodFilesList = goodFileList
        self.badFilesList = badFileList

    def printAllFiles(self):
        nbFiles = len(self.allFilesList)
        print("All {} files:".format(nbFiles))
        for file in self.allFilesList:
            print(f"  {file}")

    def printProblems(self):
        nExpectedFiles = self.expectedFiles.totalExpectedFiles
        nBadFiles = len(self.badFilesList)
        msg = []
        msg.append(f"In {self.absPathToData}:")
        if len(self.badFilesList) == 0:
            msg.append(f"- Found all the {nExpectedFiles} expected files.")
        else:
            msg.append(f"- Problem with {nBadFiles} files:")
            # define a set of possible problems from the messages
            possibleProblems = set()
            for problem in self.badFilesList:
                possibleProblems.add(problem[1])
            # print the problems by type
            for problem in possibleProblems:
                msg.append(f" {problem}:")
                for file in self.badFilesList:
                    if file[1] == problem:
                        msg.append(f"    {file[0]}")
        msg = "\n".join(msg)
        print(msg)
        print(f"See {self.logFullFileName} for details")
        with open(self.logFullFileName, "w") as f:
            f.write(msg)

    def printGoodFiles(self):
        print("Good files:")
        for goodFile in self.goodFilesList:
            print(f"  {goodFile}")

    def saveCheckResults(self):
        saveFile = "goodFiles.log"
        saveFile = os.path.join(self.logPath, saveFile)
        print(f"Saving check result files to {saveFile}")
        with open(saveFile, "w") as f:
            for goodFile in self.goodFilesList:
                f.write(f"{goodFile}\n")

## Run the check of the data files

In [None]:
pathToData = "dat/ReArm.lnk/ReArm_C1P02"
C1P02_V1 = CheckFilesInRearmVisit(pathToData)
C1P02_V1.checkFilesByVisit()
C1P02_V1.printProblems()
# C1P02_V1.printGoodFiles()
C1P02_V1.saveCheckResults()

# Testing the class `CheckFilesInRearmVisit`

To test the class `CheckFilesInRearmVisit`, I ran some manual tests:

- Delete or rename one or several files (e.g., with a wrong token[5] in the name)
  - This is detected as one or several missing files
- Duplicate a file, giving the copy a wrong date in its name
  - This is detected as a duplicate file
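
These manual tests could be automated along the following lines. This is only a sketch under simplifying assumptions: it works in a temporary directory with empty placeholder files, checks a single pattern, and uses the hypothetical helper `count_matches` instead of the class itself.

```python
import glob
import os
import tempfile

# Same wildcard style as the expected file names of the class
PATTERN = "ReArm_C?P??_????????_?_ac_p.cwa"


def count_matches(directory, pattern=PATTERN):
    """Count the files in directory whose name matches pattern."""
    # glob.escape protects special characters in the directory name
    return len(glob.glob(os.path.join(glob.escape(directory), pattern)))


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "ReArm_C1P02_20190131_1_ac_p.cwa")
    open(original, "w").close()
    n_ok = count_matches(tmp)         # exactly one match: the file is fine

    # duplicate the file with a wrong date in its name
    duplicate = os.path.join(tmp, "ReArm_C1P02_20190201_1_ac_p.cwa")
    open(duplicate, "w").close()
    n_duplicate = count_matches(tmp)  # more than one match: duplicate

    # delete both files
    os.remove(original)
    os.remove(duplicate)
    n_missing = count_matches(tmp)    # no match: missing file

print(n_ok, n_duplicate, n_missing)
```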


