# Check data files for the ReArm project

The goal is to check that all the expected data files for the ReArm project are present. 



## Location of the data files

The data files are accessed via `dat/ReArm.lnk`, a symlink to the data.  
The symlink points to the directory where the data files are actually stored.  
The symlink is ignored by Git (see `.gitignore` file).
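
Before running any check, it can help to confirm that the symlink exists and resolves to a directory. The snippet below is a minimal sketch of such a check; the function name `describe_data_link` is hypothetical and not part of the project code.

```python
import os


def describe_data_link(link_path):
    """Return (is_symlink, resolved_target, target_is_directory) for link_path."""
    is_symlink = os.path.islink(link_path)
    # realpath follows symlinks; for a plain path it only normalises it
    target = os.path.realpath(link_path)
    return is_symlink, target, os.path.isdir(target)


if __name__ == "__main__":
    # path taken from the text above, relative to the project root
    is_symlink, target, is_dir = describe_data_link(os.path.join("dat", "ReArm.lnk"))
    print(f"symlink: {is_symlink}, target: {target}, directory: {is_dir}")
```

If the last value is `False`, the link is probably missing or broken on the current machine.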

The directory structure is as follows:

    dat
    └── ReArm.lnk 
        ├── ReArm_C1P02
        │   ├── Accelerometry
        │   ├── Armeo
        │   ├── Circle
        │   ├── Reaching
        │   └── Scan
        ├── ...
        ...
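
As a rough illustration of this layout, the sketch below lists, for each participant directory, the sub-directories that are missing. It assumes that every participant directory contains the same five sub-directories as `ReArm_C1P02` in the tree above; the function name `missing_subdirs` is hypothetical.

```python
import os

# Sub-directories shown in the tree above (assumed identical for every participant)
EXPECTED_SUBDIRS = ("Accelerometry", "Armeo", "Circle", "Reaching", "Scan")


def missing_subdirs(data_root):
    """Map each participant directory under data_root to its missing sub-directories."""
    report = {}
    for name in sorted(os.listdir(data_root)):
        participant_dir = os.path.join(data_root, name)
        if not os.path.isdir(participant_dir):
            continue  # ignore stray files at the top level
        missing = [
            d for d in EXPECTED_SUBDIRS
            if not os.path.isdir(os.path.join(participant_dir, d))
        ]
        if missing:
            report[name] = missing
    return report
```

Calling `missing_subdirs("dat/ReArm.lnk")` from the project root would return an empty dictionary when every participant directory is complete.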



## Structure of the name of the data files

The data files are named according to the following structure:

```
<project>_<participant>_<date>_<visit>_<record>_<record-specific-file>.<extension>
```

where:

- `<project>` is the name of the project (`ReArm`)
- `<participant>` is the participant ID (`C1P02`: `C1` for the center and `P02` for the participant)
- `<date>` is the date of the recording (`20190131`, formatted as YYYYMMDD)
- `<visit>` is the visit identifier (`1`, `2`, `3`, `Inc`, `Reed`)
- `<record>` is the code of the record (`r`, `c`, `a`, `ac`, `Scan`)
- `<record-specific-file>` is a string that depends on the type of record; some files (e.g. `.easy`, `.oxy4`, `.xdf`) have none, in which case the preceding underscore is dropped as well
- `<extension>` depends on the type of record (`csv`, {`easy`, `oxy3`, `oxy4`}, `xdf`, `cwa`, `pdf`)
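
The naming structure above can be sketched as a regular expression. This is only an illustration inferred from the field list: the pattern, the name `FILE_NAME_RE` and the function `parse_file_name` are assumptions, not part of the project code, and the example file names are hypothetical.

```python
import re

# Assumed pattern, inferred from the field list above (not taken from the project code)
FILE_NAME_RE = re.compile(
    r"(?P<project>ReArm)"
    r"_(?P<participant>C\d+P\d+)"
    r"_(?P<date>\d{8})"
    r"_(?P<visit>1|2|3|Inc|Reed)"
    r"_(?P<record>ac|r|c|a|Scan)"
    r"(?P<specific>(?:_[A-Za-z0-9]+)*)"   # optional record-specific part
    r"\.(?P<extension>[A-Za-z0-9]+)"
)


def parse_file_name(file_name):
    """Return the name fields as a dict, or None if the name does not fit."""
    match = FILE_NAME_RE.fullmatch(file_name)
    return match.groupdict() if match else None


# Hypothetical file name built from the example values given in the list
fields = parse_file_name("ReArm_C1P02_20190131_1_r_l_m_mau_np.csv")
print(fields["record"], fields["specific"], fields["extension"])
```

A name that does not follow the structure, for example one with a seven-digit date, yields `None`.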




The following table shows the possible values of `<visit>`:

|`<visit>`  | Content   |
| :--      | :------  |   
| `Inc` | Inclusion  |
| `1` | Just before the start of therapy  |
| `2` | Just after the end of therapy  |
| `3` | 3 months after the end of therapy  |
| `Reed` | During the therapy  |


The following table shows the different types of records and the corresponding `<record>`:

| `<record>` | directory | Record   |
| :-- | :-----------     | :------  |   
| `r` | `Reaching`       | Reaching task |
| `c` | `Circle`         | Circular Steering task  | 
| `a` | `Armeo`          | Armeo's Ladybug Game|
| `ac` | `Accelerometry` | Wrist accelerometer at home |
| `Scan` | `Scan` | Scanned paper (manually filled data sheet) |




The following table shows, for each type of record, the possible endings of the file name (`<record-specific-file>` with its leading underscore, followed by `.<extension>`):

| `<record>` | End of file name | Content |
|--------|--------------------------|-------------|
| `r` | `_k.csv` | Kinect MoCap  |
| `r` | `_k_m.csv` | Kinect markers |
| `r` | `_l_m_mau_np.csv` | l markers for `mau` and `np` |
| `r` | `_l_m_mau_p.csv` | l markers for `mau` and `p` |
| `r` | `_l_m_sau_np.csv` | l markers for `sau` and `np` |
| `r` | `_l_m_sau_p.csv` | l markers for `sau` and `p` |
| `r` | `.easy` | Oxysoft csv export  |
| `r` | `.oxy4` | Oxysoft binary data |
| `r` | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `c` | `_k.csv` | Kinect MoCap  |
| `c` | `_k_m.csv` | Kinect markers |
| `c` | `_l_m_np.csv` | l markers for `np` |
| `c` | `_l_m_p.csv` | l markers for `p` |
| `c` | `_l_np.csv` | l mouse mocap for `np` |
| `c` | `_l_p.csv` | l mouse mocap for `p` |
| `c` | `.easy` | Oxysoft csv export  |
| `c` | `.oxy4` | Oxysoft binary data |
| `c` | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `a` | `.easy` | Oxysoft csv export  |
| `a` | `.oxy4` | Oxysoft binary data |
| `a` | `.xdf` | XDF file (it contains all the previous information) |
| | | |
| `ac` | `_p.cwa` | Accelerometer data for `p` |
| `ac` | `_np.cwa` | Accelerometer data for `np` |
| | | |
| `Scan` | `_Acti1.pdf` | Activity questionnaire  |
| `Scan` | `_SetupW.pdf` | Setup for clinical tests |
| `Scan` | `_V1.pdf` | Results of clinical tests at V1 |
| `Scan` | `_DataSheet.pdf` | Complementary information (lab notebook) |

The following table shows how to interpret the items that appear in a `<record-specific-file>`:

| Item | Meaning |
|--------| -------------|
| `mau` |  maximal arm use   |
| `sau` |  spontaneous arm use   |
| `np` |  non-paretic arm   |
| `p` |  paretic arm   |


# Script to check the data files

The script checks for the presence of the data files and reports the files that are missing, duplicated, or not expected.  
It takes an object-oriented approach to checking that all expected files are present in the directory.

I use two classes:

- `ExpectedFilesInRearmVisit`: stores the expected file names and sub-directories for a ReArm visit
- `CheckFilesInRearmVisit`: checks if the files are present in the directory


In [None]:
import os
import glob


class ExpectedFilesInRearmVisit:
    """
    a class to hold the expected file names and sub-directories for a ReArm visit
    """

    def __init__(self):
        self.begOfFileName = "ReArm_C?P??_????????_?_"
        self.reachingFiles = [
            "_k.csv",
            "_k_m.csv",
            "_l_m_mau_np.csv",
            "_l_m_mau_p.csv",
            "_l_m_sau_np.csv",
            "_l_m_sau_p.csv",
            ".easy",
            ".oxy4",
            ".xdf",
        ]
        self.circleFiles = [
            "_k.csv",
            "_k_m.csv",
            "_l_m_np.csv",
            "_l_m_p.csv",
            "_l_np.csv",
            "_l_p.csv",
            ".easy",
            ".oxy4",
            ".xdf",
        ]
        self.armeoFiles = [
            ".easy",
            ".oxy4",
        ]
        self.accelerometerFiles = [
            "_p.cwa",
            "_np.cwa",
        ]
        self.scanFiles = [
            "_Acti1.pdf",
            "_SetupW.pdf",
            "_V1.pdf",
            "_DataSheet.pdf",
        ]
        self.visits = {
            # text : meaning
            "1": "Visit 1",
            "2": "Visit 2",
            "3": "Visit 3",
        }
        self.records = {
            # text : directory
            "r": "Reaching",
            "c": "Circle",
            "a": "Armeo",
            "ac": "Accelerometry",
            "Scan": "Scan",
        }

        self.endOfFileNames = {
            "r": self.reachingFiles,
            "c": self.circleFiles,
            "a": self.armeoFiles,
            "ac": self.accelerometerFiles,
            "Scan": self.scanFiles,
        }
        self.__setTotalExpectedFiles()

    def __setTotalExpectedFiles(self):
        # total number of expected files over all the record types
        self.totalExpectedFiles = sum(
            len(self.endOfFileNames[record]) for record in self.records
        )

    def print(self):
        print("Expected files:")
        for recordText, recordDirectory in self.records.items():
            for endOfFileName in self.endOfFileNames[recordText]:
                print(
                    "{:15s} {:s}{:s}{:s}".format(
                        recordDirectory, self.begOfFileName, recordText, endOfFileName
                    )
                )


class CheckFilesInRearmVisit:
    """
    a class to check that the expected files are present in a ReArm visit
    """

    def __init__(self, dataDirectory):
        self.abspathToData = self.__abspathToDataDirectory(dataDirectory)
        self.relpathToData = self.__relpathToDataDirectory()
        self.goodFilesList = []
        self.badFilesList = []
        self.expectedFiles = ExpectedFilesInRearmVisit()
        self.allFilesList = self.__getAllFiles()

    def __getAllFiles(self):
        dataDirectory = self.abspathToData
        # using glob, list all the files in the data directory
        allFiles = glob.glob(os.path.join(dataDirectory, "**/*"), recursive=True)
        allFiles = [f for f in allFiles if not os.path.isdir(f)]
        return allFiles

    def __abspathToDataDirectory(self, dataDirectory):
        # we expect a full path, but the user may have given a relative path
        if not os.path.isdir(dataDirectory):
            # most likely a relative path from the current directory
            dataDirectory = os.path.join(os.getcwd(), "..", dataDirectory)
            if not os.path.isdir(dataDirectory):
                raise NotADirectoryError(f"{dataDirectory} is not a directory")
            dataDirectory = os.path.normpath(dataDirectory)
        # ensure we have an absolute path
        dataDirectory = os.path.abspath(dataDirectory)
        return dataDirectory
    
    def __relpathToDataDirectory(self):
        fromDirectory = os.getcwd()
        relpathToDataDirectory = os.path.relpath(self.abspathToData, fromDirectory)
        return relpathToDataDirectory

    def checkFilesByVisit(self):
        dataDirectory = self.abspathToData
        expected = self.expectedFiles
        # start from empty lists so that a second call does not duplicate entries
        goodFileList = []
        badFileList = []
        # every file matched by an expected pattern (once or several times)
        matchedFiles = set()
        for recordText, recordDirectory in expected.records.items():
            for endOfFileName in expected.endOfFileNames[recordText]:
                fnamePattern = expected.begOfFileName + recordText + endOfFileName
                fullFnamePattern = os.path.join(
                    dataDirectory, recordDirectory, fnamePattern
                )
                fullFnamePattern = os.path.normpath(fullFnamePattern)

                foundFiles = glob.glob(fullFnamePattern)
                matchedFiles.update(os.path.normpath(f) for f in foundFiles)
                if len(foundFiles) == 1:
                    goodFileList.append(foundFiles[0])
                else:
                    # we have a problem
                    message = ""
                    if len(foundFiles) == 0:
                        message += "    File not found"
                    if len(foundFiles) > 1:
                        message += "    Multiple files found:"
                        for problem in foundFiles:
                            message += "\n      " + problem
                    badFileList.append([fullFnamePattern, message])
        # check if some files are not expected (but present in the directory);
        # badFileList holds [pattern, message] pairs, so paths are compared
        # with matchedFiles instead
        for file in self.allFilesList:
            if os.path.normpath(file) not in matchedFiles:
                badFileList.append([file, "    File not expected"])

        self.goodFilesList = goodFileList
        self.badFilesList = badFileList

    def printAllFiles(self):
        nbFiles = len(self.allFilesList)
        print("All {} files:".format(nbFiles))
        for file in self.allFilesList:
            print(f"  {file}")

    def printProblems(self):
        nExpectedFiles = self.expectedFiles.totalExpectedFiles
        nBadFiles = len(self.badFilesList)
        print(f"In {self.abspathToData}:")
        if len(self.badFilesList) == 0:
            print(f"- Found all the {nExpectedFiles} expected files.")
        else:
            print(f"- Problem with {nBadFiles} files:")
            # define a set of possible problems from the messages
            possibleProblems = set()
            for problem in self.badFilesList:
                possibleProblems.add(problem[1])
            # print the problems by type
            for problem in possibleProblems:
                print(f" {problem}:")
                for file in self.badFilesList:
                    if file[1] == problem:
                        print(f"    {file[0]}")

    def printGoodFiles(self):
        print("Good files:")
        for goodFile in self.goodFilesList:
            print(f"  {goodFile}")

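    def __setResPath(self):
        # walk up from the current directory until we reach the root of the
        # project (the directory containing "dat"), without changing the cwd
        root = os.getcwd()
        while not os.path.isdir(os.path.join(root, "dat")):
            parent = os.path.dirname(root)
            if parent == root:
                # reached the filesystem root without finding the project
                raise FileNotFoundError("no 'dat' directory found above the cwd")
            root = parent
        resPath = os.path.join(root, "res")
        os.makedirs(resPath, exist_ok=True)

        self.resPath = resPath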

    def getPathAfterLastLnk(self, path):
        """
        return the path after the last directory ending with .lnk
        """
        parts = path.split(os.sep)
        # keep the parts after the last one ending with .lnk (if any)
        idx = -1
        for i, s in enumerate(parts):
            if s.endswith(".lnk"):
                idx = i
        if idx >= 0:
            parts = parts[idx + 1 :]
        return os.sep.join(parts)

    def saveGoodFiles(self, saveFile):
        self.__setResPath()
        pp = self.getPathAfterLastLnk(self.abspathToData)
        saveDirectory = os.path.join(self.resPath, pp)
        # create the sub-directory of res/ if it does not exist yet
        os.makedirs(saveDirectory, exist_ok=True)
        saveFile = os.path.join(saveDirectory, saveFile)
        print(f"Saving good files to {saveFile}")
        with open(saveFile, "w") as f:
            for goodFile in self.goodFilesList:
                f.write(f"{goodFile}\n")

In [None]:
import os

abspathToData = "dat/ReArm.lnk/ReArm_C1P02/toto"


def getPathAfterLastLnk(path):
    """
    return the path after the last directory ending with .lnk
    """
    parts = os.path.normpath(path).split(os.sep)
    # keep the parts after the last one ending with .lnk (if any)
    idx = -1
    for i, s in enumerate(parts):
        if s.endswith(".lnk"):
            idx = i
    if idx >= 0:
        parts = parts[idx + 1 :]
    return os.sep.join(parts)


ttt = getPathAfterLastLnk(abspathToData)
print(ttt)



In [None]:
abspathToData = "dat/ReArm.lnk/ReArm_C1P02"
C1P02_V1 = CheckFilesInRearmVisit(abspathToData)
C1P02_V1.checkFilesByVisit()
C1P02_V1.printProblems()
#C1P02_V1.printGoodFiles()
C1P02_V1.saveGoodFiles("goodFiles.txt")

# Testing the class `CheckFilesInRearmVisit`

To test the class `CheckFilesInRearmVisit`, I ran a few manual tests:

- Delete or rename one or several files (e.g., with a wrong `token[5]` in the name)
  - This is detected as one or several missing files
- Duplicate a file, giving the copy a different date in its name
  - This is detected as multiple files found for the same pattern
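
The manual checks above could also be scripted. The sketch below reproduces them in a temporary directory, using the same wildcard prefix as `begOfFileName` in `ExpectedFilesInRearmVisit`; the helper names `count_matches` and `touch` and the file names are hypothetical, and the sketch exercises only the glob matching, not the class itself.

```python
import glob
import os
import tempfile

# Same wildcard prefix as begOfFileName, followed by the record code and one ending
PATTERN = "ReArm_C?P??_????????_?_" + "r" + "_k.csv"


def count_matches(directory, pattern=PATTERN):
    """Number of files in directory whose name matches the glob pattern."""
    return len(glob.glob(os.path.join(directory, pattern)))


def touch(directory, name):
    """Create an empty file."""
    open(os.path.join(directory, name), "w").close()


with tempfile.TemporaryDirectory() as tmp:
    n_missing = count_matches(tmp)                   # no file yet: reported as missing
    touch(tmp, "ReArm_C1P02_20190131_1_r_k.csv")
    n_single = count_matches(tmp)                    # exactly one match: good file
    touch(tmp, "ReArm_C1P02_20190201_1_r_k.csv")     # copy with another date
    n_duplicate = count_matches(tmp)                 # two matches: duplicate
    touch(tmp, "ReArm_C1P02_20190131_1_r_kk.csv")    # wrong token: not matched
    n_after_typo = count_matches(tmp)

print(n_missing, n_single, n_duplicate, n_after_typo)  # prints: 0 1 2 2
```

Because `?` matches any single character, two files that differ only by their date both match the pattern, which is why a copy with a different date shows up as a duplicate rather than as an unexpected file.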


