Dataset #7

Open · sungggat opened this issue Apr 3, 2023 · 8 comments

sungggat commented Apr 3, 2023

How can I get the CINC2021 dataset? How do I download the dataset from the URL you provided in the benchmarks? I could not find prepare_dataset.py here, though I did find it in the original repo.

wenh06 (Collaborator) commented Apr 4, 2023

Just call the download method. Of course, you may also download the zip files from Google Cloud with other tools and uncompress them manually. The prepare_dataset function in the original repo existed because I had to keep the files in specific subfolders to maintain the paths. The _ls_rec method has since been updated so that the paths are maintained in a pandas DataFrame, which makes the file moving in prepare_dataset unnecessary; it has therefore been removed.
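
For reference, a minimal sketch of the download call; the reader class name and constructor argument are inferred from this thread and torch_ecg's database readers, so treat them as assumptions rather than the exact API:

```python
# Minimal sketch, assuming the CINC2021 database reader exported from
# torch_ecg.databases and its download method (inferred from this thread)
from torch_ecg.databases import CINC2021

reader = CINC2021(db_dir="data/CINC2021")  # db_dir: where the data should live
reader.download()  # fetch the challenge zip files and uncompress them into db_dir
```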

sungggat (Author) commented Apr 11, 2023

I downloaded the CINC 2021 dataset from https://physionet.org/content/challenge-2021/#files . I want to run trainer.py from benchmarks/cinc2021. I also added ds_train and ds_val:

```python
TrainCfg.db_dir = 'data/CINC2021/physionet.org/files/challenge-2021/1.0.3/training/'

ds_train = CINC2021(TrainCfg, training=True, lazy=True)
ds_val = CINC2021(TrainCfg, training=False, lazy=True)
```

I am getting the error below:

```
  File "trainer.py", line 423, in <module>
    ds_train = CINC2021(TrainCfg, training=True, lazy=True)
  File "/workspace/torch_ecg/benchmarks/train_crnn_cinc2021/dataset.py", line 101, in __init__
    self.config.train_ratio, force_recompute=False
  File "/workspace/torch_ecg/benchmarks/train_crnn_cinc2021/dataset.py", line 306, in _train_test_split
    self.reader.all_records[t], dynamic_ncols=True, mininterval=1.0
TypeError: len() takes no keyword arguments
```

wenh06 (Collaborator) commented Apr 11, 2023

It's a typo in this file, which probably crept in during a copy-paste (from torch_ecg/databases/datasets/cinc2021/cinc2021_dataset.py). The closing bracket of the len call was missing and was added in the wrong place (perhaps by Copilot?). It is now corrected in 20203ca.
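
For readers hitting the same TypeError, here is a hedged illustration of the misplaced bracket, reconstructed from the traceback above rather than copied from the repo:

```python
# The closing bracket of len() was misplaced, so tqdm's keyword arguments
# ended up inside the len() call (a reconstruction, not the repo's exact lines)
from tqdm.auto import tqdm

records = ["A0001", "A0002", "A0003"]

# broken: raises TypeError: len() takes no keyword arguments
# n = len(records, dynamic_ncols=True, mininterval=1.0)

# fixed: close len() first, then pass the keyword arguments to tqdm
for rec in tqdm(records, total=len(records), dynamic_ncols=True, mininterval=1.0):
    pass  # process each record
```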

AK-mehr commented Mar 25, 2024

Hi, I'm trying to run trainer.py for train_hybrid_cpsc2020. I have downloaded the CPSC 2020 dataset and specified the data path inside cfg.py like this:

```python
BaseCfg.db_dir = 'D:/AUT/Data_Lab/Implementation/TinyML/data/TrainingSet/'
```

TrainingSet contains two subfolders, namely data and ref, each of which holds 10 .mat files, but I come across this error whenever I run trainer.py:

File "C:\Users\AK\miniconda3\envs\cpsc\Lib\site-packages\torch\utils\data\dataloader.py", line 350, in init
sampler = RandomSampler(dataset, generator=generator) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\AK\miniconda3\envs\cpsc\Lib\site-packages\torch\utils\data\sampler.py", line 143, in init
raise ValueError(f"num_samples should be a positive integer value, but got num_samples={self.num_samples}")
ValueError: num_samples should be a positive integer value, but got num_samples=0
Any advice on how I can fix this?

wenh06 (Collaborator) commented Mar 26, 2024

It seems that the data reader did not find the recording files. The CPSC2020 data reader searches for the recordings and annotation files using the following method:

```python
    def _ls_rec(self) -> None:
        """Find all records in the database directory
        and store them (path, metadata, etc.) in some private attributes.
        """
        self._df_records = pd.DataFrame()
        n_records = 10
        all_records = [f"A{i:02d}" for i in range(1, 1 + n_records)]
        self._df_records["path"] = [path for path in self.db_dir.rglob(f"*.{self.rec_ext}") if path.stem in all_records]
        self._df_records["record"] = self._df_records["path"].apply(lambda x: x.stem)
        self._df_records.set_index("record", inplace=True)

        all_annotations = [f"R{i:02d}" for i in range(1, 1 + n_records)]
        df_ann = pd.DataFrame()
        df_ann["ann_path"] = [path for path in self.db_dir.rglob(f"*.{self.ann_ext}") if path.stem in all_annotations]
        df_ann["record"] = df_ann["ann_path"].apply(lambda x: x.stem.replace("R", "A"))
        df_ann.set_index("record", inplace=True)
        # take the intersection by the index of `df_ann` and `self._df_records`
        self._df_records = self._df_records.join(df_ann, how="inner")

        if len(self._df_records) > 0:
            if self._subsample is not None:
                size = min(
                    len(self._df_records),
                    max(1, int(round(self._subsample * len(self._df_records)))),
                )
                self._df_records = self._df_records.sample(n=size, random_state=DEFAULTS.SEED, replace=False)

        self._all_records = self._df_records.index.tolist()
        self._all_annotations = self._df_records["ann_path"].apply(lambda x: x.stem).tolist()
```

In theory, you can pass any parent directory of the data as db_dir, since pathlib.Path.rglob searches recursively.
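
A quick way to verify this (the path below is the one from your comment; adjust as needed):

```python
# rglob searches recursively, so any ancestor of the data/ and ref/
# subfolders works as db_dir
from pathlib import Path

db_dir = Path("D:/AUT/Data_Lab/Implementation/TinyML/data/TrainingSet")
print(sorted(p.stem for p in db_dir.rglob("*.mat")))
# expected: ['A01', ..., 'A10', 'R01', ..., 'R10'] if the files are in place
```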

wenh06 (Collaborator) commented Mar 26, 2024

I think I know the reason now. The CPSC2020 dataset uses sliced recordings, since the original recordings are fairly long. You should therefore call the persistence method first; it takes quite a long time to slice the recordings.
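
Something along these lines; note that the dataset class name and where the persistence method lives are assumptions based on the benchmark's dataset.py, not verified here:

```python
# Hedged sketch: build the dataset, then slice and cache the recordings
# (class name and method location assumed from benchmarks/train_hybrid_cpsc2020)
from cfg import TrainCfg
from dataset import CPSC2020

ds_train = CPSC2020(TrainCfg, training=True)
ds_train.persistence()  # slices the long recordings into segments; slow on first run
```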

AK-mehr commented Mar 26, 2024

Thank you for your guidance. It seems that training requires a CNN.h5 and a CRNN.h5 file located in the signal_processing/ecg_rpeaks_dl_models directory, but I only have the corresponding JSON files. It's worth noting that I've only run trainer.py. Should I do anything before running trainer.py? Could you please help me with this one as well?

wenh06 (Collaborator) commented Mar 26, 2024

I added automatic downloading of these models, which you can find at https://opensz.oss-cn-beijing.aliyuncs.com/ICBEB2020/file/CPSC2019-opensource.zip. However, these models were trained with a much older version of Keras, so one might have trouble loading them. I have also removed the auto-loading of deep learning models in the signal_processing module.

The changes are currently in the dev branch and will be merged into the master branch soon.
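
If you want to try loading them anyway, the usual Keras pattern for an architecture JSON plus HDF5 weights looks like this; the file names are taken from this thread, and whether the load succeeds under a modern Keras is exactly the uncertainty mentioned above:

```python
# Hedged sketch: rebuild the model from its JSON architecture, then load
# the HDF5 weights; may fail if the files were saved with a very old Keras
from tensorflow.keras.models import model_from_json

with open("CNN.json") as f:
    cnn = model_from_json(f.read())
cnn.load_weights("CNN.h5")
```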
