load a custom preprocessed dataset Error #68

Closed
srashtchi opened this issue Aug 17, 2022 · 4 comments

@srashtchi

  • OCTIS version: 1.10.4
  • Python version: 3.9
  • Operating System: MacOS

Description

I am trying to use the evaluation metrics from the OCTIS package on my own dataset.
I followed the guide on the main README page on how to load a custom preprocessed dataset. I even used your sample .tsv file, but I got the following error:
NotADirectoryError: [Errno 20] Not a directory: '/Users/shabnam.rashtchi/DEB/topicModeling_project_folder/scratches/metadata.json/corpus.tsv'

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder('/Users/../scratches/corpus.tsv')

Traceback (most recent call last):
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/octis/dataset/dataset.py", line 327, in load_custom_dataset_from_folder
    df = pd.read_csv(self.dataset_path + "/corpus.tsv", sep='\t', header=None)
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1218, in _make_engine
    self.handles = get_handle( # type: ignore[call-overload]
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/common.py", line 786, in get_handle
    handle = open(
NotADirectoryError: [Errno 20] Not a directory: '/Users/shabnam.rashtchi/DEB/topicModeling_project_folder/scratches/metadata.json/corpus.tsv'

@silviatti
Collaborator

Hi!
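The function load_custom_dataset_from_folder() requires the path to the folder, not the path to the corpus file. It appends "/corpus.tsv" to whatever path it is given, which is why the error shows a file path followed by "/corpus.tsv".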

Can you check if it works in this way?

dataset.load_custom_dataset_from_folder('/Users/../scratches/')

Silvia
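
For anyone hitting the same error, here is a minimal sketch of a guard that accepts either the folder or the corpus file and returns the folder to pass to load_custom_dataset_from_folder(). The helper name resolve_dataset_folder is hypothetical and not part of OCTIS; the only thing taken from the traceback above is that the loader reads a file named corpus.tsv inside the given folder.

```python
import tempfile
from pathlib import Path


def resolve_dataset_folder(path):
    """Return the folder that contains corpus.tsv (hypothetical helper, not an OCTIS API)."""
    p = Path(path)
    # If a file path such as .../corpus.tsv was passed, fall back to its parent folder.
    folder = p.parent if p.is_file() else p
    # The traceback shows the loader opening <folder>/corpus.tsv, so fail early if it is missing.
    if not (folder / "corpus.tsv").is_file():
        raise FileNotFoundError(f"expected a file named corpus.tsv in {folder}")
    return str(folder)


# Quick check with a throwaway folder: both spellings resolve to the same folder.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.tsv"
    corpus.write_text("first document\nsecond document\n", encoding="utf-8")
    from_file = resolve_dataset_folder(corpus)
    from_folder = resolve_dataset_folder(tmp)
```

With OCTIS installed, the result would be passed on as dataset.load_custom_dataset_from_folder(resolve_dataset_folder(some_path)).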

@srashtchi
Author

Thanks Silvia for getting back to me.
It worked. I didn't know that I needed to name my own .tsv file "corpus.tsv", the same as your sample, and leave the file name out of the path.
Cheers

@silviatti
Collaborator

Perfect. I'll fix the readme to make it clear.
Thanks,

Silvia

@srashtchi
Author

Hi Silvia

I managed to get my code running fine, thanks for your response.

I have another question. I am trying to streamline my code: right now, to create a dataset object, I have to save my variable to a .tsv file first and then use the load_custom_dataset_from_folder method to load the data from the .tsv file into an empty dataset object. Without this object, the get_corpus() method obviously won't work. See the sample code below.

So basically the question is: is there a way to directly pass my variable to a dataset object without saving and loading?

from pathlib import Path
from octis.dataset.dataset import Dataset

# df is an existing pandas DataFrame with a 'document' column
f = Path('/myFolderPath/corpus.tsv')
df.to_csv(f, sep="\t", index=False, header=False, columns=['document'])

dataset = Dataset()
dataset.load_custom_dataset_from_folder('/myFolderPath/')

texts = dataset.get_corpus()
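
One possible workaround, sketched here under assumptions rather than as an OCTIS feature, is to hide the save-and-load round trip behind a helper that writes corpus.tsv into a temporary folder. The names load_from_memory and read_back are hypothetical; the sketch assumes the loader reads the file eagerly (as the pd.read_csv call in the traceback above suggests), so the folder can be deleted once the call returns.

```python
import tempfile
from pathlib import Path


def load_from_memory(documents, load_folder):
    """Write documents to a temporary corpus.tsv and pass the folder to load_folder."""
    with tempfile.TemporaryDirectory() as tmp:
        corpus = Path(tmp) / "corpus.tsv"
        # One document per line, no header, mirroring the to_csv call above.
        # Tabs and newlines inside a document are collapsed so each stays on one line.
        lines = [" ".join(str(doc).split()) for doc in documents]
        corpus.write_text("\n".join(lines) + "\n", encoding="utf-8")
        # The temporary folder is removed as soon as load_folder returns.
        return load_folder(tmp)


def read_back(folder):
    """Stand-in loader for this demo; it just reads the file back as a list of lines."""
    return (Path(folder) / "corpus.tsv").read_text(encoding="utf-8").splitlines()


docs = ["first  document", "second\tdocument"]
loaded = load_from_memory(docs, read_back)
```

With OCTIS installed, the stand-in would be replaced by the bound method, for example load_from_memory(df['document'], dataset.load_custom_dataset_from_folder), followed by dataset.get_corpus(). Whether the dataset object later needs the folder to still exist is not confirmed in this thread.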
