This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Accept compressed files as input to predict when using a Predictor #5237

Open
danieldeutsch opened this issue Jun 2, 2021 · 10 comments
Labels
Contributions welcome Feature request Good First Issue A great place to start for first time contributors

Comments

@danieldeutsch
Contributor

Is your feature request related to a problem? Please describe.
I typically use compressed datasets (e.g., gzipped) to save disk space. This works fine with AllenNLP during training because I can write my dataset reader to load the compressed data. However, the predict command opens the input file as plain text and reads it line by line for the Predictor. This fails when it tries to load data from my compressed files.

    def _get_json_data(self) -> Iterator[JsonDict]:
        if self._input_file == "-":
            for line in sys.stdin:
                if not line.isspace():
                    yield self._predictor.load_line(line)
        else:
            input_file = cached_path(self._input_file)
            with open(input_file, "r") as file_input:
                for line in file_input:
                    if not line.isspace():
                        yield self._predictor.load_line(line)

Describe the solution you'd like
Either automatically detect that the file is compressed, or add a flag to predict that indicates that the file is compressed. One method that I have used to detect if a file is gzipped is here, although it isn't 100% accurate. I have an implementation here. Otherwise, a flag like --compression-type to mark how the file is compressed should be sufficient. Passing the type of compression would allow support for gzip, bz2, or any other method.
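
For illustration, here is a minimal sketch of one common way to detect compression automatically: inspecting the file's leading "magic" bytes. It is not necessarily the method linked above, and the function name detect_compression and the _MAGIC_BYTES table are hypothetical, not part of AllenNLP. As noted, the check is a heuristic: an uncompressed file that happens to start with the same bytes would be misclassified.

```python
from typing import Optional

# Leading "magic" bytes that identify each format.
# gzip streams start with 0x1f 0x8b; bzip2 streams start with the ASCII bytes "BZh".
_MAGIC_BYTES = {
    "gzip": b"\x1f\x8b",
    "bz2": b"BZh",
}


def detect_compression(path: str) -> Optional[str]:
    """Return "gzip", "bz2", or None, judging only by the file's first bytes.

    Hypothetical helper for illustration; not an AllenNLP function.
    """
    with open(path, "rb") as f:
        header = f.read(3)  # long enough for the longest signature above
    for name, magic in _MAGIC_BYTES.items():
        if header.startswith(magic):
            return name
    return None
```

A caller could use the returned name to choose between gzip.open, bz2.open, and the built-in open before reading lines.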

@ArjunSubramonian
Contributor

I think this feature is a great idea! The latter design (passing a flag) seems better to me.

I am adding @epwalsh here to get his input as well.

@ArjunSubramonian ArjunSubramonian added Contributions welcome Good First Issue A great place to start for first time contributors labels Jun 3, 2021
@epwalsh
Member

epwalsh commented Jun 3, 2021

Yeup, this seems reasonable. I think we should try to automatically detect the compression type, but also have the flag so that users can override it when the automatic detection fails.

You may find this helper function useful:

def open_compressed(
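
As a rough sketch of the design discussed in this comment (automatic detection with a flag that overrides it), the reading logic might look like the following. This is not the helper referenced above, whose body is not shown here. The names open_input and read_nonblank_lines, the compression_type parameter, and the choice to infer the format from the file extension are all assumptions made for illustration.

```python
import bz2
import gzip
from typing import Iterator, Optional, TextIO

# Hypothetical mapping from a --compression-type value to an opener.
_OPENERS = {"gzip": gzip.open, "bz2": bz2.open, "none": open}
# Assumed extension-based detection, used only when no override is given.
_EXTENSIONS = {".gz": "gzip", ".bz2": "bz2"}


def open_input(path: str, compression_type: Optional[str] = None) -> TextIO:
    """Open `path` as text, honoring an explicit override when provided."""
    if compression_type is None:
        # No override: guess from the file extension, defaulting to plain text.
        compression_type = next(
            (name for ext, name in _EXTENSIONS.items() if path.endswith(ext)),
            "none",
        )
    if compression_type not in _OPENERS:
        raise ValueError(f"Unsupported compression type: {compression_type!r}")
    return _OPENERS[compression_type](path, "rt", encoding="utf-8")


def read_nonblank_lines(
    path: str, compression_type: Optional[str] = None
) -> Iterator[str]:
    """Yield non-blank lines, mirroring the loop in _get_json_data."""
    with open_input(path, compression_type) as file_input:
        for line in file_input:
            if not line.isspace():
                yield line
```

With this shape, a gzipped file with an unusual name can still be read by passing compression_type="gzip", which is the override case described above.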

@Dbhasin1
Contributor

Hi, I'd like to try working on this. I'm relatively new to this, so are there any pointers I should keep in mind before raising a pull request?

@epwalsh
Member

epwalsh commented Jun 29, 2021

@spranjal25

Hi @epwalsh! Is this issue still unresolved? I'm looking for issues to start contributing to AllenNLP. Can I take this up if it hasn't been resolved already?

@epwalsh
Member

epwalsh commented Oct 25, 2021

Hi @spranjal25, we haven't heard from @Dbhasin1 for a while on their PR, so it's probably okay for you to take over at this point.

@Dbhasin1
Contributor

Dbhasin1 commented Oct 25, 2021

Hey, sorry, I've been engaged elsewhere for a while. I'd like to give it one more shot!

@aterzgar

Is the issue still open?

@Akshat977

Hi @danieldeutsch, @epwalsh,
This issue seems like a good starting point for my journey toward contributing to FOSS projects.
Can you please assign this to me?

@epwalsh
Member

epwalsh commented Sep 12, 2022

Hi @Akshat977, feel free to open a PR when you're ready.

7 participants