-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add /convert_alignment endpoint #119
Conversation
400 means Bad Request, but a request we can parse and process, just not g2p, is a good request, but with what we can consider a semantic problem, so 422 seems more appropriate.
This new web_api endpoint will need to "write" a TextGrid file to string in memory, and it will need to do so without having access to the audio file. So... - save_label_files() takes words and tokenized_xml instead of align_results - rename write_to_text_grid -> create_text_grid since the function does not actually write the results to file, but rather it creates an in-memory TextGrid object. - save_subtitles with words and tokenized_xml instead of align_results - factor out get_audio_duration()
This end point specifically converts to TextGrid, but there will be more endpoints, for the other formats, and probably some refactorings to unify the endpoints once they're all done.
Questions for review...
|
Codecov Report
@@ Coverage Diff @@
## master #119 +/- ##
==========================================
+ Coverage 80.38% 81.21% +0.83%
==========================================
Files 26 26
Lines 1840 1922 +82
Branches 338 340 +2
==========================================
+ Hits 1479 1561 +82
- Misses 320 321 +1
+ Partials 41 40 -1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
The utf-8 assumption is baked in in various subtle places, so just document it is required and remove the parameter. Did some refactoring in test_web_api.py at the same time.
Argh, CI is failing once again because of a failure to push to codecov. This problem usually goes away by itself after a few minutes, but it's a nuisance. The tests passed on my latest push, though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @joanise - this is awesome! Well done on your first API with FastAPI. I like the way you did it, but I have some suggestions in comments. Maybe we can discuss further on slack?
Thanks again - solid work. Well tested, and thought-out.
readalongs/web_api.py
Outdated
) | ||
|
||
|
||
class ConvertResponse(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm - I see why you did this - because some responses included two files, but I think we might actually want to design this a little differently. FastAPI has a built-in FileResponse object (https://fastapi.tiangolo.com/advanced/custom-response/?h=file#fileresponse) which takes care of sending appropriate headers along with the response (which we don't fully implement here). I think we should just parameterize the request for subtitles to return either a sentence or word response. More to come!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it turns out I cannot use a FileResponse, because that asynchronously reads the file to send it, but I wipe the file out as the endpoint end when I clean up the tempdir, as we promise in our data privacy policy, so I'll use a Response instead like this:
return Response(
slurp_file(prefix + ".TextGrid"),
media_type="text/plain",
headers={"Content-Disposition": "filename=aligned.TextGrid"},
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also tested with headers={"Content-Disposition": "attachment; filename=aligned.TextGrid"},
and it returns a download link with the contents, downloading to that filename, but I don't know if that's better for our use cases or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternately if you don't want to read the whole file (but the files are small enough that I think this is a non-issue) you can do this: https://fastapi.tiangolo.com/advanced/custom-response/#using-streamingresponse-with-file-like-objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was looking at StreamingResponse too, but I have to read it all into memory before I delete off the disk, so again there is not benefit in trying to do anything async optimized. The files are quite small, so I think whole-file fully sync is file.
readalongs/web_api.py
Outdated
return f.read() | ||
|
||
|
||
@v1.post("/convert_alignment", response_model=ConvertResponse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@v1.post("/convert_alignment", response_model=ConvertResponse) | |
@v1.post("/convert_alignment/{output_format}", response_model=ConvertResponse) | |
async def convert_alignment(input: ConvertRequest, output_format) -> ConvertResponse: |
This is how you do what you were suggesting. I would also kind of prefer this I think instead of bundling it with the body of the request? I mean it's certainly what you would do in a RESTful API, but this is more RPC-ish.
See "path parameters" https://fastapi.tiangolo.com/tutorial/path-params/
Another option (and what might make sense for specifying sentences vs words on vtt would be something like:
/convert_alignment/vtt?tier=sentences
and in that case the tier=
would be a "query parameter" (https://fastapi.tiangolo.com/tutorial/query-params/). Then we could create an Enumaration class for the output formats that we accept and let FastAPI handle the case where somebody requests something else. See https://fastapi.tiangolo.com/tutorial/path-params/?h=enum#working-with-python-enumerations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum, I like the path parameter, but I can't figure out how to keep my documentation I had for the enum values.
With output_name
as a member of ConvertRequest
, as an enum or a str, I can do:
output_format: FormatName = Field(
example=FormatName.textgrid,
title="Format to convert to, one of textgrid (Praat TextGrid), eaf (ELAN EAF), srt (SRT subtitles), or vtt (VTT subtitles).",
)
Is there a corresponding place I can put that title
info when I do a path parameter? I tried giving the path param a default value as a Field with such a title, but it didn't put that title in the documentation.
I guess there's always the docstring for the endpoint itself, that would do too.
Most satisfying, though, would be if my class FormatName(Enum)
could itself have its own title
or similar documentation...
readalongs/web_api.py
Outdated
return ConvertResponse( | ||
file_name="aligned_sentences.srt", | ||
file_contents=slurp_file(prefix + "_sentences.srt"), | ||
other_file_name="aligned_words.srt", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So yes, just to re-iterate, my recommendations here would be:
- return
FileResponse
objects instead of this customConvertResponse
object - use path parameters for each output format and use a Python Enumeration to make use of FastAPI handling exceptions
- use query parameters to return either sentence or word subtitles for subtitle formats (with sentences being the default as this is what the vast majority of cases will want I expect)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just been following this discussion from afar, but these recommendations make a lot of sense to me and result in a much cleaner API. The extra round-trip to do sentences and words is maybe not a huge concern if as you say most of the requests are only going to be for sentences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree with both of you. The use case 99% of the time is get the sentence tier, so it's OK to de-optimize the case where your want both tiers for the sake of a cleaner API design that's easier to understand.
|
||
Returns: | ||
list of words, list of sentences | ||
list of words, list of sentences (as a list of lists of words) | ||
The returned words are dicts containing: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orthogonal to this PR, but it would be more robust to use a NamedTuple or data class instead of a dict here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
God yes! I really don't like our dicts used like this all over the code. But I've never had the courage to change them, because they're really all over!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had an interesting idea in the shower: we could have a Word class that is initialized from the lxml.etree element and sets id, start and end right away, kinda like a dataclass (I prefer dataclass to NamedTuple, byt the way). .text
would then be an attribute implemented with a getter. This way, I wouldn't have to have a get_word_texts
in the first place, you'd just use word.text
when you need it. We'd still need get_sentences
, but it would do just one thing, figuring out the sentence identification logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like a better idea (it is also what pocketsphinx and soundswallower do internally, except that of course their "text" is our "id")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That, if and when we do it, will definitely be in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I like the idea enough that I think I'll do it.
I've done all the changes we've discussed, I think this PR can be ready to merge. @dhdaines @roedoejet do you want to take a look again or can I just merge? |
I'll have a quick look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - but why use Response
instead of FileResponse
?
readalongs/web_api.py
Outdated
from typing import Dict, List, Optional, Union | ||
|
||
from fastapi import Body, FastAPI, HTTPException | ||
from fastapi.middleware.cors import CORSMiddleware | ||
from fastapi.responses import Response |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not FileResponse
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's because the file has to be deleted immediately after sending it, and there isn't a way (or at least not an obvious way) to do that with FileResponse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be the same no? The response (whether Response
or FileResponse
) should return the file created in the temporary directory. Once it does that, the file gets deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that FileResponse
sends it asynchronously, so the directory will get deleted (from exiting the with block) before the event loop gets around to sending it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see - in that case I guess we could just write a background task to delete the file after the response is sent https://fastapi.tiangolo.com/tutorial/background-tasks/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I first tried FileResponse
, but it's broken for our use case: it asynchronously sends the file after the endpoint function exists, but I delete the temporary folder so the file is gone by the time FileResponse
tries to read it and I just get a file-not-found error. David suggested StreamingResponse
as an alternative, but it suffers from the same problem.
Because our data privacy policy promises to delete the file right away, I have to slurp the file into memory and send that. I dug around to find a "FileResponse from memory" solution and what I found was to put it in a plain Response
with the headers I set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My window was out of sync, I hadn't seen the background task suggestion yet.
I'll look into the background task, and I think it makes sense to use it, as long as it really does happen after the FileResponse is done sending the file. It might yield better performance, too, by making things asynchronous, returning the headers quickly and deferring reading and deleting the file in the background.
I'll want to make sure the temp dir really does get deleted, though, because a failure here would violate our privacy promise. I'm not quite sure how to test that convincingly.
The other downside is that I can't use the TemporaryDirectory as a context manager anymore, but that's a minor point.
I guess my real decision factor would be: is the FileResponse actually faster or better in some other concrete way? If so, the change is a no brainer. Otherwise, my Response solution already works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I was worried about creating a race condition, but FileResponse accepts a background
parameter provide a background task to run after running. See https://stackoverflow.com/a/68579536
This looks like a very elegant solution, I'll give it a shot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dang I hate Windows! I have it all working with a FileResponse now, but if I run the tests on Windows, the SRT, VTT and EAF files get CRLF newlines instead of LF. I guess that's valid too, but what a nuisance! It means my unit testing has to convert before testing for equality. Bleh.
My slurp_file solution was reading the file back on windows, converting it to LF in the process, but a FileResponse sends the file contents verbatim.
"lc3", "Third Language Name", | ||
... | ||
}` | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally this, and all of the removal of encoding parameters, would be a separate PR, but git doesn't make that very easy to do in my experience...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, what makes it hard is that my changes are kinda mixed together. If I had decided to separate the two tasks, I could have easily done it, and I guess I should have. Git actually helps doing that, but I was just too lazy to do it. Full disclosure: I consider myself a Git wizard and there's not much I can't do with it. I think it would have been easy, but a "normal" person would probably disagree. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me...
- Change Response to FileResponse - Use a BackgroundTask to make sure the temporary directory is cleaned up, but only *after* the FileResponse has been delivered to the client - Test to make sure the temporary directory (and therefore the files it contains) really always gets deleted, even when things go wrong.
That code was just there so I could test that my unit test really failed if the temporary directory was not deleted.
- enum names are constants, so they should be UPPER_CASE - input is a builtin, don't redefine it (s/input/request/) - and misc minor stuff
@@ -386,6 +423,13 @@ async def convert_alignment( | |||
detail="If this happens, FastAPI Enum validation didn't work so this is a bug!", | |||
) | |||
|
|||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
Feature implementation: the
/convert_alignment
endpoint supports converting alignments to TextGrid, eaf, srt, vtt.