Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add /convert_alignment endpoint #119

Merged
merged 19 commits into from
Oct 7, 2022
Merged

feat: add /convert_alignment endpoint #119

merged 19 commits into from
Oct 7, 2022

Conversation

joanise
Copy link
Member

@joanise joanise commented Sep 26, 2022

Feature implementation: the /convert_alignment endpoint supports converting alignments to TextGrid, eaf, srt, vtt.

400 means Bad Request, but a request we can parse and process, just not g2p,
is a good request, but with what we can consider a semantic problem, so
422 seems more appropriate.
This new web_api endpoint will need to "write" a TextGrid file to
string in memory, and it will need to do so without having access to the
audio file.

So...
 - save_label_files() takes words and tokenized_xml instead of align_results
 - rename write_to_text_grid -> create_text_grid since the function
   does not actually write the results to file, but rather it creates an
   in-memory TextGrid object.
 - save_subtitles with words and tokenized_xml instead of align_results
 - factor out get_audio_duration()
This end point specifically converts to TextGrid, but there will be more
endpoints, for the other formats, and probably some refactorings to
unify the endpoints once they're all done.
@joanise
Copy link
Member Author

joanise commented Sep 26, 2022

Questions for review...

  1. The /convert_alignment endpoint accepts the output format of part of the JSON input dict. It's probably more "conventional" to do /convert_to/TextGrid and /convert_to/eaf etc, but a) I don't really know how to do that, and b) since a bunch of parameters already have to come through that JSON structure, I'm not sure I see value in changing it.

  2. My ConvertResponse structure feels clunky with 4 fields, only two of which are used for TextGrid and eaf, and all four for srt and vtt, but I was trying to unify the four possible conversion outputs into one response object. Is that OK, or should I try to do better? Can one endpoint return one of several different responses, the way it can accept one of several different requests? (I mean, it's pretty obvious it can, but how to do you do cleanly and keep it user intuitive and well documented?) (Maybe a list of filenames and a list of filecontents? Or a list of (filename, filecontents) pairs?)

@codecov
Copy link

codecov bot commented Sep 26, 2022

Codecov Report

Merging #119 (c8462ff) into master (2dd66a5) will increase coverage by 0.83%.
The diff coverage is 96.74%.

@@            Coverage Diff             @@
##           master     #119      +/-   ##
==========================================
+ Coverage   80.38%   81.21%   +0.83%     
==========================================
  Files          26       26              
  Lines        1840     1922      +82     
  Branches      338      340       +2     
==========================================
+ Hits         1479     1561      +82     
- Misses        320      321       +1     
+ Partials       41       40       -1     
Impacted Files Coverage Δ
readalongs/align.py 95.88% <95.00%> (+0.38%) ⬆️
readalongs/web_api.py 96.35% <96.00%> (+0.24%) ⬆️
readalongs/text/make_smil.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

The utf-8 assumption is baked in in various subtle places, so just document
it is required and remove the parameter.

Did some refactoring in test_web_api.py at the same time.
@joanise
Copy link
Member Author

joanise commented Sep 28, 2022

Argh, CI is failing once again because of a failure to push to codecov. This problem usually goes away by itself after a few minutes, but it's a nuisance. The tests passed on my latest push, though.

Copy link
Collaborator

@roedoejet roedoejet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @joanise - this is awesome! Well done on your first API with FastAPI. I like the way you did it, but I have some suggestions in comments. Maybe we can discuss further on slack?

Thanks again - solid work. Well tested, and thought-out.

readalongs/align.py Show resolved Hide resolved
)


class ConvertResponse(BaseModel):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm - I see why you did this - because some responses included two files, but I think we might actually want to design this a little differently. FastAPI has a built-in FileResponse object (https://fastapi.tiangolo.com/advanced/custom-response/?h=file#fileresponse) which takes care of sending appropriate headers along with the response (which we don't fully implement here). I think we should just parameterize the request for subtitles to return either a sentence or word response. More to come!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it turns out I cannot use a FileResponse, because that asynchronously reads the file to send it, but I wipe the file out as the endpoint end when I clean up the tempdir, as we promise in our data privacy policy, so I'll use a Response instead like this:

            return Response(
                slurp_file(prefix + ".TextGrid"),
                media_type="text/plain",
                headers={"Content-Disposition": "filename=aligned.TextGrid"},
            )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also tested with headers={"Content-Disposition": "attachment; filename=aligned.TextGrid"}, and it returns a download link with the contents, downloading to that filename, but I don't know if that's better for our use cases or not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternately if you don't want to read the whole file (but the files are small enough that I think this is a non-issue) you can do this: https://fastapi.tiangolo.com/advanced/custom-response/#using-streamingresponse-with-file-like-objects

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was looking at StreamingResponse too, but I have to read it all into memory before I delete off the disk, so again there is not benefit in trying to do anything async optimized. The files are quite small, so I think whole-file fully sync is file.

return f.read()


@v1.post("/convert_alignment", response_model=ConvertResponse)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@v1.post("/convert_alignment", response_model=ConvertResponse)
@v1.post("/convert_alignment/{output_format}", response_model=ConvertResponse)
async def convert_alignment(input: ConvertRequest, output_format) -> ConvertResponse:

This is how you do what you were suggesting. I would also kind of prefer this I think instead of bundling it with the body of the request? I mean it's certainly what you would do in a RESTful API, but this is more RPC-ish.

See "path parameters" https://fastapi.tiangolo.com/tutorial/path-params/
Another option (and what might make sense for specifying sentences vs words on vtt would be something like:

/convert_alignment/vtt?tier=sentences and in that case the tier= would be a "query parameter" (https://fastapi.tiangolo.com/tutorial/query-params/). Then we could create an Enumaration class for the output formats that we accept and let FastAPI handle the case where somebody requests something else. See https://fastapi.tiangolo.com/tutorial/path-params/?h=enum#working-with-python-enumerations

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, I like the path parameter, but I can't figure out how to keep my documentation I had for the enum values.
With output_name as a member of ConvertRequest, as an enum or a str, I can do:

output_format: FormatName = Field(
    example=FormatName.textgrid,
    title="Format to convert to, one of textgrid (Praat TextGrid), eaf (ELAN EAF), srt (SRT subtitles), or vtt (VTT subtitles).",
)

Is there a corresponding place I can put that title info when I do a path parameter? I tried giving the path param a default value as a Field with such a title, but it didn't put that title in the documentation.

I guess there's always the docstring for the endpoint itself, that would do too.

Most satisfying, though, would be if my class FormatName(Enum) could itself have its own title or similar documentation...

return ConvertResponse(
file_name="aligned_sentences.srt",
file_contents=slurp_file(prefix + "_sentences.srt"),
other_file_name="aligned_words.srt",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yes, just to re-iterate, my recommendations here would be:

  1. return FileResponse objects instead of this custom ConvertResponse object
  2. use path parameters for each output format and use a Python Enumeration to make use of FastAPI handling exceptions
  3. use query parameters to return either sentence or word subtitles for subtitle formats (with sentences being the default as this is what the vast majority of cases will want I expect)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just been following this discussion from afar, but these recommendations make a lot of sense to me and result in a much cleaner API. The extra round-trip to do sentences and words is maybe not a huge concern if as you say most of the requests are only going to be for sentences.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with both of you. The use case 99% of the time is get the sentence tier, so it's OK to de-optimize the case where your want both tiers for the sake of a cleaner API design that's easier to understand.


Returns:
list of words, list of sentences
list of words, list of sentences (as a list of lists of words)
The returned words are dicts containing:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orthogonal to this PR, but it would be more robust to use a NamedTuple or data class instead of a dict here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

God yes! I really don't like our dicts used like this all over the code. But I've never had the courage to change them, because they're really all over!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had an interesting idea in the shower: we could have a Word class that is initialized from the lxml.etree element and sets id, start and end right away, kinda like a dataclass (I prefer dataclass to NamedTuple, byt the way). .text would then be an attribute implemented with a getter. This way, I wouldn't have to have a get_word_texts in the first place, you'd just use word.text when you need it. We'd still need get_sentences, but it would do just one thing, figuring out the sentence identification logic.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a better idea (it is also what pocketsphinx and soundswallower do internally, except that of course their "text" is our "id")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That, if and when we do it, will definitely be in a separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I like the idea enough that I think I'll do it.

@joanise
Copy link
Member Author

joanise commented Oct 6, 2022

I've done all the changes we've discussed, I think this PR can be ready to merge. @dhdaines @roedoejet do you want to take a look again or can I just merge?

@roedoejet
Copy link
Collaborator

I've done all the changes we've discussed, I think this PR can be ready to merge. @dhdaines @roedoejet do you want to take a look again or can I just merge?

I'll have a quick look

@roedoejet roedoejet self-requested a review October 6, 2022 20:42
Copy link
Collaborator

@roedoejet roedoejet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - but why use Response instead of FileResponse?

from typing import Dict, List, Optional, Union

from fastapi import Body, FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import Response
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not FileResponse?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's because the file has to be deleted immediately after sending it, and there isn't a way (or at least not an obvious way) to do that with FileResponse.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be the same no? The response (whether Response or FileResponse) should return the file created in the temporary directory. Once it does that, the file gets deleted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that FileResponse sends it asynchronously, so the directory will get deleted (from exiting the with block) before the event loop gets around to sending it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - in that case I guess we could just write a background task to delete the file after the response is sent https://fastapi.tiangolo.com/tutorial/background-tasks/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I first tried FileResponse, but it's broken for our use case: it asynchronously sends the file after the endpoint function exists, but I delete the temporary folder so the file is gone by the time FileResponse tries to read it and I just get a file-not-found error. David suggested StreamingResponse as an alternative, but it suffers from the same problem.

Because our data privacy policy promises to delete the file right away, I have to slurp the file into memory and send that. I dug around to find a "FileResponse from memory" solution and what I found was to put it in a plain Response with the headers I set.

Copy link
Member Author

@joanise joanise Oct 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My window was out of sync, I hadn't seen the background task suggestion yet.

I'll look into the background task, and I think it makes sense to use it, as long as it really does happen after the FileResponse is done sending the file. It might yield better performance, too, by making things asynchronous, returning the headers quickly and deferring reading and deleting the file in the background.

I'll want to make sure the temp dir really does get deleted, though, because a failure here would violate our privacy promise. I'm not quite sure how to test that convincingly.

The other downside is that I can't use the TemporaryDirectory as a context manager anymore, but that's a minor point.

I guess my real decision factor would be: is the FileResponse actually faster or better in some other concrete way? If so, the change is a no brainer. Otherwise, my Response solution already works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I was worried about creating a race condition, but FileResponse accepts a background parameter provide a background task to run after running. See https://stackoverflow.com/a/68579536
This looks like a very elegant solution, I'll give it a shot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dang I hate Windows! I have it all working with a FileResponse now, but if I run the tests on Windows, the SRT, VTT and EAF files get CRLF newlines instead of LF. I guess that's valid too, but what a nuisance! It means my unit testing has to convert before testing for equality. Bleh.

My slurp_file solution was reading the file back on windows, converting it to LF in the process, but a FileResponse sends the file contents verbatim.

"lc3", "Third Language Name",
...
}`
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this, and all of the removal of encoding parameters, would be a separate PR, but git doesn't make that very easy to do in my experience...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, what makes it hard is that my changes are kinda mixed together. If I had decided to separate the two tasks, I could have easily done it, and I guess I should have. Git actually helps doing that, but I was just too lazy to do it. Full disclosure: I consider myself a Git wizard and there's not much I can't do with it. I think it would have been easy, but a "normal" person would probably disagree. :)

Copy link
Collaborator

@dhdaines dhdaines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me...

 - Change Response to FileResponse
 - Use a BackgroundTask to make sure the temporary directory is cleaned
   up, but only *after* the FileResponse has been delivered to the
   client
 - Test to make sure the temporary directory (and therefore the files it
   contains) really always gets deleted, even when things go wrong.
That code was just there so I could test that my unit test really failed
if the temporary directory was not deleted.
 - enum names are constants, so they should be UPPER_CASE
 - input is a builtin, don't redefine it (s/input/request/)
 - and misc minor stuff
@@ -386,6 +423,13 @@ async def convert_alignment(
detail="If this happens, FastAPI Enum validation didn't work so this is a bug!",
)

except Exception:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

readalongs/web_api.py Outdated Show resolved Hide resolved
@joanise joanise merged commit 5c60518 into master Oct 7, 2022
@joanise joanise deleted the dev.web-api-convert branch October 7, 2022 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants