feat: add /convert_alignment endpoint #119

joanise · 2022-09-26T13:29:21Z

Feature implementation: the /convert_alignment endpoint supports converting alignments to TextGrid, eaf, srt, vtt.

400 means Bad Request, but a request we can parse and process, just not g2p, is a good request, but with what we can consider a semantic problem, so 422 seems more appropriate.

This new web_api endpoint will need to "write" a TextGrid file to string in memory, and it will need to do so without having access to the audio file. So... - save_label_files() takes words and tokenized_xml instead of align_results - rename write_to_text_grid -> create_text_grid since the function does not actually write the results to file, but rather it creates an in-memory TextGrid object. - save_subtitles with words and tokenized_xml instead of align_results - factor out get_audio_duration()

This end point specifically converts to TextGrid, but there will be more endpoints, for the other formats, and probably some refactorings to unify the endpoints once they're all done.

joanise · 2022-09-26T13:31:28Z

Questions for review...

The /convert_alignment endpoint accepts the output format of part of the JSON input dict. It's probably more "conventional" to do /convert_to/TextGrid and /convert_to/eaf etc, but a) I don't really know how to do that, and b) since a bunch of parameters already have to come through that JSON structure, I'm not sure I see value in changing it.
My ConvertResponse structure feels clunky with 4 fields, only two of which are used for TextGrid and eaf, and all four for srt and vtt, but I was trying to unify the four possible conversion outputs into one response object. Is that OK, or should I try to do better? Can one endpoint return one of several different responses, the way it can accept one of several different requests? (I mean, it's pretty obvious it can, but how to do you do cleanly and keep it user intuitive and well documented?) (Maybe a list of filenames and a list of filecontents? Or a list of (filename, filecontents) pairs?)

codecov · 2022-09-26T13:57:30Z

Codecov Report

Merging #119 (c8462ff) into master (2dd66a5) will increase coverage by 0.83%.
The diff coverage is 96.74%.

@@            Coverage Diff             @@
##           master     #119      +/-   ##
==========================================
+ Coverage   80.38%   81.21%   +0.83%     
==========================================
  Files          26       26              
  Lines        1840     1922      +82     
  Branches      338      340       +2     
==========================================
+ Hits         1479     1561      +82     
- Misses        320      321       +1     
+ Partials       41       40       -1

Impacted Files	Coverage Δ
readalongs/align.py	`95.88% <95.00%> (+0.38%)`	⬆️
readalongs/web_api.py	`96.35% <96.00%> (+0.24%)`	⬆️
readalongs/text/make_smil.py	`100.00% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

The utf-8 assumption is baked in in various subtle places, so just document it is required and remove the parameter. Did some refactoring in test_web_api.py at the same time.

joanise · 2022-09-28T20:36:52Z

Argh, CI is failing once again because of a failure to push to codecov. This problem usually goes away by itself after a few minutes, but it's a nuisance. The tests passed on my latest push, though.

readalongs/text/make_smil.py

roedoejet

Hey @joanise - this is awesome! Well done on your first API with FastAPI. I like the way you did it, but I have some suggestions in comments. Maybe we can discuss further on slack?

Thanks again - solid work. Well tested, and thought-out.

readalongs/align.py

roedoejet · 2022-10-04T19:33:34Z

readalongs/web_api.py

+    )
+
+
+class ConvertResponse(BaseModel):


Hm - I see why you did this - because some responses included two files, but I think we might actually want to design this a little differently. FastAPI has a built-in FileResponse object (https://fastapi.tiangolo.com/advanced/custom-response/?h=file#fileresponse) which takes care of sending appropriate headers along with the response (which we don't fully implement here). I think we should just parameterize the request for subtitles to return either a sentence or word response. More to come!

Well, it turns out I cannot use a FileResponse, because that asynchronously reads the file to send it, but I wipe the file out as the endpoint end when I clean up the tempdir, as we promise in our data privacy policy, so I'll use a Response instead like this:

return Response( slurp_file(prefix + ".TextGrid"), media_type="text/plain", headers={"Content-Disposition": "filename=aligned.TextGrid"}, )

I've also tested with headers={"Content-Disposition": "attachment; filename=aligned.TextGrid"}, and it returns a download link with the contents, downloading to that filename, but I don't know if that's better for our use cases or not.

Alternately if you don't want to read the whole file (but the files are small enough that I think this is a non-issue) you can do this: https://fastapi.tiangolo.com/advanced/custom-response/#using-streamingresponse-with-file-like-objects

Yeah, I was looking at StreamingResponse too, but I have to read it all into memory before I delete off the disk, so again there is not benefit in trying to do anything async optimized. The files are quite small, so I think whole-file fully sync is file.

roedoejet · 2022-10-04T19:48:35Z

readalongs/web_api.py

+        return f.read()
+
+
+@v1.post("/convert_alignment", response_model=ConvertResponse)


Suggested change

@v1.post("/convert_alignment", response_model=ConvertResponse)

@v1.post("/convert_alignment/{output_format}", response_model=ConvertResponse)

async def convert_alignment(input: ConvertRequest, output_format) -> ConvertResponse:

This is how you do what you were suggesting. I would also kind of prefer this I think instead of bundling it with the body of the request? I mean it's certainly what you would do in a RESTful API, but this is more RPC-ish.

See "path parameters" https://fastapi.tiangolo.com/tutorial/path-params/
Another option (and what might make sense for specifying sentences vs words on vtt would be something like:

/convert_alignment/vtt?tier=sentences and in that case the tier= would be a "query parameter" (https://fastapi.tiangolo.com/tutorial/query-params/). Then we could create an Enumaration class for the output formats that we accept and let FastAPI handle the case where somebody requests something else. See https://fastapi.tiangolo.com/tutorial/path-params/?h=enum#working-with-python-enumerations

Hum, I like the path parameter, but I can't figure out how to keep my documentation I had for the enum values.
With output_name as a member of ConvertRequest, as an enum or a str, I can do:

output_format: FormatName = Field( example=FormatName.textgrid, title="Format to convert to, one of textgrid (Praat TextGrid), eaf (ELAN EAF), srt (SRT subtitles), or vtt (VTT subtitles).", )

Is there a corresponding place I can put that title info when I do a path parameter? I tried giving the path param a default value as a Field with such a title, but it didn't put that title in the documentation.

I guess there's always the docstring for the endpoint itself, that would do too.

Most satisfying, though, would be if my class FormatName(Enum) could itself have its own title or similar documentation...

roedoejet · 2022-10-04T19:51:01Z

readalongs/web_api.py

+            return ConvertResponse(
+                file_name="aligned_sentences.srt",
+                file_contents=slurp_file(prefix + "_sentences.srt"),
+                other_file_name="aligned_words.srt",


So yes, just to re-iterate, my recommendations here would be:

return FileResponse objects instead of this custom ConvertResponse object

use path parameters for each output format and use a Python Enumeration to make use of FastAPI handling exceptions

use query parameters to return either sentence or word subtitles for subtitle formats (with sentences being the default as this is what the vast majority of cases will want I expect)

I've just been following this discussion from afar, but these recommendations make a lot of sense to me and result in a much cleaner API. The extra round-trip to do sentences and words is maybe not a huge concern if as you say most of the requests are only going to be for sentences.

Yeah, I agree with both of you. The use case 99% of the time is get the sentence tier, so it's OK to de-optimize the case where your want both tiers for the sake of a cleaner API design that's easier to understand.

dhdaines · 2022-10-06T01:17:16Z

readalongs/align.py


    Returns:
-        list of words, list of sentences
+        list of words, list of sentences (as a list of lists of words)
+        The returned words are dicts containing:


Orthogonal to this PR, but it would be more robust to use a NamedTuple or data class instead of a dict here.

God yes! I really don't like our dicts used like this all over the code. But I've never had the courage to change them, because they're really all over!

I've had an interesting idea in the shower: we could have a Word class that is initialized from the lxml.etree element and sets id, start and end right away, kinda like a dataclass (I prefer dataclass to NamedTuple, byt the way). .text would then be an attribute implemented with a getter. This way, I wouldn't have to have a get_word_texts in the first place, you'd just use word.text when you need it. We'd still need get_sentences, but it would do just one thing, figuring out the sentence identification logic.

this seems like a better idea (it is also what pocketsphinx and soundswallower do internally, except that of course their "text" is our "id")

That, if and when we do it, will definitely be in a separate PR.

But I like the idea enough that I think I'll do it.

…formats

joanise · 2022-10-06T20:40:24Z

I've done all the changes we've discussed, I think this PR can be ready to merge. @dhdaines @roedoejet do you want to take a look again or can I just merge?

roedoejet · 2022-10-06T20:41:53Z

I've done all the changes we've discussed, I think this PR can be ready to merge. @dhdaines @roedoejet do you want to take a look again or can I just merge?

I'll have a quick look

roedoejet

Looks good - but why use Response instead of FileResponse?

roedoejet · 2022-10-06T20:44:24Z

readalongs/web_api.py

 from typing import Dict, List, Optional, Union

 from fastapi import Body, FastAPI, HTTPException
 from fastapi.middleware.cors import CORSMiddleware
+from fastapi.responses import Response


why not FileResponse?

I think it's because the file has to be deleted immediately after sending it, and there isn't a way (or at least not an obvious way) to do that with FileResponse.

It should be the same no? The response (whether Response or FileResponse) should return the file created in the temporary directory. Once it does that, the file gets deleted.

The problem is that FileResponse sends it asynchronously, so the directory will get deleted (from exiting the with block) before the event loop gets around to sending it.

I see - in that case I guess we could just write a background task to delete the file after the response is sent https://fastapi.tiangolo.com/tutorial/background-tasks/

I first tried FileResponse, but it's broken for our use case: it asynchronously sends the file after the endpoint function exists, but I delete the temporary folder so the file is gone by the time FileResponse tries to read it and I just get a file-not-found error. David suggested StreamingResponse as an alternative, but it suffers from the same problem.

Because our data privacy policy promises to delete the file right away, I have to slurp the file into memory and send that. I dug around to find a "FileResponse from memory" solution and what I found was to put it in a plain Response with the headers I set.

My window was out of sync, I hadn't seen the background task suggestion yet.

I'll look into the background task, and I think it makes sense to use it, as long as it really does happen after the FileResponse is done sending the file. It might yield better performance, too, by making things asynchronous, returning the headers quickly and deferring reading and deleting the file in the background.

I'll want to make sure the temp dir really does get deleted, though, because a failure here would violate our privacy promise. I'm not quite sure how to test that convincingly.

The other downside is that I can't use the TemporaryDirectory as a context manager anymore, but that's a minor point.

I guess my real decision factor would be: is the FileResponse actually faster or better in some other concrete way? If so, the change is a no brainer. Otherwise, my Response solution already works.

Interesting. I was worried about creating a race condition, but FileResponse accepts a background parameter provide a background task to run after running. See https://stackoverflow.com/a/68579536
This looks like a very elegant solution, I'll give it a shot.

Dang I hate Windows! I have it all working with a FileResponse now, but if I run the tests on Windows, the SRT, VTT and EAF files get CRLF newlines instead of LF. I guess that's valid too, but what a nuisance! It means my unit testing has to convert before testing for equality. Bleh.

My slurp_file solution was reading the file back on windows, converting it to LF in the process, but a FileResponse sends the file contents verbatim.

dhdaines · 2022-10-06T20:50:05Z

readalongs/web_api.py

+            "lc3", "Third Language Name",
+            ...
+        }`
+    """


Ideally this, and all of the removal of encoding parameters, would be a separate PR, but git doesn't make that very easy to do in my experience...

Actually, what makes it hard is that my changes are kinda mixed together. If I had decided to separate the two tasks, I could have easily done it, and I guess I should have. Git actually helps doing that, but I was just too lazy to do it. Full disclosure: I consider myself a Git wizard and there's not much I can't do with it. I think it would have been easy, but a "normal" person would probably disagree. :)

dhdaines

looks good to me...

- Change Response to FileResponse - Use a BackgroundTask to make sure the temporary directory is cleaned up, but only *after* the FileResponse has been delivered to the client - Test to make sure the temporary directory (and therefore the files it contains) really always gets deleted, even when things go wrong.

That code was just there so I could test that my unit test really failed if the temporary directory was not deleted.

- enum names are constants, so they should be UPPER_CASE - input is a builtin, don't redefine it (s/input/request/) - and misc minor stuff

roedoejet · 2022-10-07T16:56:25Z

readalongs/web_api.py

@@ -386,6 +423,13 @@ async def convert_alignment(
                detail="If this happens, FastAPI Enum validation didn't work so this is a bug!",
            )

+    except Exception:


readalongs/web_api.py

joanise added 8 commits September 26, 2022 09:00

fix: failure to g2p should be 422, not 400

b0da504

400 means Bad Request, but a request we can parse and process, just not g2p, is a good request, but with what we can consider a semantic problem, so 422 seems more appropriate.

feat: parse_smil() with unit testing

84af6b0

feat: new /convert_to_TextGrid endpoint in web_api

42f724f

This end point specifically converts to TextGrid, but there will be more endpoints, for the other formats, and probably some refactorings to unify the endpoints once they're all done.

feat: convert_alignment now also supports ELAN eaf

6fcc404

feat: endpoint /convert_alignment supports srt and vtt

e035cb2

style: polishing my code before submitting my PR

c3711c5

docs: document the return value of the /langs endpoint

4520ad8

joanise requested review from dhdaines and roedoejet September 26, 2022 13:29

joanise added 2 commits September 28, 2022 14:33

fix: remove unsupported encoding argument from web_api

41d5912

The utf-8 assumption is baked in in various subtle places, so just document it is required and remove the parameter. Did some refactoring in test_web_api.py at the same time.

style: I prefer audio_duration to audio_length

728991f

roedoejet reviewed Sep 28, 2022

View reviewed changes

readalongs/text/make_smil.py Outdated Show resolved Hide resolved

roedoejet requested changes Oct 4, 2022

View reviewed changes

dhdaines reviewed Oct 6, 2022

View reviewed changes

joanise added 5 commits October 6, 2022 11:13

fix: typo

ddd9f65

refactor: use an Enum instead of a regex to validate /convert output …

b2703c1

…formats

refactor: use a path parameter for /convert format

9d4d5ea

refactor: /convert returns a Response with one file only

badbd9c

refactor: word text instead of textual word

1a3a792

roedoejet self-requested a review October 6, 2022 20:42

roedoejet reviewed Oct 6, 2022

View reviewed changes

dhdaines reviewed Oct 6, 2022

View reviewed changes

dhdaines approved these changes Oct 6, 2022

View reviewed changes

joanise added 2 commits October 7, 2022 12:29

style: remove breadcrumb code that kept the temporary directory around

af5e085

That code was just there so I could test that my unit test really failed if the temporary directory was not deleted.

style: apply some selected pylint and PEP8 recommendations

c95ca35

- enum names are constants, so they should be UPPER_CASE - input is a builtin, don't redefine it (s/input/request/) - and misc minor stuff

roedoejet reviewed Oct 7, 2022

View reviewed changes

readalongs/web_api.py Outdated Show resolved Hide resolved

refactor: use filename= to set FileResponse headers

c8462ff

joanise merged commit 5c60518 into master Oct 7, 2022

joanise deleted the dev.web-api-convert branch October 7, 2022 17:41

joanise mentioned this pull request Oct 24, 2022

Refactor the representation of a word inside Studio #123

Open

		return f.read()


		@v1.post("/convert_alignment", response_model=ConvertResponse)

	@v1.post("/convert_alignment", response_model=ConvertResponse)
	@v1.post("/convert_alignment/{output_format}", response_model=ConvertResponse)
	async def convert_alignment(input: ConvertRequest, output_format) -> ConvertResponse:

feat: add /convert_alignment endpoint #119

feat: add /convert_alignment endpoint #119

Conversation

joanise commented Sep 26, 2022

joanise commented Sep 26, 2022

codecov bot commented Sep 26, 2022 • edited Loading

Codecov Report

joanise commented Sep 28, 2022

roedoejet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joanise commented Oct 6, 2022

roedoejet commented Oct 6, 2022

roedoejet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joanise Oct 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dhdaines left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Sep 26, 2022 •

edited

Loading

joanise Oct 7, 2022 •

edited

Loading