Transcript for Episode 1 #15

Closed
ole opened this issue Jan 20, 2019 · 16 comments

@ole
Contributor

ole commented Jan 20, 2019

This is an exciting project. Great job on the first episode!

I ran the first episode through Amazon's Transcribe service. The result is a massive JSON file that includes not only the transcribed text but also timecodes and speaker identification (i.e. you tell the AWS Transcribe API how many speakers there were and it will try to distinguish them as "Speaker 1", "Speaker 2" and so on). Here's a screenshot of the AWS Transcribe console:

[Screenshot: AWS Transcribe console]

The transcription is obviously not perfect, but I think it's a good start, and editing the file by hand is probably much faster than typing everything out from scratch. I'm a big fan of transcripts to make it possible to find things again later, but I also think a podcast transcript need not (and perhaps should not) mirror the spoken word precisely. Transcribed text is generally not very readable if it includes every "uh" etc.

You can download the complete JSON file (5.6 MB). I ran it through a formatter and removed my AWS account ID; other than that, it's unchanged.

The bulk of the file is a huge array of recognized words with a per-word timecode and sometimes with word alternatives if the system wasn't certain. For example, this is what the first two words ("Welcome to") look like:

    "items": [
      {
        "start_time": "0.54",
        "end_time": "1.31",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Welcome"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "1.31",
        "end_time": "1.45",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "to"
          }
        ],
        "type": "pronunciation"
      },
      ...
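
A minimal sketch of reading this structure, in Python purely for illustration (it is not the code used in this thread). It assumes each word is an item of type "pronunciation" with string-encoded times, as in the excerpt above, and it skips any other item type. The function names are hypothetical.

```python
import json

# The two items shown above, wrapped in an object so they parse on their own.
SAMPLE = """
{"items": [
  {"start_time": "0.54", "end_time": "1.31",
   "alternatives": [{"confidence": "1.0", "content": "Welcome"}],
   "type": "pronunciation"},
  {"start_time": "1.31", "end_time": "1.45",
   "alternatives": [{"confidence": "1.0", "content": "to"}],
   "type": "pronunciation"}
]}
"""

def best_word(item):
    """Return the text of the alternative with the highest confidence."""
    best = max(item["alternatives"], key=lambda alt: float(alt["confidence"]))
    return best["content"]

def words_with_times(items):
    """Turn recognized-word items into (start, end, word) tuples."""
    words = []
    for item in items:
        # Assumption: only "pronunciation" items carry timecodes.
        if item.get("type") != "pronunciation":
            continue
        start = float(item["start_time"])
        end = float(item["end_time"])
        words.append((start, end, best_word(item)))
    return words

words = words_with_times(json.loads(SAMPLE)["items"])
text = " ".join(word for _, _, word in words)
print(text)
```

Keeping the times as numbers next to each word is what makes later steps (grouping, timecoded output) possible.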

I'm not sure how much time I can spend on editing the transcript and/or writing a script to process the whole thing into something that can be published on the web. If anyone would like to help, feel free to chime in.

Lastly, I'd like to mention the Podlove web player, a great (I think) open-source HTML5 audio player that can, among many other features, display transcripts and sync them to the audio, i.e. you can search the transcript for something, click on a search result, and the player will jump to that timecode in the audio. I think this or something like this would be a great addition to the web site — at least if transcripts become a regular thing that we create for each episode (making a good transcript is a lot of work, and somebody has to do it).

@lattner
Contributor

lattner commented Jan 21, 2019

Wow, this is super cool!

@ole
Contributor Author

ole commented Jan 23, 2019

Current progress: I wrote a script to parse the JSON and convert it into readable Markdown. See the result in this Gist. The numbers above each paragraph are timecodes (measured in seconds from the start of recording).

That was the easy part. The hard part is to make the data editable (for manual fixes) while keeping the timecodes etc. intact.
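
As a rough illustration of the "easy part", here is a sketch in Python (not the script mentioned above) that renders paragraphs with a timecode line above each one. The tuple layout, the function name, and the sample sentences are assumptions made for the example; the header layout (a range in seconds, then the speaker in parentheses) is modeled on the entry quoted further down in this thread.

```python
def render_markdown(paragraphs):
    """Render (start, end, speaker, text) tuples as Markdown paragraphs,
    each preceded by its timecode range in seconds and the speaker name."""
    blocks = []
    for start, end, speaker, text in paragraphs:
        header = f"{start:.3f}–{end:.3f}"
        blocks.append(f"{header}\n({speaker})\n\n{text}")
    return "\n\n".join(blocks) + "\n"

# Invented sample data, for illustration only.
paragraphs = [
    (0.54, 5.2, "Speaker 1", "Welcome to the show."),
    (5.2, 9.875, "Speaker 2", "Thanks for having me."),
]
markdown = render_markdown(paragraphs)
print(markdown)
```

Output like this is readable, but once someone edits the text there is no reliable way to map the words back to their original timecodes, which is the problem described above.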

@calebkleveter

Maybe you could add a case to the script that would remove entries like this:

387.794–387.834
(Unknown)
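
A sketch of what such a case could look like, in Python for illustration. It assumes a segment is a dictionary with `start`, `end`, `speaker`, and `text` keys, and that entries like the one above have no recognized speaker and no text; both are guesses about the data, not something confirmed here.

```python
def is_empty_segment(segment):
    """True for segments with no known speaker and no transcribed text."""
    return segment["speaker"] is None and not segment["text"].strip()

def drop_empty_segments(segments):
    """Return the segments that carry a speaker or some text."""
    return [seg for seg in segments if not is_empty_segment(seg)]

# Hypothetical data: the second entry mirrors the example above.
segments = [
    {"start": 380.1, "end": 387.5, "speaker": "Speaker 1", "text": "Some words."},
    {"start": 387.794, "end": 387.834, "speaker": None, "text": ""},
]
cleaned = drop_empty_segments(segments)
print(len(cleaned))
```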

@garricn
Member

garricn commented Jan 24, 2019

This is amazing @ole! Do you think it's good enough to open a PR adding it to the repo?

@garricn
Member

garricn commented Jan 24, 2019

Also, is there a way to automate this?

@garricn
Member

garricn commented Jan 25, 2019

@ole this is amazing! I’ll mention it during #22 and try to get you some support so we can get this integrated into the repo and workflow.

@ole
Contributor Author

ole commented Jan 25, 2019

> Maybe you could add a case to the script that would remove entries like this.

@calebkleveter Good call. That wouldn't be difficult to include in my code. I wanted to preserve the structure as the automatic transcription created it for this first attempt.

> Do you think it's good enough to open a PR adding it to the repo?

@garricn We could of course publish the unedited transcript as-is. It's arguably better than nothing, even with all the transcription errors. And we can always push manual edits later (and possibly in a piecemeal fashion; it takes a lot of time to edit an entire episode).

I considered editing the portion of the transcript that I found most interesting to preserve for posterity (mostly @lattner's comments about the origins of Swift) and possibly post it to my own blog in addition to the entire transcript (which should definitely be hosted on the podcast's website). Would you be okay with that?

> Also, is there a way to automate this?

@garricn The steps up to this current state are automatable, yes. Like all AWS services, Amazon Transcribe has an API, and so do other potential services, I believe. (Google has one, and there are probably others. Apple ships a speech recognition API in the iOS and macOS SDKs, but if I remember correctly it had some limitations regarding the length of the content you could transcribe in one go. I might be wrong about that.)

In any case, if you have an AWS account, setting up a transcription job on Amazon Transcribe takes just a few clicks. It would be worth automating if we have a complete process in place, but it's not the most important step right now IMO.
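
For reference, a hedged sketch of what the automated version of those clicks might look like with the AWS SDK for Python (boto3). The helper only builds the request parameters, so it runs without an AWS account. The job name, the bucket URI, and the speaker count are placeholders, and the helper name is made up for this example.

```python
def build_transcription_request(job_name, media_uri, speaker_count,
                                language_code="en-US", media_format="mp3"):
    """Build keyword arguments for Amazon Transcribe's StartTranscriptionJob,
    with speaker identification switched on."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "MediaFormat": media_format,
        "Media": {"MediaFileUri": media_uri},
        "Settings": {
            "ShowSpeakerLabels": True,
            "MaxSpeakerLabels": speaker_count,
        },
    }

# Placeholder values, not the real episode data.
request = build_transcription_request(
    job_name="episode-1",
    media_uri="s3://example-bucket/episode-1.mp3",
    speaker_count=2,
)

# With boto3 installed and AWS credentials configured, the job would be
# started roughly like this:
#   import boto3
#   boto3.client("transcribe").start_transcription_job(**request)
print(request["Settings"])
```

The audio has to be uploaded to S3 first, and the resulting JSON is fetched once the job finishes; neither step is shown here.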

I wrote Swift code to parse the JSON file produced by Amazon Transcribe and a simple function to output it as Markdown. I haven't published the code yet, but I can certainly do that. This is where we currently stand.

The next step is where it gets complicated: Ideally, I'd like to be able to edit the transcript manually while preserving as much of the timecode information as possible. This means we can't just edit the Markdown source because it would be pretty much impossible to reintegrate the edits with the timecodes (at least on a per-word basis — arguably, per-sentence or per-paragraph timecodes would be good enough).
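
One way to get per-paragraph timecodes is sketched below in Python, as an illustration only. It groups per-word entries into paragraphs wherever the speaker pauses, keeping just the start of the first word and the end of the last. Splitting on pauses and the 1.5-second threshold are assumptions for the example; splitting on speaker changes would be another option.

```python
def group_into_paragraphs(words, max_pause=1.5):
    """Group (start, end, word) tuples into (start, end, text) paragraphs,
    starting a new paragraph whenever the gap between words exceeds max_pause."""
    groups = []
    current = []
    for start, end, word in words:
        if current and start - current[-1][1] > max_pause:
            groups.append(current)
            current = []
        current.append((start, end, word))
    if current:
        groups.append(current)
    return [
        (group[0][0], group[-1][1], " ".join(w for _, _, w in group))
        for group in groups
    ]

# The first two words come from the JSON excerpt; the third is invented.
words = [(0.54, 1.31, "Welcome"), (1.31, 1.45, "to"), (4.0, 4.4, "Today")]
paragraphs = group_into_paragraphs(words)
print(paragraphs)
```

With this shape, the text of a paragraph can be edited freely while its start and end times stay untouched.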

I plan to research whether there is existing transcription software (a desktop app or a web-based solution) that we could use for this. I don't know any off the top of my head, but I have a hunch others must have solved the same problem. If anyone has suggestions, I'd be all ears.

@lattner
Contributor

lattner commented Jan 26, 2019

> I considered editing the portion of the transcript that I found most interesting to preserve for posterity (mostly @lattner's comments about the origins of Swift) and possibly post it to my own blog in addition to the entire transcript (which should definitely be hosted on the podcast's website). Would you be okay with that?

Yes, absolutely. It occurs to me that we don't have a license for the repo, but IMO it makes sense to use the Creative Commons Attribution license, which allows very permissive use and re-use of the contents.

In terms of the transcript, I think it would be really great to post something relatively raw and then ask for contributors to help with the editing. GitHub is pretty good for collaboration :-)

Thanks for driving this @ole!

@ole
Contributor Author

ole commented Jan 31, 2019

Thank you for the shoutout in episode 2. I haven't forgotten this. I'll send a PR with the (unedited) transcript for episode 1 soon.

@jjonesdev

jjonesdev commented Feb 1, 2019

Hello 👋 @ole ! This is really cool! I'm a Junior, but how can I help?

@JulianKahnert
Contributor

Hi there, same here! I would love to get involved in this (side-)project! 😊

@ole
Contributor Author

ole commented Feb 3, 2019

@jonesandcode @JulianKahnert Great! I pushed my code for parsing the Amazon Transcribe JSON format to this repository: ole/transcribe. Feel free to have a look.

The only functionality so far is parsing Amazon Transcribe files and outputting them in a (hardcoded) Markdown format. I still want to research a good existing file format for transcripts that would allow us to edit the transcript text while keeping speaker and timecode information (at least on a per-paragraph basis; the per-word timecodes that Amazon Transcribe produces are probably overkill).

@JulianKahnert
Contributor

JulianKahnert commented Feb 3, 2019

@ole @jonesandcode what do you think about WebVTT? I have no experience with it, but it seems to be supported by Auphonic and the beta version of the Podlove Web Player.

We can even see this in action (sorry for the German reference):
https://forschergeist.de/podcast/fg060-klimawandel/
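
To make the format concrete, here is a small sketch in Python (for illustration; not the tool discussed in this thread) that writes timecoded paragraphs as WebVTT cues. The `(start, end, speaker, text)` tuple layout and the function names are assumptions; the `WEBVTT` header, the `-->` cue timing line, and the `<v ...>` voice tag come from the WebVTT format itself.

```python
def vtt_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def to_webvtt(cues):
    """Render (start, end, speaker, text) tuples as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for start, end, speaker, text in cues:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        # The voice tag records who is speaking in this cue.
        lines.append(f"<v {speaker}>{text}")
        lines.append("")
    return "\n".join(lines)

# Sample cue built from the first two transcribed words; the speaker is a placeholder.
vtt = to_webvtt([(0.54, 1.45, "Speaker 1", "Welcome to")])
print(vtt)
```

Because each cue is plain text under its own timing line, the text can be corrected in any editor without disturbing the timecodes.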

@ole
Contributor Author

ole commented Feb 3, 2019

@JulianKahnert I love it! I opened a separate issue in the other repo: ole/transcribe#2

Maybe it's better to discuss concrete next steps over there.

@JulianKahnert
Contributor

Does anyone have experience with the Auphonic Transcript Editor?

It seems like a perfect fit for creating the transcript (e.g. via AWS) and editing it afterwards with an inline HTML editor (example). Since I have never used Auphonic, I don't know if we can use it collaboratively.

Another option might be the "open source transcript editor":

> Our open source transcript editor, which is embedded directly in the HTML Transcript File, has been designed to make checking and editing transcripts as easy as possible. Try it yourself with our Transcript Editor Examples.

https://auphonic.com/help/algorithms/speech_recognition.html#transcript-editor

It would be awesome if we could use the transcript editor, but I cannot find any documentation except for the two Transcript Editor Examples. Am I missing something? 🤔

garricn pushed a commit that referenced this issue Feb 14, 2019
* Add unedited WebVTT transcript for episode 1

The transcript has been generated with the Amazon Transcribe service (as discussed in #15) and converted to the WebVTT format with a Swift tool written by @ole (https://github.com/ole) and @JulianKahnert (https://github.com/JulianKahnert).

This autogenerated transcript contains many transcription errors, but it gives us a good baseline for manual editing.

* Add some metadata and a "request for editing" to episode 1 transcript

* Edit episode 1 transcript from 00:00:00.000 to 00:05:38.497

* Edit episode 1 transcript from 00:05:38.887 to 00:09:46.529

* Edit episode 1 transcript from 00:59:15.213 to 01:07:26.240

* Edit episode 1 transcript from 00:50:18.774 to 00:59:15.093

* Small transcript edits

* Edit episode 1 transcript from 00:15:14.437 to 00:23:31.535

* Episode 1 transcript from 00:39:55.214 -> 00:50:18.494

* Edit episode 1 transcript from 00:09:51.039 to 00:15:13.397

* More transcript edits

* Edit episode 1 transcript from 00:23:32.785 to 00:30:32.178

* Edit episode 1 transcript from 00:30:32.658 to 00:39:54.794

* Edit episode 1 transcript header comment

@BasThomas
Contributor

Closing as this is done! 🎉
