Transcript for Episode 1 #15

ole · 2019-01-20T23:47:35Z

This is an exciting project. Great job on the first episode!

I ran the first episode through Amazon's Transcribe service. The result is a massive JSON file that includes not only the transcribed text but also timecodes and speaker identification (i.e. you tell the AWS Transcribe API how many speakers there were and it will try to distinguish them as "Speaker 1", "Speaker 2" and so on). Here's a screenshot of the AWS Transcribe console:

The transcription is obviously not perfect, but I think it's a good start, and editing the file by hand is probably much faster than typing everything out from scratch. I'm a big fan of transcripts to make it possible to find things again later, but I also think a podcast transcript need not (and perhaps should not) mirror the spoken word precisely. Transcribed text is generally not very readable if it includes every "uh" etc.

You can download the complete JSON file (5.6 MB). I ran it through a formatter and removed my AWS account ID; other than that, it's unchanged.

The bulk of the file is a huge array of recognized words with a per-word timecode and sometimes with word alternatives if the system wasn't certain. For example, this is what the first two words ("Welcome to") look like:

    "items": [
      {
        "start_time": "0.54",
        "end_time": "1.31",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Welcome"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "1.31",
        "end_time": "1.45",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "to"
          }
        ],
        "type": "pronunciation"
      },
      ...
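
A minimal sketch of how such an `items` array could be read. This is not the script discussed later in this thread (that one is written in Swift); it is a Python illustration that assumes the array has already been located in the file, and that items without timecodes (such as punctuation entries) can be skipped. The function name `parse_words` is hypothetical.

```python
import json

# The two items from the excerpt above, wrapped in an object for parsing.
SAMPLE = """
{"items": [
  {"start_time": "0.54", "end_time": "1.31",
   "alternatives": [{"confidence": "1.0", "content": "Welcome"}],
   "type": "pronunciation"},
  {"start_time": "1.31", "end_time": "1.45",
   "alternatives": [{"confidence": "1.0", "content": "to"}],
   "type": "pronunciation"}
]}
"""


def parse_words(items):
    """Return (start, end, text, confidence) tuples for every timed item."""
    words = []
    for item in items:
        # Assumption: entries without timecodes carry no timing information
        # and are skipped here.
        if "start_time" not in item or "end_time" not in item:
            continue
        # All numbers are stored as strings, so convert them explicitly.
        # When several alternatives exist, keep the most confident one.
        best = max(item["alternatives"], key=lambda alt: float(alt["confidence"]))
        words.append((
            float(item["start_time"]),
            float(item["end_time"]),
            best["content"],
            float(best["confidence"]),
        ))
    return words


words = parse_words(json.loads(SAMPLE)["items"])
print(words)
```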

I'm not sure how much time I can spend on editing the transcript and/or writing a script to process the whole thing into something that can be published on the web. If anyone would like to help, feel free to chime in.

Lastly, I'd like to mention the Podlove web player, a great (I think) open-source HTML5 audio player that can, among many other features, display transcripts and sync them to the audio, i.e. you can search the transcript for something, click on a search result, and the player will jump to that timecode in the audio. I think this or something like this would be a great addition to the web site — at least if transcripts become a regular thing that we create for each episode (making a good transcript is a lot of work, and somebody has to do it).

lattner · 2019-01-21T02:10:13Z

Wow, this is super cool!

ole · 2019-01-23T21:45:44Z

Current progress: I wrote a script to parse the JSON and convert it into readable Markdown. See the result in this Gist. The numbers above each paragraph are timecodes (measured in seconds from the start of recording).

That was the easy part. The hard part is to make the data editable (for manual fixes) while keeping the timecodes etc. intact.
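
To illustrate the idea of timecodes above each paragraph, here is a rough Python sketch. It is not the actual script, and the exact layout (a start–end range in seconds on one line, then the speaker and text) is an assumption; `to_markdown` is a hypothetical name.

```python
def to_markdown(paragraphs):
    """Render (start, end, speaker, text) tuples as Markdown paragraphs.

    Each paragraph is preceded by its timecode range in seconds,
    measured from the start of the recording.
    """
    blocks = []
    for start, end, speaker, text in paragraphs:
        blocks.append(f"{start:.3f}–{end:.3f}\n**{speaker}:** {text}")
    # Paragraphs are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


print(to_markdown([(0.54, 1.45, "Speaker 1", "Welcome to")]))
```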

calebkleveter · 2019-01-23T22:35:42Z

Maybe you could add a case to the script that would remove entries like this:

387.794–387.834
(Unknown)
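
One possible shape for such a case, as a Python sketch rather than a patch to the real script. It assumes that entries like the one above are segments with no identified speaker and no recognized text; the dictionary keys and the name `drop_empty_segments` are hypothetical.

```python
def drop_empty_segments(segments):
    """Remove segments that have neither a known speaker nor any text."""
    return [
        seg for seg in segments
        if seg["speaker"] is not None or seg["text"].strip()
    ]


segments = [
    {"start": 0.54, "end": 1.45, "speaker": "Speaker 1", "text": "Welcome to"},
    # Shaped like the entry above: unknown speaker, nothing recognized.
    {"start": 387.794, "end": 387.834, "speaker": None, "text": ""},
]
kept = drop_empty_segments(segments)
print(len(kept))
```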

garricn · 2019-01-24T02:16:49Z

This is amazing @ole! Do you think it's good enough to open a PR adding it to the repo?

garricn · 2019-01-24T02:17:19Z

Also, is there a way to automate this?

garricn · 2019-01-25T05:21:16Z

@ole this is amazing! I’ll mention it during #22 and try to get you some support so we can get this integrated into the repo and workflow.

ole · 2019-01-25T13:21:42Z

> Maybe you could add a case to the script that would remove entries like this.

@calebkleveter Good call. That wouldn't be difficult to include in my code. I wanted to preserve the structure as the automatic transcription created it for this first attempt.

> Do you think it's good enough to open a PR adding it to the repo?

@garricn We could of course publish the unedited transcript as-is. It's arguably better than nothing, even with all the transcription errors. And we can always push manual edits later (and possibly in a piecemeal fashion; it takes a lot of time to edit an entire episode).

I considered editing the portion of the transcript that I found most interesting to preserve for posterity (mostly @lattner's comments about the origins of Swift) and possibly post it to my own blog in addition to the entire transcript (which should definitely be hosted on the podcast's website). Would you be okay with that?

> Also, is there a way to automate this?

@garricn The steps up to this current state are automatable, yes. Like all AWS services, Amazon Transcribe has an API, and so do other potential services, I believe. (Google has one, and there are probably others. Apple ships a speech recognition API in the iOS and macOS SDKs, but if I remember correctly it had some limitations regarding the length of the content you could transcribe in one go. I might be wrong about that.)

In any case, if you have an AWS account, setting up a transcription job on Amazon Transcribe takes just a few clicks. It would be worth automating if we have a complete process in place, but it's not the most important step right now IMO.
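
For reference, a sketch of what starting a job through the API could look like in Python with the `boto3` SDK. This is an illustration, not something that was run for this episode: the bucket, job name, and helper names are hypothetical, and the audio is assumed to be in an S3 bucket already.

```python
def build_transcription_request(job_name, media_uri, speakers,
                                language="en-US", media_format="mp3"):
    """Build the parameters for a transcription job with speaker labels."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": media_format,
        "LanguageCode": language,
        # Tell the service how many speakers to distinguish.
        "Settings": {"ShowSpeakerLabels": True, "MaxSpeakerLabels": speakers},
    }


def start_job(request):
    """Submit the job. Requires boto3 and configured AWS credentials."""
    import boto3  # third-party dependency, imported only when needed

    client = boto3.client("transcribe")
    return client.start_transcription_job(**request)


# Hypothetical job name and S3 location, for illustration only.
request = build_transcription_request(
    "episode-1", "s3://example-bucket/episode-1.mp3", speakers=2
)
print(request)
```

Calling `start_job(request)` would submit the job; the result still has to be downloaded once the job has finished.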

I wrote Swift code to parse the JSON file produced by Amazon Transcribe and a simple function to output it as Markdown. I haven't published the code yet, but I can certainly do that. This is where we currently stand.

The next step is where it gets complicated: Ideally, I'd like to be able to edit the transcript manually while preserving as much of the timecode information as possible. This means we can't just edit the Markdown source because it would be pretty much impossible to reintegrate the edits with the timecodes (at least on a per-word basis — arguably, per-sentence or per-paragraph timecodes would be good enough).
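
To make the per-word versus per-paragraph trade-off concrete, here is a small Python sketch that collapses per-word timecodes into per-paragraph ones. Splitting on pauses longer than `max_gap` seconds is an assumption chosen for illustration (splitting on speaker changes would work the same way), and the third sample word is made up.

```python
def group_words(words, max_gap=1.0):
    """Group (start, end, text) words into (start, end, text) paragraphs.

    A new paragraph starts whenever the pause between two consecutive
    words is longer than max_gap seconds.
    """
    groups = []
    current = []
    for word in words:
        if current and word[0] - current[-1][1] > max_gap:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    # Keep only the first start and the last end of each group.
    return [
        (group[0][0], group[-1][1], " ".join(w[2] for w in group))
        for group in groups
    ]


sample = [(0.54, 1.31, "Welcome"), (1.31, 1.45, "to"), (3.0, 3.4, "Swift")]
paragraphs = group_words(sample)
print(paragraphs)
```

With only a start and end time per paragraph, the text inside a paragraph can be edited freely without invalidating any timecode.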

I plan to research whether there is existing transcription software (a desktop app or a web-based solution) that we could use for this. I don't know of any off the top of my head, but I have a hunch others must have solved the same problem. If anyone has suggestions, I'm all ears.

lattner · 2019-01-26T17:21:11Z

> I considered editing the portion of the transcript that I found most interesting to preserve for posterity (mostly @lattner's comments about the origins of Swift) and possibly post it to my own blog in addition to the entire transcript (which should definitely be hosted on the podcast's website). Would you be okay with that?

Yes, absolutely. It occurs to me that we don't have a license for the repo, but IMO it makes sense to use the Creative Commons Attribution license, which allows very permissive use and re-use of the contents.

In terms of the transcript, I think it would be really great to post something relatively raw and then ask for contributors to help with the editing. GitHub is pretty good for collaboration :-)

Thanks for driving this @ole!

ole · 2019-01-31T11:41:21Z

Thank you for the shoutout in episode 2. I haven't forgotten this. I'll send a PR with the (unedited) transcript for episode 1 soon.

jjonesdev · 2019-02-01T14:54:11Z

Hello 👋 @ole ! This is really cool! I'm a Junior, but how can I help?

JulianKahnert · 2019-02-02T14:56:52Z

Hi there, same here! I would love to get involved in this (side-)project! 😊

ole · 2019-02-03T18:29:18Z

@jonesandcode @JulianKahnert Great! I pushed my code for parsing the Amazon Transcribe JSON format to this repository: ole/transcribe. Feel free to have a look.

The only functionality so far is parsing Amazon Transcribe files and outputting them in a (hardcoded) Markdown format. I still want to research a good existing file format for transcripts that would allow us to edit the transcript text while keeping speaker and timecode information (at least on a per-paragraph basis; the per-word timecodes that Amazon Transcribe produces are probably overkill).

JulianKahnert · 2019-02-03T21:18:51Z

@ole @jonesandcode What do you think about WebVTT? I have no experience with it, but it seems to be supported by Auphonic and the beta version of the Podlove Web Player.

We can even see this in action (sorry for the German reference):
https://forschergeist.de/podcast/fg060-klimawandel/
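
For anyone unfamiliar with the format, a WebVTT file is a `WEBVTT` header followed by cues, each with a `start --> end` timing line and the cue text. A minimal Python sketch of writing one is below; the function names are hypothetical, the speaker is encoded with a `<v ...>` voice tag, and escaping of special characters in the text is not handled.

```python
def vtt_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    millis = round(seconds * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(cues):
    """Render (start, end, speaker, text) tuples as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for start, end, speaker, text in cues:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(f"<v {speaker}>{text}")
        lines.append("")  # a blank line ends the cue
    return "\n".join(lines)


vtt = to_webvtt([(0.54, 1.45, "Speaker 1", "Welcome to")])
print(vtt)
```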

ole · 2019-02-03T22:59:31Z

@JulianKahnert I love it! I opened a separate issue in the other repo: ole/transcribe#2

Maybe it's better to discuss concrete next steps over there.

JulianKahnert · 2019-02-12T20:18:06Z

Does anyone have experience with the Auphonic Transcript Editor?

It seems like a perfect fit for creating the transcript (e.g. via AWS) and editing it afterwards with an inline HTML editor (example). Since I have never used Auphonic, I don't know if we can use it collaboratively.

Another option might be the "open source transcript editor":

> Our open source transcript editor, which is embedded directly in the HTML Transcript File, has been designed to make checking and editing transcripts as easy as possible. Try it yourself with our Transcript Editor Examples.

https://auphonic.com/help/algorithms/speech_recognition.html#transcript-editor

It would be awesome if we could use the transcript editor, but I cannot find any documentation except for the two Transcript Editor Examples. Am I missing something? 🤔

@ole

* Add unedited WebVTT transcript for episode 1

  The transcript has been generated with the Amazon Transcribe service (as discussed in #15) and converted to the WebVTT format with a Swift tool written by @ole (https://github.com/ole) and @JulianKahnert (https://github.com/JulianKahnert). This autogenerated transcript contains many transcription errors, but it gives us a good baseline for manual editing.
* Add some metadata and a "request for editing" to episode 1 transcript
* Edit episode 1 transcript from 00:00:00.000 to 00:05:38.497
* Edit episode 1 transcript from 00:05:38.887 to 00:09:46.529
* Edit episode 1 transcript from 00:59:15:213 to 01:07:26.240
* Edit episode 1 transcript from 00:50:18.774 to 00:59:15.093
* Small transcript edits
* Edit episode 1 transcript from 00:15:14.437 to 00:23:31.535
* Episode 1 transcript from 00:39:55.214 -> 00:50:18.494
* Edit episode 1 transcript from 00:09:51.039 to 00:15:13.397
* More transcript edits
* Edit episode 1 transcript from 00:23:32.785 to 00:30:32.178
* Edit episode 1 transcript from 00:30:32.658 to 00:39:54.794
* Edit episode 1 transcript header comment

BasThomas · 2019-02-14T18:46:44Z

Closing as this is done! 🎉

Sent with GitHawk

ole mentioned this issue Feb 3, 2019

WebVTT support ole/transcribe#2

Closed

ole mentioned this issue Feb 10, 2019

Episode 1 transcript #43

Merged

JulianKahnert mentioned this issue Feb 12, 2019

add name option to CLI ole/transcribe#4

Merged

BasThomas closed this as completed Feb 14, 2019

ole mentioned this issue Aug 20, 2019

Add Episode 2 transcript #68

Open
