# Convert transcripts to readable files

To work with this notebook, first obtain your transcription files from the [**KanjuTech Transcription and Diarization Model**](https://aws.amazon.com/marketplace/pp/prodview-ngtdx4ayt4emo), as demonstrated in this [sample notebook](https://github.com/KanjuTech/aws-marketplace/blob/main/KanjuTech-Transcription-Speaker-Diarization-Model.ipynb). Alternatively, you can use this [example output](https://github.com/KanjuTech/aws-marketplace/blob/main/example_output.json).

> **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open it from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.

Each phrase in the JSON output has the following fields:

**id** - Name of transcribed audio file.  
**speaker** - Speaker number.  
**start** - Phrase start time in seconds.  
**end** - Phrase end time in seconds.  
**text** - Text of the phrase.  
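
For orientation, here is a minimal sketch of how such a file might be loaded and inspected. It assumes the file holds a JSON array of phrase objects with the five fields above, which is how the conversion cells in this notebook iterate over it. The sample values, including the `call.wav` file name, are made up for illustration and are not real model output.

```python
import json

# Hypothetical sample: the values are illustrative, not real model output.
sample = """
[
  {"id": "call.wav", "speaker": "Speaker_1", "start": 0.0, "end": 1.8, "text": "Okay, how are you?"},
  {"id": "call.wav", "speaker": "Speaker_2", "start": 2.1, "end": 3.0, "text": "I'm pretty good."}
]
"""

phrases = json.loads(sample)
for phrase in phrases:
    # Phrase duration in seconds, derived from the start and end fields
    duration = round(phrase["end"] - phrase["start"], 2)
    print(phrase["speaker"], phrase["start"], duration, phrase["text"])
```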

> **Note**: This reference notebook will not run until you make the changes suggested in the code cells, such as setting the name of your S3 bucket.

## Contents:
1. [Phrase by phrase](#1.-Phrase-by-phrase)
2. [Speaker by speaker](#2.-Speaker-by-speaker)
3. [SRT](#3.-SRT)
4. [Clean-up JSON directory](#4.-Clean-up-JSON-directory)
5. [Questions](#5.-Questions)

In [None]:
import s3fs
import json
import os
import datetime

In [None]:
bucket = 's3://<Name-of-your-existing-S3-bucket>' # Write the name of your S3 bucket where you store your input files and want to save the output

In [None]:
# Specify S3 folders
json_outputs = bucket+'/'+'batch-transcript' # Your folder with results of transcription from the model
txt_transcripts = bucket+'/'+'final-transcript' # Folder for converted transcriptions

In [None]:
fs = s3fs.S3FileSystem()
fs_ls = fs.ls(json_outputs)
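# Keep only entries whose file name has an extension (checking the full path would also match dots in the bucket name)
paths = [k for k in fs_ls if '.' in os.path.basename(k)]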

## 1. Phrase by phrase

This script converts the transcript into text with one line per phrase. Each phrase is written on its own line, even when the same speaker utters several phrases in a row:

> Speaker_1 (0:00:00): Okay, how are you?  
Speaker_2 (0:00:02): I'm pretty good.  
Speaker_2 (0:00:03): That's a strange deal.  
Speaker_2 (0:00:05): What's that all about?  
Speaker_1 (0:00:06): Well, you know, I'm on a computer mailing list on my e-mail.  
Speaker_1 (0:00:14): It's a research thing for psycholinguistics.  
Speaker_1 (0:00:17): And at UPenn, they're building a linguistic database of many languages, and so they were offering free phone calls anywhere in the world.  
Speaker_1 (0:00:27): We have to only speak one language, though.  
Speaker_1 (0:00:30): So they're collecting lots of different languages, but you have to only speak the two parties have to speak the same language.  
Speaker_1 (0:00:37): So I could only call a native English speaker.  
Speaker_1 (0:00:40): So that was the deal.  
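
The line format above can be sketched as a small standalone function, which is convenient for checking the output locally without S3. The name `format_phrase` is a hypothetical helper introduced here for illustration; it applies the same rounding and `datetime.timedelta` formatting as the conversion cell.

```python
import datetime

def format_phrase(phrase):
    """Format one phrase as 'Speaker (H:MM:SS): text'."""
    # Round the start time to whole seconds, then let timedelta render H:MM:SS
    stamp = datetime.timedelta(seconds=round(phrase["start"]))
    return "{} ({}): {}".format(phrase["speaker"], stamp, phrase["text"])

print(format_phrase({"speaker": "Speaker_2", "start": 2.3, "text": "I'm pretty good."}))
```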

In [None]:
# Convert JSON files and save them as .txt
for json_file_path in paths:
    # Load JSON from S3
    with fs.open(json_file_path, "r") as f:
        contents = json.load(f)

    # Convert and save under the same base name with a .txt extension
    file_name = os.path.splitext(os.path.basename(json_file_path))[0]
    with fs.open(txt_transcripts+'/{}.txt'.format(file_name), 'w') as f:
        try:
            for content in contents:
                print(content["speaker"],
                      '({}):'.format(datetime.timedelta(seconds=round(content["start"]))),
                      content["text"],
                      file=f)
        except (KeyError, IndexError, TypeError):
            # The entries do not have the expected fields: write the first element as-is
            print(contents[0], file=f)

## 2. Speaker by speaker

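This script converts the transcript into text segmented by speaker turn. Consecutive phrases from the same speaker are merged into a single entry. Note that it writes to the same file names in the `final-transcript` folder as the script in section 1, so running both overwrites the phrase-by-phrase files: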

> Speaker_1 (0:00:00): Okay, how are you?  
Speaker_2 (0:00:02): I'm pretty good. That's a strange deal. What's that all about?  
Speaker_1 (0:00:06): Well, you know, I'm on a computer mailing list on my e-mail. It's a research thing for psycholinguistics. And at UPenn, they're building a linguistic database of many languages, and so they were offering free phone calls anywhere in the world. We have to only speak one language, though. So they're collecting lots of different languages, but you have to only speak the two parties have to speak the same language. So I could only call a native English speaker. So that was the deal.  
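
The merging step can also be expressed with `itertools.groupby`, which groups consecutive items that share a key. This is an alternative sketch, not the notebook's own implementation; `merge_turns` is a hypothetical helper name, and joining phrases with a single space is an assumption about how the text should be separated.

```python
from itertools import groupby

def merge_turns(phrases):
    """Merge consecutive phrases by the same speaker into single turns."""
    turns = []
    # groupby only groups adjacent items, so a speaker who returns later starts a new turn
    for speaker, group in groupby(phrases, key=lambda p: p["speaker"]):
        group = list(group)
        turns.append({
            "speaker": speaker,
            "start": group[0]["start"],  # the turn starts with its first phrase
            "text": " ".join(p["text"].strip() for p in group),
        })
    return turns

# Illustrative input, not real model output
example = [
    {"speaker": "Speaker_1", "start": 0.0, "text": "Okay, how are you?"},
    {"speaker": "Speaker_2", "start": 2.1, "text": "I'm pretty good."},
    {"speaker": "Speaker_2", "start": 3.2, "text": "That's a strange deal."},
    {"speaker": "Speaker_1", "start": 6.0, "text": "Well, you know."},
]
for turn in merge_turns(example):
    print(turn["speaker"], turn["start"], turn["text"])
```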

In [None]:
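# Convert JSON files and save them as .txt, merging consecutive phrases by the same speaker
for json_file_path in paths:
    # Load JSON from S3
    with fs.open(json_file_path, "r") as f:
        contents = json.load(f)

    # Convert and save under the same base name with a .txt extension
    file_name = os.path.splitext(os.path.basename(json_file_path))[0]
    with fs.open(txt_transcripts+'/{}.txt'.format(file_name), 'w') as f:
        if not contents:
            continue  # No phrases: leave the output file empty
        try:
            current_speaker = contents[0]['speaker']
            texts = []
            s_time = contents[0]["start"]
            for content in contents:
                speaker = content['speaker']
                if speaker != current_speaker:
                    # Speaker changed: write the finished turn and start a new one
                    print(current_speaker,
                          '({}):'.format(datetime.timedelta(seconds=round(s_time))),
                          ' '.join(texts),
                          file=f)
                    texts = []
                    s_time = content["start"]
                    current_speaker = speaker
                # Strip and join with a space so merged phrases do not run together
                texts.append(content["text"].strip())
            # Write the last turn
            print(current_speaker,
                  '({}):'.format(datetime.timedelta(seconds=round(s_time))),
                  ' '.join(texts),
                  file=f)
        except (KeyError, IndexError, TypeError):
            # The entries do not have the expected fields: write the first element as-is
            print(contents[0], file=f)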

## 3. SRT

In progress.
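
While this section is in progress, the following is one possible sketch of an SRT conversion, not the authors' implementation. It assumes the standard SubRip layout (a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, the subtitle text, and a blank line) and uses the `start`, `end`, `speaker`, and `text` fields described at the top of this notebook. The helper names `srt_timestamp` and `to_srt`, and the choice to prefix each subtitle with the speaker label, are assumptions made for illustration.

```python
def srt_timestamp(seconds):
    """Convert a time in seconds to the SRT 'HH:MM:SS,mmm' form."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, secs, ms)

def to_srt(phrases):
    """Render a list of phrase dicts as SRT text, one subtitle per phrase."""
    blocks = []
    for index, p in enumerate(phrases, start=1):
        blocks.append("{}\n{} --> {}\n{}: {}\n".format(
            index,
            srt_timestamp(p["start"]),
            srt_timestamp(p["end"]),
            p["speaker"],
            p["text"].strip(),
        ))
    # Blocks already end with a newline; joining with another gives the blank separator line
    return "\n".join(blocks)

# Illustrative input, not real model output
print(to_srt([
    {"speaker": "Speaker_1", "start": 0.0, "end": 1.5, "text": "Okay, how are you?"},
    {"speaker": "Speaker_2", "start": 2.1, "end": 3.0, "text": "I'm pretty good."},
]))
```

The resulting string could be written to S3 with `fs.open(..., 'w')` in the same way as the .txt transcripts in the sections above.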

## 4. Clean-up JSON directory

After converting the JSON files to .txt, you can delete the original JSON files if you no longer need them. The cell below permanently removes them from the `batch-transcript` folder.

In [None]:
fs_ls = fs.ls(json_outputs)
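# Keep only entries whose file name has an extension (checking the full path would also match dots in the bucket name)
paths = [k for k in fs_ls if '.' in os.path.basename(k)]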
for file in paths:
    fs.rm(file)

## 5. Questions

If you have any questions about our product, feel free to email us at aws@kanju.tech or schedule a [meeting](https://calendly.com/kanjutech).