# Automatic speech transcription using Whisper
In this module, we use [Whisper](https://github.com/openai/whisper) from OpenAI to transcribe speech automatically. Whisper is a robust automatic speech recognition (ASR) model that supports 99 languages (e.g., English, Italian, Dutch, Japanese, Chinese, and Spanish).

Whisper comes in five model sizes. All sizes except the largest also have an English-only variant:

|  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
| medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
| large  |   1550 M   |        N/A         | `large/large-v2` |    ~10 GB     |       1x       |

As you can see, smaller models run faster but produce less accurate transcripts.
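
To make the trade-off concrete, here is a minimal sketch that picks the largest model fitting a given amount of GPU memory. The VRAM numbers are the approximate values from the table above; the helper `pick_model_size` and the dictionary name are made up for this illustration and are not part of Whisper.

```python
# Approximate VRAM needs in GB, copied from the table above (smallest to largest).
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v2": 10,
}


def pick_model_size(available_vram_gb):
    """Hypothetical helper: return the largest model whose VRAM need fits."""
    fitting = [name for name, need in MODEL_VRAM_GB.items() if need <= available_vram_gb]
    if not fitting:
        raise ValueError("Not enough VRAM for any model in the table")
    # dicts keep insertion order, so the last fitting entry is the largest model
    return fitting[-1]


print(pick_model_size(4))   # small
print(pick_model_size(12))  # large-v2
```

The returned name can be used as `model_size` later in this notebook. Keep in mind that the largest model that fits is also the slowest one that fits.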


### Issues with timestamp accuracy
The original Whisper models do not capture silences correctly. Instead, they include the silences in the timestamps. <br>
For example, if an utterance was produced from 00:00:11.000 to 00:00:15.000 and followed by a 2-second silence, the timestamp will be 00:00:11.000 - 00:00:17.000 instead. <br>

To solve this, we can use **[whisper-timestamped](https://github.com/linto-ai/whisper-timestamped#python)**, which gives more accurate timing and word-level timestamps, and partially recovers disfluencies (e.g., uh) as "*".
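
To give an idea of what word-level timestamps look like, here is a small sketch that walks through a result dictionary. The dictionary is hand-written for illustration and is not real model output. It assumes that each segment carries a `"words"` list whose items have `"text"`, `"start"`, and `"end"` keys, and that a disfluency shows up as a word containing `*` (written here as `[*]`). The helper `list_words` is hypothetical.

```python
# Hand-written sample that imitates the assumed structure of a result.
# All values are made up for illustration.
sample_result = {
    "text": " I see [*] the cake.",
    "segments": [
        {
            "id": 0,
            "start": 11.0,
            "end": 13.1,
            "text": " I see [*] the cake.",
            "words": [
                {"text": "I", "start": 11.0, "end": 11.2},
                {"text": "see", "start": 11.2, "end": 11.6},
                {"text": "[*]", "start": 11.6, "end": 12.3},
                {"text": "the", "start": 12.3, "end": 12.5},
                {"text": "cake.", "start": 12.5, "end": 13.1},
            ],
        }
    ],
}


def list_words(result):
    """Hypothetical helper: flatten all words into (text, start, end) tuples."""
    rows = []
    for segment in result["segments"]:
        for word in segment.get("words", []):
            rows.append((word["text"], word["start"], word["end"]))
    return rows


for text, start, end in list_words(sample_result):
    marker = "  <- disfluency" if "*" in text else ""
    print(f"{start:6.2f} - {end:6.2f}  {text}{marker}")
```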


***

### Import packages and define paths

Let's first import whisper-timestamped and other required packages.


<font color = "mandarin">If you haven't installed the packages, follow the steps below</font>

1. Open a terminal/Anaconda prompt in the folder where you store this notebook
1. Activate your conda environment (e.g., `conda activate mld_study_group`)
1. Run this command: `pip install -r requirements.txt`


In [1]:
import whisper_timestamped as whisper
import pandas as pd
import os
from datetime import datetime, timedelta

input_folder = "input"
output_folder = "output"

Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.



### Create functions that convert time to the format ELAN can recognize.

**For float timestamps**
1. convert the seconds to a timedelta object: 84.14 -> 0:01:24.140000
2. convert the timedelta object to a time object to enable formatting with strftime(): 0:01:24.140000 -> 00:01:24.140000
3. format the time object using strftime() and keep three decimals: 00:01:24.140000 -> "00:01:24.140"
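
The three steps can be traced one by one for 84.14 seconds. This is only a walk-through of the intermediate values, using the standard library; the variable names are made up for the illustration.

```python
from datetime import datetime, timedelta

seconds = 84.14

step1 = timedelta(seconds=seconds)          # a duration
step2 = (datetime.min + step1).time()       # the same value as a time of day
step3 = step2.strftime("%H:%M:%S.%f")[:-3]  # drop the last three digits (microseconds -> milliseconds)

print(step1)  # 0:01:24.140000
print(step2)  # 00:01:24.140000
print(step3)  # 00:01:24.140
```

Because step 2 keeps only the time of day, this approach assumes that the recording is shorter than 24 hours.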

In [2]:
def convert_time_float_to_string(flt_timestamp):
    t = timedelta(seconds = flt_timestamp)  #step 1
    t = (datetime.min + t).time()           #step 2

    time_format = "%H:%M:%S.%f"             #renamed from "format" to avoid shadowing the built-in function
    t_formatted = t.strftime(time_format)[:-3]   #step 3 *[:-3] drops the last three digits, leaving milliseconds

    return t_formatted


def convert_time_string_to_float(str_series):
    dt_series = pd.to_timedelta(str_series).dt.total_seconds()
    return dt_series

# str_series = pd.Series(["01:23:45.678", "02:34:56.789"])
# output:
# 0    5025.678
# 1    9276.789

### Create a function to format the Whisper output for ELAN and export it as a csv file

In [3]:
def export_transcript_as_csv(result, output_filename):

    transcript_list = []

    # save result["segments"] as segments so that we don't need to type result["segments"] every time
    segments = result["segments"]

    for segment in segments:
        # get the id, start time, and end time of the segment
        segment_id = segment["id"]
        start = segment["start"]
        end = segment["end"]

        # convert the start and end time to string
        start = convert_time_float_to_string(start)
        end = convert_time_float_to_string(end)

        # get the text of the segment
        text = segment["text"]

        # create a new row for the dataframe
        new_row = {'id': segment_id, 'start': start, 'end': end, 'text': text}

        # append the new row to the list
        transcript_list.append(new_row)

    # build the dataframe from the list (fixed columns, so an empty transcript still gets a header)
    df_transcript = pd.DataFrame(transcript_list, columns=['id', 'start', 'end', 'text'])

    # make sure the output folder exists before writing the csv file
    os.makedirs(output_folder, exist_ok=True)
    output_file = os.path.join(output_folder, output_filename)
    df_transcript.to_csv(output_file, index=False)

### Run the model and export the output as csv

<details><summary>Click here for the list of languages</summary>

1.  "en": "english",
1.  "zh": "chinese",
1. "de": "german",
1. "es": "spanish",
1. "ru": "russian",
1. "ko": "korean",
1. "fr": "french",
1. "ja": "japanese",
1. "pt": "portuguese",
1. "tr": "turkish",
1. "pl": "polish",
1. "ca": "catalan",
1. "nl": "dutch",
1. "ar": "arabic",
1. "sv": "swedish",
1. "it": "italian",
1. "id": "indonesian",
1. "hi": "hindi",
1. "fi": "finnish",
1. "vi": "vietnamese",
1. "he": "hebrew",
1. "uk": "ukrainian",
1. "el": "greek",
1. "ms": "malay",
1. "cs": "czech",
1. "ro": "romanian",
1. "da": "danish",
1. "hu": "hungarian",
1. "ta": "tamil",
1. "no": "norwegian",
1. "th": "thai",
1. "ur": "urdu",
1. "hr": "croatian",
1. "bg": "bulgarian",
1. "lt": "lithuanian",
1. "la": "latin",
1. "mi": "maori",
1. "ml": "malayalam",
1. "cy": "welsh",
1. "sk": "slovak",
1. "te": "telugu",
1. "fa": "persian",
1. "lv": "latvian",
1. "bn": "bengali",
1. "sr": "serbian",
1. "az": "azerbaijani",
1. "sl": "slovenian",
1. "kn": "kannada",
1. "et": "estonian",
1. "mk": "macedonian",
1. "br": "breton",
1. "eu": "basque",
1. "is": "icelandic",
1. "hy": "armenian",
1. "ne": "nepali",
1. "mn": "mongolian",
1. "bs": "bosnian",
1. "kk": "kazakh",
1. "sq": "albanian",
1. "sw": "swahili",
1. "gl": "galician",
1. "mr": "marathi",
1. "pa": "punjabi",
1. "si": "sinhala",
1. "km": "khmer",
1. "sn": "shona",
1. "yo": "yoruba",
1. "so": "somali",
1. "af": "afrikaans",
1. "oc": "occitan",
1. "ka": "georgian",
1. "be": "belarusian",
1. "tg": "tajik",
1. "sd": "sindhi",
1. "gu": "gujarati",
1. "am": "amharic",
1. "yi": "yiddish",
1. "lo": "lao",
1. "uz": "uzbek",
1. "fo": "faroese",
1. "ht": "haitian creole",
1. "ps": "pashto",
1. "tk": "turkmen",
1. "nn": "nynorsk",
1. "mt": "maltese",
1. "sa": "sanskrit",
1. "lb": "luxembourgish",
1. "my": "myanmar",
1. "bo": "tibetan",
1. "tl": "tagalog",
1. "mg": "malagasy",
1. "as": "assamese",
1. "tt": "tatar",
1. "haw": "hawaiian",
1. "ln": "lingala",
1. "ha": "hausa",
1. "ba": "bashkir",
1. "jw": "javanese",
1. "su": "sundanese"

</details>

In [12]:
model_size = "base"
language = "en"
model = whisper.load_model(model_size)

# iterate over files in the input folder & apply the whisper model to each file
for filename in os.listdir(input_folder):
    file = os.path.join(input_folder, filename)
    # check if it is a wav or mp4 file
    if filename.endswith(".wav") or filename.endswith(".mp4"):
        # check if the output file already exists
        # (splitext removes only the extension, so dots elsewhere in the filename are kept)
        output_filename = os.path.splitext(filename)[0] + ".csv"
        if os.path.exists(os.path.join(output_folder, output_filename)):
            print(f"{output_filename} already exists in the output folder")
        else:
            #apply whisper model on each file
            print("Now, whisper is working on " + filename)
            result = whisper.transcribe(
                model, file, 
                language=language, 
                vad=True,               #default = False
                no_speech_threshold=0.01,
                beam_size=5, 
                best_of=5, 
                temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                detect_disfluencies=True
                )

            #export the results as csv
            export_transcript_as_csv(result, output_filename)

salma_hayek_short.mp4
salma_hayek_short.csv already exists in the output folder
salma_hayek_short.wav
salma_hayek_short.csv already exists in the output folder


### <font color="orange">Exercise 1: Import transcript to ELAN</font>
Let's import the transcript into ELAN. See the [official ELAN documentation](https://www.mpi.nl/corpus/html/elan/ch04s03s01.html#Sec_Importing_CSV_Tab-delimited_Text_Files) on importing csv files.

### <font color="orange">Exercise 2: Change the model size</font>
Change `model_size` in the code above to "large-v2" and run the Whisper model again. After running the model, answer the following questions:

- Is the output more accurate than that of the base model?
- How long did it take the large-v2 model to process a 37-second video?
- Did the audience's voices affect the transcription accuracy?

Note: Before rerunning, add "_base" to the name of the existing csv file in the output folder. Otherwise, the code above skips the input file, because it does not run Whisper when a csv file with the same name already exists.
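
To answer the question about processing time, one option is to wrap the call in a small timer. The sketch below uses `time.perf_counter()` from the standard library. The helper `timed` and the stand-in function `fake_transcribe` are made up for this illustration so that the example runs without loading a model; in the notebook, the wrapped call would be `whisper.transcribe(model, file, ...)`.

```python
import time


def timed(func, *args, **kwargs):
    """Hypothetical helper: call func and return (result, elapsed_seconds)."""
    started = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - started
    return result, elapsed


# Stand-in for the real transcription call (assumption for illustration only).
def fake_transcribe(path):
    time.sleep(0.05)  # pretend to work for a moment
    return {"segments": [], "source": path}


result, elapsed = timed(fake_transcribe, "input/example.wav")
print(f"Processing took {elapsed:.2f} seconds")
```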

### <font color="orange">Exercise 3: Combine segments</font>
Did you notice that Whisper sometimes cuts the transcript in the middle of a sentence? This is because Whisper keeps each line of output short so that it fits on the screen when shown as a subtitle.

However, this is not what we want in research. We want each line to be a sentence, a turn construction unit (TCU), etc. So, let's combine short segments into sentences.

One thing to keep in mind is that people don't speak perfectly in natural settings: they produce pauses, disfluencies, etc. To reflect this, let's combine segments only if:

1. there's no time difference between the end of the target segment and the start of the next segment (diff = 0), and 
2. there's no period in the target segment. 

To illustrate, let's take a look at the following example:

| id | start | end | text | diff (sec) |
|----|-------|-----|------|------|
| 0  |00:00:08.940|00:00:13.500|I see the beautiful wedding cake in a little table|0|
| 1  |00:00:13.500|00:00:15.300|with two chairs for the bride and the groom.|0|
| 2  |00:00:15.300|00:00:16.780|Instead of the bride and the groom,|0|
| 3  |00:00:16.780|00:00:21.420|there is Lupe and Angie sitting perfectly.|4.69|

Here, we will combine segments 0 and 1 because the time difference between the end of segment 0 and the start of segment 1 is 0, and segment 0 doesn't contain a period.<br>
As for segments 1 and 2, although the time difference is 0, we will **not** combine them because segment 1 contains a period.<br>
Segments 2 and 3 are combined for the same reason as segments 0 and 1.<br>
Lastly, we will **not** combine segment 3 with segment 4 (not shown in the table) because the time difference isn't 0.
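
Before working with real csv files, the rule can be checked on the four example rows with a short pure-Python sketch. The helpers `to_seconds` and `combine` are made up for this illustration, and the leading space in each text is an assumption about how the segment texts are stored.

```python
def to_seconds(timestamp):
    """Convert "HH:MM:SS.mmm" to seconds as a float."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# The four example rows from the table above.
rows = [
    {"start": "00:00:08.940", "end": "00:00:13.500",
     "text": " I see the beautiful wedding cake in a little table"},
    {"start": "00:00:13.500", "end": "00:00:15.300",
     "text": " with two chairs for the bride and the groom."},
    {"start": "00:00:15.300", "end": "00:00:16.780",
     "text": " Instead of the bride and the groom,"},
    {"start": "00:00:16.780", "end": "00:00:21.420",
     "text": " there is Lupe and Angie sitting perfectly."},
]


def combine(rows, max_gap=0.0):
    """Merge a row into the next one when the gap is at most max_gap
    and the row's text contains no period."""
    merged = []
    current = None
    for i, row in enumerate(rows):
        if current is None:
            current = {"start": row["start"], "text": ""}
        current["text"] += row["text"]
        current["end"] = row["end"]

        is_last = i == len(rows) - 1
        if is_last:
            keep_going = False
        else:
            gap = to_seconds(rows[i + 1]["start"]) - to_seconds(row["end"])
            keep_going = gap <= max_gap and "." not in row["text"]

        if not keep_going:
            merged.append(current)
            current = None
    return merged


for segment in combine(rows):
    print(segment["start"], segment["end"], segment["text"])
```

With these rows, the sketch produces two lines: segments 0 and 1 become one sentence, and segments 2 and 3 become another.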

Let's write a script to implement this!!

In [13]:
### Function to combine segments
def combine_segments(file_path):
    silence_duration = 0  #in seconds
    
    new_df = pd.DataFrame([], columns=['start', 'end', 'text', 'diff'])
    new_list = []

    df = pd.read_csv(file_path)
    df['diff'] = convert_string_to_float(df['start'].shift(-1)) - convert_string_to_float(df['end']) # we need to convert from string to dtype: float64 to perform calculation

    #initiate variables used in the for loop
    count = 0
    text = ""
    start = ""
    end = ""

    for index, row in df.iterrows():
        if pd.isnull(df.loc[index, 'text']):            #if the value of text is null 
            continue                                    #skip the row  
        
        #when the gap between the next segment's start time and the current segment's end time is at most "silence_duration" and the text doesn't contain "."
        if df.loc[index, 'diff'] <= silence_duration and "." not in df.loc[index, 'text']:
            text += df.loc[index, 'text']               #add the value of text to "text" variable
            if count == 0:                              #if this is the first segment after a long pause (> silence_duration)
                start = df.loc[index, 'start']
                count += 1

        else:
            if start == "":
                start = df.loc[index, 'start']

            text += df.loc[index, 'text']
            end = df.loc[index, 'end']
            diff = df.loc[index, 'diff']

            new_row = {'start': start, 'end': end, 'text': text, 'diff': diff}
            new_list.append(new_row)

            #reset the variables
            count = 0
            text = ""
            start = ""
            end = ""

    # build the dataframe directly from the list (avoids concatenating with an empty dataframe)
    new_df = pd.DataFrame(new_list, columns=['start', 'end', 'text', 'diff'])
    
    return new_df


### Function to convert string time to float
def convert_string_to_float(str_series):
    dt_series = pd.to_timedelta(str_series).dt.total_seconds()
    return dt_series

# str_series = pd.Series(["01:23:45.678", "02:34:56.789"])
# output:
# 0    5025.678
# 1    9276.789

In [14]:
### combine the segments of each csv file in the output folder and save the result as a "_merged.csv" file
for filename in os.listdir(output_folder):
    file_path = os.path.join(output_folder, filename)
    # skip files that are already merged outputs (names ending with "_merged.csv")
    if filename.endswith("_merged.csv"):
        continue #skip the file
    elif filename.endswith(".csv"):
        output_filename = filename.split(".csv")[0] + "_merged.csv"
        output_file = os.path.join(output_folder, output_filename)
        #check if the output file already exists
        if os.path.exists(output_file):
            print(f"{output_filename} already exists in the output folder")
        else:
            print("Now, combining segments for " + filename)
            output = combine_segments(file_path)
            output.to_csv(output_file, index=False)        

Now, combining segments for salma_hayek_short.csv
Now, combining segments for salma_hayek_short_largev2.csv
