Using whisper in R by Jalal Al-Tamimi
==============================

During this session, we will use `whisper` to automatically transcribe an audio file. We will use the R package `reticulate` to run Python code from within RStudio. At the end of the session, we will export the transcription to a `TextGrid` file that can be opened in `Praat` for further acoustic analysis.

# Loading and installation

## Loading packages

Before running anything, make sure to run the code below to load the `rpy2` extension, which lets you run R within Python. Then start a cell with `%%R` each time you want to run R code.

Then, make sure to have already installed and loaded the package `reticulate`.


In [10]:
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [11]:
%%R
### Check that all required packages are installed. Any package that is missing is installed first, then every package is loaded.
requiredPackages = c('reticulate', 'tidyverse')
for(p in requiredPackages){
  if(!require(p,character.only = TRUE)) install.packages(p, dependencies = TRUE)
  library(p,character.only = TRUE)
}
options(ggrepel.max.overlaps = Inf)

## Installation of `whisper`

Uncomment the line of code below if you haven't installed `openai-whisper` yet. From R, you can instead install it with the function `py_install` from `reticulate`. This only needs to be done once; otherwise, the package will be installed again (and this may take some time).
We could use the new R package `audio.whisper`, which is built within R. However, the first time you use this package, transcribing the audio file below takes more than 20 minutes; all subsequent runs of the code are relatively faster. In our case, we use the original Python implementation of `whisper`, which we run from RStudio via the `reticulate` package and which is much faster.


In [4]:
#!pip install openai-whisper


## Installation of `textgrid`

In [5]:
#!pip install textgrid
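
Since both installation lines are commented out, it can help to check which packages are actually missing before uncommenting them. The sketch below is one possible way to do this with the standard library; the helper name `missing_packages` and the `REQUIRED` mapping are illustrative and not part of the original notebook.

```python
import importlib.util

# pip package name -> module name used in `import`
# (openai-whisper is imported as `whisper`).
REQUIRED = {"openai-whisper": "whisper", "textgrid": "textgrid"}


def missing_packages(required):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


# Report what still needs installing, without installing anything.
for name in missing_packages(REQUIRED):
    print(f"Not installed yet: run `pip install {name}`")
```

If nothing is printed, both packages are already available and the `pip install` lines can stay commented out.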

## Loading and using `whisper`

### Audio files

We obtain various audio files from the GitHub repository of the `audio.whisper` package, [here](https://github.com/bnosac/audio.whisper/tree/master/inst/samples).
We will use the audio file `jfk.wav`, which is a recording of John F. Kennedy's inaugural address in English.
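
Loading fails with a long ffmpeg message when the audio file is not in the working directory, so a quick check beforehand gives a more readable error. This is a minimal sketch; the helper `resolve_audio` is hypothetical and assumes the file was downloaded manually into the given folder.

```python
from pathlib import Path


def resolve_audio(filename, folder="."):
    """Return the absolute path of an audio file, or raise a readable error."""
    path = Path(folder) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"{filename} not found in {Path(folder).resolve()}; "
            "download it or change the working directory first."
        )
    return path.resolve()


# Report the problem instead of stopping the notebook.
try:
    print(resolve_audio("jfk.wav"))
except FileNotFoundError as err:
    print(err)
```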


In [14]:
%%R
# load whisper module
whisper <- import("whisper")
# load the audio file (it must be in the working directory)
audio <- whisper$load_audio("jfk.wav")



### Download and load model

We download the whisper model. In our case, we use the `base` model, as it provides better predictions than the `tiny` model.

In [5]:
%%R
whisper_model <- "base" # Size of the transcription model
# (Down)load the model
print("Loading the model")
model <- whisper$load_model(whisper_model)

[1] "Loading the model"


### Transcription

In [6]:
%%R
print("Transcription started")

transcription <- model$transcribe(audio, language = "en", verbose = FALSE, word_timestamps = TRUE)

print("Transcription completed")


In [None]:
%%R
transcription %>%
  head(6)
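
To make the shape of the result easier to picture, here is a small Python sketch of the dictionary that `transcribe` returns: a full `text`, the detected `language`, and a list of `segments`, each with its own `start`, `end` and `text`. The values below are made up for illustration and real segments carry additional fields; `segment_table` is a hypothetical helper.

```python
# Hand-written stand-in for the result of model.transcribe();
# the text and times are invented for illustration only.
transcription = {
    "text": " Hello world. Good morning.",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 1.2, "text": " Hello world."},
        {"id": 1, "start": 1.5, "end": 2.4, "text": " Good morning."},
    ],
}


def segment_table(result):
    """Return one (text, start, end) tuple per segment."""
    return [
        (seg["text"].strip(), seg["start"], seg["end"])
        for seg in result["segments"]
    ]


for text, start, end in segment_table(transcription):
    print(f"{start:5.2f} {end:5.2f}  {text}")
```

Each tuple corresponds to one row of the sentence-level data frames built in the export step.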

### Export

#### First sentence

In [None]:
%%R
textDF1 <- data.frame(t(transcription$segments))[1,1]
textDF1 %>%
  head(6)

#### Second sentence

In [None]:
%%R
textDF2 <- data.frame(t(transcription$segments))[1,2]
textDF2 %>%
  head(6)

#### Dataframes

In [None]:
%%R
textDF1 <- data.frame(textDF1)[1,]
textDF2 <- data.frame(textDF2)[1,]
textDF1
textDF2

#### Full sentence

##### Sentence 1

In [None]:
%%R
textDF1Sentence <- textDF1[,c(5,3,4)]
textDF1Sentence

##### Sentence 2

In [None]:
%%R
textDF2Sentence <- textDF2[,c(5,3,4)]
textDF2Sentence

#### Words

##### Sentence 1

In [None]:
%%R
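# Columns 11 to 66 hold the word-level fields (word, start, end, probability) for the 14 words of sentence 1
textDF1Words <- textDF1[,c(11:66)]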
textDF1Words <- textDF1Words %>%
  rename_with(~str_remove(., 'words.')) %>%
  rename(word.0 = word,
         start.0 = start,
         end.0 = end,
         probability.0 = probability)
colnames(textDF1Words) <- str_replace(colnames(textDF1Words), "\\d+",
      function(x) sprintf("%02d", as.integer(x)))

textDF1Words

In [None]:
%%R
textDF1Words <- textDF1Words %>%
  pivot_longer(cols= starts_with(c("word", "start", "end", "probability")),
               names_to = c(".value", "limit"),
               names_pattern = "(.*)(..)$") %>%
  rename(word = word.,
         start = start.,
         end = end.,
         probability = probability.)

textDF1Words

##### Sentence 2

In [None]:
%%R
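# Columns 11 to 42 hold the word-level fields (word, start, end, probability) for the 8 words of sentence 2
textDF2Words <- textDF2[,c(11:42)]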
textDF2Words <- textDF2Words %>%
  rename_with(~str_remove(., 'words.')) %>%
  rename(word.0 = word,
         start.0 = start,
         end.0 = end,
         probability.0 = probability)
colnames(textDF2Words) <- str_replace(colnames(textDF2Words), "\\d+",
      function(x) sprintf("%02d", as.integer(x)))

textDF2Words

In [None]:
%%R
textDF2Words <- textDF2Words %>%
  pivot_longer(cols= starts_with(c("word", "start", "end", "probability")),
               names_to = c(".value", "limit"),
               names_pattern = "(.*)(..)$") %>%
  rename(word = word.,
         start = start.,
         end = end.,
         probability = probability.)

textDF2Words

#### Merging

In [None]:
%%R
textDFFull <- rbind(textDF1Words, textDF2Words)

textDFFull
write.csv(textDFFull, "words_whisper.csv", row.names = FALSE)
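
The reshaping above relies on column positions that are specific to this recording. As an alternative, the same word table could be built in Python by looping over the segments and their `words` lists, which works for any number of sentences. This is a sketch under assumptions: `example_result` is invented data that mimics the structure of a word-timestamped transcription, and `word_rows` and `write_words_csv` are hypothetical helpers.

```python
import csv

# Invented data mimicking a transcription with word timestamps.
example_result = {
    "segments": [
        {"start": 0.0, "end": 1.2, "text": " Hello world.", "words": [
            {"word": " Hello", "start": 0.0, "end": 0.5, "probability": 0.98},
            {"word": " world.", "start": 0.5, "end": 1.2, "probability": 0.95},
        ]},
        {"start": 1.5, "end": 2.4, "text": " Good morning.", "words": [
            {"word": " Good", "start": 1.5, "end": 1.9, "probability": 0.90},
            {"word": " morning.", "start": 1.9, "end": 2.4, "probability": 0.93},
        ]},
    ]
}

FIELDS = ["word", "start", "end", "probability"]


def word_rows(result):
    """Flatten every segment's word list into one list of row dictionaries."""
    rows = []
    for seg in result["segments"]:
        for w in seg.get("words", []):
            row = {key: w[key] for key in FIELDS}
            row["word"] = row["word"].strip()  # drop the leading space
            rows.append(row)
    return rows


def write_words_csv(rows, path):
    """Write the word rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


for row in word_rows(example_result):
    print(row)
```

Calling `write_words_csv(word_rows(example_result), "words_sketch.csv")` would then produce a file with the `word`, `start` and `end` columns that the TextGrid step reads.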


### Transform to TextGrid

In [None]:
import csv
import textgrid # install with pip install textgrid
# Load the CSV data
with open("words_whisper.csv",
          "r", encoding="utf-8") as f:
    reader = csv.DictReader(f,
                            delimiter=","
                            )
    data = [row for row in reader]

# Create a TextGrid object
tg = textgrid.TextGrid()

# Create IntervalTier objects
transcript_tier = textgrid.IntervalTier(name="word")

# Populate the interval tiers
for row in data:
    start_time = float(row["start"])
    end_time = float(row["end"])
    transcript_tier.add(start_time, end_time, row["word"])

# Add the interval tiers to the TextGrid
tg.append(transcript_tier)

# Write the TextGrid to a file
with open("words_whisper.TextGrid", "w", encoding="utf-8") as f:
    tg.write(f)
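
Praat interval tiers expect intervals with a positive duration that do not overlap, while automatic word timestamps can occasionally contain zero-length or overlapping words. The sketch below shows one way to tidy the rows before adding them to the tier; `clean_intervals` is a hypothetical helper, and clipping an overlapping start to the previous end is just one possible policy.

```python
def clean_intervals(rows):
    """Return (start, end, word) tuples that are sorted, positive and non-overlapping."""
    cleaned = []
    previous_end = 0.0
    for row in sorted(rows, key=lambda r: float(r["start"])):
        # Clip the start so it never begins before the previous word ended.
        start = max(float(row["start"]), previous_end)
        end = float(row["end"])
        if end <= start:
            continue  # skip words left with no duration
        cleaned.append((start, end, row["word"].strip()))
        previous_end = end
    return cleaned


# Invented rows in the same shape as those read by csv.DictReader.
example_rows = [
    {"word": " And", "start": "0.0", "end": "0.5"},
    {"word": " so", "start": "0.5", "end": "0.5"},
    {"word": " my", "start": "0.4", "end": "0.9"},
]
for interval in clean_intervals(example_rows):
    print(interval)
```

The cleaned tuples could then be passed to `transcript_tier.add(start, end, word)` in place of the raw rows.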


Finished. We demonstrated how to use `openai-whisper` to automatically transcribe an audio file and how to transform the output into a Praat TextGrid.