## Whisper tutorial notebook: Transcript

This notebook provides an introduction to [Whisper](https://github.com/openai/whisper/tree/main), an automatic speech recognition model that transcribes speech audio to text.

Pros:
* It can be used to transcribe speech of varying length.
* It is very accurate and pretty fast.
* It is multilingual!

Cons:
* It processes audio in ~30-second windows, so segment boundaries might cut off the speaker mid-turn, mid-utterance, or mid-word.
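* By default, it does not provide phoneme or word-level time alignments (recent versions of `transcribe` accept a `word_timestamps=True` option, which we do not cover here). If you need alignments, this tutorial is not for you (but we know of things that might work for you! Come see us!).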
* The authors of Whisper say they trained on 680,000 hours of speech+text collected from the internet. However, they don't indicate which speech, from where, transcribed by whom, or with whose consent.

To learn more about Whisper, please refer to the [GitHub repo](https://github.com/openai/whisper/tree/main).

## Step 1: Import packages and libraries

In [1]:
# processing libraries
import os 
import pandas as pd
import numpy as np
import csv

### Specify path settings

In [6]:
# base directory for all inputs and outputs (the current working directory)
BASE_PATH = os.getcwd()

# where our input videos are stored
INPUT_VIDEOS = os.path.join(BASE_PATH,'conversation_data/ZR') 

# where our output audio files will be stored
OUTPUT_AUDIOS = os.path.join(BASE_PATH, 'conversation_audios')

# where the final transcript files will be stored
OUTPUT_CSVS = os.path.join(BASE_PATH,'whisper_transcripts')

# create the output directories if they do not already exist
# os.makedirs(OUTPUT_AUDIOS, exist_ok=True)  # uncomment this if you skipped tutorial2
os.makedirs(OUTPUT_CSVS, exist_ok=True)

### Convert video files to audio

If you followed **tutorial2 on opensmile**, you have already seen how to extract audio files (.wav) from
your .mp4 video recordings. You can skip this part if you have already converted your videos to audio files.

Alternatively, you can upload your files directly to your input file folder.

In [3]:
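# collect the .mp4 files, skipping hidden files, and drop the extension
input_videos = sorted(x for x in os.listdir(INPUT_VIDEOS) if x.endswith(".mp4") and not x.startswith("."))
input_videos = [os.path.splitext(i)[0] for i in input_videos]

# convert each video to 16 kHz, 16-bit PCM stereo audio; -y overwrites an existing output file
for file in input_videos:
    !ffmpeg -y -i "$INPUT_VIDEOS"/"$file".mp4 -acodec pcm_s16le -ar 16000 -ac 2 "$OUTPUT_AUDIOS"/"$file".wav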

ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --e
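
If you prefer to run the conversion outside a notebook, where the `!` shell syntax is not available, the same ffmpeg call can be built in plain Python. The sketch below is one possible way to do it, not part of the original tutorial: the helper names `build_ffmpeg_command` and `convert_all` are hypothetical, and it assumes `ffmpeg` is installed and on your PATH. The flags are the ones used in the cell above, plus `-y` to overwrite existing output.

```python
import os
import subprocess


def build_ffmpeg_command(video_path, audio_dir, sample_rate=16000, channels=2):
    """Return the ffmpeg argument list that converts one video to a .wav file."""
    # splitext keeps dots inside the file name, unlike split('.')[0]
    stem = os.path.splitext(os.path.basename(video_path))[0]
    audio_path = os.path.join(audio_dir, stem + ".wav")
    return [
        "ffmpeg", "-y",            # -y overwrites an existing output file
        "-i", video_path,
        "-acodec", "pcm_s16le",    # 16-bit PCM audio
        "-ar", str(sample_rate),   # sampling rate in Hz
        "-ac", str(channels),      # number of audio channels
        audio_path,
    ]


def convert_all(video_dir, audio_dir):
    """Convert every .mp4 file in video_dir (requires ffmpeg on the PATH)."""
    os.makedirs(audio_dir, exist_ok=True)
    for name in sorted(os.listdir(video_dir)):
        if name.lower().endswith(".mp4"):
            cmd = build_ffmpeg_command(os.path.join(video_dir, name), audio_dir)
            subprocess.run(cmd, check=True)  # raises if ffmpeg reports an error
```

Calling `convert_all(INPUT_VIDEOS, OUTPUT_AUDIOS)` would then play the role of the loop above.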

## Step 2: Install Whisper

**Note:** You may need to run `!pip install numpy==1.23.4` to resolve a version conflict. Restart your session afterwards so that the change takes effect.


In [3]:
#installing whisper from its github repository
!pip install git+https://github.com/openai/whisper.git 
!pip install numpy==1.23.4

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-1ae1xvm7
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-1ae1xvm7
  Resolved https://github.com/openai/whisper.git to commit b91c907694f96a3fb9da03d4bbdc83fbcd3a40a4
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting triton==2.0.0
  Downloading triton-2.0.0-1-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.3/63.3 MB 22.4 MB/s eta 0:00:00
Collecting tiktoken==0.3.3
  Downloading tiktoken-0.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 70.2 MB/s

### Import additional libraries

Whisper supports many languages, including English; the full list can be found in the [Whisper repository](https://github.com/openai/whisper).

Here we have selected English.

In [4]:
#import libraries

import torch
import whisper

# define parameters for whisper

# use the GPU if one is available, otherwise fall back to the CPU
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# load the English-only "base" model
model = whisper.load_model("base.en", device=DEVICE)
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters.")

100%|███████████████████████████████████████| 139M/139M [00:02<00:00, 60.3MiB/s]


Model is English-only and has 71,825,408 parameters.
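
If your recordings are not in English, you would load a multilingual model instead. In the Whisper repository, the English-only models carry an `.en` suffix (`tiny.en`, `base.en`, `small.en`, `medium.en`), and the same names without the suffix are multilingual. The helper below is a small sketch of that naming rule; `pick_model_name` is a hypothetical name introduced here for illustration and is not part of Whisper.

```python
# Model sizes that have an English-only ".en" variant
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def pick_model_name(size="base", english_only=True):
    """Return the Whisper model name for a given size and language setting."""
    if english_only:
        if size not in ENGLISH_ONLY_SIZES:
            raise ValueError(f"no English-only variant for size '{size}'")
        return size + ".en"
    # multilingual models use the bare size name
    return size


# Usage (assumes whisper is installed and DEVICE is defined as above):
# model = whisper.load_model(pick_model_name("base", english_only=False), device=DEVICE)
```

Larger models are generally more accurate but slower and need more memory, so it is worth starting with `base` and moving up only if the transcripts are not good enough.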


## Step 3: Run Whisper on .wav file to get transcription

In [7]:
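# collect the .wav files, skipping hidden files
wav_files = sorted(x for x in os.listdir(OUTPUT_AUDIOS) if x.endswith(".wav") and not x.startswith("."))

for file in wav_files:
    # transcribe returns a dict; the full transcript is under the 'text' key
    result = model.transcribe(os.path.join(OUTPUT_AUDIOS, file))
    print('Transcription for ', file)
    print('=========================')
    print(result['text'])

    # save the transcription as a .txt file named after the audio file;
    # the with-block closes the file automatically
    transcript_path = os.path.join(OUTPUT_CSVS, os.path.splitext(file)[0] + '.txt')
    with open(transcript_path, mode="wt") as f:
        f.write(result['text'])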

Transcription for  1031_ZT_1_Aff_Video_left.wav
 I'm a new girl. Okay, I watched new girl. Yeah. I wouldn't say it's my favorite. I know. I kind of have like one of those like, and not unbiased, or not biased, but it's like one of those opinions where everyone thinks it's so funny. And I don't really find the humor in it. Like sometimes okay, but then other times like, I don't know. I don't find it as funny as like other people find it funny. It's a joy. I do like that. I understand that. I feel like I've just watched this many times where I guess just become like my comfort show. Like at one point I found it really funny even though it's just like, it's something I can watch and I know like everything that's gonna happen. And she's so funny. Yeah, I also really like, like all those like related type shows like Parks and Rec and the Office and Brooklyn Nine Nine. I'm gonna go watch the movie. I've seen Parks and Rec. I love Parks and Rec. Personally, the Office I've seen, it's kind of,
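
The dictionary returned by `model.transcribe` contains more than the full text: its `segments` entry is a list of segments, each with `start` and `end` times in seconds and its own `text`. If you want timestamps alongside the transcript, one option is to write the segments to a CSV file. The sketch below assumes that result layout; the helper names and the `example_result` values are made up for illustration and are not Whisper output.

```python
import csv


def segments_to_rows(result):
    """Turn a Whisper result dict into (start, end, text) rows."""
    return [
        (round(seg["start"], 2), round(seg["end"], 2), seg["text"].strip())
        for seg in result.get("segments", [])
    ]


def write_segments_csv(result, csv_path):
    """Write one CSV row per segment, with a header row."""
    with open(csv_path, mode="w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["start_sec", "end_sec", "text"])
        writer.writerows(segments_to_rows(result))


# Hypothetical result in the shape described above (values are made up)
example_result = {
    "text": " Hello there. How are you?",
    "segments": [
        {"start": 0.0, "end": 1.4, "text": " Hello there."},
        {"start": 1.4, "end": 2.9, "text": " How are you?"},
    ],
}
```

Inside the transcription loop, you could call `write_segments_csv` with the result for each file and a `.csv` path in the `whisper_transcripts` folder.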

### Congratulations! You now have your transcripts stored in the `whisper_transcripts` folder.

### You can use analysis tools such as [Align](https://github.com/nickduran/align-linguistic-alignment/tree/master) to perform linguistic analysis using the generated transcripts.

**Note:** We recommend manually checking these transcripts to correct word errors.
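
Once you have corrected a few transcripts by hand, you can estimate how many word errors Whisper made by computing the word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the Whisper transcript into your corrected one, divided by the number of words in the corrected transcript. The function below is a minimal sketch of that calculation. It is not part of Whisper, and it only lowercases and splits on whitespace, so punctuation differences count as errors unless you remove punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Here `reference` is your manually corrected transcript and `hypothesis` is the Whisper output for the same audio.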