## Whisper tutorial notebook: Transcript

This notebook provides an introduction to [Whisper](https://github.com/openai/whisper/tree/main), an automatic speech recognition model for transcribing speech audio into text.

Pros:
* It can be used to transcribe speech of varying length.
* It is very accurate and pretty fast.
* It is multilingual!

Cons:
* It transcribes audio in ~30-second chunks, so a chunk boundary might cut off the speaker mid-turn, mid-utterance, or mid-word.
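* By default, it returns segment-level timestamps rather than phoneme- or word-level time alignments. Newer releases accept an optional `word_timestamps=True` argument in `transcribe`, but if you need careful alignments, this may not be for you (we know of things that might work for you! Come see us!).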
* The authors of Whisper say they trained on 680,000 hours of speech and text collected from the internet. However, they do not indicate which speech was used, where it came from, who transcribed it, or whose consent was obtained.

To learn more about Whisper, please refer to the [GitHub repo](https://github.com/openai/whisper/tree/main).

## Step 1: Import packages and libraries

In [1]:
# processing libraries
import os 
import pandas as pd
import numpy as np
import csv

**tutorial2 on opensmile** walks through how to extract audio files (.wav) from your .mp4 video recordings.
If you followed it and have already converted your videos to audio files, you can skip the conversion part below.

### Specify path settings

In [2]:
# get audio files from conversation video files in .wav format 
BASE_PATH = os.getcwd()

# where our input videos are stored
INPUT_VIDEOS = os.path.join(BASE_PATH,'conversation_data/ZT') 

# where our output audio files will be stored
OUTPUT_AUDIOS = os.path.join(BASE_PATH, 'conversation_audios')

# where the final text files containing the transcripts will be stored
OUTPUT_CSVS = os.path.join(BASE_PATH, 'whisper_transcripts')

# create the output directories if they do not already exist
# os.makedirs(OUTPUT_AUDIOS, exist_ok=True)  # uncomment this if you skipped tutorial2
os.makedirs(OUTPUT_CSVS, exist_ok=True)

### Convert video files to audio

In [3]:
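# base names (without extension) of the .mp4 files, skipping hidden files
input_videos = sorted(os.path.splitext(x)[0] for x in os.listdir(INPUT_VIDEOS)
                      if x.endswith(".mp4") and not x.startswith("."))

# extract 16 kHz, 16-bit PCM audio; -y overwrites an existing .wav without prompting
for file in input_videos:
    !ffmpeg -y -i "$INPUT_VIDEOS"/"$file".mp4 -acodec pcm_s16le -ar 16000 -ac 2 "$OUTPUT_AUDIOS"/"$file".wav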

ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --e
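
If you want to run the same conversion outside a notebook (where the `!` shell syntax is not available), one option is to call ffmpeg through `subprocess`. The following is a minimal sketch under that assumption; `build_ffmpeg_command` and `convert_folder` are hypothetical helper names introduced here, and the flags mirror the ones used in the cell above, with `-y` added to overwrite existing files.

```python
import os
import subprocess


def build_ffmpeg_command(video_path, audio_path, sample_rate=16000, channels=2):
    """Return the ffmpeg argument list that extracts a 16-bit PCM .wav track."""
    return [
        "ffmpeg",
        "-y",                     # overwrite existing output without prompting
        "-i", video_path,         # input video file
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM audio
        "-ar", str(sample_rate),  # sample rate in Hz
        "-ac", str(channels),     # number of audio channels
        audio_path,
    ]


def convert_folder(video_dir, audio_dir):
    """Convert every .mp4 in video_dir to a .wav file in audio_dir."""
    os.makedirs(audio_dir, exist_ok=True)
    for name in sorted(os.listdir(video_dir)):
        stem, ext = os.path.splitext(name)
        if name.startswith(".") or ext.lower() != ".mp4":
            continue
        cmd = build_ffmpeg_command(
            os.path.join(video_dir, name),
            os.path.join(audio_dir, stem + ".wav"),
        )
        subprocess.run(cmd, check=True)  # requires ffmpeg to be on your PATH


# Example usage with the paths defined earlier in this notebook:
# convert_folder(INPUT_VIDEOS, OUTPUT_AUDIOS)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.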

## Step 2: Install Whisper

**Note:** You may need to run `!pip install numpy==1.23.4` to resolve a NumPy version conflict, and then restart your session.


In [7]:
#installing whisper from its github repository
!pip install git+https://github.com/openai/whisper.git 
!pip install numpy==1.23.4

Collecting numpy==1.23.4
  Downloading numpy-1.23.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.1/17.1 MB 42.1 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.4
    Uninstalling numpy-1.24.4:
      Successfully uninstalled numpy-1.24.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.9.2 requires protobuf<3.20,>=3.9.2, but you have protobuf 4.23.4 which is incompatible.
tensorboard 2.9.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 4.23.4 which is incompatible.
Successfully installed numpy-1.23.4

### Import additional libraries

Whisper supports many languages, including English; the full list can be found [here](https://github.com/openai/whisper).

In [3]:
#import libraries

import torch
import whisper

# define parameters for whisper

# use the GPU if one is available; otherwise fall back to the CPU
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# load the English-only "base" model
model = whisper.load_model("base.en", device=DEVICE)
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters.")

100%|███████████████████████████████████████| 139M/139M [00:11<00:00, 13.1MiB/s]


Model is English-only and has 71,825,408 parameters.
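
The model name passed to `whisper.load_model` decides whether you get an English-only or a multilingual checkpoint. Below is a small sketch of that choice. `pick_model_name` and `SIZES_WITH_ENGLISH_ONLY` are hypothetical helpers introduced here (they are not part of Whisper), and the naming rule reflects our reading of the model list in the Whisper README, so please check it against the version you installed.

```python
# Sizes that have an English-only ".en" checkpoint.
# Assumption: the "large" models are multilingual only.
SIZES_WITH_ENGLISH_ONLY = {"tiny", "base", "small", "medium"}


def pick_model_name(size="base", english_only=True):
    """Return a Whisper model name such as 'base.en' or 'base'."""
    if english_only and size in SIZES_WITH_ENGLISH_ONLY:
        return size + ".en"
    return size


# Example usage with the DEVICE defined above:
# model = whisper.load_model(pick_model_name("base", english_only=False), device=DEVICE)
```

For recordings in languages other than English, you would pick a multilingual name (for example `"base"` instead of `"base.en"`).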


## Step 3: Run Whisper on .wav file to get transcription

In [4]:
wav_files = sorted([x for x in os.listdir(OUTPUT_AUDIOS) if not x.startswith(".")])

for file in wav_files:
    # transcribe the audio file; the result is a dict whose 'text' key holds the full transcript
    text = model.transcribe(os.path.join(OUTPUT_AUDIOS, file))
    print('Transcription for ', file)
    print('=========================')
    print(text['text'])

    # save the transcription; the with-block closes the file automatically
    out_path = os.path.join(OUTPUT_CSVS, os.path.splitext(file)[0] + '.txt')
    with open(out_path, mode="wt", encoding="utf-8") as f:
        f.write(text['text'])

Transcription for  1058_ZT_4_Aff_Video_left.wav
 I guess maybe just to start broadly, like what are your favorite forms of like consuming entertainment? I mean, music, like I listen to a lot of music. It's like movies. I don't really watch a lot of TV, honestly, but I like some TV shows. And yeah, where are you? For me, it's like definitely like before college, like a lot of video games, but since getting to college more, it's more like social media and like occasionally like TV movies and then definitely a good bit of music as well. So I guess what kind of music do you listen to? Cause I feel like you can probably find something and comment there. Yeah. Okay, well like a lot. So I mean, like mostly like rap and like I think that's been like some, some like indie, some kind of like, not country, but also like some Latin music. Yeah. Yeah. Yeah. But you know, so just like, like popular like rap songs and like old, like old, like old, I thought too. What are you? Yeah. No, that lines up 
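
Besides the full `text`, the dictionary returned by `model.transcribe` also contains a `segments` list, where each segment carries `start` and `end` times in seconds along with its own `text`. If segment-level timing is useful to you, a sketch like the following could save it as a CSV. The helper names and column names are our own choices, not part of Whisper.

```python
import csv


def segments_to_rows(result):
    """Turn a Whisper result dict into (start, end, text) rows, one per segment."""
    rows = []
    for seg in result.get("segments", []):
        rows.append((round(seg["start"], 2), round(seg["end"], 2), seg["text"].strip()))
    return rows


def write_segments_csv(result, csv_path):
    """Write one CSV row per segment, with start/end times in seconds."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["start_sec", "end_sec", "text"])
        writer.writerows(segments_to_rows(result))


# Example usage inside the loop above (assumes `text` is the result of model.transcribe):
# write_segments_csv(text, os.path.join(OUTPUT_CSVS, os.path.splitext(file)[0] + ".csv"))
```

Keep in mind that these are segment boundaries chosen by the model, so they will not necessarily line up with speaker turns.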

### Congratulations! You now have your transcripts stored in the `whisper_transcripts` folder.

### You can use analysis tools such as [Align](https://github.com/nickduran/align-linguistic-alignment/tree/master) to perform linguistic analysis using the generated transcripts.

**Note:** We recommend manually checking these transcripts to correct word errors.
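
Once you have hand-corrected a transcript, one way to get a rough sense of how much needed fixing is to compute a word error rate (WER) between the corrected version and the Whisper output. The sketch below is a plain word-level edit distance; `word_error_rate` is a hypothetical helper written for this tutorial, and it only lowercases and splits on whitespace (no punctuation handling), so treat the number as approximate.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (word missing from the hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in the hypothesis)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Example usage (both strings are placeholders for your own transcripts):
# wer = word_error_rate(corrected_text, whisper_text)
```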