# 2 Extract speech and laughter from audio files

For speech recognition we try the [SpeechBrain](https://github.com/speechbrain/speechbrain) project and OpenAI's [Whisper](https://github.com/openai/whisper) model.

We also try identifying laughter with the [Laughter Detection model](https://github.com/jrgillick/laughter-detection) by jrgillick.

This code is based on prototypes developed at the Sage IDEMS hackathon in 2023:
https://github.com/chilledgeek/ethical_ai_hackathon_2023


In [1]:
import os
import time
import json
import pandas as pd
import utils

In [2]:
videos_in = "..\\LookitLaughter.test\\"
data_dir = "..\\data\\1_interim\\"



In [3]:
processedvideos = utils.getprocessedvideos(data_dir)
processedvideos.head()

Found existing processedvideos.xlsx


Unnamed: 0,VideoID,ChildID,JokeType,JokeNum,JokeRep,JokeTake,HowFunny,LaughYesNo,Frames,FPS,...,Audio.file,Faces.when,Faces.file,LastError,Speech.file,Speech.when,Objects.file,Objects.when,Understand.file,Understand.when
0,2UWdXP.joke1.rep2.take1.Peekaboo.mp4,2UWdXP,Peekaboo,1,2,1,Slightly funny,No,217,14.29891,...,..\data\1_interim\\2UWdXP.joke1.rep2.take1.Pee...,2023-10-04 11:31:32,..\data\1_interim\2UWdXP.joke1.rep2.take1.Peek...,,..\data\1_interim\2UWdXP.joke1.rep2.take1.Peek...,2023-09-20 16:58:38,,,,
1,2UWdXP.joke1.rep3.take1.Peekaboo.mp4,2UWdXP,Peekaboo,1,3,1,Slightly funny,No,152,14.359089,...,..\data\1_interim\\2UWdXP.joke1.rep3.take1.Pee...,2023-10-04 11:33:44,..\data\1_interim\2UWdXP.joke1.rep3.take1.Peek...,,..\data\1_interim\2UWdXP.joke1.rep3.take1.Peek...,2023-09-20 16:58:39,,,,
2,2UWdXP.joke2.rep1.take1.NomNomNom.mp4,2UWdXP,NomNomNom,2,1,1,Funny,No,95,13.241315,...,..\data\1_interim\\2UWdXP.joke2.rep1.take1.Nom...,2023-10-04 11:35:09,..\data\1_interim\2UWdXP.joke2.rep1.take1.NomN...,,..\data\1_interim\2UWdXP.joke2.rep1.take1.NomN...,2023-09-20 16:58:40,,,,
3,2UWdXP.joke2.rep2.take1.NomNomNom.mp4,2UWdXP,NomNomNom,2,2,1,Slightly funny,No,97,14.213813,...,..\data\1_interim\\2UWdXP.joke2.rep2.take1.Nom...,2023-10-04 11:36:15,..\data\1_interim\2UWdXP.joke2.rep2.take1.NomN...,,..\data\1_interim\2UWdXP.joke2.rep2.take1.NomN...,2023-09-20 16:58:40,,,,
4,2UWdXP.joke2.rep3.take1.NomNomNom.mp4,2UWdXP,NomNomNom,2,3,1,Slightly funny,No,133,14.223092,...,..\data\1_interim\\2UWdXP.joke2.rep3.take1.Nom...,2023-10-04 11:38:34,..\data\1_interim\2UWdXP.joke2.rep3.take1.NomN...,,..\data\1_interim\2UWdXP.joke2.rep3.take1.NomN...,2023-09-20 16:58:48,,,,


## 2.1 Audio extraction with moviepy

The first step is simple. We extract the audio from each video and save it as `mp3` or `wav`. We will use the `moviepy` library to do this. 
This will be helpful for later analysis and regenerating labeled videos with audio.

Note that `moviepy` is a wrapper around `ffmpeg` and `ffmpeg` needs to be installed separately. 

`conda install ffmpeg moviepy`
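
Since `moviepy` delegates the actual work to `ffmpeg`, the extraction step can also be sketched by calling `ffmpeg` directly. This is a minimal illustration and not the implementation of `utils.convert_video_to_audio_moviepy`. The function names `build_ffmpeg_audio_command` and `extract_audio` are hypothetical, and the sketch assumes the `ffmpeg` executable is on the PATH.

```python
import os
import subprocess


def build_ffmpeg_audio_command(video_dir, video_id, out_dir, output_ext="wav"):
    """Return (command, audio_path) for extracting the audio track of one video.

    Hypothetical helper for illustration; not part of utils.
    """
    stem, _ = os.path.splitext(video_id)
    video_path = os.path.join(video_dir, video_id)
    audio_path = os.path.join(out_dir, f"{stem}.{output_ext}")
    # -y overwrites an existing output file, -vn drops the video stream
    command = ["ffmpeg", "-y", "-i", video_path, "-vn", audio_path]
    return command, audio_path


def extract_audio(video_dir, video_id, out_dir, output_ext="wav"):
    """Run ffmpeg and return the path of the audio file it wrote."""
    command, audio_path = build_ffmpeg_audio_command(
        video_dir, video_id, out_dir, output_ext
    )
    # check=True raises CalledProcessError if ffmpeg exits with an error
    subprocess.run(command, check=True, capture_output=True)
    return audio_path
```

Building the output path with `os.path.join` avoids doubled separators when the directory string already ends with a backslash.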

In [5]:
forceaudio = False
#output_ext="mp3"
output_ext="wav"

for index, r in processedvideos.iterrows():
    if forceaudio or pd.isnull(r["Audio.file"]):
        audiopath = utils.convert_video_to_audio_moviepy(videos_in,r["VideoID"], data_dir, output_ext=output_ext)
        r["Audio.file"] = audiopath
        r["Audio.when"] = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
        #update this row in processedvideos dataframe
        processedvideos.loc[index] = r
    else:
        print("Audio already extracted for video: ", r["VideoID"])
        

utils.saveprocessedvideos(processedvideos, data_dir)
processedvideos.head()

Audio already extracted for video:  2UWdXP.joke1.rep2.take1.Peekaboo.mp4
Audio already extracted for video:  2UWdXP.joke1.rep3.take1.Peekaboo.mp4
Audio already extracted for video:  2UWdXP.joke2.rep1.take1.NomNomNom.mp4
Audio already extracted for video:  2UWdXP.joke2.rep2.take1.NomNomNom.mp4
Audio already extracted for video:  2UWdXP.joke2.rep3.take1.NomNomNom.mp4
Audio already extracted for video:  2UWdXP.joke3.rep2.take1.ThatsNotAHat.mp4
Audio already extracted for video:  2UWdXP.joke3.rep3.take1.ThatsNotAHat.mp4
Audio already extracted for video:  2UWdXP.joke4.rep1.take1.TearingPaper.mp4
Audio already extracted for video:  2UWdXP.joke4.rep2.take1.TearingPaper.mp4
Audio already extracted for video:  2UWdXP.joke4.rep3.take1.TearingPaper.mp4
Audio already extracted for video:  2UWdXP.joke5.rep1.take1.ThatsNotACat.mp4
Audio already extracted for video:  2UWdXP.joke5.rep2.take1.ThatsNotACat.mp4
Audio already extracted for video:  2UWdXP.joke5.rep3.take1.ThatsNotACat.mp4
Audio already ex

Unnamed: 0,VideoID,ChildID,JokeType,JokeNum,JokeRep,JokeTake,HowFunny,LaughYesNo,Frames,FPS,...,Audio.file,Faces.when,Faces.file,LastError,Speech.file,Speech.when,Objects.file,Objects.when,Understand.file,Understand.when
0,2UWdXP.joke1.rep2.take1.Peekaboo.mp4,2UWdXP,Peekaboo,1,2,1,Slightly funny,No,217,14.29891,...,..\data\1_interim\\2UWdXP.joke1.rep2.take1.Pee...,2023-10-04 11:31:32,..\data\1_interim\2UWdXP.joke1.rep2.take1.Peek...,,..\data\1_interim\2UWdXP.joke1.rep2.take1.Peek...,2023-09-20 16:58:38,,,,
1,2UWdXP.joke1.rep3.take1.Peekaboo.mp4,2UWdXP,Peekaboo,1,3,1,Slightly funny,No,152,14.359089,...,..\data\1_interim\\2UWdXP.joke1.rep3.take1.Pee...,2023-10-04 11:33:44,..\data\1_interim\2UWdXP.joke1.rep3.take1.Peek...,,..\data\1_interim\2UWdXP.joke1.rep3.take1.Peek...,2023-09-20 16:58:39,,,,
2,2UWdXP.joke2.rep1.take1.NomNomNom.mp4,2UWdXP,NomNomNom,2,1,1,Funny,No,95,13.241315,...,..\data\1_interim\\2UWdXP.joke2.rep1.take1.Nom...,2023-10-04 11:35:09,..\data\1_interim\2UWdXP.joke2.rep1.take1.NomN...,,..\data\1_interim\2UWdXP.joke2.rep1.take1.NomN...,2023-09-20 16:58:40,,,,
3,2UWdXP.joke2.rep2.take1.NomNomNom.mp4,2UWdXP,NomNomNom,2,2,1,Slightly funny,No,97,14.213813,...,..\data\1_interim\\2UWdXP.joke2.rep2.take1.Nom...,2023-10-04 11:36:15,..\data\1_interim\2UWdXP.joke2.rep2.take1.NomN...,,..\data\1_interim\2UWdXP.joke2.rep2.take1.NomN...,2023-09-20 16:58:40,,,,
4,2UWdXP.joke2.rep3.take1.NomNomNom.mp4,2UWdXP,NomNomNom,2,3,1,Slightly funny,No,133,14.223092,...,..\data\1_interim\\2UWdXP.joke2.rep3.take1.Nom...,2023-10-04 11:38:34,..\data\1_interim\2UWdXP.joke2.rep3.take1.NomN...,,..\data\1_interim\2UWdXP.joke2.rep3.take1.NomN...,2023-09-20 16:58:48,,,,


## 2.2 Speech-to-text 

### 2.2.1 SpeechBrain Example 

Let's look at [SpeechBrain](https://github.com/speechbrain/speechbrain). It's not on Anaconda so we'll have to install it with pip.

`pip install speechbrain`

It depends on PyTorch and torchaudio, so we'll install them with conda. Note that we need to specify the CUDA version and install a sound-processing backend library: on Windows this is `soundfile`, on macOS/Linux it is `sox`.

```
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c conda-forge pysoundfile
conda install -c conda-forge ffmpeg
```

Windows users: If you encounter `Backend not found.` or similar errors, try restarting the PC.

Windows users: If you encounter `UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in`, you could try running VSCode as Administrator (right-click the icon in the Start menu and look under More). See [speechbrain issue 1155](https://github.com/speechbrain/speechbrain/issues/1155).
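
Before loading a model, it can help to confirm that the pieces installed above are visible to the current kernel. The following is a small sketch using only the standard library. The function name `check_environment` is hypothetical, and the default module list is an assumption based on the install commands above (the `pysoundfile` conda package is imported as `soundfile`).

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "torchaudio", "soundfile", "speechbrain")):
    """Report which modules can be found and whether ffmpeg is on the PATH."""
    # find_spec locates a top-level module without importing it
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    report["ffmpeg (executable)"] = shutil.which("ffmpeg") is not None
    return report


for name, found in check_environment().items():
    print(f"{name:20s} {'ok' if found else 'MISSING'}")
```

This only shows that each package can be located. It does not prove that the audio backend loads correctly, so a `Backend not found.` error may still need the restart suggested above.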


*Initially we tried Google Cloud Speech-to-Text, but it is a closed model and kept crashing my ipykernel. Then we tried the [Speech Recognition](https://github.com/Uberi/speech_recognition) project to access the [Sphinx](https://github.com/cmusphinx/pocketsphinx) speech model, but pocketsphinx is no longer maintained on Anaconda and compiling it from source is a bit beyond me :)*

In [6]:
from speechbrain.pretrained import EncoderDecoderASR

source="speechbrain/asr-crdnn-rnnlm-librispeech" 
savedir="pretrained_models/asr-crdnn-rnnlm-librispeech"

asr_model = EncoderDecoderASR.from_hparams(
    source=source, 
    savedir=savedir)

The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.


In [8]:
demo_data = r"..\data\demo"
AUDIO_FILE = os.path.join(demo_data, "2UWdXP.joke1.rep2.take1.Peekaboo.mp3")
AUDIO_FILE2 = os.path.join(demo_data, "2UWdXP.joke2.rep1.take1.NomNomNom.mp3")
testset = [AUDIO_FILE, AUDIO_FILE2]

In [9]:
for audio_file in testset:
    results = asr_model.transcribe_file(audio_file)
    print(results)

DRINK DRINK DRINK SAID D'ARTAGNAN
HE MURMURED WON'T YOU RUDDY


SpeechBrain is not very accurate here (with these default settings). Rather than trying to improve it, let's try the OpenAI Whisper model instead.

## 2.3 Speech-to-text using OpenAI Whisper 

The [OpenAI Whisper](https://github.com/openai/whisper) model is freely available and open source. It is multilingual and comes in a range of sizes, with larger models generally being more accurate but slower. We'll try the `base` model.

Simple tutorial: https://analyzingalpha.com/openai-whisper-python-tutorial 

In [None]:
import whisper
model = whisper.load_model("base")

In [None]:
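def whisper_transcribe(audio_file, save_path, saveJSON = True):
    """Transcribe audio_file with Whisper.

    Returns (jsonfile, result) when saveJSON is True, otherwise just result.
    """
    result = model.transcribe(audio_file, verbose = True)
    if saveJSON:
        basename = os.path.basename(audio_file)
        filename, ext = os.path.splitext(basename)
        # os.path.join works whether or not save_path ends with a separator
        jsonfile = os.path.join(save_path, f"{filename}.json")
        with open(jsonfile, "w") as f:
            json.dump(result, f)
        return jsonfile, result
    else:
        return result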

In [None]:
processedvideos = utils.getprocessedvideos(data_dir)
processedvideos.head()

In [None]:
for index, r in processedvideos.iterrows():
    if pd.isnull(r["Speech.file"]) and not pd.isnull(r["Audio.file"]):
        speechpath, result = whisper_transcribe(r["Audio.file"],save_path=data_dir)
        r["Speech.file"] = speechpath
        r["Speech.when"] = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
        #update this row in processedvideos dataframe
        processedvideos.loc[index] = r
        
utils.saveprocessedvideos(processedvideos, data_dir)

In [None]:
processedvideos.head()
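
The JSON files saved by `whisper_transcribe` can be read back later without rerunning the model. The sketch below assumes the saved result has the usual Whisper layout, with a `segments` list whose entries carry `start`, `end` and `text` fields. The helper names `load_segments` and `format_segments` are hypothetical.

```python
import json


def load_segments(json_path):
    """Return (start, end, text) tuples from a saved Whisper result."""
    with open(json_path, "r", encoding="utf-8") as f:
        result = json.load(f)
    # A result without segments (for example, silent audio) gives an empty list
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in result.get("segments", [])
    ]


def format_segments(segments):
    """Render segments as one '[start - end] text' line each (times in seconds)."""
    return "\n".join(
        f"[{start:6.2f} - {end:6.2f}] {text}" for start, end, text in segments
    )
```

For example, `print(format_segments(load_segments(r["Speech.file"])))` would list the timed utterances for one row of `processedvideos`, which is useful when lining speech up with video frames.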

## 2.4 TODO - Laughter detection

Might not do this here as it seems like we would need to import a lot of supporting code. 

In [None]:
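# NOTE: "laughter-detection" contains a hyphen, so it cannot be imported by name.
# The import stays commented out until the supporting code is made importable.
# from laughter-detection import laughter_segmenter


def segment_laughter(wav_filename):
    """Placeholder: should return the laughter segments found in wav_filename."""
    # TODO: results[file_prefix]["laughs"] = segment_laughter(wav_filename)
    raise NotImplementedError("Laughter detection is not implemented yet.")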
