# 2 Extract speech and laughter from audio files

For speech recognition we try the [SpeechBrain](https://github.com/speechbrain/speechbrain) project and OpenAI's [Whisper](https://github.com/openai/whisper) model.

We also try identifying laughter with the [Laughter Detection model](https://github.com/jrgillick/laughter-detection) by jrgillick.

This code is based on prototypes developed at the Sage IDEMS hackathon in 2023:
https://github.com/chilledgeek/ethical_ai_hackathon_2023


In [None]:
import os
import time
import json
import pandas as pd
import utils

In [None]:
videos_in = os.path.join("..","LookitLaughter.test")
data_out = os.path.join("..", "data", "1_interim")


#videos_in = r"..\..\LookitLaughter.full"
#data_out = r"..\..\LookitLaughter.full.data\1_interim"

In [None]:
processedvideos = utils.getProcessedVideos(data_out)
processedvideos.head()

## 2.1 Audio extraction with moviepy

The first step is simple. We extract the audio from each video and save it as `mp3` or `wav`. We will use the `moviepy` library to do this. 
This will be helpful for later analysis and regenerating labeled videos with audio.

Note that `moviepy` is a wrapper around `ffmpeg` and `ffmpeg` needs to be installed separately. 

`conda install ffmpeg moviepy`
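
The extraction step can be sketched roughly as follows. This is a minimal illustration, not the actual `utils.convert_video_to_audio_moviepy` implementation: the helper names `ffmpeg_available`, `audio_path_for` and `extract_audio` are hypothetical, and the naming scheme (same file stem, new extension) is an assumption.

```python
import os
import shutil


def ffmpeg_available():
    """Return True if an `ffmpeg` executable is on the PATH."""
    return shutil.which("ffmpeg") is not None


def audio_path_for(video_file, out_dir, output_ext="wav"):
    """Build the output audio path from a video file name (assumed naming scheme)."""
    stem, _ = os.path.splitext(os.path.basename(video_file))
    return os.path.join(out_dir, stem + "." + output_ext)


def extract_audio(video_file, out_dir, output_ext="wav"):
    """Write the audio track of `video_file` into `out_dir`.

    Returns the new audio path, or None if the video has no audio track.
    """
    # Imported here so the helpers above work even without moviepy installed.
    from moviepy.video.io.VideoFileClip import VideoFileClip

    audio_path = audio_path_for(video_file, out_dir, output_ext)
    clip = VideoFileClip(video_file)
    try:
        if clip.audio is None:  # the video has no audio track
            return None
        # The audio codec is chosen from the file extension (wav or mp3).
        clip.audio.write_audiofile(audio_path)
    finally:
        clip.close()  # release the ffmpeg reader
    return audio_path
```

Calling `ffmpeg_available()` before the main loop gives an early, readable failure if `ffmpeg` is missing from the environment.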

In [None]:
forceaudio = False
#output_ext="mp3"
output_ext="wav"

for index, r in processedvideos.iterrows():
    if forceaudio or pd.isnull(r["Audio.file"]):
        audiopath = utils.convert_video_to_audio_moviepy(videos_in,r["VideoID"], data_out, output_ext=output_ext)
        r["Audio.file"] = audiopath
        r["Audio.when"] = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
        #update this row in processedvideos dataframe
        processedvideos.loc[index] = r
    else:
        print("Audio already extracted for video: ", r["VideoID"])
        

utils.saveProcessedVideos(processedvideos, data_out)
processedvideos.head()

## 2.2 Speech-to-text 



### 2.2.1 How not to do it - SpeechBrain Example 

Let's look at [SpeechBrain](https://github.com/speechbrain/speechbrain). It's not on Anaconda so we'll have to install it with pip.

`pip install speechbrain`

It depends on PyTorch and torchaudio, so we'll install them with conda. Note that we need to specify the CUDA version and install a sound-processing backend library: `soundfile` on Windows, `sox` on macOS/Linux. 

```
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c conda-forge pysoundfile
conda install -c conda-forge ffmpeg
```

Windows users: If you encounter `Backend not found.` or similar errors, try restarting the PC.   
Windows users: If you encounter `: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in `, you could try running VSCode as Administrator (right-click the icon in the Start menu and look under More >). See [speechbrain issue 1155](https://github.com/speechbrain/speechbrain/issues/1155) 


*Initially we tried Google Cloud Speech-to-Text, but it's a closed model and kept crashing my ipykernel. Then we tried the [Speech Recognition](https://github.com/Uberi/speech_recognition) project to access the [Sphinx](https://github.com/cmusphinx/pocketsphinx) speech model, but pocketsphinx is no longer maintained on Anaconda and compiling from source is a bit beyond me :)*

In [None]:
import os
from speechbrain.inference import EncoderDecoderASR
import torchaudio
    

# source="speechbrain/asr-crdnn-rnnlm-librispeech" 
# savedir="pretrained_models/asr-crdnn-rnnlm-librispeech"
source="speechbrain/asr-conformer-transformerlm-librispeech"
savedir="pretrained_models/asr-conformer-transformerlm-librispeech"


asr_model = EncoderDecoderASR.from_hparams( source=source, savedir=savedir)


In [None]:
demo_data = os.path.join("..","data", "demo")
AUDIO_FILE = os.path.join(demo_data, "2UWdXP.joke1.rep2.take1.Peekaboo.mp3")
AUDIO_FILE2 = os.path.join(demo_data, "2UWdXP.joke2.rep1.take1.NomNomNom.mp3")
testset = [AUDIO_FILE, AUDIO_FILE2]

In [None]:
# Transcribe each demo audio file and print the recognised text
for audio_file in testset:
    results = asr_model.transcribe_file(audio_file)
    print(results)

SpeechBrain is not very accurate on these clips (with these default settings). Rather than trying to improve it, let's try the OpenAI Whisper model instead.
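
One way to put a number on transcription accuracy is the word error rate (WER): the word-level edit distance between a hand-made reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration. The function name `word_error_rate` and the example sentences are hypothetical and are not taken from the dataset.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: a reference typed by hand and an imagined model output.
print(word_error_rate("where is the baby", "where is a baby"))
```

A WER of 0 means a perfect match, and values can exceed 1 when the output contains many extra words.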

### 2.2.2 Speech-to-text using OpenAI Whisper 

There is a free version of the [OpenAI Whisper](https://github.com/openai/whisper) model. It is multilingual (xx languages) and comes in a range of different sizes (and accuracies). We'll try the `base` model. 

Simple tutorial: https://analyzingalpha.com/openai-whisper-python-tutorial 
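
Whisper's `transcribe` call returns a dictionary that includes, among other keys, the full `text`, the detected `language`, and a list of `segments` with `start` and `end` times in seconds. The sketch below shows how such a result could be read back after saving it as JSON. The `example` dictionary is a hand-written stand-in rather than real model output, and `format_segments` is a hypothetical helper.

```python
import json


def format_segments(result):
    """Return one 'start-end: text' line per segment of a Whisper-style result."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(f"{seg['start']:6.2f}-{seg['end']:6.2f}: {seg['text'].strip()}")
    return lines


# Hand-written stand-in with the same keys as a Whisper result (not real output).
example = {
    "text": " Peekaboo! Where is the baby?",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Peekaboo!"},
        {"start": 1.5, "end": 3.0, "text": " Where is the baby?"},
    ],
}

# Round-trip through JSON to mimic saving the result and loading it later.
restored = json.loads(json.dumps(example))
for line in format_segments(restored):
    print(line)
```

The segment timestamps are what make the transcript useful for later steps, such as aligning speech with events in the video.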

In [None]:
import whisper
model = whisper.load_model("base")

In [None]:
def whisper_transcribe(audio_file, save_path, saveJSON = True):
    """Transcribe audio_file with Whisper and optionally save the result as JSON."""
    result = model.transcribe(audio_file, verbose = True)
    if saveJSON:
        basename = os.path.basename(audio_file)
        filename, ext = os.path.splitext(basename)
        # Save the transcript in save_path as <audio file name>.json
        jsonfile = os.path.join(save_path, filename + ".json")
        with open(jsonfile, "w") as f:
            json.dump(result, f)
        return jsonfile, result
    else:
        return result

In [None]:
processedvideos = utils.getProcessedVideos(data_out)
processedvideos.head()

In [None]:
for index, r in processedvideos.iterrows():
    if pd.isnull(r["Speech.file"]) and not pd.isnull(r["Audio.file"]):
        speechpath, result = whisper_transcribe(r["Audio.file"],save_path=data_out)
        r["Speech.file"] = speechpath
        r["Speech.when"] = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
        #update this row in processedvideos dataframe
        processedvideos.loc[index] = r
        
utils.saveProcessedVideos(processedvideos, data_out)

In [None]:
processedvideos.head()

## 2.3 #TODO - Laughter detection

We would like to identify laughter in the videos with the [Laughter Detection model](https://github.com/jrgillick/laughter-detection) by jrgillick. However, we want to find a simple way to call it as an external project rather than incorporating its code into our own.
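
One possible approach, sketched here under several assumptions, is to clone the laughter-detection repository next to this project and run its command-line script in a subprocess. The script name `segment_laughter.py` and the flag names below are assumptions based on that project's README and should be checked against the current version. `repo_dir` and both function names are hypothetical.

```python
import os
import subprocess
import sys


def build_laughter_command(wav_filename, output_dir, threshold=0.5, min_length=0.2):
    """Build the command line for the external laughter segmentation script.

    The script name and flag names are assumptions; check them against the
    laughter-detection README before use.
    """
    return [
        sys.executable,
        "segment_laughter.py",
        f"--input_audio_file={os.path.abspath(wav_filename)}",
        f"--output_dir={os.path.abspath(output_dir)}",
        f"--threshold={threshold}",
        f"--min_length={min_length}",
    ]


def run_laughter_detection(repo_dir, wav_filename, output_dir):
    """Run the script from inside the cloned repository (repo_dir is hypothetical)."""
    cmd = build_laughter_command(wav_filename, output_dir)
    # check=True raises CalledProcessError if the script exits with an error.
    return subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True, check=True)
```

Running the model in a separate process keeps its dependencies out of this notebook's environment, at the cost of exchanging results through files in `output_dir`.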



In [None]:
import laughter_segmenter


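def segment_laughter(wav_filename):
    # Intended use: results[file_prefix]["laughs"] = segment_laughter(wav_filename)
    # Placeholder until the laughter detection model is wired in.
    raise NotImplementedError("Laughter detection is not implemented yet.")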
