# Experiment 9 - Speech Recognition
This experiment shows how the Hugging Face `transformers` library in Python makes speech recognition on audio files possible with only a few lines of code. The notebook is intended to compare pre-trained speech recognition models available through `transformers`; the cells below use Facebook's `wav2vec2-base-960h`. The input is a set of sound files in `.wav` format from the TESS dataset.

## Import Dependencies

Here we import `pipeline` from the `transformers` library. It loads Facebook's `wav2vec2-base-960h` speech recognition model and lets us pass our input `.wav` files to it. We also import `pandas` to tabulate the results, `os` to collect the file paths, and `warnings` to silence warning messages.

In [1]:
from transformers import pipeline
import pandas as pd
import warnings
import os

warnings.filterwarnings('ignore')

## Transformers Example for Speech Recognition

### Create Pipeline

We create the speech recogniser's pipeline with `transformers`. The `pipeline()` function takes the task name and the model to load as arguments; here the model is Facebook's `wav2vec2-base-960h`.

In [2]:
speechRecognizer = pipeline(task='automatic-speech-recognition', model='facebook/wav2vec2-base-960h')

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



We only process the 200 `.wav` files in the `OAF_Fear` directory. To do this, we build a list of the paths of all the files in that directory.

In [3]:
fearFiles = []

for dirname, _, filenames in os.walk('/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/OAF_Fear'):
    for filename in filenames:
        fearFiles.append(os.path.join(dirname, filename))
        
len(fearFiles)

200
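The directory walk above keeps every file it finds, whatever its extension. As a minimal sketch of an alternative, the helper below collects only `.wav` files and returns them in sorted order so that later row numbers are reproducible. The name `list_wav_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def list_wav_files(root):
    """Return sorted paths of all .wav files under root (searched recursively)."""
    root = Path(root)
    # rglob walks subdirectories; the suffix check is case-insensitive so
    # files ending in .WAV are kept too, and non-audio files are skipped.
    return sorted(
        str(p) for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".wav"
    )
```

In this notebook it could be called with the `OAF_Fear` directory path used above in place of the `os.walk` loop.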

Once we have the paths to all 200 files, we can apply the pipeline to each one. We collect the output of each call in a list.

### Pass Data to Pipeline

In [4]:
predictedDicts = []

for file in fearFiles:
    predictedDicts.append(speechRecognizer(file))

ffmpeg: /opt/conda/lib/libncursesw.so.6: no version information available (required by /lib/x86_64-linux-gnu/libcaca.so.0)
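The loop above stores only the pipeline outputs, so the link between a transcript and its source file depends on list order. The sketch below keeps the two together. It assumes only that the recognizer is a callable returning a dictionary with a `'text'` key, as the pipeline does in this notebook. `transcribe_files` and `fake_recognizer` are hypothetical names, the stand-in recognizer exists only so the sketch runs without downloading a model, and the example file names are an assumed layout.

```python
def transcribe_files(recognizer, paths):
    """Run recognizer on each path and keep the path next to its transcript."""
    records = []
    for path in paths:
        result = recognizer(path)  # expected to return a dict such as {"text": "..."}
        records.append({"file": path, "text": result["text"]})
    return records


# Stand-in recognizer (hypothetical): builds a transcript from the file name
# so the example runs offline. In the notebook, pass `speechRecognizer` instead.
def fake_recognizer(path):
    return {"text": "SAY THE WORD " + path.split("_")[1].upper()}


records = transcribe_files(fake_recognizer, ["OAF_gap_fear.wav", "OAF_page_fear.wav"])
print(records[0])  # {'file': 'OAF_gap_fear.wav', 'text': 'SAY THE WORD GAP'}
```

A list of such records can be passed directly to `pd.DataFrame` to get one column per key.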

Each output in the list is a dictionary. The transcribed text is stored under the `'text'` key, so we extract it as follows.

In [5]:
text = []

for dictionary in predictedDicts:
    text.append(dictionary['text'])
    
len(text)

200
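With the transcripts extracted, a rough accuracy check becomes possible. The sketch below assumes that each file name has the form `speaker_word_emotion.wav` (for example `OAF_gap_fear.wav`) and that each transcript should end with the target word. Both are assumptions about the dataset and are not confirmed by this notebook. The function names are hypothetical.

```python
import os


def target_word(path):
    """Extract the target word from a name like 'OAF_gap_fear.wav' (assumed layout)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("_")
    return parts[1].upper() if len(parts) >= 3 else None


def word_accuracy(paths, transcripts):
    """Fraction of scored files whose transcript ends with the expected word."""
    hits = 0
    scored = 0
    for path, transcript in zip(paths, transcripts):
        expected = target_word(path)
        if expected is None:
            continue  # skip files that do not follow the assumed naming scheme
        scored += 1
        words = transcript.split()
        if words and words[-1] == expected:
            hits += 1
    return hits / scored if scored else 0.0
```

In the notebook this could be called as `word_accuracy(fearFiles, text)`. The comparison is an exact string match, so a homophone such as `HOLE` for `WHOLE` counts as a miss.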

We convert this list of text to a pandas DataFrame to view the extracted text easily.

### Representing Data in DataFrame

In [6]:
data = pd.DataFrame(data=text, columns=['Speech Text'])
data

| | Speech Text |
|---|---|
| 0 | SAY THE WORD GAP |
| 1 | SAY THE WORD CHOICE |
| 2 | SAY THE WORD PAGE |
| 3 | SAY THE WORD WHOLE |
| 4 | SAY THE WORD LEAST |
| ... | ... |
| 195 | SAY THE WORD HAW |
| 196 | SAY THE WORD NEAR |
| 197 | SAY THE WORD JUG |
| 198 | SAY THE WORD DIP |


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Speech Text  200 non-null    object
dtypes: object(1)
memory usage: 1.7+ KB


In [8]:
data.describe()

| | Speech Text |
|---|---|
| count | 200 |
| unique | 197 |
| top | SAY THE WORD WHOLE |
| freq | 2 |
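The summary reports 197 unique values across 200 rows, so some transcripts occur more than once. A minimal sketch for listing them is given below. It uses `collections.Counter` from the standard library rather than pandas so that it stays self-contained, and `repeated_transcripts` is a hypothetical helper name.

```python
from collections import Counter


def repeated_transcripts(texts):
    """Return {transcript: count} for every transcript that occurs more than once."""
    counts = Counter(texts)
    return {text: n for text, n in counts.items() if n > 1}
```

Calling `repeated_transcripts(text)` on the list built earlier would show which transcripts repeat. If every file carries a different target word, each repeat points to two recordings that the model transcribed identically, which makes these rows a natural place to start looking for recognition errors.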
