![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/17.0.Speech_Recognition.ipynb)

# **`Automatic Speech Recognition in Spark NLP`**



📌 Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the capability to automatically understand audio input and transcribe it to text.

📌 This enables applications in many fields, such as automatic caption generation for videos, transcription of business meetings, voice typing of messages, and much more.


We currently support three types of models: [Whisper](https://sparknlp.org/docs/en/transformers#whisperforctc), [Wav2Vec](https://sparknlp.org/docs/en/transformers#wav2vec2forctc), and [HuBERT](https://sparknlp.org/docs/en/transformers#hubertforctc). They are end-to-end implementations of ASR, meaning that they encode the audio input and transcribe it directly to text. Wav2Vec2 and HuBERT decode with a [Connectionist Temporal Classification (CTC)](https://dl.acm.org/doi/10.1145/1143844.1143891) decoder; Whisper is an encoder-decoder model that generates the text token by token, even though its Spark NLP annotator is named `WhisperForCTC`.
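
To make the CTC decoding step concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It is not Spark NLP's implementation: the blank symbol `_`, the function name, and the per-frame labels are assumptions made up for illustration. The idea is that the model emits one label per audio frame, and the decoder merges consecutive repeats and then drops the blanks.

```python
def ctc_greedy_collapse(frame_labels, blank="_"):
    """Collapse per-frame labels into text: merge consecutive repeats, then drop blanks."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame and is not the blank
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)


# Hypothetical frame-level predictions for the word "hello"
frames = list("hh_e_ll_lo__")
print(ctc_greedy_collapse(frames))  # hello
```

The blank between the two `l` runs is what lets the decoder keep a genuine double letter instead of merging it.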

To find all pretrained models and pipelines, you can follow these links:

* List of all available ASR [models](https://nlp.johnsnowlabs.com/models?task=Automatic+Speech+Recognition&type=model)
* List of all available ASR [pipelines](https://nlp.johnsnowlabs.com/models?task=Automatic+Speech+Recognition&type=pipeline)


For importing models from Hugging Face or TFHub to Spark NLP, you can follow the steps described [here](https://sparknlp.org/docs/en/transformers#import-transformers-into-spark-nlp).

📌Additional blogposts and videos:

* https://huggingface.co/tasks/automatic-speech-recognition
* https://huggingface.co/docs/transformers/model_doc/wav2vec2
* https://medium.com/usabilitygeek/automatic-speech-recognition-asr-software-an-introduction-824390b9282d

## **Install Spark NLP**

In [1]:
# This is only to set up PySpark and Spark NLP on Colab
!pip install -q pyspark==3.4.1 spark-nlp==5.3.2

# to process audio files
!pip install -q librosa



In [2]:
import librosa
import pandas as pd
import sparknlp
from IPython.display import Audio
from pyspark.sql import functions as F
from pyspark.sql.types import (
    ArrayType,
    FloatType,
    LongType,
    StringType,
    StructField,
    StructType,
)
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline

print("Spark NLP Version :", sparknlp.version())

spark = sparknlp.start()
spark


Spark NLP Version : 5.3.2


# **Spark NLP ASR Pipeline & Model**

## Loading audio files

▶︎ Let's start by downloading a sample WAV file

In [3]:
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/wavs/ngm_12484_01067234848.wav

--2024-04-16 13:58:10--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/wavs/ngm_12484_01067234848.wav
Resolving s3.amazonaws.com (s3.amazonaws.com)... 16.182.41.240, 52.217.168.88, 54.231.233.208, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|16.182.41.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 417836 (408K) [audio/wav]
Saving to: ‘ngm_12484_01067234848.wav’


2024-04-16 13:58:10 (5.17 MB/s) - ‘ngm_12484_01067234848.wav’ saved [417836/417836]



►Let's listen to the audio

In [4]:
FILE_PATH = "ngm_12484_01067234848.wav"
Audio(filename=FILE_PATH)

▶︎ We will use the `librosa` library to load our WAV file and resample it to 16 kHz

In [5]:
data, sampling_rate = librosa.load(FILE_PATH, sr=16000)

# Convert the NumPy float32 samples to plain Python floats for the Spark schema below
data = [float(x) for x in data]

►This is how we can create PySpark DataFrame from the `librosa` results

In [7]:
schema = StructType([
        StructField("audio_content", ArrayType(FloatType())),
        StructField("sampling_rate", LongType())
])

df = pd.DataFrame({
    "audio_content":[data],
    "sampling_rate":[sampling_rate]
})

spark_df = spark.createDataFrame(df, schema)

In [9]:
spark_df.printSchema()

root
 |-- audio_content: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- sampling_rate: long (nullable = true)



In [10]:
spark_df.show(1)

+--------------------+-------------+
|       audio_content|sampling_rate|
+--------------------+-------------+
|[-5.3283205E-5, -...|        16000|
+--------------------+-------------+



## Using Pretrained Pipelines

► The simplest and fastest way is to use a pretrained pipeline for ASR




In [11]:
# Download a pretrained pipeline
pipeline = PretrainedPipeline('asr_whisper_tiny_english_pipeline', lang='en')

pipelineDF = pipeline.transform(spark_df)

asr_whisper_tiny_english_pipeline download started this may take some time.
Approx size to download 238.9 MB
[OK!]


► Let's see which stages the pretrained pipeline contains

In [15]:
pipeline.model.stages

[AudioAssembler_9aaff852926c, WhisperForCTC_83343c021daf]

In [16]:
pipelineDF.printSchema()

root
 |-- audio_content: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- sampling_rate: long (nullable = true)
 |-- audio_assembler: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- result: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |-- text: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = 

In [17]:
pipelineDF.select("text.result", "text.metadata").show(truncate=False)

+------------------------------------------------+-------------------------------+
|result                                          |metadata                       |
+------------------------------------------------+-------------------------------+
|[ People who died while living in other places.]|[{length -> 69632, audio -> 0}]|
+------------------------------------------------+-------------------------------+
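
The `length` value in the metadata appears to be the number of audio samples. That is an interpretation of the numbers shown here, not a documented guarantee. Under that assumption, dividing it by the sampling rate gives the clip duration. A small sketch with a hypothetical helper name:

```python
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Duration of a mono clip given its sample count and sampling rate in Hz."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


# 69632 is the `length` reported above; 16000 is the rate passed to librosa.load
print(duration_seconds(69632, 16000))  # 4.352
```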



## Using Pretrained Models with Custom Pipeline

►You can also construct your own custom Pipeline by using Spark NLP pretrained Models. This way you have more control and flexibility over the entire pipeline.





### Whisper Model

There are many pretrained Whisper models. A good starting point is the official OpenAI models:

- [Whisper base](https://sparknlp.org/2023/10/17/asr_whisper_base_english_en.html)
- [Whisper small](https://sparknlp.org/2023/10/17/asr_whisper_small_english_en.html)
- [Whisper tiny](https://sparknlp.org/2023/10/17/asr_whisper_tiny_english_en.html)

In [23]:
audio_assembler = (
    AudioAssembler().setInputCol("audio_content").setOutputCol("audio_assembler")
)

speech_to_text = (
    WhisperForCTC.pretrained("asr_whisper_tiny_opt")
    .setInputCols("audio_assembler")
    .setOutputCol("text")
)

pipeline = Pipeline(stages=[audio_assembler, speech_to_text])

pipelineDF = pipeline.fit(spark_df).transform(spark_df)

asr_whisper_tiny_opt download started this may take some time.
Approximate size to download 228.2 MB
[OK!]


▶︎Let's have a look:

In [24]:
pipelineDF.select("text.result", "text.metadata").show(truncate=False)

+------------------------------------------------+-------------------------------+
|result                                          |metadata                       |
+------------------------------------------------+-------------------------------+
|[ People who died while living in other places.]|[{length -> 69632, audio -> 0}]|
+------------------------------------------------+-------------------------------+



### Wav2Vec Model

In [20]:
audio_assembler = (
    AudioAssembler().setInputCol("audio_content").setOutputCol("audio_assembler")
)

speech_to_text = (
    Wav2Vec2ForCTC.pretrained("asr_wav2vec2_base_960h")
    .setInputCols("audio_assembler")
    .setOutputCol("text")
)

pipeline = Pipeline(stages=[audio_assembler, speech_to_text])

pipelineDF = pipeline.fit(spark_df).transform(spark_df)

asr_wav2vec2_base_960h download started this may take some time.
Approximate size to download 217 MB
[OK!]


In [21]:
pipelineDF.select("text.result", "text.metadata").show(truncate=False)

+-----------------------------------------------+----------------------------------------------+
|result                                         |metadata                                      |
+-----------------------------------------------+----------------------------------------------+
|[PEOPLE WHO DIED WHILE LIVING IN OTHER PLACES ]|[{audio -> 0, sentence -> 0, length -> 69632}]|
+-----------------------------------------------+----------------------------------------------+



### HuBERT Model

In [8]:
audio_assembler = (
    AudioAssembler().setInputCol("audio_content").setOutputCol("audio_assembler")
)

speech_to_text = (
    HubertForCTC.pretrained("asr_hubert_large_ls960", "en")
    .setInputCols("audio_assembler")
    .setOutputCol("text")
)

pipeline = Pipeline(stages=[audio_assembler, speech_to_text])

pipelineDF = pipeline.fit(spark_df).transform(spark_df)


asr_hubert_large_ls960 download started this may take some time.
Approximate size to download 1.4 GB
[OK!]


In [9]:
pipelineDF.select("text.result", "text.metadata").show(truncate=False)

+-----------------------------------------------+----------------------------------------------+
|result                                         |metadata                                      |
+-----------------------------------------------+----------------------------------------------+
|[PEOPLE WHO DIED WHILE LIVING IN OTHER PLACES ]|[{audio -> 0, sentence -> 0, length -> 69632}]|
+-----------------------------------------------+----------------------------------------------+
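
The models format their output differently: Whisper returns cased, punctuated text with a leading space, while the Wav2Vec2 and HuBERT checkpoints used here return upper-case text without punctuation. When comparing transcripts across models, it can help to normalize them first. A minimal sketch follows; the helper name is hypothetical, and lower-casing plus dropping ASCII punctuation is just one reasonable choice of rule.

```python
import string


def normalize_transcript(text: str) -> str:
    """Lower-case, strip ASCII punctuation, and collapse whitespace."""
    without_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(without_punct.lower().split())


# Transcripts copied from the outputs above
whisper_out = " People who died while living in other places."
wav2vec_out = "PEOPLE WHO DIED WHILE LIVING IN OTHER PLACES "

print(normalize_transcript(whisper_out) == normalize_transcript(wav2vec_out))  # True
```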



### Processing with LightPipeline


A `LightPipeline` can process audio only through the `.fullAnnotate` method, which takes the audio samples as a list of floats.

In [13]:
empty_df = spark.createDataFrame([[""]]).toDF("text")

pipelineModel = pipeline.fit(empty_df)
light_model = LightPipeline(pipelineModel)

In [14]:
light_model.fullAnnotate(data)[0]['text']

[Annotation(document, 0, 4, [PAD], Map(audio -> 0, sentence -> 0, length -> 69632), [])]

## Processing HuggingFace Datasets





▶︎ Let's create a DataFrame from the Hugging Face Datasets library

In [15]:
!pip install -q datasets


In [16]:
from datasets import load_dataset

# Taking the first 5 examples only for illustration purposes
ds = load_dataset(
    "patrickvonplaten/librispeech_asr_dummy", "clean", split="validation[:5]"
)

pandas_dataframe = pd.DataFrame(ds["audio"])
pandas_dataframe["array"] = pandas_dataframe["array"].apply(
    lambda row: [float(value) for value in row]
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



In [17]:
schema = StructType(
    [
        StructField("path", StringType()),
        StructField("audio_content", ArrayType(FloatType())),
        StructField("sampling_rate", LongType()),
    ]
)

spark_df = spark.createDataFrame(pandas_dataframe, schema)

In [18]:
spark_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- audio_content: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- sampling_rate: long (nullable = true)



In [19]:
spark_df.show()

+--------------------+--------------------+-------------+
|                path|       audio_content|sampling_rate|
+--------------------+--------------------+-------------+
|/root/.cache/hugg...|[0.002380371, 0.0...|        16000|
|/root/.cache/hugg...|[-1.5258789E-4, -...|        16000|
|/root/.cache/hugg...|[-6.713867E-4, 6....|        16000|
|/root/.cache/hugg...|[-4.5776367E-4, -...|        16000|
|/root/.cache/hugg...|[2.1362305E-4, -5...|        16000|
+--------------------+--------------------+-------------+



In [20]:
# Download a pre-trained pipeline
pipeline = PretrainedPipeline('asr_whisper_tiny_english_pipeline', lang='en')

pipelineDF = pipeline.transform(spark_df)

asr_whisper_tiny_english_pipeline download started this may take some time.
Approx size to download 238.9 MB
[OK!]


In [21]:
pipelineDF.select("text.result").show(5, False)

pipelineDF.select("text.metadata").show(5, False)

pipelineDF.select("text").show(5, False)


## Processing longer audio files

The models are trained to process audio clips of at most 30 seconds. If your audio is longer than that, you first need to split it into clips of at most 30 seconds each. Let's see how to do that.

In [None]:
# Download audio with ~50 seconds of audio data
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/open-source-nlp/data/audio_50secs.mp3

In [22]:
FILE_PATH = "./audio_50secs.mp3"
Audio(filename=FILE_PATH)

We use the sampling rate to split the audio into 30-second clips. Each second of audio corresponds to `sampling_rate` values in the array, so a 30-second clip holds `sampling_rate * 30` values.

In [24]:
data, sampling_rate = librosa.load(FILE_PATH, sr=16000)
data = [float(x) for x in data]


schema = StructType([
        StructField("audio_content", ArrayType(FloatType())),
        StructField("sampling_rate", LongType())
])

spark_df = spark.createDataFrame([(data, sampling_rate)], schema)

In [25]:
# Split the file into clips of 30 seconds each
clips = []
clip_length = sampling_rate * 30  # 30 seconds
for i in range(0, len(data), clip_length):
    clips.append(data[i : i + clip_length])


print(f"The audio was split in {len(clips)} clips")
print(f"Clip 1 has {len(clips[0]) / sampling_rate} seconds")
print(f"The last clip has {len(clips[-1]) / sampling_rate} seconds")

The audio was split in 2 clips
Clip 1 has 30.0 seconds
The last clip has 20.0 seconds
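
The same splitting logic can be wrapped in a small reusable function. This is a sketch with hypothetical names (`split_into_clips`, `clip_seconds`, `demo_clips`), checked here on a synthetic signal instead of the real audio. It cuts at fixed sample offsets, so a word that straddles a boundary may be split between two clips.

```python
def split_into_clips(samples, sampling_rate, clip_seconds=30):
    """Split a list of samples into consecutive clips of at most clip_seconds each."""
    if sampling_rate <= 0 or clip_seconds <= 0:
        raise ValueError("sampling_rate and clip_seconds must be positive")
    clip_length = sampling_rate * clip_seconds  # number of samples per clip
    return [samples[i : i + clip_length] for i in range(0, len(samples), clip_length)]


# Synthetic 50-second silent signal at 16 kHz, standing in for real audio
fake_audio = [0.0] * (50 * 16000)
demo_clips = split_into_clips(fake_audio, 16000)
print([len(c) / 16000 for c in demo_clips])  # [30.0, 20.0]
```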


In [None]:
# Creating the Spark DataFrame
spark_df = spark.createDataFrame([(c, sampling_rate) for c in clips], schema)
spark_df.show()

In [26]:
audio_assembler = (
    AudioAssembler().setInputCol("audio_content").setOutputCol("audio_assembler")
)

speech_to_text = (
    WhisperForCTC.pretrained("asr_whisper_tiny_opt")
    .setInputCols("audio_assembler")
    .setOutputCol("text")
)

pipeline = Pipeline(stages=[audio_assembler, speech_to_text])

pipelineDF = (
    pipeline.fit(spark_df)
    .transform(spark_df)
    .selectExpr("text.result[0] as clip_text")
    .toPandas()
)

asr_whisper_tiny_opt download started this may take some time.
Approximate size to download 228.2 MB
[OK!]


In [27]:
print(" ".join(pipelineDF["clip_text"].tolist()))

 By a little bit, history of Wave2Wag2. Wave2Wag2 is one of the current state of the art models for automatic speech recognition due to self-supervised training, which is quite a new concept in this field. Using one-ever of label data, Wave2Wag2 app performs the previous state of the art on the 100-hour subset while using 100 times less label data. Spark an LP4.2 was released just
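
Joining with `" ".join(...)` assumes every clip produced a transcript. A slightly more defensive variant is sketched below in plain Python. The helper name and the sample strings are made up for illustration. It skips missing or empty entries and trims the leading space that Whisper tends to put in front of each clip.

```python
def join_clip_texts(clip_texts):
    """Join per-clip transcripts, skipping missing or empty entries and trimming whitespace."""
    cleaned = [t.strip() for t in clip_texts if t and t.strip()]
    return " ".join(cleaned)


# Hypothetical per-clip results; None stands for a clip without any transcript
parts = [" First clip text.", None, " Second clip text."]
print(join_clip_texts(parts))  # First clip text. Second clip text.
```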


# **Spark NLP ASR-NER Pipeline**

### 📌 **`ASR`, `OntoNotes NER`, and `BERT`**


In [28]:
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/1664116679869-voicemaker.in-speech.mp3

--2024-04-16 13:42:42--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/1664116679869-voicemaker.in-speech.mp3
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.56.40, 52.217.167.120, 52.217.134.56, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.56.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40221 (39K) [audio/mp3]
Saving to: ‘1664116679869-voicemaker.in-speech.mp3’


2024-04-16 13:42:42 (1.97 MB/s) - ‘1664116679869-voicemaker.in-speech.mp3’ saved [40221/40221]



In [29]:
from IPython.display import Audio

FILE_PATH = "./1664116679869-voicemaker.in-speech.mp3"
Audio(FILE_PATH)

In [30]:
data, sampling_rate = librosa.load(FILE_PATH, sr=16000)
data = [float(x) for x in data]

In [31]:
schema = StructType(
    [
        StructField("audio_content", ArrayType(FloatType())),
        StructField("sampling_rate", LongType()),
    ]
)

df = pd.DataFrame({"audio_content": [data], "sampling_rate": [sampling_rate]})

spark_df = spark.createDataFrame(df, schema)

In [33]:
audio_assembler = (
    AudioAssembler().setInputCol("audio_content").setOutputCol("audio_assembler")
)

speech_to_text = (
    WhisperForCTC.pretrained("asr_whisper_tiny_opt")
    .setInputCols("audio_assembler")
    .setOutputCol("document")
)

token = Tokenizer().setInputCols("document").setOutputCol("token")

normalizer = (
    Normalizer().setInputCols("token").setOutputCol("normalized").setLowercase(True)
)

bert = (
    BertEmbeddings.pretrained("small_bert_L4_256")
    .setInputCols("document", "normalized")
    .setOutputCol("embeddings")
)

ner_onto = (
    NerDLModel.pretrained("onto_small_bert_L4_256", "en")
    .setInputCols(["document", "normalized", "embeddings"])
    .setOutputCol("ner")
)

entities = (
    NerConverter()
    .setInputCols(["document", "normalized", "ner"])
    .setOutputCol("entities")
)

pipeline = Pipeline(
    stages=[
        audio_assembler,
        speech_to_text,
        token,
        normalizer,
        bert,
        ner_onto,
        entities,
    ]
)

asr_pipelineDF = pipeline.fit(spark_df).transform(spark_df)

asr_whisper_tiny_opt download started this may take some time.
Approximate size to download 228.2 MB
[OK!]
small_bert_L4_256 download started this may take some time.
Approximate size to download 40.5 MB
[OK!]
onto_small_bert_L4_256 download started this may take some time.
Approximate size to download 14.1 MB
[OK!]


In [34]:
asr_pipelineDF.selectExpr("document.result as text").show(1, False)

asr_pipelineDF.selectExpr("normalized.result as normalized_text").show(1, False)

asr_pipelineDF.selectExpr("ner.result as NER").show(1, False)

asr_pipelineDF.selectExpr("entities.result as entities").show(1, False)

+---------------------------------------------------------------------------------------------------------+
|text                                                                                                     |
+---------------------------------------------------------------------------------------------------------+
|[ The Mona Lisa is a 16th-century oil painting created by Leonardo. It's how that the rover embarrassed.]|
+---------------------------------------------------------------------------------------------------------+

+------------------------------------------------------------------------------------------------------------------+
|normalized_text                                                                                                   |
+------------------------------------------------------------------------------------------------------------------+
|[the, mona, lisa, is, a, thcentury, oil, painting, created, by, leonardo, its, how, that, the, rover, embar

In [35]:
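# Note: arrays_zip pairs elements by position. `token` still contains tokens that
# the Normalizer dropped (e.g. "."), so from the first dropped token onward the
# `token` column is shifted relative to `normalized`, `ner`, `begin` and `end`.
asr_pipelineDF.select(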
    F.explode(
        F.arrays_zip(
            asr_pipelineDF.token.result,
            asr_pipelineDF.normalized.result,
            asr_pipelineDF.ner.result,
            asr_pipelineDF.ner.begin,
            asr_pipelineDF.ner.end,
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("token"),
    F.expr("cols['3']").alias("begin"),
    F.expr("cols['4']").alias("end"),
    F.expr("cols['1']").alias("normalized"),
    F.expr("cols['2']").alias("ner"),
).show()

+------------+-----+----+-----------+-------------+
|       token|begin| end| normalized|          ner|
+------------+-----+----+-----------+-------------+
|         The|    1|   3|        the|B-WORK_OF_ART|
|        Mona|    5|   8|       mona|I-WORK_OF_ART|
|        Lisa|   10|  13|       lisa|I-WORK_OF_ART|
|          is|   15|  16|         is|            O|
|           a|   18|  18|          a|            O|
|16th-century|   20|  28|  thcentury|            O|
|         oil|   33|  35|        oil|            O|
|    painting|   37|  44|   painting|            O|
|     created|   46|  52|    created|            O|
|          by|   54|  55|         by|            O|
|    Leonardo|   57|  64|   leonardo|     B-PERSON|
|           .|   67|  69|        its|            O|
|        It's|   72|  74|        how|            O|
|         how|   76|  79|       that|            O|
|        that|   81|  83|        the|            O|
|         the|   85|  89|      rover|            O|
|       rove
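
The table shows two effects of the `Normalizer` stage: `16th-century` becomes `thcentury`, and the `.` token disappears, which shifts the `token` column relative to the other columns. The sketch below imitates that behaviour in plain Python, assuming the stage keeps only letters and lower-cases the result. It is an approximation for illustration (with a hypothetical function name), not the annotator's actual implementation.

```python
def normalize_tokens(tokens):
    """Keep only letters in each token, lower-case it, and drop tokens that become empty."""
    normalized = []
    for token in tokens:
        letters_only = "".join(ch for ch in token if ch.isalpha()).lower()
        if letters_only:
            normalized.append(letters_only)
    return normalized


# Tokens taken from the table above
tokens = ["Leonardo", ".", "It's", "16th-century"]
print(normalize_tokens(tokens))  # ['leonardo', 'its', 'thcentury']
```

Because the output list can be shorter than the input list, zipping the two by position only lines up until the first dropped token.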

In [36]:
asr_pipelineDF.select(
    F.explode(
        F.arrays_zip(
            asr_pipelineDF.entities.result,
            asr_pipelineDF.entities.begin,
            asr_pipelineDF.entities.end,
            asr_pipelineDF.entities.metadata,
        )
    ).alias("col")
).select(
    F.expr("col['0']").alias("entities"),
    F.expr("col['1']").alias("begin"),
    F.expr("col['2']").alias("end"),
    F.expr("col['3']['entity']").alias("ner_label"),
).show(truncate=False)

+-------------+-----+---+-----------+
|entities     |begin|end|ner_label  |
+-------------+-----+---+-----------+
|The Mona Lisa|1    |13 |WORK_OF_ART|
|Leonardo     |57   |64 |PERSON     |
+-------------+-----+---+-----------+
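
To show how the token-level `B-`/`I-`/`O` tags relate to the merged entities above, here is a minimal sketch of grouping IOB tags into chunks. It is written in plain Python with a hypothetical function name and a shortened version of the sentence. `NerConverter`'s real implementation may differ, for example in how it handles malformed tag sequences.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (text, label) chunks."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_label):
            # A new entity starts: close the previous one, if any
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the current entity
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Shortened token/tag sequence based on the NER output above
tokens = ["The", "Mona", "Lisa", "is", "a", "painting", "by", "Leonardo"]
tags = ["B-WORK_OF_ART", "I-WORK_OF_ART", "I-WORK_OF_ART", "O", "O", "O", "O", "B-PERSON"]
print(iob_to_chunks(tokens, tags))
# [('The Mona Lisa', 'WORK_OF_ART'), ('Leonardo', 'PERSON')]
```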

