[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/audio/whisper/Automatic_Speech_Recognition_Whisper_(WhisperForCTC).ipynb)

# Automatic Speech Recognition in Spark NLP
## Whisper (WhisperForCTC)

WhisperForCTC is a Whisper model with a language modeling head on top for Connectionist Temporal Classification (CTC). Whisper was proposed in [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356). This annotator requires Spark 3.4.0 or later.

The annotator takes audio and transcribes it as text. The audio needs to be provided pre-processed as an array of floats.

- List of all available ASR [models](https://sparknlp.org/models?task=Automatic+Speech+Recognition&type=model)
- List of all available ASR [pipelines](https://sparknlp.org/models?task=Automatic+Speech+Recognition&type=pipeline)
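
To make the "array of floats" input concrete, here is a minimal standard-library sketch that decodes a 16-bit PCM WAV into floats in the range [-1.0, 1.0). The helper name `wav_to_floats` is hypothetical and not part of Spark NLP. Unlike the `librosa.load` call used later in this notebook, it does not resample, so it only illustrates the target data shape.

```python
import io
import struct
import wave


def wav_to_floats(source):
    """Read a 16-bit PCM WAV and return (samples, sampling_rate).

    `source` may be a file path or a binary file-like object.
    Multi-channel audio is averaged down to mono.
    """
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    # WAV PCM data is little-endian; "<h" decodes signed 16-bit integers.
    pcm = [value for (value,) in struct.iter_unpack("<h", raw)]

    # Average the interleaved channels and scale int16 to [-1.0, 1.0).
    samples = [
        sum(pcm[i:i + channels]) / (channels * 32768.0)
        for i in range(0, len(pcm), channels)
    ]
    return samples, rate


# Tiny in-memory demo: four mono samples at 16 kHz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))
buffer.seek(0)

samples, rate = wav_to_floats(buffer)
print(rate, samples[:3])  # 16000 [0.0, 0.5, -0.5]
```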

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash

# to process audio files
!pip install -q pyspark==3.4.1 librosa

--2023-08-24 13:47:43--  https://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2023-08-24 13:47:43--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1191 (1.2K) [text/plain]
Saving to: ‘STDOUT’

Installing PySpark 3.2.3 and Spark NLP 5.0.2
setup Colab for PySpark 3.2.3 

In [None]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

print(sparknlp.version())

4.3.1
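
Because the annotator requires Spark 3.4.0 or later, it can help to verify the running Spark version before building a pipeline. The sketch below is one way to compare version strings. `parse_version` and `meets_minimum` are hypothetical helpers, not Spark NLP functions.

```python
def parse_version(version):
    """Turn '3.4.1' (or '3.4.1-SNAPSHOT') into a comparable tuple of ints."""
    numeric = version.split("-")[0]
    return tuple(int(part) for part in numeric.split(".")[:3])


def meets_minimum(version, minimum="3.4.0"):
    """Return True when `version` is at least `minimum`."""
    return parse_version(version) >= parse_version(minimum)


print(meets_minimum("3.4.1"))  # True
print(meets_minimum("3.2.3"))  # False
```

In the notebook session, the string to check would come from `spark.version`, for example `meets_minimum(spark.version)`.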


# Spark NLP ASR Pipeline & Model
## Whisper
Loading an audio file

Let's download a sample WAV file:

In [None]:
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/wavs/ngm_12484_01067234848.wav

--2023-08-24 13:50:54--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/wavs/ngm_12484_01067234848.wav
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.90.214, 16.182.72.72, 16.182.74.176, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.90.214|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 417836 (408K) [audio/wav]
Saving to: ‘ngm_12484_01067234848.wav’


2023-08-24 13:50:54 (5.24 MB/s) - ‘ngm_12484_01067234848.wav’ saved [417836/417836]



Let's listen to the audio

In [None]:
from IPython.display import Audio
FILE_PATH = "ngm_12484_01067234848.wav"
Audio(filename=FILE_PATH)

We will use the `librosa` library to load and resample our WAV file:

In [None]:
import librosa
# load the file as mono audio, resampled to 16 kHz
data, sampleing_rate = librosa.load(FILE_PATH, sr=16000)
# convert the NumPy float32 samples to plain Python floats
data = [float(x) for x in data]
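
Once the samples are loaded, a quick sanity check is to derive the clip duration from the sample count and the sampling rate. Whisper models operate on 30-second windows. Whether this annotator splits or truncates longer clips is not covered here, so the splitting helper below is only a hypothetical pre-processing step, shown on synthetic data.

```python
WHISPER_WINDOW_SECONDS = 30  # window length used by Whisper models


def clip_duration(num_samples, sampling_rate):
    """Duration of a clip in seconds."""
    return num_samples / sampling_rate


def split_into_windows(samples, sampling_rate,
                       window_seconds=WHISPER_WINDOW_SECONDS):
    """Cut a list of samples into consecutive windows (the last may be shorter)."""
    window = int(window_seconds * sampling_rate)
    return [samples[i:i + window] for i in range(0, len(samples), window)]


# Synthetic example: 3 seconds of silence at 16 kHz fits in one window.
silence = [0.0] * 48000
print(clip_duration(len(silence), 16000))        # 3.0
print(len(split_into_windows(silence, 16000)))   # 1
```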

This is how we can create a PySpark DataFrame from the `librosa` results:

In [None]:
from pyspark.sql.types import *
import pyspark.sql.functions as F
import pandas as pd
schema = StructType([StructField("audio_content", ArrayType(FloatType())),
                     StructField("sampling_rate", LongType())])

df = pd.DataFrame({
    "audio_content":[data],
    "sampling_rate":[sampleing_rate]
})

spark_df=spark.createDataFrame(df, schema)
spark_df.printSchema()
spark_df.show(1)

root
 |-- audio_content: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- sampling_rate: long (nullable = true)

+--------------------+-------------+
|       audio_content|sampling_rate|
+--------------------+-------------+
|[-5.3283205E-5, -...|        16000|
+--------------------+-------------+



### Creating the Pipeline
You can also construct your own custom Pipeline by using Spark NLP pretrained Models. This way you have more control and flexibility over the entire pipeline.


In [None]:
from sparknlp.annotator import *
from sparknlp.base import *

audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = WhisperForCTC \
    .pretrained()\
    .setInputCols("audio_assembler") \
    .setOutputCol("text")

pipeline = Pipeline(stages=[
  audio_assembler,
  speech_to_text,
])

pipelineDF = pipeline.fit(spark_df).transform(spark_df)

asr_whisper_tiny_opt download started this may take some time.
Approximate size to download 231.4 MB
[OK!]


Let's have a look:

In [None]:
pipelineDF.select("text.result").show(1, False)

pipelineDF.select("text.metadata").show(1, False)

pipelineDF.select("text").show(1, False)

+------------------------------------------------+
|result                                          |
+------------------------------------------------+
|[ People who died while living in other places.]|
+------------------------------------------------+

+-------------------------------+
|metadata                       |
+-------------------------------+
|[{length -> 69632, audio -> 0}]|
+-------------------------------+

+------------------------------------------------------------------------------------------------------+
|text                                                                                                  |
+------------------------------------------------------------------------------------------------------+
|[{document, 0, 45,  People who died while living in other places., {length -> 69632, audio -> 0}, []}]|
+------------------------------------------------------------------------------------------------------+
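
Each entry in the `text` column is an annotation with the fields `annotatorType`, `begin`, `end`, `result`, `metadata`, and `embeddings`. The sketch below rewrites the annotation shown above as a plain dict to illustrate how the fields relate. `transcripts` is a hypothetical helper, and the dict form assumes the rows were collected and converted with `Row.asDict(recursive=True)`.

```python
def transcripts(annotations):
    """Join the `result` field of each annotation into one string."""
    return " ".join(a["result"].strip() for a in annotations)


# Annotation copied from the output above, written as a plain dict.
annotation = {
    "annotatorType": "document",
    "begin": 0,
    "end": 45,
    "result": " People who died while living in other places.",
    "metadata": {"length": "69632", "audio": "0"},
    "embeddings": [],
}

# `end` is an inclusive character index, so the span length is end - begin + 1.
span_length = annotation["end"] - annotation["begin"] + 1
print(span_length == len(annotation["result"]))  # True
print(transcripts([annotation]))  # People who died while living in other places.
```

Note that the raw transcription starts with a space, which is why the helper strips each result before joining.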



# Spark NLP ASR-NER Pipeline
## Whisper, OntoNotes NER, and BERT

In [None]:
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/1664116679869-voicemaker.in-speech.mp3

--2023-08-24 13:54:44--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/1664116679869-voicemaker.in-speech.mp3
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.162.40, 54.231.196.248, 52.217.225.200, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.162.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40221 (39K) [audio/mp3]
Saving to: ‘1664116679869-voicemaker.in-speech.mp3’



2023-08-24 13:54:44 (1.98 MB/s) - ‘1664116679869-voicemaker.in-speech.mp3’ saved [40221/40221]



In [None]:
from IPython.display import Audio
FILE_PATH = "./1664116679869-voicemaker.in-speech.mp3"
Audio(FILE_PATH)

In [None]:
data,sampleing_rate = librosa.load(FILE_PATH, sr=16000)
data=[float(x) for x in data]

In [None]:
# Create a PySpark DataFrame from the pandas DataFrame
from pyspark.sql.types import *
import pyspark.sql.functions as F

schema = StructType([StructField("audio_content", ArrayType(FloatType())),
                     StructField("sampling_rate", LongType())])

df = pd.DataFrame({
    "audio_content":[data],
    "sampling_rate":[sampleing_rate]
})

spark_df=spark.createDataFrame(df, schema)

In [None]:
from sparknlp.annotator import *
from sparknlp.base import *

audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = WhisperForCTC \
    .pretrained()\
    .setInputCols("audio_assembler") \
    .setOutputCol("document")

token = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols("token") \
    .setOutputCol("normalized") \
    .setLowercase(True)

bert = BertEmbeddings.pretrained("small_bert_L4_256") \
    .setInputCols("document", "normalized") \
    .setOutputCol("embeddings")

ner_onto = NerDLModel.pretrained("onto_small_bert_L4_256", "en") \
    .setInputCols(["document", "normalized", "embeddings"]) \
    .setOutputCol("ner")

entities = NerConverter() \
    .setInputCols(["document", "normalized", "ner"]) \
    .setOutputCol("entities")

pipeline = Pipeline(stages=[
  audio_assembler,
  speech_to_text,
  token,
  normalizer,
  bert,
  ner_onto,
  entities
])

asr_pipelineDF = pipeline.fit(spark_df).transform(spark_df)

asr_whisper_tiny_opt download started this may take some time.
Approximate size to download 231.4 MB
[OK!]
small_bert_L4_256 download started this may take some time.
Approximate size to download 40.5 MB
[OK!]
onto_small_bert_L4_256 download started this may take some time.
Approximate size to download 14.1 MB
[OK!]
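
The `Normalizer` stage in the pipeline above lowercases tokens via `setLowercase(True)` and cleans them before they reach the embeddings and NER stages. The sketch below is a rough plain-Python approximation. It assumes the cleanup removes every non-letter character, which is an assumption about the default pattern, and it splits on whitespace instead of using the real `Tokenizer`. The sample sentence is taken from the transcription this notebook produces.

```python
import re


def normalize_tokens(text):
    """Approximate Tokenizer + Normalizer: split, strip non-letters, lowercase."""
    tokens = []
    for raw in text.split():
        # Assumption: everything that is not an ASCII letter is removed.
        cleaned = re.sub(r"[^A-Za-z]", "", raw).lower()
        if cleaned:  # drop tokens that become empty after cleaning
            tokens.append(cleaned)
    return tokens


sentence = "The Mona Lisa is a 16th-century oil painting created by Leonardo."
print(normalize_tokens(sentence))
# ['the', 'mona', 'lisa', 'is', 'a', 'thcentury', 'oil', 'painting',
#  'created', 'by', 'leonardo']
```

This shows why `16th-century` surfaces as `thcentury`: digits and hyphens are dropped, which can matter for entities such as dates.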


In [None]:
asr_pipelineDF.select("document.result").show(1, False)

asr_pipelineDF.select("normalized.result").show(1, False)

asr_pipelineDF.select("ner.result").show(1, False)

asr_pipelineDF.select("entities.result").show(1, False)

+---------------------------------------------------------------------------------------------------------+
|result                                                                                                   |
+---------------------------------------------------------------------------------------------------------+
|[ The Mona Lisa is a 16th-century oil painting created by Leonardo. It's how that the rover embarrassed.]|
+---------------------------------------------------------------------------------------------------------+

+------------------------------------------------------------------------------------------------------------------+
|result                                                                                                            |
+------------------------------------------------------------------------------------------------------------------+
|[the, mona, lisa, is, a, thcentury, oil, painting, created, by, leonardo, its, how, that, the, rover, embar

# Spark NLP ASR pipeline and model
## HuggingFace Datasets

Let's create a DataFrame using the Hugging Face Datasets library:

In [None]:
!pip install -q datasets


In [None]:
import pandas as pd
import librosa

from datasets import load_dataset

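# a small LibriSpeech sample; each audio entry has `path`, `array`, and `sampling_rate`
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
pandas_dataframe = pd.DataFrame(ds['audio'])
# convert the NumPy samples to plain Python floats
pandas_dataframe['array'] = pandas_dataframe['array'].apply(lambda row: [float(value) for value in row])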

Downloading builder script:   0%|          | 0.00/5.16k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating validation split: 0 examples [00:00, ? examples/s]
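
The pandas DataFrame built above has the columns `path`, `array`, and `sampling_rate`, while the Spark schema in this notebook names the second field `audio_content`. The mapping appears to rely on column position. As an alternative sketch, the hypothetical helper `to_rows` below makes that ordering explicit by building plain tuples. It assumes each audio record is a dict with those three keys.

```python
def to_rows(audio_records):
    """Convert audio records into (path, audio_content, sampling_rate) tuples."""
    rows = []
    for record in audio_records:
        rows.append((
            record["path"],
            [float(value) for value in record["array"]],  # plain Python floats
            int(record["sampling_rate"]),
        ))
    return rows


# Synthetic record standing in for one dataset entry.
records = [{"path": "a.flac", "array": [0, 0.5, -0.25], "sampling_rate": 16000}]
print(to_rows(records))  # [('a.flac', [0.0, 0.5, -0.25], 16000)]
```

A list of such tuples can be passed to `spark.createDataFrame` together with a matching three-field schema.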

In [None]:
# Create a PySpark DataFrame from the pandas DataFrame
from pyspark.sql.types import *
import pyspark.sql.functions as F

schema = StructType([StructField("path", StringType()),
                     StructField("audio_content", ArrayType(FloatType())),
                     StructField("sampling_rate", LongType())])
spark_df=spark.createDataFrame(pandas_dataframe, schema)
spark_df.printSchema()
spark_df.show()

root
 |-- path: string (nullable = true)
 |-- audio_content: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- sampling_rate: long (nullable = true)

+--------------------+--------------------+-------------+
|                path|       audio_content|sampling_rate|
+--------------------+--------------------+-------------+
|/root/.cache/hugg...|[-1.8310547E-4, -...|        16000|
|/root/.cache/hugg...|[-0.0013427734, -...|        16000|
|/root/.cache/hugg...|[-3.9672852E-4, -...|        16000|
|/root/.cache/hugg...|[-0.006164551, -0...|        16000|
|/root/.cache/hugg...|[-0.001373291, -0...|        16000|
|/root/.cache/hugg...|[-0.004852295, 2....|        16000|
|/root/.cache/hugg...|[0.0011291504, 5....|        16000|
|/root/.cache/hugg...|[-0.0027160645, 0...|        16000|
|/root/.cache/hugg...|[0.002380371, 0.0...|        16000|
|/root/.cache/hugg...|[-0.0033874512, 0...|        16000|
|/root/.cache/hugg...|[-9.1552734E-4, -...|        16000|
|/root/.cach

In [None]:
from sparknlp.annotator import *
from sparknlp.base import *

audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = WhisperForCTC \
    .pretrained()\
    .setInputCols("audio_assembler") \
    .setOutputCol("text")

pipeline = Pipeline(stages=[
  audio_assembler,
  speech_to_text,
])

pipelineDF = pipeline.fit(spark_df).transform(spark_df)

asr_whisper_tiny_opt download started this may take some time.
Approximate size to download 231.4 MB
[OK!]


In [None]:
pipelineDF.select("text.result").show(5, False)

pipelineDF.select("text.metadata").show(5, False)

pipelineDF.select("text").show(5, False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ Because you are asleep instead of conquering, the lovely Rose Princess has become a fiddle without a bow. While poor Shaggy sits there, a cooling dove.]           |
|[ He has gone and gone for good answered Paul Icrom who had managed to squeeze into the room beside the dragon and had witnessed the occurrences with much interest.]|
|[ I have remained a prisoner only because I wished to be one. And with this, he stepped forward and burst the stoutchains as easily as if they had been threads