[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/audio/asr-wav2vec2/Automatic_Speech_Recognition_Wav2Vec2_(Wav2Vec2ForCTC).ipynb)

# Automatic Speech Recognition in Spark NLP
## Wav2Vec2 (Wav2Vec2ForCTC)

Wav2Vec2ForCTC is a Wav2Vec2 model with a language modeling head on top for Connectionist Temporal Classification (CTC). Wav2Vec2 was proposed in *wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations* by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli.
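
To make the CTC idea more concrete, here is a minimal sketch of greedy CTC decoding: take the most likely label for each audio frame, collapse consecutive repeats, then drop the blank symbol. The toy vocabulary and frame labels are hypothetical, and this is not Spark NLP's internal decoder, only an illustration of the general technique.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank symbol.
BLANK = 0
VOCAB = {1: "H", 2: "E", 3: "L", 4: "O"}

def ctc_greedy_decode(frame_ids, blank=BLANK, vocab=VOCAB):
    """Collapse consecutive repeated labels, then remove blanks."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(vocab[i] for i in collapsed if i != blank)

# One predicted label per audio frame (the argmax over the vocabulary).
frames = [1, 1, 0, 2, 2, 3, 3, 0, 3, 4, 4, 0]
print(ctc_greedy_decode(frames))  # HELLO
```

The blank between the two runs of `3` is what allows a genuine double letter to survive the collapsing step.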

The annotator takes audio and transcribes it as text. The audio must be provided pre-processed as an array of floats.

Note that this annotator is currently not supported on Apple Silicon processors such as the M1, because these processors do not support the instructions required by XLA.

- List of all available ASR [models](https://sparknlp.org/models?task=Automatic+Speech+Recognition&type=model)
- List of all available ASR [pipelines](https://sparknlp.org/models?task=Automatic+Speech+Recognition&type=pipeline)

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash

# to process audio files
!pip install -q pyspark librosa

--2022-12-23 14:10:21--  https://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-12-23 14:10:21--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1191 (1.2K) [text/plain]
Saving to: ‘STDOUT’


2022-12-23 14:10:21 (23.1 MB/s) - written to stdout [1191/1191]

Installing PySpark 3.2.3 and Spark NLP 4.2.6
setup Colab for PySpark 3.2.3 and Spark NLP

In [None]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

print(sparknlp.version())

4.3.1


# Spark NLP ASR Pipeline & Model
## Wav2Vec2 
Loading an audio file

Let's download a sample WAV file

In [None]:
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/wavs/ngm_12484_01067234848.wav

--2023-02-17 15:55:34--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/wavs/ngm_12484_01067234848.wav
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.108.254, 52.217.163.192, 52.216.144.93, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.108.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 417836 (408K) [audio/wav]
Saving to: ‘ngm_12484_01067234848.wav’


2023-02-17 15:55:35 (857 KB/s) - ‘ngm_12484_01067234848.wav’ saved [417836/417836]



Let's listen to the audio

In [None]:
from IPython.display import Audio
FILE_PATH = "ngm_12484_01067234848.wav"
Audio(filename=FILE_PATH)

We will use the `librosa` library to load our WAV file and resample it to 16 kHz

In [None]:
import librosa
# Load the file and resample it to 16 kHz
data, sampling_rate = librosa.load(FILE_PATH, sr=16000)
# Convert the NumPy samples to plain Python floats
data = [float(x) for x in data]

This is how we can create a PySpark DataFrame from the `librosa` results

In [None]:
from pyspark.sql.types import *
import pyspark.sql.functions as F
import pandas as pd
schema = StructType([StructField("audio_content", ArrayType(FloatType())),
                     StructField("sampling_rate", LongType())])

df = pd.DataFrame({
    "audio_content":[data],
    "sampling_rate":[sampling_rate]
})

spark_df=spark.createDataFrame(df, schema)
spark_df.printSchema()
spark_df.show(1)

root
 |-- audio_content: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- sampling_rate: long (nullable = true)

+--------------------+-------------+
|       audio_content|sampling_rate|
+--------------------+-------------+
|[-5.3640502E-5, -...|        16000|
+--------------------+-------------+
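
The `audio_content` column above holds the samples as floats. For readers who want to see what such a conversion involves without `librosa`, here is a minimal standard-library sketch. It assumes a mono, 16-bit PCM WAV file and does no resampling; the helper name `pcm16_wav_to_floats` is hypothetical, and the tiny synthetic tone exists only so the example runs on its own.

```python
import io
import math
import struct
import wave

def pcm16_wav_to_floats(wav_bytes):
    """Read a mono 16-bit PCM WAV and return (samples scaled to [-1, 1), sample rate)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles mono 16-bit PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [i / 32768.0 for i in ints], rate

# Build a tiny synthetic WAV in memory (160 samples of a 440 Hz tone at 16 kHz).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    tone = [int(16000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(160)]
    wav.writeframes(struct.pack("<160h", *tone))

samples, rate = pcm16_wav_to_floats(buffer.getvalue())
print(len(samples), rate)  # 160 16000
```

In the notebook itself `librosa.load` remains the simpler choice, since it also handles other formats and resampling.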



### The simplest and fastest way is to use a pre-trained [pipeline for ASR](https://sparknlp.org/models?task=Automatic+Speech+Recognition&type=pipeline):





In [None]:
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
# Download a pre-trained pipeline
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_960h', lang='en')

pipelineDF = pipeline.transform(spark_df)

pipeline_asr_wav2vec2_base_960h download started this may take some time.
Approx size to download 217 MB
[OK!]


In [None]:

# let's see what's inside out-of-the-box
pipelineDF.printSchema()

pipelineDF.select("text.result").show(1, False)

pipelineDF.select("text.metadata").show(1, False)

pipelineDF.select("text").show(1, False)

root
 |-- audio_content: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- sampling_rate: long (nullable = true)
 |-- audio_assembler: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- result: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |-- text: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = 

### Custom Pipeline
You can also construct your own custom pipeline from Spark NLP pretrained models. This gives you more control and flexibility over the entire pipeline.


In [None]:
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.base import *

audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = Wav2Vec2ForCTC \
    .pretrained()\
    .setInputCols("audio_assembler") \
    .setOutputCol("text")

pipeline = Pipeline(stages=[
  audio_assembler,
  speech_to_text,
])

pipelineDF = pipeline.fit(spark_df).transform(spark_df)

asr_wav2vec2_base_960h download started this may take some time.
Approximate size to download 217 MB
[OK!]


Let's have a look:

In [None]:
pipelineDF.select("text.result").show(1, False)

pipelineDF.select("text.metadata").show(1, False)

pipelineDF.select("text").show(1, False)

+-----------------------------------------------+
|result                                         |
+-----------------------------------------------+
|[PEOPLE WHO DIED WHILE LIVING IN OTHER PLACES ]|
+-----------------------------------------------+

+----------------------------------------------+
|metadata                                      |
+----------------------------------------------+
|[{audio -> 0, sentence -> 0, length -> 69632}]|
+----------------------------------------------+

+--------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 44, PEOPLE WHO DIED WHILE LIVING IN OTHER PLACES , {audio -> 0, sentence -> 0, length -> 69632}, []}]|
+--------------------------
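
The numbers in the output above can be sanity-checked with a little arithmetic. This sketch assumes that the `length` metadata value counts audio samples and that `end` is an inclusive character offset; both are readings of the output rather than documented guarantees.

```python
# Values copied from the output above.
num_samples = 69632     # metadata "length" (assumed to be a sample count)
sampling_rate = 16000   # rate requested from librosa.load

duration_seconds = num_samples / sampling_rate
print(duration_seconds)  # 4.352

# The transcript, including its trailing space, as shown in the result column.
text = "PEOPLE WHO DIED WHILE LIVING IN OTHER PLACES "
end_offset = len(text) - 1
print(end_offset)  # 44, matching the "end" field of the annotation
```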

# Spark NLP ASR-NER Pipeline
## Wav2Vec2, OntoNotes NER, and BERT

In [None]:
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/1664116679869-voicemaker.in-speech.mp3

--2023-02-17 15:58:03--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/1664116679869-voicemaker.in-speech.mp3
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.135.80, 52.216.108.237, 52.216.162.13, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.135.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40221 (39K) [audio/mp3]
Saving to: ‘1664116679869-voicemaker.in-speech.mp3’


2023-02-17 15:58:04 (336 KB/s) - ‘1664116679869-voicemaker.in-speech.mp3’ saved [40221/40221]



In [None]:
from IPython.display import Audio
FILE_PATH = "./1664116679869-voicemaker.in-speech.mp3"
Audio(FILE_PATH)

In [None]:
data, sampling_rate = librosa.load(FILE_PATH, sr=16000)
data = [float(x) for x in data]

In [None]:
# Create a PySpark DataFrame from pandas
from pyspark.sql.types import *
import pyspark.sql.functions as F

schema = StructType([StructField("audio_content", ArrayType(FloatType())),
                     StructField("sampling_rate", LongType())])

df = pd.DataFrame({
    "audio_content":[data],
    "sampling_rate":[sampling_rate]
})

spark_df=spark.createDataFrame(df, schema)

In [None]:
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.base import *

audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = Wav2Vec2ForCTC \
    .pretrained()\
    .setInputCols("audio_assembler") \
    .setOutputCol("document")

token = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols("token") \
    .setOutputCol("normalized") \
    .setLowercase(True)

bert = BertEmbeddings.pretrained("small_bert_L4_256") \
    .setInputCols("document", "normalized") \
    .setOutputCol("embeddings")

ner_onto = NerDLModel.pretrained("onto_small_bert_L4_256", "en") \
    .setInputCols(["document", "normalized", "embeddings"]) \
    .setOutputCol("ner")

entities = NerConverter() \
    .setInputCols(["document", "normalized", "ner"]) \
    .setOutputCol("entities")

pipeline = Pipeline(stages=[
  audio_assembler,
  speech_to_text,
  token,
  normalizer,
  bert,
  ner_onto,
  entities
])

asr_pipelineDF = pipeline.fit(spark_df).transform(spark_df)

asr_wav2vec2_base_960h download started this may take some time.
Approximate size to download 217 MB
[OK!]
small_bert_L4_256 download started this may take some time.
Approximate size to download 40.5 MB
[OK!]
onto_small_bert_L4_256 download started this may take some time.
Approximate size to download 14.1 MB
[OK!]
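
The last two stages of the pipeline above work as a pair: the NER model emits one tag per token, and `NerConverter` groups tagged tokens into entity chunks. The sketch below illustrates that grouping in plain Python, assuming IOB-style tags (`B-`, `I-`, `O`). The function `iob_to_chunks`, the tokens, and the tags are all hypothetical and hand-written for illustration; they are not output from the model.

```python
def iob_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)  # continue the open chunk
        else:  # an "O" tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Hypothetical example, written by hand.
tokens = ["alice", "visited", "new", "york", "city"]
tags = ["B-PERSON", "O", "B-GPE", "I-GPE", "I-GPE"]
print(iob_to_chunks(tokens, tags))
# [('alice', 'PERSON'), ('new york city', 'GPE')]
```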


In [None]:
asr_pipelineDF.select("document.result").show(1, False)

asr_pipelineDF.select("normalized.result").show(1, False)

asr_pipelineDF.select("ner.result").show(1, False)

asr_pipelineDF.select("entities.result").show(1, False)

+--------------------------------------------------------------------------------------------------------+
|result                                                                                                  |
+--------------------------------------------------------------------------------------------------------+
|[THE MONALISA IS THE SIXTENTH CENTURY OIL PAINTING CREATED BY LEONARDO IT'S HELD AT THE LUVRE IN PARIS ]|
+--------------------------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                 |
+-----------------------------------------------------------------------------------------------------------------------+
|[the, monalisa, is, the, sixtenth, century, oil, painting, created, by, leonardo, its, held, at, 
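
In the `normalized` output above, the tokens are lowercased and stripped of punctuation, so `IT'S` becomes `its`. The following is a rough stand-alone approximation of that step, assuming whitespace tokenization and an ASCII-letters-only cleanup. The actual `Tokenizer` and `Normalizer` annotators have configurable rules, so treat this only as a sketch; `normalize_tokens` is a hypothetical helper.

```python
import re

def normalize_tokens(text):
    """Split on whitespace, strip non-letters, lowercase, and drop empty tokens."""
    cleaned = (re.sub(r"[^A-Za-z]", "", token).lower() for token in text.split())
    return [token for token in cleaned if token]

# Transcript copied from the `document.result` output above.
transcript = ("THE MONALISA IS THE SIXTENTH CENTURY OIL PAINTING CREATED BY "
              "LEONARDO IT'S HELD AT THE LUVRE IN PARIS ")
tokens = normalize_tokens(transcript)
print(tokens[11:14])  # ['its', 'held', 'at']
```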

# Spark NLP ASR pipeline and model
## HuggingFace Datasets

Let's create a DataFrame from the Hugging Face Datasets library

In [None]:
!pip install -q datasets

In [None]:
import pandas as pd
import librosa

from datasets import load_dataset

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
pandas_dataframe = pd.DataFrame(ds['audio'])
# Convert each NumPy audio array to a list of plain Python floats
pandas_dataframe['array'] = pandas_dataframe['array'].apply(lambda row: [float(value) for value in row])

Downloading builder script: 100%|██████████| 5.16k/5.16k [00:00<00:00, 1.63MB/s]


Downloading and preparing dataset librispeech_asr_dummy/clean to /home/root/.cache/huggingface/datasets/patrickvonplaten___librispeech_asr_dummy/clean/2.1.0/f2c70a4d03ab4410954901bde48c54b85ca1b7f9bf7d616e7e2a72b5ee6ddbfc...


Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 3214.03it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 308.20it/s]
                                                             

Dataset librispeech_asr_dummy downloaded and prepared to /home/root/.cache/huggingface/datasets/patrickvonplaten___librispeech_asr_dummy/clean/2.1.0/f2c70a4d03ab4410954901bde48c54b85ca1b7f9bf7d616e7e2a72b5ee6ddbfc. Subsequent calls will reuse this data.


In [None]:
# Create a PySpark DataFrame from pandas (the schema renames the `array` column to `audio_content`)
from pyspark.sql.types import *
import pyspark.sql.functions as F

schema = StructType([StructField("path", StringType()), 
                     StructField("audio_content", ArrayType(FloatType())),
                     StructField("sampling_rate", LongType())])
spark_df=spark.createDataFrame(pandas_dataframe, schema)
spark_df.printSchema()
spark_df.show()

root
 |-- path: string (nullable = true)
 |-- audio_content: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- sampling_rate: long (nullable = true)

+--------------------+--------------------+-------------+
|                path|       audio_content|sampling_rate|
+--------------------+--------------------+-------------+
|/home/root/.cach...|[-1.8310547E-4, -...|        16000|
|/home/root/.cach...|[-0.0013427734, -...|        16000|
|/home/root/.cach...|[-3.9672852E-4, -...|        16000|
|/home/root/.cach...|[-0.006164551, -0...|        16000|
|/home/root/.cach...|[-0.001373291, -0...|        16000|
|/home/root/.cach...|[-0.004852295, 2....|        16000|
|/home/root/.cach...|[0.0011291504, 5....|        16000|
|/home/root/.cach...|[-0.0027160645, 0...|        16000|
|/home/root/.cach...|[0.002380371, 0.0...|        16000|
|/home/root/.cach...|[-0.0033874512, 0...|        16000|
|/home/root/.cach...|[-9.1552734E-4, -...|        16000|
|/home/root/.cach...|[2

In [None]:
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
# Download a pre-trained pipeline
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_960h', lang='en')

pipelineDF = pipeline.transform(spark_df)

pipeline_asr_wav2vec2_base_960h download started this may take some time.
Approx size to download 217 MB
[OK!]


In [None]:
pipelineDF.select("text.result").show(5, False)

pipelineDF.select("text.metadata").show(5, False)

pipelineDF.select("text").show(5, False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[BECAUSE YOU ARE SLEPING INSTEAD OF CONQUERING THE LOVELY ROSE PRINCES HAS BECOME A FIDLE WITHOUT A BAW WHILE POR SHAGY SITS THERE A COING DOVE ]             |
|[HE HAS GONE AND GONE FOR GOD ANSWERED POLYCHROME WHO HAD MANAGED TO SQUEZE INTO THE ROM BESIDE THE DRAGON AND HAD WITNESED THE OCURENCES WITH MUCH INTEREST ]|
|[I HAVE REMAINED A PRISONER ONLY BECAUSE I WISHED TO BE ONE AND WITH THIS HE STEPED FORWARD AND BURST THE STOUT CHAINS AS EASILY AS IF THEY HAD BEN THREADS ] |
|[THE LITLE GIRL HAD BEN ASLEP BUT