![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/22.2.Retrieval_Augmented_Generation_with_Spark_NLP.ipynb)

# Automatic Speech Recognition in Spark NLP

**Automatic Speech Recognition** (**ASR**), also known as **Speech-to-Text** (**STT**), is the capability to automatically understand audio inputs and transcribe them to text.

This enables applications in many fields, such as automatic caption generation for videos, transcription of business meetings, voice typing of messages, and much more.


We currently support three types of models:

- [Whisper](https://sparknlp.org/docs/en/transformers#whisperforctc)
- [Wav2Vec2](https://sparknlp.org/docs/en/transformers#wav2vec2forctc)
- [HuBERT](https://sparknlp.org/docs/en/transformers#hubertforctc)

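These are end-to-end implementations of ASR, meaning that each model encodes the audio input and decodes it directly into text. Wav2Vec2 and HuBERT decode with a [Connectionist Temporal Classification (CTC)](https://dl.acm.org/doi/10.1145/1143844.1143891) head, while Whisper is an encoder-decoder (sequence-to-sequence) model, even though its Spark NLP annotator is named `WhisperForCTC`.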
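
To make the CTC idea concrete, here is a minimal sketch of greedy CTC decoding in plain Python: the acoustic model emits one label per audio frame, consecutive repeats are collapsed, and the blank symbol is dropped. The `_` blank symbol, the function name `ctc_greedy_decode`, and the frame labels are assumptions made up for this illustration; they are not Spark NLP API or real model output.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this illustration


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeated labels, then drop blank symbols."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)


# One predicted label per audio frame (made-up values, not model output)
frames = list("hh_e_ll_lll_oo")
decoded = ctc_greedy_decode(frames)
print(decoded)  # hello
```

The blank between the two runs of `l` is what lets CTC keep a genuine double letter: without it, the repeated frames would collapse into a single `l`.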

Pretrained models can be found at the [Spark NLP Models Hub](https://nlp.johnsnowlabs.com/models?task=Automatic+Speech+Recognition&type=model).


For importing models from Hugging Face or TFHub to Spark NLP, you can follow the steps described [here](https://sparknlp.org/docs/en/transformers#import-transformers-into-spark-nlp).

## Setup

We install `librosa` to open audio files and standardize the sampling rate.

In [None]:
!pip install -q johnsnowlabs librosa


In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Applied_Generative_AI/data/clinical.mp3

--2024-07-14 23:17:58--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Applied_Generative_AI/data/clinical.mp3
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3442455 (3.3M) [audio/mpeg]
Saving to: ‘clinical.mp3’


2024-07-14 23:17:58 (30.1 MB/s) - ‘clinical.mp3’ saved [3442455/3442455]



In [None]:
import librosa
import sparknlp
import pandas as pd
from johnsnowlabs import nlp
from pyspark.sql.types import ArrayType, FloatType, LongType, StructField, StructType
from IPython.display import Audio

import warnings

warnings.filterwarnings("ignore")

In [None]:
spark = sparknlp.start()
spark.sparkContext.setLogLevel("ERROR")

## Transcribing a clinical audio file (synthetic data)

### Reading the audio file

In [None]:
Audio(filename="./clinical.mp3")

We load the audio at a 16 kHz sampling rate and split it into 30-second clips, since Whisper processes audio in windows of at most 30 seconds.

In [None]:
data, sampling_rate = librosa.load("./clinical.mp3", sr=16000)
data = [float(x) for x in data]

# Split the audio into 30-second clips
clips = []
clip_length = sampling_rate * 30 # 30 seconds
for i in range(0, len(data), clip_length):
    clips.append(data[i:i+clip_length])

len(clips)

21
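
The clip count follows directly from the arithmetic: with a clip length of `sampling_rate * 30` samples, the number of clips is the sample count divided by the clip length, rounded up, and the last clip may be shorter than 30 seconds. A count of 21 therefore means the recording is longer than 10 minutes and at most 10.5 minutes. The sketch below checks this logic on a synthetic signal; the helper names and the toy 100 Hz rate are assumptions chosen for illustration only.

```python
import math


def split_into_clips(samples, sampling_rate, clip_seconds=30):
    """Split a list of samples into consecutive clips of clip_seconds each."""
    clip_length = sampling_rate * clip_seconds
    return [samples[i:i + clip_length] for i in range(0, len(samples), clip_length)]


def expected_clip_count(n_samples, sampling_rate, clip_seconds=30):
    """Number of clips produced for n_samples samples (last one may be partial)."""
    return math.ceil(n_samples / (sampling_rate * clip_seconds))


# Synthetic signal: 65 seconds of silence at a toy rate of 100 Hz
toy_rate = 100
toy_samples = [0.0] * (65 * toy_rate)
toy_clips = split_into_clips(toy_samples, toy_rate)
clip_lengths = [len(c) for c in toy_clips]
print(clip_lengths)  # [3000, 3000, 500]
```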

In [None]:
# Let's transcribe the first 2 minutes only
clips = clips[:4]

In [None]:
schema = StructType([
        StructField("audio_content", ArrayType(FloatType())),
        StructField("sampling_rate", LongType())
])

spark_df = spark.createDataFrame([
    (c, sampling_rate) for c in clips
    ], schema)
spark_df.show()

+--------------------+-------------+
|       audio_content|sampling_rate|
+--------------------+-------------+
|[-0.112586305, -0...|        16000|
|[-4.200524E-4, -1...|        16000|
|[0.14874943, 0.13...|        16000|
|[-2.8103145E-4, -...|        16000|
+--------------------+-------------+



### Building the pipeline

In [None]:
audioAssembler = (
    nlp.AudioAssembler()
    .setInputCol("audio_content")
    .setOutputCol("audio_assembler")
)

speechToText = (
    nlp.WhisperForCTC.pretrained("asr_whisper_base_english", "en")
    .setInputCols(["audio_assembler"])
    .setOutputCol("text")
)

pipeline = nlp.Pipeline().setStages([audioAssembler, speechToText])
asr_model = pipeline.fit(spark_df)

asr_whisper_base_english download started this may take some time.
Approximate size to download 387.4 MB
[OK!]


### Making the transcription

In [None]:
%%time

result = asr_model.transform(spark_df)
result_df = result.select("text.result").toPandas()
result_df.head(3)

CPU times: user 322 ms, sys: 37.1 ms, total: 359 ms
Wall time: 54.9 s


|   | result |
|---|--------|
| 0 | [ what brought you in today? Sure. I'm just ha... |
| 1 | [ It started last night, but it's becoming sha... |
| 2 | [ like eight. Okay. Has it been constant throu... |
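
As a rough way to read the `%%time` output, the wall time can be compared with the duration of the audio that was transcribed (four 30-second clips). The ratio is often called the real-time factor; values below 1 mean transcription ran faster than playback. This is only a back-of-the-envelope sketch: the measured wall time also includes Spark job overhead and the `toPandas()` collection, and it will vary with hardware.

```python
n_clips = 4               # clips kept after clips[:4]
clip_seconds = 30         # length of each clip
wall_time_seconds = 54.9  # wall time reported by %%time above

audio_seconds = n_clips * clip_seconds
real_time_factor = wall_time_seconds / audio_seconds
print(f"{audio_seconds} s of audio, real-time factor = {real_time_factor:.2f}")
# 120 s of audio, real-time factor = 0.46
```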


In [None]:
import textwrap


result_df["text"] = result_df["result"].apply(lambda x: x[0])
final_text = " ".join(result_df["text"].values)
print(textwrap.fill(final_text, width=100))

 what brought you in today? Sure. I'm just having a lot of chest pain. And so I thought I should get
it checked out. OK. And before we start, could you remind me of your gender and age? Sure. 39, I'm a
male. OK. And so when did this chest pain start?  It started last night, but it's becoming sharper.
Okay. And where is this pain located? It's located on the last side of my chest. Okay. And so how
long has it been going on for benefit started last night? So I guess it would be a couple hours now.
Maybe  like eight. Okay. Has it been constant throughout that time or changing? I would say it's
been pretty constant, yeah. Okay. And how would you describe the pain? People will use words
sometimes like sharp, burning, achy. I think it's pretty sharp, yeah. Sharp, okay.  Anything that
you have done tried since last night that's made the pain better? Not laying down. Helps. Okay, so
do you find laying down makes the pain worse? Yes, definitely. Okay. Do you find that the pain is
radiating anyw
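
Each clip's transcription starts with a space, so joining the clips with `" ".join(...)` leaves double spaces at the clip boundaries (for example before "It started last night"). One possible cleanup, sketched below, strips each chunk and collapses whitespace runs before printing. The helper name `join_transcripts` is hypothetical, and the sample strings are shortened excerpts of the output above.

```python
import re


def join_transcripts(chunks):
    """Join per-clip transcriptions and collapse any repeated whitespace."""
    joined = " ".join(chunk.strip() for chunk in chunks)
    return re.sub(r"\s+", " ", joined)


# Shortened excerpts of the per-clip results shown above
chunks = [
    " what brought you in today?",
    " It started last night, but it's becoming sharper.",
]
clean_text = join_transcripts(chunks)
print(clean_text)
```

In the notebook, the same helper could be applied to `result_df["text"].values` in place of the plain `" ".join(...)` call.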

> **The model transcribed the conversation between the clinician and the patient well, with only minor errors (for example, "last side of my chest", presumably "left side").**