<a href="https://colab.research.google.com/github/SeongeunKim-sonja/Keras_example_study_2023/blob/main/Audio_Data/Automatic_Speech_Recognition_using_CTC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

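# Automatic Speech Recognition Using CTC

*This document analyzes the Keras example and adds explanatory commentary.*

<br>[Example page](https://keras.io/examples/audio/ctc_asr/)
<br>[CTC reference](https://ratsgo.github.io/speechbook/docs/neuralam/ctc)
<br>[Lecture video on deep learning for speech recognition](https://youtu.be/MghbHLuupoA)
<br>[Written summary of the lecture video](https://lynnshin.tistory.com/42)

<br>The example code on this page implements automatic speech recognition (ASR) using a 2D CNN, an RNN, and the CTC loss.
<br>The dataset is LJSpeech from the LibriVox project (audio clips of sentences read from seven non-fiction books).
<br>Model accuracy is measured with the word error rate (WER), so the jiwer package is installed to compute it.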





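# CTC
CTC (Connectionist Temporal Classification) is a classification technique used when there is no one-to-one, segment-by-segment labeling between the input data and the output data.<br>
A basic CTC model passes the input through several RNN-based sequence layers and a softmax, producing one label distribution per frame.<br>
A blank symbol, epsilon (ε), is inserted between phonemes to handle silence.<br><br>

1. Split the input into segments and classify the phoneme in each segment
![CTC1](https://i.imgur.com/hpVlJXr.png)
<br><br>
2. Merge repeated phonemes and delete the epsilon (silence) segments
![CTC2](https://i.imgur.com/LRjrS68.png)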
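The two steps above can be sketched in a few lines of plain Python. This is only an illustration of the collapsing rule, not code from the Keras example; the symbol `"ε"` and the helper name `ctc_collapse` are assumptions made for this sketch.

```python
BLANK = "ε"  # assumed blank symbol, matching the figures above


def ctc_collapse(frames):
    """Collapse a per-frame label sequence into the final output."""
    merged = []
    previous = None
    for symbol in frames:
        # Step 1: merge consecutive repeats of the same symbol.
        if symbol != previous:
            merged.append(symbol)
        previous = symbol
    # Step 2: drop the blank symbol.
    return [s for s in merged if s != BLANK]


frames = ["h", "h", "ε", "e", "ε", "l", "l", "ε", "l", "o", "o"]
print("".join(ctc_collapse(frames)))  # hello
```

Because repeats are merged before blanks are removed, a blank between two identical symbols is what lets a genuine double letter such as the "ll" in "hello" survive.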


## Setup

In [None]:
pip install jiwer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting jiwer
  Downloading jiwer-2.5.1-py3-none-any.whl (15 kB)
Collecting levenshtein==0.20.2
  Downloading Levenshtein-0.20.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Collecting rapidfuzz<3.0.0,>=2.3.0
  Downloading rapidfuzz-2.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
Installing collected packages: rapidfuzz, levenshtein, jiwer
Successfully installed jiwer-2.5.1 levenshtein-0.20.2 rapidfuzz-2.13.2


In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
from IPython import display
from jiwer import wer   # word error rate (WER)

## Load the LJSpeech Dataset

Let's download the [LJSpeech Dataset](https://keithito.com/LJ-Speech-Dataset/).
The dataset contains 13,100 audio files as `wav` files in the `/wavs/` folder.
The label (transcript) for each audio file is a string
given in the `metadata.csv` file. The fields are:

- **ID**: this is the name of the corresponding .wav file
- **Transcription**: words spoken by the reader (UTF-8)
- **Normalized transcription**: transcription with numbers,
ordinals, and monetary units expanded into full words (UTF-8).

For this demo we will use the "Normalized transcription" field.

Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22,050 Hz.

In [None]:
data_url = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2"
data_path = keras.utils.get_file("LJSpeech-1.1", data_url, untar=True)
wavs_path = data_path + "/wavs/"
metadata_path = data_path + "/metadata.csv"


# Read metadata file and parse it
metadata_df = pd.read_csv(metadata_path, sep="|", header=None, quoting=3)
metadata_df.columns = ["file_name", "transcription", "normalized_transcription"]
metadata_df = metadata_df[["file_name","normalized_transcription"]]
metadata_df = metadata_df.sample(frac=1).reset_index(drop=True)
metadata_df.head(3)


Downloading data from https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2


| | file_name | normalized_transcription |
| :--: | :--: | :--: |
| 0 | LJ036-0013 | found along the path of flight taken by the gu... |
| 1 | LJ040-0227 | Mrs. Siegel concluded her report with the stat... |
| 2 | LJ014-0323 | The warrant thus represented money, and was of... |


In [None]:
split = int(len(metadata_df) * 0.90)    # Split the dataset 9:1 into training data and validation data
df_train = metadata_df[:split]
df_val = metadata_df[split:]

print(f"Size of the training set: {len(df_train)}")
print(f"Size of the validation set: {len(df_val)}")


Size of the training set: 11790
Size of the validation set: 1310


## Preprocessing

We first prepare the vocabulary to be used.

In [None]:
# The set of characters accepted in the transcription.
characters = [x for x in "abcdefghijklmnopqrstuvwxyz'?! "]    # list comprehension; see https://wikidocs.net/22805
# Mapping characters to integers
char_to_num = keras.layers.StringLookup(vocabulary=characters, oov_token="")
# When oov_token is set, it is added to the vocabulary (index 0 here) and stands in
# for any out-of-vocabulary character, so unknown input does not raise an error.
# Mapping integers back to original characters
num_to_char = keras.layers.StringLookup(
    vocabulary=char_to_num.get_vocabulary(), oov_token="", invert=True
)

print(
    f"The vocabulary is: {char_to_num.get_vocabulary()} "
    f"(size ={char_to_num.vocabulary_size()})"
)

The vocabulary is: ['', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'", '?', '!', ' '] (size =31)
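The index assignment shown in the output above can be mimicked in plain Python. This is a rough sketch of the mapping only, not the `StringLookup` implementation; the names `char_to_index`, `index_to_char`, `encode`, and `decode` are hypothetical helpers introduced for illustration.

```python
characters = list("abcdefghijklmnopqrstuvwxyz'?! ")
OOV_INDEX = 0  # index 0 is reserved for the out-of-vocabulary token ""

# Indices 1..30 follow the order of `characters`, as in the printed vocabulary.
char_to_index = {ch: i + 1 for i, ch in enumerate(characters)}
index_to_char = {i: ch for ch, i in char_to_index.items()}
index_to_char[OOV_INDEX] = ""


def encode(text):
    """Lowercase the text and map each character to its index (0 if unknown)."""
    return [char_to_index.get(ch, OOV_INDEX) for ch in text.lower()]


def decode(indices):
    """Map indices back to characters; index 0 decodes to the empty string."""
    return "".join(index_to_char[i] for i in indices)


ids = encode("Hi, Bob!")
print(ids)          # [8, 9, 0, 30, 2, 15, 2, 29]
print(decode(ids))  # hi bob!
```

The comma is not in the vocabulary, so it is encoded as 0 and disappears when the indices are decoded back to text.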


Next, we create the function that describes the transformation that we apply to each
element of our dataset.

In [None]:
# An integer scalar Tensor. The window length in samples.
frame_length = 256
# An integer scalar Tensor. The number of samples to step.
frame_step = 160
# An integer scalar Tensor. The size of the FFT to apply.
# If not provided, uses the smallest power of 2 that is greater than or equal to frame_length.
fft_length = 384


def encode_single_sample(wav_file, label):
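    # Encode one sample: compute the spectrogram, keep only the magnitude (absolute value), then standardize it.
    # English has upper and lower case, so the label is lowercased to reduce the number of distinct characters.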
    ###########################################
    ##  Process the Audio
    ##########################################
    # 1. Read wav file
    file = tf.io.read_file(wavs_path + wav_file + ".wav")
    # 2. Decode the wav file
    audio, _ = tf.audio.decode_wav(file)
    audio = tf.squeeze(audio, axis=-1)  
    # 3. Change type to float
    audio = tf.cast(audio, tf.float32)
    # 4. Get the spectrogram
    spectrogram = tf.signal.stft(
        audio, frame_length=frame_length, frame_step=frame_step, fft_length=fft_length
    )
    # 5. We only need the magnitude, which can be derived by applying tf.abs
    spectrogram = tf.abs(spectrogram)
    spectrogram = tf.math.pow(spectrogram, 0.5)
    # 6. normalisation
    means = tf.math.reduce_mean(spectrogram, 1, keepdims=True)
    stddevs = tf.math.reduce_std(spectrogram, 1, keepdims=True)
    spectrogram = (spectrogram - means) / (stddevs + 1e-10)
    ###########################################
    ##  Process the label
    ##########################################
    # 7. Convert label to Lower case
    label = tf.strings.lower(label)
    # 8. Split the label
    label = tf.strings.unicode_split(label, input_encoding="UTF-8")
    # 9. Map the characters in label to numbers
    label = char_to_num(label)
    # 10. Return the (spectrogram, label) pair used as model input and target
    return spectrogram, label
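To get a feel for the size of the spectrogram that `encode_single_sample` returns, the frame and bin counts can be worked out by hand. The sketch below assumes the STFT is computed without end padding (the default of `tf.signal.stft`); `spectrogram_shape` is a hypothetical helper, not part of the example.

```python
sample_rate = 22050   # LJSpeech sample rate in Hz
frame_length = 256    # window length in samples
frame_step = 160      # hop between windows in samples
fft_length = 384      # FFT size


def spectrogram_shape(num_samples):
    """Return (frames, frequency bins) for an STFT without end padding."""
    num_frames = max(0, 1 + (num_samples - frame_length) // frame_step)
    num_bins = fft_length // 2 + 1
    return num_frames, num_bins


# A 5-second clip.
print(spectrogram_shape(5 * sample_rate))  # (688, 193)
# Window length and hop expressed in milliseconds.
print(1000 * frame_length / sample_rate, 1000 * frame_step / sample_rate)
```

At 22,050 Hz the window covers about 11.6 ms and the hop is about 7.3 ms, so a 5-second clip becomes a 688 x 193 matrix.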


## Creating `Dataset` objects

We create a `tf.data.Dataset` object that yields
the transformed elements, in the same order as they
appeared in the input.

In [None]:
batch_size = 32
# Define the training dataset
train_dataset = tf.data.Dataset.from_tensor_slices(
    (list(df_train["file_name"]), list(df_train["normalized_transcription"]))
)
train_dataset = (
    train_dataset.map(encode_single_sample, num_parallel_calls=tf.data.AUTOTUNE)
    .padded_batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)

# Define the validation dataset
validation_dataset = tf.data.Dataset.from_tensor_slices(
    (list(df_val["file_name"]), list(df_val["normalized_transcription"]))
)
validation_dataset = (
    validation_dataset.map(encode_single_sample, num_parallel_calls=tf.data.AUTOTUNE)
    .padded_batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)
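Clips and transcripts have different lengths, so `padded_batch` pads every element of a batch to the longest one. The sketch below imitates that for label sequences in plain Python, assuming the default padding value of 0; `pad_batch` is a hypothetical helper.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (width - len(seq)) for seq in sequences]


labels = [[8, 9], [2, 15, 2, 29], [1]]
print(pad_batch(labels))
# [[8, 9, 0, 0], [2, 15, 2, 29], [1, 0, 0, 0]]
```

In this notebook index 0 is also the out-of-vocabulary token `""`, so padded positions vanish when labels are mapped back to text with `num_to_char`.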


## Visualize the data

Let's visualize an example in our dataset, including the
audio clip, the spectrogram and the corresponding label.

In [None]:
# Data visualization
# Plot the spectrogram with its label as the title, followed by the waveform (amplitude over time).
fig = plt.figure(figsize=(8, 5))
for batch in train_dataset.take(1):
    spectrogram = batch[0][0].numpy()
    spectrogram = np.array([np.trim_zeros(x) for x in np.transpose(spectrogram)])
    label = batch[1][0]
    # Spectrogram
    label = tf.strings.reduce_join(num_to_char(label)).numpy().decode("utf-8")
    ax = plt.subplot(2, 1, 1)
    ax.imshow(spectrogram, vmax=1)
    ax.set_title(label)
    ax.axis("off")
    # Wav
    file = tf.io.read_file(wavs_path + list(df_train["file_name"])[0] + ".wav")
    audio, _ = tf.audio.decode_wav(file)
    audio = audio.numpy()
    ax = plt.subplot(2, 1, 2)
    plt.plot(audio)
    ax.set_title("Signal Wave")
    ax.set_xlim(0, len(audio))
    # LJSpeech audio is sampled at 22,050 Hz
    display.display(display.Audio(np.transpose(audio), rate=22050))
plt.show()

## Model

We first define the CTC Loss function.

In [None]:
# Implement CTCLoss, which is used as the loss function.
def CTCLoss(y_true, y_pred):
    # Compute the training-time loss value
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")

    input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")

    loss = keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
    return loss
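What the CTC loss measures can be shown with a brute-force version for a tiny input: the negative log of the total probability of all frame-level alignments that collapse to the target. This is a sketch for intuition only. Real implementations use dynamic programming rather than enumeration, and the function names here are hypothetical. It assumes the blank is the last class index, which is consistent with the classification layer of the model using `output_dim + 1` units.

```python
import itertools
import math


def collapse(path, blank):
    """Merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for k in path:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out


def ctc_loss_brute_force(probs, target, blank):
    """-log of the summed probability of all alignments that collapse to target.

    probs[t][k] is the softmax output for class k at frame t.
    """
    num_frames, num_classes = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(num_classes), repeat=num_frames):
        if collapse(path, blank) == list(target):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return -math.log(total)


# Two frames, two classes: 0 = "a", 1 = blank.
probs = [[0.4, 0.6], [0.4, 0.6]]
loss = ctc_loss_brute_force(probs, target=[0], blank=1)
print(round(math.exp(-loss), 2))  # 0.64
```

Three alignments collapse to "a" here: (a, a), (a, blank), and (blank, a), with probabilities 0.16 + 0.24 + 0.24. The enumeration grows exponentially with the number of frames, which is why it is only usable for toy inputs.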


We now define our model. We will define a model similar to
[DeepSpeech2](https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition/deepspeech2.html).

In [None]:
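# Build the layers of the model.
# The architecture is similar to DeepSpeech2 (2015), published by Baidu.
# Two convolution blocks are followed by RNN layers built in a loop; this model uses five RNN layers.
# The optimizer is Adam and the loss function is the CTCLoss defined above, so the network does not need a separate decoder.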
def build_model(input_dim, output_dim, rnn_layers=5, rnn_units=128):
    """Model similar to DeepSpeech2."""
    # Model's input
    input_spectrogram = layers.Input((None, input_dim), name="input")
    # Expand the dimension to use 2D CNN.
    x = layers.Reshape((-1, input_dim, 1), name="expand_dim")(input_spectrogram)
    # Convolution layer 1
    x = layers.Conv2D(
        filters=32,
        kernel_size=[11, 41],
        strides=[2, 2],
        padding="same",
        use_bias=False,
        name="conv_1",
    )(x)
    x = layers.BatchNormalization(name="conv_1_bn")(x)
    x = layers.ReLU(name="conv_1_relu")(x)
    # Convolution layer 2
    x = layers.Conv2D(
        filters=32,
        kernel_size=[11, 21],
        strides=[1, 2],
        padding="same",
        use_bias=False,
        name="conv_2",
    )(x)
    x = layers.BatchNormalization(name="conv_2_bn")(x)
    x = layers.ReLU(name="conv_2_relu")(x)
    # Reshape the resulted volume to feed the RNNs layers
    x = layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)
    # RNN layers
    for i in range(1, rnn_layers + 1):
        recurrent = layers.GRU(
            units=rnn_units,
            activation="tanh",
            recurrent_activation="sigmoid",
            use_bias=True,
            return_sequences=True,
            reset_after=True,
            name=f"gru_{i}",
        )
        x = layers.Bidirectional(
            recurrent, name=f"bidirectional_{i}", merge_mode="concat"
        )(x)
        if i < rnn_layers:
            x = layers.Dropout(rate=0.5)(x)
    # Dense layer
    x = layers.Dense(units=rnn_units * 2, name="dense_1")(x)
    x = layers.ReLU(name="dense_1_relu")(x)
    x = layers.Dropout(rate=0.5)(x)
    # Classification layer
    output = layers.Dense(units=output_dim + 1, activation="softmax")(x)
    # Model
    model = keras.Model(input_spectrogram, output, name="DeepSpeech_2")
    # Optimizer
    opt = keras.optimizers.Adam(learning_rate=1e-4)
    # Compile the model and return
    model.compile(optimizer=opt, loss=CTCLoss)
    return model


# Get the model
model = build_model(
    input_dim=fft_length // 2 + 1,
    output_dim=char_to_num.vocabulary_size(),
    rnn_units=512,
)
model.summary(line_length=110)
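The shape that reaches the first GRU layer can be checked by hand. With `padding="same"`, a convolution with stride `s` maps a length `n` to `ceil(n / s)` along that axis. The sketch below applies that rule to the two convolution layers; the helper names are hypothetical, and the 688-frame input is just an example length.

```python
import math


def same_padding_out(size, stride):
    """Output length of a 'same'-padded convolution along one axis."""
    return math.ceil(size / stride)


def rnn_input_shape(num_frames, num_bins=193, filters=32):
    """Return (time steps, features) after conv_1, conv_2 and the Reshape layer."""
    # conv_1: strides [2, 2] halve both the time and the frequency axis.
    t = same_padding_out(num_frames, 2)
    f = same_padding_out(num_bins, 2)
    # conv_2: strides [1, 2] keep the time axis and halve frequency again.
    t = same_padding_out(t, 1)
    f = same_padding_out(f, 2)
    # The Reshape layer flattens (frequency, channels) into one feature axis.
    return t, f * filters


print(rnn_input_shape(688))  # (344, 1568)
```

The time axis is halved once, so the network emits roughly one character distribution per two spectrogram frames, and the transcript has to fit within that many steps for CTC to align it.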

## Training and Evaluating

In [None]:
# A utility function to decode the output of the network
def decode_batch_predictions(pred):
    input_len = np.ones(pred.shape[0]) * pred.shape[1]
    # Use greedy search. For complex tasks, you can use beam search
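    # Greedy search picks the most probable token at each time step, so it is fast but can produce grammatically odd output.
    # Beam search keeps several candidates and selects the most probable combination; it often improves results in language tasks but needs more memory and computation.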
    results = keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0]
    # Iterate over the results and get back the text
    output_text = []
    for result in results:
        result = tf.strings.reduce_join(num_to_char(result)).numpy().decode("utf-8")
        output_text.append(result)
    return output_text


# A callback class to output a few transcriptions during training
class CallbackEval(keras.callbacks.Callback):
    """Displays a batch of outputs after every epoch."""

    def __init__(self, dataset):
        super().__init__()
        self.dataset = dataset

    def on_epoch_end(self, epoch: int, logs=None):
        predictions = []
        targets = []
        for batch in self.dataset:
            X, y = batch
            batch_predictions = model.predict(X)
            batch_predictions = decode_batch_predictions(batch_predictions)
            predictions.extend(batch_predictions)
            for label in y:
                label = (
                    tf.strings.reduce_join(num_to_char(label)).numpy().decode("utf-8")
                )
                targets.append(label)
        wer_score = wer(targets, predictions)
        print("-" * 100)
        print(f"Word Error Rate: {wer_score:.4f}")
        print("-" * 100)
        for i in np.random.randint(0, len(predictions), 2):
            print(f"Target    : {targets[i]}")
            print(f"Prediction: {predictions[i]}")
            print("-" * 100)
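The greedy search used in `decode_batch_predictions` can be sketched without TensorFlow: take the most probable class in every frame, merge repeats, and drop the blank. This is a simplified illustration, not the `ctc_decode` implementation. It assumes the blank is the last class index, and the two-character `alphabet` is hypothetical.

```python
from itertools import groupby


def greedy_ctc_decode(probs, alphabet):
    """probs[t][k]: probability of class k at frame t; the blank is the last class."""
    blank = len(probs[0]) - 1
    # Most probable class in each frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    # Merge consecutive repeats, then drop blanks.
    merged = [k for k, _ in groupby(best_path)]
    return "".join(alphabet[k] for k in merged if k != blank)


alphabet = ["a", "b"]  # hypothetical two-character vocabulary
probs = [
    [0.7, 0.2, 0.1],  # "a"
    [0.6, 0.3, 0.1],  # "a" again, merged with the previous frame
    [0.1, 0.1, 0.8],  # blank
    [0.2, 0.7, 0.1],  # "b"
]
print(greedy_ctc_decode(probs, alphabet))  # ab
```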


Let's start the training process.

In [None]:
# Define the number of epochs.
epochs = 1
# Callback function to check transcription on the val set.
validation_callback = CallbackEval(validation_dataset)
# Train the model
history = model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=epochs,
    callbacks=[validation_callback],
)


## Inference

In [None]:
# Let's check results on more validation samples
# Print the results: the word error rate (WER) on the validation set, then the target sentence and the predicted sentence in turn.
predictions = []
targets = []
for batch in validation_dataset:
    X, y = batch
    batch_predictions = model.predict(X)
    batch_predictions = decode_batch_predictions(batch_predictions)
    predictions.extend(batch_predictions)
    for label in y:
        label = tf.strings.reduce_join(num_to_char(label)).numpy().decode("utf-8")
        targets.append(label)
wer_score = wer(targets, predictions)
print("-" * 100)
print(f"Word Error Rate: {wer_score:.4f}")
print("-" * 100)
for i in np.random.randint(0, len(predictions), 5):
    print(f"Target    : {targets[i]}")
    print(f"Prediction: {predictions[i]}")
    print("-" * 100)
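The WER reported above is a word-level edit distance divided by the number of words in the reference. The sketch below computes it for a single sentence pair in plain Python. It is a minimal illustration rather than the jiwer implementation, and `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] = edit distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


target = "only and properly replaced"
prediction = "only and proberly replace"
print(word_error_rate(target, prediction))  # 0.5
```

Two of the four reference words are wrong, giving 0.5. A single wrong character makes the whole word count as an error, so WER is a stricter measure than a character-level error rate.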


## Conclusion

In practice, you should train for around 50 epochs or more. Each epoch
takes approximately 5-6 minutes using a `GeForce RTX 2080 Ti` GPU.
The model we trained at 50 epochs has a `Word Error Rate (WER) ≈ 16% to 17%`.

According to the original example, the WER drops to about 16-17% after 50 epochs.
The output below is from around epoch 50.

Some of the transcriptions around epoch 50:

**Audio file: LJ017-0009.wav**
```
- Target    : sir thomas overbury was undoubtedly poisoned by lord rochester in the reign
of james the first
- Prediction: cer thomas overbery was undoubtedly poisoned by lordrochester in the reign
of james the first
```

**Audio file: LJ003-0340.wav**
```
- Target    : the committee does not seem to have yet understood that newgate could be
only and properly replaced
- Prediction: the committee does not seem to have yet understood that newgate could be
only and proberly replace
```

**Audio file: LJ011-0136.wav**
```
- Target    : still no sentence of death was carried out for the offense and in eighteen
thirtytwo
- Prediction: still no sentence of death was carried out for the offense and in eighteen
thirtytwo
```

Example available on HuggingFace.

| Trained Model | Demo |
| :--: | :--: |
| [![Generic badge](https://img.shields.io/badge/🤗%20Model-CTC%20ASR-black.svg)](https://huggingface.co/keras-io/ctc_asr) | [![Generic badge](https://img.shields.io/badge/🤗%20Spaces-CTC%20ASR-black.svg)](https://huggingface.co/spaces/keras-io/ctc_asr) |

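# Extensions

- How should the output units be chosen for Korean speech?<br>
  According to the wiki of sooftware's KoSpeech repository on GitHub, the model is trained, as expected, on every distinct unit that appears in the dataset (2,337 labels; the sample below has one label per syllable block rather than per phoneme).<br> [GitHub wiki page](https://github.com/sooftware/KoSpeech/wiki/Preparation-before-Training)
  <br>
  The sentence '칠십 퍼센트' is output as
  '318 119 0 489 551 156'.
  <br>
  Here 0 is the space character.<br>
  With this many output classes, a sufficiently large dataset is probably needed to reach a low word error rate.
  <br>
  Alternatively, a speech recognition model based on Hybrid CTC-Attention split each syllable into its consonants and vowels and used 49 graphemes (jaso) as output units.

  [Reference paper](https://koreascience.kr/article/CFKO201832073078862.pdf)
  <br><br>
- Deep Speech 2 was published by Baidu and appears to also target Chinese speech. Because Chinese has a romanized phonetic notation (pinyin), the task looks comparatively easy. <br>However, it is worth checking later how syllables with the same pinyin but different tones are handled. <br> -> Deep Speech 2 reportedly reflects the characteristics of Chinese by defining the 6,000 most frequently used characters as its output.<br>
[Deep Speech 2 paper](https://arxiv.org/abs/1512.02595)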
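The idea of splitting Korean syllables into consonants and vowels, mentioned in the first item above, can be sketched with the standard Unicode arithmetic for precomposed Hangul syllables. This only illustrates the decomposition itself. It does not reproduce the 49-grapheme inventory of the referenced paper, and `decompose` is a hypothetical helper.

```python
# Jamo tables in Unicode composition order.
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")          # 19 initial consonants
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")      # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # none + 27 finals

HANGUL_BASE = 0xAC00     # code point of the first syllable block
HANGUL_COUNT = 11172     # 19 * 21 * 28 syllable blocks


def decompose(text):
    """Split each Hangul syllable block into its jamo; keep other characters as-is."""
    out = []
    for ch in text:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < HANGUL_COUNT:
            initial, rest = divmod(offset, 21 * 28)
            medial, final = divmod(rest, 28)
            out.append(CHOSEONG[initial])
            out.append(JUNGSEONG[medial])
            if final:  # final index 0 means "no final consonant"
                out.append(JONGSEONG[final])
        else:
            out.append(ch)
    return out


print(decompose("칠십 퍼센트"))
```

Working at this level shrinks the output layer from thousands of syllable classes to a few dozen symbols, at the cost of longer target sequences.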


