# FeedPulse checkpoint generation

This concept aims to reduce the effort of summarizing a feedback meeting. The initial solution uses speech-to-text to transcribe the meeting, followed by a summarization algorithm that condenses the transcript. The project is divided into two sub-tasks: the speech-to-text project and the summarization project.


## Import

All dependencies are imported at the beginning. The following packages are required:
- pytorch
- librosa
- ipython
- datasets
- transformers

If Conda is used as the general package and environment manager, the following one-liner will install all required dependencies:

```bash
conda install -c conda-forge -c pytorch datasets transformers librosa pytorch ipython
```

In [7]:
import IPython.display
import torch
import gc
import librosa
import numpy as np
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

## Setting up the processor and model

An existing model was used from [jonatasgrosman/wav2vec2-large-xlsr-53-dutch](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-dutch), which is a version of the `facebook/wav2vec2-large-xlsr-53` model fine-tuned on Dutch using the train and validation splits of Common Voice 6.1 and CSS10. It is initialised by loading a pretrained `Wav2Vec2` processor and a `Wav2Vec2` model with a [Connectionist temporal classification](https://en.wikipedia.org/wiki/Connectionist_temporal_classification) (CTC) head. Additionally, a target device can be specified to run the model on either the CPU or the GPU (CUDA).

In [18]:
LANG_ID = "nl"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-dutch"
DEVICE = "cpu"  # cuda or cpu

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).to(DEVICE)
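Instead of hard-coding the device, it can be chosen at runtime. The following is a minimal sketch, assuming that falling back to the CPU is acceptable when no CUDA device is found. The `ImportError` fallback is only there so the snippet also runs in an environment without PyTorch; the notebook itself always has `torch` available.

```python
# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU.
try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    # Assumption for illustration only: without PyTorch, report "cpu".
    cuda_available = False

DEVICE = "cuda" if cuda_available else "cpu"
print(f"Using device: {DEVICE}")
```

The resulting `DEVICE` string can be passed to `.to(DEVICE)` in the same way as the constant above.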

## Testing the model accuracy

The accuracy of the [jonatasgrosman/wav2vec2-large-xlsr-53-dutch](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-dutch) model can be tested using samples from the [Common Voice](https://commonvoice.mozilla.org/en) dataset. Each sample consists of an audio waveform and a reference text, and the reference text is compared to the sentence the model predicts from the waveform. The number of samples is set with the `TEST_SAMPLES` option at the top of the code.

In [9]:
TEST_SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{TEST_SAMPLES}]")


def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch


test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(
    test_dataset["speech"],
    sampling_rate=16_000,
    return_tensors="pt",
    padding=True
)

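with torch.no_grad():
    # Move the inputs to the same device as the model (required when DEVICE is "cuda")
    logits = model(
        inputs.input_values.to(DEVICE),
        attention_mask=inputs.attention_mask.to(DEVICE),
    ).logits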

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

Reusing dataset common_voice (C:\Users\Typically\.cache\huggingface\datasets\common_voice\nl\6.1.0\a1dc74461f6c839bfe1e8cf1262fd4cf24297e3fbd4087a711bd090779023a5e)


  0%|          | 0/10 [00:00<?, ?ex/s]

NoBackendError: 
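Comparing references and predictions by eye does not scale beyond a few samples. A common metric for speech recognition is the word error rate (WER): the word-level edit distance between prediction and reference, divided by the number of reference words. The following is a minimal sketch that is not part of the original notebook; the function name and the two sample sentences are hypothetical.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = prediction.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the reference words processed
    # so far and the first j predicted words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current

    return previous[-1] / len(ref_words)


# Hypothetical example pair: one substituted word out of four.
wer = word_error_rate("WAT WAS JOUW INDRUK", "wat was jou indruk")
print(f"WER: {wer:.2f}")
```

Both strings are lower-cased before comparison, so the upper-cased references produced by `speech_file_to_array_fn` can be compared directly with the model output.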

## Batching

The selected model is quite resource-intensive when processing large files. Therefore, the audio file is split into shorter sections that are easier for the model to handle. Sections are split at silence: any stretch of audio more than `top_db` decibels below the peak level of the recording (librosa's default reference) counts as silence. The `top_db` value can be fine-tuned for each audio file. The function returns the interval of each non-silent section, meaning its start and end sample index in the full audio waveform array.
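The thresholding idea can be illustrated with a simplified, pure-Python sketch. It is not librosa's implementation: it uses non-overlapping frames and a tiny hypothetical signal, whereas `librosa.effects.split` works on overlapping frames, so exact boundaries will differ. The function name and parameters below are assumptions made for illustration.

```python
import math


def split_on_silence(samples, frame_length=4, top_db=30.0):
    """Return (start, end) sample intervals of the non-silent regions."""
    # RMS energy of each non-overlapping frame
    rms = []
    for start in range(0, len(samples), frame_length):
        frame = samples[start:start + frame_length]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))

    peak = max(rms, default=0.0)
    if peak == 0.0:
        return []  # the whole signal is digital silence

    intervals = []
    current_start = None
    for i, value in enumerate(rms):
        # Level of this frame relative to the loudest frame, in dB
        level_db = 20 * math.log10(max(value, 1e-10) / peak)
        is_loud = level_db > -top_db
        if is_loud and current_start is None:
            current_start = i * frame_length          # a section begins
        elif not is_loud and current_start is not None:
            intervals.append((current_start, i * frame_length))  # a section ends
            current_start = None
    if current_start is not None:
        intervals.append((current_start, len(samples)))
    return intervals


# Hypothetical signal: silence, a loud burst, silence, a quieter burst.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8 + [0.25, -0.25] * 2
print(split_on_silence(signal, top_db=30.0))  # both bursts are kept
print(split_on_silence(signal, top_db=3.0))   # the quieter burst now counts as silence
```

Lowering `top_db` makes the split stricter, so quieter speech is treated as silence; raising it keeps more of the recording in each section.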

In [11]:
file_name = "samples/tweede-kamer.wav"
wav, sampling_rate = librosa.load(file_name, sr=16000)

intervals = librosa.effects.split(wav, top_db=32.5)
intervals

array([[      0,   94720],
       [  95232,  225280],
       [ 233472,  235520],
       [ 242176,  282112],
       [ 284672,  315904],
       [ 316416,  344064],
       [ 361984,  390656],
       [ 393216,  472064],
       [ 472576,  514048],
       [ 514560,  529920],
       [ 530432,  540672],
       [ 542720,  710144],
       [ 710656,  725504],
       [ 726528,  739840],
       [ 741376,  770048],
       [ 775680,  888320],
       [ 888832, 1017856],
       [1019392, 1049600],
       [1050112, 1085952],
       [1087488, 1109504],
       [1112576, 1117184],
       [1119744, 1185280],
       [1186816, 1248256],
       [1250304, 1268736],
       [1278976, 1358848],
       [1359872, 1373184],
       [1381376, 1400320],
       [1402880, 1439744],
       [1445888, 1657344],
       [1657856, 1705472]])
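Because the intervals are sample indices, dividing them by the sampling rate gives timestamps in seconds, which helps when checking a section against the original recording. A minimal sketch using the first three intervals from the output above; the helper name `to_seconds` is hypothetical.

```python
SAMPLING_RATE = 16_000  # matches sr=16000 used when loading the file

# First three intervals copied from the output above
first_intervals = [(0, 94720), (95232, 225280), (233472, 235520)]


def to_seconds(interval, sampling_rate=SAMPLING_RATE):
    """Convert a (start, end) pair of sample indices to seconds."""
    start, end = interval
    return start / sampling_rate, end / sampling_rate


for interval in first_intervals:
    start_s, end_s = to_seconds(interval)
    print(f"{start_s:7.3f}s -> {end_s:7.3f}s (duration {end_s - start_s:.3f}s)")
```

The third interval spans only 2048 samples, roughly an eighth of a second, which is likely too short to contain a recognisable word.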

To prepare the batches, each interval is translated into its audio waveform by slicing the waveform array between the start and end index.

In [12]:
batches = [wav[interval[0]:interval[1]] for interval in intervals]
batches

[array([ 1.0174924e-06, -2.8060481e-06,  1.1889879e-06, ...,
        -1.2050091e-05,  2.7856824e-04,  5.9959297e-05], dtype=float32),
 array([-0.00102658, -0.00085669, -0.00143559, ..., -0.00055669,
        -0.00019385, -0.00031544], dtype=float32),
 array([-9.6203701e-04,  3.6868907e-04, -1.3820484e-03, ...,
         8.8280016e-05, -1.5363136e-03, -1.4806953e-03], dtype=float32),
 array([ 0.00029198,  0.00112316,  0.00044602, ..., -0.00016031,
        -0.00016692, -0.00020549], dtype=float32),
 array([ 0.00167839,  0.00015959, -0.00131264, ..., -0.00167724,
        -0.00139994, -0.0011482 ], dtype=float32),
 array([-2.6862298e-03, -9.8501936e-05,  1.7255503e-03, ...,
         7.1416527e-04,  2.9694100e-04,  3.1439849e-05], dtype=float32),
 array([-0.00063951, -0.00061384,  0.00017739, ..., -0.00159753,
        -0.0017196 , -0.00145414], dtype=float32),
 array([-0.00037515, -0.00050763, -0.00089572, ..., -0.00517249,
        -0.00505253, -0.00435706], dtype=float32),
 array([ 2.5484879

An example of a batched section can be listened to below:

In [14]:
IPython.display.Audio(data=batches[4], rate=sampling_rate)

## Processing

After all batches are prepared, they can be processed. For each batch, the processor first converts the waveform into input features, which the model uses to compute the logits. The most likely token for each audio frame is then selected, and the processor decodes the predicted token IDs into text.

This process is repeated until all batches are processed. All results are combined into a transcript list, which contains the recognised text.
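The decoding step described above can be illustrated in isolation. With CTC, the model emits one token per audio frame, including a special blank token; greedy decoding takes the highest-scoring token per frame, collapses consecutive repeats, and drops the blanks. The sketch below uses a toy vocabulary and hand-written scores, both of which are assumptions for illustration and do not reflect the real model's vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, vocabulary, blank_id=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    # Highest-scoring token index for every frame (the argmax step)
    best_ids = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Collapse runs of the same token into a single occurrence
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # Remove the blank token and map the remaining IDs to characters
    return "".join(vocabulary[t] for t in collapsed if t != blank_id)


# Toy vocabulary: index 0 is the CTC blank (hypothetical layout).
vocabulary = ["_", "h", "a", "l", "o"]

# One-hot scores for the frame sequence: h h _ a l l _ l o o
frame_ids = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
frame_scores = [
    [1.0 if i == token_id else 0.0 for i in range(len(vocabulary))]
    for token_id in frame_ids
]

print(ctc_greedy_decode(frame_scores, vocabulary))
```

The blank between the two runs of `l` is what allows the double letter in "hallo" to survive the collapsing step.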

In [17]:
gc.collect()
torch.cuda.empty_cache()


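def process_batch(batch):
    # Convert the raw waveform into model inputs on the target device
    features = processor(batch, sampling_rate=16_000, return_tensors="pt", padding=True).to(DEVICE)

    # Inference only: skip gradient tracking to reduce memory use
    with torch.no_grad():
        logits = model(features.input_values).logits

    # Greedy decoding: most likely token per frame, then token IDs to text
    predicted_ids = torch.argmax(logits, dim=-1)

    return processor.batch_decode(predicted_ids)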


transcript = []

for i, batch in enumerate(batches):
    print(f"Processing batch #{i}/{len(batches)}")
    transcript.extend(process_batch(batch))

transcript

Processing batch #0/30
Processing batch #1/30
Processing batch #2/30
Processing batch #3/30
Processing batch #4/30
Processing batch #5/30
Processing batch #6/30
Processing batch #7/30
Processing batch #8/30
Processing batch #9/30
Processing batch #10/30
Processing batch #11/30
Processing batch #12/30
Processing batch #13/30
Processing batch #14/30
Processing batch #15/30
Processing batch #16/30
Processing batch #17/30
Processing batch #18/30
Processing batch #19/30
Processing batch #20/30
Processing batch #21/30
Processing batch #22/30
Processing batch #23/30
Processing batch #24/30
Processing batch #25/30
Processing batch #26/30
Processing batch #27/30
Processing batch #28/30
Processing batch #29/30


['wat en dat had natuurlijk alles te maken met de oekraïne en primier rutte minister van devensie-olongen en mijnister van buitenlads zake hoestra',
 'waar een daarbij aanwezig lauwes boven hes politiek verslaggefer een onder andere maker van de pootkas de binnenkamer hij volgde het debat voor ons mijnheer boven van boven groeinavond',
 '',
 'een goeinavond maralo hadden werloop en',
 'jaeerst evs de algemeenheid een',
 'wat was jouw indruk van dit debat',
 'nou ja er s ist de ittimt',
 'het begon dat is op zich ook al uitzonderlijke het begon met een toespraak van mark roete zelf',
 'vanwachtsproken begint een kamerdebat altijd eerst met',
 'e in dit geval',
 'twintig',
 'fractie ze zijn het tegenwoordig in de tweedekamer en om de zwaarten van hete moment toc een beetje te markeeren en  ateerst vira bergkampen een toespraak waaraan cajr solidariteit uit',
 'sprak voor',
 'oekraïne',
 'en vervolgens dit marcrutte dat',
 "en ja die fond toch wel weer  ja  woorden en het het mooie het of