In [37]:
!pip install git+https://github.com/m-bain/whisperx.git --upgrade

Collecting git+https://github.com/m-bain/whisperx.git
  Cloning https://github.com/m-bain/whisperx.git to /tmp/pip-req-build-sgc96xu1
  Running command git clone --filter=blob:none --quiet https://github.com/m-bain/whisperx.git /tmp/pip-req-build-sgc96xu1
  Resolved https://github.com/m-bain/whisperx.git to commit 49e0130e4e0c0d99d60715d76e65a71826a97109
  Preparing metadata (setup.py) ... done

In [38]:
!pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cpu

Looking in indexes: https://download.pytorch.org/whl/cpu

In [39]:
from google.colab import files

uploaded = files.upload()
audio_file = list(uploaded.keys())[0]

Saving audio1.wav to audio1 (1).wav


# **Audio transcription**

In [40]:
import whisperx
import gc
import torch

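device = "cuda" if torch.cuda.is_available() else "cpu" # fall back to CPU when no GPU is attached
audio_file = "audio1.wav"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" if device == "cuda" else "int8" # float16 needs a GPU; "int8" also helps if low on GPU mem (may reduce accuracy)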

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# (drop the reference first, so the collector and the CUDA cache can actually release it)
del model; gc.collect(); torch.cuda.empty_cache()

No language specified, language will be first be detected for each audio file (increases inference time).


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/torch/whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.0.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.0+cu118. Bad things might happen unless you revert torch to 1.x.
Detected language: es (1.00) in first 30s of audio...
[{'text': ' Siempre me ha preocupado enormemente ser entendido y aceptado. Pero he aprendido que agradar o contentar a todo el mundo es imposible. Por más que hagamos, nos esforcemos o justifiquemos, siempre habrá alguien que nos busque y a quien no le guste lo que hacemos.', 'start': 0.009, 'end': 19.292}, {'text': ' En el poder de confiar en ti, comparto de la forma más clara y directa todo lo que he vivido y aprendido. No quiero guardarme nada para mí, porque sé que el cambio es posible. También puedes hacerlo tú, independientemente del lugar en el que te encuentres. Recuerda, las puertas del cambio interno están siempre abiertas para todo el que se decida pasar por ellas.', 'start': 19.
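The raw list printed above is hard to scan. As a minimal sketch, assuming each entry of `result["segments"]` carries `start`, `end`, and `text` keys as in that output, the segments could be rendered one per line with readable timestamps. The helper names `format_timestamp` and `segments_to_lines` are hypothetical and not part of whisperx, and the sample text is shortened from the real segment.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_lines(segments):
    """Return one '[start --> end] text' string per segment."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} --> {end}] {seg['text'].strip()}")
    return lines


# Hand-written sample in the same shape as result["segments"];
# the text is shortened from the real first segment.
sample_segments = [
    {
        "text": " Siempre me ha preocupado enormemente ser entendido y aceptado.",
        "start": 0.009,
        "end": 19.292,
    },
]

for line in segments_to_lines(sample_segments):
    print(line)
```

In the notebook itself, `segments_to_lines(result["segments"])` would take the place of the sample list.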

# **Translation of the transcription**

In [41]:
# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# del model_a; gc.collect(); torch.cuda.empty_cache()

[{'start': 0.59, 'end': 5.736, 'text': ' Siempre me ha preocupado enormemente ser entendido y aceptado.', 'words': [{'word': 'Siempre', 'start': 0.59, 'end': 0.91, 'score': 0.872}, {'word': 'me', 'start': 0.99, 'end': 1.07, 'score': 0.834}, {'word': 'ha', 'start': 1.13, 'end': 1.23, 'score': 0.822}, {'word': 'preocupado', 'start': 1.291, 'end': 1.991, 'score': 0.878}, {'word': 'enormemente', 'start': 2.071, 'end': 2.812, 'score': 0.965}, {'word': 'ser', 'start': 2.832, 'end': 3.113, 'score': 0.837}, {'word': 'entendido', 'start': 3.173, 'end': 3.753, 'score': 0.919}, {'word': 'y', 'start': 3.773, 'end': 3.794, 'score': 0.001}, {'word': 'aceptado.', 'start': 3.954, 'end': 4.594, 'score': 0.954}]}, {'start': 5.736, 'end': 11.142, 'text': 'Pero he aprendido que agradar o contentar a todo el mundo es imposible.', 'words': [{'word': 'Pero', 'start': 5.736, 'end': 5.876, 'score': 0.943}, {'word': 'he', 'start': 5.936, 'end': 6.136, 'score': 0.595}, {'word': 'aprendido', 'start': 6.156, 'end'
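After alignment each word carries a `score`, and the output above contains at least one very low value (`'y'` at 0.001). The sketch below lists such words so they can be reviewed by hand. It assumes the `words` layout printed above; the helper `low_confidence_words` and the 0.5 threshold are illustrative choices, not whisperx features. The `.get` calls guard against words that come back without a score.

```python
def low_confidence_words(segments, threshold=0.5):
    """Collect (word, start, score) for words whose alignment score is below threshold."""
    flagged = []
    for seg in segments:
        for word in seg.get("words", []):
            score = word.get("score")
            # Skip words that carry no score at all.
            if score is not None and score < threshold:
                flagged.append((word["word"], word.get("start"), score))
    return flagged


# Three words copied from the aligned output above, in the same structure.
sample_segments = [
    {
        "start": 0.59,
        "end": 5.736,
        "text": " Siempre me ha preocupado enormemente ser entendido y aceptado.",
        "words": [
            {"word": "entendido", "start": 3.173, "end": 3.753, "score": 0.919},
            {"word": "y", "start": 3.773, "end": 3.794, "score": 0.001},
            {"word": "aceptado.", "start": 3.954, "end": 4.594, "score": 0.954},
        ],
    },
]

print(low_confidence_words(sample_segments))  # [('y', 3.773, 0.001)]
```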

# **Assigning a label to each speaker (diarization)**

In [42]:
token = "<YOUR_HF_TOKEN>" # Hugging Face access token; avoid hard-coding a real token in a shared notebook

# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=token, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
#print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

[{'start': 0.59, 'end': 5.736, 'text': ' Siempre me ha preocupado enormemente ser entendido y aceptado.', 'words': [{'word': 'Siempre', 'start': 0.59, 'end': 0.91, 'score': 0.872, 'speaker': 'SPEAKER_00'}, {'word': 'me', 'start': 0.99, 'end': 1.07, 'score': 0.834, 'speaker': 'SPEAKER_00'}, {'word': 'ha', 'start': 1.13, 'end': 1.23, 'score': 0.822, 'speaker': 'SPEAKER_00'}, {'word': 'preocupado', 'start': 1.291, 'end': 1.991, 'score': 0.878, 'speaker': 'SPEAKER_00'}, {'word': 'enormemente', 'start': 2.071, 'end': 2.812, 'score': 0.965, 'speaker': 'SPEAKER_00'}, {'word': 'ser', 'start': 2.832, 'end': 3.113, 'score': 0.837, 'speaker': 'SPEAKER_00'}, {'word': 'entendido', 'start': 3.173, 'end': 3.753, 'score': 0.919, 'speaker': 'SPEAKER_00'}, {'word': 'y', 'start': 3.773, 'end': 3.794, 'score': 0.001, 'speaker': 'SPEAKER_00'}, {'word': 'aceptado.', 'start': 3.954, 'end': 4.594, 'score': 0.954, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 5.736, 'end': 11.142, 'text': 'Per
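Once segments carry a `speaker` key, consecutive segments from the same speaker can be merged into turns, which reads more like a conversation. This is a sketch under the assumption that the segment layout matches the output above. `speaker_turns` is a hypothetical helper. In the sample, the speaker label of the second entry and the whole third entry are made up for illustration, since the printed output is truncated.

```python
def speaker_turns(segments):
    """Merge consecutive segments that share a speaker into single turns."""
    turns = []
    for seg in segments:
        # A segment may lack a label if diarization found no overlap for it.
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append(
                {"speaker": speaker, "start": seg["start"], "end": seg["end"], "text": text}
            )
    return turns


# Illustrative sample: the first entry mirrors the output above; the second
# entry's speaker label and the entire third entry are assumptions.
sample_segments = [
    {"start": 0.59, "end": 5.736, "speaker": "SPEAKER_00",
     "text": " Siempre me ha preocupado enormemente ser entendido y aceptado."},
    {"start": 5.736, "end": 11.142, "speaker": "SPEAKER_00",
     "text": "Pero he aprendido que agradar o contentar a todo el mundo es imposible."},
    {"start": 11.5, "end": 12.0, "speaker": "SPEAKER_01",
     "text": "(hypothetical reply)"},
]

for turn in speaker_turns(sample_segments):
    print(f"{turn['speaker']} [{turn['start']:.2f}-{turn['end']:.2f}]: {turn['text']}")
```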

In summary, from the final result obtained in the diarization step, the part that should be translated is the `text` field of each segment.

Example:

[{'start': 0.59, 'end': 5.736, **'text': ' Siempre me ha preocupado enormemente ser entendido y aceptado.'**, 'words': [{'word': 'Siempre', 'start': 0.59, 'end': 0.91, 'score': 0.872, 'speaker': 'SPEAKER_00'}, {'word': 'me', 'start': 0.99, 'end': 1.07, 'score': 0.834, 'speaker': 'SPEAKER_00'}, {'word': 'ha', 'start': 1.13, 'end': 1.23, 'score': 0.822, 'speaker': 'SPEAKER_00'}, {'word': 'preocupado', 'start': 1.291, 'end': 1.991, 'score': 0.878, 'speaker': 'SPEAKER_00'}, {'word': 'enormemente', 'start': 2.071, 'end': 2.812, 'score': 0.965, 'speaker': 'SPEAKER_00'}, {'word': 'ser', 'start': 2.832, 'end': 3.113, 'score': 0.837, 'speaker': 'SPEAKER_00'}, {'word': 'entendido', 'start': 3.173, 'end': 3.753, 'score': 0.919, 'speaker': 'SPEAKER_00'}, {'word': 'y', 'start': 3.773, 'end': 3.794, 'score': 0.001, 'speaker': 'SPEAKER_00'}, {'word': 'aceptado.', 'start': 3.954, 'end': 4.594, 'score': 0.954, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 5.736, 'end': 11.142, **'text': 'Pero he aprendido que agradar o contentar a todo el mundo es imposible.'**, ...}]
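A possible shape for that translation step is sketched here: walk the segments, translate only `text`, and store the result under a new key so the timestamps, speaker labels, and original wording stay intact. Everything named in this block is an assumption: `translate_segments`, the `text_translated` key, and especially `placeholder_translate`, which only tags the string and stands in for whatever translation backend ends up being used.

```python
import copy


def translate_segments(segments, translate_fn):
    """Return a deep copy of segments with a translated version of each 'text' field."""
    translated = copy.deepcopy(segments)  # leave the diarization result untouched
    for seg in translated:
        seg["text_translated"] = translate_fn(seg["text"].strip())
    return translated


def placeholder_translate(text: str) -> str:
    """Stand-in translator (hypothetical): replace with a real translation call."""
    return f"[en] {text}"


# First segment from the output above, without its word list for brevity.
sample_segments = [
    {
        "start": 0.59,
        "end": 5.736,
        "text": " Siempre me ha preocupado enormemente ser entendido y aceptado.",
        "speaker": "SPEAKER_00",
    },
]

for seg in translate_segments(sample_segments, placeholder_translate):
    print(seg["speaker"], seg["start"], seg["end"], seg["text_translated"])
```

Only the segment-level `text` is translated in this sketch. The per-word entries stay in the source language, because word timings would not map one-to-one onto a translated sentence.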