<a href="https://colab.research.google.com/github/Dec0XD/Decupagem/blob/main/Trancri%C3%A7%C3%A3o_com_divis%C3%A3o_de_falantes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Usage notes:

- Make sure to [switch to a GPU runtime](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm).
- The transcript is saved under "Files", which you can find in the menu on the left.
- Change the number of speakers below if it is not two.
- Choose a larger model for better accuracy or a smaller one for faster execution ([more information](https://github.com/openai/whisper#available-models-and-languages)).
- If you know the spoken language is Portuguese, set the language to "Portuguese", since that improves performance.

### Process overview:

1. I use OpenAI's Whisper model to split the audio into segments and generate transcriptions.
2. Next, I generate a speaker embedding for each segment.
3. Then I apply agglomerative clustering to the embeddings to identify the speaker of each segment.

Let me know if there is anything I can improve!
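
To make step 3 more concrete, here is a minimal, dependency-free sketch of agglomerative clustering on toy vectors. It is an illustration only: the notebook itself relies on scikit-learn's `AgglomerativeClustering` (which defaults to Ward linkage), whereas this sketch swaps in a naive average-linkage merge with cosine distance. The helper names `cosine_distance` and `agglomerative_labels` and the 3-dimensional `toy_embeddings` are hypothetical stand-ins for the real speaker embeddings.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def agglomerative_labels(vectors, num_clusters):
    """Naive average-linkage agglomerative clustering (illustrative only)."""
    # Start with every segment in its own cluster.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                dist = sum(cosine_distance(vectors[i], vectors[j])
                           for i, j in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        # Merge the closest pair of clusters (a < b, so index a stays valid).
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical embeddings for four segments: 0 and 2 sound alike, as do 1 and 3.
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.9, 0.2, 0.0],
    [0.1, 0.9, 0.0],
]
toy_labels = agglomerative_labels(toy_embeddings, num_clusters=2)
speakers = ["SPEAKER " + str(label + 1) for label in toy_labels]
print(speakers)
```

Segments whose embeddings point in similar directions end up with the same label, which is the property the notebook uses to attribute each Whisper segment to a speaker.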

In [1]:
# Upload the audio file
from google.colab import files
uploaded = files.upload()
path = next(iter(uploaded))

Saving C0088 (online-audio-converter.com).wav to C0088 (online-audio-converter.com).wav


In [2]:
num_speakers = 2 #@param {type:"integer"}

language = 'Portuguese' #@param ['any', 'Portuguese']

model_size = 'large' #@param ['tiny', 'base', 'small', 'medium', 'large']


# Whisper only publishes language-specific checkpoints for English (the '.en'
# suffix); there is no '.pt' variant, so Portuguese uses the multilingual model.
model_name = model_size


# Program

In [None]:
!pip install -q git+https://github.com/openai/whisper.git > /dev/null
!pip install -q git+https://github.com/pyannote/pyannote-audio > /dev/null

import whisper
import datetime

import subprocess

import torch
import pyannote.audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
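    # Fall back to the CPU (much slower) if no GPU runtime is attached
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"))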

from pyannote.audio import Audio
from pyannote.core import Segment

import wave
import contextlib

from sklearn.cluster import AgglomerativeClustering
import numpy as np

In [4]:
# Convert any non-WAV upload to WAV with ffmpeg ('-y' overwrites an existing file)
if not path.lower().endswith('.wav'):
  subprocess.call(['ffmpeg', '-i', path, 'audio.wav', '-y'])
  path = 'audio.wav'

In [None]:
model = whisper.load_model(model_size)

In [6]:
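# Pass the language chosen above; None lets Whisper detect it automatically
result = model.transcribe(path, language=None if language == 'any' else language)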
segments = result["segments"]

In [7]:
with contextlib.closing(wave.open(path,'r')) as f:
  frames = f.getnframes()
  rate = f.getframerate()
  duration = frames / float(rate)
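
The duration computed above is the frame count divided by the sample rate, and it is later used to clamp segment end times. As a small, self-contained sketch of that arithmetic, the snippet below builds a hypothetical 2-second silent clip in memory (16 kHz, mono, 16-bit are assumed values, not taken from the uploaded file) and measures it the same way. The helper name `wav_duration_seconds` is introduced here only for illustration.

```python
import contextlib
import io
import wave

def wav_duration_seconds(source):
    """Duration in seconds = number of frames / sample rate."""
    with contextlib.closing(wave.open(source, "rb")) as f:
        return f.getnframes() / float(f.getframerate())

# Build a hypothetical 2-second, 16 kHz, mono, 16-bit silent clip in memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 2 bytes per sample = 16-bit audio
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 32000)  # 32000 frames of silence
buffer.seek(0)

demo_duration = wav_duration_seconds(buffer)
print(demo_duration)

# A segment end that overshoots the file is clamped back to the duration.
clamped_end = min(demo_duration, 2.4)
print(clamped_end)
```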

In [8]:
audio = Audio()

def segment_embedding(segment):
  start = segment["start"]
  # Whisper overshoots the end timestamp in the last segment
  end = min(duration, segment["end"])
  clip = Segment(start, end)
  waveform, sample_rate = audio.crop(path, clip)

  # Convert waveform to mono by averaging channels
  waveform = torch.mean(waveform, dim=0, keepdim=True)

  return embedding_model(waveform[None])

In [9]:
# One row per segment; the ECAPA speaker embeddings are 192-dimensional
embeddings = np.zeros(shape=(len(segments), 192))
for i, segment in enumerate(segments):
  embeddings[i] = segment_embedding(segment)

embeddings = np.nan_to_num(embeddings)

In [10]:
# Group the segment embeddings into one cluster per expected speaker
clustering = AgglomerativeClustering(n_clusters=num_speakers).fit(embeddings)
labels = clustering.labels_
for i in range(len(segments)):
  segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)

In [11]:
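def format_time(secs):
  # Render a number of seconds as H:MM:SS
  return datetime.timedelta(seconds=round(secs))

# The context manager closes the file even if a write fails
with open("transcript.txt", "w", encoding="utf-8") as f:
  for i, segment in enumerate(segments):
    # Write a speaker header with a timestamp whenever the speaker changes
    if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
      f.write("\n" + segment["speaker"] + ' ' + str(format_time(segment["start"])) + '\n')
    # Whisper segment text usually starts with a space; strip it safely
    f.write(segment["text"].lstrip() + ' ')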
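
The writing loop above merges consecutive segments from the same speaker into a single block headed by the speaker label and a timestamp. A minimal sketch of the same grouping logic, run on made-up segments instead of real Whisper output, might look like this. The function names `format_timestamp` and `group_turns` and the contents of `toy_segments` are hypothetical and exist only for this example.

```python
import datetime

def format_timestamp(secs):
    """Render seconds as H:MM:SS, rounding to the nearest second."""
    return str(datetime.timedelta(seconds=round(secs)))

def group_turns(segments):
    """Merge consecutive segments spoken by the same speaker into one turn."""
    turns = []
    for segment in segments:
        text = segment["text"].strip()
        if turns and turns[-1]["speaker"] == segment["speaker"]:
            # Same speaker as the previous segment: extend the current turn.
            turns[-1]["text"] += " " + text
        else:
            # Speaker changed (or first segment): start a new turn.
            turns.append({
                "speaker": segment["speaker"],
                "start": format_timestamp(segment["start"]),
                "text": text,
            })
    return turns

# Made-up segments in the shape the notebook produces after clustering.
toy_segments = [
    {"speaker": "SPEAKER 1", "start": 0.0, "text": " Hello."},
    {"speaker": "SPEAKER 1", "start": 2.4, "text": " How are you?"},
    {"speaker": "SPEAKER 2", "start": 65.2, "text": " Fine, thanks."},
]
turns = group_turns(toy_segments)
for turn in turns:
    print(turn["speaker"], turn["start"])
    print(turn["text"])
```

Keeping the grouping in a function that returns plain dictionaries makes it easy to check the turn boundaries before anything is written to `transcript.txt`.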