# Welcome to the Diari-scription script !
This Python script transcribes any audio file that contains speech. Please prefer the .mp3 or .wav file formats.
The script also recognizes who is speaking, an operation called "diarization".
This script relies on Whisper, a general-purpose speech recognition model developed by [OpenAI](https://openai.com/research/whisper). The original paper (Radford et al., 2022) can be found [here](https://arxiv.org/abs/2212.04356).

HuggingFace is a nice resource, and the transcription part of the code comes from [there](https://huggingface.co/openai/whisper-large-v3) !
The diarization part of the code comes from Max Bain on GitHub and his [WhisperX](https://github.com/m-bain/whisperX) code !


## A quick requirement
Before anything else, please visit the HuggingFace website to [request a token](https://huggingface.co/settings/tokens). Copy and paste it into the **HuggingFaceToken** form field below, without adding any quotation marks (" " or ' ') ! Don't worry, you only have to request the token once, unless you reset it.

After this first step is completed, please select a **GPU** backend in Google Colab (usually in the top-right corner). This script is meant to be run on a GPU, not a CPU.
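
If you want to double-check that a GPU backend is really active before running everything, here is a minimal sketch. It assumes the `nvidia-smi` tool is available on GPU runtimes (it ships with the NVIDIA driver); the helper name `gpu_available` is only an illustration and is not part of the script.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi reports at least one GPU (hypothetical helper)."""
    # nvidia-smi ships with the NVIDIA driver; if it is missing, assume there is no GPU
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # "nvidia-smi -L" lists the detected GPUs, one per line
        listing = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return listing.returncode == 0 and "GPU" in listing.stdout


if not gpu_available():
    print("No GPU detected: please select a GPU runtime before running the script.")
```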


If you want to connect your personal Google Drive account to Colab, click on the folder icon in the left-side bar, then click on the folder icon that shows the Google Drive triangle, called "Mount Drive" *("Installer Drive")*.
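
If you prefer code to clicking, mounting Drive can probably be done from a cell as well. This is a sketch, assuming the script runs inside Colab, where the `google.colab` package is available; the wrapper `mount_drive` and its default mount point are illustrative choices, not part of the script.

```python
def mount_drive(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive when running inside Colab; return False elsewhere."""
    try:
        # google.colab only exists inside a Colab runtime
        from google.colab import drive
    except ImportError:
        return False
    # Colab asks for authorization the first time this is called
    drive.mount(mount_point)
    return True
```

After a successful `mount_drive()`, your files should appear under `/content/drive/MyDrive/`, which is the prefix to use in the audio file path.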

## Files management
To load your audio file, click on the folder icon in the left-side panel, then click on the first icon, called "Upload to session storage" *("Aucun fichier selectionné")*.

The transcription of the audio file is exported into a .txt file. This text file will have the **same** name as the audio file you chose as input.

**Be careful**: if you run this script multiple times with the same input audio file, it will overwrite any previous transcription file with the same name.
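
To see which file would be written (and whether it already exists) before launching anything, a small sketch like this one can help. It assumes the transcript simply takes the audio file's path with a `.txt` extension; `transcript_path` and `will_overwrite` are hypothetical helper names.

```python
from pathlib import Path


def transcript_path(audio_path: str) -> Path:
    """Return the .txt path that sits next to the audio file."""
    # with_suffix only swaps the final extension, so other dots in the path are kept
    return Path(audio_path).with_suffix(".txt")


def will_overwrite(audio_path: str) -> bool:
    """Return True if a transcript with that name already exists."""
    return transcript_path(audio_path).exists()
```

For example, `will_overwrite("/content/interview.mp3")` tells you whether `/content/interview.txt` is already there, so you can rename or download the old transcript first.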

## Launching the script
You just have to fill out the forms below and go to Runtime -> Run all *(Exécution -> Tout exécuter)* and then wait a few minutes :)

In [None]:
# @title Initialisation script { display-mode: "form" }

#initialization and imports
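# pip equivalent of the conda command (the "-c pytorch -c nvidia" channels are conda-only options)
!pip install torch==2.0.0 torchaudio==2.0.0 --index-url https://download.pytorch.org/whl/cu118 &> /dev/null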
!pip install git+https://github.com/m-bain/whisperx.git &> /dev/null
!sudo apt update &> /dev/null
!sudo apt install ffmpeg &> /dev/null

import whisperx
import gc

#cuda is for running the inference on the Google Colab's GPU
device = "cuda"
batch_size = 16 # reduce if low on GPU mem
compute_type = "int8" # "int8" saves GPU memory (may reduce accuracy). Another option is "float16" if you have enough GPU mem

HuggingFaceToken = "" # @param {type:"string"}

This is the part you have to interact with when you want to transcribe an interview !

You're just a few steps away from your transcript !
1. Enter the path to your audio file in the **AudioFilePath** field (*the easiest way is to host your file on your personal Google Drive account*).
2. Select the task type (transcription, or translation to English) in the **TaskType** dropdown menu.
3. Optionally, you can choose the language spoken in the interview in the **LanguageSpokenShort** field. If left blank, the model will detect the most probable one by itself ! Please use the abbreviation: fr for French, en for English, etc. A list of abbreviations can be found [here](https://www.loc.gov/standards/iso639-2/php/code_list.php).
4. Choose the minimum and maximum number of speakers in your interview. These numbers can be equal (if you know how many people are speaking), but MaxNumberSpeakers should **always be >= MinNumberSpeakers** !!!
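
The rules in the steps above can be checked before launching the transcription. Here is a minimal sketch; `check_form` is a hypothetical helper, and the rule that a language abbreviation has two or three letters is an assumption made for illustration.

```python
def check_form(audio_path, language, min_speakers, max_speakers):
    """Return a list of problems found in the form; an empty list means it looks fine."""
    problems = []
    # Keep only the last part of the path to look for an extension
    file_name = audio_path.strip().rsplit("/", 1)[-1]
    if not file_name:
        problems.append("AudioFilePath is empty or points to a folder.")
    elif "." not in file_name:
        problems.append("AudioFilePath has no extension (.mp3, .wav, ...).")
    # A blank language is allowed: the model then detects it by itself
    code = language.strip()
    if code and not (code.isalpha() and len(code) in (2, 3)):
        problems.append("LanguageSpokenShort should be an abbreviation such as 'fr' or 'en'.")
    if max_speakers < min_speakers:
        problems.append("MaxNumberSpeakers must be >= MinNumberSpeakers.")
    return problems


for problem in check_form("/content/interview.mp3", "fr", 1, 2):
    print(problem)
```

With the form values of the cell below, the call would be `check_form(AudioFilePath, LanguageSpokenShort, MinNumberSpeakers, MaxNumberSpeakers)`.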


In [None]:
# @title Fill out the form ! { display-mode: "form" }
AudioFilePath = "" # @param {type:"string"}
TaskType = "transcribe" # @param ["transcribe", "translate"]
LanguageSpokenShort = "" # @param {type:"string"}
MinNumberSpeakers = 1 # @param {type:"slider", min:1, max:10, step:1}
MaxNumberSpeakers = 1 # @param {type:"slider", min:1, max:10, step:1}
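#the path to the audio file of your interview.
#DON'T FORGET THE EXTENSION !
audio_file = AudioFilePath
task_type = TaskType #the task type can also be "translate", to translate INTO ENGLISH
#please use the abbreviation (fr, en, ...); a blank field becomes None so that the model detects the language by itself
language_spoken = LanguageSpokenShort.strip() or None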

This is the transcription part, where the audio is turned into text. But the speakers are not identified yet !

In [None]:
# @title Transcription { display-mode: "form" }
# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v3", device, compute_type=compute_type, task=task_type, language=language_spoken)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
#print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import torch; del model; gc.collect(); torch.cuda.empty_cache()

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

#print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import torch; del model_a; gc.collect(); torch.cuda.empty_cache()

This is the part where the speakers get identified and the text they spoke is assigned to them !
This step is called '**diarization**'.
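
To give an intuition of how text can be assigned to speakers, here is a simplified sketch: each transcript segment gets the speaker whose speaking turns overlap it the most in time. This is only an illustration on toy data, not WhisperX's actual implementation; the function names, the dictionary keys and the `SPEAKER_00` style labels are assumptions.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turns overlap it the most."""
    labelled = []
    for seg in segments:
        totals = {}  # speaker -> total overlapping time with this segment
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + shared
        new_seg = dict(seg)
        # A segment that overlaps no turn keeps no 'speaker' key
        if totals:
            new_seg["speaker"] = max(totals, key=totals.get)
        labelled.append(new_seg)
    return labelled


toy_segments = [
    {"start": 0.0, "end": 2.0, "text": "Hello."},
    {"start": 2.0, "end": 5.0, "text": "Hi, thanks for having me."},
    {"start": 9.0, "end": 10.0, "text": "..."},
]
toy_turns = [
    {"start": 0.0, "end": 2.2, "speaker": "SPEAKER_00"},
    {"start": 2.2, "end": 6.0, "speaker": "SPEAKER_01"},
]
toy_labelled = assign_speakers(toy_segments, toy_turns)
```

In this toy example, the last segment overlaps no speaking turn, so it ends up without a speaker label.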

In [None]:
# @title Speaker identification { display-mode: "form" }
# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HuggingFaceToken, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio, min_speakers=MinNumberSpeakers, max_speakers=MaxNumberSpeakers)

result = whisperx.assign_word_speakers(diarize_segments, result, fill_nearest=False)
#print(diarize_segments)
segments = result["segments"]

This is the part where it all comes together, transcription and diarization. The result is formatted and printed into a file that has the **same name** as the audio input.
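
If you would like to preview the formatted text before it is written to the file, the grouping logic can be sketched as a small function that returns a string. It assumes each segment is a dictionary with a `text` key and an optional `speaker` key; `format_transcript` and the `UNKNOWN` label are hypothetical.

```python
def format_transcript(segments):
    """Group consecutive segments by speaker and return the transcript as one string."""
    blocks = []  # list of [speaker, list of texts]
    for seg in segments:
        # A segment without a label stays with the current speaker
        speaker = seg.get("speaker", blocks[-1][0] if blocks else "UNKNOWN")
        if not blocks or speaker != blocks[-1][0]:
            blocks.append([speaker, []])
        blocks[-1][1].append(seg["text"].strip())
    return "\n\n".join(speaker + "\n" + " ".join(texts) for speaker, texts in blocks)


toy_result = format_transcript([
    {"speaker": "SPEAKER_00", "text": " Hello."},
    {"text": " How are you?"},
    {"speaker": "SPEAKER_01", "text": " Fine."},
    {"speaker": "SPEAKER_01", "text": " Thanks."},
])
print(toy_result)
```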

In [None]:
# @title Where it all comes together { display-mode: "form" }
import os

#reuse the audio file name and turn it into a .txt file name
#splitext only removes the final extension, so other dots in the path are safe
file_name = os.path.splitext(audio_file)[0] + '.txt'
#print(file_name)

#open a .txt file to write the transcription into
#'w' (write) mode rewrites the whole file on every run
#replace 'w' with 'a' to open it in 'append' mode and add to an existing file instead
file1 = open(file_name, "w", encoding='utf-8')
#we loop through the list of dictionaries 'segments' to separate the speaker changes
file1.write("RETRANSCRIPTION \n")
#print(segments[0]['speaker'])

#variable to store the previous speaker
speaker_prev = 'no'
#let's run through the list of dictionaries and extract the speaker and the text
for i in range(len(segments)):
    #first, we check if the 'speaker' key is present in the dictionary
    #sometimes it is not, for example with punctuation
    if 'speaker' in segments[i]:
        #we want to change the speaker only if it changed compared to the previous sentence
        if segments[i]['speaker'] != speaker_prev:
            #print('\n', segments[i]['speaker'])
            #file1.write(' \n')
            file1.write('\n' + '\n' + segments[i]['speaker'] + '\n')
            #file1.write(' \n')
            speaker_prev = segments[i]['speaker']
        #print(segments[i]['text'], end=' ')
    #in every case, we write the text, with a space at the end
    file1.write(segments[i]['text'] + ' ')

#And don't forget to close the file when leaving...
file1.close()
print('Done ! Go check your transcription file !')