<a href="https://colab.research.google.com/github/Jared-Steven/YouTube-Video-Transcriber/blob/main/YouTube_Video_Transcriber_With_an_AI_Speech_To_Text_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing The Libraries Needed For The Task




In [None]:
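# OpenAI's Whisper is published on PyPI as "openai-whisper"; the "whisper" package is an unrelated project
!pip install openai-whisper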
# moviepy 2.x removed the moviepy.editor module that this notebook imports
!pip install "moviepy<2"
!pip install pydub
!pip install transformers
!pip install gradio==3.43.1
!pip install pytube
!pip install noisereduce
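# pydub needs the ffmpeg binary, which comes from the system package manager rather than from PyPI
!apt-get -qq install -y ffmpeg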

# Installing The Hugging Face Transformers Library For Transcribing The Audio From The Video

In [None]:
# The extras specifier is quoted so the shell does not treat the brackets as a glob pattern
!pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate "datasets[audio]"

# Import The Needed Libraries

In [None]:
from pytube import YouTube
from moviepy.editor import VideoFileClip
import noisereduce as nr
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from pydub import AudioSegment
import os
import warnings
import gradio as gr
warnings.filterwarnings('ignore')

# Downloading The YouTube Video, Extracting Its Audio, And Saving It To The Local Drive

In [None]:
# Download the YouTube video
def download_video(video_url, output_path):
    yt = YouTube(video_url)
    stream = yt.streams.filter(progressive=True, file_extension='mp4').order_by('resolution').desc().first()
    stream.download(output_path)
    return stream.default_filename

# Extract audio from the video
def extract_audio(video_file, output_path):
    video = VideoFileClip(video_file)
    audio = video.audio
    audio.write_audiofile(f"{output_path}/audio.mp3")
    audio.close()
    video.close()

# Main function
def main():
    video_url = "https://www.youtube.com/watch?v=Sby1uJ_NFIY"
    output_path = "/content"

    video_file = download_video(video_url, output_path)
    extract_audio(f"{output_path}/{video_file}", output_path)

if __name__ == "__main__":
    main()

# Building A Speech Recognition Pipeline Using A Pre-Trained Model From Hugging Face

Step 1: Device and Data Type Selection

    Device Selection (device):
        Check if a CUDA-enabled GPU is available using torch.cuda.is_available().
        If a GPU is available, device is set to "cuda:0", which means the first GPU device.
        If no GPU is available, device is set to "cpu", meaning the code will run on the CPU.

    Data Type Selection (torch_dtype):
        If a GPU is available, torch_dtype is set to torch.float16 for half-precision floating-point, which is faster and uses less memory on GPUs.
        If no GPU is available, torch_dtype is set to torch.float32, which is the default precision for CPUs.

Step 2: Model Loading

    Model Identifier (model_id):
        The identifier for the pre-trained model is set to "washeed/audio-transcribe". This string is typically the name or path of the model in a model repository like Hugging Face's Model Hub.

    Loading the Model:
        AutoModelForSpeechSeq2Seq.from_pretrained is a method from the Hugging Face Transformers library. It loads a pre-trained sequence-to-sequence model for speech recognition.
        torch_dtype=torch_dtype specifies the data type to use for the model's tensors (either float16 or float32).
        low_cpu_mem_usage=True minimizes the CPU memory usage during model loading.
        use_safetensors=True uses a safer serialization format for model weights, improving security.
        model.to(device) moves the model to the specified device ("cuda:0" or "cpu").

Step 3: Processor Initialization

    AutoProcessor.from_pretrained(model_id) loads a pre-trained processor associated with the model. The processor typically includes components like tokenizers and feature extractors necessary for preparing input data for the model.

Step 4: Pipeline Creation

    pipeline is a high-level API from the Hugging Face Transformers library that simplifies the process of using models for specific tasks.
    Parameters:
        "automatic-speech-recognition": Specifies the type of task the pipeline will perform.
        model=model: Uses the previously loaded model.
        tokenizer=processor.tokenizer: Uses the tokenizer from the processor to decode the model's predicted token IDs into text.
        feature_extractor=processor.feature_extractor: Uses the feature extractor from the processor to process audio input.
        max_new_tokens=128: Sets the maximum number of new tokens the model will generate.
        chunk_length_s=14: Sets the chunk length of the audio in seconds. The pipeline processes audio in chunks of 14 seconds.
        batch_size=16: Sets the batch size for processing the audio data.
        return_timestamps=True: Configures the pipeline to return timestamps for the recognized speech.
        torch_dtype=torch_dtype: Specifies the data type for the pipeline (either float16 or float32).
        device=device: Specifies the device to run the pipeline on ("cuda:0" or "cpu").
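
To make the half-precision note in Step 1 concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone occupy at each precision. The parameter count is a hypothetical round number chosen for illustration, not the size of the model used in this notebook, and `weight_memory_gib` is a made-up helper name. Real usage is higher, because activations and buffers also take memory.

```python
# Bytes used to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory taken by the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

# Hypothetical parameter count, for illustration only
num_params = 1_000_000_000

for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

Whatever the parameter count, the float16 figure is half the float32 figure, which is why half precision is preferred when a GPU is available.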

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "washeed/audio-transcribe"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=14,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)


# In Summary

The above code sets up a speech recognition pipeline using a pre-trained model from Hugging Face.

It selects the appropriate device and data type based on GPU availability, loads the model and processor, and configures the pipeline for automatic speech recognition with specified parameters for chunking, batch size, and output format.
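
With return_timestamps=True, the pipeline's result is a dictionary holding the full "text" plus a "chunks" list whose entries each carry a "timestamp" pair (in seconds) and the matching "text". The sketch below walks such a result. `sample_result` is a hand-written stand-in with made-up text, not output from the model, and `format_segments` is a hypothetical helper name.

```python
def format_segments(result):
    """Turn a pipeline-style result into 'start-end: text' lines."""
    lines = []
    for segment in result.get("chunks", []):
        start, end = segment["timestamp"]
        # The end time may be missing (None) for the last segment
        end_label = f"{end:.1f}s" if end is not None else "end"
        lines.append(f"{start:.1f}s-{end_label}: {segment['text'].strip()}")
    return lines

# Hand-written stand-in for a pipeline result (made-up text)
sample_result = {
    "text": " Hello there. Welcome to the video.",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Hello there."},
        {"timestamp": (1.5, None), "text": " Welcome to the video."},
    ],
}

for line in format_segments(sample_result):
    print(line)
```

In the notebook, the same helper could be applied to the dictionary returned by `pipe(...)` instead of the stand-in.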

# Splitting The Audio Into 14-Second Chunks For Accurate Transcription

In [None]:
# Replace with the path to your input mp3 file
input_file = "/content/audio.mp3"

# Replace with the desired output folder path
output_folder = "/content/drive/MyDrive/Audio Chunks"

# Load the input mp3 file
audio = AudioSegment.from_mp3(input_file)

# Calculate the total duration of the audio in milliseconds
total_duration = len(audio)

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Split the audio into 14-second chunks
for i, start_time in enumerate(range(0, total_duration, 14000)):
    end_time = start_time + 14000
    chunk = audio[start_time:end_time]

    # Save the chunk as a new mp3 file
    output_file = os.path.join(output_folder, f"chunk_{i}.mp3")
    chunk.export(output_file, format="mp3")
    print(f"Exported chunk {i} to {output_file}")

# Checking if the number of files in the folder is accurate

In [None]:
# Get the list of all files and directories
directory_path = '/content/drive/MyDrive/Audio Chunks'
files_and_dirs = os.listdir(directory_path)

# Filter out directories, keep only files
files = [f for f in files_and_dirs if os.path.isfile(os.path.join(directory_path, f))]

print("No of files in the folder:", len(files))
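
The file count can be compared against the number of chunks the split should have produced, which is the audio duration divided by the chunk length, rounded up. A minimal sketch, with `expected_chunk_count` as a hypothetical helper and the duration chosen only as an example:

```python
import math

def expected_chunk_count(total_duration_ms: int, chunk_ms: int = 14_000) -> int:
    """Number of chunks produced by stepping through the audio in chunk_ms steps."""
    return math.ceil(total_duration_ms / chunk_ms)

# A 60-second clip yields four full 14 s chunks plus one 4 s remainder
print(expected_chunk_count(60_000))  # 5
```

In the notebook this could be checked as `len(files) == expected_chunk_count(len(audio))`, assuming the folder holds only chunks from the current run.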

# Transcribing the audio chunks and mapping each chunk to its transcript in a JSON format

In [None]:
# Function to convert milliseconds to a formatted string (MM:SS)
def format_time(milliseconds):
    seconds = (milliseconds // 1000) % 60
    # Total minutes are not wrapped at 60, so audio longer than an hour is still reported correctly
    minutes = milliseconds // (1000 * 60)
    return f"{minutes:02}:{seconds:02}"

# Load the mp3 file
audio = AudioSegment.from_mp3("/content/audio.mp3")

# Get the total duration of the audio in milliseconds
total_duration = len(audio)
print(f"Total duration: {format_time(total_duration)}")

# Segment the audio into 14-second chunks
chunk_duration = 14 * 1000  # 14 seconds in milliseconds
time_stamp = []

for start_time in range(0, total_duration, chunk_duration):
    end_time = min(start_time + chunk_duration, total_duration)
    time_stamp.append((start_time, end_time))

chunks = []

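# Iterate over the computed time windows so every chunk index has a matching timestamp
for i, (start_time, end_time) in enumerate(time_stamp):
    chunk_path = os.path.join("/content/drive/MyDrive/Audio Chunks", f"chunk_{i}.mp3")
    result = pipe(chunk_path, generate_kwargs={"task": "transcribe"})
    dictio = {
        "chunk_id": i,
        # Actual chunk length in seconds; the last chunk is usually shorter than 14 s
        "chunk_length": (end_time - start_time) / 1000,
        "text": result["text"],
        "start_time": format_time(start_time),
        "end_time": format_time(end_time)
    }
    chunks.append(dictio)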

print(' ')
print('output: ')
print(chunks)
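
The list above is printed as Python objects. To store it as actual JSON, the standard `json` module can write it to disk. A minimal sketch, assuming a list shaped like `chunks`; the sample entries, the `save_chunks` helper, and the output file name `transcript.json` are placeholders:

```python
import json

# Placeholder entries shaped like the `chunks` list built above
sample_chunks = [
    {"chunk_id": 0, "chunk_length": 14.0, "text": "first chunk text",
     "start_time": "00:00", "end_time": "00:14"},
    {"chunk_id": 1, "chunk_length": 6.0, "text": "second chunk text",
     "start_time": "00:14", "end_time": "00:20"},
]

def save_chunks(chunks, path):
    """Write the chunk list to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-English transcripts readable
        json.dump(chunks, f, ensure_ascii=False, indent=2)

save_chunks(sample_chunks, "transcript.json")
```

Calling `save_chunks(chunks, ...)` with the real list would produce a file that other tools can load with `json.load`.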