<a href="https://colab.research.google.com/github/LifeHackInnovationsLLC/whisper-video-transcription/blob/main/LHI_WhisperVideoDrive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [10]:
# LHI_WhisperVideoDrive.py

In [11]:
# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py:percent
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.16.5
#   kernelspec:
#     display_name: Python 3
#     name: python3
# ---

<a href="https://colab.research.google.com/github/LifeHackInnovationsLLC/whisper-video-transcription/blob/main/LHI_WhisperVideoDrive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Jupytext Initialization (Sync Logic)
Ensure Jupytext is installed and the notebook is paired with the `.py` file.

import subprocess
import sys

def ensure_module(module_name, install_name=None):
    """Install a module if it's not already installed."""
    try:
        __import__(module_name)
        print(f"Module '{module_name}' is already installed.")
    except ImportError:
        install_name = install_name or module_name
        print(f"Module '{module_name}' not found. Installing...")
        subprocess.run([sys.executable, "-m", "pip", "install", install_name], check=True)

Ensure Jupytext is installed
ensure_module("jupytext")

Sync the notebook with its paired `.py` file
try:
    subprocess.run(["jupytext", "--sync", "LHI_WhisperVideoDrive.ipynb"], check=True)
    print("Jupytext synchronization successful.")
except subprocess.CalledProcessError as e:
    print(f"Error during Jupytext synchronization: {e}")

In [12]:
# Handle missing modules and Google Colab environment checks

import subprocess
import sys


# Install and import required modules
required_modules = {
    "google.colab": "google-colab",
    "whisper": "openai-whisper",
    "librosa": "librosa",
    "soundfile": "soundfile",
    "colorama": "colorama",
    "google-api-python-client": "google-api-python-client",
    "google-auth-httplib2": "google-auth-httplib2",
    "google-auth-oauthlib": "google-auth-oauthlib"
}


for module, install_name in required_modules.items():
    try:
        __import__(module)
        print(f"Module '{module}' is already installed.")
    except ImportError:
        print(f"Module '{module}' not found. Installing...")
        subprocess.run([sys.executable, "-m", "pip", "install", install_name], check=True)

# Conditional import for Google Colab
try:
    from google.colab import drive
    print("Google Colab environment detected.")
except ImportError:
    print("Google Colab environment not detected. Skipping Colab imports.")

# Import other required modules
import whisper
import librosa
import soundfile as sf



Module 'google.colab' is already installed.
Module 'whisper' is already installed.
Module 'librosa' is already installed.
Module 'soundfile' is already installed.
Module 'colorama' is already installed.
Module 'google-api-python-client' not found. Installing...
Module 'google-auth-httplib2' not found. Installing...
Module 'google-auth-oauthlib' not found. Installing...
Google Colab environment detected.



#📼 OpenAI Whisper + Google Drive Video Transcription

📺 Getting started video: https://youtu.be/YGpYinji7II

###This application will extract audio from all the video files in a Google Drive folder and create a high-quality transcription with OpenAI's Whisper automatic speech recognition system.

*Note: This requires giving the application permission to connect to your drive. Only you will have access to the contents of your drive, but please read the warnings carefully.*

This notebook application:
1. Connects to your Google Drive when you give it permission.
2. Creates a WhisperVideo folder and three subfolders (ProcessedVideo, AudioFiles and TextFiles.)
3. When you run the application it will search for all the video files (.mp4, .mov, mkv and .avi) in your WhisperVideo folder, transcribe them and then move the file to WhisperVideo/ProcessedVideo and save the transcripts to WhisperVideo/TextFiles. It will also add a copy of the new audio file to WhisperVideo/AudioFiles

###**For faster performance set your runtime to "GPU"**
*Click on "Runtime" in the menu and click "Change runtime type". Select "GPU".*


**Note: If you add a new file after running this application you'll need to remount the drive in step 1 to make them searchable**

##0. Choose which 'LHI Client' or folder to add transcriptions to

In [13]:
import os
import subprocess
import sys
from colorama import Fore, Style, init
from google.colab import drive
from google.colab import auth
from googleapiclient.discovery import build
from tabulate import tabulate


init(autoreset=True)

# Global registry
registry_entries = []

def add_to_registry(entry_type, name, path, entity_id=None, is_file=False):
    """Add or update an entity in the registry."""
    url = None
    if entity_id:
        if is_file:
            url = f"https://drive.google.com/file/d/{entity_id}/view"
        else:
            url = f"https://drive.google.com/drive/folders/{entity_id}"

    # Update or add
    for e in registry_entries:
        if e["path"] == path:
            e["type"] = entry_type
            e["name"] = name
            e["id"] = entity_id
            e["url"] = url if url else e["url"]
            return

    registry_entries.append({
        "type": entry_type,
        "name": name,
        "path": path,
        "id": entity_id,
        "url": url
    })

def print_registry_table():
    """Print a table of all registered entries."""
    headers = ["Type", "Name", "Path", "ID", "URL"]
    table_data = []
    for e in registry_entries:
        table_data.append([
            e["type"],
            e["name"],
            e["path"],
            e["id"] if e["id"] else "-",
            e["url"] if e["url"] else "-"
        ])
    print(Fore.CYAN + "=== REGISTRY TABLE ===")
    print(tabulate(table_data, headers=headers, tablefmt="fancy_grid"))

def check_and_mount_drive():
    print("Checking /content/drive status...")
    if os.path.exists("/content/drive"):
        print("Mount directory exists. Checking contents...")
        if os.listdir("/content/drive"):
            print("Mountpoint already contains files. Attempting to unmount...")
            print("Unmounted successfully or already unmounted.")

    # Mount Google Drive
    print("Mounting Google Drive...")
    drive.mount("/content/drive", force_remount=True)
    print("Google Drive mounted successfully.")

    # Verify mount
    if os.path.exists("/content/drive/MyDrive"):
        print("Drive is mounted and ready.")
        return True
    else:
        print("Mounting seems incomplete. Please check your drive configuration.")
        return False

def initialize_drive_api():
    """
    Initialize Google Drive API using OAuth user credentials.
    This will prompt for user authentication.
    """
    print(Fore.CYAN + "Initializing Google Drive API using OAuth (User Credentials)...")
    try:
        auth.authenticate_user()  # This will prompt you to authorize the app
        service = build("drive", "v3")
        print(Fore.GREEN + "Google Drive API service initialized successfully as the user.")
        return service
    except Exception as e:
        print(Fore.RED + f"Failed to initialize Google Drive API: {e}")
        return None

drive_service = initialize_drive_api()


def get_file_id(file_name, folder_id):
    """
    Retrieve the file ID for a given file name in a specific folder on Google Drive.
    """
    try:
        results = drive_service.files().list(
            q=f"name='{file_name}' and '{folder_id}' in parents",
            spaces="drive",
            fields="files(id, name)",
            pageSize=1
        ).execute()
        items = results.get("files", [])
        if items:
            return items[0]["id"]
        else:
            print(Fore.YELLOW + f"File '{file_name}' not found in folder {folder_id}.")
            return None
    except Exception as e:
        print(Fore.RED + f"Error retrieving file ID for '{file_name}': {e}")
        return None


def get_or_create_folder(drive_service, folder_name, parent_id):
    """
    Retrieve or create a folder in Google Drive given a name and parent folder ID.
    """
    try:
        query = f"name='{folder_name}' and mimeType='application/vnd.google-apps.folder' and '{parent_id}' in parents"
        results = drive_service.files().list(
            q=query,
            spaces="drive",
            fields="files(id, name)",
            pageSize=1
        ).execute()
        items = results.get("files", [])

        if items:
            folder_id = items[0]["id"]
            print(Fore.GREEN + f"Folder '{folder_name}' found with ID: {folder_id}")
            return folder_id
        else:
            folder_metadata = {
                "name": folder_name,
                "mimeType": "application/vnd.google-apps.folder",
                "parents": [parent_id]
            }
            folder = drive_service.files().create(body=folder_metadata, fields="id").execute()
            folder_id = folder.get("id")
            print(Fore.GREEN + f"Folder '{folder_name}' created with ID: {folder_id}")
            return folder_id
    except Exception as e:
        print(Fore.RED + f"Error creating or retrieving folder '{folder_name}': {e}")
        return None

folder_id_cache = {}

def get_folder_id_from_path(drive_service, local_path):
    if local_path in folder_id_cache:
        return folder_id_cache[local_path]

    prefix = "/content/drive/MyDrive/"
    if not local_path.startswith(prefix):
        print(Fore.RED + "The path does not start with /content/drive/MyDrive/.")
        return None

    relative_path = local_path[len(prefix):].strip("/")
    if not relative_path:
        folder_id_cache[local_path] = "root"
        return "root"

    parts = relative_path.split("/")
    current_parent_id = "root"
    for part in parts:
        folder_id = get_or_create_folder(drive_service, part, current_parent_id)
        if not folder_id:
            print(Fore.RED + f"Failed to navigate/create the folder for part: {part}")
            return None
        current_parent_id = folder_id

    # Cache the final folder ID
    folder_id_cache[local_path] = current_parent_id
    return current_parent_id


# Attempt to check and mount the drive
if check_and_mount_drive():
    print("Proceeding...")
else:
    print("Drive mount failed. Exiting.")
    raise SystemExit("Drive mount failed.")

drive_service = initialize_drive_api()

# Predefined options for client folders
clients = {
    "1": "/content/drive/MyDrive/Clients/WCBradley/Videos/",
    "2": "/content/drive/MyDrive/Clients/SiriusXM/Videos/",
    "3": "/content/drive/MyDrive/Clients/LHI/Videos/"
}

print("Select a client folder:")
print("1: WCBradley")
print("2: SiriusXM")
print("3: LHI")
print("4: Enter a custom folder path")

choice = input("Enter the number corresponding to your choice (default: 1): ").strip()
if choice in clients:
    client_videos_folder = clients[choice]
elif choice == "4":
    client_videos_folder = input("Enter the full path to your Videos folder: ").strip()
else:
    client_videos_folder = clients["1"]

rootFolder = client_videos_folder + "WhisperVideo/"
audio_folder = rootFolder + "AudioFiles/"
text_folder = rootFolder + "TextFiles/"
processed_folder = rootFolder + "ProcessedVideo/"

# Ensure local folders exist
folders = [rootFolder, audio_folder, text_folder, processed_folder]
for folder in folders:
    try:
        print(f"Checking folder: {folder}")
        folder_name = os.path.basename(os.path.normpath(folder))
        if not os.path.exists(folder):
            os.makedirs(folder)
            print(Fore.GREEN + f"Created folder: {folder}")
        else:
            print(Fore.GREEN + f"Folder already exists: {folder}")
        # Register locally. No ID yet.
        add_to_registry("folder", folder_name, folder)
    except Exception as e:
        print(Fore.RED + f"Error ensuring folder {folder}: {e}")

print(Fore.CYAN + f"WhisperVideo folder and subfolders initialized for client:")
print(Fore.GREEN + f"WhisperVideo folder: {rootFolder}")
print(Fore.GREEN + f"Audio files folder: {audio_folder}")
print(Fore.GREEN + f"Text files folder: {text_folder}")
print(Fore.GREEN + f"Processed videos folder: {processed_folder}")

# Now get or create these folders in Google Drive to get their IDs
if drive_service:
    rootFolderID = get_folder_id_from_path(drive_service, rootFolder)
    if rootFolderID:
        root_name = os.path.basename(os.path.normpath(rootFolder))
        add_to_registry("folder", root_name, rootFolder, rootFolderID, is_file=False)

    audio_name = os.path.basename(os.path.normpath(audio_folder))
    text_name = os.path.basename(os.path.normpath(text_folder))
    processed_name = os.path.basename(os.path.normpath(processed_folder))

    audio_id = get_or_create_folder(drive_service, audio_name, rootFolderID)
    if audio_id:
        add_to_registry("folder", audio_name, audio_folder, audio_id, is_file=False)

    text_id = get_or_create_folder(drive_service, text_name, rootFolderID)
    if text_id:
        add_to_registry("folder", text_name, text_folder, text_id, is_file=False)

    processed_id = get_or_create_folder(drive_service, processed_name, rootFolderID)
    if processed_id:
        add_to_registry("folder", processed_name, processed_folder, processed_id, is_file=False)

# Print the updated registry table with IDs and URLs
print_registry_table()


Checking /content/drive status...
Mount directory exists. Checking contents...
Mountpoint already contains files. Attempting to unmount...
Unmounted successfully or already unmounted.
Mounting Google Drive...
Mounted at /content/drive
Google Drive mounted successfully.
Drive is mounted and ready.
Proceeding...
Initializing Google Drive API using OAuth (User Credentials)...
Google Drive API service initialized successfully as the user.
Select a client folder:
1: WCBradley
2: SiriusXM
3: LHI
4: Enter a custom folder path
Enter the number corresponding to your choice (default: 1): 1
Checking folder: /content/drive/MyDrive/Clients/WCBradley/Videos/WhisperVideo/
Folder already exists: /content/drive/MyDrive/Clients/WCBradley/Videos/WhisperVideo/
Checking folder: /content/drive/MyDrive/Clients/WCBradley/Videos/WhisperVideo/AudioFiles/
Folder already exists: /content/drive/MyDrive/Clients/WCBradley/Videos/WhisperVideo/AudioFiles/
Checking folder: /content/drive/MyDrive/Clients/WCBradley/Video

##1. Load the code libraries

In [14]:
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install ffmpeg
!pip install librosa
!pip install audioread

import whisper
import time
import librosa
import soundfile as sf
import re
import os

# model = whisper.load_model("tiny.en")
model = whisper.load_model("base.en")
# model = whisper.load_model("small.en") # load the small model
# model = whisper.load_model("medium.en")
# model = whisper.load_model("large")

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-9o734189
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-9o734189
  Resolved https://github.com/openai/whisper.git to commit 90db0de1896c23cbfaf0c58bc2d30665f709f170
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu 

In [15]:
# from colorama import Fore, Style, init
# from googleapiclient.discovery import build
# from google.oauth2.service_account import Credentials  # Ensure this import is included
# from google.colab import drive

# print(Fore.CYAN + "Attempting to mount Google Drive...")
# drive.mount('/content/drive', force_remount=True)
# print(Fore.GREEN + "Google Drive mounted successfully.")

# # Initialize colorama for console color support
# init(autoreset=True)
# print(Fore.CYAN + "Colorama initialized for console color support.")

# # Google Drive API setup
# def initialize_drive_api():
#     """
#     Initialize Google Drive API service account for generating shareable links.
#     """
#     print(Fore.CYAN + "Initializing Google Drive API...")
#     try:
#         credentials = Credentials.from_service_account_file(
#             "/content/drive/MyDrive/key.json",
#             scopes=["https://www.googleapis.com/auth/drive"]
#         )
#         service = build("drive", "v3", credentials=credentials)
#         print(Fore.GREEN + "Google Drive API service initialized successfully.")
#         return service
#     except Exception as e:
#         print(Fore.RED + f"Failed to initialize Google Drive API: {e}")
#         return None

# drive_service = initialize_drive_api()

# def get_file_id(file_name, folder_id):
#     """
#     Retrieve the file ID for a given file name in a specific folder on Google Drive.
#     """
#     print(Fore.CYAN + f"Searching for file '{file_name}' in folder ID '{folder_id}'...")
#     if drive_service is None:
#         print(Fore.RED + "Drive service not initialized. Cannot proceed.")
#         return None
#     try:
#         results = drive_service.files().list(
#             q=f"name='{file_name}' and '{folder_id}' in parents",
#             spaces="drive",
#             fields="files(id, name)",
#             pageSize=1
#         ).execute()
#         items = results.get("files", [])
#         if items:
#             file_id = items[0]["id"]
#             print(Fore.GREEN + f"File '{file_name}' found with ID: {file_id}")
#             return file_id
#         else:
#             print(Fore.YELLOW + f"File '{file_name}' not found in folder {folder_id}.")
#             return None
#     except Exception as e:
#         print(Fore.RED + f"Error retrieving file ID for '{file_name}': {e}")
#         return None

# def generate_shareable_link(file_id):
#     """
#     Generate a shareable link for a given Google Drive file.
#     """
#     print(Fore.CYAN + f"Generating shareable link for file ID: {file_id}...")
#     if drive_service is None:
#         print(Fore.RED + "Drive service not initialized. Cannot generate link.")
#         return None
#     try:
#         permission = {"type": "anyone", "role": "reader"}
#         drive_service.permissions().create(fileId=file_id, body=permission).execute()
#         link = f"https://drive.google.com/file/d/{file_id}/view"
#         print(Fore.GREEN + f"Shareable link generated successfully: {link}")
#         return link
#     except Exception as e:
#         print(Fore.RED + f"Failed to generate shareable link: {e}")
#         return None

# # Example usage (uncomment to test):
# # folder_id = "YOUR_FOLDER_ID"
# # file_name = "test.txt"
# # file_id = get_file_id(file_name, folder_id)
# # if file_id:
# #     link = generate_shareable_link(file_id)


##2. Give the application permission to mount the drive and create the folders

In [16]:
# # Mount Google Drive
# from google.colab import drive
# drive.mount("/content/drive", force_remount=True)  # This will prompt for authorization.

# import os

# # Ensure WhisperVideo folder and its subfolders exist
# folders = [rootFolder, audio_folder, text_folder, processed_folder]
# for folder in folders:
#     try:
#         if not os.path.exists(folder):
#             os.makedirs(folder)
#             print(f"Created folder: {folder}")
#         else:
#             print(f"Folder already exists: {folder}")
#     except Exception as e:
#         print(f"Error ensuring folder {folder}: {e}")

# print(f"All folders verified and ready under: {rootFolder}")

##3. Upload any video files you want transcribed in the "WhisperVideo" folder in your Google Drive.

## 4. Extract audio from the video files and create a transcription

This step processes video files in the `WhisperVideo` folder by extracting audio, transcribing it, and saving the transcription in the `TextFiles` folder. The original video file is moved to the `ProcessedVideo` folder upon successful transcription.

### Shareable Links
The shareable link for the processed video is generated based on its Google Drive file path. This method avoids additional API calls and assumes that files are already shared within your team. The constructed link can be found at the beginning of the transcription file.

Example of a shareable link format:
```
https://drive.google.com/file/d/<file_id>/view
```



In [17]:
import os
import shutil
import subprocess
import logging
import csv
from datetime import datetime, timedelta
import librosa
import soundfile as sf
import whisper
from googleapiclient.http import MediaFileUpload

# Clear old local audio and text files before starting (Optional: do this if safe)
for f in os.listdir(audio_folder):
    if f.endswith(".wav"):
        os.remove(os.path.join(audio_folder, f))
for f in os.listdir(text_folder):
    if f.endswith(".txt"):
        os.remove(os.path.join(text_folder, f))

# Also print directory states before processing
print("Initial Audio directory:", os.listdir(audio_folder))
print("Initial Text directory:", os.listdir(text_folder))

def format_time(seconds):
    return str(timedelta(seconds=int(seconds)))

logging.basicConfig(
    filename="processing_log.txt",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

def remove_from_registry_by_path(path):
    global registry_entries
    registry_entries = [e for e in registry_entries if e["path"] != path]

def file_in_registry_with_id(path):
    for e in registry_entries:
        if e["path"] == path and e.get("id"):
            return True
    return False

def file_in_registry(path):
    for e in registry_entries:
        if e["path"] == path:
            return True
    return False

def get_file_count(folder):
    return len([f for f in os.listdir(folder) if os.path.isfile(os.path.join(folder, f))])

def get_file_bases(folder):
    return {os.path.splitext(f)[0] for f in os.listdir(folder) if os.path.isfile(os.path.join(folder, f))}

def verify_folder_state():
    """Compare local folder states with what we expect from processed videos."""
    videos = get_file_bases(processed_folder)
    audios = get_file_bases(audio_folder)
    texts = get_file_bases(text_folder)
    all_match = (videos == audios == texts)

    if not all_match:
        print("WARNING: Folder parity mismatch detected:")
        print(f"Processed Videos: {len(videos)} ({videos})")
        print(f"Audio Files: {len(audios)} ({audios})")
        print(f"Text Files: {len(texts)} ({texts})")

    return all_match

def upload_file_to_drive(drive_service, file_path, parent_folder_id):
    file_name = os.path.basename(file_path)
    file_metadata = {
        'name': file_name,
        'parents': [parent_folder_id]
    }
    media = MediaFileUpload(file_path, resumable=True)
    file = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()
    return file.get('id')

def move_file_in_drive(drive_service, file_id, old_parent_id, new_parent_id):
    file_info = drive_service.files().get(fileId=file_id, fields='parents').execute()
    parents = file_info.get('parents', [])
    if old_parent_id in parents:
        parents.remove(old_parent_id)
    updated_file = drive_service.files().update(
        fileId=file_id,
        addParents=new_parent_id,
        removeParents=old_parent_id,
        fields='id, parents'
    ).execute()
    return updated_file.get('id')

def register_and_upload_local_file(drive_service, entry_type, file_name, file_path, parent_folder_id, is_file=True):
    if file_in_registry_with_id(file_path):
        # Already uploaded
        return None, None
    else:
        if file_in_registry(file_path):
            remove_from_registry_by_path(file_path)
        add_to_registry(entry_type, file_name, file_path, entity_id=None, is_file=is_file)
        file_id = upload_file_to_drive(drive_service, file_path, parent_folder_id)
        if file_id:
            link = generate_shareable_link(file_id)
            remove_from_registry_by_path(file_path)
            add_to_registry(entry_type, file_name, file_path, entity_id=file_id, is_file=is_file)
            return file_id, link
        else:
            return None, None

# Before processing videos, print registry and folder state
print("Initial Registry State:")
print_registry_table()
verify_folder_state()

video_files = [f for f in os.listdir(rootFolder) if os.path.isfile(os.path.join(rootFolder, f))]

for video_file in video_files:
    if video_file == "processing_report.txt":
        continue
    if not video_file.lower().endswith((".mp4", ".mov", ".avi", ".mkv")):
        skipped_log.append((video_file, "Invalid video format"))
        print(f"Skipped {video_file}: Invalid video format.")
        continue

    base_name = os.path.splitext(video_file)[0]
    video_path = os.path.join(rootFolder, video_file)
    audio_path = os.path.join(audio_folder, base_name + ".wav")
    text_path = os.path.join(text_folder, base_name + ".txt")
    processed_path = os.path.join(processed_folder, video_file)

    if os.path.exists(processed_path):
        print(f"Video {video_file} already in processed folder. Skipping.")
        skipped_log.append((video_file, "Already processed (video)"))
        continue

    if not file_in_registry(video_path):
        add_to_registry("file", video_file, video_path, entity_id=None, is_file=True)

    video_id = get_file_id(video_file, rootFolderID)
    if video_id:
        remove_from_registry_by_path(video_path)
        add_to_registry("file", video_file, video_path, entity_id=video_id, is_file=True)

    # Log current folder state
    print(f"\nProcessing {video_file}:")
    print("Audio directory:", os.listdir(audio_folder))
    print("Text directory:", os.listdir(text_folder))

    try:
        if not os.path.exists(audio_path):
            print(f"Extracting audio for {video_file} to {audio_path}")
            try:
                y, sr = librosa.load(video_path, sr=16000)
                sf.write(audio_path, y, sr)
                print(f"Audio extraction successful using librosa for {video_file}")
            except Exception as e_librosa:
                print(f"Librosa extraction failed for {video_file}: {e_librosa}. Falling back to ffmpeg...")
                subprocess.run(["ffmpeg", "-i", video_path, "-ar", "16000", "-ac", "1", audio_path], check=True)
                print(f"Audio extraction successful using ffmpeg for {video_file}")
        else:
            print(f"Audio file {audio_path} already exists.")

        print(f"Uploading audio file {os.path.basename(audio_path)}...")
        register_and_upload_local_file(drive_service, "file", os.path.basename(audio_path), audio_path, audio_folder_id, is_file=True)

        if not os.path.exists(text_path):
            print(f"Starting transcription for {audio_path}")
            result = model.transcribe(audio_path)
            print(f"Transcription completed for {audio_path}")

            transcription_text = ""
            for segment in result["segments"]:
                start_time = format_time(segment["start"])
                end_time = format_time(segment["end"])
                text_segment = segment["text"].strip()
                transcription_text += f"[{start_time} - {end_time}] {text_segment}\n\n"

            print(f"Saving transcription to {text_path}")
            with open(text_path, "w") as f:
                f.write(transcription_text)
        else:
            print(f"Text file {text_path} already exists.")

        print(f"Uploading text file {os.path.basename(text_path)}...")
        register_and_upload_local_file(drive_service, "file", os.path.basename(text_path), text_path, text_folder_id, is_file=True)

        print(f"Moving file {video_file} to processed folder")
        shutil.move(video_path, processed_path)

        if video_id:
            move_file_in_drive(drive_service, video_id, rootFolderID, processed_folder_id)
            remove_from_registry_by_path(video_path)
            add_to_registry("file", video_file, processed_path, entity_id=video_id, is_file=True)

            for e in registry_entries:
                if e["path"] == processed_path and e["id"] == video_id:
                    if not e["url"]:
                        link = generate_shareable_link(video_id)
                        remove_from_registry_by_path(processed_path)
                        add_to_registry("file", video_file, processed_path, entity_id=video_id, is_file=True)
                    with open(text_path, "a") as f:
                        f.write(f"\nOriginal Video Link: {e['url']}\n")
                    break

        print("Registry after processing this video:")
        print_registry_table()

        if not verify_folder_state():
            print("Folder parity mismatch after processing", video_file)

        success_log.append(video_file)
        logging.info(f"Successfully processed {video_file}")

    except subprocess.CalledProcessError as ffmpeg_error:
        error_message = f"FFmpeg error for {video_file}: {ffmpeg_error}"
        print(error_message)
        error_log.append((video_file, error_message))
        logging.error(error_message)

    except Exception as general_error:
        error_message = f"General error for {video_file}: {general_error}"
        print(error_message)
        error_log.append((video_file, error_message))
        logging.error(error_message)

# Final parity check
videos = get_file_bases(processed_folder)
audios = get_file_bases(audio_folder)
texts = get_file_bases(text_folder)
all_match = (videos == audios == texts)

report = "Processing Report\n"
report += f"\nSuccessfully Processed Files ({len(success_log)}):\n"
report += "\n".join(success_log)
report += f"\n\nSkipped Files ({len(skipped_log)}):\n"
report += "\n".join([f"{file} - {reason}" for file, reason in skipped_log])
report += f"\n\nErrors ({len(error_log)}):\n"
report += "\n".join([f"{file} - {reason}" for file, reason in error_log])
report += f"\n\nFolder Parity Check:\n"
report += f"All folders have matching files: {'Yes' if all_match else 'No'}\n"
report += f"Processed Videos: {len(videos)}\n"
report += f"Audio Files: {len(audios)}\n"
report += f"Text Files: {len(texts)}\n"

with open(os.path.join(rootFolder, "processing_report.txt"), "w") as f:
    f.write(report)

print("=== COMPLETION REPORT ===")
print(report)

csv_path = os.path.join(rootFolder, "processing_log.csv")
file_exists = os.path.isfile(csv_path)
current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

with open(csv_path, "a", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    if not file_exists:
        writer.writerow(["Timestamp", "FileName", "Status", "Notes"])

    for fname in success_log:
        writer.writerow([current_time, fname, "Processed", ""])
    for (fname, reason) in skipped_log:
        writer.writerow([current_time, fname, "Skipped", reason])
    for (fname, reason) in error_log:
        writer.writerow([current_time, fname, "Error", reason])

print("\nCurrent CSV log entries:")
with open(csv_path, "r", encoding="utf-8") as csvfile:
    print(csvfile.read())


Initial Audio directory: []
Initial Text directory: []
Initial Registry State:
=== REGISTRY TABLE ===
╒════════╤════════════════╤══════════════════════════════════════════════════════════════════════════════╤═══════════════════════════════════╤══════════════════════════════════════════════════════════════════════════╕
│ Type   │ Name           │ Path                                                                         │ ID                                │ URL                                                                      │
╞════════╪════════════════╪══════════════════════════════════════════════════════════════════════════════╪═══════════════════════════════════╪══════════════════════════════════════════════════════════════════════════╡
│ folder │ WhisperVideo   │ /content/drive/MyDrive/Clients/WCBradley/Videos/WhisperVideo/                │ 10VXO4dTg36YySueayLgiAvzeC5dX5fNF │ https://drive.google.com/drive/folders/10VXO4dTg36YySueayLgiAvzeC5dX5fNF │
├────────┼────────────────

TypeError: get_file_id() missing 1 required positional argument: 'drive_service'

In [None]:
# ### Final Note for Synchronization
# For Colab: Sync changes manually after downloading the notebook.
# For Local: Use the Jupytext command:
#    jupytext --sync LHI_WhisperVideoDrive.ipynb

print("Final Note: Synchronize your files locally using Jupytext.")
print("Colab users: Save your notebook and download it to sync manually.")