# Cleanup Pipeline

## Overview
This Jupyter notebook implements a cleanup utility for managing video files in the transcription pipeline. It removes videos that have already been transcribed, ensuring efficient storage usage and preventing duplicate processing.

### Key Features
- Checks for already transcribed videos
- Removes redundant video files
- Optional verbose output
- Safe deletion with existence checks
- CSV-based transcription verification

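### Prerequisites
The notebook depends on `pandas` and the standard-library `os` module, both imported in the next cell.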

In [None]:
import os
import pandas as pd



### Process Flow
1. Load transcription records from CSV
2. Compare existing video files against transcription records
3. Remove videos that have already been transcribed
4. Optional logging of deleted files
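
The four steps above can be sketched as a pair of small functions. This is a minimal illustration rather than the notebook's own code: it reads the CSV with the standard-library `csv` module instead of `pandas` so that it runs without extra dependencies, and the names `load_transcribed_ids` and `remove_transcribed` are hypothetical. It assumes, as the notebook does, that the ID column is called `Video ID` and that each video file is named after its ID.

```python
import csv
import os


def load_transcribed_ids(csv_path, id_column="Video ID"):
    """Step 1: return the set of video IDs listed in the transcription CSV."""
    if not os.path.exists(csv_path):
        return set()  # no records yet, so nothing counts as transcribed
    # utf-8-sig tolerates a byte-order mark at the start of the header row
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        return {row[id_column].strip() for row in reader if row.get(id_column)}


def remove_transcribed(folder, csv_path, tee=False):
    """Steps 2-4: delete transcribed MP4 files and return their names."""
    transcribed = load_transcribed_ids(csv_path)
    deleted = []
    for filename in sorted(os.listdir(folder)):
        stem, extension = os.path.splitext(filename)
        # Step 2: skip non-MP4 files and videos without a transcription record
        if extension.lower() != ".mp4" or stem not in transcribed:
            continue
        # Step 3: remove the already transcribed video
        os.remove(os.path.join(folder, filename))
        deleted.append(filename)
        # Step 4: optional logging
        if tee:
            print(f"Deleted already transcribed video: {filename}")
    return deleted
```

Returning the list of deleted names makes the result easy to inspect or log after the run, whether or not `tee` is set.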

### Usage
Run this notebook before transferring the `Non_Transcribed_Videos` folder to a VM or other processing environment. Pass `tee=True` to `clean_folder()` to print the name of each file as it is deleted.
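
Because deletion cannot be undone, it may help to preview the affected files before running the cleanup. The sketch below is an assumption-laden illustration, not part of the notebook: `preview_deletions` is a hypothetical helper that takes the folder path and a set of already transcribed IDs, and only reports matches without removing anything.

```python
import os


def preview_deletions(folder, transcribed_ids):
    """Return MP4 file names whose stem is a transcribed ID; deletes nothing."""
    matches = []
    for filename in os.listdir(folder):
        stem, extension = os.path.splitext(filename)
        if extension.lower() == ".mp4" and stem in transcribed_ids:
            matches.append(filename)
    return sorted(matches)
```

A call such as `preview_deletions('Non_Transcribed_Videos', ids)`, where `ids` is the set of IDs read from the CSV, lists the files the cleanup would remove.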

### Configuration
- Input: MP4 video files in `Non_Transcribed_Videos` folder
- Reference: `Transcription.csv` containing completed transcriptions
- Output: Cleaned directory with only untranscribed videos
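
The cleanup relies on the reference CSV having a `Video ID` column. A quick header check, sketched below, can catch a misnamed column before any file is touched. The helper name `has_id_column` is hypothetical, and the sketch uses the standard-library `csv` module rather than `pandas`.

```python
import csv


def has_id_column(csv_path, id_column="Video ID"):
    """Return True if the first row of the CSV contains the ID column."""
    # utf-8-sig tolerates a byte-order mark at the start of the header row
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])  # [] when the file is empty
    return id_column in [cell.strip() for cell in header]
```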

### Note
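Run the cleanup before moving the folder to the VM so that already transcribed videos are neither copied nor processed again. Files are deleted with `os.remove`, which is permanent, so keep a backup if the videos may be needed later.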

In [None]:
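input_folder = 'Non_Transcribed_Videos'
transcription_csv = 'Transcription.csv'  # CSV of completed transcriptions

def clean_folder(tee=False):
    """Delete videos in input_folder whose ID appears in the transcription CSV."""
    # Nothing to compare against if the CSV or the folder is missing.
    if not os.path.exists(transcription_csv) or not os.path.isdir(input_folder):
        return

    # Read IDs as strings so they match file names exactly (e.g. leading zeros).
    df = pd.read_csv(transcription_csv, dtype={"Video ID": str})
    transcribed_ids = set(df["Video ID"].dropna())

    for filename in os.listdir(input_folder):
        video_id, extension = os.path.splitext(filename)
        # Only touch MP4 files whose name (without extension) is a transcribed ID.
        if extension.lower() != '.mp4' or video_id not in transcribed_ids:
            continue

        os.remove(os.path.join(input_folder, filename))
        if tee:
            print(f"Deleted already transcribed video: {filename}")

clean_folder(tee=True)  # pass tee=False to silence the per-file output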