# Transcription Validation Pipeline

## Overview
This Jupyter notebook validates the output of the video transcription pipeline. It checks the transcriptions against the language detection results, removes duplicate rows, and transcribes any video that was missed.

### Key Features
- Validates transcription completeness
- Cross-references with language detection results
- Identifies and handles duplicate transcriptions
- Manages missing or corrupted files
- Maintains the integrity of the transcription dataset

### Prerequisites


In [None]:
import pandas as pd
import os
import whisper
import subprocess



### Data Sources
- `Detected_Language.csv`: Complete language detection results
- `Detected_Language_Confident.csv`: High-confidence (>0.98) detections
- `Video_Transcriptions.csv`: Generated transcriptions
- Video files in `CPU_VM_folder/Non_Transcribed_Videos`
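
The high-confidence file is described as the subset of detections with confidence above 0.98. A minimal sketch of such a filter follows. The column names `language` and `confidence` and the helper `filter_confident` are hypothetical, since the real CSV schema is not shown in this notebook.

```python
import pandas as pd

def filter_confident(detections: pd.DataFrame, threshold: float = 0.98,
                     score_col: str = "confidence") -> pd.DataFrame:
    """Keep rows whose detection score is strictly above the threshold."""
    return detections[detections[score_col] > threshold].reset_index(drop=True)

# Toy data standing in for Detected_Language.csv (column names are assumptions)
toy = pd.DataFrame({
    "Video ID": ["a", "b", "c"],
    "language": ["en", "da", "en"],
    "confidence": [0.99, 0.98, 0.50],
})
confident = filter_confident(toy)
print(confident["Video ID"].tolist())
```

The comparison is strict, so a score of exactly 0.98 is excluded, matching the ">0.98" wording above.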

### Validation Steps
1. Load and compare language detection datasets
2. Identify missing or duplicate transcriptions
3. Validate transcription quality
4. Handle edge cases (corrupted files, duplicates)
5. Update master transcription folder

### Process Flow
The notebook validates the 10,956 high-confidence videos, ensuring that:
- All high-confidence detected videos are transcribed
- No duplicate transcriptions exist
- Data integrity is maintained
- Missing transcriptions are identified and processed
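
The checks listed above can be sketched as a single report that compares the expected video IDs with the IDs present in the transcription table. This is an illustration on toy data, not the notebook's own code; `validation_report` is a hypothetical helper, and only the `Video ID` column name is taken from the notebook.

```python
import pandas as pd

def validation_report(expected_ids, transcriptions: pd.DataFrame,
                      id_col: str = "Video ID") -> dict:
    """Compare expected IDs with transcribed IDs and list the discrepancies."""
    expected = set(expected_ids)
    counts = transcriptions[id_col].value_counts()
    transcribed = set(counts.index)
    return {
        "missing": sorted(expected - transcribed),      # expected but never transcribed
        "unexpected": sorted(transcribed - expected),   # transcribed but not expected
        "duplicated": sorted(counts[counts > 1].index), # transcribed more than once
    }

# Toy data: "c" is missing and "b" appears twice
toy = pd.DataFrame({"Video ID": ["a", "b", "b"], "transcript": ["x", "y", "y2"]})
report = validation_report(["a", "b", "c"], toy)
print(report)
```

An empty list under every key would mean the table and the video folder agree.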

In [5]:
# Full language detection results
dl = pd.read_csv("Detected_Language.csv")
# High-confidence subset (confidence > 0.98)
dlc = pd.read_csv("Detected_Language_Confident.csv")

In [12]:
N = len(dl)
N # this is exactly 1009 IDs x 20 videos 

20180

In [16]:
M = len(dlc)
m = f"Filtering for confidence we remove {N-M:,} ({((N-M) /N)*100:.1f}%) of the videos leaving us with {M:,}"
print(m)

Filtering for confidence we remove 9,224 (45.7%) of the videos leaving us with 10,956


In [61]:
# These 10,956 videos are stored in this folder
vid_folder = "CPU_VM_folder/Non_Transcribed_Videos"
mp4_files = os.listdir(vid_folder)
len(mp4_files)

10956

In [56]:
df = pd.read_csv("Video_Transcriptions.csv")

In [45]:
len(df)  # two rows more than the 10,956 videos in the folder?

10958

In [None]:
df['Video ID'].value_counts()  # two videos were transcribed twice; one video is missing

Video ID
0-pwca91OCM    2
2K0AHZh6OrM    2
r8DgUgj0dOY    1
ZC9SHB8IeX4    1
8KVYgAXY5Qc    1
              ..
fdg2OG8zFiM    1
_Cyd-5BVB0U    1
7jhwwWLMDgs    1
fU4kmv1_3Kg    1
okeeKqcmQkw    1
Name: count, Length: 10956, dtype: int64

In [31]:
up = df[(df['Video ID']=='0-pwca91OCM') | (df['Video ID']=='2K0AHZh6OrM')]
up

Unnamed: 0,Video ID,transcript
0,0-pwca91OCM,These pork and mango spring rolls are one of ...
1,2K0AHZh6OrM,"What's up y'all, Forrest here. To start off t..."
346,2K0AHZh6OrM,"What's up y'all, Forrest here. To start off t..."


In [59]:
(up['transcript'][0] == up['transcript'][345]) # the transcripts are identical

True

In [60]:
up['transcript'][1] == up['transcript'][346]  # these two transcripts differ, so drop_duplicates() alone will not remove the extra row

False
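
Rather than comparing rows by hand, the same question (which IDs are repeated, and do their transcripts differ?) can be answered for the whole table at once. A minimal sketch on toy data; `duplicate_summary` is a hypothetical helper and the column names are the ones used in this notebook.

```python
import pandas as pd

def duplicate_summary(df: pd.DataFrame, id_col: str = "Video ID",
                      text_col: str = "transcript") -> pd.Series:
    """For every repeated ID, count how many distinct transcripts it has."""
    dupes = df[df.duplicated(subset=id_col, keep=False)]  # all rows of repeated IDs
    return dupes.groupby(id_col)[text_col].nunique().rename("distinct_transcripts")

# Toy data: "a" is an exact duplicate, "b" has two different transcripts
toy = pd.DataFrame({
    "Video ID": ["a", "b", "a", "b", "c"],
    "transcript": ["x", "y", "x", "y2", "z"],
})
summary = duplicate_summary(toy)
print(summary)
```

A count of 1 means the repeated rows are identical and `drop_duplicates()` will handle them; a count above 1 means the rows need an explicit decision about which one to keep.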

In [30]:
Transcribed_ids = df['Video ID'].unique().tolist()

In [32]:
len(Transcribed_ids)

10955

# Removing duplicates

In [57]:
df = df.drop_duplicates()  # removes rows that are exact duplicates of an earlier row

In [59]:
df = df.drop(df.index[345])  # positional drop of the remaining 2K0AHZh6OrM duplicate (row label 346)

In [60]:
df[(df['Video ID']=='2K0AHZh6OrM')]

Unnamed: 0,Video ID,transcript
1,2K0AHZh6OrM,"What's up y'all, Forrest here. To start off t..."
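
The positional drop above depends on the row order at the time the cell was run. A less fragile alternative, shown here as a sketch on toy data rather than as the method used in this notebook, is to deduplicate on the ID column and keep the first occurrence. `dedupe_by_id` is a hypothetical helper.

```python
import pandas as pd

def dedupe_by_id(df: pd.DataFrame, id_col: str = "Video ID") -> pd.DataFrame:
    """Keep the first row for each ID, even when the transcripts differ."""
    return df.drop_duplicates(subset=id_col, keep="first").reset_index(drop=True)

toy = pd.DataFrame({
    "Video ID": ["a", "b", "a", "b", "c"],
    "transcript": ["x", "y", "x", "y2", "z"],
})
clean = dedupe_by_id(toy)
print(clean)
```

Keeping the first occurrence gives the same result as the cells above, where row 1 of `2K0AHZh6OrM` survived. If the later transcript were preferred, `keep="last"` would be the change to make.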


In [61]:
df['Video ID'].value_counts() 

Video ID
486zF8xEJbE    1
0-pwca91OCM    1
2K0AHZh6OrM    1
-ED-vjRCKaE    1
LsNg-KrFxCA    1
              ..
8EoxNRV1Amk    1
Kjij_M_GFyw    1
Nxkz6I2lA_8    1
3zojnGyZ-L0    1
7Qlm24ekm9Y    1
Name: count, Length: 10956, dtype: int64

In [62]:
df.to_csv("Video_Transcriptions.csv", index=False)

In [63]:
df = pd.read_csv("Video_Transcriptions.csv")

In [64]:
df

Unnamed: 0,Video ID,transcript
0,0-pwca91OCM,These pork and mango spring rolls are one of ...
1,2K0AHZh6OrM,"What's up y'all, Forrest here. To start off t..."
2,-ED-vjRCKaE,"What's up guys, I'm RandomFrankP back with an..."
3,LsNg-KrFxCA,Let me go to Big Head Joe's for you. They hav...
4,1S0lygj3w84,Would you be willing to trade the outfit you ...
...,...,...
10951,kK0FXAoIE1I,"Lige inden den her Titanic video starter, som..."
10952,iMGG0rTjkBs,Holy shit! There's four of those motherfucker...
10953,_U1foLW8h54,". So, thank you. Oh, good, this is on. So, I'..."
10954,HIGtLRnGCD4,Now obviously with so much money at stake it ...


# Transcribing the missing video

In [62]:
vid_ids = [vid.split('.')[0] for vid in mp4_files]

In [63]:
set(vid_ids) - set(Transcribed_ids)  # videos in the folder without a transcription

{'486zF8xEJbE'}

In [64]:
'486zF8xEJbE.mp4' in mp4_files

True

In [2]:
!pip install openai-whisper pandas tqdm --break-system-packages
!pip install ffmpeg-python --break-system-packages

Collecting openai-whisper
  Downloading openai-whisper-20240930.tar.gz (800 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 800.5/800.5 kB 5.2 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting numba (from openai-whisper)
  Downloading numba-0.61.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.8 kB)
Collecting torch (from openai-whisper)
  Downloading torch-2.6.0-cp312-cp312-manylinux1_x86_64.whl.metadata (28 kB)
Collecting more-itertools (from openai-whisper)
  Downloading more_itertools-10.6.0-py3-none-any.whl.metadata (37 kB)
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.9.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting triton>=2.0.0 (from openai-whisper)
  Downloading triton-3.2.0-cp312-cp312-manylinux_2_17_x86_64.manyl

In [3]:
import whisper
import subprocess

In [4]:
model = whisper.load_model("turbo")

100%|█████████████████████████████████████| 1.51G/1.51G [01:30<00:00, 17.8MiB/s]


In [13]:
import shutil

source_path = "../../YouTube_Downloader/Complete_Downloads/486zF8xEJbE.mp4"
destination_path = "CPU_VM_folder/Non_Transcribed_Videos/486zF8xEJbE.mp4"  # Folder scanned for transcription

try:
    shutil.copy2(source_path, destination_path)  # Use copy2 to keep metadata
    print(f"File copied successfully to {destination_path}")
except Exception as e:
    print(f"Error copying file: {e}")

File copied successfully to CPU_VM_folder/Non_Transcribed_Videos/486zF8xEJbE.mp4


In [15]:
video = '486zF8xEJbE.mp4'
video_path = os.path.join("CPU_VM_folder/Non_Transcribed_Videos", video)
output_csv = 'Video_Transcriptions.csv'  # a CSV file, not a folder

try:
    # Transcribe audio directly from the MP4
    result = model.transcribe(video_path)
    transcript = result["text"]

    # Append the new row to the CSV; use a separate name so `df` is not overwritten
    video_id = os.path.splitext(video)[0]
    new_row = pd.DataFrame([{"Video ID": video_id, "transcript": transcript}])
    new_row.to_csv(output_csv, mode='a', header=not os.path.exists(output_csv), index=False)

except RuntimeError as e:
    if "moov atom not found" in str(e) or "Invalid data found when processing input" in str(e):
        print(f"Skipping {video} due to file corruption or invalid format.")
    else:
        raise  # Re-raise other errors
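
Appending in `mode='a'` adds a row every time the cell is run, which is one way a video can end up transcribed twice. A minimal sketch of an append that skips IDs already in the file follows. `append_transcript` is a hypothetical helper, and the example writes to a temporary file rather than to `Video_Transcriptions.csv`.

```python
import os
import tempfile
import pandas as pd

def append_transcript(csv_path: str, video_id: str, transcript: str,
                      id_col: str = "Video ID") -> bool:
    """Append one row unless the ID is already present. Returns True if a row was written."""
    file_exists = os.path.exists(csv_path)
    if file_exists:
        # Read only the ID column, as strings, to check for an existing entry
        existing = pd.read_csv(csv_path, usecols=[id_col], dtype=str)
        if video_id in set(existing[id_col]):
            return False
    row = pd.DataFrame([{id_col: video_id, "transcript": transcript}])
    row.to_csv(csv_path, mode="a", header=not file_exists, index=False)
    return True

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_transcriptions.csv")
    first = append_transcript(path, "abc", "hello")         # file is created
    second = append_transcript(path, "abc", "hello again")  # skipped, ID already present
    rows = len(pd.read_csv(path))

print(first, second, rows)
```

The check rereads the ID column on every call, which is fine for a single missing video but would be slow inside a loop over thousands of files.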

