# YouTube Video First Minute Extraction Pipeline

## Overview
This Jupyter notebook extracts the first minute of audio from downloaded YouTube videos. The pipeline reads MP4 files, writes the first 60 seconds of each as an MP3 file, and logs processed files so that repeated runs skip work already done.

### Key Features
- Extracts first 60 seconds of audio from MP4 videos
- Converts video to MP3 format using FFmpeg
- Maintains processing logs to track completed files
- Includes utilities for:
  - Finding MP4 files in directories
  - Tracking which files have already been processed
  - Log concatenation and analysis

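### Prerequisites
The cells below use `pandas` for log analysis and call the `ffmpeg` command-line tool through `subprocess`, so FFmpeg must be installed and available on the `PATH`. The `ffmpeg-python` package is imported as well, although the extraction code shown here invokes the command-line tool directly.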

In [None]:
import pandas as pd
import os
import ffmpeg # ffmpeg-python
import subprocess



### Usage
The pipeline reads MP4 files from the input directory, extracts the first minute of audio from each, and saves the result as an MP3 file in the output directory. It logs every successfully processed file so that an interrupted run can be resumed without repeating work.

### Configuration
- `INPUT_FOLDER`: Source directory containing MP4 files
- `OUTPUT_FOLDER`: Destination directory for processed MP3 files
- `LOG_FILE`: File tracking processed videos
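
Before a long run it can help to confirm that the configuration is usable. The following is a minimal sketch, not part of the original notebook: `preflight` is a hypothetical helper name, and the check assumes the pipeline calls an executable named `ffmpeg` found on the `PATH`.

```python
import os
import shutil

def preflight(input_folder, output_folder, tool="ffmpeg"):
    """Return a list of problems; an empty list means the run can start.

    Hypothetical helper, not part of the original notebook.
    """
    problems = []
    # shutil.which returns None when the executable is not on PATH
    if shutil.which(tool) is None:
        problems.append(f"{tool} not found on PATH")
    if not os.path.isdir(input_folder):
        problems.append(f"input folder missing: {input_folder}")
    # Create the output folder up front so the converter has somewhere to write
    os.makedirs(output_folder, exist_ok=True)
    return problems
```

Calling `preflight(INPUT_FOLDER, OUTPUT_FOLDER)` before processing would surface a missing FFmpeg install or a mistyped input path before any file is touched.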

### Additional Tools
Includes utilities for:
- Log file analysis and concatenation
- File extension detection
- Video processing status verification

In [30]:
def find_mp4_files(folder_path):
    """Returns a list of .mp4 file names without the extension."""
    mp4_files = [
        os.path.splitext(file)[0] for file in os.listdir(folder_path) if file.lower().endswith(".mp4")
    ]
    return mp4_files

folder_path = '../../YouTube_Downloader/Downloads'
mp4_files = find_mp4_files(folder_path)

In [33]:
INPUT_FOLDER = '../../YouTube_Downloader/Downloads'
OUTPUT_FOLDER = 'First_minute_videos'  
LOG_FILE = "processed_files.log"

def find_mp4_files(folder_path):
    """Returns a list of .mp4 file names without the extension."""
    mp4_files = [
        os.path.splitext(file)[0] for file in os.listdir(folder_path) if file.lower().endswith(".mp4")
    ]
    return mp4_files

def load_processed_files():
    """Reads the log file and returns a set of processed filenames."""
    if not os.path.exists(LOG_FILE):
        return set()
    
    with open(LOG_FILE, "r") as f:
        return set(line.strip() for line in f.readlines())

def save_processed_file(filename):
    """Appends a successfully processed filename to the log file."""
    with open(LOG_FILE, "a") as f:
        f.write(filename + "\n")

def extract_audio(filename):
    """Extracts the first 60 seconds of audio from the given .mp4 file."""
    input_path = os.path.join(INPUT_FOLDER, filename + ".mp4")
    output_path = os.path.join(OUTPUT_FOLDER, filename + ".mp3")

    if not os.path.exists(input_path):
        print(f"File not found: {input_path}")
        return False

    # ffmpeg does not create missing directories, so make sure the output folder exists
    os.makedirs(OUTPUT_FOLDER, exist_ok=True)

    command = [
        "ffmpeg",
        "-y",  # Overwrite a partial output left behind by an interrupted run
        "-i", input_path,  # Input file
        "-t", "60",  # Duration: 60 seconds
        "-q:a", "2",  # VBR quality setting for MP3 (lower means higher quality)
        "-vn",  # No video
        output_path  # Output file
    ]

    try:
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        print(f"Successfully processed: {filename}")
        return True
    except subprocess.CalledProcessError:
        # Raised when ffmpeg exits with a non-zero status
        print(f"Failed to process: {filename}")
        return False

def process_files(file_list):
    """Processes each file, extracting the first 60 seconds of audio, skipping processed ones."""
    processed_files = load_processed_files()

    for filename in file_list:
        if filename in processed_files:
            print(f"Skipping {filename}, already processed.")
            continue

        if extract_audio(filename):
            save_processed_file(filename)

In [None]:
mp4_files = find_mp4_files(INPUT_FOLDER)
process_files(mp4_files)

Skipping hZN9-pn7cPM, already processed.
Skipping rvCd2GviguU, already processed.
Skipping XCvQFgUx7Xg, already processed.
Skipping qgtiRkFDUrI, already processed.
Skipping yRPLrDnkZ7I, already processed.
Skipping r1VacSU8kFA, already processed.
Skipping 6oeQfEnMHIU, already processed.
Skipping 2NNAccP5Vvg, already processed.
Skipping EjLXRGpZKv8, already processed.
Skipping p1dSJtBbzhg, already processed.
Skipping ZwxJk-noC4I, already processed.
Skipping xw490NjBY0Y, already processed.
Skipping LaVd7qRxqDk, already processed.
Skipping sBCCjhK2CAg, already processed.
Skipping 78I9zOUPuBI, already processed.
Skipping _avT-3l0fL0, already processed.
Skipping CvwAPi8ishw, already processed.
Skipping hQ3FnfqUuwA, already processed.
Skipping 7968d4lGYY8, already processed.
Skipping afTv-DVZBTo, already processed.
Successfully processed: uc4yoIwGqOc
Successfully processed: Ini186efqmc
Successfully processed: rqANzQL2Ynw
Successfully processed: 9kvkaFKoYlI
Successfully processed: YQQokcoOzeY
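
The run above mixes skipped and newly processed files. To preview what a run would do before starting it, a dry-run helper along these lines may be useful. This is a sketch under the same assumptions as the pipeline (one filename stem per line in the log); `pending_files` is a hypothetical name, not a function from the notebook.

```python
import os

def pending_files(input_folder, log_file):
    """Return the sorted .mp4 stems in input_folder that are not yet in log_file.

    Hypothetical helper, not part of the original notebook.
    """
    stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(input_folder)
        if name.lower().endswith(".mp4")
    }
    processed = set()
    # A missing log simply means nothing has been processed yet
    if os.path.exists(log_file):
        with open(log_file, "r") as f:
            processed = {line.strip() for line in f if line.strip()}
    return sorted(stems - processed)
```

For example, `len(pending_files(INPUT_FOLDER, LOG_FILE))` gives the number of files the next run would convert.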


### In other news: cross-checking the download logs against the files on disk

In [None]:
def concatenate_logs(log_path):
    dataframes = []
    
    # Iterate over all files in the folder
    for filename in os.listdir(log_path):
        # Check if the file is a CSV
        if filename.endswith(".csv"):
            file_path = os.path.join(log_path, filename)
            
            # Read CSV file into a DataFrame and append it to the list
            df = pd.read_csv(file_path, dtype={'Participant ID': str, 'video_id': str})
            if 'log' in df.columns:
                df = df.drop(columns=['log'])
            dataframes.append(df)
    
    # Concatenate all DataFrames in the list
    if dataframes:  # Check if there are any dataframes to concatenate
        concatenated_df = pd.concat(dataframes, ignore_index=True)
    else:
        concatenated_df = pd.DataFrame()  # Return an empty DataFrame if no CSV files found
    
    return concatenated_df
log_df = concatenate_logs('../../YouTube_Downloader/Log_Entries')

In [9]:
df = log_df[log_df['status']=='successful']

In [15]:
set(df['video_id']) - set(mp4_files)

{'-m7NcPgFQv0',
 '1cYs-eynpQc',
 '1yAQhRP-UJ8',
 '3KI2_cpR9Ek',
 '3PBecE4J5ak',
 '6aO3_nd0jPM',
 '7vuE4TP7Ta8',
 '84QsAF6FYg0',
 '8zEPlkPAlNQ',
 'B3fDZz3IS2Q',
 'B6cLl_hHm7Y',
 'CH4dkLvZkKc',
 'DUID0Y2VwPw',
 'G87fDW89hjI',
 'GUWHTIZbmkw',
 'Gy1M2KpBWII',
 'HyAGMjbIFqY',
 'ITJ3AF3TK5M',
 'ItkhXq1N5gg',
 'JZRpVSn_Z4A',
 'KMWZE6ekIIc',
 'LwiG0j7PMG8',
 'MBg2sxYQLYs',
 'MDVyM3w3RwE',
 'OR23wIADpsM',
 'QZQ90uH-NKo',
 'Ts8Bjkpsim8',
 'Vue2J0qbN20',
 'W90w6xm6LJU',
 '_HUTXfdKoe8',
 'bCR2dUrxPk4',
 'cm0i2ewsi3Y',
 'dLF1qgYSQ38',
 'dZtjtiiT5SI',
 'dlJNqEaSjcw',
 'hcwrfBfwUhc',
 'iMfMGMu7KiI',
 'jXYOylTeFqo',
 'k-6P1H2rLmY',
 'l8h3AFKrADo',
 'lVlFwwzi7Ok',
 'pY9H6DxhN8o',
 'rMl4CmbmtGQ',
 'rfYfuAOKTxA',
 'rtT6odWzKbQ',
 'u7PAK5B8JC4',
 'v0WgqPU25oI',
 'wVZ8HBxb9Nk',
 'x_21VQxe_1E',
 'yJbHoajwhQM',
 'yJvR16wRLFs',
 'ydszZrNW_Iw',
 'zt5J8zV7f5s'}

In [24]:
def find_file_extension(ID, folder_path='../../YouTube_Downloader/Downloads/'):
    """Searches the folder for a file named ID (ignoring extension) and returns its extension, or None if no file matches."""
    for file in os.listdir(folder_path):
        name, ext = os.path.splitext(file)
        if name == ID:
            return ext
    return None  # No file with this name exists in the folder

find_file_extension('rMl4CmbmtGQ')
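
Checking one ID at a time lists the directory once per lookup. To check every missing ID in a single pass, the folder can be indexed by filename stem first. The sketch below is an illustration only: `index_extensions` and `locate_ids` are hypothetical names, and it assumes, as the lookup above does, that a file's stem is exactly the video ID.

```python
import os
from collections import defaultdict

def index_extensions(folder_path):
    """Map each file stem in folder_path to the sorted list of its extensions."""
    found = defaultdict(set)
    for name in os.listdir(folder_path):
        stem, ext = os.path.splitext(name)
        found[stem].add(ext.lower())
    return {stem: sorted(exts) for stem, exts in found.items()}

def locate_ids(ids, folder_path):
    """Return {id: [extensions]}; an empty list means no file matches that ID."""
    index = index_extensions(folder_path)
    return {video_id: index.get(video_id, []) for video_id in ids}
```

Passing the set difference computed earlier to `locate_ids` would show whether any of those IDs exist on disk under a different extension or are absent altogether.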

In [44]:
log_df[log_df['video_id']=='rMl4CmbmtGQ']

| Unnamed: 0 | Participant ID | video_id | status | server_reply | size_MB | start_time | end_time | download_time_minutes | download_speed_KBs | format | vcodec | acodec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7700 | 23690 | rMl4CmbmtGQ | successful | Download successful | | 2025-02-11 17:52:01.804688 | 2025-02-11 17:52:06.205357 | 0.07 | | | | |


### 53 videos are marked as downloaded in the logs but do not seem to have been downloaded...