# Audio Description (AD) Text Analysis Prep Tool

This tool helps you prepare an audio description (AD) dataset for analyzing the words spoken in a media file.

A notable classification when analyzing the words spoken in AD media is whether a word is part of a recitation of credits. Reading a film's credits may be important for properly translating the visual experience into an aural one for AD users, but the words chosen in that recitation are far less interesting to analyze than the other, creative choices in the AD. Your analysis may benefit from filtering out these credit recitations. Additionally, the WhisperX transcription process labels the distinct speakers in the media file. Knowing which speaker recites the credits is a useful indicator that this speaker is the AD voice, since it would be very rare for a character in the narrative to recite the credits the way AD does.

This project has 2 main modules:

1.   **Module 1: Transcribe a media file with WhisperX**
This is helpful if you don't yet have a dataset of words spoken in a media file of interest, or don't yet have word-level timestamps or speaker identification. Caption files bundle words into caption blocks, and this timing may not be precise enough for certain analyses. Additionally, caption and transcript files may not include speaker identification, which would prevent comparing the words used in the AD with non-AD words. WhisperX writes its outputs to the same directory as your media file. Module 1 outputs include a transcript (txt), a caption file (srt), and the full data with word-level timestamps (json).

2.   **Module 2: Give words a 'Credit Score'**
This module assesses each run of words (identified by run_id) from the same speaker ID as assigned by WhisperX. Each run_id receives a weighted score, and the run with the highest score (1.00 being a perfect score) is considered the most likely candidate for the recitation of credits. Filtering of this run and other runs could later be applied in analysis. The output of Module 2 is a csv that can be used as a classified version of the Module 1 json dataset. The tools in Module 2 assume English is the predominant language of both the AD and the media file.



# Module 1: Transcribe a media file with WhisperX

Credit to Elle Wang for the video tutorial on running WhisperX in Google Colab: https://www.youtube.com/watch?v=1z0aHkFbD8E

WARNING: If running Module 1 (WhisperX), change the Google Colab runtime to T4 GPU. Go to Runtime > Change Runtime Type and select T4 GPU. WhisperX needs much more compute than the default CPU setting. If you do not change this before you start running cells in Module 1, you'll lose that work when changing the runtime and will have to rerun those cells. If you are only running Module 2, the CPU runtime is best for conserving Colab usage limits.

Module 1 requires an access token generated on Hugging Face (aka hf). Create an account on hf and follow the instructions for creating an access token, which the provided code will prompt you for. A read token is sufficient:
https://huggingface.co/docs/hub/en/security-tokens

While logged in to hf, your account will also need to accept the usage terms of 2 models on that platform. Use of the models is free, though you do need to accept the terms before Module 1 (WhisperX) will run correctly.

Accept the terms of each model on its landing page so that WhisperX can pull them via your hf access token:

https://huggingface.co/pyannote/segmentation-3.0

https://huggingface.co/pyannote/speaker-diarization-3.1




In [None]:
## Module 1: Kernel 1 ##
## This kernel prepares a directory and media file to run on WhisperX ##

## You'll have the option to 1. Use Google Drive or 2. Upload into Colab ##
## Choosing Google Drive will prompt for permission, then ask for the directory path to the folder with your file ##
## Choosing the Upload option will not save files outside of the Colab session, so download the WhisperX outputs before it ends! ##

import os
from google.colab import files, drive

# Step 1: Prompt user for storage option
file_choice = input("Choose an option:\n1 - Connect to Google Drive\n2 - Upload a file to sample_data\nEnter 1 or 2: ")

if file_choice == '1':
    # Step 2a: Mount Google Drive
    drive.mount('/content/drive')
    print("Google Drive mounted.")

    # Step 2b: Ask for directory path
    example_path = '/content/drive/MyDrive/your_folder_name'
    drive_path = input(f"Enter the full path to your Google Drive folder (e.g., {example_path}): ")

    if os.path.exists(drive_path):
        os.chdir(drive_path)
        print(f"Changed directory to: {drive_path}")
        # Step 2c: store file list in this directory to variable file_list and print that variable
        file_list = os.listdir()
        print("Files in this directory:", file_list)

    else:
        print("That path doesn't exist. Please double-check and try again.")

elif file_choice == '2':
    # Upload file
    uploaded = files.upload()

    # Save to /content/media_files (or any subfolder you prefer)
    media_dir = '/content/media_files'
    os.makedirs(media_dir, exist_ok=True)

    for filename in uploaded.keys():
      dest_path = os.path.join(media_dir, filename)
      os.rename(filename, dest_path)
      print(f"Saved {filename} to {dest_path}")

    # Step 3b: Change to the media_files directory
    os.chdir(media_dir)
    print("Changed directory to", media_dir)

    # Step 3c: store file list in this directory to variable file_list and print that variable
    file_list = os.listdir()
    print("Files in this directory:", file_list)

else:
    print("Invalid choice. Please enter 1 or 2.")


In [None]:
## Module 1: Kernel 2 ##
## Prepare your variables for Kernel 3 (WhisperX) ##

## Enter your hf token when prompted (the input is hidden) ##
## Pick the file by number from the upload folder OR the Google Drive folder
## (aka directory) you selected in the previous kernel. ##
## Note: WhisperX will accept audio or video files such as mp3 or mp4 ##

import getpass

# Step 1: Securely prompt for Hugging Face token
hf_token = getpass.getpass("Enter your Hugging Face token (input hidden): ")

# Step 2: Display file list and prompt for selection
print("\nAvailable files:")
for i, filename in enumerate(file_list):
    print(f"{i + 1}: {filename}")

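try:
    file_index = int(input("\nSelect a file by number: ")) - 1
except ValueError:
    file_index = -1  # Treat non-numeric input as an invalid selection

# Validate selection
if file_index < 0 or file_index >= len(file_list):
    print("Invalid selection. Re-run this kernel and pick a listed number.")
else:
    selected_file = file_list[file_index]
    print(f"\nSelected file for WhisperX: {selected_file}")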

In [None]:
## Module 1: Kernel 3 ##
## This is your kernel to run WhisperX. ##

## WhisperX settings: ##
## --model large-v2 is good for this task ##
## --chunk_size 6 is good for this task ##
## --diarize is a flag to include this phase of the process and
## is very helpful for preparing AD dataset for analysis ##


## Install all tools for WhisperX. This will take a minute or 2 ##
## Remember to change runtime in Google Colab to T4 GPU for best running of WhisperX ##

!pip install whisperx

!pip3 install -U huggingface_hub

!apt install libcudnn8 libcudnn8-dev -y

## Run the WhisperX code! This will take a few minutes ##
!whisperx --model large-v2 --chunk_size 6 --diarize --hf_token {hf_token} "{selected_file}"

## OUTPUT EXPLANATION ##
## The output files will write to the same location as the input media file. ##
## Each will have the same file name as the input file but with a different extension. ##
## Among the outputs: SRT is a caption file. TXT is a transcript file (without timecodes). ##
## JSON is the full dataset with each word timestamped and assigned a speaker value in the form "SPEAKER_##" ##

In [None]:
## Module 1: Kernel 4 ##
## Optional check of the WhisperX json output and save as CSV ##

## Loads the transcription provided by WhisperX into a dataframe so it can be manipulated and analyzed. ##
## Note: this will only work for a json in the same directory selected earlier ##
## (either the Google Drive folder or the upload folder in Google Colab) ##

import pandas as pd
import json

## The input function asks the user for the name of the json file to open below. ##
json_file_name = input("Please enter the JSON file name (e.g., ADxPD_Plan9_Mix.json): ")

## Open the json file named above and load its contents ##
with open(json_file_name, 'r') as f:
  data = json.load(f)

## This block interprets the json data structure to prep for the dataframe ##
all_words = []
for segment in data['segments']:
  for word_info in segment['words']:
    all_words.append(word_info)

## df0 will be the name of the dataframe native to the json file. ##
## Module 2 will analyze df0 but also add to it, ##
## resulting in modified dataframes of names df1, df2, etc. ##
df0 = pd.DataFrame(all_words)

## This line defines the data types of each column in the dataframe ##
df0 = df0.astype({'word': str, 'start': float, 'end': float, 'score': float, 'speaker': str})

## This line updates pandas formatting options to display timecodes from the json file with three decimal places ##
pd.options.display.float_format = '{:,.3f}'.format

# Save the DataFrame to a CSV file named after the json file (e.g., ADxPD_Plan9_Mix.json.csv)
df0.to_csv(f'{json_file_name}.csv', index=False)

print(df0)
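
The nested loop in the kernel above can be tried on a small hand-written structure before pointing it at a real file. The sketch below assumes the same segments-to-words layout that the kernel reads. The sample values, the `FIELDS` tuple, and the helper name `flatten_words` are hypothetical, and one word deliberately lacks timing, score, and speaker keys to show how a defensive version might fill such gaps with `None` instead of failing.

```python
# Hypothetical data in the segments -> words layout read by the kernel above.
data = {
    "segments": [
        {"words": [
            {"word": "In", "start": 8.131, "end": 8.393, "score": 0.797, "speaker": "SPEAKER_10"},
            {"word": "1959,"},  # entry with no timing, score, or speaker keys
        ]},
        {"words": [
            {"word": "bold", "start": 8.473, "end": 8.695, "score": 0.750, "speaker": "SPEAKER_10"},
        ]},
    ]
}

# Columns expected by the dataframe step (taken from the astype call above).
FIELDS = ("word", "start", "end", "score", "speaker")

def flatten_words(data):
    """Collect every word entry into one flat list, filling absent keys with None."""
    rows = []
    for segment in data.get("segments", []):
        for word_info in segment.get("words", []):
            rows.append({field: word_info.get(field) for field in FIELDS})
    return rows

rows = flatten_words(data)
print(len(rows), rows[1])
```

A list of uniform dictionaries like `rows` can be passed to `pd.DataFrame` in the same way `all_words` is in the kernel.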


# Module 2: Give words a 'Credit Score'
You may be interested in analyzing the words transcribed from your media file. While it is helpful for primary AD users to have any on-screen text read aloud, the recitation of the credits (opening or ending) represents a far different creative process from the other words used in the piece, whether AD or non-AD. Knowing whether a word is part of a credit recitation therefore becomes a useful classification for further analysis. Additionally, it is very rare for credits to be recited or narrated by any voice other than the AD. If you are attempting to automate the classification of which speaker is the AD, knowing which words are credits would aid this as a second-order classification.
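
As a rough illustration of how a credit classification could feed both uses described above, the sketch below flags the words in the top-scoring run and reads off that run's speaker label as the likely AD voice. It is plain Python on hand-made rows. The sample values, the `is_credit` field, and the helper name `classify_credits` are hypothetical, and treating only the single top-scoring run as credits is a simplifying assumption.

```python
# Hypothetical per-word rows: word, speaker label, and run_id.
words = [
    {"word": "A", "speaker": "SPEAKER_10", "run_id": 1},
    {"word": "Hello.", "speaker": "SPEAKER_02", "run_id": 2},
    {"word": "Directed", "speaker": "SPEAKER_10", "run_id": 3},
    {"word": "by", "speaker": "SPEAKER_10", "run_id": 3},
]

# Hypothetical credit score per run_id; run 3 scores highest.
run_scores = {1: 0.11, 2: 0.05, 3: 0.90}

def classify_credits(words, run_scores):
    """Flag words in the top-scoring run and return that run's speaker label."""
    top_run = max(run_scores, key=run_scores.get)
    labeled = [{**w, "is_credit": w["run_id"] == top_run} for w in words]
    ad_speaker = next(w["speaker"] for w in labeled if w["is_credit"])
    return labeled, ad_speaker

labeled, ad_speaker = classify_credits(words, run_scores)

# Words left over once the credit recitation is filtered out.
non_credit_words = [w["word"] for w in labeled if not w["is_credit"]]
print(ad_speaker, non_credit_words)
```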

What follows is a series of tests on runs of words by speaker. WhisperX outputs a transcription of each word in linear order and assigns a speaker ID to whoever spoke that word. A "run of words by the same speaker" is an unbroken stretch of words with the same speaker label. If at least one word from another speaker label interrupts a run, a new run is started. Runs are then given a run_id in linear order through the piece so that certain properties of each run can be analyzed.
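
The run definition can be checked on a toy sequence. The sketch below is a plain-Python equivalent of the idea, using `itertools.groupby` in place of the pandas `shift` and `cumsum` approach that Kernel 3 uses. The speaker labels and the helper name `assign_run_ids` are hypothetical.

```python
from itertools import groupby

# Hypothetical speaker labels in spoken order, one per word.
speakers = ["SPEAKER_10", "SPEAKER_10", "SPEAKER_02", "SPEAKER_10", "SPEAKER_10"]

def assign_run_ids(speakers):
    """Number each unbroken stretch of identical speaker labels, starting at 1."""
    run_ids = []
    for run_id, (_, group) in enumerate(groupby(speakers), start=1):
        run_ids.extend(run_id for _ in group)
    return run_ids

run_ids = assign_run_ids(speakers)
print(list(zip(speakers, run_ids)))
```

Note that the same speaker receives a new run_id each time they resume after an interruption, so the two SPEAKER_10 stretches above become runs 1 and 3.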

The results of each test are normalized, weighted, and added together to form a 'credit score'. When the weights total 1.00, a score of 1 means that run_id achieved the top result on every test. The run with the score closest to 1 can be assumed to have the highest likelihood of being a recitation of credits. The credit scores are stored in a final dataframe and can be output to a csv, alongside a csv of the original WhisperX dataframe (df0) with the run_id columns appended. Analysis can then proceed using this classification along the dimension of each run_id's credit score.
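
To make the arithmetic concrete, the sketch below scores three made-up runs: each test column is divided by its maximum across runs, multiplied by a weight, and summed. The sample values and the helper name `credit_scores` are hypothetical, and the weights are illustrative ones chosen to total 1.00; they are not necessarily the weights used in Kernel 3.

```python
# Hypothetical raw test results for three runs.
runs = [
    {"run_id": 1, "run_length_total": 50.0, "credit_word_count": 0, "proper_noun_count": 2, "final_run": 0},
    {"run_id": 2, "run_length_total": 200.0, "credit_word_count": 1, "proper_noun_count": 5, "final_run": 0},
    {"run_id": 3, "run_length_total": 100.0, "credit_word_count": 4, "proper_noun_count": 10, "final_run": 1},
]

# Illustrative weights that total 1.00 (an assumption for this sketch).
WEIGHTS = {
    "run_length_total": 0.2,
    "credit_word_count": 0.4,
    "proper_noun_count": 0.3,
    "final_run": 0.1,
}

def credit_scores(runs, weights):
    """Scale each test by its maximum across runs, then take the weighted sum."""
    scores = {}
    for run in runs:
        total = 0.0
        for column, weight in weights.items():
            column_max = max(r[column] for r in runs)
            # Guard against dividing by zero when no run scored on a test.
            normalized = run[column] / column_max if column_max else 0.0
            total += normalized * weight
        scores[run["run_id"]] = round(total, 3)
    return scores

scores = credit_scores(runs, WEIGHTS)
print(scores)
```

In this toy data, run 3 is the longest-credit-word, most proper-noun-heavy, and final run, so it ranks first even though run 2 is the longest overall.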

In [None]:
## Module 2: Kernel 1 ##
## This kernel prepares a directory and json file to analyze ##
## NOTE: If continuing with selections and output from Module 1,
## and same session/runtime, then you can skip this kernel. ##

## You'll have the option to 1. Use Google Drive or 2. Upload into Colab ##
## Choosing Google Drive will prompt for permission, then ask for the directory path to the folder with your file ##
## Choosing the Upload option will not save files outside of the Colab session, so download the outputs before it ends! ##

import os
from google.colab import files, drive

# Step 1: Prompt user for storage option
file_choice = input("Choose an option:\n1 - Connect to Google Drive\n2 - Upload a file to sample_data\nEnter 1 or 2: ")

if file_choice == '1':
    # Step 2a: Mount Google Drive
    drive.mount('/content/drive')
    print("Google Drive mounted.")

    # Step 2b: Ask for directory path
    example_path = '/content/drive/MyDrive/your_folder_name'
    drive_path = input(f"Enter the full path to your Google Drive folder (e.g., {example_path}): ")

    if os.path.exists(drive_path):
        os.chdir(drive_path)
        print(f"Changed directory to: {drive_path}")
        # Step 2c: store file list in this directory to variable file_list and print that variable
        file_list = os.listdir()
        print("Files in this directory:", file_list)

    else:
        print("That path doesn't exist. Please double-check and try again.")

elif file_choice == '2':
    # Upload file
    uploaded = files.upload()

    # Save to /content/media_files (or any subfolder you prefer)
    media_dir = '/content/media_files'
    os.makedirs(media_dir, exist_ok=True)

    for filename in uploaded.keys():
      dest_path = os.path.join(media_dir, filename)
      os.rename(filename, dest_path)
      print(f"Saved {filename} to {dest_path}")

    # Step 3b: Change to the media_files directory
    os.chdir(media_dir)
    print("Changed directory to", media_dir)

    # Step 3c: store file list in this directory to variable file_list and print that variable
    file_list = os.listdir()
    print("Files in this directory:", file_list)

else:
    print("Invalid choice. Please enter 1 or 2.")


Choose an option:
1 - Connect to Google Drive
2 - Upload a file to sample_data
Enter 1 or 2: 2


Saving ADxPD_Plan9_Mix.json to ADxPD_Plan9_Mix.json
Saved ADxPD_Plan9_Mix.json to /content/media_files/ADxPD_Plan9_Mix.json
Changed directory to /content/media_files
Files in this directory: ['ADxPD_Plan9_Mix.json']


In [None]:
## Module 2: Kernel 2 ##
## NOTE: If continuing with selections and output from Module 1,
## and same session/runtime, then you can skip this kernel. ##

## If you haven't already loaded the json file into a dataframe for this session, ##
## you'll need to run this kernel which is a repeat of the final Module 1 kernel. ##

import pandas as pd
import json

json_file_name = input("Please enter the JSON file name (e.g., ADxPD_Plan9_Mix.json): ")

with open(json_file_name, 'r') as f:
  data = json.load(f)

all_words = []
for segment in data['segments']:
  for word_info in segment['words']:
    all_words.append(word_info)

df0 = pd.DataFrame(all_words)
df0 = df0.astype({'word': str, 'start': float, 'end': float, 'score': float, 'speaker': str})
pd.options.display.float_format = '{:,.3f}'.format

print("Original DataFrame")
print(df0)

Please enter the JSON file name (e.g., ADxPD_Plan9_Mix.json): ADxPD_Plan9_Mix.json
Original DataFrame
           word     start       end  score     speaker
0             ♪     2.680     8.151  0.901  SPEAKER_10
1            In     8.131     8.393  0.797  SPEAKER_10
2          bold     8.473     8.695  0.750  SPEAKER_10
3     lettering     8.755     9.077  0.852  SPEAKER_10
4          over     9.218     9.379  0.944  SPEAKER_10
...         ...       ...       ...    ...         ...
8800  Ghoulman, 4,689.279 4,689.723  0.601  SPEAKER_10
8801       Bela 4,689.743 4,689.925  0.580  SPEAKER_10
8802    Lugosi, 4,689.945 4,690.167  0.411  SPEAKER_10
8803        and 4,690.389 4,690.530  0.727  SPEAKER_10
8804  Criswell. 4,690.571 4,691.115  0.602  SPEAKER_10

[8805 rows x 5 columns]


In [None]:
## Module 2: Kernel 3 ##
## This kernel contains all of the analysis and dataframe appending for the Credit Score ##

# Create df1 as an iteration of our original df0, leaving df0 undisturbed by Module 2.
# Create run_id, a UID for every stretch of words of the same speaker value.
# Use a boolean series indicating where the speaker value changes
# from the previous row.
df1 = df0.assign(
 speaker_change_point = df0['speaker'] != df0['speaker'].shift(),
 run_id = (df0['speaker'] != df0['speaker'].shift()).cumsum()
)
# Save the df1 to a CSV file
# This aids in later analysis using the run_id as primary key
df1.to_csv(f'{json_file_name}_run_id.csv', index=False)

## CREDIT SCORE Test 1: "Long, Uninterrupted run_id" ##
# Count rows per run_id
run_counts = df1.groupby('run_id').size().reset_index(name='run_count')

# Get the first speaker value for each run_id
speakers = df1.groupby('run_id')['speaker'].first().reset_index()

# Merge counts with speaker values
run_length_words = run_counts.merge(speakers, on='run_id')

# Create a df for this entire Credit Scoring process with run_id as primary key
df_creditscores = run_length_words[['run_id', 'run_count']]

# Calculate length (in seconds) of each run_id
run_length_time = df1.groupby('run_id').agg(
    max_end=('end', 'max'),
    min_start=('start', 'min')
).reset_index()

run_length_time['run_length_time'] = run_length_time['max_end'] - run_length_time['min_start']

# Merge run_length_time into df_creditscores
df_creditscores = df_creditscores.merge(run_length_time[['run_id', 'run_length_time']], on='run_id', how='left')

# Multiply word count by duration to widen the spread between short and long runs for the final "length"
df_creditscores['run_length_total'] = df_creditscores['run_count'] * df_creditscores['run_length_time']
print("df_creditscores Test 1 added")
print(df_creditscores[['run_id','run_length_total','run_count','run_length_time']])

## CREDIT SCORE Test 2: Typical words from credits ##
# Define the list of target words appearing in typical credits recitation
target_credit_words = ['by', 'direct', 'produce', 'credit']

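# Create a regex pattern that matches any target as a substring
# (so 'direct' also matches 'Directed', but 'by' also matches 'nearby')
pattern = '|'.join(target_credit_words)

# Find rows where 'word' contains any of the target words (case-insensitive)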
matches = df1['word'].str.contains(pattern, case=False, na=False)

# Count matches per run_id
credit_word_count = df1[matches].groupby('run_id').size().reset_index(name='credit_word_count')

# Merge into df_creditscores
df_creditscores = df_creditscores.merge(credit_word_count, on='run_id', how='left')
df_creditscores['credit_word_count'] = df_creditscores['credit_word_count'].fillna(0).astype(int)
print("df_creditscores Test 2 added")
print(df_creditscores[['run_id','credit_word_count']])

## CREDIT SCORE Test 3: Proper Noun count ##
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Download required NLTK resources (run once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

# Group words by run_id
run_id_words = df1.groupby('run_id')['word'].apply(list).reset_index()

# Function to count proper nouns in a list of words
def count_proper_nouns(words):
    tokens = [str(w) for w in words]  # Ensure all are strings
    tagged = pos_tag(tokens)
    return sum(1 for word, tag in tagged if tag in ('NNP', 'NNPS'))

# Apply the function to each group
run_id_words['proper_noun_count'] = run_id_words['word'].apply(count_proper_nouns)

# Merge into df_creditscores
df_creditscores = df_creditscores.merge(run_id_words[['run_id', 'proper_noun_count']], on='run_id', how='left')
df_creditscores['proper_noun_count'] = df_creditscores['proper_noun_count'].fillna(0).astype(int)
print("df_creditscores Test 3 added")
print(df_creditscores[['run_id','proper_noun_count']])

## CREDIT SCORE Test 4: Labelling the Final Run ##
# Identify the maximum run_id
max_run_id = df1['run_id'].max()

# Add a new column to flag the final run with a 1 and everything else a 0
df_creditscores['final_run'] = df_creditscores['run_id'].apply(lambda x: 1 if x == max_run_id else 0)
print("df_creditscores Test 4 added")
print(df_creditscores[['run_id','final_run']])

## CREDIT SCORE Tests aggregation ##
# Store the maximum value of each test column in a variable
max_run_length = df_creditscores['run_length_total'].max()
max_credit_word_count = df_creditscores['credit_word_count'].max()
max_proper_noun_count = df_creditscores['proper_noun_count'].max()

# Create new columns with values normalized to the 0-1 range by dividing by each column's maximum
df_creditscores['run_length_normalized'] = df_creditscores['run_length_total'] / max_run_length
df_creditscores['credit_word_normalized'] = df_creditscores['credit_word_count'] / max_credit_word_count
df_creditscores['proper_noun_normalized'] = df_creditscores['proper_noun_count'] / max_proper_noun_count

## CREDIT SCORE Test weighting to combine with relative importance of each test ##
# Test weights can be adjusted based on specific needs or refinement of analysis
# Weights should total 1.00 for a normalized scoring system;
# the weights below total 1.10, so the highest possible score here is 1.10
df_creditscores['run_credit_score_total'] = (
    df_creditscores['run_length_normalized'] * 0.2 +
    df_creditscores['credit_word_normalized'] * 0.4 +
    df_creditscores['proper_noun_normalized'] * 0.3 +
    df_creditscores['final_run'] * 0.2
)
print("df_creditscores Full DataFrame with Credit Score added")
print(df_creditscores)

# Sort by 'run_credit_score_total' in descending order and display the top 20
top_20 = df_creditscores.sort_values(by='run_credit_score_total', ascending=False).head(20)
print("Top 20 Credit Scores Sorted in DataFrame")
print(top_20[['run_id','run_credit_score_total']])



df_creditscores Test 1 added
     run_id  run_length_total  run_count  run_length_time
0         1           748.045         41           18.245
1         2         8,578.385        139           61.715
2         3        13,249.488        168           78.866
3         4           843.359         43           19.613
4         5             2.364          3            0.788
..      ...               ...        ...              ...
517     518            24.876          6            4.146
518     519         3,868.716         87           44.468
519     520         1,327.326         66           20.111
520     521             0.928          2            0.464
521     522           398.844         36           11.079

[522 rows x 4 columns]
df_creditscores Test 2 added
     run_id  credit_word_count
0         1                  0
1         2                  0
2         3                  8
3         4                  0
4         5                  0
..      ...                ...
517  

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


df_creditscores Test 3 added
     run_id  proper_noun_count
0         1                  1
1         2                  6
2         3                121
3         4                  2
4         5                  0
..      ...                ...
517     518                  1
518     519                  2
519     520                 43
520     521                  2
521     522                 35

[522 rows x 2 columns]
df_creditscores Test 4 added
     run_id  final_run
0         1          0
1         2          0
2         3          0
3         4          0
4         5          0
..      ...        ...
517     518          0
518     519          0
519     520          0
520     521          0
521     522          1

[522 rows x 2 columns]
df_creditscores Credit Score added
     run_id  run_count  run_length_time  run_length_total  credit_word_count  \
0         1         41           18.245           748.045                  0   
1         2        139           61.715         8,5

In [None]:
## Module 2: Kernel 4 (Optional) ##
# Save the Credit Score DataFrame to a CSV file
df_creditscores.to_csv(f'{json_file_name}_creditscore.csv', index=False)