# Zoom Transcript Format Cleaner

#### Author: John K. Wagner, jkwagner@unm.edu

***

### The following notebook reformats Zoom transcripts so that each speaker's consecutive lines are concatenated into a single block rather than being split across many short entries.
 
### For this to work properly, the provided Zoom transcripts will need to have the *.txt* extension. 

***

#### Loading Libraries

In [13]:
import pandas as pd #Data management & manipulation
import os #General cmds
import re #Regex
from pathlib import Path #File/directory utility
from docx import Document #Exporting word docs (.docx)
from tqdm import tqdm #Progress bar for main loop
from ipyfilechooser import FileChooser #Widget for browsing to a folder path
import ipywidgets as widgets #For choosing output type(s)

#### Function for Folder Selection
These widgets let you select the folder where the transcripts are initially stored and the folder where you want the files saved after cleaning.

In [14]:
# Create and display a FileChooser widget
fc_transcripts = FileChooser()
fc_transcripts.show_only_dirs = True

fc_destination = FileChooser()
fc_destination.show_only_dirs = True

#### Select Folder Containing Original Transcripts (.txt extensions required)

In [15]:
display(fc_transcripts)

FileChooser(path='C:\Users\deere\Dropbox\SOE_climate_study\Transcript Cleaner', filename='', title='HTML(value…

#### Select Folder to Output Modified Transcripts To (Original Folder Structure Will Be Mirrored)

In [17]:
display(fc_destination)

FileChooser(path='C:\Users\deere\Dropbox\SOE_climate_study\Transcript Cleaner', filename='', title='HTML(value…

#### Select Output Format(s)
Use the checkboxes below to choose .txt, .docx, or both as the format(s) written to the output folder.

In [19]:
checkbox_txt = widgets.Checkbox(description = ".txt")
checkbox_docx = widgets.Checkbox(description = ".docx")
boxes = widgets.VBox([checkbox_txt, checkbox_docx])
print("Please select the output format of your choice. You may select both")
boxes

Please select the output format of your choice. You may select both


VBox(children=(Checkbox(value=False, description='.txt'), Checkbox(value=False, description='.docx')))

#### Checking Provided Folders
First we convert the selected folders to strings for easier use.

In [20]:
transcript_folder = str(fc_transcripts.selected_path)
output_folder = str(fc_destination.selected_path)

Then we check to make sure directories were provided properly. 

In [21]:
isDirectory_transcript = os.path.isdir(transcript_folder)
isDirectory_output = os.path.isdir(output_folder)

# Report which folder (if any) is not a valid directory
if not isDirectory_output:
    print("Error: invalid destination transcript folder provided.")
elif not isDirectory_transcript:
    print("Error: invalid original transcript folder provided.")
else:
    print("Folders successfully verified")

Folders successfully verified
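
Printing a message does not stop later cells from running against a bad folder. One possible alternative is to raise an exception instead. The following is only a minimal sketch: `require_directory` is a hypothetical helper name that does not appear in this notebook, and a temporary directory stands in for a folder chosen through the widget.

```python
import os
import tempfile

def require_directory(path, label):
    """Return path unchanged if it is an existing directory, otherwise raise.

    Hypothetical helper for illustration; label names the folder in the error.
    """
    if not os.path.isdir(str(path)):
        raise NotADirectoryError(f"Invalid {label} folder provided: {path!r}")
    return path

# A temporary directory stands in for a folder selected in the widget
with tempfile.TemporaryDirectory() as tmp:
    checked = require_directory(tmp, "original transcript")
    print("Folders successfully verified")
```

With this approach, an unselected folder (which becomes the string `'None'` after `str(...)`) would also be rejected, because that string is not an existing directory.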


#### Setting Up File List (Based on transcript_folder Directory)
First we set up a recursive search for .txt files under the transcript folder; `rglob` also looks inside subfolders and returns a generator rather than a list.

In [27]:
txt_folder = Path(transcript_folder).rglob('*.txt')

Then we turn that generator into a list of files.

In [28]:
files = list(txt_folder)
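
To illustrate what the recursive search picks up, here is a small self-contained sketch. The folder tree and file names are made up for the example, and everything is created inside a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build a small, made-up folder tree with transcripts at two depths
    (root / "week1").mkdir()
    (root / "a.txt").write_text("WEBVTT\n")
    (root / "week1" / "b.txt").write_text("WEBVTT\n")
    (root / "week1" / "notes.docx").write_text("not a transcript")

    # rglob yields matches lazily; sorting materializes them in a stable order
    found = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.txt"))
    print(found)  # ['a.txt', 'week1/b.txt']
```

Only the two .txt files are returned, including the one in the subfolder; the .docx file is ignored.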

We also convert output_folder and transcript_folder to path objects for later use.

In [29]:
output_dir = Path(output_folder)
base_dir = Path(transcript_folder)
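
These two path objects are what allow the original folder structure to be mirrored in the output folder. A minimal sketch of the idea follows, assuming made-up example paths; `mirrored_output_path` and the `example_*` names are hypothetical, and pure paths are used so that nothing is touched on disk.

```python
from pathlib import PurePosixPath

def mirrored_output_path(file, base_dir, output_dir):
    """Map a file under base_dir to the same relative location under output_dir."""
    return output_dir / file.relative_to(base_dir)

# Made-up example paths (pure paths, so no files or folders are created)
example_base = PurePosixPath("/data/transcripts")
example_output = PurePosixPath("/data/cleaned")
example_file = PurePosixPath("/data/transcripts/week1/interview.txt")

out_txt = mirrored_output_path(example_file, example_base, example_output)
out_docx = out_txt.with_suffix(".docx")
print(out_txt)   # /data/cleaned/week1/interview.txt
print(out_docx)  # /data/cleaned/week1/interview.docx
```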

#### Loop for Reading, Transforming, and Saving Each Transcript

In [31]:
# This loop reads each file in full and stores the result as a single string called content_str
for file in tqdm(files):
    # The with-statement closes the file even if reading fails
    with open(file, 'r') as f:
        content_str = f.read()
    
    #####
    ## Regex Patterns stored as Regex objects for identifying our three main parts of a speaking instance

    # -- for recognizing the number ahead of each speaking instance (based on a line containing only one number {1-12 digits} and \n)
    num_pattern = re.compile(r'^[1-9]\d{0,11}\n', re.MULTILINE)
    # -- for recognizing the time period of each speaking instance (based on time period format; \. matches a literal dot)
    time_pattern = re.compile(r'(^\d\d:\d\d:\d\d\.\d\d\d --> \d\d:\d\d:\d\d\.\d\d\d\n)', re.MULTILINE)
    # -- for recognizing the speaking instance itself (it's the line after the time)
    blurb_pattern = re.compile(r'^\d\d:\d\d:\d\d\.\d\d\d --> \d\d:\d\d:\d\d\.\d\d\d[\r\n]+([^\r\n]+)', re.MULTILINE)

    #####
    ## Apply each pattern to content_str, store each as a list
    num_temp = num_pattern.findall(content_str)
    time_temp = time_pattern.findall(content_str)
    blurb_temp = blurb_pattern.findall(content_str)
    
    ####
    # Transform our resulting three lists into a pandas dataframe for easier transformations
    df = pd.DataFrame({'Speech_Num': num_temp, 'Time' : time_temp, 'Speech' : blurb_temp})
    
    ####
    ## Cleaning up and prepping columns

    # -- Replacing errant newline characters
    # (regex is set explicitly because the default of str.replace changed to regex=False in pandas 2.0)
    df['Speech_Num'] = df['Speech_Num'].str.replace('\n', '', regex=False)
    df['Time'] = df['Time'].str.replace('\n', '', regex=False)

    # Extracting the Speaker from the Speech Column
    df['Speaker'] = df['Speech'].str.extract(r'^(.+?):')
    # Extracting the Speech from the Speech Column (removing speaker)
    df['Speech'] = df['Speech'].str.replace(r'^(.+?):', '', regex=True)

    # Separating Time Start and Time End for Later Transformation
    df['Time_Start'] = df['Time'].str.extract(r'^(.+?) -->')
    df['Time_End'] = df['Time'].str.extract(r' --> (.*)')

    # Check if prior speech instance had the same speaker (will allow us to collapse adjacent speakers' speaking instances ahead)
    df['Speaker_Match'] = df.Speaker.eq(df.Speaker.shift())
    
    ####
    ## Numbers each run of consecutive rows from the same speaker (for impending collapse)
    # A row starts a new run when Speaker_Match is False, so the running count of such rows is the run number
    df['Speaker_Num'] = (~df['Speaker_Match']).cumsum()
            
    ####
    ## Group Dataset by Speaker Number
    # Combine All Speech Blurbs per Speaker's Speaking Instance
    collapsed_df = df.groupby(['Speaker_Num']).agg({'Speech': ''.join, 'Speaker' : 'first', 'Time_Start' : 'first', 'Time_End' : 'last'})

    # Remove All Trailing and Leading Whitespace from Speeches
    collapsed_df.replace(r"^ +| +$", r"", regex=True, inplace=True)
    
    ####
    ## Creating Final String to Be Written to File

    # Create an initial string containing the preamble
    trnscpt_final = 'WEBVTT\n\n'

    # Loop to add in the df strings
    i = 0
    for index, row in collapsed_df.iterrows():

        # Add one to speaking instance number
        i = i + 1

        # Add speaking instance number to string
        trnscpt_final = trnscpt_final + str(i) + '\n'

        # Add time elapsed to next line per speaking instance
        trnscpt_final = trnscpt_final + row['Time_Start'] + ' --> ' + row['Time_End'] + '\n'

        # Add speaker at the start of the line
            # if speaker is not missing (this skips adding speaker when speaker is unknown):
        if pd.notna(row['Speaker']):
            trnscpt_final = trnscpt_final + str(row['Speaker']) + ': '

        # Add Speech Blurb on the same line, then end the line
        trnscpt_final = trnscpt_final + row['Speech'] + '\n'

        # Add newline between each speaking instance
        trnscpt_final = trnscpt_final + '\n'
    
    ####
    ## Create Export File Locations
        # Combine the desired path (output_dir) with the file's path relative to our base directory (base_dir)
    output_file_txt = output_dir.joinpath(file.relative_to(base_dir))
        # Alter file extension from txt to docx for alternate (desired) output format
    output_file_docx = output_file_txt.with_suffix('.docx')
    
    ####
    ## Export .txt file
    if checkbox_txt.value: #Check if user desired .txt output
        # Create missing folders if necessary in output directory
        os.makedirs(os.path.dirname(output_file_txt), exist_ok=True)
            # Open .txt for writing, write final transcript, and close the file automatically
        with open(output_file_txt, "wt") as text_file:
            text_file.write(trnscpt_final)
    
    ####
    ## Export .docx file
        #Check if user desired .docx output
    if checkbox_docx.value:
            # Create missing folders if necessary in output directory (needed when only .docx is selected)
        os.makedirs(os.path.dirname(output_file_docx), exist_ok=True)
            # Create stored document to write to
        document = Document()
            # Write final transcript to body of docx (add_paragraph)
        document.add_paragraph(trnscpt_final)
            # Write document to disk
        document.save(output_file_docx)

100%|██████████████████████████████████████████████████████████████████████████████████| 87/87 [00:08<00:00, 10.43it/s]
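
To show the effect of the loop on a small input, the core idea (parse each speaking instance, then merge consecutive instances from the same speaker) can be sketched with the standard library alone. This is a simplified stand-in, not the pandas-based code above: `CUE`, `collapse_cues`, and the sample transcript are hypothetical names and data introduced only for illustration, and the sketch assumes each speaking instance has exactly one line of text.

```python
import re

# One speaking instance = number line, time line, then a single line of text
CUE = re.compile(
    r'^(\d+)\r?\n'
    r'(\d\d:\d\d:\d\d\.\d\d\d) --> (\d\d:\d\d:\d\d\.\d\d\d)\r?\n'
    r'([^\r\n]+)',
    re.MULTILINE,
)

def collapse_cues(vtt_text):
    """Merge consecutive speaking instances that share the same known speaker."""
    merged = []
    for _num, start, end, line in CUE.findall(vtt_text):
        speaker, sep, speech = line.partition(':')
        if not sep:  # no "Name:" prefix, so the speaker is unknown
            speaker, speech = None, line
        speech = speech.strip()
        same_speaker = (
            bool(merged)
            and speaker is not None
            and merged[-1]['speaker'] == speaker
        )
        if same_speaker:
            # Extend the previous instance: keep its start, update its end
            merged[-1]['end'] = end
            merged[-1]['speech'] += ' ' + speech
        else:
            merged.append(
                {'speaker': speaker, 'start': start, 'end': end, 'speech': speech}
            )
    return merged

# Made-up sample transcript
sample = (
    "WEBVTT\n\n"
    "1\n00:00:01.000 --> 00:00:03.000\nAlice: Hello\n\n"
    "2\n00:00:03.500 --> 00:00:05.000\nAlice: everyone.\n\n"
    "3\n00:00:05.500 --> 00:00:07.000\nBob: Hi Alice.\n"
)

merged = collapse_cues(sample)
for i, cue in enumerate(merged, start=1):
    print(i, cue['start'], '-->', cue['end'], cue['speaker'], cue['speech'])
```

Here the two adjacent entries from Alice become one entry spanning 00:00:01.000 to 00:00:05.000 with the text "Hello everyone.", while Bob's entry is left as is. As in the loop above, entries with no identifiable speaker are never merged with their neighbors.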
