# Word Transcript Tabulator Recipe

This notebook is a computational recipe that helps you take a collection of transcripts in Word documents (_docx_ only) and turn them into a structured tabular format suitable for computational text analytics.

As a side effect, it also helps you identify some inconsistencies in your transcripts (for example, inconsistent speaker identifiers).

Note that this approach removes all formatting from your text: any information carried by **bold**, _italics_, or elements highlighted in different colours will be lost in this process.

## Workflow Overview

1. Upload your transcripts in `.docx` format to the transcripts folder.
2. Run the script to produce an initial file for examination and manual changes.
3. Check that the headers have been correctly identified, for example by reviewing the `header` column in the `turns` sheet of the output file.
4. Use the produced metadata to identify and fix inconsistencies in the transcripts *in Word*, then re-upload any changed versions and run the script again.
5. Fill in the metadata for `speaker_codes` and `speakers`: match the codes used in the transcripts to speaker_ids, and record any details known about the speakers.
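
Step 4 depends on spotting speaker codes that differ only trivially. As a minimal sketch, assuming a table shaped like the `speaker_codes` sheet, codes that collapse to the same value after lower-casing can be flagged as follows. The data is invented for illustration, and `find_code_variants` is a hypothetical helper that is not part of this notebook.

```r
# Invented example data in the shape of the speaker_codes sheet.
speaker_codes <- data.frame(
    source_file  = c("a.docx", "a.docx", "a.docx", "b.docx"),
    speaker_code = c("INT", "Int", "P", "INT"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: within each file, flag codes that differ only by case.
find_code_variants <- function(codes) {
    # Group on the file name plus the lower-cased code
    key <- paste(codes$source_file, tolower(codes$speaker_code), sep = "|")
    # Count the distinct spellings that share each key
    n_spellings <- tapply(codes$speaker_code, key, function(x) length(unique(x)))
    codes[as.vector(n_spellings[key] > 1), , drop = FALSE]
}

variants <- find_code_variants(speaker_codes)
print(variants)
```

Any rows returned here are candidates for correction in the Word document itself, after which the transcript can be uploaded and processed again.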

(This is going to need screenshots...)

## Data Model

## Libraries We'll Use

## TODO: how to do this iteratively?

## USECASE: while you're in data collection stage!

# Getting Started

1. Upload your transcripts in `.docx` format to the transcripts folder on the left. Nested folders are currently not supported.
2. Run the code below.


## TODO: Break down the functions and explain them one by one.

In [None]:
library(officer)
library(writexl)
library(dplyr)
library(stringr)

# Define some helper functions: these are how we extract individual units of data from your transcripts

# Try to guess where a transcript might start after any header information - this can be fixed in the next step.
guess_header_rows <- function(doc_summary){
    n_rows <- nrow(doc_summary)
    # Index of the last header row; 0 means no header was found.
    header_end <- 0

    # Go through all of the lines 1 by 1 until one matches a potential end of the header block
    for (i in seq_len(n_rows)){
        row <- doc_summary[i,]
        # Header might end with a first blank line (the blank line itself is counted as header)
        if (!is.na(row$text) && trimws(row$text) == '') {
            header_end <- i
            break
        }
        # Header ends just before the first list paragraph style (ie, if you use numbered lists to generate turn numbers)
        else if (!is.na(row$style_name) && row$style_name == 'List Paragraph'){
            header_end <- i - 1
            break
        }
    }

    # If we can't find an end of header match, header_end is still 0 and nothing is marked as the header.
    doc_summary$header <- seq_len(n_rows) <= header_end

    return(doc_summary)
}


# TODO: extract any timecodes

# Do the initial extraction from a particular file
prepare_transcripts <- function(transcript_file_path){
    transcript_df <- officer::docx_summary(officer::read_docx(transcript_file_path))
    
    # Attach the filepath as a column so we can trace this back
    transcript_df$source_file <- basename(transcript_file_path)
    
    # Keep a copy of the original text so we can always compare the processing
    transcript_df$source_text <- transcript_df$text
    
    # Extract the speaker codes via regex: a run of letters at the start of the
    # paragraph, up to but not including the colon that follows it.
    speaker_match <- '^[[:alpha:]]+?(?=:)'
    transcript_df$speaker_code <- stringr::str_extract(transcript_df$source_text, speaker_match)
    
    # And replace the original text with the speaker code (and its colon) removed
    transcript_df$text <- NULL
    transcript_df$text <- stringr::str_replace(transcript_df$source_text, '^[[:alpha:]]+?:', "")
    
    transcript_df <- guess_header_rows(transcript_df)
    
    return(transcript_df)
}
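
To make the speaker-code pattern concrete, here is a small sketch that applies the same regular expressions to a few invented transcript lines. It uses base R with `perl = TRUE` as a stand-in for the `stringr` calls, on the assumption that both behave the same for simple examples like these.

```r
# Invented transcript lines, for illustration only.
lines <- c("INT: How did you get started?",
           "P: I began in 2010.",
           "Speaker 1: Codes with spaces or digits are not matched.",
           "A paragraph with no speaker code.")

speaker_match <- "^[[:alpha:]]+?(?=:)"

# regexpr() returns -1 where there is no match, so start from NA
# and fill in only the positions that matched.
hits <- regexpr(speaker_match, lines, perl = TRUE)
speaker_code <- rep(NA_character_, length(lines))
speaker_code[hits > 0] <- regmatches(lines, hits)

# Remove the code and its colon from the start of the text.
text <- sub("^[[:alpha:]]+?:", "", lines, perl = TRUE)

print(data.frame(speaker_code, text))
```

Two things are worth noticing: a code such as `Speaker 1` is not recognised because the pattern accepts letters only, and the remaining text keeps the space that followed the colon (wrapping the result in `trimws()` would remove it).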

In [None]:
# Now let's actually load your transcripts

# List all of the docx files in the transcripts folder
# (pattern is a regular expression, not a wildcard, so anchor it on the extension)
transcript_files <- list.files("transcripts", full.names=TRUE, pattern="\\.docx$")

message("Loading ", length(transcript_files), " transcripts")

# TODO: figure out how to handle warnings without scaring people or leading them to ignore the warnings...
# Load each transcript, extract the paragraphs of the document as rows in a dataframe, using the officer package
loaded_docs <- lapply(transcript_files, prepare_transcripts)

# Combine the transcripts together
combined_transcripts <- bind_rows(loaded_docs)
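
One possible approach to the warnings TODO, offered as a sketch rather than a settled design, is to collect warnings per file instead of letting them print during loading. Both `load_with_warnings` and `toy_loader` below are hypothetical names; `toy_loader` simply stands in for `prepare_transcripts` so that the example runs without any Word files.

```r
# Hypothetical wrapper: run a loader on one file, collecting any warnings
# instead of printing them, so they can be reviewed per file afterwards.
load_with_warnings <- function(path, loader) {
    collected <- character()
    result <- withCallingHandlers(
        loader(path),
        warning = function(w) {
            collected <<- c(collected, conditionMessage(w))
            invokeRestart("muffleWarning")
        }
    )
    list(file = basename(path), result = result, warnings = collected)
}

# Toy loader standing in for prepare_transcripts(), for illustration only.
toy_loader <- function(path) {
    if (grepl("odd", path)) warning("unexpected content in ", path)
    data.frame(source_file = basename(path), text = "example")
}

loaded <- lapply(c("transcripts/ok.docx", "transcripts/odd.docx"),
                 load_with_warnings, loader = toy_loader)

# Number of warnings raised per file
warning_counts <- vapply(loaded, function(x) length(x$warnings), integer(1))
names(warning_counts) <- vapply(loaded, function(x) x$file, character(1))
print(warning_counts)
```

The data frames are still available in each element's `result`, so they could be combined as before while the collected messages are summarised in a separate, calmer report.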


In [None]:
# Now prepare three views of this dataset:

# Summary of the files and number of turns
file_turns <- combined_transcripts %>% 
    filter(header==FALSE) %>% 
    group_by(source_file) %>% 
    count(name="turns")

# Info about the speaker-codes in each transcript
speaker_code_summary <- combined_transcripts %>% 
    filter(header==FALSE) %>%
    group_by(source_file, speaker_code) %>%
    # .groups="drop" returns an ungrouped table and avoids the regrouping message
    summarise(n_turns=n(), first_turn=min(doc_index), last_turn=max(doc_index), .groups="drop")
    

# TODO: add an (initially empty) table for speaker information

sheets <- list(
    transcripts = file_turns, 
    turns = combined_transcripts, 
    speaker_codes = speaker_code_summary
)

write_xlsx(
    sheets,
    path='combined_transcripts.xlsx'
)
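
The header guesses can also be reviewed directly in R before opening the spreadsheet. The following sketch uses invented rows with the same column names as `combined_transcripts` and assumes the rows are in document order; it lists the rows flagged as header and the first row treated as a turn in each file.

```r
# Invented rows in the shape of combined_transcripts, for illustration only.
turns <- data.frame(
    source_file = c("a.docx", "a.docx", "a.docx", "b.docx", "b.docx"),
    doc_index   = c(1, 2, 3, 1, 2),
    source_text = c("Interview 1", "", "INT: Hello", "INT: Hi", "P: Hello"),
    header      = c(TRUE, TRUE, FALSE, FALSE, FALSE),
    stringsAsFactors = FALSE
)

# Rows currently flagged as header, to eyeball file by file.
header_rows <- turns[turns$header, c("source_file", "doc_index", "source_text")]
print(header_rows)

# The first row treated as a turn in each file: it should be real speech.
body <- turns[!turns$header, ]
first_turns <- body[!duplicated(body$source_file), c("source_file", "source_text")]
print(first_turns)
```

If a file's first turn still looks like title or metadata text, the header guess ended too early for that file and the `header` values need correcting by hand.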




In [None]:
# Display the speaker-code summary computed above for a quick check in the notebook
speaker_code_summary