# Word Transcript Tabulator

This notebook is a computational recipe for taking a collection of transcripts stored as Word documents (_docx_ only) and turning them into a structured tabular format suitable for computational text analytics.

As a side-effect of this, it will also help you check for and identify some inconsistencies in your transcripts (for example, inconsistent speaker identifiers).
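
As a rough illustration of the kind of inconsistency check meant here, the sketch below pulls speaker codes out of a few made-up turns and tabulates them. The example turns, the variable names, and the choice to allow digits in a code (`[[:alnum:]]`) are assumptions for illustration; the helper functions later in this notebook use their own pattern.

```r
# Hypothetical turns; the text before the first colon is treated as the speaker code.
turns <- c("INT: How did you find the course?",
           "P1: It was fine.",
           "Int: Anything else?",
           "(laughter)",
           "P1: No, that's all.")

# Find a run of letters/digits at the start of each turn that is followed by a colon.
m <- regexpr("^[[:alnum:]]+(?=:)", turns, perl = TRUE)

# Turns without a match get NA so the result stays aligned with the input.
speaker_code <- rep(NA_character_, length(turns))
speaker_code[m != -1] <- regmatches(turns, m)

# Count how often each code appears; rare variants are often typos.
code_counts <- table(speaker_code)
print(code_counts)

# Codes that differ only by letter case probably refer to the same speaker.
normalised <- tolower(names(code_counts))
case_variants <- names(code_counts)[normalised %in% normalised[duplicated(normalised)]]
print(case_variants)
```

Here `INT` and `Int` would be flagged as likely variants of one speaker, and the turn with no speaker code is kept as `NA` rather than dropped.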

Note that this approach will remove all formatting from your text: if you have information in **bold**, _italics_, or elements highlighted in different colours, that information will be lost in this process.

## Workflow Overview

## Data Model

## Libraries We'll Use


# Getting Started

1. Upload your transcripts in `.docx` format to the `transcripts` folder on the left. Files in nested folders are currently not supported.
2. Run the code below.
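
Before running the main cells, it can help to confirm that the upload worked. The following is a minimal sketch, not part of the notebook's own pipeline: `find_transcripts` is a hypothetical helper name, and skipping files whose names start with `~$` (the temporary lock files Word creates while a document is open) is an added assumption.

```r
# Hypothetical helper: list the .docx files directly inside a folder.
find_transcripts <- function(folder = "transcripts") {
  # `pattern` is a regular expression, so escape the dot and anchor at the end.
  files <- list.files(folder, pattern = "\\.docx$", full.names = TRUE,
                      ignore.case = TRUE)
  # Skip Word's temporary lock files, whose names start with "~$".
  files[!startsWith(basename(files), "~$")]
}

found <- find_transcripts("transcripts")
if (length(found) == 0) {
  message("No .docx files found in the transcripts folder yet.")
} else {
  message("Found ", length(found), " transcript(s).")
}
```

Because `list.files` is not called with `recursive = TRUE`, files in subfolders are ignored, which matches the note above about nested files.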

In [None]:
library(officer)
library(writexl)
library(tidyverse)

# Define some helper functions: these are how we extract individual units of data from your transcripts

# Try to guess where a transcript might start after any header information - this can be fixed in the next step.
guess_header_end <- function(doc_summary){
  # seq_len() avoids looping over c(1, 0) when the document has no rows
  for (i in seq_len(nrow(doc_summary))){
    row <- doc_summary[i,]
    # Header might end with a first blank line (isTRUE() guards against NA text)
    if (isTRUE(trimws(row$text) == '')) {
      return(i)
    }
    # Header ends with first list paragraph style (ie, if you use numbered lists to generate turn numbers)
    else if (isTRUE(row$style_name == 'List Paragraph')){
      return(i)
    }
  }
  # If we don't match any other rules, fall back to the first row,
  # i.e. assume there is no header at the top of the file.
  return(1)
}

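# Extract the speaker codes for each turn. A speaker identifier is assumed to be a run of letters
# from the start of the turn to the first colon (`:`).
handle_speaker_codes <- function(doc_summary){
    speaker_match <- '^[[:alpha:]]+?(?=:)'
    
    # Keep a copy of the original text so that it is left as is
    doc_summary$original_text <- doc_summary$text
    matches <- regexpr(speaker_match, doc_summary$text, perl = TRUE)
    # regmatches() only returns the turns that matched, so start from NA
    # and fill in the matches to keep speaker codes aligned with their rows
    has_match <- !is.na(matches) & matches != -1
    doc_summary$speaker_code <- rep(NA_character_, nrow(doc_summary))
    doc_summary$speaker_code[has_match] <- regmatches(doc_summary$text, matches)
    return(doc_summary)
}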

# TODO: text without speakers, turn as originally entered
# TODO: extract any timecodes

# Do the initial extraction from a particular file
extract_data <- function(transcript_file_path){
    transcript_df <- officer::docx_summary(officer::read_docx(transcript_file_path))
    
    # Attach the filepath as a column so we can trace this back
    transcript_df$source_file <- basename(transcript_file_path)
    
    return(transcript_df)
}



In [None]:
# Now let's actually load your transcripts

# List all of the docx files in the transcripts folder
# `pattern` is a regular expression (not a glob), so escape the dot and anchor at the end
transcript_files <- list.files("transcripts", full.names = TRUE, pattern = "\\.docx$")

"Loading:" 
transcript_files

# Load each transcript, extract the paragraphs of the document as rows in a dataframe, using the officer package
loaded_docs <- lapply(transcript_files, extract_data)

# Extract the estimated point the header ends.
# This is a stub for the segments table we'll need to generate to support this annotation.
header_ends <- lapply(loaded_docs, guess_header_end)

# Combine the transcripts together
combined_transcripts <- bind_rows(loaded_docs)


In [None]:
# Show the estimated header end (row index) for each loaded transcript
header_ends

In [None]:
# Run this once if tidyverse is not installed yet, then re-run the cells above
install.packages("tidyverse")