# Imports

In [1]:
using DataFrames
using CSV

# Metadata Config

Here we specify how we want our DataFrame structure to look. Each column name begins with a number because `DataFrame` sorts the keys of a `Dict` by name, so the prefixes force the columns into the order we want. `4_LineNo` is pre-filled with one entry per line of the file and then trimmed by the loop below, which also populates `5_Speaker` and `6_Text`. All we have to do here is specify the metadata attached to this transcript file. For example, the provided transcript comes from `Session` 8, `Activity` 2, `Group` 1.

In [2]:
data = Dict(
    "1_Session" => "#8",
    "2_Activity" => "#2",
    "3_Group" => "#1",
    "4_LineNo" => collect(1:countlines("data/transcript.txt")),
    "5_Speaker" => String[],
    "6_Text" => String[]
)

Dict{String, Any} with 6 entries:
  "4_LineNo"   => [1, 2, 3, 4, 5, 6, 7]
  "3_Group"    => "#1"
  "5_Speaker"  => String[]
  "2_Activity" => "#2"
  "1_Session"  => "#8"
  "6_Text"     => String[]

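The printed `Dict` above lists its keys in hash order rather than insertion order, which is why the numeric prefixes are needed. As a minimal sketch using only Base Julia, the snippet below sorts the keys to recover the intended order and then strips the prefixes. The helper `strip_prefix` is a hypothetical name, not part of the notebook, and the sample values are stand-ins.

```julia
# A plain Dict does not preserve insertion order, so the keys are sorted explicitly.
cols = Dict(
    "1_Session" => "#8",
    "2_Activity" => "#2",
    "3_Group" => "#1",
    "4_LineNo" => [1, 2, 3, 4],
    "5_Speaker" => String[],
    "6_Text" => String[],
)

# Sorting the keys recovers the intended column order thanks to the numeric prefixes.
ordered = sort(collect(keys(cols)))

# Hypothetical helper: drop the leading "<digits>_" once the order is fixed.
strip_prefix(name) = replace(name, r"^\d+_" => "")

clean = strip_prefix.(ordered)
println(ordered)
println(clean)
```

The sort is lexicographic, so with ten or more columns the prefixes would need zero-padding (`01_`, `02_`, ...) to stay in order. Once the DataFrame exists, something like `rename!(df, strip_prefix.(names(df)))` should apply the cleaned names.
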
# Process Transcript

This loop expects the `data/transcript.txt` file to be formatted like:

```txt
>> Alice: Are we Group 1 or 2? Just the group.

>> Beth Anne: One.

>> Carol: So should we just like throw up a whiteboard and start trying to like collectively sketch a prototype?

>> Alice: Yeah, let's do that. Do we want to do that on Slide 24?
```

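Each turn of talk starts with `>>`. Any line that does not, including a blank line, is treated as a continuation of the previous turn. If your transcript uses a different but similar format, small changes to the two regular expressions below should be enough.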

In [3]:
for line in eachline("data/transcript.txt")
    m = match(r">> (.+)", line) # Are we starting a new turn of talk? Signified by >> symbols.
    if !isnothing(m) # Yes, a new turn of talk.
        line = m[1]
        speaker = "UNDEFINED" # By default, if we can't find a valid speaker name, use UNDEFINED
        m = match(r"([^:]+):\s*(.*)", line) # Look for a name between >> and :
        if !isnothing(m) # Did we find a name? If so, keep it. Else, leave the default.
            speaker = m[1]
            line = m[2]
        end

        # Add the speaker and text to the data structure
        push!(data["5_Speaker"], speaker)
        push!(data["6_Text"], line)
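    else # No, the same turn of talk, so append whatever we find to our previous result.
        # Skip blank lines and any text before the first turn; join the rest with a space.
        if !isempty(data["6_Text"]) && !isempty(strip(line))
            data["6_Text"][end] *= " " * strip(line)
        end
        pop!(data["4_LineNo"]) # This line did not start a turn, so drop one line number.
    end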
end
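
One way to make the loop above easier to adapt to other transcript formats is to pull the two regular-expression steps into a small function. The sketch below is an assumption about how that might look, not the notebook's own code: `parse_turn` and its `marker` keyword are hypothetical names, and the marker is anchored to the start of the line, which the original pattern is not.

```julia
# Hypothetical helper: split one transcript line into (speaker, text).
# Returns `nothing` when the line does not start a new turn of talk.
function parse_turn(line::AbstractString; marker::Regex = r"^>> (.+)")
    m = match(marker, line)
    isnothing(m) && return nothing
    rest = m[1]
    parts = match(r"^([^:]+):\s*(.*)", rest)
    # Fall back to UNDEFINED when no "Name:" prefix is present.
    isnothing(parts) && return ("UNDEFINED", String(rest))
    return (String(parts[1]), String(parts[2]))
end

println(parse_turn(">> Beth Anne: One."))
println(parse_turn(">> [inaudible]"))
println(parse_turn("continuation of the previous turn"))
```

A transcript that marks turns with a leading dash instead of `>>` could then be handled by passing a different marker, for example `parse_turn("- Carol: Hi"; marker = r"^- (.+)")`.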

# Save and Output Results

Finally, create a DataFrame, write it to disk, and display it here for a preview. Note that the file is written tab-delimited despite its `.csv` extension, and that `data/output.csv` is listed in `.gitignore` to prevent accidentally pushing non-anonymized data to GitHub.

In [4]:
data = DataFrame(data)
CSV.write("data/output.csv", data, delim="\t")
display(data)

| | 1_Session | 2_Activity | 3_Group | 4_LineNo | 5_Speaker | 6_Text |
|---|---|---|---|---|---|---|
| | String | String | String | Int64 | String | String |
| 1 | #8 | #2 | #1 | 1 | Alice | Are we Group 1 or 2? Just the group. |
| 2 | #8 | #2 | #1 | 2 | Beth Anne | One. |
| 3 | #8 | #2 | #1 | 3 | Carol | So should we just like throw up a whiteboard and start trying to like collectively sketch a prototype? |
| 4 | #8 | #2 | #1 | 4 | Alice | Yeah, let's do that. Do we want to do that on Slide 24? |
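
Because the output is tab-delimited, it can help to pass the same delimiter when loading it again. The following is a minimal round-trip sketch under stated assumptions: it uses a small stand-in table and a temporary directory rather than the notebook's `data/output.csv`, and the variable names `df`, `path`, and `restored` are illustrative only.

```julia
using CSV
using DataFrames

# Small stand-in table with some of the same column names as the notebook's output.
df = DataFrame(
    "1_Session" => ["#8", "#8"],
    "5_Speaker" => ["Alice", "Beth Anne"],
    "6_Text" => ["Are we Group 1 or 2? Just the group.", "One."],
)

# Write to a temporary directory so nothing lands in data/.
path = joinpath(mktempdir(), "output.csv")
CSV.write(path, df; delim = '\t')

# Passing the delimiter explicitly avoids relying on auto-detection,
# since the .csv extension suggests commas rather than tabs.
restored = CSV.read(path, DataFrame; delim = '\t')
println(restored)
```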
