# Dataframe Preparation for Generalized Linear Regression Analysis of Annotated Chats

Also covers data preparation for the low-rank SVD analysis.

## README

Section A of this document constructs long-form versions of the data under various codebooks:
1. **Codebook 1**: The size-48 codebook consisting of all leaf nodes, with no programmatic incorporation of the multi-attribute structure for Helping and Questioning quotations that we enforced manually during the annotation phase.
2. **Codebook 2**: The size-897 (or 491 without "unknown" attribute options) codebook in which every possible combination of attributes assigned to a Helping or Questioning instance is treated as an individual code (and attributes cannot occur outside of this structure).
3. **Codebook 3**: The size-197 codebook constructed by dropping confidence and specificity information from Codebook 2.
4. **Codebook 4**: The size-92 codebook derived from Codebook 3 via the following steps: \
   a) Merge "guiding" questions into "guide interactively" help \
   b) Merge positive and negative confirmation into "confirmation" \
   c) Group contentDomains per the hierarchy (Figure 3)

## Settings

In [None]:
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import colors as mcolors
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess
from scipy.stats.distributions import norm
import statistics
from itertools import product

In [None]:
input_version = 7
input_file = "../data-management/output/clean/v{}/annotations_data.csv".format(input_version)
document_metadata_file = "../data-management/data/annotation-timeline.csv"

In [None]:
output = True
output_version = 5

outputdir = "derived-dataframes/regression-data-v{}".format(output_version)

Names of files that dataframes are output to:

In [None]:
# Long-form dataframe of annotations with Outcomes Codebook
# Output from Section A.1
codebook1_annotations_output = "codebook1_longform.csv"
codebook1_codes_output = "codebook1_codes.csv" # currently unused because the codes are easy to derive

# Long-form dataframe of annotations with Questioning/Helping Codebook
# Output from Section A.2
codebook2_annotations_output = "codebook2_longform.csv"
codebook2_codes_output = "codebook2_codes.csv"

# Long-form dataframe of annotations with Questioning/Helping minus details Codebook
# Output from Section A.3
codebook3_annotations_output = "codebook3_longform.csv"
codebook3_codes_output = "codebook3_codes.csv"

# Long-form dataframe of annotations with Questioning/Helping grouping attributes Codebook
# Output from Section A.4
codebook4_annotations_output = "codebook4_longform.csv"
codebook4_codes_output = "codebook4_codes.csv"

# Pooled 1-gram counts dataframe, where each row is a code-outcome with its corresponding
# number of observations
# Output from Section B.1 (the actual outputs are prefixed with the codebook version)
code_counts_output = "code-outcome_counts.csv"

# 1-gram counts dataframe, where each row is a conversation-annotator-code-speaker with
# its corresponding outcome and number of observations
# Output from Section B.2 (the actual outputs are prefixed with the codebook version)
conversation_1gram_counts_output = "conv-annotator-code-speaker_outcome-counts.csv.gz"

# 1-gram counts dataframe, where each row is a conversation-annotator and each column
# is a code-speaker (elements are counts)
# Output from Section B.3 (the actual outputs are prefixed with the codebook version)
countsmtx_output = "conv-annotator_code-speaker_counts.csv"

# 2-gram counts dataframe, where each row is a conversation-annotator-code1-speaker1-code2-speaker2
# with its corresponding outcome and number of observations
# Output from Section C.1
conversation_2gram_counts_output = "conv-annotator-code2-speaker2_outcome-counts.csv.gz"

# the above dataframe is massive (>3 million rows) and doesn't display on GitHub, so we'll also make a 20-line preview
# Output from Section C.1
small_conversation_2gram_counts_output = "small_" + conversation_2gram_counts_output[:-3]

if output:
    try:
        os.mkdir(outputdir)
    except FileExistsError:
        print("Output directory already exists; no action taken.")

### Utility to change the pandas dataframe display settings

In [None]:
#pd.set_option("display.max_colwidth", None)
#pd.reset_option("display.max_colwidth")

# util for displaying dataframes
# the defaults are actually 60 & 20, but that gets annoying
def show(da, rows = 20, cols = 20, width = None):
    # option_context applies the overrides only inside the block and restores the
    # previous settings afterwards, even if display() raises
    with pd.option_context("display.max_rows", rows, 
                           "display.max_columns", cols, 
                           "display.max_colwidth", width):
        display(da)

### Read in the data
First read in the main dataset

In [None]:
da0 = pd.read_csv(input_file, index_col=0, keep_default_na=False)
da0 = da0.drop(["da1.idx", "da2.idx"], axis=1)
da0.head(3)

In [None]:
da0["document.creationDateTime"].isna().value_counts()

## A. Long-form dataframe manipulation

Produces the following dataframes:
1. Copy of the input dataframe \
   Codebook size = 48

2. Join the co-located Questioning (question, contentDomain, & specificity) and Helping (communicationMechanism, contentDomain, certainty, & specificity) annotations into a single "Questioning" or "Helping" annotation (still containing all the auxiliary information) \
   Codebook size = 897

3. Join the co-located Questioning and Helping annotations like above, but drop less-interesting auxiliary information \
   Codebook size = 197

4. Further coarsening based on the hierarchy \
   Codebook size = 92 (see Codebook 4 in the README above)

### 0. Define utilities for column renaming

The variable naming here is not ideal for use in R and regression analysis (e.g. column names are too long, "interaction" means two things, etc.)

We will rename before outputting, but leave the names as-is in this script for legacy reasons

In [None]:
da0.columns

In [None]:
# define the renaming maps
colname_map = {"annotation.code" : "code", 
               "annotation.creatingUser" : "annotator", 
               "document.name" : "document", 
               "quote.speaker" : "speaker", 
               "quote.speakerIsLearner" : "speakerIsLearner", 
               "annotation.code.noOutcome" : "code.noOutcome", 
               "annotation.code.noRequestOutcome" : "code.noRequestOutcome"}

colterm_map = {"." : "_", "interaction" : "conversation"}

def replace_colterms(colname, colterm_map = colterm_map):
    for old, new in colterm_map.items():
        colname = colname.replace(old, new)
    return colname

def output_longform(da, dirname, fname, colname_map = colname_map, colterm_map = colterm_map):
    # the second rename is keyed on the original names, so columns already renamed
    # via colname_map (e.g. "code.noOutcome") keep their dots
    renamed = da.rename(columns={old : new for old, new in colname_map.items() if old in da.columns}
                       ).rename(columns={col : replace_colterms(col, colterm_map) for col in da.columns})
    # real one
    renamed.to_csv(os.path.join(dirname, fname), index=False)
    # mini one (first 30 rows) for previewing
    renamed.head(30).to_csv(os.path.join(dirname, "small_" + fname), index=False)
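
To make the effect of the two chained renames concrete, here is a small sketch on an empty toy dataframe. It re-declares a trimmed-down `colname_map` so the block is self-contained; the toy column list is an assumption for illustration, not the real input schema.

```python
import pandas as pd

# trimmed-down copies of the maps defined above
colname_map = {"annotation.code": "code",
               "annotation.code.noOutcome": "code.noOutcome"}
colterm_map = {".": "_", "interaction": "conversation"}

def replace_colterms(colname, colterm_map=colterm_map):
    for old, new in colterm_map.items():
        colname = colname.replace(old, new)
    return colname

# hypothetical column list (assumption, for illustration only)
toy = pd.DataFrame(columns=["annotation.code", "annotation.code.noOutcome",
                            "interaction.number", "quote.text"])

# same two-step rename as output_longform
renamed = toy.rename(columns={old: new for old, new in colname_map.items() if old in toy.columns}
                    ).rename(columns={col: replace_colterms(col) for col in toy.columns})
print(list(renamed.columns))
```

Because the second rename is keyed on the original column names, `code.noOutcome` keeps its dot, while `interaction.number` becomes `conversation_number` and `quote.text` becomes `quote_text`.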

### 1. Construct the first codebook & dataframe (`da1`)

The main thing we need to do here is "take votes" on request type and outcome for each interaction.

In [None]:
# take the end sentinels because there's exactly one per conversation-annotator (verified in csv cleaning script)
votesda = da0[da0["annotation.code"].str.startswith("Big picture of an interaction > resolveRequest")]

# take the vote (mode) across annotators for each conversation
reqcol = votesda.groupby(by=["document.name", "interaction.number"]).aggregate({"interaction.requests" : statistics.mode})

# rename for later
reqcol = reqcol.rename(columns={"interaction.requests" : "voted.interaction.requests"})
reqcol

In [None]:
# take the vote (mode) across annotators for each conversation
outcol = votesda.groupby(by=["document.name", "interaction.number"]).aggregate({"interaction.outcome" : statistics.mode})

# rename for later
outcol = outcol.rename(columns={"interaction.outcome" : "voted.interaction.outcome"})
outcol

In [None]:
da1 = pd.merge(da0, reqcol, how="left", on=["document.name", "interaction.number"])
da1 = pd.merge(da1, outcol, how="left", on=["document.name", "interaction.number"])
da1
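
As a sanity check on the voting step, the following sketch runs the same `groupby` plus `statistics.mode` pattern on toy data. The outcome labels `"resolved"` and `"unresolved"` are hypothetical placeholders, and the tie behavior described assumes Python 3.8 or later (earlier versions raise `StatisticsError` on ties).

```python
import statistics
import pandas as pd

# toy data: three annotators label conversation 1, two label conversation 2
# (outcome labels are hypothetical placeholders)
toy = pd.DataFrame({
    "document.name": ["doc1"] * 5,
    "interaction.number": [1, 1, 1, 2, 2],
    "interaction.outcome": ["resolved", "unresolved", "resolved",
                            "unresolved", "resolved"],
})

# take the vote (mode) across annotators for each conversation
voted = toy.groupby(by=["document.name", "interaction.number"]).aggregate(
    {"interaction.outcome": statistics.mode})
voted = voted.rename(columns={"interaction.outcome": "voted.interaction.outcome"})

# conversation 1 has a clear majority; conversation 2 is a 1-1 tie, which
# statistics.mode resolves to the first value in row order
print(voted)
```

The tie case matters here: with an even number of annotators, the voted value depends on row order, so it may be worth checking how many conversations are decided that way.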

#### Generate the codebook

In [None]:
da1_codes = np.sort(da1["annotation.code"].unique()).tolist()

#### Output

In [None]:
if output:
    #da1.rename(columns={old : new for old, new in colname_map.items() if old in da1.columns}
    #          ).rename(columns={col : replace_colterms(col) for col in da1.columns}
    #                  ).to_csv(os.path.join(outputdir, annotations_output), index=False)
    output_longform(da1, outputdir, codebook1_annotations_output)

### 2. Construct the second codebook & dataframe (`da2`)

In [None]:
# checking that these output the same thing, so quote GUIDs are unique the way I want them
#da1.groupby(by="quote.guid").count()["da2.idx"].value_counts()
da1.groupby(by=["document.name", "annotation.creatingUser", "quote.guid"]).count()["quote.text"].value_counts()

In [None]:
# make the result dataframe
da2 = da1.copy()
#da2["annotation.mainCodebook.code"] = da2["annotation.code"] # rows will change so this is useless

# drop columns that won't be well-defined anymore
da2 = da2.drop(["annotation.creationDateTime", "annotation.guid", 
                "annotation.codeRef.guid", "annotation.original_code"], 
               axis=1)

#### a. Define utility functions for dealing with malformed annotation clusters

In [None]:
# utility function: given multiple contentDomains, choose the highest-priority one
# (lowest priority number), breaking ties in favor of the less-frequent code
contentDomainsByCount = da2.loc[da1["annotation.code"].str.startswith("General message attributes > contentDomain"), 
                                "annotation.code"].str.split(" > ").str[-1].value_counts()
contentDomainsByCount = contentDomainsByCount.sort_values(ascending=True).index.to_list()
contentDomainsByCount = {i : k for k, i in enumerate(contentDomainsByCount)}
#print(contentDomainsByCount)

contentDomainPriorities = {"bug" : 0, 
                           "codeSpecifications" : 0, 
                           "codingConcept" : 0, 
                           "learningResources" : 0, 
                           "developmentStrategy" : 0, 
                           "testCases" : 0, 
                           "codingExperience" : 0, 
                           "errorMsg" : 1, 
                           "errorLine" : 2, 
                           "errorLocation" : 3, 
                           "codeOpinion" : 4, 
                           "originalCode" : 5, 
                           "proposedNewCode" : 5, 
                           "platformRelated" : 6} # FIXME this one is temporary

def chooseContentDomain(ser):
    if len(ser) == 1:
        return ser.iloc[0].split(" > ")[-1]
    elif len(ser) == 0:
        return "unknown"
    
    ls = ser.str.split(" > ").str[-1].to_list()
    c0 = ls[0]
    p0 = contentDomainPriorities[c0]
    for c1 in ls[1:]:
        p1 = contentDomainPriorities[c1]
        if p0 < p1 or c0 == c1:
            continue
        elif p0 > p1:
            c0, p0 = c1, p1
        else: # p0 == p1
            q0 = contentDomainsByCount[c0]
            q1 = contentDomainsByCount[c1]
            c0, p0 = (c0, p0) if q0 < q1 else (c1, p1)
    return c0

In [None]:
# utility function: given multiple communicationMechanisms, choose the highest-priority one
# (lowest priority number), breaking ties in favor of the less-frequent code
commMechsByCount = da2.loc[da1["annotation.code"].str.startswith("Explanations and help > communicationMechanism"), 
                               "annotation.code"].str.split(" > ").str[-1].value_counts()
commMechsByCount = commMechsByCount.sort_values(ascending=True).index.to_list()
commMechsByCount = {i : k for k, i in enumerate(commMechsByCount)}
#print(commMechsByCount)

commMechPriorities = {"explain" : 0, 
                      "implement" : 0, 
                      "guideInteractively" : 0, 
                      "suggest" : 1, 
                      "teachWithExtensions" : 2, 
                      "state" : 3, 
                      "positiveConfirmation" : 4, 
                      "negativeConfirmation" : 4}

def chooseCommunicationMechanism(ser):
    if len(ser) == 1:
        return ser.iloc[0].split(" > ")[-1]
    elif len(ser) == 0:
        return "unknown"
    
    ls = ser.str.split(" > ").str[-1].to_list()
    c0 = ls[0]
    p0 = commMechPriorities[c0]
    for c1 in ls[1:]:
        p1 = commMechPriorities[c1]
        if p0 < p1 or c0 == c1:
            continue
        elif p0 > p1:
            c0, p0 = c1, p1
        else: # p0 == p1
            q0 = commMechsByCount[c0]
            q1 = commMechsByCount[c1]
            c0, p0 = (c0, p0) if q0 < q1 else (c1, p1)
    return c0

In [None]:
# this one is totally unnecessary, but while we're over-engineering things, we might as well go all the way
# priority tiers: (checkForFollowing, guiding, personal, checkIfCorrect) outrank content

# utility function: given multiple question types, choose the highest-priority one
# (lowest priority number), breaking ties in favor of the less-frequent code
questionsByCount = da2.loc[da1["annotation.code"].str.startswith("Questions > question"), 
                               "annotation.code"].str.split(" > ").str[-1].value_counts()
questionsByCount = questionsByCount.sort_values(ascending=True).index.to_list()
questionsByCount = {i : k for k, i in enumerate(questionsByCount)}
print(questionsByCount)

questionPriorities = {"checkForFollowing" : 0, 
                      "guiding" : 0, 
                      "personal" : 0, 
                      "checkIfCorrect" : 0, 
                      "content" : 1}

def chooseQuestionType(ser):
    if len(ser) == 1:
        return ser.iloc[0].split(" > ")[-1]
    elif len(ser) == 0:
        return "unknown"
    
    ls = ser.str.split(" > ").str[-1].to_list()
    c0 = ls[0]
    p0 = questionPriorities[c0]
    for c1 in ls[1:]:
        p1 = questionPriorities[c1]
        if p0 < p1 or c0 == c1:
            continue
        elif p0 > p1:
            c0, p0 = c1, p1
        else: # p0 == p1
            q0 = questionsByCount[c0]
            q1 = questionsByCount[c1]
            c0, p0 = (c0, p0) if q0 < q1 else (c1, p1)
    return c0
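
The three chooser functions above apply the same rule: lowest priority number wins, and ties go to the less frequent code. A minimal sketch of that rule as one generic helper follows. The name `choose_by_priority` and the toy rarity ranks are hypothetical; it works on plain lists of short labels rather than on the pandas Series used above, and it assumes every label appears in both dictionaries.

```python
def choose_by_priority(labels, priorities, rarity_rank):
    """Pick the label with the lowest priority number, then the lowest rarity rank."""
    if len(labels) == 0:
        return "unknown"
    if len(labels) == 1:
        # mirror the functions above: a single label is returned without any lookup
        return labels[0]
    return min(labels, key=lambda c: (priorities[c], rarity_rank[c]))

# subset of contentDomainPriorities from above
priorities = {"bug": 0, "codingConcept": 0, "errorMsg": 1, "errorLine": 2}
# hypothetical rarity ranks (0 = least frequent), standing in for contentDomainsByCount
rarity_rank = {"codingConcept": 0, "bug": 1, "errorMsg": 2, "errorLine": 3}

print(choose_by_priority(["errorLine", "bug", "codingConcept"], priorities, rarity_rank))
print(choose_by_priority(["errorLine", "errorMsg"], priorities, rarity_rank))
print(choose_by_priority([], priorities, rarity_rank))
```

Since distinct codes have distinct rarity ranks, taking the minimum over `(priority, rarity_rank)` pairs gives the same result as the pairwise loop.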

#### b. Extract the helping quotes 

In [None]:
# get all quote GUIDs associated with "Explanations & help" via various criteria that 
# agree in theory but probably not in practice

# Helping instances and communicationMechanism annotations should correspond exactly
help_quote_guid1 = np.sort(da1.loc[da1["annotation.code"].str.startswith(
    "Explanations and help > communicationMechanism"), "quote.guid"].unique())

# Helping instances and confidenceLevel annotations should correspond exactly
help_quote_guid2 = np.sort(da1.loc[da1["annotation.code"].str.startswith(
    "Explanations and help > confidenceLevel"), "quote.guid"].unique())

# Helping instances should be a subset of contentDomain instances (Questioning
# instances also have these)
help_quote_guid3 = np.sort(da1.loc[da1["annotation.code"].str.startswith(
    "General message attributes > contentDomain"), "quote.guid"].unique())

# [explain, suggest, guideInteractively, & teachWithExtensions] Helping instances should 
# be a subset of specificity instances (Questioning instances also have these)
help_quote_guid4 = np.sort(da1.loc[da1["annotation.code"].str.startswith(
    "Questions > specificity"), "quote.guid"].unique())

len(help_quote_guid1), len(help_quote_guid2), len(help_quote_guid3), len(help_quote_guid4)
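
Comparing lengths only hints at whether the "correspond exactly" and "subset" expectations hold. One way to check them directly is with NumPy set operations, sketched below on hypothetical GUID arrays that stand in for `help_quote_guid1` through `help_quote_guid3`.

```python
import numpy as np

# hypothetical quote GUIDs (assumption, for illustration only)
comm_mech_guids = np.array(["q1", "q2", "q3"])        # has a communicationMechanism
conf_level_guids = np.array(["q1", "q3"])             # has a confidenceLevel
content_guids = np.array(["q1", "q2", "q3", "q4"])    # has a contentDomain

# quotes with a communicationMechanism but no confidenceLevel
missing_conf = np.setdiff1d(comm_mech_guids, conf_level_guids)

# quotes with a communicationMechanism but no contentDomain (expected to be empty)
missing_content = np.setdiff1d(comm_mech_guids, content_guids)

# True when every Helping quote also carries a contentDomain
is_subset = np.isin(comm_mech_guids, content_guids).all()

print(missing_conf, missing_content, is_subset)
```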

In [None]:
# separate out the communication mechanisms determining whether the Helping instance has 3 vs 4 attributes
commMechs4 = ["Explanations and help > communicationMechanism > explain", 
              "Explanations and help > communicationMechanism > suggest", 
              "Explanations and help > communicationMechanism > guideInteractively", 
              "Explanations and help > communicationMechanism > teachWithExtensions"]

# derive the complement
commMechs3 = da1["annotation.code"].unique()
commMechs3 = commMechs3.astype("U")
commMechs3 = commMechs3[np.char.startswith(commMechs3, "Explanations and help > communicationMechanism")]
commMechs3 = commMechs3.astype("object")
# index with the negated mask (np.delete only treats boolean arrays as masks in newer NumPy versions)
commMechs3 = commMechs3[~np.isin(commMechs3, commMechs4)]
commMechs3

In [None]:
# get the corresponding quote GUIDs
help_quote_guid5 = np.sort(da1.loc[da1["annotation.code"].isin(commMechs3), "quote.guid"].unique())
help_quote_guid6 = np.sort(da1.loc[da1["annotation.code"].isin(commMechs4), "quote.guid"].unique())

In [None]:
# make the shared part of the filter
# 1. annotation belongs to a quote with a confidenceLevel annotation
#    too many messages are missing these, so we'll deal with it later instead
#help_filter = da1["quote.guid"].isin(help_quote_guid2)

# 2. annotation belongs to a quote with a contentDomain annotation
help_filter = da1["quote.guid"].isin(help_quote_guid3)

#print(da1.loc[help_filter, "annotation.code"].value_counts().sort_index())

##### (i) Helping quotes with 3 attributes

In [None]:
# make the filter for Helping messages with 3 attributes

# 3. annotation belongs to one of the three categories of attributes
help3_filter = da1["annotation.code"].str.startswith("Explanations and help > confidenceLevel")
help3_filter |= da1["annotation.code"].str.startswith("General message attributes > contentDomain")
#help3_filter |= da1["annotation.code"].str.startswith("Questions > specificity")
for commMech in commMechs3:
    help3_filter |= (da1["annotation.code"] == commMech)

# 4. annotation belongs to a quote with a communicationMechanism annotation in the relevant category
help3_filter &= help_filter & da1["quote.guid"].isin(help_quote_guid5)
help3_filter.value_counts()

In [None]:
# make the results dataframe for this category
da2_help3 = da2[help3_filter]

da2_help3.shape

In [None]:
da2_help3.groupby(by="quote.guid").count()["quote.text"].value_counts()

In [None]:
tmp = da2_help3.groupby(by="quote.guid")
tmp = tmp.aggregate({"quote.text" : "count", 
                     "document.name" : "first", 
                     "annotation.creatingUser" : "first", 
                     "annotation.code" : 
                     [lambda ser : ser.str.startswith("Explanations and help > communicationMechanism").sum(), 
                      lambda ser : ser.str.startswith("Explanations and help > confidenceLevel").sum(), 
                      lambda ser : ser.str.startswith("General message attributes > contentDomain").sum()]
                    })
tmp.columns = ["count", "document.name", "annotation.creatingUser", "commMech.count", "conf.count", "content.count"]
tmp["dist"] = list(zip(tmp["commMech.count"].to_list(), tmp["conf.count"].to_list(), tmp["content.count"].to_list()))

In [None]:
tmp["dist"].value_counts()
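
For readers unfamiliar with the `dist` tuples, here is a toy sketch of the same per-quote tally built in a slightly different way (one boolean column per attribute category, summed per quote). The GUIDs and the `certain` confidence label are hypothetical.

```python
import pandas as pd

# toy annotations: q1 is well-formed, q2 lacks a confidenceLevel and has 3 contentDomains
toy = pd.DataFrame({
    "quote.guid": ["q1", "q1", "q1", "q2", "q2", "q2", "q2"],
    "annotation.code": [
        "Explanations and help > communicationMechanism > state",
        "Explanations and help > confidenceLevel > certain",  # hypothetical label
        "General message attributes > contentDomain > bug",
        "Explanations and help > communicationMechanism > state",
        "General message attributes > contentDomain > bug",
        "General message attributes > contentDomain > errorMsg",
        "General message attributes > contentDomain > errorLine",
    ],
})

prefixes = {"commMech": "Explanations and help > communicationMechanism",
            "conf": "Explanations and help > confidenceLevel",
            "content": "General message attributes > contentDomain"}

# count annotations of each attribute category per quote
counts = pd.DataFrame({
    name: toy["annotation.code"].str.startswith(prefix).groupby(toy["quote.guid"]).sum()
    for name, prefix in prefixes.items()
})

# (commMech, conf, content) tuple per quote; (1, 1, 1) is the well-formed case
dist = pd.Series(list(zip(counts["commMech"], counts["conf"], counts["content"])),
                 index=counts.index)
print(dist.value_counts())
```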

In [None]:
#show(tmp[tmp["count"] != 3], cols=None)

Code to check the above annotations without exactly 3 attributes:

In [None]:
# pd.set_option("display.max_colwidth", None)
# display(da2_help3[da2_help3["quote.guid"] == "10B67B86-6785-471B-BD23-E62282DC1105"])
# pd.reset_option("display.max_colwidth")

Construct the combined dataframe:

In [None]:
# note this works because I've already verified that all invalid Helping quotes
# have too many contentDomains XOR no confidenceLevel, and no other problems
def combinecodes_help3(codeser):
    commMechs = codeser[codeser.str.startswith("Explanations and help > communicationMechanism")]
    confLevels = codeser[codeser.str.startswith("Explanations and help > confidenceLevel")].unique()
    confLvl = confLevels[0].split(" > ")[-1] if len(confLevels) == 1 else "unknown"
    contentDomains = codeser[codeser.str.startswith("General message attributes > contentDomain")]
    return "Helping > ({}, {}, {})".format(chooseCommunicationMechanism(commMechs), 
                                           confLvl, 
                                           chooseContentDomain(contentDomains))

agg_dict = {col : "first" for col in da2_help3.columns}
agg_dict["annotation.code"] = combinecodes_help3

da2_help3 = da2_help3.groupby(by="quote.guid").aggregate(agg_dict).reset_index(drop=True)
show(da2_help3, cols=None)

In [None]:
da2_help3["annotation.code"].value_counts()

In [None]:
# compare with theoretical number of possibilities
# (n commMechs3) * (n confidenceLevels + 1 for "unknown") * (n contentDomains)
(4) * (2 + 1) * (14)
#np.sort(da1["annotation.code"].unique())

In [None]:
# Add columns breaking the combined code down into its attributes
da2_help3["code.primary"] = "Helping"

da2_help3["code.communicationMechanism"] = da2_help3["annotation.code"].str.split(" > ").str[1].str.split(", ").str[0].str[1:]
da2_help3["code.confidenceLevel"] = da2_help3["annotation.code"].str.split(", ").str[1]
da2_help3["code.contentDomain"] = da2_help3["annotation.code"].str.split(", ").str[2].str[:-1]

da2_help3["code.questionType"] = "N/A"
da2_help3["code.specificity"] = "N/A"
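
The chained `str.split` calls above depend on the exact layout of the combined code string. As an alternative sketch (not what this notebook uses), the same three attributes can be pulled out in one pass with `Series.str.extract` and named groups. The confidence label `certain` below is a hypothetical placeholder.

```python
import pandas as pd

codes = pd.Series(["Helping > (state, certain, bug)",          # "certain" is hypothetical
                   "Helping > (implement, unknown, proposedNewCode)"])

# one named group per attribute; rows that do not match would come back as NaN
pattern = (r"^Helping > \((?P<communicationMechanism>[^,]+), "
           r"(?P<confidenceLevel>[^,]+), "
           r"(?P<contentDomain>[^,)]+)\)$")
parts = codes.str.extract(pattern)
print(parts)
```

A side benefit of the regex form is that malformed codes surface as missing values instead of silently producing misaligned attribute columns.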

##### (ii) Helping quotes with 4 attributes

In [None]:
# make the filter for Helping messages with 4 attributes

# 3. annotation belongs to a quote with a communicationMechanism annotation in the relevant category
help_filter &= da1["quote.guid"].isin(help_quote_guid6)

# 4. annotation belongs to a quote with a specificity instance
#    too many messages are missing these, so we'll deal with it later instead
#help_filter &= da1["quote.guid"].isin(help_quote_guid4)

# 5. annotation belongs to one of the four categories of attributes
help4_filter = da1["annotation.code"].str.startswith("Explanations and help > confidenceLevel")
help4_filter |= da1["annotation.code"].str.startswith("General message attributes > contentDomain")
help4_filter |= da1["annotation.code"].str.startswith("Questions > specificity")
for commMech in commMechs4:
    help4_filter |= (da1["annotation.code"] == commMech)

help4_filter &= help_filter

In [None]:
# make the results dataframe for this category
da2_help4 = da2[help4_filter]

# keep track of the leftovers
da2 = da2[~help3_filter & ~help4_filter]

da2_help4.shape

In [None]:
da2_help4.groupby(by="quote.guid").count()["quote.text"].value_counts()

In [None]:
tmp = da2_help4.groupby(by="quote.guid")
tmp = tmp.aggregate({"quote.text" : "count", 
                     "document.name" : "first", 
                     "annotation.creatingUser" : "first", 
                     "annotation.code" : 
                     [lambda ser : ser.str.startswith("Explanations and help > communicationMechanism").sum(), 
                      lambda ser : ser.str.startswith("Explanations and help > confidenceLevel").sum(), 
                      lambda ser : ser.str.startswith("General message attributes > contentDomain").sum(), 
                      lambda ser : ser.str.startswith("Questions > specificity").sum()]
                    })
tmp.columns = ["count", "document.name", "annotation.creatingUser", 
               "commMech.count", "conf.count", "content.count", "specificity.count"]

tmp["dist"] = list(zip(tmp["commMech.count"].to_list(), tmp["conf.count"].to_list(), 
                       tmp["content.count"].to_list(), tmp["specificity.count"].to_list()))

#tmp[tmp["count"] != 4]
tmp["dist"].value_counts()

Code to check the above annotations without exactly one of each category of attributes:

In [None]:
# pd.set_option("display.max_colwidth", None)
# display(da2_help4[da2_help4["quote.guid"] == "027E3617-0493-423B-9819-3CE56C88DE99"])
# pd.reset_option("display.max_colwidth")

In [None]:
def combinecodes_help4(codeser):
    commMechs = codeser[codeser.str.startswith("Explanations and help > communicationMechanism")]
    confLevels = codeser[codeser.str.startswith("Explanations and help > confidenceLevel")].unique()
    confLvl = confLevels[0].split(" > ")[-1] if len(confLevels) == 1 else "unknown"
    contentDomains = codeser[codeser.str.startswith("General message attributes > contentDomain")]
    specificities = codeser[codeser.str.startswith("Questions > specificity")].unique()
    spec = specificities[0].split(" > ")[-1] if len(specificities) == 1 else "unknown"
    return "Helping > ({}, {}, {}, {})".format(chooseCommunicationMechanism(commMechs), 
                                               confLvl, 
                                               chooseContentDomain(contentDomains), 
                                               spec)

agg_dict = {col : "first" for col in da2_help4.columns}
agg_dict["annotation.code"] = combinecodes_help4

da2_help4 = da2_help4.groupby(by="quote.guid").aggregate(agg_dict).reset_index(drop=True)
show(da2_help4, cols=None)

In [None]:
da2_help4["annotation.code"].value_counts()

In [None]:
# compare with theoretical number of possibilities
# (n commMechs4) * (n confidenceLevels + 1 for "unknown") * (n contentDomains) * (n specificities + 1 for "unknown")
(4) * (2 + 1) * (14) * (2 + 1)
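
The product above can be cross-checked by enumerating the combinations explicitly with `itertools.product`. In this sketch the communication mechanisms and content domains are the ones named in this notebook, while the confidence and specificity labels are hypothetical placeholders; only their number (two plus "unknown") matters for the count.

```python
from itertools import product

comm_mechs4 = ["explain", "suggest", "guideInteractively", "teachWithExtensions"]
content_domains = ["bug", "codeSpecifications", "codingConcept", "learningResources",
                   "developmentStrategy", "testCases", "codingExperience", "errorMsg",
                   "errorLine", "errorLocation", "codeOpinion", "originalCode",
                   "proposedNewCode", "platformRelated"]
# placeholder labels (assumption): two real levels each, plus "unknown"
confidence_levels = ["conf_a", "conf_b", "unknown"]
specificities = ["spec_a", "spec_b", "unknown"]

# one code string per combination of attributes
help4_codes = ["Helping > ({})".format(", ".join(combo))
               for combo in product(comm_mechs4, confidence_levels,
                                    content_domains, specificities)]
print(len(help4_codes))
```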

In [None]:
# Add columns breaking the combined code down into its attributes

da2_help4["code.primary"] = "Helping"

da2_help4["code.communicationMechanism"] = da2_help4["annotation.code"].str.split(" > ").str[1].str.split(", ").str[0].str[1:]
da2_help4["code.confidenceLevel"] = da2_help4["annotation.code"].str.split(", ").str[1]
da2_help4["code.contentDomain"] = da2_help4["annotation.code"].str.split(", ").str[2]
da2_help4["code.specificity"] = da2_help4["annotation.code"].str.split(", ").str[3].str[:-1]

da2_help4["code.questionType"] = "N/A"

#### c. Extract the questioning quotes 

In [None]:
# get all quote GUIDs associated with "Questions" via various criteria

# Questioning instances and questions annotations should correspond exactly
ques_quote_guid1 = np.sort(da2.loc[da2["annotation.code"].str.startswith(
    "Questions > question"), "quote.guid"].unique())

# Questioning instances should be a subset of contentDomain annotations (Helping instances also have these)
ques_quote_guid2 = np.sort(da2.loc[da2["annotation.code"].str.startswith(
    "General message attributes > contentDomain"), "quote.guid"].unique())

# Questioning instances should be a subset of specificity annotations (Helping instances also have these)
ques_quote_guid3 = np.sort(da2.loc[da2["annotation.code"].str.startswith(
    "Questions > specificity"), "quote.guid"].unique())

len(ques_quote_guid1), len(ques_quote_guid2), len(ques_quote_guid3)

In [None]:
# make the filter
# 1. annotation belongs to a quote with a questions annotation
ques_filter = da2["quote.guid"].isin(ques_quote_guid1)

# 2. annotation belongs to a quote with a contentDomain annotation
ques_filter &= da2["quote.guid"].isin(ques_quote_guid2)

# 3. annotation belongs to a quote with a specificity annotation
#    too many messages are missing these, so we'll deal with it later instead
#ques_filter &= da2["quote.guid"].isin(ques_quote_guid3)

tmp = da2["annotation.code"].str.startswith("Questions > question")
tmp |= da2["annotation.code"].str.startswith("General message attributes > contentDomain")
tmp |= da2["annotation.code"].str.startswith("Questions > specificity")
ques_filter &= tmp

#print(da2.loc[ques_filter, "annotation.code"].value_counts().sort_index())
ques_filter.value_counts()

In [None]:
# make the results dataframe for this category
da2_ques = da2[ques_filter]

# keep track of the leftovers
da2 = da2[~ques_filter]

da2_ques.shape

In [None]:
da2_ques.groupby(by="quote.guid").count()["quote.text"].value_counts()

In [None]:
tmp = da2_ques.groupby(by="quote.guid")
tmp = tmp.aggregate({"quote.text" : "count", 
                     "document.name" : "first", 
                     "annotation.creatingUser" : "first", 
                     "annotation.code" : 
                     [lambda ser : ser.str.startswith("Questions > question").sum(), 
                      lambda ser : ser.str.startswith("General message attributes > contentDomain").sum(), 
                      lambda ser : ser.str.startswith("Questions > specificity").sum()]
                    })
tmp.columns = ["count", "document.name", "annotation.creatingUser", 
               "quesType.count", "content.count", "specificity.count"]

tmp["dist"] = list(zip(tmp["quesType.count"].to_list(), 
                       tmp["content.count"].to_list(), 
                       tmp["specificity.count"].to_list()))

tmp["dist"].value_counts()

In [None]:
def combinecodes_ques(codeser):
    #if len(codeser) > 3:
    #    contentDomains = codeser[codeser.str.startswith("General message attributes > contentDomain")]
    #    codeser = codeser[~codeser.str.startswith("General message attributes > contentDomain")]
    #    codeser[contentDomains.index[0]] = chooseContentDomain(contentDomains)
    
    #attr = ", ".join(codeser.sort_values().str.split(" > ").str[-1])
    #return "Questioning > ({})".format(attr)
    
    contentDomains = codeser[codeser.str.startswith("General message attributes > contentDomain")]
    specificities = codeser[codeser.str.startswith("Questions > specificity")].unique()
    spec = specificities[0].split(" > ")[-1] if len(specificities) == 1 else "unknown"
    questions = codeser[codeser.str.startswith("Questions > question")]
    return "Questioning > ({}, {}, {})".format(chooseContentDomain(contentDomains), 
                                               spec, 
                                               chooseQuestionType(questions))

agg_dict = {col : "first" for col in da2_ques.columns}
agg_dict["annotation.code"] = combinecodes_ques

da2_ques = da2_ques.groupby(by="quote.guid").aggregate(agg_dict).reset_index(drop=True)
show(da2_ques, cols=None)

In [None]:
# Add columns breaking the combined code down into its attributes

da2_ques["code.primary"] = "Questioning"

da2_ques["code.contentDomain"] = da2_ques["annotation.code"].str.split(" > ").str[1].str.split(", ").str[0].str[1:]
da2_ques["code.specificity"] = da2_ques["annotation.code"].str.split(", ").str[1]
da2_ques["code.questionType"] = da2_ques["annotation.code"].str.split(", ").str[2].str[:-1]

da2_ques["code.communicationMechanism"] = "N/A"
da2_ques["code.confidenceLevel"] = "N/A"

#### d. Join everybody back together

In [None]:
# a small number of miscellaneous Helping and Questioning attributes that were lying loose
# around the dataset are going to be dropped, but that's okay -- let's check how many
tmp = da2[da2["annotation.code"].str.startswith("Explanations and help") | 
          da2["annotation.code"].str.startswith("General message attributes") | 
          da2["annotation.code"].str.startswith("Questions")]
print(len(tmp))
tmp["annotation.code"].value_counts()

In [None]:
# take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
da2_others = da2[~(da2["annotation.code"].str.startswith("Explanations and help") | 
                   da2["annotation.code"].str.startswith("General message attributes") | 
                   da2["annotation.code"].str.startswith("Questions"))].copy()

In [None]:
# Add columns breaking the code down into its attributes (all "N/A" for non-Helping/Questioning codes)

da2_others["code.primary"] = da2_others["annotation.code"].copy()

da2_others["code.communicationMechanism"] = "N/A"
da2_others["code.confidenceLevel"] = "N/A"
da2_others["code.contentDomain"] = "N/A"
da2_others["code.questionType"] = "N/A"
da2_others["code.specificity"] = "N/A"

In [None]:
da2 = pd.concat([da2_help3, da2_help4, da2_ques, da2_others], axis=0)
da2 = da2.sort_values(["document.name", "quote.startPosition", "quote.endPosition", "annotation.creatingUser"])
da2 = da2.reset_index(drop=True)
assert len(da2) == len(da2_help3) + len(da2_help4) + len(da2_ques) + len(da2_others)

#### e. Do some postprocessing on masked code columns

In [None]:
da2["annotation.code.noOutcome"] = np.where(da2["annotation.code"].str.startswith("Helping > ") | 
                                            da2["annotation.code"].str.startswith("Questioning > "), 
                                            da2["annotation.code"], da2["annotation.code.noOutcome"])

In [None]:
da2["annotation.code.noRequestOutcome"] = np.where(da2["annotation.code"].str.startswith("Helping > ") | 
                                                   da2["annotation.code"].str.startswith("Questioning > "), 
                                                   da2["annotation.code"], da2["annotation.code.noRequestOutcome"])

In [None]:
show(da2, cols=None)

In [None]:
da2["code.communicationMechanism"].value_counts()

In [None]:
da2["code.confidenceLevel"].value_counts()

In [None]:
da2["code.contentDomain"].value_counts()

In [None]:
da2["code.specificity"].value_counts()

In [None]:
da2["code.questionType"].value_counts()

#### f. Generate the codebook

In [None]:
annser = pd.Series(np.sort(da1["annotation.code"].unique()))
short_commMechs3 = pd.Series(commMechs3).str.split(" > ").str[-1].to_list()
short_commMechs4 = pd.Series(commMechs4).str.split(" > ").str[-1].to_list()
confidenceLevels = annser[annser.str.startswith("Explanations and help > confidenceLevel")].str.split(" > ").str[-1].to_list() + ["unknown"]
contentDomains = annser[annser.str.startswith("General message attributes > contentDomain")].str.split(" > ").str[-1].to_list()
specificities = annser[annser.str.startswith("Questions > specificity")].str.split(" > ").str[-1].to_list() + ["unknown"]
questionTypes = annser[annser.str.startswith("Questions > question")].str.split(" > ").str[-1].to_list()
others = annser[~annser.str.startswith("Explanations and help") & 
                ~annser.str.startswith("General message attributes > contentDomain") & 
                ~annser.str.startswith("Questions")].to_list()

help3_codes = np.array(np.meshgrid(short_commMechs3, confidenceLevels, contentDomains), dtype="object")
help3_codes = help3_codes.T.reshape([-1, 3])
help3_codes = list(map(lambda ls : "Helping > ({})".format(", ".join(ls)), help3_codes))

help4_codes = np.array(np.meshgrid(short_commMechs4, confidenceLevels, contentDomains, specificities), dtype="object")
help4_codes = help4_codes.T.reshape([-1, 4])
help4_codes = list(map(lambda ls : "Helping > ({})".format(", ".join(ls)), help4_codes))

question_codes = np.array(np.meshgrid(contentDomains, specificities, questionTypes), dtype="object")
question_codes = question_codes.T.reshape([-1, 3])
question_codes = list(map(lambda ls : "Questioning > ({})".format(", ".join(ls)), question_codes))

da2_codes = help3_codes + help4_codes + question_codes + others

In [None]:
len(da2_codes)

In [None]:
# number of codes if we didn't have "unknown" as an option
(4 * 2 * 14) + (4 * 2 * 14 * 2) + (5 * 14 * 2) + len(others)
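
The codebook sizes quoted in the README can be cross-checked by hand. The following is a minimal sketch, not the notebook's own code: the attribute values are toy placeholders, `itertools.product` stands in for the `np.meshgrid` reshaping, and the count of 15 remaining codes is an assumption taken from the Codebook 3 size calculation.

```python
from itertools import product

# Toy attribute values, for illustration only (not the real codebook)
toy_commMechs = ["explain", "hint"]
toy_confidenceLevels = ["high", "low", "unknown"]
toy_contentDomains = ["bug", "sourceCode"]

# One code per combination of attribute values
toy_help_codes = ["Helping > ({})".format(", ".join(combo))
                  for combo in product(toy_commMechs, toy_confidenceLevels, toy_contentDomains)]
print(len(toy_help_codes))  # 12 = 2 * 3 * 2
print(toy_help_codes[0])    # Helping > (explain, high, bug)

# Same arithmetic at full scale; n_others = 15 is an assumption (see lead-in)
n_others = 15
with_unknown = (4 * 3 * 14) + (4 * 3 * 14 * 3) + (5 * 14 * 3) + n_others
without_unknown = (4 * 2 * 14) + (4 * 2 * 14 * 2) + (5 * 14 * 2) + n_others
print(with_unknown, without_unknown)  # 897 491
```

Adding "unknown" raises the number of confidence levels and specificities from 2 to 3 each, which accounts for the jump from 491 to 897 codes.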

#### g. Output

In [None]:
if output:
    output_longform(da2, outputdir, codebook2_annotations_output)
    #da2.to_csv(os.path.join(outputdir, codebook2_annotations_output), index=False)
    pd.Series(da2_codes).to_csv(os.path.join(outputdir, codebook2_codes_output), index=False)

### 3. Construct the third codebook and dataframe (`da3`)

#### a. Regenerate the codes in the long-form dataset

In [None]:
da3 = da2.copy()
da3 = da3.drop(["code.confidenceLevel", "code.specificity"], axis="columns")
da3["annotation.code"] = np.where(da3["code.primary"] == "Helping", 
                                  np.frompyfunc("Helping > ({}, {})".format, 2, 1)(
                                      da3["code.communicationMechanism"], da3["code.contentDomain"]), 
                                  da3["annotation.code"])
da3["annotation.code"] = np.where(da3["code.primary"] == "Questioning", 
                                  np.frompyfunc("Questioning > ({}, {})".format, 2, 1)(
                                      da3["code.questionType"], da3["code.contentDomain"]), 
                                  da3["annotation.code"])
show(da3, cols=None)
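
The `np.frompyfunc(... .format, 2, 1)` idiom above turns a string template into an elementwise function of two columns. A minimal sketch on a toy dataframe (the rows and the "Greeting" code are hypothetical) shows how it combines with `np.where` to rewrite only the Helping rows:

```python
import numpy as np
import pandas as pd

# Toy long-form rows (hypothetical values)
toy = pd.DataFrame({
    "code.primary": ["Helping", "Greeting"],
    "code.communicationMechanism": ["hint", "N/A"],
    "code.contentDomain": ["bug", "N/A"],
    "annotation.code": ["Helping > (hint, high, bug)", "Greeting"],
})

# Wrap str.format as a 2-input, 1-output elementwise function
fmt = np.frompyfunc("Helping > ({}, {})".format, 2, 1)
rebuilt = fmt(toy["code.communicationMechanism"].to_numpy(),
              toy["code.contentDomain"].to_numpy())

# Keep the rebuilt code only where the primary code is Helping
toy["annotation.code"] = np.where(toy["code.primary"] == "Helping",
                                  rebuilt, toy["annotation.code"])
print(toy["annotation.code"].tolist())  # ['Helping > (hint, bug)', 'Greeting']
```

The template is evaluated for every row, including non-Helping ones, and `np.where` then discards the results that are not needed.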

#### b. Do some postprocessing on masked code columns

In [None]:
da3["annotation.code.noOutcome"] = np.where(da3["annotation.code"].str.startswith("Helping > ") | 
                                            da3["annotation.code"].str.startswith("Questioning > "), 
                                            da3["annotation.code"], da3["annotation.code.noOutcome"])

In [None]:
da3["annotation.code.noRequestOutcome"] = np.where(da3["annotation.code"].str.startswith("Helping > ") | 
                                                   da3["annotation.code"].str.startswith("Questioning > "), 
                                                   da3["annotation.code"], da3["annotation.code.noRequestOutcome"])

#### c. Generate the codebook

In [None]:
# check how many codes were used
da3["annotation.code"].value_counts()

In [None]:
# compute the codebook size for comparison
(4 * 14) + (4 * 14) + (5 * 14) + 15

In [None]:
commMechs = annser[annser.str.startswith("Explanations and help > communicationMechanism")].str.split(" > ").str[-1].to_list()

help_codes = np.array(np.meshgrid(commMechs, contentDomains), dtype="object")
help_codes = help_codes.T.reshape([-1, 2])
help_codes = list(map(lambda ls : "Helping > ({})".format(", ".join(ls)), help_codes))

ques_codes = np.array(np.meshgrid(questionTypes, contentDomains), dtype="object")
ques_codes = ques_codes.T.reshape([-1, 2])
ques_codes = list(map(lambda ls : "Questioning > ({})".format(", ".join(ls)), ques_codes))

da3_codes = help_codes + ques_codes + others

In [None]:
len(da3_codes) # matches the hardcoded calculation, yay!

#### d. Output

In [None]:
if output:
    output_longform(da3, outputdir, codebook3_annotations_output)
    pd.Series(da3_codes).to_csv(os.path.join(outputdir, codebook3_codes_output), index=False)

### 4. Construct the fourth codebook and dataframe (`da4`)

#### a. Merge codes in the long-form dataset

In [None]:
da4 = da3.copy()

# Merge guiding codes
fil = da4["code.questionType"] == "guiding"
da4.loc[fil, "code.primary"] = "Helping"
da4.loc[fil, "code.questionType"] = "N/A"
da4.loc[fil, "code.communicationMechanism"] = "guideInteractively"
da4["annotation.code"] = np.where(fil, 
                                  np.frompyfunc("Helping > ({}, {})".format, 2, 1)(
                                      da4["code.communicationMechanism"], da4["code.contentDomain"]), 
                                  da4["annotation.code"])

# Merge confirmation codes
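# (str.endswith is case-sensitive; the values to merge are "positiveConfirmation" and "negativeConfirmation")
fil = da4["code.communicationMechanism"].str.endswith("Confirmation")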
da4.loc[fil, "code.communicationMechanism"] = "confirmation"
da4["annotation.code"] = np.where(fil, 
                                  np.frompyfunc("Helping > ({}, {})".format, 2, 1)(
                                      da4["code.communicationMechanism"], da4["code.contentDomain"]), 
                                  da4["annotation.code"])

# Merge contentDomain codes
grp = {"proposedNewCode"     : "sourceCode", 
       "originalCode"        : "sourceCode", 
       "codeOpinion"         : "sourceCode", 
       "bug"                 : "codeError", 
       "errorLocation"       : "codeError", 
       "errorMsg"            : "codeError", 
       "codingConcept"       : "higherLevelInstruction", 
       "developmentStrategy" : "higherLevelInstruction", 
       "learningResources"   : "higherLevelInstruction", 
       "codingExperience"    : "rapportBuilding", 
       "personalInfo"        : "rapportBuilding"}
# (codes that are not keys of grp, including "N/A", are left unchanged)
da4["code.contentDomain"] = da4["code.contentDomain"].replace(grp)

In [None]:
da4["annotation.code"] = np.where(da4["code.primary"] == "Helping", 
                                  np.frompyfunc("Helping > ({}, {})".format, 2, 1)(
                                      da4["code.communicationMechanism"], da4["code.contentDomain"]), 
                                  da4["annotation.code"])
da4["annotation.code"] = np.where(da4["code.primary"] == "Questioning", 
                                  np.frompyfunc("Questioning > ({}, {})".format, 2, 1)(
                                      da4["code.questionType"], da4["code.contentDomain"]), 
                                  da4["annotation.code"])
show(da4, cols=None)

In [None]:
# expect 529 guideInteractively instances
da4["code.communicationMechanism"].value_counts()

#### b. Do some postprocessing on masked code columns

In [None]:
da4["annotation.code.noOutcome"] = np.where(da4["annotation.code"].str.startswith("Helping > ") | 
                                            da4["annotation.code"].str.startswith("Questioning > "), 
                                            da4["annotation.code"], da4["annotation.code.noOutcome"])

In [None]:
da4["annotation.code.noRequestOutcome"] = np.where(da4["annotation.code"].str.startswith("Helping > ") | 
                                                   da4["annotation.code"].str.startswith("Questioning > "), 
                                                   da4["annotation.code"], da4["annotation.code.noRequestOutcome"])

#### c. Generate the codebook

In [None]:
# check how many codes were used
da4["annotation.code"].value_counts()

In [None]:
# compute the codebook size for comparison
(7 * 7) + (4 * 7) + 15

In [None]:
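# NOTE: these lists are modified in place, so this cell can only be run once per session
# (a second run raises ValueError because the removed items are already gone)
questionTypes.remove("guiding")

commMechs.remove("positiveConfirmation")
commMechs.remove("negativeConfirmation")
commMechs.append("confirmation")

contentDomains = np.concatenate([[c for c in contentDomains if c not in grp], np.unique(list(grp.values()))])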

help_codes = np.array(np.meshgrid(commMechs, contentDomains), dtype="object")
help_codes = help_codes.T.reshape([-1, 2])
help_codes = list(map(lambda ls : "Helping > ({})".format(", ".join(ls)), help_codes))

ques_codes = np.array(np.meshgrid(questionTypes, contentDomains), dtype="object")
ques_codes = ques_codes.T.reshape([-1, 2])
ques_codes = list(map(lambda ls : "Questioning > ({})".format(", ".join(ls)), ques_codes))

da4_codes = help_codes + ques_codes + others

In [None]:
len(da4_codes) # matches the hardcoded calculation, yay!

#### d. Output

In [None]:
if output:
    output_longform(da4, outputdir, codebook4_annotations_output)
    pd.Series(da4_codes).to_csv(os.path.join(outputdir, codebook4_codes_output), index=False)

## B. 1-gram frequency dataframes

REQUIRES: Section A has been run

### 1. Construct code-outcome counts dataframe `countsda1`

**Pooled `speaker`, `conversation`, `annotator`**

In [None]:
# for each version of the codebook
for k, (da, annls) in enumerate([(da1, da1_codes), (da2, da2_codes), (da3, da3_codes), (da4, da4_codes)]):
    # build the dataframe
    countsda1 = da[["annotation.code.noOutcome", 
                    "voted.interaction.outcome"]].value_counts()                         # compute the frequencies

    for idx in product(annls, ["F", "S"]):                                               # fill in empty rows
        if idx not in countsda1.index:
            countsda1[idx] = 0

    countsda1 = countsda1.sort_index().to_frame().reset_index()                          # fix the formatting

    countsda1 = countsda1.rename(columns={"annotation.code.noOutcome" : "code",          # fix the naming
                                          "voted.interaction.outcome" : "outcome", 
                                          0 : "count"})
    
    # preview for debugging
    show(countsda1, rows=5, cols=None, width=None)
    
    # output it
    if output:
        countsda1.to_csv(os.path.join(outputdir, "codebook{}_{}".format(k+1, code_counts_output)), index=False)
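
The zero-filling step can also be written as a single reindex against the full code-by-outcome grid. This is a sketch of an alternative under toy assumptions (the codes and rows are hypothetical), not the loop used above:

```python
import pandas as pd

# Toy annotations (hypothetical codes), one row per annotation
toy = pd.DataFrame({
    "annotation.code.noOutcome": ["greeting", "greeting", "thanks"],
    "voted.interaction.outcome": ["S", "S", "F"],
})
toy_codes = ["apology", "greeting", "thanks"]  # hypothetical codebook

# Observed (code, outcome) frequencies, as a Series with a two-level index
toy_counts = toy.value_counts()

# Every (code, outcome) pair the codebook allows
full_index = pd.MultiIndex.from_product([toy_codes, ["F", "S"]], names=["code", "outcome"])

# Align to the full grid; pairs that never occurred get a count of 0
toy_counts = toy_counts.reindex(full_index, fill_value=0).sort_index()
toy_counts = toy_counts.rename("count").reset_index()
print(toy_counts)
```

The result has one row per (code, outcome) pair, six in this toy case, including the two all-zero rows for the unused "apology" code.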

### 2. Construct conversation-annotator-code-speaker-outcome counts dataframe `countsda2`

In [None]:
#        'quote.text', 'annotation.code', 'annotation.creatingUser',
#        'annotation.creationDateTime', 'quote.startPosition',
#        'quote.endPosition', 'quote.creatingUser', 'quote.creationDateTime',
#        'quote.modifyingUser', 'quote.modifiedDateTime', 'document.name',
#        'document.creatingUser', 'document.creationDateTime',
#        'document.modifyingUser', 'document.modifiedDateTime',
#        'document.plainTextPath', 'document.richTextPath', 'annotation.guid',
#        'annotation.codeRef.guid', 'quote.guid', 'document.guid',
#        'quote.paragraphStartPosition', 'quote.paragraphEndPosition',
#        'quote.paragraphText', 'quote.speaker', 'quote.speakerIsLearner',
#        'annotation.original_code', 'interaction.number', 'interaction.len',
#        'interaction.requests', 'interaction.outcome', 'interaction.strict',
#        'interaction.strict_len', 'annotation.code.noOutcome',
#        'annotation.code.noRequestOutcome', 'voted.interaction.requests', 'voted.interaction.outcome'

In [None]:
def construct_conv_ann_speaker_outcome_longform_counts(k, da, annls):
    # compute the frequencies
    countsda2 = da[["document.name", "interaction.number", "annotation.creatingUser", # group by interaction-annotator
                    "annotation.code.noOutcome", "quote.speakerIsLearner",            # group by annotation and speaker
                    "voted.interaction.requests", "interaction.requests",             # keep these around
                    "voted.interaction.outcome", "interaction.outcome"]].value_counts()
    countsda2 = countsda2.sort_index().to_frame()                                     # fix the formatting
    countsda2 = countsda2.rename(columns={0 : "count"})                               # fix the naming

    # debugging
    #show(countsda2.head(2))

    # pivot to fill in empty values (and compute conversation lengths)
    countsda2 = countsda2.unstack(level=["annotation.code.noOutcome", "quote.speakerIsLearner"], 
                                  fill_value=0).copy()
    
    # add missing code-speaker columns
    cols_to_add = []
    for code, is_learner in product(annls, [False, True]):  # itertools.product keeps the booleans as booleans
        idx = ("count", code, is_learner)
        if idx not in countsda2.columns:
            cols_to_add.append(idx)
    #print("Included code-speakers:", countsda2.shape[1])
    #print("Missing code-speakers:", len(cols_to_add))
    countsda2 = pd.concat([countsda2, 
                           pd.DataFrame(0, index=countsda2.index, columns=cols_to_add)], 
                          axis = 1) #countsda2[cols_to_add] = 0
    
    # debugging
    #show(countsda2, cols=None)
    #break

    # compute the conversation lengths according to each annotator (for codebook 1 this should
    # agree with the version in the dataframe)
    tmp = countsda2["count"].sum(axis=1) #.astype(np.int64)
    assert(tmp.min() >= 2)

    # we have to make a lot of copies of this column because it doesn't broadcast automatically when we restack
    for col in countsda2.columns:
        countsda2[("interaction.length", col[1], False)] = tmp
        countsda2[("interaction.length", col[1], True)] = tmp

    # make the whole thing vertical again (apparently this fills in more empty values)
    countsda2 = countsda2.stack(level=["annotation.code.noOutcome", "quote.speakerIsLearner"], 
                                dropna=False)
    countsda2 = countsda2.fillna(0).astype(np.int64)

    # debugging
    #show(countsda2, cols=None)

    # clean up some formatting things
    countsda2 = countsda2.reset_index() # level="interaction.outcome"

    countsda2 = countsda2.rename(columns={"document.name" : "document",  
                                          "interaction.number" : "conversation_number",
                                          "annotation.creatingUser" : "annotator", 
                                          "voted.interaction.requests" : "request", 
                                          "voted.interaction.outcome" : "outcome", 
                                          "annotation.code.noOutcome" : "code", 
                                          "quote.speakerIsLearner" : "speakerIsLearner", 
                                          "interaction.requests" : "nominal_request", 
                                          "interaction.outcome" : "nominal_outcome", 
                                          "annotation.code.noRequestOutcome" : "code_noRequest", 
                                          "interaction.length" : "conversation_length"})
    # debugging
    #print(countsda2.shape)
    #show(countsda2.iloc[0:5])
    #show(countsda2.iloc[94:100])

    # derived columns
    assert(len(countsda2[countsda2["conversation_length"] <= 1]) == 0)               # data validity check
    countsda2["ln_conversation_length"] = np.log(countsda2["conversation_length"])   # derived column (offset)
    assert((countsda2.isnull().sum() == 0).all())                                    # data validity check FIXME

    np.power(countsda2["count"], 1/2).hist(figsize=(11, 4), bins=40)                 # visualize the output
    plt.title("Distribution of conversation-annotator-code-speaker frequencies")
    plt.xlabel("Square root count")
    plt.ylabel("Number of conv-ann-code-speakers")
    plt.show()

    countsda2["conversation_sharedID"] = list(zip(countsda2["document"],             # derived column (grouping variable)
                                                  countsda2["conversation_number"]))
    countsda2["conversation_uniqueID"] = list(zip(countsda2["document"],             # derived column (grouping variable)
                                                  countsda2["annotator"], 
                                                  countsda2["conversation_number"]))
    # debugging
    show(countsda2, rows=4, cols=None)

    # output
    if output:
        countsda2.to_csv(os.path.join(outputdir, 
                                      "codebook{}_{}".format(k+1, conversation_1gram_counts_output)), 
                         index=False, 
                         compression="gzip")
    
    # need this for the next thing
    return countsda2
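
The unstack/stack round trip in this function is what creates the zero-count rows. A minimal sketch on toy data (hypothetical conversations and codes, with a single code level rather than code plus speaker) illustrates the idea, and attaches the conversation length by lookup as one possible alternative to copying it into a column per code:

```python
import numpy as np
import pandas as pd

# Toy long-form counts (hypothetical conversations and codes)
toy_long = pd.DataFrame({
    "conversation": ["c1", "c1", "c2"],
    "code": ["greeting", "thanks", "greeting"],
    "count": [2, 1, 3],
})

# Pivot wide so that every (conversation, code) cell exists; absent pairs become 0
toy_wide = toy_long.set_index(["conversation", "code"])["count"].unstack(level="code", fill_value=0)

# Conversation length = total number of annotations in the conversation
toy_lengths = toy_wide.sum(axis=1)

# Back to long form, now including the zero-count rows
toy_filled = toy_wide.stack().rename("count").reset_index()

# Attach the length by lookup on the conversation key
toy_filled["conversation_length"] = toy_filled["conversation"].map(toy_lengths)
toy_filled["ln_conversation_length"] = np.log(toy_filled["conversation_length"])
print(toy_filled)
```

Here conversation `c2` gains an explicit zero row for "thanks", so the long-form frame has one row per conversation and code.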

In [None]:
# for each version of the codebook
countsda2ls = []
for k, (da, annls) in enumerate([(da1, da1_codes), (da2, da2_codes), (da3, da3_codes), (da4, da4_codes)]): 
    # preprocessing: annls includes outcomes so let's fix that
    annls = annls.copy()
    annls.remove("Big picture of an interaction > resolveRequest > failure") 
    annls.remove("Big picture of an interaction > resolveRequest > success")
    annls.append("Big picture of an interaction > resolveRequest")
    
    # now do the thing
    countsda2 = construct_conv_ann_speaker_outcome_longform_counts(k, da, annls)
    countsda2ls.append(countsda2)

### 3. Construct conversation-annotator x code-speaker counts matrix `countsmtx1`

In [None]:
for k, countsda2 in enumerate(countsda2ls):
    # make it fat
    countsmtx1 = countsda2.pivot(index=["document", "conversation_number", 
                                        "annotator", 
                                        "request", "outcome", 
                                        "nominal_request", "nominal_outcome", 
                                        "conversation_length"], 
                                 columns=["code", "speakerIsLearner"], values="count")
    # debugging
    #show(countsmtx1.head(4))

    # column name formatting (booleans to strings)
    countsmtx1.columns = pd.MultiIndex.from_product(
        iterables = [countsmtx1.columns.levels[0], 
                     pd.Index(["Helper", "Learner"], dtype="object", name="speaker")])
    # debugging
    #show(countsmtx1.head(4))

    # column name formatting (MultiIndex to tuples)
    countsmtx1.columns = countsmtx1.columns.to_flat_index()
    # debugging
    show(countsmtx1.head(4))

    # row name formatting (MultiIndex to columns)
    countsmtx1 = countsmtx1.reset_index()

    # output
    if output:
        countsmtx1.to_csv(os.path.join(outputdir, "codebook{}_{}".format(k+1, countsmtx_output)), index=False)
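
The column relabeling above assumes that the pivoted columns come out sorted with every code-speaker pair present. A sketch of an alternative on toy data (hypothetical codes) derives each label from the column's own boolean, so it does not depend on column order:

```python
import pandas as pd

# Toy long-form counts for one conversation (hypothetical codes)
toy_counts = pd.DataFrame({
    "conversation": ["c1", "c1", "c1", "c1"],
    "code": ["greeting", "greeting", "thanks", "thanks"],
    "speakerIsLearner": [False, True, False, True],
    "count": [1, 0, 0, 2],
})

# One row per conversation, one column per (code, speakerIsLearner) pair
toy_mtx = toy_counts.pivot(index="conversation",
                           columns=["code", "speakerIsLearner"], values="count")

# Relabel each column from its own boolean, then flatten to plain tuples
toy_mtx.columns = pd.MultiIndex.from_tuples(
    [(code, "Learner" if is_learner else "Helper") for code, is_learner in toy_mtx.columns],
    names=["code", "speaker"]).to_flat_index()
print(toy_mtx.columns.tolist())
```

Because each label is computed from the existing column key, a missing or reordered column cannot shift the Helper and Learner names onto the wrong counts.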

## C. 2-gram frequency dataframes

REQUIRES: Section A has been run

NOTE: this isn't updated for new codebooks yet

### 1. Construct conversation-annotator-2gram-speakers-outcome counts dataframe `countsda3`

In [None]:
# INPUT
# da        : a long-form dataframe, for a single conversation-annotator, where each item is an annotation, in sorted order
# codecol   : specify which column to read codes from 
#             (annotation.code, annotation.code.noOutcome, or annotation.code.noRequestOutcome)
# annls     : alphabetical list of all codes (for indexing)
# speakerls : [True, False] = ["learner", "helper"]
#
# OUTPUT
# counts    : a counts dataframe indexed by (code 1, speaker 1, code 2, speaker 2)
def count2grams(da, codecol, annls, speakerls=[True, False]):
    # First, initialize a counts matrix of all zeros
    idx = pd.MultiIndex.from_product(iterables=[annls, speakerls, annls, speakerls], 
                                     names=["code1", "speakerIsLearner1", "code2", "speakerIsLearner2"])
    counts = pd.DataFrame(data=0, index=idx, columns=["count"], dtype=np.int64)
    
    # Second, iterate through the conversation, adding to the counts dataframe
    i1, j1 = 0, 0 # start and stop indices of all colocated annotations to act as code 1
    
    # Find the first code 1 location range in the conversation
    while j1 < len(da) and da.iloc[j1]["quote.startPosition"] == da.iloc[i1]["quote.startPosition"]:
        j1 += 1
    
    # j1 is an exclusive end index, so the code 2 range starts at j1 itself
    # (starting at j1+1 would skip the annotation at index j1)
    i2, j2 = j1, j1 # start and stop indices of all colocated annotations to act as code 2
    
    while j1 < len(da): # for each biclique
        # iterate over the code 2 location range
        while j2 < len(da) and da.iloc[j2]["quote.startPosition"] == da.iloc[i2]["quote.startPosition"]:
            # write this biclique into the counts matrix
            for k1 in range(i1, j1):
                counts.loc[(da.iloc[k1][codecol], da.iloc[k1]["quote.speakerIsLearner"], 
                           da.iloc[j2][codecol], da.iloc[j2]["quote.speakerIsLearner"]), 
                           "count"] += 1
            j2 += 1
        
        # move to the next code 1 location range
        i1, j1 = i2, j2
        i2, j2 = j1, j1
    
    # store the conversation length
    counts["conversation_length"] = len(da)
    
    return counts
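
The biclique idea is that every annotation at one start position is paired with every annotation at the next distinct start position. A simplified, standalone sketch of that idea on toy data (hypothetical codes and positions; it uses `itertools.groupby` and a `Counter` rather than the dataframe-based function above) looks like this:

```python
from collections import Counter
from itertools import groupby

# Toy conversation: (startPosition, code, speakerIsLearner), sorted by position
toy_annotations = [
    (0, "greeting", True),
    (0, "question", True),
    (5, "hint", False),
    (9, "thanks", True),
]

# Group colocated annotations (same start position) together
toy_groups = [list(g) for _, g in groupby(toy_annotations, key=lambda a: a[0])]

# Pair every annotation in a group with every annotation in the following group
toy_2grams = Counter()
for first, second in zip(toy_groups, toy_groups[1:]):
    for _, code1, speaker1 in first:
        for _, code2, speaker2 in second:
            toy_2grams[(code1, speaker1, code2, speaker2)] += 1

for key, count in toy_2grams.items():
    print(key, count)
```

In this toy conversation the two colocated annotations at position 0 each pair with "hint", and "hint" pairs with "thanks", giving three 2-grams in total.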

In [None]:
# count up the 2-grams
countsda3 = da1.groupby(by=["document.name", "annotation.creatingUser", "interaction.number", # group by conversation
                            "interaction.outcome"])                                           # keep the outcome around

countsda3 = countsda3.apply(count2grams,                                         # the function applied to each group
                            "annotation.code.noOutcome",                         # second argument `codecol`
                            np.sort(da1["annotation.code.noOutcome"].unique()))  # third argument `annls`

countsda3

In [None]:
# fix up the formatting
countsda3 = countsda3.reset_index()

countsda3 = countsda3.rename(columns={"document.name" : "document", 
                                      "annotation.creatingUser" : "annotator", 
                                      "interaction.number" : "conversation_number", 
                                      "interaction.outcome" : "outcome"})

In [None]:
# useful extras
countsda3["ln_conversation_length"] = np.log(countsda3["conversation_length"])

countsda3["conversation_sharedID"] = list(zip(countsda3["document"], 
                                              countsda3["conversation_number"]))
countsda3["conversation_uniqueID"] = list(zip(countsda3["document"], 
                                              countsda3["annotator"], 
                                              countsda3["conversation_number"]))

In [None]:
pd.set_option("display.max_colwidth", None)
display(countsda3)
pd.reset_option("display.max_colwidth")

In [None]:
# sanity check the length of the dataframe (it's not perfect because some 
# annotators skipped some documents, but it's believable)
print(countsda3.shape)
print("", 133*3*(48*2)**2) # num convos * num annotators * (num codes * num speakers)**2

In [None]:
if output:
    countsda3.head(20).to_csv(os.path.join(outputdir, small_conversation_2gram_counts_output), index=False)

In [None]:
if output:
    countsda3.to_csv(os.path.join(outputdir, conversation_2gram_counts_output), index=False, compression="gzip")

In [None]:
# we have 3.33M rows, of which 3.30M are zeros
countsda3["count"].value_counts().sort_index()