# What do we want to learn today?
Today we take a first look at ConvoKit, a toolkit for conversational analysis. It provides a unified framework for loading and working with interactive conversational data. It ships with a large number of existing conversational datasets and also provides tools to manipulate, analyze, and model conversational data.

In today's session we will look at two datasets annotated for two of the pragmatic phenomena we discussed in the first lecture: politeness strategies and speech acts. The ConvoKit website is here: https://convokit.cornell.edu/

What we will cover today:
- Representation of datasets: the Corpus object in ConvoKit, and how to load a corpus and extract information from it.
- Manipulation of data: how to use the transformation pipeline (e.g., filtering, adding annotations, extracting linguistic features).
- Classification: how to train a simple feature-based classifier to predict politeness or speech acts in conversations.

In [None]:
from convokit import Corpus, download, TextParser, PolitenessStrategies, Classifier, BoWTransformer
import random

In [None]:
# We first download the two corpora we will be using today.
# 1) The Stanford Politeness Corpus (Danescu-Niculescu-Mizil et al., 2013), which contains ~10k requests annotated for politeness.
#    Here we load its Wikipedia portion (requests from Wikipedia talk pages); the other portion comes from Stack Exchange.
pol_corpus = Corpus(filename=download("wikipedia-politeness-corpus"))
# 2) The Switchboard Corpus (Godfrey et al., 1992), which contains ~1k two-sided telephone conversations annotated for speech acts.
#    The "switchboard-processed-corpus" variant is loaded later, for the dialogue-act part.
swbd_corpus = Corpus(filename=download("switchboard-corpus"))


If you want to learn more about the two corpora, you can check out the following links:
- Stanford Politeness Corpus: https://www.cs.cornell.edu/~cristian//pdfs/politeness_talk.pdf
- and the paper: https://aclanthology.org/P13-1025.pdf
- Switchboard Corpus: https://web.stanford.edu/~jurafsky/tr.pdf


In [None]:
# Let's take a look at the summary statistics of the two corpora.
pol_corpus.print_summary_stats()
swbd_corpus.print_summary_stats()

Every corpus follows a consistent data format built around three core components / objects:
- the utterance object: represents a single utterance in a conversation. It has primary data fields such as text, id, speaker, conversation_id (the conversation it belongs to), reply_to (the utterance it is replying to, if any), and metadata attributes such as annotations or additional features (e.g. dependency parse trees, sentiment scores, etc.).
- to access a random utterance from the corpus, we can use the .random_utterance() method.
- to access a specific utterance by its ID, we can use the .get_utterance(utterance_id) method.
- we can also iterate through all utterances in the corpus using the .iter_utterances() method.
--> task: retrieve a random utterance from the politeness corpus and the switchboard corpus and look at its text, speaker, and metadata.
--> what differences do you notice between the two corpora?


In [None]:
# we can also get all utterances of a corpus as a dataframe. 
utt_df_pol = pol_corpus.get_utterances_dataframe()
utt_df_swbd = swbd_corpus.get_utterances_dataframe()

In [None]:
utt_df_pol.head()

In [None]:
utt_df_swbd.head()

- the speaker object: represents a single speaker in a conversation. It has primary data fields such as id, and metadata attributes such as demographic information or other speaker-level annotations.
--> task: retrieve a random speaker from the politeness corpus and the switchboard corpus and look at its id and metadata.
--> what differences do you notice between the two corpora?

In [None]:
# again we can retrieve all speakers as a dataframe.
speaker_df_pol = pol_corpus.get_speakers_dataframe()
speaker_df_swbd = swbd_corpus.get_speakers_dataframe()

In [None]:
speaker_df_pol.head()

In [None]:
speaker_df_swbd.head()

- the third core object is the conversation object: it represents a single conversation / interaction. It has primary data fields such as id, owner (which corpus does it belong to?) and metadata attributes such as topic or other conversation-level annotations.
- to access a random conversation from the corpus, we can use the .random_conversation() method.
- to access a specific conversation by its ID, we can use the .get_conversation(conversation_id) method.
- we can also iterate through all conversations in the corpus using the .iter_conversations() method.
- we can also get all conversations of a corpus as a dataframe with the .get_conversations_dataframe() method.
--> task: retrieve a random conversation from the politeness corpus and the switchboard corpus and look at its primary data fields and metadata and compare them.

In [None]:
convo_df_pol = pol_corpus.get_conversations_dataframe()
convo_df_swbd = swbd_corpus.get_conversations_dataframe()

In [None]:
convo_df_pol.head()

To summarize: every corpus in ConvoKit is made up of conversations, which contain utterances, which are produced by speakers. From each component we can reach the connected components, such as the speaker of an utterance, the conversation of an utterance, or the utterances in a conversation.

In [None]:
convo_df_swbd.head()

In the politeness corpus, the politeness annotations are stored in the utterance metadata. Take a look at the distribution of politeness labels in the corpus. The binarized politeness label ("Binary") takes the values 1 = "polite", 0 = "neutral", and -1 = "impolite". The continuous politeness score ("PolitenessScore") ranges from roughly -3 (most impolite) to 3 (most polite); if that key is missing, inspect utt.meta, since the ConvoKit release of this corpus may store the score under "Normalized Score".
--> task: check the distribution of the labels (e.g. histogram plot or value counts) for both the binary and continuous politeness labels.
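
As a starting point for this task, the sketch below counts label values with `collections.Counter`. The `toy_metas` list is made up for illustration and only stands in for the `utt.meta` dictionaries; with the real corpus you would collect `utt.meta["Binary"]` while looping over `pol_corpus.iter_utterances()`.

```python
from collections import Counter

# Hypothetical stand-ins for utt.meta; the real values come from the corpus.
toy_metas = [
    {"Binary": 1}, {"Binary": 0}, {"Binary": -1},
    {"Binary": 0}, {"Binary": 1}, {"Binary": 0},
]

def label_distribution(metas, key="Binary"):
    """Return absolute and relative frequencies of the label stored under `key`."""
    counts = Counter(meta[key] for meta in metas)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}

dist = label_distribution(toy_metas)
for label, (n, share) in dist.items():
    print(f"label {label:>2}: {n} utterances ({share:.0%})")
```

For the continuous score, a histogram over the corresponding dataframe column is usually more informative than value counts.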

Next, we will look at how to use transformers to manipulate and analyze the corpus data.
We start with the built-in PolitenessStrategies transformer, which extracts politeness strategies from the utterance text and adds them as features to the utterance metadata. The extractor uses pattern-matching rules over syntactic parses to detect linguistic indicators of politeness. How do we apply a transformer in ConvoKit?
1) First, initialize the transformer object: pol_strategies = PolitenessStrategies()
2) Then apply it to the corpus with the .transform() method: pol_corpus = pol_strategies.transform(pol_corpus)
3) Finally, access the extracted features in the utterance metadata. The 'politeness_strategies' field holds a dictionary that maps each strategy name to a binary indicator (1 if the strategy occurs in the utterance, 0 otherwise). If you pass markers=True to .transform(), the matching tokens are additionally stored in the 'politeness_markers' field as lists of (token, sentence index, token index) tuples.
4) You can also use the summarize() method of pol_strategies to get an overview of the extracted strategies (pass plot=True to visualize their distribution).
--> task: follow the steps above to extract politeness strategies from the politeness corpus and display the first few utterances with their extracted strategies, then use the summarize() method to visualize the distribution of strategies.

Here is a short, rough overview of some of the politeness strategies:

| Strategy                | Linguistic Pattern (simplified)      | Example                       |
|--------------------------|--------------------------------------|--------------------------------|
| **Apology**              | regex for “sorry”, “apologize”, etc. | “I’m sorry to ask, but…”       |
| **Gratitude**            | tokens like “thanks”, “thank you”    | “Thanks for your help!”        |
| **Deference**            | honorifics like “sir”, “ma’am”       | “Yes, sir.”                    |
| **Indirect Request**     | modal verbs “could”, “would” + verb  | “Could you check this?”        |
| **Please**               | token “please” anywhere              | “Please have a look.”          |
| **Hedges**               | “maybe”, “I think”, “sort of”        | “I think you might be right.”  |
| **First Person Plural**  | use of “we”, “our” for solidarity    | “We should try again.”         |
| **Compliment**           | “good job”, “nice work”, etc.        | “Great answer!”                |


In [None]:
pol_strategies = PolitenessStrategies()
pol_corpus = pol_strategies.transform(pol_corpus, markers=True)

In [None]:
df_pol = pol_corpus.get_utterances_dataframe()
df_pol.head()

We can visualize the tokens that triggered each feature with the following code. It displays a random utterance with the marker tokens highlighted and color-coded by strategy.

In [None]:
import re
from IPython.display import display, HTML

def highlight_politeness_markers(utterance):
    """
    Visualize politeness markers in an utterance.
    """
    text = utterance.text
    markers = utterance.meta.get('politeness_markers', {})

    # Collect all tokens to highlight
    highlight_spans = []
    for key, token_lists in markers.items():
        if not token_lists:
            continue
        strategy = key.replace('politeness_markers_==', '').replace('==', '')
        for token_group in token_lists:
            for token, sent_i, tok_i in token_group:
                highlight_spans.append((token, strategy))

    if not highlight_spans:
        print("No politeness markers detected for this utterance.")
        print(utterance.text)
        return

    colors = [
        "#FFD54F", "#AED581", "#81D4FA", "#CE93D8",
        "#FFAB91", "#C5E1A5", "#F48FB1", "#80CBC4",
        "#E6EE9C", "#B39DDB"
    ]
    color_map = {}
    color_index = 0

    def get_color(strategy):
        nonlocal color_index
        if strategy not in color_map:
            color_map[strategy] = colors[color_index % len(colors)]
            color_index += 1
        return color_map[strategy]

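    highlighted_text = text
    seen_tokens = set()
    for token, strategy in highlight_spans:
        # highlight each distinct token only once to avoid nested <mark> tags
        if token.lower() in seen_tokens:
            continue
        seen_tokens.add(token.lower())
        color = get_color(strategy)
        pattern = r'\b' + re.escape(token) + r'\b'
        # use a replacement function so the original casing of the text is kept
        repl = lambda m, c=color, s=strategy: f"<mark style='background-color:{c};' title='{s}'>{m.group(0)}</mark>"
        highlighted_text = re.sub(pattern, repl, highlighted_text, flags=re.IGNORECASE)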

    # Create color-coded legend
    legend_html = "<div style='margin-top:10px;'>"
    for strategy, color in color_map.items():
        legend_html += f"<span style='background-color:{color}; padding:2px 6px; border-radius:4px; margin-right:4px;'>{strategy}</span>"
    legend_html += "</div>"

    # Display the highlighted text + legend
    html = f"""
    <div style='font-family:sans-serif; line-height:1.6;'>
        <p>{highlighted_text}</p>
        {legend_html}
    </div>
    """
    display(HTML(html))

    #print("Detected strategies:", sorted(set([s for _, s in highlight_spans])))

# pick a random utterance with the corpus method introduced earlier
u = pol_corpus.random_utterance()
highlight_politeness_markers(u)

Now that we have extracted politeness strategies as features, we can use them to train a simple classifier that predicts whether an utterance is polite or impolite. We will use ConvoKit's built-in Classifier module for this, with the extracted politeness strategies as features and the binary politeness label as the target variable.

In [None]:
import random
from sklearn.metrics import classification_report
from convokit import Classifier

In [None]:
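# keep only utterances labelled polite (1) or impolite (-1); drop the neutral ones (0)
pol_binary_corpus = Corpus(utterances=[utt for utt in pol_corpus.iter_utterances() if utt.meta["Binary"] != 0])
# check the distribution of the remaining labels
utt_new = pol_binary_corpus.random_utterance()
utt_df_binary = pol_binary_corpus.get_utterances_dataframe()
utt_df_binary["meta.Binary"].value_counts()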

The Classifier object takes the following arguments:
- obj_type: the type of object to classify (utterance, speaker, or conversation)
- pred_feats: the metadata fields to use as features (here the extracted 'politeness_strategies')
- labeller: a function that takes an object and returns its label (here we read the binary politeness label from the utterance's .meta)
- you can then call .evaluate_with_cv() to get results from a 5-fold cross-validation


In [None]:
# this code uses the wrapper from convokit
cv_classifier = Classifier(
    obj_type='utterance', pred_feats=['politeness_strategies'],
    labeller=lambda u: u.meta['Binary'] == 1)

cv_results = cv_classifier.evaluate_with_cv(pol_binary_corpus, selector=lambda u: u.meta.get('Binary') in {1, -1})
print("Cross-validation accuracy per fold:", cv_results)



In [None]:
# here we use the full corpus, train and predict
cv_classifier.fit_transform(corpus=pol_binary_corpus, selector=lambda u: u.meta.get('Binary') in {1, -1})
result_df = cv_classifier.summarize(pol_binary_corpus)

Since we used a LogisticRegression classifier, we can directly inspect which features were most impactful.
--> Look at the 10 "most polite" and 10 "most impolite" indicators.

In [None]:
import pandas as pd
feature_names = sorted(list(next(iter(pol_binary_corpus.iter_utterances())).meta['politeness_strategies'].keys()))

coefs = cv_classifier.clf.named_steps['logreg'].coef_[0]

coef_df = pd.DataFrame({'Feature': feature_names, 'Weight': coefs})
coef_df = coef_df.sort_values('Weight', ascending=False)
display(coef_df.head(10))      # Most "polite" indicators
display(coef_df.tail(10))      # Most "impolite" indicators

An alternative is to skip ConvoKit's built-in wrapper and use a scikit-learn classifier directly. This gives you more options to adapt the model to your use case. To do so, you need to build the feature vectors from the metadata while iterating over the corpus with .iter_utterances(). The label is available via utterance.meta.get('Binary').
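
A minimal sketch of that feature-building step is shown below. It runs on toy stand-in objects rather than on the real corpus: `toy_utts`, the shortened strategy names, and the helper `build_xy` are all hypothetical and only mimic the `.meta` layout described above.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for ConvoKit utterances: only the .meta dict is modelled,
# and the strategy names are shortened, made-up examples.
toy_utts = [
    SimpleNamespace(meta={"Binary": 1, "politeness_strategies": {"Gratitude": 1, "Please": 1, "Direct_question": 0}}),
    SimpleNamespace(meta={"Binary": -1, "politeness_strategies": {"Gratitude": 0, "Please": 0, "Direct_question": 1}}),
    SimpleNamespace(meta={"Binary": 0, "politeness_strategies": {"Gratitude": 0, "Please": 0, "Direct_question": 0}}),
    SimpleNamespace(meta={"Binary": 1, "politeness_strategies": {"Gratitude": 0, "Please": 1, "Direct_question": 0}}),
]

def build_xy(utterances, feat_key="politeness_strategies", label_key="Binary"):
    """Turn per-utterance strategy dicts into a feature matrix X and a label list y."""
    # keep only polite (1) and impolite (-1) utterances
    kept = [u for u in utterances if u.meta.get(label_key) in {1, -1}]
    # fix one feature order so that every row has the same columns
    feature_names = sorted({name for u in kept for name in u.meta[feat_key]})
    X = [[u.meta[feat_key].get(name, 0) for name in feature_names] for u in kept]
    y = [int(u.meta[label_key] == 1) for u in kept]  # 1 = polite, 0 = impolite
    return X, y, feature_names

X, y, feature_names = build_xy(toy_utts)
print(feature_names)
print(X, y)
```

With the real corpus, you would pass `pol_binary_corpus.iter_utterances()` instead of `toy_utts`. `X` and `y` can then go into any scikit-learn estimator, for example `LogisticRegression().fit(X, y)`, and the predictions can be evaluated with `classification_report`.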

Next, let's have a look at speech acts in the Switchboard corpus. To access the speech act annotations we have to load a different version of the corpus. Since this corpus captures dialogues, the concept is called "dialogue acts" here.
--> check how these are annotated in the dataset, the labels are stored in the "tags" field in the metadata. Inspect a few random utterances. 

In [None]:
swbd_corpus = Corpus(filename=download("switchboard-processed-corpus"))


For your reference, here is a table with the DAMSL tags used in the Switchboard corpus, the full name of each tag, and an example.
--> In what way do they differ from the speech acts we have looked at in the lecture?

| DAMSL Tag | Dialogue / Speech Act Label          | Example Utterance                          |
|------------|--------------------------------------|--------------------------------------------|
| **sd**     | Statement-non-opinion               | “It’s raining outside.”                    |
| **sv**     | Statement-opinion                   | “I think that’s a bad idea.”               |
| **b**      | Acknowledge / Backchannel           | “Uh-huh.” / “Right.”                       |
| **aa**     | Agree / Accept                      | “Yes.” / “Exactly.”                        |
| **ba**     | Appreciation                        | “I can imagine.”                           |
| **qy**     | Yes-No Question                     | “Do you like pizza?”                       |
| **qw**     | Wh-Question                         | “What time is it?”                         |
| **qo**     | Open Question                       | “How do you feel about that?”              |
| **qr**     | Or-Question                         | “Is it red or blue?”                       |
| **qy^d**   | Declarative Yes-No Question         | “You’re coming tonight?”                   |
| **ny**     | Yes Answer                          | “Yes.” / “I do.”                           |
| **nn**     | No Answer                           | “No.” / “I don’t.”                         |
| **fc**     | Conventional Closing                | “Goodbye.” / “See you later.”              |
| **fp**     | Conventional Opening                | “Hi!” / “Hello!”                           |
| **fe**     | Exclamation                         | “Wow!” / “Great!”                          |
| **ft**     | Thanking                            | “Thanks!” / “Thank you so much.”           |
| **fa**     | Apology                             | “Sorry about that.”                        |
| **fo**     | Other (forward function)            | “Well give me a break, you know.”          |
| **oo**     | Offers, Options, Commits            | “I’ll have to check that out.”             |
| **no**     | Other Answers                       | “I don’t know.”                            |
| **h**      | Hedge / Qualified Answer            | “I guess so.” / “Probably.”                |
| **na**     | Affirmative Non-Yes Answer          | “Yeah, sure.” / “Absolutely.”              |
| **ng**     | Negative Non-No Answer              | “Nope, not really.”                        |
| **br**     | Signal-non-understanding            | “Excuse me?” / “Pardon?”                   |
| **arp**    | Dispreferred Answer                 | “Well, not so much that.”                  |
| **sd@**    | Statement continuation (multi-turn) | “And then we went to the park.”            |
| **t1**     | Self-Talk                           | “Let’s see…” / “Hmm…”                      |
| **t3**     | 3rd-Party Talk                      | “My goodness, Diane, get down from there.” |
| **x**      | Non-verbal / Uninterpretable        | “[laughter]” / “[noise]”                   |



Next, we also want to train a simple classifier that predicts the dialogue acts from the utterance text. For that, we first need to create feature vectors from the utterance text. We will use a bag-of-words representation. ConvoKit provides a bag-of-words transformer (BoWTransformer) that creates vectors from the text of utterances.
--> initialize the BoWTransformer and add the vectors to the corpus
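
To make the bag-of-words idea concrete, here is a small sketch in plain Python. It is not how BoWTransformer is implemented, and the tokenizer, helper names, and toy texts are assumptions made for illustration; it only shows what kind of vector ends up attached to each utterance.

```python
from collections import Counter

def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace (an assumption of this sketch)."""
    return text.lower().split()

def fit_vocabulary(texts):
    """Map every token seen in the given texts to a column index."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}

def to_bow_vector(text, vocab):
    """Count how often each vocabulary token occurs; unknown tokens are ignored."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocab]

# Hypothetical toy utterances
toy_texts = ["uh-huh", "do you like it", "i like it i do"]
vocab = fit_vocabulary(toy_texts)
vectors = [to_bow_vector(t, vocab) for t in toy_texts]

print(list(vocab))
for text, vec in zip(toy_texts, vectors):
    print(vec, "<-", text)
```

Each utterance becomes a vector with one column per vocabulary word, so word order is lost and only word counts remain.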

You can now see that the utterances dataframe has a new column called "vectors". With utt.get_vector("bow_vector") we can retrieve the vector representation of an utterance. We can now use any scikit-learn classifier to predict dialogue acts from the bag-of-words representation. Note: training the classifier might take a while on your local machine. If it takes too long, use only a subset of the dataset.

In [None]:
# inspect the utterances dataframe, which now has the new "vectors" column
swbd_corpus_utterances = swbd_corpus.get_utterances_dataframe()
swbd_corpus_utterances.head()

--> Look at the results of the model. How does the model perform? Which tags are predicted well? Which are not?
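
One way to answer these questions is to compute a score per tag instead of a single overall accuracy. The sketch below computes per-tag recall on hypothetical gold and predicted tags; the lists and the helper `per_tag_recall` are made up for illustration.

```python
from collections import Counter

def per_tag_recall(gold, predicted):
    """For each gold tag, compute the share of its utterances that were predicted correctly."""
    totals = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {tag: correct[tag] / totals[tag] for tag in sorted(totals)}

# Hypothetical gold and predicted dialogue act tags
gold = ["sd", "sd", "sd", "b", "b", "qy", "sv", "sv"]
pred = ["sd", "sd", "sv", "b", "b", "sd", "sd", "sv"]

recall = per_tag_recall(gold, pred)
for tag, value in recall.items():
    print(f"{tag:>3}: {value:.2f}")
```

On real predictions, scikit-learn's `classification_report(y_true, y_pred)` reports precision, recall, and F1 per tag in one call. Rare tags typically score much lower than frequent ones such as sd or b.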

You now have a first idea of how to load data with ConvoKit and extract features. Pick one of the following tasks: use the classifier trained to predict politeness from the politeness features to make predictions on a new dataset, either
(a) the other portion of the politeness corpus (stack-exchange-politeness-corpus), or
(b) the Switchboard corpus. This is spoken language, so a very different domain / genre.
Inspect the predictions on a subset of 20 utterances. Do they make sense? What is the classifier missing?
--> In general, this is a rather simple classifier. How do you think it could be improved? Is there any additional context, for example, that could be taken into consideration?