In [None]:
from bokeh.io import output_notebook

# This needs to be run to enable interactive bokeh plots
output_notebook()
from convokit import Corpus, download

from atap_widgets.conversation import (
    Conversation,
    ConceptSimilarityModel,
    EmbeddingModel,
)
from atap_widgets.plotting import ConversationPlot

## Download corpus

We will use the [Intelligence Squared Debates Corpus](https://convokit.cornell.edu/documentation/iq2.html) from `ConvoKit` as it can be downloaded on demand.
`ConvoKit` provides data as dataframes, so we can load them into our conversation tool easily, without much preprocessing.

In [None]:
corpus = Corpus(filename=download("iq2-corpus"))

In [None]:
sport = corpus.conversations["0"]

In [None]:
sport_df = sport.get_utterances_dataframe().reset_index()
sport_df.head()

## Create a `Conversation` object

We use a `Conversation` object to wrap conversation data and make sure we can access the key parts of it. The input is a dataframe with one row per utterance and `text` and `speaker` columns, plus an optional column holding an utterance ID.
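
As a rough illustration of the expected input shape, the sketch below builds a tiny dataframe by hand. The `text` and `speaker` column names come from the description above, and `id` matches the `id_column` used in the next cell. The utterances and speaker names are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data: one row per utterance.
toy_df = pd.DataFrame(
    {
        "id": ["u0", "u1", "u2"],
        "speaker": ["moderator", "speaker_a", "speaker_b"],
        "text": [
            "Welcome to tonight's debate.",
            "Thank you, I will argue for the motion.",
            "And I will argue against it.",
        ],
    }
)

print(toy_df)
```

A dataframe shaped like this should be usable in place of `sport_df` when constructing a `Conversation`.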

In [None]:
sport_conversation = Conversation(sport_df, id_column="id")
sport_conversation

The `Conversation` object currently provides a few basic features, such as listing the speaker names and the most common terms.

In [None]:
print(sport_conversation.get_speaker_names())
sport_conversation.get_term_frequencies().head(10)

If we notice terms we don't want in the top terms, we can add them to the stopwords:

In [None]:
sport_conversation.add_stopword("uh")
sport_conversation.get_term_frequencies().head(10)

## Calculating conversational similarity

We pass the conversation object to different models to calculate
conversational similarity in different ways:

* `ConceptSimilarityModel` implements the co-occurrence based algorithm from
  > Angus, D., Smith, A., & Wiles, J. (2012). Conceptual Recurrence Plots: Revealing Patterns in Human Discourse. IEEE Transactions on Visualization and Computer Graphics, 18(6), 988–997. https://doi.org/10/b49pvx
* `EmbeddingModel` uses RoBERTa sentence embeddings for each utterance, and calculates the cosine similarity between them.
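
To make the idea of pairwise cosine similarity concrete, here is a minimal pure-Python sketch. It is not the `atap_widgets` implementation: the helper names are hypothetical, and the two-dimensional vectors are made-up stand-ins for real sentence embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: zero vectors are similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(vectors):
    """Square matrix (list of lists) of similarities between all vector pairs."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical "embeddings", one per utterance.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
similarity = pairwise_similarity(toy_vectors)

for row in similarity:
    print([round(value, 3) for value in row])
```

The result is a square, symmetric matrix with one row and one column per utterance, which is the same general shape as the similarity matrices used later in this notebook.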

### Co-occurrence-based similarity

By default, we use the 20 most frequent terms as the key terms for the model (counting each term only once per utterance).
However, we can provide our own custom terms using the `top_terms` argument.

In [None]:
sport_concept_model = ConceptSimilarityModel(sport_conversation, n_top_terms=20)
sport_concept_model.top_terms
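
The following sketch illustrates what counting a term only once per utterance means when ranking terms. It is an assumption-laden illustration rather than the library's code: the function name is hypothetical, the utterances are pre-tokenised toy data, and the real model's tokenisation, stopword handling, and tie-breaking may differ.

```python
from collections import Counter


def top_terms_by_utterance(utterances, n_top_terms):
    """Rank terms by the number of utterances that contain them."""
    counts = Counter()
    for tokens in utterances:
        # set() ensures each term contributes at most once per utterance.
        counts.update(set(tokens))
    return [term for term, _ in counts.most_common(n_top_terms)]


# Hypothetical tokenised utterances.
toy_utterances = [
    ["sport", "sport", "sport", "money"],
    ["money", "fans"],
    ["money", "sport"],
]

top = top_terms_by_utterance(toy_utterances, n_top_terms=2)
print(top)
```

Here "sport" appears four times in total but in only two utterances, whereas "money" appears in all three, so "money" ranks first under this counting scheme.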

The model object provides access to details of the model through methods like `get_concept_vectors()`.
However, the most important output it provides is the utterance-level pairwise similarity matrix:

In [None]:
sport_concept_matrix = sport_concept_model.get_conversation_similarity()
sport_concept_matrix.iloc[:5, :5]

## Interactive plot

We pass the pairwise similarity matrix we get from a model to `ConversationPlot` to produce an interactive visualisation.
The full conversation may be too large to visualise easily, so we can view a subset by
indexing the matrix:

In [None]:
sport_concept_plot = ConversationPlot(
    sport_conversation, sport_concept_matrix.iloc[100:150, 100:150]
)
sport_concept_plot.show()