# Analysis of Conversation Data using Discursis

In this workshop, we'll look at how we can use Discursis to analyze conversation and discourse
transcripts, and other text that is structured like a conversation, such as social media data.

The dataset we'll use is the transcript of the National Press Club Leaders Debate
between Kevin Rudd and Tony Abbott, available at the [Parliament of Australia website](https://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id:%22media/pressrel/2658246%22) under a [CC BY-NC-ND 3.0 AU](https://creativecommons.org/licenses/by-nc-nd/3.0/au/)
Creative Commons license.

## Python setup

Discursis and the accompanying tools for conversation data are in the [atap_widgets](https://github.com/Australian-Text-Analytics-Platform/atap_widgets) Python
package. We'll load the tools from this library, along with the other libraries we'll be using
for the analysis:

In [None]:
import os

# In order for bokeh to work properly on Binder, we need to specify
#   the URL it's running on
os.environ["BINDER_EXTERNAL_URL"] = os.environ.get('JUPYTERHUB_EXTERNAL_URL', "https://notebooks.gesis.org/")
# pandas: tools for data processing
import pandas as pd

# numpy: tools for numerical calculations
import numpy as np

# Show up to 100 characters per cell when displaying dataframes
pd.options.display.max_colwidth = 100
# Bokeh: interactive plots
from bokeh.io import output_notebook
from bokeh.models import ColorBar
from bokeh.plotting import figure, show
from bokeh.transform import linear_cmap

# networkx: graphs/networks
import networkx as nx

# matplotlib: another plotting library
from matplotlib import cm
import matplotlib.pyplot as plt

# This needs to be run to enable interactive bokeh plots
output_notebook()

# Individual tools from atap_widgets
from atap_widgets.conversation import (
    ConceptSimilarityModel,
    Conversation,
    EmbeddingModel,
)
from atap_widgets.plotting import ConversationPlot
from atap_widgets.concordance import (
    ConcordanceTable,
    ConcordanceWidget,
    prepare_text_df,
)

In [None]:
# Create a results folder, if it doesn't already exist
os.makedirs("conversation_results", exist_ok=True)

## Loading the data

The conversation tools are designed to accept data as a `pandas` dataframe.
Each row in the dataframe should be an utterance in the conversation. There
should be a `"text"` column with the actual content of the utterance
and a `"speaker"` column identifying who is speaking. It also helps if we have a `"text_id"` column that gives a unique identifier for each utterance that
we can refer to.


Additional metadata columns that might be relevant to your particular dataset
will be imported into the conversation tool as-is. In this case we have an additional
`"role"` column identifying each person's role in the debate.

In [None]:
data = pd.read_excel("data/debate_clean.xlsx")
data.head()
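
If you'd like to see the expected layout without opening the Excel file, here is a minimal sketch. The rows are made up for illustration and are not taken from the debate transcript; only the column names follow the description above.

```python
import pandas as pd

# Hypothetical toy utterances, only to show the expected column layout
toy_data = pd.DataFrame(
    {
        "text_id": [0, 1, 2],
        "speaker": ["Moderator", "Speaker A", "Speaker B"],
        "role": ["moderator", "candidate", "candidate"],
        "text": [
            "Welcome to tonight's debate.",
            "Thank you, it is good to be here.",
            "Thank you, I am looking forward to it.",
        ],
    }
)

# Check that the columns the conversation tools rely on are present
required_columns = {"text", "speaker", "text_id"}
missing_columns = required_columns - set(toy_data.columns)
print("Missing columns:", missing_columns or "none")
```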

## Data exploration

Before carrying out our analysis using Discursis, we can do some initial exploration of the data
to get an overview of it. We can do this with the tools that are built in to `pandas`,
rather than specialized language tools.

We can see how many times each person spoke during the debate:

In [None]:
print("Total utterances:", len(data))
data["speaker"].value_counts()

And see how many times people in different roles spoke:

In [None]:
pd.crosstab(data["speaker"], data["role"], margins=True)

We can use the `ConcordanceTable` tool to start looking at some key terms
in the conversation. Terms relevant to a political debate
might include "economy" or "environment".

Before using the concordance tools, we need to use the `prepare_text_df()`
function to process the data, which applies some initial NLP processing.

In [None]:
data = prepare_text_df(data, text_column="text", id_column="text_id")

table = ConcordanceTable(data, keyword="economy", results_per_page=10)
table

We can update the search settings and display the table again:

In [None]:
table.keyword = "environment"
table

It seems the economy was a focus, but environmental issues were not a major feature of this debate.

If you have other ideas about relevant topics, you can use the 
`ConcordanceWidget` to search for them in real-time. If you've
found useful results, you can export them to Excel:

In [None]:
widget = ConcordanceWidget(data, results_per_page=10)
widget.show()

> ### What's the difference between ConcordanceWidget and ConcordanceTable?
>
> * `ConcordanceWidget` lets you search interactively, but the results won't be saved in the notebook
> * `ConcordanceTable` is non-interactive, but the results are saved in the notebook, so you
>   can share them easily
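
To make the idea of a concordance concrete, here is a rough sketch that finds a keyword and shows a little context on either side. It uses only Python's `re` module; `simple_concordance` is a hypothetical helper written for this illustration, not part of `atap_widgets`, and the sentences are made up.

```python
import re


def simple_concordance(texts, keyword, context_chars=30):
    """Return (left context, match, right context) for each keyword hit."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", flags=re.IGNORECASE)
    rows = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - context_chars) : match.start()]
            right = text[match.end() : match.end() + context_chars]
            rows.append((left, match.group(), right))
    return rows


# Made-up sentences, not taken from the debate transcript
toy_texts = [
    "We need a plan to grow the economy and create jobs.",
    "The Economy is not the only issue facing voters.",
]
for left, hit, right in simple_concordance(toy_texts, "economy"):
    print(f"{left:>30} [{hit}] {right}")
```

The concordance tools above do more than this (they apply NLP processing first), so treat the sketch as an intuition aid only.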

## Analysis of Conversation Data

In order to model our conversation data, we need to load our data into
a `Conversation` object. The `Conversation` object carries out some initial processing of the text,
which will be handled by a `spacy` language model. If you need to analyse
data for a non-English corpus, you can install a relevant [spacy model](https://spacy.io/usage/models).

In [None]:
conversation = Conversation(
    data=data,
    text_column="text",
    speaker_column="speaker",
    id_column="text_id",
    language_model="en_core_web_sm",
)
conversation

The `Conversation` object offers some basic functionality for accessing information about
the conversation:

In [None]:
conversation.n_speakers, conversation.n_utterances

You can access the table of utterance data via `conversation.data` - this
is a `pandas` DataFrame like the original data but has some additional
information added:

In [None]:
conversation.data.head()

When we calculate conversational (semantic) similarity below, the default method is based
on the most common terms in the data. We can check what these are:

In [None]:
conversation.get_most_common_terms(n=20)

We may want to treat some of these terms as **stopwords** so that they don't contribute to
the calculation of topic similarity. Stopwords are common grammatical terms that don't carry specific
contextual meaning, such as `and, of, the, is`. Note that the `spacy` language model we are using already has some
default stopwords defined:

In [None]:
default_stopwords = sorted(conversation.nlp.Defaults.stop_words)
print(len(default_stopwords), "default stopwords")
default_stopwords[:10]

We can add our own stopwords with `add_stopword()`. Once added,
they are applied in any new operations:

In [None]:
conversation.add_stopword("mr")
conversation.add_stopword("said")

In [None]:
conversation.get_most_common_terms(n=20)
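
The effect of a stopword list on term counts can be sketched with plain Python. The tokens below are made up, and this is only an illustration of the idea, not how `Conversation` counts terms internally.

```python
from collections import Counter

# Made-up tokens standing in for a tokenised transcript
toy_tokens = ["mr", "rudd", "said", "economy", "mr", "abbott", "said", "economy", "jobs"]
toy_stopwords = {"mr", "said"}

counts_before = Counter(toy_tokens)
counts_after = Counter(token for token in toy_tokens if token not in toy_stopwords)

print(counts_before.most_common(3))
print(counts_after.most_common(3))
```

With the stopwords removed, content terms such as "economy" rise to the top of the list.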

You can also access and export the full frequency table of terms. The `term_frequencies` table
is a `pandas` DataFrame, so we can export it to Excel easily:

In [None]:
term_frequencies = conversation.get_term_frequencies()
term_frequencies.to_excel("conversation_results/term_frequencies.xlsx", index=False)
term_frequencies

### Calculating similarity

To calculate the similarity of terms and topics across the conversation,
we'll use the conceptual recurrence calculation from:

> Angus, D., Smith, A. E., & Wiles, J. (2012). Human Communication as Coupled Time Series: Quantifying Multi-Participant Recurrence. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1795–1807. https://doi.org/10.1109/TASL.2012.2189566

This method is implemented in `ConceptSimilarityModel`, which takes in a conversation
object and performs the similarity calculation on it. To match the method used in the article,
we'll use the top 50 key terms as the basis for concepts, which we set with the
`key_terms` argument, and use 3-sentence windows when counting which terms co-occur:

In [None]:
concept_model = ConceptSimilarityModel(
    conversation, key_terms=50, sentence_window_size=3
)
concept_model
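
To illustrate what "counting which terms co-occur within a sentence window" means, here is a small sketch in `numpy`. It is a simplified assumption about the general idea, not the calculation `ConceptSimilarityModel` actually performs, and the sentences and key terms are made up.

```python
import numpy as np

# Made-up sentences, already tokenised; not taken from the debate
toy_sentences = [
    ["economy", "jobs"],
    ["tax", "economy"],
    ["schools", "health"],
    ["health", "tax"],
]
toy_key_terms = ["economy", "jobs", "tax", "health"]
window_size = 3  # mirrors sentence_window_size=3 above

term_index = {term: i for i, term in enumerate(toy_key_terms)}
cooccurrence = np.zeros((len(toy_key_terms), len(toy_key_terms)), dtype=int)

# Slide a window over the sentences and count each pair of key terms
# that appears together inside the same window
for start in range(len(toy_sentences) - window_size + 1):
    window = toy_sentences[start : start + window_size]
    terms_in_window = {t for sentence in window for t in sentence if t in term_index}
    for a in terms_in_window:
        for b in terms_in_window:
            if a != b:
                cooccurrence[term_index[a], term_index[b]] += 1

print(cooccurrence)
```

Terms that are not key terms ("schools" here) are ignored, and terms that share more windows get higher counts.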

For convenience, we only need to call one function to get the utterance-to-utterance
similarity that will form the basis of the Discursis plot:

In [None]:
concept_similarity = concept_model.get_conversation_similarity()
print(concept_similarity.shape)
concept_similarity.iloc[:5, :5]
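
As a rough intuition for what an utterance-to-utterance similarity matrix contains, imagine each utterance summarised as a vector of weights over key terms, with utterances compared by cosine similarity. Cosine similarity is one common choice; whether `ConceptSimilarityModel` uses exactly this measure is an assumption here, and the vectors are made up.

```python
import numpy as np

# Hypothetical concept vectors: one row per utterance, one column per key term
toy_vectors = np.array(
    [
        [1.0, 0.0, 0.5],
        [0.8, 0.1, 0.4],
        [0.0, 1.0, 0.0],
    ]
)

# Cosine similarity: dot product of the length-normalised rows
norms = np.linalg.norm(toy_vectors, axis=1, keepdims=True)
unit_vectors = toy_vectors / norms
toy_utterance_similarity = unit_vectors @ unit_vectors.T

print(toy_utterance_similarity.round(2))
```

The first two utterances weight the same terms and score close to 1, while the third shares no terms with the first and scores 0.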

However, if we need to, we can also access the concept vectors the similarity is based on:

In [None]:
concept_vectors = concept_model.get_concept_vectors()
print(concept_vectors.shape)
concept_vectors.iloc[:5, :]

Or the similarity between terms in the conversation (based on their co-occurrence):

In [None]:
term_similarity = concept_model.get_term_similarity_matrix()
print(term_similarity.shape)
term_similarity.loc[concept_model.key_terms, concept_model.key_terms].iloc[:5, :]

All of these results are `pandas` DataFrames, so we can export them to Excel like we did above:

In [None]:
# The index for these tables contains important info, so include it
#   when exporting
concept_similarity.to_excel("conversation_results/debate_similarity.xlsx", index=True)
concept_vectors.to_excel("conversation_results/concept_vectors.xlsx", index=True)
term_similarity.to_excel("conversation_results/term_similarity.xlsx", index=True)
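
Because the results are ordinary DataFrames, you can also query them directly with `pandas`. The sketch below uses a made-up term similarity table (the values are invented) to list the terms most similar to a chosen term.

```python
import pandas as pd

toy_terms = ["economy", "jobs", "tax", "health"]
toy_term_similarity = pd.DataFrame(
    [
        [1.0, 0.7, 0.6, 0.1],
        [0.7, 1.0, 0.4, 0.2],
        [0.6, 0.4, 1.0, 0.3],
        [0.1, 0.2, 0.3, 1.0],
    ],
    index=toy_terms,
    columns=toy_terms,
)

# Similarity of every other term to "economy", most similar first
closest_to_economy = (
    toy_term_similarity.loc["economy"].drop("economy").sort_values(ascending=False)
)
print(closest_to_economy)
```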

> #### Advanced usage
>
> This code is here to demonstrate that you can use the results
> from the conversation tools however you want, using standard
> tools like plotting libraries.
>
> However, the code requires some advanced knowledge of
> `pandas` - this level of knowledge isn't required
> for the rest of the workshop.

In [None]:
term_similarity_data = (
    term_similarity.loc[concept_model.key_terms, concept_model.key_terms]
    .stack()
    .rename_axis(["term", "other"])
    .rename("similarity")
    .reset_index()
)
p = figure(
    title="Term similarity",
    x_range=concept_model.key_terms,
    y_range=concept_model.key_terms,
)
similarity_colours = linear_cmap("similarity", "Viridis256", 0, 1)
p.rect(
    x="term",
    y="other",
    width=1,
    height=1,
    fill_color=similarity_colours,
    source=term_similarity_data,
)
p.xaxis.major_label_orientation = "vertical"

legend = ColorBar(color_mapper=similarity_colours["transform"])
p.add_layout(legend, "right")
show(p)

## Topic recurrence metrics

Once we've calculated the conversational similarity, we can use it
to calculate related quantities, e.g. the **multi-participant
recurrence metrics** outlined in:

> Angus, D., Smith, A. E., & Wiles, J. (2012). Human Communication as Coupled Time Series: Quantifying Multi-Participant Recurrence. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1795–1807. https://doi.org/10.1109/TASL.2012.2189566



and the **person-to-person (P2P)** and **group-to-group (G2G) recurrence** outlined in:

> Angus, D., & Wiles, J. (2018). Social semantic networks: Measuring topic management in discourse using a pyramid of conceptual recurrence metrics. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(8), 085723. https://doi.org/10.1063/1.5024809


### Multi-participant recurrence

The multi-participant recurrence metrics can be calculated for different combinations of:

* Time scale: short, medium or long
* Direction: forward or backward
* Speaker: self or other
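
To give a feel for what a "forward, self" or "forward, other" score measures, here is a deliberately simplified sketch. It averages each utterance's similarity to all later utterances by the same speaker or by other speakers. The definitions in the paper and in `atap_widgets` (including how time scales are handled) may differ; `forward_recurrence` is a hypothetical helper and the similarity values are made up.

```python
import numpy as np
import pandas as pd

toy_speakers = pd.Series(["A", "B", "A", "B"], name="speaker")
toy_sim_matrix = pd.DataFrame(
    [
        [1.0, 0.2, 0.6, 0.4],
        [0.2, 1.0, 0.1, 0.8],
        [0.6, 0.1, 1.0, 0.3],
        [0.4, 0.8, 0.3, 1.0],
    ]
)


def forward_recurrence(similarity, speakers, speaker="self"):
    """Mean similarity of each utterance to all later utterances by the
    same speaker ("self") or by anyone else ("other")."""
    scores = []
    for i in range(len(speakers)):
        later = np.arange(len(speakers)) > i
        same = (speakers == speakers.iloc[i]).to_numpy()
        mask = later & (same if speaker == "self" else ~same)
        values = similarity.to_numpy()[i, mask]
        # No later utterance to compare with: leave the score undefined
        scores.append(values.mean() if mask.any() else np.nan)
    return pd.Series(scores, index=similarity.index, name="score")


print(forward_recurrence(toy_sim_matrix, toy_speakers, speaker="self"))
print(forward_recurrence(toy_sim_matrix, toy_speakers, speaker="other"))
```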

You can calculate metrics one at a time. The results show the recurrence score
for each utterance in the conversation (identified by `text_id`):

In [None]:
conversation.get_topic_recurrence(
    similarity=concept_similarity,
    time_scale="short",
    direction="forward",
    speaker="self",
)

Or calculate them for all combinations. These results are returned in a format that's
easy to filter and query in `pandas`:

In [None]:
all_recurrences = conversation.get_all_topic_recurrences(similarity=concept_similarity)
all_recurrences.head()

If you want to view them side-by-side you can do:

In [None]:
all_recurrences.pivot(
    index="text_id", columns=["time_scale", "direction", "speaker"], values="score"
)
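
If the difference between the long and wide layouts is unclear, this toy example may help. The column names mirror those used in the `pivot()` call above; the scores themselves are invented.

```python
import pandas as pd

# Hypothetical long-format scores, mimicking the column names used above
toy_long = pd.DataFrame(
    {
        "text_id": [0, 0, 1, 1],
        "time_scale": ["short", "short", "short", "short"],
        "direction": ["forward", "forward", "forward", "forward"],
        "speaker": ["self", "other", "self", "other"],
        "score": [0.6, 0.3, 0.8, 0.1],
    }
)

# Long format makes filtering easy...
forward_self = toy_long.query("direction == 'forward' and speaker == 'self'")
print(forward_self)

# ...while pivoting puts one metric per column for side-by-side reading
toy_wide = toy_long.pivot(
    index="text_id", columns=["time_scale", "direction", "speaker"], values="score"
)
print(toy_wide)
```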

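### Person-to-person and group-to-group recurrence

The same similarity matrix can be summarised between pairs of participants. Grouping on the `"speaker"` column gives person-to-person (P2P) recurrence, and grouping on the `"role"` column gives group-to-group (G2G) recurrence: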

In [None]:
p2p_recurrence = conversation.get_grouped_recurrence(
    concept_similarity, grouping_column="speaker"
)
p2p_recurrence.round(2)

In order to interpret the recurrence scores, it's useful to divide the un-normalized recurrence scores by their total
to express them as a percentage of the total recurrence:

In [None]:
def get_percentage_recurrence(recurrence_scores):
    total_score = recurrence_scores.sum().sum()
    percentages = (recurrence_scores / total_score) * 100
    return percentages


p2p_recurrence_raw = conversation.get_grouped_recurrence(
    concept_similarity, grouping_column="speaker", normalize=False
)
p2p_percentages = get_percentage_recurrence(p2p_recurrence_raw)
p2p_percentages.round(1)
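
As a worked example of the percentage calculation, take a made-up table of un-normalized scores between two speakers. The numbers are invented so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical un-normalized recurrence scores between two speakers
toy_scores = pd.DataFrame(
    [[6.0, 2.0], [1.0, 1.0]],
    index=["A", "B"],
    columns=["A", "B"],
)

total = toy_scores.sum().sum()  # 6 + 2 + 1 + 1 = 10
toy_percentages = toy_scores / total * 100
print(toy_percentages)
```

Each cell is now that pair's share of all recurrence in the conversation, and the cells sum to 100.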

In [None]:
g2g_recurrence = conversation.get_grouped_recurrence(
    concept_similarity, grouping_column="role"
)
g2g_recurrence.round(2)

In [None]:
g2g_recurrence_raw = conversation.get_grouped_recurrence(
    concept_similarity, grouping_column="role", normalize=False
)
g2g_percentages = get_percentage_recurrence(g2g_recurrence_raw)
g2g_percentages.round(1)

### Social semantic networks

Recurrence scores can be visualized as a social semantic network, with the recurrence score
as the edge weight between nodes of the network.

You could export the recurrence scores to Excel and use them in specialized software
like [Gephi](https://gephi.org/), or use Python's `networkx` library:

In [None]:
p2p_network = nx.from_pandas_adjacency(p2p_recurrence_raw, create_using=nx.DiGraph)
edge_weights = [weight for (a, b, weight) in p2p_network.edges.data("weight")]

fig, ax = plt.subplots()
fig.set_size_inches((8, 8))
# Draw the network
layout = nx.layout.spring_layout(p2p_network)
nx.draw_networkx(
    p2p_network,
    pos=layout,
    arrows=True,
    connectionstyle="arc3,rad=0.2",
    font_weight="bold",
    node_size=500,
    edge_cmap=cm.viridis,
    edge_color=edge_weights,
    width=[np.log(w) for w in edge_weights],
    bbox={"ec": "k", "fc": "white", "alpha": 0.5},
    verticalalignment="bottom",
    horizontalalignment="left",
)

As expected, the 3 central participants in the debate repeat each other's concepts much more often than the additional journalists do.
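
If you would rather work in external network software, one option is to reshape the square recurrence table into an edge list before exporting. This is a sketch with invented scores; the column names `source`, `target` and `weight` and the output filename are assumptions, and which direction each row/column pair represents depends on how the recurrence table is oriented.

```python
import pandas as pd

# Hypothetical recurrence scores: rows and columns are speakers
toy_recurrence = pd.DataFrame(
    [[0.0, 2.5], [1.5, 0.0]],
    index=["A", "B"],
    columns=["A", "B"],
)

# Reshape the square matrix into one row per (source, target) pair
edge_list = (
    toy_recurrence.rename_axis(index="source", columns="target")
    .stack()
    .rename("weight")
    .reset_index()
)
# Drop self-loops and zero-weight edges before exporting
is_loop = edge_list["source"] == edge_list["target"]
edge_list = edge_list[~is_loop & (edge_list["weight"] > 0)]
print(edge_list)

# Uncomment to save (the filename is only an example):
# edge_list.to_csv("conversation_results/p2p_edges.csv", index=False)
```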

## Visualizing similarity

The Discursis-style plot of similarity across the conversation is
available through `ConversationPlot`:

In [None]:
discursis_plot = ConversationPlot(conversation, similarity_matrix=concept_similarity)
discursis_plot.show()

While the default is to colour the plot by speakers, the more relevant column here 
is probably `"role"`, so we may want to use that from now on:

In [None]:
role_plot = ConversationPlot(
    conversation,
    similarity_matrix=concept_similarity,
    grouping_column="role",
    options={"show_help_text": False},
)
role_plot.show()

For more focussed exploration of the conversation, you can inspect a subset.
You can use the **Box Zoom** tool to zoom in on parts of the plot manually, but it may be better
to explicitly plot only the part of the conversation you want to focus on.

**Box zoom icon**: ![Box zoom icon](https://docs.bokeh.org/en/latest/_images/BoxZoom.png)

You can also use the
`threshold` option to remove tiles with low similarity, to better highlight the utterances
that are similar:

In [None]:
focused_plot = ConversationPlot(
    conversation,
    similarity_matrix=concept_similarity.iloc[:25, :25],
    grouping_column="role",
    threshold=0.2,
)
focused_plot.show()
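
The idea behind a similarity threshold can be sketched directly in `pandas`. The values are made up, and whether `ConversationPlot` treats values exactly equal to the threshold as visible is an assumption here.

```python
import pandas as pd

toy_plot_similarity = pd.DataFrame(
    [
        [1.0, 0.15, 0.6],
        [0.15, 1.0, 0.05],
        [0.6, 0.05, 1.0],
    ]
)

threshold = 0.2
# Keep values at or above the threshold; everything else becomes NaN
above_threshold = toy_plot_similarity.where(toy_plot_similarity >= threshold)
print(above_threshold)
```

Only the stronger links survive, which is why thresholding makes related utterances easier to spot.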

## Using Discursis with your own data

To use Discursis with your own data, you first
load the data with `pandas`, and then create a conversation
object from it. `pandas` can load Excel and `csv` files (along
with many other formats), but we'll assume you have an Excel file here,
with a `.xlsx` file extension, and with two columns named
`text` and `speaker`.
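
If your data is in a `csv` file instead, a sketch along these lines may be a useful starting point. The in-memory text stands in for a real file, and both the column check and the added `text_id` column are optional conveniences rather than requirements stated by the package.

```python
import io

import pandas as pd

# Stand-in for a file on disk; with a real file you would pass its path
# to pd.read_csv() or pd.read_excel() instead
toy_csv = io.StringIO(
    "speaker,text\n"
    "A,Hello there.\n"
    "B,Hi. How are you?\n"
)
my_toy_data = pd.read_csv(toy_csv)

# Fail early with a clear message if a required column is missing
missing = {"text", "speaker"} - set(my_toy_data.columns)
if missing:
    raise ValueError(f"Missing required columns: {sorted(missing)}")

# Add a unique identifier per utterance if the file doesn't have one
if "text_id" not in my_toy_data.columns:
    my_toy_data["text_id"] = range(len(my_toy_data))
print(my_toy_data)
```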

If you are running this notebook via [mybinder.org](https://mybinder.org/),
you can upload a file using the example code below. Upload your
data then re-run the subsequent cell to see a basic Discursis visualisation.

In [None]:
from ipywidgets import FileUpload
from IPython.display import display, Markdown
import io

uploader = FileUpload(accept=".xlsx", multiple=False)

display(Markdown("Press the upload button to load your own data:"))
display(uploader)

Rerun this cell after uploading:

In [None]:
# We are doing some extra steps here to deal with the uploader
# Once you have your data loaded, things should be easier
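if uploader.value:
    # ipywidgets 7 stores uploads in a dict keyed by filename;
    #   ipywidgets 8 stores them in a tuple of dicts
    if isinstance(uploader.value, dict):
        file_info = next(iter(uploader.value.values()))
    else:
        file_info = uploader.value[0]
    my_data = pd.read_excel(io.BytesIO(file_info["content"]))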
    # After loading your data, these are the steps to
    #   calculate similarity and produce the plot:
    my_conversation = Conversation(my_data)
    my_model = ConceptSimilarityModel(my_conversation)
    print("Calculating concept similarity...")
    my_similarity = my_model.get_conversation_similarity()
    print("Done")
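    # Plot utterances 20 to 39 only; adjust or remove the slice
    #   if your conversation is shorter than that
    my_plot = ConversationPlot(my_conversation, my_similarity.iloc[20:40, 20:40])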
    my_plot.show()
else:
    print("Upload your data with the button above then rerun this cell")