# Conversational analysis using Discursis

In this workshop, we'll look at how we can use Discursis to analyze conversational
data.

The dataset we'll use is the transcript of the National Press Club Leaders Debate
between Kevin Rudd and Tony Abbott, available at the [Parliament of Australia website](https://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id:%22media/pressrel/2658246%22) under a [CC BY-NC-ND 3.0 AU](https://creativecommons.org/licenses/by-nc-nd/3.0/au/)
Creative Commons license.

## Python setup

Discursis and the accompanying tools for conversation data are in the `atap_widgets` Python
package. We'll load the tools from this library, along with the other libraries we'll be using
for the analysis:

In [1]:
import os

import pandas as pd
from bokeh.io import output_notebook
from bokeh.models import ColorBar
from bokeh.plotting import figure, show
from bokeh.transform import linear_cmap

# Individual tools from atap_widgets
from atap_widgets.concordance import (
    ConcordanceTable,
    ConcordanceWidget,
    prepare_text_df,
)
from atap_widgets.conversation import (
    ConceptSimilarityModel,
    Conversation,
    EmbeddingModel,
)
from atap_widgets.plotting import ConversationPlot

# Show up to 100 characters of each text column when displaying dataframes
pd.options.display.max_colwidth = 100

# This needs to be run to enable interactive bokeh plots
output_notebook()

In [2]:
# Create a results folder, if it doesn't already exist
os.makedirs("conversation_results", exist_ok=True)

## Loading the data

The conversation tools are designed to accept data as a `pandas` dataframe.
Each row in the dataframe should be an utterance in the conversation. There
should be a `"text"` column with the actual content of the utterance
and a `"speaker"` column identifying who is speaking. It also helps to have a
`"text_id"` column that gives each utterance a unique identifier we can refer to.

Additional metadata columns that might be relevant to your particular dataset
will be imported into the conversation tool as-is. In this case we have an additional
`"role"` column identifying each person's role in the debate.
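
Before loading your own data, it can help to check that it has this shape. The following is a minimal sketch using only `pandas`; the helper name `check_conversation_df` and the toy data are hypothetical and are not part of `atap_widgets`:

```python
import pandas as pd

REQUIRED_COLUMNS = ("text_id", "speaker", "text")


def check_conversation_df(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    if df["text_id"].duplicated().any():
        problems.append("text_id values are not unique")
    if df["text"].isna().any() or df["speaker"].isna().any():
        problems.append("some rows have no text or no speaker")
    return problems


# Toy data, invented for illustration (not taken from the debate transcript)
toy = pd.DataFrame(
    {
        "text_id": [1, 2, 3],
        "speaker": ["A", "B", "A"],
        "text": ["Hello.", "Hi there.", "Shall we start?"],
        "role": ["Host", "Guest", "Host"],
    }
)
print(check_conversation_df(toy))
```

Extra columns such as `"role"` are ignored by this check, mirroring the way additional metadata is carried along untouched.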

In [3]:
data = pd.read_excel("data/debate_clean.xlsx")
data.head()

| | text_id | speaker | text | role |
|---|---|---|---|---|
| 0 | 1 | SPEERS | Good evening and welcome to the National Press Club election leaders' debate. Please put your h... | Journalist |
| 1 | 2 | PM | This country of ours, Australia, is one of the best countries in the world. We have a strong an... | Labor |
| 2 | 3 | SPEERS | Prime Minister, thank you. Tony Abbott I would like you now to make your opening remarks. | Journalist |
| 3 | 4 | ABBOTT | Thanks very much, David. This debate is between Mr Rudd and me but the election is not about Mr... | Coalition |
| 4 | 5 | SPEERS | Tony Abbott, Thank you very much for that. Now before we get to questions from my colleagues on... | Journalist |


## Data exploration

Before carrying out conversational analysis, we can do some initial exploration of the data
to get an overview of it. We can do this with the tools that are built in to `pandas`,
rather than specialized language tools.

We can see how many times each person spoke during the debate:

In [4]:
print("Total utterances:", len(data))
data["speaker"].value_counts()

Total utterances: 118


SPEERS      47
ABBOTT      31
PM          29
HARTCHER     6
CURTIS       3
BENSON       2
Name: speaker, dtype: int64

And see how many times people in different roles spoke:

In [5]:
# margins=True adds the "All" row and column totals
pd.crosstab(data["speaker"], data["role"], margins=True)

| speaker / role | Coalition | Journalist | Labor | All |
|---|---|---|---|---|
| ABBOTT | 31 | 0 | 0 | 31 |
| BENSON | 0 | 2 | 0 | 2 |
| CURTIS | 0 | 3 | 0 | 3 |
| HARTCHER | 0 | 6 | 0 | 6 |
| PM | 0 | 0 | 29 | 29 |
| SPEERS | 0 | 47 | 0 | 47 |
| All | 31 | 58 | 29 | 118 |


We can use the `ConcordanceTable` tool to start looking at some key terms
in the conversation. Terms that might be relevant to a political debate
might be things like "economy" or "environment".

Before using the concordance tools, we need to use the `prepare_text_df()`
function to process the data, which applies some initial NLP processing.

In [6]:
data = prepare_text_df(data, text_column="text", id_column="text_id")

table = ConcordanceTable(data, keyword="economy", results_per_page=10)
table

| Document ID | Result | | |
|---|---:|:---:|:---|
| 2 | ntries in the world. We have a strong and dynamic | economy | , recognised as such around the world. We believe |
| 2 | saving and providing for your kids' future. This | economy | is strong. This election is about the future stre |
| 2 | This election is about the future strength of our | economy | and how best to secure it. The election is about |
| 2 | e it. The election is about a clear choice on the | economy | , on jobs, on how we support families under pressu |
| 2 | t education and health. We've managed to keep the | economy | strong. We are supporting families under financia |
| 2 | Government when we actually had to modernise the | economy | from the old economy of the past. We did so durin |
| 2 | ctually had to modernise the economy from the old | economy | of the past. We did so during the Global Financia |
| 4 | w our country will change: we'll build a stronger | economy | so that everyone can get ahead. We'll scrap the C |
| 5 | ajor issues of the campaign, let’s start with the | economy | . Now Prime Minister, you mentioned the Global Fin |


We can update the search settings and display the table again:

In [7]:
table.keyword = "environment"
table

| Document ID | Result | | |
|---|---:|:---:|:---|
| 85 | do we enable people to age with dignity in a home | environment | as well? And frankly, can I just make one point a |
| 106 | ission and we will give a one-stop shop for major | environment | approvals. Now, we also want to see a 5-pillar ec |


It seems the economy was a focus, but environmental issues were not a major feature of this debate.
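
To make the idea behind a concordance concrete, here is a minimal keyword-in-context sketch in plain Python. It is an illustration only: the function name `keyword_in_context`, the character-based context width and the toy texts are assumptions, and this is not how `ConcordanceTable` is implemented.

```python
import re


def keyword_in_context(texts, keyword, width=30):
    """Yield (doc_id, left, match, right) for each whole-word, case-insensitive hit."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", flags=re.IGNORECASE)
    for doc_id, text in texts.items():
        for match in pattern.finditer(text):
            # Take up to `width` characters on either side of the match
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            yield doc_id, left, match.group(0), right


# Toy utterances, invented for illustration
toy_texts = {
    1: "The economy is strong. We will keep the economy growing.",
    2: "Let's talk about schools instead.",
}
hits = list(keyword_in_context(toy_texts, "economy", width=12))
for doc_id, left, word, right in hits:
    print(f"{doc_id:>3} | {left:>12} | {word} | {right}")
```

Each hit keeps the document ID alongside its surrounding text, so you can trace a result back to the utterance it came from.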

If you have other ideas about relevant topics, you can use the
`ConcordanceWidget` to search for them in real time. If you
find useful results, you can export them to Excel:

In [8]:
widget = ConcordanceWidget(data, results_per_page=10)
widget.show()

VBox(children=(Text(value='', description='Keyword(s):'), HBox(children=(Checkbox(value=False, description='En…

> ### What's the difference between ConcordanceWidget and ConcordanceTable?
>
> * `ConcordanceWidget` lets you search interactively, but the results won't be saved in the notebook
> * `ConcordanceTable` is non-interactive, but the results are saved in the notebook, so you
>   can share them easily

## Conversational analysis

To perform conversational analysis, we need to load our data into
a `Conversation` object. The `Conversation` object carries out some initial processing of the text
using a `spacy` language model. If you need to analyse
a non-English corpus, you can install a relevant [spacy model](https://spacy.io/usage/models)
and pass its name via the `language_model` argument.

In [9]:
conversation = Conversation(
    data=data,
    text_column="text",
    speaker_column="speaker",
    id_column="text_id",
    language_model="en_core_web_sm",
)
conversation

Conversation(118 utterances, 6 speakers, language_model='en_core_web_sm')

The `Conversation` object offers some basic functionality for accessing information about
the conversation:

In [10]:
conversation.n_speakers, conversation.n_utterances

(6, 118)

You can access the table of utterance data via `conversation.data`. This
is a `pandas` DataFrame like the original data, with some additional
information added:

In [11]:
conversation.data.head()

| text_id | text_id | speaker | text | role | spacy_doc |
|---|---|---|---|---|---|
| 1 | 1 | SPEERS | Good evening and welcome to the National Press Club election leaders' debate. Please put your h... | Journalist | ( , Good, evening, and, welcome, to, the, National, Press, Club, election, leaders, ', debate, .... |
| 2 | 2 | PM | This country of ours, Australia, is one of the best countries in the world. We have a strong an... | Labor | ( , This, country, of, ours, ,, Australia, ,, is, one, of, the, best, countries, in, the, world,... |
| 3 | 3 | SPEERS | Prime Minister, thank you. Tony Abbott I would like you now to make your opening remarks. | Journalist | ( , Prime, Minister, ,, thank, you, ., Tony, Abbott, I, would, like, you, now, to, make, your, o... |
| 4 | 4 | ABBOTT | Thanks very much, David. This debate is between Mr Rudd and me but the election is not about Mr... | Coalition | ( , Thanks, very, much, ,, David, ., This, debate, is, between, Mr, Rudd, and, me, but, the, ele... |
| 5 | 5 | SPEERS | Tony Abbott, Thank you very much for that. Now before we get to questions from my colleagues on... | Journalist | ( , Tony, Abbott, ,, Thank, you, very, much, for, that, ., Now, before, we, get, to, questions, ... |


When we calculate conversational similarity below, the default method is based
on the most common terms in the data. We can check what these are:

In [12]:
conversation.get_most_common_terms(n=20)

['mr',
 'tax',
 'new',
 'abbott',
 'government',
 'people',
 'future',
 'economy',
 'years',
 'question',
 'country',
 'rudd',
 'productivity',
 'time',
 'australia',
 'minister',
 'change',
 'said',
 'going',
 'way']

We may want to treat some of these terms as **stopwords** so that they don't contribute to
the calculation of topic similarity. Once added, stopwords are excluded from
any subsequent operations:

In [13]:
conversation.add_stopword("Mr")
conversation.add_stopword("said")

In [14]:
conversation.get_most_common_terms(n=20)

['tax',
 'new',
 'abbott',
 'government',
 'people',
 'future',
 'economy',
 'years',
 'question',
 'country',
 'rudd',
 'time',
 'australia',
 'productivity',
 'minister',
 'change',
 'going',
 'way',
 'national',
 'believe']
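
The effect of a stopword list on term counts can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's implementation: the function `most_common_terms`, the simple regular-expression tokenizer and the toy sentences are all hypothetical. Stopwords are compared case-insensitively here, which is consistent with adding `"Mr"` removing `'mr'` from the list above.

```python
from collections import Counter
import re


def most_common_terms(texts, stopwords=(), n=5):
    """Count lower-cased word tokens, skipping stopwords (compared case-insensitively)."""
    blocked = {word.lower() for word in stopwords}
    counts = Counter(
        token
        for text in texts
        for token in re.findall(r"[a-z']+", text.lower())
        if token not in blocked
    )
    return [term for term, _ in counts.most_common(n)]


# Toy sentences, invented for illustration
toy_texts = ["Mr Smith said tax", "Mr Jones said tax cuts", "tax reform, Mr Smith"]
print(most_common_terms(toy_texts, n=3))
print(most_common_terms(toy_texts, stopwords=["Mr", "said"], n=3))
```

Removing a frequent but uninformative term lets the next most frequent terms move up the ranking, which is why `'national'` and `'believe'` appear in the second list above.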

You can also access and export the full frequency table of terms. The `term_frequencies` table
is a `pandas` DataFrame, so we can export it to Excel easily:

In [15]:
term_frequencies = conversation.get_term_frequencies()
term_frequencies.to_excel("conversation_results/term_frequencies.xlsx", index=False)
term_frequencies

| | term | frequency |
|---|---|---|
| 1269 | tax | 61 |
| 860 | new | 48 |
| 51 | abbott | 44 |
| 584 | government | 43 |
| 924 | people | 42 |
| ... | ... | ... |
| 850 | necessary | 1 |
| 340 | criticised | 1 |
| 852 | needed | 1 |
| 854 | negativity | 1 |


### Calculating similarity

In order to calculate similarity of terms and topics across the conversation,
we'll use the conceptual recurrence calculation from

> Angus, D., Smith, A. E., & Wiles, J. (2012). Human Communication as Coupled Time Series: Quantifying Multi-Participant Recurrence. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1795–1807. https://doi.org/10.1109/TASL.2012.2189566

This method is implemented in `ConceptSimilarityModel`, which takes in a conversation
object and performs the similarity calculation on it. To match the method used in the article,
we'll use the top 50 key terms as the basis for concepts, which we set with the
`key_terms` argument, and use 3-sentence windows when counting which terms co-occur:

In [16]:
concept_model = ConceptSimilarityModel(
    conversation, key_terms=50, sentence_window_size=3
)
concept_model

ConceptSimilarityModel(key_terms=50, sentence_window_size=3
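
To give a feel for what counting co-occurrence within sentence windows means, here is a rough sketch in plain Python. It simply counts how many sliding windows contain both terms of a pair. The function name `window_cooccurrence` and the toy sentences are hypothetical, and the weighting used in the article and in `ConceptSimilarityModel` may well differ from this plain count.

```python
from collections import Counter
from itertools import combinations


def window_cooccurrence(sentences, key_terms, window_size=3):
    """Count how many sliding windows of sentences contain each pair of key terms."""
    key_terms = set(key_terms)
    counts = Counter()
    n_windows = max(1, len(sentences) - window_size + 1)
    for start in range(n_windows):
        window = sentences[start:start + window_size]
        # Key terms that appear anywhere in this window
        present = {tok for sent in window for tok in sent if tok in key_terms}
        for pair in combinations(sorted(present), 2):
            counts[pair] += 1
    return counts


# Toy tokenised sentences, invented for illustration
toy_sentences = [
    ["tax", "cuts"],
    ["economy"],
    ["schools"],
    ["jobs", "economy"],
    ["tax"],
]
counts = window_cooccurrence(toy_sentences, ["tax", "economy", "jobs"], window_size=3)
print(counts)
```

A larger window treats terms that are further apart as related, so the window size controls how local the notion of a shared topic is.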

For convenience, we only need to call one function to get the utterance-to-utterance
similarity that will form the basis of the Discursis plot:

In [17]:
concept_similarity = concept_model.get_conversation_similarity()
print(concept_similarity.shape)
concept_similarity.iloc[:5, :5]

(118, 118)


| text_id | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 1.0 | 0.408907 | 0.69409 | 0.307462 | 0.793816 |
| 2 | 0.408907 | 1.0 | 0.329627 | 0.558965 | 0.439635 |
| 3 | 0.69409 | 0.329627 | 1.0 | 0.084305 | 0.921439 |
| 4 | 0.307462 | 0.558965 | 0.084305 | 1.0 | 0.198498 |
| 5 | 0.793816 | 0.439635 | 0.921439 | 0.198498 | 1.0 |


However, if we need to, we can also access the concept vectors the similarity is based on:

In [18]:
concept_vectors = concept_model.get_concept_vectors()
print(concept_vectors.shape)
concept_vectors.iloc[:5, :]

(50, 118)


| term / text_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| abbott | 226.657508 | 89.638009 | 65.822535 | 46.306846 | 139.180737 | 21.978506 | 6.738371 | 61.168528 | 133.213583 | 15.320487 | ... | 126.145118 | 20.160411 | 7.535714 | 30.67524 | 39.276323 | 69.625738 | 27.41392 | 218.28723 | 10.016613 | 61.233782 |
| government | 36.854661 | 203.320848 | 9.635212 | 135.253168 | 43.370261 | 48.312349 | 7.369811 | 114.924995 | 49.780928 | 24.342093 | ... | 29.328086 | 49.515239 | 7.78412 | 27.636187 | 14.307657 | 72.854472 | 11.932849 | 292.934629 | 1.807305 | 164.994088 |
| question | 431.245037 | 67.626011 | 4.925293 | 33.442717 | 43.113649 | 23.973507 | 8.615978 | 52.205826 | 25.688888 | 14.194328 | ... | 36.459616 | 41.04388 | 0.0 | 22.786054 | 10.136769 | 39.206392 | 5.749375 | 98.099399 | 2.355862 | 51.702296 |
| people | 53.351949 | 182.515537 | 7.163617 | 176.928827 | 59.43828 | 98.307783 | 4.466867 | 65.776343 | 23.981009 | 30.579424 | ... | 7.829748 | 27.882479 | 2.846561 | 10.054462 | 5.938501 | 47.696821 | 7.259189 | 358.28646 | 1.357344 | 125.946447 |
| tax | 49.382885 | 106.928395 | 10.450724 | 211.229058 | 55.101091 | 36.146314 | 7.389293 | 176.788051 | 109.888084 | 26.009276 | ... | 14.721255 | 46.038178 | 1.232143 | 9.842295 | 3.143102 | 30.223368 | 7.929428 | 135.119181 | 4.980087 | 155.393809 |
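
One common way to turn per-utterance vectors like these into a similarity score is cosine similarity, which gives 1.0 for vectors pointing in the same direction and 0.0 for vectors with no concepts in common. The sketch below shows that calculation on toy vectors. Whether `ConceptSimilarityModel` uses exactly this measure is an assumption; the function name and the vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy concept vectors (one per utterance), invented for illustration
vectors = {
    "u1": [3.0, 0.0, 1.0],
    "u2": [6.0, 0.0, 2.0],  # same direction as u1, twice the magnitude
    "u3": [0.0, 5.0, 0.0],  # shares no concepts with u1
}
similarity = {
    (i, j): cosine_similarity(vectors[i], vectors[j]) for i in vectors for j in vectors
}
print(round(similarity[("u1", "u2")], 3), round(similarity[("u1", "u3")], 3))
```

Because the measure ignores vector length, a long utterance and a short one can still be rated as highly similar if they emphasise the same concepts in the same proportions.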


We can also access the similarity between terms in the conversation (based on their co-occurrence):

In [19]:
term_similarity = concept_model.get_term_similarity_matrix()
print(term_similarity.shape)
term_similarity.loc[concept_model.key_terms, concept_model.key_terms].iloc[:5, :]

(1423, 1423)


| | abbott | government | question | people | tax | years | time | rudd | minister | country | ... | ago | look | debate | labor | cuts | better | like | come | way | strong |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| abbott | 0.0 | 1.370661 | 2.355862 | 0.841719 | 3.561905 | 0.890465 | 1.759259 | 1.376803 | 1.490909 | 0.209016 | ... | 1.913793 | 0.518145 | 0.843333 | 2.92504 | 2.671053 | 0.370389 | 0.368035 | 0.645161 | 0.840678 | 0.387097 |
| government | 1.370661 | 0.0 | 0.658789 | 0.969935 | 2.756044 | 2.756044 | 1.017316 | 2.652911 | 1.210256 | 2.178283 | ... | 6.169355 | 0.428819 | 0.930587 | 1.845588 | 1.28882 | 1.26213 | 1.511853 | 0.883803 | 3.083896 | 0.320106 |
| question | 2.355862 | 0.658789 | 0.0 | 0.391761 | 2.462121 | 0.772655 | 3.123543 | 1.649642 | 0.17619 | 1.676768 | ... | 2.447368 | 0.0 | 2.486486 | 1.395 | 1.907692 | 0.0 | 0.603104 | 1.054409 | 0.0 | 0.0 |
| people | 0.841719 | 0.969935 | 0.391761 | 0.0 | 0.535862 | 1.000805 | 0.820339 | 0.938265 | 1.223058 | 1.74 | ... | 0.725806 | 0.829032 | 0.36075 | 2.20339 | 2.614943 | 1.796364 | 0.36075 | 1.555887 | 0.823333 | 1.525424 |
| tax | 3.561905 | 2.756044 | 2.462121 | 0.535862 | 0.0 | 1.054776 | 0.584906 | 2.605128 | 0.263102 | 0.676667 | ... | 3.646259 | 1.974852 | 0.426446 | 1.974852 | 24.642857 | 0.606208 | 2.693878 | 0.0 | 0.7373 | 1.036743 |


All of these results are `pandas` DataFrames, so we can export them to Excel like we did above:

In [20]:
# The index for these tables contains important info, so include it
#   when exporting
concept_similarity.to_excel("conversation_results/debate_similarity.xlsx", index=True)
concept_vectors.to_excel("conversation_results/concept_vectors.xlsx", index=True)
term_similarity.to_excel("conversation_results/term_similarity.xlsx", index=True)

We can also use them in other tools, such as `bokeh` plots:

In [21]:
term_similarity_data = (
    term_similarity.loc[concept_model.key_terms, concept_model.key_terms]
    .stack()
    .rename_axis(["term", "other"])
    .rename("similarity")
    .reset_index()
)
p = figure(
    title="Term similarity",
    x_range=concept_model.key_terms,
    y_range=concept_model.key_terms,
)
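# Map similarity onto the colour palette over the range 0 to 1;
# values above 1 are drawn in the top colour of the palette
similarity_colours = linear_cmap("similarity", "Viridis256", 0, 1)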
p.rect(
    x="term",
    y="other",
    width=1,
    height=1,
    fill_color=similarity_colours,
    source=term_similarity_data,
)
p.xaxis.major_label_orientation = "vertical"

legend = ColorBar(color_mapper=similarity_colours["transform"])
p.add_layout(legend, "right")
show(p)

### Visualizing similarity

The Discursis-style plot of similarity across the conversation is
available through `ConversationPlot`:

In [37]:
discursis_plot = ConversationPlot(conversation, similarity_matrix=concept_similarity)
discursis_plot.show()

While the default is to colour the plot by speakers, the more relevant column here 
is probably `"role"`, so we may want to use that from now on:

In [42]:
role_plot = ConversationPlot(
    conversation, similarity_matrix=concept_similarity, grouping_column="role"
)
role_plot.show()

For more focused exploration of the conversation, you can inspect a subset.
You can use the **Box Zoom** tool to zoom in on parts of the plot manually, but it may be better
to explicitly plot the part of the conversation you want to focus on.

You can also use the
`threshold` option to remove tiles with low similarity, to better highlight the utterances
that are similar:

In [40]:
focused_plot = ConversationPlot(
    conversation,
    similarity_matrix=concept_similarity.iloc[:25, :25],
    grouping_column="role",
    threshold=0.2,
)
focused_plot.show()
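
The idea of thresholding can be sketched independently of the plot: values below the cut-off are blanked out so that only the stronger links remain. The helper `apply_threshold` below is hypothetical and is not part of `atap_widgets`. Keeping values equal to the threshold is an assumption of this sketch, and `ConversationPlot` may treat the boundary differently.

```python
def apply_threshold(matrix, threshold):
    """Return a copy of a square similarity matrix with values below `threshold` set to None."""
    return [
        [value if value >= threshold else None for value in row]
        for row in matrix
    ]


# Top-left 3x3 corner of the similarity matrix printed earlier, rounded to 2 d.p.
corner = [
    [1.00, 0.41, 0.69],
    [0.41, 1.00, 0.33],
    [0.69, 0.33, 1.00],
]
for row in apply_threshold(corner, threshold=0.5):
    print(row)
```

A higher threshold leaves fewer tiles, so it is worth trying a few values to find one that highlights structure without hiding too much of the conversation.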