# Discursis: Conversation analysis tool

Discursis can be used to analyze conversation and discourse
transcripts, and other text that is structured like a conversation, such as social media data.
In this notebook, we provide the minimum code needed to run the tool on your own dataset.
See our other notebooks, e.g. `discursis_workshop.ipynb`, for a more detailed explanation
of the different analysis options available.

For instructions on using this demonstration notebook, see the **Discursis User Guide**.

<div class="alert alert-block alert-warning">
<b>User guide to using a Jupyter Notebook</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](https://github.com/Australian-Text-Analytics-Platform/discursis/blob/master/docs/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

## Python setup

The cell below loads the Python packages/libraries required to run the Discursis tool. Run it
before running any of the other code cells.

<div class="alert alert-block alert-info">
<b>Python tools used in the Discursis tool:</b>    
    
- [spaCy](https://spacy.io/): for text cleaning and normalisation
- [pandas](https://pandas.pydata.org/): for storing and displaying data in dataframe (table) format
- [bokeh](https://bokeh.org/): for interactive visualizations
- [atap_widgets](https://github.com/Australian-Text-Analytics-Platform/atap_widgets): a collection of data analysis tools (including Discursis) developed as part of ATAP
</div>

In [None]:
import os

# In order for bokeh to work properly on Binder, we need to specify
#   the URL it's running on (atap_widgets checks for this environment variable)
os.environ["BINDER_EXTERNAL_URL"] = os.environ.get(
    "JUPYTERHUB_EXTERNAL_URL",  # Should be defined on the ATAP binder instance
    "https://notebooks.gesis.org/"
)

from IPython.display import display, Markdown
from ipywidgets import FileUpload
import pandas as pd
from bokeh.io import output_notebook
# This needs to be run to enable interactive bokeh plots
output_notebook()

# Conversational analysis tools from atap_widgets
from atap_widgets.conversation import (
    ConceptSimilarityModel,
    Conversation,
)
from atap_widgets.plotting import ConversationPlot
from atap_widgets.concordance import (
    ConcordanceTable,
    ConcordanceWidget,
    prepare_text_df,
)
from demo import read_uploaded_file

## Using Discursis to visualize conversations

Applying the Discursis tool to a dataset requires you to
load your data with `pandas`, and then create a conversation
object from it. `pandas` can load Excel and `csv` files (along
with multiple other formats), but we'll assume you have an Excel file here,
with a `.xlsx` file extension, and with two columns named
`text` and `speaker`, and one row per utterance in the conversation.
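
As a rough illustration of the expected layout, the sketch below builds a small table by hand and checks that it has the two required columns. The example utterances and the `check_columns` helper are made up for this illustration and are not part of Discursis or `atap_widgets`. For a real spreadsheet you would typically load the table with `pd.read_excel` instead of typing it in.

```python
import pandas as pd

# Hypothetical toy data: one row per utterance, in conversation order
example = pd.DataFrame(
    {
        "speaker": ["A", "B", "A"],
        "text": [
            "Good evening and welcome to the debate.",
            "Thank you, it is good to be here.",
            "Let us start with the economy.",
        ],
    }
)

REQUIRED_COLUMNS = ["text", "speaker"]


def check_columns(df, required=REQUIRED_COLUMNS):
    """Return the required columns missing from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


missing = check_columns(example)
print("Missing columns:", missing)
```

If the list of missing columns is not empty, rename the columns in your spreadsheet before uploading it.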

If you are running this notebook in your browser via [mybinder.org](https://mybinder.org/) (or another Binder site),
you can upload a file by running the cell below and then clicking the "Upload" button. If you just want to test out Discursis, you can use [this spreadsheet](https://github.com/Australian-Text-Analytics-Platform/discursis/raw/master/notebooks/data/debate_clean.xlsx), which contains the transcript of the National Press Club Leaders Debate
between Kevin Rudd and Tony Abbott. The transcript is available at the [Parliament of Australia website](https://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id:%22media/pressrel/2658246%22) under a [CC BY-NC-ND 3.0 AU](https://creativecommons.org/licenses/by-nc-nd/3.0/au/)
Creative Commons license.

<div class="alert alert-block alert-danger">
<b>Large files</b> 
    
If you are running this demo online via Binder, you should keep
your data file small (e.g. &lt; 500 utterances), as the servers
have limited resources. For larger datasets, you can install
the tool on your local machine.
</div>

In [None]:
uploader = FileUpload(accept=".xlsx", multiple=False)

display(Markdown("Press the upload button to load your own data:"))
display(uploader)

Once you've uploaded the file, run the following cell to preview your data:

In [None]:
data = read_uploaded_file(uploader)
data.head()

The next cell loads the data as a conversation, and analyses the similarity between each pair of utterances. This analysis uses the conceptual recurrence metric from:

> Angus, D., Smith, A. E., & Wiles, J. (2012). Human Communication as Coupled Time Series: Quantifying Multi-Participant Recurrence. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1795–1807. https://doi.org/10.1109/TASL.2012.2189566
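
To give a feel for what a pairwise similarity matrix looks like, here is a minimal sketch that compares utterances using plain bag-of-words cosine similarity. This is a deliberately simplified stand-in: it is not the conceptual recurrence metric from the paper above, and it is not what `ConceptSimilarityModel` computes. The utterances and helper names are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty utterance is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(utterances):
    """Build an n x n list-of-lists of pairwise similarities."""
    counts = [Counter(u.lower().split()) for u in utterances]
    n = len(counts)
    return [
        [cosine_similarity(counts[i], counts[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical utterances, one per row/column of the matrix
utterances = [
    "the economy is growing",
    "the economy is shrinking",
    "thank you all",
]
matrix = similarity_matrix(utterances)
for row in matrix:
    print([round(value, 2) for value in row])
```

As in the Discursis output, the result has one row and one column per utterance, and the entry at row `i`, column `j` scores how similar utterances `i` and `j` are.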

In [None]:
conversation = Conversation(data, text_column="text", speaker_column="speaker")
model = ConceptSimilarityModel(conversation)
similarity = model.get_conversation_similarity()
print(f"Similarity analysis finished:\nsimilarity is a matrix with {similarity.shape[0]} rows and {similarity.shape[1]} columns")

Finally, you can visualise the results of the conversation analysis. We only plot a subset of the conversation here by default (set by the `start` and `end` variables), as the results are easier to inspect on a small scale, but you can plot an entire conversation at once if needed.

In the plot:

* Utterances are represented in the **main diagonal** - each tile represents one utterance
* Tiles are coloured by speaker
* The size of tiles (in the main diagonal) reflects the length of the utterance
* Similarity between utterances is shown in the tiles **below the main diagonal**,
  showing the similarity between the two utterances in that row and column:
  * More opaque tiles have higher similarity scores
  * Tiles are coloured by the speakers for the utterances in that row and column

<div class="alert alert-block alert-warning">
<b>Interacting with the Discursis plot</b> 

The Discursis plot allows you to:
    
* Hover over tiles on the main diagonal to see the ID and text of that utterance
* Hover over tiles below the main diagonal to see the similarity scores between utterances
* Click on tiles to select them, and show the corresponding text in the table below the plot.
</div>

In [None]:
start = 20
end = 40
plot = ConversationPlot(conversation, similarity.iloc[start:end, start:end])
plot.show()
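
The subsetting in the cell above relies on standard `pandas` positional indexing. As a small sketch, assuming the similarity matrix is a square `pandas` DataFrame (as the `.iloc` call suggests), slicing rows and columns with the same `start:end` range keeps a square block covering utterances `start` up to, but not including, `end`. The `toy` matrix and its values are placeholders invented for this illustration.

```python
import pandas as pd

# Hypothetical 6 x 6 stand-in for a similarity matrix (placeholder values)
n = 6
toy = pd.DataFrame(
    [[1.0 if i == j else 0.1 * abs(i - j) for j in range(n)] for i in range(n)]
)

start, end = 2, 5
# Same range on both axes, so the subset stays square
subset = toy.iloc[start:end, start:end]

print(subset.shape)
print(list(subset.index))
```

Because `iloc` excludes the end position, `start = 20` and `end = 40` in the cell above select 20 utterances.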