# Analysis of Social Media Data using Discursis

In this workshop, we'll look at how we can use Discursis to analyze conversation and discourse
transcripts, and other text that can be structured like a conversation, such as social media data.

The first dataset we'll use is the transcript of the National Press Club Leaders Debate
between Kevin Rudd and Tony Abbott, available at the [Parliament of Australia website](https://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;query=Id:%22media/pressrel/2658246%22) under a [CC BY-NC-ND 3.0 AU](https://creativecommons.org/licenses/by-nc-nd/3.0/au/)
Creative Commons license.

The second type of dataset we will investigate is YouTube data, as collected and prepared via [Youte](https://github.com/QUT-Digital-Observatory/youte/). This will show some of the challenges of analysing social media data as conversations.

## Python setup

Discursis and the accompanying tools for conversation data are in the [atap_widgets](https://github.com/Australian-Text-Analytics-Platform/atap_widgets) Python
package. We'll load the tools from this library, along with the other libraries we'll be using
for the analysis:

In [None]:
import os

# In order for bokeh to work properly on Binder, we need to specify
#   the URL it's running on
os.environ["BINDER_EXTERNAL_URL"] = os.environ.get('JUPYTERHUB_EXTERNAL_URL', "https://notebooks.gesis.org/")

# pandas: tools for data processing
import pandas as pd

# numpy: tools for numerical calculations
import numpy as np

pd.options.display.max_colwidth = 100
# Bokeh: interactive plots
from bokeh.io import output_notebook
from bokeh.models import ColorBar
from bokeh.plotting import figure, show
from bokeh.transform import linear_cmap

# networkx: graphs/networks
import networkx as nx

# matplotlib: another plotting library
from matplotlib import cm
import matplotlib.pyplot as plt

# This needs to be run to enable interactive bokeh plots
output_notebook()

# Individual tools from atap_widgets
from atap_widgets.conversation import (
    ConceptSimilarityModel,
    Conversation,
    EmbeddingModel,
)
from atap_widgets.plotting import ConversationPlot
from atap_widgets.concordance import (
    ConcordanceTable,
    ConcordanceWidget,
    prepare_text_df,
)

In [None]:
# Create a results folder, if it doesn't already exist
os.makedirs("conversation_results", exist_ok=True)

## Loading the data

The conversation tools are designed to accept data as a `pandas` dataframe.
Each row in the dataframe should be an utterance in the conversation. There
should be a `"text"` column with the actual content of the utterance
and a `"speaker"` column identifying who is speaking. It also helps if we have a `"text_id"` column that gives a unique identifier for each utterance that
we can refer to.


Additional metadata columns that might be relevant to your particular dataset
will be imported into the conversation tool as-is. In this case we have an additional
`"role"` column identifying each person's role in the debate.

In [None]:
data = pd.read_excel("data/debate_clean.xlsx")
data.head()
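
As a rough illustration of the layout described above, the sketch below builds a tiny dataframe by hand and checks it for the expected columns. The utterances, speaker labels, and the `check_conversation_df` helper are all invented for this example and are not part of `atap_widgets`.

```python
import pandas as pd

# Hypothetical toy data: one row per utterance
toy = pd.DataFrame(
    {
        "text_id": [0, 1, 2],
        "speaker": ["MODERATOR", "SPEAKER_A", "SPEAKER_B"],
        "role": ["moderator", "candidate", "candidate"],
        "text": [
            "Welcome to the debate.",
            "Thank you, it is good to be here.",
            "Thank you for having us.",
        ],
    }
)


def check_conversation_df(df, required=("text", "speaker")):
    """Hypothetical helper: return a list of problems found in the dataframe."""
    problems = [f"missing column: {col}" for col in required if col not in df.columns]
    # A text_id column is optional, but if present it should be unique
    if "text_id" in df.columns and not df["text_id"].is_unique:
        problems.append("text_id values are not unique")
    return problems


problems = check_conversation_df(toy)
print(problems or "Looks OK")
```

Running a check like this on your own data before building a `Conversation` can catch a misnamed column early.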

## Data exploration

Before carrying out our analysis using Discursis, we can do some initial exploration of the data
to get an overview of it. We can do this with the tools that are built in to `pandas`,
rather than specialized language tools.

We can see how many times each person spoke during the debate:

In [None]:
print("Total utterances:", len(data))
data["speaker"].value_counts()

And see how many times people in different roles spoke:

In [None]:
pd.crosstab(data["speaker"], data["role"], margins=True)
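
To see what the cross-tabulation produces, here is a minimal sketch on made-up data (the speakers and roles below are hypothetical, not taken from the debate). Setting `margins=True` adds an `All` row and column holding the totals.

```python
import pandas as pd

# Hypothetical toy data, not the debate transcript
toy = pd.DataFrame(
    {
        "speaker": ["A", "B", "A", "C", "B", "A"],
        "role": [
            "candidate",
            "candidate",
            "candidate",
            "moderator",
            "candidate",
            "candidate",
        ],
    }
)

# Count utterances for each speaker/role pair, with totals in "All"
table = pd.crosstab(toy["speaker"], toy["role"], margins=True)
print(table)
```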

## Concordance

We can use the `ConcordanceWidget` tool to start looking at some key terms
in the conversation. Terms that might be relevant to a political debate
might be things like "economy" or "environment".

Before using the concordance tools, we need to use the `prepare_text_df()`
function to process the data, which applies some initial NLP processing.

If you have other ideas about relevant topics, you can use the 
`ConcordanceWidget` to search for them in real-time. If you've
found useful results, you can export them to Excel:

In [None]:
data = prepare_text_df(data, text_column="text", id_column="text_id")
widget = ConcordanceWidget(data, results_per_page=10)
widget.show()
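
If it helps to see the idea behind a concordance, the sketch below implements a bare-bones keyword-in-context search with the standard library. It is not how `ConcordanceWidget` works internally; the `keyword_in_context` function and the example utterances are invented for illustration.

```python
import re


def keyword_in_context(texts, keyword, width=30):
    """Return (left context, match, right context) for every hit of keyword."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", flags=re.IGNORECASE)
    hits = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            hits.append((left, match.group(0), right))
    return hits


# Hypothetical utterances, not taken from the debate transcript
utterances = [
    "The economy is the central issue of this campaign.",
    "We need a plan for the environment and the economy.",
    "Thank you both.",
]

# Right-align the left context so the keyword lines up in a column
for left, word, right in keyword_in_context(utterances, "economy"):
    print(f"{left:>30} [{word}] {right}")
```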

## Analysis of Conversation Data

In order to model our conversation data, we need to load our data into
a `Conversation` object. The `Conversation` object carries out some initial processing of the text,
which will be handled by a `spacy` language model. If you need to analyse
data for a non-English corpus, you can install a relevant [spacy model](https://spacy.io/usage/models).

In [None]:
conversation = Conversation(
    data=data,
    text_column="text",
    speaker_column="speaker",
    id_column="text_id",
    language_model="en_core_web_sm",
)
conversation

The `Conversation` object offers some basic functionality for accessing information about
the conversation:

In [None]:
conversation.n_speakers, conversation.n_utterances

You can access the table of utterance data via `conversation.data` - this
is a `pandas` DataFrame like the original data but has some additional
information added:

In [None]:
conversation.data.head()

When we calculate conversational (semantic) similarity below, the default method is based
on the most common terms in the data. We can check what these are:

In [None]:
conversation.get_most_common_terms(n=20)

We may want to treat some of these terms as **stopwords** so that they don't contribute to
the calculation of topic similarity. Stopwords are common grammatical terms that don't carry specific
contextual meaning, such as `and, of, the, is`. Note that the `spacy` language model we are using already has some
default stopwords defined:

In [None]:
default_stopwords = sorted(conversation.nlp.Defaults.stop_words)
print(len(default_stopwords), "default stopwords")
default_stopwords[:10]

We can also add our own stopwords with `add_stopword()`. Here we add "mr" and "said";
once added, the changes are applied in any new operations:

In [None]:
conversation.add_stopword("mr")
conversation.add_stopword("said")

In [None]:
conversation.get_most_common_terms(n=20)
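
To make the effect of stopwords concrete, here is a minimal sketch that counts terms with and without two extra stopwords. It uses a simple regular-expression tokenizer rather than `spacy`, and the utterances and stopword list are hypothetical, so treat it as an illustration of the idea rather than of what `Conversation` does internally.

```python
from collections import Counter
import re

# Hypothetical utterances and stopword list, for illustration only
utterances = [
    "Mr Smith said the budget is the priority.",
    "The budget said nothing about health, Mr Jones.",
]
stopwords = {"the", "is", "about", "nothing"}


def term_counts(texts, stopwords):
    """Count lowercase word tokens, skipping anything in the stopword set."""
    tokens = (tok for text in texts for tok in re.findall(r"[a-z]+", text.lower()))
    return Counter(tok for tok in tokens if tok not in stopwords)


before = term_counts(utterances, stopwords)
after = term_counts(utterances, stopwords | {"mr", "said"})

print(before.most_common(3))
print(after.most_common(3))
```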

You can also access and export the full frequency table of terms. The `term_frequencies` table
is a `pandas` DataFrame, so we can export it to Excel easily:

In [None]:
term_frequencies = conversation.get_term_frequencies()
term_frequencies.to_excel("conversation_results/term_frequencies.xlsx", index=False)
term_frequencies

### Calculating similarity

In order to calculate similarity of terms and topics across the conversation,
we'll use the conceptual recurrence calculation from

> Angus, D., Smith, A. E., & Wiles, J. (2012). Human Communication as Coupled Time Series: Quantifying Multi-Participant Recurrence. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1795–1807. https://doi.org/10.1109/TASL.2012.2189566

This method is implemented in `ConceptSimilarityModel`, which takes in a conversation
object and performs the similarity calculation on it. To match the method used in the article,
we'll use the top 50 key terms as the basis for concepts, which we set with the
`key_terms` argument, and use 3-sentence windows when counting which terms co-occur:

In [None]:
concept_model = ConceptSimilarityModel(
    conversation, key_terms=50, sentence_window_size=3
)

concept_similarity = concept_model.get_conversation_similarity()
concept_model

For convenience, a single call to `get_conversation_similarity()`, as in the cell above, gives us
the utterance-to-utterance similarity that will form the basis of the Discursis plot.
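
To give a feel for what an utterance-to-utterance similarity matrix contains, the sketch below computes a much simpler stand-in: cosine similarity between utterances represented by which key terms they mention. This is not the conceptual recurrence calculation from the article, which also uses term co-occurrence within sentence windows, and it is not the `ConceptSimilarityModel` implementation. The key terms and tokenised utterances are hypothetical.

```python
import numpy as np

# Hypothetical key terms and tokenised utterances, for illustration only
key_terms = ["economy", "jobs", "climate"]
utterances = [
    ["economy", "jobs"],
    ["jobs", "economy", "growth"],
    ["climate"],
]

# One row per utterance, one column per key term (1 if the term appears)
vectors = np.array(
    [[1.0 if term in tokens else 0.0 for term in key_terms] for tokens in utterances]
)

# Cosine similarity between every pair of utterances
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # avoid dividing by zero for utterances with no key terms
unit = vectors / norms
similarity = unit @ unit.T

print(similarity.round(2))
```

The result is a square, symmetric matrix with one row and column per utterance, which is the shape of input the plot in the next section works from.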

## Visualizing similarity

The Discursis-style plot of similarity across the conversation is
available through `ConversationPlot`.

You can use the **Box Zoom** tool to zoom in on parts of the plot manually.

**Box zoom icon**: ![Box zoom icon](https://docs.bokeh.org/en/latest/_images/BoxZoom.png)

You can also adjust the
`threshold` option to remove tiles with low similarity, to better highlight the utterances
that are similar - note that the default is to not apply a threshold:

In [None]:
discursis_plot = ConversationPlot(
    conversation, 
    similarity_matrix=concept_similarity,
    threshold=0.2,
)
discursis_plot.show()
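
Conceptually, a threshold just hides the cells of the similarity matrix that fall below a cutoff. The sketch below shows that idea on a made-up matrix; how `ConversationPlot` applies `threshold` internally (for example, whether the comparison is inclusive) is an assumption here.

```python
import numpy as np

# Hypothetical utterance-to-utterance similarity matrix
similarity = np.array(
    [
        [1.0, 0.6, 0.1],
        [0.6, 1.0, 0.3],
        [0.1, 0.3, 1.0],
    ]
)

threshold = 0.2

# Keep cells at or above the threshold; blank out the rest
filtered = np.where(similarity >= threshold, similarity, np.nan)

# Count the off-diagonal pairs that survive (each pair counted once)
upper = np.triu_indices_from(filtered, k=1)
n_kept = int(np.sum(~np.isnan(filtered[upper])))

print(filtered)
print("pairs kept:", n_kept)
```

Raising the threshold leaves fewer tiles, which can make clusters of closely related utterances easier to spot.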

## Using Discursis with YouTube data

Applying Discursis to YouTube data that we've already collected and prepared using `youte tidy` requires:

1. (If working in Binder or other JupyterHub instances) Uploading the processed database to the binder instance.
2. Loading the comment data from the prepared database.
3. Telling Discursis which columns to use in the dataset.
4. Applying the functions we used earlier to visualise the conversation, and using the concordance tools to search for keywords in context.

### Collecting YouTube data

For this section, we're going to look at the comments on a [YouTube video discussing a cat-related video game](https://www.youtube.com/watch?v=PVYVkXhQtzU). To collect and prepare the comments using [youte](https://github.com/QUT-Digital-Observatory/youte/), we can run the following commands:

```
# Collect the top level comments on the specific video id
youte list-comments -v kitty.json PVYVkXhQtzU
youte tidy kitty.json kitty.db

# Collect the comments in a specific thread
youte list-comments -t kitty_thread.json UgzlxcyHprauGpVcsBV4AaABAg
youte tidy kitty_thread.json kitty_thread.db

```

### Uploading Data

Data can be uploaded using the standard Jupyter tools - in this case we're going to upload the collected data to the `data` subdirectory.

### Loading Data

To load the data, we connect directly to the uploaded database and load what we need from there. Note that we're ordering results by `published_at` - this establishes the sequence of comments for Discursis to use when visualising the conversation.

In [None]:
import sqlite3

# Create a connection to the database
conn = sqlite3.connect("data/kitty_thread.db")

# Note that we need to order by a date, as we need the *sequence* of comments on the page
# Resetting the index adds an "index" column that we can use as a unique id for each comment
data = pd.read_sql_query("select * from comments order by published_at", conn)
data.reset_index(inplace=True)

data.head()
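
The same load-and-order pattern can be tried without a real database. The sketch below uses an in-memory SQLite database with a hypothetical, cut-down `comments` table; it assumes `published_at` is stored as ISO 8601 text, so that sorting the text also sorts the comments chronologically.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for the youte output;
# the real comments table has many more columns
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table comments (author_name text, text_original text, published_at text)"
)
conn.executemany(
    "insert into comments values (?, ?, ?)",
    [
        ("user_b", "Second comment", "2022-07-20T09:30:00Z"),
        ("user_a", "First comment", "2022-07-19T18:05:00Z"),
        ("user_a", "Third comment", "2022-07-21T11:00:00Z"),
    ],
)

toy = pd.read_sql_query("select * from comments order by published_at", conn)
conn.close()

# reset_index() turns the 0..n-1 row numbers into an "index" column,
# which can then act as a unique id for each comment
toy = toy.reset_index()
print(toy)
```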

### Preparing the Conversation

As before, we'll prepare the conversation object - note that we're mapping the YouTube `author_name` as the "speaker", and the `text_original` column as the text for Discursis to use. 

In [None]:
conversation = Conversation(
    data=data,
    text_column="text_original",
    speaker_column="author_name",
    id_column="index",
    language_model="en_core_web_sm",
)
conversation

#### Investigation

As before, let's take a look at the common terms in the dataset.

In [None]:
conversation.get_most_common_terms(n=50)

You'll notice that "etc" is a common word, so let's add it as a stopword:

In [None]:
conversation.add_stopword("etc")
conversation.get_most_common_terms(n=50)

### Creating and Visualising the Similarity Model

We'll follow the same steps as earlier to create and then visualise the similarity model.

In [None]:
concept_model = ConceptSimilarityModel(
    conversation, key_terms=50, sentence_window_size=3
)

concept_similarity = concept_model.get_conversation_similarity()
concept_model.key_terms

In [None]:
discursis_plot = ConversationPlot(
    conversation, 
    similarity_matrix=concept_similarity,
    threshold=0.5,
)
discursis_plot.show()

### Looking Deeper with Concordance

In [None]:
data = prepare_text_df(data, text_column="text_original", id_column="index")
widget = ConcordanceWidget(data, results_per_page=10)
widget.show()