# ANZCA 2023 - Digital Methods Workshop

## Relevant links:

Australian Text Analytics Platform (ATAP): https://www.atap.edu.au/

Juxtorpus code base: https://github.com/Sydney-Informatics-Hub/juxtorpus

## Juxtorpus: Understanding your data

### Definitions
**Corpus**

A corpus is a collection of documents of a similar kind. An example of a corpus would be essays from a particular author.
Documents in a corpus are typically assumed to be in the same format as each other and to share a theme or subject-matter.

**Metadata**

Metadata is data that describes the dataset in question. Examples of metadata for a news article would be the author or the date published. Metadata provides context for data so that we can perform more informed analysis.

### Example data set: Internet news data with readers engagement
Access link: https://www.kaggle.com/datasets/szymonjanowski/internet-articles-data-with-users-engagement

Context:

This dataset was created and used in order to determine the popularity of an article before it was published online, but thanks to its flexibility it can be used in various tasks.
It contains articles (listed as the top in popularity on the publisher's website) from multiple well-known publishers. The data was then enriched, using the Facebook Graph API, with engagement features such as share, reaction, and comment counts.

Date range: 03-09-2019 to 04-11-2019

### Data Context

Data always comes with context and bias. It is part of our job when analysing data to understand the data's context and account for it when drawing conclusions. One method of accounting for the context of data is data cleaning. The following changes to the example data set have been made to more easily achieve our analysis goals:

- only the relevant columns were preserved. Columns such as 'url' have been removed. This speeds up processing and makes loading easier.
- the article contents column has been edited to improve term frequency analysis:
    - most entries in the 'content' column ended with a statement like "… \[+100 chars\]". We can infer that this is not from the articles themselves, but is an artifact of the parsing tool used to gather the articles (most likely to keep the file size small). Performing frequency analysis prior to this amendment yielded "chars" as by far the most frequently used term.
- data analysis tools often contain rudimentary data cleaning functionality. In the timeline tool demonstrated in this notebook, there is a checkbox titled "Remove stopwords", which excludes terms such as "but", "and", and "the" from the analysis. Be mindful of the data cleaning functionality available to you so that you don't make unnecessary changes to the data.
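
The truncation-marker clean-up described above can be sketched in a few lines. This is a minimal illustration, not the exact procedure used to prepare the workshop data: the function name `strip_truncation_marker`, the regular expression, and the sample sentence are all assumptions made for the example.

```python
import re

# Matches a trailing marker such as "… [+100 chars]" or "... [+2345 chars]".
# The exact marker format is an assumption based on the example quoted above.
TRUNCATION_MARKER = re.compile(r"\s*(?:…|\.\.\.)?\s*\[\+\d+ chars\]\s*$")


def strip_truncation_marker(text: str) -> str:
    """Remove a trailing "[+N chars]" marker, if one is present."""
    return TRUNCATION_MARKER.sub("", text)


# Hypothetical article snippet, not taken from the dataset.
sample = "Markets rallied on Tuesday as investors… [+100 chars]"
cleaned = strip_truncation_marker(sample)
print(cleaned)
```

Text without a marker passes through unchanged, so the function can be applied to every entry in the 'content' column without first checking which entries were truncated.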

### Data Analysis

#### Corpus Loading

In [None]:
import os
# Move up the directory tree until the juxtorpus directory is visible.
while 'juxtorpus' not in os.listdir():
    os.chdir('../')
from juxtorpus.corpus.corpora import Corpora
from juxtorpus import Jux
from juxtorpus.viz.item_timeline import ItemTimeline

import warnings
warnings.filterwarnings('ignore')

corpora = Corpora()
corpora.widget()

In [None]:
corpus_ls = [corpora.get(c) for c in corpora.items()]
corpus_ls[-1].summary()

#### Author publishing frequency

In order to run the following cell, ensure you have built a corpus with the document column assigned to the column with the author names (or equivalent).

This cell will provide a timeline of publishing frequency for each day by author.
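
To make "publishing frequency for each day by author" concrete, the counting behind such a timeline can be sketched with the standard library alone. This is only an illustration of the idea, not how `ItemTimeline` is implemented; the records below are hypothetical and not taken from the dataset.

```python
from collections import Counter
from datetime import date

# Hypothetical (author, publication date) records for illustration only.
records = [
    ("Author A", date(2019, 9, 3)),
    ("Author A", date(2019, 9, 3)),
    ("Author B", date(2019, 9, 3)),
    ("Author A", date(2019, 9, 4)),
]

# Count the number of articles for each (day, author) pair.
counts = Counter((day, author) for author, day in records)

for (day, author), n in sorted(counts.items()):
    print(day.isoformat(), author, n)
```

Each (day, author) count corresponds to one point on an author's line in the timeline.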

**Example data discussion**

Below we can start to see some patterns and limitations within our example data:
- individual author names are conflated with publication names. We can mitigate this by deselecting the publication names in the visualisation
- some publications have duplicate entries (BBC News and https://www.facebook.com/bbcnews)
- some dates are conspicuously missing any entries at all (Sept 21st to Sept 25th)

In [None]:
corpus_ls = [corpora.get(c) for c in corpora.items()]
corpus = corpus_ls[-1]
corpus.create_custom_dtm(lambda x: [x])
timeline = ItemTimeline.from_corpus(corpus, freq='1d', use_custom_dtm=True)
timeline.widget()

#### Term frequency over time

In order to run the following cell, ensure you have built a corpus with the document column assigned to the column with the article contents.

This cell will provide a timeline of terms used across all articles for each day.

In [None]:
corpus_ls = [corpora.get(c) for c in corpora.items()]
corpus = corpus_ls[-1]
timeline = ItemTimeline.from_corpus(corpus, freq='1d')
timeline.widget()

#### Term frequency by news source

In order to run the following cell, ensure you have built a corpus with the document column assigned to the column with the news source identifier.

Additionally, ensure you have sliced the corpus twice by news source.

This cell will set up the news source corpora for use in the later cells.

In [None]:
corpus_ls = [corpora.get(c) for c in corpora.items()]
corpus_a = corpus_ls[-1]
corpus_b = corpus_ls[-2]

The following cells provide a wordcloud for each corpus.

In [None]:
corpus_a.viz.wordcloud()

In [None]:
corpus_b.viz.wordcloud()

**Polarity wordcloud - [Term frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency) mode**

The Juxtorpus wordcloud in Term Frequency mode provides a wordcloud with three variables for each term:
1. colour: one corpus is assigned red and the other blue.
2. opacity: the opacity of a term indicates how much its frequency skews to one corpus. A very opaque term occurs much more frequently in one corpus than in the other.
3. size: the size of a term indicates how distinctive that term is of one corpus. The formula for determining the size is (simplified): `size = polarity / total_frequency`. Polarity is the difference between the frequency of the term in one corpus and its frequency in the other corpus (normalised for corpus size). Dividing by total frequency controls for words that are common across both corpora.
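
The simplified size formula can be worked through on a toy example. This sketch is not the Juxtorpus implementation: the function names, the sign convention (positive skews to the first corpus), and the choice to take `total_frequency` as the sum of the two normalised frequencies are assumptions made for illustration.

```python
from collections import Counter


def relative_freqs(tokens):
    """Term frequency normalised by corpus size (total token count)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def polarity_sizes(tokens_a, tokens_b):
    """Compute polarity / total_frequency for every term in either corpus."""
    freq_a, freq_b = relative_freqs(tokens_a), relative_freqs(tokens_b)
    sizes = {}
    for term in set(freq_a) | set(freq_b):
        a, b = freq_a.get(term, 0.0), freq_b.get(term, 0.0)
        polarity = a - b  # > 0 skews to corpus A, < 0 skews to corpus B
        sizes[term] = polarity / (a + b)
    return sizes


# Two tiny hypothetical corpora, already tokenised.
tokens_a = "election vote vote the the".split()
tokens_b = "football goal the the the".split()
sizes = polarity_sizes(tokens_a, tokens_b)

for term, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{term:10s} {size:+.2f}")
```

Terms that appear in only one corpus reach the extremes of the range, while a term shared by both corpora, such as "the", ends up close to zero even though its raw frequency is high.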

In [None]:
jux = Jux(corpus_a, corpus_b)
jux.polarity.wordcloud('tf')

**Polarity wordcloud - [Term frequency-inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency%E2%80%93inverse_document_frequency) mode**

An alternative to Term Frequency measurement is Term frequency-inverse document frequency. Tf-idf is the product of Term frequency and Inverse document frequency, which minimises the score for terms that appear in many documents within the corpus. A much more detailed description of tf-idf can be found at the link above.
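
As a rough worked example of the product described above, the sketch below computes one common tf-idf variant by hand. The weighting scheme Juxtorpus uses may differ (tf-idf has several variants); the function name `tf_idf` and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenised document.

    tf  = count of the term in the document / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (n / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return scores


# Three tiny hypothetical documents, already tokenised.
docs = [
    "the election result".split(),
    "the football result".split(),
    "the election campaign".split(),
]
scores = tf_idf(docs)

for term, score in sorted(scores[2].items(), key=lambda item: -item[1]):
    print(f"{term:10s} {score:.3f}")
```

In this variant, "the" appears in every document and so scores zero, while "campaign", which appears in only one document, scores highest in the third document.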

In [None]:
jux = Jux(corpus_a, corpus_b)
jux.polarity.wordcloud('tfidf')

**Polarity wordcloud - [Log likelihood](https://ucrel.lancs.ac.uk/llwizard.html) mode**

Log-likelihood is another alternative measure. It compares how often a term is observed in each corpus with how often it would be expected to occur given the sizes of the two corpora, so that differences in corpus size are taken into account.
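
For reference, the log-likelihood calculation used by the wizard linked above can be sketched as follows. This is a minimal illustration of that formula rather than the Juxtorpus code, which may differ in detail; the function name, parameter names, and example counts are assumptions.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood for one term observed in two corpora.

    freq_a, freq_b: occurrences of the term in corpus A and corpus B
    size_a, size_b: total number of tokens in corpus A and corpus B
    """
    total = freq_a + freq_b
    # Expected counts if the term were spread evenly relative to corpus size.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Hypothetical counts: a term seen 30 times in one corpus and 10 in another,
# where both corpora contain 1000 tokens.
print(log_likelihood(30, 10, 1000, 1000))

# The same term frequency relative to corpus size gives a score of zero.
print(log_likelihood(10, 20, 1000, 2000))
```

A higher score indicates stronger evidence that the term's frequency differs between the two corpora. The score itself does not indicate which corpus the term favours, so the direction has to be read from the observed frequencies.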

In [None]:
jux = Jux(corpus_a, corpus_b)
jux.polarity.wordcloud('log_likelihood')

## Other ATAP tools

Semantic tagger: tag your text so you can extract token level semantic tags from the tagged text.
https://github.com/Australian-Text-Analytics-Platform/semantic-tagger

Quotation tool: extract quotes from a text while providing information about the speakers and quote locations.
https://github.com/Australian-Text-Analytics-Platform/quotation-tool

Discursis: an analysis and visualisation tool for conversational data.
https://github.com/Australian-Text-Analytics-Platform/discursis