# ANZCA 2023 - Digital Methods Workshop
## Juxtorpus: Understanding your data

### Definitions
**Corpus**

A corpus is a (usually large) collection of documents of a similar kind. For example, the essays of a particular author form a corpus.
Documents in a corpus are typically assumed to share the same format and a common theme or subject matter.

**Metadata**

Metadata is data that describes the dataset in question. For a news article, examples of metadata include the author and the publication date. Metadata provides context for the data so that we can perform more informed analysis.
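
As a rough illustration of these two definitions, a corpus can be pictured as a list of documents, each holding its text together with a metadata record. The `Document` class, its field names, and the sample articles below are hypothetical and made up for this sketch; they are not part of Juxtorpus.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """One document: its text plus the metadata describing it (hypothetical structure)."""
    text: str
    metadata: dict = field(default_factory=dict)


# A tiny, made-up corpus: documents of the same kind (news articles) in the same format.
corpus = [
    Document("Markets rallied on Tuesday.", {"author": "A. Writer", "published": "2019-09-03"}),
    Document("Storm warnings were issued.", {"author": "B. Reporter", "published": "2019-09-04"}),
]

# Metadata lets us slice the corpus, for example by selecting one author's articles.
by_author = [doc for doc in corpus if doc.metadata["author"] == "A. Writer"]
print(len(by_author))
```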

### Example data set: Internet news data with readers engagement
Access link: https://www.kaggle.com/datasets/szymonjanowski/internet-articles-data-with-users-engagement

Context:

This dataset was created to determine the popularity of an article before it is published online, but it is flexible enough to be used for various other tasks.
It contains articles (listed as the most popular on each publisher's website) from multiple well-known publishers. The data was then enriched, using the Facebook Graph API, with engagement features such as share, reaction, and comment counts.

Date range (DD-MM-YYYY): 03-09-2019 to 04-11-2019

### Data Context

Data always comes with context and bias. It is part of our job when analysing data to understand the data's context and account for it when drawing conclusions. One method of accounting for the context of data is data cleaning. The following changes to the example data set have been made to more easily achieve our analysis goals:

- Only the relevant columns were kept; columns such as 'url' were removed. This speeds up processing and makes loading easier.
- The article contents column was edited to improve term frequency analysis:
    - Most entries in the 'content' column ended with a statement like "… \[+100 chars\]". We can infer that this text does not come from the articles themselves but is an artifact of the tool used to gather them (most likely a truncation to keep the file size small). Before this text was removed, frequency analysis ranked "chars" as by far the most frequently used term.

Data analysis tools often include rudimentary data cleaning functionality of their own. The timeline tool demonstrated in this notebook has a checkbox titled "Remove stopwords", which excludes terms such as "but", "and", and "the". Be mindful of the cleaning functions already available to you so that you don't make unnecessary changes to the data.
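
The truncation-marker clean-up described above could be sketched roughly as follows. This is not the exact script used to prepare the workshop data: the regular expression, the helper names, and the sample rows are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed marker shape: an optional ellipsis followed by "[+<digits> chars]" at the end.
TRUNCATION_MARKER = re.compile(r"\s*(?:…|\.\.\.)?\s*\[\+\d+ chars\]\s*$")


def strip_truncation_marker(content: str) -> str:
    """Remove a trailing truncation marker, if one is present."""
    return TRUNCATION_MARKER.sub("", content)


def term_counts(texts):
    """Count lowercase alphabetic terms across an iterable of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


# Made-up sample rows, for illustration only.
raw = [
    "The council approved the plan… [+100 chars]",
    "Voters backed the plan… [+250 chars]",
]
cleaned = [strip_truncation_marker(row) for row in raw]

print(term_counts(raw)["chars"])      # "chars" is counted once per article before cleaning
print(term_counts(cleaned)["chars"])  # and no longer appears after cleaning
```

Because the marker appears on almost every row, even a small sample shows how it can outrank genuine vocabulary in a frequency count.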

### Data Analysis

#### Corpus Loading

In [None]:
from juxtorpus.corpus.corpora import Corpora
from juxtorpus import Jux
from juxtorpus.viz.item_timeline import ItemTimeline

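import warnings
# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Create an empty collection of corpora and display its loading widget
corpora = Corpora()
corpora.widget()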

In [None]:
corpus_ls = [corpora.get(c) for c in corpora.items()]
corpus_ls[-1].summary()

#### Author publishing frequency

Below we can start to see some patterns and limitations within our example data:
- individual author names are conflated with publication names. We can mitigate this by deselecting the publication names in the visualisation
- some publications have duplicate entries (BBC News and https://www.facebook.com/bbcnews)
- some dates are conspicuously missing any entries at all (Sept 21st to Sept 25th)
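
Two of these limitations, duplicate source labels and gaps in the date range, can also be checked programmatically. The sketch below uses only the Python standard library; the alias table, the helper names, and the sample dates are assumptions for illustration and are not part of Juxtorpus.

```python
from datetime import date, timedelta

# Hypothetical alias table mapping alternative source labels onto one canonical name.
SOURCE_ALIASES = {"https://www.facebook.com/bbcnews": "BBC News"}


def canonical_source(name: str) -> str:
    """Map known aliases onto a single label; leave other names unchanged."""
    return SOURCE_ALIASES.get(name, name)


def missing_dates(published):
    """Return each calendar day between the earliest and latest date that has no entries."""
    seen = set(published)
    day, last = min(seen), max(seen)
    gaps = []
    while day <= last:
        if day not in seen:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Made-up publication dates, for illustration only.
dates = [date(2019, 9, 19), date(2019, 9, 20), date(2019, 9, 26)]
print(missing_dates(dates))
print(canonical_source("https://www.facebook.com/bbcnews"))
```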

In [None]:
corpus_ls = [corpora.get(c) for c in corpora.items()]
corpus = corpus_ls[-1]
corpus.create_custom_dtm(lambda x: [x])
timeline = ItemTimeline.from_corpus(corpus, freq='1d', use_custom_dtm=True)
timeline.widget()

#### Term frequency over time 

In [None]:
corpus_ls = [corpora.get(c) for c in corpora.items()]
corpus = corpus_ls[-1]
timeline = ItemTimeline.from_corpus(corpus, freq='1d')
timeline.widget()
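
To make the idea behind this view concrete, the sketch below counts terms per day, which is roughly what the `freq='1d'` argument suggests. It is a simplified illustration rather than Juxtorpus's implementation; the function name, the tokenisation rule, and the sample documents are assumptions.

```python
import re
from collections import Counter, defaultdict
from datetime import date


def daily_term_frequency(docs, stopwords=frozenset()):
    """Map each day to a Counter of the terms found in that day's documents."""
    per_day = defaultdict(Counter)
    for published, text in docs:
        terms = re.findall(r"[a-z]+", text.lower())
        per_day[published].update(t for t in terms if t not in stopwords)
    return dict(per_day)


# Made-up (date, text) pairs, for illustration only.
docs = [
    (date(2019, 9, 3), "The election and the economy"),
    (date(2019, 9, 3), "Election results"),
    (date(2019, 9, 4), "The economy slows"),
]
tf = daily_term_frequency(docs, stopwords={"the", "and", "but"})
print(tf[date(2019, 9, 3)]["election"])
```

Passing a stopword set mirrors the effect of the "Remove stopwords" checkbox: common function words are dropped before counting, so they do not dominate the timeline.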

#### Term frequency by news source

In [None]:
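corpus_ls = [corpora.get(c) for c in corpora.items()]
# Take the last two corpora in the list as the pair to compare
corpus_a = corpus_ls[-1]
corpus_b = corpus_ls[-2]
jux = Jux(corpus_a, corpus_b)
# Word cloud contrasting the two corpora by term frequency ('tf')
jux.polarity.wordcloud('tf')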
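
One simple way to think about contrasting two corpora by term frequency is to compare how often each term occurs relative to the size of its corpus. The sketch below illustrates that idea only; it is an assumption about the general approach, not a description of how Juxtorpus computes polarity, and the function names and sample texts are made up.

```python
import re
from collections import Counter


def relative_tf(texts):
    """Term frequency normalised by the total number of terms in the corpus."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def tf_difference(texts_a, texts_b):
    """Positive scores lean towards corpus A, negative scores towards corpus B."""
    tf_a, tf_b = relative_tf(texts_a), relative_tf(texts_b)
    return {t: tf_a.get(t, 0.0) - tf_b.get(t, 0.0) for t in set(tf_a) | set(tf_b)}


# Made-up one-document corpora, for illustration only.
scores = tf_difference(["trade trade talks"], ["trade weather"])
for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{term}: {score:+.2f}")
```

Normalising by corpus size matters when the two sources published different numbers of articles; raw counts would otherwise favour the larger corpus.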