In [1]:
import numpy as np
import pandas as pd
import sentence_transformers
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import bertopic
import spacy
from transformers import pipeline
import string
from nltk.stem import PorterStemmer

In [14]:
df_news = pd.read_json('data/cc_gw_news_blogs_2021-10-01_2021-10-31.json')

# For testing purposes, you might find it helpful to cut this dataset down to only the first 500 articles.
df_news = df_news.iloc[:500]

## Challenge 1: Identifying characteristic words.
One of the most valuable applications of TF-IDF is to use the IDF weighting to provide an explainable way to highlight the terms that are key to a given document. This can even be done across a group of documents, such as all those articles that are produced by the same news source.

Explore the articles present in the provided example data and determine the five most characteristic words for each source. You should:
1. Identify the total list of sources in the dataset.
2. Group the article texts together to create a document for each source.
3. Calculate the TF-IDF vectors for each source. Consider the use of custom stopword sets.
4. Identify the tokens with maximum weight in each vector.

Hint: TF-IDF methods include two parameters that affect the results: `max_df` and `min_df`. Review how they affect the results.

In [15]:
import tqdm
#1. Get the list of all sources.
# The source column has most of this, but we need to process the data a little first.
df_news['source_uri'] = [u['uri'] for u in df_news.source]  ## Alternatively, set() could be applied to the list comprehension, but it's useful to save this.
unique_sources = df_news['source_uri'].unique()
print(unique_sources[:10])

['haidagwaiiobserver.com' 'worldbank.org' 'eurasiareview.com'
 'stettlerindependent.com' 'reliefweb.int' 'bloombergquint.com'
 'catholicnewsagency.com' 'cranbrooktownsman.com' 'dailyecho.co.uk'
 'conservativeangle.com']


In [17]:
#2. Group the article texts for each source into one document.
# For now we'll record this in a dictionary mapping each source domain to its combined text, which keeps the link between source and full text.
source_text = {}
for u in tqdm.tqdm(unique_sources):  # tqdm is very handy to add a progress bar and indicate roughly how long it will take to execute this for loop.
    df_u = df_news[df_news.source_uri==u]
    source_text[u] = ' '.join(df_u.body)
print(source_text['dailyecho.co.uk'][:100])

100%|█████████████████████████████████████████████████████████████████████████████| 340/340 [00:00<00:00, 685.18it/s]

Charles is expected to call for a "vast military-style campaign" to address urgent environmental iss





In [20]:
#3. Calculate the TF-IDF vectors for the sources. Consider the use of custom stop words.
# We should first define our list of stopwords, then we can just apply the example code we saw before.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  ## Public import path for the built-in English stop word list.
stop = list(ENGLISH_STOP_WORDS) + ['climate', 'change', 'global', 'warming']  ## These terms should be removed as they're part of our search terms.
stop += ['comment', 'subscribe', 'cookie', 'accept', 'reject']  ## Add in some terms that might appear from online data quirks (not at all an exhaustive list).

tf_vectorizer = TfidfVectorizer(stop_words=stop,max_df=0.5,min_df=10) ## Using max_df and min_df we automatically ignore terms in more than half of all sources and in fewer than 10 sources.
X_tf = tf_vectorizer.fit_transform([source_text[u] for u in unique_sources])  ## We need to use this list comprehension to ensure that the order matches.
print(tf_vectorizer.get_feature_names_out())
print(X_tf.toarray())

['000' '10' '100' ... 'zero' 'zeshan' 'zone']
[[0.         0.         0.         ... 0.15650129 0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.02345739 0.         0.         ... 0.10601783 0.         0.        ]
 ...
 [0.         0.         0.07375629 ... 0.03991311 0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.07159617 ... 0.         0.         0.        ]]


In [48]:
#4. Identify the tokens with maximum weight in each document.
# Each row in X_tf corresponds to a particular source, whereas the columns refer to the feature names (i.e. words).
feature_names = tf_vectorizer.get_feature_names_out()

source_key_terms = {}
for i,u in enumerate(unique_sources):
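    max_cols = np.argsort(X_tf[i].toarray())  ## We need to make a dense array for the sorting function to work.
    ## argsort sorts in ascending order, so the last five indices are the highest-weighted terms (listed from weakest to strongest).
    source_key_terms[u] = [feature_names[j] for j in max_cols[0,-5:]]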

print(source_key_terms['dailyecho.co.uk'])
print(source_key_terms['worldbank.org'])
print(source_key_terms['conservativeangle.com'])

['london', 'health', 'pollution', 'ride', 'air']
['policy', 'project', 'bank', 'disaster', 'risk']
['democrats', 'white', 'interested', 'biden', 'americans']


In [49]:
# We can see from checking through these key terms that we can begin to characterise the sources based on the types of words that they prefer to use.
# This can reveal elements of topic focus or region of interest or political leaning from even just a few most characteristic terms.
# This is an application that really benefits from having a big dataset. You need to see a lot of content to start getting enough term frequency to 
# see the most important terms.
# One key thing that should be clear here is the explainability of TF-IDF. We can link our statistical results back to the very tokens we started with.
# This is a key advantage of these simpler methods over transformers!

## Challenge 2: n-gram construction.
Words are not the only sub-sentence unit used to study linguistic patterns in natural language processing. A more general unit called an `n-gram` defines sequences of `n` consecutive words extracted from a text. For example, `2-grams` or `bi-grams` are pairs of words appearing consecutively in a given string. More concretely, the text `This is example text` yields the following bi-grams: `this is`, `is example`, `example text`.
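
The definition above can be sketched in a few lines of plain Python. The helper name `ngrams` is hypothetical, and splitting on whitespace is a simplifying assumption (a real tokeniser would also handle punctuation).

```python
def ngrams(text, n):
    """Return the n-grams of `text` as strings, using lower-cased whitespace tokens."""
    words = text.lower().split()
    # Slide a window of n words across the token list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("This is example text", 2))
# ['this is', 'is example', 'example text']
```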

Using different values of `n`, look for different n-gram patterns appearing in the titles and body text of the provided example dataset. You should:
1. Compare the bi-gram frequency values both before and after removing stopwords from the text.
2. Find the most frequent 4-grams that include `climate` and the most frequent 4-grams that include `change`. Determine the overlap between these sets of 4-grams (i.e. those that include both `climate` and `change`).
3. Find the set of 5-grams that contain `climate change`. After partitioning the example dataset into one-day windows, can you see any patterns in how the usage of `climate change` varies over time?

## Challenge 3: Narrative consistency.
Text data can be difficult to work with for many reasons, chiefly because it is noisy and needs careful cleaning. In the example data you were given, portions of the text may be apparently unrelated to the major themes of the article. This is an inevitable consequence of the data source: it is provided by a service that collects and formats online news articles, and this sometimes results in text from web features (e.g. other story links) being incorrectly included in the main text. Luckily, we can use some of the techniques we introduced previously to look for divergence in the semantic content of portions of the text.

Look for evidence of narrative inconsistency in the body text of the example data. You should:
1. Determine an appropriate unit of analysis to partition the article into, and preprocess the data into these units.
2. Describe an algorithm to measure the internal narrative consistency of an article that uses these sub-article units.
3. Apply this algorithm to quantify the narrative consistency of all articles in the example dataset. Which articles are the most and least narratively consistent?

Hint: Consider mapping the semantic space covered by the article. Which articles cover the largest or smallest semantic space?
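
One way to make the idea concrete is sketched below. It uses sentences as the unit of analysis and scores an article by the mean pairwise cosine similarity between its sentence vectors. To keep the sketch runnable offline, TF-IDF vectors stand in for semantic embeddings; this is an assumption of the sketch, and vectors from a sentence-embedding model could be swapped in. The helper `consistency_score` and both toy articles are hypothetical.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def consistency_score(text):
    """Mean pairwise cosine similarity between the sentences of `text`."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    units = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(units) < 2:
        return 1.0  # a single unit is trivially consistent with itself
    # Stand-in for semantic embeddings (assumption): TF-IDF fitted per article.
    vectors = TfidfVectorizer().fit_transform(units)
    sims = cosine_similarity(vectors)
    n = len(units)
    # Average the off-diagonal entries only (exclude self-similarity).
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

# Toy articles, invented for illustration only.
focused = "The flood damaged the town. The flood damaged many homes. The town rebuilt the homes."
mixed = "The flood damaged the town. The flood damaged many homes. Subscribe now for celebrity recipes."

print(consistency_score(focused))
print(consistency_score(mixed))
```

Here the article with the unrelated final sentence receives the lower score. TF-IDF only captures word overlap, so paraphrased but related sentences would be penalised; that is the main reason to prefer semantic embeddings on the real data.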

## Challenge 4: Positivity and negativity in narratives.
The precise dataset and seeding applied to topic modelling can make a big difference to the results. Sometimes this can mean that the detected topics miss some of the smaller, more nuanced themes. It is therefore often a good idea to consider a few different outputs from topic modelling under a range of parameter values. At this stage, confidence in the model outputs can be determined by aggregating the common patterns appearing across different runs.

Compare the topics identified in different perspectives on the example data. You should:
1. Apply the BERTopic framework with several different parameter combinations in the UMAP and HDBSCAN processes. Do you notice any topics that are consistent over runs?
2. Partition the news dataset into three sets of articles based on valence (i.e. positive, neutral, negative). What proportion of articles fall in each category? How sensitive is this value to the method of sentiment analysis applied?
3. Visualise the semantic space for the entire corpus. Are there any spatial patterns emerging based on the article valence?
4. Apply BERTopic to each subset. Compare the topics found on each subset of the dataset to those found over all texts.
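
For step 2, the partition itself can be sketched independently of whichever sentiment model you choose. The snippet below assumes each article has already been given a signed score in [-1, 1]. The scores in `toy_scores` are placeholders invented for illustration, and the helper names and the `threshold` parameter are hypothetical.

```python
from collections import Counter

def valence_label(score, threshold=0.25):
    """Map a signed sentiment score in [-1, 1] to a valence category."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

def proportions(scores, threshold):
    """Share of scores falling in each valence category."""
    counts = Counter(valence_label(s, threshold) for s in scores)
    return {k: counts[k] / len(scores) for k in ("positive", "neutral", "negative")}

# Placeholder scores, invented for illustration; real values would come from a sentiment model.
toy_scores = [0.8, 0.1, -0.6, -0.2, 0.3, 0.0, -0.9, 0.05]

# Varying the neutral band shows how sensitive the split is to this choice.
for t in (0.1, 0.25, 0.5):
    print(t, proportions(toy_scores, t))
```

Repeating the same comparison with scores from different sentiment methods, rather than different thresholds, gives a direct view of the sensitivity the question asks about.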