In [None]:
import numpy as np
import pandas as pd
import sentence_transformers
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import bertopic
import spacy
from transformers import pipeline
import string
from nltk.stem import PorterStemmer

In [None]:
df_news = pd.read_json('data/cc_gw_news_blogs_2021-10-01_2021-10-31.json')

# For testing purposes, you might find it helpful to cut this dataset down to only the first few articles.
df_news = df_news.iloc[:100]

## Challenge 1: Identifying characteristic words.
One of the most valuable applications of TF-IDF is to use the IDF weighting to provide an explainable way to highlight the terms that are key to a given document. This can even be done across a group of documents, such as all those articles that are produced by the same news source.

Explore the articles present in the provided example data and determine the five most characteristic words for each source. You should:
1. Identify the total list of sources in the dataset.
2. Group the article texts together to create a document for each source.
3. Calculate the TF-IDF vectors for each source. Consider the use of custom stopword sets.
4. Identify the tokens with maximum weight in each vector.

Hint: TF-IDF methods include two parameters that affect the results: `max_df` and `min_df`. Review how they affect the results.

## Challenge 2: n-gram construction.
Words are not the only sub-sentence unit used to study linguistic patterns in natural language processing. A more general unit, the `n-gram`, is a sequence of `n` consecutive words extracted from a text. For example, `2-grams` or `bigrams` are pairs of words appearing consecutively in a given string. More concretely, the text `This is example text` yields the following bigrams: `this is`, `is example`, `example text`.

Using different values of `n`, look for different n-gram patterns appearing in the titles and body text of the provided example dataset. You should:
1. Write a custom function to calculate all n-grams of a given string. This function should include optional stopword removal.
2. Compare the bigram frequency values both before and after removing stopwords from the text.
3. Find the most frequent 4-grams that include `climate` and the most frequent 4-grams that include `change`. Determine the overlap between these sets of 4-grams (i.e. those that include both `climate` and `change`).
4. Find the set of 5-grams that contain `climate`. After partitioning the example dataset into one-day windows, can you see any patterns in how the usage of `climate` varies over time? Hint: you will need a larger sample or a different sampling strategy to see more than one day included in the data.
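As a starting point for the tasks above, here is one possible shape for the n-gram function. It is a sketch under stated assumptions: the function name `ngrams`, the simple regex tokeniser, and the sample sentence are illustrative choices, not part of the exercise data.

```python
import re
from collections import Counter


def ngrams(text, n=2, stopwords=None):
    """Return all n-grams of `text` as space-joined strings.

    Tokenisation is a simple lowercase regex split (an assumption);
    stopwords, if given, are removed before the n-grams are built.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("This is example text"))
# ['this is', 'is example', 'example text']

# Overlap between 4-grams containing `climate` and those containing `change`.
sample = "Leaders said climate change policy must address climate risk and drive change"
four_grams = Counter(ngrams(sample, n=4))
with_climate = {g for g in four_grams if "climate" in g.split()}
with_change = {g for g in four_grams if "change" in g.split()}
both = with_climate & with_change
```

Note that removing stopwords before building n-grams joins words that were not adjacent in the original text, which is one reason the bigram frequencies differ before and after removal.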

## Challenge 3: Narrative consistency.
Text data can be difficult to work with for many reasons, chiefly because it is noisy and needs careful cleaning. The example data you were given has a particular issue: portions of the text may appear unrelated to the article's major themes. This is an inevitable consequence of the data source; it is provided by a service that collects and formats online news articles, and text from web features (e.g. links to other stories) is sometimes incorrectly included in the main text. Luckily, we can use some of the techniques introduced previously to look for divergence in the semantic content of portions of the text.

Look for evidence of narrative inconsistency in the body text of the example data. You should:
1. Determine an appropriate unit of analysis to partition the article into, and preprocess the data into these units.
2. Describe an algorithm to measure the internal narrative consistency of an article that uses these sub-article units.
3. Apply this algorithm to quantify the narrative consistency of all articles in the example dataset. Which articles are the most and least narratively consistent?

Hint: Consider mapping the semantic space covered by the article. Which articles cover the largest or smallest semantic space?
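One possible consistency measure is the mean pairwise cosine similarity between an article's sentences. The sketch below assumes sentences as the unit of analysis and a naive punctuation-based splitter, and it swaps in TF-IDF vectors as a lightweight stand-in for sentence embeddings so that it runs without downloading a model; with `sentence_transformers`, the vectorisation step would be replaced by encoded sentence embeddings. The function name `consistency` and the two toy articles are hypothetical.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def consistency(article):
    """Mean pairwise cosine similarity between the sentences of one article."""
    # Naive sentence split on terminal punctuation (an assumption).
    units = [s.strip() for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]
    if len(units) < 2:
        return 1.0  # a single unit is trivially consistent with itself
    # TF-IDF is a stand-in here; sentence embeddings would capture meaning better.
    vectors = TfidfVectorizer().fit_transform(units)
    sims = cosine_similarity(vectors)
    upper = np.triu_indices(len(units), k=1)  # each unordered pair once
    return float(sims[upper].mean())


coherent = (
    "Climate change raises sea levels. "
    "Sea levels rise with climate change. "
    "Climate change and sea levels are linked."
)
noisy = (
    "Climate change raises sea levels. "
    "Click here to subscribe now. "
    "Celebrity wedding photos revealed today."
)

print(consistency(coherent), consistency(noisy))
```

A low score flags articles whose units share little content, which is where stray web-feature text is most likely to be found. Mean distance to the centroid of the unit vectors is an alternative that follows the hint about the size of the semantic space covered.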

## Challenge 4: Positivity and negativity in narratives.
The precise dataset and seeding applied to topic modelling can make a big difference to the results. Sometimes this means that the detected topics miss some of the smaller, more nuanced themes. It is therefore often a good idea to consider several outputs from topic modelling under a range of parameter values. Confidence in the model outputs can then be assessed by aggregating the common patterns that appear across different runs.
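One simple way to aggregate patterns across runs is to match topics by the overlap of their top words. The sketch below assumes each run has already been reduced to a dictionary mapping a topic id to its list of top words (with BERTopic, the words returned by `get_topic` could be used for this); the helper names, the toy runs, and the 0.5 threshold are illustrative assumptions.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two collections of words."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def stable_topics(run_a, run_b, threshold=0.5):
    """Pair each topic in run_a with its best match in run_b above `threshold`."""
    matches = []
    for id_a, words_a in run_a.items():
        best_id, best_score = None, 0.0
        for id_b, words_b in run_b.items():
            score = jaccard(words_a, words_b)
            if score > best_score:
                best_id, best_score = id_b, score
        if best_score >= threshold:
            matches.append((id_a, best_id, best_score))
    return matches


# Toy top-word lists standing in for two runs with different parameters.
run_a = {
    0: ["climate", "emissions", "carbon", "target"],
    1: ["flood", "storm", "rain", "damage"],
}
run_b = {
    0: ["storm", "flood", "rain", "wind"],
    1: ["carbon", "emissions", "climate", "tax"],
    2: ["election", "vote", "party", "poll"],
}

matches = stable_topics(run_a, run_b)
print(matches)
```

Topics that find a match in most runs can be treated as more reliable than those that appear only once. Topic ids are not comparable between runs, which is why matching is done on word content rather than on the ids themselves.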

Compare the topics identified in different perspectives on the example data. You should:
1. Apply the BERTopic framework with several different parameter combinations in the UMAP and HDBSCAN processes. Do you notice any topics that are consistent over runs?
2. Partition the news dataset into three sets of articles based on valence (i.e. positive, neutral, negative). What proportion of articles fall in each category? How sensitive is this value to the method of sentiment analysis applied?
3. Visualise the semantic space for the entire corpus. Are there any spatial patterns emerging based on the article valence?
4. Apply BERTopic to each subset. Compare the topics found on each subset of the dataset to those found over all texts.
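For the valence partition in step 2, one common pattern is to map a signed sentiment score onto three labels using a symmetric cut-off. The sketch below assumes scores in the range -1 to 1 have already been produced by whichever sentiment method you choose; the scores shown are made-up illustrative values, and the function name and thresholds are arbitrary assumptions.

```python
from collections import Counter


def label_valence(score, threshold=0.05):
    """Map a signed sentiment score in [-1, 1] to a valence label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def valence_proportions(scores, threshold=0.05):
    """Proportion of scores falling in each valence category."""
    counts = Counter(label_valence(s, threshold) for s in scores)
    return {label: counts[label] / len(scores)
            for label in ("positive", "neutral", "negative")}


# Made-up scores for illustration only; not results from the dataset.
scores = [0.62, -0.40, 0.01, -0.03, 0.20, -0.75]

narrow = valence_proportions(scores, threshold=0.05)
wide = valence_proportions(scores, threshold=0.30)
print(narrow)
print(wide)
```

Comparing the proportions under different thresholds, or under different scoring methods, gives a direct view of how sensitive the partition is to those choices.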