<a id="nlp-trials-using-gensim-for-selecting-number-of-topics"></a>

# [NLP Trials using Gensim for selecting number of topics](#nlp-trials-using-gensim-for-selecting-number-of-topics)

In [None]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [None]:
!python -m spacy download en_core_web_md

In [None]:
import os
from azure.storage.blob import BlobServiceClient
from glob import glob
from io import StringIO
from operator import itemgetter

import altair as alt
import matplotlib.ticker as mtick
import numpy as np
import pandas as pd
import spacy
from gensim.corpora.dictionary import Dictionary
from joblib import dump, load
from scipy import stats
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
%aimport src.data_helpers
from src.data_helpers import load_data

%aimport src.extraction_helpers
from src.extraction_helpers import get_top_n_most_freq_words

%aimport src.hybrid_helpers
from src.hybrid_helpers import get_nmf_coherence_scores

%aimport src.processing_helpers
from src.processing_helpers import (
    do_tokenize,
    expand_contractions,
    expand_match,
    process_text,
)

%aimport src.visualization_helpers
from src.visualization_helpers import (
    altair_boxplot_sorted,
    altair_datetime_heatmap,
    altair_plot_bar_chart_value_counts,
    altair_plot_grid_by_column,
    altair_plot_horiz_bar_chart,
    altair_plot_line_chart,
    altair_plot_histogram_grid_by_column,
    altair_plot_triangular_heatmap,
    boxplot_sorted,
)

<a id="toc"></a>

## [Table of Contents](#table-of-contents)
0. [About](#about)
1. [User Inputs](#user-inputs)
2. [Load Data, Concatenate and Filter](#load-data-concatenate-and-filter)
3. [Topic modeling using Gensim NMF with Topic coherence to find best number of topics](#topic-modeling-using-gensim-nmf-with-topic-coherence-to-find-best-number-of-topics)
   - 3.1. [Pre-Processing for Gensim NMF, Tokenization, Stemming, etc.](#pre-processing-for-gensim-nmf,-tokenization,-stemming-etc.)
   - 3.2. [Use Gensim to perform Bag-of-Words transformation](#use-gensim-to-perform-bag-of-words-transformation)
   - 3.3. [Use Gensim NMF and Topic coherence to find number of topics](#use-gensim-nmf-and-topic-coherence-to-find-number-of-topics)
4. [Topic modeling using TFIDF vectorization and NMF](#topic-modeling-using-tfidf-vectorization-and-nmf)
   - 4.1. [Pre-processing for NMF](#pre-processing-for-nmf)
   - 4.2. [NMF](#nmf)
   - 4.3. [Merging with the original data](#merging-with-the-source-data)
   - 4.4. [Coherence Residual by Topic](#coherence-residual-by-topic)
   - 4.5. [Assign Names to 15 Topics](#assign-names-to-15-topics)
   - 4.6. [Comparison to Word Vectors](#comparison-to-word-vectors)
5. [Exploring topics combined with source data and 15 topics](#exploring-topics-combined-with-source-data-and-15-topics)
6. [Exploring articles with 35 topics](#exploring-articles-with-35-topics)
   - 6.1. [Coherence Residual by Topic](#coherence-residual-by-topic)
   - 6.2. [Assign Names to 35 Topics](#assign-names-to-35-topics)
7. [Exploring topics combined with source data and 35 topics](#exploring-topics-combined-with-source-data-and-35-topics) 
   - 7.1. [Years Featured](#years-featured)
   - 7.2. [Most Popular Topic by Year](#most-popular-topic-by-year)
   - 7.3. [All Topics by Year](#all-topics-by-year)
   - 7.4. [Examining Infrequently Occurring Topics](#examining-infrequently-occurring-topics)
   - 7.5. [Terms by Topic](#terms-by-topic)
     - 7.5.1. [Philae](#philae)
     - 7.5.2. [SpaceX](#spacex)
     - 7.5.3. [Discovery of Gravitational Waves](#discovery-of-gravitational-waves)
     - 7.5.4. [MIR Space Station Funding](#mir-space-station-funding)
     - 7.5.5. [Reporting on Pluto](#reporting-on-pluto)
8. [Extracting Topics from Unseen Data](#extracting-topics-from-unseen-data)
   - 8.1. [Comparison to Word Vectors for Unseen Data](#comparison-to-word-vectors-for-unseen-data)
9. [Conclusion](#conclusion)
10. [Looking Forward](#looking-forward)

<a id="about"></a>

## 0. [About](#about)

In this notebook, we will go through a second experiment with [topic coherence](https://en.wikipedia.org/wiki/Coherence_(linguistics)) approaches using Gensim to find an optimal number of topics from the Guardian's Space news listings data in `data/processed/*_processed.csv`. Following previous work ([1](https://github.com/robsalgado/personal_data_science_projects/blob/master/topic_modeling_nmf/topic_modeling_cnn.ipynb)), this will be done using [`sklearn`](https://en.wikipedia.org/wiki/Scikit-learn)'s [`NMF` model](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization#Text_mining) with [TFIDF vectorization](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), after determining the optimal number of topics using [Gensim](https://en.wikipedia.org/wiki/Gensim)'s topic coherence pipeline to evaluate the topics determined by Gensim's `NMF`.

<a id="user-inputs"></a>

## 1. [User Inputs](#user-inputs)

We'll define below the variables that are to be used throughout the code.

In [None]:
PROJ_ROOT_DIR = os.getcwd()
data_dir = os.path.join(PROJ_ROOT_DIR, "data", "raw")
processed_data_dir = os.path.join(PROJ_ROOT_DIR, "data", "processed")

In [None]:
# General inputs
cloud_data = True

# Topic naming
topic_nums = list(range(10, 45 + 5, 5))
n_top_words = 10

unwanted_guardian_cols = [
    "webTitle",
    "id",
    "sectionId",
    "sectionName",
    "type",
    "isHosted",
    "pillarId",
    "pillarName",
    "page",
    "document_type",
    "apiUrl",
    "publication",
    "year",
    "month",
    "day",
    "dayofweek",
    "dayofyear",
    "weekofyear",
    "quarter",
]

# SpaCy preference
run_spacy_medium_model = False

In [None]:
def shaprio_wilk_is_normal(g):
    """
    Use Shapiro-Wilk test to check if data is normally distributed
    SOURCE
    ------
    https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html
    """
    sw_statistic, p_value = stats.shapiro(g)
    if len(g) > 5000:  # W-stat is more accurate for larger sample sizes
        if sw_statistic < 0.05:
            return False
    if p_value < 0.05:
        return False
    return True


def levene_has_equal_variance(g1, g2):
    """
    Use Levene test to check homoscedasticity (equal variance)
    SOURCE
    ------
    https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.levene.html
    """
    _, p_value = stats.levene(g1, g2)
    if p_value < 0.05:
        return False
    else:
        return True


def ttest_or_manwhitu(g1, g2):
    """
    Use T-test (if normal and homoskedastic) or Mann Whitney U test to
    calculate test statistic between two groups
    SOURCES
    -------
    1. Mann Whitney U test
       - https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
    2. T-test (without, by default, and with equal variance)
       - use parameter `equal_var` to indicate if variances are equal or not
       - https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
    """
    if shaprio_wilk_is_normal(g1) and shaprio_wilk_is_normal(g2):
        if levene_has_equal_variance(g1, g2):
            test_type = "T-Test"
            return stats.ttest_ind(g1, g2)
        else:
            test_type = "T-Test with unequal variance"
            return stats.ttest_ind(g1, g2, equal_var=False)
    test_type = "U-Test"
    print(f"Performing {test_type} ...")
    return [stats.mannwhitneyu(g1, g2), test_type]

In [None]:
# General inputs
az_storage_container_name = "myconedesx7"

# Guardian Filenames
# # Cloud-based files
guardian_inputs = {
    "blobedesz21": "urls",
    "blobedesz19": "text1",
    "blobedesz20": "text2",
}

In [None]:
alt.data_transformers.disable_max_rows()

The SpaCy pipeline steps in the pre-trained model downloaded and used here are
- part-of-speech tagger
- dependency parser
- named entity recognizer

and will be used when comparing to the word vectors-based approach.

In [None]:
# nlp = spacy.load("en_core_web_sm")  # does not include (multi-dimension.) word vectors (DO NOT USE)
nlp = spacy.load("en_core_web_md")
print(nlp.pipe_names)

The medium sized model is required since `_sm` (small sized models) don't include word vectors (see the [**Important note**](https://spacy.io/usage/vectors-similarity)).

In [None]:
conn_str = (
    "DefaultEndpointsProtocol=https;"
    f"AccountName={os.getenv('AZURE_STORAGE_ACCOUNT')};"
    f"AccountKey={os.getenv('AZURE_STORAGE_KEY')};"
    f"EndpointSuffix={os.getenv('ENDPOINT_SUFFIX')}"
)
blob_service_client = BlobServiceClient.from_connection_string(conn_str=conn_str)

<a id="load-data-concatenate-and-filter"></a>

## 2. [Load Data, Concatenate and Filter](#load-data-concatenate-and-filter)

We'll start by loading the data and dropping the news articles that are 500 characters or shorter in length.

In [None]:
%%time
df_guardian = load_data(
    cloud_data,
    data_dir,
    "",
    "",
    guardian_inputs,
    blob_service_client,
    az_storage_container_name,
    unwanted_guardian_cols,
)
df_guardian["year"] = pd.to_datetime(df_guardian["publication_date"]).dt.year
df_guardian["article_chars"] = df_guardian["text"].str.split().str.len()

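Next, we will show a boxplot of article length by year. Note that `article_chars` is computed above with `.str.split().str.len()`, so despite its name it holds the number of whitespace-separated words in each article, not the number of characters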

In [None]:
# boxplot_sorted(
#     df_guardian[["article_chars", "year"]],
#     ["year"],
#     "article_chars",
#     "Article Characters by Year",
#     "center",
#     12,
#     45,
#     False,
#     (12, 6),
# )

altair_boxplot_sorted(
    df_guardian[["article_chars", "year"]],
    "year",
    "article_chars",
    "",
    "Article Characters by Year",
    14,
    14,
    16,
    dx=50,
    offset=-5,
    x_tick_label_angle=-45,
    horiz_bar_chart=False,
    axis_range=[0, 8_000],
    fig_size=(700, 300),
)

Most of the articles are about 1,000 words in length (`article_chars` counts whitespace-separated words, as computed above). The outliers (circles) that are longer than this extend up to approx. 4,000, with only a very small number of articles being longer than this.

The number of articles per year is shown below

In [None]:
altair_plot_bar_chart_value_counts(
    df_guardian["year"].value_counts().sort_index().reset_index(),
    "Number of articles by Year",
    "index:N",
    "year:Q",
    labelFontSize=12,
    titleFontSize=12,
    plot_titleFontSize=16,
    dx=30,
    offset=-5,
    x_tick_label_angle=-45,
    horiz_bar_chart=False,
    fig_size=(700, 250),
)

Prior to 1999, very few **Space** articles were published by the Guardian. There was also a drop from 2006 to 2010, likely coinciding with the [Great Recession](https://www.investopedia.com/terms/g/great-recession.asp). From 2011, the number of articles increased until 2015 and has been dropping since then.

The big events in British space news in 2003-2005 were the launch of, and search for, the missing Beagle 2 Mars lander. It was found in 2015. See the timeline [here](https://www.itv.com/news/2015-01-16/the-timeline-of-events-surrounding-the-beagle-2). In 2015, there was big news about the search for the ESA's Philae lander (part of the Rosetta mission to the comet 67P). The lander was declared lost in 2016. The mission timeline is available [here](http://www.esa.int/Education/Teach_with_Rosetta/Rosetta_timeline/). These could account for the increased number of the newspaper's online publications during these years.

<a id="topic-modeling-using-gensim-nmf-with-topic-coherence-to-find-best-number-of-topics"></a>

## 3. [Topic modeling using Gensim NMF with Topic coherence to find best number of topics](#topic-modeling-using-gensim-nmf-with-topic-coherence-to-find-best-number-of-topics)

<a id="pre-processing-for-gensim-nmf,-tokenization,-stemming-etc."></a>

### 3.1. [Pre-Processing for Gensim NMF, Tokenization, Stemming, etc.](#pre-processing-for-gensim-nmf,-tokenization,-stemming-etc.)

Now, we'll perform the following processing actions on each news article's text
- [tokenize](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) the text of the article, using the [NLTK package's `TweetTokenizer`](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual)
- clean the text of the articles
  - convert to lowercase
  - remove numbers
  - [expand contractions](https://www.kdnuggets.com/2018/08/practitioners-guide-processing-understanding-text-2.html) ([code](https://grammar.yourdictionary.com/style-and-usage/using-contractions.html))
  - ([snowball](https://en.wikipedia.org/wiki/Snowball_(programming_language))) [stemming](https://en.wikipedia.org/wiki/Stemming)
  - remove [punctuation](https://docs.python.org/3/library/string.html#string.punctuation)
  - remove [stopwords](https://en.wikipedia.org/wiki/Stop_word)
  - remove any standalone single character
  - remove [whitespaces](https://en.wikipedia.org/wiki/Whitespace_character)
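
The cleaning steps listed above can be sketched in plain Python as follows. This is only an illustration and not the project's `process_text` helper: the function name `simple_process_text`, the tiny stopword set and the contraction map are all hypothetical, a plain `str.split()` stands in for `TweetTokenizer`, and stemming is left out to keep the sketch free of third-party dependencies.

```python
import re
import string

# Hypothetical, tiny stopword list (for illustration only)
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "it", "not"}
# Hypothetical contraction map; a real one would be much larger
CONTRACTIONS = {"isn't": "is not", "it's": "it is", "won't": "will not"}


def simple_process_text(text):
    """Lowercase, expand contractions, then drop numbers, punctuation,
    stopwords and single characters. Stemming is omitted in this sketch."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # splitting also discards repeated whitespace
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


print(simple_process_text("It's 2015: the Philae lander isn't  lost!"))
```

Contractions are expanded before punctuation is removed so that the apostrophes are still present when the lookup happens.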

In [None]:
%%time
texts = df_guardian["text"].apply(process_text)

The tokenized texts are made up of nearly 67,000 unique words, as shown below

In [None]:
%%time
len(set(pd.DataFrame(texts.tolist()).to_numpy().flatten()))

After processing them, the most common tokens (expressed as a percentage) by year are shown below, for a subset of years from the data. This is done using a helper function that returns an integer representing the number of times a given token appears in the article's text. The helper function is shown below and is followed by the code to use it for getting the most common tokens by year

In [None]:
with open("src/extraction_helpers.py", "r") as f:
    for k, line in enumerate(f):
        if k >= 4:
            print(str(k) + " " + line.strip("\n"))

In [None]:
%%time
ywc = [
    1957,
    1986,
    2000,
    2001,
    2003,
    2004,
    2007,
    2011,
    2012,
    2014,
    2015,
    2016,
    2017,
    2018,
    2019,
]
top_n_words = 10

# Get most popular token by year
tokens_by_year = pd.concat([df_guardian["year"], texts], axis=1)
df_mpy = tokens_by_year.groupby("year").apply(lambda x: get_top_n_most_freq_words(x, top_n_words, True)).to_frame()

# Sanity check for first year (1957)
assert pd.DataFrame(
    get_top_n_most_freq_words(
        tokens_by_year[tokens_by_year["year"] == 1957], top_n_words, True
    )
).equals(pd.DataFrame(df_mpy.loc[1957].iloc[0]))

# Put into DataFrame for plotting
df_ywcs = []
for y in ywc:
    df_ywc = (
        pd.DataFrame.from_dict(df_mpy.loc[y][0], orient="index")
        .T.rename(columns={0: "count"})
        .rename_axis("word")
        .assign(year=y)
        .reset_index()
    )
    df_ywcs.append(df_ywc)
df_ywc = pd.concat(df_ywcs)
# display(df_ywc)
altair_plot_grid_by_column(
    df_ywc, xvar="count", yvar="word", col2grid="year", fig_size=(100, 200)
)

The token `space` is universally the most commonly occurring one in Guardian **Space** articles from the Science section, though it wasn't used much in the early days of publishing in this section of the newspaper.

In 1986, the space shuttle Challenger's explosion was mentioned in a lot of articles. However, the Columbia shuttle crash in 2003 did not get mentioned by name. Even the token `shuttle` wasn't as frequent in 2003 as it was after the Challenger shuttle crash in 1986. Other topics such as the launch of the twin Mars rovers, later [during the same year](https://www.nasa.gov/directorates/somd/reports/2003/table1.html), shared the spotlight with the shuttle crash and were mentioned in a lot of news articles in this section.

<a id="use-gensim-to-perform-bag-of-words-transformation"></a>

### 3.2. [Use Gensim to perform Bag-of-Words transformation](#use-gensim-to-perform-bag-of-words-transformation)

Now, we'll convert each tokenized article into Gensim's [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) format. A `Dictionary` assigns an integer ID to each unique token and, for each document, `doc2bow()` returns a list of tuples comprising the token identifier and the corresponding count (frequency). Before the conversion, tokens that appear in fewer than 3 documents or in more than 85% of the documents are dropped, and only the 5,000 most frequent remaining tokens are kept.

In [None]:
%%time
# Create Dictionary
dictionary = Dictionary(texts)

# Remove extreme values
dictionary.filter_extremes(
    no_below=3,  # default = 5
    no_above=0.85,  # default is 0.5
    keep_n=5_000,  # default is 100_000
)

# Term Document Frequency for corpus
corpus = [dictionary.doc2bow(text) for text in texts]
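
To make the bag-of-words format concrete, here is a minimal plain-Python sketch of what a token-to-ID mapping and the resulting list of `(token_id, count)` tuples look like. It does not use Gensim; the names `token2id` and `to_bow` are made up for this illustration, and the IDs Gensim assigns to the same tokens may differ.

```python
from collections import Counter

docs = [
    ["mars", "rover", "launch", "mars"],
    ["comet", "lander", "launch"],
]

# Assign an integer ID to each unique token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

Tokens that are missing from the mapping (for example, ones removed by a frequency filter) are simply skipped when a document is converted.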

<a id="use-gensim-nmf-and-topic-coherence-to-find-number-of-topics"></a>

### 3.3. [Use Gensim NMF and Topic coherence to find number of topics](#use-gensim-nmf-and-topic-coherence-to-find-number-of-topics)

Next, we'll compute the coherence score for our specified list of number of topics to be compared. A [topic coherence `Class`](https://radimrehurek.com/gensim/models/coherencemodel.html), from the [Gensim library](https://pypi.org/project/gensim/), is used to evaluate topics found using an [NMF](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization) model. A helper function is used for this and is shown below

In [None]:
with open("src/hybrid_helpers.py", "r") as f:
    for k, line in enumerate(f):
        if k >= 4:
            print(str(k) + " " + line.strip("\n"))

In [None]:
!pygmentize src/hybrid_helpers.py

The coherence model's reported coherence score will be computed as the number of topics used to train an NMF model is varied. The higher the coherence for the selected number of topics the better. This way, we can select the number of topics in an NMF model that returns the highest coherence score.

In [None]:
%%time
# For each specified number of topics, run NMF and calculate topic coherence
topic_coherence_scores = [
    get_nmf_coherence_scores(corpus, texts, n, dictionary)
    for n in topic_nums
]

# Extract coherence score for each number of topics tried
df_coherence_scores = (
    pd.DataFrame.from_dict(dict(zip(topic_nums, topic_coherence_scores)), orient="index")
    .reset_index()
    .rename(columns={"index": "num_topics", 0: "coherence"})
    .set_index("num_topics")
)

We'll now plot the coherence scores by the number of topics

In [None]:
altair_plot_line_chart(
    df_coherence_scores.reset_index(),
    "num_topics",
    "coherence",
    "Topic Coherence versus Number of Topics",
    labelFontSize=12,
    titleFontSize=14,
    plot_titleFontSize=16,
    linewidth=3,
    dx=35,
    offset=-5,
    x_tick_label_angle=0,
    marker_size=150,
    y_axis_range=[0.44, 0.5],
    fig_size=(700, 250),
)

Over a range of 15 to 45 topics, we can see that there isn't much of a change in the coherence score and there is little evidence of a trend. For the above choices of pre-processing and coherence-modeling, the choice of 35 topics gives the maximum coherence score but the difference between scores for 15 and 35 topics is less than 10%. That's quite a large range of topics but only a small improvement.
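
As a rough sketch of this comparison, the snippet below picks the number of topics with the highest coherence and reports its relative gain over a 15-topic baseline. The scores in the dictionary are hypothetical placeholders and are not the values produced by this notebook.

```python
# Hypothetical coherence scores keyed by number of topics (NOT real results)
scores = {10: 0.452, 15: 0.470, 20: 0.468, 25: 0.474, 30: 0.471, 35: 0.486}

# Number of topics with the highest coherence score
best_n = max(scores, key=scores.get)

# Relative improvement of the best score over the 15-topic baseline
baseline_n = 15
rel_gain = (scores[best_n] - scores[baseline_n]) / scores[baseline_n]
print(f"best={best_n}, gain over {baseline_n} topics={rel_gain:.1%}")
```

A small relative gain spread over a wide range of topic counts is one reason to prefer the smaller, easier to interpret model.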

For now, we'll move forward with 15 topics

In [None]:
n_topics_wanted = 15

<a id="topic-modeling-using-tfidf-vectorization-and-nmf"></a>

## 4. [Topic modeling using TFIDF vectorization and NMF](#topic-modeling-using-tfidf-vectorization-and-nmf)

We'll now train an ML model with the chosen number of topics (15).

**TFIDF Overview**

First, we'll convert the documents (news articles) into vector representations that will allow numeric ML techniques to be applied. This is [feature extraction or word vectorization](https://en.wikipedia.org/wiki/Word_embedding), since it maps single or multiple words from a vocabulary of words to a corresponding real-numbered vector.

Here, we'll use a [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vectorizer as a preliminary step before the NMF model, as opposed to the Gensim bag-of-words counts that were used when selecting the optimal number of topics. So, for this part, `scikit-learn` will be used instead of Gensim.

**NMF Overview**

The ML model will be [Non-Negative Matrix Factorization, or NMF](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization). At a high level, NMF approximates a given matrix using a [low-rank matrix approximation](https://en.wikipedia.org/wiki/Low-rank_approximation). It decomposes an input matrix into two lower dimensionality matrices (a weighted sum of some number of [basis vectors](https://en.wikipedia.org/wiki/Basis_(linear_algebra)))
- a matrix of basis vectors
- a matrix of associated basis weights

The matrix product of these two is an approximation (but not a perfect match) to the original matrix. How good is this approximation? One approach to estimating the goodness of this approximation is by computing the [Frobenius norm](https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm) of the difference between the original and approximated data.
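
The idea can be sketched on a toy matrix with NumPy. The snippet below uses the classic multiplicative update rules as a stand-in solver (an assumption made for brevity, and not the coordinate descent solver that `sklearn`'s `NMF` is configured with later), and follows the naming where `W` holds the weights (rows are documents) and `H` holds the basis vectors (rows are topics). The matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))  # toy non-negative "document-term" matrix
k = 2  # number of basis vectors (topics)

# Random non-negative initialization of the two factors
W = rng.random((A.shape[0], k))  # weights: documents x topics
H = rng.random((k, A.shape[1]))  # basis vectors: topics x terms
eps = 1e-10  # avoids division by zero

err_start = np.linalg.norm(A - W @ H, "fro")
for _ in range(200):
    # Multiplicative updates keep every entry of W and H non-negative
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)
err_end = np.linalg.norm(A - W @ H, "fro")

print(f"Frobenius error: {err_start:.3f} -> {err_end:.3f}")
```

The error does not reach zero because a rank-2 product cannot, in general, reproduce a 6 x 5 matrix exactly; the Frobenius norm of the difference quantifies what is lost.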

For NLP tasks, NMF takes a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) of TFIDF features (output of the vectorizer) and creates an approximate representation of weighted combinations of the document terms that go together across the corpus. The combinations are of basis documents (topics) and their weights indicate how to add the contributions of each topic to approximately re-assemble an original document. In other words, these topics represent sets of terms that occur simultaneously in different documents.

<a id="pre-processing-for-nmf"></a>

### 4.1. [Pre-processing for NMF](#pre-processing-for-nmf)

First, a two-step pipeline, of TFIDF vectorization (convert processed text to numerical vectors) followed by NMF, is instantiated.

The TFIDF vectorization hyper-parameters are set as follows (and can be tuned if necessary)
- uni- and bi-grams will be extracted (`ngram_range = (1,2)`)
- pre-processing is not done, since the processed tokens are being used
- text is already converted to lowercase before tokenization, so nothing new will be done here
- of the nearly 67,000 unique words in the corpus, use only the top `max_features = 5000` terms, by term frequency, when building vectors
  - the default is to use all the terms
  - this level of feature reduction can be adjusted by varying the `max_features` hyper-parameter
- the default token pattern is used
  - tokens are comprised of 2 or more alphanumeric characters (removes any standalone single character)
    - this won't be done here since the processed tokens are already filtered
- instead, `min_df` and `max_df` are used to eliminate terms based on their document frequency across the text corpus (collection of documents), i.e. ignore any terms that are found in
    - more than 85% (`max_df = 0.85`)
    - less than three (`min_df = 3`)
    
  of the processed texts in the corpus
- no stopwords are used
  - these were already removed earlier, so this does not need to be done here
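
The effect of the two document-frequency thresholds can be sketched without `scikit-learn`. The toy corpus and the smaller `min_df = 2` below are assumptions chosen so that the filter has a visible effect on four documents; an integer threshold is treated as an absolute number of documents and a float threshold as a proportion of documents.

```python
from collections import Counter

docs = [
    ["space", "mars", "rover"],
    ["space", "comet", "lander"],
    ["space", "mars", "launch"],
    ["space", "mars", "shuttle"],
]
min_df = 2  # absolute number of documents
max_df = 0.85  # proportion of documents

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)
kept = sorted(
    term
    for term, df in doc_freq.items()
    if df >= min_df and df / n_docs <= max_df
)
print(kept)
```

Here `space` is dropped for being in every document, the terms seen only once are dropped for being too rare, and only `mars` survives.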

In [None]:
vectorizer = TfidfVectorizer(
    tokenizer=None,  # default is None
    stop_words=None,  # default is None
    lowercase=True,  # default is True
    ngram_range=(1, 2),  # default is (1, 1)
    max_df=0.85,  # default is 1.0
    min_df=3,  # default is 1
    max_features=5000,  # default is None
    preprocessor=" ".join,  # default is None
    binary=False,  # default is False
    strip_accents=None,  # default is None
    # token_pattern='(?u)\\b\\w\\w+\\b',  # default is '(?u)\\b\\w\\w+\\b'
)

Next, the NMF step is defined, again with hyper-parameter choices that can be tuned later as required

In [None]:
sk_nmf = NMF(
    n_components=n_topics_wanted,
    solver="cd",  # default is "cd"
    init="nndsvd",  # default is None, "nndsvd" = Nonnegative Double Singular Value Decomposition
    max_iter=500,  # default is 200
    l1_ratio=0.0,  # default is 0.0
    alpha=0.0,  # default is 0.0
    tol=0.0001,  # default is 0.0001
    random_state=42,
)
pipe = Pipeline([("vectorizer", vectorizer), ("nmf", sk_nmf)])

**Note**
1. When randomly initialized, NMF results are not deterministic. This stability can be improved if [NMF is initialized using SVD](https://arxiv.org/abs/1702.07186) and so this has been done here.

<a id="nmf"></a>

### 4.2. [NMF](#nmf)

The pipeline is now trained on the data, giving the document-topic matrix `W`

In [None]:
%%time
doc_topic = pipe.fit_transform(texts)

First, we'll show the Document-Term Matrix produced by TFIDF vectorization.

In [None]:
pd.DataFrame(pipe.named_steps["vectorizer"].transform(texts).toarray())

This is the TFIDF-weighted frequency of terms across the corpus. So, each row corresponds to a single document and each column corresponds to a single term. This is the matrix of TFIDF features.

The extracted topics and their top ten most important terms (using basis weights, from the NMF transformation) are shown below (these are the sets of corpus-wide coexisting terms mentioned [earlier](#topic-modeling-using-tfidf-vectorization-and-nmf) and they come from the factorization matrix `H`, where rows represent topics and columns represent the input data features)

In [None]:
def get_top_words_per_topic(row, n_top_words=5):
    return row.nlargest(n_top_words).index.tolist()

In [None]:
# Factorization matrix, of weights
H = pipe.named_steps["nmf"].components_
topic_words = pd.DataFrame(
    H,
    index=[str(k) for k in range(n_topics_wanted)],
    columns=pipe.named_steps["vectorizer"].get_feature_names(),
)
# display(topic_words)

# Get row-wise (topic-wise) top 10 weights
topic_df = (
    pd.DataFrame(
        topic_words.apply(
            lambda x: get_top_words_per_topic(x, n_top_words), axis=1
        ).tolist(),
        index=topic_words.index,
    )
    .reset_index()
    .rename(columns={"index": "topic"})
    # .assign(topic_num=range(1, n_topics_wanted + 1))
    .assign(topic_num=range(n_topics_wanted))
)
# Sanity check on first row
assert (
    topic_words.iloc[0].nlargest(n_top_words).index.tolist()
    == topic_df.iloc[0, 1:-1].tolist()
)
for k, v in topic_df.iterrows():
    print(k, ",".join(v[1:-1]))
# display(topic_df)

In [None]:
# print(np.sort(df_temp["topic_num"].unique()), topic_df["topic_num"].unique())
# print(df_temp.shape[0], len(df_temp), df_guardian["url"].nunique())
# print(df_temp.merge(topic_df, on="topic_num", how="left").assign(
#     url=df_guardian["url"].tolist()
# )["url"].nunique())

<a id="merging-with-the-source-data"></a>

### 4.3. [Merging with the original data](#merging-with-the-source-data)

Next, we'll start combining the original news article data with the NMF model output (document-topic matrix `W`) which is shown below

In [None]:
pd.DataFrame(doc_topic)

The document-topic matrix (`W`) is the transformed data produced by NMF. Here, each row is a unique document vector in the topic space - here, there are 15 elements per vector since 15 topics were used.

Each element of a single such vector is NMF's learned weight of each of the specified number of topics - 15 topics, so 15 columns. This weight is translated into the importance of that topic. Since we have a single vector for each document, we have the importance of each topic for each document.

There are two things we could do with this document-topic matrix
- the topic with the highest importance (weight) is the topic that most strongly reconstructs the combination of terms for a particular document
  - we'll be doing this here
- if we were to add up all the contributions from the different topics then we could reassemble the combination of terms for a particular document in the corpus

First, the most popular topic for each article (highest topic weight) is found by taking the column index of the row-wise maximum of the document-topic matrix.
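
On a toy document-topic matrix (the values below are invented for illustration), this row-wise selection looks like the following sketch.

```python
import numpy as np

# Toy document-topic matrix: 3 documents (rows) x 4 topics (columns)
doc_topic_toy = np.array(
    [
        [0.10, 0.00, 0.75, 0.05],
        [0.40, 0.30, 0.00, 0.20],
        [0.00, 0.02, 0.01, 0.60],
    ]
)

# Column index of the largest weight in each row = most important topic
top_topic = doc_topic_toy.argmax(axis=1)
print(top_topic.tolist())
```

The result holds one topic number per document, counted from zero.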

In [None]:
# df_temp = pd.DataFrame({"topic_num": doc_topic.argmax(axis=1)})
df_temp = pd.DataFrame(doc_topic).idxmax(axis=1).rename("topic_num").to_frame()
print(f"Number of rows = {df_temp.shape[0]}")
display(df_temp.head())

In [None]:
df_temp["topic_num"].min(), df_temp["topic_num"].max()

This gives the (most popular) topic number for each document - the topic numbers start at 0 and end at 14.

Earlier, we got the most important terms for each topic. So, now, we'll merge those with this most popular topic, on the topic number column.

Since we ultimately want to merge these results back with the original news article data (including the metadata columns), we'll also append a column of the news article URL.

In [None]:
merged_topic = df_temp.merge(topic_df, on="topic_num", how="left").assign(
    url=df_guardian["url"].tolist()
)
print(f"Number of rows = {merged_topic.shape[0]}")
display(merged_topic.head())

We now merge this result with the original data, to get access to all the metadata columns, this time on the common `url` column, which is aligned for both of these data structures

In [None]:
df_topics = df_guardian.merge(merged_topic, on="url", how="left").astype(
    {"topic_num": int}
)
print(f"Number of rows = {df_topics.shape[0]}")
display(df_topics.head())

The NMF factorization matrix is shown below, with weights for all features (terms, in the column names) and for each component (topic, along the rows)
- for brevity, a random selection of 10 columns (terms) is shown

In [None]:
topic_word = pd.DataFrame(
    pipe.named_steps["nmf"].components_.round(3),
    index=[k for k in range(n_topics_wanted)],
    columns=pipe.named_steps["vectorizer"].get_feature_names(),
)
print(f"Number of rows = {topic_word.shape[0]}")
display(topic_word.sample(10, axis=1))

From the NMF factorization matrix, we'll get the top 10 terms by weight for each topic

In [None]:
%%time
df_topic_word_factors = (
    topic_word.groupby(topic_word.index)
    .apply(lambda x: x.iloc[0].nlargest(n_top_words))
    .reset_index()
    .rename(columns={"level_0": "topic_num", "level_1": "term", 0: "weight"})
)
display(df_topic_word_factors.head())

We'll also get the 10 most common words by topic

In [None]:
%%time
tokens_by_topic = tokens_by_year[["text"]].assign(
    topic_num=df_topics["topic_num"].tolist()
)
df_mpt = (
    tokens_by_topic.groupby("topic_num")
    .apply(lambda x: get_top_n_most_freq_words(x, top_n_words, True))
    .to_frame()
)
df_twcs = []
for t in np.sort(tokens_by_topic["topic_num"].unique()):
    df_twc = (
        pd.DataFrame.from_dict(df_mpt.loc[t][0], orient="index")
        .T.rename(columns={0: "count"})
        .rename_axis("word")
        .assign(topic_num=t)
        .reset_index()
    )
    df_twcs.append(df_twc)
df_twc = pd.concat(df_twcs)
display(df_twc.head())

**Notes**
1. Each term may be associated with more than a single topic.

<a id="coherence-residual-by-topic"></a>

### 4.4. [Coherence Residual by Topic](#coherence-residual-by-topic)

Here, we'll use the [Frobenius Norm](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html#numpy.linalg.norm) to examine how well the topics found represent the news article text - this will be a residual between the modeled and true document-term matrix.

To get the Frobenius norm, from TFIDF, we're interested in
- `A`
  - the TF-IDF normalised document-term matrix
  - comes from the `.transform()` method ([1](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform)) of a trained `TfidfVectorizer()` object
    - shape is `len_corpus` (or number of news articles) X `num_most_frequent_words`
    - this is a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html), so to get its shape, we'll call the `.toarray()` method ([1](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.toarray.html))

NMF produces two factor matrices as its output, and we need both to calculate the Frobenius norm. Note that the letters below follow the variable names used in this notebook's code, which are the reverse of the `W`/`H` naming used in the `sklearn` docs.
- `W`
  - Factorization matrix, sometimes called *dictionary*
  - comes from the `.components_` attribute ([1](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF)) of a trained `NMF()` object
    - shape is `num_topics` X `num_most_frequent_words`
    - we've set `num_topics` to 15 and `num_most_frequent_words` is `max_features`, which was set to 5,000
  - contains the term weights relative to each of the k topics. Each row corresponds to a topic, and each column corresponds to a unique term in the corpus vocabulary.
- `H`
  - transformed data
  - comes from the `.transform()` method ([1](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF.transform)) of a trained `NMF()` object
    - again, per the `sklearn` docs, shape is `len_corpus` (or number of news articles) X `num_topics`
    - we know we have a corpus with a length of approx. 4,000 and, as above, we've indicated to use 15 topics
  - contains the document membership weights relative to each of the k topics. Each row corresponds to a single document, and each column corresponds to a topic.
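
As a rough illustration of where these three matrices come from, the sketch below fits a TF-IDF vectorizer and an NMF model on a tiny made-up corpus and collects the shapes. The toy texts, the `*_toy` variable names and the choice of two topics are assumptions for illustration only; the letters follow this notebook's code (`W` from `.components_`, `H` from the transformed data).

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus (not taken from the news articles)
toy_texts = [
    "comet lander mission comet",
    "moon landing apollo mission",
    "higgs boson collider particle",
    "apollo moon astronaut landing",
]

toy_vectorizer = TfidfVectorizer()
toy_nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)

# A: sparse TF-IDF document-term matrix, shape (n_docs, n_terms)
A_toy = toy_vectorizer.fit_transform(toy_texts)
# H: document-topic weights (transformed data), shape (n_docs, n_topics)
H_toy = toy_nmf.fit_transform(A_toy)
# W: topic-term weights (components_), shape (n_topics, n_terms)
W_toy = toy_nmf.components_

shapes = {"A": A_toy.shape, "W": W_toy.shape, "H": H_toy.shape}
print(shapes)
```

With this naming, the product `H_toy @ W_toy` has the same shape as `A_toy`, which is what allows a row-by-row residual to be computed.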

For the current dataset, we'll compute these three matrices

In [None]:
%%time
A = pipe.named_steps["vectorizer"].transform(texts)
W = pipe.named_steps["nmf"].components_
H = pipe.named_steps["nmf"].transform(A)

We'll now show the shapes of each of the three matrices above

In [None]:
f"A={A.toarray().shape}, W={W.shape}, H={H.shape}"

We'll now compute the residual (Frobenius norm) for the text (row) of each news article in our dataset. For each article, this is the norm of the difference between its row of the TF-IDF matrix `A` and the corresponding row of the NMF approximation `H.dot(W)`

In [None]:
r = np.zeros(A.shape[0])
for row in range(A.shape[0]):
    r[row] = np.linalg.norm(A[row, :] - H[row, :].dot(W), "fro")
df_topics["resid"] = r
display(df_topics.head())
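
The per-row loop above could also be written without Python-level iteration. The sketch below is one possible vectorized version, shown on small random matrices (the `*_demo` names and sizes are assumptions, not values from this analysis). It densifies `A`, which is only reasonable when the document-term matrix fits comfortably in memory.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 6 documents, 8 terms, 3 topics
A_demo = sparse.random(6, 8, density=0.4, format="csr", random_state=0)
H_demo = rng.random((6, 3))  # document-topic weights
W_demo = rng.random((3, 8))  # topic-term weights

# Difference between each true row and its reconstruction
diff = A_demo.toarray() - H_demo @ W_demo

# One residual per document: the Euclidean norm of each row of the difference
r_vectorized = np.linalg.norm(diff, axis=1)

# Row-by-row version, for comparison with the vectorized result
r_loop = np.array(
    [
        np.linalg.norm(A_demo[i].toarray().ravel() - H_demo[i] @ W_demo)
        for i in range(A_demo.shape[0])
    ]
)
```

The square root of the sum of the squared row residuals equals the Frobenius norm of the full difference matrix, so the per-article residuals can be read as each article's contribution to the overall reconstruction error.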

We'll show a boxplot of these topic residuals

In [None]:
# boxplot_sorted(
#     df_topics[["topic_num", "resid"]],
#     ["topic_num"],
#     "resid",
#     "Topic Residual (sorted by median)",
#     "center",
#     14,
#     sort_by_median=True,
#     fig_size=(12, 6),
# )

altair_boxplot_sorted(
    df_topics[["topic_num", "resid"]],
    "topic_num",
    "resid",
    "median(resid)",
    "Topic Residual (sorted by median)",
    14,
    14,
    16,
    dx=30,
    offset=-5,
    x_tick_label_angle=0,
    horiz_bar_chart=False,
    axis_range=[0.5, 1],
    fig_size=(450, 250),
)

This plot summarizes how well the texts assigned to each topic actually fit that topic. A smaller value on the vertical axis (the residual) indicates a better fit between a news article and the topic the NMF model assigned it to. Outliers are shown by circles. The topics (horizontal axis) are sorted by the median residual, so the best-performing topic appears on the far left and the worst-scoring one is shown on the far right.

**Observations**
1. None of the topics now have a median residual below 0.80.
2. Only three of the 15 topics have the first quartile of their residuals falling below 0.80.
3. The residuals scale from 0.0 to 1.0. The median residual values are above 0.8 and there is a considerable range.
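
The numbers behind observations like these can be checked directly rather than read off the chart. The following is a minimal sketch on a made-up DataFrame with the same two columns as `df_topics` (`topic_num` and `resid`); the values and the `df_demo` and `summary` names are hypothetical.

```python
import pandas as pd

# Hypothetical residuals for three topics
df_demo = pd.DataFrame(
    {
        "topic_num": [0, 0, 0, 1, 1, 1, 2, 2, 2],
        "resid": [0.95, 0.90, 0.99, 0.70, 0.80, 0.75, 0.85, 0.88, 0.82],
    }
)

# Median and first quartile of the residual per topic, best (lowest) median first
summary = (
    df_demo.groupby("topic_num")["resid"]
    .agg(median="median", q1=lambda s: s.quantile(0.25))
    .sort_values("median")
)
print(summary)
```

Sorting by the median reproduces the left-to-right ordering used on the horizontal axis of the boxplot.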

We're now ready to start reading the articles and assessing the quality of the topics.
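
The per-topic lookups in the sub-sections that follow all use the same `nsmallest`/`nlargest` pattern. As a sketch only, this could be wrapped in a small helper; `best_and_worst_urls` is a hypothetical name that does not exist in `src`, and the demo data is made up.

```python
import pandas as pd


def best_and_worst_urls(df, topic_num, n=5, resid_col="resid", url_col="url"):
    """Return the URLs of the n best- and n worst-fitting texts for one topic."""
    df_topic = df[df["topic_num"] == topic_num]
    # Lowest residual = best fit, highest residual = worst fit
    best = df_topic.nsmallest(n, resid_col)[url_col].tolist()
    worst = df_topic.nlargest(n, resid_col)[url_col].tolist()
    return best, worst


# Hypothetical data with placeholder URLs
df_demo = pd.DataFrame(
    {
        "topic_num": [4, 4, 4, 6, 6],
        "resid": [0.60, 0.90, 0.75, 0.80, 0.70],
        "url": ["a", "b", "c", "d", "e"],
    }
)
best, worst = best_and_worst_urls(df_demo, topic_num=4, n=2)
```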

<a id="topic-4"></a>

#### 4.4.1. [Topic 4](#topic-4)

The five best texts that fit into the best topic (by residual), topic number 4, are shown below

In [None]:
df_topics[df_topics["topic_num"] == 4].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. This is a very coherent sub-grouping of comet-related articles, where the focus is on mission operations related to the [Rosetta spacecraft mission to comet 67P](https://en.wikipedia.org/wiki/Rosetta_(spacecraft)).

The five worst texts that fit into the best topic (by residual) are shown below

In [None]:
df_topics[df_topics["topic_num"] == 4].nlargest(5, "resid")["url"].tolist()

**Observations**
1. While in some way related to reporting on comets, these aren't connected to the Rosetta mission specifically, or to comet-related missions in general. The topics of these articles are listed below
   - [reports on a theory about comet dust containing diamonds](https://www.theguardian.com/science/2002/jul/18/technology)
   - [end of the Deep Space 1 spacecraft mission which involved studying other comets and asteroids](https://www.theguardian.com/science/2001/dec/20/technology)
   - [obituary for a researcher involved in advocating for *in situ* analyses on the Philae lander that accompanied Rosetta](https://www.theguardian.com/science/2014/dec/21/colin-pillinger-remembered-monica-grady-observer-obituaries-2014-open-university)
   - [inappropriate attire by one of the scientists providing an update on the mission](https://www.theguardian.com/science/2014/nov/14/rosetta-comet-dr-matt-taylor-apology-sexist-shirt)
   - [related to the Tempel 1 comet](https://www.theguardian.com/science/2005/jul/07/thisweekssciencequestions2)

Although related to comets, the five worst articles aren't directly reporting on Rosetta mission operations and, in some cases, aren't even indirectly related to them.

<a id="topic-6"></a>

#### 4.4.2. [Topic 6](#topic-6)

The five best texts that fit into topic number 6 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 6].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below and are related to the lunar landings, missions or planned/cancelled missions
   - [Recapping the 1969 Moon Landing, as part of the Apollo Mission](https://www.theguardian.com/science/2009/jun/21/apollo-fallen-dream)
   - [Chinese plans for manned moon mission](https://www.theguardian.com/science/2011/dec/30/china-manned-moon-mission-lunar)
   - [Written in 2019 (as 50th anniversary of Apollo 11 approached), discusses new interest in Moon exploration](https://www.theguardian.com/science/2019/jul/06/everyones-going-to-the-moon-again-apollo-11-50th-aniversary)
   - [written in 2010, reports on Chinese leading the way to a return to the moon](https://www.theguardian.com/science/2010/feb/02/lunar-us-china-race-moon) after [NASA abandons Moon orbit mission](https://www.nbcnews.com/id/wbna35209628)
   - [Written in 1999, references complete absence of moon landings for a period of](https://www.theguardian.com/science/1999/jul/19/spaceexploration.g2) 27 years ([1](https://www.history.com/news/us-moon-landings-apollo))

The five worst texts that fit into this topic are shown below and, while related to the moon, don't necessarily involve missions

In [None]:
df_topics[df_topics["topic_num"] == 6].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [talks about imprisoned peace activist receiving a photograph of Earth from the 1969 Apollo 11 mission's moon landing](https://www.theguardian.com/science/2018/oct/18/a-surprise-package-from-a-spaceman-for-a-prisoner-on-earth)
   - [Death of US astronaut John Glenn](https://www.theguardian.com/science/2016/dec/08/john-glenn-us-astronaut-dies), who was [connected to the Apollo space program](https://web.archive.org/web/20161210225701/http://www.jsc.nasa.gov/Bios/htmlbios/glenn-j.html)
   - [talks about a song played as Apollo 14 approached the moon](https://www.theguardian.com/science/2017/oct/09/apollo-14-song-a-hymn-to-god-or-to-the-nazis)
   - [reports on auction of limousine used by Apollo 11 astronauts](https://www.theguardian.com/world/2011/aug/04/limousine-pope-neil-armstrong-auction)
   - [Moon tourism](https://www.theguardian.com/science/1999/sep/22/spaceexploration.business)

<a id="topic-12"></a>

#### 4.4.3. [Topic 12](#topic-12)

The five best texts that fit into topic number 12 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 12].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [discovery of Higgs boson](https://www.theguardian.com/science/2012/jul/04/higgs-boson-discovery-real-work)
   - [explaining fundamental concepts behind science researched at LHC (and CERN, in general) related to the Higgs boson](https://www.theguardian.com/science/2008/jun/30/cern.particle.physics1)
   - [anticipating the LHC coming online](https://www.theguardian.com/science/2007/may/27/particlephysics.observermagazine)
   - [days away from starting LHC operations](https://www.theguardian.com/science/2011/feb/28/large-hadron-collider-higgs-boson)
   - [getting close to detecting the Higgs boson](https://www.theguardian.com/science/2011/dec/13/scientists-higgs-boson-god-particle)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 12].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below and are focused on elements of the periodic table
   - [all about the element niobium](https://www.theguardian.com/science/punctuated-equilibrium/2011/dec/09/1)
   - [element Xenon](https://www.theguardian.com/science/grrlscientist/2012/mar/16/1)
   - [element tellurium](https://www.theguardian.com/science/grrlscientist/2012/mar/02/1)
   - [element calcium](https://www.theguardian.com/science/punctuated-equilibrium/2011/jul/08/1)
   - [mentions how the element gold is created when neutron stars collide with each other](https://www.theguardian.com/science/life-and-physics/2016/feb/02/the-cosmic-gift-of-neutron-stars)

These aren't related to the search for the Higgs boson and likely only appear here since they mention components listed on the periodic table (particles like neutrons, protons, etc.) that are also mentioned in news articles about the Higgs boson. Either these should be incorporated into a separate topic or this hybrid topic should be named to account for all elementary particles and not just the Higgs boson.

<a id="topic-10"></a>

#### 4.4.4. [Topic 10](#topic-10)

The five best texts that fit into topic number 10 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 10].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - approaching end of Cassini mission
     - [1](https://www.theguardian.com/science/2017/apr/30/cassini-saturn-space-probe-fiery-finale-nasa)
     - [2](https://www.theguardian.com/science/2017/sep/14/nasas-cassini-spacecraft-poised-to-begin-mission-ending-dive-into-saturn)
   - Cassini spacecraft enters Saturn orbit
     - [1](https://www.theguardian.com/science/2004/jul/01/spaceexploration.research)
     - [2](https://www.theguardian.com/science/2004/jun/04/spaceexploration.starsgalaxiesandplanets)
   - [explores what was learned from Cassini about Saturn](https://www.theguardian.com/science/2017/sep/15/what-did-the-cassini-mission-tell-us-about-saturn-and-its-moons)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 10].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [toy experiment to show that gas takes up more volume than solids](https://www.theguardian.com/science/2008/may/02/physics.chemistry)
     - water mixed with Alka-Seltzer causes the container to shoot up like a rocket
   - [basic science of a stone skimming on water](https://www.theguardian.com/science/2004/jan/08/research.science1)
   - [discovery of remnants of ancient lake in Sudan](https://www.theguardian.com/science/2007/jul/19/sudan.water)
   - [challenges of a mission to fly a glider into space, due to extreme atmospheric conditions](https://www.theguardian.com/science/2005/jul/28/thisweekssciencequestions.aeronautics)
   - [New Horizons spacecraft captures first images of Ultima Thule, proposed to have been created by mixing of ice and dust during the early days of the Solar System](https://www.theguardian.com/science/2019/jan/02/first-close-ups-of-ultima-thule-reveal-it-resembles-dark-red-snowman)

The themes tying these articles to the Cassini mission to Saturn are
- science involving water/ice, since [Cassini may have found evidence of water on one of Saturn's moons](https://www.nasa.gov/mission_pages/cassini/media/cassini-20060309.html)
- discovery
- mission into a difficult atmosphere

If the article about flying a glider is left out, then discoveries/experiments involving water across the solar system, and even on Earth, are a more appropriate theme for the five worst articles.

**Best** - Cassini mission updates

**Worst** - Fun with Water

<a id="topic-0"></a>

#### 4.4.5. [Topic 0](#topic-0)

The five best texts that fit into topic number 0 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 0].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about a book discussing biggest remaining questions in science](https://www.theguardian.com/science/2013/sep/01/20-big-questions-in-science)
   - [scientific ideas that aren't very strong](https://www.theguardian.com/science/2014/jan/12/what-scientific-idea-is-ready-for-retirement-edge-org)
   - [unexpected scientific findings that challenge existing beliefs](https://www.theguardian.com/science/2014/jun/29/five-insights-challenging-sciences-unshakable-truths)
   - [about a book on scientific thinking and cosmology](https://www.theguardian.com/science/2016/sep/18/brian-cox-interview-it-is-a-book-about-how-to-think-universal-a-guide-to-the-cosmos)
   - questions for a theorist ([1](https://en.wikipedia.org/wiki/Theoretical_physics)), [primarily about nature of time](https://www.theguardian.com/science/2019/mar/31/carlo-rovelli-you-ask-the-questions-time-travel-is-just-what-we-do-every-day-theoretical-physics)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 0].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [article about a type of puzzle](https://www.theguardian.com/science/2015/nov/23/can-you-solve-it-the-crossword-that-counts-itself)
   - [home science experiment - soda fountain](https://www.theguardian.com/science/2008/may/02/chemistry.physics)
   - [on an interdisciplinary research study of knots between Biology and Physics](https://www.theguardian.com/science/2001/sep/13/physicalsciences.highereducation) (knots in [Mathematics](https://en.wikipedia.org/wiki/Knot_theory) and [Biology](https://en.wikipedia.org/wiki/Molecular_knot))
   - [scientific study proving that ducks produce an echo](https://www.theguardian.com/science/2003/sep/08/sciencenews.theguardianlifesupplement)
   - celebrating [Pi Day](https://www.piday.org/), with an [article](https://www.theguardian.com/science/2016/mar/14/pi-day-your-guide-to-this-infinitely-interesting-number) all about the number [Pi](https://en.wikipedia.org/wiki/Pi)

This was the worst overall topic in terms of the median residual. The worst of these high-residual news articles also include those with a focus on home science experiments or leisurely activities (Pi Day, puzzle). The loose connection between puzzles and science is that they could promote creative thinking ([1](https://newforums.com/one-activity-develops-creative-thinking/), [2](https://www.lifehack.org/374975/science-explains-why-crossword-puzzles-are-good-for-your-mental-health)). There are articles about scientific findings, remaining questions and concepts (time, cosmos) in the best articles, but also some in the worst (duck echo and knots). Some of the best articles don't relate to scientific research, but are more a discussion about scientific concepts. So, it may be difficult to separate these into two sub-topics. The articles about research tend to focus on academic research. Interesting scientific findings might be the best blanket topic name to use here.

<a id="topic-1"></a>

#### 4.4.6. [Topic 1](#topic-1)

The five best texts that fit into topic number 1 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 1].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [reporting on NASA's plans to re-attempt the Columbia shuttle mission, two years after the 2003 crash](https://www.theguardian.com/science/2005/jul/07/1)
   - [about preparing for launch of the last mission for the shuttle Discovery](https://www.theguardian.com/science/2010/oct/31/us-space-shuttle-discovery-mission)
   - [looking ahead to Discovery's last launch mission, while discussing the problem of debris from the 2003 Columbia crash](https://www.theguardian.com/science/2005/jul/29/sciencenews.spaceexploration)
   - [about Discovery's last mission risks despite learning and fixing issues related to 2003 Columbia crash](https://www.theguardian.com/science/2005/apr/10/spaceexploration.usnews)
   - [NASA announcing Discovery is given green-light for launch despite facing safety-related questions in light of the 2003 Columbia crash](https://www.theguardian.com/science/2006/jul/01/spaceexploration.internationalnews)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 1].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about death of Israeli astronaut, who trained at NASA as a craft payload specialist to image aerosols in a desert and died in 2003 Columbia crash](https://www.theguardian.com/science/2003/feb/03/spaceexploration.columbia8)
   - [characteristics of an ideal fuel cell](https://www.theguardian.com/science/2005/mar/24/science.research1)
     - [hydrogen is used in rocket propellants](https://theconversation.com/hydrogen-fuels-rockets-but-what-about-power-for-daily-life-were-getting-closer-112958)
   - [remembering Israeli astronaut who died in 2003 Columbia crash](https://www.theguardian.com/science/2003/feb/03/spaceexploration.research1)
   - [about picture carried by Israeli astronaut who died in 2003 Columbia crash](https://www.theguardian.com/science/2003/feb/02/spaceexploration.columbia1)
   - [about hummingbirds travelling faster than fighter jets and a space shuttle](https://www.theguardian.com/science/2009/jun/10/hummingbird-fastest-animal-fighter-jet)

This seems like a reasonably coherent topic, related to space shuttle crashes. Even some of the worst articles fitting this topic are related to the crashes, though predominantly indirectly. A few of the worst fitting articles focus more on space flight than on shuttle crashes.

<a id="topic-2"></a>

#### 4.4.7. [Topic 2](#topic-2)

The five best texts that fit into topic number 2 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 2].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these news articles are listed below
   - [anticipating the mission launch for the Beagle 2 Mars lander](https://www.theguardian.com/science/2003/may/22/spaceexploration.starsgalaxiesandplanets)
   - [about a joint NASA-ESA Mars mission proposal with sample return, with estimated launch between 2018-2023](https://www.theguardian.com/science/2008/jul/14/mars.spaceexploration)
     - this turned out to be the Insight mission with a [lander](https://en.wikipedia.org/wiki/InSight) and [Perseverance rover](https://en.wikipedia.org/wiki/Perseverance_(rover)) with the [Ingenuity helicopter](https://en.wikipedia.org/wiki/Mars_Helicopter_Ingenuity) onboard
   - [year-end review of scientific findings in 2015, including Mars-based research](https://www.theguardian.com/science/2015/dec/25/2015-big-year-space-exploration-major-tim-peake-pluto)
   - [looking ahead to Beagle 2 Mars mission launch](https://www.theguardian.com/science/2003/jun/02/spaceexploration.starsgalaxiesandplanets)
   - [reporting on Beagle 2 separating from parent spacecraft, and beginning descent to the Martian surface](https://www.theguardian.com/science/2003/dec/19/spaceexploration.research1)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 2].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about](https://www.theguardian.com/uk/2008/jan/14/art.artnews) the [Martian Museum of Terrestrial Art Exhibition](https://www.frieze.com/article/martian-museum-terrestrial-art)
   - [Scuderia Ferrari Formula 1 team sends its red paint to Mars on the ESA's Mars Express](https://www.theguardian.com/science/2002/aug/29/technology)
   - [about NASA's announced spacesuit to aid humans on Mars](https://www.theguardian.com/science/2014/apr/30/nasa-spacesuit-zseries-new-design-mars)
   - [about space tourism in 2018](https://www.theguardian.com/science/shortcuts/2013/feb/28/mars-mission-married-couple-space)
   - [about colonization on Mars](https://www.theguardian.com/science/2019/jun/23/all-female-mars-colony-possible-using-frozen-sperm-says-study)

Although related to Mars, the five worst articles aren't directly reporting on the operations of space missions to the planet.

<a id="topic-9"></a>

#### 4.4.8. [Topic 9](#topic-9)

The five best texts that fit into topic number 9 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 9].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these news articles are listed below
   - [about life on the ISS](https://www.theguardian.com/science/2010/oct/24/international-space-station-nasa-astronauts)
   - [arrival of first crew to the ISS - US astronaut and two Russian cosmonauts](https://www.theguardian.com/science/2000/nov/02/spaceexploration)
   - [launch of Space Shuttle Atlantis, carrying five US and two Russian crew members, heading to the ISS](https://www.theguardian.com/science/2000/sep/08/spaceexploration)
   - [Atlantis returns to Earth after dropping off five-crew team to ISS](https://www.theguardian.com/science/2000/sep/27/spaceexploration)
   - [Four Russian and one US crew member preparing for mission to MIR space station](https://www.theguardian.com/science/2000/nov/15/spaceexploration)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 9].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [remembering US space flight surgeon](https://www.theguardian.com/science/2003/feb/03/spaceexploration.columbia7), [who helped develop an ISS treadmill for astronauts](https://womeninwisconsin.org/profile/laurel-clark/)
   - [remembering spider that spent 100 days on ISS](https://www.theguardian.com/science/2012/dec/05/space-pioneer-jumping-spider-dies)
   - [reporting on a missing tourist (former fashion-designer) who predicted MIR would crash over Paris](https://www.theguardian.com/science/1999/aug/13/eclipse.internationalnews)
   - [article about former British PM as captain of USS Enterprise, from Star Trek](https://www.theguardian.com/science/brain-flapping/2015/aug/18/david-cameron-captain-enterprise-star-trek)
   - [regarding a statue for Yuri Gagarin](https://en.wikipedia.org/wiki/Yuri_Gagarin), who was a former deputy training director for the [Cosmonaut Training Centre](https://en.wikipedia.org/wiki/Yuri_Gagarin_Cosmonaut_Training_Center) ([now named after him](https://en.wikipedia.org/wiki/Yuri_Gagarin_Cosmonaut_Training_Center)) where [ISS crews train](https://www.nasa.gov/feature/remembering-yuri-gagarin-50-years-later)

This is a reasonably coherent topic, related to US and Russian Space Stations - directly, via articles about mission operations, or indirectly involving former astronauts/cosmonauts.

<a id="topic-13"></a>

#### 4.4.9. [Topic 13](#topic-13)

The five best texts that fit into topic number 13 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 13].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these news articles are listed below
   - [report on third detection of gravitational wave at LIGO](https://www.theguardian.com/science/2017/jun/01/third-gravitational-wave-detection-gives-hints-on-dark-matter-and-black-holes)
   - [first discovery of gravitational waves](https://www.theguardian.com/science/2016/feb/11/gravitational-waves-discovery-hailed-as-breakthrough-of-the-century)
   - [remembering Stephen Hawking after he passed away](https://www.theguardian.com/science/2018/mar/14/a-life-in-science-stephen-hawking)
     - Hawking radiation is predicted to be released by black holes but [has not been detected](https://www.forbes.com/sites/startswithabang/2020/07/01/this-is-why-well-never-detect-hawking-radiation-from-an-actual-black-hole/?sh=56eaaf8b1f0d)
       - roughly speaking, in order to learn about black holes, instead of detecting Hawking radiation, gravitational waves can be used to collect data about wave sources (such as black hole collisions)
   - [report on being near first sighting of black hole in Milky Way](https://www.theguardian.com/science/2019/jan/11/scientists-close-to-capturing-first-image-of-black-hole-at-the-centre-of-the-milky-way)
     - article explains what is a black hole and that LIGO has detected black hole collisions
   - [reporting on first image of a black hole](https://www.theguardian.com/science/2019/apr/10/black-hole-picture-captured-for-first-time-in-space-breakthrough)
     - article explains what a black hole is and [links to](https://www.theguardian.com/science/2016/feb/11/gravitational-waves-discovery-hailed-as-breakthrough-of-the-century) gravitational wave detection when black holes collide

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 13].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - article about [Ada Lovelace](https://en.wikipedia.org/wiki/Ada_Lovelace)[-inspired puzzle where black squares are prominent](https://www.theguardian.com/science/2017/mar/28/did-you-solve-it-take-the-ada-lovelace-challenge-solution-part-i)
   - [why hummingbirds create nests near hawks](https://www.theguardian.com/science/grrlscientist/2015/sep/30/hummingbirds-nest-near-hawks-for-protection)
   - [black hole explains existence of Santa Claus](https://www.theguardian.com/science/brain-flapping/2014/dec/12/santa-claus-father-christmas)
   - [about](https://www.theguardian.com/science/2003/jan/16/science.research1) the Golden Ratio, which [appears in theory of black holes](https://johncarlosbaez.wordpress.com/2013/02/28/black-holes-and-the-golden-ratio/)
   - [about shape of universe being that of an American football](https://www.theguardian.com/science/2003/oct/09/spaceexploration.research)

This is a somewhat coherent topic, related to black holes. The best articles focus on gravitational wave detection as evidence of black hole existence (collision) or direct black hole images. The worst articles relate to black holes only indirectly. The focus here is on (sometimes loosely) related topics: hawks (birds, due to the connection between black holes and Hawking, or black-body, radiation), a puzzle featuring the colour black, and the universe in general (in the case of the latter, at best, this connection could be due to [our universe maybe existing inside a black hole](https://insidescience.org/news/every-black-hole-contains-new-universe), black holes being [spread across the universe](https://www.space.com/15421-black-holes-facts-formation-discovery-sdcmp.html) and [black holes possibly leading to another universe](https://www.space.com/where-do-black-holes-lead.html)).

<a id="assign-names-to-15-topics"></a>

### 4.5. [Assign Names to 15 Topics](#assign-names-to-15-topics)

A mapping between topic number and name will be developed, based on the top 10 terms and a manual reading of each topic's articles.

In [None]:
d_topics_15 = {
    4: {
        "best": "Rosetta mission updates",
        "worst": "Science involving Comets",
    },
    6: {"best": "About the Moon Landings", "worst": "Moon in Popular Culture"},
    12: {"best": "Discovery of Higgs Boson", "worst": "Particles of Matter"},
    10: {
        "best": "Cassini mission updates",
        "worst": "Fun with Water",
    },
    0: {"best": "Science Facts & Thinking", "worst": "Fun with Science"},
    1: {
        "best": "Columbia shuttle crash",
        "worst": "Space Shuttle Astronauts and Technology",
    },
    2: {
        "best": "Mars mission updates",
        "worst": "Mars Colonization and Tourism",
    },
    9: {
        "best": "ISS updates",
        "worst": "Space Stations - Facts and History",
    },
    13: {
        "best": "Black Holes",
        "worst": "Black Holes in Popular Culture",
    },
    3: {"best": "Topic 3", "worst": "Topic 3"},
    5: {"best": "Topic 5", "worst": "Topic 5"},
    7: {"best": "Topic 7", "worst": "Topic 7"},
    8: {"best": "Topic 8", "worst": "Topic 8"},
    11: {"best": "Topic 11", "worst": "Topic 11"},
    14: {"best": "Topic 14", "worst": "Topic 14"},
}
df_named_topics = (
    pd.DataFrame.from_dict(d_topics_15, orient="index")
    .sort_index()
    .rename_axis("topic_num")
)
display(df_named_topics)
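
Before the names are attached to the residuals, it may be worth confirming that every article's topic number has a name in the mapping. The sketch below shows one way to do this with a left merge on made-up data; the `*_demo` names and topic labels are hypothetical, and `validate="many_to_one"` is an optional safeguard rather than something the plotting cell relies on.

```python
import pandas as pd

# Hypothetical two-topic naming dictionary, same structure as d_topics_15
d_names_demo = {
    0: {"best": "Topic A", "worst": "Topic A (loose)"},
    1: {"best": "Topic B", "worst": "Topic B (loose)"},
}
df_names_demo = pd.DataFrame.from_dict(d_names_demo, orient="index").rename_axis(
    "topic_num"
)

# Hypothetical residuals, one row per article
df_resid_demo = pd.DataFrame(
    {"topic_num": [0, 1, 1, 0], "resid": [0.90, 0.80, 0.85, 0.95]}
)

# Left merge keeps every article; validate raises if a topic number is duplicated
df_labelled = df_resid_demo.merge(
    df_names_demo.reset_index()[["topic_num", "best"]],
    on="topic_num",
    how="left",
    validate="many_to_one",
)

# Articles whose topic number has no name would show up as missing values
n_unnamed = int(df_labelled["best"].isna().sum())
```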

The residual plot is repeated now with topic names

In [None]:
# boxplot_sorted(
#     df_topics[["topic_num", "resid"]].merge(
#         df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"
#     ),
#     ["best"],
#     "resid",
#     "Topic Residual (sorted by median)",
#     "right",
#     14,
#     45,
#     sort_by_median=True,
#     fig_size=(12, 6),
# )

altair_boxplot_sorted(
    df_topics[["topic_num", "resid"]].merge(
        df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"
    ),
    "best",
    "resid",
    "median(resid)",
    "Topic Residual (sorted by median)",
    14,
    14,
    16,
    dx=100,
    offset=-5,
    x_tick_label_angle=-45,
    horiz_bar_chart=False,
    axis_range=[0.5, 1],
    fig_size=(600, 250),
)

**Observations**
1. The topics related to missions and the Moon are among the best, but the ISS and Cassini mission updates aren't as good
   - could the ISS topic be mixing US and Russian space programs, rather than just the two countries' space stations?
   - for Cassini, could it be getting mixed in with water-related articles which serve to de-focus the overall topic?
2. Topic 7 (SpaceX) doesn't fare so well, while Topic 8 (objects impacting Earth) is among the best. Both of these are fairly focused topics. It is somewhat surprising that SpaceX is so far down this list. Articles from this topic were not read. It may be that the best articles are focused on other companies involved in rocket launch testing and perhaps also with space tourism.
3. Concrete topics on scientific discoveries - Bosons and Black Holes (middle-of-the-chart) - fare better than the general topic on scientific research findings, facts, etc. (the highest-residual, or worst, topic).

<a id="comparison-to-word-vectors"></a>

### 4.6. [Comparison to Word Vectors](#comparison-to-word-vectors)

We'll briefly explore the concept of comparing documents to each other, instead of comparing the words as we've done so far. This concept of similarity is determined using [word vectors](https://en.wikipedia.org/wiki/Word_embedding), which leverage the multi-dimensional meaning/representation of words. These are generated using algorithms like [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) and a lot of text.

We'll use the [spaCy natural language processing package](https://en.wikipedia.org/wiki/SpaCy) to apply this technique to the raw news articles without processing. spaCy offers an NLP pipeline that will apply all the necessary steps - [part-of-speech tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging), [dependency parsing](https://en.wikipedia.org/wiki/Parse_tree) and [named entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) - before applying a pre-trained word vector model. In this version of the analysis, we'll limit ourselves to this built-in pipeline with no customizations.

In [None]:
%%time
if run_spacy_medium_model:
    vectors = [nlp(sentence).vector for sentence in df_guardian["text"]]

Next, the articles will be compared to each other using cosine similarity between these vectors

In [None]:
if run_spacy_medium_model:
    dfcs = pd.DataFrame(cosine_similarity(vectors))
    dfcs = dfcs.assign(topic_num=df_topics["topic_num"])

In [None]:
# if run_spacy_medium_model:
#     dfcs_unstacked = (
#         dfcs.select_dtypes("float32")
#         # dfcs[dfcs.index]
#         .unstack()
#         .reset_index()
#         .rename(columns={"level_0": "doc1", "level_1": "doc2", 0: "cos_sim"})
#     )
#     dfcs_topics = (
#         dfcs.merge(df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num")
#         .reset_index()
#         .rename(columns={"index": "doc1"})[["doc1", "best"]]
#     )
#     dfcs_merged = dfcs_unstacked.merge(dfcs_topics, on="doc1")

We'll now generate a boxplot of the cosine similarities for each document pair within a given topic.
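
The reshaping in the next cell can be hard to follow, so here is a simplified sketch of the underlying idea: restrict the similarity matrix to pairs of documents that share a topic. The inputs and the `within_topic_pairs` helper are hypothetical. Unlike the notebook's cell, which keeps every (doc1, doc2) combination within a topic, this sketch keeps each unordered pair once and skips self-pairs.

```python
from itertools import combinations

import numpy as np

# Hypothetical inputs: a symmetric similarity matrix and one topic label per document
sims = np.array(
    [
        [1.0, 0.9, 0.2, 0.1],
        [0.9, 1.0, 0.3, 0.2],
        [0.2, 0.3, 1.0, 0.8],
        [0.1, 0.2, 0.8, 1.0],
    ]
)
topic_nums = [0, 0, 1, 1]


def within_topic_pairs(sims, topic_nums):
    """Return (topic, doc1, doc2, cos_sim) for each unordered same-topic pair."""
    rows = []
    for doc1, doc2 in combinations(range(len(topic_nums)), 2):
        if topic_nums[doc1] == topic_nums[doc2]:
            rows.append((topic_nums[doc1], doc1, doc2, float(sims[doc1, doc2])))
    return rows


pairs = within_topic_pairs(sims, topic_nums)
```

Dropping self-pairs matters for a boxplot, since a document's similarity to itself is always 1 and would pull each topic's distribution upward.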

In [None]:
if run_spacy_medium_model:
    dfcs_all = []
    for topic in dfcs["topic_num"].unique():
        dfcs_named_topics = dfcs.loc[dfcs["topic_num"] == topic]
        # display(dfcs_named_topics[dfcs_named_topics.index])
        dfcs_unstacked = (
            dfcs_named_topics[dfcs_named_topics.index]
            .unstack()
            .reset_index()
            .rename(columns={"level_0": "doc1", "level_1": "doc2", 0: "cos_sim"})
        )
        # display(dfcs_unstacked)
        dfcs_reshaped = dfcs_unstacked.merge(
            dfcs_named_topics[dfcs_named_topics.index.tolist() + ["topic_num"]]
            .reset_index()[["index", "topic_num"]]
            .rename(columns={"index": "doc1"}),
            on="doc1",
        ).merge(df_named_topics.reset_index()[["best", "topic_num"]], on="topic_num")[
            ["doc1", "doc2", "cos_sim", "best"]
        ]
        # display(dfcs_reshaped)
        dfcs_all.append(dfcs_reshaped)
    # display(dfcs[["topic_num", "best"]].head(2))
    dfcs_merged = pd.concat(dfcs_all).reset_index(drop=True)
    display(dfcs_merged)

In [None]:
if run_spacy_medium_model:
    altair_boxplot_sorted(
        dfcs_merged,
        "best",
        "cos_sim",
        "median(cos_sim)",
        "Cosine Similarity (sorted by median)",
        14,
        14,
        16,
        dx=75,
        offset=-5,
        x_tick_label_angle=-45,
        horiz_bar_chart=False,
        axis_range=[0.3, 1],
        fig_size=(600, 250),
    )

**Notes**
1. Due to the size of the model required here (`_sm` does not include word vectors) and the size of the data (even though it's only a few thousand news articles), this comparison will *only* be made if the [User Inputs](#user-inputs) section variable `run_spacy_medium_model` is set to `True`.

**Observations**
1. Most of the topics have news articles with a high cosine similarity to each other, so this grouping of texts does seem qualitatively reasonable.
2. A few topics have a relatively larger number of outliers, so there could be articles in these outliers that belong to sub-topics that don't overlap enough to be picked up by the overall grouping.
3. The topics whose best news articles report on one of mission updates, discovery of black holes or discovery of the Higgs particle have the most tightly grouped cosine similarity scores.

The limitation of this approach is that we've used a small vocabulary with fewer vectors. Computing the document vectors with this pre-trained model took a long time and would take even longer with a larger vocabulary. Could this comparison between articles benefit from a larger vocabulary and reveal some less-sensible topics? For this word vector-based approach, are short phrases better than long documents with many irrelevant words?

<a id="exploring-topics-combined-with-source-data-and-15-topics"></a>

## 5. [Exploring topics combined with source data and 15 topics](#exploring-topics-combined-with-source-data-and-15-topics)

The top 10 scoring terms by topic are visualized below

In [None]:
altair_plot_grid_by_column(
    df_topic_word_factors.merge(
        df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"
    ),
    xvar="weight",
    yvar="term",
    col2grid="best",
    space_between_plots=5,
    fig_size=(150, 200),
)

By comparison, the 10 most common words (expressed as a percentage) by topic are visualized below

In [None]:
altair_plot_grid_by_column(
    df_twc.merge(df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"),
    xvar="count",
    yvar="word",
    col2grid="best",
    space_between_plots=5,
    fig_size=(150, 200),
)

We can comment on these charts for a couple of topics
1. Topic 10 relates to the [Cassini-Huygens mission](https://en.wikipedia.org/wiki/Cassini%E2%80%93Huygens) to Saturn, exploring the planet and its system of moons, rings, etc. The important terms are tied to the mission. The most commonly used words place a focus on ice and water (some of the findings of the Cassini mission).
2. Topic 2 is clearly about the Guardian's reporting on Mars. The (British) Beagle 2 Mars lander is given high importance. The word-count frequency doesn't include `beagle` in the top 10 most commonly used words. The ML model's predictions suggest the Beagle lander should be among the most important sub-topics, which we cannot determine from just looking at the most frequently used words.
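
The contrast between the two charts can be illustrated with a small sketch. The numbers below are made up for a single topic and `top_n` is a hypothetical helper. The point is only that ranking terms by model weight and ranking words by raw count can produce different top lists, because TF-IDF down-weights terms that occur across many documents.

```python
def top_n(scores, n):
    """Return the n keys with the largest values, highest first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [key for key, _ in ranked[:n]]


# Made-up numbers for a single topic, for illustration only
nmf_term_weights = {"beagle": 0.9, "lander": 0.7, "mars": 0.6, "mission": 0.2}
word_counts = {"mars": 120, "mission": 95, "lander": 40, "beagle": 12}

top_by_weight = top_n(nmf_term_weights, 2)
top_by_count = top_n(word_counts, 2)
```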

Next, we will show a bar chart of the number of years in which each topic appears, using 15 topics

In [None]:
altair_plot_bar_chart_value_counts(
    df_topics.groupby(["topic_num"])["year"]
    .nunique()
    .sort_values()
    .reset_index()
    .merge(df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"),
    "Number of years in which a topic appears",
    "best:N",
    "year:Q",
    labelFontSize=12,
    titleFontSize=12,
    plot_titleFontSize=16,
    dx=20,
    offset=0,
    x_tick_label_angle=-45,
    fig_size=(650, 250),
)

**Observations**
1. If a topic does not occur in the majority of years, that could be an indication that it is a poor choice of standalone topic and should be combined with another. This is not the case for the topic assignments here.
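
As a sketch of how this check could be automated, the snippet below counts distinct years per topic on a made-up stand-in for `df_topics` and flags topics that appear in fewer than half of all years. The 50% threshold is an assumption chosen here to represent "the majority of years" and is not taken from the analysis above.

```python
import pandas as pd

# Hypothetical stand-in for df_topics with only the two columns used here
toy_topics = pd.DataFrame(
    {
        "topic_num": [0, 0, 0, 1, 1, 2],
        "year": [2001, 2002, 2002, 2001, 2005, 2010],
    }
)

years_per_topic = toy_topics.groupby("topic_num")["year"].nunique()
total_years = toy_topics["year"].nunique()

# Flag topics that appear in fewer than half of all years in the data
sparse_topics = years_per_topic[years_per_topic < total_years / 2].index.tolist()
```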

Next, we will show a heatmap of all the years when each topic's articles appear, with 15 topics

In [None]:
topics_by_timeframe = (
    df_topics.groupby(["topic_num", "year"])
    .size()
    .reset_index()
    .sort_values(by=["topic_num", 0, "year"], ascending=False)
    .rename(columns={0: "count"})
)
display(topics_by_timeframe.head())

Sanity checks

In [None]:
assert (
    df_topics.loc[(df_topics["topic_num"] == 1) & (df_topics["year"] == 2003)].shape[0]
    == topics_by_timeframe.loc[
        (topics_by_timeframe["topic_num"] == 1) & (topics_by_timeframe["year"] == 2003)
    ]["count"].iloc[0]
)

In [None]:
assert topics_by_timeframe["count"].sum() == df_guardian.shape[0]

Generate the heatmap

In [None]:
altair_datetime_heatmap(
    topics_by_timeframe.merge(
        df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"
    ),
    x="year:O",
    y="best:N",
    xtitle="Year",
    ytitle="Topics by Year",
    tooltip=[
        {
            "title": "Year",
            "field": "year",
            "type": "ordinal",
        },
        {
            "title": "Topic",
            "field": "best",
            "type": "nominal",
        },
        {
            "title": "Number of articles",
            "field": "count",
            "type": "quantitative",
        },
    ],
    cmap="yelloworangered",
    legend_title="",
    color_by_col="count:Q",
    yscale="log",
    axis_tick_font_size=12,
    axis_title_font_size=16,
    title_font_size=20,
    legend_fig_padding=10,  # default is 18
    y_axis_title_alignment="left",
    fwidth=650,
    fheight=450,
    file_path="",
    save_to_html=False,
    sort_y=[],
    sort_x=[],
)

**Observations**
1. There are two topics, Topic 11 (articles here were not read manually) and a blanket *Science Findings* topic, that feature a lot of articles in most years. It is likely that several sub-topics are embedded in these, which might allow them to be split into logical sub-topics.

<a id="exploring-articles-with-35-topics"></a>

## 6. [Exploring articles with 35 topics](#exploring-articles-with-35-topics)

The blocks of TFIDF-NMF code are combined below and are re-run with 35 topics now as opposed to 15 previously
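
Before the full run, here is a minimal sketch of the same TF-IDF then NMF pipeline on a made-up four-document corpus, to show how a document is assigned to its dominant topic. The corpus and the `toy_` names are hypothetical, and the settings are deliberately simpler than those used in the cell below.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus (made up); the notebook passes pre-tokenized texts instead
toy_texts = [
    "mars rover landing mars mission",
    "rover mission to mars",
    "higgs boson particle collider",
    "collider finds higgs particle",
]

toy_pipe = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("nmf", NMF(n_components=2, init="nndsvd", max_iter=500, random_state=42)),
    ]
)
# Rows are documents, columns are topics, values are non-negative topic weights
toy_doc_topic = toy_pipe.fit_transform(toy_texts)

# Each document is assigned the topic with the largest weight in its row
toy_assigned = np.argmax(toy_doc_topic, axis=1)
```

This hard assignment is the same idea as the `idxmax(axis=1)` call in the cell below. It discards the weights of all other topics, which is worth keeping in mind when a document sits between two topics.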

In [None]:
%%time
n_topics_wanted = 35

vectorizer = TfidfVectorizer(
    tokenizer=None,  # default is None
    stop_words=None,  # default is None
    lowercase=True,  # default is True
    ngram_range=(1, 2),  # default is (1, 1)
    max_df=0.85,  # default is 1.0
    min_df=3,  # default is 1
    max_features=5000,  # default is None
    preprocessor=" ".join,  # default is None
    binary=False,  # default is False
    strip_accents=None,  # default is None
    # token_pattern='(?u)\\b\\w\\w+\\b',  # default is '(?u)\\b\\w\\w+\\b'
)
sk_nmf = NMF(
    n_components=n_topics_wanted,
    solver="cd",  # default is "cd"
    init="nndsvd",  # default is None, "nndsvd" = Nonnegative Double Singular Value Decomposition
    max_iter=500,  # default is 200
    l1_ratio=0.0,  # default is 0.0
    alpha=0.0,  # default is 0.0
    tol=0.0001,  # default is 0.0001
    random_state=42,
)
pipe = Pipeline([("vectorizer", vectorizer), ("nmf", sk_nmf)])
doc_topic = pipe.fit_transform(texts)

topic_words = pd.DataFrame(
    pipe.named_steps["nmf"].components_,
    index=[str(k) for k in range(n_topics_wanted)],
    # note: scikit-learn >= 1.2 removed get_feature_names(); use get_feature_names_out() there
    columns=pipe.named_steps["vectorizer"].get_feature_names(),
)
topic_df = (
    pd.DataFrame(
        topic_words.apply(
            lambda x: get_top_words_per_topic(x, n_top_words), axis=1
        ).tolist(),
        index=topic_words.index,
    )
    .reset_index()
    .rename(columns={"index": "topic"})
    .assign(topic_num=range(n_topics_wanted))
)
display(topic_df.head(20))
# for k, v in topic_df.iterrows():
#     print(k, ",".join(v[1:-1]))

df_temp = pd.DataFrame(doc_topic).idxmax(axis=1).rename("topic_num").to_frame()

merged_topic = df_temp.merge(topic_df, on="topic_num", how="left").assign(
    url=df_guardian["url"].tolist()
)
df_topics = df_guardian.merge(merged_topic, on="url", how="left").astype(
    {"topic_num": int}
)
display(df_topics.head(20))

<a id="coherence-residual-by-topic"></a>

### 6.1. [Coherence Residual by Topic](#coherence-residual-by-topic)

In [None]:
A = pipe.named_steps["vectorizer"].transform(texts)  # documents x terms (TF-IDF)
W = pipe.named_steps["nmf"].components_  # topics x terms
H = pipe.named_steps["nmf"].transform(A)  # documents x topics
print(f"A={A.shape}, W={W.shape}, H={H.shape}")
r = np.zeros(A.shape[0])
for row in range(A.shape[0]):
    r[row] = np.linalg.norm(A[row, :] - H[row, :].dot(W), "fro")
df_topics["resid"] = r

# boxplot_sorted(
#     df_topics[["topic_num", "resid"]],
#     ["topic_num"],
#     "resid",
#     "Topic Residual (sorted by median)",
#     "center",
#     14,
#     sort_by_median=True,
#     vert=True,
#     fig_size=(12, 6),
# )
altair_boxplot_sorted(
    df_topics[["topic_num", "resid"]],
    "topic_num",
    "resid",
    "median(resid)",
    "Topic Residual (sorted by median)",
    14,
    14,
    16,
    dx=30,
    offset=-5,
    x_tick_label_angle=0,
    horiz_bar_chart=False,
    axis_range=[0.4, 1],
    fig_size=(700, 250),
)

**Observations**
1. Only two topics now have a median residual below 0.80.
2. 15 of the 35 topics have the first quartile of their residuals falling below 0.80.
3. There are now 10 topics with a median residual of approx. 0.83 or better (lower), compared to only 1 previously.
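
For reference, the residual plotted above is the Euclidean norm of the difference between a document's TF-IDF row and its reconstruction from the topic model. The sketch below reproduces the row-by-row loop on made-up dense matrices, using the same `A`, `H` and `W` naming as the cell above, and shows an equivalent vectorized form. The random matrices are stand-ins and are not outputs of the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dense stand-ins, using the notebook's naming:
# A: documents x terms, H: documents x topics, W: topics x terms
A = rng.random((5, 8))
H = rng.random((5, 2))
W = rng.random((2, 8))

# Row by row, as in the cell above
resid_loop = np.zeros(A.shape[0])
for row in range(A.shape[0]):
    resid_loop[row] = np.linalg.norm(A[row, :] - H[row, :].dot(W))

# Same result in one step: Euclidean norm of each row of the difference
resid_vectorized = np.linalg.norm(A - H @ W, axis=1)
```

With the real sparse TF-IDF matrix, the vectorized form would need a dense difference matrix, which may be too large to hold in memory for a big corpus. The loop trades speed for a smaller footprint.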

<a id="topic-2"></a>

#### 6.1.1. [Topic 2](#topic-2)

The five best texts that fit into topic number 2 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 2].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [year-end review of scientific findings in 2015, including Mars-based research](https://www.theguardian.com/science/2015/dec/25/2015-big-year-space-exploration-major-tim-peake-pluto)
   - [anticipating answers about Mars habitability on 40 year anniversary of Mariner 9 launch](https://www.theguardian.com/science/2011/jun/05/mars-anniversary-40-years-space)
   - [about a joint NASA-ESA Mars mission proposal with sample return, with estimated launch between 2018-2023](https://www.theguardian.com/science/2008/jul/14/mars.spaceexploration)
     - missions that later launched in this window include the [InSight lander](https://en.wikipedia.org/wiki/InSight) and the [Perseverance rover](https://en.wikipedia.org/wiki/Perseverance_(rover)), which carried the [Ingenuity helicopter](https://en.wikipedia.org/wiki/Mars_Helicopter_Ingenuity) onboard
   - [anticipating the launch of the Odyssey orbiter](https://www.theguardian.com/science/2001/apr/05/spaceexploration.technology)
   - [list of, and operational details about, Mars missions that were lost](https://www.theguardian.com/science/2016/oct/20/total-recall-of-unsuccessful-mars-lander-schiaparelli-exomars)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 2].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below   
   - [about NASA's announced spacesuit to aid humans on Mars](https://www.theguardian.com/science/2014/apr/30/nasa-spacesuit-zseries-new-design-mars)
   - [about space tourism in 2018](https://www.theguardian.com/science/shortcuts/2013/feb/28/mars-mission-married-couple-space)
   - [about colonization on Mars](https://www.theguardian.com/science/2019/jun/23/all-female-mars-colony-possible-using-frozen-sperm-says-study)
   - [book discussing extreme conditions that could be encountered during a manned mission to Mars](https://www.theguardian.com/science/sifting-the-evidence/2013/mar/13/medical-research-health)
   - [reporting on the Curiosity rover's landing procedure on Mars in 2012](https://www.theguardian.com/science/2012/jul/20/spacewatch-curiosity-mars-landing)

The focus of the best articles remains on science and operations for Mars missions. The worst articles focus on manned missions to Mars - habitability, spacesuits, landing sequence (for spacecraft), tourism.
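
The same two lookups (lowest and highest residuals within a topic) are repeated for every topic in this section. A hypothetical helper such as `best_and_worst_urls` below could wrap them. It is a sketch on made-up rows and assumes only the `topic_num`, `resid` and `url` columns used in the cells above.

```python
import pandas as pd


def best_and_worst_urls(df, topic_num, n=5):
    """Return the URLs of the n lowest- and n highest-residual articles in a topic."""
    in_topic = df[df["topic_num"] == topic_num]
    best = in_topic.nsmallest(n, "resid")["url"].tolist()
    worst = in_topic.nlargest(n, "resid")["url"].tolist()
    return best, worst


# Made-up rows standing in for df_topics
toy = pd.DataFrame(
    {
        "topic_num": [2, 2, 2, 1],
        "resid": [0.70, 0.95, 0.80, 0.60],
        "url": ["a", "b", "c", "d"],
    }
)
best, worst = best_and_worst_urls(toy, topic_num=2, n=2)
```

When a topic has fewer than `2 * n` articles, the two lists overlap, as article `c` does here, so "worst" articles in small topics should be read with care.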

<a id="topic-1"></a>

#### 6.1.2. [Topic 1](#topic-1)

The five best texts that fit into topic number 1 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 1].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [reporting on NASA's plans to re-attempt the Columbia shuttle mission, two years after the 2003 crash](https://www.theguardian.com/science/2005/jul/07/1)
   - [about preparing for launch of the last mission for the shuttle Discovery](https://www.theguardian.com/science/2010/oct/31/us-space-shuttle-discovery-mission)
   - [about Discovery's last mission risks despite learning and fixing issues related to 2003 Columbia crash](https://www.theguardian.com/science/2005/apr/10/spaceexploration.usnews)
   - [looking ahead to Discovery's last launch mission, while discussing the problem of debris from the 2003 Columbia crash](https://www.theguardian.com/science/2005/jul/29/sciencenews.spaceexploration)
   - [about Discovery reaching back safely](https://www.theguardian.com/science/2005/apr/10/spaceexploration.usnews)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 1].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about picture carried by Israeli astronaut who died in 2003 Columbia crash](https://www.theguardian.com/science/2003/feb/02/spaceexploration.columbia1)
   - [about more stringent NASA screening after astronaut arrested for attempted murder](https://www.theguardian.com/science/2003/feb/03/spaceexploration.columbia8)
   - [about an Israeli astronaut on the Columbia shuttle mission](https://www.theguardian.com/science/2003/jan/17/spaceexploration.internationaleducationnews)
   - [about picture of Endeavour shuttle's launch](https://www.theguardian.com/science/2011/may/17/photograph-space-shuttle-endeavour-aeroplane)
   - [about nuclear technology in use for Prometheus project](https://www.theguardian.com/science/2004/sep/23/thisweekssciencequestions2)

The best articles focus on the Columbia shuttle crash. The worst focus on space shuttles in a variety of contexts.

<a id="topic-6"></a>

#### 6.1.3. [Topic 6](#topic-6)

The five best texts that fit into topic number 6 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 6].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about possible previous collision between Moon and another moon](https://www.theguardian.com/science/2011/aug/03/second-moon-collision)
   - [about China's plans to land on moon](https://www.theguardian.com/science/2018/dec/07/chinese-spacecraft-attempt-first-landing-far-side-of-moon-change-4)
   - [Written in 2019 (as 50th anniversary of Apollo 11 approached), discusses new interest in Moon exploration](https://www.theguardian.com/science/2019/jul/06/everyones-going-to-the-moon-again-apollo-11-50th-aniversary)
   - [about Lunar orbiter to take pictures of Apollo landing site](https://www.theguardian.com/science/2005/jul/18/spaceexploration.internationalnews)
   - [Written in 1999, references complete absence of moon landings for a period of](https://www.theguardian.com/science/1999/jul/19/spaceexploration.g2) 27 years ([1](https://www.history.com/news/us-moon-landings-apollo))

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 6].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [Moon tourism](https://www.theguardian.com/science/1999/sep/22/spaceexploration.business)
   - [about a phylum (tardigrades) that may have survived the crash of a private Israeli Moon lander](https://www.theguardian.com/science/2019/aug/06/tardigrades-may-have-survived-spacecraft-crashing-on-moon)
   - [about picture of Earth taken by Apollo 8 astronaut](https://www.theguardian.com/science/2018/oct/17/the-most-iconic-photograph-of-earth)
   - [about real estate on the Moon](https://www.theguardian.com/science/2006/jun/15/spaceexploration.alternativeinvestment)
   - [about a competition to put a Swedish house on the moon](https://www.theguardian.com/science/2006/oct/18/spaceexploration.g2)

The focus of the best articles is on exploring the Moon. The worst articles focus on manned missions to the Moon.

<a id="topic-15"></a>

#### 6.1.4. [Topic 15](#topic-15)

The five best texts that fit into topic number 15 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 15].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - approaching end of Cassini mission
     - [1](https://www.theguardian.com/science/2017/apr/30/cassini-saturn-space-probe-fiery-finale-nasa)
     - [2](https://www.theguardian.com/science/2017/sep/14/nasas-cassini-spacecraft-poised-to-begin-mission-ending-dive-into-saturn)
   - Cassini spacecraft enters Saturn orbit
     - [1](https://www.theguardian.com/science/2004/jul/01/spaceexploration.research)
     - [2](https://www.theguardian.com/science/2004/jun/04/spaceexploration.starsgalaxiesandplanets)
     - [3](https://www.theguardian.com/science/2004/jun/27/spaceexploration.research)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 15].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [firing reserve thrusters on Voyager spacecraft that passed Saturn and its moon Titan in 1980](https://www.theguardian.com/science/2017/dec/07/spacewatch-voyager-1-gets-new-lease-of-life)
   - [about Cassini's observed change in Saturn's colour at north pole](https://www.theguardian.com/science/2016/oct/26/hexagon-on-saturn-nasa-scientists-ponder-colour-changing-north-pole)
   - [what Earth would look like with Saturn's rings](https://www.theguardian.com/science/punctuated-equilibrium/2011/may/02/1)
   - [anticipating Cassini entering atmosphere of the moon Titan, during which time music will be delivered](https://www.theguardian.com/science/2015/aug/19/potential-sources-of-helium-revealed-as-reserves-of-the-precious-gas-dwindle)
   - [about Cassini probe entering last phase of travel from Earth towards Saturn](https://www.theguardian.com/science/1999/aug/19/spaceexploration.aeronautics)

The topic is focused on mission updates for the Cassini probe. The sub-topic of water across the solar system, and even on Earth, has been separated out. While Cassini did find evidence of water on Saturn's moon ([1](https://solarsystem.nasa.gov/missions/cassini/science/enceladus/), [2](https://www.nasa.gov/mission_pages/cassini/media/cassini-20060309.html)), this specific topic (even the majority of the worst articles) covers articles that are focused on the mission operations themselves.

<a id="topic-10"></a>

#### 6.1.5. [Topic 10](#topic-10)

The five best texts that fit into topic number 10 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 10].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on finding evidence of water across the Solar System.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 10].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [toy experiment to show that gas takes up more volume than solids](https://www.theguardian.com/science/2008/may/02/physics.chemistry)
     - water mixed with Alka-Seltzer causes the container to shoot up like a rocket
   - [cooking recipe (involves water)](https://www.theguardian.com/science/grrlscientist/2012/aug/18/4)
   - [about tardigrades ability to survive in extreme conditions - ice, heat, etc. - on Earth](https://www.theguardian.com/science/2007/jul/19/sudan.water)
   - [about Helium in presence of groundwater in some places on Earth](https://www.theguardian.com/science/2015/aug/19/potential-sources-of-helium-revealed-as-reserves-of-the-precious-gas-dwindle)
   - [basic science of a stone skimming on water](https://www.theguardian.com/science/2004/jan/08/research.science1)

The best articles here focus on discovery of water on planets across the solar system. The worst involve uses of water on Earth - cooking, stone skimming, experiments, etc.

<a id="topic-4"></a>

#### 6.1.6. [Topic 4](#topic-4)

The five best texts that fit into topic number 4 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 4].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on operational events related to the Rosetta spacecraft mission to comet 67P - mission end by crashing into comet, landing on comet, images captured, etc.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 4].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about the flu epidemic being attributed to comets passing close to Earth due to energy created by sunspots](https://www.theguardian.com/science/2000/jan/19/spaceexploration.medicineandhealth)
   - [about the probe mission to comet Tempel 1 ending with its descent and crash into the comet](https://www.theguardian.com/science/2005/jul/07/thisweekssciencequestions2)
   - [about shutting down the Deep Space 1 spacecraft which flew by a comet](https://www.theguardian.com/science/2001/dec/20/technology)
   - [about finding decayed comet bits](https://www.theguardian.com/science/2003/feb/28/spaceexploration.highereducation)
   - [about aerogel - the material used by the Stardust spacecraft to capture comet dust](https://www.theguardian.com/science/2004/jan/08/research.science1)

The best articles here focus on the Rosetta spacecraft's mission to comet 67P. The worst focus on other comets and, in some cases, do not focus on space missions but on comet dust/debris - most of the worst articles are focused on scientific findings or mission operations.

<a id="topic-5"></a>

#### 6.1.7. [Topic 5](#topic-5)

The five best texts that fit into topic number 5 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 5].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topic is focused on news reports about the Philae lander as part of the Rosetta mission.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 5].nlargest(5, "resid")["url"].tolist()

**Observations**
1. Although most of the worst articles in this topic place the Rosetta mission as a secondary area of focus, the focus remains on non-science/operational themes connected to the mission.

The Philae lander, as part of the Rosetta mission, is the focus here. The worst articles try to keep the mission at the center of focus, but it is a secondary theme in some of them, and there is no emphasis on the Philae lander. The best ones focus on Philae itself.

<a id="topic-9"></a>

#### 6.1.8. [Topic 9](#topic-9)

The five best texts that fit into topic number 9 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 9].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on operational events (launch, arrival, crew) related to the International Space Station.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 9].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles involve people related to the ISS, videos filmed from the ISS or the first overall space station (Salyut 1 from Russia).

The focus of this topic is on the ISS. The worst articles do not directly relate to the ISS at all.

<a id="topic-12"></a>

#### 6.1.9. [Topic 12](#topic-12)

The five best texts that fit into topic number 12 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 12].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on the search for the Higgs Boson.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 12].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are only generally focused on operating the LHC - safety, facts/figures - or about other particles (some of which don't involve bosons).

The topic is focused on bosons, with a secondary emphasis (some of the worst articles) on other particles (neutron) which indirectly emphasize bosons.

<a id="topic-13"></a>

#### 6.1.10. [Topic 13](#topic-13)

The five best texts that fit into topic number 13 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 13].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on Stephen Hawking and his research.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 13].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [scientists show that more M&Ms than ball bearings can fit into a jar](https://www.theguardian.com/science/2004/feb/13/sciencenews.theguardianlifesupplement)
   - [Nobel Prize laureates get a dedicated parking space](https://www.theguardian.com/science/2019/oct/08/nobel-prizes-have-a-point-parking-space)
   - [about Aristotle](https://www.theguardian.com/science/2002/jan/15/peopleinscience)
   - [dollar value of Nobel Prize, from experimental physicist selling his award](https://www.theguardian.com/science/shortcuts/2015/may/27/nobel-prize-buyers-guide-to-worlds-top-trophies-leon-lederman)
   - [obituary for Physics professor Jack Meadows](https://www.theguardian.com/science/2016/aug/02/jack-meadows-obituary)

The best articles here focus on the life and accomplishments of Stephen Hawking. The worst involve other experimental physicists.

<a id="topic-25"></a>

#### 6.1.11. [Topic 25](#topic-25)

The five best texts that fit into topic number 25 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 25].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on research into understanding and imaging black holes. The findings are primarily related to data collected from the Event Horizon Telescope (Earth-based).

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 25].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - article about [Ada Lovelace](https://en.wikipedia.org/wiki/Ada_Lovelace)[-inspired puzzle where black squares are prominent](https://www.theguardian.com/science/2017/mar/28/did-you-solve-it-take-the-ada-lovelace-challenge-solution-part-i)
   - [Stephen Hawking black-hole brownies](https://www.theguardian.com/science/brain-flapping/2013/sep/10/great-science-bake-off-baking-recipes)
   - [black hole explains existence of Santa Claus](https://www.theguardian.com/science/brain-flapping/2014/dec/12/santa-claus-father-christmas)   
   - [about](https://www.theguardian.com/science/2003/jan/16/science.research1) the Golden Ratio, which [appears in theory of black holes](https://johncarlosbaez.wordpress.com/2013/02/28/black-holes-and-the-golden-ratio/)
   - [making analogy between doughnuts with holes in them and Nobel Physics topics](https://www.theguardian.com/science/2003/oct/09/spaceexploration.research)

This is a topic about black holes in the universe. Some of the worst articles are loosely connected to the topic.

<a id="topic-28"></a>

#### 6.1.12. [Topic 28](#topic-28)

The five best texts that fit into topic number 28 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 28].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on launch of and search for the Beagle 2 UK Mars lander.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 28].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [mentions Beagle 2 researchers being candidates for an award](https://www.theguardian.com/science/2005/mar/24/research.science)
   - [Beagle 2 mission lead talking about patent for nuclear-powered flying saucer](https://www.theguardian.com/science/2006/mar/13/spaceexploration.transportintheuk)
   - [about tests on Mars rover prototype as successor to Beagle 2](https://www.theguardian.com/science/2006/jun/12/spaceexploration.starsgalaxiesandplanets), as [part of the ExoMars mission](https://www.theguardian.com/science/2006/aug/22/spaceexploration.uknews)
   - [mentions Beagle spacecraft in article about Russian dog sent to space](https://www.theguardian.com/science/2004/mar/20/spaceexploration.animalrights)
   - [obituary for a researcher involved in advocating for *in situ* analyses on the Philae lander that accompanied Rosetta](https://www.theguardian.com/science/2014/dec/21/colin-pillinger-remembered-monica-grady-observer-obituaries-2014-open-university)

The best articles here focus on science and (primarily) operations related to the lost British Beagle 2 Mars lander. The worst articles (sometimes loosely) reference the Beagle 2, but are not directly related to its mission ([Mars Express](https://en.wikipedia.org/wiki/Mars_Express)).

<a id="topic-3"></a>

#### 6.1.13. [Topic 3](#topic-3)

The five best texts that fit into topic number 3 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 3].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on the discovery of new planets.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 3].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about discovery of Earth-like planets](https://www.theguardian.com/science/2004/sep/04/starsgalaxiesandplanets.spaceexploration)
   - [about the Israeli astronaut who was part of the Columbia shuttle mission](https://www.theguardian.com/science/2003/jan/16/technology)
   - [about discovery of youngest supernova in Milky Way](https://www.theguardian.com/science/2008/may/14/usa)
   - [about disputes over naming stars](https://www.theguardian.com/science/2000/aug/27/spaceexploration.theobserver)
   - [about the brightness of the planet Venus](https://www.theguardian.com/science/2018/jun/03/starwatch-venus-gemini-castor-pollux)

The best articles focus on discovery of new planets, including Earth-like ones. The worst focus on existing planets (Venus) and discovery of new planets.

**Best** - Discovery of new planets

**Worst** - New and Existing planets
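
The best and worst lookups for each topic in this section follow the same two-line pattern. A minimal sketch of a helper that wraps that pattern is shown below. The function name `best_and_worst_urls` and the toy `df_demo` data are hypothetical; the only things taken from the notebook are the `topic_num`, `resid` and `url` column names and the convention that a smaller `resid` means a better fit.

```python
import pandas as pd


def best_and_worst_urls(df, topic_num, n=5):
    """Return (best, worst) URL lists for one topic, ranked by residual.

    Hypothetical helper: assumes `df` has the `topic_num`, `resid` and `url`
    columns used by the cells in this section.
    """
    # Keep only the documents assigned to the requested topic
    df_topic = df[df["topic_num"] == topic_num]
    # Smallest residuals are treated as the best-fitting documents
    best = df_topic.nsmallest(n, "resid")["url"].tolist()
    # Largest residuals are treated as the worst-fitting documents
    worst = df_topic.nlargest(n, "resid")["url"].tolist()
    return best, worst


# Toy stand-in for df_topics (made-up values, for illustration only)
df_demo = pd.DataFrame(
    {
        "topic_num": [3, 3, 3, 7],
        "resid": [0.2, 0.9, 0.5, 0.1],
        "url": ["a", "b", "c", "d"],
    }
)
best, worst = best_and_worst_urls(df_demo, topic_num=3, n=2)
```

With the real data, the call would presumably be `best_and_worst_urls(df_topics, 3)`, which returns the same two lists produced by the separate cells above.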

<a id="topic-7"></a>

#### 6.1.14. [Topic 7](#topic-7)

The five best texts that fit into topic number 7 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 7].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on reports about SpaceX events involving Falcon rockets - launch, landing, crash.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 7].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [mentions absence of teapot on Elon Musk's (SpaceX founder) rocket](https://www.theguardian.com/science/2018/feb/09/runways-rockets-and-russells-teapot)
     - no mention of Falcon rocket, just rocket
   - [2004 article on use of rocket nozzle technology in loud speakers](https://www.theguardian.com/science/2004/jul/15/science.research)
     - nozzles are built to optimize air flow and turbulence at launch time
   - [comparing cost of moon trip on Falcon Heavy to terrestrial trips](https://www.theguardian.com/science/shortcuts/2017/feb/28/140-pound-mile-elon-musk-moon-trip-spacex-compare-terrestrial-journeys)
   - [anniversary of first successful launch of liquid-fuelled rocket](https://www.theguardian.com/science/the-h-word/2015/oct/14/frank-malina-and-an-overlooked-space-age-milestone)
   - [1999 article on successful launch](https://www.theguardian.com/science/1999/aug/21/spaceexploration.uknews) of [Starchaser 3A rocket, in the UK](https://en.wikipedia.org/wiki/Starchaser_Industries#Launched_Rockets)

The best articles focus on SpaceX operational events. The worst focus on uses of rocket technology - this could include (non-SpaceX) launches and space tourism.

<a id="topic-8"></a>

#### 6.1.15. [Topic 8](#topic-8)

The five best texts that fit into topic number 8 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 8].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on imminent and future threatening asteroids.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 8].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about UK government agreeing to more studies for near-Earth asteroids](https://www.theguardian.com/science/2001/mar/01/technology3)
   - [about canceled asteroid exploring rover](https://www.theguardian.com/science/2000/nov/09/technology2)
   - [from 2013, about new class of astronauts who could be involved in a mission to an asteroid](https://www.theguardian.com/science/2013/jun/18/nasa-astronaut-mars-2013-women)
   - [on probability of massive asteroid hitting the Earth](https://www.theguardian.com/science/2003/nov/20/science.research)
   - [about efforts to track space rocks threatening Earth - asteroids, meteors](https://www.theguardian.com/science/2019/mar/18/meteor-blast-over-bering-sea-was-10-times-size-of-hiroshima)

The best articles focus on threatening asteroids. The worst focus on threatening asteroids or missions to asteroids.

**Best** - Earth-threatening asteroids

**Worst** - Earth-threatening asteroids and Missions to asteroids

<a id="topic-11"></a>

#### 6.1.16. [Topic 11](#topic-11)

The five best texts that fit into topic number 11 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 11].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on
   - [about plans to ignore astronauts in UK space policy in 2008](https://www.theguardian.com/science/2008/feb/14/spaceexploration.spacetechnology)
   - [announcement of British astronaut to travel to ISS](https://www.theguardian.com/science/2013/may/20/tim-peake-space-station-mission)
   - [about UK company's plans for moon satellites, in part aimed at probing for future manned bases](https://www.theguardian.com/science/2007/jan/11/spaceexploration.uknews)
   - [on UK's potential for space exploration areas - manned missions vs satellites (its strength)](https://www.theguardian.com/science/2007/aug/08/spaceexploration.telecoms)
   - [in 2004, about the low profile of the British space program](https://www.theguardian.com/science/2004/jan/08/spaceexploration.science1)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 11].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [in 2008 about urine sample donations for NASA tests of space capsules for space travellers](https://www.theguardian.com/science/2008/jul/17/spaceexploration.usa)
   - [about delaying aged care reforms in UK](https://www.theguardian.com/science/2018/sep/17/dont-delay-aged-care-reforms-because-of-royal-commission-government-urged)
   - [about Helium gas reserve discovery in Africa](https://www.theguardian.com/science/2016/jun/28/huge-helium-gas-tanzania-east-africa-averts-medical-shortage)
     - article mentions that some of Earth's initial Helium content was ``lost to space``
   - [about astronaut monitoring company becoming public](https://www.theguardian.com/science/2005/sep/12/spaceexploration.business)
   - [about two Swedish citizens hacking into NASA](https://www.theguardian.com/science/1999/aug/17/spaceexploration.aeronautics)

The best articles focus on the extent of the British space program's focus - primarily on satellites. The worst are a mix of space-related topics.

**Best** - British space program

**Worst** - Random Space news, loosely connected to finances
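
For topics such as this one, where the worst articles are only loosely related to the best ones, it can help to check how widely the residuals vary within each topic. The sketch below is one possible way to do that, not part of the original analysis. It assumes `resid` is a per-document fit residual (smaller is better), which is consistent with how `nsmallest` is used for the best texts. The toy `df_demo` values and the `spread` column name are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_topics (made-up values, for illustration only)
df_demo = pd.DataFrame(
    {
        "topic_num": [11, 11, 11, 14, 14, 14],
        "resid": [0.1, 0.5, 0.9, 0.2, 0.3, 0.4],
    }
)

# Summarize the residuals per topic and rank topics by how far apart
# their best-fitting and worst-fitting documents are
resid_summary = (
    df_demo.groupby("topic_num")["resid"]
    .agg(["min", "median", "max"])
    .assign(spread=lambda d: d["max"] - d["min"])
    .sort_values("spread", ascending=False)
)
```

Topics near the top of `resid_summary` are the ones where the five worst texts are most likely to drift away from the theme of the five best.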

<a id="topic-14"></a>

#### 6.1.17. [Topic 14](#topic-14)

The five best texts that fit into topic number 14 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 14].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics report on or look ahead to the occurrence of an eclipse.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 14].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [science approach to stopping a shower curtain from wrapping itself around your legs](https://www.theguardian.com/science/2001/sep/06/physicalsciences.highereducation)
   - [astronomer used science and a trip northwest of Paris to pinpoint the moment of creation of a van Gogh](https://www.theguardian.com/science/2001/mar/08/spaceexploration.uknews)
     - article mentions the following about same astronomer
       - traced the origins of the term "blue moon"
       - tracked down evidence that the young Abraham Lincoln witnessed meteor storm
       - proposed that the star mentioned in Hamlet could have been a supernova
   - [about noctilucent clouds](https://www.theguardian.com/science/2018/jun/17/starwatch-time-to-look-for-noctilucent-clouds-at-the-edge-of-space)
     - [cloud-like phenomena in the upper atmosphere of Earth. They consist of ice crystals and are only visible during astronomical twilight](https://en.wikipedia.org/wiki/Noctilucent_cloud)
   - [about light from meteor shower over UK](https://www.theguardian.com/science/2016/mar/17/st-patricks-day-meteor-sky-over-britain)
   - [about street lights being brighter than real night sky](https://www.theguardian.com/science/2001/aug/14/spaceexploration.uknews)

The best focus on an eclipse. The worst focus on astronomical phenomena associated with lighting up the night sky.

**Best** - Eclipse

**Worst** - Astronomical Phenomena that light up the skies during darker times of the day

<a id="topic-16"></a>

#### 6.1.18. [Topic 16](#topic-16)

The five best texts that fit into topic number 16 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 16].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on using the [New Horizons probe](https://en.wikipedia.org/wiki/New_Horizons) or [Hubble Space Telescope](https://en.wikipedia.org/wiki/Hubble_Space_Telescope) to capture images of Pluto.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 16].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [New Horizons spacecraft captures first images of Ultima Thule, proposed to have been created by mixing of ice and dust during the early days of the Solar System](https://www.theguardian.com/science/2019/jan/02/first-close-ups-of-ultima-thule-reveal-it-resembles-dark-red-snowman)
   - [report](https://www.theguardian.com/science/2011/may/18/spacewatch-endeavour-shuttle) on [Dawn spacecraft](https://en.wikipedia.org/wiki/Dawn_(spacecraft)) closing in on asteroid before heading for a dwarf planet [Ceres](https://en.wikipedia.org/wiki/Ceres_(dwarf_planet))
   - [protests against using nuclear technology to power spacecraft to (distant) planets beyond Mars as solar powered craft can't benefit from strong enough sunlight as a power source](https://www.theguardian.com/science/2003/oct/05/spaceexploration.observersciencepages)
   - [report on New Horizons communication after flying past Ultima Thule](https://www.theguardian.com/science/2018/dec/31/new-horizons-heads-for-flyby-of-space-rock-ultima-thule) ([which is smaller than a dwarf planet](https://www.space.com/ultima-thule-birth-new-horizons-first-science.html) and is [located further away from Earth than Pluto](https://www.google.com/search?q=where+is+ultima+thule&client=ubuntu&hs=LQz&channel=fs&sxsrf=ALeKk01-AZjHf2Rwo8PvEEaDxRDh0njVfw:1608062644302&tbm=isch&source=iu&ictx=1&fir=dv-l9DQF1rEwvM%252CXVz_LsNMVaU1mM%252C_&vet=1&usg=AI4_-kQHekD5S6R0osfdct69--S5wtX9BA&sa=X&ved=2ahUKEwjboMKV5NDtAhVNQ80KHVw4B-QQ9QF6BAgeEAE&biw=1920&bih=1098#imgrc=gOqqXvnthfZXrM))
   - [about images captured by Dawn spacecraft, of dwarf planet Ceres](https://www.theguardian.com/science/2016/mar/16/mystery-of-ceres-bright-spots-deepens-new-data-analysed-nasa-dawn)

The best articles focus on images of Pluto. The worst focus on images or technology (nuclear power) to study dwarf planets (Ceres), or distant objects smaller than dwarf planets (e.g. [Ultima Thule](https://en.wikipedia.org/wiki/486958_Arrokoth)). A [distant minor planet](https://en.wikipedia.org/wiki/Distant_minor_planet) is located beyond Jupiter, and Ceres (a dwarf planet) is between Mars and Jupiter, so it is likely that the focus of the worst articles is on dwarf planets (or smaller objects) beyond Mars. Operational aspects of missions are included in the worst articles too.

**Best** - science and operations related to Pluto ([a dwarf planet](https://www.space.com/43-pluto-the-ninth-planet-that-was-a-dwarf.html) and a distant object)

**Worst** - science and operations related to distant objects and dwarf planets

<a id="topic-18"></a>

#### 6.1.19. [Topic 18](#topic-18)

The five best texts that fit into topic number 18 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 18].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on reporting on the end of the operations for the MIR space station.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 18].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about crashes (also involves Russian cosmonauts) returning from the Salyut 1 space station](https://www.theguardian.com/science/2003/feb/02/spaceexploration.theobserver)
   - [about MIR cosmonaut talking about life in space, in London](https://www.theguardian.com/science/2002/mar/14/technology)
   - [about Russian Space Agency medical team member commenting about alcohol on the ISS](https://www.theguardian.com/science/2006/jan/07/uknews)
   - [on anniversary of Russian cosmonaut Yuri Gagarin's launch into space](https://www.theguardian.com/science/2001/apr/08/spaceexploration.theobserver)
   - [about funding limits facing MIR space station](https://www.theguardian.com/science/1999/jun/03/technology1)

The best articles focus on the end of life of the MIR space station operations. The worst retain a loose focus on Russian space stations, not just MIR, and cover funding issues, launches, famous residents, etc.

**Best** - End of MIR space station operations

**Worst** - Russian space stations

<a id="topic-19"></a>

#### 6.1.20. [Topic 19](#topic-19)

The five best texts that fit into topic number 19 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 19].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on
   - [reports on unveiling](https://www.theguardian.com/science/2016/apr/10/virgin-galactic-richard-branson-interview) of a [new Virgin Galactic ship](https://sanfrancisco.cbslocal.com/2016/12/05/virgin-galactic-spaceship-glide-flight-mojave-desert/) to be used for trips into space
   - [evaluating dangers and other factors involved in space travel through discussions with Richard Branson and other Virgin Galactic team members](https://www.theguardian.com/theobserver/2012/jun/17/space-tourism-science-virgin-robin-mckie)
   - [Virgin Galactic becomes public company](https://www.theguardian.com/science/2019/jul/09/richard-branson-virgin-galactic-go-public)
   - [Richard Branson announces launch of Virgin Galactic Airways](https://www.theguardian.com/science/2004/sep/27/spaceexploration.travelnews)
   - [NASA collaborates with Virgin Galactic on space passenger plane design](https://www.theguardian.com/science/2007/feb/22/spaceexploration.weekendmagazinespacesection)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 19].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [does alcohol really cause air rage](https://www.theguardian.com/science/sifting-the-evidence/2016/aug/03/flight-or-fight-does-alcohol-really-cause-air-rage)
   - [about training for marathon](https://www.theguardian.com/science/2005/feb/10/alokjhasmarathonattempt.lifeandhealth)
   - [monkeys return from space flight of 300 miles](https://www.theguardian.com/books/2009/may/29/from-archive-monkeys-missile)
   - [best place to tie shoes at airport](https://www.theguardian.com/science/2016/jul/18/did-you-solve-it-wheres-the-best-place-to-tie-your-shoe-in-an-airport)
   - [about reader question (letter) regarding carbon footprint of one Virgin Galactic flight](https://www.theguardian.com/science/2019/jul/10/picking-up-bransons-rocketing-carbon-bill)

The best articles focus on Virgin Galactic's space tourism business - new spaceship announcement, design, dangers, etc. The worst loosely focus on air travel.

**Best** - Virgin Galactic space travel

**Worst** - air travel

<a id="topic-20"></a>

#### 6.1.21. [Topic 20](#topic-20)

The five best texts that fit into topic number 20 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 20].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on
   - [in 2007, about detection of halo around cluster of galaxies being evidence for existence of dark matter](https://www.theguardian.com/science/2007/may/16/spaceexploration.universe)
     - appears twice in data
   - [about image of galaxy with no dark matter](https://www.theguardian.com/science/2018/mar/28/galaxy-without-any-dark-matter-baffles-astronomers)
   - [about using dwarf galaxies to determine property of dark matter](https://www.theguardian.com/science/2006/feb/06/starsgalaxiesandplanets.spaceexploration)
   - [about detection of particles with same characteristics of dark matter in underground mine on Earth](https://www.theguardian.com/science/2009/dec/17/dark-matter-detected)
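
Because one article appears twice in the data, it takes up two of the five best slots. A minimal sketch of one way to avoid this is to drop duplicate URLs before ranking. This is an illustration under assumptions rather than what the notebook does: the toy `df_demo` values are made up, and it assumes the duplicated rows share the same `url` value (duplicates with different URLs would need a comparison on the article text instead).

```python
import pandas as pd

# Toy stand-in for df_topics with one duplicated article (made-up values)
df_demo = pd.DataFrame(
    {
        "topic_num": [20, 20, 20, 20],
        "resid": [0.1, 0.1, 0.3, 0.4],
        "url": ["u1", "u1", "u2", "u3"],
    }
)

df_topic = df_demo[df_demo["topic_num"] == 20]
# Sort by residual first so that, for each URL, the best-fitting copy is kept
df_unique = df_topic.sort_values("resid").drop_duplicates(subset="url", keep="first")
# Rank the de-duplicated rows exactly as in the cells above
best_unique = df_unique.nsmallest(3, "resid")["url"].tolist()
```

Each URL now appears at most once in `best_unique`, so the list covers distinct articles.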

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 20].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about aether manifesting itself (in modern times) as dark matter](https://www.theguardian.com/science/2004/feb/12/research.science) (even though [the two are not related](https://www.discovermagazine.com/the-sciences/dark-matter-vs-aether))
   - [about suggestion that dark matter is affecting travel of Pioneer spacecraft](https://www.theguardian.com/science/2004/sep/12/spaceexploration.research)
   - [about quiz game where one of the questions involved dark matter](https://www.theguardian.com/science/2005/apr/28/1)
   - [about strawberries in space](https://www.theguardian.com/science/2003/nov/13/spaceexploration.science)
   - [very high level discussion of universe, including proportion of dark matter](https://www.theguardian.com/science/2004/apr/15/spaceexploration.highereducation)

The best articles focus on detecting (presence or absence of) dark matter. The worst involve an indirect mention of dark matter in the universe - dark matter here is not strongly related to the main topic.

**Best** - Discoveries about dark matter in the universe

**Worst** - Passing mentions of dark matter in relation to other topics about space

<a id="topic-21"></a>

#### 6.1.22. [Topic 21](#topic-21)

The five best texts that fit into topic number 21 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 21].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on consequences of and push for a solution to the problem of global warming.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 21].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about UK poverty statistics, in context of global economic crisis](https://www.theguardian.com/science/2013/dec/12/press-releasing-poverty-at-the-dwp)
   - [about populating space as a way to deal with global warming](https://www.theguardian.com/science/2010/aug/12/populating-space-stephen-hawking)
   - [about film developed to be invisible to IR cameras](https://www.theguardian.com/science/2018/jun/27/scientists-develop-thermal-camouflage-that-can-fool-infrared-cameras)
     - it can hide an object by causing it to appear as though it has the same temperature as its background
     - focus here is on temperature
   - [about forest sustainability and microbes surviving in countries with a lot of sunlight](https://www.theguardian.com/science/2000/jun/15/technology1)
   - [photonics expert on mirrors replacing air conditioners, and cutting back on electricity use, by reflecting heat towards the sky](https://www.theguardian.com/science/2014/nov/26/mirrors-air-conditioning-heat-space)
     - expert dismisses the idea that this could slow down global warming

The best articles focus on the problem of global warming and climate change. The worst focus on uses of sunlight, with an indirect link to global warming.

**Best** - Global Warming

**Worst** - Uses of light and heat

<a id="topic-23"></a>

#### 6.1.23. [Topic 23](#topic-23)

The five best texts that fit into topic number 23 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 23].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on detection of gravitational waves.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 23].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [high level discussion of the role of EM (micro)-waves in invisibility cloaks](https://www.theguardian.com/science/2011/feb/01/scientists-invent-invisibility-cloak)
   - [about earthquake detection, including maps of seismic wave detections in Los Angeles, USA](https://www.theguardian.com/science/2004/aug/09/geology)
   - [use of electricity in real-world uses of EM (radio and micro)-waves](https://www.theguardian.com/science/2017/may/22/michael-faraday-lost-better-call-saul-genius)
   - [about shape of universe being that of an American football](https://www.theguardian.com/science/2003/oct/09/spaceexploration.research)
   - [about the largest tsunami waves](https://www.theguardian.com/science/2003/feb/27/technology)

The best articles focus on the detection of gravitational waves. The worst focus on a few other types of [waves](https://byjus.com/physics/types-of-waves/), including ocean waves.

**Best** - Gravitational waves

**Worst** - Waves - of water and EM

<a id="topic-26"></a>

#### 6.1.24. [Topic 26](#topic-26)

The five best texts that fit into topic number 26 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 26].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on the search for and detection of neutrinos.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 26].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [all about the element niobium](https://www.theguardian.com/science/punctuated-equilibrium/2011/dec/09/1)
   - [element Xenon](https://www.theguardian.com/science/grrlscientist/2012/mar/16/1)
   - [element tellurium](https://www.theguardian.com/science/grrlscientist/2012/mar/02/1)
   - [element calcium](https://www.theguardian.com/science/punctuated-equilibrium/2011/jul/08/1)
   - [about science of crumpling](https://www.theguardian.com/science/2002/apr/04/physicalsciences.research)
     - by researchers at University of Chicago, which is located near Fermilab (the world's largest neutrino source)

The best articles focus on detecting and understanding [neutrinos](https://www.scientificamerican.com/article/what-is-a-neutrino/). The worst focus on elements, likely due to their connection to neutrinos through their constituent [sub-atomic particles](https://chem.libretexts.org/Bookshelves/Physical_and_Theoretical_Chemistry_Textbook_Maps/Supplemental_Modules_(Physical_and_Theoretical_Chemistry)/Atomic_Theory/The_Atom/Sub-Atomic_Particles) (protons, neutrons, and electrons).

**Best** - Detection of and Understanding Neutrinos (subatomic particles)

**Worst** - Elements in nature, their subatomic particles and other properties

<a id="topic-27"></a>

#### 6.1.25. [Topic 27](#topic-27)

The five best texts that fit into topic number 27 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 27].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on Neil Armstrong, one of the Apollo 11 spacecraft's astronauts.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 27].nlargest(5, "resid")["url"].tolist()

**Observations**
1. These articles are focused on mementos of space travel - photo of Earth given to prisoner by astronaut, or song (affiliated with a supposed war criminal) played by an astronaut as Apollo 14 approached the Moon - or remembering astronauts who died in the Columbia shuttle crash in 2003.

**Best** - Neil Armstrong

**Worst** - Astronauts who died or were in some way affiliated with real or supposed criminals.

<a id="topic-30"></a>

#### 6.1.26. [Topic 30](#topic-30)

The five best texts that fit into topic number 30 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 30].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on efforts to search for and some opinions about the form of extraterrestrial life.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 30].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about company that believes aliens from space created human life on Earth being ordered to appear in court regarding human cloning in 2003](https://www.theguardian.com/science/2003/jan/13/genetics.internationalnews)
   - [about the UFO Congress Convention and Film](https://www.theguardian.com/science/2001/mar/04/spaceexploration.comment)
   - [about studying the path of pebbles on a beach](https://www.theguardian.com/science/2000/aug/10/uknews1)
   - [attributing Cold War to UFO obsession](https://www.theguardian.com/science/2002/may/05/spaceexploration.research)
   - [about apocalypse in 2058, prompting (among other things) colonizing other planets](https://www.theguardian.com/science/2000/may/18/technology2)

The best articles focus on a scientific view about extraterrestrial life. The worst are focused on public imagination related to life away from Earth.

**Best** - The search for extraterrestrial life

**Worst** - Humans' casual fascination with the existence of life forms on other planets

<a id="topic-31"></a>

#### 6.1.27. [Topic 31](#topic-31)

The five best texts that fit into topic number 31 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 31].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on the achievements of the Hubble telescope, primarily on its 25-year anniversary, or in relation to extending its working lifetime.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 31].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [imaging provided by Proba satellite](https://www.theguardian.com/science/2004/oct/28/science.research)
   - [about Astrogrid virtual observatory for astronomers](https://www.theguardian.com/science/2005/sep/23/spaceexploration.universe)
   - [about seismological pictures, of Earth's mantle](https://www.theguardian.com/science/2017/may/24/extra-layer-of-tectonic-plates-discovered-within-earths-mantle-scientists-say)
     - analogy is made between new seismology technology used here and turning Hubble to look into the Earth
   - [about 1894 bombing at Observatory whose telescope defined the Greenwich Meridian](https://www.theguardian.com/science/the-h-word/2016/aug/05/secret-agent-greenwich-observatory-bombing-of-1894)
   - [about Observatory becoming a heritage site](https://www.theguardian.com/culture/2019/jul/07/jodrell-bank-observatory-becomes-world-heritage-site)

The best articles focus on the Hubble Telescope and its achievements. The worst focus on imaging in general, as connected to observatories or Earth-facing satellites.

**Best** - Hubble Telescope

**Worst** - Observing Earth and the Stars

<a id="topic-32"></a>

#### 6.1.28. [Topic 32](#topic-32)

The five best texts that fit into topic number 32 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 32].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on the following
   - [use of satellites as space-based weapons, including how to counter them](https://www.theguardian.com/science/2007/jan/19/spaceexploration.china)
     - appears twice with different URLs
     - weapons to destroy satellites
   - [about launching satellite that was labeled as space junk](https://www.theguardian.com/science/2018/feb/11/one-mans-mission-to-conquer-space-peter-beck-humanity-star)
   - [about implications of space war, including space debris](https://www.theguardian.com/science/2018/apr/15/its-going-to-happen-is-world-ready-for-war-in-space)
   - [why space pollution could be next big problem facing humans](https://www.theguardian.com/science/2017/mar/26/weve-left-junk-everywhere-why-space-pollution-could-be-humanitys-next-big-problem)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 32].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [satellites to monitor movement of a dam](https://www.theguardian.com/science/2017/jun/04/monitoring-dam-movement-from-space-terrawatch)
   - [about using nanotubes to build space elevators to launch satellites instead of blasting them from the Earth's surface](https://www.theguardian.com/science/2004/mar/18/research.science)
     - space elevator replaces costly rocket launches
   - [use satellites to watch whales](https://www.theguardian.com/science/animal-magic/2014/feb/13/whale-watching-space)
   - [new imaging technology to help police in hostage situations by seeing through walls](https://www.theguardian.com/science/2006/nov/11/uknews)
   - [on use of space elevator to launch satellites without blasting them from Earth](https://www.theguardian.com/science/2006/sep/03/spaceexploration.theobserver)
     - space elevator replaces costly rocket launches

The best articles focus on implications of satellite debris in space. The worst focus on the use of satellites to monitor things on Earth or on how to launch them to space without a blastoff.

**Best** - Satellite debris from overcrowding in space

**Worst** - Cost-reducing methods to add Earth-imaging satellites

<a id="topic-34"></a>

#### 6.1.29. [Topic 34](#topic-34)

The five best texts that fit into topic number 34 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 34].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on the following
   - [Pluto contains plains of ice, despite large distance from Sun](https://www.theguardian.com/science/2015/jul/18/new-horizons-search-life-outside-solar-system-pluto)
     - article mentions that innermost planets are thought to have been relatively hospitable due to their proximity to the Sun
   - [Solar Orbiter telescopes will get closest view of Sun](https://www.theguardian.com/science/2018/sep/19/staring-at-the-sun-solar-orbiter-telescopes-will-get-closest-view-yet)
   - [pros and cons of life on other planets, including implications of their distance from the Sun](https://www.theguardian.com/science/2006/jun/16/spaceexploration.g2)
   - [results of studies showing the effect of solar winds (unhindered from reaching the surface, due to absence of magnetic field) on Mars](https://www.theguardian.com/science/2015/nov/05/mars-atmosphere-liquid-water-nasa-northern-lights)
   - [report on solar storm hitting Earth](https://www.theguardian.com/science/2012/mar/08/solar-storms-continuing-to-hit-earth)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 34].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [about boat used to cover the world faster than any other](https://www.theguardian.com/science/2005/feb/10/thisweekssciencequestions4)
     - discusses resilience to wind
   - [about efficiently bringing a building down in presence of weather conditions caused by a hurricane](https://www.theguardian.com/science/1999/aug/05/technology1)
   - [about spotting a previously unseen purple patch of ionized gas](https://www.theguardian.com/science/shortcuts/2018/mar/19/steve-mystery-purple-aura-rivals-northern-lights-alberta-canada-nasa)
     - this was not identified to be the northern lights, which are caused by collisions between upper-atmosphere gases on Earth and electrically charged particles emitted from the Sun.
   - [about Voyager 1's 25-year launch anniversary when heading for outer planets (farthest from Sun)](https://www.theguardian.com/science/2002/aug/22/technology)
   - [challenges of a mission to fly a glider into space, due to extreme atmospheric conditions](https://www.theguardian.com/science/2005/jul/28/thisweekssciencequestions.aeronautics)

The best articles focus on the ramifications of energy (specifically solar storms) given off by the Sun. The worst focus on stormy conditions on Earth (influenced by the Sun's emitted energy) or on events indirectly connected to energy (not just solar storms) emitted by the Sun.

**Best** - The Sun's emitted energy (Solar Storms) across the solar system

**Worst** - Sun's role in stormy conditions on Earth

<a id="topic-22"></a>

#### 6.1.30. [Topic 22](#topic-22)

The five best texts that fit into topic number 22 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 22].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on the following
   - [2007 year-end review on findings in understanding science of life](https://www.theguardian.com/science/2008/jan/01/science.review.2007)
   - [2009 decade review on Biology's major findings](https://www.theguardian.com/science/2009/dec/29/science-decade-genetics-language-of-life)
   - [reporting on researcher who grew eyes and brain cells](https://www.theguardian.com/science/neurophilosophy/2014/aug/26/the-man-who-grew-eyes)
   - [Biological findings in top 10 scientific breakthroughs of 2008](https://www.theguardian.com/science/2008/dec/18/top-10-scientific-breakthroughs-2008)
   - [scientists working to eliminate ageing cells](https://www.theguardian.com/science/2018/oct/06/race-to-kill-killer-zombie-cells-senescent-damaged-ageing-eliminate-research-mice-aubrey-de-grey)

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 22].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [healing properties of honey](https://www.theguardian.com/science/2000/jul/02/drugs.life)
   - [improving noseband comfort for horses in equestrian sports](https://www.theguardian.com/science/2016/may/03/new-rules-on-horse-nosebands-needed-to-prevent-distress-say-researchers)
   - [on posthumous birth lawsuit](https://www.theguardian.com/science/2003/sep/19/genetics.uknews)
   - [written in 2000, on potential of molecular computers](https://www.theguardian.com/science/2000/feb/10/technology1)
   - [about wine quality](https://www.theguardian.com/science/2013/oct/25/science-magic-wine-making)
     - acidity can potentially cause lifeless wine

The best articles focus on scientific research findings in biology. The worst are one-off stories with only a loose connection to life or health.

**Best** - Biological Research

**Worst** - Random articles with loose connections to life/health
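
The same `nsmallest`/`nlargest` pattern is repeated for every topic in this section. A minimal sketch of how it could be wrapped in one helper is shown below. The function name `best_and_worst_urls` and the toy `df_demo` values are assumptions made for illustration and are not part of the notebook's `src` helpers.

```python
import pandas as pd


def best_and_worst_urls(df, topic_num, n=5):
    """Return (best, worst) URL lists for one topic, ranked by residual."""
    df_topic = df[df["topic_num"] == topic_num]
    # Lowest residual = best fit to the topic
    best = df_topic.nsmallest(n, "resid")["url"].tolist()
    # Highest residual = worst fit to the topic
    worst = df_topic.nlargest(n, "resid")["url"].tolist()
    return best, worst


# Toy stand-in for df_topics (hypothetical values, for illustration only)
df_demo = pd.DataFrame(
    {
        "topic_num": [22, 22, 22, 17],
        "resid": [0.9, 0.5, 0.7, 0.6],
        "url": ["u1", "u2", "u3", "u4"],
    }
)
best, worst = best_and_worst_urls(df_demo, topic_num=22, n=2)
print(best, worst)
```

With the real data, `best_and_worst_urls(df_topics, 22)` would be expected to return the same two lists as the two cells above.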

<a id="topic-17"></a>

#### 6.1.31. [Topic 17](#topic-17)

The five best texts that fit into topic number 17 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 17].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on neuroscience research.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 17].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [report on study that finds crossing the fingers can confuse the brain's processing of hot, cold and pain](https://www.theguardian.com/science/2015/mar/26/crossing-your-fingers-might-reduce-pain-says-study)
   - [report on neuroscience study showing that feeling invisible makes people less anxious in front of a stern-looking crowd](https://www.theguardian.com/science/2015/apr/23/transparent-findings-invisible-people-less-anxious-say-scientists)
   - [mind games - theories about how the brain interprets illusions](https://www.theguardian.com/science/head-quarters/2016/oct/03/ambiguous-figure-illusions-do-they-offer-a-window-on-the-mind)
   - [about AI that mirrors the brain's method of learning](https://www.theguardian.com/global/2017/mar/14/googles-deepmind-makes-ai-program-that-can-learn-like-a-human)
   - [study showing putting in golf may be subject to a strong visual illusion](https://www.theguardian.com/science/2001/aug/30/medicalscience.education)

The best and worst articles focus on research into understanding or partially replicating how the brain functions.

**Best** - Neuroscience

**Worst** - Neuroscience

<a id="topic-24"></a>

#### 6.1.32. [Topic 24](#topic-24)

The five best texts that fit into topic number 24 are shown below

In [None]:
df_topics[df_topics["topic_num"] == 24].nsmallest(5, "resid")["url"].tolist()

**Observations**
1. The topics are focused on detection of gravitational waves.

The five worst texts that fit into this topic are shown below

In [None]:
df_topics[df_topics["topic_num"] == 24].nlargest(5, "resid")["url"].tolist()

**Observations**
1. The topics of these articles are listed below
   - [high level discussion of the role of EM (micro)-waves in invisibility cloaks](https://www.theguardian.com/science/2011/feb/01/scientists-invent-invisibility-cloak)
   - [about earthquake detection, including maps of seismic wave detections in Los Angeles, USA](https://www.theguardian.com/science/2004/aug/09/geology)
   - [use of electricity in real-world uses of EM (radio and micro)-waves](https://www.theguardian.com/science/2017/may/22/michael-faraday-lost-better-call-saul-genius)
   - [about shape of universe being that of an American football](https://www.theguardian.com/science/2003/oct/09/spaceexploration.research)
   - [about the largest tsunami waves](https://www.theguardian.com/science/2003/feb/27/technology)

The best articles focus on the detection of gravitational waves. The worst focus on a few other types of [waves](https://byjus.com/physics/types-of-waves/), including ocean waves.

**Best** - Gravitational waves

**Worst** - Waves - of water and EM

<a id="assign-names-to-35-topics"></a>

### 6.2. [Assign Names to 35 Topics](#assign-names-to-35-topics)

A mapping between topic number and name will be developed, based on the top 10 terms and a manual reading of each topic's articles.

In [None]:
d_topics_35 = {
    3: {
        "best": "Reporting on Discovery of New Planets",
        "worst": "Reporting on Discovery of New Planets and Objects in Space",
    },
    7: {"best": "SpaceX Rocket Testing Reports", "worst": "Rocket Technology"},
    8: {
        "best": "Studying Earth-Threatening Asteroids",
        "worst": "Studying Earth-Threatening Asteroids",
    },
    11: {"best": "Focus of the UK Space Program", "worst": "Random Space News"},
    14: {
        "best": "Anticipating or Reporting on Eclipses",
        "worst": "Astronomical Phenomena during dark skies",
    },
    16: {
        "best": "Spacecraft Imaging of Dwarf distant planet Pluto",
        "worst": "Spacecraft Imaging of Distant Objects and Dwarf planets",
    },
    18: {
        "best": "MIR Space Station Funding",
        "worst": "Russian news about Space Stations",
    },
    19: {"best": "Events relating to Virgin Galactic", "worst": "Air Travel"},
    20: {
        "best": "Scientific Research about Dark Matter",
        "worst": "General Dark Matter in Space",
    },
    21: {
        "best": "Reports about Problem of Global Warming",
        "worst": "Uses of Heat on Earth",
    },
    23: {
        "best": "Report on Detection of Gravitational Waves",
        "worst": "All kinds of Waves on Earth",
    },
    26: {
        "best": "On the Search for and Detection of Neutrinos",
        "worst": "Particles and Properties of Elements in Nature",
    },
    27: {"best": "Neil Armstrong", "worst": "Astronauts as human beings"},
    30: {
        "best": "Search for E.T. life",
        "worst": "Fascination with Otherworldly Life",
    },
    31: {
        "best": "Achievements of Hubble Space Telescope",
        "worst": "Observing Earth and the Stars",
    },
    32: {
        "best": "Space Debris from Satellites",
        "worst": "Cost-efficient methods to deploy Earth facing satellites",
    },
    34: {
        "best": "Sun's influence on life across the Solar System",
        "worst": "Sun's role in stormy conditions on Earth",
    },
    22: {"best": "Biological Achievements", "worst": "General Life and Health"},
    17: {"best": "Neuroscience Research", "worst": "Neuroscience Research"},
    2: {
        "best": "Mars mission updates",
        "worst": "Mars Colonization and Tourism",
    },
    1: {"best": "Columbia shuttle crash", "worst": "Space Shuttles"},
    6: {"best": "About Exploring the Moon", "worst": "Moon Tourism"},
    15: {"best": "Cassini mission updates", "worst": "Cassini mission updates"},
    10: {
        "best": "Evidence of Water in the Solar System",
        "worst": "Fun with water on Earth",
    },
    4: {
        "best": "Rosetta mission (to comet 67P) updates",
        "worst": "Science involving Comets",
    },
    5: {
        "best": "News Reports about Philae lander on Rosetta",
        "worst": "About People affiliated with Rosetta",
    },
    9: {
        "best": "ISS updates",
        "worst": "Space Stations - Facts and History",
    },
    12: {"best": "Discovery of Higgs Boson", "worst": "Particles of Matter"},
    13: {
        "best": "Research of Stephen Hawking",
        "worst": "Experimental Physicists",
    },
    25: {
        "best": "Black Holes",
        "worst": "Black Holes in Popular Culture",
    },
    28: {"best": "Beagle 2 Mission Updates", "worst": "Beagle 2 Legacy"},
    24: {"best": "Dinosaurs", "worst": "Topic 24"},
    33: {"best": "Topic 33", "worst": "Topic 33"},
    29: {"best": "Topic 29", "worst": "Topic 29"},
    0: {"best": "Topic 0", "worst": "Topic 0"},
}
df_named_topics = (
    pd.DataFrame.from_dict(d_topics_35, orient="index")
    .sort_index()
    .rename_axis("topic_num")
)
display(df_named_topics)
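
Since the mapping is typed by hand, a quick check that every topic number is present, and a list of which topics still carry a placeholder name, can catch slips. The sketch below runs on a three-topic toy dictionary. `d_topics_demo` and `n_topics_demo` are hypothetical stand-ins for `d_topics_35` and the 35 topics used here.

```python
import pandas as pd

# Toy stand-in for d_topics_35 (hypothetical, three topics only)
d_topics_demo = {
    0: {"best": "Topic 0", "worst": "Topic 0"},
    1: {"best": "Columbia shuttle crash", "worst": "Space Shuttles"},
    2: {"best": "Mars mission updates", "worst": "Mars Colonization and Tourism"},
}
n_topics_demo = 3

df_names_demo = (
    pd.DataFrame.from_dict(d_topics_demo, orient="index")
    .sort_index()
    .rename_axis("topic_num")
)

# Topic numbers from 0 to n_topics - 1 that have no entry in the mapping
missing = sorted(set(range(n_topics_demo)) - set(df_names_demo.index))

# Placeholder names follow the pattern "Topic <number>"
is_placeholder = df_names_demo["best"].str.fullmatch(r"Topic \d+")
unnamed = df_names_demo.index[is_placeholder].tolist()

print(f"Missing topic numbers: {missing}")
print(f"Topics with placeholder names: {unnamed}")
```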

The residual plot is repeated now with topic names

In [None]:
# boxplot_sorted(
#     df_topics[["topic_num", "resid"]].merge(
#         df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"
#     ),
#     ["best"],
#     "resid",
#     "Topic Residual (sorted by median)",
#     "right",
#     14,
#     0,
#     sort_by_median=True,
#     vert=False,
#     fig_size=(6, 16),
# )
altair_boxplot_sorted(
    df_topics[["topic_num", "resid"]].merge(
        df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"
    ),
    "best",
    "resid",
    "median(resid)",
    "Topic Residual (sorted by median)",
    14,
    14,
    16,
    dx=340,
    offset=0,
    x_tick_label_angle=0,
    horiz_bar_chart=True,
    axis_range=[0.4, 1],
    fig_size=(300, 750),
)

**Observations**
1. There are 15 topics related to missions (including Cassini) and the Moon and, as with the smaller number of topics in ML modeling, these topics remain the best.
2. The topic on ISS updates is still not as good.
   - the Russian segment (MIR) is now separated from the ISS and the MIR topic has a lower residual (a better topic) than the ISS topic does
3. The topic on SpaceX now fares relatively better than with a smaller number of topics. This perhaps sheds light on why the Virgin Galactic topic is relatively worse (higher residual) than the SpaceX topic. While SpaceX articles are focused on reporting about rocket launch events, articles within the Virgin Galactic topic focus on space tourism. Perhaps, with a smaller number of topics, these two themes were previously being conflated and the space tourism segment was bringing the merged topic down.
4. Concrete topics on scientific discoveries - Bosons, Neutrinos, Dark Matter and Gravitational Waves (middle-of-the-chart) - continue to fare relatively better than general scientific research findings, facts, etc. The box and whisker components for the latter of these are narrower and trend higher in residual (worse).
   - Interestingly, Black holes now score relatively better than the discovery of the Higgs Boson
     - the Higgs Boson is less Space-y since it is searched for at the LHC (on Earth), while the best Black hole articles are focused on their detection in space
     - this is the one non-mission-related topic that scores well
     - the Stephen Hawking research topic (black holes are predicted to release Hawking radiation) is among the worst and is distinct from this black hole segment, likely since (as seen earlier) it was more focused on his achievements, which cover several areas of research
5. Two of the mission-related topics (Columbia crash and Mars mission updates) score relatively worse now than with a smaller number of topics.
   - As seen from reading the articles, the Columbia crash topic also includes articles about the first post-crash launch of the Discovery shuttle. In the best articles within this topic that are focused on Discovery, these two shuttles are linked to each other.
   - Some of the best articles in the Mars mission topic aren't related to mission updates. Instead, they cover year-end reviews and a launch anniversary - not fresh mission updates and/or news reports about the latest data coming in that sheds light on the [red planet](https://en.wikipedia.org/wiki/Mars)! Also, several missions are included under this one topic. As seen from the Beagle 2 topic, perhaps pulling out a few of the big missions (landers, rovers) and keeping them as standalone topics would allow them to score better.

A theme that seems to have emerged, with both a smaller and larger number of topics in ML modeling, is that the better topics are the ones that are very focused. From reading the articles' text, these tend to be the ones that report on space-related mission findings - the Moon, Pluto, SpaceX, Rosetta, Philae, Beagle 2, MIR missions (funding), etc. Other focused space-like topics - Black Hole discovery (related to Earth-based telescopes) and Asteroids heading for Earth - are the non-mission-related topics which also score well. The narrower scope from allowing for a larger number of topics in the ML model has helped in this regard.
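
The ranking read off the boxplot can also be tabulated directly. The sketch below computes the median residual and article count per named topic and sorts by median. The two toy DataFrames are hypothetical stand-ins for `df_topics` and `df_named_topics`.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for df_topics and df_named_topics
df_topics_demo = pd.DataFrame(
    {"topic_num": [1, 1, 1, 6, 6, 6], "resid": [0.8, 0.9, 0.7, 0.5, 0.6, 0.4]}
)
df_names_demo = pd.DataFrame(
    {"best": ["Columbia shuttle crash", "About Exploring the Moon"]},
    index=pd.Index([1, 6], name="topic_num"),
)

# Attach topic names, then summarise residuals per topic (lower median = better)
median_resid = (
    df_topics_demo.merge(
        df_names_demo.reset_index()[["topic_num", "best"]], on="topic_num"
    )
    .groupby("best")["resid"]
    .agg(["median", "count"])
    .sort_values("median")
)
print(median_resid)
```

Keeping the `count` column next to the median helps to spot topics whose ranking rests on only a handful of articles.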

<a id="exploring-topics-combined-with-source-data-and-35-topics"></a>

## 7. [Exploring topics combined with source data and 35 topics](#exploring-topics-combined-with-source-data-and-35-topics)

<a id="years-featured"></a>

### 7.1. [Years Featured](#years-featured)

First, we will show a bar chart of the number of years in which each topic appears, using 35 topics

In [None]:
altair_plot_bar_chart_value_counts(
    df_topics.groupby(["topic_num"])["year"]
    .nunique()
    .sort_values()
    .reset_index()
    .merge(df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"),
    "Number of years in which a topic appears",
    "year:Q",
    "best:N",
    labelFontSize=12,
    titleFontSize=12,
    plot_titleFontSize=16,
    dx=300,
    offset=0,
    horiz_bar_chart=True,
    fig_size=(450, 650),
)

**Observations**
1. There are a few topics that do not occur in the majority of years. We'll discuss this later.

<a id="most-popular-topic-by-year"></a>

### 7.2. [Most Popular Topic by Year](#most-popular-topic-by-year)

Next, we'll get the most popular topic by year

In [None]:
dfg = (
    df_topics.groupby(["year", "topic_num"])["url"]
    .count()
    .reset_index()
    .rename(columns={"url": "count"})
)
dfg = dfg.loc[dfg.groupby("year")["count"].idxmax()]

In [None]:
altair_plot_horiz_bar_chart(
    dfg.merge(
        df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"
    ).rename(columns={"best": "Topic"}),
    ptitle="Most Popular Topic by Year",
    xvar="count",
    yvar="year",
    xtitle="Occurrences",
    labelFontSize=14,
    titleFontSize=16,
    plot_titleFontSize=16,
    text_var="Topic",
    tooltip=["Topic", "count"],
    dx=45,
    offset=0,
    fig_size=(600, 450),
)

**Observations**
1. The following are selected years, with the reasons why the most popular topic chosen for each year is a sensible prediction by our ML model
   - 1986, 1988
     - Columbia space shuttle crash
       - although the [Challenger shuttle crash occurred in 1986](https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster), it was the Discovery shuttle that was the first post-crash launch in 1988 ([1](https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster#Continuation_of_the_Shuttle_program), [2](https://history.nasa.gov/SP-4219/Chapter15.html))
       - due to Discovery's connection to the Columbia crash in 2003 and the much larger number of articles on the later crash (which prompted its naming), the articles' connection to Discovery is how this topic appears during these two years
   - 1999
     - MIR
       - [British businessman's fundraising attempt by promising to fly on MIR fell through](http://news.bbc.co.uk/2/hi/science/nature/353467.stm)
       - [last crew members leave the MIR space station](https://www.nytimes.com/1999/08/28/world/last-full-crew-leaves-mir-to-be-abandoned-after-13-years.html)
   - 2001
     - MIR, because this was the year when the [MIR space station was taken out of orbit](https://en.wikipedia.org/wiki/Deorbit_of_Mir)
   - 2003
     - Columbia shuttle crash, because the crash occurred during this year
   - 2004
     - search for the Beagle 2 lander and its mission updates
   - 2005
     - Columbia shuttle crash, because the first *Return-to-Flight* ([1](https://www.nasa.gov/returntoflight/main/index.html), [2](https://www.nature.com/collections/vlrdpjnyvk)) Discovery shuttle launch was [the first one since the Columbia crash](https://en.wikipedia.org/wiki/STS-114)
   - 2006
     - Columbia shuttle crash, because the second *Return-to-Flight* ([1](https://www.nasa.gov/returntoflight/main/index.html), [2](https://www.nature.com/collections/vlrdpjnyvk)) Discovery shuttle launch [occurred](https://en.wikipedia.org/wiki/STS-121)
   - 2008
     - ISS mission updates due to its [10th anniversary celebrations](https://www.nasa.gov/mission_pages/station/main/10th_anniversary.html) and the completion of [four missions to the ISS during the year](https://www.nasa.gov/home/hqnews/2008/dec/HQ_08-322_Top_Stories_2008.html)
   - 2009
     - Neil Armstrong, likely because this was the year of the 40th anniversary of the Apollo 11 Moon landing
   - 2011
     - Columbia Space Shuttle disaster
       - actually, the big news of this year was not about the Columbia disaster specifically (which was the topic of the best articles by residual)
       - instead, the focus was on the [retirement of the Space Shuttle Program](https://en.wikipedia.org/wiki/Space_Shuttle_retirement), which occurred during this year
         - this would have included articles with some focus on Columbia, but also on other shuttles (Discovery, Endeavour, etc.)
   - 2015
     - ISS mission updates, since a SpaceX mission to the ISS ended in an explosion ([1](https://www.theguardian.com/science/2015/jun/28/nasa-spacex-launch-international-space-station-wrong), [2](https://phys.org/news/2015-11-nasa-space-station-resupply-missions-relaunch.html))
   - 2016
     - ISS mission updates
       - [successful SpaceX rocket launches during this year](https://www.space.com/35127-top-spaceflight-stories-2016.html), which has implications for [NASA missions to the ISS](https://blogs.nasa.gov/spacex/)
   - 2019
     - Moon, because of a new focus on manned missions to the Moon ([1](https://www.newscientist.com/article/mg24432613-600-2019-was-the-year-we-got-serious-about-walking-on-the-moon-again/), US-based interest - [1](https://www.theatlantic.com/science/archive/2019/03/trump-nasa-moon-2024/585880/), [2](https://www.cnbc.com/2019/06/07/trump-wants-nasa-to-go-to-mars-not-the-moon-like-he-declared-weeks-ago.html), [3](https://www.nature.com/articles/d41586-019-02020-w))
2. This degree of granularity is not possible with a smaller number of topics during topic modeling.
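
One caveat when reading this chart: `idxmax` returns only the first row holding the maximum, so if two topics tie for the most articles in a year, only one of them is shown. The sketch below uses hypothetical toy counts (`dfg_demo`) to contrast the notebook's approach with an alternative that keeps ties.

```python
import pandas as pd

# Toy stand-in for the per-year topic counts (hypothetical values)
dfg_demo = pd.DataFrame(
    {
        "year": [2003, 2003, 2004, 2004],
        "topic_num": [1, 9, 28, 2],
        "count": [40, 12, 15, 15],
    }
)

# idxmax returns only the first row holding the maximum count in each year
top_single = dfg_demo.loc[dfg_demo.groupby("year")["count"].idxmax()]

# Alternative that keeps every topic tied for the yearly maximum
yearly_max = dfg_demo.groupby("year")["count"].transform("max")
top_with_ties = dfg_demo[dfg_demo["count"] == yearly_max]

print(top_single)
print(top_with_ties)
```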

In [None]:
df_all_by_year = dfg.merge(
    df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"
).rename(columns={"best": "Topic"})
display(df_all_by_year.head())

<a id="all-topics-by-year"></a>

### 7.3. [All Topics by Year](#all-topics-by-year)

Next, we will show a heatmap of the number of articles per topic by year, using 35 topics

In [None]:
topics_by_timeframe = (
    df_topics.groupby(["topic_num", "year"])
    .size()
    .reset_index(name="count")  # name the size column directly
    .sort_values(by=["topic_num", "count", "year"], ascending=False)
)
display(topics_by_timeframe.head())

Sanity checks

In [None]:
assert (
    df_topics.loc[(df_topics["topic_num"] == 1) & (df_topics["year"] == 2003)].shape[0]
    == topics_by_timeframe.loc[
        (topics_by_timeframe["topic_num"] == 1) & (topics_by_timeframe["year"] == 2003)
    ]["count"].iloc[0]
)

In [None]:
assert topics_by_timeframe["count"].sum() == df_guardian.shape[0]

Generate the heatmap

In [None]:
altair_datetime_heatmap(
    topics_by_timeframe.merge(
        df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"
    ),
    x="year:O",
    y="best:N",
    xtitle="Year",
    ytitle="Topics by Year",
    tooltip=[
        {
            "title": "Year",
            "field": "year",
            "type": "ordinal",
        },
        {
            "title": "Topic",
            "field": "best",
            "type": "nominal",
        },
        {
            "title": "Number of articles",
            "field": "count",
            "type": "quantitative",
        },
    ],
    cmap="yelloworangered",
    legend_title="",
    color_by_col="count:Q",
    yscale="log",
    axis_tick_font_size=12,
    axis_title_font_size=16,
    title_font_size=20,
    legend_fig_padding=10,  # default is 18
    y_axis_title_alignment="left",
    fwidth=650,
    fheight=900,
    file_path="",
    save_to_html=False,
    sort_y=[],
    sort_x=[],
    dx=20,
    offset=0,
    plot_titleFontSize=16,
)

**Observations**
1. If a topic does not occur in the majority of years, then that could be an indication that it is a poor choice for a standalone topic and should be combined with another topic. This appears to be the case for topic numbers 5 (Philae), 7 (SpaceX), 16 (Pluto), 18 (MIR) and 23 (Gravitational Waves), which appear in 15 years or fewer. With the exception of topic 5, this is less of a concern but still worth noting.
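
Rather than reading these topics off the heatmap, the same cut-off can be applied programmatically. The sketch below counts distinct years per topic and keeps those at or below a threshold. The toy data and the name `max_years` are assumptions for illustration; with the real `df_topics`, the threshold discussed above would be 15.

```python
import pandas as pd

# Toy stand-in for df_topics (hypothetical values)
df_topics_demo = pd.DataFrame(
    {
        "topic_num": [5, 5, 5, 1, 1, 1, 1],
        "year": [2014, 2015, 2015, 2003, 2004, 2005, 2006],
    }
)
max_years = 3  # the discussion above uses 15; 3 suits the toy data

# Number of distinct years in which each topic has at least one article
years_per_topic = df_topics_demo.groupby("topic_num")["year"].nunique()

# Topics appearing in max_years or fewer distinct years
infrequent = years_per_topic[years_per_topic <= max_years].sort_values()
print(infrequent)
```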

<a id="examining-infrequently-occurring-topics"></a>

### 7.4. [Examining Infrequently Occurring Topics](#examining-infrequently-occurring-topics)

We'll walk through the five topics identified above as having articles in the fewest years.

<a id="philae"></a>

#### 7.4.1. [Philae](#philae)

Focusing on topic 5, the best articles here focused on the Philae lander onboard the Rosetta spacecraft. The topic only appears in the mid 2010s - 2014, 2015 and 2016. Could it really occur only during a three-year period? Rosetta launched in 2004, but Philae landed on comet 67P in [2014](http://www.esa.int/Science_Exploration/Space_Science/Rosetta/Three_touchdowns_for_Rosetta_s_lander), so it makes sense that articles under the Philae topic began during that year. Rosetta carried the lander [in addition to the spacecraft's own payload of instruments](https://www.esa.int/Science_Exploration/Space_Science/Rosetta/Rosetta_Media_factsheet). So, prior news reports about Philae would have been either in relation to the Rosetta spacecraft, as opposed to a dedicated discussion about the lander itself, or about the science behind the design of the lander and its own [dedicated payload of 10 instruments](https://www.esa.int/Science_Exploration/Space_Science/Rosetta/Lander_Instruments).

Contact with the lander was [lost after July 2015](https://www.sciencemag.org/news/2015/07/philae-s-scientific-harvest-may-be-its-last), so it was in the news again that year. In July 2016, it was decided to [end communication with the lander after nearly two years of failed attempts to communicate with it](https://www.bbc.com/news/science-environment-36904368). Rosetta then captured an image of the lander in [September 2016](https://www.esa.int/Science_Exploration/Space_Science/Rosetta/Philae_found). Overall, the lander captured [three days](https://www.smithsonianmag.com/smart-news/after-two-years-searching-comet-lander-philae-finally-found-180960350/) of [science data](https://www.cosmos.esa.int/web/psa/rosetta).

So, the news articles prior to 2014 would have focused on the non-operational aspects of the lander. Articles from 2014-2016 would have been focused solely on ESA operations to find the lander. As we saw earlier, this is in line with the five best articles within this topic since they all focused on the operational aspects of the search for the lander. So, it does make sense that this topic only appeared during a three year period, ending in 2016.

Looking at topic number 4 (Rosetta) from the ML model's predictions with 15 topics, we do still see that the years 2014-2016 feature the most occurrences of this topic. But we don't have a direct approach to determining whether this elevated three-year period was solely attributable to Philae being in the news, to other comet-related news articles that appeared during those years, or to a combination of these two factors.

Okay, so it makes sense that this is picked up as a standalone topic when we search for 35 topics instead of 15 topics. But is it worthy of a standalone topic if its articles only occur in three years? Maybe we don't need this level of granularity. Maybe our interest in the topic of comet science (topic 4, when using 15 topics) can be limited to including the high-level Rosetta mission as one of the sub-topics within this topic. On the other hand, if we're interested in directly attributing that three-year period of elevated news publications to a specific sub-topic, then getting a topic that splits Philae (topic 5, when using 35 topics) out of the overall topic of comets makes sense. Maybe we could keep 35 topics but combine this small Philae-centric topic with the larger topic centered on the Rosetta mission (best articles) and comets in general (worst articles), which has articles appearing across nearly 20 distinct years. This is one of the tradeoffs to be considered when allowing for a larger number of topics.

<a id="spacex"></a>

#### 7.4.2. [SpaceX](#spacex)

For Topic 7, there isn't another topic with a theme of rocket testing/launches. The focus of the news articles involving the Virgin Galactic topic is on space tourism/travel with little mention of launches/testing. Also, note that SpaceX was founded in 2002 so we theoretically have at most 18 (2002-2019) years of data from which to expect news publications on this topic. The worst articles here are focused on rocket technology, which is still distinct from space tourism.

SpaceX does have interest in space tourism (eg. involving the ISS, see [1](https://www.latimes.com/business/story/2020-02-18/spacex-tourists-crew-dragon-capsule), [2](https://www.cnbc.com/2020/09/26/space-tourism-how-spacex-virgin-galactic-blue-origin-axiom-compete.html)). Virgin Galactic does [also have ties to the ISS](https://www.virgingalactic.com/visit-the-international-space-station/) and only recently announced [plans for Mars tourism](https://www.forbes.com/sites/jonathanocallaghan/2019/10/09/virgin-orbit-is-planning-an-ambitious-mission-to-mars-in-2022/?sh=7c63db7966f1) (though the latest Guardian news articles retrieved for this project just pre-date this announcement).

The ISS topic is focused on crew arrivals/departures and history. Space tourism is not limited to the ISS - see the Moon ([1](https://en.wikipedia.org/wiki/Tourism_on_the_Moon#Proposed_missions), [2](https://en.wikipedia.org/wiki/DearMoon_project)) and [Mars](https://www.cnn.com/2020/05/01/business/space-industry-critical-business-blue-origin-spacex-scn/index.html) - but opening up the ISS to space tourists is [a focus for NASA](https://www.bbc.com/news/world-us-canada-48560874) because it is easier to reach than the Moon or Mars.

So assigning SpaceX its own topic, separate from Virgin Galactic (due to the much smaller focus on rocket testing in the best Virgin Galactic news articles), is a reasonable choice. But, it would nonetheless be interesting to compare each of these two companies to the ISS, Moon and Mars topics in some way in order to more qualitatively determine if their news articles should be kept separate or combined under a common topic.

<a id="discovery-of-gravitational-waves"></a>

#### 7.4.3. [Discovery of Gravitational Waves](#discovery-of-gravitational-waves)

Gravitational Waves (Topic 23) were first detected in 2015. LIGO only started collecting data in 2002. [VIRGO](https://en.wikipedia.org/wiki/Virgo_interferometer) started collecting in 2007 and other detectors ([GEO600](https://en.wikipedia.org/wiki/GEO600), [TAMA](https://en.wikipedia.org/wiki/TAMA_300)) were also collecting data starting in 1999 or the early 2000s, so there are at most approx. 20 years in which articles under this topic could appear. The worst articles are focused on other kinds of waves and are limited to their uses or occurrences (eg. ocean waves) on Earth, which don't point to news articles in another topic. So assigning Gravitational Waves its own topic is a reasonable choice.

<a id="mir-space-station-funding"></a>

#### 7.4.4. [MIR Space Station Funding](#mir-space-station-funding)

Topic 18 is about MIR. Its topic has been predicted here for approx. 10 years. The space station was shut down in 2001. So, while there are as many as approx. 19 years post-shutdown, these articles were only published for approx. 50% of the time. This makes sense since this topic wouldn't be generating as many headlines as an active space station. The ISS topic is distinct since it focuses on a different space station. The worst articles in the MIR topic also include news articles about other Russian Space Stations. In combination, it seems a logical choice to keep this as a standalone topic.

<a id="reporting-on-pluto"></a>

#### 7.4.5. [Reporting on Pluto](#reporting-on-pluto)

Topic 16 is about imaging of Pluto and dwarf/distant objects and appears during approx. 14 years. A considerable focus of the articles was on use of the [New Horizons probe](https://en.wikipedia.org/wiki/New_Horizons), which launched in January, 2006 and [recorded its first images of Pluto in December, 2006](https://en.wikipedia.org/wiki/New_Horizons#First_Pluto_sighting), so this aligns with the number of years in which these articles appear. Discoveries of similarly-sized (non-planetary) space objects in the [Kuiper Belt](https://en.wikipedia.org/wiki/Kuiper_belt) were made in 2002-2005 and prompted Pluto's downgrade in [2006](https://www.bbc.com/news/science-environment-33462184). The best articles in this topic focus on Pluto and New Horizons, rather than on other objects/planets. This could explain the presence of these articles over a period of approx. 14 years, i.e. coinciding with the demotion of Pluto from a planet to a dwarf planet combined with the launch of, and images recorded by, New Horizons.

<a id="hypothesis-testing-for-distinct-topics"></a>

#### 7.4.6. [Hypothesis Testing for Distinct Topics](#hypothesis-testing-for-distinct-topics)

We'll run through hypothesis tests to compare pairs of similar topics to each other based on their residuals. The requirements for the Mann-Whitney U hypothesis test (which will be used here, due to the absence of normality in the data) [are](https://www.statisticshowto.com/mann-whitney-u-test/)
- need two independent, categorical groups
  - as we're comparing pairs of topics, we meet this requirement
- no relationship between the two topics or within each topic
  - between the two topics
    - since a single article is only assigned to a single topic, the topics don't have a relation to each other
  - within each topic
    - each article, within each topic, is a separate observation from all other articles within the same topic
- the data need not be normally distributed, but the two distributions should follow the same shape
  - this is mostly satisfied for most pairs of topics, as is shown in the grid of histograms below

In [None]:
dfr = df_topics[["topic_num", "resid"]].merge(
    df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"
)
row_size = 3
combos = [
    ["MIR", "ISS"],
    ["Beagle 2", "Mars"],
    ["Cassini", "Evidence of Water"],
    ["Hawking", "Black Holes"],
    ["Philae", "Rosetta"],
    ["SpaceX", "Virgin Galactic"],
    ["SpaceX", "ISS"],
    ["SpaceX", "Mars"],
    ["SpaceX", "Moon"],
    ["Virgin Galactic", "ISS"],
    ["Virgin Galactic", "Mars"],
    ["Virgin Galactic", "Moon"],
    ["Pluto", "Discovery of New Planets"],
]
altair_plot_histogram_grid_by_column(
    dfr,
    "resid",
    "count()",
    "best",
    combos,
    space_between_plots=5,
    row_size=row_size,
    labelFontSize=14,
    titleFontSize=14,
    fig_size=(100, 200),
)

In [None]:
d_hypothesis_scores = {}
for topic_pair in combos:
    topic_1_str, topic_2_str = topic_pair
    topic_1 = dfr[dfr["best"].str.contains(topic_1_str)]["resid"]
    topic_2 = dfr[dfr["best"].str.contains(topic_2_str)]["resid"]
    topic_pair = f"{topic_1_str}-{topic_2_str}"
    print(f"{topic_pair} - ", end="")
    test_result, test_type = ttest_or_manwhitu(topic_1, topic_2)
    print(f"{test_result}\n")
    d_hypothesis_scores[topic_pair] = {
        "stat": np.around(test_result.statistic, 1),
        "p": test_result.pvalue,
        "test_type": test_type,
        "topic_1_size": len(topic_1),
        "topic_2_size": len(topic_2),
    }
df_hypothesis_scores = pd.DataFrame.from_dict(
    d_hypothesis_scores, orient="index"
).reset_index()
df_hypothesis_scores[["Topic 1", "Topic 2"]] = df_hypothesis_scores["index"].str.split(
    "-", expand=True
)
df_hypothesis_scores = df_hypothesis_scores.drop(columns=["index"])

The results of the hypothesis tests are summarized below

In [None]:
display(
    df_hypothesis_scores.style.format({"stat": "{:.1f}"})
    .set_caption("Hypothesis Tests on Pairs of Topic Residuals")
    .background_gradient(cmap="YlOrRd", subset=["p"])
)

**Notes**
1. It isn't intuitively possible to pick out a topic similar to the discovery of gravitational waves, so a test involving this topic was not performed.
2. The Mann-Whitney U test [works with groups (topics) of different sizes](https://influentialpoints.com/Training/Wilcoxon-Mann-Whitney_U_test_use_and_misuse.htm) - as is the case here (see the `topic_1_size` and `topic_2_size` columns).

**Observations**
1. For each pair except for Virgin Galactic and ISS, we can state the following
   - the residuals from approximating the article text using the first topic come from a different distribution than the residuals from the second topic
   - i.e. the first topic's residuals are different from the second topic's residuals

   which suggests that the predicted topic names under the **Topic 1** column above are distinct from those under the **Topic 2** column.
2. The Virgin Galactic topic's residuals are not significantly different from those of the ISS topic. We could consider re-training the NMF model to try to combine these articles under the same topic.

<a id="terms-by-topic"></a>

### 7.5. [Terms by Topic](#terms-by-topic)

The top 10 terms by topic are shown below

In [None]:
topic_word = pd.DataFrame(
    pipe.named_steps["nmf"].components_.round(3),
    index=[k for k in range(n_topics_wanted)],
    columns=pipe.named_steps["vectorizer"].get_feature_names(),
)
df_topic_word_factors = (
    topic_word.groupby(topic_word.index)
    .apply(lambda x: x.iloc[0].nlargest(n_top_words))
    .reset_index()
    .rename(columns={"level_0": "topic_num", "level_1": "term", 0: "weight"})
)
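
To make the top-terms extraction above easier to follow, here is a minimal NumPy-only sketch of the same idea: for each row (topic) of a topic-term weight matrix, keep the `n_top` columns (terms) with the largest weights. The function `top_terms_per_topic`, the toy matrix and the toy vocabulary are illustrative assumptions and are not part of the trained pipeline.

```python
import numpy as np


def top_terms_per_topic(components, feature_names, n_top=2):
    """Return, for each topic (row), the n_top (term, weight) pairs with the largest weight."""
    components = np.asarray(components, dtype=float)
    feature_names = np.asarray(feature_names)
    # argsort is ascending, so reverse each row and keep the first n_top columns
    top_idx = np.argsort(components, axis=1)[:, ::-1][:, :n_top]
    return [
        [(str(feature_names[j]), float(components[k, j])) for j in row]
        for k, row in enumerate(top_idx)
    ]


# Toy topic-term matrix: 2 topics x 4 terms (made-up weights)
toy_components = np.array([[0.1, 0.9, 0.0, 0.4], [0.7, 0.0, 0.2, 0.3]])
toy_terms = ["moon", "mars", "rocket", "orbit"]

for topic_num, terms in enumerate(top_terms_per_topic(toy_components, toy_terms)):
    print(topic_num, terms)
```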

The top 10 scoring terms by weight (from the NMF matrix factorization), for each topic, are visualized below

In [None]:
altair_plot_grid_by_column(
    df_topic_word_factors.merge(
        df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num"
    ),
    xvar="weight",
    yvar="term",
    col2grid="best",
    space_between_plots=5,
    row_size=6,
    fig_size=(150, 200),
)

<a id="extracting-topics-from-unseen-data"></a>

## 8. [Extracting Topics from Unseen Data](#extracting-topics-from-unseen-data)

Save (pickle) the ML pipeline that was trained here so that it can be used to predict unseen news article topics without re-training

In [None]:
topic_names_filepath = "app/data/nlp_topic_names.csv"
nlp_pipeline_filepath = "app/data/nlp_pipe.joblib"

In [None]:
dump(pipe, nlp_pipeline_filepath, compress=True)

Export the topic names `DataFrame` to a file

In [None]:
df_named_topics.to_csv(topic_names_filepath, index=True)

Delete the pipeline object and topic names `DataFrame` from memory

In [None]:
del pipe
del df_named_topics

Reload the saved pipeline and topic names from their respective files

In [None]:
%%time
pipe = load(nlp_pipeline_filepath)
df_named_topics = pd.read_csv(topic_names_filepath, index_col="topic_num")
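
One way to gain confidence in the save and reload steps is to check that a reloaded pipeline transforms text exactly as the in-memory one does. The sketch below does this for a small throwaway pipeline written to a temporary directory. The documents, step settings and file name are assumptions chosen for the example, so the real `pipe` and `app/data/nlp_pipe.joblib` are left untouched.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up documents, only used to fit a throwaway pipeline
toy_docs = [
    "rocket launch to the moon",
    "mars rover finds water",
    "moon landing rocket test",
    "water ice found on mars",
]
toy_pipe = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("nmf", NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)),
    ]
)
toy_pipe.fit(toy_docs)
expected = toy_pipe.transform(toy_docs)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_pipe.joblib")
    dump(toy_pipe, toy_path, compress=True)
    reloaded_pipe = load(toy_path)

# The reloaded pipeline should give the same document-topic weights
round_trip_ok = np.allclose(expected, reloaded_pipe.transform(toy_docs))
print(round_trip_ok)
```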

Load unseen data and predict topics. This could be a single news article or more than one

In [None]:
blob_client = blob_service_client.get_blob_client(
    container=az_storage_container_name, blob="blobedesz42"
)
blobstring = blob_client.download_blob().content_as_text()

In [None]:
# Load the unseen articles for prediction
if cloud_data:
    df_new = pd.read_csv(StringIO(blobstring))
else:
    df_new = pd.read_csv("data/guardian_3.csv")
# display(df_new.head())
df_new[["year", "month", "day"]] = df_new["url"].str.extract(
    r"/(\d{4})/([a-z]{3})/(\d{2})/"
)
d = {
    "jan": 1,
    "feb": 2,
    "mar": 3,
    "apr": 4,
    "may": 5,
    "jun": 6,
    "jul": 7,
    "aug": 8,
    "sep": 9,
    "oct": 10,
    "nov": 11,
    "dec": 12,
}
df_new["month"] = df_new["month"].map(d).astype(int)
df_new["date"] = pd.to_datetime(df_new[["year", "month", "day"]])
df_new["weekday"] = df_new["date"].dt.day_name()
df_new["week_of_month"] = df_new["date"].apply(lambda d: (d.day - 1) // 7 + 1)

# Process the text
df_new["processed_text"] = df_new["text"].apply(process_text)

new_texts = df_new["processed_text"]

# Transform the new data with the fitted models
doc_topic = pipe.transform(new_texts)

topic_words = pd.DataFrame(
    pipe.named_steps["nmf"].components_,
    index=[str(k) for k in range(n_topics_wanted)],
    columns=pipe.named_steps["vectorizer"].get_feature_names(),
)
topic_df = (
    pd.DataFrame(
        topic_words.apply(
            lambda x: get_top_words_per_topic(x, n_top_words), axis=1
        ).tolist(),
        index=topic_words.index,
    )
    .reset_index()
    .rename(columns={"index": "topic"})
    .assign(topic_num=range(n_topics_wanted))
)
# for k, v in topic_df.iterrows():
#     print(k, ",".join(v[1:-1]))

df_temp = pd.DataFrame(doc_topic).idxmax(axis=1).rename("topic_num").to_frame()

merged_topic = df_temp.merge(topic_df, on="topic_num", how="left").assign(
    url=df_new["url"].tolist()
)
df_topics_new = df_new.merge(merged_topic, on="url", how="left").astype(
    {"topic_num": int}
)
df_topics_new = df_topics_new[
    ["url", "year", "week_of_month", "weekday", "text"]
    + ["topic_num", "topic"]
    + list(range(n_top_words))
].merge(df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num")[
    ["url", "year", "week_of_month", "weekday", "text", "topic_num", "topic", "best"]
]
display(df_topics_new)
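
The date handling in the cell above can be tried in isolation. The following sketch parses the publication date out of two of the article URLs evaluated later in this section and derives the weekday and week of the month in the same way. Building the month lookup from `calendar.month_abbr` (which assumes the default English locale) and the names `toy` and `parts` are choices made for this example only.

```python
import calendar

import pandas as pd

# Lookup from three-letter month abbreviation to month number ("jan" -> 1, ...)
month_map = {calendar.month_abbr[m].lower(): m for m in range(1, 13)}

toy = pd.DataFrame(
    {
        "url": [
            "https://www.theguardian.com/science/2019/nov/22/not-cool-telescope-faces-interference-from-space-bound-satellites",
            "https://www.theguardian.com/science/2020/feb/07/solar-orbiter-spacecraft-will-capture-the-suns-north-and-south-poles",
        ]
    }
)

# Pull /YYYY/mon/DD/ out of each URL
parts = toy["url"].str.extract(r"/(\d{4})/([a-z]{3})/(\d{2})/")
parts.columns = ["year", "month", "day"]
parts["month"] = parts["month"].map(month_map)
parts = parts.astype(int)

toy["date"] = pd.to_datetime(parts)
toy["weekday"] = toy["date"].dt.day_name()
# Days 1-7 fall in week 1, days 8-14 in week 2, and so on
toy["week_of_month"] = (toy["date"].dt.day - 1) // 7 + 1
print(toy[["date", "weekday", "week_of_month"]])
```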

As a reminder, the document residuals from training the NMF model (for the topics in these unseen documents) are shown below

In [None]:
altair_boxplot_sorted(
    df_topics.loc[df_topics["topic_num"].isin(df_topics_new["topic_num"].unique())][
        ["topic_num", "resid"]
    ]
    .merge(df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num")
    .rename(columns={"best": "Topic"}),
    "Topic",
    "resid",
    "median(resid)",
    "Topic Residual (sorted by median)",
    16,
    16,
    16,
    dx=400,
    offset=0,
    x_tick_label_angle=0,
    horiz_bar_chart=True,
    axis_range=[0.4, 1],
    fig_size=(400, 500),
)
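
For readers who want a concrete picture of what a document residual measures, the sketch below uses one common definition: the Euclidean norm of the difference between a document's row in the document-term matrix `A` and its NMF reconstruction `W @ H`. Whether the `resid` column in `df_topics` was computed in exactly this way is an assumption, and the matrices are toy values.

```python
import numpy as np


def document_residuals(A, W, H):
    """Row-wise Euclidean norm of A - W @ H (one residual per document)."""
    A, W, H = (np.asarray(m, dtype=float) for m in (A, W, H))
    return np.linalg.norm(A - W @ H, axis=1)


# Toy example: 2 documents, 2 terms, 1 topic
A_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
W_toy = np.array([[1.0], [0.0]])
H_toy = np.array([[1.0, 0.0]])

# The first document is reconstructed exactly, the second is not reconstructed at all
resid_toy = document_residuals(A_toy, W_toy, H_toy)
print(resid_toy)
```

Under this definition, a smaller residual means the topic approximates the document better.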

Print the article web URL and the ML pipeline's predicted topic

In [None]:
for k, row in df_topics_new.iterrows():
    print(f"Row = {k}, URL={row['url']}, Topic = {row['best']}\n")
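
The predicted topic printed above is the column with the largest weight in each row of the document-topic matrix, which is what the `idxmax(axis=1)` call in the prediction cell selects. A minimal sketch on a made-up matrix follows (`doc_topic_toy` is illustrative, not pipeline output).

```python
import numpy as np
import pandas as pd

# Made-up document-topic weights: 3 documents x 3 topics
doc_topic_toy = np.array(
    [
        [0.02, 0.31, 0.00],
        [0.40, 0.05, 0.12],
        [0.00, 0.00, 0.27],
    ]
)

# For each document (row), pick the topic (column) with the largest weight
dominant_topic = pd.DataFrame(doc_topic_toy).idxmax(axis=1).rename("topic_num")
print(dominant_topic.tolist())
```

Note that `idxmax` returns the first column on ties, so a document whose weights are all zero would still be labelled as topic 0.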

After manually reading the articles, a qualitative assessment of the suitability of their topics is shown below

0. [About Hubble constant](https://www.theguardian.com/science/2019/nov/02/hubble-constant-mystery-that-keeps-getting-bigger-estimate-rate-expansion-universe-cosmology-cepheid)
   - not about dark matter
   - dark matter is mentioned as one of the three mysteries about understanding the universe
1. [pictures returned from Voyager 2's mission outside the heliosphere (created by plasma from the Sun)](https://www.theguardian.com/science/2019/nov/04/nasa-voyager-2-sends-back-first-signal-from-interstellar-space)
2. [Solar probe returns first image from edge of Sun's atmosphere](https://www.theguardian.com/science/2019/dec/04/nasas-parker-solar-probe-beams-back-first-insights-from-suns-edge)
3. [About the Voyager probe's accomplishments](https://www.theguardian.com/science/2020/jan/11/voyager-scientist-ed-stone-on-the-search-for-extraterrestrial-life-we-need-to-get-back-to-enceladus)
4. [Discovery of new form of Aurora Borealis](https://www.theguardian.com/science/2020/jan/29/amateur-stargazers-capture-new-form-of-northern-lights)
5. [On capture of most detailed pictures of Sun's surface](https://www.theguardian.com/science/2020/jan/29/solar-telescope-captures-most-detailed-pictures-yet-the-sun)
6. [On first ever recorded images of Sun's north and south poles](https://www.theguardian.com/science/2020/feb/07/solar-orbiter-spacecraft-will-capture-the-suns-north-and-south-poles)
7. [Mission update from deployment of Solar orbiter probe](https://www.theguardian.com/science/2020/feb/20/spacewatch-solar-orbiter-sends-first-data-back-to-earth)
8. [About a history of messages received from E.T. life](https://www.theguardian.com/science/2019/nov/06/cosmic-cats-nuclear-interstellar-messages-extraterrestrial-intelligence)
9. [Astronomers use grid of satellites to search for E.T. life](https://www.theguardian.com/science/2020/feb/15/astronomers-to-sweep-entire-sky-for-signs-of-extraterrestrial-life)
10. [Obituary for designer of satellites used in various fields of applied science](https://www.theguardian.com/science/2019/nov/06/daniel-lobb-obituary)
11. [About Chilean satellite's views being blocked by communications satellites](https://www.theguardian.com/science/2019/nov/22/not-cool-telescope-faces-interference-from-space-bound-satellites)
12. [Reporting on several satellite launches during the period of a week at the end of November 2019](https://www.theguardian.com/science/2019/nov/28/spacewatch-you-wait-ages-for-a-rocket-launch-then-)
13. [ESA launching space debris collection device](https://www.theguardian.com/science/2019/dec/09/european-space-agency-to-launch-clearspace-1-space-debris-collector-in-2025)
14. [ESA awards first space junk clean-up contract](https://www.theguardian.com/science/2019/dec/12/spacewatch-esa-awards-first-junk-clean-up-contract-clearspace)
15. [About views of the sky being blocked by technology firms' large number of plans involving launching new satellites](https://www.theguardian.com/science/2020/jan/09/companies-plans-for-satellite-constellations-put-night-sky-at-risk)
16. [About SpaceX satellite launches](https://www.theguardian.com/science/2020/jan/09/spacewatch-spacex-elon-musk-launches-60-more-satellites-into-starlink-constellation)
17. [About images captured about a star leaving the Milky Way after encountering a black hole](https://www.theguardian.com/science/2019/nov/13/superfast-star-found-leaving-milky-way-at-1700km-per-second)
18. [About hole created by material/energy expelled by an inter-stellar explosion originating in a black hole](https://www.theguardian.com/science/2020/feb/27/biggest-cosmic-explosion-ever-detected-makes-huge-dent-in-space)
19. [About Boeing proposing direct Moon flights in 2024](https://www.theguardian.com/science/2019/nov/14/spacewatch-boeing-proposes-direct-flights-moon-2024-nasa)
20. [About trial study that places humans into suspended animation](https://www.theguardian.com/science/2019/nov/20/humans-put-into-suspended-animation-for-first-time)
21. [About more than expected energy released after gamma ray burst](https://www.theguardian.com/science/2019/nov/20/big-star-energy-record-breaking-explosion-recorded)
    - this topic is unrelated to the discovery of neutrinos, but focuses on fundamental particles (protons, neutrons), which were seen (during training) in the worst articles within this topic
22. [About Mars sample return mission being planned by the ESA](https://www.theguardian.com/science/2019/nov/24/mars-robot-will-send-samples-to-earth)
23. [About ExoMars asking for NASA help with fixing parachute (landing) problems](https://www.theguardian.com/science/2019/dec/15/exomars-race-against-time-to-launch-troubled-europe-mission-to-mars)
24. [About ExoMars mission successfully testing parachute system](https://www.theguardian.com/science/2019/dec/26/europes-mars-lander-passes-parachute-test)
25. [About space exploration topics - Moon, Mars, Asteroid missions and private companies involved in exploring space](https://www.theguardian.com/science/2020/jan/05/space-race-moon-mars-asteroids-commercial-launches)
26. [About whether zero net CO2 emissions is possible with air travel](https://www.theguardian.com/science/2019/nov/24/can-we-fly-and-have-net-zero-emissions-air-industry-e-fan-x-rolls-royce-engines-kerosine-carbon-2050)
    - this topic is close to the worst articles that fit into the Virgin Galactic topic
27. Topic 0
28. Topic 0
29. Topic 0
30. [About major topics from 2019 with one focused on world's renewed focus on combating global warming](https://www.theguardian.com/science/2019/dec/22/the-science-stories-that-shaped-2019)
31. Dinosaurs
32. Dinosaurs
33. [About experiments involving fire and combustion conducted by ISS astronauts](https://www.theguardian.com/science/2020/jan/01/international-space-station-astronauts-play-with-fire-for-research)
34. [About a study to re-create the Overview Effect - image of the Earth from space](https://www.theguardian.com/science/2019/dec/26/scientists-attempt-to-recreate-overview-effect-from-earth)
    - this only involves a reference to a former astronaut involved in building the ISS, and not to the ISS operations themselves
    - this is among the worst articles in this topic
35. [About return of first ISS female astronaut to Earth](https://www.theguardian.com/science/2020/feb/06/christina-koch-returns-to-earth-after-record-breaking-space-mission)
36. [About an eclipse among stars within a constellation in space](https://www.theguardian.com/science/2020/jan/05/starwatch-cassiopeia-queen-of-the-northern-sky)
    - this isn't an eclipse involving Earth
    - actually, it is loosely connected to eclipses since the focus is on the constellation
37. [About a wave of dust and gas encapsulating a large stretch of stars across the Milky Way](https://www.theguardian.com/science/2020/jan/07/astronomers-discover-huge-gaseous-wave-holding-milky-ways-newest-stars)
    - this is unrelated to gravitational waves
    - this falls within the scope of the worst articles within this topic
38. [About a cosmologist's theory that could be tested with gravitational wave telescopes](https://www.theguardian.com/science/2020/jan/25/has-physicists-gravity-theory-solved-impossible-dark-energy-riddle)
    - this is not among the worst articles within the gravitational wave discovery topic but is not among the best either
39. [About space tourism around the Moon by SpaceX](https://www.theguardian.com/science/2020/jan/13/japanese-billionaire-yusaku-maezawa-seeks-special-woman-for-trip-around-moon)
    - the rocket launch aspect is secondary
    - this is among the worst articles within the SpaceX rocket testing topic
40. [About successful completion of SpaceX launch](https://www.theguardian.com/science/2020/jan/23/spacewatch-successful-spacex-test-a-key-milestone-for-nasa)
41. [About people involved in rescuing victims who are drowning](https://www.theguardian.com/science/2020/jan/16/bring-up-the-bodies-gene-sandy-ralston-drowning-victims-sonar)
    - unrelated to finding evidence of water on other planets
    - among the worst articles in this topic
42. [About discoveries made by large telescopes on Earth](https://www.theguardian.com/science/2020/feb/02/the-five-large-telescopes-solar-surface-images)
    - among the worst articles in the Hubble Telescope topic
43. [About end of Spitzer IR Telescope mission](https://www.theguardian.com/science/2020/feb/06/spacewatch-nasa-ends-16-year-spitzer-infrared-mission)
    - only indirectly related to the Hubble Telescope mission
44. Topic 33
45. [About images recorded of the most-distant object ever visited by a space probe](https://www.theguardian.com/science/2020/feb/13/not-just-a-space-potato-nasa-unveils-astonishing-details-of-most-distant-object-ever-visited-arrokoth)
46. [Obituary for mathematician involved in Moon landing](https://www.theguardian.com/science/2020/feb/24/katherine-johnson-nasa-mathematician-hidden-figures-dies-101)
    - this is indirectly related to Neil Armstrong
47. [Obituary of mathematician involved in Moon landing](https://www.theguardian.com/science/2020/feb/24/katherine-johnson-obituary)
    - this is not related to the Neil Armstrong topic directly, but is indirectly related to Moon landings

A summary of the above qualitative evaluation is shown below

In [None]:
d_unseen_evaluate = {
    "About Hubble constant": [
        "https://www.theguardian.com/science/2019/nov/02/hubble-constant-mystery-that-keeps-getting-bigger-estimate-rate-expansion-universe-cosmology-cepheid",
        False,
    ],
    "pictures returned from Voyager 2's mission outside the heliosphere (created by plasma from the Sun)": [
        "https://www.theguardian.com/science/2019/nov/04/nasa-voyager-2-sends-back-first-signal-from-interstellar-space",
        True,
    ],
    "Solar probe returns first image from edge of Sun's atmosphere": [
        "https://www.theguardian.com/science/2019/dec/04/nasas-parker-solar-probe-beams-back-first-insights-from-suns-edge",
        True,
    ],
    "About the Voyager probe's accomplishments": [
        "https://www.theguardian.com/science/2020/jan/11/voyager-scientist-ed-stone-on-the-search-for-extraterrestrial-life-we-need-to-get-back-to-enceladus",
        True,
    ],
    "Discovery of new form of Aurora Borealis": [
        "https://www.theguardian.com/science/2020/jan/29/amateur-stargazers-capture-new-form-of-northern-lights",
        True,
    ],
    "On capture of most detailed pictures of Sun's surface": [
        "https://www.theguardian.com/science/2020/jan/29/solar-telescope-captures-most-detailed-pictures-yet-the-sun",
        True,
    ],
    "On first ever recorded images of Sun's north and south poles": [
        "https://www.theguardian.com/science/2020/feb/07/solar-orbiter-spacecraft-will-capture-the-suns-north-and-south-poles",
        True,
    ],
    "Mission update from deployment of Solar orbiter probe": [
        "https://www.theguardian.com/science/2020/feb/20/spacewatch-solar-orbiter-sends-first-data-back-to-earth",
        True,
    ],
    "About a history of messages received from E.T. life": [
        "https://www.theguardian.com/science/2019/nov/06/cosmic-cats-nuclear-interstellar-messages-extraterrestrial-intelligence",
        True,
    ],
    "Astronomers use grid of satellites to search for E.T. life": [
        "https://www.theguardian.com/science/2020/feb/15/astronomers-to-sweep-entire-sky-for-signs-of-extraterrestrial-life",
        True,
    ],
    "Obituary for designer of satellites used in various fields of applied science": [
        "https://www.theguardian.com/science/2019/nov/06/daniel-lobb-obituary",
        True,
    ],
    "About Chilean satellite's views being blocked by communications satellites": [
        "https://www.theguardian.com/science/2019/nov/22/not-cool-telescope-faces-interference-from-space-bound-satellites",
        True,
    ],
    "Reporting on several satellite launches during the period of a week at the end of November 2019": [
        "https://www.theguardian.com/science/2019/nov/28/spacewatch-you-wait-ages-for-a-rocket-launch-then-",
        True,
    ],
    "ESA launching space debris collection device": [
        "https://www.theguardian.com/science/2019/dec/09/european-space-agency-to-launch-clearspace-1-space-debris-collector-in-2025",
        True,
    ],
    "ESA awards first space junk clean-up contract": [
        "https://www.theguardian.com/science/2019/dec/12/spacewatch-esa-awards-first-junk-clean-up-contract-clearspace",
        True,
    ],
    "About views of the sky being blocked by technology firms' large number of plans involving launching new satellites": [
        "https://www.theguardian.com/science/2020/jan/09/companies-plans-for-satellite-constellations-put-night-sky-at-risk",
        True,
    ],
    "About SpaceX satellite launches": [
        "https://www.theguardian.com/science/2020/jan/09/spacewatch-spacex-elon-musk-launches-60-more-satellites-into-starlink-constellation",
        True,
    ],
    "About images captured about a star leaving the Milky Way after encountering a black hole": [
        "https://www.theguardian.com/science/2019/nov/13/superfast-star-found-leaving-milky-way-at-1700km-per-second",
        True,
    ],
    "About hole created by material/energy expelled by an inter-stellar explosion originating in a black hole": [
        "https://www.theguardian.com/science/2020/feb/27/biggest-cosmic-explosion-ever-detected-makes-huge-dent-in-space",
        True,
    ],
    "About Boeing proposing direct Moon flights in 2024": [
        "https://www.theguardian.com/science/2019/nov/14/spacewatch-boeing-proposes-direct-flights-moon-2024-nasa",
        True,
    ],
    "About trial study that places humans into suspended animation": [
        "https://www.theguardian.com/science/2019/nov/20/humans-put-into-suspended-animation-for-first-time",
        True,
    ],
    "About more than expected energy released after gamma ray burst": [
        "https://www.theguardian.com/science/2019/nov/20/big-star-energy-record-breaking-explosion-recorded",
        False,
    ],
    "About Mars sample return mission being planned by the ESA": [
        "https://www.theguardian.com/science/2019/nov/24/mars-robot-will-send-samples-to-earth",
        True,
    ],
    "About ExoMars asking for NASA help with fixing parachute (landing) problems": [
        "https://www.theguardian.com/science/2019/dec/15/exomars-race-against-time-to-launch-troubled-europe-mission-to-mars",
        True,
    ],
    "About ExoMars mission successfully testing parachute system": [
        "https://www.theguardian.com/science/2019/dec/26/europes-mars-lander-passes-parachute-test",
        True,
    ],
    "About space exploration topics - Moon, Mars, Asteroid missions and private companies involved in exploring space": [
        "https://www.theguardian.com/science/2020/jan/05/space-race-moon-mars-asteroids-commercial-launches",
        True,
    ],
    "About whether zero net CO2 emissions is possible with air travel": [
        "https://www.theguardian.com/science/2019/nov/24/can-we-fly-and-have-net-zero-emissions-air-industry-e-fan-x-rolls-royce-engines-kerosine-carbon-2050",
        False,
    ],
    "Topic 0": [
        "https://www.theguardian.com/science/2019/dec/16/did-you-solve-it-the-club-sandwich-problem",
        np.nan,
    ],
    "Topic 0_1": [
        "https://www.theguardian.com/science/2019/dec/19/true-meanings-of-words-of-emotion-get-lost-in-translation-study-finds",
        np.nan,
    ],
    "Topic 0_2": [
        "https://www.theguardian.com/science/2020/feb/06/brian-greene-theoretical-physicist-interview-until-the-end-of-time",
        np.nan,
    ],
    "About major topics from 2019 with one focused on world's renewed focus on combating global warming": [
        "https://www.theguardian.com/science/2019/dec/22/the-science-stories-that-shaped-2019",
        True,
    ],
    "Dinosaurs": [
        "https://www.theguardian.com/science/2019/dec/23/cha-cha-chimp-ape-study-suggests-urge-to-dance-is-prehuman",
        np.nan,
    ],
    "Dinosaurs_1": [
        "https://www.theguardian.com/science/2020/jan/13/stardust-older-than-earth-and-sun-found-meteorite-australia",
        np.nan,
    ],
    "About experiments involving fire and combustion conducted by ISS astronauts": [
        "https://www.theguardian.com/science/2020/jan/01/international-space-station-astronauts-play-with-fire-for-research",
        True,
    ],
    "About a study to re-create the Overview Effect - image of the Earth from space": [
        "https://www.theguardian.com/science/2019/dec/26/scientists-attempt-to-recreate-overview-effect-from-earth",
        False,
    ],
    "About return of first ISS female astronaut to Earth": [
        "https://www.theguardian.com/science/2020/feb/06/christina-koch-returns-to-earth-after-record-breaking-space-mission",
        True,
    ],
    "About an eclipse among stars within a constellation in space": [
        "https://www.theguardian.com/science/2020/jan/05/starwatch-cassiopeia-queen-of-the-northern-sky",
        False,
    ],
    "About a wave of dust and gas encapsulating a large stretch of stars across the Milky Way": [
        "https://www.theguardian.com/science/2020/jan/07/astronomers-discover-huge-gaseous-wave-holding-milky-ways-newest-stars",
        False,
    ],
    "About a cosmologist's theory that could be tested with gravitational wave telescopes": [
        "https://www.theguardian.com/science/2020/jan/25/has-physicists-gravity-theory-solved-impossible-dark-energy-riddle",
        False,
    ],
    "About space tourism around the Moon by SpaceX": [
        "https://www.theguardian.com/science/2020/jan/13/japanese-billionaire-yusaku-maezawa-seeks-special-woman-for-trip-around-moon",
        False,
    ],
    "About successful completion of SpaceX launch": [
        "https://www.theguardian.com/science/2020/jan/23/spacewatch-successful-spacex-test-a-key-milestone-for-nasa",
        True,
    ],
    "About people involved in rescuing victims who are drowning": [
        "https://www.theguardian.com/science/2020/jan/16/bring-up-the-bodies-gene-sandy-ralston-drowning-victims-sonar",
        False,
    ],
    "About discoveries made by large telescopes on Earth": [
        "https://www.theguardian.com/science/2020/feb/02/the-five-large-telescopes-solar-surface-images",
        False,
    ],
    "About end of Spitzer IR Telescope mission": [
        "https://www.theguardian.com/science/2020/feb/06/spacewatch-nasa-ends-16-year-spitzer-infrared-mission",
        False,
    ],
    "Topic 33_1": [
        "https://www.theguardian.com/science/2020/feb/12/alan-rodger-obituary",
        np.nan,
    ],
    "About images recorded of the most-distant object ever visited by a space probe": [
        "https://www.theguardian.com/science/2020/feb/13/not-just-a-space-potato-nasa-unveils-astonishing-details-of-most-distant-object-ever-visited-arrokoth",
        True,
    ],
    "Obituary for mathematician involved in Moon landing": [
        "https://www.theguardian.com/science/2020/feb/24/katherine-johnson-nasa-mathematician-hidden-figures-dies-101",
        False,
    ],
    "Obituary of mathematician involved in Moon landing": [
        "https://www.theguardian.com/science/2020/feb/24/katherine-johnson-obituary",
        False,
    ],
}

In [None]:
df_unseen_evaluate = (
    pd.DataFrame.from_dict(d_unseen_evaluate, orient="index")
    .reset_index()
    .rename(columns={"index": "summary", 0: "url", 1: "Suitable"})
    .assign(best=df_topics_new["best"])
)
df_unseen_evaluate["best"] = df_unseen_evaluate["best"].str.split("_", expand=True)[0]
df_unseen_evaluate = df_unseen_evaluate.fillna("Not Read")
df_unseen_evaluate

In [None]:
altair_plot_horiz_bar_chart(
    (
        df_unseen_evaluate["Suitable"]
        .value_counts()
        .sort_values(ascending=False)
        .reset_index()
        .rename(columns={"index": "Match", "Suitable": "count"})
    ),
    "Evaluation of Assigned Topics versus Article text",
    "count",
    "Match",
    xtitle="Number of Articles",
    labelFontSize=14,
    titleFontSize=14,
    plot_titleFontSize=16,
    text_var="count",
    tooltip=["Match", "count"],
    dx=70,
    offset=0,
    horiz_label_limit=450,
    sort_y=None,
    fig_size=(200, 450),  # (height, width)
)

**Observations**
1. The topics are generally reasonable. Approximately 70% of the manually read unseen articles matched the assigned topic. As with the training data, there are some news articles that fall under the *worst articles* within a given topic.
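
The approximate 70% figure can be reproduced from the counts in the evaluation dictionary above, which holds 29 suitable, 13 unsuitable and 6 unread articles. The sketch below recomputes the share among the articles that were read; the series is rebuilt from those counts for illustration rather than taken from `df_unseen_evaluate`.

```python
import pandas as pd

# Counts taken from the evaluation dictionary: 29 True, 13 False, 6 not read
suitable = pd.Series([True] * 29 + [False] * 13 + ["Not Read"] * 6, name="Suitable")

# Keep only the articles that were manually read, then take the share marked suitable
read = suitable[suitable != "Not Read"].astype(bool)
share_suitable = read.mean()
print(f"{share_suitable:.1%}")  # 69.0%
```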

As a reminder, the named topics (with 35 topics) are shown below

In [None]:
display(df_named_topics)

**Observations**
1. The link on row 17 points to an article discussing imaging of the solar system, which is a closer fit to the worst articles of topic 31 (Hubble Telescope). With this exception, the other unseen news articles are an adequate fit to the predicted topic.

The top terms (weights) for each of the topics in the unseen data are shown below, in the same way as has been done earlier in this analysis

In [None]:
topic_word = pd.DataFrame(
    pipe.named_steps["nmf"].components_.round(3),
    index=[k for k in range(n_topics_wanted)],
    columns=pipe.named_steps["vectorizer"].get_feature_names(),
)
display(topic_word)

In [None]:
topic_word = pd.DataFrame(
    pipe.named_steps["nmf"].components_.round(3),
    index=[k for k in range(n_topics_wanted)],
    columns=pipe.named_steps["vectorizer"].get_feature_names(),
)
print(f"Number of rows = {topic_word.shape[0]}")
# display(topic_word.sample(10, axis=1))

df_topic_word_factors = (
    topic_word.groupby(topic_word.index)
    .apply(lambda x: x.iloc[0].nlargest(n_top_words))
    .reset_index()
    .rename(columns={"level_0": "topic_num", "level_1": "term", 0: "weight"})
)
display(df_topic_word_factors.head())

In [None]:
altair_plot_grid_by_column(
    df_topic_word_factors.merge(
        df_named_topics[
            df_named_topics["best"].isin(df_topics_new["best"].unique().tolist())
        ].reset_index()[["topic_num", "best"]],
        on="topic_num",
    ),
    xvar="weight",
    yvar="term",
    col2grid="best",
    space_between_plots=5,
    row_size=3,
    fig_size=(150, 200),
)

Next is the number of occurrences of each topic in the unseen data

In [None]:
altair_plot_horiz_bar_chart(
    df_topics_new["best"]
    .value_counts()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={"best": "count", "index": "topic"}),
    "Topic Occurrences",
    "count",
    "topic",
    xtitle="Occurrences",
    labelFontSize=14,
    titleFontSize=16,
    plot_titleFontSize=16,
    text_var="count",
    tooltip=["topic", "count"],
    dx=350,
    offset=0,
    horiz_label_limit=450,
    sort_y=None,
    fig_size=(500, 400),  # (height, width)
)

Below is the number of news articles by week of the month and day of the week

In [None]:
altair_datetime_heatmap(
    df_topics_new.groupby(["weekday", "week_of_month"])["url"]
    .count()
    .reset_index()
    .rename(columns={"url": "count"}),
    "week_of_month:N",
    "weekday:N",
    "weekday",
    "Articles by Weekday vs Week of Month",
    [
        {
            "title": "Weekday",
            "field": "weekday",
            "type": "nominal",
        },
        {
            "title": "Week of Month",
            "field": "week_of_month",
            "type": "quantitative",
        },
        {
            "title": "Number of News Articles",
            "field": "count",
            "type": "quantitative",
        },
    ],
    "yelloworangered",
    "",
    "count:Q",
    "log",
    axis_tick_font_size=12,
    axis_title_font_size=14,
    title_font_size=14,
    legend_fig_padding=10,  # default is 18
    y_axis_title_alignment="left",
    fwidth=350,
    fheight=200,
    file_path="",
    save_to_html=False,
    sort_x=pd.Series(df_topics_new["week_of_month"].unique())
    .sort_values(ascending=True)
    .values.tolist(),
    sort_y=[
        "Sunday",
        "Monday",
        "Tuesday",
        "Wednesday",
        "Thursday",
        "Friday",
        "Saturday",
    ],
    dx=80,
    offset=5,
    plot_titleFontSize=16,
)

Below is the number of news articles by topic and weekday

In [None]:
altair_datetime_heatmap(
    df_topics_new.groupby(["best", "weekday"])["url"]
    .count()
    .reset_index()
    .rename(columns={"url": "count"}),
    "weekday:N",
    "best:N",
    "weekday",
    "Articles by Weekday",
    [
        {
            "title": "Weekday",
            "field": "weekday",
            "type": "nominal",
        },
        {
            "title": "Topic",
            "field": "best",
            "type": "nominal",
        },
        {
            "title": "Number of News Articles",
            "field": "count",
            "type": "quantitative",
        },
    ],
    "yelloworangered",
    "",
    "count:Q",
    "log",
    axis_tick_font_size=12,
    axis_title_font_size=14,
    title_font_size=14,
    legend_fig_padding=10,  # default is 18
    y_axis_title_alignment="left",
    fwidth=450,
    fheight=350,
    file_path="",
    save_to_html=False,
    sort_x=[
        "Sunday",
        "Monday",
        "Tuesday",
        "Wednesday",
        "Thursday",
        "Friday",
        "Saturday",
    ],
    dx=300,
    offset=5,
    plot_titleFontSize=16,
)

<a id="comparison-to-word-vectors-for-unseen-data"></a>

### 8.1. [Comparison to Word Vectors for Unseen Data](#comparison-to-word-vectors-for-unseen-data)

As before, we'll assess cosine similarity for these unseen articles, using spaCy's pre-trained [medium-sized](https://spacy.io/models#conventions) [word vector model](https://spacy.io/models)

In [None]:
%%time
vectors = [nlp(sentence).vector for sentence in df_new["text"]]
similarities = cosine_similarity(vectors)
dfcs = pd.DataFrame(similarities).assign(topic_num=df_topics["topic_num"])
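As a reminder of what the similarity matrix above contains, here is a minimal NumPy sketch of pairwise cosine similarity. The toy vectors are made up for illustration (real document vectors are much longer), and `cosine_similarity_matrix` is a hypothetical helper, not part of `src`. It is meant to mirror what `sklearn.metrics.pairwise.cosine_similarity` returns for dense input.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Guard against all-zero rows (e.g. a text with no known tokens)
    norms[norms == 0] = 1.0
    X_unit = X / norms
    return X_unit @ X_unit.T


# Hypothetical toy "document vectors"
toy_vectors = np.array(
    [
        [1.0, 0.0, 0.0],
        [2.0, 0.0, 0.0],  # same direction as the first row
        [0.0, 3.0, 0.0],  # orthogonal to the first two rows
    ]
)
sims = cosine_similarity_matrix(toy_vectors)
```

Each entry `sims[i, j]` depends only on the angle between the two document vectors and not on their lengths, so the first two toy rows are treated as identical in direction.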

Plotting a heatmap of the document-document cosine similarity, we get the following

In [None]:
dfcs_unstacked = (
    dfcs.select_dtypes("float32")
    .unstack()
    .reset_index()
    .rename(columns={"level_0": "doc1", "level_1": "doc2", 0: "cos_sim"})
)
dfcs_topics = (
    dfcs.merge(df_named_topics.reset_index()[["topic_num", "best"]], on="topic_num")
    .reset_index()
    .rename(columns={"index": "doc1"})[["doc1", "best"]]
)
dfcs_merged = dfcs_unstacked.merge(dfcs_topics, on="doc1")

In [None]:
altair_plot_triangular_heatmap(
    data=dfcs_merged[["best", "cos_sim", "doc1", "doc2"]],
    ptitle=f"Found {df_topics_new['topic'].nunique()} unique topics in unseen data",
    xvar="doc1",
    yvar="doc2",
    zvar="cos_sim",
    xtitle="News Article 1",
    ytitle="News Article 2",
    tooltip=[
        alt.Tooltip("doc1:N", title="News Article 1"),
        alt.Tooltip("doc2:N", title="News Article 2"),
        alt.Tooltip("best:N", title="Topic"),
        alt.Tooltip("cos_sim:Q", title="Cosine Similarity", format=".3f"),
    ],
    axis_tick_font_size=14,
    axis_title_font_size=14,
    plot_titleFontSize=16,
    dx=45,
    offset=0,
    show_triangle="lower",
    fig_size=(650, 650),
)

**Observations**
1. An article (index 16) on SpaceX launching 60 new satellites has a weak (cosine) similarity to all the other articles, across a range of topics. Yet the other articles within the same topic (indices 10-15) are not a similarly poor fit.

Lastly, we'll show a boxplot of the cosine similarities by topic, where only articles belonging to a given topic are compared to each other

In [None]:
dfcs_all = []
for topic in dfcs["topic_num"].unique():
    dfcs_named_topics = dfcs.loc[dfcs["topic_num"] == topic]
    # display(dfcs_named_topics[dfcs_named_topics.index])
    dfcs_unstacked = (
        dfcs_named_topics[dfcs_named_topics.index]
        .unstack()
        .reset_index()
        .rename(columns={"level_0": "doc1", "level_1": "doc2", 0: "cos_sim"})
    )
    # display(dfcs_unstacked)
    dfcs_reshaped = dfcs_unstacked.merge(
        dfcs_named_topics[dfcs_named_topics.index.tolist() + ["topic_num"]]
        .reset_index()[["index", "topic_num"]]
        .rename(columns={"index": "doc1"}),
        on="doc1",
    ).merge(df_named_topics.reset_index()[["best", "topic_num"]], on="topic_num")[
        ["doc1", "doc2", "cos_sim", "best"]
    ]
    # display(dfcs_reshaped)
    dfcs_all.append(dfcs_reshaped)
# display(dfcs[["topic_num", "best"]].head(2))
dfcs_merged = pd.concat(dfcs_all).reset_index(drop=True)
display(dfcs_merged.head())
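The reshaping above keeps every ordered pair within a topic, including each article paired with itself (similarity of 1). If only distinct pairs are wanted, one option is to take the upper triangle of each topic's block of the similarity matrix. The sketch below uses made-up similarities and labels, and `within_topic_similarities` is a hypothetical helper rather than something defined in `src`.

```python
import numpy as np


def within_topic_similarities(similarities, topic_labels):
    """Map each topic to the cosine similarities between its distinct article pairs."""
    sims = np.asarray(similarities, dtype=float)
    labels = list(topic_labels)
    per_topic = {}
    for topic in dict.fromkeys(labels):  # unique labels, in order of appearance
        idx = [i for i, label in enumerate(labels) if label == topic]
        block = sims[np.ix_(idx, idx)]
        # k=1 skips the diagonal, so self-similarities (always 1) are excluded
        rows, cols = np.triu_indices(len(idx), k=1)
        per_topic[topic] = block[rows, cols]
    return per_topic


# Hypothetical similarities for four articles and their assigned topics
toy_sims = [
    [1.0, 0.9, 0.2, 0.8],
    [0.9, 1.0, 0.3, 0.7],
    [0.2, 0.3, 1.0, 0.1],
    [0.8, 0.7, 0.1, 1.0],
]
toy_topics = ["mars", "mars", "pluto", "mars"]
pairs_by_topic = within_topic_similarities(toy_sims, toy_topics)
```

With this approach, a topic containing a single article yields no pairs at all, so it contributes nothing to a distribution plot.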

In [None]:
altair_boxplot_sorted(
    dfcs_merged,
    "best",
    "cos_sim",
    "median(cos_sim)",
    "Cosine Similarity (sorted by median)",
    14,
    14,
    16,
    dx=350,
    offset=0,
    x_tick_label_angle=0,
    horiz_bar_chart=True,
    axis_range=[0, 1],
    fig_size=(350, 400),
)

**Notes**
1. Topics containing a single document do not display boxes on the box plot.

**Observations**
1. From the boxplot, the news articles per topic are similar to each other in cosine similarity.
2. From the heatmap, cosine similarities are comparable to each other. From reading the individual articles (there are only a small number), we can see that, when multiple articles appear per topic, these are qualitatively comparable to the best articles in each topic identified earlier. In part, this could explain the high inter-article similarities within a given group. As a result, the heatmap is not very informative.

<a id="conclusion"></a>

## 9. [Conclusion](#conclusion)

Selecting a suitable number of topics is critical to topic modeling. One approach to guiding this choice is to use Gensim's topic coherence pipeline to quantitatively evaluate the topics from a trained topic model. The topic coherence score can be used to estimate a useful range for the number of topics. Once this range is found, there are several possible approaches to picking one or more numbers of topics from it for final topic modeling.
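To give a feel for what a coherence score captures, below is a hand-rolled sketch of a UMass-style measure, which rewards topics whose top terms tend to appear in the same documents. This is not Gensim's implementation and not necessarily the measure used by this notebook's helper: the function name, the toy corpus and the smoothing constant `epsilon=1.0` are all assumptions made for illustration. In practice, Gensim's `CoherenceModel` provides this and other measures.

```python
import math
from itertools import combinations


def umass_coherence(topic_terms, tokenized_docs, epsilon=1.0):
    """Mean UMass-style coherence over pairs of a topic's top terms.

    Assumes every term in `topic_terms` occurs in at least one document
    and that terms are ordered from most to least important.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*terms):
        # Number of documents containing every term in `terms`
        return sum(all(term in doc for term in terms) for doc in doc_sets)

    scores = []
    for j, i in combinations(range(len(topic_terms)), 2):
        earlier, later = topic_terms[j], topic_terms[i]
        scores.append(
            math.log((doc_freq(later, earlier) + epsilon) / doc_freq(earlier))
        )
    return sum(scores) / len(scores)


# Hypothetical tokenized corpus
toy_docs = [
    ["mars", "rover", "landing"],
    ["mars", "rover", "mission"],
    ["pluto", "flyby", "mission"],
    ["pluto", "flyby", "image"],
]
coherent_score = umass_coherence(["mars", "rover"], toy_docs)
mixed_score = umass_coherence(["mars", "flyby"], toy_docs)
```

Here the topic whose terms co-occur (`mars`, `rover`) scores higher than the one that mixes terms from unrelated articles (`mars`, `flyby`). Averaging such scores over all topics of a model, for each candidate number of topics, gives the kind of curve used to pick a range.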

Here, we looked through news articles from the Guardian newspaper, where the topic coherence score was approximately constant across the range of numbers of topics that we tried. Picking the low and high extremes of this range, and training a separate topic model at each of these end points, allowed for a comparison between high-level topics and embedded sub-topics, such as the overall topic on Mars missions vs Beagle 2 (the lost lander from the Mars Express mission). Similarly, we pulled out the Pluto missions as a standalone topic, separate from a blanket Planetary Research topic. The greater granularity did allow for improved (lower) topic residuals (Frobenius norm), which indicated that the more granular topics were a closer approximation of their constituent articles.

With a larger number of topics, there is always a risk that at least one topic will occur infrequently. If this is a niche but genuinely independent topic, then this could justify keeping it as a standalone topic. Here, this was the case for the topic on the Philae mission, which was separated from the overall Rosetta orbiter mission to comet 67P. A hypothesis test of the residuals also indicated that these are separate topics. But if we were not interested specifically in the Philae aspect of the mission, it would not have been a problem to keep these together, as was the case with the higher-level topic groupings.

It really depends on the specific reason why we are modeling the topics. Here, the end goal is unsupervised learning, so we don't have ground truth labels to compare the topic groupings to. Both the higher-level and the more granular groupings are logical. But for a supervised learning problem, it could mean that one approach (separated or conflated topics) is preferred, i.e. the higher granularity might not be needed.

An effective but time-consuming approach to assess whether a topic is a reasonable choice for its constituent news articles is to go through the articles manually. If the articles that fall under a topic are a poor match to each other, then this points to an incoherent topic. The best (most tightly focused) topics here, in terms of their residual (Frobenius norm), are the ones reporting on mission updates (e.g. Saturn imaging, Pluto, Mars, etc.). Earth-based scientific findings (Higgs particle, gravitational waves) were next. The worst topics (highest residuals) were the ones that focused on *general space science*, as the overall focus of these articles was qualitatively weaker than for topics focused on the space missions.
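The residual comparison described above can be sketched as follows. The notebook's own residual computation is in an earlier section, so this is an illustration under assumptions: the residual of an article is taken to be the norm of its row of `A - WH`, where `A` is the TF-IDF matrix and `W`, `H` are the NMF factors, and each article is assigned to its highest-weight topic. The helper names and the toy matrices are hypothetical.

```python
import numpy as np


def per_article_residuals(A, W, H):
    """Norm of each article's reconstruction error, A[i] - W[i] @ H."""
    A, W, H = (np.asarray(m, dtype=float) for m in (A, W, H))
    return np.linalg.norm(A - W @ H, axis=1)


def mean_residual_by_topic(residuals, topic_labels):
    """Average residual of the articles assigned to each topic (lower is tighter)."""
    labels = list(topic_labels)
    return {
        topic: float(np.mean([r for r, t in zip(residuals, labels) if t == topic]))
        for topic in dict.fromkeys(labels)
    }


# Hypothetical toy factorization: 3 articles, 4 terms, 2 topics
A = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], [0.5, 0.5, 0.0, 0.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]])
H = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])

residuals = per_article_residuals(A, W, H)
dominant_topic = W.argmax(axis=1).tolist()  # highest-weight topic per article
residual_by_topic = mean_residual_by_topic(residuals, dominant_topic)
```

In this toy case the third article contains a term that neither topic covers, so it raises the mean residual of the topic it is assigned to. A loosely focused topic would show the same effect across many of its articles.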

<a id="looking-forward"></a>

## 10. [Looking Forward](#looking-forward)

1. Expanding this analysis to include a third choice of number of topics, between the two extrema chosen here (based on topic coherence), would allow for a check of a more focused selection of topics by combining some of the poorer ones together.
2. Since this was a first version of the topic modeling, hyperparameters were not tuned. A process centered around using the residual (Frobenius norm) to assess topic quality could be used to tune hyperparameters for the TF-IDF and NMF steps of the overall pipeline. More stopwords should also be added to the custom list of stopwords used here.
3. Data were acquired until the end of 2019. Extending this for another year could be useful to determine the new topics that emerged due to global events during the year.
4. The focus here was on using NMF for topic modeling. While NMF relies on matrix factorization, another approach, Latent Dirichlet Allocation (LDA), is a probabilistic one. It returns the probabilities of topics within each news article. A comparison between NMF's topics and those from LDA could provide further insights into (a) the topic groupings and (b) the number of topics in the corpus. [Distributed LDA training approaches](https://www.ics.uci.edu/~asuncion/pubs/JMLR_09.pdf) may be beneficial for large datasets.
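For item 2 above, a residual-driven search could look roughly like the sketch below. It loops over a small grid of TF-IDF and NMF settings and records scikit-learn's `reconstruction_err_` for each fit. The toy corpus, the grid values and the step names `tfidf` and `nmf` are assumptions for illustration, not the pipeline used in this notebook.

```python
from itertools import product

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the real input would be the processed article text
docs = [
    "mars rover landing mission",
    "mars rover mission update",
    "mars orbiter mission images",
    "pluto flyby mission images",
    "pluto flyby new horizons",
    "pluto surface images new",
]

# Illustrative grid only; sensible values depend on the corpus
grid = {"tfidf__min_df": [1, 2], "nmf__init": ["nndsvd", "nndsvda"]}

results = []
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    pipe = Pipeline(
        [
            ("tfidf", TfidfVectorizer()),
            ("nmf", NMF(n_components=2, random_state=0, max_iter=500)),
        ]
    ).set_params(**params)
    pipe.fit(docs)
    # For the default loss, reconstruction_err_ is the Frobenius norm of X - WH
    residual = pipe.named_steps["nmf"].reconstruction_err_
    results.append({**params, "residual": residual})

best = min(results, key=lambda r: r["residual"])
```

One caveat with this design: changing the TF-IDF settings changes the matrix being factorized, so residuals from different vectorizer settings are not measured against the same target and should be compared with care. Increasing the number of topics will also tend to lower the residual on its own, so it is probably best held fixed during such a search.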