# [Gensim NLP trials](#gensim-nlp-trials)

In [None]:
# !python -m spacy download en_core_web_sm

In [None]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [None]:
import os
import re
from pathlib import Path
from time import time
from IPython.display import display

import altair as alt
import gensim.corpora as corpora
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from gensim.models import nmf, Word2Vec
from nltk.corpus import stopwords
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.decomposition import NMF
from sklearn.feature_extraction import stop_words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

In [None]:
%aimport src.pipe_helpers
from src.pipe_helpers import TextCleaner

%aimport src.gensim_helpers
from src.gensim_helpers import (
    compute_coherence_values,
    make_bigrams,
    remove_stopwords,
    lemmatization,
    sent_to_words,
    format_topics_sentences,
    get_bigrams_trigrams,
    plot_coherence_scores,
    print_top_words_gensim,
)

%aimport src.word_embedding_helpers
from src.word_embedding_helpers import (
    calculate_coherence,
    get_descriptor,
    TokenGenerator,
    fit_nmf_for_num_topics,
    compute_coherence_values_manually,
    print_top_words,
    get_docs_with_topics,
)

%aimport src.visualization_helpers
from src.visualization_helpers import (
    altair_datetime_heatmap,
    plot_horiz_bar,
    plot_horiz_bar_gensim,
    pipe_get_topics,
    get_top_words_per_topic,
    get_docs_with_topics_v2,
)

In [None]:
SMALL_SIZE = 26
MEDIUM_SIZE = 28
BIGGER_SIZE = 30
plt.rc("font", size=SMALL_SIZE)  # controls default text sizes
plt.rc("axes", titlesize=SMALL_SIZE)  # fontsize of the axes title
plt.rc("axes", labelsize=MEDIUM_SIZE)  # fontsize of the x and y labels
plt.rc("xtick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
plt.rc("ytick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
plt.rc("legend", fontsize=SMALL_SIZE)  # legend fontsize
plt.rc("figure", titlesize=BIGGER_SIZE)  # fontsize of the figure title
plt.rcParams["axes.facecolor"] = "white"
sns.set_style("darkgrid", {"legend.frameon": False})
sns.set_context("talk", font_scale=0.95, rc={"lines.linewidth": 2.5})

In [None]:
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)
%matplotlib inline

<a id="toc"></a>

## [Table of Contents](#table-of-contents)
0. [About](#about)
1. [User Inputs](#user-inputs)
2. [Load joined data](#load-joined-data)
3. [Topic modeling using TFIDF vectorization and NMF with Gensim Word2Vec word-embedding for selecting number of topics via topic coherence score](#topic-modeling-using-tfidf-vectorization-and-nmf-with-gensim-word2vec-word-embedding-for-selecting-number-of-topics-via-topic-coherence-score)
   - 3.1. [Pre-processing for NMF](#pre-processing-for-nmf)
   - 3.2. [NMF](#nmf)
   - 3.3. [Build Word Embedding model with Gensim](#build-word-embedding-model-with-gensim)
   - 3.4. [Use Word2Vec model to find number of topics](#use-word-to-vec-model-to-find-number-of-topics)
   - 3.5. [NMF with selected number of topics from Word2Vec with coherence score](#nmf-with-selected-number-of-topics-from-word-to-vec-with-coherence-score)
   - 3.6. [Exploring topics combined with source data](#exploring-topics-combined-with-source-data)
4. [Topic modeling using Gensim NMF without TFIDF vectorization](#topic-modeling-using-gensim-nmf-without-tfidf-vectorization)
   - 4.1. [Pre-processing for Gensim NMF](#pre-processing-for-gensim-nmf)
   - 4.2. [Gensim tokenization](#gensim-tokenization)
   - 4.3. [Gensim bi- and tri-gram models and lemmatization](#gensim-bi--and-tri-gram-models-and-lemmatization)
   - 4.4. [Use Gensim to perform Bag-of-Words transformation](#use-gensim-to-perform-bag-of-words-transformation)
   - 4.5. [Use Gensim NMF without TFIDF vectorization to find number of topics](#use-gensim-nmf-without-tfidf-vectorization-to-find-number-of-topics)
   - 4.6. [NMF with selected number of topics from Gensim coherence score without TFIDF vectorization](#nmf-with-selected-number-of-topics-from-gensim-coherence-score-without-tfidf-vectorization)
   - 4.7. [Exploring Gensim NMF topics combined with source data](#exploring-gensim-nmf-topics-combined-with-source-data)

<a id="about"></a>

## 0. [About](#about)

In this notebook, we will experiment with topic coherence approaches using Gensim to find an optimal number of topics from the joined news listings data in `data/processed/*_processed.csv`. This will be done separately using
- `sklearn`'s NMF model with TFIDF vectorization, scored using manually calculated topic coherence via a word embedding model in Gensim
- Gensim's NMF model without TFIDF vectorization, scored using Gensim's built-in coherence score

<a id="user-inputs"></a>

## 1. [User Inputs](#user-inputs)

We'll define below the variables that are to be used throughout the code.

In [None]:
PROJ_ROOT_DIR = os.path.abspath(os.getcwd())
processed_data_dir = os.path.join(PROJ_ROOT_DIR, "data", "processed")

In [None]:
# Dataset
publication_name = "guardian"

# Data locations
data_dir_path = os.path.join(processed_data_dir, f"{publication_name}_processed.csv")
cloud_run = True

# Custom stop words to include
manual_stop_words = ["nt", "ll", "ve"]

# Topic naming
gensim_tfidf_mapping_dict = {
    "guardian": {
        "component_1": "Gravity and Black holes - Hawking",
        "component_2": "Rocket Launches - Testing",
        "component_3": "Mars Exploration",
        "component_4": "Academia",  ##
        "component_5": "Studying Comets and Meteors",
        "component_6": "Discover of Sub-Atomic particles",
        "component_7": "Rocket Launches - Moon Landing",
        "component_8": "Shuttle Missions and Crashes",
        "component_9": "Global Warming",
        "component_10": "ISS - USA and Russian segments",
        "component_11": "Objects crashing into Earth",
        "component_12": "Space Funding Bodies",
        "component_13": "Imaging Stars - Astronomy",  ##
        "component_14": "Saturn Research",
        "component_15": "Planetary Research",  ##
    }
}

gensim_non_tfidf_mapping_dict = {
    "guardian": {
        0: "Studying Comets and Meteors",  #
        1: "Rocket Launches - Testing",  #
        2: "Discover of Sub-Atomic particles",  #
        3: "Learning and Memory",  #
        4: "ISS",  #
        5: "Brain Research",  #
        6: "Academia",  ##
        7: "Rocket Launches - Moon Landing",  #
        8: "Pseudo space-science and Humanity - Opinion",  #
        9: "Imaging Stars - Astronomy",  #
        10: "Planetary Research",  #
        11: "Global Warming and Climate Science",  #
        12: "Dark Matter Theories",  ##
        13: "Space Funding Bodies",  ##
        14: "Mars Exploration",  #
    }
}

# General inputs
limit = 30
start = 10
step = 1
n_top_words = 10
random_state = 42

In [None]:
# Get stop words from all packages
# NLTK
nltk_dir = os.path.join(os.path.expanduser("~"), "nltk_data")
if not os.path.isdir(nltk_dir):
    nltk.download("punkt")
    nltk.download("wordnet")
    nltk.download("stopwords")
    nltk.download("averaged_perceptron_tagger")
nltk_stop_words = set(stopwords.words("english"))
# Spacy and sklearn
spacy_stop_words = STOP_WORDS
sklearn_stop_words = stop_words.ENGLISH_STOP_WORDS

# Assemble manual list of stop words
spacy_not_in_sklearn = set(spacy_stop_words) - set(sklearn_stop_words)
nltk_not_in_sklearn = set(nltk_stop_words) - set(sklearn_stop_words)
all_stop_words = set(
    list(set(sklearn_stop_words))
    + list(spacy_not_in_sklearn)
    + list(nltk_not_in_sklearn)
)

# Manually add to stop words
for manual_stop_word in manual_stop_words:
    all_stop_words.add(manual_stop_word)

<a id="load-joined-data"></a>

## 2. [Load joined data](#load-joined-data)

We'll start by loading the joined data from a publication, stored at `data/processed/<publication-name>_processed.csv`, into a `DataFrame`

In [None]:
df = pd.read_csv(data_dir_path)
df = df[["text", "year"]]
print(f"Number of rows = {df.shape[0]}")
display(df.head())

In [None]:
chunk_size = 500 if cloud_run else len(df)

<a id="topic-modeling-using-tfidf-vectorization-and-nmf-with-gensim-word2vec-word-embedding-for-selecting-number-of-topics-via-topic-coherence-score"></a>

## 3. [Topic modeling using TFIDF vectorization and NMF with Gensim Word2Vec word-embedding for selecting number of topics via topic coherence score](#topic-modeling-using-tfidf-vectorization-and-nmf-with-gensim-word2vec-word-embedding-for-selecting-number-of-topics-via-topic-coherence-score)

Here, we will find the optimal number of topics for an NMF model from `sklearn` with TFIDF vectorization, using topic coherence from a word embedding (Gensim's Word2Vec) model. This is based on a modified version of a method used [elsewhere](https://github.com/derekgreene/topic-model-tutorial/blob/master/3%20-%20Parameter%20Selection%20for%20NMF.ipynb).

In [None]:
corpus_raw = df["text"].str.lower().values.tolist()

<a id="pre-processing-for-nmf"></a>

### 3.1. [Pre-processing for NMF](#pre-processing-for-nmf)

We'll instantiate a TFIDF vectorizer object with the following changes from notebook `4_nlp_trials.ipynb` due to problems running Word2Vec (in `.wv.similarity()`, in the helper function `calculate_coherence_manually`) on the generated vectors
- `strip_accents`
  - was `"ascii"`, now set to its default value of `None`
- `token_pattern`
  - was `"[a-z][a-z]+"`, now set to its default value of `"(?u)\\b\\w\\w+\\b"`
- `min_df`
  - was `1`, now `20` (meaning a term must occur in at least 20 documents to be kept in the vocabulary)
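
To make the effect of `min_df` concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is not `sklearn`'s implementation; the toy corpus and the function name `filter_by_min_df` are hypothetical and only illustrate that the threshold counts documents, not total occurrences.

```python
from collections import Counter


def filter_by_min_df(docs, min_df):
    """Return the sorted terms that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # Use a set so that repeats within one document count only once
        doc_freq.update(set(doc.split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


toy_docs = [
    "mars rover mars",
    "mars mission",
    "comet mission",
]
vocab = filter_by_min_df(toy_docs, min_df=2)
print(vocab)
```

Here `rover` and `comet` are dropped because each appears in only one document, even though `mars` appears twice within a single document.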

In [None]:
vectorizer = TfidfVectorizer(
    tokenizer=None,
    lowercase=True,
    ngram_range=(1, 1),
    stop_words=all_stop_words,
    min_df=20,
    max_features=None,
    binary=False,
    strip_accents=None,
    token_pattern="(?u)\\b\\w\\w+\\b",
)

In [None]:
pipe = Pipeline(
    steps=[("cleaner", TextCleaner(split=False)), ("vectorizer", vectorizer)]
)

In preparation for using an [NMF model on the corpus](https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py), we'll use this vectorizer to get the document-term matrix for the corpus

In [None]:
cell_st = time()

nmf_transformed = pipe.fit_transform(corpus_raw)
docs_terms = pipe.named_steps["vectorizer"].transform(corpus_raw)
print(
    f"Created {docs_terms.shape[0]:0d} X {docs_terms.shape[1]:0d} "
    "TF-IDF-normalized document-term matrix"
)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

We'll now retrieve the mapping from feature integer indices to feature names, i.e. the vocabulary

In [None]:
terms = pipe.named_steps["vectorizer"].get_feature_names()
print(f"Vocabulary has {len(terms):0d} distinct terms")

<a id="nmf"></a>

### 3.2. [NMF](#nmf)

Now, we will iterate over a desired number of topics and train an NMF model (using the `sklearn` library, as was done in `4_nlp_trials.ipynb`) on that number of topics, using the document-term matrix generated above. A helper function will be used to iterate over a range of number of topics and it will return a `list` of tuples of
- `num_topics`
- `model_transformed`
  - NMF-transformed data
- `factors_dict`
  - factorization matrix/dictionary

In [None]:
cell_st = time()

topic_models = fit_nmf_for_num_topics(start, limit, random_state, docs_terms)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

<a id="build-word-embedding-model-with-gensim"></a>

### 3.3. [Build Word Embedding model with Gensim](#build-word-embedding-model-with-gensim)

Now, a custom Python class is used to format the documents so that they can be used with Gensim's implementation of a `Word2Vec` model. A skip-gram (`sg=1`) `Word2Vec` model with 500 dimensions is then trained on the formatted documents, ignoring words that occur fewer than 20 times in the corpus (`min_count=20`)

In [None]:
cell_st = time()

docgen = TokenGenerator(corpus_raw, all_stop_words)
w2v_model = Word2Vec(docgen, size=500, min_count=20, sg=1, seed=1)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

In [None]:
print(f"Model has {len(w2v_model.wv.vocab):0d} terms")
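
The source of `TokenGenerator` is not shown in this notebook, so the following is only a sketch of the general idea: `Word2Vec` makes more than one pass over its input, so the documents must be wrapped in an object that can be iterated repeatedly rather than in a one-shot generator. The class name `ToyTokenGenerator`, the token pattern and the decision to drop stop words outright are assumptions made for illustration.

```python
import re


class ToyTokenGenerator:
    """Re-iterable stream of token lists, one list per document."""

    def __init__(self, documents, stop_words):
        self.documents = documents
        self.stop_words = set(stop_words)
        # Tokens are runs of two or more word characters
        self.pattern = re.compile(r"\b\w\w+\b")

    def __iter__(self):
        # A fresh pass starts every time iteration is requested
        for doc in self.documents:
            tokens = self.pattern.findall(doc.lower())
            yield [t for t in tokens if t not in self.stop_words]


docgen_toy = ToyTokenGenerator(
    ["The rover landed on Mars", "A comet passed the Sun"], {"the", "on"}
)
first_pass = list(docgen_toy)
second_pass = list(docgen_toy)
print(first_pass)
```

Both passes yield the same token lists, which is the property that a plain generator function would not provide.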

<a id="use-word-to-vec-model-to-find-number-of-topics"></a>

### 3.4. [Use Word2Vec model to find number of topics](#use-word-to-vec-model-to-find-number-of-topics)

The `Word2Vec` model will be used to evaluate the trained NMF models, by calculating a coherence score for each model

In [None]:
cell_st = time()

coherences = compute_coherence_values_manually(
    topic_models, terms, n_top_words, w2v_model
)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)
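
The internals of `compute_coherence_values_manually` are not shown here, so the following is only a sketch of one common way to score topics with word embeddings. It assumes that the coherence of a topic is the mean pairwise cosine similarity of its top words, and that the score of a model is the mean over its topics. The toy vectors and the function names `cosine`, `topic_coherence` and `model_coherence` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_coherence(top_words, vectors):
    """Mean pairwise similarity of the top words of a single topic."""
    pairs = list(combinations(top_words, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def model_coherence(topics, vectors):
    """Mean topic coherence over all topics of one model."""
    return sum(topic_coherence(t, vectors) for t in topics) / len(topics)


# Toy 2-dimensional embeddings standing in for w2v_model.wv
toy_vectors = {
    "mars": [1.0, 0.0],
    "rover": [1.0, 0.0],
    "higgs": [0.0, 1.0],
    "boson": [0.0, 1.0],
}
coherent = model_coherence([["mars", "rover"], ["higgs", "boson"]], toy_vectors)
mixed = model_coherence([["mars", "higgs"], ["rover", "boson"]], toy_vectors)
print(coherent, mixed)
```

Topics whose top words sit close together in the embedding space score higher than topics that mix unrelated words, which is why this score can be used to compare models with different numbers of topics.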

The coherence scores are graphed below against the number of topics

In [None]:
plot_coherence_scores(coherences, start, limit, step, (12, 4))

**Observations**
1. Ideally, the coherence scores would increase and then reach a plateau, and the optimal number of topics would occur at the elbow in this curve, where the highest coherence score occurs before flattening out. Here, there appear to be two weak plateaus, between
   - 14-18 topics (coherence score changes from approx. 0.405 to approx. 0.41 over this range)
   - 24-30 topics (score changes from approx. 0.44 to approx. 0.45 over this range)
   
   Hyperparameter optimization could help improve the width of these plateaus and suggest stronger convergence towards an optimal number of topics, possibly by eliminating one such plateau.
2. Over the range of number of topics from 14-30, the change in coherence score is 0.04 (approx. 10%).
3. Since the coherence score at the onset of the second plateau (24 topics) is only approx. 9% higher than that of the first plateau, we'll proceed with using 15 topics, which lies within the first plateau.
4. The highest coherence score is annotated and occurs for 30 topics.

<a id="nmf-with-selected-number-of-topics-from-word-to-vec-with-coherence-score"></a>

### 3.5. [NMF with selected number of topics from Word2Vec with coherence score](#nmf-with-selected-number-of-topics-from-word-to-vec-with-coherence-score)

We'll use 15 topics for further exploration

In [None]:
best_n_topics = 15

Now, we'll print out all the topics and their top ten terms found from the NMF model with the chosen number of topics based on coherence scores (15)

In [None]:
cell_st = time()

docs_topics = print_top_words(
    topic_models,
    best_n_topics,
    n_top_words,
    start,
    terms,
    docs_terms,
    method=2,
    random_state=random_state,
)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

The topics and their top ten words are shown below, for the pre-determined choice of `random_state` of `42` specified in the `sklearn` NMF model

```
Top terms per topic, using random_state=42:
Topic 0: universe galaxies black stars gravitational matter dark light waves telescope
Topic 1: rocket spacex musk company falcon launch space flight virgin rockets
Topic 2: mars beagle martian planet rover mission nasa lander life surface
Topic 3: science people brain like says world research time human work
Topic 4: comet rosetta philae lander comets dust mission surface esa probe
Topic 5: higgs particle particles lhc physics cern matter boson energy collider
Topic 6: moon lunar apollo armstrong nasa astronauts surface earth space aldrin
Topic 7: shuttle nasa columbia space discovery launch astronauts mission station flight
Topic 8: water ice climate life ocean carbon scientists sea surface warming
Topic 9: station mir space russian iss crew peake astronaut russia soyuz
Topic 10: asteroid asteroids earth impact rock object collision near miles hit
Topic 11: space china satellites satellite uk government britain said chinese agency
Topic 12: planets star planet stars kepler earth telescope astronomers life light
Topic 13: cassini saturn titan pluto huygens rings planet spacecraft moons probe
Topic 14: sun solar eclipse earth magnetic venus mercury atmosphere weather field
```

**Observations**
1. Topics are very similar to those found in `4_nlp_trials.ipynb` and appear clearly separable. The `Planetary Research` and `Global Warming` topics found here are variants of the corresponding topics found previously, based on their top words. Differences in top words, relative to `4_nlp_trials.ipynb`, occur due to differences between the TFIDF vectorization used here and that used previously, [as documented above](#pre-processing-for-nmf).

Below is the source article data, horizontally concatenated with the NMF document-topic matrix and with the most popular topic for each article, which is found by taking the row-wise maximum of the document-topic matrix

In [None]:
df_with_topics = get_docs_with_topics(
    docs_topics=docs_topics,
    num_topics=best_n_topics,
    df_raw=df,
    mapper_dict=gensim_tfidf_mapping_dict[publication_name],
)
display(df_with_topics.head(3))
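
The row-wise maximum used to pick the most popular topic can be sketched in plain Python as follows. This is only an illustration under the assumption that each row of the document-topic matrix holds one weight per topic; the function name `most_popular_topics` and the toy values are hypothetical and do not reproduce `get_docs_with_topics`.

```python
def most_popular_topics(doc_topic_rows, topic_names):
    """Return the name of the highest-weighted topic for every document."""
    labels = []
    for row in doc_topic_rows:
        # Index of the largest weight in this row
        best_idx = max(range(len(row)), key=row.__getitem__)
        labels.append(topic_names[best_idx])
    return labels


toy_doc_topics = [
    [0.10, 0.70, 0.20],
    [0.55, 0.05, 0.40],
]
toy_names = ["Mars Exploration", "Academia", "Global Warming"]
labels = most_popular_topics(toy_doc_topics, toy_names)
print(labels)
```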

We'll now train a new NMF model with the chosen number of topics based on coherence score and get the corresponding factorization matrix (dictionary)

In [None]:
cell_st = time()

nmf_best = NMF(n_components=best_n_topics, max_iter=700, random_state=random_state)
nmf_transformed = nmf_best.fit_transform(docs_terms)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

Tabular versions of the top words per topic and of an alternate method of obtaining the source data with topics (and probabilities) are shown below

In [None]:
(
    topic_word_best,
    df_top_words_per_topic,
    df_top_vals_per_topic,
) = get_top_words_per_topic(
    nmf_best,
    vectorizer,
    gensim_tfidf_mapping_dict[publication_name],
    n_top_words,
    best_n_topics,
    False,
)
display(df_top_words_per_topic.head())

In [None]:
df_with_topics_v2 = get_docs_with_topics_v2(
    corpus_raw, nmf_transformed, gensim_tfidf_mapping_dict[publication_name], df
)
display(df_with_topics_v2.head())

We'll now plot a heatmap of the topics and their top 10 words, annotated with the words themselves and shaded by the value associated with each word in that topic

In [None]:
fig, ax = plt.subplots(figsize=(24, 8))
_ = sns.heatmap(
    df_top_vals_per_topic.rename(columns=gensim_tfidf_mapping_dict[publication_name]).T,
    annot=df_top_words_per_topic.rename(
        columns=gensim_tfidf_mapping_dict[publication_name]
    ).T,
    fmt="",
    center=0,
    cbar=False,
    cmap=sns.diverging_palette(0, 255, sep=1, n=256),
    ax=ax,
)
ax.set_xticklabels(ax.get_xticklabels(), ha="center", rotation=0)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0)
ax.set_title("Top Words per Topic", fontweight="bold", loc="left")

**Notes**
1. In the above heatmap, the darker the shade of blue the stronger the relationship between the word and that topic.

In [None]:
_ = plot_horiz_bar(
    topic_word_best.T,
    ptitle="",
    y_tick_mapper_list=list(gensim_tfidf_mapping_dict[publication_name].values()),
    fig_size=(60, 8),
    xspacer=0.001,
    yspacer=0.3,
    ytick_font_size=18,
    title_font_size=20,
    annot_font_size=16,
    n_bars=10,
    n_plots=topic_word_best.T.shape[1],
    n_cols=best_n_topics,
    show_bar_labels=False,
)

**Observations**
1. These charts generally resemble those found in the previous notebook `4_nlp_trials.ipynb`, with exceptions due to differences in TFIDF vectorization [as mentioned earlier in this section](#nmf-with-selected-number-of-topics-from-word-to-vec-with-coherence-score).

<a id="exploring-topics-combined-with-source-data"></a>

### 3.6. [Exploring topics combined with source data](#exploring-topics-combined-with-source-data)

The number of articles for which each topic is the most popular one, counted separately for each year, is shown below

In [None]:
topics_by_timeframe = (
    df_with_topics.groupby(["most_popular_topic", "year"])
    .size()
    .reset_index()
    .sort_values(by=["most_popular_topic", 0, "year"], ascending=False)
    .rename(columns={0: "count"})
)
topics_by_timeframe.head()
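
For readers less familiar with `groupby(...).size()`, the same counting can be sketched without `pandas`. The toy rows below are hypothetical; the point is only that each (topic, year) pair becomes one row holding the number of articles in that group.

```python
from collections import Counter

# One (most_popular_topic, year) pair per article
toy_rows = [
    ("Mars Exploration", 2004),
    ("Mars Exploration", 2004),
    ("Academia", 2004),
    ("Mars Exploration", 2012),
]
counts = Counter(toy_rows)

# Flatten into records resembling the rows of topics_by_timeframe
topics_by_year = [
    {"most_popular_topic": topic, "year": year, "count": n}
    for (topic, year), n in sorted(counts.items())
]
print(topics_by_year)
```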

Here, we will show a heatmap of the most popular topic by year

In [None]:
altair_datetime_heatmap(
    topics_by_timeframe,
    x="year:O",
    y="most_popular_topic:N",
    xtitle="Year",
    ytitle="Most popular topic",
    tooltip=[
        {"title": "Year", "field": "year", "type": "ordinal",},
        {
            "title": "Most popular topic",
            "field": "most_popular_topic",
            "type": "nominal",
        },
        {
            "title": "Number of occurrences as main topic",
            "field": "count",
            "type": "quantitative",
        },
    ],
    cmap="yelloworangered",
    legend_title="",
    color_by_col="count:Q",
    yscale="log",
    axis_tick_font_size=12,
    axis_title_font_size=16,
    title_font_size=20,
    legend_fig_padding=10,  # default is 18
    y_axis_title_alignment="left",
    fwidth=700,
    fheight=450,
    file_path=Path().cwd() / "reports" / "figures" / "my_heatmap.html",
    save_to_html=False,
    sort_y=[],
    sort_x=[],
)

Next, we will show a bar chart of the number of occurrences of `"Space Funding Bodies"` as the most popular topic, relative to the earliest year in which it appears as the most popular topic
- this will approximate the change in public interest in this topic over the years investigated

In [None]:
funds = (
    topics_by_timeframe[
        topics_by_timeframe["most_popular_topic"] == "Space Funding Bodies"
    ]
    .set_index("year")["count"]
    .sort_index()
)
funds = funds / funds.loc[funds.index.min()]
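
The normalization above divides every yearly count by the count for the earliest year in the index. A minimal pure-Python sketch of the same arithmetic, using hypothetical counts and a hypothetical function name `relative_to_first_year`, is:

```python
def relative_to_first_year(counts_by_year):
    """Scale yearly counts so that the earliest year equals 1.0."""
    first_year = min(counts_by_year)
    baseline = counts_by_year[first_year]
    return {year: counts_by_year[year] / baseline for year in sorted(counts_by_year)}


toy_counts = {1983: 6, 1981: 4, 1982: 2}
relative = relative_to_first_year(toy_counts)
print(relative)
```

A value of 1.5 then means 50% more articles than in the baseline year, so the bars are read as ratios rather than raw counts.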

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
funds.plot(kind="bar", ax=ax, rot=45, align="edge", width=0.8)
ax.set_title(
    "Cyclic variation in funding as main topic in article",
    fontsize=18,
    fontweight="bold",
)
ax.set_xlabel(None)
h = plt.ylabel(
    f"Funding\n(rel. to {funds.index.min()})", labelpad=65, fontweight="bold"
)
h.set_rotation(0)

**Observations**
1. The same implementation of NMF (from the `sklearn` library) was used, with the same number of topics. It is not surprising that the above chart is similar qualitatively (appearance of peaks and dips at locations very similar to those found earlier) and quantitatively (peak heights) to that found in the `4_nlp_trials.ipynb` notebook.
2. The differences that do occur are a peak in 2007 that is stronger than previously, a peak in 2013 that is weaker than previously and fewer occurrences of this topic in 2014, 2015 and 2016 than previously. These are due to the
   - different hyperparameter settings needed in the TFIDF vectorization in order to use Gensim's Word2Vec model to select the number of topics
   - `random_state` enforced here, which was not done previously

<a id="topic-modeling-using-gensim-nmf-without-tfidf-vectorization"></a>

## 4. [Topic modeling using Gensim NMF without TFIDF vectorization](#topic-modeling-using-gensim-nmf-without-tfidf-vectorization)

Here, we'll use Gensim's implementation of NMF, without TFIDF vectorization, to retrieve the optimal number of topics. This will be done without TFIDF Vectorization from either [`sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer) or [`gensim`](https://radimrehurek.com/gensim/models/tfidfmodel.html#gensim.models.tfidfmodel.TfidfModel) itself.

In [None]:
corpus_raw = df.loc[:, "text"].values.tolist()

<a id="pre-processing-for-gensim-nmf"></a>

### 4.1. [Pre-processing for Gensim NMF](#pre-processing-for-gensim-nmf)

First, we'll clean the text of the articles

In [None]:
cell_st = time()

pipe = Pipeline(steps=[("cleaner", TextCleaner(split=False))])
corpus_raw_cleaned = pipe.fit_transform(corpus_raw)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

<a id="gensim-tokenization"></a>

### 4.2. [Gensim tokenization](#gensim-tokenization)

We'll now tokenize the cleaned sentences into a list of words
- this will convert each document into a list of lowercase tokens, ignoring tokens that are too short (fewer than 2 characters) or too long (more than 15 characters)

In [None]:
cell_st = time()

# data_words = list(sent_to_words(corpus_raw))
data_words = list(sent_to_words(corpus_raw_cleaned))

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)
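
The source of `sent_to_words` is not shown here. Assuming it applies length limits of 2 and 15 characters (the defaults of `gensim.utils.simple_preprocess`), the filtering can be approximated in plain Python as below. The function name `simple_tokens` and the ASCII-only pattern are simplifications made for illustration.

```python
import re


def simple_tokens(text, min_len=2, max_len=15):
    """Lowercase the text and keep alphabetic tokens within the length limits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = simple_tokens("A rover landed on Mars; electroencephalography followed.")
print(tokens)
```

The one-letter token `a` and the 22-letter token `electroencephalography` are discarded, while two-letter tokens such as `on` are kept.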

<a id="gensim-bi--and-tri-gram-models-and-lemmatization"></a>

### 4.3. [Gensim bi- and tri-gram models and lemmatization](#gensim-bi--and-tri-gram-models-and-lemmatization)

We'll use Gensim's `Phrases` module to build bigram and trigram models

In [None]:
cell_st = time()

bigram_model, trigram_model = get_bigrams_trigrams(data_words, 5, 100)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

Next, we'll perform the following pre-processing
- remove stopwords
- (optional) create bigrams
- (optional) lemmatize

In [None]:
cell_st = time()

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words, all_stop_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops, bigram_model)

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(
    data_words_nostops, allowed_postags=["NOUN", "ADJ", "VERB", "ADV"]
)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

<a id="use-gensim-to-perform-bag-of-words-transformation"></a>

### 4.4. [Use Gensim to perform Bag-of-Words transformation](#use-gensim-to-perform-bag-of-words-transformation)

Now, we'll create a corpus comprising an assigned ID and corresponding frequency of words from the cleaned list of words (where stopwords were removed) above. This is Gensim's document conversion into a bag-of-words format, giving a list of tuples comprising token identifier and count (frequency).

In [None]:
cell_st = time()

# Create Dictionary
# id2word = corpora.Dictionary(data_lemmatized)
id2word = corpora.Dictionary(data_words_nostops)

# Term Document Frequency for corpus
# corpus = [id2word.doc2bow(text) for text in data_lemmatized]
corpus = [id2word.doc2bow(text) for text in data_words_nostops]

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds")
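
The bag-of-words conversion can be sketched without Gensim as follows. This is an illustration only: the helper names `build_vocab` and `doc_to_bow` are hypothetical, and IDs are assigned here in order of first appearance, which may differ from the numbering that `corpora.Dictionary` produces.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign an integer ID to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token_id, count) tuples."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [["mars", "rover", "mars"], ["comet", "rover"]]
token2id = build_vocab(toy_docs)
bow_corpus = [doc_to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(bow_corpus)
```

Word order within a document is discarded; only the token identifiers and their frequencies remain, which is the representation the NMF model consumes.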

<a id="use-gensim-nmf-without-tfidf-vectorization-to-find-number-of-topics"></a>

### 4.5. [Use Gensim NMF without TFIDF vectorization to find number of topics](#use-gensim-nmf-without-tfidf-vectorization-to-find-number-of-topics)

We'll now train Gensim's NMF model. A helper function below will iterate over the number of topics and compute the coherence score for each number.

In [None]:
cell_st = time()

model_dict, coherence_values = compute_coherence_values(
    corpus=corpus,
    id2word=id2word,
    # texts=data_lemmatized,
    texts=data_words_nostops,
    limit=limit,
    start=start,
    step=step,
    chunk_size=chunk_size,
    random_state=random_state,
)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

The coherence scores are graphed below by number of topics used, [with an annotation](https://matplotlib.org/2.0.0/users/annotations.html#annotating-with-text-with-box) showing the number of topics with the highest coherence score

In [None]:
plot_coherence_scores(coherence_values, start, limit, step, (12, 4))
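
The annotated point can also be recovered numerically. The sketch below assumes that the i-th score corresponds to `start + i * step` topics, which is how the range is described in this notebook; the function name `best_num_topics` and the toy scores are hypothetical.

```python
def best_num_topics(scores, start, step):
    """Return (number of topics, score) for the highest coherence score."""
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return start + best_idx * step, scores[best_idx]


toy_scores = [0.31, 0.35, 0.33, 0.36, 0.34]
best_n, best_score = best_num_topics(toy_scores, start=10, step=1)
print(best_n, best_score)
```

As the observations in this notebook point out, the maximum alone is not always the best choice, so this value is better treated as one input alongside the shape of the curve.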

**Notes**
1. The scores here are not directly comparable to those found earlier since
   - the manual calculation of coherence scores earlier has not been verified against the [built-in gensim calculation](https://radimrehurek.com/gensim/models/coherencemodel.html#gensim.models.coherencemodel.CoherenceModel.get_coherence) used by this helper function
   - pre-processing (text cleaning) was slightly different here than that used with the earlier approach
   - the approach being used here (NMF without TFIDF) is different to that from earlier (word embedding with TFIDF)

**Observations**
1. There isn't strong evidence of an increasing trend in the coherence score with the number of topics, nor of a plateau in the scores. While Gensim model hyperparameter optimization could be further explored, the scores do not show a preference for a specific choice of number of topics over the others.

<a id="nmf-with-selected-number-of-topics-from-gensim-coherence-score-without-tfidf-vectorization"></a>

### 4.6. [NMF with selected number of topics from Gensim coherence score without TFIDF vectorization](#nmf-with-selected-number-of-topics-from-gensim-coherence-score-without-tfidf-vectorization)

We'll use 15 topics for further exploration

In [None]:
best_n_topics = 15

In [None]:
cell_st = time()

best_model = nmf.Nmf(
    corpus=corpus,
    id2word=id2word,
    num_topics=best_n_topics,
    chunksize=chunk_size,  # no. of docs to be used in each training chunk
    passes=10,
    kappa=1.0,
    minimum_probability=0.01,
    w_max_iter=200,
    w_stop_condition=0.0001,
    h_max_iter=50,
    h_stop_condition=0.001,
    eval_every=10,
    normalize=True,
    random_state=random_state,
)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

Now, we'll print out all the topics found from the Gensim NMF model

In [None]:
print_top_words_gensim(
    gensim_nmf_model=best_model,
    mapper_dict=gensim_non_tfidf_mapping_dict[publication_name],
    top_n_words=n_top_words,
    random_state=random_state,
)

The topics and their top ten words are shown below, for the pre-determined choice of `random_state` of `42` specified in the Gensim NMF model

```
Top terms per cluster, using random_state=42:
Topic 0: life time earth comet years scientists like species book dna
Topic 1: space earth spacecraft satellites orbit satellite mission solar rocket moon
Topic 2: particles universe particle theory physics higgs energy lhc matter time
Topic 3: people memory like memories price day dont think remember things
Topic 4: space station nasa shuttle astronauts launch mission said flight crew
Topic 5: brain cells stem human tissue body cell neurons embryonic brains
Topic 6: number atomic species new matter xenon human tellurium dark numbers
Topic 7: says moon people like time lunar world going apollo think
Topic 8: said time dawkins think people human like new dont world
Topic 9: stars telescope star said light black astronomers galaxy years universe
Topic 10: planet planets solar earth water sun scientists surface pluto ice
Topic 11: work ice theory equation space climate surface time mathematical moduli
Topic 12: research says new university scientists work matter scientific uk dark
Topic 13: science scientific scientists research people public world climate new like
Topic 14: mars mission life surface martian planet water nasa said scientists
```

**Observations**
1. Three new topics are introduced here, relative to the `TFIDF+NMF` or CorEx approaches. Some existing ones are modified in their appearance.
2. One of these new topic names (`Dark Matter Theories`) is a weaker choice, based on its component words, and could be similar to the `Gravity and Black Holes` topic previously identified. Two topics could possibly be combined (`Brain Research` and `Learning and Memory`), resulting in 14 topics, as suggested by cross-validation with coherence scores. NMF model hyperparameter optimization could be explored to further study the topic selection.

In [None]:
cell_st = time()

_ = plot_horiz_bar_gensim(
    best_model,
    id2word,
    gensim_non_tfidf_mapping_dict[publication_name],
    fig_size=(40, 35),
)

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

We'll append the assigned topic to each document's row in the original data.
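
The internals of `format_topics_sentences` are not shown in this notebook. As a rough sketch of the idea, and assuming the helper picks the highest-weighted topic per document, the selection could look like the following. The weights, the `topic_names` mapping and the `dominant_topic` function are hypothetical stand-ins; the `(topic_id, weight)` pairs mimic the shape Gensim topic models return for a bag-of-words document.

```python
# Toy per-document topic weights as (topic_id, weight) pairs
doc_topic_weights = [
    [(0, 0.10), (4, 0.75), (14, 0.15)],
    [(2, 0.55), (9, 0.45)],
    [],  # a document for which no topic was returned
]

# Hypothetical id-to-name mapping (stand-in for the notebook's mapping dict)
topic_names = {0: "Topic A", 2: "Topic B", 4: "Topic C", 9: "Topic D", 14: "Topic E"}


def dominant_topic(weights, names, default="Unassigned"):
    """Return the name of the highest-weighted topic, or `default` if none."""
    if not weights:
        return default
    topic_id, _ = max(weights, key=lambda pair: pair[1])
    return names.get(topic_id, default)


most_popular = [dominant_topic(w, topic_names) for w in doc_topic_weights]
print(most_popular)
```

Handling the empty case explicitly avoids an error from `max` on documents that receive no topic.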

In [None]:
cell_st = time()

df_with_topics = format_topics_sentences(
    best_model, corpus, df, gensim_non_tfidf_mapping_dict[publication_name]
)
display(df_with_topics.head(2))

total_minutes, total_seconds = divmod(time() - cell_st, 60)
print(
    f"Cell execution time: {int(total_minutes):d} minutes, {total_seconds:.2f} seconds"
)

<a id="exploring-gensim-nmf-topics-combined-with-source-data"></a>

### 4.7. [Exploring Gensim NMF topics combined with source data](#exploring-gensim-nmf-topics-combined-with-source-data)

Here, we will show a heatmap of how often each topic is the most popular one in a document, by year, as found by Gensim's implementation of NMF (recall that this was done above without TFIDF vectorization).

In [None]:
topics_by_timeframe = (
    df_with_topics.groupby(["most_popular_topic", "year"])
    .size()
    .reset_index()
    .sort_values(by=["most_popular_topic", 0, "year"], ascending=False)
    .rename(columns={0: "count"})
)
topics_by_timeframe.head()

In [None]:
from pathlib import Path

altair_datetime_heatmap(
    topics_by_timeframe,
    x="year:O",
    y="most_popular_topic:N",
    xtitle="Year",
    ytitle="Most popular topic",
    tooltip=[
        {"title": "Year", "field": "year", "type": "ordinal",},
        {
            "title": "Most popular topic",
            "field": "most_popular_topic",
            "type": "nominal",
        },
        {
            "title": "Number of occurrences as main topic",
            "field": "count",
            "type": "quantitative",
        },
    ],
    cmap="yelloworangered",
    legend_title="",
    color_by_col="count:Q",
    yscale="log",
    axis_tick_font_size=12,
    axis_title_font_size=16,
    title_font_size=20,
    legend_fig_padding=10,  # default is 18
    y_axis_title_alignment="left",
    fwidth=700,
    fheight=450,
    file_path=Path().cwd() / "reports" / "figures" / "my_heatmap.html",
    save_to_html=False,
    sort_y=[],
    sort_x=[],
)

Next, we will show a bar chart of the number of times `"Space Funding Bodies"` is the most popular topic in each year, relative to the count in the earliest year in which it appears.
- This approximates how public interest in this topic changed over the years investigated.
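
To make the normalization concrete, here is a minimal sketch with made-up yearly counts (the years and values below are hypothetical, not taken from the data). It mirrors the `funds.loc[funds.index.min()]` baseline used in the next cell, but with a plain dictionary.

```python
# Hypothetical yearly counts for one topic (not the notebook's data)
counts_by_year = {2003: 4, 2001: 2, 2002: 3}

# Use the earliest year present as the baseline
baseline_year = min(counts_by_year)
baseline = counts_by_year[baseline_year]

# Divide every year's count by the baseline count, in chronological order
relative = {year: counts_by_year[year] / baseline for year in sorted(counts_by_year)}

print(baseline_year, relative)
```

Because years in which the topic was never the most popular one are absent from the grouped counts, the baseline is the earliest year with at least one such document, which may differ from the first year in the corpus.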

In [None]:
funds = (
    topics_by_timeframe[
        topics_by_timeframe["most_popular_topic"] == "Space Funding Bodies"
    ]
    .set_index("year")["count"]
    .sort_index()
)
# Normalize the counts by the count in the earliest year
funds = funds / funds.loc[funds.index.min()]

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
funds.plot(kind="bar", ax=ax, rot=45, align="edge", width=0.8)
ax.set_title(
    "Cyclic variation in funding as main topic in article",
    fontsize=18,
    fontweight="bold",
)
ax.set_xlabel(None)
# Label the baseline year from the data rather than hard-coding it
h = plt.ylabel(
    f"Funding\n(rel. to {funds.index.min()})", labelpad=65, fontweight="bold"
)
h.set_rotation(0)

**Observations**
1. The multi-modal features seen with the `TFIDF+NMF` and CorEx implementations also appear here: a broadened peak centered at 2014, a weaker peak in 2004 and a peak in 2007. Choosing a name for this topic was not as difficult as it was for two of the other assigned topic names.
2. Peak heights are distinctly weaker than those found with the `TFIDF+NMF` and CorEx approaches, indicating that documents previously assigned to this topic are now being placed in another topic.

<a id="conclusions"></a>

## 5. [Conclusions](#conclusions)

1. With `TFIDF` vectorization, `sklearn`'s `NMF` found two topics that were not seen previously using `sklearn`'s NMF with TFIDF in `4_nlp_trials.ipynb`. Because this approach is similar to the previously used `NMF` approach, the other topics identified here were in good agreement with those found earlier, as expected. The number of topics was chosen at the onset of a plateau in coherence scores computed using a word embedding model. A weakly increasing trend (improvement) in scores was noticed after the plateau, but it was not considered when selecting the optimal number of topics. This offers reasonable validation of the number of topics found by the NMF+TFIDF and CorEx approaches in previous notebooks (`4_nlp_trials.ipynb` and `5_corex_nlp_trials.ipynb`).
2. Without `TFIDF` vectorization, Gensim's `NMF` found three topics that were not observed previously using `sklearn`'s NMF with TFIDF in `4_nlp_trials.ipynb`. A few of the other topics found using Gensim's `NMF` (without TFIDF) were variants of those found with that notebook's NMF approach. The agreement between the topics found here and those from the previous notebook was not as good as with the above approach (NMF with TFIDF, from section 3 of this notebook). There is some evidence that the chosen number of topics is one too many. If this approach (Gensim's `NMF` without `TFIDF` vectorization) is to be used further, then model hyperparameter optimization with 14 and 15 topics should be investigated.
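
The "onset of a plateau" rule mentioned in the first conclusion can be made explicit. The sketch below is one possible formalization, not the procedure used in this notebook: `first_plateau`, its tolerance `tol` and the coherence values are all hypothetical. It returns the first topic count after which the score gain falls below the tolerance, so a weak upward drift after the plateau is ignored.

```python
def first_plateau(num_topics, scores, tol=0.005):
    """Return the first topic count after which the score gain drops below `tol`."""
    for k, current, following in zip(num_topics, scores, scores[1:]):
        if following - current < tol:
            return k
    # No plateau found: fall back to the largest topic count tried
    return num_topics[-1]


# Hypothetical coherence scores, not results from this notebook
ks = [10, 11, 12, 13, 14, 15, 16, 17]
coherence = [0.40, 0.44, 0.47, 0.49, 0.51, 0.512, 0.515, 0.519]

best_k = first_plateau(ks, coherence)
print(best_k)
```

The choice of `tol` decides whether a slow post-plateau rise counts as an improvement, so it is worth checking how sensitive the selected topic count is to that value.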