# Naming topics

## Importing modules & defining globals


In [17]:
import os
import numpy as np
import pandas as pd

from gensim.models import LdaModel

RAW_DATA_PATH = "../data/raw"
PKL_DATA_PATH = "../data/pickles"


## Loading the saved `LdaMulticore` model


In [18]:
path = os.path.join(PKL_DATA_PATH, "lda_model.pkl")
lda_model = LdaModel.load(path)


## Naming the topics

According to [Luis Serrano](https://ca.linkedin.com/in/luisgserrano) in this [video](https://www.youtube.com/watch?v=T05t-SqKArY),
labeling (naming) a topic is a task that is best done by humans.

A naive approach would be:

    the name of the topic would be included in the title of the articles.

<!-- here is a simple hack to try using ML for the topic name. -->

<!-- Exploiting TF-IDF, for each topic, the lower the IDF score of a token, the more it appears in documents; intuitively the higher the chance that this particular token is the name of the topic. -->

So we gather the top contributing words of every topic.


In [19]:
topics_word = lda_model.show_topics(formatted=False)
topics_word


[(0,
  [('patient', 0.030850375),
   ('group', 0.008121144),
   ('analysis', 0.0070354235),
   ('covid', 0.0062768757),
   ('risk', 0.005522125),
   ('data', 0.0053980565),
   ('high', 0.005219881),
   ('model', 0.005129929),
   ('treatment', 0.005111212),
   ('result', 0.0050725886)]),
 (1,
  [('patient', 0.011360956),
   ('health', 0.009242357),
   ('level', 0.006977953),
   ('analysis', 0.006772014),
   ('treatment', 0.006655649),
   ('covid', 0.0063779918),
   ('data', 0.0062814853),
   ('high', 0.0061813607),
   ('model', 0.006060855),
   ('include', 0.005775447)]),
 (2,
  [('cell', 0.0097719375),
   ('high', 0.007295579),
   ('group', 0.006840076),
   ('effect', 0.0059798807),
   ('patient', 0.0059255785),
   ('control', 0.0055863317),
   ('result', 0.005190926),
   ('increase', 0.0050835307),
   ('health', 0.005002963),
   ('risk', 0.004952738)]),
 (3,
  [('patient', 0.009856614),
   ('high', 0.00812396),
   ('base', 0.007709249),
   ('specie', 0.0075495476),
   ('data', 0.00719

Ignoring for a moment the participation ratio of the tokens, we keep only the words of each topic:


In [20]:
k = len(topics_word)
# keep only the tokens of each topic, dropping their weights
topic_word_sets = [{word for word, _ in words} for _, words in topics_word]
topic_word_sets


[{'analysis',
  'covid',
  'data',
  'group',
  'high',
  'model',
  'patient',
  'result',
  'risk',
  'treatment'},
 {'analysis',
  'covid',
  'data',
  'health',
  'high',
  'include',
  'level',
  'model',
  'patient',
  'treatment'},
 {'cell',
  'control',
  'effect',
  'group',
  'health',
  'high',
  'increase',
  'patient',
  'result',
  'risk'},
 {'analysis',
  'base',
  'cell',
  'data',
  'group',
  'health',
  'high',
  'patient',
  'result',
  'specie'},
 {'analysis',
  'cancer',
  'cell',
  'data',
  'disease',
  'group',
  'health',
  'high',
  'increase',
  'patient'},
 {'covid',
  'health',
  'high',
  'include',
  'increase',
  'level',
  'low',
  'patient',
  'report',
  'specie'},
 {'base',
  'cell',
  'data',
  'disease',
  'health',
  'high',
  'model',
  'patient',
  'protein',
  'result'}]

In [21]:
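# note: `&=` updates the set in place, and `intersection` is the same object as
# topic_word_sets[0], so that set is also reduced to the common words
intersection = topic_word_sets[0]
for tw_set in topic_word_sets[1:]:
    intersection &= tw_set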

intersection


{'high', 'patient'}

It seems that some words already contribute highly to all $7$ topics:

\[high, patient\]
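
One detail of the loop above is worth spelling out: `intersection` is bound to the very same set object as `topic_word_sets[0]`, and `&=` updates that object in place, so the first topic's set is shrunk to the common words as well. A minimal sketch of the effect and of a non-mutating alternative follows; the `toy_sets` values are made up for illustration and are not the model's topics.

```python
# Hypothetical toy sets, standing in for topic_word_sets.
toy_sets = [
    {"high", "patient", "risk"},
    {"high", "patient", "covid"},
    {"high", "patient", "cell"},
]

# In-place variant, run on copies so toy_sets itself stays intact.
aliased = [set(words) for words in toy_sets]
common = aliased[0]          # same object as aliased[0], not a copy
for words in aliased[1:]:
    common &= words          # in-place update: aliased[0] shrinks too

# Non-mutating variant: builds a new set and leaves the inputs untouched.
safe_common = set.intersection(*toy_sets)

print(aliased[0])    # the first set has lost "risk"
print(toy_sets[0])   # still holds all three words
print(safe_common)
```

Whether the in-place behaviour matters depends on whether the per-topic sets are reused afterwards, as they are in the following cells.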

---

Let's try to obtain the unique terms per topic, i.e. the tokens that appear among the top words of exactly $1$ topic.

To do that, we need to filter the sets.

For each topic:

- make a union
  of each set of topic/words,
  not including the current topic


In [22]:
union = [
    set.union(
        *[
            topic_word_sets[i]
            for i in range(k)  # 3. of each set of topic/words
            if i != j  # 4. not including the current topic
        ]
    )  # 2. make a union
    for j in range(k)  # 1. for each topic
]
union


[{'analysis',
  'base',
  'cancer',
  'cell',
  'control',
  'covid',
  'data',
  'disease',
  'effect',
  'group',
  'health',
  'high',
  'include',
  'increase',
  'level',
  'low',
  'model',
  'patient',
  'protein',
  'report',
  'result',
  'risk',
  'specie',
  'treatment'},
 {'analysis',
  'base',
  'cancer',
  'cell',
  'control',
  'covid',
  'data',
  'disease',
  'effect',
  'group',
  'health',
  'high',
  'include',
  'increase',
  'level',
  'low',
  'model',
  'patient',
  'protein',
  'report',
  'result',
  'risk',
  'specie'},
 {'analysis',
  'base',
  'cancer',
  'cell',
  'covid',
  'data',
  'disease',
  'group',
  'health',
  'high',
  'include',
  'increase',
  'level',
  'low',
  'model',
  'patient',
  'protein',
  'report',
  'result',
  'specie',
  'treatment'},
 {'analysis',
  'base',
  'cancer',
  'cell',
  'control',
  'covid',
  'data',
  'disease',
  'effect',
  'group',
  'health',
  'high',
  'include',
  'increase',
  'level',
  'low',
  'model',
  

Now, for each topic/words set, take the difference between the set and its complementing union (the union of all the other sets).


In [23]:
unique_topic_word = [topic_word_sets[i] - union[i] for i in range(k)]
unique_topic_word


[set(),
 {'treatment'},
 {'control', 'effect', 'risk'},
 set(),
 {'cancer'},
 {'low', 'report'},
 {'protein'}]
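
The union-then-difference steps above can also be expressed by counting in how many topic sets each word occurs and keeping the words counted once. This is only a sketch of an equivalent formulation, not the notebook's method; the helper name `unique_words_per_topic` and the `toy_sets` values are hypothetical.

```python
from collections import Counter


def unique_words_per_topic(word_sets):
    """For each set, keep the words that occur in no other set."""
    # Each set contributes a word at most once, so the count of a word
    # equals the number of sets that contain it.
    counts = Counter(word for words in word_sets for word in words)
    return [{word for word in words if counts[word] == 1} for words in word_sets]


# Hypothetical toy sets, made up for illustration only.
toy_sets = [
    {"patient", "high", "covid"},
    {"patient", "high", "cancer", "cell"},
    {"patient", "cell", "protein"},
    {"patient", "high"},
]

unique = unique_words_per_topic(toy_sets)
print(unique)
```

The last toy set shares all of its words with other sets, so it ends up empty, which mirrors the empty sets in the output above.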

Notice that the results contain empty sets. One intuition would be:

    the vocab is built on a unigram model, while a topic name is likely to be an n-gram (several words)

The same naive approach could be extended in several ways, e.g.:

1. Relying on more than the top $10$ words; that might yield non-empty sets, but could equally well produce more empty sets, as longer word lists can also overlap more.
1. Building n-gram (bigram, trigram, ...) models, and applying the same naive approach.

---

Apart from that caveat, it seems the approach gives acceptable insights into what some of the topics are:

- topic \#$1$ could be about `treatment`s
- topic \#$4$ could be about `cancer`
- topic \#$6$ could be about `protein`s

Other results don't seem particularly informative:

- topic \#$2$ could be about `control`, `effect`, `risk`, or any combination of two or all three of them.
  Perhaps their contribution rates could lead to another conclusion.
- topic \#$5$ has the values `low` & `report`, neither of which is very descriptive.


In [24]:
# looking at the participation of the unique words in topic 2
_, topic_2_words = topics_word[2]
unique_participants = [
    (word, weight)
    for word, weight in topic_2_words
    if word in unique_topic_word[2]
]
unique_participants


[('effect', 0.0059798807), ('control', 0.0055863317), ('risk', 0.004952738)]

The participation rates merely sort the findings, as the differences between them are at most about $10^{-3}$.
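
To make the size of these differences concrete, the three weights printed above can be normalised so that they sum to one. The snippet below is a small worked example using only those three numbers; the names `shares` and `spread` are introduced here for illustration.

```python
# Weights copied from the output of the previous cell.
unique_participants = [
    ("effect", 0.0059798807),
    ("control", 0.0055863317),
    ("risk", 0.004952738),
]

total = sum(weight for _, weight in unique_participants)

# Share of each word among the three unique words of topic 2.
shares = {word: weight / total for word, weight in unique_participants}

# Gap between the largest and the smallest raw weight.
weights = [weight for _, weight in unique_participants]
spread = max(weights) - min(weights)

for word, share in shares.items():
    print(f"{word}: {share:.2%}")
print(f"spread: {spread:.6f}")
```

The shares come out at roughly 36%, 34% and 30%, so no single word clearly dominates the other two.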

Here a bigram model could be helpful: it would show how often each pair of these words appears together. The topic name might then be something like:

- `risk control`
- `risk effect`
- `control effect`, maybe?
- ...

Or a trigram model:

- `control risk effect`
- `effect` _of_ `risk control`
- ...
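
As a rough illustration of the bigram idea, adjacent pairs of the candidate words can be counted directly in tokenised text. This is a plain-Python sketch rather than a bigram topic model; the function `count_bigrams` and the `docs` below are hypothetical and made up for illustration, not taken from the corpus.

```python
from collections import Counter


def count_bigrams(documents, vocabulary):
    """Count adjacent token pairs whose two tokens are both in `vocabulary`."""
    counts = Counter()
    for tokens in documents:
        # zip(tokens, tokens[1:]) yields every pair of neighbouring tokens.
        for left, right in zip(tokens, tokens[1:]):
            if left in vocabulary and right in vocabulary:
                counts[(left, right)] += 1
    return counts


# Hypothetical tokenised documents, made up for illustration only.
docs = [
    ["risk", "control", "measure", "reduce", "risk"],
    ["effect", "of", "risk", "control"],
    ["control", "effect", "observe"],
]
candidates = {"control", "effect", "risk"}

bigram_counts = count_bigrams(docs, candidates)
print(bigram_counts.most_common())
```

On real data, the most frequent pair would be one candidate for a two-word topic name; word order is kept, so `risk control` and `control risk` are counted separately.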

---

## Loading the saved topics


In [25]:
df_topics = pd.read_csv(os.path.join(RAW_DATA_PATH, "topics.csv"))


### Assigning topic names


In [26]:
topic_dict = {
    1: "treatment",
    4: "cancer",
    6: "protein",
    2: "effect",  # selected by order of participation
    5: "report",  # selected intuitively as report could be more meaningful than low
    # randomly selecting remaining topic names from participants
    0: "covid",
    3: "specie",
}
columns = ["topic_1", "topic_2", "topic_3"]
df_topics.loc[:, columns] = df_topics[columns].replace(topic_dict).values
df_topics.head()


Unnamed: 0,ArticleID,topic_1,topic_1_prop,topic_2,topic_2_prop,topic_3,topic_3_prop
0,34153941,report,0.249494,treatment,0.210527,0.0,0.196882
1,34153942,report,0.45984,effect,0.289519,cancer,0.101766
2,34153964,3,0.65111,0.0,0.111941,treatment,0.082597
3,34153968,0,0.909579,treatment,0.030253,report,0.024919
4,34153978,cancer,0.552593,0.0,0.296178,3.0,0.087066
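
The output above still shows some numeric ids (`0`, `3`, `0.0`, `3.0`) that were not replaced by a name. The cause is not established here. As one possible way to make the mapping insensitive to whether an id is stored as an integer or a float, the sketch below maps each cell through a plain dictionary lookup. The frame `df` is a hypothetical miniature of `df_topics`, and `name_topic` is a helper introduced for illustration.

```python
import pandas as pd

# Same id-to-name assignment as topic_dict above.
topic_names = {
    0: "covid", 1: "treatment", 2: "effect", 3: "specie",
    4: "cancer", 5: "report", 6: "protein",
}

# Hypothetical miniature frame: one integer id column, one float id column
# with a missing value.
df = pd.DataFrame({
    "topic_1": [5, 3, 0],
    "topic_2": [1.0, 0.0, None],
})
columns = ["topic_1", "topic_2"]


def name_topic(value):
    """Return the name for a known id; leave anything else unchanged."""
    # In Python, 3 and 3.0 are equal and hash alike, so both find key 3.
    return topic_names.get(value, value)


named = df[columns].apply(lambda col: col.map(name_topic))
print(named)
```

Values that are not known ids, such as missing values or names assigned earlier, pass through unchanged.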


### Saving findings


In [27]:
df_topics.to_csv(os.path.join(RAW_DATA_PATH, "topics_named.csv"))
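
By default `to_csv` also writes the row index, and reading such a file back with `read_csv` yields an extra `Unnamed: 0` column like the one visible in the table above. The sketch below shows the difference that `index=False` makes, using an in-memory buffer and a hypothetical two-row frame instead of the real files.

```python
import io

import pandas as pd

# Hypothetical miniature of the named topics frame.
df = pd.DataFrame({
    "ArticleID": [34153941, 34153942],
    "topic_1": ["report", "report"],
})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

print(list(reloaded.columns))
print(list(clean.columns))
```

Passing `index_col=0` to `read_csv` is the alternative when the index is meant to be kept in the file.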
