# Topic Clustering

This notebook clusters ClaimReview claims with BERTopic to identify topics.

**Note about Reproducibility**:
Setting a `random_state` makes UMAP consistent between runs on the same machine, but results still vary across operating systems and CPU architectures because UMAP depends on the native NumPy implementation. Several issues in the `umap-learn` repository discuss this ([example](https://github.com/lmcinnes/umap/issues/525)), and the maintainers consider it unfixable.

Running this notebook on a different machine will therefore likely produce different clusters. The public version of this repository will include a download link to the trained topic model.
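Since the clusters depend on the platform, it can help to record the environment a model was trained in. The following is a minimal sketch using only the standard library; the helper name `environment_fingerprint` and the default package list are illustrative assumptions, not part of this repository.

```python
import platform
from importlib import metadata


def environment_fingerprint(packages=("umap-learn", "numba", "numpy", "bertopic")):
    """Collect the OS, CPU architecture, and package versions for this run.

    Hypothetical helper: the package list is an assumption about what is
    relevant to UMAP's numerical results.
    """
    info = {
        "system": platform.system(),    # e.g. Linux, Darwin, Windows
        "machine": platform.machine(),  # e.g. x86_64, arm64
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # package not installed in this environment
    return info


fingerprint = environment_fingerprint()
print(fingerprint)
```

Storing this dictionary next to the saved model makes it easier to tell whether two differing cluster sets came from different platforms or library versions.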

In [None]:
import pandas as pd
from bertopic import BERTopic
from umap import UMAP

In [None]:
df = pd.read_csv("../data/claimreview_en_subcols.csv")
claims = df["claimReviewed"]

In [None]:
umap_model = UMAP(random_state=42)
topic_model = BERTopic(umap_model=umap_model)

In [None]:
topics, probs = topic_model.fit_transform(claims)

In [None]:
topic_model.get_topic_info().to_csv("../additional_data/topic_list.csv", index=False)

In [None]:
topic_model.save("topics.bertopic")

#### Topics related to the war in Ukraine

| id  | count | topic                                      |
| --- | ----- | ------------------------------------------ |
| 4   | 217   | 4_ukraine_ukrainian_aid_putin              |
| 38  | 74    | 38_ukraine_ukrainian_russian_russia        |
| 270 | 11    | 270_fighter_jet_russian_ukrainian          |
| 279 | 10    | 279_volodymyr_suspended_russian_vladimir   |

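Insert the IDs of the relevant topics in the cell below (the IDs shown come from the run that produced the table above and will likely differ after retraining):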

In [None]:
ukr = [4, 38, 270, 279]
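Because topic IDs change whenever the model is retrained, a keyword filter over the topic names can shortlist candidates before picking IDs by hand. This is a rough sketch in plain Python, not a BERTopic API: `match_topic_ids` and the keyword stems are hypothetical, and the names are copied from the table above (in a real run they would presumably come from the `Name` column of `topic_model.get_topic_info()`).

```python
def match_topic_ids(topic_names, keywords):
    """Return IDs of topics whose label contains any keyword as a substring."""
    matched = []
    for name in topic_names:
        # Topic names have the form "<id>_<word>_<word>_...".
        topic_id, _, label = name.partition("_")
        if any(keyword in label for keyword in keywords):
            matched.append(int(topic_id))
    return matched


# Names copied from the table above.
names = [
    "4_ukraine_ukrainian_aid_putin",
    "38_ukraine_ukrainian_russian_russia",
    "270_fighter_jet_russian_ukrainian",
    "279_volodymyr_suspended_russian_vladimir",
]

# Illustrative keyword stems; substring matching is deliberately loose.
candidates = match_topic_ids(names, keywords=("ukrain", "russia"))
print(candidates)
```

Loose substring matching will also pick up unrelated topics that merely mention Russia, so the shortlist is a starting point for the manual selection rather than a replacement for it.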

In [None]:
doc_info = topic_model.get_document_info(claims)
doc_info

In [None]:
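# Keep only the documents assigned to the selected topics.
ukr_docs = doc_info[doc_info.Topic.isin(ukr)].sort_values("Topic")
# Attach publication dates; the join aligns on the row index shared with df.
ukr_docs = ukr_docs.join(df[["datePublished"]], how="left")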
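The join in the cell above matches rows by index label, so it is only correct if `doc_info` has the same row index as `df`. The toy example below illustrates this with made-up data; the frames are stand-ins for the notebook's `df` and `doc_info`, and the final check is a suggested sanity test, not something the original notebook does.

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` and `doc_info`; all values are made up.
df = pd.DataFrame({
    "claimReviewed": ["claim a", "claim b", "claim c", "claim d"],
    "datePublished": ["2022-03-01", "2022-03-02", "2022-03-03", "2022-03-04"],
})
doc_info = pd.DataFrame({
    "Document": df["claimReviewed"].to_list(),
    "Topic": [4, 12, 38, 4],
})

ukr = [4, 38]
ukr_docs = doc_info[doc_info["Topic"].isin(ukr)].sort_values("Topic")

# Filtering and sorting reorder rows but keep their index labels,
# so the join still finds the right date for each document.
ukr_docs = ukr_docs.join(df[["datePublished"]], how="left")

# Sanity check: each joined row still carries the claim text from the same df row.
assert (ukr_docs["Document"] == df.loc[ukr_docs.index, "claimReviewed"]).all()
print(ukr_docs)
```

If rows were dropped or the index was reset between loading `df` and fitting the model, this check would fail and the dates would be attached to the wrong claims.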

In [None]:
ukr_docs.to_csv("../data/claimreview_ukr.csv", index=False)

These documents should then be checked manually for relevance.