Task: Document Analysis System \
Takes a search query (a question in natural language) and retrieves the relevant paragraphs from the collected articles that are most likely to answer it. The retrieved paragraphs are then clustered, and their hidden topics are uncovered through topic modeling. \
Main article references can be found in: Notes.DOCX



# Phase 0: Install and Load Necessary Libraries/Packages

In [82]:
# Install sentence-transformers, needed for semantic search
!pip install pandas sentence-transformers
# !pip install sentence-transformers

# Install BERTopic
!pip install bertopic

# Install Visualization for BERTopic results
!pip install bertopic[visualization]

# Supporting libraries to run Llama 2 more easily
!pip install accelerate bitsandbytes xformers adjustText



In [83]:
# Log into hugging face
# Needed for permission to download Llama2
from huggingface_hub import notebook_login
notebook_login()


In [84]:
# Import Basic Python library
import json
import pandas as pd
import numpy as np

# Data Preprocessing related library
# reference: https://www.nltk.org/
# content cleaning, lowercasing, removing punctuation, removing stop words etc
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer

# Semantic Searching related library
from sentence_transformers import SentenceTransformer
from sentence_transformers import util
import torch
from torch import cuda

# Cosine Similarity, measure the similarity among embeddings
from sklearn.metrics.pairwise import cosine_similarity

# BERTopic Topic Modeling related library
from bertopic import BERTopic
# Fine-tuned Topic Representation
# Reduces stopwords from the resulting topic representations
from bertopic.representation import KeyBERTInspired, TextGeneration
# Clustering method that allows us to set the number of clusters
from sklearn.cluster import KMeans

# Progress Bar
from tqdm import tqdm
tqdm.pandas()

# Phase 1: Collect Newspaper Articles

* This phase was conducted outside of this notebook.
* The model processes JSON-formatted files, with each data point representing a unique article.
* Articles sourced from the Nexus Uni Database.
* The JSON file must be uploaded at the start of each runtime session (mounting the drive avoids the repeated upload).
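
As a rough sketch of the input format, the snippet below builds a one-article sample with the five fields this notebook reads (`title`, `source`, `date`, `byline`, `content`) and checks that none are missing. The sample values and the helper `missing_fields` are made up for illustration; the real file comes from the parser mentioned in the next cell.

```python
import json

# Hypothetical sample mirroring the fields this notebook reads from each article
sample_json = """
{
  "articles": [
    {
      "title": "Sample headline",
      "source": "Sample source",
      "date": "July 20, 2023 Thursday",
      "byline": "",
      "content": "Body text of the article."
    }
  ]
}
"""

REQUIRED_FIELDS = ("title", "source", "date", "byline", "content")

def missing_fields(article):
    """Return the required fields that are absent from one article record."""
    return [field for field in REQUIRED_FIELDS if field not in article]

sample_data = json.loads(sample_json)

# Map article index -> missing fields, keeping only articles with problems
problems = {}
for i, article in enumerate(sample_data["articles"]):
    missing = missing_fields(article)
    if missing:
        problems[i] = missing

print(problems)
```

Running a check like this before the extraction loop gives a clearer message than the `KeyError` a missing field would otherwise raise.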

In [85]:
# Load the JSON data from uploaded JSON documents
# This should be the output of the nexus_parser.py program
with open('Task4.json', 'r') as file:
  data = json.load(file)

In [86]:
# Extract stored information from the JSON file
# list used to store all post-process information
articles_info = []

In [87]:
# Loop through every article in the loaded JSON data
for article_data in data['articles']:

  # get each field of data information
  title = article_data['title']
  source = article_data['source']
  date = article_data['date']
  byline = article_data['byline']
  content = article_data['content']

  # Store article information
  article_info = {
    "title": title,
    "source": source,
    "date": date,
    "byline": byline,
    "content": content
  }

  # Append the extracted information for the current article to the target list
  articles_info.append(article_info)

In [88]:
# Convert the list of extracted dictionaries to a Pandas DataFrame for easier analysis
df = pd.DataFrame(articles_info)
# check if our dataframe is successfully constructed
df.head()

Unnamed: 0,title,source,date,byline,content
0,'It was a nightmare': Pinal County builds new ...,Newstex Blogs,Arizona Mirror,Jen Fifield/Votebeat,"Jun 16, 2023( Arizona Mirror: https://www.azmi..."
1,'The law contains 19 sections aimed at helping...,Newstex Blogs,Small Dead Animals,Kate,"March 10th, 2024 ( Small Dead Animals — Deliv..."
2,(EDITORIAL from Korea Herald on Feb. 26),ASEAN Tribune,"February 26, 2024 Monday",,26 Feb 2024 (Yonhap News Agency) Global chips...
3,05:38 EDT TSMC delays Arizona plant start amid...,Theflyonthewall.com,"July 20, 2023 Thursday 5:38 AM EST",,05:38 EDT TSMC delays Arizona plant start amid...
4,05:38 EDT TSMC delays Arizona plant start amid...,Theflyonthewall.com,"July 20, 2023 Thursday 5:38 AM EST",,05:38 EDT TSMC delays Arizona plant start amid...


In [89]:
# Sample analysis based purely on this Pandas DataFrame:
# get the unique sources of the article samples

# unique_sources_list = df['source'].unique()
# print(str(len(unique_sources_list)) + " unique sources in the DataFrame:")
# for source in unique_sources_list:
#    print(source)

In [90]:
# Check whether the information was extracted successfully by inspecting the date column
# The remaining columns cannot be validated automatically

# Regular expression pattern to match the start of the date with a proper month
pattern = r'^(January|February|March|April|May|June|July|August|September|October|November|December|Unknown|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|0?[1-9]|1[0-2])'

# Function to check if the date starts with a proper month
def check_proper_date(date):
  return bool(re.match(pattern, date))

# Apply the function to the "date" column and filter out the invalid rows
invalid_rows = df[~df['date'].apply(check_proper_date)]

# Print the invalid rows
# If all dates under column 'date' are valid
if invalid_rows.empty:
  print("All entries in the 'date' column contains a proper month name.")
else:
# If there are invalid dates
  print("The following rows do not contain a proper month name and is not listed as 'unknown':")
  print(invalid_rows)

The following rows do not contain a proper month name and is not listed as 'unknown':
                                                 title          source  \
0    'It was a nightmare': Pinal County builds new ...  Newstex Blogs    
1    'The law contains 19 sections aimed at helping...  Newstex Blogs    
15   1 Cheap AI Stock to Buy Hand Over Fist Before ...  Newstex Blogs    
16                      2:00PM Water Cooler 10/27/2021  Newstex Blogs    
17                        2:00PM Water Cooler 6/4/2021  Newstex Blogs    
..                                                 ...             ...   
889  Why Intel, Broadcom, and Applied Materials Plu...  Newstex Blogs    
890  Why Intel, Taiwan Semiconductor, and Micron St...  Newstex Blogs    
896  Why Taiwan Is at the Heart of a Geopolitical S...  Newstex Blogs    
897  Why Taiwan semiconductors are key for global h...  Newstex Blogs    
898            Window on Washington - Vol. 6, Issue 46  Newstex Blogs    

                        d

In [91]:
# Now the dataframe is successfully constructed, but we may want to focus on certain sources/byline/date
# We can add a filter here

# Phase 2: Data Preprocessing

In [92]:
# Data preprocessing is a crucial step in natural language processing (NLP).
# We go through each article's content body and preprocess it before analysis
# reference: https://www.nltk.org/
# Download the necessary NLTK data files
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [93]:
# Light text preprocessing before applying sentence embeddings
def clean_text(text):
  # Remove HTML tags
  text = re.sub(r'<[^>]+>', ' ', text)
  # Remove special characters
  text = re.sub(r'[^a-zA-Z0-9\s?!,.:"]', '', text)
  # Normalize Whitespace
  text = re.sub(r'\s+', ' ', text)
  '''
  # Remove stopwords
  stop_words = set(stopwords.words('english'))
  words = word_tokenize(text)
  filtered_words = [word for word in words if word.lower() not in stop_words]
  text = ' '.join(filtered_words)
  # Lemmatize words
  lemmatizer = WordNetLemmatizer()
  words = word_tokenize(text)
  stemmed_words = [lemmatizer.lemmatize(word) for word in words]
  text = ' '.join(stemmed_words)
  '''

  return text
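
A quick check of what these substitutions do to a sample string. `clean_text_demo` repeats the three active regular expressions so the snippet runs on its own; the sample input is made up for illustration.

```python
import re

def clean_text_demo(text):
    # Same three active steps as clean_text above
    text = re.sub(r'<[^>]+>', ' ', text)                 # replace HTML tags with a space
    text = re.sub(r'[^a-zA-Z0-9\s?!,.:"]', '', text)     # drop characters outside the allowed set
    text = re.sub(r'\s+', ' ', text)                     # collapse runs of whitespace
    return text

sample = "<p>TSMC's plant: https://example.com/a (2024)</p>"
cleaned = clean_text_demo(sample)
print(repr(cleaned))  # ' TSMCs plant: https:example.coma 2024 '
```

Note that apostrophes and slashes are dropped, which is why the cleaned output later shows forms such as `TSMCs` and `https:www...`. A leading or trailing space can also remain where a tag was removed.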

In [94]:
# Apply the cleaning function to the content of each article
# Store into separate columns
df['cleaned_content'] = df['content'].progress_apply(clean_text)

100%|██████████| 905/905 [00:14<00:00, 63.01it/s] 


In [95]:
df.head()

Unnamed: 0,title,source,date,byline,content,cleaned_content
0,'It was a nightmare': Pinal County builds new ...,Newstex Blogs,Arizona Mirror,Jen Fifield/Votebeat,"Jun 16, 2023( Arizona Mirror: https://www.azmi...","Jun 16, 2023 Arizona Mirror: https:www.azmirro..."
1,'The law contains 19 sections aimed at helping...,Newstex Blogs,Small Dead Animals,Kate,"March 10th, 2024 ( Small Dead Animals — Deliv...","March 10th, 2024 Small Dead Animals Delivered ..."
2,(EDITORIAL from Korea Herald on Feb. 26),ASEAN Tribune,"February 26, 2024 Monday",,26 Feb 2024 (Yonhap News Agency) Global chips...,26 Feb 2024 Yonhap News Agency Global chips r...
3,05:38 EDT TSMC delays Arizona plant start amid...,Theflyonthewall.com,"July 20, 2023 Thursday 5:38 AM EST",,05:38 EDT TSMC delays Arizona plant start amid...,05:38 EDT TSMC delays Arizona plant start amid...
4,05:38 EDT TSMC delays Arizona plant start amid...,Theflyonthewall.com,"July 20, 2023 Thursday 5:38 AM EST",,05:38 EDT TSMC delays Arizona plant start amid...,05:38 EDT TSMC delays Arizona plant start amid...


# Phase 3: Semantic Search

In [96]:
# Load a pre-trained Sentence Transformer (SBERT) model
# all-MiniLM-L6-v2 maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search
# reference: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
# reference: https://sbert.net/
# This is also the default SBERT model used by BERTopic
# Fast, with good overall performance
# SBERT_model = SentenceTransformer('all-MiniLM-L6-v2')
# Best overall performing SBERT model according to: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
SBERT_model = SentenceTransformer('all-mpnet-base-v2')
# Sentence transformer recommended by the author of BERTopic
# SBERT_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
# Multilingual support, reference: https://sbert.net/examples/training/multilingual/README.html
# SBERT_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

In [97]:
# Language of question, MUST match the language of the target articles
# Input query
question = "Why is TSMC's operation in the United States being delayed?"

# Get embedding for the question
question_embedding_SBERT = SBERT_model.encode(question, convert_to_tensor=True)

Loop through the article pool. \
For each article, split the content into sentences and create an embedding for each sentence (these embeddings are not kept between iterations). \
Compare the question embedding with each sentence embedding; if the similarity score exceeds a set threshold, add the sentence to the result list.

If processing takes a long time and we do not wish to repeat it, \
we can split the articles once and store the sentence embeddings in Meta's FAISS vector store to avoid redundant sentence splitting and embedding.
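
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the SBERT encoder, the regex split stands in for NLTK's `sent_tokenize`, and the plain dictionary `embedding_cache` stands in for a vector store such as FAISS.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Hypothetical stand-in for an SBERT embedding: word counts
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Cache so that each distinct sentence is embedded only once
embedding_cache = {}

def cached_embed(sentence):
    if sentence not in embedding_cache:
        embedding_cache[sentence] = toy_embed(sentence)
    return embedding_cache[sentence]

def search_article(question, title, content, threshold=0.5):
    question_vec = toy_embed(question)
    # Naive sentence split on end punctuation followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", content) if s.strip()]
    hits = []
    for sentence in sentences:
        score = cosine(question_vec, cached_embed(sentence))
        if score > threshold:  # keep only sentences above the threshold
            hits.append({"Title": title, "Score": score, "Content": sentence})
    return hits

hits = search_article(
    "Why is the TSMC plant delayed?",
    "Sample article",
    "The TSMC plant is delayed. Shares rose on Monday.",
)
print(hits)
```

With a real vector store, the cache would be filled once and persisted, so later questions only need one new embedding (the question itself).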

In [98]:
# Function for semantic search
def semantic_search(question_embedding_SBERT, model, article_title, article_content):
  # Ensure the embeddings and the question embedding are on the same device (GPU if available)
  # Needed if we change setting to GPU-T4
  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  # print(device)

  # Move question embedding to the correct device
  question_embedding_SBERT = question_embedding_SBERT.to(device)

  # Split the article into sentences using NLTK library provided function
  sentence_list = sent_tokenize(article_content)
  # Create a sentence embedding for each sentence in the list and move it to the selected device
  # Use the model passed in as an argument rather than the global SBERT_model
  sentence_embeddings = model.encode(sentence_list, convert_to_tensor=True).to(device)

  # Compute cosine similarity scores
  # Works similarly to the pre-defined function util.semantic_search
  scores = util.pytorch_cos_sim(question_embedding_SBERT, sentence_embeddings)[0]

  # Get all results with score over a certain threshold
  target_indices = (scores > 0.5).nonzero(as_tuple=True)[0]
  target_scores = scores[target_indices]

  # Append each matching sentence to the global result_list (defined in the next cell)
  for score, idx in zip(target_scores, target_indices):
    title = article_title
    content = sentence_list[idx]
    result_list.append({
      "Title": title,
      "Score": score.item(),
      "Content": content,
    })

  # Don't need to return anything
  # return

In [99]:
# List to store highly-rated sentences that may potentially answer the question
result_list = []

# Loop through the data pool and perform semantic searching
for index, row in tqdm(df.iterrows(), total=df.shape[0]):
  article_title = row['title']
  article_content = row['cleaned_content']
  # Print which article is currently being processed and its length in characters
  # print("Currently working on article:", article_title, "with", len(article_content), "tokens")
  semantic_search(question_embedding_SBERT, SBERT_model, article_title, article_content)

100%|██████████| 905/905 [30:33<00:00,  2.03s/it]


In [100]:
# Collect the matching sentences into a DataFrame
result_df = pd.DataFrame(result_list)
# Sort the dataframe by score
result_df = result_df.sort_values(by='Score', ascending=False)
# Reset indexes
result_df = result_df.reset_index(drop=True)

In [101]:
result_df

Unnamed: 0,Title,Score,Content
0,From Japanese Farm to Chip 'Fab',0.733754,TSMCs American factories have been repeatedly ...
1,How Japan Is Trying to Rebuild Its Chip Industry,0.733754,TSMCs American factories have been repeatedly ...
2,TSMC gets $6.6 billion in CHIPS funding for th...,0.725563,TSMCs US production has faced delays in gettin...
3,Why contractors are still all in on manufactur...,0.716750,TSMC recently delayed the production timeline ...
4,Biden's Plan To Boost American Chip Manufactur...,0.710944,TSMC delayed its first Arizona factory because...
...,...,...,...
336,Q2 2020 Taiwan Semiconductor Manufacturing Co ...,0.501784,And is this something that TSMC would consider?
337,Hobbs announces new worker safety partnership ...,0.501375,The need for more skilled labor is another iss...
338,Q2 2023 Chemtrade Logistics Income Fund Earnin...,0.500564,So Im going off of what TSMC has said publicly.
339,Chemtrade Logistics Income Fund (CGIFF) Q2 202...,0.500564,So Im going off of what TSMC has said publicly.


In [102]:
result_df['Content'][0]

'TSMCs American factories have been repeatedly delayed.'

In [103]:
# Extract to csv file
result_df.to_csv("result_df.csv", index=False)

# Phase 4: Topic Modeling

We will apply the BERTopic Model for semantic topic modeling \
reference: https://maartengr.github.io/BERTopic/index.html#installation

## Optimization Tricks and Prompt Engineering for Llama2

In [104]:
# Copied directly from: https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing#scrollTo=lPQzxTBtZG6R
from torch import bfloat16
import transformers

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type='nf4',  # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)

What is Prompt Engineering: Prompt engineering is the process where you guide generative artificial intelligence (generative AI) solutions to generate desired outputs.\
Llama2 Prompt Engineering Reference: https://huggingface.co/blog/llama2#how-to-prompt-llama-2

In [105]:
# Llama2 Prompt Template
"""
<s>[INST] <<SYS>>

{{ System Prompt }}

<</SYS>>

{{ User Prompt }} [/INST]

{{ Model Answer }}
"""

'\n<s>[INST] <<SYS>>\n\n{{ System Prompt }}\n\n<</SYS>>\n\n{{ User Prompt }} [/INST]\n\n{{ Model Answer }}\n'
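
As an illustration of how the template above is filled in, the helper below assembles a single-turn prompt from a system prompt and a user prompt. `build_llama2_prompt` and the two example strings are hypothetical names and values for this sketch; the notebook itself concatenates hand-written prompt strings instead.

```python
def build_llama2_prompt(system_prompt, user_prompt):
    """Fill the single-turn Llama 2 chat template with a system and a user prompt."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_prompt} [/INST]"
    )

demo_prompt = build_llama2_prompt(
    "You are a helpful assistant for labeling topics.",
    "Create a short label for documents about chip factory delays.",
)
print(demo_prompt)
```

The model's answer is generated after the closing `[/INST]` tag, so the helper stops there.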

### Prompt Engineering

Code referenced from: https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing#scrollTo=lPQzxTBtZG6R

In [106]:
# The system prompt describes information given to all conversations
# It provides context for the model so it knows how we expect it to respond
system_prompt = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for labeling topics. Please ensure that your responses are socially unbiased and positive in nature.
<</SYS>>
"""

In [107]:
# User Prompt
# Example prompt demonstrating the output we are looking for
example_prompt = """
I have a topic that contains the following documents:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.

[/INST] Environmental impacts of eating meat
"""

# Our main prompt with documents ([DOCUMENTS]) and keywords ([KEYWORDS]) tags
# These are BERTopic Specific tags
main_prompt = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
[/INST]
"""

In [108]:
question

"Why is TSMC's operation in the United States being delayed?"

In [109]:
prompt_label = system_prompt + example_prompt + main_prompt

## Apply BERTopic and Llama2

In [110]:
# Before applying the model,
# remove all rows with duplicate content from the result set
result_df = result_df.drop_duplicates(subset="Content", keep="first").reset_index(drop=True)
result_df

Unnamed: 0,Title,Score,Content
0,From Japanese Farm to Chip 'Fab',0.733754,TSMCs American factories have been repeatedly ...
1,TSMC gets $6.6 billion in CHIPS funding for th...,0.725563,TSMCs US production has faced delays in gettin...
2,Why contractors are still all in on manufactur...,0.716750,TSMC recently delayed the production timeline ...
3,Biden's Plan To Boost American Chip Manufactur...,0.710944,TSMC delayed its first Arizona factory because...
4,Chemtrade Logistics Income Fund (CGIFF) Q2 202...,0.700141,The delays TSMC has been public with their exp...
...,...,...,...
274,U.S. begins public investment to boost chip ma...,0.502035,This was evidenced by TSMC last July when it a...
275,Q2 2020 Taiwan Semiconductor Manufacturing Co ...,0.501784,And is this something that TSMC would consider?
276,Hobbs announces new worker safety partnership ...,0.501375,The need for more skilled labor is another iss...
277,Q2 2023 Chemtrade Logistics Income Fund Earnin...,0.500564,So Im going off of what TSMC has said publicly.


In [111]:
# Set up the llama2 model
model_id = 'meta-llama/Llama-2-7b-chat-hf'
# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

# Llama 2 Model
Llama_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)
Llama_model.eval()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096

In [112]:
# Text generator for Llama 2
generator = transformers.pipeline(
    model=Llama_model, tokenizer=tokenizer,
    task='text-generation',
    temperature=1,  # Default temperature; lower it (e.g. 0.1) to keep creativity low
    max_new_tokens=500,
    repetition_penalty=1.1
)

Device set to use cuda:0


In [113]:
# Memory Intensive Operation happening
# Free up GPU memory
torch.cuda.empty_cache()

Possible out-of-memory error on CUDA/GPU \
Potential Solution: https://saturncloud.io/blog/how-to-solve-gpu-out-of-memory-error-on-google-colab/

In [125]:
# Call the topic model
# Use fine-tuned Topic Representation model

# BERTopic Pipeline
# If no changes are made or wanting to use default pipeline, simply use BERTopic()

# Text Embedding
embedding_model = SBERT_model

# Dimensionality Reduction: BERTopic's default (UMAP) is used
# Clustering
# Use k-means in place of HDBSCAN so the number of clusters can be specified
cluster_model = KMeans(n_clusters=10)

# Tokenizer
# Tokenize Chinese language
# vectorizer_model = CountVectorizer(tokenizer=tokenize_zh)

# Topic Representation
# ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
keybert = KeyBERTInspired()
# Set up (AI driven) Llama2 label and answer generation
llama2_label = TextGeneration(generator, prompt = prompt_label)

representation_model = {
    "KeyBERT": keybert,
    "Llama2_label": llama2_label,
}

# Not specifying cluster number
# topic_model = BERTopic(embedding_model=embedding_model, representation_model=representation_model, top_n_words = 10, verbose = True)
# Specifying cluster number
topic_model = BERTopic(embedding_model=embedding_model, hdbscan_model=cluster_model, representation_model=representation_model, top_n_words = 10, verbose = True)

# Use a GPT model for a more powerful representation
# client = openai.OpenAI(api_key="")
# representation_model = OpenAI(client, model="gpt-3.5-turbo", chat=True)
# topic_model = BERTopic(representation_model=representation_model)

# Generate Topics and their Probabilities
docs = result_df["Content"].to_list()
titles = result_df["Title"].to_list()
topics, probs = topic_model.fit_transform(docs)

2025-04-08 08:08:15,094 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/9 [00:00<?, ?it/s]

2025-04-08 08:08:16,047 - BERTopic - Embedding - Completed ✓
2025-04-08 08:08:16,049 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-04-08 08:08:16,978 - BERTopic - Dimensionality - Completed ✓
2025-04-08 08:08:16,980 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-04-08 08:08:17,136 - BERTopic - Cluster - Completed ✓
2025-04-08 08:08:17,143 - BERTopic - Representation - Fine-tuning topics using representation models.
 10%|█         | 1/10 [00:07<01:03,  7.08s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 10/10 [00:54<00:00,  5.41s/it]
2025-04-08 08:09:12,177 - BERTopic - Representation - Completed ✓


In [126]:
question

"Why is TSMC's operation in the United States being delayed?"

In [127]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,Llama2_label,Representative_Docs
0,0,42,0_in_the_to_tsmc,"[in, the, to, tsmc, and, arizona, of, with, pr...","[tsmc, tsmcs, manufacturing, factory, 2023, ph...","[Semiconductor production delays in Arizona, ,...",[Under the preliminary agreement announced by ...
1,1,35,1_semiconductor_the_arizona_to,"[semiconductor, the, arizona, to, delay, tsmc,...","[tsmc, tsmcs, delays, delay, postpone, delayed...",[TSMC Delays Arizona Factory Production Due to...,[Recent reports indicated TSMC urging major su...
2,2,34,2_the_and_is_tsmc,"[the, and, is, tsmc, to, that, it, in, of, inv...","[tsmc, tsmcs, manufacturing, factory, taiwan, ...",[Global Tech Company Faces Supply Chain Concer...,[And while TSMCs president says an invasion wo...
3,3,34,3_2025_production_its_plant,"[2025, production, its, plant, to, from, delay...","[tsmc, 2024, delaying, delays, 2025, manufactu...","[TSMC Production Delays, , , , , , , , , ]",[TSMC also warned that the production at its A...
4,4,31,4_the_of_to_semiconductor,"[the, of, to, semiconductor, its, in, arizona,...","[tsmc, tsmcs, manufacturing, chipmaker, foundr...","[Semiconductor manufacturing delays, , , , , ,...","[TSMC, a semiconductor manufacturer based in T..."
5,5,27,5_tsmc_support_so_made,"[tsmc, support, so, made, is, you, cash, view,...","[tsmc, tsmcs, subsidy, support, when, taiwans,...","[TSMC Support for US CHIPS Act, , , , , , , , , ]",[And then when would you expect the cash suppo...
6,6,21,6_workers_and_us_the,"[workers, and, us, the, to, is, tsmc, with, fr...","[tsmc, tsmcs, technicians, taiwan, taiwan10, t...",[International worker dispatch in the semicond...,[TSMC and its suppliers are in talks with the ...
7,7,21,7_the_to_arizona_of,"[the, to, arizona, of, but, tsmc, as, been, 20...","[tsmc, tsmcs, 2024, 2025, 2028, 2027, delays, ...","[TSMC Arizona factory delays, , , , , , , , , ]","[Kanthan Kanthan2030 March 14, 2024 TSMC also ..."
8,8,18,8_the_for_of_in,"[the, for, of, in, funding, and, to, delays, t...","[delays, delayed, delay, shortages, bottleneck...",[Supply chain delays for semiconductor manufac...,"[For the complete story, see: https:myrepublic..."
9,9,16,9_of_workers_shortage_skilled,"[of, workers, shortage, skilled, tsmc, the, co...","[shortage, tsmc, manufacturing, foundrys, dela...",[Skilled worker shortage delays TSMC microchip...,[A major projectfunded partly by U.S. taxpayer...


In [128]:
# We can choose to output this result to a csv file
df_topic = topic_model.get_topic_info()
# df_topic.to_csv("BERTopic_Topic_Info.csv", index=False)

In [129]:
a = df_topic["Llama2_label"].to_list()
a

[['Semiconductor production delays in Arizona',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''],
 ['TSMC Delays Arizona Factory Production Due to Worker Shortage',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''],
 ['Global Tech Company Faces Supply Chain Concerns Due to Geopolitical Tensions',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''],
 ['TSMC Production Delays', '', '', '', '', '', '', '', '', ''],
 ['Semiconductor manufacturing delays', '', '', '', '', '', '', '', '', ''],
 ['TSMC Support for US CHIPS Act', '', '', '', '', '', '', '', '', ''],
 ['International worker dispatch in the semiconductor industry',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''],
 ['TSMC Arizona factory delays', '', '', '', '', '', '', '', '', ''],
 ['Supply chain delays for semiconductor manufacturing',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''],
 ['Skilled worker shortage delays TSMC microchip production',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '']]
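
Each entry in the output above is a list of ten strings in which only the first is filled. A minimal sketch of flattening that shape into one label per topic follows; `raw_labels` copies two entries from the output, while `first_label` and its `fallback` value are hypothetical names for this illustration.

```python
# Shape copied from the output above: one list per topic, only the first entry filled
raw_labels = [
    ["Semiconductor production delays in Arizona", "", "", "", "", "", "", "", "", ""],
    ["TSMC Production Delays", "", "", "", "", "", "", "", "", ""],
]

def first_label(entries, fallback="Unlabeled"):
    """Return the first non-empty string, keeping only its first line."""
    for entry in entries:
        text = entry.strip()
        if text:
            return text.split("\n")[0]
    return fallback

flat_labels = [first_label(entries) for entries in raw_labels]
print(flat_labels)
```

The fallback covers the case where the model returns an empty label for a topic.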

In [130]:
# visualize terms
topic_model.visualize_barchart()

In [131]:
# Select topics to merge if we decide the model clustered topics are too close
'''
topics_to_merge_list = [[1,3,5]]
for topics_to_merge in topics_to_merge_list:
  topic_model.merge_topics(docs, topics_to_merge)
'''

'\ntopics_to_merge_list = [[1,3,5]]\nfor topics_to_merge in topics_to_merge_list:\n  topic_model.merge_topics(docs, topics_to_merge)\n'

In [133]:
# visualize topics through Intertopic Distance Map
topic_model.visualize_topics()

In [134]:
# substitute c-TF-IDF labels with llama2 labels
llama2_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["Llama2_label"].values()]
topic_model.set_topic_labels(llama2_labels)

In [135]:
# visualize topics through Document Map
# topic_model.visualize_documents(docs, hide_annotations=True, custom_labels=True)
topic_model.visualize_documents(docs, hide_annotations=True, hide_document_hover=False)

In [136]:
# visualize topics through Hierarchical Structure
topic_model.visualize_hierarchy()

In [138]:
# Chatgpt generated code
# Export result
from collections import defaultdict

# Create a dictionary to store topics and their corresponding documents
topic_groups = defaultdict(list)

# Group documents by their assigned topic
for content, title, topic in zip(docs, titles, topics):
  info = {"title": title, "content": content}
  topic_groups[topic].append(info)

# Prepare the data for Excel
excel_data = []

# Add the question at the top of the document
# It should only appear once
excel_data.append({
  "Question": question,
  "Topic Summary": None,
  "Topic Label": None,
  "Article Title": None,
  "Source Sentence": None
})

# Work topic by topic
for topic_label, documents_info in topic_groups.items():
  # Get the Llama 2 label for the topic
  llama2_label = topic_model.get_topic(topic_label, full=True)["Llama2_label"][0][0]

  # Add the topic label
  # This should appear only once for each topic
  excel_data.append({
    "Question": None,
    "Topic Label": llama2_label,
    "Article Title": None,
    "Source Sentence": None
  })

  # Add the supporting documents
  for info in documents_info:
    excel_data.append({
      "Question": None,
      "Topic Label": None,
      "Article Title": info["title"],
      "Source Sentence": info["content"]
    })

# Create a DataFrame
df_final = pd.DataFrame(excel_data)

# Save the DataFrame to an Excel file
df_final.to_excel("bertopic_results.xlsx", index=False)

# Save the DataFrame to a csv file
df_final.to_csv("bertopic_results.csv", index=False)