Task: QA System \
* Select the top articles/paragraphs that best answer the question (general QA, closest to the architecture proposed in the CO-Search paper)
* Predict topics among the selected article paragraphs (based on topic modeling)

Main article references can be found in: Notes.DOCX \
This Colab focuses on the second task: Topic Modeling with BERTopic and Llama2 \
A T4 GPU is required

# Phase 0: Load Necessary Libraries

In [None]:
# Install sentence-transformers, needed for semantic search
!pip install pandas sentence-transformers

# Install BM25
# !pip install rank-bm25

# Install BERTopic
!pip install bertopic

# Install Visualization for BERTopic results
!pip install bertopic[visualization]

# Supporting libraries that make running llama2 easier
!pip install accelerate bitsandbytes xformers adjustText

# Install openai
# !pip install openai

Collecting sentence-transformers
  Downloading sentence_transformers-3.1.0-py3-none-any.whl.metadata (23 kB)
Downloading sentence_transformers-3.1.0-py3-none-any.whl (249 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.1.0
Collecting bertopic
  Downloading bertopic-0.16.3-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.38.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.6-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->bertopic)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading bertopic-0.16.3-py3-none-any.whl (143 kB)

In [None]:
# Log into hugging face
# Needed for permission to download Llama2
from huggingface_hub import notebook_login
notebook_login()


In [None]:
# Import Basic Python library
import json
import pandas as pd
import numpy as np

# Data Preprocessing related library
# reference: https://www.nltk.org/
# content cleaning, lowercasing, removing punctuation, removing stop words etc
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Semantic Searching related library
from sentence_transformers import SentenceTransformer
from sentence_transformers import util
import torch
from torch import cuda
from nltk.tokenize import sent_tokenize

# Keyword Matching related library (TF-IDF and BM25)
# reference: https://pypi.org/project/rank-bm25/
from sklearn.feature_extraction.text import TfidfVectorizer
# from rank_bm25 import BM25Okapi

# Cosine Similarity, measure the similarity among embeddings
from sklearn.metrics.pairwise import cosine_similarity

# BERTopic Topic Modeling related library
from bertopic import BERTopic
# Fine-tuned Topic Representation
# Reduces stopwords from the resulting topic representations
from bertopic.representation import KeyBERTInspired, TextGeneration
# Clustering method that allows us to set the number of clusters
from sklearn.cluster import KMeans
# Reduce frequent words that appear commonly among topics
from bertopic.vectorizers import ClassTfidfTransformer
# Tokenizer
from sklearn.feature_extraction.text import CountVectorizer
import jieba

# Progress Bar
from tqdm import tqdm
tqdm.pandas()

# We can use a GPT model as well
# import openai
# from bertopic.representation import OpenAI

## Optimization Tricks and Prompt Engineering for Llama2

In [None]:
# Copied directly from: https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing#scrollTo=lPQzxTBtZG6R
from torch import bfloat16
import transformers

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type='nf4',  # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)
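As a rough sanity check on why 4-bit loading matters on a T4 (about 16 GB of GPU memory), the sketch below estimates the storage needed for the weights alone. The 7-billion figure is the nominal, rounded parameter count of Llama-2-7b, `weight_memory_gb` is a hypothetical helper, and the estimate ignores activations, the KV cache, and the small overhead of quantization constants.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9  # assumption: nominal, rounded parameter count of Llama-2-7b

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half precision
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit NF4, before quantization constants

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the half-precision weights nearly fill the card on their own, while the 4-bit version leaves room for the embedding model and generation.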

What is Prompt Engineering: Prompt engineering is the process where you guide generative artificial intelligence (generative AI) solutions to generate desired outputs.\
Llama2 Prompt Engineering Reference: https://huggingface.co/blog/llama2#how-to-prompt-llama-2

In [None]:
# Llama2 Prompt Template
"""
<s>[INST] <<SYS>>

{{ System Prompt }}

<</SYS>>

{{ User Prompt }} [/INST]

{{ Model Answer }}
"""

### Prompt Engineering - EN

Code referenced from: https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing#scrollTo=lPQzxTBtZG6R

In [None]:
# The system prompt describes information given to all conversations
# It provides context for the model so it knows how we expect it to respond
system_prompt_en = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for labeling topics. Please ensure that your responses are socially unbiased and positive in nature.
<</SYS>>
"""

In [None]:
# User Prompt
# Example prompt demonstrating the output we are looking for
example_prompt_en = """
I have a topic that contains the following documents:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.

[/INST] Environmental impacts of eating meat
"""

# Our main prompt with documents ([DOCUMENTS]) and keywords ([KEYWORDS]) tags
# These are BERTopic Specific tags
main_prompt_en = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
[/INST]
"""

In [None]:
prompt_en = system_prompt_en + example_prompt_en + main_prompt_en
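BERTopic replaces the `[DOCUMENTS]` and `[KEYWORDS]` tags before the prompt reaches the model. The sketch below mimics that substitution with plain string replacement so the final prompt can be inspected by hand. `fill_prompt` is a hypothetical helper, not BERTopic's internal code, and the template is a shortened stand-in for the main prompt.

```python
def fill_prompt(template: str, documents: list[str], keywords: list[str]) -> str:
    """Substitute the [DOCUMENTS] and [KEYWORDS] tags in a prompt template."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    return (template
            .replace("[DOCUMENTS]", doc_block)
            .replace("[KEYWORDS]", ", ".join(keywords)))

# Shortened stand-in for the main prompt (assumption: same tags as the real one)
template = (
    "[INST]\n"
    "I have a topic that contains the following documents:\n"
    "[DOCUMENTS]\n\n"
    "The topic is described by the following keywords: '[KEYWORDS]'.\n"
    "[/INST]"
)

filled = fill_prompt(
    template,
    documents=["tsmc announced a new plant in phoenix.",
               "the chips law funds semiconductor plants."],
    keywords=["tsmc", "plant", "semiconductor"],
)
print(filled)
```

Printing the filled prompt for one topic is a quick way to check that the few-shot example and the main prompt line up before spending GPU time on generation.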

Note: prompt engineering for ZH is not yet working as expected

### Prompt Engineering - ZH

In [None]:
system_prompt_zh = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for labeling topics.
<</SYS>>
"""

example_prompt_zh = """
I have a topic that contains the following documents:
- 大多数文化中的传统饮食主要以植物为主，少量肉类为辅，但随着工业化肉类生产和工厂化养殖的兴起，肉类已成为主食。
- 肉类，特别是牛肉，是食物中碳排放量最高的。
- 吃肉不会让你成为坏人，不吃肉也不会让你成为好人。

The topic is described by the following keywords: ‘肉类、牛肉、吃、饮食、排放、牛排、食物、健康、加工、鸡肉’。

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.

[/INST] 吃肉的环境影响
"""

main_prompt_zh = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: ‘[KEYWORDS]’。

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
[/INST]
"""

In [None]:
prompt_zh = system_prompt_zh + example_prompt_zh + main_prompt_zh

# Phase 1: Collect Newspaper Articles

* This phase was conducted outside of this notebook.
* The model processes JSON-formatted files, with each data point representing a unique article.
* Articles are sourced from the Nexis Uni database and leading newspapers, including The Wall Street Journal (WSJ).
* The JSON file must be uploaded in each runtime session (this can be avoided by mounting Google Drive).
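For reference, a minimal sketch of the file layout that the loading code below appears to expect: a top-level `articles` list whose items carry `title`, `source`, `date`, `byline`, and `content`. The sample values are invented for illustration, and `validate_articles` is a hypothetical helper rather than part of the notebook.

```python
import json

REQUIRED_FIELDS = ("title", "source", "date", "byline", "content")

def validate_articles(data: dict) -> list[dict]:
    """Return the article list, raising KeyError if a required field is missing."""
    articles = data["articles"]
    for i, article in enumerate(articles):
        missing = [field for field in REQUIRED_FIELDS if field not in article]
        if missing:
            raise KeyError(f"article {i} is missing fields: {missing}")
    return articles

# Invented sample with the same shape as the uploaded JSON file
sample = json.loads("""
{
  "articles": [
    {
      "title": "Example headline",
      "source": "Example Source",
      "date": "January 10, 2022 Monday",
      "byline": "reporter@example.com",
      "content": "First paragraph.\\nSecond paragraph."
    }
  ]
}
""")

articles = validate_articles(sample)
print(len(articles), "article(s) loaded")
```

Checking the fields up front gives a clearer error than a `KeyError` raised halfway through the extraction loop.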

In [None]:
# Load the JSON data from uploaded JSON documents
with open('Task2.json', 'r', encoding='utf-8') as file:
  data = json.load(file)

In [None]:
# Extract the stored fields from the JSON file
# List that collects the extracted information for every article
articles_info = []

# Loop through every article in the file
for article_data in data['articles']:

  # get each field of data information
  title = article_data['title']
  source = article_data['source']
  date = article_data['date']
  byline = article_data['byline']
  content = article_data['content']

  # Store article information
  article_info = {
    "title": title,
    "source": source,
    "date": date,
    "byline": byline,
    "content": content
  }

  # Append the extracted information for the current article to the target list
  articles_info.append(article_info)

In [None]:
# Convert the list of extracted dictionaries to a pandas DataFrame for easier analysis
df = pd.DataFrame(articles_info)

In [None]:
# check if our dataframe is successfully constructed
print(df)

                                                title               source  \
0   2022 State Of The State: Expanding Arizona‘s T...  Yellow Sheet Report   
1   ADVISORY: Sen. Kelly to Tour Construction of I...  Yellow Sheet Report   
2                            Amkor's big announcement  Yellow Sheet Report   
3                      Arizona is the new Switzerland  Yellow Sheet Report   
4   Arizona Wins 2021 Gold Shovel Award Recognizin...  Yellow Sheet Report   
..                                                ...                  ...   
75        Wake Up Call for Wednesday, August 23, 2023  Yellow Sheet Report   
76       Wake Up Call for Wednesday, December 7, 2022  Yellow Sheet Report   
77      Wake Up Call for Wednesday, February 22, 2023  Yellow Sheet Report   
78  WATCH: Sen. Kelly Talks CHIPS Law, Child Care ...  Yellow Sheet Report   
79  WATCH: Sen. Kelly Talks TikTok, Space as the N...  Yellow Sheet Report   

                           date                          byline

In [None]:
# Sample analysis based purely on this pandas DataFrame:
# get the unique sources of the article samples

# unique_sources_list = df['source'].unique()
# print(str(len(unique_sources_list)) + " unique sources in the DataFrame:")
# for source in unique_sources_list:
#    print(source)

In [None]:
# Check that the information was extracted successfully; we do this by checking the date column,
# as there is no easy way to check the validity of the other columns

# Regular expression pattern to match the start of the date with a proper month
pattern = r'^(January|February|March|April|May|June|July|August|September|October|November|December|Unknown|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|0?[1-9]|1[0-2])'

# Function to check if the date starts with a proper month
def check_proper_date(date):
  return bool(re.match(pattern, date))

# Apply the function to the "date" column and filter out the invalid rows
invalid_rows = df[~df['date'].apply(check_proper_date)]

# Print the invalid rows
# If all dates in the 'date' column are valid
if invalid_rows.empty:
  print("All entries in the 'date' column contain a proper month name.")
else:
  # If there are invalid dates
  print("The following rows do not contain a proper month name and are not listed as 'Unknown':")
  print(invalid_rows)

All entries in the 'date' column contain a proper month name.
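The regular expression only checks how a date string starts. If stricter validation is wanted, the dates shown in this dataset (for example `January 10, 2022 Monday`) can be parsed outright. A minimal sketch, assuming every known date follows that `Month day, year Weekday` layout in English; `parse_article_date` is a hypothetical helper.

```python
from datetime import date, datetime
from typing import Optional

def parse_article_date(text: str) -> Optional[date]:
    """Parse dates such as 'January 10, 2022 Monday'; return None if they do not fit."""
    try:
        return datetime.strptime(text.strip(), "%B %d, %Y %A").date()
    except ValueError:
        return None

for raw in ["January 10, 2022 Monday", "Unknown"]:
    print(repr(raw), "->", parse_article_date(raw))
```

Rows that come back as `None` are the ones worth inspecting by hand; a parsed `date` also makes later filtering by time range straightforward.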


In [None]:
# Now the dataframe is successfully constructed, but we may want to focus on certain sources/byline/date
# We can add a filter here


# Phase 2: Data Preprocessing

In [None]:
# Data preprocessing is a crucial step in natural language processing (NLP).
# We go through each article's content body and preprocess it before analysis/training
# reference: https://www.nltk.org/
# Download necessary NLTK data files
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
# Define cleaning functions for the content text
# For semantic search, we do not remove stop words; we only strip surrounding whitespace and lowercase all words
# This way, we try our best to keep the semantic meaning of the original sentences
def clean_text_lower(text):
  # Convert to lowercase
  text = text.lower()
  # Remove leading and trailing whitespace
  text = text.strip()
  # Output
  return text

# For keyword search, we apply more aggressive preprocessing
def clean_text_higher(text):
  # Convert to lowercase
  text = text.lower()
  # Remove punctuation and special character
  text = re.sub(r'[^\w\s]', '', text)
  # Remove numbers
  text = re.sub(r'\d+', '', text)
  # Tokenize the text
  words = word_tokenize(text)
  # Remove stop words
  # Use the stop word list that matches the article language (English here)
  stop_words = set(stopwords.words('english'))
  # stop_words = set(stopwords.words('chinese'))
  words = [word for word in words if word not in stop_words]
  # Lemmatize the words
  lemmatizer = WordNetLemmatizer()
  words = [lemmatizer.lemmatize(word) for word in words]
  # Join the words back into a single string
  cleaned_text = ' '.join(words)
  # Output
  return cleaned_text

In [None]:
# Apply the cleaning function to the content of each article
# Store into separate columns
df['semantic_content'] = df['content'].apply(clean_text_lower)
# df['keyword_content'] = df['content'].apply(clean_text_higher)

In [None]:
df

Unnamed: 0,title,source,date,byline,content,semantic_content
0,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,PHOENIX Governor Doug Ducey today called for ...,phoenix governor doug ducey today called for ...
1,ADVISORY: Sen. Kelly to Tour Construction of I...,Yellow Sheet Report,"April 4, 2023 Tuesday",dlajom,"On Wednesday, April 5, 2023, Arizona Senator M...","on wednesday, april 5, 2023, arizona senator m..."
2,Amkor's big announcement,Yellow Sheet Report,"November 30, 2023 Thursday",jkronenfeld@azcapitoltimes.com,Arizona is once again the recipient of a massi...,arizona is once again the recipient of a massi...
3,Arizona is the new Switzerland,Yellow Sheet Report,"October 12, 2023 Thursday",wschutsky@azcapitoltimes.com,The Arizona Dept of Education announced the la...,the arizona dept of education announced the la...
4,Arizona Wins 2021 Gold Shovel Award Recognizin...,Yellow Sheet Report,"June 10, 2021 Thursday",mstead@azcapitoltimes.com,TSMC Fab In Phoenix Wins “Manufacturing Projec...,tsmc fab in phoenix wins “manufacturing projec...
...,...,...,...,...,...,...
75,"Wake Up Call for Wednesday, August 23, 2023",Yellow Sheet Report,"August 24, 2023 Thursday",wschutsky@azcapitoltimes.com,Lake trying to get Richer’s defamation lawsuit...,lake trying to get richer’s defamation lawsuit...
76,"Wake Up Call for Wednesday, December 7, 2022",Yellow Sheet Report,"December 7, 2022 Wednesday",mstead@azcapitoltimes.com,"Ducey, Biden cheer microchip plant, TSMC annou...","ducey, biden cheer microchip plant, tsmc annou..."
77,"Wake Up Call for Wednesday, February 22, 2023",Yellow Sheet Report,"February 22, 2023 Wednesday",wschutsky@azcapitoltimes.com,Scottsdale expresses approval of plan to get w...,scottsdale expresses approval of plan to get w...
78,"WATCH: Sen. Kelly Talks CHIPS Law, Child Care ...",Yellow Sheet Report,"March 3, 2023 Friday",jkronenfeld@azcapitoltimes.com,Kelly also discussed how his CHIPS law strengt...,kelly also discussed how his chips law strengt...


# Phase 3: Question and Answering

## Part 1: Article and Paragraph Indexing

Semantic search is a technique that uses the meaning of words and phrases to improve search accuracy, rather than relying solely on keyword matching.\
We will use both semantic and keyword matching techniques to build a richer index.
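One common way to combine the two signals is to rescale each score list to a shared range and take a weighted average. The sketch below illustrates that idea only; the weighting scheme, the `alpha` parameter, the helper names, and the scores are all assumptions for illustration and are not taken from this notebook or from the CO-Search paper.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(semantic: list[float], keyword: list[float],
                  alpha: float = 0.5) -> list[float]:
    """Weighted blend of two per-paragraph score lists after min-max scaling."""
    sem, key = min_max(semantic), min_max(keyword)
    return [alpha * s + (1 - alpha) * k for s, k in zip(sem, key)]

# Invented scores for three paragraphs
semantic = [0.79, 0.55, 0.32]  # e.g. cosine similarities
keyword = [2.0, 6.0, 0.0]      # e.g. BM25 or TF-IDF scores on a different scale

combined = hybrid_scores(semantic, keyword, alpha=0.5)
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(ranking)
```

Rescaling first matters because cosine similarities and keyword scores live on different scales; without it the larger-valued signal would dominate the blend.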

### Paragraph Splitting

In [None]:
# ChatGPT-generated code
def is_all_english(text):
  # Check if all characters are English letters, spaces, punctuation, or digits
  return bool(re.match(r'^[A-Za-z0-9\s.,!?;:\'\"-]+$', text))

In [None]:
# Loop through all stored articles, for each article, split text into paragraphs
# Save back into a pandas dataframe for later processing
paragraphs_info = []
# Set to track seen paragraphs
# seen_paragraphs = set()

# Loop through each row in the existing DataFrame
for index, row in df.iterrows():
  # Split the content into paragraphs
  # Paragraphs are separated by a "\n" newline
  # Paragraphs are intended for SBERT, so the semantic-level preprocessed text is used
  paragraphs = row['semantic_content'].split('\n')

  # Loop through each paragraph
  for paragraph in paragraphs:
    # Additional conditions to skip paragraphs that carry no meaning
    # Check if the paragraph starts with "copyright" (the text is already lowercased)
    # This is specific to Nexis Uni articles
    if paragraph.startswith("copyright"):
      continue
    # Check if the paragraph is less than 5 words
    # For English articles
    if len(paragraph.split()) < 5:
      continue
    # For Chinese articles
    # if is_all_english(paragraph):
    #   if len(paragraph.split()) < 5:
    #     continue
    # elif len(paragraph) < 10:
    #   continue
    # Check if the paragraph is actually a link
    if paragraph.startswith(("http://", "https://")):
      continue
    if paragraph.startswith("arizona republic"):
      continue

    # Dropping duplicates would require a global lookup table
    # If the exact same paragraph has already appeared, it would be skipped
    # This is disabled because it could change the overall article index; we need an overall score per article
    # if paragraph in seen_paragraphs:
    #   continue
    # seen_paragraphs.add(paragraph)

    # Store paragraph information
    paragraph_info = {
      "title": row['title'],
      "source": row['source'],
      "date": row['date'],
      "byline": row['byline'],
      "content": paragraph
    }
    paragraphs_info.append(paragraph_info)
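The skip conditions in the loop above can also be gathered into one predicate, which makes them easier to test on sample strings. A minimal sketch mirroring the English-article rules only; `keep_paragraph` and the sample paragraphs are hypothetical.

```python
def keep_paragraph(paragraph: str, min_words: int = 5) -> bool:
    """Return True if a lowercased paragraph passes the filters used in the loop."""
    # Too short to carry meaning (English articles: fewer than min_words words)
    if len(paragraph.split()) < min_words:
        return False
    # Boilerplate prefixes: copyright lines, bare links, newspaper name lines
    if paragraph.startswith(("copyright", "https:", "arizona republic")):
        return False
    return True

samples = [
    "copyright 2023 example publishing all rights reserved",
    "https://example.com/a/very/long/link with some trailing words here",
    "too short",
    "the tsmc project is important for economic and national security",
]
kept = [p for p in samples if keep_paragraph(p)]
print(kept)
```

`str.startswith` accepts a tuple of prefixes, so adding another source-specific prefix is a one-word change.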

In [None]:
# Convert the list to a new DataFrame
para_df = pd.DataFrame(paragraphs_info)
para_df

Unnamed: 0,title,source,date,byline,content
0,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,phoenix governor doug ducey today called for ...
1,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"in his 2022 state of the state address, govern..."
2,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"“let’s invest in the worker, arming them with ..."
3,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"through targeted investments, governor ducey i..."
4,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"with collaboration among government, industry ..."
...,...,...,...,...,...
1101,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,we invented the semiconductor chip in this cou...
1102,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,"if we lose access to those chips, in a matter ..."
1103,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,on how kelly’s chips law fuels economic growth...
1104,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,[...] it means tens of thousands of high-payin...


### Paragraph Embedding

In [None]:
# Load a pre-trained Sentence Transformer (SBERT) model
# all-MiniLM-L6-v2 is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search
# reference: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
# reference: https://sbert.net/
# This is also the default SBERT model used by BERTopic
# Fast
# SBERT_model = SentenceTransformer('all-MiniLM-L6-v2')
# Best-performing SBERT model according to: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
# SBERT_model = SentenceTransformer('all-mpnet-base-v2')
# Sentence transformer recommended by the author of BERTopic
SBERT_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
# Multilingual model, reference: https://sbert.net/examples/training/multilingual/README.html
# SBERT_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

In [None]:
# Encode separated paragraphs
# Generate paragraph embeddings using pretrained SBERT model
# para_df['embeddings'] = para_df['content'].apply(lambda x: SBERT_model.encode(x, convert_to_tensor=True).tolist())
para_df['embeddings'] = para_df['content'].progress_apply(lambda x: SBERT_model.encode(x, convert_to_tensor=True).tolist())

100%|██████████| 1106/1106 [00:15<00:00, 69.63it/s]


In [None]:
para_df

Unnamed: 0,title,source,date,byline,content,embeddings
0,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,phoenix governor doug ducey today called for ...,"[-0.03664790838956833, 0.010178462602198124, -..."
1,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"in his 2022 state of the state address, govern...","[-0.0530061200261116, 0.005819747690111399, 0...."
2,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"“let’s invest in the worker, arming them with ...","[-0.03257669880986214, -0.018776817247271538, ..."
3,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"through targeted investments, governor ducey i...","[-0.02714850939810276, 0.00980773288756609, -0..."
4,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"with collaboration among government, industry ...","[-0.038465362042188644, 0.016472117975354195, ..."
...,...,...,...,...,...,...
1101,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,we invented the semiconductor chip in this cou...,"[-0.010295873507857323, 0.010998079553246498, ..."
1102,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,"if we lose access to those chips, in a matter ...","[-0.04067046195268631, -0.03629226237535477, -..."
1103,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,on how kelly’s chips law fuels economic growth...,"[-0.08377302438020706, -0.02886289730668068, 0..."
1104,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,[...] it means tens of thousands of high-payin...,"[-0.017358308658003807, 0.000557399878744036, ..."


In [None]:
# The language of the question MUST match the language of the target articles
# question = "Why is TSMC's operation in the US delayed?"
question = "Why is TSMC in the US?"
# question = "為什麼台積電在美國？"
# question = "为什么台积电在美国？"

question_embedding_SBERT = SBERT_model.encode(question, convert_to_tensor=True)

In [None]:
# Function for semantic search
def semantic_search(question_embedding_SBERT, model, df):

  # Ensure the embeddings and the question embedding are on the same device (GPU if available)
  # Needed if we change setting to GPU-T4
  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  # print(device)

  # Convert list embeddings back to tensor, move to correct device
  # embeddings = torch.tensor(df['embeddings'].tolist())
  embeddings = torch.tensor(df['embeddings'].tolist()).to(device)

  # Move question embedding to the correct device
  question_embedding_SBERT = question_embedding_SBERT.to(device)

  # Compute cosine similarity scores
  # works similarly to the pre-defined function util.semantic_search
  scores = util.pytorch_cos_sim(question_embedding_SBERT, embeddings)[0]

  # Keep only results whose score exceeds the 0.27 threshold
  positive_indices = (scores > 0.27).nonzero(as_tuple=True)[0]
  positive_scores = scores[positive_indices]

  # Sort the scores and find the cutoff for the top ?%
  sorted_scores, sorted_indices = torch.sort(positive_scores, descending=True)
  top_percent_cutoff = int(len(sorted_scores) * 1)

  # Keep only the top % of the results
  # sorted_indices index into positive_scores, so map them back to DataFrame rows
  top_indices = positive_indices[sorted_indices[:top_percent_cutoff]]
  top_scores = sorted_scores[:top_percent_cutoff]

  # List to store results
  results = []

  # Extract top result information
  for score, idx in zip(top_scores, top_indices):
    idx = idx.item()  # Ensure idx is an integer
    title = df['title'].iloc[idx]
    content = df['content'].iloc[idx]
    results.append({
      "Title": title,
      "Score": score.item(),
      "Content": content,
    })

  # Sort results by score in descending order
  # results = sorted(results, key=lambda x: x['Score'], reverse=True)

  # convert list to pandas Dataframe for easier interpretation
  result_df = pd.DataFrame(results)

  return result_df
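When a threshold drops some rows, positions in the filtered score list no longer match row positions in the DataFrame, so each score should carry its original row index through the sort. The pure-Python sketch below shows the same threshold-then-rank logic on toy vectors; the helper names and the 2-D "embeddings" are hypothetical and stand in for the tensor code.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_above_threshold(query, vectors, threshold=0.27):
    """Return (row_index, score) pairs with score > threshold, best first."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    kept = [(i, s) for i, s in scored if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy 2-D "embeddings"; row 0 is filtered out, so list positions and row indices differ
vectors = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
query = [1.0, 0.0]

ranked = rank_above_threshold(query, vectors)
print(ranked)
```

Because row 0 falls below the threshold, the best match sits at position 0 of the filtered list but at row 1 of the data, which is exactly the case where mixing the two index spaces returns the wrong paragraph.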

# Phase 4: Topic Modeling

We will apply the BERTopic Model for semantic topic modeling \
reference: https://maartengr.github.io/BERTopic/index.html#installation

In [None]:
# Get the most relevant paragraphs
result_df = semantic_search(question_embedding_SBERT, SBERT_model, para_df)

In [None]:
result_df

Unnamed: 0,Title,Score,Content
0,"Wake Up Call for Tuesday, September 19, 2023",0.791027,tsmc is about a lot more for arizona and ameri...
1,"Ducey, Biden cheer microchip plant, TSMC annou...",0.788487,the tsmc project is important for economic and...
2,"Mayor Gallego Joins Phoenix Sister Cities, Cou...",0.760244,"last year, tsmc announced that it would more t..."
3,"Ducey, Biden cheer microchip plant, TSMC annou...",0.759342,"morris chang, who founded tsmc in the 1990s, n..."
4,"Smart policy, technology benefit our economic ...",0.743106,"from carroll’s perspective, this historic inve..."
...,...,...,...
1101,Dutch tech firm's Scottsdale project no small ...,0.329232,"""i think the employment is a huge deal,"" she s..."
1102,No city 'holiday presents' for tech giant,0.325506,after a news conference featuring gov. katie h...
1103,"Wake Up Call for Wednesday, August 23, 2023",0.323746,"the trial date was reaffirmed monday morning, ..."
1104,Governor Ducey Announces Legislative Special S...,0.319537,the press conference followed the governor’s a...


In [None]:
# Extract to excel file for further check
result_df.to_excel('semantic_search_results.xlsx', index=False)

In [None]:
# Set up the llama2 model
model_id = 'meta-llama/Llama-2-7b-chat-hf'
# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

# Llama 2 Model
Llama_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)
Llama_model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=

In [None]:
# Text generator for Llama 2
generator = transformers.pipeline(
    model=Llama_model, tokenizer=tokenizer,
    task='text-generation',
    temperature=0.1, # Keep the creativity low
    max_new_tokens=500,
    repetition_penalty=1.1
)

In [None]:
# Memory Intensive Operation happening
# Free up GPU memory
torch.cuda.empty_cache()

Possible out-of-memory error on CUDA/GPU \
Potential Solution: https://saturncloud.io/blog/how-to-solve-gpu-out-of-memory-error-on-google-colab/
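`torch.cuda.empty_cache()` can only release cached blocks that nothing references any more, so large objects should be dropped (for example with `del`) and garbage-collected first. A minimal sketch of that order of operations; `free_gpu_memory` is a hypothetical helper, and the guarded import is only there so the snippet also runs on a machine without PyTorch.

```python
import gc

try:
    import torch  # optional here so the sketch also runs on CPU-only machines
except ImportError:
    torch = None

def free_gpu_memory() -> bool:
    """Run garbage collection, then clear PyTorch's CUDA cache if a GPU is present.

    Returns True if the CUDA cache was cleared, False otherwise.
    """
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False

print("CUDA cache cleared:", free_gpu_memory())
```

If memory is still tight after this, restarting the runtime and loading only the models needed for the current phase is usually the more reliable fix.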

In [None]:
# Taken from: https://maartengr.github.io/BERTopic/getting_started/vectorizers/vectorizers.html#max_features
# Chinese language tokenizer
def tokenize_zh(text):
    words = jieba.lcut(text)
    return words

In [None]:
# Call the topic model
# Use fine-tuned Topic Representation model

# BERTopic Pipeline
# If no changes are made or wanting to use default pipeline, simply use BERTopic()
# Text Embedding
embedding_model = SBERT_model
# Dimensionality Reduction
# Clustering
cluster_model = KMeans(n_clusters=10)
# Tokenizer
# Tokenize Chinese language
# vectorizer_model = CountVectorizer(tokenizer=tokenize_zh)
# Topic Representation
# ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
keybert = KeyBERTInspired()
# English
llama2 = TextGeneration(generator, prompt=prompt_en)
# Simplified Chinese
# llama2 = TextGeneration(generator, prompt=prompt_zh)
representation_model = {
    "KeyBERT": keybert,
    "Llama2": llama2,
}

# Not specifying cluster number
topic_model = BERTopic(embedding_model=embedding_model, representation_model=representation_model, top_n_words = 10, verbose = True)
# Specifying cluster number
# topic_model = BERTopic(embedding_model=embedding_model, hdbscan_model = cluster_model, representation_model=representation_model, top_n_words = 10, verbose = True)

# Use a GPT model for a more powerful representation
# client = openai.OpenAI(api_key="")
# representation_model = OpenAI(client, model="gpt-3.5-turbo", chat=True)
# topic_model = BERTopic(representation_model=representation_model)

# Generate Topics and their Probabilities
docs = result_df["Content"].to_list()
titles = result_df["Title"].to_list()
topics, probs = topic_model.fit_transform(docs)

2024-09-12 07:23:14,032 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/35 [00:00<?, ?it/s]

2024-09-12 07:23:15,611 - BERTopic - Embedding - Completed ✓
2024-09-12 07:23:15,612 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-09-12 07:23:27,392 - BERTopic - Dimensionality - Completed ✓
2024-09-12 07:23:27,393 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-09-12 07:23:27,462 - BERTopic - Cluster - Completed ✓
2024-09-12 07:23:27,470 - BERTopic - Representation - Extracting topics from clusters using representation models.
 40%|████      | 10/25 [01:01<01:23,  5.59s/it]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 25/25 [02:23<00:00,  5.75s/it]
2024-09-12 07:25:52,376 - BERTopic - Representation - Completed ✓
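
The `titles` list is built above but not used afterwards. A minimal sketch of one way to use it is to group article titles by the topic assigned to each paragraph. The helper name `titles_by_topic` and the toy inputs are assumptions for illustration; in the notebook the inputs would be `topics` and `titles`.

```python
from collections import defaultdict


def titles_by_topic(topics, titles):
    """Map each topic id to the sorted, de-duplicated titles of its documents."""
    grouped = defaultdict(set)
    for topic, title in zip(topics, titles):
        grouped[topic].add(title)
    return {topic: sorted(names) for topic, names in sorted(grouped.items())}


# Toy stand-ins for the notebook's `topics` and `titles`
example_topics = [0, 0, 1, -1]
example_titles = ["A", "A", "B", "C"]
print(titles_by_topic(example_topics, example_titles))
```

Titles are de-duplicated because several paragraphs can come from the same article.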


In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,Llama2,Representative_Docs
0,-1,300,-1_the_and_in_to,"[the, and, in, to, of, arizona, for, is, state...","[technology, arizona, workforce, tsmc, sector,...","[Semiconductor Manufacturing in Arizona, , , ,...",[“today’s announcement is yet another importan...
1,0,208,0_election_to_of_arizona,"[election, to, of, arizona, trump, the, for, c...","[elections, election, lawmaker, voters, offici...","[Election Integrity Crisis, , , , , , , , , ]",[a change in federal election laws and a quirk...
2,1,59,1_tsmc_the_in_billion,"[tsmc, the, in, billion, plant, semiconductor,...","[tsmc, investment, funding, taiwan, arizona, m...","[TSMC expands to Arizona, , , , , , , , , ]","[today, the department of commerceannouncedtha..."
3,2,55,2_kelly_chips_act_microchip,"[kelly, chips, act, microchip, manufacturing, ...","[microchip, intel, chips, thechips, kelly, mic...","[U.S. Microchip Manufacturing Boost, , , , , ,...",[senator kelly’s landmark chips law will boost...
4,3,52,3_scottsdale_asm_very_headquarters,"[scottsdale, asm, very, headquarters, saad, la...","[scottsdale, asm, headquarters, facility, camp...","[ASM Builds New HQ in Scottsdale, , , , , , , ...",[asm bought 21 acres near scottsdale road and ...
5,4,36,4_technology_and_the_to,"[technology, and, the, to, in, is, arizona, fo...","[innovation, investment, expansion, investment...","[Arizona Tech Investment, , , , , , , , , ]",[what this means for arizona is that opportuni...
6,5,34,5_sinema_and_bipartisan_arizona,"[sinema, and, bipartisan, arizona, infrastruct...","[bipartisan, arizona, sinema, arizonans, infra...","[AZ Infrastructure & Jobs Law, , , , , , , , , ]","[with projects already underway in arizona, si..."
7,6,34,6_governor_arizona_ducey_the,"[governor, arizona, ducey, the, has, of, our, ...","[arizonans, arizona, economy, economic, govern...","[Arizona Economic Growth, , , , , , , , , ]",[“today’s milestone represents a significant t...
8,7,31,7_sinema_america_semiconductor_manufacturing,"[sinema, america, semiconductor, manufacturing...","[arizona, intel, arizonans, sinema, chips, eco...",[Semiconductor Manufacturing and Job Creation ...,[sinema highlighted arizona as an example of t...
9,8,29,8_city_asm_development_corsette,"[city, asm, development, corsette, agreement, ...","[asm, city, plans, discussions, permitting, de...","[City's Talks with ASM, , , , , , , , , ]","[kelly corsette, a city spokesman, was asked a..."
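
Topic -1 in the table above is BERTopic's outlier topic: it collects the documents that the default HDBSCAN clustering did not assign to any cluster. A minimal sketch for measuring how large that group is follows. The helper name `outlier_share` is hypothetical, and the commented-out calls at the end are one possible follow-up rather than a step this notebook performs.

```python
def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    if len(topics) == 0:
        return 0.0
    return sum(1 for topic in topics if topic == -1) / len(topics)


# Toy stand-in for the `topics` list returned by fit_transform
print(outlier_share([-1, 0, 1, -1]))

# Possible follow-up on the fitted model (requires topic_model, docs, topics):
# new_topics = topic_model.reduce_outliers(docs, topics)
# topic_model.update_topics(docs, topics=new_topics)
```

When the k-Means clustering option is used instead, every document is forced into a cluster, so no -1 topic appears.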


In [None]:
# Optionally export the topic information to a CSV file
df_topic = topic_model.get_topic_info()
df_topic.to_csv("BERTopic_Topic_Info.csv", index=False)
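
One caveat with this export: columns such as `Representation`, `KeyBERT`, and `Llama2` hold Python lists, and `to_csv` writes them as their string form. A minimal sketch for turning such a cell back into a list when the file is read again is shown below. The helper name `parse_list_cell` is hypothetical.

```python
import ast


def parse_list_cell(cell):
    """Convert a stringified list such as "['tsmc', 'plant']" back into a list."""
    value = ast.literal_eval(cell)  # safely evaluates Python literals only
    if not isinstance(value, list):
        raise ValueError(f"Expected a list literal, got: {cell!r}")
    return value


print(parse_list_cell("['tsmc', 'investment', 'funding']"))

# Possible use when reloading the file (requires pandas):
# df = pd.read_csv("BERTopic_Topic_Info.csv",
#                  converters={"Representation": parse_list_cell})
```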

In [None]:
# visualize terms
topic_model.visualize_barchart()

In [None]:
# Select topics to merge, if applicable
# topics_to_merge_list = [[1,3,5]]
# for topics_to_merge in topics_to_merge_list:
  # topic_model.merge_topics(docs, topics_to_merge)

In [None]:
# View topic information
topic_model.get_topic(0, full = True)["Llama2"]
# topic_model.get_topic(1, full = True)["KeyBERT"]

[('Election Integrity Crisis', 1),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0)]

In [None]:
# visualize topics through Intertopic Distance Map
topic_model.visualize_topics()

In [None]:
# Replace the default c-TF-IDF labels with the Llama 2 labels
llama2_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["Llama2"].values()]
topic_model.set_topic_labels(llama2_labels)
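
The list comprehension above relies on the labels being in the same order as the topics. A slightly more defensive sketch builds a dictionary from topic id to label instead, which `set_topic_labels` also accepts, and falls back to a placeholder when the generated label is empty. The helper name `build_topic_labels` and the `"Unlabeled"` fallback are assumptions for illustration.

```python
def build_topic_labels(aspect_topics, fallback="Unlabeled"):
    """Map each topic id to the first line of its first generated label."""
    labels = {}
    for topic_id, words in aspect_topics.items():
        first_line = words[0][0].strip().split("\n")[0].strip() if words else ""
        labels[topic_id] = first_line or fallback
    return labels


# Toy stand-in shaped like topic_model.get_topics(full=True)["Llama2"]
example = {
    0: [("Election Integrity Crisis", 1), ("", 0)],
    1: [("", 0)],
}
print(build_topic_labels(example))

# Possible use on the fitted model:
# topic_model.set_topic_labels(
#     build_topic_labels(topic_model.get_topics(full=True)["Llama2"]))
```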

In [None]:
# visualize topics through Document Map
# topic_model.visualize_documents(docs, custom_labels=True)
topic_model.visualize_documents(docs, hide_annotations=True, hide_document_hover=False)

In [None]:
# Extract hierarchical topics and their representations
hierarchical_topics = topic_model.hierarchical_topics(docs)
tree = topic_model.get_topic_tree(hierarchical_topics)
print(tree)

100%|██████████| 23/23 [00:00<00:00, 188.43it/s]

.
├─the_and_to_arizona_in
│    ├─the_and_in_to_arizona
│    │    ├─■──safety_site_construction_workers_tsmc ── Topic: 11
│    │    └─the_and_in_to_arizona
│    │         ├─the_in_and_arizona_to
│    │         │    ├─■──electric_electrameccanica_vehicles_battery_facility ── Topic: 9
│    │         │    └─the_in_and_arizona_to
│    │         │         ├─the_in_arizona_and_to
│    │         │         │    ├─the_in_tsmc_and_to
│    │         │         │    │    ├─■──tsmc_the_in_billion_plant ── Topic: 1
│    │         │         │    │    └─■──technology_and_the_to_in ── Topic: 4
│    │         │         │    └─■──governor_arizona_ducey_the_has ── Topic: 6
│    │         │         └─■──programs_education_and_to_hobbs ── Topic: 21
│    │         └─■──phoenix_gallego_city_sister_service ── Topic: 10
│    └─sinema_and_manufacturing_act_chips
│         ├─manufacturing_act_chips_kelly_and
│         │    ├─■──kelly_chips_act_microchip_manufacturing ── Topic: 2
│         │    └─■──sinema_america_s




# Testing Area

In [None]:
para_df

Unnamed: 0,title,source,date,byline,content
0,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,phoenix governor doug ducey today called for ...
1,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"in his 2022 state of the state address, govern..."
2,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"“let’s invest in the worker, arming them with ..."
3,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"through targeted investments, governor ducey i..."
4,2022 State Of The State: Expanding Arizona‘s T...,Yellow Sheet Report,"January 10, 2022 Monday",wschutsky@azcapitoltimes.com,"with collaboration among government, industry ..."
...,...,...,...,...,...
1101,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,we invented the semiconductor chip in this cou...
1102,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,"if we lose access to those chips, in a matter ..."
1103,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,on how kelly’s chips law fuels economic growth...
1104,"WATCH: Sen. Kelly Talks TikTok, Space as the N...",Yellow Sheet Report,"March 29, 2023 Wednesday",jkronenfeld@azcapitoltimes.com,[...] it means tens of thousands of high-payin...
