## RAG Approach to News similarity comparison

In order to develop such an app, RAG based solutions can be employed with some process modifiers to make the querying much more viable and broad in context. In the naive rag methodology, the system would typically look up sources relating to the language semantics of the query

For example for the question:

"How many people were trapped in the mountingbourn cave" 

The RAG system looks up sources that relates to things such as 
- mountingbourn cave 
- trapped 
- how many people

With questions like these it is great and easy to answer but the pulling of sources may perform worse in the context of more open ended and implicitly implied questions. 

For example: "Is trump corruupt" would lead to the system pulling information that can be considered as one sided, ultimately affecting the percieved message given to the user. Reasons would be due to how 
- US media is more left wing than right wing leading to article imbalance with more against him than for him, displaying how majority is not equal to reliability
- Media companies would use negativity as clickbait (Phrasing "Trump displays corruption") 

Hence although yes RAG, but RAG is not sufficient

In [40]:
from openai import OpenAI
from getpass import getpass
import json
import pandas as pd
from sklearn.cluster import DBSCAN
import hdbscan
import numpy as np
from tqdm import tqdm


In [None]:
openai_key = getpass("Enter your API Key:")
client = OpenAI(api_key=openai_key)


In [None]:
def set_role(system_prompt, set_json=False):
    def get_completion(prompt, model="gpt-4o-mini"):
        messages = [{"role":"system", "content": system_prompt}, {"role": "user", "content": f"{prompt}"}]
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0, # this is the degree of randomness of the model's output
        )
        return json.loads(response.choices[0].message.content) if set_json else response.choices[0].message.content
    
    return get_completion

## Question Expansion

One possible method in achieving a more openly driven semantic search would be to expand the query before sending it for embedding and vector DB retrieval.
Possible solution would be to use a LLM to expand the query

In [43]:
system_prompt = """
You are an intelligent research assistant that helps break down vague or subjective queries for evidence-based investigation
Given a user query contained within the query tag <query> expand the question to encompass the following 
Ensure that the questions are creative and suit the role of the person, do not simply swap vocabulary
Provide your answer as a JSON object like the one in the example tag where all sub questions are collated in a single list

<requirements>
- At least 2 sub questions that support the original query
- At least 2 sub questions that go against the original query
- At least 2 sub questions that take a neutral stance
- 2 third person perspective questions
</requirements>

<example>
{
  "original": "Is donald trump corrupt",
  "all_sub_questions": [
    "What are the major corruption allegations made against Donald Trump during his presidency?",
    "Have any court cases or legal inquiries found Trump guilty of unethical or corrupt practices?".
    "Have any official investigations concluded that Donald Trump did not engage in corruption?",
    "Were corruption allegations against Donald Trump politically motivated with no legal standing?",
    "What were the main legal and ethical controversies associated with Donald Trump?",
    "How has media coverage of Trump’s alleged corruption varied across sources?",
    "How do historians evaluate the ethical conduct of Donald Trump during his time in office?",
    "What do international news outlets report about Trump’s alleged corruption?"
  ]
}
</example>

"""

expansion_gpt = set_role(system_prompt,set_json=True)
results = expansion_gpt("<query>How many were caught in the implosion of the oceangate incident</query>")
results

{'original': 'How many were caught in the implosion of the OceanGate incident',
 'all_sub_questions': ['What was the total number of individuals aboard the OceanGate submersible during the incident?',
  'Were there any survivors from the OceanGate submersible implosion?',
  'Did any official reports confirm the exact number of casualties in the OceanGate incident?',
  'Were there any discrepancies in the reported number of people involved in the OceanGate incident?',
  'What safety measures were in place on the OceanGate submersible to prevent such incidents?',
  'How did the OceanGate company respond to the incident in terms of public communication?',
  "How do experts in marine exploration assess the risks associated with submersible expeditions like OceanGate's?",
  "What was the international media's reaction to the OceanGate incident in terms of coverage and analysis?"]}

## Trail Dataset (Multinews)
To simulate and test the usage of this rag broadening method, we will use the dataset Multi news by alexfabbri

Found in the link here https://huggingface.co/datasets/alexfabbri/multi_news/tree/main/data

The Multi-News dataset is a multi-document summarization dataset consisting of news articles grouped by topic, where each group has:
- 2 to 10 news articles covering the same event or topic
- A human-written summary that combines key information from all articles

2 train files are given 
- train.tgt which contains the summary 
- train src cleaned which contains the articles themselves (Each row is on a topic, each article per row is delimited by '|||||')




In [44]:

relative = "datasets/"
source = "train.src.cleaned"
target = "train.tgt"

with open(f'{relative}{source}', 'r', encoding='utf-8') as f:
    sources = f.readlines()

with open(f'{relative}{target}', 'r', encoding='utf-8') as f:
    targets = f.readlines()

# Clean up
sources = [s.strip() for s in sources]
targets = [t.strip() for t in targets]

# Check if aligned
assert len(sources) == len(targets)

# Example pair
print("Source:", sources[0])
print("Target (Summary):", targets[0])



Source: National Archives NEWLINE_CHAR NEWLINE_CHAR Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. NEWLINE_CHAR NEWLINE_CHAR A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%. NEWLINE_CHAR NEWLINE_CHAR Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. And while you’re here, why don’t you sign up to follow us on Twitter. NEWLINE_CHAR NEWLINE_CHAR Enjoy the show. ||||| Employers p

Splitting of topic into their separate articles

In [45]:
all_articles = []

for topic_id, source in enumerate(sources):
    articles = [a.strip().replace("NEWLINE_CHAR", "\n") for a in source.split("|||||")]
    for article in articles:

        # Apply to your DataFrame
        all_articles.append({
            "topic_id": topic_id,
            "article": article        
        })

articles_df = pd.DataFrame(all_articles)
articles_df.reset_index(inplace=True)
articles_df.rename(columns={'index': 'article_id'}, inplace=True)

summaries_df = pd.DataFrame({
    "topic_id": list(range(len(targets))),
    "summary": targets
})



from IPython.display import display
print("Articles DataFrame:")
display(articles_df.head())

print("\nSummaries DataFrame (optional):")
display(summaries_df.head())

Articles DataFrame:


Unnamed: 0,article_id,topic_id,article
0,0,0,"National Archives \n \n Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. \n \n A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%. \n \n Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. And while you’re here, why don’t you sign up to follow us on Twitter. \n \n Enjoy the show."
1,1,0,"Employers pulled back sharply on hiring last month, a reminder that the U.S. economy may not be growing fast enough to sustain robust job growth. The unemployment rate dipped, but mostly because more Americans stopped looking for work. \n \n The Labor Department says the economy added 120,000 jobs in March, down from more than 200,000 in each of the previous three months. \n \n The unemployment rate fell to 8.2 percent, the lowest since January 2009. The rate dropped because fewer people searched for jobs. The official unemployment tally only includes those seeking work. \n \n The economy has added 858,000 jobs since December _ the best four months of hiring in two years. But Federal Reserve Chairman Ben Bernanke has cautioned that the current hiring pace is unlikely to continue without more consumer spending."
2,2,1,"LOS ANGELES (AP) — In her first interview since the NBA banned her estranged husband, Shelly Sterling says she will fight to keep her share of the Los Angeles Clippers and plans one day to divorce Donald Sterling. \n \n (Click Prev or Next to continue viewing images.) \n \n ADVERTISEMENT (Click Prev or Next to continue viewing images.) \n \n Los Angeles Clippers co-owner Shelly Sterling, below, watches the Clippers play the Oklahoma City Thunder along with her attorney, Pierce O'Donnell, in the first half of Game 3 of the Western Conference... (Associated Press) \n \n Shelly Sterling spoke to Barbara Walters, and ABC News posted a short story with excerpts from the conversation Sunday. \n \n NBA Commissioner Adam Silver has banned Donald Sterling for making racist comments and urged owners to force Sterling to sell the team. Silver added that no decisions had been made about the rest of Sterling's family. \n \n According to ABC's story, Shelly Sterling told Walters: ""I will fight that decision."" \n \n Sterling also said that she ""eventually"" will divorce her husband, and that she hadn't yet done so due to financial considerations."
3,3,1,"Shelly Sterling said today that ""eventually, I am going to"" divorce her estranged husband, Donald Sterling, and if the NBA tries to force her to sell her half of the Los Angeles Clippers, she would ""absolutely"" fight to keep her stake in the team. \n \n ""I will fight that decision,"" she told ABC News' Barbara Walters today in an exclusive interview. ""To be honest with you, I'm wondering if a wife of one of the owners, and there's 30 owners, did something like that, said those racial slurs, would they oust the husband? Or would they leave the husband in?"" \n \n Sterling added that the Clippers franchise is her ""passion"" and ""legacy to my family."" \n \n ""I've been with the team for 33 years, through the good times and the bad times,"" she added. \n \n These comments come nearly two weeks after NBA Commissioner Adam Silver announced a lifetime ban and a $2.5 million fine for Donald Sterling on April 29, following racist comments from the 80-year-old, which were caught on tape and released to the media. \n \n Read: Barbara Walters' Exclusive Interview With V. Stiviano \n \n Being estranged from her husband, Shelly Sterling said she would ""have to accept"" whatever punishment the NBA handed down to him, but that her stake in the team should be separate. \n \n ""I was shocked by what he said. And -- well, I guess whatever their decision is -- we have to live with it,"" she said. ""But I don't know why I should be punished for what his actions were."" \n \n An NBA spokesman said this evening that league rules would not allow her tol hold on to her share. \n \n ""Under the NBA Constitution, if a controlling owner's interest is terminated by a 3/4 vote, all other team owners' interests are automatically terminated as well,"" NBA spokesman Mike Bass said. ""It doesn't matter whether the owners are related as is the case here. These are the rules to which all NBA owners agreed to as a condition of owning their team."" \n \n Sherry Sterling's lawyer, Pierce O'Donnell, disputed the league's reading of its constitution. \n \n ""We do not agree with the league's self-serving interpretation of its constitution, its application to Shelly Sterling or its validity under these unique circumstances,"" O'Donnell said in a statement released this evening in reposnse the NBA. ""We live in a nation of laws. California law and the United States Constitution trump any such interpretation."" \n \n If the league decides to force Donald Sterling to sell his half of the team, Shelly Sterling doesn't know what he will do, but the possibility of him transferring full ownership to her is something she ""would love him to"" consider. \n \n Related: NBA Bans Clippers Owner Donald Sterling For Life \n \n ""I haven't discussed it with him or talked to him about it,"" she said. \n \n The lack of communication between Rochelle and Donald Sterling led Walters to question whether she plans to file for divorce. \n \n ""For the last 20 years, I've been seeing attorneys for a divorce,"" she said, laughing. ""In fact, I have here-- I just filed-- I was going to file the petition. I signed the petition for a divorce. And it came to almost being filed. And then, my financial advisor and my attorney said to me, 'Not now.'"" \n \n Sterling added that she thinks the stalling of the divorce stems from ""financial arrangements."" \n \n But she said ""Eventually, I'm going to."" \n \n She also told Walters she thinks her estranged husband is suffering from ""the onset of dementia."" \n \n Since Donald Sterling's ban, several celebrities have said they would be willing to buy the team from Sterling, including Oprah Winfrey and Magic Johnson. Sterling remains the owner, though his ban means he can have nothing to do with running the team and can't attend any games. \n \n Silver announced Friday that former Citigroup chairman and former Time Warner chairman Richard Parsons has been named interim CEO of the team, but nothing concrete in terms of ownership or whether Sterling will be forced to sell the team. Parsons will now take over the basic daily operations for the team and oversee the team's president. \n \n Read: What You Need to Know This Week About Donald Sterling \n \n ABC News contacted Donald Sterling for comment on his wife's interview, but he declined."
4,4,2,"GAITHERSBURG, Md. (AP) — A small, private jet has crashed into a house in Maryland's Montgomery County on Monday, killing at least three people on board, authorities said. \n \n Preliminary information indicates at least three people were on board and didn't survive the Monday crash into home in Gaithersburg, a Washington, D.C. suburb, said Pete Piringer, a Montgomery County Fire and Rescue spokesman. \n \n He said a fourth person may have been aboard. \n \n Piringer said the jet crashed into one home around 11 a.m., setting it and two others on fire. Crews had the fire under control within an hour and were searching for anyone who may have been in the homes. \n \n Television news footage of the scene showed one home nearly destroyed, with a car in the driveway. Witnesses told television news crews that they saw the airplane appear to struggle to maintain altitude before going into a nosedive and crashing. \n \n An FAA spokesman said preliminary information shows the Embraer EMB-500/Phenom 100 twin-engine jet was on approach at the nearby Montgomery County Airpark. The National Transportation Safety Board is sending an investigator to the scene."



Summaries DataFrame (optional):


Unnamed: 0,topic_id,summary
0,0,"– The unemployment rate dropped to 8.2% last month, but the economy only added 120,000 jobs, when 203,000 new jobs had been predicted, according to today's jobs report. Reaction on the Wall Street Journal's MarketBeat Blog was swift: ""Woah!!! Bad number."" The unemployment rate, however, is better news; it had been expected to hold steady at 8.3%. But the AP notes that the dip is mostly due to more Americans giving up on seeking employment."
1,1,"– Shelly Sterling plans ""eventually"" to divorce her estranged husband Donald, she tells Barbara Walters at ABC News. As for her stake in the Los Angeles Clippers, she plans to keep it, the AP notes. Sterling says she would ""absolutely"" fight any NBA decision to force her to sell the team. The team is her ""legacy"" to her family, she says. ""To be honest with you, I'm wondering if a wife of one of the owners … said those racial slurs, would they oust the husband? Or would they leave the husband in?"""
2,2,"– A twin-engine Embraer jet that the FAA describes as ""on approach to Runway 14"" at the Montgomery County Airpark in Gaithersburg, Maryland, crashed into a home this morning, engulfing that home in flames and setting two others on fire. Three people are dead, but the count could grow. A Montgomery County Fire rep says three fliers were killed in the crash, but notes the corporate plane may have had a fourth person on board, reports the AP. A relative of the owner of the home that was hit tells WUSA 9 that a mother with three children pre-school age and under should have been home at the time; there's no word on the family's whereabouts. The crash occurred around 11am on Drop Forge Lane, and the fire was extinguished within an hour. Crews are now searching the wreckage. A witness noted the plane appeared to ""wobble"" before the crash; the airport is no more than 3/4 mile from the crash scene. NTSB and FAA will investigate."
3,3,"– Tucker Carlson is in deep doodoo with conservative women after an ill-advised tweet referencing Sarah Palin that he posted, then removed, Monday night. ""Palin's popularity falling in Iowa, but maintains lead to become supreme commander of Milfistan,"" he tweeted—and we probably don't need to tell you where that is. His first attempt at an apology, which he tweeted the next morning: ""Apparently Charlie Sheen got control of my Twitter account last night while I was at dinner. Apologies for his behavior.” That wasn't good enough for many conservative women, Politico notes, rounding up reactions from bloggers to Michelle Malkin calling his behavior sexist and misogynistic. By late Tuesday, Carlson had offered up a more sincere-sounding apology: “I’m sorry for last night’s tweet. I meant absolutely no offense. Not the first dumb thing I’ve said. Hopefully the last.” But at least one man—Erick Erickson, editor of RedState.com—was on Carlson's side, tweeting his reaction to the post in question: ""I laughed then got out my passport."""
4,4,"– What are the three most horrifying words in the English language? Wrong. The correct answer is ""amateur testicle surgery."" The BBC reports 56-year-old Allan Matthews pleaded guilty Wednesday to removing another man's left testicle at an Australian motel despite not being qualified to practice medicine. The unsanctioned surgery took place in May after a 52-year-old man posted an ad online seeking help for a medical issue, according to the Sydney Morning Herald. The man was apparently still suffering after being kicked in the groin by a horse years earlier but couldn't afford an actual doctor. A week after Matthews allegedly removed the man's testicle, infection set in. The man went to the hospital, and the police launched an investigation. Authorities say a raid of Matthews' home last month turned up medical equipment, seven guns, and four bottles of what may be amyl nitrate. In addition to performing surgery without being a doctor, Matthews also pleaded guilty to gun and drug charges. He did not plead guilty to inflicting ""reckless grievous bodily harm."" AAP reports Matthews is out on bail until another hearing next month. (An Oregon man claimed surgery left him with an 80-pound scrotum.)"


In [46]:
articles_df.groupby('topic_id').count().reset_index().rename(columns={'article':'article/topic'}).groupby('article/topic').size()

article/topic
1       504
2     23743
3     12577
4      4921
5      1845
6       707
7       371
8       194
9        81
10       29
dtype: int64

Here we see that each topic can span 1 to 10 articles which will be good in simulating the different sources of news.

## Article formatting with chunking for embeddings
Here new line char is changed to \n

In [47]:
articles_df['article'] = articles_df['article'].apply(lambda x: '\n'.join(x.split('NEWLINE_CHAR')))
articles_df.head()

Unnamed: 0,article_id,topic_id,article
0,0,0,"National Archives \n \n Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. \n \n A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%. \n \n Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. And while you’re here, why don’t you sign up to follow us on Twitter. \n \n Enjoy the show."
1,1,0,"Employers pulled back sharply on hiring last month, a reminder that the U.S. economy may not be growing fast enough to sustain robust job growth. The unemployment rate dipped, but mostly because more Americans stopped looking for work. \n \n The Labor Department says the economy added 120,000 jobs in March, down from more than 200,000 in each of the previous three months. \n \n The unemployment rate fell to 8.2 percent, the lowest since January 2009. The rate dropped because fewer people searched for jobs. The official unemployment tally only includes those seeking work. \n \n The economy has added 858,000 jobs since December _ the best four months of hiring in two years. But Federal Reserve Chairman Ben Bernanke has cautioned that the current hiring pace is unlikely to continue without more consumer spending."
2,2,1,"LOS ANGELES (AP) — In her first interview since the NBA banned her estranged husband, Shelly Sterling says she will fight to keep her share of the Los Angeles Clippers and plans one day to divorce Donald Sterling. \n \n (Click Prev or Next to continue viewing images.) \n \n ADVERTISEMENT (Click Prev or Next to continue viewing images.) \n \n Los Angeles Clippers co-owner Shelly Sterling, below, watches the Clippers play the Oklahoma City Thunder along with her attorney, Pierce O'Donnell, in the first half of Game 3 of the Western Conference... (Associated Press) \n \n Shelly Sterling spoke to Barbara Walters, and ABC News posted a short story with excerpts from the conversation Sunday. \n \n NBA Commissioner Adam Silver has banned Donald Sterling for making racist comments and urged owners to force Sterling to sell the team. Silver added that no decisions had been made about the rest of Sterling's family. \n \n According to ABC's story, Shelly Sterling told Walters: ""I will fight that decision."" \n \n Sterling also said that she ""eventually"" will divorce her husband, and that she hadn't yet done so due to financial considerations."
3,3,1,"Shelly Sterling said today that ""eventually, I am going to"" divorce her estranged husband, Donald Sterling, and if the NBA tries to force her to sell her half of the Los Angeles Clippers, she would ""absolutely"" fight to keep her stake in the team. \n \n ""I will fight that decision,"" she told ABC News' Barbara Walters today in an exclusive interview. ""To be honest with you, I'm wondering if a wife of one of the owners, and there's 30 owners, did something like that, said those racial slurs, would they oust the husband? Or would they leave the husband in?"" \n \n Sterling added that the Clippers franchise is her ""passion"" and ""legacy to my family."" \n \n ""I've been with the team for 33 years, through the good times and the bad times,"" she added. \n \n These comments come nearly two weeks after NBA Commissioner Adam Silver announced a lifetime ban and a $2.5 million fine for Donald Sterling on April 29, following racist comments from the 80-year-old, which were caught on tape and released to the media. \n \n Read: Barbara Walters' Exclusive Interview With V. Stiviano \n \n Being estranged from her husband, Shelly Sterling said she would ""have to accept"" whatever punishment the NBA handed down to him, but that her stake in the team should be separate. \n \n ""I was shocked by what he said. And -- well, I guess whatever their decision is -- we have to live with it,"" she said. ""But I don't know why I should be punished for what his actions were."" \n \n An NBA spokesman said this evening that league rules would not allow her tol hold on to her share. \n \n ""Under the NBA Constitution, if a controlling owner's interest is terminated by a 3/4 vote, all other team owners' interests are automatically terminated as well,"" NBA spokesman Mike Bass said. ""It doesn't matter whether the owners are related as is the case here. These are the rules to which all NBA owners agreed to as a condition of owning their team."" \n \n Sherry Sterling's lawyer, Pierce O'Donnell, disputed the league's reading of its constitution. \n \n ""We do not agree with the league's self-serving interpretation of its constitution, its application to Shelly Sterling or its validity under these unique circumstances,"" O'Donnell said in a statement released this evening in reposnse the NBA. ""We live in a nation of laws. California law and the United States Constitution trump any such interpretation."" \n \n If the league decides to force Donald Sterling to sell his half of the team, Shelly Sterling doesn't know what he will do, but the possibility of him transferring full ownership to her is something she ""would love him to"" consider. \n \n Related: NBA Bans Clippers Owner Donald Sterling For Life \n \n ""I haven't discussed it with him or talked to him about it,"" she said. \n \n The lack of communication between Rochelle and Donald Sterling led Walters to question whether she plans to file for divorce. \n \n ""For the last 20 years, I've been seeing attorneys for a divorce,"" she said, laughing. ""In fact, I have here-- I just filed-- I was going to file the petition. I signed the petition for a divorce. And it came to almost being filed. And then, my financial advisor and my attorney said to me, 'Not now.'"" \n \n Sterling added that she thinks the stalling of the divorce stems from ""financial arrangements."" \n \n But she said ""Eventually, I'm going to."" \n \n She also told Walters she thinks her estranged husband is suffering from ""the onset of dementia."" \n \n Since Donald Sterling's ban, several celebrities have said they would be willing to buy the team from Sterling, including Oprah Winfrey and Magic Johnson. Sterling remains the owner, though his ban means he can have nothing to do with running the team and can't attend any games. \n \n Silver announced Friday that former Citigroup chairman and former Time Warner chairman Richard Parsons has been named interim CEO of the team, but nothing concrete in terms of ownership or whether Sterling will be forced to sell the team. Parsons will now take over the basic daily operations for the team and oversee the team's president. \n \n Read: What You Need to Know This Week About Donald Sterling \n \n ABC News contacted Donald Sterling for comment on his wife's interview, but he declined."
4,4,2,"GAITHERSBURG, Md. (AP) — A small, private jet has crashed into a house in Maryland's Montgomery County on Monday, killing at least three people on board, authorities said. \n \n Preliminary information indicates at least three people were on board and didn't survive the Monday crash into home in Gaithersburg, a Washington, D.C. suburb, said Pete Piringer, a Montgomery County Fire and Rescue spokesman. \n \n He said a fourth person may have been aboard. \n \n Piringer said the jet crashed into one home around 11 a.m., setting it and two others on fire. Crews had the fire under control within an hour and were searching for anyone who may have been in the homes. \n \n Television news footage of the scene showed one home nearly destroyed, with a car in the driveway. Witnesses told television news crews that they saw the airplane appear to struggle to maintain altitude before going into a nosedive and crashing. \n \n An FAA spokesman said preliminary information shows the Embraer EMB-500/Phenom 100 twin-engine jet was on approach at the nearby Montgomery County Airpark. The National Transportation Safety Board is sending an investigator to the scene."


Chunking with meta data capturing is conducted. 

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " "]
)

chunks = []

# Loop through your DataFrame rows
for _, row in articles_df.iterrows():
    article_text = row['article']
    article_id = row['article_id']
    topic_id = row['topic_id']
    
    # Split and attach metadata
    split_chunks = splitter.create_documents(
        [article_text],
        metadatas=[{
            "article_id": article_id,
            "topic_id": topic_id
        }]
    )
    
    chunks.extend(split_chunks)


In [10]:
chunks[0:3]

[Document(metadata={'article_id': 0, 'topic_id': 0}, page_content='National Archives \n \n Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs.'),
 Document(metadata={'article_id': 0, 'topic_id': 0}, page_content='A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%.'),
 Document(metadata={'article_id': 0, 'topic_id': 0}, page_content='Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. An

## Embedding Model

Here the MPNET base v2 embeddings model was used

In [48]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document
import numpy as np

encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

Here only 50000 chunks were embedded to save time for testing. Since FAISS does not use cosine similarity by default, we had to apply our own embeddings by encoding and normalizing our selves before sending them as chunk embedding pairs into FAISS

In [283]:
index_end = 50000
embeddings = encoder.embed_documents([chunk.page_content for chunk in chunks[0:index_end]])
normalized_embeddings = [vec / np.linalg.norm(vec) for vec in embeddings]

text_embedding_pairs = list(zip(
    [doc.page_content for doc in chunks],
    normalized_embeddings
))

faiss_index = FAISS.from_embeddings(
    text_embedding_pairs,         # list of (text, embedding)
    embedding=encoder,            # model used
    metadatas=[doc.metadata for doc in chunks[0:index_end]]  # optional: include article_id/topic_id
)

faiss_index.save_local("vector_index_cosine")


### Retrieval Function 
This function here loads the database and allows for easy query to vector to db chunk retrieval. 
Since FAISS is used, meta data is stored in relation to the article the chunk its from and the topic it is referencing. 

In [49]:
db = FAISS.load_local("vector_index_cosine", encoder, allow_dangerous_deserialization=True)

def semantic_search_with_threshold(db, query, encoder, threshold=0.1, k=99999):
    vec = encoder.embed_query(query)
    vec = vec / np.linalg.norm(vec)
    results = db.similarity_search_with_score_by_vector(vec, k=k)
    return [(doc, 1 - score) for doc, score in results if 1 - score >= threshold]

## Test Topic
For the test topic, we will look at 7 different articles on the topic of Huricane harvey and its aftermath

Why this topic is due to 
- it being broad for 7 articles 
- containing different perspectives and info

In [50]:
articles_df[articles_df['topic_id'] == 390]['article'].values

       'HOUSTON — Five days after the pummeling began — a time when big storms have usually blown through, the sun has come out, and evacuees have returned home — Tropical Storm Harvey refused to go away, battering southeast Texas even more on Tuesday, spreading the destruction into Louisiana and shattering records for rainfall and flooding. \n \n Along 300 miles of Gulf Coast, people poured into shelters by the thousands, straining their capacity; as heavy rain kept falling, some rivers were still rising and floodwater in some areas had not crested yet; and with whole neighborhoods flooded, others were covered in water for the first time. \n \n Officials cautioned that the full-fledged rescue-and-escape phase of the crisis, usually finished by now, would continue, and that they still had no way to gauge the scale of the catastrophe — how many dead, how many survivors taking shelter inland or still hunkered down in flooded communities, and how many homes destroyed. \n \n For everybody,

Possible questions
- What were the key challenges faced by Texas and Louisiana during Hurricane Harvey?
- How many homes were destroyed in huricane harvey

What were the key challenges faced by Texas and Louisiana during Hurricane Harvey?

In [51]:
semantic_search_with_threshold(db, "What were the key challenges faced by Texas and Louisiana during Hurricane Harvey", encoder, threshold=0.25)

[(Document(id='13a9b0ec-d27e-4221-89b3-89c07fcfe49c', metadata={'article_id': 1112, 'topic_id': 390}, page_content='(CNN) With countless Houstonians still awaiting rescue, Tropical Depression Harvey devoured another Texas city. \n \n The unrelenting storm unleashed its wrath on a wide swath east of Houston, leaving thousands stranded in flooded homes and forcing the evacuation of a nursing facility and even an emergency shelter where residents had sought refuge.'),
  np.float32(0.37814432)),
 (Document(id='0e72b927-a078-4c7d-94d7-553a88d5ec3e', metadata={'article_id': 1113, 'topic_id': 390}, page_content="The catastrophic flooding from Hurricane Harvey is not limited to Texas, it's also affecting parts of southwest Louisiana where preparations are underway to evacuate some areas. \n \n Interested in Hurricane Harvey? Add Hurricane Harvey as an interest to stay up to date on the latest Hurricane Harvey news, video, and analysis from ABC News. Add Interest"),
  np.float32(0.36525846)),
 

### Broadening search 

In [52]:
question =  "<query>What were the key challenges faced by Texas and Louisiana during Hurricane Harvey?</query>"
questions = expansion_gpt(question)
questions

{'original': 'What were the key challenges faced by Texas and Louisiana during Hurricane Harvey?',
 'all_sub_questions': ['What were the most significant infrastructure damages in Texas and Louisiana caused by Hurricane Harvey?',
  'How did the emergency response systems in Texas and Louisiana perform during Hurricane Harvey?',
  'Were there any successful strategies or measures implemented in Texas and Louisiana that mitigated the impact of Hurricane Harvey?',
  'Did Texas and Louisiana have adequate disaster preparedness plans in place before Hurricane Harvey struck?',
  'What were the primary environmental impacts of Hurricane Harvey on Texas and Louisiana?',
  'How did the economic recovery process differ between Texas and Louisiana post-Hurricane Harvey?',
  'How did the federal government support Texas and Louisiana in the aftermath of Hurricane Harvey?',
  'What lessons have been learned by Texas and Louisiana from Hurricane Harvey that could improve future disaster responses?']

In [53]:
retrieved_documents =  []
for question in questions['all_sub_questions']:
    retrieved_documents += semantic_search_with_threshold(db, question, encoder, threshold=0.25)

article_id = set()
topic_id_check = set()

for doc, score in retrieved_documents:
    aid = doc.metadata.get("article_id")
    tid = doc.metadata.get("topic_id")
    article_id.add(aid)
    topic_id_check.add(tid)

print("Unique Article IDs:", article_id)
print("Topics Covered:", topic_id_check)


Unique Article IDs: {4040, 2037, 1110, 2038, 1112, 1113, 1114}
Topics Covered: {1394, 390, 711}


From broadening the search we see that even more articles emerged with article_id = 4040 which covers topic 1394. And it encompasses huricane harvey as well from the example below 

In [54]:
articles_df[articles_df['article_id'] == 1537]['article'].iloc[0]

'Acting Homeland Secretary Elaine Duke, center, is briefed on the Hurricane Maria response during a flight to Puerto Rico on Friday, Sept. 29, 2017. President Donald Trump on Thursday cleared the way for... (Associated Press) \n \n BRANCHBURG, N.J. (AP) — President Donald Trump on Sunday scoffed at "politically motivated ingrates" who had questioned his administration\'s commitment to revive Puerto Rico after a pulverizing hurricane and said the federal government had done "a great job with the almost impossible situation." \n \n The tweets coming from a president ensconced in his New Jersey golf club sought to defend Washington\'s efforts to mobilize and coordinate recovery efforts on a U.S. territory in dire straits almost two weeks after Hurricane Maria struck. \n \n San Juan Mayor Carmen Yulin Cruz on Friday accused the Trump administration of "killing us with the inefficiency" after the storm. She begged the president, who is set to visit Puerto Rico on Tuesday, to "make sure some

## Fact Extraction (Using LLMS)
With the articles obtained, we can feed them to the LLM for fact extraction. However we don't want all facts as some are rubbish and waste time or token gen so we can prompt engineer to ask for facts only relating to query. We can test the difference between with question and no question but just not giving the question. The llm will extract everything

In [102]:
system_prompt = """
You are a research assistant specialized in fact extraction.

Extract clear, verifiable facts from the <context>. Focus on short, distinct, evidence-based statements — no opinions, summaries, or assumptions.
Avoid full sentences. Only return **distinct factual keywords or short noun phrases** that can be grounded in the text.
Reduce the facts down to those that aid in answering the <query>

<requirements>
- Output a valid JSON object .
- Min. 1 fact cannot be empty, no additional prior knowledge.
- Each fact must be concise, directly grounded in the text.
- Avoid redundancy, vague phrasing, or restatements.
</requirements>

IMPORTANT:
- Do NOT include markdown formatting like ```json or ``` in your response.
- Return only the raw JSON object,
- Do not write: "Here's the JSON:", "Output:", or similar phrases.


<example>
{
  "facts": [
    "The study was published in Behavioral Ecology",
    "Urban birds solved food puzzles faster than rural birds.",
    "Over 100 pigeons and sparrows were studied in cities."
  ]
}
"""


example ="""

<context>
The OceanGate incident involved the implosion of a submersible during a dive to the Titanic wreck site. Five individuals were aboard the submersible when it imploded
</context>"

<query>
how many people were aboard ocean gate
</query>

"""


fact_gpt = set_role(system_prompt, set_json=True)
results = fact_gpt(example)
results

{'facts': ['Five individuals were aboard the submersible']}

### Testing with single article


In [103]:
article_id_list = list(article_id)
article_id_list

[4040, 2037, 1110, 2038, 1112, 1113, 1114]

In [104]:
single_test = articles_df[articles_df['article_id'] == article_id_list[4]]
fact_extract_context = single_test['article'].values[0]
fact_extract_context

'(CNN) With countless Houstonians still awaiting rescue, Tropical Depression Harvey devoured another Texas city. \n \n The unrelenting storm unleashed its wrath on a wide swath east of Houston, leaving thousands stranded in flooded homes and forcing the evacuation of a nursing facility and even an emergency shelter where residents had sought refuge. \n \n "Our whole city is underwater right now but we are coming!" Port Arthur Mayor Derrick Freeman posted Wednesday on Facebook. "If you called, we are coming. Please get to higher ground if you can, but please try (to) stay out of attics." \n \n My uncles have been rescuing people in Port Arthur for 24hrs! So blessed to have such a helpful family who help others in times like this! pic.twitter.com/O2qIVGHqxR \n \n At least 37 deaths related to Hurricane Harvey and its aftermath have been reported in Texas. One of them, Houston police Sgt. Steve Perez , drowned while trying to get to work. \n \n "To those Americans who have lost loved ones

### No question|

In [105]:
fact_extract_context_query = f"""
<context>
{fact_extract_context}
</context>"
"""

fact_extract_context_query

'\n<context>\n(CNN) With countless Houstonians still awaiting rescue, Tropical Depression Harvey devoured another Texas city. \n \n The unrelenting storm unleashed its wrath on a wide swath east of Houston, leaving thousands stranded in flooded homes and forcing the evacuation of a nursing facility and even an emergency shelter where residents had sought refuge. \n \n "Our whole city is underwater right now but we are coming!" Port Arthur Mayor Derrick Freeman posted Wednesday on Facebook. "If you called, we are coming. Please get to higher ground if you can, but please try (to) stay out of attics." \n \n My uncles have been rescuing people in Port Arthur for 24hrs! So blessed to have such a helpful family who help others in times like this! pic.twitter.com/O2qIVGHqxR \n \n At least 37 deaths related to Hurricane Harvey and its aftermath have been reported in Texas. One of them, Houston police Sgt. Steve Perez , drowned while trying to get to work. \n \n "To those Americans who have lo

In [106]:
fact_gpt(fact_extract_context_query)

{'facts': ['Tropical Depression Harvey affected Texas and Louisiana.',
  'Port Arthur Mayor Derrick Freeman posted about the city being underwater.',
  'At least 37 deaths related to Hurricane Harvey in Texas.',
  'Houston police Sgt. Steve Perez drowned.',
  'Record-setting rain in Harris County.',
  '15 inches of rain in Beaumont area.',
  '60,000 to 70,000 calls for help in Houston.',
  'US Coast Guard searching for two civilian rescuers.',
  'One-third of Houston area covered in water.',
  'Houston Astros to play a doubleheader at home against New York Mets.',
  'Flooding in Houston slowly receding.',
  'Controversy over homes near Barker and Addicks reservoirs.',
  '2,500 homes flooded near Addicks Reservoir.',
  'Homes inundated for several weeks.',
  "Louisiana largely spared from Harvey's wrath.",
  'Harvey weakened to a tropical depression with 35 mph winds.',
  'New Orleans announced a fundraiser for Texas residents.',
  'Mother died saving her toddler in Beaumont.',
  'Bob B

Model extracted  points all extremely atomic along with all assocuated nouns, relating to the specifics of the article. Maybe relevant who knows

### With question

In [107]:
fact_extract_context_query = f"""
<context>
{fact_extract_context}
</context>

<query>
{question}
</query>
"""

ans = fact_gpt(fact_extract_context_query)
ans

{'facts': ['Hurricane Harvey caused historic flooding in Texas.',
  'Port Arthur and Beaumont received 26 inches of rain in 24 hours.',
  'At least 37 deaths related to Hurricane Harvey reported in Texas.',
  'Houston Police received 60,000 to 70,000 calls for help.',
  'US Coast Guard searching for civilian rescuers after boat capsized.',
  'Homes near Barker and Addicks reservoirs flooded.',
  'Controversy over building homes inside reservoirs.',
  "Louisiana largely spared from Harvey's wrath.",
  'Louisiana requested federal disaster declaration for additional parishes.',
  'Harvey dumped almost 52 inches of rain in parts of Texas.',
  'Volunteers from across the country helped evacuate victims.']}

Time could be saved for lesser answers, may be good or bad possibly need to balance 

### For all facts
Now with this established, we can try to extract all facts and nouns from all related articles. Can take a while if there are a lot of articles not sure if there is a faster better way to extract facts... 

Maybe smaller model? OpenIE (A bit out dated)

In [108]:

facts_and_nouns = {}

# Filter once and store in variable to avoid repeated filtering
filtered_df = articles_df[articles_df['article_id'].isin(article_id_list)]

for _, row in tqdm(filtered_df.iterrows(), total=len(filtered_df), desc="Extracting facts"):
    article_row = row.iloc[2]
    article_id = row.iloc[0]

    fact_extract_context_query = f"""
    <context>
    {article_row}
    </context>

    <query>
    {question}
    </query>
    """

    facts = fact_gpt(fact_extract_context_query)
    facts_and_nouns[article_id] = facts

facts_and_nouns



Extracting facts: 100%|██████████| 7/7 [00:24<00:00,  3.57s/it]


{1110: {'facts': ['Hurricane Harvey caused catastrophic flooding in Texas and Louisiana.',
   'Thousands of people were stranded and homes destroyed.',
   'Power outages affected 75,000 in the Houston area.',
   'The Navy sent ships for storm relief efforts.',
   'The National Guard deployed 24,000 troops in Texas.',
   'FEMA is operating over 230 shelters in Texas.',
   'FEMA placed more than 1,800 flood survivors in hotels.',
   'Texas accepted resources from Mexico and Israel.']},
 1112: {'facts': ['Hurricane Harvey caused historic flooding in Texas.',
   'Port Arthur and Beaumont received 26 inches of rain in 24 hours.',
   'Hurricane Harvey broke the US record for rainfall from a single storm.',
   'At least 37 deaths related to Hurricane Harvey reported in Texas.',
   'Houston Police received 60,000 to 70,000 calls for help.',
   'Homes near Barker and Addicks reservoirs were flooded.',
   'Controversy over building homes inside reservoirs.',
   'Floodwaters overwhelmed the Bob B

## HDBSCAN and Cluster Association (Concept Summary)

We’ve extracted factual statements and key nouns from articles retrieved via question broadening, aiming to capture multiple aspects of a topic. Now, we need to organize and reason over this information.

### Project Goals

* Identify distinct factual statements about a topic
* Show what different people/sources are saying
* Detect missing or underreported information

### LLM Limitations

While LLMs can analyze text directly and I am not going against, relying solely on them poses some challenges that I am concerned about:

* Context window limits
* Slow responses when handling many facts
* Diluted attention with too many tokens
* Hallucinations or tracking errors


### Proposed Solution: Fact clustering before content summarization and comparing

We use embeddings + clustering to organize the facts to semantically similar claims and then make sub comparisons and abstractions of them

This allows us to possibly improve on:
- summarising all relevant themes and remove redundant processing saving time 
- Data organization and presentation later on

### Potentially viable ?
* Embeddings inherently encode semantic meaning meaning clustering could use the relational patterns in their latent space
* Clustering leverages this to organize facts/nouns without manual labeling

* Pronouns (e.g., "he", "she") may weaken noun clarity
* Can mitigate by prompting LLM extractors to avoid or resolve pronouns

HDBSCAN seems ok as it does not assume any uniform density and it doesn't require predefined cluster count (Like an anything goes model )



### Fact Reembedding
Next we can take all of these facts then reembedd the them 
Function below embedds the facts as well as normalizes them before fitting them into a dataframe

In [109]:
def embed_and_build_dataframes(articles, encoder):
    fact_rows = []

    for article_id, article in articles.items():
        facts = article["facts"]

        # Embed and normalize facts
        fact_embeddings = encoder.embed_documents(facts)
        norm_fact_embeddings = [vec / np.linalg.norm(vec) for vec in fact_embeddings]

        # Add to rows
        for fact, emb in zip(facts, norm_fact_embeddings):
            fact_rows.append({
                "article": article_id,
                "text": fact,
                "embedding": emb
            })

    df_facts = pd.DataFrame(fact_rows)
    df_facts.reset_index(inplace=True)
    df_facts.rename(columns={'index': 'fact_id'}, inplace=True)

    return df_facts


In [110]:
df_facts = embed_and_build_dataframes(facts_and_nouns, encoder)

In [111]:
df_facts.head(2)

Unnamed: 0,fact_id,article,text,embedding
0,0,1110,Hurricane Harvey caused catastrophic flooding in Texas and Louisiana.,"[-0.04331660117801618, 0.004618039191684698, -0.021901129903252183, -0.03430385021486677, -0.012568234022410736, 0.004241631216600844, -0.03442599502886154, -0.024672835194445526, 0.023964588615835506, -0.04448552645189974, -0.02969932637590785, 0.009679736227870021, 0.0018320668540590661, -0.05806823987448396, 0.05081279023948028, -0.03291281196535767, 0.007360045538344373, -0.013668876590178724, -0.020087392291721858, -0.019575196558729226, -0.0077797604772805815, -0.0343554305825187, 0.07002558312018585, -0.03616392857342918, 0.02203426246059445, 0.027800547840445822, -0.005866738294865124, 0.0396141842926665, -0.015728836323972713, -0.022408056213303627, 0.04708030281190277, -0.01888505675297579, 0.05283912829819478, -0.03577532865495135, 1.7570865173281935e-06, -0.02600664181111939, -0.002372385151768144, -0.027114705156899992, 0.009715224273323166, 0.010257162723008926, -0.11298692572671863, -0.010333404511566235, -0.010376552684925221, 0.02813275059108581, -0.014329210349306195, -0.06975014261582006, -0.01790577668826751, 0.0003650025626802186, -0.0009695646243681266, 0.005765120761012924, 0.01969280769459079, -0.028456334882924504, 0.05014842199110197, 0.027915929390142456, 0.05594135250887264, 0.08529928770594365, 0.0036060906605184027, 0.013255018273046272, 0.06857125219073985, -0.06695042314257107, -0.004687752409224383, 0.03365658849893521, -0.025825734270030754, 0.006363829373214484, -0.021368504678983862, -0.03278880078098924, 0.02709478789302226, 0.01746790794468636, 0.03259640444502565, 0.029193964687726538, -0.048702338854085654, -0.014266412201638958, 0.00014625726391753353, 0.008607865938706664, -0.014884668808330448, -0.0450458957746866, -0.0226188070539199, 0.0055686382473974375, 0.017181451757296826, 0.0040560628718520515, -0.03901633855079986, -0.016140363540783183, 0.02458951721721605, 0.029989299237215012, 0.01942966809835635, 0.04374266633970327, -0.006086800963915021, -0.021424588922447668, -0.09467273616422026, 0.022208235374037263, 0.027310677771423875, 0.01555569042494697, 0.01191789707944634, -0.009654315779093127, -0.04505808864940311, 0.0381776751513793, -0.013485404186810126, 0.031871179856479644, 0.018589481055257498, -0.07609234481697018, ...]"
1,1,1110,Thousands of people were stranded and homes destroyed.,"[-0.030167908906379513, 0.016355400670431647, -0.006856915442646183, 0.0001463415953296043, 0.038027762821276484, 0.011078939614239865, 0.011949311770476104, -0.0005982086791539007, 0.011692317966309833, -0.030875462595843867, 0.016237968346037534, 0.03832742144329653, -0.029495516368892888, -0.04010295065343199, 0.04357250344843186, -0.02789459896312253, 0.006587293370210539, -0.011962257154077586, -0.04878026177060221, 0.00573453578323665, -0.021305747024605727, 0.009142001028367711, 0.05205715000470957, 0.019383968507385725, 0.013366698095712204, 0.025760614875029706, -0.023315174170710818, -0.01407465597916814, -0.016216654097900587, -0.04651755400338259, 0.04228459237962949, -0.05388276574218946, 0.054885382908156616, 0.021615745189700858, 1.8414689029304972e-06, 0.005734957672356899, -0.03259158087715156, 0.00346168894012325, -0.06876727840020844, 0.005217970739715828, -0.025906968490241114, -0.027810288194296987, -0.0227233335841925, -0.022906198769095457, 0.004377270985942544, -0.06382875052933672, 0.04052888543888785, 0.030658289350429295, 0.03114957876529886, 0.039423491240350536, 0.017255490398275057, -0.028409493679629735, 0.011916412800999672, 0.03127017572375613, 0.004733557744977308, 0.043599467099225395, -0.01710626644692985, -0.0070924720640974225, 0.05117243827499765, -0.07172340051776739, 0.05624250986971166, 0.06652139031844513, -0.01847717511974638, -0.002992174312397171, -0.061374473436556616, -0.00283358824100492, -0.08693416291957545, -0.03625828348249897, -0.0012811319718789684, 0.02728231398373948, -0.045454855356343575, 0.0069687710075434835, -0.017586193726552072, 0.02018878579985759, -0.028008147672436198, -0.06319454966744215, -0.008217562803482776, 0.01477116138908829, 0.007816703877988777, 0.0019065938849352946, -0.02803084400325405, 0.01897627646825978, 0.05140871480878874, -0.009184511246676321, -0.001032752552158372, 0.0280929930204119, -0.017453158023966648, 0.04960736933725522, -0.08237042159909531, 0.008861058264538698, -0.12022228366558194, 0.04343361345222636, 0.015480790065219125, 0.009535956991037669, -0.01914407657941183, -0.012545775442188507, 0.017167513943848405, -0.013838147546059295, 0.022919917150422783, -0.13820245663786798, ...]"


Run HDBSCAN

In [112]:
def run_hdbscan(df, min_cluster_size=2, min_samples=1, metric='euclidean'):
    X = np.vstack(df["embedding"].values)

    # Run HDBSCAN
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                 min_samples=min_samples,
                                 metric=metric,
                                 
                                 )
    clusters = clusterer.fit_predict(X)

    # Add cluster labels to DataFrame
    df_with_clusters = df.copy()
    df_with_clusters["cluster"] = clusters

    return df_with_clusters


In [113]:
# DBSCAN clustering
df_facts_clustered = run_hdbscan(df_facts)



Here we see that a few clusters were found. These results are already better than DBSCAN and KMeans based on experimentation in the past. 
- DBSCAN was at times forming too many or too little clusters
- KMeans requires predefined clusters, not ideal for subjective unstructured tasks

In [114]:
print(f"Unqiue clusters found for facts: {df_facts_clustered['cluster'].unique()}")
df_facts_clustered.head()

Unqiue clusters found for facts: [19 17 -1 11 12 16 10 18 14  4  8 20  7  5  9 15 13  6  3  1  0  2]


Unnamed: 0,fact_id,article,text,embedding,cluster
0,0,1110,Hurricane Harvey caused catastrophic flooding in Texas and Louisiana.,"[-0.04331660117801618, 0.004618039191684698, -0.021901129903252183, -0.03430385021486677, -0.012568234022410736, 0.004241631216600844, -0.03442599502886154, -0.024672835194445526, 0.023964588615835506, -0.04448552645189974, -0.02969932637590785, 0.009679736227870021, 0.0018320668540590661, -0.05806823987448396, 0.05081279023948028, -0.03291281196535767, 0.007360045538344373, -0.013668876590178724, -0.020087392291721858, -0.019575196558729226, -0.0077797604772805815, -0.0343554305825187, 0.07002558312018585, -0.03616392857342918, 0.02203426246059445, 0.027800547840445822, -0.005866738294865124, 0.0396141842926665, -0.015728836323972713, -0.022408056213303627, 0.04708030281190277, -0.01888505675297579, 0.05283912829819478, -0.03577532865495135, 1.7570865173281935e-06, -0.02600664181111939, -0.002372385151768144, -0.027114705156899992, 0.009715224273323166, 0.010257162723008926, -0.11298692572671863, -0.010333404511566235, -0.010376552684925221, 0.02813275059108581, -0.014329210349306195, -0.06975014261582006, -0.01790577668826751, 0.0003650025626802186, -0.0009695646243681266, 0.005765120761012924, 0.01969280769459079, -0.028456334882924504, 0.05014842199110197, 0.027915929390142456, 0.05594135250887264, 0.08529928770594365, 0.0036060906605184027, 0.013255018273046272, 0.06857125219073985, -0.06695042314257107, -0.004687752409224383, 0.03365658849893521, -0.025825734270030754, 0.006363829373214484, -0.021368504678983862, -0.03278880078098924, 0.02709478789302226, 0.01746790794468636, 0.03259640444502565, 0.029193964687726538, -0.048702338854085654, -0.014266412201638958, 0.00014625726391753353, 0.008607865938706664, -0.014884668808330448, -0.0450458957746866, -0.0226188070539199, 0.0055686382473974375, 0.017181451757296826, 0.0040560628718520515, -0.03901633855079986, -0.016140363540783183, 0.02458951721721605, 0.029989299237215012, 0.01942966809835635, 0.04374266633970327, -0.006086800963915021, -0.021424588922447668, -0.09467273616422026, 0.022208235374037263, 0.027310677771423875, 0.01555569042494697, 0.01191789707944634, -0.009654315779093127, -0.04505808864940311, 0.0381776751513793, -0.013485404186810126, 0.031871179856479644, 0.018589481055257498, -0.07609234481697018, ...]",19
1,1,1110,Thousands of people were stranded and homes destroyed.,"[-0.030167908906379513, 0.016355400670431647, -0.006856915442646183, 0.0001463415953296043, 0.038027762821276484, 0.011078939614239865, 0.011949311770476104, -0.0005982086791539007, 0.011692317966309833, -0.030875462595843867, 0.016237968346037534, 0.03832742144329653, -0.029495516368892888, -0.04010295065343199, 0.04357250344843186, -0.02789459896312253, 0.006587293370210539, -0.011962257154077586, -0.04878026177060221, 0.00573453578323665, -0.021305747024605727, 0.009142001028367711, 0.05205715000470957, 0.019383968507385725, 0.013366698095712204, 0.025760614875029706, -0.023315174170710818, -0.01407465597916814, -0.016216654097900587, -0.04651755400338259, 0.04228459237962949, -0.05388276574218946, 0.054885382908156616, 0.021615745189700858, 1.8414689029304972e-06, 0.005734957672356899, -0.03259158087715156, 0.00346168894012325, -0.06876727840020844, 0.005217970739715828, -0.025906968490241114, -0.027810288194296987, -0.0227233335841925, -0.022906198769095457, 0.004377270985942544, -0.06382875052933672, 0.04052888543888785, 0.030658289350429295, 0.03114957876529886, 0.039423491240350536, 0.017255490398275057, -0.028409493679629735, 0.011916412800999672, 0.03127017572375613, 0.004733557744977308, 0.043599467099225395, -0.01710626644692985, -0.0070924720640974225, 0.05117243827499765, -0.07172340051776739, 0.05624250986971166, 0.06652139031844513, -0.01847717511974638, -0.002992174312397171, -0.061374473436556616, -0.00283358824100492, -0.08693416291957545, -0.03625828348249897, -0.0012811319718789684, 0.02728231398373948, -0.045454855356343575, 0.0069687710075434835, -0.017586193726552072, 0.02018878579985759, -0.028008147672436198, -0.06319454966744215, -0.008217562803482776, 0.01477116138908829, 0.007816703877988777, 0.0019065938849352946, -0.02803084400325405, 0.01897627646825978, 0.05140871480878874, -0.009184511246676321, -0.001032752552158372, 0.0280929930204119, -0.017453158023966648, 0.04960736933725522, -0.08237042159909531, 0.008861058264538698, -0.12022228366558194, 0.04343361345222636, 0.015480790065219125, 0.009535956991037669, -0.01914407657941183, -0.012545775442188507, 0.017167513943848405, -0.013838147546059295, 0.022919917150422783, -0.13820245663786798, ...]",17
2,2,1110,"Power outages affected 75,000 in the Houston area.","[-0.05709598545593606, -0.010277228096474975, -0.047742812877784334, 0.037470365260362966, 0.023181434206703582, 0.02040030925936745, 0.018723049358973533, -0.05470773523426689, 0.006634944970212628, -0.02359020102170832, 0.004790104272977147, -0.01638816019864385, 0.00992052780435569, 0.010619472398750092, 0.03340776459019224, -0.02209692018134249, -0.0060636851738874375, 0.00428900101027914, 0.04558845502111093, -0.01450083302874402, -0.02826325014744002, -0.010782021725089585, 0.04554045092753228, -0.008107485721296496, -0.008809910668354707, 0.040292164792189125, 0.0025864449903031856, 0.010143282482033684, -0.0297940334874222, -0.02898773719625177, 0.04725333950641341, -0.048368464243937766, 0.0010245891185491913, 0.010139443570157789, 2.0610378959043345e-06, 0.005431550870914957, -0.017003995424289774, 0.01667251348631302, -0.021753884254832895, 0.02403420349705092, -0.017923013418513867, -0.016691931563123413, 0.022889589359808376, 0.030156189467500972, 0.003942856794495941, -0.06731886063247189, -0.004348326261733429, 0.02306450665065828, 0.01615908277163896, 0.028791783189862634, 0.023534686742451065, 0.013362130927549074, 0.029441237299762828, 0.02412277041640414, 0.009185739395946827, 0.06435897575711944, -0.030789736586901335, 0.03180689922239318, 0.03370767655369808, -0.06629832474641854, 0.021727473807725505, 0.08134146507926507, -0.007645866812768913, -0.00022109002579051842, -0.020047634143649823, -0.00543026052341276, -0.009968369847796928, 0.03394779387914471, 0.010133593932726323, -0.009367689103966766, -0.0013903915759656822, 0.01779107852962815, 0.028139322770516687, -0.06027676515767679, 0.004313417495638278, 0.004223580717880697, -0.04545491453201317, -0.005519228842818988, 0.02605351025932265, -0.005487209505179677, -0.057208880384998675, 0.025491119914052795, 0.030237192184463108, 0.006801371858317113, 0.01686730333941617, -0.013081452638973128, -0.020539674240185702, -0.023276896638800403, -0.02927608399535322, 0.012446450793626584, -0.05227481132873966, -0.016892894222610217, 0.018231091218828984, -0.01947311611510779, -0.03720423795653933, 0.00514848373979615, 0.024391784820377044, 0.010944327044900334, 0.04094219818476443, -0.05848135021801282, ...]",-1
3,3,1110,The Navy sent ships for storm relief efforts.,"[0.012925745272467412, 0.006411300647529484, -0.0413438835897503, -0.012019688528040046, 0.045818106348274476, -0.0039437032241230584, -0.007209312329391638, -0.025706915750560312, -0.02951499512177046, -0.03765041463973959, 0.002806537380735227, -0.024650594762024906, 0.010675863597781174, -0.041547537764268214, 0.030950492929033804, 0.019744350078081053, 0.015549249357924353, -0.0071290015880475085, -0.03035040266545537, -0.0040865196814380035, -0.017516890926194687, -0.014158720462165899, -0.030314604487544528, 0.007676573639165091, 0.08641887646351527, 0.009598975281055357, -0.04191391262260377, -0.04968753938940529, -0.05536305281161971, -0.024297204403310374, 0.1020234622160733, -0.012944961251572318, 0.051100959263228954, -0.03823265889975053, 1.5623716424695992e-06, 0.01904261154118192, -0.0030864524402924383, -0.018421767684894046, -0.028323257965260855, -0.024705835230431267, -0.006571629229261582, -0.06914035383024865, -0.009165249035286628, 0.02014373659712288, -0.00021032413232110951, -0.05262573551710474, 0.005007755587181771, 0.0060135863182071755, 0.024592677672480197, 0.052078348799183584, 0.005755084693418034, 0.03543628830212482, -0.017986307316453225, 0.0374813050829195, -0.03205965902426095, 0.03858264248041215, -0.02256559816486734, 0.02205066058880501, 0.03414956182884955, -0.008057713685923522, 0.04567991297632166, 0.05195829759138351, -0.03568557728380313, 0.047961802345956636, -0.014143381579024434, 0.012390639905607972, 0.03776467302099808, 0.014479144794981316, -0.046955977668528104, -0.013967329943580391, 0.034113430237449664, 0.03425990865520858, 0.007231652895803139, -0.007517860436474212, 0.016754638772355678, 0.005740994248215316, -0.023556681868260473, -0.01282474519967291, -0.00482390226235849, -0.020015120946738175, 0.0007678460847988757, -0.000554124313027818, 0.04845050456451472, -0.02657862438635742, -0.007663967722178979, 0.06735756396462354, -0.014654453234994463, 0.01587279920936896, -0.04646007073837425, -0.004271732271588848, -0.026487129391968615, -0.007708717307212722, 0.029258000376940207, 0.04768929735462742, -0.016284859166285047, -0.003090756082004442, 0.005896716974053296, -0.0354716040549314, -0.010389390630170496, -0.01116120839838388, ...]",11
4,4,1110,"The National Guard deployed 24,000 troops in Texas.","[0.013532480033319281, -0.010954988787832798, 0.0011602688516453435, -9.612623923843618e-05, -0.014242518540371127, 0.009007843703654964, -0.0009605404281746749, 0.023107362311490492, -0.06410798780852095, 0.002171848786497094, -0.0015888557240830502, -0.03023417086754022, 0.021474764924823947, -0.013149308759469115, 0.04125122951439611, 0.07010509456923288, -0.0233232894622175, -0.02339184598332734, 0.04039446485298395, 0.015039616155134036, -0.03729883040214369, 0.029560712850573583, -0.00934337782520974, -0.06397115043779841, -0.06668232000766167, 0.015842248620260416, -0.010851614770367859, 0.010692204001847958, 0.039124748014125355, -0.06374993523720371, 0.053265026887989425, -0.07102986068347006, 0.028424465719823627, -0.06010318130948523, 1.6626873729591692e-06, -0.0032150539929896404, -0.018590900200510373, 0.005258444984087331, -0.03974813812674399, -0.02066625022599478, -0.03100717237201303, -0.057440194646975924, -0.027117145798637, 0.021376411667633193, 0.004903058789446953, -0.059773768771548, -0.006575545189825868, 0.029880419144107634, 0.012595572266013183, 0.060035123977302665, 0.013323609514125844, -0.04090652838423833, -0.0140212847133239, -0.0030219046653763076, -0.07881329639399802, 0.03295299963467575, -0.033327000174632415, 0.013501827411745514, 0.0009163001587405425, 0.01671513075102983, 0.03207625996559143, 0.016698111761371065, -0.02843352003838894, 0.014475314737814195, 0.0017297110534427308, 0.0028158967991030646, -0.0030669913906617103, 0.07882099284417532, 0.010299691562063817, -0.03009911604832294, -0.08316830255197259, -0.005486355463036917, 0.03395087084972582, 0.0118363366412682, -0.01653029673747846, -0.02893686080193269, -0.023272834127657283, -0.01294167503672237, 0.03467231805894029, 0.00599434632044989, -0.06521844475312498, -0.010997838010510385, 0.01670849042070987, -0.030768187535723374, -0.03361899216947621, 0.02590437747568375, -0.00908814885968575, -0.054418123507680306, -0.06462543810984134, 0.04922589183697701, 0.031169572686160853, 0.013211092702446168, 0.037911353842367754, 0.052271585043346336, -0.0414364546834502, 0.012921672089371432, -0.006125179919866997, -0.031297368776834825, 0.025683745283052607, -0.007944867449145236, ...]",12


Each fact cluster represents a specific sentiment or message or topic. The -1's indicate data that have no clusters, can be outliers and unique info not shared across documents 

the one below emphasises on the amount of money that went to different things. Cluster labels can change as well but does not matter as this is a one off thing and the clusters remain the same just the number changes 

In [115]:
# for i in df_facts_clustered['cluster'].unique():
list_of_fact = df_facts_clustered[df_facts_clustered['cluster'] == 1]['text'].to_list()
list_of_fact

['Senate resistance to the aid package.',
 'Internal resistance from House conservatives.']

## Fact Organization

Now that we have the clusters, it is easier for the LLM to focus on certain facts at a time. Here we instructed it to extract statements that are either on similarity, contradiction or standalone. Then color code them based on the sentiment they bring (code:n/g/r). A reference number to their article was also given to allow the LLM to link the topics to the original articles

Color coding is very helpful as it would help the operator see anomalies (contradictions) and not have to read the entire thing when there are so many facts

In [128]:
system_prompt = """
You are a precise fact analysis assistant. You are given a group of factual claims in the <claims> tag from multiple sources that all relate to the same topic cluster.

Your task is to:
- Output a valid JSON object using double quotes.
- Write the "central topic" that summarizes the subject of the cluster. The central topic should be about the thing they refer to specifically
- Group all claims into the fewest possible contradiction or agreement statements.
- Every claim must be used exactly once.
- Avoid atomic or overly granular statements. Instead, group claims with the same intent into a **single code:g** or **code:r** line.
- If numeric values (e.g. deaths, wind speeds) vary, list all values clearly in one **code:r** sentence.
- Use **code:g** if claims agree in meaning or outcome, even if phrasing differs.
- Use **code:r** if claims present **contradictory numbers or incompatible facts** about the same thing Contradictory claims (code:r), only if they come from different references
- Do NOT output individual fact summaries or lines with only one ref unless it’s a unique standalone detail.
- Do not restate claims that are already included in another code:g or code:r statement.
- Use each reference only once in the claims section. Do not repeat the same source across multiple comparison entries.
- Choose the best grouping: if a source contributes to both support and contradiction, prefer grouping it in the contradiction unless it’s a unique point.
- Group similar references to a topic together.

IMPORTANT:
- Do NOT include markdown formatting like ```json or ``` in your response.
- Return only the raw JSON object,
- Do not write: "Here's the JSON:", "Output:", or similar phrases.


<example>
Claims:
"A loan of 80Mil was made ref:1107"
"A loan of 80 million dollars was made ref:1108"
"A loan was not made ref:1109"

{
  "central topic": "Loan approval and funding",
  "claims": [
    "code:g ref:1107, ref:1108 state that a loan of 80 million was made.",
    "code:r ref:1107, ref:1108, ref:1109 show contradiction on whether a loan was made, with 1107 and 1108 affirming it, and 1109 denying it."
  ]
}
</example>



"""

sum_gpt = set_role(system_prompt, set_json=True)
sum_gpt("<claims>$81 billion emergency aid bill passed by the House. [ID 1107].</claims>")

{'central topic': 'Emergency aid bill passage',
 'claims': ['code:g ref:1107 states that an $81 billion emergency aid bill was passed by the House.']}

In [129]:
filtered_facts = df_facts_clustered[df_facts_clustered['cluster'] == 3].copy()
filtered_facts.loc[:, 'format'] = filtered_facts.apply(
    lambda row: f"{row['text']} ref:{row['article']}", axis=1
)
filtered_facts


Unnamed: 0,fact_id,article,text,embedding,cluster,format
75,75,4040,House bill passed with a 251-169 vote.,"[-0.06288344878025215, 0.032688798789605775, 0...",3,House bill passed with a 251-169 vote. ref:4040
76,76,4040,69 Democrats supported the House bill.,"[-0.05474687884152152, 0.014471201606605115, 2...",3,69 Democrats supported the House bill. ref:4040


only non -1 clusters are processed and sent to the LLM as they have content to compare, we can save time by ignoring the -1's and cleaning it ourselves 

In [130]:
cluster_ids = df_facts_clustered[df_facts_clustered['cluster'] > -1]['cluster'].unique()
responses = []
for i in  tqdm(cluster_ids):
    filtered_facts = df_facts_clustered[df_facts_clustered['cluster'] == i].copy()
    filtered_facts.loc[:, 'format'] = filtered_facts.apply(
        lambda row: f"{row['text']} ref:{row['article']}", axis=1
    )

    list_of_fact = filtered_facts['format'].to_list()
    string_of_fact = '\n'.join(list_of_fact)
    # print(string_of_fact)
    prompt = f"""
    <claims>
    {string_of_fact}
    </claims>
    """
    response = sum_gpt(prompt)
    response['cluster'] = i
    responses.append(response)
    


100%|██████████| 21/21 [00:38<00:00,  1.86s/it]


From the output we use regex to extract the color, extract the references as well and form a data frame, completely changing our unstructed data to a more structured factual table containing similarities, contradictions and stand alone statements

In [131]:
import pandas as pd
import re

pd.set_option('display.max_colwidth', None)

rows = []

for i, item in enumerate(responses):
    topic = item['central topic']
    for claim in item['claims']:
        # Extract the code marker at the start
        code_match = re.match(r'code:([grn])\s+', claim, flags=re.IGNORECASE)
        code = code_match.group(1) if code_match else None

        # Extract all ref:<id> patterns
        refs = list(set(re.findall(r'ref:(\d+)', claim, flags=re.IGNORECASE)))

        # Remove both code and ref patterns for clean text
        clean_text = re.sub(r'code:[grn]\s+', '', claim, flags=re.IGNORECASE)

        rows.append({
            'cluster_id': i,
            'central_topic': topic,
            'ref_ids': refs,
            'code': code,
            'claim': clean_text
        })

final_df = pd.DataFrame(rows)

final_df

Unnamed: 0,cluster_id,central_topic,ref_ids,code,claim
0,0,Impact of Hurricane Harvey,"[2038, 1110, 2037, 1112]",g,"ref:1110, ref:1112, ref:2037, ref:2038 agree that Hurricane Harvey caused significant flooding in Texas, with specific mentions of catastrophic, historic, and widespread flooding in Texas and Houston."
1,1,Impact and rescue operations in Houston,"[1110, 1112, 1114]",g,"ref:1110, ref:1112, ref:1114 agree that many people were stranded and a significant number of rescues occurred in Houston."
2,1,Impact and rescue operations in Houston,"[1112, 1114]",r,"ref:1112, ref:1114 show contradiction in the number of people rescued, with ref:1112 stating 9,000 to 10,000 and ref:1114 stating 13,000."
3,1,Impact and rescue operations in Houston,"[1110, 1114, 1112]",r,"ref:1114 states 18 dead, which is not mentioned in ref:1110 or ref:1112."
4,2,Storm relief efforts,"[1110, 2038]",g,"ref:1110, ref:2038 indicate that both the Navy and the 'Cajun Navy' were involved in storm relief efforts."
5,3,Deployment of National Guard troops in Texas,"[1110, 2038]",r,"ref:1110, ref:2038 show a contradiction in the number of National Guard troops deployed in Texas, with ref:1110 stating 24,000 troops and ref:2038 stating 4,000 troops."
6,4,Shelter operations and occupancy in Texas and Louisiana,"[1110, 1114]",g,"ref:1110, ref:1114 state that FEMA is operating shelters in Texas, with ref:1114 specifying that 17,000 people sought refuge in these shelters."
7,4,Shelter operations and occupancy in Texas and Louisiana,[1114],g,"ref:1114 mentions that Houston's largest shelter housed 10,000 displaced people."
8,4,Shelter operations and occupancy in Texas and Louisiana,[1113],g,ref:1113 states that 269 people are in shelters in southwest Louisiana.
9,5,Rainfall measurements during a storm,"[1112, 1114]",g,"ref:1112 states that Port Arthur and Beaumont received 26 inches of rain in 24 hours, while ref:1114 reports a Cedar Bayou rainfall record of 51.88 inches."


And we also have the original -1's or unclustered info which represent unique information not in any specific group.

In [132]:
pd.set_option('display.max_colwidth', 50)
df_facts_clustered[df_facts_clustered['cluster'] == -1].head()

Unnamed: 0,fact_id,article,text,embedding,cluster
2,2,1110,"Power outages affected 75,000 in the Houston a...","[-0.05709598545593606, -0.010277228096474975, ...",-1
6,6,1110,"FEMA placed more than 1,800 flood survivors in...","[-0.01805662391791506, 0.005040829148146107, -...",-1
7,7,1110,Texas accepted resources from Mexico and Israel.,"[0.01843603389992548, 0.03324553482500138, 0.0...",-1
10,10,1112,Hurricane Harvey broke the US record for rainf...,"[-0.04299234415695113, -0.00016000004734850767...",-1
11,11,1112,At least 37 deaths related to Hurricane Harvey...,"[-0.013959809988397229, -0.062403890567950954,...",-1


## UI Design

With these 2 structured tables, we can probably design it like this.

url in grp chat

## Possible improvements 

- We could maybe add the news station inside as better references
- An article reference page that allows the user to go to the directed article

### Tests

1 factual side, against misinformation

In [133]:
claims = [
    "10 people were found dead in the plane crash. ref:1108",  # factual
    "Authorities confirmed that ten bodies were recovered from the wreckage. ref:1109",
    "A total of ten fatalities resulted from the aircraft accident. ref:1110",
    "The plane crash claimed the lives of 10 individuals. ref:1111",
    "Emergency responders found 10 deceased passengers at the crash site. ref:1112",
    "Ten lives were lost following the tragic plane crash. ref:1118",
    "The aircraft disaster led to 10 confirmed deaths. ref:1119",
    "Local officials reported that ten people perished in the crash. ref:1120",
    "Ten casualties have been officially recorded from the aviation incident. ref:1121",
    "Confirmed death toll in the crash stands at ten. ref:1122",
    "Ten passengers did not survive the plane crash. ref:1123",
    "Recovery teams located the remains of ten individuals post-crash. ref:1124",
    "Ten victims have been identified from the plane wreck. ref:1125",
    "10 people lost their lives when the plane went down. ref:1126",
    "The final count lists ten people dead in the crash. ref:1127",

    "Only 2 people died in the plane crash, contrary to earlier reports. ref:1113",  # misinfo
    "All 87 passengers aboard the plane died instantly. ref:1114",
    "No fatalities occurred in the recent plane crash incident. ref:1115",
    "The crash was a hoax, and no plane actually went down. ref:1116",
    "Five survivors were rescued, and no one was killed in the crash. ref:1117",
    "The plane was shot down, not crashed. ref:1128",
    "Ten passengers survived without injuries. ref:1129",
    "The crash site had no human remains, only cargo. ref:1130",
    "The plane landed safely; reports of a crash are false. ref:1131",
    "Only crew members were harmed, not passengers. ref:1132",
    "The incident involved a drone, not a commercial aircraft. ref:1133",
    "A mechanical fault was ruled out; it was sabotage. ref:1134",
    "Crash footage is from a different event in 2015. ref:1135",
    "The death toll is actually 25, not 10. ref:1136",
    "Reports of the crash were fabricated to cover a military exercise. ref:1137"
]


str_claims = "\n".join(claims)

test = f"""
<claims>
{str_claims}
</claims>
"""

sum_gpt(test)

{'central topic': 'Plane crash fatalities and incident details',
 'claims': ['code:g ref:1108, ref:1109, ref:1110, ref:1111, ref:1112, ref:1118, ref:1119, ref:1120, ref:1121, ref:1122, ref:1123, ref:1124, ref:1125, ref:1126, ref:1127 state that 10 people died in the plane crash.',
  'code:r ref:1108, ref:1113, ref:1114, ref:1115, ref:1116, ref:1117, ref:1129, ref:1130, ref:1131, ref:1136 show contradictions on the number of fatalities, with ref:1108 and others stating 10 deaths, ref:1113 stating 2 deaths, ref:1114 stating 87 deaths, ref:1115, ref:1116, ref:1117, ref:1129, ref:1130, ref:1131 stating no deaths, and ref:1136 stating 25 deaths.',
  'code:r ref:1116, ref:1117, ref:1131, ref:1137 claim the crash did not occur, with ref:1116 stating it was a hoax, ref:1117 stating no one was killed, ref:1131 stating the plane landed safely, and ref:1137 stating it was fabricated to cover a military exercise.',
  'code:r ref:1128, ref:1133, ref:1134, ref:1135 provide alternative explanations, 

2 consistent sides

In [124]:
claims = [
    # 20 claims saying 5 people died
    "5 people were found dead in the plane crash. ref:2001",
    "Authorities confirmed five bodies were recovered from the crash site. ref:2002",
    "Only five fatalities occurred in the incident. ref:2003",
    "Emergency responders reported five deaths. ref:2004",
    "The plane crash resulted in five confirmed deaths. ref:2005",
    "Just five victims were identified from the wreckage. ref:2006",
    "The fatality count currently stands at five. ref:2007",
    "Officials have announced five deaths in the crash. ref:2008",
    "Only five passengers lost their lives in the incident. ref:2009",
    "Five bodies were recovered after the crash. ref:2010",
    "Five casualties have been recorded so far. ref:2011",
    "Five lives were lost in the aircraft tragedy. ref:2012",
    "Crash investigators confirmed five deceased. ref:2013",
    "Five fatalities have been verified post-crash. ref:2014",
    "Crash site responders confirmed five dead. ref:2015",
    "Five people are confirmed dead after the incident. ref:2016",
    "Reports indicate five victims in the crash. ref:2017",
    "Medical teams documented five fatalities. ref:2018",
    "The official toll released today is five. ref:2019",
    "Government sources confirm only five deaths. ref:2020",

    # 10 claims saying 20 people died
    "20 people died in the plane crash, according to authorities. ref:2021",
    "The crash claimed 20 lives. ref:2022",
    "Emergency teams recovered 20 bodies from the wreckage. ref:2023",
    "Twenty fatalities have been confirmed. ref:2024",
    "20 individuals were found deceased at the site. ref:2025",
    "Officials report 20 passengers were killed. ref:2026",
    "The plane accident led to 20 deaths. ref:2027",
    "Twenty victims have been listed in the crash report. ref:2028",
    "Medical examiners identified 20 casualties. ref:2029",
    "A total of 20 people are believed to have perished. ref:2030"
]


str_claims = "\n".join(claims)

test = f"""
<claims>
{str_claims}
</claims>
"""

sum_gpt(test)

{'central topic': 'Fatalities in the plane crash',
 'claims': ['code:g ref:2001, ref:2002, ref:2003, ref:2004, ref:2005, ref:2006, ref:2007, ref:2008, ref:2009, ref:2010, ref:2011, ref:2012, ref:2013, ref:2014, ref:2015, ref:2016, ref:2017, ref:2018, ref:2019, ref:2020 state that five people died in the plane crash.',
  'code:r ref:2001, ref:2002, ref:2003, ref:2004, ref:2005, ref:2006, ref:2007, ref:2008, ref:2009, ref:2010, ref:2011, ref:2012, ref:2013, ref:2014, ref:2015, ref:2016, ref:2017, ref:2018, ref:2019, ref:2020, ref:2021, ref:2022, ref:2023, ref:2024, ref:2025, ref:2026, ref:2027, ref:2028, ref:2029, ref:2030 show a contradiction in the number of fatalities, with refs 2001-2020 reporting five deaths and refs 2021-2030 reporting 20 deaths.']}

### Regex exclusion of reference chain (UX)

To improve on user experience but still be able to show the different references in the case of many references being shown, we could try regex extraction of reference chain and include a simple hover over to show the different sources in this chain

In [125]:
import re

text = 'ref:2021, ref:2022, ref:2023, ref:2024, ref:2025, ref:2026, ref:2027, ref:2028, ref:2029, ref:2030 indicate that 20 fatalities occurred in the incident, contradicting the claims of five deaths.'

# Pattern to find the full chain of refs
pattern = r'ref:\d{4}(?:,\s*ref:\d{4})*'

# Substitute the matched pattern with an empty string
cleaned_text = re.sub(pattern, '', text)

# Optionally, remove extra spaces left behind
cleaned_text = re.sub(r'\s{2,}', ' ', cleaned_text).strip()

print("These references " + cleaned_text)
# Output: "code:g state that five fatalities were confirmed."



These references indicate that 20 fatalities occurred in the incident, contradicting the claims of five deaths.


In [152]:
import pandas as pd
from pyvis.network import Network

net = Network(height="800px", width="100%", bgcolor="#222222", font_color="white", directed=True)

for _, row in final_df.iterrows():
    refs = row['ref_ids']
    relation = 'SUPPORTS' if row['code'] == 'g' else 'CONTRADICTS'
    topic = row['central_topic']

    # Add Topic node once
    net.add_node(topic, label=topic, title="Topic", color="lightgreen", shape="ellipse")

    # Add each reference node and link directly to topic
    for ref in refs:
        ref_node = f"ref:{ref}"
        net.add_node(ref_node, label=ref_node, title="Reference", color="lightblue", shape="box")
        edge_color = "green" if relation == "SUPPORTS" else "red"
        net.add_edge(ref_node, topic, title=relation, color=edge_color)

net.set_options("""
{
  "physics": {
    "enabled": false,
    "repulsion": {
      "nodeDistance": 200,
      "springLength": 200,
      "damping": 0.09
    },
    "solver": "repulsion"
  },
  "interaction": {
    "dragNodes": true,
    "dragView": true,
    "zoomView": true
  }
}
""")





net.write_html("graph.html")
