## RAG Approach to News similarity comparison

In order to develop such an app, RAG based solutions can be employed with some process modifiers to make the querying much more viable and broad in context. In the naive rag methodology, the system would typically look up sources relating to the language semantics of the query

For example for the question:

"How many people were trapped in the mountingbourn cave" 

The RAG system looks up sources that relates to things such as 
- mountingbourn cave 
- trapped 
- how many people

With questions like these it is great and easy to answer but the pulling of sources may perform worse in the context of more open ended and implicitly implied questions. 

For example: "Is trump corruupt" would lead to the system pulling information that can be considered as one sided, ultimately affecting the percieved message given to the user. Reasons would be due to how 
- US media is more left wing than right wing leading to article imbalance with more against him than for him, displaying how majority is not equal to reliability
- Media companies would use negativity as clickbait (Phrasing "Trump displays corruption") 

Hence although yes RAG, but RAG is not sufficient

In [1]:
from openai import OpenAI
from getpass import getpass
import json
import pandas as pd
from sklearn.cluster import DBSCAN
import hdbscan
import numpy as np

In [2]:
openai_key = getpass("Enter your API Key:")
client = OpenAI(api_key=openai_key)

In [3]:
def set_role(system_prompt, set_json=False):
    def get_completion(prompt, model="gpt-4o-mini"):
        messages = [{"role":"system", "content": system_prompt}, {"role": "user", "content": f"{prompt}"}]
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0, # this is the degree of randomness of the model's output
        )
        return json.loads(response.choices[0].message.content) if set_json else response.choices[0].message.content
    
    return get_completion

## Question Expansion

One possible method in achieving a more openly driven semantic search would be to expand the query before sending it for embedding and vector DB retrieval.
Possible solution would be to use a LLM to expand the query

In [5]:
system_prompt = """
You are an intelligent research assistant that helps break down vague or subjective queries for evidence-based investigation
Given a user query contained within the query tag <query> expand the question to encompass the following 
Ensure that the questions are creative and suit the role of the person, do not simply swap vocabulary
Provide your answer as a JSON object like the one in the example tag where all sub questions are collated in a single list

<requirements>
- At least 2 sub questions that support the original query
- At least 2 sub questions that go against the original query
- At least 2 sub questions that take a neutral stance
- 2 third person perspective questions
</requirements>

<example>
{
  "original": "Is donald trump corrupt",
  "all_sub_questions": [
    "What are the major corruption allegations made against Donald Trump during his presidency?",
    "Have any court cases or legal inquiries found Trump guilty of unethical or corrupt practices?".
    "Have any official investigations concluded that Donald Trump did not engage in corruption?",
    "Were corruption allegations against Donald Trump politically motivated with no legal standing?",
    "What were the main legal and ethical controversies associated with Donald Trump?",
    "How has media coverage of Trump’s alleged corruption varied across sources?",
    "How do historians evaluate the ethical conduct of Donald Trump during his time in office?",
    "What do international news outlets report about Trump’s alleged corruption?"
  ]
}
</example>

"""

expansion_gpt = set_role(system_prompt,set_json=True)
results = expansion_gpt("<query>How many were caught in the implosion of the oceangate incident</query>")
results

{'original': 'How many were caught in the implosion of the Oceangate incident',
 'all_sub_questions': ['What is the total number of individuals who were aboard the Oceangate vessel at the time of the incident?',
  'What were the circumstances surrounding the implosion of the Oceangate submersible, and how many people were involved?',
  'Were there any survivors from the Oceangate incident, and if so, how many?',
  'Have any investigations concluded that the number of individuals reported as caught in the incident was exaggerated?',
  'What safety measures were in place for the Oceangate expedition, and how might they have affected the outcome?',
  'How have different media outlets reported on the number of individuals involved in the Oceangate incident?',
  'What do experts say about the implications of the Oceangate incident for future deep-sea exploration?',
  'How do various reports compare in terms of the number of people caught in the Oceangate implosion?']}

## Trail Dataset (Multinews)
To simulate and test the usage of this rag broadening method, we will use the dataset Multi news by alexfabbri

Found in the link here https://huggingface.co/datasets/alexfabbri/multi_news/tree/main/data

The Multi-News dataset is a multi-document summarization dataset consisting of news articles grouped by topic, where each group has:
- 2 to 10 news articles covering the same event or topic
- A human-written summary that combines key information from all articles

2 train files are given 
- train.tgt which contains the summary 
- train src cleaned which contains the articles themselves (Each row is on a topic, each article per row is delimited by '|||||')




In [6]:

relative = "datasets/"
source = "train.src.cleaned"
target = "train.tgt"

with open(f'{relative}{source}', 'r', encoding='utf-8') as f:
    sources = f.readlines()

with open(f'{relative}{target}', 'r', encoding='utf-8') as f:
    targets = f.readlines()

# Clean up
sources = [s.strip() for s in sources]
targets = [t.strip() for t in targets]

# Check if aligned
assert len(sources) == len(targets)

# Example pair
print("Source:", sources[0])
print("Target (Summary):", targets[0])



Source: National Archives NEWLINE_CHAR NEWLINE_CHAR Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. NEWLINE_CHAR NEWLINE_CHAR A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%. NEWLINE_CHAR NEWLINE_CHAR Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. And while you’re here, why don’t you sign up to follow us on Twitter. NEWLINE_CHAR NEWLINE_CHAR Enjoy the show. ||||| Employers p

Splitting of topic into their separate articles

In [7]:
all_articles = []

for topic_id, source in enumerate(sources):
    articles = [a.strip().replace("NEWLINE_CHAR", "\n") for a in source.split("|||||")]
    for article in articles:

        # Apply to your DataFrame
        all_articles.append({
            "topic_id": topic_id,
            "article": article        
        })

articles_df = pd.DataFrame(all_articles)
articles_df.reset_index(inplace=True)
articles_df.rename(columns={'index': 'article_id'}, inplace=True)

summaries_df = pd.DataFrame({
    "topic_id": list(range(len(targets))),
    "summary": targets
})



from IPython.display import display
print("Articles DataFrame:")
display(articles_df.head())

print("\nSummaries DataFrame (optional):")
display(summaries_df.head())

Articles DataFrame:


Unnamed: 0,article_id,topic_id,article
0,0,0,"National Archives \n \n Yes, it’s that time ag..."
1,1,0,Employers pulled back sharply on hiring last m...
2,2,1,LOS ANGELES (AP) — In her first interview sinc...
3,3,1,"Shelly Sterling said today that ""eventually, I..."
4,4,2,"GAITHERSBURG, Md. (AP) — A small, private jet ..."



Summaries DataFrame (optional):


Unnamed: 0,topic_id,summary
0,0,– The unemployment rate dropped to 8.2% last m...
1,1,"– Shelly Sterling plans ""eventually"" to divorc..."
2,2,– A twin-engine Embraer jet that the FAA descr...
3,3,– Tucker Carlson is in deep doodoo with conser...
4,4,– What are the three most horrifying words in ...


In [8]:
articles_df.groupby('topic_id').count().reset_index().rename(columns={'article':'article/topic'}).groupby('article/topic').size()

article/topic
1       504
2     23743
3     12577
4      4921
5      1845
6       707
7       371
8       194
9        81
10       29
dtype: int64

Here we see that each topic can span 1 to 10 articles which will be good in simulating the different sources of news.

## Article formatting with chunking for embeddings
Here new line char is changed to \n

In [9]:
articles_df['article'] = articles_df['article'].apply(lambda x: '\n'.join(x.split('NEWLINE_CHAR')))
articles_df.head()

Unnamed: 0,article_id,topic_id,article
0,0,0,"National Archives \n \n Yes, it’s that time ag..."
1,1,0,Employers pulled back sharply on hiring last m...
2,2,1,LOS ANGELES (AP) — In her first interview sinc...
3,3,1,"Shelly Sterling said today that ""eventually, I..."
4,4,2,"GAITHERSBURG, Md. (AP) — A small, private jet ..."


Chunking with meta data capturing is conducted. 

In [280]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " "]
)

chunks = []

# Loop through your DataFrame rows
for _, row in articles_df.iterrows():
    article_text = row['article']
    article_id = row['article_id']
    topic_id = row['topic_id']
    
    # Split and attach metadata
    split_chunks = splitter.create_documents(
        [article_text],
        metadatas=[{
            "article_id": article_id,
            "topic_id": topic_id
        }]
    )
    
    chunks.extend(split_chunks)


In [281]:
chunks[0:3]

[Document(metadata={'article_id': 0, 'topic_id': 0}, page_content='National Archives \n \n Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs.'),
 Document(metadata={'article_id': 0, 'topic_id': 0}, page_content='A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%.'),
 Document(metadata={'article_id': 0, 'topic_id': 0}, page_content='Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. An

## Embedding Model

Here the MiniLM L6 v2 embeddings model was used

In [10]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document
import numpy as np

encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

  encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
  from .autonotebook import tqdm as notebook_tqdm


Here only 50000 chunks were embedded to save time for testing. Since FAISS does not use cosine similarity by default, we had to apply our own embeddings by encoding and normalizing our selves before sending them as chunk embedding pairs into FAISS

In [283]:
index_end = 50000
embeddings = encoder.embed_documents([chunk.page_content for chunk in chunks[0:index_end]])
normalized_embeddings = [vec / np.linalg.norm(vec) for vec in embeddings]

text_embedding_pairs = list(zip(
    [doc.page_content for doc in chunks],
    normalized_embeddings
))

faiss_index = FAISS.from_embeddings(
    text_embedding_pairs,         # list of (text, embedding)
    embedding=encoder,            # model used
    metadatas=[doc.metadata for doc in chunks[0:index_end]]  # optional: include article_id/topic_id
)

faiss_index.save_local("vector_index_cosine")


### Retrieval Function 
This function here loads the database and allows for easy query to vector to db chunk retrieval. 
Since FAISS is used, meta data is stored in relation to the article the chunk its from and the topic it is referencing. 

In [11]:
db = FAISS.load_local("vector_index_cosine", encoder, allow_dangerous_deserialization=True)

def semantic_search_with_threshold(db, query, encoder, threshold=0.1, k=99999):
    vec = encoder.embed_query(query)
    vec = vec / np.linalg.norm(vec)
    results = db.similarity_search_with_score_by_vector(vec, k=k)
    return [(doc, 1 - score) for doc, score in results if 1 - score >= threshold]

## Test Topic
For the test topic, we will look at 7 different articles on the topic of Huricane harvey and its aftermath

Why this topic is due to 
- it being broad for 7 articles 
- containing different perspectives and info

In [61]:
articles_df[articles_df['topic_id'] == 390]['article'].values

       'HOUSTON — Five days after the pummeling began — a time when big storms have usually blown through, the sun has come out, and evacuees have returned home — Tropical Storm Harvey refused to go away, battering southeast Texas even more on Tuesday, spreading the destruction into Louisiana and shattering records for rainfall and flooding. \n \n Along 300 miles of Gulf Coast, people poured into shelters by the thousands, straining their capacity; as heavy rain kept falling, some rivers were still rising and floodwater in some areas had not crested yet; and with whole neighborhoods flooded, others were covered in water for the first time. \n \n Officials cautioned that the full-fledged rescue-and-escape phase of the crisis, usually finished by now, would continue, and that they still had no way to gauge the scale of the catastrophe — how many dead, how many survivors taking shelter inland or still hunkered down in flooded communities, and how many homes destroyed. \n \n For everybody,

Possible questions
- What were the key challenges faced by Texas and Louisiana during Hurricane Harvey?
- How many homes were destroyed in huricane harvey

What were the key challenges faced by Texas and Louisiana during Hurricane Harvey?

In [62]:
semantic_search_with_threshold(db, "What were the key challenges faced by Texas and Louisiana during Hurricane Harvey", encoder, threshold=0.25)

[(Document(id='13a9b0ec-d27e-4221-89b3-89c07fcfe49c', metadata={'article_id': 1112, 'topic_id': 390}, page_content='(CNN) With countless Houstonians still awaiting rescue, Tropical Depression Harvey devoured another Texas city. \n \n The unrelenting storm unleashed its wrath on a wide swath east of Houston, leaving thousands stranded in flooded homes and forcing the evacuation of a nursing facility and even an emergency shelter where residents had sought refuge.'),
  np.float32(0.37814432)),
 (Document(id='0e72b927-a078-4c7d-94d7-553a88d5ec3e', metadata={'article_id': 1113, 'topic_id': 390}, page_content="The catastrophic flooding from Hurricane Harvey is not limited to Texas, it's also affecting parts of southwest Louisiana where preparations are underway to evacuate some areas. \n \n Interested in Hurricane Harvey? Add Hurricane Harvey as an interest to stay up to date on the latest Hurricane Harvey news, video, and analysis from ABC News. Add Interest"),
  np.float32(0.36525846)),
 

### Broadening search 

In [63]:
question =  "<query>What were the key challenges faced by Texas and Louisiana during Hurricane Harvey?</query>"
questions = expansion_gpt(question)
questions

{'original': 'What were the key challenges faced by Texas and Louisiana during Hurricane Harvey?',
 'all_sub_questions': ['What were the immediate impacts of Hurricane Harvey on infrastructure in Texas and Louisiana?',
  'How did the response and recovery efforts differ between Texas and Louisiana during and after Hurricane Harvey?',
  'Were there any significant challenges that Texas and Louisiana did not face during Hurricane Harvey?',
  'Did the preparedness levels of Texas and Louisiana mitigate some of the challenges posed by Hurricane Harvey?',
  'What were the long-term economic and social challenges faced by communities in Texas and Louisiana post-Hurricane Harvey?',
  'How did local, state, and federal agencies coordinate their efforts in response to Hurricane Harvey?',
  'What lessons have emergency management officials learned from the challenges faced during Hurricane Harvey in Texas and Louisiana?',
  'How did the media portray the challenges faced by Texas and Louisiana d

In [64]:
retrieved_documents =  []
for question in questions['all_sub_questions']:
    retrieved_documents += semantic_search_with_threshold(db, question, encoder, threshold=0.25)

article_id = set()
topic_id_check = set()

for doc, score in retrieved_documents:
    aid = doc.metadata.get("article_id")
    tid = doc.metadata.get("topic_id")
    article_id.add(aid)
    topic_id_check.add(tid)

print("Unique Article IDs:", article_id)
print("Topics Covered:", topic_id_check)


Unique Article IDs: {1537, 4040, 2037, 1110, 2038, 1112, 1113, 1114}
Topics Covered: {1394, 541, 390, 711}


From broadening the search we see that even more articles emerged with article_id = 4040 which covers topic 1394. And it encompasses huricane harvey as well from the example below 

In [65]:
articles_df[articles_df['article_id'] == 1537]['article'].iloc[0]

'Acting Homeland Secretary Elaine Duke, center, is briefed on the Hurricane Maria response during a flight to Puerto Rico on Friday, Sept. 29, 2017. President Donald Trump on Thursday cleared the way for... (Associated Press) \n \n BRANCHBURG, N.J. (AP) — President Donald Trump on Sunday scoffed at "politically motivated ingrates" who had questioned his administration\'s commitment to revive Puerto Rico after a pulverizing hurricane and said the federal government had done "a great job with the almost impossible situation." \n \n The tweets coming from a president ensconced in his New Jersey golf club sought to defend Washington\'s efforts to mobilize and coordinate recovery efforts on a U.S. territory in dire straits almost two weeks after Hurricane Maria struck. \n \n San Juan Mayor Carmen Yulin Cruz on Friday accused the Trump administration of "killing us with the inefficiency" after the storm. She begged the president, who is set to visit Puerto Rico on Tuesday, to "make sure some

## Fact Extraction (Using LLMS)
With the articles obtained, we can feed them to the LLM for fact extraction. However we don't want all facts as some are rubbish and waste time or token gen so we can prompt engineer to ask for facts only relating to query. We can test the difference between with question and no question but just not giving the question. The llm will extract everything

In [None]:
system_prompt = """
You are a research assistant specialized in fact extraction.

Extract clear, verifiable facts from the <context>. Focus on short, distinct, evidence-based statements — no opinions, summaries, or assumptions.
Avoid full sentences. Only return **distinct factual keywords or short noun phrases** that can be grounded in the text.
Reduce the facts down to those that aid in answering the <query>

<requirements>
- Output a valid JSON.
- Min. 1 fact cannot be empty, no additional prior knowledge.
- Each fact must be concise, directly grounded in the text.
- Avoid redundancy, vague phrasing, or restatements.
</requirements>

<example>
{
  "facts": [
    "The study was published in 'Behavioral Ecology'.",
    "Urban birds solved food puzzles faster than rural birds.",
    "Over 100 pigeons and sparrows were studied in cities."
  ]
}
"""


example ="""

<context>
The OceanGate incident involved the implosion of a submersible during a dive to the Titanic wreck site. Five individuals were aboard the submersible when it imploded
</context>"

<query>
how many people were aboard ocean gate
</query>

"""


fact_gpt = set_role(system_prompt, set_json=True)
results = fact_gpt(example)
results

{'facts': ['Five individuals were aboard the submersible.']}

### Testing with single article


In [67]:
article_id_list = list(article_id)
article_id_list

[1537, 4040, 2037, 1110, 2038, 1112, 1113, 1114]

In [68]:
single_test = articles_df[articles_df['article_id'] == article_id_list[2]]
fact_extract_context = single_test['article'].values[0]
fact_extract_context

'A man checks on a boat storage facility that was damaged by Hurricane Harvey, Saturday, Aug. 26, 2017, in Rockport, Texas. (AP Photo/Eric Gay) (Associated Press) \n \n HOUSTON (AP) — Rising floodwaters from the remnants of Hurricane Harvey chased thousands of people to rooftops or higher ground Sunday in Houston, overwhelming rescuers who fielded countless desperate calls for help. \n \n A fleet of helicopters, airboats and high-water vehicles confronted flooding so widespread that authorities had trouble pinpointing the worst areas. Rescuers got too many calls to respond to each one and had to prioritize life-and-death situations. \n \n The water rose high enough to begin filling second floors — a highly unusual sight for a city built on nearly flat terrain. Authorities urged people to get on top of their homes to avoid becoming trapped in attics and to wave sheets or towels to draw attention to their location. \n \n Harris County Sheriff Ed Gonzalez used Twitter to field calls for a

### No question

In [69]:
fact_extract_context_query = f"""
<context>
{fact_extract_context}
</context>"
"""

fact_extract_context_query

'\n<context>\nA man checks on a boat storage facility that was damaged by Hurricane Harvey, Saturday, Aug. 26, 2017, in Rockport, Texas. (AP Photo/Eric Gay) (Associated Press) \n \n HOUSTON (AP) — Rising floodwaters from the remnants of Hurricane Harvey chased thousands of people to rooftops or higher ground Sunday in Houston, overwhelming rescuers who fielded countless desperate calls for help. \n \n A fleet of helicopters, airboats and high-water vehicles confronted flooding so widespread that authorities had trouble pinpointing the worst areas. Rescuers got too many calls to respond to each one and had to prioritize life-and-death situations. \n \n The water rose high enough to begin filling second floors — a highly unusual sight for a city built on nearly flat terrain. Authorities urged people to get on top of their homes to avoid becoming trapped in attics and to wave sheets or towels to draw attention to their location. \n \n Harris County Sheriff Ed Gonzalez used Twitter to fiel

In [70]:
fact_gpt(fact_extract_context_query)

{'facts': ['Hurricane Harvey made landfall on Aug. 25, 2017.',
  'Hurricane Harvey was a Category 4 storm.',
  'Winds reached 130 mph (209 kph).',
  'Houston received 11 inches (28 centimeters) of rain.',
  'Aransas County reported one death during the storm.',
  'Rockport experienced widespread devastation.',
  'The storm caused structural flooding reports in Houston.',
  'More than 2,000 calls for help were received by authorities.',
  'The Coast Guard received over 300 requests for help.',
  'President Donald Trump announced plans to visit Texas.',
  'Rainfall totals varied across the region.',
  'Harvey was the strongest hurricane to hit Texas since 1961.']}

Model extracted  points all extremely atomic along with all assocuated nouns, relating to the specifics of the article. Maybe relevant who knows

### With question

In [71]:
fact_extract_context_query = f"""
<context>
{fact_extract_context}
</context>

<query>
{question}
</query>
"""

ans = fact_gpt(fact_extract_context_query)
ans

{'facts': ['Rising floodwaters chased thousands to rooftops or higher ground.',
  'Rescuers received countless desperate calls for help.',
  'Authorities had trouble pinpointing the worst areas of flooding.',
  'Water rose high enough to begin filling second floors.',
  'Houston Mayor Sylvester Turner reported over 2,000 calls for help.',
  'The Coast Guard received more than 300 requests for help.',
  'KHOU-TV staff evacuated due to flooding from Buffalo Bayou.',
  'Hurricane Harvey was blamed for at least two deaths and up to 14 injuries.',
  'Forecast predicted as much as 40 inches of rain in the region.',
  'Rockport experienced widespread devastation, including heavily damaged homes and schools.',
  'Harvey came ashore as a Category 4 storm with 130 mph winds.',
  'Harvey was the fiercest hurricane to hit the U.S. in over a decade.']}

Time could be saved for lesser answers, may be good or bad possibly need to balance 

### For all facts
Now with this established, we can try to extract all facts and nouns from all related articles. Can take a while if there are a lot of articles not sure if there is a faster better way to extract facts... 

Maybe smaller model? OpenIE (A bit out dated)

In [72]:
facts_and_nouns = {}

for row in articles_df[articles_df['article_id'].isin(article_id_list)].iterrows():
    article_row = row[1].iloc[2]
    article_id = row[1].iloc[0]

    fact_extract_context_query = f"""
    <context>
    {article_row}
    </context>

    <query>
    {question}
    </query>
    """

    facts = fact_gpt(fact_extract_context_query)

    facts_and_nouns[article_id] = facts

facts_and_nouns


{1110: {'facts': ['30,000 to 40,000 homes destroyed in Houston area.',
   'Harvey made landfall near Cameron, Louisiana.',
   'Maximum sustained winds of 35 mph.',
   'Storm dumped more than 2 feet of rain in Beaumont-Port Arthur area.',
   'Power outages in Houston area down to 75,000.',
   '32,000 outages inaccessible to crews.',
   'Houston Fire Department received about 15,000 calls for assistance.',
   'Coast Guard taking more than 1,000 calls per hour for rescues.',
   'At least 31 deaths reported due to the storm.',
   '24,000 National Guard troops deployed in Texas.',
   'FEMA operating more than 230 shelters in Texas.',
   'George R. Brown Convention Center housing about 8,000 people.',
   'Texas accepting resources from Mexico for relief efforts.',
   'Israeli Rescue Coalition team arriving in Houston.']},
 1112: {'facts': ['Tropical Depression Harvey affected Texas and Louisiana.',
   'Thousands stranded in flooded homes.',
   '37 deaths reported in Texas related to Hurrican

## HDBSCAN and Cluster Association (Concept Summary)

We’ve extracted factual statements and key nouns from articles retrieved via question broadening, aiming to capture multiple aspects of a topic. Now, we need to organize and reason over this information.

### Project Goals

* Identify distinct factual statements about a topic
* Show what different people/sources are saying
* Detect missing or underreported information

### LLM Limitations

While LLMs can analyze text directly and I am not going against, relying solely on them poses some challenges that I am concerned about:

* Context window limits
* Slow responses when handling many facts
* Diluted attention with too many tokens
* Hallucinations or tracking errors


### Proposed Solution: Fact clustering before content summarization and comparing

We use embeddings + clustering to organize the facts to semantically similar claims and then make sub comparisons and abstractions of them

This allows us to possibly improve on:
- summarising all relevant themes and remove redundant processing saving time 
- Data organization and presentation later on

### Potentially viable ?
* Embeddings inherently encode semantic meaning meaning clustering could use the relational patterns in their latent space
* Clustering leverages this to organize facts/nouns without manual labeling

* Pronouns (e.g., "he", "she") may weaken noun clarity
* Can mitigate by prompting LLM extractors to avoid or resolve pronouns

HDBSCAN seems ok as it does not assume any uniform density and it doesn't require predefined cluster count (Like an anything goes model )



### Fact Reembedding
Next we can take all of these facts then reembedd the them 
Function below embedds the facts as well as normalizes them before fitting them into a dataframe

In [73]:
def embed_and_build_dataframes(articles, encoder):
    fact_rows = []

    for article_id, article in articles.items():
        facts = article["facts"]

        # Embed and normalize facts
        fact_embeddings = encoder.embed_documents(facts)
        norm_fact_embeddings = [vec / np.linalg.norm(vec) for vec in fact_embeddings]

        # Add to rows
        for fact, emb in zip(facts, norm_fact_embeddings):
            fact_rows.append({
                "article": article_id,
                "text": fact,
                "embedding": emb
            })

    df_facts = pd.DataFrame(fact_rows)
    df_facts.reset_index(inplace=True)
    df_facts.rename(columns={'index': 'fact_id'}, inplace=True)

    return df_facts


In [74]:
df_facts = embed_and_build_dataframes(facts_and_nouns, encoder)

In [75]:
df_facts.head(2)

Unnamed: 0,fact_id,article,text,embedding
0,0,1110,"30,000 to 40,000 homes destroyed in Houston area.","[-0.056860456866819634, 0.08809049980619167, -..."
1,1,1110,"Harvey made landfall near Cameron, Louisiana.","[-0.011395760571932744, -0.056227716364759636,..."


Run HDBSCAN

In [76]:
def run_hdbscan(df, min_cluster_size=2, min_samples=1, metric='euclidean'):
    X = np.vstack(df["embedding"].values)

    # Run HDBSCAN
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                 min_samples=min_samples,
                                 metric=metric,
                                 
                                 )
    clusters = clusterer.fit_predict(X)

    # Add cluster labels to DataFrame
    df_with_clusters = df.copy()
    df_with_clusters["cluster"] = clusters

    return df_with_clusters


In [77]:
# DBSCAN clustering
df_facts_clustered = run_hdbscan(df_facts)



Here we see that a few clusters were found. These results are already better than DBSCAN and KMeans based on experimentation in the past. 
- DBSCAN was at times forming too many or too little clusters
- KMeans requires predefined clusters, not ideal for subjective unstructured tasks

In [78]:
print(f"Unqiue clusters found for facts: {df_facts_clustered['cluster'].unique()}")
df_facts_clustered.head()

Unqiue clusters found for facts: [-1 16 10 12  3  9  5 13  4  7 15 11 14  2  8  6  1  0]


Unnamed: 0,fact_id,article,text,embedding,cluster
0,0,1110,"30,000 to 40,000 homes destroyed in Houston area.","[-0.056860456866819634, 0.08809049980619167, -...",-1
1,1,1110,"Harvey made landfall near Cameron, Louisiana.","[-0.011395760571932744, -0.056227716364759636,...",16
2,2,1110,Maximum sustained winds of 35 mph.,"[-0.058482503044972635, -0.07144808938939251, ...",10
3,3,1110,Storm dumped more than 2 feet of rain in Beaum...,"[-0.009097628574021573, 0.004168155769721676, ...",12
4,4,1110,"Power outages in Houston area down to 75,000.","[-0.054019449486435195, 0.0033545723539305603,...",3


Each fact cluster represents a specific sentiment or message or topic. The -1's indicate data that have no clusters, can be outliers and unique info not shared across documents 

the one below emphasises on the amount of money that went to different things. Cluster labels can change as well but does not matter as this is a one off thing and the clusters remain the same just the number changes 

In [152]:
# for i in df_facts_clustered['cluster'].unique():
list_of_fact = df_facts_clustered[df_facts_clustered['cluster'] == 1]['text'].to_list()
list_of_fact

['$81 billion emergency aid bill passed by House.',
 '$27.6 billion allocated for FEMA.',
 '$26.1 billion for Community Development Block Grants.',
 '$12.11 billion for Army Corps of Engineers.',
 'Congress may spend a record $133 billion on natural disasters this year.']

## Fact Organization

Now that we have the clusters, it is easier for the LLM to focus on certain facts at a time. Here we instructed it to extract statements that are either on similarity, contradiction or standalone. Then color code them based on the sentiment they bring (code:n/g/r). A reference number to their article was also given to allow the LLM to link the topics to the original articles

Color coding is very helpful as it would help the operator see anomalies (contradictions) and not have to read the entire thing when there are so many facts

In [4]:
system_prompt = """
You are a precise fact analysis assistant. You are given a group of factual claims in the <claims> tag from multiple sources that all relate to the same topic cluster.

Your task is to:
- Output a valid JSON object using double quotes.
- Write the "central topic" that summarizes the subject of the cluster.
- Group all claims into the fewest possible contradiction or agreement statements.
- Every claim must be used exactly once.
- Avoid atomic or overly granular statements. Instead, group claims with the same intent into a **single code:g** or **code:r** line.
- If numeric values (e.g. deaths, wind speeds) vary, list all values clearly in one **code:r** sentence.
- Use **code:g** if claims agree in meaning or outcome, even if phrasing differs.
- Use **code:r** if claims present **contradictory numbers or incompatible facts** about the same thing Contradictory claims (code:r), only if they come from different references
- Do NOT output individual fact summaries or lines with only one ref unless it’s a unique standalone detail.
- Do not restate claims that are already included in another code:g or code:r statement.
- Use each reference only once in the claims section. Do not repeat the same source across multiple comparison entries.
- Choose the best grouping: if a source contributes to both support and contradiction, prefer grouping it in the contradiction unless it’s a unique point.
- Group similar references to a topic together.


<example>
Claims:
"A loan of 80Mil was made ref:1107"
"A loan of 80 million dollars was made ref:1108"
"A loan was not made ref:1109"

{
  "central topic": "Loan approval and funding",
  "claims": [
    "code:g ref:1107, ref:1108 state that a loan of 80 million was made.",
    "code:r ref:1107, ref:1108, ref:1109 show contradiction on whether a loan was made, with 1107 and 1108 affirming it, and 1109 denying it."
  ]
}
</example>



"""

sum_gpt = set_role(system_prompt, set_json=True)
sum_gpt("<claims>$81 billion emergency aid bill passed by the House. [ID 1107].</claims>")

{'central topic': 'Emergency aid bill approval',
 'claims': ['code:g ref:1107 states that an $81 billion emergency aid bill was passed by the House.']}

In [268]:
filtered_facts = df_facts_clustered[df_facts_clustered['cluster'] == 3].copy()
filtered_facts.loc[:, 'format'] = filtered_facts.apply(
    lambda row: f"{row['text']} ref:{row['article']}", axis=1
)
filtered_facts


Unnamed: 0,fact_id,article,text,embedding,cluster,format
4,4,1110,"Power outages in Houston area down to 75,000.","[-0.054019449486435195, 0.0033545723539305603, -0.04658672392104268, 0.011920405077334176, 0.026038092627688504, 0.018466357849481176, 0.011556990887597408, -0.04244546009855441, -0.01352453842439681, -0.006011732618695522, 0.0030314797766071502, -0.020925999489761222, 0.0043543364658879735, 0.029394027972259252, 0.035633136952664915, -0.04206704137024295, -0.005585822956516535, 0.003794214367998564, 0.04152314151524122, -0.044037962157536886, -0.02582533197736326, -0.0007000191521901712, 0.04217614022610506, 0.012951739628797171, 0.007207828850132316, 0.02232452381282318, 0.014167388512776113, 0.013568999765812995, -0.023729545042414656, -0.05517221826173385, 0.01813745381906943, -0.0515238774198494, -0.006344299075272308, 0.0019163948752587954, 2.093141654955196e-06, -0.03976744938318984, -0.018835249147489175, 0.017238414575723535, -0.05347799342196292, 0.017829104078667224, -0.030469811757944495, -0.030640208405371974, 0.019177409623936065, 0.020298559996614715, -0.01515747106053805, -0.0673229645663081, -0.005838453062127831, 0.004855795018215281, 0.057879386825382004, -0.0012784462213622273, 0.022883213070876627, -0.007456185116284902, 0.021423528791995633, 0.002121011342187534, -0.005634494341792623, 0.08820582519275977, -0.019439088925690434, 0.0174863624569115, 0.02893592900670013, -0.08201394544456814, 0.0315947880039063, 0.016514385752016594, -0.004470459356877045, 0.003257238663773834, -0.02190673435248348, -0.0033468807933226654, -0.010826606718397712, 0.027900609325787622, 0.018712221430667267, 0.0003114648981183677, 2.8894233053152272e-05, -0.010331135639132084, 0.01438476386663544, -0.025482612707350184, 0.00744586606176267, 0.00884236336116329, -0.06321931872669545, 0.019410882889114595, 0.022686459993577386, 0.0031188387687768075, -0.05672003918716412, 0.008521283372000758, 0.026727466937065823, 0.009840285782658813, 0.002732646523467739, 0.0022169494454127117, 0.0013938895960917516, -0.03283248967597476, -0.03176164748805982, -0.0013754482442543572, -0.01818908075664934, -0.03179798024573322, 0.010193173232931252, -0.022308365365534304, -0.042907537674304774, 0.014351722403066874, -0.0007711920484309622, 0.019255573667974136, 0.023775325136647037, -0.057298368176423774, ...]",3,"Power outages in Houston area down to 75,000. ref:1110"
5,5,1110,"32,000 outages inaccessible to crews.","[-0.08085642170468596, 0.041858532365538145, -0.0170479774133727, 0.0411620110993189, 0.015479465322490518, 0.013468577793852824, 0.01530794829713195, 0.005482889820688864, -0.024121620239789608, -0.02109971311846817, 0.02317989291067781, 0.036913786816176034, -0.051399727440894216, 0.006652818193465768, 0.020588523185630966, 0.004001020533288116, -0.03144865545469243, -0.025593866170835835, -0.07728009079311036, 0.004059339488406846, -0.047346402899454916, 0.025423631576687967, 0.006210734259681897, 0.006630818491171521, -0.015322146310063785, 0.00643159974094244, 0.0324128649112017, -0.020321544663156884, 0.02525515345695069, -0.01821597701705929, -0.005583062413141582, -0.052075744708903735, -0.0017210746059952966, 0.01684349995012167, 1.7385857863191646e-06, -0.0109480435339147, -0.010326384771696746, 0.0013956608294936716, -0.08548783680611532, 0.0130414732526564, -0.05258584871959738, -0.0633268294387426, 0.0045397235981707255, 0.020438993755372937, 0.0016948235324788642, -0.013689510796001367, 0.010663526344367328, -0.0225954898461517, 0.03118217053319269, 0.049208578699071324, -0.00919582340493025, 0.06183864671548322, -0.003594453052253474, 0.021090791048026254, -0.06240410481582045, -0.048326985007150915, -0.03373583473173522, 0.004199127284341033, 0.011975028789811989, -0.10088814250215998, 0.03371028669036125, 0.05823329206661619, 0.03667330442147569, 0.018792149052507565, -0.02433676134673165, 0.0327516912464432, 0.001912330548053358, 0.012565328057675023, 0.01288161452351828, 0.02760257440460567, -0.007631567923959839, -0.010202467644071605, -0.015168459455750148, -0.03466997364435843, 0.002755928839312349, 0.0022215021749466143, -0.007371526823481771, 0.0615667042435879, -0.024266068373979675, 0.000904543983543902, -0.04209322565899976, 0.018144322919391234, 0.060042974799595014, -0.04512230396428809, -0.001102782458536081, 0.020554026996782482, -0.016729526553823076, 0.005500382387293796, -0.005674528070671394, 0.036918748902952075, -0.03896620201924013, -0.008404009210984841, 0.022997156242040146, -0.0459723593479157, -0.036732752605238554, 0.018662164358188366, 0.019816527536466764, -0.019734502230401928, 0.019029111047805983, -0.06263666724318906, ...]",3,"32,000 outages inaccessible to crews. ref:1110"


only non -1 clusters are processed and sent to the LLM as they have content to compare, we can save time by ignoring the -1's and cleaning it ourselves 

In [269]:
cluster_ids = df_facts_clustered[df_facts_clustered['cluster'] > -1]['cluster'].unique()
responses = []
for i in cluster_ids:
    filtered_facts = df_facts_clustered[df_facts_clustered['cluster'] == i].copy()
    filtered_facts.loc[:, 'format'] = filtered_facts.apply(
        lambda row: f"{row['text']} ref:{row['article']}", axis=1
    )

    list_of_fact = filtered_facts['format'].to_list()
    string_of_fact = '\n'.join(list_of_fact)
    print(string_of_fact)
    prompt = f"""
    <claims>
    {string_of_fact}
    </claims>
    """
    response = sum_gpt(prompt)
    response['cluster'] = i
    responses.append(response)
    


Harvey made landfall near Cameron, Louisiana. ref:1110
Tropical Storm Harvey made landfall in southwestern Louisiana. ref:1114
Maximum sustained winds of 35 mph. ref:1110
Maximum sustained winds of Harvey were 45 mph. ref:1114
Harvey was a Category 4 storm with 130 mph winds. ref:2037
Storm dumped more than 2 feet of rain in Beaumont-Port Arthur area. ref:1110
15 inches of rain in Beaumont area. ref:1112
26 inches of rain in 24 hours in Beaumont and Port Arthur. ref:1112
52 inches of rain in parts of Texas. ref:1112
Surrounding areas received 10 to 20 inches of rain. ref:1113
Another 10 to 15 inches of rain is still possible. ref:1113
Rainfall exceeded 4 inches per hour. ref:2037
Houston experienced more than 20 inches of rain in 24 hours. ref:2038
Power outages in Houston area down to 75,000. ref:1110
32,000 outages inaccessible to crews. ref:1110
Houston Fire Department received about 15,000 calls for assistance. ref:1110
Houston Police received 60,000 to 70,000 calls for help. ref:1

From the output we use regex to extract the color, extract the references as well and form a data frame, completely changing our unstructed data to a more structured factual table containing similarities, contradictions and stand alone statements

In [266]:
import pandas as pd
import re

pd.set_option('display.max_colwidth', None)

rows = []

for i, item in enumerate(responses):
    topic = item['central topic']
    for claim in item['claims']:
        # Extract the code marker at the start
        code_match = re.match(r'code:([grn])\s+', claim, flags=re.IGNORECASE)
        code = code_match.group(1) if code_match else None

        # Extract all ref:<id> patterns
        refs = list(set(re.findall(r'ref:(\d+)', claim, flags=re.IGNORECASE)))

        # Remove both code and ref patterns for clean text
        clean_text = re.sub(r'code:[grn]\s+', '', claim, flags=re.IGNORECASE)

        rows.append({
            'cluster_id': i,
            'central_topic': topic,
            'ref_ids': refs,
            'code': code,
            'claim': clean_text
        })

final_df = pd.DataFrame(rows)

final_df

Unnamed: 0,cluster_id,central_topic,ref_ids,code,claim
0,0,Landfall of Tropical Storm Harvey,"[1110, 1114]",g,"ref:1110, ref:1114 state that Tropical Storm Harvey made landfall in Louisiana, specifically near Cameron and in southwestern Louisiana."
1,1,Hurricane Harvey wind speeds,"[1110, 2037, 1114]",r,"ref:1110, ref:1114, ref:2037 present contradictory information about Hurricane Harvey's maximum sustained winds, with claims of 35 mph, 45 mph, and 130 mph respectively."
2,2,Rainfall amounts during a storm in Texas,"[1110, 1112, 2038]",r,"ref:1110, ref:1112, ref:2038 show contradictions in reported rainfall amounts, with ref:1110 stating more than 2 feet, ref:1112 reporting 15 inches and 26 inches, and ref:2038 indicating more than 20 inches in Houston."
3,2,Rainfall amounts during a storm in Texas,"[1112, 1113]",g,"ref:1112, ref:1113 indicate that surrounding areas received 10 to 20 inches of rain and that another 10 to 15 inches is still possible."
4,2,Rainfall amounts during a storm in Texas,"[1112, 2037]",g,"ref:1112, ref:2037 state that rainfall exceeded 4 inches per hour."
5,3,Power outages in Houston area,[1110],r,"ref:1110 show contradictory information regarding power outages in the Houston area, with one stating outages are down to 75,000 and another indicating that 32,000 outages are inaccessible to crews."
6,4,Emergency calls in Houston,"[1110, 2038]",g,"ref:1110, ref:2038 state that Houston Fire Department received about 15,000 calls for assistance, while Houston received about 6,000 rescue calls and over 56,000 911 calls."
7,4,Emergency calls in Houston,"[1112, 2038]",r,"ref:1112, ref:2038 show contradiction in the number of calls received by the Houston Police, with ref:1112 stating 60,000 to 70,000 calls, while ref:2038 does not specify a number for police calls."
8,5,Coast Guard rescue operations,"[1110, 2037]",r,"ref:1110, ref:2037 present contradictory information about the number of calls for rescues, with ref:1110 stating more than 1,000 calls per hour and ref:2037 stating over 300 requests for help."
9,6,Casualties from Hurricane Harvey,"[2038, 1110, 1112, 2037]",r,"ref:1110, ref:1112, ref:2037, ref:2038 present contradictory reports on the number of deaths due to the storm, with claims of at least 31 deaths, 37 deaths in Texas, one death in Aransas County, and two deaths reported."


And we also have the original -1's or unclustered info which represent unique information not in any specific group.

In [163]:
pd.set_option('display.max_colwidth', 50)
df_facts_clustered[df_facts_clustered['cluster'] == -1].head()

Unnamed: 0,fact_id,article,text,embedding,cluster
0,0,1110,"30,000 to 40,000 homes destroyed in Houston area.","[-0.056860456866819634, 0.08809049980619167, -...",-1
11,11,1110,George R. Brown Convention Center housing abou...,"[-0.040225155486884066, -0.0024642052493685594...",-1
12,12,1110,Texas accepting resources from Mexico for reli...,"[0.024872729992436025, 0.020925637702795312, -...",-1
13,13,1110,Israeli Rescue Coalition team arriving in Hous...,"[-0.01806380278812633, -0.018730933312977836, ...",-1
20,20,1112,One-third of the Houston area covered in water.,"[0.01787723482161884, 0.031090346408651152, -0...",-1


## UI Design

With these 2 structured tables, we can probably design it like this.

https://www.figma.com/design/48vCsCvLT0L3GF4rWK7vFS/UI-Wireframe-OSU?node-id=0-1&p=f&t=YNVEEaOBBAYRSSG3-0

## Possible improvements 

- We could maybe add the news station inside as better references
- An article reference page that allows the user to go to the directed article

### Tests

1 factual side, against misinformation

In [8]:
claims = [
    "10 people were found dead in the plane crash. ref:1108",  # factual
    "Authorities confirmed that ten bodies were recovered from the wreckage. ref:1109",
    "A total of ten fatalities resulted from the aircraft accident. ref:1110",
    "The plane crash claimed the lives of 10 individuals. ref:1111",
    "Emergency responders found 10 deceased passengers at the crash site. ref:1112",
    "Ten lives were lost following the tragic plane crash. ref:1118",
    "The aircraft disaster led to 10 confirmed deaths. ref:1119",
    "Local officials reported that ten people perished in the crash. ref:1120",
    "Ten casualties have been officially recorded from the aviation incident. ref:1121",
    "Confirmed death toll in the crash stands at ten. ref:1122",
    "Ten passengers did not survive the plane crash. ref:1123",
    "Recovery teams located the remains of ten individuals post-crash. ref:1124",
    "Ten victims have been identified from the plane wreck. ref:1125",
    "10 people lost their lives when the plane went down. ref:1126",
    "The final count lists ten people dead in the crash. ref:1127",

    "Only 2 people died in the plane crash, contrary to earlier reports. ref:1113",  # misinfo
    "All 87 passengers aboard the plane died instantly. ref:1114",
    "No fatalities occurred in the recent plane crash incident. ref:1115",
    "The crash was a hoax, and no plane actually went down. ref:1116",
    "Five survivors were rescued, and no one was killed in the crash. ref:1117",
    "The plane was shot down, not crashed. ref:1128",
    "Ten passengers survived without injuries. ref:1129",
    "The crash site had no human remains, only cargo. ref:1130",
    "The plane landed safely; reports of a crash are false. ref:1131",
    "Only crew members were harmed, not passengers. ref:1132",
    "The incident involved a drone, not a commercial aircraft. ref:1133",
    "A mechanical fault was ruled out; it was sabotage. ref:1134",
    "Crash footage is from a different event in 2015. ref:1135",
    "The death toll is actually 25, not 10. ref:1136",
    "Reports of the crash were fabricated to cover a military exercise. ref:1137"
]


str_claims = "\n".join(claims)

test = f"""
<claims>
{str_claims}
</claims>
"""

sum_gpt(test)

{'central topic': 'Fatalities and circumstances surrounding a plane crash',
 'claims': ['code:g ref:1108, ref:1109, ref:1110, ref:1111, ref:1112, ref:1118, ref:1119, ref:1120, ref:1121, ref:1122, ref:1123, ref:1124, ref:1125, ref:1126, ref:1127 state that ten people were confirmed dead in the plane crash.',
  'code:r ref:1113, ref:1114, ref:1115, ref:1116, ref:1117, ref:1128, ref:1129, ref:1130, ref:1131, ref:1132, ref:1133, ref:1134, ref:1135, ref:1136, ref:1137 present contradictory claims about the crash, with 1113 stating only 2 deaths, 1114 claiming all 87 passengers died, 1115 asserting no fatalities, 1116 suggesting the crash was a hoax, 1117 reporting five survivors with no deaths, 1128 indicating ten passengers survived, 1130 claiming no human remains were found, 1131 stating the crash reports are false, 1132 saying only crew members were harmed, 1133 suggesting it involved a drone, 1134 ruling out mechanical fault for sabotage, 1135 indicating crash footage is from a differen

2 consistent sides

In [9]:
claims = [
    # 20 claims saying 5 people died
    "5 people were found dead in the plane crash. ref:2001",
    "Authorities confirmed five bodies were recovered from the crash site. ref:2002",
    "Only five fatalities occurred in the incident. ref:2003",
    "Emergency responders reported five deaths. ref:2004",
    "The plane crash resulted in five confirmed deaths. ref:2005",
    "Just five victims were identified from the wreckage. ref:2006",
    "The fatality count currently stands at five. ref:2007",
    "Officials have announced five deaths in the crash. ref:2008",
    "Only five passengers lost their lives in the incident. ref:2009",
    "Five bodies were recovered after the crash. ref:2010",
    "Five casualties have been recorded so far. ref:2011",
    "Five lives were lost in the aircraft tragedy. ref:2012",
    "Crash investigators confirmed five deceased. ref:2013",
    "Five fatalities have been verified post-crash. ref:2014",
    "Crash site responders confirmed five dead. ref:2015",
    "Five people are confirmed dead after the incident. ref:2016",
    "Reports indicate five victims in the crash. ref:2017",
    "Medical teams documented five fatalities. ref:2018",
    "The official toll released today is five. ref:2019",
    "Government sources confirm only five deaths. ref:2020",

    # 10 claims saying 20 people died
    "20 people died in the plane crash, according to authorities. ref:2021",
    "The crash claimed 20 lives. ref:2022",
    "Emergency teams recovered 20 bodies from the wreckage. ref:2023",
    "Twenty fatalities have been confirmed. ref:2024",
    "20 individuals were found deceased at the site. ref:2025",
    "Officials report 20 passengers were killed. ref:2026",
    "The plane accident led to 20 deaths. ref:2027",
    "Twenty victims have been listed in the crash report. ref:2028",
    "Medical examiners identified 20 casualties. ref:2029",
    "A total of 20 people are believed to have perished. ref:2030"
]


str_claims = "\n".join(claims)

test = f"""
<claims>
{str_claims}
</claims>
"""

sum_gpt(test)

{'central topic': 'Fatalities in the plane crash',
 'claims': ['code:g ref:2001, ref:2002, ref:2003, ref:2004, ref:2005, ref:2006, ref:2007, ref:2008, ref:2009, ref:2010, ref:2011, ref:2012, ref:2013, ref:2014, ref:2015, ref:2016, ref:2017, ref:2018, ref:2019, ref:2020 state that five fatalities were confirmed in the plane crash.',
  'code:r ref:2021, ref:2022, ref:2023, ref:2024, ref:2025, ref:2026, ref:2027, ref:2028, ref:2029, ref:2030 indicate that 20 fatalities occurred in the incident, contradicting the claims of five deaths.']}

### Regex exclusion of reference chain (UX)

To improve on user experience but still be able to show the different references in the case of many references being shown, we could try regex extraction of reference chain and include a simple hover over to show the different sources in this chain

In [16]:
import re

text = 'ref:2021, ref:2022, ref:2023, ref:2024, ref:2025, ref:2026, ref:2027, ref:2028, ref:2029, ref:2030 indicate that 20 fatalities occurred in the incident, contradicting the claims of five deaths.'

# Pattern to find the full chain of refs
pattern = r'ref:\d{4}(?:,\s*ref:\d{4})*'

# Substitute the matched pattern with an empty string
cleaned_text = re.sub(pattern, '', text)

# Optionally, remove extra spaces left behind
cleaned_text = re.sub(r'\s{2,}', ' ', cleaned_text).strip()

print("These references " + cleaned_text)
# Output: "code:g state that five fatalities were confirmed."



These references indicate that 20 fatalities occurred in the incident, contradicting the claims of five deaths.
