## Possible Merge solution 
This is a merged solution that combines both the strategies of Zons solutions with Shao Yang's. The aim is to reduce the dependency on LLMs given the possible restrained resources provided in terms of GPU VRAM. The preassumption is the maximum model size possible is an 8B model.

In [2]:
import json
import pandas as pd
import hdbscan
import numpy as np
from tqdm import tqdm
from langchain.vectorstores import Chroma
import shutil
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.embeddings import HuggingFaceEmbeddings
import spacy
from together import Together

In [3]:
import torch
print(torch.cuda.is_available())  # should be True now

True


## Small Model Assumption 
For this experiment, in order to test our systems feasibility with small open source, Together AI is used as quick access to small models from hugging face

The model in use is the Mistral 3.1 7B v0.2 instruct

In [8]:
from getpass import getpass

key = getpass("Enter your API Key:")

client = Together(api_key=key)

def set_role(system_prompt, set_json=False, temperature = 0.2):
    def get_completion(prompt, model="mistralai/Mistral-7B-Instruct-v0.3"):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            stream=False,
        )
        content = response.choices[0].message.content
        return json.loads(content) if set_json else content

    return get_completion

In [9]:
system_prompt = """
You are a question-answer alignment classifier.

Your task is to compare a question with a candidate answer and classify their relationship based on **semantic alignment** and **topic relevance**.

- Return 's' if the answer is a valid or semantically similar response to the question.
- Return 'c' if the answer contradicts the question.
- Return 'i' if the answer is irrelevant to the question or about a different topic.

Return only a **single lowercase letter**: 's', 'c', or 'i'. Do not include any explanation or extra text.
"""

comparison_mistral = set_role(system_prompt,set_json=False)
results = comparison_mistral("""
Q: Did 20 people die on the Titanic?
A: 20 people passed away on the Titanic.
""")
results

' s'

## Trail Dataset (Multinews)
To simulate and test the usage of this rag broadening method, we will use the dataset Multi news by alexfabbri

Found in the link here https://huggingface.co/datasets/alexfabbri/multi_news/tree/main/data

The Multi-News dataset is a multi-document summarization dataset consisting of news articles grouped by topic, where each group has:
- 2 to 10 news articles covering the same event or topic
- A human-written summary that combines key information from all articles

2 train files are given 
- train.tgt which contains the summary 
- train src cleaned which contains the articles themselves (Each row is on a topic, each article per row is delimited by '|||||')


In [10]:
relative = "datasets/"
source = "train.src.cleaned"
target = "train.tgt"

with open(f'{relative}{source}', 'r', encoding='utf-8') as f:
    sources = f.readlines()

with open(f'{relative}{target}', 'r', encoding='utf-8') as f:
    targets = f.readlines()

# Clean up
sources = [s.strip() for s in sources]
targets = [t.strip() for t in targets]

# Check if aligned
assert len(sources) == len(targets)

# Example pair
print("Source:", sources[0])
print("Target (Summary):", targets[0])


Source: National Archives NEWLINE_CHAR NEWLINE_CHAR Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. NEWLINE_CHAR NEWLINE_CHAR A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%. NEWLINE_CHAR NEWLINE_CHAR Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. And while you’re here, why don’t you sign up to follow us on Twitter. NEWLINE_CHAR NEWLINE_CHAR Enjoy the show. ||||| Employers p

Splitting of topic into their separate articles

In [11]:
all_articles = []

for topic_id, source in enumerate(sources):
    articles = [a.strip().replace("NEWLINE_CHAR", "\n") for a in source.split("|||||")]
    for article in articles:

        # Apply to your DataFrame
        all_articles.append({
            "topic_id": topic_id,
            "article": article        
        })

articles_df = pd.DataFrame(all_articles)
articles_df.reset_index(inplace=True)
articles_df.rename(columns={'index': 'article_id'}, inplace=True)

summaries_df = pd.DataFrame({
    "topic_id": list(range(len(targets))),
    "summary": targets
})



from IPython.display import display
print("Articles DataFrame:")
display(articles_df.head())

print("\nSummaries DataFrame (optional):")
display(summaries_df.head())

Articles DataFrame:


Unnamed: 0,article_id,topic_id,article
0,0,0,"National Archives \n \n Yes, it’s that time ag..."
1,1,0,Employers pulled back sharply on hiring last m...
2,2,1,LOS ANGELES (AP) — In her first interview sinc...
3,3,1,"Shelly Sterling said today that ""eventually, I..."
4,4,2,"GAITHERSBURG, Md. (AP) — A small, private jet ..."



Summaries DataFrame (optional):


Unnamed: 0,topic_id,summary
0,0,– The unemployment rate dropped to 8.2% last m...
1,1,"– Shelly Sterling plans ""eventually"" to divorc..."
2,2,– A twin-engine Embraer jet that the FAA descr...
3,3,– Tucker Carlson is in deep doodoo with conser...
4,4,– What are the three most horrifying words in ...


Chunking with meta data capturing is conducted. 

In [None]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " "]
)

chunks = []
chunk_global_id = 0

for _, row in articles_df.iterrows():
    article_text = row['article']
    article_id = row['article_id']
    topic_id = row['topic_id']

    split_chunks = splitter.create_documents([article_text])

    for chunk in split_chunks:
        chunk.metadata = {
            "article_id": article_id,
            "topic_id": topic_id,
            "chunk_id": chunk_global_id
        }
        chunk_global_id += 1
        chunks.append(chunk)


In [6]:
chunks[-1]

Document(metadata={'article_id': 124039, 'topic_id': 44971, 'chunk_id': 1351371}, page_content='Cohn has always maintained that what was genuine was the staying power of Saturday Night Fever itself. That central figure, with all his grace, energy and passion. A nobody who once a week was a somebody. “Tribal Rites is about identity,” he said. “Finding a place in the world where you can shine. What still resonates, to me at least, is the sense of yearning. If I was writing the story today, Vincent might be trans…”')

## Embedding Model

Here the MPNET base v2 embeddings model was used

In [12]:
encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

  encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
  from .autonotebook import tqdm as notebook_tqdm


Here only 50000 chunks were embedded to save time for testing. Here we have decided to instead use chroma store to leverage on a meta data based retrieval feature that FAISS does not have. Chroma store also does use HNSW under the hood that we do not have to manage.

In [8]:
shutil.rmtree("chroma_store", ignore_errors=True)

# Slice and prepare chunks
index_end = 50000
chunks_subset = chunks[0:index_end]

# Optional: Normalize embedding (Chroma handles this internally if using standard encoder)
# You don’t need to manually embed; just pass encoder and documents
# But if needed for custom handling, embed here

# Create Chroma DB from documents
chroma_db = Chroma.from_documents(
    documents=chunks_subset,
    embedding=encoder,
    persist_directory="chroma_store"
)

# Persist the DB to disk
chroma_db.persist()


  chroma_db.persist()


### Retrieval Functions
Here we have 2 types of retrieval functions. The first one being a retrieval by semantic similarity and the second one being a full retrireval by article id with filtered chunks. 

The remaining chunks is done after similarity search to pull chunks related to article but not query or question

In [13]:
from langchain.vectorstores import Chroma

# Load Chroma vector store
db = Chroma(persist_directory="chroma_store", embedding_function=encoder)

def semantic_search_with_threshold(db, query, encoder, threshold=0.1, k=999):
    vec = encoder.embed_query(query)
    vec = vec / np.linalg.norm(vec)
    
    # Note: Chroma returns docs with score by default via similarity_search_with_relevance_scores
    results = db.similarity_search_with_relevance_scores(query, k=k)

    return [(doc, score) for doc, score in results if score >= threshold]

def get_remaining_chunks_by_article_ids(db, article_ids, exclude_ids):
    all_related = []
    for article_id in article_ids:
        # Retrieve all chunks with this article_id
        docs = db.similarity_search(query="", k=999, filter={"article_id": article_id})
        
        # Exclude already retrieved doc ids
        filtered_docs = [doc for doc in docs if doc.metadata.get("id") not in exclude_ids]
        all_related.extend(filtered_docs)
    
    return all_related



  db = Chroma(persist_directory="chroma_store", embedding_function=encoder)


In [14]:
# articles_df[articles_df['topic_id'] == 390]['article'].values

In [15]:
search_results = semantic_search_with_threshold(db, "Did 37 people die in hurricane harvey", encoder, threshold=0.45)
search_results

  results = db.similarity_search_with_relevance_scores(query, k=k)


[(Document(metadata={'topic_id': 390, 'article_id': 1114, 'chunk_id': 12443}, page_content='Four days after the storm ravaged the Texas coastline as a Category 4 hurricane, authorities and family members reported at least 18 deaths from Harvey. They include a former football and track coach in suburban Houston and a woman who died after she and her young daughter were swept into a rain-swollen drainage canal. Two Beaumont, Texas, police officers and two fire-rescue divers spotted the woman floating with the child, who was holding onto her mother.'),
  0.6620693924785019),
 (Document(metadata={'chunk_id': 12366, 'topic_id': 390, 'article_id': 1112}, page_content='At least 37 deaths related to Hurricane Harvey and its aftermath have been reported in Texas. One of them, Houston police Sgt. Steve Perez , drowned while trying to get to work. \n \n "To those Americans who have lost loved ones, all of America is grieving with you and our hearts are joined with yours forever," President Donald

Here we extract the unique chunks and articles referenced

In [16]:
found_doc_ids = set(doc.metadata['chunk_id'] for doc, _ in search_results)
article_ids = set(doc.metadata['article_id'] for doc, _ in search_results)
print(article_ids)
print(found_doc_ids)

{1112, 1114, 1110, 1111}
{12357, 12366, 12336, 12434, 12346, 12443}


Unrelated to query chunks related to the same article are taken out as well

In [17]:
remaining_chunks = get_remaining_chunks_by_article_ids(db, article_ids, found_doc_ids)
remaining_chunks[0:10]

[Document(metadata={'chunk_id': 12365, 'topic_id': 390, 'article_id': 1112}, page_content='"Our whole city is underwater right now but we are coming!" Port Arthur Mayor Derrick Freeman posted Wednesday on Facebook. "If you called, we are coming. Please get to higher ground if you can, but please try (to) stay out of attics." \n \n My uncles have been rescuing people in Port Arthur for 24hrs! So blessed to have such a helpful family who help others in times like this! pic.twitter.com/O2qIVGHqxR'),
 Document(metadata={'topic_id': 390, 'chunk_id': 12419, 'article_id': 1112}, page_content='"I\'m in my home in Tyler County, and we could not get out unless a helicopter plucks me out or I get my boat and launch it," the Texas Republican told CNN by phone early in the day. "We\'re fine. These waters are going to recede hopefully sometime this evening." \n \n On Wednesday afternoon, a US Navy helicopter plucked seven people from floodwaters. \n \n \'We help each other out\' \n \n Strangers from

### DF Conversion
For better management of items we can convert the relevant document chunks to the dataframe with their meta data

In [18]:
rows = []
for doc, score in search_results:
    row = {
        'chunk_id': doc.metadata['chunk_id'],
        'article_id': doc.metadata['article_id'],
        'topic_id': doc.metadata['topic_id'],
        'page_content': doc.page_content,
        'score': score
    }
    rows.append(row)

related_df = pd.DataFrame(rows)
related_df.head()

Unnamed: 0,chunk_id,article_id,topic_id,page_content,score
0,12443,1114,390,Four days after the storm ravaged the Texas co...,0.662069
1,12366,1112,390,At least 37 deaths related to Hurricane Harvey...,0.618639
2,12336,1110,390,"Harvey, which first came ashore last Friday in...",0.579347
3,12434,1114,390,But the dangers remain far from over Wednesday...,0.531248
4,12357,1111,390,• Local officials in Texas said at least 30 de...,0.454794


In [19]:
rows = []
for doc in remaining_chunks:
    row = {
        'chunk_id': doc.metadata['chunk_id'],
        'article_id': doc.metadata['article_id'],
        'topic_id': doc.metadata['topic_id'],
        'page_content': doc.page_content,
    }
    rows.append(row)

indirectly_related_df = pd.DataFrame(rows)
indirectly_related_df.head()

Unnamed: 0,chunk_id,article_id,topic_id,page_content
0,12365,1112,390,"""Our whole city is underwater right now but we..."
1,12419,1112,390,"""I'm in my home in Tyler County, and we could ..."
2,12375,1112,390,"""While things are still serious and there is a..."
3,12420,1112,390,JUST WATCHED CNN crew helps rescue man from tr...
4,12418,1112,390,"In Beaumont, a man who accidentally drove a tr..."


## Sentence extraction

Next we can convert the chunks into sentences using libraries like spacy or nltk. This would allow for a separate strategy from LLM fact extraction

In [20]:
nlp = spacy.load("en_core_web_sm")

def split_with_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents]

# Example
text = "Dr. Smith went to Washington. He arrived at 3 p.m. It was raining."
print(split_with_spacy(text))


['Dr. Smith went to Washington.', 'He arrived at 3 p.m. It was raining.']


In [21]:
def extract_sentence_df(df):
    sentence_rows = []
    for _, row in related_df.iterrows():
        doc = nlp(row['page_content'])
        for i, sent in enumerate(doc.sents):
            sentence_rows.append({
                'chunk_id': row['chunk_id'],
                'article_id': row['article_id'],
                'topic_id': row['topic_id'],
                'sentence': sent.text.strip(),
                'sentence_id': f"{row['chunk_id']}_{i}"
            })

    # Create new sentence-level DataFrame
    sentence_df = pd.DataFrame(sentence_rows)

    return sentence_df

Successful sentence extraction for both related 

In [22]:
related_sentence_df = extract_sentence_df(related_df)
related_sentence_df.head()

Unnamed: 0,chunk_id,article_id,topic_id,sentence,sentence_id
0,12443,1114,390,Four days after the storm ravaged the Texas co...,12443_0
1,12443,1114,390,They include a former football and track coach...,12443_1
2,12443,1114,390,"Two Beaumont, Texas, police officers and two f...",12443_2
3,12366,1112,390,At least 37 deaths related to Hurricane Harvey...,12366_0
4,12366,1112,390,"One of them, Houston police Sgt.",12366_1


Irrelevant chunks are just kept as chunks

In [23]:
indirectly_related_df.head()


Unnamed: 0,chunk_id,article_id,topic_id,page_content
0,12365,1112,390,"""Our whole city is underwater right now but we..."
1,12419,1112,390,"""I'm in my home in Tyler County, and we could ..."
2,12375,1112,390,"""While things are still serious and there is a..."
3,12420,1112,390,JUST WATCHED CNN crew helps rescue man from tr...
4,12418,1112,390,"In Beaumont, a man who accidentally drove a tr..."


## Question to sentence relevance

In the past few days, multiple attempts have been placed onto either clustering or LLMs to classify the sentences in relevance to one another, however they had issues in terms of accuracy, understanding, granularity and following the instruction. One possible solution we could leverage would be using both llm Bart large mnli for classification towards topic. 

One of the major issues faced was that using llms to classify kept throwing out labels that were not in the provided list, which will cause problems in our system. They idea of using an llm is for the topic gen flexibility but for strictness we use the Bart MNLI to follow a strict set of labels

In [27]:
main_q = "Did at least 37 people die in hurricane harvey"
system_prompt = """
You are an atomic question segmenter.

Your task is to segment a complex question into atomic subquestions **only if it contains multiple explicit claims**.

Important rules:
- Do **NOT** split out references like events, dates, or locations **unless they are being questioned independently**.
- Do **NOT** invent additional questions like "Did the event involve Hurricane Harvey?" unless that is a separately stated claim.
- Maintain the original phrasing and claim. Do not rephrase or reinterpret.
- Do not use generic topics like event. Use specific things like death, injury, date

Return in the following format:

{
  "atomic": [
    {
      "question": "<original subquestion>",
      "topic": "<single word topic>"
    }
  ]
}
"""


topic_mistral = set_role(system_prompt,set_json=True)
sub_q = topic_mistral(main_q)
sub_q

{'atomic': [{'question': 'Did at least 37 people die in hurricane harvey?',
   'topic': 'death'}]}

In [28]:
import torch.nn.functional as F

from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli", device=0)


Device set to use cuda:0


In [29]:
# Step 1: Create your list of inputs
sequences = related_sentence_df['sentence'].tolist()
initial_candidate_labels = [i['topic'] for i in sub_q['atomic']]
candidate_labels = initial_candidate_labels + ['others']
# Step 2: Call the classifier ONCE with all sequences
results = classifier(sequences, candidate_labels, batch_size=16)

# Step 3: Extract the top label for each result
top_labels = [r['labels'][0] for r in results]

# Step 4: Assign back to the DataFrame
related_sentence_df['category'] = top_labels


In [30]:
explicit_relevance_df = related_sentence_df[related_sentence_df['category'].isin(initial_candidate_labels)]
implicit_relevance_df = related_sentence_df[~related_sentence_df['category'].isin(initial_candidate_labels)]

explicit_relevance_df.head()

Unnamed: 0,chunk_id,article_id,topic_id,sentence,sentence_id,category
0,12443,1114,390,Four days after the storm ravaged the Texas co...,12443_0,death
1,12443,1114,390,They include a former football and track coach...,12443_1,death
3,12366,1112,390,At least 37 deaths related to Hurricane Harvey...,12366_0,death
5,12366,1112,390,"Steve Perez , drowned while trying to get to w...",12366_2,death
8,12336,1110,390,The storm led to at least 31 deaths over the p...,12336_1,death


As for the other sentences that are not relevant, we can prevent loss of information by repackaging them as chunks to be combined later with the implicitly relevant data 

In [31]:
merged_chunks = (
    implicit_relevance_df.groupby("chunk_id")
    .agg({
        "article_id": "first",
        "topic_id": "first",
        "sentence": lambda x: " ".join(x),  # Join sentences into one chunk
        "sentence_id": list,
    })
    .reset_index()
)

merged_chunks

Unnamed: 0,chunk_id,article_id,topic_id,sentence,sentence_id
0,12336,1110,390,"Harvey, which first came ashore last Friday in...",[12336_0]
1,12346,1110,390,"Five days after Harvey first made landfall, FE...","[12346_0, 12346_1, 12346_2]"
2,12357,1111,390,“I expect that number to be significantly high...,[12357_3]
3,12366,1112,390,"One of them, Houston police Sgt. ""To those Ame...","[12366_1, 12366_3]"
4,12434,1114,390,But the dangers remain far from over Wednesday...,"[12434_0, 12434_1, 12434_2]"
5,12443,1114,390,"Two Beaumont, Texas, police officers and two f...",[12443_2]


## Fact CC (Factual consistency of abstractive text summarization)
https://arxiv.org/abs/1910.12840

Next we have another model that could help us evaluate the factual consistency between sources. In this case, we will be evaluating between our claim and the news stated. Originally this model was used for hallucination detection in LLM summary models, but there is a chance to repurpose it for our use case as it has been trained to focus on things like numerical values and classifying them between 2 categories (Correct/Incorrect). 

However based on experience it seems like a failed attempt :<

In [32]:
from transformers import BertForSequenceClassification, BertTokenizer
model_path = 'manueldeprada/FactCC'

tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)


In [33]:
main_q = "at least 37 people were reported dead in hurricane harvey"

In [34]:
pd.set_option('display.max_colwidth', None)

# Define a function to classify consistency
def factcc_consistency_check(hypothesis, source_text):
    input_dict = tokenizer(source_text, hypothesis, max_length=512, padding='max_length', truncation='only_first', return_tensors='pt')
    with torch.no_grad():
        logits = model(**input_dict).logits
    pred = logits.argmax(dim=1)
    return model.config.id2label[pred.item()]

# Apply to the DataFrame
explicit_relevance_df.loc[:, 'factcc_label'] = explicit_relevance_df['sentence'].apply(
    lambda x: factcc_consistency_check(x, main_q)
)

# View results
explicit_relevance_df


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  explicit_relevance_df.loc[:, 'factcc_label'] = explicit_relevance_df['sentence'].apply(


Unnamed: 0,chunk_id,article_id,topic_id,sentence,sentence_id,category,factcc_label
0,12443,1114,390,"Four days after the storm ravaged the Texas coastline as a Category 4 hurricane, authorities and family members reported at least 18 deaths from Harvey.",12443_0,death,INCORRECT
1,12443,1114,390,They include a former football and track coach in suburban Houston and a woman who died after she and her young daughter were swept into a rain-swollen drainage canal.,12443_1,death,INCORRECT
3,12366,1112,390,At least 37 deaths related to Hurricane Harvey and its aftermath have been reported in Texas.,12366_0,death,INCORRECT
5,12366,1112,390,"Steve Perez , drowned while trying to get to work.",12366_2,death,INCORRECT
8,12336,1110,390,"The storm led to at least 31 deaths over the past five days, according to The Associated Press.",12336_1,death,INCORRECT
9,12336,1110,390,"Harris County officials, where Houston is located, confirmed six new deaths late Wednesday.",12336_2,death,INCORRECT
13,12357,1111,390,"• Local officials in Texas said at least 30 deaths were believed to have been caused by the storm through Tuesday, up from eight a day earlier.",12357_0,death,INCORRECT
14,12357,1111,390,"The dead included a Houston police officer, Sgt.",12357_1,death,INCORRECT
15,12357,1111,390,"Steve Perez, 60, who was caught in flooding on Sunday while trying to report for duty.",12357_2,death,INCORRECT


In [35]:
pd.set_option('display.max_colwidth', 50)

## Using a small LLM Instead

Since the factCC model provided less than desirable results, it seems that we have to use the LLM for this as there is little to no sources on fact classifying. I tried other models besides this including FACTCC-PENS which uses news headlines even but it was worse in number sensitivity.

However given that a lot of the facts have been segregated to the relative to claim category vs others, we have offloaded the responsibility of the LLM. The task will also be similar ^ simply just returning a single word 

In [36]:
system_prompt = """
You are a fact checker.

You will be given two sentences: S1 and S2. 
Ignore attribution or hedging phrases when making a decision
Provide a single word response from the 3 categories

- correct: The two sentences express the same factual content, including same exact numbers, values, and scope. Hedging phrases should not play a part
- incorrect: The sentences contradict each other or show any difference in values, numbers, dates, locations, or other facts. Even small numeric mismatches or missing scope count as incorrect.
- neutral: The two sentences refer to different topics, have different contexts or do not provide any direct answering

"""

segment_mistral = set_role(system_prompt,set_json=False, temperature=0)
results = segment_mistral("""
S1: Did 20 people die on the Titanic?
S2: 20 people passed away on the Titanic.
""")

results

' correct'

In [37]:
main_q = "In total there were 37 deaths related to Hurricane Harvey"

In [39]:
pd.set_option('display.max_colwidth', None)

key_map = {
    'incorrect': 'contradict',
    'correct': 'support',
    'neutral': 'neutral'
}


explicit_relevance_df.loc[:, 'llm_sentiment'] = explicit_relevance_df['sentence'].apply(
    lambda x: key_map[segment_mistral(f"""
    S1: {main_q.lower()}
    S2: {x.lower()}
    """).strip()]
)

# View results
explicit_relevance_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  explicit_relevance_df.loc[:, 'llm_sentiment'] = explicit_relevance_df['sentence'].apply(


Unnamed: 0,chunk_id,article_id,topic_id,sentence,sentence_id,category,factcc_label,sentiment,llm_sentiment
0,12443,1114,390,"Four days after the storm ravaged the Texas coastline as a Category 4 hurricane, authorities and family members reported at least 18 deaths from Harvey.",12443_0,death,INCORRECT,contradict,contradict
1,12443,1114,390,They include a former football and track coach in suburban Houston and a woman who died after she and her young daughter were swept into a rain-swollen drainage canal.,12443_1,death,INCORRECT,neutral,neutral
3,12366,1112,390,At least 37 deaths related to Hurricane Harvey and its aftermath have been reported in Texas.,12366_0,death,INCORRECT,support,support
5,12366,1112,390,"Steve Perez , drowned while trying to get to work.",12366_2,death,INCORRECT,neutral,neutral
8,12336,1110,390,"The storm led to at least 31 deaths over the past five days, according to The Associated Press.",12336_1,death,INCORRECT,contradict,contradict
9,12336,1110,390,"Harris County officials, where Houston is located, confirmed six new deaths late Wednesday.",12336_2,death,INCORRECT,neutral,neutral
13,12357,1111,390,"• Local officials in Texas said at least 30 deaths were believed to have been caused by the storm through Tuesday, up from eight a day earlier.",12357_0,death,INCORRECT,contradict,contradict
14,12357,1111,390,"The dead included a Houston police officer, Sgt.",12357_1,death,INCORRECT,neutral,neutral
15,12357,1111,390,"Steve Perez, 60, who was caught in flooding on Sunday while trying to report for duty.",12357_2,death,INCORRECT,neutral,neutral


here we have managed to classify the sentences into the groups of support, contradict, neutral. Breaking a complex task of extracting similar facts into a more simple one for small llms to handle

## Using Bart large finetuned instead

In [48]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "./model_training/bart-nli-finetuned/checkpoint-2980"
# model_path = "facebook/bart-large-mnli"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()
model.to("cuda")


# Map label IDs to text if needed
id2label = {0: "contradiction", 1: "neutral", 2: "entailment"}  # double-check your model's config

# Function to classify a single sentence
def classify_entailment(premise, main_q):
    inputs = tokenizer(premise, main_q, return_tensors="pt", truncation=True, padding=True, max_length=256).to("cuda")
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_class_id = torch.argmax(logits, dim=1).item()
    return id2label[predicted_class_id]

# Apply to DataFrame
explicit_relevance_df["bart_sentiment"] = explicit_relevance_df["sentence"].apply(lambda x: classify_entailment(x,main_q))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  explicit_relevance_df["bart_sentiment"] = explicit_relevance_df["sentence"].apply(lambda x: classify_entailment(x,main_q))


In [49]:
pd.set_option("display.max_colwidth", None)   # Do not truncate text in cells

explicit_relevance_df

Unnamed: 0,chunk_id,article_id,topic_id,sentence,sentence_id,category,factcc_label,llm_sentiment,bart_sentiment
0,12443,1114,390,"Four days after the storm ravaged the Texas coastline as a Category 4 hurricane, authorities and family members reported at least 18 deaths from Harvey.",12443_0,death,INCORRECT,contradict,contradiction
1,12443,1114,390,They include a former football and track coach in suburban Houston and a woman who died after she and her young daughter were swept into a rain-swollen drainage canal.,12443_1,death,INCORRECT,neutral,neutral
3,12366,1112,390,At least 37 deaths related to Hurricane Harvey and its aftermath have been reported in Texas.,12366_0,death,INCORRECT,support,entailment
5,12366,1112,390,"Steve Perez , drowned while trying to get to work.",12366_2,death,INCORRECT,neutral,neutral
8,12336,1110,390,"The storm led to at least 31 deaths over the past five days, according to The Associated Press.",12336_1,death,INCORRECT,contradict,contradiction
9,12336,1110,390,"Harris County officials, where Houston is located, confirmed six new deaths late Wednesday.",12336_2,death,INCORRECT,neutral,neutral
13,12357,1111,390,"• Local officials in Texas said at least 30 deaths were believed to have been caused by the storm through Tuesday, up from eight a day earlier.",12357_0,death,INCORRECT,contradict,contradiction
14,12357,1111,390,"The dead included a Houston police officer, Sgt.",12357_1,death,INCORRECT,neutral,neutral
15,12357,1111,390,"Steve Perez, 60, who was caught in flooding on Sunday while trying to report for duty.",12357_2,death,INCORRECT,neutral,neutral


In [50]:
classify_entailment('It is reported that 50 year old sam altman die last week',  'Fifty year old sam altman passed away last')

'entailment'

In [51]:
indirectly_related_df

Unnamed: 0,chunk_id,article_id,topic_id,page_content
0,12365,1112,390,"""Our whole city is underwater right now but we are coming!"" Port Arthur Mayor Derrick Freeman posted Wednesday on Facebook. ""If you called, we are coming. Please get to higher ground if you can, but please try (to) stay out of attics."" \n \n My uncles have been rescuing people in Port Arthur for 24hrs! So blessed to have such a helpful family who help others in times like this! pic.twitter.com/O2qIVGHqxR"
1,12419,1112,390,"""I'm in my home in Tyler County, and we could not get out unless a helicopter plucks me out or I get my boat and launch it,"" the Texas Republican told CNN by phone early in the day. ""We're fine. These waters are going to recede hopefully sometime this evening."" \n \n On Wednesday afternoon, a US Navy helicopter plucked seven people from floodwaters. \n \n 'We help each other out' \n \n Strangers from across the country descended on Texas and braved treacherous floodwater to evacuate victims."
2,12375,1112,390,"""While things are still serious and there is a long way to go, we ... have fared much better than we'd feared might be the case, but our neighbors are still taking it on the chin,"" Gov. John Bel Edwards said. ""In Texas, we're going to do everything we can do to be good neighbors to them."" \n \n Edwards requested a federal disaster declaration be extended to seven additional Louisiana parishes."
3,12420,1112,390,"JUST WATCHED CNN crew helps rescue man from truck Replay More Videos ... MUST WATCH CNN crew helps rescue man from truck 02:05 \n \n Tom Dickers is among those who came hauling boats from Dallas and San Antonio. \n \n ""This is what Texans would do. We help each other out,"" Dickers said. \n \n At least 9,000 to 10,000 people have been rescued in the Houston region by first responders. Volunteers said they have helped as many as 400 in one day."
4,12418,1112,390,"In Beaumont, a man who accidentally drove a truck into a flooded ravine that looked like a street was rescued by CNN correspondent Drew Griffin, producer Brian Rokus and photographer Scott Pisczek on Wednesday. ""I want to thank these guys for saving my life,"" said the driver, Jerry Sumrall. \n \n In Woodville, a town north of Beaumont, US Rep. Brian Babin was trapped for part of Wednesday at home with members of his family after a creek overflowed."
...,...,...,...,...
108,12354,1111,390,"Along 300 miles of Gulf Coast, people poured into shelters by the thousands, straining their capacity; as heavy rain kept falling, some rivers were still rising and floodwater in some areas had not crested yet; and with whole neighborhoods flooded, others were covered in water for the first time."
109,12353,1111,390,"HOUSTON — Five days after the pummeling began — a time when big storms have usually blown through, the sun has come out, and evacuees have returned home — Tropical Storm Harvey refused to go away, battering southeast Texas even more on Tuesday, spreading the destruction into Louisiana and shattering records for rainfall and flooding."
110,12357,1111,390,"• Local officials in Texas said at least 30 deaths were believed to have been caused by the storm through Tuesday, up from eight a day earlier. The dead included a Houston police officer, Sgt. Steve Perez, 60, who was caught in flooding on Sunday while trying to report for duty. “I expect that number to be significantly higher once the roads become passable,” said Erin Barnhart, the chief medical examiner for Galveston County."
111,12359,1111,390,"• Parts of the Houston area broke the record for rainfall from a single storm anywhere in the continental United States, with a top reading on Tuesday afternoon, since the storm began, of 51.88 inches in Cedar Bayou, east of Houston, the National Weather Service reported. The previous record was 48 inches in Medina, Tex., from Tropical Storm Amelia in 1978, and with the rain still falling along the Gulf Coast, Harvey could top the 52 inches recorded in Kauai, Hawaii in 1950 from Hurricane Hiki."


## Dealing with implicitly related facts 

In [53]:
implicitly_related_df = pd.concat([indirectly_related_df, merged_chunks.drop(columns=['sentence_id']).rename(columns={'sentence':'page_content'})])
implicitly_related_df.head()

Unnamed: 0,chunk_id,article_id,topic_id,page_content
0,12365,1112,390,"""Our whole city is underwater right now but we are coming!"" Port Arthur Mayor Derrick Freeman posted Wednesday on Facebook. ""If you called, we are coming. Please get to higher ground if you can, but please try (to) stay out of attics."" \n \n My uncles have been rescuing people in Port Arthur for 24hrs! So blessed to have such a helpful family who help others in times like this! pic.twitter.com/O2qIVGHqxR"
1,12419,1112,390,"""I'm in my home in Tyler County, and we could not get out unless a helicopter plucks me out or I get my boat and launch it,"" the Texas Republican told CNN by phone early in the day. ""We're fine. These waters are going to recede hopefully sometime this evening."" \n \n On Wednesday afternoon, a US Navy helicopter plucked seven people from floodwaters. \n \n 'We help each other out' \n \n Strangers from across the country descended on Texas and braved treacherous floodwater to evacuate victims."
2,12375,1112,390,"""While things are still serious and there is a long way to go, we ... have fared much better than we'd feared might be the case, but our neighbors are still taking it on the chin,"" Gov. John Bel Edwards said. ""In Texas, we're going to do everything we can do to be good neighbors to them."" \n \n Edwards requested a federal disaster declaration be extended to seven additional Louisiana parishes."
3,12420,1112,390,"JUST WATCHED CNN crew helps rescue man from truck Replay More Videos ... MUST WATCH CNN crew helps rescue man from truck 02:05 \n \n Tom Dickers is among those who came hauling boats from Dallas and San Antonio. \n \n ""This is what Texans would do. We help each other out,"" Dickers said. \n \n At least 9,000 to 10,000 people have been rescued in the Houston region by first responders. Volunteers said they have helped as many as 400 in one day."
4,12418,1112,390,"In Beaumont, a man who accidentally drove a truck into a flooded ravine that looked like a street was rescued by CNN correspondent Drew Griffin, producer Brian Rokus and photographer Scott Pisczek on Wednesday. ""I want to thank these guys for saving my life,"" said the driver, Jerry Sumrall. \n \n In Woodville, a town north of Beaumont, US Rep. Brian Babin was trapped for part of Wednesday at home with members of his family after a creek overflowed."


In [None]:
system_prompt = """
You are an atomic fact extractor.

Your job is to extract atomic facts from a given text chunk that is related/interesting to point out in relation to a particular sentence. Each fact must be:
- Specific
- Literal
- Short
- Independent (no grouping or summarizing)
- Fully detailed (include all numbers, names, dates, and locations)
- Keep facts simple, and unrepeated. Do not mention the same detail twice 
- Be as atomic as possible


Output format:
You must return a **strict JSON object** with a single field: "facts".  
"facts" must be a list of fact strings.  
Do not return any explanation, text, or keys other than "facts".

Example:

sentence: 

{
  "facts": [
    "The flood damaged 14 houses in Orange County.",
    "Three people were rescued by helicopter on Tuesday.",
    "President Smith declared a state of emergency on April 9."
  ]
}
"""

chunk_mistral = set_role(system_prompt,set_json=True, temperature=0)
results = chunk_mistral("""
In Beaumont, a man who accidentally drove a truck into a flooded ravine that looked like a street was rescued by CNN correspondent Drew Griffin, producer Brian Rokus and photographer Scott Pisczek on Wednesday. "I want to thank these guys for saving my life," said the driver, Jerry Sumrall. \n \n In Woodville, a town north of Beaumont, US Rep. Brian Babin was trapped for part of Wednesday at home with members of his family after a creek overflowed.""")
results

{'facts': ['Jerry Sumrall was rescued by CNN correspondent Drew Griffin, producer Brian Rokus, and photographer Scott Pisczek in Beaumont on Wednesday.',
  'U.S. Rep. Brian Babin was trapped at home in Woodville, a town north of Beaumont, due to a creek overflow on Wednesday.']}

In [315]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from json.decoder import JSONDecodeError

fact_list = []
errors = []
max_workers = 10  # adjust depending on rate limits & CPU

def process_row(row):
    try:
        content = row["page_content"]
        result = chunk_mistral(content)
        return {
            "chunk_id": row["chunk_id"],
            "article_id": row["article_id"],
            "topic_id": row["topic_id"],
            "fact": result
        }
    except JSONDecodeError:
        return None

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = [executor.submit(process_row, row) for _, row in implicitly_related_df.iterrows()]

    for future in tqdm(as_completed(futures), total=len(futures)):
        data = future.result()
        if data is not None:
            fact_list.append(data)
        else:
            errors.append(1)

print("CHunks skipped due to JSON error:", len(errors))

100%|██████████| 119/119 [00:13<00:00,  9.09it/s]

CHunks skipped due to JSON error: 5





In [312]:
fact_list

[{'chunk_id': 12419,
  'article_id': 1112,
  'topic_id': 390,
  'fact': {'facts': ['The Texas Republican is in his home in Tyler County.',
    'Seven people were rescued by a US Navy helicopter on Wednesday afternoon.',
    'Strangers from across the country are helping evacuate victims.']}},
 {'chunk_id': 12377,
  'article_id': 1112,
  'topic_id': 390,
  'fact': {'facts': ['New Orleans officials announced a fundraiser to help the residents of Houston and other flooded Texas cities recover from Harvey.',
    'Landrieu stated that no city was more welcoming for the citizens of New Orleans than the people of Houston.']}},
 {'chunk_id': 12383,
  'article_id': 1112,
  'topic_id': 390,
  'fact': {'facts': ['The police department posted a message on Facebook about rescue boats in Port Arthur.',
    'The city of Port Arthur asked people trapped to hang a white towel, sheet, or shirt outside to alert rescuers.']}},
 {'chunk_id': 12382,
  'article_id': 1112,
  'topic_id': 390,
  'fact': {'facts

In [316]:
# Flatten it and add fact_id
flat_fact_rows = []

for i, entry in enumerate(fact_list):
    for fact in entry['fact']['facts']:
        flat_fact_rows.append({
            'fact_id': len(flat_fact_rows), 
            'chunk_id': entry['chunk_id'],
            'article_id': entry['article_id'],
            'topic_id': entry['topic_id'],
            'fact': fact
        })

# Create DataFrame
df_facts_flat = pd.DataFrame(flat_fact_rows)
df_facts_flat

Unnamed: 0,fact_id,chunk_id,article_id,topic_id,fact
0,0,12377,1112,390,New Orleans officials announced a fundraiser t...
1,1,12377,1112,390,Landrieu stated that no city was more welcomin...
2,2,12377,1112,390,"According to Landrieu, the heart of New Orlean..."
3,3,12383,1112,390,The police department posted a message on Face...
4,4,12383,1112,390,The city of Port Arthur asked people trapped t...
...,...,...,...,...,...
389,389,12346,1110,390,"These shelters house more than 30,000 people."
390,390,12434,1114,390,At least 18 people died in the Houston area an...
391,391,12434,1114,390,"13,000 people were rescued in the Houston area."
392,392,12434,1114,390,Weakened levees were in danger of failing.


In [317]:
def embed_and_build_dataframes_from_chunks(df_chunks, encoder):
    chunk_rows = []

    # Embed and normalize each chunk's page_content
    chunk_embeddings = encoder.embed_documents(df_chunks["fact"].tolist())
    norm_chunk_embeddings = [vec / np.linalg.norm(vec) for vec in chunk_embeddings]

    for i, row in df_chunks.iterrows():
        chunk_rows.append({
            "chunk_id": row["chunk_id"],
            "article_id": row["article_id"],
            "topic_id": row["topic_id"],
            "text": row["fact"],
            "embedding": norm_chunk_embeddings[i]
        })

    df_chunks_embedded = pd.DataFrame(chunk_rows)
    df_chunks_embedded.reset_index(inplace=True)
    df_chunks_embedded.rename(columns={'index': 'row_id'}, inplace=True)

    return df_chunks_embedded


In [318]:
df_chunks_embedded = embed_and_build_dataframes_from_chunks(df_facts_flat, encoder)
df_chunks_embedded

Unnamed: 0,row_id,chunk_id,article_id,topic_id,text,embedding
0,0,12377,1112,390,New Orleans officials announced a fundraiser t...,"[0.02674776268382799, 0.06051028834227942, -0...."
1,1,12377,1112,390,Landrieu stated that no city was more welcomin...,"[-0.030839549835687527, 0.042668536939276136, ..."
2,2,12377,1112,390,"According to Landrieu, the heart of New Orlean...","[-0.03863598400228658, 0.06449426700446063, -0..."
3,3,12383,1112,390,The police department posted a message on Face...,"[-0.01166471244046706, -0.029752739715401915, ..."
4,4,12383,1112,390,The city of Port Arthur asked people trapped t...,"[-0.015490042314821342, -0.041836966555968594,..."
...,...,...,...,...,...,...
389,389,12346,1110,390,"These shelters house more than 30,000 people.","[-0.023295945665673685, -0.02859723289193262, ..."
390,390,12434,1114,390,At least 18 people died in the Houston area an...,"[0.0030162208417100424, 0.0058314003408430634,..."
391,391,12434,1114,390,"13,000 people were rescued in the Houston area.","[-0.00025582824536043406, 0.03314129853510445,..."
392,392,12434,1114,390,Weakened levees were in danger of failing.,"[-0.03867061845411607, 0.05848771016682153, -0..."


In [319]:
def run_hdbscan(df, min_cluster_size=3, min_samples=2, metric='euclidean'):
    X = np.vstack(df["embedding"].values)

    # Run HDBSCAN
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                 min_samples=min_samples,
                                 metric=metric,
                                 cluster_selection_method='leaf'
                                 )
    clusters = clusterer.fit_predict(X)

    # Add cluster labels to DataFrame
    df_with_clusters = df.copy()
    df_with_clusters["cluster"] = clusters

    return df_with_clusters

In [320]:
df_chunks_clustered = run_hdbscan(df_chunks_embedded)
print(f"Unqiue clusters found for facts: {df_chunks_clustered['cluster'].unique()}")
df_chunks_clustered.head()



Unqiue clusters found for facts: [-1  8  3  1 19 12 14 20 15  5 13 10 16 18 21 17  4  7  2  0  6 11  9]


Unnamed: 0,row_id,chunk_id,article_id,topic_id,text,embedding,cluster
0,0,12377,1112,390,New Orleans officials announced a fundraiser t...,"[0.02674776268382799, 0.06051028834227942, -0....",-1
1,1,12377,1112,390,Landrieu stated that no city was more welcomin...,"[-0.030839549835687527, 0.042668536939276136, ...",8
2,2,12377,1112,390,"According to Landrieu, the heart of New Orlean...","[-0.03863598400228658, 0.06449426700446063, -0...",8
3,3,12383,1112,390,The police department posted a message on Face...,"[-0.01166471244046706, -0.029752739715401915, ...",-1
4,4,12383,1112,390,The city of Port Arthur asked people trapped t...,"[-0.015490042314821342, -0.041836966555968594,...",-1


In [324]:
print(df_chunks_clustered[df_chunks_clustered['cluster'] == -1].iloc[0: , 4].values)
print('article ids:', df_chunks_clustered[df_chunks_clustered['cluster'] == -1].iloc[0: , 2].values)

['New Orleans officials announced a fundraiser to help the residents of Houston and other flooded Texas cities.'
 'The police department posted a message on Facebook about rescue boats in Port Arthur.'
 'The city of Port Arthur asked people trapped to hang a white towel, sheet, or shirt outside to alert rescuers.'
 'Gov. John Bel Edwards said things are serious but better than feared.'
 'Edwards stated that neighbors are still taking it on the chin.'
 'Edwards plans to help Texas as good neighbors.'
 'Edwards requested a federal disaster declaration for 7 additional Louisiana parishes.'
 'The Texas Republican is in his home in Tyler County.'
 'Seven people were rescued by a US Navy helicopter on Wednesday afternoon.'
 'Strangers from across the country are helping evacuate victims.'
 'Uncles of the author have been rescuing people in Port Arthur for 24 hours.'
 'Cynthia Harmon was trapped with her two sons, two grandsons in the attic of her Port Arthur home.'
 'They began waiting for r