# A Human-Interpretable Multi-Document Reasoning System Powered by LLaMA

This notebook provides a runnable, end-to-end scaffold for the project: ingestion, preprocessing, dense retrieval (FAISS), basic NER, a KG construction stub, a RAG inference pipeline, and LoRA/QLoRA fine-tuning recipes. **Note:** model weights are not included. The retrieval and demo cells run on CPU; the fine-tuning recipe needs a machine with appropriate GPU(s).

**How to use:** Run cells sequentially. Several cells include `!pip install` commands to install required packages.

In [None]:
# Setup: install required packages (run this cell first)
# You may need to restart the kernel after some installs.
!pip install --upgrade pip
!pip install transformers==4.34.0 sentence-transformers faiss-cpu datasets nltk spacy pyvis streamlit==1.24.1 neo4j pandas scikit-learn tqdm jupyterlab
# For LoRA / QLoRA & bitsandbytes (requires CUDA & compatible drivers) - uncomment if GPU available
# !pip install peft accelerate bitsandbytes
# Install a spaCy model
!python -m spacy download en_core_web_sm


## Sample documents
Below we create a tiny set of sample documents to run the pipeline end-to-end. Replace this with your PDFs / scraped pages in practice.

In [None]:
# Create a small sample corpus
docs = [
    { 'id': 'doc1', 'text': "Apple acquired Beats in 2014. The company continued expanding into audio products." },
    { 'id': 'doc2', 'text': "In 2014, Apple bought Beats Electronics. Tim Cook announced the acquisition." },
    { 'id': 'doc3', 'text': "Tesla delivered 1.2M cars in 2023 according to their report. Another source claims 1.4M deliveries." },
    { 'id': 'doc4', 'text': "Beats was founded by Dr. Dre and Jimmy Iovine before being acquired by Apple in 2014."}
]
import json, os
os.makedirs('/mnt/data/project_assets', exist_ok=True)
with open('/mnt/data/project_assets/sample_docs.json', 'w') as f:
    json.dump(docs, f, indent=2)
print('Sample documents saved to /mnt/data/project_assets/sample_docs.json')


## Preprocessing & Chunking
Chunk documents into passages for retrieval. Adjust chunk size for your use-case.

In [None]:
import json
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the sentence tokenizer from this resource

with open('/mnt/data/project_assets/sample_docs.json') as f:
    docs = json.load(f)

# simple sentence-based chunking
passages = []
for d in docs:
    sents = sent_tokenize(d['text'])
    for i, s in enumerate(sents):
        passages.append({
            'doc_id': d['id'],
            'passage_id': f"{d['id']}_s{i}",
            'text': s
        })

import pandas as pd
df_passages = pd.DataFrame(passages)
df_passages.to_csv('/mnt/data/project_assets/passages.csv', index=False)
df_passages.head()
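
The cell above emits one passage per sentence, so there is no chunk size to adjust yet. A minimal sketch of one way to add that knob is a sliding window over sentences. The function name `chunk_sentences` and the `size` and `overlap` parameters are illustrative assumptions, not part of the notebook.

```python
def chunk_sentences(sentences, size=3, overlap=1):
    """Group consecutive sentences into overlapping windows.

    size: sentences per chunk; overlap: sentences shared by neighbouring chunks.
    Both defaults are illustrative, not tuned values.
    """
    if size < 1 or not 0 <= overlap < size:
        raise ValueError("require size >= 1 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # this window already reaches the last sentence
    return chunks

sample = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(chunk_sentences(sample, size=3, overlap=1))
```

Passing the output of `sent_tokenize(d['text'])` instead of `sample` would produce multi-sentence passages. Larger windows give the retriever more context per hit, at the cost of less precise provenance.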


## Build embeddings and FAISS index
Uses `sentence-transformers` to create embeddings and FAISS to index them. If a GPU is available, `sentence-transformers` will use it by default, which makes a larger embedding model practical.


In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import pickle
import pandas as pd

# Load a small SBERT model (CPU compatible). On GPU use a larger model.
model = SentenceTransformer('all-MiniLM-L6-v2')  # ~100MB, good for demos
texts = df_passages['text'].tolist()
embs = model.encode(texts, show_progress_bar=True, convert_to_numpy=True)

# Build FAISS index
d = embs.shape[1]
index = faiss.IndexFlatIP(d)  # inner product (use normalized vectors for cosine)
# normalize vectors for cosine similarity
faiss.normalize_L2(embs)
index.add(embs)
print('FAISS index built with', index.ntotal, 'vectors')

# save index and metadata
faiss.write_index(index, '/mnt/data/project_assets/faiss_index.idx')
with open('/mnt/data/project_assets/passages_meta.pkl','wb') as f:
    pickle.dump(df_passages, f)
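
The index above relies on the fact that the inner product of two L2-normalized vectors equals their cosine similarity. The sketch below checks that identity in plain Python on two made-up vectors; the helper names are hypothetical and the numbers are not real embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

passage_vec = [3.0, 4.0, 0.0]   # made-up stand-in for a passage embedding
query_vec = [1.0, 2.0, 2.0]     # made-up stand-in for a query embedding

ip_after_norm = dot(l2_normalize(passage_vec), l2_normalize(query_vec))
print(round(cosine(passage_vec, query_vec), 6), round(ip_after_norm, 6))
```

Both printed values agree, which is why the query vector must be normalized in the same way as the indexed vectors before searching.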


## Retrieval demo
Retrieve top-k passages for a query and show their provenance.

In [None]:
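def retrieve(query, k=3):
    """Return up to k (score, passage_metadata) pairs ranked by cosine similarity."""
    q_emb = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    D, I = index.search(q_emb, k)
    results = []
    for score, idx in zip(D[0], I[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than k vectors are indexed
            continue
        results.append((float(score), df_passages.iloc[int(idx)].to_dict()))
    return results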

query = "When did Apple acquire Beats?"
results = retrieve(query, k=5)
for score, meta in results:
    print(f"score={score:.3f}\t doc={meta['doc_id']}\t passage={meta['passage_id']}\n  {meta['text']}\n")
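
For multi-document reasoning it can help to see provenance per document rather than per passage. The following sketch groups `(score, metadata)` pairs, in the shape returned by `retrieve`, by `doc_id`. The function `group_by_document` is a hypothetical helper and the scores in the example are made up.

```python
from collections import defaultdict

def group_by_document(results):
    """Collapse (score, passage_meta) pairs into one entry per source document.

    Each entry keeps the best passage score and every supporting passage id.
    """
    grouped = defaultdict(lambda: {"best_score": float("-inf"), "passage_ids": []})
    for score, meta in results:
        entry = grouped[meta["doc_id"]]
        entry["best_score"] = max(entry["best_score"], score)
        entry["passage_ids"].append(meta["passage_id"])
    # Highest-scoring documents first
    return sorted(grouped.items(), key=lambda kv: kv[1]["best_score"], reverse=True)

# Hand-written stand-in for the output of retrieve(); scores are invented.
example_results = [
    (0.91, {"doc_id": "doc1", "passage_id": "doc1_s0"}),
    (0.88, {"doc_id": "doc2", "passage_id": "doc2_s0"}),
    (0.42, {"doc_id": "doc1", "passage_id": "doc1_s1"}),
]
for doc_id, info in group_by_document(example_results):
    print(doc_id, info)
```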


## Named Entity Recognition (spaCy)
Run a simple NER pass to extract entities from passages.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
def extract_entities(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

for _, row in df_passages.iterrows():
    print(row['passage_id'], extract_entities(row['text']))


## Relation Extraction (placeholder)
Relation extraction is often task-specific. Below is a simple heuristic extractor for 'acquire' relations. Replace with a trained RE model (REBEL/T5) for production.

In [None]:
# Very simple heuristic relation extractor
def extract_acquisition_relations(text):
    text_lower = text.lower()
    if 'acquir' in text_lower or 'bought' in text_lower:  # 'acquir' covers acquire/acquired/acquiring
        # naive: collect ORG and PERSON entities from sentences containing a keyword
        doc = nlp(text)
        entities = [ent.text for ent in doc.ents if ent.label_ in ('ORG','PERSON')]
        return entities
    return []

for _, row in df_passages.iterrows():
    rels = extract_acquisition_relations(row['text'])
    if rels:
        print(row['passage_id'], '->', rels)
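
The heuristic above returns entities in textual order, so it cannot tell the acquirer from the target in passive sentences such as "acquired by Apple". A slightly more careful sketch, still purely pattern-based, handles the active and passive forms separately. The regular expressions and the function name `extract_acquisition_triple` are assumptions for illustration; they only recognize capitalized names and the verbs "acquired" and "bought".

```python
import re

NAME = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"   # one or more capitalized tokens
VERB = r"(?:acquired|bought)"

# Passive voice: "<target> ... acquired by <acquirer>"
PASSIVE = re.compile(rf"(?P<target>{NAME})\b.*?\b{VERB} by (?P<acquirer>{NAME})")
# Active voice: "<acquirer> acquired <target>"
ACTIVE = re.compile(rf"(?P<acquirer>{NAME}) {VERB} (?P<target>{NAME})")

def extract_acquisition_triple(sentence):
    """Return (acquirer, 'acquired', target), or None if no pattern matches."""
    for pattern in (PASSIVE, ACTIVE):  # passive first: "acquired by" is more specific
        m = pattern.search(sentence)
        if m:
            return (m.group("acquirer"), "acquired", m.group("target"))
    return None

for s in [
    "Apple acquired Beats in 2014.",
    "In 2014, Apple bought Beats Electronics.",
    "Beats was founded by Dr. Dre and Jimmy Iovine before being acquired by Apple in 2014.",
    "Tim Cook announced the acquisition.",
]:
    print(extract_acquisition_triple(s))
```

In the passive pattern the target is simply the first capitalized name in the sentence, which is fragile; this remains a stopgap until a trained RE model is plugged in.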


## Knowledge Graph (Neo4j) - stub
This cell prepares triples as a simple in-memory list and saves them to CSV. Pushing them to Neo4j is an optional follow-up step that is not performed here.

In [None]:
# Build triples from heuristic RE
triples = []
for _, row in df_passages.iterrows():
    rels = extract_acquisition_relations(row['text'])
    if rels:
        # if sentence contains acquisition, assume first org is acquirer, second is acquired (very naive)
        if len(rels) >= 2:
            triples.append((rels[0], 'acquired', rels[1], row['doc_id'], row['passage_id']))
        elif len(rels) == 1:
            triples.append((rels[0], 'mentioned_in', row['doc_id'], row['doc_id'], row['passage_id']))

import pandas as pd
pd.DataFrame(triples, columns=['head','relation','tail','src_doc','passage_id']).to_csv('/mnt/data/project_assets/triples.csv', index=False)
print('Triples saved to /mnt/data/project_assets/triples.csv')
pd.read_csv('/mnt/data/project_assets/triples.csv').head()
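
To move from the CSV to Neo4j, each triple can be turned into a parameterized Cypher `MERGE` statement. The sketch below only builds the statement and its parameters, so it runs without a database. The `Entity` label, the property names, and the helper `triple_to_cypher` are hypothetical schema choices, not something the notebook defines.

```python
import re

def triple_to_cypher(head, relation, tail, src_doc, passage_id):
    """Build a parameterized Cypher MERGE statement for one triple.

    Relationship types cannot be supplied as Cypher parameters, so the
    relation name is sanitized and interpolated; all values are parameters.
    """
    rel_type = re.sub(r"[^A-Za-z0-9_]+", "_", relation).strip("_").upper()
    if not re.fullmatch(r"[A-Z_][A-Z0-9_]*", rel_type):
        raise ValueError(f"cannot derive a relationship type from {relation!r}")
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[r:{rel_type} {{passage_id: $passage_id}}]->(t) "
        "SET r.src_doc = $src_doc"
    )
    params = {"head": head, "tail": tail, "src_doc": src_doc, "passage_id": passage_id}
    return query, params

query, params = triple_to_cypher("Apple", "acquired", "Beats", "doc1", "doc1_s0")
print(query)
print(params)
```

With the official `neo4j` Python driver, each pair could then be passed to `session.run(query, params)` inside a session; the connection URI and credentials depend on your deployment. Keeping `passage_id` on the relationship preserves provenance for every edge.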


## Retrieval-Augmented Generation (RAG) - inference recipe
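Below is an example of how you would wire retrieval results into an LLM prompt and generate an answer. This cell uses a small placeholder Hugging Face model (`distilgpt2`), which is not instruction-tuned, so it demonstrates the wiring rather than producing a reliable answer; replace it with your LLaMA checkpoint and LoRA adapter when available.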

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
# Small CPU-friendly model for demo - replace with LLaMA (or a LLaMA-derivative) when running on GPU with proper weights.
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model_lm = AutoModelForCausalLM.from_pretrained('distilgpt2')

def build_rag_prompt(question, retrieved):
    ctx = '\n'.join([f"[{r['doc_id']}|{r['passage_id']}] {r['text']}" for _, r in retrieved])
    prompt = f"Context:\n{ctx}\n\nQuestion: {question}\nAnswer concisely and cite passages in brackets like [docID|passageID].\n"
    return prompt

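question = "When did Apple acquire Beats and who founded Beats?"
retrieved = retrieve(question, k=5)  # retrieve for this question instead of reusing the earlier query's results
prompt = build_rag_prompt(question, retrieved)
print(prompt)
# Generate (short)
inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=512)
# GPT-2 models define no pad token; reuse EOS to avoid a warning during generation
out = model_lm.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))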
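
Because the prompt asks for citations in the form `[docID|passageID]`, the generated answer can be checked against the passages that were actually supplied. The sketch below is one minimal way to do that; `check_citations` is a hypothetical helper and the example answer is hand-written, not model output.

```python
import re

# Matches citations such as [doc1|doc1_s0]
CITATION = re.compile(r"\[([^\[\]|]+)\|([^\[\]|]+)\]")

def check_citations(answer, retrieved):
    """Split the citations found in `answer` into supported and unsupported.

    A citation is supported when its (doc_id, passage_id) pair belongs to
    one of the retrieved passages that were placed in the prompt.
    """
    allowed = {(meta["doc_id"], meta["passage_id"]) for _, meta in retrieved}
    cited = [(d.strip(), p.strip()) for d, p in CITATION.findall(answer)]
    supported = [c for c in cited if c in allowed]
    unsupported = [c for c in cited if c not in allowed]
    return supported, unsupported

retrieved_example = [
    (0.9, {"doc_id": "doc1", "passage_id": "doc1_s0"}),
    (0.8, {"doc_id": "doc4", "passage_id": "doc4_s0"}),
]
answer_example = (
    "Apple acquired Beats in 2014 [doc1|doc1_s0]; "
    "it was founded by Dr. Dre and Jimmy Iovine [doc9|doc9_s0]."
)
print(check_citations(answer_example, retrieved_example))
```

This only verifies that a cited passage exists in the context. Whether the passage actually supports the claim needs a separate entailment or verification step.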


## LoRA / QLoRA fine-tuning recipe (instructional)
Below is a template (non-executable in CPU-only environments) showing how to fine-tune a causal LLaMA-family model with LoRA using `peft`. Replace model paths and dataset with your own.

In [None]:
# Example LoRA fine-tuning template (requires GPU and proper drivers)
# from transformers import AutoTokenizer, AutoModelForCausalLM
# from datasets import load_dataset
# from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# import bitsandbytes as bnb
#
# model_name = 'meta-llama/Llama-2-7b-hf'  # example; the '-hf' repo holds the Transformers-format weights
# tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map='auto')
#
# # Prepare for k-bit training
# model = prepare_model_for_kbit_training(model)
# lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=['q_proj','v_proj'], lora_dropout=0.05, bias='none', task_type='CAUSAL_LM')
# model = get_peft_model(model, lora_config)
#
# # Load your dataset formatted as {"input": "...", "output": "..."}
# dataset = load_dataset('json', data_files={'train':'train.json','validation':'val.json'})
#
# # Use Trainer/Accelerate to run training. See peft docs for examples.
# print('See template - run on GPU-enabled machine with bitsandbytes, peft, accelerate installed.')
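
To get a feel for what `r=8` with `target_modules=['q_proj','v_proj']` means in practice, the arithmetic below counts the parameters LoRA adds. It assumes Llama-2-7B shapes (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj` matrices); verify these against the config of whichever checkpoint you use.

```python
def lora_trainable_params(d_in, d_out, r, n_modules_per_layer, n_layers):
    """Count parameters added by LoRA.

    Each adapted weight W (d_out x d_in) gains two low-rank factors:
    A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) parameters.
    """
    per_module = r * (d_in + d_out)
    return per_module * n_modules_per_layer * n_layers

# Assumed Llama-2-7B shapes; q_proj and v_proj are both 4096 x 4096.
added = lora_trainable_params(d_in=4096, d_out=4096, r=8,
                              n_modules_per_layer=2, n_layers=32)
scaling = 32 / 8  # lora_alpha / r, the factor applied to the low-rank update

print(added)    # 4194304
print(scaling)  # 4.0
```

Under these assumptions the adapter trains about 4.2M parameters, a small fraction of the roughly 7B in the base model, which is what makes single-GPU fine-tuning feasible.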


## Evaluation & Next Steps
Suggested metrics and steps to expand the notebook into a full project.

In [None]:
print("Suggested evaluations:\n- QA: Exact Match (EM) / F1\n- Citation fidelity: % of claims grounded in cited passages\n- Verification accuracy on FEVER-style data\n- Calibration (Brier score) for confidence outputs\n\nNext steps:\n- Replace heuristic RE with a trained RE model (REBEL/T5)\n- Integrate a LLaMA LoRA adapter for answer generation\n- Add a Streamlit demo and Neo4j-backed KG visualizer\n- Prepare ablation experiments and run on larger datasets (HotpotQA, FEVER)")
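
Three of the suggested metrics are simple enough to sketch directly. The code below follows the common answer normalization used in extractive QA evaluation (lowercasing, stripping punctuation and English articles) and the standard definition of the Brier score. Function names and example inputs are illustrative assumptions.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)  # both empty counts as a match
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def brier_score(confidences, outcomes):
    """Mean squared gap between predicted confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

print(exact_match("The Beats", "beats"))
print(token_f1("Apple acquired Beats in 2014", "2014"))
print(brier_score([0.9, 0.6, 0.2], [1, 0, 0]))
```

Lower Brier scores indicate better-calibrated confidence. EM and F1 are usually averaged over the whole evaluation set.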
