## Expansion with generated answers


In [14]:
from pypdf import PdfReader
from openai import OpenAI
import os
import requests

In [16]:
reader = PdfReader('LLM.pdf')
pdf_texts = [p.extract_text().strip() for p in reader.pages]

pdf_texts = [text for text in pdf_texts if text]

print(pdf_texts[0])

Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette
Jiahao Yuan1, Zixiang Di1, Shangzixin Zhao2, Usman Naseem3
1East China Normal University
2University of Shanghai for Science and Technology
3Macquarie University
51275900024@stu.ecnu.edu.cn, 51265901113@stu.ecnu.edu.cn,
2135061508@st.usst.edu.cn, usman.naseem@mq.edu.au
Abstract
Large language models (LLMs) face challenges
in aligning with diverse cultural values despite
their remarkable performance in generation,
which stems from inherent monocultural bi-
ases and difficulties in capturing nuanced cul-
tural semantics. Existing methods struggle to
adapt to unkown culture after fine-tuning. In-
spired by cultural geography across five con-
tinents, we propose Cultural Palette, a multi-
agent framework that redefines cultural align-
ment as an adaptive "color-blending" process
for country-specific adaptation. Our approach
harnesses cultural geography across five conti-
nents (Africa, America, Asia, Europe, Oceania)
t

In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

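***Why apply the token splitter after the recursive one?***

The order gives semantic preservation first, then model compatibility:

First, use `RecursiveCharacterTextSplitter`, which tries the separators in order (paragraph breaks, line breaks, sentence ends, spaces), so chunks tend to follow natural boundaries such as paragraphs, bullet lists, or code functions.

Then, feed each chunk into `SentenceTransformersTokenTextSplitter`, which re-splits it by token count so that no chunk exceeds the embedding model's token window (256 tokens in this notebook). Text beyond that window would otherwise be truncated when the chunk is embedded.

This two-step approach gives you:

✅ Chunks that mostly follow the document's own structure

✅ Fewer abrupt mid-sentence cuts than token splitting alone (a chunk that exceeds the token limit is still cut at a token boundary)

✅ Chunks that fit the embedding model's input limit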
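The ordering described above can be illustrated without any library. The sketch below is a simplified stand-in, not LangChain's implementation: `split_recursive`, `split_by_tokens`, and `two_stage_split` are hypothetical helpers, whitespace-separated words stand in for real model tokens, and separators are dropped wherever a cut is made.

```python
def split_recursive(text, separators, chunk_size):
    """Stage 1: cut on the coarsest separator first, recursing into finer
    separators only for pieces that are still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(split_recursive(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


def split_by_tokens(text, tokens_per_chunk):
    """Stage 2: enforce a token budget (whitespace words as toy tokens)."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + tokens_per_chunk])
        for i in range(0, len(tokens), tokens_per_chunk)
    ]


def two_stage_split(text, chunk_size=1000, tokens_per_chunk=256):
    """Run the structural split first, then the token-budget split."""
    result = []
    for block in split_recursive(text, ["\n\n", "\n", " "], chunk_size):
        result.extend(split_by_tokens(block, tokens_per_chunk))
    return result


sample = (
    "First paragraph about alignment.\n\n"
    "Second paragraph with many more words than fit."
)
chunks = two_stage_split(sample, chunk_size=40, tokens_per_chunk=4)
for chunk in chunks:
    print(repr(chunk))
```

With these deliberately tiny limits, the first paragraph survives as one chunk, while the second is cut first by characters and then by the four-token budget. The second stage guarantees the size limit, not sentence integrity.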

In [18]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)

character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))


print(character_split_texts[10])
print(f"\nTotal chunks: {len(character_split_texts)}")

plicit cultural norms (Chen et al., 2024). Re-
cent studies have demonstrated that self-instruct
(Wang et al., 2023) or multi-agent (Li et al., 2024a)
can effectively synthesize culturally nuanced data
through LLM-driven multi-step generation and re-
finement, including expanding datasets based on
the World Values Survey (WVS) (Haerpfer et al.,
2022) to study cultural dominance and alignment,
with benchmarks such as CultureLLM (Li et al.,
2024a), CulturePark (Li et al., 2024b), and Cul-
tureSPA (Xu et al., 2024). However, LLM-driven
data synthesis, seeded from the WVS for multiple-
choice data pairs, may introduce biases in cultural
options and lead to value leakage from the WVS
benchmark (Zheng et al., 2023; Zhou et al., 2023).
To overcome these limitations, we expanded
Prism (Kirk et al., 2024) to cultural dialogues from
five continents, creating Pentachromatic Cultural
Palette Dataset (§3) through self-feedback con-
trastive aggregation of cultural differences.
2.3 Model Merging

To

In [19]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(token_split_texts[10])
print(f"\nTotal chunks: {len(token_split_texts)}")


troduce a multi - agent framework integrated with a cultural moerges ( goddard et al., 2024 ) mech - anism integrating semantic relationships at both continent and country levels to dynamically blend colors adapting to judging diverse cultural align - ment and ensuring nuanced and context - aware re - sponse through multi - agent collaboration. 2. 2 data synthesis for cultural datasets aligning llms for cultural pluralism — whether through fine - tuning ( achiam et al., 2023 ; li et al., 2024a, b ; shi et al., 2024 ), alignment ( kirk et al., 2024 ; choenni and shutova, 2024 ; li et al. ), or agent - based approaches ( sorensen et al., 2024b ; feng et al., 2024 ) — requires extensive, culture - specific datasets. while existing datasets like prism ( kirk et al., 2024 ) collect user feedback on llm responses across diverse countries, they primarily focus on preference ranking rather than generating culturally nuanced dialogues, limiting their utility for training models to understand im

In [20]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()
print(embedding_function([token_split_texts[10]]))

[array([ 5.65085374e-02, -2.31369156e-02, -7.21845329e-02, -1.61805116e-02,
        1.68165788e-02,  5.74388076e-03, -1.06489612e-02, -8.65370780e-02,
        4.74963337e-02, -2.76748650e-02,  7.18682352e-03, -1.39496714e-01,
        6.58115745e-02,  4.74999472e-02, -4.49240766e-03,  4.45113033e-02,
        7.17280358e-02,  3.49879116e-02, -7.40670711e-02, -4.88845631e-02,
       -4.10147719e-02,  1.45947589e-02,  6.56867772e-02,  1.89636704e-02,
       -8.55829790e-02, -2.97187921e-02,  4.68219295e-02, -3.59477401e-02,
        1.93399377e-02,  1.48043297e-02,  2.85676178e-02,  1.28142416e-01,
       -4.66285422e-02,  9.18545797e-02, -1.20962216e-02,  1.04049474e-01,
       -5.73261194e-02,  5.77217974e-02, -8.87486804e-03,  4.09008972e-02,
       -5.05358838e-02,  2.73765158e-03,  2.09790748e-02, -4.40682620e-02,
        3.20180915e-02,  4.09754924e-02, -3.14888135e-02,  4.03019227e-02,
       -9.42779556e-02,  6.86208531e-02, -8.72482434e-02,  2.63484269e-02,
        5.68088889e-02, 

In [21]:
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("LLM", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

94

In [None]:
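import os
import requests

# Read the key from the environment rather than hard-coding it in the notebook.
# MISTRAL_API_KEY is an assumed variable name; use whichever name you exported.
api_key = os.environ.get("MISTRAL_API_KEY", "")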

def augment_query_generated(query, retrieved_documents, api_key, model="mistral-large-latest"):
    url = "https://api.mistral.ai/v1/chat/completions"

    # Combine retrieved context into a single string
    information = "\n\n".join(retrieved_documents)

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful researcher assistant. Your users are asking questions about information contained in a final report. "
                "You will be shown the user's question, and the relevant information from the report. Answer the user's question using only this information."
            )
        },
        {
            "role": "user",
            "content": f"Question: {query}\n\nContext: {information}"
        }
    ]

    payload = {
        "model": model,
        "messages": messages
    }

    response = requests.post(url, headers=headers, json=payload)

    if response.status_code == 200:
        result = response.json()
        return result["choices"][0]["message"]["content"]
    else:
        raise Exception(f"Request failed: {response.status_code} - {response.text}")


In [25]:
query = "Why the did not use other datasets like WVS or eHRAF?"
results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results["documents"][0]
hypothetical_answer = augment_query_generated(
    query=query,
    retrieved_documents=retrieved_documents,
    api_key=api_key
)

joint_query = f"{query} {hypothetical_answer}"
print(joint_query)

Why the did not use other datasets like WVS or eHRAF? The report does not explicitly state why other datasets like WVS or eHRAF were not used. However, it does mention that using WVS for LLM-driven data synthesis may introduce biases and lead to value leakage. It also notes that seeding from WVS or the PRISM dataset can be limited by underrepresented cultures and may lead to overfitting and value leakage. Additionally, the report discusses the struggle with unseen cultures when using multi-culture community mechanisms. These challenges could imply reasons why other datasets might not have been used.


In [26]:
results = chroma_collection.query(query_texts=joint_query, n_results=5, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]

for doc in retrieved_documents:
    print(doc)
    print('')

plicit cultural norms ( chen et al., 2024 ). re - cent studies have demonstrated that self - instruct ( wang et al., 2023 ) or multi - agent ( li et al., 2024a ) can effectively synthesize culturally nuanced data through llm - driven multi - step generation and re - finement, including expanding datasets based on the world values survey ( wvs ) ( haerpfer et al., 2022 ) to study cultural dominance and alignment, with benchmarks such as culturellm ( li et al., 2024a ), culturepark ( li et al., 2024b ), and cul - turespa ( xu et al., 2024 ). however, llm - driven data synthesis, seeded from the wvs for multiple - choice data pairs, may introduce biases in cultural options and lead to value leakage from the wvs benchmark ( zheng et al., 2023 ; zhou et al., 2023 ). to overcome these limitations, we expanded prism ( kirk et al., 2024 ) to cultural dialogues from five continents, creating pentachromatic cultural palette dataset ( § 3 ) through self - feedback

et al., 2024 ) seeded from the 

## Expansion with multiple queries

In [None]:
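import os
import requests

# Read the key from the environment rather than hard-coding it in the notebook.
# MISTRAL_API_KEY is an assumed variable name; use whichever name you exported.
api_key = os.environ.get("MISTRAL_API_KEY", "")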

def augment_multiple_query(query, retrieved_documents, api_key, model="mistral-large-latest"):
    url = "https://api.mistral.ai/v1/chat/completions"

    # Accept either a flat list of chunks or the nested per-query lists
    # returned by chroma_collection.query(), then combine into one string.
    # (Flattening a flat list of strings would split it into characters.)
    flat_documents = []
    for doc_group in retrieved_documents:
        if isinstance(doc_group, str):
            flat_documents.append(doc_group)
        else:
            flat_documents.extend(doc_group)
    information = "\n\n".join(flat_documents)

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful researcher assistant. Your users are asking questions about information contained in a final report. "
                "Suggest up to five additional related questions to help them find the information they need, for the provided question. "
                "Make sure they are complete questions, and that they are related to the original question. "
                "Output one question per line. Do not number the questions."
            )
        },
        {
            "role": "user",
            "content": f"Question: {query}\n\nContext: {information}"
        }
    ]

    payload = {
        "model": model,
        "messages": messages
    }

    response = requests.post(url, headers=headers, json=payload)

    if response.status_code == 200:
        result = response.json()
        return result["choices"][0]["message"]["content"]
    else:
        raise Exception(f"Request failed: {response.status_code} - {response.text}")


In [45]:
query = "Why the did not use other datasets like WVS or eHRAF?"

augmented_queries = augment_multiple_query(query=query, retrieved_documents=retrieved_documents, api_key=api_key)
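# Keep only non-empty lines; the model may separate questions with blank lines
augmented_queries_list = [q.strip() for q in augmented_queries.splitlines() if q.strip()]

# Iterate over the list (not the raw string) and use a separate loop variable
# so the original `query` is not overwritten before it is reused below
for augmented_query in augmented_queries_list:
    print(augmented_query)

queries = [query] + augmented_queries_list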
results = chroma_collection.query(query_texts=queries, n_results=5, include=['documents', 'embeddings'])

retrieved_documents = results['documents']

# Deduplicate the retrieved documents
unique_documents = set()
for documents in retrieved_documents:
    for document in documents:
        unique_documents.add(document)

for i, documents in enumerate(retrieved_documents):
    print(f"Query: {queries[i]}")
    print('')
    print("Results:")
    for doc in documents:
        print(doc)
        print('')
    print('-'*100)

What are the limitations of the World Values Survey (WVS) dataset that made it less suitable for this study?
What are the specific challenges faced when using the eHRAF dataset for cultural alignment?
How does the PRISM dataset compare to the WVS and eHRAF datasets in terms of cultural representation?
What are the potential biases that could be introduced when using the WVS dataset for data synthesis?
How does the Pentachromatic Cultural Palette dataset address the limitations found in other dat
