## Expansion with generated answers


In [6]:
from pypdf import PdfReader
from openai import OpenAI
import os
import requests

In [7]:
reader = PdfReader('LLM.pdf')
pdf_texts = [p.extract_text().strip() for p in reader.pages]

pdf_texts = [text for text in pdf_texts if text]

print(pdf_texts[0])

Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette
Jiahao Yuan1, Zixiang Di1, Shangzixin Zhao2, Usman Naseem3
1East China Normal University
2University of Shanghai for Science and Technology
3Macquarie University
51275900024@stu.ecnu.edu.cn, 51265901113@stu.ecnu.edu.cn,
2135061508@st.usst.edu.cn, usman.naseem@mq.edu.au
Abstract
Large language models (LLMs) face challenges
in aligning with diverse cultural values despite
their remarkable performance in generation,
which stems from inherent monocultural bi-
ases and difficulties in capturing nuanced cul-
tural semantics. Existing methods struggle to
adapt to unkown culture after fine-tuning. In-
spired by cultural geography across five con-
tinents, we propose Cultural Palette, a multi-
agent framework that redefines cultural align-
ment as an adaptive "color-blending" process
for country-specific adaptation. Our approach
harnesses cultural geography across five conti-
nents (Africa, America, Asia, Europe, Oceania)
t

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

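***Why apply the token splitter after the recursive one?***

The two splitters do different jobs: the first preserves semantic structure, the second enforces the embedding model's input limit.

First, RecursiveCharacterTextSplitter splits on paragraph, line, and sentence boundaries where it can, so each chunk tends to stay a coherent block (a paragraph, a bullet list, or a code function).

Then each chunk is passed through SentenceTransformersTokenTextSplitter, which re-splits any chunk that is longer than the configured token budget (tokens_per_chunk=256 below). Without this step, the embedding model would truncate over-long chunks and the truncated text would never be searchable.

This two-step approach gives you:

✅ Chunks that mostly follow the document's own structure

✅ Fewer abrupt cuts, because most boundaries fall on paragraph or sentence breaks

✅ Chunks that fit within the embedding model's token limit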
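
The two-stage idea can be sketched without LangChain. The toy below is an illustration only, not how either LangChain splitter is implemented: it splits on blank lines first, then caps each block at a fixed number of whitespace-separated words, which stand in for real model tokens. `two_stage_split` and `max_tokens` are hypothetical names.

```python
def two_stage_split(text, max_tokens=8):
    """Toy two-stage splitter: paragraphs first, then a hard token cap.

    Whitespace-separated words stand in for model tokens here; a real
    tokenizer would count differently.
    """
    chunks = []
    # Stage 1: keep semantically coherent blocks (paragraphs) together.
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        # Stage 2: cut any block that exceeds the token budget.
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks


sample = "Short paragraph.\n\n" + " ".join(f"w{i}" for i in range(20))
chunks = two_stage_split(sample, max_tokens=8)
print(chunks)
```

The short paragraph survives intact, while the 20-word paragraph is cut into pieces of at most 8 words. The second stage only intervenes when a block is too long.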

In [9]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)

character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))


print(character_split_texts[10])
print(f"\nTotal chunks: {len(character_split_texts)}")

plicit cultural norms (Chen et al., 2024). Re-
cent studies have demonstrated that self-instruct
(Wang et al., 2023) or multi-agent (Li et al., 2024a)
can effectively synthesize culturally nuanced data
through LLM-driven multi-step generation and re-
finement, including expanding datasets based on
the World Values Survey (WVS) (Haerpfer et al.,
2022) to study cultural dominance and alignment,
with benchmarks such as CultureLLM (Li et al.,
2024a), CulturePark (Li et al., 2024b), and Cul-
tureSPA (Xu et al., 2024). However, LLM-driven
data synthesis, seeded from the WVS for multiple-
choice data pairs, may introduce biases in cultural
options and lead to value leakage from the WVS
benchmark (Zheng et al., 2023; Zhou et al., 2023).
To overcome these limitations, we expanded
Prism (Kirk et al., 2024) to cultural dialogues from
five continents, creating Pentachromatic Cultural
Palette Dataset (§3) through self-feedback con-
trastive aggregation of cultural differences.
2.3 Model Merging

To

In [10]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(token_split_texts[10])
print(f"\nTotal chunks: {len(token_split_texts)}")


troduce a multi - agent framework integrated with a cultural moerges ( goddard et al., 2024 ) mech - anism integrating semantic relationships at both continent and country levels to dynamically blend colors adapting to judging diverse cultural align - ment and ensuring nuanced and context - aware re - sponse through multi - agent collaboration. 2. 2 data synthesis for cultural datasets aligning llms for cultural pluralism — whether through fine - tuning ( achiam et al., 2023 ; li et al., 2024a, b ; shi et al., 2024 ), alignment ( kirk et al., 2024 ; choenni and shutova, 2024 ; li et al. ), or agent - based approaches ( sorensen et al., 2024b ; feng et al., 2024 ) — requires extensive, culture - specific datasets. while existing datasets like prism ( kirk et al., 2024 ) collect user feedback on llm responses across diverse countries, they primarily focus on preference ranking rather than generating culturally nuanced dialogues, limiting their utility for training models to understand im

In [18]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# Initialize the embedding function with the library's default model.
# To choose a model explicitly, pass model_name, e.g. model_name="all-MiniLM-L6-v2".
embedding_function = SentenceTransformerEmbeddingFunction()

# Generate embeddings and count them
embeddings = embedding_function(token_split_texts)
print(f"Generated {len(embeddings)} embeddings.")
print(embeddings)

Generated 94 embeddings.
[array([ 3.70091349e-02, -1.92501415e-02,  3.51325772e-03, -4.18434180e-02,
        1.38586583e-02, -4.96600755e-03, -5.62489927e-02, -1.27150118e-01,
        7.82619715e-02, -1.85568873e-02, -1.00761419e-03, -1.75074473e-01,
        5.73186800e-02,  1.67862084e-02,  1.30108036e-02,  1.74466018e-02,
        1.51935238e-02,  4.50468063e-02, -7.26108029e-02, -6.80698082e-02,
        2.29914039e-02, -2.57953256e-02,  3.13208625e-02,  3.26047023e-03,
       -1.94989294e-02, -2.81291939e-02,  4.78519499e-02, -1.58775672e-02,
        4.19371128e-02, -5.88064734e-03,  2.37059239e-02,  1.80798352e-01,
       -4.19346504e-02,  9.61453915e-02,  1.01414518e-02,  1.04062796e-01,
       -1.12802073e-01,  5.79457320e-02, -7.53625808e-03,  4.07293253e-02,
        1.48292370e-02,  4.19239178e-02,  5.84433526e-02, -3.78549211e-02,
        3.83243784e-02,  2.82139275e-02, -5.25874086e-02,  2.17330828e-02,
       -6.78416193e-02,  1.45771056e-02, -4.09235321e-02,  1.30603556e-03,

In [19]:
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("LLM", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

94
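
Conceptually, a query against the collection embeds the query text and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with hand-made 3-dimensional vectors and exact cosine similarity. It is not Chroma's implementation (which relies on an approximate index and a configurable distance metric); `top_k` and the toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda doc_id: cosine_similarity(query_vector, stored[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy "collection": id -> embedding
stored_vectors = {
    "0": [1.0, 0.0, 0.0],
    "1": [0.9, 0.1, 0.0],
    "2": [0.0, 1.0, 0.0],
}
nearest = top_k([1.0, 0.05, 0.0], stored_vectors, k=2)
print(nearest)
```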

In [None]:
import requests
api_key = "null"  # placeholder; replace with your Mistral API key

def augment_query_generated(query, retrieved_documents, api_key, model="mistral-large-latest"):
    url = "https://api.mistral.ai/v1/chat/completions"

    # Combine retrieved context into a single string
    information = "\n\n".join(retrieved_documents)

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful researcher assistant. Your users are asking questions about information contained in a final report. "
                "You will be shown the user's question, and the relevant information from the report. Answer the user's question using only this information."
            )
        },
        {
            "role": "user",
            "content": f"Question: {query}\n\nContext: {information}"
        }
    ]

    payload = {
        "model": model,
        "messages": messages
    }

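    # Send the request; the timeout avoids hanging on a stalled connection
    response = requests.post(url, headers=headers, json=payload, timeout=60)

    if response.status_code == 200:
        result = response.json()
        # The generated text is in the first choice of the chat completion
        return result["choices"][0]["message"]["content"]
    else:
        raise RuntimeError(f"Request failed: {response.status_code} - {response.text}")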


In [23]:
query = "Why the did not use other datasets like WVS or eHRAF?"
results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results["documents"][0]
hypothetical_answer = augment_query_generated(
    query=query,
    retrieved_documents=retrieved_documents,
    api_key=api_key
)

joint_query = f"{query} {hypothetical_answer}"
print(joint_query)

Why the did not use other datasets like WVS or eHRAF? The report indicates that other datasets like the World Values Survey (WVS) were not used due to potential biases and value leakage. Using LLM-driven data synthesis seeded from the WVS for multiple-choice data pairs may introduce biases in cultural options and lead to value leakage from the WVS benchmark. To overcome these limitations, the researchers expanded the PRISM dataset to cultural dialogues from five continents, creating the Pentachromatic Cultural Palette dataset.
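
Why append the generated answer to the query? The answer tends to share vocabulary with the passage that holds the real answer, so the joint query lands closer to that passage than the short question does. The toy below illustrates the effect with bag-of-words cosine similarity in place of the sentence-transformer embeddings used in this notebook; the strings and the `bow_cosine` helper are made up for illustration.

```python
import math
import re
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between simple word-count vectors."""
    a = Counter(re.findall(r"[a-z]+", text_a.lower()))
    b = Counter(re.findall(r"[a-z]+", text_b.lower()))
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


passage = "data synthesis seeded from the WVS may introduce biases and lead to value leakage"
short_query = "Why not use WVS?"
hypothetical = "WVS was avoided because of biases and value leakage."
joint = f"{short_query} {hypothetical}"

plain_score = bow_cosine(short_query, passage)
joint_score = bow_cosine(joint, passage)
print(plain_score, joint_score)
```

The joint query scores higher against the passage because the hypothetical answer contributes terms such as "biases" and "leakage" that the bare question lacks. The same reasoning applies to dense embeddings, although the size of the effect depends on the model.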


In [24]:
results = chroma_collection.query(query_texts=[joint_query], n_results=5, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]

for doc in retrieved_documents:
    print(doc)
    print('')

plicit cultural norms ( chen et al., 2024 ). re - cent studies have demonstrated that self - instruct ( wang et al., 2023 ) or multi - agent ( li et al., 2024a ) can effectively synthesize culturally nuanced data through llm - driven multi - step generation and re - finement, including expanding datasets based on the world values survey ( wvs ) ( haerpfer et al., 2022 ) to study cultural dominance and alignment, with benchmarks such as culturellm ( li et al., 2024a ), culturepark ( li et al., 2024b ), and cul - turespa ( xu et al., 2024 ). however, llm - driven data synthesis, seeded from the wvs for multiple - choice data pairs, may introduce biases in cultural options and lead to value leakage from the wvs benchmark ( zheng et al., 2023 ; zhou et al., 2023 ). to overcome these limitations, we expanded prism ( kirk et al., 2024 ) to cultural dialogues from five continents, creating pentachromatic cultural palette dataset ( § 3 ) through self - feedback

decision - making tasks within 

In [25]:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [26]:
pairs = [[query, doc] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)
print("Scores:")
for score in scores:
    print(score)

Scores:
-5.4395037
-9.187912
-11.031597
-11.206184
-10.818405


In [28]:
import numpy as np

In [29]:
# Sort the retrieved documents by cross-encoder score, best first (1-based positions)
print("New Ordering:")
for o in np.argsort(scores)[::-1]:
    print(o+1)

New Ordering:
1
2
5
3
4

## Expansion with multiple queries

In [None]:
import requests
api_key = "cvb"  # placeholder; replace with your Mistral API key

def augment_multiple_query(query, retrieved_documents, api_key, model="mistral-large-latest"):
    url = "https://api.mistral.ai/v1/chat/completions"

    # Combine retrieved context into a single string. Accept either a flat list
    # of chunks or a list of per-query result lists: flattening a flat list of
    # strings would break every chunk into single characters.
    flat_documents = []
    for item in retrieved_documents:
        if isinstance(item, str):
            flat_documents.append(item)
        else:
            flat_documents.extend(item)
    information = "\n\n".join(flat_documents)

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful researcher assistant. Your users are asking questions about information contained in a final report. "
                "Suggest up to five additional related questions to help them find the information they need, for the provided question. "
                "Make sure they are complete questions, and that they are related to the original question. "
                "Output one question per line. Do not number the questions."
            )
        },
        {
            "role": "user",
            "content": f"Question: {query}\n\nContext: {information}"
        }
    ]

    payload = {
        "model": model,
        "messages": messages
    }

    # Send the request; the timeout avoids hanging on a stalled connection
    response = requests.post(url, headers=headers, json=payload, timeout=60)

    if response.status_code == 200:
        result = response.json()
        return result["choices"][0]["message"]["content"]
    else:
        raise RuntimeError(f"Request failed: {response.status_code} - {response.text}")


In [31]:
query = "Why the did not use other datasets like WVS or eHRAF?"

augmented_queries = augment_multiple_query(query=query, retrieved_documents=retrieved_documents, api_key=api_key)
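# Keep one question per non-empty line
augmented_queries_list = [q.strip() for q in augmented_queries.strip().split('\n') if q.strip()]

# Iterate over the list, not the raw string (which would print one character
# per line), and use a new loop variable so `query` is not overwritten
for augmented_query in augmented_queries_list:
    print(augmented_query)

queries = [query] + augmented_queries_list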
results = chroma_collection.query(query_texts=queries, n_results=5, include=['documents', 'embeddings'])

retrieved_documents = results['documents']

# Deduplicate the retrieved documents
unique_documents = set()
for documents in retrieved_documents:
    for document in documents:
        unique_documents.add(document)

for i, documents in enumerate(retrieved_documents):
    print(f"Query: {queries[i]}")
    print('')
    print("Results:")
    for doc in documents:
        print(doc)
        print('')
    print('-'*100)

Streaming output truncated to the last 5000 lines.
driven data synthesis, The final report demonstrates the cultural palette framework, which presents a comprehensive and diverse set of cultural dialogues, laying a solid foundation for training cultural nuance-aware LLMs. By integrating five continents, our work extends above efforts by presenting a novel approach designed to enhance cultural alignment of LLMs conceptualizing the cultures of five continents as primary colors. and introducing a cultural palette fra

In [32]:

original_query = "Why the did not use other datasets like WVS or eHRAF?"
generated_queries = [
    "What were the limitations of using WVS for the study?",
    "Were there any specific challenges faced when using Prism for cultural data collection?",
    "Did the study consider any other datasets besides WVS and Prism for cultural data synthesis?",
    "How did the cultural palette framework address the limitations of existing datasets like Prism?",
    "What were the key contributions of the cultural palette framework to the study of cultural nuances?"
]
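
The list above is hardcoded. When the questions come back from the model as a single string instead, they have to be split into a clean list first; iterating over the raw string yields one character at a time. A minimal parsing sketch, assuming one question per line and tolerating stray numbering or bullets (`parse_questions` is a hypothetical helper, not part of any library):

```python
import re


def parse_questions(raw_response):
    """Split a newline-separated model response into a list of questions."""
    questions = []
    for line in raw_response.splitlines():
        # Drop leading bullets or numbering such as "- ", "* ", "1. " or "2) ".
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if cleaned:
            questions.append(cleaned)
    return questions


raw = (
    "What were the limitations of using WVS?\n"
    "\n"
    "2. Which datasets were considered?\n"
    "- How was Prism extended?"
)
parsed = parse_questions(raw)
print(parsed)
```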

In [33]:
queries = [original_query] + generated_queries

results = chroma_collection.query(query_texts=queries, n_results=10, include=['documents', 'embeddings'])
retrieved_documents = results['documents']

In [34]:
# Deduplicate the retrieved documents
unique_documents = set()
for documents in retrieved_documents:
    for document in documents:
        unique_documents.add(document)

unique_documents = list(unique_documents)
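
One caveat: a `set` does not preserve insertion order, and for strings its iteration order can change between interpreter sessions, so the indices printed by the reranking step are not stable across runs. If reproducible ordering matters, an order-preserving variant is a small change. A sketch using `dict.fromkeys`, with toy data standing in for the query results (`dedupe_preserving_order` is a hypothetical helper):

```python
def dedupe_preserving_order(result_lists):
    """Flatten per-query result lists and drop duplicates, keeping first-seen order."""
    flat = (doc for docs in result_lists for doc in docs)
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(flat))


toy_results = [["chunk a", "chunk b"], ["chunk b", "chunk c"], ["chunk a", "chunk d"]]
unique = dedupe_preserving_order(toy_results)
print(unique)  # ['chunk a', 'chunk b', 'chunk c', 'chunk d']
```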

In [35]:
pairs = []
for doc in unique_documents:
    pairs.append([original_query, doc])

In [36]:
scores = cross_encoder.predict(pairs)

In [37]:
print("Scores:")
for score in scores:
    print(score)

Scores:
-10.589858
-11.326737
-9.300861
-11.342749
-11.031597
-10.000721
-8.579799
-9.232069
-10.398995
-11.288389
-10.524404
-11.345856
-10.818405
-10.752235
-10.561224
-11.255968
-5.4395037
-9.187912
-10.780692
-11.392193
-11.187029
-8.609164
-9.145045
-10.71002
-11.385693
-11.418872
-3.861743
-11.286821
-11.206184
-10.320877


In [38]:
print("New Ordering:")
for o in np.argsort(scores)[::-1]:
    print(o)

New Ordering:
26
16
6
21
22
17
7
2
5
29
8
10
14
0
23
13
18
12
4
20
28
15
27
9
1
3
11
24
19
25
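
The ordering above lists indices into `unique_documents`, best first. To pass only the strongest chunks on to the final prompt, pair each document with its score and keep the top few. A minimal sketch with stand-in data (the five scores from the first reranking output and placeholder document names); `top_k_by_score` is a hypothetical helper:

```python
def top_k_by_score(documents, scores, k=3):
    """Return the k (score, document) pairs with the highest scores, best first."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]


toy_scores = [-5.4395037, -9.187912, -11.031597, -11.206184, -10.818405]
toy_documents = ["doc 1", "doc 2", "doc 3", "doc 4", "doc 5"]
best = top_k_by_score(toy_documents, toy_scores, k=3)
for score, doc in best:
    print(f"{score:.2f}  {doc}")
```

With the real data, `documents` would be `unique_documents` and `scores` the output of `cross_encoder.predict(pairs)`; the choice of `k` is a tuning decision that the notebook does not fix.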
