In [None]:
!pip install langchain langchain-openai openai langchain-community google-api-python-client html2text huggingface_hub tiktoken faiss-cpu unstructured sentence_transformers chromadb --quiet


1. **Technique 1**: Plain Chat LLM
2. **Technique 2**: Self-ask with search LangChain Agent (ESA architecture)
3. **Technique 3**: Custom search + RAG chain

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, load_tools, AgentType
from langchain.tools import Tool
from langchain_community.utilities import GoogleSearchAPIWrapper
from langchain_community.document_loaders import DirectoryLoader, PDFMinerLoader
from langchain.chains import LLMChain

import os

os.environ['OPENAI_API_KEY'] = ""
os.environ["GOOGLE_CSE_ID"] = ""
os.environ["GOOGLE_API_KEY"] = ""
os.environ["HUGGINGFACEHUB_API_TOKEN"] = ""

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## Technique 1

In [None]:
def build_profile(prompt: str):
    '''
    Function to initialize and run a GPT model to build a cultural profile from its training data
    params
        prompt (str): prompt to build the profile
    '''
    llm = ChatOpenAI(model_name='gpt-4',
                     temperature=0.5,
                     max_tokens=600)

    messages = [
            SystemMessage(content="You're a helpful assistant that aids in constructing detailed and comprehensive cultural profiles"),
            HumanMessage(content=prompt),
        ]

    # Pass the full message list so the system message is actually applied
    response = llm.invoke(messages)

    return response

In [None]:
# prompt inputs for search
tribe_to_search = "Yanomami"
relevant_factors = ["lifestyle", "culture", "economic system", "political ideologies", "values", "kinship", "social organization"]

# Generate persona based on profile and system prompt
prompt = f"Please construct a profile on the {tribe_to_search}. " \
        + f" The profile must cover the following socio-economic relevant factors {relevant_factors}. Proceed step by step."

tribe_profile1 = build_profile(prompt)
print(tribe_profile1.content)

Profile: Yanomami Tribe

1. Lifestyle: 
The Yanomami are indigenous people who live in the rainforests and mountains of northern Brazil and southern Venezuela. Their lifestyle revolves around their environment, with hunting, gathering, and gardening as their primary means of subsistence. They live in communal houses called Shabonos, which are large, oval structures that can house up to 400 people. The Yanomami are also known for their use of a hallucinogenic drug called Yopo as part of their spiritual and cultural practices.

2. Culture: 
The Yanomami culture is deeply rooted in their spiritual beliefs. They practice shamanism and believe in the existence of numerous spirits in the natural world. Their rituals often involve the use of Yopo, which they believe allows them to communicate with these spirits. They also have a rich oral tradition, with storytelling being a significant part of their cultural practices. The Yanomami are known for their intricate basketry, pottery, and body pa

## Technique 2

In [None]:
def run_search_agent(prompt_search: str):
    '''
    Function to initialize and run a LangChain SELF_ASK_WITH_SEARCH agent with access to Google Search (via GoogleSearchAPIWrapper).
    params:
        prompt_search (str): Instructions for the agent about what to search.
    '''
    # Initialize the LLM
    llm = ChatOpenAI(model_name='gpt-4',
                     temperature=0.5,
                     max_tokens=600)

    search = GoogleSearchAPIWrapper()
    tools = [
        Tool(
            name="Intermediate Answer",
            func=search.run,
            description="useful for when you need to ask with search",
        )
    ]

    self_ask_with_search = initialize_agent(tools, llm,
                                            agent=AgentType.SELF_ASK_WITH_SEARCH,
                                            verbose=True,
                                            max_iterations=10,
                                            early_stopping_method="generate",
                                            handle_parsing_errors=True)

    result = self_ask_with_search.run(prompt_search)

    return result


In [None]:
# prompt inputs for search
tribe_to_search = "Yanomami"
relevant_factors = ["lifestyle", "culture", "economic system", "political ideologies", "values", "kinship", "social organization"]

# Generate persona based on profile and system prompt
prompt_search = f"Please construct a detailed and comprehensive cultural profile on the {tribe_to_search}. " \
        + f" The profile must cover the following socio-economic relevant factors {relevant_factors}, use search to get this information."

tribe_profile2 = run_search_agent(prompt_search=prompt_search)



> Entering new AgentExecutor chain...
Could not parse output: Yes.
Intermediate answer: Invalid or incomplete response
Follow up: What is the lifestyle of the Yanomami?
Intermediate answer: The Yanomami are the largest relatively isolated tribe in South America. They live in the rainforests and mountains of northern Brazil and southern ... Mar 2, 2024 ... Yanomami, South American Indians, speakers of a Xiriana language, who live in the remote forest of the Orinoco River basin in southern ... Aug 9, 2022 ... Today, the Yanomami – who number about 29,000 – say they are at serious risk of losing their lands, culture and traditional way of life. The ... Nov 15, 2018 ... The Yanomami diet, low in fat and salt and high in fiber, consists of such items as plantains, cassavas (a root vegetable), fruit, and meat— ... The Yanomami, also spelled Yąnomamö or Yanomama, are a group of approximately 35,000 indigenous people who live in some 200

In [None]:
print(tribe_profile2)

The Yanomami are the largest relatively isolated tribe in South America. They live in the rainforests and mountains of northern Brazil and southern Venezuela. Their lifestyle is highly dependent on hunting and gathering. Their culture is deeply rooted in their belief in equality among people and they do not recognize 'chiefs'. Decisions are made by consensus after everyone has had the chance to speak. Their economic system is primarily a subsistence economy. However, their political ideologies are not clearly defined but they have a strong belief in the importance of maintaining their lands and protecting their culture. The Yanomami value marriage, family, leadership, and belief in spirits. They have a patrilineal kinship system and their social organization is structured around large, circular, communal houses called yanos or shabonos.


## Technique 3

In [None]:
import html2text
from langchain_community.document_transformers import Html2TextTransformer
from langchain_community.document_loaders import AsyncHtmlLoader

async def do_webscraping(link):
    '''
    Function to scrape a single link. Returns a dictionary with the title, metadata, and text extracted from the page's HTML.
    Parameters
        link: A single URL to scrape. The caller loops over several links and awaits each call.
    Returns:
        doc: A Python dictionary with keys 'title', 'metadata', and 'page_content', or None if nothing could be extracted.
    '''
    try:
        urls = [link]
        loader = AsyncHtmlLoader(urls)
        docs = loader.load()

        html2text_transformer = Html2TextTransformer()
        docs_transformed = html2text_transformer.transform_documents(docs)

        if docs_transformed is not None and len(docs_transformed) > 0:
            metadata = docs_transformed[0].metadata
            title = metadata.get('title', '')
            content = docs_transformed[0].page_content

            doc = {
                    'title': title,
                    'metadata': metadata,
                    'page_content': html2text.html2text(content)
                }
            return doc
        else:
            return None

    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

In [None]:
def search_tool():
    search = GoogleSearchAPIWrapper()

    # k has to be a default parameter so we can build a tool with it. Change to get more or less sources
    # 5 seems like a good default
    # 10 yields a more comprehensive profile
    # 3 is actually pretty good and much faster
    def topk_results(query, k=3):
        return search.results(query, k)

    # alternatively, we can just use search.run and retrieve the first result from a google search given query
    # But using search.results gives us access to the link
    tool_sources = Tool(
        name="Google Search Sources",
        description="Search Google for recent results and information relevant to user query.",
        func=topk_results,
    )

    return tool_sources
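
The comment above notes that `search.results` gives access to each hit's link. A minimal sketch of pulling those links out defensively follows. It assumes each hit is a dictionary with a `link` key, and that an empty search comes back as a single placeholder dictionary without one (this placeholder shape is an assumption about the wrapper's behavior, so check it against your installed version). `extract_links` and the sample data are hypothetical and not part of the notebook.

```python
from typing import Any

def extract_links(results: list[dict[str, Any]]) -> list[str]:
    """Return the 'link' value of every search hit that has one."""
    # Skipping hits without a 'link' key avoids a KeyError on placeholder entries.
    return [hit["link"] for hit in results if "link" in hit]

# Hypothetical data shaped like the wrapper's output (not real search results).
sample_results = [
    {"title": "Example page A", "link": "https://example.com/a", "snippet": "..."},
    {"title": "Example page B", "link": "https://example.com/b", "snippet": "..."},
]
# Assumed shape of the "no results" placeholder.
no_hits = [{"Result": "No good Google Search Result was found"}]

print(extract_links(sample_results))
print(extract_links(no_hits))
```

With a helper like this, a query that returns nothing would yield an empty source list instead of raising inside the list comprehension.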

In [None]:
async def get_results(prompt):

    results = search_tool().run(prompt)

    structured_response = []

    sources = [result['link'] for result in results]

    for link in sources:
        print(link)
        response = await do_webscraping(link)
        if response is not None:
            structured_response.append(response)

    return structured_response, sources
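
The loop above awaits each link in turn, so the pages are processed one after another. As a sketch of the concurrent alternative, the snippet below uses `asyncio.gather` to schedule one coroutine per link. `fake_scrape` and `scrape_all` are hypothetical names, and `fake_scrape` is a stand-in that simulates latency instead of touching the network.

```python
import asyncio
from typing import Optional

async def fake_scrape(link: str) -> Optional[dict]:
    """Hypothetical stand-in for do_webscraping: no network access."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    if "broken" in link:
        return None  # mimic a page that could not be scraped
    return {"title": link, "metadata": {"source": link}, "page_content": f"text from {link}"}

async def scrape_all(links, scrape=fake_scrape):
    """Run one scrape coroutine per link concurrently and drop failed ones."""
    # gather returns results in the same order as the input links.
    responses = await asyncio.gather(*(scrape(link) for link in links))
    return [r for r in responses if r is not None]
```

In a notebook cell this could be called as `docs = await scrape_all(sources, scrape=do_webscraping)`. Any speed-up depends on the scrape coroutine actually awaiting its I/O. As written, `do_webscraping` calls the synchronous `loader.load()`, so it would likely still fetch one page at a time.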

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

BGEembeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en",
                                             model_kwargs={"device": "cpu"},
                                             encode_kwargs={"normalize_embeddings": True})

print("Embeddings model initialized!")

def chunk_and_index(texts):

    # we'll create chunks of up to 1500 characters each, with roughly the last 200 characters of each chunk overlapping with the next to capture context a bit better
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500, chunk_overlap=200,
        add_start_index=True, separators = ['\n', '\\', '####']
    )

    docs = [x['page_content'] for x in texts]

    chunks = text_splitter.create_documents(docs)
    print(f"There are {len(chunks)} chunks")

    vectorstore = Chroma.from_documents(documents=chunks,
                                        embedding=BGEembeddings)

    print("Documents created and indexed to ChromaDB")

    # set up a retriever from the vector store
    # let's have it return the 5 chunks most similar (a proxy for relevant) to a given query.
    retriever = vectorstore.as_retriever(search_type="similarity",
                                        search_kwargs={"k": 5})

    return retriever

Embeddings model initialized!
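
To make the `chunk_size` and `chunk_overlap` settings in `chunk_and_index` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter first breaks on the configured separators and then merges pieces, so its boundaries will differ. The sketch only shows that sizes are counted in characters and that the tail of one chunk is repeated at the head of the next. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

With `chunk_size=1500` and `chunk_overlap=200`, each window would advance by 1300 characters under this simplified scheme.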


In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def build_run_chain(retriever, target_population_label, relevant_factors):
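    '''
    Function to build and run a RAG chain that writes a profile from the retrieved chunks.
    params:
        retriever: retriever over the indexed chunks (as returned by chunk_and_index)
        target_population_label (str): name of the tribe or society to profile
        relevant_factors (list): factors the profile must cover
    returns:
        profile (str): the generated profile
    '''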

    # langchain utility function to format chunks into documents to pass as context
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    llm = ChatOpenAI(model_name='gpt-4',
                    max_tokens=600, temperature=0.5)

    # TODO: Experiment with ChatHuggingFace and open-source models for building profile (less $$)
    # I think the less we depend on commercial tools, like OpenAI, the greater the contribution of our work to academia

    template = """Use the following pieces of context to answer the query at the end.
    The context will contain information about a specific tribe or society.
    Only rely on the information provided to ensure accuracy in your thoughtful response.

    {context}

    Query: {query}

    Thoughtful response:"""

    custom_rag_prompt = PromptTemplate.from_template(template)

    # Generate persona based on profile and system prompt
    query = f"Please construct a detailed and comprehensive cultural profile on the {target_population_label}. " \
            + f" The profile must cover the following socio-economic relevant factors {relevant_factors}."

    rag_chain = (
        {"context": retriever | format_docs, "query": RunnablePassthrough()}
        | custom_rag_prompt
        | llm
        | StrOutputParser()
    )

    profile = rag_chain.invoke(query)

    return profile
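
To see what the model receives before the `llm` step, the sketch below reproduces the first two stages of the chain in plain Python: joining the retrieved chunks and filling the template. `FakeDoc`, `TEMPLATE`, and `render_prompt` are hypothetical stand-ins so that the example runs without LangChain or an API key. The template is a shortened version of the one above.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str

def format_docs(docs):
    # Same joining logic as in build_run_chain: chunks separated by blank lines.
    return "\n\n".join(doc.page_content for doc in docs)

# Shortened copy of the notebook's template, for illustration only.
TEMPLATE = """Use the following pieces of context to answer the query at the end.

{context}

Query: {query}

Thoughtful response:"""

def render_prompt(docs, query):
    """Fill the template with the joined chunks and the user query."""
    return TEMPLATE.format(context=format_docs(docs), query=query)

docs = [FakeDoc("Chunk one."), FakeDoc("Chunk two.")]
prompt_text = render_prompt(docs, "Summarize the chunks.")
print(prompt_text)
```

Printing the rendered prompt like this can be a quick way to check that the retrieved context, and not just the query, is reaching the model.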

Pick any target population label and list of relevant factors for the profile. The following function will return the profile and sources referenced.

In [None]:
async def main(target_population_label,
         relevant_factors):

    search_query = f"What characterizes the {target_population_label} population?"

    # we can play around with this search prompt
    results_with_sources, sources = await get_results(search_query)
    retriever = chunk_and_index(results_with_sources)

    profile = build_run_chain(retriever,
                                target_population_label=target_population_label,
                                relevant_factors=relevant_factors)

    return profile, sources

In [None]:
target_population_label = "Yanomami"
relevant_factors = ["lifestyle", "culture", "economic system",
                    "political ideologies", "values", "kinship", "social organization"]

# takes way too long to index the chunks
# Maybe try another vectorDB or another embeddings model
# main returns a (profile, sources) tuple
tribe_profile3 = await main(target_population_label,
                            relevant_factors)

print(tribe_profile3)

https://www.survivalinternational.org/tribes/yanomami
https://www.ohchr.org/en/stories/2022/08/amazon-rainforest-indigenous-tribe-fights-survival
https://www.greenpeace.org/international/story/58033/yanomami-indigenous-brazil-mining-health-crisis-malnutrition-malaria/
There are 17 chunks
Documents created and indexed to ChromaDB
("The Yanomami are the largest relatively isolated tribe in South America, living in the rainforests and mountains of northern Brazil and southern Venezuela.\n\nLifestyle: The Yanomami live in large, circular, communal houses called yanos or shabonos, with some housing up to 400 people. The central area is used for activities such as rituals, feasts, and games. Each family has its own hearth where food is prepared and cooked during the day. At night, hammocks are slung near the fire which is stoked all night to keep people warm.\n\nCulture: The Yanomami have a strong belief in equality among people and do not recognize 'chiefs'. Decisions are made by consensus, often after long debates where everyone has a say. Tasks are divided between the sexes, with men hunting and women tending the gardens and collecting food. Both men and women participate in fishing. Wild honey is highly prized and the Yanomami harvest 15 different kin

In [None]:
tribe_profile3[0]

"The Yanomami are the largest relatively isolated tribe in South America, living in the rainforests and mountains of northern Brazil and southern Venezuela.\n\nLifestyle: The Yanomami live in large, circular, communal houses called yanos or shabonos, with some housing up to 400 people. The central area is used for activities such as rituals, feasts, and games. Each family has its own hearth where food is prepared and cooked during the day. At night, hammocks are slung near the fire which is stoked all night to keep people warm.\n\nCulture: The Yanomami have a strong belief in equality among people and do not recognize 'chiefs'. Decisions are made by consensus, often after long debates where everyone has a say. Tasks are divided between the sexes, with men hunting and women tending the gardens and collecting food. Both men and women participate in fishing. Wild honey is highly prized and the Yanomami harvest 15 different kinds.\n\nEconomic System: The Yanomami have a subsistence economy

In [None]:
tribe_profile3[1]

['https://en.wikipedia.org/wiki/Hadza_people',
 'https://www.nature.com/articles/ncomms4654',
 'https://pubmed.ncbi.nlm.nih.gov/28063234/']