# Pipeline for RAG

**TABLE OF CONTENTS**
1. Imports
2. Loading the embedding and the vectorstore
3. Auxiliary functions
4. RAG 1
    * a) Natural history
    * b) Theory of relativity
    * c) Humans, animals and the origin of life
5. RAG 2
    * a) Blumenbach's writing style
    * b) Rodents
    * c) God
    * d) Human races
    * e) Exploratory research
6. RAG 3
    * a) Comparing editions
    * b) Search by metadata in both editions without register
    * c) Search by metadata in the register of the 1st edition
    * d) Using information from metadata

## 1. Imports

In [1]:
import datetime
from huggingface_hub import login
login(token = hf_logging_token)  # UPDATE THE TOKEN FOR RUNNING THE NOTEBOOK
from langchain.chains import RetrievalQA, LLMChain
from langchain.chat_models import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain.schema import Document
from langchain.vectorstores import FAISS
import pickle
from operator import itemgetter
import os
os.environ['OPENAI_API_KEY'] = openAI_API_key  # UPDATE THE KEY FOR RUNNING THE NOTEBOOK
from transformers import AutoModel 


Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/Onema/.cache/huggingface/token
Login successful


## 3. Auxiliary functions

In [2]:
# custom helper functions for the RAG pipeline
from rag_funcs import *
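
The module `rag_funcs` is not shown in this notebook, so the exact helper implementations are unknown. As a rough sketch of what `format_docs` and `print_prompt` might do (the function bodies are assumptions, and `Doc` is a hypothetical stand-in for LangChain's `Document` so that the snippet runs without extra dependencies):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the retrieved texts into one context string; metadata are dropped."""
    return "\n\n".join(doc.page_content for doc in docs)


def print_prompt(prompt_value):
    """Print the filled-in prompt and pass it on unchanged to the next chain step."""
    print(prompt_value)
    return prompt_value


docs = [Doc("Erster Abschnitt.", {"paragraph": "§. 1."}), Doc("Zweiter Abschnitt.")]
context = format_docs(docs)
```

The real `print_prompt` evidently prints a header line before the retrieved context (see the cell outputs further down); the pass-through behaviour is what lets it sit between `prompt` and `llm` in a chain.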

In [3]:
# a function for documentation (RAG1) - relies on hard-coding variable names, so shouldn't be imported
def record_in_txt(file_name, export_dir = "../data/results/RAG"):
    
    """Records all relevant information about generation through access to global variables.
    Warning: this function relies on hard coding. It should be used ONLY in relation to the last generation."""
    
    model_name, temp, max_tokens = llm_hyperparameters   # unpacking model hyperparameters
    vectorstore, _ = retriever_hyperparameters   # unpacking retriever hyperparameters
    
    identified_docs = retriever.get_relevant_documents(question)  # reproduce the retrieval
                  
    with open(f"{export_dir}/{file_name}.txt", "a") as f:
        f.write("X"*30+f"   NEW GENERATION {datetime.datetime.now()}   "+"X"*30)    # separator
        f.write("\n\n")
        f.write("TEMPLATE:\n")    # prompt template
        f.write(template)
        f.write("\n\n")
        f.write("QUESTION OR INSTRUCTION:\n")   # question / instruction
        f.write(question)
        f.write("\n\n")
        f.write(f"GENERATION ({model_name}, temperature = {temp}, max_tokens = {max_tokens}):")  # generation
        f.write(inference.pretty_repr()[81:])
        f.write("\n\n")
        comment = input("Please insert a short comment: ")
        if comment:
            f.write(f"COMMENT: {comment}")
        else:
            f.write("COMMENT: N/A")
        f.write("\n\n")
        f.write(f"RETRIEVED DOCUMENTS ({vectorstore}):\n")
        f.write(format_docs_for_documentation(identified_docs))
        f.write("\n\n\n\n")
        

# a function for documentation (RAG2) - relies on hard-coding variable names, so shouldn't be imported
# version for the instruction following templates
def record_in_txt_instruction(file_name, export_dir = "../data/results/RAG"):
    
    """Records all relevant information about generation through access to global variables.
    Warning: this function relies on hard coding. It should be used ONLY in relation to the last generation."""
    
    model_name, temp, max_tokens = llm_hyperparameters   # unpacking model hyperparameters
    vectorstore, _ = retriever_hyperparameters   # unpacking retriever hyperparameters
    
    identified_docs = retriever.get_relevant_documents(topic)  # reproduce the retrieval
                  
    with open(f"{export_dir}/{file_name}.txt", "a") as f:
        f.write("X"*30+f"   NEW GENERATION {datetime.datetime.now()}   "+"X"*30)    # separator
        f.write("\n\n")
        f.write("TEMPLATE:\n")    # prompt template
        f.write(template)
        f.write("\n\n")
        f.write("QUESTION OR INSTRUCTION:\n")   # question / instruction
        f.write(instruction)
        f.write("\n\n")
        f.write("TOPIC (used for retrieval):\n")   # topic
        f.write(topic)
        f.write("\n\n")
        f.write(f"GENERATION ({model_name}, temperature = {temp}, max_tokens = {max_tokens}):")  # generation
        f.write(inference.pretty_repr()[81:])
        f.write("\n\n")
        comment = input("Please insert a short comment: ")
        if comment:
            f.write(f"COMMENT: {comment}")
        else:
            f.write("COMMENT: N/A")
        f.write("\n\n")
        f.write(f"RETRIEVED DOCUMENTS ({vectorstore}):\n")
        f.write(format_docs_for_documentation(identified_docs))
        f.write("\n\n\n\n")
        

def format_and_save_docs(docs):
    """For RAG3, to be used in conjunction with record_in_txt_metadata.
    Warning: this function relies on hard coding. It should be used ONLY inside of a chain."""  
    
    formatted_docs = format_docs_with_metadata(docs) 
    
    with open("../data/results/RAG/metadata.txt", "a") as f:
        f.write("X"*30+f"   NEW GENERATION {datetime.datetime.now()}   "+"X"*30)    # separator
        f.write("\n\n")
        f.write(f"RETRIEVED DOCUMENTS ({vectorstore}):\n")
        f.write(formatted_docs)
        f.write("\n\n")
    return formatted_docs


# a function for documentation (RAG3) - relies on hard-coding variable names, so shouldn't be imported
# version for the SelfQueryRetriever & instruction following templates
def record_in_txt_metadata(file_name="metadata", export_dir = "../data/results/RAG"):
    
    """Records all relevant information about generation through access to global variables.
    To be used in conjunction with format_and_save_docs.
    Warning: this function relies on hard coding. It should be used ONLY in relation to the last generation."""
    
    model_name, temp, max_tokens = llm_hyperparameters   # unpacking model hyperparameters
                  
    with open(f"{export_dir}/{file_name}.txt", "a") as f:
        f.write("TEMPLATE:\n")    # prompt template
        f.write(template)
        f.write("\n\n")
        f.write("QUESTION OR INSTRUCTION:\n")   # question / instruction
        f.write(instruction)
        f.write("\n\n")
        f.write("TOPIC (used for retrieval):\n")   # topic
        f.write(topic)
        f.write("\n\n")
        f.write(f"GENERATION ({model_name}, temperature = {temp}, max_tokens = {max_tokens}):")  # generation
        f.write(inference.pretty_repr()[81:])
        f.write("\n\n")
        comment = input("Please insert a short comment: ")
        if comment:
            f.write(f"COMMENT: {comment}")
        else:
            f.write("COMMENT: N/A")
        f.write("\n\n\n\n")
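
The three recording functions above read `template`, `question` or `instruction`, and `inference` from global variables, which is why they are only safe to call right after a generation. One possible alternative, sketched here under assumptions (the name `record_generation` and its parameters are hypothetical and not part of this notebook), passes every value explicitly:

```python
import datetime
import os
import tempfile


def record_generation(path, template, question, generation, comment="N/A",
                      model_name="unknown", temp=None, max_tokens=None):
    """Append one generation record to a text file (hypothetical helper)."""
    with open(path, "a", encoding="utf-8") as f:
        # same separator style as the notebook's record_in_txt functions
        f.write("X" * 30 + f"   NEW GENERATION {datetime.datetime.now()}   " + "X" * 30)
        f.write("\n\nTEMPLATE:\n" + template)
        f.write("\n\nQUESTION OR INSTRUCTION:\n" + question)
        f.write(f"\n\nGENERATION ({model_name}, temperature = {temp}, "
                f"max_tokens = {max_tokens}):\n" + generation)
        # an empty comment is recorded as N/A
        f.write(f"\n\nCOMMENT: {comment or 'N/A'}\n\n\n\n")


# usage with a temporary file, so nothing under ../data is touched
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
record_generation(demo_path, "QUESTION: {question}", "What is natural history?",
                  "An example answer.", comment="")
with open(demo_path, encoding="utf-8") as f:
    recorded = f.read()
```

Because nothing is read from globals, a function like this could live in `rag_funcs` and be imported, at the cost of a longer call.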



## 2. Loading the embedding and the vectorstore

In [4]:
# load the embedding

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True) 
model_name = "jinaai/jina-embeddings-v2-base-de"
model_kwargs = {'device': 'cpu'}
hf_jina = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs)


In [5]:
# load the vectorstore - use the longer documents with the register (db_1500)

db_dir_path = "../data/vectorDB"
#db_1500_without_reg = FAISS.load_local(db_dir_path+"/"+"faiss_vecDB_1500_without_reg", hf_jina, allow_dangerous_deserialization = True)
#db_700 = FAISS.load_local(db_dir_path+"/"+"faiss_vecDB_700_reg", hf_jina, allow_dangerous_deserialization = True)
#db_700_without_reg = FAISS.load_local(db_dir_path+"/"+"faiss_vecDB_700_without_reg", hf_jina, allow_dangerous_deserialization = True)
db_1500 = FAISS.load_local(db_dir_path+"/"+"faiss_vecDB_1500_reg", hf_jina, allow_dangerous_deserialization = True)
#db_1ed_12ed = FAISS.load_local(db_dir_path+"/"+"faiss_vecDB_ed1_ed12", hf_jina, allow_dangerous_deserialization = True)


## 4. RAG 1

In [6]:
# initialize a retriever and an LLM
retriever, retriever_hyperparameters = initialize_retriever(db_1500, k = 7)
llm, llm_hyperparameters = initialize_llm(max_tokens=400)

Which vector database are you using? Please provide its name for record: db_1500_with_register
Initialized the retriever out of db_1500_with_register. 7 documents will be retrieved.
Created an LLM with standard hyperparameters.



### a) Natural history

In [42]:
# prompt template

template = """You are a historian analyzing a primary historical source (SOURCE). Answer the QUESTION based solely on the SOURCE. Explain your ANSWER. If the SOURCE doesn't contain sufficient information to answer the QUESTION, say that you don't know the ANSWER.
QUESTION: {question}
SOURCE: {context}
ANSWER: """

prompt = PromptTemplate.from_template(template)

In [43]:
# chain
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt | print_prompt
    | llm
)

In [44]:
question = "How are medieval alchemy and natural history of the 18th century different?"
inference = chain.invoke(question)

Retrieved context (metadata are not passed to the model):

Alle Dinge, die sich auf, und in unsrer Erde finden, zeigen sich entweder in derselben Gestalt, in welcher sie aus der Hand der Natur gekommen; oder so, wie sie durch Menschen oder Thiere, zu bestimmten Absichten, oder auch durch bloßen Zufall verändert und gleichsam umgeschaffen worden sind. Auf diese Verschiedenheit gründet sich die bekannte Eintheilung aller Körper in natürliche (naturalia), und durch Kunst verfertigte (artefacta). Die erstern machen den Gegenstand der Naturgeschichte aus, und man belegt alle Körper mit dem Namen der Naturalien, die nur noch keine wesentliche Veränderung durch Menschenhände erlitten haben; Da hingegen die mehresten von denen so der Zufall umgeändert hat, und beyläufig auch diejenigen so durch die Thiere nach ihren Trieben und zu Stillung ihrer Bedürfnisse verändert und umgebildet worden, mit unter den Naturalien begriffen werden. Artefacten werden sie blos alsdann, wenn der Mensch wesentlich

In [45]:
print(inference.pretty_repr())


Medieval alchemy and the natural history of the 18th century are different in their approach to categorizing and studying natural objects. In the source, it is explained that natural history focuses on objects that have not undergone significant changes by humans or animals, while alchemy includes objects that have been altered or transformed for specific purposes. Natural history categorizes objects as naturalia, which are objects that have not undergone essential changes by human hands, while alchemy categorizes objects as artefacta when significant changes are made by humans. This distinction shows that medieval alchemy focused on the transformation and manipulation of natural objects, while the natural history of the 18th century focused on studying and categorizing natural objects in their original form.


In [46]:
record_in_txt("natural_history")

Please insert a short comment: not correct about alchemy


### b) Theory of relativity

In [66]:
template = """You are a historian analyzing a primary historical source (SOURCE). Answer the QUESTION based on the SOURCE. If the SOURCE doesn't contain sufficient information, report it and answer with your knowledge.
QUESTION: {question}
SOURCE: {context}
ANSWER: """

prompt = PromptTemplate.from_template(template)

In [67]:
# chain
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt | print_prompt
    | llm
)

In [68]:
question = "Who came up with the theory of relativity?"
inference = chain.invoke(question)

Retrieved context (metadata are not passed to the model):

Zuerst etwas vom Ursprung der Mineralien, nemlich von den Hauptwegen, wodurch sie theils vor Zeiten mit einemmal entstanden sind, und theils nach und nach und noch immerfort entstehen. Um jene aufzuklären, müssen wir nothwendig auf den Ursprung unsrer Erde selbst zurück gehen: eine Untersuchung, bey der man sich freylich immer einige gewagte Muthmassungen wird erlauben müssen: doch wollen wir uns nicht dem Flug der kühnen Männer überlassen, die Kometen und ausgebrannte Sonnen zum Bau ihres Erdsystems aufgebothen haben sondern unsere bescheidnere Meinung vortragen auf die wir zuerst durch die Untersuchung der Versteinerungen, und durch ihre Vergleichung und gefundene Unähnlichkeit mit den gegenwärtigen organisirten Körpern und dann durch die Beobachtung einiger ehemaligen Vulcane gebracht worden sind, und die uns zwar immer noch eine Hypothese, aber doch eine solche Hypothese zu seyn scheint, die sich der Natur und dem Augensche

In [69]:
print(inference.pretty_repr())


The source does not contain information about who came up with the theory of relativity. The theory of relativity was developed by Albert Einstein in the early 20th century.


In [70]:
record_in_txt("theory_of_relativity")

Please insert a short comment: 


### c) Humans, animals and the origin of life

In [80]:
# prompt template

template ="""You are a historian. Answer the QUESTION based on the SOURCE, which is an eighteenth-century textbook on natural history.
QUESTION: {question}
SOURCE: {context}
ANSWER: """

prompt = PromptTemplate.from_template(template)

In [92]:
# chain
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt | print_prompt
    | llm
)

In [93]:
question = "What is said about human races?"
inference = chain.invoke(question)

Retrieved context (metadata are not passed to the model):

Eine directe Folge der Vernunft, mithin ein abermaliges Eigenthum der Menschheit, ist die Rede oder Sprache (Loquela), die nicht mit der Stimme (vox) der Thiere verwechselt werden darf. Auch der Mensch hat Stimme, wie man an den unglücklichen Beyspielen in Wildniß aufgewachsener, oder taubgebohrner Kinder sieht, und wie die unwillkürlichen Töne aus beklemmter Brust, bey Schrecken und in andern heftigen Leidenschaften zeigen Die Sprache aber entwickelt sich erst mit der Vernunft, da denn die Seele ihre erlangten Begriffe, der Zunge zum Aussprechen überträgt. Es giebt eben so wenig ein sprachloses, als ein vernunftloses Volt auf unserer Erde, und wir haben nun die Wörterbücher der Eskimos, der Hottentotten und anderer Nationen, denen die leichtglaubigen Reisenden der alten Zeit die Rede abzusprechen wagten. Zu den körperlichen Eigenschaften des Menschen gehört vorzüglich sein aufrechter Gang und der Gebrauch zweyer Hände, wodurch

In [94]:
print(inference.pretty_repr())


The source discusses human races in terms of physical characteristics and geographical distribution. It mentions five main varieties of the human race, including Europeans, Asians, North Africans, Greenlanders and Eskimos, and Australasians and Polynesians. The text also highlights the impact of climate, diet, and lifestyle on the physical appearance of different human populations. It emphasizes that all human beings belong to the same species and can trace their ancestry back to Adam. The source also touches on the social nature of humans, their dependence on culture and education for the development of reason and language, and their ability to domesticate and work with animals.


In [95]:
record_in_txt("humans")

Please insert a short comment: 1st generation with inclusion of the register


## 5. RAG 2

### a) Blumenbach's writing style

In [7]:
# prompt template

template ="""You are a historian. Follow the INSTRUCTION. The SOURCE is a historical textbook on natural history by J. F. Blumenbach.
INSTRUCTION: {instruction} Provide details and examples from the SOURCE.
SOURCE: {context} """

prompt = PromptTemplate.from_template(template)


In [8]:
# chain
chain = (
    {"context": itemgetter("topic") | retriever | format_docs, "instruction": itemgetter("instruction")}
    | prompt | print_prompt
    | llm
)
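
The dictionary at the head of this chain routes the two input keys differently: `topic` is sent to the retriever, while `instruction` goes straight into the prompt. A minimal plain-Python sketch of that routing, where `fake_retrieve` is a hypothetical stand-in for `retriever | format_docs`:

```python
from operator import itemgetter


def fake_retrieve(topic):
    """Hypothetical stand-in for `retriever | format_docs`."""
    return f"<documents about {topic}>"


def route(inputs):
    """Mimic the chain's first step: build the variables the prompt expects."""
    return {
        "context": fake_retrieve(itemgetter("topic")(inputs)),
        "instruction": itemgetter("instruction")(inputs),
    }


template = "INSTRUCTION: {instruction}\nSOURCE: {context}"
variables = route({"topic": "rodents", "instruction": "Summarise the SOURCE."})
filled = template.format(**variables)
```

Separating the retrieval query from the instruction means the wording of the instruction does not influence which documents are retrieved.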

In [15]:
topic = "scientific discussion"
instruction = "First, write a few sentences about best practices of scientific writing. Then, evaluate the language used by Blumenbach according to them."
inference = chain.invoke({"topic": topic, "instruction": instruction})

Retrieved context (metadata are not passed to the model):

Der gegenwärtige Abschnitt betrift allerdings eine eben so wichtige als anmuthige Untersuchung nemlich die allgemeine Naturgeschichte der Gewächse, die wir so viel möglich in der gleichen Ordnung abfassen wollen, die oben in der allgemeinen Thiergeschichte befolgt worden ist, damit beide desto leichter mit einander verglichen und die Aehnlichkeit oder Abweichung dieser zweyerley Arten von organisirten Körpern um so deutlicher ersehen Werden kan.
Metadata: J.F.Blumenbach, Handbuch der Naturgeschichte, ed. 1 (1780), §. 170.

1. Geschl. homo. Animal rationale, loquens erectum, bimanum. 1. Gatt. sapiens. Der Mensch wird durch so merkwürdige Eigenschaften des Geistes und des Körpers von der ganzen übrigen thierischen Schöpfung ausgezeichnet, daß er bey weitem nicht blos in einem eignen Geschlecht, sondern allerdings in einer besondern Ordnung, von ihr abgeschieden werden muß. Er hat ausser dem Begattungstrieb wenig Spuren von Instin

In [16]:
print(inference.pretty_repr())


Best practices of scientific writing include clarity, precision, objectivity, and logical organization of ideas. It is important for scientific writing to be accessible to a wide audience, provide evidence to support claims, and avoid bias or subjective language.

In evaluating the language used by Blumenbach in the historical textbook on natural history, it is evident that he adheres to some of these best practices. Blumenbach's writing is clear and precise, as he discusses the classification of organisms and theories of evolution in a structured and organized manner. He presents his arguments logically and supports them with examples and evidence from the natural world.

Blumenbach also maintains objectivity in his writing, as he discusses different theories and perspectives without showing bias towards any particular viewpoint. For example, he presents the theories of Epigenesis and Evolution in a balanced manner, highlighting the arguments of both sides without favoring one over t

In [17]:
record_in_txt_instruction("Blumenbach")

Please insert a short comment: 


### b) Rodents

In [111]:
# prompt template

template ="""You are a historian analysing a primary historical SOURCE, which is an eighteenth-century textbook on natural history by J. F. Blumenbach. First, generate a few sentences about historical hermeneutics. Then, follow the INSTRUCTION.
INSTRUCTION: {instruction} Provide quotes from the source to support your reply.
SOURCE: {context}  """

prompt = PromptTemplate.from_template(template)

In [71]:
# chain
chain = (
    {"context": itemgetter("topic") | retriever | format_docs, "instruction": itemgetter("instruction")}
    | prompt | print_prompt
    | llm
)

In [113]:
topic = "hamster, marmot and other small animals"
instruction = "Find incoherences in the SOURCE."
inference = chain.invoke({"topic": topic, "instruction": instruction})

Retrieved context (metadata are not passed to the model):

u. f Ein muntres possierliches Thier, was in gebürgichten Gegenden der nordlichen Erde, besonders in den Schweizer-Alpen, in Savoyen Aegypten, und in der grossen Tattarey zu Hause ist. Es macht sich tiefe Hölen in die Erde, die es mit Heu und Moos ausfüttert, nährt sich von allerhand Pflanzen und Wurzeln; liebt aber vorzüglich Milchspeisen, daher es sich in den Schweizeralpen haüfig in die Sennhütten eingräbt. Bey kaltem Wetter schlafen die Murmelthiere; sobald aber die Sonne scheint, kommen sie aus ihren Hölen hervor, balgen sich und spielen mit einander. Ihr Fleisch ist eßbar und wohlschmeckend. Gegen den Winter werben sie so fett, daß oft eins bey 20 Pfund wiegt. Sie schlafen alsdann vom October bis in den April; und nachdem der Winter hart oder gelind werden wird vermachen sie den Eingang zu ihren Hölen fester oder lockerer. In der Tatarey pflanzen sie den Rhabarber fort. 2. †. Cricetus. der Hamster. M. abdomine nigro. * F.

In [114]:
print(inference.pretty_repr())


Historical hermeneutics involves the interpretation and understanding of historical sources within their historical context, taking into account the biases, perspectives, and limitations of the author. It also involves analyzing the language, symbols, and cultural references used in the source to uncover deeper meanings and implications.

In the eighteenth-century textbook on natural history by J. F. Blumenbach, there are several incoherences that can be identified. One inconsistency is the classification of animals, where the author mixes different categories and orders without clear distinctions. For example, in the passage about amphibians, the author mentions three orders but then introduces a fourth order without clear justification: "Die Siren lacertina aus Süd-Carolina, die Linné doch erst spät und mit eigenem Gefühl von Zweifel und Ungewißheit, in eine besondere vierte Ordnung (meantes) gesetzt hat..." This inconsistency in classification undermines the clarity and accuracy of

In [115]:
record_in_txt_instruction("Rodents")

Please insert a short comment: Mistook a marmot for a hamster, but better answer about incoherences than was the case with human races 


### c) God

In [65]:
template ="""You are a historian analysing a primary historical SOURCE, which is an eighteenth-century textbook on natural history by J. F. Blumenbach. First, generate a few sentences about historical hermeneutics. Then, follow the INSTRUCTION.
INSTRUCTION: {instruction} Provide relevant quotes from the SOURCE.
SOURCE: {context} """

prompt = PromptTemplate.from_template(template)

In [66]:
# chain
chain = (
    {"context": itemgetter("topic") | retriever | format_docs, "instruction": itemgetter("instruction")}
    | prompt | print_prompt
    | llm
)

In [67]:
topic = "God is the Creator of the world."
instruction = "Find fragments of the SOURCE where the idea that God created the world is mentioned or discussed. If there are no such fragments, report it."
inference = chain.invoke({"topic": topic, "instruction": instruction})

Retrieved context (metadata are not passed to the model):

Die Vollkommenheit in der große Haushaltung der Mutter-Natur ist, so wie bey der kleinsten Oekonomie einer Familie, in ganz andern Vorzügen zu suchen. Daß Gott in seiner Schöpfung keine Lücke gelassen hat, daß dieses unermeßliche Uhrwerk nirgend stockt, sondern im ununterbrochnen Gange, im beständigen Gleichgewicht erhalten wird, davon liegt der Grund wohl schwerlich darinne, weil der Orangoutang den Uebergang vom Menschen zum Affen machen, oder weil die Vögel durch die Fledermäuse mit den vierfüßigen Thieren, und durch die fliegenden Fische mit den Fischen verbunden seyn sollen sondern weil jedes erschaffne Wesen seine Bestimmung, und den zu dieser Bestimmung erfoderlichen Körperbau hat; weil kein zweckloses Geschöpf exsistirt, was nicht auch seinen Beytrag zur Vollkommenheit des Ganzen gäbe. Das machts, daß die Schöpfung ihren Gang geht und daß noch kein Weiser, irgend einer Zeit oder eines Volks, in ihr eine Lücke hat antref

In [68]:
print(inference.pretty_repr())


Historical hermeneutics involves the interpretation and understanding of historical sources within their historical context, taking into account the biases, perspectives, and intentions of the author. It requires a critical analysis of the text to uncover hidden meanings and implications.

In the eighteenth-century textbook on natural history by J. F. Blumenbach, the idea that God created the world is mentioned in several fragments. The author emphasizes the perfection of God's creation, stating that every created being has a purpose and contributes to the overall perfection of the whole. Blumenbach argues that there is no gap in God's creation, and that the intricate balance and design of the natural world reflect the wisdom and perfection of the Creator. He also discusses the concept of natural classification and the interconnectedness of all living beings in the grand scheme of creation. The author highlights the role of humans as distinct from other animals, possessing unique qual

In [69]:
record_in_txt_instruction("intelligent_design")

Please insert a short comment: Topic is described differently, but still works.


### d) Human races

In [34]:
template ="""You are a historian analysing a primary historical SOURCE, which is an eighteenth-century textbook on natural history by J. F. Blumenbach. First, generate a few sentences about historical hermeneutics. Then, follow the INSTRUCTION.
INSTRUCTION: {instruction} Provide all the relevant quotes from the SOURCE to support your reply.
SOURCE: {context} """

prompt = PromptTemplate.from_template(template)

In [35]:
# chain
chain = (
    {"context": itemgetter("topic") | retriever | format_docs, "instruction": itemgetter("instruction")}
    | prompt | print_prompt
    | llm
)

In [42]:
topic = "human races"
instruction = "Based on your own knowledge, generate a few sentences about theories of human races in the 18th century. Then, decide whether the SOURCE presents an extreme or a moderate view on human races."
inference = chain.invoke({"topic": topic, "instruction": instruction})

Retrieved context (metadata are not passed to the model):

Die Australasiaten und Polynesen; oder die Südländer des fünften Welttheils; dazu man auch wol die Bewohner der Sundaischen Inseln, der Molucken, Philippinen u. s. w. zälen könnte. Sie sind meist schwarzbraun, breitnasicht, und starkbehaart Die Patagonischen Riesen sind, von Magelhans Zeiten bis auf die unsrigen, in den Erzälungen der Reisenden, von zwölf Fus zu sechs bis siebenthalb eingekrochen, und bleiben also nicht größer und nicht kleiner als jeder andre ehrliche Mensch von guter Statur. Commersons Quimos und andre Zwergnationen werden in dem Maas wachsen, wie die Patagonen an Länge abgenommen haben. Die Rackerlacken, Blafards, Albinos oder weiße Mohren sind nicht einmal eine Spielart geschweige eine besondre Gattung, wozu sie der gute Voltaire so gern machen möchte: sondern eine Krankheit, die Menschen unter allen Meridianen befallen kan, und der auch Thiere unterworfen sind. Linne's Homo troglodytes ist ein Gemisch aus 

In [43]:
print(inference.pretty_repr())


Historical hermeneutics is the study of how historical texts are interpreted and understood in their historical context, taking into account the biases, assumptions, and perspectives of the author and the audience. It involves analyzing the language, tone, and content of the text to uncover the underlying beliefs and values of the time period in which it was written.

In the 18th century, theories of human races were often based on physical characteristics such as skin color, facial features, and body size. These theories were used to categorize and rank different groups of people, with some views promoting the idea of hierarchy and superiority of certain races over others. The concept of race was often linked to ideas of civilization, culture, and intelligence, with European societies typically being seen as the most advanced and superior.

Based on the SOURCE, it presents a moderate view on human races. Blumenbach acknowledges the physical differences among human populations but arg

In [44]:
record_in_txt_instruction("racism")

Please insert a short comment: rerunning


### e) Exploratory research

In [55]:
template ="""You are a historian analysing a primary historical SOURCE, which is an eighteenth-century textbook on natural history by J. F. Blumenbach. First, generate a few sentences about historical hermeneutics. Then, follow the INSTRUCTION.
INSTRUCTION: {instruction}
SOURCE: {context} """

prompt = PromptTemplate.from_template(template)

In [72]:
# chain
chain = (
    {"context": itemgetter("topic") | retriever | format_docs, "instruction": itemgetter("instruction")}
    | prompt | print_prompt
    | llm
)

In [84]:
topic = "scientific discussion"
instruction = "Identify incoherencies or contradictions in the SOURCE. If there are none, report it."
inference = chain.invoke({"topic": topic, "instruction": instruction})

Retrieved context (metadata are not passed to the model):

Der gegenwärtige Abschnitt betrift allerdings eine eben so wichtige als anmuthige Untersuchung nemlich die allgemeine Naturgeschichte der Gewächse, die wir so viel möglich in der gleichen Ordnung abfassen wollen, die oben in der allgemeinen Thiergeschichte befolgt worden ist, damit beide desto leichter mit einander verglichen und die Aehnlichkeit oder Abweichung dieser zweyerley Arten von organisirten Körpern um so deutlicher ersehen Werden kan.
Metadata: J.F.Blumenbach, Handbuch der Naturgeschichte, ed. 1 (1780), §. 170.

1. Geschl. homo. Animal rationale, loquens erectum, bimanum. 1. Gatt. sapiens. Der Mensch wird durch so merkwürdige Eigenschaften des Geistes und des Körpers von der ganzen übrigen thierischen Schöpfung ausgezeichnet, daß er bey weitem nicht blos in einem eignen Geschlecht, sondern allerdings in einer besondern Ordnung, von ihr abgeschieden werden muß. Er hat ausser dem Begattungstrieb wenig Spuren von Instin

In [85]:
print(inference.pretty_repr())


Historical hermeneutics is the study of how historical texts are interpreted and understood in their historical context. It involves analyzing the language, structure, and content of primary sources to uncover the meaning and significance of the text within its historical setting. By examining the primary historical SOURCE, one can identify incoherencies or contradictions that may exist within the text, shedding light on the historical context in which the source was created.

In the eighteenth-century textbook on natural history by J. F. Blumenbach, there are several incoherencies and contradictions present. One such contradiction is the discussion of the classification of natural bodies into naturalia and artefacta. The text mentions that artefacta are created through human intervention, yet includes examples such as the bark of a mulberry tree or the outer shell of a coconut as naturalia, despite being altered by humans. This inconsistency raises questions about the criteria for cl

In [86]:
record_in_txt_instruction("exploratory_research")

Please insert a short comment: nice


## 6. RAG 3

In [14]:
# initialize the llm
llm, llm_hyperparameters = initialize_llm(max_tokens=400)

Created an LLM with standard hyperparameters.


### a) Comparing editions

In [15]:
retriever, retriever_hyperparameters = initialize_retriever(db_1ed_12ed, k = 14)

Which vector database are you using? Please provide its name for record: db_1st_and_12th_edition
Initialized the retriever out of db_1st_and_12th_edition. 14 documents will be retrieved.


In [8]:
template ="""You are a historian analysing a primary historical SOURCE, which consists of fragments from the 1st and the 12th edition of a historical textbook on natural history by J. F. Blumenbach.
"ed. 1" in the metadata indicates the 1st edition; "ed. 12" in the metadata indicates the 12th edition.
First, generate a few sentences about historical hermeneutics. Then, follow the INSTRUCTION.
INSTRUCTION: {instruction}
SOURCE: {context} """
prompt = PromptTemplate.from_template(template)

In [45]:
# chain
chain = (
    {"context": itemgetter("topic") | retriever | format_docs_with_metadata, "instruction": itemgetter("instruction")}
    | prompt | print_prompt
    | llm
)

# note that this time the metadata are passed to the prompt by the format_docs_with_metadata function
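
The implementation of `format_docs_with_metadata` lives in `rag_funcs` and is not shown here. As a hedged sketch of the idea (the metadata key names `edition` and `paragraph` are assumptions, and `Doc` is a hypothetical stand-in for LangChain's `Document`), each text could be followed by a `Metadata:` line so the model can tell the editions apart:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_metadata(docs):
    """Append a 'Metadata:' line to each text, then join the blocks."""
    blocks = []
    for doc in docs:
        # key names other than "source" are assumptions for this sketch
        keys = ("source", "edition", "paragraph")
        meta = ", ".join(str(doc.metadata[k]) for k in keys if k in doc.metadata)
        blocks.append(f"{doc.page_content}\nMetadata: {meta}")
    return "\n\n".join(blocks)


docs = [
    Doc("Text A", {"source": "Handbuch der Naturgeschichte",
                   "edition": "ed. 1", "paragraph": "§. 170."}),
    Doc("Text B", {"source": "Handbuch der Naturgeschichte", "edition": "ed. 12"}),
]
formatted = format_docs_with_metadata(docs)
```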

In [46]:
topic = "natural history"
instruction = "Compare and contrast the portrayal of natural history in the 1st and the 12th edition."
inference = chain.invoke({"topic": topic, "instruction": instruction})

Aber alles dies herzlich gerne zugegeben, dürfen doch die Leitern und Ketten, der guten Sache der bestimmten Naturreiche, und der Classification der Naturalien, bey weitem keinen Eintrag thun Die passendste Allegorie kann matt werden, kann in eine Spielerey ausarten, wenn sie zu weit getrieben wird. Und das ist in der That bey den eben angeführten zu befürchten. Es ist unterhaltend, es ist, wie wir so eben selbst gesagt haben, nutzbar, wenn der Naturforscher die Creaturen nach ihrer nächsten Verwandschaft unter einander ordnet, an einander kettet u. s. w Aber es scheint uns von der andern Seite eine schwache, und der Allweisheit des Schöpfers unanständige Behauptung, wenn man im Ernste annehmen wollte, daß auch Er bey der Schöpfung einen solchen allegorischen Plan befolgt, und die Vollkommenheit seiner großen Handlung darein gesetzt hätte, daß er seinen Creaturen alle ersinnliche Formen gäbe, und sie folglich vom obersten bis zum untersten ganz regelmäßig stufenweis auf einander folgen

In [47]:
print(inference.pretty_repr())


Historical hermeneutics is the study of how historical texts are interpreted and understood in their historical context, taking into account the cultural, social, and intellectual factors that influenced the author and the audience. It involves analyzing the language, symbols, and ideas present in the text to uncover the meaning and significance behind them.

In comparing the portrayal of natural history in the 1st and the 12th edition of J.F. Blumenbach's textbook, several differences and similarities can be observed. In the 1st edition, there is a focus on the classification of natural objects into two categories: naturalia and artefacta, emphasizing the distinction between objects created by nature and those altered by humans. Blumenbach discusses the importance of understanding the origins and changes in natural objects, such as petrified remains and fossils, to gain insights into the Earth's geological history.

On the other hand, in the 12th edition, there is a shift towards a m

In [48]:
record_in_txt_instruction("comparing_editions")

Please insert a short comment: increased max_tokens to 800, but the reply is still short


### b) Search by metadata in both editions without register

Additional computation needed for this step:

In [4]:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
#from langchain_openai import ChatOpenAI
#from langchain_chroma import Chroma

In [8]:
from langchain_community.vectorstores import Chroma

In [9]:
# SelfQueryRetriever does not support FAISS, so the vectorstore has to be rebuilt with Chroma
with open("../data/pickles/ed1_ed12_docs.pickle", 'rb') as f:
    docs_both_editions = pickle.load(f)
db_1ed_12ed_chroma = Chroma.from_documents(docs_both_editions, hf_jina)

In [65]:
# describe the metadata
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The title of the book",
        type="string",
    ),
    AttributeInfo(
        name="autor",
        description="The name of the author of the book",
        type="string",
    ),
    AttributeInfo(
        name="edition",
        description="The edition of the book",
        type="integer",
    ),
    AttributeInfo(
        name="date", 
        description="Year of publication of the book", 
        type="integer"
    ),
    
    AttributeInfo(
        name="language", 
        description="The main language of the book", 
        type="string"
    ),
    
    AttributeInfo(
        name="in-text location", 
        description="The paragraph (e.g. §. 1. means paragraph 1, §. 5. means paragraph 5)", 
        type="string"
    ),
]

# describe the contents 
document_content_description = "A chunk from a book"
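
Because a self-querying retriever builds its filters from these declarations, it can help to check that every declared field actually occurs in the stored metadata with the declared type. The following is a minimal sketch using plain dictionaries: the `declared` mapping mirrors the fields above, while the sample record and the helper name `check_declared_fields` are hypothetical.

```python
# Declared fields and their types, mirroring metadata_field_info above.
declared = {
    "source": "string",
    "autor": "string",
    "edition": "integer",
    "date": "integer",
    "language": "string",
    "in-text location": "string",
}

PY_TYPES = {"string": str, "integer": int}


def check_declared_fields(metadata_records, declared_fields):
    """Return a list of human-readable problems found in the metadata."""
    problems = []
    for i, meta in enumerate(metadata_records):
        for name, type_name in declared_fields.items():
            if name not in meta:
                problems.append(f"record {i}: missing field '{name}'")
            elif not isinstance(meta[name], PY_TYPES[type_name]):
                problems.append(
                    f"record {i}: field '{name}' is {type(meta[name]).__name__}, "
                    f"declared as {type_name}"
                )
    return problems


# Hypothetical sample: the edition is stored as text although declared as integer.
sample_records = [
    {"source": "Handbuch der Naturgeschichte", "autor": "J.F.Blumenbach",
     "edition": "1", "date": 1780, "language": "German",
     "in-text location": "§. 242."},
]
problems = check_declared_fields(sample_records, declared)
for problem in problems:
    print(problem)
```

With the real data, and assuming the pickled objects are LangChain `Document`s, the records could be taken from `[d.metadata for d in docs_both_editions]`.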


In [66]:
# llm for retrieval (the default llm proposed in the class)
llm_retrieval = ChatOpenAI(temperature=0) 

# initialize the retriever
retriever = SelfQueryRetriever.from_llm(
    llm_retrieval,
    db_1ed_12ed_chroma,
    document_content_description,
    metadata_field_info,
)

vectorstore = input("Please indicate which vector database is used.")

Please indicate which vector database is used.db_1st_and_12th_edition_chroma


The actual RAG pipeline starts here

In [82]:
template ="""You are a historian analysing a primary historical SOURCE, which is an eighteenth-century textbook on natural history by J. F. Blumenbach. First, generate a few sentences about historical hermeneutics. Then, follow the INSTRUCTION.
INSTRUCTION: {instruction}
SOURCE: {context} """

prompt = PromptTemplate.from_template(template)

In [83]:
# chain
chain = (
    {"context": itemgetter("topic") | retriever | format_and_save_docs, "instruction": itemgetter("instruction")}
    | prompt | print_prompt
    | llm
)
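
To make the data flow of this chain explicit: the dictionary at the start sends the `topic` to the retriever and passes the `instruction` through unchanged, and both results fill the prompt template. The sketch below imitates that routing with plain Python functions. `fake_retriever`, `fake_format` and `run_chain_sketch` are hypothetical stand-ins, not the LangChain implementation, and the template is shortened.

```python
from operator import itemgetter

TEMPLATE = "INSTRUCTION: {instruction}\nSOURCE: {context} "


def fake_retriever(query):
    """Stand-in for the retriever: returns chunks that 'match' the query."""
    return [f"chunk about {query} (1)", f"chunk about {query} (2)"]


def fake_format(chunks):
    """Stand-in for format_and_save_docs: joins chunks into one string."""
    return "\n\n".join(chunks)


def run_chain_sketch(inputs):
    """Mimic the first two steps of the chain: input routing, then prompt filling."""
    routed = {
        # "topic" is used only for retrieval ...
        "context": fake_format(fake_retriever(itemgetter("topic")(inputs))),
        # ... while "instruction" goes to the prompt unchanged.
        "instruction": itemgetter("instruction")(inputs),
    }
    return TEMPLATE.format(**routed)


filled_prompt = run_chain_sketch(
    {"topic": "natural history", "instruction": "Summarize what natural history was."}
)
print(filled_prompt)
```

Since the retriever in this section is a `SelfQueryRetriever`, the `topic` string is also the only place where a metadata constraint (edition, paragraph) can be expressed; the `instruction` never reaches the retriever.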


In [87]:
topic = "Information about natural history from the 1st paragraph of the 12th edition."
instruction = "Summarize what natural history was."
inference = chain.invoke({"topic": topic, "instruction": instruction})

Alle Körper, die sich auf, und in unserer Erde finden, zeigen sich entweder in derselben Gestalt und Beschaffenheit, die sie aus der Hand des Schöpfers erhalten und durch die Wirkung der sich selbst überlassenen Naturkräfte angenommen haben; oder so wie sie durch Menschen und Thiere, zu bestimmten Absichten, oder auch durch bloßen Zufall verändert und gleichsam umgeschaffen worden sind Auf diese Verschiedenheit gründet sich die bekannte Eintheilung derselben in natürliche (naturalia), und durch Kunst verfertigte (artefacta). 
Die erstern machen den Gegenstand der Naturgeschichte aus, und man pflegt alle Körper zu den Naturalien zu rechnen, die nur noch keine wesentliche Veränderung durch Menschen erlitten haben. Artefacten werden sie dann genannt, wenn der Mensch absichtlich Veränderungen mit ihnen vorgenommen Anm. 1. Daß übrigens jene Begriffe vom Wesentlichen und vom Absichtlichen im gegenwärtigen Falle bey so verschiedentlicher Rücksicht und Modification, nicht anders als relativ se

In [88]:
print(inference.pretty_repr())


Natural history, as described in the eighteenth-century textbook by J.F. Blumenbach, is the study of all bodies found on or within the Earth, either in their original form as created by the Creator and shaped by natural forces, or altered by humans and animals for specific purposes or by chance. This distinction leads to the classification of bodies into natural (naturalia) and artificially made (artefacta). Natural history focuses on bodies that have not undergone significant changes by humans, while artefacts are those intentionally altered by humans. Sometimes, natural bodies can closely resemble artificial products, making it difficult to distinguish between the two.


In [89]:
record_in_txt_metadata()

Please insert a short comment: the 1st paragraph of the 12th edition


### c) Search by metadata in the register of the 1st edition

Additional computation needed for this step:

In [90]:
# SelfQueryRetriever does not support FAISS, so the vectorstore has to be rebuilt with Chroma
with open("../data/pickles/ed1_docs_with_register_1500_400.pickle", 'rb') as f:
    docs_ed1_1500_with_reg = pickle.load(f)
db_1ed_chroma = Chroma.from_documents(docs_ed1_1500_with_reg, hf_jina)

In [233]:
# describe the metadata
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The title of the book",
        type="string",
    ),
    AttributeInfo(
        name="autor",
        description="The name of the author of the book",
        type="string",
    ),
    AttributeInfo(
        name="edition",
        description="The edition of the book",
        type="integer",
    ),
    AttributeInfo(
        name="date", 
        description="Year of publication of the book", 
        type="integer"
    ),
    
    AttributeInfo(
        name="language", 
        description="The main language of the book", 
        type="string"
    ),
    
    AttributeInfo(
        name="in-text location", 
        description="Topic of the section (e.g. Register 6th Chapter: Amphibien, I. REPTILES. means the register about amphibians and reptiles; Register 13th Chapter: Saltze, I. ACIDA. means the register about acid salts)", 
        type="string"
    ),
]

# describe the contents 
document_content_description = "A chunk from a book"

In [234]:
# llm for retrieval (the default llm proposed in the class)
llm_retrieval = ChatOpenAI(temperature=0) 

# initialize the retriever
retriever = SelfQueryRetriever.from_llm(
    llm_retrieval,
    db_1ed_chroma,
    document_content_description,
    metadata_field_info,
)

vectorstore = input("Please indicate which vector database is used.")

Please indicate which vector database is used.1st_edition_1500_with_register_chroma


The actual RAG pipeline starts here

In [240]:
template ="""You are a historian analysing a primary historical SOURCE, which is an eighteenth-century textbook on natural history by J. F. Blumenbach. First, generate a few sentences about historical hermeneutics. Then, follow the INSTRUCTION.
INSTRUCTION: {instruction}
SOURCE: {context} """

prompt = PromptTemplate.from_template(template)

In [241]:
# chain
chain = (
    {"context": itemgetter("topic") | retriever | format_and_save_docs, "instruction": itemgetter("instruction")}
    | prompt | print_prompt
    | llm
)

# note that here the documents are formatted (and saved) by the format_and_save_docs function; the metadata still reach the prompt, as the "Metadata:" lines in the printed prompt show

In [245]:
topic = "Fragments from register about gold"
instruction = "What is said about gold in the SOURCE?"
inference = chain.invoke({"topic": topic, "instruction": instruction})

Aculeata, der Goldwurm. A. ovalis hirsuta aculeata, pedibus utrinque 32. * Ein über alle Beschreibung prächtiges Geschöpf die Stacheln und Haare, womit es zumal an beiden Seiten besetzt ist, changiren, zumal im Sonnenschein, in alle mögliche Goldfarben: theils auch wie blaue Schwefelflammen u. s. w.
Metadata: Blumenbach, Handbuch der Naturgeschichte, ed. 1 (1779), Register 9th Chapter: Würmer, I. MOLLUSCA., 1.

Aurata. Der Goldbrachsen. S. lunula aurea inter oculos. * Hat fast in allen Sprachen seinen Namen von dem goldnen halben Mond vor den Augen. Hält sich im Sommer in der offnen See, die übrige Zeit aber am Gestade und in Flüssen auf. Er schläft zu gesetzter Zeit, was man bey andern Fischen nicht so bemerkt.
Metadata: Blumenbach, Handbuch der Naturgeschichte, ed. 1 (1779), Register 7th Chapter: Fische, III. THORACICI., 1.

† Chrysaëtos. der Goldadler, Steinadler. ( le grand Aigle, Buff.) F. cera lutea, pedibusque lauatis luteo-ferrugineis, corpore fusco ferrugineo vario, cauda nig

In [246]:
print(inference.pretty_repr())


Historical hermeneutics is the study of interpreting historical texts and sources in order to understand the context, meaning, and significance of the information they provide. When analyzing a primary historical source like the eighteenth-century textbook on natural history by J. F. Blumenbach, it is important to consider the language, cultural context, and scientific knowledge of the time period in order to accurately interpret the information presented.

In the SOURCE, gold is mentioned in relation to various creatures. The Goldwurm is described as having stunning gold colors on its spines and hairs, which shimmer in the sunlight. The Goldbrachsen, or Goldbrass, is noted for the golden crescent shape near its eyes. The Goldadler, or Golden Eagle, is described as having a beautiful and strong physique, living in mountainous regions, and preying on small mammals and birds. The deaurata insect is mentioned as varying in color, with some specimens being green and gold, while others are

In [247]:
record_in_txt_metadata()

Please insert a short comment: again, the retriever doesn't work with the metadata properly
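
When the self-querying retriever does not translate the request into the intended metadata filter, one possible workaround is to filter the documents on their metadata directly, before or instead of the similarity search. This is a sketch under assumptions, not what the notebook does: `filter_by_location` is a hypothetical helper, the records are invented for illustration, and the `in-text location` values are modelled on the metadata lines printed above.

```python
def filter_by_location(records, keyword):
    """Keep records whose 'in-text location' contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [
        r for r in records
        if keyword in r["metadata"].get("in-text location", "").lower()
    ]


# Hypothetical records; the locations follow the format seen in the printed prompt.
records = [
    {"text": "Aculeata, der Goldwurm.",
     "metadata": {"in-text location": "Register 9th Chapter: Würmer, I. MOLLUSCA., 1."}},
    {"text": "Aurata. Der Goldbrachsen.",
     "metadata": {"in-text location": "Register 7th Chapter: Fische, III. THORACICI., 1."}},
    {"text": "Placeholder text of a chunk about tin.",
     "metadata": {"in-text location": "Register 15th Chapter: Metalle, I. Eigentliche Metalle., "
                                      "B. Unedle Metalle., 4. stannvm, Zinn"}},
]

metal_records = filter_by_location(records, "Metalle")
print([r["text"] for r in metal_records])
```

A substring match of this kind restricts the candidates to the chapter on metals, so that a query about gold is not answered with animals whose names merely contain "Gold".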


### d) Using information from metadata

In [11]:
retriever, retriever_hyperparameters = initialize_retriever(db_1500, k = 20)

Which vector database are you using? Please provide its name for record: db_1500_with_register
Initialized the retriever out of db_1500_with_register. 20 documents will be retrieved.


In [28]:
template ="""You are a historian analysing a primary historical SOURCE, which are fragments from the 1st and the 12th edition of a historical textbook on natural history by J. F. Blumenbach. 
Each fragment is described by metadata. 
For example, if metadata include "Register 6th Chapter: Amphibien, I. REPTILES", the fragment is about the subclass of reptiles belonging to amphibians.
For example, if metadata include "Register 13th Chapter: Saltze, I. ACIDA", the fragment is about the subclass of acid salts belonging to salts.
For example, if metadata include "Register 15th Chapter: Metalle, I. Eigentliche Metalle., B. Unedle Metalle., 4. stannvm, Zinn", the fragment is about tin belonging to base metals, actual metals and metals.
First, generate a few sentences about historical hermeneutics. Then, follow the INSTRUCTION.
INSTRUCTION: {instruction}
SOURCE: {context} """
prompt = PromptTemplate.from_template(template)

In [29]:
# chain
chain = (
    {"context": itemgetter("topic") | retriever | format_docs_with_metadata, "instruction": itemgetter("instruction")}
    | prompt | print_prompt
    | llm
)

# note that this time the metadata are passed to the prompt by the format_docs_with_metadata function

In [45]:
topic = "description of metals, gold, silver, etc."
instruction = "Provide all information from the SOURCE about base metals."
inference = chain.invoke({"topic": topic, "instruction": instruction})

Man theilt die Metalle überhaupt in Ganze- oder eigentlich so genannte Metalle, und Halbmetalle, und begreift unter der lezten Abtheilung diejenigen, die nicht so geschmeidig als die erstern sind, und im Feuer größtentheils verflüchtigen. Von jenen hat man das Gold und Silber wegen ihrer größern Feuerbeständigkeit Edle und die übrigen Unedle Metalle genannt.
Metadata: J.F.Blumenbach, Handbuch der Naturgeschichte, ed. 1 (1780), §. 242.

So verschieden die Gestalten sind, unter denen sich die Metalle zeigen, so lassen sie sich doch am kürzesten auf zwey Hauptgattungen zurück bringen. Entweder nemlich finden sich die Erzte gediegen (metallum nudum s. natiuum) d. h in aller ihrer wahren metallischen Substanz und Ansehen, so daß sie ohne weitere Scheidung u. s. w sogleich verarbeitet werden könnten; oder aber vererzt, (mineralisatum) so daß ihnen der Mangel eines ihrer eigenthümlichen Bestandtheile, oder die innige Beymischung einer fremden Säure von Schwefel u. s. w. mehr oder weniger von 

In [46]:
print(inference.pretty_repr())


Historical hermeneutics involves the interpretation and analysis of historical sources to understand the context, meaning, and significance of the information presented. In the study of historical texts, it is important to consider the metadata provided, such as the edition, chapter, and specific topic, to extract relevant information about a particular subject.

From the SOURCE provided, information about base metals can be gathered from the following fragments:

1. The classification of metals into noble and base metals, with gold and silver being considered noble due to their greater resistance to fire, while the rest are classified as base metals.
2. Metals are composed of three main elements: phlogiston, salt, and earth. Base metals like copper and iron lose their metallic properties when deprived of phlogiston.
3. Base metals are categorized under the class of metals and semi-metals, along with additional classifications like salts and earth resins.
4. Detailed descriptions of b

In [47]:
record_in_txt_instruction("metadata_2")

Please insert a short comment: a great reply
