In [1]:
# importing os module for environment variables
import os
# load_dotenv reads the key/value pairs of a .env file into the environment
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# MODEL = "gpt-3.5-turbo"
MODEL = "llama3:8b"
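
The API key only matters when `MODEL` points at an OpenAI model, so a small guard can fail early instead of at the first request. This is a minimal sketch: `require_api_key` is a hypothetical helper, and the `gpt` prefix check simply mirrors the convention used in this notebook.

```python
import os
from typing import Optional


def require_api_key(model_name: str, env_var: str = "OPENAI_API_KEY") -> Optional[str]:
    """Return the API key for hosted models, or None for local ones.

    Hypothetical helper: the 'gpt' prefix test is this notebook's own
    convention for telling OpenAI models apart from local Ollama models.
    """
    if not model_name.startswith("gpt"):
        return None  # local model, no key required
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; check your .env file")
    return key
```

Calling `require_api_key(MODEL)` right after `load_dotenv()` would surface a missing key before any model object is built.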

In [14]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings

# test openai and llama message response
if MODEL.startswith('gpt'):
    model = ChatOpenAI(api_key=OPENAI_API_KEY, model=MODEL)
    embeddings = OpenAIEmbeddings()
else:
    model = Ollama(model=MODEL)
    embeddings = OllamaEmbeddings(model=MODEL)

model.invoke("Why is the sky blue ?")

"What a great question!\n\nThe color of the sky can appear differently depending on various factors such as time of day, atmospheric conditions, and the observer's location. However, under normal circumstances, the sky typically appears blue to our eyes because of a phenomenon called Rayleigh scattering.\n\nRayleigh scattering is the scattering of light by small particles or molecules in the atmosphere, like nitrogen (N2) and oxygen (O2). These gases are present in the air at very high concentrations, making up about 99% of the Earth's atmosphere. When sunlight enters the Earth's atmosphere, it encounters these tiny molecules and scatters in all directions.\n\nHere's the key part: shorter (blue) wavelengths scatter more than longer (red) wavelengths. This is because the smaller molecules are more effective at scattering the shorter wavelengths. As a result, the blue light is dispersed throughout the atmosphere, reaching our eyes from all directions.\n\nWhen we look up at the sky, we se

LangChain gives us an `AIMessage` object for our OpenAI chat model, and a plain string for our local llama model. We use a parser to always deal with strings.
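
As a rough picture of what the parser and the `|` operator used in the next cell do, here is a pure-Python sketch. `FakeMessage`, `to_text` and `Step` are hypothetical names made up for the illustration; this is not how LangChain implements its runnables.

```python
from dataclasses import dataclass


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a chat message object that carries its text in .content."""
    content: str


def to_text(output):
    """Return plain text whether the model produced a message object or a string."""
    return output.content if hasattr(output, "content") else output


class Step:
    """Hypothetical chainable wrapper; `a | b` feeds the output of a into b."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        return Step(lambda value: other.invoke(self.invoke(value)))


chat_style_model = Step(lambda prompt: FakeMessage(content=f"echo: {prompt}"))
text_style_model = Step(lambda prompt: f"echo: {prompt}")
parser_step = Step(to_text)

# Both chains now return a plain string
print((chat_style_model | parser_step).invoke("hello"))  # echo: hello
print((text_style_model | parser_step).invoke("hello"))  # echo: hello
```

Whatever the model returns, the code after the parser only ever has to handle one type.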

In [15]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()
# pipe the output of our model into the input of our parser
chain = model | parser
# instead of invoking the model, we invoke the chain instead
chain.invoke("Why is the sky blue ? Give a concise answer.")

"The sky appears blue because of a phenomenon called Rayleigh scattering, where shorter (blue) wavelengths of light are scattered more than longer (red) wavelengths by tiny molecules of gases like nitrogen and oxygen in the Earth's atmosphere. This scattering effect gives the sky its blue color!"

In [16]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("data/building_statistical_models_in_python.md")
pages = loader.load_and_split()
pages

[Document(page_content="lang: eng\nannotation-target: \nauthor:\n  - Huy Hoang Nguyen\n  - Paul N Adams\n  - Stuart J Miller\nsubject:\n  - Statistical Anylisis\n  - Python\ntags:\n  - book\n  - statisticalmodels\n\nOutline\n\nAn introduction to statistics\n\nRegression models\n\nClassification models\n\nTime series models\n\nSurvival analysis\n\nReadthrough\n\nPart 1, An introduction to statistics\n\nChapter 1, Sampling and Generalization\n\nPopulation versus sample\n\nThe goal of stats modeling is to answer a question about a group by making an inference about that group (the entirety of the group is called a population) .\n\nBecause it's unlikely to have data on the whole population (can't collect all data, too large),  we use a subset of the population, a sample.\n\nThis subset needs to be representative of the population.\n\nPopulation inference from samples\n\nWe have to give our study the same degrees of uncertainty as those of the population, to that effect we used randomized e

In [17]:
type(pages[0])

langchain_core.documents.base.Document

We create a prompt template that passes the context to the model together with the question.

In [18]:
from langchain.prompts import PromptTemplate

template="""
Answer the question based on the context below.
If you can't answer the question, reply 'I don't know'.

Context: {context}
Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))


Answer the question based on the context below.
If you can't answer the question, reply 'I don't know'.

Context: Here is some context
Question: Here is a question



To pass this prompt to our model, we expand upon our chain.

In [19]:
chain = prompt | model | parser

In [20]:
chain.input_schema.schema()

{'title': 'PromptInput',
 'type': 'object',
 'properties': {'context': {'title': 'Context', 'type': 'string'},
  'question': {'title': 'Question', 'type': 'string'}}}
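
The schema lists one string property per placeholder of the template. As an illustration only (we are not claiming this is how LangChain builds the schema), the placeholder names of a `str.format`-style template can be recovered with the standard library. `demo_template` and `template_fields` are hypothetical names.

```python
from string import Formatter

demo_template = "Context: {context}\nQuestion: {question}\n"


def template_fields(template):
    """List the placeholder names found in a str.format-style template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


fields = template_fields(demo_template)
print(fields)  # ['context', 'question']

# Formatting works only when every placeholder receives a value
print(demo_template.format(context="some context", question="a question"))
```

This is why the dict passed to `invoke` must contain exactly the keys `context` and `question`.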

In [21]:
# the dict passed to invoke must provide the keys listed in the input schema above
chain.invoke(
    {
        "context": "In the 1960s, the rock scene was an effervescent field of talented and eccentric musicians brought up through the hippie movement. Jimi Hendrix, Eric Clapton, Jimmy Page are a few guitarists of that era who experienced a lot of success. Jimi Hendrix is still considered to be the best guitarist of all time.",
        "question": "Who is the greatest guitarist ever ?",
    }
)

"Based on the context, I would say that according to many people's opinions, including fans and experts in the field, Jimi Hendrix is widely considered to be the greatest guitarist of all time."

So that our chain receives the relevant parts of our documents as context, we'll use a very simple in-memory vector store that holds the embeddings generated from our data.

In [22]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore = DocArrayInMemorySearch.from_documents(pages, embedding=embeddings)

A retriever is a LangChain component that retrieves information from a vector store (it can also retrieve from other sources). Calling `invoke` on it returns the documents closest to the query, the top 4 by default.

In [30]:
retriever = vectorstore.as_retriever()
retriever.invoke("statistical test")

[Document(page_content="[[Bootstrapping as a demonstration of the CLT]]\n\nPermutations\n\nBasic knowledge of permutations and combinations\n\nThe order of objects matter in permutations while it does not for combinations.\n\nFor exemple, someone needs to choose at random 3 people out 10 to get moneys prices (winners being Huy, Paul, and Stuart). \nIn one example, he gives out 1000\\$, 500\\$, et 300\\$. In a second, he gives three equal 500$ prices. In the first exemple, the prices are different for each winner so it plays out in more ways than the second exemple as the order prize arrangement doesn't matter, the first example is a permutation example.\n![[Pasted image 20240209153511.png]]\nIn Python, the package itertools is used to find permutations directly with permutations.\n\nFor the second example, the order doesn't matter. In the first example, when the 3 winners are selected, there are six ways of  arranging the prizes, whereas there is only one way of doing so in the second 
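
To make "closest documents" concrete, here is a toy sketch of a top-k search over embeddings. It assumes cosine similarity as the metric and uses made-up two-dimensional vectors; `cosine_similarity`, `top_k` and `toy_store` are hypothetical names, not part of the vector store's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=4):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 2-dimensional "embeddings"; real ones have far more dimensions
toy_store = [
    ("hypothesis testing", [0.9, 0.1]),
    ("permutations", [0.7, 0.3]),
    ("python packaging", [0.1, 0.9]),
]
print(top_k([1.0, 0.0], toy_store, k=2))  # ['hypothesis testing', 'permutations']
```

In the real pipeline the query vector comes from embedding the question with the same embedding model that was used for the documents.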

The retriever allows us to generate context that can be passed along our chain. The prompt object expects a map, so we need to integrate our retriever with that in mind.
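
As a rough illustration of such a map step, the sketch below builds the prompt inputs with plain Python. `fake_retriever` and `run_map` are hypothetical stand-ins; only `operator.itemgetter` is the real standard-library function.

```python
from operator import itemgetter

get_question = itemgetter("question")
print(get_question({"question": "What is a p-value?"}))  # What is a p-value?


def fake_retriever(query):
    """Hypothetical retriever: returns canned text instead of searching a vector store."""
    return [f"passage about: {query}"]


def run_map(steps, payload):
    """Call every value of `steps` with the same payload and collect the results by key."""
    return {key: step(payload) for key, step in steps.items()}


prompt_inputs = run_map(
    {
        "context": lambda payload: fake_retriever(get_question(payload)),
        "question": get_question,
    },
    {"question": "What is a p-value?"},
)
print(prompt_inputs)
```

Each key of the map receives the same input dict, so the question is used twice: once to fetch the context and once as is.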

In [29]:
# itemgetter("question") creates a callable that returns the value stored under
# the "question" key of its argument; here the argument is the dict passed to invoke
from operator import itemgetter

# the dict placed before the prompt becomes a Runnable that produces a map with context and question
# the context comes from our retriever, which receives the "question" item
chain = ({
    "context": itemgetter("question") | retriever,
    "question": itemgetter("question"),
    }
    | prompt
    | model
    | parser
)
# the "question" item is supplied through invoke when we call the chain
chain.invoke({"question": "What is a statistical test used for?"})

'According to the provided context, a statistical test is used "to help make decisions based on quantifiable uncertainties". It also contains a null hypothesis (no difference between data) and an alternative hypothesis (difference between data), with a critical value or benchmark. Additionally, it can be used to:\n\n* Test one variable against another (e.g., t-test)\n* Test multiple variables against one variable (e.g., linear regression)\n* Test multiple variables against multiple variables (e.g., MANOVA)'

We can see that the retriever passed the pieces of text related to our question (statistical tests) to the map, which in turn filled in our prompt. The model answers with phrasing used in our Markdown document and cites examples taken directly from it.
We now test our chain with a series of questions to gauge its effectiveness.

In [31]:
questions = [
    "Who are the authors of the book ?",
    "What are the main subjects of the book ?",
    "What are the different ways to evaluate distributions ?",
    "How to sample data ?",
    "How much does the book cost ?",
    "What are the topics that will be covered by the document ?",
]

for question in questions:
    print(f"Question: \n{question}")
    print(f"Answer : \n{chain.invoke({'question': question})}")

Question: 
Who are the authors of the book ?
Answer : 
According to the context, the authors of the book are:

* Huy Hoang Nguyen
* Paul N Adams
* Stuart J Miller
Question: 
What are the main subjects of the book ?
Answer : 
Based on the context, the main subjects of the book are:

1. Statistical Analysis
2. Python
3. Regression models
4. Classification models
5. Time series models
6. Survival analysis

These subjects are listed under the "Outline" section at the beginning of the book, which suggests that they are the primary topics covered in the book.
Question: 
What are the different ways to evaluate distributions ?
Answer : 
Based on the provided context, there is no specific information about evaluating distributions. However, I can provide a general overview of common methods used to evaluate or understand the distribution of data:

1. **Summary Statistics**: Calculating mean, median, mode, standard deviation, variance, and other measures to get an overall sense of the distributi

In general, llama3 is less consistent than GPT-3.5 at retrieval and summarization, and it is also much slower and much more verbose in its answers.

Instead of printing the whole answer in one shot with `invoke`, we can use `stream` to print it as it is generated.
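
The difference can be pictured with a plain generator. This is a sketch under stated assumptions: `fake_stream` and `fake_invoke` are hypothetical functions, and real chunks are model tokens rather than whole words.

```python
from typing import Iterator


def fake_stream(answer: str) -> Iterator[str]:
    """Hypothetical token stream: yields the answer one word at a time."""
    words = answer.split(" ")
    for index, word in enumerate(words):
        # Re-attach the space that split() removed, except after the last word
        yield word if index == len(words) - 1 else word + " "


def fake_invoke(answer: str) -> str:
    """One-shot variant: waits for every chunk, then returns the full text."""
    return "".join(fake_stream(answer))


for chunk in fake_stream("Rayleigh scattering makes the sky blue."):
    print(chunk, end="", flush=True)
print()
```

Both variants produce the same text; streaming only changes when the reader starts seeing it.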

In [33]:
for s in chain.stream({'question': 'What is permutation testing useful for ?'}):
    print(s, end="", flush=True)

Based on the context, it seems that permutation testing is not explicitly mentioned. However, since we are discussing randomization and statistical tests, I'll try to provide an answer based on related concepts.

Permutation testing is a type of non-parametric statistical test that is used to determine whether there is a significant association between two variables or not. It's particularly useful when the data does not follow a normal distribution or when the assumptions of parametric tests are not met. Permutation testing can be used for both continuous and categorical outcomes.

In the context of the problem, permutation testing could be used to test whether there is a significant difference in crop yield between different pesticide brands. This would involve randomly permuting the treatment labels and recalculating the statistic (e.g., mean difference) multiple times to generate a null distribution under the hypothesis that there is no difference between treatments. The actual val

(Again, the model fails to infer the correct response from the context.)
We can also batch the questions and get all the answers at once.

In [35]:
chain.batch([{"question": question} for question in questions[:3]])

['Based on the provided context, the authors of the book "Building Statistical Models in Python" are:\n\n1. Huy Hoang Nguyen\n2. Paul N Adams\n3. Stuart J Miller',
 'Based on the context, the main subjects of the book "Building Statistical Models in Python" appear to be:\n\n1. **Statistical Analysis**: The book covers various statistical concepts and techniques, including hypothesis testing, regression models, classification models, time series models, survival analysis, and more.\n2. **Python**: The book focuses on building statistical models using Python, which implies that it will cover various Python libraries and tools relevant to statistical analysis.\n\nThese are the primary subjects of the book, as inferred from the provided context.',
 'Based on the provided context, it appears that you are referring to a discussion of statistical methods for evaluating data distributions.\n\nIn this context, "evaluating distributions" likely refers to understanding and analyzing the character
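
Conceptually, batching amounts to running the same call over several inputs and returning the results in input order. The sketch below does this with a thread pool, which is an assumption made for the illustration and not a statement about how `batch` is implemented; `fake_chain` and `run_batch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_chain(payload):
    """Hypothetical stand-in for chain.invoke."""
    return f"answer to: {payload['question']}"


def run_batch(func, inputs, max_workers=4):
    """Run func over inputs concurrently; results keep the order of the inputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, inputs))


demo_inputs = [{"question": q} for q in ["Who wrote it?", "What is it about?"]]
print(run_batch(fake_chain, demo_inputs))
```

With a single local model the calls may still be served one after the other, so the gain depends on the backend.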

Next time we'll use a Pinecone index to have a persistent and better organized retrieval and embedding search method. We'll also experiment in different ways:
 - implementing function calling, such as web search;
 - orchestrating an agent workflow with LangGraph to:
   - route prompts to different retrieval methods (document/RAG retrieval or web search, for example),
   - fall back to another mechanism when the retrieved context is irrelevant,
   - evaluate responses with a reflection system or a hallucination metric;
 - integrating this functionality in Streamlit, plugins or other front ends;
 - trying another vector database that is self-hosted (Qdrant or Milvus), or different data structures like graphs or entity databases.
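
The routing and fallback ideas in the list above can be sketched in plain Python before reaching for LangGraph. Everything here is hypothetical: the `route` function, the two fake search functions, the relevance scores and the `min_score` threshold are made up for the illustration.

```python
def route(question, retrievers, min_score=0.5):
    """Try each retriever in order; use the first whose best hit is relevant enough."""
    for name, retrieve in retrievers:
        hits = retrieve(question)  # each hit is (text, relevance score between 0 and 1)
        if hits and max(score for _, score in hits) >= min_score:
            return name, [text for text, _ in hits]
    return "no_context", []


def fake_document_search(question):
    """Hypothetical document retrieval that finds nothing relevant."""
    return [("unrelated passage", 0.2)]


def fake_web_search(question):
    """Hypothetical web search that finds a relevant result."""
    return [("fresh web result", 0.8)]


source, context = route(
    "How much does the book cost?",
    [("documents", fake_document_search), ("web", fake_web_search)],
)
print(source, context)  # web ['fresh web result']
```

When no retriever clears the threshold, the `no_context` result is the point where the chain could answer 'I don't know' instead of improvising.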