<b><h1>RAG Model</h1></b>

In [1]:
import os
from langsmith.wrappers import wrap_openai
from langsmith import traceable
from langsmith import Client
from pathlib import Path
from dotenv import load_dotenv,find_dotenv

load_dotenv(find_dotenv()) 
os.environ["LANGCHAIN_API_KEY"]=str(os.getenv("LANGCHAIN_API_KEY"))
os.environ["LANGCHAIN_TRACING_V2"]="true"
os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"
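A minimal sketch of a fail-fast check for the settings above, assuming the keys come from the environment or a `.env` file. `require_env` is a hypothetical helper, not part of LangSmith or python-dotenv; the point is that a missing key is reported at setup time rather than later as an authentication error.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo with a placeholder variable; in the notebook the real keys would come from .env.
os.environ.setdefault("DEMO_API_KEY", "placeholder-value")
print(require_env("DEMO_API_KEY"))
```

In this notebook the same check could be applied to `LANGCHAIN_API_KEY` and `MISTRAL_API_KEY` before any client is created.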

In [2]:
client=Client()

Loading PDF from sample directory

In [3]:
from langchain_community.document_loaders import PyPDFLoader,DirectoryLoader
directory_path = "sample"

loader = DirectoryLoader(directory_path, glob="*.pdf", loader_cls=PyPDFLoader)
data = loader.load()

In [4]:
len(data)

98

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split data
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(data)


print("Total number of documents: ",len(docs))

Total number of documents:  348
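To illustrate why 98 pages become 348 chunks, here is a simplified sketch of size-limited splitting with overlap. It is not the `RecursiveCharacterTextSplitter` algorithm, which first tries to split on separators such as paragraph and line breaks; `split_with_overlap` is a hypothetical helper that only cuts fixed character windows.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Cut text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text
    return chunks


sample = "Salt, also known as sodium chloride, is about 40% sodium and 60% chloride."
for chunk in split_with_overlap(sample, chunk_size=30, chunk_overlap=10):
    print(repr(chunk))
```

Each chunk repeats the last 10 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.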


In [6]:
docs[0]

Document(metadata={'producer': 'Skia/PDF m133', 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36 Edg/133.0.0.0', 'creationdate': '2025-02-19T08:37:02+00:00', 'title': 'Salt and Sodium - The Nutrition Source', 'moddate': '2025-02-19T08:37:02+00:00', 'source': 'sample\\Salt and Sodium - The Nutrition Source.pdf', 'total_pages': 10, 'page': 0, 'page_label': '1'}, page_content='The Nutrition Source > Salt and Sodium\nSalt and Sodium\nSalt, also known as sodium chloride, is about 40% sodium and 60% chloride. It\nflavors food and is used as a binder and stabilizer. It is also a food preservative,\nas bacteria can’t thrive in the presence of a high amount of salt. The human body\n\uf409\nTHE NUTRITION SOURCE\n \uf431')

Creating embeddings

In [7]:
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vector = embeddings.embed_query("Hello World")
vector[:5]
# sample vector

[0.04656680300831795,
 -0.0376756377518177,
 -0.0274836253374815,
 -0.02519204653799534,
 0.023942284286022186]

In [8]:
vectorstore = Chroma.from_documents(documents=docs, embedding=GoogleGenerativeAIEmbeddings(model="models/embedding-001"))

Retrieving Top 10 similar documents

In [9]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10})
retrieved_docs = retriever.invoke("What are the types of salts?")

In [10]:
len(retrieved_docs)

10
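A rough sketch of what a top-k similarity search does, assuming cosine similarity over embedding vectors. The metric Chroma actually applies depends on how the collection is configured, and the 2-D vectors here are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy vectors standing in for real embeddings (hypothetical values).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], toy_vectors, k=2))
```

With `k=10`, as in the retriever above, the ten highest-scoring chunks are returned regardless of how relevant the tenth one is, which is why lower-ranked results can drift off topic.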

In [11]:
print(retrieved_docs[5].page_content)

requires a small amount of sodium to conduct nerve impulses, contract and
relax muscles, and maintain the proper balance of water and minerals. It is
estimated that we need about 500 mg of sodium daily for these vital functions.
But too much sodium in the diet can lead to high blood pressure, heart disease,
and stroke. It can also cause calcium losses, some of which may be pulled from
bone. Most Americans consume at least 1.5 teaspoons of salt per day, or about
3400 mg of sodium, which contains far more than our bodies need.
Recommended Amounts
The U.S. Dietary Reference Intakes state that there is not enough evidence to
establish a Recommended Dietary Allowance or a toxic level for sodium (aside
from chronic disease risk). Because of this, a Tolerable Upper intake Level (UL)
has not been established; a UL is the maximum daily intake unlikely to cause
harmful effects on health. 
Guidelines for Adequate Intakes (AI) of sodium were established based on the


LLM - Gemini model

In [12]:
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro",temperature=0.3, max_tokens=500)

In [13]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [14]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

Sample RAG model outputs

In [15]:
response = rag_chain.invoke({"input": "list only the types of salt"})
print(response["answer"])

Iodized table salt, Kosher salt, Diamond Crystal kosher salt, sea salt, pink (Himalayan) salt, black salt, fleur de sel, and potassium salt (salt substitute).


In [16]:
response = rag_chain.invoke({"input": "What is RAG?"})
print(response["answer"])

This document does not contain the answer to what RAG is.  It discusses dietary recommendations for various nutrients, including zinc, and its relationship with age-related macular degeneration.  It also mentions different intake levels like RDA, AI, EAR, and UL.


<H1><B>Dataset: Manually Curated</B></H1>

Created a manually curated dataset of question-answer (input, output) pairs based on a blog post about vitamins and minerals:

In [None]:
from langsmith import Client

# Define your QA pairs
inputs = [
    "What are water-soluble vitamins?",
    "What is the source of Vitamin E?",
    "What is the RDA of Vitamin B2?",
    "List the fat soluble vitamins",
    "What are the sources of Zinc?",
]

outputs = [
    "These include the eight B-complex vitamins (B1, B2, B3, B5, B6, B7, B9, B12) and vitamin C. Water-soluble vitamins are not stored in large amounts and need to be replenished regularly through your diet, as excess amounts are excreted through urine.",
    "Nuts, seeds, spinach, and vegetable oils",
    "1.3 mg/day (men), 1.1 mg/day (women)",
    "Vitamins A, D, E, and K",
    "Meat, shellfish, legumes, seeds",
]

# Create QA pairs
qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]

# Initialize Langsmith client
client = Client()

# Define dataset parameters
dataset_name = "Vit_Min_Dataset"
dataset_description = "QA pairs about Vitamins and Minerals."

# Create the dataset
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description=dataset_description,
)

# Add examples to the dataset
client.create_examples(
    inputs=[{"question": q} for q in inputs],
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)
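Because the questions and reference answers are kept in two parallel lists, a missing entry in either list would misalign the dataset. A small sketch of a length check before uploading; `build_examples` is a hypothetical helper, not a LangSmith function.

```python
def build_examples(questions, answers):
    """Pair questions with reference answers, refusing lists of different lengths."""
    if len(questions) != len(answers):
        raise ValueError(
            f"{len(questions)} questions but {len(answers)} answers; lists must align"
        )
    example_inputs = [{"question": q} for q in questions]
    example_outputs = [{"answer": a} for a in answers]
    return example_inputs, example_outputs


example_inputs, example_outputs = build_examples(
    ["What is the source of Vitamin E?"],
    ["Nuts, seeds, spinach, and vegetable oils"],
)
print(example_inputs, example_outputs)
```

The two returned lists have the shape that the `inputs=` and `outputs=` arguments of `client.create_examples` receive in the cell above.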


In [17]:
# import image module 
from IPython.display import Image 
  
# get the image 
Image(url="1.png", width=1000, height=500) 

<B><H1>Dataset: From User Logs</H1></B>

Save user logs as a dataset for future testing.

Load the source text from a website:

In [18]:
import requests
from bs4 import BeautifulSoup

url = "https://drhyman.com/blogs/content/supplements-101-essential-vitamins-and-minerals#:~:text=Learn%20why%20the%20RDA%20doesn%E2%80%99t%20provide%20optimal%20intake,Dr.%20Mark%20Hyman%E2%80%99s%20patient-proven%20recommendations%20for%20peak%20health."
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
text = [p.text for p in soup.find_all("p")]
full_text = "\n".join(text)

Q&A function - using LLM

In [19]:
def answer_question(inputs: dict) -> dict:
    """
    Generates answers to user questions based on a provided website text using Gemini API.

    Parameters:
    inputs (dict): A dictionary with a single key 'question', representing the user's question as a string.

    Returns:
    dict: A dictionary with a single key 'answer', containing the generated answer as a string.
    """

    # System prompt
    system_msg = (
        f"Answer user questions in 2-3 sentences about this context: \n\n\n {full_text}"
    )

    # Pass in website text
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": inputs["question"]},
    ]

    llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro") 

    response = llm.invoke(messages)

    # Response in output dict
    return {"answer": response.content}

Sample Outputs:

In [21]:
answer_question(
    {
        "question": "What is the source of Magnesium"
    }
)

{'answer': "Magnesium can be found in foods like pumpkin seeds, almonds, spinach, and dark chocolate.  It's also available as a supplement in different forms, including citrate, glycinate, and L-threonate, each with specific benefits."}

In [20]:
answer_question(
    {
        "question": "What is the MRI for Vitamin-A"
    }
)

{'answer': "The Mark's Recommended Intake (MRI) for Vitamin A is 2000-3000 mcg/day. This contrasts with the Recommended Dietary Allowance (RDA) of 900 mcg/day for men and 700 mcg/day for women, which are lower amounts aimed at preventing deficiency rather than optimizing health."}

In [None]:
answer_question(
    {
        "question": "What is the recommended dietary allowance of Iron for men"
    }
)

In [None]:
answer_question(
    {
        "question": "What is Fat-soluble vitamins?"
    }
)

<B><H1>LLM-as-Judge: Built-in evaluator</H1></B>

In [21]:
Image(url="2.png", width=1000, height=500) 

In [None]:
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Evaluators
qa_evaluator = [LangChainStringEvaluator("cot_qa")]
dataset_name = "Vit_Min_Dataset"

experiment_results = evaluate(
    answer_question,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-VitMindataset-qa",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "stuff website context into gemini",
    },
)
# Not run: the built-in "cot_qa" evaluator uses OpenAI as its grading LLM, which requires an OpenAI API key.

<b><H1>Custom evaluator - Heuristics</H1></b>

In [22]:
Image(url="3.png", width=1000, height=500) 

In [23]:
from langsmith.schemas import Run, Example
from langsmith import evaluate

def is_answered(run: Run, example: Example) -> dict:
    # Get outputs
    student_answer = run.outputs.get("answer")

    # Check if the student_answer is an empty string
    if not student_answer:
        return {"key": "is_answered", "score": 0}
    else:
        return {"key": "is_answered", "score": 1}


# Evaluators
qa_evaluator = [is_answered]
dataset_name = "Vit_Min_Dataset"

# Run
experiment_results = evaluate(
    answer_question,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-vitmin-qa-custom-eval-is-answered",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "stuff website context into gemini",
    },
)

View the evaluation results for experiment: 'test-vitmin-qa-custom-eval-is-answered-e2ccd1f3' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/b5ef57fa-3d8f-49f7-b239-0e90a25f508c/compare?selectedSessions=23a1b317-2d22-4770-aace-b5cbe447574e





Used a simple rule-based function that returns a score of 1 if the LLM gives a response and 0 if it gives none.<br>
Output displayed in LangSmith:

In [24]:
Image(url="4.png", width=1000, height=500) 
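The rule-based check can also be exercised offline, without a LangSmith run. The sketch below is a slightly stricter variant that treats whitespace-only answers as unanswered; `is_answered_strict` is a hypothetical name, and `SimpleNamespace` objects stand in for LangSmith `Run` objects purely for local testing.

```python
from types import SimpleNamespace


def is_answered_strict(run, example=None) -> dict:
    """Score 1 when the run produced a non-blank string answer, otherwise 0."""
    answer = (run.outputs or {}).get("answer")
    answered = isinstance(answer, str) and bool(answer.strip())
    return {"key": "is_answered", "score": int(answered)}


# Stand-in run objects: a real answer, a blank answer, and a run with no outputs at all.
print(is_answered_strict(SimpleNamespace(outputs={"answer": "Nuts and seeds"})))
print(is_answered_strict(SimpleNamespace(outputs={"answer": "   "})))
print(is_answered_strict(SimpleNamespace(outputs=None)))
```

A heuristic like this only confirms that something was returned; it says nothing about whether the answer is correct, which is what the LLM-based evaluators are for.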

<B><H1>Custom evaluator - LLM as a judge</H1></B>

In [25]:
Image(url="5.png", width=1000, height=500) 

In [26]:
import re

def compare_semantic_similarity(run: Run, example: Example) -> dict:
    input_question = example.inputs.get("question")
    reference_response = example.outputs.get("answer")
    run_response = run.outputs.get("answer")

    llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0.0) # Using a low temperature for determinism.

    prompt = f"""
    You are a semantic similarity evaluator. Compare the meanings of two responses to a question, 
    Reference Response and Run Response, where the reference is the correct answer, and we are trying to judge if the run response is similar. 
    Provide a score between 1 and 10, where 1 means completely unrelated, and 10 means identical in meaning.

    Question: {input_question}
    Reference Response: {reference_response}
    Run Response: {run_response}

    Please provide your response as an integer value between 1 and 10, where 1 means unrelated and 10 means identical.
    """
    similarity_score = llm.invoke(prompt)
    # The model may wrap the number in extra text, so extract the first integer from 1 to 10
    match = re.search(r"\b(10|[1-9])\b", similarity_score.content)
    return {"key": "semantic_similarity", "score": int(match.group(1)) if match else 0}

In [27]:
# Evaluators
qa_evalulator1 = [compare_semantic_similarity]
dataset_name = "Vit_Min_Dataset"

# Run
experiment_results = evaluate(
    answer_question,
    data=dataset_name,
    evaluators=qa_evalulator1,
    experiment_prefix="test-vitmin-qa-custom-eval-simantic-similarity",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "stuff website context into gemini",
    },
)

View the evaluation results for experiment: 'test-vitmin-qa-custom-eval-simantic-similarity-673a2fd2' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/b5ef57fa-3d8f-49f7-b239-0e90a25f508c/compare?selectedSessions=42fab29c-2698-4531-8b45-5a290d9d5ae7





Retrying langchain_google_genai.chat_models._chat_with_retry.<locals>._chat_with_retry in 2.0 seconds as it raised ResourceExhausted: 429 Resource has been exhausted (e.g. check quota)..
Error running evaluator <DynamicRunEvaluator compare_semantic_similarity> on run 7ffe627e-e3fb-4879-98a6-26db1051c747: ResourceExhausted('Resource has been exhausted (e.g. check quota).')
Traceback (most recent call last):
  File "C:\Users\ragamira.shankar\AppData\Local\anaconda3\envs\env_langchain1\lib\site-packages\langsmith\evaluation\_runner.py", line 1634, in _run_evaluators
    evaluator_response = evaluator.evaluate_run(
  File "C:\Users\ragamira.shankar\AppData\Local\anaconda3\envs\env_langchain1\lib\site-packages\langsmith\evaluation\evaluator.py", line 331, in evaluate_run
    result = self.func(
  File "C:\Users\ragamira.shankar\AppData\Local\anaconda3\envs\env_langchain1\lib\site-packages\langsmith\run_helpers.py", line 629, in wrapper
    raise e
  File "C:\Users\ragamira.shankar\AppData\L
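The `429 Resource has been exhausted` error in the log above is a quota limit on the judge model. One common mitigation is to retry with exponential backoff; the sketch below is a generic pattern under that assumption. `call_with_backoff` is a hypothetical helper, and which exception types are worth retrying is left as a parameter rather than tied to a specific library class.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (plus jitter) and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


# Demo with a fake call that fails twice before succeeding; sleep is stubbed so it runs instantly.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429 Resource has been exhausted")
    return "ok"


print(call_with_backoff(flaky_call, sleep=lambda seconds: None))
```

Inside an evaluator, the wrapped call would be something like `lambda: llm.invoke(prompt)`. Lowering the evaluation concurrency is another option when the quota is per minute.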

In [28]:
Image(url="6.png", width=1000, height=500) 

The evaluator scores the semantic similarity between the ground-truth answer and the LLM's output.

<b><h1> Comparison</h1></b>

In [None]:
!pip install -qU mistralai

In [29]:
from mistralai import Mistral
import re
def compare_semantic_similarity_mistral(run: Run, example: Example) -> dict:
    input_question = example.inputs.get("question")
    reference_response = example.outputs.get("answer")
    run_response = run.outputs.get("answer")

    model="mistral-large-latest"
    mistral_api_key = os.environ.get("MISTRAL_API_KEY")
    client = Mistral(api_key=mistral_api_key)
    

    prompt = f"""
    You are a semantic similarity evaluator. Compare the meanings of two responses to a question, 
    Reference Response and Run Response, where the reference is the correct answer, and we are trying to judge if the run response is similar. 
    Provide a score between 1 and 10, where 1 means completely unrelated, and 10 means identical in meaning.

    Question: {input_question}
    Reference Response: {reference_response}
    Run Response: {run_response}

    Please provide your response as an integer value between 1 and 10, where 1 means unrelated and 10 means identical.
    """
    similarity_score_m = client.chat.complete(model=model,
                                     messages = [
                                        {
                                            "role":"user",
                                            "content":prompt,
                                        },
                                     ]
                                    )
    response_text = similarity_score_m.choices[0].message.content
    match = re.search(r'\b(10|[1-9])\b', response_text)  # Find a number between 1 and 10

    if match:
        score = int(match.group(1))  # Convert the matched string to an integer
        return {"key": "semantic_similarity", "score": score}
    # Return 0 if no score is found
    return {"key": "semantic_similarity", "score": 0}

In [30]:
# Evaluators
qa_evalulator1 = [compare_semantic_similarity_mistral]
dataset_name = "Vit_Min_Dataset"

# Run
experiment_results = evaluate(
    answer_question,
    data=dataset_name,
    evaluators=qa_evalulator1,
    experiment_prefix="test-vitmin-qa-custom-eval-simantic-similarity-mistral",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "stuff website context into gemini, judged by mistral",
    },
)

View the evaluation results for experiment: 'test-vitmin-qa-custom-eval-simantic-similarity-mistral-e4b28c0a' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/b5ef57fa-3d8f-49f7-b239-0e90a25f508c/compare?selectedSessions=76bfa80e-3948-4124-b874-99ba9d860eda





In [31]:
Image(url="7.png", width=1000, height=500) 

<b><h1> Experiment on datasets from the prompt playground</h1></b>

In [32]:
inputs = [
    {
        "question": "Sea moss",
        "doc_txt": "What potential health risk is associated with excessive iodine intake from sea moss?",
    },
    {"question": "hallucinations", "doc_txt": "What are the primary natural food sources of thiamin?"},
    {
        "question": "zinc deficiency",
        "doc_txt": "What type of surgery or medical conditions can lead to zinc deficiency",
    },
]

outputs = ["yes", "no", "yes"]

In [None]:
from langsmith import Client

client = Client()
dataset_name = "Relevance_grade"

# Store
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Testing relevance grading.",
)
client.create_examples(
    inputs=inputs,
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)

Tested Gemini and Mistral, including different models of each, on this dataset and compared their performance.

In [33]:
Image(url="8.png", width=1000, height=500) 

<b><h1>Summary Evaluators</h1></b>

In [34]:
from langchain_google_genai import ChatGoogleGenerativeAI
import os

def predict_gemini(inputs: dict) -> dict:
    question=inputs["question"]
    document=inputs["doc_txt"]
    # Gemini Model
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash",temperature=0.3, max_tokens=500)

    # Prompt
    prompt = f"""You are a grader assessing relevance of a retrieved document to a user question. 
        It does not need to be a stringent test. The goal is to filter out erroneous retrievals. 
        If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. 
        Give a binary score, 'yes' or 'no', to indicate whether the document is relevant to the question.

        Question: {question}
        Retrieved document: {document}
        """
    score = llm.invoke(prompt)
    return {"grade": score.content}

In [35]:
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser

def predict_mistral(inputs: dict) -> dict:
    
    model="mistral-large-latest"
    mistral_api_key = os.environ.get("MISTRAL_API_KEY")
    client = Mistral(api_key=mistral_api_key)

    prompt_template = PromptTemplate(
    template="""You are a grader assessing relevance of a retrieved document to a user question. \n 
    Here is the retrieved document: \n\n {document} \n\n
    Here is the user question: {question} \n
    If the document contains keywords related to the user question, grade it as relevant. \n
    It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
    Give a binary score, 'yes' or 'no', to indicate whether the document is relevant to the question. \n
    Provide the binary score with no preamble or explanation.""",
    input_variables=["question", "document"],
    )

    prompt = prompt_template.format(
        question=inputs["question"], document=inputs["doc_txt"]
    )

    llm = client.chat.complete(model=model,
                                     messages = [
                                        {
                                            "role":"user",
                                            "content":prompt,
                                        },
                                     ]
                                    )

    grade = llm.choices[0].message.content
    return {"grade": grade}

In [36]:
from typing import List
from langsmith.schemas import Example, Run
from langsmith.evaluation import evaluate


def f1_score_summary_evaluator(runs: List[Run], examples: List[Example]) -> dict:
    """
    Evaluates the F1 score for a list of runs against a set of examples.

    The function iterates through paired runs and examples, comparing the output
    of each run (`run.outputs["grade"]`) with the expected output in the example
    (`example.outputs["answer"]`). It calculates the true positives, false positives,
    and false negatives based on these comparisons to compute the F1 score of the predictions.

    Parameters:
    - runs (List[Run]): A list of run objects, where each run contains an output that is a prediction.
    - examples (List[Example]): A list of example objects, where each example contains an output that is the expected answer.

    Returns:
    - dict: A dictionary with "key" set to "f1_score" and "score" set to the computed F1 score.
    """

    # Default values
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    # Iterate through samples
    for run, example in zip(runs, examples):
        # Grades are 'yes'/'no' strings, which are both truthy, so map them to booleans first
        reference = example.outputs["answer"].strip().lower() == "yes"
        prediction = run.outputs["grade"].strip().lower().startswith("yes")
        if reference and prediction:
            true_positives += 1
        elif prediction and not reference:
            false_positives += 1
        elif not prediction and reference:
            false_negatives += 1
    if true_positives == 0:
        return {"key": "f1_score", "score": 0.0}

    # Compute F1 score
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return {"key": "f1_score", "score": f1_score}
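A worked check of the F1 arithmetic on plain label lists, independent of LangSmith. It treats 'yes' as the positive class and normalizes case, which matters because the Mistral run in this notebook returns "No" while the reference is "no". `normalize_label` and `f1_from_labels` are hypothetical helpers for illustration.

```python
def normalize_label(text: str) -> bool:
    """Map a free-text grade to True/False (True only if it starts with 'yes')."""
    return text.strip().lower().startswith("yes")


def f1_from_labels(predictions, references) -> float:
    """F1 score with 'yes' as the positive class; 0.0 when there are no true positives."""
    preds = [normalize_label(p) for p in predictions]
    refs = [normalize_label(r) for r in references]
    tp = sum(p and r for p, r in zip(preds, refs))
    fp = sum(p and not r for p, r in zip(preds, refs))
    fn = sum(not p and r for p, r in zip(preds, refs))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Grades from the Mistral run and the references of the Relevance_grade dataset.
print(f1_from_labels(["yes", "No", "yes"], ["yes", "no", "yes"]))
```

With two true positives and no false positives or false negatives, precision and recall are both 1, so the F1 score is 1.0. The "no" example is a true negative and does not enter the formula.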

In [37]:
evaluate(
    predict_mistral,
    data="Relevance_grade",
    summary_evaluators=[f1_score_summary_evaluator],
    experiment_prefix="test-score-mistral",
    # Any experiment metadata can be specified here
    metadata={
        "model": "mistral",
    },
)

View the evaluation results for experiment: 'test-score-mistral-5f6873bf' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/97e32f7e-ad98-4f6b-9d21-fa5c425aa83e/compare?selectedSessions=88145053-32bf-4490-8e2b-345a39da4c7b





Unnamed: 0,inputs.doc_txt,inputs.question,outputs.grade,error,reference.answer,execution_time,example_id,id
0,What type of surgery or medical conditions can...,zinc deficiency,yes,,yes,7.32798,e05d8031-72ba-4aa7-ae50-fd04d1d540c2,531cec15-8e8e-40a8-a038-c81359aead7e
1,What are the primary natural food sources of t...,hallucinations,No,,no,4.662165,7ef36323-2cb2-44b1-830b-c60de837cfb2,a310f261-0d58-48e3-96b5-691e627b3591
2,What potential health risk is associated with ...,Sea moss,yes,,yes,4.565858,fb510641-52ba-4dd2-9461-f727848ddf9a,945d8e26-1359-4987-8375-521a30f2140b


In [38]:
evaluate(
    predict_gemini,
    data="Relevance_grade",
    summary_evaluators=[f1_score_summary_evaluator],
    experiment_prefix="test-score-gemini",
    # Any experiment metadata can be specified here
    metadata={
        "model": "gemini",
    },
)

View the evaluation results for experiment: 'test-score-gemini-7204bd0f' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/97e32f7e-ad98-4f6b-9d21-fa5c425aa83e/compare?selectedSessions=6e3c587c-5d56-4975-a218-5a8c36941a79





Unnamed: 0,inputs.doc_txt,inputs.question,outputs.grade,error,reference.answer,execution_time,example_id,id
0,What type of surgery or medical conditions can...,zinc deficiency,yes,,yes,0.747617,e05d8031-72ba-4aa7-ae50-fd04d1d540c2,efdaa028-5334-4281-80a6-e93138ab72ba
1,What are the primary natural food sources of t...,hallucinations,no,,no,1.806177,7ef36323-2cb2-44b1-830b-c60de837cfb2,21a4b319-a3ce-459a-912f-ad8d5fbb4090
2,What potential health risk is associated with ...,Sea moss,yes,,yes,1.862178,fb510641-52ba-4dd2-9461-f727848ddf9a,f4a86df0-7adf-4825-8e9f-c94c7c38c725


In [39]:
Image(url="9.png", width=1000, height=500) 

<b><h1>RAG Application evaluation - Answer Hallucination</h1></b>

In [40]:
### INDEX

from bs4 import BeautifulSoup as Soup
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load
url = "https://python.langchain.com/v0.1/docs/expression_language/"
loader = RecursiveUrlLoader(
    url=url, max_depth=20, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Embed
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Index the split chunks (not the unsplit pages) so that retrieval returns chunk-sized contexts
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)

# Default similarity retriever
retriever = vectorstore.as_retriever()

In [43]:
from langsmith import traceable
from mistralai import Mistral
class RagBot:
    def __init__(self, retriever, model: str = "mistral-medium"):
        self._retriever = retriever
        self._model = model

    @traceable()
    def retrieve_docs(self, question):
        return self._retriever.invoke(question)

    @traceable()
    def get_answer(self, question: str):
        similar = self.retrieve_docs(question)
        similar_str = "\n\n".join([str(doc) for doc in similar]) #convert similar documents to a single string.
        mistral_api_key = os.environ.get("MISTRAL_API_KEY")
        client = Mistral(api_key=mistral_api_key)
        
        prompt = f"""You are a helpful AI code assistant with expertise in LCEL. Use the following docs to produce a concise code solution to the user question.\n 
        Docs: {similar_str}

        Question: {question} """
        
        response = client.chat.complete(model=self._model,
                                     messages = [
                                        {
                                            "role":"user",
                                            "content":prompt,
                                        },
                                     ])
        

        return {
            "answer": response.choices[0].message.content,
            "contexts": [str(doc) for doc in similar],
        }

rag_bot = RagBot(retriever)

In [42]:
response = rag_bot.get_answer("What is LCEL?")
response["answer"]

'To create a simple LCEL chain that takes a topic and generates a joke using the OpenAI model, you can use the following code:\n```python\nfrom langchain_openai import ChatOpenAI\nfrom langchain_core.output_parsers import StrOutputParser\nfrom langchain_core.prompts import ChatPromptTemplate\n\n# Create the model\nmodel = ChatOpenAI(model="gpt-4")\n\n# Create the prompt template\ntemplate = "tell me a short joke about {topic}"\nprompt = ChatPromptTemplate.from_template(template)\n\n# Create the output parser\noutput_parser = StrOutputParser()\n\n# Create the chain\nchain = prompt | model | output_parser\n\n# Invoke the chain with a topic\noutput = chain.invoke({"topic": "ice cream"})\nprint(output)\n```\nThis code creates an LCEL chain with three components: a prompt template, a model, and an output parser. The `|` symbol is used to chain these components together. When the chain is invoked with a topic, the prompt template generates a prompt using the topic, the model generates a resp

In [44]:
def predict_rag_answer_with_context(example: dict):
    """Use this for evaluation of retrieved documents and hallucinations"""
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"], "contexts": response["contexts"]}
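The evaluators that follow rely on the prediction function returning both an `answer` and the list of retrieved `contexts`. A small offline sketch of that contract, using a stub retriever and a stub LLM so that no API key is needed; `StubRetriever`, `stub_llm`, and `answer_with_contexts` are hypothetical names for illustration only.

```python
class StubRetriever:
    """Stand-in for a vector-store retriever; returns the first k canned documents."""

    def __init__(self, documents, k=2):
        self._documents = documents
        self._k = k

    def invoke(self, question):
        return self._documents[: self._k]


def stub_llm(prompt: str) -> str:
    """Stand-in for a chat model call; echoes how much prompt text it received."""
    return f"(stub answer based on {len(prompt)} prompt characters)"


def answer_with_contexts(question, retriever, llm):
    """Retrieve contexts, build a prompt containing docs and question, and return both outputs."""
    contexts = [str(doc) for doc in retriever.invoke(question)]
    prompt = "Docs:\n" + "\n\n".join(contexts) + f"\n\nQuestion: {question}"
    return {"answer": llm(prompt), "contexts": contexts}


result = answer_with_contexts("What is LCEL?", StubRetriever(["doc a", "doc b", "doc c"]), stub_llm)
print(result)
```

Checking the output shape this way makes it easier to tell a wiring mistake (a missing key) apart from a genuinely low evaluator score.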

In [None]:
from langsmith import Client

inputs = [
    "How can I directly pass a string to a runnable and use it to construct the input needed for my prompt?",
    "How can I make the output of my LCEL chain a string?",
    "How can I apply a custom function to one of the inputs of an LCEL chain?",
]

outputs = [
    "Use RunnablePassthrough. from langchain_core.runnables import RunnableParallel, RunnablePassthrough; from langchain_core.prompts import ChatPromptTemplate; from langchain_openai import ChatOpenAI; prompt = ChatPromptTemplate.from_template('Tell a joke about: {input}'); model = ChatOpenAI(); runnable = ({'input' : RunnablePassthrough()} | prompt | model); runnable.invoke('flowers')",
    "Use StrOutputParser. from langchain_openai import ChatOpenAI; from langchain_core.prompts import ChatPromptTemplate; from langchain_core.output_parsers import StrOutputParser; prompt = ChatPromptTemplate.from_template('Tell me a short joke about {topic}'); model = ChatOpenAI(model='gpt-3.5-turbo') #gpt-4 or other LLMs can be used here; output_parser = StrOutputParser(); chain = prompt | model | output_parser",
    "Use RunnableLambda with itemgetter to extract the relevant key. from operator import itemgetter; from langchain_core.prompts import ChatPromptTemplate; from langchain_core.runnables import RunnableLambda; from langchain_openai import ChatOpenAI; def length_function(text): return len(text); chain = ({'prompt_input': itemgetter('foo') | RunnableLambda(length_function),} | prompt | model); chain.invoke({'foo':'hello world'})",
]

qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]

# Create dataset
client = Client()
dataset_name = "RAG_test_LCEL"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="QA pairs about LCEL.",
)
client.create_examples(
    inputs=[{"question": q} for q in inputs],
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)

In [46]:
from langchain import hub

grade_prompt_hallucinations = hub.pull("langchain-ai/rag-answer-hallucination")

def answer_hallucination_evaluator(run, example) -> dict:
    """
    A simple evaluator for generation hallucination
    """
    
    # RAG inputs
    input_question = example.inputs["question"]
    contexts = run.outputs["contexts"]
        
    # RAG answer 
    prediction = run.outputs["answer"]

    # LLM grader
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash",temperature=0.3, max_tokens=500)

    # Structured prompt
    answer_grader = grade_prompt_hallucinations | llm

    # Get score
    score = answer_grader.invoke({"documents": contexts,
                                  "student_answer": prediction})
    final_score = score[0]["args"]["Score"]
    # print(score)

    return {"key": "answer_hallucination", "score": final_score}

In [47]:
dataset_name = "RAG_test_LCEL"
experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators=[answer_hallucination_evaluator],
    experiment_prefix="rag-qa-gemini-hallucination",
    metadata={
        "variant": "LCEL context, gemini-1.5-flash",
    },
)

View the evaluation results for experiment: 'rag-qa-gemini-hallucination-945c1fda' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/5a65df5b-e73a-420e-8939-2b0cacc52590/compare?selectedSessions=20efb0bc-223e-4296-b453-edde3260ea84





Key 'parameters' is not supported in schema, ignoring
Key 'parameters' is not supported in schema, ignoring
Key 'parameters' is not supported in schema, ignoring


In [48]:
Image(url="10.png", width=1000, height=500) 

<b><h1>RAG Application evaluation - Document Relevance to Question</h1></b>

In [49]:
# Grade prompt 
grade_prompt_doc_relevance = hub.pull("langchain-ai/rag-document-relevance")

def docs_relevance_evaluator(run, example) -> dict:
    """
    A simple evaluator for document relevance
    """
    
    # RAG inputs
    input_question = example.inputs["question"]
    contexts = run.outputs["contexts"]
        
    # RAG answer 
    prediction = run.outputs["answer"]

    # LLM grader
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash",temperature=0.3, max_tokens=500)

    # Structured prompt
    answer_grader = grade_prompt_doc_relevance | llm

    # Get score
    score = answer_grader.invoke({"question":input_question,
                                  "documents":contexts})
    final_score = score[0]["args"]["Score"]

    return {"key": "document_relevance", "score": final_score}

In [50]:
dataset_name = "RAG_test_LCEL"
experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators=[docs_relevance_evaluator],
    experiment_prefix="rag-qa-gemini-doc-relevance",
    metadata={
        "variant": "LCEL context, gemini-1.5-flash",
    },
)

View the evaluation results for experiment: 'rag-qa-gemini-doc-relevance-04b36138' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/5a65df5b-e73a-420e-8939-2b0cacc52590/compare?selectedSessions=6eb69359-333b-4f0a-bd22-fce12cb95094




0it [00:00, ?it/s]

Key 'parameters' is not supported in schema, ignoring


In [51]:
Image(url="11.png", width=1000, height=500) 

<b><h1>RAG Application Evaluator - Reference Answer

In [52]:
def predict_rag_answer(example: dict):
    """Use this for answer evaluation"""
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"]}

In [53]:
from langchain import hub
# Grade prompt 
grade_prompt_answer_accuracy = prompt = hub.pull("langchain-ai/rag-answer-vs-reference")

def answer_evaluator(run, example) -> dict:
    """
    A simple evaluator for RAG answer accuracy
    """
    
    # Get question, reference answer, and prediction
    input_question = example.inputs["question"]
    reference = example.outputs["answer"]
    prediction = run.outputs["answer"]

    # LLM grader
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash",temperature=0.3, max_tokens=500)

    # Structured prompt
    
    answer_grader = grade_prompt_answer_accuracy | llm

    # Get score
    score = answer_grader.invoke({"question": input_question,
                                  "correct_answer": reference,
                                  "student_answer": prediction})
    final_score = score[0]["args"]["Score"]

    return {"key": "answer_score", "score": final_score}

In [155]:
from langsmith.evaluation import evaluate

dataset_name = "RAG_test_LCEL"
experiment_results = evaluate(
    predict_rag_answer,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix="rag-qa-gemini",
    metadata={"variant": "LCEL context, gemini-1.5-flash"},
)

View the evaluation results for experiment: 'rag-qa-gemini-18ecd6f7' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/5a65df5b-e73a-420e-8939-2b0cacc52590/compare?selectedSessions=809420d5-7bfa-4acd-8079-f6362397b536




0it [00:00, ?it/s]

Key 'parameters' is not supported in schema, ignoring


In [54]:
Image(url="12.png", width=1000, height=500) 

<B><H1>Regression testing

In [157]:
from langsmith import Client

# QA
inputs = [
    "My LCEL map contains the key 'question'. What is the difference between using itemgetter('question'), lambda x: x['question'], and x.get('question')?",
    "How can I make the output of my LCEL chain a string?",
    "How can I run two LCEL chains in parallel and write their output to a map?",
]

outputs = [
    "Itemgetter can be used as shorthand to extract specific keys from the map. In the context of a map operation, the lambda function is applied to each element in the input map and the function returns the value associated with the key 'question'. (get) is safer for accessing values in a dictionary because it handles the case where the key might not exist.",
    "Use StrOutputParser. from langchain_openai import ChatOpenAI; from langchain_core.prompts import ChatPromptTemplate; from langchain_core.output_parsers import StrOutputParser; prompt = ChatPromptTemplate.from_template('Tell me a short joke about {topic}'); model = ChatOpenAI(model='gpt-3.5-turbo') #gpt-4 or other LLMs can be used here; output_parser = StrOutputParser(); chain = prompt | model | output_parser",
    "We can use RunnableParallel. For example: from langchain_core.prompts import ChatPromptTemplate; from langchain_core.runnables import RunnableParallel; from langchain_openai import ChatOpenAI; model = ChatOpenAI(); joke_chain = ChatPromptTemplate.from_template('tell me a joke about {topic}') | model; poem_chain = (ChatPromptTemplate.from_template('write a 2-line poem about {topic}') | model); map_chain = RunnableParallel(joke=joke_chain, poem=poem_chain); map_chain.invoke({'topic': 'bear'})",
]

qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]

# Create dataset
client = Client()
dataset_name = "RAG_QA_LCEL_Reg_Testing"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="QA pairs about LCEL.",
)
client.create_examples(
    inputs=[{"question": q} for q in inputs],
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)

In [55]:
### INDEX

from bs4 import BeautifulSoup as Soup
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load
url = "https://python.langchain.com/v0.1/docs/expression_language/"
loader = RecursiveUrlLoader(
    url=url, max_depth=20, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Embed
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Index the chunked splits (not the raw pages) and reuse the embeddings object created above
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)

retriever = vectorstore.as_retriever()
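
The splitter above is configured with `chunk_size=1500` and `chunk_overlap=200`. As a rough illustration of what overlap means, here is a simplified fixed-width sliding window. This is not how `RecursiveCharacterTextSplitter` works internally (that splitter breaks on separators such as paragraphs and lines), and `sliding_chunks` is a hypothetical helper used only for the sketch:

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of the previous one, which is the property that keeps a sentence straddling a boundary retrievable from at least one chunk.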

In [77]:
### RAG
from langsmith import traceable
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

class RagBot_RT:
    """
    A class to interface with retrieval-augmented generation (RAG) models from different providers
    such as Mistral or Gemini, utilizing a retriever for document-based context.
    """

    def __init__(
        self,
        retriever,
        provider: str = "mistral",
        model: str = "mistral-large-latest",
        use_vectorstore: bool = True,
    ):
        """
        Initializes the RagBot with a retriever, provider information, model details, and configuration
        to use a vector store for document retrieval.

        Args:
        retriever: The document retriever instance.
        provider (str): The provider of the RAG model ('mistral' or 'Gemini'; matched case-sensitively).
        model (str): The model identifier used by the provider.
        use_vectorstore (bool): Flag to determine whether to use vectorstore for document retrieval.
        """
        self._retriever = retriever
        self._provider = provider
        self._model = model
        self._use_vectorstore = use_vectorstore
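        if provider == "mistral":
            self._client = Mistral(api_key=mistral_api_key)
        elif provider == "Gemini":
            self._client = ChatGoogleGenerativeAI(model=self._model,temperature=0.3, max_tokens=500)
        else:
            # Fail fast: get_answer_RT would otherwise reach an undefined response_str
            raise ValueError(f"Unsupported provider: {provider!r} (expected 'mistral' or 'Gemini')")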

        
    @traceable()
    def retrieve_docs_RT(self, question):
        """
        Retrieves documents based on the input question, using either a vectorstore or full context.

        Args:
        question (str): The question to retrieve documents for.

        Returns:
        list: A list of documents relevant to the question or the full context (as a string).
        """
        if self._use_vectorstore:
            return self._retriever.invoke(question)
        else:
            return full_doc_text

    @traceable()
    def get_answer_RT(self, question: str):
        """
        Generates an answer for a given question by using RAG, leveraging both the retriever
        and the provider's model capabilities.

        Args:
        question (str): The user's question to answer.

        Returns:
        dict: A dictionary containing the 'answer' and 'contexts' (related documents).
        """
        similar = self.retrieve_docs_RT(question)
        if self._provider == "mistral":
            # MistralAI RAG
            response = self._client.chat.complete(
                model=self._model,
                messages=[
                    {
                        "role": "system",
                        "content": "You are a helpful AI code assistant with expertise in LCEL.\n"
                        " Use the following docs to produce a concise code solution to the user question.\n"
                        " Use three sentences maximum and keep the answer concise. \n"
                        f"## Docs\n\n{similar}",
                    },
                    {"role": "user", "content": question},
                ],
            )
            response_str = response.choices[0].message.content

        elif self._provider == "Gemini":
            # Gemini RAG
            prompt = PromptTemplate(
                template="""You are a helpful AI code assistant with expertise in LCEL.
                Use the following docs to produce a concise code solution to the user question.
                If you don't know the answer, just say that you don't know. 
                Use three sentences maximum and keep the answer concise.
                Question: {question} 
                Context: {context} 
                Answer: """,
                input_variables=["question", "context"],
            )
            rag_chain = prompt | self._client | StrOutputParser()
            response_str = rag_chain.invoke({"context": similar, "question": question})

        return {
            "answer": response_str,
            "contexts": [str(doc) for doc in similar],
        }

In [78]:
def predict_rag_answer_mistral_RT(example: dict):
    """Use this for answer evaluation"""
    rag_bot = RagBot_RT(retriever, provider="mistral", model="mistral-large-latest")
    response = rag_bot.get_answer_RT(example["question"])
    return {"answer": response["answer"]}


def predict_rag_answer_gemini_pro_RT(example: dict):
    """Use this for answer evaluation"""
    rag_bot = RagBot_RT(retriever, provider="Gemini", model="gemini-1.5-pro")
    response = rag_bot.get_answer_RT(example["question"])
    return {"answer": response["answer"]}


def predict_rag_answer_gemini_flash_RT(example: dict):
    """Use this for answer evaluation"""
    rag_bot = RagBot_RT(retriever, provider="Gemini", model="gemini-1.5-flash")
    response = rag_bot.get_answer_RT(example["question"])
    return {"answer": response["answer"]}
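
The three target functions above differ only in the provider and model passed to the bot. A minimal sketch of a factory that removes the duplication, assuming the bot exposes `get_answer_RT(question)` and returns a dict with an `answer` key as in the class above (`make_predictor` and `FakeBot` are hypothetical names introduced for illustration):

```python
from typing import Callable


def make_predictor(build_bot: Callable[[], object]) -> Callable[[dict], dict]:
    """Build an evaluation target that creates a bot and returns only its answer."""

    def predict(example: dict) -> dict:
        bot = build_bot()
        response = bot.get_answer_RT(example["question"])
        return {"answer": response["answer"]}

    return predict


class FakeBot:
    """Stand-in for the RAG bot so the sketch runs without API keys."""

    def get_answer_RT(self, question: str) -> dict:
        return {"answer": f"echo: {question}", "contexts": []}


predict_fake = make_predictor(FakeBot)
print(predict_fake({"question": "What is LCEL?"}))
```

With the notebook's class, the same pattern would read `make_predictor(lambda: RagBot_RT(retriever, provider="Gemini", model="gemini-1.5-flash"))`.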

In [79]:
from langsmith.schemas import Example, Run
from langchain_core.prompts import ChatPromptTemplate
import re

def answer_evaluator_RT(run, example) -> dict:
    """
    A simple evaluator for RAG answer generation
    """
    input_question = example.inputs["question"]
    reference = example.outputs["answer"]
    prediction = run.outputs["answer"]
    
    # LLM
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash",temperature=0.3, max_tokens=500)

    # Prompt
    system = """Is the Assistant's Answer grounded in and similar to the Ground Truth answer? Note that we do not expect all of the text 
            in code solution examples to be identical. We expect (1) code imports to be identical if the same import is used. (2) But, it is
            ok if there are differences in the implementation itself. The main point is that the same concept is employed. A score of 1 means 
            that the Assistant answer is not at all conceptually grounded in and similar to the Ground Truth answer. A score of 5 means that the Assistant 
            answer contains some information that is conceptually grounded in and similar to the Ground Truth answer. A score of 10 means that the 
            Assistant answer is fully conceptually grounded in and similar to the Ground Truth answer.
            Return only a numerical score for answer accuracy, from 1 to 10."""

    grade_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system),
            (
                "human",
                "Ground Truth answer: \n\n {reference} \n\n Assistant's Answer: {prediction}",
            ),
        ]
    )

    answer_grader = grade_prompt | llm
    score = answer_grader.invoke({"reference": reference, "prediction": prediction})
    final_score = int(score.content)
    return {"key": "answer_accuracy", "score": final_score/10}
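
The evaluator above calls `int(score.content)`, which fails as soon as the grader wraps the number in extra text. A minimal sketch of a more tolerant parser, using the `re` module that the cell already imports (`parse_score` is a hypothetical helper, and returning `None` for unusable replies is an assumption about how you want to handle them):

```python
import re
from typing import Optional


def parse_score(text: str, low: int = 1, high: int = 10) -> Optional[float]:
    """Extract the first integer in `text` and scale it to the 0-1 range.

    Returns None when no integer is present or it falls outside [low, high].
    """
    match = re.search(r"\d+", text)
    if match is None:
        return None
    value = int(match.group())
    if not low <= value <= high:
        return None
    return value / high


print(parse_score("8"))
print(parse_score("Score: 7/10"))
print(parse_score("no number here"))
```

Taking the first integer keeps a reply such as "7/10" from being read as 10; a stricter variant could reject any reply that contains more than one number.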

In [80]:
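from mistralai import Mistral

mistral_api_key = os.environ.get("MISTRAL_API_KEY")
# Named mistral_client so it does not shadow the LangSmith `client` created earlier
mistral_client = Mistral(api_key=mistral_api_key)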

In [81]:
dataset_name = "RAG_QA_LCEL_Reg_Testing"
experiment_results = evaluate(
    predict_rag_answer_mistral_RT,
    data=dataset_name,
    evaluators=[answer_evaluator_RT],
    experiment_prefix="rag-qa-mistral",
    metadata={"variant": "LCEL context, mistral-large-latest"},
)

View the evaluation results for experiment: 'rag-qa-mistral-662acd6c' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/04f9d8d4-cf06-44a9-930c-4702a1a0073c/compare?selectedSessions=22405e21-dbd7-43c1-99b4-844d2b6c0af4




0it [00:00, ?it/s]

In [82]:
experiment_results = evaluate(
    predict_rag_answer_gemini_pro_RT,
    data=dataset_name,
    evaluators=[answer_evaluator_RT],
    experiment_prefix="rag-qa-gemini-pro",
    metadata={"variant": "LCEL context, gemini-1.5-pro"},
)

View the evaluation results for experiment: 'rag-qa-gemini-pro-48bd5e53' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/04f9d8d4-cf06-44a9-930c-4702a1a0073c/compare?selectedSessions=c45353db-00e1-4be3-b241-05017cab3774




0it [00:00, ?it/s]

In [83]:
experiment_results = evaluate(
    predict_rag_answer_gemini_flash_RT,
    data=dataset_name,
    evaluators=[answer_evaluator_RT],
    experiment_prefix="rag-qa-gemini-flash",
    metadata={"variant": "LCEL context, gemini-1.5-flash"},
)

View the evaluation results for experiment: 'rag-qa-gemini-flash-140ce620' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/04f9d8d4-cf06-44a9-930c-4702a1a0073c/compare?selectedSessions=51ec1aed-1a4f-4c49-a38f-19a353896d68




0it [00:00, ?it/s]

In [84]:
Image(url="13.png", width=1000, height=500) 

<B><H1>Pairwise evaluations - Paper summarization for Twitter

In [85]:
pip install arxiv

Note: you may need to restart the kernel to use updated packages.


In [86]:
pip install pymupdf

Note: you may need to restart the kernel to use updated packages.


In [87]:
from langchain_community.document_loaders import ArxivLoader

# Arxiv IDs
# phi3, llama3 context extension, jamba, longRope, can llms reason & plan, action learning, roformer, attn is all you need, segment anything, swin transformer
ids = [
    "2404.14219",
    "2404.19553",
    "2403.19887",
    "2402.13753",
    "2403.04121",
    "2402.15809",
    "2104.09864",
    "1706.03762",
    "2304.02643",
    "2111.09883",
]

# Load papers
docs = []
for paper_id in ids:
    doc = ArxivLoader(query=paper_id, load_max_docs=1).load()
    docs.extend(doc)

In [None]:
from langsmith import Client

# Summarization
inputs = [d.page_content for d in docs]

# Create dataset
client = Client()
dataset_name = "Paper_Tweet_Generator"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Papers to summarize",
)
client.create_examples(
    inputs=[{"text": d} for d in inputs],
    dataset_id=dataset.id,
)

In [None]:
pip install langchain-cohere

In [261]:
pip install langchain_openai

Collecting langchain_openai
  Downloading langchain_openai-0.3.7-py3-none-any.whl.metadata (2.3 kB)
Collecting langchain-core<1.0.0,>=0.3.39 (from langchain_openai)
  Downloading langchain_core-0.3.40-py3-none-any.whl.metadata (5.9 kB)
Collecting openai<2.0.0,>=1.58.1 (from langchain_openai)
  Downloading openai-1.65.1-py3-none-any.whl.metadata (27 kB)
Collecting tiktoken<1,>=0.7 (from langchain_openai)
  Downloading tiktoken-0.9.0-cp310-cp310-win_amd64.whl.metadata (6.8 kB)
Collecting distro<2,>=1.7.0 (from openai<2.0.0,>=1.58.1->langchain_openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting jiter<1,>=0.4.0 (from openai<2.0.0,>=1.58.1->langchain_openai)
  Downloading jiter-0.8.2-cp310-cp310-win_amd64.whl.metadata (5.3 kB)
Downloading langchain_openai-0.3.7-py3-none-any.whl (55 kB)
Downloading langchain_core-0.3.40-py3-none-any.whl (414 kB)
Downloading openai-1.65.1-py3-none-any.whl (472 kB)
Downloading tiktoken-0.9.0-cp310-cp310-win_amd64.whl (894 kB)

In [262]:
pip install langchain_anthropic

Collecting langchain_anthropic
  Downloading langchain_anthropic-0.3.8-py3-none-any.whl.metadata (1.9 kB)
Collecting anthropic<1,>=0.47.0 (from langchain_anthropic)
  Downloading anthropic-0.49.0-py3-none-any.whl.metadata (24 kB)
Downloading langchain_anthropic-0.3.8-py3-none-any.whl (23 kB)
Downloading anthropic-0.49.0-py3-none-any.whl (243 kB)
Installing collected packages: anthropic, langchain_anthropic
Successfully installed anthropic-0.49.0 langchain_anthropic-0.3.8
Note: you may need to restart the kernel to use updated packages.


In [89]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_mistralai import ChatMistralAI
### Tweet generators (Gemini and Mistral)

# Prompt
system_tweet_instructions = (
    "<role> You are an assistant that generates Tweets to distill / summarize"
    " an academic paper. Ensure the summary: (1) has an engaging title, "
    " (2) provides a bullet point list of main points from the paper, "
    " (3) utilizes emojis, (4) includes limitations of the approach, and "
    " (5) highlights in one sentence the key point or innovation in the paper. </role>"
)
human = "Here is a paper to convert into a Tweet: <paper> {paper} </paper>"
prompt = ChatPromptTemplate.from_messages(
    [("system", system_tweet_instructions), ("human", human)]
)

def predict_tweet_gemini_flash(example: dict):
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash",temperature=0.3, max_tokens=500)
    tweet_generator_gemini_flash = prompt | llm | StrOutputParser()
    response = tweet_generator_gemini_flash.invoke({"paper": example["text"]})
    return {"answer": response}

def predict_tweet_mistralai_mll(example: dict):
    chat = ChatMistralAI(temperature=0, model="mistral-large-latest")
    tweet_generator_mistralai_mll = prompt | chat | StrOutputParser()
    response = tweet_generator_mistralai_mll.invoke({"paper": example["text"]})
    return {"answer": response}

In [None]:
from langchain import hub

from langchain_openai import ChatOpenAI
from langsmith.schemas import Example, Run
from langchain_core.prompts import ChatPromptTemplate
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Grade prompt
grade_prompt = hub.pull("rlm/summary-evaluator")

def text_summary_grader(run, example) -> dict:
    """
    A simple evaluator for text summarization
    """
    
    # Get summary
    summary = run.outputs["answer"]

    # LLM grader
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash",temperature=0.3, max_tokens=500)

    # Structured prompt
    answer_grader = grade_prompt | llm

    # Get score
    score = answer_grader.invoke({"summary": summary})
    final_score = score[0]["args"]["Score"]

    return {"key": "summary_engagement_score", "score": final_score}

In [271]:
pip install langchain-mistralai

Collecting langchain-mistralai
  Downloading langchain_mistralai-0.2.7-py3-none-any.whl.metadata (2.0 kB)
Downloading langchain_mistralai-0.2.7-py3-none-any.whl (15 kB)
Installing collected packages: langchain-mistralai
Successfully installed langchain-mistralai-0.2.7
Note: you may need to restart the kernel to use updated packages.


In [91]:
dataset_name = "Paper_Tweet_Generator"

experiment_results = evaluate(
    predict_tweet_gemini_flash,
    data=dataset_name,
    evaluators=[text_summary_grader],
    experiment_prefix="summary-gemini-1.5-flash",
    metadata={"variant": "paper summary tweet, gemini-1.5-flash"},
)

View the evaluation results for experiment: 'summary-gemini-1.5-flash-6322e0c4' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/e5b3d17b-6239-489c-8e8c-f4f5dcce9981/compare?selectedSessions=fe1bee1b-bd5e-4a9e-8b13-9ad4a42172be




0it [00:00, ?it/s]

Key 'parameters' is not supported in schema, ignoring


In [92]:
experiment_results = evaluate(
    predict_tweet_mistralai_mll,
    data=dataset_name,
    evaluators=[text_summary_grader],
    experiment_prefix="summary-mistral-mll",
    max_concurrency=2,
    metadata={"variant": "paper summary tweet, mistral-mll"},
)

View the evaluation results for experiment: 'summary-mistral-mll-0f9f2ece' at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/e5b3d17b-6239-489c-8e8c-f4f5dcce9981/compare?selectedSessions=6a387013-98a7-402b-8c20-455acbcf6191




0it [00:00, ?it/s]

Key 'parameters' is not supported in schema, ignoring
Key 'parameters' is not supported in schema, ignoring
Error running target function: Error response 429 while fetching https://api.mistral.ai/v1/chat/completions: {"message":"Requests rate limit exceeded"}
Traceback (most recent call last):
  File "C:\Users\ragamira.shankar\AppData\Local\anaconda3\envs\env_langchain1\lib\site-packages\langsmith\evaluation\_runner.py", line 1914, in _forward
    fn(
  File "C:\Users\ragamira.shankar\AppData\Local\anaconda3\envs\env_langchain1\lib\site-packages\langsmith\run_helpers.py", line 629, in wrapper
    raise e
  File "C:\Users\ragamira.shankar\AppData\Local\anaconda3\envs\env_langchain1\lib\site-packages\langsmith\run_helpers.py", line 626, in wrapper
    function_result = run_container["context"].run(func, *args, **kwargs)
  File "C:\Users\ragamira.shankar\AppData\Local\Temp\ipykernel_20952\833916525.py", line 28, in predict_tweet_mistralai_mll
    response = tweet_generator_mistralai_mll.i
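
The run above stops on an HTTP 429 ("Requests rate limit exceeded"). Besides lowering `max_concurrency`, one common mitigation is to retry the call with exponential backoff. A minimal sketch using only the standard library (`call_with_backoff` is a hypothetical helper; it catches every `Exception` for brevity, whereas real code should catch only the client's rate-limit error):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, doubling the wait after each failure; re-raise on the last attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to the rate-limit error in practice
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")


# Simulated target that is rate limited twice before succeeding
calls = {"n": 0}
delays = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429: rate limit exceeded (simulated)")
    return "ok"


# delays.append stands in for time.sleep so the demo finishes instantly
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Inside a target function, the chain call could be wrapped as `call_with_backoff(lambda: chain.invoke(...))`, where `chain` stands for whichever runnable the target builds.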

In [93]:
from langchain import hub
from langsmith.schemas import Example, Run
from langchain_core.prompts import ChatPromptTemplate
from langsmith.evaluation import evaluate

def evaluate_pairwise(runs: list, example) -> dict:
    """
    A simple pairwise evaluator that scores two answers based on engagement
    """

    # Store scores
    scores = {}
    for i, run in enumerate(runs):
        scores[run.id] = i

    # Runs is the pair of runs for each example
    answer_a = runs[0].outputs["answer"]
    answer_b = runs[1].outputs["answer"]

    # LLM with function call; ideally the highest-capacity model available (gemini-1.5-flash is used here)
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash",temperature=0.3)

    # Structured prompt
    grade_prompt = hub.pull("rlm/pairwise-evaluation-tweet-summary")
    answer_grader = grade_prompt | llm

    # Get score
    score = answer_grader.invoke(
        {
            "question": system_tweet_instructions,
            "answer_a": answer_a,
            "answer_b": answer_b,
        }
    )
    final_score = score[0]["args"]["Preference"]

    # Map from the score to the run assignment
    if final_score == 1:  # Assistant A is preferred
        scores[runs[0].id] = 1
        scores[runs[1].id] = 0
    elif final_score == 2:  # Assistant B is preferred
        scores[runs[0].id] = 0
        scores[runs[1].id] = 1
    else:
        scores[runs[0].id] = 0
        scores[runs[1].id] = 0

    return {"key": "ranked_preference", "scores": scores}
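
The preference-to-score mapping at the end of the evaluator above can be isolated into a small pure function, which makes it easy to check without calling an LLM. A minimal sketch, assuming the grader returns 1 when the first answer is preferred, 2 when the second is preferred, and anything else for a tie (`preference_to_scores` is a hypothetical helper):

```python
from typing import Dict, Hashable, Sequence


def preference_to_scores(run_ids: Sequence[Hashable], preference: int) -> Dict[Hashable, int]:
    """Give 1 to the preferred run and 0 to the other; both get 0 on a tie."""
    scores = {run_id: 0 for run_id in run_ids}
    if preference in (1, 2):
        # preference is 1-based: 1 -> first run, 2 -> second run
        scores[run_ids[preference - 1]] = 1
    return scores


print(preference_to_scores(["run-a", "run-b"], 1))
print(preference_to_scores(["run-a", "run-b"], 2))
print(preference_to_scores(["run-a", "run-b"], 0))
```

In the evaluator this would be called with `[runs[0].id, runs[1].id]` and the parsed preference.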

In [94]:
from langsmith.evaluation import evaluate_comparative

evaluate_comparative(
    # Replace the following array with the names or IDs of your experiments
    ["summary-mistral-mll-5160412f", "summary-gemini-1.5-flash-cdceef3f"],
    evaluators=[evaluate_pairwise],
)

View the pairwise evaluation results at:
https://smith.langchain.com/o/a87e4dfa-61d0-4714-8c72-64f2925b822e/datasets/e5b3d17b-6239-489c-8e8c-f4f5dcce9981/compare?selectedSessions=6e52f796-19bb-4e10-8319-0afebaeaf5ef%2C4fe8df47-ed19-4741-847f-e202cdd89305&comparativeExperiment=a9b68177-2478-478b-af6d-a60e3ea9018f




  0%|          | 0/10 [00:00<?, ?it/s]

Key 'parameters' is not supported in schema, ignoring


<langsmith.evaluation._runner.ComparativeExperimentResults at 0x2f692a09e10>

In [95]:
Image(url="14.png", width=1000, height=500) 