In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
from langsmith import Client

client = Client()

# (question, reference answer) pairs to upload as dataset examples
example_inputs = [
("What is the difference between LangChain and LangSmith?", "LangChain is an open-source framework for building applications with large language models, providing tools for chaining together different components like prompts, LLMs, and data sources. LangSmith, on the other hand, is a platform for monitoring, debugging, and testing LLM applications built with or without LangChain. While LangChain helps you build LLM apps, LangSmith helps you observe and improve them in production."),
    
("How do I create a simple chain in LangChain?", "To create a simple chain in LangChain, you can use the LCEL (LangChain Expression Language) syntax by connecting components with the pipe operator. For example, you can chain a prompt template with an LLM like this: chain = prompt | llm. You can then invoke the chain with chain.invoke({'input': 'your input'}). This creates a sequence where the output of one component becomes the input of the next."),
    
("What is a vector store and why is it important for RAG applications?", "A vector store is a database that stores embeddings (numerical representations) of text chunks and enables efficient similarity search. It's crucial for RAG (Retrieval Augmented Generation) applications because it allows you to quickly find relevant documents based on semantic similarity to a user's query. When a question is asked, the vector store retrieves the most relevant chunks, which are then provided as context to the LLM for generating accurate answers."),
    
("How do I make API calls with rate limiting in Python?", "To implement rate limiting for API calls in Python, you can use libraries like ratelimit or tenacity. A simple approach is to use time.sleep() between calls, but for more robust solutions, use a rate limiter decorator. For example, with the ratelimit library: @sleep_and_retry and @limits(calls=10, period=60) above your function ensures no more than 10 calls per 60 seconds. This prevents hitting API rate limits and getting blocked."),
    
("What are embeddings in the context of LLMs?", "Embeddings are dense vector representations of text that capture semantic meaning in a high-dimensional space. In LLMs, text is converted into numerical vectors (typically 768, 1536, or more dimensions) where similar concepts are positioned close together. These embeddings enable tasks like semantic search, clustering, and similarity comparison. Models like OpenAI's text-embedding-ada-002 or open-source alternatives create these representations for use in applications."),
    
("How do I handle errors and retries when calling LLM APIs?", "When calling LLM APIs, implement error handling using try-except blocks to catch common errors like rate limits, timeouts, or API failures. Use exponential backoff for retries, where wait time increases with each attempt. Libraries like tenacity provide decorators like @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10)) that automatically retry failed requests. Always log errors and set maximum retry limits to avoid infinite loops."),
    
("What is prompt engineering and why is it important?", "Prompt engineering is the practice of crafting effective input prompts to get desired outputs from language models. It's important because LLMs are highly sensitive to how questions are phrased, and well-designed prompts can significantly improve response quality, accuracy, and consistency. Techniques include providing clear instructions, using examples (few-shot learning), specifying output format, and adding context. Good prompt engineering can reduce hallucinations and improve task performance without model fine-tuning."),
    
("How do I implement streaming responses from an LLM API?", "To implement streaming responses, use the streaming parameter when calling the API and process chunks as they arrive. With OpenAI's API, set stream=True in the completion request. Then iterate over the response object to handle each chunk: for chunk in response: print(chunk.choices[0].delta.content). In LangChain, use callbacks or the streaming method to handle real-time token generation. This provides better user experience by showing responses as they're generated rather than waiting for completion."),
    
("What is the purpose of text splitting in RAG applications?", "Text splitting divides large documents into smaller chunks that fit within an LLM's context window and improve retrieval accuracy. It's essential because LLMs have token limits, and smaller chunks provide more precise context matching during similarity search. Good splitting strategies preserve semantic meaning by respecting sentence or paragraph boundaries. LangChain's RecursiveCharacterTextSplitter splits text hierarchically, and parameters like chunk_size and chunk_overlap help balance between context completeness and retrieval precision."),
    
("How do I track costs when using LLM APIs in production?", "Track LLM API costs by monitoring token usage for each request, as most providers charge per token. Implement logging to record input/output tokens, model used, and timestamp for each call. Tools like LangSmith automatically track these metrics and provide cost analysis dashboards. Calculate costs by multiplying token counts by the model's price per token. Set up alerts for unusual usage spikes, implement caching to reduce redundant calls, and consider using cheaper models for simpler tasks to optimize costs.")
]
# ID of the target LangSmith dataset; replace this with your own dataset's ID
dataset_id = "3b3a2968-d4c6-4433-bc5a-8519d6f7b1bb"

# Prepare inputs and outputs for bulk creation
inputs = [{"question": input_prompt} for input_prompt, _ in example_inputs]
outputs = [{"output": output_answer} for _, output_answer in example_inputs]

client.create_examples(
    inputs=inputs,
    outputs=outputs,
    dataset_id=dataset_id,
)

{'example_ids': ['9bf9cdfc-7adb-4579-86e4-e6562c1ef569',
  '21836737-355f-424c-a081-53d2416c5958',
  '37ca6249-1656-4fc9-ad96-6b5335e63273',
  'd987ec00-a7b1-4614-98d5-222cc36e0861',
  '02c8fc33-9b2e-4699-9460-c2c7163fbb49',
  'd71e40e3-c281-4d0f-aca6-c4747292d34d',
  'f1644430-42bc-42e7-9ce4-561c764449ab',
  '6147b90b-c010-42f8-b028-c6ec1784ec9b',
  'e99a8a6d-1be5-4571-abd4-f45ec65eff90',
  '607cba38-6f79-4617-91f5-69f809f60d7c'],
 'count': 10}
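
The cell above splits the (question, answer) tuples into two parallel lists before the bulk upload. A minimal sketch of the same split with a few sanity checks added is shown here. `build_example_payloads` is a hypothetical helper written for illustration and is not part of the LangSmith SDK; the sample pairs are made up and shortened.

```python
def build_example_payloads(pairs):
    """Split (question, answer) pairs into parallel input/output dict lists.

    Hypothetical helper for illustration; not part of the LangSmith SDK.
    Raises ValueError on empty fields or duplicate questions.
    """
    inputs, outputs = [], []
    seen_questions = set()
    for question, answer in pairs:
        question, answer = question.strip(), answer.strip()
        if not question or not answer:
            raise ValueError("question and answer must be non-empty")
        if question in seen_questions:
            raise ValueError(f"duplicate question: {question!r}")
        seen_questions.add(question)
        # Same dict keys as the notebook cell above: "question" and "output"
        inputs.append({"question": question})
        outputs.append({"output": answer})
    return inputs, outputs


# Illustrative sample data (assumed, shortened for the example)
sample_pairs = [
    ("What is LangSmith?", "A platform for observing LLM apps."),
    ("What is LCEL?", "LangChain Expression Language."),
]
sample_inputs, sample_outputs = build_example_payloads(sample_pairs)
```

The two returned lists have the same shape as `inputs` and `outputs` above, so they could be passed to `client.create_examples` in the same way.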

In [3]:
from app import langsmith_rag

Fetching pages: 100%|##########| 197/197 [00:56<00:00,  3.52it/s]


In [4]:
question = "What is a vector store and why is it important for RAG applications?"
langsmith_rag(question)

'A vector store is a data structure that organizes and stores embeddings, which are high-dimensional representations of data such as text, images, or other types of information. In Retrieval-Augmented Generation (RAG) applications, a vector store is essential because it facilitates efficient searching and retrieval of relevant information based on user queries, allowing the application to generate accurate and contextually relevant responses. This enhances the capabilities of the language model by grounding its outputs in external knowledge.'

In [5]:
question = "How do I handle errors and retries when calling LLM APIs?"
langsmith_rag(question)

'To handle errors and retries when calling LLM APIs, you can implement exponential backoff for retrying failed requests, increasing the wait time between each retry. In LangChain, you can add retries to model calls using the `.with_retry(...)` method, specifying the maximum number of attempts. Additionally, monitor your runs and check for common error messages to help diagnose issues effectively.'
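
The answer above mentions exponential backoff. A minimal standard-library sketch of that idea follows; it is not LangChain's `.with_retry(...)` and not the tenacity decorator from the reference answer. The names `call_with_backoff` and `flaky_call` and the default delays are assumptions chosen for illustration, and the `sleep` parameter is injectable so the demo can record waits instead of actually sleeping.

```python
import time


def call_with_backoff(fn, max_attempts=3, base_delay=1.0, max_delay=10.0,
                      sleep=time.sleep):
    """Call fn(), retrying on any exception with exponentially growing waits.

    Hypothetical helper for illustration. The wait doubles after each failed
    attempt (base_delay, 2 * base_delay, ...) and is capped at max_delay.
    The last failure is re-raised so the caller still sees the error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay)


# Demo: a stand-in for an LLM API call that fails twice, then succeeds
attempts = {"count": 0}
waits = []  # records the delays instead of sleeping


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_backoff(flaky_call, sleep=waits.append)
```

In real use you would likely catch only the specific exceptions your client raises for rate limits and timeouts, rather than every `Exception`.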