# Dataset Upload

In addition to creating and editing datasets in the LangSmith UI, you can also create and edit them with the LangSmith SDK.

Let's go ahead and upload a list of examples from our RAG application to LangSmith as a new dataset.

In [1]:
# Setup
import os
from dotenv import load_dotenv
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"


# Inline env (optional). Prefer .env in project root.
os.environ.setdefault("LANGSMITH_TRACING", "true")
os.environ.setdefault("LANGSMITH_PROJECT", "Langsmith_intro")  # personalize project name
os.environ.setdefault("GEMINI_API_KEY", os.getenv("GEMINI_API_KEY", ""))
os.environ.setdefault("LANGSMITH_API_KEY", os.getenv("LANGSMITH_API_KEY", ""))

# Load from .env if present
load_dotenv(dotenv_path="../../.env", override=True)
print("Tracing:", os.getenv("LANGSMITH_TRACING"), "Project:", os.getenv("LANGSMITH_PROJECT"))

# Install the Gemini client only if the import fails
try:
    import google.generativeai as genai  # noqa: F401
except ImportError:
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "google-generativeai", "-q"])

Tracing: true Project: Langsmith_intro
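
If an API key is missing, the problem only surfaces later as an authentication error when the client makes its first request. A minimal sketch of an early check is shown here; `missing_env_vars` is a hypothetical helper written for this example, not part of the LangSmith SDK.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names whose values are unset or blank in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not str(env.get(name, "")).strip()]


# Example with an explicit mapping so the behavior is easy to follow
sample_env = {"LANGSMITH_API_KEY": "", "LANGSMITH_TRACING": "true"}
print(missing_env_vars(["LANGSMITH_API_KEY", "LANGSMITH_TRACING"], sample_env))
```

In the notebook, calling `missing_env_vars(["LANGSMITH_API_KEY", "GEMINI_API_KEY"])` right after `load_dotenv` would list any keys that still need to be set.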




In [2]:
# Or you can use a .env file
#from dotenv import load_dotenv
#load_dotenv(dotenv_path=r"C:\Users\rishi\LLM_prac\intro-to-langsmith-main\.env", override=True)

In [3]:
from langsmith import Client

# Resolve and set dataset_id from dataset name
DATASET_NAME = "second"
client = Client()
dataset_id = client.read_dataset(dataset_name=DATASET_NAME).id
print("dataset_id:", dataset_id)


dataset_id: 073c3ace-3840-4c61-99a0-816c2c13e33a
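
The lookup above assumes a dataset named `second` already exists; otherwise `read_dataset` raises an error. The sketch below shows one way to fall back to creating the dataset. `get_or_create_dataset_id` is a hypothetical helper, and it assumes the client exposes `has_dataset`, `read_dataset`, and `create_dataset` with the keyword arguments used here.

```python
def get_or_create_dataset_id(client, dataset_name, description=""):
    """Return the ID of `dataset_name`, creating the dataset first if it does not exist."""
    if client.has_dataset(dataset_name=dataset_name):
        return client.read_dataset(dataset_name=dataset_name).id
    # Assumption: the dataset is created with default settings plus a description
    created = client.create_dataset(dataset_name=dataset_name, description=description)
    return created.id


# Usage with a real client (requires LANGSMITH_API_KEY to be set):
# from langsmith import Client
# dataset_id = get_or_create_dataset_id(Client(), "second")
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at a real workspace.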


In [4]:
from langsmith import Client

# Use env vars (.env loaded earlier) instead of hard-coding the key
client = Client()

# Require a dataset_id to avoid any name-based API calls
if 'dataset_id' not in globals() or not dataset_id:
    raise RuntimeError("dataset_id is not set. Run the cell that defines dataset_id first.")

# Read dataset strictly by ID (no name lookups)
ds = client.read_dataset(dataset_id=dataset_id)

print("Dataset ID:", ds.id)
print("Dataset name:", ds.name)
print("As dict:", ds.dict())



Dataset ID: 073c3ace-3840-4c61-99a0-816c2c13e33a
Dataset name: second
As dict: {'name': 'second', 'description': '', 'data_type': <DataType.kv: 'kv'>, 'id': UUID('073c3ace-3840-4c61-99a0-816c2c13e33a'), 'created_at': datetime.datetime(2025, 10, 3, 12, 42, 38, 64851, tzinfo=datetime.timezone.utc), 'modified_at': datetime.datetime(2025, 10, 3, 12, 42, 38, 64851, tzinfo=datetime.timezone.utc), 'example_count': 40, 'session_count': 0, 'last_session_start_time': None, 'inputs_schema': None, 'outputs_schema': None, 'transformations': None, 'metadata': None}


In [5]:
from langsmith import Client

client = Client()
dataset = client.read_dataset(dataset_id="a213a95a-dbd0-4d5f-af2e-256bf41691c4")

print(dataset)          # shows dataset summary
print(dataset.dict())   # full fields as a dict


name='Golden Example' description='' data_type=<DataType.kv: 'kv'> id=UUID('a213a95a-dbd0-4d5f-af2e-256bf41691c4') created_at=datetime.datetime(2025, 10, 3, 11, 18, 39, 393942, tzinfo=datetime.timezone.utc) modified_at=datetime.datetime(2025, 10, 3, 11, 18, 39, 393942, tzinfo=datetime.timezone.utc) example_count=0 session_count=0 last_session_start_time=None inputs_schema=None outputs_schema=None transformations=None metadata=None
{'name': 'Golden Example', 'description': '', 'data_type': <DataType.kv: 'kv'>, 'id': UUID('a213a95a-dbd0-4d5f-af2e-256bf41691c4'), 'created_at': datetime.datetime(2025, 10, 3, 11, 18, 39, 393942, tzinfo=datetime.timezone.utc), 'modified_at': datetime.datetime(2025, 10, 3, 11, 18, 39, 393942, tzinfo=datetime.timezone.utc), 'example_count': 0, 'session_count': 0, 'last_session_start_time': None, 'inputs_schema': None, 'outputs_schema': None, 'transformations': None, 'metadata': None}


In [6]:
from langsmith import Client

example_inputs = [
("How do I set up tracing to LangSmith if I'm using LangChain?", "To set up tracing to LangSmith while using LangChain, you need to set the environment variable `LANGSMITH_TRACING` to 'true'. Additionally, you must set the `LANGSMITH_API_KEY` environment variable to your API key. By default, traces will be logged to a project named \"default.\""),
("How can I trace with the @traceable decorator?", "To trace with the @traceable decorator in Python, simply decorate any function you want to log traces for by adding `@traceable` above the function definition. Ensure that the LANGSMITH_TRACING environment variable is set to 'true' to enable tracing, and also set the LANGSMITH_API_KEY environment variable with your API key. By default, traces will be logged to a project named \"default,\" but you can configure it to log to a different project if needed."),
("How do I pass metadata in with @traceable?", "You can pass metadata with the @traceable decorator by specifying arbitrary key-value pairs as arguments. This allows you to associate additional information, such as the execution environment or user details, with your traces. For more detailed instructions, refer to the LangSmith documentation on adding metadata and tags."),
("What is LangSmith used for in three sentences?", "LangSmith is a platform designed for the development, monitoring, and testing of LLM applications. It enables users to collect and analyze unstructured data, debug issues, and create datasets for testing and evaluation. The tool supports various workflows throughout the application development lifecycle, enhancing the overall performance and reliability of LLM applications."),
("What testing capabilities does LangSmith have?", "LangSmith offers capabilities for creating datasets of inputs and reference outputs to run tests on LLM applications, supporting a test-driven approach. It allows for bulk uploads of test cases, on-the-fly creation, and exporting from application traces. Additionally, LangSmith facilitates custom evaluations to score test results, enhancing the testing process."),
("Does LangSmith support online evaluation?", "Yes, LangSmith supports online evaluation as a feature. It allows you to configure a sample of runs from production to be evaluated, providing feedback on those runs. You can use either custom code or an LLM as a judge for the evaluations."),
("Does LangSmith support offline evaluation?", "Yes, LangSmith supports offline evaluation through its evaluation how-to guides and features for managing datasets. Users can manage datasets for offline evaluations and run various types of evaluations, including unit testing and auto-evaluation. This allows for comprehensive testing and improvement of LLM applications."),
("Can LangSmith be used for finetuning and model training?", "Yes, LangSmith can be used for fine-tuning and model training. It allows you to capture run traces from your deployment, query and filter this data, and convert it into a format suitable for fine-tuning models. Additionally, you can create training datasets to keep track of the data used for model training."),
("Can LangSmith be used to evaluate agents?", "Yes, LangSmith can be used to evaluate agents. It provides various evaluation strategies, including assessing the agent's final response, evaluating individual steps, and analyzing the trajectory of tool calls. These methods help ensure the effectiveness of LLM applications."),
("How do I create user feedback with the LangSmith sdk?", "To create user feedback with the LangSmith SDK, you first need to run your application and obtain the `run_id`. Then, you can use the `create_feedback` method, providing the `run_id`, a feedback key, a score, and an optional comment. For example, in Python, it would look like this: `client.create_feedback(run_id, key=\"feedback-key\", score=1.0, comment=\"comment\")`."),
]

from langsmith.utils import LangSmithNotFoundError

client = Client()
# Resolve dataset by name
DATASET_NAME = "second"
try:
    ds_by_name = client.read_dataset(dataset_name=DATASET_NAME)
    dataset_id = ds_by_name.id
except LangSmithNotFoundError as e:
    raise RuntimeError(f"Dataset '{DATASET_NAME}' not found in your workspace.") from e

# Prepare inputs and outputs for bulk creation
inputs = [{"question": input_prompt} for input_prompt, _ in example_inputs]
outputs = [{"output": output_answer} for _, output_answer in example_inputs]

client.create_examples(
    inputs=inputs,
    outputs=outputs,
    dataset_id=dataset_id,
)

{'example_ids': ['b0947fc0-7ed3-4d03-9c6b-53d80c6a2271',
  'd860852c-9645-4d74-8fdc-b3066c5df34a',
  'f9e9b43b-46f8-405d-920a-ebca4a0e08bc',
  '43947649-4970-4a49-be91-91cf0d4a4540',
  '2a8c9125-d001-4ae1-a0f8-6112a53de892',
  'e7510179-a4d8-4ec2-8394-195df04ce879',
  'fa2d5aeb-5f3b-4741-920d-21e38c2bf75f',
  '0cc53c10-ccea-49d8-84a4-a13a7bd5fc0d',
  '555c4939-ac00-44ba-99a6-6f87cc2ec321',
  'c80c9023-ab98-4239-a3d9-277f90358aa5'],
 'count': 10}
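
Each call to `create_examples` returns fresh example IDs, so uploading the same list twice leaves duplicate examples in the dataset. A minimal sketch of filtering out questions that are already present before uploading follows; `filter_new_pairs` and `to_inputs_outputs` are hypothetical helpers, and the commented `list_examples` line assumes each example stores its question under the `"question"` key, as in the upload code above.

```python
def filter_new_pairs(pairs, existing_questions):
    """Keep only (question, answer) pairs whose question has not been seen yet."""
    seen = set(existing_questions)
    fresh = []
    for question, answer in pairs:
        if question not in seen:
            seen.add(question)
            fresh.append((question, answer))
    return fresh


def to_inputs_outputs(pairs):
    """Split (question, answer) pairs into the inputs/outputs lists used for bulk upload."""
    inputs = [{"question": question} for question, _ in pairs]
    outputs = [{"output": answer} for _, answer in pairs]
    return inputs, outputs


# With a real client, the existing questions could be collected like this:
# existing = [ex.inputs.get("question") for ex in client.list_examples(dataset_id=dataset_id)]
pairs = [("Q1", "A1"), ("Q2", "A2"), ("Q1", "A1")]
new_pairs = filter_new_pairs(pairs, existing_questions=["Q2"])
inputs, outputs = to_inputs_outputs(new_pairs)
print(inputs, outputs)
```

Comparing on the question text alone is a simplification; two examples with the same question but different reference answers would be treated as duplicates.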

In [7]:
from dotenv import load_dotenv
load_dotenv(dotenv_path="../../.env", override=True)

from langsmith import Client

client = Client()

# Only use dataset_id already defined (e.g., from previous cell). Do not look up by name.
if 'dataset_id' not in globals():
    raise RuntimeError("dataset_id is not set. Run the previous cell that defines dataset_id first.")

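# Prepare inputs and outputs for bulk creation (example_inputs defined above).
# Note: this uploads the same examples as the previous cell again, so the dataset
# ends up with duplicates (the returned example IDs below are new ones).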
inputs = [{"question": input_prompt} for input_prompt, _ in example_inputs]
outputs = [{"output": output_answer} for _, output_answer in example_inputs]

client.create_examples(
    inputs=inputs,
    outputs=outputs,
    dataset_id=dataset_id,
)


{'example_ids': ['d4032ed9-9e98-4606-b8ed-db468bc6cdd9',
  '92ed06f8-c58a-4bdc-8334-2adb3307efc2',
  'f6494af6-ede7-42f9-8975-d5ea888775bd',
  'ea281bd3-3f70-4f72-9c3f-46c25caa9788',
  'b7db606f-4490-42a5-8752-7d8b619241d9',
  'b1a962af-65b0-4f6c-a62d-78ec6f9b0f5e',
  'a4dd9382-e16b-4452-91d6-d2d5eb278ad3',
  '8a9db11c-ed21-4f3b-8ab9-79e21bd2dc3c',
  '28d51241-cde7-42b1-9d07-3abd415a849f',
  '68bd6225-63d1-4c88-ab0c-744d0c60f547'],
 'count': 10}

## Submitting another Trace

I've moved our RAG application definition to `app.py` so we can quickly import it.

In [9]:
from app import langsmith_rag



Let's ask another question to create a new trace!

In [11]:
question = "How do I set up tracing to LangSmith if I'm using LangChain?"
langsmith_rag(question)


'If you are using LangChain, you can set up tracing to LangSmith using its built-in integration. You primarily need to set a few environment variables to enable tracing. For more detailed configuration, refer to the "Trace With LangChain" guide.'