
Commit 4e3c996 (parent 9b6fe8e)

docs: rename 'manual review' feature to 'annotation' (langfuse#665)

* feat: update annotation docs
* add: changelog post

16 files changed, +52 -32 lines

next.config.mjs

Lines changed: 1 addition & 0 deletions

@@ -179,6 +179,7 @@ const nonPermanentRedirects = [
   ["/docs/sdk/typescript", "/docs/sdk/typescript/guide"],
   ["/docs/sdk/typescript-web", "/docs/sdk/typescript/guide-web"],
   ["/docs/scores/evals", "/docs/scores/model-based-evals"],
+  ["/docs/scores/manually", "/docs/scores/annotation"],
   ["/docs/scores/model-based-evals/overview", "/docs/scores/model-based-evals"],
   ["/docs/scores/model-based-evals/ragas", "/cookbook/evaluation_of_rag_with_ragas"],
   ["/docs/scores/model-based-evals/langchain", "/cookbook/evaluation_with_langchain"],

pages/blog/update-2023-07.mdx

Lines changed: 2 additions & 2 deletions

@@ -106,9 +106,9 @@ Until now, token counts needed to be ingested when logging new LLM calls. For Op
 
 Scores in Langfuse are essential to monitor the quality of your LLM app. Until now, scores were created via the Web SDK based on user feedback (e.g. thumbs up/down, implicit user feedback) or via the API (e.g. when running model-based evals).
 
-Many of you wanted to manually score generations in the UI as you or your team browse production logs. We've added this to the Langfuse UI:
+Many of you wanted to annotate generations in the UI as you or your team browse production logs. We've added this to the Langfuse UI:
 
-<Frame>![Add manual score in UI](/images/docs/score-manual.gif)</Frame>
+<Frame>![Annotate via the langfuse UI](/images/docs/score-manual.gif)</Frame>
 
 [Learn more](/docs/scores)
 

Lines changed: 18 additions & 0 deletions (new changelog post)

@@ -0,0 +1,18 @@
+---
+date: 2024-06-05
+title: Annotation via Langfuse UI
+description: Record human-in-the-loop evaluation by annotating traces and observations with scores.
+author: Marlies
+---
+
+import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";
+
+<ChangelogHeader />
+
+Introducing our revamped annotation workflow via the Langfuse UI allowing you to effectively collaborate with your team on human-in-the-loop evaluations.
+
+## Highlights
+- **Centralized score configuration management**: Standardize score names, data types and criteria project-wide.
+- **Enhanced annotation capabilities**: Score traces and observations across configured score dimensions.
+- **Improved data type support**: Annotate numeric, categorical, and binary scores.
+- **Comment feature**: Optionally add context to each score for improved data interpretation.

pages/docs/index.mdx

Lines changed: 1 addition & 1 deletion

@@ -29,7 +29,7 @@ import { ProductUpdateSignup } from "@/components/productUpdateSignup";
 - **Evals:** Collect and calculate scores for your LLM completions ([Scores & Evaluations](/docs/scores))
   - Run [model-based evaluations](/docs/scores/model-based-evals/overview) within Langfuse
   - Collect [user feedback](/docs/scores/user-feedback)
-  - [Manually score](/docs/scores/manually) observations in Langfuse
+  - [Annotate](/docs/scores/annotation) observations in Langfuse
 
 ### Test
 

pages/docs/integrations/haystack/example-python.md

Lines changed: 1 addition & 1 deletion

@@ -222,7 +222,7 @@ You can score traces using a number of methods:
 - Through user feedback
 - Model-based evaluation
 - Through SDK/API
-- Manually, in the Langfuise UI
+- Using annotation in the Langfuse UI
 
 The example below walks through a simple way to score the chat generator's response via the Python SDK. It adds a score of 1 to the trace above with the comment "Cordial and relevant" because the model's response was very polite and factually correct. You can then sort these scores to identify low-quality output or to monitor the quality of responses.
 

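For reference, here is a minimal sketch of what that Python SDK call could look like. It assumes the low-level `Langfuse` client and its `score()` method; the trace ID and score name below are illustrative placeholders rather than values from the Haystack example itself.

```python
from langfuse import Langfuse

# Credentials are read from LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST.
langfuse = Langfuse()

# Attach a human-review score to the trace produced by the chat generator run.
# "quality" and the trace_id value are placeholders for this sketch.
langfuse.score(
    trace_id="my-haystack-trace-id",
    name="quality",
    value=1,
    comment="Cordial and relevant",
)

# Scores are sent asynchronously; flush before a short-lived script exits.
langfuse.flush()
```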
pages/docs/scores/_meta.json

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 {
   "overview": "Overview",
-  "manually": "Manually in Langfuse UI",
+  "annotation": "Annotation via Langfuse UI",
   "user-feedback": "User Feedback",
   "model-based-evals": "Model-based Evaluation",
   "custom": "Custom via SDKs/API"

pages/docs/scores/annotation.mdx

Lines changed: 20 additions & 0 deletions

@@ -0,0 +1,20 @@
+---
+description: Annotate traces and observations with scores in the Langfuse UI to record human-in-the-loop evaluations.
+---
+
+# Annotation in Langfuse UI
+
+Collaborate with your team and add [`scores`](/docs/scores) via the Langfuse UI.
+
+<Frame>![Annotate in UI](/images/docs/score-manual.gif)</Frame>
+
+## Common use cases:
+
+- **Collaboration**: Enable team collaboration by inviting other internal members to annotate a subset of traces and observations. This human-in-the-loop evaluation can enhance the overall accuracy and reliability of your results by incorporating diverse perspectives and expertise.
+- **Annotation data consistency**: Create score configurations for annotation workflows to ensure that all team members are using standardized scoring criteria. Hereby configure categorical, numerical or binary score types to capture different aspects of your data.
+- **Evaluation of new product features**: This feature can be useful for new use cases where no other scores have been allocated yet.
+- **Benchmarking of other scores**: Establish a human baseline score that can be used as a benchmark to compare and evaluate other scores. This can provide a clear standard of reference and enhance the objectivity of your performance evaluations.
+
+## Get in touch
+
+Looking for a specific way to annotate your executions in Langfuse? Join the [Discord](/discord) and discuss your use case!

pages/docs/scores/manually.mdx

Lines changed: 0 additions & 19 deletions
This file was deleted.

pages/docs/scores/overview.mdx

Lines changed: 1 addition & 1 deletion

@@ -51,7 +51,7 @@ Most users of Langfuse ingest scores programmatically. These are common sources
 
 | Source | examples |
 | --- | --- |
-| [Manual evaluation (UI)](/docs/scores/manually) | Review traces/generations and add scores manually in the UI |
+| [Annotation (UI)](/docs/scores/annotation) | Annotate traces/generations by adding scores in the UI |
 | [User feedback](/docs/scores/user-feedback) | Explicit (e.g., thumbs up/down, 1-5 star rating) or implicit (e.g., time spent on a page, click-through rate, accepting/rejecting a model-generated output) |
 | [Model-based evaluation](/docs/scores/model-based-evals) | OpenAI Evals, Whylabs Langkit, Langchain Evaluators ([cookbook](/docs/scores/model-based-evals/langchain)), RAGAS for RAG pipelines ([cookbook](/docs/scores/model-based-evals/ragas)), custom model outputs |
 | [Custom via SDKs/API](/docs/scores/custom) | Run-time quality checks (e.g. valid structured output format), custom workflow tool for human evaluation |

pages/docs/sdk/python/decorators.mdx

Lines changed: 1 addition & 1 deletion

@@ -398,7 +398,7 @@ def llama_index_fn(question: str):
 
 ## Adding scores
 
-[Scores](https://langfuse.com/docs/scores/overview) are used to evaluate single observations or entire traces. They can created manually via the Langfuse UI or via the SDKs.
+[Scores](https://langfuse.com/docs/scores/overview) are used to evaluate single observations or entire traces. They can be created via our annotation workflow in the Langfuse UI or via the SDKs.
 
 | Parameter | Type | Optional | Description |
 | --------- | ------ | -------- | --------------------------------------------------------------------- |

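To make the SDK path concrete, below is a minimal sketch using the Python decorator integration that this page documents; the function body, score name, and value are assumptions chosen for illustration.

```python
from langfuse.decorators import langfuse_context, observe


@observe()
def answer(question: str) -> str:
    # Call your LLM here; the hard-coded reply stands in for a real completion.
    response = "42"

    # Attach a score to the trace created by @observe().
    # "correctness", the value and the comment are placeholders for this sketch.
    langfuse_context.score_current_trace(
        name="correctness",
        value=1,
        comment="Spot-checked by a human reviewer",
    )
    return response


answer("What is the answer to life, the universe and everything?")

# Flush queued events before a short-lived script exits.
langfuse_context.flush()
```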