diff --git a/docs/en/benchmark.md b/docs/en/benchmark.md
index 6b5743e0c5..4647da1d80 100644
--- a/docs/en/benchmark.md
+++ b/docs/en/benchmark.md
@@ -697,14 +697,14 @@ To ensure a fair comparison of these tools, we enlisted the assistance of human
   - [sbiobertresolve_rxnorm_augmented](https://nlp.johnsnowlabs.com/2024/01/17/sbiobertresolve_rxnorm_augmented_en.html): Trained with `sbiobert_base_cased_mli` embeddings.
   - [biolordresolve_rxnorm_augmented](https://nlp.johnsnowlabs.com/2024/05/06/biolordresolve_rxnorm_augmented_en.html): Trained with `mpnet_embeddings_biolord_2023_c` embeddings.

-- **GPT-4:** *GPT-4 Turbo* and *GPT-4o* models.
+- **GPT-4:** *GPT-4 (Turbo)* and *GPT-4o* models.

 - **Amazon:** *Amazon Comprehend Medical* service

 ### Evaluation Notes

 - Healthcare NLP returns up to 25 closest results, and Amazon Medical Comprehend returns up to five results, both sorted starting from the closest one. In contrast, the GPT-4 returns only one result, *so its scores are reflected similarly in both charts*.
-- Since the performance of GPT-4 Turbo and GPT-4o is almost identical according to the [official announcement](https://community.openai.com/t/announcing-gpt-4o-in-the-api/744700?page=3), and we used both versions for the accuracy calculation. Additionally, the GPT-4 returns **only one result**, which means you will see the same results in both evaluation approaches.
+- Since the performance of GPT-4 and GPT-4o is almost identical according to the [official announcement](https://community.openai.com/t/announcing-gpt-4o-in-the-api/744700?page=3), and we used both versions for the accuracy calculation. Additionally, the GPT-4 returns **only one result**, which means you will see the same results in both evaluation approaches.
 - Two approaches were adopted for evaluating these tools, given that the model outputs may not precisely match the annotations:
   - **Top-3:** Compare the annotations to see if they appear in the first three results.
   - **Top-5:** Compare the annotations to see if they appear in the first five results.
@@ -723,7 +723,7 @@ To ensure a fair comparison of these tools, we enlisted the assistance of human

 Since we don't have such a small dataset in real world, we calculated the price of these tools according to 1M clinical notes.

-- *Open AI Pricing:* We created a prompt to achieve better results, which costs $3.476 on GPT-4 and $1.738 GPT-4o model for the 79 documents. This means that for processing **1 million notes, the estimated cost would be $44,000 for the GPT-4 Turbo model** and **$22,000 for the GPT-4o model**.
+- *Open AI Pricing:* We created a prompt to achieve better results, which costs $3.476 on GPT-4 and $1.738 GPT-4o model for the 79 documents. This means that for processing **1 million notes, the estimated cost would be $44,000 for the GPT-4** and **$22,000 for the GPT-4o**.

 - *Amazon Comprehend Medical Pricing:* According to the price calculator, obtaining RxNorm predictions for **1M documents, with an average of 9,700 characters per document, costs $24,250**.

@@ -739,7 +739,7 @@ Based on the evaluation results:
 If you want to process **1M documents** and extract RxNorm codes for medication entities (*excluding the NER stage*), the total cost:
 - With Healthcare NLP is about **$4,500, including the infrastructure costs**.
 - **$24,250** with Amazon Comprehend Medical
-- **$44,000** with the GPT-4 Turbo and **$22,000** with the GPT-4o.
+- **$44,000** with the GPT-4 and **$22,000** with the GPT-4o.

 Therefore, **Healthcare NLP is almost 5 times cheaper than its closest alternative**, not to mention the accuracy differences (**Top 3: Healthcare NLP 82.7% vs Amazon 55.8% vs GPT-4 8.9%**).

@@ -768,7 +768,7 @@ Therefore, **Healthcare NLP is almost 5 times cheaper than its closest alternati
 $24,250
- GPT-4 Turbo
+ GPT-4 (Turbo)
 8.9%
 8.9%
 $44,000