
Questioning the Cost of Data translation with ChatGPT Turbo #3

Open
temberature opened this issue Mar 21, 2023 · 10 comments

Comments

@temberature

"We translated the alpaca_data.json to portuguese using ChatGPT. ....We paid around US$ 8.00 to translate the full dataset to portuguese."

The dataset is approximately 20 million tokens, and ChatGPT Turbo costs $0.002 per 1,000 tokens. I am curious why the total cost is not closer to $40.
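The arithmetic behind the $40 expectation can be sketched as follows (assuming the "20 million" figure refers to tokens and that everything is billed at the turbo input rate):

```python
# Back-of-envelope cost check: tokens divided into 1k blocks, times the rate.
def estimate_cost(total_tokens: int, usd_per_1k_tokens: float = 0.002) -> float:
    """Return the estimated API cost in USD."""
    return total_tokens / 1000 * usd_per_1k_tokens

print(estimate_cost(20_000_000))  # → 40.0
```

A cost well below this suggests either fewer tokens than assumed, or that only part of the dataset was actually translated.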

By the way, I appreciate you sharing the excellent suggestion for fine-tuning.

@stefangrotz

stefangrotz commented Mar 22, 2023

I would like to know this too. Right now I am at 4% with about $0.25 in costs, so it could come out to around 6€. I will report back once I have finished the German translation; right now it is really slow, almost as if it were stuck at 4%.

EDIT: it looks like I hit the rate limit. After some experiments I am now down to 25 parallel calls. This is very slow, but it seems to work.
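Capping parallelism and retrying with backoff, as described above, could be sketched like this (a minimal sketch; `call_with_backoff` and the stub `translate_item` are hypothetical, not from the repo):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_PARALLEL_REQUESTS = 25  # empirically below the API rate limit

def call_with_backoff(fn, *args, max_retries=5):
    """Retry fn with exponential backoff as a rough guard against
    rate-limit errors (hypothetical helper)."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())

def translate_item(item):
    # Stub standing in for the real API call.
    return item.upper()

with ThreadPoolExecutor(max_workers=MAX_PARALLEL_REQUESTS) as executor:
    futures = [executor.submit(call_with_backoff, translate_item, x)
               for x in ["hallo", "welt"]]
    results = [f.result() for f in as_completed(futures)]
```

The `max_workers` cap is what actually limits concurrency; the backoff only smooths over transient 429 responses.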

@sungkim11

I have no idea how they did it for US$8. My cost was close to US$25. I did not translate to Portuguese, though.

@agupta54

I am trying this with Hindi. The generation results don't seem so good.

@stefangrotz

I highly recommend translating the cleaned dataset instead: https://github.com/gururise/AlpacaDataCleaned

I will try to translate it into German in a few weeks when the cleaning has progressed further.

@MohamedYasser97

MohamedYasser97 commented Apr 10, 2023

If you look closely at translate_data.py:

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

# MAX_PARALLEL_REQUESTS and translate_item() are defined earlier in the file.

with open('alpaca_data.json', 'r') as f:
    data = json.load(f)

start = 40000
end = 55000
translated_data = []
data = data[start:end]

with ThreadPoolExecutor(max_workers=MAX_PARALLEL_REQUESTS) as executor:
    futures = {executor.submit(translate_item, item): item for item in data}

    for future in tqdm(as_completed(futures), total=len(futures), desc="Translating"):
        translated_data.append(future.result())
```

Only a chunk of the original instruction set is translated. You need to repeat this process by changing the start and end variables.
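The repeat-with-new-boundaries process could be automated by looping over chunk boundaries instead of editing `start` and `end` by hand (a sketch; the stub `translate_item` and the checkpointing path are stand-ins, not the repo's actual code):

```python
CHUNK_SIZE = 15_000  # same size as the hard-coded 40000-55000 slice

def translate_item(item):
    # Stub standing in for the real API call: reverse each string value.
    return {k: v[::-1] for k, v in item.items()}

def translate_in_chunks(data, chunk_size=CHUNK_SIZE):
    translated = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        translated.extend(translate_item(item) for item in chunk)
        # In practice you would checkpoint each chunk to disk here, e.g.
        # json.dump(translated, open(f"out_{start}.json", "w"))
    return translated

sample = [{"instruction": "abc"}, {"instruction": "def"}]
out = translate_in_chunks(sample, chunk_size=1)
```

Checkpointing per chunk also means a rate-limit failure only loses one chunk, not the whole run.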

@Aciid

Aciid commented May 14, 2023

Translating the whole alpaca-lora/alpaca_data_cleaned_archive.json is somewhere around ~~30,000-50,000€~~ 30-50€ according to tokens calculated from 1,000 randomly sampled prompts.

I'm curious: was the chunk 40000-55000 selected for translation in the project because of its quality, or is it just random?
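The sampling-based estimate mentioned above can be sketched like this (a rough heuristic of ~4 characters per token; for a real count you would use a tokenizer such as tiktoken, and the `estimate_dataset_cost` helper is hypothetical):

```python
import random

def rough_token_count(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def estimate_dataset_cost(items, sample_size=1000, usd_per_1k=0.002, seed=0):
    """Extrapolate total cost from a random sample of prompts.
    Doubles the token count to account for the translated output."""
    rng = random.Random(seed)
    sample = rng.sample(items, min(sample_size, len(items)))
    avg_tokens = sum(rough_token_count(t) for t in sample) / len(sample)
    total_tokens = avg_tokens * len(items) * 2  # input + output tokens
    return total_tokens / 1000 * usd_per_1k
```

Sampling 1,000 prompts keeps the estimate cheap while the extrapolation error stays small as long as prompt lengths are not heavily skewed.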

@MohamedYasser97

MohamedYasser97 commented May 14, 2023

> Translating the whole alpaca-lora/alpaca_data_cleaned_archive.json is somewhere around 30,000-50,000€ according to tokens calculated from 1,000 randomly sampled prompts.
>
> I'm curious: is the selected chunk 40000-55000 chosen for its quality, or is it just random?

I translated the complete alpaca_data.json to Arabic and it cost me $60 using GPT-3.5-turbo ($16-$18 of which were given for free by OpenAI, iirc).

@Aciid

Aciid commented May 14, 2023

Ah, it seems I miscalculated from the JSON structure (rows <-> instructions); thank you for the correction. I'll just run the whole translation, but I think the larger dataset will take a lot more time to fine-tune.

@Aciid

Aciid commented May 14, 2023

Dropping tqdm in favour of simply counting completed futures via a callback seems to double the overall speed of the threading job. There seem to be underlying issues with this library, which is used in many machine learning projects.
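The callback-based counting described above could look like this (a minimal sketch using `Future.add_done_callback`; the `work` stub stands in for the real translation call):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Each future increments a shared counter in its done-callback, so progress
# can be read from the main thread without wrapping as_completed() in tqdm.
done = 0
lock = threading.Lock()

def on_done(_future):
    global done
    with lock:
        done += 1

def work(x):
    # Stub standing in for the real translation call.
    return x * 2

items = list(range(100))
with ThreadPoolExecutor(max_workers=25) as executor:
    futures = [executor.submit(work, x) for x in items]
    for f in futures:
        f.add_done_callback(on_done)

results = [f.result() for f in futures]
print(f"completed {done}/{len(items)}")  # → completed 100/100
```

Since the callback is invoked by the worker thread as each future finishes (or immediately if it already has), the counter is accurate without the per-iteration overhead of a progress-bar redraw.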

@MohamedYasser97

> Dropping tqdm in favour of just counting via callback how many futures have been completed/not completed seems to double the overall speed of the threading job. There seems to be underlying issues with this library that is used in many machinelearning projects.

Nice. I hadn't thought of trying that before.
