
Questioning the Cost of Data translation with ChatGPT Turbo #3

Open
temberature opened this issue Mar 21, 2023 · 10 comments

Comments

@temberature

"We translated the alpaca_data.json to portuguese using ChatGPT. ....We paid around US$ 8.00 to translate the full dataset to portuguese."

The dataset is approximately 20 million tokens, and ChatGPT Turbo costs $0.002 per 1,000 tokens. I am curious why the total cost is not closer to $40.
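The arithmetic behind the $40 expectation can be sketched as follows (assuming the "20 million" figure refers to tokens and that everything is billed at the turbo input rate):

```python
# Back-of-envelope cost check: tokens divided into 1k blocks, times the rate.
def estimate_cost(total_tokens: int, usd_per_1k_tokens: float = 0.002) -> float:
    """Return the estimated API cost in USD."""
    return total_tokens / 1000 * usd_per_1k_tokens

print(estimate_cost(20_000_000))  # → 40.0
```

A cost well below this suggests either fewer tokens than assumed, or that only part of the dataset was actually translated.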

By the way, I appreciate you sharing the excellent suggestion for fine-tuning.

@stefangrotz

stefangrotz commented Mar 22, 2023

I would like to know this too. Right now I am at 4% with about $0.25 in costs, so it could come out to around 6€. I will report back once I have finished the German translation; right now it is really slow, almost as if it were stuck at 4%.

EDIT: it looks like I hit the rate limit. After some experiments I am now down to 25 parallel calls. This is very slow, but it seems to work.
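Capping parallelism and retrying with backoff, as described above, could be sketched like this (a minimal sketch; `call_with_backoff` and the stub `translate_item` are hypothetical, not from the repo):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_PARALLEL_REQUESTS = 25  # empirically below the API rate limit

def call_with_backoff(fn, *args, max_retries=5):
    """Retry fn with exponential backoff as a rough guard against
    rate-limit errors (hypothetical helper)."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())

def translate_item(item):
    # Stub standing in for the real API call.
    return item.upper()

with ThreadPoolExecutor(max_workers=MAX_PARALLEL_REQUESTS) as executor:
    futures = [executor.submit(call_with_backoff, translate_item, x)
               for x in ["hallo", "welt"]]
    results = [f.result() for f in as_completed(futures)]
```

The `max_workers` cap is what actually limits concurrency; the backoff only smooths over transient 429 responses.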

@sungkim11

I have no idea how they did it for US$8. My cost was close to US$25. I did not translate to Portuguese, though.

@agupta54

I am trying this with Hindi. The generation results don't seem so good.

@stefangrotz

I highly recommend translating the cleaned dataset instead: https://github.com/gururise/AlpacaDataCleaned

I will try to translate it into German in a few weeks when the cleaning has progressed further.

@MohamedYasser97

MohamedYasser97 commented Apr 10, 2023

If you look closely at translate_data.py:

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

# MAX_PARALLEL_REQUESTS and translate_item() are defined earlier in the file.

with open('alpaca_data.json', 'r') as f:
    data = json.load(f)

start = 40000
end = 55000
translated_data = []
data = data[start:end]

with ThreadPoolExecutor(max_workers=MAX_PARALLEL_REQUESTS) as executor:
    futures = {executor.submit(translate_item, item): item for item in data}

    for future in tqdm(as_completed(futures), total=len(futures), desc="Translating"):
        translated_data.append(future.result())
```

Only a chunk of the original instruction set is translated. You need to repeat this process by changing the start and end variables.
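The repeat-with-new-boundaries process could be automated by looping over chunk boundaries instead of editing `start` and `end` by hand (a sketch; the stub `translate_item` and the checkpointing path are stand-ins, not the repo's actual code):

```python
CHUNK_SIZE = 15_000  # same size as the hard-coded 40000-55000 slice

def translate_item(item):
    # Stub standing in for the real API call: reverse each string value.
    return {k: v[::-1] for k, v in item.items()}

def translate_in_chunks(data, chunk_size=CHUNK_SIZE):
    translated = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        translated.extend(translate_item(item) for item in chunk)
        # In practice you would checkpoint each chunk to disk here, e.g.
        # json.dump(translated, open(f"out_{start}.json", "w"))
    return translated

sample = [{"instruction": "abc"}, {"instruction": "def"}]
out = translate_in_chunks(sample, chunk_size=1)
```

Checkpointing per chunk also means a rate-limit failure only loses one chunk, not the whole run.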

@Aciid

Aciid commented May 14, 2023

Translating the whole alpaca-lora/alpaca_data_cleaned_archive.json is somewhere around ~~30,000-50,000€~~ 30-50€ according to tokens calculated from 1,000 randomly sampled prompts.

I'm curious: was the chunk 40000-55000 selected for translation in the project because of its quality, or is it just random?
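The sampling-based estimate mentioned above can be sketched like this (a rough heuristic of ~4 characters per token; for a real count you would use a tokenizer such as tiktoken, and the `estimate_dataset_cost` helper is hypothetical):

```python
import random

def rough_token_count(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def estimate_dataset_cost(items, sample_size=1000, usd_per_1k=0.002, seed=0):
    """Extrapolate total cost from a random sample of prompts.
    Doubles the token count to account for the translated output."""
    rng = random.Random(seed)
    sample = rng.sample(items, min(sample_size, len(items)))
    avg_tokens = sum(rough_token_count(t) for t in sample) / len(sample)
    total_tokens = avg_tokens * len(items) * 2  # input + output tokens
    return total_tokens / 1000 * usd_per_1k
```

Sampling 1,000 prompts keeps the estimate cheap while the extrapolation error stays small as long as prompt lengths are not heavily skewed.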

@MohamedYasser97

MohamedYasser97 commented May 14, 2023

> Translating the whole alpaca-lora/alpaca_data_cleaned_archive.json is somewhere around 30,000-50,000€ according to tokens calculated from 1,000 randomly sampled prompts.
>
> I'm curious: is the selected chunk 40000-55000 chosen for its quality, or is it just random?

I translated the complete alpaca_data.json to Arabic and it cost me $60 using GPT-3.5-turbo ($16-$18 of which were given for free by OpenAI, iirc).

@Aciid

Aciid commented May 14, 2023

Ah, it seems I miscalculated from the JSON structure (rows <-> instructions); thank you for the correction. I'll just run the whole translation, but I think the larger dataset will take a lot more time to fine-tune.

@Aciid

Aciid commented May 14, 2023

Dropping tqdm in favour of simply counting completed futures via a callback seems to double the overall speed of the threading job. There seem to be underlying issues with this library, which is used in many machine learning projects.
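The callback-based counting described above could look like this (a minimal sketch using `Future.add_done_callback`; the `work` stub stands in for the real translation call):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Each future increments a shared counter in its done-callback, so progress
# can be read from the main thread without wrapping as_completed() in tqdm.
done = 0
lock = threading.Lock()

def on_done(_future):
    global done
    with lock:
        done += 1

def work(x):
    # Stub standing in for the real translation call.
    return x * 2

items = list(range(100))
with ThreadPoolExecutor(max_workers=25) as executor:
    futures = [executor.submit(work, x) for x in items]
    for f in futures:
        f.add_done_callback(on_done)

results = [f.result() for f in futures]
print(f"completed {done}/{len(items)}")  # → completed 100/100
```

Since the callback is invoked by the worker thread as each future finishes (or immediately if it already has), the counter is accurate without the per-iteration overhead of a progress-bar redraw.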

@MohamedYasser97

> Dropping tqdm in favour of just counting via callback how many futures have been completed/not completed seems to double the overall speed of the threading job. There seems to be underlying issues with this library that is used in many machinelearning projects.

Nice. I hadn't thought of trying that before.
