Offer translations #157

Open
Wituareard opened this issue Apr 21, 2024 · 8 comments

Comments

@Wituareard
Collaborator

Discord discussion: https://discord.com/channels/1100491867675709580/1226136907046588476

@Wituareard Wituareard added the enhancement New feature or request label Apr 21, 2024
@joepio joepio self-assigned this Jul 2, 2024
@joepio
Collaborator

joepio commented Jul 2, 2024

This is a very big and important question, as very little AI risk material exists in most non-English languages. We need to get this right.

Wants

  • Translate to a lot of languages without too much effort (would be great if we can utilize AI to translate)
  • Keep pages up to date. We need a process for this; expect automated translations to require QC / fine-tuning.
  • Have custom pages for certain nationalities that are not translated.

Technical approaches

sveltekit-i18n

Seems great for non-markdown content. We'll need something custom for dealing with pages. May need quite a bit of logic to accommodate all of the above.

Split some folders, manual management

  • We have /en and /nl etc folders.
  • We can initially translate a whole batch of pages by manually running them through LLMs
  • Use the same file names in every language folder to identify matching pages
  • Manually keep track of when the article was last translated, e.g. with a date in a comment. Would be nice to have analytics insights to prioritize translations.
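A minimal sketch of how the "last translated" date could be used to flag outdated translations. The record shape, the field names and the idea of comparing against the date of the last English edit are assumptions for illustration, not something the site currently implements.

```typescript
// Hypothetical record for one translated page; field names are illustrative.
interface TranslationRecord {
  file: string; // file name shared across language folders, e.g. "risks.md"
  locale: string; // language folder, e.g. "nl"
  translatedAt: string; // ISO date noted in the translated file
}

// Returns the records whose English source was edited after the translation date.
function findStale(
  records: TranslationRecord[],
  sourceModified: Record<string, string>, // file name -> ISO date of last English edit
): TranslationRecord[] {
  return records.filter((r) => {
    const modified = sourceModified[r.file];
    if (modified === undefined) return false; // no known source, nothing to compare
    return Date.parse(modified) > Date.parse(r.translatedAt);
  });
}

const staleRecords = findStale(
  [
    { file: "risks.md", locale: "nl", translatedAt: "2024-06-01" },
    { file: "faq.md", locale: "nl", translatedAt: "2024-07-15" },
  ],
  { "risks.md": "2024-07-01", "faq.md": "2024-07-01" },
);
console.log(staleRecords.map((r) => `${r.locale}/${r.file}`));
```

Sorting the stale list by page views would give the analytics-based prioritization mentioned above.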

Translate articles on the fly

  • Translate using LLM if there is no hand-written article found
  • Show a notice that the article is machine-translated, with a "show original" button
  • Store cached responses somewhere to keep costs low
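The caching idea could be sketched as below. The `Translator` type is a stand-in for an LLM call (a real one would be asynchronous), and the in-memory `Map` is an assumption; a deployed version would need a persistent store.

```typescript
// Stand-in for an LLM call; a real one would be asynchronous and cost money.
type Translator = (text: string, locale: string) => string;

class TranslationCache {
  private store = new Map<string, string>();
  private translate: Translator;
  misses = 0; // number of times the translator was actually called

  constructor(translate: Translator) {
    this.translate = translate;
  }

  // Returns the cached translation, calling the translator only on a miss.
  get(text: string, locale: string): string {
    const key = `${locale}\u0000${text}`; // locale and source text identify an entry
    const cached = this.store.get(key);
    if (cached !== undefined) return cached;
    this.misses += 1;
    const result = this.translate(text, locale);
    this.store.set(key, result);
    return result;
  }
}

// Fake translator used only to demonstrate the caching behaviour.
const fakeTranslate: Translator = (text, locale) => `[${locale}] ${text}`;
const translationCache = new TranslationCache(fakeTranslate);
translationCache.get("Hello", "nl");
translationCache.get("Hello", "nl"); // served from the cache
translationCache.get("Hello", "de");
console.log(translationCache.misses);
```

Because the key includes the source text, editing the English article naturally invalidates its cached translations.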

Use an external CMS with built-in drafting / versioning

  • Can help integrating drafts / comments / pipelines for translations
  • Will cost money $$$
  • We chatted with Hubspot, which has a CMS that offers (AI gen) translations

Questions

  • Do we use the same handle for every page? That means we can't translate the handles, which might be bad for SEO. We can use a lookup function in the a.svelte component
  • How do we initially translate the articles? Do we translate
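One possible shape for the handle lookup raised in the first question, as a hedged sketch. The table contents, the handle names and the `/locale/handle` URL scheme are all hypothetical; how a.svelte would call this is not decided in this thread.

```typescript
// Hypothetical table: canonical (English) handle -> localized handle per locale.
const localizedHandles: Record<string, Record<string, string>> = {
  risks: { nl: "risicos", de: "risiken" },
  proposal: { nl: "voorstel" },
};

// Builds the href for a link; falls back to the canonical handle if no
// localized handle exists for that locale.
function localizeHref(handle: string, locale: string): string {
  const translated = localizedHandles[handle]?.[locale];
  return `/${locale}/${translated ?? handle}`;
}

console.log(localizeHref("risks", "nl"));
console.log(localizeHref("proposal", "de"));
```

Authors would keep writing links with the English handle, and the link component would resolve the localized one at render time.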

@joepio
Collaborator

joepio commented Jul 31, 2024

@Imotaru

@Imotaru

Imotaru commented Aug 2, 2024

From my call with Mia:

  • Have a button to switch between languages on the website (a good example is how Finnish government sites do it, e.g. https://valtioneuvosto.fi/en/decisions).
  • Keep track of whether a page was manually translated / verified or not. If the translation has not been manually verified, it could show a note: "This page is a machine translation. If you want to help improve the translation, click here." (The link could just go to Discord, or to an email to me or someone else on the team, if we want to keep it simple.)
  • A version number could indicate how important the change is: 1.0.1 to 1.0.2 is low prio, but 1.0.1 to 2.0.0 is a big change and high prio. A low prio change would be something like a typo fix that only affects English anyway, and a high prio change would be something that adds very important info, or a case where the old info no longer applies / would be misleading.
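The version-based priority could be computed along these lines. This is a sketch only: the three-part version format follows the examples above, while the "medium" level for a minor bump is an assumption that the call notes do not mention.

```typescript
type Priority = "high" | "medium" | "low" | "none";

// Parses "1.0.2" into [1, 0, 2]; throws on anything else.
function parseVersion(v: string): [number, number, number] {
  const parts = v.split(".").map(Number);
  if (parts.length !== 3 || parts.some((n) => !Number.isInteger(n) || n < 0)) {
    throw new Error(`Invalid version: ${v}`);
  }
  return parts as [number, number, number];
}

// Compares the version a translation was made from with the current source version.
function translationPriority(translatedFrom: string, current: string): Priority {
  const [oldMajor, oldMinor, oldPatch] = parseVersion(translatedFrom);
  const [newMajor, newMinor, newPatch] = parseVersion(current);
  if (newMajor !== oldMajor) return "high"; // e.g. 1.0.1 -> 2.0.0
  if (newMinor !== oldMinor) return "medium"; // assumed middle tier
  if (newPatch !== oldPatch) return "low"; // e.g. 1.0.1 -> 1.0.2
  return "none";
}

console.log(translationPriority("1.0.1", "1.0.2"));
console.log(translationPriority("1.0.1", "2.0.0"));
```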

@joepio
Collaborator

joepio commented Aug 8, 2024

@Imotaru Great insights.

  • The switch button should probably be a dropdown
  • Some versioning system sounds like a good idea. Maybe we can use something like v1.2ai to indicate that a translation was AI-generated, whereas v1.2m is a manually checked one. I think we could put this in the md metadata
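A small sketch of reading such a tag from the metadata, assuming the format is exactly `v<major>.<minor>` followed by `ai` or `m`. The function and type names are hypothetical.

```typescript
interface TranslationVersion {
  major: number;
  minor: number;
  origin: "ai" | "manual"; // "ai" suffix = generated, "m" suffix = manually checked
}

// Parses tags such as "v1.2ai" or "v1.2m"; throws on any other format.
function parseTranslationVersion(tag: string): TranslationVersion {
  const match = /^v(\d+)\.(\d+)(ai|m)$/.exec(tag);
  if (match === null) {
    throw new Error(`Unrecognized translation version: ${tag}`);
  }
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    origin: match[3] === "ai" ? "ai" : "manual",
  };
}

console.log(parseTranslationVersion("v1.2ai"));
console.log(parseTranslationVersion("v1.2m"));
```

The `origin` field is what a page template would check to decide whether to show a machine-translation notice.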

@joepio
Collaborator

joepio commented Sep 6, 2024

Current idea

  • Have a JS script that iterates over every markdown page and creates translations for every configured language using an LLM
  • Use Git to do version management
  • Every language can have an additional set of instructions to inject in the prompt
  • Every translated page can have an additional page-language-specific prompt
  • Store in the document itself whether the file was generated
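The layered prompt described in this list might be assembled as follows. This is an illustrative sketch: the config shape, the `locale/page` key convention and the base prompt wording are assumptions, and the actual LLM call is left out.

```typescript
// Hypothetical configuration; keys and wording are illustrative only.
interface TranslationConfig {
  languageInstructions: Record<string, string>; // keyed by locale, e.g. "nl"
  pageInstructions: Record<string, string>; // keyed by "locale/page", e.g. "nl/risks"
}

// Combines the base prompt with optional language-level and page-level instructions.
function buildPrompt(
  markdown: string,
  locale: string,
  page: string,
  config: TranslationConfig,
): string {
  const sections = [
    `Translate the following markdown page into the language with code "${locale}".`,
    "Preserve all markdown formatting and links.",
    config.languageInstructions[locale],
    config.pageInstructions[`${locale}/${page}`],
  ].filter((s): s is string => typeof s === "string" && s.length > 0);
  return `${sections.join("\n")}\n\n---\n${markdown}`;
}

const exampleConfig: TranslationConfig = {
  languageInstructions: { nl: "Use the informal 'je' form." },
  pageInstructions: { "nl/risks": "Keep the term 'x-risk' untranslated." },
};
console.log(buildPrompt("# Risks", "nl", "risks", exampleConfig));
```

Pages or languages without extra instructions simply get the base prompt, so the config can start empty and grow as reviewers find recurring problems.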

@anthonybailey
Collaborator

anthonybailey commented Sep 6, 2024

OK. So as you can see from the above note, Joep's background evolving plan and the one I'd arrived at independently were pretty similar and even complementary.

My evolving ideas are captured in the Discord Project Babel thread and at the time of writing conclude with a "Sonnet 3.5, tell me why I'm stupid about this" interaction linked from here.

Joep showed me some of the Svelte machinery in a 1:1 chat just now and bestowed the welcome revelation that Netlify is already ultimately serving static content from a CDN - all my worrying about how our main PauseAI website scales to extraordinary load during the most important hours of its life is pretty much handled. As expected, we do then have to thread the localization needle appropriately (decisions about whether to capture locale in route URL / fragment / cookie etc.); similarly, we might want to special-case localization for some really valuable dynamic components (or tweak their design), and the built-in search indexing will need some localization tweaks.

But ultimately it all looks essentially flexible and workable, with nice degradation properties when we mess up or things fail (very low effort LLM-powered l10n coverage at reasonable quality suffices for us.) Even though it has a worrying whiff of "invent it here, bespoke glue" developer enthusiasm, it truly might be simplest to just do it ourselves, rather than locate and early adopt some cutting-edge truly future-looking AI start-up's reimagining of localization, or make do with existing l10n frameworks that have incentive to wrongly emphasize highest possible quality and humans in the loop. We can pair at his convenience to share understanding of existing Svelte design choices and mitigate a very clear potential bus problem.

@anthonybailey
Collaborator

Here are some more specific notes.

The website wasn't designed to be l10nable, so there are scattered text edge-cases. Experiments with Claude suggest I can translate markdown posts directly. And most other parts of the site are post-like and probably convertible to that form.

Trivial and ignorable short messages (errors, special cases, understandable in context):

  • src/lib/components/{widget-consent/*, Edit.svelte, Toc.svelte, ToggleTheme.svelte}

This is opaque to me but almost text-free:

  • routes/[slug]/+page.svelte

There is l10n-worthy text in these, which I suspect can become markdown posts:

  • routes/about/+page.svelte
  • routes/+page.svelte, src/lib/config.ts - this is the homepage

As above, but we'd want to factor out some non-l10ned innards:

  • routes/communities/+page.svelte, routes/communities/communities.ts - factor out the map
  • routes/chat/+page.svelte, api/chat/+server.ts - factor out the bot

Plausibly become markdown, but they are patterns over static text data:

  • routes/pdoom/+page.svelte, src/lib/components/Doomers.svelte
  • routes/quotes/+page.svelte, routes/quotes/data.json

As above, but said data is drawn from AirTable:

  • routes/people/{+page.svelte, meta.ts}
  • routes/teams/{+page.svelte, meta.ts}

These are simple aggregators that just need the l10n choice pushed through:

  • routes/rss.xml/+server.ts
  • routes/sitemap.{xml, txt}/+server.ts

These special cases look too difficult to factor trivially. Perhaps leave as en-US only in round one?

  • routes/outcomes/+page.svelte, outcomes/tree.ts
  • routes/email-builder/*

joepio added a commit that referenced this issue Sep 22, 2024
@joepio
Collaborator

joepio commented Sep 22, 2024

I played around a bit with a script and got it to automatically generate translations based on a configuration and add subfolders.
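As a rough illustration of the subfolder layout such a script could produce (not the code from the referenced commit), the sketch below maps each source file to one output path per configured language. The `src/posts` path and the locale list are hypothetical.

```typescript
// Places the translated file in a locale subfolder next to the source file.
function translatedPath(sourcePath: string, locale: string): string {
  const slash = sourcePath.lastIndexOf("/");
  const dir = slash === -1 ? "" : sourcePath.slice(0, slash + 1);
  const file = sourcePath.slice(slash + 1);
  return `${dir}${locale}/${file}`;
}

// Lists every output file for the given sources and configured locales.
function planOutputs(sources: string[], locales: string[]): string[] {
  const outputs: string[] = [];
  for (const source of sources) {
    for (const locale of locales) {
      outputs.push(translatedPath(source, locale));
    }
  }
  return outputs;
}

console.log(planOutputs(["src/posts/risks.md"], ["nl", "de"]));
```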

4 participants