automatically detect language of text being processed #99

Open
BBC-Esq opened this issue Feb 18, 2024 · 5 comments

Comments

@BBC-Esq
Contributor

BBC-Esq commented Feb 18, 2024

Instead of requiring callers to pass a language identifier (e.g. de or pl), perhaps the library could autodetect the language of the input, even in a multilingual text string. A library such as langdetect might be used:

Example:

from langdetect import detect

# Example texts
text_de = "Dies ist ein deutscher Text."
text_pl = "To jest polski tekst."

# Detecting languages
language_de = detect(text_de)
language_pl = detect(text_pl)

print(f"The language of the first text is: {language_de}")  # Output: de
print(f"The language of the second text is: {language_pl}")  # Output: pl

The challenge would be detecting several languages within a single text string, since langdetect is geared towards detecting the "predominant" language of a string. But assuming we could parse the text intelligently (or there's a better library), this would remove the need to pass language identifiers to the methods in the WhisperSpeech library.
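One way the "parse it intelligently" idea might look is sketched below: split the text into sentences, detect each sentence on its own, and merge neighbouring sentences that share a language. This is only a sketch under assumptions. `detect_segments` is a hypothetical helper, not part of WhisperSpeech, the sentence splitter is deliberately naive, and the detector is passed in as a function so that langdetect's `detect` (or any alternative) can be plugged in.

```python
import re
from itertools import groupby
from typing import Callable, List, Tuple


def detect_segments(text: str, detect: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return (language, text) runs for a possibly multilingual string.

    Hypothetical helper: `detect` is any function mapping a string to a
    language code, e.g. `langdetect.detect`.
    """
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Tag every sentence with its detected language.
    tagged = [(detect(s), s) for s in sentences]
    # Merge adjacent sentences that were detected as the same language.
    return [
        (lang, " ".join(sentence for _, sentence in group))
        for lang, group in groupby(tagged, key=lambda pair: pair[0])
    ]
```

With langdetect installed, this could be called as `detect_segments(text, detect)` after `from langdetect import detect`. Short sentences give a detector little to work with, so per-sentence results are likely to be less reliable than detection on a whole paragraph.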

@jpc
Contributor

jpc commented Feb 18, 2024

Yeah, that would be a nice idea, although you correctly pointed out that language switching is going to be challenging.

We could try to train a model that would detect the language of each input token, but I am not sure how well it would work in practice.

Somewhat related: there is a different "API" in the Gradio demo where you can specify the language inside the text string with HTML-like tags. Have you seen it?

@BBC-Esq
Contributor Author

BBC-Esq commented Feb 18, 2024

I have another idea then. What about changing the default to "auto" so a user doesn't have to (but can) specify a language? For example, pipeline.py currently has:

    def generate_to_file(self, fname, text, speaker=None, lang='en', cps=15, step_callback=None):
        self.vocoder.decode_to_file(fname, self.generate_atoks(text, speaker, lang=lang, cps=cps, step_callback=None))

Could we set the default to lang='auto' instead? Then we'd simply modify the source code to run langdetect on the input text to get the language identifier whenever a language isn't specified. This would save users from having to look up the language codes themselves and specify one each time.

We'd still let a user specify the language explicitly; auto-detect would just be the default.

For example, users could rely on auto-detect for text in a single language, which langdetect should identify without trouble, while keeping the ability to specify languages explicitly when the text string is multilingual.
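A minimal sketch of the "auto" default, to make the proposal concrete. `resolve_lang` is a hypothetical helper, not an existing WhisperSpeech function, and the `'auto'` sentinel and the `'en'` fallback are assumptions. An explicit language always wins; detection only runs when the caller leaves the default in place.

```python
from typing import Callable, Optional


def resolve_lang(
    text: str,
    lang: str = "auto",
    detect: Optional[Callable[[str], str]] = None,
    fallback: str = "en",
) -> str:
    """Return the language code to use for `text` (hypothetical helper)."""
    # An explicitly passed language is never overridden.
    if lang != "auto":
        return lang
    if detect is None:
        # Third-party dependency, imported lazily so explicit callers don't need it.
        from langdetect import detect
    try:
        return detect(text)
    except Exception:
        # Detectors can fail on empty or symbol-only input; fall back
        # to the assumed default instead of crashing.
        return fallback
```

A method like generate_to_file could then call `lang = resolve_lang(text, lang)` before passing `lang` on. One caveat is that a detector may return a code the model does not support, so a real implementation would probably also check the result against the supported languages.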

@jpc
Contributor

jpc commented Feb 19, 2024

Yeah, that sounds nice. I’d like to move away from the lang= parameter but we could use this auto detection if there are no tags in the text.
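A rough sketch of that rule: honour tags when they are present and fall back to detection only when there are none. The tag syntax here (a two-letter code in angle brackets, such as `<pl>`, applying to the text that follows) is an assumption for illustration and may not match what the Gradio demo actually accepts. `split_by_tags` is a hypothetical helper.

```python
import re
from typing import Callable, List, Tuple

# Assumed tag syntax: "<pl>" switches the language for the text after it.
TAG = re.compile(r"<([a-z]{2})>")


def split_by_tags(
    text: str, detect: Callable[[str], str], default_lang: str = "en"
) -> List[Tuple[str, str]]:
    """Return (language, text) segments (hypothetical helper)."""
    # With one capture group, split() yields [head, lang1, text1, lang2, text2, ...].
    parts = TAG.split(text)
    if len(parts) == 1:
        # No tags at all: auto-detect the language of the whole string.
        stripped = text.strip()
        return [(detect(stripped), stripped)] if stripped else []
    segments = []
    # Text before the first tag gets the assumed default language.
    if parts[0].strip():
        segments.append((default_lang, parts[0].strip()))
    # Pair each tag with the text that follows it, skipping empty chunks.
    for lang, chunk in zip(parts[1::2], parts[2::2]):
        if chunk.strip():
            segments.append((lang, chunk.strip()))
    return segments
```

Because detection only runs on untagged input, existing tagged text would behave as before and the detector would never override a language the user chose.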

@BBC-Esq
Contributor Author

BBC-Esq commented Feb 19, 2024

Sounds good. It would require modifying the source code somewhat, and I might be able to take that on, but I haven't had time to analyze the code base further. If you're willing, could you briefly explain how the language parameter operates? I see the languages.py script, but could you walk through the flow, for example:

  1. lang= is passed to script A
  2. then it's passed to script B
  3. then the languages.py script is consulted...
  4. and so on...

I only ask because this is a hobby of mine and I'm not a programmer by trade, and a summary of the program's flow would save me a lot of time. For example, my basic understanding so far (using generate_to_file as an example) is that it:

  1. runs generate_atoks
  2. which runs t2s.generate
  3. and runs s2a.generate
  4. returning back to pipeline.py, then runs vocoder.decode_to_file using what it obtained from generate_atoks

As an amateur this took me hours to understand, so any help would be much appreciated since I'd like to contribute more efficiently!

@BBC-Esq BBC-Esq changed the title Implement auto language detection for multilingural text strings Implement auto language detection Feb 19, 2024
@BBC-Esq
Contributor Author

BBC-Esq commented Feb 19, 2024

@jpc Just to give you an idea, I didn't know what the word "python" even meant until approximately 9 months ago. ;-)

@BBC-Esq BBC-Esq changed the title Implement auto language detection automatically detect language of text being processed Feb 20, 2024