automatically detect language of text being processed #99

BBC-Esq · 2024-02-18T07:04:21Z

Instead of requiring callers to pass language identifiers (e.g. de or pl), perhaps the library could auto-detect the language, even in a multilingual text string. A library such as langdetect might be used:

Example:

    from langdetect import detect

    # Example texts
    text_de = "Dies ist ein deutscher Text."
    text_pl = "To jest polski tekst."

    # Detecting languages
    language_de = detect(text_de)
    language_pl = detect(text_pl)

    print(f"The language of the first text is: {language_de}")  # Output: de
    print(f"The language of the second text is: {language_pl}")  # Output: pl

The challenge would be detecting several languages within a single text string, since langdetect is geared towards detecting the "predominant" language of the whole string. But assuming we could split the text intelligently (or there's a better library), it would remove the need to pass language identifiers to the methods in the WhisperSpeech library.
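One way to work around the "predominant language" limitation is to split the text into sentences and detect each one separately. The sketch below is only an illustration of that idea: `detect_segments` and its `detect_fn` parameter are hypothetical names, not part of WhisperSpeech, and the sentence splitter is deliberately naive.

```python
import re
from itertools import groupby


def _default_detect(text):
    # Imported lazily so langdetect is only required when this default is used.
    from langdetect import detect
    return detect(text)


def detect_segments(text, detect_fn=_default_detect):
    """Split text into sentences and merge neighbours that share a language.

    Hypothetical helper: returns a list of (language_code, segment_text) tuples.
    """
    # Naive splitter: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tagged = [(detect_fn(s), s) for s in sentences]
    # Merge consecutive sentences detected as the same language.
    return [
        (lang, " ".join(s for _, s in group))
        for lang, group in groupby(tagged, key=lambda pair: pair[0])
    ]
```

Detection on very short sentences tends to be unreliable, and langdetect is non-deterministic unless `DetectorFactory.seed` is set, so a real implementation would probably need a fallback for low-confidence segments.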


jpc · 2024-02-18T12:22:28Z

Yeah, that would be a nice idea, although you correctly pointed out that language switching is going to be challenging.

We could try to train a model that would detect the language of each input token, but I am not sure how well it would work in practice.

A bit related: there is a different "API" in the Gradio demo where you can specify the language inside the text string with HTML-like tags. Have you seen it?

BBC-Esq · 2024-02-18T13:19:52Z

I have another idea then. What about changing the default to "auto" so a user doesn't have to (but can) specify a language? For example, pipeline.py contains:

    def generate_to_file(self, fname, text, speaker=None, lang='en', cps=15, step_callback=None):
        self.vocoder.decode_to_file(fname, self.generate_atoks(text, speaker, lang=lang, cps=cps, step_callback=None))

Could we set the default to lang='auto' instead? Then we'd simply modify the source code to run langdetect on the input text to get the language identifier whenever a language isn't specified, since "auto" is the default. This would save a user from having to look up the language codes and specify one each time.

We'd still keep the ability for a user to specify the language; auto-detect would just be the default.

For example, this would let users rely on auto-detect for text that is in only one language, which langdetect should have no problem identifying, while still keeping the ability to specify multiple languages when the text string is multilingual.
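A minimal sketch of that default, assuming a hypothetical helper called `resolve_lang` that `generate_to_file` could call before passing `lang` on. Neither the helper nor the `fallback` and `detect_fn` parameters exist in WhisperSpeech; they are here only to illustrate the proposal.

```python
def resolve_lang(text, lang="auto", detect_fn=None, fallback="en"):
    """Return an explicit language code, detecting it only when lang == 'auto'.

    Hypothetical helper, not part of the WhisperSpeech API.
    """
    if lang != "auto":
        return lang  # the caller chose a language, so respect it
    if detect_fn is None:
        # Lazy import: langdetect is only needed when auto-detection is used.
        from langdetect import detect as detect_fn
    try:
        return detect_fn(text)
    except Exception:
        # Detection can fail on input with no usable features (e.g. empty text).
        return fallback
```

The detected code would still need to be checked against the languages the model actually supports before it is used.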

jpc · 2024-02-19T15:57:17Z

Yeah, that sounds nice. I’d like to move away from the lang= parameter but we could use this auto detection if there are no tags in the text.
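The "auto-detect only when there are no tags" rule could look roughly like the sketch below. The tag syntax is an assumption: this thread only says the tags are HTML-like, so the `<en>` / `<pl>` form, the `TAG_RE` pattern, and the `split_by_tags_or_detect` name are all hypothetical.

```python
import re

# ASSUMPTION: tags look like <en> or <pl>; the real syntax is not given in this thread.
TAG_RE = re.compile(r"<([a-z]{2})>")


def split_by_tags_or_detect(text, detect_fn):
    """Return (lang, chunk) pairs from inline tags, or detect once if there are none."""
    parts = TAG_RE.split(text)
    if len(parts) == 1:
        # No tags at all: fall back to auto-detection on the whole text.
        return [(detect_fn(text), text.strip())]
    # parts = [text_before_first_tag, lang1, chunk1, lang2, chunk2, ...]
    pairs = []
    leading = parts[0].strip()
    if leading:
        # Untagged text before the first tag is auto-detected on its own.
        pairs.append((detect_fn(leading), leading))
    for lang, chunk in zip(parts[1::2], parts[2::2]):
        if chunk.strip():
            pairs.append((lang, chunk.strip()))
    return pairs
```

Here `detect_fn` would typically be `langdetect.detect`; it is passed in so the tag handling can be exercised without the detector.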

BBC-Esq · 2024-02-19T16:10:18Z

Sounds good. It would require modifying the source code somewhat, and I might be able to take that on, but I haven't had the time to analyze the code base further. If you're willing, can you explain briefly how the language parameter operates? I see the language script, but could you walk through the flow, for example:

lang= is passed to script A
then it's passed to script B
then the languages.py script is consulted...
and so on...

I only ask because this is a hobby of mine and I'm not a programmer by trade, and a summary of the program's flow would save me a lot of time. For example, my basic understanding so far (using generate_to_file as an example) is that it:

runs generate_atoks
which runs t2s.generate
and runs s2a.generate
returning back to pipeline.py, then runs vocoder.decode_to_file using what it obtained from generate_atoks
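The call order listed above can be traced with a toy stand-in. This is not WhisperSpeech code: the `Fake*` classes and `ToyPipeline` are hypothetical stubs that only record the order of calls as described in this comment, and their signatures are assumptions.

```python
class FakeT2S:
    def __init__(self, calls):
        self.calls = calls

    def generate(self, text):
        self.calls.append("t2s.generate")
        return f"stoks for {text!r}"  # placeholder for semantic tokens


class FakeS2A:
    def __init__(self, calls):
        self.calls = calls

    def generate(self, stoks):
        self.calls.append("s2a.generate")
        return f"atoks from {stoks}"  # placeholder for acoustic tokens


class FakeVocoder:
    def __init__(self, calls):
        self.calls = calls

    def decode_to_file(self, fname, atoks):
        self.calls.append("vocoder.decode_to_file")  # no file is written here


class ToyPipeline:
    def __init__(self):
        self.calls = []
        self.t2s = FakeT2S(self.calls)
        self.s2a = FakeS2A(self.calls)
        self.vocoder = FakeVocoder(self.calls)

    def generate_atoks(self, text, lang="en"):
        self.calls.append("generate_atoks")
        # Where lang goes from here is the open question in this thread,
        # so this stub does not forward it anywhere.
        return self.s2a.generate(self.t2s.generate(text))

    def generate_to_file(self, fname, text, lang="en"):
        self.calls.append("generate_to_file")
        self.vocoder.decode_to_file(fname, self.generate_atoks(text, lang=lang))
```

Calling `ToyPipeline().generate_to_file("out.wav", "hello")` and then inspecting `.calls` shows the steps in the order listed above.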

As an amateur this took me hours to understand, so any help would be much appreciated since I'd like to contribute more efficiently!

BBC-Esq · 2024-02-19T16:16:46Z

@jpc Just to give you an idea, I didn't know what the word "python" even meant until approximately 9 months ago. ;-)

BBC-Esq changed the title ~~Implement auto language detection for multilingural text strings~~ Implement auto language detection Feb 19, 2024

BBC-Esq changed the title ~~Implement auto language detection~~ automatically detect language of text being processed Feb 20, 2024
