A method to scrape all Forvo pronunciations to use the add-on offline #5

Closed
ghost opened this issue Nov 22, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@ghost

ghost commented Nov 22, 2021

A method to scrape all Forvo pronunciations is now available:

https://ankiweb.net/shared/info/560814150#:~:text=11%2F21%2F2021-,a%20method%20to%20scrape,-all%20Forvo%20pronunciations

The scraping method works perfectly. It can scrape absolutely all the audio files for each language.

For example, more than 500,000 Russian audio files can be scraped easily.

Would it be possible to download the audio files for one's target language and then bulk-add them into Anki with the add-on?

@Rascalov Rascalov added the enhancement New feature or request label Nov 22, 2021
@Rascalov
Owner

Good day,

I have not tried the script myself yet. I'm curious how it keeps Forvo from blocking an IP that is bulk-downloading every audio file for a language. I'll soon have some fresh throwaway IPs to test it on.
I'm also curious how the creator got a txt dump of all the words recorded on Forvo; that is very useful.

For now, I will mark this as an enhancement. This proposal could well solve the issues my original bulk scraper had, by having users give the add-on a dictionary file to work with.
I just need some time to figure out how best to do this. Taking an .mdx dictionary file as input seems like my best bet.
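
As a rough illustration of the "work from local files" idea, here is a minimal sketch that looks words up in a folder of already-downloaded audio instead of querying Forvo. It does not read .mdx files. The folder layout (one audio file per pronunciation, with the file name being the word) and the function names `index_audio_files` and `find_pronunciations` are assumptions made for this example, not part of the add-on or the script.

```python
from pathlib import Path

# Assumption: downloaded pronunciations are stored as .mp3 or .ogg files.
AUDIO_EXTENSIONS = {".mp3", ".ogg"}


def index_audio_files(audio_dir):
    """Map each casefolded word to the audio files found under audio_dir.

    Assumption: the file name without its extension is the word itself,
    e.g. "hallo.mp3". Files for the same word in different subfolders
    are grouped together.
    """
    index = {}
    for path in sorted(Path(audio_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
            index.setdefault(path.stem.casefold(), []).append(path)
    return index


def find_pronunciations(words, index):
    """Return {word: [audio paths]}; words without audio map to []."""
    return {word: index.get(word.casefold(), []) for word in words}
```

An add-on could build the index once per session and then call `find_pronunciations` with the words from the selected notes; words that come back with an empty list would still need another source.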

@ghost
Author

ghost commented Nov 22, 2021

The author of the script is from China. He is a member of the "FreeMDict" Telegram group:
https://t.me/freemdict

He obtained a list of 5.7 million URLs from Forvo using Python and spent several weeks doing it! He finished the work in August 2021 and shared the script with me.

The original author tried to scrape all the sounds from Forvo too quickly, and after querying 1 or 2 million URLs his IP was blocked in China. He then asked me to scrape from my IP. I did it slowly (at a speed of 400 Kb/s) and successfully queried all 5.7 million URLs 🥇

Forvo never blocked me. 👯 😃
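
To illustrate what "slowly" can mean in practice, here is a minimal sketch of capping the average download rate. This is not the original script: `throttle_delay` and `download_all` are hypothetical names, and `fetch` stands for whatever function actually retrieves a URL and returns its bytes.

```python
import time


def throttle_delay(bytes_so_far, elapsed_seconds, max_bytes_per_second):
    """Seconds to sleep so the average rate stays at or below the cap."""
    # Time the transfer should have taken at exactly the capped rate.
    min_elapsed = bytes_so_far / max_bytes_per_second
    return max(0.0, min_elapsed - elapsed_seconds)


def download_all(urls, fetch, max_bytes_per_second,
                 sleep=time.sleep, clock=time.monotonic):
    """Fetch each URL in turn, pausing after each one to respect the cap.

    fetch is a caller-supplied function (hypothetical here) that takes a
    URL and returns the response body as bytes.
    """
    start = clock()
    total_bytes = 0
    results = []
    for url in urls:
        data = fetch(url)
        total_bytes += len(data)
        results.append((url, data))
        sleep(throttle_delay(total_bytes, clock() - start, max_bytes_per_second))
    return results
```

The cap is left as a required parameter because "400 Kb/s" above could be read as kilobits or kilobytes per second. Sleeping on the running average, rather than a fixed pause per file, keeps the overall rate steady even when file sizes vary.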

In September I obtained 620,000 German pronunciations from Forvo and made an .mdx dictionary (on FreeMDict, private post).

Yesterday I ran the Python script and it is still working perfectly! I tried Russian, French and English, and all of them work OK.

Just follow the instructions on FreeMDict, where the script was posted:
https://forum.freemdict.com/t/topic/8100 (private post; registration required)

Please contact me on the FreeMDict forum. My nickname there is "tovaremeterio": https://forum.freemdict.com/u/tovaremeterio/

I want to scrape several languages (including Russian). We could split the work to avoid duplication of effort :D

@ghost
Author

ghost commented Nov 26, 2021

@Rascalov

Someone from https://forum.ru-board.com/ (aleven) is downloading all the Russian pronunciations from Forvo.com. He might finish within 3-4 days.

Please let me know if you are interested in the sounds.

@ghost
Author

ghost commented Apr 3, 2022

@Rascalov All Forvo audio files are now available to download:

https://forum.freemdict.com/t/topic/11947

You can use the Russian audio files for your language learning :D

This issue was closed.