
[PyTorch/LanguageModeling/BERT] BookCorpus Data Download - HTTPError: HTTP Error 403: Forbidden #536

Closed
paulhendricks opened this issue May 28, 2020 · 4 comments
Labels
bug Something isn't working

Comments


paulhendricks commented May 28, 2020

Related to Model/Framework(s)
PyTorch/LanguageModeling/BERT

Describe the bug
BookCorpus is no longer available from Smashwords.

To Reproduce

The following works perfectly.

git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples
cd PyTorch/LanguageModeling/BERT
bash scripts/docker/build.sh
bash scripts/docker/launch.sh

However, errors start here:

bash data/create_datasets_from_start.sh
root@dgxstation:/workspace/bert# bash data/create_datasets_from_start.sh
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Working Directory: /workspace/bert/data
Action: download
Dataset Name: bookscorpus

Directory Structure:
{ 'download': '/workspace/bert/data/download',
  'extracted': '/workspace/bert/data/extracted',
  'formatted': '/workspace/bert/data/formatted_one_article_per_line',
  'hdf5': '/workspace/bert/data/hdf5_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5',
  'sharded': '/workspace/bert/data/sharded_training_shards_256_test_shards_256_fraction_0.2',
  'tfrecord': '/workspace/bert/data/tfrecord_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5'}

0 files had already been saved in /workspace/bert/data/download/bookscorpus.
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
 Gave up to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
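
The log gives up after three identical 403s, which suggests a fixed retry budget. A minimal sketch of that retry-then-give-up pattern (all names here are hypothetical; this is not the repo's actual downloader code):

```python
import urllib.error

MAX_ATTEMPTS = 3  # the log shows exactly three failures before "Gave up"

def fetch_with_retries(url, opener, max_attempts=MAX_ATTEMPTS):
    """Try `opener(url)` up to `max_attempts` times; return None on give-up."""
    for _ in range(max_attempts):
        try:
            return opener(url)
        except urllib.error.HTTPError as e:
            print(f"Failed to open {url}")
            print(f"HTTPError: HTTP Error {e.code}: {e.reason}")
    print(f" Gave up to open {url}")
    return None

def forbidden(url):
    # Simulates Smashwords' current behavior: every request gets a 403.
    raise urllib.error.HTTPError(url, 403, "Forbidden", hdrs=None, fp=None)

result = fetch_with_retries("https://example.com/book.txt", forbidden)
# result is None: three failures, then give-up
```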

Expected behavior

BookCorpus should download. This looks similar to:

Looks like https://www.smashwords.com/ has stepped up their anti-web-crawling measures. In fact, after attempting the download, my IP address is now blocked from their website. Users should be warned of this before being asked to download the BookCorpus dataset, lest they get banned without realizing the consequences.

[Screenshot attached: Smashwords block page, 2020-05-28 1:42 PM]

Environment
Please provide at least:

  • Git commit: c76880b
  • Container version (e.g. pytorch:19.05-py3):
Step 1/15 : ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.03-py3
Step 2/15 : FROM nvcr.io/nvidia/tritonserver:20.03-py3-clientsdk as trt
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): 4x Tesla V100, DGX Station
  • CUDA driver version (e.g. 418.67): 418.126.02
@paulhendricks paulhendricks added the bug Something isn't working label May 28, 2020
@paulhendricks paulhendricks changed the title [PyTorch/LanguageModeling/BERT] What is the problem? [PyTorch/LanguageModeling/BERT] BookCorpus Data Download - HTTPError: HTTP Error 403: Forbidden May 28, 2020
swethmandava (Contributor) commented:

You can just ignore the bookscorpus files that are missing. They don't exist anymore on the web.

#247 #262
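
One way to "just ignore" the missing titles is to filter out files that never downloaded before building the corpus. A sketch under an assumed file layout (zero-byte `.txt` files standing in for failed downloads; this is not the repo's actual code):

```python
import tempfile
from pathlib import Path

def usable_book_files(download_dir):
    """Keep only book files that actually downloaded: present and non-empty."""
    return sorted(p for p in Path(download_dir).glob("*.txt")
                  if p.is_file() and p.stat().st_size > 0)

# Demo: one successful download next to one zero-byte 403 casualty.
d = Path(tempfile.mkdtemp())
(d / "ok.txt").write_text("some book text")
(d / "failed.txt").touch()  # 0 bytes: the download that hit HTTP 403
good = usable_book_files(d)
# good contains only ok.txt
```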


vilmara commented Jul 1, 2020

Hi @swethmandava, the script run_pretraining_lamb.sh throws a bunch of errors because it still references these datasets. Is there another script for pre-training BERT with the datasets that are still available?

swethmandava (Contributor) commented:

Could you open another bug with details of your errors? @vilmara


vilmara commented Jul 14, 2020

I have found how to work with only the English Wikipedia dataset, which is still available, ignoring the BookCorpus dataset.
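
In that spirit, the Wikipedia-only workaround amounts to dropping `bookscorpus` from the dataset list before the download step. A toy sketch (the dataset names are assumptions based on the log output above; check the repo's data preparation scripts for the names your checkout actually uses):

```python
# Datasets the full pipeline fetches; names assumed from the log output above.
ALL_DATASETS = ["bookscorpus", "wikicorpus_en"]

def datasets_to_download(skip=("bookscorpus",)):
    """Drop datasets that are no longer reachable on the web."""
    return [name for name in ALL_DATASETS if name not in skip]

print(datasets_to_download())  # prints ['wikicorpus_en']
```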
