
Error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) #4

Closed
DehengYang opened this issue Jul 30, 2020 · 10 comments


@DehengYang

I encountered the following error when running python3 -m sosed.run -i input_examples/input.txt -o output/output_example:

(sosed-env) dale@dale:~/sosed$ python3 -m sosed.run -i input_examples/input.txt -o output/output_example
Running tokenizer on repos listed in input_examples/input.txt
Parser successfully initialized.
Enry successfully initialized.
Tokenizing the repositories.
Tokenizing batch 1 out of 1.
  0%|                                                                                                                                   | 0/1 [00:00<?, ?it/s]Segmentation fault
  0%|                                                                                                                                   | 0/1 [05:51<?, ?it/s]
Traceback (most recent call last):
  File "/home/apr/anaconda3/envs/sosed-env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/apr/anaconda3/envs/sosed-env/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/apr/apr_tools/sosed/sosed/run.py", line 198, in <module>
    tokenize(args.input, args.output, args.batches, args.local, args.force)
  File "/home/apr/apr_tools/sosed/sosed/run.py", line 36, in tokenize
    run_tokenizer(tokenizer_args)
  File "/home/apr/apr_tools/sosed/tokenizer/identifiers_extractor/run.py", line 19, in main
    batch_size=int(args.batches), local=args.local)
  File "/home/apr/apr_tools/sosed/tokenizer/identifiers_extractor/parsing.py", line 304, in tokenize_repositories
    lang2files = recognize_languages(td)
  File "/home/apr/apr_tools/sosed/tokenizer/identifiers_extractor/parsing.py", line 212, in recognize_languages
    .format(enry_loc=get_enry(), directory=directory)))
  File "/home/apr/anaconda3/envs/sosed-env/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/apr/anaconda3/envs/sosed-env/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/apr/anaconda3/envs/sosed-env/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The content of the input.txt file is:

https://github.com/google/closure-compiler

Is there any way to deal with this error? Any guidance or solution would be much appreciated.

Thanks!

@egor-bogomolov
Collaborator

Hi! Could you specify your system details (mainly OS)? I've tried to reinstall Sosed from scratch and run your example, and it worked fine:
python3 -m sosed.run -i input.txt -o output/closure/
...

Query project: https://github.com/google/closure-compiler
https://github.com/st-js/st-js | similarity = 1.2118
https://github.com/google/compile-testing | similarity = 1.2119
https://github.com/rzwitserloot/lombok | similarity = 1.2121
https://github.com/cincheo/jsweet | similarity = 1.2137
https://github.com/peichhorn/lombok-pg | similarity = 1.2138
https://github.com/codemix/babel-plugin-typecheck | similarity = 1.2142
https://github.com/BladeRunnerJS/brjs | similarity = 1.2143
https://github.com/nativelibs4java/Scalaxy | similarity = 1.2144
https://github.com/ceylon/ceylon-compiler | similarity = 1.2144
https://github.com/google/closure-templates | similarity = 1.2145

I will look deeper into your stacktrace and try to understand the reason but information about your system will definitely be helpful.

@egor-bogomolov
Collaborator

egor-bogomolov commented Jul 30, 2020

Based on the stacktrace, something went wrong during the language recognition step.
Could you please run enry from the command line and attach the output?

tokenizer/identifiers_extractor/language_recognition/build/enry -json -mode files [path to some small directory with code]
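For anyone debugging a similar failure: the JSONDecodeError at "char 0" typically means json.loads received an empty string because the enry subprocess crashed before printing anything. A minimal defensive sketch of the call (the run_enry helper name is mine, not part of Sosed; the flags match the command above):

```python
import json
import subprocess

def run_enry(enry_path: str, directory: str) -> dict:
    """Run enry on `directory` and parse its JSON output, raising a
    descriptive error instead of a bare JSONDecodeError when the
    binary crashes (e.g. segfaults) and emits nothing."""
    result = subprocess.run(
        [enry_path, "-json", "-mode", "files", directory],
        capture_output=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        raise RuntimeError(
            f"enry exited with code {result.returncode} and produced "
            f"{len(result.stdout)} bytes of output; run it manually "
            "to check for a crash"
        )
    return json.loads(result.stdout)
```

A negative returncode on POSIX indicates the process was killed by a signal (-11 is SIGSEGV), which matches the "Segmentation fault" seen in the log above.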

@DehengYang
Author

Thank you so much for the detailed guidance!

My OS is Ubuntu 14.04.

I ran the command you provided and it reported a segmentation fault:

dale@dale:~/dale/sosed$ tokenizer/identifiers_extractor/language_recognition/build/enry -json -mode files Closure/
Segmentation fault
dale@dale:~/dale/sosed$ tokenizer/identifiers_extractor/language_recognition/build/enry -json -mode files Closure/src/com/google/javascript/jscomp/
Segmentation fault

I have no idea why this occurs...

Thank you again for your help!

@egor-bogomolov
Collaborator

Thanks a lot! What I suspect is:

  1. We use enry to detect languages in source code files
  2. Instead of building it from scratch when setting up the tokenizer, we just download a proper release version based on the OS
  3. In your case, the prebuilt version seems to fail :(

A workaround for now would be to build enry from scratch and check whether that works. We will think about the proper way to add a build step for the case when the prebuilt binary fails; see the issue in Buckwheat I've just opened.
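Before rebuilding, it can help to confirm programmatically that the prebuilt binary is the culprit. A minimal sketch (the binary_runs helper is my own illustration, not part of Sosed):

```python
import subprocess

def binary_runs(path: str) -> bool:
    """Return True if the executable at `path` can be launched and does
    not die from a signal. On POSIX, a negative returncode means the
    process was killed by a signal (-11 is SIGSEGV)."""
    try:
        result = subprocess.run([path], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode >= 0
```

If binary_runs("tokenizer/identifiers_extractor/language_recognition/build/enry") returns False on your machine, replacing the prebuilt binary (by rebuilding or downloading a different release) is the next step.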

@DehengYang
Author

Thank you so much for providing this workaround! The error no longer occurs since I downloaded the released version of enry and re-ran python3 -m sosed.run -i input_examples/input.txt -o output/output_example.

However, when it comes to downloading the data, the speed is extremely slow, especially for https://s3-eu-west-1.amazonaws.com/resources.ml.labs.aws.intellij.net/sosed/data_stars_100.tar.xz. I also tried a different network and downloading it via a web browser, but both failed. Is there any other way to obtain this tar.xz? Thank you!

@DehengYang
Author

Or could you please send me a copy of data_stars_100.tar.xz, data_stars_50.tar.xz, and data_stars_10.tar.xz via email? (My email is dehengyang@qq.com.) Any help would be sincerely appreciated.

Thank you again for your great help and patience!

@egor-bogomolov
Collaborator

I've sent you two emails. The archives are too large to send directly, so the emails contain links to Google Drive / OneDrive. If neither works for you, let's try some other file sharing service :)

@DehengYang
Author

Thank you so much for the great and timely help, which really helped me a lot! I now have all the data_stars_{}.tar.xz files and have decompressed them into the folder shown below:

[screenshot: decompressed data_stars_{}.tar.xz archives in the data folder]

However, a new error occurs when I run python3 -m sosed.run -i input_examples/input.txt -o output/closure/ --force or python3 -m sosed.run -i input_examples/input.txt -o output/closure/ (without --force):

(sosed-env) dale@dale:~/dale/sosed$ python3 -m sosed.run -i input_examples/input.txt -o output/closure/ --force
Running tokenizer on repos listed in input_examples/input.txt
Parser successfully initialized.
Enry successfully initialized.
Tokenizing the repositories.
Tokenizing batch 1 out of 1.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [07:08<00:00, 428.97s/it]
Tokenization successfully completed.
Found 1 batches with tokenized data.
Assigning clusters to tokens from vocab file.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 39002982/39002982 [00:23<00:00, 1677005.10it/s]
Computing vectors for 1 repositories.
Extracting stats for 1 repositories
Traceback (most recent call last):
  File "/home/apr/anaconda3/envs/sosed-env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/apr/anaconda3/envs/sosed-env/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/apr/apr_tools/sosed/sosed/run.py", line 201, in <module>
    analyze(processed_data, args.min_stars, args.closest, args.explain, args.metric, args.lang)
  File "/home/apr/apr_tools/sosed/sosed/run.py", line 107, in analyze
    clusters_info = get_clusters_info()
  File "/home/apr/apr_tools/sosed/sosed/utils.py", line 129, in get_clusters_info
    return pickle.load(filepath.open('rb'))
  File "/home/apr/anaconda3/envs/sosed-env/lib/python3.7/pathlib.py", line 1203, in open
    opener=self._opener)
  File "/home/apr/anaconda3/envs/sosed-env/lib/python3.7/pathlib.py", line 1058, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'data/clusters_info.pkl'

Any further guidance would be much appreciated.

Thank you again for your great kindness and help!

@egor-bogomolov
Collaborator

Well, as the error states, you are missing a pickle file :)
Fortunately, it is a small one, stored right in the repository. It should have been downloaded when you cloned the repo, but either something went wrong, or you deleted it afterwards :)
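For anyone hitting the same FileNotFoundError, a small defensive variant of the loading code makes the failure self-explanatory. A sketch (the default path comes from the traceback above; the restore hint assumes the file is tracked in git):

```python
import pickle
from pathlib import Path

def get_clusters_info(filepath: Path = Path("data/clusters_info.pkl")):
    """Load the cluster-info pickle, failing with an actionable message
    when the file is missing instead of a bare FileNotFoundError."""
    if not filepath.exists():
        raise FileNotFoundError(
            f"{filepath} is missing; it ships with the repository, so "
            f"`git checkout -- {filepath}` should restore it"
        )
    with filepath.open("rb") as f:
        return pickle.load(f)
```

This keeps the original behavior on the happy path and only changes what the user sees when the file was accidentally deleted.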

@DehengYang
Author

DehengYang commented Aug 2, 2020

Thank you very much for pointing this out! I am sorry that I unintentionally deleted this file at the very beginning, when I was trying to solve the error reported above by myself. Now everything works well, and I obtained the same output as yours:

Found tokenizer output in output/closure/.
If you want to re-run tokenizer, pass --force flag.
Found precomputed vectors in output/closure.
If you wan to re-run vector computation, pass --force flag.

-----------------------
Query project: https://github.com/google/closure-compiler
https://github.com/st-js/st-js | similarity = 1.2102
https://github.com/google/compile-testing | similarity = 1.2102
https://github.com/rzwitserloot/lombok | similarity = 1.2105
https://github.com/cincheo/jsweet | similarity = 1.2119
https://github.com/peichhorn/lombok-pg | similarity = 1.2121
https://github.com/codemix/babel-plugin-typecheck | similarity = 1.2126
https://github.com/nativelibs4java/Scalaxy | similarity = 1.2126
https://github.com/BladeRunnerJS/brjs | similarity = 1.2126
https://github.com/ceylon/ceylon-compiler | similarity = 1.2127
https://github.com/google/closure-templates | similarity = 1.2129
-----------------------

Sorry for my delayed reply, and thank you again for your continued help and great kindness. This issue is now solved.

Wish you a nice day!
