No such file Error: ../data/MSRvid2012 #2

Closed
huache opened this issue Feb 20, 2017 · 11 comments

Comments

@huache

huache commented Feb 20, 2017

When I run demo.sh in the examples directory, this error occurs:

word vectors loaded from ../data/glove.840B.300d.txt
word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=-1.000000
remove the first 0 principal components
Traceback (most recent call last):
  File "sim_sif.py", line 28, in <module>
    parr, sarr = eval.sim_evaluate_all(We, words, weight4ind, sim_algo.weighted_average_sim_rmpc, params)
  File "../src/eval.py", line 64, in sim_evaluate_all
    p,s = sim_getCorrelation(We, words, prefix+i, weight4ind, scoring_function, params)
  File "../src/eval.py", line 13, in sim_getCorrelation
    f = open(f,'r')
IOError: [Errno 2] No such file or directory: '../data/MSRvid2012'

Could you please tell me:

  1. Is this error detrimental to the model training?
  2. Where can I download the missing data file?

Thanks a lot!

@YingyuLiang
Collaborator

This happens because the function sim_evaluate_all evaluates over all the textual similarity datasets, but I only put a few example datasets online.

You can:

  1. Run sim_evaluate_one instead of sim_evaluate_all. The function sim_evaluate_one checks only one example dataset (see the sketch after this list).
  2. If you would like to evaluate over all the datasets, you will need to contact John Wieting to obtain them. These datasets are from https://github.com/jwieting/iclr2016, but both of us only put example datasets online, since some of the other datasets have copyright issues.
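
For reference, here is a minimal sketch of option 1 in examples/sim_sif.py, assuming sim_evaluate_one takes the same arguments as sim_evaluate_all (We, words, weight4ind, params, and the eval and sim_algo modules come from the script's existing setup):

    # Replace the sim_evaluate_all call shown in the traceback with:
    parr, sarr = eval.sim_evaluate_one(We, words, weight4ind,
                                       sim_algo.weighted_average_sim_rmpc, params)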

@huache
Author

huache commented Feb 21, 2017

My fault, I should have checked the comments in the source code before opening a new issue.
I will change the function and try again. Thank you for replying so quickly!

@huache
Author

huache commented Feb 21, 2017

I ran demo.sh again with the updated code, and the error still occurred when running sim_tfidf.py:

Traceback (most recent call last):
  File "sim_tfidf.py", line 22, in <module>
    weight4ind = data_io.getIDFWeight(wordfile)
  File "../src/data_io.py", line 355, in getIDFWeight
    g1x,g1mask,g2x,g2mask = getDataFromFile(prefix+f, words)
  File "../src/data_io.py", line 309, in getDataFromFile
    f = open(f,'r')
IOError: [Errno 2] No such file or directory: '../data/MSRvid2012'

I will change the farr list at line 326 of data_io.py so that only one element, ["MSRpar2012"], remains (as sketched below), and run it again.
Hope everything works.
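
A minimal sketch of that workaround in src/data_io.py, inside getIDFWeight (the exact line number may differ between versions of the repo):

    # Keep only the bundled example dataset in the list of files to read:
    farr = ["MSRpar2012"]  # originally listed every STS dataset file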

@YingyuLiang
Collaborator

Yes, data_io.getIDFWeight reads all the data files to compute the IDF weights. I forgot to change it to read only the example file; just changed it.

That being said, I recommend using more files when computing the IDF weights. Using a single file will probably lead to a less accurate estimate.
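
To see why, consider a toy IDF computation (illustrative only, not the repo's getIDFWeight): with few documents, the document frequency df(w) can only take a few distinct values, so the resulting weights are coarse.

    import math
    from collections import Counter

    def toy_idf(documents):
        # documents: a list of token lists, standing in for the data files
        N = len(documents)
        df = Counter()
        for doc in documents:
            for w in set(doc):  # count each word at most once per document
                df[w] += 1
        # idf(w) = log(N / df(w)); with small N, only a few distinct values exist
        return {w: math.log(float(N) / df[w]) for w in df}

Adding files increases N and spreads out the document frequencies, which sharpens the estimate.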

@huache
Author

huache commented Feb 23, 2017

Sorry to bother you again.
I have run demo.sh two more times, but it seems I need more memory. I got this error on a machine with 32 GB of RAM (28 GB free):

Traceback (most recent call last):
  File "train.py", line 237, in <module>
    model = proj_model_sentiment(We, params)
        ......
        ......
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

How much memory do I need to run demo.sh?

@YingyuLiang
Collaborator

Are you running it with the word vector file glove.840B.300d.txt?

This file contains a very large vocabulary (the file is about 6 GB), and loading the whole set of word vectors is probably what causes the memory issue. You can try using only the first 50,000 words (i.e., keep only the first 50,000 lines of the file), which barely affects the experiments.
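
A minimal way to do that truncation in Python (binary mode sidesteps any encoding issues; the output filename here is just an example, so point the demo's word vector path at whatever file you produce):

    # Keep only the first 50,000 lines of the GloVe file.
    with open('glove.840B.300d.txt', 'rb') as src, \
         open('glove.840B.300d.50k.txt', 'wb') as dst:
        for i, line in enumerate(src):
            if i >= 50000:
                break
            dst.write(line)

A shell one-liner such as head -n 50000 achieves the same thing.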

@huache
Author

huache commented Feb 24, 2017

Yes, I hadn't noticed that...
I ran demo.sh again with only the first 50,000 words, and it finally works.
Thank you so much!

@YingyuLiang
Collaborator

No problem!

@loretoparisi
Contributor

loretoparisi commented Jul 20, 2017

@huache in my case it is not an OOM problem, but it takes too long. How did you cap it to the first 50K words?

@huache
Author

huache commented Jul 21, 2017

@loretoparisi As YingyuLiang said, modify glove.840B.300d.txt (or whichever word vector file you use): "keep only the first 50,000 lines in the file".

@loretoparisi
Contributor

@huache OK, so I just take the first 50K rows of the text file.
