
RuntimeError: Vector for token darang has 230 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions. #57

Closed
aurooj opened this issue Nov 30, 2018 · 4 comments

Comments

aurooj commented Nov 30, 2018

Expected Behavior

Load FastText vectors

Environment:
Ubuntu 16.04
Python 3.6.4
Pytorch 0.4.1

Actual Behavior

Throws the following error:

  File "<stdin>", line 1, in <module>
  File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/fast_text.py", line 83, in __init__
    super(FastText, self).__init__(name, url=url, **kwargs)
  File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 72, in __init__
    self.cache(name, cache, url=url)
  File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 153, in cache
    word, len(entries), dim))
RuntimeError: Vector for token darang has 230 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions.

Steps to Reproduce the Problem

  1. Open a Python console
  2. Run the following code:
        from torchnlp.word_to_vector import FastText
        vectors = FastText()
  3. The error above is thrown.

PetrochukM (Owner) commented Nov 30, 2018

Hi there!

This code base works just fine:

>>> from torchnlp.word_to_vector import FastText
>>> vectors = FastText()
wiki.en.vec: 6.60GB [05:28, 21.4MB/s]
  0%|                                                                      | 0/2519371 [00:00<?, ?it/s]Skipping token 2519370 with 1-dimensional vector ['300']; likely a header
100%|██████████████████████████████████████████████████████| 2519371/2519371 [05:19<00:00, 7884.92it/s]
>>> vectors['derang']
tensor([ 0.3663, -0.2729, -0.5492,  0.2594, -0.2059, -0.6579,  0.3311, -0.3561,
        -0.0211, -0.4950,  0.2345,  0.5009,  0.1284, -0.0284,  0.4262,  0.1306,
         0.0736, -0.1482,  0.1071,  0.3749, -0.3396,  0.2189, -0.0933, -0.6236,
         0.2598,  0.1215,  0.3682,  0.0977,  0.3826,  0.2483,  0.0497,  0.3010,
         0.1354, -0.1132,  0.3291,  0.1183,  0.0862, -0.2852, -0.2880,  0.4053,
        -0.2330,  0.4374, -0.0842,  0.1315, -0.1406,  0.1829, -0.1734,  0.2383,
         0.1084,  0.0826, -0.2086,  0.1929,  0.4043, -0.0709,  0.0764, -0.2958,
         0.0644,  0.4529,  0.0039,  0.0321,  0.2296,  0.1703,  0.3169,  0.3324,
        -0.1998,  0.1265, -0.4961, -0.1126,  0.3073, -0.0775,  0.1673, -0.1065,
         0.1746, -0.3484, -0.1683,  0.3709,  0.1794, -0.1061, -0.3025,  0.0797,
         0.7037, -0.3384,  0.0654,  0.0047,  0.0675,  0.2268, -0.2287, -0.0502,
        -0.1027, -0.1576,  0.0931, -0.5580,  0.3006, -0.6026,  0.0979, -0.1607,
         0.2291,  0.2667, -0.2266,  0.3741, -0.3300,  0.2384, -0.1749,  0.1554,
        -0.0474,  0.1531, -0.2938,  0.3155,  0.1208, -0.4494,  0.0461,  0.1716,
        -0.3338,  0.1848,  0.2872, -0.4439, -0.0408,  0.0823, -0.3677,  0.0684,
         0.1709, -0.2148, -0.0842,  0.4830, -0.2937, -0.0804, -0.1713, -0.1559,
        -0.1759,  0.1321,  0.0048,  0.1698,  0.1019,  0.1963,  0.0649, -0.0431,
        -0.3056, -0.2303, -0.2197,  0.0797, -0.1263,  0.2204, -0.0276, -0.0039,
         0.2605, -0.0019, -0.0057,  0.3839,  0.5118,  0.0172,  0.1729, -0.0898,
         0.1416, -0.4514, -0.0455,  0.2964, -0.1571,  0.5023,  0.0768, -0.3092,
        -0.1937,  0.2595, -0.2484,  0.5232, -0.1842, -0.3832, -0.4159, -0.3071,
         0.3744,  0.5791,  0.0642, -0.1190, -0.0598,  0.0508,  0.1179,  0.0383,
        -0.3242,  0.1952, -0.0211, -0.1509, -0.4514, -0.1727, -0.0395, -0.4362,
         0.3575,  0.1249,  0.0599,  0.0472,  0.6013,  0.1357, -0.0937,  0.1200,
         0.1294,  0.4008, -0.1689,  0.1403, -0.7018, -0.0751, -0.6768, -0.1206,
         0.5307, -0.0490, -0.1083,  0.2631,  0.0748, -0.1714,  0.1157,  0.3715,
         0.6093,  0.3088,  0.4642,  0.0930,  0.0624, -0.0640,  0.1391, -0.7331,
        -0.1361, -0.0859, -0.3891,  0.0768, -0.4963,  0.0695, -0.3626,  0.8411,
         0.1532, -0.1458, -0.2630, -0.2151, -0.3103,  0.1697, -0.1632, -0.3756,
        -0.0803, -0.1968,  0.5468,  0.1773, -0.2990, -0.0036,  0.0758, -0.3991,
        -0.0524,  0.2814, -0.2947, -0.1843,  0.3038,  0.4715, -0.3175,  0.1851,
         0.0134, -0.1914,  0.4584,  0.2807,  0.1590,  0.3280,  0.3517,  0.3911,
         0.1309, -0.2509, -0.0008, -0.2097,  0.2152,  0.1403,  0.3071,  0.0773,
         0.1583, -0.6938,  0.0017, -0.3672,  0.1968,  0.0241, -0.5667,  0.1639,
         0.0899, -0.1899, -0.1444,  0.3414,  0.4791,  0.0642,  0.0116, -0.1053,
         0.5087,  0.0990,  0.1311,  0.3384, -0.3098, -0.1424, -0.0206, -0.1233,
         0.1623, -0.0964, -0.2188,  0.4343,  0.1835, -0.0482, -0.3140,  0.2048,
        -0.0942,  0.0402,  0.0923, -0.1973])

You must have modified the wiki.en.vec file. Try deleting it with rm -r .word_vectors_cache/wiki.en.vec and rerunning.

aurooj commented Nov 30, 2018

Thanks for your reply!

I am running into one more issue:

After downloading the pre-trained embeddings, when it starts loading them, my RAM fills up and the machine either freezes or throws a memory error.
The same happens when I try to load GloVe.

I am not an expert in NLP, nor do I have any prior experience with text data. All I want to do is load pre-trained embeddings as features for the words in my dataset.

I tried on two machines with the following configurations:
Machine 1:
Ubuntu 16.04
RAM 24GB
Python 3.6.4
Pytorch 0.4.1

Machine 2:
Ubuntu 14.04
RAM 16GB
Python 3.6.6
Pytorch 0.4.1

wiki.en.vec: 6.60GB [05:28, 21.4MB/s] <-- [this step finishes successfully.]
0%| | 0/2519371 [00:00<?, ?it/s]Skipping token 2519370 with 1-dimensional vector ['300']; likely a header
100%|██████████████████████████████████████████████████████| 2519371/2519371 [05:19<00:00, 7884.92it/s] <-- [My RAM starts filling up at this step resulting in freezing my machine or throwing the error I posted in this issue]

Your help is highly appreciated. Thanks.

PetrochukM (Owner) commented:

Yup, this is a known problem. You are attempting to load all 6 gigabytes of embeddings into memory. I'd use is_include to filter the embeddings down to your vocabulary.
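
For example, a minimal sketch of the is_include approach (the vocab set below is just a placeholder for your own dataset's vocabulary):

    from torchnlp.word_to_vector import FastText

    # Placeholder vocabulary -- build this set from the tokens in your dataset.
    vocab = {'the', 'cat', 'sat', 'on', 'mat'}

    # is_include is called for every token in wiki.en.vec; only tokens for which
    # it returns True are kept in memory, so roughly |vocab| vectors are loaded
    # instead of all 2.5 million.
    vectors = FastText(is_include=lambda token: token in vocab)

    print(vectors['cat'].shape)  # torch.Size([300])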

There are other, more sophisticated options as well, e.g. https://github.com/vzhong/embeddings

aurooj commented Dec 1, 2018

Ah, I see. Thank you, I will try these solutions.
