Russian word embedding models from RusVectores project #3

Closed
akutuzov opened this issue Nov 16, 2017 · 10 comments

@akutuzov

Name: word2vec-ruscorpora-300
Link: http://rusvectores.org/static/models/ruscorpora_1_300_10.bin.gz
Description: Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words.
Related papers: https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models
Preprocessing: The corpus was lemmatized and tagged with Universal PoS.
Parameters: vector size 300, window size 10
Code example:

import gensim

# Load the binary word2vec model; keys are lemmas with a Universal PoS suffix.
model = gensim.models.KeyedVectors.load_word2vec_format('ruscorpora_1_300_10.bin.gz', binary=True)
for n in model.most_similar(positive=['пожар_NOUN']):
    print(n[0], n[1])
пожарище_NOUN 0.618148565292
возгорание_NOUN 0.592390716076
сгорать_VERB 0.589370012283
наводнение_NOUN 0.575950324535
тушение_NOUN 0.572953224182
пожарный_NOUN 0.562128543854
поджог_NOUN 0.561940491199
сгорать::дотла_VERB 0.547737360001
поджигать_VERB 0.534844279289
незатушить_VERB 0.534272968769
@akutuzov
Author

OK, let's try :-)
By the way, what is the procedure for updating the resources? RusVectores rolls out new models from time to time.

@menshikh-iv
Contributor

@akutuzov No updates to existing models; we only add new ones. That is the best scheme for supporting backward compatibility :)

Thanks for the detailed info. Only one thing: as I remember, RusVectores used mystem for _POS. Can you add a function for converting word -> word_POS to the first message?

@akutuzov
Author

Well, it can be any tagger that supports Russian and Universal tags. Do we really need to clutter the issue with the preprocessing details?

@menshikh-iv
Contributor

@akutuzov This would be very desirable because this is not an obvious process (it is impossible to apply this model without pre-processing in the current case).

Your code example will be linked with this model and simplify life for users :)

@akutuzov
Author

akutuzov commented Nov 21, 2017

OK. It will look somewhat like this with UDPipe. Models for various languages can be downloaded here.

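def tag(word='пожар', modelfile='russian-syntagrus-ud-2.0-170801.udpipe'):
    from ufal.udpipe import Model, Pipeline
    # Model.load returns None if the model file cannot be read.
    model = Model.load(modelfile)
    # Tokenize, tag and parse with default settings; emit CoNLL-U.
    pipeline = Pipeline(model, 'tokenize', Pipeline.DEFAULT, Pipeline.DEFAULT, 'conllu')
    processed = pipeline.process(word)
    # Drop CoNLL-U comment lines, which start with '#'.
    output = [l for l in processed.split('\n') if not l.startswith('#')]
    # Columns 2 and 3 of each token line are LEMMA and UPOS.
    tagged = ['_'.join(w.split('\t')[2:4]) for w in output if w]
    return tagged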

This produces Universal PoS tags straight away.
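
To make the column slicing in the UDPipe function easier to follow, here is a minimal sketch of the same extraction run on a hand-written CoNLL-U string, so it needs neither UDPipe nor a model file. The sample text and the name `lemma_pos_pairs` are made up for illustration; real UDPipe output may contain additional comment lines and multi-word tokens.

```python
# Hand-written CoNLL-U fragment (illustrative only, not real UDPipe output).
# Tab-separated columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
SAMPLE_CONLLU = (
    "# sent_id = 1\n"
    "# text = пожар\n"
    "1\tпожар\tпожар\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def lemma_pos_pairs(conllu):
    """Return 'lemma_UPOS' strings for every token line in a CoNLL-U string."""
    pairs = []
    for line in conllu.split("\n"):
        # Skip blank sentence separators and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # Columns 2 and 3 hold LEMMA and UPOS.
        pairs.append("_".join(columns[2:4]))
    return pairs


print(lemma_pos_pairs(SAMPLE_CONLLU))
```

The resulting strings have the same lemma_POS shape as the keys used by the model.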
Another option is to use pymystem:

def tag(word='пожар'):
    from pymystem3 import Mystem
    m = Mystem()
    # Take the first token and its first analysis.
    processed = m.analyze(word)[0]
    lemma = processed["analysis"][0]["lex"].lower().strip()
    # "gr" is a grammar string, e.g. "S,муж,неод=им,ед"; the PoS is its first field.
    pos = processed["analysis"][0]["gr"].split(',')[0]
    pos = pos.split('=')[0].strip()
    tagged = lemma+'_'+pos
    return tagged

With Mystem output, one will have to convert RNC tags to UPOS, using this conversion table.
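
As a rough sketch of that conversion step, the snippet below maps a few Mystem tags to UPOS before building the lemma_POS key. The dictionary is a partial mapping written for illustration and the names `MYSTEM_TO_UPOS` and `to_model_key` are hypothetical; the conversion table mentioned above remains the authoritative source, and unknown tags fall back to `X` here only as an assumption.

```python
# Partial, illustrative Mystem -> UPOS mapping (assumption: consult the
# conversion table for the full and authoritative list).
MYSTEM_TO_UPOS = {
    "S": "NOUN",
    "V": "VERB",
    "A": "ADJ",
    "ADV": "ADV",
    "PR": "ADP",
    "NUM": "NUM",
}


def to_model_key(lemma, mystem_pos):
    """Build a 'lemma_UPOS' key from a lemma and a Mystem PoS tag."""
    # Tags missing from the partial mapping become 'X' (a choice made for this sketch).
    upos = MYSTEM_TO_UPOS.get(mystem_pos, "X")
    return lemma + "_" + upos


print(to_model_key("пожар", "S"))
```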

menshikh-iv added a commit that referenced this issue Dec 18, 2017
* add ruscorpora-300

* add ruscorpora to README
@menshikh-iv
Contributor

menshikh-iv commented Dec 18, 2017

Thanks @akutuzov, and sorry for the wait. This repo is now released, and the ruscorpora vectors are available through our API in gensim>=3.2.0:

import gensim.downloader as api

model = api.load("word2vec-ruscorpora-300")
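
Because the model's keys are lemmas with a PoS suffix, a lookup for a bare word will fail. A small helper along these lines may be convenient; it is only a sketch, the name `most_similar_tagged` is hypothetical, and it assumes the loaded object supports `in` checks and `most_similar`, as gensim's `KeyedVectors` does.

```python
def most_similar_tagged(model, lemma, pos, topn=10):
    """Query a lemma_POS-keyed model, returning [] for out-of-vocabulary keys."""
    key = lemma + "_" + pos
    # most_similar raises KeyError for unknown keys, so check membership first.
    if key not in model:
        return []
    return model.most_similar(positive=[key], topn=topn)


# Usage with the downloaded vectors (requires network access, so left commented out):
# import gensim.downloader as api
# model = api.load("word2vec-ruscorpora-300")
# for word, score in most_similar_tagged(model, "пожар", "NOUN"):
#     print(word, score)
```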

@akutuzov
Author

Thanks @menshikh-iv! One small fix: in the table, I see "License not found" for this model. However, we do have a license, it is Creative Commons Attribution 4.0 International :-).
We are now updating our models and will come up with more of them before the end of the month, I think.

@menshikh-iv
Contributor

@akutuzov License updated in fa71854 :)

@rahelmou

Sorry, I can't download the file. Could you fix the download link above?

@akutuzov
Author
