Russian word embedding models from RusVectores project #3

akutuzov · 2017-11-16T19:02:18Z

Name: word2vec-ruscorpora-300
Link: http://rusvectores.org/static/models/ruscorpora_1_300_10.bin.gz
Description: Word2vec Continuous Skipgram vectors trained on the full Russian National Corpus (about 250M words). The model contains 185K words.
Related papers: https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models
Preprocessing: The corpus was lemmatized and tagged with Universal PoS.
Parameters: vector size 300, window size 10
Code example:

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('ruscorpora_1_300_10.bin.gz', binary=True)
for word, similarity in model.most_similar(positive=['пожар_NOUN']):
    print(word, similarity)
пожарище_NOUN 0.618148565292
возгорание_NOUN 0.592390716076
сгорать_VERB 0.589370012283
наводнение_NOUN 0.575950324535
тушение_NOUN 0.572953224182
пожарный_NOUN 0.562128543854
поджог_NOUN 0.561940491199
сгорать::дотла_VERB 0.547737360001
поджигать_VERB 0.534844279289
незатушить_VERB 0.534272968769


akutuzov · 2017-11-16T19:03:09Z

OK, let's try :-)
By the way, what is the procedure for updating the resources? RusVectores rolls out new models from time to time.

menshikh-iv · 2017-11-17T08:03:47Z

@akutuzov No updates; we only add new models, which is the best scheme for preserving backward compatibility :)

Thanks for the detailed info. Only one thing: as I remember, RusVectores used mystem for _POS. Can you add a function for converting word -> word_POS to the first message?

akutuzov · 2017-11-17T14:54:36Z

Well, it can be any tagger supporting Russian and Universal Tags. Do we really need to clutter the issue with the preprocessing details?

menshikh-iv · 2017-11-21T01:05:26Z

@akutuzov This would be very desirable because this is not an obvious process (it is impossible to apply this model without pre-processing in the current case).

Your code example will be linked to this model and will make life easier for users :)

akutuzov · 2017-11-21T14:02:39Z

OK. It will look somewhat like this with UDPipe. Models for various languages can be downloaded here.

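def tag(word='пожар', modelfile='russian-syntagrus-ud-2.0-170801.udpipe'):
    from ufal.udpipe import Model, Pipeline
    # Loading the model is slow; in real code, load it once and reuse it.
    model = Model.load(modelfile)
    # Tokenize, then tag and parse with default settings; emit CoNLL-U.
    pipeline = Pipeline(model, 'tokenize', Pipeline.DEFAULT, Pipeline.DEFAULT, 'conllu')
    processed = pipeline.process(word)
    # Drop CoNLL-U comment lines, which start with '#'.
    output = [l for l in processed.split('\n') if not l.startswith('#')]
    # Columns 3 and 4 (indices 2 and 3) hold the lemma and the Universal PoS tag.
    tagged = ['_'.join(w.split('\t')[2:4]) for w in output if w]
    return tagged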

This produces Universal PoS tags straight away.
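To make the parsing step easier to follow, here is a minimal sketch that isolates it from UDPipe itself. It assumes the input is text in CoNLL-U layout (tab-separated columns ID, FORM, LEMMA, UPOS, ...). The function name `lemma_pos_from_conllu` and the sample string are made up for illustration; the sample is hand-written, not actual UDPipe output. Unlike the function above, it also skips multiword-token ranges and empty nodes, which is a design choice of this sketch.

```python
def lemma_pos_from_conllu(conllu_text):
    """Extract lemma_UPOS strings from text in CoNLL-U layout."""
    tagged = []
    for line in conllu_text.split('\n'):
        # Skip blank lines (sentence separators) and comment lines.
        if not line or line.startswith('#'):
            continue
        columns = line.split('\t')
        # CoNLL-U column order: ID, FORM, LEMMA, UPOS, XPOS, ...
        token_id, lemma, upos = columns[0], columns[2], columns[3]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not token_id.isdigit():
            continue
        tagged.append(lemma + '_' + upos)
    return tagged


# Hand-written sample in CoNLL-U layout (hypothetical, for illustration only).
sample = (
    "# sent_id = 1\n"
    "1\tпожар\tпожар\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
print(lemma_pos_from_conllu(sample))
```

With real data, the string returned by `pipeline.process(...)` would take the place of `sample`.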
Another option is to use pymystem:

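def tag(word='пожар'):
    from pymystem3 import Mystem
    m = Mystem()
    # Take the analysis of the first token only.
    processed = m.analyze(word)[0]
    # Use the first reading; "analysis" may be missing or empty for tokens Mystem cannot analyze.
    lemma = processed["analysis"][0]["lex"].lower().strip()
    # The PoS tag is the first item of the grammar string, before any ',' or '='.
    pos = processed["analysis"][0]["gr"].split(',')[0]
    pos = pos.split('=')[0].strip()
    tagged = lemma+'_'+pos
    return tagged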

With Mystem output, one will have to convert RNC tags to UPOS, using this conversion table.
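The conversion step could look roughly like the sketch below. The dictionary is only an illustrative subset of a Mystem-to-UPOS mapping and is an assumption of this sketch; the conversion table mentioned above remains the authoritative source. The names `MYSTEM_TO_UPOS` and `to_upos`, and the fallback to `X` for unmapped tags, are likewise hypothetical choices.

```python
# Illustrative subset of a Mystem-to-UPOS mapping (an assumption of this sketch,
# not a copy of the conversion table referenced in the thread).
MYSTEM_TO_UPOS = {
    'S': 'NOUN',
    'V': 'VERB',
    'A': 'ADJ',
    'ADV': 'ADV',
    'NUM': 'NUM',
    'PR': 'ADP',
    'PART': 'PART',
    'INTJ': 'INTJ',
    'SPRO': 'PRON',
}


def to_upos(tagged, mapping=MYSTEM_TO_UPOS, unknown='X'):
    """Turn 'lemma_MYSTEMTAG' into 'lemma_UPOS'; unmapped tags become `unknown`."""
    # Split on the last underscore so lemmas containing '_' stay intact.
    lemma, _, pos = tagged.rpartition('_')
    return lemma + '_' + mapping.get(pos, unknown)


print(to_upos('пожар_S'))    # пожар_NOUN
print(to_upos('сгорать_V'))  # сгорать_VERB
```

The output of the Mystem-based `tag()` can be passed straight to `to_upos()` to get a key in the format the model expects.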


menshikh-iv · 2017-12-18T11:17:17Z

Thanks @akutuzov, sorry for the wait. This repo is now released, and the ruscorpora vectors are available through our API in gensim>=3.2.0:

import gensim.downloader as api

model = api.load("word2vec-ruscorpora-300")
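Since the keys in this model are lemma_UPOS strings, a lookup with a bare word will miss. Below is a small sketch of checking keys before querying. It relies only on the `in` operator, which gensim's KeyedVectors supports; the helper name `split_by_vocabulary` and the toy vocabulary are hypothetical and are there so the sketch runs without downloading the model.

```python
def split_by_vocabulary(keys, vectors):
    """Split keys into those present in `vectors` and those missing from it."""
    known = [k for k in keys if k in vectors]
    missing = [k for k in keys if k not in vectors]
    return known, missing


# Stand-in vocabulary for illustration; with gensim, pass the object
# returned by api.load("word2vec-ruscorpora-300") instead.
toy_vocabulary = {'пожар_NOUN', 'сгорать_VERB'}
known, missing = split_by_vocabulary(['пожар_NOUN', 'пожар'], toy_vocabulary)
print(known)    # ['пожар_NOUN']
print(missing)  # ['пожар']
```

Only the keys in `known` are safe to pass to `model.most_similar(positive=...)`.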

akutuzov · 2017-12-18T11:27:05Z

Thanks @menshikh-iv! One small fix: in the table, I see "License not found" for this model. However, we do have a license, it is Creative Commons Attribution 4.0 International :-).
We are now updating our models and will come up with more of them before the end of the month, I think.

menshikh-iv · 2017-12-18T11:30:52Z

@akutuzov Updated the license in fa71854 :)

rahelmou · 2018-07-23T01:53:11Z

Sorry, I can't download the file. Could you fix the download link above?

akutuzov · 2018-07-23T14:49:25Z

@rahelmou http://rusvectores.org/en/models/

menshikh-iv mentioned this issue Dec 18, 2017

Add ruscorpora model. Fix #3 #13

Merged

menshikh-iv closed this as completed in #13 Dec 18, 2017

menshikh-iv added a commit that referenced this issue Dec 18, 2017

Add ruscorpora model. Fix #3 (#13)

e908b90

* add ruscorpora-300
* add ruscorpora to README

GraphGrailAi mentioned this issue May 2, 2019

Grammatically-correct response lukalabs/cakechat#54

Closed
