[WIP GSOC 2018]: Multistream API, Part 1 #2048
Closed: persiyanov wants to merge 51 commits into piskvorky:develop from persiyanov:feature/gsoc-multistream-api-1
Changes from 32 commits
2724812
Add wikipedia parsing script
f893487
Track performance metrics in base_any2vec.py
f03d9e6
reset performance metrics in beginning of epoch
55517fd
add tracking CPU load + benchmarking script
8ae3248
Some bug fixes
29d2dba
prettify logging results in benchmark script
5e47dfa
More prettifying in benchmark script
389293f
add SUM cpu load
b1765e7
remove sent2vec from script
4d50cff
First approach to multistream, only for word2vec right now
48f498c
adapted benchmarking script to multistream
a2a6e4f
fix
b9668ee
fix bench script
2765207
Measure vocabulary building time
d110f26
fix
c9e507f
multiprocessing multistream
44bc8f8
add w2v benchmarking script
99d0fc0
multiprocessinng for scan_vocab
ffd5204
fixes
8a0badd
without progress_per at all
f21b3a2
Merge branch 'develop' into feature/gsoc-multistream-api-1
75cac9d
Merge branch 'feature/gsoc-multistream-api-1' of https://github.com/p…
2472b2b
get rid of job_producer, make batches in _worker_loop
4e0c103
fix
3dd8a64
fix
d389847
make cythonlinesentence. not working, but at least compiles now
4c1d3a6
add operator>>
36882a0
change ifstream to ifstream*
37b55f3
fastlinesentence in c++
97f834d
almost working version; works on large files, but one bug is to be fixed
944e3dc
remove batch iterator from pyx
0081f01
working code
fe66246
remove build_vocab changes
491a087
approaching to fully nogil cython _worker_loop
15e07ae
wrapper fix
5cad26b
one more fix
495c4dc
more fixes
8b29df8
upd
2119c3a
try to cythonize batch preparation
3506ec9
it compiles
62f71ee
prepare batch inside nogil section in a while loop
8924af5
compiles
53fedfa
some bugfixes
c679bc6
add cpu_distribution script
921ff38
accept CythonLineSentence into _worker_loop, not filename
9e4ed0e
make CythonLineSentence iterable
f9ea23b
fix
cb8bb71
python iterators without gil
6162b50
fix
c14fca1
fixes
440c6df
last changes
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
#include <stdexcept> | ||
#include "linesentence.h" | ||
|
||
|
||
FastLineSentence::FastLineSentence(const std::string& filename) : fs_(filename) { } | ||
|
||
std::vector<std::string> FastLineSentence::ReadSentence() { | ||
std::string line, word; | ||
// Test the read itself: eof() only becomes true after a read has already failed. | ||
if (!std::getline(fs_, line)) { | ||
throw std::runtime_error("EOF occurred in C++!"); | ||
} | ||
std::vector<std::string> res; | ||
|
||
std::istringstream iss(line); | ||
while (iss >> word) { | ||
res.push_back(word); | ||
} | ||
|
||
return res; | ||
} |
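For readers more at home in Python, here is a minimal sketch of what `ReadSentence` is meant to do: read one line, split it on whitespace, and signal an error once the input is exhausted. The name `read_sentence` and the use of `EOFError` are illustrative assumptions, not part of this PR.

```python
import io


def read_sentence(stream):
    """Return the next line of `stream` split on whitespace.

    Hypothetical helper: raises EOFError once the stream is exhausted,
    playing the role of the std::runtime_error in the C++ reader.
    """
    line = stream.readline()
    if line == "":  # readline() returns "" only at end of input
        raise EOFError("no more sentences")
    return line.split()


# In-memory stand-in for a corpus file; the empty middle line is deliberate.
stream = io.StringIO("the quick brown fox\n\njumps over\n")
sentences = []
while True:
    try:
        sentences.append(read_sentence(stream))
    except EOFError:
        break
print(sentences)
```

A blank line yields an empty sentence rather than an error, so only real end of input stops the loop.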
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
#pragma once | ||
|
||
#include <fstream> | ||
#include <sstream> | ||
#include <string> | ||
#include <vector> | ||
|
||
|
||
class FastLineSentence { | ||
public: | ||
explicit FastLineSentence(const std::string& filename); | ||
|
||
std::vector<std::string> ReadSentence(); | ||
private: | ||
std::ifstream fs_; | ||
}; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
from __future__ import unicode_literals | ||
from __future__ import print_function | ||
|
||
import logging | ||
import argparse | ||
import json | ||
import copy | ||
# import yappi | ||
import os | ||
import glob | ||
|
||
from gensim.models import base_any2vec | ||
from gensim.models.fasttext import FastText | ||
from gensim.models.word2vec import Word2Vec | ||
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument | ||
from gensim.models.word2vec import LineSentence | ||
|
||
|
||
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
SUPPORTED_MODELS = { | ||
'fasttext': FastText, | ||
'word2vec': Word2Vec, | ||
'doc2vec': Doc2Vec, | ||
} | ||
|
||
|
||
def print_results(model_str, results): | ||
logger.info('----- MODEL "{}" RESULTS -----'.format(model_str).center(50)) | ||
logger.info('\t* Vocab time: {} sec.'.format(results['vocab_time'])) | ||
logger.info('\t* Total epoch time: {} sec.'.format(results['total_time'])) | ||
# logger.info('\t* Avg queue size: {} elems.'.format(results['queue_size'])) | ||
logger.info('\t* Processing speed: {} words/sec'.format(results['words_sec'])) | ||
logger.info('\t* Avg CPU loads: {}'.format(results['cpu_load'])) | ||
logger.info('\t* Sum CPU load: {}'.format(results['cpu_load_sum'])) | ||
|
||
|
||
def benchmark_model(input_streams, model, window, workers, vector_size): | ||
if model == 'doc2vec': | ||
kwargs = { | ||
'input_streams': [TaggedLineDocument(inp) for inp in input_streams] | ||
} | ||
else: | ||
kwargs = { | ||
'input_streams': [inp for inp in input_streams] # hack for CythonLineSentence | ||
} | ||
|
||
kwargs['size'] = vector_size | ||
|
||
kwargs['window'] = window | ||
|
||
kwargs['workers'] = workers | ||
kwargs['iter'] = 1 | ||
|
||
logger.info('Creating model with kwargs={}'.format(kwargs)) | ||
|
||
# Training model for 1 epoch. | ||
# yappi.start() | ||
SUPPORTED_MODELS[model](**kwargs) | ||
# yappi.get_func_stats().print_all() | ||
# yappi.get_thread_stats().print_all() | ||
|
||
return copy.deepcopy(base_any2vec.PERFORMANCE_METRICS) | ||
|
||
|
||
def do_benchmarks(input_streams, models_grid, vector_size, workers_grid, windows_grid, label): | ||
full_report = {} | ||
|
||
for model in models_grid: | ||
for window in windows_grid: | ||
for workers in workers_grid: | ||
model_str = '{}-{}-window-{:02d}-workers-{:02d}-size-{}'.format(label, model, window, workers, vector_size) | ||
|
||
logger.info('Start benchmarking {}.'.format(model_str)) | ||
results = benchmark_model(input_streams, model, window, workers, vector_size) | ||
|
||
print_results(model_str, results) | ||
|
||
full_report[model_str] = results | ||
|
||
logger.info('Benchmarking completed. Here are the results:') | ||
for model_str in sorted(full_report.keys()): | ||
print_results(model_str, full_report[model_str]) | ||
|
||
fout_name = '{}-report.json'.format(label) | ||
with open(fout_name, 'w') as fout: | ||
json.dump(full_report, fout) | ||
|
||
logger.info('Saved metrics report to {}.'.format(fout_name)) | ||
|
||
|
||
if __name__ == '__main__': | ||
parser = argparse.ArgumentParser(description='GSOC Multistream-API: evaluate performance ' | ||
'metrics for any2vec models') | ||
parser.add_argument('--input', type=str, required=True, help='Input file, or a glob pattern matching several files in `multistream` mode.') | ||
parser.add_argument('--models-grid', nargs='+', type=str, default=sorted(SUPPORTED_MODELS)) | ||
parser.add_argument('--size', type=int, default=300) | ||
parser.add_argument('--workers-grid', nargs='+', type=int, default=[1, 4, 8, 10, 12, 14]) | ||
parser.add_argument('--windows-grid', nargs='+', type=int, default=[10]) | ||
parser.add_argument('--label', type=str, default='untitled') | ||
|
||
args = parser.parse_args() | ||
|
||
input_ = os.path.expanduser(args.input) | ||
input_streams = glob.glob(input_) | ||
logger.info('Glob found {} input streams. List: {}'.format(len(input_streams), input_streams)) | ||
|
||
do_benchmarks(input_streams, args.models_grid, args.size, args.workers_grid, args.windows_grid, args.label) |
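The nested loops in `do_benchmarks` visit every (model, window, workers) combination and derive a report key for each. A small sketch of that enumeration, assuming `itertools.product` as a stand-in for the three nested loops; `run_labels` is a hypothetical helper and does not train anything:

```python
from itertools import product


def run_labels(label, models_grid, windows_grid, workers_grid, vector_size):
    """List the report keys in the order the nested loops would produce them."""
    fmt = '{}-{}-window-{:02d}-workers-{:02d}-size-{}'
    return [
        fmt.format(label, model, window, workers, vector_size)
        for model, window, workers in product(models_grid, windows_grid, workers_grid)
    ]


labels = run_labels('untitled', ['word2vec', 'doc2vec'], [10], [1, 4], 300)
for name in labels:
    print(name)
```

The zero-padded `{:02d}` fields are what make `sorted(full_report.keys())` order runs by worker count rather than lexicographically by digit.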
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
from __future__ import unicode_literals | ||
from __future__ import print_function | ||
|
||
import logging | ||
import argparse | ||
# import yappi | ||
import os | ||
import glob | ||
|
||
from gensim.models import base_any2vec | ||
from gensim.models.word2vec import Word2Vec, LineSentence | ||
|
||
|
||
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
if __name__ == '__main__': | ||
parser = argparse.ArgumentParser(description='GSOC Multistream-API: evaluate vocab performance ' | ||
'for word2vec') | ||
parser.add_argument('--input', type=str, required=True, help='Input file, or a glob pattern for multistream.') | ||
parser.add_argument('--size', type=int, default=300) | ||
parser.add_argument('--workers-grid', nargs='+', type=int, default=[1, 2, 3, 4, 5, 8, 10, 12, 14]) | ||
parser.add_argument('--label', type=str, default='untitled') | ||
|
||
args = parser.parse_args() | ||
|
||
input_ = os.path.expanduser(args.input) | ||
input_streams = glob.glob(input_) | ||
logger.info('Glob found {} input streams. List: {}'.format(len(input_streams), input_streams)) | ||
|
||
input_streams = [LineSentence(path) for path in input_streams] | ||
for workers in args.workers_grid: | ||
model = Word2Vec() | ||
model.build_vocab(input_streams, workers=workers) | ||
logger.info('Workers = {}\tVocab time = {:.2f} secs'.format(workers, | ||
base_any2vec.PERFORMANCE_METRICS['vocab_time'])) | ||
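Both scripts pass `--input` through `os.path.expanduser` and `glob.glob`, so the value is matched as a shell-style glob pattern, not a regular expression. A self-contained sketch of that expansion, using made-up file names in a temporary directory:

```python
import glob
import os
import tempfile

# Hypothetical corpus shards plus one unrelated file, created only for the demo.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('part-00.txt', 'part-01.txt', 'notes.md'):
        with open(os.path.join(tmp, name), 'w'):
            pass

    pattern = os.path.expanduser(os.path.join(tmp, 'part-*.txt'))
    # glob.glob returns matches in arbitrary order, so sort for stable output.
    input_streams = sorted(glob.glob(pattern))
    names = [os.path.basename(path) for path in input_streams]

print(names)
```

If the pattern matches nothing, `glob.glob` returns an empty list instead of raising, so a benchmark run on a mistyped path would silently train on zero streams.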
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,4 +26,4 @@ | |
|
||
i += 1 | ||
|
||
fout.close() | ||
fout.close() |
You can simply measure like
Why do you need some "internal" stuff?
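The snippet attached to this comment is not preserved here. As one possible reading of the suggestion, the timing could be taken around the call from the outside instead of through a module-level metrics dict. The sketch below assumes Python 3 (`time.perf_counter`); `timed` and `count_words` are hypothetical stand-ins, with `count_words` playing the role of `build_vocab`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results, key):
    """Store the wall-clock duration of the enclosed block in results[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[key] = time.perf_counter() - start


def count_words(lines):
    """Toy stand-in for vocabulary building: count whitespace-separated tokens."""
    vocab = {}
    for line in lines:
        for word in line.split():
            vocab[word] = vocab.get(word, 0) + 1
    return vocab


results = {}
with timed(results, 'vocab_time'):
    vocab = count_words(["a b a", "b c"])

print('Vocab time = {:.6f} secs'.format(results['vocab_time']))
```

Measured this way, the benchmark script needs no instrumentation inside the library code it is timing.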