
ipython notebook for fastText comparison #815

Merged (6 commits) — Sep 9, 2016

Conversation

jayantj
Contributor

@jayantj jayantj commented Aug 6, 2016

A brief comparison of word representations from fastText and word2vec. I got some interesting looking results while trying out fastText, so I thought I'd share them.

Jupyter notebook.

Let me know if this is worth merging, and if you think any changes/additions could be made.

@tmylk

@gojomo
Collaborator

gojomo commented Aug 7, 2016

Great stuff!

I would make all parameters explicit, to avoid seeing differences that are only because of different defaults (and to also better ensure consistent operation if the various packages' defaults are changed in the future).

In particular, I've noticed that the fastText word2vec defaults differ from gensim Word2Vec (and word2vec.c before it) in three ways: skip-gram (instead of CBOW), initial learning rate of 0.05 (no matter the mode, instead of 0.025 for skip-gram), and downsampling threshold of 0.0001 (instead of 0.001). It'd be best to compare against the gensim Word2Vec results with those options.
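Concretely, those three overrides could be pinned when constructing the gensim model, so remaining score differences aren't just differing defaults (a sketch; keyword names per gensim's Word2Vec API, and `sentences` stands in for whatever corpus iterator the notebook uses):

```python
# Sketch: pin gensim Word2Vec settings to match fastText's skipgram defaults.
FASTTEXT_STYLE_PARAMS = dict(
    sg=1,         # skip-gram - fastText's default mode (gensim defaults to CBOW, sg=0)
    alpha=0.05,   # fastText's initial learning rate (gensim's skip-gram default: 0.025)
    sample=1e-4,  # fastText's downsampling threshold (gensim's default: 1e-3)
)

# Hypothetical usage:
# model = Word2Vec(sentences, **FASTTEXT_STYLE_PARAMS)
```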

Also, except for the char-ngrams, the fastText skipgram mode is still word2vec, so should match gensim Word2Vec analogy results pretty closely. Running fastText with some excessive min-char-ngram-size (eg -minn 100) seems to effectively disable the char-ngrams... so it'd be good to see those results, and compare them vs gensim Word2Vec, and fastText with char-ngrams, to really understand the effect of the char-ngram approach.

For all the major steps, using notebook timing (eg %%time cell magic) would help show relative performance.

FYI: I needed to patch the logging-setup with logging.root.handlers = [], to get log output in the notebook rather than the notebook-server console.

@piskvorky
Owner

piskvorky commented Aug 7, 2016

Great, thanks!

Like @gojomo says, let's compare apples to apples, then draw conclusions.

@jayantj
Contributor Author

jayantj commented Aug 7, 2016

Thanks for the detailed response!

Yes, I'd completely missed that word2vec was using CBOW, not skip-gram, as the default. That could be a major difference. I'll recheck the other hyperparameters too, and push the updated notebook.

@jayantj
Contributor Author

jayantj commented Aug 8, 2016

So I've pushed a bunch of changes -

  1. Took care of the different hyperparameters for the word2vec and fastText models
  2. Added comparisons between fastText models with and without n-grams - I think this adds much more insight, thanks for the idea @gojomo
  3. Minor changes in logging, timing

As for logging to the notebook vs. the console: I wasn't sure logging to the notebook was a great idea, since gensim prints a lot of log messages while training word2vec. The current notebook has these - they look a little ugly, so I'm thinking I should remove them.

@gojomo
Collaborator

gojomo commented Aug 8, 2016

Comments/suggestions on current notebook:

  • Ideally, it would do the right/minimal things in a "Run All" situation - only download/expand/recalc things that are needed. This could take the form of running certain steps only when the expected files don't already exist in the expected locations. And, only loading models from files if the (just-trained) model objects don't already exist in current scope. (Also, since none of the models take that long to train, providing a public download source doesn't seem important.)
  • While it seems to intend to enable INFO logging before the first Word2Vec (Brown) training, because the logging.root.handlers fix isn't yet applied, no logging appears in the notebook. (Maybe verbose logging isn't wanted yet or ever; if not it seems odd to try setting it up.)
  • Steps & results could be re-grouped to highlight runtime/performance comparisons. (For example, all three Brown variants trained in consecutive cells - gensim then fasttext-no-ngrams then fasttext – then compared in same order.)
  • models directory currently needs to be created outside notebook
  • fasttext output paths don't save to 'models' directory - but later loads expect files there
  • In my tests with text8, repeated runs of gensim-word2vec and fasttext-word2vec-no_ngrams both give 39%-42% on both semantic and syntactic analogies – sometimes one has the slight edge, sometimes the other, but same range as expected. BUT adding ngrams consistently brings fasttext-word2vec's syntactic score up to 63% or more, while dropping the semantic score to 34% or less. Those are both significant changes: the char-ngrams are helping one and hurting the other. (34% is not really in the same range as 39-42% - it's ~6 points or ~15% lower.)
  • Also worth noting: gensim is faster than fasttext! Only a little when disabling char-ngrams so they're doing equal work, but almost 3x faster when char-ngrams are enabled.
  • Tightening up these observations about syntactic-vs-semantic, gensim-vs-fasttext, char-ngrams-vs-not with more trials or larger datasets (maybe text9), with a few graphs, would make a compelling blog post about the tradeoffs & opportunities for improvement.
  • One other oddity in my experiments: I tried adding more training passes (`epoch`/`iter` to 10) on text8, to see if the relations held. All scores improved; char-ngrams continued to help syntactic and hurt semantic, BUT fasttext-word2vec-no_ngrams improved more on semantic analogies, with the extra iterations, than gensim-word2vec. I don't have a good theory why that would be; the algorithms should be the same except for small, somewhat-arbitrary differences in ordering/platform-math/threading-granularity.
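The "only do work when needed" pattern in the first bullet above can be sketched with a small helper (a minimal sketch; the real URLs and file names would come from the notebook, and the same `if not in scope` idea applies to just-trained model objects):

```python
import os
import urllib.request


def download_if_missing(url, path):
    """Fetch url to path only when path doesn't already exist,
    so a "Run All" skips downloads that were done on a previous run."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```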

@piskvorky
Owner

piskvorky commented Aug 9, 2016

Thanks @jayantj @gojomo for the analysis. Super useful.

Related to the "syntactic ngram mode": morphology for word embeddings (potentially combining the best of both worlds).

Re. @gojomo 's last bullet point: perhaps different alpha decay across epochs?

@jayantj
Contributor Author

jayantj commented Aug 9, 2016

@piskvorky Yes, @tmylk and I have discussed that paper; the authors haven't published any related code, though. I'll mail the authors asking about it - we were thinking of reproducing the paper, and if it goes well, maybe integrating it into gensim?

@gojomo Really appreciate all the help and feedback you've given so far.

  1. Yep, I completely agree about "Run All". A public download link is for people who don't have gensim/fastText installed, and also because training all the models takes > 30 minutes.
  2. Ah, no, the gensim training was causing way too many messages to be logged. The intention behind the current logging setup is to log training messages to the console and subsequent messages to the notebook. You can look at the notebook state after the 2nd commit in the PR - it looks very cluttered with all the training logs.
  3, 4, 5. Yes, sounds good. Will do.
  6. Hmm, that's a very interesting observation. My hypothesis would be that the words occurring in the semantic analogies are mostly standalone words, completely unrelated to their morphemes (father, mother, Paris, France); as a result, information from char n-grams actually makes the embeddings worse. The results in the original paper show worse performance on semantic tasks too. I should definitely look into this further.
  7. Yes, I'd noticed gensim being faster too! Adding a note about this - it's pretty impressive.
  8. I thought about this, but I'm not so sure, since I don't know what else I could add that would actually provide insight into why the models work as they do. I could add quite a lot of empirical results, but I don't know how useful they would be, considering this analysis is on a toy-ish task. Maybe an analysis on different tasks would be more useful?
  9. That's a little surprising. Will dig into this further.

@jayantj
Contributor Author

jayantj commented Aug 11, 2016

@gojomo I've made some changes, do you think this looks better?

@gojomo
Collaborator

gojomo commented Aug 19, 2016

It looks pretty good! It's already very useful, but if you want to keep refining, other comments:

  • if only 'brown' is needed, I think nltk.download('brown') can be used
  • it's nice to predicate the wget/unzip steps on whether the expected files already exist – fastText's own word-vector-example.sh is a good example
  • maybe pull questions-words.txt from github.com/tmikolov's word2vec mirror - just for extra authoritativeness
  • could offer option of running all tests yet again with 'text9' - perhaps assuming as prerequisite user has already run fastText's own word-vector-example.sh to ensure it's present and unpacked - this would be slower but provide yet more evidence of how performance scales with more data
  • I tend to prefer smaller-rather-than-larger code cells - as soon as I were to start tinkering, and running cells out of order, I'd have to split a lot of these cells
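The guarded wget/unzip step might look like this (a sketch in the style of fastText's word-vector-example.sh; the text8 URL/filenames are the ones that script uses):

```shell
# Only download and unpack text8 when it's not already present,
# so re-running the notebook doesn't redo finished work.
if [ ! -f text8 ]; then
  wget -c "http://mattmahoney.net/dc/text8.zip"
  unzip text8.zip
fi
```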

Your various conclusions about the reasons for observed results seem exactly right.

I would add an extra caveat on the Brown numbers that the corpus is so small the results (and thus relative accuracies) could vary a lot from run to run. (For example, only more runs or more iterations would give me confidence that the fastText semantic-accuracy increase from 16.5% to 18.1% isn't just jitter.)

(Also, it may be notable that the P parameter, from the 'enhancing with subword info' paper, for exempting the P most-frequent terms from subword-composition, is missing from the fastText implementation. It might help eliminate the penalty on semantic analogies (no noise from unhelpful subwords) – but also hurt the syntactic, by not helping rare/OOV words learn morphemes from their more-common peers.)

Finally, I think a summary graph of all time/accuracy results would help hammer home the results... and might significantly drive interest in reporting/discussing the notebook, if it's easy to pull out for display in tweets/blogposts. Specifically, I'm thinking of a bar graph: left axis train-time, right axis analogy-accuracy-percent. Three bottom-axis clusters of bars: brown, text8, text9. Within each cluster, runtime (gensim, ft-ng, ft); semantic-acc (gensim, ft-ng, ft); syntactic-acc (gensim, ft-ng, ft). It's a lot to squeeze into one graph (27 bars!)... but it'd still be interpretable and likely be re-used a lot!
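A layout sketch of that 27-bar graph (matplotlib, dual y-axes via twinx; all values here are placeholder ones - the real numbers would come from the notebook runs):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, e.g. on a headless machine
import matplotlib.pyplot as plt
import numpy as np

corpora = ["brown", "text8", "text9"]
models = ["gensim", "ft-ng", "ft"]
# Placeholder values purely to show the layout; substitute measured results.
train_time = np.ones((3, 3))  # rows: corpora, cols: models
sem_acc = np.ones((3, 3))
syn_acc = np.ones((3, 3))

fig, ax_time = plt.subplots()
ax_acc = ax_time.twinx()  # right axis for accuracy percentages
width = 0.09
x = np.arange(len(corpora))  # one cluster per corpus
for i, m in enumerate(models):
    ax_time.bar(x + i * width, train_time[:, i], width, label=f"time {m}")
    ax_acc.bar(x + (3 + i) * width, sem_acc[:, i], width, label=f"sem {m}")
    ax_acc.bar(x + (6 + i) * width, syn_acc[:, i], width, label=f"syn {m}")
ax_time.set_xticks(x + 4 * width)
ax_time.set_xticklabels(corpora)
ax_time.set_ylabel("training time (s)")
ax_acc.set_ylabel("analogy accuracy (%)")
fig.legend(loc="upper left", fontsize=6)
fig.savefig("comparison.png")
```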

@piskvorky
Owner

piskvorky commented Aug 20, 2016

Very nice @jayantj !

In addition to what @gojomo wrote (all great points), it looks like NLTK is doing something clever in brown.sents(), because just iterating over this tiny corpus, doing nothing at all, takes 2.5s (vs 100ms for Text8Corpus("brown_corp.txt")). So probably best to avoid it and use the same plain-text file input as fastText. The sentences are split differently between these two versions too (could theoretically affect accuracy, in addition to speed). Just a minor nitpick.

What BLAS does your gensim installation use @jayantj ?

@piskvorky (Owner) commented on a notebook cell:

    "For training the models yourself, you'll need to have both [Gensim](github.com/RaRe-Technologies/gensim) and [FastText](https://github.com/facebookresearch/fastText) set up on your machine."

The gensim link gives 404.

@jayantj
Contributor Author

jayantj commented Aug 21, 2016

Thanks for the great ideas. I've added the check-if-existing conditions, updated the links, and removed the brown.sents() call. Will add graphs and text9 as soon as training is complete(!)

@piskvorky Haven't manually installed any BLAS packages, and can't seem to find any with dpkg -l | grep -iE 'openblas|lapack|atlas'

word2vec.FAST_VERSION returns 1 though.

@jayantj force-pushed the fast_text_notebook branch 2 times, most recently from b97ea8f to d3cf3d7, on August 21, 2016 06:51
@piskvorky
Owner

Worth installing OpenBLAS then (and then re-installing numpy+scipy). Btw, the easiest way to check BLAS linkage is with numpy.show_config() and scipy.show_config().

@jayantj
Contributor Author

jayantj commented Aug 21, 2016

Yeah, ran that already, that gives me

blas_opt_info:
    libraries = ['openblas', 'openblas']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
blas_mkl_info:
  NOT AVAILABLE
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
openblas_info:
    libraries = ['openblas', 'openblas']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']

This means it's linked, right? The packages don't seem to show up with dpkg, though.

@jayantj
Contributor Author

jayantj commented Aug 23, 2016

So I've added the text9 comparison, and a nice looking graph - thanks for the advice @gojomo
[graph image]

One tiny nitpick - I've had to hardcode the training-time values; there doesn't seem to be an easy way of retrieving the output of %time, and other methods of timing statements seem to add a lot of clutter.
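One lightweight alternative to hardcoding (just a sketch, not gensim-specific): wrap each training call so the elapsed time comes back as a plain value that later cells can plot:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds),
    so the timing survives as a variable instead of %time console output."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start


# Hypothetical notebook usage:
# model, train_seconds = timed(Word2Vec, sentences, sg=1)
```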

@rodrigocesar

Hi there guys, I tried to run the notebook FastText_Tutorial.ipynb but to no avail. I have already installed fastText. The error occurs when I try to import FastText from the gensim package, i.e.,

import gensim, os
from gensim.models.wrappers.fasttext import FastText

# Set FastText home to the path to the FastText executable
ft_home = '/home/rodrigo/Projects/fastText/fasttext'


# Set file names for train and test data
data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data']) + os.sep
lee_train_file = data_dir + 'lee_background.cor'

model = FastText.train(ft_home, lee_train_file)

print(model)
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-20-9c488bfaa296> in <module>()
      1 import gensim, os
----> 2 from gensim.models.wrappers.fasttext import FastText
      3 
      4 # Set FastText home to the path to the FastText executable
      5 ft_home = '/home/rodrigo/Projects/fastText/fasttext'

ImportError: No module named fasttext

Can you please help me solve this issue?


@tmylk
Contributor

tmylk commented Jan 31, 2017

Hi. This is not yet released - you need to clone the develop branch from GitHub.

@rodrigocesar

Thanks Lev. But now another error is showing up when I try to import.

import gensim, os
from gensim.models.wrappers.fasttext import FastText

# Set FastText home to the path to the FastText executable
ft_home = '/home/rodrigo/Projects/fastText/fasttext'

# Set file names for train and test data
data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data']) + os.sep
lee_train_file = data_dir + 'lee_background.cor'

model = FastText.train(ft_home, lee_train_file)

print(model)


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.py", line 1132, in get_records
    return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
  File "/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.py", line 313, in wrapped
    return f(*args, **kwargs)
  File "/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
    records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
  File "/usr/lib/python2.7/inspect.py", line 1049, in getinnerframes
    framelist.append((tb.tb_frame,) + getframeinfo(tb, context))
  File "/usr/lib/python2.7/inspect.py", line 1009, in getframeinfo
    filename = getsourcefile(frame) or getfile(frame)
  File "/usr/lib/python2.7/inspect.py", line 454, in getsourcefile
    if hasattr(getmodule(object, filename), '__loader__'):
  File "/usr/lib/python2.7/inspect.py", line 483, in getmodule
    file = getabsfile(object, _filename)
  File "/usr/lib/python2.7/inspect.py", line 467, in getabsfile
    return os.path.normcase(os.path.abspath(_filename))
  File "/usr/lib/python2.7/posixpath.py", line 364, in abspath
    cwd = os.getcwd()
OSError: [Errno 2] No such file or directory

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_code(self, code_obj, result)
   2896             if result is not None:
   2897                 result.error_in_exec = sys.exc_info()[1]
-> 2898             self.showtraceback()
   2899         else:
   2900             outflag = 0

/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in showtraceback(self, exc_tuple, filename, tb_offset, exception_only)
   1822                     except Exception:
   1823                         stb = self.InteractiveTB.structured_traceback(etype,
-> 1824                                             value, tb, tb_offset=tb_offset)
   1825 
   1826                     self._showtraceback(etype, value, stb)

/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.pyc in structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
   1404         self.tb = tb
   1405         return FormattedTB.structured_traceback(
-> 1406             self, etype, value, tb, tb_offset, number_of_lines_of_context)
   1407 
   1408 

/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.pyc in structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
   1312             # Verbose modes need a full traceback
   1313             return VerboseTB.structured_traceback(
-> 1314                 self, etype, value, tb, tb_offset, number_of_lines_of_context
   1315             )
   1316         else:

/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.pyc in structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
   1196                 structured_traceback_parts += formatted_exception
   1197         else:
-> 1198             structured_traceback_parts += formatted_exception[0]
   1199 
   1200         return structured_traceback_parts

IndexError: string index out of range
