
ipython notebook for fastText comparison #815

Merged (6 commits) — Sep 9, 2016

Conversation

jayantj
Contributor

@jayantj jayantj commented Aug 6, 2016

A brief comparison of word representations from fastText and word2vec. I got some interesting looking results while trying out fastText, so I thought I'd share them.

Jupyter notebook.

Let me know if this is worth merging, and if you think any changes/additions could be made.

@tmylk

@gojomo
Collaborator

gojomo commented Aug 7, 2016

Great stuff!

I would make all parameters explicit, to avoid seeing differences that are only because of different defaults (and to also better ensure consistent operation if the various packages' defaults are changed in the future).

In particular, I've noticed that the fastText word2vec defaults differ from gensim Word2Vec (and word2vec.c before it) in three ways: skip-gram (instead of CBOW), initial learning rate of 0.05 (no matter the mode, instead of 0.025 for skip-gram), and downsampling threshold of 0.0001 (instead of 0.001). It'd be best to compare against the gensim Word2Vec results with those options.
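Concretely, those three overrides could be pinned when constructing the gensim model, so remaining score differences aren't just differing defaults (a sketch; keyword names per gensim's Word2Vec API, and `sentences` stands in for whatever corpus iterator the notebook uses):

```python
# Sketch: pin gensim Word2Vec settings to match fastText's skipgram defaults.
FASTTEXT_STYLE_PARAMS = dict(
    sg=1,         # skip-gram - fastText's default mode (gensim defaults to CBOW, sg=0)
    alpha=0.05,   # fastText's initial learning rate (gensim's skip-gram default: 0.025)
    sample=1e-4,  # fastText's downsampling threshold (gensim's default: 1e-3)
)

# Hypothetical usage:
# model = Word2Vec(sentences, **FASTTEXT_STYLE_PARAMS)
```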

Also, except for the char-ngrams, the fastText skipgram mode is still word2vec, so should match gensim Word2Vec analogy results pretty closely. Running fastText with some excessive min-char-ngram-size (eg -minn 100) seems to effectively disable the char-ngrams... so it'd be good to see those results, and compare them vs gensim Word2Vec, and fastText with char-ngrams, to really understand the effect of the char-ngram approach.

For all the major steps, using notebook timing (eg %%time cell magic) would help show relative performance.

FYI: I needed to patch the logging-setup with logging.root.handlers = [], to get log output in the notebook rather than the notebook-server console.

@piskvorky
Owner

piskvorky commented Aug 7, 2016

Great, thanks!

Like @gojomo says, let's compare apples to apples, then draw conclusions.

@jayantj
Contributor Author

jayantj commented Aug 7, 2016

Thanks for the detailed response!

Yes, I'd completely missed that word2vec was using CBOW, not skip-gram, as the default. That could be a major difference. I'll recheck the other hyperparameters too, and push the updated notebook.

@jayantj
Contributor Author

jayantj commented Aug 8, 2016

So I've pushed a bunch of changes -

  1. Took care of the different hyperparameters for the word2vec and fastText models
  2. Added comparisons between fastText models with and without n-grams - I think this adds much more insight, thanks for the idea @gojomo
  3. Minor changes in logging, timing

As for logging to the notebook vs. the console: I wasn't sure logging to the notebook was a great idea, since gensim prints a lot of log messages while training word2vec. The current notebook has these - they look a little ugly, so I'm thinking I should remove them.

@gojomo
Collaborator

gojomo commented Aug 8, 2016

Comments/suggestions on current notebook:

  • Ideally, it would do the right/minimal things in a "Run All" situation - only download/expand/recalc things that are needed. This could take the form of running certain steps only when the expected files don't already exist in the expected locations. And, only loading models from files if the (just-trained) model objects don't already exist in current scope. (Also, since none of the models take that long to train, providing a public download source doesn't seem important.)
  • While it seems to intend to enable INFO logging before the first Word2Vec (Brown) training, because the logging.root.handlers fix isn't yet applied, no logging appears in the notebook. (Maybe verbose logging isn't wanted yet or ever; if not it seems odd to try setting it up.)
  • Steps & results could be re-grouped to highlight runtime/performance comparisons. (For example, all three Brown variants trained in consecutive cells - gensim then fasttext-no-ngrams then fasttext – then compared in same order.)
  • models directory currently needs to be created outside notebook
  • fasttext output paths don't save to 'models' directory - but later loads expect files there
  • In my tests with text8, repeated runs of gensim-word2vec and fasttext-word2vec-no_ngrams both give 39%-42% on both semantic and syntactic analogies – sometimes one has the slight edge, sometimes the other, but same range as expected. BUT adding ngrams consistently brings fasttext-word2vec's syntactic score up to 63% or more, while dropping the semantic score to 34% or less. Those are both significant changes: the char-ngrams are helping one and hurting the other. (34% is not really in the same range as 39-42% - it's ~6 points or ~15% lower.)
  • Also worth noting: gensim is faster than fasttext! Only a little when disabling char-ngrams so they're doing equal work, but almost 3x faster when char-ngrams are enabled.
  • Tightening up these observations about syntactic-vs-semantic, gensim-vs-fasttext, char-ngrams-vs-not with more trials or larger datasets (maybe text9), with a few graphs, would make a compelling blog post about the tradeoffs & opportunities for improvement.
  • One other oddity in my experiments: I tried adding more training passes (`epoch`/`iter` to 10) on text8, to see if the relations held. All scores improved; char-ngrams continued to help syntactic and hurt semantic, BUT fasttext-word2vec-no_ngrams improved more on semantic analogies, with the extra iterations, than gensim-word2vec. I don't have a good theory why that would be; the algorithms should be the same except for small, somewhat-arbitrary differences in ordering/platform-math/threading-granularity.
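The "only do work when needed" pattern in the first bullet above can be sketched with a small helper (a minimal sketch; the real URLs and file names would come from the notebook, and the same `if not in scope` idea applies to just-trained model objects):

```python
import os
import urllib.request


def download_if_missing(url, path):
    """Fetch url to path only when path doesn't already exist,
    so a "Run All" skips downloads that were done on a previous run."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```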

@piskvorky
Owner

piskvorky commented Aug 9, 2016

Thanks @jayantj @gojomo for the analysis. Super useful.

Related to the "syntactic ngram mode": morphology for word embeddings (potentially combining the best of both worlds).

Re. @gojomo 's last bullet point: perhaps different alpha decay across epochs?

@jayantj
Contributor Author

jayantj commented Aug 9, 2016

@piskvorky Yes, @tmylk and I have discussed that paper; the authors haven't published any related code, though. I'll mail the authors asking about it - we were thinking of reproducing the paper, and if it goes well, maybe integrating it into gensim?

@gojomo Really appreciate all the help and feedback you've given so far.

  1. Yep, I completely agree about "Run All". A public download link is for people who don't have gensim/fastText installed, and also because training all the models takes > 30 minutes.
  2. Ah, no, the gensim training was causing way too many messages to be logged. The intention behind the current logging setup is to log training messages to the console and subsequent messages to the notebook. You can look at the notebook state after the 2nd commit in the PR - it looks very cluttered with all the training logs.
  3, 4, 5. Yes, sounds good. Will do.
  6. Hmm, that's a very interesting observation. My hypothesis would be that the words occurring in the semantic analogies are mostly standalone words, completely unrelated to their morphemes (father, mother, Paris, France); as a result, information from char n-grams actually makes the embeddings worse. The results in the original paper show worse performance on semantic tasks too. I should definitely look into this further.
  7. Yes, I'd noticed gensim being faster too! Adding a note about this - it's pretty impressive.
  8. I thought about this, but I'm not so sure, since I don't know what else I could add that would actually provide insight into why the models work as they do. I could add quite a lot of empirical results, but I don't know how useful they would be, considering this analysis is on a toy-ish task. Maybe an analysis on different tasks would be more useful?
  9. That's a little surprising. Will dig into this further.

@jayantj
Contributor Author

jayantj commented Aug 11, 2016

@gojomo I've made some changes, do you think this looks better?

@gojomo
Collaborator

gojomo commented Aug 19, 2016

It looks pretty good! It's already very useful, but if you want to keep refining, other comments:

  • if only 'brown' is needed, I think nltk.download('brown') can be used
  • it's nice to predicate the wget/unzip steps on whether the expected files already exist – fastText's own word-vector-example.sh is a good example
  • maybe pull questions-words.txt from github.com/tmikolov's word2vec mirror - just for extra authoritativeness
  • could offer option of running all tests yet again with 'text9' - perhaps assuming as prerequisite user has already run fastText's own word-vector-example.sh to ensure it's present and unpacked - this would be slower but provide yet more evidence of how performance scales with more data
  • I tend to prefer smaller-rather-than-larger code cells - as soon as I were to start tinkering, and running cells out of order, I'd have to split a lot of these cells
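The guarded wget/unzip step might look like this (a sketch in the style of fastText's word-vector-example.sh; the text8 URL/filenames are the ones that script uses):

```shell
# Only download and unpack text8 when it's not already present,
# so re-running the notebook doesn't redo finished work.
if [ ! -f text8 ]; then
  wget -c "http://mattmahoney.net/dc/text8.zip"
  unzip text8.zip
fi
```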

Your various conclusions about the reasons for observed results seem exactly right.

I would add an extra caveat on the Brown numbers that the corpus is so small the results (and thus relative accuracies) could vary a lot from run to run. (For example, only more runs or more iterations would give me confidence that the fastText semantic-accuracy increase from 16.5% to 18.1% isn't just jitter.)

(Also, it may be notable that the P parameter, from the 'enhancing with subword info' paper, for exempting the P most-frequent terms from subword-composition, is missing from the fastText implementation. It might help eliminate the penalty on semantic analogies (no noise from unhelpful subwords) – but also hurt the syntactic, by not helping rare/OOV words learn morphemes from their more-common peers.)

Finally, I think a summary graph of all time/accuracy results would help hammer home the results... and might significantly drive interest in reporting/discussing the notebook, if it's easy to pull out for display in tweets/blogposts. Specifically, I'm thinking of a bar graph: left axis train-time, right axis analogy-accuracy-percent. Three bottom-axis clusters of bars: brown, text8, text9. Within each cluster, runtime (gensim, ft-ng, ft); semantic-acc (gensim, ft-ng, ft); syntactic-acc (gensim, ft-ng, ft). It's a lot to squeeze into one graph (27 bars!)... but it'd still be interpretable and likely be re-used a lot!
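A layout sketch of that 27-bar graph (matplotlib, dual y-axes via twinx; all values here are placeholder ones - the real numbers would come from the notebook runs):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, e.g. on a headless machine
import matplotlib.pyplot as plt
import numpy as np

corpora = ["brown", "text8", "text9"]
models = ["gensim", "ft-ng", "ft"]
# Placeholder values purely to show the layout; substitute measured results.
train_time = np.ones((3, 3))  # rows: corpora, cols: models
sem_acc = np.ones((3, 3))
syn_acc = np.ones((3, 3))

fig, ax_time = plt.subplots()
ax_acc = ax_time.twinx()  # right axis for accuracy percentages
width = 0.09
x = np.arange(len(corpora))  # one cluster per corpus
for i, m in enumerate(models):
    ax_time.bar(x + i * width, train_time[:, i], width, label=f"time {m}")
    ax_acc.bar(x + (3 + i) * width, sem_acc[:, i], width, label=f"sem {m}")
    ax_acc.bar(x + (6 + i) * width, syn_acc[:, i], width, label=f"syn {m}")
ax_time.set_xticks(x + 4 * width)
ax_time.set_xticklabels(corpora)
ax_time.set_ylabel("training time (s)")
ax_acc.set_ylabel("analogy accuracy (%)")
fig.legend(loc="upper left", fontsize=6)
fig.savefig("comparison.png")
```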

@piskvorky
Owner

piskvorky commented Aug 20, 2016

Very nice @jayantj !

In addition to what @gojomo wrote (all great points), it looks like NLTK is doing something clever in brown.sents(), because just iterating over this tiny corpus, doing nothing at all, takes 2.5s (vs 100ms for Text8Corpus("brown_corp.txt")). So probably best to avoid it and use the same plain-text file input as fastText. The sentences are split differently between these two versions too (could theoretically affect accuracy, in addition to speed). Just a minor nitpick.

What BLAS does your gensim installation use @jayantj ?

@piskvorky (Owner) commented on a notebook cell:

    "For training the models yourself, you'll need to have both [Gensim](github.com/RaRe-Technologies/gensim) and [FastText](https://github.com/facebookresearch/fastText) set up on your machine."

The gensim link gives 404.

@jayantj
Contributor Author

jayantj commented Aug 21, 2016

Thanks for the great ideas. I've added the check-if-existing conditions, updated the links, and removed the brown.sents() call. Will add graphs and text9 as soon as training is complete(!)

@piskvorky Haven't manually installed any BLAS packages, and can't seem to find any with dpkg -l | grep -iE 'openblas|lapack|atlas'

word2vec.FAST_VERSION returns 1 though.

@jayantj force-pushed the fast_text_notebook branch 2 times, most recently from b97ea8f to d3cf3d7, on August 21, 2016 06:51
@piskvorky
Owner

Worth installing OpenBLAS then (and then re-installing numpy+scipy). Btw, the easiest way to check BLAS linkage is with numpy.show_config() and scipy.show_config().

@jayantj
Contributor Author

jayantj commented Aug 21, 2016

Yeah, ran that already, that gives me

blas_opt_info:
    libraries = ['openblas', 'openblas']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
blas_mkl_info:
  NOT AVAILABLE
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
openblas_info:
    libraries = ['openblas', 'openblas']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']

This means it's linked, right? The packages don't seem to show up with dpkg, though.

@jayantj
Contributor Author

jayantj commented Aug 23, 2016

So I've added the text9 comparison, and a nice looking graph - thanks for the advice @gojomo
[graph image]

One tiny nitpick - I've had to hardcode the training-time values; there doesn't seem to be an easy way of retrieving the output of %time, and other methods of timing statements seem to add a lot of clutter.
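One lightweight alternative to hardcoding (just a sketch, not gensim-specific): wrap each training call so the elapsed time comes back as a plain value that later cells can plot:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds),
    so the timing survives as a variable instead of %time console output."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start


# Hypothetical notebook usage:
# model, train_seconds = timed(Word2Vec, sentences, sg=1)
```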

@rodrigocesar

Hi there guys, I tried to run the notebook FastText_Tutorial.ipynb but to no avail. I have already installed fastText. The error occurs when I try to import FastText from the gensim package, i.e.,

import gensim, os
from gensim.models.wrappers.fasttext import FastText

# Set FastText home to the path to the FastText executable
ft_home = '/home/rodrigo/Projects/fastText/fasttext'


# Set file names for train and test data
data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data']) + os.sep
lee_train_file = data_dir + 'lee_background.cor'

model = FastText.train(ft_home, lee_train_file)

print(model)
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-20-9c488bfaa296> in <module>()
      1 import gensim, os
----> 2 from gensim.models.wrappers.fasttext import FastText
      3 
      4 # Set FastText home to the path to the FastText executable
      5 ft_home = '/home/rodrigo/Projects/fastText/fasttext'

ImportError: No module named fasttext

Can you please help me solve this issue?


@tmylk
Contributor

tmylk commented Jan 31, 2017

Hi. This is not yet released - you need to clone the develop branch from GitHub.

@rodrigocesar

Thanks Lev. But now another error is showing up when I try to import.

import gensim, os
from gensim.models.wrappers.fasttext import FastText

# Set FastText home to the path to the FastText executable
ft_home = '/home/rodrigo/Projects/fastText/fasttext'

# Set file names for train and test data
data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data']) + os.sep
lee_train_file = data_dir + 'lee_background.cor'

model = FastText.train(ft_home, lee_train_file)

print(model)


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.py", line 1132, in get_records
    return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
  File "/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.py", line 313, in wrapped
    return f(*args, **kwargs)
  File "/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
    records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
  File "/usr/lib/python2.7/inspect.py", line 1049, in getinnerframes
    framelist.append((tb.tb_frame,) + getframeinfo(tb, context))
  File "/usr/lib/python2.7/inspect.py", line 1009, in getframeinfo
    filename = getsourcefile(frame) or getfile(frame)
  File "/usr/lib/python2.7/inspect.py", line 454, in getsourcefile
    if hasattr(getmodule(object, filename), '__loader__'):
  File "/usr/lib/python2.7/inspect.py", line 483, in getmodule
    file = getabsfile(object, _filename)
  File "/usr/lib/python2.7/inspect.py", line 467, in getabsfile
    return os.path.normcase(os.path.abspath(_filename))
  File "/usr/lib/python2.7/posixpath.py", line 364, in abspath
    cwd = os.getcwd()
OSError: [Errno 2] No such file or directory

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_code(self, code_obj, result)
   2896             if result is not None:
   2897                 result.error_in_exec = sys.exc_info()[1]
-> 2898             self.showtraceback()
   2899         else:
   2900             outflag = 0

/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in showtraceback(self, exc_tuple, filename, tb_offset, exception_only)
   1822                     except Exception:
   1823                         stb = self.InteractiveTB.structured_traceback(etype,
-> 1824                                             value, tb, tb_offset=tb_offset)
   1825 
   1826                     self._showtraceback(etype, value, stb)

/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.pyc in structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
   1404         self.tb = tb
   1405         return FormattedTB.structured_traceback(
-> 1406             self, etype, value, tb, tb_offset, number_of_lines_of_context)
   1407 
   1408 

/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.pyc in structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
   1312             # Verbose modes need a full traceback
   1313             return VerboseTB.structured_traceback(
-> 1314                 self, etype, value, tb, tb_offset, number_of_lines_of_context
   1315             )
   1316         else:

/home/rodrigo/.local/lib/python2.7/site-packages/IPython/core/ultratb.pyc in structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
   1196                 structured_traceback_parts += formatted_exception
   1197         else:
-> 1198             structured_traceback_parts += formatted_exception[0]
   1199 
   1200         return structured_traceback_parts

IndexError: string index out of range
