Example needed for persisting Birch model using _ClusteringWrapper or similar #184

Open
robguinness opened this issue Sep 18, 2018 · 5 comments

@robguinness

robguinness commented Sep 18, 2018

Hi,

I am new to FreeDiscovery. I came across it when reading this pull request. The features you added to Birch are really great, but I am stuck figuring out how to persist a Birch model according to your API.

I have tried persisting using the _ClusteringWrapper class, as in this example:

cl = _ClusteringWrapper(cache_dir=cache_dir, parent_id=lsi_model_id)
cl.birch(threshold=0.9, branching_factor=20)

The problem is when I try to interact with it using the Birch API, I get an error, e.g.:

htree, _ = birch_hierarchy_wrapper(cl.km)

OUT: ValueError: the birch object must be created with freediscovery.cluster.Birch

An example of how to persist a Birch model and work with it later using the Birch API would be most appreciated!

@rth
Contributor

rth commented Sep 18, 2018

Thanks for the feedback @robguinness!

If you want to use this version of Birch in Python, just use

from freediscovery.cluster import Birch

instead of

from sklearn.cluster import Birch

Everything else should be mostly the same as with the Birch estimator from scikit-learn; see the documentation at http://freediscovery.io/doc/stable/user_manual/clustering.html#birch for more details.

There is a more complete example at http://freediscovery.io/doc/stable/python/examples/birch_cluster_hierarchy.html

The serialization should work the same as with regular scikit-learn estimators:

from sklearn.externals import joblib

joblib.dump(estimator, 'file_name.pkl')
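As a minimal sketch of the full round trip (dump, load, and reuse), the following uses the Birch estimator from scikit-learn on synthetic data. It assumes that freediscovery.cluster.Birch can be swapped in for the import, as described above, and it imports the standalone joblib package rather than sklearn.externals.joblib, which newer scikit-learn releases no longer ship. The data, file name, and parameter values are illustrative only.

```python
import os
import tempfile

import joblib
import numpy as np
# Assumption: freediscovery.cluster.Birch could be imported here instead
from sklearn.cluster import Birch

# Synthetic data: two well-separated blobs of 20 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])

estimator = Birch(threshold=0.5, branching_factor=20, n_clusters=2)
estimator.fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'birch_model.pkl')
    joblib.dump(estimator, path)    # persist the fitted estimator
    restored = joblib.load(path)    # read it back into a new object

# The restored estimator should assign the same labels as the original
labels_before = estimator.predict(X)
labels_after = restored.predict(X)
```

Comparing labels_before with labels_after is a quick sanity check that the restored estimator behaves like the original one.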

As for _ClusteringWrapper, it is a private class and shouldn't be used directly from Python. Which example did you see it in?

@robguinness
Author

Thanks for the reply @rth. I want to clarify my issue a bit. Persisting with joblib or pickle is one possibility, but what I would really like to do is utilize freediscovery's consistent approach of persisting models and the associated metadata, as found in freediscovery.server.resources. In particular I followed the "example" in BirchClusteringApi.post():

    @marshal_with(IDSchema())
    def post(self, **args):
        from math import sqrt

        metric = args.pop('metric')
        S_cos = _scale_cosine_similarity(args.pop('min_similarity'),
                                         metric=metric,
                                         inverse=True)
        # cosine sim to euclidean distance
        threshold = sqrt(2 * (1 - S_cos))

        cl = _ClusteringWrapper(cache_dir=self._cache_dir,
                                parent_id=args.pop('parent_id'),
                                metric=metric)

        if args.get('n_clusters') <= 0:
            args['n_clusters'] = None

        cl.birch(threshold=threshold, **args)
        return {'id': cl.mid}
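For reference, the threshold line in the method above relies on the identity ||x - y||^2 = 2 * (1 - cos(x, y)), which holds for unit-length vectors x and y. The sketch below checks that identity numerically. It assumes S_cos is a plain cosine similarity; the _scale_cosine_similarity helper is not reproduced, and the function names and sample vectors are hypothetical.

```python
from math import isclose, sqrt


def cosine_to_euclidean(s_cos):
    """Euclidean distance between two unit vectors with cosine similarity s_cos."""
    return sqrt(2 * (1 - s_cos))


def normalize(v):
    """Scale a vector to unit length."""
    norm = sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


# Two arbitrary vectors, normalized to unit length
x = normalize([1.0, 2.0, 3.0])
y = normalize([2.0, 1.0, 0.5])

# Cosine similarity of unit vectors is just their dot product
s_cos = sum(a * b for a, b in zip(x, y))

# Distance computed directly, and via the conversion formula
direct = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
converted = cosine_to_euclidean(s_cos)
```

The two values agree up to floating-point error, which is why a minimum cosine similarity can be translated into a Euclidean threshold when the input vectors are normalized.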

However, I later found out that this version of the Birch model created by _ClusteringWrapper.birch() is not the same as the one created by freediscovery.cluster.Birch. I realize now that _ClusteringWrapper was not really designed to be used in this way, but as Python doesn't really have private classes, it could be used, right? ;-)

I really like the idea of interacting with freediscovery through REST APIs, but not all of the features I need are currently available through them, and I am interested in eventually extending them. Still getting my feet wet though...
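To make the idea of persisting a model together with its metadata under an id more concrete, here is a generic, hypothetical sketch using only the standard library. It is not FreeDiscovery's actual storage layout or API: the save_model and load_model helpers, the file names, and the stand-in model object are all assumptions made for illustration.

```python
import json
import pickle
import tempfile
import uuid
from pathlib import Path


def save_model(cache_dir, model, params, parent_id=None):
    """Store a model and its metadata in a new directory; return the model id."""
    mid = uuid.uuid4().hex                      # hypothetical model id
    model_dir = Path(cache_dir) / mid
    model_dir.mkdir(parents=True)
    with open(model_dir / 'model.pkl', 'wb') as fh:
        pickle.dump(model, fh)
    with open(model_dir / 'params.json', 'w') as fh:
        json.dump({'parent_id': parent_id, 'params': params}, fh)
    return mid


def load_model(cache_dir, mid):
    """Load a model and its metadata back from the cache directory."""
    model_dir = Path(cache_dir) / mid
    with open(model_dir / 'model.pkl', 'rb') as fh:
        model = pickle.load(fh)
    with open(model_dir / 'params.json') as fh:
        meta = json.load(fh)
    return model, meta


with tempfile.TemporaryDirectory() as cache_dir:
    # Stand-in for a fitted estimator; any picklable object works here
    stand_in_model = {'centroids': [[0.0, 0.0], [5.0, 5.0]]}
    mid = save_model(cache_dir, stand_in_model,
                     {'threshold': 0.9, 'branching_factor': 20},
                     parent_id='hypothetical-lsi-id')
    restored_model, restored_meta = load_model(cache_dir, mid)
```

Keeping the parameters in a separate JSON file means the metadata can be inspected without unpickling the model itself.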

@rth
Contributor

rth commented Sep 20, 2018

Thanks for the additional explanations! You are of course free to use _ClusteringWrapper in Python if you find it useful. It was initially part of the public API, but because maintaining backward compatibility for both the REST API and the exposed Python classes was somewhat difficult, it was subsequently removed from it. I was also not sure Python users would find it useful, so thanks for the feedback. Now that the code base is fairly stable, it should be fine to use it. Looking at the freediscovery.engine and freediscovery.server code is indeed currently the simplest way to find examples of its use.

To answer your initial question: cl = _ClusteringWrapper(..) already calls birch_hierarchy_wrapper internally, and you can access the results with cl.km.htree.

If you encounter any issues, e.g. while using scikit-learn 0.20rc1, don't hesitate to open issues. Pull requests would also be much appreciated!

@robguinness
Author

Thanks. This is very helpful! I will certainly send a PR if I come up with anything useful ;-)

@robguinness
Author

Hi again,
Sorry, I tried your suggestion, and I am still getting an error:

Traceback (most recent call last):
  File "/home/rob/projects/dochier/core.py", line 78, in <module>
    htree = cl.km.htree
AttributeError: '_BirchDummy' object has no attribute 'htree'

Any ideas?
