Rgf feature importances #161

fukatani · 2018-02-24T16:15:03Z

No description provided.

Conflicts: include/rgf/src/tet/AzFindSplit.cpp

StrikerRUS · 2018-02-27T18:05:33Z

Wow! Great job!

I've created a separate repo for RGF C++: https://github.com/StrikerRUS/rgf . Of course, we should have done it earlier... But better late than never! 😃 I've invited you as a collaborator to this repo; please accept the invitation. After that I will transfer ownership of the repo to Tong Zhang. We'll remain collaborators.
I'll help you transfer all your work from the current subdirectory to the new repo if you want.

fukatani · 2018-02-28T13:12:00Z

Thank you for your comment.

Basically, I agree that this PR exceeds the scope of the wrapper.
But I'm not sure that separating the wrapper from RGF is the best way.
For example, the XGB and LightGBM repositories include both the C++ code and their wrappers.
Maintenance costs for versioning and testing are also lower if the wrapper and the C++ code are held in one repository. RGF (and FastRGF) itself doesn't have any tests, so the rgf_python tests are essential for developing them.

I think there are two ways.

1. Create an RGF group and transfer a clone of the whole rgf_python repository to it.
2. A slightly rougher way: rename this repository to "RGF."

What do you think?

StrikerRUS · 2018-02-28T20:09:04Z

I agree that tests are a very important part of the repo and help developers, and that we'll face some problems after separating the C++ and Python code. But it will be possible to just create a new branch and update the Git submodule in the wrapper's repo so that the tests are triggered (I understand that it's not so convenient, but it's acceptable).

Anyway, RGF is not written by us, and it's wrong to just copy the original code into the wrapper's repo. XGB and LGB are written by the same people, so it's OK to hold all code in one repo for them.

Moreover, submodules are a tradeoff between independence (we can reference any specific commit we want) and the need to maintain the C++ code or duplicate it. Also, what about users who want to use the command-line tool? It's not so easy to find out how to do that in the wrapper's repo. Both RGF and FastRGF are independent projects and deserve separate repos.

My opinion is that the right way is a modified (1): create an RGF-team organization on GitHub and transfer all three repos under its ownership. I'm not sure about the difficulties of transferring fast_rgf, since Baidu is the owner now, so I'll contact Tong about this.

StrikerRUS · 2018-03-01T10:58:30Z

I've created the RGF-team organization and invited you and Tong Zhang to it. Please accept the invitation and transfer the rgf_python repo.
https://github.com/RGF-team

fukatani · 2018-03-02T14:32:13Z

Thank you for creating the organization.

But there is no need to hurry.
Before starting the transfer, we should agree on the transfer process (and maybe on a maintenance policy).

> XGB and LGB are written by the same people, so it's OK to hold all code in one repo for them.

Why is it the wrong way? In terms of licence? Copyright?

For example, TensorFlow includes skflow, Keras, ... and other repositories.
They were originally independent projects.

Honestly, I can't see the benefit of managing them separately.
The transition is also troublesome, and maintenance costs increase.

> Also, what about users who want to use the command-line tool?

They can use git grep.

The number of maintainers will increase in the future, so I want to choose a simple and easy way.

fukatani · 2018-03-02T14:36:22Z

Would it be better to upload a clone of rgf_python rather than transferring this repository?
Some articles link to https://github.com/fukatani/rgf_python.
We should point readers to the new repository in the https://github.com/fukatani/rgf_python README.

fukatani · 2018-03-02T14:46:54Z

Anyway, whatever system we end up with, it is a lot of fun for the project to be run by three people. Tong Zhang joining is a great benefit for the RGF community. 👍
Thanks @StrikerRUS !

StrikerRUS · 2018-03-02T16:01:43Z

> Would it be better to upload a clone of rgf_python rather than transferring this repository?
> Some articles link to https://github.com/fukatani/rgf_python.
> We should point readers to the new repository in the https://github.com/fukatani/rgf_python README.

Yeah! Good point! But you shouldn't worry about it. GitHub sets up redirects automatically once a transfer is finished. You can check it by following this link: https://github.com/StrikerRUS/rgf . I've transferred ownership of the rgf repo to RGF-team.

> The transition is also troublesome, and maintenance costs increase.

To be honest, I don't see either of those problems.

The main reason I want to separate the repos is that rgf_python's repo contains a lot of information that is unnecessary for users of the RGF/FastRGF CLI versions. The repo's structure is dictated by the Python package structure, and the C++ code is hidden in one of dozens of folders. Separation also brings independence to the projects' development and frees us from code duplication. And

> Both RGF and FastRGF are independent projects and deserve separate repos.

As for tests, it's possible to write dedicated tests for RGF and FastRGF in the future (folders with examples are already there and could be executed on Travis/AppVeyor). Also, as I said before, any commit in the RGF/FastRGF repo could be checked by the rgf_python tests just by creating, let's say, a test branch with a pointer to the needed commit.

fukatani · 2018-03-03T05:40:17Z

> The main reason I want to separate the repos is that rgf_python's repo contains a lot of information that is unnecessary for users of the RGF/FastRGF CLI versions. The repo's structure is dictated by the Python package structure, and the C++ code is hidden in one of dozens of folders.

XGB serves CLI, Python, and R users, and that seems to cause no problems. Their folder structure is useful as a reference.
TensorFlow serves C++ API users, Python users, Java users, Go users, Keras users, skflow users, ...
and it is structured into more than 100 folders.

I agree that separating the repositories is "possible", but I don't feel that it is "best".
It is less confusing to develop, test, and release the wrapper and the kernel together.
We should think about why many famous machine learning repositories include both their wrapper and their kernel.

StrikerRUS · 2018-03-03T10:39:33Z

Your examples use submodules for core components too: LGB uses the compute module for the GPU version, and XGB uses cub, dmlc-core, nccl, rabit. It's logically correct not to copy-paste the needed code and then update it in many places, but to link to it instead.

> We should think about why many famous machine learning repositories include both their wrapper and their kernel.

Because the development was done by the same people and at the same time. RGF went a long time without a wrapper, so it's not good to hide a mature and independent project in one of the many folders of its wrapper. In your examples the C++ code takes the main place and the wrappers have their own subdirectories. rgf_python has the opposite situation; even the name of the project says that Python plays the main role in it.

Imagine the situation: someone wants to develop, let's say, a Java wrapper. What should they do? Copy-paste the needed C++ code? Or pull the full Python wrapper repo with a lot of unnecessary code? The current rgf_python structure doesn't make that easy.

In addition, what stops you from transferring the repo now "as is"?

fukatani · 2018-03-04T12:05:04Z

Since the discussion is overheating, it seems we need some time to cool down.

> In addition, what stops you from transferring the repo now "as is"?

It doesn't seem to be time to start teamwork yet.
Are you suggesting that we first transfer rgf_python as is to RGF-team and, after we reach consensus on a future plan, do any additional transfer work?

StrikerRUS · 2018-03-04T13:12:05Z

> Are you suggesting that we first transfer rgf_python as is to RGF-team and, after we reach consensus on a future plan, do any additional transfer work?

I think that transferring rgf_python to RGF-team now changes nothing in terms of the development scheme. You have Owner status in RGF-team and can continue development as you wish. It isn't connected with our discussion about using Git submodules, copy-pasting the code, or turning the C++ repos into subfolders of the wrapper. It's just a cosmetic change of the URL with the aim of collecting all RGF code in one place. And as you correctly noted, we should reach consensus before making any changes to the repos' structure.

fukatani · 2018-03-04T14:19:18Z

I moved this repository and made a logo for RGF-team. (Is it cool?)

BTW, can I merge this PR? I'd like to decide on the review system in the near future.

StrikerRUS · 2018-03-04T14:39:33Z

Great! 🌟

> and made a logo for RGF-team. (Is it cool?)

For some reason (I don't know why) I find the logo funny 😄 .

> BTW, can I merge this PR? I'd like to decide on the review system in the near future.

This PR is rather big and important, but it doesn't need to be merged urgently. My opinion is that Tong (or Rie, if she expresses the wish to join RGF-team) should take a look at it. I can also review the Python part of the PR, if you are not in a hurry.

fukatani · 2018-03-04T14:45:50Z

It is nice to be reviewed by the RGF authors (and by you, of course).
I'm not in a hurry, but I am a little concerned that Tong is busy.

fukatani · 2018-03-04T14:47:29Z

I compared feature_importances_ with those of a random forest; the trends were consistent.
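One way to put a number on "the trends were consistent" is a rank correlation between the two importance vectors. The sketch below is only an illustration of that idea, not what was run for this PR: the helper names `ranks` and `spearman` are hypothetical, and the two input vectors are made-up placeholders rather than measured results.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either vector is constant.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Made-up importance vectors for illustration only; NOT results from this PR.
rgf_like = [0.50, 0.30, 0.15, 0.05]
forest_like = [0.45, 0.35, 0.12, 0.08]
agreement = spearman(rgf_like, forest_like)
print(agreement)
```

A value close to 1 means both models order the features the same way, even if the raw importance values differ.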

StrikerRUS · 2018-03-04T14:56:38Z

> It is nice to be reviewed by the RGF authors (and by you, of course).
> I'm not in a hurry, but I am a little concerned that Tong is busy.

Thanks!
Let's wait for his answer. I've added him to the reviewers.

I think that for future PRs we should review each other's Python code and ask for Tong's review of C++ code. If there is no review from anyone for, let's say, a week, then we merge the PR. What do you think about it?

fukatani · 2018-03-04T15:05:32Z

Basically, OK.
What about changes to indentation or minor coding style?
And in some cases we may want to merge a PR in a hurry.

And I will review the C++ written by Tong as far as I can.

StrikerRUS · 2018-03-04T15:52:27Z

OK.

> What about changes to indentation or minor coding style?

What's your suggestion?

> And I will review the C++ written by Tong as far as I can.

If he has time to do it 😃 .

I'll review this PR in a few hours today.

StrikerRUS

Impressive work! 👍
I really like the new features introduced in this PR. But please think about a "real dump" of a model. I suppose it would be more useful than just printing to the console.

StrikerRUS · 2018-03-04T16:17:28Z

rgf/rgf_model.py

@@ -381,6 +381,38 @@ def _find_model_file(self):
                            'Training is abnormally finished.'.format(utils.TEMP_PATH))
        self._model_file = sorted(model_files, reverse=True)[0]

+    def dump_model(self):


I find the name a little confusing. Maybe print_model? Another option: really "dump" the model to a file and, if necessary, print it to the console. The signature would then be dump_model(file_name, print_to_console=False).

fukatani · 2018-03-05

"Dump" is not wrong, since it includes the meaning of "print".
Should I change it to print_model?
Or leave it as is, in preparation for a print_to_console argument?
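To make the proposed signature concrete, here is a minimal sketch of a "real dump" that writes to a file and optionally echoes to the console. It is written as a free function that takes the model text as an argument, which is an assumption made so the example is self-contained; in the PR `dump_model` is a method, and the model text shown is a hypothetical placeholder, not actual RGF output.

```python
import os
import tempfile


def dump_model(model_text, file_name, print_to_console=False):
    """Write a textual model description to file_name; optionally echo it."""
    with open(file_name, "w") as f:
        f.write(model_text)
    if print_to_console:
        print(model_text)
    return file_name


# Hypothetical model text standing in for whatever the RGF executable emits.
example_text = "tree[0]: 1 leaf\n"
dump_path = os.path.join(tempfile.mkdtemp(), "model_dump.txt")
dump_model(example_text, dump_path, print_to_console=True)
```

With this shape, printing becomes an opt-in side effect and the file is always the primary output.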

StrikerRUS · 2018-03-04T16:18:15Z

rgf/rgf_model.py

+    def feature_importances_(self):
+        """Return the feature importances.
+
+        The importance of a feature is computed from sum of gain of each nodes.


... each node.
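As an illustration of the docstring under review, the sketch below sums the gain of every split node per feature. The input format (a flat list of `(feature_index, gain)` pairs) and the final normalization to a sum of 1 are assumptions made for this example; the PR computes the values inside the RGF C++ code and may differ in both respects.

```python
def feature_importances(split_nodes, n_features, normalize=True):
    """Sum the gain of each split node per feature index.

    split_nodes: iterable of (feature_index, gain) pairs over all trees.
    """
    totals = [0.0] * n_features
    for feature_index, gain in split_nodes:
        totals[feature_index] += gain
    total_gain = sum(totals)
    # Normalizing is an assumption here, not something the PR states.
    if normalize and total_gain > 0:
        totals = [g / total_gain for g in totals]
    return totals


# Hypothetical (feature_index, gain) pairs; feature 1 is never used in a split.
nodes = [(0, 2.0), (2, 1.0), (0, 1.0)]
importances = feature_importances(nodes, n_features=3)
print(importances)  # [0.75, 0.0, 0.25]
```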

StrikerRUS · 2018-03-04T16:19:15Z

rgf/rgf_model.py

+        params.append("feature_importances_fn=%s" % self._feature_importances_loc)
+        params.append("model_fn=%s" % self._model_file)
+        cmd = (utils.RGF_PATH, "feature_importances", ",".join(params))
+        print(cmd)


Remove this line.
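For context, the command built in this hunk is a tuple of the executable path, an action name, and a single comma-joined string of key=value parameters. A minimal sketch of that assembly follows; `build_command` is a hypothetical helper, and the executable name and file paths are placeholders, not values used by rgf_python.

```python
def build_command(executable, action, params):
    """Return (executable, action, "key=value,key=value,...")."""
    joined = ",".join("%s=%s" % (key, value) for key, value in params)
    return (executable, action, joined)


# Placeholder executable and file names, for illustration only.
cmd = build_command(
    "rgf",
    "feature_importances",
    [("feature_importances_fn", "importances.txt"),
     ("model_fn", "model-01")],
)
```

If the command needs to be inspected while debugging, logging it behind the existing verbosity setting would avoid an unconditional `print(cmd)`.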

StrikerRUS · 2018-03-04T16:19:51Z

rgf/rgf_model.py

@@ -659,6 +691,34 @@ def _fit_multiclass_task(self, X, y, sample_weight, params):
                                                                                      sample_weight)
                                                        for i in range(self._n_classes))

+    def dump_model(self):


The same as above.

StrikerRUS · 2018-03-04T16:20:45Z

rgf/rgf_model.py

+    def feature_importances_(self):
+        """Return the feature importances.
+
+        The importance of a feature is computed from sum of gain of each nodes.


The same as above.

StrikerRUS · 2018-03-04T16:24:19Z

rgf/rgf_model.py

+    def feature_importances_(self):
+        """Return the feature importances.
+
+        The importance of a feature is computed from sum of gain of each nodes.


The same as above.

StrikerRUS · 2018-03-04T16:24:44Z

rgf/rgf_model.py

+        params.append("feature_importances_fn=%s" % self._feature_importances_loc)
+        params.append("model_fn=%s" % self._model_file)
+        cmd = (utils.RGF_PATH, "feature_importances", ",".join(params))
+        print(cmd)


The same as above.

StrikerRUS · 2018-03-04T16:29:54Z

rgf/utils.py

@@ -293,13 +294,13 @@ def _save_train_data(self, X, y, sample_weight):
            self._save_dense_files(X, y, sample_weight)
            self._is_sparse_train_X = False

-    def _execute_command(self, cmd):
+    def _execute_command(self, cmd, verbose=False):


Are you sure that we need a new parameter? If you don't like the idea of really dumping a model and then printing the file content, you can temporarily set self.verbose = True.

fukatani · 2018-03-05

I first came up with your solution, but the current implementation is simpler to me, and it takes fewer lines.
self.verbose should be set only once, by the user. What do you think?
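The per-call `verbose` flag discussed here can be sketched as follows. This is an assumption-laden illustration, not the code in rgf/utils.py: it is a free function rather than a method, it returns the captured output, and a Python one-liner stands in for the RGF executable so the example runs anywhere.

```python
import subprocess
import sys


def execute_command(cmd, verbose=False):
    """Run cmd and return (exit_code, output_lines); echo lines if verbose."""
    output_lines = []
    with subprocess.Popen(cmd,
                          stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          universal_newlines=True) as process:
        for line in process.stdout:
            line = line.rstrip("\n")
            output_lines.append(line)
            if verbose:
                print(line)
    # Leaving the with-block waits for the process, so returncode is set.
    return process.returncode, output_lines


# Stand-in for the RGF executable: a child Python process printing one line.
stand_in_cmd = (sys.executable, "-c", "print('tree dump')")
exit_code, lines = execute_command(stand_in_cmd, verbose=True)
```

Passing the flag per call leaves the estimator's own `verbose` attribute untouched, which is the point made in the reply above; temporarily flipping `self.verbose` would achieve the same output without a new parameter.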

StrikerRUS · 2018-03-04T17:15:01Z

> I compared feature_importances_ with those of a random forest; the trends were consistent.

Can you please share the results you obtained?

fukatani · 2018-03-05T13:17:23Z

> I really like the new features introduced in this PR. But please think about a "real dump" of a model. I suppose it would be more useful than just printing to the console.

For example, a dump in JSON format like LightGBM's.
It's convenient and we may support it in the future, but we should do it in another PR.
This PR is already big.

StrikerRUS · 2018-03-05T17:39:21Z

> For example, a dump in JSON format like LightGBM's.

Yeah, JSON seems to be a good choice!

> Should I change it to print_model?
> Or leave it as is, in preparation for a print_to_console argument?

> I first came up with your solution, but the current implementation is simpler to me, and it takes fewer lines.
> self.verbose should be set only once, by the user. What do you think?

If you plan to implement a "real dump" to JSON with the possibility of outputting the results to the console, I think it's OK to leave the name and the additional parameter (which will become unneeded and be deleted in the next PR) as is.

StrikerRUS · 2018-03-05T17:42:06Z

Tong asked me to email him whenever we need his attention, so I asked him to review this PR.

fukatani added 19 commits February 24, 2018 17:36

a7bbe24 refactoring
a00aa56 refactoring
08a07bb fix test
adbbf9e refactoring
5fe1d5f introduce impurity
a0f9e23 Support dump_model
9db8f5f impurity to gain
a808edf clean up
b531595 wip
50bc789 wip
d87cffe wip
09e8fd0 fix feature importances
5d15f19 implement help message
a250e05 refactoring
d99a44c write doc
527008d fix test
3e2d3d5 RGFClassifier supports feature importances
4579d83 Add feature importances test
e221d78 Merge remote-tracking branch 'refs/remotes/origin/master' (Conflicts: include/rgf/src/tet/AzFindSplit.cpp)

fukatani changed the title ~~[WIP] Rgf feature importances~~ Rgf feature importances Feb 27, 2018

StrikerRUS requested a review from TongZhang-ML March 4, 2018 14:56

StrikerRUS requested changes Mar 4, 2018

Addressed review comments

2c51ad2

StrikerRUS approved these changes Mar 5, 2018

TongZhang-ML merged commit 535f5d7 into master Mar 6, 2018

This was referenced Mar 14, 2018

repo's structure #166 (Closed)
Feature importances #109 (Open)
dump RGF and FastRGF to the JSON file #167 (Open)

StrikerRUS deleted the rgf-feature-importances branch June 11, 2018 00:28

fukatani mentioned this pull request Jun 22, 2018

Add copy right for RGF #187 (Merged)
