Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rgf feature importances #161

Merged
merged 20 commits into from Mar 6, 2018
Merged

Rgf feature importances #161

merged 20 commits into from Mar 6, 2018

Conversation

fukatani
Copy link
Member

No description provided.

@fukatani fukatani changed the title [WIP] Rgf feature importances Rgf feature importances Feb 27, 2018
@StrikerRUS
Copy link
Member

Wow! Great job!

I've created a separate repo for RGF C++: https://github.com/StrikerRUS/rgf . Of course, we had to do it more early... But better late than never! 😃 I've invited you as a collaborator to this repo, please accept the invitation. After doing this I will transfer the ownership of the repo to Tong Zhang. We'll remain collaborators.
I'll help you to transfer all your work from current subdirectory to the new repo if you want.

@fukatani
Copy link
Member Author

Thank you for your comment.

Basically, I agree with this PR exceeds the scope of the wrapper.
But I'm not sure that separating wrapper from RGF is the best way.
For example, XGB and LightGBM repository includes both C++ and their wrapper.
Maintenance costs for versioning and testing are also if wrapper and C ++ are held in one repository. RGF (and fast RGF) itself doesn't have any tests, so rgf_python test is essential for development them.

I think there are two ways.

  1. Make group RGF, and transfer clone whole rgf_python.
  2. A slightly rough way, rename this repository to "RGF."
    How do you think?

@StrikerRUS
Copy link
Member

StrikerRUS commented Feb 28, 2018

I agree that tests is very important part of the repo which helps developers and we'll face some problems after separating C++ and Python code. But it'll be possible just to create new branch and update GitHub submodule into wrapper's repo, so tests will be triggered (I understand that it's not so comfortable, but OK in some meaning).

Anyway, RGF is now written by us and it's wrong way to just copy the original code into the wrapper's repo. XGB and LGB are written by the same people, so it's OK to held all code in one repo for them.

Moreover, submodules is a tradeoff between independence (we can reference any certain commit we want) and need to maintain the C++ code or duplicate it. Also what's about the users who want to use command line tool? It's not so easy to find the way to do it into wrapper's repo. Both RGF and FastRGF are independent projects, and deserve separate repos.

My opinion about the situation is that the right way is modified (1): create RGF-team organization on GitHub and transfer all three repos under its owning. I'm not sure about the difficulties connected with transfer of fast_rgf, I mean baidu is the owner now, so I'll contact Tong with this question.

@StrikerRUS
Copy link
Member

I've created RGF-team organization and invited you and Tong Zhang there. Please accept the invitation and transfer rgf_python repo.
https://github.com/RGF-team

@fukatani
Copy link
Member Author

fukatani commented Mar 2, 2018

Thank you for creating organization.

But there is no need to hurry.
And before starting to transfer, we should agree with transfer process (and maintenance policy may be).

XGB and LGB are written by the same people, so it's OK to held all code in one repo for them.

Why wrong method? In terms of licence? Copyright?

For example, tensorflow including skflow keras, .... and other repository.
They were originally independent projects.

Honestly, I can not feel the benefit of separately managing.
Transition is also troublesome, maintenance costs increase.

Also what's about the users who want to use command line tool?
They can use git grep.

The maintainer will increase in the future, so I want to choose a simple and easy way.

@fukatani
Copy link
Member Author

fukatani commented Mar 2, 2018

Would it be better to upload clone rgf_python than transferring this repository?
Some articles are linking to https://github.com/fukatani/rgf_python.
We should guidance to new repository in https://github.com/fukatani/rgf_python readme.

@fukatani
Copy link
Member Author

fukatani commented Mar 2, 2018

Anyway, whatever system it will be, It is a lot of fun to be run by three people. It was a great benefit for RGFcommunity Tong Zhang join. 👍
Thanks @StrikerRUS !

@StrikerRUS
Copy link
Member

StrikerRUS commented Mar 2, 2018

Would it be better to upload clone rgf_python than transferring this repository?
Some articles are linking to https://github.com/fukatani/rgf_python.
We should guidance to new repository in https://github.com/fukatani/rgf_python readme.

Yeah! Good point! But you shouldn't worry about it. GitHub makes redirects automatically after transferring is finished. You could check it by following this link: https://github.com/StrikerRUS/rgf . I've transferred ownership of rgf repo to RGF-team.

Transition is also troublesome, maintenance costs increase.

To be honest, I don't see any of them.

The main reason I want to separate repos is that rgf_python's repo contains much unnecessary information for RGF/FastRGF CLI version users. Repo's structure is dictated by Python package structure and C++ code is hidden into one of the dozens folders. Also, it brings independence into projects development and frees us from code duplication. And

Both RGF and FastRGF are independent projects, and deserve separate repos.

To speak about tests, it's possible to write own tests for RGF and FastRGF (folders with examples are already there and could be executed at Travis/Appveyor side) in the future. Also, as I said before, any commit into RGF/FastRGF repo could be checked by rgf_python tests just by creating, let say, test branch with a pointer to a needed commit.

@fukatani
Copy link
Member Author

fukatani commented Mar 3, 2018

The main reason I want to separate repos is that rgf_python's repo contains much unnecessary information for RGF/FastRGF CLI version users. Repo's structure is dictated by Python package structure and C++ code is hidden into one of the dozens folders.

XGB holding CLI and Python and R users, though it seems no problem. Their folder structure is useful as a reference.
Tensorflow holding C++ API users, Python users, java users, go users, Keras users, skflow users, ...
and they structured by >100 folders.

I agree with separating repository is "possible", but not feel that it is "best".
It is less confusing to develop wrapper and kernel to test and to release simultaneously.
We should think why many famous machine learning repository includes their wrapper and kernel.

@StrikerRUS
Copy link
Member

Your examples use submodules for the core components too: LGB uses compute module for GPU version and XGB uses cub, dmlc-core, nccl, rabit. It's logically correct not to copy-paste the needed code and then update this code in many places, but use a link to it.

We should think why many famous machine learning repository includes their wrapper and kernel.

Because the development were done by the same people and in the same time. RGF much time hadn't a wrapper, so it's not good to hide the adult and independent project into the one of multiple folders of its' wrapper. In your examples the main place takes C++ code, wrappers have their subdirectories. rgf_python has the opposite situation, even the name of the project says that Python plays the main role in it.

Imagine the situation: someone want to develop, let say, Java wrapper. What should he do? Copy-paste the needed C++ code? Or pull the full Python wrapper repo with much unnecessary code? The current rgf_python's structure doesn't allow to it easy.

In addition, what stops you to transfer the repo now "as is"?

@fukatani
Copy link
Member Author

fukatani commented Mar 4, 2018

Since the discussion is overheating, it seems time is needed to cool down.

In addition, what stops you to transfer the repo now "as is"?

It seems not to be yet time to start teamwork.
Do you suggest once transfer rgf_python as is to RGF-team and after we reach consensus future plan, we will take additional transfer work?

@StrikerRUS
Copy link
Member

StrikerRUS commented Mar 4, 2018

Do you suggest once transfer rgf_python as is to RGF-team and after we reach consensus future plan, we will take additional transfer work?

I think that transferring rgf_python now to RGF-team means nothing in terms of development scheme. You have Owner status in RGF-team and can continue the development as you wish. It isn't connected with our discussion about either using Git submodules, or copy-pasting the code, or transforming C++ repos into wrappers subfolders. It's just cosmetic change of the URL with the aim to collect all code about RGF in one place. And as you correctly noticed, we should reach the consensus before performing any changes in the repos' structure.

@fukatani
Copy link
Member Author

fukatani commented Mar 4, 2018

I moved this repository and made a logo for RGF-team. (Is it cool?)

BTW, Can I merge this PR? I'd like to decide the review system in the near future.

@StrikerRUS
Copy link
Member

StrikerRUS commented Mar 4, 2018

Great! 🌟

and made a logo for RGF-team. (Is it cool?)

For some reason (I don't know why) I find the logo funny 😄 .

BTW, Can I merge this PR? I'd like to decide the review system in the near future.

This PR is rather big and important, but not critical for fast merging. My opinion is that Tong (or Rie, if she express the wish to join RGF-team) should take a loot at it. Also I can review Python part of the PR, if you are not in a hurry.

@fukatani
Copy link
Member Author

fukatani commented Mar 4, 2018

It is nice to be reviewed by RGF authors (of cource and by you).
I'm not in a hurry, but I am little concerned that Tong is busy.

@fukatani
Copy link
Member Author

fukatani commented Mar 4, 2018

I compared feature_importances_ with the random forest, the trends were consistent.

@StrikerRUS
Copy link
Member

It is nice to be reviewed by RGF authors (of cource and by you).
I'm not in a hurry, but I am little concerned that Tong is busy.

Thanks!
Let's wait for his answer. I've added him to reviewers.

I thinks for future PR we should review each other for Python code and ask Tong's review for C++ code. If there is no review from anyone, let say, for week, then merge PR. What do think about it?

@fukatani
Copy link
Member Author

fukatani commented Mar 4, 2018

Basically, OK.
What about changing indenting or fine coding style?
And in some cases we may want to merge PR in a hurry.

And I will review C ++ written by Tong as far as I can.

@StrikerRUS
Copy link
Member

OK.

What about changing indenting or fine coding style?

What's your suggestion?

And I will review C ++ written by Tong as far as I can.

If he will have a time to do it 😃 .

I'll review this PR in a few hours today.

Copy link
Member

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work! 👍
Really like new features introduced in this PR. But please think about "real dump" of a model. I suppose it'll be more useful than just printing to the console.

@@ -381,6 +381,38 @@ def _find_model_file(self):
'Training is abnormally finished.'.format(utils.TEMP_PATH))
self._model_file = sorted(model_files, reverse=True)[0]

def dump_model(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the name is little confusing. Maybe print_model? Or another variant: really "dump" model to a file and if necessary print it to the console. So the signature will be dump_model(file_name, print_to_console=False).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dump is not wrong since it includes the meaning of "print".
Should I change it to print_model?

Or leave it to prepare for print_to_console argument?

rgf/rgf_model.py Outdated
def feature_importances_(self):
"""Return the feature importances.

The importance of a feature is computed from sum of gain of each nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... each node.

rgf/rgf_model.py Outdated
params.append("feature_importances_fn=%s" % self._feature_importances_loc)
params.append("model_fn=%s" % self._model_file)
cmd = (utils.RGF_PATH, "feature_importances", ",".join(params))
print(cmd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line.

@@ -659,6 +691,34 @@ def _fit_multiclass_task(self, X, y, sample_weight, params):
sample_weight)
for i in range(self._n_classes))

def dump_model(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above.

rgf/rgf_model.py Outdated
def feature_importances_(self):
"""Return the feature importances.

The importance of a feature is computed from sum of gain of each nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above.

rgf/rgf_model.py Outdated
def feature_importances_(self):
"""Return the feature importances.

The importance of a feature is computed from sum of gain of each nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above.

rgf/rgf_model.py Outdated
params.append("feature_importances_fn=%s" % self._feature_importances_loc)
params.append("model_fn=%s" % self._model_file)
cmd = (utils.RGF_PATH, "feature_importances", ",".join(params))
print(cmd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above.

@@ -293,13 +294,13 @@ def _save_train_data(self, X, y, sample_weight):
self._save_dense_files(X, y, sample_weight)
self._is_sparse_train_X = False

def _execute_command(self, cmd):
def _execute_command(self, cmd, verbose=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that we need a new parameter? If you don't like the idea about real dumping a model with next printing the file content, you can temporary set self.verbose = True.

Copy link
Member Author

@fukatani fukatani Mar 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I first come out come up with your solution, but current implementation is simple for me. And fewer line numbers.
self.verbose setted by only user one time. How do you think?

@StrikerRUS
Copy link
Member

I compared feature_importances_ with the random forest, the trends were consistent.

Can you please share obtained results?

@fukatani
Copy link
Member Author

fukatani commented Mar 5, 2018

Really like new features introduced in this PR. But please think about "real dump" of a model. I suppose it'll be more useful than just printing to the console.

For example dump in JSON format like lightGBM.
It's convenient and we may support it in the future, but we should do it with another PR.
This PR is already big.

@StrikerRUS
Copy link
Member

For example dump in JSON format like lightGBM.

Yeah, JSON seem to be a good choice!

Should I change it to print_model?
Or leave it to prepare for print_to_console argument?

I first come out come up with your solution, but current implementation is simple for me. And fewer line numbers.
self.verbose setted by only user one time. How do you think?

If you have plans to implement "real dump" to JSON with possibility to output results to the console, I think it's OK to leave name and additional parameter (which will be unneeded and deleted in next PR) as is.

@StrikerRUS
Copy link
Member

Tong asked me to email him in case we need any his attention, so I asked to review this PR.

@TongZhang-ML TongZhang-ML merged commit 535f5d7 into master Mar 6, 2018
@StrikerRUS StrikerRUS deleted the rgf-feature-importances branch June 11, 2018 00:28
@fukatani fukatani mentioned this pull request Jun 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants