Rgf feature importances #161
Conflicts: include/rgf/src/tet/AzFindSplit.cpp
Wow! Great job! I've created a separate repo for RGF C++: https://github.com/StrikerRUS/rgf . Of course, we should have done this earlier... But better late than never! 😃 I've invited you as a collaborator to this repo, please accept the invitation. After that I will transfer the ownership of the repo to Tong Zhang. We'll remain collaborators. |
Thank you for your comment. Basically, I agree that this PR exceeds the scope of the wrapper. I think there are two ways.
|
I agree that tests are a very important part of the repo that helps developers, and that we'll face some problems after separating the C++ and Python code. But it will be possible to just create a new branch and update the Git submodule in the wrapper's repo, so the tests will be triggered (I understand that it's not so comfortable, but it's OK in some sense). Anyway, RGF was not written by us, and it's the wrong way to just copy the original code into the wrapper's repo. XGB and LGB are written by the same people, so it's OK for them to hold all the code in one repo. Moreover, submodules are a tradeoff between independence (we can reference any specific commit we want) and the need to maintain the C++ code or duplicate it. And what about users who want to use the command-line tool? It's not easy to find out how to do that in the wrapper's repo. Both RGF and FastRGF are independent projects and deserve separate repos. My opinion is that the right way is a modified (1): create |
I've created |
Thank you for creating the organization. But there is no need to hurry.
Why is it the wrong way? In terms of license? Copyright? For example, TensorFlow includes skflow, Keras, ... and other repositories. Honestly, I can't see the benefit of managing them separately.
The number of maintainers will increase in the future, so I want to choose a simple and easy way. |
Would it be better to upload a clone of rgf_python rather than transfer this repository? |
Anyway, whatever system we end up with, it is a lot of fun to have this run by three people. Tong Zhang joining is a great benefit for the RGF community. 👍 |
Yeah! Good point! But you shouldn't worry about it. GitHub sets up redirects automatically once the transfer is finished. You can check it by following this link: https://github.com/StrikerRUS/rgf . I've transferred ownership of
To be honest, I don't see any of them. The main reason I want to separate repos is that
Speaking of tests, it's possible to write dedicated tests for RGF and FastRGF (folders with examples are already there and could be executed on the Travis/AppVeyor side) in the future. Also, as I said before, any commit to the RGF/FastRGF repo could be checked by |
XGB serves CLI, Python, and R users from one repo, and it seems to be no problem. Their folder structure is useful as a reference. I agree that separating the repositories is "possible", but I don't feel that it is "best". |
Your examples use submodules for the core components too: LGB uses
Because the development was done by the same people and at the same time. RGF went a long time without a wrapper, so it's not good to hide a mature, independent project in one of the many folders of its wrapper. In your examples the C++ code takes the main place and the wrappers have their own subdirectories. Imagine the situation: someone wants to develop, let's say, a Java wrapper. What should they do? Copy-paste the needed C++ code? Or pull the full Python wrapper repo with a lot of unnecessary code? The current In addition, what stops you from transferring the repo now "as is"? |
Since the discussion is overheating, it seems time is needed to cool down.
It seems it is not yet time to start teamwork. |
I think that transferring |
I moved this repository and made a logo for RGF-team. (Is it cool?) BTW, can I merge this PR? I'd like to decide on the review system in the near future. |
Great! 🌟
For some reason (I don't know why) I find the logo funny 😄 .
This PR is rather big and important, but it's not critical to merge it fast. My opinion is that Tong (or Rie, if she expresses the wish to join RGF-team) should take a look at it. I can also review the Python part of the PR, if you are not in a hurry. |
It would be nice to be reviewed by the RGF authors (and by you, of course). |
I compared |
Thanks! I think that for future PRs we should review each other's Python code and ask Tong to review the C++ code. If there is no review from anyone for, let's say, a week, then we merge the PR. What do you think about it? |
Basically, OK. And I will review the C++ code written by Tong as far as I can. |
OK.
What's your suggestion?
If he has time to do it 😃 . I'll review this PR in a few hours today. |
Impressive work! 👍
I really like the new features introduced in this PR. But please think about a "real dump" of the model. I suppose it would be more useful than just printing to the console.
@@ -381,6 +381,38 @@ def _find_model_file(self):
'Training is abnormally finished.'.format(utils.TEMP_PATH))
self._model_file = sorted(model_files, reverse=True)[0]

def dump_model(self):
I find the name a little confusing. Maybe print_model? Or another variant: really "dump" the model to a file and, if necessary, print it to the console. So the signature would be dump_model(file_name, print_to_console=False).
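A minimal sketch of the second variant, assuming the model description is already available as a string. The model_text argument and the standalone function form are illustrative assumptions, not the wrapper's actual API; only the file_name and print_to_console parameters come from the suggestion above.

```python
import os
import tempfile


def dump_model(model_text, file_name, print_to_console=False):
    """Write a textual model description to file_name; optionally echo it.

    model_text is a hypothetical stand-in for whatever the RGF executable
    reports about the trained model.
    """
    with open(file_name, "w") as f:
        f.write(model_text)
    if print_to_console:
        print(model_text)
    return file_name


# Usage: dump to a temporary file and also echo to the console.
dump_path = os.path.join(tempfile.mkdtemp(), "model_dump.txt")
dump_model("tree[0]: 3 nodes\n", dump_path, print_to_console=True)
```

With this shape, printing becomes an opt-in side effect of a real dump, so the method name matches what it does.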
Dump is not wrong, since it includes the meaning of "print". Should I change it to print_model? Or leave it as is, in preparation for the print_to_console argument?
rgf/rgf_model.py
Outdated
def feature_importances_(self):
"""Return the feature importances.

The importance of a feature is computed from sum of gain of each nodes.
... each node.
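To illustrate the idea in the docstring, here is a hedged sketch of gain-based importances. The (feature index, gain) node representation and the normalization to a sum of 1 are assumptions made for illustration; in this PR the computation is done on the C++ side and may differ in detail.

```python
def gain_importances(nodes, n_features):
    """Sum the gain of every split node per feature, then normalize.

    nodes: iterable of (feature_index, gain) pairs, one per internal node
           (a hypothetical representation, not RGF's model format).
    """
    totals = [0.0] * n_features
    for feature_index, gain in nodes:
        totals[feature_index] += gain
    total_gain = sum(totals)
    if total_gain == 0.0:
        return totals  # no splits at all: every importance stays 0
    return [t / total_gain for t in totals]


# Three split nodes: feature 0 is used twice, feature 2 once.
example_nodes = [(0, 2.0), (2, 1.0), (0, 1.0)]
importances = gain_importances(example_nodes, n_features=3)
```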
rgf/rgf_model.py
Outdated
params.append("feature_importances_fn=%s" % self._feature_importances_loc)
params.append("model_fn=%s" % self._model_file)
cmd = (utils.RGF_PATH, "feature_importances", ",".join(params))
print(cmd)
Remove this line.
@@ -659,6 +691,34 @@ def _fit_multiclass_task(self, X, y, sample_weight, params):
sample_weight)
for i in range(self._n_classes))

def dump_model(self):
The same as above.
rgf/rgf_model.py
Outdated
def feature_importances_(self):
"""Return the feature importances.

The importance of a feature is computed from sum of gain of each nodes.
The same as above.
rgf/rgf_model.py
Outdated
def feature_importances_(self):
"""Return the feature importances.

The importance of a feature is computed from sum of gain of each nodes.
The same as above.
rgf/rgf_model.py
Outdated
params.append("feature_importances_fn=%s" % self._feature_importances_loc)
params.append("model_fn=%s" % self._model_file)
cmd = (utils.RGF_PATH, "feature_importances", ",".join(params))
print(cmd)
The same as above.
@@ -293,13 +294,13 @@ def _save_train_data(self, X, y, sample_weight):
self._save_dense_files(X, y, sample_weight)
self._is_sparse_train_X = False

-def _execute_command(self, cmd):
+def _execute_command(self, cmd, verbose=False):
Are you sure that we need a new parameter? If you don't like the idea of really dumping the model and then printing the file content, you can temporarily set self.verbose = True.
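One way to set the flag temporarily without leaking the change is a small context manager that always restores the previous value. This is only a sketch: temporarily_verbose and DummyEstimator are hypothetical names introduced for illustration and are not part of the wrapper.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_verbose(estimator):
    """Set estimator.verbose to True inside the block, then restore it."""
    previous = estimator.verbose
    estimator.verbose = True
    try:
        yield estimator
    finally:
        # Restore the user's setting even if the wrapped call raises.
        estimator.verbose = previous


class DummyEstimator:
    """Hypothetical stand-in for the wrapper's estimator class."""

    def __init__(self, verbose=False):
        self.verbose = verbose


est = DummyEstimator(verbose=False)
with temporarily_verbose(est):
    inside = est.verbose  # a command executed here would see verbose=True
after = est.verbose
```

The try/finally keeps the user's own verbose setting intact, which addresses the concern that self.verbose should be set only by the user.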
I first came up with your solution, but the current implementation is simpler for me, and takes fewer lines. self.verbose is set only once, by the user. What do you think?
Can you please share the results you obtained? |
For example, dump in JSON format like LightGBM. |
Yeah, JSON seems to be a good choice!
If you plan to implement a "real dump" to JSON with the possibility of printing the results to the console, I think it's OK to leave the name and the additional parameter (which will become unnecessary and be deleted in the next PR) as is. |
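A rough sketch of what such a JSON dump could look like. The dump_model_json name and the {"trees": [...]} layout are assumptions made up for this example; they are neither LightGBM's dump format nor RGF's model format.

```python
import json
import os
import tempfile


def dump_model_json(trees, file_name, print_to_console=False):
    """Write a model as JSON to file_name and optionally echo it.

    trees: list of plain dicts describing each tree (hypothetical schema).
    """
    payload = {"trees": trees}
    with open(file_name, "w") as f:
        json.dump(payload, f, indent=2)
    if print_to_console:
        print(json.dumps(payload, indent=2))
    return payload


json_path = os.path.join(tempfile.mkdtemp(), "model.json")
example_trees = [{"tree_index": 0, "num_nodes": 3}]
written = dump_model_json(example_trees, json_path, print_to_console=True)
```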
Tong asked me to email him in case we need his attention, so I asked him to review this PR. |
No description provided.