Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames #113

Closed
rhiever opened this issue Mar 15, 2016 · 5 comments

Comments

@rhiever
Contributor

rhiever commented Mar 15, 2016

Since we don't maintain the column names any more, it seems that we could replace the pandas DataFrames in our pipeline structure with numpy matrices. We're always changing the data into numpy matrices anyway when passing them to the sklearn operations, so I'm not seeing the point of using pandas DataFrames any more.

This might make TPOT more memory efficient, as we won't introduce DataFrame overhead either.

To make this happen, we would need to:

  • Store the train/test indices as internal self variables (in place of having a group column)
  • Ensure that the class column is always the last entry in the matrix (in place of having a class column)
  • Ensure that the latest guess column is always the second-to-last entry in the matrix (in place of having a guess column)
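The layout described in the list above could look roughly like the following. This is only a sketch under assumed names (`features`, `guesses`, `classes`, `train_idx`, `test_idx` are all hypothetical and not taken from the TPOT code base), and the "model" is a stand-in that predicts the majority training class.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 3 feature columns.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 3))
classes = np.array([0, 1, 0, 1, 1, 0])
guesses = np.zeros(6)  # placeholder until a model has produced predictions

# Single matrix: features first, latest guess second-to-last, class last.
data = np.column_stack([features, guesses, classes])

# Train/test membership kept as index arrays instead of a 'group' column.
train_idx = np.array([0, 1, 2, 3])
test_idx = np.array([4, 5])

X_train = data[train_idx, :-2]             # feature columns only
y_train = data[train_idx, -1].astype(int)  # class column is always last
X_test = data[test_idx, :-2]

# Stand-in for a fitted model: predict the majority training class everywhere,
# then write the result into the second-to-last (guess) column.
majority = np.bincount(y_train).argmax()
data[:, -2] = majority
```

One thing this makes visible: a single matrix has a single dtype, so the class labels are stored as floats and have to be cast back to integers when they are read out.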

I believe that this would also make #29 much easier to implement.

Any downsides to this change that we can think of?

@rasbt
Contributor

rasbt commented Mar 15, 2016

> We're always changing the data into numpy matrices anyway when passing them to the sklearn operations, so I'm not seeing the point of using pandas DataFrames any more.

Yes, I agree 100%

> Ensure that the class column is always the last entry in the matrix (in place of having a class column)

I'd suggest using 2 arrays ("matrices") instead of 1. The reason is that you may want separate NumPy dtypes: the feature array (e.g., X) as a float array and the class label array (e.g., y) as an integer array. Here, you can simply operate via a meta-array, an index array, that is created at the very beginning of the TPOT tree: something like idx = np.asarray(range(data.shape[0])). Operating via the idx variable then (X[idx], y[idx]) could also help you avoid creating unnecessary copies of the arrays internally (plus, having "shared memory" for parallelization at some point is probably a good idea as well).
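A minimal sketch of the two-array idea, assuming toy values for `X` and `y` and a hypothetical 3/1 train/test split (none of this is taken from the TPOT code base). It also checks one caveat: in NumPy, indexing with an integer array returns a copy, while a basic slice returns a view, so whether an index array avoids copies depends on how it is applied.

```python
import numpy as np

# Separate arrays so each keeps its own dtype.
X = np.array([[0.1, 1.2],
              [0.3, 0.8],
              [0.5, 0.4],
              [0.7, 0.0]], dtype=float)
y = np.array([0, 1, 0, 1], dtype=int)

# Index array created once; equivalent to np.asarray(range(X.shape[0])).
idx = np.arange(X.shape[0])

# Hypothetical split: first three rows for training, last row for testing.
train_idx, test_idx = idx[:3], idx[3:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Integer-array ("fancy") indexing allocates a new array ...
fancy_shares = np.shares_memory(X, X_train)
# ... whereas a basic slice is a view onto the same buffer.
slice_shares = np.shares_memory(X, X[:3])
```

If avoiding copies matters, one option is to pass `X`, `y`, and the index arrays around together and apply the indices only at the point where a subset is actually needed.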

@rhiever changed the title from "Refactor TPOT to work directly with numpy matrixes instead of pandas DataFrames?" to "Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames?" on Mar 19, 2016
@rhiever changed the title from "Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames?" to "Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames" on Apr 30, 2016
@dipanjanS

Really looking forward to this!

@tonyfast

Perhaps it might be best to use pandas under the hood? It is a supercharged numpy object. Much of tpot relies on pandas machinery and it might make extra work to manage state with it.

@rhiever
Contributor Author

rhiever commented May 31, 2016

pandas comes with quite a bit of overhead. sklearn doesn't use pandas; I don't think it's necessary for TPOT to use it either.

@rhiever
Contributor Author

rhiever commented Aug 19, 2016

This change will be in the 0.5 release.
