Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames #113

rhiever · 2016-03-15T00:47:34Z

Since we don't maintain the column names any more, it seems that we could replace the pandas DataFrames in our pipeline structure with numpy matrices. We're always changing the data into numpy matrices anyway when passing them to the sklearn operations, so I'm not seeing the point of using pandas DataFrames any more.

This might make TPOT more memory efficient, as we won't introduce DataFrame overhead either.

To make this happen, we would need to:

- Store the train/test indices as internal self variables (in place of having a group column)
- Ensure that the class column is always the last entry in the matrix (in place of having a class column)
- Ensure that the latest guess column is always the second-to-last entry in the matrix (in place of having a guess column)
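
As a rough illustration of that layout (a minimal sketch only, not TPOT's actual code; the variable names `features`, `guesses`, `classes`, `train_idx`, and `test_idx` are hypothetical), a single matrix with the guess column second-to-last and the class column last could look like this:

```python
import numpy as np

# Toy data; sizes and values are made up for illustration.
rng = np.random.default_rng(0)
n_samples, n_features = 6, 3
features = rng.normal(size=(n_samples, n_features))
classes = np.array([0, 1, 0, 1, 1, 0])
guesses = np.zeros(n_samples)

# Assumed column layout: [feature columns..., guess, class]
data = np.column_stack([features, guesses, classes])

# Train/test membership kept as index arrays instead of a "group" column.
train_idx = np.array([0, 1, 2, 3])
test_idx = np.array([4, 5])

X_train = data[train_idx, :-2]  # every column except guess and class
y_train = data[train_idx, -1]   # class is the last column
data[test_idx, -2] = 1.0        # write a guess into the second-to-last column
```

One thing this makes visible: stacking everything into one matrix forces a single dtype, so the integer class labels end up stored as floats.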

I believe that this would also make #29 much easier to implement.

Any downsides to this change that we can think of?


rasbt · 2016-03-15T03:30:10Z

> We're always changing the data into numpy matrices anyway when passing them to the sklearn operations, so I'm not seeing the point of using pandas DataFrames any more.

Yes, I agree 100%

> Ensure that the class column is always the last entry in the matrix (in place of having a class column)

I'd suggest using 2 arrays ("matrices") instead of 1. The reason is that you may want to have separate NumPy dtypes: the feature array (e.g., `X`) as a float array and the class label array (e.g., `y`) as an integer array. Here, you can simply operate via a meta-array, an index array that is created at the very beginning of the TPOT tree: something like `idx = np.asarray(range(data.shape[0]))`. Operating via the `idx` variable (`X[idx]`, `y[idx]`) could then also help you avoid creating unnecessary copies of the arrays internally (plus, having "shared memory" for parallelization at some point is probably a good idea as well).
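
A minimal sketch of the two-array idea, assuming toy data and hypothetical names (`train_idx`, `test_idx`); it is meant only to show the mechanics, not how TPOT implements it:

```python
import numpy as np

# Separate arrays let features and labels keep their own dtypes.
X = np.arange(12, dtype=np.float64).reshape(6, 2)  # float features
y = np.array([0, 1, 0, 1, 1, 0], dtype=np.int64)   # integer class labels

# Index array created once, up front.
idx = np.arange(X.shape[0])

# Slices of the index array are views, so splitting it is cheap.
train_idx, test_idx = idx[:4], idx[4:]

# Rows are only materialized when they are actually needed.
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

One caveat worth keeping in mind: integer-array indexing such as `X[train_idx]` returns a copy of the selected rows in NumPy, so any saving comes from passing the small index arrays between steps and deferring the row selection, not from the indexing itself.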

dipanjanS · 2016-05-12T19:03:19Z

Really looking forward to this!

tonyfast · 2016-05-31T14:24:13Z

Perhaps it might be best to use pandas under the hood? It is a supercharged numpy object. Much of tpot relies on pandas machinery and it might make extra work to manage state with it.

rhiever · 2016-05-31T15:12:28Z

pandas comes with quite a bit of overhead. sklearn doesn't use pandas; I don't think it's necessary for TPOT to use it either.

rhiever · 2016-08-19T17:47:05Z

This change will be in the 0.5 release.

rhiever added the need contributor and question labels Mar 15, 2016

rhiever changed the title ~~Refactor TPOT to work directly with numpy matrixes instead of pandas DataFrames?~~ Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames? Mar 19, 2016

rhiever added the being worked on and enhancement labels and removed the question label Apr 25, 2016

rhiever changed the title ~~Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames?~~ Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames Apr 30, 2016

rhiever removed the need contributor label Jun 1, 2016

tonyfast mentioned this issue Jun 3, 2016

[WIP] Refactor tpot to many sklearn models #164

Closed

rhiever added this to the Major refactor milestone Jun 17, 2016

rhiever closed this as completed Aug 19, 2016

danthedaniel mentioned this issue Aug 22, 2016

TypeError: 'NoneType' object is not iterable #234

Closed

AIAdventures mentioned this issue Jun 6, 2017

Titanic example -problem with 2nd last cell. #492

Closed

saddy001 mentioned this issue Mar 20, 2018

Segfault on optimization process #676

Closed

westurner mentioned this issue Jul 20, 2018

Parallelization with python dask and dask-learn. Proposal. #304

Open
