
Refactor Pipeline To Seamlessly Use A Data Container Object #49


alexbrillant (Member) commented on Sep 9, 2019:

1. Introduce handle_transform and handle_fit_transform methods inside BaseStep, to execute side effects on the new DataContainer object after fits and transforms.

2. Introduce a DataContainer object to keep track of:

• original_ids
• ids
• data inputs
• expected outputs

This will enable hashing of the ids and hyperparams. The hyperparams/ids checkpointing itself will be done in the next PR.
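For context, a minimal sketch of what such a DataContainer could look like, based only on the fields listed above (the actual class in this PR may differ):

    class DataContainer:
        def __init__(self, ids, data_inputs, expected_outputs=None, original_ids=None):
            # ids as first computed from the data, kept for checkpoint lookups
            self.original_ids = original_ids if original_ids is not None else ids
            # current ids, rehashed as hyperparams change along the pipeline
            self.ids = ids
            self.data_inputs = data_inputs
            self.expected_outputs = expected_outputs

        def set_data_inputs(self, data_inputs):
            self.data_inputs = data_inputs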

cla-bot added the cla-signed label on Sep 9, 2019.


    class TruncableSteps(BaseStep, ABC):

        def __init__(
                self,
                steps_as_tuple: NamedTupleList,
                hyperparams: HyperparameterSamples = dict(),
    -           hyperparams_space: HyperparameterSpace = dict()
    +           hyperparams_space: HyperparameterSpace = dict(),
    +           hasher: Hasher = None

alexbrillant (Author, Member) commented on Sep 9, 2019:

Maybe force the hasher prop here, without a default value (see the comments above).

guillaume-chevalier (Member) commented on Sep 9, 2019:

I'd use the ID hasher, as the pipeline will use that as a starting point for the data.

Potential bug I just thought of: if we have nested pipelines, will the inner pipeline have access to the rehashed IDs of the new data? It'd need to override the handle methods to obtain it.
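A hedged sketch of the override being suggested, using the handle_transform naming from this PR (the exact wiring is an assumption):

    class Pipeline(TruncableSteps):
        def handle_transform(self, data_container: DataContainer) -> DataContainer:
            # Because the inner pipeline receives the whole container rather
            # than bare data_inputs, it sees the outer pipeline's rehashed ids.
            for _, step in self.steps_as_tuple:
                data_container = step.handle_transform(data_container)
            return data_container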

    -        with open(self.get_checkpoint_file_path(data_inputs_checkpoint_file_name), 'rb') as file:
    -            data_inputs = pickle.load(file)
    +        with open(self.get_checkpoint_file_path(data_container), 'rb') as file:
    +            checkpoint = pickle.load(file)

alexbrillant (Author, Member) commented on Sep 9, 2019:

Rename to data_container_checkpoint.

    @@ -96,44 +102,45 @@ def __init__(self, force_checkpoint_name: str = None, cache_folder: str = DEFAUL
             self.cache_folder = cache_folder
             self.force_checkpoint_name = force_checkpoint_name

    -    def read_checkpoint(self, data_inputs):
    +    def read_checkpoint(self, data_container: DataContainer):

alexbrillant (Author, Member) commented on Sep 9, 2019:

Add a return type annotation.

guillaume-chevalier (Member) commented on Sep 9, 2019:

I think it returns a DataContainer.
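A hedged sketch of the annotated version, mirroring the pickle-based body shown earlier in this diff (that get_checkpoint_file_path accepts a DataContainer here is an assumption):

    def read_checkpoint(self, data_container: DataContainer) -> DataContainer:
        # Load the pickled checkpoint and hand it back as a DataContainer.
        with open(self.get_checkpoint_file_path(data_container), 'rb') as file:
            checkpoint: DataContainer = pickle.load(file)
        return checkpoint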


    def set_checkpoint_path(self, path):
        """
        Set checkpoint path inside the cache folder (ex: cache_folder/pipeline_name/force_checkpoint_name/data_inputs.pickle).
        :param path: checkpoint path
        """
        if path is None:

alexbrillant (Author, Member) commented on Sep 9, 2019:

Clean code: use self.force_checkpoint_name directly, for clarity.

guillaume-chevalier (Member) commented on Sep 9, 2019:

TODO: initialize the checkpoint path in the initialize method?

alexbrillant changed the title from "Introduce Base Step Hasher & Refactor Pipeline To Seamlessly Use DataContainer Object" to "Refactor Pipeline To Seamlessly Use DataContainer Object" on Sep 9, 2019.

alexbrillant changed the title from "Refactor Pipeline To Seamlessly Use DataContainer Object" to "Refactor Pipeline To Seamlessly Use A Data Container Object" on Sep 9, 2019.

alexbrillant requested a review from guillaume-chevalier on Sep 9, 2019.

    @@ -29,14 +29,55 @@
     from neuraxle.hyperparams.space import HyperparameterSpace, HyperparameterSamples


    -class BaseStep(ABC):
    +class Hasher(ABC):

alexbrillant (Author, Member) commented on Sep 9, 2019:

TODO: make only one hasher, which hashes hyperparams.

guillaume-chevalier (Member) commented on Sep 9, 2019:

1 or 2 (abstract).

            raise NotImplementedError()


    class HasherByIndex(Hasher):

alexbrillant (Author, Member) commented on Sep 9, 2019:

We should not be hashing by data.

guillaume-chevalier (Member) commented on Sep 9, 2019:

Except the first time (so 3 classes).

    @@ -29,14 +29,55 @@
     from neuraxle.hyperparams.space import HyperparameterSpace, HyperparameterSamples


    -class BaseStep(ABC):
    +class Hasher(ABC):
    +    def hash(self, hyperparameters: HyperparameterSamples, data_inputs: Any):

alexbrillant (Author, Member) commented on Sep 9, 2019:

We don't need hash anymore; we just need to initialize the ids in the DataContainer.
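That is, instead of a hash method on every step, the ids could be seeded once when building the container. A sketch under that assumption (the helper name is hypothetical, and hashing pickled items is just one possible scheme):

    import hashlib
    import pickle

    def initialize_ids(data_inputs) -> list:
        # One id per data input; after this initial seeding, only
        # hyperparam-based rehashing would touch the ids.
        return [hashlib.md5(pickle.dumps(di)).hexdigest() for di in data_inputs]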

Neuraxio deleted a comment from guillaume-chevalier on Sep 9, 2019.

    ):
        self.hasher = hasher

        if self.hasher is None and hyperparams is not None and hyperparams != {}:

alexbrillant (Author, Member) commented on Sep 9, 2019:

Not needed anymore.



    class IDHasher(Hasher):
        def rehash(self, ids, hyperparameters: HyperparameterSamples, data_inputs: Any):
            return ids

guillaume-chevalier (Member) commented on Sep 9, 2019:

Just delete this one; I'd keep the HasherByID for pipeline.transform's initialization of the original IDs, and everywhere else I'd use the hasher by hyperparams.
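A sketch of the hyperparam-based rehashing proposed here, following the rehash signature from this diff (the class name and md5 scheme are assumptions):

    import hashlib

    class HasherByHyperparameter(Hasher):
        def rehash(self, ids, hyperparameters: HyperparameterSamples, data_inputs: Any):
            # Fold the current hyperparams into each id, so that changing a
            # hyperparam invalidates any checkpoint keyed on downstream ids.
            hp_digest = hashlib.md5(str(sorted(hyperparameters.items())).encode()).hexdigest()
            return [hashlib.md5((str(i) + hp_digest).encode()).hexdigest() for i in ids]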

"""
new_self, out = self.fit_transform(data_container.data_inputs, data_container.expected_outputs)
data_container.set_data_inputs(out)

guillaume-chevalier (Member) commented on Sep 9, 2019:

Add something like:

    data_container.id_rehash = self.hasher.rehash(data_container.id_rehash, self.get_hyperparams())

    def inverse_transform_one(self, data_output):
        # return data_input
        raise NotImplementedError("TODO: Implement this method in {}.".format(self.__class__.__name__))

guillaume-chevalier (Member) commented on Sep 9, 2019:

Open an issue to delete the "_one" classes.



guillaume-chevalier (Member) left a review comment:

More unit tests would be good too, using tmpdir and checking that at least some of the checkpoint files exist on disk.
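For instance, a hedged test sketch (the step classes, constructor arguments, and file layout are placeholders, not this PR's exact API):

    import os

    def test_pipeline_writes_checkpoint_files(tmpdir):
        # Hypothetical setup: a pipeline with a checkpoint step caching to tmpdir.
        pipeline = Pipeline([SomeStep(), PickleCheckpointStep(cache_folder=str(tmpdir))])
        pipeline.fit_transform([1, 2, 3], [2, 4, 6])

        # At least some checkpoint files should now exist on disk.
        written = [f for _, _, files in os.walk(str(tmpdir)) for f in files]
        assert len(written) > 0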

    @@ -67,19 +71,21 @@ def set_checkpoint_path(self, path):
             raise NotImplementedError()

         @abstractmethod
    -    def read_checkpoint(self, data_inputs) -> Any:
    +    def read_checkpoint(self, data_container: DataContainer) -> DataContainer:

guillaume-chevalier (Member) commented on Sep 9, 2019:

TODO: we should start thinking about single-item (streaming) checkpoints and multiple-item (block) checkpoints...


        super().__init__(steps)

    def __init__(self, steps: NamedTupleList, hasher=HasherByIndex()):
        if hasher is None:
            raise ValueError('Pipeline hasher cannot be None')

guillaume-chevalier (Member) commented on Sep 9, 2019:

TODO: default it to None and add an if that creates it in the constructor. We want to avoid default arguments passed by reference. Same comment as in BaseStep's constructor.
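The standard Python idiom being requested looks like this (whether the parent constructor takes a hasher kwarg is an assumption from the diff above):

    def __init__(self, steps: NamedTupleList, hasher: Hasher = None):
        if hasher is None:
            # Created per instance, so no mutable default is shared across calls.
            hasher = HasherByIndex()
        super().__init__(steps, hasher=hasher)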

        :return: transformed data inputs
        """
        ids = self.hasher.hash(self.hyperparams, data_inputs)
        data_container = DataContainer(ids=ids, data_inputs=data_inputs)

guillaume-chevalier (Member) commented on Sep 9, 2019:

When do we set ids_rehashed = ids at the beginning, once? Should we initialize ids_rehashed at the same time as ids, like I suggest?

alexbrillant (Author, Member) commented on Sep 10, 2019:

Right now, id_rehashed is named ids in the DataContainer class. DataContainer holds both original_ids and ids.

"""
steps_left_to_do, data_inputs = self.read_checkpoint(data_inputs)
steps_left_to_do, data_container = self._load_pipeline_checkpoint(data_container)

guillaume-chevalier (Member) commented on Sep 9, 2019:

We'll need to start thinking about how (and whether) we resume the steps here.

    @@ -35,13 +35,14 @@ def __init__(
             self,
             wrapped_sklearn_predictor,
             hyperparams_space: HyperparameterSpace = None,
    -        return_all_sklearn_default_params_on_get=False
    +        return_all_sklearn_default_params_on_get=False,
    +        hasher: Hasher = IDHasher()

guillaume-chevalier (Member) commented on Sep 9, 2019:

I wouldn't specify a hasher here; I'd let the BaseStep receive None and create a HasherByHyperparameter, as in my first comments way above.
