Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-define the type system #8352

Closed
wants to merge 3 commits into from

Conversation

wangkuiyi
Copy link
Collaborator

Fix #8336

Related with #8339

This PR is not supposed to work. It is just a reference to how we need to re-design Fluid's type system.

}
optional LoDTensorDesc lod_tensor = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoDTensorDesc => LoDTensor?

required Type type = 1;

message Tensor {
required Type data_type = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the data_type here for Tensor only POD types?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Our C++ implementation of Tensor requires that elements have to be C++ POD types:

https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/tensor.h#L169

It looks to me that we can check in C++ code that data_type must be of POD types. But I am open to better solutions. What do you think?

Copy link
Contributor

@abhinavarora abhinavarora Feb 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested by @wangkuiyi , one way could be to leave it as it is and check in the C++ code that data_type must be a POD. Another way could be to define this inside protobuf. We can split Type into 2 different enums :

enum PodType {
    // POD types
    BOOL = 0;
    INT16 = 1;
    INT32 = 2;
    INT64 = 3;
    FP16 = 4;
    FP32 = 5;
    FP64 = 6;
}

enum CompositeType {
    LOD_TENSOR = 1;    // LoDTensor
    SELECTED_ROWS = 2; // Tensor
    FEED_MINIBATCH = 3;
    FETCH_LIST = 4;
    STEP_SCOPES = 5;
    LOD_RANK_TABLE = 6;
    // LOD_TENSOR_ARRAY = 7; // replaced by array-of-tensors. c.f. ArrayType
    // PLACE_LIST = 8; // replaced by array-of-place. c.f. ArrayType
    READER = 7; // ReaderType
}

Once we have done this, we can define a field type in VarType as follows:

message VarType {
    oneof type {
        PodType ptype = 1;
        CompositeType ctype = 2;
    }
}
```protobuf

This way we can capture the requirement of our Tensor and LODTensor inside protobuf message. The tensor can be defined as follows:

```protobuf
message Tensor {
    required PodType data_type = 1;
    repeated int64 dims = 2; // [UNK, 640, 480] is saved as [-1, 640, 480]
  }

What do you guys think @wangkuiyi and @jacquesqiao ?

}
optional Array array = 4;

message Reader { repeated LoDTensorDesc lod_tensor = 1; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoDTensorDesc => LoDTensor

FP64 = 6;

// other types, that might need additonal descriptions
LOD_TENSOR = 1; // LoDTensorType
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because other types are in the same enum Type with POD types, the index should not be duplicated. So the index should not start from 1.


message ReaderDesc { repeated LoDTensorDesc lod_tensor = 1; }
message Array { VarType elem_type = 1; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the VarType be repeated?
And message Array can be used as array of VarType in which the VarType should be the same and n-tuple in which the VarType of the array can be different, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the VarType in the array should be repeated. This is because the array will have all elements of the same VarType. In an n-tuple, the VarType will be repeated.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Abhinav's comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abhinavarora Thanks, you are right, I finally know where I was wrong.
We should add message Tuple {repeated VarType tuple_type = 1;}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this has also been discussed here: #8336 Instead of VarDesc we will have VarType as has been done in this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, channel of channels is a commonly used pattern used with CSP

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a channel can hold so many types, we cannot make shape inference easily at compile time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Tuple and Array should be two different types.

  • The Array can hold elements with the same type and shape at compile type. The shape of array_read can be inferenced at compile time only if the array holds the same time.

    • The length of Array can only be decided at runtime since there are array_write operators for RNN. If we make array can hold different types, we cannot inference type at compile time.
  • The Tuple is an immutable type. The length of Tuple and its elements will not be changed at compile time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reyoung You are right, Tuple and Array are 2 different types. The type of the array elements will be known at compile time. Tuple should be immutable and its length and elements will not change. I am just a little curious how we could implement thin our cpp code. Today, we map the LODTensorArray directly to a Vector<LODTensor>. For an Array would we have to create mappings for all types?

Copy link
Contributor

@chengduoZH chengduoZH Feb 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we have to create mappings for all types?

I am curious about this too.

Copy link
Contributor

@abhinavarora abhinavarora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! This was exactly as we discussed on Friday. I have few more questions that I have asked in the comments.

@abhinavarora
Copy link
Contributor

abhinavarora commented Feb 11, 2018

Since @wangkuiyi was unable to see my comment, I am pasting it here again:

There are 2 ways to make sure that Tensors and LODTensors only contain POD data types. One way, as suggested by @wangkuiyi, is to leave it as it is and check in the C++ code that data_type must be a POD. Another way could be to define this inside protobuf. We can split Type into 2 different enums :

enum PODType {
    // POD types
    BOOL = 0;
    INT16 = 1;
    INT32 = 2;
    INT64 = 3;
    FP16 = 4;
    FP32 = 5;
    FP64 = 6;
}

enum CompositeType {
    LOD_TENSOR = 1;    // LoDTensor
    SELECTED_ROWS = 2; // Tensor
    FEED_MINIBATCH = 3;
    FETCH_LIST = 4;
    STEP_SCOPES = 5;
    LOD_RANK_TABLE = 6;
    // LOD_TENSOR_ARRAY = 7; // replaced by array-of-tensors. c.f. ArrayType
    // PLACE_LIST = 8; // replaced by array-of-place. c.f. ArrayType
    READER = 7; // ReaderType
}

Once we have done this, we can define a field type in VarType as follows:

message VarType {
    oneof type {
        PODType ptype = 1;
        CompositeType ctype = 2;
    }
   ...
}

This way we can capture the requirement of our Tensor and LODTensor inside protobuf message. The tensor can be defined as follows:

message Tensor {
    required PodType data_type = 1;
    repeated int64 dims = 2; // [UNK, 640, 480] is saved as [-1, 640, 480]
  }

More information about oneof in protobuf can be found here
What do you guys think @wangkuiyi and @jacquesqiao ?

@kavyasrinet
Copy link

Isn't this similar to what @wangkuiyi has proposed in the PR , except for splitting the VarType into two separate enums ? Or am I missing something ?

Copy link

@kavyasrinet kavyasrinet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for starting off the PR. The refinement done here should suffice for now and is the same idea as we had discussed on Friday.

@abhinavarora
Copy link
Contributor

@kavyasrinet The above comment was in response to the following thread -> #8352 (comment)

@kavyasrinet
Copy link

I see. Thanks for the clarification, I agree with both the approaches.

LOD_TENSOR_ARRAY = 7;
PLACE_LIST = 8;
READER = 9;
}
required string name = 1;
required VarType type = 2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can VarDesc::type be POD type like BOOL, INT16? Or it can only be LOD_TENSOR, SELECTED_ROWS...?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VarType::Type contains POD type.

FETCH_LIST = 10;
STEP_SCOPES = 11;
LOD_RANK_TABLE = 12;
// LOD_TENSOR_ARRAY = 7; // replaced by array-of-tensors. c.f. ArrayType
Copy link
Collaborator

@reyoung reyoung Feb 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All tensors in the LOD_TENSOR_ARRAY will hold the same type and the same shape in compile time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. So LoDTensorArray is not Arary. We got that. Thanks for the reminder!

@jacquesqiao
Copy link
Member

jacquesqiao commented Feb 12, 2018

@abhinavarora @wangkuiyi I personally prefer the first way to separate the two kinds of Type: PODType and CompositeType. There are several reasons:

  • In the future, if we want to add some new POD type to our framework.proto, like int8, it will be better to add it to the end of PODType::FP64 but not Type::LOD_RANK_TABLE.
  • IsPOD will be easier to implement with oneof, we did not need to repeat all the types in POD when do the check in C++ and Python.

@wangkuiyi
Copy link
Collaborator Author

wangkuiyi commented Feb 12, 2018

@jacquesqiao @abhinavarora @reyoung and everyone: Let us use a single enum and do NOT separate POD types from composite types, as I posted in the PR.

I understand your point is due to the fact that the composite type Tensor requires that its elements are POD types. Let us follow this logic and imagine that in the future we might have an additional composite type, whose elements must be, say, tensors. How are we going to handle this? Do we need to separate LOD_TENSOR and SELECTED_ROW out from CompositeType and make it a new enum named TensorType? If we do so, we will have

enum PODType {...}
enum TensorType {...}
enum CompositeType {...}
message VarType {
  oneof {
    PODType
    TensorType
    CompositeType
  }
  ...
}

There are two critical problems with the above strategy:

  1. It is intractable to have an asymptotically infinite amount of enums with the development of Fluid.

  2. Logically, TensorType and CompositeType are not peers to each other and cannot simply be described by the oneof keyword. Instead, we must have methods like:

    bool IsPOD(const VarType&);
    bool IsComposite(const VarType&); // returns true for tensors and other composite types
    bool isTensor(const VarType&); // returns true for tensors but not other composite types

@paddle-bot-old paddle-bot-old bot closed this May 22, 2020
@paddle-bot-old
Copy link

Since you haven't replied for a long time, we have closed this issue/pr.
If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.
由于您长期未回复,我们将关闭这个issue/pr。
若问题未解决或有后续问题,请随时重新打开,我们会继续跟进。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Re-design the type system of Fluid
7 participants