Adding the Xavier Initializer #5270

Merged
merged 4 commits into PaddlePaddle:develop from the xavier branch on Nov 2, 2017

Conversation

abhinavarora
Contributor

No description provided.

fan_out = shape[1]
else:
# Assume this to be a convolutional kernel
# In Paddlepaddle, the shape of the kernel is like:
Contributor

s/Paddlepaddle/PaddlePaddle/g

Contributor Author

Fixed in 4e1fa1b
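
For context on the snippet above (the kernel-layout comment is cut off here), this is a minimal sketch of how fan_in and fan_out are typically derived for dense and convolutional weights. It assumes an [out_channels, in_channels, height, width] kernel layout and is only an illustration, not the PR's exact _compute_fans:

import numpy as np

def compute_fans(shape):
    # Dense weight: [fan_in, fan_out]; conv kernel (assumed layout):
    # [out_channels, in_channels, *spatial_dims].
    if len(shape) == 2:
        fan_in, fan_out = shape[0], shape[1]
    else:
        receptive_field = int(np.prod(shape[2:]))
        fan_in = shape[1] * receptive_field
        fan_out = shape[0] * receptive_field
    return fan_in, fan_out

print(compute_fans([64, 32, 3, 3]))  # (288, 576) for a 3x3 kernel with 32 in / 64 out channels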

(http://proceedings.mlr.press/v9/glorot10a.html)
"""

def __init__(self, uniform=True, fan_in=None, fan_out=None, seed=0):
Contributor

Normal distribution is used more frequently than Uniform distribution. Shall we change the default value of uniform to False?

Contributor
@lcy-seso lcy-seso Nov 1, 2017

I think the uniform distribution is acceptable to me; my concern is that we are not yet at a stage where we are intensively encoding the community's best practices into our framework. We can carefully tune this later.

But a uniform distribution over [-1, 1] as the default initialization seems terrible (only from my personal experience; it seems too large). Maybe we can reduce the range limit to 1e-3?

Contributor Author

I was looking at TensorFlow, where they keep uniform as the default. That is why I chose uniform as the default here.

Contributor

Got it~ thank you for the information.
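
For a sense of scale in the [-1, 1] discussion above: the Xavier uniform limit x = sqrt(6 / (fan_in + fan_out)) from the docstring below is already far smaller than 1 for typical layer sizes. A quick check with a hypothetical 784-by-256 dense layer:

import math

fan_in, fan_out = 784, 256  # hypothetical dense layer shape
limit = math.sqrt(6.0 / (fan_in + fan_out))
print(limit)  # ~0.076, versus a fixed [-1, 1] range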

def __init__(self, uniform=True, fan_in=None, fan_out=None, seed=0):
"""Constructor for XavierInitializer

Args:
Contributor

Please add descriptions for the default values.

Contributor Author

Thank you for the feedback. Could you give me an example of the description you are referring to? I have talked about fan_in and fan_out in the Args section of my docstring.

Contributor
@pengli09 pengli09 Nov 2, 2017

Just ignore this comment; I didn't realize that the signature of the function would be displayed in the doc.

For example, "uniform: .... Default value is True.". Otherwise, one may need to dig into the code to find out what the default values are.
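
A sketch of the requested style, spelling out "Default value is ..." for every argument (the exact wording is only a suggestion, not the PR's final docstring):

class XavierInitializer(object):
    def __init__(self, uniform=True, fan_in=None, fan_out=None, seed=0):
        """Constructor for XavierInitializer

        Args:
            uniform: whether to use the uniform (True) or normal (False)
                distribution. Default value is True.
            fan_in: fan_in for Xavier initialization; inferred from the
                variable if None. Default value is None.
            fan_out: fan_out for Xavier initialization; inferred from the
                variable if None. Default value is None.
            seed: random seed for the initializer op. Default value is 0.
        """
        self._uniform = uniform
        self._fan_in = fan_in
        self._fan_out = fan_out
        self._seed = seed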

f_in, f_out = self._compute_fans(var)

# If fan_in and fan_out are passed, use them
fan_in = self._fan_in or f_in
Contributor

f_in will be used if self._fan_in is 0, which is counter-intuitive. Therefore I think f_in if self._fan_in is None else self._fan_in is better. 0 is not a proper value for self._fan_in, which means that when encountering a 0, either the user is using it intentionally for debugging purposes or an error has occurred. Therefore, I think failing deterministically would be a better choice.
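
In other words, an explicit None check lets a user-supplied 0 surface as an error rather than being silently replaced. A sketch of the suggested replacement lines (names taken from the snippet above):

f_in, f_out = self._compute_fans(var)

# Fall back to the computed fans only when the user passed None;
# `self._fan_in or f_in` would silently discard a (suspicious) 0.
fan_in = f_in if self._fan_in is None else self._fan_in
fan_out = f_out if self._fan_out is None else self._fan_out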

Contributor Author

Fixed in 4e1fa1b


# If fan_in and fan_out are passed, use them
fan_in = self._fan_in or f_in
fan_out = self._fan_out or f_out
Contributor

See above.

outputs={"Out": var},
attrs={
"shape": var.shape,
"data_type": int(var.data_type),
Contributor

I'm not familiar with Block. Is int correct?

Contributor Author

Yes, this is correct. We pass the data type as an integer, which is then mapped to the corresponding type.
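
Purely as an illustration of that convention (these codes are made up, not PaddlePaddle's actual data-type enum), an integer attribute can be mapped back to a named type like this:

# Hypothetical integer codes; the real mapping lives in the framework.
HYPOTHETICAL_DATA_TYPES = {
    0: "float32",
    1: "float64",
    2: "int32",
}

def decode_data_type(code):
    # Map the integer attribute value back to a named data type.
    return HYPOTHETICAL_DATA_TYPES[int(code)]

print(decode_data_type(0))  # float32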

limit = np.sqrt(6.0 / (param.shape[0] + param.shape[1]))
self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
self.assertEqual(init_op.attr('seed'), 0)
Contributor

I think we should also test whether seed can be set properly.

Contributor Author

Fixed in 4e1fa1b
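
A sketch of the kind of check asked for above, following the assertion style of the existing test; the helper that builds the parameter and returns its initializer op is assumed scaffolding, not code from this PR:

def test_xavier_initializer_seed(self):
    # Assumed helper: builds a program, creates a parameter with the given
    # initializer, and returns the prepended initializer op.
    init_op = self.create_parameter_with_initializer(
        initializer.XavierInitializer(uniform=True, seed=134))
    self.assertEqual(init_op.attr('seed'), 134)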

approximately same in all the layers. In case of Uniform distribution,
the range is [-x, x], where x = sqrt(6 / (fan_in + fan_out)).
In case of Normal distribution, the mean is 0 and the standard deviation
is sqrt(2/ (fan_in + fan_out)).
Contributor

The recommended scale/std is different for different nonlinearity functions. I think we can borrow the idea from PyTorch to make this initializer more general.
https://github.com/pytorch/pytorch/blob/master/torch/nn/init.py#L8
https://github.com/pytorch/pytorch/blob/master/torch/nn/init.py#L184

Contributor Author

The original paper does not talk about ReLUs or gain because it was written before ReLUs became popular. I also looked at the TensorFlow and Keras source code; they have kept this initialization as it is defined in the paper. Maybe we can merge this for now and add the gain later in a separate PR when we have more knowledge about it. I can read a few more papers to see how the gain attribute is used. I do not think it is right to borrow the idea directly without researching it thoroughly. We can merge this and look at it in more detail after the refactoring is complete. What do you suggest?
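
For reference, the PyTorch-style generalization discussed here only multiplies the Glorot terms by an activation-dependent gain. A hedged sketch (the function name is made up; gain=1.0 reproduces the formulas in this PR's docstring):

import math

def xavier_scale(fan_in, fan_out, gain=1.0, uniform=True):
    # gain=1.0: uniform limit sqrt(6 / (fan_in + fan_out)),
    #           normal std    sqrt(2 / (fan_in + fan_out)).
    if uniform:
        return gain * math.sqrt(6.0 / (fan_in + fan_out))
    return gain * math.sqrt(2.0 / (fan_in + fan_out))

# Recommended gains in torch.nn.init: tanh -> 5/3, ReLU -> sqrt(2), linear -> 1.
print(xavier_scale(256, 128, gain=math.sqrt(2)))  # ReLU-scaled uniform limit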

Contributor
@lcy-seso lcy-seso left a comment

I like reading your docs/comments; they are quite easy to follow. Thank you.

networks. International conference on artificial intelligence and
statistics.
(http://proceedings.mlr.press/v9/glorot10a.html)
"""
Contributor
@lcy-seso lcy-seso Nov 1, 2017

I like reading your comments~ They are quite easy to follow.

Just out of personal interest: I remember this is the recommended initializer for tanh, and the usually recommended initializer varies with the activation.

I think we can just add all these helpful initializers into our framework as the first goal, and later choose a better initialization strategy to encode the best practices.

Contributor Author

Thank you for the feedback. I agree with you.


Contributor
@lcy-seso lcy-seso left a comment

LGTM.

@lcy-seso
Contributor
lcy-seso commented Nov 2, 2017

Not changing the default uniform distribution is acceptable to me. Personally, this is not a top concern currently, and I do not have strong evidence to support any other choice (they all seem reasonable).

@abhinavarora abhinavarora merged commit 66d1c6c into PaddlePaddle:develop Nov 2, 2017
@abhinavarora abhinavarora deleted the xavier branch November 2, 2017 17:51