Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional use of header to get user IP address #6973

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Jun 9, 2020

What this PR does / why we need it: Allows optional specification of an HTTP header as the source of user ip addresses. This is needed to get valid ip addresses for MDC and IP based access control when running Dataverse behind a proxy/load balancer

Which issue(s) this PR closes:

Closes #6970

Special notes for your reviewer: There are security implications if the ip can be spoofed. The header is explicitly not used/ignored if it reports a localhost (127.0.0.1) address

Suggestions on how to test this: This should work with/without a header set on a 'normal' test instance (since the header should be null, and should change the reported IP (e.g. as reported in the mdc logs) on a server behind an AWS load balancer when you use 'X-Forwarded-For' from the IP of the load balancer to the correct user IP address.

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No

Is there a release notes update needed for this change?: Yes? Since it affects MDC reporting.

Additional documentation:

@coveralls
Copy link

coveralls commented Jun 9, 2020

Coverage Status

Coverage decreased (-0.03%) to 19.56% when pulling f520d38 on QualitativeDataRepository:IQSS/6970_use_header_for_ip_address into 75742b9 on IQSS:develop.

Copy link
Contributor

@poikilotherm poikilotherm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qqmyers would you like me to provide a review for this? I see some potential for optimization and have opinions 😉

@qqmyers
Copy link
Member Author

qqmyers commented Jun 9, 2020

Sure. That would be helpful.

FWIW @landreev and I are looking into issues of multiple values/multiple headers right now.

@poikilotherm
Copy link
Contributor

Sure. That would be helpful.

FWIW @landreev and I are looking into issues of multiple values/multiple headers right now.

Great. It's quite late already over here, so please let me add my review tomorrow (while you are asleep 😄).

@landreev
Copy link
Contributor

landreev commented Jun 9, 2020

@poikilotherm

@qqmyers would you like me to provide a review for this? I see some potential for optimization and have opinions 😉

We absolutely want to hear your feedback.
But just in case one of your questions was why was it being implemented as a JVM, and not a database option - the answer is, it would take an impractical amount of code changes, because of how that DataverseRequest object is created and initialized throughout the application.

@qqmyers
Copy link
Member Author

qqmyers commented Jun 9, 2020

@poikilotherm - based on a side discussion, I've updated the code to handle cases where there may be multiple headers and/or headers with several IP addresses. Not sure how that affects your existing comments but hopefully you can still review.

We believe that the ordering is always such that the last value is closest to Dataverse. This code should cover the single proxy/load balancer case and enable IP group support with, for example, a campus proxy and load balancer, although in the latter case, MDC gets the campus proxy IP address (since that's what would be needed for IP groups.). If you think the logic is confused or the code isn't clear, we can hop into chat, etc.

Copy link
Contributor

@poikilotherm poikilotherm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a very high level of looking at this and keeping in mind its security relevance, I would not recommend implementing these checks on such a deep level of the application.

From my point of view, it would make much more sense to implement this as a HTTPFilter class.

  • Can be done without changes to web.xml by using the @WebFilter annotation.
  • Should catch calls to the REST API, too.
  • Can replace the remote address in the HttpServletRequest with the client IP instead of the proxy
  • Can be configured once on startup to avoid the repeated validation issue

I'm also missing tests for this crucial part of the code.


private static String headerToUse = System.getProperty("dataverse.useripaddresssourceheader");

private static final HashSet<String> ALLOWED_HEADERS = new HashSet<String>(Arrays.asList(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing support for RFC 7239 based headers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. No reason those shouldn't be included except that they have more structure and it would take more time to understand them/test, etc.

Comment on lines 88 to 91
int index = ip.indexOf(',');
if (index >= 0) {
ip = ip.substring(index + 1);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this is attempting to support the comma separated list used in the X-Forwarded-For header.

What I don't understand: the indexOf will return an index to first appearance of , in the list. Then this is used to get the part beyond that as the IP.

Let's have an example:
10.0.0.1, 123.13.123.1, 134.94.231.0

  1. The code will result in a substring " 123.13.123.1, 134.94.231.0", which is no valid IP.
  2. It just extracted the IP of the proxies between the service and the client. Please review the format: X-Forwarded-For: client, proxy1, proxy2
  3. This doesn't handle whitespace, too.

I don't think this has been intended here, does it?

Maybe using commons.lang3.StringUtils.substringBefore() is the easiest way forward, null safe etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Straight up bug - lastIndexOf() was what I intended.

Comment on lines 121 to 128
String remoteAddressFromRequest = aHttpServletRequest.getRemoteAddr();
if (remoteAddressFromRequest != null) {
remoteAddressStr = remoteAddressFromRequest;
try {
address = IpAddress.valueOf(remoteAddressStr);
} catch (IllegalArgumentException iae) {
address = IpAddress.valueOf(saneDefault);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very verbose. Wouldn't it make sense to refactor the code of IpAddress.valueOf() to be null safe and use a default value as second parameter? That would render this into a oneliner, much easier to read and understand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable.

private final User user;
private final IpAddress sourceAddress;

private final static String undefined = "0.0.0.0";
private final static String localhost = "127.0.0.1";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this variable is not used anymore?

public DataverseRequest(User aUser, HttpServletRequest aHttpServletRequest) {
this.user = aUser;

final String undefined = "0.0.0.0";
String saneDefault = undefined;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make sense to refactor this to use a class variable saneDefault directly then?

String saneDefault = undefined;
String remoteAddressStr = saneDefault;

IpAddress address = null;

if (aHttpServletRequest != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no handling of when the request IS null, which would be a major failure. Instead, the object is created anyway. Either remove the if or throw an exception in case it's null. (Why would that ever happen?) You might think of using @NotNull annotation for the parameter to document that you assume this will be never null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there are places in the code where null is sent. I'd have to look again to find specific examples but I think they are places where there's internal/async processing and possibly for tests. Refactoring could probably be done to avoid that.

Comment on lines +77 to +85
// One can have multiple instances of a given header. They SHOULD be in order.
Enumeration<String> ipEnumeration = aHttpServletRequest.getHeaders(headerToUse);
if (ipEnumeration.hasMoreElements()) {
// Always get the last header, which SHOULD be from the proxy closest to
// Dataverse
String ip = ipEnumeration.nextElement();
while (ipEnumeration.hasMoreElements()) {
ip = ipEnumeration.nextElement();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a source this behavior is actually more than grey theory? I could imagine that you might have a situation where there are multiple different headers caused by different proxy implementations beyond each other (which is unsupported here). At least the most common de-facto standard X-Forwarded-For is handling multiple proxies differently, see below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We spent time yesterday looking online for specs/practices and saw that multiple headers are allowed. It wasn't clear if they are in practice. Non-buggy software is supposed to have later proxies put their X-Forwarded-For headers after the initial ones so there is a global order even if there are multiple headers and some headers with multiple values. We debated whether we should try to support this or just flag it and fail, but the challenge is that, if you just use getHeader() you get the first header, which is not the one you'd want, so it looked like you still need to getHeaders() at some point.

logger.warning("Ignoring invalid IP address received in header " + headerToUse + " : " + ip);
address = null;
}
if (address!= null && address.isLocalhost()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the use of .isLocalhost() instead of checking for a String. 👍

Comment on lines +97 to +104
if (ip != null) {
// This conversion will throw an IllegalArgumentException if it can't be parsed.
try {
address = IpAddress.valueOf(ip);
} catch (IllegalArgumentException iae) {
logger.warning("Ignoring invalid IP address received in header " + headerToUse + " : " + ip);
address = null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very verbose. Please see my comment below about refactoring IpAddress.valueOf() becoming null safe (here: just return null if given).

// Not allowed since it is hard to image why a localhost request would be
// proxied and we want to protect
// the internal admin apis that can be restricted to localhost access
logger.warning("Ignoring localhost received as IP address in header " + headerToUse + " : " + ip);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seriously? This is a malicious hacking attempt and all we do is log it? IMHO this request should be aborted with a 401.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The attempt fails. That said, we couldn't think of a legitimate case for localhost to be sent. IP addresses that can't be parsed (the check before this) is another one where there's probably no reason to tolerate it as the code does now. Using a filter would definitely make it easier/more efficient to just refuse the request.

Copy link
Contributor

@landreev landreev Jun 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be fair, we don't know for the fact that this is a malicious hacking attempt. Could be a misconfigured proxy, or an honest API scripting mistake, or something of that nature. I will mention this particular warning in the guide, as something to look for.
I will also recommend that anyone behind a proxy adds %header.x-forwarded-for% (or whatever header it is in their case) to the access log format line in their Payara configuration (something we've been doing on our servers). Then the admin should monitor the log; and specifically look for entries with the localhost address in them - something like 127.0.0.1, 111.222.333.444 - could potentially indicate someone trying to send it along maliciously. An entry like 111.222.333.444, 127.0.0.1 on the other hand would indicate that the proxy is doing something seriously wrong, possibly messing up the order of the proxies in the comma-separated list. In which case the practice of trusting the header as the source of the true ip address should be discontinued.

I will also remind everybody in that guide section, that allowing localhost connections to bypass authentication for admin calls is also optional and configurable. And that any Dataverse running behind a proxy that is interested in enabling this mechanism of extracting true IP addresses from headers may consider disabling it after weighing all the risks involved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I guess the thing to emphasize here is that it's less likely to be a "malicious attempt", and much more likely a "something is very wrong in your configuration" situation. Normally, this just should not happen. Even if someone is trying to maliciously pretend to be coming from localhost, that address should never end up in the portion of the header where Dataverse would be inclined to use it. Seeing this in the log would indicate that either the header lookup option was activated on a Dataverse that is NOT behind a proxy; or the proxy is handling the headers incorrectly, etc.)

@qqmyers
Copy link
Member Author

qqmyers commented Jun 10, 2020

@poikilotherm - Thanks for the great comments. I'll make a few comments inline but, in terms of new code, would it make more sense for you to make a PR? If you're interested and able to do that, I'm fine with it. If not, I can work to make changes.

@qqmyers
Copy link
Member Author

qqmyers commented Jun 10, 2020

W.r.t. being a filter - I agree that makes sense in terms of security.

@poikilotherm
Copy link
Contributor

poikilotherm commented Jun 10, 2020

If you folks want me to craft a PR, sure. I will need to put this in our sprint planing first, next sprint starting 06/22/2020 (ping @bronger ). How fast or slow can we move with this @qqmyers?

@qqmyers
Copy link
Member Author

qqmyers commented Jun 10, 2020

My main concern is that this breaks MDC geolocation anywhere using load balancers, which either means lost info, or manual work to pull IP addresses from other logs. That said, there's no emergency and we're just aiming to get this into the 5.x series (versus a 4.20.1), so I don't think a short delay is too bad.

That said, I don't know how urgent the issue of IP group access is for anyone or if MDC geolocation is critical for anyone affected (using a proxy/LB). So - if others think that something has to be part of 5.0, I could make minor updates to this PR that could then be replaced when you can do a more thorough redesign job.)

@landreev
Copy link
Contributor

My main concern is that this breaks MDC geolocation anywhere using load balancers, which either means lost info, or manual work to pull IP addresses from other logs. That said, there's no emergency and we're just aiming to get this into the 5.x series (versus a 4.20.1), so I don't think a short delay is too bad.

That said, I don't know how urgent the issue of IP group access is for anyone or if MDC geolocation is critical for anyone affected (using a proxy/LB). So - if others think that something has to be part of 5.0, I could make minor updates to this PR that could then be replaced when you can do a more thorough redesign job.)

I should've said so earlier, but I would very much vote in favor of using this PR, with whatever minor cleanup you deem necessary. This implementation does the job; and it does not appear to expose us to any serious risk. Anyone who have reasons to believe the risk is unacceptable has an option of not using this header lookup feature. Or to disable the admin API auth override for localhost.

If we have a PR with a more elegant solution - a filter - later on, we'll accept it too. (I want this on my own server badly enough that I will probably make a patch based on this PR or a similar solution without waiting for v5).

So I'll proceed adding the security warning to the guide under the assumption that we are using this PR.

@poikilotherm
Copy link
Contributor

I was up for hacking away on this in spare time. It's late, so just a first draft: #6977

@qqmyers
Copy link
Member Author

qqmyers commented Jun 12, 2020

Just updated to fix the bug identified (not getting the lastIndexOf(',') in the header string and did some minor cleanup per the comments. If we want this for 5.0 or for back-porting prior to #6977 , I think the code is ready (awaiting documentation).

@landreev
Copy link
Contributor

@qqmyers

Just updated to fix the bug identified (not getting the lastIndexOf(',') in the header string and did some minor cleanup per the comments. If we want this for 5.0 or for back-porting prior to #6977 , I think the code is ready (awaiting documentation).

Thank you, much appreciated.
I looked at #6977 the other day. I like the idea (will comment there as well). But seeing how the goal appears to be to try and cut the v5 release literally in the next few days, I would much like to use this PR. So I will proceed checking in the doc additions as discussed, and then move it along.

landreev and others added 2 commits June 13, 2020 21:15
Added  a section discussing the security implications of using a proxy supplied header for tracking the addresses of incoming requests.
proxy header address lookup
Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving the PR; anyone should feel free to review the extra documentation text that I have added; feel free to either suggest changes, or edit as needed)

@landreev
Copy link
Contributor

I have removed the "closes" annotation for #6456. So that the issue could be used to track a potential filter implementation PR (DRAFT PR #6977)

@kcondon kcondon self-assigned this Jun 16, 2020
@kcondon
Copy link
Contributor

kcondon commented Jun 16, 2020

@qqmyers There are some doc formatting issues: the first explanatory section is one big wall of text with no breaks. Leonid thinks the line breaks need to be formatted differently. The second involves the jvm-option section. It would be good to provide an example command line and the actual argument option list is malformatted -the first option is normally formatted but has a couple extra quotes before the name and then the remaining options are formatted as a quoted pasted section of text, indented with quoted bar.

@landreev
Copy link
Contributor

@qqmyers There are some doc formatting issues: ...

yeah, my blurb is a mess. will address.

@kcondon
Copy link
Contributor

kcondon commented Jun 16, 2020

Functionality checks out, blocked when not behind lb, header address added to mdc log
just waiting for doc clean up to merge. Can view built docs here:

http://guides.dataverse.org/en/6970_use_header_for_ip_address/installation/config.html#dataverse-useripaddresssourceheader

http://guides.dataverse.org/en/6970_use_header_for_ip_address/installation/config.html#recording-user-ip-addresses

@landreev
Copy link
Contributor

@kcondon
I made a PR with some cosmetic fixes (QualitativeDataRepository#42; I don't have a write on @qqmyers ' repo)

landreev and others added 3 commits June 16, 2020 19:14
(this one should finally make the github online previewer work on this file!)
formatting fixes for config.rst
@landreev
Copy link
Contributor

@qqmyers did merging my diff overwrite the asadmin example command you had added earlier? (./asadmin create-jvm-options '-Ddataverse.useripaddresssourceheader=X-Forwarded-For') - I'm not seeing it anymore...
(I fixed a lot of other, unrelated junk in that file)

@kcondon kcondon merged commit 7ff0124 into IQSS:develop Jun 17, 2020
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA 🔎✅ to Done 🚀 Jun 17, 2020
@djbrooke djbrooke added this to the Dataverse 5 milestone Jun 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Enable geo-location for MDC when behind a load balancer
6 participants