Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DS-1089 Feedback form breaks when hostname is short #195

Closed
wants to merge 1 commit into from

Conversation

helix84
Copy link
Member

@helix84 helix84 commented Feb 28, 2013

Fix for a string index out of bounds error

https://jira.duraspace.org/browse/DS-1089

Rebased #193 on the correct branch (master)

Fix for a string index out of bounds error
@bram-atmire
Copy link
Member

looks good, +1

@mwoodiupui
Copy link
Member

If I understand this, the original code assumes that the FQDN is hostname.domain.domain, while the revised code assumes that it is either hostname.domain.domain or hostname.domain. Instead of clipping domains off the right end, shouldn't we just take the leftmost longest substring not containing a period? That is: int firstDot = host.indexOf('.'); basicHost = (firstDot < 0) ? host : host.substring(0, firstDot);

@bram-atmire
Copy link
Member

Correct me if I'm wrong, but wouldn't your code leave "www" as the hostname if you would feed it www.myrepo.edu?
Right now, I still think Ryan's code is correct.

@mwoodiupui
Copy link
Member

If the FQDN is "www.myrepo.edu", then "www" is the hostname. What do you think it should be? I may have completely misunderstood the intent here.

@helix84
Copy link
Member Author

helix84 commented May 13, 2013

The reason why this is confusing is because this code doesn't intend to find out the hostname, it wants to make sure that all of example.com, www.example.com, www.sub.example.com etc. are allowed as referrers when only example.com is listed as mail.allowed.referrers.

@bram-atmire
Copy link
Member

The original comment before the functionality states:

// cut off all but the hostname, to cover cases where more than one URL
// arrives at the installation; e.g. presence or absence of "www"

So in this context I think it was the original author's intent to keep the top level domain+extension.

e.g.

www.atmire.com
repo.atmire.com
mylibrariesredirect.repo.atmire.com

should all end up as atmire.com

The original code fails because it assumes there are always at least two "." symbols. Which is not the case for short hostnames like atmire.com (without www)

@mwoodiupui
Copy link
Member

Ah, I was reading "cut off" to mean "cut off and throw away". That's what I get for looking only at the diff.

So, if the desire is that "any.old.thing.example.com" should match "example.com", rather than making assumptions about how many name elements there are, we should simply compare the rightmost portion of the name with the name from mail.allowed.referrers. That is, in RE terms: if the allowed referrer is "example.com", then the presented name must match "example.com$" Otherwise we not only have problems with short allowed names, but with long ones such as "library.kcl.ac.uk".

Looking at the surronding code, I think we have other problems. "String.indexOf(String) != -1" does not mean "occurs at the end"; it means "occurs anywhere". So "example.com" would match "badbot.example.com.roguesite.porn".

@helix84
Copy link
Member Author

helix84 commented May 13, 2013

Yes, Mark, while writing my response I also realized that is what we want. With one difference, we don't want to allow xxxexample.com, so the comparison should be separated by a dot (".example.com$") unless it equals to mail.allowed.referrers ("example.com$").

@bram-atmire
Copy link
Member

in an attempt to revive this:

Instead of doing all of this cutting ourselves, maybe we can use URI.getHost() ?

http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getHost()

http://stackoverflow.com/questions/9607903/get-domain-name-from-given-url

@helix84
Copy link
Member Author

helix84 commented Jul 22, 2013

Hi Bram, thanks for the tip, but the problem wasn't extracting the domain name from the URL, it was subdomain matching (see my first comment above).

@mwoodiupui
Copy link
Member

The more I look at the code, the more confused I get. The patch doesn't seem to be related to mail.allowed.referers code at all (although that code needs work too). The code to be patched seems to be trying to match the referer to the DSpace host -- that is, ${dspace.hostname} is always implicitly a member of mail.allowed.referers -- or to a stem of it, to allow for things like example.org CNAME www.example.org.

We have to be careful about this stemming. Not only might the offered name be too short to remove two labels ("example.org"); stemming "example.org" produces "org" which is too short for a sensible match. If stemming makes sense at all, I think it should be: remove one leftmost label from the domain IFF the result would contain at least two labels. ("foo.example.org" -> "example.org" and "foo.bar.example.org" -> "bar.example.org" but "example.org" -> "example.org".) The patched code does seem to do that.

However I wonder if the intended behavior is too clever. Shouldn't we just look for an exact match to ${dspace.hostname} or to any element of ${mail.allowed.referer} and document that if you want stemming then you must specify it explicitly in mail.allowed.referer? I think we have assumed too much.

Also, I can't find where this stemming behavior is documented.

@mwoodiupui
Copy link
Member

In #duraspace on 17-Jul-2013, helix84 commented that it would be good to have unit tests for this behavior.

@helix84
Copy link
Member Author

helix84 commented Jul 22, 2013

Hi Mark. It's true that may be trying to be too clever here. I was only trying to fix a common bug by implementing the intended behaviour correctly and now it turns out to be a little complex.

The reason why the intended behaviour exists is that you might have a dozen subdomains, e.g. dspace.university.edu, blog.university.edu, library.university.edu etc. and you want to allow linking from all of them to the feedback page. The subdomains may change over time and this way you wouldn't need to bother the dspace admin to add each one of them.

So we may decide to

  • trade this complexity for manual work; or
  • implement this complexity as intended; or
  • suggest a different solution.

I now see what you mean - that stemming is implemented only for the dspace hostname and not in the allowedReferrersSplit loop. It's an open question whether we want it there, too. My example above suggests it would have a use.

@tdonohue tdonohue added the bug label Aug 13, 2014
@tdonohue tdonohue added the work in progress PR is still being worked on & is not currently ready for review label Jun 17, 2015
alanorth added a commit to alanorth/DSpace that referenced this pull request Mar 21, 2016
@tdonohue
Copy link
Member

tdonohue commented Feb 3, 2017

Closing as outdated/obsolete (no activity in several years). Please resubmit if you are interested in helping move this forward. Thanks!

@tdonohue tdonohue closed this Feb 3, 2017
olli-gold pushed a commit to tubhh/DSpace that referenced this pull request Dec 8, 2020
[D4CRIS-1033] use transfer manager to support multipart upload and to avoid reaching max number of connections on S3
benbosman added a commit that referenced this pull request Jun 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug work in progress PR is still being worked on & is not currently ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants