Prefer full titles shown in search results #2191

Closed
garthg opened this issue May 22, 2015 · 15 comments

@garthg

garthg commented May 22, 2015

Hi,

When searching in a Dataverse, the search results are shown in a list in the right-hand pane. It appears that result titles are sometimes truncated, which makes the results unclear. For us it would be preferable if the full title was always shown. See attached screenshot for an example.

Thanks,

Garth

screenshot from 2015-05-22 09 43 04

@pdurbin
Member

pdurbin commented May 22, 2015

@garthg I see what you mean. A search for https://dataverse.harvard.edu/dataverse/antislaverypetitionsma?q=garrison shows "of William Lloyd Garrison" rather than "Senate Unpassed Legislation 1864, referred to next general court, SC1/series 231, Petition of William Lloyd Garrison". This is for https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/L8PGT

Perhaps the fix will be as simple as increasing this value: https://wiki.apache.org/solr/HighlightingParameters#hl.fragsize

@garthg
Author

garthg commented May 22, 2015

Thanks @pdurbin ! You're exactly right about what we'd expect to see there.

@scolapasta scolapasta added this to the Candidates for 4.0.2 milestone Jun 1, 2015
@pdurbin
Member

pdurbin commented Jun 17, 2015

Option 1: Always show the full title with no highlighting

@garthg wrote: "For us it would be preferable if the full title was always shown."

The most straightforward way to make the full title always shown is to simply never show the version with highlights. It would look something like this:

root_dataverse_-_2015-06-17_15 17 39

Note that the word "Garrison" in the title is no longer in bold. This would be a change to what's currently written in http://guides.dataverse.org/en/4.0.1/user/find-use-data.html which says:

"If the search term or query was found in the title or name of the dataverse, dataset, or file, the search term or query will be bolded within it."

For illustrative purposes, in the screenshot above I'm showing "Title: of William Lloyd Garrison" at the bottom of the card to show what it would look like if we placed the highlighted/matched title at the bottom with other fields that may match. We don't have to do this (we can continue to suppress "title" from being shown at the bottom) but it's an option.

The code change would look something like this:

murphy:dataverse pdurbin$ git diff src/main/webapp/search-include-fragment.xhtml src/main/java/edu/harvard/iq/dataverse/SolrSearchResult.java
diff --git a/src/main/java/edu/harvard/iq/dataverse/SolrSearchResult.java b/src/main/java/edu/harvard/iq/dataverse/SolrSearchResult.java
index 19fdbe9..5998d64 100644
--- a/src/main/java/edu/harvard/iq/dataverse/SolrSearchResult.java
+++ b/src/main/java/edu/harvard/iq/dataverse/SolrSearchResult.java
@@ -549,7 +549,7 @@ public class SolrSearchResult {
                     && !field.equals(SearchFields.DESCRIPTION)
                     && !field.equals(SearchFields.DATASET_DESCRIPTION)
                     && !field.equals(SearchFields.AFFILIATION)
-                    && !field.equals("title")) {
+                    ) {
                 filtered.add(highlight);
             }
         }
diff --git a/src/main/webapp/search-include-fragment.xhtml b/src/main/webapp/search-include-fragment.xhtml
index 94f12f1..a04b714 100644
--- a/src/main/webapp/search-include-fragment.xhtml
+++ b/src/main/webapp/search-include-fragment.xhtml
@@ -571,8 +571,7 @@
                                         <span class="icon-dataset text-info pull-right" title="#{bundle.dataset}"/>

                                         <a href="#{result.datasetUrl}" target="#{showFacets == true ? '_self' : '_blank'}">
-                                            <h:outputText value="#{result.title}" style="padding:4px 0;" rendered="#{result.titleHighlightSnippet == null}"/>
-                                            <h:outputText value="#{result.titleHighlightSnippet}" style="padding:4px 0;" rendered="#{result.titleHighlightSnippet != null}" escape="false"/>
+                                            <h:outputText value="#{result.title}" style="padding:4px 0;"/>
                                             <h:outputText value=" (#{result.entityId})" style="padding:4px 0;" rendered="#{SearchIncludeFragment.debug == true}"/></a>
                                         <h:outputText value="#{SearchIncludeFragment.DRAFT}" styleClass="label label-primary" rendered="#{result.draftState}"/>
                                         <h:outputText value="#{SearchIncludeFragment.UNPUBLISHED}" styleClass="label label-warning" rendered="#{result.unpublishedState}"/>
murphy:dataverse pdurbin$ 

Option 2: Find a better "fragsize"

I also played a bit with setting the "fragsize" of the highlight snippets. For example, if you set the fragsize to zero (as in the code below), then all the characters in the field that matched are returned, which can be quite long in the case of descriptions. It's the difference between this...

world1

... and this (when searching for "world"):

world2

So the question with the second approach is whether we can find a fragsize we're happy with. Perhaps this could be a configurable option so we could tweak it at runtime until we settle on a value we like. The default fragsize is 100 (the first screenshot above). Here's how it looks with a fragsize of 300:

world300

murphy:dataverse pdurbin$ git diff src/main/java/edu/harvard/iq/dataverse/SearchServiceBean.java
diff --git a/src/main/java/edu/harvard/iq/dataverse/SearchServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/SearchServiceBean.java
index 1b63bbb..d821022 100644
--- a/src/main/java/edu/harvard/iq/dataverse/SearchServiceBean.java
+++ b/src/main/java/edu/harvard/iq/dataverse/SearchServiceBean.java
@@ -89,6 +89,7 @@ public class SearchServiceBean {
 //        }
 //        solrQuery.setSort(sortClause);
         solrQuery.setHighlight(true).setHighlightSnippets(1);
+        solrQuery.setHighlightFragsize(0);
         solrQuery.setHighlightSimplePre("<span class=\"search-term-match\">");
         solrQuery.setHighlightSimplePost("</span>");
         Map<String, String> solrFieldsToHightlightOnMap = new HashMap<>();
murphy:dataverse pdurbin$ 
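
One way the "configurable option" idea could look, as a minimal sketch: read the fragment size from a runtime settings store and fall back to the default of 100 when nothing usable is set. The class name, the settings map, and the key `search.highlight.fragsize` are hypothetical stand-ins for illustration, not Dataverse code.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not taken from the Dataverse code base. */
final class FragmentSizeSetting {

    /** Default highlight fragment size mentioned in this thread. */
    static final int DEFAULT_FRAGSIZE = 100;

    /**
     * Resolves the fragment size from a settings map (a stand-in for a
     * runtime settings store). Missing, non-numeric, or negative values
     * fall back to the default.
     */
    static int resolve(Map<String, String> settings, String key) {
        String raw = settings.get(key);
        if (raw == null) {
            return DEFAULT_FRAGSIZE;
        }
        try {
            int size = Integer.parseInt(raw.trim());
            // 0 is accepted: it asks for the whole field value with no fragmenting.
            return size >= 0 ? size : DEFAULT_FRAGSIZE;
        } catch (NumberFormatException e) {
            return DEFAULT_FRAGSIZE;
        }
    }

    public static void main(String[] args) {
        String key = "search.highlight.fragsize"; // hypothetical setting name
        System.out.println(resolve(Map.of(key, "300"), key)); // configured value
        System.out.println(resolve(Map.of(), key));           // falls back to the default
    }
}
```

The resolved value would then be passed to `solrQuery.setHighlightFragsize(...)`, the call shown in the diff above, in place of the hard-coded 0.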

@scolapasta scolapasta modified the milestones: 4.0.2, Candidates for 4.0.2 Jun 18, 2015
@garthg
Author

garthg commented Jun 18, 2015

Thanks @pdurbin for this great breakdown. For my project, it's preferable to have the full title always (your Option 1), and even better if it also includes highlighting.

However, I could also see our use case being satisfied with a larger cutoff for the title fragment, such that the fragment usually included almost all of each title. Our titles are usually between 100 and 150 characters long, so if that much of the title was shown (ideally with the matching text highlighted), that would be a fine solution for us as well.

In any case, thank you for your responsiveness here!

@pdurbin
Member

pdurbin commented Jun 18, 2015

@garthg oh sure. We've been discussing this internally as well. Option 1 seems to be ahead, but I plan to deploy the branch to a test server so we can make sure we like it. For consistency, I'll remove highlighting from the names of dataverses and files as well.

@garthg
Author

garthg commented Jun 18, 2015

@pdurbin Great! Thanks for the update.

@pdurbin
Member

pdurbin commented Jun 30, 2015

I spoke briefly with @eaquigley about how @scolapasta and I were planning on merging the "Option 1: Always show the full title with no highlighting" commit (40697f9) into the 4.0.2 branch, but I'm going to wait until we've had a chance to talk more.

@garthg
Author

garthg commented Jul 1, 2015

That makes sense. Thanks for the update!

@pdurbin
Member

pdurbin commented Jul 7, 2015

@eaquigley, @mheppler, and I met this morning to discuss this bug as well as the related #537. I took some notes in a Google doc: https://docs.google.com/document/d/1p8zXIbzlACxfFhumkZN0_niyOM7V5LR8j_x3cOxkboE/edit?usp=sharing

We decided to try option 2 after all, after playing around with a "frag size" of 320. I made a build (number 21) and deployed it to dvn-build and dataverse-internal, where I also set the frag size to 320. I also documented SearchHighlightFragmentSize in the Harvard setup script and as a setting at http://guides.dataverse.org/en/4.0.2/installation/installation-main.html#searchhighlightfragmentsize

Passing to QA.

@pdurbin pdurbin removed their assignment Jul 7, 2015
@garthg
Author

garthg commented Jul 8, 2015

Hi @pdurbin and @eaquigley ,

I just wanted to say that I'm in favor of "option 2" as discussed, and I'm glad to hear that we're moving in that direction.

Garth

@pdurbin
Member

pdurbin commented Jul 8, 2015

@garthg cool. Thanks.

Meanwhile, I've been playing around with Solr, trying to understand some strange fragsize behavior I was demonstrating to @eaquigley and @mheppler yesterday.

"The size, in characters, of the snippets (aka fragments) created by the highlighter" is what fragsize means according to https://wiki.apache.org/solr/HighlightingParameters#hl.fragsize . Setting fragsize to 0 should mean "the whole field value should be used with no fragmenting". This works fine. 100 is the default so we see " of William Lloyd <em>Garrison</em>" as originally reported, but this is not 100 characters... it's 27...

murphy:dataverse pdurbin$ echo -n " of William Lloyd <em>Garrison</em>" | awk '{gsub("<[^>]*>", "")}1'
 of William Lloyd Garrison
murphy:dataverse pdurbin$ echo -n " of William Lloyd <em>Garrison</em>" | awk '{gsub("<[^>]*>", "")}1' | wc -c
      27

... and when I bump fragsize to 110 I get fewer characters, only 10 for " <em>Garrison</em>". Bumping the value up to 120 results in all 117 characters being shown, which makes sense. But what's up with fragsize=110 showing only 10 characters? Very strange.
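
A note on the counts: `awk '...1'` prints its output with a trailing newline and `wc -c` counts that newline, so the figures 27, 10, and 117 are each one higher than the visible snippet length. The sketch below, which is only an illustration with hypothetical class and method names, strips the tags the same way and counts characters directly.

```java
/** Hypothetical helper for illustration; counts snippet characters without markup. */
final class SnippetLength {

    /** Removes anything that looks like an HTML tag, then returns the remaining length. */
    static int visibleLength(String snippet) {
        return snippet.replaceAll("<[^>]*>", "").length();
    }

    public static void main(String[] args) {
        String[] snippets = {
            // fragsize=100
            " of William Lloyd <em>Garrison</em>",
            // fragsize=110
            " <em>Garrison</em>",
            // fragsize=120 (and fragsize=0)
            "Senate Unpassed Legislation 1864, referred to next general court, "
                + "SC1/series 231, Petition of William Lloyd <em>Garrison</em>"
        };
        for (String snippet : snippets) {
            System.out.println(visibleLength(snippet)); // prints 26, 9, 116
        }
    }
}
```

The off-by-one does not change the puzzle: the fragsize=110 snippet is still far shorter than the fragsize=100 one.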

Here are the curl commands I'm using:

murphy:dataverse pdurbin$ FRAGSIZE=0 && curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=garrison&hl.fragsize=$FRAGSIZE" | jq '.highlighting.dataset_25_draft.title'
[
  "Senate Unpassed Legislation 1864, referred to next general court, SC1/series 231, Petition of William Lloyd <em>Garrison</em>"
]
murphy:dataverse pdurbin$ FRAGSIZE=100 && curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=garrison&hl.fragsize=$FRAGSIZE" | jq '.highlighting.dataset_25_draft.title'
[
  " of William Lloyd <em>Garrison</em>"
]
murphy:dataverse pdurbin$ FRAGSIZE=110 && curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=garrison&hl.fragsize=$FRAGSIZE" | jq '.highlighting.dataset_25_draft.title'
[
  " <em>Garrison</em>"
]
murphy:dataverse pdurbin$ FRAGSIZE=120 && curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=garrison&hl.fragsize=$FRAGSIZE" | jq '.highlighting.dataset_25_draft.title'
[
  "Senate Unpassed Legislation 1864, referred to next general court, SC1/series 231, Petition of William Lloyd <em>Garrison</em>"
]

I should probably report this on the Solr mailing list but using the example data that ships with Solr.

@pdurbin
Member

pdurbin commented Jul 8, 2015

"I should probably report this on the Solr mailing list but using the example data that ships with Solr."

Yeah, not hard to reproduce with the sample data from Solr. I just emailed the solr list about it: http://lucene.472066.n3.nabble.com/unexpected-hl-fragsize-behavior-td4216356.html

@sbarbosadataverse

sbarbosadataverse commented Jul 15, 2015

My searches for titles are returning full-title results.

@pdurbin
Member

pdurbin commented Jul 31, 2015

I'm still seeing this bug in production if I scroll halfway down the page at https://dataverse.harvard.edu/dataverse/antislaverypetitionsma?q=garrison

My guess is that we need to set the SearchHighlightFragmentSize per http://guides.dataverse.org/en/4.1/installation/installation-main.html#searchhighlightfragmentsize

curl -X PUT -d 320 http://localhost:8080/api/admin/settings/:SearchHighlightFragmentSize

@pdurbin pdurbin reopened this Jul 31, 2015
@kcondon
Contributor

kcondon commented Jul 31, 2015

OK, ran the update and full titles are now showing for this test case. Closing.
