View the GitHub project here or download the latest release here.
Written By: Jason Wells
Use this script to help you explore the terms present in a Nuix case.
Begin by providing one or more single term wild card expressions and/or fuzzy expressions. Separate multiple expressions with spaces. The provided expressions are then used to find related terms present in the current Nuix case.
Wild card expressions support `*` (0 or more characters) or `?` (1 character).
Expression | Example Term Matches |
---|---|
`cat*` | `cat`, `catalina`, `catch`, `catching`, `category`, `caterpillar` |
`*th` | `depth`, `path`, `width`, `with` |
`ca?` | `car`, `cat`, `can` |
`ca??` | `call`, `card`, `case`, `cats` |
`h???th` | `health`, `hearth` |
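
The wild card rules above can be sketched in a few lines of Python. This is only an illustration of the matching semantics, not the script's actual implementation: the function names `wildcard_to_regex` and `match_terms` are hypothetical, the term list is made up, and case-insensitive matching is an assumption.

```python
import re

def wildcard_to_regex(expression):
    """Translate a single-term wild card expression into a compiled regex.

    `*` becomes "0 or more characters" and `?` becomes "exactly 1 character".
    Case-insensitive matching is an assumption made for this sketch.
    """
    parts = []
    for ch in expression:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def match_terms(expression, terms):
    """Return the sorted terms that match the whole expression."""
    pattern = wildcard_to_regex(expression)
    return sorted(t for t in terms if pattern.fullmatch(t))

# Hypothetical term list standing in for the terms of a case.
terms = ["cat", "catalina", "car", "can", "card", "health", "hearth", "path", "with"]
print(match_terms("ca?", terms))     # ['can', 'car', 'cat']
print(match_terms("h???th", terms))  # ['health', 'hearth']
```

Using `fullmatch` rather than `search` is what keeps `ca?` from matching longer terms such as `card`.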
Fuzzy term expressions take the form `TERM~SIMILARITY`, with `SIMILARITY` being a value between `0.0` (not similar) and `1.0` (exactly the same). Providing a blank value for `SIMILARITY` (e.g. `car~`) is the same as providing `0.5` (i.e. `car~0.5`).
Expression | Example Term Matches |
---|---|
`cat~0.5` | `act`, `at`, `bat`, `can`, `car`, `cat`, `cats`, `coat`, `eat`, `hat`, `nat`, `pat`, `rat`, `sat` |
`jason~0.5` | `bison`, `jackson`, `jadon`, `jalyn`, `jaron`, `jason`, `jast`, `jevon`, `json`, `larson`, `olson`, `saison`, `samson`, `wasn` |
`jason~0.8` | `jadon`, `jaron`, `jasen`, `jason`, `javon`, `jayson`, `mason` |
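
As a rough illustration of the `TERM~SIMILARITY` form, the following Python sketch splits a fuzzy expression into its term and similarity, falling back to `0.5` when the similarity is blank. The function name `parse_fuzzy_expression` and its error handling are assumptions made for this example and are not taken from the script.

```python
def parse_fuzzy_expression(expression, default_similarity=0.5):
    """Split 'TERM~SIMILARITY' into (term, similarity).

    A blank similarity (e.g. 'car~') falls back to default_similarity.
    """
    term, separator, similarity = expression.partition("~")
    if not separator or not term:
        raise ValueError("not a fuzzy expression: %r" % expression)
    value = float(similarity) if similarity.strip() else default_similarity
    if not 0.0 <= value <= 1.0:
        raise ValueError("similarity must be between 0.0 and 1.0")
    return term, value

print(parse_fuzzy_expression("car~"))       # ('car', 0.5)
print(parse_fuzzy_expression("jason~0.8"))  # ('jason', 0.8)
```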
Fuzzy similarity can be calculated using several different methods:
- Nuix - Determines what terms Nuix resolved the given fuzzy search term to. Running the given fuzzy query expression in the Nuix search bar is expected to return the same results as taking the matched terms found by this script and joining them into an OR query (i.e. `jadon OR jaron OR jasen OR javon OR ...` == `jason~0.8`). Note that this approach most accurately represents how Nuix would resolve the given fuzzy search to terms, but it also takes longer because it is resolved against each responsive item individually. That means the more items responsive to the given fuzzy search, the longer this approach can take to resolve!
- Levenshtein Distance - Uses Lucene's built-in Levenshtein distance string comparison to compare the provided fuzzy term against the case terms list. Levenshtein distance compares 2 strings by determining the number of edits (insertions, deletions and substitutions) needed to change one string into the other. The Lucene method scales the resulting edit distance to a range between `0.0` and `1.0`. The script resolves this type of fuzzy search against the case terms list.
- Jaro-Winkler - Uses Lucene's built-in Jaro-Winkler string comparison method.
- NGram Distance - Uses Lucene's built-in NGram distance string comparison method. The script resolves this type of fuzzy search against the case terms list.
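
To make the Levenshtein option more concrete, here is a minimal Python sketch of edit distance scaled into a `0.0` to `1.0` similarity and used to filter a list of terms. It is not the Lucene code the script calls: the scaling used here (1 minus the distance divided by the length of the longer string) is an assumption about how the normalization works, and the names `levenshtein_distance`, `similarity` and `filter_terms` are hypothetical.

```python
def levenshtein_distance(a, b):
    """Count the insertions, deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale edit distance to 0.0..1.0 (assumed scaling: 1 - distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest

def filter_terms(term, min_similarity, case_terms):
    """Keep the terms whose similarity to term is at least min_similarity."""
    return sorted(t for t in case_terms if similarity(term, t) >= min_similarity)

# Hypothetical term list standing in for the case terms list.
print(filter_terms("jason", 0.7, ["mason", "bison", "jason", "olson"]))
```

Under this scaling, `mason` is 1 edit away from `jason` (similarity around 0.8), while `bison` and `olson` are 2 edits away (around 0.6), which is consistent with how a higher similarity narrows the matches in the tables above.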
Resulting matched terms can be exported to a CSV or added to a running collection of terms you build up (the right hand table). This collection of terms you have selected can then be used in different ways including generating a query from the terms or saving them to a file.
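
Generating a query from a collection of terms could look something like the sketch below. It assumes the terms are plain single words that need no quoting or escaping, and the function name `terms_to_or_query` is hypothetical; the script itself may build its queries differently.

```python
def terms_to_or_query(terms):
    """Join a collection of terms into a single OR query, dropping blanks and duplicates."""
    unique_terms = sorted({t.strip() for t in terms if t.strip()})
    return " OR ".join(unique_terms)

print(terms_to_or_query(["jaron", "jadon", "jasen", "jadon"]))
# jadon OR jaron OR jasen
```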
The Expression Matches table is populated with terms that match your expression against the specified locations (`content` and/or `properties`) and fall within items responsive to your scope query. The table, and CSVs exported from it, have the following columns.
Column | Description |
---|---|
Original Expressions | The expression or expressions you provided that led to this term being a match. |
Matched Term | A term which matched one or more of your provided single term expressions. |
Similarity | When the provided expression is a fuzzy expression, this column will contain the highest similarity value of all the fuzzy expressions that contributed to matching this term. |
Occurrences | How many times this term occurs in the scope. A single item may have multiple occurrences of any given term and therefore contribute more than 1 occurrence to this count. |
Scope Responsive Items | Based on the scope query and whether you chose to resolve matches against content and/or properties, this will be the count of items within those constraints that have the given matched term. See below for more detail. |
When determining the `Scope Responsive Items` count column, `Case.count(String queryString)` is used to determine the responsive item count, using queries built with the following logic.
Matched Term | Fields | Scope Query | Count Query |
---|---|---|---|
`catalina` | `content` | `kind:email` | `(kind:email) AND (content:catalina)` |
`catalina` | `properties` | `kind:email` | `(kind:email) AND (properties:catalina)` |
`catalina` | `content` and `properties` | `kind:email` | `(kind:email) AND (content:catalina OR properties:catalina)` |
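
The query-building logic in the table above can be sketched as a small Python function. The name `build_count_query` is hypothetical, and the sketch assumes a non-empty scope query and a term that needs no escaping; it only reproduces the string format shown in the table and does not call Nuix.

```python
def build_count_query(scope_query, term, fields):
    """Combine a scope query with a term restricted to the chosen fields."""
    if not fields:
        raise ValueError("at least one field (content and/or properties) is required")
    field_clause = " OR ".join("%s:%s" % (field, term) for field in fields)
    return "(%s) AND (%s)" % (scope_query, field_clause)

print(build_count_query("kind:email", "catalina", ["content"]))
# (kind:email) AND (content:catalina)
print(build_count_query("kind:email", "catalina", ["content", "properties"]))
# (kind:email) AND (content:catalina OR properties:catalina)
```

The resulting string is the kind of value that would then be passed to `Case.count(String queryString)`.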
Begin by downloading the latest release of this code. Extract the contents of the archive into your Nuix scripts directory. In Windows the script directory is likely going to be either of the following:
- `%appdata%\Nuix\Scripts` - User level script directory
- `%programdata%\Nuix\Scripts` - System level script directory
This script relies on 2 JAR files that are not included in this repository (but are included in releases).
`TermExplorerGUI.jar` provides the user interface of the script. The source code of this JAR file is included in this repository in the Java sub-directory.
The other JAR file is `SuperUtilities.jar`, which includes the class `TermExpander`. This class does the work of taking a given term expression and resolving it to the appropriate related terms. You can head over to the SuperUtilities repository and either download a copy of the source and build it yourself or download an already built release JAR.
Copyright 2020 Nuix
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.