Signals/Noise Issue #74

Open
sschneiderman opened this issue Apr 24, 2013 · 4 comments

@sschneiderman

Andrew, we previously discussed methods for promoting or demoting source documents based on analyst judgment. This was of interest to both Aveshka and CGS. Please advise if there is any follow-up on how this might work.
Thanks,
Scott

@ghost ghost assigned astrite Apr 24, 2013
@astrite
Contributor

astrite commented Apr 24, 2013

That's partially implemented today via Tag Weighting. When a user creates a source, they can set a number of user-defined tags. These tags are passed on to each document harvested from that source. If you give each source a unique tag, you can then define weights to apply to query scoring on the Advanced Options pane. The format is "Tag1": number, "Tag2": number, and so on, where each number is the weighting factor you want applied to the score. So for an RSS feed of CNN sources, you can tag it with "CNN", and if you want all CNN documents weighted x 2, you'd put "CNN": 2 in the tag weighting. When you run a query, each document is assigned an overall score based on how well it matches the query terms, and that score is then weighted further by whatever geo / time / tag weighting parameters exist. Note that in the current implementation you can update a source's tags, but this only affects new documents; it's not retroactive. There's an open issue to make this retroactive, but we do not have an ETA for when it might be worked into an upcoming build.
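
To make the weighting concrete, here is a rough Python sketch of the idea described above. It is not the platform's actual scoring code. The function names, the document layout (`score`, `tags`) and the choice to multiply weights when a document carries several weighted tags are all assumptions for illustration, and the geo / time weighting is left out.

```python
import json


def parse_tag_weights(text):
    """Parse a string such as '"CNN": 2, "Blog": 0.5' into a dict.

    Assumption: the Advanced Options text can be read as the inside of a
    JSON object, so it is wrapped in braces before parsing.
    """
    return {tag: float(weight) for tag, weight in json.loads("{" + text + "}").items()}


def weighted_score(base_score, doc_tags, tag_weights):
    """Scale a query-match score by the weight of every tag on the document.

    Assumption: tags with no configured weight count as 1.0, and multiple
    weighted tags are combined by multiplication.
    """
    score = base_score
    for tag in doc_tags:
        score *= tag_weights.get(tag, 1.0)
    return score


def rank(docs, tag_weights):
    """Return documents ordered by weighted score, highest first."""
    return sorted(
        docs,
        key=lambda d: weighted_score(d["score"], d["tags"], tag_weights),
        reverse=True,
    )


# Hypothetical documents; each inherits its tags from the source it was harvested from.
docs = [
    {"title": "a", "score": 0.6, "tags": ["CNN"]},
    {"title": "b", "score": 0.9, "tags": ["Blog"]},
    {"title": "c", "score": 1.0, "tags": []},
]
weights = parse_tag_weights('"CNN": 2, "Blog": 0.5')
ranked_titles = [d["title"] for d in rank(docs, weights)]
```

With these made-up numbers the CNN document moves ahead of the two documents that matched the query terms better, which is the "promote a source" effect being asked about. A weight below 1 would demote a source in the same way, under the same assumptions.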

From a functional perspective, the case management layer would also partially resolve the issue you're describing: once an analyst flags a document as relevant to a case, it can be moved into the supporting evidence folder. At that level, you'll only be working with documents deemed relevant by an analyst, while the analysis / collection layer retains granular query-specific relevance.


@sschneiderman
Author

Can you provide training on Thursday on how Tag Weighting would be applied to reduce false positives on similar names (John Smith the target versus John Smith the innocent bystander)? I understand the principle but not the implementation.
Thanks.


@astrite
Contributor

astrite commented Apr 24, 2013

That's a slightly different issue. Tag weighting is appropriate for inflating the score of a particular kind of document (e.g. all those from CNN or Databot), which ensures that those kinds of documents show up before others.

"False positives" like the one you describe are better solved using alternative query strategies and query qualifiers, and to a lesser extent aliasing. Selecting documents that match the correct John Smith and finding the entities associated with him will give you additional query parameters. These terms, if included in the query for John Smith, should push the relevant documents to the top.

e.g. John Smith AND (Company A OR Company B OR Associate A OR Associate B)

Alternatively, if you have a scenario with John Smith (incorrect person) and John B. Smith (correct person), you can either discard one of the entities so it no longer displays or run queries like:

e.g. (John B. Smith OR "John Smith") NOT John Smith

A certain amount of experimentation is probably required to develop an effective query.
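
As a rough illustration of the first strategy (qualifying the name with associated entities), the query string could be assembled along these lines. This is only a sketch: `build_qualified_query` is a hypothetical helper, and quoting each multi-word term is an assumption about the query syntax rather than something confirmed in this thread.

```python
def quote(term):
    """Wrap a term in double quotes so a multi-word name is treated as a phrase."""
    return '"' + term.replace('"', "") + '"'


def build_qualified_query(name, associated_entities):
    """Combine a name with entities known to be linked to the intended person.

    Produces: "name" AND ("entity 1" OR "entity 2" ...)
    With no associated entities, the query is just the quoted name.
    """
    if not associated_entities:
        return quote(name)
    qualifiers = " OR ".join(quote(e) for e in associated_entities)
    return quote(name) + " AND (" + qualifiers + ")"


# Entities taken from documents already confirmed to be about the right John Smith.
query = build_qualified_query("John Smith", ["Company A", "Associate A"])
# query == '"John Smith" AND ("Company A" OR "Associate A")'
```

The qualifier list is where the experimentation comes in: too few entities and the wrong John Smith still ranks highly, too many and relevant documents that mention none of them drop out.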

As an aside, John Smith (the accountant) vs. John Smith (the priest) isn't a true false positive. In both cases, a query for John Smith should bring back matches for "John Smith" (of whatever entity type you define). A false positive would be documents getting labeled with John Smith when they are not actually about that entity. This is more like the situation where an advertisement causes a document to be flagged as being about a company, even though the company is not actually in the text.


@sschneiderman
Author

Understood. Let's discuss again Thursday.

