Home
Dexter is a web application built for Media Monitoring Africa to partially automate the collection and classification of news sources in the South African media. It uses machine learning to extract quotations, names, places and other entities from news articles and then uses those to inform the decisions and classifications that MMA uses to evaluate the media.
Here we describe the basic concepts in Dexter and how some of those concepts translate into database entities.
A full database diagram is available in the repo.
The central concept in Dexter is a document, representing a news article or another news item that has been captured. It has an associated URL, title, blurb, author and publication date, among other fields. Every document is associated with a publication medium, such as the Mail and Guardian.
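As a rough illustration of the relationship described above, the sketch below models a document and its medium with plain Python dataclasses. The class and field names (`Medium`, `Document`, `published_at`) and the example URL are assumptions made for this example; they are not taken from Dexter's actual schema, which is shown in the database diagram in the repo.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Medium:
    """A publication medium (hypothetical class name)."""
    name: str


@dataclass
class Document:
    """A captured news item; field names are illustrative only."""
    url: str
    title: str
    medium: Medium                      # every document belongs to one medium
    blurb: Optional[str] = None
    author: Optional[str] = None
    published_at: Optional[datetime] = None


# Placeholder values, not real data.
mail_and_guardian = Medium(name="Mail and Guardian")
doc = Document(
    url="https://example.com/article",
    title="Example headline",
    medium=mail_and_guardian,
    published_at=datetime(2014, 1, 15),
)
print(doc.medium.name)
```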
A document source is someone who informed the content of an article. For example, a person who is directly quoted in an article is a source. A person who is indirectly referenced by an article is also a source.
When a source is linked to an article, the nature of the link is described, including:
- whether the source was directly quoted
- whether the source was named (e.g. a politician is generally named, but a mine worker may be anonymous)
- whether the source was photographed
- the role the source was playing at the time, such as whether they were quoted as the leader of a political organisation or in their private capacity
A source is almost always a person, but may also be an organisation, and is linked with an entity.
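The attributes of the link between a source and a document could be sketched as follows. This is a minimal illustration that assumes a hypothetical `DocumentSource` record with boolean flags and a free-text role; Dexter's real table and column names may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentSource:
    """Describes how one source relates to one document (hypothetical names)."""
    entity_name: str
    quoted: bool = False         # was the source directly quoted?
    named: bool = True           # False for an anonymous source
    photographed: bool = False
    role: Optional[str] = None   # capacity in which the source appeared


def describe(source: DocumentSource) -> str:
    """Summarise the nature of the link in one line."""
    parts = []
    if source.quoted:
        parts.append("quoted")
    parts.append("named" if source.named else "anonymous")
    if source.photographed:
        parts.append("photographed")
    if source.role:
        parts.append(f"as {source.role}")
    return f"{source.entity_name}: " + ", ".join(parts)


source = DocumentSource(
    entity_name="Helen Zille",
    quoted=True,
    role="leader of a political organisation",
)
print(describe(source))
```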
An entity is a generic term for a thing that was found in an article, such as a person, organisation or location. Entities are extracted from an article using machine learning, and may be incorrect or ambiguous. For that reason, entities are treated as suggestions: they inform the decision about an article's sources, and a human monitor can correct them.
An entity has a group, which is its type, such as 'person' or 'organisation', and a name, such as 'Helen Zille' or 'ANC'.
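To make the group and name pairing concrete, here is a small sketch of how extracted entities might be filtered into source suggestions for a monitor to review. The `Entity` class, the `suggest_sources` helper and the 'location' group value are assumptions for illustration, not Dexter's actual code.

```python
from dataclasses import dataclass
from typing import List

# Sources are people or organisations, so only those groups are suggested.
SOURCE_GROUPS = {"person", "organisation"}


@dataclass(frozen=True)
class Entity:
    group: str   # the entity's type, e.g. "person"
    name: str    # the text found in the article, e.g. "ANC"


def suggest_sources(entities: List[Entity]) -> List[Entity]:
    """Return entities that could be sources; a human monitor still decides."""
    return [e for e in entities if e.group in SOURCE_GROUPS]


extracted = [
    Entity(group="person", name="Helen Zille"),
    Entity(group="organisation", name="ANC"),
    Entity(group="location", name="Cape Town"),
]
suggestions = suggest_sources(extracted)
print([e.name for e in suggestions])
```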
Because most sources in Dexter are people, it's useful to have additional information about people such as their race and gender. In Dexter's model, a 'person' entity can be linked to a Person in the database, which stores this additional information.
Sometimes there are multiple entities that correspond to the same person. This happens because the machine learning algorithms don't realise that two names refer to the same person. For instance, they might not realise that 'Zuma' and 'Jacob Zuma' are most likely the same person when mentioned in the same article. Therefore many entities might be linked to the same Person.
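The many-to-one link from entities to a Person can be sketched like this. The class and helper names are hypothetical, and the `race` and `gender` fields are left unset in the example because they only stand in for the additional information a Person stores.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Person:
    name: str
    race: Optional[str] = None
    gender: Optional[str] = None


@dataclass
class Entity:
    group: str
    name: str
    person: Optional[Person] = None   # many entities may share one Person


def entities_by_person(entities: List[Entity]) -> Dict[str, List[str]]:
    """Group entity names under the Person they are linked to."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity.person is not None:
            grouped[entity.person.name].append(entity.name)
    return dict(grouped)


zuma = Person(name="Jacob Zuma")
entities = [
    Entity(group="person", name="Zuma", person=zuma),
    Entity(group="person", name="Jacob Zuma", person=zuma),
    Entity(group="organisation", name="ANC"),   # not linked to a Person
]
linked = entities_by_person(entities)
print(linked)
```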
When a document source is directly quoted in an article, we extract the quotation and store it as an utterance. This utterance is linked to the same entity that the source is linked to. The same person might have many quotations in one document.
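A minimal sketch of utterances, assuming a hypothetical `Utterance` record keyed by a document id and an entity name, shows how one entity can have several quotations in the same document. The names and quotation text below are placeholders rather than real data or Dexter's real schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Utterance:
    document_id: int
    entity_name: str   # the entity the quoted source is linked to
    quote: str


def quotes_by_entity(utterances: List[Utterance], document_id: int) -> Dict[str, List[str]]:
    """Collect every quotation in one document, grouped by entity."""
    grouped = defaultdict(list)
    for utterance in utterances:
        if utterance.document_id == document_id:
            grouped[utterance.entity_name].append(utterance.quote)
    return dict(grouped)


utterances = [
    Utterance(document_id=1, entity_name="Example Person", quote="First placeholder quotation."),
    Utterance(document_id=1, entity_name="Example Person", quote="Second placeholder quotation."),
    Utterance(document_id=2, entity_name="Example Person", quote="A quotation from another document."),
]
quotes = quotes_by_entity(utterances, document_id=1)
print(quotes)
```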