Although this solution meets the basic requirements of the challenge, we believe there are still considerable opportunities for improving both the results and the underlying end-to-end (E2E) processing and visualization pipeline.
Obtaining additional tagged training data
We have been working in partnership with CrowdFlower to obtain more pre-labelled training data through crowdsourcing. This has already provided data for training our classifier, and in the future we would extend the same approach to obtain more labelled excerpts as well.
Enhancing classification and report extraction
Based on initial results, we believe that a larger set of labelled excerpts could have a significant impact on the results (precision and recall) of the report extraction machine learning models.
We believe there are still many opportunities to implement additional visualizations into our tool. We would also like to give analysts the flexibility to choose or create their own visualizations based on the underlying data.
Additional Complementary Fields
While designing the tool, we identified a number of opportunities for enhancing its utility to analysts by implementing various meta-fields for both articles and reports. Some ideas that have not yet been implemented include:
A measure of the estimated reliability of a given article, which could take numerous factors into account, including the underlying domain, manually captured analyst ratings, and similarity of content with other articles.
A measure of the likely accuracy of an extracted report, which could, for example, take into account:
- Incidence of certain key words
- Presence of conflicting or synonymous words
- Output probabilities of ML models, etc.
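
As a rough illustration of how the three signals listed above might be combined, the following is a minimal sketch rather than an implemented feature. The function name `report_accuracy_score`, the weights, the example key words and the conflicting word pairs are all hypothetical assumptions; the sketch handles only conflicting words (not synonyms), and a real measure would need to be tuned against labelled data.

```python
import re


def report_accuracy_score(excerpt, model_probability, key_words,
                          conflicting_pairs, weights=(0.3, 0.2, 0.5)):
    """Combine three signals into a score between 0 and 1.

    Everything here is an illustrative assumption: the weights are
    arbitrary and the signals are deliberately simple.
    """
    tokens = set(re.findall(r"[a-z']+", excerpt.lower()))

    # Signal 1: fraction of the expected key words found in the excerpt.
    key_words = set(key_words)
    keyword_signal = len(tokens & key_words) / len(key_words) if key_words else 0.0

    # Signal 2: 1.0 if no pair of conflicting words co-occurs, else 0.0.
    has_conflict = any(a in tokens and b in tokens for a, b in conflicting_pairs)
    consistency_signal = 0.0 if has_conflict else 1.0

    # Signal 3: the output probability of the ML model, used as given.
    w_keywords, w_consistency, w_model = weights
    return (w_keywords * keyword_signal
            + w_consistency * consistency_signal
            + w_model * model_probability)


# Hypothetical inputs, for illustration only.
KEY_WORDS = {"confirmed", "reported"}
CONFLICTS = [("confirmed", "denied")]

clean = report_accuracy_score(
    "Officials confirmed the figures on Monday", 0.8, KEY_WORDS, CONFLICTS)
conflicted = report_accuracy_score(
    "Officials confirmed, then denied, the figures", 0.8, KEY_WORDS, CONFLICTS)
```

With these inputs, the excerpt containing both "confirmed" and "denied" receives the lower score, so it would appear further down if analysts sorted reports by this field. A weighted sum keeps each factor's contribution easy to explain to analysts, although other combinations are equally plausible.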