Workshop at the Wellcome Trust
Register [here](http://www.eventbrite.co.uk/e/contentmine-workshop-at-the-wellcome-trust-tickets-15567357385) (registration is FREE, places limited to 25)
Location: Darwin Room, Wellcome Trust, 215 Euston Road, London NW1 2BE
Dates: 13-14 April 2015
|13 April 2015|14 April 2015|
|Training Workshop|Hackday & Policy Panel Session|
|10:00 - 18:00|09:00 - 17:00|
- Peter Murray-Rust @petermurrayrust
- Jenny Molloy @jenny_molloy
- The ContentMine Team [@TheContentMine](https://twitter.com/TheContentMine)
Content mining technologies hold much potential for maximising discovery and reuse of research, as well as for generating novel scientific discoveries through automated searching, indexing and analysis of the scientific literature.
This workshop aims to educate and engage researchers and research-related professionals in the UK who are interested in using these technologies for their work in human and animal health. In turn this will help to provide demonstrable examples of the benefits of content mining.
Participants will be instructed in the use and custom development of tools in the ContentMine pipeline before having the opportunity to apply these skills to a series of hackday projects aimed at producing useful applications for researchers and funders.
On the final afternoon, participants will showcase their projects to invited policy makers, and a policy panel session will discuss the future potential of content mining. The discussion will cover how to accelerate uptake for research and research assessment, with the aim of promoting content mining in the UK.
- Raise awareness of content mining among researchers.
- Train 20-25 researchers in using the ContentMine pipeline tools.
- Promote legal and responsible content mining practices.
- Prototype or at least explore developing a range of new applications and tools that might be useful to funders and researchers.
- Showcase the scope of potential applications for content mining to workshop participants and invited policy staff.
- Discuss and suggest requirements to drive uptake of content mining e.g. training, policy, use cases, commercial applications.
Anticipated Workshop Outcomes
- Research or assessment performed using ContentMine tools by a proportion of workshop participants.
- Production of two or more prototype applications designed to be useful to the funders and/or researchers.
- Continuing collaboration and development of a proportion of hackday prototypes by participants and ContentMine staff.
- One or more ContentMine trainers recruited from the pool of participants, increasing the reach of training for content mining.
- Better informed policy staff as a result of panel discussion and publication of outputs.
- Demand for training by further organisations or individuals linked to the Wellcome Trust, therefore potential for increased impact beyond this particular workshop.
This two-day event is intended for researchers or research-related staff who are not currently heavily involved in text and data mining but have at least some pre-existing computational skills. At minimum we expect familiarity with a command line interface and basic coding abilities in some language. Experienced developers are also extremely welcome. We expect participants to be researchers working in the fields of life science and biomedicine. All our examples and activities will therefore be geared to animal and health sciences, and the hackday projects will focus on custom tools and analyses of interest to researchers in these disciplines. Open Access content in the Europe PMC repository will be the primary source material for this hackathon.
Training Workshop Agenda
|10:10|What is content mining?|
|10:30|Think like a content miner|
|11:00|Scraping and the anatomy of scrapers|
|12:00|Entity recognition using AMI|
|13:30|Introduction to regular expressions and AMI-regex|
|15:00|Legality of content mining: what can I mine?|
|15:30|Responsible content mining: how should I mine?|
|16:00|Hackday pitches and forming teams|
|18:00 onwards|Informal social event|
Workshop Hackday and Policy Panel Session Agenda
|09:00|Hacking in teams|
|13:30|Hacking in teams|
|15:45|Introduction to content mining|
|16:00|Presentation of hackday projects|
|16:30|Panel discussion on accelerating uptake of content mining|
The primary means of communication and taking notes during the workshop will be via this [Etherpad](http://pads.cottagelabs.com/p/contentmine_wellcometrust2014).
If you're not familiar with Etherpads, it's easy - you just type! Your writing will show up in a different colour to other people's and you can associate the colour with your name by clicking the person icon in the top right corner of the screen.
Potential Hackday Projects
Tasks within these projects range from the non-technical (e.g. putting together a bag of words to retrieve documents related to a certain research area) through to those that are technically challenging even for skilled developers.
- Find Wellcome Trust funded papers and visualise their associated MeSH heading or other topic indicators in a tag cloud to indicate areas with which Wellcome funding is most associated.
- Find Wellcome funded Open Access (OA) papers and compare their subject set to the main tag cloud. Use document sectioning with the ContentMine pipeline to search for mentions of “Wellcome Trust” in the acknowledgements section on a daily basis.
- Find Wellcome funded papers and visualise or summarise by publisher.
- Build custom scrapers for journals and publishers with which attendees publish.
- Retrieve documents related to Wellcome Trust areas of interest using a bag-of-regex system. Compile lists of words and phrases, then search cached OA literature or daily RSS feeds from publishers. ContentMine have been experimenting with this system for terms related to Ebola and (in a separate project) agriculture. The results have been very interesting in terms of facts extracted.
- Co-location of entities (both within document and within sentence).
  - We can currently extract species and chemistry in standalone modules.
  - The AMI-regex module allows content miners to search for specific terms or relatively easily described patterns e.g. microRNAs, primers.
  - This allows us to pull out co-occurrences of plants and herbivores, chemicals and mosquitoes etc. This feature is not fully functioning but is a development priority.
- Custom searches using the regular expressions module AMI-regex to expand the keyword searches offered by most custom/saved search systems. This is especially powerful if sectioning is offered. A custom prototype would be a challenging project e.g. for EuropePMC or the daily output of PLOS.
- Developing a system for feeding facts from Wellcome funded CC-BY publications into Wikidata and Wikipedia to maximise access. This would likely involve generating a feed which Wikipedia editors can follow and then add facts manually. It could make use of custom searches and co-location.
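
To make the bag-of-regex and co-location ideas in the list above more concrete, here is a minimal Python sketch. It is not the AMI-regex implementation and does not use its file format. The patterns, the labels and the function names (`count_matches`, `split_sentences`, `cooccurrences`) are all hypothetical, chosen only for illustration. The sketch counts how often each pattern matches a text, then counts which pairs of patterns match within the same sentence.

```python
import re
from collections import Counter

# Hypothetical "bag of regexes": label -> illustrative pattern.
# These are assumptions for this sketch, not AMI-regex definitions.
BAG = {
    "ebola": re.compile(r"\bebola(?:virus)?\b", re.IGNORECASE),
    "outbreak": re.compile(r"\boutbreaks?\b", re.IGNORECASE),
    "microrna": re.compile(r"\bmiR-\d+[a-z]?\b"),
}


def count_matches(text, bag=BAG):
    """Count how many times each pattern in the bag matches the text."""
    return Counter({label: len(rx.findall(text)) for label, rx in bag.items()})


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Abbreviations such as 'e.g.' will be split incorrectly; a real
    pipeline would need a proper sentence tokenizer.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccurrences(text, bag=BAG):
    """Count pairs of labels whose patterns match within the same sentence."""
    pairs = Counter()
    for sentence in split_sentences(text):
        hits = sorted(label for label, rx in bag.items() if rx.search(sentence))
        for i, first in enumerate(hits):
            for second in hits[i + 1:]:
                pairs[(first, second)] += 1
    return pairs


if __name__ == "__main__":
    sample = (
        "Ebola outbreaks were reported. Levels of miR-21 rose. "
        "The Ebolavirus outbreak ended."
    )
    print(count_matches(sample))
    print(cooccurrences(sample))
```

The per-pattern counts could feed a simple document ranking or a tag cloud, while the sentence-level pair counts are one rough way to surface co-located entities. Within-document co-location would use the same approach with the whole text treated as a single unit.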