Workshop resources for the two-day workshop at the Wellcome Trust, 13-14 April 2015

Workshop at the Wellcome Trust


Register [here] (registration is FREE; places are limited to 25)

Location: Darwin Room, Wellcome Trust, 215 Euston Road, London NW1 2BE

Dates: 13-14 April 2015

13 April 2015: Training Workshop, 10:00 - 18:00
14 April 2015: Hackday & Policy Panel Session, 09:00 - 17:00

Twitter hashtag: #WellcomeTDM

Contact us via [@TheContentMine] ( or


Please read the [Pre-workshop Installation Instructions].

We would also appreciate your feedback

Workshop Purpose

Content mining technologies hold much potential for maximising the discovery and reuse of research, as well as for generating novel scientific discoveries through automated searching, indexing and analysis of the scientific literature.

This workshop aims to educate and engage researchers and research-related professionals in the UK who are interested in using these technologies for their work in human and animal health. In turn this will help to provide demonstrable examples of the benefits of content mining.

Participants will be instructed in the use and custom development of tools in the ContentMine pipeline, and will then have the opportunity to apply these skills to a series of hackday projects aimed at producing useful applications for researchers and funders.

On the final afternoon, participants will showcase their projects to invited policy makers and a policy panel session will discuss the future potential of content mining. This includes how to accelerate uptake for research and research assessment with the purpose of promoting content mining in the UK.

Workshop Objectives

  1. Raise awareness of content mining among researchers.
  2. Train 20-25 researchers in using the ContentMine pipeline tools.
  3. Promote legal and responsible content mining practices.
  4. Prototype or at least explore developing a range of new applications and tools that might be useful to funders and researchers.
  5. Showcase the scope of potential applications for content mining to workshop participants and invited policy staff.
  6. Discuss and suggest requirements to drive uptake of content mining e.g. training, policy, use cases, commercial applications.

Anticipated Workshop Outcomes

  1. Research or assessment performed using ContentMine tools by a proportion of workshop participants.
  2. Production of two or more prototype applications designed to be useful to the funders and/or researchers.
  3. Continuing collaboration and development of a proportion of hackday prototypes by participants and ContentMine staff.
  4. One or more ContentMine trainers recruited from pool of participants, increasing the reach of training for content mining.
  5. Better informed policy staff as a result of panel discussion and publication of outputs.
  6. Demand for training by further organisations or individuals linked to the Wellcome Trust, therefore potential for increased impact beyond this particular workshop.

Intended Audience

This two-day event is intended for researchers or research-related staff who are not currently heavily involved in text and data mining but have at least some pre-existing computational skills. At minimum we expect familiarity with a command line interface and basic coding abilities in some language. Experienced developers are also extremely welcome. We expect participants to be researchers working in the fields of life science and biomedicine. All our examples and activities will therefore be geared to animal and health sciences, and the hackday projects will focus on custom tools and analyses of interest to researchers in these disciplines. Open Access content in the Europe PMC repository will be the primary source material for the hackday.

Training Workshop Agenda

Times Session
10:00 Introductions
10:10 What is content mining?
  • Overview presentation from ContentMine staff
10:30 Think like a content miner
  • Hands-on activity facilitated by ContentMine staff introducing entity extraction techniques, precision and recall.
11:00 Scraping and the anatomy of scrapers
  • Hands-on activity facilitated by ContentMine staff including use of quickscrape and custom scraper development.
12:00 Entity recognition using AMI
  • Hands-on activity facilitated by ContentMine staff including extracting species names from OA papers using AMI-species.
13:00 Lunch
13:30 Introduction to regular expressions and AMI-regex
  • Hands-on activity facilitated by ContentMine staff including regex creation in groups and use of AMI-regex.
15:00 Legality of content mining: what can I mine?
  • Presentation and Q&A by ContentMine staff covering copyright, database and contract law. Special attention will be paid to the UK copyright exception.
15:30 Responsible content mining: how should I mine?
  • Presentation and Q&A by ContentMine staff covering server limits, online scraping etiquette and responsible technology use.
16:00 Hackday pitches and forming teams
  • Presentations by individuals and groups followed by discussion in newly formed teams facilitated by ContentMine staff.
18:00 onwards Informal social event
  • Move as a group to nearby pub or late opening cafe to discuss hackday projects.
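
Precision and recall, introduced in the 10:30 session, can be illustrated with a minimal Python sketch. The entity names below are made-up example data, and `precision_recall` is a hypothetical helper written for this illustration; it is not part of the ContentMine tools.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for two collections of entity strings."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    # Precision: what fraction of the extracted entities are correct?
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what fraction of the correct entities were extracted?
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical example: species names a tool pulled from one paper,
# compared with the names a human annotator marked in the same paper.
extracted = {"Homo sapiens", "Mus musculus", "Table 2"}
expected = {"Homo sapiens", "Mus musculus", "Danio rerio", "Aedes aegypti"}

precision, recall = precision_recall(extracted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.50
```

Here two of the three extracted strings are real species (precision 2/3), and two of the four annotated species were found (recall 2/4).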

Workshop Hackday and Policy Panel Session Agenda

Times Session
09:00 Hacking in teams
13:00 Lunch
13:30 Hacking in teams
15:30 Coffee Break
  • Arrival of policy panel session attendees
15:45 Introduction to content mining
  • Presentation delivered by Peter Murray-Rust to new attendees
16:00 Presentation of hackday projects
  • Presentations delivered by participants, including future scope for development of their projects.
16:30 Panel discussion on accelerating uptake of content mining
  • Panel and Q&A with audience including workshop participants.
17:30 Event close


The primary means of communication and taking notes during the workshop will be via this [Etherpad].

If you're not familiar with Etherpads, it's easy - you just type! Your writing will show up in a different colour to other people's and you can associate the colour with your name by clicking the person icon in the top right corner of the screen.

Potential Hackday Projects

Tasks within these projects range from non-technical (e.g. putting together a bag of words to retrieve documents related to a certain research area) to technically challenging, even for skilled developers.

  1. Find Wellcome Trust funded papers and visualise their associated MeSH heading or other topic indicators in a tag cloud to indicate areas with which Wellcome funding is most associated.
  2. Find Wellcome funded Open Access (OA) papers and compare subject set to main tag cloud. Use document sectioning with ContentMine pipeline to search for mentions of “Wellcome Trust” in acknowledgements section on a daily basis.
  3. Find Wellcome funded papers and visualise or summarise by publisher.
  4. Build custom scrapers for journals and publishers with which attendees publish.
  5. Retrieve documents related to Wellcome Trust areas of interest using a bag-of-regex system. Compile lists of words and phrases, then search cached OA literature or daily RSS feeds from publishers. ContentMine have been experimenting with this system for terms related to Ebola and (in a separate project) agriculture. The results have been very interesting in terms of facts extracted.
  6. Co-location of entities (both within document and within sentence).
  • We can currently extract species and chemistry in standalone modules.
  • The AMI-regex module allows content miners to search for specific terms or relatively easily described patterns e.g. microRNAs, primers.
  • This allows us to pull out co-occurrences of plants and herbivores, chemicals and mosquitoes etc. This feature is not fully functioning but is a development priority.
  7. Custom searches using the regular expressions module AMI-regex to expand the keyword searches offered by most custom/saved search systems. This is especially powerful if sectioning is offered. A custom prototype would be a challenging project e.g. for EuropePMC or the daily output of PLOS.
  8. Developing a system for feeding facts from Wellcome funded CC-BY publications into Wikidata and Wikipedia to maximise access. This would likely involve generating a feed which Wikipedia editors can follow and then add facts manually. It could make use of custom searches and co-location.
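
As a rough illustration of the bag-of-regex and co-location ideas in the projects above, here is a minimal Python sketch that uses only the standard `re` module. It does not use AMI-regex or reproduce its input format; the pattern labels, the expressions, the sample text and the helper names are all assumptions made for this example.

```python
import re
from itertools import combinations

# Hypothetical "bag" of labelled patterns; the labels and expressions are
# illustrative only and are not taken from AMI-regex.
PATTERNS = {
    "funder": re.compile(r"\bWellcome\s+Trust\b", re.IGNORECASE),
    "microRNA": re.compile(r"\bmiR-\d+[a-z]?\b"),
    "mosquito": re.compile(r"\b(?:Aedes|Anopheles)\s+[a-z]+\b"),
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def find_matches(sentence):
    """Return {label: [matched strings]} for every pattern that hits."""
    hits = {}
    for label, pattern in PATTERNS.items():
        found = pattern.findall(sentence)
        if found:
            hits[label] = found
    return hits


def cooccurrences(text):
    """Yield (label_a, label_b, sentence) for labels sharing a sentence."""
    for sentence in split_sentences(text):
        labels = sorted(find_matches(sentence))
        for label_a, label_b in combinations(labels, 2):
            yield label_a, label_b, sentence


# Made-up sample text, not taken from any real paper.
text = (
    "Expression of miR-21 was elevated in Aedes aegypti after infection. "
    "This work was supported by the Wellcome Trust."
)
for label_a, label_b, sentence in cooccurrences(text):
    print(label_a, "+", label_b, "->", sentence)
```

In this sample only the first sentence contains two labelled entities, so it is the only one reported as a co-occurrence. The second sentence still matches the `funder` pattern on its own, which is the kind of hit a search of acknowledgements sections would look for. A real pipeline would need a more robust sentence splitter and document sectioning.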

Click here to be advised of future ContentMine Workshops