DsSaurabh/SkimLit-Project

SkimLit-Project

In this project, we're going to replicate the deep learning model behind the 2017 paper PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts.

When it was released, the paper presented a new dataset called PubMed 200k RCT which consists of ~200,000 labelled Randomized Controlled Trial (RCT) abstracts.

The goal of the dataset was to explore the ability of NLP models to classify sentences that appear in sequential order.

In other words, given the abstract of an RCT, what role does each sentence serve in the abstract?

Problem in a sentence

The number of RCT papers released continues to increase, and those without structured abstracts can be hard to read, which in turn slows down researchers moving through the literature.

Solution in a sentence

Create an NLP model to classify abstract sentences by the role they play (e.g. objective, methods, results, etc.) to enable researchers to skim through the literature (hence SkimLit) and dive deeper when necessary.
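To make the end goal concrete, here is a minimal sketch of how per-sentence predictions could be turned into a skimmable abstract. It is not the model itself: the labels are assumed to come from a trained classifier, the helper names `group_by_role` and `format_skimmable` are hypothetical, and the example sentences are made up for illustration.

```python
def group_by_role(sentences, labels):
    """Group sentences under their predicted role, keeping first-seen role order."""
    grouped = {}  # plain dicts preserve insertion order in Python 3.7+
    for sentence, label in zip(sentences, labels):
        grouped.setdefault(label, []).append(sentence)
    return grouped


def format_skimmable(grouped):
    """Render one line per role so a reader can jump to the part they need."""
    return "\n".join(
        f"{label}: {' '.join(role_sentences)}"
        for label, role_sentences in grouped.items()
    )


# Made-up sentences and labels, standing in for an unstructured abstract
# and the predictions a trained model might return for it.
sentences = [
    "We wanted to know whether treatment A reduces symptom B.",
    "Patients were randomly assigned to treatment A or placebo.",
    "Outcomes were measured after 12 weeks.",
    "Symptom scores were lower in the treatment group.",
]
labels = ["OBJECTIVE", "METHODS", "METHODS", "RESULTS"]

grouped = group_by_role(sentences, labels)
skimmable = format_skimmable(grouped)
print(skimmable)
```

Grouping by role rather than printing sentence by sentence is a design choice for readability; it assumes sentences with the same role belong together, which may not hold for every abstract.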

Resources: Before going through the code in this notebook, you might want to get some background on what we're going to be doing. To do so, spend an hour (or two) going through the following papers and then return to this notebook:

1. Where our data is coming from: PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts
2. Where our model is coming from: Neural networks for joint sentence classification in medical paper abstracts.

What we're going to cover

Time to take what we've learned in the NLP fundamentals notebook and build our biggest NLP model yet:

  • Downloading a text dataset (PubMed RCT200k from GitHub)
  • Writing a preprocessing function to prepare our data for modelling
  • Setting up a series of modelling experiments
    • Making a baseline (TF-IDF classifier)
    • Deep models with different combinations of: token embeddings, character embeddings, pretrained embeddings, positional embeddings
  • Building our first multimodal model (taking multiple types of data inputs)
  • Finding the most wrong predictions
  • Making predictions on PubMed abstracts from the wild
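As a rough idea of what the preprocessing step might look like, here is a minimal sketch. It assumes the raw files use a layout where each abstract starts with a `###` ID line, each sentence sits on its own line as `LABEL<TAB>text`, and a blank line ends the abstract. The function name `parse_rct_lines`, the dictionary keys, and the choice to lowercase the text are assumptions for illustration, not necessarily what the notebook does. The `line_number` and `total_lines` fields are the kind of positional information that positional embeddings could be built from.

```python
def parse_rct_lines(lines):
    """Turn raw abstract lines into a list of per-sentence dictionaries."""
    samples = []
    abstract = []  # (label, text) pairs for the abstract currently being read

    def flush():
        # Emit one sample per sentence, recording where it sits in its abstract.
        for position, (label, text) in enumerate(abstract):
            samples.append({
                "target": label,
                "text": text.lower(),            # assumption: lowercase the text
                "line_number": position,         # 0-based position in the abstract
                "total_lines": len(abstract),    # number of sentences in the abstract
            })
        abstract.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith("###") or not line:
            # An ID line or a blank line marks an abstract boundary.
            flush()
        elif "\t" in line:
            label, text = line.split("\t", 1)
            abstract.append((label, text))
    flush()  # handle a final abstract with no trailing blank line
    return samples


# Made-up lines in the assumed layout (IDs and sentences are illustrative).
raw_lines = [
    "###00000001\n",
    "OBJECTIVE\tTo test whether treatment A helps symptom B .\n",
    "METHODS\tWe enrolled patients and assigned them at random .\n",
    "\n",
    "###00000002\n",
    "RESULTS\tScores improved in the treatment group .\n",
    "\n",
]

samples = parse_rct_lines(raw_lines)
for sample in samples:
    print(sample)
```

To run this on a real file, something like `parse_rct_lines(open(path).readlines())` should work, where `path` points at one of the downloaded text files.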
