GitHub - DsSaurabh/SkimLit-Project

SkimLit-Project

In this project we're going to replicate the deep learning model behind the 2017 paper PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts.

When it was released, the paper presented a new dataset called PubMed 200k RCT which consists of ~200,000 labelled Randomized Controlled Trial (RCT) abstracts.

The goal of the dataset was to explore the ability of NLP models to classify sentences which appear in sequential order.

In other words, given the abstract of an RCT, what role does each sentence serve in the abstract?

Problem in a sentence

The number of RCT papers released continues to increase, and those without structured abstracts can be hard to read, which slows down researchers moving through the literature.

Solution in a sentence

Create an NLP model to classify abstract sentences into the role they play (e.g. objective, methods, results) to enable researchers to skim through the literature (hence SkimLit) and dive deeper when necessary.

Resources: Before going through the code in this notebook, you might want some background on what we're going to be doing. To get it, spend an hour (or two) going through the following papers and then return to this notebook:

1. Where our data is coming from: PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts
2. Where our model is coming from: Neural Networks for Joint Sentence Classification in Medical Paper Abstracts

What we're going to cover

Time to take what we've learned in the NLP fundamentals notebook and build our biggest NLP model yet:

Downloading a text dataset (PubMed 200k RCT from GitHub)
Writing a preprocessing function to prepare our data for modelling
Setting up a series of modelling experiments
- Making a baseline (TF-IDF classifier)
- Deep models with different combinations of: token embeddings, character embeddings, pretrained embeddings, positional embeddings
Building our first multimodal model (taking multiple types of data inputs)
- Replicating the model architecture from https://arxiv.org/pdf/1612.05251.pdf
Finding the most wrong predictions
Making predictions on PubMed abstracts from the wild
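
The preprocessing step listed above could look something like the following. This is a minimal sketch, not the notebook's own code. It assumes the raw text files follow the layout used in the PubMed RCT GitHub repository: each abstract starts with a `###<ID>` line, each sentence sits on its own line as `LABEL<TAB>sentence`, and abstracts are separated by a blank line. The function name `preprocess_abstracts`, the dictionary keys, and the example abstracts are all made up for illustration.

```python
from typing import Dict, List, Union

Sample = Dict[str, Union[str, int]]


def preprocess_abstracts(raw_text: str) -> List[Sample]:
    """Turn raw PubMed RCT style text into one dictionary per sentence.

    Assumed (hypothetical) output keys:
      target      - the sentence's role label, e.g. "METHODS"
      text        - the lowercased sentence
      line_number - position of the sentence within its abstract (0-based)
      total_lines - number of sentences in that abstract
    """
    samples: List[Sample] = []
    abstract_lines: List[str] = []

    def flush() -> None:
        # Emit every sentence collected for the current abstract, then reset.
        for line_number, line in enumerate(abstract_lines):
            target, text = line.split("\t", maxsplit=1)
            samples.append(
                {
                    "target": target,
                    "text": text.lower(),
                    "line_number": line_number,
                    "total_lines": len(abstract_lines),
                }
            )
        abstract_lines.clear()

    for line in raw_text.splitlines():
        line = line.strip()
        if line.startswith("###") or not line:
            # An ID line or a blank line marks an abstract boundary.
            flush()
        elif "\t" in line:
            abstract_lines.append(line)
    flush()  # handle a final abstract with no trailing blank line
    return samples


# Made-up example text in the assumed format (not real dataset content).
example_text = (
    "###00000001\n"
    "OBJECTIVE\tTo test a made-up treatment .\n"
    "METHODS\tPatients were randomized .\n"
    "RESULTS\tNo difference was found .\n"
    "\n"
    "###00000002\n"
    "BACKGROUND\tA second made-up abstract .\n"
    "CONCLUSIONS\tMore trials are needed .\n"
)

samples = preprocess_abstracts(example_text)
for sample in samples:
    print(sample)
```

Keeping `line_number` and `total_lines` next to each sentence is what makes the positional embeddings mentioned above possible: the model can be given the sentence's position in the abstract as an extra input alongside its tokens and characters.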

