# Extracting Insights from Medical Research Papers

**In this exercise, we will use natural language processing (NLP) to extract key insights from academic medical papers.**

We will use the latest GPT model from OpenAI and [a dataset of over 130,000 PubMed articles](https://huggingface.co/datasets/ccdv/pubmed-summarization). We'll first generate a summary of the full article text, and then we'll use that summary to generate an abstract and answer questions about the paper.

In this exercise, we'll learn how to:

- **Perform various NLP tasks**, including abstractive summarisation, extractive summarisation, and question-answering
- **Prepare large amounts of text data** for use by NLP models
- Use OpenAI's GPT **large language model (LLM)**
- **Query an API** (Application Programming Interface)

## Part 0: Installations

In [None]:
! pip install datasets
! pip install openai
! pip install nltk
! pip install textwrap3

## Part 1: Loading and understanding our data

For this exercise, we're going to use full-text versions of academic articles. 

Because large language models like GPT are trained on ['plain text'](https://en.wikipedia.org/wiki/Plain_text) from the internet, we'll need to ensure that the data we're working with is in that format. For example, we can't directly feed in PDF files, and some of the formatting that programs like Word include is also unhelpful.

To make our lives easier, in this exercise, we're going to use text that's already been processed into a plain text format. We'll use the [Hugging Face library](https://huggingface.co/), which makes it easy to load data and models with only a few lines of code.

After [installing the Hugging Face library](https://huggingface.co/docs/transformers/installation), we can call the following to load a dataset of PubMed articles:

In [2]:
from datasets import load_dataset
dataset = load_dataset("ccdv/pubmed-summarization")

No config specified, defaulting to: pubmed-summarization/section
Found cached dataset pubmed-summarization (/Users/chrislovejoy/.cache/huggingface/datasets/ccdv___pubmed-summarization/section/1.0.0/f765ec606c790e8c5694b226814a13f1974ba4ea98280989edaffb152ded5e2b)
100%|█████████████████████████████████████████████| 3/3 [00:00<00:00, 53.12it/s]


This is a relatively small dataset, but it's still 1.5GB and may take some time to load, depending on your computer and internet connection speed. In my case, it took just under 5 minutes. (Note: after the first download, the dataset is cached on your computer, so it won't need to be re-downloaded the next time you come back to the exercise.)

You can browse other datasets accessible via Hugging Face [here](https://huggingface.co/datasets), and to load those, you just need to change the parameter in the ```load_dataset()``` function call above.

Let's have a look at our dataset:

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'abstract'],
        num_rows: 119924
    })
    validation: Dataset({
        features: ['article', 'abstract'],
        num_rows: 6633
    })
    test: Dataset({
        features: ['article', 'abstract'],
        num_rows: 6658
    })
})

We can see that it's divided into training, validation, and test sets. This is because it's designed for 'fine-tuning' NLP models. But we won't worry about that; we will just use some of the data in the 'training' portion. We can view it as follows:

In [4]:
dataset['train']

Dataset({
    features: ['article', 'abstract'],
    num_rows: 119924
})

But now we want to find the actual article texts. It's not *obvious* from the Dataset object alone exactly how to do that.

This will be the case sometimes, and we must figure out how to get the data we want. There may be documentation on how to do so. But sometimes, you'll need to experiment and figure it out.

One helpful attribute for this is ```.__dict__```, which shows what a Python object stores internally. Let's try that:

In [1]:
dataset['train'].__dict__

This may look like a bit of a jumble, but from within it, we can see the structure of the dataset. To access the data, we first index to the data point of interest, and then we select either 'article' or 'abstract'. Let's look at the article at index 10:

In [6]:
dataset['train'][10]['article']

"an exponential rise in alzheimer 's disease ( ad ) prevalence rates is predicted to parallel the aging of baby boomers creating a potentially unsustainable economic burden to the healthcare system .   delaying the onset or progression of ad , even modestly , by earlier pharmacological intervention could substantially reduce the economic and psychosocial impact of the illness [ 1 , 2 ] .   unfortunately \n , many ad patients remain undiagnosed or go undetected until the later stages of disease . \n insights into the underlying pathological mechanisms involving beta - amyloid plaque deposition within the brain have   led to the development of a host of antiamyloid agents   that are in various stages of clinical investigation . \n there is now a scientific consensus that the pathological events in ad initiate decades before clinical symptoms become apparent , and if disease modification is realized in the coming decades , the need for improved methods of early detection prior to the over

Great! It looks like an article about Alzheimer's Disease and blood-based markers for diagnosis. Now that we can access the full articles, let's prepare our data for our large language model.



## Part 2: Preparing the data

Let's continue with our Alzheimer's Disease article to help us understand the data.

If we look at the text above, we can see a lot of extra spaces and certain symbols that we don't normally have, like '\n'. This is because the text has already been divided into 'tokens', a standard preprocessing step in NLP.

A token is a sequence of characters that represents a "unit of meaning" in a text. Tokens are typically individual words but can also be phrases or other meaningful sequences of characters, such as numbers, symbols, or punctuation marks.

If we were working with a new dataset, we might need to do the tokenisation ourselves. In this case, it's already done for us.

A space separates each token. The '\n' token denotes a new line.

Using the Python ```print()``` function will convert the '\n' symbols into new lines, making the text easier to follow. However, the additional spacing will still be present.

*(NOTE: we'll index the text to only look at the first 5000 characters, rather than print out the entire article below)*

In [13]:
print(dataset['train'][10]['article'][:5000])

an exponential rise in alzheimer 's disease ( ad ) prevalence rates is predicted to parallel the aging of baby boomers creating a potentially unsustainable economic burden to the healthcare system .   delaying the onset or progression of ad , even modestly , by earlier pharmacological intervention could substantially reduce the economic and psychosocial impact of the illness [ 1 , 2 ] .   unfortunately 
 , many ad patients remain undiagnosed or go undetected until the later stages of disease . 
 insights into the underlying pathological mechanisms involving beta - amyloid plaque deposition within the brain have   led to the development of a host of antiamyloid agents   that are in various stages of clinical investigation . 
 there is now a scientific consensus that the pathological events in ad initiate decades before clinical symptoms become apparent , and if disease modification is realized in the coming decades , the need for improved methods of early detection prior to the overt cl
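If the extra spacing gets in the way when reading, one option is to collapse it for display. The snippet below is a minimal sketch on a made-up sample string; `tidy_for_display` and `sample` are illustrative names rather than part of the dataset or any library, and the rule for re-attaching punctuation is an assumption that will not suit every case (brackets, for example, are left alone).

```python
import re

def tidy_for_display(text):
    """Collapse runs of whitespace and remove the space before common punctuation."""
    # str.split() with no arguments splits on any whitespace, including '\n'
    collapsed = " ".join(text.split())
    # Re-attach punctuation that tokenisation separated from the previous word
    return re.sub(r"\s+([.,;:!?])", r"\1", collapsed)

# Hypothetical sample in the same style as the dataset text
sample = "many ad patients remain undiagnosed .   unfortunately \n , this is common ."
print(tidy_for_display(sample))
```

This is only for human readability; it does not change the data stored in `dataset`.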

We're going to be feeding this text data into our language models. However, an important consideration here is the length of our text. Language models have limits on the length of text that they can take in at one point in time. 

For example, GPT-derived models can typically take a maximum of 1024-2048 tokens. This needs to include *both* the text we're providing and the accompanying command we will provide.

We can look at the number of characters using ```len()```:

In [8]:
len(dataset['train'][10]['article'])

22397

However, there's no direct conversion of characters to 'tokens' given that the token length can vary.

We can use the handy NLTK library for this. NLTK is the 'natural language toolkit' and contains various helpful functions, including tokenisation, text tagging, and more.

Let's import it and use the ```word_tokenize()``` function:

In [4]:
import nltk
nltk.download('punkt')
print(len(nltk.word_tokenize(dataset['train'][10]['article'])))

3936


Nearly 4000 tokens... That will be a problem as it's far beyond our limit.
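To make the budget concrete: the limit has to cover the instruction, the text, *and* the room we leave for the model's reply. Below is a rough sketch of that check. It assumes whitespace-separated tokens (reasonable here only because the dataset is already tokenised with spaces; the model's own tokeniser may count differently) and a limit of 2048 taken from the range quoted above. `count_tokens` and `fits_in_limit` are hypothetical helper names.

```python
def count_tokens(text):
    """Approximate token count by splitting on whitespace.

    This works here only because the dataset text is already tokenised
    with spaces; the model's own tokeniser may count differently.
    """
    return len(text.split())

def fits_in_limit(instruction, text, max_output_tokens, limit=2048):
    """Return True if instruction + text + room for the reply fit within `limit`."""
    used = count_tokens(instruction) + count_tokens(text) + max_output_tokens
    return used <= limit

# Illustrative numbers: an article of ~4000 tokens versus a 500-token chunk
instruction = "Write a concise summary of the following:"
article = "word " * 4000
chunk = "word " * 500

print(fits_in_limit(instruction, article, max_output_tokens=1000))
print(fits_in_limit(instruction, chunk, max_output_tokens=1000))
```

The full-length article fails the check while the smaller chunk passes, which is the motivation for the approaches discussed next.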

**What can we do?**

There are a few different approaches; the most appropriate approach depends on our end goal.

One option is to use a tool like [GPT Index](https://gpt-index.readthedocs.io/en/latest/index.html). This tool divides the text into parts, which it calls "indices". Then, when you ask a question, it will identify which of the indices (i.e. which segments of the original text) are the best for answering that particular question, and it will use those segments to generate an answer.
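To get a feel for the idea of picking the most relevant segment for a question, here is a toy sketch. It is **not** how GPT Index is implemented: the word-overlap score is a simple stand-in for whatever relevance measure a real tool uses, the text is made up, and all function names are hypothetical.

```python
def split_into_segments(text, segment_size):
    """Split text into consecutive segments of `segment_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + segment_size])
        for i in range(0, len(words), segment_size)
    ]

def overlap_score(question, segment):
    """Count how many distinct question words also appear in the segment."""
    question_words = set(question.lower().split())
    segment_words = set(segment.lower().split())
    return len(question_words & segment_words)

def best_segment(question, segments):
    """Return the segment sharing the most words with the question."""
    return max(segments, key=lambda segment: overlap_score(question, segment))

# Made-up text purely for illustration
text = (
    "participants were recruited from a prevention trial . "
    "blood samples were collected at baseline . "
    "cognitive decline was measured after one year ."
)
segments = split_into_segments(text, segment_size=8)
print(best_segment("how was cognitive decline measured", segments))
```

Only the selected segment would then be sent to the language model along with the question, which keeps the request within the token limit.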

An alternative is to create a *summary* of the original text and then use that summary as the basis for future questions to the language model.

The GPT Index approach is necessary if the text is very long. For example, if there were 10,000+ tokens in the original text, it wouldn't be possible to make a summary without losing key information.

We need to condense to around 20-25% (from ~4000 tokens to ~1000 tokens), which is quite reasonable. Therefore, we'll go with summarisation, which is also easier to implement.


## Part 3: Using AI to create an initial summary

### Extractive and Abstractive summarisation

There are broadly two types of summarisation: **extractive** and **abstractive** summarisation.

In **extractive** summarisation, the model selects the most important sentences in the text and cuts out all the rest. So the final summary contains no *new* words; it consists only of sentences taken from the original.

In **abstractive** summarisation, the model *creates* a new summary in its own words.
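To illustrate the extractive idea, here is a deliberately simple sketch that scores each sentence by how frequent its words are across the whole text and keeps the top-scoring ones. This is an assumption-laden toy (real extractive summarisers are more sophisticated), the sentences are made up, and `extractive_summary` is a hypothetical name. It assumes every sentence contains at least one word.

```python
from collections import Counter

def extractive_summary(sentences, num_sentences=1):
    """Pick the sentences whose words are most frequent across the whole text."""
    # Word frequencies over the full text
    word_counts = Counter(
        word for sentence in sentences for word in sentence.lower().split()
    )

    def score(sentence):
        words = sentence.lower().split()
        # Average frequency, so long sentences are not favoured automatically
        return sum(word_counts[word] for word in words) / len(words)

    # Rank by score, then restore the original order of the chosen sentences
    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [sentence for sentence in sentences if sentence in ranked]

# Made-up sentences purely for illustration
sentences = [
    "early detection of dementia matters",
    "the weather was pleasant",
    "blood markers may help early detection of dementia",
]
print(extractive_summary(sentences, num_sentences=1))
```

Notice that every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarisation.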





In this exercise, we're going to use **abstractive** summarisation. 

Given the token limitations, we can't just ask the model to write an overall summary. So let's divide the article into chunks, create a short summary of each chunk, and then combine the chunk summaries to make the overall summary. The result won't be *perfect*, but hopefully, it contains all the information we need. Let's use the **textwrap3** library.

In [5]:
import textwrap3

We can write a function that splits up our article into separate 'chunks':

In [6]:
chunk_length = 2000  # maximum length of each chunk, in characters (not tokens)

In [7]:
def chunk_paper(paper):    
    chunks = textwrap3.wrap(paper, chunk_length)    
    return chunks
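Two things are worth knowing about this kind of wrapping: the width counts *characters* rather than tokens, and by default newlines are replaced with spaces while lines are broken at word boundaries. The sketch below shows this on a made-up string using the standard-library `textwrap` module. I'm assuming `textwrap3` behaves the same way, since it is, as far as I know, a backport of that module.

```python
import textwrap

# Made-up sample in the same style as the dataset text
sample = "unfortunately \n , many ad patients remain undiagnosed ."

# Wrap to at most 30 characters per chunk (the notebook uses 2000)
chunks = textwrap.wrap(sample, 30)

for chunk in chunks:
    # repr() makes any leftover whitespace characters visible
    print(len(chunk), repr(chunk))
```

Each chunk stays within the width, no word is split in half, and the '\n' symbol no longer appears in the chunks.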

In [8]:
test_paper = dataset['train'][10]['article']

Let's look at the first 'chunk' as a sanity check:

In [10]:
chunk_paper(test_paper)[0]

"an exponential rise in alzheimer 's disease ( ad ) prevalence rates is predicted to parallel the aging of baby boomers creating a potentially unsustainable economic burden to the healthcare system .   delaying the onset or progression of ad , even modestly , by earlier pharmacological intervention could substantially reduce the economic and psychosocial impact of the illness [ 1 , 2 ] .   unfortunately   , many ad patients remain undiagnosed or go undetected until the later stages of disease .   insights into the underlying pathological mechanisms involving beta - amyloid plaque deposition within the brain have   led to the development of a host of antiamyloid agents   that are in various stages of clinical investigation .   there is now a scientific consensus that the pathological events in ad initiate decades before clinical symptoms become apparent , and if disease modification is realized in the coming decades , the need for improved methods of early detection prior to the overt c

### Prompt Engineering

For each of those chunks, we now want GPT to generate a summary.

So **how do we do that?**

The way that GPT-3 and similar models work is that you **ask them in 'natural language'** (i.e. using words). If you want to understand who is mentioned in a text, you could say something like, "Read this text and list all the people it contains". If you want to translate a text into another language, you could say, "Re-write this paragraph into French".

This is great because it means the same model can perform many different types of tasks. Before GPT, you would often need a separate model for each specific task.

However, writing the "prompts" that give the response you want is something of an art. The model's responses can vary a lot depending on small changes in the instructions. (You can read more about this [here](https://gwern.net/gpt-3#prompts-as-programming).)



Here's a simple prompt that works quite well for our purposes:

In [11]:
def generate_prompt(chunk):
    prompt = f"Write a concise summary of the following: \n \n {chunk} \n \n CONCISE SUMMARY:"
    return prompt

Play around with your own prompts and compare how the outputs of the model vary.

You can do that within this Jupyter Notebook, plus OpenAI has a "playground" to experiment in: https://platform.openai.com/playground


### Using the GPT API

Now we're going to interact with our model. One option would be to load a model into our Jupyter Notebook and interact with it. An easier option, though, is to send our text to a language model 'API'.

An 'API' is an "Application Programming Interface", which basically means it's somewhere that you can send information and receive information back.

OpenAI has an API for the GPT models, so we can send our text there directly from within this notebook. To do that, we'll import the openai library:

In [12]:
import openai

And we'll need to define our 'API key'. This is a long string of text which tells the OpenAI API *who* is sending the request. To get this, you'll need to make an account with OpenAI and generate a new API key, which you can copy into the cell below.

One reason for the API key is to stop people from attacking their service with too many requests. But another reason is that this service isn't *free*, so they use the API key to know which account to charge.

It's not *free*, but the cost for personal use is very low. A whole day of playing with the model and sending requests will cost less than a coffee. I'd say it's worth it for the educational experience.

But suppose you're absolutely against spending money here. In that case, an alternative is to load a model from [Hugging Face](https://huggingface.co) and use it in the notebook - see the documentation on their website for how to do so (it's a fair amount more work than calling OpenAI's API).

In [13]:
openai.api_key = ""


To send the API request, there are a number of standard variables we need to provide. Have a look at the [OpenAI API documentation](https://platform.openai.com/docs/api-reference/completions) and fill out the function below:

In [25]:
def gpt_completion(chunk):
    result = openai.Completion.create(
    model="", # TODO: add model name here
    prompt=generate_prompt(chunk),
    max_tokens = 1000,
    temperature = , # TODO: add an appropriate temperature here
    n = 1
    )
    return result

### Bringing it together

We now have functions for (1) dividing our long text into chunks, (2) generating a prompt to summarise the chunk, and (3) asking OpenAI's GPT to perform that task.

The final step is to run all of those functions over our full text to generate our summary.

Fill out the gaps in the cell below to generate a summary for the Alzheimer's paper we've been looking at.

In [None]:
def create_summary(paper):
    chunks = # TODO: use the relevant function to create the chunks
    results = []
    for chunk in chunks:
        result = gpt_completion() # TODO: add the relevant argument here for the gpt_completion function call
        chunk_summary = # TODO: look at what 'result' includes and index into the appropriate text we want
        results.append(chunk_summary)
    summary = ' '.join(results)
    return summary

In [23]:
alzheimers_paper = dataset['train'][10]['article']

summary = create_summary(alzheimers_paper)

### Looking at our summary

We can look at the summary that our model created:

In [24]:
summary

" The prevalence of Alzheimer's Disease (AD) is expected to increase as the Baby Boomer generation ages, creating a potentially unsustainable economic burden on the healthcare system. To reduce this burden, antiamyloid agents are being developed to delay the onset or progression of AD. Neuropsychological measures have been used to identify cognitively normal elders who subsequently develop AD, but false positives are possible and interpretation of extensive testing requires expertise. Molecular imaging tracers are being developed to improve early detection.  This study examined the usefulness of brief neuropsychological tests in combination with blood Aβ140 and Aβ142 as a predictive test for detecting MCI/AD in at-risk older adults at a pre-symptomatic stage. This approach is more practical for clinical use and could be used to design large-scale prevention trials. Participants included a subset of subjects enrolled in the Alzheimer's Disease Anti-Inflammatory Prevention Trial (ADAPT).

Let's compare its length with the original text:

In [30]:
print(f"The original text was {len(alzheimers_paper)} characters.")
print(f"The summary text is {len(summary)} characters.")

# Format the ratio to two decimal places directly in the f-string
print(f"\nThis is {len(summary) / len(alzheimers_paper) * 100:.2f} percent of the original length.")


The original text was 22397 characters.
The summary text is 6234 characters.

This is 27.83 percent of the original length.


That's a pretty decent compression.

We should also do a visual inspection of the summary text. Does it seem like a reasonable representation?

This is one of the challenges with NLP: numerical, objective performance measures are harder to define because text is so varied. So it's always worth visually inspecting the text ourselves and seeing if it looks reasonable.

If our text is still too long, one option is to do a second round of summarisation. We could also experiment with changing the prompt. We asked for a "concise summary", but could change it - for example, to "very concise summary", or specify a maximum number of sentences, etc. It's ultimately about trial-and-error, to see what gives the desired output.
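A second (or third) round of summarisation can be expressed as a small loop. The sketch below is a minimal illustration, not part of the exercise solution: `summarise_until_short` is a hypothetical helper, and `fake_summarise` is a stand-in that simply keeps the first half of the words so the example runs without calling the API. In the notebook you would pass in a function that calls the language model instead.

```python
def summarise_until_short(text, summarise_fn, max_chars, max_rounds=3):
    """Apply `summarise_fn` repeatedly until the text fits in `max_chars`.

    `summarise_fn` is any function that takes a string and returns a shorter
    string. `max_rounds` guards against looping forever if the text stops
    shrinking.
    """
    rounds = 0
    while len(text) > max_chars and rounds < max_rounds:
        text = summarise_fn(text)
        rounds += 1
    return text, rounds

# Stand-in summariser for illustration only: keeps the first half of the words.
# In the notebook this would be replaced by a call to the language model.
def fake_summarise(text):
    words = text.split()
    return " ".join(words[: max(1, len(words) // 2)])

long_text = "word " * 400  # 2000 characters
short_text, rounds_used = summarise_until_short(long_text, fake_summarise, max_chars=600)
print(len(short_text), rounds_used)
```

Bear in mind that each extra round costs another API request and risks losing more detail, so it's worth checking the result by eye after each round.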

## Part 4: Using our summary to answer questions

Now that we've generated an initial summary, we can use it to answer questions about the text and generate other summaries. For example, we could generate an Abstract using a set format (Background, Methods, Results, Conclusion) and compare this to the *true* abstract of the paper.

To do this, we can continue using GPT and create new prompts.

Let's first define a general function which can take in both our summary and our prompt function, and return the generated text:

In [36]:
def gpt_complete_custom_prompt(summary, prompt_function):
    result = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt_function(summary),
    max_tokens = 1000,
    temperature = 0.25,
    n = 1
    )
    return result['choices'][0]['text']

### Creating an abstract

Here is an example prompt for creating an abstract:

In [37]:
# Prompt for creating full abstract
def make_abstract_prompt(summary):
    prompt = (
        "I want you to act as an academic researcher writing an abstract for an academic article you wrote. "
        "I will share a summary of the article and it will be your job to write the abstract. "
        "The abstract should have four sections: background, materials and methods, results and conclusion. "
        f"PAPER: \n \n {summary} \n \n ABSTRACT: \n BACKGROUND:"
    )
    return prompt

Let's test it out:

In [39]:
"BACKGROUND:" + gpt_complete_custom_prompt(summary, make_abstract_prompt)
# NOTE: We've added "BACKGROUND:" to the output text, as it is used as part of the input.

"BACKGROUND:  The prevalence of Alzheimer's Disease (AD) is expected to increase as the Baby Boomer generation ages, creating a potentially unsustainable economic burden on the healthcare system. To reduce this burden, antiamyloid agents are being developed to delay the onset or progression of AD. Neuropsychological measures have been used to identify cognitively normal elders who subsequently develop AD, but false positives are possible and interpretation of extensive testing requires expertise. Molecular imaging tracers are being developed to improve early detection. \n\nMATERIALS AND METHODS: This study utilized a battery of cognitive tests to assess early changes associated with mild cognitive impairment (MCI) or Alzheimer's Disease (AD). Tests included the Wechsler Adult Intelligence Scale-Revised (WAIS-R) Digit Span (forward and backward), a generative verbal fluency test (supermarket items), the Rivermead Behavioral Memory Test (RBMT) narratives, the Brief Visuospatial Memory Te

**Looks pretty good!**
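If we want to compare the generated abstract with the true one section by section, it helps to split the text on its headings. Here is a minimal sketch, assuming the model writes each heading in capitals followed by a colon, as it did for "BACKGROUND:" and "MATERIALS AND METHODS:" above. The "RESULTS" and "CONCLUSION" headings, the example abstract, and the function name are all assumptions for illustration.

```python
import re

def split_abstract_sections(abstract, headings):
    """Split an abstract into {heading: text} using 'HEADING:' markers."""
    # Build a pattern such as (BACKGROUND|MATERIALS AND METHODS|...):
    pattern = "(" + "|".join(re.escape(h) for h in headings) + "):"
    parts = re.split(pattern, abstract)
    # re.split with one capture group returns [before, heading, text, heading, text, ...]
    sections = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections[heading] = body.strip()
    return sections

headings = ["BACKGROUND", "MATERIALS AND METHODS", "RESULTS", "CONCLUSION"]
# Made-up abstract purely for illustration
example = (
    "BACKGROUND: Early detection is difficult. \n\n"
    "MATERIALS AND METHODS: We tested a short battery. \n\n"
    "RESULTS: Accuracy improved. \n\n"
    "CONCLUSION: The approach looks practical."
)
for heading, body in split_abstract_sections(example, headings).items():
    print(f"{heading} -> {body}")
```

If the model uses different headings, or a heading word followed by a colon appears inside a section, this simple split will not behave as intended, so check the output before relying on it.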

### Question-answering

We can also ask specific questions about the text. Below are two custom prompts - the first for identifying the medical conditions and the second for identifying the main findings:

In [41]:
def med_conds_prompt(summary):
    prompt = f"Look at the following text and identify what medical conditions are mentioned. \n \n {summary} \n \n MEDICAL CONDITIONS:"
    return prompt

In [42]:
def main_findings_prompt(summary):
    prompt = f"Look at the following summary of a research study and identify what the main findings were. \n \n {summary} \n \n MAIN FINDINGS:"
    return prompt
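Both prompts above follow the same layout: an instruction, the text, and a capitalised label for the model to continue from. If you find yourself writing many of these, one option is a small helper that builds the prompt function for you. This is just a sketch; `make_prompt_function` and the example question are hypothetical, not part of the original exercise.

```python
def make_prompt_function(instruction, answer_label):
    """Return a prompt function with the 'instruction, text, LABEL:' layout used above."""
    def prompt_function(summary):
        return f"{instruction} \n \n {summary} \n \n {answer_label}:"
    return prompt_function

# Hypothetical example: a prompt asking about the study population
study_population_prompt = make_prompt_function(
    "Look at the following summary of a research study and describe who the participants were.",
    "PARTICIPANTS",
)
print(study_population_prompt("<summary text goes here>"))
```

The returned function takes a summary and returns a prompt string, so it can be passed as the `prompt_function` argument in the same way as `med_conds_prompt` and `main_findings_prompt`.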

In [43]:
print("The medical conditions mentioned in the paper are:")
gpt_complete_custom_prompt(summary, med_conds_prompt)

The medical conditions mentioned in the paper are:


" \nAlzheimer's Disease (AD), Mild Cognitive Impairment (MCI), Amnestic Mild Cognitive Impairment (MCI), APOE4 allele"

In [45]:
print("The main findings of the paper were:")
gpt_complete_custom_prompt(summary, main_findings_prompt)

The main findings of the paper were:


' Combining brief neuropsychological tests and blood biomarkers (A142 and A142/A140 ratios) had higher sensitivities and specificities for predicting cognitive decline in at-risk cognitively normal older adults than either test alone, with an accuracy of 91%. Low levels of serum A142 and A142/A140 ratios were associated with cognitive decline even within one year.'

This seems to be working pretty well.

Your task: play around with other prompts for asking other questions. Come up with at least three other prompts for different aspects of the paper.

What seems to work well and what not so well?

In [None]:
# TODO: write and test further prompts here

## Next steps

1. **Play around with different prompts for obtaining different information**. Check out [this course](https://learnprompting.org/docs/intro) if you want more guidance on how to generate prompts.

2. **Try different models and compare them to the GPT model we used**. For this, you can:
    - Modify the "model" parameter when calling the "openai.Completion.create()" function
    - Use a different API. For example, the [HuggingFace API](https://huggingface.co) is another popular API for large language models.
    - Load in specific models and see how they perform. We used a general model here (GPT), but there are models fine-tuned for biomedical text, such as [biomedLM](https://github.com/stanford-crfm/BioMedLM) and [bio-clinical-BERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT). Read the documentation, implement them, and compare their performance against more general models like the above.

3. **Look at performance metrics for NLP model performance**. How might we compare the generated abstracts with the real ones?
4. **Try fine-tuning your model** for "abstract generation" on the whole dataset to see if its performance improves. Use the real paper abstracts as the "ground truth" training data. 
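As a starting point for step 3, one simple way to compare a generated abstract with the real one is to measure how many words they share. The sketch below computes a unigram-overlap F1 score. It is a simplified score in the spirit of ROUGE-1, not the official ROUGE implementation, and the function name and example texts are made up for illustration.

```python
from collections import Counter

def unigram_f1(generated, reference):
    """Unigram-overlap F1 between two texts (a simplified, ROUGE-1-style score)."""
    generated_counts = Counter(generated.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Counter & Counter keeps the minimum count of each shared word
    overlap = sum((generated_counts & reference_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(generated_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)

# Made-up texts purely for illustration
generated = "blood markers predict cognitive decline"
reference = "blood markers and brief tests predict decline"
print(round(unigram_f1(generated, reference), 3))
```

A score of 1.0 means the two texts use exactly the same words, and 0.0 means they share none. Word overlap says nothing about whether the meaning is preserved, so it complements rather than replaces reading the outputs yourself.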


Fill out the form below and we'll provide feedback on your code.

**Any feedback on the exercise? Any questions? Want feedback on your code? Please fill out the form [here](https://docs.google.com/forms/d/e/1FAIpQLSdoOjVom8YKf11LxJ_bWN40afFMsWcoJ-xOrKhMbfBzgxTS9A/viewform).**