# Summarizing Medical Documents using spaCy

<img src= "https://images.pexels.com/photos/6801648/pexels-photo-6801648.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=650&w=940" alt ="Document" style='width: 1050px;'>

# What is Document Summarization?

Text summarization is a **Natural Language Processing** (NLP) task in which we try to create a summary starting from a textual input such as books, articles, or news.

When the source is a document (in our case a clinical document 📝, such as a discharge letter), it is called document summarization.

Based on the output type, document summarization can be:

* Extractive: the summary is extracted from the input text. The output is usually the concatenation of the most important sentences of the original text.

* Abstractive: the summary is generated. This means that we use the original text to learn internal representations and then we use such representations to generate new text. The output is original, not a combination/concatenation of the input sentences.

* Mixed: these systems either produce an abstractive summary after identifying an extractive intermediate state, or choose which approach to use (e.g., pointer models) based on the particulars of the text.


<img src= "https://miro.medium.com/max/875/1*SM41ES3n-q71Xn8zCIdRMw.png" alt ="Document" style='width: 1000px;'>

# Summarization Methods and Implementation

We have plenty of summarization algorithms today. Picking the best one is largely hit or miss, though. First of all, there is no clear consensus on which metrics to use to evaluate these systems. Moreover, the best summarization technique depends heavily on the domain and the type of text you are interested in summarizing.

## Frequency-Based Sentence Scoring

> This approach is the simplest and most straightforward one. Here we use information theory to assign each sentence of the input a score based on relative word frequencies. A high score for a sentence indicates that its content is likely to be informative.

We will be using **spaCy**, a fantastic library designed to implement standard NLP pipelines quickly and smoothly; with it, we can implement this approach in a few lines of code.
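
As a rough illustration of what "relative frequency" means here, the following sketch uses only the standard library on a made-up toy text. The names `toy_text` and `toy_stopwords` are hypothetical and the stop-word set is a tiny illustrative subset, not spaCy's list.

```python
from collections import Counter

# Hypothetical toy input; the stop-word set is a tiny illustrative subset.
toy_text = "radiation is advised . the patient fears radiation . the mri is progressing"
toy_stopwords = {"is", "the", "."}

# Count every whitespace-separated token that is not a stop word.
counts = Counter(tok for tok in toy_text.split() if tok not in toy_stopwords)

# Divide by the highest count so that the most frequent word scores 1.0.
max_count = max(counts.values())
relative_freq = {word: count / max_count for word, count in counts.items()}

print(relative_freq)
```

In this toy text "radiation" appears twice and every other kept word once, so "radiation" gets a relative frequency of 1.0 and the rest get 0.5.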

In [1]:
import spacy
import textwrap
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest
punctuation += '\n'  # treat newlines like punctuation so they are never counted as words
stopwords = set(STOP_WORDS)  # a set makes membership tests fast

reduction_rate = 0.1  # fraction of the input sentences to keep in the summary


In [2]:
text = """I saw ABC back in Neuro-Oncology Clinic today. He comes in for an urgent visit because of increasing questions about what to do next for his anaplastic astrocytoma.
Within the last several days, he has seen you in clinic and once again discussed whether or not to undergo radiation for his left temporal lesion. The patient has clearly been extremely ambivalent about this therapy for reasons that are not immediately apparent. It is clear that his MRI is progressing and that it seems unlikely at this time that anything other than radiation would be particularly effective. Despite repeatedly emphasizing this; however, the patient still is worried about potential long-term side effects from treatment that frankly seem unwarranted at this particular time.
After seeing you in clinic, he and his friend again wanted to discuss possible changes in the chemotherapy regimen. They came in with a list of eight possible agents that they would like to be administered within the next two weeks. They then wanted another MRI to be performed and they were hoping that with the use of this type of approach, they might be able to induce another remission from which he can once again be spared radiation.
From my view, I noticed a man whose language has deteriorated in the week since I last saw him. This is very worrisome. Today, for the first time, I felt that there was a definite right facial droop as well. Therefore, there is no doubt that he is becoming symptomatic from his growing tumor. It suggests that he is approaching the end of his compliance curve and that the things may rapidly deteriorate in the near future.
Emphasizing this once again, in addition, to recommending steroids I once again tried to convince him to undergo radiation. Despite an hour, this again amazingly was not possible. It is not that he does not want treatment, however. Because I told him that I did not feel it was ethical to just put him on the radical regimen that him and his friend devised, we compromised and elected to go back to Temodar in a low dose daily type regimen. We would plan on giving 75 mg/sq m everyday for 21 days out of 28 days. In addition, we will stop thalidomide 100 mg/day. If he tolerates this for one week, we then agree that we would institute another one of the medications that he listed for us. At this stage, we are thinking of using Accutane at that point.
While I am very uncomfortable with this type of approach, I think as long as he is going to be monitored closely that we may be able to get away with this for at least a reasonable interval. In the spirit of compromise, he again consented to be evaluated by radiation and this time, seemed more resigned to the fact that it was going to happen sooner than later. I will look at this as a positive sign because I think radiation is the one therapy from which he can get a reasonable response in the long term.
I will keep you apprised of followups. If you have any questions or if I could be of any further assistance, feel free to contact me."""

Here we use the spaCy NLP pipeline for English, which is very handy because it returns a `Doc` object that contains the already tokenized and preprocessed text, split into words and sentences.
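
To make the word-level and sentence-level views of a `Doc` concrete, here is a minimal sketch. It assumes spaCy v3 and, to avoid downloading a trained model, uses a blank English pipeline with the rule-based `sentencizer` instead of `en_core_web_sm`; the names `nlp_demo` and `demo_doc` are illustrative only.

```python
import spacy

# Assumption: spaCy v3. A blank English pipeline plus the rule-based
# "sentencizer" is used so that no trained model has to be downloaded.
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

demo_doc = nlp_demo("The patient was seen today. He declined radiation.")

demo_tokens = [token.text for token in demo_doc]         # word-level view
demo_sentences = [sent.text for sent in demo_doc.sents]  # sentence-level view

print(demo_tokens)
print(demo_sentences)
```

A trained pipeline such as `en_core_web_sm` derives sentence boundaries from its parser rather than from punctuation rules, so its splits can differ on harder text.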

In [3]:
nlp_pl = spacy.load('en_core_web_sm')     # load spaCy's small English pipeline
document = nlp_pl(text)                   # Doc object: tokenized, sentence-split text

tokens = [token.text for token in document]  # tokenized text

# Count how often each word occurs, skipping stop words and punctuation.
# Note: keys keep their original casing, so 'Clinic' and 'clinic' are counted separately.
word_frequencies = {}
for word in document:
    lowered = word.text.lower()
    if lowered not in stopwords and lowered not in punctuation:
        word_frequencies[word.text] = word_frequencies.get(word.text, 0) + 1

# The highest raw count is used to rescale all counts.
max_frequency = max(word_frequencies.values())
print(max_frequency)

# Turn raw counts into relative frequencies in the range (0, 1].
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / max_frequency

print(word_frequencies)

6
{'saw': 0.3333333333333333, 'ABC': 0.16666666666666666, 'Neuro': 0.16666666666666666, 'Oncology': 0.16666666666666666, 'Clinic': 0.16666666666666666, 'today': 0.16666666666666666, 'comes': 0.16666666666666666, 'urgent': 0.16666666666666666, 'visit': 0.16666666666666666, 'increasing': 0.16666666666666666, 'questions': 0.3333333333333333, 'anaplastic': 0.16666666666666666, 'astrocytoma': 0.16666666666666666, 'days': 0.5, 'seen': 0.16666666666666666, 'clinic': 0.3333333333333333, 'discussed': 0.16666666666666666, 'undergo': 0.3333333333333333, 'radiation': 1.0, 'left': 0.16666666666666666, 'temporal': 0.16666666666666666, 'lesion': 0.16666666666666666, 'patient': 0.3333333333333333, 'clearly': 0.16666666666666666, 'extremely': 0.16666666666666666, 'ambivalent': 0.16666666666666666, 'therapy': 0.3333333333333333, 'reasons': 0.16666666666666666, 'immediately': 0.16666666666666666, 'apparent': 0.16666666666666666, 'clear': 0.16666666666666666, 'MRI': 0.3333333333333333, 'progressing': 0.16

Extractive summarization is essentially based on **sentence scoring**: we need a way to give an **importance score** to each sentence so that the summary can include the most important ones. To score a sentence, we **sum the relative frequencies of its words** and then build a dictionary that pairs each sentence with its score.
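
The effect of summing frequencies, with and without dividing by the number of scored words (the `len_norm` option in the cell below), can be sketched on toy data. The frequencies and sentences here are invented for illustration, and `score` is a hypothetical helper rather than the notebook's function.

```python
# Hypothetical relative frequencies, as if computed from a longer text.
toy_freq = {"radiation": 1.0, "mri": 0.5, "tumor": 0.5, "visit": 0.25, "clinic": 0.25}

# Sentences are represented as plain lists of lower-cased words.
toy_sentences = [
    ["radiation", "tumor"],                                   # short, high-frequency words
    ["visit", "clinic", "mri", "tumor", "visit", "clinic"],   # long, mostly low-frequency words
]

def score(sentence, freq, len_norm):
    """Sum the relative frequencies of the known words; optionally average them."""
    matched = [freq[w] for w in sentence if w in freq]
    if not matched:
        return 0.0
    total = sum(matched)
    return total / len(matched) if len_norm else total

raw = [score(s, toy_freq, len_norm=False) for s in toy_sentences]
normalized = [score(s, toy_freq, len_norm=True) for s in toy_sentences]

print(raw)
print(normalized)
```

Without normalization the long sentence wins simply because it contains more scored words; with normalization the short, dense sentence ranks first.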

In [4]:
sentence_tokens = list(document.sents)

def get_sentence_scores(sentence_tok, len_norm=True):
    """Score each sentence by summing the relative frequencies of its words."""
    sentence_scores = {}
    for sent in sentence_tok:
        word_count = 0
        for word in sent:
            # The lookup uses lower-cased text, so only lower-case keys of
            # word_frequencies can match (e.g. 'MRI' is never matched).
            key = word.text.lower()
            if key in word_frequencies:
                word_count += 1
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]
        # Guard against sentences without any scored word (avoids KeyError / division by zero).
        if len_norm and word_count > 0:
            sentence_scores[sent] = sentence_scores[sent] / word_count
    return sentence_scores

sentence_scores = get_sentence_scores(sentence_tokens, len_norm=False)     # sentence scoring without length normalization
sentence_scores_rel = get_sentence_scores(sentence_tokens, len_norm=True)  # sentence scoring with length normalization

The final summary is built with the `nlargest` function from the `heapq` module, which efficiently returns the **k sentences with the highest score**.
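
As a small standalone sketch of this selection step, the following uses invented scores, with string keys standing in for spaCy sentence spans. The names `toy_scores`, `rate`, and `k` are illustrative only.

```python
from heapq import nlargest

# Hypothetical sentence scores; the keys stand in for spaCy sentence spans.
toy_scores = {"s1": 0.4, "s2": 1.7, "s3": 0.9, "s4": 1.2, "s5": 0.1,
              "s6": 0.3, "s7": 0.8, "s8": 0.2, "s9": 0.6, "s10": 0.5}

rate = 0.2                       # keep roughly 20% of the sentences
k = int(len(toy_scores) * rate)  # int() truncates toward zero

# Iterating a dict yields its keys; key=toy_scores.get ranks them by score.
top = nlargest(k, toy_scores, key=toy_scores.get)
print(top)  # ['s2', 's4']
```

Two points are worth keeping in mind: `nlargest` returns items in descending score order rather than in document order, and because `int()` truncates, a short text combined with a small rate can yield `k = 0` and therefore an empty summary.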

In [5]:
def get_summary(sentence_sc, rate):
    # Number of sentences to keep: a fraction of the scored sentences, rounded down
    summary_length = int(len(sentence_sc) * rate)
    # Pick the highest-scoring sentences (returned in descending score order)
    top_sentences = nlargest(summary_length, sentence_sc, key=sentence_sc.get)
    return ' '.join(sent.text for sent in top_sentences)

print("Lengthy description: " + get_summary(sentence_scores, reduction_rate))
print("Concise: " + get_summary(sentence_scores_rel, reduction_rate))

Lengthy description: Because I told him that I did not feel it was ethical to just put him on the radical regimen that him and his friend devised, we compromised and elected to go back to Temodar in a low dose daily type regimen. I will look at this as a positive sign because I think radiation is the one therapy from which he can get a reasonable response in the long term.

Concise: It is clear that his MRI is progressing and that it seems unlikely at this time that anything other than radiation would be particularly effective. I will look at this as a positive sign because I think radiation is the one therapy from which he can get a reasonable response in the long term.



As expected, the **first output shows verbose and content-heavy sentences**, while the **second one is much more concise**. The second summary also tends to focus more on what is going well with the patient, omitting important information about what the doctor is worried about.