<a href="https://colab.research.google.com/github/Rhino-byte/Skimit_NLP/blob/main/nlp_skimit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Milestone Project : SkimLit 📄🔥

The purpose of this notebook is to build an NLP model to make reading medical abstracts easier.

The paper we're replicating (the source of the dataset that we'll be using) is available here: https://arxiv.org/abs/1710.06071

Reading through that paper, we see that the model architecture they use to achieve their best results is described here: https://arxiv.org/abs/1612.05251

## Confirm access to GPU


In [None]:
!nvidia-smi -L
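
The same check can be done from Python when a shell escape is not convenient. This is a minimal sketch using only the standard library; `list_gpus` is a hypothetical helper name, not something from the paper or from Colab.

```python
import shutil
import subprocess

def list_gpus():
    """Return the lines printed by `nvidia-smi -L`, or an empty list if unavailable."""
    # nvidia-smi is only on PATH when an NVIDIA driver is installed
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    # Keep only the non-empty lines of the command output
    return [line for line in result.stdout.splitlines() if line.strip()]

gpus = list_gpus()
print(gpus if gpus else "No GPU found")
```

An empty list here usually means the runtime has no GPU attached.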

## Get the data

Since we'll be replicating the paper above (PubMed 200k RCT), let's download the dataset they used.

We can do so from the authors' GitHub: https://github.com/Franck-Dernoncourt/pubmed-rct

In [15]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct
!ls pubmed-rct

Cloning into 'pubmed-rct'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 39 (delta 8), reused 5 (delta 5), pack-reused 25 (from 1)
Receiving objects: 100% (39/39), 177.08 MiB | 34.77 MiB/s, done.
Resolving deltas: 100% (15/15), done.
PubMed_200k_RCT
PubMed_200k_RCT_numbers_replaced_with_at_sign
PubMed_20k_RCT
PubMed_20k_RCT_numbers_replaced_with_at_sign
README.md


In [2]:
# Check what files are in the PubMed 20K dataset
!ls pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/

dev.txt  test.txt  train.txt


In [16]:
# Start our experiments using the 20k dataset with numbers replaced by the @ sign
data_dir = "/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign"

In [6]:
# Check all of the file names in the target directory
import os
filenames = [os.path.join(data_dir, filename) for filename in os.listdir(data_dir)] # join with a path separator
filenames

['/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/dev.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/test.txt']

In [12]:
# Sanity check: a path concatenated without a "/" separator differs from the real file path
'/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_signtrain.txt' == '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt'


False
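
The comparison above is `False` because plain string concatenation does not insert a path separator. Here is a minimal sketch of the difference, using `os.path.join` from the standard library. The directory string is the notebook's `data_dir`, and nothing is read from disk.

```python
import os

data_dir = "/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign"

# Plain concatenation glues the file name onto the directory name
concatenated = data_dir + "train.txt"

# os.path.join inserts the platform's path separator between the parts
joined = os.path.join(data_dir, "train.txt")

print(concatenated)
print(joined)
print(concatenated == joined)
```

Joining paths this way also avoids having to remember whether `data_dir` ends with a trailing slash.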

## Preprocess data

Now that we've got some text data, it's time to become one with it. One of the best ways to become one with the data is to...

`Visualize, visualize, visualize`

With that in mind, let's write a function to read in all of the lines of a target text file.



In [17]:
# Create a function to read the lines of a document
def get_lines(filename):
  """
  Reads filename (a text filename) and returns the lines of a text as a list

  Args:
    filename: a string containing the target filepath.

  Returns:
    A list of strings with one string per line from the target filename.
  """
  with open(filename,"r") as f:
    return f.readlines()



In [18]:
# Let's read in the training lines
train_lines = get_lines(data_dir + "/train.txt") # read the lines of the training file
train_lines[:27]

['###24293578\n',
 'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',
 'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',
 'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',
 'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',
 'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',
 'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and 

In [17]:
len(train_lines)

210040

Let's think about how we want our data to look...

I think our data would be best represented as a list of dictionaries:

```
[{'line_number': 0,
  'target': 'BACKGROUND',
  'text': ' Emotional eating is associated with overeating and the development of obesity .\n',
  'total_lines': 11},
  ...]
```

In [37]:
def preprocess_text_with_line_numbers(filename):
  """
  Returns a list of dictionaries of abstract line data.

  Takes in filename, reads its contents and sorts through each line,
  extracting things like the target label, the text of the sentence,
  how many sentences are in the current abstract and what sentence
  number the target line is.
  """
  input_lines = get_lines(filename) # get all lines from filename
  abstract_lines = "" # create an empty abstract
  abstract_sample = [] # create an empty list of abstract line samples

  # Loop through each line in the target file
  for line in input_lines:
    if line.startswith("###"): # check to see if the line is an ID line
      abstract_id = line
      abstract_lines = "" # reset the abstract string when a new abstract starts
    elif line.isspace(): # a blank line marks the end of the current abstract
      abstract_line_split = abstract_lines.splitlines() # split abstract into separate lines (\n)

      # Iterate through each line in a single abstract and count them at the same time
      for abstract_line_number, abstract_line in enumerate(abstract_line_split):
        line_data = {} # create an empty dictionary for each line
        target_text_split = abstract_line.split("\t") # split target label from text
        line_data["line_number"] = abstract_line_number # position of the line within the abstract (0-indexed)
        line_data["target"] = target_text_split[0] # get target label
        line_data["text"] = target_text_split[1].lower() # get target text and lower it
        line_data["total_lines"] = len(abstract_line_split)-1 # index of the last line, i.e. number of lines in the abstract minus one
        abstract_sample.append(line_data) # add line data to abstract sample list

    else: # if the above conditions aren't fulfilled, the line contains a labelled sentence
      abstract_lines += line
  return abstract_sample
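
To see the parsing logic without touching the dataset, here is a minimal sketch that restates the same loop over an in-memory list of lines. `parse_abstracts` and the toy lines are hypothetical and made up for illustration. They only mimic the file layout: an ID line starting with `###`, tab-separated label and sentence lines, then a blank line.

```python
def parse_abstracts(lines):
    """Toy restatement of the parsing loop: list of raw lines -> list of dicts."""
    samples, current = [], []
    for line in lines:
        if line.startswith("###"):      # ID line: start collecting a new abstract
            current = []
        elif line.isspace():            # blank line: the abstract is complete
            for i, raw in enumerate(current):
                target, text = raw.split("\t", 1)
                samples.append({
                    "line_number": i,
                    "target": target,
                    "text": text.lower(),
                    "total_lines": len(current) - 1,  # index of the last line
                })
        else:                           # labelled sentence line
            current.append(line.rstrip("\n"))
    return samples

# Made-up input in the same layout as train.txt (not real dataset content)
toy_lines = [
    "###00000001\n",
    "OBJECTIVE\tTo test a toy example .\n",
    "METHODS\tWe parsed @ lines .\n",
    "\n",
]

for sample in parse_abstracts(toy_lines):
    print(sample)
```

Note that `total_lines` is the index of the last sentence rather than a count, so a two-sentence abstract gets `total_lines` equal to 1.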






In [6]:
# @title How `isspace()` works

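# @markdown In Python, whitespace refers to any character that represents horizontal or vertical space, including spaces, tabs, newlines, and carriage returns. Unlike many other languages where it is mostly ignored, whitespace is syntactically significant in Python, especially for indicating code structure through indentation. `str.isspace()` returns `True` when a string is non-empty and every character in it is whitespace, which is how the preprocessing function detects the blank line that ends each abstract.

text_list = ["\n", "\t", " "]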

x = [txt.isspace() for txt in text_list]
print(x)

[True, True, True]
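
Two edge cases are worth knowing when using `isspace()` to detect blank lines: an empty string is not whitespace, and a single non-whitespace character makes the whole string fail the test. A minimal sketch with made-up strings:

```python
examples = ["\n", " \t\n", "", "a b", " a "]

for s in examples:
    # repr() makes the invisible characters visible in the printout
    print(repr(s), s.isspace())
```

Lines returned by `readlines()` keep their trailing `"\n"`, so a blank line in the file arrives as `"\n"` rather than `""` and passes the check.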


In [11]:
# @title How `splitlines()` works

# @markdown The `splitlines()` method splits a string into a list. The splitting is done at line breaks.
txt = "Thank you for the music\nWelcome to the jungle"

x = txt.splitlines()

print(x)

['Thank you for the music', 'Welcome to the jungle']
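
For text that ends with a newline, as each collected abstract does, `splitlines()` and `split("\n")` behave differently. A minimal sketch with a made-up two-sentence abstract:

```python
abstract = "OBJECTIVE\tfirst sentence .\nMETHODS\tsecond sentence .\n"

with_splitlines = abstract.splitlines()            # no trailing empty item
with_split = abstract.split("\n")                  # trailing "" after the last "\n"
with_keepends = abstract.splitlines(keepends=True) # keeps each "\n"

print(len(with_splitlines), with_splitlines)
print(len(with_split), with_split)
print(with_keepends)
```

Using `splitlines()` avoids the extra empty string, which would otherwise fail at the `split("\t")` step because it contains no tab.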


In [38]:
%%time
# Get data from file and preprocess it
train_samples = preprocess_text_with_line_numbers(data_dir+'/train.txt')
val_samples = preprocess_text_with_line_numbers(data_dir+'/dev.txt')
test_samples = preprocess_text_with_line_numbers(data_dir+'/test.txt')
print(len(train_samples),len(val_samples),len(test_samples))


180040 30212 30135
CPU times: user 280 ms, sys: 35.9 ms, total: 316 ms
Wall time: 320 ms
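
The raw training file had 210040 lines, while preprocessing returns 180040 samples. The difference of 30000 is what we would expect if the remaining lines are the `###` ID lines and the blank separator lines, one of each per abstract. That is an inference from the file layout, not something verified here. Here is a minimal sketch of how the line types could be tallied; `count_line_types` is a hypothetical helper and the input is made up.

```python
from collections import Counter

def count_line_types(lines):
    """Tally ID lines, blank separator lines and labelled sentence lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("###"):
            counts["id"] += 1
        elif line.isspace():
            counts["blank"] += 1
        else:
            counts["sentence"] += 1
    return counts

# Made-up lines in the same layout as train.txt: two abstracts, three sentences
toy_lines = [
    "###1\n", "METHODS\ta .\n", "RESULTS\tb .\n", "\n",
    "###2\n", "OBJECTIVE\tc .\n", "\n",
]

counts = count_line_types(toy_lines)
print(counts["id"], counts["blank"], counts["sentence"])
```

On the real data, `count_line_types(train_lines)["sentence"]` could be compared with `len(train_samples)` as a quick consistency check.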


In [39]:
# Check the first 10 lines of the first abstract in our training data
train_samples[:10]

[{'line_number': 0,
  'target': 'OBJECTIVE',
  'text': 'to investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( oa ) .',
  'total_lines': 11},
 {'line_number': 1,
  'target': 'METHODS',
  'text': 'a total of @ patients with primary knee oa were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .',
  'total_lines': 11},
 {'line_number': 2,
  'target': 'METHODS',
  'text': 'outcome measures included pain reduction and improvement in function scores and systemic inflammation markers .',
  'total_lines': 11},
 {'line_number': 3,
  'target': 'METHODS',
  'text': 'pain was assessed using the visual analog pain scale ( @-@ mm ) .',
  'total_lines': 11},
 {'line_number': 4,
  'target': 'METHODS',
  'text': 'secondary outcome measures include