<a href="https://colab.research.google.com/github/LivingstonTardzenyuy/Deep-Learning-with-TensorFlow/blob/main/09_SkimLit_nlp_milestone_project_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Milestone project 2: SkimLit.

The purpose of this notebook is to build an NLP model that makes reading medical abstracts easier.

The paper we're replicating (the source of the dataset) is available here: https://arxiv.org/abs/1710.06071



In [1]:
# Confirm access to a GPU.

!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-4c324027-0c61-de9d-61dd-ccc00c2efdd1)


## Get data.

Since we'll be replicating the paper above (PubMed 200k RCT), let's download the dataset used.

We can do so from the authors' GitHub: https://github.com/Franck-Dernoncourt/pubmed-rct

In [2]:
# Clone the dataset repository from GitHub.

!git clone https://github.com/Franck-Dernoncourt/pubmed-rct
!ls pubmed-rct

Cloning into 'pubmed-rct'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 39 (delta 8), reused 5 (delta 5), pack-reused 25 (from 1)
Receiving objects: 100% (39/39), 177.08 MiB | 39.83 MiB/s, done.
Resolving deltas: 100% (15/15), done.
Updating files: 100% (13/13), done.
PubMed_200k_RCT				       PubMed_20k_RCT_numbers_replaced_with_at_sign
PubMed_200k_RCT_numbers_replaced_with_at_sign  README.md
PubMed_20k_RCT


In [3]:
# Check what files are in the PubMed 200k dataset (numbers replaced with "@").

!ls pubmed-rct/PubMed_200k_RCT_numbers_replaced_with_at_sign

dev.txt  test.txt  train.zip


In [4]:
# Start our experiments using the 20k dataset with numbers replaced by "@" sign.

data_dir = "/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

In [5]:
# Check all of the file names in the target directory.
import os

filenames = [data_dir + filename for filename in os.listdir(data_dir)]
filenames

['/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/dev.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/test.txt']

## Preprocess data.

Now we've got some text data, it's time to become one with it.

One of the best ways to become one with the data is to *visualize, visualize, visualize*.

So with that in mind, let's write a function to read in all of the lines of a target text file.

In [6]:
# Create a function to read the lines of a document.

def get_lines(filename):
  """
    Reads a text file and returns its lines as a list of strings.
  """
  with open(filename, "r") as f:
    return f.readlines()

In [7]:
train_lines = get_lines(data_dir + "train.txt")
train_lines[:10]

['###24293578\n',
 'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',
 'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',
 'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',
 'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',
 'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',
 'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and 

In [8]:
len(train_lines)

210040
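
The raw file mixes three kinds of lines: abstract ID lines starting with `###`, labelled sentence lines, and blank lines separating abstracts. As a minimal sketch, the tally below counts each kind on a small made-up sample rather than the real file. The helper name `count_line_kinds` and the sample lines are hypothetical, for illustration only.

```python
from collections import Counter

def count_line_kinds(lines):
    """Hypothetical helper: tally ID lines, blank lines and labelled sentence lines."""
    kinds = Counter()
    for line in lines:
        if line.startswith("###"):      # abstract ID line
            kinds["abstract_id"] += 1
        elif line.isspace():            # blank separator line
            kinds["blank"] += 1
        else:                           # "LABEL\ttext" line
            kinds["labelled_sentence"] += 1
    return kinds

# Made-up sample in the same layout as the dataset files.
sample_lines = [
    "###00000001\n",
    "OBJECTIVE\tA made-up objective sentence .\n",
    "METHODS\tA made-up methods sentence .\n",
    "\n",
]
sample_counts = count_line_kinds(sample_lines)
print(sample_counts)
```

Running the same helper on `train_lines` would show how many of the lines are actual labelled sentences.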

Let's think about how we want our data to look.

One possible way is to turn each line into a dictionary and collect the dictionaries in a list:

'''
[{'line_number': 0,
  'target': 'BACKGROUND',
  'text': 'Emotional eating is associated with overeating and the development of obesity .\n',
  'total_lines': 11
}]

'''
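
To make the target format concrete, here is a minimal sketch of how one labelled line might map to such a dictionary. The helper name `line_to_dict` and the `line_number`/`total_lines` values passed in are hypothetical, and stripping the trailing newline is a choice made for this sketch.

```python
def line_to_dict(raw_line, line_number, total_lines):
    """Hypothetical helper: turn one 'LABEL\\ttext\\n' line into a dictionary."""
    # Split on the first tab only: the label is on the left, the sentence on the right.
    target, text = raw_line.rstrip("\n").split("\t", maxsplit=1)
    return {
        "line_number": line_number,
        "target": target,
        "text": text,
        "total_lines": total_lines,
    }

# Sample line taken from the train.txt output shown earlier; the numbers are illustrative.
example = line_to_dict(
    "METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n",
    line_number=4,
    total_lines=11,
)
print(example)
```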

In [13]:
def preprocess_text_with_line_numbers(filename):
  """
    Returns a list of dictionaries of abstract line data.

    Takes in a filename, reads its contents and sorts through each line,
    extracting the target label, the text of the sentence, how many
    sentences are in the current abstract and what sentence number the
    target line is.
  """
  input_lines = get_lines(filename)   # get all lines from filename
  abstract_lines = ""                 # empty string to accumulate the lines of one abstract
  abstract_samples = []               # empty list to hold one dictionary per labelled line

  # Loop through each line in the target file.
  for line in input_lines:

    if line.startswith("###"):  # an ID line marks the start of a new abstract
      abstract_id = line
      abstract_lines = ""   # reset the abstract string
    elif line.isspace():  # a blank line marks the end of an abstract
      # splitlines() splits on newlines without leaving an empty final element,
      # so every element contains a tab-separated label and sentence.
      abstract_line_split = abstract_lines.splitlines()

      # Iterate through each line in a single abstract.
      for abstract_line_number, abstract_line in enumerate(abstract_line_split):
        line_data = {}
        target_text_split = abstract_line.split("\t")
        line_data['target'] = target_text_split[0]  # get target label.
        line_data['text'] = target_text_split[1].lower()   # get target text and lowercase it.
        line_data['line_number'] = abstract_line_number
        line_data['total_lines'] = len(abstract_line_split) - 1  # index of the last line, counting from 0.
        abstract_samples.append(line_data)

    else:  # otherwise the line is a labelled sentence
      abstract_lines += line
  return abstract_samples


In [14]:
train_samples = preprocess_text_with_line_numbers(data_dir + "train.txt")
val_samples = preprocess_text_with_line_numbers(data_dir + "dev.txt")  # dev is another name for validation.
test_samples = preprocess_text_with_line_numbers(data_dir + "test.txt")

len(train_samples), len(val_samples), len(test_samples)

(180040, 30212, 30135)
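
When splitting a newline-terminated string, `str.split("\n")` and `str.splitlines()` behave differently, and that difference matters for the per-abstract parsing in `preprocess_text_with_line_numbers`. Below is a minimal illustration on a made-up two-line abstract (the sample text is invented for this sketch).

```python
# Made-up abstract text: two labelled lines, each ending in a newline.
abstract_lines = "BACKGROUND\tfirst sentence .\nMETHODS\tsecond sentence .\n"

with_split = abstract_lines.split("\n")       # keeps an empty string after the final newline
with_splitlines = abstract_lines.splitlines()  # only the two labelled lines

print(len(with_split), with_split)
print(len(with_splitlines), with_splitlines)

# The empty string contains no tab, so splitting it on "\t" gives a single
# element and indexing position 1 would raise an IndexError.
empty_parts = "".split("\t")
print(empty_parts, len(empty_parts))
```

Because every element returned by `splitlines()` here contains a tab, indexing `[1]` after splitting on `"\t"` is safe for each line.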