# Milestone Project 2

### Problem Statement
The number of randomized controlled trial (RCT) papers published keeps increasing, which makes it difficult for researchers to keep up with the latest research in their field. This project aims to build a tool that helps with that: it will search for the latest RCT papers in a specific field and summarize their key findings.

### Solution
Create an NLP model that classifies each abstract sentence by the role it plays (e.g. objective, methods, results), so researchers can skim the literature (hence SkimLit 🤓🔥) and dive deeper when necessary.

> 📖 **Resources:** Before going through the code in this notebook, you might want some background on what we're going to be doing. To get it, spend an hour (or two) going through the following papers and then return to this notebook:
1. Where our data is coming from: [*PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts*](https://arxiv.org/abs/1710.06071)
2. Where our model is coming from: [*Neural networks for joint sentence classification in medical paper abstracts*](https://arxiv.org/pdf/1612.05251.pdf)

## Topics to be covered
* Downloading a dataset ([PubMed RCT200k from GitHub](https://github.com/Franck-Dernoncourt/pubmed-rct))
* Preprocessing data for NLP models
* Setting up a series of NLP models
    * Making a baseline model (TF-IDF Multinomial Naive Bayes)
    * Deep models with different combinations of layers (Conv1D, LSTM, GRU)
* Building a first multimodal model (feature extraction and token embeddings)
    * Replicating the model architecture from https://arxiv.org/abs/1612.05251
* Finding the most wrong predictions
* Making predictions on PubMed abstracts


## Get Data

* `train.txt` - training samples
* `dev.txt` - validation samples
* `test.txt` - test samples

In [1]:
# Start by using the 20k dataset
data_dir = "data/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

In [2]:
# Check all of the filenames in the data directory
import os
filenames = [data_dir + filename for filename in os.listdir(data_dir)]
filenames

['data/PubMed_20k_RCT_numbers_replaced_with_at_sign/dev.txt',
 'data/PubMed_20k_RCT_numbers_replaced_with_at_sign/test.txt',
 'data/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt']

## Preprocess Data

In [3]:
# Create a function to read the lines of the documents
def get_lines(filename):
    """
    Reads filename (a text filename) and returns the lines of text as a list.
    Args:
    filename: a string containing the target text filename.
    Returns:
    A list of strings with one string per line from the target text filename.
    """
    with open(filename, "r") as file:
        return file.readlines()

In [5]:
train_lines = get_lines(data_dir + "train.txt")
train_lines[:20]

['###24293578\n',
 'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',
 'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',
 'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',
 'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',
 'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',
 'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and 
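
The output above suggests three kinds of line: an ID line starting with `###`, a labelled sentence of the form `LABEL<tab>text`, and (as the parsing code further down assumes) a whitespace-only line that closes each abstract. The following is a minimal sketch of how those kinds could be told apart. `describe_line` and `sample_lines` are hypothetical names introduced only for illustration, and the sample reuses a few lines from the output above plus an assumed trailing blank line.

```python
from collections import Counter

# Hypothetical in-memory sample that mirrors the file layout shown above.
# The final "\n" entry is an assumed blank separator line.
sample_lines = [
    "###24293578\n",
    "METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n",
    "METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n",
    "\n",
]


def describe_line(line):
    """Return a (kind, label, text) tuple for one raw line (hypothetical helper)."""
    if line.startswith("###"):
        # ID line: strip the newline and the leading "#" characters
        return ("id", None, line.strip().lstrip("#"))
    if line.isspace():
        # Whitespace-only line: marks the end of an abstract
        return ("blank", None, "")
    # Labelled sentence: the label and the text are separated by the first tab
    label, text = line.rstrip("\n").split("\t", maxsplit=1)
    return ("sentence", label, text)


parsed = [describe_line(line) for line in sample_lines]

# Count how often each label appears among the sentence lines
label_counts = Counter(label for kind, label, _ in parsed if kind == "sentence")
print(parsed[0])
print(label_counts)
```

Running the same counting step over `train_lines` would be one quick way to check how the labels are distributed before building any model.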

In [7]:
# Create a function to turn each abstract line into a dictionary of line data
def preprocess_text_with_line_numbers(filename):
    """
    Returns a list of dictionaries of abstract line data.
    
    Takes in filename, reads its contents and sorts through each line,
    extracting things like the line number, the text of the sentence,
    and the target label.

    Args:
    filename: a string containing the target text filename.

    Returns:
    A list of dictionaries each containing the key value pairs for
    line_number, target, and text.
    """

    input_lines = get_lines(filename) # get all lines from filename
    abstract_lines = "" # create an empty abstract
    abstract_samples = [] # create an empty list to hold abstracts

    # Loop through each line in the target file
    for line in input_lines:
        if line.startswith("###"): # check to see if line is an ID line
            abstract_id = line
            abstract_lines = "" # reset abstract string
        elif line.isspace(): # check to see if line is a new line
            abstract_line_split = abstract_lines.splitlines() # split abstract into separate lines

            # Iterate through each line in abstract and count them at the same time
            for abstract_line_number, abstract_line in enumerate(abstract_line_split):
                line_data = {} # create an empty dictionary for each line
                target_text_split = abstract_line.split("\t") # split target label from text
                line_data["target"] = target_text_split[0] # get target label
                line_data["text"] = target_text_split[1].lower() # get target text and lower it
                line_data["line_number"] = abstract_line_number # what line number does the line appear in the abstract?
                line_data["total_lines"] = len(abstract_line_split) - 1 # how many total lines are in the abstract? (start from 0)
                abstract_samples.append(line_data) # add line data to abstract samples list

        else: # if the above conditions aren't fulfilled, the line contains a labelled sentence
            abstract_lines += line

    return abstract_samples
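
To make the shape of the returned dictionaries concrete, here is a minimal standalone sketch that applies the same per-line logic to a single abstract held in memory. `parse_abstract_block` and `example_block` are hypothetical names, and the three sentences are made up for illustration rather than taken from the dataset.

```python
def parse_abstract_block(block):
    """Turn one abstract's labelled lines into per-sentence dicts (hypothetical helper)."""
    lines = block.splitlines()
    records = []
    for line_number, line in enumerate(lines):
        # Split the target label from the sentence text at the first tab
        target, text = line.split("\t", maxsplit=1)
        records.append({
            "target": target,
            "text": text.lower(),  # lowercase the text, as in the function above
            "line_number": line_number,  # zero-based position within the abstract
            "total_lines": len(lines) - 1,  # zero-based index of the last line
        })
    return records


# Made-up example abstract (not real data), in the LABEL<tab>text layout
example_block = (
    "OBJECTIVE\tTo test a Made-Up example sentence .\n"
    "METHODS\tA second made-up sentence .\n"
    "RESULTS\tA third made-up sentence .\n"
)

records = parse_abstract_block(example_block)
for record in records:
    print(record)
```

Note that `line_number` and `total_lines` both count from 0, so a three-sentence abstract yields line numbers 0 to 2 and a `total_lines` value of 2.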
