<a href="https://colab.research.google.com/github/LyaSolis/exBERT/blob/master/1_data_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!ls /content/drive/MyDrive/GitHub/bluebert

1_data_prep_blue_bert.ipynb  elmo	  mt-bluebert  requirements.txt
bert			     LICENSE.txt  NER_output   tokenizer
bluebert		     mribert	  README.md


# Preprocess Data
### We will make two types of dataset: one for BlueBERT pretraining and fine-tuning, and one for exBERT.

### Input file format for BlueBERT: 
1. One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text. 
(Because we use the sentence boundaries for the "next sentence prediction" task).


In [3]:
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/GitHub/exBERT/data/paragrafs.csv")
df.head(1)

Unnamed: 0.1,Unnamed: 0,articleids,txts
0,0,PMC8519417,Patients with chronic lymphocytic leukemia (CL...


In [4]:
df = df.drop(['Unnamed: 0'], axis = 1)
df.head(1)

Unnamed: 0,articleids,txts
0,PMC8519417,Patients with chronic lymphocytic leukemia (CL...


In [5]:
import re
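# Testing the lookahead pattern: match a period only when a space and a letter follow
text = "Patients with. chronic lymphocytic 's leukemia (. CL Patients with chronic. lymphocytic. leukemia (CL"
re.findall(r"\.(?= [a-z])", text)  # raw string avoids the invalid "\." escape warning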

['.', '.', '.']

In [6]:
re.sub(r"\.(?= [a-z])", ".\\n", text)

"Patients with.\n chronic lymphocytic 's leukemia (. CL Patients with chronic.\n lymphocytic.\n leukemia (CL"

In [7]:
df[df['txts'].isna()]

Unnamed: 0,articleids,txts
384,PMC7216400,


In [8]:
print(df['txts'][384])

nan


In [9]:
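# Drop rows with missing text; .copy() so a column can be added later without a pandas warning
df = df[df['txts'].notna()].copy()
df[df['txts'].isna()]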

Unnamed: 0,articleids,txts


In [10]:
sent_list = []
for pargr in df['txts']:
    pargr = pargr.strip()
    # Insert a newline after each period that is followed by a space and an uppercase letter
    pargr = re.sub(r"\.(?= [A-Z])", ".\\n", pargr)
    sent_list.append(pargr)
df['sents'] = sent_list
df.head(1)

Unnamed: 0,articleids,txts,sents
0,PMC8519417,Patients with chronic lymphocytic leukemia (CL...,Patients with chronic lymphocytic leukemia (CL...
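
The split rule used above breaks after a period only when it is followed by a space and an uppercase letter. A minimal sketch of that rule as a standalone function makes its behaviour easy to check on small inputs. The function name `split_sentences` and the sample strings are illustrative and not part of the original notebook.

```python
import re

# Same rule as the loop above: a period followed by a space and an uppercase letter.
SENTENCE_END = re.compile(r"\.(?= [A-Z])")

def split_sentences(paragraph):
    """Split a paragraph into sentences (hypothetical helper, for illustration only)."""
    marked = SENTENCE_END.sub(".\n", paragraph.strip())
    return [s.strip() for s in marked.split("\n") if s.strip()]

# Decimal points are left alone because a digit, not a space, follows the period.
print(split_sentences("CLL is common. Patients were treated with 0.5 mg. Results follow."))

# Abbreviations followed by a capitalised word are split too early.
print(split_sentences("See Fig. 2 for details. Dr. Smith agreed."))
```

The second call returns `'Dr.'` as its own sentence, which shows the main limitation of a purely regex-based rule. Sentences that begin with a digit or a lowercase token (a gene name, for instance) are also not split.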



 2. Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task doesn't span between documents.

In [11]:
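# Adding blank lines between docs
# True on the last row of each article (the next row has a different article ID)
mask = df['articleids'].ne(df['articleids'].shift(-1))
# Empty rows at fractional indices, so they sort directly after those rows
df1 = pd.DataFrame('', index=mask.index[mask] + .5, columns=df.columns)

# Merge, restore row order, and drop the trailing blank row after the last article
df = pd.concat([df, df1]).sort_index().reset_index(drop=True).iloc[:-1]
df.tail(3)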

Unnamed: 0,articleids,txts,sents
7669,PMC2150407,B-cell chronic lymphocytic leukaemia (CLL) can...,B-cell chronic lymphocytic leukaemia (CLL) can...
7670,,,
7671,PMC1968218,2'-Chlorodeoxyadenosine (2CDA) is a purine ana...,2'-Chlorodeoxyadenosine (2CDA) is a purine ana...
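
The target layout, one sentence per line with a single blank line between documents, can also be sketched without the fractional-index approach. The version below is an illustration only: it assumes rows arrive as `(article_id, text)` pairs with all paragraphs of an article adjacent, the helper name `to_pretraining_lines` is hypothetical, and the article IDs in the sample are made up.

```python
from itertools import groupby

def to_pretraining_lines(rows):
    """rows: iterable of (article_id, text); text holds newline-separated sentences."""
    lines = []
    # groupby merges consecutive rows that share the same article ID
    for _, group in groupby(rows, key=lambda r: r[0]):
        if lines:
            lines.append("")  # blank line marks a document boundary
        for _, text in group:
            lines.extend(s.strip() for s in text.split("\n") if s.strip())
    return lines

rows = [
    ("PMC1", "First sentence.\n Second sentence."),
    ("PMC1", "Another paragraph."),
    ("PMC2", "New document."),
]
print(to_pretraining_lines(rows))
```

Like the `shift(-1)` comparison, `groupby` only looks at neighbouring rows, so an article whose paragraphs are not contiguous would be treated as several documents.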


Now we will write the updated text to a text file.

In [12]:
text_file = []
for row in df['sents']:
  # One entry per sentence; the empty separator rows become blank lines
  for sent in row.split('\n'):
    text_file.append(sent.lstrip())

text_file[:1]

['Patients with chronic lymphocytic leukemia (CLL) experience hypogammaglobinemia and non-neutropenic infections.']

In [13]:
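# Absolute path, consistent with the read_csv call above (does not depend on the working directory)
save_file = "/content/drive/MyDrive/GitHub/exBERT/data/bluebert_train_data.txt"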

with open(save_file, 'w') as f:
    for item in text_file:
        f.write("%s\n" % item)

The preprocessed PubMed text corpus used to pre-train the BlueBERT models contains ~4000M words extracted from the PubMed ASCII code version.

Other operations include:

 - lowercasing the text
 - removing special characters, i.e. anything outside the ASCII range \x00-\x7F
 - tokenizing the text using the NLTK Treebank tokenizer


In [14]:
preprocessed_text = []
for line in text_file:
    line = line.lower()
    line = re.sub(r'[\r\n]+', ' ', line)
    line = re.sub(r'[^\x00-\x7F]+', ' ', line)
    preprocessed_text.append(line)
preprocessed_text[:1]


['patients with chronic lymphocytic leukemia (cll) experience hypogammaglobinemia and non-neutropenic infections.']

In [15]:
len(preprocessed_text)

30904

In [16]:
from nltk import TreebankWordTokenizer

In [17]:
tokenizer = TreebankWordTokenizer()  # create once instead of on every iteration
pubmed_sent_nltk = []
for line in preprocessed_text:
  tokenized = tokenizer.tokenize(line)
  sentence = ' '.join(tokenized)
  sentence = re.sub(r"\s's\b", "'s", sentence)  # re-attach possessive 's to the preceding word
  pubmed_sent_nltk.append(sentence)

pubmed_sent_nltk[:1]

['patients with chronic lymphocytic leukemia ( cll ) experience hypogammaglobinemia and non-neutropenic infections .']
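
The last substitution in the loop undoes one Treebank convention: the tokenizer emits the possessive clitic as a separate token (`patient 's`), and the regex glues it back onto the preceding word. A small illustration on an already tokenized string follows; the sample sentence is made up.

```python
import re

# Treebank-style output, with the possessive split off as its own token
tokenized = "the patient 's spleen was enlarged ."

# Remove the whitespace in front of a standalone 's token
rejoined = re.sub(r"\s's\b", "'s", tokenized)
print(rejoined)
```

Only `'s` is rejoined. Other tokens the tokenizer splits off, such as `n't`, stay separate.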

In [18]:
len(pubmed_sent_nltk)

30904

In [19]:
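# Absolute path, consistent with the read_csv call above (does not depend on the working directory)
save_file = "/content/drive/MyDrive/GitHub/exBERT/data/bluebert_clean_train_data.txt"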

with open(save_file, 'w') as f:
    for item in pubmed_sent_nltk:
        f.write("%s\n" % item)
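
Before handing the file to pretraining, a quick structural check can catch formatting slips. The sketch below reflects an assumption about what is worth checking and is not part of the BlueBERT tooling: it counts documents and flags repeated blank lines and non-ASCII characters. The name `check_pretraining_lines` is hypothetical.

```python
def check_pretraining_lines(lines):
    """Return (n_documents, problems) for a list of lines without trailing newlines."""
    problems = []
    n_docs = 0
    prev_blank = True  # treat the start of the file as a document boundary
    for num, line in enumerate(lines, start=1):
        blank = not line.strip()
        if blank and prev_blank:
            problems.append(f"line {num}: blank line at start or repeated blank line")
        if not blank and prev_blank:
            n_docs += 1  # first sentence after a boundary starts a new document
        if not line.isascii():
            problems.append(f"line {num}: non-ASCII character")
        prev_blank = blank
    return n_docs, problems

sample = ["first doc , sentence one .", "sentence two .", "", "second doc .", "", "", "caf\u00e9 ."]
print(check_pretraining_lines(sample))
```

For the saved file, the input could be obtained with `open(save_file).read().splitlines()`.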

## For exBERT, the text file needs one paragraph per line, with no blank lines in between.

In [20]:
df = pd.read_csv("/content/drive/MyDrive/GitHub/exBERT/data/paragrafs.csv")
text = df["txts"].dropna()  # skip the missing paragraph so it does not produce an empty line

In [21]:
text.to_csv("/content/drive/MyDrive/GitHub/exBERT/data/exbert_train_data.txt", sep='\n', index=False, header=False)
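
Because `to_csv` goes through the CSV writer, paragraphs that contain quote characters or embedded newlines may be written with added quoting. A minimal sketch of an alternative that builds the lines directly is shown below. The helper name `paragraphs_to_lines` is hypothetical, and collapsing internal whitespace is an assumption about what "one paragraph per line" requires.

```python
import math

def paragraphs_to_lines(paragraphs):
    """Return one cleaned line per non-empty paragraph (hypothetical helper)."""
    lines = []
    for p in paragraphs:
        # Skip missing values: None, or float NaN as pandas uses for empty cells
        if p is None or (isinstance(p, float) and math.isnan(p)):
            continue
        flat = " ".join(str(p).split())  # collapse newlines and runs of whitespace
        if flat:
            lines.append(flat)
    return lines

sample = ['He said "yes".', float("nan"), "Two\nlines here.", "   "]
print(paragraphs_to_lines(sample))
```

With the notebook's data this could be called as `paragraphs_to_lines(df["txts"])`, and the result written out joined by `"\n"`.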

Next, we will create our new dictionary and tokenizer (notebook 2_get_vocab_and_tokenizer.ipynb).