<a href="https://colab.research.google.com/github/LyaSolis/exBERT/blob/master/1_data_prep_blue_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!ls /content/drive/MyDrive/GitHub/bluebert

bert	  LICENSE.txt  NER_output	 set_up_bluebert.ipynb
bluebert  mribert      README.md	 tokenizer
elmo	  mt-bluebert  requirements.txt


## Preprocess Data

### Input file format
1. One sentence per line. These should be actual sentences, not entire paragraphs or arbitrary spans of text, because the sentence boundaries are used for the "next sentence prediction" task.
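
As a minimal sketch of the one-sentence-per-line requirement, the snippet below splits a paragraph at every period that is followed by a space and a capital letter. The helper name `split_sentences` and the example text are hypothetical, and the rule is a rough heuristic rather than a full sentence segmenter: it will not split before a sentence that starts with a digit, for example.

```python
import re


def split_sentences(paragraph):
    """Split a paragraph at periods followed by a space and a capital letter."""
    # The lookahead (?= [A-Z]) checks the following text without consuming it,
    # so only the period itself is replaced by a period plus a newline.
    marked = re.sub(r"\.(?= [A-Z])", ".\n", paragraph.strip())
    # Drop the space left at the start of each new line and skip empty pieces.
    return [piece.strip() for piece in marked.split("\n") if piece.strip()]


# Hypothetical example text, not taken from the dataset.
example = "Patients with CLL were evaluated. Fifteen patients enrolled. Results, e.g. titers, improved."
for sentence in split_sentences(example):
    print(sentence)
```

Abbreviations such as "e.g." followed by a lowercase word are left intact, which is the reason for requiring a capital letter after the space.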


In [81]:
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/GitHub/exBERT/data/paragrafs.csv")
df.head(1)

Unnamed: 0.1,Unnamed: 0,articleids,txts
0,0,PMC8519417,Patients with chronic lymphocytic leukemia (CL...


In [82]:
df = df.drop(['Unnamed: 0'], axis = 1)
df

Unnamed: 0,articleids,txts
0,PMC8519417,Patients with chronic lymphocytic leukemia (CL...
1,PMC8519417,Patients with CLL with serum IgG >= 400 mg/dL ...
2,PMC8519417,Fifteen patients enrolled with median IgG = 78...
3,PMC8519417,Patients with CLL demonstrate humoral immunode...
4,PMC8519417,NCT 03730129.
...,...,...
7467,PMC4535919,The use of GM-CSF as a single dose with standa...
7468,PMC4535919,"A low immunogenicity of PPV, which predominant..."
7469,PMC4535919,Various strategies to improve vaccine response...
7470,PMC2150407,B-cell chronic lymphocytic leukaemia (CLL) can...


In [83]:
import re
# Testing patterns
text = "Patients with. chronic lymphocytic 's leukemia (. CL Patients with chronic. lymphocytic. leukemia (CL"
re.findall(r"\.(?= [a-z])", text)  # raw string, so the backslash reaches the regex engine unchanged

['.', '.', '.']

In [84]:
re.sub(r"\.(?= [a-z])", ".\\n", text)

"Patients with.\n chronic lymphocytic 's leukemia (. CL Patients with chronic.\n lymphocytic.\n leukemia (CL"

In [97]:
df[df['txts'].isna()]

Unnamed: 0,articleids,txts
384,PMC7216400,


In [91]:
print(df['txts'][384])

nan


In [98]:
df = df[df['txts'].notna()]
df[df['txts'].isna()]

Unnamed: 0,articleids,txts


In [101]:
sent_list = []
for pargr in df['txts']:
    # Insert a newline after each period that is followed by a space and a capital letter,
    # i.e. at likely sentence ends only
    pargr = re.sub(r"\.(?= [A-Z])", ".\n", pargr.strip())
    sent_list.append(pargr)
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered frame
df = df.assign(sents=sent_list)
df

Unnamed: 0,articleids,txts,sents
0,PMC8519417,Patients with chronic lymphocytic leukemia (CL...,Patients with chronic lymphocytic leukemia (CL...
1,PMC8519417,Patients with CLL with serum IgG >= 400 mg/dL ...,Patients with CLL with serum IgG >= 400 mg/dL ...
2,PMC8519417,Fifteen patients enrolled with median IgG = 78...,Fifteen patients enrolled with median IgG = 78...
3,PMC8519417,Patients with CLL demonstrate humoral immunode...,Patients with CLL demonstrate humoral immunode...
4,PMC8519417,NCT 03730129.,NCT 03730129.
...,...,...,...
7467,PMC4535919,The use of GM-CSF as a single dose with standa...,The use of GM-CSF as a single dose with standa...
7468,PMC4535919,"A low immunogenicity of PPV, which predominant...","A low immunogenicity of PPV, which predominant..."
7469,PMC4535919,Various strategies to improve vaccine response...,Various strategies to improve vaccine response...
7470,PMC2150407,B-cell chronic lymphocytic leukaemia (CLL) can...,B-cell chronic lymphocytic leukaemia (CLL) can...



2. Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task does not pair sentences from different documents.
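
The target layout can be sketched without pandas, assuming the rows are already ordered so that all sentences of one document are adjacent. The function name `to_pretraining_lines` and the document IDs `PMC1` and `PMC2` are hypothetical and only serve the illustration.

```python
from itertools import groupby


def to_pretraining_lines(rows):
    """Turn (doc_id, sentence) pairs into lines with a blank line between documents."""
    lines = []
    # groupby only merges adjacent rows, so rows must already be ordered by document.
    for _, group in groupby(rows, key=lambda row: row[0]):
        if lines:
            lines.append("")  # blank line separates consecutive documents
        lines.extend(sentence for _, sentence in group)
    return lines


# Hypothetical rows, not taken from the dataset.
rows = [
    ("PMC1", "First sentence."),
    ("PMC1", "Second sentence."),
    ("PMC2", "Another document."),
]
print("\n".join(to_pretraining_lines(rows)))
```

No blank line is added after the final document, which matches dropping the last blank row in the pandas version.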

In [102]:
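# Adding blank lines between docs
# Mark the last row of each document (its article ID differs from the next row's)
mask = df['articleids'].ne(df['articleids'].shift(-1))
# Build empty rows at half-step index positions so they sort directly after the marked rows
df1 = pd.DataFrame('', index=mask.index[mask] + .5, columns=df.columns)

# Merge, restore the order, and drop the trailing blank row after the final document
df = pd.concat([df, df1]).sort_index().reset_index(drop=True).iloc[:-1]
print(df)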

      articleids                                               txts  \
0     PMC8519417  Patients with chronic lymphocytic leukemia (CL...   
1     PMC8519417  Patients with CLL with serum IgG >= 400 mg/dL ...   
2     PMC8519417  Fifteen patients enrolled with median IgG = 78...   
3     PMC8519417  Patients with CLL demonstrate humoral immunode...   
4     PMC8519417                                      NCT 03730129.   
...          ...                                                ...   
7667  PMC4535919  Various strategies to improve vaccine response...   
7668                                                                  
7669  PMC2150407  B-cell chronic lymphocytic leukaemia (CLL) can...   
7670                                                                  
7671  PMC1968218  2'-Chlorodeoxyadenosine (2CDA) is a purine ana...   

                                                  sents  
0     Patients with chronic lymphocytic leukemia (CL...  
1     Patients with CLL with se

Now we will split the updated text into one sentence per line, ready to be written to a text file.

In [116]:
!ls drive/MyDrive/GitHub/exBERT/data

paragrafs.csv  paragrafs.zip  paragraphs.txt


In [125]:
text_file = []
for row in df['sents']:
    # Each row holds newline-separated sentences; a blank separator row yields one empty string
    for sentence in row.split('\n'):
        text_file.append(sentence.lstrip())

text_file[:10]

['Patients with chronic lymphocytic leukemia (CLL) experience hypogammaglobinemia and non-neutropenic infections.',
 'In this exploratory proof of concept study, our objective was to determine the prevalence of humoral immunodeficiency in patients with CLL and serum IgG >= 400 mg/dL, and to evaluate the efficacy of subcutaneous immunoglobulin (SCIG) in this population.',
 'Patients with CLL with serum IgG >= 400 mg/dL were evaluated for serum IgG, IgM, IgA, along with pre/post vaccine IgG titers to diphtheria, tetanus, and Streptococcus pneumoniae.',
 'Patients with evidence of humoral dysfunction were treated with SCIG with Hizentra every 7+-2 days for 24 weeks.',
 'Fifteen patients enrolled with median IgG = 782 mg/dL [IQR: 570 to 827], and 6/15 (40%) responded to vaccination with Td, while 5/15 (33%) responded to vaccination with PPV23. 14/15 (93.3%) demonstrated humoral immunodeficiency as evidenced by suboptimal vaccine responses, and were treated with SCIG.',
 'In patients treate

The preprocessed PubMed corpus used to pre-train the BlueBERT models contains ~4000M words extracted from the PubMed ASCII code version.

Other operations include:

 - lowercasing the text
 - removing special characters outside the ASCII range \x00-\x7F
 - tokenizing the text using the NLTK Treebank tokenizer
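
The first two operations can be sketched as a single helper using only the standard library. The name `normalize_line` is hypothetical, and the tokenization step is left out here so the sketch does not depend on NLTK being installed.

```python
import re


def normalize_line(line):
    """Lowercase a line, flatten line breaks, and blank out non-ASCII characters."""
    line = line.lower()
    # Collapse any run of carriage returns or newlines into one space.
    line = re.sub(r"[\r\n]+", " ", line)
    # Replace each run of characters outside \x00-\x7F with one space.
    line = re.sub(r"[^\x00-\x7F]+", " ", line)
    return line


# \u2265 is the "greater than or equal to" sign, a non-ASCII character.
print(normalize_line("Serum IgG \u2265 400 mg/dL"))
```

Because the non-ASCII character is replaced by a space rather than deleted, the surrounding spaces remain and the result can contain runs of several spaces.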


In [129]:
preprocessed_text = []
for line in text_file:
    line = line.lower()
    line = re.sub(r'[\r\n]+', ' ', line)
    line = re.sub(r'[^\x00-\x7F]+', ' ', line)
    preprocessed_text.append(line)
preprocessed_text[:5]


['patients with chronic lymphocytic leukemia (cll) experience hypogammaglobinemia and non-neutropenic infections.',
 'in this exploratory proof of concept study, our objective was to determine the prevalence of humoral immunodeficiency in patients with cll and serum igg >= 400 mg/dl, and to evaluate the efficacy of subcutaneous immunoglobulin (scig) in this population.',
 'patients with cll with serum igg >= 400 mg/dl were evaluated for serum igg, igm, iga, along with pre/post vaccine igg titers to diphtheria, tetanus, and streptococcus pneumoniae.',
 'patients with evidence of humoral dysfunction were treated with scig with hizentra every 7+-2 days for 24 weeks.',
 'fifteen patients enrolled with median igg = 782 mg/dl [iqr: 570 to 827], and 6/15 (40%) responded to vaccination with td, while 5/15 (33%) responded to vaccination with ppv23. 14/15 (93.3%) demonstrated humoral immunodeficiency as evidenced by suboptimal vaccine responses, and were treated with scig.']

In [130]:
len(preprocessed_text)

30904

In [15]:
from nltk import TreebankWordTokenizer

In [131]:
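pubmed_sent_nltk = []
tokenizer = TreebankWordTokenizer()  # create once and reuse for every line
for line in preprocessed_text:
    tokenized = tokenizer.tokenize(line)
    sentence = ' '.join(tokenized)
    # Re-attach a possessive 's that was split off as a separate token
    sentence = re.sub(r"\s's\b", "'s", sentence)
    pubmed_sent_nltk.append(sentence)

pubmed_sent_nltk[:10]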

['patients with chronic lymphocytic leukemia ( cll ) experience hypogammaglobinemia and non-neutropenic infections .',
 'in this exploratory proof of concept study , our objective was to determine the prevalence of humoral immunodeficiency in patients with cll and serum igg > = 400 mg/dl , and to evaluate the efficacy of subcutaneous immunoglobulin ( scig ) in this population .',
 'patients with cll with serum igg > = 400 mg/dl were evaluated for serum igg , igm , iga , along with pre/post vaccine igg titers to diphtheria , tetanus , and streptococcus pneumoniae .',
 'patients with evidence of humoral dysfunction were treated with scig with hizentra every 7+-2 days for 24 weeks .',
 'fifteen patients enrolled with median igg = 782 mg/dl [ iqr : 570 to 827 ] , and 6/15 ( 40 % ) responded to vaccination with td , while 5/15 ( 33 % ) responded to vaccination with ppv23. 14/15 ( 93.3 % ) demonstrated humoral immunodeficiency as evidenced by suboptimal vaccine responses , and were treated w

In [132]:
len(pubmed_sent_nltk)

30904

In [133]:
save_file = "drive/MyDrive/GitHub/exBERT/data/train_data.txt"

with open(save_file, 'w') as f:
    for item in pubmed_sent_nltk:
        f.write("%s\n" % item)
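
A quick sanity check on the written format can be sketched as follows. The helper `count_documents_and_sentences` is hypothetical; it treats every non-blank line as a sentence and every blank line as a document boundary, and the sample lines are made up for illustration.

```python
def count_documents_and_sentences(lines):
    """Count documents and sentences in one-sentence-per-line text."""
    docs = 0
    sentences = 0
    in_doc = False
    for raw in lines:
        if raw.strip():
            sentences += 1
            if not in_doc:
                # First sentence after a blank line (or at the start) opens a new document.
                docs += 1
                in_doc = True
        else:
            in_doc = False
    return docs, sentences


# Hypothetical sample in the same layout as the training file.
sample = ["first sentence .\n", "second sentence .\n", "\n", "another document .\n"]
print(count_documents_and_sentences(sample))
```

The function accepts any iterable of lines, so an open file handle for the saved training file could be passed in the same way.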

Next, we will create our new dictionary and tokenizer (notebook 2_get_vocab_and_tokenizer.ipynb).