# BERT for BETO

In [160]:
import pandas as pd
import numpy as np
import torch
import random
import os
import nltk
from nltk import tokenize

In [136]:
#Load in some data
save_path = '/Users/Jonathan/Desktop/LabeledChemEData/Labeled_Sheets/'
#Empty cells are read in as NaN. Is there something that lets us fill them with "" or something else?
#Maybe set any NaNs to 0? That would imply a null label.
df = pd.read_excel(save_path+"Carbon_0.xlsx")

In [50]:
df

Unnamed: 0.1,Unnamed: 0,name,tokens,BESIO,entity,mol_class,Unnamed: 6,name.1,tokens.1,BESIO.1,...,tokens.48,BESIO.48,entity.48,mol_class.48,Unnamed: 294,name.49,tokens.49,BESIO.49,entity.49,mol_class.49
0,0,Jon O,In,,,,0,Jon O,X-ray,,...,©,,,,0,Jon O,©,,,
1,1,2010,the,,,,1,2014,photoelectron,,...,2020,,,,1,2015,2015,,,
2,2,250,interaction,,,,2,114,spectroscopy,,...,Elsevier,,,,2,104,Elsevier,,,
3,3,10.1016/j.carbon.2010.02.003,between,,,,3,10.1016/j.carbon.2013.12.061,(XPS),,...,Ltd,,,,3,10.1016/j.carbon.2015.08.007,Ltd.,,,
4,4,,gas,,,,4,,has,,...,The,,,,4,,Hydrogels,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
349,349,,,,,,349,,,,...,,,,,349,,,,,
350,350,,,,,,350,,,,...,,,,,350,,,,,
351,351,,,,,,351,,,,...,,,,,351,,,,,
352,352,,,,,,352,,,,...,,,,,352,,,,,


We need to update extract_xy_ so that it returns only the x values (i.e., the tokens) and the y values as one combined BESIO label per token. He used NLTK's sent_tokenize, so we are starting from the same place and can recover the sentence structure later.

In [138]:
def extract_xy_(df):
    """
    This method extracts and correctly arranges the NER training x-values (tokens)
    and y-values (BESIO labels) from a pandas dataframe containing labeled NER
    data

    Parameters:
        df (pandas DataFrame, required): Dataframe loaded via pd.read_excel() on
            a labeled NER dataset

    Returns:
        tuple: Two lists with one entry per paper, the tokens and the combined
            labels (e.g. 'S-MOL-I')
    """
    labeled = []
    columns = df.columns
    new_df = pd.DataFrame()
    all_tokens = []
    besio = []
    mol = []
    IorO = []
        
    for idx, column in enumerate(columns):
        # find every column that starts with 'name'
        if column.startswith('name'):

            # check if the entry in 'name' cell is a str
            if isinstance(df[column][0], str):
                tokens = df[columns[idx + 1]].values
                # Count the string tokens and truncate every column to that length.
                # Note: this assumes the only non-string entries are the trailing NaNs;
                # if a token is read as a number (e.g. a year), it is not counted and
                # the paper is cut short by one token for each such entry.
                l = sum(isinstance(entry, str) for entry in tokens)
                all_tokens.append(tokens[:l])
                # fillna returns a new Series, which avoids chained in-place replacement
                besio.append(df[columns[idx + 2]].fillna('O')[:l].values)
                # idx + 3 is the 'entity' column and idx + 4 is the 'mol_class' column
                mol.append(df[columns[idx + 3]].fillna('')[:l].values)
                IorO.append(df[columns[idx + 4]].fillna('')[:l].values)

    # Merge the three label columns into a single tag per token, e.g. 'S-MOL-I';
    # tokens outside any entity keep the bare 'O' tag.
    label_values = []
    for besio_tags, mol_tags, io_tags in zip(besio, mol, IorO):
        paper_labels = []
        for tag, mol_tag, io_tag in zip(besio_tags, mol_tags, io_tags):
            if tag == 'O':
                paper_labels.append(tag)
            else:
                paper_labels.append(tag + '-' + mol_tag + '-' + io_tag)
        label_values.append(paper_labels)
    return all_tokens, label_values

In [131]:
tokens, labels = extract_xy_(df)

219
91
133
47
114
127
135
232
136
196
124
52
39
115


In [135]:
len(labels[1])

91

In [134]:
len(tokens[1])

91

OK, so now we have two lists. Each entry in either list represents one paper (i.e., tokens[1] holds a whole paper). We now need to trim each entry so that the extra empty cells at the end are removed and tokens[1] is only as long as the number of words in that paper.
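
One way to do the trimming is to drop only the trailing empty cells rather than counting string entries. The following is a minimal sketch, assuming empty cells arrive as float NaN (as they do from pd.read_excel); the helper names `is_missing` and `trim_trailing_missing` are hypothetical and are not part of extract_xy_.

```python
import math


def is_missing(value):
    """True for the float NaN that pandas uses for empty cells."""
    return isinstance(value, float) and math.isnan(value)


def trim_trailing_missing(column):
    """Return the column up to and including its last non-missing entry."""
    end = len(column)
    while end > 0 and is_missing(column[end - 1]):
        end -= 1
    return list(column[:end])


nan = float("nan")
# A numeric token such as a year is kept, because only trailing NaNs are dropped.
example = ["©", 2010, "Elsevier", "Ltd", nan, nan]
print(trim_trailing_missing(example))  # ['©', 2010, 'Elsevier', 'Ltd']
```

Because this looks for the last valid entry instead of counting strings, a token stored as a number does not shorten the paper.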

Once we have done that for each entry, we will need to build a looping/wrapper function that will read every excel sheet in a directory. It'd be ideal if that looping function would append each new list to the previous list, so we could end up with every single labeled paper in a single set of two lists. 

After that function is built, the next step is to regenerate our sentence-split structure. First, turn each paper back into a single string with a homemade inverse of .split(): add the items in tokens[1] together with a single space between them. For example, ['The', 'dog', 'ran.'] should regenerate the original string 'The dog ran.'. We could do that with original_string += tokens[1][i] + ' '. Once each paper in the list is a single string again, we'll split each string into individual sentences using NLTK.

All of the above is now done!
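
As a small illustration of the "inverse .split()" idea, the sketch below uses str.join, which gives the same text as the += loop but without the trailing space. The function name `tokens_to_text` is hypothetical.

```python
def tokens_to_text(tokens):
    """Rebuild one paper's text by joining its tokens with single spaces."""
    return " ".join(str(token) for token in tokens)


paper_tokens = ["The", "dog", "ran."]
text = tokens_to_text(paper_tokens)
print(repr(text))  # 'The dog ran.'

# Splitting on whitespace recovers the token list, so the labels still line up.
assert text.split() == paper_tokens
```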

From that point on, it's more standard BERT. We'll use BERT's tokenizer, making sure to extend each label by hand to match the tokenization done by the BERT tokenizer so we don't get length mismatches (à la the tokenize_and_preserve_labels function at https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/), and then we'll send it to a dataloader.
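
The label-extension step could look roughly like the sketch below. It is written against a generic `tokenize_fn` callable so that it runs without downloading a model; `toy_wordpiece` is a hypothetical stand-in, and with Hugging Face transformers one would presumably pass a BERT tokenizer's `tokenize` method instead. Repeating the word's label on every subword is one simple convention, not necessarily the right one for BESIO tags.

```python
def tokenize_and_preserve_labels(words, labels, tokenize_fn):
    """Split each word into subwords and repeat its label once per subword."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    subwords, subword_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize_fn(str(word))
        subwords.extend(pieces)
        subword_labels.extend([label] * len(pieces))
    return subwords, subword_labels


def toy_wordpiece(word, max_len=4):
    """Hypothetical stand-in for WordPiece: fixed-size chunks marked with '##'."""
    chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return chunks[:1] + ["##" + chunk for chunk in chunks[1:]]


words = ["Soot", "oxidation", "with", "oxygen"]
labels = ["S-MOL-O", "O", "O", "S-MOL-I"]
pieces, piece_labels = tokenize_and_preserve_labels(words, labels, toy_wordpiece)
print(pieces)        # ['Soot', 'oxid', '##atio', '##n', 'with', 'oxyg', '##en']
print(piece_labels)  # ['S-MOL-O', 'O', 'O', 'O', 'O', 'S-MOL-I', 'S-MOL-I']
```

The two returned lists always have the same length, which is the property the dataloader step depends on.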

In [214]:
#OK, need another function that takes in a list of token lists (one per paper) and returns each paper as a single string
def tokenized_to_string(token_list):
    token_stringlist = []
    for paper_tokens in token_list:
        paper_string = ""
        for i in paper_tokens:
            #This is basically an 'unsplit' method lol
            paper_string += (str(i) + " ")
        token_stringlist.append(paper_string)
    return token_stringlist

In [226]:
def labeled_sheets_to_listed_tokens(directory_url):
    """This function opens a directory of labeled excel sheets from David's excel sheets and returns the tokens as a list 
    of sentences for each document. It returns two lists with one entry per document: the sentences and the flat list of token labels."""
    files = os.listdir(directory_url)
    token_list = []
    label_list = []
    for file in files:
        df = pd.read_excel(os.path.join(directory_url, file))
        token, label = extract_xy_(df)
        token_list += (tokenized_to_string(token))
        label_list += (label)
    #Now we tokenize each paper by sentences using NLTK:
    for i in range(len(token_list)):
        sentences = tokenize.sent_tokenize(token_list[i])
        token_list[i] = sentences
    return token_list, label_list
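
Note that the function above returns each paper's text split into sentences, while its labels stay in one flat list per paper. A minimal sketch of regrouping the labels to match the sentences follows. It assumes that no token contains internal whitespace and that the sentence splitter only breaks the text at spaces between tokens; the function name `split_labels_by_sentence` is hypothetical.

```python
def split_labels_by_sentence(sentences, labels):
    """Regroup a paper's flat label list so it matches its sentence split."""
    grouped, start = [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        grouped.append(list(labels[start:start + n_tokens]))
        start += n_tokens
    if start != len(labels):
        raise ValueError(
            f"sentences cover {start} tokens but there are {len(labels)} labels"
        )
    return grouped


sentences = ["Soot may be formed.", "Oxidation uses oxygen."]
labels = ["S-MOL-O", "O", "O", "O", "O", "O", "S-MOL-I"]
print(split_labels_by_sentence(sentences, labels))
# [['S-MOL-O', 'O', 'O', 'O'], ['O', 'O', 'S-MOL-I']]
```

The length check is there to surface papers where the assumptions fail, for example a token that contains a space.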

In [227]:
list_o_tokens, list_o_labels = labeled_sheets_to_listed_tokens(dir_url)

In [228]:
print(list_o_tokens[41])

['Soot may be formed when carbonaceous fuels are burned under local reducing conditions.', 'Its subsequent oxidation is of great significance for pollution control in industrial flames, auto engines etc.', 'Oxidation (gasification) can be achieved with oxygen, carbon dioxide, water vapour or nitrogen dioxide.', 'In this review, the experimental techniques which have been used to study the gasification of soot are described and the methods and results obtained by analysis of the data from them are considered.', 'Firstly, the mechanism of soot formation and its structure are briefly discussed.', 'The various scales of particulate which comprise it, i.e.', 'spherule, particle and aggregate, influence its properties and behaviour.', 'Next, the experimental equipment used in the study of its gasification is briefly described.', 'Gasification kinetics at low temperatures are measured either in fixed beds or by thermogravimetry.', 'The apparatus may be operated as a thermally programmed desor

In [225]:
print(list_o_labels[41])

['S-MOL-O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'S-MOL-I', 'B-MOL-I', 'E-MOL-I', 'S-MOL-I', 'O', 'O', 'B-MOL-I', 'E-MOL-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'S-MOL-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'S-MOL-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'S-MOL-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',

In [104]:
#Wrapper function that loops all excel spreadsheets and truncates each entry in each spreadsheet.
dir_url = '/Users/Jonathan/Desktop/LabeledChemEData/Labeled_Sheets/'
#for all sheets in dir_url:
files = os.listdir(dir_url)
print(files)
print(files[2])

['Carbon_0.xlsx', 'Carbon_1.xlsx', 'Journal_of_Inorganic_Biochemistry_0.xlsx', 'Journal_of_Inorganic_Biochemistry_1.xlsx', 'Journal_of_Organometallic_Chemistry_0.xlsx', 'Journal_of_Organometallic_Chemistry_1.xlsx']
Journal_of_Inorganic_Biochemistry_0.xlsx


In [153]:
token_list = []
label_list = []
for file in files:
    df = pd.read_excel(dir_url+file)
    token, label = extract_xy_(df)
    token_list.append(tokenized_to_string(token))
    label_list.append(label)
print(token_list[0][1])
print(label_list[0][1])

X-ray photoelectron spectroscopy (XPS) has been commonly used to determine the nitrogen-containing functional groups of graphene. However, reported assignments of C1s shifts of nitrogen-containing functional groups are unclear. Most works discuss peak shifts of only N1s spectra and C1s shifts and the full width at half maximum (FWHM) are excluded. Thus, peak shifts and FWHMs of C1s and N1s XPS spectra of graphene with nitrogen-containing functional groups such as pyridinic, phenanthroline-like, sp2C-NH2, sp 3C-NH2, pyrrolic, imine, pyridazine-like, pyrazole-like, sp2C-CN, sp3C-CN, and valley quaternary nitrogen (Q-N) on edges and sp3C-NH2, center amine, and center Q-N in the basal plane were simulated using density functional theory calculation. Main peaks of C1s spectra were shifted positively and negatively by the electron-withdrawing and electron-donating functional groups, respectively. FWHMs of the main peaks of C1s spectra were influenced by mainly electron-withdrawing functional

In [143]:
token_stringlist = []
tokens, labels = extract_xy_(df)

for paper_tokens in tokens:
    paper_string = ""
    for i in paper_tokens:
        paper_string += (str(i) + " ")
    token_stringlist.append(paper_string)
    
# for paper_labels in labels:
#     label_string = ""
#     for i in paper_labels:
#         print(i)
#         label_string += (str(i) + " ")
#     label_stringlist.append(label_string)
# print(label_stringlist)
#print(token_stringlist)

['In the interaction between gas molecules with single-walled carbon nanotube (SWCNT) we show that as a result of collisions the gas scattering contributes with an important background signal and should be considered in SWCNT-based gas sensors. Experimental evidence of the collision-induced tube wall deformation is demonstrated using in situ X-ray absorption near-edge structure spectroscopy. Results support the occurrence of the scattering process and show how gas collisions may affect the electronic structure of SWCNTs. © 2010 Elsevier ', 'X-ray photoelectron spectroscopy (XPS) has been commonly used to determine the nitrogen-containing functional groups of graphene. However, reported assignments of C1s shifts of nitrogen-containing functional groups are unclear. Most works discuss peak shifts of only N1s spectra and C1s shifts and the full width at half maximum (FWHM) are excluded. Thus, peak shifts and FWHMs of C1s and N1s XPS spectra of graphene with nitrogen-containing functional 

In [147]:
token_stringlist[1]

'X-ray photoelectron spectroscopy (XPS) has been commonly used to determine the nitrogen-containing functional groups of graphene. However, reported assignments of C1s shifts of nitrogen-containing functional groups are unclear. Most works discuss peak shifts of only N1s spectra and C1s shifts and the full width at half maximum (FWHM) are excluded. Thus, peak shifts and FWHMs of C1s and N1s XPS spectra of graphene with nitrogen-containing functional groups such as pyridinic, phenanthroline-like, sp2C-NH2, sp 3C-NH2, pyrrolic, imine, pyridazine-like, pyrazole-like, sp2C-CN, sp3C-CN, and valley quaternary nitrogen (Q-N) on edges and sp3C-NH2, center amine, and center Q-N in the basal plane were simulated using density functional theory calculation. Main peaks of C1s spectra were shifted positively and negatively by the electron-withdrawing and electron-donating functional groups, respectively. FWHMs of the main peaks of C1s spectra were influenced by mainly electron-withdrawing functiona

In [151]:
tokenized_to_string(tokens)

['In the interaction between gas molecules with single-walled carbon nanotube (SWCNT) we show that as a result of collisions the gas scattering contributes with an important background signal and should be considered in SWCNT-based gas sensors. Experimental evidence of the collision-induced tube wall deformation is demonstrated using in situ X-ray absorption near-edge structure spectroscopy. Results support the occurrence of the scattering process and show how gas collisions may affect the electronic structure of SWCNTs. © 2010 Elsevier ',
 'X-ray photoelectron spectroscopy (XPS) has been commonly used to determine the nitrogen-containing functional groups of graphene. However, reported assignments of C1s shifts of nitrogen-containing functional groups are unclear. Most works discuss peak shifts of only N1s spectra and C1s shifts and the full width at half maximum (FWHM) are excluded. Thus, peak shifts and FWHMs of C1s and N1s XPS spectra of graphene with nitrogen-containing functional

In [None]:
#next, need to tokenize with nltk
