<font size=6>
    <b>Read_conllu_files.ipynb:</b> Read and parse information from diverse .conllu files, and use integrations with libraries to apply data efficently
 </font>

## Introduction

This notebook demonstrates how diverse .conllu files can be imported, converted and worked with using the open source library [Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas). This library uses [Pandas](https://pandas.pydata.org/) DataFrames as a primary data storage format, and to work with several different NLP libraries, such as [SpaCy](https://spacy.io), [Huggingface Transformers](https://huggingface.co/transformers/). 

Here we show how these features can be used in conjunction to import, select data, display sentence structure information, and then finally retokenize and train a classifier model on the dataset. 



In [1]:
import os
import sys
import numpy as np
import pandas as pd
import sklearn.pipeline
import sklearn.linear_model
import transformers

# And of course we need the text_extensions_for_pandas library itself.
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("notebooks"):
        raise e
    if ".." not in sys.path:
        sys.path.insert(0, "..")
    import text_extensions_for_pandas as tp


### Loading files
There are several sub-flavors of .conllu files, including those used in the EWT, Ontonotes, Universal Dependencies, and  CoNLL 2009 corpuses. Text Extensions is designed to take advantage of the common features of .conllu files, while allowing for varied types to be accepted. 

In importing this file type, we
1. Translate the raw words into Token Dtypes
1. Preserve the dependencies between tokens as represented in the `head` and `deprel` columns
1. Capture conllu metadata written into the file, if it exists 
1. Allow for conll 09 and Ontonotes style predicate - predicate argument representations
1. Capture each token's sentence 
1. Allow the user to choose how sub-tokens are handled


First, though we must load the datasets we will be using for this demo notebook 

In the following cell, we use the facilities of Text Extensions for Pandas to download a copy of the [Universal Dependencies EWT data set](https://github.com/UniversalDependencies/UD_English-EWT) and the [Trial section of the CoNLL 2009 dataset](https://ufal.mff.cuni.cz/conll2009-st/trial-data.html). **Make sure that you adhere to the terms under which they are liscensed when using them** 

Then we read them in and display them in the document. Notice how different the information stored in each dataset is. One thing to note is in this specific example, we drop a few columns from each dataset for brevity; remove the `.drop()` methods to show more lines. 

In [2]:
# init file locations, and download data if necessary. 
BASE_DIR = 'CoNLL_u_test_inputs/'
FEATHER_FILE = "conllu_database.feather"

ewt_base_url = "https://github.com/UniversalDependencies/UD_English-EWT/raw/master/en_ewt-ud-"
ewt_dev_url = ewt_base_url + 'dev.conllu'
conll_09_test_data_url =  'https://ufal.mff.cuni.cz/conll2009-st/trial/CoNLL2009-ST-English-trial.zip'

# allows us to re-start from saved points
corpus_df = None 

ewt_dev_url, conll_09_test_data_url

('https://github.com/UniversalDependencies/UD_English-EWT/raw/master/en_ewt-ud-dev.conllu',
 'https://ufal.mff.cuni.cz/conll2009-st/trial/CoNLL2009-ST-English-trial.zip')

In [3]:
# download the files if they have not already been downloaded 
conll_09_path = tp.io.conll.maybe_download_dataset_data(BASE_DIR, conll_09_test_data_url)
conllu_ewt_path = tp.io.conll.maybe_download_dataset_data(BASE_DIR, ewt_dev_url)

# if you already have access to the full conll 2009 dataset, name the file accordingly and uncomment this line 
# conll_09_path = BASE_DIR + 'CoNLL2009-ST-evaluation-English.conllu'
conll_09_path, conllu_ewt_path

('CoNLL_u_test_inputs//CoNLL2009-ST-English-trial.txt',
 'CoNLL_u_test_inputs//en_ewt-ud-dev.conllu')

In [4]:
# import two very different documents, both in the conllu file format. 

# by default we look for EWT style column names, 
# so we have to define a new set for this specific conll09 format
conll_09_cols = ["LEMMA","PLEMMA",'POS','PPOS','FEAT','PFEAT','head','phead','DEPREL','PDEPREL','FILLPRED','PRED']

conll_09_docs = tp.io.conll.conll_u_to_dataframes(conll_09_path,column_names=conll_09_cols)
#now just filter,and display the document 
conll_09_doc = conll_09_docs[0].drop(columns=["PLEMMA",'PPOS','PFEAT','phead','PDEPREL','FILLPRED','sentence','line_num'])
print("Conll 09 format .conllu document:")
display(conll_09_doc.head())


#simultaneously, we can import an ewt style document, and display it with the same function
conll_u_docs = tp.io.conll.conll_u_to_dataframes(conllu_ewt_path)
#display 
DOC_NUM = 0
doc_df = conll_u_docs[DOC_NUM]
# here we drop the sentence argument for brevity.
print("EWT format .conllu document:")
doc_df.head(10).drop(columns = ["sentence"])

Conll 09 format .conllu document:


Unnamed: 0,span,LEMMA,POS,FEAT,head,DEPREL,PRED,predicate,pred0arg,pred1arg,pred2arg,pred3arg,pred4arg,pred5arg,pred6arg,pred7arg,pred8arg,pred9arg,pred10arg
0,"[0, 3): 'The'",the,DT,,1.0,NMOD,,,,,,,,,,,,,
1,"[4, 11): 'economy'",economy,NN,,3.0,NMOD,,A1,,,,,,,,,,,
2,"[11, 13): ''s'",'s,POS,,1.0,SUFFIX,,,,,,,,,,,,,
3,"[14, 25): 'temperature'",temperature,NN,,4.0,SBJ,temperature.01,A2,A1,,,,,,,,,,
4,"[26, 30): 'will'",will,MD,,,ROOT,,,AM-MOD,,,,,,,,,,


EWT format .conllu document:


Unnamed: 0,span,lemma,upostag,xpostag,features,head,deprel,deps,misc,sentence_id,paragraph_id,doc_id,line_num
0,"[0, 4): 'From'",from,ADP,IN,,2.0,case,3:case,,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,4
1,"[5, 8): 'the'",the,DET,DT,Definite=Def|PronType=Art,2.0,det,3:det,,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,5
2,"[9, 11): 'AP'",AP,PROPN,NNP,Number=Sing,3.0,obl,4:obl:from,,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,6
3,"[12, 17): 'comes'",come,VERB,VBZ,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbF...,,root,0:root,,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,7
4,"[18, 22): 'this'",this,DET,DT,Number=Sing|PronType=Dem,5.0,det,6:det,,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,8
5,"[23, 28): 'story'",story,NOUN,NN,Number=Sing,3.0,nsubj,4:nsubj,,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,9
6,"[28, 29): ':'",:,PUNCT,:,,3.0,punct,4:punct,,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,10
7,"[30, 39): 'President'",President,PROPN,NNP,Number=Sing,11.0,nsubj,5:nsubj,,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,15
8,"[40, 44): 'Bush'",Bush,PROPN,NNP,Number=Sing,7.0,flat,1:flat,,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,16
9,"[45, 47): 'on'",on,ADP,IN,,10.0,case,4:case,,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,weblog-blogspot.com_nominations_20041117172713...,17


## Combining documents and saving as a .feather file. 

Something that is often useful is to store multiple documents from a corpus in one single dataset. Here we do that, then make an adjustment to keep the `'head'` column of our database pointing at the correct elements.


Next we quickly write then reread our document as a `.feather` file in its. manipulated state. Because this is serialized, writing and reading is significantly faster than writing to a raw `.conllu` format.  

In [5]:
# Because we are concatenating our dataframes, we need to modify the "head" 
# fields to still point at their desired targets 
df_starts_at =0
temp = conll_u_docs.copy()
for df in temp:
    df['head'] = df['head'].apply(lambda i: i +df_starts_at if i is not None else -1)
    df_starts_at += df.shape[0]

# Now concatenate all our documents into one big dataframe
complete_df = pd.concat(temp, ignore_index=True)

#show the last few rows of the dataframe, select just a few columns for compactness
print(f"size is {complete_df.shape[0]}")
complete_df[["span","lemma","upostag","head","deprel"]]

size is 25153


Unnamed: 0,span,lemma,upostag,head,deprel
0,"[0, 4): 'From'",from,ADP,2.0,case
1,"[5, 8): 'the'",the,DET,2.0,det
2,"[9, 11): 'AP'",AP,PROPN,3.0,obl
3,"[12, 17): 'comes'",come,VERB,,root
4,"[18, 22): 'this'",this,DET,5.0,det
...,...,...,...,...,...
25148,"[251, 254): 'and'",and,CCONJ,25152.0,cc
25149,"[255, 256): 'a'",a,DET,25152.0,det
25150,"[257, 261): 'very'",very,ADV,25151.0,advmod
25151,"[262, 275): 'knowledgeable'",knowledgeable,ADJ,25152.0,amod


In [6]:
# one advantage of using pandas dataframes is that we can write and read them signifcantly faster than we could the raw conllu files 
# here we use pyarrow with feather to save and reload our dataframe. 

# Currently writing multi document files is not supported, so we will have to use a workaround, 
# by converting sentences from TokenSpanArrays to SpanArrays
complete_df["sentence"] = tp.SpanArray(complete_df["span"].array.target_text, complete_df["sentence"].array.begin, complete_df["sentence"].array.end)

#finally write to file using feather 
path = BASE_DIR +FEATHER_FILE
# increase the chunksize slightly, to allow writing in a single block
# time to show how fast Feather actually is 
%time
complete_df.to_feather(path,chunksize= 65536*8)
print(f"File written to {path}")

CPU times: user 1 μs, sys: 0 ns, total: 1 μs
Wall time: 1.67 μs
File written to CoNLL_u_test_inputs/conllu_database.feather


In [7]:
# now we can read this df and continue operating on it as before. Time the read operation 
%time 
re_read_df = pd.read_feather(path)
print(f"size is {re_read_df.shape[0]}")
# show the same subset of the dataframe as above 
re_read_df.tail()[["span","lemma","upostag","head","deprel"]]

CPU times: user 0 ns, sys: 1e+03 ns, total: 1e+03 ns
Wall time: 3.1 μs
size is 25153


Unnamed: 0,span,lemma,upostag,head,deprel
25148,"[251, 254): 'and'",and,CCONJ,25152.0,cc
25149,"[255, 256): 'a'",a,DET,25152.0,det
25150,"[257, 261): 'very'",very,ADV,25151.0,advmod
25151,"[262, 275): 'knowledgeable'",knowledgeable,ADJ,25152.0,amod
25152,"[276, 281): 'staff'",staff,NOUN,25147.0,conj


## Show sentence parse trees using pandas data manipulation, and SpaCy integrations
Because of the integrations built into Text extensions, we can use powerful data visualization tools here we're leveraging spaCy's dependency tree visualization tools, to show the parse tree as specified in the raw conllu file. 

First, we use Pandas groupby to to quickly select the n'th sentence in the dataset, and store it as its own dataframe and display selected columns 

Then we use Spacy to render the parse tree of that specific sentence, as found in the raw data. 

In [8]:
Sentence_num = 110

# use pandas to quickly select the 'n'th sentence in the dataset 
nth_sentence = list(re_read_df.groupby("sentence_id",sort=False))[Sentence_num][1]
display(nth_sentence[["span","lemma","upostag","xpostag","head","deprel","sentence"]])

# now use spacy integration to rendeer the parse tree
tp.io.spacy.render_parse_tree(nth_sentence,tag_col="upostag",label_col="deprel",head_col="head")

Unnamed: 0,span,lemma,upostag,xpostag,head,deprel,sentence
2511,"[979, 982): 'And'",and,CCONJ,CC,2515.0,cc,"[979, 1014): 'And what do we get for this effo..."
2512,"[983, 987): 'what'",what,PRON,WP,2515.0,obj,"[979, 1014): 'And what do we get for this effo..."
2513,"[988, 990): 'do'",do,AUX,VBP,2515.0,aux,"[979, 1014): 'And what do we get for this effo..."
2514,"[991, 993): 'we'",we,PRON,PRP,2515.0,nsubj,"[979, 1014): 'And what do we get for this effo..."
2515,"[994, 997): 'get'",get,VERB,VB,,root,"[979, 1014): 'And what do we get for this effo..."
2516,"[998, 1001): 'for'",for,ADP,IN,2518.0,case,"[979, 1014): 'And what do we get for this effo..."
2517,"[1002, 1006): 'this'",this,DET,DT,2518.0,det,"[979, 1014): 'And what do we get for this effo..."
2518,"[1007, 1013): 'effort'",effort,NOUN,NN,2515.0,obl,"[979, 1014): 'And what do we get for this effo..."
2519,"[1013, 1014): '?'",?,PUNCT,.,2515.0,punct,"[979, 1014): 'And what do we get for this effo..."


# Train a classifier model

Now use more text extensions integrations, with *transformers* to quickly and easily train a part of speech classifier model using bert embeddings on our data. We loosely follow the same process as is used in the [Model_Training_with_BERT](./Model_Training_with_BERT.ipynb) demo, notebook so check there for a more indepth explanation of each step.

Broadly, what we do is: 
1. Import all the folds of the dataset we're using (Universal dependencies EWT) 
1. Create a Pandas Categorical datatype on over which to classify
1. Retokenize that dataset using Huggingface Transformers to Bert-compatible tokens
1. Correlate the new tokens with their original counterpart's parts of speech
1. Create the Bert embeddings for each sub-token
1. Convert the parts of speech tags to our categoical datatype
1. Initialize and train a sklearn model on the Bert embeddings -> Part of Speech
1. Use that model to perform inference on our dataset



In [9]:
# We're going to need the whole ewt dataset for this: download them, and parse them in 
fold_paths = {"test":  tp.io.conll.maybe_download_dataset_data(BASE_DIR, ewt_base_url + "test.conllu"),
              "dev":   tp.io.conll.maybe_download_dataset_data(BASE_DIR, ewt_base_url + "dev.conllu"),
              "train": tp.io.conll.maybe_download_dataset_data(BASE_DIR, ewt_base_url + "train.conllu")}
fold_docs = {}
for fold,fold_path in fold_paths.items(): 
    fold_docs[fold] = tp.io.conll.conll_u_to_dataframes(fold_path)
    print(f"converted fold: '{fold}' to list of {len(fold_docs[fold])} dataframes")
    #     uncomment to display segments of the extracted folds 
    #     display(fold_docs[fold][0].head()[['span','lemma','upostag','features','sentence']])

converted fold: 'test' to list of 316 dataframes
converted fold: 'dev' to list of 318 dataframes
converted fold: 'train' to list of 540 dataframes


### Initialize elements for preprocessing steps
Instantiate pretrained tokenizer and BERT models from transformers library, and create a pandas categorical datatype for parts of speech

In [10]:
bert_model_name = "dslim/bert-base-NER"
tokenizer = transformers.BertTokenizerFast.from_pretrained(bert_model_name)
bert = transformers.BertModel.from_pretrained(bert_model_name)

# also we will want to create a pandas categorical dtype for what we want to predict- part of speech. 
# use the combined df, because it has all the elements 
upostags_list = list(re_read_df["upostag"].unique())
# upostag_dtype,upostag_list,upostag_dict = tp.io.conll.make_iob_tag_categories(upostags)
upostag_dtype = pd.CategoricalDtype(categories = upostags_list)
upostag_dict = {upostags_list[i]:i for i in range(len(upostags_list)) }

## Preprocess the document

Because steps 3-6 can only be done on a document-by-document basis, we create a method to do them in a batch, then run them  on the whole corpus. Note this process is computationally intensive so it may take a few minutes to run.

In [11]:
# make a method to take care of preprocessing steps: 3-6
def preprocess_document(document, tokenizer,bert):
    # create BERT compatible tokens using our tokenizer
    temp = tp.io.bert.make_bert_tokens(document.loc[0,'span'].target_text, tokenizer)
    # re-correlate our original spans with their bert-compatible equivalents
    spans = tp.TokenSpanArray.align_to_tokens(temp["span"],document["span"])

    # now carry over some features from the old spans to the new ones
    #spans_df = spans.as_frame().drop(columns = ["begin","end"])
    spans_df = spans.as_frame().drop(columns=['begin','end','covered_text'])
    spans_df['postag'] = document['upostag']
    # printed = 20
    for i, b_tok, e_tok, pos in spans_df.itertuples():
        temp.loc[b_tok:e_tok-1, ["postag","raw_span",'raw_span_id']] = pos,spans[i],i

    # now translate from text tags to postag
    temp['postag'] = temp['postag'].fillna('X') # in our Labels, 'X' is a standin for "N/A" so convert N/A's to 'X'
    temp["postag_id"] = temp['postag'].apply(lambda t: int(upostag_dict[str(t)]))
    temp = temp.astype({'postag_id':'int','postag':upostag_dtype})
    return tp.io.bert.add_embeddings(temp, bert)


# preprocess the whole corpus: 
bert_docs_by_fold = {}
for fold in fold_docs.keys():
    docs = fold_docs[fold]
    print(f"processing fold {fold}")
    bert_docs_by_fold[fold] = tp.jupyter.run_with_progress_bar(len(docs),lambda i: preprocess_document(docs[i],tokenizer,bert))

processing fold test


IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=316, style=ProgressStyle(desc…

Token indices sequence length is longer than the specified maximum sequence length for this model (713 > 512). Running this sequence through the model will result in indexing errors


processing fold dev


IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=318, style=ProgressStyle(desc…

processing fold train


IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=540, style=ProgressStyle(desc…

## Checkpoint: save preprocessed data 

Because the last step was time intensive, combine all the documents together, then save them as a feather file, so that we can restart from here if need be. 

In [12]:
# combine folds and save to a feather file, so we don't necessarily need to redo the preprocessing. 
corpus_df = tp.io.conll.combine_folds(bert_docs_by_fold)
corpus_df["text"] = corpus_df["span"].apply(lambda s: s.covered_text)
cols_to_drop = [c for c in corpus_df.columns if "span" in c]
corpus_df.drop(columns=cols_to_drop).to_feather("outputs/conll_u_corpus.feather")
corpus_df.head()

  pa.PyExtensionType.__init__(self, pa.list_(pyarrow_dtype))


Unnamed: 0,fold,doc_num,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask,postag,raw_span,raw_span_id,postag_id,embedding,text
0,test,0,0,"[0, 0): ''",101,0,1,True,X,,,14,"[ -0.37686658, -0.14841501, 0.7397996, ...",
1,test,0,1,"[0, 4): 'What'",1327,0,1,False,PRON,"[0, 4): 'What'",0.0,11,"[ -0.23267001, -0.4054631, 0.6171939, ...",What
2,test,0,2,"[5, 7): 'if'",1191,0,1,False,SCONJ,"[5, 7): 'if'",1.0,13,"[ -0.8156848, -0.047826126, 0.08148584, ...",if
3,test,0,3,"[8, 14): 'Google'",7986,0,1,False,PROPN,"[8, 14): 'Google'",2.0,2,"[ 0.78967667, -0.8511877, -0.4881261, ...",Google
4,test,0,4,"[15, 17): 'Mo'",12556,0,1,False,VERB,"[15, 22): 'Morphed'",3.0,3,"[ -0.25935128, 0.5710735, -0.091067344, ...",Mo


In [13]:
# re-read feather document if need be: 
if corpus_df is None or corpus_df.size == 0:
    corpus_df = pd.read_feather("outputs/conll_u_corpus.feather")
corpus_df

Unnamed: 0,fold,doc_num,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask,postag,raw_span,raw_span_id,postag_id,embedding,text
0,test,0,0,"[0, 0): ''",101,0,1,True,X,,,14,"[ -0.37686658, -0.14841501, 0.739799...",
1,test,0,1,"[0, 4): 'What'",1327,0,1,False,PRON,"[0, 4): 'What'",0.0,11,"[ -0.23267001, -0.4054631, 0.617193...",What
2,test,0,2,"[5, 7): 'if'",1191,0,1,False,SCONJ,"[5, 7): 'if'",1.0,13,"[ -0.8156848, -0.047826126, 0.0814858...",if
3,test,0,3,"[8, 14): 'Google'",7986,0,1,False,PROPN,"[8, 14): 'Google'",2.0,2,"[ 0.78967667, -0.8511877, -0.488126...",Google
4,test,0,4,"[15, 17): 'Mo'",12556,0,1,False,VERB,"[15, 22): 'Morphed'",3.0,3,"[ -0.25935128, 0.5710735, -0.09106734...",Mo
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307909,train,539,756,"[3152, 3154): 'my'",1139,0,1,False,PRON,"[3152, 3154): 'my'",690.0,11,"[ -0.069846205, -0.46460724, 0.8547706...",my
307910,train,539,757,"[3155, 3158): 'car'",1610,0,1,False,NOUN,"[3155, 3158): 'car'",691.0,4,"[ 0.14624085, -0.46386108, 0.5966832...",car
307911,train,539,758,"[3158, 3159): ')'",114,0,1,False,PUNCT,"[3158, 3159): ')'",692.0,5,"[ -0.090651415, -0.2959277, 0.5970228...",)
307912,train,539,759,"[3159, 3160): '.'",119,0,1,False,PUNCT,"[3159, 3160): '.'",693.0,5,"[ 0.03102396, -0.27608696, 0.782190...",.


In [14]:
# now get ready to train our model: 
train_df = corpus_df[corpus_df["fold"] == "train"]
train_df = train_df.astype({'postag_id':'int'}, copy=False)
train_df

Unnamed: 0,fold,doc_num,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask,postag,raw_span,raw_span_id,postag_id,embedding,text
64732,train,0,0,"[0, 0): ''",101,0,1,True,X,,,14,"[ -0.41927955, -0.22575185, 0.664876...",
64733,train,0,1,"[0, 2): 'Al'",2586,0,1,False,PROPN,"[0, 2): 'Al'",0.0,2,"[ -0.36961353, -1.0804737, -0.28336...",Al
64734,train,0,2,"[2, 3): '-'",118,0,1,False,PUNCT,"[2, 3): '-'",1.0,5,"[ -0.9178746, -0.9462433, -0.808997...",-
64735,train,0,3,"[4, 5): 'Z'",163,0,1,False,PROPN,"[4, 9): 'Zaman'",2.0,2,"[ -0.90530235, -0.9708696, -1.440879...",Z
64736,train,0,4,"[5, 9): 'aman'",19853,0,1,False,PROPN,"[4, 9): 'Zaman'",2.0,2,"[ -1.1586107, -1.1497655, -1.19497...",aman
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307909,train,539,756,"[3152, 3154): 'my'",1139,0,1,False,PRON,"[3152, 3154): 'my'",690.0,11,"[ -0.069846205, -0.46460724, 0.8547706...",my
307910,train,539,757,"[3155, 3158): 'car'",1610,0,1,False,NOUN,"[3155, 3158): 'car'",691.0,4,"[ 0.14624085, -0.46386108, 0.5966832...",car
307911,train,539,758,"[3158, 3159): ')'",114,0,1,False,PUNCT,"[3158, 3159): ')'",692.0,5,"[ -0.090651415, -0.2959277, 0.5970228...",)
307912,train,539,759,"[3159, 3160): '.'",119,0,1,False,PUNCT,"[3159, 3160): '.'",693.0,5,"[ 0.03102396, -0.27608696, 0.782190...",.


## Train the model
Use a sklearn pipeline to train a multinomial regression model ontop of Bert embeddings to predict Part of speech


In [15]:
# now actually train a model, using sklearn
#MULTI_CLASS = "multinomial"

# How many iterations to run the BGFS optimizer when fitting logistic
# regression models. 100 ==> Fast; 10000 ==> Full convergence
LBGFS_ITERATIONS = 1000
REGULARIZATION_COEFF = 1e-1

base_pipeline = sklearn.pipeline.Pipeline([
    # Standard scaler. This only makes a difference for certain classes
    # of embeddings.
    #("scaler", sklearn.preprocessing.StandardScaler()),
    ("mlogreg", sklearn.linear_model.LogisticRegression(
        #multi_class=MULTI_CLASS,
        verbose=1,
        max_iter=LBGFS_ITERATIONS,
        C=REGULARIZATION_COEFF
    ))
])

X_train = train_df["embedding"].values
Y_train = train_df["postag_id"]
base_model = base_pipeline.fit(X_train, Y_train)
base_model

In [16]:
# pickle model so I don't need to re-fit it every time
import pickle 

load_from_file= False
pickle_model_file = "conllu_pos_classifier.pickle"

if not load_from_file:
    with open(BASE_DIR+ pickle_model_file, 'wb') as file: 
        pickle.dump(base_model,file)
    print("saved")
else: 
    with open(BASE_DIR+ pickle_model_file, 'rb') as file: 
        base_model = pickle.load(file)
    print("loaded")


saved


### Use the model to run inference on the test set of the data 

In [17]:
def infer_on_df(df: pd.DataFrame, id_to_class_dict, predictor):
    result_df = df.copy()
    inputs = result_df["embedding"].to_numpy()
    raw_outputs = tp.TensorArray(predictor.predict_proba(inputs))
    result_df["p_id"] = np.argmax(raw_outputs, axis=1)
    result_df["p_postag"]= result_df["p_id"].apply(lambda p_id: id_to_class_dict[p_id])
    result_df["raw_output"] = raw_outputs
    return result_df

test_results = infer_on_df(corpus_df[corpus_df["fold"] == "test"], upostags_list, base_model)
test_results.head(20)

Unnamed: 0,fold,doc_num,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask,postag,raw_span,raw_span_id,postag_id,embedding,text,p_id,p_postag,raw_output
0,test,0,0,"[0, 0): ''",101,0,1,True,X,,,14,"[ -0.37686658, -0.14841501, 0.739799...",,14,X,"[3.8628878957358064e-09, 1.5814630122443712e-0..."
1,test,0,1,"[0, 4): 'What'",1327,0,1,False,PRON,"[0, 4): 'What'",0.0,11,"[ -0.23267001, -0.4054631, 0.617193...",What,5,PUNCT,"[ 0.000151603222407165, 0.00262331201456116..."
2,test,0,2,"[5, 7): 'if'",1191,0,1,False,SCONJ,"[5, 7): 'if'",1.0,13,"[ -0.8156848, -0.047826126, 0.0814858...",if,13,SCONJ,"[ 0.004580406302863653, 1.5823046733312133e-0..."
3,test,0,3,"[8, 14): 'Google'",7986,0,1,False,PROPN,"[8, 14): 'Google'",2.0,2,"[ 0.78967667, -0.8511877, -0.488126...",Google,2,PROPN,"[ 6.787568556052076e-10, 1.8462375223634097e-0..."
4,test,0,4,"[15, 17): 'Mo'",12556,0,1,False,VERB,"[15, 22): 'Morphed'",3.0,3,"[ -0.25935128, 0.5710735, -0.09106734...",Mo,4,NOUN,"[ 0.02169692151317765, 2.687465260450824e-0..."
5,test,0,5,"[17, 19): 'rp'",15615,0,1,False,VERB,"[15, 22): 'Morphed'",3.0,3,"[ -0.3267123, -0.10905984, 0.0530879...",rp,4,NOUN,"[ 7.801477438016654e-11, 3.233460822608078e-0..."
6,test,0,6,"[19, 22): 'hed'",8961,0,1,False,VERB,"[15, 22): 'Morphed'",3.0,3,"[ -0.9018075, -0.16881414, 0.4379887...",hed,3,VERB,"[ 0.0001596840954975109, 3.373975531340552e-0..."
7,test,0,7,"[23, 27): 'Into'",14000,0,1,False,ADP,"[23, 27): 'Into'",4.0,0,"[ 0.0956599, -0.10993048, -0.1493198...",Into,0,ADP,"[ 0.9232216075809515, 6.113139647861225e-1..."
8,test,0,8,"[28, 34): 'Google'",7986,0,1,False,PROPN,"[28, 36): 'GoogleOS'",5.0,2,"[ -1.2023001, -0.29254347, 0.2236390...",Google,2,PROPN,"[1.7792656145443183e-15, 4.8210263680143775e-1..."
9,test,0,9,"[34, 36): 'OS'",9025,0,1,False,PROPN,"[28, 36): 'GoogleOS'",5.0,2,"[ -0.78179884, -0.20742226, -1.288185...",OS,2,PROPN,"[ 2.03803406860384e-08, 4.815731588956786e-1..."


## Now look at the data we've made and aggregate it to calculate F1 scores 

First, aggregate by raw surface token, to get the 'real' token predictions from the bert-ified values To do this we multiply the probabilities for each subtoken

Then compare with existing labels 

In [18]:
def agg_outputs(series: pd.Series):
    return series.to_numpy().prod(axis=0).argmax()
     
test_raw_preds = test_results.groupby("raw_span").agg({"fold":'first', "doc_num": 'first','postag':'first','postag_id':'first','raw_output': agg_outputs}).rename(columns= {'raw_output':'p_postag_id'}).sort_values(["fold","doc_num",'raw_span'])
test_raw_preds['predicted_postag'] = test_raw_preds["p_postag_id"].apply(lambda p_id: upostags_list[p_id]) 
test_raw_preds.head(200)

Unnamed: 0_level_0,fold,doc_num,postag,postag_id,p_postag_id,predicted_postag
raw_span,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"[0, 4): 'What'",test,0,PRON,11,5,PUNCT
"[5, 7): 'if'",test,0,SCONJ,13,13,SCONJ
"[8, 14): 'Google'",test,0,PROPN,2,2,PROPN
"[15, 22): 'Morphed'",test,0,VERB,3,4,NOUN
"[23, 27): 'Into'",test,0,ADP,0,0,ADP
...,...,...,...,...,...,...
"[307, 309): 'of'",test,2,ADP,0,0,ADP
"[310, 313): 'the'",test,2,DET,1,1,DET
"[314, 319): 'pic's'",test,2,NOUN,4,4,NOUN
"[319, 320): '.'",test,2,PUNCT,5,5,PUNCT


In [19]:
from sklearn import metrics
# calculate precision, recall, and f1 score for each pos
print(metrics.classification_report(test_raw_preds['postag'], test_raw_preds['predicted_postag'], digits=3))

              precision    recall  f1-score   support

         ADJ      0.798     0.769     0.783      1794
         ADP      0.912     0.919     0.916      2030
         ADV      0.784     0.752     0.768      1183
         AUX      0.956     0.963     0.959      1543
       CCONJ      0.975     0.967     0.971       736
         DET      0.957     0.957     0.957      1896
        INTJ      0.857     0.694     0.767       121
        NOUN      0.866     0.890     0.878      4123
         NUM      0.817     0.897     0.855       542
        PART      0.949     0.937     0.943       649
        PRON      0.962     0.965     0.964      2166
       PROPN      0.825     0.856     0.840      2076
       PUNCT      0.984     0.965     0.974      3096
       SCONJ      0.870     0.786     0.826       384
         SYM      0.671     0.523     0.588       109
        VERB      0.909     0.906     0.908      2606
           X      0.220     0.214     0.217        42

    accuracy              