# Create Tokenization Dataset (from Metadata)
Copyright (C) 2021 ServiceNow, Inc.

This notebook uses the GEOSCAN metadata (titles and English descriptions) to produce a clean, sentence-tokenized text dataset for training BERT tokenization models.

## Load the metadata file

In [1]:
import pandas as pd
output_large = '/nrcan_p2/data/01_raw/20201006/geoscan/GEOSCAN-extract-20200211144755.xml_processed_Feb29.parquet'
df_s_large = pd.read_parquet(output_large)

output_small = '/nrcan_p2/data/01_raw/20201006/geoscan/EAIDown.xml_processed_Feb29.parquet'
df_s = pd.read_parquet(output_small)

In [3]:
df_s_large.columns

Index(['{http://purl.org/dc/elements/1.1/}contributor',
       '{http://purl.org/dc/elements/1.1/}title_en',
       '{http://purl.org/dc/elements/1.1/}creator',
       '{http://purl.org/dc/elements/1.1/}subject_en',
       '{http://purl.org/dc/elements/1.1/}subject_fr',
       '{http://purl.org/dc/elements/1.1/}source_en',
       '{http://purl.org/dc/elements/1.1/}source_fr',
       '{http://purl.org/dc/elements/1.1/}description_en',
       '{http://purl.org/dc/elements/1.1/}description_fr',
       '{http://purl.org/dc/elements/1.1/}date',
       '{http://purl.org/dc/elements/1.1/}type_en',
       '{http://purl.org/dc/elements/1.1/}format',
       '{http://purl.org/dc/elements/1.1/}identifier_geoscanid',
       '{http://purl.org/dc/elements/1.1/}identifier_en',
       '{http://purl.org/dc/elements/1.1/}identifier_fr',
       '{http://purl.org/dc/elements/1.1/}language',
       '{http://purl.org/dc/elements/1.1/}coverage_en',
       '{http://purl.org/dc/elements/1.1/}coverage_fr',
     

## Filter to title and description columns

In [4]:
DESC_COL = 'desc_en_en'
TITLE_COL = 'title_merged'

In [6]:
df_s_large[[DESC_COL, TITLE_COL]]

Unnamed: 0,desc_en_en,title_merged
0,,"Voggite, a new hydrated Na-Zr hydroxide-phosph..."
1,Airborne electromagnetic (EM) methods were dev...,The inversion of time-domain airborne electrom...
2,Cornwall and Princess Margaret arches are majo...,"Lithosphere folds in the Eurekan orogen, Arcti..."
3,,
4,,Archaean Geology; Dating Old Gold Deposits
...,...,...
92658,Clumped isotope (,Clumped isotope temperature calibration for ca...
92659,Climatic reconstructions based on tree-ring is...,An Overview on Isotopic Divergences - Causes f...
92660,,Catalogue of Mines Branch Publications
92661,,"Catalogue of Mines Branch Publications, with a..."
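Many rows have no description, and some have neither a title nor a description. A minimal sketch of how the gaps could be counted before tokenizing; the toy values are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

DESC_COL = 'desc_en_en'
TITLE_COL = 'title_merged'

# Toy frame (illustrative values only) mimicking the two columns above
toy = pd.DataFrame({
    DESC_COL: [None, "A toy description.", None],
    TITLE_COL: ["A toy title", "Another toy title", None],
})

# Missing values per column
missing = toy[[DESC_COL, TITLE_COL]].isna().sum()

# Rows where both the title and the description are missing
both_missing = int(toy[[DESC_COL, TITLE_COL]].isna().all(axis=1).sum())

print(missing)
print(both_missing)
```

Applied to `df_s_large`, the same two expressions would show how much text is actually available for each column.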


## Perform sentence tokenization

In [7]:
import sys
sys.path.append('..')

In [8]:
from nrcan_p2.data_processing.preprocessing_str import sentence_tokenize_spacy_lg

In [10]:
import tqdm
tqdm.tqdm.pandas()

In [12]:
df_s_large['title_merged_split'] = df_s_large.title_merged.progress_apply(lambda x: None if x is None else sentence_tokenize_spacy_lg(x))

100%|██████████| 92663/92663 [08:33<00:00, 180.29it/s]


In [13]:
df_s_large['desc_en_en_split'] = df_s_large.desc_en_en.progress_apply(lambda x: None if x is None else sentence_tokenize_spacy_lg(x))

100%|██████████| 92663/92663 [10:17<00:00, 150.07it/s] 
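The `x is None` guard in the two cells above only skips `None`. If missing values were stored as `NaN` instead, they would be passed on to the tokenizer. A hedged sketch of a guard that covers both cases, using `pd.isna`; `fake_sentence_tokenize` and `tokenize_or_none` are hypothetical names, and the stand-in tokenizer is not `sentence_tokenize_spacy_lg`:

```python
import pandas as pd


def fake_sentence_tokenize(text: str) -> str:
    """Hypothetical stand-in tokenizer: one '. '-separated sentence per line."""
    return text.replace(". ", ".\n") + "\n"


def tokenize_or_none(text, tokenizer=fake_sentence_tokenize):
    """Return None for any missing value (None or NaN), else the tokenized text."""
    if pd.isna(text):
        return None
    return tokenizer(text)


# Toy column containing both kinds of missing value
toy = pd.Series(["One. Two.", None, float("nan")], dtype=object)
result = toy.apply(tokenize_or_none)
print(result.tolist())
```

In the notebook, the same wrapper could be passed to `progress_apply` with the real tokenizer as the `tokenizer` argument.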


In [15]:
dff = pd.concat([df_s_large['title_merged_split'], df_s_large['desc_en_en_split']])
display(dff)
print(df_s_large.shape)
print(dff.shape)
dff = dff.dropna()

0        Voggite, a new hydrated Na-Zr hydroxide-phosph...
1        The inversion of time-domain airborne electrom...
2        Lithosphere folds in the Eurekan orogen, Arcti...
3                                                     None
4             Archaean Geology; Dating Old Gold Deposits\n
                               ...                        
92658                                  Clumped isotope (\n
92659    Climatic reconstructions based on tree-ring is...
92660                                                 None
92661                                                 None
92662    This study examined the relationship between t...
Length: 185326, dtype: object

(92663, 45)
(185326,)


In [18]:
dff_2 = dff.apply(lambda x: x.split('\n'))
dff_2 = dff_2.explode()

In [26]:
dff_2 = dff_2.dropna()
dff_2 = dff_2[dff_2.str.strip() != ""]
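Judging by the output above, the tokenizer returns sentences separated by `\n`, so splitting on `\n` and exploding yields one sentence per row. A small sketch of the same steps on toy strings (not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for tokenizer output: sentences joined by "\n"
toy = pd.Series(
    ["First sentence.\nSecond sentence.\n", "Only one sentence.\n", "  \n"],
    index=[10, 11, 12],
)

# Split each entry into a list of sentences, then give each sentence its own row
sentences = toy.apply(lambda x: x.split("\n")).explode()

# Drop missing values and whitespace-only strings (e.g. from a trailing "\n")
sentences = sentences.dropna()
sentences = sentences[sentences.str.strip() != ""]

print(sentences.tolist())
print(sentences.index.tolist())
```

Note that `explode` repeats the original index, so every sentence keeps the row label of the record it came from.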

In [27]:
with pd.option_context('display.max_colwidth', None):
    display(dff_2.sample(20).to_frame())

Unnamed: 0,0
88327,"carboxydotrophs including Thermincola, Desulfotomaculum, Thermolithobacter, and Carboxydocella, although a few species with lower similarity to known bacteria were also found that may represent previously unconfirmed CO-oxidizers."
45050,Contextual Analysis of Sea Ice Types From Remotely Sensed Imagery
34288,"northwestward from 70 m to 142 m, reflecting Laurentide loading."
23124,"Yakoun Lake, British Columbia"
48536,Evolution of the early Paleozoic Cordilleran margin of Laurentia: tectonic and eustatic events interpreted from sequence stratigraphy and conodont community patterns
80820,The heavy signature in the
85364,show that this is a rare event during the Quaternary; it is the largest MTD observed in the upper c. 375 m of the levee succession and among the largest and deepest in the western North Atlantic.
3885,The zone is
3794,Basin are also illustrated.
78352,Two-dimensional InSAR provides valuable information about slope processes and the nature of terrain movement.


## Save the sentence-tokenized dataset

In [29]:
output_file = '/nrcan_p2/data/03_primary/metadata/EAIDown.xml_processed_sentences.txt'

# Write one sentence per line
with open(output_file, 'w') as f:
    for value in dff_2.values:
        f.write(value + "\n")
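A quick round-trip check can confirm that the file holds one entry per line. The sketch below writes toy strings to a temporary file instead of the real output path; the explicit `encoding="utf-8"` is an assumption, since the original cell relies on the platform default:

```python
import os
import tempfile

import pandas as pd

# Toy sentences (illustrative only), including non-ASCII characters
toy = pd.Series(["Sentence one.", "Sentence two.", "Grain size in µm."])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentences.txt")

    # Write one sentence per line
    with open(path, "w", encoding="utf-8") as f:
        for value in toy.values:
            f.write(value + "\n")

    # Read the file back and split it into lines
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(len(lines) == len(toy))
```

If an entry still contained an embedded newline, the line count would exceed the number of entries, which makes this a cheap sanity check.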


## Build the dataset without sentence tokenization

In [37]:
dff = pd.concat([df_s_large['title_merged'], df_s_large['desc_en_en']])
display(dff)
print(df_s_large.shape)
print(dff.shape)
dff = dff.dropna()

0        Voggite, a new hydrated Na-Zr hydroxide-phosph...
1        The inversion of time-domain airborne electrom...
2        Lithosphere folds in the Eurekan orogen, Arcti...
3                                                     None
4               Archaean Geology; Dating Old Gold Deposits
                               ...                        
92658                                    Clumped isotope (
92659    Climatic reconstructions based on tree-ring is...
92660                                                 None
92661                                                 None
92662    This study examined the relationship between t...
Length: 185326, dtype: object

(92663, 45)
(185326,)


## Clean up newline hyphenation and remove null rows

In [38]:
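# Rejoin lowercase words hyphenated across a line break
dff_2 = dff.str.replace(r'([a-z])(-\s*\n\s*)([a-z])', r'\1\3', regex=True)
# Replace remaining newlines with spaces, keeping the de-hyphenated text
dff_2 = dff_2.str.replace(r'\n', ' ', regex=True)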
dff_2 = dff_2.explode()

In [39]:
dff_2 = dff_2.dropna()
dff_2 = dff_2[dff_2.str.strip() != ""]
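A toy illustration of the intended two-step cleanup, with the second replacement chained onto the result of the first. The input string is made up for the example:

```python
import pandas as pd

# Toy text with a hyphenated line break, a plain line break,
# and a numeric range split across lines
toy = pd.Series(["The geo-\nphysical survey\ncovered 1996-\n2010."])

# Step 1: rejoin lowercase words hyphenated across a line break
step1 = toy.str.replace(r'([a-z])(-\s*\n\s*)([a-z])', r'\1\3', regex=True)

# Step 2: replace the remaining newlines with spaces
step2 = step1.str.replace(r'\n', ' ', regex=True)

print(step2.iloc[0])
```

Because the pattern only matches lowercase letters on both sides of the hyphen, the numeric range `1996-\n2010` is not rejoined and only has its newline replaced by a space. Words that continue with a capital letter after the break are left alone in the same way.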

In [40]:
with pd.option_context('display.max_colwidth', None):
    display(dff_2.sample(20).to_frame())

Unnamed: 0,0
9629,"Waterton, west of Fourth Meridian, Alberta"
79624,"Heat pumps, as a means of achieving significant energy reductions, have attracted a great deal of attention for decades. However, the main challenge remains improving their performance in cold climates. This paper represents the first step of a larger research project for the implementation of the zeotropic refrigerant mixtures in order to increase the performance of residential air-source heat pumps in cold climates. A detailed screening heat pump model is developed and used to assess the performance of zeotropic refrigerant mixtures. A group of pure refrigerants are selected and their potential mixtures are studied. The performance of these mixtures is compared in order to find suitable zeotropic refrigerant mixtures for cold climate residential applications. The main goal of this paper is to illustrate the possibility of applying environmentally friendly zeotropic refrigerant mixtures in conventional heat pumps, with minimal changes in the components, in order to improve their performance."
8545,"Tazin Lake Sheet, Northern Saskatchewan"
73303,"Absolute gravity and GRACE satellite data have been combined with GPS data to identify a large-scale water storage anomaly on the Canadian prairies. Monthly GRACE data for the period 2002-2011 were used to produce a gravity rate map of the northern mid-continent. This map was corrected for glacial isostatic adjustment (GIA) using a GPS-based, vertical velocity map derived from 27 continuous and over 50 campaign sites, combined over the period 1996-2010. The vertical velocity map used to correct for GIA was first converted into a virtual gravity rate map using a linear relationship between surface gravity rate and vertical velocity (-0.16 microGal/mm), empirically derived from combined annual absolute gravity and continuous GPS observations at 7 sites outside the anomalous area. The corrected GRACE gravity rate map reveals a major mass rate anomaly with a water equivalent thickness rate of around 3 cm/yr and approximate dimensions of 600 km (N-S) and 800 km (E-W) centered on the Manitoba-Saskatchewan border. The amplitude and spatial extent of the anomaly are estimated by data inversion, taking into account the effect of elastic loading on the GPS-based GIA correction. The source of the anomaly is confirmed by records from deep observation wells in Saskatchewan to be an increase in total water content from 2002 to 2011, amounting to an overall water equivalent accumulation of around 27 cm over a wide area."
81644,"Air Carrier Routes, 2006 - Air Canada"
71566,"A lithostratigraphic transect through the Cambro - Ordovician Franklin Mountain Formation in NTS 96D (Carcajou Canyon) and 96E (Norman Wells), Northwest Territories"
85952,"Newmarket Till is a stony, sandy (38%) silty (~47%) diamicton, which is of variable thickness (~1 - 69 m) and of widespread distribution in Southern Ontario. The Newmarket Till has unusually high densities (2.2 - 2.4 g/cm3); elevated seismic velocities (Vp ~2600 m/s) determined by downhole geophysical studies are characteristic and the Till can be traced across the region as a seismostratigraphic marker. As the Till is highly indurated and has low permeability, it forms a regional aquitard that confines underlying aquifers, and is also a basal aquitard for overlying aquifers (e.g. Oak Ridges Moraine). Given the high sand content of this diamicton, the low permeability and indurated nature is surprising, and could be resultant from over-consolidation due to glacial loading, presence of a secondary cement, or both processes. Recent observations from drill core and surficial sampling transects illustrate that Newmarket Till is not always cemented, but the observation of residual cement on pebbles indicates it was potentially formerly cemented. Our new studies indicate that the matrix of the Dummer moraine (adjacent to and south of the Shield - Paleozoic boundary and to the north of the Newmarket Till) is mineralogically and geochemically equivalent to Newmarket Till, and we thus suggest the Dummer Moraine is a very stone- to boulder-rich equivalent of the Newmarket Till. The matrix mineral assemblage of the Till (in decreasing abundance) is quartz, calcite, K-feldspar, plagioclase, dolomite, amphibole and clinopyroxene; these grains are comminuted and range in size from ~2000 ?m to ~2 ?m, leading to optimum packing, and potentially over-consolidation. The intra-grain matrix is exceptionally fine (<1 ?m, typically 0.25 - 0.50 ?m) and not resolvable by optical methods. Higher resolution SEM and FE-SEM backscattered electron and secondary electron images of the intra-grain matrix reveals a complex pore filling cement. 
The minerals comprising the secondary cement are a challenge to analyze due to their very fine grain size and composition. Semi-quantitative EDS analyses indicate a calcite (CaCO3) cement with minor phyllosilicates, as confirmed by XRD on the clay-silt and clay fractions. The calcite cements the silt- to sand-sized mineral grains and larger clasts, and result in the Newmarket Till being highly indurated and of low permeability. The timing and process of the initial cementation event is currently being evaluated; we also note that in the vadose zone the Till becomes uncemented (i.e. the original calcite cement dissolves out)."
32273,"Geological Investigations of Proposed Pipeline Channel Crossings in the Vicinity of Taglu and Niglintgak Islands, Mackenzie Delta, Northwest Territories"
85358,"The collection of multibeam echosounder and marine seismic-reflection profiles within the San Juan Islands and Gulf Islands Archipelagos exhibit well defined fault zones and zones of deformation that cross the Cascadia forearc. Recent interpretation of these data provides a new and modified geometry of recent and past faulting within the Archipelagos. Previous structural mapping in the region was island and land based devoid of any marine data, but these new marine data and interpretations facilitate a better structural view. We present a comprehensive map of the geology that includes the Devil's Mountain and Skipjack Island fault zones, the Lopez Fracture zone, and intermediate faults that appear to facilitate rotation of the San Juan islands. In addition, the associated coastal and marine hazards of the region are presented with emphasis on local active faulting, mass wasting, and tsunami generation."
68223,"Integrating ice-flow history, geochronology, geology, and geophysics to trace mineralized glacial erratics to their bedrock source: an example from south-central British Columbia"


## Save the final dataset

In [41]:
output_file = '/nrcan_p2/data/03_primary/metadata/EAIDown.xml_processed_nosentences_Feb29.txt'

# Write one record (title or description) per line
with open(output_file, 'w') as f:
    for value in dff_2.values:
        f.write(value + "\n")