# Compare tokenizers
Aim: compare how different tokenizers handle biomedical abstracts

created by Sonja Aits, Lund University

Further reading:
    
https://lhncbc.nlm.nih.gov/publication/lhncbc-tr-2006-003

https://www.aclweb.org/anthology/W15-2605/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4749772/
    
https://en.wikipedia.org/wiki/Gene_nomenclature

https://en.wikipedia.org/wiki/Chemical_nomenclature

https://www.genenames.org/about/guidelines/


In [1]:
from nltk import FreqDist

Cell death journal abstracts were downloaded from the following PubMed search (using the send function and the abstract option):

(Cell Death Differ[journal]) OR (Cell Death Discov[journal]) OR (Cell Death Dis[journal]) OR Apoptosis[journal]

The results were saved in the file celldeathabstracts_20191230.txt.

The first abstract was saved separately as abstract1.txt.

In [2]:
#read the file inside a with block so that it is closed again after reading
with open('abstract1.txt', encoding="utf8") as f:
    abstract = f.read()
abstract = abstract.lower()
print(abstract)

1. cell death differ. 2019 dec 20. doi: 10.1038/s41418-019-0483-6. [epub ahead of
print]

hnrnp f/h associate with hterc and telomerase holoenzyme to modulate telomerase
function and promote cell proliferation.

xu c(1), xie n(2), su y(1), sun z(1), liang y(1), zhang n(1), liu d(1), jia s(3),
xing x(3), han l(1), li g(1), tong t(1), chen j(4).

author information: 
(1)peking university research center on aging, beijing key laboratory of protein 
posttranslational modifications and cell function, department of biochemistry and
molecular biology, department of integration of chinese and western medicine,
school of basic medical science, peking university, beijing, 100191, china.
(2)department of physiology and pathophysiology, school of basic medical science,
peking university, beijing, 100191, china.
(3)department of molecular diagnostics, key laboratory of carcinogenesis and
translational research (ministry of education), peking university cancer hospital
& institute, beijing, 100142, 

In [3]:
#read the file inside a with block so that it is closed again after reading
with open('celldeathabstracts_20191230.txt', encoding="utf8") as f:
    cd_abstracts = f.read()
cd_abstracts = cd_abstracts.lower()

In [4]:
#tricky examples of biomedical text inspired by articles above and compiled by me (stored at D:/Lab/Data/tokenization.txt)

examples = """Compound words
Hydrogen peroxide causes cell death
Cell death is caused by hydrogen peroxide
UV radiation kills cancer cells

Hyphenated compound words
co-localization
Co-localization
wild-type 
Wild-type 
TLR-4 
X-ray

Slashes
downregulation/mutation 
Downregulation/mutation 
Downregulation/Mutation 
P53/73 
p53/73 
Omi/HtrA2 
HER2/Neu
mg/ml

&
Material&Methods
material&methods

‘ ' "
Parkinson’s
can’t 
wouldn’t 
haven’t 
hadn’t 
shouldn’t 
cells’ circumference 
‘localization’ 
Parkinson's 
can't 
wouldn't 
haven't 
hadn't 
shouldn't 
cells' 
'localization' 
"localization"

Non-alphanumerical symbols
TNF-α 

Brackets
(cells)
[cells]
{cells}
<cells>
(Cells)
[Cells]
{Cells}
<Cells>
(GAPDH)
[GAPDH]
{GAPDH}
<GAPDH>
(cells)
(cells) 
A)
2)

Combined letters and punctuation
A.
e.g.
e.g.,
i.e.
p.o.
The cell is dead. Therefore, we 

Combined letters and numbers
O2
30th
LAMP2
log2
2nd 

Numbers with blanks
4 000 453

Combined numbers and punctuation
3,000,000
1/2
76%
1.4
1-20
1)
1.

Mathematical operators
5*4
5x4
5⋅4
56/8
5+70=75
5 + 70 = 75
5-4
-40
6.65×10^−34
6.65 × 10^−34
20+/-5
5.6-7.5%
7^3
5˃3
5˃=x

Units
20 mg/ml
20 mg/mL
20 µl/ml
20 µg/mol
20 m/s
20 Gy
20Gy
20 °C
20°C
20mmol
20 mg/mg/h

Nucleotide sequences
5’-TTAC-3’
GGGCAAATT
GGGCAAAUU 

Gene, RNA, Protein names
miR-643
LAMP2
Gal1-3
leuAum
leuAcs
leuA+
leuA−
ΔleuA
leuA-lacZ
leuA:lacZ
leuA::Tn10
ΩleuA
ΔleuA::nptII(KanR)
mTOR

Chemicals
HC9H7O4
2-acetyloxybenzoic acid
InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
BSYNRYMUTXBXSQ-UHFFFAOYSA-N
CC(=O)OC1=CC=CC=C1C(=O)O
Ca2+

Identifiers
50-78-2
GAL3_HUMAN
P2446246
NM_235235

Complex combinations
LAMP1/2
LAMP1-2
(LAMP1-2)
(see below (Fig.1))
A-B)


Hypertext 
&lt
&quot

Links
http://www.lu.se
https://ai-lu.com/gst.txt 
ftp://as.li.de
https://doi.org/10.1109/5.771073

Emails
sdfsag@gmail.com
asg.smg@lu.se 

Times and dates
7d
7 d
00:30 min
30min
5h
5 hours
5 weeks
5 s
5 sec
5 seconds
2019-09-01
1 Sept 2019
1st of September 2019
Sept 1, 2019
Sept 1 2019
1.9.2019
01.09.2019

Abbreviations
CytoC
hGal3
h-Gal13
Cyto-c
MERS
CD4+
CD4+
C. elegans growth was inhibited
"""

#open question: how are formatting differences (superscript, subscript, italics, bold, font size) preserved?

# nltk RegexpTokenizer

In [5]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
#An r or R prefix creates a raw string. Raw strings do not process escape sequences (\n, \b, etc.) and are commonly used for regex patterns, which often contain many \ characters
#\w+ matches one or more word characters: a-z, A-Z, 0-9, _ and, for str patterns in Python 3, other Unicode letters and digits (e.g. α)
tokens_abstract = tokenizer.tokenize(abstract)
tokens_examples = tokenizer.tokenize(examples)
tokens_cd_abstracts = tokenizer.tokenize(cd_abstracts)
counts_cd_abstracts = FreqDist(tokens_cd_abstracts)


print(len(tokens_abstract))
print(len(tokens_examples))
print(len(tokens_cd_abstracts))
print()
print(tokens_abstract[:40])
print()
print(tokens_examples)
print()
print(counts_cd_abstracts.most_common(40))

412
407
3921337

['1', 'cell', 'death', 'differ', '2019', 'dec', '20', 'doi', '10', '1038', 's41418', '019', '0483', '6', 'epub', 'ahead', 'of', 'print', 'hnrnp', 'f', 'h', 'associate', 'with', 'hterc', 'and', 'telomerase', 'holoenzyme', 'to', 'modulate', 'telomerase', 'function', 'and', 'promote', 'cell', 'proliferation', 'xu', 'c', '1', 'xie', 'n']

['Compound', 'words', 'Hydrogen', 'peroxide', 'causes', 'cell', 'death', 'Cell', 'death', 'is', 'caused', 'by', 'hydrogen', 'peroxide', 'UV', 'radiation', 'kills', 'cancer', 'cells', 'Hyphenated', 'compound', 'words', 'co', 'localization', 'Co', 'localization', 'wild', 'type', 'Wild', 'type', 'TLR', '4', 'X', 'ray', 'Slashes', 'downregulation', 'mutation', 'Downregulation', 'mutation', 'Downregulation', 'Mutation', 'P53', '73', 'p53', '73', 'Omi', 'HtrA2', 'HER2', 'Neu', 'mg', 'ml', 'Material', 'Methods', 'material', 'methods', 'Parkinson', 's', 'can', 't', 'wouldn', 't', 'haven', 't', 'hadn', 't', 'shouldn', 't', 'cells', 'circumference'
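
The effect of `\w+` on a few of the tricky examples can be checked with the standard `re` module alone. This is a minimal sketch, assuming that `RegexpTokenizer(r'\w+')` returns the same matches as `re.findall` with the same pattern; the sample strings are taken from the examples above, and the name `WORD_PATTERN` is only illustrative.

```python
import re

# Same pattern as passed to RegexpTokenizer above
WORD_PATTERN = re.compile(r'\w+')

samples = ['TNF-α', 'Ca2+', 'miR-643', 'NM_235235', "can't"]

for text in samples:
    # Every non-word character (hyphen, plus, apostrophe) acts as a separator and is dropped
    print(text, '->', WORD_PATTERN.findall(text))

# For str patterns, \w also matches Unicode letters such as α
unicode_tokens = WORD_PATTERN.findall('TNF-α')

# With re.ASCII, \w is restricted to a-z, A-Z, 0-9 and _, so the Greek letter is lost
ascii_tokens = re.findall(r'\w+', 'TNF-α', flags=re.ASCII)
```

Because the separators are discarded, charge signs (`Ca2+`), hyphens in gene names (`miR-643`) and apostrophes cannot be recovered from the token list afterwards.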

# nltk RegexpTokenizer (gaps mode)

In [6]:
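tokenizer = RegexpTokenizer(r'\s+', gaps=True) #gaps=True: the pattern matches the separators, so the text is split on whitespace (raw string avoids an invalid escape sequence warning)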
tokens_abstract = tokenizer.tokenize(abstract)
tokens_examples = tokenizer.tokenize(examples)
tokens_cd_abstracts = tokenizer.tokenize(cd_abstracts)
counts_cd_abstracts = FreqDist(tokens_cd_abstracts)


print(len(tokens_abstract))
print(len(tokens_examples))
print(len(tokens_cd_abstracts))
print()
print(tokens_abstract[:40])
print()
print(tokens_examples)
print()
print(counts_cd_abstracts.most_common(40))

367
270
3465883

['1.', 'cell', 'death', 'differ.', '2019', 'dec', '20.', 'doi:', '10.1038/s41418-019-0483-6.', '[epub', 'ahead', 'of', 'print]', 'hnrnp', 'f/h', 'associate', 'with', 'hterc', 'and', 'telomerase', 'holoenzyme', 'to', 'modulate', 'telomerase', 'function', 'and', 'promote', 'cell', 'proliferation.', 'xu', 'c(1),', 'xie', 'n(2),', 'su', 'y(1),', 'sun', 'z(1),', 'liang', 'y(1),', 'zhang']

['Compound', 'words', 'Hydrogen', 'peroxide', 'causes', 'cell', 'death', 'Cell', 'death', 'is', 'caused', 'by', 'hydrogen', 'peroxide', 'UV', 'radiation', 'kills', 'cancer', 'cells', 'Hyphenated', 'compound', 'words', 'co-localization', 'Co-localization', 'wild-type', 'Wild-type', 'TLR-4', 'X-ray', 'Slashes', 'downregulation/mutation', 'Downregulation/mutation', 'Downregulation/Mutation', 'P53/73', 'p53/73', 'Omi/HtrA2', 'HER2/Neu', 'mg/ml', '&', 'Material&Methods', 'material&methods', '‘', "'", '"', 'Parkinson’s', 'can’t', 'wouldn’t', 'haven’t', 'hadn’t', 'shouldn’t', 'cells’', 'circumfe
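
Splitting on whitespace keeps identifiers such as the DOI intact but leaves punctuation attached to the neighbouring word. The following sketch illustrates this with `re` only; it assumes that gaps mode with `\s+` behaves like splitting on runs of whitespace and discarding empty strings, and the variable names are illustrative.

```python
import re

# Fragment of the first abstract, including the line break before "print]"
text = "doi: 10.1038/s41418-019-0483-6. [epub ahead of\nprint]"

# Split on runs of whitespace and drop empty strings at the edges
tokens = [t for t in re.split(r'\s+', text) if t]
print(tokens)

# Tokens whose first or last character is not alphanumeric,
# i.e. tokens that still carry leading or trailing punctuation
attached = [t for t in tokens if not (t[0].isalnum() and t[-1].isalnum())]
print(attached)
```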

# nltk word_tokenize

In [9]:
from nltk.tokenize import word_tokenize
tokens_abstract = word_tokenize(abstract)
tokens_examples = word_tokenize(examples)
tokens_cd_abstracts = word_tokenize(cd_abstracts)
counts_cd_abstracts = FreqDist(tokens_cd_abstracts)

print(len(tokens_abstract))
print(len(tokens_examples))
print(len(tokens_cd_abstracts))
print()
print(tokens_abstract[:40])
print()
print(tokens_examples)
print()
print(counts_cd_abstracts.most_common(40))

505
399
4650661

['1.', 'cell', 'death', 'differ', '.', '2019', 'dec', '20.', 'doi', ':', '10.1038/s41418-019-0483-6', '.', '[', 'epub', 'ahead', 'of', 'print', ']', 'hnrnp', 'f/h', 'associate', 'with', 'hterc', 'and', 'telomerase', 'holoenzyme', 'to', 'modulate', 'telomerase', 'function', 'and', 'promote', 'cell', 'proliferation', '.', 'xu', 'c', '(', '1', ')']

['Compound', 'words', 'Hydrogen', 'peroxide', 'causes', 'cell', 'death', 'Cell', 'death', 'is', 'caused', 'by', 'hydrogen', 'peroxide', 'UV', 'radiation', 'kills', 'cancer', 'cells', 'Hyphenated', 'compound', 'words', 'co-localization', 'Co-localization', 'wild-type', 'Wild-type', 'TLR-4', 'X-ray', 'Slashes', 'downregulation/mutation', 'Downregulation/mutation', 'Downregulation/Mutation', 'P53/73', 'p53/73', 'Omi/HtrA2', 'HER2/Neu', 'mg/ml', '&', 'Material', '&', 'Methods', 'material', '&', 'methods', '‘', "'", '``', 'Parkinson', '’', 's', 'can', '’', 't', 'wouldn', '’', 't', 'haven', '’', 't', 'hadn', '’', 't', 'shouldn', '’'

# nltk TreebankWordTokenizer

In [10]:
from nltk.tokenize.treebank import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokens_abstract = tokenizer.tokenize(abstract)
tokens_examples = tokenizer.tokenize(examples)
tokens_cd_abstracts = tokenizer.tokenize(cd_abstracts)
counts_cd_abstracts = FreqDist(tokens_cd_abstracts)

print(len(tokens_abstract))
print(len(tokens_examples))
print(len(tokens_cd_abstracts))
print()
print(tokens_abstract[:40])
print()
print(tokens_examples)
print()
print(counts_cd_abstracts.most_common(40))

488
394
4467643

['1.', 'cell', 'death', 'differ.', '2019', 'dec', '20.', 'doi', ':', '10.1038/s41418-019-0483-6.', '[', 'epub', 'ahead', 'of', 'print', ']', 'hnrnp', 'f/h', 'associate', 'with', 'hterc', 'and', 'telomerase', 'holoenzyme', 'to', 'modulate', 'telomerase', 'function', 'and', 'promote', 'cell', 'proliferation.', 'xu', 'c', '(', '1', ')', ',', 'xie', 'n']

['Compound', 'words', 'Hydrogen', 'peroxide', 'causes', 'cell', 'death', 'Cell', 'death', 'is', 'caused', 'by', 'hydrogen', 'peroxide', 'UV', 'radiation', 'kills', 'cancer', 'cells', 'Hyphenated', 'compound', 'words', 'co-localization', 'Co-localization', 'wild-type', 'Wild-type', 'TLR-4', 'X-ray', 'Slashes', 'downregulation/mutation', 'Downregulation/mutation', 'Downregulation/Mutation', 'P53/73', 'p53/73', 'Omi/HtrA2', 'HER2/Neu', 'mg/ml', '&', 'Material', '&', 'Methods', 'material', '&', 'methods', '‘', "'", '``', 'Parkinson', '’', 's', 'can', '’', 't', 'wouldn', '’', 't', 'haven', '’', 't', 'hadn', '’', 't', 'shouldn'
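
The outputs of `word_tokenize` and `TreebankWordTokenizer` look very similar, so it helps to list only the tokens on which they disagree. Below is a minimal sketch using `collections.Counter`; `token_diff` is a hypothetical helper, not part of nltk, and the two short lists are copied from the outputs printed above.

```python
from collections import Counter

def token_diff(tokens_a, tokens_b):
    """Return the tokens (with counts) that occur more often in one list than in the other."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Counter subtraction keeps only positive differences
    return counts_a - counts_b, counts_b - counts_a

# First tokens of the abstract as printed above
from_word_tokenize = ['1.', 'cell', 'death', 'differ', '.', '2019', 'dec', '20.']
from_treebank = ['1.', 'cell', 'death', 'differ.', '2019', 'dec', '20.']

only_word_tokenize, only_treebank = token_diff(from_word_tokenize, from_treebank)
print(only_word_tokenize)
print(only_treebank)
```

In this fragment the only difference is the period after "differ", which `word_tokenize` splits off and `TreebankWordTokenizer` keeps attached.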

# scispacy tokenization (en_core_sci_md)

In [11]:
import spacy
path = 'C:\\Users\\Sonja\\Anaconda3\\envs\\scispacy\\Lib\\site-packages\\en_core_sci_md\\en_core_sci_md-0.2.4'
#the model was downloaded from https://github.com/allenai/scispacy
nlp = spacy.load(path)

In [13]:
print(nlp)
print(len(cd_abstracts))

<spacy.lang.en.English object at 0x0000029C088A08C8>
25881725


The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters. Then, when you call your spaCy pipeline, disable RAM-intensive parts such as ner and parser.
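
As an alternative to truncating the text or raising `nlp.max_length`, a long text could be processed in pieces. The sketch below splits a string into chunks below a character limit, breaking at line ends where possible. `chunk_text` is a hypothetical helper, and the default of 100,000 characters is simply the figure from the note above, not a recommended setting.

```python
def chunk_text(text, max_chars=100_000):
    """Split text into chunks of at most max_chars characters, breaking at line ends where possible."""
    chunks, current, current_len = [], [], 0
    for line in text.splitlines(keepends=True):
        # A single line longer than max_chars is cut into fixed-size pieces
        pieces = [line[i:i + max_chars] for i in range(0, len(line), max_chars)]
        for piece in pieces:
            # Start a new chunk when adding this piece would exceed the limit
            if current and current_len + len(piece) > max_chars:
                chunks.append(''.join(current))
                current, current_len = [], 0
            current.append(piece)
            current_len += len(piece)
    if current:
        chunks.append(''.join(current))
    return chunks

# Small demonstration with an artificially low limit
demo = "first abstract\n\nsecond abstract\n\nthird abstract\n"
chunks = chunk_text(demo, max_chars=20)
print(chunks)
```

Each chunk could then be passed to `nlp` separately. Note that a chunk boundary may fall inside an abstract, which is harmless for tokenization but could matter for sentence-level analysis.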

In [14]:
cd_abstracts_s = cd_abstracts[:500000]

In [None]:
#Example:
#nlp.max_length = 26000000
#doc_cd_ab = nlp(cd_abstracts, disable = ['ner', 'parser'])

In [15]:
doc_ab = nlp(abstract)
tokens_abstract = [token.text for token in doc_ab]

doc_ex = nlp(examples)
tokens_examples = [token.text for token in doc_ex]

doc_cd_ab_s = nlp(cd_abstracts_s)
tokens_cd_abstracts_s = [token.text for token in doc_cd_ab_s]

counts_cd_abstracts_s = FreqDist(tokens_cd_abstracts_s)

print(len(tokens_abstract))
print(len(tokens_examples))
print(len(tokens_cd_abstracts_s))
print()
print(tokens_abstract[:40])
print()
print(tokens_examples)
print()
print(counts_cd_abstracts_s.most_common(40))

515
585
93512

['1', '.', 'cell', 'death', 'differ', '.', '2019', 'dec', '20', '.', 'doi', ':', '10.1038/s41418', '-', '019', '-', '0483', '-', '6', '.', '[', 'epub', 'ahead', 'of', '\n', 'print', ']', '\n\n', 'hnrnp', 'f/h', 'associate', 'with', 'hterc', 'and', 'telomerase', 'holoenzyme', 'to', 'modulate', 'telomerase', '\n']

['Compound', 'words', '\n', 'Hydrogen', 'peroxide', 'causes', 'cell', 'death', '\n', 'Cell', 'death', 'is', 'caused', 'by', 'hydrogen', 'peroxide', '\n', 'UV', 'radiation', 'kills', 'cancer', 'cells', '\n\n', 'Hyphenated', 'compound', 'words', '\n', 'co-localization', '\n', 'Co-localization', '\n', 'wild-type', '\n', 'Wild-type', '\n', 'TLR-4', '\n', 'X-ray', '\n\n', 'Slashes', '\n', 'downregulation/mutation', '\n', 'Downregulation/mutation', '\n', 'Downregulation/Mutation', '\n', 'P53/73', '\n', 'p53/73', '\n', 'Omi/HtrA2', '\n', 'HER2/Neu', '\n', 'mg/ml', '\n\n', '&', '\n', 'Material&Methods', '\n', 'material&methods', '\n\n', '‘', "'", '"', '\n', 'Parkinson',
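
Unlike the nltk tokenizers, the spaCy-based output contains line breaks (`'\n'`, `'\n\n'`) as separate tokens, which inflates the token counts. A minimal sketch of removing them before comparing counts, using a short list copied from the output above (variable names are illustrative):

```python
# Tokens of the first abstract as returned by the scispacy pipeline above
tokens = ['of', '\n', 'print', ']', '\n\n', 'hnrnp', 'f/h']

# str.isspace() is True for strings made up only of whitespace, such as '\n' or '\n\n'
tokens_no_space = [t for t in tokens if not t.isspace()]
n_removed = len(tokens) - len(tokens_no_space)
print(tokens_no_space, n_removed)
```

When working with the spaCy token objects rather than their text, the `is_space` attribute of each token could serve the same purpose.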

# spacy default tokenizer

In [16]:
#builds spaCy's default English tokenizer (spaCy v2.x API) instead of the tokenizer shipped with the scispacy model
tokenizer = nlp.Defaults.create_tokenizer(nlp)
tokens_abstract = tokenizer(abstract)
token_list_ab = [token.text for token in tokens_abstract]

tokens_examples = tokenizer(examples)
token_list_ex = [token.text for token in tokens_examples]

tokens_cd_abstracts_s = tokenizer(cd_abstracts_s)
token_list_cd_ab_s = [token.text for token in tokens_cd_abstracts_s]
    
    
counts_cd_abstracts_s = FreqDist(token_list_cd_ab_s)


print(len(token_list_ab))
print(len(token_list_ex))
print(len(token_list_cd_ab_s))
print()
print(token_list_ab[:40])
print()
print(token_list_ex)
print()
print(counts_cd_abstracts_s.most_common(40))

547
643
98953

['1', '.', 'cell', 'death', 'differ', '.', '2019', 'dec', '20', '.', 'doi', ':', '10.1038', '/', 's41418', '-', '019', '-', '0483', '-', '6', '.', '[', 'epub', 'ahead', 'of', '\n', 'print', ']', '\n\n', 'hnrnp', 'f', '/', 'h', 'associate', 'with', 'hterc', 'and', 'telomerase', 'holoenzyme']

['Compound', 'words', '\n', 'Hydrogen', 'peroxide', 'causes', 'cell', 'death', '\n', 'Cell', 'death', 'is', 'caused', 'by', 'hydrogen', 'peroxide', '\n', 'UV', 'radiation', 'kills', 'cancer', 'cells', '\n\n', 'Hyphenated', 'compound', 'words', '\n', 'co', '-', 'localization', '\n', 'Co', '-', 'localization', '\n', 'wild', '-', 'type', '\n', 'Wild', '-', 'type', '\n', 'TLR-4', '\n', 'X', '-', 'ray', '\n\n', 'Slashes', '\n', 'downregulation', '/', 'mutation', '\n', 'Downregulation', '/', 'mutation', '\n', 'Downregulation', '/', 'Mutation', '\n', 'P53/73', '\n', 'p53/73', '\n', 'Omi', '/', 'HtrA2', '\n', 'HER2', '/', 'Neu', '\n', 'mg', '/', 'ml', '\n\n', '&', '\n', 'Material&Methods', '

# scispacy custom tokenizer

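This seems to be identical to using the scispacy model as above: the token counts (515, 585 and 93512) and the printed tokens match, presumably because the model already uses this tokenizer.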

In [17]:
import scispacy
from scispacy.custom_tokenizer import combined_rule_tokenizer
tokenizer = combined_rule_tokenizer(nlp)

In [18]:
tokens_abstract = tokenizer(abstract)
token_list_ab = [token.text for token in tokens_abstract]

tokens_examples = tokenizer(examples)
token_list_ex = [token.text for token in tokens_examples]

tokens_cd_abstracts_s = tokenizer(cd_abstracts_s)
token_list_cd_ab_s = [token.text for token in tokens_cd_abstracts_s]
    
    
counts_cd_abstracts_s = FreqDist(token_list_cd_ab_s)


print(len(token_list_ab))
print(len(token_list_ex))
print(len(token_list_cd_ab_s))
print()
print(token_list_ab[:40])
print()
print(token_list_ex)
print()
print(counts_cd_abstracts_s.most_common(40))

515
585
93512

['1', '.', 'cell', 'death', 'differ', '.', '2019', 'dec', '20', '.', 'doi', ':', '10.1038/s41418', '-', '019', '-', '0483', '-', '6', '.', '[', 'epub', 'ahead', 'of', '\n', 'print', ']', '\n\n', 'hnrnp', 'f/h', 'associate', 'with', 'hterc', 'and', 'telomerase', 'holoenzyme', 'to', 'modulate', 'telomerase', '\n']

['Compound', 'words', '\n', 'Hydrogen', 'peroxide', 'causes', 'cell', 'death', '\n', 'Cell', 'death', 'is', 'caused', 'by', 'hydrogen', 'peroxide', '\n', 'UV', 'radiation', 'kills', 'cancer', 'cells', '\n\n', 'Hyphenated', 'compound', 'words', '\n', 'co-localization', '\n', 'Co-localization', '\n', 'wild-type', '\n', 'Wild-type', '\n', 'TLR-4', '\n', 'X-ray', '\n\n', 'Slashes', '\n', 'downregulation/mutation', '\n', 'Downregulation/mutation', '\n', 'Downregulation/Mutation', '\n', 'P53/73', '\n', 'p53/73', '\n', 'Omi/HtrA2', '\n', 'HER2/Neu', '\n', 'mg/ml', '\n\n', '&', '\n', 'Material&Methods', '\n', 'material&methods', '\n\n', '‘', "'", '"', '\n', 'Parkinson',
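
Whether two tokenizers really produce identical output can be checked directly instead of comparing the printed lists by eye. The sketch below returns the position of the first disagreement; `first_mismatch` is a hypothetical helper, and the short lists are copied from the outputs of the scispacy model and the spaCy default tokenizer above.

```python
def first_mismatch(tokens_a, tokens_b):
    """Return the index of the first position where the lists differ, or None if they are equal."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    # One list may be a prefix of the other
    if len(tokens_a) != len(tokens_b):
        return min(len(tokens_a), len(tokens_b))
    return None

# Tokens around the DOI as printed for the scispacy model and the spaCy default tokenizer
scispacy_tokens = ['doi', ':', '10.1038/s41418', '-', '019']
default_tokens = ['doi', ':', '10.1038', '/', 's41418', '-', '019']

print(first_mismatch(scispacy_tokens, scispacy_tokens))
print(first_mismatch(scispacy_tokens, default_tokens))
```

Applied to the full token lists of the scispacy model and the custom tokenizer, a result of `None` would confirm that the two are identical for that text.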

# syntok

In [19]:
from syntok.tokenizer import Tokenizer
tokenizer = Tokenizer()  # default settings; constructor options can instead keep "n't" contractions and "-", "_" inside words as tokens

token_list_ab = [token.value for token in tokenizer.tokenize(abstract)]

token_list_ex = [token.value for token in tokenizer.tokenize(examples)]

token_list_cd_ab_s = [token.value for token in tokenizer.tokenize(cd_abstracts_s)]
    
    
counts_cd_abstracts_s = FreqDist(token_list_cd_ab_s)


print(len(token_list_ab))
print(len(token_list_ex))
print(len(token_list_cd_ab_s))
print()
print(token_list_ab[:40])
print()
print(token_list_ex)
print()
print(counts_cd_abstracts_s.most_common(40))

508
426
95387

['1', '.', 'cell', 'death', 'differ', '.', '2019', 'dec', '20', '.', 'doi', ':', '10.1038/s41418-019-0483-6', '.', '[', 'epub', 'ahead', 'of', 'print', ']', 'hnrnp', 'f/h', 'associate', 'with', 'hterc', 'and', 'telomerase', 'holoenzyme', 'to', 'modulate', 'telomerase', 'function', 'and', 'promote', 'cell', 'proliferation', '.', 'xu', 'c', '(']

['Compound', 'words', 'Hydrogen', 'peroxide', 'causes', 'cell', 'death', 'Cell', 'death', 'is', 'caused', 'by', 'hydrogen', 'peroxide', 'UV', 'radiation', 'kills', 'cancer', 'cells', 'Hyphenated', 'compound', 'words', 'co', 'localization', 'Co', 'localization', 'wild', 'type', 'Wild', 'type', 'TLR', '4', 'X', 'ray', 'Slashes', 'downregulation/mutation', 'Downregulation/mutation', 'Downregulation/Mutation', 'P53/73', 'p53/73', 'Omi/Htr', 'A2', 'HER2/Neu', 'mg/ml', '&', 'Material&Methods', 'material&methods', '‘', "'", '"', 'Parkinson', '’s', 'ca', 'not', 'would', 'not', 'have', 'not', 'had', 'not', 'should', 'not', 'cells', '’', 'c

In [20]:
#print the repr of each token for more information: preceding whitespace, token text and character offset

for token in tokenizer.tokenize(abstract):
    print(repr(token))

<Token '' : '1' @ 0>
<Token '' : '.' @ 1>
<Token ' ' : 'cell' @ 3>
<Token ' ' : 'death' @ 8>
<Token ' ' : 'differ' @ 14>
<Token '' : '.' @ 20>
<Token ' ' : '2019' @ 22>
<Token ' ' : 'dec' @ 27>
<Token ' ' : '20' @ 31>
<Token '' : '.' @ 33>
<Token ' ' : 'doi' @ 35>
<Token '' : ':' @ 38>
<Token ' ' : '10.1038/s41418-019-0483-6' @ 40>
<Token '' : '.' @ 65>
<Token ' ' : '[' @ 67>
<Token '' : 'epub' @ 68>
<Token ' ' : 'ahead' @ 73>
<Token ' ' : 'of' @ 79>
<Token '\n' : 'print' @ 82>
<Token '' : ']' @ 87>
<Token '\n\n' : 'hnrnp' @ 90>
<Token ' ' : 'f/h' @ 96>
<Token ' ' : 'associate' @ 100>
<Token ' ' : 'with' @ 110>
<Token ' ' : 'hterc' @ 115>
<Token ' ' : 'and' @ 121>
<Token ' ' : 'telomerase' @ 125>
<Token ' ' : 'holoenzyme' @ 136>
<Token ' ' : 'to' @ 147>
<Token ' ' : 'modulate' @ 150>
<Token ' ' : 'telomerase' @ 159>
<Token '\n' : 'function' @ 170>
<Token ' ' : 'and' @ 179>
<Token ' ' : 'promote' @ 183>
<Token ' ' : 'cell' @ 191>
<Token ' ' : 'proliferation' @ 196>
<Token '' : '.' @ 209
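
Each repr above shows the whitespace preceding the token, the token text and its character offset, so the original text can be rebuilt from the tokens. The sketch below uses plain tuples as stand-ins for the syntok tokens (an assumption made to keep the example self-contained); the values are copied from the first lines of the output above.

```python
# (spacing, value, offset) triples copied from the first tokens printed above
triples = [('', '1', 0), ('', '.', 1), (' ', 'cell', 3), (' ', 'death', 8),
           (' ', 'differ', 14), ('', '.', 20)]

# Joining spacing and value for every token restores the original text
rebuilt = ''.join(spacing + value for spacing, value, _ in triples)
print(repr(rebuilt))

# Each offset should point at the start of its token in the rebuilt text
offsets_ok = all(rebuilt.startswith(value, offset) for _, value, offset in triples)
print(offsets_ok)
```

This only holds when the tokenizer does not rewrite tokens; in the examples output above, "can't" becomes 'ca' and 'not', so a text rebuilt from such tokens would differ from the input.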