#### Task Details

What do we know about vaccines and therapeutics? <br> 
What has been published concerning research and development and evaluation efforts of vaccines and therapeutics?

In [1]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import string

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# ML
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

# NLP
import spacy
import scispacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from scispacy.umls_linking import UmlsEntityLinker
from scispacy.abbreviation import AbbreviationDetector 
from negspacy.negation import Negex

In [2]:
meta_data = pd.read_csv('../data/raw/all_sources_metadata_2020-03-13.csv')

In [3]:
meta_data.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text
0,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,CZI,Angiotensin-converting enzyme 2 (ACE2) as a SA...,10.1007/s00134-020-05985-9,,32125455.0,cc-by-nc,,2020,"Zhang, Haibo; Penninger, Josef M.; Li, Yimin; ...",Intensive Care Med,2002765000.0,#3252,True
1,53eccda7977a31e3d0f565c884da036b1e85438e,CZI,Comparative genetic analysis of the novel coro...,10.1038/s41421-020-0147-1,,,cc-by,,2020,"Cao, Yanan; Li, Lin; Feng, Zhimin; Wan, Shengq...",Cell Discovery,3003431000.0,#1861,True
2,210a892deb1c61577f6fba58505fd65356ce6636,CZI,Incubation Period and Other Epidemiological Ch...,10.3390/jcm9020538,,,cc-by,The geographic spread of 2019 novel coronaviru...,2020,"Linton, M. Natalie; Kobayashi, Tetsuro; Yang, ...",Journal of Clinical Medicine,3006065000.0,#1043,True
3,e3b40cc8e0e137c416b4a2273a4dca94ae8178cc,CZI,Characteristics of and Public Health Responses...,10.3390/jcm9020575,,32093211.0,cc-by,"In December 2019, cases of unidentified pneumo...",2020,"Deng, Sheng-Qun; Peng, Hong-Juan",J Clin Med,177663100.0,#1999,True
4,92c2c9839304b4f2bc1276d41b1aa885d8b364fd,CZI,Imaging changes in severe COVID-19 pneumonia,10.1007/s00134-020-05976-w,,32125453.0,cc-by-nc,,2020,"Zhang, Wei",Intensive Care Med,3006643000.0,#3242,False


In [4]:
# How many papers are we talking about?
meta_data.shape

(29500, 14)

29,500 articles are too many to read through manually, especially since many of them will not investigate vaccines or therapies. So how do we find the relevant ones?<br>
Look at the abstracts and find the ones that contain lemmas of "vaccine", "drug", or "therapy".<br>
Assumption: these words appear in the abstract if the paper discusses our topic.

Preprocessing: Replace all medical abbreviations with their official long form (e.g. COVID-19 = coronavirus disease 2019). Lemmatize?<br>
Processing: Find all papers which mention "vaccine" or "therapy". From this subgroup, visualize clusters of mentioned compounds, or even a cluster of papers stating that "no drug has been found".
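
A cheap first pass over this plan could be a plain regular-expression filter on the raw abstract text, before any spaCy model is loaded. The sketch below is an assumption, not the approach this notebook ends up using: `KEYWORD_PATTERN` and `mentions_treatment` are hypothetical names, the example sentences are made up, and the word stems only approximate lemma matching.

```python
import re

# Hypothetical stems that approximate the lemmas named above
KEYWORD_PATTERN = re.compile(r"\b(vaccin\w*|drugs?|therap\w*)\b", re.IGNORECASE)

def mentions_treatment(abstract):
    """Return True if the abstract contains one of the keyword stems."""
    return bool(KEYWORD_PATTERN.search(abstract))

# Made-up example abstracts
examples = [
    "We evaluate a candidate vaccine in mice.",
    "Short-term forecasts of cumulative case counts in Hubei.",
    "Antiviral therapies remain under investigation.",
]
selected = [i for i, text in enumerate(examples) if mentions_treatment(text)]
print(selected)
```

On the real data, the same predicate could be applied to the `abstract` column. A filter like this misses abbreviations and paraphrases, which is one reason to prefer entity-based matching.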

In [5]:
# What does an abstract look like?
meta_data.abstract[6]

'The initial cluster of severe pneumonia cases that triggered the 2019-nCoV epidemic was identified in Wuhan, China in December 2019. While early cases of the disease were linked to a wet market, human-to-human transmission has driven the rapid spread of the virus throughout China. The Chinese government has implemented containment strategies of city-wide lockdowns, screening at airports and train stations, and isolation of suspected patients; however, the cumulative case count keeps growing every day. The ongoing outbreak presents a challenge for modelers, as limited data are available on the early growth trajectory, and the epidemiological characteristics of the novel coronavirus are yet to be fully elucidated. We use phenomenological models that have been validated during previous outbreaks to generate and assess short-term forecasts of the cumulative number of confirmed reported cases in Hubei province, the epicenter of the epidemic, and for the overall trajectory in China, excludi

In [6]:
# Let's filter out the ones without an abstract
abstracts = meta_data[~meta_data.abstract.isna()]

In [7]:
# Ok this did not really bring down the number of articles by much...
abstracts.shape

(26553, 14)

In [9]:
'''abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)'''

In [10]:
'''linker = UmlsEntityLinker(resolve_abbreviations=True)
nlp.add_pipe(linker)''' # this runs endlessly

'linker = UmlsEntityLinker(resolve_abbreviations=True)\nnlp.add_pipe(linker)'

In [11]:
'''negex = Negex(nlp)
nlp.add_pipe(negex, last=True)'''

In [12]:
abstracts.iloc[1,7]

'In December 2019, cases of unidentified pneumonia with a history of exposure in the Huanan Seafood Market were reported in Wuhan, Hubei Province. A novel coronavirus, SARS-CoV-2, was identified to be accountable for this disease. Human-to-human transmission is confirmed, and this disease (named COVID-19 by World Health Organization (WHO)) spread rapidly around the country and the world. As of 18 February 2020, the number of confirmed cases had reached 75,199 with 2009 fatalities. The COVID-19 resulted in a much lower case-fatality rate (about 2.67%) among the confirmed cases, compared with Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS). Among the symptom composition of the 45 fatality cases collected from the released official reports, the top four are fever, cough, short of breath, and chest tightness/pain. The major comorbidities of the fatality cases include hypertension, diabetes, coronary heart disease, cerebral infarction, and chronic bronch

What are the differences between the different NER models? How do they perform on our documents?
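
One way to make this comparison concrete is to collect the entity strings each model returns for the same abstract and measure how much they overlap. The helper below is a sketch under assumptions: `entity_overlap` is a hypothetical name, the two entity lists are made-up stand-ins for `[ent.text for ent in doc.ents]`, and Jaccard similarity is only one possible measure.

```python
def entity_overlap(ents_a, ents_b):
    """Jaccard similarity between two collections of entity strings."""
    set_a = {e.lower() for e in ents_a}
    set_b = {e.lower() for e in ents_b}
    if not set_a and not set_b:
        return 1.0  # two empty results agree trivially
    return len(set_a & set_b) / len(set_a | set_b)

# Made-up example output of two models on the same text
model_a = ["SARS-CoV-2", "pneumonia", "fever"]
model_b = ["pneumonia", "fever", "cough", "hypertension"]
print(entity_overlap(model_a, model_b))
```

A low score does not say which model is better, only that the models disagree; the rendered entities still need to be inspected by eye.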

In [23]:
sample = -4

In [28]:
nlp = spacy.load("en_ner_jnlpba_md")
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)
doc = nlp(abstracts.iloc[sample,7])
spacy.displacy.render(doc, style='ent',jupyter=True)

In [25]:
nlp = spacy.load("en_ner_bc5cdr_md")
doc = nlp(abstracts.iloc[sample,7])
spacy.displacy.render(doc, style='ent',jupyter=True)

In [26]:
nlp = spacy.load("en_ner_bionlp13cg_md")
doc = nlp(abstracts.iloc[sample,7])
spacy.displacy.render(doc, style='ent',jupyter=True)

In [55]:
nlp = spacy.load("en_core_sci_sm")
doc = nlp(abstracts.iloc[sample,7])
spacy.displacy.render(doc, style='ent',jupyter=True)


In [35]:
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

WHO 	 (57, 58) World Health Organization
SARS 	 (113, 114) Severe Acute Respiratory Syndrome
MERS 	 (121, 122) Middle East Respiratory Syndrome


In [56]:
# This takes a while...
nlp = spacy.load("en_core_sci_sm") # we use this more sensitive model here to detect all instances of entities
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)

vaccine_idx = []

for i in range(abstracts.shape[0]):
    doc = nlp(abstracts.iloc[i, 7])
    # Check each entity for our lemmas. An index is appended (and printed) once per
    # matching entity, so vaccine_idx contains duplicates that are removed later.
    for w in doc.ents:
        if any(x in w.lemma_ for x in ['vaccine', 'drug', 'therapy', 'therapeutic']):
            vaccine_idx.append(i)
            print(i)


1
1
1
8
8
10
10
10
15
...
26437
26438
...

In [None]:
'''    for e in doc.ents:
        if any(x in e.text for x in ['vaccine', 'drug', 'therapy']):
            print(i)
            print(e.text, e._.negex)'''

In [57]:
# This drastically cuts down the number of papers of interest!
len(np.unique(vaccine_idx))

6666
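
Because the loop above appends an index once per matching entity, `vaccine_idx` contains many repeats, and `np.unique` is what reduces it to one entry per abstract. As a small illustration of the same idea without NumPy, the sketch below removes repeats while keeping first-seen order; `unique_in_order` is a hypothetical helper and not part of the notebook.

```python
def unique_in_order(indices):
    """Drop repeated indices while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(indices))

# First few values printed by the loop above
vaccine_idx_sample = [1, 1, 1, 8, 8, 10, 10, 10, 15]
print(unique_in_order(vaccine_idx_sample))
```

Since the indices here are appended in ascending order, this gives the same values as `np.unique`, which additionally sorts its result.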

In [58]:
# We can now start to look at the entities mentioned in these abstracts and construct a topic cloud from them
# .copy() avoids a SettingWithCopyWarning when a column is added later
vaccines = abstracts.iloc[np.unique(vaccine_idx)].copy()
vaccines.reset_index(inplace=True)

In [None]:
'''for umls_ent in entity._.umls_ents:
    print(linker.umls.cui_to_entity[umls_ent[0]])'''

In [60]:
def replace_acronyms(text):
    # Expand recent abbreviations by hand (not always detected by scispaCy)
    text = text.replace("SARS-CoV-2", "severe acute respiratory syndrome coronavirus 2")
    text = text.replace("COVID-19"  , "Coronavirus disease 2019")
    
    doc = nlp(text)
    altered_tok = [tok.text for tok in doc]
    # Swap the first token of each detected abbreviation for its long form
    for abrv in doc._.abbreviations:
        altered_tok[abrv.start] = str(abrv._.long_form)
    text = " ".join(altered_tok)
    
    return text
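
`replace_acronyms` overwrites only the first token of each detected short form, which is enough when the abbreviation is a single token. If a short form spans several tokens, the remaining tokens stay in the text. The sketch below shows one way to handle that on plain token lists. `expand_spans` is a hypothetical helper, the `(start, end, long_form)` tuples stand in for `abrv.start`, `abrv.end`, and `abrv._.long_form`, and the spans are assumed not to overlap.

```python
def expand_spans(tokens, spans):
    """Replace each token span [start, end) with its long form."""
    out = list(tokens)
    # Work right to left so that earlier offsets stay valid
    for start, end, long_form in sorted(spans, reverse=True):
        out[start:end] = [long_form]
    return " ".join(out)

tokens = ["The", "WHO", "named", "the", "disease", "."]
spans = [(1, 2, "World Health Organization")]
print(expand_spans(tokens, spans))
```

Like the original function, this joins tokens with single spaces, so punctuation ends up separated from the preceding word.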

In [62]:
tqdm.pandas()

nlp = spacy.load("en_core_sci_sm") # we use this because it is faster
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)

vaccines["abstract_noabbreviations"] = vaccines.abstract.progress_apply(replace_acronyms)

  0%|          | 0/6666 [00:00<?, ?it/s][A[A[A[A[A[A[A[A[A








  0%|          | 3/6666 [00:00<05:03, 21.94it/s][A[A[A[A[A[A[A[A[A








  0%|          | 5/6666 [00:00<05:15, 21.09it/s][A[A[A[A[A[A[A[A[A








  0%|          | 7/6666 [00:00<05:20, 20.75it/s][A[A[A[A[A[A[A[A[A








  0%|          | 9/6666 [00:00<06:04, 18.24it/s][A[A[A[A[A[A[A[A[A








  0%|          | 11/6666 [00:00<05:57, 18.60it/s][A[A[A[A[A[A[A[A[A








  0%|          | 13/6666 [00:00<06:15, 17.72it/s][A[A[A[A[A[A[A[A[A








  0%|          | 15/6666 [00:00<06:16, 17.69it/s][A[A[A[A[A[A[A[A[A








  0%|          | 17/6666 [00:00<06:08, 18.04it/s][A[A[A[A[A[A[A[A[A








  0%|          | 19/6666 [00:01<06:24, 17.30it/s][A[A[A[A[A[A[A[A[A








  0%|          | 21/6666 [00:01<06:18, 17.54it/s][A[A[A[A[A[A[A[A[A








  0%|          | 23/6666 [00:01<07:19, 15.10i

  4%|▍         | 250/6666 [00:13<08:31, 12.55it/s][A[A[A[A[A[A[A[A[A








  4%|▍         | 252/6666 [00:13<08:01, 13.33it/s][A[A[A[A[A[A[A[A[A








  4%|▍         | 254/6666 [00:13<07:29, 14.28it/s][A[A[A[A[A[A[A[A[A








  4%|▍         | 257/6666 [00:14<06:33, 16.30it/s][A[A[A[A[A[A[A[A[A








  4%|▍         | 260/6666 [00:14<06:04, 17.59it/s][A[A[A[A[A[A[A[A[A








  4%|▍         | 262/6666 [00:14<05:57, 17.90it/s][A[A[A[A[A[A[A[A[A








  4%|▍         | 265/6666 [00:14<05:33, 19.22it/s][A[A[A[A[A[A[A[A[A








  4%|▍         | 268/6666 [00:14<05:25, 19.64it/s][A[A[A[A[A[A[A[A[A








  4%|▍         | 271/6666 [00:14<05:33, 19.20it/s][A[A[A[A[A[A[A[A[A








  4%|▍         | 273/6666 [00:14<05:37, 18.94it/s][A[A[A[A[A[A[A[A[A








  4%|▍         | 275/6666 [00:15<05:54, 18.05it/s][A[A[A[A[A[A[A[A[A








  4%|▍         | 278/6666 [00:15<05:36, 18.97it/s][A

  7%|▋         | 499/6666 [00:26<04:51, 21.13it/s][A[A[A[A[A[A[A[A[A








  8%|▊         | 502/6666 [00:26<04:38, 22.15it/s][A[A[A[A[A[A[A[A[A








  8%|▊         | 505/6666 [00:26<04:24, 23.31it/s][A[A[A[A[A[A[A[A[A








  8%|▊         | 508/6666 [00:26<04:19, 23.70it/s][A[A[A[A[A[A[A[A[A








  8%|▊         | 511/6666 [00:26<04:11, 24.47it/s][A[A[A[A[A[A[A[A[A








  8%|▊         | 514/6666 [00:27<04:17, 23.90it/s][A[A[A[A[A[A[A[A[A








  8%|▊         | 517/6666 [00:27<04:16, 23.97it/s][A[A[A[A[A[A[A[A[A








  8%|▊         | 520/6666 [00:27<04:01, 25.49it/s][A[A[A[A[A[A[A[A[A








  8%|▊         | 524/6666 [00:27<03:50, 26.66it/s][A[A[A[A[A[A[A[A[A








  8%|▊         | 527/6666 [00:27<03:59, 25.66it/s][A[A[A[A[A[A[A[A[A








  8%|▊         | 530/6666 [00:27<04:08, 24.71it/s][A[A[A[A[A[A[A[A[A








  8%|▊         | 534/6666 [00:27<03:44, 27.29it/s][A

 12%|█▏        | 805/6666 [00:38<12:14,  7.98it/s][A[A[A[A[A[A[A[A[A








 12%|█▏        | 807/6666 [00:39<12:03,  8.10it/s][A[A[A[A[A[A[A[A[A








 12%|█▏        | 809/6666 [00:39<11:40,  8.36it/s][A[A[A[A[A[A[A[A[A








 12%|█▏        | 811/6666 [00:39<11:32,  8.45it/s][A[A[A[A[A[A[A[A[A








 12%|█▏        | 814/6666 [00:39<09:20, 10.43it/s][A[A[A[A[A[A[A[A[A








 12%|█▏        | 816/6666 [00:39<08:17, 11.77it/s][A[A[A[A[A[A[A[A[A








 12%|█▏        | 819/6666 [00:40<07:09, 13.62it/s][A[A[A[A[A[A[A[A[A








 12%|█▏        | 821/6666 [00:40<06:30, 14.97it/s][A[A[A[A[A[A[A[A[A








 12%|█▏        | 825/6666 [00:40<05:35, 17.39it/s][A[A[A[A[A[A[A[A[A








 12%|█▏        | 828/6666 [00:40<04:57, 19.60it/s][A[A[A[A[A[A[A[A[A








 12%|█▏        | 831/6666 [00:40<04:40, 20.81it/s][A[A[A[A[A[A[A[A[A








 13%|█▎        | 835/6666 [00:40<04:24, 22.05it/s][A

 16%|█▌        | 1075/6666 [00:52<03:44, 24.90it/s][A[A[A[A[A[A[A[A[A








 16%|█▌        | 1079/6666 [00:52<03:20, 27.83it/s][A[A[A[A[A[A[A[A[A








 16%|█▌        | 1082/6666 [00:52<03:37, 25.70it/s][A[A[A[A[A[A[A[A[A








 16%|█▋        | 1085/6666 [00:52<03:55, 23.74it/s][A[A[A[A[A[A[A[A[A








 16%|█▋        | 1088/6666 [00:52<03:40, 25.32it/s][A[A[A[A[A[A[A[A[A








 16%|█▋        | 1092/6666 [00:53<03:37, 25.66it/s][A[A[A[A[A[A[A[A[A








 16%|█▋        | 1096/6666 [00:53<03:19, 27.91it/s][A[A[A[A[A[A[A[A[A








 16%|█▋        | 1099/6666 [00:53<03:22, 27.49it/s][A[A[A[A[A[A[A[A[A








 17%|█▋        | 1102/6666 [00:53<03:33, 26.10it/s][A[A[A[A[A[A[A[A[A








 17%|█▋        | 1105/6666 [00:53<03:52, 23.92it/s][A[A[A[A[A[A[A[A[A








 17%|█▋        | 1108/6666 [00:53<03:39, 25.29it/s][A[A[A[A[A[A[A[A[A








 17%|█▋        | 1112/6666 [00:53<03:16, 28

 20%|█▉        | 1331/6666 [01:05<04:12, 21.14it/s][A[A[A[A[A[A[A[A[A








 20%|██        | 1334/6666 [01:05<04:09, 21.36it/s][A[A[A[A[A[A[A[A[A








 20%|██        | 1337/6666 [01:05<04:23, 20.23it/s][A[A[A[A[A[A[A[A[A








 20%|██        | 1340/6666 [01:06<05:22, 16.50it/s][A[A[A[A[A[A[A[A[A








 20%|██        | 1342/6666 [01:06<05:59, 14.81it/s][A[A[A[A[A[A[A[A[A








 20%|██        | 1344/6666 [01:06<05:43, 15.47it/s][A[A[A[A[A[A[A[A[A








 20%|██        | 1347/6666 [01:06<05:08, 17.25it/s][A[A[A[A[A[A[A[A[A








 20%|██        | 1349/6666 [01:06<05:09, 17.20it/s][A[A[A[A[A[A[A[A[A








 20%|██        | 1351/6666 [01:06<05:13, 16.98it/s][A[A[A[A[A[A[A[A[A








 20%|██        | 1354/6666 [01:06<04:38, 19.06it/s][A[A[A[A[A[A[A[A[A








 20%|██        | 1357/6666 [01:06<04:23, 20.12it/s][A[A[A[A[A[A[A[A[A








 20%|██        | 1360/6666 [01:07<04:02, 21

 24%|██▍       | 1612/6666 [01:21<03:41, 22.83it/s][A[A[A[A[A[A[A[A[A








 24%|██▍       | 1615/6666 [01:22<03:54, 21.51it/s][A[A[A[A[A[A[A[A[A








 24%|██▍       | 1618/6666 [01:22<03:56, 21.33it/s][A[A[A[A[A[A[A[A[A








 24%|██▍       | 1621/6666 [01:22<03:55, 21.38it/s][A[A[A[A[A[A[A[A[A








 24%|██▍       | 1624/6666 [01:22<03:41, 22.73it/s][A[A[A[A[A[A[A[A[A








 24%|██▍       | 1627/6666 [01:22<03:45, 22.30it/s][A[A[A[A[A[A[A[A[A








 24%|██▍       | 1630/6666 [01:22<03:37, 23.15it/s][A[A[A[A[A[A[A[A[A








 24%|██▍       | 1633/6666 [01:22<03:43, 22.55it/s][A[A[A[A[A[A[A[A[A








 25%|██▍       | 1636/6666 [01:22<03:50, 21.82it/s][A[A[A[A[A[A[A[A[A








 25%|██▍       | 1639/6666 [01:23<03:37, 23.09it/s][A[A[A[A[A[A[A[A[A








 25%|██▍       | 1642/6666 [01:23<03:36, 23.24it/s][A[A[A[A[A[A[A[A[A








 25%|██▍       | 1645/6666 [01:23<03:35, 23

 28%|██▊       | 1893/6666 [01:35<05:24, 14.69it/s][A[A[A[A[A[A[A[A[A








 28%|██▊       | 1895/6666 [01:35<05:04, 15.68it/s][A[A[A[A[A[A[A[A[A








 28%|██▊       | 1897/6666 [01:35<05:00, 15.88it/s][A[A[A[A[A[A[A[A[A








 28%|██▊       | 1899/6666 [01:35<04:54, 16.20it/s][A[A[A[A[A[A[A[A[A








 29%|██▊       | 1901/6666 [01:36<04:45, 16.67it/s][A[A[A[A[A[A[A[A[A








 29%|██▊       | 1903/6666 [01:36<04:32, 17.51it/s][A[A[A[A[A[A[A[A[A








 29%|██▊       | 1906/6666 [01:36<04:14, 18.73it/s][A[A[A[A[A[A[A[A[A








 29%|██▊       | 1909/6666 [01:36<04:11, 18.92it/s][A[A[A[A[A[A[A[A[A








 29%|██▊       | 1911/6666 [01:36<04:15, 18.59it/s][A[A[A[A[A[A[A[A[A








 29%|██▊       | 1915/6666 [01:36<03:51, 20.53it/s][A[A[A[A[A[A[A[A[A








 29%|██▉       | 1918/6666 [01:36<03:43, 21.29it/s][A[A[A[A[A[A[A[A[A








 29%|██▉       | 1921/6666 [01:36<03:38, 21

 33%|███▎      | 2179/6666 [01:51<03:13, 23.15it/s][A[A[A[A[A[A[A[A[A








 33%|███▎      | 2182/6666 [01:51<03:01, 24.75it/s][A[A[A[A[A[A[A[A[A








 33%|███▎      | 2185/6666 [01:51<03:16, 22.79it/s][A[A[A[A[A[A[A[A[A








 33%|███▎      | 2188/6666 [01:51<03:29, 21.38it/s][A[A[A[A[A[A[A[A[A








 33%|███▎      | 2191/6666 [01:51<03:34, 20.89it/s][A[A[A[A[A[A[A[A[A








 33%|███▎      | 2194/6666 [01:52<03:17, 22.60it/s][A[A[A[A[A[A[A[A[A








 33%|███▎      | 2197/6666 [01:52<03:06, 23.96it/s][A[A[A[A[A[A[A[A[A








 33%|███▎      | 2200/6666 [01:52<03:02, 24.40it/s][A[A[A[A[A[A[A[A[A








 33%|███▎      | 2203/6666 [01:52<03:09, 23.59it/s][A[A[A[A[A[A[A[A[A








 33%|███▎      | 2206/6666 [01:52<03:08, 23.69it/s][A[A[A[A[A[A[A[A[A








 33%|███▎      | 2209/6666 [01:52<03:20, 22.28it/s][A[A[A[A[A[A[A[A[A








 33%|███▎      | 2212/6666 [01:52<03:21, 22

 37%|███▋      | 2472/6666 [02:03<02:57, 23.59it/s][A[A[A[A[A[A[A[A[A








 37%|███▋      | 2475/6666 [02:03<02:54, 24.06it/s][A[A[A[A[A[A[A[A[A








 37%|███▋      | 2479/6666 [02:03<02:48, 24.92it/s][A[A[A[A[A[A[A[A[A








 37%|███▋      | 2482/6666 [02:03<02:54, 23.97it/s][A[A[A[A[A[A[A[A[A









In [64]:
vaccines.abstract.iloc[-5]

'It is of special significance to find a safe and effective vaccine against coronavirus disease 2019 (COVID-19) that can induce T cell and B cell -mediated immune responses. There is currently no vaccine to prevent COVID-19. In this project, a novel multi-epitope vaccine for COVID-19 virus based on surface glycoprotein was designed through application of bioinformatics methods. At the first, seventeen potent linear B-cell and T-cell binding epitopes from surface glycoprotein were predicted in silico, then the epitopes were joined together via different linkers. The immunogenicity of these epitopes was identified using IFN-{gamma} ELIspot assays. The IFN-{gamma} producing T cell variation ranged from 11.1 {+/-}1.2 SFU to 38.2 {+/-} 2.1 SFU per 10 6 PBMCs. One final vaccine was constructed which composed of 398 amino acids and attached to 50S ribosomal protein L7/L12 as adjuvant. Physicochemical properties, as well as antigenicity in the proposed vaccines, were checked for defining the v

In [65]:
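# Tokenizer for abstracts: lemmatize, lowercase, and drop stop words and punctuation.
# `stopwords` and `punctuations` are globals defined in the next cell.
parser = English()
def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    # Keep pronouns as written; spaCy 2.x uses "-PRON-" as the lemma of every pronoun.
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    return " ".join(mytokens)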

In [None]:
punctuations = string.punctuation
stopwords = list(STOP_WORDS)

In [83]:
tqdm.pandas()
abstracts["abstract_description"] = abstracts["abstract"].progress_apply(spacy_tokenizer)



In [90]:
abstracts.abstract_description.iloc[0]

'geographic spread 2019 novel coronavirus covid-19 infections epicenter wuhan china provided opportunity study natural history recently emerged virus publicly available event date data ongoing epidemic present study investigated incubation period time intervals govern epidemiological dynamics covid-19 infections results incubation period falls range 2&ndash;14 days 95 confidence mean 5 days approximated best fit lognormal distribution mean time illness onset hospital admission treatment and/or isolation estimated 3&ndash;4 days truncation 5&ndash;9 days right truncated based 95th percentile estimate incubation period recommend length quarantine 14 days median time delay 13 days illness onset death 17 days right truncation considered estimating covid-19 case fatality risk'

In [66]:
vectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(vaccines['abstract_noabbreviations']) # abstracts["abstract_description"]

In [67]:
# How do we determine the number of topics?
lda = LatentDirichletAllocation(n_components=10, max_iter=10, learning_method='online',verbose=True)
data_lda = lda.fit_transform(data_vectorized)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10
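
One common way to approach the "how many topics?" question is to fit LDA for several candidate values of `n_components` and compare perplexity on held-out documents. The following is a minimal sketch under stated assumptions, not the method used in this notebook: the toy corpus, the candidate list, and the helper name `perplexity_by_topic_count` are all hypothetical, and only `LatentDirichletAllocation.perplexity` and `train_test_split` are real scikit-learn calls.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the abstracts.
base_docs = [
    "vaccine induces neutralizing antibodies against the virus",
    "neutralizing antibodies bind the spike protein",
    "spike protein structure guides vaccine design",
    "drug binding to the viral protease protein",
    "molecular structure of drug binding peptides",
    "public health control of infectious disease",
    "vaccination programs improve public health",
    "infectious disease outbreaks require vaccination and control",
]
toy_docs = base_docs * 4


def perplexity_by_topic_count(docs, candidates, seed=0):
    """Fit one LDA model per candidate topic count; return held-out perplexity for each."""
    train_docs, test_docs = train_test_split(docs, test_size=0.25, random_state=seed)
    vec = CountVectorizer(stop_words="english")
    X_train = vec.fit_transform(train_docs)   # vocabulary comes from training docs only
    X_test = vec.transform(test_docs)
    scores = {}
    for k in candidates:
        model = LatentDirichletAllocation(
            n_components=k, max_iter=10, learning_method="online", random_state=seed
        )
        model.fit(X_train)
        scores[k] = model.perplexity(X_test)  # lower is better
    return scores


scores = perplexity_by_topic_count(toy_docs, candidates=[2, 3, 4])
best_k = min(scores, key=scores.get)
print(scores, "-> lowest held-out perplexity at k =", best_k)
```

On real abstracts the perplexity curve is often noisy, so it is probably best read alongside a manual look at the top terms per topic rather than as the only criterion.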


In [68]:
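def selected_topics(model, vectorizer, top_n=10):
    # Look up the vocabulary once instead of once per printed term.
    # (scikit-learn >= 1.0 provides get_feature_names_out() for the same purpose.)
    feature_names = vectorizer.get_feature_names()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])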

In [69]:
print("LDA Model:")
selected_topics(lda, vectorizer)

LDA Model:
Topic 0:
[('vaccine', 2192.680779980336), ('disease', 1776.8429811449632), ('health', 1492.3306565077617), ('vaccines', 1349.2935450237783), ('virus', 1307.4587099015944), ('infectious', 984.9000526115249), ('diseases', 901.4981328214324), ('vaccination', 844.0977141085943), ('control', 733.5851657142026), ('development', 730.2851330833446)]
Topic 1:
[('protein', 914.7381254206164), ('drug', 706.3341044651825), ('proteins', 647.690698262426), ('binding', 647.6490054125255), ('structure', 611.6579524438013), ('peptides', 557.7706130488925), ('potential', 481.02409473966134), ('molecular', 472.62903862958797), ('structural', 470.49202058615504), ('design', 458.15161117636484)]
Topic 2:
[('virus', 6460.56802216828), ('protein', 2542.5406316911267), ('antibodies', 2483.9095625275318), ('human', 1815.2193188967678), ('vaccine', 1669.2005700559453), ('antibody', 1503.4465480650542), ('porcine', 1419.3577280640386), ('domain', 1198.3849710760499), ('neutralizing', 1159.466968365216

An alternative: collect all relevant entities and build a topic map from those. The LDA topics are nice to look at, but what we actually care about are the "protein", "rna", "dna", ... entities, because we want to know which ones are being researched right now.
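
The idea above can be sketched with plain Python: given the `(text, label)` pairs that an NER model returns per abstract, keep only the labels of interest and count in how many abstracts each entity appears. This is a rough illustration only; the label names, the example data, and the helper `count_target_entities` are assumptions, not output from the notebook.

```python
from collections import Counter

# Assumed label names; adjust to whatever the NER model actually emits.
TARGET_LABELS = {"PROTEIN", "DNA", "RNA"}


def count_target_entities(entities_per_doc, target_labels=TARGET_LABELS):
    """Count, for each entity of interest, the number of documents mentioning it."""
    counts = Counter()
    for ents in entities_per_doc:
        # A set per document: repeated mentions in one abstract count once.
        seen = {text.lower() for text, label in ents if label in target_labels}
        counts.update(seen)
    return counts


# Hypothetical NER output for two abstracts.
example = [
    [("spike protein", "PROTEIN"), ("Spike protein", "PROTEIN"), ("viral RNA", "RNA")],
    [("spike protein", "PROTEIN"), ("T cell", "CELL_TYPE")],
]
counts = count_target_entities(example)
print(counts.most_common(2))
```

Counting documents rather than raw mentions keeps a single long abstract from dominating the ranking; raw mention counts would be an equally defensible choice.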

In [70]:
# since we saw differences between models, it would be helpful to collect all recognized entities
nlp_protein = spacy.load("en_ner_jnlpba_md")
nlp_disease = spacy.load("en_ner_bc5cdr_md")
nlp_genes   = spacy.load("en_ner_bionlp13cg_md")

In [72]:
nlp = spacy.load("en_ner_bionlp13cg_md")
doc = nlp(abstracts.iloc[sample,7])
spacy.displacy.render(doc, style='ent',jupyter=True)

In [191]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

coronavirus 25 36 ORGANISM
Wuhan 71 76 ORGANISM
2019-nCoV. 315 325 GENE_OR_GENE_PRODUCT
coronaviruses 361 374 ORGANISM


In [195]:
vaccines['Entities'] = ''
vaccines['EntitiesLabels'] = ''

In [196]:
# Run all three NER models on the first 100 abstracts and merge their entities.
# .items() yields each row's index label, which is what .at expects; a positional
# counter from enumerate() only matches when the index is exactly 0..n-1.
for idx, a in vaccines.abstract.iloc[:100].items():
    docs = [nlp_protein(a), nlp_disease(a), nlp_genes(a)]

    vaccines.at[idx, 'Entities'] = [ent.text for doc in docs for ent in doc.ents] # list(set())
    vaccines.at[idx, 'EntitiesLabels'] = [ent.label_ for doc in docs for ent in doc.ents]
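
Because several models tag the same abstract, the merged lists can contain the same mention more than once. A plain `set()` would remove duplicates but also lose the order and the pairing between the entity and label lists. The sketch below de-duplicates `(text, label)` pairs while keeping first-seen order; the helper name `dedupe_entities` and the sample values are hypothetical.

```python
def dedupe_entities(texts, labels):
    """Drop repeated (text, label) pairs while keeping first-seen order."""
    # dict.fromkeys preserves insertion order (Python 3.7+), unlike set().
    unique_pairs = list(dict.fromkeys(zip(texts, labels)))
    unique_texts = [text for text, _ in unique_pairs]
    unique_labels = [label for _, label in unique_pairs]
    return unique_texts, unique_labels


# Hypothetical merged output from three models for one abstract.
texts = ["ACE2", "SARS", "ACE2", "ACE2"]
labels = ["PROTEIN", "DISEASE", "PROTEIN", "GENE_OR_GENE_PRODUCT"]
unique_texts, unique_labels = dedupe_entities(texts, labels)
print(list(zip(unique_texts, unique_labels)))
```

The same text tagged with two different labels is kept twice on purpose, since disagreement between models may itself be informative.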

In [201]:
def extract_named_ents(text):
    # Uses the global `nlp` pipeline (en_ner_bionlp13cg_md, loaded above).
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

In [244]:
vaccines['Entities'] = vaccines[~vaccines.abstract_noabbreviations.isna()]['abstract_noabbreviations'].progress_apply(extract_named_ents)




















 23%|██▎       | 1547/6666 [01:20<04:11, 20.32it/s][A[A[A[A[A[A[A[A[A[A









 23%|██▎       | 1550/6666 [01:20<03:49, 22.28it/s][A[A[A[A[A[A[A[A[A[A









 23%|██▎       | 1553/6666 [01:21<04:08, 20.54it/s][A[A[A[A[A[A[A[A[A[A










 27%|██▋       | 1795/6666 [01:36<03:49, 21.26it/s][A[A[A[A[A[A[A[A[A[A









 27%|██▋       | 1798/6666 [01:36<03:48, 21.29it/s][A[A[A[A[A[A[A[A[A[A









 27%|██▋       | 1801/6666 [01:36<04:09, 19.48it/s][A[A[A[A[A[A[A[A[A[A









 27%|██▋       | 1804/6666 [01:36<04:16, 18.93it/s][A[A[A[A[A[A[A[A[A[A









 27%|██▋       | 1806/6666 [01:36<04:28, 18.09it/s][A[A[A[A[A[A[A[A[A[A









 27%|██▋       | 1810/6666 [01:36<03:47, 21.36it/s][A[A[A[A[A[A[A[A[A[A









 27%|██▋       | 1813/6666 [01:36<03:30, 23.06it/s][A[A[A[A[A[A[A[A[A[A









 27%|██▋       | 1816/6666 [01:37<03:20, 24.18it/s][A[A[A[A[A[A[A[A[A[A









 27%|██▋       | 1819/6666 [01:37<03:28, 23.21it/s][A[A[A[A[A[A[A[A[A[A









 27%|██▋       | 1822/6666 [01:37<03:19, 24.30it/s][A[A[A[A[A[A[A[A[A[A









 27%|██▋       | 1825/6666 [01:37<03:10, 25.42it/s][A[A[A[A[A[A[A[A[A[A










 31%|███       | 2069/6666 [01:50<03:50, 19.94it/s][A[A[A[A[A[A[A[A[A[A









 31%|███       | 2072/6666 [01:50<03:48, 20.13it/s][A[A[A[A[A[A[A[A[A[A









 31%|███       | 2075/6666 [01:50<03:46, 20.29it/s][A[A[A[A[A[A[A[A[A[A









 31%|███       | 2078/6666 [01:50<03:50, 19.93it/s][A[A[A[A[A[A[A[A[A[A









 31%|███       | 2081/6666 [01:50<03:51, 19.79it/s][A[A[A[A[A[A[A[A[A[A









 31%|███▏      | 2084/6666 [01:50<03:46, 20.25it/s][A[A[A[A[A[A[A[A[A[A









 31%|███▏      | 2087/6666 [01:51<03:35, 21.23it/s][A[A[A[A[A[A[A[A[A[A









 31%|███▏      | 2090/6666 [01:51<03:16, 23.24it/s][A[A[A[A[A[A[A[A[A[A









 31%|███▏      | 2093/6666 [01:51<03:11, 23.83it/s][A[A[A[A[A[A[A[A[A[A









 31%|███▏      | 2097/6666 [01:51<02:56, 25.87it/s][A[A[A[A[A[A[A[A[A[A









 32%|███▏      | 2100/6666 [01:51<02:59, 25.49it/s][A[A[A[A[A[A[A[A[A[A










 35%|███▌      | 2337/6666 [02:02<04:30, 16.02it/s][A[A[A[A[A[A[A[A[A[A









 35%|███▌      | 2339/6666 [02:02<04:27, 16.18it/s][A[A[A[A[A[A[A[A[A[A









 35%|███▌      | 2342/6666 [02:02<03:59, 18.04it/s][A[A[A[A[A[A[A[A[A[A









 35%|███▌      | 2345/6666 [02:02<03:31, 20.43it/s][A[A[A[A[A[A[A[A[A[A









 35%|███▌      | 2348/6666 [02:02<03:11, 22.54it/s][A[A[A[A[A[A[A[A[A[A









 35%|███▌      | 2351/6666 [02:03<03:01, 23.80it/s][A[A[A[A[A[A[A[A[A[A









 35%|███▌      | 2354/6666 [02:03<03:24, 21.07it/s][A[A[A[A[A[A[A[A[A[A









 35%|███▌      | 2357/6666 [02:03<03:37, 19.79it/s][A[A[A[A[A[A[A[A[A[A









 35%|███▌      | 2360/6666 [02:03<03:21, 21.37it/s][A[A[A[A[A[A[A[A[A[A









 35%|███▌      | 2363/6666 [02:03<03:19, 21.59it/s][A[A[A[A[A[A[A[A[A[A









 35%|███▌      | 2366/6666 [02:03<03:28, 20.63it/s][A[A[A[A[A[A[A[A[A[A










 39%|███▉      | 2593/6666 [02:15<03:17, 20.67it/s][A[A[A[A[A[A[A[A[A[A









 39%|███▉      | 2596/6666 [02:15<03:08, 21.62it/s][A[A[A[A[A[A[A[A[A[A









 39%|███▉      | 2599/6666 [02:15<02:59, 22.60it/s][A[A[A[A[A[A[A[A[A[A









 39%|███▉      | 2602/6666 [02:15<03:13, 20.98it/s][A[A[A[A[A[A[A[A[A[A









 39%|███▉      | 2605/6666 [02:15<03:45, 18.02it/s][A[A[A[A[A[A[A[A[A[A









 39%|███▉      | 2608/6666 [02:15<03:22, 20.06it/s][A[A[A[A[A[A[A[A[A[A









 39%|███▉      | 2611/6666 [02:15<03:11, 21.14it/s][A[A[A[A[A[A[A[A[A[A









 39%|███▉      | 2614/6666 [02:16<03:51, 17.53it/s][A[A[A[A[A[A[A[A[A[A









 39%|███▉      | 2616/6666 [02:16<04:17, 15.74it/s][A[A[A[A[A[A[A[A[A[A









 39%|███▉      | 2618/6666 [02:16<04:40, 14.43it/s][A[A[A[A[A[A[A[A[A[A









 39%|███▉      | 2620/6666 [02:16<04:53, 13.78it/s][A[A[A[A[A[A[A[A[A[A










 43%|████▎     | 2850/6666 [02:27<03:00, 21.14it/s][A[A[A[A[A[A[A[A[A[A









 43%|████▎     | 2853/6666 [02:27<02:47, 22.81it/s][A[A[A[A[A[A[A[A[A[A









 43%|████▎     | 2857/6666 [02:27<02:38, 23.99it/s][A[A[A[A[A[A[A[A[A[A









 43%|████▎     | 2860/6666 [02:28<02:47, 22.72it/s][A[A[A[A[A[A[A[A[A[A









 43%|████▎     | 2863/6666 [02:28<02:44, 23.15it/s][A[A[A[A[A[A[A[A[A[A









 43%|████▎     | 2866/6666 [02:28<02:38, 24.04it/s][A[A[A[A[A[A[A[A[A[A









 43%|████▎     | 2869/6666 [02:28<03:03, 20.66it/s][A[A[A[A[A[A[A[A[A[A









 43%|████▎     | 2872/6666 [02:28<02:49, 22.44it/s][A[A[A[A[A[A[A[A[A[A









 43%|████▎     | 2876/6666 [02:28<02:37, 24.09it/s][A[A[A[A[A[A[A[A[A[A









 43%|████▎     | 2879/6666 [02:28<02:48, 22.48it/s][A[A[A[A[A[A[A[A[A[A









 43%|████▎     | 2882/6666 [02:28<02:42, 23.35it/s][A[A[A[A[A[A[A[A[A[A










 47%|████▋     | 3118/6666 [02:40<02:58, 19.87it/s][A[A[A[A[A[A[A[A[A[A









 47%|████▋     | 3121/6666 [02:40<02:50, 20.74it/s][A[A[A[A[A[A[A[A[A[A









 47%|████▋     | 3124/6666 [02:40<02:42, 21.84it/s][A[A[A[A[A[A[A[A[A[A









 47%|████▋     | 3127/6666 [02:40<03:00, 19.59it/s][A[A[A[A[A[A[A[A[A[A









 47%|████▋     | 3130/6666 [02:40<03:24, 17.28it/s][A[A[A[A[A[A[A[A[A[A









 47%|████▋     | 3133/6666 [02:41<03:10, 18.52it/s][A[A[A[A[A[A[A[A[A[A









 47%|████▋     | 3135/6666 [02:41<03:08, 18.69it/s][A[A[A[A[A[A[A[A[A[A









 47%|████▋     | 3138/6666 [02:41<02:52, 20.42it/s][A[A[A[A[A[A[A[A[A[A









 47%|████▋     | 3141/6666 [02:41<02:41, 21.78it/s][A[A[A[A[A[A[A[A[A[A









 47%|████▋     | 3145/6666 [02:41<02:33, 22.99it/s][A[A[A[A[A[A[A[A[A[A









 47%|████▋     | 3148/6666 [02:41<02:46, 21.17it/s][A[A[A[A[A[A[A[A[A[A










 51%|█████     | 3391/6666 [02:52<02:45, 19.80it/s][A[A[A[A[A[A[A[A[A[A









 51%|█████     | 3394/6666 [02:52<02:49, 19.26it/s][A[A[A[A[A[A[A[A[A[A









 51%|█████     | 3396/6666 [02:52<02:57, 18.39it/s][A[A[A[A[A[A[A[A[A[A









 51%|█████     | 3398/6666 [02:52<03:14, 16.83it/s][A[A[A[A[A[A[A[A[A[A









 51%|█████     | 3400/6666 [02:52<03:28, 15.68it/s][A[A[A[A[A[A[A[A[A[A









 51%|█████     | 3402/6666 [02:52<03:37, 14.99it/s][A[A[A[A[A[A[A[A[A[A









 51%|█████     | 3404/6666 [02:53<03:21, 16.15it/s][A[A[A[A[A[A[A[A[A[A









 51%|█████     | 3407/6666 [02:53<03:00, 18.05it/s][A[A[A[A[A[A[A[A[A[A









 51%|█████     | 3410/6666 [02:53<02:49, 19.20it/s][A[A[A[A[A[A[A[A[A[A









 51%|█████     | 3413/6666 [02:53<02:43, 19.86it/s][A[A[A[A[A[A[A[A[A[A









 51%|█████     | 3416/6666 [02:53<03:03, 17.69it/s][A[A[A[A[A[A[A[A[A[A










 55%|█████▍    | 3658/6666 [03:04<02:13, 22.52it/s][A[A[A[A[A[A[A[A[A[A









 55%|█████▍    | 3661/6666 [03:04<02:18, 21.65it/s][A[A[A[A[A[A[A[A[A[A









 55%|█████▍    | 3664/6666 [03:04<02:30, 20.01it/s][A[A[A[A[A[A[A[A[A[A









 55%|█████▌    | 3667/6666 [03:04<02:22, 21.03it/s][A[A[A[A[A[A[A[A[A[A









 55%|█████▌    | 3671/6666 [03:05<02:07, 23.45it/s][A[A[A[A[A[A[A[A[A[A









 55%|█████▌    | 3674/6666 [03:05<02:18, 21.64it/s][A[A[A[A[A[A[A[A[A[A









 55%|█████▌    | 3677/6666 [03:05<02:08, 23.19it/s][A[A[A[A[A[A[A[A[A[A









 55%|█████▌    | 3680/6666 [03:05<02:03, 24.26it/s][A[A[A[A[A[A[A[A[A[A









 55%|█████▌    | 3683/6666 [03:05<01:57, 25.35it/s][A[A[A[A[A[A[A[A[A[A









 55%|█████▌    | 3686/6666 [03:05<02:46, 17.86it/s][A[A[A[A[A[A[A[A[A[A









 55%|█████▌    | 3689/6666 [03:05<02:44, 18.08it/s][A[A[A[A[A[A[A[A[A[A










 59%|█████▉    | 3936/6666 [03:16<01:36, 28.27it/s][A[A[A[A[A[A[A[A[A[A









 59%|█████▉    | 3940/6666 [03:17<01:31, 29.68it/s][A[A[A[A[A[A[A[A[A[A









 59%|█████▉    | 3944/6666 [03:17<01:30, 30.24it/s][A[A[A[A[A[A[A[A[A[A









 59%|█████▉    | 3948/6666 [03:17<01:27, 30.93it/s][A[A[A[A[A[A[A[A[A[A









 59%|█████▉    | 3952/6666 [03:17<01:31, 29.69it/s][A[A[A[A[A[A[A[A[A[A









 59%|█████▉    | 3956/6666 [03:17<01:38, 27.61it/s][A[A[A[A[A[A[A[A[A[A









 59%|█████▉    | 3959/6666 [03:17<01:47, 25.22it/s][A[A[A[A[A[A[A[A[A[A









 59%|█████▉    | 3962/6666 [03:17<01:42, 26.28it/s][A[A[A[A[A[A[A[A[A[A









 59%|█████▉    | 3965/6666 [03:17<01:39, 27.07it/s][A[A[A[A[A[A[A[A[A[A









 60%|█████▉    | 3968/6666 [03:18<01:41, 26.51it/s][A[A[A[A[A[A[A[A[A[A









 60%|█████▉    | 3971/6666 [03:18<01:45, 25.44it/s][A[A[A[A[A[A[A[A[A[A










 63%|██████▎   | 4217/6666 [03:29<02:15, 18.02it/s][A[A[A[A[A[A[A[A[A[A









 63%|██████▎   | 4219/6666 [03:29<02:18, 17.66it/s][A[A[A[A[A[A[A[A[A[A









 63%|██████▎   | 4222/6666 [03:29<02:02, 19.90it/s][A[A[A[A[A[A[A[A[A[A









 63%|██████▎   | 4225/6666 [03:29<02:01, 20.07it/s][A[A[A[A[A[A[A[A[A[A









 63%|██████▎   | 4228/6666 [03:29<02:04, 19.52it/s][A[A[A[A[A[A[A[A[A[A









 63%|██████▎   | 4231/6666 [03:29<02:03, 19.77it/s][A[A[A[A[A[A[A[A[A[A









 64%|██████▎   | 4234/6666 [03:29<01:57, 20.70it/s][A[A[A[A[A[A[A[A[A[A









 64%|██████▎   | 4237/6666 [03:30<01:55, 21.10it/s][A[A[A[A[A[A[A[A[A[A









 64%|██████▎   | 4240/6666 [03:30<01:56, 20.75it/s][A[A[A[A[A[A[A[A[A[A









 64%|██████▎   | 4243/6666 [03:30<02:04, 19.50it/s][A[A[A[A[A[A[A[A[A[A









 64%|██████▎   | 4245/6666 [03:30<02:03, 19.54it/s][A[A[A[A[A[A[A[A[A[A










 67%|██████▋   | 4474/6666 [03:41<01:42, 21.32it/s][A[A[A[A[A[A[A[A[A[A









 67%|██████▋   | 4477/6666 [03:41<01:48, 20.15it/s][A[A[A[A[A[A[A[A[A[A









 67%|██████▋   | 4481/6666 [03:41<01:40, 21.71it/s][A[A[A[A[A[A[A[A[A[A









 67%|██████▋   | 4484/6666 [03:42<01:41, 21.50it/s][A[A[A[A[A[A[A[A[A[A









 67%|██████▋   | 4487/6666 [03:42<01:43, 21.04it/s][A[A[A[A[A[A[A[A[A[A









 67%|██████▋   | 4490/6666 [03:42<01:49, 19.89it/s][A[A[A[A[A[A[A[A[A[A









 67%|██████▋   | 4493/6666 [03:42<01:43, 21.00it/s][A[A[A[A[A[A[A[A[A[A









 67%|██████▋   | 4496/6666 [03:42<01:36, 22.41it/s][A[A[A[A[A[A[A[A[A[A









 67%|██████▋   | 4499/6666 [03:42<01:32, 23.51it/s][A[A[A[A[A[A[A[A[A[A









 68%|██████▊   | 4503/6666 [03:42<01:26, 24.92it/s][A[A[A[A[A[A[A[A[A[A









 68%|██████▊   | 4507/6666 [03:43<01:21, 26.64it/s][A[A[A[A[A[A[A[A[A[A










 71%|███████   | 4748/6666 [03:54<01:24, 22.79it/s][A[A[A[A[A[A[A[A[A[A









 71%|███████▏  | 4751/6666 [03:54<01:25, 22.38it/s][A[A[A[A[A[A[A[A[A[A









 71%|███████▏  | 4754/6666 [03:54<01:20, 23.74it/s][A[A[A[A[A[A[A[A[A[A









 71%|███████▏  | 4757/6666 [03:55<01:28, 21.47it/s][A[A[A[A[A[A[A[A[A[A









 71%|███████▏  | 4760/6666 [03:55<01:22, 23.22it/s][A[A[A[A[A[A[A[A[A[A









 71%|███████▏  | 4763/6666 [03:55<01:18, 24.23it/s][A[A[A[A[A[A[A[A[A[A









 71%|███████▏  | 4766/6666 [03:55<01:16, 24.96it/s][A[A[A[A[A[A[A[A[A[A









 72%|███████▏  | 4769/6666 [03:55<01:25, 22.29it/s][A[A[A[A[A[A[A[A[A[A









 72%|███████▏  | 4773/6666 [03:55<01:16, 24.62it/s][A[A[A[A[A[A[A[A[A[A









 72%|███████▏  | 4776/6666 [03:55<01:15, 25.04it/s][A[A[A[A[A[A[A[A[A[A









 72%|███████▏  | 4780/6666 [03:55<01:12, 26.02it/s][A[A[A[A[A[A[A[A[A[A










 75%|███████▍  | 4994/6666 [04:07<01:54, 14.59it/s][A[A[A[A[A[A[A[A[A[A









 75%|███████▍  | 4996/6666 [04:07<01:54, 14.57it/s][A[A[A[A[A[A[A[A[A[A









 75%|███████▍  | 4999/6666 [04:08<01:43, 16.07it/s][A[A[A[A[A[A[A[A[A[A









 75%|███████▌  | 5002/6666 [04:08<01:34, 17.64it/s][A[A[A[A[A[A[A[A[A[A









 75%|███████▌  | 5005/6666 [04:08<01:31, 18.13it/s][A[A[A[A[A[A[A[A[A[A









 75%|███████▌  | 5008/6666 [04:08<01:23, 19.79it/s][A[A[A[A[A[A[A[A[A[A









 75%|███████▌  | 5011/6666 [04:08<01:18, 21.07it/s][A[A[A[A[A[A[A[A[A[A









 75%|███████▌  | 5014/6666 [04:08<01:12, 22.84it/s][A[A[A[A[A[A[A[A[A[A









 75%|███████▌  | 5017/6666 [04:08<01:17, 21.29it/s][A[A[A[A[A[A[A[A[A[A









 75%|███████▌  | 5020/6666 [04:09<01:28, 18.55it/s][A[A[A[A[A[A[A[A[A[A









 75%|███████▌  | 5022/6666 [04:09<01:28, 18.62it/s][A[A[A[A[A[A[A[A[A[A










 79%|███████▉  | 5257/6666 [04:20<00:53, 26.45it/s][A[A[A[A[A[A[A[A[A[A









 79%|███████▉  | 5260/6666 [04:20<00:52, 27.02it/s][A[A[A[A[A[A[A[A[A[A









 79%|███████▉  | 5264/6666 [04:20<00:50, 27.82it/s][A[A[A[A[A[A[A[A[A[A









 79%|███████▉  | 5267/6666 [04:20<00:51, 27.19it/s][A[A[A[A[A[A[A[A[A[A









 79%|███████▉  | 5270/6666 [04:20<00:52, 26.70it/s][A[A[A[A[A[A[A[A[A[A









 79%|███████▉  | 5273/6666 [04:20<00:54, 25.42it/s][A[A[A[A[A[A[A[A[A[A









 79%|███████▉  | 5276/6666 [04:20<01:00, 23.05it/s][A[A[A[A[A[A[A[A[A[A









 79%|███████▉  | 5279/6666 [04:21<00:58, 23.66it/s][A[A[A[A[A[A[A[A[A[A









 79%|███████▉  | 5283/6666 [04:21<00:54, 25.34it/s][A[A[A[A[A[A[A[A[A[A









 79%|███████▉  | 5286/6666 [04:21<00:57, 24.13it/s][A[A[A[A[A[A[A[A[A[A









 79%|███████▉  | 5289/6666 [04:21<00:56, 24.46it/s][A[A[A[A[A[A[A[A[A[A










 83%|████████▎ | 5541/6666 [04:31<00:42, 26.33it/s][A[A[A[A[A[A[A[A[A[A









 83%|████████▎ | 5544/6666 [04:31<00:41, 26.95it/s][A[A[A[A[A[A[A[A[A[A









 83%|████████▎ | 5547/6666 [04:31<00:48, 23.21it/s][A[A[A[A[A[A[A[A[A[A









 83%|████████▎ | 5551/6666 [04:32<00:44, 25.15it/s][A[A[A[A[A[A[A[A[A[A









 83%|████████▎ | 5554/6666 [04:32<00:42, 26.43it/s][A[A[A[A[A[A[A[A[A[A









 83%|████████▎ | 5557/6666 [04:32<00:43, 25.58it/s][A[A[A[A[A[A[A[A[A[A









 83%|████████▎ | 5560/6666 [04:32<00:41, 26.34it/s][A[A[A[A[A[A[A[A[A[A









 83%|████████▎ | 5563/6666 [04:32<00:41, 26.63it/s][A[A[A[A[A[A[A[A[A[A









 83%|████████▎ | 5566/6666 [04:32<00:44, 24.66it/s][A[A[A[A[A[A[A[A[A[A









 84%|████████▎ | 5569/6666 [04:32<00:47, 23.18it/s][A[A[A[A[A[A[A[A[A[A









 84%|████████▎ | 5572/6666 [04:32<00:48, 22.73it/s][A[A[A[A[A[A[A[A[A[A










 87%|████████▋ | 5822/6666 [04:42<00:37, 22.49it/s][A[A[A[A[A[A[A[A[A[A









 87%|████████▋ | 5825/6666 [04:43<00:34, 24.08it/s][A[A[A[A[A[A[A[A[A[A









 87%|████████▋ | 5829/6666 [04:43<00:32, 25.40it/s][A[A[A[A[A[A[A[A[A[A









 88%|████████▊ | 5833/6666 [04:43<00:31, 26.52it/s][A[A[A[A[A[A[A[A[A[A









 88%|████████▊ | 5836/6666 [04:43<00:31, 26.19it/s][A[A[A[A[A[A[A[A[A[A









 88%|████████▊ | 5839/6666 [04:43<00:30, 26.83it/s][A[A[A[A[A[A[A[A[A[A









 88%|████████▊ | 5842/6666 [04:43<00:32, 25.07it/s][A[A[A[A[A[A[A[A[A[A









 88%|████████▊ | 5845/6666 [04:43<00:32, 25.18it/s][A[A[A[A[A[A[A[A[A[A









 88%|████████▊ | 5848/6666 [04:43<00:32, 25.27it/s][A[A[A[A[A[A[A[A[A[A









 88%|████████▊ | 5851/6666 [04:44<00:32, 25.05it/s][A[A[A[A[A[A[A[A[A[A









 88%|████████▊ | 5854/6666 [04:44<00:33, 24.02it/s][A[A[A[A[A[A[A[A[A[A










 92%|█████████▏| 6112/6666 [04:54<00:20, 26.39it/s][A[A[A[A[A[A[A[A[A[A









 92%|█████████▏| 6115/6666 [04:54<00:20, 26.61it/s][A[A[A[A[A[A[A[A[A[A









 92%|█████████▏| 6118/6666 [04:54<00:19, 27.44it/s][A[A[A[A[A[A[A[A[A[A









 92%|█████████▏| 6122/6666 [04:54<00:18, 28.84it/s][A[A[A[A[A[A[A[A[A[A









 92%|█████████▏| 6126/6666 [04:54<00:18, 29.69it/s][A[A[A[A[A[A[A[A[A[A









 92%|█████████▏| 6130/6666 [04:54<00:18, 28.28it/s][A[A[A[A[A[A[A[A[A[A









 92%|█████████▏| 6134/6666 [04:54<00:18, 29.13it/s][A[A[A[A[A[A[A[A[A[A









 92%|█████████▏| 6137/6666 [04:54<00:18, 28.52it/s][A[A[A[A[A[A[A[A[A[A









 92%|█████████▏| 6140/6666 [04:55<00:19, 26.58it/s][A[A[A[A[A[A[A[A[A[A









 92%|█████████▏| 6144/6666 [04:55<00:18, 27.71it/s][A[A[A[A[A[A[A[A[A[A









 92%|█████████▏| 6147/6666 [04:55<00:18, 27.53it/s][A[A[A[A[A[A[A[A[A[A










 96%|█████████▌| 6401/6666 [05:05<00:12, 21.11it/s][A[A[A[A[A[A[A[A[A[A









 96%|█████████▌| 6404/6666 [05:05<00:13, 19.93it/s][A[A[A[A[A[A[A[A[A[A









 96%|█████████▌| 6407/6666 [05:06<00:14, 17.80it/s][A[A[A[A[A[A[A[A[A[A









 96%|█████████▌| 6410/6666 [05:06<00:12, 19.92it/s][A[A[A[A[A[A[A[A[A[A









 96%|█████████▌| 6413/6666 [05:06<00:12, 20.22it/s][A[A[A[A[A[A[A[A[A[A









 96%|█████████▌| 6416/6666 [05:06<00:11, 21.41it/s][A[A[A[A[A[A[A[A[A[A









 96%|█████████▋| 6419/6666 [05:06<00:11, 22.18it/s][A[A[A[A[A[A[A[A[A[A









 96%|█████████▋| 6423/6666 [05:06<00:10, 24.07it/s][A[A[A[A[A[A[A[A[A[A









 96%|█████████▋| 6426/6666 [05:06<00:09, 24.68it/s][A[A[A[A[A[A[A[A[A[A









 96%|█████████▋| 6429/6666 [05:07<00:09, 25.19it/s][A[A[A[A[A[A[A[A[A[A









 96%|█████████▋| 6432/6666 [05:07<00:09, 24.00it/s][A[A[A[A[A[A[A[A[A[A










100%|█████████▉| 6643/6666 [05:18<00:00, 25.18it/s][A[A[A[A[A[A[A[A[A[A









100%|█████████▉| 6646/6666 [05:18<00:00, 24.28it/s][A[A[A[A[A[A[A[A[A[A









100%|█████████▉| 6649/6666 [05:18<00:00, 24.62it/s][A[A[A[A[A[A[A[A[A[A









100%|█████████▉| 6652/6666 [05:18<00:00, 24.05it/s][A[A[A[A[A[A[A[A[A[A









100%|█████████▉| 6655/6666 [05:19<00:00, 20.69it/s][A[A[A[A[A[A[A[A[A[A









100%|█████████▉| 6659/6666 [05:19<00:00, 22.71it/s][A[A[A[A[A[A[A[A[A[A









100%|█████████▉| 6662/6666 [05:19<00:00, 24.29it/s][A[A[A[A[A[A[A[A[A[A









100%|██████████| 6666/6666 [05:19<00:00, 20.86it/s]


In [270]:
from itertools import chain  # needed for chain.from_iterable; not among the imports in In [1]
s = pd.Series(list(chain.from_iterable(vaccines[~vaccines.Entities.isna()]['Entities'])))

In [271]:
s.str[1].value_counts()

ORGANISM                           47289
GENE_OR_GENE_PRODUCT               34153
CELL                               14202
SIMPLE_CHEMICAL                    12616
CELLULAR_COMPONENT                  4741
ORGAN                               4522
CANCER                              3648
TISSUE                              2574
ORGANISM_SUBSTANCE                  2311
MULTI-TISSUE_STRUCTURE              1964
IMMATERIAL_ANATOMICAL_ENTITY         564
AMINO_ACID                           388
PATHOLOGICAL_FORMATION               349
ORGANISM_SUBDIVISION                 278
ANATOMICAL_SYSTEM                    192
DEVELOPING_ANATOMICAL_STRUCTURE        1
dtype: int64

In [287]:
ent_label = 'GENE_OR_GENE_PRODUCT'  # Change this string to see results for different entities
s.str[0][s.str[1] == ent_label].value_counts().head(30)

”                                  391
antigen                            338
’s                                 266
hemagglutinin                      195
IFN-γ                              156
IFN-β                              144
nucleocapsid                       143
IFNAR1                             127
CD4                                121
IC(50                              114
miR-27b                            110
ACE2                               110
type I interferon                  108
ELECTRONIC SUPPLEMENTARY           102
interferons                         99
ubiquitin                           98
neuraminidase                       96
alpha-1,2-fucosyltransferase        95
Ifnar1(SA                           93
angiotensin-converting enzyme 2     91
CD8                                 87
IFN                                 86
MERS-CoV.                           85
dipeptidyl peptidase 4              83
interferon                          83
green fluorescent protein
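
The top counts above include tokenisation debris such as `”` and `’s`, which are not gene or protein names. One possible way to filter them before counting is sketched below. The helper name `drop_noisy_entities` and the `min_word_chars` threshold are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def drop_noisy_entities(texts: pd.Series, min_word_chars: int = 2) -> pd.Series:
    """Keep entity strings containing at least `min_word_chars` word characters.

    Hypothetical helper: the threshold is an assumption and may need tuning.
    """
    # Count letters, digits and underscores; quote marks and punctuation do not count.
    word_chars = texts.str.count(r"\w")
    return texts[word_chars >= min_word_chars]

# Toy input mirroring a few of the strings seen in the output above.
sample = pd.Series(["”", "antigen", "’s", "IFN-γ", "CD4"])
kept = drop_noisy_entities(sample)
```

In the notebook this could be applied to `s.str[0]` for the selected label before calling `value_counts()`.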

Visualise which entities (*ents*) have been researched over the years

In [242]:
vaccines.publish_time = pd.to_datetime(vaccines.publish_time)

ParserError: Unknown string format: 2006 Jun-Dec
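
The `ParserError` above comes from free-text values such as `2006 Jun-Dec` in `publish_time`. Since the aim is a per-year view, one workaround is to pull out just the four-digit year rather than parse full dates. This is a minimal sketch: the function name `extract_publish_year` and the assumption that every usable value contains a year in the range 1900 to 2099 are mine, not the notebook's.

```python
import pandas as pd

def extract_publish_year(raw: pd.Series) -> pd.Series:
    """Return the first four-digit year (19xx or 20xx) found in each value.

    Hypothetical helper: values without such a year become <NA>.
    """
    # Cast to str so missing values and numbers are handled uniformly.
    year = raw.astype(str).str.extract(r"((?:19|20)\d{2})", expand=False)
    # Nullable integer dtype keeps missing years as <NA> instead of NaN floats.
    return pd.to_numeric(year, errors="coerce").astype("Int64")

# Toy input covering the formats seen in the metadata and the failing value.
example = pd.Series(["2006 Jun-Dec", "2020", "2020-03-13", None])
years = extract_publish_year(example)
```

Applied to the notebook's data this might look like `vaccines['publish_year'] = extract_publish_year(vaccines.publish_time)`, where `publish_year` is an assumed new column name.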

In [198]:
s = pd.Series(list(chain.from_iterable(vaccines[:100]['EntitiesLabels']))).value_counts()
s.head(40)

ORGANISM                        569
DISEASE                         486
PROTEIN                         437
GENE_OR_GENE_PRODUCT            305
SIMPLE_CHEMICAL                 230
CHEMICAL                        212
CELL                             75
DNA                              51
ORGAN                            48
ORGANISM_SUBSTANCE               22
CELL_LINE                        22
CELL_TYPE                        21
CELLULAR_COMPONENT               16
MULTI-TISSUE_STRUCTURE           16
CANCER                           14
TISSUE                           11
PATHOLOGICAL_FORMATION            7
ORGANISM_SUBDIVISION              6
IMMATERIAL_ANATOMICAL_ENTITY      5
RNA                               3
AMINO_ACID                        2
dtype: int64

(1588,)

In [29]:
nlp = spacy.load("en_ner_bionlp13cg_md")
doc = nlp(abstracts.iloc[sample,7])
spacy.displacy.render(doc, style='ent',jupyter=True)

ValueError: [E007] 'Negex' already exists in pipeline. Existing names: ['tagger', 'parser', 'ner', 'Negex']

In [71]:
'''doc = nlp(abstracts.iloc[np.where(vaccine_idx)[0][6],7])
for e in doc.ents:
    print(e.text, e._.negex)'''

'doc = nlp(abstracts.iloc[np.where(vaccine_idx)[0][6],7])\nfor e in doc.ents:\n    print(e.text, e._.negex)'

In [45]:
'''for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, 
          token.shape_, token.is_alpha, token.is_stop)'''

'for token in doc:\n    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, \n          token.shape_, token.is_alpha, token.is_stop)'