### Import

In [1]:
import pubmed

### Download Neurosurgery articles

In [2]:
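db = 'pmc'  # Entrez database name for PubMed Central
# Candidate search queries; only query3 is used for the download below.
query = 'text classification neural network'
query0 = 'neurosurgery'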
query1 = '(neurosurg*) AND (complicat*[Text Word])'
query2 = '("Glioma/surgery"[Mesh] OR "Glioma/therapy"[Mesh])'
query3 = '("Neurosurgery"[Mesh] OR "Neurosurgical Procedures"[Mesh]) AND "Postoperative Complications"[Mesh]'
query4 = '("Neurosurgery"[Mesh] OR "Neurosurgical Procedures"[Mesh]) AND complicat*[Text Word]'
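The MeSH queries above share one pattern: alternatives joined with `OR` inside parentheses, then combined with `AND`. A minimal sketch of building such strings programmatically follows. `mesh_any` and `all_of` are hypothetical helpers introduced here for illustration; they are not part of the `pubmed` module.

```python
def mesh_any(*headings):
    """Join MeSH headings into a parenthesized OR group."""
    return '(' + ' OR '.join('"{0}"[Mesh]'.format(h) for h in headings) + ')'


def all_of(*clauses):
    """Combine query clauses with AND."""
    return ' AND '.join(clauses)


# Rebuild the strings assigned to query2 and query3 from their parts.
rebuilt_query2 = mesh_any('Glioma/surgery', 'Glioma/therapy')
rebuilt_query3 = all_of(
    mesh_any('Neurosurgery', 'Neurosurgical Procedures'),
    '"Postoperative Complications"[Mesh]',
)
print(rebuilt_query3)
```

Building the string from parts makes it harder to unbalance quotes or parentheses when a query is edited by hand.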

In [3]:
article_ids = pubmed.download_all_articles(query=query3, db=db, refresh=False, cache=True)

1583 articles found in pmc with query specified.
1583 articles are already stored in the database.
0 articles will be downloaded from pmc.
Total 1583 articles stored in the database.
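
The log above suggests that only articles missing from the local store are fetched. How `download_all_articles` actually implements `refresh` and `cache` is not shown here, so the following is only a sketch of that idea under assumed semantics. `plan_download` is a hypothetical name.

```python
def plan_download(found_ids, stored_ids, refresh=False):
    """Return the IDs that still need fetching, in their original order."""
    if refresh:
        # Assumed meaning of refresh=True: fetch everything again.
        return list(found_ids)
    stored = set(stored_ids)  # set membership keeps the filter linear
    return [article_id for article_id in found_ids if article_id not in stored]


found = ['7433549', '7440145', '7440095']
stored = ['7433549', '7440145']
to_fetch = plan_download(found, stored)

print('{0} articles found.'.format(len(found)))
print('{0} articles are already stored.'.format(len(found) - len(to_fetch)))
print('{0} articles will be downloaded.'.format(len(to_fetch)))
```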


### Generate dataset

In [4]:
items = pubmed.generate_dataset(db, article_ids)
items.head()

Unnamed: 0,pmid,pmc,pii,doi,art-access-id,sici,pmc-scan,medline,manuscript,other,...,journal-id_nlm-journal-id,journal-id_iso-abbrev,abstract_len,abstract,full_text_len,full_text,license-type,license,copyright,keywords
0,30773564,7433549,,10.19723/j.issn.1671-167X.2019.01.030,1671-167X-51-1-177,,,,,R614.4,...,,Beijing Da Xue Xue Bao,936,目的 比较超声引导下肌间沟臂丛神经阻滞和颈5-6神经根阻滞用于肩关节镜术后镇痛的效果。 方法...,0,,,,版权所有©《北京大学学报(医学版)》编辑部2019,肌间沟臂丛神经阻滞; 颈神经根阻滞; 肩关节手术; 镇痛
1,32176032,7440145,,10.1097/MD.0000000000019071,19071,,,,,,...,,Medicine (Baltimore),2266,Abstract Background: It is important to manage...,23380,1 Introduction A craniotomy is a surgical oper...,open-access,This is an open access article distributed und...,Copyright © 2020 the Author(s). Published by W...,acupuncture; craniotomy; cytokine; electroacup...
2,32332664,7440095,,10.1097/MD.0000000000019896,19896,,,,,,...,,Medicine (Baltimore),53,Supplemental Digital Content is available in t...,18530,1 Introduction Postoperative delirium (POD) is...,open-access,This is an open access article distributed und...,Copyright © 2020 the Author(s). Published by W...,inflammation; pain; paravertebral block; posto...
3,19213828,7051788,,10.3174/ajnr.A1453,,,,,,08-00789,...,,AJNR Am J Neuroradiol,1885,"BACKGROUND AND PURPOSE: Recently, surgeons hav...",0,,,,Copyright © American Society of Neuroradiology,
4,32011490,7220741,,10.1097/MD.0000000000018817,18817,,,,,,...,,Medicine (Baltimore),1890,Abstract Rationale: Although C5 palsy is a com...,7929,1 Introduction C5 palsy is a common complicati...,open-access,This is an open access article distributed und...,Copyright © 2020 the Author(s). Published by W...,endoscopic; foraminotomy; palsy; radiculopathy...


In [6]:
items.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1583 entries, 0 to 1582
Columns: 45 entries, pmid to keywords
dtypes: object(45)
memory usage: 568.9+ KB
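
Every column is reported as `object`, including length fields such as `abstract_len`. If those fields are stored as strings (an assumption; the output above does not say why they are `object`), they could be converted with `pd.to_numeric`. The toy frame below uses illustrative values only.

```python
import pandas as pd

# Toy frame mimicking all-object dtypes; the 'n/a' entry is made up for the demo.
df = pd.DataFrame(
    {
        'pmid': ['30773564', '32176032', '19213828'],
        'abstract_len': ['936', '2266', 'n/a'],
    },
    dtype=object,
)

# Unparseable entries become NaN, are filled with 0, then cast to integers.
df['abstract_len'] = (
    pd.to_numeric(df['abstract_len'], errors='coerce').fillna(0).astype('int64')
)
print(df.dtypes)
```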


In [10]:
items.to_csv('database/{0}_queryX.csv'.format(db), sep='|')
items.to_excel('database/{0}_queryX.xlsx'.format(db))
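
Because `to_csv` writes the index by default, reading such a file back without `index_col` produces an extra `Unnamed: 0` column. A small round trip through an in-memory buffer illustrates this; the frame and variable names are illustrative, not taken from the notebook.

```python
import io

import pandas as pd

items_demo = pd.DataFrame(
    {
        'pmid': ['30773564', '32176032'],
        'doi': ['10.19723/j.issn.1671-167X.2019.01.030', '10.1097/MD.0000000000019071'],
    }
)

buffer = io.StringIO()
items_demo.to_csv(buffer, sep='|')  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer, sep='|', dtype=str)  # index comes back as a data column

buffer.seek(0)
restored = pd.read_csv(buffer, sep='|', index_col=0, dtype=str)  # index restored

print(list(naive.columns))
print(list(restored.columns))
```

Passing `dtype=str` keeps identifiers such as `pmid` from being parsed as numbers on the way back in.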

## MySQL tests

In [4]:
import pubmed

# Parse a single cached PMC article into a flat dict of fields.
filename = 'pmc/7051788'

root = pubmed.get_element_tree(filename)
item = pubmed.parse_element_tree_pmc(root)

In [3]:
root = pubmed.get_element_tree(filename)
item, article_meta = pubmed.parse_top_level_subroots_pmc(root)
pubmed.parse_affiliations(article_meta, 'contrib-group/aff')

'(1) Brain Imaging and Behavior: Systems Neuroscience Krembil Research Institute, University Health Network Toronto Ontario Canada // (2) Department of Psychology University of Toronto Toronto Ontario Canada'
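
The returned string has the shape `(label) text // (label) text`. The internals of `pubmed.parse_affiliations` are not shown, so the sketch below is only one way such a string might be assembled with the standard library. `join_affiliations` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def join_affiliations(article_meta, path='contrib-group/aff'):
    """Join <aff> elements as '(label) text', separated by ' // '."""
    parts = []
    for aff in article_meta.findall(path):
        label = (aff.findtext('label') or '').strip()
        # Collect all text inside <aff> except the <label> child itself.
        pieces = [aff.text or '']
        for child in aff:
            if child.tag != 'label':
                pieces.append(''.join(child.itertext()))
            pieces.append(child.tail or '')
        text = ' '.join(''.join(pieces).split())  # normalize whitespace
        parts.append('({0}) {1}'.format(label, text) if label else text)
    return ' // '.join(parts)


sample = """
<article-meta>
  <contrib-group>
    <aff id="a1"><label>1</label>Department A <institution>Example University</institution> City</aff>
    <aff id="a2"><label>2</label>Department B</aff>
  </contrib-group>
</article-meta>
"""
joined = join_affiliations(ET.fromstring(sample))
print(joined)
```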

In [5]:
item

{'article-type': 'research-article',
 'pmid': '19213828',
 'pmc': '7051788',
 'other': '08-00789',
 'doi': '10.3174/ajnr.A1453',
 'category': 'Head & Neck',
 'title': 'The MR Imaging Appearance of the Vascular Pedicle Nasoseptal Flap',
 'authors': 'M.D. Kang (a); E. Escott (b); A.J. Thomas (c); R.L. Carrau (c,d); C.H. Snyderman (c,d); A.B. Kassam (c,d); W. Rothfus (b)',
 'affiliations': '(a) Department of Neuroradiology, Thomas Jefferson University Hospital, Philadelphia, Pa // (b) Department of Neuroradiology, University of Pittsburgh Medical Center, Pittsburgh, Pa // (c) Department of Neurosurgery, University of Pittsburgh Medical Center, Pittsburgh, Pa // (d) Department of Otolaryngology—Head & Neck Surgery, University of Pittsburgh Medical Center, Pittsburgh, Pa',
 'pub_date': '2009-4',
 'copyright': 'Copyright © American Society of Neuroradiology',
 'license': '',
 'keywords': nan,
 'abstract': 'BACKGROUND AND PURPOSE: Recently, surgeons have used an expanded endonasal surgical ap

In [6]:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from config import conn_string
from models.pubmed_model import PubmedArticle, Base

engine = create_engine(conn_string)
Base.metadata.bind = engine
DBSession = sessionmaker(bind=engine)

session = DBSession()

### Add

In [4]:
article = PubmedArticle(item)
session.add(article) 
session.commit()

### Read

In [7]:
result = session.query(PubmedArticle).filter_by(pmid=29522659).first()
# .first() returns None when no row matches, so guard before dereferencing.
if result is not None:
    print(result.to_dict()['abstract'])

Introduction: The aim of the study was an analysis of ophthalmic symptoms coexisting with the tumour of the cerebellum. Material and methods: The study included 14 patients in the age between 21–55 years old with the tumor of cerebellum, who were operated in the Neurosurgery Clinic of the Pomeranian Medical University in Szczecin. The comprehensive ophthalmic examination were performed before and after 5 days from surgery. The examinations included evaluation of: pupillary reactions, visual acuity, fundus ophthalmoscopy, intraocular pressures, eye motility, visual field, optometrical tests and visual manual localization test. Results: The symptoms found before surgery of cerebellum tumors: diplopia (3 persons), early papilloedema (4 persons), nystagmus (2 persons), lack (5 persons) and weakened of convergence reflex (3 persons), latent strabismus (5 persons), manifest strabismus (3 persons). On the 5th day after the surgery were found: nystagmus (1 person), lack (5 persons) and weakene

### Update

In [34]:
item = session.query(PubmedArticle).filter_by(pmid=18263039).first()
item.pmc = 'PMCID888'
session.commit()

### Delete

In [5]:
article = session.query(PubmedArticle).filter_by(pmid=29368597).first()
session.delete(article)
session.commit()

In [5]:
articles = session.query(PubmedArticle).all()
for article in articles[:100]:
    session.delete(article)
session.commit()

In [11]:
session.close()
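
The add, read, update and delete cells above assume that each lookup finds a row and that each commit succeeds. A more defensive pattern is sketched below against an in-memory SQLite database. `DemoArticle` is a hypothetical two-column stand-in for `PubmedArticle`, whose real definition is not shown, and `delete_by_pmid` is likewise an illustrative helper.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class DemoArticle(Base):
    """Hypothetical stand-in for PubmedArticle."""
    __tablename__ = 'demo_article'
    pmid = Column(Integer, primary_key=True)
    pmc = Column(String(32))


engine = create_engine('sqlite:///:memory:')  # throwaway database for the demo
Base.metadata.create_all(engine)
DBSession = sessionmaker(bind=engine)


def delete_by_pmid(session, pmid):
    """Delete one article; return False instead of failing when it is absent."""
    article = session.query(DemoArticle).filter_by(pmid=pmid).first()
    if article is None:
        return False
    session.delete(article)
    return True


session = DBSession()
try:
    session.add(DemoArticle(pmid=1, pmc='PMC1'))
    session.commit()
    deleted_existing = delete_by_pmid(session, 1)
    deleted_missing = delete_by_pmid(session, 999)
    session.commit()
    remaining = session.query(DemoArticle).count()
except Exception:
    session.rollback()  # keep the session usable after a failed commit
    raise
finally:
    session.close()
```

Rolling back on failure matters in a notebook, where a session left in a failed state would make every later cell raise until it is reset.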

### Debug

In [2]:
import pandas as pd

In [5]:
articles = pd.read_csv('database/pmc_neurosurgery.csv', sep='|', low_memory=False)
articles.head()

Unnamed: 0,abstract,abstract_len,art-access-id,article-type,authors,category,coden,copyright,doi,elocation-id,...,pmcid,pmid,pub_date,publisher-id,publisher-manuscript,publisher_loc,publisher_name,sici,title,volume
0,,0,,editorial,"Douglas Kondziolka, William T. Couldwell, Jame...",Editorial,,Copyright held by the American Association of ...,10.3171/2020.7.JNS202691,,...,,32707559.0,2020-7-24,2020.7.JNS202691,,,American Association of Neurological Surgeons,,Editorial. Putting pen to paper during a pande...,
1,,0,,research-article,"Junad M. Chowdhury, Maulin Patel, Matthew Zhen...",Perspective,,Copyright © 2020 by the American Thoracic Society,10.1513/AnnalsATS.202003-259PS,,...,,32315201.0,2020-8,,202003-259PS,,American Thoracic Society,,Mobilization and Preparation of a Large Urban ...,17.0
2,SUMMARY We address whether combinations with a...,957,,research-article,"Irem Ozkan-Dagliyan, J. Nathaniel Diehl, Samue...",Article,,,10.1016/j.celrep.2020.107764,,...,,32553168.0,2020-6-16,,,,,,Low-Dose Vertical Inhibition of the RAF-MEK-ER...,31.0
3,,0,,news,Concezio Di Rocco,Obituary,,"© Springer-Verlag GmbH Germany, part of Spring...",10.1007/s00381-020-04804-2,,...,,,2020-7-31,4804,,Berlin/Heidelberg,Springer Berlin Heidelberg,,Jim Tait Goodrich,
4,,0,,discussion,"Tariq Al-Saadi, Humaid Al-Kalbani, Jack Lam",Article,,Crown Copyright © 2020 Published by Elsevier I...,10.1016/j.wneu.2020.06.123,,...,,,2020-7-30,S1878-8750(20)31381-4,,,Published by Elsevier Inc.,,Letter to the Editor: Spinal and Neurosurgical...,


In [16]:
articles.columns

Index(['abstract', 'abstract_len', 'art-access-id', 'article-type', 'authors',
       'category', 'coden', 'copyright', 'doi', 'elocation-id', 'file_size',
       'full_text', 'full_text_len', 'issn_epub', 'issn_ppub', 'issue',
       'journal-id_', 'journal-id_allenpress-id', 'journal-id_archive',
       'journal-id_coden', 'journal-id_doi', 'journal-id_hwp',
       'journal-id_iso-abbrev', 'journal-id_issn', 'journal-id_nlm-journal-id',
       'journal-id_nlm-ta', 'journal-id_pmc', 'journal-id_publisher-id',
       'journal-id_pubmed', 'journal-id_pubmed-jr-id', 'journal-id_sc',
       'journal_title', 'keywords', 'license', 'license-type', 'manuscript',
       'medline', 'other', 'pages', 'pii', 'pmc', 'pmc-scan', 'pmcid', 'pmid',
       'pub_date', 'publisher-id', 'publisher-manuscript', 'publisher_loc',
       'publisher_name', 'sici', 'title', 'volume'],
      dtype='object')

In [49]:
columns = ['pmid', 'pmc', 'pii', 'doi', 'art-access-id', 'sici', 'pmc-scan', 'medline', 'manuscript', 'other',
                   'title', 'authors', 'article-type', 'category',
                   'journal_title', 'volume', 'issue', 'pages', 'pub_date', 'issn_epub', 'issn_ppub',
                   'publisher_name', 'publisher_loc', 'publisher-id', 'elocation-id', 'publisher-manuscript',
                   'journal-id_nlm-ta', 'journal-id_pubmed-jr-id', 'journal-id_issn', 'journal-id_pmc',
                   'journal-id_doi', 'journal-id_coden', 'journal-id_publisher-id', 'journal-id_hwp',
                   'journal-id_nlm-journal-id', 'journal-id_iso-abbrev',
                   'abstract_len', 'abstract', 'full_text_len', 'full_text',
                   'license-type', 'license', 'copyright', 'keywords']

In [50]:
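# Despite the variable name, this is a set difference: the columns present
# in the CSV but absent from the expected `columns` list.
intersection = list(set(articles.columns.difference(set(columns))))
intersection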

['journal-id_pubmed',
 'pmcid',
 'coden',
 'file_size',
 'journal-id_archive',
 'journal-id_sc',
 'journal-id_',
 'journal-id_allenpress-id']
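
For reference, `Index.difference` and `Index.intersection` answer different questions about two column sets. The short sketch below uses made-up column lists to show both, plus the reverse difference for columns that are expected but absent.

```python
import pandas as pd

csv_columns = pd.Index(['pmid', 'pmcid', 'coden', 'doi'])
expected = ['pmid', 'pmc', 'doi']

extra = csv_columns.difference(expected)              # in the CSV only
shared = csv_columns.intersection(expected)           # in both
missing = pd.Index(expected).difference(csv_columns)  # expected but absent

print(list(extra), list(shared), list(missing))
```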

In [51]:
articles[intersection].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132342 entries, 0 to 132341
Data columns (total 8 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   journal-id_pubmed         4 non-null       object
 1   pmcid                     2 non-null       object
 2   coden                     21 non-null      object
 3   file_size                 132342 non-null  int64 
 4   journal-id_archive        50 non-null      object
 5   journal-id_sc             1 non-null       object
 6   journal-id_               1 non-null       object
 7   journal-id_allenpress-id  1 non-null       object
dtypes: int64(1), object(7)
memory usage: 8.1+ MB


In [4]:
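# Needs `articles` to be loaded first (see the read_csv cell above);
# run before that cell, this raises the NameError shown below.
articles['pmc'].dropna()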

NameError: name 'articles' is not defined

In [15]:
articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132342 entries, 0 to 132341
Data columns (total 52 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   abstract                   110056 non-null  object 
 1   abstract_len               132342 non-null  int64  
 2   art-access-id              1478 non-null    object 
 3   article-type               132342 non-null  object 
 4   authors                    128259 non-null  object 
 5   category                   132328 non-null  object 
 6   coden                      21 non-null      object 
 7   copyright                  84607 non-null   object 
 8   doi                        98345 non-null   object 
 9   elocation-id               29050 non-null   object 
 10  file_size                  132342 non-null  int64  
 11  full_text                  58351 non-null   object 
 12  full_text_len              132342 non-null  int64  
 13  issn_epub                  12

In [14]:
# Longest string representation of any value in each column.
for column_name in articles.columns:
    max_len = articles[column_name].astype(str).str.len().max()
    print('{0:>25}  {1}'.format(column_name, max_len))

                 abstract  119942
             abstract_len  6
            art-access-id  18
             article-type  19
                  authors  40418
                 category  176
                    coden  17
                copyright  1124
                      doi  60
             elocation-id  152
                file_size  7
                full_text  6327298
            full_text_len  7
                issn_epub  9
                issn_ppub  9
                    issue  29
              journal-id_  3
 journal-id_allenpress-id  4
       journal-id_archive  4
         journal-id_coden  18
           journal-id_doi  32
           journal-id_hwp  43
    journal-id_iso-abbrev  80
          journal-id_issn  9
journal-id_nlm-journal-id  9
        journal-id_nlm-ta  80
           journal-id_pmc  44
  journal-id_publisher-id  67
        journal-id_pubmed  4
  journal-id_pubmed-jr-id  7
            journal-id_sc  4
            journal_title  237
                 keywords  1979
    