# <p style="text-align: center;">Topic Evolution in Life Sciences Research</p>

## Problem Statement
I aim to use text from biomedical and life science literature to gain insight into research topic trends over time.

## Project Goals
Through text mining and topic modeling with biomedical and life science literature, I seek to discover major research topics for a certain time period as well as their trends over time. Ultimately, I aim to build a tool that allows users to explore the trend of a topic of their own interest.

## Data description
I have the full text of a subset of the archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine, which is made [freely available](https://www.ncbi.nlm.nih.gov/pmc/tools/textmining/) to the public under a Creative Commons or similar license. This subset contains over 1 million articles, including historical articles digitized by OCR.

## Proposed Methods and Models
- Extract text features from article titles and abstracts (may extend to full text if time allows) using Natural Language Processing techniques
- Use topic modeling techniques to identify major topics within a time period and the trends over time
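
Before any topic model is fitted, a "trend over time" can be approximated with a plain keyword baseline: for each year, the share of titles that mention a term. The sketch below is only that baseline, not topic modeling, and the `records` list and the `term_share_by_year` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical (year, title) pairs standing in for rows of the extracted CSV.
records = [
    (2014, "CRISPR screening in human cells"),
    (2014, "A review of statin therapy"),
    (2015, "CRISPR base editing in mice"),
    (2015, "Genome-wide CRISPR knockout libraries"),
    (2015, "Outcomes after hip replacement"),
]

def term_share_by_year(records, term):
    """Return {year: fraction of that year's titles containing `term`}."""
    totals, hits = Counter(), Counter()
    term = term.lower()
    for year, title in records:
        totals[year] += 1
        if term in title.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

shares = term_share_by_year(records, "crispr")
print(shares)
```

Dividing by the yearly total matters because the number of archived articles differs from year to year, so raw counts alone would mostly track the size of the archive.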

## Exploratory Data Analysis
Let's get started!

In [1]:
import os
import csv
from lxml import etree, html
import pandas as pd
import numpy as np

# import the EDA functions I've built in a Python script
from eda import *

### Extract the metadata, title, and abstract of each article from the `.nxml` files and append to a `.csv` file

In [5]:
# # desktop
# CSV = '/media/fay/Seagate Backup Plus Drive/GA capstone/pmc.csv'

# # laptop
CSV = 'non_comm_use.csv'
CWD = '/run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml'
# CWD = '/run/media/fay/Seagate Backup Plus Drive/GA capstone/articles'

XPATHS = [
    ('pmid', '/article/front/article-meta/article-id[@pub-id-type="pmid"]'),
    ('pmc', '/article/front/article-meta/article-id[@pub-id-type="pmc"]'),
    ('doi', '/article/front/article-meta/article-id[@pub-id-type="doi"]'),
    ('article_subject', '/article/front/article-meta/article-categories/subj-group/subject'),
    ('pub_year', '/article/front/article-meta/pub-date/year'),
    ('pub_month', '/article/front/article-meta/pub-date/month'),
    ('pub_day', '/article/front/article-meta/pub-date/day'),
    ('publisher_name', '/article/front/journal-meta/publisher/publisher-name'),
    ('journal_id', '/article/front/journal-meta/journal-id'),
    ('journal_title', '//journal-title')
]

XPATHS_SPECIAL = [
    ('article_title', '/front/article-meta/title-group/article-title'),
    ('abstract', '/front/article-meta/abstract')
]
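
To make the `(column name, path)` pairs concrete, here is a minimal sketch of how one pair pulls one field out of an article. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies. The `SAMPLE` document and the shortened `FIELDS` list are made up, and the paths are written relative to the root element because ElementTree does not accept absolute paths.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed article; real .nxml files carry many more elements.
SAMPLE = """<article>
  <front>
    <journal-meta>
      <journal-id>Example J</journal-id>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="pmid">12345</article-id>
      <article-id pub-id-type="pmc">67890</article-id>
      <pub-date><year>2015</year></pub-date>
    </article-meta>
  </front>
</article>"""

# Paths are relative to the root <article> element, so the leading
# "/article" used with lxml is dropped here.
FIELDS = [
    ("pmid", "front/article-meta/article-id[@pub-id-type='pmid']"),
    ("doi", "front/article-meta/article-id[@pub-id-type='doi']"),
    ("pub_year", "front/article-meta/pub-date/year"),
]

root = ET.fromstring(SAMPLE)
row = {}
for name, path in FIELDS:
    el = root.find(path)  # first matching element, or None when absent
    row[name] = el.text if el is not None else None

print(row)
```

The sample has no DOI, so that field comes back as `None`, which mirrors how a missing element is recorded in the extraction code.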

In [3]:
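def dir_to_csv(xml_paths, csvfile):
    # xml_paths is a list of lists: one list of .nxml paths per directory
    for sub_dir in xml_paths:
        for xml_path in sub_dir:
            entry = extract_elements(xml_path)
            append_to_csv(csvfile, entry)
        # report progress once a non-empty directory has been written
        if sub_dir:
            print('Finished ' + os.path.dirname(sub_dir[0]))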

In [4]:
def find_xml_paths(cwd):
    xml_paths = [[os.path.join(os.path.abspath(dirpath), f) 
                  for f in filenames if f.endswith('.nxml')]
                 for dirpath, dirnames, filenames in os.walk(cwd)]
    return xml_paths
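
`find_xml_paths` returns one list per directory that `os.walk` visits, including empty lists for directories that hold no `.nxml` files; the `if sub_dir` check in `dir_to_csv` guards against those. As an alternative sketch (the name `find_paths_by_dir` and the temporary directory layout are invented for this example), the same walk can be keyed by directory and skip the empty ones:

```python
import os
import tempfile

def find_paths_by_dir(root, suffix=".nxml"):
    """Group matching files by directory, skipping directories with none."""
    grouped = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        matches = sorted(
            os.path.join(dirpath, f) for f in filenames if f.endswith(suffix)
        )
        if matches:
            grouped[dirpath] = matches
    return grouped

# Build a throwaway tree: Journal_A has two articles, Journal_B has none.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "Journal_A"))
    os.makedirs(os.path.join(tmp, "Journal_B"))
    for name in ("a1.nxml", "a2.nxml", "notes.txt"):
        open(os.path.join(tmp, "Journal_A", name), "w").close()
    grouped = find_paths_by_dir(tmp)
    summary = {os.path.basename(d): len(files) for d, files in grouped.items()}

print(summary)
```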

In [5]:
def extract_elements(xml_path):
    tree = etree.parse(xml_path)
    entry = [xml_path]

    for path in XPATHS:
        try:
            entry.append(tree.xpath(path[1])[0].text)
        except IndexError:
            entry.append(None)

    for path in XPATHS_SPECIAL:
        entry.append(extract_text(tree, path[1]))
    
    return entry

In [6]:
def extract_text(tree, xpath):
    el = tree.find(xpath)
    if el is not None:
        html_el = html.fragment_fromstring(etree.tostring(el))
        return html_el.text_content().strip().replace('\xa0', ' ')
    else:
        return ''
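
Titles and abstracts contain nested markup (section titles, italics, and so on), which is why they need more than `.text`. The sketch below shows the same flattening idea with the standard library's `itertext()` instead of lxml's `text_content()`. The `SAMPLE` abstract and the `flatten_text` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up abstract with nested markup and a non-breaking space (\xa0).
SAMPLE = (
    "<abstract><sec><title>Background</title>"
    "<p>Levels of <italic>p53</italic> rose by\xa010%.</p></sec></abstract>"
)

def flatten_text(el):
    """Join every text node under `el` and normalise whitespace."""
    text = " ".join(el.itertext()).replace("\xa0", " ")
    return " ".join(text.split())

abstract = flatten_text(ET.fromstring(SAMPLE))
print(abstract)
```

Joining the text nodes with a space keeps a section title such as "Background" from fusing with the first word of the paragraph that follows it. The trade-off is that inline markup directly followed by punctuation picks up a stray space.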

In [7]:
def append_to_csv(csvfile, entry):
    with open(csvfile, 'a', newline='') as f:
        entry_writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        entry_writer.writerow(entry)
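
`append_to_csv` writes data rows only, so column names have to be supplied some other way when the file is read back. One option, sketched here as an assumption rather than what this notebook does, is to write a header row the first time the file is created. The `append_row` helper and the shortened `COLUMNS` list are hypothetical.

```python
import csv
import os
import tempfile

COLUMNS = ["file_path", "pmid", "article_title"]  # shortened, illustrative list

def append_row(csv_path, row, columns=COLUMNS):
    """Append `row`, writing the header first if the file is new or empty."""
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        if needs_header:
            writer.writerow(columns)
        writer.writerow(row)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    append_row(path, ["a.nxml", "111", "Title, with a comma"])
    append_row(path, ["b.nxml", None, ""])
    with open(path, newline="") as f:
        rows = list(csv.reader(f))

print(rows)
```

Both `None` and the empty string are written as an empty field, and pandas reads empty fields back as missing values by default, so the two cases cannot be told apart after a round trip through the CSV.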

In [None]:
%%time
xml_paths = find_xml_paths(CWD)

CPU times: user 3.21 s, sys: 234 ms, total: 3.45 s
Wall time: 17.7 s


In [None]:
%%time
dir_to_csv(xml_paths, CSV)

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Icarus
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/IDCases
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Identity_(Mahwah,_N_J)
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/IEEE_J_Transl_Eng_Health_Med
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/IJC_Metab_Endocr
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/ILAR_J
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Illn_Crises_Loss
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Images_Paediatr_Cardiol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Imaging_Sci_Dent
Finished /run/media/fay

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Inhal_Toxicol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Injury
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Inj_Prev
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Innate_Immun
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Innovations_(Phila)
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Innov_Food_Sci_Emerg_Technol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Inorganica_Chim_Acta
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Inorg_Chem
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Inorg_Chem_Commun
Finished /run/media/fay/Seagat

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Cardiol_Heart_Vasc
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Care_Coord
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Child_Comput_Interact
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Child_Health_Nutr
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Chron_Obstruct_Pulmon_Dis
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Circumpolar_Health
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Climatol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Clin_Oncol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_u

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Law_Context
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Law_Psychiatry
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Legal_Med
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Life_Cycle_Assess
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Marit_Hist
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Mass_Spectrom
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_MCH_AIDS
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Med_Inform
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Med_Microbiol
Fini

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Spine_Surg
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Sport_Exerc_Psychol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Stat_Med_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_STD_AIDS
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Stomatol_Occlusion_Med
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Stroke
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Surg_Case_Rep
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Surg_Oncol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_J_Surg_Patho

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_Clin_Psychopharmacol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_Dent_J
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_Endod_J
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_Forum_Allergy_Rhinol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_Health
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Int_Immunol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/In_Vitro_Cell_Dev_Biol_Anim
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/In_Vitro_Cell_Dev_Biol_Plant
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Iperception
Finishe

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Iran_J_Pharm_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/JBI_Database_System_Rev_Implement_Rep
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Abnorm_Psychol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Acad_Mark_Sci
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Acad_Nutr_Diet
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Acquir_Immune_Defic_Syndr
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Addict_Dis
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Addict_Med
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Addict_Nurs

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Appl_Genet
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Appl_Microbiol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Appl_Oral_Sci
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Appl_Phycol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Appl_Physiol_(1985)
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Appl_Res_Intellect_Disabil
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Appl_Res_Mem_Cogn
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Appl_Sport_Psychol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Appl_Toxicol
Finis

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Bone_Oncol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Bras_Pneumol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Breast_Cancer
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Burn_Care_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Bus_Ethics
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Bus_Psychol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Bus_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Cachexia_Sarcopenia_Muscle
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Cancer
Finished /run/media/fay/Seagate Backu

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Clin_Endocrinol_Metab
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Clin_Epidemiol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Adv_Pharm_Technol_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Agromedicine
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Am_Assoc_Nurse_Pract
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Anaesthesiol_Clin_Pharmacol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Anxiety_Disord
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Atmos_Sol_Terr_Phys
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Conserv_Dent
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Constr_Psychol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Consult_Clin_Psychol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Contemp_Brachytherapy
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Contemp_Psychother
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Contin_Educ_Health_Prof
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Control_Release
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Cosmet_Dermatol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Cosmet_L

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Environ_Sci_Health_A_Tox_Hazard_Subst_Environ_Eng
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Environ_Sci_Health_B
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Environ_Sci_Health_C_Environ_Carcinog_Ecotoxicol_Rev
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Enzyme_Inhib_Med_Chem
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Epidemiol_Community_Health
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Epidemiol_Glob_Health
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Epilepsy_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Equine_Sci
Finished /run/media/fay/Seaga

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Genet_Couns
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Genet_Genomics
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Genet_Psychol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Genet_Syndr_Gene_Ther
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Gene_Med
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Genomics
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Gen_Intern_Med
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Gen_Physiol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Geochem_Explor
Finished /run/media/fay/Seagate

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Histochem_Cytochem
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Histotechnol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Hist_Med_Allied_Sci
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Hist_Neurosci
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_HIV_AIDS_Soc_Serv
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Holist_Nurs
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Homosex
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Hosp_Infect
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Hosp_Med
Finished /run/media/fay/Seaga

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Korean_Neurosurg_Soc
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Korean_Soc_Coloproctol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Korean_Soc_Coloproctology
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Korean_Surg_Soc
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Labelled_Comp_Radiopharm
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Lab_Physicians
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Law_Biosci
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Law_Econ_Organ
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Le

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Midlife_Health
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Midwifery_Womens_Health
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Minim_Access_Surg
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Mix_Methods_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Molluscan_Stud
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Mol_Biol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Mol_Biomark_Diagn
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Mol_Catal_B_Enzym
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Mol_Cell_Biol
Finishe

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Inflamm_(Lond)
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Invertebr_Pathol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Laparoendosc_Adv_Surg_Tech_A
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Magn_Reson_Imaging
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Math_Imaging_Vis
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Med_Genet
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Med_Ultrason_(2001)
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Mod_Opt
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Nat_Sci_Biol_Med
Fi

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Pathol_Transl_Med
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Patient_Exp
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Patient_Saf
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Pediatr
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Pediatric_Infect_Dis_Soc
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Pediatr_Gastroenterol_Nutr
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Pediatr_Neurosci
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Pediatr_Oncol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Pediatr_Oncol_Nurs
Fi

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Psycholinguist_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Psychol_Med_Ment_Pathol_(Lond)
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Psychopathol_Behav_Assess
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Psychopharmacol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Psychophysiol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Psychosoc_Oncol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Psychosom_Obstet_Gynaecol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Psychosom_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Soc_Psychol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Soc_Serv_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Soc_Welf_Fam_Law
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Soc_Work_(Lond)
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Soc_Work_End_Life_Palliat_Care
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Soc_Work_Pract
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Soc_Work_Pract_Addict
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Soils_Sediments
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Solid_State_

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Venom_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Vet_Intern_Med
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Vet_Med_Sci
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Vet_Pharmacol_Ther
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Vet_Sci
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Viral_Hepat
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Virol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Virol_Methods
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/J_Virus_Erad
Finished /run/media/fay/Seagate Backup Plus Dri

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Landsc_Ecol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Land_use_policy
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Langenbecks_Arch_Surg
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Langmuir
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Lang_Cogn_Neurosci
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Lang_Cogn_Process
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Lang_Resour_Eval
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Lang_Sci
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Laryngoscope
Finished /run/media/fay/Seagate Backu

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Matern_Child_Health_J
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Matern_Child_Nutr
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Matern_Health_Neonatol_Perinatol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mater_Charact
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mater_Lett
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mater_Sci_Eng_A_Struct_Mater
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mater_Sociomed
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mater_Struct
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Math_Comput_Model

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Med_Sci_Monit_Basic_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Med_Sci_Sports_Exerc
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Med_Stud
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Med_Teach
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Melanoma_Res
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Memo
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Memory
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mem_Cognit
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mem_Inst_Oswaldo_Cruz
Finished /run/media/fay/Seagate Backup Plus Drive/

Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mol_Breed
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mol_Cancer
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mol_Cancer_Ther
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mol_Carcinog
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mol_Cell
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mol_Cells
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mol_Cell_Biochem
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mol_Cell_Biol
Finished /run/media/fay/Seagate Backup Plus Drive/GA capstone/articles/non_comm_use.I-N.xml/Mol_Cell_Endocrinol
Finished /run/media/fay/Seagate Backup Plus Drive/GA

In [6]:
df = pd.read_csv(CSV)
eda(df)

Head of the dataframe:

                                           file_path        pmid      pmc  \
0  /run/media/fay/Seagate Backup Plus Drive/GA ca...  25147756.0  4136529   
1  /run/media/fay/Seagate Backup Plus Drive/GA ca...  25558448.0  4280839   
2  /run/media/fay/Seagate Backup Plus Drive/GA ca...  25558447.0  4280861   
3  /run/media/fay/Seagate Backup Plus Drive/GA ca...  25793156.0  4360915   
4  /run/media/fay/Seagate Backup Plus Drive/GA ca...  25825692.0  4375554   

                            doi  article_subject  pub_year  pub_month  \
0  10.1016/j.bbacli.2014.04.001           Review      2014        4.0   
1  10.1016/j.bbacli.2014.11.003  Regular Article      2014       11.0   
2  10.1016/j.bbacli.2014.10.002  Regular Article      2014       11.0   
3  10.1016/j.bbacli.2015.01.002  Regular Article      2015        1.0   
4  10.1016/j.bbacli.2015.03.005  Regular Article      2015        3.0   

   pub_day publisher_name journal_id journal_title  \
0     13.0       Els

In [7]:
df[(df.abstract.isnull()) & (df.article_title.isnull())]

| Unnamed: 0 | file_path | pmid | pmc | doi | article_subject | pub_year | pub_month | pub_day | publisher_name | journal_id | journal_title | article_title | abstract |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31109 | /run/media/fay/Seagate Backup Plus Drive/GA ca... | 26130931.0 | 4481558 | 10.5214/ans.0972.7531.220313 | Molecular Shots | 2015 | 7.0 | | Indian Academy of Neurosciences | Ann Neurosci | Annals of Neurosciences | | |
| 31116 | /run/media/fay/Seagate Backup Plus Drive/GA ca... | 26535016.0 | 4627196 | 10.5214/ans.0972.7531.220414 | Molecular Shots | 2015 | 10.0 | | Indian Academy of Neurosciences | Ann Neurosci | Annals of Neurosciences | | |
| 31634 | /run/media/fay/Seagate Backup Plus Drive/GA ca... | | 3180995 | | Selected Summaries | 2011 | | | Medknow Publications | Ann Pediatr Cardiol | Annals of Pediatric Cardiology | | |
| 37487 | /run/media/fay/Seagate Backup Plus Drive/GA ca... | 26558081.0 | 4443012 | 10.1016/j.aju.2013.08.001 | Editorial | 2013 | 9.0 | 2.0 | Elsevier | Arab J Urol | Arab Journal of Urology | | |
| 46800 | /run/media/fay/Seagate Backup Plus Drive/GA ca... | 22375229.0 | 3289206 | | Author's Reply | 2011 | 6.0 | | Tehran University of Medical Sciences | Asian J Sports Med | Asian Journal of Sports Medicine | | |
| 55334 | /run/media/fay/Seagate Backup Plus Drive/GA ca... | | 3255004 | | Poster Presentation | 2010 | 9.0 | 24.0 | BioMed Central | BMC Proc | BMC Proceedings | | |
| 55335 | /run/media/fay/Seagate Backup Plus Drive/GA ca... | | 3255005 | | Poster Presentation | 2010 | 9.0 | 24.0 | BioMed Central | BMC Proc | BMC Proceedings | | |
| 55344 | /run/media/fay/Seagate Backup Plus Drive/GA ca... | | 3255014 | | Poster Presentation | 2010 | 9.0 | 24.0 | BioMed Central | BMC Proc | BMC Proceedings | | |
| 55350 | /run/media/fay/Seagate Backup Plus Drive/GA ca... | | 3255020 | | Poster Presentation | 2010 | 9.0 | 24.0 | BioMed Central | BMC Proc | BMC Proceedings | | |
| 72690 | /run/media/fay/Seagate Backup Plus Drive/GA ca... | 27990342.0 | 5154632 | 10.1016/j.bdq.2016.10.003 | Editorial | 2016 | 11.0 | 25.0 | Elsevier | Biomol Detect Quantif | Biomolecular Detection and Quantification | | |


It appears most articles without a title or an abstract are not primary research papers, so they are not essential to my analysis. (It's cool to see the variety of publications that is included in the PubMed Central archive, though!)

In [8]:
cols = ['article_subject', 'pub_year', 'publisher_name', 'journal_title']
category_counts(df[cols], max_nunique=None, numeric=True)

Original Article                                               71541
Article                                                        53355
Research Article                                               37503
Articles                                                       36745
Case Report                                                    29962
Original Research                                              20724
Review                                                         16140
Review Article                                                 10807
Editorial                                                       7484
Poster Presentation                                             6555
Original Paper                                                  6244
Research                                                        6139
Research Articles                                               6067
Research Paper                                                  5627
Original Articles                 