<h1>Data parsing<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-dependencies" data-toc-modified-id="Import-dependencies-1">Import dependencies</a></span></li><li><span><a href="#Load-data" data-toc-modified-id="Load-data-2">Load data</a></span></li><li><span><a href="#Data-exploration" data-toc-modified-id="Data-exploration-3">Data exploration</a></span><ul class="toc-item"><li><span><a href="#Get-a-list-of-standalone-paragraphs-from-each-article's-body" data-toc-modified-id="Get-a-list-of-standalone-paragraphs-from-each-article's-body-3.1">Get a list of standalone paragraphs from each article's body</a></span></li><li><span><a href="#Get-a-list-of-sections-(in-each-section-are-one-or-more-paragraphs)--from-each-article's-body" data-toc-modified-id="Get-a-list-of-sections-(in-each-section-are-one-or-more-paragraphs)--from-each-article's-body-3.2">Get a list of sections (in each section are one or more paragraphs)  from each article's body</a></span></li><li><span><a href="#Get-article's-body" data-toc-modified-id="Get-article's-body-3.3">Get article's body</a></span></li><li><span><a href="#Get-article's-abstract" data-toc-modified-id="Get-article's-abstract-3.4">Get article's abstract</a></span></li><li><span><a href="#Get-article's-categories" data-toc-modified-id="Get-article's-categories-3.5">Get article's categories</a></span></li><li><span><a href="#Get-article's-title" data-toc-modified-id="Get-article's-title-3.6">Get article's title</a></span></li><li><span><a href="#Get-final-dataframe" data-toc-modified-id="Get-final-dataframe-3.7">Get final dataframe</a></span></li><li><span><a href="#Get-number-of-pages" data-toc-modified-id="Get-number-of-pages-3.8">Get number of pages</a></span></li><li><span><a href="#Get-authors-list" data-toc-modified-id="Get-authors-list-3.9">Get authors list</a></span></li><li><span><a href="#Get-author's-affiliation" data-toc-modified-id="Get-author's-affiliation-3.10">Get author's affiliation</a></span></li><li><span><a 
href="#Get-number-of-figures" data-toc-modified-id="Get-number-of-figures-3.11">Get number of figures</a></span></li></ul></li></ul></div>

## Import dependencies

In [1]:
import os, shutil
import warnings
import random
warnings.simplefilter("ignore")

from shutil import copyfile
from distutils.dir_util import copy_tree

## Load data

In [2]:
input_dir = "../datasets/2016_testing_df/"

df = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "record")\
    .load(input_dir)

In [3]:
df.rdd.count()

114
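
The `rowTag` option tells spark-xml which XML element becomes one row. As a rough illustration only, the sketch below mimics that idea with the standard library on a made-up miniature document; `SAMPLE` and `rows_for_tag` are hypothetical names, and the real reader does considerably more (schema inference, attributes, nested structs).

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the input: two <record> elements, each with one <body>.
SAMPLE = """
<records>
  <record><metadata><article>
    <body><p>First standalone paragraph.</p><sec><p>Section text.</p></sec></body>
  </article></metadata></record>
  <record><metadata><article>
    <body><p>Second article.</p></body>
  </article></metadata></record>
</records>
"""

def rows_for_tag(xml_text, row_tag):
    """Return one element per occurrence of row_tag, at any nesting depth."""
    root = ET.fromstring(xml_text)
    return list(root.iter(row_tag))

records = rows_for_tag(SAMPLE, "record")
bodies = rows_for_tag(SAMPLE, "body")
print(len(records), len(bodies))

# Direct <p> children of each body, i.e. the "standalone" paragraphs
standalone = [[p.text for p in body.findall("p")] for body in bodies]
print(standalone)
```

Choosing a deeper tag such as `body` therefore yields rows whose top-level columns are the children of that element (`p`, `sec`), which is the pattern the parsing functions in this notebook rely on.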

## Data exploration

### Get a list of standalone paragraphs from each article's body

Standalone paragraphs can be found either under **body/p** or directly in the **p** tag from the root; both give the same results.
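
A parsed `p` column can mix plain strings with struct-like values: when an element carries attributes, spark-xml stores its text in a separate value field, named `_VALUE` by default. The sketch below shows the flattening idea on hypothetical stand-ins (`SimpleNamespace` objects instead of Spark `Row`s); `flatten_paragraphs` is an illustrative name, not part of the notebook.

```python
from types import SimpleNamespace

def flatten_paragraphs(paragraphs):
    """Keep plain strings, unwrap objects that carry text in _VALUE, skip empties."""
    texts = []
    for paragraph in paragraphs or []:
        if isinstance(paragraph, str):
            texts.append(paragraph)
        elif paragraph is not None and getattr(paragraph, "_VALUE", None) is not None:
            texts.append(paragraph._VALUE)
    return texts

# Hypothetical stand-ins for what a parsed <p> column may contain
mixed = [
    "A plain paragraph.",
    SimpleNamespace(_VALUE="Paragraph with attributes.", _id="p1"),
    SimpleNamespace(_VALUE=None, _id="p2"),  # element with no text of its own
    None,
]
print(flatten_paragraphs(mixed))
```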

In [4]:
def parse_paragraphs_list(list_of_paragraphs):
    # Collect the text of every paragraph, skipping missing entries
    list_of_paragraphs_to_ret = []
    if list_of_paragraphs is not None:
        for paragraph in list_of_paragraphs:
            if paragraph is not None:
                if isinstance(paragraph, str):
                    list_of_paragraphs_to_ret.append(paragraph)
                # Non-string entries carry their text in the _VALUE field
                elif paragraph._VALUE is not None:
                    list_of_paragraphs_to_ret.append(paragraph._VALUE)
    return list_of_paragraphs_to_ret

def get_paragraphs_list_from_body(input_dir):
    # One row per <body> element
    df =   spark.read \
                .format('com.databricks.spark.xml') \
                .options(rowTag='body')\
                .load(input_dir)
    # Pair each paragraph list with its row position: (index, paragraphs)
    return  df.select("p")\
              .rdd\
              .map(lambda row: parse_paragraphs_list(row['p']))\
              .zipWithIndex()\
              .map(lambda record: (record[1], record[0]))


get_paragraphs_list_from_body(input_dir).take(10)

[(0,
  ['The mature 26S proteasome consists of at least 33 distinct subunits. Fourteen of them (α1-7 and β1-7) form the 20S core particle (CP), a barrel-shaped structure that encloses three types of peptidase activities (trypsin-like, caspase-like and chymotrypsin-like). The remaining 19 subunits (Rpt1-6, Rpn1-3, 5-13 and 15) constitute the 19S regulatory particle (RP) that caps the CP on one or both ends. Protein substrates destined for proteasomal degradation are captured and processed by the 19S RP before they are threaded into the 20S CP for proteolysis. During this process, the ATPase subunits (Rpt1-6) play key roles in substrate engagement, unfolding, translocation and CP gate opening']),
 (1,
  ['Purely computational approaches have been used to perform virtual screens across multiple mechanisms of action']),
 (2,
  ['The zebrafish has recently emerged as a useful vertebrate model system to study neural circuits and behavior, but tools to modulate neurons in freely behaving anim

### Get a list of sections (in each section are one or more paragraphs)  from each article's body

The sections are read from the **sec** tags of each body, so they are the same as those under **metadata/article/body**.

In [5]:
def parse_sections_list(list_of_sections):
    # One list of paragraph texts per section; sections without paragraphs are skipped
    if list_of_sections is not None:
        list_of_sections = [parse_paragraphs_list(section.p) for section in list_of_sections if section.p is not None]
    else:
        list_of_sections = []
    return list_of_sections

def get_sections_list(input_dir):
    # One row per <body> element
    df =   spark.read \
                .format('com.databricks.spark.xml') \
                .options(rowTag='body')\
                .load(input_dir)
    return  df.select("sec")\
              .rdd\
              .map(lambda row: parse_sections_list(row['sec']))\
              .zipWithIndex()\
              .map(lambda record: (record[1], record[0]))

get_sections_list(input_dir).take(10)

[(0,
  [['The 26S proteasome is an essential protein complex responsible for degrading the majority of cellular proteins in eukaryotes']]),
 (1,
  [['Advances in molecular biology have led to an unprecedented ability to profile the genetic- and pathway-level changes that occur in disease']]),
 (2,
  [['We fused rat TRPV1 (NM_031982), containing the E600K mutation that increases sensitivity to Csn by over 10–fold ']]),
 (3,
  [['BICR-31 was obtained from Sigma-Aldrich, and A549, NCI-H2009, NCI-H358, HCC95, and Ishikawa cells were obtained from the American Type Culture Collection. Cells were cultured in RPMI 1640 medium supplemented with 10% FBS and 1% penicillin-streptomycin.'],
   ['Chromatin-immunoprecipitation followed by massive parallel sequencing (ChIP-seq) was performed as previously described']]),
 (4,
  [['Lentivirus was produced at titers >10E9 using methods similar to those described by Han et al. ']]),
 (5,
  [['Cornelia de Lange syndrome (CdLS, OMIM 608667) is the prototyp

### Get article's body

The body of the article is built as follows:
-  concatenate the standalone paragraphs (1)
-  concatenate the paragraphs within each section (2)
-  concatenate the sections (3)
-  finally, concatenate the results from (1) and (3)
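
The concatenation order can be sketched on plain Python lists. `join_body` is a hypothetical helper written for this illustration; unlike the notebook's own function it strips the result, an added assumption that avoids a stray space when an article has only standalone paragraphs or only sections.

```python
def join_body(standalone_paragraphs, sections):
    """Concatenate standalone paragraphs, then sections, separated by single spaces."""
    # Standalone paragraphs become one string
    standalone_text = " ".join(standalone_paragraphs)
    # Paragraphs inside each section become one string per section
    section_texts = [" ".join(section) for section in sections]
    # Sections become one string
    sections_text = " ".join(section_texts)
    # Standalone text comes first, sections second
    return (standalone_text + " " + sections_text).strip()

example = join_body(
    ["Intro one.", "Intro two."],
    [["Methods a.", "Methods b."], ["Results."]],
)
print(example)
print(join_body([], [["Only section."]]))
```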

In [6]:
def build_body(row_sec, row_p):
    # Join the paragraphs inside each section, then join the sections
    concat_parsed_sections = ' '.join([' '.join(section) for section in parse_sections_list(row_sec)])
    concat_standalone_paragraphs = ' '.join(parse_paragraphs_list(row_p))
    # Standalone paragraphs come first, followed by the section text
    return concat_standalone_paragraphs + " " + concat_parsed_sections

In [7]:
def get_bodys_list(input_dir):
    # One row per <body> element
    df =   spark.read \
                .format('com.databricks.spark.xml') \
                .options(rowTag='body')\
                .load(input_dir)
    return  df.select("sec", "p")\
              .rdd\
              .map(lambda row: build_body(row['sec'], row["p"]))\
              .zipWithIndex()\
              .map(lambda record: (record[1], record[0]))

get_bodys_list(input_dir).take(5)

[(0,
  'The mature 26S proteasome consists of at least 33 distinct subunits. Fourteen of them (α1-7 and β1-7) form the 20S core particle (CP), a barrel-shaped structure that encloses three types of peptidase activities (trypsin-like, caspase-like and chymotrypsin-like). The remaining 19 subunits (Rpt1-6, Rpn1-3, 5-13 and 15) constitute the 19S regulatory particle (RP) that caps the CP on one or both ends. Protein substrates destined for proteasomal degradation are captured and processed by the 19S RP before they are threaded into the 20S CP for proteolysis. During this process, the ATPase subunits (Rpt1-6) play key roles in substrate engagement, unfolding, translocation and CP gate opening The 26S proteasome is an essential protein complex responsible for degrading the majority of cellular proteins in eukaryotes'),
 (1,
  'Purely computational approaches have been used to perform virtual screens across multiple mechanisms of action Advances in molecular biology have led to an unprecede

### Get article's abstract

The abstract is built the same way as the body, from the **sec** and **p** tags inside each **abstract** element.

In [8]:
def get_abstract_list(input_dir):
    # One row per <abstract> element
    df =   spark.read \
                .format('com.databricks.spark.xml') \
                .options(rowTag='abstract')\
                .load(input_dir)
    return  df.select("sec", "p")\
              .rdd\
              .map(lambda row: build_body(row['sec'], row["p"]))\
              .zipWithIndex()\
              .map(lambda record: (record[1], record[0]))

get_abstract_list(input_dir).take(5)

[(0,
  'Despite the fundamental importance of proteasomal degradation in cells, little is known about whether and how the 26S proteasome itself is regulated in coordination with various physiological processes. Here we show that the proteasome is dynamically phosphorylated during cell cycle at Thr25 of the 19S subunit Rpt3. CRISPR/Cas9-mediated genome editing, RNA interference and biochemical studies demonstrate that blocking Rpt3-Thr25 phosphorylation markedly impairs proteasome activity and impedes cell proliferation. Through a kinome-wide screen, we have identified dual-specificity tyrosine-regulated kinase 2 (DYRK2) as the primary kinase that phosphorylates Rpt3-Thr25, leading to enhanced substrate translocation and degradation. Importantly, loss of the single phosphorylation of Rpt3-Thr25 or knockout of DYRK2 significantly inhibits tumor formation by proteasome-addicted human breast cancer cells in mice. These findings define an important mechanism for proteasome regulation and de

### Get article's categories

In [9]:
def get_article_categories(row):
    categories = []
    try:
        for subj_group in row['subj-group']:
            subject = subj_group['subject']
            if isinstance(subject, str):
                categories.append(subject)
            elif isinstance(subject, list):
                categories.extend(subject)
    except Exception:
        # Fallback for a 'subj-group' that cannot be iterated as a list of groups
        return [row['subj-group']['subject']]
    return categories
    

def get_categories_list(input_dir):
    # One row per <article-meta> element
    df = spark.read \
            .format('com.databricks.spark.xml') \
            .options(rowTag='article-meta')\
            .load(input_dir)
    
    return  df.select("article-categories")\
              .rdd\
              .map(lambda row: get_article_categories(row['article-categories']))\
              .zipWithIndex()\
              .map(lambda record: (record[1], record[0]))

get_categories_list(input_dir).take(100)

[(0, ['Article']),
 (1, ['Article']),
 (2, ['Article']),
 (3, ['Article']),
 (4, ['Article']),
 (5, ['Case Report']),
 (6, ['Research Article']),
 (7, ['Case Report']),
 (8, ['Research Article']),
 (9, ['Review']),
 (10, ['Original Research Article']),
 (11, ['Research Article']),
 (12, ['Case Report']),
 (13, ['Case Report']),
 (14, ['Articles']),
 (15, ['Articles']),
 (16, ['Articles']),
 (17, ['Articles']),
 (18, ['Articles']),
 (19, ['Articles']),
 (20, ['Articles']),
 (21, ['Articles']),
 (22, ['Articles']),
 (23, ['Articles']),
 (24, ['Correspondence']),
 (25, ['Correspondence']),
 (26, ['Article']),
 (27, ['Review']),
 (28, ['Article']),
 (29, ['Review Article']),
 (30, ['Research Article']),
 (31, ['Review Article']),
 (32, ['Research Article']),
 (33, ['Research Article']),
 (34, ['Research Article']),
 (35, ['Research Article']),
 (36, ['Research Article']),
 (37, ['Research Article']),
 (38, ['Research Article']),
 (39, ['Research']),
 (40, ['Research']),
 (41, ['Research Ar

### Get article's title

In [10]:
def get_titles_list(input_dir):
    # One row per <title-group> element
    df = spark.read \
            .format('com.databricks.spark.xml') \
            .options(rowTag='title-group')\
            .load(input_dir)
    
    return  df.select("article-title")\
              .rdd\
              .map(lambda row: row['article-title'])\
              .zipWithIndex()\
              .map(lambda record: (record[1], record[0]))

get_titles_list(input_dir).take(5)

[(0,
  'Site-specific Proteasome Phosphorylation Controls Cell Proliferation and Tumorigenesis'),
 (1,
  'Improving drug discovery with high-content phenotypic screens by systematic selection of reporter cell lines'),
 (2,
  'TRP channel mediated neuronal activation and ablation in freely behaving zebrafish'),
 (3,
  'Identification of focally amplified lineage-specific super-enhancers in human epithelial cancers'),
 (4,
  'Disruption of relative reward value by reversible disconnection of orbitofrontal and rhinal cortex using DREADDs in rhesus monkeys')]

### Get final dataframe

In [11]:
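def get_final_rdd():
    # Every RDD is keyed by row position; join() keeps only keys present on both sides
    final_rdd = get_abstract_list(input_dir)
    final_rdd = final_rdd.join(get_bodys_list(input_dir))
    final_rdd = final_rdd.join(get_categories_list(input_dir))
    final_rdd = final_rdd.join(get_titles_list(input_dir))
    return final_rdd.sortByKey()

def get_final_to_pandas():
    # After three joins each value is nested as (((abstract, body), categories), title).
    # The result is a Spark DataFrame; call .toPandas() on it to get a pandas one.
    return get_final_rdd().map(lambda record: (record[1][0][0][0], record[1][0][0][1], record[1][0][1], record[1][1]))\
                          .toDF(["abstract", "body", "categories", "title"])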

get_final_to_pandas().toPandas()

| | abstract | body | categories | title |
|---|---|---|---|---|
| 0 | Despite the fundamental importance of proteaso... | The mature 26S proteasome consists of at least... | [Article] | Site-specific Proteasome Phosphorylation Contr... |
| 1 | High-content, image-based screens enable the i... | Purely computational approaches have been used... | [Article] | Improving drug discovery with high-content phe... |
| 2 | Whole genome analysis approaches are revealing... | The zebrafish has recently emerged as a useful... | [Article] | TRP channel mediated neuronal activation and a... |
| 3 | To study how the interaction between orbitofro... | Somatic copy number alterations (SCNAs), inclu... | [Article] | Identification of focally amplified lineage-sp... |
| 4 | C o r n e l i a d e L a n g e s y n d r... | Interrupting the flow of information by discon... | [Article] | Disruption of relative reward value by reversi... |
| ... | ... | ... | ... | ... |
| 109 | Pikman et al. demonstrate that the mitochondri... | Cardiolipin (CL) is a dimeric phospholipid tha... | [Article] | Hydrophobic CDR3 residues promote the developm... |
| 110 | Drugs targeting metabolism have formed the bac... | The ability of αβ T cell repertoires to target... | [Article] | The genomic landscape and evolution of endomet... |
| 111 | Antignano, Zaph, and collaborators show that t... | Recent large-scale sequencing studies of prima... | [Article] | Genomic analysis of local variation and recent... |
| 112 | Innate lymphoid cells (ILCs) are emerging as i... | In this study we analysed | [Article] | Recognizing millions of consistently unidentif... |
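
The join-and-unpack pattern can be emulated with plain dictionaries. This is a sketch on made-up data; `inner_join` and `flatten` are hypothetical helpers that imitate the inner-join behaviour of `RDD.join` without Spark. It shows two things: how the nested tuples arise, and that a key missing from any one side disappears from the result.

```python
def inner_join(left, right):
    """Pair values by key, keeping only keys present on both sides."""
    return {key: (left[key], right[key]) for key in left if key in right}

# Made-up data keyed by row position; the abstracts are missing key 2
abstracts = {0: "abstract A", 1: "abstract B"}
bodies = {0: "body A", 1: "body B", 2: "body C"}
categories = {0: ["Article"], 1: ["Review"], 2: ["Article"]}
titles = {0: "Title A", 1: "Title B", 2: "Title C"}

joined = inner_join(inner_join(inner_join(abstracts, bodies), categories), titles)

def flatten(value):
    """Unpack (((abstract, body), categories), title) into a flat tuple."""
    ((abstract, body), cats), title = value
    return abstract, body, cats, title

rows = [flatten(joined[key]) for key in sorted(joined)]
print(rows)
```

Because the keys are row positions produced independently by each read, the columns line up per article only if every read yields exactly one row per article in the same order. Comparing the row counts of the individual RDDs is one way to check that assumption.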


### Get number of pages

In [12]:
def get_page_counts(row):
    try:
        # The page count is stored in the 'count' attribute of <page-count>
        return row['page-count']._count
    except Exception:
        return 'Not specified'
    
def get_number_of_pages_list(input_dir):
    # One row per <article> element
    df = spark.read \
            .format('com.databricks.spark.xml') \
            .options(rowTag='article')\
            .load(input_dir)
    
    return  df.select("counts")\
              .rdd\
              .map(lambda row: get_page_counts(row['counts']))\
              .zipWithIndex()\
              .map(lambda record: (record[1], record[0]))

get_number_of_pages_list(input_dir).collect()

[(0, 'Not specified'),
 (1, 'Not specified'),
 (2, 'Not specified'),
 (3, 'Not specified'),
 (4, 'Not specified'),
 (5, 'Not specified'),
 (6, 'Not specified'),
 (7, 'Not specified'),
 (8, 'Not specified'),
 (9, 'Not specified'),
 (10, 'Not specified'),
 (11, 'Not specified'),
 (12, 'Not specified'),
 (13, 'Not specified'),
 (14, 'Not specified'),
 (15, 'Not specified'),
 (16, 'Not specified'),
 (17, 'Not specified'),
 (18, 'Not specified'),
 (19, 'Not specified'),
 (20, 'Not specified'),
 (21, 'Not specified'),
 (22, 'Not specified'),
 (23, 'Not specified'),
 (24, 'Not specified'),
 (25, 'Not specified'),
 (26, 'Not specified'),
 (27, 'Not specified'),
 (28, 'Not specified'),
 (29, 'Not specified'),
 (30, 'Not specified'),
 (31, 'Not specified'),
 (32, 'Not specified'),
 (33, 'Not specified'),
 (34, 'Not specified'),
 (35, 7),
 (36, 'Not specified'),
 (37, 'Not specified'),
 (38, 'Not specified'),
 (39, 'Not specified'),
 (40, 'Not specified'),
 (41, 'Not specified'),
 (42, 'Not speci

### Get authors list

In [13]:
def get_authors(row):
    authors = []
    try:
        for contrib in row['contrib']:
            # Keep only contributors marked as authors
            if contrib['_contrib-type'] == 'author':
                authors.append(contrib['name']['given-names'] + ' ' + contrib['name']['surname'])
        return authors
    except Exception:
        # Any missing or unexpectedly shaped field marks the whole article as unspecified
        return 'Not specified'
    
def get_authors_list(input_dir):
    # One row per <article-meta> element
    df = spark.read \
            .format('com.databricks.spark.xml') \
            .options(rowTag='article-meta')\
            .load(input_dir)
    
    return  df.select("contrib-group")\
              .rdd\
              .map(lambda row: get_authors(row['contrib-group']))\
              .zipWithIndex()\
              .map(lambda record: (record[1], record[0]))

get_authors_list(input_dir).take(5)

[(0,
  ['Xing Guo',
   'Xiaorong Wang',
   'Zhiping Wang',
   'Sourav Banerjee',
   'Jing Yang',
   'Lan Huang',
   'Jack E. Dixon']),
 (1,
  ['Jungseog Kang',
   'Chien-Hsiang Hsu',
   'Qi Wu',
   'Shanshan Liu',
   'Adam D. Coster',
   'Bruce A. Posner',
   'Steven J. Altschuler',
   'Lani F. Wu']),
 (2,
  ['Shijia Chen',
   'Cindy N. Chiu',
   'Kimberly L. McArthur',
   'Joseph R. Fetcho',
   'David A. Prober']),
 (3,
  ['Xiaoyang Zhang',
   'Peter S. Choi',
   'Joshua M. Francis',
   'Marcin Imielinski',
   'Hideo Watanabe',
   'Andrew D. Cherniack',
   'Matthew Meyerson']),
 (4,
  ['Mark A G Eldridge',
   'Walter Lerchner',
   'Richard C Saunders',
   'Hiroyuki Kaneko',
   'Kristopher W Krausz',
   'Frank J Gonzalez',
   'Bin Ji',
   'Makoto Higuchi',
   'Takafumi Minamimoto',
   'Barry J Richmond'])]
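
A more tolerant variant of the name assembly can be sketched on plain dictionaries. The sample data and the helper `author_names` are made up for illustration, with dictionaries standing in for Spark `Row`s. Instead of giving up on the whole article when one name part is missing, it joins whatever parts are present.

```python
def author_names(contribs):
    """Build 'given-names surname' for each author, tolerating missing parts."""
    names = []
    for contrib in contribs:
        if contrib.get("_contrib-type") != "author":
            continue
        name = contrib.get("name") or {}
        parts = [name.get("given-names"), name.get("surname")]
        full_name = " ".join(part for part in parts if part)
        if full_name:
            names.append(full_name)
    return names

# Made-up contributors: two authors (one without given names) and one editor
contribs = [
    {"_contrib-type": "author", "name": {"given-names": "Ada", "surname": "Lovelace"}},
    {"_contrib-type": "editor", "name": {"given-names": "Grace", "surname": "Hopper"}},
    {"_contrib-type": "author", "name": {"given-names": None, "surname": "Example Consortium"}},
]
print(author_names(contribs))
```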

### Get author's affiliation

In [14]:
def get_affiliations(row):
    affiliations = dict()
    affiliations_for_authors = []
    try:
        # Map each affiliation id to its (country, institution) pair
        for aff in row['aff']:
            affiliations[aff['_id']] = (aff['country'], aff['institution'])
            
        for contrib in row['contrib']:
            if contrib['_contrib-type'] == 'author':
                # Each xref points to an affiliation through its 'rid' attribute
                ids_aff = [xref['_rid'] for xref in contrib['xref']]
                affiliations_for_authors.append([])
                
                for id_aff in ids_aff:
                    affiliations_for_authors[-1].append(affiliations[id_aff])
        return affiliations_for_authors
    except Exception:
        # Any missing or unexpectedly shaped field marks the whole article as unspecified
        return 'Not specified'
    
def get_affiliations_list(input_dir):
    # One row per <article-meta> element
    df = spark.read \
            .format('com.databricks.spark.xml') \
            .options(rowTag='article-meta')\
            .load(input_dir)
    
    return  df.select("contrib-group")\
              .rdd\
              .map(lambda row: get_affiliations(row['contrib-group']))\
              .zipWithIndex()\
              .map(lambda record: (record[1], record[0]))

get_affiliations_list(input_dir).collect()

[(0, 'Not specified'),
 (1, 'Not specified'),
 (2, 'Not specified'),
 (3, 'Not specified'),
 (4, 'Not specified'),
 (5, 'Not specified'),
 (6, 'Not specified'),
 (7, 'Not specified'),
 (8, 'Not specified'),
 (9, 'Not specified'),
 (10, 'Not specified'),
 (11, 'Not specified'),
 (12, 'Not specified'),
 (13, 'Not specified'),
 (14, 'Not specified'),
 (15, 'Not specified'),
 (16, 'Not specified'),
 (17, 'Not specified'),
 (18, 'Not specified'),
 (19, 'Not specified'),
 (20, 'Not specified'),
 (21, 'Not specified'),
 (22, 'Not specified'),
 (23, 'Not specified'),
 (24, 'Not specified'),
 (25, 'Not specified'),
 (26, 'Not specified'),
 (27, 'Not specified'),
 (28, 'Not specified'),
 (29, 'Not specified'),
 (30, 'Not specified'),
 (31, 'Not specified'),
 (32, 'Not specified'),
 (33, 'Not specified'),
 (34, 'Not specified'),
 (35, 'Not specified'),
 (36, 'Not specified'),
 (37, 'Not specified'),
 (38, 'Not specified'),
 (39, 'Not specified'),
 (40, 'Not specified'),
 (41, 'Not specified'),
 (
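
The id lookup behind the affiliation mapping can be sketched on plain dictionaries. Everything here is hypothetical: the sample data, the helper names `as_list` and `affiliations_per_author`, and the assumption that a field may arrive either as a single value or as a list. It is not a diagnosis of the output above, only an illustration of the lookup with that normalisation added.

```python
def as_list(value):
    """Wrap a single value in a list; treat None as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def affiliations_per_author(affs, contribs):
    """Return, per author, the (country, institution) pairs their xrefs point to."""
    by_id = {
        aff["_id"]: (aff.get("country"), aff.get("institution"))
        for aff in as_list(affs)
    }
    result = []
    for contrib in as_list(contribs):
        if contrib.get("_contrib-type") != "author":
            continue
        rids = [xref.get("_rid") for xref in as_list(contrib.get("xref"))]
        # Unknown ids are skipped instead of raising a KeyError
        result.append([by_id[rid] for rid in rids if rid in by_id])
    return result

# Made-up affiliations and contributors
affs = [
    {"_id": "aff1", "country": "USA", "institution": "Example University"},
    {"_id": "aff2", "country": "China", "institution": "Sample Institute"},
]
contribs = [
    {"_contrib-type": "author", "xref": [{"_rid": "aff1"}, {"_rid": "aff2"}]},
    {"_contrib-type": "author", "xref": {"_rid": "aff1"}},  # single xref, not a list
    {"_contrib-type": "editor", "xref": {"_rid": "aff2"}},
]
result = affiliations_per_author(affs, contribs)
print(result)
```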

### Get number of figures

In [15]:
def get_fig_no(row):
    try:
        if row['fig'] is not None:
            # Several figures arrive as a list, a single figure as one struct
            if isinstance(row['fig'], list):
                return len(row['fig'])
            return 1
        else:
            return 'Not specified'
    except Exception:
        return 'Not specified'

def get_number_of_figures(input_dir):
    # One row per <record> element
    df = spark.read \
            .format('com.databricks.spark.xml') \
            .options(rowTag='record')\
            .load(input_dir)
    
    return  df.select('fig')\
              .rdd\
              .map(lambda row: get_fig_no(row))\
              .zipWithIndex()\
              .map(lambda record: (record[1], record[0]))

get_number_of_figures(input_dir).collect()

[(0, 'Not specified'),
 (1, 'Not specified'),
 (2, 'Not specified'),
 (3, 'Not specified'),
 (4, 'Not specified'),
 (5, 'Not specified'),
 (6, 'Not specified'),
 (7, 'Not specified'),
 (8, 'Not specified'),
 (9, 'Not specified'),
 (10, 'Not specified'),
 (11, 'Not specified'),
 (12, 'Not specified'),
 (13, 'Not specified'),
 (14, 'Not specified'),
 (15, 1),
 (16, 'Not specified'),
 (17, 1),
 (18, 'Not specified'),
 (19, 'Not specified'),
 (20, 'Not specified'),
 (21, 1),
 (22, 'Not specified'),
 (23, 'Not specified'),
 (24, 'Not specified'),
 (25, 'Not specified'),
 (26, 'Not specified'),
 (27, 'Not specified'),
 (28, 'Not specified'),
 (29, 'Not specified'),
 (30, 'Not specified'),
 (31, 'Not specified'),
 (32, 1),
 (33, 'Not specified'),
 (34, 'Not specified'),
 (35, 'Not specified'),
 (36, 'Not specified'),
 (37, 'Not specified'),
 (38, 'Not specified'),
 (39, 'Not specified'),
 (40, 'Not specified'),
 (41, 'Not specified'),
 (42, 'Not specified'),
 (43, 'Not specified'),
 (44, 'Not