<h1>Data parsing<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-dependencies" data-toc-modified-id="Import-dependencies-1">Import dependencies</a></span></li><li><span><a href="#Load-data" data-toc-modified-id="Load-data-2">Load data</a></span></li><li><span><a href="#Data-exploration" data-toc-modified-id="Data-exploration-3">Data exploration</a></span><ul class="toc-item"><li><span><a href="#Get-a-list-of-standalone-paragraphs-from-each-article's-body" data-toc-modified-id="Get-a-list-of-standalone-paragraphs-from-each-article's-body-3.1">Get a list of standalone paragraphs from each article's body</a></span></li><li><span><a href="#Get-a-list-of-sections-(in-each-section-are-one-or-more-paragraphs)--from-each-article's-body" data-toc-modified-id="Get-a-list-of-sections-(in-each-section-are-one-or-more-paragraphs)--from-each-article's-body-3.2">Get a list of sections (in each section are one or more paragraphs)  from each article's body</a></span></li><li><span><a href="#Get-article's-body" data-toc-modified-id="Get-article's-body-3.3">Get article's body</a></span></li><li><span><a href="#Get-article's-abstract" data-toc-modified-id="Get-article's-abstract-3.4">Get article's abstract</a></span></li><li><span><a href="#Get-article's-categories" data-toc-modified-id="Get-article's-categories-3.5">Get article's categories</a></span></li><li><span><a href="#Get-article's-title" data-toc-modified-id="Get-article's-title-3.6">Get article's title</a></span></li><li><span><a href="#Get-final-dataframe" data-toc-modified-id="Get-final-dataframe-3.7">Get final dataframe</a></span></li></ul></li></ul></div>

## Import dependencies

In [1]:
import os, shutil
import warnings
import random
warnings.simplefilter("ignore")

from shutil import copyfile
from distutils.dir_util import copy_tree

## Load data

In [2]:
input_dir = "../datasets/2016_testing_df/"

# Use the same `spark` session entry point as the rest of the notebook
df = spark.read.format('com.databricks.spark.xml') \
    .option("rowTag", "record") \
    .load(input_dir)

In [3]:
df.rdd.count()

175

## Data exploration

### Get a list of standalone paragraphs from each article's body

Standalone paragraphs can be found either under **body/p** or directly in a **p** tag at the root. Both locations give the same results.

In [4]:
def parse_paragraphs_list(list_of_paragraphs):
    """Return the text of every non-empty paragraph as a list of strings."""
    list_of_paragraphs_to_ret = []
    if list_of_paragraphs is not None:
        for paragraph in list_of_paragraphs:
            if paragraph is None:
                continue
            if isinstance(paragraph, str):
                # Plain-text paragraph
                list_of_paragraphs_to_ret.append(paragraph)
            elif paragraph._VALUE is not None:
                # Structured paragraph: the text is stored in the _VALUE field
                list_of_paragraphs_to_ret.append(paragraph._VALUE)
    return list_of_paragraphs_to_ret

def get_paragraphs_list_from_body(input_dir):
    # Repeated rowTag options overwrite one another; only the last one
    # ('body') applies, so it is set once.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='body') \
              .load(input_dir)
    # Pair each body's paragraphs with its position: (index, paragraphs)
    return df.select("p") \
             .rdd \
             .map(lambda row: parse_paragraphs_list(row['p'])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))


get_paragraphs_list_from_body(input_dir).take(10)

[(0,
  ['Recent work has shown that some TOR-dependent processes are regulated through control of mRNA stability. Although mechanisms of its regulation remain poorly understood, steps leading to deadenylation-dependent mRNA degradation in eukaryotes are well known ']),
 (1,
  ['Prior to 2006, the 2nd Edition of the Bayley (Bayley-II) Psychomotor Developmental Index (PDI)']),
 (2,
  ['Policies related to alcohol, as well as other substances, both in the US and more broadly, remain an open area of debate and controversy. Most recently, the Amethyst Initiative (']),
 (3,
  ['Malaria exacts a deadly toll on human populations worldwide. A new era in the fight against the disease has seen multiple malaria vaccines entering or completing advanced clinical study']),
 (4, ['Genetic crosses in the human malaria parasite ']),
 (5,
  ['One class of drugs that has promise for the treatment of relapsed CLL are the cyclin dependent kinase (CDK) inhibitors. Flavopiridol is the first member of this cla
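
To make the two paragraph shapes concrete, here is a minimal sketch that runs without Spark. `FakeParagraph` is a hypothetical stand-in for the row objects that expose their text through `_VALUE`; it and the sample strings are assumptions for illustration, not part of the dataset.

```python
class FakeParagraph:
    """Hypothetical stand-in for a parsed row that keeps its text in _VALUE."""
    def __init__(self, value):
        self._VALUE = value


def parse_paragraphs_list(list_of_paragraphs):
    """Keep plain strings and non-empty _VALUE fields, skip everything else."""
    result = []
    for paragraph in list_of_paragraphs or []:
        if paragraph is None:
            continue
        if isinstance(paragraph, str):
            result.append(paragraph)
        elif paragraph._VALUE is not None:
            result.append(paragraph._VALUE)
    return result


sample = ["plain text", None, FakeParagraph("text with markup"), FakeParagraph(None)]
parsed = parse_paragraphs_list(sample)
print(parsed)
```

Missing paragraphs and paragraphs without text are dropped, and a missing list (`None`) yields an empty list.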

### Get a list of sections (in each section are one or more paragraphs)  from each article's body

The sections read here are the same as those found under **metadata/article/body**.

In [5]:
def parse_sections_list(list_of_sections):
    """Return one list of paragraph strings per section that has paragraphs."""
    if list_of_sections is None:
        return []
    return [parse_paragraphs_list(section.p)
            for section in list_of_sections
            if section.p is not None]

def get_sections_list(input_dir):
    # Repeated rowTag options overwrite one another; only the last one
    # ('body') applies, so it is set once.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='body') \
              .load(input_dir)
    # Pair each body's sections with its position: (index, sections)
    return df.select("sec") \
             .rdd \
             .map(lambda row: parse_sections_list(row['sec'])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_sections_list(input_dir).take(10)

[(0, [['Autophagy is a conserved essential eukaryotic catabolic process ']]),
 (1,
  [['Motor function is an important component of neurodevelopmental assessment and is included in assessment tools such as the Bayley Scales of Infant (and Toddler) Development (Bayley).']]),
 (2,
  [['Alcohol and marijuana are among the most commonly used drugs by adolescents and young adults in the United States (US) (']]),
 (3,
  [['The tertiary structure of the AnAPN1 ectodomain (residues 57-942) exhibited the classical four-domain assembly of M1-family metallopeptidases, designated domains I-IV (']]),
 (4,
  [['Non-blood fed adult female mosquitoes three to seven days post-emergence were fed on mixed gametocyte cultures. Gametocyte cultures were quickly spun down and the pelleted infected erythrocytes diluted to a 40% hematocrit with fresh A+ human serum and O+ erythrocytes. Mosquitoes were allowed to feed through Parafilm for up to 20 minutes. Following blood feeding, mosquitoes were maintained for
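
Every getter in this notebook ends with `zipWithIndex()` followed by a swap, so that the record position becomes the key. The sketch below is a plain-Python analogue of that step (not Spark code); the nested sample lists are made up for illustration.

```python
# One entry per article: a list of sections, each a list of paragraph strings.
sections_per_article = [[["a1 p1"]], [], [["a3 p1", "a3 p2"]]]

# zipWithIndex pairs each element with its position as (value, index).
zipped = [(value, index) for index, value in enumerate(sections_per_article)]

# The final map swaps the pair so the index can serve as a join key.
keyed = [(index, value) for value, index in zipped]
print(keyed)
```

Because the key is purely positional, it identifies the same article across getters only if every getter yields exactly one record per article, in the same order.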

### Get article's body

The body of the article is built as follows:
-  concatenate the standalone paragraphs (1)
-  concatenate the paragraphs within each section (2)
-  concatenate the sections obtained in (2) (3)
-  finally, concatenate the results of (1) and (3)
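
The steps above can be sketched with plain Python lists. `join_body` is a hypothetical helper that takes already-parsed strings, and the sample sentences are made up for illustration.

```python
def join_body(standalone_paragraphs, sections):
    """Join standalone paragraphs, then section paragraphs, into one string."""
    # Join the standalone paragraphs with single spaces.
    standalone_text = " ".join(standalone_paragraphs)
    # Join the paragraphs inside each section, then join the sections.
    sections_text = " ".join(" ".join(section) for section in sections)
    # Standalone text comes first, followed by the section text.
    return standalone_text + " " + sections_text


body = join_body(["Intro one.", "Intro two."],
                 [["Sec A p1.", "Sec A p2."], ["Sec B p1."]])
print(body)

# With no standalone paragraphs the result starts with a space.
only_sections = join_body([], [["Only."]])
```

The separator is always inserted, so a record without standalone paragraphs produces a string with a leading space; calling `.strip()` on the result would remove it if that matters downstream.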

In [6]:
def build_body(row_sec, row_p):
    # Join the paragraphs of each section, then join the sections
    concat_parsed_sections = ' '.join([' '.join(section) for section in parse_sections_list(row_sec)])
    # Join the standalone paragraphs
    concat_standalone_paragraphs = ' '.join(parse_paragraphs_list(row_p))
    # Standalone paragraphs come first, followed by the section text
    return concat_standalone_paragraphs + " " + concat_parsed_sections

In [7]:
def get_bodys_list(input_dir):
    # Repeated rowTag options overwrite one another; only the last one
    # ('body') applies, so it is set once.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='body') \
              .load(input_dir)
    # Pair each assembled body with its position: (index, body)
    return df.select("sec", "p") \
             .rdd \
             .map(lambda row: build_body(row['sec'], row["p"])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_bodys_list(input_dir).take(5)

[(0,
  'Recent work has shown that some TOR-dependent processes are regulated through control of mRNA stability. Although mechanisms of its regulation remain poorly understood, steps leading to deadenylation-dependent mRNA degradation in eukaryotes are well known  Autophagy is a conserved essential eukaryotic catabolic process '),
 (1,
  'Prior to 2006, the 2nd Edition of the Bayley (Bayley-II) Psychomotor Developmental Index (PDI) Motor function is an important component of neurodevelopmental assessment and is included in assessment tools such as the Bayley Scales of Infant (and Toddler) Development (Bayley).'),
 (2,
  'Policies related to alcohol, as well as other substances, both in the US and more broadly, remain an open area of debate and controversy. Most recently, the Amethyst Initiative ( Alcohol and marijuana are among the most commonly used drugs by adolescents and young adults in the United States (US) ('),
 (3,
  'Malaria exacts a deadly toll on human populations worldwide.

### Get article's abstract

The abstract is built the same way as the body, reading rows from the **abstract** tag instead of **body**.

In [8]:
def get_abstract_list(input_dir):
    # Repeated rowTag options overwrite one another; only the last one
    # ('abstract') applies, so it is set once.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='abstract') \
              .load(input_dir)
    # Pair each assembled abstract with its position: (index, abstract)
    return df.select("sec", "p") \
             .rdd \
             .map(lambda row: build_body(row['sec'], row["p"])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_abstract_list(input_dir).take(5)

[(0,
  'Autophagy is an essential eukaryotic pathway requiring tight regulation to maintain homeostasis and preclude disease. Using yeast and mammalian cells, we report a conserved mechanism of autophagy regulation by RNA helicase RCK family members in association with the decapping enzyme Dcp2. Under nutrient-replete conditions, Dcp2 undergoes TOR-dependent phosphorylation and associates with RCK members to form a complex with autophagy-related ( '),
 (1,
  ' To determine whether a Bayley-III Motor Composite score of 85 may overestimate moderate-severe motor impairment by analyzing Bayley-III motor components and developing cut-point scores for each. Retrospective study of 1183 children born <27 weeks gestation at NICHD Neonatal Research Network centers and evaluated at 18-22 months corrected age. Gross Motor Function Classification System determined gross motor impairment. Statistical analyses included linear and logistic regression and sensitivity/specificity. Bayley-III Motor Compo

### Get article's categories

In [9]:
def get_article_categories(row):
    """Collect every subject listed in the article's subj-group entries."""
    categories = []
    for subj_group in row['subj-group']:
        subject = subj_group['subject']
        if isinstance(subject, str):
            # Append instead of reassigning, so earlier categories are kept
            categories.append(subject)
        elif isinstance(subject, list):
            categories.extend(subject)
    return categories
    

def get_categories_list(input_dir):
    # Repeated rowTag options overwrite one another; only the last one
    # ('article-meta') applies, so it is set once.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='article-meta') \
              .load(input_dir)

    # Pair each article's categories with its position: (index, categories)
    return df.select("article-categories") \
             .rdd \
             .map(lambda row: get_article_categories(row['article-categories'])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_categories_list(input_dir).take(20)

[(0, ['Article']),
 (1, ['Article']),
 (2, ['Article']),
 (3, ['Article']),
 (4, ['Article']),
 (5, ['Article']),
 (6, ['Article']),
 (7, ['Analytical Methods']),
 (8, ['Article']),
 (9, ['Article']),
 (10, ['Article']),
 (11, ['Original Article']),
 (12, ['Article']),
 (13, ['Article']),
 (14, ['Article']),
 (15, ['Article']),
 (16, ['Article']),
 (17, ['Article']),
 (18, ['Article']),
 (19, ['Article'])]
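
The `subject` field can hold either a single string or a list of strings, so the collector has to handle both shapes. The sketch below runs without Spark and uses plain dictionaries as hypothetical stand-ins for the parsed rows (both support `row['key']` lookups); the combination of subjects in the sample is an assumption for illustration.

```python
def collect_categories(article_categories):
    """Gather every subject from all subj-group entries, in order."""
    categories = []
    for subj_group in article_categories["subj-group"]:
        subject = subj_group["subject"]
        if isinstance(subject, str):
            # A single subject: append it to what was already collected.
            categories.append(subject)
        elif isinstance(subject, list):
            # Several subjects: add them all.
            categories.extend(subject)
    return categories


row = {"subj-group": [{"subject": ["Research Paper", "Review"]},
                      {"subject": "Article"}]}
collected = collect_categories(row)
print(collected)
```

Appending the string case matters when a group with a single subject follows a group with several: reassigning the list at that point would discard the subjects collected earlier.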

### Get article's title

In [10]:
def get_titles_list(input_dir):
    # Repeated rowTag options overwrite one another; only the last one
    # ('title-group') applies, so it is set once.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='title-group') \
              .load(input_dir)

    # Pair each title with its position: (index, title)
    return df.select("article-title") \
             .rdd \
             .map(lambda row: row['article-title']) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_titles_list(input_dir).take(5)

[(0,
  'A conserved mechanism of TOR-dependent RCK-mediated mRNA degradation regulates autophagy'),
 (1,
  'Do currently recommended Bayley III cutoffs overestimate motor impairment in infants born <27 weeks gestation?'),
 (2,
  'Effects of minimum legal drinking age on alcohol and marijuana use: evidence from toxicological testing data for fatally injured drivers aged 16 to 25 years'),
 (3, 'Structural analysis of '),
 (4, '<italic>Plasmodium falciparum</italic>')]

### Get final dataframe

In [11]:
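def get_final_rdd():
    # Every RDD is keyed by the record index, so each join matches on that index
    final_rdd = get_abstract_list(input_dir)
    final_rdd = final_rdd.join(get_bodys_list(input_dir))
    final_rdd = final_rdd.join(get_categories_list(input_dir))
    final_rdd = final_rdd.join(get_titles_list(input_dir))
    return final_rdd.sortByKey()

def get_final_to_pandas():
    # After three joins each value is nested as (((abstract, body), categories), title);
    # flatten it into one tuple per article before building the DataFrame
    return get_final_rdd().map(lambda record: (record[1][0][0][0], record[1][0][0][1], record[1][0][1], record[1][1]))\
                          .toDF(["abstract", "body", "categories", "title"])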

get_final_to_pandas().toPandas()

Unnamed: 0,abstract,body,categories,title
0,Autophagy is an essential eukaryotic pathway r...,Recent work has shown that some TOR-dependent ...,[Article],A conserved mechanism of TOR-dependent RCK-med...
1,To determine whether a Bayley-III Motor Compo...,"Prior to 2006, the 2nd Edition of the Bayley (...",[Article],Do currently recommended Bayley III cutoffs ov...
2,Alcohol and marijuana are among the most comm...,"Policies related to alcohol, as well as other ...",[Article],Effects of minimum legal drinking age on alcoh...
3,Mosquito-based malaria transmission-blocking v...,Malaria exacts a deadly toll on human populati...,[Article],Structural analysis of
4,Genetic crosses of phenotypically distinct str...,Genetic crosses in the human malaria parasite,[Article],<italic>Plasmodium falciparum</italic>
...,...,...,...,...
169,Recent studies suggest that pro-inflammatory t...,These SNP arrays or chips have been widely use...,[Research Paper],Retinoic Acid Induced-Autophagic Flux Inhibits...
170,,Spiders (Araneae) are among the largest animal...,[Review],Genome Wide Sampling Sequencing for SNP Genoty...
171,Zinc-fingers and homeoboxes 1 (ZHX1) was impli...,Immunogenic apoptosis is characterized by secr...,[Research Paper],The Complete Mitochondrial Genome of two
172,,Recent technological advancements have enabled...,[Research Paper],Photodynamic-therapy Activates Immune Response...
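
The nested indexing used to flatten the joined records is easier to follow on a small example. The sketch below is a plain-Python analogue of an inner join on `(key, value)` pairs with unique keys, not Spark's implementation; `inner_join` and all sample values are hypothetical.

```python
def inner_join(left, right):
    """Inner join of two (key, value) lists, assuming keys are unique."""
    right_by_key = dict(right)
    return [(key, (value, right_by_key[key]))
            for key, value in left if key in right_by_key]


abstracts = [(0, "abs0"), (1, "abs1")]
bodies = [(0, "body0"), (1, "body1"), (2, "body2")]
categories = [(0, ["Article"]), (1, ["Review"]), (2, ["Article"])]
titles = [(0, "title0"), (1, "title1"), (2, "title2")]

# Chaining joins nests the values as (((abstract, body), categories), title).
joined = inner_join(inner_join(inner_join(abstracts, bodies), categories), titles)

# Flatten each nested value into (abstract, body, categories, title).
flat = [(v[0][0][0], v[0][0][1], v[0][1], v[1]) for _, v in joined]
print(flat)
```

An inner join keeps only keys present on both sides, so an index missing from any of the four inputs (key `2` in this sample) drops out of the final result.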
