<h1>Data parsing<span class="tocSkip"></span></h1>

## Import dependencies

In [5]:
import os, shutil
import warnings
import random
warnings.simplefilter("ignore")

from shutil import copyfile
from distutils.dir_util import copy_tree

## Load data

In [47]:
input_dir = "../datasets/2016_testing_df/"

df = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "record")\
    .load(input_dir)

In [48]:
df.rdd.count()

103

## Data exploration

### Get a list of standalone paragraphs from each article's body

Standalone paragraphs can be read either from **body/p** or directly from the **p** tag under the root; both give the same results.

In [49]:
def parse_paragraphs_list(list_of_paragraphs):
    # Collect the text of every non-empty paragraph. An entry is either a plain
    # string or a Row whose text is stored in the `_VALUE` field.
    list_of_paragraphs_to_ret = []
    if list_of_paragraphs is not None:
        for paragraph in list_of_paragraphs:
            if paragraph is None:
                continue
            if isinstance(paragraph, str):
                list_of_paragraphs_to_ret.append(paragraph)
            elif paragraph._VALUE is not None:
                list_of_paragraphs_to_ret.append(paragraph._VALUE)
    return list_of_paragraphs_to_ret

def get_paragraphs_list_from_body(input_dir):
    # Only the last rowTag set on a reader takes effect, so rows are <body> elements.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='body') \
              .load(input_dir)
    # Pair each article's paragraphs with its position: (index, paragraphs).
    return df.select("p") \
             .rdd \
             .map(lambda row: parse_paragraphs_list(row['p'])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))


get_paragraphs_list_from_body(input_dir).take(10)

[(0,
  ['Cerebral white matter injury is a prominent brain injury and a leading cause of cerebral palsy in preterm infants [']),
 (1,
  ['Like many developing countries, India is experiencing an epidemiologic transition, in which the burdens of infectious disease, maternal and child health problems are decreasing, while the burden of non-communicable chronic diseases, such as stroke and injury, is increasing [']),
 (2,
  ['Betatrophin is a newly recognized liver-derived hormone that has been implicated in both glucose and lipid metabolism [']),
 (3,
  ['In addition, an outer membrane protein, multivalent adhesion molecule (MAM) which includes MCE (from Mammalian cell entry domains) was recently described in ']),
 (4,
  ['The timing of initiation of an RCT raises ethical as well as scientific issues. Physician-researchers are widely regarded as having a duty of care to patients in RCTs, meaning that there must be good grounds to believe that study interventions are consistent with stand
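
As a minimal sketch of what `parse_paragraphs_list` does, the snippet below applies the same filtering to plain Python objects, so it runs without a Spark session. The `SimpleNamespace` objects are hypothetical stand-ins for the rows that carry their text in `_VALUE`, and the helper name `extract_paragraph_texts` is an assumption made for illustration only.

```python
from types import SimpleNamespace


def extract_paragraph_texts(paragraphs):
    """Return the text of every non-empty paragraph entry."""
    texts = []
    for paragraph in paragraphs or []:
        if paragraph is None:
            continue
        if isinstance(paragraph, str):
            # Plain-text paragraph.
            texts.append(paragraph)
        elif paragraph._VALUE is not None:
            # Structured paragraph: the text sits in the `_VALUE` field.
            texts.append(paragraph._VALUE)
    return texts


# Hypothetical input mixing the cases the parser has to handle.
mixed = [
    "plain text paragraph",
    None,
    SimpleNamespace(_VALUE="paragraph with nested markup"),
    SimpleNamespace(_VALUE=None),
]
print(extract_paragraph_texts(mixed))
# ['plain text paragraph', 'paragraph with nested markup']
```

Missing entries and entries without text are dropped, so every returned item is a string that can be joined later.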

### Get a list of sections (in each section are one or more paragraphs)  from each article's body

The sections are the same as those found under **metadata/article/body**.

In [50]:
def parse_sections_list(list_of_sections):
    # Each section becomes a list of paragraph texts; sections without <p> are skipped.
    if list_of_sections is None:
        return []
    return [parse_paragraphs_list(section.p) for section in list_of_sections if section.p is not None]

def get_sections_list(input_dir):
    # Only the last rowTag set on a reader takes effect, so rows are <body> elements.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='body') \
              .load(input_dir)
    # Pair each article's sections with its position: (index, sections).
    return df.select("sec") \
             .rdd \
             .map(lambda row: parse_sections_list(row['sec'])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_sections_list(input_dir).take(10)

[(0,
  [['Cytokines from monocytes and macrophages initiate inflammation by inducing chemokines released from inflammatory and endothelial cells [']]),
 (1,
  [['Stroke causes 6 million deaths each year among 17 million affected people, with the greatest burden experienced in populations of low- and middle-income countries [']]),
 (2,
  [['Type 2 diabetes mellitus (T2DM) is one of the most severe public health problems and affects over 170 million people worldwide [']]),
 (3,
  [['<italic>Vibrio parahaemolyticus</italic><italic>V. parahaemolyticus</italic><italic>V. parahaemolyticus</italic><xref>1</xref><xref>1</xref><italic>V. parahaemolyticus</italic><ext-link>http://www.cdc.gov/foodnet/data/trends/trends-2013-progress.html</ext-link><ext-link>http://epi.minsal.cl/</ext-link>',
    'Pandemic ']]),
 (4,
  [['Randomized controlled trials (RCTs) help to provide reliable information on the safety and efficacy of healthcare interventions. To have scientific and clinical utility, an RCT o
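
The nested shape of this output (one list per section, each holding that section's paragraph texts) can be reproduced without Spark. The sketch below mirrors `parse_sections_list` on mock objects; `SimpleNamespace` stands in for the section and paragraph rows, and the function names are hypothetical.

```python
from types import SimpleNamespace


def paragraph_texts(paragraphs):
    """Text of each non-empty paragraph (plain string or object with `_VALUE`)."""
    texts = []
    for paragraph in paragraphs or []:
        if isinstance(paragraph, str):
            texts.append(paragraph)
        elif paragraph is not None and paragraph._VALUE is not None:
            texts.append(paragraph._VALUE)
    return texts


def section_texts(sections):
    """One list of paragraph texts per section; sections without `p` are skipped."""
    if sections is None:
        return []
    return [paragraph_texts(section.p) for section in sections if section.p is not None]


# Hypothetical article body with three sections, one of them without paragraphs.
sections = [
    SimpleNamespace(p=["first paragraph", SimpleNamespace(_VALUE="second paragraph")]),
    SimpleNamespace(p=None),
    SimpleNamespace(p=["only paragraph"]),
]
print(section_texts(sections))
# [['first paragraph', 'second paragraph'], ['only paragraph']]
```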

### Get article's body

The body of the article is built as follows:
-  concatenate the standalone paragraphs (1)
-  concatenate the paragraphs within each section (2)
-  concatenate the sections obtained in (2) (3)
-  finally, concatenate the results of (1) and (3)
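
The steps above can be sketched on plain lists. This is only an illustration under the assumption that paragraphs and sections have already been parsed into strings; `join_body` is a hypothetical name, not a function from this notebook.

```python
def join_body(standalone_paragraphs, sections):
    """Join standalone paragraphs, then append the joined sections."""
    # Join the paragraphs of each section, then join the sections.
    section_text = " ".join(" ".join(section) for section in sections)
    # Join the standalone paragraphs.
    standalone_text = " ".join(standalone_paragraphs)
    # Standalone paragraphs come first, then the section text.
    return standalone_text + " " + section_text


print(join_body(["Intro."], [["S1 p1.", "S1 p2."], ["S2 p1."]]))
# Intro. S1 p1. S1 p2. S2 p1.
```

When there are no standalone paragraphs, the result starts with a single space; calling `.strip()` on the result is one way to remove it if that matters downstream.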

In [51]:
def build_body(row_sec, row_p):
    # Join the paragraphs of each section, then join the sections.
    concat_parsed_sections = ' '.join([' '.join(section) for section in parse_sections_list(row_sec)])
    # Join the standalone paragraphs.
    concat_standalone_paragraphs = ' '.join(parse_paragraphs_list(row_p))
    return concat_standalone_paragraphs + " " + concat_parsed_sections

In [52]:
def get_bodys_list(input_dir):
    # Only the last rowTag set on a reader takes effect, so rows are <body> elements.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='body') \
              .load(input_dir)
    # Pair each article's body text with its position: (index, body).
    return df.select("sec", "p") \
             .rdd \
             .map(lambda row: build_body(row['sec'], row["p"])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_bodys_list(input_dir).take(5)

[(0,
  'Cerebral white matter injury is a prominent brain injury and a leading cause of cerebral palsy in preterm infants [ Cytokines from monocytes and macrophages initiate inflammation by inducing chemokines released from inflammatory and endothelial cells ['),
 (1,
  'Like many developing countries, India is experiencing an epidemiologic transition, in which the burdens of infectious disease, maternal and child health problems are decreasing, while the burden of non-communicable chronic diseases, such as stroke and injury, is increasing [ Stroke causes 6 million deaths each year among 17 million affected people, with the greatest burden experienced in populations of low- and middle-income countries ['),
 (2,
  'Betatrophin is a newly recognized liver-derived hormone that has been implicated in both glucose and lipid metabolism [ Type 2 diabetes mellitus (T2DM) is one of the most severe public health problems and affects over 170 million people worldwide ['),
 (3,
  'In addition, an 

### Get article's abstract

The abstract is built the same way as the body, reading from the **abstract** tag instead of **body**.

In [53]:
def get_abstract_list(input_dir):
    # Only the last rowTag set on a reader takes effect, so rows are <abstract> elements.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='abstract') \
              .load(input_dir)
    # Pair each article's abstract text with its position: (index, abstract).
    return df.select("sec", "p") \
             .rdd \
             .map(lambda row: build_body(row['sec'], row["p"])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_abstract_list(input_dir).take(5)

[(0,
  ' In very preterm infants, white matter injury is a prominent brain injury, and hypoxic ischemia (HI) and infection are the two primary pathogenic factors of this injury. Microglia and microvascular endothelial cells closely interact; therefore, a common signaling pathway may cause neuroinflammation and blood–brain barrier (BBB) damage after injury to the immature brain. CXC chemokine ligand 5 (CXCL5) is produced in inflammatory and endothelial cells by various organs in response to insults. CXCL5 levels markedly increased in the amniotic cavity in response to intrauterine infection and preterm birth in clinical studies. The objective of this study is to determine whether CXCL5 signaling is a shared pathway of neuroinflammation and BBB injury that contributes to white matter injury in the immature brain. Postpartum day 2 (P2) rat pups received lipopolysaccharide (LPS) followed by 90-min HI. Immunohistochemical analyses were performed to determine microglial activation, neutrophi

### Get article's categories

In [59]:
def get_article_categories(row):
    categories = []
    for subj_group in row['subj-group']:
        subject = subj_group['subject']
        if isinstance(subject, str):
            # Note: a single string subject replaces anything collected so far.
            categories = [subject]
        elif isinstance(subject, list):
            # A list of subjects is appended to what was collected so far.
            categories.extend(subject)
    return categories


def get_categories_list(input_dir):
    # Only the last rowTag set on a reader takes effect, so rows are <article-meta> elements.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='article-meta') \
              .load(input_dir)

    # Pair each article's categories with its position: (index, categories).
    return df.select("article-categories") \
             .rdd \
             .map(lambda row: get_article_categories(row['article-categories'])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_categories_list(input_dir).take(20)

[(0, ['Article']),
 (1, ['Article']),
 (2, ['Article']),
 (3, ['Article']),
 (4, ['Article']),
 (5, ['Article']),
 (6, ['Article']),
 (7, ['Analytical Methods']),
 (8, ['Article']),
 (9, ['Article']),
 (10, ['Article']),
 (11, ['Original Article']),
 (12, ['Article']),
 (13, ['Article']),
 (14, ['Article']),
 (15, ['Article']),
 (16, ['Article']),
 (17, ['Article']),
 (18, ['Article']),
 (19, ['Article'])]
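
To make the branching in `get_article_categories` concrete, the sketch below runs the same logic on plain dictionaries with hypothetical sample data. `collect_categories` mirrors the notebook's behaviour, where a string subject replaces earlier entries; `collect_all_categories` is a hypothetical variant, not the notebook's method, that keeps every subject.

```python
def collect_categories(article_categories):
    """Same branching as get_article_categories, applied to plain dicts."""
    categories = []
    for group in article_categories["subj-group"]:
        subject = group["subject"]
        if isinstance(subject, str):
            # A string subject replaces whatever was collected before.
            categories = [subject]
        elif isinstance(subject, list):
            categories.extend(subject)
    return categories


def collect_all_categories(article_categories):
    """Hypothetical variant that keeps the subjects of every group."""
    categories = []
    for group in article_categories["subj-group"]:
        subject = group["subject"]
        if isinstance(subject, str):
            categories.append(subject)
        elif isinstance(subject, list):
            categories.extend(subject)
    return categories


# Hypothetical record with two subject groups.
sample = {"subj-group": [{"subject": ["Neurology", "Pediatrics"]},
                         {"subject": "Research"}]}
print(collect_categories(sample))      # ['Research']
print(collect_all_categories(sample))  # ['Neurology', 'Pediatrics', 'Research']
```

Which behaviour is preferable depends on whether articles in the dataset carry more than one subject group; the outputs shown in this notebook contain a single category per article.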

### Get article's title

In [55]:
def get_titles_list(input_dir):
    # Only the last rowTag set on a reader takes effect, so rows are <title-group> elements.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='title-group') \
              .load(input_dir)

    # Pair each article's title with its position: (index, title).
    return df.select("article-title") \
             .rdd \
             .map(lambda row: row['article-title']) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_titles_list(input_dir).take(5)

[(0,
  'CXCL5 signaling is a shared pathway of neuroinflammation and blood–brain barrier injury contributing to white matter injury in the immature brain'),
 (1,
  'Family-led rehabilitation after stroke in India: the ATTEND trial, study protocol for a randomized controlled trial'),
 (2,
  'Higher serum betatrophin level in type 2 diabetes subjects is associated with urinary albumin excretion and renal function'),
 (3, 'The expression of heterologous MAM-7 in '),
 (4,
  'The ethics of future trials: qualitative analysis of physicians’ decision making')]

### Get final dataframe

In [56]:
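def get_final_rdd():
    # Every RDD is keyed by its zipWithIndex position, so records line up only
    # if each reader returns the articles in the same order.
    final_rdd = get_abstract_list(input_dir)
    final_rdd = final_rdd.join(get_bodys_list(input_dir))
    final_rdd = final_rdd.join(get_categories_list(input_dir))
    final_rdd = final_rdd.join(get_titles_list(input_dir))
    return final_rdd.sortByKey()

def get_final_to_pandas():
    # After three joins each record is (index, (((abstract, body), categories), title)).
    # Returns a Spark DataFrame; call .toPandas() on it to get a pandas one.
    return get_final_rdd().map(lambda record: (record[1][0][0][0], record[1][0][0][1], record[1][0][1], record[1][1]))\
                          .toDF(["abstract", "body", "categories", "title"])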

get_final_to_pandas().toPandas()

| | abstract | body | categories | title |
|---|---|---|---|---|
| 0 | In very preterm infants, white matter injury ... | Cerebral white matter injury is a prominent br... | [Research] | CXCL5 signaling is a shared pathway of neuroin... |
| 1 | Globally, most strokes occur in low- and midd... | Like many developing countries, India is exper... | [Study Protocol] | Family-led rehabilitation after stroke in Indi... |
| 2 | Betatrophin is a newly identified liver-deriv... | Betatrophin is a newly recognized liver-derive... | [Original Investigation] | Higher serum betatrophin level in type 2 diabe... |
| 3 | `<italic>Vibrio parahaemolyticus</italic><ital...` | In addition, an outer membrane protein, multiv... | [Research Article] | The expression of heterologous MAM-7 in |
| 4 | The decision to conduct a randomized controll... | The timing of initiation of an RCT raises ethi... | [Research] | The ethics of future trials: qualitative analy... |
| ... | ... | ... | ... | ... |
| 98 | Malaria in pregnancy contributes greatly to m... | Interventions to prevent malaria through the p... | [Research] | Perceptions and practices for preventing malar... |
| 99 | Improving the quality of care of at the medic... | Many conditions are managed exclusively at the... | [Research Article] | The relationship between GPs and hospital cons... |
| 100 | Adipose-derived stem cells (ASCs) are being i... | It has been demonstrated that the regenerative... | [Research] | Mass spectrometry analysis of adipose-derived ... |
| 101 | Most adolescents begin alcohol consumption du... | In most Western industrialised countries there... | [Research Article] | Parental supply of alcohol to Australian minor... |
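
The index path `record[1][0][0][0]` used in `get_final_to_pandas` follows from how successive joins nest their values. The sketch below imitates that nesting in plain Python; `inner_join` is a hypothetical helper that assumes unique keys (an RDD `join` would produce every combination for duplicate keys), and the sample values are placeholders.

```python
def inner_join(left, right):
    """Mimic an inner join of two lists of (key, value) pairs with unique keys."""
    right_by_key = dict(right)
    return [(key, (value, right_by_key[key]))
            for key, value in left if key in right_by_key]


# Placeholder data keyed by article position.
abstracts = [(0, "abstract 0"), (1, "abstract 1")]
bodies = [(0, "body 0"), (1, "body 1")]
categories = [(0, ["Research"]), (1, ["Study Protocol"])]
titles = [(0, "title 0"), (1, "title 1")]

# Each join wraps the previous value in a new pair:
# (key, (((abstract, body), categories), title))
joined = inner_join(inner_join(inner_join(abstracts, bodies), categories), titles)


def flatten(record):
    """Unpack the nested pairs into a flat (abstract, body, categories, title) tuple."""
    _key, (((abstract, body), cats), title) = record
    return abstract, body, cats, title


rows = [flatten(record) for record in joined]
print(rows[0])
# ('abstract 0', 'body 0', ['Research'], 'title 0')
```

Unpacking by pattern, as in `flatten`, gives the same fields as the chained indices and is easier to check against the join order.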
