<h1>Data parsing<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-dependencies" data-toc-modified-id="Import-dependencies-1">Import dependencies</a></span></li><li><span><a href="#Load-data" data-toc-modified-id="Load-data-2">Load data</a></span></li><li><span><a href="#Data-exploration" data-toc-modified-id="Data-exploration-3">Data exploration</a></span><ul class="toc-item"><li><span><a href="#Get-a-list-of-standalone-paragraphs-from-each-article's-body" data-toc-modified-id="Get-a-list-of-standalone-paragraphs-from-each-article's-body-3.1">Get a list of standalone paragraphs from each article's body</a></span></li><li><span><a href="#Get-a-list-of-sections-(in-each-section-are-one-or-more-paragraphs)--from-each-article's-body" data-toc-modified-id="Get-a-list-of-sections-(in-each-section-are-one-or-more-paragraphs)--from-each-article's-body-3.2">Get a list of sections (in each section are one or more paragraphs)  from each article's body</a></span></li><li><span><a href="#Get-article's-body" data-toc-modified-id="Get-article's-body-3.3">Get article's body</a></span></li><li><span><a href="#Get-article's-abstract" data-toc-modified-id="Get-article's-abstract-3.4">Get article's abstract</a></span></li><li><span><a href="#Get-article's-categories" data-toc-modified-id="Get-article's-categories-3.5">Get article's categories</a></span></li><li><span><a href="#Get-article's-title" data-toc-modified-id="Get-article's-title-3.6">Get article's title</a></span></li><li><span><a href="#Get-final-dataframe" data-toc-modified-id="Get-final-dataframe-3.7">Get final dataframe</a></span></li></ul></li></ul></div>

## Import dependencies

In [5]:
import os
import shutil
import warnings
import random
warnings.simplefilter("ignore")

from shutil import copyfile
from distutils.dir_util import copy_tree

## Load data

In [6]:
input_dir = "../datasets/2016_testing_df/"

# Each <record> element becomes one row of the DataFrame.
df = spark.read.format('com.databricks.spark.xml') \
    .option("rowTag", "record") \
    .load(input_dir)

In [7]:
df.rdd.count()

103

## Data exploration

### Get a list of standalone paragraphs from each article's body

Standalone paragraphs can be read either from **body/p** or directly from the **p** tag under the root; both locations give the same results.
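As a rough illustration of why the parser in the next cell checks two cases, the sketch below mimics the two shapes it expects for a `p` entry: a plain string, or an object whose text sits in a `_VALUE` field. `extract_text` and the `SimpleNamespace` objects are hypothetical stand-ins written for this example; they are not part of the notebook or of Spark.

```python
from types import SimpleNamespace


def extract_text(paragraphs):
    """Collect paragraph text from plain strings or objects with a _VALUE field."""
    texts = []
    for paragraph in paragraphs or []:
        if paragraph is None:
            continue  # missing paragraph
        if isinstance(paragraph, str):
            texts.append(paragraph)
        elif getattr(paragraph, "_VALUE", None) is not None:
            texts.append(paragraph._VALUE)
    return texts


# Made-up input covering every branch of the function.
mixed = [
    "A plain-text paragraph.",
    SimpleNamespace(_VALUE="Text stored in a struct."),
    SimpleNamespace(_VALUE=None),  # struct without text: skipped
    None,                          # missing paragraph: skipped
]

print(extract_text(mixed))
print(extract_text(None))
```

Entries that carry no text are dropped rather than kept as empty strings, so an article without standalone paragraphs ends up with an empty list.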

In [36]:
def parse_paragraphs_list(list_of_paragraphs):
    """Return the text of every non-empty paragraph as a list of strings."""
    list_of_paragraphs_to_ret = []
    if list_of_paragraphs is not None:
        for paragraph in list_of_paragraphs:
            if paragraph is None:
                continue
            if isinstance(paragraph, str):
                # Plain-text paragraph.
                list_of_paragraphs_to_ret.append(paragraph)
            elif paragraph._VALUE is not None:
                # Struct paragraph: the text is stored in the _VALUE field.
                list_of_paragraphs_to_ret.append(paragraph._VALUE)
    return list_of_paragraphs_to_ret

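def get_paragraphs_list_from_body(input_dir):
    # Repeated .options(rowTag=...) calls overwrite one another, so only the
    # last value ('body') ever took effect; it is set once here.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='body') \
              .load(input_dir)
    # Key each body by its positional index: (index, list_of_paragraphs).
    return df.select("p") \
             .rdd \
             .map(lambda row: parse_paragraphs_list(row['p'])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))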


get_paragraphs_list_from_body(input_dir).take(10)

[(0,
  ['There are few studies regarding stenosis at the level of the atlas, with little data existing to define a critical threshold for stenosis.']),
 (1, []),
 (2, []),
 (3, []),
 (4,
  ['Several cohort studies have assessed longitudinal WMH progression over approximately 2–6 years (']),
 (5, []),
 (6,
  ['One of the rarest complications involves the formation of a mass surrounding the SCS electrode, resulting in spinal cord compression. A search of the published literature showed that only seven documented cases have been reported.']),
 (7,
  ['This bibliometric study aims to present a success story of the journal based on journal metric analysis and comparison to 3 other orthopedic journals from Asia that are indexed in Web of Science Core Collection, which will also show how far-sighted the leaders of the KOA who decided to publish an English journal were. At the same time, I hope this article will be helpful to local journals in paving the way for international circulation.',
  

### Get a list of sections (in each section are one or more paragraphs)  from each article's body

The sections are the same as in **metadata/article/body**

In [37]:
def parse_sections_list(list_of_sections):
    """Return one list of paragraph strings per section that has paragraphs."""
    if list_of_sections is None:
        return []
    return [parse_paragraphs_list(section.p)
            for section in list_of_sections
            if section.p is not None]

def get_sections_list(input_dir):
    # Only the last rowTag option is effective, so 'body' is set once.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='body') \
              .load(input_dir)
    # Key each body by its positional index: (index, list_of_sections).
    return df.select("sec") \
             .rdd \
             .map(lambda row: parse_sections_list(row['sec'])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_sections_list(input_dir).take(10)

[(0,
  [['The sagittal diameter of the cervical spinal canal is important, as a small canal diameter is associated with cervical myelopathy and with a high risk of spinal cord injury after trauma.']]),
 (1,
  [['Pathological gambling (PG) is a psychiatric disorder characterized by a pre-occupation with thoughts of gambling, repeated attempts to reduce or quit, debt and/or illegal activity, and disruption of personal relationships and/or employment. PG has been estimated to affect between 0.2 and 5.3% of the adult population worldwide (']]),
 (2,
  [['The sex-steroid hormone estradiol (E2) enhances the psychoactive effects of cocaine, as evidenced by clinical and preclinical studies (']]),
 (3, [['ADCC in HIV-1 has been studied for over 20\xa0years (']]),
 (4,
  [['Brain white matter hyperintensities (WMH) are common in community-dwelling older people and are associated with cognitive decline, dementia, stroke, and death (']]),
 (5,
  [['Spontaneous cerebrospinal fluid (CSF) fistula of 

### Get article's body

The body of the article is built as follows:
-  concatenate the standalone paragraphs (1)
-  concatenate the paragraphs within each section (2)
-  concatenate the resulting section texts (3)
-  finally, concatenate the results from (1) and (3)
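The steps above can be sketched in plain Python, with no Spark involved. `join_body` is a hypothetical helper written for illustration and the input strings are made up; it only mirrors the joining order described in the list.

```python
def join_body(standalone_paragraphs, sections):
    """Join standalone paragraphs and section text with single spaces."""
    # Standalone paragraphs become one string.
    standalone_text = " ".join(standalone_paragraphs)
    # The paragraphs inside each section become one string per section.
    section_texts = [" ".join(paragraphs) for paragraphs in sections]
    # The per-section strings become one string.
    sections_text = " ".join(section_texts)
    # Standalone text comes first, then the section text.
    return standalone_text + " " + sections_text


body = join_body(
    ["Standalone one.", "Standalone two."],
    [["Sec A, par 1.", "Sec A, par 2."], ["Sec B, par 1."]],
)
print(body)

# With no standalone paragraphs the result keeps a leading space.
no_standalone = join_body([], [["Only section text."]])
print(repr(no_standalone))
```

Because the two parts are always joined with a space, an article that has only section text starts with a blank; calling `.strip()` on the result is one way to remove it if that matters downstream.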

In [38]:
def build_body(row_sec, row_p):
    """Concatenate standalone paragraphs, then the text of all sections."""
    concat_parsed_sections = ' '.join([' '.join(section) for section in parse_sections_list(row_sec)])
    concat_standalone_paragraphs = ' '.join(parse_paragraphs_list(row_p))
    # The debug print was dropped: inside an RDD map it runs on the executors.
    return concat_standalone_paragraphs + " " + concat_parsed_sections

In [39]:
def get_bodys_list(input_dir):
    # Only the last rowTag option is effective, so 'body' is set once.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='body') \
              .load(input_dir)
    # Key each body by its positional index: (index, body_text).
    return df.select("sec", "p") \
             .rdd \
             .map(lambda row: build_body(row['sec'], row["p"])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_bodys_list(input_dir).take(5)

[(0,
  'There are few studies regarding stenosis at the level of the atlas, with little data existing to define a critical threshold for stenosis. The sagittal diameter of the cervical spinal canal is important, as a small canal diameter is associated with cervical myelopathy and with a high risk of spinal cord injury after trauma.'),
 (1,
  ' Pathological gambling (PG) is a psychiatric disorder characterized by a pre-occupation with thoughts of gambling, repeated attempts to reduce or quit, debt and/or illegal activity, and disruption of personal relationships and/or employment. PG has been estimated to affect between 0.2 and 5.3% of the adult population worldwide ('),
 (2,
  ' The sex-steroid hormone estradiol (E2) enhances the psychoactive effects of cocaine, as evidenced by clinical and preclinical studies ('),
 (3, ' ADCC in HIV-1 has been studied for over 20\xa0years ('),
 (4,
  'Several cohort studies have assessed longitudinal WMH progression over approximately 2–6 years ( Brai

### Get article's abstract

The abstract is built the same way as the body, from the **sec** and **p** children of each **abstract** element.

In [40]:
def get_abstract_list(input_dir):
    # Only the last rowTag option is effective, so 'abstract' is set once.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='abstract') \
              .load(input_dir)
    # Key each abstract by its positional index: (index, abstract_text).
    return df.select("sec", "p") \
             .rdd \
             .map(lambda row: build_body(row['sec'], row["p"])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_abstract_list(input_dir).take(5)

[(0, ' '),
 (1,
  'Pathological gambling is a psychiatric disorder and the first recognized behavioral addiction, with similarities to substance use disorders but without the confounding effects of drug-related brain changes. Pathophysiology within the opioid receptor system is increasingly recognized in substance dependence, with higher mu-opioid receptor (MOR) availability reported in alcohol, cocaine and opiate addiction. Impulsivity, a risk factor across the addictions, has also been found to be associated with higher MOR availability. The aim of this study was to characterize baseline MOR availability and endogenous opioid release in pathological gamblers (PG) using [ '),
 (2,
  'The sex-steroid hormone estradiol (E2) enhances the psychoactive effects of cocaine, as evidenced by clinical and preclinical studies. The medial preoptic area (mPOA), a region in the hypothalamus, is a primary neural locus for neuroendocrine integration, containing one of the richest concentrations of es

### Get article's categories

In [41]:
def get_article_categories(row):
    """Flatten every subject of every subj-group into one list."""
    categories = []
    for subj_group in row['subj-group']:
        for cat in subj_group['subject']:
            categories.append(cat)
    return categories


def get_categories_list(input_dir):
    # Chained rowTag options do not describe a path: each call overwrites the
    # previous one, so only 'article-meta' was ever used as the row tag.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='article-meta') \
              .load(input_dir)

    # Key each article by its positional index: (index, categories).
    return df.select("article-categories") \
             .rdd \
             .map(lambda row: get_article_categories(row['article-categories'])) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_categories_list(input_dir).take(5)

[(0, ['Article']),
 (1, ['Original Article']),
 (2, ['Original Article']),
 (3, ['Research Paper']),
 (4, ['Regular Article'])]
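The nested loop in `get_article_categories` walks every `subj-group` and collects each `subject`. Here is a minimal sketch of the same flattening, with plain dicts standing in for Spark rows. The sample values are invented and `flatten_categories` is a hypothetical name. Unlike the notebook's version, it also tolerates a missing `article-categories` element, which is an added assumption about what the data might contain.

```python
def flatten_categories(article_categories):
    """Return every subject from every subj-group as one flat list."""
    if article_categories is None:
        return []  # article without an article-categories element
    categories = []
    for subj_group in article_categories.get("subj-group") or []:
        for subject in subj_group.get("subject") or []:
            categories.append(subject)
    return categories


# Invented sample: two subject groups, the second with two subjects.
sample = {
    "subj-group": [
        {"subject": ["Original Article"]},
        {"subject": ["Neuroscience", "Imaging"]},
    ]
}

print(flatten_categories(sample))
print(flatten_categories(None))
```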

### Get article's title

In [42]:
def get_titles_list(input_dir):
    # Each rowTag call overwrites the previous one, so only 'title-group'
    # was ever used as the row tag; it is set once here.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='title-group') \
              .load(input_dir)

    # Key each title by its positional index: (index, title).
    return df.select("article-title") \
             .rdd \
             .map(lambda row: row['article-title']) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_titles_list(input_dir).take(5)

[(0, 'Spinolaminar Line Test as a Screening Tool for C1 Stenosis'),
 (1,
  'Blunted Endogenous Opioid Release Following an Oral Amphetamine Challenge in Pathological Gamblers'),
 (2,
  'Estradiol in the Preoptic Area Regulates the Dopaminergic Response to Cocaine in the Nucleus Accumbens'),
 (3,
  'A new cell line for high throughput HIV-specific antibody-dependent cellular cytotoxicity (ADCC) and cell-to-cell virus transmission studies'),
 (4,
  'Vascular risk factors and progression of white matter hyperintensities in the Lothian Birth Cohort 1936')]

### Get final dataframe

In [43]:
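def get_final_rdd():
    # Every join on the positional index nests the previous value one level:
    # (index, (((abstract, body), categories), title)).
    final_rdd = get_abstract_list(input_dir)
    final_rdd = final_rdd.join(get_bodys_list(input_dir))
    final_rdd = final_rdd.join(get_categories_list(input_dir))
    final_rdd = final_rdd.join(get_titles_list(input_dir))
    return final_rdd.sortByKey()

def get_final_to_pandas():
    # Flatten the nested tuples into one row per article. The result is still
    # a Spark DataFrame; the caller converts it with toPandas().
    return get_final_rdd().map(lambda record: (record[1][0][0][0], record[1][0][0][1], record[1][0][1], record[1][1]))\
                          .toDF(["abstract", "body", "categories", "title"])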

get_final_to_pandas().toPandas()

| | abstract | body | categories | title |
|---|---|---|---|---|
| 0 | | There are few studies regarding stenosis at th... | [Article] | Spinolaminar Line Test as a Screening Tool for... |
| 1 | Pathological gambling is a psychiatric disorde... | Pathological gambling (PG) is a psychiatric d... | [Original Article] | Blunted Endogenous Opioid Release Following an... |
| 2 | The sex-steroid hormone estradiol (E2) enhance... | The sex-steroid hormone estradiol (E2) enhanc... | [Original Article] | Estradiol in the Preoptic Area Regulates the D... |
| 3 | Several lines of evidence indicate that antibo... | ADCC in HIV-1 has been studied for over 20 ye... | [Research Paper] | A new cell line for high throughput HIV-specif... |
| 4 | We aimed to determine associations between mul... | Several cohort studies have assessed longitudi... | [Regular Article] | Vascular risk factors and progression of white... |
| ... | ... | ... | ... | ... |
| 98 | Kraft’s work focuses on the mechanisms that re... | Tau protein can transfer between neurons trans... | [Article] | Neuronal activity enhances tau propagation and... |
| 99 | Many signaling proteins permanently or transie... | The hippocampus is critical to memory for even... | [Article] | Bidirectional prefrontal-hippocampal interacti... |
| 100 | Tumours exist in a hypoxic microenvironment an... | The brain depends almost exclusively on glucos... | [Article] | Glucose responsive neurons of the paraventricu... |
| 101 | Genetic heterogeneity contributes to clinical ... | Mutations in the breast cancer susceptibility ... | [Article] | FANCD2 limits replication stress and genome in... |
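The chained joins nest each previous result one level deeper, which is why the lambda in `get_final_to_pandas` indexes `record[1][0][0][0]`. The sketch below reproduces that nesting with plain lists. `join_by_key` is a hypothetical stand-in for `RDD.join` that assumes unique keys, and all values are placeholders.

```python
def join_by_key(left, right):
    """Inner-join two lists of (key, value) pairs; assumes keys are unique."""
    right_lookup = dict(right)
    return [
        (key, (value, right_lookup[key]))
        for key, value in left
        if key in right_lookup
    ]


# Placeholder data keyed by positional index.
abstracts = [(0, "abstract 0"), (1, "abstract 1")]
bodies = [(0, "body 0"), (1, "body 1")]
categories = [(0, ["Article"]), (1, ["Review"])]
titles = [(0, "title 0"), (1, "title 1")]

# Same join order as get_final_rdd: abstract, body, categories, title.
joined = join_by_key(join_by_key(join_by_key(abstracts, bodies), categories), titles)


def flatten(record):
    """Unpack (index, (((abstract, body), categories), title)) into a flat row."""
    _, (((abstract, body), cats), title) = record
    return (abstract, body, cats, title)


rows = [flatten(record) for record in sorted(joined, key=lambda r: r[0])]
print(joined[0])
print(rows[0])
```

An inner join keeps only the indices present on both sides. Since the key is a positional index from `zipWithIndex`, the rows line up only if every reader yields its elements in the same order and number, which is an assumption worth checking against the real data.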


In [44]:
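# Renamed so that this exploratory cell no longer shadows get_abstract_list,
# which get_final_rdd depends on.
def get_trans_abstract_titles_list(input_dir):
    # Only the last rowTag option is effective, so 'trans-abstract' is set once.
    df = spark.read \
              .format('com.databricks.spark.xml') \
              .options(rowTag='trans-abstract') \
              .load(input_dir)
    # Key each translated-abstract title by its positional index.
    return df.select("title") \
             .rdd \
             .map(lambda row: row['title']) \
             .zipWithIndex() \
             .map(lambda record: (record[1], record[0]))

get_trans_abstract_titles_list(input_dir).join(get_bodys_list(input_dir)).take(3)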

[(0,
  ('ABRÉGÉ',
   'There are few studies regarding stenosis at the level of the atlas, with little data existing to define a critical threshold for stenosis. The sagittal diameter of the cervical spinal canal is important, as a small canal diameter is associated with cervical myelopathy and with a high risk of spinal cord injury after trauma.')),
 (1,
  (None,
   ' Pathological gambling (PG) is a psychiatric disorder characterized by a pre-occupation with thoughts of gambling, repeated attempts to reduce or quit, debt and/or illegal activity, and disruption of personal relationships and/or employment. PG has been estimated to affect between 0.2 and 5.3% of the adult population worldwide ('))]