# A - 1 - Metadata Acquisition - MARC XML

## Description
**Process aim:**
This process aims to acquire a MARC XML file representing a collection of records, fetch and extract the full text of the resources described, and save all information in a CSV file. Most of the MARC records should contain at least one URL locating the full text of the resource in PDF.

**Input:** a MARC XML file saved in data/acquisition

**Sub-processes**:
1. Import and Parse MARC XML
2. Convert MARC XML to a Dataframe
3. Save Extracted Metadata

**Output:** a CSV file

In [1]:
# Python Libraries required in this section
# lxml: used to parse and extract data from an XML file
from lxml import etree
# Pandas: used to structure metadata as table for better visualization and manipulation
import pandas as pd

## 1. Import and Parse MARC XML
* **Input**: MARC XML file
* **Output**: Collection of XML records extracted from the file
* **Customization:** None; the code can be run as is

In [2]:
def clean_prefixes(xml_root):
    '''
    This function takes the XML root as argument and strips the namespace prefix from every tag before returning it.
    '''
    for e in xml_root.iter():
        if hasattr(e.tag, 'find'):
            i = e.tag.find('}')
            if i >= 0:
                e.tag = e.tag[i+1:]
    return xml_root

In [3]:
# Set a variable with parameters on how the parser should behave.
parser = etree.XMLParser(encoding='utf-8')
# Import xml file as an xml etree
tree = etree.parse('data/acquisition/undl_marc.xml',parser)
# Remove prefixes in XML elements
root = clean_prefixes(tree.getroot())
# Get all <record> elements in the MARC XML
xml_records = root.findall("record")
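
As a minimal sketch of what this step produces, the snippet below parses a small in-memory MARC XML string instead of the file, strips the namespace prefix in the same spirit as `clean_prefixes`, and collects the `record` elements. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies; the sample records and the `strip_namespaces` name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record collection using the MARC 21 slim namespace
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record><controlfield tag="001">1</controlfield></record>
  <record><controlfield tag="001">2</controlfield></record>
</collection>"""

def strip_namespaces(xml_root):
    """Remove the '{namespace}' prefix from every element tag."""
    for element in xml_root.iter():
        # Only string tags can carry a namespace prefix
        if isinstance(element.tag, str) and "}" in element.tag:
            element.tag = element.tag.split("}", 1)[1]
    return xml_root

root = strip_namespaces(ET.fromstring(SAMPLE))
records = root.findall("record")
record_ids = [r.findtext("controlfield[@tag='001']") for r in records]
print(len(records), record_ids)
```

Without the stripping step, `findall("record")` would return an empty list here, because each tag would still carry the namespace in curly braces.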

## 2. Convert MARC XML to a Dataframe
* **Aim:** The aim of this sub-process is to extract the relevant values from the XML records, then organize and store them in a data structure for further manipulation and processing.
* **Input**: A collection of MARC XML records
* **Steps**:
    * Extract and get metadata values
    * Create, check and reshape the dataframe of Metadata
* **Output:** A dataframe containing all relevant metadata
* **Customization:** remove and add desired metadata fields

### Extract and get metadata values

#### get_metadata()
This function returns a dictionary for each record in the MARC XML. The structure of the dictionary is:
<pre><code>
{
field1: 'value'
field2: ['value1', 'value2']
}
</code></pre>

* field1 represents a field with only one value; field2 represents a field with multiple values.

***Customization***

You can remove unwanted fields by adding a **#** in front of the relevant line. To add a new field:
<pre><code>"variable_name": get_md_value(r,"datafield[@tag='MARC_tag'][@ind1='indicator']/subfield[@code='MARC_subfield']"),</code></pre>

For instance:
* variable_name: report
* MARC_tag: @tag='993'
* indicator: @ind1='3' (optional)
* MARC_subfield: subfield[@code='a'] (optional)
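
To illustrate how the optional indicator narrows a query, here is a small sketch run against a made-up record. It uses the standard library's `xml.etree.ElementTree`, which accepts the same path syntax for these simple queries; the record content and the values in it are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record with two 993 fields that differ by their first indicator
record = ET.fromstring("""<record>
  <datafield tag="993" ind1="3" ind2=" "><subfield code="a">value-one</subfield></datafield>
  <datafield tag="993" ind1="2" ind2=" "><subfield code="a">value-two</subfield></datafield>
</record>""")

# Tag and subfield only: both fields match
all_993 = record.findall("datafield[@tag='993']/subfield[@code='a']")
# Tag, indicator and subfield: only the field with ind1='3' matches
only_ind3 = record.findall("datafield[@tag='993'][@ind1='3']/subfield[@code='a']")

all_values = [e.text for e in all_993]
ind3_values = [e.text for e in only_ind3]
print(all_values)
print(ind3_values)
```

Leaving out the indicator predicate is the broader choice; adding it is useful when the same tag is reused with different meanings.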

In [4]:
def get_metadata(r):
    """
    Takes a MARC XML record. For each metadata field, the function calls get_md_value, which returns
    an empty string (no value found), a string (one value found), or a list of strings (multiple values found).
    """
    # Each record is represented as a dictionary that includes the following fields
    metadata =  {
        # Standard MARC fields
        "record_id": get_md_value(r,"controlfield[@tag='001']"),
        # The title field concatenates 3 subfields (a, b, c)
        "title": get_md_value(
            r,"datafield[@tag='245']/subfield[@code='a']") + " " + get_md_value(
            r,"datafield[@tag='245']/subfield[@code='b']") + " " + get_md_value(
            r,"datafield[@tag='245']/subfield[@code='c']"),
        "topics": get_md_value(r,"datafield[@tag='650']/subfield[@code='a']"),
        "geographic_terms": get_md_value(r,"datafield[@tag='651']/subfield[@code='a']"),
        "corporates": get_md_value(r,"datafield[@tag='610']/subfield[@code='a']"),      
        # Local MARC fields used for UN documents
        "symbol": get_md_value(r,"datafield[@tag='191']/subfield[@code='a']"),
        "body": get_md_value(r,"datafield[@tag='191']/subfield[@code='b']"),
        "session": get_md_value(r,"datafield[@tag='191']/subfield[@code='c']"),
    }
    # Call the function get_files_md. This function returns another dictionary {description: url}
    files_info = get_files_md(r)
    # Merge the initial metadata dictionary with the dictionary containing the files information
    metadata = {**metadata,**files_info}
    return metadata

#### get_files_md()
This function is used by get_metadata to get the information recorded in MARC 856: the description of the file and its URL. Based on a metadata record, it returns the following data structure:
<pre><code>
{
    description: url
}
</code></pre>

By default, it returns the values of subfield y (description) and subfield u (URL) of MARC 856 fields whose first indicator is set to 4, which indicates that the resource is accessible via HTTP.

***Customization***

The subfield used for the description can be modified. For instance, if you would rather use the file extension, change the reference from subfield y to subfield q:
<pre><code>
description = "url-" + get_md_value(item, "subfield[@code='y']")
</code></pre>
to
<pre><code>
description = "url-" + get_md_value(item, "subfield[@code='q']")
</code></pre>

In [5]:
def get_files_md(r):
    '''
    This function gets the URL and its description from an XML record. It builds a dictionary
    where the description is the key and the URL is the value.
    '''
    files = {}
    MARC_856 = r.findall("datafield[@tag='856'][@ind1='4']")
    if len(MARC_856) > 0:
        for item in MARC_856:
            description = "url-" + get_md_value(item, "subfield[@code='y']")
            # Deals with some variation of encoding in the name of languages (Specific to UN)
            description = description.replace('ñ','ñ').replace('ç','ç')
            url = get_md_value(item, "subfield[@code='u']")
            files[description] = url
    return files
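
One common source of the encoding variation mentioned in the comment above is that the same accented letter can be stored either as a single precomposed code point or as a base letter followed by a combining mark. As an alternative to chained `.replace()` calls, and not the approach used in this notebook, the standard library's `unicodedata.normalize` can bring both spellings to a single form. The sample strings and the `normalize_label` helper are assumptions.

```python
import unicodedata

# Two spellings of the same label: precomposed vs. base letter + combining mark
precomposed = "Espa\u00f1ol"   # 'ñ' as a single code point
decomposed = "Espan\u0303ol"   # 'n' followed by COMBINING TILDE

# The strings differ even though they render alike
differ = precomposed != decomposed

def normalize_label(label, form="NFC"):
    """Return the label in the requested Unicode normalization form."""
    return unicodedata.normalize(form, label)

same = normalize_label(precomposed) == normalize_label(decomposed)
print(differ, same)
```

If labels are normalized this way, any later lookup by column name (for example in `.rename()`) needs to use the same normalization form for its keys.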

#### get_md_value()

This function is used by the two preceding functions (get_metadata and get_files_md). It processes an XML record and gets the relevant MARC field or subfield identified by the query. It returns a string or a list of strings. It assumes that the queried MARC XML elements do not have children. If no element matches the query, the function returns an empty string.

In [6]:
def get_md_value(marc_xml_record, query):
    """
    Takes 2 arguments: a MARC XML record and a query (XPath) that identifies an XML element.
    Queries the record for the targeted element. If no element is found, returns an empty string;
    if one element is found, returns a string; if more than one element is found, returns a list of strings.
    """
    # Query the record to get all matching elements
    xml_elements = marc_xml_record.findall(query)
    # Return a list of strings, a string, or an empty string depending on the number of matches
    if len(xml_elements) > 1:
        return [item.text for item in xml_elements]  # multiple values, returns a list of strings
    elif len(xml_elements) == 1:
        return xml_elements[0].text  # one value, returns a string
    else:
        return ""  # no value, returns an empty string
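
Because the return type varies between a string and a list, code that consumes these values may find it convenient to coerce them to a single shape first. The `as_list` helper below is a hypothetical addition, not part of the notebook, sketching one way to do that.

```python
def as_list(value):
    """Coerce a value that may be '', None, a string, or a list into a list of strings."""
    if value is None or value == "":
        return []
    if isinstance(value, list):
        # Drop missing entries (None or '') that can come from empty XML elements
        return [v for v in value if v]
    return [value]

empty = as_list("")
single = as_list("A/65/PV.71")
several = as_list(["SOCIAL DEVELOPMENT", None, "AGEING PERSONS"])
print(empty, single, several)
```

With every value in list form, operations such as joining subfields or counting terms no longer need to branch on the type.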

In [7]:
# Get a list of dictionaries containing the metadata specified in the get_metadata function
dictionary_records = [get_metadata(r) for r in xml_records]   

### Create, check and reshape the metadata dataframe
From the list of dictionaries, a dataframe can be created. A dataframe is a table-like structure that eases data manipulation and extraction. Some basic information on the dataset can be obtained using:
* md_dataset.info(): prints information about the dataset, for instance:
    * number of entries (rows)
    * number of non-null values per column
* md_dataset.columns: names of the columns
* md_dataset.head(): prints the first five rows of the table, including the headers
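
The non-null counts are worth a closer look: when the dictionaries do not all share the same keys, pandas creates one column per key and fills the gaps with NaN. The sketch below shows this on two made-up records; the record values and URL are hypothetical.

```python
import pandas as pd

# Hypothetical records: the second one lacks the 'url-English' key
records = [
    {"record_id": "1", "symbol": "X/1", "url-English": "http://example.org/1-en.pdf"},
    {"record_id": "2", "symbol": "X/2"},
]

df = pd.DataFrame(records).set_index("record_id")

# Count non-null values per column, the same figures that .info() reports
non_null = df.notna().sum()
print(non_null)
print(list(df.columns))
```

This is why a URL column can report fewer non-null values than the number of records: a record with no file in that language simply has no entry for that key.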

In [8]:
# Create a data frame
md_dataset = (pd.DataFrame(dictionary_records)
              .set_index('record_id'))  
# Get dataframe information
md_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 703243 to 702423
Data columns (total 15 columns):
body                200 non-null object
corporates          200 non-null object
geographic_terms    200 non-null object
session             200 non-null object
symbol              200 non-null object
title               200 non-null object
topics              200 non-null object
url-                1 non-null object
url-English         198 non-null object
url-Español        195 non-null object
url-Français       198 non-null object
url-Other           1 non-null object
url-Русский         197 non-null object
url-العربية         194 non-null object
url-中文              196 non-null object
dtypes: object(15)
memory usage: 25.0+ KB


In [9]:
# Print a limited number of rows, in this case 2
md_dataset.head(2)

record_id,body,corporates,geographic_terms,session,symbol,title,topics,url-,url-English,url-Español,url-Français,url-Other,url-Русский,url-العربية,url-中文
703243,A/,,,65,A/65/PV.71,"General Assembly official records, 65th sessio...","[SOCIAL DEVELOPMENT, AGEING PERSONS, DEMOGRAPH...",,http://digitallibrary.un.org/record/703243/fil...,http://digitallibrary.un.org/record/703243/fil...,http://digitallibrary.un.org/record/703243/fil...,,http://digitallibrary.un.org/record/703243/fil...,http://digitallibrary.un.org/record/703243/fil...,http://digitallibrary.un.org/record/703243/fil...
703245,A/,[UN. Peacebuilding Commission. Organizational ...,,65,A/65/PV.72,"General Assembly official records, 65th sessio...","[POPULATION PROGRAMMES, CHEMICAL WEAPONS, COOP...",,http://digitallibrary.un.org/record/703245/fil...,http://digitallibrary.un.org/record/703245/fil...,http://digitallibrary.un.org/record/703245/fil...,,http://digitallibrary.un.org/record/703245/fil...,http://digitallibrary.un.org/record/703245/fil...,http://digitallibrary.un.org/record/703245/fil...


In [10]:
# Print the columns names
md_dataset.columns

Index(['body', 'corporates', 'geographic_terms', 'session', 'symbol', 'title',
       'topics', 'url-', 'url-English', 'url-Español', 'url-Français',
       'url-Other', 'url-Русский', 'url-العربية', 'url-中文'],
      dtype='object')

#### Removing and renaming columns
* Using .drop() we remove the columns 'url-' and 'url-Other', which contain very little information (one non-null value each).
* Using .rename() we rename the url columns to remove any special character.

In [11]:
# Drop unwanted columns and shorten some column names
md_dataset = (md_dataset
              .drop(['url-','url-Other'], axis=1) # delete columns 'url-' and 'url-Other'
              .rename(columns={'url-English':'url-en','url-Español':'url-es','url-Français':'url-fr',
                               'url-Русский': 'url-ru','url-العربية':'url-ar','url-中文': 'url-zh'}))

In [12]:
# Check the dataframe information to ensure it was correctly processed
md_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 703243 to 702423
Data columns (total 13 columns):
body                200 non-null object
corporates          200 non-null object
geographic_terms    200 non-null object
session             200 non-null object
symbol              200 non-null object
title               200 non-null object
topics              200 non-null object
url-en              198 non-null object
url-es              195 non-null object
url-fr              198 non-null object
url-ru              197 non-null object
url-ar              194 non-null object
url-zh              196 non-null object
dtypes: object(13)
memory usage: 21.9+ KB


## 3. Save extracted metadata
From the dataset, it is easy to save the metadata in a variety of formats, including CSV, JSON or Excel.
* CSV: dataset_name.to_csv(path)
* JSON: dataset_name.to_json(path)
* Excel: dataset_name.to_excel(path)

In this case we will save a CSV file in data/acquisition/

In [13]:
# Save the content of the dataset in data/acquisition/
md_dataset.to_csv('data/acquisition/dataset_from_MARC.csv')
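
One caveat when saving to CSV, sketched below under assumptions: cells that hold Python lists (such as `topics` when a record has several values) are written as their string representation, so they come back as plain strings when the file is read again. The snippet writes to an in-memory buffer instead of `data/acquisition/`, and the `restore_list` helper and sample values are hypothetical.

```python
import ast
import io

import pandas as pd

# Hypothetical dataset: one multi-valued cell and one single-valued cell
df = pd.DataFrame(
    [
        {"record_id": "1", "topics": ["AGEING PERSONS", "SOCIAL DEVELOPMENT"]},
        {"record_id": "2", "topics": "DISARMAMENT"},
    ]
).set_index("record_id")

# Write to and read back from an in-memory CSV
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype={"record_id": str}).set_index("record_id")

# After the round trip, the list is a string such as "['AGEING PERSONS', ...]"
raw_value = reloaded.loc["1", "topics"]

def restore_list(value):
    """Turn the string form of a list back into a list; leave other values alone."""
    if isinstance(value, str) and value.startswith("["):
        return ast.literal_eval(value)
    return value

reloaded["topics"] = reloaded["topics"].apply(restore_list)
print(type(raw_value).__name__, reloaded.loc["1", "topics"])
```

If preserving nested values matters more than spreadsheet compatibility, saving with `to_json` avoids this conversion step.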