# A - 1 - Metadata Acquisition - MARC XML

## Description
**Process aim:**
* to transform a MARC XML file representing a collection of records into a table
* to select the label fields and potential feature fields and save them in a CSV file.

**Input:** a MARC XML file saved in data/A_input

**Sub-processes**:
1. Import and Parse MARC XML
2. Convert MARC XML to a Dataframe
3. Save Extracted Metadata

**Output:** a CSV file

In [1]:
# Python Libraries required in this section
# lxml: used to parse and extract data from an XML file
from lxml import etree
# Pandas: used to structure metadata as table for better visualization and manipulation
import pandas as pd

## 1. Import and Parse MARC XML
* **Input**: MARC XML file
* **Output**: Collection of XML records extracted from the file
* **Customization:** None; the code can be run as is

In [2]:
def clean_prefixes(xml_root):
    '''
    Takes the root element and strips the namespace prefix from every tag before returning it.
    '''
    for e in xml_root.iter():
        if hasattr(e.tag, 'find'):
            i = e.tag.find('}')
            if i >= 0:
                e.tag = e.tag[i+1:]
    return xml_root

In [3]:
# Configure the parser to read the file as UTF-8
parser = etree.XMLParser(encoding='utf-8')
# Import the XML file as an element tree
tree = etree.parse('../data/A_input/doc_2000_2017.xml', parser)
# Remove namespace prefixes from the XML element tags
root = clean_prefixes(tree.getroot())
# Get all <record> elements in the MARC XML collection
xml_records = root.findall("record")
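
The effect of the prefix cleanup can be seen on a small inline collection. This is a minimal sketch: it assumes the file declares the MARC21 slim namespace (`http://www.loc.gov/MARC21/slim`), and it uses the standard-library `xml.etree.ElementTree` in place of `lxml` so that it runs without extra installs. `sample_xml` and `strip_namespaces` are illustrative names, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record collection using the MARC21 slim namespace
sample_xml = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record><controlfield tag="001">1</controlfield></record>
  <record><controlfield tag="001">2</controlfield></record>
</collection>"""


def strip_namespaces(xml_root):
    # Same idea as clean_prefixes: drop the '{namespace}' part of every tag
    for element in xml_root.iter():
        if isinstance(element.tag, str) and '}' in element.tag:
            element.tag = element.tag.split('}', 1)[1]
    return xml_root


root_ns = ET.fromstring(sample_xml)
# Before cleaning, every tag carries the namespace, so a plain query finds nothing
found_before = len(root_ns.findall("record"))

root_clean = strip_namespaces(root_ns)
# After cleaning, the same query matches both records
found_after = len(root_clean.findall("record"))

print(found_before, found_after)
```

Without the cleanup, every query in the rest of the notebook would have to spell out the namespace for each element.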

## 2. Convert MARC XML to a Dataframe
* **Aim:** 
    * to extract the relevant values from the XML records
    * to organize and store them in a data structure for further manipulation and processing.
* **Input**: a collection of MARC XML records
* **Steps**:
    * Extract and get metadata values
    * Create, check and reshape the dataframe of Metadata
* **Output:** a dataframe containing all relevant metadata
* **Customization:** add or remove metadata fields as needed

### Extract and get metadata values

#### get_metadata()
This function returns a dictionary for each record in the MARC XML. The structure of the dictionary is:
<pre><code>
{
field1: 'value',
field2: ['value1', 'value2']
}
</code></pre>

* field1 represents a field with only one value, field2 a field with multiple values.
* A field with no value is stored as an empty string.

***Customization***

You can remove unwanted fields by adding a **#** in front of the relevant line. To add a new field, add a line of the form:
<pre><code>"variable_name": get_md_value(r,"datafield[@tag='MARC_tag'][@ind1='indicator']/subfield[@code='MARC_subfield']"),</code></pre>

For instance:
* a variable_name: report
* a MARC_tag: @tag='993'
* an indicator: @ind1='3' (optional)
* a MARC_subfield: subfield[@code='a'] (optional)
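
As a sketch of this pattern, the query for the example field can be assembled and tried on a single inline record before it is added to `get_metadata`. The helper `build_query` and the record content (`Draft report`) are hypothetical. The standard-library `xml.etree.ElementTree` stands in for `lxml` here; `findall` accepts the same path syntax in both.

```python
import xml.etree.ElementTree as ET


def build_query(tag, ind1=None, subfield=None):
    # Assemble a query for a data field; indicator and subfield are optional
    query = f"datafield[@tag='{tag}']"
    if ind1 is not None:
        query += f"[@ind1='{ind1}']"
    if subfield is not None:
        query += f"/subfield[@code='{subfield}']"
    return query


# Hypothetical record holding a single 993 field
record = ET.fromstring(
    '<record>'
    '<datafield tag="993" ind1="3" ind2=" ">'
    '<subfield code="a">Draft report</subfield>'
    '</datafield>'
    '</record>'
)

query = build_query("993", ind1="3", subfield="a")
values = [element.text for element in record.findall(query)]
print(query)
print(values)
```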

In [4]:
def get_metadata(r):
    """
    Takes a MARC XML record. For each metadata field, the function calls get_md_value, which returns
    an empty string (no value found), a string (one value found), or a list of strings (multiple values found).
    """
    # Each record is represented as a dictionary that includes the following fields
    metadata =  {
        "record_id": get_md_value(r,"controlfield[@tag='001']"),
        "title": get_md_value(
            r,"datafield[@tag='245']/subfield[@code='a']") + " " + get_md_value(
            r,"datafield[@tag='245']/subfield[@code='b']") + " " + get_md_value(
            r,"datafield[@tag='245']/subfield[@code='c']"),
        "topics_primary": get_md_value(r,"datafield[@tag='650'][@ind1='1']/subfield[@code='a']"),
        "topics_secondary": get_md_value(r,"datafield[@tag='650'][@ind1='2']/subfield[@code='a']"),
        "geo": get_md_value(r,"datafield[@tag='651']/subfield[@code='a']"),
        "date": get_md_value(r,"datafield[@tag='269']/subfield[@code='a']"),
        "symbol": get_md_value(r,"datafield[@tag='191']/subfield[@code='a']"),
        "body": get_md_value(r,"datafield[@tag='191']/subfield[@code='b']"),
    }
    # Call get_files_md, which returns another dictionary {description: url}
    files_info = get_files_md(r)
    # Merge the initial metadata dictionary with the dictionary containing the files information
    metadata = {**metadata,**files_info}
    return metadata

#### get_files_md()
This function is used by get_metadata to get the information recorded in MARC field 856: the description of each file and its URL. For a given metadata record, it returns the following data structure:
<pre><code>
{
    description: url
}
</code></pre>

By default, it returns the values of subfield y (description) and subfield u (URL) of the MARC 856 fields whose first indicator is set to 4, which indicates that the resource is accessible via HTTP. Each description is prefixed with `url_` to form the dictionary key.
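
The indicator filter and the resulting dictionary can be illustrated on an inline record. This is a sketch under assumptions: the `example.org` URLs and the third field (first indicator 7) are made up, the standard-library `xml.etree.ElementTree` stands in for `lxml`, and `findtext` is used instead of the notebook's `get_md_value` to keep the block short.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: two 856 fields with first indicator 4, one with indicator 7
record = ET.fromstring(
    '<record>'
    '<datafield tag="856" ind1="4" ind2=" ">'
    '<subfield code="u">http://example.org/files/report-EN.pdf</subfield>'
    '<subfield code="y">English</subfield>'
    '</datafield>'
    '<datafield tag="856" ind1="4" ind2=" ">'
    '<subfield code="u">http://example.org/files/report-FR.pdf</subfield>'
    '<subfield code="y">Français</subfield>'
    '</datafield>'
    '<datafield tag="856" ind1="7" ind2=" ">'
    '<subfield code="u">ftp://example.org/ignored</subfield>'
    '<subfield code="y">Other</subfield>'
    '</datafield>'
    '</record>'
)

files = {}
# Only 856 fields whose first indicator is 4 are kept
for field in record.findall("datafield[@tag='856'][@ind1='4']"):
    description = field.findtext("subfield[@code='y']", default="")
    url = field.findtext("subfield[@code='u']", default="")
    files["url_" + description] = url

print(files)
```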

***Customization***

The subfield used for the description can be modified. For instance, to use the file format instead of the description, change the reference to subfield y into a reference to subfield q:
<pre><code>
description = "url_" + get_md_value(item, "subfield[@code='y']")
</code></pre>
to 
<pre><code>
description = "url_" + get_md_value(item, "subfield[@code='q']")
</code></pre>

In [5]:
def get_files_md(r):
    '''
    Gets the URL and the description of each file from an XML record. It builds a dictionary
    where the description (prefixed with "url_") is the key and the URL is the value.
    '''
    files = {}
    MARC_856 = r.findall("datafield[@tag='856'][@ind1='4']")
    if len(MARC_856) > 0:
        for item in MARC_856:
            description = "url_" + get_md_value(item, "subfield[@code='y']")
            # Handle encoding variations in the names of languages (specific to UN records)
            description = description.replace('ñ','ñ').replace('ç','ç')
            url = get_md_value(item, "subfield[@code='u']")
            files[description] = url
    return files
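
The `replace` calls in this function target two specific characters. If the variation comes from decomposed characters (a base letter followed by a combining mark), a more general option is Unicode normalization with the standard-library `unicodedata` module, combined with stripping surrounding whitespace. This is offered only as a sketch, not as the notebook's method; `normalize_description` is a hypothetical helper.

```python
import unicodedata


def normalize_description(text):
    # NFC composes a base letter and its combining mark into a single character;
    # strip() removes leading and trailing whitespace
    return unicodedata.normalize("NFC", text).strip()


decomposed = "Espan\u0303ol"  # 'n' followed by a combining tilde
composed = "Espa\u00f1ol"     # precomposed 'ñ'

# The two strings look the same but compare as different
same_before = decomposed == composed
# After normalization they are equal
same_after = normalize_description(decomposed) == composed
# Surrounding spaces and a combining cedilla are handled the same way
french_ok = normalize_description(" Franc\u0327ais ") == "Fran\u00e7ais"

print(same_before, same_after, french_ok)
```

Applied to the description before it is used as a key, this would keep visually identical language names from producing separate columns.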

#### get_md_value()

This function is used by the two preceding functions (get_metadata and get_files_md). It queries an XML record for the MARC field or subfield identified by the query and returns a string or a list of strings. It assumes that the matched XML elements have no children, since only each element's own text is read. If no element matches, the function returns an empty string.

In [6]:
def get_md_value(marc_xml_record, query):
    """
    Takes 2 arguments: a MARC XML record and a query (XPath) that identifies an XML element.
    Queries the record for the targeted element. Returns an empty string if no element is found,
    a string if one element is found, and a list of strings if more than one element is found.
    """
    # Query the record to get all matching elements
    xml_elements = marc_xml_record.findall(query)
    # Return a list of strings, a string, or an empty string depending on the number of matches
    if len(xml_elements) > 1:
        return [item.text for item in xml_elements]  # multiple values, returns a list of strings
    elif len(xml_elements) == 1:
        return xml_elements[0].text  # one value, returns a string
    else:
        return ""  # no value, returns an empty string
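
The three return shapes can be checked on a small inline record. In this sketch, `md_value` is a compact stand-in with the same logic as `get_md_value`, and the standard-library `xml.etree.ElementTree` is used in place of `lxml`. The record reuses the identifier and symbols of the first row in the preview.

```python
import xml.etree.ElementTree as ET


def md_value(record, query):
    # Compact stand-in for get_md_value: '' / single string / list of strings
    texts = [element.text for element in record.findall(query)]
    if len(texts) > 1:
        return texts
    if len(texts) == 1:
        return texts[0]
    return ""


record = ET.fromstring(
    '<record>'
    '<controlfield tag="001">455823</controlfield>'
    '<datafield tag="191" ind1=" " ind2=" ">'
    '<subfield code="a">A/56/682</subfield></datafield>'
    '<datafield tag="191" ind1=" " ind2=" ">'
    '<subfield code="a">S/2001/1159</subfield></datafield>'
    '</record>'
)

one = md_value(record, "controlfield[@tag='001']")                    # one match
many = md_value(record, "datafield[@tag='191']/subfield[@code='a']")  # two matches
none = md_value(record, "datafield[@tag='651']/subfield[@code='a']")  # no match

print(repr(one), many, repr(none))
```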

In [7]:
# Get a list of dictionaries containing the metadata specified in the get_metadata function
dictionary_records = [get_metadata(r) for r in xml_records]   

In [8]:
# Create a data frame
md_dataset = pd.DataFrame(dictionary_records)
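
Records do not all carry the same `url_` keys, because the keys depend on the files attached to each record. When a list of dictionaries with different keys is passed to `pd.DataFrame`, pandas creates one column per key and fills the gaps with `NaN`. The sketch below shows this with two made-up records; the `example.org` URL is hypothetical.

```python
import pandas as pd

# Two hypothetical records: only the first one has an English file
records = [
    {"record_id": "455823", "url_English": "http://example.org/455823-EN.pdf"},
    {"record_id": "590996"},
]

frame = pd.DataFrame(records)

# One column per key seen in any record
columns = list(frame.columns)
# The record without an English file gets a missing value in that column
missing = frame["url_English"].isna().tolist()

print(columns)
print(missing)
```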

In [9]:
md_dataset.head()

Unnamed: 0,body,date,geo,record_id,symbol,title,topics,url_,url_ 中文,url_Deutsch,url_English,url_Español,url_Français,url_Other,url_Русский,url_العربية,url_中文,url_ﺎﻠﻋﺮﺒﻳﺓ
0,"[A/, S/]",20011206,,455823,"[A/56/682, S/2001/1159]",Letter dated 2001/12/06 from the Permanent Rep...,"[NON-ALIGNED COUNTRIES, INTERNATIONAL SECURITY...",,,,http://digitallibrary.un.org/record/455823/fil...,http://digitallibrary.un.org/record/455823/fil...,http://digitallibrary.un.org/record/455823/fil...,,http://digitallibrary.un.org/record/455823/fil...,http://digitallibrary.un.org/record/455823/fil...,http://digitallibrary.un.org/record/455823/fil...,
1,E/,20101004,"[HAITI, ISRAEL, GAZA STRIP (STATE OF PALESTINE)]",694579,E/2010/SR.46,Provisional summary record of the 46th meeting...,"[OPERATIONAL ACTIVITIES, ORGANIZATIONAL REFORM...",,,,http://digitallibrary.un.org/record/694579/fil...,http://digitallibrary.un.org/record/694579/fil...,http://digitallibrary.un.org/record/694579/fil...,,,,,
2,E/,20041028,,550037,E/2004/SR.47,Provisional summary record of the 47th meeting...,"[WOMEN'S ADVANCEMENT, GENDER EQUALITY, NARCOTI...",,,,http://digitallibrary.un.org/record/550037/fil...,http://digitallibrary.un.org/record/550037/fil...,http://digitallibrary.un.org/record/550037/fil...,,,,,
3,E/,2007,,590996,E/RES/2006/15,Promoting youth employment,"[EMPLOYMENT POLICY, YOUTH EMPLOYMENT, DEVELOPM...",,,,,,,,,,,
4,E/,20031128,,524202,E/2003/SR.49,Provisional summary record of the 49th meeting...,"[SUSTAINABLE DEVELOPMENT, SCIENCE AND TECHNOLO...",,,,http://digitallibrary.un.org/record/524202/fil...,http://digitallibrary.un.org/record/524202/fil...,http://digitallibrary.un.org/record/524202/fil...,,,,,
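
Because `get_md_value` returns a string for one match and a list for several, a column such as `symbol` mixes both types, as the preview shows. One possible way to make such a column uniform is to wrap every value in a list. This is a sketch, not a step of the notebook: `as_list` is a hypothetical helper, and the third sample row is made up to show the empty case.

```python
import pandas as pd


def as_list(value):
    # Keep lists, turn empty or missing values into [], wrap single values
    if isinstance(value, list):
        return value
    if value is None or value == "" or (isinstance(value, float) and pd.isna(value)):
        return []
    return [value]


sample = pd.DataFrame({
    "record_id": ["455823", "694579", "0"],  # the last row is illustrative only
    "symbol": [["A/56/682", "S/2001/1159"], "E/2010/SR.46", ""],
})

sample["symbol"] = sample["symbol"].apply(as_list)
symbols = sample["symbol"].tolist()
print(symbols)
```

With a uniform type, later steps can iterate over the values without first checking whether a cell holds a string or a list.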


## 3. Save extracted metadata
From the dataset, it is easy to save the metadata in a variety of formats, including CSV, JSON or Excel.
* CSV: dataset_name.to_csv(path)
* JSON: dataset_name.to_json(path)
* Excel: dataset_name.to_excel(path)

In this case, we save a CSV file in data/A_input/, next to the input file.

In [10]:
# Save the content of the dataset as a CSV file in data/A_input/
md_dataset.to_csv('../data/A_input/doc_2000_2017.csv')
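
Two details of the CSV round trip are worth knowing. By default, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again; passing `index=False` avoids this. List values are written as their string representation and are read back as plain strings. The sketch below uses an in-memory buffer instead of a file and a hypothetical helper, `parse_list`, to restore the lists.

```python
import ast
import io

import pandas as pd

sample = pd.DataFrame({
    "record_id": ["455823", "694579"],
    "symbol": [["A/56/682", "S/2001/1159"], "E/2010/SR.46"],
})

# Write to an in-memory buffer; index=False leaves out the row index
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)


def parse_list(cell):
    # Cells that look like a Python list are parsed back; others are kept as is
    if isinstance(cell, str) and cell.startswith("["):
        return ast.literal_eval(cell)
    return cell


# Read record_id as text so that identifiers are not converted to numbers
restored = pd.read_csv(buffer, dtype={"record_id": str})
restored["symbol"] = restored["symbol"].apply(parse_list)

restored_columns = list(restored.columns)
restored_symbols = restored["symbol"].tolist()
print(restored_columns)
print(restored_symbols)
```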