# 1.1.1 Metadata Acquisition - MARC XML
**Process overview:**

This process imports MARC XML files, extracts the relevant metadata, and converts it to a tabular structure better suited to further processing. It includes the following steps:
1. Import the MARC XML files located in data/acquisition.
2. Extract the relevant metadata fields from the XML. Relevant metadata include:
    * at least one field representing a unique identifier (e.g. MARC field 001, 035, or any relevant local field);
    * one or several fields representing the labels, in other words the metadata your institution wishes to generate automatically;
    * metadata fields that could be used to predict the labels (optional).
3. Create a CSV file and store it in data/pre-processing.

In [108]:
# Python Libraries required to perform this process
# lxml: used to parse and extract data from an XML file
from lxml import etree, objectify
# Pandas: used to structure metadata as a table for better visualization and manipulation
import pandas as pd

## Getting Metadata 
The following script defines a function that extracts the relevant metadata fields and returns a provisional record (a dictionary) for further manipulation.

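Additional fields can be added to the dictionary with lines of the following form:

<pre><code>"variable_name": get_md_value(r,"datafield[@tag='MARC_tag'][@ind1='indicator']/subfield[@code='MARC_subfield']"),</code></pre>

where:

* variable_name is the name of the new key, e.g. report;
* MARC_tag is the MARC field, e.g. @tag='993';
* indicator is the value of the first indicator, e.g. @ind1='3' (optional);
* MARC_subfield is the subfield code, e.g. subfield[@code='a'] (optional).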
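
To illustrate how the optional indicator predicate narrows a query, here is a minimal sketch. It assumes an invented record with two 993 fields (the values are made up for illustration), and it uses the standard-library `xml.etree.ElementTree` so that it runs without extra installs; the same path expressions should behave identically with the `findall` method of lxml.

```python
import xml.etree.ElementTree as ET

# Invented sample record: the 993 values below are placeholders, not real data
RECORD_XML = """
<record>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="993" ind1="3" ind2=" ">
    <subfield code="a">first value</subfield>
  </datafield>
  <datafield tag="993" ind1="2" ind2=" ">
    <subfield code="a">second value</subfield>
  </datafield>
</record>
"""

record = ET.fromstring(RECORD_XML)

# Without the indicator predicate, both 993 fields match
all_993 = record.findall("datafield[@tag='993']/subfield[@code='a']")
# With the indicator predicate, only the field with ind1='3' matches
only_ind3 = record.findall("datafield[@tag='993'][@ind1='3']/subfield[@code='a']")

values_all = [e.text for e in all_993]
values_ind3 = [e.text for e in only_ind3]
print(values_all)
print(values_ind3)
```

Leaving out the indicator therefore returns every occurrence of the field, which matters for repeatable fields whose indicators distinguish different kinds of content.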


In [99]:
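# Functions needed to acquire MARC XML records and transform them into a more suitable structure
def clean_prefixes(xml_root):
    '''
    Takes the root element and strips the namespace prefix from every tag before returning it.
    '''
    # iter() replaces the deprecated getiterator()
    for e in xml_root.iter():
        # Comments and processing instructions have a non-string tag, so they are skipped
        if hasattr(e.tag, 'find'):
            # A namespaced tag looks like '{namespace}name': keep only what follows '}'
            i = e.tag.find('}')
            if i >= 0:
                e.tag = e.tag[i+1:]
    return xml_root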

def get_metadata(r):
    """
    Takes a MARC XML record. For each metadata field, the function calls get_md_value, which returns
    'None' (no value found), a string (one value found), or a list of strings (multiple values found).
    """
    # Each record is a dictionary that maps a field name to its value(s)
    metadata =  {
        # Standard MARC fields
        "record_id": get_md_value(r,"controlfield[@tag='001']"),
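        # Join the 245 $a, $b and $c subfields that are present: concatenating get_md_value results
        # would add the string 'None' for a missing subfield and raise a TypeError for a repeated one
        "title": " ".join(
            s.text for s in r.findall("datafield[@tag='245']/subfield")
            if s.get("code") in ("a", "b", "c") and s.text),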
        "topics": get_md_value(r,"datafield[@tag='650']/subfield[@code='a']"),
        "geographic_terms": get_md_value(r,"datafield[@tag='651']/subfield[@code='a']"),
        "corporates": get_md_value(r,"datafield[@tag='610']/subfield[@code='a']"),      
        # Local MARC fields used for UN documents
        "symbol": get_md_value(r,"datafield[@tag='191']/subfield[@code='a']"),
        "body": get_md_value(r,"datafield[@tag='191']/subfield[@code='b']"),
        "session": get_md_value(r,"datafield[@tag='191']/subfield[@code='c']"),
    }
    return metadata

def get_md_value(marc_xml_record,query):
    """
    Takes 2 arguments: a MARC XML record and a query (XPath) that identifies an XML element.
    Queries the record for the targeted element. Returns the string 'None' if no element is found,
    a string if one element is found, and a list of strings if more than one element is found.
    """
    # Query the record to get the text of all matching elements
    values = [element.text for element in marc_xml_record.findall(query)]
    # Return a list of strings, a single string, or 'None', depending on the number of matches
    if len(values) > 1:
        return values
    elif len(values) == 1:
        return values[0]
    else:
        return 'None'
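
The three return shapes of `get_md_value` can be checked on a small record. The sketch below is only an illustration: the record is invented, the function is restated compactly so that the block runs on its own, and the standard-library `xml.etree.ElementTree` stands in for lxml. `as_list` is a hypothetical helper, not part of this notebook, showing one way to give every value the same type before further processing.

```python
import xml.etree.ElementTree as ET

def get_md_value(marc_xml_record, query):
    # Compact restatement of the function above so that this sketch runs on its own
    values = [e.text for e in marc_xml_record.findall(query)]
    if len(values) > 1:
        return values
    if len(values) == 1:
        return values[0]
    return 'None'

def as_list(value):
    # Hypothetical helper (an assumption, not in the notebook): always return a list
    if value == 'None':
        return []
    if isinstance(value, list):
        return value
    return [value]

# Invented sample record: one identifier, two topics, no geographic term
record = ET.fromstring("""
<record>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="650" ind1=" " ind2=" "><subfield code="a">CLIMATE CHANGE</subfield></datafield>
  <datafield tag="650" ind1=" " ind2=" "><subfield code="a">SUSTAINABLE DEVELOPMENT</subfield></datafield>
</record>
""")

record_id = get_md_value(record, "controlfield[@tag='001']")                          # one match
topics = get_md_value(record, "datafield[@tag='650']/subfield[@code='a']")            # two matches
geographic_terms = get_md_value(record, "datafield[@tag='651']/subfield[@code='a']")  # no match

print(record_id, topics, geographic_terms)
print(as_list(record_id), as_list(geographic_terms))
```

Note that the "no value" case is the string `'None'`, not Python's `None`, so a test such as `value is None` will not detect it.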

## Importing and parsing XML


In [122]:
# Create a parser that decodes the input as UTF-8
parser = etree.XMLParser(encoding='utf-8')
# Import xml file as an xml etree
tree = etree.parse('data/acquisition/undl_marc.xml',parser)
# Remove prefixes in XML elements
root = clean_prefixes(tree.getroot())
# Get all <record> elements in the MARC XML
xml_records = root.findall("record")
# Get a list of dictionaries containing the metadata specified in the get_metadata function
dictionary_records = [get_metadata(r) for r in xml_records]
# Create a Pandas data frame
md_dataset = pd.DataFrame(dictionary_records).set_index('record_id')
# Export to csv in data/pre-processing
md_dataset.to_csv('data/pre-processing/mdMARC_dataset.csv')
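
The cell above reads a single file, whereas step 1 of the overview mentions files in data/acquisition. If several files have to be combined, a loop along the following lines could be used. This is a sketch under assumptions: `load_records` and `strip_namespaces` are hypothetical names, each file is assumed to have `<record>` elements directly under its root, and the standard-library `xml.etree.ElementTree` is used instead of lxml so that the example runs on its own in a temporary directory.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    # Same idea as clean_prefixes: keep only the part of each tag after '}'
    for e in root.iter():
        if isinstance(e.tag, str) and '}' in e.tag:
            e.tag = e.tag.split('}', 1)[1]
    return root

def load_records(directory):
    # Hypothetical helper: collect the <record> elements of every .xml file in a directory
    records = []
    for path in sorted(glob.glob(os.path.join(directory, '*.xml'))):
        root = strip_namespaces(ET.parse(path).getroot())
        records.extend(root.findall('record'))
    return records

# Invented sample content, written to a temporary directory for the demonstration
SAMPLE = ('<collection xmlns="http://www.loc.gov/MARC21/slim">'
          '<record><controlfield tag="001">{}</controlfield></record>'
          '</collection>')

with tempfile.TemporaryDirectory() as tmp:
    for name, rec_id in [('a.xml', '1'), ('b.xml', '2')]:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write(SAMPLE.format(rec_id))
    records = load_records(tmp)
    record_ids = [r.find("controlfield[@tag='001']").text for r in records]

print(record_ids)
```

The resulting list of records could then be passed to `get_metadata` in the same way as `xml_records` above.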