Netherlands-Standardization-Institute/xml2csv

xml2csv

xml2csv is a Python module to transform standards from semi-structured to structured data. It provides a set of classes to parse XML that uses the ISO Standard Tag Set (ISO STS) and/or NISO Standard Tag Suite (NISO STS). The results are written to CSV.

The API documentation and additional information are available via data.nen.nl.

Description

Parses standards as XML and outputs data as CSV. The output includes:

  • committees
  • ICS codes
  • dates, e.g. review or withdrawal
  • references
  • metadata
  • terms and definitions
  • titles, e.g. NL and EN
  • sections
  • equations

How it works

  1. Create an instance of a Processor, passing a reader object and a writer object to its constructor.
  2. Call the process method.
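from csv import DictWriter

from xml2csv import IcsProcessor

# Open both files in context managers so they are closed even if processing fails.
# newline='' lets the csv module control line endings itself.
with open('input.xml', 'r', encoding='utf-8') as reader, \
        open('output.csv', 'a', encoding='utf-8', newline='') as out:
    writer = DictWriter(out, delimiter=',', lineterminator='\n', fieldnames=IcsProcessor.fieldnames)

    p = IcsProcessor(reader, writer)
    p.process()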

To implement your own parser:

  1. Create a subclass of the Processor class.
  2. Override the converter method.
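
The two steps above might look roughly like the sketch below. The README does not show the signature of converter, so everything here is an assumption: the Processor class is a local stand-in written for this example (not the one shipped in xml2csv), converter is assumed to receive the parsed XML root and yield one dict per CSV row, and TitleProcessor and the toy XML are hypothetical. Check the API documentation for the real contract before adapting it.

```python
import csv
import io
import xml.etree.ElementTree as ET


class Processor:
    """Stand-in base class for illustration only; the real xml2csv Processor may differ."""

    fieldnames = []

    def __init__(self, reader, writer):
        self.reader = reader
        self.writer = writer

    def converter(self, root):
        """Yield one dict per CSV row (assumed contract); subclasses override this."""
        raise NotImplementedError

    def process(self):
        # Parse the whole document, then write every row the converter yields.
        root = ET.parse(self.reader).getroot()
        for row in self.converter(root):
            self.writer.writerow(row)


class TitleProcessor(Processor):
    """Hypothetical custom parser: one row per <title> element."""

    fieldnames = ["lang", "title"]

    def converter(self, root):
        for el in root.iter("title"):
            yield {"lang": el.get("lang", ""), "title": (el.text or "").strip()}


# Toy input, not real ISO STS / NISO STS markup.
xml_doc = '<standard><title lang="en">Widgets</title><title lang="nl">Dingen</title></standard>'
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=TitleProcessor.fieldnames, lineterminator="\n")
TitleProcessor(io.StringIO(xml_doc), writer).process()
print(out.getvalue())
```

Keeping the column names in a class-level fieldnames attribute mirrors how IcsProcessor.fieldnames is used in the example above, so the writer can be built before the processor is instantiated.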

Installation

How to install the project locally:

  1. Clone the repository
  2. Copy the XML documents to the /data/xml directory.

Note that the /data/xml directory contains a sample document (NISO-STS-Standard-1-0.XML).

Usage

  1. Run main.py, which defines a pipeline (a list of processors).
  2. The output is written to the /data/csv directory as a set of CSV files.
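
As a rough illustration of what such a pipeline could look like (this is not the contents of main.py), the sketch below runs a list of processor classes over every XML file in one directory and writes one CSV file per processor to another. The function name run_pipeline and the one-file-per-processor naming are assumptions; the only things taken from the README are the constructor shape (reader, writer), the fieldnames class attribute and the process method.

```python
import csv
from pathlib import Path


def run_pipeline(xml_dir, csv_dir, processors):
    """Run every processor class over every XML file in xml_dir.

    Each processor appends to its own CSV file in csv_dir, named after the
    class (an assumed naming scheme, chosen for this sketch).
    """
    xml_dir, csv_dir = Path(xml_dir), Path(csv_dir)
    csv_dir.mkdir(parents=True, exist_ok=True)

    # Match the extension case-insensitively: the sample document ends in .XML.
    xml_files = sorted(p for p in xml_dir.iterdir() if p.suffix.lower() == ".xml")

    for processor_cls in processors:
        target = csv_dir / f"{processor_cls.__name__}.csv"
        with open(target, "a", encoding="utf-8", newline="") as out:
            writer = csv.DictWriter(
                out, fieldnames=processor_cls.fieldnames, lineterminator="\n"
            )
            for xml_path in xml_files:
                with open(xml_path, "r", encoding="utf-8") as reader:
                    processor_cls(reader, writer).process()
```

With xml2csv installed, a call such as run_pipeline("data/xml", "data/csv", [IcsProcessor]) would be the equivalent of a one-processor pipeline. Note that the sketch does not write a CSV header row.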

License

Exclusive copyright: GNU GPLv3

About

Data pipeline to transform standards in XML to CSV
