# Tensile Test Data Annotation Pipeline

This notebook shows how the script reads a dataset in JSON format, annotates it as RDF triples using vocabulary from the new ontology created by our team (with help from the RDFLib library), and serializes the annotated data to the Turtle and JSON-LD formats.

First, import the necessary packages.
From rdflib we import Graph, which stores the RDF triples; Namespace, which lets us define prefixes; and Literal, which wraps a value together with its datatype. We also import RDF, which provides standard terms such as rdf:type, and XSD, which provides the datatypes (for example xsd:float) that we attach to literals.

In [1]:
import json
import os
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

Here we set the input and output directory paths. The input directory holds the source data, which must be in JSON format. The annotated output is written twice, once serialized as Turtle and once as JSON-LD, and each set goes to its own directory.

In [2]:
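input_dir = "../data/FAIRtrain_data_json"
output_dir_jsonld = "../output/annotatedBy_newOntology/jsonld_output"
output_dir_ttl = "../output/annotatedBy_newOntology/ttl_output"
# Create the output directories if they do not exist yet, so that writing the files does not fail
os.makedirs(output_dir_jsonld, exist_ok=True)
os.makedirs(output_dir_ttl, exist_ok=True)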

Now we loop over each JSON file in the input directory. For each file we create a new graph to hold its RDF triples, then bind prefixes for our namespaces, which makes the terms easier to write in the script and keeps the serialized output compact.

Next, we load the contents of the input file we are currently iterating over. We first create the URIs that identify the parts of this specific test: one for the sample (the test piece) and one for the machine group it belongs to. Using these URIs, we then declare the sample and the machine by adding an RDF type triple for each to the graph we created.

Next, we annotate the values in the input dataset. For each quantity, such as width, we create a node and describe it in three steps. First, we declare its type using a class from the new ontology created by our team, such as "OriginalWidth". Second, we attach the value from the dataset with the "hasValue" property and, where applicable, the unit with "hasUnitLabel". Third, we link the test piece to the node with the corresponding property, such as "hasWidth". This groups everything known about a quantity on one node and ties that node to the test piece.

The script numbers each force and elongation pair by its position in the input data. For every pair it creates a node of the type "MeasuredData" and links a force node and an elongation node to it, all three carrying the same number tag. Each measured data node is in turn linked to the test piece, so every force reading stays associated with its elongation.

Finally, we serialize the annotated graph to Turtle and JSON-LD and write the files to the specified output directories.

In [3]:
for filename in os.listdir(input_dir):
    if filename.endswith(".json"):
        filepath = os.path.join(input_dir, filename)

        g = Graph()

        NTTO = Namespace(
            "https://webprotege.stanford.edu/#projects/bbc9ebd0-8a7b-49f3-aedf-36cb71ee8a37/edit/Classes/")
        g.bind("ntto", NTTO)

        #TODO: replace the placeholder namespace below with our own domain
        EX = Namespace("http://example.org/tensile/")
        g.bind("ex", EX)

        with open(filepath) as f:
            data = json.load(f)

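            sample_id = data["sample_id"]
            #Assumes sample_id has at least three "_"-separated fields; the third is the test piece ID
            test_piece_id = sample_id.split("_")[2]
            test_piece = EX["testPiece_" + test_piece_id]
            #The first six characters of sample_id identify the machine group
            machine = EX[sample_id[:6]]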

            g.add((test_piece, RDF.type, NTTO.Sample))
            g.add((machine, RDF.type, NTTO.Machine))

            #Original width
            #Creating a node to represent the concept of width for our data
            width_node = EX[f"{sample_id}_width"]
            #Saying this node (subject) is a (predicate) object of this type (object)
            g.add((width_node, RDF.type, NTTO.OriginalWidth))
            #Saying this node (subject) has value (predicate) of this float (object)
            g.add((width_node, NTTO.hasValue, Literal(data["width"], datatype=XSD.float)))
            #Saying this node (subject) has unit label (predicate) of this label type (object)
            g.add((width_node, NTTO.hasUnitLabel, Literal("mm")))
            #Saying the test piece (subject) has width (predicate) from details in width_node (object)
            g.add((test_piece, NTTO.hasWidth, width_node))

            #Original thickness
            thickness_node = EX[f"{sample_id}_thickness"]
            g.add((thickness_node, RDF.type, NTTO.OriginalThickness))
            g.add((thickness_node, NTTO.hasValue, Literal(data["thickness"], datatype=XSD.float)))
            g.add((thickness_node, NTTO.hasUnitLabel, Literal("mm")))
            g.add((test_piece, NTTO.hasThickness, thickness_node))

            #Gauge length
            length_node = EX[f"{sample_id}_length"]
            g.add((length_node, RDF.type, NTTO.OriginalGaugeLength))
            g.add((length_node, NTTO.hasValue, Literal(data["length"], datatype=XSD.float)))
            g.add((length_node, NTTO.hasUnitLabel, Literal("mm")))
            g.add((test_piece, NTTO.hasLength, length_node))

            #Young's modulus, typed as the slope of the elastic part
            youngs_mod_node = EX[f"{sample_id}_youngs_modulus"]
            g.add((youngs_mod_node, RDF.type, NTTO.SlopeOfTheElasticPart))
            g.add((youngs_mod_node, NTTO.hasValue, Literal(data["extracted_properties"]["youngs_modulus"], datatype=XSD.float)))
            g.add((test_piece, NTTO.hasYoungsModulus, youngs_mod_node))

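            #Ultimate tensile strength, currently typed as UpperYieldStrength
            #NOTE: tensile strength and upper yield strength are distinct quantities; verify this class mapping against the ontology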
            ult_tensile_strength_node = EX[f"{sample_id}_ultimate_tensile_strength"]
            g.add((ult_tensile_strength_node, RDF.type, NTTO.UpperYieldStrength))
            g.add((ult_tensile_strength_node, NTTO.hasValue, Literal(data["extracted_properties"]["ultimate_tensile_strength"], datatype=XSD.float)))
            g.add((test_piece, NTTO.hasUTS, ult_tensile_strength_node))

            for i, point in enumerate(data["data"]):
                measured_data_node = EX[f"{sample_id}_measured_data_{i}"]
                g.add((measured_data_node, RDF.type, NTTO.MeasuredData))
                #Force
                force_node = EX[f"{sample_id}_force_{i}"]
                g.add((force_node, RDF.type, NTTO.Force))
                g.add((force_node, NTTO.hasValue, Literal(point["N"], datatype=XSD.float)))
                g.add((force_node, NTTO.hasUnitLabel, Literal("N")))
                g.add((measured_data_node, NTTO.hasForce, force_node))

                #Elongation
                elong_node = EX[f"{sample_id}_elongation_{i}"]
                g.add((elong_node, RDF.type, NTTO.Elongation))
                g.add((elong_node, NTTO.hasValue, Literal(point["mm"], datatype=XSD.float)))
                g.add((elong_node, NTTO.hasUnitLabel, Literal("mm")))
                g.add((measured_data_node, NTTO.hasElongation, elong_node))

                #Link data points to test piece
                g.add((test_piece, NTTO.hasMeasurement, measured_data_node))

            output_base_jsonld = os.path.join(output_dir_jsonld, f"annotated_{sample_id}")
            output_base_ttl = os.path.join(output_dir_ttl, f"annotated_{sample_id}")

            #Serializing in jsonld and ttl
            g.serialize(output_base_jsonld + ".jsonld", format="json-ld")
            g.serialize(output_base_ttl + ".ttl", format="turtle")