# Generating Text Embeddings for SoSEn
Here, we show how to generate text embeddings for SoSEn. The process has four steps:
1. Import the Knowledge Graph (as N-Triples) into KGTK format
2. Filter the Knowledge Graph, so we only use the edges we want
3. Sort the Knowledge Graph edges by node, so that all edges for a node are together
4. Generate the embeddings


## Importing the Knowledge Graph
First, we import the Knowledge Graph into KGTK format. KGTK will automatically assign prefixes for us, but since we want our prefixes to be consistent, we have to define them ourselves. Here is the code I used to generate the prefix file, `prefixes.tsv`.

In [4]:
from rdflib import XSD, RDFS

schema = "https://schema.org/"
sd = "https://w3id.org/okn/o/sd#"
xsd = str(XSD)
obj = "https://w3id.org/okn/o/i/"
sosen = "http://example.org/sosen#"
rdfs = str(RDFS)

prefixes_dict = {
    "schema": schema,
    "sd": sd,
    "xsd": xsd,
    "obj": obj,
    "sosen": sosen,
    "rdfs": rdfs
}

with open("prefixes.tsv", "w") as out_file:
    out_file.write("node1\tlabel\tnode2\n")

    for key, value in prefixes_dict.items():
        out_file.write(f"{key}\tprefix_expansion\t\"{value}\"\n")

In [5]:
!cat prefixes.tsv

node1	label	node2
schema	prefix_expansion	"https://schema.org/"
sd	prefix_expansion	"https://w3id.org/okn/o/sd#"
xsd	prefix_expansion	"http://www.w3.org/2001/XMLSchema#"
obj	prefix_expansion	"https://w3id.org/okn/o/i/"
sosen	prefix_expansion	"http://example.org/sosen#"
rdfs	prefix_expansion	"http://www.w3.org/2000/01/rdf-schema#"
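
To make the role of the prefix table concrete, here is a minimal sketch of how a compact identifier such as `sd:keyword` relates to a full IRI. The helper names `expand` and `compact` are hypothetical and are not part of KGTK; they only illustrate the mapping that the prefix table defines, using three of the prefixes from above.

```python
# Illustrative only: KGTK applies the prefix table itself during import.
PREFIXES = {
    "schema": "https://schema.org/",
    "sd": "https://w3id.org/okn/o/sd#",
    "obj": "https://w3id.org/okn/o/i/",
}


def expand(curie, prefixes=PREFIXES):
    """Turn 'sd:keyword' into a full IRI; return the input unchanged if the prefix is unknown."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return curie


def compact(iri, prefixes=PREFIXES):
    """Turn a full IRI into 'prefix:local' using the longest matching namespace."""
    matches = [(ns, p) for p, ns in prefixes.items() if iri.startswith(ns)]
    if not matches:
        return iri
    ns, p = max(matches, key=lambda m: len(m[0]))
    return f"{p}:{iri[len(ns):]}"


print(expand("sd:keyword"))
print(compact("https://w3id.org/okn/o/i/Software/manoelcampos/cloudsim-plus"))
```

The second call produces the compact form of the node identifier that appears in the filtered output later in this notebook.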


## Filtering the Knowledge Graph
Now that we have created the prefix table, we can filter the Knowledge Graph. We use the filter pattern
`" ; sd:name, sd:keyword, sd:description ; "`,
which matches any edge whose label is one of these properties. A pattern has three semicolon-separated fields (subject, predicate, object), and an empty field matches anything. Below, we filter on only name and keyword to show the output so far (the descriptions are very long).

In [7]:
%%bash
kgtk import_ntriples -i small_graph.nt \
  --namespace-file prefixes.tsv | \
kgtk filter \
  -p " ; sd:name, sd:keyword ; "


node1	label	node2
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"google-cluster-data"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"cloud-simulation"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"auto-scaling"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"java8"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"load-balancing"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"simulation"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"java"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"cloud-computing"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"cloudsimplus"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"trace"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"research"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"iaas"
obj:Software/manoelcampos/cloudsim-plus	sd:name	"manoelcampos/cloudsim-plus: Fixes critical bugs and introduces features to provide more simulation control"
obj:Software/manoelcampos/cloudsim-plus	sd:keyword	"t
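
The matching rule behind the filter pattern can be sketched in a few lines of Python. This is an illustration of the semantics described above, not KGTK's implementation; `parse_pattern` and `matches` are hypothetical helper names, and the description value is a placeholder.

```python
def parse_pattern(pattern):
    """Split 'subject ; predicate ; object' into three sets; an empty set matches anything."""
    fields = pattern.split(";")
    if len(fields) != 3:
        raise ValueError("expected three ';'-separated fields")
    return [{v.strip() for v in f.split(",") if v.strip()} for f in fields]


def matches(edge, pattern):
    """Return True if every non-empty pattern field contains the edge's value."""
    return all(
        not allowed or value in allowed
        for value, allowed in zip(edge, parse_pattern(pattern))
    )


edges = [
    ("obj:Software/manoelcampos/cloudsim-plus", "sd:keyword", '"java"'),
    ("obj:Software/manoelcampos/cloudsim-plus", "sd:description", '"(long text)"'),
]

# Only the predicate field is constrained, so subject and object are wildcards.
kept = [e for e in edges if matches(e, " ; sd:name, sd:keyword ; ")]
print(kept)
```

With this pattern, the `sd:description` edge is dropped and the `sd:keyword` edge is kept.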

## Sorting
Next, we need to sort the graph. The text embedding script reads through the knowledge graph in order, focusing on one node at a time, so all edges of a node must appear consecutively for the script to see all of that node's properties. We can accomplish this by sorting the edges on the `node1` column.
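
The following sketch shows why the order matters for a reader that handles one run of consecutive rows at a time. It uses `itertools.groupby`, which (like such a reader) only groups adjacent rows. The node identifiers `obj:A` and `obj:B` are made up for the example, and this is not the KGTK code itself.

```python
from itertools import groupby
from operator import itemgetter

edges = [
    ("obj:A", "sd:keyword", '"java"'),
    ("obj:B", "sd:name", '"B"'),
    ("obj:A", "sd:name", '"A"'),
]


def nodes_seen(rows):
    """Yield (node, edge count) for each run of consecutive rows sharing node1."""
    for node, group in groupby(rows, key=itemgetter(0)):
        yield node, len(list(group))


# Unsorted: obj:A is seen twice, each time with only part of its edges.
print(list(nodes_seen(edges)))

# Sorted on node1: each node is seen once, with all of its edges.
print(list(nodes_seen(sorted(edges, key=itemgetter(0)))))
```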

## Generating Embeddings
Finally, with the graph processed, we can generate embeddings. The script creates a sentence for each node, then embeds that sentence with a sentence embedding model. Three arguments control how the sentence is built: `label-properties`, `has-properties`, and `property-value`. `label-properties` lists the properties used to determine the name of the node. For each property in `has-properties`, the sentence states only that the node has that property. For each property in `property-value`, the sentence states that the node has the property and then gives its value. To control how these edges are worded in the sentence, we supply the file `property_labels.tsv`, which says what each edge should be called. See below.

In [8]:
!cat property_labels.tsv

predicate	label
sd:keyword	has keyword
sd:description	has description
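
As a rough mental model of how the three arguments and the label file interact, here is a small sketch that assembles a sentence from the edges of one node. It is an approximation for illustration only and does not reproduce KGTK's exact sentence template; `build_sentence` is a hypothetical function, and the node and values in the example are made up.

```python
def build_sentence(edges, label_props, has_props, value_props, labels):
    """Assemble one sentence from the (node, property, value) edges of a single node."""
    values = {}
    for _node, prop, value in edges:
        values.setdefault(prop, []).append(value)

    # The node's name comes from the first label property that has a value.
    name = next((values[p][0] for p in label_props if p in values), "It")

    parts = []
    for prop in has_props:
        if prop in values:
            # Mentioned by name only, without its value.
            parts.append(labels.get(prop, prop))
    for prop in value_props:
        if prop in values:
            # Mentioned together with all of its values.
            parts.append(f"{labels.get(prop, prop)} {', '.join(values[prop])}")

    return f"{name} {', and '.join(parts)}." if parts else f"{name}."


labels = {"sd:keyword": "has keyword", "sd:description": "has description"}
edges = [
    ("obj:X", "sd:name", "cloudsim-plus"),
    ("obj:X", "sd:keyword", "java"),
    ("obj:X", "sd:keyword", "simulation"),
]
print(build_sentence(edges, ["sd:name"], [], ["sd:keyword", "sd:description"], labels))
```

In this sketch, a node without any label property is simply called "It".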

# Putting it all together
Below, we see a full example. To save time, we run it on the file `small_graph.nt`, which contains only one software object, but the same pipeline can be run on a larger graph. You may also want to remove the `--save-embedding-sentence` argument. It saves the sentences used to generate the embeddings, which is useful for illustration but makes the output file larger.

In [12]:
%%bash
kgtk import_ntriples -i small_graph.nt \
  --namespace-file prefixes.tsv | \
kgtk filter \
  -p " ; sd:name, sd:keyword, sd:description ; " | \
kgtk sort \
 -c node1 | \
kgtk text-embedding \
  --model bert-base-nli-cls-token \
  -f kgtk_format \
  --output-format kgtk_format \
  --label-properties "sd:name" \
  --has-properties "" \
  --property-value "sd:keyword" "sd:description" \
  --property-labels-file property_labels.tsv \
  --save-embedding-sentence \
  > test_embed.txt

Running with logging level 30

# Results
The text embeddings are not human-readable, but below we can see some examples of the embedding sentences that are generated.

In [14]:
!cat test_embed.txt | grep embedding_sentence

obj:SoftwareVersion/manoelcampos/cloudsim-plus/v5.1.0	embedding_sentence	It has description ## New Features\r\n\r\n- Add size() method to Datacenter interface to enable getting its number of Hosts.\r\n- #216 :: Enable manual VM migrations based on arbitrary conditions\r\n- #221 :: Enable UtilizationModelPlanetLab to read the number of lines from a comment at the beginning of the file \r\n- #222 :: Enable UtilizationModelPlanetLab to scale the values read from the trace\r\n\r\n\r\n## BugFixes\r\n\r\n- #214\r\n- #215\r\n- #217\r\n- #218\r\n- #219\r\n- #220\r\n- #223\r\n- #224\r\n.
obj:SoftwareVersion/manoelcampos/cloudsim-plus/v5.0.0	embedding_sentence	It has description ## 1. Important new Features\r\n\r\n### 1.1. Implement synchronous simulations (#205)\r\n\r\nThis feature enables the simulation to be run inside a regular loop as below:\r\n\r\n```java\r\nsimulation.startSync();\r\nwhile(simulation.isRunning()){\r\n    simulation.runFor(INTERVAL);\r\n    /*\r\n    Perform some operatio
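
If you prefer to inspect the sentences from Python rather than with `grep`, a minimal sketch could look like the following. It assumes the three tab-separated columns (node, property, value) visible in the output above; `read_sentences` is a hypothetical helper, and the sample rows are made up.

```python
def read_sentences(lines):
    """Map node id -> embedding sentence from tab-separated (node, property, value) rows."""
    sentences = {}
    for line in lines:
        # Split on the first two tabs only, so tabs inside the value are preserved.
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) == 3 and fields[1] == "embedding_sentence":
            sentences[fields[0]] = fields[2]
    return sentences


sample = [
    "obj:X\tembedding_sentence\tIt has description example text.\n",
    "obj:X\tother_property\t0.1,0.2\n",
]
print(read_sentences(sample))
```

To run it on the real output, pass an open file instead of the sample list, for example `with open("test_embed.txt") as f: read_sentences(f)`.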