# A basic tutorial to create RDF data from a CSV file

A crucial step in building a knowledge graph is turning input data into linked data, using schemas as connectors.

Today we are going to use `kglab`, a popular Python library, to convert a CSV file into an RDF file. The first step is to import the library.

IMPORTANT NOTE: at the moment `kglab` requires Python 3.10, so choose Python 3.10 when prompted to select the environment.

In [None]:
import kglab

### The CSV file

The CSV file we are going to convert is `data/amazon_books.csv`; it contains a small database of books sold on Amazon. The file is already cleaned and ready to use.

Let's use the pandas library to see the content of the CSV file:

In [None]:
import pandas as pd

# Read the CSV file
csv_data = pd.read_csv("data/amazon_books.csv")

# Preview of the CSV file
csv_data
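
Before writing a mapping, it can help to confirm that the columns you plan to reference exist and that pandas infers sensible types for them. The following is a minimal sketch using a small in-memory sample; the two rows are made up for illustration, and the column names are taken from the target model used in this tutorial. To run it against the real file, replace `sample_csv` with the path `"data/amazon_books.csv"`.

```python
import io

import pandas as pd

# Hypothetical sample rows, standing in for data/amazon_books.csv.
sample_csv = io.StringIO(
    "id,bookName,author,price,currency,customerRatings,rating,bookCover\n"
    "1,An Example Book,Jane Doe,12.99,USD,1500,4.7,Paperback\n"
    "2,Another Example,John Roe,8.50,USD,320,4.9,Hardcover\n"
)
df = pd.read_csv(sample_csv)

# Columns that the mapping will reference with the $(column) syntax.
expected_columns = [
    "id", "bookName", "author", "price",
    "currency", "customerRatings", "rating", "bookCover",
]
missing = [col for col in expected_columns if col not in df.columns]
print("Missing columns:", missing)

# Inferred dtypes hint at which RDF datatype fits each column.
print(df.dtypes)

# Count rows containing at least one empty cell.
rows_with_gaps = int(df.isna().any(axis=1).sum())
print("Rows with missing values:", rows_with_gaps)
```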

So, looking at the header and the values of our CSV file, we would like our RDF model to look something like this:

```
$(id) a ex:Book .
$(id) rdfs:label $(bookName) .
$(id) ex:authorLabel $(author) .
$(id) ex:hasPrice $(price) . # with the price as an xsd:decimal
$(id) ex:hasCurrency $(currency) .
$(id) ex:hasCustomerRatings $(customerRatings) . # with the customerRatings as an xsd:integer
$(id) ex:hasRating $(rating) . # with the rating as an xsd:decimal
$(id) ex:hasBookCover $(bookCover) .
```

NOTE: The syntax `$()` refers to a column label of the CSV file.

We can describe this model in a YARRRML mapping file, which states how each CSV column becomes a triple. Open the `mapping` directory and fill in the `main_mapping.yml` file.

Note: Try to get familiar with the YARRRML docs [here](https://rml.io/yarrrml/tutorial/getting-started/). In particular, you can see an example of a complete YARRRML file in [section 11](https://rml.io/yarrrml/tutorial/getting-started/#complete-yarrrml-document).


### Create the RDF file

Now, let's define all the namespaces that we are going to use in our file and create the knowledge graph `kg`:

In [None]:
namespaces = {
    "ex":  "http://example.com/",
    "schema": "https://schema.org/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
    }

kg = kglab.KnowledgeGraph(
    name = "A KG example",
    namespaces = namespaces
    )

Let's now create an RDF file from the CSV data, following the schema described in the mapping file we've just created:

In [None]:
kg.materialize('morph_configs/main_config.ini')
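
If the materialization step fails, a common cause is a config file that points at the wrong mapping path. The sketch below parses an INI config with the standard library and reports which mapping files it references. The config text is an assumption: it follows the Morph-KGC convention of a data source section with a `mappings` key, and both the section name `DataSource1` and the path `mapping/main_mapping.yml` are hypothetical, so compare them with your own `morph_configs/main_config.ini`.

```python
import configparser
from pathlib import Path

# Hypothetical config content; check it against your own main_config.ini.
config_text = """
[DataSource1]
mappings = mapping/main_mapping.yml
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

# Collect the mapping file referenced by each section that declares one.
mapping_files = {
    section: parser[section]["mappings"]
    for section in parser.sections()
    if parser.has_option(section, "mappings")
}

# Report whether each referenced mapping file exists on disk.
for section, path in mapping_files.items():
    status = "found" if Path(path).exists() else "not found"
    print(f"{section}: {path} ({status})")
```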

Load the Turtle file `data/books.ttl` into the knowledge graph:

In [None]:
kg.load_rdf('data/books.ttl')

Now let's visualize the graph and save our knowledge graph to a `.ttl` file:

In [None]:
# Define a dictionary called VIS_STYLE that specifies visualization styles for different types of nodes in the graph
VIS_STYLE = {
    "owl": {
        "color": "orange",
        "size": 20,
    },
    "ex":{
        "color": "blue",
        "size": 35,
    },
}

# Create a SubgraphTensor object from the knowledge graph, which allows for tensor-based operations on the graph
subgraph = kglab.SubgraphTensor(kg)

# Build a PyVis graph object from the SubgraphTensor, specifying that it should be rendered in a Jupyter notebook and using the VIS_STYLE dictionary for node styling
pyvis_graph = subgraph.build_pyvis_graph(notebook=True, style=VIS_STYLE)

# Apply a force-directed graph layout algorithm (Force Atlas 2) to the PyVis graph
pyvis_graph.force_atlas_2based()

# Save the PyVis graph as an HTML file
pyvis_graph.show('output/graph.html')

# Save the knowledge graph as a TTL (Turtle) file
kg.save_rdf('output/triples.ttl')

# Convert triples to string and print
ttl = kg.save_rdf_text()
print(ttl)

### Query

Now that we have the knowledge graph, let's try to query it using SPARQL queries.

NOTE: When working with SPARQL, make sure you have specified all the datatypes and that you use them consistently throughout. If you're not seeing the results you expect, your triples may have an incorrect datatype, which can lead to unintuitive behaviour in comparisons.

Let's run a generic SPARQL query:

In [None]:
kg.query_as_df('''
SELECT ?subject ?predicate ?object 
WHERE {
    ?subject ?predicate ?object .
}
''')

Let's now query for the labels of all the books with a rating higher than 4.8.

In [None]:
kg.query_as_df('''
# list the books that have a rating higher than 4.8 (also show the rating)

ADD YOUR SPARQL QUERY HERE 

''')

Same query here, but sort the results in alphabetical order.

In [None]:
kg.query_as_df('''
# list the books that have a rating higher than 4.8, ordered alphabetically

ADD YOUR SPARQL QUERY HERE 

''')

Now try another SPARQL query of your own.

In [None]:
kg.query_as_df('''

ADD YOUR SPARQL QUERY HERE 

''')