# SparkKG-ML Tutorial

Welcome to this tutorial on **SparkKG-ML**, a Python library designed to facilitate machine learning with Spark on semantic web and knowledge graph data.

In this notebook, we walk through installing SparkKG-ML and demonstrate some of its key functionality: data acquisition with SPARQL queries (here run against a local Turtle file), feature engineering, vectorization, and semantification.

## Installation

We begin by installing the SparkKG-ML library.


In [None]:
!pip install sparkkgml

## Data Acquisition

We will retrieve data from our Turtle (`.ttl`) file with a SPARQL query and load the result into a Spark DataFrame using the `query_local_rdf` function. Here's how you can achieve that.


In [29]:
from sparkkgml.data_acquisition import DataAcquisition

# Create an instance of DataAcquisition
dataAcquisitionObject = DataAcquisition()

query ='''
    PREFIX schema: <https://schema.org/>
    PREFIX recipeKG:<http://purl.org/recipekg/>
    SELECT  ?recipe
    WHERE { ?recipe a schema:Recipe. }
    LIMIT 3
'''
# Run the query against the local Turtle file and get the result as a Spark DataFrame
spark_df = dataAcquisitionObject.query_local_rdf("recipekg_100.ttl", 'ttl', query)
spark_df.show(truncate=False)

Drop the columns where at least %0 element is missing.
Drop the rows where at least %100 element is missing.
+--------------------------------------------------------+
|recipe                                                  |
+--------------------------------------------------------+
|http://purl.org/recipekg/recipe/peanut-butter-tandy-bars|
|http://purl.org/recipekg/recipe/the-best-oatmeal-cookies|
|http://purl.org/recipekg/recipe/peach-cobbler-ii        |
+--------------------------------------------------------+



## Feature Engineering

Let's get more information about the recipes.
We first run a richer query that returns the calorie amount, category, and USDA score of each recipe, and then use the `getFeatures` function to extract the features and their descriptions from the resulting Spark DataFrame.


In [30]:
from pyspark.sql.functions import regexp_replace

# SPARQL query for the calorie amount, category, and USDA score of each recipe
query2 ="""
     PREFIX schema: <https://schema.org/>
     PREFIX recipeKG:<http://purl.org/recipekg/>
     PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
     SELECT DISTINCT ?recipe ?calorie ?category ?USDAScore
     WHERE {
                 ?recipe a schema:Recipe.

                 ?recipe recipeKG:hasNutritionalInformation ?a.
                 ?a recipeKG:hasCalorificData ?b.
                 ?b recipeKG:hasAmount ?calorie.

                 ?recipe recipeKG:belongsTo ?subcategory.
                 ?subcategory rdfs:subClassOf* ?category.
                 ?category a recipeKG:RecipeCategory.

                 ?recipe recipeKG:hasUSDAScore ?USDAScore.
         }
         LIMIT 10
     """

# Retrieve the data as a Spark DataFrame
spark_df = dataAcquisitionObject.query_local_rdf("recipekg_100.ttl",'turtle', query2)
# Strip the namespace URLs so that only the local names remain
spark_df = spark_df.withColumn("recipe", regexp_replace('recipe','http://purl.org/recipekg/recipe/',''))
spark_df = spark_df.withColumn("category", regexp_replace('category','http://purl.org/recipekg/categories/',''))
spark_df = spark_df.withColumn("category", regexp_replace('category','/',''))
spark_df.show(truncate=False)

Drop the columns where at least %0 element is missing.
Drop the rows where at least %100 element is missing.
+--------------------------------------------+-------+-------------------------------+---------+
|recipe                                      |calorie|category                       |USDAScore|
+--------------------------------------------+-------+-------------------------------+---------+
|peanut-butter-tandy-bars                    |230.0  |desserts                       |3        |
|the-best-oatmeal-cookies                    |172.8  |desserts                       |4        |
|peach-cobbler-ii                            |672.4  |desserts                       |1        |
|pie-crust-v                                 |210.4  |desserts                       |3        |
|dads-beef-and-chive-dip                     |77.6   |appetizers-and-snacks          |4        |
|palak-paneer-indian-spinach-and-paneer      |315.1  |trusted-brands-recipes-and-tips|4        |
|corn-and-porcini-
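
The `regexp_replace` calls above strip the namespace prefix from each URI so that only the local name remains. The same idea can be sketched on plain Python strings with the standard `re` module, which makes it easy to try out without a Spark session. The helper name `strip_namespace` and the sample category URI with a trailing slash are assumptions made for illustration.

```python
import re

# Namespace prefixes as used in the regexp_replace calls above.
RECIPE_NS = "http://purl.org/recipekg/recipe/"
CATEGORY_NS = "http://purl.org/recipekg/categories/"


def strip_namespace(uri, namespace):
    """Remove a leading namespace prefix from a URI string (hypothetical helper)."""
    # re.escape stops the dots in the namespace from acting as regex wildcards.
    return re.sub("^" + re.escape(namespace), "", uri)


recipe = strip_namespace(
    "http://purl.org/recipekg/recipe/peach-cobbler-ii", RECIPE_NS
)

# Assumed example URI; the remaining slash is removed in a second step,
# mirroring the second regexp_replace on the category column.
category = strip_namespace(
    "http://purl.org/recipekg/categories/desserts/", CATEGORY_NS
).replace("/", "")

print(recipe)
print(category)
```

Escaping and anchoring the pattern is slightly stricter than the notebook cell, where the namespace is passed as an unescaped pattern and `.` would match any character.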

In [31]:
from sparkkgml.feature_engineering import FeatureEngineering

# Create an instance of FeatureEngineering
featureEngineeringObject = FeatureEngineering()

# Extract features
df2, features = featureEngineeringObject.getFeatures(spark_df)
features

No entity column has been set, that is why the first column recipe is used as entity column


{'calorie': {'featureType': 'Single_NonCategorical_Double',
  'name': 'calorie',
  'nullable': False,
  'datatype': DoubleType(),
  'numberDistinctValues': 10,
  'isListOfEntries': False,
  'isCategorical': False},
 'category': {'featureType': 'Single_NonCategorical_String',
  'name': 'category',
  'nullable': False,
  'datatype': StringType(),
  'numberDistinctValues': 6,
  'isListOfEntries': False,
  'isCategorical': False},
 'USDAScore': {'featureType': 'Single_NonCategorical_Long',
  'name': 'USDAScore',
  'nullable': False,
  'datatype': LongType(),
  'numberDistinctValues': 3,
  'isListOfEntries': False,
  'isCategorical': False}}
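
The returned `features` object is an ordinary Python dictionary, so it can be inspected with plain Python. The sketch below groups feature names by the last token of `featureType`, using a hand-copied subset of the output above (the Spark datatype objects are left out so the snippet runs without Spark). The helper `group_by_type` is hypothetical, and it assumes the underscore-separated naming seen in this output.

```python
from collections import defaultdict

# Subset of the metadata printed above, copied by hand for illustration.
features_subset = {
    "calorie": {"featureType": "Single_NonCategorical_Double", "numberDistinctValues": 10},
    "category": {"featureType": "Single_NonCategorical_String", "numberDistinctValues": 6},
    "USDAScore": {"featureType": "Single_NonCategorical_Long", "numberDistinctValues": 3},
}


def group_by_type(feature_descriptions):
    """Group feature names by the trailing token of featureType (Double, String, ...)."""
    groups = defaultdict(list)
    for name, description in feature_descriptions.items():
        # "Single_NonCategorical_Double" -> "Double"
        base_type = description["featureType"].rsplit("_", 1)[-1]
        groups[base_type].append(name)
    return dict(groups)


groups = group_by_type(features_subset)
print(groups)
```

A grouping like this is one way to see at a glance which columns are already numeric and which, like `category`, still need to be turned into numbers.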

## Vectorization

Next, we can vectorize the features we extracted using the `vectorize` function.


In [26]:
from sparkkgml.vectorization import Vectorization

# Create an instance of Vectorization
vectorizationObject = Vectorization()

# Vectorize the DataFrame
digitized_df = vectorizationObject.vectorize(df2, features)
digitized_df.show(5)


No entity column has been set, that is why the first column recipe is used as entity column
+--------------------+-------+--------------------+---------+
|              recipe|calorie|            category|USDAScore|
+--------------------+-------+--------------------+---------+
|corn-and-porcini-...|  214.7|[0.06451535224914...|        3|
|dads-beef-and-chi...|   77.6|[-0.1662092804908...|        4|
|easy-mexican-frie...|  548.1|[0.06132638454437...|        3|
|onion-masala-omel...|  522.3|[-0.0420703291893...|        4|
|  orange-raisin-cake|  168.2|[0.11072149872779...|        4|
+--------------------+-------+--------------------+---------+
only showing top 5 rows



## Semantification

Finally, we will use the `semantify` function to convert the DataFrame results into RDF data in Turtle format.


In [28]:
from sparkkgml.semantification import Semantification

# Create an instance of Semantification
semantificationObject = Semantification()

# Semantify the data
# No model is trained here, so USDAScore serves as both label and prediction
semantificationObject.semantify(
    df2, namespace="http://purl.org/recipekg/recipe/", exp_uri="recipe",
    exp_label="USDAScore", exp_prediction="USDAScore", dest="output.ttl",
)
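
What exactly ends up in `output.ttl` is determined by the library. As a rough, library-independent illustration of the general idea of turning rows into RDF, the sketch below formats two rows from the earlier query output as Turtle triples. The helper `rows_to_turtle` is hypothetical, and reusing the `recipeKG:hasUSDAScore` predicate from the query is an assumption; it is not necessarily the vocabulary that `semantify` emits.

```python
NAMESPACE = "http://purl.org/recipekg/recipe/"

# Two rows taken from the query output earlier in this notebook.
rows = [
    {"recipe": "peanut-butter-tandy-bars", "USDAScore": 3},
    {"recipe": "peach-cobbler-ii", "USDAScore": 1},
]


def rows_to_turtle(rows, namespace):
    """Format each row as one Turtle triple (hypothetical helper, not the library's output)."""
    lines = ["@prefix recipeKG: <http://purl.org/recipekg/> .", ""]
    for row in rows:
        # Rebuild the full subject URI from the namespace and the local name.
        subject = f"<{namespace}{row['recipe']}>"
        lines.append(f"{subject} recipeKG:hasUSDAScore {row['USDAScore']} .")
    return "\n".join(lines)


turtle_text = rows_to_turtle(rows, NAMESPACE)
print(turtle_text)
```

The `namespace` argument plays the same role here as in the `semantify` call: it is prepended to the value of the entity column to rebuild a full URI.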

## Conclusion

In this tutorial, we installed SparkKG-ML, retrieved data from a local Turtle file with SPARQL queries, performed feature engineering, vectorized the data, and finally semantified the results as RDF. This demonstrates how SparkKG-ML supports the main steps of a machine learning pipeline on semantic web and knowledge graph data.

Feel free to explore the additional functionality described in the SparkKG-ML documentation.
