# SparkKG-ML Tutorial

Welcome to this tutorial on **SparkKG-ML**, a Python library designed to facilitate machine learning with Spark on semantic web and knowledge graph data.

In this notebook, we walk through installing SparkKG-ML and demonstrate its key functionality: data acquisition from SPARQL endpoints, feature engineering, vectorization, and semantification. Chained together, these steps form a simple end-to-end pipeline.

## Installation

We begin by installing the SparkKG-ML library.


In [None]:

!pip install sparkkgml


## Data Acquisition

We will retrieve data from a SPARQL endpoint and convert it into a Spark DataFrame using the `getDataFrame` function. The cell below queries the RecipeKG endpoint for three recipe URIs.


In [None]:

from sparkkgml.data_acquisition import DataAcquisition

# Create an instance of DataAcquisition
dataAcquisitionObject = DataAcquisition()

# Specify the SPARQL endpoint and query
endpoint = "https://recipekg.arcc.albany.edu/RecipeKG"
query = '''
    PREFIX schema: <https://schema.org/>
    PREFIX recipeKG: <http://purl.org/recipekg/>
    SELECT ?recipe
    WHERE { ?recipe a schema:Recipe . }
    LIMIT 3
'''

# Retrieve the data as a Spark DataFrame
spark_df = dataAcquisitionObject.getDataFrame(endpoint=endpoint, query=query)
spark_df.show()
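
To make the conversion step more concrete, here is a minimal sketch of how a SPARQL result set can be flattened into tabular rows. It assumes the endpoint answers in the W3C SPARQL 1.1 Query Results JSON format. The sample response and its recipe URIs are made-up placeholders, and `bindings_to_rows` is a hypothetical helper, not part of SparkKG-ML. How `getDataFrame` does this internally is not shown in this tutorial.

```python
import json

# Hypothetical response in the SPARQL 1.1 Query Results JSON format.
# The recipe URIs are placeholders, not real RecipeKG data.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["recipe"]},
  "results": {"bindings": [
    {"recipe": {"type": "uri", "value": "http://purl.org/recipekg/recipe/1"}},
    {"recipe": {"type": "uri", "value": "http://purl.org/recipekg/recipe/2"}},
    {"recipe": {"type": "uri", "value": "http://purl.org/recipekg/recipe/3"}}
  ]}
}
"""


def bindings_to_rows(response_text):
    """Flatten SPARQL JSON bindings into (column names, list of row tuples)."""
    payload = json.loads(response_text)
    columns = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        # A variable can be unbound in a given solution; store None in that case.
        rows.append(tuple(binding.get(col, {}).get("value") for col in columns))
    return columns, rows


columns, rows = bindings_to_rows(SAMPLE_RESPONSE)
print(columns)
for row in rows:
    print(row)
```

Rows in this shape could then be handed to PySpark, for example with `spark.createDataFrame(rows, columns)` on an existing `SparkSession`.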


## Feature Engineering

After acquiring the data, we can use the `getFeatures` function to extract features and their descriptions from the Spark DataFrame.


In [None]:

from sparkkgml.feature_engineering import FeatureEngineering
from pyspark.sql.functions import regexp_replace

# Strip the namespace prefix so that only the recipe identifier remains
spark_df = spark_df.withColumn("recipe", regexp_replace('recipe', 'http://purl.org/recipekg/recipe/', ''))

# Create an instance of FeatureEngineering
featureEngineeringObject = FeatureEngineering()

# Extract features
df2, features = featureEngineeringObject.getFeatures(spark_df)
df2.show()
print(features)
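
One detail of the cleaning step is worth illustrating: `regexp_replace` treats its pattern as a regular expression, so an unescaped `.` matches any character, not only a literal dot. That is harmless for this prefix, but a literal, anchored match is stricter. The sketch below shows the idea in plain Python with the standard `re` module rather than Spark. `strip_prefix` is a hypothetical helper and the identifiers are invented for illustration.

```python
import re

PREFIX = "http://purl.org/recipekg/recipe/"


def strip_prefix(uri, prefix=PREFIX):
    """Remove `prefix` from the start of `uri`, matching it literally."""
    # re.escape turns regex metacharacters such as '.' into literals;
    # '^' anchors the match at the start of the string.
    return re.sub("^" + re.escape(prefix), "", uri)


# Hypothetical identifiers, used only for illustration.
print(strip_prefix("http://purl.org/recipekg/recipe/1234"))
print(strip_prefix("http://example.com/other/1234"))
```

The first call leaves only the trailing identifier, while the second URI is returned unchanged because it does not start with the prefix.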


## Vectorization

Next, we can vectorize the features we extracted using the `vectorize` function.


In [None]:

from sparkkgml.vectorization import Vectorization

# Create an instance of Vectorization
vectorizationObject = Vectorization()

# Vectorize the DataFrame
digitized_df = vectorizationObject.vectorize(df2, features)
digitized_df.show(5)
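
As a rough intuition for what turning features into numbers can involve, the sketch below indexes a categorical string column by frequency, so that the most common value gets index 0. This is one common encoding, shown in plain Python. It is an assumption for illustration only and not a description of what `vectorize` does internally. `fit_string_index` and the sample values are hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: position for position, value in enumerate(ordered)}


# Hypothetical categorical column; the values are made up for illustration.
categories = ["dessert", "soup", "dessert", "salad", "dessert", "soup"]
index = fit_string_index(categories)
encoded = [float(index[c]) for c in categories]
print(index)
print(encoded)
```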


## Semantification

Finally, we will use the `semantify` function to convert the DataFrame results into RDF data in Turtle format.


In [None]:

from sparkkgml.semantification import Semantification

# Create an instance of Semantification
semantificationObject = Semantification()

# Semantify the data
semantificationObject.semantify(
    df2,
    namespace="http://example.com/",
    exp_uri="recipe",
    exp_label="calorie",
    exp_prediction="category",
    dest="output.ttl",
)
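
To give a sense of what "converting rows to Turtle" means, here is a minimal, library-free sketch that serializes a few rows as RDF triples. The predicate names (`label`, `prediction`) and the sample row are hypothetical. The vocabulary that `semantify` actually writes to `output.ttl` is not shown in this tutorial, so treat this purely as an illustration of the output format.

```python
def escape_literal(text):
    """Escape characters that are special inside a Turtle string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def to_turtle(rows, namespace):
    """Serialize (identifier, label, prediction) rows as Turtle text."""
    lines = []
    for row_id, label, prediction in rows:
        # <namespace>label and <namespace>prediction are hypothetical predicates.
        lines.append(f'<{namespace}{row_id}> <{namespace}label> "{escape_literal(label)}" ;')
        lines.append(f'    <{namespace}prediction> "{escape_literal(prediction)}" .')
    return "\n".join(lines) + "\n"


# Hypothetical row: identifier, label value, predicted value.
sample_rows = [("1234", "250", "dessert")]
turtle_text = to_turtle(sample_rows, "http://example.com/")
print(turtle_text)
```

Each row becomes one subject with two predicate-object pairs. The `;` separator lets both statements share the same subject, and the final `.` ends the statement.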


## Conclusion

In this tutorial, we installed SparkKG-ML, retrieved data from a SPARQL endpoint, performed feature engineering, vectorized the data, and finally exported the results as RDF in Turtle format. Together, these steps show how SparkKG-ML supports an end-to-end machine learning workflow on semantic web and knowledge graph data.

Feel free to explore the additional functionalities in the SparkKG-ML documentation.
