<a href="https://colab.research.google.com/github/SuperCUDA/ARCs/blob/main/Kumo_SDK_Item_Recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Kumo logo](https://kumo-ai.github.io/kumo-sdk/docs/_static/kumo-logo.svg)

## **This notebook requires a Kumo API key. To provision one for free and get started, visit https://kumo.ai/try/**.


Your API key and environment will be emailed to you shortly after submitting the form on the website.

---


## Introduction

This notebook demonstrates an end-to-end example of building a model and generating predictions in the Kumo SDK. Here, we build a [relational deep learning](https://arxiv.org/abs/2312.04615) model directly on a dataset of customers, transactions, and products, to predict which products a user is most likely to interact with in the next 7 days. The model is specified using Kumo's [predictive query language](https://docs.kumo.ai/docs/pquery-structure) and trained on the Kumo machine learning platform as part of a smooth, performant, and scalable end-to-end pipeline.

**Our documentation is hosted at https://kumo-ai.github.io/kumo-sdk/docs/**.

In [None]:
API_KEY = 'kumo:<secret>'
ENVIRONMENT = 'https://<environment>.trial.kumoai.cloud'
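
Hard-coding a secret in a notebook makes it easy to leak when the notebook is shared. One alternative, sketched below, reads both values from environment variables. The variable names `KUMO_API_KEY` and `KUMO_ENVIRONMENT` and the helper `load_kumo_credentials` are assumptions made for this illustration; they are not names the SDK looks for.

```python
import os


def load_kumo_credentials(env=os.environ):
    """Return (api_key, environment_url) read from a mapping of environment variables.

    KUMO_API_KEY and KUMO_ENVIRONMENT are hypothetical names chosen for this sketch.
    """
    api_key = env.get("KUMO_API_KEY", "")
    environment = env.get("KUMO_ENVIRONMENT", "").rstrip("/")
    if not api_key or not environment:
        raise RuntimeError(
            "Set KUMO_API_KEY and KUMO_ENVIRONMENT before running this notebook."
        )
    return api_key, environment


# Example with an explicit mapping, so the sketch runs without real credentials:
example_env = {
    "KUMO_API_KEY": "kumo:<secret>",
    "KUMO_ENVIRONMENT": "https://<environment>.trial.kumoai.cloud/",
}
API_KEY_FROM_ENV, ENVIRONMENT_FROM_ENV = load_kumo_credentials(example_env)
print(ENVIRONMENT_FROM_ENV)
```

Calling `load_kumo_credentials()` with no argument reads from the real process environment and fails early with a clear message if either value is missing.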

## Initialization

Initializing the SDK is simple: install with `pip`, import, and connect to your Kumo platform endpoint using a provisioned API key.

*See documentation [here](https://kumo-ai.github.io/kumo-sdk/docs/generated/kumoai.init.html#kumoai.init).*

In [None]:
!pip install kumoai==0.4.2 --extra-index-url=https://sdk-pkg.kumoai.cloud

In [None]:
import kumoai as kumo

In [None]:
kumo.init(f"{ENVIRONMENT}/api", API_KEY)

## Connecting Data

You can connect data to the Kumo platform from a variety of data sources: see [`kumo.Connector`](https://kumo-ai.github.io/kumo-sdk/docs/modules/connector.html) for more details. We support connecting to data on Snowflake, Databricks, BigQuery, and Amazon S3.

In this notebook, we will connect to the customer lifetime value dataset hosted at `s3://kumo-public-datasets/customerltv_mini/`.

*See documentation [here](https://kumo-ai.github.io/kumo-sdk/docs/modules/connector.html).*

<img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/kumo_data.png" alt="drawing" width="800"/>


In [None]:
connector = kumo.S3Connector(root_dir="s3://kumo-public-datasets/customerltv_mini/")

Connectors can be used to inspect the tables within them, and fetch samples of the source data.

In [None]:
# List all table names behind this connector:
connector.table_names()

In [None]:
# View a sample of the 'customer' table's rows:
connector['customer'].head(num_rows=5)

In [None]:
# View a sample of the 'transaction' table's rows:
connector['transaction'].head(num_rows=5)

In [None]:
# View a sample of the 'stock' table's rows:
connector['stock'].head(num_rows=5)

## Creating Tables

Once we are comfortable with our source data, we can prepare data for the Kumo platform by constructing Kumo `Table` objects from the source tables. Kumo `Table` objects define important metadata for the downstream machine learning problem, including:
* Column data types (`dtype`) and semantic types (`stype`)
* The table's primary key, if present
* The table's time and end time columns, if present

*See documentation [here](https://kumo-ai.github.io/kumo-sdk/docs/generated/kumoai.graph.Table.html).*

In [None]:
# Create a Kumo table from a source table, specifying
# additional metadata about the table's structure:
transaction = kumo.Table.from_source_table(
    source_table=connector['transaction'],
    primary_key=None,
    time_column='InvoiceDate',
)
transaction['InvoiceDate'].dtype = 'time'
transaction.validate()

In [None]:
# Print the table's definition, ready to copy-and-paste back into
# code if needed:
transaction.print_definition()

In [None]:
# Adjust any semantic types if necessary:
transaction['Quantity'].stype = 'categorical'

We can repeat this process for the other tables in the dataset:

In [None]:
customer = kumo.Table.from_source_table(
    source_table=connector['customer'],
    primary_key='CustomerID',
)
customer.validate()

In [None]:
stock = kumo.Table.from_source_table(
    source_table=connector['stock'],
    primary_key='StockCode',
)
stock.validate()
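
The metadata declared above assumes that each primary key column is unique and non-null, and that the time column holds parseable timestamps. Below is a minimal sketch of checking those two assumptions on a local sample before declaring them. It uses plain Python on invented toy rows; the helper names and the timestamp format are assumptions for illustration and are not part of the Kumo SDK.

```python
from datetime import datetime


def check_primary_key(rows, key):
    """Return a list of problems found when treating `key` as a primary key."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        value = row.get(key)
        if value is None:
            problems.append(f"row {index}: missing {key}")
        elif value in seen:
            problems.append(f"row {index}: duplicate {key}={value!r}")
        else:
            seen.add(value)
    return problems


def check_time_column(rows, column, fmt="%Y-%m-%d %H:%M:%S"):
    """Return a list of rows whose `column` value does not parse as a timestamp."""
    problems = []
    for index, row in enumerate(rows):
        try:
            datetime.strptime(row[column], fmt)
        except (KeyError, TypeError, ValueError):
            problems.append(f"row {index}: bad {column}={row.get(column)!r}")
    return problems


# Toy rows, invented for illustration; the column names follow the notebook.
customers = [{"CustomerID": 1}, {"CustomerID": 2}, {"CustomerID": 2}]
transactions = [
    {"CustomerID": 1, "InvoiceDate": "2011-01-04 10:00:00"},
    {"CustomerID": 2, "InvoiceDate": "not a date"},
]

pk_problems = check_primary_key(customers, "CustomerID")
time_problems = check_time_column(transactions, "InvoiceDate")
print(pk_problems)
print(time_problems)
```

An empty list from both helpers suggests the sample is consistent with the declared metadata. The table's own `validate()` call remains the authoritative check.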

### Aside: Saving Table Schemas

The Kumo SDK and UI are fully compatible: you can define a table in the SDK and access it in the UI, and vice versa. Calling the `table.save(name=...)` method saves the schema of your table under `name`; you can then access the schema on the "Tables" page in the UI.

Let's try it below:

In [None]:
stock.save('stock')
customer.save('customer')
transaction.save('transaction')

You can now access and modify these tables in the UI. To load the tables in the SDK, you can call the `Table.load(name)` method, as follows:

In [None]:
transaction_loaded = kumo.Table.load('transaction')

## Creating a Graph

After specifying our Kumo tables, we can next create a `Graph`, which represents relationships between these tables. Defining this graph is the final step of the data specification pipeline; after its creation, we are able to create predictive queries to answer business problems that relate to our data.

*See documentation [here](https://kumo-ai.github.io/kumo-sdk/docs/generated/kumoai.graph.Graph.html).*

In [None]:
retail_graph = kumo.Graph(
    # These are the tables that participate in the graph: the keys of this
    # dictionary are the names of the tables, and the values are the Table
    # objects that correspond to these names:
    tables={
        'customer': customer,
        'transaction': transaction,
        'stock': stock,
    },

    # These are the edges that define the primary key / foreign key
    # relationships between the tables defined above. Here, `src_table`
    # is the table that has the foreign key `fkey`, which maps to the
    # table `dst_table`'s primary key:
    edges=[
        dict(src_table='transaction', fkey='CustomerID', dst_table='customer'),
        dict(src_table='transaction', fkey='StockCode', dst_table='stock'),
    ],
)
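
Each edge in the graph says that a foreign key value in `transaction` should match a primary key value in the destination table. The sketch below shows one way to look for foreign key values with no match, using plain Python on invented toy rows. The helper `dangling_foreign_keys` is hypothetical, and how the Kumo platform treats unmatched keys is not covered here.

```python
def dangling_foreign_keys(src_rows, fkey, dst_rows, pkey):
    """Return the foreign key values in `src_rows` that have no match in `dst_rows`."""
    known = {row[pkey] for row in dst_rows}
    return sorted({row[fkey] for row in src_rows if row[fkey] not in known})


# Toy rows, invented for illustration; the column names follow the notebook.
customer_rows = [{"CustomerID": 1}, {"CustomerID": 2}]
stock_rows = [{"StockCode": "A"}, {"StockCode": "B"}]
transaction_rows = [
    {"CustomerID": 1, "StockCode": "A"},
    {"CustomerID": 2, "StockCode": "C"},  # "C" is not in the stock table
    {"CustomerID": 3, "StockCode": "B"},  # customer 3 is not in the customer table
]

# Each entry mirrors one graph edge: (foreign key, destination rows, primary key).
edges = [
    ("CustomerID", customer_rows, "CustomerID"),
    ("StockCode", stock_rows, "StockCode"),
]
report = {
    fkey: dangling_foreign_keys(transaction_rows, fkey, dst_rows, pkey)
    for fkey, dst_rows, pkey in edges
}
print(report)
```

A non-empty list for an edge points at rows worth inspecting in the source data before training.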

In [None]:
# Let's visualize our graph, to get a sense for how all our tables are
# connected:
retail_graph.visualize(show_cols=True)

### Aside: Saving Graph Schemas

The Kumo SDK and UI are fully compatible: you can define a graph in the SDK and access it in the UI, and vice versa. Calling the `graph.save(name=...)` method saves the schema of your graph under `name`; you can then access the schema on the "Graphs" page in the UI.

Let's try it below:

In [None]:
retail_graph.save('retail_graph')

You can now access and modify the graph in the UI. To load the graph in the SDK, you can call the `Graph.load(name)` method, as follows:

In [None]:
graph_loaded = kumo.Graph.load('retail_graph')

## Writing a Predictive Query

After connecting our data as Kumo `Table` objects in a Kumo `Graph`, we can write a predictive query that represents a business problem we would like to solve over those tables. The [predictive query language](https://docs.kumo.ai/docs/pquery-structure) is specified in the Kumo documentation.

<img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/kumo_pq.png" alt="drawing" width="700"/>


*See documentation [here](https://kumo-ai.github.io/kumo-sdk/docs/generated/kumoai.pquery.PredictiveQuery.html#kumoai.pquery.PredictiveQuery).*

In [None]:
# Construct a query to predict which products (StockCode) the user is most likely
# to interact with in the next 7 days, for all customers in the customer table
query = kumo.PredictiveQuery(
    graph=retail_graph,
    query="PREDICT LIST_DISTINCT(transaction.StockCode, 0, 7, days) RANK TOP 12 FOR EACH customer.CustomerID",
)

# Ensure this query is specified appropriately for this graph:
query.validate()
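
To make the target of this query concrete, the sketch below computes, for one customer and one anchor time, the set of distinct `StockCode` values purchased in the following 7 days. This is the set the model is asked to rank candidates for. The toy rows and the helper `list_distinct_target` are invented for illustration, and the boundary convention (start exclusive, end inclusive) is an assumption of this sketch rather than a statement about the platform.

```python
from datetime import datetime, timedelta


def list_distinct_target(transactions, customer_id, anchor, start_days=0, end_days=7):
    """Distinct StockCodes bought by `customer_id` in (anchor + start, anchor + end].

    The half-open boundary convention is an assumption of this sketch.
    """
    window_start = anchor + timedelta(days=start_days)
    window_end = anchor + timedelta(days=end_days)
    return sorted(
        {
            t["StockCode"]
            for t in transactions
            if t["CustomerID"] == customer_id
            and window_start < t["InvoiceDate"] <= window_end
        }
    )


# Toy rows, invented for illustration.
toy_transactions = [
    {"CustomerID": 1, "StockCode": "A", "InvoiceDate": datetime(2011, 1, 2)},
    {"CustomerID": 1, "StockCode": "A", "InvoiceDate": datetime(2011, 1, 3)},
    {"CustomerID": 1, "StockCode": "B", "InvoiceDate": datetime(2011, 1, 5)},
    {"CustomerID": 1, "StockCode": "C", "InvoiceDate": datetime(2011, 1, 20)},
    {"CustomerID": 2, "StockCode": "D", "InvoiceDate": datetime(2011, 1, 4)},
]

target = list_distinct_target(toy_transactions, customer_id=1, anchor=datetime(2011, 1, 1))
print(target)
```

Repeated purchases of the same product count once, and purchases outside the window or by other customers are ignored.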

In [None]:
# Fetch the machine learning task type for this query:
print(f"This query is a {query.get_task_type().replace('_', ' ')} task.")

## Training a Model

With a predictive query in place, we can now train a model to predict the desired outputs of the query over our Kumo Graph. The Kumo SDK supports modular execution of the different components of the training pipeline for ease of experimentation and hyperparameter tuning.

<img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/kumo_model.png" alt="drawing" width="800"/>

### Generating a Training Table

The first step of training is the generation of a training table from your predictive query. You can specify a granular plan to determine how exactly this is done, including specifications of elements like the `split`, `train_start_offset`, and more.

As with all long-running jobs in the Kumo SDK, training table generation can be run in nonblocking mode, which returns a job that can be attached to, polled, and resolved (once complete) to a training table object to inspect.

*See documentation [here](https://kumo-ai.github.io/kumo-sdk/docs/generated/kumoai.pquery.PredictiveQuery.html#kumoai.pquery.PredictiveQuery.generate_training_table).*

In [None]:
# Let Kumo intelligently suggest a training table generation plan, given the
# specified graph and query:
training_table_plan = query.suggest_training_table_plan()

In [None]:
# Take a look inside:
print(training_table_plan)

In [None]:
# Launch an asynchronous (nonblocking) job to generate a training table, given
# our training table generation plan. This job is scheduled and orchestrated by
# the Kumo platform, and can be chained with other jobs (e.g. training) downstream:
train_table_job = query.generate_training_table(training_table_plan, non_blocking=True)

In [None]:
# The ID of this job:
print(train_table_job.id)

In [None]:
# OPTIONAL: If you want to wait for training table generation to complete
# train_table_job.attach()
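
The launch, poll, and resolve pattern used for nonblocking jobs resembles the futures in Python's standard `concurrent.futures` module. The sketch below is only an analogy: `slow_job` is a hypothetical stand-in for a long-running platform job, and none of these calls are part of the Kumo SDK.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_job(seconds):
    """Hypothetical stand-in for a long-running job."""
    time.sleep(seconds)
    return "training table ready"


with ThreadPoolExecutor(max_workers=1) as executor:
    # Launch: returns immediately with a handle, similar in spirit to `non_blocking=True`.
    future = executor.submit(slow_job, 0.2)

    # Poll: check on the job without blocking.
    polls = 0
    while not future.done():
        polls += 1
        time.sleep(0.05)

    # Resolve: fetch the finished result (this would block if the job were still running).
    outcome = future.result()

print(outcome)
```

The benefit of the pattern is that the notebook stays responsive while the work runs elsewhere, and later steps can be queued against the handle instead of the finished result.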

### Training

After launching a training table generation job, we are ready to train a model. Following the same pattern as with training table generation, we let Kumo suggest a model plan, which we can then modify:

<img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/kumo_training.png" alt="drawing" width="700"/>

*See documentation [here](https://kumo-ai.github.io/kumo-sdk/docs/generated/kumoai.trainer.Trainer.html).*

In [None]:
# Let Kumo intelligently suggest a modeling plan, given the
# specified graph and query:
model_plan = query.suggest_model_plan()
print(model_plan)

In [None]:
# Let's make a minor adjustment:
model_plan.training_job.num_experiments = 2

Now, we train:

In [None]:
# A Trainer object manages the execution of a training pipeline, according to
# the `model_plan` specification:
trainer = kumo.Trainer(model_plan)

# Launch an asynchronous (nonblocking) job to train a model, given
# our specified model plan. This job is scheduled and orchestrated by the
# Kumo platform, and is chained with the job to generate the training table
# launched above (it will sequence itself after training table generation is
# complete):
training_job = trainer.fit(
    graph=retail_graph,
    train_table=train_table_job,
    non_blocking=True,
)

In [None]:
# The ID of this job:
print(f'The ID of our training job is {training_job.job_id}. To see the results later you can run kumo.TrainingJob("{training_job.job_id}").result()')

In [None]:
# Let's follow along...
training_job.attach()

Once training is done, we can observe generated artifacts to get a sense for how the model behaved:

In [None]:
job_result = kumo.TrainingJob(training_job.job_id).result()

In [None]:
print(job_result.metrics())
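
Which metrics are reported depends on the task type and is not listed in this notebook. For a ranking task such as this one, metrics from the precision@k and recall@k family are common. As an aid to reading such numbers, the sketch below computes both on toy data. It is an illustration under that assumption and does not reproduce Kumo's evaluation code.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision@k and recall@k for one customer's ranked recommendations."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


# Toy example: 4 recommendations, of which 2 were actually purchased.
ranked = ["A", "B", "C", "D"]
purchased = {"B", "D", "E"}
precision, recall = precision_recall_at_k(ranked, purchased, k=4)
print(f"precision@4={precision:.2f} recall@4={recall:.2f}")
```

Precision asks how many of the recommended items were relevant; recall asks how many of the relevant items were recommended.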

### Aside: Training a Model in the UI

Since we've saved our graph under the name `retail_graph` in the UI, we can now navigate to the "New > Model" button in the UI sidebar to train this exact same model, on the same data, in the UI.

Simply paste the predictive query into the predictive query box, adjust any model parameters as necessary, and train! You can interact with your job in the SDK, using its job ID.

## Generating Predictions

After our training job is completed, we can generate batch predictions using our trained model. We can choose to output these batch predictions directly to a connector (e.g. Amazon S3, Databricks, Snowflake), or we can generate predictions for download and export at our convenience later with the [`export`](https://kumo-ai.github.io/kumo-sdk/docs/generated/kumoai.trainer.BatchPredictionJobResult.html#kumoai.trainer.BatchPredictionJobResult.export) method.

We will do the latter here.

*See documentation [here](https://kumo-ai.github.io/kumo-sdk/docs/modules/trainer.html#batch-prediction).*

In [None]:
# Predict on your trained model:
prediction_job = trainer.predict(
    graph=retail_graph,
    prediction_table=query.generate_prediction_table(non_blocking=True),
    output_types={'predictions', 'embeddings'},
    training_job_id=training_job.job_id,  # use our training job's model
    non_blocking=False,
)
print(f'Batch prediction job summary: {prediction_job.summary()}')

In [None]:
# See your predictions:
prediction_job.predictions_df()

**And... that's it! You've trained a complex model and made predictions on a relational dataset.**

Feel free to continue to work with more examples, or reach out to us on Slack if you have any further questions.

### Generating Predictions in a New Session

The code above assumes that `retail_graph`, `query`, `trainer`, and `training_job` are still available as local variables at prediction time. Often, you'll want to train a model once and generate predictions at a regular cadence in a separate workflow.

To do so, you can simply load the relevant objects associated with your training job, and call `predict`. Here's an example:

In [None]:
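# Run this after `import kumoai as kumo` and `kumo.init(...)` in the new session.
training_job = kumo.TrainingJob('<training_job_id>')
training_query = kumo.PredictiveQuery.load_from_training_job(training_job.job_id)
training_graph = training_query.graph

# A new session has no `trainer` variable, so recreate one from the training
# job. This assumes `kumo.Trainer.load` is available in your SDK version:
trainer = kumo.Trainer.load(training_job.job_id)

# Predict on your trained model:
prediction_job = trainer.predict(
    graph=training_graph,
    prediction_table=training_query.generate_prediction_table(non_blocking=True),
    output_types={'predictions', 'embeddings'},
    training_job_id=training_job.job_id,  # use our training job's model
    non_blocking=False,
)
print(f'Batch prediction job summary: {prediction_job.summary()}')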
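
Running predictions in a separate workflow means the training job ID has to travel between sessions. One simple option, sketched below, is to write it to a small JSON file when training is launched and read it back at prediction time. The file name and the helpers `save_job_id` and `load_job_id` are assumptions for illustration; the Kumo SDK does not require any particular storage.

```python
import json
import tempfile
from pathlib import Path


def save_job_id(path, job_id):
    """Write the training job ID to a small JSON file."""
    Path(path).write_text(json.dumps({"training_job_id": job_id}))


def load_job_id(path):
    """Read the training job ID back in a later session."""
    return json.loads(Path(path).read_text())["training_job_id"]


# Demonstration with a temporary directory and a placeholder ID.
with tempfile.TemporaryDirectory() as tmp_dir:
    state_file = Path(tmp_dir) / "kumo_training_job.json"
    save_job_id(state_file, "<training_job_id>")  # done in the training session
    restored_id = load_job_id(state_file)         # done in the prediction session

print(restored_id)
```

In a scheduled workflow the file would live somewhere durable, and the restored ID would take the place of the `'<training_job_id>'` placeholder.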

### Aside: Generating Predictions in the UI

You can generate predictions in the UI or SDK regardless of where the model was trained. To do so, after training is completed, simply navigate to the "New > Prediction" button in the UI sidebar, select your trained model, and fill in the form to generate predictions!

You can interact with your job in the SDK, using its job ID.