# TileDB 101 Lab : Intro to TileDB Answers! 
 




### Hands on Section
**This section requires manual configuration of credentials and organizations. If you are running into trouble please reach out to the workshop host or admin!**
#### Cloud Credentials
Your administrator or instructor will provide you with an Amazon Resource Name (ARN) as well as classic credentials. From there, follow the academy instructions [here](https://cloud.tiledb.com/academy/accounts/individual/profile/cloud-credentials/index.html) to set up both.
#### Organization Creation
Now that you've set up your individual credentials, follow the [academy guide](https://cloud.tiledb.com/academy/accounts/org-admin/create-org/index.html) to create an organization. After your organization is created, go ahead and invite another member from the workshop. Once that is done, you are ready to move on and start cataloging!

Going forward, **please ensure you use Python for these tutorials**



## Section 1: TileDB Fundamentals (Arrays, Tables, and Files)

### Hands on Section
These tutorials will require a TileDB `Basic Data Science` image.
#### Notebook Arrays 

Let's create dense and sparse arrays within your notebook.
Follow the [Academy Tutorial](https://cloud.tiledb.com/academy/structure/arrays/quickstart/). 
Once complete, move onto the next section.

In [None]:
# Dense Array Code. Tested on TileDB core version: (2, 26, 2) TileDB-Py version: (0, 32, 0)
import tiledb
import numpy as np, shutil, os.path

print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))


In [None]:
#remove array if it already exists
dense_array = os.path.expanduser("~/dense_array")
if os.path.exists(dense_array):
    shutil.rmtree(dense_array)

In [None]:
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(1, 4), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(1, 4), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an integer attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema, setting `sparse=False` to indicate a dense array
schema = tiledb.ArraySchema(domain=dom, sparse=False, attrs=[a])

# Create the array on disk (it will initially be empty)
tiledb.Array.create(dense_array, schema)

In [None]:
# Read the array schema
schema = tiledb.ArraySchema.load(dense_array)
print(schema)

In [None]:
# Prepare some data in a numpy array
data = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]], dtype=np.int32)

# Open the array in write mode and write to the whole array domain
with tiledb.open(dense_array, 'w') as A:
    A[:] = data

In [None]:
# Open the array in read mode
A = tiledb.open(dense_array, 'r')

In [None]:
# Read the whole array
print(A[:])        # dictionary of 2D numpy arrays, one for each attribute
print(A[:]['a'])   # numpy array

In [None]:
# close the array
A.close()

In [None]:
# Sparse Array Code Tested on TileDB core version: (2, 26, 2) TileDB-Py version: (0, 32, 0)

In [None]:
import tiledb
import numpy as np, shutil, os.path

print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))

In [None]:
sparse_array = os.path.expanduser("~/sparse_array")
if os.path.exists(sparse_array):
    shutil.rmtree(sparse_array)

In [None]:
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(0, 3), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(0, 3), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an integer attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema, setting `sparse=True` to indicate a sparse array
schema = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[a])

# Create the array on disk (it will initially be empty)
tiledb.Array.create(sparse_array, schema)

In [None]:
# Read the array schema
schema = tiledb.ArraySchema.load(sparse_array)
print(schema)

In [None]:
# Prepare some data in numpy arrays, simulating the COO format
d1_data = np.array([2, 0, 3, 2, 0, 1], dtype=np.int32)
d2_data = np.array([0, 1, 1, 2, 3, 3], dtype=np.int32)
a_data = np.array([4, 1, 6, 5, 2, 3], dtype=np.int32)

# Open the array in write mode and write the data
with tiledb.open(sparse_array, 'w') as A:
    A[d1_data, d2_data] = a_data

In [None]:
# Open the array in read mode
A = tiledb.open(sparse_array, 'r')

In [None]:
# Read the whole array
A[:]

In [None]:
A.close()
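
The sparse write above passes three parallel lists (the COO, or coordinate, format): two coordinate arrays and one value array. As an illustration only, the following plain-Python sketch expands the same six cells into a 4x4 grid so you can see which cells are populated. `coo_to_grid` is a hypothetical helper, not part of TileDB, and `None` stands in for cells that were never written.

```python
def coo_to_grid(rows, cols, values, size):
    """Expand COO triples into a size x size grid; unwritten cells stay None."""
    grid = [[None] * size for _ in range(size)]
    for r, c, v in zip(rows, cols, values):
        grid[r][c] = v
    return grid

# Same coordinates and values as the sparse array cell above
d1_data = [2, 0, 3, 2, 0, 1]
d2_data = [0, 1, 1, 2, 3, 3]
a_data = [4, 1, 6, 5, 2, 3]

grid = coo_to_grid(d1_data, d2_data, a_data, size=4)
for row in grid:
    print(row)
```

Only 6 of the 16 cells hold a value, which is why a sparse array stores coordinates explicitly instead of materializing the full grid.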

#### TileDB URIs (Arrays on S3 with TileDB URIs)

Now that we've created some local arrays, let's follow the next [Academy Tutorial](https://cloud.tiledb.com/academy/structure/arrays/tutorials/basics/basic-tiledb-cloud/) and create remote and centralized arrays. Once you create the array, navigate to the Assets -> Arrays tab and view your newly created and registered array.


For this tutorial you will need to set up a REST API token. You can find that guide [here](https://cloud.tiledb.com/academy/accounts/individual/profile/api-tokens/index.html). Your TileDB account can be found in the upper left-hand corner of your browser: click the tile with the letters on it, and you should see your username and personal namespace, for example `john-doe`. This is your `TILEDB_ACCOUNT` value. The `S3_BUCKET` value is the name of the bucket you used when configuring your storage. You can find it under Settings -> Storage paths, where it is listed as the `Root default path`, e.g. `s3://john-doe-cloud-bucket`.
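
These two values are combined into `tiledb://` URIs in the cells that follow. As a minimal sketch of that string construction, assuming the example account `john-doe` and bucket `s3://john-doe-cloud-bucket` from the paragraph above (`registration_uri` and `access_uri` are hypothetical helper names, not TileDB functions):

```python
def registration_uri(account, bucket, name):
    """URI used once, when creating and registering the asset."""
    return f"tiledb://{account}/{bucket.rstrip('/')}/{name}"

def access_uri(account, name):
    """URI used afterwards to open the already registered asset."""
    return f"tiledb://{account}/{name}"

# Example values only; substitute your own account and bucket
reg = registration_uri("john-doe", "s3://john-doe-cloud-bucket", "basic_tiledb_cloud")
acc = access_uri("john-doe", "basic_tiledb_cloud")
print(reg)
print(acc)
```

Note that the bucket value keeps its `s3://` prefix inside the registration URI.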

In [None]:
import tiledb, os
import tiledb.cloud  # needed for tiledb.cloud.user_profile()

tiledb_account = tiledb.cloud.user_profile().username
tiledb_token = os.getenv("TILEDB_REST_TOKEN")

In [None]:
# Set your bucket (e.g. "s3://john-doe-cloud-bucket") and read it back from the environment
os.environ['S3_BUCKET'] = "<your bucket>"
s3_bucket = os.environ['S3_BUCKET']

In [None]:
# This context initialization can be performed only once!!! If you see an error, restart your kernel and run this section again starting at the imports.
cfg = tiledb.Config({
    "rest.token": tiledb_token,
})
tiledb.default_ctx(cfg)
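
If `TILEDB_REST_TOKEN` is unset, `os.getenv` returns `None` and the problem may only surface later as an authentication error. A small sketch of a guard that fails early is shown below; `rest_config` is a hypothetical helper, and the example token is a placeholder, not a real credential.

```python
import os

def rest_config(env=os.environ):
    """Return TileDB config parameters, or raise if the token is missing."""
    token = env.get("TILEDB_REST_TOKEN")
    if not token:
        raise RuntimeError("TILEDB_REST_TOKEN is not set; create an API token first")
    return {"rest.token": token}

# Demonstration with a placeholder environment instead of the real one
params = rest_config({"TILEDB_REST_TOKEN": "example-token"})
print(sorted(params))
```

The returned dictionary could then be passed to `tiledb.Config(...)` in place of the literal dictionary used above.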

In [None]:
# Import necessary libraries
import numpy as np, shutil

# Set array URI
array_name = "basic_tiledb_cloud"
array_uri = "tiledb://" + tiledb_account + "/" + array_name

# Delete array if it already exists
if tiledb.array_exists(array_uri):
    tiledb.Array.delete_array(array_uri)

In [None]:
# Create the two dimensions
d1 = tiledb.Dim(name="d1", domain=(1, 4), tile=2, dtype=np.int32)
d2 = tiledb.Dim(name="d2", domain=(1, 4), tile=2, dtype=np.int32)

# Create a domain using the two dimensions
dom = tiledb.Domain(d1, d2)

# Create an attribute
a = tiledb.Attr(name="a", dtype=np.int32)

# Create the array schema, setting `sparse=False` to indicate a dense array.
sch = tiledb.ArraySchema(domain=dom, sparse=False, attrs=[a])

# NOTE: This is the only special thing about TileDB Cloud when
# creating and registering arrays: the URI should be of the form:
# tiledb://<account>/s3://<bucket>/<array_name>
# TileDB Cloud understands that you are trying to create an array in
# s3://<bucket>/<array_name> and register it under <account>.
# After the array is created and registered, it will be accessible
# simply as tiledb://<account>/<array_name>
array_uri_reg = "tiledb://" + tiledb_account + "/" + s3_bucket + "/" + array_name

# Create the array on S3 and register it on TileDB Cloud (it will initially be empty)
tiledb.Array.create(array_uri_reg, sch)

In [None]:
# Prepare some data in a NumPy array
data = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]], dtype=np.int32)

# Write data to the array
with tiledb.open(array_uri, 'w') as A:
    A[:] = data

In [None]:
# Open the array in read mode
A = tiledb.open(array_uri, 'r')

# Show the entire array
print("Entire array: ")
print(A[:])
print("\n")

# Slice a portion of the array, which is useful 
# when the arrays are too big to fit in main memory
print("Slice [1:3), [1:2): ")
print(A[1:3, 1:2]["a"])

# Remember to close the array
A.close()
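
The slice in the cell above is labelled `[1:3), [1:2)`. As a plain-Python sketch of that selection, assuming the ranges are half-open and expressed in domain coordinates (the domain of this array starts at 1, not 0); `slice_dense` is a hypothetical helper that works on a nested list rather than a TileDB array:

```python
def slice_dense(data, domain_start, row_range, col_range):
    """Select half-open coordinate ranges from a grid whose first coordinate is domain_start."""
    (r0, r1), (c0, c1) = row_range, col_range
    return [
        row[c0 - domain_start:c1 - domain_start]
        for row in data[r0 - domain_start:r1 - domain_start]
    ]

data = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# Rows with d1 in {1, 2} and the single column with d2 = 1
selected = slice_dense(data, 1, (1, 3), (1, 2))
print(selected)
```

Under these assumptions the sketch selects the first column of the first two rows.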

Before deleting the array, visit Assets -> Arrays and find `basic_tiledb_cloud` array. From there, you can view details about the array such as the schema and the activity.

In [None]:
# Delete the array
if tiledb.array_exists(array_uri):
    tiledb.Array.delete_array(array_uri)

#### Notebook Tables

Let's build some *dense* and *sparse* tables using arrays! Follow the [Academy Tutorial](https://cloud.tiledb.com/academy/structure/tables/tutorials/basics/csv-ingestion/) and store the code below!

In [None]:
import os, shutil
import tiledb
import warnings
warnings.filterwarnings("ignore")
import tiledb.sql
import pandas as pd
import numpy as np

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-SQL version: {}".format(tiledb.sql.version))

# Set table dataset URIs, and the URI to an example CSV
dense_table_uri = "my_dense_table"
sparse_table_uri = "my_sparse_table"
example_csv_uri = "s3://tiledb-inc-demo-data/examples/notebooks/nyc_yellow_tripdata/taxi_first_10.csv"

# Set configuration parameters.
cfg = tiledb.Config({
    "vfs.s3.no_sign_request": "true",
    "vfs.s3.region": "us-east-1"
})
ctx = tiledb.Ctx(cfg)

# Clean up the tables if they already exist
if os.path.exists(dense_table_uri):
    shutil.rmtree(dense_table_uri)
if os.path.exists(sparse_table_uri):
    shutil.rmtree(sparse_table_uri)

In [None]:
#Dense Table
tiledb.from_csv(
    dense_table_uri,
    example_csv_uri,
    ctx=ctx,
    parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime']
)

In [None]:
# Open the Dataset in read mode
table = tiledb.open(dense_table_uri, mode='r', ctx=ctx)

In [None]:
# Print the schema of the dense table
print(table.schema)

In [None]:
# Read entire dataset into a pandas dataframe
df = table.df[:] # Equivalent to: table.df[0:9]
df

In [None]:
#Sparse Table
tiledb.from_csv(
    sparse_table_uri,
    example_csv_uri,
    ctx=ctx,
    sparse=True,
    index_dims=["tpep_pickup_datetime", "PULocationID"],
    allows_duplicates=True,
    dim_filters={'tpep_pickup_datetime': tiledb.FilterList([tiledb.GzipFilter(level=-1)])},
    attr_filters={'passenger_count': tiledb.FilterList([tiledb.GzipFilter(level=-1)])},
    dtype={"fare_amount": np.float32},
    parse_dates=['tpep_dropoff_datetime', 'tpep_pickup_datetime'])

In [None]:
# Open the table in read mode
table = tiledb.open(sparse_table_uri, mode='r', ctx=ctx)

In [None]:
# Print the schema of the sparse table
print(table.schema)

In [None]:
# Read entire dataset into a pandas dataframe
df = table.df[:]
df

In [None]:
# Clean up the tables if they already exist
if os.path.exists(dense_table_uri):
    shutil.rmtree(dense_table_uri)
if os.path.exists(sparse_table_uri):
    shutil.rmtree(sparse_table_uri)

#### TileDB Files

TileDB allows you to import, securely manage, and search over all your files, in one governed and compliant data platform. You can follow our files guide via the [Academy Tutorial](https://cloud.tiledb.com/academy/catalog/data/files/index.html). Make sure you view your files via TileDB's UI directly as well as via programmatic tools!

You can run all the items below within your local Bash environment to generate a file for this task. Create a file named `create_file.sh` and save it with the content below.

In [None]:
#!/bin/bash

# Define the file name and path
FILE_NAME="example.txt"
DESKTOP_PATH="$HOME/Desktop"

# Create the file on the desktop
cat <<EOL > "$DESKTOP_PATH/$FILE_NAME"
This is an example file.
It contains simple text data.
EOL

# Print a success message
echo "File '$FILE_NAME' has been created on your Desktop."


Give the script execute permissions 

```chmod +x create_file.sh```

Run the script

```./create_file.sh```
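
If Bash is not available on your machine (for example on Windows), the same file can be produced from Python. This is a sketch rather than part of the official tutorial: `create_example_file` is a hypothetical helper, and the demonstration writes into a temporary directory so that it does not touch your Desktop.

```python
import tempfile
from pathlib import Path

def create_example_file(directory, file_name="example.txt"):
    """Write the two-line example file into `directory` and return its path."""
    target = Path(directory) / file_name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("This is an example file.\nIt contains simple text data.\n")
    return target

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = create_example_file(tmp)
    content = path.read_text()
    print(f"File '{path.name}' has been created.")
```

To mirror the shell script, call `create_example_file(Path.home() / "Desktop")` instead.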

Follow the instructions on the academy tutorial to upload the file. Once you add the file(s) to the TileDB catalog, you can browse them under Assets -> Data -> Files. 



In [None]:
# The following will return a JSON file with various info about the file.
tiledb.cloud.asset.info("tiledb://<account-name>/<example>") 

A file's UUID (you can find it in the file details) identifies the file uniquely, so you can also use it in place of the file name.

In [None]:
tiledb.cloud.asset.info("tiledb://<account_name>/<UUID>") 

## Section 2: Life Sciences on TileDB (VCF, Biomedical Images, and Single Cell Data)


### Hands On Section

These tutorials will require a TileDB `Genomics` image.

#### TileDB SOMA (Stacks of Matrices, Annotated) 
[Academy Tutorial](https://cloud.tiledb.com/academy/structure/life-sciences/single-cell/tutorials/data-ingestion/) will guide you through ingesting and accessing SOMA data stored on S3 into TileDB.


In [None]:
import anndata as ad
import scanpy as sc
import tiledb
import tiledb.cloud
import tiledbsoma
import tiledbsoma.io

tiledbsoma.show_package_versions()

In [None]:
cfg = tiledb.Config({"vfs.s3.no_sign_request": True})
vfs = tiledb.VFS(config=cfg)

In [None]:
H5AD_URI = "s3://tiledb-inc-demo-data/singlecell/h5ad/pbmc3k_processed.h5ad"

with vfs.open(H5AD_URI) as h5ad:
    adata = ad.read_h5ad(h5ad)

In [None]:
adata

In [None]:
import os

os.environ["S3_BUCKET"] = "<Your S3 Bucket>"
TILEDB_NAMESPACE = tiledb_account = tiledb.cloud.user_profile().username
S3_BUCKET = os.environ["S3_BUCKET"]
EXPERIMENT_NAME = "soma-exp-pbmc3k"
EXPERIMENT_URI = f"tiledb://{TILEDB_NAMESPACE}/{S3_BUCKET}/{EXPERIMENT_NAME}"

In [None]:
tiledbsoma.io.from_anndata(
    experiment_uri=EXPERIMENT_URI, measurement_name="RNA", anndata=adata
)

Once you see a URI above, you should be able to see your SOMA experiment in Assets -> Data -> SOMA.

#### TileDB VCF 
The [Academy Tutorial](https://cloud.tiledb.com/academy/structure/life-sciences/population-genomics/tutorials/basics/basic-tiledb-cloud/) uses TileDB Cloud for a basic ingestion. You can use the TileDB UI to directly ingest files as well. You can follow that walkthrough [here](https://cloud.tiledb.com/academy/catalog/data/genomics/index.html). Use the cells below to run and organize your code. Once ingested, view your dataset in the catalog!


In [None]:
import tiledb, os
import tiledb.cloud  # needed for tiledb.cloud.user_profile()

tiledb_token = os.getenv("TILEDB_REST_TOKEN")
tiledb_account = tiledb.cloud.user_profile().username

# Set the REST token and anonymous S3 access in the config.
# Alternatively, authenticate with "rest.username" and "rest.password".
cfg = tiledb.Config({
    #"rest.username": tiledb_username,
    #"rest.password": tiledb_password,
    "vfs.s3.no_sign_request": "true", # boosts performance when accessing public S3 buckets
    "rest.token": tiledb_token
})
ctx = tiledb.Ctx(cfg)

In [7]:
# Confirm the REST token is set (avoid printing the token itself)
print("TILEDB_REST_TOKEN is set:", os.getenv("TILEDB_REST_TOKEN") is not None)

In [None]:
import tiledb.cloud
import tiledbvcf
import pandas as pd
import numpy as np
import shutil, urllib.request, os.path

# Print library versions
print("TileDB core version: {}".format(tiledb.libtiledb.version()))
print("TileDB-Py version: {}".format(tiledb.version()))
print("TileDB-VCF version: {}".format(tiledbvcf.version))
print("TileDB-Cloud-Py version: {}".format(tiledb.cloud.version.version))

# Set array URI
vcf_name = "basic_tiledb_cloud"
vcf_uri = "tiledb://" + tiledb_account + "/" + vcf_name

# Clean up VCF dataset if it already exists
if tiledb.object_type(vcf_uri, ctx=ctx) == "group":
    tiledb.cloud.asset.delete(vcf_uri, recursive=True)

In [None]:
# Specify the sample URIs
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = ["HG00096_chr21.gvcf.gz",
                     "HG00097_chr21.gvcf.gz", 
                     "HG00099_chr21.gvcf.gz", 
                     "HG00100_chr21.gvcf.gz", 
                     "HG00101_chr21.gvcf.gz"]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]
sample_uris

In [None]:
# NOTE: This is the only special thing about TileDB Cloud when
# creating and registering VCF datasets: the URI should be of the form:
# tiledb://<account>/s3://<bucket>/<vcf_name>
# TileDB Cloud understands that you are trying to create VCF dataset in
# s3://<bucket>/<vcf_name> and register it under <account>.
# After the VCF dataset is created and registered, it will be accessible
# simply as tiledb://<account>/<vcf_name>
vcf_uri_reg = "tiledb://" + tiledb_account + "/" + S3_BUCKET + "/" + vcf_name

# Open a VCF dataset in write mode.
# Notice you need to pass the TileDB Cloud config for authentication.
ds = tiledbvcf.Dataset(uri=vcf_uri_reg, mode="w", tiledb_config=cfg)

# Create empty VCF dataset
ds.create_dataset()
    
# Ingest samples
ds.ingest_samples(sample_uris = sample_uris)

Once launched, you can visit Monitor -> Logs and select `Tasks`. From there, you can view `Queries`. These are the tasks TileDB launches to run the ingestion in a serverless fashion, which is a major advantage of the TileDB platform.

Once completed, visit `Assets` -> `VCF`  and view your VCF entry.

#### Biomedical Imaging Data

For this section, all you need to do is fill in the blanks; we provided the code in your notebook. The code is specific to your environment, and previous sections contain the answers. Reach out if you are still stuck. Make sure you find your account name, your S3 bucket (the bucket you registered with TileDB Cloud), and your REST API token (you can use the token from a previous section), and create the proper TileDB context and URI. Many of these details are in Section 1 of your notebook. Your account name can be found programmatically with ```tiledb.cloud.user_profile().username```.

## Section 3: Machine Learning on TileDB (Vector Search and Models)


### Hands on Section
#### Vector Search
The power of TileDB's vector search is its ability to store the code, vectors, original data, chunked data, and models all within the same system. The [Academy Tutorial](https://cloud.tiledb.com/academy/structure/ai-ml/vector-search/tutorials/basics/ingestion-and-querying/#indexes-on) will help you learn how to ingest a set of vectors into an IVF_FLAT index and perform basic similarity search. You will ingest the small (10k) SIFT dataset from the Datasets for approximate nearest neighbor search site. You will download a mirrored copy of the dataset from the TileDB-Vector-Search repo on GitHub. Make sure you follow the section called `Indexes on TileDB Cloud` so that you unlock TileDB's true potential.
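
To build intuition for what an IVF_FLAT index does, here is a toy plain-Python sketch of the general technique, not of TileDB's implementation. Vectors are grouped into partitions by their nearest centroid, and a query scans only the `nprobe` closest partitions exhaustively. The centroids are hard-coded here as an assumption; a real index typically learns them (for example with k-means), and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector id to the partition of its nearest centroid."""
    partitions = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        partitions[nearest].append(vid)
    return partitions

def query_ivf(query, vectors, centroids, partitions, k, nprobe):
    """Return the ids of the k closest vectors found in the nprobe nearest partitions."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in partitions[c]]
    # "Flat" step: exhaustive distance computation over the candidates only
    ranked = sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))
    return ranked[:k]

vectors = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.0)]
centroids = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]  # assumed, not trained

partitions = build_ivf(vectors, centroids)
top2 = query_ivf((0.2, 0.1), vectors, centroids, partitions, k=2, nprobe=1)
print(partitions)
print(top2)
```

Raising `nprobe` scans more partitions, which trades query speed for a better chance of finding the true nearest neighbors.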


##### **Section Code (Use Below to Organize Your Code)**

In [None]:
# Import necessary libraries
import os
import tarfile
import shutil
import urllib.request
import numpy as np
import tiledb.vector_search as vs
from tiledb.vector_search.utils import load_fvecs, load_ivecs
import tiledb
import tiledb.cloud  # needed for tiledb.cloud.user_profile() and tiledb.cloud.asset

# The URIs for the data to download and ingest
data_uri = "https://github.com/TileDB-Inc/TileDB-Vector-Search/releases/download/0.0.1/siftsmall.tgz"
data_filename = "siftsmall.tar.gz"
data_dir = os.path.expanduser("~/sift10k/")
local_data_path = os.path.join(data_dir, data_filename)

# Clean up previous data
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)

In [None]:
# Create a directory to store the source dataset
os.makedirs(os.path.dirname(data_dir))

# Download the file that contains the vector dataset
urllib.request.urlretrieve(data_uri, local_data_path)

# untar the file
tarfile.open(local_data_path, "r:gz").extractall(
    os.path.dirname(local_data_path), filter="fully_trusted"
)

In [None]:
# Get your username
username = tiledb.cloud.user_profile().username

# Set your bucket (the `Root default path` from Settings -> Storage paths)
s3_bucket = "<your bucket>"

# Set index URI
index_name = "cloud_vector_index"
index_uri = "tiledb://" + username + "/" + index_name
index_reg_uri = "tiledb://" + username + "/" + s3_bucket + "/" + index_name


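# Clean up index if it already exists.
# `ctx` is not otherwise defined in this section, so build an explicit context here.
ctx = tiledb.Ctx({"rest.token": os.getenv("TILEDB_REST_TOKEN")})
if tiledb.object_type(index_uri, ctx=ctx) == "group":
    tiledb.cloud.asset.delete(index_uri, recursive=True)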

In [None]:
tiledb_token  = os.getenv("TILEDB_REST_TOKEN")

In [None]:
# This context initialization can be performed only once!!! If you see an error, restart your kernel and run this section again starting at the imports.
cfg = tiledb.Config({
    "rest.token": tiledb_token,
})
tiledb.default_ctx(cfg)

In [None]:
# Create an index, where the dimensionality of each vector is 128,
# the type of the vector values is float32, and the index will
# use 100 partitions.
index = vs.ivf_flat_index.create(
    uri=index_reg_uri,
    dimensions=128,
    partitions=100,
    vector_type=np.dtype(np.float32),
)

In [None]:
index = vs.ingest(
    index_type="IVF_FLAT",
    source_uri=os.path.join(data_dir, "siftsmall_base.fvecs"),
    index_uri=index_reg_uri,
    source_type="FVEC",
    partitions=100,
)

Once completed, visit `Monitor` -> `Tasks` to view the ingestion tasks. Then visit `Assets` -> `Vector Search` to view your newly ingested asset!

In [None]:
# Show the physical group
group = tiledb.Group(index_uri, "r")
print("Index physical contents:\n")
print(group)

# Prepare the index for reading
index = vs.IVFFlatIndex(index_uri)

# Open the vector array to inspect it
print("Vector array URI:", index.db_uri, "\n")
A = tiledb.open(index.db_uri)

# Print the schema of the vector array
print("Vector array schema:\n")
print(A.schema)

# Print the first vector
print("Contents of first vector:\n")
print(A[:, 0]["values"])

In [None]:
# Get query vectors with ground truth
query_vectors = load_fvecs(os.path.join(data_dir, "siftsmall_query.fvecs"))
ground_truth = load_ivecs(os.path.join(data_dir, "siftsmall_groundtruth.ivecs"))

# Select a query vector
query_id = 77
qv = np.array([query_vectors[query_id]])

# Return the 100 most similar vectors to the query vector with IVF_FLAT
result_d, result_i = index.query(qv, k=100, nprobe=10)
print("Result vector ids:\n")
print(result_i)
print("\nResult vector distances:\n")
print(result_d)
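
The ground truth loaded above lets you judge how good the approximate results are. A common measure is recall@k: the fraction of the true k nearest neighbors that the index returned. The sketch below uses small made-up id lists; `recall_at_k` is a hypothetical helper, not part of `tiledb.vector_search`.

```python
def recall_at_k(result_ids, truth_ids, k):
    """Fraction of the true top-k neighbor ids that appear in the returned top-k ids."""
    found = set(result_ids[:k])
    expected = set(truth_ids[:k])
    return len(found & expected) / k

# Made-up ids for illustration only
returned = [3, 7, 1, 9]
truth = [3, 1, 4, 7]

recall = recall_at_k(returned, truth, k=4)
print(recall)
```

In the notebook, something like `recall_at_k(result_i[0], ground_truth[query_id], 100)` would apply the same idea, assuming `ground_truth` holds one row of neighbor ids per query vector.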

### Machine Learning Models
This [Academy Tutorial](https://cloud.tiledb.com/academy/structure/ai-ml/ml-models/tutorials/ingestion/model-ingestion/) will walk you through storing a model, based on its framework, in TileDB Cloud. Once you have a stored model, you could pull it later for training or fine-tuning. Those topics are beyond the scope of this workshop. For this lab, use the `PyTorch` example.

##### **Section Code (Use Below to Organize Your Code)**

In [8]:
import os

import tiledb.cloud
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision

from tiledb.ml.models.pytorch import PyTorchTileDBModel

epochs = 1
batch_size_train = 128
learning_rate = 0.01
momentum = 0.5
log_interval = 10

# Set random seeds for anything using random number generation
torch.manual_seed(seed=1)

# Disable nondeterministic algorithms
torch.backends.cudnn.enabled = False

In [18]:
# Log in first so that the context picks up the authenticated configuration
tiledb.cloud.login(token=os.getenv("TILEDB_REST_TOKEN"))
ctx = tiledb.cloud.Ctx()
namespace = tiledb.cloud.client.default_user().username

In [22]:
data_home = os.path.expanduser("~/data")
dataset = torchvision.datasets.MNIST(
    root=data_home, 
    train=True, 
    download=True,
    transform=torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.1307,), (0.3081,))
    ])
)
train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size_train, shuffle=True)


In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim = 1)

In [None]:
model = Net()
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)


In [None]:
train_losses = []
train_counter = []

def train(epoch):
  model.train()
  for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    if batch_idx % log_interval == 0:
      print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
        epoch, batch_idx * len(data), len(train_loader.dataset),
        100. * batch_idx / len(train_loader), loss.item()))
      train_losses.append(loss.item())
      train_counter.append(
        (batch_idx * batch_size_train) + ((epoch-1)*len(train_loader.dataset)))

for epoch in range(1, epochs + 1):
  train(epoch)
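
The model's `forward` returns `log_softmax` outputs and the training loop feeds them to `nll_loss`; together these two steps amount to the usual cross-entropy loss. A plain-Python sketch of the arithmetic for a single sample follows, using made-up logits (the function names mirror the PyTorch ones but are local re-implementations for illustration).

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, computed stably by subtracting the max."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]

logits = [2.0, 1.0, 0.1]  # made-up scores for three classes
log_probs = log_softmax(logits)
loss = nll_loss(log_probs, target=0)
print(round(loss, 4))
```

The loss shrinks toward zero as the score of the correct class grows relative to the others.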

In [None]:
print('Defining PyTorchTileDBModel model...')
# In order to save our model on S3 and register it on TileDB-Cloud we have to pass our Namespace and TileDB Context.
tiledb_model = PyTorchTileDBModel(uri='tiledb-pytorch-model', namespace=namespace, ctx=ctx, model=model, optimizer=optimizer)

# We will need the uri that was created from our model class
# (and follows pattern tiledb://my_username/s3://my_bucket/my_array),
# in order to interact with our model on TileDB-Cloud.
tiledb_cloud_model_uri = tiledb_model.uri

print('Saving model on S3 and registering on TileDB-Cloud...')
tiledb_model.save(meta={'epochs': epochs,
                        'train_loss': train_losses})

In [None]:
# List all our models. Here, we filter with file_type = 'ml_model'. All machine learning model TileDB arrays are of type
# 'ml_model'
print(tiledb.cloud.client.list_arrays(file_type=['ml_model'], namespace=namespace))

# Get model's info
print(tiledb.cloud.array.info(tiledb_cloud_model_uri))

# Load our model for inference
# Place holder for the loaded model
loaded_model = Net()
loaded_optimizer = optim.SGD(loaded_model.parameters(), lr=learning_rate, momentum=momentum)

tiledb_model = PyTorchTileDBModel(uri=os.path.basename(tiledb_cloud_model_uri), namespace=namespace, ctx=ctx)
tiledb_model.load(model=loaded_model, optimizer=loaded_optimizer)

# Check model parameters
assert str(model.state_dict()) == str(loaded_model.state_dict())

# Check optimizer parameters
assert str(optimizer.state_dict()) == str(loaded_optimizer.state_dict())

You should now be able to see your model from `Assets` -> `Code` -> `ML Models`

### User Defined Functions
TileDB provides effortless scalability for Python and R code using serverless user-defined functions (UDFs). UDFs come in three types:

    Generic: run any Python function at scale with arbitrary input arguments.
    Single-array: apply a function to a predefined slice of a TileDB array.
    Multi-array: UDFs that can be applied to any number of arrays.
This [Academy Tutorial](https://cloud.tiledb.com/academy/analyze/user-defined-functions/) will walk you through some basic examples. Once completed, attempt to apply your knowledge to our challenge problem. 
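
As a rough local illustration of the difference between the first two types: a generic UDF receives whatever arguments you pass, while an array UDF is handed the data read from a slice. The sketch below assumes the slice arrives as a mapping from attribute name to values (an assumption about the calling convention; confirm it in the Academy tutorial) and simply calls both functions locally, without TileDB Cloud.

```python
import statistics

def generic_mean(numbers):
    """Generic-style function: works on arbitrary input arguments."""
    return statistics.mean(numbers)

def array_median(data):
    """Array-style function: `data` maps attribute names to the sliced values."""
    return statistics.median(data["a"])

mean_value = generic_mean([1, 2, 3, 4])

# Stand-in for the result of slicing attribute "a" from an array
fake_slice = {"a": [1, 5, 9, 13]}
median_value = array_median(fake_slice)

print(mean_value)
print(median_value)
```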

##### **Section Code (Use Below to Organize Your Code)**

### UDF Challenge Problem

Write a function called `sum_array` that takes an input list of numbers and returns their sum. Register the function with TileDB Cloud, and then execute it.

In [None]:
def sum_array(numbers):
    """
    Calculate the sum of a list of numbers.

    Args:
        numbers (list): A list of numerical values.

    Returns:
        float: The sum of the numbers in the list.
    """
    return sum(numbers)

# Example usage
numbers = [1, 2, 3, 4, 5]
print(sum_array(numbers))  # Output: 15

In [None]:
namespace = tiledb.cloud.user_profile().username
tiledb.cloud.udf.register_generic_udf(sum_array, name="sum_array", namespace=namespace)

Now visit `Assets` -> `Code` -> `UDFs` and find `sum_array`

In [None]:
numbers = [1, 2, 3, 4, 5]
tiledb.cloud.udf.exec(f"{namespace}/sum_array", numbers) #output 15

### Task Graphs

This [Academy Tutorial](https://cloud.tiledb.com/academy/scale/api-usage/index.html#modes-of-operation) will walk you through some of the basics of task graphs. Once completed move onto our challenge problem.

### Task Graphs Challenge

Create a task graph with these steps:

    Step 1: Generate a list of numbers.
    Step 2: Compute the square of each number.
    Step 3: Compute the sum of the squares.
    Step 4: Run a step AFTER the above step that prints "done"! 

Register and execute these task graphs on TileDB Cloud in batch mode as a DAG. Monitor the task graphs and ensure they log their inputs and outputs to the console. You must also visualize the DAG output.

##### **Section Code (Use below to organize your code. Feel free to add additional cells as needed.)**

In [None]:
def task_graphs_challenge_start(n):
    """
    Generates a list of numbers, computes their squares, and returns the sum of the squares.

    Args:
        n (int): The number of integers to generate (from 1 to n).

    Returns:
        int: The sum of the squares of the numbers.
    """
    # Step 1: Generate a list of numbers
    numbers = list(range(1, n + 1))
    
    # Step 2: Compute the square of each number
    squares = [num ** 2 for num in numbers]
    
    # Step 3: Compute the sum of the squares
    sum_of_squares = sum(squares)
    
    return sum_of_squares

# Example usage
n = 10
result = task_graphs_challenge_start(n)
print(f"Sum of squares for numbers 1 to {n}: {result}")


In [None]:
def task_graphs_challenge_end():
    print("done!")

In [None]:
import tiledb.cloud
batch_dag = tiledb.cloud.dag.DAG(mode=tiledb.cloud.dag.Mode.BATCH)
step_1 = batch_dag.submit(task_graphs_challenge_start, 5)
step_2 = batch_dag.submit(task_graphs_challenge_end)
step_2.depends_on(step_1)

# Visualize the task graph
batch_dag.visualize()



In [None]:
# Start task graph
batch_dag.compute()

Visit `Monitor` -> `Logs` -> `Task Graphs` from the TileDB UI and select the launching task graph. You should see something that matches the above visualization. 
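
The answer above folds steps 1 to 3 into a single node. If you want each step to be its own node, the structure could look like the following sketch. It runs locally with Python's standard `graphlib` module instead of the TileDB Cloud DAG API, so it only illustrates the dependency ordering and the input/output logging; all function and node names are hypothetical.

```python
from graphlib import TopologicalSorter

def generate_numbers(n):
    return list(range(1, n + 1))

def square_numbers(numbers):
    return [x ** 2 for x in numbers]

def sum_squares(squares):
    return sum(squares)

def report_done(_total):
    print("done!")
    return "done"

# Each node: (function, name of the node it depends on, or None for the first node)
steps = {
    "generate": (generate_numbers, None),
    "square": (square_numbers, "generate"),
    "sum": (sum_squares, "square"),
    "done": (report_done, "sum"),
}

def run_graph(steps, first_input):
    """Run every node after its dependency, logging inputs and outputs."""
    graph = {name: ({dep} if dep else set()) for name, (_, dep) in steps.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, dep = steps[name]
        arg = results[dep] if dep else first_input
        results[name] = func(arg)
        print(f"{name}: input={arg!r} output={results[name]!r}")
    return results

results = run_graph(steps, 5)
```

On TileDB Cloud, the equivalent would be one `batch_dag.submit(...)` call per function, with each node linked to the previous one, for example via `depends_on` as in the answer above.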

## Section 4: Marketplace
This final section is all about Marketplace. We've published a few items to Marketplace already. Now, let's list the public items and publish an asset to Marketplace.

To achieve victory:
1. List all assets of type `"bioimg"`
2. Publish an image to Marketplace
3. Search for the new asset by name

You will need this [article](https://cloud.tiledb.com/academy/collaborate/marketplace/index.html) and [this article](https://cloud.tiledb.com/academy/catalog/search/index.html).

In [None]:
from tiledb.cloud import asset

In [None]:
asset.list_public(type="bioimg") # run before publishing to confirm your image is not yet public

In [None]:
help(asset.list_public)

In [None]:
tiledb.cloud.asset.list( 
    search="bioimage",  # The search keywords based on what you named your image.
)

## Congratulations! 

You did it! If you ran into any issues (or just got stuck) please reach out to us directly or check out the answers guide on the git repo. 