# TileDB 101 Lab: Intro to TileDB!
## Goals and Outcomes
Welcome to the Intro to TileDB Lab! This lab can be self-guided (download and go), or you can walk through it alongside the workshop instructor. By the end of this lab you will have:
* Logged into TileDB using your organization's credentials.
* Toured the TileDB UI from within your allocated namespace.
* Joined your organization's TileDB Org.
* Created and ingested: Arrays, Tables, SOMA Experiments, VCF Datasets, a Biomedical Image Array, Files, and a Vector Search dataset.
* Launched: Dashboards, User Defined Functions, Task Graphs, and Notebooks.
* Cataloged: all of the above, plus a Model stored as an Array.

Then, we will revisit our organization's page to view all the ingested data and models.

We will also view our UDFs and DAGs after they are submitted.

Lastly, we will clean up our environments and shut down our servers. 

By the end of this lab you will have a foundational understanding of TileDB concepts and components as well as how to navigate the TileDB platform.
This notebook can serve as an introductory reference to TileDB concepts as a whole. Once done, you may download it or (if using your organization's cluster) keep it in your namespace.

***Think of this as your Intro to TileDB Workbook you can use as a reference going forward!*** 



## Requirements
* AWS Access Keys
* AWS Assume Role
* Bucket name 

## Section 0: Configuring Your Cloud Credentials and Creating Organizations
### Workshop Recap
TileDB uses two types of credentials to access data: **AWS Access Keys** when working in our isolated notebooks, and **Assume Roles** when accessing data from User Defined Functions or Task Graphs.
### Section Goals
In this section, we will set up the credentials for your newly created Namespace, create an Organization, and invite a fellow user to your new org.
### Hands on Section
#### Cloud Credentials
Your administrator or instructor will provide you with an Amazon Resource Name (ARN) as well as classic access key credentials. From there, follow the academy instructions [here](https://cloud.tiledb.com/academy/accounts/individual/profile/cloud-credentials/index.html) to set up both.
#### Organization Creation
Now that you've set up your individual credentials, follow the [academy guide](https://cloud.tiledb.com/academy/accounts/org-admin/create-org/index.html) to create an organization. After your organization is created, go ahead and invite another member from the workshop. Once complete, you are ready to move on and start cataloging!


## Section 1: TileDB Fundamentals (Arrays, Tables, and Files)
### Workshop Recap
TileDB's foundation is built on arrays. Our Arrays workshop goes into this in more depth, and our [Academy](https://cloud.tiledb.com/academy/structure/arrays/index.html) has a whole section on them. At a high level, TileDB is architected around multi-dimensional arrays and, therefore, the array is a first-class citizen in TileDB. TileDB supports both dense and sparse arrays. In a dense array, each element (called a “cell”) has a value that is materialized on storage. In a sparse array, the majority of cells are empty and, therefore, TileDB does not materialize them on storage. The decision on whether to model your data with a dense or sparse array depends on the application, and it can greatly affect performance. We can store all sorts of data in arrays. These tradeoffs are a topic we will dive into in a different lab, but you can also explore [Academy's Performance Section](https://cloud.tiledb.com/academy/structure/arrays/tutorials/performance/index.html) to learn more!
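
To make the dense versus sparse distinction concrete, here is a minimal conceptual sketch in plain Python. It does not use the TileDB API; the helper names `dense_cells` and `sparse_cells` are hypothetical and exist only for illustration. It shows that a dense layout stores every cell in the domain, while a sparse layout stores only the non-empty cells together with their coordinates.

```python
# Conceptual sketch only: plain Python stand-ins, not the TileDB API.

def dense_cells(rows, cols, values, fill=0):
    """Materialize every cell of a rows x cols grid, filling unset cells."""
    return [[values.get((r, c), fill) for c in range(cols)] for r in range(rows)]


def sparse_cells(values):
    """Keep only the non-empty cells as (coordinates, value) pairs."""
    return sorted(values.items())


# Three non-empty cells in a 4 x 4 domain.
data = {(0, 1): 7, (2, 2): 5, (3, 0): 9}

dense = dense_cells(4, 4, data)
sparse = sparse_cells(data)

stored_dense = sum(len(row) for row in dense)  # every cell is stored
stored_sparse = len(sparse)                    # only non-empty cells are stored
print(stored_dense, stored_sparse)  # 16 3
```

The dense form pays for all 16 cells but needs no coordinates; the sparse form stores 3 cells but must keep their coordinates alongside the values.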
### Section Goals
In this section, we will create both dense and sparse arrays, ingest files, and store data in a table using TileDB. Launch the following cells in order and follow along before attempting to create these arrays, files, and tables yourself! 

### Hands on Section
These tutorials will require a TileDB `Basic Data Science` image.
#### Local Arrays

Let's create dense and sparse arrays locally. Follow the [Academy Tutorial](https://cloud.tiledb.com/academy/structure/arrays/quickstart/). Once complete, move on to the next section.

##### **Section Code (Use Below to Organize Your Code)**

#### Remote Arrays

Now that we've created some local arrays, let's follow the next [Academy Tutorial](https://cloud.tiledb.com/academy/structure/arrays/tutorials/basics/basic-s3/) and create remote and centralized arrays. Once you create the array, navigate to the Assets -> Arrays tab and view your newly created and registered array.


##### **Section Code (Use Below to Organize Your Code)**

#### Local Tables

Tabular data is just an instance of multi-dimensional arrays. As such, TileDB can very efficiently model tables as arrays, and therefore tables can inherit all the TileDB functionality built around arrays. Now that we better understand arrays, let's build some *dense* and *sparse* tables using arrays! Follow the [Academy Tutorial](https://cloud.tiledb.com/academy/structure/tables/tutorials/basics/csv-ingestion/) and store the code below! 
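
The idea of a table as an array can be sketched in plain Python. This is a conceptual illustration under simple assumptions, not the TileDB API: the sample rows and the helper `slice_sparse` are hypothetical. In the dense form the dimension is the row position and each column is stored as an attribute; in the sparse form a column (here `id`) acts as the dimension, so range queries on it become array slices.

```python
# Conceptual sketch only: plain Python stand-ins, not the TileDB API.
rows = [
    {"id": 10, "city": "Athens", "temp": 31.5},
    {"id": 42, "city": "Boston", "temp": 22.0},
    {"id": 77, "city": "Tokyo", "temp": 27.3},
]

# Dense table: the dimension is the row position (0, 1, 2, ...),
# and every column becomes an attribute stored along that dimension.
dense_table = {col: [row[col] for row in rows] for col in rows[0]}

# Sparse table: a column (here "id") is chosen as the dimension,
# so cells exist only at the coordinates that actually occur.
sparse_table = {row["id"]: {"city": row["city"], "temp": row["temp"]} for row in rows}


def slice_sparse(table, lo, hi):
    """Return the rows whose dimension value falls in [lo, hi]."""
    return {k: v for k, v in sorted(table.items()) if lo <= k <= hi}


print(dense_table["city"][1])                    # Boston
print(list(slice_sparse(sparse_table, 40, 80)))  # [42, 77]
```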

##### **Section Code (Use Below to Organize Your Code)**

#### Tables on S3

Now that we've ingested some CSVs into arrays, let's continue to unlock the full potential of TileDB by leveraging S3 to handle ingestion. Follow the [Academy Tutorial](https://cloud.tiledb.com/academy/structure/tables/tutorials/basics/basic-s3/) and use the section below to store your code!

##### **Section Code (Use Below to Organize Your Code)**

#### TileDB Files

Hopefully by now you've come to realize that TileDB's array-based format is very flexible. This format can quickly become complex, because structuring and storing data is itself complex. That said, TileDB's flexibility ensures you can efficiently store and structure your data in ways that benefit your business and application needs. Your team is no longer forced to fit data into rigid table structures and pay the performance and other costs of doing so. TileDB's flexibility also enables us to handle more standardized types of data, such as files. TileDB allows you to import, securely manage, and search over all your files in one governed and compliant data platform. You can follow our files guide via the [Academy Tutorial](https://cloud.tiledb.com/academy/catalog/data/files/index.html). Make sure you view your files directly in TileDB's UI as well as via programmatic tools!

## Section 2: Life Sciences on TileDB (VCF, Biomedical Images, and Single Cell Data)


### Workshop Recap
#### SOMA
TileDB-SOMA is an open-source library introducing a data format, query engine and API for storing, managing and analyzing collections of annotated matrices and derived results that are typical of systems biology data. While single-cell genomics data is the most common use case, TileDB-SOMA is equally useful for other types of omics data as well, such as bulk RNA-seq. 
#### VCF
TileDB was first used to build a variant store system in 2015, as part of a collaboration between Intel Labs, Intel Health and Life Sciences, MIT and the Broad Institute. The resulting software, GenomicsDB, became part of Broad’s GATK 4.0. TileDB-VCF is an evolution of that work, built on an improved and always up-to-date version of the open-source TileDB array engine, incorporating new algorithms, features, and optimizations. TileDB-VCF has extended security, management, scalable compute, and visualization features on TileDB Cloud.
#### Imaging
TileDB provides support for importing, visualizing, analyzing, and exporting multi-resolution whole-slide microscopy images. This can be done through the integrated ingestion, viewer, data management, access control, and computation features for bioimaging datasets within the TileDB user interface, as well as through Python APIs for ingesting images into TileDB BioImage arrays, slicing them with NumPy array semantics, or reading them via an OpenSlide Python-compatible API.

### Section Goals
This section introduces some of the basic life sciences modalities we support through TileDB and the projects supporting them, the major ones being SOMA, VCF, and Biomedical Imaging. We will ensure you have a basic understanding of these projects by ingesting, cataloging, and accessing single-cell, imaging, and population genomics data, and we provide links to the relevant Academy sections so you can learn more.

### Hands On Section

These tutorials will require a TileDB `Genomics` image.

#### TileDB SOMA (Stacks of Matrices, Annotated) 
This [Academy Tutorial](https://cloud.tiledb.com/academy/structure/life-sciences/single-cell/tutorials/basic-s3/) will guide you through ingesting and accessing SOMA data stored on S3. Do not worry about the deeper details! There are other workshops where we go deeper into SOMA. This is just the start of that journey! Use the section below to store and run your code. Once you are done, check your *Assets* tab in TileDB to see your ingested data!


##### **Section Code (Use Below to Organize Your Code)**

#### TileDB VCF 
TileDB Cloud provides a method to ingest a batch of single-sample VCF files into a VCF dataset and add the dataset to the catalog, all in one step. The source VCF files are read from cloud object storage and written into TileDB arrays defined by TileDB-VCF. The [Academy Tutorial](https://cloud.tiledb.com/academy/structure/life-sciences/population-genomics/tutorials/basics/basic-tiledb-cloud/) uses TileDB Cloud for a basic ingestion. You can also use the TileDB UI to ingest files directly by following the walkthrough [here](https://cloud.tiledb.com/academy/catalog/data/genomics/index.html). Use the cells below to run and organize your code.


##### **Section Code (Use Below to Organize Your Code)**

#### TileDB Images


To get started with bioimages on TileDB, start up a TileDB notebook with the Genomics environment, and import the TileDB BioImage package.

*Note: a dedicated tutorial for this section is not yet available.*

## Section 3: Machine Learning on TileDB (Vector Search and Models)


### Workshop Recap
TileDB's orchestration power and arrays give you the ability to highly parallelize your workloads and data streams. With TileDB you can also share the code AND the data of any experiment, making your work much more reproducible. Beyond that, once you have completed your ML tasks, you can store your models in a model registry, all as arrays. Your original data, embeddings, models, and more can be consolidated into TileDB. The story continues with the ability to store `User Defined Functions` for shareable and reproducible code, as well as `Task Graphs` to create custom workflows based on local or remote functions.
### Section Goals
In this section, we will kick off a simple vector ingestion, store a model as an array, create User Defined Functions, and lastly create a task graph. These features will help set us up for deeper ML walkthroughs in different workshops, where we discuss scraping, ingesting, and querying our data at scale with Large Language Models such as `BioMistral` as well as specialized life science Transformer models such as `scBERT`.

### Hands on Section
#### Vector Search
The power of TileDB's vector search is its ability to store the code, vectors, original data, chunked data, and models all within the same system. The [Academy Tutorial](https://cloud.tiledb.com/academy/structure/ai-ml/vector-search/tutorials/basics/ingestion-and-querying/) will walk you through ingesting a sample dataset and querying it.
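
Before running the tutorial, it may help to see what a similarity query does in the simplest case. The sketch below is a brute-force search in plain Python using cosine similarity. It is not the TileDB Vector Search API: the function names and the toy vectors are hypothetical, and it assumes all vectors are non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda vid: cosine_similarity(query, vectors[vid]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" keyed by document id.
vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.05, 0.0], vectors))  # ['doc_a', 'doc_b']
```

Brute force compares the query against every stored vector, which is fine for a handful of vectors but is the cost that a dedicated vector search system is designed to avoid at scale.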

##### **Section Code (Use Below to Organize Your Code)**

#### Machine Learning Models
This [Academy Tutorial](https://cloud.tiledb.com/academy/structure/ai-ml/ml-models/tutorials/ingestion/model-ingestion/) will walk you through storing a model in TileDB Cloud, based on the framework it was built with. Once you have a stored model, you can pull it later for training or fine-tuning.
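
As a rough intuition for what "a model as an array" can mean, the sketch below serializes a toy model to bytes, treats each byte as one cell of a 1-D array, and restores the model from those cells. This is an illustration in plain Python under simple assumptions, not how the TileDB model tooling necessarily lays out its data; the toy model and the metadata keys are hypothetical.

```python
import pickle

# Toy stand-in for a trained model.
model = {"weights": [0.25, -1.5, 3.0], "bias": 0.1}

# Serialize the model, then view the bytes as cells of a 1-D dense "array".
blob = pickle.dumps(model)
cells = list(blob)  # one integer in 0..255 per cell

# Descriptive metadata that would travel with the array.
metadata = {"framework": "toy", "num_bytes": len(cells)}

# Reading the cells back reconstructs the original model.
restored = pickle.loads(bytes(cells))
print(restored == model)  # True
```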

##### **Section Code (Use Below to Organize Your Code)**

#### User Defined Functions
TileDB provides effortless scalability for Python and R code using serverless user-defined functions (UDFs). UDFs come in three types:

* Generic: run any Python function at scale with arbitrary input arguments.
* Single-array: apply a function to a predefined slice of a TileDB array.
* Multi-array: UDFs that can be applied to any number of arrays.

This [Academy Tutorial](https://cloud.tiledb.com/academy/analyze/user-defined-functions/) will walk you through some basic examples. Once completed, attempt to apply your knowledge to our challenge problem. 
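
The three UDF types differ mainly in what the function receives. The sketch below shows function shapes for each type and runs them locally on stand-in data. It assumes that an array UDF receives a mapping from attribute name to the values in the requested slice; the function names and the attribute name `a` are hypothetical, and nothing here calls TileDB Cloud.

```python
def generic_udf(values, scale=1):
    """Generic: arbitrary inputs, no array involved."""
    return [v * scale for v in values]


def single_array_udf(data):
    """Single-array: `data` maps attribute names to the sliced values."""
    return sum(data["a"]) / len(data["a"])


def multi_array_udf(slices):
    """Multi-array: one mapping per array, processed together."""
    return [single_array_udf(s) for s in slices]


# Local stand-ins for the data TileDB Cloud would pass to each function.
slice_one = {"a": [1, 2, 3]}
slice_two = {"a": [10, 20]}

print(generic_udf([1, 2, 3], scale=2))          # [2, 4, 6]
print(single_array_udf(slice_one))              # 2.0
print(multi_array_udf([slice_one, slice_two]))  # [2.0, 15.0]
```

Testing the function locally on small stand-in data like this is a cheap way to catch mistakes before submitting it for remote execution, which the Academy tutorial covers.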

##### **Section Code (Use Below to Organize Your Code)**

#### UDF Challenge Problem

Write a function called `sum_array` that takes an input list of numbers and returns their sum. Register the function with TileDB Cloud, and then execute it. Use an appropriately sized resource class.
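
One possible starting point, offered as a sketch rather than the reference answer: define the function, check it locally, and keep the remote call in a separate helper. The helper `run_remotely` is hypothetical; it assumes the `tiledb-cloud` package is installed and that you are already logged in, and it submits the function as a generic UDF with `tiledb.cloud.udf.exec`. Registration and the choice of resource class are left for you to complete.

```python
def sum_array(numbers):
    """Return the sum of a list of numbers."""
    return sum(numbers)


def run_remotely(numbers):
    """Execute sum_array as a generic UDF on TileDB Cloud.

    Assumes tiledb-cloud is installed and you are logged in.
    """
    import tiledb.cloud  # imported lazily so the local check works anywhere

    return tiledb.cloud.udf.exec(sum_array, numbers)


# Local check before submitting anything remotely.
print(sum_array([1, 2, 3, 4]))  # 10
```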

#### Task Graphs

This [Academy Tutorial](https://cloud.tiledb.com/academy/scale/api-usage/index.html#modes-of-operation) will walk you through some of the basics of task graphs. Once completed, move on to our challenge problem.

#### Task Graphs Challenge

Create a task graph with three steps:

* Step 1: Generate a list of numbers.
* Step 2: Compute the square of each number.
* Step 3: Compute the sum of the squares.

Register and execute this task graph on TileDB Cloud in batch mode. Monitor the task graph and ensure each step logs its inputs and outputs to the console.
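
As a hedged starting point rather than the reference answer, the sketch below defines the three steps as plain functions, verifies the pipeline locally, and wires the same functions into a task graph in a separate helper. The helper `build_task_graph` is hypothetical; it assumes the `tiledb-cloud` package is installed and that you are logged in, and it uses `tiledb.cloud.dag.DAG` with `submit`, where passing one node as an argument to another creates the dependency. Batch mode, registration, and logging are left for you to add.

```python
def generate_numbers(n):
    """Step 1: generate the list 1..n."""
    return list(range(1, n + 1))


def square_all(numbers):
    """Step 2: square each number."""
    return [x * x for x in numbers]


def sum_all(numbers):
    """Step 3: sum the values."""
    return sum(numbers)


def run_locally(n):
    """Chain the three steps in-process to verify the logic."""
    return sum_all(square_all(generate_numbers(n)))


def build_task_graph(n):
    """Chain the same steps as a TileDB Cloud task graph.

    Assumes tiledb-cloud is installed and you are logged in.
    """
    from tiledb.cloud import dag  # imported lazily so the local check works anywhere

    graph = dag.DAG()
    numbers = graph.submit(generate_numbers, n, name="generate")
    squares = graph.submit(square_all, numbers, name="square")
    total = graph.submit(sum_all, squares, name="sum")
    return graph, total


print(run_locally(4))  # 30
```

After building the graph, calling `graph.compute()` followed by `graph.wait()` runs it, and `total.result()` should return the final sum.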

## Congratulations! 

You did it! If you ran into any issues (or just got stuck) please reach out to us directly or check out the answers guide on the git repo. 