### Requirements

Install the required Python packages.

In [0]:
!pip install clarifaipyspark
!pip install protobuf==4.24.2

Setting CLARIFAI_PAT as an environment variable.  
*Note: Guide to get your [PAT](https://docs.clarifai.com/clarifai-basics/authentication/personal-access-tokens)*

In [0]:
import os
# Set your PAT here
os.environ['CLARIFAI_PAT'] = ''
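
An empty or missing token usually surfaces later as an authentication failure, so it can help to check the variable before creating the client. The sketch below is a minimal illustration; `require_pat` is a hypothetical helper and not part of clarifaipyspark.

```python
import os

def require_pat(env=os.environ):
    """Return the PAT stored in CLARIFAI_PAT, or raise if it is missing or blank.

    Hypothetical helper for illustration; not part of clarifaipyspark.
    """
    pat = env.get("CLARIFAI_PAT", "").strip()
    if not pat:
        raise RuntimeError(
            "CLARIFAI_PAT is not set; assign your PAT before creating the client."
        )
    return pat
```

Calling `require_pat()` right after the cell above would raise until the empty string is replaced with a real token.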

### Clarifai-PySpark Interface

Create a ClarifaiPySpark client object to connect to your App on Clarifai.

In [0]:
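from clarifaipyspark.client import ClarifaiPySpark

# Replace 'user_id' and 'app_id' with your own Clarifai user and App IDs
cspark_obj = ClarifaiPySpark(user_id='user_id', app_id='app_id')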

Specify the dataset in your App that you want to work on.  
This creates a new dataset in the App if it doesn't already exist.

In [0]:
dataset_obj = cspark_obj.dataset(dataset_id='dataset_id')

# 1. Upload Dataset

### a. Upload from Volume folder   
If your dataset's image/text files are stored in a Databricks volume, you can upload them directly from the volume to your Clarifai App.  
The folder should contain only images.  
If the folder name is the label for all images inside it, set the labels parameter to True.

In [0]:
# input_type is 'image' or 'text'; set labels=True only if the folder name is the label
dataset_obj.upload_dataset_from_folder(folder_path='volume_folder_path', input_type='image', labels=True)
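
To make the folder-name-as-label convention concrete, the sketch below previews which label each file would be paired with when labels is True. It only reads the local file system and does not call Clarifai; `preview_folder_labels` is a hypothetical helper, and it assumes the label is the name of the folder itself, as described above.

```python
from pathlib import Path

def preview_folder_labels(folder_path):
    """List (file name, label) pairs, taking the folder's own name as the label.

    Hypothetical helper for illustration; it does not upload anything.
    """
    folder = Path(folder_path)
    label = folder.name
    # Only regular files are listed; nested sub-folders are ignored in this sketch.
    return [(p.name, label) for p in sorted(folder.iterdir()) if p.is_file()]
```

For a folder named `cats`, every file inside it would be listed with the label `cats`.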

### b. Upload from CSV
The dataset can be populated from a CSV file that contains these mandatory columns - inputid, input.  
Other columns supported in the CSV are - concepts, metadata, geopoints.  
The input column can hold a file URL, a file path, or raw text.  
If a concepts column exists in the CSV, set labels=True.

In [0]:
# input_type: 'text' or 'image'; csv_type: 'raw', 'url' or 'filepath'; labels=True if a concepts column exists
dataset_obj.upload_dataset_from_csv(csv_path='volume_csv_path', input_type='text', labels=True, csv_type='raw')
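
As an illustration of the expected layout, the sketch below writes a small raw-text CSV with the mandatory inputid and input columns plus a concepts column. The rows are made-up sample data, `write_upload_csv` is a hypothetical helper, and using a single concept name per row is an assumption rather than a documented format.

```python
import csv

# Made-up sample rows: raw text in the input column, one concept per row (assumed format).
sample_rows = [
    {"inputid": "review-001", "input": "The battery lasts all day.", "concepts": "positive"},
    {"inputid": "review-002", "input": "The screen cracked within a week.", "concepts": "negative"},
]

def write_upload_csv(path, rows, fieldnames=("inputid", "input", "concepts")):
    """Write rows to a CSV file whose header matches the columns described above."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(fieldnames))
        writer.writeheader()
        writer.writerows(rows)
```

A file written this way would be uploaded with input_type='text', csv_type='raw' and labels=True.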

### c. Upload from Delta table
A Delta table can be used to populate the dataset in your App. The table should have these mandatory columns - inputid, input.   
Other columns supported in the Delta table are - concepts, metadata, geopoints.   
The input column can hold a file URL, a file path, or raw text.    
If a concepts column exists in the table, set labels=True.  

In [0]:
# input_type: 'text' or 'image'; csv_type: 'raw', 'url' or 'filepath'; labels=True if a concepts column exists
dataset_obj.upload_dataset_from_table(table_path='volume_table_path', input_type='text', labels=True, csv_type='raw')

### d. Upload from Dataframe
The dataset can be uploaded from a dataframe that contains these mandatory columns - inputid, input.  
Other columns supported in the dataframe are - concepts, metadata, geopoints.  
The input column can hold a file URL, a file path, or raw text.  
If a concepts column exists in the dataframe, set labels=True.

In [0]:
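# spark_dataframe is an existing Spark DataFrame with the columns described above
# input_type: 'text' or 'image'; csv_type: 'raw', 'url' or 'filepath'; labels=True if a concepts column exists
dataset_obj.upload_dataset_from_dataframe(dataframe=spark_dataframe, input_type='text', labels=True, csv_type='raw')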
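
Because the CSV, Delta table, and dataframe uploads all expect the same column names, a quick local check before uploading can catch a misnamed column early. The sketch below is a hypothetical helper, not part of clarifaipyspark; how the library itself treats extra columns is not stated here, so the function only reports them.

```python
MANDATORY_COLUMNS = {"inputid", "input"}
OPTIONAL_COLUMNS = {"concepts", "metadata", "geopoints"}

def check_columns(columns, labels=False):
    """Return (missing, unknown) column names for an upload source.

    Hypothetical helper for illustration. When labels is True, the concepts
    column is treated as required as well.
    """
    present = set(columns)
    required = MANDATORY_COLUMNS | ({"concepts"} if labels else set())
    missing = sorted(required - present)
    unknown = sorted(present - MANDATORY_COLUMNS - OPTIONAL_COLUMNS)
    return missing, unknown
```

With a Spark DataFrame, the column names can be passed as `check_columns(spark_dataframe.columns, labels=True)`; two empty lists mean the layout matches the description above.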

### e. Upload with custom dataloader
If you have your dataset stored in a different format or need to do any preprocessing, a custom dataloader can be provided.  
You can find examples of different dataloaders [here](https://github.com/Clarifai/examples/tree/main/datasets/upload)

In [0]:
dataset_obj.upload_dataset_from_dataloader(task="visual-classification", split="train", module_dir="volume_module_path")

# 2. Fetch inputs from dataset in App

### a. Get inputs from dataset in your Clarifai App
You can fetch the inputs and their metadata from your dataset in the Clarifai App. The call returns a list of JSON objects.

In [0]:
inputs_response = list(dataset_obj.list_inputs())

### b. Fetch actual image/text files from your dataset and store them into volume
The image/text files from your App's dataset can be exported and stored in a Databricks volume.

In [0]:
# For images
dataset_obj.export_images_to_volume(path="destination_volume_path", input_response=inputs_response)

# For text
dataset_obj.export_text_to_volume(path="destination_volume_path", input_response=inputs_response)

# 3. Fetch Annotations from dataset in App
You can retrieve the image/text annotations from your App's dataset as a dataframe.  
The resulting dataframe has these columns - id (annotation_id), user_id, input_id, annotation (json), created_at, modified_at.

In [0]:
annotations_df = dataset_obj.export_annotations_to_dataframe()