### Requirements

Install the required Python packages.

In [None]:
!pip install clarifai-pyspark
!pip install protobuf==4.24.2

Set CLARIFAI_PAT as an environment variable.  
*Note: see the guide to getting your [PAT](https://docs.clarifai.com/clarifai-basics/authentication/personal-access-tokens).*

In [None]:
import os
from clarifaipyspark.client import ClarifaiPySpark
# Set your PAT here
os.environ['CLARIFAI_PAT'] = ''
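
A token typed into a notebook cell is easy to leak when the notebook is shared. As a minimal sketch, assuming you would rather be prompted when the variable is not already set, the helper below reads CLARIFAI_PAT and falls back to a hidden prompt. `ensure_pat` is a hypothetical helper written for this example and is not part of clarifai-pyspark.

```python
import os
from getpass import getpass


def ensure_pat(env_var="CLARIFAI_PAT", prompt=getpass):
    """Return the PAT from the environment, prompting once if it is missing.

    Illustrative helper only; it is not part of clarifai-pyspark.
    """
    pat = os.environ.get(env_var, "").strip()
    if not pat:
        # Nothing usable in the environment: ask for the token without echoing it.
        pat = prompt("Clarifai PAT: ").strip()
        if not pat:
            raise ValueError(f"{env_var} is empty; provide a valid PAT")
        os.environ[env_var] = pat
    return pat
```

Calling `ensure_pat()` before creating the client leaves the token in `os.environ` without storing it in the notebook source.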

### Clarifai-PySpark Interface

Create a ClarifaiPySpark client object to connect to your App on Clarifai.

In [None]:
cspark_obj = ClarifaiPySpark(user_id='user_id', app_id='app_id')

Specify the dataset in your App that you want to work with.  
This creates a new dataset in the App if it doesn't already exist.

In [None]:
dataset_obj = cspark_obj.dataset(dataset_id='dataset_id')

# 1. Ingest Dataset from Databricks Volume to Clarifai App

### a. Upload from Volume folder   
If your dataset images or text files are stored in a Databricks volume, you can upload them directly from the volume to your Clarifai App.  
Make sure the folder contains only image or text files. If the folder name serves as the label for every file in it, set the labels parameter to True.

In [None]:
dataset_obj.upload_dataset_from_folder(folder_path='volume_folder_path',
                                       input_type='image',  # 'image' or 'text'
                                       labels=True)  # True: use the folder name as the label
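
Because the folder must contain only image or text files, it can help to scan it before uploading. The sketch below is one way to do that with the standard library. `unexpected_files` is a hypothetical helper, and the extension sets are assumptions to adjust, since the accepted file types are not listed here.

```python
from pathlib import Path

# Assumed extension sets; adjust them to the file types you actually upload.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tiff", ".webp"}
TEXT_EXTS = {".txt"}


def unexpected_files(folder_path, input_type):
    """Return names of entries that are not plain files of the expected type."""
    if input_type not in ("image", "text"):
        raise ValueError("input_type must be 'image' or 'text'")
    allowed = IMAGE_EXTS if input_type == "image" else TEXT_EXTS
    return sorted(
        entry.name
        for entry in Path(folder_path).iterdir()
        # Flag sub-folders and any file whose extension is not allowed.
        if not entry.is_file() or entry.suffix.lower() not in allowed
    )
```

An empty list from `unexpected_files('volume_folder_path', 'image')` suggests the folder is ready for upload.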

### b. Upload from CSV
You can populate the dataset from a CSV file, which must include the columns 'inputid' and 'input'.  
The optional columns 'concepts', 'metadata', and 'geopoints' are also supported.  
The 'input' column can contain a file URL, a file path, or raw text; set 'csv_type' to match. If the 'concepts' column exists in the CSV, set 'labels=True'.  
You can also use a CSV file stored in your AWS S3 bucket by setting the 'source' parameter to 's3'.


In [None]:
dataset_obj.upload_dataset_from_csv(csv_path='volume_csv_path',  # or an S3 CSV path
                                    input_type='text',  # 'text' or 'image'
                                    labels=True,  # True if the CSV has a 'concepts' column
                                    csv_type='raw',  # 'raw', 'url', or 'filepath'
                                    source='volume')  # 'volume' or 's3'
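
To make the expected CSV layout concrete, here is a minimal sketch that writes such a file with Python's `csv` module. `write_upload_csv` is a hypothetical helper and the sample rows in the usage note are made up. Each 'concepts' cell holds a single label here, because the encoding of multiple concepts in one cell is not covered in this guide.

```python
import csv

REQUIRED = ("inputid", "input")
OPTIONAL = ("concepts", "metadata", "geopoints")


def write_upload_csv(csv_path, rows):
    """Write dict rows to a CSV with the required columns first.

    Returns the header that was written.
    """
    # Validate everything before touching the file.
    for row in rows:
        missing = [col for col in REQUIRED if not row.get(col)]
        if missing:
            raise ValueError(f"row is missing required columns: {missing}")
    # Include an optional column only if at least one row uses it.
    fieldnames = list(REQUIRED) + [
        col for col in OPTIONAL if any(col in row for row in rows)
    ]
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

For example, `write_upload_csv('reviews.csv', [{'inputid': '1', 'input': 'Great product', 'concepts': 'positive'}])` produces a file with the header `inputid,input,concepts`, which would be uploaded with 'labels=True' and 'csv_type' set to 'raw'.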


### c. Upload from Delta table
You can use a Delta table to populate a dataset in your App.  
The table must include the columns 'inputid' and 'input'.  
The optional columns 'concepts', 'metadata', and 'geopoints' are also supported.  
The 'input' column can contain a file URL, a file path, or raw text.  
If the 'concepts' column is present in the table, set the 'labels' parameter to True.  
You can also use a Delta table stored in your AWS S3 bucket by providing its S3 path.


In [None]:
dataset_obj.upload_dataset_from_table(table_path='volume_table_path',  # or an S3 table path
                                      input_type='text',  # 'text' or 'image'
                                      labels=True,  # True if the table has a 'concepts' column
                                      csv_type='raw')  # 'raw', 'url', or 'filepath'

### d. Upload from Dataframe
You can upload a dataset from a dataframe, which must include the columns 'inputid' and 'input'.  
The optional columns 'concepts', 'metadata', and 'geopoints' are also supported.  
The 'input' column can contain a file URL, a file path, or raw text.  
If the dataframe contains the 'concepts' column, set 'labels=True'.


In [None]:
dataset_obj.upload_dataset_from_table(dataframe=spark_dataframe,
                                      input_type='text',  # 'text' or 'image'
                                      labels=True,  # True if the dataframe has a 'concepts' column
                                      csv_type='raw')  # 'raw', 'url', or 'filepath'
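
The column rules above apply to tables and dataframes alike, so they can be checked from the column names alone before uploading. A minimal sketch follows. `check_columns` is a hypothetical helper that takes a plain list of names, such as the `columns` attribute of a Spark dataframe.

```python
REQUIRED_COLUMNS = {"inputid", "input"}
OPTIONAL_COLUMNS = {"concepts", "metadata", "geopoints"}


def check_columns(columns):
    """Validate column names and suggest a value for the labels parameter.

    Returns (labels, extras): labels is True when 'concepts' is present,
    extras lists columns outside the required and optional sets.
    """
    cols = set(columns)
    missing = REQUIRED_COLUMNS - cols
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    labels = "concepts" in cols
    extras = sorted(cols - REQUIRED_COLUMNS - OPTIONAL_COLUMNS)
    return labels, extras
```

With a Spark dataframe, `labels, extras = check_columns(spark_dataframe.columns)` gives a value to pass as 'labels' and a list of columns this guide does not describe.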

### e. Upload with custom dataloader
If your dataset is stored in another format or requires preprocessing, you can supply a custom dataloader.  
Several dataloader examples are available for reference [here](https://github.com/Clarifai/examples/tree/main/datasets/upload).  
The files and folders required by the dataloader should be stored in Databricks volume storage.

In [None]:
dataset_obj.upload_dataset_from_dataloader(task="visual-classification", 
                                           split="train", 
                                           module_dir="volume_module_path")

# 2. Fetching Dataset Information from Clarifai App

### a. Retrieve data file details in JSON format
To get information about the data files in your Clarifai App's dataset, use the following function, which returns a JSON response.  
You can use the 'input_type' parameter to retrieve details for a specific type of data file, such as 'image', 'video', 'audio', or 'text'.
 

In [None]:
inputs_response = list(dataset_obj.list_inputs())

### b. Retrieve data file details as a dataframe
You can also obtain input details as a structured dataframe with columns such as 'input_id', 'image_url/text_url', 'image_info/text_info', 'input_created_at', and 'input_modified_at'.  
Be sure to specify the 'input_type' when using this function.  
Please note that the JSON response might include additional attributes.

In [None]:
inputs_df = dataset_obj.export_inputs_to_dataframe(input_type="image")  # or "text"

### c. Download image/text files from Clarifai App to Databricks Volume
With these functions, you can download the image or text files from your Clarifai App's dataset directly to your Databricks volume.  
Specify the destination path in the volume and pass the response obtained from list_inputs() as the input_response parameter.

In [None]:
# For images
dataset_obj.export_images_to_volume(path="destination_volume_path",
                                    input_response=inputs_response)

# For text
dataset_obj.export_text_to_volume(path="destination_volume_path",
                                  input_response=inputs_response)

# 3. Fetching Annotations from Clarifai App

### a. Retrieve annotation details in JSON format
To obtain the annotations in your Clarifai App's dataset, use the following function, which returns a JSON response.  
You can optionally pass a list of input IDs to limit the annotations to those inputs.


In [None]:
annotations_response = list(dataset_obj.list_annotations(input_ids=None))

### b. Retrieve annotation details as a dataframe
You can also obtain annotations as a structured dataframe with columns such as 'annotation_id', 'annotation', 'annotation_user_id', 'input_id', 'annotation_created_at', and 'annotation_modified_at'.  
If necessary, you can pass a list of input IDs to limit the annotations to those inputs.  
Please note that the JSON response may contain additional attributes.


In [None]:
annotations_df = dataset_obj.export_annotations_to_dataframe(input_ids=None)

### c. Acquire inputs with their associated annotations in a dataframe
You can retrieve input details and their corresponding annotations together using the following function.  
It returns a dataframe that combines the data from the annotations and inputs dataframes described in the sections above.

In [None]:
dataset_df = dataset_obj.export_dataset_to_dataframe()
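
To illustrate what combining the two dataframes means, the sketch below pairs annotation rows with input rows that share an 'input_id', using plain Python dictionaries. It is a conceptual illustration only. `join_on_input_id` is hypothetical, and the assumption that unmatched annotations are dropped (an inner join) is ours, not a documented behavior of export_dataset_to_dataframe().

```python
def join_on_input_id(inputs, annotations):
    """Pair each annotation row with its input row via 'input_id'.

    Plain-Python illustration; assumes inner-join behavior, which may
    differ from what clarifai-pyspark does internally.
    """
    inputs_by_id = {row["input_id"]: row for row in inputs}
    joined = []
    for annotation in annotations:
        input_row = inputs_by_id.get(annotation["input_id"])
        if input_row is not None:
            # Merge the columns of both rows into one combined row.
            joined.append({**input_row, **annotation})
    return joined
```

An input with two annotations therefore appears in two combined rows, one per annotation.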