# Unity Catalog Configuration

## Overview
This notebook sets up the data infrastructure for a GenAI demo by:
1. Importing Unity Catalog configuration settings
2. Creating a volume for raw data storage
3. Loading PDF files from a GitHub repository into the volume

## Use Case
This is particularly useful for:
- RAG (Retrieval Augmented Generation) applications
- Document management systems
- Knowledge base creation for AI agents

In [0]:
%run ../_config/config_unity_catalog

## Create a Volume

### What is a Volume?
A Unity Catalog volume is a governed storage location for files of any format (non-tabular data), such as PDFs, CSVs, and images.

### Purpose
In this volume, we will load CSV and PDF files before creating Delta tables.

### Benefits
- Governed access control through Unity Catalog
- Centralized file management
- Version control and lineage tracking
- File access through `/Volumes/...` paths with Databricks utilities such as `dbutils.fs`

In [0]:
%sql
-- Create a Unity Catalog volume named 'raw_data'
-- IF NOT EXISTS: Prevents errors if the volume already exists
-- This volume will be created in the current catalog.schema namespace
-- Location: /Volumes/{catalog}/{schema}/raw_data/
CREATE VOLUME IF NOT EXISTS raw_data;
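
The same statement can also be issued from Python when the catalog and schema should be named explicitly instead of relying on the current namespace. The sketch below only builds the SQL string and the matching volume path. The helper names `create_volume_sql` and `volume_path` are hypothetical, and the sample values `main` and `genai_demo` are placeholders, not values taken from the config notebook.

```python
def create_volume_sql(catalog: str, schema: str, volume: str = "raw_data") -> str:
    """Build a fully qualified CREATE VOLUME statement (hypothetical helper)."""
    # Backticks guard against names that contain hyphens or other special characters
    return f"CREATE VOLUME IF NOT EXISTS `{catalog}`.`{schema}`.`{volume}`"


def volume_path(catalog: str, schema: str, volume: str = "raw_data") -> str:
    """Return the file path under which the volume's contents are exposed."""
    return f"/Volumes/{catalog}/{schema}/{volume}"


# Placeholder names; in the notebook these would come from the config notebook
statement = create_volume_sql("main", "genai_demo")
path = volume_path("main", "genai_demo")
print(statement)
print(path)

# In a Databricks notebook the statement would then be executed with:
# spark.sql(statement)
```

Qualifying the name makes the cell independent of whichever catalog and schema happen to be active in the session.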

## Load PDF Files

### Source
Load PDF files into the raw_data volume from our GitHub repository.

### Files Included
- **product_catalog.pdf**: Product information and specifications
- **return_policy.pdf**: Return and refund guidelines
- **shipping_guide.pdf**: Shipping information and procedures
- **technical_faq.pdf**: Technical frequently asked questions

### Process
This step uses `dbutils.fs.cp()` to copy each file from GitHub into the `pdf/` folder of the Unity Catalog volume.
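
Should copying directly from an HTTPS URL not be available in a given workspace, one alternative is to download the files with the Python standard library and write them to the volume path, which is exposed as a regular file path. This is a minimal sketch, not the notebook's method: `download_to_dir` is a hypothetical helper, and it assumes the destination directory is writable through standard file APIs.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_dir(base_url: str, names: list[str], dest_dir: str) -> list[Path]:
    """Download `<base_url><name>.pdf` for each name into dest_dir (hypothetical helper)."""
    target = Path(dest_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the pdf/ folder if it is missing
    written = []
    for name in names:
        out_path = target / f"{name}.pdf"
        # urlopen follows HTTP redirects; the response body is streamed straight to disk
        with urllib.request.urlopen(f"{base_url}{name}.pdf") as resp, open(out_path, "wb") as fh:
            shutil.copyfileobj(resp, fh)
        written.append(out_path)
    return written


# In the notebook this could be called as (names taken from the loading cell):
# download_to_dir(prefix + "data/pdf/", l_pdf_names, f"/Volumes/{catalog}/{schema}/raw_data/pdf")
```

Streaming with `shutil.copyfileobj` avoids holding a whole PDF in memory, which matters little for four small files but keeps the helper usable for larger ones.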

In [0]:
# Define the source location (GitHub raw content URL)
# This points to the main branch of the demo repository
prefix = "https://github.com/O-Faraday/databricks_genai_demo/raw/main/"

# Define the destination path in the Unity Catalog volume
# Uses the catalog and schema variables imported from the config notebook
# Format: /Volumes/{catalog}/{schema}/raw_data/
path_volume = f"/Volumes/{catalog}/{schema}/raw_data/"

# List of PDF file names to download
# These files contain knowledge base information for the GenAI agent
l_pdf_names = [
    "product_catalog",    # Product specifications and details
    "return_policy",      # Return and refund procedures
    "shipping_guide",     # Shipping methods and timelines
    "technical_faq",      # Technical support and troubleshooting
]

# Loop through each PDF file and copy it to the volume
for pdf_name in l_pdf_names:
    # Construct source URL: repository URL + data/pdf/ + filename.pdf
    source_path = f"{prefix}data/pdf/{pdf_name}.pdf"
    
    # Construct destination path: volume path + pdf/ + filename.pdf
    # path_volume already ends with "/", so no extra separator is added here
    destination_path = f"{path_volume}pdf/{pdf_name}.pdf"
    
    # Copy the file using Databricks File System utilities
    # dbutils.fs.cp(): Copy files between different file systems
    # Supports: DBFS, Unity Catalog Volumes, S3, ADLS, etc.
    dbutils.fs.cp(
        source_path,
        destination_path
    )
    
    # Optional: Print confirmation message (uncomment if desired)
    # print(f"✓ Copied {pdf_name}.pdf to {destination_path}")

# All PDF files are now available in the Unity Catalog volume
# Path: /Volumes/{catalog}/{schema}/raw_data/pdf/
# These files can now be processed for RAG applications
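
After the copy, it can be worth confirming that every expected file actually arrived before building anything on top of it. The sketch below checks a directory for missing or empty PDFs using only the standard library. `missing_pdfs` is a hypothetical helper, and it assumes the volume is readable as a regular file path.

```python
from pathlib import Path


def missing_pdfs(pdf_dir: str, names: list[str]) -> list[str]:
    """Return the names whose `<name>.pdf` is absent or empty in pdf_dir (hypothetical helper)."""
    folder = Path(pdf_dir)
    missing = []
    for name in names:
        candidate = folder / f"{name}.pdf"
        # A zero-byte file usually indicates a failed or interrupted copy
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing


# In the notebook this could be called as:
# missing = missing_pdfs(f"/Volumes/{catalog}/{schema}/raw_data/pdf", l_pdf_names)
# if missing:
#     raise FileNotFoundError(f"PDFs not found in volume: {missing}")
```

Failing early here gives a clearer error than a later parsing or indexing step that silently processes fewer documents than expected.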