# Music Flow Verification

This notebook checks that the Apache Polaris setup succeeded and that we can work with the catalog, for example by creating namespaces and tables.

## Imports

In [1]:
import os
from dotenv import load_dotenv


# load_dotenv returns True when at least one variable was set from the file
if load_dotenv("../.env", interpolate=True):
    print("Loaded environment variables from .env file")

Loaded environment variables from .env file


### Define Variables
Let us define some variables for use across the notebook.

In [5]:
client_id = os.getenv("OPENCATALOG_CLIENT_ID")
client_secret = os.getenv("OPENCATALOG_CLIENT_SECRET")
# Catalog details
namespace = "events"
table_name = "music_events"
# Single quotes inside the f-string keep this valid on Python versions before 3.12
CATALOG_URI = f"{os.getenv('OPENCATALOG_API_URL')}/polaris"
catalog_name = os.getenv("OPENCATALOG_CATALOG_NAME")

# Print configuration
print(f"Connecting to catalog: {catalog_name}")
print(f"API endpoint: {CATALOG_URI}")
print(f"Working with namespace: {namespace}")
print(f"Working with table: {table_name}")

Connecting to catalog: music_flow
API endpoint: https://npb00565.snowflakecomputing.com/polaris
Working with namespace: events
Working with table: music_events
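
`os.getenv` returns `None` for an unset variable, so a missing entry in `.env` would only surface later as a confusing connection error. Below is a minimal sketch of an early check, assuming the four variable names used in the cell above. The helper `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variable names taken from the configuration cell above
REQUIRED_VARS = [
    "OPENCATALOG_CLIENT_ID",
    "OPENCATALOG_CLIENT_SECRET",
    "OPENCATALOG_API_URL",
    "OPENCATALOG_CATALOG_NAME",
]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv` makes it clear whether a later failure comes from the configuration or from the catalog itself.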


## Working with Catalog
Let us retrieve the catalog `music_flow` that we created earlier.

In [8]:
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name=catalog_name,
    **{
        "uri": f"{CATALOG_URI}/api/catalog",
        "credential": f"{client_id}:{client_secret}",
        "header.content-type": "application/vnd.api+json",
        "header.X-Iceberg-Access-Delegation": "vended-credentials",
        "warehouse": catalog_name,
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)
print("\nCatalog connection established.\nListing Namespaces:\n")
namespaces = catalog.list_namespaces()
for ns in namespaces:
    print(f" - {ns}")


Catalog connection established.
Listing Namespaces:
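
The listing above is empty because the catalog has no namespaces yet. Once namespaces exist, `catalog.list_namespaces()` returns each one as a tuple of strings, so printing `ns` directly shows tuple syntax such as `('events',)`. A small sketch of formatting them as dotted names follows. The helper `format_namespace` and the sample identifiers are made up for illustration and do not come from the catalog.

```python
def format_namespace(identifier):
    """Join a namespace identifier tuple such as ('a', 'b') into 'a.b'."""
    return ".".join(identifier)


# Made-up identifiers standing in for the result of catalog.list_namespaces()
sample_namespaces = [("events",), ("wildlife", "birds")]
for ns in sample_namespaces:
    print(f" - {format_namespace(ns)}")
```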



## Verify Snowflake Integration

## Verify Setup
Let us verify the setup by creating a namespace and a table. 
> NOTE: This will not be used for the music flow demo; it is only there to test the setup.

In [10]:
import traceback

import pandas as pd
import pyarrow as pa
from pyiceberg.exceptions import NamespaceAlreadyExistsError, TableAlreadyExistsError

try:
    penguins_df = pd.read_csv(
        "https://raw.githubusercontent.com/dataprofessor/data/refs/heads/master/penguins_cleaned.csv",
    )
    penguins_table = pa.Table.from_pandas(penguins_df)

    # Create the namespace, tolerating one left over from an earlier run
    try:
        catalog.create_namespace("wildlife")
    except NamespaceAlreadyExistsError as e:
        print(f"Namespace 'wildlife' already exists: {e}")

    # Create the table and load the data; if the table already exists,
    # load it instead so that p_tbl is defined for the query cell below
    try:
        p_tbl = catalog.create_table(
            identifier="wildlife.penguins",
            schema=penguins_table.schema,
        )
        p_tbl.append(penguins_table)
    except TableAlreadyExistsError as e:
        print(f"Table 'penguins' already exists: {e}")
        p_tbl = catalog.load_table("wildlife.penguins")
except Exception:
    traceback.print_exc()

### Query the data

In [11]:
p_tbl.scan().to_pandas().head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181,3750,male
1,Adelie,Torgersen,39.5,17.4,186,3800,female
2,Adelie,Torgersen,40.3,18.0,195,3250,female
3,Adelie,Torgersen,36.7,19.3,193,3450,female
4,Adelie,Torgersen,39.3,20.6,190,3650,male


## Cleanup
Drop the test table and then its namespace.


In [12]:
catalog.drop_table("wildlife.penguins")
catalog.drop_namespace("wildlife")

## Music Flow Demo

In [None]:
table = catalog.load_table(f"{namespace}.{table_name}".upper())
print(table)

### View Data


### All Data


In [None]:
import pandas as pd

df = pd.DataFrame()  # fallback so the cell still renders if the scan fails
try:
    # Read every row of the table through PyIceberg
    df = table.scan().to_pandas()
except Exception as e:
    print(f"Error loading table: {e}")
df


### Evolved Data


In [None]:
import pandas as pd

baseline_columns = ['artist_name', 'vip_price', 'genre', 'sponsor']
df_mixed = pd.DataFrame()  # fallback so the final line works if the scan fails
try:
    # Query mixed data through PyIceberg
    df_mixed = table.scan().to_pandas()

    # Label each row by whether the genre column is populated
    df_mixed['data_version'] = df_mixed['genre'].apply(
        lambda x: 'Enhanced Data' if pd.notna(x) else 'Legacy Data'
    )
except Exception as e:
    print(f"Error loading table: {e}")
# Display the baseline columns together with the version label
df_mixed.reindex(columns=baseline_columns + ['data_version']).head()


### Data Statistics

In [None]:
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}, Timestamp: {snapshot.timestamp_ms}")
    
print(f"Total Records Loaded: {len(df)}")
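
The `timestamp_ms` value printed for each snapshot is a count of milliseconds since the Unix epoch, which is hard to read at a glance. A minimal sketch of converting it to a UTC datetime, using only the standard library, is shown here. The helper name `snapshot_time` and the sample value are illustrative and not taken from the table.

```python
from datetime import datetime, timezone


def snapshot_time(timestamp_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


# Illustrative value only; a real one would come from snapshot.timestamp_ms
print(snapshot_time(1_700_000_000_000).isoformat())
```

Inside the loop above, `snapshot_time(snapshot.timestamp_ms)` could replace the raw value in the printed line.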