# Accessing Fabric OneLake from Azure ML Notebook

This notebook is designed to run **inside Azure ML compute instances/clusters**.

**Prerequisites:**
- Running on Azure ML Compute Instance or Compute Cluster
- OneLake datastore already registered

---

## 1. Install Required Packages

In [None]:
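# Install/upgrade required packages (uncomment on first run)
# azureml-fsspec lets pandas open azureml:// URIs; azure-storage-file-datalake is used in section 7
#%pip install --upgrade azure-ai-ml azure-identity azureml-fsspec azure-storage-file-datalake pandas pyarrow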

## 2. Configuration

Set your workspace details and datastore name.

In [None]:
# Azure ML Workspace Configuration
SUBSCRIPTION_ID = "<your-subscription-id>"
RESOURCE_GROUP = "<your-resource-group>"
WORKSPACE_NAME = "<your-workspace-name>"

# OneLake Datastore Name (registered via register-with-cli.ps1)
DATASTORE_NAME = "onelakesp_datastore"

print(f"Workspace: {WORKSPACE_NAME}")
print(f"Datastore: {DATASTORE_NAME}")
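
Before connecting, it can help to confirm that no placeholder was left unfilled. The sketch below is a hypothetical helper (`find_unset_placeholders` is not part of any Azure SDK) and assumes placeholders follow the `<your-...>` pattern used in this notebook; the example values are invented.

```python
def find_unset_placeholders(settings: dict) -> list:
    """Return the names of settings that are empty or still look like '<...>'."""
    unset = []
    for name, value in settings.items():
        text = str(value).strip()
        if not text or (text.startswith("<") and text.endswith(">")):
            unset.append(name)
    return unset


# Example values only; in the notebook you would pass the real variables.
example_settings = {
    "SUBSCRIPTION_ID": "<your-subscription-id>",
    "RESOURCE_GROUP": "rg-demo",
    "WORKSPACE_NAME": "",
    "DATASTORE_NAME": "onelakesp_datastore",
}

missing = find_unset_placeholders(example_settings)
print(f"Settings still to fill in: {missing}")
```

In the notebook itself, the dictionary could be built from `SUBSCRIPTION_ID`, `RESOURCE_GROUP`, `WORKSPACE_NAME`, and `DATASTORE_NAME`.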

## 3. Connect Using Managed Identity (Azure ML Compute)

On Azure ML compute, the notebook authenticates with the **compute's managed identity**, so no interactive browser login is needed.

### Authentication Options

This notebook supports multiple authentication methods:

1. **Managed Identity** - Works automatically on Azure ML compute
2. **Azure CLI** - If you've run `az login` locally
3. **Environment Variables** - Service principal via env vars

**If running locally**, choose one of these options:

#### Option A: Install Azure CLI and Login (Recommended for Local Development)

In [None]:
# Option A: Use Azure CLI authentication
# 1. Install Azure CLI: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
# 2. Run in terminal: az login
# 3. Then run the cells below

# Uncomment to check if Azure CLI is installed:
# !az --version

#### Option B: Use Service Principal via Environment Variables

In [None]:
# Option B: Use Service Principal from your .env file
# Set these environment variables (from your azml_onelakesp_datastore.yml):

import os

# Uncomment and set these values from your service principal.
# They are commented out so that placeholders never overwrite real values already set.
# os.environ["AZURE_TENANT_ID"] = "<your-tenant-id>"
# os.environ["AZURE_CLIENT_ID"] = "<your-client-id>"
# os.environ["AZURE_CLIENT_SECRET"] = "<your-client-secret>"  # Get from Key Vault or secure store

---

### Create the Credential Chain

In [None]:
# Create Credential Chain
from azure.ai.ml import MLClient
from azure.identity import (
    ManagedIdentityCredential,
    AzureCliCredential,
    ChainedTokenCredential,
    EnvironmentCredential
)
import os

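# Use credential chain with multiple fallbacks
# Priority: Managed Identity -> Azure CLI -> Environment Variables
# Note: building the chain does not authenticate. Each credential is tried
# in order the first time a token is requested (for example by MLClient).
try:
    print("Building credential chain...\n")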
    
    credential = ChainedTokenCredential(
        ManagedIdentityCredential(),  # Works on Azure ML compute
        AzureCliCredential(),  # Works if 'az login' has been run
        EnvironmentCredential()  # Uses env vars (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID)
    )
    
    print("Credential chain created")
    print("Will try: Managed Identity -> Azure CLI -> Environment Vars")
    
except Exception as e:
    print(f"Error creating credential: {str(e)}")
    raise

In [None]:
# Connect to Azure ML workspace
try:
    ml_client = MLClient(
        credential=credential,
        subscription_id=SUBSCRIPTION_ID,
        resource_group_name=RESOURCE_GROUP,
        workspace_name=WORKSPACE_NAME
    )
    
    # Test connection
    workspace = ml_client.workspaces.get(WORKSPACE_NAME)
    print(f"Connected to workspace: {workspace.name}")
    print(f"Location: {workspace.location}")
    
except Exception as e:
    print(f"Connection failed: {str(e)}")
    print("\nTroubleshooting:")
    print("1. Make sure WORKSPACE_NAME is correct")
    print("2. Verify compute has permissions to access the workspace")
    print("3. Check subscription ID and resource group")

## 4. List All Available Datastores
See what datastores are registered in your workspace.

In [None]:
# List all datastores
try:
    print("Available Datastores:\n")
    
    datastores = ml_client.datastores.list()
    
    for ds in datastores:
        print(f"  - {ds.name} ({ds.type})")
    
    print(f"\nLooking for: '{DATASTORE_NAME}'")
    
except Exception as e:
    print(f"Could not list datastores: {str(e)}")
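
The listing cell prints the names but leaves the comparison to the reader. As a sketch, the hypothetical helper below checks whether the wanted name is present and, if not, suggests close matches using the standard library's `difflib`. The example names are invented.

```python
import difflib


def check_datastore_name(wanted, available_names):
    """Return (found, suggestions) for a datastore name."""
    names = list(available_names)
    if wanted in names:
        return True, []
    # Suggest up to three similar names to help spot typos.
    return False, difflib.get_close_matches(wanted, names, n=3, cutoff=0.6)


example_names = ["workspaceblobstore", "workspacefilestore", "onelake_sp_datastore"]
found, suggestions = check_datastore_name("onelakesp_datastore", example_names)
print(f"Found: {found}, close matches: {suggestions}")
```

With a live client, the names could come from `[ds.name for ds in ml_client.datastores.list()]`.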

## 5. Verify OneLake Datastore Registration
Check if the datastore was registered correctly.

In [None]:
# Get the OneLake datastore
try:
    datastore = ml_client.datastores.get(DATASTORE_NAME)
    
    print("OneLake Datastore Found!\n")
    print(f"Name: {datastore.name}")
    print(f"Type: {datastore.type}")
    
    if hasattr(datastore, 'description'):
        print(f"Description: {datastore.description}")
    
    print("\nDatastore is ready to use!")
    
except Exception as e:
    print(f"Datastore '{DATASTORE_NAME}' not found!")
    print(f"\nError: {str(e)}")
    print("\nSOLUTION: Register the datastore first:")
    print(f"   .\\register-with-cli.ps1 -s '{SUBSCRIPTION_ID}' -g '{RESOURCE_GROUP}' -w '{WORKSPACE_NAME}'")

---

## 6. Access OneLake Data Using Datastore URI

Once the datastore is verified, you can access files directly.

In [None]:
import pandas as pd

# Construct FULL datastore URI (required for Azure ML Compute Clusters)
# Format: azureml://subscriptions/{sub}/resourcegroups/{rg}/workspaces/{ws}/datastores/{datastore}/paths/{path}

# Example: Read a CSV file
# NOTE: OneLake datastore paths do NOT include "Files/" prefix
# Use just the folder/filename relative to the Files folder
file_path = "RawData/your-file.csv"  # Replace with your file path (no "Files/" prefix)

# FULL URI format (required for Compute Clusters)
datastore_uri = f"azureml://subscriptions/{SUBSCRIPTION_ID}/resourcegroups/{RESOURCE_GROUP}/workspaces/{WORKSPACE_NAME}/datastores/{DATASTORE_NAME}/paths/{file_path}"

print(f"Datastore URI:\n{datastore_uri}")
print(f"\nPath format: Use folder/file.csv (without 'Files/' prefix)")
print("Using FULL URI format (required for Compute Clusters)")

### Read CSV from OneLake

In [None]:
# Read CSV file from OneLake, falling back to other encodings if UTF-8 decoding fails
try:
    # Try UTF-8 first (default)
    df = pd.read_csv(datastore_uri)
    print(f"Data loaded successfully!")
    print(f"Shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    print(f"\nFirst 5 rows:")
    display(df.head())

except UnicodeDecodeError as e:
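    # UTF-8 failed, try alternative encodings.
    # latin1 (an alias of iso-8859-1) accepts any byte sequence, so it must come last;
    # otherwise the encodings after it are never tried.
    print(f"UTF-8 encoding failed, trying alternative encodings...")
    encodings_to_try = ["cp1252", "utf-16", "latin1"]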
    for encoding in encodings_to_try:
        try:
            print(f"Trying {encoding}...")
            df = pd.read_csv(datastore_uri, encoding=encoding)
            print(f"Data loaded successfully with {encoding} encoding!")
            print(f"Shape: {df.shape}")
            print(f"Columns: {list(df.columns)}")
            print(f"\nFirst 5 rows:")
            display(df.head())
            break
        except Exception as enc_error:
            print(f"{encoding} failed")
            continue
    else:
        print(f"\nCould not decode file with any common encoding")
        print(f"Try specifying encoding manually:")
        print(f"   df = pd.read_csv(datastore_uri, encoding='latin1')")

except FileNotFoundError:
    print(f"File not found: {file_path}")
    print("\nCheck the file path in your OneLake lakehouse")
    print("Remember: Don't include 'Files/' prefix in the path")
    print(f"Current path: '{file_path}'")

except Exception as e:
    print(f"Error reading file: {str(e)}")
    print(f"\nMake sure:")
    print(f"1. Datastore '{DATASTORE_NAME}' is registered")
    print(f"2. File path is correct (no 'Files/' prefix): '{file_path}'")
    print(f"3. Service principal has read access to the lakehouse")
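
The order of fallback encodings matters: `latin1` maps every byte to a character, so it never raises and anything listed after it is unreachable. The sketch below shows this on raw bytes using only the standard library. `decode_with_fallback` is a hypothetical helper and the sample data is invented.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252", "latin1")):
    """Return (text, encoding) for the first encoding that decodes `raw`."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


sample = "café,müsli\n".encode("cp1252")  # these bytes are not valid UTF-8
text, used = decode_with_fallback(sample)
print(f"Decoded with {used}: {text!r}")

# Byte 0x81 is undefined in cp1252, so only the latin1 catch-all accepts it.
_, last_resort = decode_with_fallback(b"\x81")
print(f"Byte 0x81 decoded with {last_resort}")
```

A successful decode does not prove the encoding is right, so it is worth inspecting a few rows with accented characters after loading.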

### Handle Different File Encodings

If you know your file's encoding, specify it directly:

In [None]:
# Example: Read file with specific encoding
file_with_special_chars = "RawData/YourFile.csv"

# FULL URI format (required for Compute Clusters)
file_uri = f"azureml://subscriptions/{SUBSCRIPTION_ID}/resourcegroups/{RESOURCE_GROUP}/workspaces/{WORKSPACE_NAME}/datastores/{DATASTORE_NAME}/paths/{file_with_special_chars}"

# Common encodings for different regions:
# - 'latin1' or 'iso-8859-1': Western European (French, Spanish, German)
# - 'cp1252': Windows Western European
# - 'utf-8': Modern standard (international)
# - 'utf-16': Some Windows exports

try:
    # Option 1: Specify encoding directly
    df = pd.read_csv(file_uri, encoding='latin1')  # Decodes any byte sequence; try 'cp1252' for typical Windows exports
    
    print(f"File loaded with latin1 encoding")
    print(f"Shape: {df.shape}")
    display(df.head())
    
except Exception as e:
    print(f"Error: {str(e)}")
    print("\nTry these encodings:")
    print("   - encoding='latin1' (Western European)")
    print("   - encoding='cp1252' (Windows)")
    print("   - encoding='utf-16' (Some exports)")
    print("   - encoding='iso-8859-1' (Latin-1)")

### Read Parquet from OneLake

In [None]:
# Example: Read Parquet file
parquet_path = "your-folder/your-data.parquet"  # Replace with your file path (no "Files/" prefix)

# FULL URI format (required for Compute Clusters)
parquet_uri = f"azureml://subscriptions/{SUBSCRIPTION_ID}/resourcegroups/{RESOURCE_GROUP}/workspaces/{WORKSPACE_NAME}/datastores/{DATASTORE_NAME}/paths/{parquet_path}"

try:
    df_parquet = pd.read_parquet(parquet_uri)
    print(f"Parquet file loaded successfully!")
    print(f"Shape: {df_parquet.shape}")
    display(df_parquet.head())
    
except Exception as e:
    print(f"Could not read parquet file: {str(e)}")

---

## 7. Write Data to OneLake

Save processed data back to your OneLake lakehouse.

In [None]:
# Create sample data
sample_data = pd.DataFrame({
    'id': range(1, 11),
    'value': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'score': [85.5, 90.2, 78.9, 92.1, 88.7, 95.3, 82.4, 91.6, 87.8, 93.2]
})

print("Sample data created:")
display(sample_data)

### Option A: Write Using DataLakeServiceClient (Recommended)

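This approach uploads bytes directly to OneLake using the Azure Data Lake SDK (the `azure-storage-file-datalake` package). It works with managed identity on Azure ML compute, provided that identity has write access to the lakehouse.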


In [None]:
# Option A: Upload using DataLakeServiceClient (recommended)
from azure.storage.filedatalake import DataLakeServiceClient

# Configure these for your OneLake environment
# Get these from your Fabric workspace
ONELAKE_WORKSPACE_ID = "<your-fabric-workspace-id>"  # From Fabric workspace URL
LAKEHOUSE_ID = "<your-lakehouse-id>"  # From lakehouse settings in Fabric

# Define output path (relative to Files/ folder)
output_file_path = "Processed/processed_data.csv"

print(f"Uploading data to OneLake...")
print(f"Target: Files/{output_file_path}\n")

try:
    # Create DataLake client for OneLake
    onelake_endpoint = "https://onelake.dfs.fabric.microsoft.com"
    
    service_client = DataLakeServiceClient(
        account_url=onelake_endpoint,
        credential=credential  # Uses the credential from authentication section above
    )
    
    print(f"Step 1: Connecting to OneLake")
    print(f"  Workspace: {ONELAKE_WORKSPACE_ID}")
    print(f"  Lakehouse: {LAKEHOUSE_ID}")
    
    # Get file client
    # OneLake path format: {workspace}/{lakehouse}/Files/{path}
    file_system_client = service_client.get_file_system_client(file_system=ONELAKE_WORKSPACE_ID)
    full_path = f"{LAKEHOUSE_ID}/Files/{output_file_path}"
    file_client = file_system_client.get_file_client(full_path)
    
    # Convert DataFrame to CSV bytes
    csv_bytes = sample_data.to_csv(index=False).encode("utf-8")
    
    print(f"\nStep 2: Uploading CSV ({len(csv_bytes)} bytes)")
    
    # Upload (overwrite if exists)
    file_client.upload_data(csv_bytes, overwrite=True)
    
    print(f"\n✓ SUCCESS: Data uploaded to OneLake!")
    print(f"  Total records: {len(sample_data)}")
    print(f"  Location: {ONELAKE_WORKSPACE_ID}/{LAKEHOUSE_ID}/Files/{output_file_path}")
    
    print(f"\nYou can access this file from:")
    print(f"  - Microsoft Fabric Lakehouse UI (Files/{output_file_path})")
    print(f"  - Other Azure ML notebooks")
    print(f"  - Azure ML jobs/pipelines")
    
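    # Optional: verify the upload by downloading the file and comparing sizes
    print(f"\nStep 3: Verifying upload...")
    downloaded = file_client.download_file().readall()
    if len(downloaded) == len(csv_bytes):
        print(f"✓ Verified - downloaded {len(downloaded)} bytes")
    else:
        print(f"✗ Size mismatch: uploaded {len(csv_bytes)} bytes, downloaded {len(downloaded)} bytes")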
    
except Exception as e:
    print(f"\n✗ Upload failed: {str(e)}")
    
    # Check for common configuration issues
    if "<your-" in ONELAKE_WORKSPACE_ID or "<your-" in LAKEHOUSE_ID:
        print("\nConfiguration needed:")
        print("  1. Set ONELAKE_WORKSPACE_ID (find in Fabric workspace URL)")
        print("  2. Set LAKEHOUSE_ID (find in lakehouse settings)")
    
    print("\nAdditional troubleshooting:")
    print("  - Ensure compute identity has WRITE permissions to the lakehouse")
    print("  - Check if OneLake Access Protection (OAP) is enabled")
    print("  - Verify the lakehouse exists and is accessible")
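
A common cause of upload failures is pasting a display name where an ID is expected. The sketch below assumes the workspace and lakehouse IDs copied from Fabric are GUIDs, as they appear in Fabric URLs. `looks_like_guid` is a hypothetical helper, and its result is a hint rather than proof that the item exists.

```python
import uuid


def looks_like_guid(value) -> bool:
    """Return True if `value` parses as a GUID/UUID string."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


# Example values only: a leftover placeholder and a well-formed (all-zero) GUID.
for candidate in ["<your-fabric-workspace-id>", "00000000-0000-0000-0000-000000000000"]:
    print(f"{candidate}: {looks_like_guid(candidate)}")
```

In the notebook, the check could be applied to `ONELAKE_WORKSPACE_ID` and `LAKEHOUSE_ID` before creating the client.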


### Option B: Write Using Azure ML Datastore Upload (Alternative)

This approach saves the DataFrame to a local temp file, then uploads it through the Datastore API of the Azure ML v1 SDK. Requires the `azureml-core` package.


In [None]:
# Option B: Save locally then upload via Azure ML Datastore
import tempfile
import os

# Ensure azureml-core is installed
try:
    from azureml.core import Workspace, Datastore
except ImportError:
    print("Installing azureml-core...")
    %pip install azureml-core
    from azureml.core import Workspace, Datastore

output_folder = "output"  # Folder inside datastore (no "Files/" prefix)

try:
    # Save DataFrame to a temp file (write through the open handle so this also works on Windows)
    with tempfile.NamedTemporaryFile(suffix=".csv", delete=False, mode='w', newline='') as tmp:
        local_path = tmp.name
        sample_data.to_csv(tmp, index=False)
    
    print(f"Saved temp file: {local_path}")
    
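    # Get the workspace. The v1 SDK expects an azureml.core.authentication object,
    # not an azure-identity credential, so use the compute's managed identity here.
    from azureml.core.authentication import MsiAuthentication
    ws = Workspace(
        subscription_id=SUBSCRIPTION_ID,
        resource_group=RESOURCE_GROUP,
        workspace_name=WORKSPACE_NAME,
        auth=MsiAuthentication()
    )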
    
    # Get the datastore
    ds = Datastore.get(ws, DATASTORE_NAME)
    
    # Upload to the datastore
    # target_path is relative to the datastore mapping (no "Files/" prefix)
    ds.upload_files(
        files=[local_path],
        target_path=output_folder,
        overwrite=True,
        show_progress=True
    )
    
    print(f"\nSUCCESS: Data uploaded via Datastore.upload_files!")
    print(f"Location: {output_folder}/")

    # Clean up temp file
    os.remove(local_path)
    print(f"\nSUCCESS: Temp file cleaned up")
    
except Exception as e:
    print(f"Upload failed: {str(e)}")
    print("\nTroubleshooting:")
    print("1. Ensure azureml-core is installed")
    print("2. Verify datastore name is correct")
    print("3. Check that the compute identity has write permissions")
    print("4. Make sure the auth object passed to Workspace() can authenticate")
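
In the cell above, the temp file is removed only when the upload succeeds, so a failed upload leaves it on disk. A minimal sketch of the same save-then-upload flow with cleanup in a `finally` block follows. It uses the standard library `csv` module in place of pandas so it runs anywhere, and `upload` is a hypothetical callable standing in for the real upload call.

```python
import csv
import os
import tempfile


def write_rows_to_temp_csv(rows, header):
    """Write rows to a closed temporary CSV file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def upload_with_cleanup(rows, header, upload):
    """Call upload(path) and delete the temp file whether or not it succeeds."""
    path = write_rows_to_temp_csv(rows, header)
    try:
        return upload(path)
    finally:
        os.remove(path)


# Hypothetical stand-in for the real upload call; it only reports the file size.
size = upload_with_cleanup([(1, "A"), (2, "B")], ["id", "value"], os.path.getsize)
print(f"Uploaded {size} bytes")
```

With pandas, `sample_data.to_csv(path, index=False)` could replace the `csv.writer` calls.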


---

## 8. Common File Operations

### Pattern 1: Load Multiple CSV Files

In [None]:
# Load multiple files from a folder
folder_path = "data"  # Your folder in OneLake (no "Files/" prefix)
file_names = ["file1.csv", "file2.csv", "file3.csv"]  # List your files

dfs = []
for file_name in file_names:
    try:
        # FULL URI format (required for Compute Clusters)
        file_uri = f"azureml://subscriptions/{SUBSCRIPTION_ID}/resourcegroups/{RESOURCE_GROUP}/workspaces/{WORKSPACE_NAME}/datastores/{DATASTORE_NAME}/paths/{folder_path}/{file_name}"
        df_temp = pd.read_csv(file_uri)
        dfs.append(df_temp)
        print(f"Loaded: {file_name} ({df_temp.shape[0]} rows)")
    except Exception as e:
        print(f"Skipped {file_name}: {str(e)}")

if dfs:
    combined_df = pd.concat(dfs, ignore_index=True)
    print(f"\nCombined {len(dfs)} files")
    print(f"Total rows: {combined_df.shape[0]}")
else:
    print("\nNo files loaded")

### Pattern 2: Process Large Files in Chunks

In [None]:
# Read large file in chunks
folder_path = "data"  # Your folder in OneLake (no "Files/" prefix)
large_file_path = "large_file.csv"  # File name inside folder_path (no "Files/" prefix)

# FULL URI format (required for Compute Clusters)
large_file_uri = f"azureml://subscriptions/{SUBSCRIPTION_ID}/resourcegroups/{RESOURCE_GROUP}/workspaces/{WORKSPACE_NAME}/datastores/{DATASTORE_NAME}/paths/{folder_path}/{large_file_path}"

try:
    chunk_size = 10000  # Rows per chunk
    chunks_processed = 0
    
    for chunk in pd.read_csv(large_file_uri, chunksize=chunk_size):
        # Process each chunk
        chunks_processed += 1
        print(f"Processing chunk {chunks_processed}: {len(chunk)} rows")
        
        # Your processing logic here
        # ...
    
    print(f"\nProcessed {chunks_processed} chunks")
    
except Exception as e:
    print(f"Could not process file: {str(e)}")
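
The placeholder "processing logic" usually means accumulating a result across chunks without holding the whole file in memory. The sketch below shows that pattern with the standard library so it runs without pandas. `iter_chunks` and `chunked_mean` are hypothetical helpers and the sample data is invented.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of up to chunk_size rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(lines, column, chunk_size):
    """Compute the mean of a numeric column while reading chunk by chunk."""
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    for chunk in iter_chunks(reader, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


sample_csv = io.StringIO("id,score\n1,80\n2,90\n3,100\n4,70\n5,60\n")
print(chunked_mean(sample_csv, "score", chunk_size=2))
```

With pandas, each `chunk` from `pd.read_csv(..., chunksize=...)` is a DataFrame, so the running totals would become `chunk[column].sum()` and `len(chunk)`.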

---

## 9. Troubleshooting & Diagnostics

In [None]:
def diagnose_connection():
    """Run diagnostics to troubleshoot connection issues"""
    print("Running Diagnostics...\n")
    print("=" * 60)
    
    # 1. Check workspace connection
    print("\n1. Workspace Connection")
    try:
        ws = ml_client.workspaces.get(WORKSPACE_NAME)
        print(f"   Connected to: {ws.name}")
        print(f"   Location: {ws.location}")
        print(f"   Resource Group: {ws.resource_group}")
    except Exception as e:
        print(f"   Connection failed: {str(e)}")
        return
    
    # 2. Check datastore
    print("\n2. Datastore Verification")
    try:
        ds = ml_client.datastores.get(DATASTORE_NAME)
        print(f"   Datastore found: {ds.name}")
        print(f"   Type: {ds.type}")
    except Exception as e:
        print(f"   Datastore not found: {str(e)}")
        print("\n   To register the datastore, run:")
        print(f"      .\\register-with-cli.ps1 -s '{SUBSCRIPTION_ID}' -g '{RESOURCE_GROUP}' -w '{WORKSPACE_NAME}'")
        return
    
    # 3. Test URI construction
    print("\n3. URI Format Test")
    test_uri = f"azureml://subscriptions/{SUBSCRIPTION_ID}/resourcegroups/{RESOURCE_GROUP}/workspaces/{WORKSPACE_NAME}/datastores/{DATASTORE_NAME}/paths/test.csv"
    print(f"   FULL URI format (required for Compute Clusters):")
    print(f"      {test_uri}")
    
    # 4. Environment check
    print("\n4. Environment Check")
    import sys
    print(f"   Python version: {sys.version.split()[0]}")
    print(f"   Pandas version: {pd.__version__}")
    
    print("\n" + "=" * 60)
    print("Diagnostics Complete!")
    print("\nNext steps:")
    print("   1. Update 'file_path' with your actual OneLake file")
    print("   2. Run the data access cells above")
    print("   3. If issues persist, check service principal permissions")

# Run diagnostics
diagnose_connection()

---

## Quick Reference

### Your Configuration
```python
SUBSCRIPTION_ID = "<your-subscription-id>"
RESOURCE_GROUP = "<your-resource-group>"
WORKSPACE_NAME = "<your-workspace-name>"
DATASTORE_NAME = "onelakesp_datastore"
```

### Authentication Methods
1. **Managed Identity** - Automatic on Azure ML compute (recommended)
2. **Azure CLI** - Run `az login` locally
3. **Environment Variables** - Set AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET

### URI Format (FULL - Required for Compute Clusters)
```
azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path}
```

**IMPORTANT:** Azure ML Compute Clusters require the FULL URI format with subscription, resource group, and workspace details.

### Path Format Rule
- In Fabric Portal: Files are shown as "Files/RawData/file.csv"
- In Azure ML Datastore: Use "RawData/file.csv" (omit "Files/" prefix)
- Reason: Datastore automatically references the Files/ folder
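
The URI format and the path rule can be combined in one small function so the long f-string is not repeated in every cell. This is a sketch. `build_datastore_uri` is a hypothetical helper, and stripping a leading `Files/` assumes the datastore is mapped to the lakehouse's Files folder, as described above. The argument values in the example are invented.

```python
def build_datastore_uri(subscription_id, resource_group, workspace_name,
                        datastore_name, path):
    """Build the full azureml:// URI for a file in a registered datastore."""
    # Normalise separators and drop any leading slash.
    relative = path.replace("\\", "/").lstrip("/")
    # Drop a leading "Files/" copied from the Fabric portal.
    if relative.lower().startswith("files/"):
        relative = relative[len("files/"):]
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace_name}"
        f"/datastores/{datastore_name}"
        f"/paths/{relative}"
    )


uri = build_datastore_uri(
    "sub-id", "my-rg", "my-ws", "onelakesp_datastore", "Files/RawData/file.csv"
)
print(uri)
```

In the notebook, the first four arguments would be `SUBSCRIPTION_ID`, `RESOURCE_GROUP`, `WORKSPACE_NAME`, and `DATASTORE_NAME`.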

### Common Operations
```python
# FULL URI format
base_uri = f"azureml://subscriptions/{SUBSCRIPTION_ID}/resourcegroups/{RESOURCE_GROUP}/workspaces/{WORKSPACE_NAME}/datastores/{DATASTORE_NAME}/paths"

# Read CSV
df = pd.read_csv(f"{base_uri}/data.csv")

# Read CSV with encoding
df = pd.read_csv(f"{base_uri}/data.csv", encoding='latin1')

# Read Parquet
df = pd.read_parquet(f"{base_uri}/data.parquet")

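# Write CSV (direct writes to azureml:// URIs depend on azureml-fsspec support;
# if this fails, use the upload approach from section 7)
df.to_csv(f"{base_uri}/output.csv", index=False)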
```