# Purview → scikit-learn end-to-end demo (notebook)

This notebook shows how to:
1. Discover governed data in **Microsoft Purview** and load it for training.
2. Train a **scikit-learn** model and save artifacts to Azure Storage.
3. Push **lineage** (input → training process → model) to Purview using the Atlas API.

**Notes**:
- Purview catalogs metadata (not your data) and powers discovery, classification, labeling, and lineage. 
- Azure Data Factory/Synapse can auto-publish runtime lineage; we add **custom lineage** here for the Python training step. 

References: Purview governance & catalog (Data Map/Unified Catalog) [docs](https://learn.microsoft.com/en-us/purview/data-governance-overview), Atlas lineage API [tutorial](https://learn.microsoft.com/en-us/purview/data-gov-api-create-lineage-relationships), ADF lineage [how-to](https://learn.microsoft.com/en-us/azure/data-factory/tutorial-push-lineage-to-purview).


In [None]:
%pip install --quiet --upgrade azure-identity adlfs pandas scikit-learn requests pyarrow


In [None]:
import os, json, uuid, io, pickle, logging
from datetime import datetime
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from azure.identity import DefaultAzureCredential
from adlfs import AzureBlobFileSystem
from purview_helpers.purview_client import PurviewAtlasClient
from purview_helpers.policy_gate import PolicyGate
logging.basicConfig(level=logging.INFO)


## 1) Configuration
Fill in these values for your environment. The data sources must already be registered and scanned in Purview.


In [None]:
PURVIEW_ACCOUNT = os.environ.get('PURVIEW_ACCOUNT', '<your-purview-account-name>')
STORAGE_ACCOUNT_NAME = os.environ.get('STORAGE_ACCOUNT_NAME', '<yourstorage>')
CONTAINER = os.environ.get('CONTAINER', 'data')
# Paths inside the container
TRAIN_DATA_PATH = os.environ.get('TRAIN_DATA_PATH', 'ml/train.parquet')
MODEL_OUTPUT_PATH = os.environ.get('MODEL_OUTPUT_PATH', 'models/churn_model.pkl')
# Entity type names used in Purview for ADLS Gen2 and Blob path assets (adjust if needed)
INPUT_ENTITY_TYPE = 'azure_datalake_gen2_path'
OUTPUT_ENTITY_TYPE = 'azure_blob_path'
# Current user (optional): object id for policy evaluation hooks; can be omitted for demo
CURRENT_USER_OBJECT_ID = os.environ.get('CURRENT_USER_OBJECT_ID')
print('Using Purview:', PURVIEW_ACCOUNT)
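Because the defaults above are angle-bracket placeholders, a quick check before creating any clients can catch an unconfigured environment early. The following is a minimal sketch of such a check. `find_placeholders` is a hypothetical helper (it is not part of `purview_helpers`), and the sample values are illustrative only.

```python
import re

# Matches values that still look like '<your-purview-account-name>'.
_PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def find_placeholders(config):
    """Return the sorted names of settings that are empty or still placeholders."""
    return sorted(
        name for name, value in config.items()
        if not value or _PLACEHOLDER.match(value)
    )


# Illustrative values only; in the notebook you would pass the real settings.
sample_config = {
    "PURVIEW_ACCOUNT": "<your-purview-account-name>",
    "STORAGE_ACCOUNT_NAME": "contosodata",
    "CONTAINER": "data",
}
unset = find_placeholders(sample_config)
print("Settings still to fill in:", unset)
```

In the notebook, this could be called with a dict of the configuration variables, raising an error when the returned list is not empty.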


## 2) Initialize clients (Purview + Storage)


In [None]:
credential = DefaultAzureCredential()
pv = PurviewAtlasClient(PURVIEW_ACCOUNT, credential=credential)
policy = PolicyGate(mode='allow_all')  # change to 'deny_writes' to see enforcement
abfs = AzureBlobFileSystem(account_name=STORAGE_ACCOUNT_NAME, credential=credential)
storage_options = abfs.storage_options
print('Storage options ready for account', STORAGE_ACCOUNT_NAME)
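The `PolicyGate` helper is not shown in this notebook, so its exact behavior is unknown here. As a rough illustration of what the two modes named above might do, the sketch below assumes that `enforce` raises `PermissionError` for a blocked action and returns nothing otherwise. The class name `SketchPolicyGate` and the example URLs are hypothetical.

```python
class SketchPolicyGate:
    """Hypothetical stand-in for PolicyGate; the real helper may differ."""

    MODES = {"allow_all", "deny_writes"}

    def __init__(self, mode="allow_all"):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode

    def enforce(self, user_object_id, resource_url, action):
        """Raise PermissionError if the action is blocked; otherwise return None."""
        if self.mode == "deny_writes" and action == "write":
            raise PermissionError(
                f"write to {resource_url} denied for user {user_object_id}"
            )


gate = SketchPolicyGate(mode="deny_writes")
# Reads pass in both modes.
gate.enforce(None, "abfss://data@example.dfs.core.windows.net/ml/train.parquet", "read")
try:
    gate.enforce(None, "abfss://data@example.dfs.core.windows.net/models/churn_model.pkl", "write")
    blocked = False
except PermissionError as exc:
    blocked = True
    print("Blocked:", exc)
```

Raising an exception rather than returning a boolean means a forgotten return-value check cannot silently let a write through.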


## 3) (Optional) Discover candidate datasets via Purview search
Search for assets in the catalog (by keyword).


In [None]:
search = pv.search_basic(keywords='churn')
hits = [(e['typeName'], e['attributes'].get('qualifiedName')) for e in search.get('entities', [])]
hits[:5]  # peek
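A keyword search usually returns more than training candidates, so it can help to narrow the hits by entity type and file extension. The sketch below assumes the same response shape as the cell above (an `entities` list whose items carry `typeName` and `attributes.qualifiedName`). `filter_hits` is a hypothetical helper and the sample response is made up.

```python
def filter_hits(entities, type_name=None, suffix=None):
    """Return (typeName, qualifiedName) pairs that match the optional filters."""
    results = []
    for entity in entities:
        qualified_name = (entity.get("attributes") or {}).get("qualifiedName")
        if qualified_name is None:
            continue  # skip entries without a usable qualified name
        if type_name is not None and entity.get("typeName") != type_name:
            continue
        if suffix is not None and not qualified_name.endswith(suffix):
            continue
        results.append((entity.get("typeName"), qualified_name))
    return results


# Made-up response in the shape the search cell assumes.
sample = {"entities": [
    {"typeName": "azure_datalake_gen2_path",
     "attributes": {"qualifiedName": "https://example.dfs.core.windows.net/data/ml/train.parquet"}},
    {"typeName": "azure_blob_path",
     "attributes": {"qualifiedName": "https://example.blob.core.windows.net/data/notes/churn.txt"}},
    {"typeName": "azure_datalake_gen2_path", "attributes": {}},
]}
parquet_hits = filter_hits(
    sample.get("entities", []),
    type_name="azure_datalake_gen2_path",
    suffix=".parquet",
)
print(parquet_hits)
```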


## 4) Enforce read policy & load the training data from ADLS Gen2


In [None]:
dataset_url = f'abfss://{CONTAINER}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{TRAIN_DATA_PATH}'
policy.enforce(CURRENT_USER_OBJECT_ID, dataset_url, 'read')
df = pd.read_parquet(dataset_url, storage_options=storage_options)
print(df.shape)
df.head()


## 5) Train a scikit-learn pipeline


In [None]:
# Adjust feature/label columns for your dataset
label_col = df.columns[-1]
X = df.drop(columns=[label_col]).select_dtypes(include=['number']).fillna(0)
y = df[label_col]
# Stratify only when the label looks categorical (20 or fewer distinct values)
stratify = y if y.nunique() <= 20 else None
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=stratify
)
pipe = Pipeline([
    ('scaler', StandardScaler(with_mean=False)),
    ('clf', LogisticRegression(max_iter=500)),
])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
print(classification_report(y_test, pred))
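The cell above stratifies whenever the label has 20 or fewer distinct values. One known failure mode is a class with a single row, for which scikit-learn's stratified split raises a `ValueError`. The sketch below adds a per-class minimum to the same heuristic. `can_stratify` and its default thresholds are assumptions for illustration, not part of the notebook or of scikit-learn.

```python
from collections import Counter


def can_stratify(labels, max_classes=20, min_per_class=2):
    """Return True when a stratified split on these labels looks safe."""
    counts = Counter(labels)
    if not counts or len(counts) > max_classes:
        return False  # empty input, or too many distinct values to be a class label
    return min(counts.values()) >= min_per_class


balanced = can_stratify(["yes", "no", "no", "yes", "no"])
singleton = can_stratify(["yes", "no", "no", "no"])  # 'yes' appears only once
print(balanced, singleton)
```

In the training cell, this could replace the inline condition as `stratify=y if can_stratify(y) else None`.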


## 6) Save the model artifact to Azure Storage (ADLS/Blob)


In [None]:
model_url = f'abfss://{CONTAINER}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{MODEL_OUTPUT_PATH}'
policy.enforce(CURRENT_USER_OBJECT_ID, model_url, 'write')
# adlfs addresses files as '<container>/<path>'; the account comes from the filesystem object
relative_path = f'{CONTAINER}/{MODEL_OUTPUT_PATH}'
with abfs.open(relative_path, 'wb') as f:
    pickle.dump(pipe, f)
print('Model written to', model_url)
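The same storage location appears in this notebook in three forms: an `abfss://` URL, a `<container>/<path>` string for the filesystem object, and an `https://` name for the catalog. As a sketch of how to derive the last two from the first, the helpers below parse the URL with the standard library. The function names and the example account `example` are hypothetical.

```python
from urllib.parse import urlsplit


def split_abfss_url(url):
    """Split abfss://<container>@<account-host>/<path> into its three parts."""
    parts = urlsplit(url)
    if parts.scheme not in ("abfs", "abfss") or not parts.username or not parts.hostname:
        raise ValueError(f"not an abfs(s) URL with container and host: {url!r}")
    return parts.username, parts.hostname, parts.path.lstrip("/")


def to_fs_path(url):
    """Return '<container>/<path>', the form passed to the filesystem object."""
    container, _host, path = split_abfss_url(url)
    return f"{container}/{path}"


def to_https_name(url):
    """Return 'https://<account-host>/<container>/<path>'."""
    container, host, path = split_abfss_url(url)
    return f"https://{host}/{container}/{path}"


example_url = "abfss://data@example.dfs.core.windows.net/models/churn_model.pkl"
fs_path = to_fs_path(example_url)
https_name = to_https_name(example_url)
print(fs_path)
print(https_name)
```

Parsing the URL once avoids hand-written string replacement, which tends to leave the account host in the path.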


## 7) Push lineage to Purview (input → process → output)
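This step uses the Atlas entity and relationship APIs: the input and output are registered (or stubbed) as dataset entities, the training run is registered as a process entity, and lineage relationships link the three.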


In [None]:
from datetime import timezone  # datetime.utcnow() is deprecated as of Python 3.12
process_name = f"sklearn_training_{datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')}"
input_qn = f"https://{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{CONTAINER}/{TRAIN_DATA_PATH}"
output_qn = f"https://{STORAGE_ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER}/{MODEL_OUTPUT_PATH}"
input_guid = pv.create_or_stub_dataset(INPUT_ENTITY_TYPE, input_qn, name='training_parquet')
proc_guid = pv.create_process(name=process_name, description='scikit-learn training run')
output_guid = pv.create_or_stub_dataset(OUTPUT_ENTITY_TYPE, output_qn, name='sklearn_model_pkl')
pv.add_lineage(input_guid, INPUT_ENTITY_TYPE, proc_guid, output_guid, OUTPUT_ENTITY_TYPE)
print('Lineage submitted. GUIDs:', {'input': input_guid, 'process': proc_guid, 'output': output_guid})
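The helper calls above hide the request bodies. For orientation, the sketch below builds one common Atlas v2 shape for lineage: a `Process` entity whose `inputs` and `outputs` reference datasets by type and qualified name. This is an illustration under assumptions, not the `purview_helpers` implementation, which may use separate relationship calls instead. The function names, the process `qualifiedName` scheme, and the `example` account are hypothetical, and no request is sent.

```python
import json


def ref(type_name, qualified_name):
    """Reference an existing entity by type and qualified name (Atlas object id)."""
    return {"typeName": type_name, "uniqueAttributes": {"qualifiedName": qualified_name}}


def build_process_entity(name, qualified_name, inputs, outputs, description=""):
    """Build an Atlas v2 entity payload for a Process with inputs and outputs."""
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "description": description,
                "inputs": [ref(t, qn) for t, qn in inputs],
                "outputs": [ref(t, qn) for t, qn in outputs],
            },
        }
    }


payload = build_process_entity(
    name="sklearn_training_example",
    qualified_name="sklearn://training/sklearn_training_example",  # hypothetical scheme
    inputs=[("azure_datalake_gen2_path",
             "https://example.dfs.core.windows.net/data/ml/train.parquet")],
    outputs=[("azure_blob_path",
              "https://example.blob.core.windows.net/data/models/churn_model.pkl")],
    description="scikit-learn training run",
)
print(json.dumps(payload, indent=2))
```

Referencing datasets by qualified name rather than GUID keeps the payload reproducible across runs, provided the referenced entities already exist in the catalog.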


### Next steps
- In the Purview portal, locate the **training_parquet** or **sklearn_model_pkl** assets and open the **Lineage** tab.
- Apply **sensitivity labels** to inputs/outputs in the Data Map, and pilot **Protection policies** (preview) if you need data-plane enforcement on labeled assets.
