<div align="center">

# Getting Started with Patra Model Card Toolkit

</div>

The Patra Toolkit is a component of the Patra ModelCards framework that simplifies creating and documenting AI/ML models. It provides a structured schema that guides users through the essential information about a model: its purpose, development process, and performance. The toolkit also semi-automates the capture of key information, such as fairness and explainability metrics, through integrated analysis tools. By reducing the manual effort of writing model cards, it encourages researchers and developers to adopt good documentation practices, contributing to greater transparency and accountability in AI/ML development.

---

## Features

1. **Encourages Accountability**
   - Incorporate essential model information (metadata, dataset details, fairness, explainability) at training time, ensuring AI models remain transparent from development to deployment.

2. **Semi-Automated Capture**
   - Automated *Fairness* and *Explainability* scanners compute demographic parity, equalized odds, SHAP-based feature importances, etc., for easy integration into Model Cards.

3. **Machine-Actionable Model Cards**
   - Produce a structured JSON representation for ingestion into the Patra Knowledge Base. Ideal for advanced queries on model selection, provenance, versioning, or auditing.

4. **Flexible Repository Support**
   - Pluggable backends for storing models/artifacts on **Hugging Face** or **GitHub**, unifying the model publishing workflow.

5. **Versioning & Model Relationship Tracking**
   - Maintain multiple versions of a model with recognized edges (e.g., `revisionOf`, `alternateOf`) using embedding-based similarity. This ensures clear lineages and easy forward/backward provenance.
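
Embedding-based similarity, as mentioned in item 5, generally means comparing vector representations of two model cards. The sketch below shows that comparison with cosine similarity as an assumed metric. The vectors are made up for illustration; how Patra actually embeds cards or maps a score to `revisionOf` or `alternateOf` is not specified in this notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings of three model cards (not produced by Patra).
card_v1 = [0.90, 0.10, 0.30]
card_v2 = [0.85, 0.15, 0.35]      # a near-identical card, e.g. a later version
card_other = [0.10, 0.90, 0.20]   # an unrelated card

sim_v1_v2 = cosine_similarity(card_v1, card_v2)
sim_v1_other = cosine_similarity(card_v1, card_other)
print(f"v1 vs v2: {sim_v1_v2:.3f}, v1 vs other: {sim_v1_other:.3f}")
```

A high score between two cards suggests they describe closely related models; any cutoff used to decide that would be a design choice of the knowledge base.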


---

This notebook demonstrates:

1. **Loading & Preprocessing** the UCI Adult Dataset  
2. **Training** a simple TensorFlow model  
3. **Creating a Model Card** with optional Fairness and XAI scans  
4. **Submitting** the Model Card (and optionally the model, inference label, and artifacts) to:
   - **Patra server** (for model card storage)  
   - **Backend** (Hugging Face or GitHub) for model storage

---

## 1. Environment Setup

In [1]:
!pip install git+https://github.com/Data-to-Insight-Center/patra-toolkit

Collecting git+https://github.com/Data-to-Insight-Center/patra-toolkit
  Cloning https://github.com/Data-to-Insight-Center/patra-toolkit to /private/var/folders/d7/zwq9fkgs65xdfbrv7v00g8dc0000gn/T/pip-req-build-534u7zs1
  Running command git clone --filter=blob:none --quiet https://github.com/Data-to-Insight-Center/patra-toolkit /private/var/folders/d7/zwq9fkgs65xdfbrv7v00g8dc0000gn/T/pip-req-build-534u7zs1
  Resolved https://github.com/Data-to-Insight-Center/patra-toolkit to commit 61070d54259f4f1cf552f816fc45a56f64cdf53f
  Preparing metadata (setup.py) ... done


In [2]:
!pip install numpy pandas tensorflow scikit-learn


In [1]:
import logging
# logging.basicConfig(level=logging.INFO)
logging.getLogger("absl").setLevel(logging.ERROR)
logging.getLogger("huggingface_hub").setLevel(logging.ERROR)
logging.getLogger("PyGithub").setLevel(logging.ERROR)

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from patra_toolkit import ModelCard, AIModel



## 2. Load and Pre-process the Data

We'll use the **UCI Adult Dataset**, a common fairness benchmark in which the task is to predict whether an individual's income is above or below $50K from demographic attributes.

In [2]:
url = "data/adult/adult.data"
cols = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race",
    "sex", "capital_gain", "capital_loss", "hours_per_week",
    "native_country", "income"
]
df = pd.read_csv(url, names=cols, header=None)

# Encode target
df["income"] = LabelEncoder().fit_transform(df["income"])  # 1 if >50K, else 0

# One-hot encode everything except the target
df = pd.get_dummies(df, drop_first=True, dtype=float)

# Split into features/labels
X = df.drop("income", axis=1).astype("float32").values
y = df["income"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

Train shape: (26048, 100) Test shape: (6513, 100)
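
Because `pd.get_dummies` expands each categorical column into several indicator columns, the position of a given feature in `X` is not obvious from the original column list. The bias scan later in this notebook needs the column holding the protected attribute, so the sketch below shows one way to look it up by name on a small toy frame. The toy data and the name `sex_Male` are illustrative; in the real dataset the encoded column name depends on the raw category strings.

```python
import pandas as pd

# Toy frame that mimics the structure of the Adult data (values are made up).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "sex": ["Male", "Female", "Male", "Female"],
    "income": [0, 1, 0, 1],
})

encoded = pd.get_dummies(toy, drop_first=True, dtype=float)
features = encoded.drop("income", axis=1)

# Keep the feature names in the same order as the columns of the array.
feature_names = features.columns.tolist()
X_toy = features.astype("float32").values

# Look the column up by name instead of hard-coding its position.
sex_idx = feature_names.index("sex_Male")
sensitive = X_toy[:, sex_idx]
print(sex_idx, sensitive)
```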


## 3. Train a Simple TensorFlow Model

Below is a straightforward neural network: two hidden layers plus a final sigmoid for binary classification. We'll train for a few epochs to demonstrate end-to-end usage.

In [3]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.fit(X_train, y_train, epochs=5, batch_size=64, verbose=1)

loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")

Epoch 1/5
407/407 ━━━━━━━━━━━━━━━━━━━━ 0s 433us/step - accuracy: 0.6826 - loss: 113.1765
Epoch 2/5
407/407 ━━━━━━━━━━━━━━━━━━━━ 0s 460us/step - accuracy: 0.6896 - loss: 133.7432
Epoch 3/5
407/407 ━━━━━━━━━━━━━━━━━━━━ 0s 498us/step - accuracy: 0.6676 - loss: 102.2878
Epoch 4/5
407/407 ━━━━━━━━━━━━━━━━━━━━ 0s 434us/step - accuracy: 0.6761 - loss: 84.1222
Epoch 5/5
407/407 ━━━━━━━━━━━━━━━━━━━━ 0s 450us/step - accuracy: 0.6856 - loss: 83.2587
Test Loss: 15.7583, Test Accuracy: 0.7875
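
Loss values this large are a likely sign that the inputs are unscaled: raw columns such as `fnlwgt` take values in the tens of thousands and above, while the one-hot columns are 0 or 1. The original notebook does not scale its features. The sketch below shows how standardization could be added with scikit-learn's `StandardScaler`, using synthetic stand-in data. Applying it to the real data would change the numbers reported in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_demo(n):
    """Synthetic stand-in for two numeric columns with very different scales."""
    age = rng.integers(17, 90, size=n)
    fnlwgt = rng.integers(10_000, 1_000_000, size=n)
    return np.column_stack([age, fnlwgt]).astype("float32")

X_train_demo = make_demo(200)
X_test_demo = make_demo(50)

# Fit on the training split only, then reuse the same statistics for the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```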


## 4. Building a Patra Model Card

### 4.1 Basic Model Card Setup
We start with essential metadata like name, version, short description, and so on.  

In [4]:
mc = ModelCard(
    name="UCI_Adult_Model",
    version="1.0",
    short_description="Predicting whether an individual's income is above $50K using TensorFlow.",
    full_description=(
        "This is a feed-forward neural network trained on the UCI Adult Dataset. "
        "It demonstrates how Patra Toolkit can store model details, fairness scans, "
        "and basic explainability data in a comprehensive Model Card."
    ),
    keywords="uci, adult, patra, fairness, xai, tensorflow",
    author="neelk",
    input_type="Tabular",
    category="classification",
    citation="Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI."
)

### 4.2 Attach AI Model Information
Here we describe the model's ownership, license, performance metrics, etc.

In [5]:
ai_model = AIModel(
    name="AdultTFModel",
    version="1.0",
    description="DNN on UCI Adult dataset for income prediction",
    owner="username",
    location="",
    license="BSD-3-Clause",
    framework="tensorflow",
    model_type="dnn",
    test_accuracy=accuracy
)

# Add additional performance or training metrics
ai_model.add_metric("Epochs", 5)
ai_model.add_metric("BatchSize", 64)
ai_model.add_metric("Optimizer", "Adam")

mc.ai_model = ai_model

## 5. Fairness & Explainability

### 5.1 Bias (Fairness) Analysis
Patra Toolkit has a built-in `populate_bias` method to measure metrics like **demographic parity** or **equalized odds**. We'll focus on the protected attribute "sex" in the data.

**Why check bias?** Real-world models often inadvertently penalize certain groups. By calling `mc.populate_bias(...)`, you get a quick sense of whether the model is systematically advantaging or disadvantaging certain subpopulations.

In [6]:
y_pred = model.predict(X_test)
y_pred = (y_pred >= 0.5).flatten()

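mc.populate_bias(
    X_test,
    y_test,
    y_pred,
    "gender",        # label under which the protected attribute is reported
    X_test[:, 58],   # one-hot `sex` indicator column (index 58 after encoding)
    model
)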

print("Bias Analysis:\n", mc.bias_analysis)


204/204 ━━━━━━━━━━━━━━━━━━━━ 0s 303us/step
Bias Analysis:
 {'demographic_parity_diff': 0.08921713666543653, 'equal_odds_difference': 0.09993135613336157}
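
To make the two reported numbers easier to read, the sketch below computes both metrics by hand on a tiny example, using their common definitions: demographic parity difference is the gap in positive-prediction rate between groups, and equalized odds difference is the larger of the gaps in true-positive rate and false-positive rate. Whether Patra computes its values in exactly this way is an assumption, and the arrays are invented.

```python
import numpy as np

def demographic_parity_difference(y_pred, sensitive):
    """Largest gap in positive-prediction rate between groups."""
    rates = [np.mean(y_pred[sensitive == g]) for g in np.unique(sensitive)]
    return float(max(rates) - min(rates))

def equalized_odds_difference(y_true, y_pred, sensitive):
    """Larger of the between-group gaps in TPR and FPR."""
    tprs, fprs = [], []
    for g in np.unique(sensitive):
        mask = sensitive == g
        yt, yp = y_true[mask], y_pred[mask]
        tprs.append(np.mean(yp[yt == 1]))  # true-positive rate for this group
        fprs.append(np.mean(yp[yt == 0]))  # false-positive rate for this group
    return float(max(max(tprs) - min(tprs), max(fprs) - min(fprs)))

# Invented labels, predictions, and group membership (1 and 0 are two groups).
y_true_demo = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0])
group_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0])

dp_diff = demographic_parity_difference(y_pred_demo, group_demo)
eo_diff = equalized_odds_difference(y_true_demo, y_pred_demo, group_demo)
print(dp_diff, eo_diff)
```

Both metrics are 0 when the groups are treated identically under these definitions, so values near 0.09 to 0.10, as in the output above, indicate a moderate gap between the two groups.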


### 5.2 Explainability (XAI)

If we want to understand model decisions, we can generate interpretability metrics (like feature importance) using Patra’s internal SHAP-based approach.

In [7]:
# Rebuild the list of columns used in training
x_columns = df.columns.tolist()
x_columns.remove('income')

mc.populate_xai(
    X_test[:10],
    x_columns,
    model
)

print("Explainability Analysis:\n", mc.xai_analysis)

Explainability Analysis:
 {'capital_gain': 0.40989761417881254, 'fnlwgt': 0.01063571285538347, 'hours_per_week': 0.00022047044171769075, 'education__Masters': 0.0001725960598939185, 'age': 0.0001642985578004169, 'education__HS_grad': 0.00013157561064331007, 'relationship__Wife': 0.00010773577373172417, 'occupation__Exec_managerial': 9.979047750643738e-05, 'sex__Male': 9.451595324317836e-05, 'education_num': 8.068107493101794e-05}
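
The output above is a ranked dictionary of per-feature scores. A common way to get such a ranking from SHAP values is to average their absolute values over the explained rows and keep the top features. The sketch below shows that aggregation on a hand-written matrix; it assumes, without confirmation from the toolkit, that Patra aggregates in a similar way, and it does not call the `shap` library itself.

```python
import numpy as np

def mean_abs_importance(shap_values, feature_names, top_k=10):
    """Rank features by mean absolute SHAP value across rows."""
    scores = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    order = np.argsort(scores)[::-1][:top_k]  # indices of the largest scores first
    return {feature_names[i]: float(scores[i]) for i in order}

# Hypothetical SHAP values: 2 explained rows x 3 features.
shap_demo = [
    [0.50, -0.10, 0.00],
    [-0.30, 0.10, 0.02],
]
names_demo = ["capital_gain", "fnlwgt", "age"]

ranked = mean_abs_importance(shap_demo, names_demo, top_k=2)
print(ranked)
```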


## 6. Add Requirements
We let Patra auto-detect Python package dependencies to ensure reproducibility.

In [8]:
mc.populate_requirements()
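
For intuition about what dependency auto-detection involves, the sketch below lists the packages installed in the current environment with the standard library's `importlib.metadata`. It is an illustration only: the helper name is hypothetical and this is not necessarily how `populate_requirements` gathers its information.

```python
from importlib import metadata

def installed_requirements():
    """Return sorted 'name==version' strings for installed distributions."""
    reqs = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            reqs[name] = version
    return sorted(f"{name}=={version}" for name, version in reqs.items())

reqs = installed_requirements()
print(f"{len(reqs)} packages found")
```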

## 7. Submission

**Tapis Authentication:**  
Before submitting, ensure you have obtained a valid Tapis token using your TACC credentials. If you do not already have a TACC account, you can create one at [https://accounts.tacc.utexas.edu/begin](https://accounts.tacc.utexas.edu/begin). You can use the `authenticate()` method provided by the toolkit (or any other method) to obtain the token. When calling the submission methods, pass the token as the `tapis_token` parameter so that your request is authenticated by the Patra server. If Tapis authentication isn’t required for your scenario, you can set `tapis_token` to `None`.

The `mc.submit(...)` method can do one or more of the following:
1. **Submit only the card** (no model, no artifacts).
2. **Include the trained model** (uploading to Hugging Face or GitHub).
3. **Add artifacts** (such as data files, inference labels, or any additional resources).


In [13]:
from getpass import getpass  # prompt for the password instead of hard-coding it
token = mc.authenticate(username="neelk", password=getpass("TACC password: "))

Authentication successful.
X-Tapis-Token: <redacted>
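
For scripted runs or shared notebooks, credentials are usually better kept out of the source entirely. The sketch below reads them from environment variables; the variable names `PATRA_USERNAME` and `PATRA_PASSWORD` and the helper itself are hypothetical, not part of the toolkit.

```python
import os

def load_credentials(env=os.environ):
    """Read the username and password from environment variables."""
    username = env.get("PATRA_USERNAME")
    password = env.get("PATRA_PASSWORD")
    if not username or not password:
        raise RuntimeError(
            "Set PATRA_USERNAME and PATRA_PASSWORD before authenticating."
        )
    return username, password

# Usage with the notebook's model card object:
#   username, password = load_credentials()
#   token = mc.authenticate(username=username, password=password)
```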


### 7.1 Submit Model Card

In [None]:
mc.submit(patra_server_url="https://patra.pods.icicleai.tapis.io",
          tapis_token=token)

### 7.2 Submit AI/ML Model

We can specify `"huggingface"` or `"github"` for `model_store`. This will attempt to upload our trained model, while the card is posted to the Patra server.

In [None]:
mc.submit(patra_server_url="https://patra.pods.icicleai.tapis.io",
          tapis_token=token,
          model=model,
          file_format="h5",
          model_store="huggingface")

### 7.3 Submit Artifacts

In [None]:
mc.submit(patra_server_url="https://patra.pods.icicleai.tapis.io",
          tapis_token=token,
          model_store="huggingface",
          artifacts=["data/adult/adult.data", "data/adult/adult.names"])
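
Since artifacts are passed as file paths, a quick pre-flight check can catch a wrong working directory before anything is uploaded. The helper below is a hypothetical convenience, not a toolkit function.

```python
from pathlib import Path

def missing_artifacts(paths):
    """Return the subset of paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]

# Usage with the artifact list from the cell above:
#   missing = missing_artifacts(["data/adult/adult.data", "data/adult/adult.names"])
#   if missing:
#       raise FileNotFoundError(f"Artifacts not found: {missing}")
```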

### 7.4 Submit Model Card, Model, and Artifacts

This scenario might include a special label file plus multiple dataset artifacts.

In [None]:
mc.version = "1.2"
with open("labels.txt", "w") as f:
    f.write("Label 1\n")
    f.write("Label 2\n")

mc.submit(patra_server_url="https://patra.pods.icicleai.tapis.io",
          tapis_token=token,
          model=model,
          file_format="h5",
          model_store="huggingface",
          inference_labels="labels.txt",
          artifacts=["data/adult/adult.data", "data/adult/adult.names"])

### 7.5 Pushing to GitHub

By switching `"huggingface"` to `"github"`, you can store your model in a GitHub repo.

In [None]:
mc.version = "1.3"
mc.submit(patra_server_url="https://patra.pods.icicleai.tapis.io",
          tapis_token=token,
          model=model,
          file_format="h5",
          model_store="github",
          artifacts=["data/adult/adult.data", "data/adult/adult.names"])

By following this notebook, you have:
1. Loaded and preprocessed the UCI Adult Dataset
2. Trained a TensorFlow model to predict income
3. Built a Patra Model Card describing the model’s purpose, performance, and environment
4. Scanned for fairness and explainability metrics
5. Submitted the card to a Patra server along with the model or artifacts to a chosen store (Hugging Face or GitHub)


In [None]:
mc.save("model_card.json")
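
Once saved, the card is an ordinary JSON file and can be inspected with the standard library. The sketch below lists its top-level fields; it assumes the file holds a single JSON object, and the helper name is hypothetical. The actual field names are determined by the Patra schema and are not reproduced here.

```python
import json

def card_top_level_fields(path):
    """Load a saved model card and return its top-level field names, sorted."""
    with open(path, "r", encoding="utf-8") as f:
        card = json.load(f)
    return sorted(card.keys())

# Usage after the save call above:
#   print(card_top_level_fields("model_card.json"))
```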