# Getting Started with Patra Toolkit

This notebook serves as a quickstart guide to help you learn how to:

- Load and preprocess the UCI Adult Dataset  
- Build and train a neural network in TensorFlow  
- Generate a comprehensive Model Card using the **Patra Toolkit**  

By the end of this tutorial, you’ll have a validated Model Card (in JSON format) that captures key information about your model, including fairness and explainability metrics.  

---

## 1. Environment Setup

### 1.1 Install Required Packages


In [1]:
!pip install tensorflow scikit-learn pandas patra_toolkit


### 1.2 Import Dependencies

In [2]:
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import sys
import os

# Add the repository root to sys.path so that the latest local version of patra_toolkit is imported.
repo_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)

from patra_toolkit import ModelCard, AIModel


---
## 2. Load and Inspect the Data

We’ll use the **UCI Adult Dataset**, a widely used benchmark for predicting whether a person's annual income exceeds $50K based on demographic factors. The dataset is described at:
[https://archive.ics.uci.edu/ml/datasets/adult](https://archive.ics.uci.edu/ml/datasets/adult).

The cell below downloads the raw file directly. Section 7 additionally assumes a copy saved locally at `data/adult/adult.data`.


In [3]:
import io

import certifi
import pandas as pd
import requests

# Path to the CA bundle used to verify the HTTPS connection
cert_path = certifi.where()
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# Download the data with certificate verification
response = requests.get(data_url, verify=cert_path)
response.raise_for_status()

# Use io.StringIO to load the text content into pandas
data = pd.read_csv(io.StringIO(response.text),
                   names=[
                       "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
                       "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
                       "hours-per-week", "native-country", "income"
                   ],
                   header=None)
data.head()


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
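
The raw `adult.data` file separates fields with a comma followed by a space, so string values are read with a leading space (for example `" State-gov"`), which is why later column names look like `workclass_ Federal-gov`. Missing entries in the file are marked with `?`. The following is a minimal standard-library sketch, not part of the original notebook, of how `skipinitialspace=True` strips that padding; the two sample rows are the first two records shown above in their raw layout, and `parse_rows` is a hypothetical helper name. `pd.read_csv` accepts a `skipinitialspace` argument with the same effect.

```python
import csv
import io

# Two raw rows in the file's native layout: fields are separated by ", ".
RAW = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

def parse_rows(text, strip_padding):
    """Parse CSV text; hypothetical helper for illustration only."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=strip_padding)
    return [row for row in reader]

padded = parse_rows(RAW, strip_padding=False)
clean = parse_rows(RAW, strip_padding=True)

print(repr(padded[0][1]))  # ' State-gov'
print(repr(clean[0][1]))   # 'State-gov'
```

Whether to strip the padding is a design choice: the rest of this notebook keeps it, so the one-hot column names below retain the space.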


---
## 3. Preprocessing

### 3.1 Encode Target Variable
We’ll encode the **income** column using `LabelEncoder`, transforming the categorical values (e.g., `>50K` and `<=50K`) into numerical labels.

In [4]:
label_encoder = LabelEncoder()
data['income'] = label_encoder.fit_transform(data['income'])
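
`LabelEncoder` assigns integer codes to the class labels in sorted order. The pure-Python sketch below mimics that mapping for illustration only; it is not scikit-learn's implementation, and `encode_labels` is a hypothetical name. The sample labels keep the leading space found in the raw file.

```python
def encode_labels(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes

labels = [" <=50K", " >50K", " <=50K", " <=50K"]
codes, classes = encode_labels(labels)

print(classes)  # [' <=50K', ' >50K']
print(codes)    # [0, 1, 0, 0]
```

Under this ordering, `<=50K` maps to 0 and `>50K` maps to 1, which matches the `income` values of 0 in the encoded rows shown in the next section.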

### 3.2 One-Hot Encode Categorical Features
We’ll one-hot encode the remaining categorical variables. Setting `drop_first=True` drops one level per feature, which avoids the dummy variable trap (indicator columns that are perfectly collinear).

In [5]:
data = pd.get_dummies(data, drop_first=True, dtype=float)
data.head()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | income | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | ... | native-country_ Portugal | native-country_ Puerto-Rico | native-country_ Scotland | native-country_ South | native-country_ Taiwan | native-country_ Thailand | native-country_ Trinadad&Tobago | native-country_ United-States | native-country_ Vietnam | native-country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
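
With `drop_first=True`, a feature with k levels yields k-1 indicator columns, and the omitted level is represented by all zeros. A minimal pure-Python sketch of the idea follows; it is illustrative only, not pandas' implementation, and `one_hot_drop_first` is a hypothetical helper.

```python
def one_hot_drop_first(column_name, values):
    """One-hot encode a categorical column, omitting the first (sorted) level."""
    levels = sorted(set(values))
    kept = levels[1:]  # the dropped level is implied when all indicators are 0
    names = [f"{column_name}_{level}" for level in kept]
    rows = [[1.0 if v == level else 0.0 for level in kept] for v in values]
    return names, rows

names, rows = one_hot_drop_first("sex", [" Male", " Female", " Male"])
print(names)  # ['sex_ Male']
print(rows)   # [[1.0], [0.0], [1.0]]
```

Here the two-level `sex` feature collapses to a single `sex_ Male` column, where 0.0 stands for the dropped ` Female` level.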


### 3.3 Train-Test Split
Next, we separate features (**X**) from the target (**y**) and then split into training and testing sets.

In [6]:
X = data.drop('income', axis=1).values
y = data['income'].values

print("List of columns after one-hot encoding:")
print(data.columns.tolist())

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

List of columns after one-hot encoding:
['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week', 'income', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'marital-status_ Married-AF-spouse', 'marital-status_ Married-civ-spouse', 'marital-status_ Married-spouse-absent', 'marital-status_ Never-married', 'marital-status_ Separated', 'marital-status_ Widowed', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial'

---
## 4. Model Training

We define a simple feed-forward neural network in TensorFlow, compile it with an **Adam** optimizer, and fit it on our training set.

In [7]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

history = model.fit(
    X_train,
    y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=[early_stopping]
)

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")


Epoch 1/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 1s 699us/step - accuracy: 0.6694 - loss: 269.3651 - val_accuracy: 0.2365 - val_loss: 87.7503
Epoch 2/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 0s 631us/step - accuracy: 0.6802 - loss: 59.5821 - val_accuracy: 0.7908 - val_loss: 71.4242
Epoch 3/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 1s 793us/step - accuracy: 0.6918 - loss: 21.3576 - val_accuracy: 0.2361 - val_loss: 17.9792
Epoch 4/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 0s 658us/step - accuracy: 0.6720 - loss: 11.6666 - val_accuracy: 0.8046 - val_loss: 3.0014
Epoch 5/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 0s 649us/step - accuracy: 0.6890 - loss: 3.4128 - val_accuracy: 0.7927 - val_loss: 1.1349
Epoch 6/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 1s 692us/step - accuracy: 0.7110 - loss: 1.0887 - val_accuracy: 0.8046 - val_loss: 0.6712
...
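
The `EarlyStopping` callback in the cell above halts training once the monitored quantity (here the training `loss`) has gone `patience` epochs without improving. The sketch below is a simplified pure-Python illustration of that rule, not Keras code; `early_stop_epoch` and the loss values are hypothetical.

```python
def early_stop_epoch(losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified: an epoch counts as an improvement only if its loss is
    strictly lower than the best loss seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical loss curve: improves for three epochs, then plateaus.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.59]
print(early_stop_epoch(losses, patience=3))  # 6
```

Monitoring `val_loss` instead of `loss` is a common alternative when a validation split is available, as it is in this notebook.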

---
## 5. Model Card Generation with Patra Toolkit

Now that we have a trained model, let’s create a **Model Card** to capture essential metadata.

1. The **ModelCard** object contains high-level information about the model (description, use cases, etc.).  
2. The **AIModel** object contains details about the model architecture, performance metrics, ownership, and location.  

Afterward, we’ll demonstrate how to automatically populate the following fields:  
- **Requirements** (packages and versions)  
- **Fairness/Bias Analysis**  
- **Explainability/XAI Analysis**  

---
### 5.1 Create a Model Card


In [8]:
mc = ModelCard(
    name="UCI_Model",
    version="0.1",
    short_description="UCI Adult Data analysis using TensorFlow for demonstration of Patra Model Cards.",
    full_description=(
        "We have trained an ML model using the TensorFlow framework to predict income "
        "for the UCI Adult Dataset. We leverage this data to run the Patra model cards "
        "to capture metadata about the model as well as fairness and explainability metrics."
    ),
    keywords="uci adult, tensorflow, explainability, fairness, patra",
    author="neelk",
    input_type="Tabular",
    category="classification",
    foundational_model="None",
    citation="Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20."
)

mc.input_data = "https://archive.ics.uci.edu/ml/datasets/adult"


### 5.2 Create an AIModel Instance

This object describes the **model** itself, capturing details like the model’s location, license, framework, and metrics.


In [9]:
ai_model = AIModel(
    name="Tensorflow Model",
    version="0.1",
    description="Census classification problem using TensorFlow Neural Network using the UCI Adult Dataset",
    owner="Neelesh Karthikeyan",
    location="",
    license="BSD-3 Clause",
    framework="tensorflow",
    model_type="dnn",
    test_accuracy=accuracy
)

# ai_model.inference_label = ""

# Populate the model's architecture details
ai_model.populate_model_structure(model)

# Add extra metrics
ai_model.add_metric("Test loss", loss)
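# Record the epochs actually run; early stopping may end training before 100.
ai_model.add_metric("Epochs", len(history.history["loss"]))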
ai_model.add_metric("Batch Size", 32)
ai_model.add_metric("Optimizer", "Adam")
ai_model.add_metric("Learning Rate", 0.001)
ai_model.add_metric("Input Shape", str(X_train.shape))

# Attach the AIModel object to the ModelCard
mc.ai_model = ai_model


### 5.3 Automatically Capture Requirements

`populate_requirements()` will parse your environment to identify installed packages and capture them under **environment/requirements** in the Model Card.


In [10]:
mc.populate_requirements()

### 5.4 Bias (Fairness) Analysis

Below, we show how to call the `populate_bias()` method, which takes the test features, the true labels, the predicted labels, a display name for the protected feature, the column of values for that feature, and the trained model. For demonstration, we label the protected feature “gender”; it corresponds to the dataset's `sex` column, which we assume is at index 58 in **X_test** after one-hot encoding.

- `feature_name`: "gender"  
- `protected_feature_data`: The specific column from your **X_test** that corresponds to "gender"  
- `model`: The trained TensorFlow model (not strictly needed to compute bias, but used in some advanced checks)
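
A hard-coded position breaks silently if the encoding changes. One way to avoid that, sketched here under the assumption that the one-hot column is named `sex_ Male` (the raw values carry a leading space), is to look the index up by name. The shortened feature list and the `column_index` helper are hypothetical.

```python
# Hypothetical, shortened feature list; in the notebook the full list would be
# data.columns.tolist() with 'income' removed.
feature_names = ["age", "fnlwgt", "race_ White", "sex_ Male", "native-country_ Cuba"]

def column_index(names, target):
    """Return the position of `target`, failing loudly if it is absent."""
    try:
        return names.index(target)
    except ValueError:
        raise KeyError(f"{target!r} not found among {len(names)} features") from None

sex_idx = column_index(feature_names, "sex_ Male")
print(sex_idx)  # 3
```

In the notebook, `X_test[:, sex_idx]` would then take the place of `X_test[:, 58]`.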


In [11]:
y_pred = model.predict(X_test)
y_pred = (y_pred >= 0.5).flatten()

mc.populate_bias(
    X_test,
    y_test,
    y_pred,
    "gender",           # Name you want displayed in the report
    X_test[:, 58],      # The slice of data that corresponds to gender
    model
)

print("Bias Analysis:\n", mc.bias_analysis)


204/204 ━━━━━━━━━━━━━━━━━━━━ 0s 503us/step
Bias Analysis:
 {'demographic_parity_diff': 0.007677798575754372, 'equal_odds_difference': 0.011425675372248634}
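
The two reported numbers are commonly defined as follows: demographic parity difference is the gap between the highest and lowest rate of positive predictions across groups, and equalized odds difference is the larger of the across-group gaps in true-positive rate and false-positive rate. The pure-Python sketch below illustrates those textbook definitions on made-up data. It is an assumption that `populate_bias()` follows them; its internal computation may differ.

```python
def rate(flags):
    """Fraction of truthy entries; 0.0 for an empty list."""
    return sum(1 for f in flags if f) / len(flags) if flags else 0.0

def group_rates(y_true, y_pred, group):
    """Per-group selection rate, true-positive rate and false-positive rate."""
    out = {}
    for g in sorted(set(group)):
        idx = [i for i, v in enumerate(group) if v == g]
        out[g] = {
            "selection": rate([y_pred[i] for i in idx]),
            "tpr": rate([y_pred[i] for i in idx if y_true[i] == 1]),
            "fpr": rate([y_pred[i] for i in idx if y_true[i] == 0]),
        }
    return out

def spread(rates, key):
    """Largest minus smallest value of one rate across all groups."""
    vals = [r[key] for r in rates.values()]
    return max(vals) - min(vals)

# Tiny made-up example: four samples in each of two groups.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

rates = group_rates(y_true, y_pred, group)
dp_diff = spread(rates, "selection")
eo_diff = max(spread(rates, "tpr"), spread(rates, "fpr"))
print(dp_diff, eo_diff)  # 0.5 0.5
```

Values near 0, like those in the output above, indicate that the two groups receive positive predictions and error rates at similar frequencies under these definitions.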


### 5.5 Explainability (XAI) Analysis

Similarly, `populate_xai()` generates basic SHAP-based interpretability metrics (feature attributions) for a sample of inputs.
- We explain 10 samples, passed as `X_test[:10]`.  
- We also pass the actual column names from the dataset (minus the target column) so that scores are reported per feature.

In [12]:
# Rebuild the list of columns used in training
x_columns = data.columns.tolist()
x_columns.remove('income')  # Remove the target

mc.populate_xai(
    X_test[:10],
    x_columns,
    model
)

print("Explainability Analysis:\n", mc.xai_analysis)


Explainability Analysis:
 {'fnlwgt': 0.007139739427301619, 'capital_gain': 0.007126364939742618, 'hours_per_week': 1.1592639817132382e-05, 'age': 2.7945637702944362e-06, 'relationship__Wife': 1.6353527704876526e-06, 'sex__Male': 8.985069062975953e-07, 'workclass__Self_emp_not_inc': 2.0090076658497835e-07, 'workclass__State_gov': 7.496939765063962e-08, 'native_country__Holand_Netherlands': 0.0, 'native_country__Haiti': 0.0}
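
The result is a plain dictionary mapping feature names to attribution scores, so ordinary Python is enough to rank it. A small sketch, assuming `mc.xai_analysis` keeps the dict shape printed above; `top_features` is a hypothetical helper, and the values are rounded copies of a subset of that output.

```python
# Values copied (rounded) from the printed output; a subset for brevity.
xai_scores = {
    "fnlwgt": 0.00714,
    "capital_gain": 0.00713,
    "hours_per_week": 1.16e-05,
    "age": 2.79e-06,
    "native_country__Haiti": 0.0,
}

def top_features(scores, k):
    """Return the k features with the largest attribution, highest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

for name, score in top_features(xai_scores, k=3):
    print(f"{name:<16} {score:.2e}")
```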


---
## 6. Validate and Save the Model Card

Before saving, let’s ensure our card follows Patra’s default schema by calling `mc.validate()`. If all checks pass, you can save it locally as a JSON file and later upload it to the **Patra Knowledge Base**.
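
A minimal sketch of the validate-then-save step described here. It assumes that `validate()` returns a truthy value when the card passes the schema check and that `save(path)` writes the JSON file, as the `mc.save(...)` call later in this notebook suggests. `validate_and_save` and the file name are hypothetical.

```python
import logging

def validate_and_save(card, path):
    """Save `card` to `path` only if validation passes.

    Assumes `card.validate()` returns a truthy value on success and that
    `card.save(path)` writes the card as JSON.
    """
    if not card.validate():
        logging.warning("Model card failed validation; not saving.")
        return False
    card.save(path)
    return True

# In the notebook this would be called as:
#   validate_and_save(mc, "model_card.json")   # hypothetical file name
```

Gating the save on validation keeps an incomplete card from being written and later uploaded by mistake.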

## 7. Submit the Model, Artifacts, and Model Card to the Patra Server and Model Store

In [14]:
mc.submit_model(
    patra_server_url="http://127.0.0.1:5002",
    model=model,
    file_format="h5",
    model_store="huggingface",
)

INFO:root:Model card validated successfully.
INFO:root:Model ID retrieved: neelk-uci_model-0.1
INFO:root:Repository credentials stored.
INFO:root:Model serialized successfully.
neelk-uci_model-0.1.h5: 100%|██████████| 314k/314k [00:00<00:00, 1.12MB/s]
INFO:root:Model uploaded at: https://huggingface.co/nkarthikeyan/neelk-uci_model-0.1/blob/main/neelk-uci_model-0.1.h5
INFO:root:ModelCard submitted successfully.


{'message': 'Successfully uploaded the model card',
 'model_card_id': 'neelk-uci_model-0.1'}

In [17]:
mc.submit_artifact(artifact_path="data/adult/adult.data")


INFO:root:Artifact stored at: https://huggingface.co/nkarthikeyan/neelk-uci_model-0.1/blob/main/adult.data


{'artifact_location': 'https://huggingface.co/nkarthikeyan/neelk-uci_model-0.1/blob/main/adult.data'}

In [20]:
mc.save("/Users/neeleshkarthikeyan/d2i/patra-toolkit/examples/notebooks/README.json")

INFO:root:Model card saved to /Users/neeleshkarthikeyan/d2i/patra-toolkit/examples/notebooks/README.json.


In [21]:
mc.submit_artifact(artifact_path="/Users/neeleshkarthikeyan/d2i/patra-toolkit/examples/notebooks/README.json")

INFO:root:Artifact stored at: https://huggingface.co/nkarthikeyan/neelk-uci_model-0.1/blob/main/README.json


{'artifact_location': 'https://huggingface.co/nkarthikeyan/neelk-uci_model-0.1/blob/main/README.json'}

---
## 8. Conclusion

Congratulations! You have successfully:

1. Trained a neural network on the UCI Adult Dataset using TensorFlow.  
2. Created a **Patra Model Card** capturing essential metadata.  
3. Automatically analyzed bias and generated basic explainability metrics.  
4. Validated and saved the Model Card in JSON format.

This process is a foundation for more advanced use-cases, such as:
- Uploading the Model Card to the **Patra Knowledge Base** for search and provenance tracking.
- Performing deeper fairness analysis (e.g., multiple protected attributes).
- Integrating advanced interpretability approaches.

By consistently generating and maintaining Model Cards, you’ll be on your way to creating **more transparent** and **accountable** AI solutions.
