# Getting Started with Patra Toolkit

This notebook serves as a quickstart guide to help you learn how to:

- Load and preprocess the UCI Adult Dataset  
- Build and train a neural network in TensorFlow  
- Generate a comprehensive Model Card using the **Patra Toolkit**  

By the end of this tutorial, you’ll have a validated Model Card (in JSON format) that captures key information about your model, including fairness and explainability metrics.  

---

## 1. Environment Setup

### 1.1 Install Required Packages


In [2]:
!pip install tensorflow scikit-learn pandas patra_toolkit

Collecting patra_toolkit
  Downloading patra_toolkit-0.1.2-py3-none-any.whl.metadata (492 bytes)
Collecting fairlearn~=0.11.0 (from patra_toolkit)
  Downloading fairlearn-0.11.0-py3-none-any.whl.metadata (7.0 kB)
Downloading patra_toolkit-0.1.2-py3-none-any.whl (13 kB)
Downloading fairlearn-0.11.0-py3-none-any.whl (232 kB)
Installing collected packages: fairlearn, patra_toolkit
Successfully installed fairlearn-0.11.0 patra_toolkit-0.1.2


### 1.2 Import Dependencies

In [3]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from patra_toolkit import ModelCard, AIModel

---
## 2. Load and Inspect the Data

We’ll use the **UCI Adult Dataset**, a widely used benchmark for predicting whether a person's annual income exceeds $50K based on demographic factors. The dataset is described at:
[https://archive.ics.uci.edu/ml/datasets/adult](https://archive.ics.uci.edu/ml/datasets/adult).

The cell below reads `adult.data` directly from the UCI repository. To work offline, download the file and point `data_url` at your local copy, for example `data/adult/adult.data`.


In [4]:
import pandas as pd

columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
data = pd.read_csv(data_url, names=columns, header=None)
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
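
The raw file separates fields with a comma followed by a space, so string values keep a leading space unless the reader is told to skip it (this is why later column names look like `workclass_ Federal-gov`). The following is a minimal sketch, not part of the original notebook, that uses two abbreviated, illustrative rows to show the effect of the `skipinitialspace` and `na_values` options of `pd.read_csv`. The shortened column list is an assumption made to keep the example small.

```python
import io

import pandas as pd

# Two abbreviated, illustrative rows in the same "comma + space" format as adult.data
sample_text = (
    "39, State-gov, 77516, Bachelors, <=50K\n"
    "54, ?, 148657, Preschool, <=50K\n"
)
cols = ["age", "workclass", "fnlwgt", "education", "income"]

# Default parsing keeps the leading space in every string field
raw = pd.read_csv(io.StringIO(sample_text), names=cols, header=None)

# skipinitialspace strips the space after each comma; na_values marks "?" as missing
clean = pd.read_csv(
    io.StringIO(sample_text),
    names=cols,
    header=None,
    skipinitialspace=True,
    na_values="?",
)

print(repr(raw.loc[0, "workclass"]))    # string with a leading space
print(repr(clean.loc[0, "workclass"]))  # string without it
print(clean["workclass"].isna().tolist())
```

The rest of this notebook keeps the default parsing, so its outputs and column names include the leading space.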


---
## 3. Preprocessing

### 3.1 Encode Target Variable
We’ll encode the **income** column using `LabelEncoder`, transforming the categorical values (e.g., `>50K` and `<=50K`) into numerical labels.

In [5]:
label_encoder = LabelEncoder()
data['income'] = label_encoder.fit_transform(data['income'])
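
`LabelEncoder` assigns integer codes following the sorted order of the distinct labels. As a rough illustration that does not depend on scikit-learn, the sketch below reproduces that mapping with `np.unique`; the four-element label array is a made-up sample, and the leading space mirrors how the values are read from the raw file.

```python
import numpy as np

# Made-up sample of income labels, with the leading space kept by the default CSV parsing
labels = np.array([" <=50K", " >50K", " <=50K", " <=50K"])

# np.unique returns the sorted distinct labels and, for each element, the index of its label
classes, codes = np.unique(labels, return_inverse=True)

print(classes.tolist())  # sorted distinct labels
print(codes.tolist())    # integer code per row
```

Because `<` sorts before `>`, the `<=50K` class maps to 0 and the `>50K` class maps to 1, so a prediction of 1 means "income above 50K".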

### 3.2 One-Hot Encode Categorical Features
We’ll convert the remaining categorical variables into **one-hot encoded** columns. Setting `drop_first=True` drops one level per feature, which avoids the dummy variable trap (indicator columns that are perfectly collinear).

In [6]:
data = pd.get_dummies(data, drop_first=True, dtype=float)
data.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,income,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,39,77516,13,2174,0,40,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,50,83311,13,0,0,13,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,38,215646,9,0,0,40,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,53,234721,7,0,0,40,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,28,338409,13,0,0,40,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
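
To make the effect of `drop_first=True` concrete, here is a small sketch on a three-row toy frame (the frame itself is illustrative, not taken from the dataset). With two categories in `sex`, only one indicator column survives, and the dropped category is implied when that indicator is 0.

```python
import pandas as pd

# Toy frame: one numeric column and one two-level categorical column
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "sex": [" Male", " Female", " Male"],
})

# Categories are sorted, and the first one (" Female") is dropped
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

print(encoded.columns.tolist())
print(encoded["sex_ Male"].tolist())
```

Numeric columns such as `age` pass through unchanged; only string and categorical columns are expanded.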


### 3.3 Train-Test Split
Next, we separate features (**X**) from the target (**y**) and then split into training and testing sets.

In [7]:
X = data.drop('income', axis=1).values
y = data['income'].values

print("List of columns after one-hot encoding:")
print(data.columns.tolist())

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

List of columns after one-hot encoding:
['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week', 'income', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'marital-status_ Married-AF-spouse', 'marital-status_ Married-civ-spouse', 'marital-status_ Married-spouse-absent', 'marital-status_ Never-married', 'marital-status_ Separated', 'marital-status_ Widowed', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial'
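
The split sizes, and the step counts that appear in the training and prediction logs later in this notebook, follow from simple arithmetic. The sketch below assumes the standard `adult.data` file with 32,561 rows (a figure not stated in this notebook). It mirrors how scikit-learn rounds the test size up and how Keras rounds the non-validation portion of the training data down.

```python
import math

n_rows = 32561            # assumed row count of adult.data
test_size = 0.2           # from train_test_split
validation_split = 0.1    # from model.fit
batch_size = 32           # from model.fit

n_test = math.ceil(test_size * n_rows)                 # test rows, rounded up
n_train = n_rows - n_test                              # remaining rows
n_fit = math.floor(n_train * (1 - validation_split))   # rows actually used for gradient updates
steps_per_epoch = math.ceil(n_fit / batch_size)        # batches per training epoch
predict_steps = math.ceil(n_test / batch_size)         # batches in model.predict(X_test)

print(n_train, n_test, steps_per_epoch, predict_steps)  # 26048 6513 733 204
```

The last two numbers match the `733/733` and `204/204` progress counters shown in the logs.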

---
## 4. Model Training

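We define a simple feed-forward neural network in TensorFlow, compile it with an **Adam** optimizer, and fit it on the training set. We hold out 10% of the training data for validation and stop early once the training loss has not improved for three epochs.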

In [8]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

history = model.fit(
    X_train,
    y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=[early_stopping]
)

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")


Epoch 1/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.6614 - loss: 340.8387 - val_accuracy: 0.8035 - val_loss: 65.6032
Epoch 2/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.6862 - loss: 72.9035 - val_accuracy: 0.2365 - val_loss: 89.4528
Epoch 3/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.6853 - loss: 53.0041 - val_accuracy: 0.6269 - val_loss: 6.0791
Epoch 4/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.6884 - loss: 25.8541 - val_accuracy: 0.8023 - val_loss: 16.7428
Epoch 5/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.6866 - loss: 12.1903 - val_accuracy: 0.7758 - val_loss: 0.9661
Epoch 6/100
733/733 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.6839 - loss: 4.3003 - val_accuracy: 0.8042 - val_loss: 3.7309
Epoch 7/100


---
## 5. Model Card Generation with Patra Toolkit

Now that we have a trained model, let’s create a **Model Card** to capture essential metadata.

1. The **ModelCard** object contains high-level information about the model (description, use cases, etc.).  
2. The **AIModel** object contains details about the model architecture, performance metrics, ownership, and location.  

Afterward, we’ll demonstrate how to automatically populate the following fields:  
- **Requirements** (packages and versions)  
- **Fairness/Bias Analysis**  
- **Explainability/XAI Analysis**  

---
### 5.1 Create a Model Card


In [9]:
mc = ModelCard(
    name="UCI Adult Data Analysis model using Tensorflow",
    version="0.1",
    short_description="UCI Adult Data analysis using Tensorflow for demonstration of Patra Model Cards.",
    full_description=(
        "We have trained an ML model using the TensorFlow framework to predict income "
        "for the UCI Adult Dataset. We leverage this data to run the Patra model cards "
        "to capture metadata about the model as well as fairness and explainability metrics."
    ),
    keywords="uci adult, tensorflow, explainability, fairness, patra",
    author="Your Name",
    input_type="Tabular",
    category="classification",
    foundational_model="None"
)

# Input and output references
mc.input_data = "https://archive.ics.uci.edu/ml/datasets/adult"
mc.output_data = "https://github.iu.edu/d2i/dockerhub/tensorflow/adult_modelv01"  # Update with your model path


### 5.2 Create an AIModel Instance

This object describes the **model** itself, capturing details like the model’s location, license, framework, and metrics.


In [10]:
ai_model = AIModel(
    name="Income prediction tensorflow model",
    version="0.1",
    description="Census classification problem using TensorFlow Neural Network using the UCI Adult Dataset",
    owner="Your Name or Organization",
    location="https://example.com/path-to-model",  # Update with the actual location if hosted
    license="BSD-3 Clause",
    framework="tensorflow",
    model_type="dnn",
    test_accuracy=accuracy
)

# Populate the model's architecture details
ai_model.populate_model_structure(model)

# Add extra metrics
ai_model.add_metric("Test loss", loss)
ai_model.add_metric("Epochs", len(history.history["loss"]))  # epochs actually run; early stopping may end before 100
ai_model.add_metric("Batch Size", 32)
ai_model.add_metric("Optimizer", "Adam")
ai_model.add_metric("Learning Rate", 0.001)
ai_model.add_metric("Input Shape", str(X_train.shape))

# Attach the AIModel object to the ModelCard
mc.ai_model = ai_model


### 5.3 Automatically Capture Requirements

`populate_requirements()` will parse your environment to identify installed packages and capture them under **environment/requirements** in the Model Card.


In [11]:
mc.populate_requirements()
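
For intuition about what "parsing your environment" can look like, the sketch below lists installed distributions with the standard library's `importlib.metadata`. This is only an illustration of the general idea; how `populate_requirements()` is implemented inside Patra is not shown in this notebook, and the helper name `installed_packages` is hypothetical.

```python
from importlib import metadata

def installed_packages():
    """Return a sorted list of (name, version) pairs for installed distributions."""
    pairs = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            pairs.add((name, str(dist.version)))
    return sorted(pairs, key=lambda pair: pair[0].lower())

# Print the first few entries as "name==version", the usual requirements format
for name, version in installed_packages()[:5]:
    print(f"{name}=={version}")
```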

### 5.4 Bias (Fairness) Analysis

Below, we call the `populate_bias()` method, which takes the test features, the true labels, the predicted labels, and the protected feature on which to measure bias. Here the protected attribute is the one-hot encoded `sex` column, which we report under the name “gender”. For demonstration, we assume it sits at index 58 in **X_test** (as determined after one-hot encoding).

- `feature_name`: "gender"  
- `protected_feature_data`: The specific column from your **X_test** that corresponds to "gender"  
- `model`: The trained TensorFlow model (not strictly needed to compute bias, but used in some advanced checks)
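
A hard-coded position such as 58 silently breaks if the preprocessing changes, so it can be safer to look the index up by column name. The sketch below uses a shortened, illustrative column list. The name `sex_ Male` (including the space that the default CSV parsing leaves in the value) is an assumption based on the naming pattern of the other dummy columns.

```python
# Shortened, illustrative list of feature columns (target already removed)
feature_columns = [
    "age",
    "fnlwgt",
    "race_ White",
    "sex_ Male",
    "native-country_ Vietnam",
]

# list.index raises ValueError if the column is missing, which surfaces typos early
gender_index = feature_columns.index("sex_ Male")
print(gender_index)
```

In this notebook, the equivalent lookup would run on `data.drop('income', axis=1).columns.tolist()`, and the result would replace the literal `58` in `X_test[:, 58]`.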


In [12]:
y_pred = model.predict(X_test)
y_pred = (y_pred >= 0.5).flatten()

mc.populate_bias(
    X_test,
    y_test,
    y_pred,
    "gender",           # Name you want displayed in the report
    X_test[:, 58],      # The slice of data that corresponds to gender
    model
)

print("Bias Analysis:\n", mc.bias_analysis)


204/204 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Bias Analysis:
 {'demographic_parity_diff': 0.005868703414968667, 'equal_odds_difference': 0.00824367931125182}
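
To help interpret the two numbers above, the sketch below computes them from their standard definitions using NumPy on a small made-up example. Demographic parity difference is the gap between the highest and lowest selection rate across groups. Equalized odds difference is the larger of the true-positive-rate gap and the false-positive-rate gap. That Patra obtains its values in exactly this way is an assumption, since the notebook shows only that Fairlearn is installed as a dependency.

```python
import numpy as np

def demographic_parity_difference(y_pred, groups):
    """Largest gap in the share of positive predictions between groups."""
    rates = [np.mean(y_pred[groups == g]) for g in np.unique(groups)]
    return float(max(rates) - min(rates))

def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the TPR gap and the FPR gap between groups."""
    tpr, fpr = [], []
    for g in np.unique(groups):
        mask = groups == g
        yt, yp = y_true[mask], y_pred[mask]
        tpr.append(np.mean(yp[yt == 1]))  # true positive rate within the group
        fpr.append(np.mean(yp[yt == 0]))  # false positive rate within the group
    return float(max(max(tpr) - min(tpr), max(fpr) - min(fpr)))

# Made-up data: two groups of four, each containing both classes
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])

dp_diff = demographic_parity_difference(y_pred, groups)
eo_diff = equalized_odds_difference(y_true, y_pred, groups)
print(dp_diff, eo_diff)  # 0.5 0.5
```

Both metrics range from 0 (identical rates across groups) to 1, so the values reported above, both below 0.01, indicate very small gaps on this test set.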


### 5.5 Explainability (XAI) Analysis

Similarly, `populate_xai()` generates basic SHAP-based interpretability metrics (feature attributions) for a sample of inputs.
- Number of samples to explain: 10 in our case, passed as `X_test[:10]`  
- Feature names: the actual column names from the dataset (minus the target column)

In [14]:
# Rebuild the list of columns used in training
x_columns = data.columns.tolist()
x_columns.remove('income')  # Remove the target

mc.populate_xai(
    X_test[:10],
    x_columns,
    model
)

print("Explainability Analysis:\n", mc.xai_analysis)


Explainability Analysis:
 {'relationship__Not_in_family': 3.228584927184031e-10, 'age': 2.3179584070066718e-10, 'sex__Male': 1.986821492821416e-10, 'fnlwgt': 1.9868214912794398e-10, 'education_num': 1.9040372627331257e-10, 'marital_status__Married_civ_spouse': 1.7384688071824742e-10, 'hours_per_week': 1.6556845770941838e-10, 'occupation__Adm_clerical': 1.490116120001556e-10, 'occupation__Exec_managerial': 1.4073318914552422e-10, 'workclass__Private': 1.1589792042743241e-10}
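
The dictionary maps feature names to attribution scores, so ranking it is a matter of sorting by value. The sketch below copies the first five entries of the output above and prints them in descending order, together with each entry's share of the subset total. All scores here are on the order of 1e-10, so the ranking is best read as relative rather than as a sign of strong individual effects.

```python
# First five entries copied from the printed xai_analysis output
xai_analysis = {
    "relationship__Not_in_family": 3.228584927184031e-10,
    "age": 2.3179584070066718e-10,
    "sex__Male": 1.986821492821416e-10,
    "fnlwgt": 1.9868214912794398e-10,
    "education_num": 1.9040372627331257e-10,
}

# Sort by attribution score, largest first
ranked = sorted(xai_analysis.items(), key=lambda item: item[1], reverse=True)
total = sum(xai_analysis.values())

for name, score in ranked:
    print(f"{name:30s} {score:.3e}  {score / total:.1%} of listed total")
```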


---
## 6. Validate and Save the Model Card

Before saving, let’s ensure our card follows Patra’s default schema by calling `mc.validate()`. If all checks pass, you can save it locally as a JSON file and later upload it to the **Patra Knowledge Base**.

In [16]:
mc.validate()

mc.save("patra_modelcard.json")
print("Model Card validation successful and file saved.")

Model Card validation successful and file saved.
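
Since the saved card is plain JSON, it can be inspected with the standard library. The sketch below is self-contained, so it writes a small stand-in file first. The stand-in's fields are hypothetical and do not claim to match the Patra schema, and the helper name `top_level_keys` is illustrative.

```python
import json
import os
import tempfile

def top_level_keys(path):
    """Load a JSON file and return its top-level keys in sorted order."""
    with open(path, "r", encoding="utf-8") as fh:
        card = json.load(fh)
    return sorted(card.keys())

# Hypothetical stand-in for a saved card; the real file has more fields
stand_in = {
    "name": "UCI Adult Data Analysis model using Tensorflow",
    "version": "0.1",
    "ai_model": {},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    card_path = os.path.join(tmp_dir, "stand_in_modelcard.json")
    with open(card_path, "w", encoding="utf-8") as fh:
        json.dump(stand_in, fh, indent=2)
    keys = top_level_keys(card_path)

print(keys)
```

Calling `top_level_keys("patra_modelcard.json")` on the file saved above gives a quick overview of which sections the card contains.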


---
# Conclusion

Congratulations! You have successfully:

1. Trained a neural network on the UCI Adult Dataset using TensorFlow.  
2. Created a **Patra Model Card** capturing essential metadata.  
3. Automatically analyzed bias and generated basic explainability metrics.  
4. Validated and saved the Model Card in JSON format.

This process is a foundation for more advanced use-cases, such as:
- Uploading the Model Card to the **Patra Knowledge Base** for search and provenance tracking.
- Performing deeper fairness analysis (e.g., multiple protected attributes).
- Integrating advanced interpretability approaches.

By consistently generating and maintaining Model Cards, you’ll be on your way to creating **more transparent** and **accountable** AI solutions.


---

# Uploading your model to Hugging Face

In [29]:
!pip install huggingface_hub

from huggingface_hub import notebook_login, create_repo, upload_file
notebook_login()




In [34]:
# Save the model in HDF5 format
model_save_name = "my_keras_model.h5"
model.save(model_save_name)



In [35]:
username = "nkarthikeyan"
repo_name = "IubNet"
repo_id = f"{username}/{repo_name}"

# Create the repository on Hugging Face Hub
create_repo(repo_id=repo_id, private=False, exist_ok=True)

RepoUrl('https://huggingface.co/nkarthikeyan/IubNet', endpoint='https://huggingface.co', repo_type='model', repo_id='nkarthikeyan/IubNet')

In [36]:
# Upload the HDF5 model file to the repository
upload_file(
    path_or_fileobj=model_save_name,       # Local file to upload
    path_in_repo=model_save_name,          # Destination path in the repo
    repo_id=repo_id,                       # Repository ID ("username/repo_name")
    commit_message="Add Keras model via upload_file"
)



CommitInfo(commit_url='https://huggingface.co/nkarthikeyan/IubNet/commit/bed6db87257c9f49434795b68de964ec744833e5', commit_message='Add Keras model via upload_file', commit_description='', oid='bed6db87257c9f49434795b68de964ec744833e5', pr_url=None, repo_url=RepoUrl('https://huggingface.co/nkarthikeyan/IubNet', endpoint='https://huggingface.co', repo_type='model', repo_id='nkarthikeyan/IubNet'), pr_revision=None, pr_num=None)

# Loading the Model from Hugging Face Hub

In [38]:
from huggingface_hub import hf_hub_download
import tensorflow as tf

# Download the model file from the Hub
model_file = hf_hub_download(
    repo_id="nkarthikeyan/IubNet",
    filename="my_keras_model.h5"
)

# Load the model
loaded_model = tf.keras.models.load_model(model_file)
loaded_model.summary()



