# Data Management Examples

This notebook demonstrates the core data management features: the `TabAccessor` pandas extension (exposed as `df.tab`) and the schema system that assigns a role to each column.

## Features Demonstrated

- **Schema Definition**: Organize data structure with column roles (numerical, categorical, target)
- **Dynamic Schema Updates**: Modify column roles dynamically for flexible data handling
- **Role-based Data Views**: Access specific subsets based on column roles through `TabAccessor`
- **Data Partitioning**: Create train/validation/test splits with `random_split`
- **Partition Management**: Access and modify specific data partitions while preserving schema

## Setup

In [1]:
import sys
import os

# Add the project root to Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

In [2]:
import pandas as pd

from src.idspy.data.schema import ColumnRole
from src.idspy.data.partition import random_split
from src.idspy.data.tab_accessor import TabAccessor

# --- toy dataset ---
df = pd.DataFrame({
    "character": [
        "Mario", "Luigi", "Peach", "Bowser", "Yoshi",
        "Toad", "Donkey Kong", "Wario", "Waluigi", "Koopa"
    ],
    "coins": [120, 95, 30, 300, 60, 45, 150, 80, 70, 20],
    "lives": [3, 2, 4, 10, 5, 1, 6, 1, 2, 1],
    "power_up": [
        "Mushroom", "Mushroom", "Star", "Fire Flower", "Egg",
        "Mushroom", "Banana", "Garlic", "Trickster", "Shell"
    ],
    "is_enemy": [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]
})
df.index.name = "id"
df

id,character,coins,lives,power_up,is_enemy
0,Mario,120,3,Mushroom,0
1,Luigi,95,2,Mushroom,0
2,Peach,30,4,Star,0
3,Bowser,300,10,Fire Flower,1
4,Yoshi,60,5,Egg,0
5,Toad,45,1,Mushroom,0
6,Donkey Kong,150,6,Banana,0
7,Wario,80,1,Garlic,1
8,Waluigi,70,2,Trickster,1
9,Koopa,20,1,Shell,1


In [3]:
TabAccessor

src.idspy.data.tab_accessor.TabAccessor
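
The `df.tab` namespace used throughout this notebook looks like a pandas accessor. As a rough illustration of that mechanism (not the `TabAccessor` implementation), here is a minimal sketch using pandas' `register_dataframe_accessor`. The accessor name `demo`, the class `DemoAccessor`, and the dtype-based column selection are all assumptions made for this example.

```python
import pandas as pd


# Hypothetical accessor name "demo" (not idspy's "tab"), chosen so it cannot
# clash with an accessor that is already registered.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Toy accessor exposing the numeric columns of a DataFrame."""

    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._obj = pandas_obj

    @property
    def numerical(self) -> pd.DataFrame:
        # Assumption: pick columns by dtype; a schema-based accessor would
        # track roles explicitly instead.
        return self._obj.select_dtypes(include="number")


toy = pd.DataFrame({"name": ["Mario", "Bowser"], "coins": [120, 300]})
numeric_cols = list(toy.demo.numerical.columns)
print(numeric_cols)
```

Importing the module that defines the accessor class is what makes the namespace available on every DataFrame, which may be why `TabAccessor` is imported above even though it is never called directly.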

## Schema Definition

Assign each column a role (numerical, categorical, or target) with `set_schema`. The schema then reports the numerical and categorical columns together as features.

In [4]:
df = df.tab.set_schema(
    numerical=["coins", "lives"],
    categorical=["character", "power_up"],
    target=["is_enemy"],
)

print("Features:", df.tab.schema.features)  # ['coins', 'lives', 'character', 'power_up']
print("Target:", df.tab.schema.target)  # is_enemy

Features: ['coins', 'lives', 'character', 'power_up']
Target: is_enemy


### Dynamic Schema Updates

Reassign a column's role after the schema is set. Here `add_role` moves `lives` from numerical to categorical.

In [5]:
df.tab.add_role("lives", ColumnRole.CATEGORICAL)

print("Numeric:", df.tab.schema.numerical)  # ['coins']
print("Categorical:", df.tab.schema.categorical)  # ['character', 'power_up', 'lives']

Numeric: ['coins']
Categorical: ['character', 'power_up', 'lives']
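
To make the behaviour above concrete, the sketch below reproduces the same bookkeeping with a plain dictionary: assigning a column to a new role removes it from its old role and appends it to the new one. The names `Role` and `move_column` are hypothetical and only mirror the output shown here; the schema class may work differently internally.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical stand-in for a column-role enumeration."""

    NUMERICAL = "numerical"
    CATEGORICAL = "categorical"
    TARGET = "target"


def move_column(roles, column, new_role):
    """Return a new mapping in which `column` belongs to `new_role` only."""
    # Drop the column from every role, keeping the order of the others.
    updated = {
        role: [c for c in cols if c != column] for role, cols in roles.items()
    }
    # Append it to the end of the new role's column list.
    updated.setdefault(new_role, []).append(column)
    return updated


roles = {
    Role.NUMERICAL: ["coins", "lives"],
    Role.CATEGORICAL: ["character", "power_up"],
    Role.TARGET: ["is_enemy"],
}
updated = move_column(roles, "lives", Role.CATEGORICAL)
print(updated[Role.NUMERICAL])
print(updated[Role.CATEGORICAL])
```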


### Role-based Data Views

Access column subsets by role while keeping regular pandas functionality. Role views are also assignable, so a transformed subset can be written back to the frame.

In [6]:
X = df.tab.features
y = df.tab.target

display(X, y)

id,coins,character,power_up,lives
0,120,Mario,Mushroom,3
1,95,Luigi,Mushroom,2
2,30,Peach,Star,4
3,300,Bowser,Fire Flower,10
4,60,Yoshi,Egg,5
5,45,Toad,Mushroom,1
6,150,Donkey Kong,Banana,6
7,80,Wario,Garlic,1
8,70,Waluigi,Trickster,2
9,20,Koopa,Shell,1


id
0    0
1    0
2    0
3    1
4    0
5    0
6    0
7    1
8    1
9    1
Name: is_enemy, dtype: int64
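
For comparison, the same feature/target selection can be written in plain pandas by keeping the role lists by hand. This is only a sketch of the equivalent column selection, assuming features are ordered as numerical columns followed by categorical ones (as in the output above). The `toy` frame and the list names are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "character": ["Mario", "Bowser"],
        "coins": [120, 300],
        "lives": [3, 10],
        "is_enemy": [0, 1],
    }
)

# Role lists maintained manually instead of through a schema object.
numerical = ["coins"]
categorical = ["character", "lives"]
target = "is_enemy"

X = toy[numerical + categorical]  # DataFrame of feature columns
y = toy[target]  # Series, because a single column label is used

print(list(X.columns))
print(y.name)
```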

In [7]:
df.tab.numerical = df.tab.numerical + 2
df.tab.numerical

id,coins
0,122
1,97
2,32
3,302
4,62
5,47
6,152
7,82
8,72
9,22


In [8]:
df.tab.numerical.dtypes

coins    int64
dtype: object

In [9]:
df.tab.numerical = df.tab.numerical.astype("float64")
df.tab.numerical.dtypes

coins    float64
dtype: object

## Data Partitioning

Create train/validation/test splits and access partitions through `TabAccessor`.

In [10]:
split_mapping = random_split(df, train_size=0.6, val_size=0.2, test_size=0.2)
df.tab.set_partitions_from_labels(split_mapping)

print("Train partition:")
display(df.tab.train)

print("Valid partition:")
display(df.tab.val)

print("Test partition:")
display(df.tab.test)

Train partition:


id,character,coins,lives,power_up,is_enemy
0,Mario,122.0,3,Mushroom,0
1,Luigi,97.0,2,Mushroom,0
3,Bowser,302.0,10,Fire Flower,1
4,Yoshi,62.0,5,Egg,0
6,Donkey Kong,152.0,6,Banana,0
7,Wario,82.0,1,Garlic,1


Valid partition:


id,character,coins,lives,power_up,is_enemy
5,Toad,47.0,1,Mushroom,0
8,Waluigi,72.0,2,Trickster,1


Test partition:


id,character,coins,lives,power_up,is_enemy
2,Peach,32.0,4,Star,0
9,Koopa,22.0,1,Shell,1
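
The 6/2/2 row counts above follow from the 0.6/0.2/0.2 fractions applied to ten rows. The sketch below shows one way a label-based random split could be computed with NumPy. It is an illustration only: the function `label_split`, its `seed` parameter, and the label strings are assumptions, and the actual return type of `random_split` is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def label_split(index, train_size, val_size, test_size, seed=0):
    """Assign each index entry a 'train', 'val', or 'test' label at random."""
    if abs(train_size + val_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size + val_size + test_size must equal 1")

    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(index))  # random order of row positions

    n_train = int(round(train_size * len(index)))
    n_val = int(round(val_size * len(index)))

    # Everything not claimed by train or val falls into test.
    labels = np.full(len(index), "test", dtype=object)
    labels[shuffled[:n_train]] = "train"
    labels[shuffled[n_train:n_train + n_val]] = "val"
    return pd.Series(labels, index=index, name="partition")


labels = label_split(pd.RangeIndex(10, name="id"), 0.6, 0.2, 0.2)
counts = labels.value_counts().to_dict()
print(counts)
```

Returning labels keyed by the original index, rather than three separate frames, keeps the split independent of the data itself, so the same labels can be reapplied after columns change.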


In [11]:
df_train = df.tab.train.copy()
df_train.loc[df_train["is_enemy"] == 0, "coins"] += 50

df.tab.train = df_train
df.tab.train.tab.numerical

id,coins
0,172.0
1,147.0
3,302.0
4,112.0
6,202.0
7,82.0
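
Writing a modified partition back, as in the cell above, amounts to an index-aligned assignment. A plain-pandas sketch of the same idea follows, assuming the partition is identified by a set of index labels (`train_idx` here is hypothetical). Rows outside the partition are left untouched.

```python
import pandas as pd

toy = pd.DataFrame(
    {"coins": [122.0, 97.0, 32.0, 302.0], "is_enemy": [0, 0, 0, 1]}
)
train_idx = pd.Index([0, 1, 3])  # hypothetical train partition labels

# Work on a copy of the partition, then modify only the non-enemy rows.
train = toy.loc[train_idx].copy()
train.loc[train["is_enemy"] == 0, "coins"] += 50

# Write the partition back; pandas aligns on index and column labels.
toy.loc[train.index, train.columns] = train

print(toy["coins"].tolist())
```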
