# Data Management with idspy

This notebook demonstrates the core data management capabilities of idspy through the `TabAccessor` extension and schema system. Together they let you attach column roles and train/validation/test partitions to an ordinary pandas DataFrame for machine learning workflows.

## What you'll learn

In this tutorial, you'll discover how to:

1. **Define Data Schemas** - Organize your data structure with meaningful column roles
2. **Use TabAccessor** - Access data subsets through an intuitive pandas extension
3. **Handle Dynamic Updates** - Modify data roles and schemas on the fly
4. **Create Data Partitions** - Split datasets into train/validation/test sets
5. **Manage Partitioned Data** - Work with specific data splits while preserving structure

## Key Benefits

- **Structured Data Management**: Clear separation of features, targets, and metadata
- **Flexible Role Assignment**: Dynamic column role updates without data copying
- **Seamless Integration**: Works naturally with pandas DataFrames
- **Partition Awareness**: Automatic handling of data splits with schema preservation

---

Let's start by setting up our environment and creating some sample data.

In [26]:
import sys
import os

# Add the project root to Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

In [27]:
import pandas as pd

from src.idspy.data.schema import ColumnRole
from src.idspy.data.partition import random_split
from src.idspy.data.tab_accessor import TabAccessor

# Create a sample Super Mario Bros. dataset with various data types
df = pd.DataFrame({
    "character": [
        "Mario", "Luigi", "Peach", "Bowser", "Yoshi",
        "Toad", "Donkey Kong", "Wario", "Waluigi", "Koopa"
    ],
    "coins": [120, 95, 30, 300, 60, 45, 150, 80, 70, 20],
    "lives": [3, 2, 4, 10, 5, 1, 6, 1, 2, 1],
    "power_up": [
        "Mushroom", "Mushroom", "Star", "Fire Flower", "Egg",
        "Mushroom", "Banana", "Garlic", "Trickster", "Shell"
    ],
    "is_enemy": [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]
})
df.index.name = "id"

print("Our sample dataset:")
print(f"Shape: {df.shape}")
print("\nColumn types:")
print(df.dtypes)
print("\nFirst few rows:")
df.head()

Our sample dataset:
Shape: (10, 5)

Column types:
character    object
coins         int64
lives         int64
power_up     object
is_enemy      int64
dtype: object

First few rows:


| id | character | coins | lives | power_up | is_enemy |
|---|---|---|---|---|---|
| 0 | Mario | 120 | 3 | Mushroom | 0 |
| 1 | Luigi | 95 | 2 | Mushroom | 0 |
| 2 | Peach | 30 | 4 | Star | 0 |
| 3 | Bowser | 300 | 10 | Fire Flower | 1 |
| 4 | Yoshi | 60 | 5 | Egg | 0 |


## Sample Dataset Creation

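The dataset created above is a Super Mario Bros. themed sample with several column types: string categories (`character`, `power_up`), integer counts (`coins`, `lives`), and a binary label (`is_enemy`). The next cell inspects the `.tab` namespace that idspy adds to it.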

In [28]:
# After the TabAccessor import above, the .tab accessor is available on any DataFrame
print("Available methods in .tab namespace:")
tab_methods = [method for method in dir(df.tab) if not method.startswith('_')]
for method in tab_methods:
    print(f"  - {method}")

print(f"\nTabAccessor type: {type(df.tab)}")
print(f"Original DataFrame type: {type(df)}")

Available methods in .tab namespace:
  - add_role
  - categorical
  - columns
  - features
  - get_meta
  - get_partition
  - load_meta
  - numerical
  - partitions
  - schema
  - set_partitions_from_labels
  - set_partitions_from_positions
  - set_schema
  - target
  - test
  - train
  - update_role
  - val

TabAccessor type: <class 'src.idspy.data.tab_accessor.TabAccessor'>
Original DataFrame type: <class 'pandas.core.frame.DataFrame'>


## Understanding TabAccessor

`TabAccessor` is a pandas extension that adds the `.tab` namespace to DataFrames. It provides structured data management capabilities while keeping your data as a regular pandas DataFrame. The cell above lists the public methods and properties that the namespace exposes.
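For readers curious how a namespace like `.tab` can appear on a DataFrame, pandas offers a public hook, `pd.api.extensions.register_dataframe_accessor`. The sketch below is not idspy's code: the accessor name `demo`, the class `DemoAccessor`, and its `numeric` property are hypothetical and only illustrate the general mechanism.

```python
import pandas as pd


# Hypothetical accessor registered under "demo" so it cannot clash with idspy's "tab".
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        # pandas passes in the DataFrame that the accessor is attached to.
        self._obj = pandas_obj

    @property
    def numeric(self) -> pd.DataFrame:
        """Return only the columns with a numeric dtype."""
        return self._obj.select_dtypes(include="number")


toy = pd.DataFrame({"coins": [120, 95], "character": ["Mario", "Luigi"]})
numeric_cols = list(toy.demo.numeric.columns)
print(numeric_cols)  # ['coins']
```

Registration happens when the decorated class is defined, which is why importing the module that defines an accessor is enough to make it available.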

In [29]:
# Define the schema by assigning roles to columns
df = df.tab.set_schema(
    numerical=["coins", "lives"],      # Continuous numeric values
    categorical=["character", "power_up"],  # Discrete categories
    target=["is_enemy"],               # What we want to predict
)

print("Schema successfully defined!")
print(f"✓ Numerical features: {df.tab.schema.numerical}")
print(f"✓ Categorical features: {df.tab.schema.categorical}")
print(f"✓ All features: {df.tab.schema.features}")
print(f"✓ Target variable: {df.tab.schema.target}")

# The schema is now part of the DataFrame
print(f"\nSchema type: {type(df.tab.schema)}")

Schema successfully defined!
✓ Numerical features: ['coins', 'lives']
✓ Categorical features: ['character', 'power_up']
✓ All features: ['coins', 'lives', 'character', 'power_up']
✓ Target variable: is_enemy

Schema type: <class 'src.idspy.data.schema.Schema'>


## Dynamic Schema Updates

Schemas aren't set in stone! You can modify column roles dynamically as your analysis evolves. This is particularly useful when:

- Exploring different feature representations
- Converting continuous variables to categorical
- Adding or removing features from your model

Let's see how to update roles dynamically:

In [30]:
# Let's treat "lives" as categorical instead of numerical
# This makes sense since lives are often discrete counts (1, 2, 3, etc.)
print("Before role change:")
print(f"  Numerical: {df.tab.schema.numerical}")
print(f"  Categorical: {df.tab.schema.categorical}")

# Add categorical role to "lives" column
df.tab.add_role("lives", ColumnRole.CATEGORICAL)

print("\nAfter role change:")
print(f"  Numerical: {df.tab.schema.numerical}")
print(f"  Categorical: {df.tab.schema.categorical}")

print("\nNote: Column 'lives' is now categorical, not numerical!")

Before role change:
  Numerical: ['coins', 'lives']
  Categorical: ['character', 'power_up']

After role change:
  Numerical: ['coins']
  Categorical: ['character', 'power_up', 'lives']

Note: Column 'lives' is now categorical, not numerical!
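The output above suggests that the numerical and categorical roles are mutually exclusive: adding `lives` to one removes it from the other, and only metadata changes. A minimal way to picture that behavior is sketched below. The `Role` enum and `assign_role` helper are hypothetical stand-ins, not idspy's `ColumnRole` or its actual implementation.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical stand-in for a column-role enum."""
    NUMERICAL = "numerical"
    CATEGORICAL = "categorical"


def assign_role(roles: dict[Role, list[str]], column: str, new_role: Role) -> None:
    """Move `column` to `new_role`, removing it from any role that holds it."""
    for cols in roles.values():
        if column in cols:
            cols.remove(column)
    roles[new_role].append(column)


roles = {
    Role.NUMERICAL: ["coins", "lives"],
    Role.CATEGORICAL: ["character", "power_up"],
}
assign_role(roles, "lives", Role.CATEGORICAL)

print(roles[Role.NUMERICAL])    # ['coins']
print(roles[Role.CATEGORICAL])  # ['character', 'power_up', 'lives']
```

Because only lists of column names are edited, no DataFrame values are touched or copied in this sketch.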


## Role-Based Data Access

Now for the magic! With a schema defined, you can access data subsets based on their roles. This creates clean separation between features and targets, making your ML code more readable and maintainable.

### Accessing Features and Targets

In [31]:
# Access feature matrix (X) and target vector (y) like in scikit-learn
X = df.tab.features  # All feature columns
y = df.tab.target    # Target column(s)

print("Feature matrix (X):")
print(f"Shape: {X.shape}")
display(X.head())

print("\nTarget vector (y):")
print(f"Shape: {y.shape}")
display(y.head())

Feature matrix (X):
Shape: (10, 4)


| id | coins | character | power_up | lives |
|---|---|---|---|---|
| 0 | 120 | Mario | Mushroom | 3 |
| 1 | 95 | Luigi | Mushroom | 2 |
| 2 | 30 | Peach | Star | 4 |
| 3 | 300 | Bowser | Fire Flower | 10 |
| 4 | 60 | Yoshi | Egg | 5 |



Target vector (y):
Shape: (10,)


id
0    0
1    0
2    0
3    1
4    0
Name: is_enemy, dtype: int64

In [32]:
# Work with only numerical columns
print("Original numerical data:")
print(df.tab.numerical.head())

# Perform operations on numerical columns
df.tab.numerical = df.tab.numerical + 2
print("\nAfter adding 2 to all numerical columns:")
print(df.tab.numerical.head())

# The original DataFrame is updated too
print("\nVerification - updated DataFrame:")
print(df[["coins"]].head())

Original numerical data:
    coins
id       
0     120
1      95
2      30
3     300
4      60

After adding 2 to all numerical columns:
    coins
id       
0     122
1      97
2      32
3     302
4      62

Verification - updated DataFrame:
    coins
id       
0     122
1      97
2      32
3     302
4      62


In [33]:
# Check data types of numerical columns
print("Data types of numerical columns:")
print(df.tab.numerical.dtypes)

# You can also access categorical columns
print("\nData types of categorical columns:")
print(df.tab.categorical.dtypes)

Data types of numerical columns:
coins    int64
dtype: object

Data types of categorical columns:
character    object
power_up     object
lives         int64
dtype: object
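Note that `lives` is still stored as `int64` even though the schema treats it as categorical. The plain-pandas comparison below, which uses a small toy frame rather than the notebook's `df`, shows why dtype-based selection alone cannot express that choice. The `categorical_cols` list is a hypothetical stand-in for a schema.

```python
import pandas as pd

toy = pd.DataFrame({
    "character": ["Mario", "Luigi", "Peach"],
    "coins": [120, 95, 30],
    "lives": [3, 2, 4],
})

# Selecting by dtype always groups "lives" with the numeric columns.
by_dtype = list(toy.select_dtypes(include="number").columns)

# An explicit role list (hypothetical stand-in for a schema) can say otherwise.
categorical_cols = ["character", "lives"]
by_role = list(toy[categorical_cols].columns)

print(by_dtype)  # ['coins', 'lives']
print(by_role)   # ['character', 'lives']
```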


In [34]:
# Type conversion is straightforward
print("Before conversion:")
print(df.tab.numerical.dtypes)

# Convert all numerical columns to float64
df.tab.numerical = df.tab.numerical.astype("float64")

print("\nAfter conversion to float64:")
print(df.tab.numerical.dtypes)

Before conversion:
coins    int64
dtype: object

After conversion to float64:
coins    float64
dtype: object


## Data Partitioning

idspy also has built-in support for data partitioning. You can split your data into train/validation/test sets and access each one through `df.tab.train`, `df.tab.val`, and `df.tab.test`, with the schema preserved across all partitions.

### Creating Data Splits

In [35]:
# Create random splits (60% train, 20% validation, 20% test)
split_mapping = random_split(df, train_size=0.6, val_size=0.2, test_size=0.2)
df.tab.set_partitions_from_labels(split_mapping)

print("Data has been partitioned!")
print(f"Total samples: {len(df)}")
print(f"Train samples: {len(df.tab.train)}")
print(f"Validation samples: {len(df.tab.val)}")
print(f"Test samples: {len(df.tab.test)}")

print("\n" + "="*50)
print("TRAINING SET:")
print("="*50)
display(df.tab.train)

print("\n" + "="*50)
print("VALIDATION SET:")
print("="*50)
display(df.tab.val)

print("\n" + "="*50)
print("TEST SET:")
print("="*50)
display(df.tab.test)

Data has been partitioned!
Total samples: 10
Train samples: 6
Validation samples: 2
Test samples: 2

TRAINING SET:


| id | character | coins | lives | power_up | is_enemy |
|---|---|---|---|---|---|
| 0 | Mario | 122.0 | 3 | Mushroom | 0 |
| 2 | Peach | 32.0 | 4 | Star | 0 |
| 3 | Bowser | 302.0 | 10 | Fire Flower | 1 |
| 5 | Toad | 47.0 | 1 | Mushroom | 0 |
| 6 | Donkey Kong | 152.0 | 6 | Banana | 0 |
| 7 | Wario | 82.0 | 1 | Garlic | 1 |



VALIDATION SET:


| id | character | coins | lives | power_up | is_enemy |
|---|---|---|---|---|---|
| 1 | Luigi | 97.0 | 2 | Mushroom | 0 |
| 4 | Yoshi | 62.0 | 5 | Egg | 0 |



TEST SET:


| id | character | coins | lives | power_up | is_enemy |
|---|---|---|---|---|---|
| 8 | Waluigi | 72.0 | 2 | Trickster | 1 |
| 9 | Koopa | 22.0 | 1 | Shell | 1 |
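To make the 6/2/2 split sizes concrete, here is a minimal sketch of label-based random partitioning with NumPy and pandas. It is an illustration only: `label_rows`, its `seed` parameter, and the `"train"`/`"val"`/`"test"` label strings are assumptions, and idspy's `random_split` may work differently and return a different structure.

```python
import numpy as np
import pandas as pd


def label_rows(index: pd.Index, train_size: float, val_size: float, seed: int = 0) -> pd.Series:
    """Hypothetical helper: assign each row a 'train', 'val', or 'test' label at random."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(index))  # random order of row positions
    n_train = int(round(train_size * len(index)))
    n_val = int(round(val_size * len(index)))

    labels = np.empty(len(index), dtype=object)
    labels[shuffled[:n_train]] = "train"
    labels[shuffled[n_train:n_train + n_val]] = "val"
    labels[shuffled[n_train + n_val:]] = "test"  # whatever remains
    return pd.Series(labels, index=index, name="partition")


toy = pd.DataFrame({"coins": range(10)})
labels = label_rows(toy.index, train_size=0.6, val_size=0.2)

counts = labels.value_counts().to_dict()
train_rows = toy[labels == "train"]  # boolean mask selects one partition
print(counts)
```

A purely random split like this does not balance the target across partitions, which matches the output above, where the test set contains only enemies.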


In [36]:
# Example: Give hero characters bonus coins in training data
df_train = df.tab.train.copy()
print("Original training data coins:")
print(df_train[["character", "coins", "is_enemy"]])

# Add 50 coins to non-enemy characters (heroes)
df_train.loc[df_train["is_enemy"] == 0, "coins"] += 50

print("\nAfter giving heroes bonus coins:")
print(df_train[["character", "coins", "is_enemy"]])

# Update the training partition
df.tab.train = df_train

print("\nUpdated training set numerical features:")
display(df.tab.train.tab.numerical)

Original training data coins:
      character  coins  is_enemy
id                              
0         Mario  122.0         0
2         Peach   32.0         0
3        Bowser  302.0         1
5          Toad   47.0         0
6   Donkey Kong  152.0         0
7         Wario   82.0         1

After giving heroes bonus coins:
      character  coins  is_enemy
id                              
0         Mario  172.0         0
2         Peach   82.0         0
3        Bowser  302.0         1
5          Toad   97.0         0
6   Donkey Kong  202.0         0
7         Wario   82.0         1

Updated training set numerical features:


| id | coins |
|---|---|
| 0 | 172.0 |
| 2 | 82.0 |
| 3 | 302.0 |
| 5 | 97.0 |
| 6 | 202.0 |
| 7 | 82.0 |
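The copy, modify, and write-back pattern above has a plain-pandas analogue. The sketch below uses a toy frame and a hypothetical `train_idx` index to show how changes made on a copy of a subset can be written back to the parent by index alignment. It does not claim to show how `df.tab.train = ...` is implemented.

```python
import pandas as pd

toy = pd.DataFrame({"coins": [122.0, 97.0, 32.0], "is_enemy": [0, 0, 1]})
train_idx = pd.Index([0, 2])  # hypothetical row labels of the training partition

# Work on a copy so the parent frame is untouched until the explicit write-back.
train = toy.loc[train_idx].copy()
train.loc[train["is_enemy"] == 0, "coins"] += 50

# Write back only the training rows; the assignment aligns on the index.
toy.loc[train.index, "coins"] = train["coins"]

print(toy["coins"].tolist())  # [172.0, 97.0, 32.0]
```

Rows outside `train_idx` keep their values, and the enemy row inside it is unchanged because the bonus applies only where `is_enemy == 0`.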


## Key Takeaways

1. **TabAccessor Integration**: The `.tab` namespace extends pandas DataFrames with structured data management
2. **Schema-Driven Design**: Define column roles (numerical, categorical, target) for organized ML workflows
3. **Dynamic Flexibility**: Update column roles on the fly without copying data
4. **Clean Data Access**: Use `df.tab.features`, `df.tab.target`, etc. for intuitive data selection
5. **Partition Management**: Split data into train/val/test while preserving schemas across all partitions
6. **Seamless Operations**: All pandas functionality remains available alongside schema features
