<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Our plan

We want to build functionality for creating ETL pipelines from pandas operations.
We want to define classic extract, transform, and load functions that we can string together in a pipeline.
We would also like to inspect the intermediate DataFrames and some of their key properties.

## Our background

We work as data science consultants, mainly with SMEs in manufacturing. We often have to build ETL pipelines from various sources, which then feed analytics or ML workloads.
The pipelines can get pretty complex with custom business logic, which is why we want a way to quickly show all the steps we take.

In [None]:
# Sample manufacturing data
data = {
    'order_id': [101, 102, 103, 104, 105],
    'product': ['Widget A', 'Widget B', 'Widget A', 'Widget C', 'Widget B'],
    'quantity': [50, 30, 75, 20, 45],
    'defects': [2, 1, 3, 0, 2],
    'production_time': [120, 95, 150, 80, 110]  # in minutes
}

df = pd.DataFrame(data)
df

Unnamed: 0,order_id,product,quantity,defects,production_time
0,101,Widget A,50,2,120
1,102,Widget B,30,1,95
2,103,Widget A,75,3,150
3,104,Widget C,20,0,80
4,105,Widget B,45,2,110


In [0]:
#| echo: false
#| output: asis
show_doc(get_demo_data)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L18){target="_blank" style="float:right; font-size:smaller"}

### get_demo_data

```python

def get_demo_data(
    
):


```

In [None]:
df[df["product"] != "Widget C"].groupby(["product"])[["quantity", "defects", "production_time"]].sum()

product,quantity,defects,production_time
Widget A,125,5,270
Widget B,75,3,205


# Solution space exploration

# Our preferred approach
For the exploration and discussion with AI on the approaches, see the notebooks in the `nbs` folder.

We write a decorator that adds a verbosity keyword (`vrbs`) to each function, so we can decide per step what and how much to log. We want to keep it as simple as possible: the pipeline just takes a df and a list of steps, where each step is a `(function, kwargs)` tuple.

In [None]:
def pipeline(df, steps, vrbs_default=True):
    for func, func_kwargs in steps:
        # a per-step "vrbs" wins over the default; build a new dict so the caller's kwargs are not mutated
        vrbs = func_kwargs.get("vrbs", vrbs_default)
        df = func(df, **{**func_kwargs, "vrbs": vrbs})
    return df

In [None]:
def track(func):
    def wrapper(in_df, vrbs=False, *args, **kwargs):
        if vrbs:
            print(f"\n{'*'*10} Step: {func.__name__} {'*'*10}")
            print(f"\nInput DataFrame shape: {in_df.shape}")
            print(in_df)
        out_df = func(in_df, *args, **kwargs)
        if vrbs:
            print(f"\nOutput DataFrame shape: {out_df.shape}")
            print(out_df)
        return out_df
    return wrapper

In [None]:
@track
def filter_products(df):
    return df[df["product"] != "Widget C"]

@track
def aggregate_by_product(df):
    return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum()

The pipeline lets individual transformation steps override the default `vrbs` setting. In this example, nothing is printed for `aggregate_by_product`:

In [None]:
steps = [
    (filter_products, {}),
    (aggregate_by_product, {"vrbs": False}),
]

In [None]:
pipeline(df, steps)


********** Step: filter_products **********

Input DataFrame shape: (5, 5)
   order_id   product  quantity  defects  production_time
0       101  Widget A        50        2              120
1       102  Widget B        30        1               95
2       103  Widget A        75        3              150
3       104  Widget C        20        0               80
4       105  Widget B        45        2              110

Output DataFrame shape: (4, 5)
   order_id   product  quantity  defects  production_time
0       101  Widget A        50        2              120
1       102  Widget B        30        1               95
2       103  Widget A        75        3              150
4       105  Widget B        45        2              110


product,quantity,defects,production_time
Widget A,125,5,270
Widget B,75,3,205


We want to add basic profiling of time and memory. 

In [None]:
from datetime import datetime
from time import sleep

start_time = datetime.now()
sleep(3.14)
end_time = datetime.now()
total_time = end_time - start_time
total_time

datetime.timedelta(seconds=3, microseconds=140237)

In [None]:
start_time

datetime.datetime(2026, 1, 21, 15, 36, 55, 877745)
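
The cells above time a step with `datetime.now()`; memory is not measured yet. Below is a minimal sketch of how both could be captured for a single step, assuming `time.perf_counter()` for the interval and `DataFrame.memory_usage(deep=True)` for the footprint. The helper name `profile_step` and the keys of the returned dict are hypothetical and not part of the package.

```python
import time

import pandas as pd


def profile_step(func, df):
    """Hypothetical helper: run func(df), return the result plus timing and memory stats."""
    mem_in = int(df.memory_usage(deep=True).sum())  # bytes, including object columns
    start = time.perf_counter()  # monotonic clock intended for measuring intervals
    out_df = func(df)
    elapsed = time.perf_counter() - start
    mem_out = int(out_df.memory_usage(deep=True).sum())
    stats = {"seconds": elapsed, "mem_in_bytes": mem_in, "mem_out_bytes": mem_out}
    return out_df, stats


# Small stand-in for the manufacturing data used in this notebook
demo = pd.DataFrame({"product": ["Widget A", "Widget B", "Widget C"], "quantity": [50, 30, 20]})
filtered, stats = profile_step(lambda d: d[d["product"] != "Widget C"], demo)
print(stats)
```

`perf_counter` is not affected by system clock adjustments, which makes it a reasonable choice for durations, while `datetime.now()` remains useful for the human-readable start and end timestamps.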

In [None]:
def track(func):
    def wrapper(in_df, vrbs=False, *args, **kwargs):
        start_time = datetime.now()
        if vrbs:
            print(f"\n{'*'*10} Step: {func.__name__} {'*'*10}")
            print(f"\nInput DataFrame shape: {in_df.shape}")
            print(f"Start time: {start_time}")
            print(in_df)
        out_df = func(in_df, *args, **kwargs)
        end_time = datetime.now()
        total_time = end_time - start_time
        if vrbs:
            print(f"\nOutput DataFrame shape: {out_df.shape}")
            print(f"End time: {end_time}")
            print(f"Total time: {total_time}")
            print(out_df)
        return out_df
    return wrapper

In [None]:
@track
def filter_products(df):
    return df[df["product"] != "Widget C"]

@track
def aggregate_by_product(df):
    return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum()

In [None]:
steps = [
    (filter_products, {}),
    (aggregate_by_product, {}),
]

In [None]:
_df = pipeline(df, steps)


********** Step: filter_products **********

Input DataFrame shape: (5, 5)
Start time: 2026-01-21 15:36:59.046934
   order_id   product  quantity  defects  production_time
0       101  Widget A        50        2              120
1       102  Widget B        30        1               95
2       103  Widget A        75        3              150
3       104  Widget C        20        0               80
4       105  Widget B        45        2              110

Output DataFrame shape: (4, 5)
End time: 2026-01-21 15:36:59.048467
Total time: 0:00:00.001533
   order_id   product  quantity  defects  production_time
0       101  Widget A        50        2              120
1       102  Widget B        30        1               95
2       103  Widget A        75        3              150
4       105  Widget B        45        2              110

********** Step: aggregate_by_product **********

Input DataFrame shape: (4, 5)
Start time: 2026-01-21 15:36:59.049173
   order_id   product  quantity

# first reflection and next steps
- right now we are still printing the whole dataframe. We want to change that to show just the first 3 rows, 5 random rows and the last 3 rows, maybe with a visual cue that rows are truncated.
- we would also like to show the dataframes to the right of the aggregate info (like shape, etc.)
- include information like df.info()
- include information like df.describe()
- include information like diff columns (num cols changed, cols just in before, cols just in after)
- include information like diff rows (maybe just the number)
- source code of functions
- description of step, based on docstring

- nicer representation maybe in html or display or something like that

- new wrapper to include assertions and/or sanity checks (property-based, for example same number of mat_ids)

- function to generate a pipeline report, in Excel (or something else)
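
The column and row diff items from the list above could be sketched as follows. This is only an illustration: the helper name `column_diff` and the keys of the returned dict are assumptions, not something that exists in the package yet.

```python
import pandas as pd


def column_diff(in_df, out_df):
    """Hypothetical helper: compare columns and row counts of two DataFrames."""
    before, after = list(in_df.columns), list(out_df.columns)
    return {
        "only_in_input": [c for c in before if c not in after],
        "only_in_output": [c for c in after if c not in before],
        # symmetric difference = columns that exist on exactly one side
        "n_cols_changed": len(set(before) ^ set(after)),
        "row_diff": len(out_df) - len(in_df),
    }


before_df = pd.DataFrame({"product": ["A", "B", "A"], "quantity": [50, 30, 75], "defects": [2, 1, 3]})
# groupby moves "product" into the index, so it disappears from the columns
after_df = before_df.groupby("product")[["quantity", "defects"]].sum()
after_df["defect_rate"] = after_df["defects"] / after_df["quantity"]
print(column_diff(before_df, after_df))
```

Lists are used instead of sets for the two "only in" entries so that the original column order is preserved in the report.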

Okay, our first idea after picking the problem back up is to store the information in a dictionary and then write separate functions to display it.

In [None]:
from functools import wraps

def track(func):
    @wraps(func)
    def wrapper(in_df, vrbs=False, *args, **kwargs):
        meta_dict = {
            'step_name':func.__name__, # name of the pipeline step
            'in_time':datetime.now(), # time when the pipeline step starts
            'in_df_shape':in_df.shape, # shape of the input dataframe
            'in_df_head':in_df.head(3), # head of the input dataframe
            'in_df_sample':in_df.sample(min(in_df.shape[0], 5)), # sample of the input dataframe
            'in_df_tail':in_df.tail(3), # tail of the input dataframe
        }

        out_df = func(in_df, *args, **kwargs)

        out_time = datetime.now()
        total_time = out_time - meta_dict['in_time']
            
        meta_dict.update({
            'out_time':out_time, # time when the pipeline step stops (same value used for total_time)
            'out_df_shape':out_df.shape, # shape of the output dataframe
            'out_df_head':out_df.head(3), # head of the output dataframe
            'out_df_sample':out_df.sample(min(out_df.shape[0], 5)), # sample of the output dataframe
            'out_df_tail':out_df.tail(3), # tail of the output dataframe
            'total_time':total_time, # difference between in_time and out_time
        })

        if vrbs:
            print('Here we use a fuction to display the information')
            print(meta_dict)
        
        return out_df
    return wrapper

In [None]:
@track
def filter_products(df):
    return df[df["product"] != "Widget C"]

@track
def aggregate_by_product(df):
    return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum()

In [None]:
steps = [
    (filter_products, {}),
    (aggregate_by_product, {}),
]

In [None]:
_df = pipeline(df, steps)

Here we use a fuction to display the information
{'step_name': 'filter_products', 'in_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 72286), 'in_df_shape': (5, 5), 'in_df_head':    order_id   product  quantity  defects  production_time
0       101  Widget A        50        2              120
1       102  Widget B        30        1               95
2       103  Widget A        75        3              150, 'in_df_sample':    order_id   product  quantity  defects  production_time
1       102  Widget B        30        1               95
2       103  Widget A        75        3              150
0       101  Widget A        50        2              120
4       105  Widget B        45        2              110
3       104  Widget C        20        0               80, 'in_df_tail':    order_id   product  quantity  defects  production_time
2       103  Widget A        75        3              150
3       104  Widget C        20        0               80
4       105  Widget B        45  

# display functions
Here we create a sample meta_dict to experiment with different functions for displaying the information.

In [None]:
meta_dict = {
    'step_name': 'filter_products',
    'step_description': 'Exclude products which are Widget C',
    'in_time': datetime.now(),
    'in_df_shape': (5, 5),
    'in_df_head': df.head(3),
    'in_df_sample': df.sample(min(df.shape[0], 5)),
    'in_df_tail': df.tail(3),
    'out_time': datetime.now(),
    'out_df_shape': (4, 5),
    'out_df_head': df[df["product"] != "Widget C"].head(3),
    'out_df_sample': df[df["product"] != "Widget C"].sample(min(4, 5)),
    'out_df_tail': df[df["product"] != "Widget C"].tail(3),
    'total_time': timedelta(microseconds=1287)
}

meta_dict

{'step_name': 'filter_products',
 'step_description': 'Exclude products which are Widget C',
 'in_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 86925),
 'in_df_shape': (5, 5),
 'in_df_head':    order_id   product  quantity  defects  production_time
 0       101  Widget A        50        2              120
 1       102  Widget B        30        1               95
 2       103  Widget A        75        3              150,
 'in_df_sample':    order_id   product  quantity  defects  production_time
 2       103  Widget A        75        3              150
 1       102  Widget B        30        1               95
 3       104  Widget C        20        0               80
 4       105  Widget B        45        2              110
 0       101  Widget A        50        2              120,
 'in_df_tail':    order_id   product  quantity  defects  production_time
 2       103  Widget A        75        3              150
 3       104  Widget C        20        0               80
 4     

In [0]:
#| echo: false
#| output: asis
show_doc(StepMeta)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L37){target="_blank" style="float:right; font-size:smaller"}

### StepMeta

```python

def StepMeta(
    step_name:str, step_description:str, in_time:datetime, in_df_shape:tuple, in_df_head:DataFrame,
    in_df_sample:DataFrame, in_df_tail:DataFrame, out_time:datetime, out_df_shape:tuple, out_df_head:DataFrame,
    out_df_sample:DataFrame, out_df_tail:DataFrame, total_time:timedelta
)->None:


```

*Metadata collected for a single ETL pipeline step.*
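
The signature above reads like a plain data container. As a rough sketch (the real definition lives in the linked source), such a container could be written as a dataclass. The names `StepMetaSketch` and `snapshot` are hypothetical and only mirror the fields listed in the signature.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

import pandas as pd


@dataclass
class StepMetaSketch:
    """Hypothetical stand-in with the same fields as the signature above."""
    step_name: str
    step_description: str
    in_time: datetime
    in_df_shape: tuple
    in_df_head: pd.DataFrame
    in_df_sample: pd.DataFrame
    in_df_tail: pd.DataFrame
    out_time: datetime
    out_df_shape: tuple
    out_df_head: pd.DataFrame
    out_df_sample: pd.DataFrame
    out_df_tail: pd.DataFrame
    total_time: timedelta


def snapshot(df, prefix):
    """Hypothetical helper: collect shape/head/sample/tail for the 'in' or 'out' side."""
    return {
        f"{prefix}_df_shape": df.shape,
        f"{prefix}_df_head": df.head(3),
        f"{prefix}_df_sample": df.sample(min(len(df), 5)),
        f"{prefix}_df_tail": df.tail(3),
    }


demo = pd.DataFrame({"product": ["Widget A", "Widget C"], "quantity": [50, 20]})
t0 = datetime.now()
out = demo[demo["product"] != "Widget C"]
t1 = datetime.now()
meta = StepMetaSketch(
    step_name="filter_products",
    step_description="Exclude Widget C",
    in_time=t0,
    out_time=t1,
    total_time=t1 - t0,
    **snapshot(demo, "in"),
    **snapshot(out, "out"),
)
print(meta.step_name, meta.in_df_shape, "->", meta.out_df_shape)
```

Attribute access (`meta.in_df_shape`) replaces the string keys of the earlier `meta_dict`, so a misspelled field name fails immediately instead of silently returning nothing.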

In [None]:
step_meta = StepMeta(
    step_name='filter_products',
    step_description='Exclude products which are Widget C',
    in_time=datetime.now(),
    in_df_shape=df.shape,
    in_df_head=df.head(3),
    in_df_sample=df.sample(min(df.shape[0], 5)),
    in_df_tail=df.tail(3),
    out_time=datetime.now(),
    out_df_shape=(4, 5),
    out_df_head=df[df["product"] != "Widget C"].head(3),
    out_df_sample=df[df["product"] != "Widget C"].sample(4),
    out_df_tail=df[df["product"] != "Widget C"].tail(3),
    total_time=timedelta(microseconds=1287)
)
step_meta

StepMeta(step_name='filter_products', step_description='Exclude products which are Widget C', in_time=datetime.datetime(2026, 1, 21, 15, 36, 59, 108163), in_df_shape=(5, 5), in_df_head=   order_id   product  quantity  defects  production_time
0       101  Widget A        50        2              120
1       102  Widget B        30        1               95
2       103  Widget A        75        3              150, in_df_sample=   order_id   product  quantity  defects  production_time
3       104  Widget C        20        0               80
4       105  Widget B        45        2              110
0       101  Widget A        50        2              120
1       102  Widget B        30        1               95
2       103  Widget A        75        3              150, in_df_tail=   order_id   product  quantity  defects  production_time
2       103  Widget A        75        3              150
3       104  Widget C        20        0               80
4       105  Widget B        45    

## print name and time

In [None]:
print(15*'*' + ' ' + step_meta.step_name + ' ' + 15*'*')

*************** filter_products ***************


In [None]:
print(f'Total Time: {step_meta.total_time}')
print(f'Start: {step_meta.in_time}')
print(f'End: {step_meta.out_time}')

Total Time: 0:00:00.001287
Start: 2026-01-21 15:36:59.108163
End: 2026-01-21 15:36:59.108396


In [0]:
#| echo: false
#| output: asis
show_doc(format_timedelta)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L80){target="_blank" style="float:right; font-size:smaller"}

### format_timedelta

```python

def format_timedelta(
    td
):


```

In [None]:
print(f'Total Time: {format_timedelta(step_meta.total_time)}')

Total Time: 1.29 ms
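
The body of `format_timedelta` is not shown here, so the following is only a guess at the idea: pick the unit (µs, ms or s) from the magnitude of the duration. The name `format_timedelta_sketch` and the two-decimal rounding are assumptions chosen to reproduce the style of the output above.

```python
from datetime import timedelta


def format_timedelta_sketch(td):
    """Hypothetical formatter: choose µs, ms or s depending on magnitude."""
    us = td // timedelta(microseconds=1)  # whole microseconds as an int
    if abs(us) < 1_000:
        return f"{us} µs"
    if abs(us) < 1_000_000:
        return f"{us / 1_000:.2f} ms"
    return f"{us / 1_000_000:.2f} s"


print(format_timedelta_sketch(timedelta(microseconds=1287)))
print(format_timedelta_sketch(timedelta(microseconds=574)))
print(format_timedelta_sketch(timedelta(seconds=3, microseconds=140237)))
```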


In [0]:
#| echo: false
#| output: asis
show_doc(print_step_name)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L89){target="_blank" style="float:right; font-size:smaller"}

### print_step_name

```python

def print_step_name(
    step_meta:StepMeta
):


```

In [None]:
print_step_name(step_meta)

*************** filter_products ***************


In [0]:
#| echo: false
#| output: asis
show_doc(print_time)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L92){target="_blank" style="float:right; font-size:smaller"}

### print_time

```python

def print_time(
    step_meta:StepMeta
):


```

In [None]:
print_time(step_meta)

Total Time: 1.29 ms

Start: 2026-01-21 15:36:59.108163
  End: 2026-01-21 15:36:59.108396


## print sample df

In [None]:
pd.concat([step_meta.in_df_head, step_meta.in_df_sample, step_meta.in_df_tail])

Unnamed: 0,order_id,product,quantity,defects,production_time
0,101,Widget A,50,2,120
1,102,Widget B,30,1,95
2,103,Widget A,75,3,150
3,104,Widget C,20,0,80
4,105,Widget B,45,2,110
0,101,Widget A,50,2,120
1,102,Widget B,30,1,95
2,103,Widget A,75,3,150
2,103,Widget A,75,3,150
3,104,Widget C,20,0,80


In [None]:
pd.DataFrame(np.nan, index=range(1), columns=step_meta.in_df_head.columns)

Unnamed: 0,order_id,product,quantity,defects,production_time
0,,,,,


In [None]:
pd.DataFrame(np.nan, index=range(1), columns=step_meta.in_df_head.columns).fillna('...')

Unnamed: 0,order_id,product,quantity,defects,production_time
0,...,...,...,...,...


In [None]:
pd.DataFrame(np.nan, index=range(3), columns=meta_dict['in_df_head'].columns).fillna('.')

Unnamed: 0,order_id,product,quantity,defects,production_time
0,.,.,.,.,.
1,.,.,.,.,.
2,.,.,.,.,.


In [None]:
pd.DataFrame(np.nan, index=range(1), columns=step_meta.in_df_head.columns).fillna(':')

Unnamed: 0,order_id,product,quantity,defects,production_time
0,:,:,:,:,:


In [0]:
#| echo: false
#| output: asis
show_doc(fill_between_df_parts)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L99){target="_blank" style="float:right; font-size:smaller"}

### fill_between_df_parts

```python

def fill_between_df_parts(
    df
):


```

In [None]:
fill_between_df_parts(meta_dict['in_df_head'])

Unnamed: 0,order_id,product,quantity,defects,production_time
,:,:,:,:,:


In [None]:
pd.concat(
    [step_meta.in_df_head, 
    fill_between_df_parts(step_meta.in_df_head),
    step_meta.in_df_sample, 
    fill_between_df_parts(step_meta.in_df_head),
    step_meta.in_df_tail]
    )

Unnamed: 0,order_id,product,quantity,defects,production_time
0.0,101,Widget A,50,2,120
1.0,102,Widget B,30,1,95
2.0,103,Widget A,75,3,150
,:,:,:,:,:
3.0,104,Widget C,20,0,80
4.0,105,Widget B,45,2,110
0.0,101,Widget A,50,2,120
1.0,102,Widget B,30,1,95
2.0,103,Widget A,75,3,150
,:,:,:,:,:


We want to build two functions:
1. a display function for within our wrapper which only uses the meta_dict
2. a display function that just takes in one df and gets the head, sample, tail from there

In the future we might want to merge these into one function, but for now we don't want to store the whole df in meta_dict.
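
The second variant (one df in, head/sample/tail out) could look roughly like the sketch below. The function name `sample_with_separators`, its parameters and the use of an empty-string index label for the filler rows are assumptions made for illustration; they are not the package's implementation.

```python
import pandas as pd


def sample_with_separators(df, n_head=3, n_sample=5, n_tail=3, filler=":"):
    """Hypothetical helper: head, random sample and tail, separated by filler rows."""
    # One filler row with an empty-string index label marks a truncation point.
    sep = pd.DataFrame(filler, index=[""], columns=df.columns)
    parts = [
        df.head(n_head),
        sep,
        df.sample(min(len(df), n_sample)),  # never ask for more rows than exist
        sep,
        df.tail(n_tail),
    ]
    # Cast to object so numeric columns and the string filler can share a column.
    return pd.concat([p.astype(object) for p in parts])


demo = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "product": ["Widget A", "Widget B", "Widget A", "Widget C", "Widget B"],
    "quantity": [50, 30, 75, 20, 45],
})
preview = sample_with_separators(demo)
print(preview)
```

Using a string label for the filler rows keeps the integer row labels from being converted to floats, which is what happens when the filler row carries a NaN index.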

In [0]:
#| echo: false
#| output: asis
show_doc(print_sample_from_meta_dict)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L103){target="_blank" style="float:right; font-size:smaller"}

### print_sample_from_meta_dict

```python

def print_sample_from_meta_dict(
    step_meta:StepMeta, mode:str='in'
):


```

In [None]:
print_sample_from_meta_dict(step_meta)

  order_id   product quantity defects production_time
0      101  Widget A       50       2             120
1      102  Widget B       30       1              95
2      103  Widget A       75       3             150
         :         :        :       :               :
3      104  Widget C       20       0              80
4      105  Widget B       45       2             110
0      101  Widget A       50       2             120
1      102  Widget B       30       1              95
2      103  Widget A       75       3             150
         :         :        :       :               :
2      103  Widget A       75       3             150
3      104  Widget C       20       0              80
4      105  Widget B       45       2             110


In [0]:
#| echo: false
#| output: asis
show_doc(display_sample_from_df)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L113){target="_blank" style="float:right; font-size:smaller"}

### display_sample_from_df

```python

def display_sample_from_df(
    df
):


```

In [None]:
display_sample_from_df(df)

Unnamed: 0,order_id,product,quantity,defects,production_time
0.0,101,Widget A,50,2,120
1.0,102,Widget B,30,1,95
2.0,103,Widget A,75,3,150
,:,:,:,:,:
2.0,103,Widget A,75,3,150
4.0,105,Widget B,45,2,110
0.0,101,Widget A,50,2,120
1.0,102,Widget B,30,1,95
3.0,104,Widget C,20,0,80
,:,:,:,:,:


We will also have to consider what happens if we have too many columns.

Our idea is to truncate them as well and add a dummy column containing '...' as a sign that it's truncated.
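
A minimal sketch of that column truncation idea, assuming we keep the first and last few columns and put a single marker column in between. The name `truncate_columns` and the `max_cols` parameter are hypothetical.

```python
import pandas as pd


def truncate_columns(df, max_cols=6, marker="..."):
    """Hypothetical helper: keep the outer columns and insert a marker column."""
    n_cols = df.shape[1]
    if n_cols <= max_cols:
        return df  # nothing to truncate
    n_left = max_cols // 2 + max_cols % 2  # left side gets the extra column if odd
    n_right = max_cols // 2
    left = df.iloc[:, :n_left]
    right = df.iloc[:, n_cols - n_right:]
    # Dummy column filled with the marker, aligned to the same row index
    middle = pd.DataFrame(marker, index=df.index, columns=[marker])
    return pd.concat([left, middle, right], axis=1)


wide = pd.DataFrame({f"c{i}": [i, i + 1] for i in range(10)})
print(truncate_columns(wide, max_cols=4))
```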

## print shape and shape change

In [None]:
print('(rows, columns) =', step_meta.in_df_shape)
print('        |          ')
print('        V          ')
print('(rows, columns) =', step_meta.out_df_shape)

(rows, columns) = (5, 5)
        |          
        V          
(rows, columns) = (4, 5)


In [None]:
print('(rows, columns) =', step_meta.in_df_shape)
print(' ↓ '*8)
print('(rows, columns) =', step_meta.out_df_shape)

(rows, columns) = (5, 5)
 ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓ 
(rows, columns) = (4, 5)


In [None]:
in_str = f"{step_meta.in_df_shape[0]} rows, {step_meta.in_df_shape[1]} columns"
print(in_str)
print('↓ ↓ ↓'.center(len(in_str), ' '))
print(f"{step_meta.out_df_shape[0]} rows, {step_meta.out_df_shape[1]} columns")

5 rows, 5 columns
      ↓ ↓ ↓      
4 rows, 5 columns


Is there a convenient way to space out a number of strings over a specific length?

##### 🤖Reply🤖

You could use string formatting with padding! Here are a couple of approaches:

**1. Using f-strings with alignment:**
```python
s1, s2, s3 = "5 rows", "4 columns", "time: 1.2ms"
total_width = 50
print(f"{s1:<{total_width//3}}{s2:<{total_width//3}}{s3}")
```

**2. Using `str.ljust()` / `str.center()` / `str.rjust()`:**
```python
width = 20
print(s1.ljust(width) + s2.center(width) + s3.rjust(width))
```

**3. For evenly spacing multiple strings, you could calculate spacing:**
```python
def space_strings(strings, total_width):
    gap = (total_width - sum(len(s) for s in strings)) // (len(strings) - 1)
    return (' ' * gap).join(strings)
```

Which approach fits what you're trying to do with your pipeline display?

In [None]:
s1, s2, s3 = "5 rows", "4 columns", "time: 1.2ms"
total_width = 50
print(f"{s1:<{total_width//3}}{s2:<{total_width//3}}{s3}")

5 rows          4 columns       time: 1.2ms


In [None]:
width = 20
print(s1.ljust(width) + s2.center(width) + s3.rjust(width))

5 rows                   4 columns               time: 1.2ms


In [0]:
#| echo: false
#| output: asis
show_doc(space_strings)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L123){target="_blank" style="float:right; font-size:smaller"}

### space_strings

```python

def space_strings(
    strings, total_width
):


```

In [None]:
space_strings(['|', '|', '|'], 20)

'|        |        |'

In [None]:
len(space_strings(['|', '|', '|'], 20))

19
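
The result is 19 characters wide rather than 20 because the integer division in `space_strings` drops the remainder. If an exact width matters, one option is to hand the leftover spaces to the first gaps. The following is a sketch with a hypothetical name, `space_strings_exact`, not the function from the package.

```python
def space_strings_exact(strings, total_width):
    """Hypothetical variant: spread strings over exactly total_width characters."""
    if len(strings) < 2:
        # no gaps to distribute, so just pad on the right
        return "".join(strings).ljust(total_width)
    n_gaps = len(strings) - 1
    free = max(total_width - sum(len(s) for s in strings), 0)
    gap, extra = divmod(free, n_gaps)
    pieces = []
    for i, s in enumerate(strings[:-1]):
        pieces.append(s)
        # the first `extra` gaps get one additional space
        pieces.append(" " * (gap + (1 if i < extra else 0)))
    pieces.append(strings[-1])
    return "".join(pieces)


print(repr(space_strings_exact(["|", "|", "|"], 20)))
print(len(space_strings_exact(["|", "|", "|"], 20)))
```

This version also avoids the division by zero that a single-element list would trigger in the original formula.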

In [None]:
in_str = f"{step_meta.in_df_shape[0]} rows, {step_meta.in_df_shape[1]} columns"
print(in_str)
print(space_strings(['↓','↓','↓'], len(in_str)))
print(f"{step_meta.out_df_shape[0]} rows, {step_meta.out_df_shape[1]} columns")

5 rows, 5 columns
↓       ↓       ↓
4 rows, 5 columns


In [None]:
in_rows, in_cols = step_meta.in_df_shape
out_rows, out_cols = step_meta.out_df_shape

in_str = f"{in_rows} rows, {in_cols} columns"
print(in_str)
print(space_strings(['↓','↓','↓'], len(in_str)))

print(f'{str(out_rows-in_rows).center(len(in_str)//2)}', f'{str(out_cols-in_cols).center(len(in_str)//2)}')

print(space_strings(['↓','↓','↓'], len(in_str)))
print(f"{out_rows} rows, {out_cols} columns")

5 rows, 5 columns
↓       ↓       ↓
   -1       0    
↓       ↓       ↓
4 rows, 5 columns


In [None]:
in_rows, in_cols = step_meta.in_df_shape
out_rows, out_cols = step_meta.out_df_shape
diff_rows, diff_cols = out_rows - in_rows, out_cols - in_cols

diff_rows_str = f"{diff_rows:+d}" if diff_rows != 0 else "0"
diff_cols_str = f"{diff_cols:+d}" if diff_cols != 0 else "0"

# Right-align each column
row_width = max(len(str(in_rows)), len(str(out_rows)), len(diff_rows_str))
col_width = max(len(str(in_cols)), len(str(out_cols)), len(diff_cols_str))

print(f"Input:  {in_rows:>{row_width}} rows, {in_cols:>{col_width}} cols")
print(f"        {' '*row_width}   ↓   {' '*col_width}   ↓")
print(f"Diff:   {diff_rows_str:>{row_width}} rows, {diff_cols_str:>{col_width}} cols")
print(f"        {' '*row_width}   ↓   {' '*col_width}   ↓")
print(f"Output: {out_rows:>{row_width}} rows, {out_cols:>{col_width}} cols")

Input:   5 rows, 5 cols
             ↓       ↓
Diff:   -1 rows, 0 cols
             ↓       ↓
Output:  4 rows, 5 cols


In [None]:
print(f"""
Input:  {in_rows:>{row_width}} rows, {in_cols:>{col_width}} cols
        {' '*row_width}   ↓   {' '*col_width}   ↓
Diff:   {diff_rows_str:>{row_width}} rows, {diff_cols_str:>{col_width}} cols
        {' '*row_width}   ↓   {' '*col_width}   ↓
Output: {out_rows:>{row_width}} rows, {out_cols:>{col_width}} cols
""")


Input:   5 rows, 5 cols
             ↓       ↓
Diff:   -1 rows, 0 cols
             ↓       ↓
Output:  4 rows, 5 cols



In [0]:
#| echo: false
#| output: asis
show_doc(print_shape_change)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L128){target="_blank" style="float:right; font-size:smaller"}

### print_shape_change

```python

def print_shape_change(
    step_meta:StepMeta
):


```

In [None]:
print_shape_change(step_meta)


        Input:   5 rows, 5 cols
                     ↓       ↓
        Diff:   -1 rows, 0 cols
                     ↓       ↓
        Output:  4 rows, 5 cols
        


## print docstring

In [None]:
@track
def my_func():
    """This is the docstring."""
    pass

print(my_func.__doc__)

This is the docstring.
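
Reading `my_func.__doc__` through the decorator only works when the wrapper copies the metadata of the wrapped function, which is what `functools.wraps` does. A small sketch of the difference; the decorator names `plain` and `preserving` are made up for this illustration.

```python
from functools import wraps


def plain(func):
    # No @wraps: the returned function keeps the wrapper's own name and docstring
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def preserving(func):
    @wraps(func)  # copies __name__, __doc__ and other attributes from func
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@plain
def step_a():
    """Docstring of step_a."""


@preserving
def step_b():
    """Docstring of step_b."""


print(step_a.__name__, step_a.__doc__)
print(step_b.__name__, step_b.__doc__)
```

Without `wraps`, both the step name and the step description shown in the report would come from `wrapper` instead of the decorated function.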


In [0]:
#| echo: false
#| output: asis
show_doc(print_step_description)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L150){target="_blank" style="float:right; font-size:smaller"}

### print_step_description

```python

def print_step_description(
    step_meta:StepMeta
):


```

In [None]:
print_step_description(step_meta)

'''Exclude products which are Widget C'''


## print all function

In [0]:
#| echo: false
#| output: asis
show_doc(print_step_info)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L153){target="_blank" style="float:right; font-size:smaller"}

### print_step_info

```python

def print_step_info(
    step_meta:StepMeta
):


```

In [None]:
print_step_info(step_meta)

*************** filter_products ***************
'''Exclude products which are Widget C'''


Total Time: 1.29 ms

Start: 2026-01-21 15:36:59.108163
  End: 2026-01-21 15:36:59.108396

Input DataFrame:
  order_id   product quantity defects production_time
0      101  Widget A       50       2             120
1      102  Widget B       30       1              95
2      103  Widget A       75       3             150
         :         :        :       :               :
3      104  Widget C       20       0              80
4      105  Widget B       45       2             110
0      101  Widget A       50       2             120
1      102  Widget B       30       1              95
2      103  Widget A       75       3             150
         :         :        :       :               :
2      103  Widget A       75       3             150
3      104  Widget C       20       0              80
4      105  Widget B       45       2             110

        Input:   5 rows, 5 cols
            

# Testing the pipeline

In [0]:
#| echo: false
#| output: asis
show_doc(pipeline)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L166){target="_blank" style="float:right; font-size:smaller"}

### pipeline

```python

def pipeline(
    df, steps, vrbs_default:bool=True
):


```

In [0]:
#| echo: false
#| output: asis
show_doc(track)

---

[source](https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L175){target="_blank" style="float:right; font-size:smaller"}

### track

```python

def track(
    func
):


```

In [None]:
@track
def filter_products(df):
    '''Exclude products which are Widget C.'''
    return df[df["product"] != "Widget C"]

In [None]:
@track
def aggregate_by_product(df):
    '''aggregate by product and sum columns "quantity", "defects", "production_time"'''
    return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum()

In [None]:
steps = [
    (filter_products, {'vrbs':True}),
    (aggregate_by_product, {}),
]

In [None]:
_df = pipeline(df, steps, vrbs_default=False)

*************** filter_products ***************
'''Exclude products which are Widget C.'''


Total Time: 574 µs

Start: 2026-01-21 15:36:59.369533
  End: 2026-01-21 15:36:59.370107

Input DataFrame:
  order_id   product quantity defects production_time
0      101  Widget A       50       2             120
1      102  Widget B       30       1              95
2      103  Widget A       75       3             150
         :         :        :       :               :
3      104  Widget C       20       0              80
2      103  Widget A       75       3             150
1      102  Widget B       30       1              95
0      101  Widget A       50       2             120
4      105  Widget B       45       2             110
         :         :        :       :               :
2      103  Widget A       75       3             150
3      104  Widget C       20       0              80
4      105  Widget B       45       2             110

        Input:   5 rows, 5 cols
           

In [None]:
_df = pipeline(df, steps, vrbs_default=True)

*************** filter_products ***************
'''Exclude products which are Widget C.'''


Total Time: 401 µs

Start: 2026-01-21 15:36:59.380798
  End: 2026-01-21 15:36:59.381199

Input DataFrame:
  order_id   product quantity defects production_time
0      101  Widget A       50       2             120
1      102  Widget B       30       1              95
2      103  Widget A       75       3             150
         :         :        :       :               :
3      104  Widget C       20       0              80
1      102  Widget B       30       1              95
4      105  Widget B       45       2             110
2      103  Widget A       75       3             150
0      101  Widget A       50       2             120
         :         :        :       :               :
2      103  Widget A       75       3             150
3      104  Widget C       20       0              80
4      105  Widget B       45       2             110

        Input:   5 rows, 5 cols
           