# `mlarena.utils.data_utils` Demo

This notebook demonstrates the data cleaning and manipulation utilities in the `mlarena.utils.data_utils` module.

In [3]:
import mlarena.utils.data_utils as dut
import pandas as pd

# 1. Transform Date Columns

Dataframes often store dates as strings. The `transform_date_cols` function converts such columns to datetime.

- Flexible input handling: Works with either a single column or multiple columns
- Format customization: Supports any date format using standard Python strftime directives
    - %d: Day of the month as a zero-padded decimal (e.g., 25)
    - %m: Month as a zero-padded decimal number (e.g., 08)
    - %b: Abbreviated month name (e.g., Aug)
    - %Y: Four-digit year (e.g., 2024)
- Smart case handling: Automatically normalizes month abbreviations (like 'JAN', 'jan', 'Jan') when using %b format
- Type safety: Preserves existing datetime columns without unnecessary conversion
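
To make the behaviors listed above concrete, here is a minimal pandas-only sketch of how a helper with these properties could be written. It is not mlarena's implementation: the function name `parse_date_cols_sketch` is hypothetical, and normalizing case with `.str.title()` is an assumption about one reasonable way to handle `%b`.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def parse_date_cols_sketch(df, cols, fmt):
    """Hypothetical stand-in for illustration; not mlarena's code."""
    # Accept either a single column name or a list of names
    if isinstance(cols, str):
        cols = [cols]
    out = df.copy()  # leave the caller's dataframe untouched
    for col in cols:
        # Skip columns that are already datetime
        if is_datetime64_any_dtype(out[col]):
            continue
        values = out[col]
        # Assumption: title-case the strings so 'AUG' and 'aug' become 'Aug'
        if "%b" in fmt:
            values = values.str.title()
        out[col] = pd.to_datetime(values, format=fmt)
    return out


demo = pd.DataFrame({"d": ["25Aug2024", "15AUG2024", "01aug2024"]})
parsed = parse_date_cols_sketch(demo, "d", "%d%b%Y")
print(parsed["d"].dt.strftime("%Y-%m-%d").tolist())
# ['2024-08-25', '2024-08-15', '2024-08-01']
```

The sketch returns a copy instead of modifying its input, which matches how the cells below reassign the result to `df_result`. Whether `transform_date_cols` itself copies or mutates its input is not stated here.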


In [12]:
# Sample DataFrame with different date formats
df_test = pd.DataFrame({
    "date1": ["20240101", "20240215", "20240320"],
    "date2": ["25-08-2024", "15-09-2024", "01-10-2024"],
    "date3": ["25Aug2024", "15AUG2024", "01aug2024"],  # different cases
    "date4": ["20240801", "20240915", "20240311"],
    "not_a_date": [123, "abc", None]
})
print(df_test.dtypes)

date1         object
date2         object
date3         object
date4         object
not_a_date    object
dtype: object


In [13]:
# Apply the function
df_result = dut.transform_date_cols(df_test, ["date1", "date4"], "%Y%m%d")  # accepts a list of columns
df_result = dut.transform_date_cols(df_result, "date2", "%d-%m-%Y")  # accepts a single column name
df_result = dut.transform_date_cols(df_result, ["date3"], "%d%b%Y")  # mixed-case month abbreviations are handled automatically

# Display result
print(df_result.dtypes)
print(df_result)

date1         datetime64[ns]
date2         datetime64[ns]
date3         datetime64[ns]
date4         datetime64[ns]
not_a_date            object
dtype: object
       date1      date2      date3      date4 not_a_date
0 2024-01-01 2024-08-25 2024-08-25 2024-08-01        123
1 2024-02-15 2024-09-15 2024-08-15 2024-09-15        abc
2 2024-03-20 2024-10-01 2024-08-01 2024-03-11       None
