# `mlarena.utils.data_utils` Demo

This notebook demonstrates the data cleaning and manipulation utilities in the `mlarena.utils.data_utils` module.

In [19]:
import mlarena.utils.data_utils as dut
import pandas as pd
import numpy as np

# 1. Transform Date Columns

DataFrames often store dates as strings. The `transform_date_cols` function converts such columns to datetime.

- Flexible input handling: Works with either a single column or multiple columns
- Format customization: Supports any date format using standard Python strftime directives
    - `%d`: Day of the month as a zero-padded decimal number (e.g., 25)
    - `%m`: Month as a zero-padded decimal number (e.g., 08)
    - `%b`: Abbreviated month name (e.g., Aug)
    - `%Y`: Four-digit year (e.g., 2024)
- Smart case handling: Automatically normalizes month abbreviations (like 'JAN', 'jan', 'Jan') when using the `%b` directive
- Type safety: Preserves existing datetime columns without unnecessary conversion


In [20]:
# Sample DataFrame with different date formats
df_test = pd.DataFrame({
    "date1": ["20240101", "20240215", "20240320"],
    "date2": ["25-08-2024", "15-09-2024", "01-10-2024"],
    "date3": ["25Aug2024", "15AUG2024", "01aug2024"],  # different cases
    "date4": ["20240801", "20240915", "20240311"],
    "not_a_date": [123, "abc", None]
})
print(df_test.dtypes)

date1         object
date2         object
date3         object
date4         object
not_a_date    object
dtype: object


In [21]:
# Apply the function
df_result = dut.transform_date_cols(df_test, ["date1", "date4"], "%Y%m%d")  # accepts a list of columns
df_result = dut.transform_date_cols(df_result, "date2", "%d-%m-%Y")  # accepts a single column name
df_result = dut.transform_date_cols(df_result, ["date3"], "%d%b%Y")  # mixed-case month abbreviations are handled automatically

# Display result
print(df_result.dtypes)
print(df_result)

date1         datetime64[ns]
date2         datetime64[ns]
date3         datetime64[ns]
date4         datetime64[ns]
not_a_date            object
dtype: object
       date1      date2      date3      date4 not_a_date
0 2024-01-01 2024-08-25 2024-08-25 2024-08-01        123
1 2024-02-15 2024-09-15 2024-08-15 2024-09-15        abc
2 2024-03-20 2024-10-01 2024-08-01 2024-03-11       None
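
To see the format directives on their own, here is a minimal sketch that uses only the standard library's `datetime.strptime`. It is not how `transform_date_cols` is implemented; it only illustrates how the format strings used above map onto the sample values. It assumes an English locale, because `%b` is locale-dependent.

```python
from datetime import datetime

# (text, format) pairs taken from the sample DataFrame above.
samples = [
    ("20240215", "%Y%m%d"),
    ("25-08-2024", "%d-%m-%Y"),
    ("15AUG2024", "%d%b%Y"),  # upper-case month abbreviation
    ("01aug2024", "%d%b%Y"),  # lower-case month abbreviation
]

# CPython's strptime matches month abbreviations case-insensitively,
# so the mixed-case values parse without extra normalization.
parsed = [datetime.strptime(text, fmt) for text, fmt in samples]

for (text, fmt), value in zip(samples, parsed):
    print(f"{text!r} with {fmt!r} -> {value.date()}")
```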


# 2. Clean Dollar Columns
It is common for a dataframe to have dollar amount columns stored as strings with currency symbols and commas. The `clean_dollar_cols` function helps you transform these into numeric values.

- Flexible input handling: Works with either a single column or multiple columns
- Symbol stripping: Removes currency symbols and commas from the column(s)
- Type conversion: Converts the cleaned strings to float values for numerical analysis


In [22]:
df_dollars = pd.DataFrame({
    'price': ['$1,234.56', '$2,345.67', '$3,456.78'],
    'revenue': ['12,000', '', '$30,000'],
    'other': ['A', 'B', 'C']
})

print("Original DataFrame:")
print(df_dollars)
print("\nDtypes:")
print(df_dollars.dtypes)

df_cleaned = dut.clean_dollar_cols(df_dollars, ['price', 'revenue'])

print("\nCleaned DataFrame:")
print(df_cleaned)
print("\nDtypes:")
print(df_cleaned.dtypes)

Original DataFrame:
       price  revenue other
0  $1,234.56   12,000     A
1  $2,345.67              B
2  $3,456.78  $30,000     C

Dtypes:
price      object
revenue    object
other      object
dtype: object

Cleaned DataFrame:
     price  revenue other
0  1234.56  12000.0     A
1  2345.67      NaN     B
2  3456.78  30000.0     C

Dtypes:
price      float64
revenue    float64
other       object
dtype: object
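
For comparison, the same kind of cleanup can be sketched in plain pandas. This is an illustration of the general technique (strip symbols, then convert), not the implementation of `clean_dollar_cols`; the sample values are made up to mirror the cell above.

```python
import pandas as pd

revenue = pd.Series(["12,000", "", "$30,000", "$1,234.56"])

# Remove dollar signs and thousands separators.
# regex=False makes "$" a literal character rather than a regex anchor.
stripped = (
    revenue.str.replace("$", "", regex=False)
           .str.replace(",", "", regex=False)
)

# With errors="coerce", values that cannot be parsed become NaN
# instead of raising; the empty string ends up as NaN here.
as_float = pd.to_numeric(stripped, errors="coerce")
print(as_float)
```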


# 3. Value Counts with Percent
The `value_counts_with_pct` function enhances pandas' built-in `value_counts` by adding percentage information alongside counts.

- Comprehensive view: Shows both raw counts and percentages in a single output
- Flexible NA handling: NA values are included in the counts unless you pass `dropna=True`
- Clear formatting: Percentages are formatted with a specified number of decimal places
- Sorted results: Values are sorted by frequency for easy interpretation
- Useful for: Quick categorical data profiling, understanding class distributions, and reporting

In [23]:
df_categories = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green', 'blue', 'red', None],
    'size': ['S', 'M', 'L', 'M', 'S', 'L', 'M']
})

print("Value counts for 'color' (including NA):")
print(dut.value_counts_with_pct(df_categories, 'color'))

print("\nValue counts for 'color' (excluding NA):")
print(dut.value_counts_with_pct(df_categories, 'color', dropna=True))

print("\nValue counts for ['color', 'size']:")
print(dut.value_counts_with_pct(df_categories, ['color','size']))


Value counts for 'color' (including NA):
   color  count    pct
0    red      3  42.86
1   blue      2  28.57
2  green      1  14.29
3   None      1  14.29

Value counts for 'color' (excluding NA):
   color  count    pct
0    red      3  50.00
1   blue      2  33.33
2  green      1  16.67

Value counts for ['color', 'size']:
   color size  count    pct
0    red    L      2  28.57
1   blue    M      1  14.29
2   blue    S      1  14.29
3  green    M      1  14.29
4    red    S      1  14.29
5    NaN    M      1  14.29
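
The count-plus-percentage idea can be sketched for a single column with plain pandas. This is only an approximation of what `value_counts_with_pct` reports, not its implementation; in particular, the values stay in the index here rather than becoming a column.

```python
import pandas as pd

colors = pd.Series(
    ["red", "blue", "red", "green", "blue", "red", None], name="color"
)

# dropna=False keeps missing values as their own category.
counts = colors.value_counts(dropna=False)

# Percentages are derived from the counts, so both share the same index.
pct = (counts / counts.sum() * 100).round(2)

summary = pd.DataFrame({"count": counts, "pct": pct})
print(summary)
```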


# 4. Drop Fully Null Columns
The `drop_fully_null_cols` function is specifically designed to prevent issues with Databricks' `display()` function, which can break when encountering columns that are entirely null (as it cannot infer the schema).

- Prevents Databricks display errors: Removes columns that would cause schema inference issues
- Safe operation: Returns a new DataFrame without modifying the original
- Common usage: `drop_fully_null_cols(df).display()` in Databricks notebooks.

In [24]:
df_nulls = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [np.nan, np.nan, np.nan],  # Fully null
    'col3': ['A', None, 'C'],
    'col4': [None, None, None]  # Fully null
})

print("Original DataFrame:")
print(df_nulls)

df_cleaned = dut.drop_fully_null_cols(df_nulls, verbose=True)

print("\nCleaned DataFrame:")
print(df_cleaned) 

Original DataFrame:
   col1  col2  col3  col4
0     1   NaN     A  None
1     2   NaN  None  None
2     3   NaN     C  None
🗑️ Dropped fully-null columns: ['col2', 'col4']

Cleaned DataFrame:
   col1  col3
0     1     A
1     2  None
2     3     C
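
Outside of `mlarena`, a similar result is available from pandas directly. The sketch below is an assumption about an equivalent approach, not the function's source: `DataFrame.dropna(axis=1, how="all")` returns a new DataFrame without the all-null columns and leaves the original untouched.

```python
import numpy as np
import pandas as pd

df_nulls = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [np.nan, np.nan, np.nan],  # fully null
    "col3": ["A", None, "C"],
    "col4": [None, None, None],        # fully null
})

# Identify the columns in which every value is missing.
fully_null = [col for col in df_nulls.columns if df_nulls[col].isna().all()]

# Drop them; how="all" only removes columns with no non-null value.
trimmed = df_nulls.dropna(axis=1, how="all")

print("Fully null:", fully_null)
print(trimmed)
```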


# 5. Print Schema Alphabetically
The `print_schema_alphabetically` function is useful when exploring wide DataFrames. Sorting column names alphabetically makes it easier to:

- Quickly locate specific columns in large datasets
- Compare schemas between different DataFrames to identify missing or additional columns
- Maintain a consistent view of your data structure regardless of the original column order
- Simplify documentation and reporting of data structures

In [25]:
df = pd.DataFrame({
    'z_price': [100.5, 200.5, 300.5],
    'a_category': ['A', 'B', 'C'],
    'm_date': pd.date_range('2024-01-01', periods=3),
    'b_is_active': [True, False, True],
    'y_quantity': np.array([1, 2, 3], dtype='int32')
})

print("Original DataFrame:")
print(df)
print("\nSchema in alphabetical order:")
dut.print_schema_alphabetically(df)

Original DataFrame:
   z_price a_category     m_date  b_is_active  y_quantity
0    100.5          A 2024-01-01         True           1
1    200.5          B 2024-01-02        False           2
2    300.5          C 2024-01-03         True           3

Schema in alphabetical order:
a_category             object
b_is_active              bool
m_date         datetime64[ns]
y_quantity              int32
z_price               float64
dtype: object
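
The schema-comparison use case can be sketched with plain pandas. The two small DataFrames below (`left`, `right`) are hypothetical, and the approach (sorting `dtypes` by index and diffing the column sets) is one possible way to do this rather than what `print_schema_alphabetically` does internally.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({
    "z_price": [100.5],
    "a_category": ["A"],
    "y_quantity": np.array([1], dtype="int32"),
})
right = pd.DataFrame({
    "a_category": ["B"],
    "m_date": pd.to_datetime(["2024-01-01"]),
    "z_price": [200.5],
})

# dtypes is a Series indexed by column name, so sort_index() orders it by name.
left_schema = left.dtypes.sort_index()
right_schema = right.dtypes.sort_index()
print(left_schema)
print(right_schema)

# Set differences show which columns are missing on either side.
only_in_left = sorted(set(left.columns) - set(right.columns))
only_in_right = sorted(set(right.columns) - set(left.columns))
print("Only in left:", only_in_left)
print("Only in right:", only_in_right)
```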


# 6. Check Primary Key
The `is_primary_key` function helps verify if a column or combination of columns could serve as a primary key in a DataFrame.

A traditional primary key must satisfy two key requirements:
1. Uniqueness: Each combination of values must be unique across all rows
2. No null values: Primary key columns cannot contain null/missing values

However, in real-world data analysis, we often encounter datasets where potential key columns contain some missing values. This function takes a practical approach by:
1. Alerting you about any missing values in the potential key columns
2. Checking if the columns would form a unique identifier after removing rows with missing values

This function is useful for:
- Data quality assessment: Understanding the completeness and uniqueness of your key fields
- Database schema design: Identifying potential primary keys even in imperfect data
- ETL validation: Verifying key constraints while being aware of data quality issues
- Data integrity checks: Ensuring uniqueness for joins/merges after handling missing values

The function accepts either a single column name or a list of columns, making it flexible for checking both simple and composite keys.
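
The two-step check described above (report missing values, then test uniqueness on the remaining rows) can be sketched in plain pandas. `check_key` is a hypothetical helper written for this illustration; it is not the implementation of `is_primary_key`, and it does not handle empty DataFrames or column names that do not exist.

```python
import pandas as pd

def check_key(df, cols):
    """Hypothetical helper: return (rows_with_missing, unique_after_dropping)."""
    cols = [cols] if isinstance(cols, str) else list(cols)

    # Step 1: count rows where any candidate key column is missing.
    rows_with_missing = int(df[cols].isna().any(axis=1).sum())

    # Step 2: drop those rows, then look for duplicated key combinations.
    complete = df.dropna(subset=cols)
    is_unique = not bool(complete.duplicated(subset=cols).any())
    return rows_with_missing, is_unique

sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "category": ["A", "B", "A", "B", "C"],
    "code": ["X1", None, "X3", "X4", "X5"],
})

print(check_key(sample, "id"))
print(check_key(sample, ["category"]))
print(check_key(sample, ["code"]))
```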

In [26]:
# Create sample DataFrame with different primary key scenarios
df = pd.DataFrame({
    # Single column primary key
    'id': [1, 2, 3, 4, 5],
    
    # Column with duplicates
    'category': ['A', 'B', 'A', 'B', 'C'],
    
    # Date column with some duplicates
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03'],
    
    # Column with null values
    'code': ['X1', None, 'X3', 'X4', 'X5'],
    
    # Values column
    'value': [100, 200, 300, 400, 500]
})

# Test 1: Single column that is a primary key
print("\nTest 1: Single column primary key")
dut.is_primary_key(df, ['id'])  # Should return True

# Test 2: Single column that is not a primary key (has duplicates)
print("\nTest 2: Column with duplicates")
dut.is_primary_key(df, ['category'])  # Should return False

# Test 3: Multiple columns that together form a primary key
print("\nTest 3: Composite primary key")
dut.is_primary_key(df, ['category', 'date'])  # Should return True

# Test 4: Column with null values
print("\nTest 4: Column with null values")
dut.is_primary_key(df, ['code','date'])  # Flags the missing 'code' value, then checks uniqueness on the remaining rows

# Test 5: Empty DataFrame
print("\nTest 5: Empty DataFrame")
empty_df = pd.DataFrame(columns=['id', 'value'])
dut.is_primary_key(empty_df, ['id'])  # Should return False

# Test 6: Non-existent column
print("\nTest 6: Non-existent column")
dut.is_primary_key(df, ['not_a_column'])  # Should return False


Test 1: Single column primary key
✅ There are no missing values in column 'id'.
ℹ️ Total row count after filtering out missings: 5
ℹ️ Unique row count after filtering out missings: 5
🔑 The column(s) 'id' form a primary key.

Test 2: Column with duplicates
✅ There are no missing values in column 'category'.
ℹ️ Total row count after filtering out missings: 5
ℹ️ Unique row count after filtering out missings: 3
❌ The column(s) 'category' do not form a primary key.

Test 3: Composite primary key
✅ There are no missing values in columns 'category', 'date'.
ℹ️ Total row count after filtering out missings: 5
ℹ️ Unique row count after filtering out missings: 5
🔑 The column(s) 'category', 'date' form a primary key.

Test 4: Column with null values
⚠️ There are 1 row(s) with missing values in column 'code'.
✅ There are no missing values in column 'date'.
ℹ️ Total row count after filtering out missings: 4
ℹ️ Unique row count after filtering out missings: 4
🔑 The column(s) 'code', 'date' form a pr

False

# 7. Select Existing Columns
The `select_existing_cols` function provides a safe way to select columns from a DataFrame, handling cases where some requested columns might not exist.

- Safe column selection: Returns only columns that exist in the DataFrame
- Case sensitivity options: Can match column names exactly or case-insensitively
- Verbose mode: Optional detailed output about which columns were found/missing
- Useful for: Data pipeline robustness, handling dynamic column selections

In [27]:
# Create sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
    'Mixed_Case': [10, 11, 12]
})

print("Original DataFrame:")
print(df)

# Example 1: Basic usage
print("\nExample 1: Select existing columns")
result1 = dut.select_existing_cols(df, ['A', 'C', 'D'])
print("Selected 'A', 'C', 'D' (D doesn't exist):")
print(result1)

# Example 2: Case-insensitive matching
print("\nExample 2: Case-insensitive matching")
result2 = dut.select_existing_cols(df, ['a', 'mixed_case'], strict=False, verbose=True)
print(result2)

# Example 3: Verbose output
print("\nExample 3: Verbose output with missing columns")
result3 = dut.select_existing_cols(df, ['A', 'Missing1', 'B', 'Missing2'], verbose=True)
print(result3)

Original DataFrame:
   A  B  C  Mixed_Case
0  1  4  7          10
1  2  5  8          11
2  3  6  9          12

Example 1: Select existing columns
Selected 'A', 'C', 'D' (D doesn't exist):
   A  C
0  1  7
1  2  8
2  3  9

Example 2: Case-insensitive matching
✅ Columns found: ['A', 'Mixed_Case']
   A  Mixed_Case
0  1          10
1  2          11
2  3          12

Example 3: Verbose output with missing columns
✅ Columns found: ['A', 'B']
⚠️ Columns not found: ['Missing1', 'Missing2']
   A  B
0  1  4
1  2  5
2  3  6
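
The selection logic can be sketched without the library as follows. `pick_existing` and its `case_sensitive` parameter are hypothetical names chosen for this illustration; they are not part of `mlarena`, whose own function uses `strict` and `verbose` as shown above.

```python
import pandas as pd

def pick_existing(df, wanted, case_sensitive=True):
    """Hypothetical helper: keep only the requested columns that exist in df."""
    if case_sensitive:
        found = [col for col in wanted if col in df.columns]
    else:
        # Map lower-cased names back to the real column names.
        lookup = {str(col).lower(): col for col in df.columns}
        found = [lookup[name.lower()] for name in wanted if name.lower() in lookup]
    return df[found]

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 9],
    "Mixed_Case": [10, 11, 12],
})

exact = pick_existing(df, ["A", "C", "D"])
loose = pick_existing(df, ["a", "mixed_case"], case_sensitive=False)
print(exact)
print(loose)
```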
