### 🎯 Basic Dictionary Grouping Example

**LLM Tabular Preprocessing with Dictionary Groups**

This notebook demonstrates dictionary-based column grouping in pandas: each column is mapped to a named group, and the columns in each group are aggregated row by row.

---

**Repository**: 16-DataMining_llm-tabular-preprocessing-dict-groups  
👩‍🚀 **Author**: Fabiana Campanari

---


### Setup

Install dependencies and clone the repository.


In [None]:
# Install dependencies
!pip install -q pandas numpy

# Clone repository
!git clone https://github.com/Quantum-Software-Development/16-DataMining_llm-tabular-preprocessing-dict-groups.git

# Change to repo directory
import os
os.chdir('16-DataMining_llm-tabular-preprocessing-dict-groups')

# Add src to path
import sys
sys.path.insert(0, 'src')

print('✓ Setup completed!')


### Imports


In [None]:
# Diagnostics: show Colab's working directory, its files, and attempt the import (prints the traceback)
import os, sys, traceback

print("PWD:", os.getcwd())
print("\nTop of sys.path:")
for p in sys.path[:10]:
    print("  ", p)

print("\nListing files/dirs in this directory:")
for item in sorted(os.listdir(".")):
    print("  ", item)

print("\nTrying to import agrupa_dicionario to see the full error...")
try:
    # Import the module itself so that agrupa_dicionario.__file__ is defined below
    import agrupa_dicionario
    from agrupa_dicionario import build_grouped_features, row_to_llm_features
    print("Import succeeded! Module loaded from:", agrupa_dicionario.__file__)
except Exception:
    print("IMPORT ERROR:")
    traceback.print_exc()


In [None]:
# Fallback: local implementations of build_grouped_features and row_to_llm_features
# Note: running this cell replaces any versions imported in the previous cell
import pandas as pd
import numpy as np

def _agg_axis(df, cols, agg_func):
    if len(cols) == 0:
        # no columns for this group -> zeros
        return pd.Series([0]*len(df), index=df.index)
    if agg_func == 'sum':
        return df[cols].sum(axis=1)
    if agg_func == 'mean':
        return df[cols].mean(axis=1)
    if agg_func == 'max':
        return df[cols].max(axis=1)
    if agg_func == 'count':
        return df[cols].count(axis=1)
    # otherwise allow a callable (or another pandas aggregation name) to be passed
    try:
        return df[cols].agg(agg_func, axis=1)
    except Exception:
        return df[cols].sum(axis=1)

def build_grouped_features(df, mapping, agg_func='sum'):
    """
    df: DataFrame with the original columns
    mapping: dict col -> group_name
    agg_func: 'sum', 'mean', 'max', 'count', or a function
    Returns a DataFrame whose columns are the groups, with the original index
    """
    # work on a copy
    df = df.copy()
    # unique groups in order of appearance
    groups = []
    for col in df.columns:
        if col in mapping:
            g = mapping[col]
            if g not in groups:
                groups.append(g)
    # also include groups from mapping whose columns do not appear in df (optional)
    for g in dict(mapping).values():
        if g not in groups:
            groups.append(g)

    result = pd.DataFrame(index=df.index)
    for g in groups:
        cols = [col for col, grp in mapping.items() if grp == g and col in df.columns]
        result[g] = _agg_axis(df, cols, agg_func)
    return result

def row_to_llm_features(grouped_df, row_identifier):
    """
    grouped_df: DataFrame returned by build_grouped_features
    row_identifier: index label (e.g. 'Joe') or integer position
    Returns: dict {group: value}
    """
    if isinstance(row_identifier, int):
        s = grouped_df.iloc[row_identifier]
    else:
        s = grouped_df.loc[row_identifier]
    # convert to native Python types (e.g. NumPy float -> Python float; NaN -> None)
    return {str(k): (None if pd.isna(v) else (v.item() if hasattr(v, "item") else v)) for k, v in s.items()}

# confirmation message
print("Fallback functions loaded: build_grouped_features, row_to_llm_features")


### 1. Create Sample Data

Let's create a simple DataFrame with 6 people and 6 columns.


In [None]:
# Set random seed for reproducibility
np.random.seed(0)

# Create DataFrame
people = pd.DataFrame(
    np.random.randn(6, 6),
    columns=['a', 'b', 'c', 'd', 'e', 'f'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis', 'Daniel']
)

print('📊 Original DataFrame:')
people


### 2. Define Mapping Dictionary

We'll group columns into **color categories**:
- `red`: columns a, b, e  
- `blue`: columns c, d  
- `orange`: column f


In [None]:
mapping = {
    'a': 'red',
    'b': 'red',
    'c': 'blue',
    'd': 'blue',
    'e': 'red',
    'f': 'orange',
}

print('🗺️ Mapping:')
for col, group in mapping.items():
    print(f'   {col} → {group}')
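
Before grouping, it can help to check the mapping against the DataFrame. The fallback `build_grouped_features` defined earlier skips unmapped columns without warning and fills a group with zeros when none of its columns exist. A minimal sketch of such a check, where `check_mapping` is a hypothetical helper (not part of the repository) and the small `df` is illustrative data:

```python
import pandas as pd

def check_mapping(df, mapping):
    """Hypothetical helper: report mismatches between df columns and mapping keys."""
    columns = set(df.columns)
    keys = set(mapping)
    # Invert the mapping: group name -> list of its columns that exist in df
    by_group = {}
    for col, group in mapping.items():
        by_group.setdefault(group, [])
        if col in columns:
            by_group[group].append(col)
    return {
        "unmapped_columns": sorted(columns - keys),   # in df, not in mapping
        "missing_columns": sorted(keys - columns),    # in mapping, not in df
        "columns_by_group": by_group,
    }

# Illustrative data: 'x' has no group, and 'c' is mapped but absent
df = pd.DataFrame({"a": [1.0], "b": [2.0], "x": [3.0]})
mapping = {"a": "red", "b": "red", "c": "blue"}
report = check_mapping(df, mapping)
print(report)
```

An empty list under `columns_by_group` flags a group that would come out as all zeros.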


### 3. Group Columns

Apply the mapping and aggregate using **sum**.


In [None]:
# Group and sum
grouped = build_grouped_features(people, mapping, agg_func='sum')

print('✨ Grouped DataFrame (sum):')
grouped
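
The same column grouping can also be expressed with pandas' own `groupby`, which accepts a dictionary as the grouping key. The sketch below transposes the frame so the mapping applies to the original column labels. The small `df` is illustrative data, not the notebook's `people` frame, and this is an alternative formulation under those assumptions rather than the repository's implementation.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame(
    {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]},
    index=["Joe", "Steve"],
)
mapping = {"a": "red", "b": "red", "c": "blue"}

# Transpose so the original columns become the index, group those labels
# through the dictionary, aggregate, then transpose back.
# sort=False keeps groups in order of first appearance instead of alphabetical.
grouped_native = df.T.groupby(mapping, sort=False).sum().T
print(grouped_native)
```

Swapping `.sum()` for `.mean()`, `.max()`, or `.count()` gives the other aggregations.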


### 4. Extract LLM-Ready Features

Convert one row to a dictionary format ready for LLM prompts.


In [None]:
import json

# Extract features for Joe
joe_features = row_to_llm_features(grouped, 'Joe')

print('🤖 Features for Joe (LLM-ready):')
print(json.dumps(joe_features, indent=2))
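
How the dictionary reaches the model is left open here. One option is to render it as a short text block inside the prompt. A minimal sketch, assuming a hypothetical `features_to_prompt` helper and made-up feature values (neither comes from the repository):

```python
def features_to_prompt(name, features, decimals=3):
    """Hypothetical helper: render a {group: value} dict as prompt text."""
    lines = [f"Grouped features for {name}:"]
    for group, value in features.items():
        # None marks a missing value in the feature dict
        shown = "missing" if value is None else f"{value:.{decimals}f}"
        lines.append(f"- {group}: {shown}")
    return "\n".join(lines)

# Illustrative values only, not the notebook's actual output
example_features = {"red": 1.5, "blue": -0.25, "orange": None}
prompt = features_to_prompt("Joe", example_features)
print(prompt)
```

Rounding to a fixed number of decimals keeps the prompt short and stable across runs.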


### 5. Different Aggregations

Try other aggregation functions: mean, max, count.


In [None]:
grouped_mean = build_grouped_features(people, mapping, agg_func='mean')
print('📈 Mean:')
display(grouped_mean)


In [None]:
grouped_max = build_grouped_features(people, mapping, agg_func='max')
print('🔼 Max:')
display(grouped_max)


In [None]:
grouped_count = build_grouped_features(people, mapping, agg_func='count')
print('🔢 Count:')
display(grouped_count)
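
With complete data such as `people`, `count` simply returns the number of columns in each group. It becomes informative when values are missing, because pandas' `count` tallies only non-null entries, while `sum` skips NaN by default. A small sketch on illustrative data (the `df` and the column lists below are made up for this example):

```python
import numpy as np
import pandas as pd

# Illustrative data with missing values
df = pd.DataFrame(
    {"a": [1.0, np.nan], "b": [2.0, 4.0], "c": [np.nan, np.nan]},
    index=["Joe", "Steve"],
)
red_cols = ["a", "b"]   # columns assumed to belong to one group
blue_cols = ["c"]       # a group whose values are all missing

# count(axis=1) counts the non-null values in each row
red_count = df[red_cols].count(axis=1)
blue_count = df[blue_cols].count(axis=1)

# sum(axis=1) skips NaN, so an all-NaN group sums to 0.0 rather than NaN
blue_sum = df[blue_cols].sum(axis=1)

print(red_count.to_dict())
print(blue_count.to_dict())
print(blue_sum.to_dict())
```

Reading `count` next to `sum` helps distinguish a true zero from a group with no observed values.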


### 6. Compare All Aggregations for Joe


In [None]:
comparison = pd.DataFrame({
    'sum': grouped.loc['Joe'],
    'mean': grouped_mean.loc['Joe'],
    'max': grouped_max.loc['Joe'],
    'count': grouped_count.loc['Joe']
})

print('📊 All aggregations for Joe:')
comparison


### 7. Visualization


In [None]:
import matplotlib.pyplot as plt

# Plot grouped sum for all people
fig, ax = plt.subplots(figsize=(10, 6))
grouped.plot(kind='bar', ax=ax, color=['red', 'blue', 'orange'])

ax.set_title('Grouped Features (Sum) by Person', fontsize=14, fontweight='bold')
ax.set_xlabel('Person')
ax.set_ylabel('Sum of Values')
ax.legend(title='Group')
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()


### Conclusion

You've learned how to:

1. ✓ Create a mapping dictionary  
2. ✓ Group DataFrame columns  
3. ✓ Apply different aggregations  
4. ✓ Extract LLM-ready features  
5. ✓ Visualize grouped data  

**Next**: 💚 Try `02_llm_preprocessing.ipynb` for real-world LLM integration!
