# Brand × Attribute Pipeline (End-to-End)

This notebook runs the full pipeline:

1. Build Brand × Attribute matrix from JSONL responses  
2. Filter attributes with an LLM (OpenAI or Ollama)  
3. Normalize/group attributes with Ollama  
4. Compute PMI matrix  
5. Run SVD on PMI  
6. Compute brand–attribute importance scores + per-brand rankings

This notebook assumes you have added the following notebook-friendly wrapper functions:

- `run_build_matrix(...)` in `build_brand_attribute_matrix.py`
- `run_filter_attributes(...)` in `filter_attributes_with_llm.py`
- `run_normalize_attributes(...)` in `normalize_attributes_with_llm.py`
- `run_compute_pmi(...)` in `compute_pmi.py`
- `run_svd(...)` in `run_svd.py`
- `run_importance_from_outdir(...)` in `compute_importance_scores.py`


## 0) Setup

Adjust paths here if your repo layout differs.


In [None]:
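import pandas as pd  # used in later cells to preview the intermediate CSVs
from IPython.display import display  # built in inside Jupyter; explicit for other runners

# --- User-editable paths ---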
JSONL_INPUT = "data/processed/responses/Appendix_A_responses.jsonl"
CONFIG_PATH = "data/raw/demo_brand_prompt_config.json"

OUTDIR = "data/processed/brand_attribute_matrix"

# Build
STEM = "raw"
PREFIX_ADJECTIVES = False

# Filter
FILTERED_CSV = f"{OUTDIR}/filtered.csv"
DECISIONS_LOG = f"{OUTDIR}/decisions-log.csv"

# Normalize
NORMALIZED_CSV = f"{OUTDIR}/filtered_normalized.csv"

# PMI
PMI_CSV = f"{OUTDIR}/pmi.csv"

# SVD
K_SVD = 10

# Importance (optionally truncate to top-k dims; None means use all saved dims)
K_IMPORTANCE = 6

## 1) Build Brand × Attribute matrix from JSONL


In [None]:
from src.data_analysis.build_brand_attribute_matrix import run_build_matrix

df_raw, raw_csv, raw_png = run_build_matrix(
    input_jsonl=JSONL_INPUT,
    outdir=OUTDIR,
    config_path=CONFIG_PATH,
    prefix_adjectives=PREFIX_ADJECTIVES,
    stem=STEM,
)

print("Raw matrix:", raw_csv)
print("Raw heatmap:", raw_png)
df_raw.head()
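
To make the build step concrete, here is a minimal sketch of turning JSONL responses into a brand × attribute count matrix. The record fields `brand` and `attributes` are hypothetical; the real schema of `Appendix_A_responses.jsonl` and the internals of `run_build_matrix` may differ.

```python
import io
import json

import pandas as pd

# Hypothetical JSONL records: one brand and a list of attribute strings per line.
SAMPLE_JSONL = """\
{"brand": "Acme", "attributes": ["reliable", "cheap"]}
{"brand": "Acme", "attributes": ["Reliable"]}
{"brand": "Zenith", "attributes": ["premium", "reliable"]}
"""


def build_count_matrix(lines):
    """Count how often each attribute is mentioned for each brand."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for attribute in record["attributes"]:
            # Lowercase and trim so "Reliable" and "reliable" share a column.
            rows.append((record["brand"], attribute.strip().lower()))
    pairs = pd.DataFrame(rows, columns=["brand", "attribute"])
    # crosstab gives brands as rows, attributes as columns, counts as values.
    return pd.crosstab(pairs["brand"], pairs["attribute"])


counts = build_count_matrix(io.StringIO(SAMPLE_JSONL))
print(counts)
```

Brand/attribute pairs that never co-occur show up as zeros, which is the shape the later PMI step needs.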


## 2) Filter attributes with an LLM

This step uses the backend configured inside `filter_attributes_with_llm.py`
(e.g., `USE_OPENAI=True` plus a model name, or Ollama).  
If you use OpenAI, make sure the required API keys are set in your environment before running the cell.
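
A small pre-flight check can catch a missing key before the LLM calls start. This is a sketch: `missing_env_vars` is a hypothetical helper, and `OPENAI_API_KEY` is the variable the official OpenAI Python client reads by default; the filter script may expect something else.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Assumption: the OpenAI backend is in use and reads OPENAI_API_KEY.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Set these before running the filter step with OpenAI:", missing)
```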


In [None]:
from src.data_analysis.filter_attributes_with_llm import run_filter_attributes

filtered_csv = run_filter_attributes(
    input_csv=raw_csv,
    output_csv=FILTERED_CSV,
    decisions_log_csv=DECISIONS_LOG,
)

print("Filtered matrix:", filtered_csv)
pd.read_csv(filtered_csv, index_col=0).head()


## 3) Normalize / group attributes (Ollama)

This step groups synonyms and rewrites the matrix to canonical attribute columns.


In [None]:
from src.data_analysis.normalize_attributes_with_llm import run_normalize_attributes

normalized_csv = run_normalize_attributes(
    input_csv=filtered_csv,
    outdir=OUTDIR,
    chunk_size=60,
)

print("Normalized matrix:", normalized_csv)
pd.read_csv(normalized_csv, index_col=0).head()
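
Once synonym groups are known, rewriting the matrix amounts to renaming columns and summing the ones that collide. The sketch below shows only that rewrite; the `canonical` mapping is a hypothetical stand-in for the groups the LLM proposes, and `run_normalize_attributes` may work differently.

```python
import pandas as pd

# Toy brand x attribute counts with near-synonym columns.
counts = pd.DataFrame(
    {"cheap": [3, 0], "affordable": [1, 2], "reliable": [4, 1]},
    index=["Acme", "Zenith"],
)

# Hypothetical synonym map: original column -> canonical column.
canonical = {"cheap": "affordable", "affordable": "affordable"}


def merge_synonym_columns(df, mapping):
    """Rename columns to canonical labels and sum columns that collide."""
    renamed = df.rename(columns=lambda col: mapping.get(col, col))
    # Group on the transposed frame so duplicate labels are summed per brand.
    return renamed.T.groupby(level=0).sum().T


normalized = merge_synonym_columns(counts, canonical)
print(normalized)
```

Columns missing from the mapping keep their original name, so an incomplete mapping never drops data.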


## 4) Compute PMI matrix


In [None]:
from src.data_analysis.compute_pmi import run_compute_pmi

PMI_df, pmi_path = run_compute_pmi(
    input_csv=filtered_csv,   # swap in normalized_csv to compute PMI on the normalized attributes from step 3
    output_csv=PMI_CSV,
)

print("PMI matrix:", pmi_path)
PMI_df.head()
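
For reference, pointwise mutual information for a brand `b` and attribute `a` is `log(p(b, a) / (p(b) * p(a)))`. The sketch below computes it from a count matrix. Mapping zero-count cells to 0 and the optional positive-PMI clipping are assumptions made here; `run_compute_pmi` may use a different base, smoothing, or zero handling.

```python
import numpy as np
import pandas as pd


def pmi_matrix(counts, positive=False):
    """PMI(b, a) = log(p(b, a) / (p(b) * p(a))) from a count matrix."""
    total = counts.to_numpy().sum()
    joint = counts / total                   # p(b, a)
    p_brand = joint.sum(axis=1)              # p(b), one value per row
    p_attr = joint.sum(axis=0)               # p(a), one value per column
    expected = np.outer(p_brand, p_attr)     # p(b) * p(a) under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint.to_numpy() / expected)
    # Zero counts give -inf or NaN; set them to 0 so later SVD stays finite.
    pmi[~np.isfinite(pmi)] = 0.0
    if positive:
        pmi = np.maximum(pmi, 0.0)           # PPMI variant
    return pd.DataFrame(pmi, index=counts.index, columns=counts.columns)


toy_counts = pd.DataFrame(
    [[2, 0], [0, 2]], index=["Acme", "Zenith"], columns=["reliable", "premium"]
)
toy_pmi = pmi_matrix(toy_counts)
print(toy_pmi)
```

Positive values mean a brand and attribute co-occur more often than independence would predict; negative values mean less often.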


## 5) Run SVD on PMI


In [None]:
from src.data_analysis.run_svd import run_svd

U_df, S_df, V_df = run_svd(
    input_pmi_csv=pmi_path,
    k=K_SVD,
    outdir=OUTDIR,
)

display(S_df.head())
U_df.head()
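
As a rough picture of this step, a truncated SVD keeps the `k` largest singular values and their vectors. The NumPy sketch below is illustrative only; whether `run_svd` clamps `k`, centers the matrix, or uses another solver is not shown in this notebook.

```python
import numpy as np


def truncated_svd(matrix, k):
    """Return the top-k singular triplets (U_k, s_k, Vt_k) of `matrix`."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))  # k cannot exceed min(n_rows, n_cols)
    return U[:, :k], s[:k], Vt[:k, :]


rng = np.random.default_rng(0)
M = rng.normal(size=(5, 8))          # stand-in for a brands x attributes PMI matrix
U_k, s_k, Vt_k = truncated_svd(M, k=3)
approx = U_k @ np.diag(s_k) @ Vt_k   # best rank-3 approximation of M
print(U_k.shape, s_k.shape, Vt_k.shape)
```

With `K_SVD = 10`, the PMI matrix needs at least 10 brands and 10 attributes for all requested dimensions to exist.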


## 6) Compute Brand–Attribute importance scores + per-brand rankings


In [None]:
from src.data_analysis.compute_importance_scores import run_importance_from_outdir

importance_df, ranking_dict = run_importance_from_outdir(
    input_outdir=OUTDIR,
    k=K_IMPORTANCE,
)

importance_df.head()
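
The notebook does not define the importance score, so the sketch below uses one plausible choice as an assumption: the rank-`k` reconstruction `U_k diag(s_k) V_k^T`, with attributes then sorted per brand. The function names and the returned dictionary shape are hypothetical and need not match `run_importance_from_outdir`.

```python
import numpy as np
import pandas as pd


def importance_scores(U, s, Vt, brands, attributes, k=None):
    """Assumed score: entries of the rank-k reconstruction U_k diag(s_k) Vt_k."""
    k = len(s) if k is None else min(k, len(s))
    scores = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return pd.DataFrame(scores, index=brands, columns=attributes)


def rank_attributes(scores, top_n=10):
    """Map each brand to its attributes sorted by descending score."""
    return {
        brand: row.sort_values(ascending=False).index[:top_n].tolist()
        for brand, row in scores.iterrows()
    }


# Toy PMI-like matrix: 2 brands x 3 attributes.
toy = np.array([[1.0, 0.2, -0.5], [-0.3, 0.8, 0.1]])
U, s, Vt = np.linalg.svd(toy, full_matrices=False)
scores = importance_scores(U, s, Vt, ["Acme", "Zenith"], ["reliable", "cheap", "premium"])
rankings = rank_attributes(scores)
print(rankings)
```

Passing a smaller `k` smooths the scores by discarding the weakest latent dimensions, which is the role `K_IMPORTANCE` plays above.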


## 7) Quick check: show top attributes for a brand


In [None]:
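# Use the first brand in the importance matrix as an example.
brand = importance_df.index[0]
# Assumes ranking_dict[brand]["top_attributes"] is ordered best-first.
top10 = ranking_dict[brand]["top_attributes"][:10]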
print("Brand:", brand)
print("Top 10 attributes:", top10)
