<a href="https://colab.research.google.com/github/Benfinkels/Cross-Channel-Attribution-Analyzer-EVC-Impact-Model/blob/main/EVC_Session_Attribution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EVC & "Ghost Traffic" Attribution Analyzer

### **The Problem: The Attribution Gap**
Standard analytics tools such as GA4 often fail to credit engaged-view conversions (EVCs) from video ads, because users rarely click directly from a video to the site. Instead, they view the ad and visit later via "Direct" or "Organic" search. As a result, video campaigns appear to have low ROI, while Organic channels appear artificially inflated.

### **The Solution: Competitive Signal Unmixing**
This tool uses **Non-Negative Least Squares (NNLS)** to mathematically "unmix" your traffic spikes and determine which channel is actually echoing your video performance.

* **Competitive Modeling:** Unlike simple correlation, this model forces traffic sources (like Direct and Organic) to "compete" for credit. This limits double-counting when multiple channels spike at the same time (mitigating multicollinearity).
* **Auto-Lag Detection:** The algorithm scans a range of delays (0 to 5 days by default) and keeps the delay at which the traffic pattern best fits the EVC pattern.
* **Ghost Efficiency:** Calculates a "Ghost Conversion Rate" (the model coefficient): the number of reported EVCs associated with each session on a channel. Its reciprocal is the number of sessions that typically appear on your site for every 1 reported EVC.
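
To make the "unmixing" idea concrete, here is a minimal sketch on synthetic data. The channel volumes and the 0.05 rate are assumptions chosen for illustration, not values from a real account; the only piece shared with this notebook is the `scipy.optimize.nnls` solver used in Step 2.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
days = 60

# Hypothetical daily sessions for two channels (synthetic, for illustration only)
direct = rng.uniform(100, 200, days)
organic = rng.uniform(300, 500, days)
X = np.column_stack([direct, organic])

# Assumed ground truth: only Direct echoes the video campaign (0.05 EVCs per session)
true_rates = np.array([0.05, 0.0])
evc = X @ true_rates  # noise-free target, so the fit can be exact

# NNLS finds non-negative weights w that minimise ||X w - evc||
weights, residual_norm = nnls(X, evc)
print(dict(zip(["direct", "organic"], weights.round(4))))
```

Because the weights cannot go negative, one channel cannot "cancel out" another; credit only goes to channels whose daily pattern helps explain the EVC curve.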

# Step 0: Install Requirements
**Required for Step 4 (Lift Analysis).**

The **CausalImpact** library (used in Step 4 to measure incremental lift) is not pre-installed in Google Colab. Run this cell once to install it.

* **What it does:** Installs `pycausalimpact`, a Python port of Google's CausalImpact R package for causal inference using Bayesian Structural Time-Series.
* **When to run:** Only once per session.

In [None]:
# @title
# ==============================================================================
# CELL 0: Install CausalImpact
# ==============================================================================
%pip install pycausalimpact

# Step 0.5: Upload Data
Run the cell below to upload your data files from your local machine to this notebook.

**You need two CSV files:**
1.  **Session Data (The Effect):** From GA4 (e.g., *Date, Session Source, Sessions*).
2.  **Ad Data (The Cause):** From Google Ads (e.g., *Date, Engaged-view conversions*).

> **Note:** The files are deleted when the runtime recycles, so you will need to re-upload them if you restart the notebook.

In [None]:
# @title
# ==============================================================================
# CELL 0.5: UPLOAD BUTTON
# ==============================================================================
from google.colab import files
import os

print(" Click below to upload your GA4 and Google Ads CSV files:")
uploaded = files.upload()

if uploaded:
    print(f"\n {len(uploaded)} file(s) uploaded successfully:")
    for fn in uploaded.keys():
        print(f"   - {fn} ({os.path.getsize(fn)/1024:.1f} KB)")
    print("\n Now run 'Step 1: Interactive Data Loader' below.")
else:
    print("\n No files were uploaded.")

### **Step 1: Prepare Data**
Run the cell below to launch the **Interactive Data Loader**.

**File Requirements:**
The loader includes an **Auto-Parser** that accepts most standard formats (Wide, Long, or Pivot). You need two files:
1.  **Session File (The Effect):** A GA4 export containing `Date`, `Channel` (e.g., Session source/medium), and `Volume` (e.g., Sessions).
2.  **Target File (The Cause):** A Google Ads export containing `Date` and your trigger metric (e.g., `EVC`, `Spend`, or `Impressions`).

> **Note:** The script will automatically detect and clean date headers (pivoted data) or date rows (wide data).
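
As a rough illustration of what "un-pivoting" means here, the sketch below reshapes a small pivot-style table (dates as column headers) into the long `Date` / `Session source / medium` / `Sessions` layout that the later steps expect. The sample values are made up, and this is a simplified version of the idea, not the loader's full detection logic.

```python
import pandas as pd

# Hypothetical pivot-style export: one row per channel, one column per date
pivot = pd.DataFrame({
    "Session source / medium": ["google / organic", "(direct) / (none)"],
    "2024-01-01": [120, 80],
    "2024-01-02": [130, 95],
})

# melt() turns the date headers into values of a single 'Date' column
long_df = pivot.melt(
    id_vars=["Session source / medium"],
    var_name="Date",
    value_name="Sessions",
)
long_df["Date"] = pd.to_datetime(long_df["Date"])
print(long_df)
```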

In [None]:
# @title
# =============================================================================
# CELL 1: INTERACTIVE DATA LOADER (OPTIMIZED FOR STEP 2)
# =============================================================================
import pandas as pd
import os
import ipywidgets as widgets
from IPython.display import display, clear_output
import numpy as np

# 1. SCAN FOR CSV FILES
csv_files = [f for f in os.listdir('.') if f.endswith('.csv')]
csv_files.sort(key=lambda x: os.path.getmtime(x), reverse=True) # Newest first

if not csv_files:
    print("❌ No CSV files found. Please upload files using the folder icon on the left.")
else:
    # 2. CREATE WIDGETS
    style = {'description_width': 'initial'}
    layout = widgets.Layout(width='600px')

    dd_session_file = widgets.Dropdown(options=csv_files, description='📂 Session Data (GA4):', style=style, layout=layout)
    dd_target_file = widgets.Dropdown(options=csv_files, description='🎯 Target Data (EVC):', style=style, layout=layout)

    # Smart defaults
    for f in csv_files:
        if 'analytic' in f.lower() or 'session' in f.lower(): dd_session_file.value = f
    for f in csv_files:
        if 'evc' in f.lower() or 'ads' in f.lower() or 'target' in f.lower(): dd_target_file.value = f

    btn_load = widgets.Button(description="Load Selected Files", button_style='primary', icon='upload')
    out = widgets.Output()

    display(widgets.VBox([
        widgets.HTML("<h3>📂 Select Your Files</h3>"),
        dd_session_file, dd_target_file, btn_load, out
    ]))

    # 3. LOADING LOGIC
    def clean_duplicates(df):
        return df.loc[:, ~df.columns.duplicated()]

    def process_session_file(filename):
        """Smart parser that differentiates between types of Wide/Pivot tables."""
        raw_head = pd.read_csv(filename, nrows=5)
        cols = raw_head.columns

        # Detect likely columns
        date_col = 'Date' if 'Date' in cols else next((c for c in cols if 'date' in str(c).lower() or 'day' in str(c).lower()), None)
        sess_col = 'Sessions' if 'Sessions' in cols else next((c for c in cols if 'session' in str(c).lower() or 'user' in str(c).lower()), None)

        # CHECK 1: PIVOT FORMAT (Headers are Dates)
        try:
            sample_headers = cols[1:10]
            valid_dates = pd.to_datetime(sample_headers, errors='coerce').notna().sum()
            headers_are_dates = valid_dates > (len(sample_headers) * 0.5)
        except Exception:
            headers_are_dates = False

        # --- PARSING ---
        if headers_are_dates:
            print("   ↳ Format: Pivot Table detected (Un-pivoting...)")
            df = pd.read_csv(filename)
            id_col = df.columns[0]
            # Melt: Turn Date Headers into a 'Date' column
            df_long = df.melt(id_vars=[id_col], var_name='Date', value_name='Sessions')
            df_long = df_long.rename(columns={id_col: 'Session source / medium'})
            return clean_duplicates(df_long)  # Return the melted frame, not the raw pivot

        # CHECK 2: ALREADY LONG FORMAT
        elif date_col and sess_col:
            print("   ↳ Format: Standard Long (Cleaning...)")
            df = pd.read_csv(filename)
            df = clean_duplicates(df)

            # Normalize names
            if date_col != 'Date': df = df.rename(columns={date_col: 'Date'})
            if sess_col != 'Sessions': df = df.rename(columns={sess_col: 'Sessions'})

            # Identify Channel Column
            reserved = ['Date', 'Sessions']
            chan_cols = [c for c in df.select_dtypes(include=['object']).columns if c not in reserved]
            if chan_cols and 'Session source / medium' not in df.columns:
                df = df.rename(columns={chan_cols[0]: 'Session source / medium'})
            return clean_duplicates(df)

        # CHECK 3: WIDE FORMAT
        elif date_col:
            print("   ↳ Format: Wide (Un-pivoting...)")
            df = pd.read_csv(filename)
            df = clean_duplicates(df)
            df_long = df.melt(id_vars=[date_col], var_name='Session source / medium', value_name='Sessions')
            df_long = df_long.rename(columns={date_col: 'Date'})
            return clean_duplicates(df_long)

        else:
            print("   ↳ Format: Unknown. Loading as-is.")
            return clean_duplicates(pd.read_csv(filename))

    def on_load_click(b):
        global df_sessions, df_evc

        with out:
            clear_output()
            s_file = dd_session_file.value
            t_file = dd_target_file.value
            print("🔄 Loading...")

            try:
                # 1. LOAD SESSIONS
                df_sessions = process_session_file(s_file)

                # Critical Data Type Enforcement for Step 2
                if 'Date' in df_sessions.columns:
                    df_sessions['Date'] = pd.to_datetime(df_sessions['Date'], errors='coerce')
                    df_sessions = df_sessions.dropna(subset=['Date'])

                if 'Sessions' in df_sessions.columns:
                    # Remove commas (e.g. "1,000") and force numeric
                    df_sessions['Sessions'] = pd.to_numeric(df_sessions['Sessions'].astype(str).str.replace(',', ''), errors='coerce').fillna(0)

                print(f"   ✅ Session Data: {len(df_sessions):,} rows loaded.")

                # 2. LOAD TARGET (EVC)
                # Scan for offset headers
                raw_evc = pd.read_csv(t_file, header=None, nrows=10)
                header_row_evc = 0
                for i, row in raw_evc.iterrows():
                     if row.astype(str).str.contains('Date|Day|EVC', case=False, regex=True).any():
                        header_row_evc = i; break

                df_evc = pd.read_csv(t_file, header=header_row_evc)
                df_evc = clean_duplicates(df_evc)

                # Normalize EVC Dates
                date_col_evc = next((c for c in df_evc.columns if 'date' in str(c).lower() or 'day' in str(c).lower()), 'Date')
                df_evc = df_evc.rename(columns={date_col_evc: 'Date'})
                df_evc['Date'] = pd.to_datetime(df_evc['Date'], errors='coerce')

                print(f"   ✅ Target Data:  {len(df_evc):,} rows loaded.")
                print("\n🎉 Success! Data is ready for Step 2.")

            except Exception as e:
                print(f"\n❌ Error loading files: {e}")
                print("Tip: Ensure your CSVs have standard headers like 'Date' and 'Sessions'.")

    btn_load.on_click(on_load_click)

### **Step 2: Configure & Run Analysis**

Run the cell below to open the **Competitive Attribution Dashboard**.

**Configuration Guide:**
* **Target (EVC):** Select the "Ghost" signal you want to explain (e.g., *Engaged-View Conversions*).
* **Channel Name Col:** The dimension for your traffic sources (e.g., *Session source / medium*).
* **Scan Lag Range:** The maximum delay to test.
    * *Recommendation:* Set to **5 days**. The tool will automatically test every delay (0 to 5) and lock onto the day where the traffic pattern best matches the EVC pattern.
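
The lag scan can be sketched as follows. This is a simplified, single-channel version on synthetic data: the 2-day delay, the 0.1 rate, and the helper `r2` are assumptions for illustration. It follows the notebook's convention of calling `shift(lag)` on the session table before fitting.

```python
import numpy as np
import pandas as pd
from scipy.optimize import nnls

rng = np.random.default_rng(1)
dates = pd.date_range("2024-01-01", periods=40, freq="D")

# Hypothetical single-channel session table (synthetic)
sessions = pd.DataFrame({"direct": rng.uniform(100, 200, 40)}, index=dates)

# Assumed ground truth: the target lines up with sessions shifted by 2 rows
true_lag = 2
evc = (0.1 * sessions["direct"].shift(true_lag)).dropna()

def r2(y, y_pred):
    """Coefficient of determination, computed by hand to avoid extra imports."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

scores = {}
for lag in range(6):  # test delays 0..5
    X = sessions.shift(lag).dropna()
    idx = X.index.intersection(evc.index)  # keep only dates present in both
    w, _ = nnls(X.loc[idx].values, evc.loc[idx].values)
    scores[lag] = r2(evc.loc[idx].values, X.loc[idx].values @ w)

best_lag = max(scores, key=scores.get)
print(best_lag, round(scores[best_lag], 3))
```

The scan simply keeps the lag with the highest R². On real data the winning score will be well below 1, and nearby lags often score similarly, so treat the chosen lag as an estimate.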

**Interpreting the Results Table:**
* **Attributed EVCs:** The number of conversions the model believes actually came from this channel.
* **Conversion Rate (Ghost):** The efficiency multiplier.
    * *Example:* **0.05** means it takes roughly **20 Sessions** from this channel to generate **1 EVC**.
* **Share of Explained:** The percentage of the total signal "owned" by this channel.
    * *Note:* Because this is a **competitive model**, channels fight for credit. If "Direct" and "Organic" spike at the same time, the model gives credit to the one that fits the curve best, preventing double-counting.
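
The table columns are related by simple arithmetic. The sketch below works through it with made-up numbers (the rates and session totals are hypothetical, not model output): attributed EVCs are the rate times the channel's sessions, the share is each channel's portion of the attributed total, and the reciprocal of the rate gives sessions per EVC.

```python
# Hypothetical fitted rates (EVCs per session) and session totals
rates = {"direct": 0.05, "organic": 0.01}
sessions = {"direct": 2000, "organic": 5000}

# Attributed EVCs = Conversion Rate (Ghost) x Raw Sessions
attributed = {ch: rates[ch] * sessions[ch] for ch in rates}

# Share of Explained = channel's attributed EVCs / total attributed EVCs
total = sum(attributed.values())
share = {ch: 100 * attributed[ch] / total for ch in rates}

# Reciprocal of the rate: how many sessions correspond to one EVC
sessions_per_evc = {ch: 1 / rates[ch] for ch in rates}

for ch in rates:
    print(ch, round(attributed[ch], 1), f"{share[ch]:.1f}%", round(sessions_per_evc[ch], 1))
```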

In [None]:
# =============================================================================
# CELL 2: EVC "GHOST TRAFFIC" FINDER (NNLS & AUTO-LAG)
# =============================================================================

import pandas as pd
import numpy as np
import ipywidgets as widgets
from IPython.display import display, clear_output
from scipy.optimize import nnls
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Global storage
current_results_df = None
nnls_results_df = None  # New: Specific storage for NNLS results

def find_date_column(df):
    possible_names = ['Date', 'date', 'Clean Date', 'Day', 'day', 'Time', 'Timestamp', 'Period', 'Week']
    for col in df.columns:
        if col in possible_names: return col
    for col in df.columns:
        if 'date' in col.lower(): return col
    return None

def run_dashboard():
    global current_results_df, nnls_results_df, df_sessions, df_evc

    # --- 1. CONNECT TO LOADED DATA ---
    if 'df_sessions' not in globals() or 'df_evc' not in globals() or df_sessions is None:
        print("❌ Error: Data not found. Please run 'Cell 1' first.")
        return

    # De-duplicate
    global_df_sessions = df_sessions.loc[:, ~df_sessions.columns.duplicated()].copy()
    global_df_analysis = df_evc.loc[:, ~df_evc.columns.duplicated()].copy()

    # --- 2. WIDGET SETUP ---
    style = {'description_width': 'initial'}
    layout = widgets.Layout(width='600px')

    all_cols_analysis = sorted(global_df_analysis.columns.tolist())
    all_cols_sessions = sorted(global_df_sessions.columns.tolist())
    num_cols_sessions = sorted(global_df_sessions.select_dtypes(include=[np.number]).columns.tolist())

    dd_trigger = widgets.Dropdown(options=all_cols_analysis, description='Target (EVC):', style=style, layout=layout)
    dd_channel_col = widgets.Dropdown(options=all_cols_sessions, description='Channel Name Col:', style=style, layout=layout)
    dd_value_col = widgets.Dropdown(options=num_cols_sessions, description='Traffic Metric (Sessions):', style=style, layout=layout)

    # Smart Defaults
    if 'EVC' in all_cols_analysis: dd_trigger.value = 'EVC'
    if 'Session source / medium' in all_cols_sessions: dd_channel_col.value = 'Session source / medium'
    if 'Sessions' in num_cols_sessions: dd_value_col.value = 'Sessions'

    slider_max_lag = widgets.IntSlider(value=5, min=1, max=14, step=1, description='Scan Lag Range (Days):', style=style, layout=layout)

    btn_run = widgets.Button(description="Run Competitive Model", button_style='primary', icon='calculator')
    btn_download = widgets.Button(description="Download Results", button_style='success', icon='download', disabled=True)
    out = widgets.Output()

    ui = widgets.VBox([
        widgets.HTML("<h3>👻 Competitive EVC Attribution (NNLS)</h3><p>Solves for multicollinearity and auto-detects time lag.</p>"),
        dd_trigger, dd_channel_col, dd_value_col, slider_max_lag,
        widgets.HBox([btn_run, btn_download]), out
    ])
    display(ui)

    # --- DOWNLOAD LOGIC ---
    def on_download_click(b):
        global current_results_df
        if current_results_df is not None:
            filename = 'EVC_Competitive_Model.csv'
            current_results_df.to_csv(filename, index=False)
            from google.colab import files
            files.download(filename)

    btn_download.on_click(on_download_click)

    # --- ANALYSIS LOGIC ---
    def on_run_click(b):
        global current_results_df, nnls_results_df
        with out:
            clear_output()
            btn_download.disabled = True

            trigger = dd_trigger.value
            channel_col = dd_channel_col.value
            value_col = dd_value_col.value
            max_lag_scan = slider_max_lag.value

            print(f"🔄 Preparing Data & Scanning Lags (0 to {max_lag_scan} days)...")

            # 1. Date Parsing
            df_s = global_df_sessions.copy()
            df_a = global_df_analysis.copy()
            date_col_s = find_date_column(df_s)
            date_col_a = find_date_column(df_a)

            if date_col_s is None or date_col_a is None:
                print("❌ CRITICAL ERROR: Could not find 'Date' column.")
                return

            df_s['Date'] = pd.to_datetime(df_s[date_col_s], errors='coerce')
            df_a['Date'] = pd.to_datetime(df_a[date_col_a], errors='coerce')
            df_a[trigger] = pd.to_numeric(df_a[trigger], errors='coerce').fillna(0)

            # 2. Pivot Sessions
            try:
                df_X = df_s.groupby(['Date', channel_col])[value_col].sum().unstack(fill_value=0)
            except Exception as e:
                print(f"❌ Error Pivoting: {e}")
                return

            # Filter out tiny channels (noise reduction)
            total_vol = df_X.sum().sum()
            df_X = df_X.loc[:, df_X.sum() > (total_vol * 0.001)] # 0.1% threshold

            # 3. Align Target
            df_y = df_a[['Date', trigger]].groupby('Date').sum()

            # Align Dates
            common_dates = df_X.index.intersection(df_y.index)
            if len(common_dates) < 10:
                print("❌ Not enough overlapping dates between Sessions and EVCs.")
                return

            # 4. Loop Lags to find Best Fit
            best_lag = 0
            best_score = -np.inf
            best_weights = None
            best_X = None
            best_y_aligned = None

            for lag in range(max_lag_scan + 1):
                # shift(lag) moves rows down, so EVCs on a date are paired with
                # sessions from `lag` rows earlier (days, if the index has no gaps)
                X_aligned = df_X.shift(lag).dropna()

                # Intersect first: .loc with dates missing from df_y raises KeyError
                valid_idx = X_aligned.index.intersection(df_y.index)
                X_final = X_aligned.loc[valid_idx]
                y_final = df_y.loc[valid_idx, trigger].values

                if len(y_final) < 5: continue

                # Run NNLS (returns the weights and the residual norm)
                weights, rnorm = nnls(X_final.values, y_final)
                y_pred = np.dot(X_final.values, weights)
                score = r2_score(y_final, y_pred)

                if score > best_score:
                    best_score = score
                    best_lag = lag
                    best_weights = weights
                    best_X = X_final
                    best_y_aligned = y_final

            if best_weights is None or best_score < 0:
                print("⚠️ No significant correlation found. Check data volume.")
                return

            # 5. Compile Results
            print(f"✅ Best Fit Found! Lag: {best_lag} days (R²: {best_score:.3f})")

            channel_names = best_X.columns
            results = []
            for i, channel in enumerate(channel_names):
                coeff = best_weights[i]
                if coeff > 0:
                    total_sessions_channel = best_X[channel].sum()
                    attr_evc = coeff * total_sessions_channel
                    results.append({
                        'Channel': channel,
                        'Conversion Rate (Ghost)': coeff,
                        'Attributed EVCs': attr_evc,
                        'Raw Sessions': total_sessions_channel
                    })

            res_df = pd.DataFrame(results).sort_values(by='Attributed EVCs', ascending=False)
            res_df['Share of Explained EVCs'] = (res_df['Attributed EVCs'] / res_df['Attributed EVCs'].sum()) * 100

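            # --- SAVE RESULTS ---
            current_results_df = res_df
            nnls_results_df = res_df.copy()  # Separate copy so later cells cannot overwrite the NNLS results
            # Write the CSV now: the Step 3 visualizer looks for this file on disk
            res_df.to_csv('EVC_Competitive_Model.csv', index=False)
            btn_download.disabled = False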

            # 6. Display
            print("\n" + "="*80)
            print(f"🏆 ATTRIBUTION RESULT (Based on {len(best_y_aligned)} days of data)")
            print("="*80)

            styled = res_df.style.format({
                'Conversion Rate (Ghost)': '{:.5f}',
                'Attributed EVCs': '{:,.1f}',
                'Raw Sessions': '{:,.0f}',
                'Share of Explained EVCs': '{:.1f}%'
            }).background_gradient(subset=['Attributed EVCs'], cmap='Greens')

            display(styled)

            # 7. Visualization
            plt.figure(figsize=(12, 5))
            y_pred_best = np.dot(best_X.values, best_weights)
            plt.plot(best_X.index, best_y_aligned, label='Actual Google EVCs', color='black', alpha=0.6)
            plt.plot(best_X.index, y_pred_best, label='Model Predicted (From Traffic)', color='green', linestyle='--')
            plt.title(f'Model Fit: Actual EVCs vs Traffic Signals (Lag: {best_lag} days)')
            plt.legend()
            plt.grid(True, alpha=0.3)
            plt.show()

    btn_run.on_click(on_run_click)

run_dashboard()

## Method 2: Ridge Regression (The "Cooperative" Model)

### **Why use this?**
Competitive models (like NNLS) force channels to fight for credit. If "Direct Traffic" and "Brand Search" spike at the same time, the model tends to pick one winner and give the other little or no credit.

**Ridge Regression** solves this by allowing **shared credit**. It recognizes that user journeys are complex and that multiple channels often work together to drive a single conversion.

**Use this model when:**
* You see "spiky" results in the NNLS model.
* You suspect YouTube is creating a "Halo Effect" that lifts multiple channels simultaneously (e.g., Search AND Direct).
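
The difference between the two models shows up most clearly when two channels move in lockstep. The sketch below is an illustration under stated assumptions: the two columns are made identical on purpose, the data is synthetic, and Ridge is computed with its closed-form solution in NumPy instead of the scikit-learn `Ridge(positive=True)` call used in the cell below.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
x = rng.uniform(100, 200, 60)

# Two hypothetical channels with identical daily patterns (perfectly collinear)
X = np.column_stack([x, x])
y = 0.1 * x  # assumed target: 0.1 EVCs per session of combined signal

# NNLS: any non-negative split that sums to 0.1 fits equally well,
# so the solver is free to put most or all credit on one channel
w_nnls, _ = nnls(X, y)

# Ridge (closed form): minimise ||Xw - y||^2 + alpha * ||w||^2
alpha = 1.0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

print("NNLS :", w_nnls.round(4))
print("Ridge:", w_ridge.round(4))
```

The L2 penalty makes uneven splits more expensive than even ones, which is why Ridge divides the credit between the two identical channels. Larger `alpha` values pull the weights closer together and shrink them toward zero.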

In [None]:
# @title
# =============================================================================
# CELL 2.5: RIDGE REGRESSION (AUTO-LAG DETECT)
# =============================================================================

import pandas as pd
import numpy as np
import ipywidgets as widgets
from IPython.display import display, clear_output
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# Global storage
ridge_results_df = None

def run_ridge_dashboard():
    global df_sessions, df_evc

    # --- 1. SETUP ---
    if 'df_sessions' not in globals() or 'df_evc' not in globals():
        print("❌ Error: Data not found. Run Step 1 first.")
        return

    style = {'description_width': 'initial'}
    layout = widgets.Layout(width='600px')

    # Detect Columns
    evc_options = sorted(df_evc.select_dtypes(include=[np.number]).columns)
    sess_options = sorted(df_sessions.select_dtypes(include=[np.number]).columns)
    chan_options = sorted(df_sessions.select_dtypes(exclude=[np.number]).columns)

    evc_guess = next((c for c in evc_options if 'evc' in c.lower()), evc_options[0])
    sess_guess = next((c for c in sess_options if 'session' in c.lower()), sess_options[0])
    chan_guess = next((c for c in chan_options if 'source' in c.lower() or 'channel' in c.lower()), chan_options[0])

    # Widgets
    dd_trigger = widgets.Dropdown(options=evc_options, value=evc_guess, description='Target (EVC):', style=style, layout=layout)
    dd_metric = widgets.Dropdown(options=sess_options, value=sess_guess, description='Input (Sessions):', style=style, layout=layout)
    dd_channel = widgets.Dropdown(options=chan_options, value=chan_guess, description='Channel Name Col:', style=style, layout=layout)

    # Settings (Note: "Scan Range" instead of "Fixed Lag")
    slider_alpha = widgets.FloatLogSlider(value=1.0, base=10, min=-2, max=2, step=0.1, description='Sharing Strength (Alpha):', style=style, layout=layout)
    slider_max_lag = widgets.IntSlider(value=5, min=0, max=14, step=1, description='Scan Lag Range (Days):', style=style, layout=layout)

    btn_run = widgets.Button(description="Run Ridge Auto-Lag", button_style='info', icon='search')
    out = widgets.Output()

    ui = widgets.VBox([
        widgets.HTML("<h3>🤝 Ridge Regression (Auto-Lag)</h3><p>Finds the optimal time delay while allowing shared credit.</p>"),
        dd_trigger, dd_metric, dd_channel, slider_max_lag, slider_alpha,
        btn_run, out
    ])
    display(ui)

    def on_run_click(b):
        global ridge_results_df
        with out:
            clear_output()
            trigger = dd_trigger.value
            metric = dd_metric.value
            chan_col = dd_channel.value
            max_lag = slider_max_lag.value
            alpha = slider_alpha.value

            print(f"🔄 Scanning delays (0-{max_lag} days) with Ridge (Alpha={alpha})...")

            # --- A. PREPARE DATA ---
            try:
                # Pivot
                df_work = df_sessions.copy()
                df_work['Clean_Key'] = df_work[chan_col].astype(str).str.lower().str.replace(' ', '')
                df_X_raw = df_work.pivot_table(index='Date', columns='Clean_Key', values=metric, aggfunc='sum').fillna(0)

                # Filter Noise (keep channels with >0.1% of total volume)
                df_X_raw = df_X_raw.loc[:, df_X_raw.sum() > df_X_raw.sum().sum() * 0.001]

                # Target
                df_y_raw = df_evc.groupby('Date')[[trigger]].sum()
            except Exception as e:
                print(f"❌ Error Preparing Data: {e}")
                return

            # --- B. LOOP LAGS (THE "BAKED IN" PART) ---
            best_score = -np.inf
            best_lag = 0
            best_model = None
            best_X = None
            best_y = None

            for lag in range(max_lag + 1):
                # Shift & Align
                X_shifted = df_X_raw.shift(lag).dropna()

                # Intersect Dates first: .loc with dates missing from df_y_raw raises KeyError
                common_idx = X_shifted.index.intersection(df_y_raw.index)
                if len(common_idx) < 10: continue

                X_final = X_shifted.loc[common_idx]
                y_final = df_y_raw.loc[common_idx, trigger].values

                # Fit Ridge
                model = Ridge(alpha=alpha, positive=True, fit_intercept=False)
                model.fit(X_final, y_final)

                # Score
                score = model.score(X_final, y_final)

                if score > best_score:
                    best_score = score
                    best_lag = lag
                    best_model = model
                    best_X = X_final
                    best_y = y_final

            if best_model is None:
                print("❌ No valid correlation found.")
                return

            print(f"✅ Best Fit Found! Lag: {best_lag} days (R²: {best_score:.3f})")

            # --- C. COMPILE RESULTS ---
            name_map = dict(zip(df_work['Clean_Key'], df_work[chan_col]))

            results = []
            y_pred = best_model.predict(best_X)

            for i, col in enumerate(best_X.columns):
                coef = best_model.coef_[i]
                if coef > 0.000001:
                    total_vol = df_X_raw[col].sum() # Use raw volume, not shifted
                    attr_evc = coef * total_vol
                    display_name = name_map.get(col, col)

                    results.append({
                        'Channel': display_name,
                        'Conversion Rate (Ghost)': coef,
                        'Attributed EVCs': attr_evc,
                        'Raw Sessions': total_vol
                    })

            ridge_results_df = pd.DataFrame(results).sort_values(by='Attributed EVCs', ascending=False)

            # Save for Visualizer
            # Note: We do NOT overwrite 'current_results_df' so NNLS is preserved for comparison
            # But we save to CSV so Step 5/6 can pick it up if desired
            ridge_results_df.to_csv('EVC_Competitive_Model_Ridge.csv', index=False)

            # Display Table
            print("-" * 60)
            styled = ridge_results_df.head(15).style.format({
                'Conversion Rate (Ghost)': '{:.5f}',
                'Attributed EVCs': '{:,.1f}',
                'Raw Sessions': '{:,.0f}'
            }).background_gradient(subset=['Attributed EVCs'], cmap='Blues')
            display(styled)

            # --- D. PLOT ---
            plt.figure(figsize=(12, 5))
            plt.plot(best_X.index, best_y, label='Actual Google EVCs', color='black', alpha=0.6)
            plt.plot(best_X.index, y_pred, label=f'Ridge Prediction (Lag {best_lag})', color='blue', linestyle='--')
            plt.title(f'Ridge Model Fit: Shared Credit (Alpha={alpha}, Lag={best_lag})')
            plt.legend()
            plt.grid(True, alpha=0.3)
            plt.show()

    btn_run.on_click(on_run_click)

run_ridge_dashboard()

---
### **⚠️ Note on Methodology**
* **Correlation vs. Causation:** A high R² or a low p-value ($<0.05$) indicates a strong statistical link, but it does not prove causality. Seasonality or concurrent media events can influence these numbers.
* **Directional Signal:** Use these results as a **directional signal** to identify which organic channels are absorbing your paid video demand.

### **Step 3: Validate the Model**
Run the cell below to visualize the accuracy of your attribution.

**The Charts Explained:**
* **Reality Check (Top):** Compares the **Actual EVCs** (Black Line) vs. the **Predicted Model** (Green Dashed).
    * *Goal:* You want these lines to move together. If the Green line spikes when the Black line spikes, the model works.
* **Attribution Stack (Middle):** Breaks down the Green line to show you *which channels* are driving that prediction.
    * *Use Case:* Use this to show stakeholders: *"See that spike on the 15th? That wasn't random; that was Organic Search echoing our video campaign."*
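
The attribution stack is each channel's daily sessions multiplied by its fitted rate; adding the layers together gives the predicted line. A minimal sketch with made-up sessions and hypothetical weights (no lag applied, to keep it short):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-03-01", periods=5, freq="D")

# Hypothetical daily sessions per channel, with a spike on the third day
sessions = pd.DataFrame(
    {"direct": [100, 120, 300, 110, 90], "organic": [400, 420, 900, 410, 380]},
    index=dates,
)

# Hypothetical fitted rates (EVCs per session), e.g. from the results table
weights = pd.Series({"direct": 0.05, "organic": 0.01})

# Each column of the stack is one channel's contribution to predicted EVCs
stack = sessions * weights  # pandas aligns the weights to the column names
predicted = stack.sum(axis=1)

print(stack.assign(Predicted=predicted))
```

Plotting `stack` as a stacked area chart and `predicted` as a line reproduces the idea behind the middle and top panels.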

In [None]:
# =============================================================================
# CELL 3: DUAL-MODEL ATTRIBUTION VISUALIZER
# =============================================================================

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import ipywidgets as widgets
from IPython.display import display, clear_output
import seaborn as sns
import os

def run_dual_visualizer():
    global df_sessions, df_evc

    # --- 1. SETUP ---
    if 'df_sessions' not in globals() or 'df_evc' not in globals():
        print("❌ Error: Raw data not found. Run Step 1 first.")
        return

    # Check for models
    models_found = {}
    if os.path.exists('EVC_Competitive_Model.csv'):
        models_found['NNLS (Competitive)'] = 'EVC_Competitive_Model.csv'
    if os.path.exists('EVC_Competitive_Model_Ridge.csv'):
        models_found['Ridge (Shared Credit)'] = 'EVC_Competitive_Model_Ridge.csv'

    if not models_found:
        print("❌ Error: No model files found. Run Step 2 (NNLS) and Step 2.5 (Ridge).")
        return

    # --- 2. WIDGET SETUP ---
    style = {'description_width': 'initial'}
    layout = widgets.Layout(width='600px')

    # Column Guessing
    evc_options = sorted(df_evc.select_dtypes(include=[np.number]).columns)
    sess_options = sorted(df_sessions.select_dtypes(include=[np.number]).columns)
    chan_options = sorted(df_sessions.select_dtypes(exclude=[np.number]).columns)

    evc_guess = next((c for c in evc_options if 'evc' in c.lower()), evc_options[0])
    sess_guess = next((c for c in sess_options if 'session' in c.lower()), sess_options[0])
    chan_guess = next((c for c in chan_options if 'source' in c.lower() or 'channel' in c.lower()), chan_options[0])

    # Controls
    dd_vis_target = widgets.Dropdown(options=evc_options, value=evc_guess, description='Target (EVC):', style=style, layout=layout)
    dd_vis_metric = widgets.Dropdown(options=sess_options, value=sess_guess, description='Metric (Sessions):', style=style, layout=layout)
    dd_vis_channel = widgets.Dropdown(options=chan_options, value=chan_guess, description='Channel Name Col:', style=style, layout=layout)

    slider_vis_lag = widgets.IntSlider(value=1, min=0, max=14, step=1, description='Shift Stack (Days):', style=style, layout=layout)
    toggle_smooth = widgets.Checkbox(value=False, description='Smooth Data (7-Day Avg)', style=style) # Default off to see spikes

    btn_viz = widgets.Button(description="Visualize Comparison", button_style='primary', icon='columns')
    out_viz = widgets.Output()

    ui = widgets.VBox([
        widgets.HTML("<h3>⚖️ Dual-Model Visualizer</h3><p>Compare the 'Spiky' Model (NNLS) vs. the 'Smooth' Model (Ridge).</p>"),
        dd_vis_target, dd_vis_metric, dd_vis_channel, slider_vis_lag, toggle_smooth,
        btn_viz, out_viz
    ])
    display(ui)

    # --- 3. HELPER TO PROCESS DATA ---
    def prepare_model_data(model_file, trigger, metric, chan_col, lag, smooth):
        try:
            df_model = pd.read_csv(model_file)
            df_model['Clean_Key'] = df_model['Channel'].astype(str).str.lower().str.replace(' ', '')

            # Weights & Mapping
            model_weights = dict(zip(df_model['Clean_Key'], df_model['Conversion Rate (Ghost)']))
            name_map = dict(zip(df_model['Clean_Key'], df_model['Channel']))
            active_keys = {k for k, v in model_weights.items() if v > 0}

            # Pivot Session Data
            df_work = df_sessions.copy()
            df_work['Clean_Key'] = df_work[chan_col].astype(str).str.lower().str.replace(' ', '')
            df_pivot = df_work.pivot_table(index='Date', columns='Clean_Key', values=metric, aggfunc='sum').fillna(0)

            # Apply Lag
            df_X = df_pivot.shift(lag).dropna()

            # Build Stack
            common_keys = [k for k in active_keys if k in df_X.columns]
            if not common_keys: return None, None, "No matching channels."

            # Calculate Impact for Sorting (Top 8)
            impact = {k: (df_X[k] * model_weights[k]).sum() for k in common_keys}
            top_keys = sorted(common_keys, key=impact.get, reverse=True)[:8]

            df_plot = pd.DataFrame(index=df_X.index)
            stack_cols = []

            for key in top_keys:
                original_name = name_map.get(key, key)
                df_plot[original_name] = df_X[key] * model_weights[key]
                stack_cols.append(original_name)

            # Merge Actuals
            df_y = df_evc.groupby('Date')[[trigger]].sum().rename(columns={trigger: 'Actual EVCs'})
            df_final = pd.merge(df_y, df_plot, left_index=True, right_index=True, how='inner')

            if smooth:
                df_final = df_final.rolling(window=7, min_periods=1).mean()

            return df_final, stack_cols, None

        except Exception as e:
            return None, None, str(e)

    # --- 4. PLOTTING LOGIC ---
    def on_viz_click(b):
        with out_viz:
            clear_output()

            # Create Subplots (1 or 2 depending on models found)
            n_models = len(models_found)
            fig, axes = plt.subplots(n_models, 1, figsize=(14, 6 * n_models), sharex=True)
            if n_models == 1: axes = [axes] # Ensure iterable

            params = {
                'trigger': dd_vis_target.value,
                'metric': dd_vis_metric.value,
                'chan_col': dd_vis_channel.value,
                'lag': slider_vis_lag.value,
                'smooth': toggle_smooth.value
            }

            for i, (model_name, filename) in enumerate(models_found.items()):
                ax = axes[i]
                print(f"⚙️ Processing {model_name}...")

                df_final, stack_cols, error = prepare_model_data(filename, **params)

                if error:
                    ax.text(0.5, 0.5, f"Error: {error}", ha='center', transform=ax.transAxes)
                    continue

                # Plot Stack
                colors = sns.color_palette("tab20", len(stack_cols))
                try:
                    ax.stackplot(df_final.index, df_final[stack_cols].T, labels=stack_cols, colors=colors, alpha=0.85)

                    # Plot Actual Line
                    sns.lineplot(data=df_final, x=df_final.index, y='Actual EVCs', ax=ax,
                                 color='black', linewidth=3, label='Actual Reported EVCs')

                    # Formatting
                    ax.set_title(f"{model_name} Attribution", fontsize=14, fontweight='bold')
                    ax.set_ylabel("Conversions")
                    ax.grid(True, alpha=0.2)

                    # Legend (Outside)
                    ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1), title="Attributed Source")

                    # Stats
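                    total_mod = df_final[stack_cols].sum().sum()
                    total_act = df_final['Actual EVCs'].sum()
                    # Guard against a zero denominator when no EVCs were reported in the window
                    ratio_label = f"Accuracy: {total_mod/total_act:.1%}" if total_act else "Accuracy: n/a"
                    ax.text(0.01, 0.95, ratio_label, transform=ax.transAxes,
                            bbox=dict(facecolor='white', alpha=0.8))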

                except Exception as e:
                    print(f"Plotting Error on {model_name}: {e}")

            plt.tight_layout()
            plt.show()

    btn_viz.on_click(on_viz_click)

run_dual_visualizer()
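
As a rough illustration of the stack construction used by the visualizer above, the sketch below shifts a small channel pivot by a lag and scales each channel by a weight. The dates, session counts, and weights are made-up values, and `weights` is a hypothetical stand-in for the `Conversion Rate (Ghost)` column; none of it is model output.

```python
import pandas as pd

# Hypothetical daily sessions per channel (illustrative numbers only).
dates = pd.date_range("2024-01-01", periods=5, freq="D")
sessions = pd.DataFrame(
    {"direct": [100, 120, 90, 110, 130], "organic": [200, 180, 210, 190, 220]},
    index=dates,
)

# Hypothetical per-channel weights, standing in for 'Conversion Rate (Ghost)'.
weights = {"direct": 0.10, "organic": 0.05}

lag = 1
# shift(lag) moves every value `lag` rows later, so the row dated t holds the
# sessions from day t - lag. The first `lag` rows become NaN and are dropped.
shifted = sessions.shift(lag).dropna()

# Each channel's modelled contribution is its lagged sessions times its weight.
stack = pd.DataFrame({ch: shifted[ch] * w for ch, w in weights.items()})
modelled_total = stack.sum(axis=1)

print(stack)
print(modelled_total)
```

With a lag of 1 the first date drops out, so the stack starts one day after the session data does. That is why the chart's date range shrinks as the lag slider increases.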

## 💰 Step 5: Financial Impact (The "So What?")

### **Why use this?**
Knowing *where* traffic comes from is useful, but knowing *what it's worth* gets budget approved. This calculator translates the model's attributed "Ghost Traffic" into actual dollars.

**Key Metrics Calculated:**
* **Ghost Revenue:** `Attributed EVCs * Average Order Value`. The estimated revenue driven by YouTube that GA4 credited to other channels.
* **Media Value Created:** `Attributed EVCs * Target CPA`. The efficiency value: how much you *would* have paid to acquire these customers directly on other paid channels.

**Instructions:**
1.  Enter your **Average Order Value (AOV)** (e.g., \$50).
2.  Enter your **Target CPA** (e.g., \$15).
3.  Click **Calculate ROI** to reveal the hidden financial value of your video campaign.
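
To make the arithmetic concrete, here is a minimal sketch of the two formulas using the example inputs above (\$50 AOV, \$15 CPA). The per-channel EVC counts are hypothetical values chosen for illustration, not model output.

```python
# Hypothetical attributed EVCs per channel (illustrative values only).
attributed_evcs = {"Direct": 120.0, "Organic Search": 80.0}

aov = 50.00         # Average Order Value in dollars
target_cpa = 15.00  # Target cost per acquisition in dollars

# Ghost Revenue = Attributed EVCs * AOV
ghost_revenue = {ch: evcs * aov for ch, evcs in attributed_evcs.items()}
# Media Value Created = Attributed EVCs * Target CPA
media_value = {ch: evcs * target_cpa for ch, evcs in attributed_evcs.items()}

total_revenue = sum(ghost_revenue.values())
total_media_value = sum(media_value.values())

print(f"Total Ghost Revenue: ${total_revenue:,.2f}")     # $10,000.00
print(f"Total Media Value:   ${total_media_value:,.2f}")  # $3,000.00
```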

In [None]:
# GHOST ROI CALCULATOR
# =============================================================================
# CELL 5: GHOST ROI CALCULATOR
# =============================================================================

import pandas as pd
import ipywidgets as widgets
from IPython.display import display, clear_output
import os

def run_roi_calculator():
    # --- 1. LOAD MODEL ---
    target_file = 'EVC_Competitive_Model.csv'
    if os.path.exists(target_file):
        df_model = pd.read_csv(target_file)
    else:
        print("❌ Error: 'EVC_Competitive_Model.csv' not found. Run Step 2 first.")
        return

    # --- 2. WIDGET SETUP ---
    style = {'description_width': 'initial'}
    layout = widgets.Layout(width='400px')

    # Financial Inputs
    txt_aov = widgets.FloatText(value=50.00, description='Average Order Value ($):', style=style, layout=layout)
    txt_cpa = widgets.FloatText(value=15.00, description='Target CPA ($):', style=style, layout=layout)

    btn_calc = widgets.Button(description="Calculate ROI", button_style='success', icon='dollar-sign')
    out_calc = widgets.Output()

    ui = widgets.VBox([
        widgets.HTML("<h3>💰 Ghost ROI Calculator</h3><p>Translate 'Attributed Conversions' into Revenue and Media Value.</p>"),
        txt_aov, txt_cpa, btn_calc, out_calc
    ])
    display(ui)

    def on_calc_click(b):
        with out_calc:
            clear_output()
            aov = txt_aov.value
            cpa = txt_cpa.value

            # --- CALCULATIONS ---
            # Revenue = EVCs * AOV
            # Media Value = EVCs * CPA (How much you would have paid to get these elsewhere)

            df_roi = df_model.copy()
            df_roi['Ghost Revenue'] = df_roi['Attributed EVCs'] * aov
            df_roi['Media Value Created'] = df_roi['Attributed EVCs'] * cpa

            # Totals
            total_evc = df_roi['Attributed EVCs'].sum()
            total_rev = df_roi['Ghost Revenue'].sum()
            total_val = df_roi['Media Value Created'].sum()

            # --- OUTPUT ---
            print("\n" + "="*60)
            print("💸 FINANCIAL IMPACT ANALYSIS")
            print("="*60)
            print(f"• Total Attributed EVCs:    {total_evc:,.0f}")
            print(f"• Total Ghost Revenue:      ${total_rev:,.2f}  (Based on ${aov:,.2f} AOV)")
            print(f"• Total Media Value:        ${total_val:,.2f}  (Based on ${cpa:,.2f} CPA)")
            print("-" * 60)
            print(f"💡 INSIGHT: Your YouTube campaigns generated an estimated ${total_rev:,.0f} in revenue")
            print("   that was incorrectly attributed to other channels in GA4.")
            print("="*60 + "\n")

            # Pretty Table
            cols_to_show = ['Channel', 'Attributed EVCs', 'Ghost Revenue', 'Media Value Created']

            styled = df_roi[cols_to_show].head(10).style.format({
                'Attributed EVCs': '{:,.1f}',
                'Ghost Revenue': '${:,.2f}',
                'Media Value Created': '${:,.2f}'
            }).background_gradient(subset=['Ghost Revenue'], cmap='Greens')

            display(styled)

    btn_calc.on_click(on_calc_click)

run_roi_calculator()

In [None]:
# @title
# =============================================================================
# CELL 4: CAUSAL IMPACT (WITH TOGGLE)
# =============================================================================
# Note: Ensure '!pip install pycausalimpact' was run in Cell 0

import numpy as np  # needed below for select_dtypes(include=[np.number])
import pandas as pd
try:
    from causalimpact import CausalImpact
except ImportError:
    print("❌ Error: CausalImpact not found. Please run 'Cell 0' to install it.")

import ipywidgets as widgets
from IPython.display import display, clear_output

def run_causal_impact_final():
    # --- 1. SETUP ---
    if 'df_sessions' not in globals():
        print("❌ Error: Session Data not found.")
        return

    style = {'description_width': 'initial'}
    layout = widgets.Layout(width='600px')

    # Identify Columns
    date_col = 'Date'
    numeric_cols = sorted(df_sessions.select_dtypes(include=[np.number]).columns)
    cat_cols = df_sessions.select_dtypes(exclude=[np.number]).columns
    chan_col_guess = cat_cols[0] if len(cat_cols) > 0 else None

    # --- 2. WIDGETS ---
    # Mode Toggle
    tgl_mode = widgets.ToggleButtons(
        options=['Simple (Target Only)', 'Advanced (With Controls)'],
        description='Analysis Mode:',
        style={'button_width': '180px'},
        layout=layout
    )

    # Metric & Dimension
    dd_metric = widgets.Dropdown(options=numeric_cols, value='Sessions' if 'Sessions' in numeric_cols else numeric_cols[0], description='Metric:', layout=layout)
    dd_chan_col = widgets.Dropdown(options=cat_cols, value=chan_col_guess, description='Channel Column:', layout=layout)

    # Channel Selectors
    dd_target_chan = widgets.Dropdown(options=[], description='🔴 Test Channel (Ads):', layout=layout)
    dd_controls = widgets.SelectMultiple(options=[], description='🟢 Control Channels:', layout=layout, disabled=True)

    # Dates
    date_start = df_sessions[date_col].min()
    picker_launch = widgets.DatePicker(description='Launch Date:', value=date_start + pd.Timedelta(days=14), layout=layout)

    btn_run = widgets.Button(description="Calculate Lift", button_style='danger', icon='rocket')
    out_ci = widgets.Output()

    # --- 3. INTERACTIVITY ---
    def update_channels(*args):
        col = dd_chan_col.value
        if col:
            unique_chans = sorted(df_sessions[col].astype(str).unique())
            dd_target_chan.options = unique_chans
            dd_controls.options = unique_chans

    def on_mode_change(change):
        # Enable/Disable Control Selector based on Toggle
        if change['new'] == 'Simple (Target Only)':
            dd_controls.disabled = True
        else:
            dd_controls.disabled = False

    dd_chan_col.observe(update_channels, 'value')
    tgl_mode.observe(on_mode_change, 'value')

    # Initialize
    update_channels()

    # --- 4. DISPLAY UI ---
    display(widgets.VBox([
        widgets.HTML("<h3>🚀 Causal Impact Analyzer</h3>"),
        tgl_mode,
        picker_launch,
        dd_metric,
        dd_chan_col,
        dd_target_chan,
        dd_controls,
        btn_run,
        out_ci
    ]))

    # --- 5. EXECUTION LOGIC ---
    def on_run_click(b):
        with out_ci:
            clear_output()
            mode = tgl_mode.value
            target_chan = dd_target_chan.value
            metric = dd_metric.value
            launch_date = pd.to_datetime(picker_launch.value)
            chan_col_name = dd_chan_col.value

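            if not target_chan:
                print("❌ Error: Please select a Test Channel.")
                return
            # A cleared date picker yields no date, which would break the period arithmetic below
            if pd.isna(launch_date):
                print("❌ Error: Please select a Launch Date.")
                return

            print(f"⚙️ Preparing Data in '{mode}' mode...")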

            # Pivot Data to get Daily columns per Channel
            df_pivot = df_sessions.pivot_table(index=date_col, columns=chan_col_name, values=metric, aggfunc='sum').fillna(0)

            # Select Columns based on Mode
            if mode == 'Simple (Target Only)':
                # Just the target (Univariate)
                try:
                    df_model = df_pivot[[target_chan]]
                except KeyError:
                    print(f"❌ Error: Channel '{target_chan}' not found in data.")
                    return
            else:
                # Target + Controls (Multivariate)
                control_chans = list(dd_controls.value)
                if not control_chans:
                    print("⚠️ Warning: Advanced mode selected but no controls picked. Switching to Simple mode.")
                    df_model = df_pivot[[target_chan]]
                else:
                    # Target MUST be first column for CausalImpact
                    selected_cols = [target_chan] + control_chans
                    df_model = df_pivot[selected_cols]
                    print(f"   • Using {len(control_chans)} control channels as baseline.")

            # Define Periods
            pre_period = [str(df_model.index.min().date()), str((launch_date - pd.Timedelta(days=1)).date())]
            post_period = [str(launch_date.date()), str(df_model.index.max().date())]

            print(f"📊 Calculating Lift on '{target_chan}'...")

            try:
                # Run Model
                ci = CausalImpact(df_model, pre_period, post_period)

                # Text Report
                print("\n" + "="*60)
                print(ci.summary())
                print("="*60)

                # Visual Report
                ci.plot()

            except Exception as e:
                print(f"\n❌ Calculation Error: {e}")
                if mode == 'Advanced (With Controls)':
                    print("   👉 Tip: Ensure your Control Channels are stable and didn't ALSO receive ad traffic.")

    btn_run.on_click(on_run_click)

run_causal_impact_final()
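
The pre/post split performed in the cell above can be sketched with the standard library alone. `split_periods` is a hypothetical helper written for this illustration, not part of `pycausalimpact`. The range check is an added assumption: the cell itself does not validate that the launch date falls inside the data.

```python
from datetime import date, timedelta

def split_periods(first_day, last_day, launch_day):
    """Return (pre_period, post_period) as [start, end] ISO date strings."""
    # Assumed sanity check: the launch must leave at least one pre-period day
    # and must not fall after the last day of data.
    if not (first_day < launch_day <= last_day):
        raise ValueError("launch_day must fall after first_day and on or before last_day")
    # The pre-period ends the day before launch; the post-period starts on launch day.
    pre_period = [first_day.isoformat(), (launch_day - timedelta(days=1)).isoformat()]
    post_period = [launch_day.isoformat(), last_day.isoformat()]
    return pre_period, post_period

pre, post = split_periods(date(2024, 1, 1), date(2024, 3, 31), date(2024, 2, 1))
print(pre)   # ['2024-01-01', '2024-01-31']
print(post)  # ['2024-02-01', '2024-03-31']
```

The two periods are adjacent and do not overlap, so every day of data belongs to exactly one of them.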

In [None]:
# =============================================================================
# CELL 6: ADVANCED EXECUTIVE REPORT (DUAL MODEL + CHARTS)
# =============================================================================
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
import matplotlib.pyplot as plt
import seaborn as sns
import base64
import pandas as pd
import io
import numpy as np

# --- HELPER: CONVERT PLOT TO BASE64 IMAGE ---
def plot_to_base64(df_model, df_sess, df_target, title_prefix):
    """
    Generates a static Stackplot for the HTML report.
    """
    try:
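        # 1. Clean & Prepare Data
        # Re-create the matching logic from Step 3 (simplified for reporting)
        # Work on a copy so the helper column does not leak into the caller's results table
        df_model = df_model.copy()
        df_model['Clean_Key'] = df_model['Channel'].astype(str).str.lower().str.replace(' ', '')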
        model_weights = dict(zip(df_model['Clean_Key'], df_model['Conversion Rate (Ghost)']))
        name_map = dict(zip(df_model['Clean_Key'], df_model['Channel']))
        active_keys = {k for k, v in model_weights.items() if v > 0}

        # Guess columns (Target & Session)
        trigger_col = df_target.select_dtypes(include=[np.number]).columns[0]
        sess_col = df_sess.select_dtypes(include=[np.number]).columns[0]
        chan_col = df_sess.select_dtypes(exclude=[np.number]).columns[0]

        # Pivot Sessions
        df_work = df_sess.copy()
        df_work['Clean_Key'] = df_work[chan_col].astype(str).str.lower().str.replace(' ', '')
        df_pivot = df_work.pivot_table(index='Date', columns='Clean_Key', values=sess_col, aggfunc='sum').fillna(0)

        # Apply Lag: the report uses a fixed 1-day lag (this helper takes no lag argument)
        lag = 1
        df_X = df_pivot.shift(lag).dropna()

        # Build Stack
        common_keys = [k for k in active_keys if k in df_X.columns]
        if not common_keys: return None

        df_plot = pd.DataFrame(index=df_X.index)
        stack_cols = []

        # Sort by impact
        sorted_keys = sorted(common_keys, key=lambda k: model_weights[k] * df_X[k].sum(), reverse=True)
        top_keys = sorted_keys[:8] # Top 8 for clean chart

        for key in top_keys:
            original_name = name_map.get(key, key)
            df_plot[original_name] = df_X[key] * model_weights[key]
            stack_cols.append(original_name)

        # Merge Actuals
        df_y = df_target.groupby('Date')[[trigger_col]].sum().rename(columns={trigger_col: 'Actual'})
        df_final = pd.merge(df_y, df_plot, left_index=True, right_index=True, how='inner')

        # Smooth for prettiness
        df_final = df_final.rolling(window=7, min_periods=1).mean()

        # PLOT
        plt.figure(figsize=(10, 5))
        colors = sns.color_palette("tab20", len(stack_cols))
        plt.stackplot(df_final.index, df_final[stack_cols].T, labels=stack_cols, colors=colors, alpha=0.85)
        plt.plot(df_final.index, df_final['Actual'], color='black', linewidth=2, label='Reported EVCs')

        plt.title(f"{title_prefix}: Traffic Source Breakdown", fontsize=14, fontweight='bold')
        plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
        plt.tight_layout()

        # Save to buffer
        buf = io.BytesIO()
        plt.savefig(buf, format='png', bbox_inches='tight')
        plt.close()
        return base64.b64encode(buf.getvalue()).decode('utf-8')

    except Exception as e:
        print(f"Error plotting {title_prefix}: {e}")
        return None

def generate_full_report():
    # --- 1. GATHER DATA ---
    report_html = ""

    # --- SECTION A: NNLS MODEL ---
    # We look for 'current_results_df' (usually NNLS) or check if you saved it specifically
    # If you only have one variable, we'll use it.
    if 'current_results_df' in globals():
        df_nnls = current_results_df

        # Generate Chart
        img_nnls = plot_to_base64(df_nnls, df_sessions, df_evc, "NNLS Model (The 'Kill Switch')")

        report_html += f"""
        <div class="metric-box">
            <h2>1. NNLS Model (Primary Attribution)</h2>
            <p>This model identifies the <strong>primary drivers</strong> by forcing channels to compete for credit.</p>
            {'<img src="data:image/png;base64,' + img_nnls + '" style="width:100%; max-width:800px;"/>' if img_nnls else '<p><em>Chart unavailable (check data).</em></p>'}
            <br>
            {df_nnls.head(8).to_html(classes='table', index=False)}
        </div>
        <hr>
        """

    # --- SECTION B: RIDGE MODEL ---
    # Check if a Ridge model exists (users often save it to 'ridge_results_df' or overwrite 'current')
    if 'ridge_results_df' in globals() and ridge_results_df is not None:
        df_ridge = ridge_results_df

        # Generate Chart
        img_ridge = plot_to_base64(df_ridge, df_sessions, df_evc, "Ridge Model (The 'Halo Effect')")

        report_html += f"""
        <div class="metric-box">
            <h2>2. Ridge Regression (Multi-Touch View)</h2>
            <p>This model reveals the <strong>shared lift</strong> ("Halo Effect") where video ads influence multiple channels simultaneously.</p>
            {'<img src="data:image/png;base64,' + img_ridge + '" style="width:100%; max-width:800px;"/>' if img_ridge else '<p><em>Chart unavailable.</em></p>'}
            <br>
            {df_ridge.head(8).to_html(classes='table', index=False)}
        </div>
        """
    elif 'current_results_df' in globals():
        # Fallback if user overwrote the variable
        report_html += "<p><em>Note: Run Step 2.5 (Ridge) to see the comparative model here.</em></p>"

    # --- 3. BUILD FINAL HTML ---
    full_html = f"""
    <!DOCTYPE html>
    <html>
    <head>
        <title>Video Attribution Report</title>
        <style>
            body {{ font-family: 'Helvetica', sans-serif; margin: 40px auto; color: #333; max-width: 1000px; }}
            h1 {{ color: #1a73e8; border-bottom: 3px solid #1a73e8; padding-bottom: 10px; }}
            h2 {{ color: #202124; margin-top: 30px; }}
            .metric-box {{ background: #fff; padding: 20px; border: 1px solid #ddd; border-radius: 8px; box-shadow: 0 2px 5px rgba(0,0,0,0.05); margin-bottom: 30px; }}
            .table {{ width: 100%; border-collapse: collapse; margin-top: 15px; font-size: 0.9em; }}
            .table th {{ background: #f1f3f4; text-align: left; padding: 12px; border-bottom: 2px solid #ddd; }}
            .table td {{ border-bottom: 1px solid #eee; padding: 10px; }}
            .footer {{ margin-top: 50px; font-size: 0.8em; color: #777; text-align: center; }}
        </style>
    </head>
    <body>
        <h1>🎥 Attribution &amp; Ghost Traffic Analysis</h1>
        <p><strong>Date:</strong> {pd.Timestamp.now().strftime('%Y-%m-%d')}</p>
        <p>This report quantifies the incremental impact of Video campaigns on other traffic channels (Organic, Direct, Search).</p>

        {report_html}

        <div class="metric-box">
            <h2>3. Strategic Recommendations</h2>
            <ul>
                <li><strong>Validation:</strong> If the <em>Ridge Model</em> shows broader lift than NNLS, assume video is driving a general brand awareness lift rather than just specific channel clicks.</li>
                <li><strong>Budgeting:</strong> Use the "Attributed EVCs" from the NNLS model to calculate your baseline ROAS.</li>
            </ul>
        </div>

        <div class="footer">Generated by the Ghost Traffic Attribution Tool</div>
    </body>
    </html>
    """

    # --- 4. DOWNLOAD BUTTON ---
    b64 = base64.b64encode(full_html.encode()).decode()
    href = f'<a href="data:text/html;base64,{b64}" download="Attribution_Master_Report.html" target="_blank">'
    href += '<button style="background-color: #1a73e8; color: white; padding: 12px 24px; border: none; border-radius: 4px; font-size: 16px; cursor: pointer; font-weight: bold;">📥 Download Full Report</button>'
    href += '</a>'

    return HTML(href)

# Create Widget Wrapper
btn_gen = widgets.Button(description="Generate Master Report", button_style='info', icon='file-text', layout=widgets.Layout(width='250px'))
out_gen = widgets.Output()

display(widgets.VBox([
    widgets.HTML("<h3>📑 Final Step: Download Master Report</h3><p>Generates a complete HTML dashboard with charts for both NNLS and Ridge models.</p>"),
    btn_gen, out_gen
]))

def on_gen_click(b):
    with out_gen:
        clear_output()
        print("⚙️ Generating charts and tables... (This may take a moment)")
        display(generate_full_report())

btn_gen.on_click(on_gen_click)
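
The download button above works by packing the whole report into a base64 `data:` URI. A minimal sketch of that encoding step follows, using only the standard library. `html_download_link` is a hypothetical helper named for this illustration, and the sample markup is a placeholder rather than the real report.

```python
import base64

def html_download_link(html_text, filename):
    """Wrap an HTML string in an anchor tag that downloads it as a file."""
    # Base64 keeps quotes, angle brackets, and non-ASCII characters out of the href.
    payload = base64.b64encode(html_text.encode("utf-8")).decode("ascii")
    return f'<a href="data:text/html;base64,{payload}" download="{filename}">Download</a>'

link = html_download_link("<h1>Hi</h1>", "report.html")
print(link)

# Decoding the payload recovers the original markup unchanged.
encoded = link.split("base64,")[1].split('"')[0]
print(base64.b64decode(encoded).decode("utf-8"))
```

Because the entire document travels inside the link, embedded charts add to its length. Very large reports may be better saved to a file instead.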