In [None]:
import pandas as pd
import plotly.express as px
from main import get_click_attribution_table, markov_chain

In [None]:
# read the files
train = pd.read_csv("train.csv")
test = pd.read_csv("test_attribution.csv")

## Train data

> Here we run all the attribution models at once and compare their results. The goal is to understand which marketing channels drive conversions and to optimize budget allocation based on those attribution insights.

In [None]:
# Step 1: Get attribution scores from each model (e.g., first/last click)
# Expected columns (used below): 'channel', 'model', 'conversions', 'percentage'
attribution_table = get_click_attribution_table(train, channel_col='channel')

# Step 2: Compute total touchpoints, actual conversions, and users per channel
conversion_summary = (
    train
    .groupby('channel')
    .agg(
        total_touchpoints=('channel', 'count'),
        actual_conversions=('converted', 'sum'),
        user_count=('user_id', 'nunique')
    )
    .reset_index()
)

# Step 3: Add percentage of users per channel
total_users = train['user_id'].nunique()
conversion_summary['user_pct'] = (
    conversion_summary['user_count'] / total_users * 100
).round(2)

# Step 4: Compute conversion rate (% of a channel's touchpoints flagged as converted)
conversion_summary['conversion_rate'] = (
    conversion_summary['actual_conversions'] / conversion_summary['total_touchpoints'] * 100
).round(4)

# Step 5: Merge attribution results with summary metrics
attribution_full = pd.merge(attribution_table, conversion_summary, on='channel', how='left')

# Final columns: channel, model, conversions, percentage, total_touchpoints, actual_conversions, user_count, user_pct, conversion_rate
attribution_full.sort_values(['model', 'percentage'], ascending=[True, False], inplace=True)


In [None]:
# Let's put everything together and compare the models
fig = px.bar(
    attribution_full,
    x='channel',
    y='percentage',
    color='model',
    barmode='group',
    text='percentage',
    title='Attribution Models (Percentage Share by Channel)',
    labels={'percentage': 'Attribution %', 'channel': 'Channel'}
)
fig.update_traces(texttemplate='%{text}%', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide', xaxis_tickangle=-45)
fig.show()

In [None]:

# Round % values to 1 decimal place for display
attribution_full['conversion_rate'] = attribution_full['conversion_rate'].round(1)
attribution_full['user_pct'] = attribution_full['user_pct'].round(1)

# === Plot 1: Conversion Rate per Channel ===
fig = px.bar(
    attribution_full[['channel', 'conversion_rate']].drop_duplicates(),
    x='channel',
    y='conversion_rate',
    barmode='group',
    text='conversion_rate',
    title='Conversion Rate per Channel (%)',
    labels={'conversion_rate': 'Conversion Rate (%)', 'channel': 'Channel'}
)

fig.update_traces(texttemplate='%{text}%', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide', xaxis_tickangle=-45)
fig.show()


# === Plot 2: % of Users per Channel ===
fig = px.bar(
    attribution_full[['channel', 'user_pct']].drop_duplicates(),
    x='channel',
    y='user_pct',
    barmode='group',
    text='user_pct',
    title='User Share per Channel (%)',
    labels={'user_pct': 'User %', 'channel': 'Channel'}
)

fig.update_traces(texttemplate='%{text}%', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide', xaxis_tickangle=-45)
fig.show()


<div style="border-radius: 6px; background-color: #fff3cd; padding: 12px; border: 1px solid #ffeeba; color: #856404;">
  <strong> Observation:</strong> 
   
   1. First-touch attribution gives more credit to Social and Display, indicating that they are effective for awareness. Last-touch attribution favors Email and Direct, showing their role in closing conversions.
   
   2. The Markov model with removal effect spreads credit across the full journey. Channels like Paid Search and Email show high impact: removing them leads to significant drops in conversion probability.
   
   3. The LSTM captures temporal and contextual interactions, which is essential for campaign sequencing.
    
   4. Marketing channels generally show a lower conversion time than Direct, owing to traffic volume.
    
  <strong> Recommendations:</strong> 
   
   1. Email & Direct are strong closers (high attribution in Markov & LSTM).
   
   2. Social & Display are weak converters and are best used for awareness.
   
   3. The Markov model offers balanced credit and should be used for spend reallocation.
   
   4. The LSTM can be deployed to predict conversion likelihood in real time.
    
</div>
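
The removal effect mentioned in the observations can be illustrated with a small, self-contained sketch. It is not the implementation behind `markov_chain` in `main`; the function names, the state labels (`start`, `conv`, `null`) and the toy transition probabilities are all assumptions made for illustration. The idea is to compute the probability of reaching conversion in an absorbing Markov chain, remove one channel at a time, and measure the relative drop.

```python
import numpy as np


def conversion_probability(trans, start="start", conv="conv"):
    """Probability of reaching `conv` from `start` in an absorbing Markov chain.

    `trans[state][next_state]` holds transition probabilities. Any state
    without an outgoing row (e.g. "null") is treated as a dead end.
    """
    states = [s for s in trans if s != conv]      # transient states
    idx = {s: i for i, s in enumerate(states)}
    Q = np.zeros((len(states), len(states)))      # transient -> transient
    r = np.zeros(len(states))                     # transient -> conversion
    for s, row in trans.items():
        if s not in idx:
            continue
        for nxt, p in row.items():
            if nxt == conv:
                r[idx[s]] += p
            elif nxt in idx:
                Q[idx[s], idx[nxt]] += p
    # Absorption probabilities x solve (I - Q) x = r
    x = np.linalg.solve(np.eye(len(states)) - Q, r)
    return float(x[idx[start]])


def removal_effects(trans, start="start", conv="conv", null="null"):
    """Relative drop in conversion probability when each channel is removed."""
    base = conversion_probability(trans, start, conv)
    effects = {}
    for channel in trans:
        if channel == start:
            continue
        # Drop the channel and redirect every transition into it to `null`
        reduced = {}
        for s, row in trans.items():
            if s == channel:
                continue
            new_row = {}
            for nxt, p in row.items():
                target = null if nxt == channel else nxt
                new_row[target] = new_row.get(target, 0.0) + p
            reduced[s] = new_row
        effects[channel] = 1.0 - conversion_probability(reduced, start, conv) / base
    return effects


# Toy transition probabilities (made up for illustration, not from train.csv)
toy = {
    "start":  {"email": 0.5, "social": 0.5},
    "social": {"email": 0.5, "null": 0.5},
    "email":  {"conv": 0.5, "null": 0.5},
}
effects = removal_effects(toy)
total = sum(effects.values())
shares = {ch: e / total for ch, e in effects.items()}  # normalized attribution
print(effects)
print(shares)
```

In this toy chain, removing `email` eliminates every converting path (removal effect of 1), while removing `social` cuts the conversion probability by a third, so normalizing the effects gives `email` roughly three quarters of the credit.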




In [None]:
# Let's look at Sankey paths to understand the journeys
markov_chain(train).plot_transition_sankey(train)

> The transitions in the Markov model show the following:

| **Insight**                                | **Interpretation**                                                                |
|--------------------------------------------|---------------------------------------------------------------------------------|
| High inflow from Start → Paid Search       | Paid Search acts as a top-of-funnel awareness driver — most users start here.      |
| Repeated loops in Social / Display         | Users engage but don’t progress — these may be weak touchpoints or drop-offs.      |
| Email or Direct before Conversion          | These are strong closers — ideal for retargeting or final-step campaign nudges.    |
| Multi-channel paths are common             | Most journeys are multi-touch — validates advanced attribution models like Markov. |
| No single dominant path                    | Funnel is non-linear — requires omnichannel coordination.                          |
| Direct channel as final step               | High brand recall or intentional navigation; a strong indicator of trust/intent.   |
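
To make the transitions behind such a Sankey diagram concrete, here is a minimal sketch of how the `Start → channel → … → Conversion/Null` transitions could be counted from raw touchpoints. It is not the code behind `plot_transition_sankey`; the helper name `transition_counts`, the `timestamp` column and the toy rows are assumptions, while `user_id`, `channel` and `converted` are the columns already used above.

```python
import pandas as pd


def transition_counts(df, user_col="user_id", time_col="timestamp",
                      channel_col="channel", conv_col="converted"):
    """Count transitions between consecutive touchpoints of each user journey."""
    pairs = []
    ordered = df.sort_values([user_col, time_col])
    for _, journey in ordered.groupby(user_col):
        # A journey ends in "Conversion" if any touchpoint is flagged as converted
        end = "Conversion" if journey[conv_col].any() else "Null"
        path = ["Start", *journey[channel_col], end]
        pairs.extend(zip(path[:-1], path[1:]))
    return (
        pd.DataFrame(pairs, columns=["source", "target"])
        .groupby(["source", "target"])
        .size()
        .reset_index(name="count")
    )


# Toy touchpoints (made up for illustration)
toy = pd.DataFrame({
    "user_id":   [1, 1, 2],
    "timestamp": [1, 2, 1],
    "channel":   ["Social", "Email", "Social"],
    "converted": [0, 1, 0],
})
counts = transition_counts(toy)
print(counts)
```

The resulting `source`/`target`/`count` table is the kind of input a Sankey plot needs (for example the `link` argument of `plotly.graph_objects.Sankey`, after mapping state names to node indices).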

## Test data

In [None]:
# Plot the attribution shares of each model on the test set
fig = px.bar(
    test,
    x='channel',
    y='percentage',
    color='model',
    barmode='group',
    text='percentage',
    title='Attribution Models (Percentage Share by Channel)',
    labels={'percentage': 'Attribution %', 'channel': 'Channel'}
)
fig.update_traces(texttemplate='%{text}%', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide', xaxis_tickangle=-45)
fig.show()

<div style="border-radius: 6px; background-color: #fff3cd; padding: 12px; border: 1px solid #ffeeba; color: #856404;">
  <strong> Observation:</strong> 
   
   1. First-touch attribution on the test set continues to highlight Social and Display as common entry points, reinforcing their role in brand awareness and initial engagement. Last-touch attribution again leans heavily on Email and Direct, validating their effectiveness in closing conversions.
   
   2. The Markov chain with removal effect attributes significant weight to Paid Search, Email, and, surprisingly, Social, indicating that even in test journeys, early- and middle-funnel interactions contribute to eventual conversion.
   
   3. The LSTM captures context-sensitive behaviors, showing that test journeys follow similar strategic patterns to training journeys.
    
</div>
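
One way to check how closely the test-set shares track the training-set shares is to join the two attribution tables on `channel` and `model` and look at the difference in percentage points. The sketch below uses placeholder numbers and hypothetical model labels (`first_touch`, `last_touch`), not results from this notebook; in practice `train_attr` and `test_attr` would be `attribution_full` and `test`.

```python
import pandas as pd

# Placeholder values for illustration only, not results from this notebook
train_attr = pd.DataFrame({
    "channel": ["Email", "Social", "Email", "Social"],
    "model": ["first_touch", "first_touch", "last_touch", "last_touch"],
    "percentage": [30.0, 70.0, 65.0, 35.0],
})
test_attr = pd.DataFrame({
    "channel": ["Email", "Social", "Email", "Social"],
    "model": ["first_touch", "first_touch", "last_touch", "last_touch"],
    "percentage": [34.0, 66.0, 60.0, 40.0],
})

# Align train and test shares per (channel, model) pair
comparison = train_attr.merge(
    test_attr, on=["channel", "model"], suffixes=("_train", "_test")
)
# Shift in percentage points between test and train
comparison["shift_pp"] = comparison["percentage_test"] - comparison["percentage_train"]

# Largest absolute shift per model: a rough stability indicator
max_shift = (
    comparison.assign(abs_shift=comparison["shift_pp"].abs())
    .groupby("model")["abs_shift"]
    .max()
)
print(comparison)
print(max_shift)
```

Small shifts across all models would support the observation that test journeys behave like training journeys; large shifts for a single model would suggest that model is less stable.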


## Attribution results

__Recommendations to Marketing__

- Based on the attribution results, the best channels at the __End of Funnel__ are `Email`, `Direct`, and `Paid Search`. This is because:

    - They appear just before conversion in last-touch attribution and dominate Markov transitions into conversion.


- The best channels for __Awareness__ are `Display` and `Social Media`. This is because:

    - They are heavily present in LSTM paths but under-credited by last-click attribution.

    - They typically appear early in the funnel and boost user recall and re-engagement.






__Steps to Productionize ML Attribution__


1. Automated Data Pipeline
- Ingest daily logs (user events, sessions, impressions) using workflow orchestrators like **Airflow** or **Prefect**.
- Apply ETL logic to clean and transform raw data into structured **touchpoint sequences** per user.
- Generate time-ordered journeys with fields like `timestamp`, `channel`, `device`, and `converted`.

2. Feature Engineering and Preparation
- Encode categorical features (e.g., channels, devices) using Label Encoding or Embeddings.
- Derive sequence features like:
  - `touchpoint_number`
  - `time_since_first_touch`
  - `time_to_conversion_days`
- Format sequences to fixed-length using **padding** or **sliding windows** (especially for LSTM models).

3. Model Hosting and Serving
- Deploy trained models (e.g., **LSTM**, **Markov Chains**) as APIs using:
  - `Flask` or `FastAPI` (for REST endpoints)
  - `TensorFlow Serving` (for serving Keras models at scale)
- Containerize using **Docker**, and optionally orchestrate with **Kubernetes** for scalability.

4. Batch Scoring
- Score new user journeys daily in **batch mode**.
- For each user sequence, predict:
  - Conversion probability (LSTM)
  - Attribution weights (Markov)
- Save the results per touchpoint or per channel.

5. Attribution Aggregation
- Aggregate predictions or attribution scores:
  - Sum normalized probabilities across touchpoints per channel.
  - Normalize scores to 100% per user or campaign.
- Join with actual conversions or revenue data for performance analysis.

6. Storage and Visualization
- Store attribution outputs in a **data warehouse**:
  - BigQuery
  - Snowflake
  - PostgreSQL
- Build visual dashboards using:
  - **Looker**, **Tableau**, or **PowerBI**
  - Lightweight apps in **Streamlit** or **Dash** for stakeholder interaction

7. Monitoring and Alerts
- Track model metrics over time:
  - Conversion accuracy
  - Attribution drift
- Set up alerts for:
  - Drops in model performance (e.g., ROC AUC)
  - Sudden shifts in channel attribution share
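
As a minimal sketch of steps 1 and 2 above (time-ordered journeys, label encoding, and fixed-length sequences), the snippet below turns raw events into left-padded integer sequences using only pandas and NumPy. The helper name `build_padded_sequences`, the `max_len` value and the toy events are assumptions; a real pipeline would likely rely on the padding utilities of its deep learning framework.

```python
import numpy as np
import pandas as pd


def build_padded_sequences(events, max_len=3):
    """Turn raw events into one fixed-length, label-encoded sequence per user."""
    events = events.sort_values(["user_id", "timestamp"])
    # Label-encode channels; 0 is reserved for padding
    channel_ids = {
        ch: i + 1 for i, ch in enumerate(sorted(events["channel"].unique()))
    }
    users, rows = [], []
    for user, journey in events.groupby("user_id"):
        # Keep only the most recent `max_len` touchpoints
        ids = [channel_ids[ch] for ch in journey["channel"]][-max_len:]
        # Left-pad so the latest touchpoint always sits in the last position
        rows.append([0] * (max_len - len(ids)) + ids)
        users.append(user)
    return users, np.array(rows), channel_ids


# Toy events (made up for illustration)
events = pd.DataFrame({
    "user_id": [1, 2, 1],
    "timestamp": pd.to_datetime(["2024-01-02", "2024-01-01", "2024-01-01"]),
    "channel": ["email", "display", "social"],
})
users, X, channel_ids = build_padded_sequences(events)
print(channel_ids)
print(X)
```

Left-padding is a design choice here: it keeps the touchpoint closest to conversion at a fixed position, which is often convenient for sequence models. Right-padding with masking is an equally valid alternative.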

---

__Summary__

A production-grade ML attribution pipeline must:
- Automate the daily data flow.
- Serve models for live or batch scoring.
- Aggregate attribution results per dimension (channel, campaign).
- Expose results through stakeholder-friendly dashboards, with built-in monitoring for reliability.

For example, we can monitor daily attribution shifts and trigger retraining if channel weights deviate by >10% over 7 days.
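
The retraining rule above could be sketched as follows. This assumes one row of attribution shares per day and interprets ">10%" as a relative change against the value 7 rows earlier; an absolute threshold in percentage points would be an equally reasonable reading. The function name and the sample shares are hypothetical.

```python
import pandas as pd


def channels_to_retrain(daily_share, threshold=0.10, window_days=7):
    """Return channels whose latest share moved more than `threshold`
    (relative change) versus `window_days` rows earlier.

    `daily_share`: one row per day, one column per channel.
    """
    baseline = daily_share.shift(window_days)
    rel_change = (daily_share - baseline).abs() / baseline
    latest = rel_change.iloc[-1]
    return sorted(latest[latest > threshold].index)


# Placeholder shares (in %), for illustration only
days = pd.date_range("2024-01-01", periods=8, freq="D")
daily_share = pd.DataFrame(
    {"email": [40] * 7 + [46], "social": [35] * 7 + [29], "display": [25] * 8},
    index=days,
)
flagged = channels_to_retrain(daily_share)
print(flagged)  # ['email', 'social']
```

A non-empty result would trigger the retraining job. Channels with a baseline share of zero need separate handling, since the relative change is undefined there.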
