# Geographic Cost & Response Analysis

This notebook joins multiple data sources to create a unified view of marketing spend and user response across different geographic regions.

**Data sources:**
- GA4 Sessions (response metric)
- Meta Geo Spend
- TikTok Geo Spend
- Google Ads Geo Spend

**Target format:**
- Date
- Geo
- Cost (sum of all platform spends)
- Response (GA4 sessions)

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import os
import sys
import warnings

# Add the project root to path to import our modules
sys.path.append('..')

# Import our custom data pipeline modules
from src.data_pipeline.data_standardizer import DateStandardizer, GeoStandardizer, CostStandardizer, DataAggregator
from src.data_pipeline.data_joiner import DataJoiner, DatasetCleaner

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)
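# Suppress warnings to keep cell output readable (note: this hides all warnings, including pandas ones)
warnings.filterwarnings('ignore')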

# Set paths
RAW_DATA_PATH = '../raw_data/'
OUTPUT_PATH = '../data/'

## 1. Data Loading

First, let's load all our datasets.

In [2]:
# Load GA4 Sessions data
ga4_sessions = pd.read_csv(os.path.join(RAW_DATA_PATH, 'ga4_sessions.csv'))
print(f"GA4 Sessions shape: {ga4_sessions.shape}")
ga4_sessions.head()

GA4 Sessions shape: (100000, 7)


Unnamed: 0,Date,Region,Region ID,City,City ID,Sessions,Engaged sessions
0,20250319,(not set),(not set),(not set),(not set),19,1
1,20250319,Alabama,US-AL,Headland,1013037,1,0
2,20250319,Alabama,US-AL,Huntsville,1013042,1,0
3,20250319,Alabama,US-AL,Uniontown,9207090,1,1
4,20250319,Alaska,US-AK,Sitka,1012913,1,0


In [3]:
# Load Meta Geo Spend data
meta_geo_spend = pd.read_csv(os.path.join(RAW_DATA_PATH, 'meta_geo_spend.csv'))
print(f"Meta Geo Spend shape: {meta_geo_spend.shape}")
meta_geo_spend.head()

Meta Geo Spend shape: (93304, 6)


Unnamed: 0,DMA region,Day,Impressions,Amount spent (USD),Reporting starts,Reporting ends
0,Cheyenne-Scottsbluff,2025-03-20,9,0.549998,2025-03-20,2025-03-20
1,"Columbus, OH",2025-03-20,1264,44.799799,2025-03-20,2025-03-20
2,Salisbury,2025-03-20,207,8.639961,2025-03-20,2025-03-20
3,Austin,2025-03-20,2283,83.249627,2025-03-20,2025-03-20
4,Wichita-Hutchinson Plus,2025-03-20,358,10.569953,2025-03-20,2025-03-20


In [4]:
# Load TikTok Geo Spend data
tiktok_geo_spend = pd.read_csv(os.path.join(RAW_DATA_PATH, 'tiktok_geo_spend.csv'))
print(f"TikTok Geo Spend shape: {tiktok_geo_spend.shape}")
tiktok_geo_spend.head()

TikTok Geo Spend shape: (10961, 5)


Unnamed: 0,By Day,Subregion,Cost,Impressions,Currency
0,3/20/25,Virginia,53.06,2204,USD
1,3/20/25,Montana,6.75,231,USD
2,3/20/25,Michigan,42.87,1696,USD
3,3/20/25,Illinois,77.01,2976,USD
4,3/20/25,Unknown,386.32,13627,USD


In [5]:
# Load Google Ads Geo Spend data
gads_geo_spend = pd.read_csv(os.path.join(RAW_DATA_PATH, 'gads_geo_spend.csv'))
print(f"Google Ads Geo Spend shape: {gads_geo_spend.shape}")
gads_geo_spend.head()

Google Ads Geo Spend shape: (370818, 8)


Unnamed: 0,Day,Metro area (User location),City (User location),Region (User location),Country/Territory (User location),Currency code,Cost,Impr.
0,2024-01-01,Albany-Schenectady-Troy NY,Albany,New York,United States,USD,2.72,35
1,2024-01-01,Albany-Schenectady-Troy NY,Catskill,New York,United States,USD,3.33,39
2,2024-01-01,Albany-Schenectady-Troy NY,Delmar,New York,United States,USD,32.31,10
3,2024-01-01,Albany-Schenectady-Troy NY,Manchester,Vermont,United States,USD,1.41,13
4,2024-01-01,Amarillo TX,Dumas,Texas,United States,USD,1.73,4


## 2. Initialize Data Pipeline Components

Set up the standardizers and cleaners from our data pipeline module.

In [6]:
# Initialize standardizers
date_standardizer = DateStandardizer(output_format='%Y-%m-%d')
geo_standardizer = GeoStandardizer()
cost_standardizer = CostStandardizer()
data_aggregator = DataAggregator()

# Bundle standardizers for dataset cleaner
standardizers = {
    'date': date_standardizer,
    'geo': geo_standardizer,
    'cost': cost_standardizer,
    'aggregator': data_aggregator
}

# Initialize dataset cleaner and joiner
dataset_cleaner = DatasetCleaner(standardizers)
data_joiner = DataJoiner(date_col='Date', geo_col='geo')
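
The raw files use three different date layouts (`20250319` in GA4, `2025-03-20` in Meta and Google Ads, `3/20/25` in TikTok), which is why a date standardizer with a single `output_format` is useful. The implementation of `DateStandardizer` is not shown in this notebook. As a rough sketch of the idea, assuming each source's input format is known in advance, plain pandas can perform the same normalization. The helper name `normalize_dates` is hypothetical and not part of the pipeline module.

```python
import pandas as pd

def normalize_dates(values, input_format, output_format="%Y-%m-%d"):
    """Hypothetical helper: parse dates with an explicit format, return formatted strings."""
    # Cast to str first so integer-typed dates such as 20250319 can be parsed
    as_text = pd.Series(values).astype(str)
    return pd.to_datetime(as_text, format=input_format).dt.strftime(output_format)

# One sample value per layout seen in the raw previews above
ga4_dates = normalize_dates([20250319], "%Y%m%d")
meta_dates = normalize_dates(["2025-03-20"], "%Y-%m-%d")
tiktok_dates = normalize_dates(["3/20/25"], "%m/%d/%y")
```

Passing an explicit format per source avoids ambiguous inference for values like `3/20/25`.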

## 3. Process Each Dataset

Clean and standardize each dataset using our data pipeline.

In [7]:
# Process GA4 Sessions data
ga4_clean = dataset_cleaner.clean_ga4_sessions(ga4_sessions)
ga4_clean.head()

AttributeError: 'numpy.int64' object has no attribute 'isdigit'
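
The error message suggests that a string method (`isdigit`) was called on an integer: the GA4 `Date` column holds values like `20250319`, which `read_csv` parses as integers. The internals of `clean_ga4_sessions` are not shown here, so the following is only a possible workaround, assuming the cleaner expects date strings: cast the column to `str` before cleaning. A small stand-in DataFrame is used so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the GA4 export (hypothetical rows); integer dates mimic read_csv's inference
ga4_sample = pd.DataFrame({"Date": [20250319, 20250319], "Sessions": [19, 1]})
dtype_before = ga4_sample["Date"].dtype  # an integer dtype at this point

# Cast to string so string methods such as .isdigit() are available
ga4_sample["Date"] = ga4_sample["Date"].astype(str)
all_digits = bool(ga4_sample["Date"].str.isdigit().all())
```

On the real data, the equivalent would be `ga4_sessions['Date'] = ga4_sessions['Date'].astype(str)` before calling the cleaner, or passing `dtype={'Date': str}` to `pd.read_csv`.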

In [9]:
# Process Meta Geo Spend data
meta_clean = dataset_cleaner.clean_meta_spend(meta_geo_spend)
meta_clean.head()

Unnamed: 0,Date,geo,meta_cost
0,2024-01-01,ABILENE-SWEETWATER,0.939899
1,2024-01-01,"ALBANY, GA",0.999893
2,2024-01-01,ALBANY-SCHENECTADY-TROY,16.838196
3,2024-01-01,ALBUQUERQUE-SANTA FE,17.368139
4,2024-01-01,"ALEXANDRIA, LA",0.809913


In [10]:
# Process TikTok Geo Spend data
tiktok_clean = dataset_cleaner.clean_tiktok_spend(tiktok_geo_spend)
tiktok_clean.head()

Unnamed: 0,Date,geo,tiktok_cost
0,2024-03-25,ALABAMA,1.48
1,2024-03-25,ALASKA,0.33
2,2024-03-25,ARIZONA,4.06
3,2024-03-25,ARKANSAS,1.17
4,2024-03-25,CALIFORNIA,38.12


In [11]:
# Process Google Ads Geo Spend data
gads_clean = dataset_cleaner.clean_gads_spend(gads_geo_spend)
gads_clean.head()

Unnamed: 0,Date,geo,gads_cost
0,2024-01-01,ALABAMA,40.11
1,2024-01-01,ARIZONA,67.81
2,2024-01-01,ARKANSAS,6.4
3,2024-01-01,CALIFORNIA,582.55
4,2024-01-01,COLORADO,149.6


## 4. Join All Datasets

Join the cleaned datasets using our DataJoiner.

In [None]:
# Prepare datasets for joining
datasets_to_join = [
    (meta_clean, 'meta'),
    (tiktok_clean, 'tiktok'),
    (gads_clean, 'gads')
]

# Join all datasets with GA4 as the base
combined_df = data_joiner.join_datasets(
    base_df=ga4_clean,
    datasets=datasets_to_join,
    join_type='outer'
)

# Calculate total cost
cost_cols = ['meta_cost', 'tiktok_cost', 'gads_cost']
combined_df = data_joiner.calculate_total_cost(combined_df, cost_cols)

# Fill missing values for response
combined_df['response'] = combined_df['response'].fillna(0)

# Sort by date and geo
combined_df = combined_df.sort_values(['Date', 'geo']).reset_index(drop=True)

# Preview the final dataset
combined_df.head()
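
The internals of `DataJoiner` are not shown in this notebook. As a minimal sketch of what an outer join on `Date` and `geo` followed by a total-cost column might look like, here is a plain-pandas stand-in. The tiny frames are illustrative, and summing costs with missing values treated as zero is an assumption about how `calculate_total_cost` behaves.

```python
import pandas as pd

# Illustrative frames shaped like the cleaned outputs above
base = pd.DataFrame({"Date": ["2024-01-01", "2024-01-01"],
                     "geo": ["ALABAMA", "ARIZONA"],
                     "response": [10, 5]})
meta = pd.DataFrame({"Date": ["2024-01-01"],
                     "geo": ["ABILENE-SWEETWATER"],
                     "meta_cost": [0.94]})
gads = pd.DataFrame({"Date": ["2024-01-01", "2024-01-01"],
                     "geo": ["ALABAMA", "ARIZONA"],
                     "gads_cost": [40.11, 67.81]})

# Outer joins keep every (Date, geo) pair seen in any source
joined = base
for spend in (meta, gads):
    joined = joined.merge(spend, on=["Date", "geo"], how="outer")

# Row-wise sum skips NaN, so a missing platform contributes zero (assumed behavior)
demo_cost_cols = ["meta_cost", "gads_cost"]
joined["cost"] = joined[demo_cost_cols].sum(axis=1)
joined["response"] = joined["response"].fillna(0)
```

Note that the DMA-level row (`ABILENE-SWEETWATER`) survives the join but matches no state-level row, so it ends up with cost and zero response.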

In [None]:
# Check for missing values
combined_df.isnull().sum()

In [None]:
# Basic summary statistics
combined_df.describe()

## 5. Data Quality Checks

Let's perform some basic data quality checks on our joined dataset.

In [None]:
# Check for records that have cost but no response
no_response_with_cost = combined_df[(combined_df['response'] == 0) & (combined_df['cost'] > 0)]
print(f"Records with cost but no response: {len(no_response_with_cost)}")
no_response_with_cost.head()

In [None]:
# Check for records with response but no costs
response_no_cost = combined_df[(combined_df['response'] > 0) & (combined_df['cost'] == 0)]
print(f"Records with response but no cost: {len(response_no_cost)}")
response_no_cost.head()
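
The two checks above count unmatched rows but do not show which side of the join they came from. One way to inspect that, sketched here with illustrative stand-in frames, is the `indicator=True` option of pandas `merge`, which labels each key as present in the left frame only, the right frame only, or both. The frame names `response_keys` and `spend_keys` are hypothetical.

```python
import pandas as pd

# Hypothetical key sets: one from the response data, one from a spend source
response_keys = pd.DataFrame({"Date": ["2024-01-01", "2024-01-01"],
                              "geo": ["ALABAMA", "ARIZONA"]})
spend_keys = pd.DataFrame({"Date": ["2024-01-01", "2024-01-01"],
                           "geo": ["ALABAMA", "ABILENE-SWEETWATER"]})

# indicator=True adds a "_merge" column: left_only, right_only, or both
coverage = response_keys.merge(spend_keys, on=["Date", "geo"],
                               how="outer", indicator=True)
match_counts = coverage["_merge"].value_counts()

# Geo labels that appear only in the spend source are candidates for a mapping fix
unmatched_spend_geos = sorted(
    coverage.loc[coverage["_merge"] == "right_only", "geo"].unique()
)
```

A large `right_only` count would point to geo labels that need mapping rather than to genuinely missing response data.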

In [None]:
# Check distribution of costs by platform
platform_costs = combined_df[cost_cols].sum()
platform_costs

In [None]:
# Visualize platform cost distribution
plt.figure(figsize=(10, 6))
platform_costs.plot(kind='bar', color='skyblue')
plt.title('Total Cost by Platform')
plt.ylabel('Cost (USD)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

## 6. Cost-Response Analysis

Let's explore the relationship between cost and response.

In [None]:
# Scatter plot of cost vs response
plt.figure(figsize=(12, 8))
sns.scatterplot(data=combined_df, x='cost', y='response', alpha=0.6)
plt.title('Cost vs Response')
plt.xlabel('Cost (USD)')
plt.ylabel('Response (Sessions)')
plt.grid(True, linestyle='--', alpha=0.7)

# Add trend line
sns.regplot(data=combined_df, x='cost', y='response', 
            scatter=False, line_kws={'color': 'red'})

plt.show()

In [None]:
# Calculate cost per response for each date-geo row
# Rows with zero response are set to NaN to avoid division by zero
combined_df['cost_per_response'] = np.where(
    combined_df['response'] > 0,
    combined_df['cost'] / combined_df['response'],
    np.nan
)

# Aggregate by geo
geo_performance = combined_df.groupby('geo').agg({
    'cost': 'sum',
    'response': 'sum',
    'cost_per_response': 'mean'
}).reset_index()

# Calculate overall CPR (total cost / total response); NaN where a geo has zero total response
geo_performance['overall_cpr'] = geo_performance['cost'] / geo_performance['response'].replace(0, np.nan)

# Sort by cost (descending)
geo_performance = geo_performance.sort_values('cost', ascending=False)

# Show top 20 geos by spend
geo_performance.head(20)
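
The table above carries two different cost-per-response figures: `cost_per_response` is the mean of daily ratios, while `overall_cpr` is total cost divided by total response. They can differ substantially when daily volumes are uneven. A small illustration with made-up numbers:

```python
import pandas as pd

# Two made-up days for one geo: a high-volume day and a low-volume day
rows = pd.DataFrame({"geo": ["TEXAS", "TEXAS"],
                     "cost": [100.0, 10.0],
                     "response": [100, 1]})
rows["cost_per_response"] = rows["cost"] / rows["response"]  # 1.0 and 10.0

# Mean of ratios weights each day equally, regardless of volume
mean_of_ratios = rows["cost_per_response"].mean()

# Ratio of sums weights each day by its response volume
ratio_of_sums = rows["cost"].sum() / rows["response"].sum()
```

Here the mean of ratios is 5.5 while the ratio of sums is about 1.09, so `overall_cpr` is usually the better summary of what a geo's spend actually bought.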

In [None]:
# Visualize top 10 geos by spend
top10_geos = geo_performance.head(10)

plt.figure(figsize=(14, 8))

# Create bar plot
ax = sns.barplot(x='geo', y='cost', data=top10_geos, color='skyblue')

# Create a twin axis for response
ax2 = ax.twinx()
sns.scatterplot(x=np.arange(len(top10_geos)), y='response', data=top10_geos, 
                color='darkred', s=100, ax=ax2)

# Add labels and title
ax.set_title('Top 10 Geos by Spend')
ax.set_xlabel('Geographic Region')
ax.set_ylabel('Total Cost (USD)', color='blue')
ax2.set_ylabel('Total Response (Sessions)', color='darkred')

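# Rotate x-axis labels for better readability
# (set on ax explicitly: after twinx() the current axes is ax2, whose x-axis is hidden)
plt.setp(ax.get_xticklabels(), rotation=45, ha='right')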
plt.tight_layout()
plt.show()

## 7. Save the Final Dataset

Save our cleaned and joined dataset for future use.

In [None]:
# Create output directory if it doesn't exist
os.makedirs(OUTPUT_PATH, exist_ok=True)

# Save the final dataset
combined_df.to_csv(os.path.join(OUTPUT_PATH, 'geo_cost_response_combined.csv'), index=False)
print(f"Final dataset saved to {os.path.join(OUTPUT_PATH, 'geo_cost_response_combined.csv')}")

## 8. Next Steps and Recommendations

Based on this analysis, here are some potential next steps:

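1. **Geo Mapping Refinement**: Create a more comprehensive geo mapping table so that sources reported at different granularities match on the join key (Meta reports DMA regions, while the cleaned TikTok and Google Ads data are at state level)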
2. **Time Lag Analysis**: Investigate the time lag between spend and response to determine the optimal attribution window
3. **Response Curve Modeling**: Model the relationship between spend and response to identify diminishing returns and optimal spend levels
4. **Platform Efficiency**: Compare the cost-effectiveness of different platforms across geos
5. **Pipeline Automation**: Set up automated ETL processes to regularly update this dataset
6. **Causal Inference**: Use this cleaned dataset for geographic causal inference analysis
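
As a starting point for the time lag analysis, one simple approach is to correlate response with spend shifted back by 0, 1, 2, ... days within each geo. This is only a sketch: the function name `lagged_correlation` is hypothetical, it assumes one row per geo per day with no gaps, and correlation alone does not establish attribution.

```python
import pandas as pd

def lagged_correlation(df, max_lag=7):
    """Correlate response with cost shifted by 0..max_lag rows within each geo."""
    df = df.sort_values(["geo", "Date"])
    results = {}
    for lag in range(max_lag + 1):
        # shift() within each geo so spend never leaks across regions
        lagged_cost = df.groupby("geo")["cost"].shift(lag)
        # Series.corr ignores rows where the lagged cost is NaN
        results[lag] = lagged_cost.corr(df["response"])
    return pd.Series(results, name="correlation")

# Synthetic example: response equals the spend from two days earlier
cost = [1, 5, 2, 8, 3, 9, 4, 7, 6, 10]
demo = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=10).strftime("%Y-%m-%d"),
    "geo": "TEXAS",
    "cost": cost,
    "response": [0, 0] + cost[:-2],
})
lag_corr = lagged_correlation(demo, max_lag=4)
best_lag = lag_corr.idxmax()
```

On the synthetic data the strongest correlation appears at a lag of two days, matching how the example was constructed.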

This modular approach can be easily extended to accommodate new data sources or adapted for other clients with similar data structures.