# Geographic Cost & Response Analysis

This notebook joins multiple data sources to create a unified view of marketing spend and user response across different geographic regions.

**Data sources:**
- GA4 Sessions (response metric)
- Meta Geo Spend
- TikTok Geo Spend
- Google Ads Geo Spend

**Target format:**
- Date
- Geo
- Cost (total spend summed across Meta, TikTok, and Google Ads)
- Response (GA4 sessions)

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import os
import re
import warnings

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)
warnings.filterwarnings('ignore')

# Set paths
RAW_DATA_PATH = '../raw_data/'
OUTPUT_PATH = '../data/'

## 1. Data Loading and Initial Inspection

First, let's load all our datasets and examine their structure.

In [None]:
# Load GA4 Sessions data
ga4_sessions = pd.read_csv(os.path.join(RAW_DATA_PATH, 'ga4_sessions.csv'))
print(f"GA4 Sessions shape: {ga4_sessions.shape}")
ga4_sessions.head()

In [None]:
# Load Meta Geo Spend data
meta_geo_spend = pd.read_csv(os.path.join(RAW_DATA_PATH, 'meta_geo_spend.csv'))
print(f"Meta Geo Spend shape: {meta_geo_spend.shape}")
meta_geo_spend.head()

In [None]:
# Load TikTok Geo Spend data
tiktok_geo_spend = pd.read_csv(os.path.join(RAW_DATA_PATH, 'tiktok_geo_spend.csv'))
print(f"TikTok Geo Spend shape: {tiktok_geo_spend.shape}")
tiktok_geo_spend.head()

In [None]:
# Load Google Ads Geo Spend data
gads_geo_spend = pd.read_csv(os.path.join(RAW_DATA_PATH, 'gads_geo_spend.csv'))
print(f"Google Ads Geo Spend shape: {gads_geo_spend.shape}")
gads_geo_spend.head()
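
Before cleaning, it can help to check how `read_csv` typed the date column. The sketch below uses a small in-memory CSV with made-up rows (only the column names `Date`, `Region`, and `Sessions` come from this notebook) to show that an all-digit date such as `20240105` is read as an integer unless a string dtype is requested.

```python
import io
import pandas as pd

# Toy CSV standing in for ga4_sessions.csv; the rows are invented for illustration.
csv_text = "Date,Region,Sessions\n20240105,California,120\n20240106,(not set),15\n"

# Default parsing: an all-digit column is inferred as an integer dtype.
default_read = pd.read_csv(io.StringIO(csv_text))

# Requesting str keeps the raw YYYYMMDD text, which string methods can then handle.
string_read = pd.read_csv(io.StringIO(csv_text), dtype={"Date": str})

print(default_read["Date"].dtype)
print(string_read["Date"].iloc[0])
```

Whether to pass `dtype` at load time or convert later is a design choice; either way, any cleaning code that calls string methods on the first value needs to account for the integer case.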

## 2. Data Cleaning and Preprocessing Functions

Let's define reusable functions for the key cleaning steps we'll need to perform.

In [None]:
def standardize_date(df, date_col, input_format=None, output_format='%Y-%m-%d'):
    """
    Standardize date format across datasets.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The dataframe containing the date column
    date_col : str
        Name of the date column to standardize
    input_format : str, optional
        Format of the input date. If None, tries to infer format.
    output_format : str, default '%Y-%m-%d'
        Desired output date format
        
    Returns:
    --------
    pandas.DataFrame
        Dataframe with standardized date column
    """
    df = df.copy()
    
    # If input is already datetime, just format it
    if pd.api.types.is_datetime64_any_dtype(df[date_col]):
        df[date_col] = df[date_col].dt.strftime(output_format)
        return df
    
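    # Handle YYYYMMDD format (like in GA4). read_csv typically parses these
    # values as integers, so cast to str before calling string methods.
    first_value = str(df[date_col].iloc[0])
    if input_format is None and first_value.isdigit() and len(first_value) == 8:
        df[date_col] = pd.to_datetime(df[date_col].astype(str), format='%Y%m%d')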
    # Handle M/D/YY format (like in TikTok)
    elif input_format is None and '/' in str(df[date_col].iloc[0]):
        df[date_col] = pd.to_datetime(df[date_col], format='%m/%d/%y')
    # Use specified format or try to infer
    else:
        if input_format:
            df[date_col] = pd.to_datetime(df[date_col], format=input_format)
        else:
            df[date_col] = pd.to_datetime(df[date_col])
    
    # Convert to output format
    df[date_col] = df[date_col].dt.strftime(output_format)
    return df

def standardize_geo(df, geo_cols, output_col='geo', geo_level='region'):
    """
    Standardize geographic data across datasets.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The dataframe containing geo columns
    geo_cols : list of str or str
        Column name(s) to use for geo standardization
    output_col : str, default 'geo'
        Name of the standardized output column
    geo_level : str, default 'region'
        Level of geographic granularity to standardize to ('region', 'city', etc.)
        
    Returns:
    --------
    pandas.DataFrame
        Dataframe with standardized geo column
    """
    df = df.copy()
    
    if isinstance(geo_cols, str):
        geo_cols = [geo_cols]
    
    # Case 1: Use first valid geo column
    for col in geo_cols:
        if col in df.columns:
            df[output_col] = df[col].str.strip() if isinstance(df[col].iloc[0], str) else df[col]
            break
    
    # Standardize names - convert to uppercase for consistent matching
    if output_col in df.columns and isinstance(df[output_col].iloc[0], str):
        df[output_col] = df[output_col].str.strip().str.upper()
        
        # Handle special cases
        # Replace "(NOT SET)" with "UNKNOWN"
        df[output_col] = df[output_col].replace(r'\(NOT SET\)', 'UNKNOWN', regex=True)
        
        # Remove state/region codes in parentheses if present
        df[output_col] = df[output_col].str.replace(r'\s*\([A-Z]{2}\)$', '', regex=True)
    
    return df

def standardize_cost(df, cost_col):
    """
    Standardize cost/spend data across datasets.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The dataframe containing the cost column
    cost_col : str
        Column name containing cost data
        
    Returns:
    --------
    pandas.DataFrame
        Dataframe with standardized cost column
    """
    df = df.copy()
    
    # Ensure cost column exists
    if cost_col not in df.columns:
        raise ValueError(f"Cost column '{cost_col}' not found in dataframe")
    
    # Handle string values with currency symbols
    if df[cost_col].dtype == 'object':
        # Raw string avoids an invalid-escape warning for '\$'
        df[cost_col] = df[cost_col].replace(r'[\$,]', '', regex=True)
    
    # Convert to float
    df[cost_col] = pd.to_numeric(df[cost_col], errors='coerce')
    
    # Fill NaN with 0
    df[cost_col] = df[cost_col].fillna(0)
    
    return df

def aggregate_data(df, date_col, geo_col, value_col, agg_func='sum'):
    """
    Aggregate data by date and geo.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The dataframe to aggregate
    date_col : str
        Column name containing date information
    geo_col : str
        Column name containing geographic information
    value_col : str
        Column name containing the value to aggregate
    agg_func : str or dict, default 'sum'
        Aggregation function to apply
        
    Returns:
    --------
    pandas.DataFrame
        Aggregated dataframe
    """
    return df.groupby([date_col, geo_col])[value_col].agg(agg_func).reset_index()
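
As a quick sanity check of the cleaning rules above, the following sketch applies the same pandas calls directly to toy values. The sample values are invented for illustration and real exports may contain formats not covered here.

```python
import pandas as pd

# Toy inputs (assumed shapes, not taken from the real files)
ga4_dates = pd.Series([20240105, 20240106])       # YYYYMMDD, parsed as integers
tiktok_dates = pd.Series(["1/5/24", "1/6/24"])    # M/D/YY strings
geo = pd.Series([" California (CA)", "(not set)"])
cost = pd.Series(["$1,234.50", "n/a"])

# Dates: parse each source format, then emit a common YYYY-MM-DD string
ga4_parsed = pd.to_datetime(ga4_dates.astype(str), format="%Y%m%d").dt.strftime("%Y-%m-%d")
tiktok_parsed = pd.to_datetime(tiktok_dates, format="%m/%d/%y").dt.strftime("%Y-%m-%d")

# Geo: trim, uppercase, map "(NOT SET)" to "UNKNOWN", drop a trailing "(XX)" code
geo_clean = (
    geo.str.strip()
    .str.upper()
    .replace(r"\(NOT SET\)", "UNKNOWN", regex=True)
    .str.replace(r"\s*\([A-Z]{2}\)$", "", regex=True)
)

# Cost: strip "$" and ",", coerce unparseable values to NaN, then fill with 0
cost_clean = pd.to_numeric(
    cost.replace(r"[\$,]", "", regex=True), errors="coerce"
).fillna(0)

print(ga4_parsed.tolist(), tiktok_parsed.tolist())
print(geo_clean.tolist(), cost_clean.tolist())
```

Note that coercing unparseable costs to 0 silently hides bad input, so counting the NaN values before `fillna` may be worthwhile on real data.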

## 3. Process Each Dataset Individually

Now let's clean and standardize each dataset separately before joining them.

In [None]:
# Process GA4 Sessions data
ga4_clean = standardize_date(ga4_sessions, 'Date')
ga4_clean = standardize_geo(ga4_clean, ['Region'], 'geo')

# Aggregate sessions by date and geo
ga4_agg = aggregate_data(ga4_clean, 'Date', 'geo', 'Sessions')
ga4_agg = ga4_agg.rename(columns={'Sessions': 'response'})
ga4_agg.head()

In [None]:
# Process Meta Geo Spend data
meta_clean = standardize_date(meta_geo_spend, 'Day')
meta_clean = standardize_geo(meta_clean, ['DMA region'], 'geo')
meta_clean = standardize_cost(meta_clean, 'Amount spent (USD)')

# Aggregate spend by date and geo
meta_agg = aggregate_data(meta_clean, 'Day', 'geo', 'Amount spent (USD)')
meta_agg = meta_agg.rename(columns={'Day': 'Date', 'Amount spent (USD)': 'meta_cost'})
meta_agg.head()

In [None]:
# Process TikTok Geo Spend data
tiktok_clean = standardize_date(tiktok_geo_spend, 'By Day')
tiktok_clean = standardize_geo(tiktok_clean, ['Subregion'], 'geo')
tiktok_clean = standardize_cost(tiktok_clean, 'Cost')

# Aggregate spend by date and geo
tiktok_agg = aggregate_data(tiktok_clean, 'By Day', 'geo', 'Cost')
tiktok_agg = tiktok_agg.rename(columns={'By Day': 'Date', 'Cost': 'tiktok_cost'})
tiktok_agg.head()

In [None]:
# Process Google Ads Geo Spend data
gads_clean = standardize_date(gads_geo_spend, 'Day')
gads_clean = standardize_geo(gads_clean, ['Region (User location)'], 'geo')
gads_clean = standardize_cost(gads_clean, 'Cost')

# Aggregate spend by date and geo
gads_agg = aggregate_data(gads_clean, 'Day', 'geo', 'Cost')
gads_agg = gads_agg.rename(columns={'Day': 'Date', 'Cost': 'gads_cost'})
gads_agg.head()

## 4. Join All Datasets

Now we can join all of our cleaned and standardized datasets.

In [None]:
# Start with GA4 as the base (response data)
combined_df = ga4_agg.copy()

# Join with Meta data
combined_df = pd.merge(
    combined_df, 
    meta_agg, 
    on=['Date', 'geo'], 
    how='outer'
)

# Join with TikTok data
combined_df = pd.merge(
    combined_df, 
    tiktok_agg, 
    on=['Date', 'geo'], 
    how='outer'
)

# Join with Google Ads data
combined_df = pd.merge(
    combined_df, 
    gads_agg, 
    on=['Date', 'geo'], 
    how='outer'
)

# Fill NaN values with 0 for cost and response columns
cost_cols = ['meta_cost', 'tiktok_cost', 'gads_cost']
combined_df[cost_cols] = combined_df[cost_cols].fillna(0)
combined_df['response'] = combined_df['response'].fillna(0)

# Calculate total cost
combined_df['cost'] = combined_df[cost_cols].sum(axis=1)

# Sort by date and geo
combined_df = combined_df.sort_values(['Date', 'geo'])

# Reset index
combined_df = combined_df.reset_index(drop=True)

# Preview the final dataset
combined_df.head()
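
Because the joins above are outer joins on exact `Date` and `geo` matches, labels that differ between sources (for example, a DMA name versus a state-level region) would produce separate rows rather than matched ones. A minimal way to audit this, sketched here on invented toy frames, is the `indicator=True` option of `pd.merge`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Toy frames; the geo labels are assumptions chosen to show a mismatch.
response = pd.DataFrame({
    "Date": ["2024-01-05", "2024-01-05"],
    "geo": ["CALIFORNIA", "TEXAS"],
    "response": [120, 80],
})
spend = pd.DataFrame({
    "Date": ["2024-01-05", "2024-01-05"],
    "geo": ["CALIFORNIA", "LOS ANGELES"],
    "meta_cost": [50.0, 30.0],
})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
audit = pd.merge(response, spend, on=["Date", "geo"], how="outer", indicator=True)

match_counts = audit["_merge"].value_counts()
unmatched_spend = audit.loc[audit["_merge"] == "right_only", "geo"].tolist()

print(match_counts)
print("Spend rows with no matching response geo:", unmatched_spend)
```

A large share of `left_only` or `right_only` rows would suggest the geo labels need a mapping step before the cost and response columns can be compared row by row.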

In [None]:
# Check for missing values
combined_df.isnull().sum()

In [None]:
# Basic summary statistics
combined_df.describe()

## 5. Data Quality Checks

Let's perform some basic data quality checks on our joined dataset.

In [None]:
# Check for records that have cost but no response
no_response_with_cost = combined_df[(combined_df['response'] == 0) & (combined_df['cost'] > 0)]
print(f"Records with cost but no response: {len(no_response_with_cost)}")
no_response_with_cost.head()

In [None]:
# Check for records with response but no costs
response_no_cost = combined_df[(combined_df['response'] > 0) & (combined_df['cost'] == 0)]
print(f"Records with response but no cost: {len(response_no_cost)}")
response_no_cost.head()

In [None]:
# Check distribution of costs by platform
platform_costs = combined_df[['meta_cost', 'tiktok_cost', 'gads_cost']].sum()
platform_costs

In [None]:
# Visualize platform cost distribution
plt.figure(figsize=(10, 6))
platform_costs.plot(kind='bar', color='skyblue')
plt.title('Total Cost by Platform')
plt.ylabel('Cost (USD)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

## 6. Additional Analyses

Let's explore the relationship between cost and response.

In [None]:
# Scatter plot of cost vs response
plt.figure(figsize=(12, 8))
sns.scatterplot(data=combined_df, x='cost', y='response', alpha=0.6)
plt.title('Cost vs Response')
plt.xlabel('Cost (USD)')
plt.ylabel('Response (Sessions)')
plt.grid(True, linestyle='--', alpha=0.7)

# Add trend line
sns.regplot(data=combined_df, x='cost', y='response', 
            scatter=False, line_kws={'color': 'red'})

plt.show()

In [None]:
# Calculate cost per response for each date-geo row
# Rows with zero response get NaN to avoid division by zero
combined_df['cost_per_response'] = np.where(
    combined_df['response'] > 0,
    combined_df['cost'] / combined_df['response'],
    np.nan
)

# Aggregate by geo
geo_performance = combined_df.groupby('geo').agg({
    'cost': 'sum',
    'response': 'sum',
    'cost_per_response': 'mean'
}).reset_index()

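# Calculate overall CPR (total cost / total response)
# Replace zero responses with NaN so geos with no sessions yield NaN instead of inf
geo_performance['overall_cpr'] = geo_performance['cost'] / geo_performance['response'].replace(0, np.nan)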

# Sort by cost (descending)
geo_performance = geo_performance.sort_values('cost', ascending=False)

# Show top 20 geos by spend
geo_performance.head(20)
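
The table above carries two cost-per-response figures that generally differ: the mean of the per-row ratios (`cost_per_response`) and the ratio of the totals (`overall_cpr`). The toy example below, with invented numbers and a hypothetical column name `mean_daily_cpr`, shows how far apart they can be when spend is uneven across days.

```python
import pandas as pd

# Two days for one geo: a cheap day and an expensive day (invented values)
toy = pd.DataFrame({
    "geo": ["A", "A"],
    "cost": [10.0, 90.0],
    "response": [10, 30],
})
toy["cost_per_response"] = toy["cost"] / toy["response"]  # 1.0 and 3.0

summary = toy.groupby("geo").agg(
    cost=("cost", "sum"),
    response=("response", "sum"),
    mean_daily_cpr=("cost_per_response", "mean"),  # unweighted mean of ratios
)
summary["overall_cpr"] = summary["cost"] / summary["response"]  # ratio of totals

print(summary)
```

Here the unweighted mean is 2.0 while the ratio of totals is 2.5, because the expensive day carries more of the spend. The ratio of totals is usually the better answer to "what did a session cost in this geo overall".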

In [None]:
# Visualize top 10 geos by spend
top10_geos = geo_performance.head(10)

plt.figure(figsize=(14, 8))

# Create bar plot
ax = sns.barplot(x='geo', y='cost', data=top10_geos, color='skyblue')

# Create a twin axis for response
ax2 = ax.twinx()
sns.scatterplot(x=np.arange(len(top10_geos)), y='response', data=top10_geos, 
                color='darkred', s=100, ax=ax2)

# Add labels and title
ax.set_title('Top 10 Geos by Spend')
ax.set_xlabel('Geographic Region')
ax.set_ylabel('Total Cost (USD)', color='blue')
ax2.set_ylabel('Total Response (Sessions)', color='darkred')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## 7. Save the Final Dataset

Let's save our cleaned and joined dataset for future use.

In [None]:
# Create output directory if it doesn't exist
os.makedirs(OUTPUT_PATH, exist_ok=True)

# Save the final dataset
combined_df.to_csv(os.path.join(OUTPUT_PATH, 'geo_cost_response_combined.csv'), index=False)
print(f"Final dataset saved to {os.path.join(OUTPUT_PATH, 'geo_cost_response_combined.csv')}")

## 8. Next Steps and Recommendations

Based on this analysis, here are some potential next steps:

1. **Geo Mapping**: Create a more comprehensive geo mapping table to improve the matching between different data sources.
2. **Time Lag Analysis**: Investigate the time lag between spend and response to determine the optimal attribution window.
3. **Response Curve Modeling**: Model the relationship between spend and response to identify diminishing returns and optimal spend levels.
4. **Platform Efficiency**: Compare the cost-effectiveness of different platforms across geos.
5. **Anomaly Detection**: Set up automated checks for anomalous data points (e.g., unusually high costs with no response).
6. **Data Pipeline**: Implement an automated ETL pipeline for regular updates to this dataset.
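
As a rough sketch of the time lag idea in step 2, one simple starting point is to correlate response with cost shifted by a range of candidate lags within each geo. The data below is synthetic, built so that response trails cost by exactly two days. The lag range and the use of Pearson correlation are assumptions for illustration, not a recommended attribution method.

```python
import numpy as np
import pandas as pd

# Synthetic series for one geo: response is 3x the cost from two days earlier
dates = pd.date_range("2024-01-01", periods=10, freq="D")
cost = np.array([0, 10, 0, 20, 0, 30, 0, 5, 0, 15], dtype=float)
response = np.roll(cost, 2) * 3.0
response[:2] = 0  # discard the values that np.roll wrapped around

toy = pd.DataFrame({"Date": dates, "geo": "A", "cost": cost, "response": response})

# Correlate response with cost shifted by each candidate lag (in days).
# Shifting inside groupby keeps lags from leaking across geos.
lag_corr = {}
for lag in range(0, 4):
    lagged_cost = toy.groupby("geo")["cost"].shift(lag)
    lag_corr[lag] = toy["response"].corr(lagged_cost)

best_lag = max(lag_corr, key=lag_corr.get)
print(lag_corr)
print("Lag with highest correlation:", best_lag)
```

On real data the dates would need to be sorted and gap-free within each geo for `shift` to represent calendar days, and correlation alone does not establish that spend caused the response.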

This analysis provides a foundation for more sophisticated geo-based causal inference analyses.