# Research with Gemini

This notebook tests Google's Gemini API.

Credentials are located in config.yaml in the ai-research root folder.

## Setup

In [7]:
# Imports
from chiefai.ai import make_gemini_request
from chiefai.db import query
import polars as pl

# Notebook formatting
from IPython.display import display, HTML, Markdown

In [2]:
data = query("SELECT * FROM web_visit_results LIMIT 5")
data.head(3)

id,core_client,project_id,user_id,session_id,unique_key,postal_code,region,dma,dma_code,city,country,browser,device,device_type,search_engine,medium,source,platform,platform_version,bounce,session_date_time,ip_address,pages,search_terms,language,latitude,longitude,organization,referrer,session_length,created_at,updated_at,core_product,last_access,content,session
i64,str,i64,str,str,str,str,str,str,i64,str,str,str,str,str,str,str,str,str,str,str,datetime[μs],str,list[struct[2]],str,str,f64,f64,str,str,str,datetime[μs],datetime[μs],str,datetime[μs],null,null
99735578,"""OPOS""",82884616,"""50182511673882""","""3288761085059530754""","""828846165018251167388232887610…",,,,0,,"""USA""","""Unknown""",,"""Desktop/laptop""",,"""paidsocial""","""snapchat""","""Unknown""","""Unknown""","""true""",2024-01-03 13:11:13,"""34.123.204.87""","[{""collections/vaginal-health"",0}]",,"""Unknown""",37.751,-97.822,"""Google""","""Direct""","""0""",2024-01-03 13:13:00.337214,2024-01-03 13:13:00.337784,"""""",2024-01-03 13:11:13,,
99735579,"""OPOS""",82884616,"""78135251450464""","""5120671839057608705""","""828846167813525145046451206718…","""32163""","""Florida""","""Orlando, FL""",534,"""The Villages""","""USA""","""Safari""","""Apple iPhone""","""Mobile""",,"""paidsocial""","""ig""","""iOS""","""iOS 17.1""","""true""",2024-01-03 13:11:13,"""68.205.39.5""","[{""collections/vaginal-health"",0}]",,"""English (United States)""",28.9265,-81.9928,"""Spectrum""","""instagram.com""","""0""",2024-01-03 13:13:00.337214,2024-01-03 13:13:00.337784,"""""",2024-01-03 13:11:13,,
99735580,"""OPOS""",82884616,"""103390995781948""","""6775832299565744285""","""828846161033909957819486775832…",,,,0,,"""USA""","""Chrome""","""Generic Android""","""Mobile""",,,"""Direct""","""Android""","""Android 9.0""","""true""",2024-01-03 13:11:13,"""147.160.184.123""","[{""tools/recurring/login"",0}]",,"""English (United States)""",37.751,-97.822,"""Unknown""","""Direct""","""0""",2024-01-03 13:13:00.337214,2024-01-03 13:13:00.337784,"""""",2024-01-03 13:11:13,,


## Assess AI Approach to Data Analysis

We're going to use the make_gemini_request function from chiefai.ai to send data to the model.

The approach is to compute summary statistics from the database and programmatically generate a text prompt to feed to the AI model. The data is interpolated directly into the prompt text.

First, we'll get order counts and revenue statistics by traffic source and ask the AI model to assess the results.


In [26]:
orders_query = """
    SELECT
        source,
        revenue::NUMERIC
    FROM web_event_results
    WHERE 
        LOWER(event_name) = 'order'
        AND event_date_time >= NOW() - INTERVAL '3 days' 
        AND core_client = 'OPOS'
        AND core_product = 'MENO'
    LIMIT 1000
"""
orders = query(orders_query)
orders = orders.with_columns(
    pl.col("revenue").cast(pl.Float64).alias("revenue")
)
print(orders.head(3))

shape: (3, 2)
┌─────────────────────┬─────────┐
│ source              ┆ revenue │
│ ---                 ┆ ---     │
│ str                 ┆ f64     │
╞═════════════════════╪═════════╡
│ portal.afterpay.com ┆ 180.08  │
│ instagram.com       ┆ 38.73   │
│ google              ┆ 45.47   │
└─────────────────────┴─────────┘


In [30]:
orders_agg = orders.group_by(
    "source"
).agg([
    pl.count("revenue").alias("source_count"),
    pl.median("revenue").alias("median_revenue"),
    pl.mean("revenue").alias("mean_revenue")
]).filter(
    pl.col("source_count") >= 30
)
orders_agg.head()

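shape: (5, 4)
┌───────────────┬──────────────┬────────────────┬──────────────┐
│ source        ┆ source_count ┆ median_revenue ┆ mean_revenue │
│ ---           ┆ ---          ┆ ---            ┆ ---          │
│ str           ┆ u32          ┆ f64            ┆ f64          │
╞═══════════════╪══════════════╪════════════════╪══════════════╡
│ instagram.com ┆ 81           ┆ 32.18          ┆ 43.182222    │
│ Direct        ┆ 193          ┆ 34.43          ┆ 50.141088    │
│ google        ┆ 321          ┆ 31.3           ┆ 40.197726    │
│ fb            ┆ 148          ┆ 36.98          ┆ 44.272973    │
│ Klaviyo       ┆ 50           ┆ 33.325         ┆ 46.5504      │
└───────────────┴──────────────┴────────────────┴──────────────┘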

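The aggregation reports both a median and a mean for each source, and the gap between the two is a quick check for skew. A minimal sketch, with the five rows from the output above copied into a plain list; the `summary` and `gaps` names and the tuple layout are illustrative and not part of the notebook:

```python
# Rows copied from the orders_agg output above: (source, median, mean).
summary = [
    ("instagram.com", 32.18, 43.182222),
    ("Direct", 34.43, 50.141088),
    ("google", 31.3, 40.197726),
    ("fb", 36.98, 44.272973),
    ("Klaviyo", 33.325, 46.5504),
]

# Mean minus median per source. A positive gap means a few large orders
# pull the mean above the typical order value.
gaps = {source: round(mean - median, 2) for source, median, mean in summary}

# Print the sources from largest to smallest gap.
for source, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{source}: mean exceeds median by ${gap:.2f}")
```

Every source shown has its mean above its median, which is consistent with right-skewed order values, and Direct has the widest gap.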

In [None]:
def generate_descriptive_prompt(
    df,
    group_cols,
    stat_cols,
    currency_symbol="$",
    add_comparisons=True,
    decimal_places=2
):
    """
    Generate a descriptive text prompt from a dataframe with grouped statistics.
    
    Parameters:
    -----------
    df : polars.DataFrame or pandas.DataFrame
        The dataframe containing the grouped statistics
    group_cols : list or str
        Column name(s) that represent categorical groupings
    stat_cols : dict
        Dictionary mapping column names to their descriptions
        Format: {'column_name': {'type': 'count|mean|median|sum|etc', 'label': 'custom label', 'format': 'currency|percent|number'}}
    currency_symbol : str, optional
        Symbol to use for currency formatting (default: "$")
    add_comparisons : bool, optional
        Whether to add comparisons between measures like mean and median (default: True)
    decimal_places : int, optional
        Number of decimal places to round numeric values (default: 2)
        
    Returns:
    --------
    str
        Formatted descriptive text suitable for an AI prompt
    """
    # Convert group_cols to list if it's a string
    if isinstance(group_cols, str):
        group_cols = [group_cols]
    
    # Validate inputs
    for col in group_cols:
        if col not in df.columns:
            raise ValueError(f"Group column '{col}' not found in dataframe")
    
    for col in stat_cols:
        if col not in df.columns:
            raise ValueError(f"Statistic column '{col}' not found in dataframe")
    
    # Start building the prompt
    prompt_text = ""
    
    # Function to format values based on their type
    def format_value(value, col_info):
        value = round(value, decimal_places)
        
        if col_info.get('format') == 'currency':
            return f"{currency_symbol}{value:.{decimal_places}f}"
        elif col_info.get('format') == 'percent':
            return f"{value:.{decimal_places}f}%"
        else:
            return f"{value:.{decimal_places}f}" if isinstance(value, float) else f"{value}"
    
    # Determine if we're using polars or pandas
    is_polars = hasattr(df, 'iter_rows')
    
    # Iterate through rows
    if is_polars:
        rows = df.iter_rows(named=True)
    else:  # pandas
        rows = df.to_dict('records')
    
    for row in rows:
        # Create the group description (e.g., "Source: instagram.com")
        group_desc = ""
        for col in group_cols:
            label = col.replace('_', ' ').title()
            group_desc += f"{label}: {row[col]}\n"
        
        # Add the statistics descriptions
        stats_desc = ""
        mean_value = None
        median_value = None
        
        for col, col_info in stat_cols.items():
            # Get custom label or generate one from column name
            label = col_info.get('label', col.replace('_', ' ').title())
            
            # Get the type of statistic
            stat_type = col_info.get('type', '').lower()
            
            # Format the value
            value = row[col]
            formatted_value = format_value(value, col_info)
            
            # Save mean and median values for comparison if needed
            if stat_type == 'mean':
                mean_value = value
            elif stat_type == 'median':
                median_value = value
            
            # Create appropriate description based on type
            if stat_type == 'count':
                stats_desc += f"There were {formatted_value} {label}.\n"
            elif stat_type == 'mean':
                stats_desc += f"The average ({stat_type}) {label} was {formatted_value}.\n"
            elif stat_type == 'median':
                stats_desc += f"The {stat_type} {label} was {formatted_value}.\n"
            elif stat_type == 'sum':
                stats_desc += f"The total {label} was {formatted_value}.\n"
            else:
                stats_desc += f"The {label} was {formatted_value}.\n"
        
        # Add comparison between mean and median if both exist and comparison is requested
        if add_comparisons and mean_value is not None and median_value is not None:
            if mean_value > median_value:
                difference = round(mean_value - median_value, decimal_places)
                format_info = next((info for col, info in stat_cols.items() if info.get('type') == 'mean'), {})
                formatted_diff = format_value(difference, format_info)
                stats_desc += f"The mean is {formatted_diff} higher than the median, suggesting some high-value outlier values.\n"
            elif median_value > mean_value:
                difference = round(median_value - mean_value, decimal_places)
                format_info = next((info for col, info in stat_cols.items() if info.get('type') == 'mean'), {})
                formatted_diff = format_value(difference, format_info)
                stats_desc += f"The median is {formatted_diff} higher than the mean, suggesting some low-value outlier values.\n"
            else:
                stats_desc += f"The mean and median are identical, suggesting a symmetrical distribution of values.\n"
        
        # Combine group and stats descriptions
        section = group_desc + stats_desc + ("-" * 40 + "\n")
        prompt_text += section
    
    # Remove the final separator
    prompt_text = prompt_text.rstrip("-" * 40 + "\n")
    
    # Add a summary (count rows on df itself; the polars row iterator is already exhausted)
    groups_count = len(df)
    if len(group_cols) == 1:
        group_label = group_cols[0].replace('_', ' ').lower()
        prompt_text += f"\n\nSummary: Analyzed statistics for {groups_count} different {group_label}s."
    else:
        group_labels = [col.replace('_', ' ').lower() for col in group_cols]
        group_desc = ", ".join(group_labels)
        prompt_text += f"\n\nSummary: Analyzed statistics for {groups_count} different groups based on {group_desc}."
    
    return prompt_text

In [None]:
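generate_descriptive_prompt(
    df=orders_agg,
    group_cols=["source"],
    stat_cols={  # describes each statistic column, as documented in the docstring
        "source_count": {"type": "count", "label": "orders"},
        "median_revenue": {"type": "median", "label": "revenue per order", "format": "currency"},
        "mean_revenue": {"type": "mean", "label": "revenue per order", "format": "currency"},
    })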

In [None]:
prompt_data = ""

for row in orders_agg.iter_rows(named=True):
    source = row['source']
    count = row['source_count']
    median = round(row['median_revenue'], 2)
    mean = round(row['mean_revenue'], 2)
    
    # Generate a descriptive paragraph for this source
    source_description = f"""
        Source: {source}
        There were {count} orders from this source. 
        The median revenue per order was ${median:.2f}.
        The average (mean) revenue per order was ${mean:.2f}.
    """
    
    # Add a comparison of mean to median
    """
    if mean > median:
        difference = round(mean - median, 2)
        source_description += f"The mean is ${difference:.2f} higher than the median, suggesting some high-value outlier orders.\n"
    elif median > mean:
        difference = round(median - mean, 2)
        source_description += f"The median is ${difference:.2f} higher than the mean, suggesting some low-value outlier orders.\n"
    else:
        source_description += f"The mean and median are identical, suggesting a symmetrical distribution of order values.\n"
    """
        
    # Add a separator between sources
    source_description += "-" * 40 + "\n"
    
    # Add this source's description to the overall text
    prompt_data += source_description

# Remove the final separator and add a summary
prompt_data = prompt_data.rstrip("-" * 40 + "\n")
prompt_data += f"\n\nSummary: Analyzed revenue statistics for {len(orders_agg)} different traffic sources."

print("Generated the following descriptive text:")
print(prompt_data)

Generated the following descriptive text:

        Source: instagram.com
        There were 81 orders from this source. 
        The median revenue per order was $32.18.
        The average (mean) revenue per order was $43.18.
    ----------------------------------------

        Source: Direct
        There were 193 orders from this source. 
        The median revenue per order was $34.43.
        The average (mean) revenue per order was $50.14.
    ----------------------------------------

        Source: google
        There were 321 orders from this source. 
        The median revenue per order was $31.30.
        The average (mean) revenue per order was $40.20.
    ----------------------------------------

        Source: fb
        There were 148 orders from this source. 
        The median revenue per order was $36.98.
        The average (mean) revenue per order was $44.27.
    ----------------------------------------

        Source: Klaviyo
        There were 50 orders from t
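Two details of the cell above are worth noting. The leading spaces in the output come from the triple-quoted f-string, which keeps the indentation of the source code. Separately, `str.rstrip` treats its argument as a set of characters rather than a suffix, so `rstrip("-" * 40 + "\n")` strips any trailing run of `-` and newline characters. A minimal sketch of one alternative, assuming Python 3.9+ for `str.removesuffix`; the `describe` helper, the `SEPARATOR` constant, and the two sample rows are illustrative:

```python
import textwrap

SEPARATOR = "-" * 40 + "\n"

def describe(source, count, median, mean):
    """Build one source paragraph without the code's indentation."""
    text = textwrap.dedent(f"""\
        Source: {source}
        There were {count} orders from this source.
        The median revenue per order was ${median:.2f}.
        The average (mean) revenue per order was ${mean:.2f}.
    """)
    return text + SEPARATOR

# Two rows taken from the printed output above, already rounded.
rows = [("instagram.com", 81, 32.18, 43.18), ("Direct", 193, 34.43, 50.14)]
prompt_data = "".join(describe(*row) for row in rows)

# removesuffix drops exactly one trailing separator and nothing else.
prompt_data = prompt_data.removesuffix(SEPARATOR)
print(prompt_data)

# rstrip with the same argument strips a character set instead:
# the newline and both dashes go, the space stays.
print(repr("total: --\n".rstrip(SEPARATOR)))  # 'total: '
```

With this version each paragraph starts at column zero and only the final separator is removed.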

In [41]:
# Test AI function
prompt = f"""
I'm sharing revenue data from our e-commerce store, broken down by traffic source. The data includes order counts, median revenue, and mean revenue for each source.

[ANALYTICS DATA]
{prompt_data}

Based on this information, please provide:

1. A summary of our overall traffic and revenue patterns. Briefly mention volume, but you do not need to provide detail, since spending levels differ by channel.
2. An analysis of which traffic sources are most valuable in terms of:
   - Total revenue contribution
   - Revenue per order
   - Potential for growth based on order volume

3. Insights about any outliers or unusual patterns in the data

4. Strategic recommendations for:
   - Which traffic sources we should invest more in
   - Which sources might need optimization
   - Any suggested A/B tests or experiments

5. How our traffic source performance compares to typical e-commerce benchmarks

Please format your analysis with clear sections and include both tactical and strategic insights. Feel free to note any additional data that would help refine this analysis further.
"""

In [42]:
response = make_gemini_request(request_text=prompt)
display(Markdown(response))

Okay, here's an analysis of your e-commerce traffic source revenue data, broken down into the sections you requested.

## E-Commerce Traffic Source Revenue Analysis

**1. Overall Traffic and Revenue Patterns Summary:**

You're acquiring traffic from a variety of sources, including social media (Instagram, Facebook), direct traffic, search engines (Google), email marketing (Klaviyo), and potentially paid advertising networks (Applovin). There is significant variation in order volume across these channels. While deeper analysis is needed to assess net profit, it appears that your average order values are generally similar across channels.

**2. Traffic Source Valuation:**

*   **Total Revenue Contribution:**
    To determine total revenue contribution, we would need to multiply order counts by the *average* revenue per order (since the mean reflects total spending better than median). Without that calculation explicitly done, we can still infer based on order volume and average revenue. I will assume the calculation is done and then present results as if you have the exact data:
    *   **Google:** Likely has the highest total revenue contribution due to the large order volume (321).
    *   **Direct:** Likely second highest due to second-highest order volume (193) and a higher-than-average revenue per order.
    *   **Facebook:** Good total revenue contribution with a solid order volume (148).
    *   **Instagram, Klaviyo, and Applovin:** contribute less total revenue due to lower order volumes.

*   **Revenue Per Order (Average):**
    *   **Direct:** Leads with the highest average revenue per order ($50.14).
    *   **Klaviyo:** A close second, at $46.55 per order. This makes intuitive sense as email-driven traffic is more likely to come from engaged customers.
    *   **Facebook:** Follows at $44.27 per order.
    *   **Instagram:** $43.18 per order.
    *   **Applovin:** At $41.30, the lowest average revenue.
    *   **Google:** Near the bottom at $40.20 per order.

*   **Potential for Growth Based on Order Volume:**

    *   **Google:** Has the *highest* potential because you already have a substantial volume. Even small improvements in conversion rate or average order value can translate to significant revenue gains.
    *   **Facebook:** Significant growth potential due to a relatively high order volume and competitive average revenue per order.
    *   **Instagram:** Solid growth potential due to moderate order volume and decent average revenue per order.
    *   **Klaviyo & Applovin:** While they have lower order volumes, the high revenue per order (especially Klaviyo) indicates a highly valuable user base that could grow.

**3. Outliers and Unusual Patterns:**

*   **Direct Traffic Dominance in Average Revenue:** The Direct channel having the highest average revenue per order is noteworthy. This could indicate strong brand loyalty, effective offline marketing driving people directly to your site, or a segment of customers who are repeat purchasers. This is good, but requires more investigation.
*   **Google's Lower Average Revenue Per Order:** Google having the *highest* order volume but one of the *lowest* average revenue per order values is another important signal. It is not unusual because high-volume channels often tend to be a bit less valuable due to targeting issues. It suggests that the keywords or targeting strategies used may be attracting a broader audience, not all of whom are high-value customers. The sheer volume still makes it important.
*   **Applovin's Lower Volume/Revenue:** Applovin's revenue and volume are both quite low compared to the other sources.

**4. Strategic Recommendations:**

*   **Invest More In:**
    *   **Google:** Given the high order volume, invest in optimizing Google Ads (or SEO) for more targeted keywords and higher-value customers. Even small improvements in conversion rate or AOV on Google will have a large impact on the bottom line.
    *   **Direct:** Determine what is driving direct traffic and double down on that.
    *   **Klaviyo:** Given the high revenue per order, focus on growing your email list and implementing more personalized email marketing campaigns.
    *   **Facebook:** Continue investing in Facebook ads, focusing on refining targeting to improve average order value.

*   **Optimize:**
    *   **Applovin:** Evaluate the ROI of Applovin. Is the cost per acquisition (CPA) justified by the revenue generated? If not, consider pausing or optimizing the campaigns, or reevaluating its use in your stack.
    *   **Google:** Implement A/B testing of landing pages and ad copy to increase average order value. Consider adding product recommendations, upsells, and bundles.
    *   **Instagram:** Experiment with different types of content and calls to action to improve conversion rates and average order value.

*   **Suggested A/B Tests and Experiments:**

    *   **Google:**
        *   **A/B Test:** Landing page variations focused on different product categories for specific keyword groups.
        *   **Experiment:** Implement smart bidding strategies to optimize for value (revenue) rather than just clicks.
        *   **Experiment:** Use data to personalize ad copy based on user demographics or search history.

    *   **Facebook/Instagram:**
        *   **A/B Test:** Ad creative with different value propositions (e.g., free shipping, exclusive discounts) to see what resonates best with your target audience.
        *   **Experiment:** Test different targeting options (e.g., lookalike audiences, interest-based targeting) to find the most profitable segments.

    *   **Klaviyo:**
        *   **A/B Test:** Subject lines and email content to improve open and click-through rates.
        *   **Experiment:** Segment email lists based on purchase history and behavior to deliver more personalized offers.
    *   **Direct:**
        *   **Experiment:** Add "how did you hear about us" section to checkout flow
        *   **Experiment:** Survey customers on brand awareness/impressions

**5. Comparison to E-Commerce Benchmarks:**

It's difficult to give precise benchmarks without knowing your specific industry and product category. However, here are some general considerations:

*   **Traffic Source Mix:** A healthy e-commerce business often has a mix of traffic sources. It's positive that you're not overly reliant on a single source.
*   **Average Order Value (AOV):** The AOV benchmark varies widely. You should research AOV benchmarks specific to your industry. Compare your AOV across channels to see if any channels are significantly underperforming.
*   **Conversion Rates:** Research average conversion rates for your industry and for each traffic source. This will help you identify areas where you can improve.
*   **Direct Traffic:** A high percentage of direct traffic typically indicates strong brand recognition and customer loyalty, which is a positive sign. If it's *too* high, make sure referral traffic is actually being attributed to "direct" when it should be shown as other channels.

**Additional Data Needs:**

To refine this analysis further, I would need the following information:

*   **Cost Data:** The cost of acquiring traffic from each source (e.g., ad spend, agency fees). This is *critical* to calculate ROI and make informed investment decisions.
*   **Profit Margin:** Your average profit margin on each product. This will allow you to determine the true profitability of each traffic source.
*   **Customer Lifetime Value (CLTV):** This is a crucial metric. Understanding which traffic sources bring in customers with the highest CLTV will help you prioritize long-term growth.
*   **Attribution Modeling:** How do you attribute conversions to different traffic sources? Are you using a first-touch, last-touch, or multi-touch attribution model? The attribution model you use can significantly impact your understanding of which sources are most valuable.
*   **Segmentation:** Can you segment your customer data by demographics, purchase history, or other factors? This would allow you to identify high-value customer segments and target them more effectively.
*   **Return on Ad Spend (ROAS)** For advertising channels, it's important to know the returns you're getting.

By incorporating this additional data, you can create a much more detailed and actionable analysis of your traffic source performance. Good luck!
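The response says total revenue would require multiplying order counts by mean revenue, then proceeds as if that had been done. That arithmetic is cheap to do before building the prompt, so the model does not have to infer it. A minimal sketch using only the five sources visible in the orders_agg output (Applovin's order count is not shown in this notebook, so it is left out); the `rows`, `totals`, and `ranked` names are illustrative:

```python
# (source, order count, mean revenue) copied from the orders_agg output.
rows = [
    ("instagram.com", 81, 43.182222),
    ("Direct", 193, 50.141088),
    ("google", 321, 40.197726),
    ("fb", 148, 44.272973),
    ("Klaviyo", 50, 46.5504),
]

# count * mean recovers the summed revenue for each source.
totals = {source: round(count * mean, 2) for source, count, mean in rows}

# Rank sources by total revenue, highest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for source, total in ranked:
    print(f"{source}: ${total:,.2f}")
```

On these figures google contributes the most and Klaviyo the least of the five, which matches the ordering the model inferred for the sources shown. Including the computed totals in the prompt would let the model cite them rather than assume them.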
