<a href="https://colab.research.google.com/github/AleemRahil/Robust-End-to-End-E-Commerce-Analytics-Automation-with-LLMs/blob/main/Talk_to_Private_Data_with_Meta_llama3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Smart & Private Data Analysis with Llama 3

In [None]:
!pip install -qU \
  python-dotenv \
  langchain \
  groq \
  langchain-groq \
  google-cloud-bigquery

In this project, I will explore how to generate insights from BigQuery using Llama 3 and LangChain. The focus will be on handling errors gracefully and feeding them back into the chain for iterative improvement.

A significant advantage of using LLMs for insight extraction is efficiency. Traditional methods often involve manual data exploration, trial and error with SQL queries, and a lot of back-and-forth between data analysts and stakeholders. With LLMs and LangChain, we can automate much of this process and retrieve relevant information quickly from natural language queries.

One key challenge in working with large datasets is dealing with sensitive or private information. Many companies have strict security requirements that prevent them from using external services like OpenAI or Anthropic to process their data. This is where Llama 3 comes in: because its weights can be downloaded and self-hosted, LLM-based SQL pipelines can run locally, so sensitive data stays within the organization's infrastructure.

Generating SQL queries from natural language is not always straightforward, and the generated query may be invalid or may not quite match the user's intent. To address this, I'll implement a retry mechanism that captures error messages and feeds them back into the chain. This lets the model see its earlier mistakes and generate more accurate queries in subsequent attempts.

To keep this notebook simple, I'll test the model's data analysis capabilities on GroqCloud, a hosted service, rather than on a local deployment.

#Setting Up the Environment


In [None]:
import json
import os
from datetime import datetime

from dotenv import load_dotenv

load_dotenv()

True

#Authenticating with BigQuery


In [None]:
from google.cloud import bigquery
from google.oauth2 import service_account

service_account_path = './gbqkey.json'
dataset = "ecomdata_bi"
full_dataset_id = f"rabbitpromotion.{dataset}"

credentials = service_account.Credentials.from_service_account_file(service_account_path)
gbq_client = bigquery.Client(credentials=credentials, project=credentials.project_id)

In [None]:
import schema

In [None]:
print(schema.fetch_schemas(full_dataset_id, gbq_client))

- rabbitpromotion.ecomdata_bi.customer_tags
- rabbitpromotion.ecomdata_bi.customers
- rabbitpromotion.ecomdata_bi.orders
- rabbitpromotion.ecomdata_bi.products

Schema for customer_tags:
- Name: id, Type: STRING, Mode: REQUIRED
- Name: acquisitionSource, Type: STRING, Mode: NULLABLE
- Name: campaignInfo, Type: STRING, Mode: NULLABLE
- Name: discountFirstPurchase, Type: INTEGER, Mode: NULLABLE
- Name: daysToConversion, Type: INTEGER, Mode: NULLABLE
- Name: predictedGender, Type: STRING, Mode: NULLABLE
- Name: predictedGeneration, Type: STRING, Mode: NULLABLE

Schema for customers:
- Name: id, Type: STRING, Mode: REQUIRED
- Name: emailMarketingConsent, Type: RECORD, Mode: NULLABLE
    - Name: consentUpdatedAt, Type: TIMESTAMP, Mode: NULLABLE
    - Name: marketingOptInLevel, Type: STRING, Mode: NULLABLE
    - Name: marketingState, Type: STRING, Mode: NULLABLE
- Name: createdAt, Type: TIMESTAMP, Mode: NULLABLE
- Name: updatedAt, Type: TIMESTAMP, Mode: NULLABLE
- Name: firstName, Type: STRI
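
The `schema` module imported above is a local helper whose source is not shown in this notebook. As a rough sketch only, a function producing output in the format printed above might look like the following. It assumes the client exposes `list_tables` and `get_table`, and that schema fields carry `name`, `field_type`, `mode`, and `fields` attributes, as `google-cloud-bigquery` objects do; the real `schema.fetch_schemas` may well differ.

```python
def format_fields(fields, indent=0):
    """Render schema fields as '- Name: ..., Type: ..., Mode: ...' lines."""
    lines = []
    for field in fields:
        lines.append(
            f"{' ' * indent}- Name: {field.name}, "
            f"Type: {field.field_type}, Mode: {field.mode}"
        )
        # RECORD fields keep their sub-fields in `.fields`; indent them one level.
        if field.fields:
            lines.extend(format_fields(field.fields, indent + 4))
    return lines


def fetch_schemas(dataset_id, client):
    """Return a text description of every table in `dataset_id`.

    Hypothetical reconstruction, not the notebook's actual helper.
    """
    tables = list(client.list_tables(dataset_id))
    # First list the fully qualified table names...
    lines = [f"- {t.project}.{t.dataset_id}.{t.table_id}" for t in tables]
    # ...then describe each table's columns.
    for t in tables:
        table = client.get_table(f"{t.project}.{t.dataset_id}.{t.table_id}")
        lines.append("")
        lines.append(f"Schema for {t.table_id}:")
        lines.extend(format_fields(table.schema))
    return "\n".join(lines)
```

Returning a single string keeps the result easy to drop into a prompt template as the `{schema}` variable.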

#Configuring the Language Model

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

llm_groq = ChatGroq(temperature=0.2, model_name="llama3-70b-8192")



In [None]:
template = """Based on the BigQuery schema below, and the message history, write a
SQL query that answers the question/request.

Remember to UNNEST repeated records and make sure to use only existing fields in the schema:

schema:{schema}

Question: {question}

Message history: {messages}

SQL Query:"""
prompt = ChatPromptTemplate.from_template(template)

In [None]:
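# InMemoryChatMessageHistory lives in langchain_core; it is aliased here because
# recent langchain releases no longer ship langchain.memory.ChatMessageHistory.
from langchain_core.chat_history import InMemoryChatMessageHistory as ChatMessageHistory
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough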

#Setting Up the Pipeline


In [None]:
hist = ChatMessageHistory()

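def get_schema(_):
    # The chain input is ignored; the full dataset schema is always returned
    return schema.fetch_schemas(full_dataset_id, gbq_client)

def get_messages(_):
    # Failed queries and their errors from earlier attempts
    return hist.messages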


sql_response = (
    RunnablePassthrough.assign(schema=get_schema, messages=get_messages)
    | prompt
    | llm_groq
    | StrOutputParser()
)
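
Note that `get_schema` above calls `schema.fetch_schemas` every time the chain runs, so each retry repeats the same BigQuery metadata lookups. If the schema does not change during a session, one option is to cache it. The sketch below is an optional variation, not part of the original pipeline; `make_cached_schema_getter` and `fake_fetch` are hypothetical names used for illustration.

```python
from functools import lru_cache


def make_cached_schema_getter(fetch):
    """Wrap a zero-argument `fetch` so its result is computed only once."""

    @lru_cache(maxsize=1)
    def cached():
        return fetch()

    def get_schema(_):
        # The chain passes its input dict here; it is ignored.
        return cached()

    return get_schema


# In the notebook this could be wired up as (assumption):
# get_schema = make_cached_schema_getter(
#     lambda: schema.fetch_schemas(full_dataset_id, gbq_client))

# Stand-in fetcher that records how often it is called.
calls = []

def fake_fetch():
    calls.append(1)
    return "stand-in schema text"

get_schema_cached = make_cached_schema_getter(fake_fetch)
get_schema_cached({"question": "first"})
get_schema_cached({"question": "second"})
print(len(calls))  # the fetcher ran once despite two lookups
```

The trade-off is staleness: if tables are added or altered mid-session, the cached text will not reflect it until the getter is rebuilt.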

#Extracting SQL from Generated Text

In [None]:
import re

def extract_sql(input_text):
    """Return the SQL from a model reply, unwrapping a triple-backtick fence if present."""
    # Match the first fenced block, skipping an optional "sql" language tag
    # so the tag is not sent to BigQuery as part of the query
    pattern = re.compile(r'```(?:sql\b)?(.*?)```', re.DOTALL | re.IGNORECASE)
    match = pattern.search(input_text)
    if match:
        return match.group(1).strip()  # Return the cleaned, extracted SQL
    # If no fenced block is found, return the input as is
    return input_text.strip()

#Executing Queries with Retry Mechanism

In [None]:
from google.cloud.exceptions import BadRequest, NotFound

def execute_query_with_retries(my_query, max_attempts=5):
    """Generate SQL for `my_query`, run it, and retry with error feedback."""
    for attempt in range(1, max_attempts + 1):
        print(f"Attempt {attempt} of {max_attempts}")
        # Ask the LLM chain for a query; earlier errors reach it via `hist`
        print("Generating the SQL")
        res = sql_response.invoke({"question": my_query})
        clean_sql = extract_sql(res)
        try:
            print("Attempting to run the query and convert it to a DataFrame")
            dataframe = gbq_client.query(clean_sql).to_dataframe()
            print("Query executed successfully.")
            return dataframe
        except (BadRequest, NotFound) as e:
            # Invalid SQL raises BadRequest; a nonexistent table raises NotFound
            error_message = str(e)
            print("Query failed with the following error:")
            print(error_message)
            # Store the failed query and its error so the next attempt can correct it
            hist.add_user_message(clean_sql + ': ' + error_message)
    print("Reached maximum attempt limit. Stopping retries.")
    return None  # All attempts failed
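
The function above needs live BigQuery and Groq credentials, which makes the feedback loop hard to try in isolation. As a minimal sketch, the same generate, execute, and feed-back-the-error pattern can be exercised with stand-ins. Everything here is hypothetical and for illustration only: `run_with_feedback`, `fake_generate`, and `fake_execute` are not part of the notebook, and `ValueError` stands in for the BigQuery exception.

```python
def run_with_feedback(generate, execute, question, max_attempts=5):
    """Generate a query, run it, and feed any error into the next attempt.

    `generate(question, errors)` returns a query string;
    `execute(query)` returns a result or raises ValueError.
    """
    errors = []  # plays the role of the chat history
    for attempt in range(1, max_attempts + 1):
        query = generate(question, errors)
        try:
            return execute(query), attempt
        except ValueError as exc:  # stand-in for the BigQuery error
            errors.append(f"{query}: {exc}")
    return None, max_attempts


# Stand-in "model": it switches column names once it has seen an error.
def fake_generate(question, errors):
    if not errors:
        return "SELECT name FROM customers"
    return "SELECT firstName FROM customers"


# Stand-in "database": it rejects queries that lack the right column.
def fake_execute(query):
    if "firstName" not in query:
        raise ValueError("Unrecognized name: name")
    return ["Michael", "Brandon"]


result, attempts_used = run_with_feedback(
    fake_generate, fake_execute, "List customer names"
)
print(result, attempts_used)
```

Passing the generator and executor in as arguments, rather than reading globals, is a design choice that makes the loop testable without network access.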


#Generating Insights


In [None]:
hist = ChatMessageHistory()

my_query = """
Give me the name of the top 100 customers with highest purchase frequency but with under average AOV.
Include frequency and aov"""

df = execute_query_with_retries(my_query)

Attempt 1 of 5
Generating the SQL
Attempting to run the query and convert it to a DataFrame
Query failed with the following error:
Attempt 2 of 5
Generating the SQL
Attempting to run the query and convert it to a DataFrame
Query failed with the following error:
Attempt 3 of 5
Generating the SQL
Attempting to run the query and convert it to a DataFrame
Query executed successfully.


In [None]:
df

| | firstName | lastName | purchase_frequency | avg_order_value |
|---|---|---|---|---|
| 0 | Michael | Harris | 6 | 383.998333 |
| 1 | Brandon | Bates | 6 | 424.055000 |
| 2 | Robert | Garcia | 6 | 306.633333 |
| 3 | Jonathan | Holder | 6 | 442.136667 |
| 4 | Edward | Payne | 6 | 461.783333 |
| ... | ... | ... | ... | ... |
| 95 | Maria | Wilson | 4 | 312.665000 |
| 96 | Jessica | Williams | 4 | 384.367500 |
| 97 | Susan | Gardner | 4 | 322.785000 |
| 98 | Charles | Thomas | 4 | 459.787500 |
