# Virtual Focus Groups Using LLMs

This notebook demonstrates how to analyze user-generated content (UGC) using LLMs.

#### Use Case
We have large amounts of user-generated content, such as product reviews or social media posts, and we want to understand what users think about our brand, products, marketing communications, corporate social media posts, and so on. Performing this analysis manually is daunting and labor-intensive, but we can leverage language models to summarize the content.

#### Prototype: Approach and Data
One way to implement content analysis with language models is to cluster the available content in the embedding space and create a *virtual persona* (or *virtual focus group*) for each cluster. The persona impersonates the real users and answers questions based on the psychological characteristics and traits expressed in the content, such as values, desires, goals, interests, and lifestyle choices. Marketing users perform the analysis by asking these virtual personas questions. This simulates an analysis with real focus groups and can be viewed as a cost- and time-efficient alternative to them.

We use the Amazon Product Review 2018 dataset (see `datasets.md` for details) for prototyping. The prototype uses a small subset of reviews that fits into the LLM context.

#### Usage and Productization
The input dataset can be easily replaced with actual reviews or social media posts; small data schema adjustments will be required. However, the basic prompt-based summarization will need to be replaced with a more scalable approach such as retrieval-augmented generation (RAG). In practice, the personas would typically be specified by marketing users rather than extracted from content on an ad hoc basis.
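
As a minimal sketch of the retrieval step in RAG, assuming the review embeddings are already available as plain lists of floats, the function below ranks reviews by cosine similarity to a question embedding and keeps the top `k`. The names `cosine_similarity` and `top_k_reviews` and the toy vectors are hypothetical and not part of this notebook. In practice the vectors would come from the embedding model, and a vector store would likely replace the linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k_reviews(question_vec, review_vecs, reviews, k=3):
    """Return the k reviews whose embeddings are closest to the question."""
    scored = [(cosine_similarity(question_vec, vec), text)
              for vec, text in zip(review_vecs, reviews)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy 2-D "embeddings" (made up for illustration only)
toy_reviews = ["great scent", "bottle leaked", "soft on skin"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
selected = top_k_reviews([1.0, 0.0], toy_vecs, toy_reviews, k=2)
print(selected)
```

Only the selected reviews would then be placed into the persona prompt, which keeps the prompt size bounded regardless of how many reviews a cluster contains.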

In [2]:
#
# Imports and helper functions
#
import json
import pandas as pd
import gzip
from pprint import pprint

from langchain.embeddings import VertexAIEmbeddings
from langchain.llms import VertexAI
from langchain_core.prompts import PromptTemplate
from sklearn.cluster import KMeans
import inspect

def trim_multiline(x):
    return inspect.cleandoc(x)

## Load the Review Data

In this section, we load the product and review data.

In [3]:
def parse(path):
  # Stream one JSON record per line; the context manager closes the file when done
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield json.loads(l)

def getDF(path):
  df = {}
  for i, d in enumerate(parse(path)):
    df[i] = d
  return pd.DataFrame.from_dict(df, orient='index')

base_path = '<<path to data>>' # Replace with the path to the Amazon Review Data folder (including the trailing slash)
review_df = getDF(base_path + 'Luxury_Beauty.json.gz')
print(f"Loaded {len(review_df)} reviews")

meta_df = getDF(base_path + 'meta_Luxury_Beauty.json.gz')
print(f"Loaded {len(meta_df)} products")

Loaded 574628 reviews
Loaded 12299 products
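
The loader above materializes every field of every record. As a sketch of a lighter variant, assuming only a few fields are needed downstream, the generator below keeps just the requested keys while streaming the file. The name `iter_records`, its `fields` parameter, and the toy records are hypothetical; the demo writes a small temporary gzip file rather than reading the real dataset.

```python
import gzip
import json
import os
import tempfile

def iter_records(path, fields):
    """Yield one dict per JSON line, keeping only the requested fields."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Missing fields become None instead of raising KeyError
            yield {name: record.get(name) for name in fields}

# Demo on a tiny temporary file with made-up records
rows = [
    {"asin": "A1", "summary": "Nice", "reviewText": "Works well", "vote": "2"},
    {"asin": "A2", "summary": "Meh", "reviewText": "Too pricey"},
]
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "toy_reviews.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

loaded = list(iter_records(tmp_path, fields=["asin", "reviewText"]))
print(loaded)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(...)`.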


## Create Personas By Clustering Review Embeddings

In this section, we compute embeddings for individual reviews, cluster the reviews in the embedding space to create the data foundations for virtual personas, and then identify the personas by analyzing the content in each cluster. 

In [31]:
#
# Initialize LLM provider
# (google-cloud-aiplatform must be installed)
#
from google.cloud import aiplatform
aiplatform.init(
    project='gd-gcp-rnd-genai-convai',
    location='us-central1'
)
embedding_llm = VertexAIEmbeddings()
print(f'Using the following embedding model: {embedding_llm}')

llm = VertexAI(temperature=0.7)
print(f'Using the following generative model: {llm}')

Using the following embedding model: project=None location='us-central1' request_parallelism=5 max_retries=6 stop=None model_name='textembedding-gecko' client=<vertexai.language_models.TextEmbeddingModel object at 0x7f92ede23880> client_preview=None temperature=0.0 max_output_tokens=128 top_p=0.95 top_k=40 credentials=None n=1 streaming=False
Using the following generative model: VertexAI
Params: {'model_name': 'text-bison', 'temperature': 0.7, 'max_output_tokens': 128, 'candidate_count': 1, 'top_k': 40, 'top_p': 0.95}


In [6]:
#
# Sample and filter the reviews 
#
n_products = 200
meta_sample_df = meta_df.sample(n_products)
df = pd.merge(meta_sample_df[['asin', 'title']], 
              review_df[['asin', 'reviewerName', 'summary', 'reviewText']], 
              how='inner', on='asin')

df[['summary', 'reviewText']] = df[['summary', 'reviewText']].astype(str)
df = df[(df['summary'].map(len) > 10) & (df['reviewText'].map(len) > 100)]   # Filter out reviews that are too short
df.reset_index(drop=True, inplace=True)
print(f"Sampled {df.shape[0]} reviews for {df['asin'].nunique()} products")

Sampled 4733 reviews for 173 products.


In [14]:
#
# Compute embeddings for the reviews
#
def row_to_document(row):
    return trim_multiline(f"""
    Product title: {row["title"]}
    Review summary: {row["summary"]}
    Review text: {row["reviewText"]}""")

docs = df.apply(lambda review: row_to_document(review), axis=1).to_list()

embeddings = embedding_llm.embed_documents(docs)
print(f'Computed {len(embeddings)} embedding vectors')

Computed 4733 embedding vectors
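
When calling an embedding API directly, or when intermediate results should be checkpointed, it can help to send the documents in fixed-size batches. The sketch below assumes `embed_fn` is any callable that maps a list of strings to a list of vectors. The names `batched`, `embed_in_batches`, and `fake_embed` are hypothetical, the batch size is arbitrary, and the LangChain wrapper used above may already batch requests internally.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(docs, embed_fn, batch_size=100):
    """Embed docs batch by batch and return the vectors in input order."""
    vectors = []
    for batch in batched(docs, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch):
    # Hypothetical stand-in for a real model: [character count, word count]
    return [[float(len(text)), float(len(text.split()))] for text in batch]

toy_docs = [f"review number {i}" for i in range(7)]
toy_vectors = embed_in_batches(toy_docs, fake_embed, batch_size=3)
print(len(toy_vectors))
```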


In [34]:
#
# Perform clustering and attribute each review with the cluster ID
#
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init="auto").fit(embeddings)
clustered_df = pd.concat([df, pd.DataFrame(kmeans.labels_, columns=['cluster_id'])], axis=1)

#
# Analyze each cluster and create a persona title
#
topic_prompt = """You're an expert marketing analyst. Your goal is to read a bunch of product reviews and create a title for an imaginary persona who wrote these reviews.
The title should clearly describe the psychological characteristics and traits such as values, desires, goals, interests, and lifestyle choices. 
The title MUST NOT contain any formatting or personal names. 

---------------------------
PRODUCT REVIEWS:
{reviews}
---------------------------



PERSONA TITLE:
"""
topic_prompt_template = PromptTemplate(template=topic_prompt, input_variables=["reviews"])

n_samples_per_cluster = 50
sampled_reviews_df = clustered_df.groupby('cluster_id').sample(n=n_samples_per_cluster)
personas = {}
for cluster_id in range(n_clusters):
    reviews_in_cluster = sampled_reviews_df[sampled_reviews_df['cluster_id']==cluster_id].apply(
        lambda review: row_to_document(review), axis=1).to_list()

    response = llm(topic_prompt_template.format(reviews="\n\n\n".join(reviews_in_cluster)))
    personas[cluster_id] = response

print(f'The following {len(personas)} personas have been identified:')
pprint(personas)

The following 4 personas have been identified:
{0: ' The Sophisticated Beauty Connoisseur',
 1: ' The Versatile Hair Enthusiast',
 2: ' The Refined Grooming Enthusiast',
 3: ' The Skincare Enthusiast'}
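
The raw responses above start with a space, and a model may occasionally add quotes or emphasis markers despite the prompt. A small cleanup step, sketched below, normalizes the titles before they are reused in later prompts. The helper `clean_title` is hypothetical, and the set of characters it strips is an assumption about what a model might add.

```python
def clean_title(title):
    """Normalize a raw LLM response into a plain persona title."""
    # Collapse runs of whitespace (including newlines) into single spaces
    collapsed = " ".join(title.split())
    # Drop surrounding quotes and markdown emphasis markers
    return collapsed.strip("\"'*")

raw_personas = {
    0: " The Sophisticated Beauty Connoisseur",
    1: " The Versatile Hair Enthusiast",
    2: " The Refined Grooming Enthusiast",
    3: " The Skincare Enthusiast",
}
clean_personas = {cid: clean_title(title) for cid, title in raw_personas.items()}
print(clean_personas)
```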


## Ask Questions to Virtual Personas

In this section, we demonstrate how business users (e.g. marketing analysts) can interact with the virtual personas and get insights into customers' perception of the brand, product, and messaging.

In [36]:
persona_prompt = """You're a virtual persona defined as {persona_title}. Please answer the question consistently with your previous product reviews. 
Your answer must be consistent with your reviews in terms of values, desires, goals, interests, lifestyle choices, and product strengths and weaknesses which you have noticed.

YOUR PREVIOUS REVIEWS:
{reviews}

---------------------------

QUESTION:
{question}

ANSWER:
"""
persona_prompt_template = PromptTemplate(template=persona_prompt, input_variables=["persona_title", "reviews", "question"])

cluster_id = 2
persona_reviews = sampled_reviews_df[sampled_reviews_df['cluster_id']==cluster_id].apply(
        lambda review: row_to_document(review), axis=1)
person_review_as_text = "\n\n".join(persona_reviews)

question = "How would you recommend improving the products?"

response = llm(persona_prompt_template.format(persona_title=personas[cluster_id], reviews=person_review_as_text, question=question))

print(response)

 To improve the products, I would recommend the following:

**For Baxter of California Comb:**
- Improve packaging to ensure combs arrive in good condition.
- Consider offering a case or pouch for the comb to protect it when not in use.

**For Billy Jealousy Beard Envy Kit:**
- Improve the quality control process to ensure products are not damaged or defective before shipping.
- Soften the bristles of the brush to make it more gentle on the beard.
- Consider adding a leave-in conditioner to the kit to help soften and condition the beard.

**For Clarisonic Sensitive Facial Cleansing Brush Head Replacement
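
To compare segments, the same question can be put to every persona in turn. The sketch below assumes `llm_fn` is any callable that maps a prompt string to a response string. The names `ask_all_personas` and `echo_llm`, the shortened template, and the toy reviews are hypothetical; a real run would pass the generative model and the sampled reviews of each cluster.

```python
PERSONA_TEMPLATE = (
    "You're a virtual persona defined as {persona_title}.\n\n"
    "YOUR PREVIOUS REVIEWS:\n{reviews}\n\n"
    "QUESTION:\n{question}\n\nANSWER:\n"
)

def ask_all_personas(personas, reviews_by_cluster, question, llm_fn):
    """Ask one question to every persona and collect answers by cluster ID."""
    answers = {}
    for cluster_id, title in personas.items():
        prompt = PERSONA_TEMPLATE.format(
            persona_title=title.strip(),
            reviews="\n\n".join(reviews_by_cluster[cluster_id]),
            question=question,
        )
        answers[cluster_id] = llm_fn(prompt).strip()
    return answers

def echo_llm(prompt):
    # Hypothetical stand-in for a real model: returns the first prompt line
    return prompt.splitlines()[0]

toy_personas = {0: " The Skincare Enthusiast", 1: " The Versatile Hair Enthusiast"}
toy_reviews = {0: ["Gentle on my skin."], 1: ["Holds curls all day."]}
answers = ask_all_personas(toy_personas, toy_reviews,
                           "What matters most to you?", echo_llm)
print(answers)
```

Collecting the answers in one dictionary makes it straightforward to show the personas side by side for the same question.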
