### Summary Generation for Mobile App Reviews

- first, load the sample reviews (N~350)
- merge (unique) reviews into a single text
- prepare three prompts: chain of density (CoD), chain of density for reviews (CoD_R), and vanilla
- use each prompt and generate summaries for each app
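
The merge step listed above can be sketched as follows. This is a minimal sketch, assuming the reviews are available as a plain list of strings; `merge_unique_reviews` and its `sep` parameter are hypothetical names, not helpers defined in this notebook.

```python
def merge_unique_reviews(reviews, sep=" "):
    """Drop empty strings and exact duplicates (first occurrence wins), then join."""
    seen = set()
    unique = []
    for review in reviews:
        text = review.strip()
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    return sep.join(unique)


# made-up sample reviews for illustration
sample = ["Great app.", "Crashes on login.", "Great app.", "  "]
merged = merge_unique_reviews(sample)
print(merged)  # Great app. Crashes on login.
```

Only exact duplicates are removed here; near-duplicates would need a similarity measure.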

In [None]:
import json
import os
import re
import time
from functools import reduce

import pandas as pd
from openai import OpenAI

## Load dataset

In [12]:
domains_apps = {
  "ridehailing": ["uber", "lyft" ],
  "mentalhealth": ["calm", "headspace"],
   "dating": ["tinder", "bumble"],
   "investing": [ "robinhood", "acorn"]
}

source_reviews = {}

In [None]:
for domain, apps in domains_apps.items():
    for app in apps:
        df = pd.read_csv(f"./data/reviews/sampled/{domain}_{app}.csv")
        source_reviews[f"{domain}_{app}"] = df

## Summary Generation using CoD, CoD_r, and Vanilla prompting on GPT-4 preview model

In [4]:
import tiktoken

def num_tokens(text, model):
    """Count the tokens in `text` using the tokenizer tiktoken maps to `model`."""
    encoder = tiktoken.encoding_for_model(model)
    tokens = encoder.encode(text)
    # sanity check: the encoding must round-trip losslessly
    assert encoder.decode(tokens) == text
    return len(tokens)

In [5]:
task_prompt_cod = """Instructions:
You will generate increasingly concise, entity-dense summaries of reviews (delimited with XML tags).{{lemma_info}}

Repeat the following 2 steps {{iterations}} times.
Step 1: Identify {{num_missing_entities}} informative entities ("";"" delimited) from the reviews which are missing from the previously generated summaries.
Step 2: Write a new, denser summary of identical length ({{summary_length}} words) which covers entities from previous summaries plus the Missing entities.

A missing entity is:
- relevant to the app’s operation,
- specific yet concise (5 words or fewer),
- novel (not in the previous summary),
- faithful (present in the reviews),
- anywhere (can be located anywhere in the reviews),
- Non-identifier: avoids specific personal, geographical information like name, location, URLs, emails

Guidelines:
- The first summary should be long (4-5 sentences, {{summary_length}} words) yet highly non-specific, containing little information beyond the entities marked as missing.
- Use overly verbose language and fillers (e.g., "this app discusses") to reach {{summary_length}} words.
- Make every word count: rewrite the previous summary to improve flow and make space for additional entities.
- Make space with fusion, compression, and removal of uninformative phrases like "the users discuss".
- The summaries should become highly dense and concise yet self-contained, i.e., easily understood without the reviews.
- Missing entities can appear anywhere in the reviews. NEVER drop entities from the previous summary.
- If space cannot be made, add fewer new entities. Remember, use the exact same number of words for each summary.
- Avoid apps' names other than {{app}} in summaries.
- Avoid personal and location specific information like name, place, URLs, and emails

Remember, use the exact number ({{summary_length}}) of words for each summary.
ONLY Answer in JSON. The JSON should be a list (length {{iterations}}) of dictionaries whose keys are ""Iteration turn"", ""Missing_Entities"", ""Denser_Summary"".
"""

task_prompt_cod_r = """Instructions:
You will generate increasingly concise, entity-dense summaries of the reviews of the {{app}} app.
Repeat the following 2 steps {{iterations}} times.
Step 1:	Identify {{num_missing_entities}} informative entities from the reviews which are missing from the previously generated summary.
Step 2:	Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the missing entities.
Note: An entity is a user-defined issue in the reviews that is perceived to harm or support their goals. This includes the functional or non-functional features of the app.

A missing entity is:
- relevant to the app’s operation,
- specific yet concise (5 words or fewer),
- novel (not in the previous summary),
- faithful (present in the reviews),
- anywhere (can be located anywhere in the reviews).

Guidelines:
- The first summary should be long (4-5 sentences, {{summary_length}} words) yet highly non-specific, containing little information beyond the entities marked as missing.
- Use overly verbose language and fillers (e.g., "this app discusses") to reach {{summary_length}} words.
- Make every word count: rewrite the previous summary to improve flow and make space for additional entities.
- Make space with fusion, compression, and removal of uninformative phrases like "the users discuss".
- The summaries should become highly dense and concise yet self-contained, i.e., easily understood without the reviews.
- Missing entities can appear anywhere in the reviews. NEVER drop entities from the previous summary.
- If space cannot be made, add fewer new entities. Remember, use the exact same number of words for each summary.
- Avoid apps' names other than {{app}} in summaries.
- Avoid personal and location specific information like name, place, URLs, and emails

Remember, use the exact number ({{summary_length}}) of words for each summary.
ONLY Answer in JSON. The JSON should be a list (length {{iterations}}) of dictionaries whose keys are ""Iteration turn"", ""Missing_Entities"", ""Denser_Summary"".
"""

task_prompt_vanilla = """Instructions:
You will summarize the app store reviews (delimited with XML tags) for {{app}} app in {{summary_length}} words.{{lemma_info}}
"""
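
The templates above rely on `{{name}}` placeholders, and one left unfilled would reach the model verbatim. The sketch below shows one way to fill them and fail loudly when a value is missing. `fill_template` and `PLACEHOLDER` are hypothetical names introduced for illustration; the sample template is a shortened stand-in for the real prompts.

```python
import re

# matches {{name}} where name is letters, digits or underscores
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template, values):
    """Replace every {{name}} with str(values[name]); raise KeyError if a name is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError("no value supplied for placeholder " + name)
        return str(values[name])

    # a function replacement avoids backslash/group-reference surprises in the values
    return PLACEHOLDER.sub(substitute, template)


template = "Summarize the reviews for {{app}} in {{summary_length}} words.{{lemma_info}}"
prompt = fill_template(template, {"app": "uber", "summary_length": 120, "lemma_info": ""})
print(prompt)  # Summarize the reviews for uber in 120 words.
```

Raising on a missing key is a design choice: it surfaces a forgotten parameter before any API call is made.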

In [None]:
OPEN_AI_API_KEY = "your-api-key-goes-here"
client = OpenAI(api_key=OPEN_AI_API_KEY)

In [None]:
def get_reviews(text):
    return """<REVIEWS>
{{reviews}}
</REVIEWS>
"""

def generate_summary(app, reviews, params, prompt_format, output_file, simply_return=False):
    
    task_prompt = ""
    
    if prompt_format == "CoD":
        task_prompt = task_prompt_cod
    elif prompt_format == "CoD_R":
        task_prompt = task_prompt_cod_r
    elif prompt_format == "vanilla":
        task_prompt = task_prompt_vanilla
    else:
        print("Invalid prompt format:", prompt_format)
        return

    # get reviews text delimited within <REVIEWS></REVIEWS> tags
    reviews_prompt = get_reviews(reviews)
    reviews_prompt = reviews_prompt.replace("{{reviews}}", reviews)
    reviews_prompt = re.sub('\n+', '\n', reviews_prompt)
    reviews_prompt = re.sub('\t+', '\t', reviews_prompt)

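    # add params in the task prompt (plain string replacement; no regex semantics are needed)
    task_prompt = task_prompt.replace("{{iterations}}", params["iterations"])
    task_prompt = task_prompt.replace("{{summary_length}}", params["summary_length"])
    task_prompt = task_prompt.replace("{{max_summary_length}}", params["max_summary_length"])
    task_prompt = task_prompt.replace("{{num_missing_entities}}", params["num_missing_entities"])
    # ASSUMPTION: {{lemma_info}} appears in the CoD and vanilla templates but no value is
    # defined in this notebook, so it is treated as optional and blanked unless params has one
    task_prompt = task_prompt.replace("{{lemma_info}}", params.get("lemma_info", ""))
    task_prompt = task_prompt.replace("{{app}}", app)
    # the templates use doubled quotes (""); collapse them to single quotes
    task_prompt = task_prompt.replace('""', '"')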

    system_prompt = "You are an experienced summarizer. You will generate summaries that people can understand easily. Reviews will be provided within XML tags (<REVIEWS> and </REVIEWS>)."

    print(system_prompt)
    print(reviews_prompt)
    print(task_prompt)

    # compute tokens for the prompt and reviews texts
    reviews_token_len = num_tokens(reviews_prompt, params["model"])
    task_token_len = num_tokens(task_prompt, params["model"])
    reviews_prompt_len = reviews_token_len + task_token_len
    print("reviews_token_len: ", reviews_token_len, 
          " task_token_len: ", task_token_len, 
          " reviews_prompt_len: ", reviews_prompt_len)

    messages = [{"role": "system", "content": system_prompt},
    {"role": "user", "content": reviews_prompt},
    {"role": "assistant", "content": "Received reviews delimited with XML tags."},
    {"role": "user", "content": task_prompt}]
    print("combined messages:\n", messages)

    if simply_return:
        return

    response =  client.chat.completions.create(
                model=params["model"],
                messages = messages,
                n=params["num_output"],
                stop= None,
                top_p=params["top_p"],
                frequency_penalty = params["frequency_penalty"],
                presence_penalty = params["presence_penalty"],
                temperature=params["temperature"]
                )

    results = []
    for item in response.choices:
        content = item.message.content.strip().replace("\n", "")
        print(item.index, content)
        results.append((item.index, content, task_prompt, reviews_prompt, messages,
                        response.model, response.created, response.system_fingerprint,
                        response.usage.completion_tokens, response.usage.prompt_tokens, response.usage.total_tokens))


    df = pd.DataFrame(results, columns=["choice index", "content", "task prompt", "reviews prompt",
                                        "messages", "model", "created", "system_fingerprint",
                                        "completion_tokens", "prompt_tokens", "total_tokens"])
    df.to_csv(output_file, index=False, header=True)
    print("\n-------------- saved result to ", output_file, "------------\n")
    return response
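
The CoD prompts ask for a JSON list of dictionaries with the keys "Iteration turn", "Missing_Entities" and "Denser_Summary", each summary having a fixed word count. A minimal sketch of checking a returned string against that contract follows. `check_cod_output` and `tolerance` are hypothetical, and the sample response is made up, not model output.

```python
import json


def check_cod_output(raw_json, summary_length, tolerance=10):
    """Parse a CoD-style JSON list and report the word count of each summary."""
    steps = json.loads(raw_json)
    if not isinstance(steps, list):
        raise ValueError("expected a JSON list of iteration dictionaries")
    report = []
    for step in steps:
        words = len(step["Denser_Summary"].split())
        report.append({
            "turn": step.get("Iteration turn"),
            "words": words,
            "within_budget": abs(words - summary_length) <= tolerance,
        })
    return report


# made-up single-iteration response for illustration
raw = json.dumps([
    {"Iteration turn": 1,
     "Missing_Entities": "surge pricing; driver rating",
     "Denser_Summary": "Users mention surge pricing and driver rating."},
])
report = check_cod_output(raw, summary_length=7, tolerance=1)
print(report)
```

Whitespace splitting is only an approximation of how a model counts words, so a tolerance is usually more practical than an exact match.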

In [None]:
params = {"iterations": "5", "summary_length": "120",
          "max_summary_length": "120", "num_missing_entities": "2-4",
          "prompt_format": ["CoD", "CoD_R", "vanilla"],
    "top_p": 0.5, "temperature": 0.5, "max_tokens": 128000, "model": "gpt-4-1106-preview",
    "num_output": 1, "presence_penalty": 0.1, "frequency_penalty": 0.1}

In [None]:
# generate summaries using each prompt type

for domain, apps in domains_apps.items():
    for app in apps:
        for prompt_format in params["prompt_format"]:
            source_df = source_reviews[f"{domain}_{app}"]
            sample_size = len(source_df.groupby("uuid"))  # number of unique reviews
            total_size = len(source_df)  # number of rows

            model = params["model"]
            OUTPUT_DIR = f"./data/summaries/{prompt_format}"
            summ_output_dir = f"{OUTPUT_DIR}/{model}/{domain}"
            os.makedirs(summ_output_dir, exist_ok=True)
            output_file = f"{summ_output_dir}/{app}.csv"

            # put the highest-scoring reviews first, then merge them into one text
            source_df = source_df.sort_values("tfidf score", ascending=False)
            col_reviews = source_df["review"].tolist()
            col_reviews_merged = ".".join(col_reviews)

            print("\n---------------PARAMS:--------------\n", app, params)
            print(domain, app, prompt_format,
                  "\n#output file: ", output_file,  # output file to save summaries
                  "\n#col reviews len: ", len(col_reviews),  # number of "review" rows
                  "\n#reviews: ", sample_size)  # number of unique reviews

            # call summary generation function
            # (the last argument, simply_return=True, builds and prints the prompt without calling the API)
            generate_summary(app, col_reviews_merged, params, prompt_format, output_file, True)
            time.sleep(1)

## Baseline summary: Hybrid TF.IDF 

In [None]:
domains_tfidf_summary = {}
sents_df = {}

for domain, apps in domains_apps.items():
    for app in apps:
        df = pd.read_csv(f"./data/raw reviews/{domain}/{app}.csv")
        df = break_into_sentences(df) # defined in data preprocessing file
        df = df.apply(lambda row: preprocess(row, "sent"), axis=1) # defined in data preprocessing file
        sents_df[f"{domain}_{app}"] = df

In [None]:
from numpy.linalg import norm
from numpy import dot

def cosine_sim(vec1, vec2):
    return dot(vec1, vec2) / (norm(vec1) * norm(vec2))

def select_salient_documents(post_vectors, post_weights, k=10, similarity_threshold=0.4):
    """
        Selects the top k most salient posts in a collection of posts.
        To avoid redundancy, any post too similar to an already selected post is disregarded. Each selected
        post will therefore be both highly salient and semantically distinct from the others.

        Note:
            post_vectors and post_weights must be in the same order. The ith element of post_weights must reflect
             the ith element of post_vectors

        Args:
            post_vectors (list of (list of float)): Hybrid tfidf representation of the documents
             as a document-term matrix

            post_weights (list of float): Hybrid Tfidf weight for each document

            k (int): The number of posts to select as output

            similarity_threshold (float): Posts whose cosine similarity to any already selected post
             is at or above this value are skipped

    """

    sorted_keyed_vectors = [z for _, z in sorted(zip(post_weights, enumerate(post_vectors)), key=lambda i: i[0],
                                                 reverse=True)]  # z is (i,vi) sorted by weight

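    veclength = len(post_vectors)
    if veclength == 0:
        return []

    i = 1
    # only loop when there is at least a second candidate to compare against
    loop_condition = veclength > 1

    significant_indices = [0]
    sorted_indices = [sorted_keyed_vectors[0][0]]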

    while loop_condition:
        is_similar = False

        for j in significant_indices:
            sim = cosine_sim(sorted_keyed_vectors[j][1], sorted_keyed_vectors[i][1])
            if sim >= similarity_threshold:
                is_similar = True

        if not is_similar:
            significant_indices.append(i)
            sorted_indices.append(sorted_keyed_vectors[i][0])

        if (len(significant_indices) >= k) or (i >= veclength - 1):
            loop_condition = False
        i += 1

    return sorted_indices
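
The selection loop above is a greedy filter: walk the posts from highest to lowest weight and keep one only if it is not too similar to anything already kept. A compact restatement of that idea is sketched below, written without NumPy so it runs standalone. `greedy_select` and `cosine` are hypothetical names, the toy vectors are invented, and the sketch is intended to mirror the behaviour rather than replace the function above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_select(vectors, weights, k=10, threshold=0.4):
    """Return up to k original indices, highest weight first, skipping near-duplicates."""
    order = sorted(range(len(vectors)), key=lambda i: weights[i], reverse=True)
    chosen = []
    for i in order:
        if len(chosen) >= k:
            break
        # keep i only if it is below the threshold against every kept vector
        if all(cosine(vectors[i], vectors[j]) < threshold for j in chosen):
            chosen.append(i)
    return chosen


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy 2-d "documents"
weights = [0.8, 0.9, 0.5]
selected = greedy_select(vectors, weights, k=2, threshold=0.5)
print(selected)  # [1, 2]
```

Index 1 is taken first because it has the highest weight; index 0 is nearly parallel to it and is skipped; index 2 points in a different direction and is kept.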

In [None]:
#https://pypi.org/project/hybridtfidf/#description

import nltk  # needed for nltk.word_tokenize below
import numpy as np
from hybridtfidf import HybridTfidf
from sklearn.feature_extraction.text import TfidfVectorizer
        
def create_glove_embeddings(reviews, review_sent = False):
    reviews_embeddings = []
    
    for review_str in reviews:
        review_words = review_str
        if review_sent:
            review_words = nltk.word_tokenize(review_str)
            
        # get embeddings for each word
        review_words_embedding = []
        for word in review_words:
            if word in glove_model.key_to_index: # glove_model is initialized in lib file
                review_words_embedding.append(glove_model[word])
            else:
                review_words_embedding.append(np.zeros(300))
        review_embedding = np.mean(review_words_embedding, axis=0)
        reviews_embeddings.append(review_embedding)

    # Normalize the review embeddings
    reviews_embeddings = np.array(reviews_embeddings)
    reviews_embeddings /= np.linalg.norm(reviews_embeddings, axis=1).reshape(-1, 1)
    return reviews_embeddings
    

def library_hybrid_tfidf_summary(uuids, raw_reviews, raw_sent, lemma_reviews, output_file):
    hybridtfidf = HybridTfidf(threshold=7) 
    hybridtfidf.fit(lemma_reviews)

    # The threshold value affects how strongly the algorithm biases towards longer documents
    # A higher threshold will make longer documents have a higher post weight

    # list of list (inner list is the hybrid tfidf vector for the document)
    # use glove vectors at document level
    document_vectors = hybridtfidf.transform(lemma_reviews) 
    document_weights = hybridtfidf.transform_to_weights(lemma_reviews) # returns a list of hybrid tfidf scores for each document

    review_embeddings = create_glove_embeddings(lemma_reviews, True) # each document is a string, so better tokenize
    sentence_embeddings = np.array(review_embeddings)
    
    most_significant = select_salient_documents(sentence_embeddings, document_weights, k = 50, similarity_threshold = 0.5)

    dissimilar_reviews = []
    for i in most_significant:
        dissimilar_reviews.append([i, uuids[i], raw_reviews[i], raw_sent[i], lemma_reviews[i]])       

    
    df = pd.DataFrame(dissimilar_reviews, columns = ["index", "uuid", "review", "sent",  "lemmatized"])
    df.to_csv(output_file, header=True, index=False)
    return df
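
`create_glove_embeddings` averages word vectors and then L2-normalizes each review vector. Two edge cases matter for that scheme: a review with no tokens, and a review whose tokens are all out of vocabulary, which gives a zero norm. The sketch below illustrates one way to guard both, using plain lists and made-up 2-d vectors instead of the 300-d GloVe model. `mean_pool` and `l2_normalize` are hypothetical helpers, not part of the notebook's library file.

```python
import math


def mean_pool(tokens, word_vectors, dim):
    """Average token vectors; unknown tokens count as zero vectors, empty input gives zeros."""
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for token in tokens:
        vec = word_vectors.get(token, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(tokens) for t in total]


def l2_normalize(vec):
    """Scale to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


toy_vectors = {"app": [3.0, 0.0], "crash": [0.0, 4.0]}  # made-up 2-d vectors
pooled = mean_pool(["app", "crash", "zzz"], toy_vectors, dim=2)
unit = l2_normalize(pooled)
empty = l2_normalize(mean_pool([], toy_vectors, dim=2))
print(unit, empty)
```

Returning the zero vector unchanged keeps the output free of NaN values; downstream similarity code then has to decide how to treat a zero vector.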

In [None]:
for domain, apps in domains_apps.items():
    for app in apps:
        df = sents_df[f"{domain}_{app}"]
        uuids = df["uuid"].tolist()
        raw_reviews = df["review"].tolist()
        sent_reviews = df["sent"].tolist()
        lemma_sent_reviews = df["sent_lemma"].tolist()
        lemmatized_reviews = [re.sub(",", " ", item) for item in lemma_sent_reviews] # commas to spaces; list of strings
        
        output_file = f"./data/summaries/{domain}_{app}_tfidf.csv"
        summary_df = library_hybrid_tfidf_summary(uuids, raw_reviews, sent_reviews, lemmatized_reviews, output_file)
        # key by domain and app so the second app in a domain does not overwrite the first
        domains_tfidf_summary[f"{domain}_{app}"] = summary_df

## Generating Summaries using CoD_r prompting on Gemini 1.5 Flash model

### Load libs and env variables

In [None]:
from dotenv import load_dotenv
import os, re, pandas as pd, time, json
from functools import reduce
import google.generativeai as genai

In [None]:
load_dotenv(".env")

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
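# report only whether the key loaded; echoing it would store the secret in the notebook output
print("GEMINI_API_KEY loaded:", bool(GEMINI_API_KEY))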

genai.configure(api_key=GEMINI_API_KEY)

In [None]:
codr_prompt = ""

with open("./data/prompts/codr.txt", "r") as infile:
    codr_prompt = infile.read()

def get_reviews():
    return """<REVIEWS>
{{reviews}}
</REVIEWS>
"""

### Load reviews for summarization into a single DF 
- use this DF to generate summaries using llama and gemini models

In [None]:
apps_reviews_dir = "./data/reviews/sampled for summarization"
apps_reviews_df = {}

for app_file in [item for item in os.listdir(apps_reviews_dir) if ".csv" in item]:
    app_file_path = f"{apps_reviews_dir}/{app_file}"
    df = pd.read_csv(app_file_path)
    appname = app_file.replace(".csv", "")
    apps_reviews_df[appname] = df

### Create API function for summary generation

In [None]:
def generate_gemini_summary(app, reviews, params, prompt_template, output_file, simply_return=False):
    system_prompt = "You are an experienced summarizer. You will generate summaries that people can understand easily. Reviews will be provided within XML tags (<REVIEWS> and </REVIEWS>)."

    task_prompt = prompt_template

    # add params in the task prompt
    task_prompt = re.sub("{{iterations}}", params["iterations"], task_prompt)
    task_prompt = re.sub("{{summary_length}}", params["summary_length"], task_prompt)
    task_prompt = re.sub("{{num_missing_entities}}", params["num_missing_entities"], task_prompt)
    task_prompt = re.sub("{{app}}", app, task_prompt)
    task_prompt = re.sub("\"\"", "\"", task_prompt)
    
    # init model instance
    model_config = genai.GenerationConfig(temperature=params["temperature"],
                                      top_p=params["top_p"],
                                      max_output_tokens = params["max_tokens"],
                                      candidate_count=params["num_output"],
                                      presence_penalty=params["presence_penalty"],
                                      frequency_penalty=params["frequency_penalty"],
                                      response_mime_type = "application/json")
    print("model_config: ", model_config)

    model = genai.GenerativeModel(params["model"], 
                                  system_instruction=system_prompt,
                                  generation_config=model_config)
    
    # count tokens of prompts
    system_prompt_tokens = model.count_tokens(system_prompt)
    task_prompt_tokens = model.count_tokens(task_prompt)

    # WARNING: append "reviews" template to the end of the task_prompt
    reviews_template  = get_reviews()
    task_prompt = f"{task_prompt}\n\n{reviews_template}" 
    task_prompt = task_prompt.replace("{{reviews}}", reviews)
    task_prompt = re.sub('\n+', '\n', task_prompt)
    task_prompt = re.sub('\t+', '\t', task_prompt)
    
    print("[TASK PROMPT]:\n", task_prompt)

    # count tokens of app reviews only
    reviews_tokens = model.count_tokens(reviews)
    
    print("[TOKENS]:\t task prompt: ", task_prompt_tokens , 
          "\n\t reviews: ", reviews_tokens,
          "\n\t system prompt: ", system_prompt_tokens)
    

    if simply_return:
        return task_prompt

    # call model API
    response = model.generate_content(task_prompt)
    print("[RESPONSE]:\n", response)

    with open(output_file, "w") as outfile:
        outfile.write(response.text)
        print("....saved to file: ", output_file, "....")

    return task_prompt

### Call Gemini API function

In [None]:
params = {"iterations": "5", "summary_length": "120", 
           "top_p": 0.5, "temperature": 0.5, 
           "num_output": 1, "max_tokens": 128000,
           "presence_penalty": 0.1, "frequency_penalty": 0.1,
          "num_missing_entities": "2-4", "model": "gemini-1.5-flash"}

In [None]:
# All text summaries are checked for JSON format and once verified, text files are converted to .json files

_outdir = "./data/summaries/gemini-summaries"
prompt_text = codr_prompt  # CoD_R template loaded from ./data/prompts/codr.txt above

for appname, df in apps_reviews_df.items():
    reviews_only = " ".join(df["review"].tolist())
    output_file_prefix = f"{_outdir}/{appname}"

    output_file = f"{output_file_prefix}.txt"
    generate_gemini_summary(appname, reviews_only, params, prompt_text, output_file, False)
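
The comment at the top of this cell says each saved text file is checked for valid JSON before being converted to a `.json` file. A minimal sketch of that check follows. `convert_if_valid_json` is a hypothetical helper, the file names are examples, and the demo writes to a temporary directory rather than `./data/summaries`.

```python
import json
import tempfile
from pathlib import Path


def convert_if_valid_json(txt_path):
    """Write a .json sibling of txt_path when its content parses as JSON.

    Returns the new path, or None when the text is not valid JSON.
    """
    txt_path = Path(txt_path)
    text = txt_path.read_text(encoding="utf-8").strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    json_path = txt_path.with_suffix(".json")
    json_path.write_text(json.dumps(data, indent=4), encoding="utf-8")
    return json_path


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "ridehailing_uber.txt"
    good.write_text('[{"Denser_Summary": "ok"}]', encoding="utf-8")
    bad = Path(tmp) / "dating_tinder.txt"
    bad.write_text("Here is the summary: ...", encoding="utf-8")
    print(convert_if_valid_json(good).name)  # ridehailing_uber.json
    print(convert_if_valid_json(bad))        # None
```

Files that return `None` are the ones to inspect by hand, for example when the model added prose around the JSON.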
    

## Generate Summaries using CoD_r prompting on Llama 3.1 70B Instruct Model

### Import Libs and load environment variables

In [None]:
from botocore.config import Config
import json
import boto3
from botocore.exceptions import ClientError

In [None]:
aws_access_key = os.getenv("AWS_ACCESS_KEY")
aws_secret_key = os.getenv("AWS_SECRET_KEY")
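# report only whether the credentials loaded; echoing them would store secrets in the notebook output
print("AWS credentials loaded:", bool(aws_access_key and aws_secret_key))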

model_id = "us.meta.llama3-1-70b-instruct-v1:0"
config = Config(read_timeout=1000)
region_name="us-east-1"

aws_client = boto3.client("bedrock-runtime", 
                        aws_access_key_id=aws_access_key,
                        aws_secret_access_key=aws_secret_key,
                        region_name=region_name,
                        config=config)

### Create Llama API function for summary generation

In [None]:

def generate_aws_llama_summary(app, reviews, params, prompt_template, output_file, simply_return=False):
    system_prompt = "You are an experienced summarizer. You will generate summaries that people can understand easily. Reviews will be provided within XML tags (<REVIEWS> and </REVIEWS>)."

    task_prompt = prompt_template

    # add params in the task prompt
    task_prompt = re.sub("{{iterations}}", params["iterations"], task_prompt)
    task_prompt = re.sub("{{summary_length}}", params["summary_length"], task_prompt)
    task_prompt = re.sub("{{num_missing_entities}}", params["num_missing_entities"], task_prompt)
    task_prompt = re.sub("{{app}}", app, task_prompt)
    task_prompt = re.sub("\"\"", "\"", task_prompt)
    
    # append reviews in the task prompt
    # WARNING: append "reviews" template to the end of the task_prompt
    reviews_template = get_reviews()
    task_prompt = f"{system_prompt}\n{task_prompt}\n{reviews_template}" 
    task_prompt = task_prompt.replace("{{reviews}}", reviews)
    task_prompt = re.sub('\n+', '\n', task_prompt)
    task_prompt = re.sub('\t+', '\t', task_prompt)

    messages = [
        {
            "role": "user",
            "content": [{"text": task_prompt}],
        }
    ]

    print("[TASK PROMPT]:\n", task_prompt)
    print("[MESSAGES]:\n", messages)

    if simply_return:
        return task_prompt


    try:
        # https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InferenceConfiguration.html
        # call model API

        model_config = {"maxTokens": params["max_tokens"], 
                             "temperature": params["temperature"], 
                             "topP": params["top_p"]}
        response = aws_client.converse(
            modelId=model_id,
            messages=messages,
            inferenceConfig=model_config)

        # Extract the response text; the full response object is printed for inspection.
        response_text = response["output"]["message"]["content"][0]["text"]
        print("[RESPONSE]:\n", response)

        with open(output_file, "w") as outfile:
            outfile.write(response_text)
            print("....saved to file: ", output_file, "....")

        output_file_2 = output_file.replace(".txt", ".json")
        with open(output_file_2, "w") as outfile_2:
            # outfile_2.write(response)
            json.dump(response, outfile_2, indent=4) 
            print("....saved to json file: ", output_file_2, "....")

    except (ClientError, Exception) as e:
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        

    return task_prompt

### Call the Llama API function

In [None]:
params = {"iterations": "5", "summary_length": "120", 
           "top_p": 0.5, "temperature": 0.5, 
           "num_output": 1, "max_tokens": 8192, #128K context length including only 8K output tokens
           "presence_penalty": 0.1, "frequency_penalty": 0.1,
          "num_missing_entities": "2-4"}

In [None]:
# This API offers on-demand service and might take 3-5 minutes to get response from the model.

prompt_text = codr_prompt
prompt = "codr"
output_summ_dir = "./data/summaries/llama-summaries"

for appname, df in apps_reviews_df.items():
    reviews_only = " ".join(df["review"].tolist())
    output_file_prefix = f"{output_summ_dir}/{prompt}/{appname}"
    output_file = f"{output_file_prefix}.txt"
    generate_aws_llama_summary(appname, reviews_only, params, prompt_text, output_file, False)
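
Because responses can take minutes, a slow or throttled request may fail before it completes. One common mitigation is to retry with exponential backoff. The sketch below is a generic helper under stated assumptions: `call_with_retry` is a hypothetical name, the failing function is simulated, and nothing here calls AWS.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt and retry.

    The exception from the final attempt is re-raised.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


calls = {"n": 0}


def flaky():
    """Simulated request that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated slow response")
    return "summary text"


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky, attempts=3, base_delay=1.0, sleep=delays.append)
print(result, delays)  # summary text [1.0, 2.0]
```

Note that `generate_aws_llama_summary` catches exceptions internally, so a wrapper like this would have to go around the `aws_client.converse` call itself to have any effect.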