### Generate Rate Features from extractive summaries (obtained using Hybrid TF-IDF sentence scoring)

- Select the **top 10 summary sentences** in each domain
- Prepare prompt texts: **two-to-three-word labels, adjective labels, and Non-Functional Requirements**
- Set parameters: temperature, top_p, max_tokens, number of outputs, prompt, and frequency and presence penalties
- Feed the prompt and the parameters to the **GPT-3.5 Turbo** model using the OpenAI API
- Parse the output generated by the GPT model to extract labels and their descriptions
- Compute the frequency of each unique label and select the top five labels as the rate features in each domain
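
The last two steps (parsing and frequency counting) are not shown in the cells below, so here is a minimal sketch of how they might look. It assumes the model follows the requested `{label}: {description}` format, one label per line; the helper names `parse_labels` and `top_labels`, the regular expression, and the sample responses are all hypothetical and only for illustration.

```python
import re
from collections import Counter

# Matches "Label: description", tolerating a leading bullet or "1." numbering.
# The exact pattern is an assumption about how the model formats its lines.
LINE_RE = re.compile(
    r"^\s*(?:[-*•]|\d+[.)])?\s*(?P<label>[^:]+?)\s*:\s*(?P<desc>.+?)\s*$"
)

def parse_labels(response_text):
    """Return (label, description) pairs found in one model response."""
    pairs = []
    for line in response_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            # Lowercase labels so "Pricing Issues" and "pricing issues" count together.
            pairs.append((match.group("label").lower(), match.group("desc")))
    return pairs

def top_labels(responses, k=5):
    """Count labels across all responses and return the k most frequent."""
    counts = Counter(
        label for text in responses for label, _ in parse_labels(text)
    )
    return counts.most_common(k)

# Made-up responses, only to show the expected input shape.
sample = [
    "Pricing Issues: fares higher than expected\nDriver Behavior: rude drivers",
    "1. Pricing Issues: surge pricing complaints\n2. App Crashes: app freezes at checkout",
]
print(top_labels(sample, k=2))
```

Exact string matching treats near-duplicates such as "pricing" and "pricing issues" as different labels, so some manual or embedding-based merging may still be needed before picking the top five.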

**NOTE: To use OpenAI GPT models, you need an authorization token. You can find details in the following links.**
- [API reference](https://platform.openai.com/docs/api-reference)
- [Pricing](https://openai.com/pricing)
- [Models](https://platform.openai.com/docs/models/overview)
- Guides for Prompt Design
    - [OpenAI Guide](https://platform.openai.com/docs/guides/completion/prompt-design)
    - [OpenAI Articles](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)
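
Rather than pasting the token into the notebook, one option is to read it from an environment variable. The sketch below assumes the key is stored under `OPENAI_API_KEY`; the helper `load_api_key` is a hypothetical name, not part of the OpenAI library.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before calling the API."
        )
    return key
```

The result could then be assigned with `openai.api_key = load_api_key()` in place of a hard-coded string, which keeps the key out of saved notebook outputs and version control.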

In [2]:
import os, openai, re, pandas as pd
from functools import reduce

**Load review data**

In [8]:
# load summary reviews - sentence level
summ_dir = ["hybrid-tfidf-summary"]
domains = ["ride", "investing", "health"]

tfidf_summ_reviews = {}
for domain in domains:
    reviews = pd.read_csv(summ_dir[0] +  "/" + domain + "_summary.csv")
    tfidf_summ_reviews[domain] = reviews
tfidf_summ_reviews["ride"].head()

current dir:  /content/drive/MyDrive/Colab Notebooks/rate-features-generation ['hybrid-tfidf-summary']


Unnamed: 0,doc_id,score,raw,sent,lemmatized
0,27783,1.205823,i am afairlynew lyftriderand iheardheard that ...,i am afairlynew lyftriderand iheardheard that ...,"competition,lyft,find,true,ber,seem,much,well,..."
1,26214,1.180414,usually cheaper than uber not cheaper than a c...,usually cheaper than uber not cheaper than a c...,"usually,cheaper,uber,cheaper,cab,reliable,usua..."
2,28885,1.076704,well it's been a roller coaster riding with ly...,well it's been a roller coaster riding with ly...,"well,roller,coaster,rid,lyft,start,use,lyft,we..."
3,292,0.874847,better & cheaper than train/bus/taxi. some dri...,also will it would be nice if there was an opt...,"also,would,nice,option,change,time,offer,money..."
4,62785,0.802241,i hate via they take way to long they take you...,i hate via they take way to long they take you...,"hate,via,take,way,long,take,money,paid,via,pas..."


In [9]:
tfidf_summ_reviews["health"].head()

Unnamed: 0,doc_id,score,raw,sent,lemmatized
0,25004,0.782249,had to go back & forth with customer service t...,had to go back & forth with customer service t...,"back,forth,customer,service,get,price,two,frie..."
1,63148,0.705859,just downloaded rootd today & i think itz such...,just downloaded rootd today & i think itz such...,"download,rootd,today,think,cute,lil,app,help,m..."
2,40152,0.580661,this app is gawd awful!! i accidentally downlo...,i accidentally downloaded it(via my niece) and...,"accidentally,download,via,niece,get,update,say..."
3,26196,0.440951,this app is for anxiety and to help people who...,this app is for anxiety and to help people who...,"app,anxiety,help,people,trouble,fall,asleep,be..."
4,4375,0.400531,i've been using betterhelp for maybe six month...,i've been using betterhelp for maybe six month...,"use,betterhelp,maybe,six,month,student,discoun..."


In [10]:
tfidf_summ_reviews["investing"].head()

Unnamed: 0,doc_id,score,raw,sent,lemmatized
0,37683,0.806133,m1 vs webull hands down m1 is better then webu...,webull has a stock lending program allows you ...,"webull,stock,lending,program,allows,loan,share..."
1,33941,0.743765,i am a disabled american man that was excited ...,i am a disabled american man that was excited ...,"disabled,american,man,excite,opening,account,p..."
2,26125,0.624518,i gave one star because had to do something to...,i gave one star because had to do something to...,"give,one,star,something,proceed,download,app,p..."
3,26484,0.513329,nothing works on apple since march 2020 - not ...,an issue with broken watch lists on apple ipad...,"issue,broken,watch,list,apple,ipad,fidelity,ap..."
4,213744,0.456688,if i can give this no stars i wouldthis is the...,if i can give this no stars i wouldthis is the...,"give,star,bad,place,crypto,first,take,money,sa..."


In [None]:
import itertools

openai.api_key = "" # paste your OpenAI Auth key here
num_reviews = 10
gpt_label_dir = "gpt_labels/"
os.makedirs(gpt_label_dir, exist_ok=True)  # to_csv fails if the output directory is missing
summ_dir = ["hybrid-tfidf-summary"]
domains = ["ride", "investing", "health"]
gpt_embed_raw = {"ride": {}, "health": {}, "investing": {}}
gpt_embed_vector = {"ride": {}, "health": {}, "investing": {}}
domain_keys = {"ride": "ride-hailing", "health": "mental health", "investing": "investing"}

def get_labels_from_summ_reviews(top_summ, params, output_file):
    result = []
    print("params: ", params)
    for summ in top_summ:
        # Wrap the review in matching triple-quote delimiters
        prompt = params["prefix"] + "Review: \"\"\"\n" + summ + "\n\"\"\""
        print("\nprompt:\n", prompt)
        # Legacy Chat Completions interface; requires openai<1.0
        response = openai.ChatCompletion.create(
                        model="gpt-3.5-turbo",
                        messages=[{"role": "user", "content": prompt}],
                        max_tokens=params["max_tokens"],
                        n=1,  # number of completions generated per prompt
                        stop=None,
                        top_p=params["top_p"],
                        frequency_penalty=params["frequency_penalty"],
                        presence_penalty=params["presence_penalty"],
                        temperature=params["temperature"])
        labels = response.choices[0].message.content.strip().split("\n")
        print("labels: ", labels)
        result.append((params, summ, prompt, response.choices, labels))
    
    df = pd.DataFrame(result, columns=["params", "input",  "prompt", "generator choices", "labels"])
    df.to_csv(output_file, index=False, header=True)
    print("\n-------------- saved result to ", output_file, "------------\n")

    
def get_params(domain):

    grid_params = {"temperature": [0.1],
            "top_p": [0.1],
            "max_tokens": [150], 
            "presence_penalty": [0.1], 
            "frequency_penalty": [0.1],
            "prefix": [
               "\nGenerate minimum of five labels (two to three words) with very concise descriptions that best summarize the following "  + domain + " app review.\nDesired format: {label}: {description}\n",
                "\nGenerate minimum of five adjective labels with very concise descriptions that best summarize the following "  + domain + " app review.\nDesired format: {label}: {description}\n",
                "\nGenerate minimum of five labels for Non Functional Requirements with very concise descriptions that summarize the following " + domain + " app review.\nDesired format: {label}: {description}\n",
                ]
            }

    # Generate all possible combinations of parameters
    param_combinations = list(itertools.product(
        grid_params["temperature"], 
        grid_params["top_p"], 
        grid_params["max_tokens"], 
        grid_params["presence_penalty"], 
        grid_params["frequency_penalty"], 
        grid_params["prefix"]
    ))

    # Convert each combination tuple into a parameter dict for the API call
    combinations = []
    for params in param_combinations:
        temperature, top_p, max_tokens, presence_penalty, frequency_penalty, prefix = params
        item = {"prefix": prefix, "top_p": top_p, 
            "temperature": temperature, "max_tokens": max_tokens,
            "presence_penalty": presence_penalty, "frequency_penalty": frequency_penalty}
        combinations.append(item)
    print("combinations: ", len(combinations))
    return combinations


print("---------------- raw sentences --------------------")


for domain in domains:
    print("\n\n------------domain: ", domain, "------------------")
    params = get_params(domain_keys[domain])
    tmp = tfidf_summ_reviews[domain]
    reviews = tmp["sent"].tolist()[:num_reviews]
    for index, param in enumerate(params):
        output_file = gpt_label_dir + domain + "_" + str(index + 6) + "_gpt_labels.csv"
        print("output_file:" , output_file)
        get_labels_from_summ_reviews(reviews, param, output_file)