## Prompting GPT to choose the best label from an industry-standard topic list
This notebook reads a standardized content taxonomy list used by the news industry and prompts GPT to pick the best-fitting label for each BERTopic group.
<br>Author: Dingyuan Xu dyxu@bu.edu

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
from bertopic import BERTopic
import json
import openai
from transformers import pipeline


In [5]:
df = pd.read_csv('../datasets/Content_Taxonomy.csv', skiprows=5, usecols=range(8))
df.columns = df.iloc[0]
df = df.tail(-1)
df.head(10)

Unnamed: 0,Unique ID,Parent,Name,Tier 1,Tier 2,Tier 3,Tier 4,NaN
1,150,150,Attractions,Attractions,,,,
2,151,150,Amusement and Theme Parks,Attractions,Amusement and Theme Parks,,,
3,179,150,Bars & Restaurants,Attractions,Bars & Restaurants,,,
4,181,150,Casinos & Gambling,Attractions,Casinos & Gambling,,,
5,153,150,Historic Site and Landmark Tours,Attractions,Historic Site and Landmark Tours,,,
6,154,150,Malls & Shopping Centers,Attractions,Malls & Shopping Centers,,,
7,155,150,Museums & Galleries,Attractions,Museums & Galleries,,,
8,158,150,Nightclubs,Attractions,Nightclubs,,,
9,159,150,Outdoor Attractions,Attractions,Outdoor Activities,,,
10,160,150,Parks,Attractions,Parks & Nature,,,


In [6]:
print(f'Topics df has {len(df)} rows.')
print(df['Tier 1'].value_counts())

Topics df has 943 rows.
Tier 1
Business and Finance                   71
Sports                                 69
Style & Fashion                        44
Technology & Computing                 44
Automotive                             41
Medical Health                         38
Hobbies & Interests                    35
Music                                  33
Personal Finance                       31
Genres                                 30
Travel                                 27
Video Gaming                           24
Education                              18
Healthy Living                         16
Family and Relationships               15
Food & Drink                           13
Home & Garden                          12
Real Estate                            12
Attractions                            12
Sensitive Topics                       12
Shopping                               11
Personal Celebrations & Life Events    11
Religion & Spirituality                11
Pet

In [4]:
df['Tier 2'].value_counts()

Tier 2
                               40
Computing                      33
Industries                     33
Diseases and Conditions        32
Business                       27
                               ..
Cigars                          1
Birdwatching                    1
Beekeeping                      1
Antiquing and Antiques          1
Religious (Music and Audio)     1
Name: count, Length: 348, dtype: int64

In [5]:
df['Tier 3'].value_counts()

Tier 3
                                      387
Computer Software and Applications     14
Internet                               11
Business Banking & Finance              9
Women's Clothing                        7
                                     ... 
Internet Safety                         1
Parenting Babies and Toddlers           1
Parenting Children Aged 4-11            1
Parenting Teens                         1
Strategy Video Games                    1
Name: count, Length: 257, dtype: int64

In [7]:
taxonomy_list = []
tier_3_list = []
for index, row in df.iterrows():
    if not pd.isnull(row['Tier 3']) and row['Tier 3'] != ' ':
        tier_2_label = row['Tier 2']
        tier_3_label = row['Tier 3']
        tier_3_list.append(f'{tier_2_label} - {tier_3_label}')
    elif not pd.isnull(row['Tier 2']) and row['Tier 2'] != ' ':
        taxonomy_list.append(row['Tier 2'])

print('Tier 3 list has ' + str(len(list(set(tier_3_list)))) + ' topics.')
print('Tier 2 list has ' + str(len(list(set(taxonomy_list)))) + ' topics.')

Tier 3 list has 256 topics.
Tier 2 list has 347 topics.


In [93]:
print(taxonomy_list)

['Amusement and Theme Parks', 'Bars & Restaurants', 'Casinos & Gambling', 'Historic Site and Landmark Tours', 'Malls & Shopping Centers', 'Museums & Galleries', 'Nightclubs', 'Outdoor Activities', 'Parks & Nature', 'Theater Venues', 'Zoos & Aquariums', 'Auto Body Styles', 'Auto Buying and Selling', 'Auto Insurance', 'Auto Parts', 'Auto Recalls', 'Auto Rentals', 'Auto Repair', 'Auto Safety', 'Auto Shows', 'Auto Technology', 'Auto Type', 'Car Culture', 'Dash Cam Videos', 'Motorcycles', 'Road-Side Assistance', 'Scooters', 'Art and Photography', 'Comics and Graphic Novels', 'Fiction', 'Poetry', 'Business', 'Economy', 'Industries', 'Apprenticeships', 'Career Advice', 'Career Planning', 'Job Search', 'Remote Working', 'Vocational Training', 'Adult Education', 'College Education', 'Early Childhood Education', 'Educational Assessment', 'Homeschooling', 'Homework and Study', 'Language Learning', 'Online Education', 'Primary Education', 'Private School', 'Secondary Education', 'Special Education

In [6]:
# load saved BERTopic Model, generate topics based on random 100k sample of the BostonGlobe (lexisnexis) data
topic_model = BERTopic.load("../models/bglobe_519_body_350") # for the scc: /projectnb/sparkgrp/dyxu/models/bglobe_519_body_non_stochastic

topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,58266,-1_schools_case_information_women,"[schools, case, information, women, public, ch...","[$400,000, but the natural gas savings alone w..."
1,0,14312,0_downtime_blah_salute_ok,"[downtime, blah, salute, ok, boy, nice, pretty...","[, , ]"
2,1,6280,1_song_symphony_opera_blues,"[song, symphony, opera, blues, musicians, ball...","[Haynes is humble about his role. ""Music is ve..."
3,2,5669,2_inning_yankees_red_francona,"[inning, yankees, red, francona, runs, homer, ...",[BODY Latos (0-1) was ineffective in his third...
4,3,4659,3_candidates_primary_romney_gop,"[candidates, primary, romney, gop, weld, dole,...","[Here is that list, sorted by party and alphab..."
...,...,...,...,...,...
226,225,53,225_farms_csa_burpee_hirshberg,"[farms, csa, burpee, hirshberg, heronswood, sp...","[And in winter ""most people won't drive that e..."
227,226,53,226_crone_olympic_peterson_mantha,"[crone, olympic, peterson, mantha, lithuanians...",[In an array of expressions meant to deke any ...
228,227,52,227_saleh_yemeni_rabbo_taiz,"[saleh, yemeni, rabbo, taiz, marib, militants,...",[The comments by Arab leaders including Yemeni...
229,228,51,228_detainees_waterboarding_conventions_geneva,"[detainees, waterboarding, conventions, geneva...",[is being subjected to these kinds of techniqu...


In [7]:
list(topic_model.get_topic_info(1)['Representation'])

[['song',
  'symphony',
  'opera',
  'blues',
  'musicians',
  'ballet',
  'pop',
  'www',
  'bso',
  'festival']]

In [8]:
# get the bag of words (keywords) for each topic from the bertopic model
topics_df = topic_model.get_topic_info()
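word_bags = [""]  # placeholder for the outlier topic (-1) in row 0
for i in range(len(topics_df)):
    temp = topic_model.get_topic(i)
    try:
        words = [x for x,_ in temp]
        word_bags.append(words)
    except TypeError:
        # get_topic returns False for a topic id that does not exist
        pass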

topics_df["Keywords"] = word_bags

In [9]:
# use openai api to create a label for the bag of words from bertopic
openai.api_key = ""  # GitHub version does not include the OpenAI API key; replace with your own key

# exponential backoff, because requests kept hitting RateLimitError
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)
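
To make the retry behaviour concrete, here is a minimal standard-library sketch of exponential backoff with random jitter. It only illustrates the idea and is not how `tenacity` is implemented; `call_with_backoff` and `flaky` are hypothetical names introduced for this example, and the wait schedule is an assumption.

```python
import random
import time


def call_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Call func(); on failure, wait a random, exponentially growing time and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:  # a real script would catch only the rate-limit error
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The upper bound doubles with each attempt and is capped at max_wait
            upper = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, upper))


# Hypothetical flaky call that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


# Pass a no-op sleep so the demonstration does not actually wait
result = call_with_backoff(flaky, sleep=lambda seconds: None)
```

The random component spreads retries out, so several clients that were rate limited at the same moment are less likely to retry at the same moment.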

In [10]:
# get labels for every BERTopic bag of words, five topics per request
# the range starts at row 1 to skip the outlier topic (-1) in row 0
openai_label = []
for i in range(1, len(topics_df), 5):
    formatted_keywords = ""
    labels_per_response = []
    for j in range(5):
        labels = topics_df['Keywords'][min(i+j, len(topics_df)-1)]
        labels_per_response.append(labels)
        formatted_keywords += str(j+1) + ". " + ", ".join(labels[:-1]) + ", and " + labels[-1] + ".\n"
    
    # print(formatted_keywords)
    text_input = "Please pick the best topics for these five groups of keywords: " + formatted_keywords + " You must choose a topic for each from the provided list. "
    text_input += "Provide your response in the format of [topic1, topic2, topic3, topic4, topic5] please. If you were not able to find an appropriate topic, leave a blank as the corresponding list element."
    text_input += "\n" 

    # using gpt-3.5-turbo: $0.002/1,000 tokens
    # each request is ~150 tokens (including the response from openai)
    background_prompt = [{"role": "system", "content": "You are given a group of news articles represented by a bag of words. " + 
                          "Your task is to find the most suitable topic to describe this group of articles, where the list of topics are provided separately. " +
                          "Remember, you must choose the most suitable topic from the provided list to summarize the bag of words. Here's the list of topics: " + str(taxonomy_list)},
                         {"role": "user", "content": "Please pick a topic for these words: celtics,points,inning,season,red,win,rebounds,coach,play,second."},
                         {"role": "assistant", "content": "[basketball]"}]


    prompt = background_prompt + [{"role": "user", "content": text_input}]

    response = completion_with_backoff(model="gpt-3.5-turbo",
                                       temperature=0.7, 
                                       max_tokens=35,
                                       messages=prompt)
    
    openai_label.append({"bow": labels_per_response, 
                         "openai": str(response['choices'][0]['message']['content'])})
    # i takes the values 1, 6, 11, ..., so i + 4 is the number of topics prompted so far
    if i + 4 in [50, 100, 150, 200]:
        print(f'{i + 4} topics have been prompted to GPT.')
#     with open('../openai_label_file/openai_label_from_taxonomy.json', 'r+') as f:
#         # load existing data 
#         file_data = json.load(f)
#         file_data[int(topics_df['Topic'][i])] = {"Name": topics_df['Name'][i],
#                                               "OpenAI_label": response['choices'][0]['message']['content'],
#                                               "OpenAI_metadata": response}
    
#         f.seek(0)
#         # convert back to json
#         json.dump(file_data, f, indent=4)

1. downtime, blah, salute, ok, boy, nice, pretty, saw, , , , , , , and .
2. song, symphony, opera, blues, musicians, ballet, pop, www, bso, and festival.
3. inning, yankees, red, francona, runs, homer, hit, ortiz, fenway, and farrell.
4. candidates, primary, romney, gop, weld, dole, polls, senate, hampshire, and dukakis.
5. burr, min, pg, ty, actor, story, gilbert, drama, extras, and novels.

1. leaves, cemetery, daughter, served, mass, brother, bogl, retired, degree, and dr.
2. celtics, rivers, points, ainge, rebounds, rondo, quarter, mchale, guard, and pitino.
3. puck, defenseman, bergeron, chara, goals, krejci, bourque, period, sinden, and milbury.
4. brady, receiver, yard, parcells, dolphins, offensive, super, bowl, gronkowski, and pass.
5. mbta, rail, commuter, tolls, buses, dig, artery, bike, authority, and pike.

1. affordable, developers, apartments, 40b, properties, land, site, bra, planning, and tower.
2. cyrisse, jaffee, stimulation, thighs, strings, inner, amateur, falls, p
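
The first group printed above ends in a run of empty keywords (`, , , , , , and .`), and the `min(...)` index in the loop repeats the last topic to pad the final batch. A minimal sketch of one way to avoid both, assuming empty strings carry no information for the prompt; `batched` and `format_keyword_groups` are hypothetical helpers, not part of the notebook.

```python
def batched(items, size=5):
    """Yield consecutive chunks of at most `size` items; the last chunk may be shorter."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def format_keyword_groups(groups):
    """Render each keyword group as a numbered line such as '1. a, b, and c.'"""
    lines = []
    for number, words in enumerate(groups, start=1):
        words = [w for w in words if w.strip()]  # drop empty keywords
        if not words:
            body = "(no keywords)"
        elif len(words) == 1:
            body = words[0]
        else:
            body = ", ".join(words[:-1]) + ", and " + words[-1]
        lines.append(f"{number}. {body}.")
    return "\n".join(lines) + "\n"


example = [["song", "opera", "blues"], ["downtime", "blah", "", ""]]
print(format_keyword_groups(example))
```

If the final batch is shorter than five, the prompt wording ("these five groups", `[topic1, ..., topic5]`) would need to be adjusted to match the batch size.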

In [2]:
# Restore a list of dictionaries where each dict contains openai:label from GPT and bow:bag of words representation
restored_dict = []
for response in openai_label:
    # strip quotes and the surrounding brackets, then split "[a, b, c, d, e]" into labels
    openai_list = response['openai'].replace("'", "")[1:-1].split(', ')
    # zip stops at the shorter list, so a reply with fewer than five labels cannot raise IndexError
    for label, bow in zip(openai_list, response['bow']):
        restored_dict.append({"openai": label, "bow": bow})

NameError: name 'openai_label' is not defined
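
The parsing step assumes GPT always returns exactly five comma-separated labels inside brackets. Because `max_tokens` can cut a reply short, and because removing every apostrophe would also alter labels such as `Women's Clothing`, a more defensive parser may be useful. The sketch below is one possibility under the assumption that no label contains a comma; `parse_label_list` is a hypothetical helper.

```python
def parse_label_list(text, expected=5):
    """Parse a reply such as "[Music, Baseball, , Fiction]" into exactly `expected` labels."""
    cleaned = text.strip()
    # Remove the surrounding brackets if present (a truncated reply may lack the closing one)
    if cleaned.startswith("["):
        cleaned = cleaned[1:]
    if cleaned.endswith("]"):
        cleaned = cleaned[:-1]
    # Strip whitespace and wrapping quotes from each label, leaving inner apostrophes alone
    labels = [part.strip().strip("'\"") for part in cleaned.split(",")]
    labels = labels[:expected]                     # ignore any extra labels
    labels += [""] * (expected - len(labels))      # pad a short reply with blanks
    return labels


print(parse_label_list("[Music, Baseball, , Fiction]"))
```

Padding with empty strings keeps each label aligned with its bag of words, so a short reply shows up as blank labels instead of an `IndexError`.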

In [12]:
# Specify the output file path
output_file_path = "../openai_label_file/openai_label_from_taxonomy_structured_230.json"

# Write the list of dictionaries to a JSON file
with open(output_file_path, 'w') as json_file:
    json.dump(restored_dict, json_file, indent=4)

### After the initial prompting, scan the label list and identify all labels not present in the given taxonomy list, i.e. those made up by GPT

In [10]:
import csv
# Check how many labels are made up


output_file_path = "../openai_label_file/openai_label_from_taxonomy_structured_230.json"
with open(output_file_path, 'r') as json_file:
    restored_dict = json.load(json_file)
    
madeup_label_list = []
for response in restored_dict:
    if response['openai'] not in taxonomy_list:
        madeup_label_list.append(response)
  
print(f"{len(madeup_label_list)} out of {len(restored_dict)} topics are made up by GPT.")

field_names = ['openai', 'bow']
with open('../output/madeup_labels.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=field_names)
    writer.writeheader()
    writer.writerows(madeup_label_list)


156 out of 230 topics are made up by GPT.
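
The check above is an exact string match, so a label that differs from a taxonomy entry only in case or spacing is counted as made up (the few-shot example in the prompt itself answers in lowercase). A minimal sketch of a more lenient comparison, assuming case and whitespace differences should be ignored; `normalize_label` and `split_madeup` are hypothetical helpers, and the sample data is illustrative only.

```python
def normalize_label(label):
    """Lowercase a label and collapse runs of whitespace to single spaces."""
    return " ".join(label.lower().split())


def split_madeup(records, taxonomy):
    """Split records into (matched, madeup) by comparing normalized labels."""
    known = {normalize_label(topic) for topic in taxonomy}
    matched, madeup = [], []
    for record in records:
        if normalize_label(record["openai"]) in known:
            matched.append(record)
        else:
            madeup.append(record)
    return matched, madeup


# Illustrative data only, not taken from the notebook's results
sample_taxonomy = ["Basketball", "Parks & Nature"]
sample_records = [
    {"openai": "basketball", "bow": ["celtics", "points"]},
    {"openai": "Obituaries", "bow": ["cemetery", "retired"]},
    {"openai": " Parks &  Nature ", "bow": ["trail", "park"]},
]
matched, madeup = split_madeup(sample_records, sample_taxonomy)
print(f"{len(madeup)} out of {len(sample_records)} labels are not in the taxonomy.")
```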


In [104]:
# re-prompt GPT for the made-up labels, five at a time, offering the "Tier 2 - Tier 3" list instead
openai_label_t3 = []
for i in range(0,len(madeup_label_list),5):
    formatted_keywords = ""
    labels_per_response = []
    for j in range(5):
        labels = madeup_label_list[min(i+j, len(madeup_label_list)-1)]['bow']
        labels_per_response.append(labels)  # keep the keywords alongside GPT's answer
        formatted_keywords += str(j+1) + ". " + ", ".join(labels[:-1]) + ", and " + labels[-1] + ".\n"
    
    print(formatted_keywords)
    text_input = "Please pick the best topics for these five groups of keywords: " + formatted_keywords + " You must choose a topic for each from the provided list. "
    text_input += "Provide your response in the format of [topic1, topic2, topic3, topic4, topic5] please. If you were not able to find an appropriate topic, leave a blank as the corresponding list element."
    text_input += "\n" 

    # using gpt-3.5-turbo: $0.002/1,000 tokens
    # each request is ~150 tokens (including the response from openai)
    background_prompt = [{"role": "system", "content": "You are given a group of news articles represented by a bag of words. " + 
                          "Your task is to find the most suitable topic to describe this group of articles, where the list of topics are provided separately. " +
                          "Remember, you must choose the most suitable topic from the provided list to summarize the bag of words. Here's the list of topics: " + str(tier_3_list)},
                         {"role": "user", "content": "Please pick a topic for these words: celtics,points,inning,season,red,win,rebounds,coach,play,second."},
                         {"role": "assistant", "content": "[basketball]"}]


    prompt = background_prompt + [{"role": "user", "content": text_input}]

    response = completion_with_backoff(model="gpt-3.5-turbo",
                                       temperature=0.7, 
                                       max_tokens=40,
                                       messages=prompt)
    
    openai_label_t3.append({"bow": labels_per_response, 
                         "openai": str(response['choices'][0]['message']['content'])})
    
    # save result in json file
    with open('../openai_label_file/openai_label_from_taxonomy_t3.json', 'r+') as f:
        # load existing data 
        file_data = json.load(f)
        file_data[int(topics_df['Topic'][i])] = {"Name": topics_df['Name'][i],
                                              "OpenAI_label": response['choices'][0]['message']['content'],
                                              "OpenAI_metadata": response}
    
        f.seek(0)
        # convert back to json
        json.dump(file_data, f, indent=4)

1. democratic, candidates, primary, gop, kerry, weld, senate, dole, hampshire, and vote.
2. grandchildren, leaves, mrs, cemetery, son, sisters, mass, degree, worked, and retired.
3. netanyahu, forces, afghanistan, militants, minister, peace, israelis, security, al, and killed.
4. bulger, tsarnaev, hernandez, victim, tamerlan, driver, charges, authorities, man, and hospital.
5. menu, fried, cream, shrimp, 95, beef, sweet, restaurants, vegetables, and fresh.

1. patients, cancer, fda, cells, genzyme, biotech, company, study, breast, and blood.
2. necee, liveright, regis, eryn, motions, detect, amateur, legs, 95, and press.
3. care, premiums, nurses, partners, uninsured, connector, hmos, plans, massachusetts, and cost.
4. mfa, works, exhibit, contemporary, galleries, collection, gardner, ica, installation, and abstract.
5. colleges, charter, umass, mcas, test, superintendent, loans, aid, budget, and scores.

1. cuts, welfare, fiscal, republicans, billion, house, weld, governor, plan, and 

In [None]:
# Prompt GPT again with tier2-tier3 labels, one topic per request
for response in madeup_label_list:
    labels = response["bow"]
    
    formatted_keywords = ", ".join(labels[:-1]) + ", and " + labels[-1]
    text_input = "Please pick the best topic for these keywords: " + formatted_keywords + ". You must choose a topic from the provided list. Include only the topic in your response."
    text_input += "\n" 
    
    background_prompt = [{"role": "system", "content": "You are given a group of news articles represented by a bag of words. " + 
                          "Your task is to find the most suitable topic to describe this group of articles, where the list of topics are provided separately. " +
                          "Remember, you must choose the most suitable topic from the provided list to summarize the bag of words. Here's the list of topics: " + str(tier_3_list)},
                         {"role": "user", "content": "Please pick a topic for these words: celtics,points,inning,season,red,win,rebounds,coach,play,second."},
                         {"role": "assistant", "content": "basketball"}]
