CLUSTER NAMES BY LLM

The LLM is given N random tweets from each cluster.
It is asked to: 1. suggest a name for the cluster, 2. give a reason, 3. decide whether any tweets are far from the cluster.
Caution with no. 3: the LLM is sometimes "too sensitive" when flagging tweets; this needs further prompt engineering.
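
A minimal sketch of the sampling step, assuming a cluster may hold fewer than N tweets (in which case `random.sample` would raise a `ValueError`). The helper name `sample_tweets` and its `seed` parameter are hypothetical and not part of the notebook; the seed is only there to make a run repeatable.

```python
import random

def sample_tweets(tweets, n, seed=None):
    """Return up to n tweets drawn without replacement (hypothetical helper)."""
    rng = random.Random(seed)   # local RNG, so a fixed seed gives a repeatable sample
    k = min(n, len(tweets))     # small clusters: take every tweet
    return rng.sample(tweets, k)

tweets = [f"tweet {i}" for i in range(10)]
print(len(sample_tweets(tweets, 50, seed=0)))   # cluster smaller than N
print(len(sample_tweets(tweets, 3, seed=0)))
```

Capping the sample size keeps the loop running on small clusters, at the cost of some clusters being named from fewer tweets than others.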

In [None]:
df_clusters_path = '/content/drive/MyDrive/VieRally/test-set/clusters-captions.csv' # set path to df containing clustered tweets
df_tf_idf_path = '/content/drive/MyDrive/VieRally/test-set/clusters-names.csv'      # set path to df containing the tf-idf of each cluster
n_tweets_sample = 50                                                                # number of tweets randomly sampled from each cluster and sent to the LLM to get a cluster name
engine = 'mistralai/Mistral-7B-Instruct-v0.3'                        # see https://huggingface.co/mistralai/Mistral-7B-v0.3; if the LLM is changed, the query format may need to change too
# optional: set the key here when not running in Colab
#hugging_face_key = 'XXXXXXXXX'


In [None]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm
from random import shuffle
import requests
import time


In [None]:
df_clusters = pd.read_csv(df_clusters_path)
if 'Unnamed: 0' in df_clusters.columns:
    df_clusters = df_clusters.drop('Unnamed: 0', axis=1)
print(f'df_clusters contains {df_clusters.shape[1]} clusters, overall {df_clusters.size - df_clusters.isna().sum().sum()} tweets, and {df_clusters.isna().sum().sum()} NaNs')

df_clusters contains 52 clusters, overall 12499 tweets, and 16933 NaNs


In [None]:
# if tf-idf names were already computed, they can be loaded for comparison
df_tf_idf = pd.read_csv(df_tf_idf_path).reset_index(drop=True)

In [None]:
df_tf_idf.head()

Unnamed: 0,cluster number,cluster name
0,0,"('life', 'fan', 'link')"
1,1,"('ers', 'ninergang', 'sanfrancisco')"
2,2,"('nfl', 'atlantafalcons', 'falcons')"
3,3,"('ers', 'niners', 'sanfrancisco')"
4,4,"('year', 'signing', 'deal')"
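
Because the tf-idf names are read back from CSV, each `cluster name` cell is a string such as "('life', 'fan', 'link')" rather than a real tuple. A minimal sketch, assuming you want actual keyword tuples for the comparison, is to parse the cells with `ast.literal_eval`; the helper name `parse_keywords` is hypothetical.

```python
import ast

def parse_keywords(cell):
    """Turn a string like "('life', 'fan', 'link')" into a tuple.

    Returns None when the cell is not a parseable tuple or list.
    """
    try:
        value = ast.literal_eval(cell)   # evaluates Python literals only, never arbitrary code
    except (ValueError, SyntaxError, TypeError):
        return None
    return tuple(value) if isinstance(value, (tuple, list)) else None

print(parse_keywords("('life', 'fan', 'link')"))
```

Under the same assumption, it could be applied to the whole column with `df_tf_idf['cluster name'].apply(parse_keywords)`.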


In [None]:
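# get the HuggingFace key from Colab secrets; when not in Colab, set hugging_face_key in the first cell instead
try:
    from google.colab import userdata
    hugging_face_key = userdata.get('H_FACE')
except ImportError:
    pass  # not running in Colab: hugging_face_key must already be defined above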

In [None]:
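def query(prompt, engine=engine, temperature=0.01, max_tokens=400):
    """Send a prompt to the LLM through the HuggingFace Inference API.
    Returns the LLM's response text, without the echoed prompt.
    """
    # Mistral instruct format; other models may need a different template
    inputs = f'<s>[INST] {prompt} [/INST]'
    # named payload (not query) so it does not shadow this function's name
    payload = {"inputs": inputs,
               "parameters": {"temperature": temperature, "do_sample": False, "max_new_tokens": max_tokens, "max_time": 120},
               "options": {"wait_for_model": True}
               }
    API_URL = "https://api-inference.huggingface.co/models/" + engine
    headers = {"Authorization": "Bearer " + hugging_face_key}
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()  # fail early on HTTP errors instead of on the indexing below
    out = response.json()
    # drop the echoed prompt from the start of generated_text
    out = out[0]["generated_text"][len(inputs):]
    return out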
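
The `query` function above indexes straight into the decoded JSON, so any reply that is not a list with a `generated_text` entry fails with an unhelpful `KeyError` or `IndexError`. The sketch below shows one way to unpack the reply more defensively. It assumes, without the notebook confirming it, that a failed call returns a dict with an `"error"` key; the helper name `extract_completion` is hypothetical.

```python
def extract_completion(payload, inputs):
    """Pull the completion out of a decoded JSON reply, or raise a readable error."""
    # Assumed failure shape: {"error": "..."}
    if isinstance(payload, dict) and "error" in payload:
        raise RuntimeError(f"LLM API error: {payload['error']}")
    ok = (isinstance(payload, list) and payload
          and isinstance(payload[0], dict) and "generated_text" in payload[0])
    if not ok:
        raise RuntimeError(f"Unexpected reply shape: {payload!r}")
    text = payload[0]["generated_text"]
    # Strip the echoed prompt only when it is really there
    return text[len(inputs):] if text.startswith(inputs) else text

inputs = "<s>[INST] how are you? [/INST]"
reply = [{"generated_text": inputs + " Fine, thanks."}]
print(extract_completion(reply, inputs))
```

Checking `startswith` before slicing avoids silently cutting real output if a model or endpoint does not echo the prompt.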



In [None]:
#sanity check
query('how are you?')

"I'm just a computer program, so I don't have feelings or emotions. I'm here to help you with any questions you have to the best of my ability! How can I assist you today?"

In [None]:
import random
cluster_num_list, tf_idf_list, out_list, suggested_name_list, reason_list, examples_list = [], [], [], [], [], []
for col in df_clusters.columns:
    cluster_num = col.split('_')[1]   # the cluster number is the part of the column name after the first '_'
    cluster_num_list.append(cluster_num)
    print(f'****cluster # {cluster_num} ****')
    try:
        tf_idf = tuple(df_tf_idf[df_tf_idf['cluster number'] == int(cluster_num)]['cluster name'])[0]
        print('tf_idf: ', tf_idf)
        tf_idf_list.append(tf_idf)
    except (NameError, IndexError, KeyError, ValueError):   # tf-idf not loaded, or no entry for this cluster
        tf_idf_list.append(None)
    tweets_list = df_clusters[col].dropna().tolist()
    examples = random.sample(tweets_list, min(3, len(tweets_list)))   # up to 3 random examples from the cluster
    examples_list.append(examples)
    print('cluster examples: ', examples)
    # clusters with fewer than n_tweets_sample tweets are sent whole
    texts = random.sample(tweets_list, min(n_tweets_sample, len(tweets_list)))
    prompt = f"""Here is a cluster of tweets. Suggest a name for the cluster, based on the majority of tweets.  Give answer in format:
            "Cluster suggested name: xxx \n Short explanation (max 15 words): yyy \n  Tweets very far from the cluster (if any): aaa , a short reason (max 10 words). "
             Tweets cluster:{texts}"""
    out = query(prompt)
    out_list.append(out)
    print(out)
    # the reply is expected to hold one 'label: value' pair per line; skip blank lines
    lines = [line for line in out.split('\n') if line.strip()]
    suggested_name = lines[0].split(':', 1)[-1].strip() if lines else None
    suggested_name_list.append(suggested_name)
    reason = lines[1].split(':', 1)[-1].strip() if len(lines) > 1 else None
    reason_list.append(reason)
    # TODO: add a random sample of tweets from texts and explain why the cluster name fits
    print('_______________________________________________________\n')
suggested_names_df = pd.DataFrame({'cluster_num': cluster_num_list, 'LLM_suggested_cluster_name': suggested_name_list, 'LLM_reasoning': reason_list,
                                   'tf_idf': tf_idf_list, 'examples': examples_list})
display(suggested_names_df)

****cluster # 0 ****
tf_idf:  ('life', 'fan', 'link')
cluster examples:  ['Explore our selection fan gear designed elevate your Team support Click the link our bio for exciting shopping experience that awaits you TampaBayBuccaneers GoBucs BuccaneersNation BuccaneersFootball BucsPride BucsCountry BucsFans Bucs Life BucsGameDay BucsFamily BucsWin BucsLove BucsNationUnite BucsUp BucsSupport BucsStrong BucsHuddle BucsCheer BucsCrazies BucsForTheWin BucsUnited BucsCommunity BucsForever BucsLegacy BucsGameday BucsFootballClub BucsFansUnite BucsTeam BucsRoar BucsNationRise', 'Explore our selection fan gear designed elevate your Team support Click the link our bio for exciting shopping experience that awaits you IndianapolisColts ColtsNation ForTheShoe ColtsFootball ColtsPride ColtsCountry ColtsFans Colts Life ColtsGameDay ColtsFamily ColtsWin ColtsLove ColtsNationUnite ColtsUp ColtsSupport ColtsStrong ColtsHuddle ColtsCheer ColtsCrazies ColtsForTheWin ColtsUnited ColtsCommunity ColtsForever C

Unnamed: 0,cluster_num,LLM_suggested_cluster_name,LLM_reasoning,tf_idf,examples
0,0,NFL Fan Gear,Tweets about NFL team fan gear.,"('life', 'fan', 'link')",[Explore our selection fan gear designed eleva...
1,1,NFL Free Agency Signings,Tweets discussing various NFL teams signing f...,"('ers', 'ninergang', 'sanfrancisco')",[The Falcons have agreed terms with former ers...
2,2,Falcons' Offensive Overhaul,Major signings and trades for the Atlanta Fal...,"('nfl', 'atlantafalcons', 'falcons')",[BREAKING The Atlanta Falcons are signing Darn...
3,3,Sports and Family,"Tweets focus on sports, teams, and family-rel...","('ers', 'niners', 'sanfrancisco')",[tiig pink phoenix rookie ers sanfrancisco ers...
4,4,NFL Free Agency Discussions,Discussions about NFL free agency signings an...,"('year', 'signing', 'deal')",[Free agent tight end Irv Smith reached agreem...
5,5,Beauty & Barbershop,"Tweets about various beauty services, haircut...","('newyorkartist', 'newyorkgiants', 'newyork')",[have the PRETTIEST clients Loving the placeme...
6,6,NFL Free Agency Discussions,"Tweets discussing NFL free agency moves, trad...","('nfl', 'chargers', 'losangeleschargers')",[The Chargers have agreed terms bring back Eas...
7,7,Arizona Cardinals Free Agency Signings,Tweets about the Arizona Cardinals signing va...,"('nfl', 'arizonacardinals', 'cardinals')",[The Cardinals signed former Seahawks running ...
8,8,NFL Teams Pride,Tweets promoting fan apparel and accessories ...,"('click', 'range', 'fantastic')",[Wear your Team pride with our exclusive range...
9,9,Steelers Signings,Discusses the Pittsburgh Steelers signing mul...,"('pittsburghsteelers', 'steelers', 'pittsburgh')",[honor Patrick Queen and Deshone Elliot coming...
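
The naming loop reads the reply by position (first line is the name, second line is the reason), which breaks if the LLM adds a preamble or reorders the fields. A possible alternative, sketched below, is to look for each field label with a regular expression. The function name `parse_reply` and the sample reply are illustrative assumptions, not output from the notebook.

```python
import re

def parse_reply(reply):
    """Extract name, explanation and outlier note from a reply.

    Fields that cannot be found are returned as None.
    """
    patterns = {
        "name": r"suggested[ \t]+\w+[ \t]*:[ \t]*(.+)",
        "explanation": r"explanation[^:\n]*:[ \t]*(.+)",
        "outliers": r"far from the cluster[^:\n]*:[ \t]*(.+)",
    }
    fields = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, reply, flags=re.IGNORECASE)
        fields[key] = match.group(1).strip() if match else None
    return fields

# Illustrative reply in the requested format (made up for this example)
reply = ("Cluster suggested name: NFL Fan Gear\n"
         "Short explanation (max 15 words): Tweets about NFL team fan gear.\n"
         "Tweets very far from the cluster (if any): None")
print(parse_reply(reply))
```

Matching on labels rather than line positions means a missing field shows up as `None` in the results table instead of stopping the loop with an `IndexError`.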
