# topic modelling on NFBCS and program-l emails

This notebook contains a set of experiments that extracts topics from emails to program-l and nfbcs, that talk about AI. The notebook also contains similar experiments on the entire email corpus. Please run scraper.ipynb before running this notebook
## setup
Please ensure that you have a .env file that has the necessary OpenAI variables as per the Azure OpenAI documentation.

## Setup

In [66]:
import os
from openai import AzureOpenAI
import pandas as pd

from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer
import tiktoken
import json

from dotenv import load_dotenv
load_dotenv()
deployment_name='GPT3516'
client = AzureOpenAI()
keywords = ['ChatGPT','chatgpt', 'GitHub Copilot','GitHub copylot','github copilot','copilot','Copylot','CoPylot','Bard','bard','Gemini','gemini', 'Generative AI','midjourney', 'BeMyEyes','bemyeyes','BeMyAI','BeMyAI','openai','transformer','huggingface','HuggingFace']
# 'AI', 'generative ai', 'bing',\
delimiter = '---end of email---'
system_prompt = """
You are an expert in topic modeling.
you are an assistant helping a user with extracting important topics from a large set of emails.
you will be given emails delimited with {delimiter}.
repetitive emails can contain:
--- original message ---
You will only operate on the data provided to you to extract topics.
You will not use any external data sources.
Please, identify the main topics mentioned in these emails.
    focus on topics that talk about problems that people experienced and solutions. 
    Return a list of 10-20 topics.
    Output returned should always be a JSON list with the following format
    [
        {{"topic_name": "<topic1>", "topic_description": "<topic_description1>","email_snippet": "<email_snippet1>"}}, 
        {{"topic_name": "<topic2>", "topic_description": "<topic_description2>", "email_snippet": "<email_snippet>"}},
        ...
    ]
    {delimiter}
    """

# load the data from the CSV into df
df = pd.read_csv("data/emails.csv")
df = df.dropna()
print(f"we have a total of {df.shape[0]} emails. the data frame has {df.shape[1]} columns.")
# filter for messages that contain keywords in subject or body
def checkForKeywords(txt):
    for keyword in keywords:
        if keyword in txt:
            # print matched snippet from the text
            # print("matched keyword: " + keyword + " in text: " + txt[:100] + "...")
            return True
    return False
ai_df = df # not applying filter
# ai_df = df[df["subject"].str.contains("|".join(keywords)) | df["body"].str.contains("|".join(keywords))]
# ai_df = df[df["subject"].apply(checkForKeywords) | df["body"].apply(checkForKeywords)]
ai_df = ai_df.dropna()

# print(f"we have a total of {ai_df.shape[0]} emails that contain the keywords. the data frame has {ai_df.shape[1]} columns.")
# print the number of threads, and number of messages in each threads, and the subjects
# print("number of messages per thread: ", ai_df["thread_id"].value_counts())
# group by thread id and print subjects
# print("subjects of threads: "+ ai_df.groupby("thread_id")["subject"].first())
    # identify how many emails have been sent to nfb lists vs the program-l list Vs. cross posted
df_nfb = ai_df[ai_df["to"].str.contains("nfb")]
df_prog = ai_df[ai_df["to"].str.contains("program-l")]
df_cross = ai_df[ai_df["to"].str.contains("nfb") & ai_df["to"].str.contains("program-l")]
print(f"number of emails sent to nfb: {df_nfb.shape[0]}")
print(f"number of emails sent to program-l: {df_prog.shape[0]}")
print(f"number of emails cross posted: {df_cross.shape[0]}")


messages = [
    {"role": "system", "content": system_prompt},
]
def getResponseFromOpenAI(prompt, history=True, model=deployment_name, system_prompt=system_prompt):
    prompt_tokens = 0
    response_tokens = 0
    gpt35_enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    # access the global message variable
    global messages
    if not history:
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    else:
        messages.append({"role": "user", "content": prompt})
     #check if context length exceeds and remove older history
    encoded_messages = [gpt35_enc.encode(message['content']) for message in messages]
    print("encoded messages.")
    total_tokens = sum([len(encoded_message) for encoded_message in encoded_messages])
    system_prompt_tokens = len(gpt35_enc.encode(system_prompt))
    while total_tokens >= 16000-system_prompt_tokens:
        if len(messages) >1:
            encoded_messages.pop(1)
            messages.pop(1)
        total_tokens = sum([len(encoded_message) for encoded_message in encoded_messages])
    try:
        # print the number of tokens being sent
        print(f"total tokens being sent to OpenAI: {total_tokens}")
        response = client.chat.completions.create(model=model,    messages=messages)
        messages.append({"role":response.choices[0].message.role, "content":response.choices[0].message.content})
        content = response.choices[0].message.content
        tokens_count ={
       
        'prompt_tokens':response.usage.prompt_tokens,
        'completion_tokens':response.usage.completion_tokens,
        'total_tokens':response.usage.total_tokens}
        return content, tokens_count 
    
    except Exception as e:
        print(f"error: {e}")
        return "error"
    
    # test the method
print(getResponseFromOpenAI("what is your primary purpose?"))

we have a total of 25857 emails. the data frame has 9 columns.
number of emails sent to nfb: 7630
number of emails sent to program-l: 18051
number of emails cross posted: 133
encoded messages.
total tokens being sent to OpenAI: 202
('My primary purpose is to assist you in extracting important topics from a large set of emails.', {'prompt_tokens': 213, 'completion_tokens': 18, 'total_tokens': 231})


In [58]:
import re

def cleanupText(text):
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', "",text,flags=re.I)
# replace any url with "". use regular expression
    text = re.sub(r'http\S+', "",text,flags=re.I)
    text = re.sub(r'www\S+', "",text,flags=re.I)
    text = re.sub(r'[\w\.-]+@[\w\.-]+', "",text,flags=re.I)
    return text

# iterate through subject and body and cleanup text
def cleanupDF (df):
    email_text = []
    for index, row in df.iterrows():
        email_text_string = ""
        email_text_string =  row["subject"] + " " + row["body"]
        email_text_string = cleanupText(email_text_string)
        email_text.append(email_text_string)
    return email_text

email_text = cleanupDF(ai_df)
# print (f"cleaned up email text: {email_text[:5]}")

## BirTopic and ChatGPT on all the emails


In [59]:
import re
# find email addresses in the list of words
def findEmails(email_text):
    email_addresses = []
    for text in email_text:
        email_addresses.append(re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', text))
    return email_addresses

print("number of email addresses found = ", len(findEmails(email_text)))
email_addresses = findEmails(email_text)
# print(email_addresses[:10])

number of email addresses found =  25857


In [40]:
def get_topic_stats(topic_model, extra_cols = []):
    topics_info_df = topic_model.get_topic_info().sort_values('Count', ascending = False)
    # print the columns
    print(f"columns in the topics_info_df: {topics_info_df.columns}")
    topics_info_df['Share'] = 100.*topics_info_df['Count']/topics_info_df['Count'].sum()
    topics_info_df['CumulativeShare'] = 100.*topics_info_df['Count'].cumsum()/topics_info_df['Count'].sum()
    return topics_info_df[['Topic', 'Count', 'Share', 'CumulativeShare', 
                           'Name', 'Representation'] + extra_cols]

def getBerTopics(email_text):
    representation_model = KeyBERTInspired()
    vectorizer_model = CountVectorizer(min_df=5, stop_words = 'english')
    topic_model = BERTopic(nr_topics = 'auto', vectorizer_model = vectorizer_model, representation_model = representation_model)
    print("fitting model")
    topics, ini_probs = topic_model.fit_transform(email_text)
    print("model fit step executed")
    topic_list = topic_model.generate_topic_labels(nr_words=3,separator=", ")
    print(f"topics: {topic_list}")
    return topic_model, topics, ini_probs

topic_model, topics, ini_probs = getBerTopics(email_text)
topic_model.visualize_topics()

fitting model
model fit step executed
topics: ['-1, accessible, accessibility, reader', '0, accessibility, screen, text', '1, font, html, text', '2, api, apis, curl', '3, cloud, protocols, communications', '4, emails, email, sender', '5, server, bgt, client', '6, json, python, parse', '7, vpn, accessible, access', '8, text, paragraph, writing', '9, keyboards, keyboard, switches', '10, fetch, nodejs, htmljs', '11, news, george, british', '12, textbox, selected, winforms', '13, saving, pause, save', '14, wifi, mobile, iphone', '15, trading, learning, crypto', '16, passwords, password, sync', '17, ftp, files, backup', '18, dropbox, cloud, storage', '19, drawing, code, draw', '20, technologies, communications, health', '21, php, session, nodejs', '22, numpy, python, matrix', '23, penny, cash, code', '24, datetime, python, hour', '25, tests, typescript, types', '26, javascript, dates, date', '27, iphone, tags, mobile', '28, cards, deck, programming', '29, calendar, offset, assignment', '30,

In [41]:
# save the model
topic_model.save("email_topic_model")


Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.



In [45]:
response_list = []
gpt35_enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
topic_stats_df = get_topic_stats(topic_model,['Representative_Docs'])
# print 3 rows of representation column
# print(topic_stats_df.head(3))
repr_docs = topic_stats_df.Representative_Docs.sum()
print(f"length of repr_docs is {len(repr_docs)}")
for repr_doc in repr_docs[:20]:
    user_message = f"""
        {repr_doc}
    {delimiter}
    """
    if len(gpt35_enc.encode(user_message)) > 16000:
        print(f"skipping topic. length of user_message is {len(gpt35_enc.encode(user_message))}")
        continue
    print (f" sending {len(gpt35_enc.encode(user_message))} tokens")
    # get response from OpenAI
    response, tokens_count = getResponseFromOpenAI(user_message, history = True, model = deployment_name)
    # print(response)
    try:
        response_items = json.loads(response)
        response_list.append(response_items)
    except Exception as e:
        print(f"error: {e}")
        print(f"could not parse response: {response}")
        continue

print(f"have a total of {len(response_list)} topics. printing a few samples")
print(response_list[:3])

columns in the topics_info_df: Index(['Topic', 'Count', 'Name', 'Representation', 'Representative_Docs'], dtype='object')
length of repr_docs is 246
 sending 3282 tokens
total tokens being sent to OpenAI: 3502
 sending 4658 tokens
total tokens being sent to OpenAI: 8583
 sending 5868 tokens
total tokens being sent to OpenAI: 14835
 sending 3592 tokens
total tokens being sent to OpenAI: 15578
 sending 2196 tokens
total tokens being sent to OpenAI: 13105
 sending 8215 tokens
total tokens being sent to OpenAI: 15467
 sending 667 tokens
total tokens being sent to OpenAI: 12516
 sending 589 tokens
total tokens being sent to OpenAI: 13246
 sending 440 tokens
total tokens being sent to OpenAI: 13841
 sending 704 tokens
total tokens being sent to OpenAI: 14611
 sending 1739 tokens
total tokens being sent to OpenAI: 13945
 sending 531 tokens
total tokens being sent to OpenAI: 14672
 sending 1125 tokens
total tokens being sent to OpenAI: 15669
 sending 1109 tokens
total tokens being sent to Open

In [46]:
print(f"sample response: {response_list[:2]}")
response_df = pd.DataFrame(columns = ["topic_name", "topic_description", "email_snippet"])
for response_listItem in response_list:
    for response in response_listItem:
        temp_df = pd.DataFrame([response])
        # print(f"temp df: {temp_df}")
        pd.concat([response_df,temp_df])
    
print(response_df.shape)
print(len(response_list))
# write raw responses to a file
with open("data/raw_responses.json", "w") as f:
    json.dump(response_list, f)

sample response: [[{'topic_name': 'Discoverability in Visual Studio', 'topic_description': 'Discussion about the discoverability problem in Visual Studio and the need for better ways to surface commands and shortcuts.', 'email_snippet': 'Agreed. I will say that one of the \'original\' ideas with the Ctrl+Q search was something like this. You could if you typed "rename a method", it would somehow know that you\'re probably looking for the \'Refactor\' command, even if you didn\'t know that magical name. And when Refactor is returned, it would include the keyboard shortcut.'}, {'topic_name': 'Intelligent Interface', 'topic_description': "Discussion about the suggestion of having a semi-intelligent interface in Visual Studio that provides suggestions based on the user's input.", 'email_snippet': "That is not the same as having a semi-intelligent interface where lets say I did control+AI  I know I can't use those keys but it sounded good.  Anyway it pops open a dialog and you can just writ

In [53]:
# load each response into a df with proper column names
response_df = pd.DataFrame(columns = ["topic_name", "topic_description", "email_snippet"])
for response_listItem in response_list:
    for response in response_listItem:
        temp_df = pd.DataFrame([response])
        response_df = pd.concat([response_df,temp_df])
print(f"response_df.shape: {response_df.shape}")
print(response_df.head())
response_df.to_csv("data/response_df.csv", index = False)

response_df.shape: (42, 3)
                              topic_name  \
0       Discoverability in Visual Studio   
0                  Intelligent Interface   
0  Keyboard Shortcuts and Customizations   
0  AI Agent for Productivity Suggestions   
0                    Developing iOS Apps   

                                   topic_description  \
0  Discussion about the discoverability problem i...   
0  Discussion about the suggestion of having a se...   
0  Discussion about the availability of keyboard ...   
0  Discussion about the suggestion of using an AI...   
0  Discussion about the process of developing iOS...   

                                       email_snippet  
0  Agreed. I will say that one of the 'original' ...  
0  That is not the same as having a semi-intellig...  
0  In Visual Studio, you can go into Tools\Option...  
0  If you want to get really cute get an AI agent...  
0  Hi, I am looking at starting to develop iOS ap...  


In [74]:

# system prompt to summarize the topic list
summary_system_prompt = """
You are an expert in topic modeling.
you are an assistant helping a user with extracting important topics from a large set of keywords and topics.
You will only operate on the data provided to you to extract topics.
You will not use any external data sources.
Please, identify the main topics mentioned in the list given to you, and summarize them.
    focus on topics that talk about problems that people experienced and solutions. 
    Return a list of 10 topics. For each topic, include the following:
    -   an explanation of the topic constructed from the information in the list given to you,
     -  classification into "problem", "solution", or "feature request".
     - an explanation grounded in data
     - a direct quote from the data for your classification.
    Input is a JSON list of the following format
    [
        {{"topic_name": "<topic1>", "topic_description": "<topic_description1>","email_snippet": "<email_snippet1>"}}, 
        {{"topic_name": "<topic2>", "topic_description": "<topic_description2>", "email_snippet": "<email_snippet>"}},
        ...
    ]
    {delimiter}
    """
# get response
# convert response_list into string
response_list_str = json.dumps(response_list)
response, tokens_count = getResponseFromOpenAI(response_list_str, history = False, model = deployment_name, system_prompt = summary_system_prompt)
print(response)

encoded messages.
total tokens being sent to OpenAI: 4594
1. Topic: Discoverability in Visual Studio
   - Problem: The discoverability problem in Visual Studio and the need for better ways to surface commands and shortcuts.
   - Quote: "Agreed. I will say that one of the 'original' ideas with the Ctrl+Q search was something like this. You could if you typed \"rename a method\", it would somehow know that you're probably looking for the 'Refactor' command, even if you didn't know that magical name. And when Refactor is returned, it would include the keyboard shortcut."

2. Topic: Intelligent Interface
   - Problem: The suggestion of having a semi-intelligent interface in Visual Studio that provides suggestions based on the user's input.
   - Quote: "That is not the same as having a semi-intelligent interface where lets say I did control+AI  I know I can't use those keys but it sounded good. Anyway it pops open a dialog and you can just write\n\nI want to copy noncontiguous blocks of tex