# Removing duplicate and near-duplicate questions from the dataset

In [1]:
import pandas as pd
import numpy as np
import faiss

from langchain_openai import OpenAIEmbeddings

test_brut = pd.read_csv("first_test_dataset_with_fragments.csv")

# OpenAI API key (read with getpass so the key is not echoed in the notebook output)
from getpass import getpass
openai_api_key = getpass("Enter the OpenAI API key: ")
model_embeddings = "text-embedding-3-small"

In [2]:
# when a question appears more than once, keep only its first occurrence
test_clean_dupl = test_brut.drop_duplicates(subset=["question"])
test_clean_dupl

Unnamed: 0,fragment,question,answer,fragment_text
0,1,What kind of information can be found in the '...,"Information about community calls, weekly gove...",Board: Get Started 🌱\nThread: How to Navigate...
1,1,What is the purpose of the 'Accountability' ca...,To increase transparency and accountability ar...,Board: Get Started 🌱\nThread: How to Navigate...
2,1,What type of discussions take place in the 'Re...,Discussions relating to the allocation of Retr...,Board: Get Started 🌱\nThread: How to Navigate...
3,1,What is the focus of the 'Governance Design' c...,"The Collective’s metagovernance strategy, whic...",Board: Get Started 🌱\nThread: How to Navigate...
4,1,What can be found in the 'General Discussions'...,General and public discussions that don’t fit ...,Board: Get Started 🌱\nThread: How to Navigate...
...,...,...,...,...
1544,462,Why did Matters Town migrate to Optimism from ...,Matters Town migrated to Optimism to leverage ...,Board: General Discussions ✨ \nThread: Bringi...
1545,462,What unique challenges does Matters Town face ...,Matters Town faces challenges such as finding ...,Board: General Discussions ✨ \nThread: Bringi...
1546,462,How does Matters Town aim to sustain valuable ...,Matters Town uses a combination of ActivityPub...,Board: General Discussions ✨ \nThread: Bringi...
1547,462,What is the broader vision for Optimism accord...,Guo envisions Optimism evolving beyond a finan...,Board: General Discussions ✨ \nThread: Bringi...
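
As a small illustration of what `drop_duplicates(subset=["question"])` does, here is a sketch on a toy frame. The rows are made up for the example and are not taken from the dataset. It shows that the first row of each repeated question survives and that the original index labels are preserved, which is why the index of the cleaned frame can have gaps.

```python
import pandas as pd

# Toy frame (hypothetical data): the second and third rows share a question.
toy = pd.DataFrame({
    "question": ["What is the Token House?", "What is OP?", "What is OP?"],
    "answer": ["A", "B", "C"],
})

# keep="first" is the default: the earliest row for each question survives.
deduped = toy.drop_duplicates(subset=["question"])

# Index labels are not renumbered, so dropped rows leave gaps.
print(deduped.index.tolist())
print(deduped["answer"].tolist())
```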


In [3]:
# list of the remaining questions
questions = test_clean_dupl["question"].tolist()

In [4]:
# project into the embedding space
embeddings = OpenAIEmbeddings(model=model_embeddings, openai_api_key=openai_api_key)
questions_emb = embeddings.embed_documents(questions)
# faiss works on float32 matrices of shape (n_questions, dim)
questions_emb = np.array(questions_emb, dtype="float32")
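
The next cell compares these vectors with `faiss.IndexFlatL2`, which reports squared L2 distances. The sketch below assumes the embeddings are unit-length (an assumption about the embedding model, not something checked in this notebook) and uses random stand-in vectors to show how a squared L2 distance translates into a cosine similarity. Under that assumption, a threshold of 0.4 would correspond to a cosine similarity of about 0.8. The helper `cosine_for_squared_l2` is hypothetical and exists only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors scaled to unit length (stand-ins for real embeddings).
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)
squared_l2 = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
assert np.isclose(squared_l2, 2 - 2 * cosine)


def cosine_for_squared_l2(d2):
    """Cosine similarity matching a squared L2 distance between unit vectors."""
    return 1 - d2 / 2


print(cosine_for_squared_l2(0.4))
```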

In [5]:
# given a threshold on the faiss distance in the embedding space (IndexFlatL2 returns
# squared L2 distances), remove the questions that are too similar to a kept one
def rm_too_similar_questions(questions, questions_emb, threshold, k=100):
    # exact (brute-force) L2 index over all question embeddings
    index = faiss.IndexFlatL2(questions_emb.shape[1])
    index.add(questions_emb)

    # get the k nearest neighbors of each question (never more than the number of questions)
    dist, ind = index.search(questions_emb, min(k, len(questions)))

    indexes_to_remove = set()
    # for each question
    for n in range(len(ind)):
        # skip questions already removed as a near-duplicate of an earlier one
        if n in indexes_to_remove:
            continue
        # neighbors closer than the threshold, excluding the question itself
        small = [int(j) for j, d in zip(ind[n], dist[n]) if d < threshold and j != n]

        if small:
            # print the kept question followed by the near-duplicates being dropped
            print(questions[n])
            for s in small:
                print(questions[s])
            print("----")
            indexes_to_remove.update(small)

    # questions and embeddings without the removed indexes
    keep = [i for i in range(len(questions)) if i not in indexes_to_remove]
    questions_clean = [questions[i] for i in keep]
    questions_emb_clean = questions_emb[keep]

    return questions_clean, questions_emb_clean

# remove the questions that are too similar
cleaned_questions, cleaned_questions_emb = rm_too_similar_questions(questions, questions_emb, threshold=0.4)

# keep only the rows whose question survived the similarity filter
test_clean = test_clean_dupl[test_clean_dupl["question"].isin(cleaned_questions)]
test_clean

What kind of information can be found in the 'Updates and Announcements' section?
What should topics in the Updates and Announcements category generally contain?
----
What is the Optimism Collective?
What are some of the ongoing goals of the Optimism Collective?
How is the Optimism Collective governed?
----
But what do the Token House and Citizens' House really do?
What are the responsibilities of the Token House and the Citizens’ House?
----
How did the Grants Council handle the increase in applications during Season 5?
What challenges did the Grants Council face in Season 5?
----
What was the overall outcome of the Grants Council's work in Season 5?
What were some key achievements of the Grants Council in Season 5?
What challenges did the Grants Council face in Season 5?
What are some of the key changes in the scope of the Grants Council for Season 5?
How will the Grants Council be renewed for Season 5?
How did the Grants Council perform in terms of reviewing mission requests and mai

Unnamed: 0,fragment,question,answer,fragment_text
0,1,What kind of information can be found in the '...,"Information about community calls, weekly gove...",Board: Get Started 🌱\nThread: How to Navigate...
1,1,What is the purpose of the 'Accountability' ca...,To increase transparency and accountability ar...,Board: Get Started 🌱\nThread: How to Navigate...
2,1,What type of discussions take place in the 'Re...,Discussions relating to the allocation of Retr...,Board: Get Started 🌱\nThread: How to Navigate...
3,1,What is the focus of the 'Governance Design' c...,"The Collective’s metagovernance strategy, whic...",Board: Get Started 🌱\nThread: How to Navigate...
4,1,What can be found in the 'General Discussions'...,General and public discussions that don’t fit ...,Board: Get Started 🌱\nThread: How to Navigate...
...,...,...,...,...
1543,461,What was the resolution for the issue of not s...,The issue was resolved by ensuring that applic...,Board: General Discussions ✨ \nThread: Missin...
1544,462,Why did Matters Town migrate to Optimism from ...,Matters Town migrated to Optimism to leverage ...,Board: General Discussions ✨ \nThread: Bringi...
1545,462,What unique challenges does Matters Town face ...,Matters Town faces challenges such as finding ...,Board: General Discussions ✨ \nThread: Bringi...
1546,462,How does Matters Town aim to sustain valuable ...,Matters Town uses a combination of ActivityPub...,Board: General Discussions ✨ \nThread: Bringi...
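
To make the greedy filtering logic easier to follow, here is a minimal sketch of the same idea without faiss, using a full pairwise distance matrix on a handful of toy 2-D points. The function name `greedy_dedup` and the toy data are assumptions made for this example. A brute-force matrix like this only suits small inputs, which is one reason to use an index for the real embeddings.

```python
import numpy as np


def greedy_dedup(vectors, threshold):
    """Return the indices kept after dropping every row whose squared L2
    distance to a kept row processed earlier is below `threshold`."""
    diffs = vectors[:, None, :] - vectors[None, :, :]
    d2 = np.sum(diffs ** 2, axis=-1)
    removed = set()
    for n in range(len(vectors)):
        if n in removed:
            continue  # already dropped as a near-duplicate
        close = np.flatnonzero(d2[n] < threshold)
        removed.update(int(j) for j in close if j != n)
    return [i for i in range(len(vectors)) if i not in removed]


# Toy points: rows 0/1 and rows 2/3 are near-duplicates, row 4 is isolated.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 1.1], [5.0, 5.0]])
kept = greedy_dedup(toy, threshold=0.4)
print(kept)
```

Because the pass is greedy, the result depends on the order of the rows: the first member of each group of close questions is the one that is kept.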


# Clustering questions into predefined categories using OpenAI Chat

In [6]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
model_chat = "gpt-3.5-turbo-0125"

# select the model
llm = ChatOpenAI(
    model = model_chat,
    temperature = 0,
    max_tokens = None,
    timeout = None,
    max_retries = 2,
    api_key = openai_api_key
)

# create the prompt template; {question} is filled in by ChatPromptTemplate
def answer_template():
    return """I'll give you a question about Optimism Documentation. You don't need to answer it, just categorize it (just return the number, one character) into one of the following categories:

1. project support;
2. governance;
3. dev;
4. tech documentation;
5. general documentation;
6. marketing / promotion / ambassadors / events / PR;
7. other.

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(answer_template())

chain = prompt | llm
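
The prompt asks for a single character, but a chat model can still return extra whitespace or a short sentence. The sketch below shows one way to normalize such a reply before storing it. The helper `parse_category` and its fallback to category 7 ("other") are assumptions made for illustration and are not part of the notebook.

```python
import re


def parse_category(raw, fallback="7"):
    """Extract the first digit 1-7 from a model reply; fall back to "other"."""
    match = re.search(r"[1-7]", raw)
    return match.group(0) if match else fallback


print(parse_category("2"))
print(parse_category(" Category: 5.\n"))
print(parse_category("I am not sure"))
```

Applied to the chain, this would look like `parse_category(chain.invoke({"question": q}).content)`.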

In [7]:
clusters = {}
for q in cleaned_questions:
    cat = chain.invoke({"question": q})
    clusters[q] = cat.content
    print(f"Question: {q}")
    print(f"Category: {cat.content}")

# add a column with the cluster to df (assign returns a new DataFrame, which avoids
# pandas' SettingWithCopyWarning on the filtered frame)
test_clean = test_clean.assign(cluster=test_clean['question'].map(clusters))
test_clean

Question: What kind of information can be found in the 'Updates and Announcements' section?
Category: 5
Question: What is the purpose of the 'Accountability' category?
Category: 5
Question: What type of discussions take place in the 'Retro Funding' category?
Category: 2
Question: What is the focus of the 'Governance Design' category?
Category: 2
Question: What can be found in the 'General Discussions' category?
Category: 5
Question: What is the Optimism Collective?
Category: 6
Question: What is the Token House?
Category: 4
Question: What is the Citizens' House?
Category: 5
Question: But what do the Token House and Citizens' House really do?
Category: 5
Question: Are Optimists expected to avoid conflicts of interest and disclose any potential conflicts?
Category: 2
Question: What is the expectation regarding the communication style of Optimists within the community?
Category: 5
Question: How should Optimists handle personal attacks and unsubstantiated claims in governance activities?
Ca


Unnamed: 0,fragment,question,answer,fragment_text,cluster
0,1,What kind of information can be found in the '...,"Information about community calls, weekly gove...",Board: Get Started 🌱\nThread: How to Navigate...,5
1,1,What is the purpose of the 'Accountability' ca...,To increase transparency and accountability ar...,Board: Get Started 🌱\nThread: How to Navigate...,5
2,1,What type of discussions take place in the 'Re...,Discussions relating to the allocation of Retr...,Board: Get Started 🌱\nThread: How to Navigate...,2
3,1,What is the focus of the 'Governance Design' c...,"The Collective’s metagovernance strategy, whic...",Board: Get Started 🌱\nThread: How to Navigate...,2
4,1,What can be found in the 'General Discussions'...,General and public discussions that don’t fit ...,Board: Get Started 🌱\nThread: How to Navigate...,5
...,...,...,...,...,...
1543,461,What was the resolution for the issue of not s...,The issue was resolved by ensuring that applic...,Board: General Discussions ✨ \nThread: Missin...,4
1544,462,Why did Matters Town migrate to Optimism from ...,Matters Town migrated to Optimism to leverage ...,Board: General Discussions ✨ \nThread: Bringi...,3
1545,462,What unique challenges does Matters Town face ...,Matters Town faces challenges such as finding ...,Board: General Discussions ✨ \nThread: Bringi...,5
1546,462,How does Matters Town aim to sustain valuable ...,Matters Town uses a combination of ActivityPub...,Board: General Discussions ✨ \nThread: Bringi...,5


In [8]:
test_clean['cluster'].value_counts()

cluster
2    719
4    225
6     96
3     88
5     71
1     38
7     17
Name: count, dtype: int64
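
The counts above are easier to compare as shares. The sketch below re-enters the counts from the output by hand (it does not rerun the classification) and computes each category's fraction of the total. With these numbers, category 2 (governance) accounts for roughly 57% of the 1,254 questions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {"2": 719, "4": 225, "6": 96, "3": 88, "5": 71, "1": 38, "7": 17},
    name="count",
)

total = int(counts.sum())
shares = (counts / total).round(3)

print(total)
print(shares)
```

On the real frame, `test_clean['cluster'].value_counts(normalize=True)` gives the same fractions directly.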

In [9]:
# map each numeric category returned by the model to its label
labels_dict = {
    "1": "project support",
    "2": "governance",
    "3": "dev",
    "4": "tech documentation",
    "5": "general documentation",
    "6": "marketing / promotion / ambassadors / events / PR",
    "7": "other"
}

# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
test_clean = test_clean.assign(cluster=test_clean['cluster'].map(labels_dict))
test_clean


Unnamed: 0,fragment,question,answer,fragment_text,cluster
0,1,What kind of information can be found in the '...,"Information about community calls, weekly gove...",Board: Get Started 🌱\nThread: How to Navigate...,general documentation
1,1,What is the purpose of the 'Accountability' ca...,To increase transparency and accountability ar...,Board: Get Started 🌱\nThread: How to Navigate...,general documentation
2,1,What type of discussions take place in the 'Re...,Discussions relating to the allocation of Retr...,Board: Get Started 🌱\nThread: How to Navigate...,governance
3,1,What is the focus of the 'Governance Design' c...,"The Collective’s metagovernance strategy, whic...",Board: Get Started 🌱\nThread: How to Navigate...,governance
4,1,What can be found in the 'General Discussions'...,General and public discussions that don’t fit ...,Board: Get Started 🌱\nThread: How to Navigate...,general documentation
...,...,...,...,...,...
1543,461,What was the resolution for the issue of not s...,The issue was resolved by ensuring that applic...,Board: General Discussions ✨ \nThread: Missin...,tech documentation
1544,462,Why did Matters Town migrate to Optimism from ...,Matters Town migrated to Optimism to leverage ...,Board: General Discussions ✨ \nThread: Bringi...,dev
1545,462,What unique challenges does Matters Town face ...,Matters Town faces challenges such as finding ...,Board: General Discussions ✨ \nThread: Bringi...,general documentation
1546,462,How does Matters Town aim to sustain valuable ...,Matters Town uses a combination of ActivityPub...,Board: General Discussions ✨ \nThread: Bringi...,general documentation
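
One caveat with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN` without an error. The sketch below uses made-up replies and a shortened, hypothetical label dictionary to show how to count unmapped values and one possible way to handle them. Sending unknown replies to "other" is an assumption, not something the notebook does.

```python
import pandas as pd

# Shortened label dictionary and made-up model replies, for illustration only.
labels = {"1": "project support", "2": "governance", "7": "other"}
raw = pd.Series(["2", "1", "2 ", "9"])

# Values that are not dictionary keys ("2 " with a trailing space, "9") map to NaN.
mapped = raw.map(labels)
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)

# Strip whitespace first, then send anything still unknown to "other".
cleaned = raw.str.strip().map(labels).fillna("other")
print(cleaned.tolist())
```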


In [10]:
# save csv
test_clean.to_csv("forum_test_dataset_with_clusters.csv", index=False)
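
The `index=False` argument matters when the file is read back: without it, pandas writes the index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below demonstrates this on a toy frame, using in-memory buffers instead of files; the data is made up for the example.

```python
import io

import pandas as pd

# Toy frame with a non-default index, similar to a frame after filtering rows.
df = pd.DataFrame(
    {"question": ["q1", "q2"], "cluster": ["governance", "dev"]},
    index=[3, 7],
)

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```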