# Testing how to clean the dataset with LLMs

Here we test how to clean the dataset with LLMs, keeping only the comments that are relevant to the OP's question. The finalized code and model are in the `./llama3-labeling` directory.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Load the data

In [2]:
comments_array = np.load('comments_with_certainty.npy', allow_pickle=True)

columns = ['post_id', 'delta', 'body', 'assertivity']
comments_df = pd.DataFrame(comments_array, columns=columns)
comments_df

Unnamed: 0,post_id,delta,body,assertivity
0,pz0w6x,False,The biggest problem with two spaces is that it...,4.303459
1,pz0w6x,False,Do you think your post is difficult to disging...,4.838442
2,pz0w6x,False,"That is a very good point, I have had troubles...",4.822953
3,pz0w6x,False,/u/pepe_extendus (OP) has awarded 3 delta(s) i...,5.036173
4,pz0w6x,False,Confirmed: 1 delta awarded to /u/Salanmander (...,5.059403
...,...,...,...,...
850130,1005v09,False,"No, public universities are part of the state....",5.173915
850131,1005v09,False,"&gt;And again, Hooters waitresses wear tank to...",5.074471
850132,1005v09,False,[Here's the full court decision explaining why...,5.04283
850133,1005v09,False,&gt; If Republicans want to restrict what prof...,4.463691


In [3]:

# remove bot comments whose body contains '(OP) has awarded'
comments_df = comments_df[~comments_df['body'].str.contains(r'\(OP\) has awarded', na=False)]
# remove bot comments whose body contains 'delta awarded'; copy so the assignment below acts on an independent frame
comments_df = comments_df[~comments_df['body'].str.contains(r'delta awarded', na=False)].copy()
# Ensure the 'delta' column is boolean
comments_df['delta'] = comments_df['delta'].astype(bool)

# Filter groups where at least one 'delta' is True
filtered_df = comments_df.groupby('post_id').filter(lambda x: x['delta'].any())

filtered_df

Unnamed: 0,post_id,delta,body,assertivity
0,pz0w6x,False,The biggest problem with two spaces is that it...,4.303459
1,pz0w6x,False,Do you think your post is difficult to disging...,4.838442
2,pz0w6x,True,"That is a very good point, I have had troubles...",4.822953
5,pz0w6x,False,I just realised that Reddit automatically made...,4.462715
6,pz0w6x,False,What kind of argument could change what you pe...,4.165082
...,...,...,...,...
850121,1003si2,False,&gt; It also seems to me he doesn't want to do...,4.543573
850122,1003si2,False,Isn't Leftist is usually synonymous with socia...,4.63549
850123,1003si2,False,"Other than DSA, Communists, what other left or...",4.942189
850124,1003si2,False,it would be nice if there was a left-leaning m...,4.601137
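
To make the two filtering steps above easier to check, here is a minimal sketch on toy data. The rows are hypothetical and only mimic the shape of `comments_df`; the combined `bot_pattern` is an assumption that merges the two patterns used above into one regular expression.

```python
import pandas as pd

# Toy data: hypothetical rows, not taken from the real dataset
toy = pd.DataFrame(
    {
        "post_id": ["a", "a", "a", "b", "b"],
        "delta": [False, True, False, False, False],
        "body": [
            "First argument",
            "Second argument",
            "/u/someone (OP) has awarded 1 delta(s) in this post.",
            "Confirmed: 1 delta awarded to /u/someone",
            "Another argument",
        ],
    }
)

# Drop the bot comments, matching the same two patterns as above
bot_pattern = r"\(OP\) has awarded|delta awarded"
toy = toy[~toy["body"].str.contains(bot_pattern, na=False)]

# Keep only posts where at least one remaining comment earned a delta
kept = toy.groupby("post_id").filter(lambda g: g["delta"].any())
print(kept["post_id"].tolist())
```

Post `b` has no delta-winning comment once the bot rows are gone, so only the two human comments of post `a` remain.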


In [12]:
# reset the index
filtered_df = filtered_df.reset_index(drop=True)

In [13]:
filtered_df

Unnamed: 0,post_id,delta,body,assertivity
0,pz0w6x,False,The biggest problem with two spaces is that it...,4.303459
1,pz0w6x,False,Do you think your post is difficult to disging...,4.838442
2,pz0w6x,True,"That is a very good point, I have had troubles...",4.822953
3,pz0w6x,False,I just realised that Reddit automatically made...,4.462715
4,pz0w6x,False,What kind of argument could change what you pe...,4.165082
...,...,...,...,...
589914,1003si2,False,&gt; It also seems to me he doesn't want to do...,4.543573
589915,1003si2,False,Isn't Leftist is usually synonymous with socia...,4.63549
589916,1003si2,False,"Other than DSA, Communists, what other left or...",4.942189
589917,1003si2,False,it would be nice if there was a left-leaning m...,4.601137


Get the total number of tokens

In [31]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")
text = "hi"
tokens = tokenizer.tokenize(text)
num_tokens = len(tokens)
print(num_tokens)

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


1


In [32]:
# count the tokens in each comment body
filtered_df['num_tokens'] = filtered_df['body'].apply(lambda x: len(tokenizer.tokenize(x)))

In [33]:
filtered_df.num_tokens.describe()

count    589919.000000
mean        104.688879
std         136.528012
min           1.000000
25%          28.000000
50%          59.000000
75%         128.000000
max        3271.000000
Name: num_tokens, dtype: float64

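Price calculation for Bedrock API calls. The formula appears to assume about 300 prompt tokens of overhead per comment (system prompt plus OP text) and 30 output tokens per comment, priced at 0.00265 per 1,000 input tokens and 0.0006 per 1,000 output tokens. Note that this cell ran (In [146]) after `filtered_df` was cut to the first 1,000 rows (In [42]), so `sum(filtered_df.num_tokens)` appears to cover only those rows; with the full-data mean of about 104.7 tokens over 589,919 comments, the same formula gives roughly 643.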

In [146]:
(((sum(filtered_df.num_tokens) + (589919 * 300)) / 1000) * 0.00265) + (((589919 * 30) / 1000) * 0.0006)

479.89394305
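
The one-line estimate above can be written with named parameters, which makes the assumptions easier to inspect and change. This is a sketch only: the function name `estimate_cost` and the parameter names are hypothetical, and reading 300 as per-comment prompt overhead and 30 as per-comment output tokens is an interpretation of the formula, not something the notebook states.

```python
def estimate_cost(
    comment_tokens,               # total tokens across all comment bodies
    n_comments,                   # number of comments to classify
    overhead_tokens=300,          # assumed prompt overhead per call (system prompt + OP text)
    output_tokens=30,             # assumed completion tokens per call
    input_price_per_1k=0.00265,   # price per 1,000 input tokens, taken from the cell above
    output_price_per_1k=0.0006,   # price per 1,000 output tokens, taken from the cell above
):
    """Return the estimated total price for classifying n_comments comments."""
    input_total = comment_tokens + n_comments * overhead_tokens
    output_total = n_comments * output_tokens
    return (
        input_total / 1000 * input_price_per_1k
        + output_total / 1000 * output_price_per_1k
    )


# Overhead and output alone, ignoring the comment bodies themselves
print(round(estimate_cost(0, 589919), 2))
```

On the real data it would be called as `estimate_cost(filtered_df["num_tokens"].sum(), len(filtered_df))` while `filtered_df` still holds every row.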

In [42]:
# take the first 1000 data points
filtered_df = filtered_df.head(1000)

In [43]:
filtered_df

Unnamed: 0,post_id,delta,body,assertivity,num_tokens
0,pz0w6x,False,The biggest problem with two spaces is that it...,4.303459,240
1,pz0w6x,False,Do you think your post is difficult to disging...,4.838442,17
2,pz0w6x,True,"That is a very good point, I have had troubles...",4.822953,50
3,pz0w6x,False,I just realised that Reddit automatically made...,4.462715,47
4,pz0w6x,False,What kind of argument could change what you pe...,4.165082,13
...,...,...,...,...,...
995,pzri5s,False,We could start by actually targeting the real ...,4.470484,124
996,pzri5s,False,"As a man, have you been consistently advised t...",5.03484,235
997,pzri5s,False,whoosh. your misogyny and racism have rendered...,4.990645,21
998,pzri5s,False,"It's a ""don't do what she did, or you'll end u...",4.924634,42


In [75]:
import json
# Read the filtered posts (October 2021 to May 2024) into a list of dictionaries
with open('/Users/shahrad/projs/assertivity/changemyview_pei/filtered_posts_with_deltas_october_2021_to_may_2024.json') as f:
    posts = json.load(f)

# Keep only the posts whose id appears in filtered_df, storing each post's id and selftext.
# Build the id set once so the membership test is not recomputed for every post.
wanted_ids = set(filtered_df['post_id'].unique())
posts_df = pd.DataFrame([{'post_id': post['post']['id'], 'selftext': post['post']['selftext']} for post in posts if post['post']['id'] in wanted_ids])

In [76]:

posts_df

Unnamed: 0,post_id,selftext
0,pz0w6x,This is something I thought was standard all w...
1,pz2hp9,I'll also address this one bit first: If you d...
2,pzri5s,More and more I see people being attacked for ...


In [77]:
filtered_df.post_id.unique()

array(['pz0w6x', 'pz2hp9', 'pzri5s'], dtype=object)
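
Since every comment needs the text of its parent post, a plain dictionary keyed by post id may be a simpler lookup structure than filtering a DataFrame for each row. The sketch below assumes the nested `post["post"]["id"]` and `post["post"]["selftext"]` layout used above; the toy `posts` list, `wanted_ids`, and `op_text_by_id` are hypothetical names introduced for illustration.

```python
# Hypothetical miniature of the `posts` list loaded from JSON above
posts = [
    {"post": {"id": "p1", "selftext": "First OP text"}},
    {"post": {"id": "p2", "selftext": "Second OP text"}},
    {"post": {"id": "p3", "selftext": "Third OP text"}},
]

# Stands in for set(filtered_df["post_id"])
wanted_ids = {"p1", "p3"}

# Map each wanted post id to its selftext in a single pass
op_text_by_id = {
    p["post"]["id"]: p["post"]["selftext"]
    for p in posts
    if p["post"]["id"] in wanted_ids
}

print(op_text_by_id["p3"])
```

A lookup such as `op_text_by_id[row["post_id"]]` then takes constant time per comment.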

Try LLM classification with the smaller Llama 3 8B model

In [94]:
import litellm
from litellm import batch_completion
import os

# AWS credentials are expected to be set in the environment already
# (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY); never hardcode keys in a notebook.
os.environ["AWS_REGION_NAME"] = "us-east-1"

litellm.set_verbose = False


def classify_comments(df):
    relevancy_scores = []

    # Loop over the DataFrame in chunks of 5 rows
    for i in range(0, len(df), 5):
        # Construct the messages for each comment in this chunk
        messages = []
        for _, row in df.iloc[i : i + 5].iterrows():
            # Get the OP text for this post_id
            op_text = posts_df[posts_df["post_id"] == row["post_id"]][
                "selftext"
            ].values[0]

            # Construct the system message
            system_message = """
            We are working with data from the subreddit r/changemyview, where each entry consists of an original post (OP) followed by comments \
            in a list format. Your task is to classify each comment based on whether it tries to persuade the OP or another commenter to change \
            their view. Classify as Relevant if the comment attempts to persuade, Neutral if it asks a non-rhetorical question or states unrelated \
            information, and Irrelevant if it contains information unrelated to the topic. For each comment, read the OP text and the comment, \
            then provide a single word classification: "relevant", "neutral", or "irrelevant". For example, if the OP is "I believe electric \
            cars are not a viable solution to climate change" and the comment is "Have you considered the advancements in battery technology?", \
            classify it as "relevant". If the comment is "How many electric cars are currently on the road?", classify it as "neutral". \
            If the comment is "I like electric cars, they look cool!", classify it as "irrelevant".
            """

            # Construct the user message
            user_message = f"OP text: {op_text}\nComment: {row['body']}"

            # Add the messages for this comment to the list
            messages.append(
                [
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": user_message},
                ]
            )

        # Call the LLM with batch_completion
        responses = batch_completion(
            model="bedrock/meta.llama3-8b-instruct-v1:0",
            messages=messages,
            max_tokens=20,
            temperature=0.1,
        )

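        # Add the relevancy scores for this chunk to the list.
        # Compare whole words: the substring test `"relevant" in content`
        # is also true for "irrelevant".
        for response in responses:
            content = response["choices"][0]["message"]["content"].lower()
            words = "".join(c if c.isalpha() else " " for c in content).split()
            relevancy_scores.append(1 if "relevant" in words else 0)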

    return relevancy_scores


# Call this function for the entire filtered_df and add the responses as a new column.
# filtered_df came from head(), so take an explicit copy first to avoid SettingWithCopyWarning.
filtered_df = filtered_df.copy()
filtered_df["relevancy"] = classify_comments(filtered_df)


In [95]:
filtered_df

Unnamed: 0,post_id,delta,body,assertivity,num_tokens,relevancy
0,pz0w6x,False,The biggest problem with two spaces is that it...,4.303459,240,1
1,pz0w6x,False,Do you think your post is difficult to disging...,4.838442,17,0
2,pz0w6x,True,"That is a very good point, I have had troubles...",4.822953,50,1
3,pz0w6x,False,I just realised that Reddit automatically made...,4.462715,47,1
4,pz0w6x,False,What kind of argument could change what you pe...,4.165082,13,0
...,...,...,...,...,...,...
995,pzri5s,False,We could start by actually targeting the real ...,4.470484,124,1
996,pzri5s,False,"As a man, have you been consistently advised t...",5.03484,235,1
997,pzri5s,False,whoosh. your misogyny and racism have rendered...,4.990645,21,1
998,pzri5s,False,"It's a ""don't do what she did, or you'll end u...",4.924634,42,1


In [96]:
filtered_df.to_csv('filtered_df_with_relevancy.csv', index=False)
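
Turning the model's reply into a 0/1 score depends on how the label is matched. A plain substring check for "relevant" is also satisfied by "irrelevant", so the sketch below matches whole words instead. The helper names `parse_label` and `to_score` are hypothetical, and the mapping (1 for "relevant", 0 for anything else) follows the scoring used in this notebook.

```python
import re

# Whole-word match for the three labels named in the system prompt
_LABEL_RE = re.compile(r"\b(irrelevant|neutral|relevant)\b")


def parse_label(content):
    """Return the first whole-word label found in the model reply, or None."""
    match = _LABEL_RE.search(content.lower())
    return match.group(1) if match else None


def to_score(content):
    """Map a model reply to 1 (relevant) or 0 (neutral, irrelevant, or unparsable)."""
    return 1 if parse_label(content) == "relevant" else 0


print("relevant" in "irrelevant")   # the substring test cannot tell the labels apart
print(to_score("irrelevant"), to_score("Relevant."))
```

Returning `None` from `parse_label` for unparsable replies keeps open the option of inspecting those rows separately rather than silently scoring them 0.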

Try the same thing with the larger Llama 3 70B model and compare
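
One simple way to compare two models' scores, once both exist, is an agreement rate plus a count of each score pair. The sketch below is an assumption about how such a comparison could look, not the method used in this project; `compare_labels` is a hypothetical helper and the score lists are made up.

```python
from collections import Counter


def compare_labels(labels_a, labels_b):
    """Return (agreement rate, Counter of (a, b) pairs) for two equal-length lists."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = Counter(zip(labels_a, labels_b))
    agree = sum(n for (a, b), n in pairs.items() if a == b)
    return agree / len(labels_a), pairs


# Hypothetical scores, not real model output
scores_8b = [1, 0, 1, 1, 0]
scores_70b = [1, 0, 0, 1, 1]

rate, pairs = compare_labels(scores_8b, scores_70b)
print(rate)            # share of comments where both models agree
print(pairs[(1, 0)])   # comments the first model kept and the second rejected
```

With the columns produced in this notebook, the inputs would be the `relevancy` and `relevancy_70B` columns converted with `.tolist()`.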

In [144]:
import litellm
from litellm import batch_completion
import os
import openai
import asyncio
import aiohttp

# AWS credentials are expected to be set in the environment already
# (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY); never hardcode keys in a notebook.
os.environ["AWS_REGION_NAME"] = "us-east-1"
client = openai.Client(api_key="anything",
                       base_url="http://0.0.0.0:4000")

litellm.set_verbose = True

async def classify_comment(session, model, message, max_tokens=20, temperature=0.1):
    response = await session.post(
        "http://0.0.0.0:4000/chat/completions",
        json={
            "model": model,
            "messages": message,
            "max_tokens": max_tokens,
            "temperature": temperature,
        },
    )
    print(response)
    response_json = await response.json()
    print(response_json)
    return response_json

async def classify_comments_70B(df, chunk_size=10, delay=0.1):
    relevancy_scores = []

    async with aiohttp.ClientSession() as session:
        for i in range(0, len(df), chunk_size):

            tasks = []
            for _, row in df[i:i+chunk_size].iterrows():
                op_text = posts_df[posts_df["post_id"] == row["post_id"]]["selftext"].values[0]

                system_message = """
                We are working with data from the subreddit r/changemyview, where each entry consists of an original post (OP) followed by comments \
                in a list format. Your task is to classify each comment based on whether it tries to persuade the OP or another commenter to change \
                their view. Classify as Relevant if the comment attempts to persuade, Neutral if it asks a non-rhetorical question or states unrelated \
                information, and Irrelevant if it contains information unrelated to the topic. For each comment, read the OP text and the comment, \
                then provide a single word classification: "relevant", "neutral", or "irrelevant". For example, if the OP is "I believe electric \
                cars are not a viable solution to climate change" and the comment is "Have you considered the advancements in battery technology?", \
                classify it as "relevant". If the comment is "How many electric cars are currently on the road?", classify it as "neutral". \
                If the comment is "I like electric cars, they look cool!", classify it as "irrelevant".
                """

                user_message = f"OP text: {op_text}\nComment: {row['body']}"

                message = [
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": user_message},
                ]

                task = classify_comment(
                    session=session,
                    model="bedrock/meta.llama3-70b-instruct-v1:0",
                    message=message,
                )
                tasks.append(task)

            responses = await asyncio.gather(*tasks)

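            for response in responses:
                print(response)
                if "choices" not in response:
                    # The proxy returned an error payload instead of a completion;
                    # record a missing score rather than raising KeyError.
                    relevancy_scores.append(None)
                    continue
                content = response["choices"][0]["message"]["content"].lower()
                # Compare whole words: the substring test `"relevant" in content`
                # is also true for "irrelevant".
                words = "".join(c if c.isalpha() else " " for c in content).split()
                relevancy_scores.append(1 if "relevant" in words else 0)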

            await asyncio.sleep(delay)  # Add a delay between each chunk

    return relevancy_scores



In [145]:
# Enable autoawait
%autoawait on

# Process the dataset in chunks
chunk_size = 100  # Adjust this value based on your memory constraints
for i in range(0, len(filtered_df), chunk_size):
    chunk = filtered_df[i:i+chunk_size]
    relevancy_scores = await classify_comments_70B(chunk, 10, 0.1)
    chunk["relevancy_70B"] = relevancy_scores
    chunk.to_csv(f'filtered_df_with_relevancy_{i//chunk_size}.csv', index=False)

<ClientResponse(http://0.0.0.0:4000/chat/completions) [200 OK]>
<CIMultiDictProxy('Date': 'Tue, 04 Jun 2024 17:27:52 GMT', 'Server': 'uvicorn', 'Content-Length': '338', 'Content-Type': 'application/json', 'x-litellm-version': '1.40.0', 'x-litellm-key-tpm-limit': 'None', 'x-litellm-key-rpm-limit': 'None')>

{'id': 'chatcmpl-d7da608a-4b14-446b-aa38-c269d8ae12db', 'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'content': 'relevant', 'role': 'assistant'}}], 'created': 1717522074, 'model': 'meta.llama3-70b-instruct-v1:0', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': {'prompt_tokens': 664, 'completion_tokens': 2, 'total_tokens': 666}}
<ClientResponse(http://0.0.0.0:4000/chat/completions) [200 OK]>
<CIMultiDictProxy('Date': 'Tue, 04 Jun 2024 17:27:52 GMT', 'Server': 'uvicorn', 'Content-Length': '338', 'Content-Type': 'application/json', 'x-litellm-version': '1.40.0', 'x-litellm-key-tpm-limit': 'None', 'x-litellm-key-rpm-limit': 'None')>

{'id': 'chatcmpl-2d

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk["relevancy_70B"] = relevancy_scores


<ClientResponse(http://0.0.0.0:4000/chat/completions) [200 OK]>
<CIMultiDictProxy('Date': 'Tue, 04 Jun 2024 17:28:13 GMT', 'Server': 'uvicorn', 'Content-Length': '338', 'Content-Type': 'application/json', 'x-litellm-version': '1.40.0', 'x-litellm-key-tpm-limit': 'None', 'x-litellm-key-rpm-limit': 'None')>

{'id': 'chatcmpl-1340ddfe-a463-454f-8e24-faff1058440a', 'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'content': 'relevant', 'role': 'assistant'}}], 'created': 1717522095, 'model': 'meta.llama3-70b-instruct-v1:0', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': {'prompt_tokens': 720, 'completion_tokens': 2, 'total_tokens': 722}}
<ClientResponse(http://0.0.0.0:4000/chat/completions) [200 OK]>
<CIMultiDictProxy('Date': 'Tue, 04 Jun 2024 17:28:13 GMT', 'Server': 'uvicorn', 'Content-Length': '338', 'Content-Type': 'application/json', 'x-litellm-version': '1.40.0', 'x-litellm-key-tpm-limit': 'None', 'x-litellm-key-rpm-limit': 'None')>

{'id': 'chatcmpl-37



<ClientResponse(http://0.0.0.0:4000/chat/completions) [200 OK]>
<CIMultiDictProxy('Date': 'Tue, 04 Jun 2024 17:28:28 GMT', 'Server': 'uvicorn', 'Content-Length': '340', 'Content-Type': 'application/json', 'x-litellm-version': '1.40.0', 'x-litellm-key-tpm-limit': 'None', 'x-litellm-key-rpm-limit': 'None')>

{'id': 'chatcmpl-f98a8415-a778-4c32-921c-bdc726b81797', 'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'content': 'relevant', 'role': 'assistant'}}], 'created': 1717522110, 'model': 'meta.llama3-70b-instruct-v1:0', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': {'prompt_tokens': 1066, 'completion_tokens': 2, 'total_tokens': 1068}}
<ClientResponse(http://0.0.0.0:4000/chat/completions) [200 OK]>
<CIMultiDictProxy('Date': 'Tue, 04 Jun 2024 17:28:28 GMT', 'Server': 'uvicorn', 'Content-Length': '338', 'Content-Type': 'application/json', 'x-litellm-version': '1.40.0', 'x-litellm-key-tpm-limit': 'None', 'x-litellm-key-rpm-limit': 'None')>

{'id': 'chatcmpl-



<ClientResponse(http://0.0.0.0:4000/chat/completions) [200 OK]>
<CIMultiDictProxy('Date': 'Tue, 04 Jun 2024 17:28:47 GMT', 'Server': 'uvicorn', 'Content-Length': '338', 'Content-Type': 'application/json', 'x-litellm-version': '1.40.0', 'x-litellm-key-tpm-limit': 'None', 'x-litellm-key-rpm-limit': 'None')>

{'id': 'chatcmpl-a3b9f1ed-7edf-43d8-8457-181de95b74d3', 'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'content': 'relevant', 'role': 'assistant'}}], 'created': 1717522129, 'model': 'meta.llama3-70b-instruct-v1:0', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': {'prompt_tokens': 758, 'completion_tokens': 2, 'total_tokens': 760}}
<ClientResponse(http://0.0.0.0:4000/chat/completions) [200 OK]>
<CIMultiDictProxy('Date': 'Tue, 04 Jun 2024 17:28:47 GMT', 'Server': 'uvicorn', 'Content-Length': '342', 'Content-Type': 'application/json', 'x-litellm-version': '1.40.0', 'x-litellm-key-tpm-limit': 'None', 'x-litellm-key-rpm-limit': 'None')>

{'id': 'chatcmpl-ea



<ClientResponse(http://0.0.0.0:4000/chat/completions) [200 OK]>
<CIMultiDictProxy('Date': 'Tue, 04 Jun 2024 17:29:03 GMT', 'Server': 'uvicorn', 'Content-Length': '340', 'Content-Type': 'application/json', 'x-litellm-version': '1.40.0', 'x-litellm-key-tpm-limit': 'None', 'x-litellm-key-rpm-limit': 'None')>

{'id': 'chatcmpl-8b35c8ef-504b-4838-b9d3-0162368b20db', 'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'content': 'irrelevant', 'role': 'assistant'}}], 'created': 1717522145, 'model': 'meta.llama3-70b-instruct-v1:0', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': {'prompt_tokens': 664, 'completion_tokens': 3, 'total_tokens': 667}}
<ClientResponse(http://0.0.0.0:4000/chat/completions) [200 OK]>
<CIMultiDictProxy('Date': 'Tue, 04 Jun 2024 17:29:03 GMT', 'Server': 'uvicorn', 'Content-Length': '340', 'Content-Type': 'application/json', 'x-litellm-version': '1.40.0', 'x-litellm-key-tpm-limit': 'None', 'x-litellm-key-rpm-limit': 'None')>

{'id': 'chatcmpl-

KeyError: 'choices'
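
The `KeyError` above occurs when a reply carries no `choices` key, which is what an error payload from the proxy looks like. One possible way to handle this is to retry the request a few times with a growing delay and give up with `None`. The sketch below is an assumption, not part of the original pipeline: `call_with_retry`, its parameters, and the fake request are hypothetical.

```python
import asyncio


async def call_with_retry(make_request, max_attempts=3, base_delay=1.0):
    """Await make_request() until the reply contains 'choices' or attempts run out."""
    for attempt in range(max_attempts):
        reply = await make_request()
        if "choices" in reply:
            return reply
        # Error payload: back off exponentially before the next attempt
        if attempt < max_attempts - 1:
            await asyncio.sleep(base_delay * 2 ** attempt)
    return None


async def _demo():
    # Fake request: fails once with an error payload, then succeeds
    replies = iter(
        [
            {"error": {"message": "throttled", "code": 500}},
            {"choices": [{"message": {"content": "relevant"}}]},
        ]
    )

    async def fake_request():
        return next(replies)

    return await call_with_retry(fake_request, base_delay=0.0)


result = asyncio.run(_demo())
print(result["choices"][0]["message"]["content"])
```

Inside a notebook with a running event loop, `await _demo()` replaces `asyncio.run(_demo())`. In the code above, the request would be wrapped as `lambda: classify_comment(session, model, message)`.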

In [121]:
# call the proxy's chat completions endpoint directly with a minimal request
import requests

response = requests.post(
    "http://0.0.0.0:4000/chat/completions",
    json={
        "model": "bedrock/meta.llama3-8b-instruct-v1:0",
        "messages": [
            {"role": "system", "content": "System message"},
            {"role": "user", "content": "User message"},
        ],
        "max_tokens": 20,
        "temperature": 0.1,
    },)

response.json()


{'error': {'message': "cannot access local variable 'response' where it is not associated with a value",
  'type': None,
  'param': None,
  'code': 500}}

In [137]:
posts_df.iloc[1].selftext

'I\'ll also address this one bit first: If you didn\'t graduate a specialized higher education, and still say that there is value to those classes, be ready for me fiercely defending my point. I won\'t attack you personally, but as someone who\'s been made completely uninterested in higher education by General Education classes (henceforth: GE classes), despite having a passion for what I went there to study, I have a lot of animosity towards that.\n\n\\----------------------------------------\n\nGeneral Education classes, or at least stuff that I\'ve seen stapled on most specialized education in college, were "Literature", "Philosophy", "Advanced Mathematics" (usually statistics), "History", and "Economics". When I mean GE classes, for the purposes of this post, I\'m mainly referring to these classes, all of which I was forced into in college.\n\n\\-------------------------------------------\n\nLiterature, History, and Philosophy were, put simply, completely un-interesting. We were le