### Preprocess Discord JSON Data

This notebook preprocesses messages extracted from the AutoGen Discord. It filters the messages and formats them as text that can be stored in a vector store for retrieval-augmented generation (RAG).

In [None]:
import os
import sys
import json
import glob
import pandas as pd

sys.path.append("../")

from utils import api_utils

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

# Discord message JSON files
path = '../data/chat_logs/*.json'

In [None]:
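# Load one channel's JSON export into a DataFrame with one row per message.
# Returns None (after printing the error) if the file cannot be parsed.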
def process_file(file_path):
    file_name = os.path.basename(file_path).split('.')[0]
    with open(file_path, 'r') as file:
        try:
            messages = json.load(file)
            return pd.DataFrame([{
                'channel': file_name,
                'author_username': item['author']['username'],
                'timestamp': item['timestamp'],
                'content': item['content'],
                'embeds': item['embeds']
            } for item in messages])
        except json.JSONDecodeError:
            print(f"Error decoding JSON from {file_path}")
        except Exception as e:
            print(f"Error processing {file_path}: {e}")

# Approximate the number of tokens in a string by counting whitespace-separated words
def get_token_len(text):
    return len(text.split())

try:
    # Parse each file once and skip files that failed to load
    frames = [frame for frame in map(process_file, glob.glob(path)) if frame is not None]
    df = pd.concat(frames, ignore_index=True)
    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')
    df['num_tokens'] = df.content.apply(get_token_len)
    # True when the message carries at least one embed (e.g. a link preview)
    df['has_embedding'] = df.embeds.apply(lambda x: x != [])
    # Keep every message with an embed; drop embed-free messages shorter than 5 tokens
    df = df[df['has_embedding'] | (df['num_tokens'] >= 5)]

except Exception as e:
    print(f"Error: {e}")
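
To make the filtering rule concrete, here is a minimal sketch on made-up records, written in plain Python so it runs without pandas. The sample messages and the `keep_message` helper are hypothetical and not part of the notebook; the rule itself (keep anything with an embed, otherwise require at least 5 whitespace-separated words) follows the cell above.

```python
# Hypothetical sample records in the shape process_file produces (illustrative only)
messages = [
    {"content": "thanks!", "embeds": []},
    {"content": "add --pre flag to see pre-releases", "embeds": []},
    {"content": "https://example.com/docs",
     "embeds": [{"title": "Docs", "description": "Install guide"}]},
]

def keep_message(message, min_tokens=5):
    """Keep messages that have an embed, or at least min_tokens words."""
    has_embedding = message["embeds"] != []
    num_tokens = len(message["content"].split())
    return has_embedding or num_tokens >= min_tokens

kept = [m["content"] for m in messages if keep_message(m)]
print(kept)
```

The short "thanks!" message is dropped, while the bare link survives because its embed may carry a useful title and description.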

In [None]:
df.channel.value_counts()

In [None]:
df.sort_values(['channel', 'timestamp'])

In [None]:
# write_cols = ['channel', 'author_username', 'timestamp', 'content', 'embeds', 'has_embedding']
output_file = "../data/docs/22112023_chat_history.txt"

with open(output_file, 'w') as file:
    for index, row in df.iterrows():

        additional_context = ""
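        try:
            if row.has_embedding:
                # Embeds are loaded from JSON, so each one is a dict
                embed = row.embeds[0]
                additional_context = (f"""Additional information about the content linked by this user: 
                - Link title: {embed['title']}
                - Link description: {embed['description']}
                """)
        except (KeyError, IndexError, TypeError):
            # The embed has no title or description; write the message without extra context
            pass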

        formatted_text = (f"""In {row.channel}, at {row.timestamp} a user named {row.author_username} said ```{row.content}```.\n {additional_context}""").strip()
        file.write(formatted_text + '\n')
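
Because the embeds are loaded from JSON, each one is a plain dict rather than an object with attributes. The sketch below shows one way to build a single output line with dict lookups. `format_message` and the sample record are hypothetical names used only for illustration; the wording of the line follows the cell above.

```python
FENCE = "`" * 3  # triple backtick, built here to keep this example readable

def format_message(channel, timestamp, author, content, embeds):
    """Build one line of chat history text; embeds is a list of dicts (may be empty)."""
    additional_context = ""
    if embeds:
        embed = embeds[0]
        # .get() avoids a KeyError when an embed lacks a title or description
        title = embed.get("title", "")
        description = embed.get("description", "")
        additional_context = (
            "Additional information about the content linked by this user:\n"
            f"- Link title: {title}\n"
            f"- Link description: {description}"
        )
    text = (f"In {channel}, at {timestamp} a user named {author} "
            f"said {FENCE}{content}{FENCE}.\n{additional_context}")
    return text.strip()

# Hypothetical record for illustration
line = format_message(
    "issues-and-help", "2023-11-14 19:01:34", "sonichi",
    "https://example.com/docs",
    [{"title": "Installation", "description": "How to install"}],
)
print(line)
```

Using `.get()` with a default keeps the message even when an embed is missing a field, instead of silently discarding the extra context.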
        


---

In [None]:
context = '''
In issues-and-help, at 2023-11-15 10:41:54 a user named Lega said ```im novice w programming so i may not have explained it in the best way. But I basically am trying to figure out the same thing as this member in the server https://discord.com/channels/1153072414184452236/1153072414184452241/1174253851285667911```.
In issues-and-help, at 2023-11-14 19:49:12 a user named razahin said ```Hi @.beibinli, thank you very much for your offer of assistance. I have attached the two files here. I've also included an example of the output.

The design.jpg files is located under a folder called `coding` which is located at the same level as main.py. 

I've also tried `user_proxy.initiate_chat(designAnalyzer, message="""
Load the image from the <img ./coding/design.jpg> file location for processing by an analyzer""")` in case of there being a location issue but recieved the same results.```.
In issues-and-help, at 2023-11-14 19:01:34 a user named sonichi said ```https://microsoft.github.io/autogen/docs/Installation#python```.
In issues-and-help, at 2023-11-14 18:11:25 a user named ariel.andres said ```Hi, I am trying to run the following example code on my computer:

https://github.com/microsoft/autogen/blob/main/notebook/agentchat_auto_feedback_from_code_execution.ipynb

If I try to use Autogen 0.1.14 with Openai 0.28.1 (by just using a pip install pyautogen), I get the following error:

raise self.handle_error_response(
openai.error.InvalidRequestError: Invalid URL (POST /v1/openai/deployments/InnovationGPT4-32/chat/completions)

If I instead do a pip install autogen==0.2.0b5 (which also installs Openai 1.2.4), it runs perfectly fine.

This is the structure for my OAI_CONFIG_LIST.json:

[
    {
        "model": "InnovationGPT4-32",
        "api_key": "xxxxx",
        "base_url": "https://innovationopenaiservice.openai.azure.com/",
        "api_type": "azure",
        "api_version": "2023-06-01-preview"
    }
]

The problem is, while it runs fine with Autogen 0.2+ and Openai 1.0+, we are trying to integrate Autogen in a project that already uses Openai 0.27.4, so we would like to use Autogen 0.1.14.```.
In issues-and-help, at 2023-11-14 15:50:34 a user named aaronward_ said ```It was my fault, i didn't format the tool config correctly. Have shared an example notebook here: https://discord.com/channels/1153072414184452236/1173957465285611551```.
In issues-and-help, at 2023-11-14 12:39:00 a user named razahin said ```I've been taking inspiration from the https://github.com/microsoft/autogen/blob/v0.2.0b4/notebook/agentchat_lmm_gpt-4v.ipynb notebook for using GPT4 with vision. Running the examples as written in the notebook works. Passing a public image file, such as one hosted on openAI, also works.

When I attempt to modify the commander, coder, and critic prompts to read a local file I am consistently met with messages like 

`I'm sorry for any confusion, but it appears there may have been a misunderstanding. As an AI text-based interface, I don't have the capability to interpret images or run code.` 

Otherwise the commander coder workflow will often go completely off the rails. 

I'm looking for suggestions on how I should approach feeding an image which is located in the local working directory into the MultimodalConversableAgent. The user prompt I am trying is

```
user_proxy.initiate_chat(designAnalyzer, message="""
Load the image from the design.jpg file location for processing by an analyzer""")
``````.
In issues-and-help, at 2023-11-14 11:56:28 a user named aaronward_ said ```Nevermind, got it working - it was the assistand_id that was messing it up. I passed None and it worked. 

One thing i noticed so far: its fast! it doesn't waste time talking back and foward with the UserProxy before writing the query - potentially saving money on token costs, i'll need to look into this further. It also seems counter intuitive because usually the UserProxy is the one who is executing the code. it's a great addition to autogen from what i can tell so far.```.
In issues-and-help, at 2023-11-14 09:28:20 a user named aaronward_ said ```i'm having issues with my agents not using the provided tool, wondering if someone could give me help - the user proxy keeps trying to run the sql as a file rather than pass it as a string to a python function which is registered.

I'm testing out the new **GPTAssistantAgent** with a postgres sql operation. 

https://github.com/AaronWard/generative-ai-workbook/blob/main/personal_projects/14.openai-assistant-api/OpenAi-assistant-with-autogen.ipynb```.
In issues-and-help, at 2023-11-14 04:12:31 a user named levre said ```My group chat eventually turns into user_proxy repeatedly calling GPT4 and (I guess?) getting no response so calling it again.  Is there some way to add logic to terminate when this happens?  It appears to be sending the full context every time, so I'm getting charged for input tokens.```.
In issues-and-help, at 2023-11-11 22:40:45 a user named pika.c said ```For now to deal with this I'm creating a new GroupChat object with messages from the previous one and a new GroupChatManager. The problem is that the new agents in the group chat do not have context of what is going on. 
I'm pretty sure there's a better way to go about this. I'll keep experimenting and update this thread if I make any progress.
https://github.com/toshNaik/TaleCraft/tree/main```.
In issues-and-help, at 2023-11-11 19:44:40 a user named ab.z said ```add --pre flag to see pre-releases```.
In issues-and-help, at 2023-11-11 19:42:12 a user named ab.z said ```Because it is not released yet, it is a pre-release```.
In issues-and-help, at 2023-11-11 13:10:24 a user named malicor said ```i'm having this code:

https://pastebin.com/faWpJi6T

but it's still throwing the same error```.
In issues-and-help, at 2023-11-10 04:06:40 a user named reporter said ```https://discord.com/channels/1153072414184452236/1162811675762753589/1171829828232695870

Might be helpful but let me know if it isn't.```.
In issues-and-help, at 2023-11-09 19:20:12 a user named sonichi said ```Check this: https://microsoft.github.io/autogen/docs/Installation#python```.
In issues-and-help, at 2023-11-09 18:46:49 a user named yigitkonur said ```for those who are facing with this issue, here is how I fixed this:

you should install autogen by following command:

```
pip install pyautogen==0.2.0b2
```

here is how to load config to fix this problem:

```
import autogen

config_list = [
    {
        "model": "YOUR_DEPLOYMENT_NAME",  
        "base_url": "https://xxx.openai.azure.com", 
        "api_type": "azure", 
        "api_version": "2023-07-01-preview", 
        "api_key": "xxx"
 }
]
``````.
In issues-and-help, at 2023-11-09 16:06:01 a user named c_bonadio said ```Hi @wadymc I spent some time with function_call and I think I got some reasonable understanding.

I even created an agent that can self execute its own function call
https://gist.github.com/bonadio/96435a1b6ccc32297aa8cc1db7cfc381
'''

In [None]:
prompt = f'''// Goal: Extract useful Q&A pairs from a given text.
Here is the text snippet of interest:
{context}


// Requirements:
// 1. Identify and compile 10 pairs of questions and answers.
// 2. Exclude irrelevant conversational speech.
// 3. Include complete URLs where relevant.
// 4. Omit specific usernames, channels, or timestamps.
// 5. Focus on general-use content, avoiding overly specific user conversations.
// 6. When citing error messages, include the complete error while omitting personal identifiers like IDs or usernames.
// 7. Where relevant, supplement answers with complete code snippets in a code block format.
// Restrictions:
// - Do not invent or create answers; rely solely on the provided text.

// Instructions for AI:
// Analyze the provided text, adhering to the above guidelines, to extract relevant and general Q&A pairs that would be beneficial for a broader audience. Ensure that each answer is clearly connected to its question, maintaining the integrity and context of the original discussion.

// Examples:
```
Question: How do I load an image for processing by an analyzer in Python?
Answer: Use the following code to load an image from a specified file location for processing:
```
user_proxy.initiate_chat(designAnalyzer, message="""Load the image from the <img ./coding/design.jpg> file location for processing by an analyzer""")
```

Question: How do I install a specific version of Autogen with OpenAI?
Answer: To install Autogen 0.2.0b5 with OpenAI 1.2.4, use the command:
```
pip install autogen==0.2.0b5
```
This resolves the issue with the error: InvalidRequestError: Invalid URL (POST /v1/openai/deployments/InnovationGPT4-32/chat/completions) encountered with earlier versions.

Question: How do I integrate Autogen with a project using an older version of OpenAI?
Answer: To integrate Autogen with a project using OpenAI 0.27.4, you might face compatibility issues. Autogen 0.1.14 raises an InvalidRequestError with OpenAI 0.28.1, while Autogen 0.2+ runs fine with OpenAI 1.0+. The compatibility must be checked between specific versions.
```
'''
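
Note that an f-string is interpolated once, when the cell runs, so the `prompt` above is tied to the single `context` value defined earlier. If the same instructions should be reused for many chunks, one option is a plain template filled in per chunk. This is a sketch under that assumption: `PROMPT_TEMPLATE` is a shortened, hypothetical stand-in for the prompt above and `build_prompt` is not part of the notebook.

```python
# Shortened stand-in for the notebook's prompt (hypothetical, for illustration)
PROMPT_TEMPLATE = """// Goal: Extract useful Q&A pairs from a given text.
Here is the text snippet of interest:
{context}

// Restrictions:
// - Do not invent or create answers; rely solely on the provided text.
"""

def build_prompt(context):
    """Insert one chunk of chat history into the template."""
    return PROMPT_TEMPLATE.format(context=context)

print(build_prompt("In issues-and-help, a user said: add --pre flag to see pre-releases"))
```

`str.format` does not re-parse the substituted value, so braces inside the chat text (for example JSON config snippets) pass through unchanged.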

In [None]:
get_token_len(context)
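
Word counts like this are only a rough stand-in for model tokens, and the choice of split matters for multi-line messages: `str.split(' ')` separates on single spaces only, while `str.split()` separates on any run of whitespace, including newlines. A small comparison on a made-up sample (both helper names are hypothetical):

```python
def count_single_space(text):
    """Split on single spaces only; newlines do not separate words."""
    return len(text.split(' '))

def count_whitespace(text):
    """Split on any run of whitespace (spaces, tabs, newlines)."""
    return len(text.split())

# Made-up multi-line message for illustration
sample = "Hi,\nI am trying\nto run this"
print(count_single_space(sample))
print(count_whitespace(sample))
```

Neither number is a true token count; an exact figure would require the tokenizer of the target model.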

In [None]:
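input_file = "../data/docs/22112023_chat_history.txt"
output_file = "../data/docs/22112023_qa.txt"
chunk_size = 3000  # approximate tokens (whitespace-separated words) per chunk

# Assumption: the text to chunk is the chat history file written earlier
with open(input_file, 'r') as file:
    words = file.read().split(' ')

with open(output_file, 'w') as file:
    # Iterate over chunks of the text file with a size of 3000 tokens
    for start in range(0, len(words), chunk_size):
        chunk = ' '.join(words[start:start + chunk_size])

        # Assumption: api_utils.prompt returns the model response as a string
        prompt_response = api_utils.prompt(chunk)
        print(prompt_response)

        file.write(prompt_response.strip() + '\n')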
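
The comments in the cell above describe iterating over the text in chunks of about 3000 tokens. A minimal sketch of one way to form such chunks, grouping whole lines so that no line is cut in half. `chunk_lines` is a hypothetical helper, not part of the notebook, and token counts are approximated by whitespace-separated words in the same spirit as `get_token_len`.

```python
def chunk_lines(lines, max_tokens=3000):
    """Group whole lines into chunks of at most max_tokens whitespace-separated words.

    A single line longer than max_tokens becomes its own (oversized) chunk.
    """
    chunks, current, current_len = [], [], 0
    for line in lines:
        line_len = len(line.split())
        # Start a new chunk when adding this line would exceed the limit
        if current and current_len + line_len > max_tokens:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += line_len
    if current:
        chunks.append("\n".join(current))
    return chunks

# Tiny illustration with a small limit
demo = ["one two three", "four five", "six seven eight nine"]
print(chunk_lines(demo, max_tokens=5))
```

Messages that span several lines can still be divided between two chunks with this approach; splitting on the "In <channel>, at <timestamp>" prefix would be needed to keep whole messages together.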