.py files are not being created. I just get all_output.txt that I manually have to create from. #35

Closed
mindwellsolutions opened this issue Jun 14, 2023 · 34 comments

@mindwellsolutions

Hi, I absolutely love this script. This is the most accurate auto-GPT development script I have tried yet, it's so powerful!

The demo video shows the script automatically creating each development file (in my case, .py files) inside the workspace folder. My build isn't doing this: I just get an all_output.txt file containing the code for every .py file in one place, plus a single Python file.

How do I make sure GPT-Engineer automatically creates the .py files for me? Thanks.

@kawpls

kawpls commented Jun 14, 2023

I have the same issue.

@Dawkinspv

Dawkinspv commented Jun 14, 2023

A hack for Python:

import re

def parse_chat(chat):  # -> List[Tuple[str, str]]
    # Match a `filename` in single backticks followed by a ``` code block
    regex = r"`(.*?)`.*?```(python)?(.*?)```"
    matches = re.finditer(regex, chat, re.DOTALL)

    files = []
    for match in matches:
        # Group 1 is the file name, group 3 is the code
        path = match.group(1)
        code = match.group(3).strip()

        # Add the file to the list
        files.append((path, code))

    return files

@jebarpg
Contributor

jebarpg commented Jun 15, 2023

Here's my solution:

import re
import os

save_dir = "results/"
# Read the combined output produced by GPT-Engineer
with open("example/workspace/all_output.txt", "r") as f:
    s = f.read()
#pattern = re.compile(r'^\*\*(.*?\.py)\*\*\s+```python\s+.*?(^(?:.*\n)*?)^```\s*', re.MULTILINE) #Example **game.py**
#pattern = re.compile(r'^(.*?\.py):\s+```python\s+.*?(^(?:.*\n)*?)^```\s*', re.MULTILINE) #Example game.py:
pattern = re.compile(r'^.*?\((.*?\.py)\):\s+```python\s+.*?(^(?:.*\n)*?)^```\s*', re.MULTILINE) #Example Game File (game.py):
os.makedirs(save_dir, exist_ok=True)
# Each match is a (file_name, file_text) pair; write one file per match
for (file_name, file_text) in re.findall(pattern, s):
    with open(os.path.join(save_dir, file_name), "w") as write_file:
        write_file.write(file_text)
    print(file_name, "\n")

Sometimes you have to change the regex, because the output format isn't always the same.
If you see **game.py**, use:
pattern = re.compile(r'^\*\*(.*?\.py)\*\*\s+```python\s+.*?(^(?:.*\n)*?)^```\s*', re.MULTILINE) #Example **game.py**
If you see game.py:, use:
pattern = re.compile(r'^(.*?\.py):\s+```python\s+.*?(^(?:.*\n)*?)^```\s*', re.MULTILINE) #Example game.py:
If you see 'Game File (game.py):', use:
pattern = re.compile(r'^.*?\((.*?\.py)\):\s+```python\s+.*?(^(?:.*\n)*?)^```\s*', re.MULTILINE) #Example Game File (game.py):
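
Rather than swapping the pattern by hand, the three header styles could be tried in turn until one matches. This is only a sketch: the names HEADER_PATTERNS and extract_files are made up for illustration, and it assumes each run of all_output.txt uses a single header style throughout.

```python
import re

# Three backticks, built this way so the example can sit inside a fenced block.
FENCE = "`" * 3

# Shared tail of the three patterns above: a fenced python block whose
# lines are captured as the file text.
BODY = r"\s+" + FENCE + r"python\s+.*?(^(?:.*\n)*?)^" + FENCE + r"\s*"

# Hypothetical lookup table: header style -> pattern for that style.
HEADER_PATTERNS = {
    "**game.py**": r"^\*\*(.*?\.py)\*\*" + BODY,
    "game.py:": r"^(.*?\.py):" + BODY,
    "Game File (game.py):": r"^.*?\((.*?\.py)\):" + BODY,
}


def extract_files(text):
    """Return (style, matches) for the first header style that matches."""
    for style, pattern in HEADER_PATTERNS.items():
        matches = re.findall(pattern, text, re.MULTILINE)
        if matches:
            return style, matches
    # None of the known styles matched; a new pattern is needed.
    return None, []
```

Each entry in matches is a (file_name, file_text) pair, so the result can be fed straight into the write loop above. If the style comes back as None, the output uses a format that none of the three patterns covers.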

@DrMemoryFish

Where do I put this?

@gartlans

+1

@jebarpg
Contributor

jebarpg commented Jun 15, 2023

Create a file called 'create_files.py' inside the main repo directory, then run python create_files.py.
It will create the files inside a directory called results, where you can see everything it created. Then run python results/{name of the entry point file created by GPT}.py

@jebarpg
Contributor

jebarpg commented Jun 15, 2023

Is @offiub a bot account?

@jebarpg
Contributor

jebarpg commented Jun 15, 2023

It's kind of annoying, and it's asking for private data. Please stop requesting emails, phone numbers, or anything else private.

@mindwellsolutions
Author

mindwellsolutions commented Jun 15, 2023

Hi, I ran your script in the main GPT-Engineer folder with save_dir and the open() path pointed at the correct all_output.txt file. The script runs and creates the results folder, but leaves it empty. The all_output.txt file has all of the code inside it, so it seems the script isn't building the .py files in /results for me.

This is your code with my custom paths. (In all_output.txt the file titles have the format: entrypoint.py, file_converter.py, gui.py)

import re
import os

save_dir = "/home/ailocal/apps/gpt-engineer/results/"
f = open("/home/ailocal/apps/gpt-engineer/example/workspace/all_output.txt", "r")
s = f.read()
#pattern = re.compile(r'^\*\*(.*?\.py)\*\*\s+```python\s+.*?(^(?:.*\n)*?)^```\s*', re.MULTILINE) #Example **game.py**
pattern = re.compile(r'^(.*?\.py):\s+```python\s+.*?(^(?:.*\n)*?)^```\s*', re.MULTILINE) #Example game.py:
#pattern = re.compile(r'^.*?\((.*?\.py)\):\s+```python\s+.*?(^(?:.*\n)*?)^```\s*', re.MULTILINE) #Example Game File (game.py):
os.makedirs(save_dir, exist_ok=True)
for (file_name, file_text) in re.findall(pattern, s):
    write_file = open(save_dir + file_name, "w")
    write_file.write(file_text)
    write_file.close()
    print(file_name, "\n")

@jebarpg
Contributor

jebarpg commented Jun 15, 2023

Hopefully the repo will find a way to keep the formatting of all_output.txt consistent soon. I've had to write about a dozen different patterns to capture different formatting scenarios.

Can you give me a chunk of your all_output.txt file, starting where the first code file begins? I can write a correct regex once I see it.

@jebarpg
Contributor

jebarpg commented Jun 15, 2023

It's probably going to be something like this:

pattern = re.compile(r'^(.*?\.py)\s+```python\s+.*?(^(?:.*\n)*?)^```\s*', re.MULTILINE) #Example game.py

But I can't be sure until I see a chunk of your output.

@zajcomm

zajcomm commented Jun 15, 2023

I am trying to follow these steps, but I don't see any game.py. Where should this file be?

@jebarpg
Contributor

jebarpg commented Jun 15, 2023

That's just an example. If your generated project doesn't have a game.py, that's fine: the regex captures {some_file_name}.py, creates a new file with that name, and writes the captured text into it. The issue you will probably run into is the format of all_output.txt, which determines which pattern you need. Since the main branch doesn't yet get consistently formatted output from gpt-#, you have to figure out what to anchor on for each run of main.py.

If you paste a chunk of your all_output.txt, I can tell you which pattern to use, or write a new one that fits your output if none of the patterns above do.

@mindwellsolutions
Author

mindwellsolutions commented Jun 16, 2023

Hi thank you so much. That would be amazing if I could get that script working. Here is a chunk of my all_output.txt

entrypoint.py

from typing import List
from tkinter import Tk, filedialog, messagebox
from file_converter import FileConverter
from gui import GUI

def main():
    root = Tk()
    root.withdraw()
    gui = GUI(root)
    root.mainloop()

if __name__ == "__main__":
    main()

file_converter.py

from typing import List
import openpyxl
import csv
import os

class FileConverter:
    def __init__(self, file_path: str):
        self.file_path = file_path
        self.workbook = openpyxl.load_workbook(filename=self.file_path)

gui.py
```python
from tkinter import Tk, Label, Button, filedialog, messagebox
from file_converter import FileConverter
class GUI:
    def __init__(self, master: Tk):
        self.master = master
        self.master.title("CSV/XLSX Converter")
        self.file_path = ""
        self.output_path = ""
        self.sheet_names = []
        self.active_sheet_name = ""
        self.create_widgets()

@goncalomoita
Contributor

I have improved the "parse_chat" function by treating "code discovery" and "filename discovery" as different tasks.
The regex to search for code blocks (``` blocks) works great. The code is almost always within those blocks so code discovery rarely fails and it supports all kinds of scripts as it is only matching ```.
However, filename placement is much more random, so it requires a greater deal of attention.
I have created a new regex expression to match "<filename>.<ext>".
Extensions (exts) are imported from a file and put in a regex expression via the "build_regex_from_file" function.

I have created a new "extensions.txt" file in the root directory and replaced the entire chat_to_files.py file with:

chat_to_files.py

import re

def build_regex_from_file(filename: str = 'extensions.txt') -> str:
    '''
    Builds a regex from a file containing a list of extensions.

    File should be formatted as follows:
    ```extensions.txt
    py
    ts
    js
    html
    css
    ```
    
    '''
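    with open(filename, 'r') as file:
        # Skip blank lines and escape each extension so it is matched literally
        exts = [re.escape(line.strip()) for line in file if line.strip()]
    
    # Pipe acts as an OR operator in regex
    extension_str = '|'.join(exts)
    return r"\b[\w\-.]+?\.(?:" + extension_str + r")\b"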

def parse_chat(chat):  # -> List[Tuple[str, str]]:
    # Get all unique filenames
    filenames = re.findall(build_regex_from_file(), chat)
    
    # Drop duplicates in case they are mentioned multiple times
    filenames = list(dict.fromkeys(filenames))

    # Get all ``` (code) blocks
    code_matches = re.finditer(r"```(.*?)```", chat, re.DOTALL)

    files = []
    for i, match in enumerate(code_matches):
        # path = match.group(1).split("\n")[0]
        path = filenames[i]
        # Get the code
        code = match.group(1).split("\n")[1:]
        code = "\n".join(code)
        # Add the file to the list
        files.append((path, code))

    return files


def to_files(chat, workspace):
    workspace['all_output.txt'] = chat

    files = parse_chat(chat)
    for file_name, file_content in files:
        workspace[file_name] = file_content

extensions.txt

py
ts
tsx
js
jsx
html
css

Hope this helps!

@patillacode
Collaborator

Hi @goncalomoita

that is a good idea.

Would you feel comfortable creating a PR with your changes?

If so I would suggest using a constant for the file extensions, in a file constants.py you could declare FILE_EXTENSIONS like:

FILE_EXTENSIONS = ['py', ..., 'css']

and then import it in chat_to_files.py and use that instead of reading from a file.

Or maybe something even better: isn't there a way to avoid the extensions altogether?

I get that the regex works well with this approach, but the solution scales badly. We would want to support all extensions, so maybe modify the regex to just look for <filename>.*? ? (thinking out loud here)
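
To make the two options concrete, here is a rough sketch. FILE_EXTENSIONS stands in for the constant suggested above (the listed values are only examples), build_regex_from_extensions is a hypothetical name, and GENERIC_FILENAME_REGEX is just one possible way to match a filename without an extension list.

```python
import re

# Stand-in for the constant proposed for constants.py; values are examples.
FILE_EXTENSIONS = ["py", "ts", "tsx", "js", "jsx", "html", "css"]


def build_regex_from_extensions(extensions=FILE_EXTENSIONS):
    """Build a filename regex from a list of extensions."""
    # Escape each entry so it is matched literally; longest entries first.
    alternatives = "|".join(
        re.escape(ext) for ext in sorted(extensions, key=len, reverse=True)
    )
    return r"\b[\w\-.]+?\.(?:" + alternatives + r")\b"


# Extension-agnostic alternative (an assumption, not from the repo):
# a name, a dot, then one to six word characters.
GENERIC_FILENAME_REGEX = r"\b[\w-]+\.\w{1,6}\b"

print(re.findall(build_regex_from_extensions(), "See main.py and app.tsx"))
print(re.findall(GENERIC_FILENAME_REGEX, "Edit styles.css then run.sh"))
```

The generic pattern needs no maintenance, but it is also looser: it would match things like version numbers ("3.10") as if they were filenames, so where it is applied matters more than with the explicit list.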

@goncalomoita
Contributor

Absolutely! Working on it. I'll drop the current extensions logic as my own extensions.txt has over 100 lines...

@mindwellsolutions
Author

mindwellsolutions commented Jun 16, 2023

@goncalomoita, you are amazing. That worked right away. Thank you so much, this really streamlines GPT-Engineer.

Can the developer add this code into future builds?

@jebarpg
Contributor

jebarpg commented Jun 16, 2023

This is a huge improvement. I think we need to figure out a way to keep the formatting consistent on the GPT side of things, if possible. We might be able to ask it to re-evaluate its output, check it against a standard format, and correct it if the format is wrong. I think that is probably the best place to resolve the issue. The other option, of course, is to keep monitoring for quirks and variations as they come up and compensate for them with additional complexity in the regex. I am not a fan of that, but if it has to be that way, so be it. Perhaps we will discover that there is only a finite number of format variations.
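
A rough sketch of what the checking half of that idea might look like. No standard format is defined in this thread, so the one assumed here (a line holding only the file name, immediately followed by a fenced code block) is purely illustrative, as is the name validate_output. Sending the problems back to GPT for correction is left out.

```python
import re

# Three backticks, built this way so the example can sit inside a fenced block.
FENCE = "`" * 3

# Assumed standard format: a header line, then a fenced block.
BLOCK_RE = re.compile(
    r"^(?P<header>[^\n]*)\n"
    + FENCE + r"[^\n]*\n(?P<code>.*?)^" + FENCE + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)
# A header counts as a file name if it is a path-like token with an extension.
FILENAME_RE = re.compile(r"[\w./-]+\.\w{1,6}")


def validate_output(chat):
    """Return a list of problems; an empty list means the format is valid."""
    problems = []
    blocks = list(BLOCK_RE.finditer(chat))
    if not blocks:
        problems.append("no fenced code blocks found")
    for number, block in enumerate(blocks, start=1):
        header = block.group("header").strip()
        if not FILENAME_RE.fullmatch(header):
            problems.append(
                f"block {number}: expected a file name, got {header!r}"
            )
    return problems
```

If the returned list is non-empty, it could be included in a follow-up prompt asking for the output to be reformatted, at the cost of one more inference per run.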

@jebarpg
Contributor

jebarpg commented Jun 16, 2023

Also, it might be advantageous to pull the extensions from this gist:

https://gist.github.com/ppisarczyk/43962d06686722d26d176fad46879d41

instead of having to add them manually. The script could just download the JSON, parse it, and have an up-to-date list of programming language extensions. That means less maintenance in the future and makes things easier for anyone who uses gpt-engineer.

@jebarpg
Contributor

jebarpg commented Jun 16, 2023

Here is the change to use the link I mentioned above:

import re
from urllib.request import urlopen
import json

def build_regex_from_file(filename: str = 'extensions.txt') -> str:
    # Builds a regex from a JSON list of programming language file
    # extensions hosted in a gist. The 'filename' parameter is unused and
    # kept only so existing callers keep working.
    url = "https://gist.githubusercontent.com/ppisarczyk/43962d06686722d26d176fad46879d41/raw/211547723b4621a622fc56978d74aa416cbd1729/Programming_Languages_Extensions.json"

    # Download and parse the JSON response
    with urlopen(url) as response:
        data_json = json.loads(response.read())

    # Collect every extension, dropping any leading dot (the regex below
    # already supplies one) and escaping regex metacharacters
    exts = []
    for item in data_json:
        for ext in item.get("extensions", []):
            exts.append(re.escape(ext.lstrip('.')))

    # Pipe acts as an OR operator in regex
    extension_str = '|'.join(exts)
    return r"\b[\w\-.]+?\.(?:" + extension_str + r")\b"

def parse_chat(chat):  # -> List[Tuple[str, str]]:
    # Get all unique filenames
    filenames = re.findall(build_regex_from_file(), chat)
    
    # Drop duplicates in case they are mentioned multiple times
    filenames = list(dict.fromkeys(filenames))

    # Get all ``` (code) blocks
    code_matches = re.finditer(r"```(.*?)```", chat, re.DOTALL)

    files = []
    for i, match in enumerate(code_matches):
        # path = match.group(1).split("\n")[0]
        path = filenames[i]
        # Get the code
        code = match.group(1).split("\n")[1:]
        code = "\n".join(code)
        # Add the file to the list
        files.append((path, code))

    return files


def to_files(chat, workspace):
    workspace['all_output.txt'] = chat

    files = parse_chat(chat)
    for file_name, file_content in files:
        workspace[file_name] = file_content

@jebarpg
Contributor

jebarpg commented Jun 16, 2023

@patillacode, see my post above for the solution you are looking for.

@goncalomoita
Contributor

@patillacode I think I've cracked this thing.
No more verbose extensions. The "parse_chat" is now able to discern filenames by searching in specific areas near the code block.
Rather than performing filename pattern matching on the entire prompt completion output, the new approach looks for the filename in logical spots (just like a human would).
I actually think this is a better way because now there's "intention" when parsing.

@jebarpg I thought about the problem of ensuring a consistent format when deciding whether or not to improve this parser. My initial thoughts were: "Does better parsing make sense? Can I fix this with a "self-review" call to an LLM? Should I call the LLM just to find out what the filenames are?"
Those are possible solutions, but they add performance overhead, since you have to wait for another inference, and ultimately cost.
And do we really need that overhead?
When reasoning from first principles, I landed on two questions (and their conjunction):
What is a filename? + Where would I expect it?

It is working great and it is supporting every format I've seen, including:

---------------------
File: entrypoint.py

```python
import pygame

---------------------
entrypoint.py
```python
import pygame

---------------------
```python
# File: entrypoint.py
import pygame

---------------------
```entrypoint.py
import pygame

Note: I will use this throughout the day to find possible issues. I'll do a PR later.
For now, here's the new code:

chat_to_files.py

import re
from typing import List, Tuple

# Amount of lines within the code block to consider for filename discovery
N_CODELINES_FOR_FILENAME_TA = 5

# Default path to use if no filename is found
DEFAULT_PATH = 'unknown.txt'


def parse_chat(chat, verbose = False) -> List[Tuple[str, str]]:
    '''
    Parses a chat message and returns a list of tuples containing
    the file path and the code content for each file.
    '''
    code_regex = r"```(.*?)```"
    filename_regex = r'\b[\w-]+\.[\w]{1,6}\b'

    # Get all ``` (code) blocks
    code_matches = re.finditer(code_regex, chat, re.DOTALL)
    
    prev_code_y_end = 0
    files = []
    for match in code_matches:
        lines = match.group(1).split('\n')
        code_y_start = match.start()
        code_y_end = match.end()

        # Now, we need to get the filename associated with this code block.
        # We will look for the filename somewhere near the code block start.
        #
        # This "somewhere near" is referred to as the "filename_ta", to
        # resemble a sort-of target area (ta).
        #
        # The target area includes the text preceding the code block that
        # does not belong to previous code blocks ("no_code").
        # Additionally, as sometimes the filename is defined within
        # the code block itself, we will also include the first few lines
        # of the code block in the filename_ta.
        #
        # Example:
        # ```python
        # # File: entrypoint.py
        # import pygame
        # ```
        #
        # The amount of lines to consider within the code block is set by
        # the constant 'N_CODELINES_FOR_FILENAME_TA'.
        #
        # Get the "preceding" text, which is located between codeblocks
        no_code = chat[prev_code_y_end:code_y_start].strip()
        within_code = '\n'.join(lines[:N_CODELINES_FOR_FILENAME_TA])
        filename_ta = no_code + '\n' + within_code

        # Visualize the filename_ta if verbose
        if verbose:
            print('-' * 20)
            print(filename_ta)
            print('-' * 20)
        
        # The path is the first filename-like token found in the target area
        filename = re.search(filename_regex, filename_ta)
        path = filename.group(0) if filename else DEFAULT_PATH

        prev_code_y_end = code_y_end
        
        # Parse the entire code block
        code = lines[1:]
        code = "\n".join(code)

        # Add the file to the list
        files.append((path, code))

    return files


def to_files(chat, workspace):
    workspace['all_output.txt'] = chat

    files = parse_chat(chat)
    for file_name, file_content in files:
        workspace[file_name] = file_content
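
For anyone who wants to try the parsing result outside of gpt-engineer, here is a rough sketch of the last step: taking the `(path, code)` tuples and writing them to a folder. `write_files` is a hypothetical helper, not part of the project, and `parsed` is a hard-coded stand-in for what `parse_chat(chat)` might return.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def write_files(files: List[Tuple[str, str]], root: Path) -> List[Path]:
    '''Write each (path, code) pair under root and return the written paths.'''
    written = []
    for name, content in files:
        target = root / name
        target.write_text(content, encoding='utf-8')
        written.append(target)
    return written


# Hard-coded stand-in for the output of parse_chat(chat)
parsed = [
    ('entrypoint.py', 'import pygame\n'),
    ('unknown.txt', 'print("no filename found")\n'),
]

with tempfile.TemporaryDirectory() as tmp:
    paths = write_files(parsed, Path(tmp))
    names = sorted(p.name for p in paths)
    contents = (Path(tmp) / 'entrypoint.py').read_text(encoding='utf-8')

print(names)
```

Note that every block without a detected filename ends up under the same default path, so with this approach (and with plain item assignment into the workspace) a later unnamed block would overwrite an earlier one.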

@patillacode
Collaborator

Ty @jebarpg for the alternatives. Just to mention it: the way we want to add code is through a proper PR.

Let's see what @goncalomoita comes up with, if anything.

@AntonOsika thoughts on this?

@patillacode
Collaborator

Speak of the devil... 😂

OK @goncalomoita, this looks promising, we can do a proper review when the PR is up.

@jebarpg
Contributor

jebarpg commented Jun 16, 2023

Another format I have seen is **game.py**, which this looks like it will handle just fine.

@jebarpg
Contributor

jebarpg commented Jun 16, 2023

@goncalomoita Definitely agree: if we don't have to run another inference, all the better. And your solution captures all file types generically, so there's no need to pull from a list of extensions, which is one less thing to maintain. This is a great solution so far. Looking forward to seeing how your testing goes later today.

@RemkoDelsman

RemkoDelsman commented Jun 16, 2023

Hi, I'm new here and this is perhaps a small thing: setting gpt-3.5-turbo-16k in main.py did not work for me either and also just created the all_output.txt file. I was able to fix it, though, by changing line 16 in the Scripts/rerun_edited_message_logs.py file. It also had a hard-coded reference to GPT-4, which stopped the files from being created. Is the above code still required then?
I also tinkered with the 'identity' files to make GPT return the file name structure consistently, i.e.:
Use_qa for me is:

Please now remember the steps:

First lay out the names of the core classes, functions, methods that will be necessary, As well as a quick comment on their purpose.
Then output the code content for each file using the structure as in the example below.
(You will start with the "entrypoint" file, then go to the ones that are imported by that file, and so on.)
Make sure that files contain all imports, types etc. The code should be fully functional. Make sure that code in different files are compatible with each other.
Before you finish, double check that all parts of the architecture is present in the files.

Example syntax:

```main.py
# this is importcode
Import abc 
\`\`\` <github comment does not escape??>

and Setup is as:

You will get instructions for code to write.
You will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code.

You will first lay out the names of the core classes, functions, methods that will be necessary, As well as a quick comment on their purpose.
Then you will output the content of each file, with syntax below.
(You will start with the "entrypoint" file, then go to the ones that are imported by that file, and so on.)
Make sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other.
Ensure to implement all code, if you are unsure, write a plausible implementation.
Before you finish, double check that all parts of the architecture is present in the files.

Example syntax:

```main.py
# this is importcode
Import abc 
\`\`\` <github comment does not escape??>

@mindwellsolutions
Author

mindwellsolutions commented Jun 16, 2023

Thanks everyone, these solutions went above and beyond. I have goncalomoita's first solution with the extensions.txt file working on the previous version of GPT-Engineer. The last script they posted didn't work on the same version of GPT-Engineer.

I also tried downloading the newest GPT-Engineer build from 3 hours ago, where the .py files are now in a subfolder /gpt-engineer/gpt-engineer/, and neither of goncalomoita's scripts works there. If possible, could we get this working with the new folder structure?

This is amazing though. I'm still running the older version from yesterday with the extensions.txt script and it's working perfectly.

@goncalomoita
Contributor

goncalomoita commented Jun 16, 2023

Updates: I found a code format that broke the function. It was:

```main.py```
```python
...
\```

For some reason "gpt-3.5-turbo-16k" really likes that one. It has since been solved.
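
To illustrate why that format is a problem, here is a small standalone sketch (assumption: simplified from the parser, not the exact PR code). The code-block regex treats the inline filename as a block of its own, and a `re.fullmatch` against the filename pattern is one way to tell that "block" apart from real code. The fence is built at runtime only so the example does not contain a literal one.

```python
import re

FENCE = '`' * 3  # avoids writing a literal fence inside this example
CODE_REGEX = FENCE + r'(.*?)' + FENCE
FILENAME_REGEX = r'\b[\w-]+\.[\w]{1,6}\b'

# The format described above: a fenced filename followed by a fenced block
chat = '\n'.join([
    FENCE + 'main.py' + FENCE,
    FENCE + 'python',
    'print("hello")',
    FENCE,
])

results = []
for match in re.finditer(CODE_REGEX, chat, re.DOTALL):
    body = match.group(1)
    # A "block" whose entire body is a filename is a false positive.
    is_filename_only = re.fullmatch(FILENAME_REGEX, body) is not None
    results.append((body, is_filename_only))

for body, flag in results:
    print(repr(body), flag)
```

The first match has the body `main.py` and is flagged, while the second one carries the actual code. Skipping the flagged match without moving the "previous block end" marker keeps `main.py` in the text that gets searched for the next block's filename.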

For the impatient people out there who can't wait for the PR, here's the latest:

def parse_chat(chat: str, verbose: bool = False) -> List[Tuple[str, str]]:
    '''
    Parses a chat message and returns a list of tuples containing
    the file path and the code content for each file.
    '''
    code_regex = r'```(.*?)```'
    filename_regex = r'\b[\w-]+\.[\w]{1,6}\b'

    # Get all ``` (code) blocks
    code_matches = re.finditer(code_regex, chat, re.DOTALL)
    
    prev_code_y_end = 0
    files = []
    for match in code_matches:
        lines = match.group(1).split('\n')
        code_y_start = match.start()
        code_y_end = match.end()

        # Now, we need to get the filename associated with this code block.
        # We will look for the filename somewhere near the code block start.
        #
        # This "somewhere near" is referred to as the "filename_ta", to
        # resemble a sort-of target area (ta).
        #
        # The target area includes the text preceding the code block that
        # does not belong to previous code blocks ("no_code").
        # Additionally, as sometimes the filename is defined within
        # the code block itself, we will also include the first few lines
        # of the code block in the filename_ta.
        #
        # Example:
        # ```python
        # # File: entrypoint.py
        # import pygame
        # ```
        #
        # The amount of lines to consider within the code block is set by
        # the constant 'N_CODELINES_FOR_FILENAME_TA'.
        #
        # Get the "preceding" text, which is located between codeblocks
        no_code = chat[prev_code_y_end:code_y_start].strip()
        within_code = '\n'.join(lines[:N_CODELINES_FOR_FILENAME_TA])
        filename_ta = no_code + '\n' + within_code
        
        # The path is the first filename-like token found in the target area
        filename = re.search(filename_regex, filename_ta)
        path = filename.group(0) if filename else DEFAULT_PATH

        # Visualize the filename_ta if verbose
        if verbose:
            print('-' * 20)
            print(f'Path: {path}')
            print('-' * 20)
            print(filename_ta)
            print('-' * 20)
        
        # Check that it's not a false positive
        #
        # For instance, the match with ```main.py``` should not be considered.
        # ```main.py```
        # ```python
        # ...
        # ```
        if not re.fullmatch(filename_regex, '\n'.join(lines)):
            # Update the previous code block end
            prev_code_y_end = code_y_end

            # File and code have been matched, add them to the list
            files.append((path, '\n'.join(lines[1:])))

    return files

@mindwellsolutions I'll check the new build now and start working on the PR. I also made a few customizations... May I suggest the creation of a global script (bash or something) to invoke gpt-engineer from any location in your OS. It's insane lmao!

goncalomoita added a commit to goncalomoita/gpt-engineer that referenced this issue Jun 16, 2023
@jebarpg
Contributor

jebarpg commented Jun 17, 2023

@goncalomoita Also, I just found out that we might not even need this if we just change the model inside gpt-engineer/scripts/rerun_edited_message_logs.py
If the model in main.py and the one in rerun_edited_message_logs.py are the same, then files are created in the example directory... or whatever the user specifies as the project directory.

@qrekkwijjtx54

Updates: I found a code format that broke the function. It was:

```main.py```
```python
...
\```

For some reason "gpt-3.5-turbo-16k" really likes that one. It has since been solved.

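@mindwellsolutions I'll check the new build now and start working on the PR. I also made a few customizations... May I suggest the creation of a global script (bash or something) to invoke gpt-engineer from any location in your OS. It's insane lmao!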

The new code does not work......

@goncalomoita
Contributor

Updates: I found a code format that broke the function. It was:

```main.py```
```python
...
\```

For some reason "gpt-3.5-turbo-16k" really likes that one. It has since been solved.

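@mindwellsolutions I'll check the new build now and start working on the PR. I also made a few customizations... May I suggest the creation of a global script (bash or something) to invoke gpt-engineer from any location in your OS. It's insane lmao!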

The new code does not work......

Can you elaborate? What issues are you encountering? In my fork I have this working for the pre-package build (branch:initial) and the post-package (branch:main).

@patillacode
Collaborator

Addressed in #120

goncalomoita added a commit to goncalomoita/gpt-engineer that referenced this issue Jul 12, 2023

12 participants