In this file we go from the zip file of the combined projects to a list of documents to be processed by RAG. The main things I wanted to extract were the class definitions and each function's docstring and code. The plan is for the function and class files to be retrievable by RAG and, if it retrieves a method, to fetch the associated class file as well. Because the project consists of a Python desktop application, a Flutter mobile app, and a C++ desktop driver, each one needs its own extraction method. Initially I wanted to make a VA that could handle any arbitrary project, but due to the differences between programming languages I quickly narrowed my focus to a VA specifically for FreeMoveVR.

In [12]:
import zipfile
import os
import ast
import re
import shutil
from typing import List, Tuple
from pathlib import Path

# Extract FreeMoveVR (FMVR) zip

I plan to submit FreeMoveVR as a zip file containing the three projects, so to follow along we need to unzip them.
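
Before extracting, it can help to confirm what the archive actually contains. The sketch below is only an illustration: `top_level_entries` is a hypothetical helper that is not part of this notebook, and it uses nothing beyond `zipfile.ZipFile.namelist()` from the standard library.

```python
import zipfile
from typing import List


def top_level_entries(zip_path: str) -> List[str]:
    """List the distinct top-level names inside a zip archive."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        names = zip_ref.namelist()
    # Archive member names always use forward slashes, regardless of platform
    return sorted({name.split('/')[0] for name in names if name})
```

Calling `top_level_entries('FreeMoveVR.zip')` should show the folder that holds the three projects, assuming the archive is laid out as described above.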

In [2]:
def unzip() -> None:
    zip_file_path = 'FreeMoveVR.zip'
    extract_to = os.getcwd()
    os.makedirs(extract_to, exist_ok=True)

    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)

    print(f"Extracted files to: {extract_to}")

# unzip() # TODO: UNCOMMENT TO UNZIP

# FreeMoveVR Desktop (Python)

The Python project is called `FreeMoveVR-Desktop` in the zip file, and its code is contained inside the `freemovevr_desktop` folder. The code has subfolders, and almost all of the functions have docstrings. I want to be explicit about which files are included, so the extractor runs at the folder level and does not process subfolders. This was the easiest of the repositories to extract, thanks to the flexibility of Python and the `ast` library, which allows Python to read Python easily. The format produced by this extractor was the one I wanted for the other two languages as well, but as those became much more intricate processes I did not achieve my vision for their extractors.
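
As a small illustration of how `ast` lets Python read Python, the sketch below separates one function's docstring from its code. The `add` function is a made-up sample, and `ast.unparse` requires Python 3.9 or newer.

```python
import ast

# A made-up sample function, used only for illustration
source = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''

tree = ast.parse(source)
func = tree.body[0]

# Read the docstring, then drop bare constant expressions (the docstring)
# from the body so it does not appear twice in the output
docstring = ast.get_docstring(func)
func.body = [
    stmt for stmt in func.body
    if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant))
]
code = ast.unparse(func)

print(docstring)
print(code)
```

Note that `ast.unparse` regenerates the source from the tree, so regular `#` comments and the original formatting are not preserved in the extracted code.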

In [17]:
def extract_comments_and_code(source_code: str, output_dir: str, file_name: str) -> None:
    """
    Extract functions, methods, and class docstring + code from the python source and save them as separate documents.
    """
    tree = ast.parse(source_code)

    for node in ast.walk(tree):
        # Function Extractor
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Extract the function name and associated docstring
            function_name = node.name
            docstring = ast.get_docstring(node) or ""
        
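            # Remove the docstring (and any other bare constant expression)
            # from the body so it is not repeated in the code section
            node.body = [stmt for stmt in node.body if not isinstance(stmt, ast.Expr) or not isinstance(stmt.value, ast.Constant)]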
            function_code = ast.unparse(node)

            if docstring and function_name and function_code:
            
                document = f"## Function {function_name} ##\n\n"
                document += f"Docstring:\n{docstring}\n\n"
                document += f"Code:\n{function_code}\n"

                name_of_file = os.path.splitext(file_name)[0]

                file_path = os.path.join(os.path.join(output_dir, "code"), f"{name_of_file}::{function_name}.txt")
                with open(file_path, 'w', encoding='utf-8') as f:
                    f.write(document)

        # Class Extractor
        elif isinstance(node, ast.ClassDef):
            class_name = node.name
            docstring = ast.get_docstring(node) or ""
            
            # Remove the class docstring from the body; it is written separately below
            node.body = [stmt for stmt in node.body if not isinstance(stmt, ast.Expr) or not isinstance(stmt.value, ast.Constant)]
            class_code = ast.unparse(node)

            document = f"## Class {class_name} ##\n\n"
            if docstring:
                document += f"Docstring:\n{docstring}\n\n"
            document += f"Code:\n{class_code}\n"

            file_path = os.path.join(os.path.join(output_dir, "classes"), f"{class_name}.txt")
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(document)

In [18]:
# Set the folder to store individual text files
def desktop_generate_docs(path: str) -> None:
    base = "desktop_documents"

    os.makedirs(os.path.join(base, "code"), exist_ok=True)
    os.makedirs(os.path.join(base, "classes"), exist_ok=True)

    input_dir = f"FreeMoveVR/FreeMoveVR-desktop/{path}"

    for file_name in os.listdir(input_dir):
        if file_name.endswith('.py'):
            file_path = os.path.join(input_dir, file_name)
            
            with open(file_path, 'r', encoding='utf-8') as f:
                source_code = f.read()

            extract_comments_and_code(source_code, base, file_name)

In [19]:
desktop_generate_docs('freemovevr_desktop')
desktop_generate_docs('freemovevr_desktop/gui')
desktop_generate_docs('freemovevr_desktop/gui/popup_window_components')
desktop_generate_docs('freemovevr_desktop/gui/popups')
desktop_generate_docs('freemovevr_desktop/gui/settings_bar_components')
desktop_generate_docs('freemovevr_desktop/messages')
desktop_generate_docs('freemovevr_desktop/vr_interfaces')


# FreeMoveVR Mobile (Dart / Flutter)

The mobile application is written in Dart using the Flutter framework. Dart is much harder to extract than Python, as we can't use `ast` again, and it is less well known than C++, so it is harder to find resources online for how to do it. I ended up needing an extremely complex RegEx, which was created using Anthropic's Claude 3.5 Sonnet. I used an LLM here because I could not have created this RegEx otherwise; it is a necessary step, but not the focus of this class. This extractor's output is a bit different than the Python extractor's, as I could not get the comments from most of the functions. I embed the entire source file in addition to the methods, and if the embedding returns a function we will pass the file to the LLM with it.
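
The pairing of a retrieved method with its source file could be sketched as follows. This is a hypothetical helper, not code from the project: `companion_file` is an illustrative name, and it assumes only the layout this notebook writes, with method documents in `code/<file>::<method>.txt` and whole-file copies in `classes/<file>.txt`.

```python
import os
from typing import Optional


def companion_file(method_doc_path: str) -> Optional[str]:
    """Map `<base>/code/<file>::<method>.txt` to `<base>/classes/<file>.txt`."""
    code_dir, doc_name = os.path.split(method_doc_path)
    base_dir = os.path.dirname(code_dir)

    # Everything before the `::` separator is the source file's base name
    source_name, separator, _ = doc_name.partition('::')
    if not separator:
        return None  # not a method document in the assumed naming scheme

    candidate = os.path.join(base_dir, 'classes', f"{source_name}.txt")
    # Only report the companion when it actually exists on disk
    return candidate if os.path.isfile(candidate) else None
```

Returning `None` instead of raising keeps the lookup safe for documents that have no matching file, so the retrieval step can simply pass the method on its own in that case.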

In [6]:
class DartMethodExtractor:
    def __init__(self):
        # Updated regex pattern for matching documentation comments
        # Used Anthropic's Claude 3.5 Sonnet to create RegEx
        self.doc_comment_pattern = r'(?:\/\/\/[^\n]*\n)*(?:\/\*\*(?:[^*]|\*(?!\/))*\*\/\s*)?'
        
        # Updated method pattern to better match Dart methods
        # used Anthropic's Claude 3.5 Sonnet to create RegEx
        self.method_pattern = (
            # Doc comments (both /// and /** */ style)
            r'(?:\/\/\/[^\n]*\n)*(?:\/\*\*(?:[^*]|\*(?!\/))*\*\/\s*)?'
            # Annotations
            r'(?:@\w+(?:\([^)]*\))?\s*)*'
            # Method modifiers and return type
            r'(?:static\s+)?(?:async\s+)?(?:Future<[^>]+>\s+)?(?:\w+(?:<[^>]+>)?)\s+'
            # Method name and parameters
            r'(\w+)\s*\([^)]*\)'
            # Async modifier (if after parameters)
            r'\s*(?:async\s*)?'
            # Method body
            r'\s*{[^{}]*(?:{[^{}]*}[^{}]*)*}'
        )

    def extract_methods(self, dart_code: str) -> List[Tuple[str, str]]:
        """
        Extracts methods from Dart code.
        Returns a list of tuples containing (method_name, full_method_text)
        """
        methods = []
        
        # Find all method matches in the code
        matches = re.finditer(self.method_pattern, dart_code, re.MULTILINE | re.DOTALL)
        
        for match in matches:
            method_text = match.group(0).strip()
            
            # Use the name captured by the pattern itself; searching the matched
            # text again could pick up a word from a leading doc comment
            method_name = match.group(1)

            # Skip if this looks like an if statement or other control structure
            if method_name not in ['if', 'for', 'while', 'switch']:
                methods.append((method_name, method_text))
        
        return methods

    def clean_method_text(self, method_text: str) -> str:
        """
        Cleans up the extracted method text by fixing indentation and removing extra whitespace.
        """
        lines = method_text.splitlines()
        
        # Remove empty lines from start and end
        while lines and not lines[0].strip():
            lines.pop(0)
        while lines and not lines[-1].strip():
            lines.pop()
            
        # Find minimum indentation (excluding empty lines)
        min_indent = float('inf')
        for line in lines:
            if line.strip():
                indent = len(line) - len(line.lstrip())
                min_indent = min(min_indent, indent)
        min_indent = 0 if min_indent == float('inf') else min_indent
        
        cleaned_lines = [line[min_indent:] if line.strip() else '' for line in lines]
        
        return '\n'.join(cleaned_lines)

    def save_methods_to_files(self, methods: List[Tuple[str, str]], output_dir: str, input_file: str) -> None:
        """
        Saves extracted methods to individual text files.
        """
        os.makedirs(output_dir, exist_ok=True)
        
        for method_name, method_text in methods:
            cleaned_method = self.clean_method_text(method_text)
            
            filename = f"{os.path.splitext(input_file)[0]}::{method_name}.txt"
            filepath = os.path.join(output_dir, 'code', filename)
            
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(cleaned_method)

def process_dart_file(input_file: str, output_dir: str) -> None:
    """
    Process a single Dart file, extract its methods, and copy the file to a folder
    with its extension changed to .txt.
    """
    extractor = DartMethodExtractor()
    
    with open(input_file, 'r', encoding='utf-8') as f:
        dart_code = f.read()

    methods = extractor.extract_methods(dart_code)
    
    extractor.save_methods_to_files(methods, output_dir, os.path.basename(input_file))
    
    base_name = os.path.basename(input_file)
    new_file_name = os.path.splitext(base_name)[0] + '.txt'
    new_file_path = os.path.join(output_dir, 'classes', new_file_name)

    shutil.copy(input_file, new_file_path)

def process_directory(input_dir: str, output_dir: str) -> None:
    """
    Process all Dart files in a directory.
    """
    
    for file in os.listdir(input_dir):
        input_file = os.path.join(input_dir, file)
        
        if os.path.isfile(input_file) and file.endswith('.dart'):
            current_output_dir = output_dir
            os.makedirs(os.path.join(current_output_dir, 'code'), exist_ok=True)
            os.makedirs(os.path.join(current_output_dir, 'classes'), exist_ok=True)
            
            process_dart_file(input_file, current_output_dir)

In [7]:
base_source_dir = "FreeMoveVR/FreeMoveVR/lib"
base_output_dir = "mobile_documents"

process_directory(base_source_dir, base_output_dir)

# Sub-folders are listed explicitly because process_directory does not recurse
for source in ["bluetooth", "calibration", "home_menu", "messages", "painter"]:
    process_directory(os.path.join(base_source_dir, source), base_output_dir)



# FreeMoveVR Driver (C++)

Extracting C++ methods felt similar to Dart, although it was easier because C++ is a more popular language with more resources available online. I once again used Anthropic's Claude 3.5 Sonnet for the RegEx, for the same reason as with the Dart code. This project has a lot of getter- and setter-like functions, so I added an arbitrary rule that a function needs to be 10 or more lines long to be saved, so that we do not include useless files. This extractor also gave me the idea to use `::` in the name of each text file to show which file a function is in. I retroactively applied this format to the other two extractors, as well as the `code` and `classes` folder format.
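
The length rule on its own can be sketched like this. The `keep_method` helper and the `Tracker` sample methods are made up for illustration and are not taken from the driver's source.

```python
MIN_LINES = 10  # same threshold as the rule described above


def keep_method(method_text: str, min_lines: int = MIN_LINES) -> bool:
    """Return True when the method spans at least min_lines lines."""
    return len(method_text.splitlines()) >= min_lines


# A 4-line getter, which the rule is meant to filter out
getter = "int Tracker::getId() const\n{\n    return id_;\n}"

# An 11-line method: signature, opening brace, 8 statements, closing brace
long_body = (
    "void Tracker::update()\n{\n"
    + "\n".join(f"    step{i}();" for i in range(8))
    + "\n}"
)

print(keep_method(getter))
print(keep_method(long_body))
```

Because the count includes the signature and brace lines, the threshold depends on brace style: the same body written with the opening brace on the signature line is one line shorter.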

In [8]:
MIN_METHOD_LINES = 10

def is_method_signature(line: str) -> bool:
    """Check if a line contains a valid method signature."""
    # Used Anthropic's Claude 3.5 Sonnet to create RegEx
    method_pattern = r'^\s*(?:virtual\s+)?(?:[\w:]+\s+)?[\w:]+\s+[\w:]+\s*\([^)]*\)\s*(?:const)?\s*\{?'
    
    # The extractor keeps thinking these are the starts of functions, 
    # so we make a blacklist to explicitly remove them when it finds them.
    invalid_starts = ['if', 'while', 'for', 'switch', 'return', 'else', 'throw']
    
    if any(line.strip().startswith(keyword) for keyword in invalid_starts):
        return False
    
    return bool(re.search(method_pattern, line))

def extract_methods(cpp_content: str) -> List[str]:
    """Extract methods from C++ content."""
    lines = cpp_content.splitlines()
    
    methods = []
    current_method = []
    brace_count = 0
    in_method = False
    
    for line in lines:
        # If we're not in a method, check if this line starts one
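        if not in_method:
            if is_method_signature(line):
                # Count closing braces too, so a one-line body such as
                # `int x() { return 1; }` does not leave the counter unbalanced
                brace_count = line.count('{') - line.count('}')
                if '{' in line and brace_count == 0:
                    # Body opens and closes on this line: a one-line method,
                    # too short to keep, so move on to the next line
                    continue
                in_method = True
                current_method.append(line)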
        else:
            current_method.append(line)
            brace_count += line.count('{')
            brace_count -= line.count('}')
            
            # If braces are balanced, we've found the end of the method
            if brace_count == 0:
                method_text = '\n'.join(current_method)
                # Check if there's at least one pair of curly brackets
                if '{' in method_text and '}' in method_text:
                    # Double check that the number of opening and closing braces match
                    total_open = method_text.count('{')
                    total_close = method_text.count('}')
                    # Check if the method has enough lines
                    if total_open == total_close and len(current_method) >= MIN_METHOD_LINES:
                        methods.append(method_text)
                current_method = []
                in_method = False
                
    return methods

def clean_method_name(method_content) -> str:
    """Extract a clean method name for the filename."""
    # Get the first line of the method
    first_line = method_content.split('\n')[0].strip()
    # Extract the method name
    match = re.search(r'[\w:]+\s+([\w:]+)\s*\(', first_line)
    if match:
        return match.group(1)
    raise ValueError(f"Method name could not be cleaned: {first_line}")

def process_file(input_file, output_dir) -> None:
    """Process a single C++ file and extract its methods."""
    try:
        with open(input_file, 'r', encoding='utf-8') as f:
            content = f.read()
        
        # Extract methods
        methods = extract_methods(content)

        os.makedirs(output_dir, exist_ok=True)
        base_filename = os.path.basename(input_file).replace('.cpp', '')
        
        # Save each method to a separate file
        for method in methods:
            method_name = clean_method_name(method)
            
            output_filename = f"{method_name}.txt"
            output_path = os.path.join(output_dir, output_filename)
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(method)
            
            
    except Exception as e:
        print(f"Error processing {input_file}: {str(e)}")

In [9]:
input_dir = Path("FreeMoveVR/FreeMoveVR-Driver/src")
output_dir = Path('driver_documents/code')

for cpp_file in input_dir.glob('**/*.cpp'):
    process_file(cpp_file, output_dir)

In [10]:
def copy_hpp_files(src_folder, dest_folder):
    os.makedirs(dest_folder, exist_ok=True)

    for filename in os.listdir(src_folder):
        if filename.endswith('.hpp'):
            full_file_name = os.path.join(src_folder, filename)
            if os.path.isfile(full_file_name):
                new_filename = filename[:-4] + '.txt'
                new_full_file_name = os.path.join(dest_folder, new_filename)

                shutil.copy(full_file_name, new_full_file_name)

In [11]:
dest_directory = 'driver_documents/classes'

# Header folders are listed explicitly because copy_hpp_files does not recurse
src_directories = [
    'FreeMoveVR/FreeMoveVR-Driver/src',
    'FreeMoveVR/FreeMoveVR-Driver/src/Connections',
    'FreeMoveVR/FreeMoveVR-Driver/src/Connections/LandmarkData',
    'FreeMoveVR/FreeMoveVR-Driver/src/Connections/Messages',
    'FreeMoveVR/FreeMoveVR-Driver/src/Connections/Messages/Fragments',
    'FreeMoveVR/FreeMoveVR-Driver/src/Trackers',
]

for src_directory in src_directories:
    copy_hpp_files(src_directory, dest_directory)