This project develops a conversational agent that answers questions about the Microsoft AutoGen framework. The agent ingests documentation from the official GitHub repository so that it can give detailed, context-aware responses on topics ranging from basic agent creation to complex multi-agent workflows. It serves as a practical demonstration of building a question-answering system over code documentation.

### Complete function

In [1]:
import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all Markdown (.md and .mdx) files from the main branch of a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
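    # Download the 'main' branch as a zip archive (assumes the default branch is named 'main')
    url = f'https://github.com/{repo_owner}/{repo_name}/archive/refs/heads/main.zip'
    # A timeout keeps the call from hanging indefinitely on a stalled connection
    resp = requests.get(url, timeout=60)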
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    # The context manager closes the archive even if an error occurs
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        for file_info in zf.infolist():
            filename = file_info.filename

            # Keep only Markdown (.md) and MDX (.mdx) files, case-insensitively
            if not filename.lower().endswith(('.md', '.mdx')):
                continue

            try:
                with zf.open(file_info) as f_in:
                    content = f_in.read().decode('utf-8', errors='ignore')
                # Split YAML front matter (if any) from the document body
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
            except Exception as e:
                # Report the failure and move on to the next file
                print(f"Error processing {filename}: {e}")
                continue
            
    return repository_data
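
The extension filter used in `read_repo_data` can be checked without network access. The sketch below is a minimal illustration using only the standard library: it builds a small in-memory zip archive and applies the same case-insensitive `.md`/`.mdx` test. The helper names `list_markdown_members` and `build_sample_archive`, and the file names inside the sample archive, are hypothetical and are not part of the project or of the AutoGen repository.

```python
import io
import zipfile


def list_markdown_members(zip_bytes):
    """Return the names of .md/.mdx members in a zip archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [
            info.filename
            for info in zf.infolist()
            # Same case-insensitive extension test as in read_repo_data
            if info.filename.lower().endswith(('.md', '.mdx'))
        ]


def build_sample_archive():
    """Build a tiny in-memory zip with made-up file names for illustration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as zf:
        zf.writestr('demo-main/README.md', '# Demo\n')
        zf.writestr('demo-main/docs/Guide.MDX', '---\ntitle: Guide\n---\nBody\n')
        zf.writestr('demo-main/src/app.py', 'print("hi")\n')
    return buffer.getvalue()


sample_names = list_markdown_members(build_sample_archive())
print(sample_names)
```

Only the two Markdown-style entries should be listed, including the upper-case `.MDX` one; the Python source file is skipped. The same in-memory approach could be used to exercise the full parsing loop if the download step were separated from the parsing step.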


In [3]:
# Example of how to use it with the project repository
data = read_repo_data('microsoft', 'autogen')
print(f"Successfully processed {len(data)} markdown files.")

# To see the structure of the data you've ingested
if data:
    print("\nExample of one file's data:")
    print(data[0])

Successfully processed 161 markdown files.

Example of one file's data:
{'content': '<!-- Thank you for your contribution! Please review https://microsoft.github.io/autogen/docs/Contribute before opening a pull request. -->\n\n<!-- Please add a reviewer to the assignee section when you create a PR. If you don\'t have the access to it, we will shortly find a reviewer and assign them to your PR. -->\n\n## Why are these changes needed?\n\n<!-- Please give a short summary of the change and the problem this solves. -->\n\n## Related issue number\n\n<!-- For example: "Closes #1234" -->\n\n## Checks\n\n- [ ] I\'ve included any doc changes needed for <https://microsoft.github.io/autogen/>. See <https://github.com/microsoft/autogen/blob/main/CONTRIBUTING.md> to build and test documentation locally.\n- [ ] I\'ve added tests (if relevant) corresponding to the changes introduced in this PR.\n- [ ] I\'ve made sure all auto checks have passed.', 'filename': 'autogen-main/.github/PULL_REQUEST_TEMPLAT