# Import Required Libraries

Import the necessary libraries: io, zipfile, requests, and frontmatter.

In [18]:
import io
import zipfile
import requests
import frontmatter

# Define the read_repo_data Function

Define a function to download a GitHub repository as a zip file, extract markdown files, parse them with frontmatter, and return a list of dictionaries with content and metadata.

In [19]:
def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
    prefix = 'https://codeload.github.com' 
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url, timeout=60)  # timeout in seconds (arbitrary) so a stalled download fails instead of hanging
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    
    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') 
            or filename_lower.endswith('.mdx')):
            continue
    
        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue
    
    zf.close()
    return repository_data
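
The filtering step can be tried without any network access. The sketch below builds a tiny zip archive in memory and lists its markdown entries. The helper name `list_markdown_files` and the file names are made up for illustration; only the standard-library `zipfile` and `io` modules are used.

```python
import io
import zipfile

def list_markdown_files(zip_bytes):
    """Return the names of .md/.mdx entries in a zip archive held in memory."""
    names = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            # str.endswith accepts a tuple of suffixes
            if info.filename.lower().endswith(('.md', '.mdx')):
                names.append(info.filename)
    return names

# Build a small archive in memory (hypothetical file names)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('docs-main/README.md', '# Hello')
    zf.writestr('docs-main/guide/intro.MDX', 'Intro text')
    zf.writestr('docs-main/images/logo.png', 'not markdown')

markdown_files = list_markdown_files(buffer.getvalue())
print(markdown_files)
```

Opening the archive in a `with` block closes it even if reading an entry raises an exception.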

# Download and Process Repository Data

Use the read_repo_data function to download and process data from specified GitHub repositories, such as 'DataTalksClub/faq' and 'evidentlyai/docs'.

In [20]:
# For homework, select a GitHub repo with documentation: evidentlyai/docs
evidently_docs = read_repo_data('evidentlyai', 'docs')

# Optionally, you can try other repos
# dtc_faq = read_repo_data('DataTalksClub', 'faq')
# fastai_docs = read_repo_data('fastai', 'fastbook')

# Print Document Counts

Print the number of documents retrieved from each repository.

In [21]:
print(f"Evidently documents: {len(evidently_docs)}")

# Uncomment to print others
# print(f"FAQ documents: {len(dtc_faq)}")
# print(f"FastAI documents: {len(fastai_docs)}")

Evidently documents: 95


# Day 2: Chunking and Intelligent Processing for Data

Welcome to Day 2 of our 7-Day AI Agents Email Crash-Course.

In the first part of the course, we focus on data preparation – the process of properly preparing data before it can be used for AI agents.

## Small and Large Documents

Yesterday (Day 1), we downloaded the data from a GitHub repository and processed it. For some use cases, like the FAQ database, this is sufficient. The questions and answers are small enough. We can put them directly into the search engine.

But it's different for the Evidently documentation. These documents are quite large. Let's take a look at this one: https://github.com/evidentlyai/docs/blob/main/docs/library/descriptors.mdx.

We could use it as is, but we risk overwhelming our LLMs.

## Why We Need to Prepare Large Documents Before Using Them

Large documents create several problems:

- Token limits: Most LLMs have maximum input token limits
- Cost: Longer prompts cost more money
- Performance: LLMs often answer less accurately when the context is very long
- Relevance: Not all parts of a long document are relevant to a specific question

So we need to split documents into smaller subdocuments. For AI applications like RAG (which we will discuss tomorrow), this process is referred to as "chunking."
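
To get a feeling for the token-limit problem, here is a rough sketch that estimates token counts from character counts. The ratio of about 4 characters per token is only a commonly quoted rule of thumb for English text; the real number depends on the model's tokenizer. The function names and the budget value are made up for illustration.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough estimate; chars_per_token=4 is an assumed rule of thumb."""
    return len(text) // chars_per_token

def fits_in_budget(text, max_tokens, chars_per_token=4):
    """Check whether the estimated token count stays within max_tokens."""
    return estimate_tokens(text, chars_per_token) <= max_tokens

sample = "a" * 10_000                 # stand-in for a 10,000-character document
approx_tokens = estimate_tokens(sample)
print(approx_tokens)                  # 2500 under the 4-chars-per-token assumption
print(fits_in_budget(sample, max_tokens=2_000))
```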

Today, we will cover multiple ways of chunking data:

1. Simple character-based chunking
2. Paragraph and section-based chunking
3. Intelligent chunking with an LLM

Just so you know, for the last section, you will need a Gemini API key.

## 1. Simple Chunking

Let's start with simple chunking. This will be sufficient for most cases.

We can continue with the notebook from Day 1. We already downloaded the data from Evidently docs. We put them into the evidently_docs list.

This is what the document at index 45 looks like:

{'title': 'LLM regression testing',
 'description': 'How to run regression testing for LLM outputs.',
 'content': 'In this tutorial, you will learn...'
}

The content field is 21,712 characters long. The simplest thing we can do is cut it into pieces of equal length. For example, with a chunk size of 2000 characters, we get:

Chunk 1: 0..2000
Chunk 2: 2000..4000
Chunk 3: 4000..6000

And so on.
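
A minimal sketch of this equal-length splitting, with no overlap yet. The function name `fixed_size_chunks` and the sample text are made up for illustration.

```python
def fixed_size_chunks(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [
        {'start': start, 'chunk': text[start:start + size]}
        for start in range(0, len(text), size)
    ]

sample_text = "a" * 4500
chunks = fixed_size_chunks(sample_text, 2000)
starts = [c['start'] for c in chunks]
lengths = [len(c['chunk']) for c in chunks]
print(starts)   # [0, 2000, 4000]
print(lengths)  # [2000, 2000, 500]
```

The last chunk is shorter because the text length is not a multiple of the chunk size.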

However, this approach has disadvantages:

- Context loss: Important information might be split in the middle
- Incomplete sentences: Chunks might end mid-sentence
- Missing connections: Related information might end up in different chunks

That's why, in practice, we usually make sure there's overlap between chunks. For size 2000 and overlap 1000, we will have:

Chunk 1: 0..2000
Chunk 2: 1000..3000
Chunk 3: 2000..4000
...

This is better for AI because:

- Continuity: Important information isn't lost at chunk boundaries
- Context preservation: Related sentences stay together in at least one chunk
- Better search: Queries can match information even if it spans chunk boundaries

This approach is known as the "sliding window" method.

In [22]:
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        if i + size >= n:
            break

    return result

In [23]:
# Let's apply it to document 45. This gives us 21 chunks:
# 0..2000, 1000..3000, ..., 19000..21000, 20000..21712

if len(evidently_docs) > 45:
    doc_45_content = evidently_docs[45]['content']
    chunks_45 = sliding_window(doc_45_content, 2000, 1000)
    print(f"Document 45 has {len(chunks_45)} chunks")
else:
    print("Document 45 not available")

# Let's process all the documents:

evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    evidently_chunks.extend(chunks)

print(f"Total chunks created: {len(evidently_chunks)} from {len(evidently_docs)} documents")

Document 45 has 21 chunks
Total chunks created: 575 from 95 documents
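
The count of 21 can be checked with a little arithmetic. The window stops as soon as it reaches the end of the text, so for a text longer than one window the count is ceil((n - size) / step) + 1. The helper below is a sketch of that calculation (its name is made up); the `min` term covers the case where the step is larger than the window.

```python
def expected_chunk_count(n, size, step):
    """Number of chunks sliding_window() would return for a text of length n."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if n == 0:
        return 0

    def ceil_div(a, b):
        # Integer ceiling division
        return -(-a // b)

    # Windows needed until one of them reaches the end of the text
    windows_until_end = max(0, ceil_div(n - size, step)) + 1
    # Windows that can start inside the text at all
    windows_started = ceil_div(n, step)
    return min(windows_until_end, windows_started)

print(expected_chunk_count(21712, 2000, 1000))  # 21
```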


## 2. Splitting by Paragraphs and Sections

Splitting by paragraphs is relatively easy:

In [24]:
import re

if len(evidently_docs) > 45:
    text = evidently_docs[45]['content']
    paragraphs = re.split(r"\n\s*\n", text.strip())
    print(f"Document 45 has {len(paragraphs)} paragraphs")
    print(f"First paragraph: {paragraphs[0][:200]}...")
else:
    print("Document 45 not available")

Document 45 has 153 paragraphs
First paragraph: In this tutorial, you will learn how to perform regression testing for LLM outputs....
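
Individual paragraphs are often very short, so one option is to pack consecutive paragraphs together until a size limit is reached. The sketch below is one possible way to do that, not part of the course code; `pack_paragraphs` and `max_chars` are names chosen for illustration.

```python
import re

def pack_paragraphs(text, max_chars=2000):
    """Group consecutive paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the chunk
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph.\n\nSecond paragraph.\n\nThird one is here."
packed = pack_paragraphs(sample, max_chars=40)
print(packed)
```

A single paragraph longer than `max_chars` is kept as its own oversized chunk; the sketch does not split inside paragraphs.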


Let's now look at section splitting. Here, we take advantage of the documents' structure. Markdown documents have this structure:

# Heading 1
## Heading 2  
### Heading 3

What we can do is split by headers.
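
The building block here is `re.split` with capturing groups: when the pattern contains groups, the captured text is returned alongside the pieces between matches. A quick check on a made-up toy string shows the layout:

```python
import re

sample = "intro\n## A\nalpha\n## B\nbeta\n"

# Two capturing groups: the "## " marker and the header text
pattern = re.compile(r'^(#{2} )(.+)$', re.MULTILINE)
parts = pattern.split(sample)
print(parts)
# ['intro\n', '## ', 'A', '\nalpha\n', '## ', 'B', '\nbeta\n']
```

The first element is the text before the first header. After that, the list repeats in groups of three: marker, header text, and the content that follows.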

In [25]:
def split_markdown_by_level(text, level=2):
    """
    Split markdown text by a specific header level.
    
    :param text: Markdown text as a string
    :param level: Header level to split on
    :return: List of sections as strings
    """
    # This regex matches markdown headers
    # For level 2, it matches lines starting with "## "
    header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
    pattern = re.compile(header_pattern, re.MULTILINE)

    # Split and keep the headers
    parts = pattern.split(text)
    
    sections = []
    for i in range(1, len(parts), 3):
        # We step by 3 because regex.split() with
        # capturing groups returns:
        # [before_match, group1, group2, after_match, ...]
        # here group1 is "## ", group2 is the header text
        header = parts[i] + parts[i+1]  # "## " + "Title"
        header = header.strip()

        # Get the content after this header
        content = ""
        if i+2 < len(parts):
            content = parts[i+2].strip()

        if content:
            section = f'{header}\n\n{content}'
        else:
            section = header
        sections.append(section)
    
    return sections
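
Note that this function drops any text before the first matching header and returns an empty list when a document has no headers at that level. If that matters for your data, one possible variation is to split with a lookahead so that nothing is discarded. This is a sketch with a made-up name (`split_keep_preamble`), not the course's implementation, and it relies on `re.split` handling empty matches (Python 3.7+).

```python
import re

def split_keep_preamble(text, level=2):
    """Split before each header of the given level, keeping all text."""
    # The lookahead matches the empty position right before "## " at a line start
    pattern = re.compile(r'^(?=#{' + str(level) + r'} )', re.MULTILINE)
    parts = [part.strip() for part in pattern.split(text)]
    return [part for part in parts if part]

with_headers = split_keep_preamble("intro\n## A\nalpha\n## B\nbeta")
without_headers = split_keep_preamble("just some text")
print(with_headers)
print(without_headers)
```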

In [26]:
# If we want to split by second-level headers, that's what we do:

if len(evidently_docs) > 45:
    text = evidently_docs[45]['content']
    sections = split_markdown_by_level(text, level=2)
    print(f"Document 45 has {len(sections)} sections")
    if sections:
        print(f"First section: {sections[0][:200]}...")
else:
    print("Document 45 not available")

# Now we iterate over all the docs to create the final result:

evidently_chunks_sections = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    sections = split_markdown_by_level(doc_content, level=2)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks_sections.append(section_doc)

print(f"Total sections created: {len(evidently_chunks_sections)} from {len(evidently_docs)} documents")

Document 45 has 8 sections
First section: ## 1. Installation and Imports

Install Evidently:

```python
pip install evidently[llm] 
```

Import the required modules:

```python
import pandas as pd
from evidently.future.datasets import Dataset...
```
Total sections created: 262 from 95 documents


## 3. Intelligent Chunking with an LLM

In some cases, we want to be more intelligent with chunking. Instead of doing simple splits, we delegate this work to AI.

This makes sense when:

- Complex structure: Documents have complex, non-standard structure
- Semantic coherence: You want chunks that are semantically meaningful
- Custom logic: You need domain-specific splitting rules
- Quality over cost: You prioritize quality over processing cost

This costs money. In most cases, we don't need intelligent chunking.

Simple approaches are sufficient. Use intelligent chunking only when:

- You already evaluated simpler methods and you can confirm that they produce poor results
- You have complex, unstructured documents
- Quality is more important than cost
- You have the budget for LLM processing

Let's create a prompt:

In [27]:
import google.generativeai as genai
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Set up Gemini API
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise ValueError("GEMINI_API_KEY not found in environment variables. Please set it in a .env file.")
genai.configure(api_key=api_key)

prompt_template = """
Split the provided document into logical sections
that make sense for a Q&A system.

Each section should be self-contained and cover
a specific topic or concept.

<DOCUMENT>
{document}
</DOCUMENT>

Use this format:

## Section Name

Section content with all relevant details

---

## Another Section Name

Another section content

---
""".strip()

def intelligent_chunking(text):
    prompt = prompt_template.format(document=text)
    
    model = genai.GenerativeModel('gemini-1.5-flash')
    response = model.generate_content(prompt)
    
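    # Split only on lines that consist of '---'; a plain substring split
    # would also cut inside Markdown table rows such as |---|---|.
    # (re was imported in an earlier cell)
    sections = re.split(r'^[ \t]*---[ \t]*$', response.text, flags=re.MULTILINE)
    sections = [s.strip() for s in sections if s.strip()]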
    return sections
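
The separator handling can be tried offline, without calling the API. The sketch below uses a made-up response string and splits only on lines that consist of `---`, so the `|------|` row of a Markdown table stays intact. The name `split_sections` and the sample text are for illustration only.

```python
import re

def split_sections(response_text):
    """Split on separator-only lines and drop empty pieces."""
    parts = re.split(r'^[ \t]*---[ \t]*$', response_text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]

# Hypothetical model output in the format requested by the prompt
fake_response = """## Setup

Install the package.

---

## Options

| name | value |
|------|-------|
| a    | 1     |

---
"""

sections = split_sections(fake_response)
print(len(sections))
print(sections[1])
```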

In [28]:
%pip install google-generativeai


In [29]:
%pip install python-dotenv


In [None]:
# Now we apply this to every document:

from tqdm.auto import tqdm

evidently_chunks_intelligent = []

# genai.configure() was already called above with the key from GEMINI_API_KEY,
# so there is no need to hard-code the key in the notebook.

for doc in tqdm(evidently_docs[:5]):  # Process only first 5 docs for demo (costs money)
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')

    sections = intelligent_chunking(doc_content)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks_intelligent.append(section_doc)

print(f"Total intelligent sections created: {len(evidently_chunks_intelligent)} from 5 documents")

# Note: This process requires time and incurs costs. As mentioned before, use this only when really necessary.
# For most applications, you don't need intelligent chunking.


## Bonus: Processing Code in Your GitHub Repository

You can use this approach for processing the code in your GitHub repository. You can use a variation of the following prompt:

"Summarize the code in plain English. Briefly describe each class and function/method (their purpose and role), then give a short overall summary of how they work together. Avoid low-level details."

Then add both the source code and the summary to your documents.

# Implementing Code Processing for GitHub Repositories

Now let's implement the bonus feature to process code files from GitHub repositories. We'll download code files, use the LLM to generate summaries, and add both the source code and summaries to our documents.

In [None]:
def read_repo_code(repo_owner, repo_name, file_extensions=None):
    """
    Download and extract code files from a GitHub repository.

    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
        file_extensions: List of file extensions to include (e.g., ['.py', '.js', '.ts'])

    Returns:
        List of dictionaries containing file content and metadata
    """
    if file_extensions is None:
        file_extensions = ['.py', '.js', '.ts', '.java', '.cpp', '.c', '.go', '.rs']

    prefix = 'https://codeload.github.com'
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url, timeout=60)  # timeout in seconds (arbitrary) so a stalled download fails instead of hanging

    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_code = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))

    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        # Skip if not a code file
        if not any(filename_lower.endswith(ext) for ext in file_extensions):
            continue

        # Skip files in common non-code directories
        if any(skip_dir in filename_lower for skip_dir in ['node_modules/', '__pycache__/', '.git/', 'dist/', 'build/']):
            continue

        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                # Skip empty files or very small files
                if len(content.strip()) < 50:
                    continue

                data = {
                    'filename': filename,
                    'content': content,
                    'language': filename.split('.')[-1] if '.' in filename else 'unknown'
                }
                repository_code.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue

    zf.close()
    return repository_code
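
The directory check above uses substring matching, so a path such as `rebuild/x.py` is also skipped because it contains `build/`. If that is a concern, one possible alternative is to compare whole path segments. This is a sketch with made-up names (`SKIP_DIRS`, `in_skipped_dir`), using only the standard-library `pathlib`.

```python
from pathlib import PurePosixPath

SKIP_DIRS = {'node_modules', '__pycache__', '.git', 'dist', 'build'}

def in_skipped_dir(filename, skip_dirs=SKIP_DIRS):
    """Return True if any directory in the path is exactly one of skip_dirs."""
    # Zip entries use forward slashes; the last part is the file name itself
    directories = PurePosixPath(filename.lower()).parts[:-1]
    return any(part in skip_dirs for part in directories)

print(in_skipped_dir('repo-main/build/lib/x.py'))  # True
print(in_skipped_dir('repo-main/rebuild/x.py'))    # False
print(in_skipped_dir('repo-main/src/build.py'))    # False
```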

In [None]:
code_summary_prompt = """
Summarize the code in plain English. Briefly describe each class and function/method (their purpose and role), then give a short overall summary of how they work together. Avoid low-level details.

<CODE>
{code}
</CODE>

Provide the summary in this format:

## Summary

[Your summary here]
""".strip()

def summarize_code_with_llm(code_content, filename):
    """
    Use LLM to generate a summary of the provided code.

    Args:
        code_content: The source code as a string
        filename: The filename for context

    Returns:
        Summary string generated by the LLM
    """
    prompt = code_summary_prompt.format(code=code_content)

    try:
        model = genai.GenerativeModel('gemini-1.5-flash')
        response = model.generate_content(prompt)
        return response.text.strip()
    except Exception as e:
        print(f"Error summarizing {filename}: {e}")
        return f"Error generating summary for {filename}"

In [None]:
# Download code from a GitHub repository
# Let's use a small, well-known Python project for demonstration
# You can replace this with any repository you want to process

code_repo_owner = 'psf'
code_repo_name = 'requests'  # A popular Python HTTP library

print(f"Downloading code from {code_repo_owner}/{code_repo_name}...")
code_files = read_repo_code(code_repo_owner, code_repo_name, file_extensions=['.py'])
print(f"Found {len(code_files)} Python files")

# Process a few files for demonstration (to avoid high API costs)
code_documents = []

for i, code_file in enumerate(code_files[:3]):  # Process only first 3 files
    print(f"Processing file {i+1}/{min(3, len(code_files))}: {code_file['filename']}")

    # Generate summary using LLM
    summary = summarize_code_with_llm(code_file['content'], code_file['filename'])

    # Create document with both source code and summary
    doc = {
        'filename': code_file['filename'],
        'language': code_file['language'],
        'source_code': code_file['content'],
        'summary': summary,
        'content': f"## Source Code\n\n```{code_file['language']}\n{code_file['content']}\n```\n\n## Summary\n\n{summary}"
    }

    code_documents.append(doc)

print(f"Created {len(code_documents)} code documents with summaries")

# You can now use these code_documents in your RAG system or search engine
# Each document contains both the original source code and its LLM-generated summary

In [None]:
# Display the results
print("Code Processing Results:")
print("=" * 50)

for i, doc in enumerate(code_documents, 1):
    print(f"\nDocument {i}:")
    print(f"Filename: {doc['filename']}")
    print(f"Language: {doc['language']}")
    print(f"Summary preview: {doc['summary'][:200]}...")
    print(f"Full content length: {len(doc['content'])} characters")
    print("-" * 30)

# You can now combine these code documents with your markdown documents
# For example:
# all_documents = evidently_docs + code_documents

print(f"\nTotal code documents created: {len(code_documents)}")
print("Each document contains both the source code and its AI-generated summary!")

# Usage Notes for Code Processing

## What was implemented:

1. **`read_repo_code()`**: Downloads and extracts code files from GitHub repositories
   - Filters by file extensions (default: .py, .js, .ts, .java, .cpp, .c, .go, .rs)
   - Skips common non-code directories and empty files

2. **`summarize_code_with_llm()`**: Uses Gemini AI to generate plain English summaries
   - Uses the prompt from the Bonus section above
   - Describes classes, functions/methods and their purposes
   - Provides overall summary of how components work together

3. **Document Creation**: Combines source code and summaries into searchable documents
   - Each document contains: filename, language, source_code, summary, and combined content

## How to use this for your own repositories:

```python
# Replace with your repository details
your_code_files = read_repo_code('your-username', 'your-repo-name')

# Process all files (be mindful of API costs)
your_code_documents = []
for code_file in your_code_files:
    summary = summarize_code_with_llm(code_file['content'], code_file['filename'])
    doc = {
        'filename': code_file['filename'],
        'language': code_file['language'],
        'source_code': code_file['content'],
        'summary': summary,
        'content': f"## Source Code\n\n```{code_file['language']}\n{code_file['content']}\n```\n\n## Summary\n\n{summary}"
    }
    your_code_documents.append(doc)
```

## Cost Considerations:
- Each file summarization costs API credits
- Process files selectively or in batches
- Consider file size limits for very large code files
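
One simple way to keep costs under control is to filter files before summarizing them. The sketch below keeps only files under a size limit and caps how many are processed. The function name and both thresholds are arbitrary placeholders, and the sample entries are made up to mimic what `read_repo_code()` returns.

```python
def select_files_for_summary(code_files, max_chars=20_000, max_files=10):
    """Keep files whose content is at most max_chars long, capped at max_files."""
    small_enough = [f for f in code_files if len(f['content']) <= max_chars]
    return small_enough[:max_files]

# Made-up entries shaped like the dictionaries returned by read_repo_code()
example_files = [
    {'filename': 'pkg/a.py', 'content': 'x' * 100},
    {'filename': 'pkg/huge.py', 'content': 'x' * 50_000},
    {'filename': 'pkg/b.py', 'content': 'x' * 200},
]

selected = select_files_for_summary(example_files, max_chars=1_000, max_files=2)
selected_names = [f['filename'] for f in selected]
print(selected_names)  # ['pkg/a.py', 'pkg/b.py']
```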

## Integration with RAG:
You can now combine code documents with your markdown documents:
```python
all_documents = evidently_docs + code_documents + your_code_documents
```