# Data Analyst Agent Notebook Documentation

## Overview

This Jupyter notebook implements a multi-agent system for data analysis using LangChain and E2B Code Interpreter. The system is designed to help users explore and analyze datasets, specifically focused on a data science student marks dataset.

## System Architecture

The notebook implements a hierarchical agent system with three main components:

1. **Research Agent** - Handles file exploration and basic data overview
2. **Analysis Agent** - Performs data analysis and visualization using code execution
3. **Supervisor Agent** - Orchestrates the workflow between research and analysis agents


In [55]:
from langchain.agents import create_agent
from langchain_core.tools import tool
from typing import Annotated
import os 
import pandas as pd
from pathlib import Path
from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from e2b_code_interpreter import Sandbox


load_dotenv()



True

### 1. Tool Definitions

#### `list_files_in_directory(directory="data")`
- **Purpose**: Lists all files in a specified directory
- **Returns**: String listing all files found in the directory
- **Usage**: Used to discover available datasets

#### `read_csv_file(file_path: str)`
- **Purpose**: Reads and provides basic statistics about CSV files
- **Returns**: First 5 rows, statistical description, and data info
- **Usage**: Provides quick overview of dataset structure and content

#### `code_tool(code: str, upload_files: list = None)`
- **Purpose**: Executes Python code in a secure sandbox environment
- **Features**: 
  - Uses E2B Code Interpreter for safe code execution
  - Supports file uploads to the sandbox
  - Returns execution results and logs
- **Usage**: For data analysis, visualization, and statistical computations

In [56]:
@tool
def list_files_in_directory(directory="data"):
    """List all files in a directory
    Args:
        directory: The directory to list files from
    Returns:
        A string listing all files in the directory
    """

    data_folder = Path(directory)
    if not data_folder.exists():
        return []
    return f"found {[f.name for f in data_folder.iterdir() if f.is_file()]} files in {directory}"


@tool 
def read_csv_file(file_path:str):
    """Read a CSV file
    Args:
        file_path: The path to the CSV file
    Returns:
        A string containing information from the CSV file
    """
    if os.path.exists(file_path) is False:
        return f"File {file_path} does not exist"
    
    df = pd.read_csv(file_path)

    first_5_rows = df.head(5)

    describe = df.describe()

    info = df.info()

    return f"First 5 rows: {first_5_rows}\nDescription: {describe}\nInfo: {info}"
    

@tool
def code_tool(code: str, upload_files: list = None) -> str:
    """Execute python code and return the result. Use this to visualize or generate any code needed
    
    Args:
        code: Python code to execute
        upload_files: List of local file paths to upload to sandbox 
    """
    with Sandbox.create() as sandbox:
        # Upload local files if provided
        if upload_files:
            for file_path in upload_files:
                # Convert to Path object for better path handling
                path_obj = Path(file_path)
                if path_obj.exists() and path_obj.is_file():
                    with open(path_obj, 'rb') as f:
                        content = f.read()
                    # Use the filename from the Path object
                    filename = path_obj.name
                    sandbox.files.write(f'/home/user/{filename}', content)
        
        execution = sandbox.run_code(code)
        if execution.error:
            return f"Error: {execution.error}"
        return f"Results: {execution.results}, Logs: {execution.logs}"

### 2. Agent Definitions

#### Research Agent
- **Model**: GPT-4o-mini
- **Tools**: `list_files_in_directory`, `read_csv_file`
- **Purpose**: Provides dataset overviews and basic information
- **Use Cases**: 
  - Listing available files
  - Understanding dataset structure
  - Getting basic statistics

#### Analysis Agent
- **Model**: GPT-4o-mini
- **Tools**: `list_files_in_directory`, `code_tool`
- **Purpose**: Performs detailed data analysis and visualization
- **Features**:
  - Automatically uploads CSV files to sandbox
  - Executes Python code for analysis
  - Creates visualizations and statistical models
- **Use Cases**:
  - Creating charts and graphs
  - Statistical analysis
  - Data modeling and regression

In [57]:



research_agent = create_agent(
    "openai:gpt-4o-mini",
    tools=[list_files_in_directory, read_csv_file],
    system_prompt=(
        """
        You are a helpful assistant that can list files in a directory, and read CSV files.
        you are responsible for givine the user an overview of the dataset 
        """
    )
)


analysis_agent = create_agent(
    "openai:gpt-4o-mini",
    tools=[list_files_in_directory, code_tool],
    system_prompt=(
        """
        You are a Data Analyst.
        You are responsible for analyzing the dataset and providing the user with insights.
        You have access to list_files_in_directory and code_tool. If the code fails to execute, respond with the code for the user to paste
        
        IMPORTANT: When analyzing data, you MUST:
        1. First use list_files_in_directory to see what files are available
        2. When using code_tool, ALWAYS pass the upload_files parameter with the CSV file path: ["data/data_science_student_marks.csv"]
        3. In your code, reference the uploaded file as '/home/user/data_science_student_marks.csv'
        
        Example usage:
        - Use list_files_in_directory to check available files
        - Use code_tool with upload_files=["data/data_science_student_marks.csv"] to upload and analyze the data
        
        Always upload the CSV file before trying to analyze it in the sandbox.
        """
    )
)







### 3. Tool Wrappers

#### `research(request: str) -> str`
- **Purpose**: Wrapper for the research agent
- **Input**: Natural language requests about datasets
- **Examples**:
  - "What files are present in the directory?"
  - "What data is present in the data science student dataset?"
  - "What is the average age of students?"

#### `analyze_data(request: str) -> str`
- **Purpose**: Wrapper for the analysis agent
- **Input**: Natural language requests for data analysis
- **Examples**:
  - "Plot a bar chart of average age by location"
  - "Create a histogram of student ages"
  - "Run regression analysis on the dataset"

In [58]:
@tool
def research(request:str) -> str:
    """
    Research Information about a file

    Use this when the user needs information about the avaliable files in the directory or 
    about a specific csv file present in the directory. Respond with the information in a concise manner.

    Input: Natural Language Request 
    Examples: 
    - What files are present in the directory?
    - What data is present in the data science student dataset?
    - What is the average age of the students in the data science student dataset?
    - What is the average salary of the students in the data science student dataset?
    - What is the average age of the students in the data science student dataset?
    """
    result = research_agent.invoke({
        "messages": [{"role": "user", "content": request}]
    })
    return result["messages"][-1].text

@tool
def analyze_data(request:str) -> str:
    """
    Analyze the data present in the dataset

    Use this when the user needs to analyze the data present in the dataset or to generate a chart that is not answered by the research tool.
    You have access to a sandbox to execute python code. 

    Input: Natural Language Request 
    Examples: 
       - Plot a bar chart of the average age of the students in the data science student dataset.
       - Create a histogram of the age of the students in the data science student dataset.
       - Run a regression analysis on the data science student dataset.
       
    """
    result = analysis_agent.invoke({
        "messages": [{"role": "user", "content": request}]
    })
    return result["messages"][-1].text







### Supervisor Agent
- **Model**: GPT-4o-mini
- **Tools**: `research`, `analyze_data`
- **Purpose**: Orchestrates the workflow between research and analysis
- **Features**:
  - Coordinates multiple tools in sequence
  - Provides comprehensive responses
  - Summarizes findings concisely

In [59]:
supervisor_agent = create_agent(
    "openai:gpt-4o-mini",
    tools=[research, analyze_data],
    system_prompt=(
        """
        You are a super assistant that can research and analyze the data present. 
        You have access to a research tool and an analyze tool.
        You are responsible for giving the user the best possible answer to their question.
        When a request involves multiple actions, use multiple tools in sequence.
        Do not make assumptions, only use the tools provided to you.
        Summarize your response in a concise manner.

        """)
    )

## Usage Example

The notebook includes a demonstration query:

```python
query = (
    "Give me an overview of the datasets I have "
    "Visualize any interesting insights from the first dataset"
)
```

This query triggers:
1. Research agent to list available files
2. Research agent to provide dataset overview
3. Analysis agent to create visualizations
4. Supervisor agent to coordinate and summarize results

In [60]:
query = (
    "Give me an overview of the datasets I have "
    "Visualize any interesting insights from the first dataset"
)

for step in supervisor_agent.stream(
    {"messages": [{"role": "user", "content": query}]}
):
    for update in step.values():
        for message in update.get("messages", []):
            message.pretty_print()

Tool Calls:
  research (call_5diOC0yZVzkP48zWBT4NDeiC)
 Call ID: call_5diOC0yZVzkP48zWBT4NDeiC
  Args:
    request: What files are present in the directory?
Name: research

The directory contains the following file: **data_science_student_marks.csv**.
Tool Calls:
  research (call_jbxm4YLflispiaeYIsDw7bfs)
 Call ID: call_jbxm4YLflispiaeYIsDw7bfs
  Args:
    request: What data is present in the data science student marks dataset?
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497 entries, 0 to 496
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   student_id      497 non-null    int64 
 1   location        497 non-null    object
 2   age             497 non-null    int64 
 3   sql_marks       497 non-null    int64 
 4   excel_marks     497 non-null    int64 
 5   python_marks    497 non-null    int64 
 6   power_bi_marks  497 non-null    int64 
 7   english_marks   497 non-null    int64 
dtypes: int64(7), object