# Working with Data: Files and Tabular Data

This notebook covers working with files and CSV data in Python: reading and writing files, manipulating tabular data with pandas, and using AI tools for data analysis.

## Part 1: Working with Files

Python makes it simple to work with plain-text files such as `.txt` or `.md` files. Let's explore file operations and how to combine them into useful workflows.

In [1]:
import sys
sys.path.append('../../scripts')
from ai_tools import ask_ai
from IPython.display import Markdown

### Reading Files

In [2]:
# Open a file for reading; the `with` block closes it automatically
with open("./file.txt", "r") as f:
    data = f.read()

print(data)

this is a sample filethis is an append to my filernew datanew data


### Creating and Writing Files

We can also create files easily in Python:

In [3]:
content = "This is a file"
with open("summary-notes.txt", "w") as f:
    f.write(content)

# Verify the file was created
!cat ./summary-notes.txt

This is a file

### File Modes

Below are more examples of the different modes for reading and writing files available via the built-in `open()` function:

In [4]:
# Common file modes in Python's open() function:

# "a" - Append - Opens file for appending, creates new file if not exists
with open("summary-notes.txt", "a") as f:
    f.write("append this")

!cat summary-notes.txt

This is a fileappend this

In [5]:
# "x" - Exclusive creation - Opens for writing, fails if file exists
try:
    with open("file.txt", "x") as f:
        f.write("new file content")
except FileExistsError:
    print("File already exists!")

File already exists!
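The same `try`/`except` pattern also covers the opposite case, reading a file that may not exist. Below is a minimal sketch of a helper for that. The function name `read_text_or_default` and the file name are hypothetical, chosen only for illustration.

```python
def read_text_or_default(path, default=""):
    """Return the file's contents, or `default` if the file cannot be read."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        print(f"File not found: {path}")
        return default
    except PermissionError:
        print(f"No permission to read: {path}")
        return default

# Hypothetical file name that is assumed not to exist
contents = read_text_or_default("this-file-does-not-exist.txt", default="(empty)")
print(contents)
```

Returning a default keeps the rest of a workflow running when one input file is missing, at the cost of hiding the error, so the printed message matters.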


In [6]:
# "+" - Read and write mode
with open("file.txt", "r+") as f:  # Open for both reading and writing
    data = f.read()      # Reading moves the position to the end of the file
    f.write("new data")  # so this text is added at the end
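One detail of `"r+"` that is easy to miss: reading and writing share a single position in the file. The sketch below illustrates this in a temporary directory so no real file is touched. The file name `demo.txt` is a made-up example.

```python
import os
import tempfile

# Work in a throwaway directory so no real file is modified
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    with open(demo_path, "w") as f:
        f.write("hello")

    with open(demo_path, "r+") as f:
        f.read()            # moves the position to the end of the file
        f.write(" world")   # so this write is added at the end

    with open(demo_path, "r") as f:
        after_read_then_write = f.read()

    with open(demo_path, "r+") as f:
        f.write("J")        # position starts at 0, so this overwrites "h"

    with open(demo_path, "r") as f:
        after_write_at_start = f.read()

print(after_read_then_write)
print(after_write_at_start)
```

Writing before reading overwrites existing characters rather than inserting new ones, which is why `"a"` is usually the safer choice when the goal is only to add text.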

### Combining File Operations with AI

What makes this useful is that we can combine AI-generated summaries with Python's ability to read and write files to build powerful workflows.

For example, below we write short summaries for multiple files, each containing the text of a different paper.

In [7]:
folder_with_papers = "../assets/papers/"
file_names = ["paper1.txt", "paper2.txt", "paper3.txt"]

def summarize_this_paper(paper_contents):
    summary_prompt = f"Summarize this paper in a couple of sentences:\n\n{paper_contents}"
    output_summary = ask_ai(summary_prompt)
    
    return output_summary

paper_summaries_list = []
for file_name in file_names:
    file_path = folder_with_papers + file_name
    with open(file_path, "r") as f:
        contents_of_the_paper = f.read()
    
    paper_summary = summarize_this_paper(contents_of_the_paper)
    paper_summaries_list.append(paper_summary)

# Display the markdown content in the notebook
markdown_content = "# Paper Summaries\n\n"
for i, summary in enumerate(paper_summaries_list, 1):
    markdown_content += f"## Paper {i}\n\n"
    markdown_content += f"{summary}\n\n"
        
Markdown(markdown_content)

# Paper Summaries

## Paper 1

The paper titled "TheAgent Company: Benchmarking LLM Agents on Consequential Real World Tasks" introduces TheAgentCompany benchmark, designed to evaluate the effectiveness of large language model (LLM)-powered AI agents in performing various professional tasks within a simulated software company environment. The benchmark encompasses 175 diverse tasks in areas such as software engineering and project management, employing web browsing and communication with simulated colleagues for realism. Performance evaluations revealed that top models like Claude-3.5-Sonnet completed only 24% of tasks autonomously, highlighting ongoing challenges, particularly in complex interactions and lengthy tasks. The research emphasizes the need for further advancements in LLM capabilities, particularly in social interactions and decision-making, signaling a critical step towards understanding and enhancing AI integration into professional settings.

## Paper 2

The paper introduces a new research area called Automated Design of Agentic Systems (ADAS), which aims to automate the creation of powerful agentic systems by using a meta-agent to define agents in code. The method, termed Meta Agent Search, involves iteratively generating and evaluating new agents based on previously discovered ones, leading to the invention of novel designs that often outperform traditional hand-designed agents across various tasks such as reasoning, math, and science. The experiments demonstrate the robustness and transferability of the generated agents across different domains, indicating the potential for significant advancements in AI capabilities through automation in agent design.

## Paper 3

The paper "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences" by Shreya Shankar et al. explores the challenges of effectively validating the evaluations produced by Large Language Models (LLMs). The authors emphasize that while LLMs provide a means to assess outputs efficiently, they can inherit the shortcomings of the models they evaluate. As a solution, the paper introduces a mixed-initiative interface called EvalGen, which helps users generate evaluation criteria and implement assertions while gathering iterative human feedback to align the generated evaluations with user preferences.

Key contributions of the paper include:

1. **Mixed-Initiative Approach**: EvalGen seeks to validate LLM-generated evaluation functions by allowing users to iteratively refine criteria based on their evaluations of LLM outputs. This is achieved through a voting system where users can give quick feedback (good/bad) on LLM outputs.

2. **Criteria Drift Phenomenon**: The authors identify "criteria drift," where user-defined evaluation criteria evolve as they interact with outputs. Users often need to re-evaluate their criteria based on observed outputs, highlighting a fundamental challenge when trying to establish fixed evaluation standards.

3. **Empirical Evaluation**: The paper includes an empirical study involving nine industry practitioners who utilized EvalGen in a real-world context. Findings indicated that while participants mostly appreciated EvalGen's ability to assist with assertion creation, they encountered challenges in aligning the generated assertions with their standards—particularly due to criteria drift and the subjectivity of evaluative metrics.

4. **Algorithmic Framework**: EvalGen employs an algorithm that suggests evaluation criteria and synthesizes potential assertions based on user input, leading to better alignment with human preferences compared to fully automated systems like SPADE, particularly in terms of coverage and false failure rates.

Overall, the paper emphasizes the need for interactive systems that account for the inherent complexity and subjectivity in evaluating LLM outputs, arguing for continuous alignment through user feedback and iterative refinement in the design of LLM evaluation assistants. The findings have implications for the future design of LLMOps and evaluation frameworks, raising questions about the stability of evaluation criteria and the dynamics of human-machine collaboration.
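The summaries above are only displayed in the notebook. To also keep them on disk, the same Markdown text could be written to a file. This is a minimal sketch under stated assumptions: the summaries are placeholder strings rather than real `ask_ai` output, and the helper `build_summary_markdown` and the file name `paper_summaries.md` are hypothetical.

```python
import tempfile
from pathlib import Path

def build_summary_markdown(summaries):
    """Build one Markdown document from a list of summary strings."""
    parts = ["# Paper Summaries\n"]
    for i, summary in enumerate(summaries, 1):
        parts.append(f"## Paper {i}\n\n{summary}\n")
    return "\n".join(parts)

# Placeholder text standing in for the output of ask_ai
placeholder_summaries = ["First summary.", "Second summary."]
markdown_text = build_summary_markdown(placeholder_summaries)

# A temporary directory keeps the sketch from leaving files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "paper_summaries.md"  # hypothetical file name
    output_path.write_text(markdown_text, encoding="utf-8")
    saved_text = output_path.read_text(encoding="utf-8")

print(saved_text)
```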



We can do similar things to extract specific information from documents. Imagine you have a set of differently formatted invoices and want to organize their contents by extracting fields such as amounts and dates.
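The invoice idea could be sketched roughly as follows. Everything here is an assumption made for illustration: the function `extract_invoice_fields`, the JSON reply format, the sample invoice text, and the stand-in `fake_ask_ai`, which replaces the notebook's `ask_ai` so the example runs without an API call. Real models do not always return valid JSON, so the sketch keeps the raw reply when parsing fails.

```python
import json

def extract_invoice_fields(invoice_text, ask):
    """Ask an AI helper for the amount and date of one invoice.

    `ask` is any function that takes a prompt string and returns a string.
    """
    prompt = (
        "Extract the total amount and the date from this invoice. "
        'Reply with JSON only, like {"amount": 0.0, "date": "YYYY-MM-DD"}.\n\n'
        f"{invoice_text}"
    )
    reply = ask(prompt)
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # The reply was not valid JSON; keep it for manual review
        return {"amount": None, "date": None, "raw_reply": reply}

def fake_ask_ai(prompt):
    """Stand-in for ask_ai that always returns the same canned reply."""
    return '{"amount": 120.5, "date": "2024-03-15"}'

# Made-up invoice text, for illustration only
invoice = "Invoice 0001\nDate: March 15, 2024\nTotal due: $120.50"
fields = extract_invoice_fields(invoice, fake_ask_ai)
print(fields)
```

Passing the AI helper in as an argument makes the function easy to try with a stand-in first and with the real helper later.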

## Part 2: Working with Tabular Data (CSV Files)

Let's learn about CSV files, which structure data into rows and columns (tabular data!). Text files are great, but sometimes you need more organization and structure. That's where CSV files come in.

In [8]:
# pandas: a widely used library for working with tabular data
import pandas as pd
import sys
sys.path.append('../../scripts')
from ai_tools import ask_ai

### Reading CSV Files

Imagine you have information about customer tickets stored in a `.csv` file, and you would like to understand it better.

In [9]:
data_customer_tickets = pd.read_csv("../assets/extracted_ticket_issues.csv")

data_customer_tickets

| | customer_name | issue_description | priority |
|---|---|---|---|
| 0 | Jane Doe | Customer was charged twice for the same transa... | High |
| 1 | John Smith | Customer unable to log into their account, fac... | Medium |
| 2 | Alice Johnson | Customer wants more information about product ... | Low |
| 3 | Bob Brown | Customer has not received the order yet, track... | High |
| 4 | Michael Lee | Customer wants to return a product and needs a... | Medium |


The data contains 3 columns, plus the row index on the left:
1. `customer_name` - name of the customer
2. `issue_description` - description of the issue they had
3. `priority` - priority level of the ticket

### Filtering Data

For example, we can use pandas to select only the high-priority issues:

In [10]:
# == compares each value in the column to "High" and returns True or False per row
data_customer_tickets["priority"]=="High"

0     True
1    False
2    False
3     True
4    False
Name: priority, dtype: bool

In [11]:
high_priority_issues = data_customer_tickets[data_customer_tickets["priority"]=="High"]

# Now we can take a look at the issues themselves:
high_priority_issues

| | customer_name | issue_description | priority |
|---|---|---|---|
| 0 | Jane Doe | Customer was charged twice for the same transa... | High |
| 3 | Bob Brown | Customer has not received the order yet, track... | High |
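Filters like the one above can also be combined. The sketch below uses a small made-up table (the rows are illustrative, not the real ticket data) to show `&` for combining conditions and `isin()` for matching several values.

```python
import pandas as pd

# Small made-up table with the same column names as the tickets data
tickets = pd.DataFrame({
    "customer_name": ["A", "B", "C", "D"],
    "issue_description": [
        "charged twice", "cannot log in", "late order", "refund request"
    ],
    "priority": ["High", "Medium", "High", "Low"],
})

# Combine conditions with & (and) or | (or); wrap comparisons in parentheses
high_and_charged = tickets[
    (tickets["priority"] == "High")
    & tickets["issue_description"].str.contains("charged")
]

# isin() keeps rows whose value matches any entry in the list
high_or_medium = tickets[tickets["priority"].isin(["High", "Medium"])]

print(high_and_charged)
print(high_or_medium)
```

The parentheses matter because `&` binds more tightly than `==` in Python, so leaving them out changes what is being compared.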


### Using AI for Data Categorization

Awesome! Now we can, for example, use our `ask_ai` tool to categorize the issues and help organize the information, then feed the result back into the table:

In [12]:
categories_list = []
for issue in high_priority_issues["issue_description"]:
    print(f"Categorizing issue: {issue}")
    category = ask_ai(f"Categorize this issue in just one single word and OUTPUT ONLY THAT WORD:\n\n issue: {issue}\n category: \n")
    print(f"Category: {category}")
    categories_list.append(category)

Categorizing issue: Customer was charged twice for the same transaction.


Category: Billing
Categorizing issue: Customer has not received the order yet, tracking information shows a delay.


Category: Delay


Notice that we reuse concepts we've learned before: looping over the issues and saving each category to a list.

With that information in hand, we can update the dataframe. First we create a new column, then fill it in for the high-priority rows:

In [13]:
data_customer_tickets["issue_category"] = None

# Update categories for high priority issues using the index from high_priority_issues
for idx, category in zip(high_priority_issues.index, categories_list):
    data_customer_tickets.loc[idx, "issue_category"] = category

data_customer_tickets

| | customer_name | issue_description | priority | issue_category |
|---|---|---|---|---|
| 0 | Jane Doe | Customer was charged twice for the same transa... | High | Billing |
| 1 | John Smith | Customer unable to log into their account, fac... | Medium | |
| 2 | Alice Johnson | Customer wants more information about product ... | Low | |
| 3 | Bob Brown | Customer has not received the order yet, track... | High | Delay |
| 4 | Michael Lee | Customer wants to return a product and needs a... | Medium | |


Notice that the issues we did not analyze still have no category (the column holds `None` for them), indicating they haven't been categorized yet!
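Rows without a category can be found or labeled with pandas' missing-data helpers. This is a minimal sketch on a made-up table that mimics the `issue_category` column. The label `"Uncategorized"` is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up table mimicking the partially filled issue_category column
tickets = pd.DataFrame({
    "priority": ["High", "Medium", "High"],
    "issue_category": ["Billing", None, "Delay"],
})

# isna() marks the rows that have no category yet
uncategorized = tickets[tickets["issue_category"].isna()]

# fillna() returns a copy with the gaps replaced by a label
labeled = tickets["issue_category"].fillna("Uncategorized")

print(len(uncategorized))
print(labeled.tolist())
```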

## Creating and Managing Structured Data

Besides analysing data, we can also create our own tables with information we care about.

Let's start with a practical example - creating a camping trip gear checklist:

In [14]:
# Create a camping gear checklist
camping_gear = {
    "item": [
        "Tent", "Sleeping Bag", "Backpack", "Hiking Boots",
        "Water Filter", "First Aid Kit", "Headlamp", "Camp Stove"
    ],
    "priority": [
        "Essential", "Essential", "Essential", "Essential",
        "High", "Essential", "High", "Medium"
    ],
    "estimated_cost": [
        299.99, 149.99, 199.99, 159.99,
        89.99, 49.99, 39.99, 79.99
    ],
    "packed": [
        False, False, False, False,
        False, False, False, False
    ]
}

# Convert to DataFrame
gear_df = pd.DataFrame(camping_gear)
print("Camping Gear Checklist:")
display(gear_df)

Camping Gear Checklist:


| | item | priority | estimated_cost | packed |
|---|---|---|---|---|
| 0 | Tent | Essential | 299.99 | False |
| 1 | Sleeping Bag | Essential | 149.99 | False |
| 2 | Backpack | Essential | 199.99 | False |
| 3 | Hiking Boots | Essential | 159.99 | False |
| 4 | Water Filter | High | 89.99 | False |
| 5 | First Aid Kit | Essential | 49.99 | False |
| 6 | Headlamp | High | 39.99 | False |
| 7 | Camp Stove | Medium | 79.99 | False |


### Working with Data Filters

Let's demonstrate how to filter and analyze our data:

In [15]:
def analyze_gear_requirements():
    # Filter essential items
    essential_gear = gear_df[gear_df['priority'] == 'Essential']
    
    # Calculate total cost of essential items
    essential_cost = essential_gear['estimated_cost'].sum()
    
    # Get unpacked essential items
    unpacked_essential = essential_gear[~essential_gear['packed']]
    
    print(f"Total cost of essential gear: ${essential_cost:.2f}")
    print("\nUnpacked essential items:")
    display(unpacked_essential[['item', 'estimated_cost']])

analyze_gear_requirements()

Total cost of essential gear: $859.95

Unpacked essential items:


| | item | estimated_cost |
|---|---|---|
| 0 | Tent | 299.99 |
| 1 | Sleeping Bag | 149.99 |
| 2 | Backpack | 199.99 |
| 3 | Hiking Boots | 159.99 |
| 5 | First Aid Kit | 49.99 |
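Filtering handles one priority level at a time. To total the cost for every priority level in one step, `groupby` is a common option. The sketch below uses a four-row subset of the checklist values and is self-contained rather than relying on `gear_df`.

```python
import pandas as pd

# Four rows taken from the gear checklist, enough to show the grouping
gear = pd.DataFrame({
    "item": ["Tent", "Water Filter", "Headlamp", "Camp Stove"],
    "priority": ["Essential", "High", "High", "Medium"],
    "estimated_cost": [299.99, 89.99, 39.99, 79.99],
})

# Group rows by priority, then sum the cost within each group
cost_by_priority = gear.groupby("priority")["estimated_cost"].sum().round(2)

print(cost_by_priority)
```

The result is a Series indexed by priority level, so a single total can be looked up with, for example, `cost_by_priority["High"]`.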


### Creating a Trip Itinerary

Let's create a more complex example with a detailed trip itinerary:

In [16]:
def create_trip_itinerary():
    itinerary_data = {
        'day': range(1, 6),
        'date': pd.date_range('2024-06-01', periods=5),
        'activity': [
            'Arrival and Camp Setup',
            'Mountain Trail Hike',
            'Lake Exploration',
            'Forest Adventure',
            'Pack and Departure'
        ],
        'location': [
            'Basecamp Area',
            'Mountain Ridge Trail',
            'Crystal Lake',
            'Ancient Forest',
            'Basecamp Area'
        ],
        'distance_km': [2, 8, 5, 6, 2],
        'difficulty': [
            'Easy',
            'Hard',
            'Moderate',
            'Moderate',
            'Easy'
        ]
    }
    
    itinerary_df = pd.DataFrame(itinerary_data)
    return itinerary_df

# Create and display the itinerary
trip_itinerary = create_trip_itinerary()
print("Trip Itinerary:")
display(trip_itinerary)

Trip Itinerary:


| | day | date | activity | location | distance_km | difficulty |
|---|---|---|---|---|---|---|
| 0 | 1 | 2024-06-01 | Arrival and Camp Setup | Basecamp Area | 2 | Easy |
| 1 | 2 | 2024-06-02 | Mountain Trail Hike | Mountain Ridge Trail | 8 | Hard |
| 2 | 3 | 2024-06-03 | Lake Exploration | Crystal Lake | 5 | Moderate |
| 3 | 4 | 2024-06-04 | Forest Adventure | Ancient Forest | 6 | Moderate |
| 4 | 5 | 2024-06-05 | Pack and Departure | Basecamp Area | 2 | Easy |


### Analyzing Trip Statistics

Let's add some analysis to our trip planning:

In [17]:
def analyze_trip_metrics(itinerary_df):
    # Calculate total distance
    total_distance = itinerary_df['distance_km'].sum()
    
    # Get difficulty breakdown
    difficulty_counts = itinerary_df['difficulty'].value_counts()
    
    # Find longest day
    longest_day = itinerary_df.loc[itinerary_df['distance_km'].idxmax()]
    
    print("Trip Analysis:")
    print(f"Total distance: {total_distance} km")
    print("\nDifficulty breakdown:")
    display(difficulty_counts)
    print(f"\nLongest day: Day {longest_day['day']} - {longest_day['activity']}")
    print(f"Distance: {longest_day['distance_km']} km")

analyze_trip_metrics(trip_itinerary)

Trip Analysis:
Total distance: 23 km

Difficulty breakdown:


difficulty
Easy        2
Moderate    2
Hard        1
Name: count, dtype: int64


Longest day: Day 2 - Mountain Trail Hike
Distance: 8 km


### Exporting and Saving Data

Let's see how to save our data for later use:

In [18]:
def export_trip_data(gear_df, itinerary_df, filename_prefix):
    # Export to CSV
    gear_df.to_csv(f"{filename_prefix}_gear.csv", index=False)
    itinerary_df.to_csv(f"{filename_prefix}_itinerary.csv", index=False)
    print(f"Data exported to {filename_prefix}_gear.csv and {filename_prefix}_itinerary.csv")

# Export our data
export_trip_data(gear_df, trip_itinerary, "camping_trip")

Data exported to camping_trip_gear.csv and camping_trip_itinerary.csv
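When an exported CSV is loaded again later, column types are not always restored automatically. In particular, dates come back as text unless `read_csv` is told to parse them. The sketch below shows this with a two-row table written to a temporary directory. The file name `itinerary.csv` is a made-up example.

```python
import os
import tempfile

import pandas as pd

# Two-row table with a datetime column, similar to the itinerary
itinerary = pd.DataFrame({
    "day": [1, 2],
    "date": pd.date_range("2024-06-01", periods=2),
    "distance_km": [2, 8],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "itinerary.csv")  # hypothetical file name
    itinerary.to_csv(csv_path, index=False)

    # Without parse_dates the "date" column is read back as plain text
    as_text = pd.read_csv(csv_path)
    # parse_dates converts the listed columns back to datetime values
    as_dates = pd.read_csv(csv_path, parse_dates=["date"])

print(as_text["date"].dtype)
print(as_dates["date"].dtype)
```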


### Practical Exercise: Trip Budget Calculator

Let's create a budget calculator for our trip:

In [19]:
def calculate_trip_budget(gear_df, itinerary_df):
    # Equipment costs
    total_gear_cost = gear_df['estimated_cost'].sum()
    
    # Daily expenses (example values)
    daily_expenses = {
        'food': 30,
        'fuel': 10,
        'miscellaneous': 15
    }
    
    num_days = len(itinerary_df)
    daily_total = sum(daily_expenses.values())
    total_daily_costs = daily_total * num_days
    
    # Create budget summary
    budget_summary = pd.DataFrame({
        'Category': ['Gear', 'Food', 'Fuel', 'Miscellaneous'],
        'Cost': [
            total_gear_cost,
            daily_expenses['food'] * num_days,
            daily_expenses['fuel'] * num_days,
            daily_expenses['miscellaneous'] * num_days
        ]
    })
    
    budget_summary['Percentage'] = (
        budget_summary['Cost'] / budget_summary['Cost'].sum() * 100
    ).round(1)
    
    return budget_summary

# Calculate and display budget
budget = calculate_trip_budget(gear_df, trip_itinerary)
print("Trip Budget Summary:")
display(budget)

Trip Budget Summary:


| | Category | Cost | Percentage |
|---|---|---|---|
| 0 | Gear | 1069.92 | 79.6 |
| 1 | Food | 150.0 | 11.2 |
| 2 | Fuel | 50.0 | 3.7 |
| 3 | Miscellaneous | 75.0 | 5.6 |


## Key Takeaways

### Working with Files
- Always use `with` statements when working with files to ensure proper closure
- Different file modes serve different purposes:
  - `"r"` for reading
  - `"w"` for writing (creates new/overwrites)
  - `"a"` for appending
  - `"x"` for exclusive creation (fails if the file already exists)
  - `"r+"` for reading and writing
- Always handle potential file-related exceptions
- File operations can be combined with data processing for powerful automation
- Consider creating helper functions for common file operations

### Working with CSV and Tabular Data
- Pandas provides powerful tools for working with tabular data
- DataFrames can be filtered and analyzed in various ways
- Data can be exported to different formats (CSV, Excel)
- Structured data makes analysis and planning easier
- Always consider data types when creating DataFrames
- Use appropriate column names and data organization
- Remember to handle missing data appropriately

### Combining AI with Data Processing
- AI tools can be integrated with file and data operations for enhanced automation
- Use AI for tasks like summarization, categorization, and information extraction
- Combine traditional data processing with AI capabilities for powerful workflows

In the next lesson, we'll explore Python packages, APIs, and more advanced data processing techniques!