# Working with Data: Files and Tabular Data

This notebook covers working with files and CSV data in Python: reading and writing files, manipulating data with pandas, and using AI tools for data analysis.

## Part 1: Working with Files

Python makes it simple to work with plain-text files such as `.txt` or `.md` files. Let's explore file operations and how to use them in powerful workflows.

In [None]:
from ai_tools import ask_ai
from IPython.display import Markdown

### Reading Files

In [None]:
# To open a file
with open("./file.txt", "r") as f:
    data = f.read()

print(data)
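For larger files, it can help to read one line at a time instead of loading everything with `read()`. The following is a minimal sketch; the file name `sample-lines.txt` is a hypothetical example, and the snippet creates that file itself so it runs on its own:

```python
# Create a small sample file so the example is self-contained
# ("sample-lines.txt" is a hypothetical name, not part of the course assets)
with open("sample-lines.txt", "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\nthird line\n")

# Iterating over the file object yields one line at a time
lines = []
with open("sample-lines.txt", "r", encoding="utf-8") as f:
    for line in f:
        lines.append(line.rstrip("\n"))  # drop the trailing newline

print(lines)
```

Passing `encoding="utf-8"` explicitly avoids surprises on systems where the default text encoding differs.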

### Creating and Writing Files

We can also create files easily in Python:

In [None]:
content = "This is a file"
with open("summary-notes.txt", "w") as f:
    f.write(content)

# Verify the file was created
!cat ./summary-notes.txt

### File Modes

Below are more examples of the different modes for reading and writing files available via the built-in `open()` function:

In [None]:
# Common file modes in Python's open() function:

# "a" - Append - Opens file for appending, creates new file if not exists
with open("summary-notes.txt", "a") as f:
    f.write("append this")

!cat summary-notes.txt

In [None]:
# "x" - Exclusive creation - Opens for writing, fails if file exists
try:
    with open("file.txt", "x") as f:
        f.write("new file content")
except FileExistsError:
    print("File already exists!")
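The same `try`/`except` pattern also covers reading a file that might be missing. Here is a small sketch; `read_text_or_default` is a hypothetical helper written for this example, not a built-in function:

```python
def read_text_or_default(path, default=""):
    # Hypothetical helper: return the file's text, or a default if the file is missing
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        print(f"Could not find {path}, using the default instead")
        return default

# This file name is made up and is assumed not to exist
text = read_text_or_default("this-file-does-not-exist.txt", default="(empty)")
print(text)
```

Catching the specific `FileNotFoundError` rather than a bare `except` keeps other problems, such as permission errors, visible.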

In [None]:
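# "r+" - Read and write mode (the file must already exist)
with open("file.txt", "r+") as f:  # Open for both reading and writing
    data = f.read()      # reading moves the position to the end of the file
    f.write("new data")  # so this write lands after the existing content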
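Where a write lands in `"r+"` mode depends on the current position in the file. The sketch below illustrates this with a hypothetical file, `rplus-demo.txt`, which the snippet creates itself:

```python
# "rplus-demo.txt" is a hypothetical file name used only for this sketch
with open("rplus-demo.txt", "w", encoding="utf-8") as f:
    f.write("hello world")

# After read(), the position is at the end, so write() adds to the end
with open("rplus-demo.txt", "r+", encoding="utf-8") as f:
    f.read()
    f.write("!")

# seek(0) moves back to the start, so write() overwrites from the beginning
with open("rplus-demo.txt", "r+", encoding="utf-8") as f:
    f.seek(0)
    f.write("HELLO")

with open("rplus-demo.txt", "r", encoding="utf-8") as f:
    result = f.read()

print(result)
```

Note that `"r+"` overwrites characters in place; it does not insert text or shorten the file.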

### Combining File Operations with AI

Reading and writing files becomes much more powerful when we combine it with AI-generated summaries: a few lines of Python can read a set of documents, summarize each one, and collect the results.

For example, below we write a short summary (a couple of sentences) for each of several files containing information about different papers.

In [None]:
folder_with_papers = "./assets-resources/papers/"
file_names = ["paper1.txt", "paper2.txt", "paper3.txt"]

def summarize_this_paper(paper_contents):
    summary_prompt = f"Summarize this paper in a couple of sentences:\n\n{paper_contents}"
    output_summary = ask_ai(summary_prompt)
    
    return output_summary

paper_summaries_list = []
for file_name in file_names:
    file_path = folder_with_papers + file_name
    with open(file_path, "r") as f:
        contents_of_the_paper = f.read()
    
    paper_summary = summarize_this_paper(contents_of_the_paper)
    paper_summaries_list.append(paper_summary)

# Build the markdown content and display it in the notebook
markdown_content = "# Paper Summaries\n\n"
for i, summary in enumerate(paper_summaries_list, 1):
    markdown_content += f"## Paper {i}\n\n"
    markdown_content += f"{summary}\n\n"
        
Markdown(markdown_content)
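Instead of listing file names by hand, the files in a folder can also be discovered with `pathlib`. This is a minimal sketch under the assumption that every `.txt` file in the folder is a paper; the temporary folder and its files are made up so the snippet runs on its own:

```python
from pathlib import Path
import tempfile

# Build a throwaway folder with a few files
# (hypothetical stand-ins for the papers folder used above)
demo_folder = Path(tempfile.mkdtemp())
for name in ["paper2.txt", "paper1.txt", "notes.md"]:
    (demo_folder / name).write_text(f"contents of {name}", encoding="utf-8")

# glob("*.txt") finds every .txt file; sorted() gives a stable order
paper_paths = sorted(demo_folder.glob("*.txt"))

# Read each file, keyed by its file name
contents_by_name = {}
for path in paper_paths:
    contents_by_name[path.name] = path.read_text(encoding="utf-8")

print(list(contents_by_name))
```

The `/` operator on `Path` objects joins path parts, which avoids mistakes with missing or doubled slashes when concatenating strings.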

We can use the same approach to extract specific information from documents. Imagine you have a stack of differently formatted invoices and want to pull out fields such as amounts and dates.
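One way such an extraction could look is sketched below. Everything here is an assumption for illustration: `fake_ask_ai` is a stand-in that returns a fixed reply so the snippet runs offline, `extract_invoice_fields` is a hypothetical helper, and the invoice text is made up. In the notebook, the real `ask_ai` function could be passed in instead.

```python
import json

def fake_ask_ai(prompt):
    # Stand-in for ask_ai so the sketch runs offline; a real model's
    # reply may not be valid JSON and needs checking
    return '{"amount": 1250.0, "date": "2024-03-15"}'

def extract_invoice_fields(invoice_text, ask=fake_ask_ai):
    # Ask for a machine-readable reply so the result can be parsed
    prompt = (
        "Extract the total amount and the invoice date from the invoice below. "
        'Reply with JSON only, using the keys "amount" and "date".\n\n'
        f"{invoice_text}"
    )
    reply = ask(prompt)
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # Fall back to an empty result when the reply is not valid JSON
        return {"amount": None, "date": None}

# Made-up invoice text for illustration
invoice = "Invoice #001\nDate: March 15, 2024\nTotal due: $1,250.00"
fields = extract_invoice_fields(invoice)
print(fields)
```

Asking for JSON and parsing it defensively makes it easier to load the extracted fields into a table later.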

## Part 2: Working with Tabular Data (CSV Files)

Let's learn about CSV files, which structure data into rows and columns (tabular data!). Text files are great, but sometimes you need more organization and structure. That's where CSV files come in.

In [None]:
# Super popular library for working with tabular data
import pandas as pd
from ai_tools import ask_ai

### Reading CSV Files

Imagine you have a bunch of information about customer tickets organized in a .csv file that you would like to understand a bit more about.

In [None]:
data_customer_tickets = pd.read_csv("./extracted_ticket_issues.csv")

data_customer_tickets

The data contains 3 columns:
1. `customer_name` - names of the customers
2. `issue_description` - description of the issue they had
3. `priority` - reference to the level of priority of that task
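Before filtering, it often helps to check the shape and contents of a table. The sketch below uses a small in-memory CSV with made-up rows that mimic the three columns described above, so it does not depend on the real file:

```python
import io
import pandas as pd

# Made-up rows that mimic the three columns described above
csv_text = """customer_name,issue_description,priority
Ana,Cannot log in,High
Ben,Invoice shows wrong amount,Medium
Cy,App crashes on start,High
"""

# read_csv accepts a file path or any file-like object
tickets = pd.read_csv(io.StringIO(csv_text))

print(tickets.shape)                       # (rows, columns)
print(tickets.columns.tolist())            # column names
print(tickets["priority"].value_counts())  # number of tickets per priority
```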

### Filtering Data

We can use pandas to select, for example, only the high-priority issues:

In [None]:
# == compares each value to "High" and returns a Series of True/False values
data_customer_tickets["priority"]=="High"

In [None]:
high_priority_issues = data_customer_tickets[data_customer_tickets["priority"]=="High"]

# Now we can take a look at the issues themselves:
high_priority_issues
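Boolean filters can also be combined. The following sketch uses a small made-up table (not the real ticket data) to show `&` for combining conditions and `isin()` for matching several values:

```python
import pandas as pd

# Made-up tickets for illustration
tickets = pd.DataFrame({
    "customer_name": ["Ana", "Ben", "Cy", "Dee"],
    "issue_description": ["Cannot log in", "Wrong invoice", "App crashes", "Login page is slow"],
    "priority": ["High", "Medium", "High", "Low"],
})

# Each condition produces a boolean Series; & combines them element-wise
# (when written inline, wrap each comparison in parentheses)
is_high = tickets["priority"] == "High"
mentions_log_in = tickets["issue_description"].str.contains("log in", case=False)
high_login_issues = tickets[is_high & mentions_log_in]

# isin() keeps rows whose value matches any entry in the list
high_or_medium = tickets[tickets["priority"].isin(["High", "Medium"])]

print(high_login_issues)
print(high_or_medium)
```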

### Using AI for Data Categorization

Awesome! Now we can, for example, use our `ask_ai` tool to categorize the issues, which helps organize the information, and then feed the categories back into the table:

In [None]:
categories_list = []
for issue in high_priority_issues["issue_description"]:
    print(f"Categorizing issue: {issue}")
    category = ask_ai(f"Categorize this issue in just one single word and OUTPUT ONLY THAT WORD:\n\n issue: {issue}\n category: \n")
    print(f"Category: {category}")
    categories_list.append(category)

Notice that we use concepts we've learned before: looping over the issues and saving each category to a list.
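AI replies are free text, so the same category may come back with different casing, stray whitespace, or punctuation. A small cleanup step can make the values consistent. This is a sketch; `normalize_category` is a hypothetical helper and the raw answers are made up:

```python
def normalize_category(raw_answer):
    # Keep only the first word, drop surrounding punctuation, and title-case it
    words = raw_answer.strip().split()
    if not words:
        return "Uncategorized"
    cleaned = words[0].strip(".,:;!\"'")
    return cleaned.title() if cleaned else "Uncategorized"

# Made-up examples of how replies might vary
raw_answers = ["Billing", " login.\n", "PERFORMANCE", ""]
cleaned_categories = [normalize_category(answer) for answer in raw_answers]
print(cleaned_categories)
```

Consistent values matter later, because grouping or counting treats `"login"` and `"Login"` as two different categories.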

With that information in hand, we can update the dataframe. First we create a new column, then we fill it in for the high-priority rows:

In [None]:
data_customer_tickets["issue_category"] = None

# Update categories for high priority issues using the index from high_priority_issues
for idx, category in zip(high_priority_issues.index, categories_list):
    data_customer_tickets.loc[idx, "issue_category"] = category

data_customer_tickets

Notice that the issues we did not analyse still contain `None`, indicating they haven't been categorized yet!
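The rows that are still uncategorized can be found with `isna()`. The sketch below uses a small made-up table and made-up categories, and also shows an alternative to the loop: assigning all categories at once through a boolean mask.

```python
import pandas as pd

# Made-up tickets and categories for illustration
tickets = pd.DataFrame({
    "issue_description": ["Cannot log in", "Wrong invoice", "App crashes"],
    "priority": ["High", "Medium", "High"],
})
categories = ["Login", "Crash"]  # one category per high-priority row, in order

# Assign all categories in one step using a boolean mask
mask = tickets["priority"] == "High"
tickets["issue_category"] = None
tickets.loc[mask, "issue_category"] = categories

# isna() flags the rows that still have no category
uncategorized = tickets[tickets["issue_category"].isna()]
print(len(uncategorized), "ticket(s) still uncategorized")
```

The mask-based assignment assumes the list has exactly one entry per selected row, in the same order as the rows appear.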

## Creating and Managing Structured Data

Besides analysing data, we can also create our own tables with information we care about.

Let's start with a practical example - creating a camping trip gear checklist:

In [None]:
# Create a camping gear checklist
camping_gear = {
    "item": [
        "Tent", "Sleeping Bag", "Backpack", "Hiking Boots",
        "Water Filter", "First Aid Kit", "Headlamp", "Camp Stove"
    ],
    "priority": [
        "Essential", "Essential", "Essential", "Essential",
        "High", "Essential", "High", "Medium"
    ],
    "estimated_cost": [
        299.99, 149.99, 199.99, 159.99,
        89.99, 49.99, 39.99, 79.99
    ],
    "packed": [
        False, False, False, False,
        False, False, False, False
    ]
}

# Convert to DataFrame
gear_df = pd.DataFrame(camping_gear)
print("Camping Gear Checklist:")
display(gear_df)

### Working with Data Filters

Let's demonstrate how to filter and analyze our data:

In [None]:
def analyze_gear_requirements():
    # Filter essential items
    essential_gear = gear_df[gear_df['priority'] == 'Essential']
    
    # Calculate total cost of essential items
    essential_cost = essential_gear['estimated_cost'].sum()
    
    # Get unpacked essential items
    unpacked_essential = essential_gear[~essential_gear['packed']]
    
    print(f"Total cost of essential gear: ${essential_cost:.2f}")
    print("\nUnpacked essential items:")
    display(unpacked_essential[['item', 'estimated_cost']])

analyze_gear_requirements()
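Filtering answers a question about one priority level at a time. To summarize every level in one step, `groupby` is a common choice. Here is a minimal sketch on a shortened copy of the gear table:

```python
import pandas as pd

# A shortened copy of the gear checklist
gear = pd.DataFrame({
    "item": ["Tent", "Water Filter", "Headlamp", "Camp Stove"],
    "priority": ["Essential", "High", "High", "Medium"],
    "estimated_cost": [299.99, 89.99, 39.99, 79.99],
})

# groupby splits the rows by priority; agg computes one result per group
cost_by_priority = gear.groupby("priority")["estimated_cost"].agg(["count", "sum"])
print(cost_by_priority)
```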

### Creating a Trip Itinerary

Let's create a more complex example with a detailed trip itinerary:

In [None]:
def create_trip_itinerary():
    itinerary_data = {
        'day': range(1, 6),
        'date': pd.date_range('2024-06-01', periods=5),
        'activity': [
            'Arrival and Camp Setup',
            'Mountain Trail Hike',
            'Lake Exploration',
            'Forest Adventure',
            'Pack and Departure'
        ],
        'location': [
            'Basecamp Area',
            'Mountain Ridge Trail',
            'Crystal Lake',
            'Ancient Forest',
            'Basecamp Area'
        ],
        'distance_km': [2, 8, 5, 6, 2],
        'difficulty': [
            'Easy',
            'Hard',
            'Moderate',
            'Moderate',
            'Easy'
        ]
    }
    
    itinerary_df = pd.DataFrame(itinerary_data)
    return itinerary_df

# Create and display the itinerary
trip_itinerary = create_trip_itinerary()
print("Trip Itinerary:")
display(trip_itinerary)

### Analyzing Trip Statistics

Let's add some analysis to our trip planning:

In [None]:
def analyze_trip_metrics(itinerary_df):
    # Calculate total distance
    total_distance = itinerary_df['distance_km'].sum()
    
    # Get difficulty breakdown
    difficulty_counts = itinerary_df['difficulty'].value_counts()
    
    # Find longest day
    longest_day = itinerary_df.loc[itinerary_df['distance_km'].idxmax()]
    
    print("Trip Analysis:")
    print(f"Total distance: {total_distance} km")
    print("\nDifficulty breakdown:")
    display(difficulty_counts)
    print(f"\nLongest day: Day {longest_day['day']} - {longest_day['activity']}")
    print(f"Distance: {longest_day['distance_km']} km")

analyze_trip_metrics(trip_itinerary)

### Exporting and Saving Data

Let's see how to save our data for later use:

In [None]:
def export_trip_data(gear_df, itinerary_df, filename_prefix):
    # Export to CSV
    gear_df.to_csv(f"{filename_prefix}_gear.csv", index=False)
    itinerary_df.to_csv(f"{filename_prefix}_itinerary.csv", index=False)
    print(f"Data exported to {filename_prefix}_gear.csv and {filename_prefix}_itinerary.csv")

# Export our data
export_trip_data(gear_df, trip_itinerary, "camping_trip")
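When an exported CSV is loaded again, dates come back as plain text unless pandas is told to parse them. The sketch below shows a round trip on a two-row table; the file name `itinerary_demo.csv` and the temporary folder are assumptions made so the example does not write into the project folder:

```python
import tempfile
from pathlib import Path
import pandas as pd

# A two-row excerpt of the itinerary
itinerary = pd.DataFrame({
    "day": [1, 2],
    "date": pd.date_range("2024-06-01", periods=2),
    "activity": ["Arrival and Camp Setup", "Mountain Trail Hike"],
})

# Write to a temporary folder rather than the project folder
csv_path = Path(tempfile.mkdtemp()) / "itinerary_demo.csv"
itinerary.to_csv(csv_path, index=False)

# parse_dates converts the listed columns back into datetime values
reloaded = pd.read_csv(csv_path, parse_dates=["date"])
print(reloaded.dtypes)
```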

### Practical Exercise: Trip Budget Calculator

Let's create a budget calculator for our trip:

In [None]:
def calculate_trip_budget(gear_df, itinerary_df):
    # Equipment costs
    total_gear_cost = gear_df['estimated_cost'].sum()
    
    # Daily expenses (example values)
    daily_expenses = {
        'food': 30,
        'fuel': 10,
        'miscellaneous': 15
    }
    
    num_days = len(itinerary_df)
    daily_total = sum(daily_expenses.values())
    total_daily_costs = daily_total * num_days
    
    # Create budget summary
    budget_summary = pd.DataFrame({
        'Category': ['Gear', 'Food', 'Fuel', 'Miscellaneous'],
        'Cost': [
            total_gear_cost,
            daily_expenses['food'] * num_days,
            daily_expenses['fuel'] * num_days,
            daily_expenses['miscellaneous'] * num_days
        ]
    })
    
    budget_summary['Percentage'] = (
        budget_summary['Cost'] / budget_summary['Cost'].sum() * 100
    ).round(1)
    
    return budget_summary

# Calculate and display budget
budget = calculate_trip_budget(gear_df, trip_itinerary)
print("Trip Budget Summary:")
display(budget)
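The arithmetic behind the summary can be checked in plain Python. This sketch repeats the calculation with the gear prices and daily expenses used above. Because each percentage is rounded separately, the rounded values may not add up to exactly 100:

```python
# Gear prices copied from the checklist above
gear_costs = [299.99, 149.99, 199.99, 159.99, 89.99, 49.99, 39.99, 79.99]
daily_expenses = {"food": 30, "fuel": 10, "miscellaneous": 15}
num_days = 5  # the itinerary has five rows

# One entry per budget category
costs = {"Gear": sum(gear_costs)}
for name, per_day in daily_expenses.items():
    costs[name.title()] = per_day * num_days

total = sum(costs.values())
percentages = {name: round(cost / total * 100, 1) for name, cost in costs.items()}

print(f"Total budget: ${total:.2f}")
for name, share in percentages.items():
    print(f"{name}: {share}%")
print(f"Sum of rounded percentages: {sum(percentages.values()):.1f}")
```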

## Key Takeaways

### Working with Files
- Always use `with` statements when working with files to ensure proper closure
- Different file modes serve different purposes:
  - `"r"` for reading
  - `"w"` for writing (creates new/overwrites)
  - `"a"` for appending
  - `"x"` for exclusive creation (fails if the file exists)
  - `"r+"` for reading and writing
- Always handle potential file-related exceptions
- File operations can be combined with data processing for powerful automation
- Consider creating helper functions for common file operations

### Working with CSV and Tabular Data
- Pandas provides powerful tools for working with tabular data
- DataFrames can be filtered and analyzed in various ways
- Data can be exported to different formats (CSV, Excel)
- Structured data makes analysis and planning easier
- Always consider data types when creating DataFrames
- Use appropriate column names and data organization
- Remember to handle missing data appropriately

### Combining AI with Data Processing
- AI tools can be integrated with file and data operations for enhanced automation
- Use AI for tasks like summarization, categorization, and information extraction
- Combine traditional data processing with AI capabilities for powerful workflows

In the next lesson, we'll explore Python packages, APIs, and more advanced data processing techniques!