# Task 1: Ticket Processing Pipeline

## Scenario
You have a CSV with support ticket data. Each ticket has:
- `id`: ticket ID
- `category`: ticket category
- `description`: ticket description (may contain PII)
- `resolution_note`: how the ticket was resolved
- `metadata`: JSON string with additional info including status

## Your Tasks:
1. **Filter**: Keep only tickets with status `resolved` (extract from metadata)
2. **Clean**: Remove phone numbers and emails from descriptions
3. **Clean**: Remove category prefix from descriptions
4. **Aggregate**: Count tickets and compute the average description length per category

## Rules:
- Phone formats: `+370 XXXXXXX` (the space after `+370` is optional) or `8XXXXXXX` (7 digits after the prefix)
- Email format: `name.surname@company.com`
- Category prefix format: `Category:\n    {category}\n    `

## Setup (provided)

In [None]:
import pandas as pd
import json
import re

# Load data
df = pd.read_csv("../fixtures/input/tickets.csv")
print(f"Loaded {len(df)} tickets")
df.head()

In [None]:
# Explore the data
print("Sample metadata:")
print(df["metadata"].iloc[0])
print("\nSample description with PII:")
print(df["description"].iloc[1])

---
## Task 1: Extract Status and Filter

Extract the `status` field from the `metadata` JSON and keep only `resolved` tickets.

**Hint**: metadata is a JSON string containing a list of dicts. One of the dicts has a `status` key.
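
As a rough illustration of the hint, the sketch below parses a JSON string that holds a list of dicts and returns the first `status` value it finds. The helper name `extract_status` and the sample strings are made up for this example, and the real metadata may contain other keys or shapes.

```python
import json


def extract_status(metadata):
    """Return the first 'status' value in a JSON list of dicts, else None."""
    try:
        entries = json.loads(metadata)
    except (TypeError, ValueError):
        # Missing value (e.g. NaN) or malformed JSON.
        return None
    if not isinstance(entries, list):
        return None
    for entry in entries:
        if isinstance(entry, dict) and "status" in entry:
            return entry["status"]
    return None


# Made-up samples, not taken from tickets.csv.
samples = [
    '[{"priority": "high"}, {"status": "resolved"}]',
    '[{"status": "open"}]',
    "not json",
]
statuses = [extract_status(s) for s in samples]
```

With pandas, a function like this could be applied with `df["metadata"].apply(extract_status)`, followed by a boolean mask on the resulting column.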

In [None]:
# YOUR CODE HERE
# 1. Create a function to extract status from metadata
# 2. Apply it to create a new 'status' column
# 3. Filter to keep only resolved tickets



In [None]:
# TEST - Do not modify
# Check the status column was created
assert "status" in df.columns, "Column 'status' not found"

# Check only resolved tickets remain
assert (df["status"] == "resolved").all(), "Non-resolved tickets still present"
assert len(df) == 10, f"Expected 10 resolved tickets, got {len(df)}"

print("Task 1 PASSED!")

---
## Task 2: Remove PII from Descriptions

Remove phone numbers and email addresses from the `description` column.

**Phone formats:**
- `+370 1234567` or `+3701234567`
- `81234567`

**Email format:**
- `name.surname@company.com` or similar addresses ending in `@company.com`
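
One possible way to express these formats as regular expressions is sketched below. The pattern names, the helper `remove_pii`, and the sample sentence are assumptions for illustration. The sketch assumes every address ends in `@company.com`, and it uses word boundaries so that digit runs embedded in longer tokens (such as hex error codes) are left alone.

```python
import re

# "+370" plus 7 digits (optional space), or "8" plus 7 digits.
PHONE_RE = re.compile(r"\+370\s?\d{7}\b|\b8\d{7}\b")
# Assumption: all addresses end in @company.com.
EMAIL_RE = re.compile(r"\b[\w.+-]+@company\.com\b")


def remove_pii(text: str) -> str:
    """Delete phone numbers and company emails, leaving other text as-is."""
    text = PHONE_RE.sub("", text)
    return EMAIL_RE.sub("", text)


# Made-up sentence, not taken from tickets.csv.
sample = (
    "Call +370 1234567 or 81234567, mail jonas.jonaitis@company.com. "
    "Error 0x80070057."
)
cleaned = remove_pii(sample)
```

The sketch deliberately leaves whitespace untouched, since collapsing it at this stage could alter the indented category prefix handled in Task 3. On a DataFrame, the same patterns could be passed to `Series.str.replace` with `regex=True`.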

In [None]:
# YOUR CODE HERE
# Create regex patterns and clean the description column



In [None]:
# TEST - Do not modify
# Check no phone numbers remain (word boundaries avoid false positives like error codes)
has_phones = df["description"].str.contains(r'\+370|\b8\d{7}\b', regex=True).any()
assert not has_phones, "Phone numbers still present in descriptions"

# Check no emails remain
has_emails = df["description"].str.contains(r'@company\.com', regex=True).any()
assert not has_emails, "Emails still present in descriptions"

print("Task 2 PASSED!")

---
## Task 3: Remove Category Prefix

Some descriptions start with:
```
Category:
    Software Installation
    Actual description here...
```

Remove this prefix using the `category` column value.
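
A minimal sketch of one approach follows, assuming both the description and the category are plain strings. The helper name `strip_category_prefix` and the sample text are hypothetical. `re.escape` is used so that a category containing regex metacharacters is matched literally.

```python
import re


def strip_category_prefix(description: str, category: str) -> str:
    """Drop a leading 'Category:<ws>{category}<ws>' block if present."""
    pattern = r"^Category:\s*" + re.escape(category) + r"\s*"
    return re.sub(pattern, "", description, count=1)


# Made-up description, not taken from tickets.csv.
raw = "Category:\n    Software Installation\n    The installer crashes."
stripped = strip_category_prefix(raw, "Software Installation")
untouched = strip_category_prefix("The installer crashes.", "Software Installation")
```

Because the prefix depends on each row's own category, a row-wise call such as `df.apply(lambda row: strip_category_prefix(row["description"], row["category"]), axis=1)` is one way to apply it.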

In [None]:
# YOUR CODE HERE
# Remove the category prefix from descriptions



In [None]:
# TEST - Do not modify
# Check specific ticket descriptions after cleaning
ticket_45ea = df[df["id"] == "45eacbab-ef56-493a-9feb-00acb471fada"]["description"].iloc[0]
expected_45ea = "I am facing issues while trying to install Adobe Photoshop on my Windows 10 PC."
assert ticket_45ea.strip() == expected_45ea.strip(), f"Ticket 45ea not cleaned correctly. Got: '{ticket_45ea}'"

# No descriptions should start with "Category:"
has_prefix = df["description"].str.startswith("Category:").any()
assert not has_prefix, "Some descriptions still have Category prefix"

print("Task 3 PASSED!")

---
## Task 4: Aggregate by Category

Create a summary DataFrame with:
- `category`: ticket category
- `ticket_count`: number of tickets
- `avg_description_length`: average description length

Store in variable `category_stats`.
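
The shape of such a summary can be sketched with pandas named aggregation. The toy frame and the names `toy`, `toy_stats`, and `description_length` are invented for this example; only the output column names come from the task.

```python
import pandas as pd

# Made-up rows, not taken from tickets.csv.
toy = pd.DataFrame({
    "category": ["Hardware", "Network Issues", "Hardware"],
    "description": ["Screen is black", "No wifi", "Fan noise"],
})

toy_stats = (
    # Helper column holding the character count of each description.
    toy.assign(description_length=toy["description"].str.len())
    # as_index=False keeps 'category' as a regular column.
    .groupby("category", as_index=False)
    .agg(
        ticket_count=("description", "size"),
        avg_description_length=("description_length", "mean"),
    )
)
```

Here `size` counts every row in a group, while `count` would skip missing descriptions. Which one fits depends on how empty descriptions should be treated.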

In [None]:
# YOUR CODE HERE
# Create category_stats DataFrame with aggregations



In [None]:
# TEST - Do not modify
assert "category_stats" in dir(), "Variable 'category_stats' not found"
assert set(category_stats.columns) >= {"category", "ticket_count", "avg_description_length"}, \
    f"Missing columns. Got: {category_stats.columns.tolist()}"

software_count = category_stats[category_stats["category"] == "Software Installation"]["ticket_count"].iloc[0]
assert software_count == 7, f"Expected 7 Software Installation tickets, got {software_count}"

print("Task 4 PASSED!")
print("\nCategory Stats:")
print(category_stats)

---
## Bonus Task: Full Pipeline Function

Combine all steps into a single function that takes the raw DataFrame and returns the cleaned, filtered data.
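
One way to structure such a function is to write each step as a small function that takes and returns a DataFrame, then chain them with `DataFrame.pipe`. The sketch below shows only the chaining pattern. The step functions `keep_resolved` and `strip_whitespace` are hypothetical stand-ins, not the actual cleaning logic.

```python
import pandas as pd


def keep_resolved(df: pd.DataFrame) -> pd.DataFrame:
    # Assumes a 'status' column already exists (hypothetical step).
    return df[df["status"] == "resolved"].copy()


def strip_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for a cleaning step; returns a new frame, input untouched.
    return df.assign(description=df["description"].str.strip())


# Made-up rows, not taken from tickets.csv.
toy = pd.DataFrame({
    "status": ["resolved", "open"],
    "description": ["  Printer jam  ", "VPN down"],
})
result = toy.pipe(keep_resolved).pipe(strip_whitespace)
```

Returning a new frame from every step keeps the raw input unchanged, which makes it easier to rerun or test the pipeline cell by cell.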

In [None]:
# YOUR CODE HERE (optional)
def process_tickets(df: pd.DataFrame) -> pd.DataFrame:
    """
    Process ticket DataFrame:
    1. Extract status from metadata
    2. Filter to resolved only
    3. Remove PII from descriptions
    4. Remove category prefix
    
    Returns: cleaned DataFrame
    """
    pass  # Your implementation


---
## Expected Results

After completing all tasks:
- 10 resolved tickets (from 12 total)
- No phone numbers or emails in descriptions
- No "Category:" prefixes
- Category stats showing 7 Software Installation, 2 Network Issues, 1 Hardware