# Task 1: Ticket Processing Pipeline - SOLUTION

## Scenario
You have a CSV with support ticket data. Each ticket has:
- `id`: ticket ID
- `category`: ticket category
- `description`: ticket description (may contain PII)
- `resolution_note`: how the ticket was resolved
- `metadata`: JSON string with additional info including status

## Your Tasks:
1. **Filter**: Keep only tickets with status `resolved` (extract from metadata)
2. **Clean**: Remove phone numbers and emails from descriptions
3. **Clean**: Remove category prefix from descriptions
4. **Aggregate**: Count tickets by category and compute the average description length per category

## Setup

In [None]:
import pandas as pd
import json
import re

# Load data
df = pd.read_csv("../fixtures/input/tickets.csv")
print(f"Loaded {len(df)} tickets")
df.head()

In [None]:
# Explore the data
print("Sample metadata:")
print(df["metadata"].iloc[0])
print("\nParsed metadata:")
print(json.loads(df["metadata"].iloc[0]))
print("\nSample description with PII:")
print(df["description"].iloc[1])

---
## Task 1: Extract Status and Filter - SOLUTION

**Approach:**
1. Parse JSON string to list of dicts
2. Find dict that contains 'status' key
3. Extract the value
4. Filter DataFrame
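
As a quick illustration of steps 1 to 3, here is a minimal sketch on a single literal string, assuming the metadata always follows the list-of-single-key-dicts layout shown in this notebook. The names `sample`, `parsed`, and `status` are only for this example and are not part of the exercise.

```python
import json

# Sample metadata string in the list-of-dicts layout used by this notebook
sample = '[{"source": "email"}, {"language": "EN"}, {"status": "resolved"}]'

# Step 1: parse the JSON string into a Python list of dicts
parsed = json.loads(sample)

# Steps 2-3: take the value from the first dict that has a "status" key,
# falling back to None when no such dict exists
status = next(
    (item["status"] for item in parsed if isinstance(item, dict) and "status" in item),
    None,
)
print(status)  # resolved
```

The `next(..., None)` form is just a compact alternative to an explicit loop; both return `None` when no dict carries a status.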

In [None]:
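# SOLUTION

from typing import Optional

def extract_status(metadata_str: str) -> Optional[str]:
    """
    Extract status from metadata JSON string.
    
    Metadata format: [{"source": "email"}, {"language": "EN"}, {"status": "resolved"}]
    Returns None if the string cannot be parsed or has no 'status' entry.
    """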
    try:
        metadata_list = json.loads(metadata_str)
        for item in metadata_list:
            if "status" in item:
                return item["status"]
        return None
    except (json.JSONDecodeError, TypeError):
        return None

# Apply to create status column
df["status"] = df["metadata"].apply(extract_status)

# Check status distribution before filtering
print("Status distribution:")
print(df["status"].value_counts())

# Filter to keep only resolved
df = df[df["status"] == "resolved"].copy()
print(f"\nFiltered to {len(df)} resolved tickets")

In [None]:
# TEST
assert "status" in df.columns, "Column 'status' not found"
assert len(df) == 10, f"Expected 10 resolved tickets, got {len(df)}"
assert df["status"].unique().tolist() == ["resolved"], "Should only contain resolved tickets"
print("Task 1 PASSED!")

---
## Task 2: Remove PII from Descriptions - SOLUTION

**Approach:**
1. Create regex pattern for Lithuanian phone numbers
2. Create regex pattern for emails
3. Apply substitution to remove matches
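
To see why the 8-prefixed pattern benefits from word boundaries, the sketch below runs the patterns on a made-up sentence. The sample text, including the error code `1812345678`, is hypothetical and does not come from the fixture.

```python
import re

# Hypothetical sample text; the phone formats follow the ones listed in this task
text = "Error 1812345678: call +370 1234567 or 81234567, mail name.surname@company.com"

# Without word boundaries the pattern also fires inside the longer error code
loose = re.findall(r'8\d{7}', text)
# With word boundaries only the standalone 8-digit number matches
strict = re.findall(r'\b8\d{7}\b', text)
print(len(loose), len(strict))  # 2 1

# Apply the three substitutions one after another
cleaned = re.sub(r'\+370\s*\d{7}', '', text)
cleaned = re.sub(r'\b8\d{7}\b', '', cleaned)
cleaned = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '', cleaned)
print(cleaned)
```

The error code survives the cleanup because no word boundary exists between its digits, while both phone numbers and the email are removed.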

In [None]:
# SOLUTION

def remove_pii(text: str) -> str:
    """
    Remove PII (phone numbers and emails) from text.
    
    Phone formats:
    - +370 1234567 or +3701234567
    - 81234567 (8 followed by 7 digits)
    
    Email format:
    - name.surname@company.com
    """
    # Remove phones with +370 prefix (with optional space, then 7 digits)
    text = re.sub(r'\+370\s*\d{7}', '', text)
    
    # Remove phones starting with 8 (8 followed by 7 digits)
    text = re.sub(r'\b8\d{7}\b', '', text)
    
    # Remove email addresses
    text = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '', text)
    
    return text

# Apply to description column
df["description"] = df["description"].apply(remove_pii)

# Debug: check for remaining phones (same word-boundary pattern as the removal step)
print("Checking for remaining phones...")
for description in df["description"]:
    if '+370' in description or re.search(r'\b8\d{7}\b', description):
        print(f"  Found in: {description[:80]}...")

print("\nSample cleaned description:")
print(df["description"].iloc[0])

In [None]:
# TEST
# Check for phone patterns (using word boundaries to avoid false positives like error codes)
has_phones = df["description"].str.contains(r'\+370|\b8\d{7}\b', regex=True).any()
assert not has_phones, "Phone numbers still present in descriptions"

has_emails = df["description"].str.contains(r'@company\.com', regex=True).any()
assert not has_emails, "Emails still present in descriptions"

print("Task 2 PASSED!")

---
## Task 3: Remove Category Prefix - SOLUTION

**Approach:**
1. Build pattern dynamically using category value
2. Use re.escape() for safe pattern building
3. Handle newlines and whitespace
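
The following sketch shows what `re.escape()` protects against. It assumes a hypothetical category name, `Hardware (Laptop)`, that contains regex metacharacters; the fixture's real categories may not need escaping at all.

```python
import re

# Hypothetical category containing regex metacharacters (parentheses)
category = "Hardware (Laptop)"
text = "Category:\n    Hardware (Laptop)\n    Screen flickers after login."

# Unescaped: "(Laptop)" is read as a capture group, so the literal
# parentheses in the text never match and nothing is removed
unescaped = re.sub(rf'Category:\s*{category}\s*', '', text)

# Escaped: every metacharacter in the category is matched literally
escaped = re.sub(rf'Category:\s*{re.escape(category)}\s*', '', text)

print(unescaped == text)  # True
print(escaped)            # Screen flickers after login.
```

The `\s*` on both sides of the category absorbs the newline and indentation, which is how the pattern handles the multi-line prefix.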

In [None]:
# SOLUTION

def remove_category_prefix(text: str, category: str) -> str:
    """
    Remove category prefix from description.
    
    Prefix format:
    Category:
        {category}
        Actual text...
    """
    # Build pattern: "Category:\n    {category}\n    "
    # Use re.escape to handle special characters in category name
    pattern = rf'Category:\s*{re.escape(category)}\s*'
    
    cleaned = re.sub(pattern, '', text)
    return cleaned

# Apply using row values (need both description and category)
df["description"] = df.apply(
    lambda row: remove_category_prefix(row["description"], row["category"]),
    axis=1
)

# Show sample
print("Sample after prefix removal:")
for i, row in df.head(3).iterrows():
    print(f"\n[{row['id'][:8]}...] {row['description'][:60]}")

In [None]:
# TEST
ticket_45ea = df[df["id"] == "45eacbab-ef56-493a-9feb-00acb471fada"]["description"].iloc[0]
expected_45ea = "I am facing issues while trying to install Adobe Photoshop on my Windows 10 PC."
assert ticket_45ea.strip() == expected_45ea.strip(), f"Ticket 45ea not cleaned correctly. Got: '{ticket_45ea}'"

has_prefix = df["description"].str.startswith("Category:").any()
assert not has_prefix, "Some descriptions still have Category prefix"

print("Task 3 PASSED!")

---
## Task 4: Aggregate by Category - SOLUTION

**Approach:**
1. Calculate description length
2. Use groupby with named aggregations
3. Reset index for clean output
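
A minimal sketch of named aggregation on a toy frame follows. The `toy` DataFrame and its values are invented for illustration; only the column names mirror this notebook.

```python
import pandas as pd

# Toy data, invented for this example
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "category": ["Billing", "Billing", "Login"],
    "description": ["abcd", "ab", "abcdef"],
})

# Step 1: description length per ticket
toy["desc_length"] = toy["description"].str.len()

# Steps 2-3: each keyword becomes an output column, defined as (source column, function)
stats = (
    toy.groupby("category")
    .agg(
        ticket_count=("id", "count"),
        avg_description_length=("desc_length", "mean"),
    )
    .reset_index()
    .sort_values("ticket_count", ascending=False)
)
print(stats)
```

Note that `"count"` skips missing values in the source column, so it equals the group size only when `id` has no nulls.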

In [None]:
# SOLUTION

# First, add description length column
df["desc_length"] = df["description"].str.len()

# Aggregate by category with named aggregations
category_stats = df.groupby("category").agg(
    ticket_count=("id", "count"),
    avg_description_length=("desc_length", "mean")
).reset_index()

# Sort by ticket count descending
category_stats = category_stats.sort_values("ticket_count", ascending=False)

print("Category Statistics:")
print(category_stats)

In [None]:
# TEST
assert "category_stats" in dir(), "Variable 'category_stats' not found"
assert set(category_stats.columns) >= {"category", "ticket_count", "avg_description_length"}, \
    f"Missing columns. Got: {category_stats.columns.tolist()}"

software_count = category_stats[category_stats["category"] == "Software Installation"]["ticket_count"].iloc[0]
assert software_count == 7, f"Expected 7 Software Installation tickets, got {software_count}"

print("Task 4 PASSED!")
print("\nFinal Category Stats:")
print(category_stats)

---
## Bonus: Full Pipeline Function - SOLUTION

In [None]:
# BONUS SOLUTION

def process_tickets(df: pd.DataFrame) -> pd.DataFrame:
    """
    Complete ticket processing pipeline.
    
    Steps:
    1. Extract status from metadata JSON
    2. Filter to resolved tickets only
    3. Remove PII (phones, emails) from descriptions
    4. Remove category prefix from descriptions
    
    Args:
        df: Raw ticket DataFrame
        
    Returns:
        Cleaned and filtered DataFrame
    """
    # Make a copy to avoid modifying original
    df = df.copy()
    
    # Step 1: Extract status
    def extract_status(metadata_str):
        try:
            for item in json.loads(metadata_str):
                if "status" in item:
                    return item["status"]
        except (json.JSONDecodeError, TypeError):
            pass
        return None
    
    df["status"] = df["metadata"].apply(extract_status)
    
    # Step 2: Filter resolved
    df = df[df["status"] == "resolved"].copy()
    
    # Step 3: Remove PII (apply patterns separately for reliability)
    def remove_pii(text):
        text = re.sub(r'\+370\s*\d{7}', '', text)  # +370 phones
        text = re.sub(r'\b8\d{7}\b', '', text)      # 8XXXXXXX phones
        text = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '', text)  # emails
        return text
    
    df["description"] = df["description"].apply(remove_pii)
    
    # Step 4: Remove category prefix
    def remove_prefix(row):
        pattern = rf'Category:\s*{re.escape(row["category"])}\s*'
        return re.sub(pattern, '', row["description"])
    
    df["description"] = df.apply(remove_prefix, axis=1)
    
    return df


# Test the pipeline
df_raw = pd.read_csv("../fixtures/input/tickets.csv")
df_processed = process_tickets(df_raw)

print(f"Input: {len(df_raw)} tickets")
print(f"Output: {len(df_processed)} resolved tickets")
print("\nSample processed:")
print(df_processed[["id", "category", "description"]].head(3).to_string())

---
## Summary

**Key techniques used:**

1. **JSON Parsing**: `json.loads()` to parse string, iterate over list of dicts
2. **Regex Patterns** (applied separately for reliability):
   - Phone +370: `r'\+370\s*\d{7}'`
   - Phone 8X: `r'\b8\d{7}\b'`
   - Email: `r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'`
3. **Dynamic Patterns**: `re.escape()` for safe pattern building
4. **Aggregation**: `groupby().agg()` with named aggregations
5. **Pipeline Pattern**: Combine all steps into reusable function

**Common pitfalls:**
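- Forgetting `.copy()` after filtering, so later column assignments trigger `SettingWithCopyWarning`
- Not escaping special characters when building regex patterns from data (use `re.escape()`)
- Not handling JSON parse errors (malformed strings and missing values)
- Using the default `axis=0` instead of `axis=1` in `DataFrame.apply` when the function needs row values
- Combining regex patterns with `|` makes them harder to test and debug, because alternation has the lowest precedence and anchors such as `\b` bind only to the alternative they appear in; applying each pattern separately keeps them independently verifiable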
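
The JSON pitfall can be reproduced without pandas. The sketch below assumes two failure modes: a malformed string, and a float `NaN` of the kind pandas typically produces for an empty CSV cell. The helper name `safe_status` is hypothetical.

```python
import json

def safe_status(metadata_str):
    """Return the status value, or None when parsing fails or no status exists."""
    try:
        for item in json.loads(metadata_str):
            if isinstance(item, dict) and "status" in item:
                return item["status"]
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: the string is not valid JSON
        # TypeError: the input is not a string (e.g. NaN) or the parsed value is not iterable
        pass
    return None

print(safe_status('[{"status": "open"}]'))   # open
print(safe_status("not json"))               # None
print(safe_status(float("nan")))             # None
print(safe_status('[{"source": "email"}]'))  # None
```

Catching only these two exception types, rather than using a bare `except`, keeps unrelated errors such as `KeyboardInterrupt` visible.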