# Pandas Essentials

This notebook covers core Pandas operations needed for data processing in ML/NLP pipelines.

## Topics:
1. Loading and exploring data
2. Filtering and querying
3. Parsing JSON stored in columns
4. Transformations with `apply()`
5. Grouping and aggregation
6. Merging DataFrames

In [None]:
import pandas as pd
import numpy as np
import json

# Load sample data
df = pd.read_csv("../fixtures/input/tickets.csv")
print(f"Loaded {len(df)} rows")
df.head()
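
JSON sources load in much the same way as CSV. The sketch below is a minimal illustration, not part of the fixtures: the inline records and the names `json_text`, `jsonl_text`, `tickets_from_json`, and `tickets_from_jsonl` are made up for the example, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical inline records standing in for a JSON file (not the real fixtures).
json_text = '[{"id": 1, "category": "Network Issues"}, {"id": 2, "category": "Email"}]'

# read_json accepts a path or a file-like object; StringIO keeps this self-contained.
tickets_from_json = pd.read_json(io.StringIO(json_text))
print(tickets_from_json)

# JSON Lines (one object per line) needs lines=True.
jsonl_text = '{"id": 1, "category": "Email"}\n{"id": 2, "category": "Hardware"}\n'
tickets_from_jsonl = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(f"Loaded {len(tickets_from_jsonl)} rows from JSON Lines")
```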

## 1. Data Exploration

In [None]:
# Basic info
print("Shape:", df.shape)
print("\nColumn types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())

In [None]:
# Value counts for categorical columns
print("Categories:")
print(df["category"].value_counts())

## 2. Filtering Data

There are several ways to filter a DataFrame:

In [None]:
# Method 1: Boolean indexing
software_tickets = df[df["category"] == "Software Installation"]
print(f"Software tickets: {len(software_tickets)}")

In [None]:
# Method 2: Multiple conditions, combined with | (or) / & (and); each condition needs its own parentheses
filtered = df[
    (df["category"] == "Software Installation") |
    (df["category"] == "Network Issues")
]
print(f"Software + Network: {len(filtered)}")

In [None]:
# Method 3: Using query() - cleaner for complex conditions
categories = ["Software Installation", "Network Issues"]
filtered = df.query("category in @categories")
print(f"Using query: {len(filtered)}")

In [None]:
# Method 4: Using isin() (reuses the `categories` list from the previous cell)
filtered = df[df["category"].isin(categories)]
print(f"Using isin: {len(filtered)}")
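
As a sanity check, the filtering methods above should select the same rows. The sketch below verifies this on a small made-up frame (`demo` and `wanted` are hypothetical names, not the tickets data) and also shows `~` for negating a mask.

```python
import pandas as pd

# Hypothetical stand-in for the tickets data.
demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "category": ["Software Installation", "Network Issues", "Email", "Network Issues"],
})
wanted = ["Software Installation", "Network Issues"]

# Three equivalent ways to keep rows whose category is in `wanted`.
by_or = demo[(demo["category"] == wanted[0]) | (demo["category"] == wanted[1])]
by_query = demo.query("category in @wanted")
by_isin = demo[demo["category"].isin(wanted)]

print("Same rows:", by_or.equals(by_query) and by_query.equals(by_isin))

# ~ inverts a boolean mask: everything NOT in `wanted`.
others = demo[~demo["category"].isin(wanted)]
print(others)
```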

## 3. Working with JSON in Columns

Columns often contain JSON-encoded strings. Here, `metadata` holds a JSON list of objects, which must be parsed before its fields can be used.

In [None]:
# Look at metadata column
print("Raw metadata:")
print(df["metadata"].iloc[0])
print("\nParsed:")
print(json.loads(df["metadata"].iloc[0]))

In [None]:
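# Extract status from metadata
def extract_status(metadata_str):
    """Extract status from metadata JSON list."""
    try:
        metadata = json.loads(metadata_str)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON, or a missing (NaN) value instead of a string
        return None
    if not isinstance(metadata, list):
        return None
    for item in metadata:
        if isinstance(item, dict) and "status" in item:
            return item["status"]
    return None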

df["status"] = df["metadata"].apply(extract_status)
print(df[["id", "status"]].head())

In [None]:
# Filter by extracted status
resolved = df[df["status"] == "resolved"]
print(f"Resolved tickets: {len(resolved)}")
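
When more than one field is needed, it can be simpler to parse each JSON string once and expand every key into its own column. The following is a rough sketch under assumptions: the sample rows, the `channel` key, and the helper `flatten_metadata` are all hypothetical, and the real `metadata` column may be shaped differently.

```python
import json

import pandas as pd

# Hypothetical sample rows; the last one is deliberately malformed.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "metadata": [
        '[{"status": "resolved"}, {"channel": "email"}]',
        '[{"status": "open"}]',
        "not valid json",
    ],
})

def flatten_metadata(metadata_str):
    """Merge a JSON list of dicts into one dict; return {} on bad input."""
    try:
        items = json.loads(metadata_str)
    except (json.JSONDecodeError, TypeError):
        return {}
    if not isinstance(items, list):
        return {}
    merged = {}
    for item in items:
        if isinstance(item, dict):
            merged.update(item)  # later keys overwrite earlier ones
    return merged

# One dict per row -> one column per key; missing keys become NaN.
parsed = pd.DataFrame(demo["metadata"].map(flatten_metadata).tolist(), index=demo.index)
demo = demo.join(parsed)
print(demo[["id", "status", "channel"]])
```

Parsing once per row avoids calling `json.loads` again for every field that is extracted.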

## 4. Transformations with apply()

In [None]:
# Add description length (calls len() per element; raises TypeError if a description is missing)
df["desc_length"] = df["description"].apply(len)
print(df[["id", "desc_length"]].head())

In [None]:
# Vectorized alternative (NaN-safe; often, though not always, faster)
df["desc_length"] = df["description"].str.len()
print(df["desc_length"].describe())
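
The two approaches differ on missing values, which matters for real ticket data. A minimal sketch, using a made-up `descriptions` Series with one missing entry:

```python
import pandas as pd

# Hypothetical descriptions with one missing value.
descriptions = pd.Series(["Printer offline", None, "VPN drops"])

# apply(len) calls Python's len() on every element and fails on the missing one.
try:
    descriptions.apply(len)
    apply_failed = False
except TypeError:
    apply_failed = True
print("apply(len) failed:", apply_failed)

# .str.len() propagates the missing value instead of raising.
lengths = descriptions.str.len()
print(lengths)

# Treat missing descriptions as empty strings when a plain integer is preferred.
lengths_filled = descriptions.fillna("").str.len()
print(lengths_filled.tolist())
```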

In [None]:
# Apply with multiple columns
def create_summary(row):
    return f"{row['category']}: {row['description'][:50]}..."

df["summary"] = df.apply(create_summary, axis=1)
print(df["summary"].iloc[0])
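
Row-wise `apply(..., axis=1)` is flexible but runs a Python function per row. For simple string assembly, the same result can usually be built from whole columns. A sketch on a hypothetical one-row frame (`demo` is not the tickets data):

```python
import pandas as pd

# Hypothetical single-row stand-in for the tickets data.
demo = pd.DataFrame({
    "category": ["Email"],
    "description": ["Cannot send attachments larger than 10 MB"],
})

# Row-wise version, mirroring create_summary above.
row_wise = demo.apply(
    lambda row: f"{row['category']}: {row['description'][:50]}...", axis=1
)

# Column-wise version: string concatenation plus the .str accessor for slicing.
column_wise = demo["category"] + ": " + demo["description"].str.slice(0, 50) + "..."

print(column_wise.iloc[0])
print("Identical:", row_wise.tolist() == column_wise.tolist())
```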

## 5. Grouping and Aggregation

In [None]:
# Basic groupby
category_stats = df.groupby("category").agg(
    ticket_count=("id", "count"),
    avg_desc_length=("desc_length", "mean")
).reset_index()

print(category_stats)

In [None]:
# Group by multiple columns
grouped = df.groupby(["category", "status"]).size().reset_index(name="count")
print(grouped)

In [None]:
# Transform - add group statistics to each row
df["category_avg_length"] = df.groupby("category")["desc_length"].transform("mean")
df["vs_category_avg"] = df["desc_length"] - df["category_avg_length"]
print(df[["category", "desc_length", "category_avg_length", "vs_category_avg"]].head())
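
The difference between aggregating and transforming is easiest to see in the shape of the result: an aggregation returns one value per group, while `transform` returns one value per original row. A small sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical data: two tickets in category A, one in B.
demo = pd.DataFrame({
    "category": ["A", "A", "B"],
    "desc_length": [10, 30, 50],
})

# Aggregation: one row per group, indexed by category.
per_group = demo.groupby("category")["desc_length"].mean()
print(per_group)

# Transform: same length and index as the input, so it can be assigned as a column.
per_row = demo.groupby("category")["desc_length"].transform("mean")
print(per_row)
```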

## 6. Merging DataFrames

In [None]:
# Create a lookup table
priority_lookup = pd.DataFrame({
    "category": ["Software Installation", "Network Issues", "Hardware", "Email"],
    "priority": [2, 3, 1, 2],
    "team": ["IT Support", "Network Team", "Hardware Team", "IT Support"]
})

print("Lookup table:")
print(priority_lookup)

In [None]:
# Merge - LEFT JOIN
df_enriched = pd.merge(
    df,
    priority_lookup,
    on="category",
    how="left"
)

print(df_enriched[["id", "category", "priority", "team"]].head())
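
A left join silently leaves `NaN` where the lookup table has no match, and it duplicates rows if the lookup has repeated keys. The sketch below shows one way to guard against both with the `indicator` and `validate` parameters of `merge`; the `tickets` and `lookup` frames are hypothetical.

```python
import pandas as pd

# Hypothetical frames: "Printing" has no entry in the lookup table.
tickets = pd.DataFrame({"id": [1, 2, 3], "category": ["Email", "Printing", "Email"]})
lookup = pd.DataFrame({"category": ["Email", "Hardware"], "priority": [2, 1]})

checked = tickets.merge(
    lookup,
    on="category",
    how="left",
    validate="many_to_one",  # raises MergeError if lookup has duplicate categories
    indicator=True,          # adds a _merge column: "both" or "left_only" for a left join
)

# Categories that found no match in the lookup table.
unmatched = checked.loc[checked["_merge"] == "left_only", "category"].unique().tolist()
print("Unmatched categories:", unmatched)
```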

In [None]:
# Alternative: map() for a single-column lookup (lighter than a merge; unmatched categories become NaN)
priority_dict = priority_lookup.set_index("category")["priority"].to_dict()
df["priority"] = df["category"].map(priority_dict)
print(df[["category", "priority"]].head())

## Summary

Key operations covered:
- `pd.read_csv()` - loading data
- Boolean indexing, `query()`, `isin()` - filtering
- `json.loads()` with `apply()` - parsing JSON columns
- `apply()`, `.str` accessor - transformations
- `groupby()`, `agg()`, `transform()` - aggregation
- `merge()`, `map()` - joining data

### Practice:
Now try the tasks in the `../tasks/` folder!