# Data Wrangling Learning Guide

## What is Data Wrangling?

**Data Wrangling** (also called Data Munging) is the process of transforming and mapping raw data from one format into another to make it more appropriate and valuable for analysis and modeling. It's the essential bridge between collecting data and analyzing it.

## Why is Data Wrangling Important?

1. **Clean Messy Data** - Real-world data is rarely clean and ready to use
2. **Standardize Formats** - Convert data into consistent, usable formats
3. **Combine Multiple Sources** - Integrate data from various sources
4. **Prepare for Analysis** - Transform data into the right shape for modeling
5. **Handle Missing Values** - Deal with incomplete or corrupted data
6. **Improve Data Quality** - Detect and fix errors, duplicates, and inconsistencies
7. **Save Time** - Automate repetitive data transformation tasks

## What You'll Learn in This Notebook

This comprehensive guide covers **15 essential data wrangling topics**:

1. [Data Loading and Inspection](#1-data-loading-and-inspection) - Loading and initial exploration
2. [Data Cleaning](#2-data-cleaning-handling-missing-duplicates) - Handling missing values and duplicates
3. [Data Type Conversions](#3-data-type-conversions) - Converting between data types
4. [String Operations](#4-string-operations) - Text cleaning and manipulation
5. [Data Filtering and Subsetting](#5-data-filtering-and-subsetting) - Selecting specific data
6. [Data Transformation](#6-data-transformation) - Reshaping and restructuring
7. [Sorting and Ranking](#7-sorting-and-ranking) - Ordering and ranking data
8. [Grouping and Aggregation](#8-grouping-and-aggregation) - Summarizing data by groups
9. [Merging and Joining](#9-merging-and-joining) - Combining multiple datasets
10. [Pivot Tables and Cross-Tabulation](#10-pivot-tables-and-cross-tabulation) - Creating summary tables
11. [Handling Categorical Data](#11-handling-categorical-data) - Encoding categories
12. [Feature Engineering](#12-feature-engineering) - Creating new features
13. [Handling Outliers](#13-handling-outliers) - Detecting and managing outliers
14. [Time Series Wrangling](#14-time-series-wrangling) - Working with dates and times
15. [Advanced Wrangling Techniques](#15-advanced-wrangling-techniques) - Optimization and best practices

## Tools & Libraries Used

- **pandas** - Primary data manipulation library
- **numpy** - Numerical operations and array handling
- **matplotlib & seaborn** - Data visualization
- **datetime** - Date and time manipulation

## How to Use This Notebook

1. **Run cells in order** - Each section builds on previous concepts
2. **Practice with examples** - Try the code with sample data
3. **Apply to your data** - Replace sample data with your own datasets
4. **Experiment** - Modify parameters to understand behavior
5. **Take notes** - Document patterns and techniques you find useful

## Dataset Overview

This notebook uses multiple **synthetic datasets** to demonstrate various wrangling techniques:

- **Customer Data**: Demographics and transaction information
- **Sales Data**: Product sales with dates and amounts
- **Time Series Data**: Temporal data for resampling and aggregation
- **Multi-source Data**: Separate tables for merging and joining

Each dataset is designed to showcase specific wrangling challenges like missing values, duplicates, inconsistent formats, and complex transformations.

## The Data Wrangling Workflow

```
Raw Data → Load → Clean → Transform → Integrate → Validate → Ready for Analysis
```

1. **Load**: Import data from various sources
2. **Clean**: Handle missing values, duplicates, errors
3. **Transform**: Convert types, reshape, engineer features
4. **Integrate**: Merge and combine datasets
5. **Validate**: Check data quality and consistency
6. **Export**: Save processed data for analysis
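
The steps above can be chained into a single pipeline with `DataFrame.pipe()`. The following is a minimal sketch, not the notebook's own code: the `raw` table and the `clean`, `transform`, and `validate` helpers are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing value, numbers stored as text
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["10.5", "20.0", "20.0", None],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and rows with a missing amount."""
    return df.drop_duplicates().dropna(subset=["amount"])

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Convert the amount column from text to float."""
    return df.assign(amount=df["amount"].astype(float))

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly if basic quality expectations are not met."""
    assert df["order_id"].is_unique, "duplicate order_id"
    assert df["amount"].notna().all(), "missing amount"
    return df

# Each step takes a DataFrame and returns a DataFrame, so the steps compose
wrangled = raw.pipe(clean).pipe(transform).pipe(validate)
print(wrangled)
```

Keeping each stage as a small function that returns a new DataFrame makes the stages easy to reorder and test in isolation.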

---

Let's master data wrangling! 🔧📊

In [1]:
# Setup: Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings

warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)
pd.set_option('display.float_format', '{:.2f}'.format)

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 2.3.5


## 1. Data Loading and Inspection

**What is it?**
Loading data from various sources (CSV, Excel, databases, APIs) and performing initial exploration to understand structure, types, and quality.

**Why use it?**
- Understand data structure before processing
- Identify data quality issues early
- Plan appropriate wrangling strategies
- Verify data loaded correctly

**When to use it?**
- Start of every data project
- After receiving new data
- Before any transformations
- When debugging data issues

**Common loading methods:**
- CSV: `pd.read_csv()`
- Excel: `pd.read_excel()`
- JSON: `pd.read_json()`
- SQL: `pd.read_sql()`
- HTML: `pd.read_html()`
- Parquet: `pd.read_parquet()`
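
As a small illustration of the first loader in the list, the sketch below reads CSV text from memory so that it runs without any file on disk. The CSV content and column names are made up for the example; with real data you would pass a file path instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# Hypothetical CSV content; the second row has a missing sales_amount
csv_text = """date,product,sales_amount
2024-01-01,Laptop,1200.50
2024-01-02,Phone,
2024-01-03,Tablet,499.99
"""

loaded = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date"],           # parse the date column as datetime
    dtype={"product": "category"},  # store repeated labels compactly
)

print(loaded.dtypes)
print(loaded.isna().sum())
```

Setting `parse_dates` and `dtype` at load time tends to be simpler than converting columns afterwards, and the missing-value count gives an early view of data quality.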

In [2]:
# ============================================
# 1. DATA LOADING AND INSPECTION
# ============================================

# Create sample datasets for demonstration
np.random.seed(42)

# Sales dataset
dates = pd.date_range('2024-01-01', periods=100, freq='D')
sales_data = pd.DataFrame({
    'date': dates,
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'sales_amount': np.random.uniform(100, 2000, 100),
    'quantity': np.random.randint(1, 20, 100),
    'customer_id': np.random.randint(1000, 1050, 100)
})

# Customer dataset
customer_data = pd.DataFrame({
    'customer_id': range(1000, 1050),
    'customer_name': [f'Customer_{i}' for i in range(1000, 1050)],
    'age': np.random.randint(20, 70, 50),
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston'], 50),
    'customer_type': np.random.choice(['Regular', 'Premium', 'VIP'], 50)
})

print("=" * 80)
print("DATA LOADING AND INSPECTION")
print("=" * 80)

# 1. Basic inspection methods
print("\n1. Basic Dataset Information:")
print("\n   Sales Data:")
print(f"   Shape: {sales_data.shape} (rows, columns)")
print(f"   Size: {sales_data.size} total elements")
print(f"   Memory usage: {sales_data.memory_usage(deep=True).sum() / 1024:.2f} KB")

# 2. Display first and last rows
print("\n2. First 5 rows (head):")
print(sales_data.head())

print("\n   Last 5 rows (tail):")
print(sales_data.tail())

# 3. Data types
print("\n3. Data Types:")
print(sales_data.dtypes)

# 4. Column names and index
print("\n4. Column Names:")
print(f"   {sales_data.columns.tolist()}")

print("\n   Index:")
print(f"   Type: {type(sales_data.index)}")
print(f"   Range: {sales_data.index[0]} to {sales_data.index[-1]}")

# 5. Summary statistics
print("\n5. Summary Statistics:")
print(sales_data.describe())

# 6. Info method (comprehensive overview)
print("\n6. Comprehensive Info:")
sales_data.info()

# 7. Unique values
print("\n7. Unique Values Count:")
for col in ['product', 'region']:
    print(f"   {col}: {sales_data[col].nunique()} unique values")
    print(f"   Values: {sales_data[col].unique()}")

# 8. Value counts
print("\n8. Value Counts (Product):")
print(sales_data['product'].value_counts())

# 9. Missing values check
print("\n9. Missing Values:")
print(sales_data.isnull().sum())

# 10. Sample random rows
print("\n10. Random Sample (3 rows):")
print(sales_data.sample(n=3, random_state=42))

print("\n   ✓ Data loading and inspection complete!")

DATA LOADING AND INSPECTION

1. Basic Dataset Information:

   Sales Data:
   Shape: (100, 6) (rows, columns)
   Size: 600 total elements
   Memory usage: 13.07 KB

2. First 5 rows (head):
        date  product region  sales_amount  quantity  customer_id
0 2024-01-01   Tablet   East        159.72        11         1032
1 2024-01-02  Monitor  South       1309.18        17         1007
2 2024-01-03   Laptop  South        697.28         8         1043
3 2024-01-04   Tablet   West       1066.28         4         1043
4 2024-01-05   Tablet  South       1824.38         6         1004

   Last 5 rows (tail):
         date  product region  sales_amount  quantity  customer_id
95 2024-04-05    Phone  South        763.50        10         1034
96 2024-04-06    Phone  South       1479.32         6         1032
97 2024-04-07  Monitor   West       1804.51        15         1032
98 2024-04-08    Phone  North       1785.46        11         1042
99 2024-04-09   Laptop   East       1581.76         5   

## 2. Filtering and Subsetting Data

**What is it?**
Selecting specific rows and columns from a dataset based on conditions or criteria.

**Why use it?**
- Focus on relevant data
- Reduce memory usage
- Improve processing speed
- Extract specific subsets for analysis
- Remove unwanted records

**When to use it?**
- Need specific time periods
- Focus on certain categories
- Extract records meeting conditions
- Remove outliers or anomalies
- Create training/test splits

**Filtering methods:**
1. **Boolean indexing**: `df[df['column'] > value]`
2. **loc**: Label-based selection
3. **iloc**: Position-based selection
4. **query()**: SQL-like filtering
5. **isin()**: Match against list of values
6. **between()**: Range filtering
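
Two details of these methods are easy to get wrong: `loc` slices include the end label while `iloc` slices exclude the end position, and each condition in a boolean filter needs its own parentheses. The sketch below uses a tiny made-up table (`demo` is a hypothetical name) to show both.

```python
import pandas as pd

demo = pd.DataFrame({
    "region": ["North", "South", "North", "East", "West"],
    "sales": [1200, 800, 400, 1500, 950],
})

# .loc slices by label and includes the end label
by_label = demo.loc[0:2]       # rows labelled 0, 1, 2

# .iloc slices by position and excludes the end position
by_position = demo.iloc[0:2]   # rows at positions 0, 1

# & and | bind more tightly than == and >, so wrap each condition in parentheses
mask = (demo["region"] == "North") & (demo["sales"] > 1000)
north_high = demo[mask]

# The same filter written with query()
north_high_q = demo.query('region == "North" and sales > 1000')

print(len(by_label), len(by_position), len(north_high))  # 3 2 1
```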

In [None]:
# ============================================
# 2. FILTERING AND SUBSETTING DATA
# ============================================

print("=" * 80)
print("FILTERING AND SUBSETTING DATA")
print("=" * 80)

# 1. Boolean indexing
print("\n1. Boolean Indexing:")

# Single condition
high_sales = sales_data[sales_data['sales_amount'] > 1000]
print(f"   Records with sales > $1000: {len(high_sales)}")
print(high_sales.head())

# Multiple conditions (AND)
north_high_sales = sales_data[(sales_data['region'] == 'North') & (sales_data['sales_amount'] > 1000)]
print(f"\n   North region with sales > $1000: {len(north_high_sales)}")

# Multiple conditions (OR)
laptop_or_phone = sales_data[(sales_data['product'] == 'Laptop') | (sales_data['product'] == 'Phone')]
print(f"\n   Laptop or Phone sales: {len(laptop_or_phone)}")

# 2. loc - Label-based selection
print("\n2. loc (Label-based selection):")

# Select specific columns
selected_cols = sales_data.loc[:, ['date', 'product', 'sales_amount']]
print("   Selected columns:")
print(selected_cols.head())

# Select rows and columns
subset = sales_data.loc[0:4, ['product', 'region', 'sales_amount']]
print("\n   Rows 0-4, specific columns:")
print(subset)

# Select with condition
laptops = sales_data.loc[sales_data['product'] == 'Laptop', ['date', 'sales_amount', 'quantity']]
print(f"\n   Laptop sales only: {len(laptops)} records")

# 3. iloc - Position-based selection
print("\n3. iloc (Position-based selection):")

# First 10 rows, first 3 columns
first_subset = sales_data.iloc[:10, :3]
print("   First 10 rows, first 3 columns:")
print(first_subset)

# Specific rows and columns
specific = sales_data.iloc[[0, 5, 10], [1, 3, 4]]
print("\n   Specific rows [0,5,10] and columns [1,3,4]:")
print(specific)

# Every 10th row
every_10th = sales_data.iloc[::10, :]
print(f"\n   Every 10th row: {len(every_10th)} records")

# 4. query() method
print("\n4. query() Method (SQL-like):")

# Simple query
query_result = sales_data.query('sales_amount > 1500 and region == "North"')
print(f"   Sales > 1500 in North: {len(query_result)} records")
print(query_result.head())

# Using variables in query
threshold = 1200
query_result2 = sales_data.query('sales_amount > @threshold')
print(f"\n   Sales > {threshold}: {len(query_result2)} records")

# Complex query
complex_query = sales_data.query('product in ["Laptop", "Tablet"] and quantity >= 10')
print(f"\n   Laptop/Tablet with quantity >= 10: {len(complex_query)} records")

# 5. isin() method
print("\n5. isin() Method:")

# Filter by list of values
products_of_interest = ['Laptop', 'Monitor']
filtered = sales_data[sales_data['product'].isin(products_of_interest)]
print(f"   Products in {products_of_interest}: {len(filtered)} records")

# Inverse (NOT in list)
not_in_list = sales_data[~sales_data['product'].isin(['Tablet'])]
print(f"   Products NOT Tablet: {len(not_in_list)} records")

# 6. between() method
print("\n6. between() Method:")

# Numeric range
mid_range = sales_data[sales_data['sales_amount'].between(500, 1000)]
print(f"   Sales between $500-$1000: {len(mid_range)} records")
print(f"   Range: ${mid_range['sales_amount'].min():.2f} - ${mid_range['sales_amount'].max():.2f}")

# Date range
date_range = sales_data[sales_data['date'].between('2024-02-01', '2024-02-29')]  # 2024 is a leap year
print(f"\n   February 2024 sales: {len(date_range)} records")

# 7. Column selection shortcuts
print("\n7. Column Selection Shortcuts:")

# Single column (Series)
product_series = sales_data['product']
print(f"   Single column type: {type(product_series)}")

# Multiple columns (DataFrame)
multi_cols = sales_data[['product', 'sales_amount']]
print(f"   Multiple columns type: {type(multi_cols)}")
print(multi_cols.head(3))

# Drop columns
dropped = sales_data.drop(columns=['customer_id'])
print(f"\n   After dropping 'customer_id': {dropped.columns.tolist()}")

# Select by data type
numeric_only = sales_data.select_dtypes(include=[np.number])
print(f"\n   Numeric columns only: {numeric_only.columns.tolist()}")

# 8. Filter with string methods
print("\n8. String Filtering:")

# Products starting with 'L'
starts_with_l = sales_data[sales_data['product'].str.startswith('L')]
print(f"   Products starting with 'L': {starts_with_l['product'].unique()}")

# Case-insensitive contains
contains_tab = sales_data[sales_data['product'].str.contains('tab', case=False)]
print(f"   Products containing 'tab': {contains_tab['product'].unique()}")

# 9. Top N and Bottom N
print("\n9. Top N and Bottom N Records:")

# Top 5 by sales amount
top_5 = sales_data.nlargest(5, 'sales_amount')
print("   Top 5 sales:")
print(top_5[['date', 'product', 'sales_amount']])

# Bottom 5 by sales amount
bottom_5 = sales_data.nsmallest(5, 'sales_amount')
print("\n   Bottom 5 sales:")
print(bottom_5[['date', 'product', 'sales_amount']])

# 10. Combining filters
print("\n10. Complex Combined Filters:")

# Multiple conditions with helper variables
min_amount = 800
max_amount = 1500
regions_of_interest = ['North', 'South']
product_list = ['Laptop', 'Phone']

complex_filter = sales_data[
    (sales_data['sales_amount'].between(min_amount, max_amount)) &
    (sales_data['region'].isin(regions_of_interest)) &
    (sales_data['product'].isin(product_list)) &
    (sales_data['quantity'] > 5)
]

print(f"   Complex filter results: {len(complex_filter)} records")
print(f"   Conditions: Sales ${min_amount}-${max_amount}, ")
print(f"   Regions: {regions_of_interest}, Products: {product_list}, Quantity > 5")
print(complex_filter.head())

# Summary
print("\n11. Filtering Summary:")
summary = pd.DataFrame({
    'Method': ['Boolean Indexing', 'loc', 'iloc', 'query()', 'isin()', 'between()', 'nlargest/nsmallest'],
    'Use_Case': [
        'Simple conditions',
        'Label-based selection',
        'Position-based selection',
        'SQL-like filtering',
        'Match against list',
        'Range filtering',
        'Top/Bottom N records'
    ],
    'Example': [
        'df[df["col"] > 100]',
        'df.loc[df["col"] > 100, ["a", "b"]]',
        'df.iloc[:10, :3]',
        'df.query("col > 100")',
        'df[df["col"].isin([1,2,3])]',
        'df[df["col"].between(1, 10)]',
        'df.nlargest(5, "col")'
    ]
})
print(summary.to_string(index=False))

print("\n   ✓ Filtering and subsetting complete!")

## 3. Sorting and Ranking

**What is it?**
Arranging data in a specific order based on one or more columns, and assigning ranks to values.

**Why use it?**
- Identify top/bottom performers
- Order chronological data
- Prepare data for certain algorithms
- Find patterns in ordered data
- Create rankings and leaderboards

**When to use it?**
- Need to find extremes (min/max)
- Time series analysis
- Creating rankings
- Before order-dependent group operations (e.g., `first()`, `head()`, `cumsum()`)
- Presenting sorted reports

**Key methods:**
- `sort_values()`: Sort by column values
- `sort_index()`: Sort by index
- `rank()`: Assign ranks to data
- `nlargest()` / `nsmallest()`: Get top/bottom N
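
The `method` argument of `rank()` only matters when values tie. Because the notebook's `sales_amount` values are drawn from a continuous distribution, ties are very unlikely there, so the sketch below uses a small hand-made Series with one tie to make the differences visible.

```python
import pandas as pd

# Two values tie at 200
scores = pd.Series([100, 200, 200, 300], name="score")

ranks = pd.DataFrame({
    "score": scores,
    "average": scores.rank(method="average"),  # ties share the mean of their ranks
    "min": scores.rank(method="min"),          # ties share the lowest rank, then a gap
    "dense": scores.rank(method="dense"),      # ties share a rank, no gap afterwards
    "first": scores.rank(method="first"),      # ties ranked by order of appearance
})
print(ranks)
#    score  average  min  dense  first
# 0    100      1.0  1.0    1.0    1.0
# 1    200      2.5  2.0    2.0    2.0
# 2    200      2.5  2.0    2.0    3.0
# 3    300      4.0  4.0    3.0    4.0
```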

In [None]:
# ============================================
# 3. SORTING AND RANKING
# ============================================

print("=" * 80)
print("SORTING AND RANKING")
print("=" * 80)

# 1. Sort by single column
print("\n1. Sort by Single Column:")

# Ascending order
sorted_asc = sales_data.sort_values('sales_amount')
print("   Bottom 5 (lowest sales):")
print(sorted_asc[['date', 'product', 'sales_amount']].head())

# Descending order
sorted_desc = sales_data.sort_values('sales_amount', ascending=False)
print("\n   Top 5 (highest sales):")
print(sorted_desc[['date', 'product', 'sales_amount']].head())

# 2. Sort by multiple columns
print("\n2. Sort by Multiple Columns:")

# Sort by region, then by sales_amount
multi_sort = sales_data.sort_values(['region', 'sales_amount'], ascending=[True, False])
print("   Sorted by region (asc), then sales_amount (desc):")
print(multi_sort[['region', 'product', 'sales_amount']].head(10))

# 3. Sort by index
print("\n3. Sort by Index:")

# First shuffle the data
shuffled = sales_data.sample(frac=1, random_state=42)
print(f"   Index before sort: {shuffled.index[:5].tolist()}")

# Sort by index
index_sorted = shuffled.sort_index()
print(f"   Index after sort: {index_sorted.index[:5].tolist()}")

# 4. Sort with missing values
print("\n4. Handling Missing Values in Sorting:")

# Create data with NaN
df_with_nan = sales_data.copy()
df_with_nan.loc[0:5, 'sales_amount'] = np.nan

# NaN first
nan_first = df_with_nan.sort_values('sales_amount', na_position='first')
print(f"   NaN first - First 3 sales_amounts: {nan_first['sales_amount'].head(3).tolist()}")

# NaN last (default)
nan_last = df_with_nan.sort_values('sales_amount', na_position='last')
print(f"   NaN last - Last 3 sales_amounts: {nan_last['sales_amount'].tail(3).tolist()}")

# 5. Inplace sorting
print("\n5. Inplace Sorting:")

temp_df = sales_data.copy()
print(f"   Before: First sales_amount = ${temp_df.iloc[0]['sales_amount']:.2f}")

temp_df.sort_values('sales_amount', ascending=False, inplace=True)
print(f"   After inplace sort: First sales_amount = ${temp_df.iloc[0]['sales_amount']:.2f}")

# 6. Ranking methods
print("\n6. Ranking Methods:")

# Create sample for ranking
rank_sample = sales_data[['product', 'sales_amount']].head(10).copy()

# Average rank (default)
rank_sample['rank_average'] = rank_sample['sales_amount'].rank(method='average')

# Min rank (ties get minimum rank)
rank_sample['rank_min'] = rank_sample['sales_amount'].rank(method='min')

# Dense rank (no gaps in ranking)
rank_sample['rank_dense'] = rank_sample['sales_amount'].rank(method='dense')

# First rank (order in data)
rank_sample['rank_first'] = rank_sample['sales_amount'].rank(method='first')

print("   Different ranking methods:")
print(rank_sample.sort_values('sales_amount'))

# 7. Ranking with ascending parameter
print("\n7. Ranking Order (Ascending vs Descending):")

rank_example = sales_data[['product', 'sales_amount']].head(8).copy()

# Ascending ranks (lowest = rank 1)
rank_example['rank_asc'] = rank_example['sales_amount'].rank(ascending=True)

# Descending ranks (highest = rank 1)
rank_example['rank_desc'] = rank_example['sales_amount'].rank(ascending=False)

print("   Ascending vs Descending ranks:")
print(rank_example.sort_values('sales_amount'))

# 8. Percentage ranking
print("\n8. Percentage Ranking (Percentile):")

rank_pct = sales_data[['product', 'sales_amount']].copy()
rank_pct['percentile'] = rank_pct['sales_amount'].rank(pct=True) * 100

print("   Sales percentiles:")
print(rank_pct.sort_values('percentile', ascending=False).head())
print(f"\n   Top sale is in {rank_pct['percentile'].max():.1f}th percentile")

# 9. Ranking within groups
print("\n9. Ranking Within Groups:")

# Rank sales within each region
grouped_rank = sales_data.copy()
grouped_rank['rank_in_region'] = grouped_rank.groupby('region')['sales_amount'].rank(ascending=False)

print("   Top sales in each region:")
top_per_region = grouped_rank[grouped_rank['rank_in_region'] <= 3].sort_values(['region', 'rank_in_region'])
print(top_per_region[['region', 'product', 'sales_amount', 'rank_in_region']].head(12))

# 10. Custom sorting with key parameter
print("\n10. Custom Sorting (with key function):")

# Sort products by length of name
custom_sort = sales_data.copy()
custom_sort_result = custom_sort.sort_values('product', key=lambda x: x.str.len())

print("   Products sorted by name length:")
print(custom_sort_result[['product']].drop_duplicates())

# 11. Stable sorting
print("\n11. Stable Sorting (preserves order for equal values):")

stable_df = sales_data.copy()
stable_df['original_order'] = range(len(stable_df))

# Stable sort keeps original order for ties
stable_sorted = stable_df.sort_values('region', kind='stable')

print("   Stable sort maintains order within same region:")
print(stable_sorted[['original_order', 'region']].head(10))

# 12. argsort for getting sorted indices
print("\n12. Getting Sorted Indices (argsort):")

arr = sales_data['sales_amount'].values[:10]
sorted_indices = np.argsort(arr)

print(f"   Original values: {arr[:5]}")
print(f"   Sorted indices: {sorted_indices[:5]}")
print(f"   Sorted values: {arr[sorted_indices][:5]}")

# 13. Summary comparison
print("\n13. Sorting Methods Comparison:")

sort_summary = pd.DataFrame({
    'Method': [
        'sort_values(col)',
        'sort_values([col1, col2])',
        'sort_index()',
        'rank()',
        'rank(pct=True)',
        'nlargest(n, col)',
        'nsmallest(n, col)'
    ],
    'Purpose': [
        'Sort by column values',
        'Sort by multiple columns',
        'Sort by index',
        'Assign ranks',
        'Percentile ranks',
        'Get top N',
        'Get bottom N'
    ],
    'Returns': [
        'Sorted DataFrame',
        'Sorted DataFrame',
        'Sorted DataFrame',
        'Series with ranks',
        'Series with percentiles',
        'Top N rows',
        'Bottom N rows'
    ]
})

print(sort_summary.to_string(index=False))

print("\n   ✓ Sorting and ranking complete!")


## 4. Reshaping Data (Pivot, Melt, Stack, Unstack)

**What is it?**
Transforming data structure between wide and long formats to facilitate different types of analysis.

**Why use it?**
- Convert between analysis-friendly formats
- Prepare data for visualization
- Match requirements of different tools
- Create pivot tables for summarization
- Normalize/denormalize data structures

**When to use it?**
- Need pivot tables for reporting
- Converting wide to long format (or vice versa)
- Preparing data for plotting
- Creating cross-tabulations
- Restructuring hierarchical data

**Key operations:**
- **pivot()**: Reshape based on column values
- **pivot_table()**: Pivot with aggregation
- **melt()**: Wide to long format
- **stack()**: Pivot columns to rows (MultiIndex)
- **unstack()**: Pivot rows to columns
- **crosstab()**: Cross-tabulation
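
The practical difference between `pivot()` and `pivot_table()` shows up when an index/column pair occurs more than once. The sketch below uses a small made-up table (`long_demo` is a hypothetical name) in which the pair (Jan, North) appears twice, then melts the result back to long format.

```python
import pandas as pd

long_demo = pd.DataFrame({
    "month": ["Jan", "Jan", "Jan", "Feb"],
    "region": ["North", "North", "South", "North"],
    "sales": [100, 50, 200, 120],
})

# pivot() cannot choose between the two (Jan, North) rows, so it raises ValueError
pivot_failed = False
try:
    long_demo.pivot(index="month", columns="region", values="sales")
except ValueError as exc:
    pivot_failed = True
    print(f"pivot() failed: {exc}")

# pivot_table() aggregates the duplicates instead (sum in this case)
wide_demo = long_demo.pivot_table(
    index="month", columns="region", values="sales", aggfunc="sum"
)
print(wide_demo)

# melt() goes back from wide to long; pairs with no data come back as NaN
back_to_long = wide_demo.reset_index().melt(
    id_vars="month", var_name="region", value_name="sales"
)
print(back_to_long)
```

Note that the round trip does not restore the original rows: the two (Jan, North) records have been summed into one, and the missing (Feb, South) pair now appears as a NaN row.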

In [None]:
# ============================================
# 4. RESHAPING DATA (PIVOT, MELT, STACK, UNSTACK)
# ============================================

print("=" * 80)
print("RESHAPING DATA")
print("=" * 80)

# Create sample data for reshaping
reshape_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=12, freq='ME'),  # 'ME' = month end ('M' is deprecated since pandas 2.2)
    'region': ['North', 'South'] * 6,
    'product': ['Laptop', 'Laptop', 'Phone', 'Phone'] * 3,
    'sales': np.random.randint(1000, 5000, 12),
    'quantity': np.random.randint(10, 50, 12)
})

print("\n0. Original Data (Long Format):")
print(reshape_data.head(8))

# 1. pivot() - Basic reshape
print("\n1. pivot() - Reshape without aggregation:")

# Create simple data for pivot (needs unique index-column combinations)
simple_pivot_data = pd.DataFrame({
    'date': ['2024-01', '2024-01', '2024-02', '2024-02'],
    'region': ['North', 'South', 'North', 'South'],
    'sales': [1000, 1500, 1200, 1600]
})

pivoted = simple_pivot_data.pivot(index='date', columns='region', values='sales')
print("\n   Pivoted (regions as columns):")
print(pivoted)

# 2. pivot_table() - Pivot with aggregation
print("\n2. pivot_table() - Pivot with aggregation:")

# Pivot with sum aggregation
pivot_sum = reshape_data.pivot_table(
    index='region',
    columns='product',
    values='sales',
    aggfunc='sum'
)

print("\n   Total sales by region and product:")
print(pivot_sum)

# Multiple aggregations
pivot_multi_agg = reshape_data.pivot_table(
    index='region',
    columns='product',
    values='sales',
    aggfunc=['sum', 'mean', 'count']
)

print("\n   Multiple aggregations:")
print(pivot_multi_agg)

# 3. pivot_table with margins (totals)
print("\n3. Pivot Table with Margins (Totals):")

pivot_margins = reshape_data.pivot_table(
    index='region',
    columns='product',
    values='sales',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)

print(pivot_margins)

# 4. melt() - Wide to long format
print("\n4. melt() - Wide to Long Format:")

# Create wide format data
wide_data = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet'],
    'Jan': [1000, 1500, 800],
    'Feb': [1200, 1600, 900],
    'Mar': [1100, 1550, 850]
})

print("\n   Wide format:")
print(wide_data)

# Melt to long format
melted = wide_data.melt(
    id_vars=['product'],
    value_vars=['Jan', 'Feb', 'Mar'],
    var_name='month',
    value_name='sales'
)

print("\n   Melted to long format:")
print(melted)

# 5. melt with multiple id variables
print("\n5. melt() with Multiple ID Variables:")

multi_id_data = pd.DataFrame({
    'region': ['North', 'South'],
    'product': ['Laptop', 'Phone'],
    'Q1': [5000, 6000],
    'Q2': [5500, 6200],
    'Q3': [5200, 6100],
    'Q4': [5800, 6500]
})

multi_melted = multi_id_data.melt(
    id_vars=['region', 'product'],
    var_name='quarter',
    value_name='sales'
)

print(multi_melted)

# 6. stack() - Pivot columns to index
print("\n6. stack() - Pivot Columns to Index:")

# Create DataFrame with column hierarchy
stacked_data = pd.DataFrame({
    ('Sales', 'North'): [1000, 1100],
    ('Sales', 'South'): [1500, 1600],
    ('Quantity', 'North'): [10, 12],
    ('Quantity', 'South'): [15, 18]
}, index=['Jan', 'Feb'])

print("\n   Before stack:")
print(stacked_data)

# Stack innermost level
stacked = stacked_data.stack()
print("\n   After stack:")
print(stacked)

# 7. unstack() - Pivot index to columns
print("\n7. unstack() - Pivot Index to Columns:")

# Create multi-index data
multi_index_data = pd.DataFrame({
    'sales': [1000, 1500, 1200, 1600],
    'quantity': [10, 15, 12, 18]
}, index=pd.MultiIndex.from_product([['North', 'South'], ['Laptop', 'Phone']], 
                                     names=['region', 'product']))

print("\n   Before unstack:")
print(multi_index_data)

# Unstack product level
unstacked = multi_index_data.unstack(level='product')
print("\n   After unstack (product):  ")
print(unstacked)

# 8. crosstab() - Cross-tabulation
print("\n8. crosstab() - Cross-tabulation:")

# Simple crosstab
ct = pd.crosstab(
    reshape_data['region'],
    reshape_data['product'],
    values=reshape_data['sales'],
    aggfunc='sum'
)

print("\n   Crosstab (region vs product):")
print(ct)

# Crosstab with margins and normalization
ct_normalized = pd.crosstab(
    reshape_data['region'],
    reshape_data['product'],
    values=reshape_data['sales'],
    aggfunc='sum',
    normalize='all',  # Scale cells to proportions of the grand total (0-1)
    margins=True
)

print("\n   Normalized crosstab (proportions of grand total):")
print(ct_normalized)

# 9. reset_index and set_index for reshaping
print("\n9. reset_index() and set_index():")

# Pivot creates index - can reset to columns
pivot_with_index = reshape_data.pivot_table(
    index='region',
    columns='product',
    values='sales',
    aggfunc='mean'
)

print("\n   With index:")
print(pivot_with_index)

# Reset index to make it a column
reset = pivot_with_index.reset_index()
print("\n   After reset_index():")
print(reset)

# Set index back
set_back = reset.set_index('region')
print("\n   After set_index('region'):")
print(set_back)

# 10. Wide to long example (practical)
print("\n10. Practical Example - Survey Data Reshaping:")

# Survey data in wide format
survey_wide = pd.DataFrame({
    'respondent_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'Q1_rating': [5, 4, 3],
    'Q2_rating': [4, 5, 4],
    'Q3_rating': [5, 3, 5]
})

print("\n   Wide format (one row per respondent):")
print(survey_wide)

# Convert to long format
survey_long = survey_wide.melt(
    id_vars=['respondent_id', 'name'],
    value_vars=['Q1_rating', 'Q2_rating', 'Q3_rating'],
    var_name='question',
    value_name='rating'
)

# Clean question names
survey_long['question'] = survey_long['question'].str.replace('_rating', '')

print("\n   Long format (one row per response):")
print(survey_long)

# 11. Reshaping time series data
print("\n11. Time Series Reshaping:")

# Create time series data
ts_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=6, freq='ME'),  # 'ME' = month end ('M' is deprecated since pandas 2.2)
    'metric': ['sales', 'costs'] * 3,
    'value': [10000, 7000, 11000, 7500, 10500, 7200]
})

print("\n   Long format time series:")
print(ts_data)

# Pivot to wide format
ts_wide = ts_data.pivot(index='date', columns='metric', values='value')

print("\n   Wide format (metrics as columns):")
print(ts_wide)

# 12. Summary of reshaping operations
print("\n12. Reshaping Operations Summary:")

reshape_summary = pd.DataFrame({
    'Operation': [
        'pivot()',
        'pivot_table()',
        'melt()',
        'stack()',
        'unstack()',
        'crosstab()'
    ],
    'Direction': [
        'Long → Wide',
        'Long → Wide (with agg)',
        'Wide → Long',
        'Columns → Index',
        'Index → Columns',
        'Cross-tabulation'
    ],
    'Use_Case': [
        'Unique combinations',
        'With aggregation',
        'Normalize data',
        'Create MultiIndex',
        'Flatten MultiIndex',
        'Frequency tables'
    ],
    'Example': [
        'df.pivot(index=, columns=, values=)',
        'df.pivot_table(index=, columns=, aggfunc=)',
        'df.melt(id_vars=, value_vars=)',
        'df.stack()',
        'df.unstack()',
        'pd.crosstab(rows, cols)'
    ]
})

print(reshape_summary.to_string(index=False))

print("\n   ✓ Reshaping complete!")


## 5. Merging and Joining DataFrames

**What is it?**
Combining multiple DataFrames based on common columns or indices, similar to SQL joins.

**Why use it?**
- Integrate data from multiple sources
- Enrich datasets with additional information
- Combine related tables
- Create comprehensive datasets for analysis
- Maintain relational data integrity

**When to use it?**
- Data split across multiple files/tables
- Need to add customer/product details to transactions
- Combining datasets from different systems
- Creating master datasets from normalized data
- Matching records across sources

**Join types:**
- **inner**: Only matching records (intersection)
- **left**: All from left + matching from right
- **right**: All from right + matching from left
- **outer**: All records from both (union)
- **cross**: Cartesian product (all combinations)
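As a rough illustration of the join types listed above, the sketch below counts the rows each `how` value returns. The `left` and `right` tables and their column names are hypothetical toy data, not part of the notebook's datasets.

```python
import pandas as pd

# Hypothetical toy tables: keys 2 and 3 appear on both sides.
left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Key-based joins share the same call; only `how` changes.
row_counts = {
    how: len(pd.merge(left, right, on='key', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}

# A cross join takes no key: it pairs every left row with every right row.
row_counts['cross'] = len(pd.merge(left, right, how='cross'))

print(row_counts)
```

Comparing row counts before and after a merge is a quick sanity check: an unexpected increase usually points to duplicate keys, and an unexpected drop to unmatched keys.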

In [None]:
# ============================================
# 5. MERGING AND JOINING DATAFRAMES
# ============================================

print("=" * 80)
print("MERGING AND JOINING DATAFRAMES")
print("=" * 80)

# Create sample datasets for merging
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'customer_id': [101, 102, 103, 101, 104],
    'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Laptop'],
    'amount': [1200, 800, 500, 300, 1100]
})

customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 105],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

products = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet', 'Monitor'],
    'category': ['Computer', 'Mobile', 'Mobile', 'Computer'],
    'price': [1200, 800, 500, 300]
})

print("\n0. Sample DataFrames:")
print("\n   Orders:")
print(orders)
print("\n   Customers:")
print(customers)
print("\n   Products:")
print(products)

# 1. Inner Join (intersection)
print("\n1. Inner Join (default):")

inner_join = pd.merge(orders, customers, on='customer_id', how='inner')
print("\n   Orders + Customers (inner join):")
print(inner_join)
print(f"\n   Result: {len(inner_join)} rows (only matching customer_ids)")

# 2. Left Join
print("\n2. Left Join:")

left_join = pd.merge(orders, customers, on='customer_id', how='left')
print("\n   Orders + Customers (left join):")
print(left_join)
print(f"\n   Result: {len(left_join)} rows (all orders, customer_id 104 has NaN)")

# 3. Right Join
print("\n3. Right Join:")

right_join = pd.merge(orders, customers, on='customer_id', how='right')
print("\n   Orders + Customers (right join):")
print(right_join)
print(f"\n   Result: {len(right_join)} rows (all customers, Diana has no orders)")

# 4. Outer Join (union)
print("\n4. Outer Join:")

outer_join = pd.merge(orders, customers, on='customer_id', how='outer')
print("\n   Orders + Customers (outer join):")
print(outer_join)
print(f"\n   Result: {len(outer_join)} rows (all records from both)")

# 5. Merge on multiple columns
print("\n5. Merge on Multiple Columns:")

sales1 = pd.DataFrame({
    'region': ['North', 'South', 'East'],
    'product': ['Laptop', 'Phone', 'Tablet'],
    'Q1_sales': [1000, 1500, 800]
})

sales2 = pd.DataFrame({
    'region': ['North', 'South', 'West'],
    'product': ['Laptop', 'Phone', 'Monitor'],
    'Q2_sales': [1100, 1600, 900]
})

multi_col_merge = pd.merge(sales1, sales2, on=['region', 'product'], how='outer')
print(multi_col_merge)

# 6. Merge with different column names
print("\n6. Merge with Different Column Names:")

df1 = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'emp_id': [1, 2, 4],
    'department': ['Sales', 'Engineering', 'HR']
})

diff_names_merge = pd.merge(df1, df2, left_on='employee_id', right_on='emp_id', how='left')
print(diff_names_merge)

# 7. Merge on index
print("\n7. Merge on Index:")

df_a = pd.DataFrame({
    'value_a': [10, 20, 30]
}, index=['A', 'B', 'C'])

df_b = pd.DataFrame({
    'value_b': [100, 200, 400]
}, index=['A', 'B', 'D'])

index_merge = pd.merge(df_a, df_b, left_index=True, right_index=True, how='outer')
print(index_merge)

# 8. Join method (alternative to merge)
print("\n8. Join Method (Index-based):")

joined = df_a.join(df_b, how='outer')
print("\n   Using .join() method:")
print(joined)

# Join with suffix for overlapping columns
df_c = pd.DataFrame({
    'value': [10, 20, 30]
}, index=['A', 'B', 'C'])

df_d = pd.DataFrame({
    'value': [100, 200, 300]
}, index=['A', 'B', 'C'])

joined_suffix = df_c.join(df_d, lsuffix='_left', rsuffix='_right')
print("\n   Join with suffixes:")
print(joined_suffix)

# 9. Indicator to show merge source
print("\n9. Merge with Indicator:")

merge_indicator = pd.merge(orders, customers, on='customer_id', how='outer', indicator=True)
print(merge_indicator)

print("\n   Merge source counts:")
print(merge_indicator['_merge'].value_counts())

# 10. Validate merge (check for duplicates)
print("\n10. Validate Merge:")

# validate='one_to_one' raises MergeError here because customer_id 101
# appears twice in orders, so the left keys are not unique
try:
    validated_merge = pd.merge(orders, customers, on='customer_id', validate='one_to_one')
except pd.errors.MergeError as e:
    print(f"   Validation error (expected): {type(e).__name__}")

# Correct validation: many orders map to one customer
validated_merge = pd.merge(orders, customers, on='customer_id', validate='many_to_one')
print("   ✓ Many-to-one validation passed")

# 11. Cross join (Cartesian product)
print("\n11. Cross Join:")

colors = pd.DataFrame({'color': ['Red', 'Blue']})
sizes = pd.DataFrame({'size': ['S', 'M', 'L']})

cross_join = pd.merge(colors, sizes, how='cross')
print("\n   All combinations (cross join):")
print(cross_join)
print(f"   Result: {len(cross_join)} rows (2 × 3)")

# 12. Merge with suffixes
print("\n12. Merge with Overlapping Columns:")

sales_2023 = pd.DataFrame({
    'product': ['Laptop', 'Phone'],
    'sales': [1000, 800],
    'quantity': [10, 15]
})

sales_2024 = pd.DataFrame({
    'product': ['Laptop', 'Phone'],
    'sales': [1200, 900],
    'quantity': [12, 18]
})

suffixed_merge = pd.merge(sales_2023, sales_2024, on='product', suffixes=('_2023', '_2024'))
print(suffixed_merge)

# 13. Multiple merges (chaining)
print("\n13. Chaining Multiple Merges:")

result = (orders
          .merge(customers, on='customer_id', how='left')
          .merge(products, on='product', how='left'))

print("\n   Orders + Customers + Products:")
print(result[['order_id', 'name', 'product', 'category', 'amount', 'city']])

# 14. Merge types comparison
print("\n14. Merge Types Comparison:")

merge_comparison = pd.DataFrame({
    'Join_Type': ['inner', 'left', 'right', 'outer', 'cross'],
    'Returns': [
        'Only matching records',
        'All left + matching right',
        'All right + matching left',
        'All records from both',
        'All combinations (Cartesian)'
    ],
    'Use_Case': [
        'Need complete data only',
        'Keep all main records',
        'Keep all lookup records',
        'Keep everything',
        'Generate combinations'
    ],
    'Example': [
        'Transaction + Valid customers',
        'Customers + Optional orders',
        'Products + Optional sales',
        'Consolidate all data',
        'Size × Color combinations'
    ]
})

print(merge_comparison.to_string(index=False))

print("\n   ✓ Merging and joining complete!")

## 6. Concatenating Data

**What**: Stacking or combining multiple DataFrames along rows or columns.

**Why**: 
- Combine data from multiple sources or time periods
- Stack datasets with the same structure
- Add new rows or columns to existing data

**When to Use**:
- Combining monthly/yearly reports
- Merging data splits (train/test)
- Appending new records
- Adding features from different sources

**Key Operations**:
- **pd.concat()**: Stack along axis (rows/columns)
- **append()**: Add rows (removed in pandas 2.0; use pd.concat() instead)
- **ignore_index**: Reset index after concatenation
- **keys**: Add hierarchical index to identify source
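The key operations above can be sketched as follows. This is a minimal example, assuming pandas 2.0 or later (where `DataFrame.append` no longer exists); the `jan` and `feb` frames are hypothetical.

```python
import pandas as pd

# Hypothetical monthly frames with identical columns.
jan = pd.DataFrame({'product': ['Laptop', 'Phone'], 'sales': [100, 200]})
feb = pd.DataFrame({'product': ['Laptop', 'Phone'], 'sales': [150, 250]})

# Stands in for the old jan.append(feb, ignore_index=True):
# stack the rows and renumber the index 0..n-1.
stacked = pd.concat([jan, feb], ignore_index=True)

# keys= adds an outer index level that records which frame each row came from.
labeled = pd.concat([jan, feb], keys=['Jan', 'Feb'], names=['month', 'row'])

print(stacked.index.tolist())
print(labeled.loc['Feb']['sales'].tolist())
```

`ignore_index=True` suits cases where the original row labels carry no meaning, while `keys=` is the better fit when the source of each row must stay traceable.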

In [None]:
# ============================================
# 6. CONCATENATING DATA
# ============================================

print("=" * 80)
print("CONCATENATING DATA")
print("=" * 80)

# Create sample datasets
q1_sales = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet'],
    'sales': [1000, 800, 500],
    'quarter': ['Q1', 'Q1', 'Q1']
})

q2_sales = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet'],
    'sales': [1200, 900, 550],
    'quarter': ['Q2', 'Q2', 'Q2']
})

q3_sales = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet'],
    'sales': [1100, 850, 520],
    'quarter': ['Q3', 'Q3', 'Q3']
})

print("\n0. Sample DataFrames:")
print("\n   Q1 Sales:")
print(q1_sales)
print("\n   Q2 Sales:")
print(q2_sales)
print("\n   Q3 Sales:")
print(q3_sales)

# 1. Concatenate rows (vertically)
print("\n1. Concatenate Rows (axis=0, default):")

yearly_sales = pd.concat([q1_sales, q2_sales, q3_sales])
print(yearly_sales)
print(f"\n   Result: {len(yearly_sales)} rows (3 + 3 + 3)")

# 2. Concatenate with ignore_index
print("\n2. Concatenate with Reset Index:")

yearly_sales_reset = pd.concat([q1_sales, q2_sales, q3_sales], ignore_index=True)
print(yearly_sales_reset)
print("\n   Index reset: 0 to 8 instead of 0,1,2,0,1,2,0,1,2")

# 3. Concatenate columns (horizontally)
print("\n3. Concatenate Columns (axis=1):")

df_a = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

df_b = pd.DataFrame({
    'C': [7, 8, 9],
    'D': [10, 11, 12]
})

horizontal_concat = pd.concat([df_a, df_b], axis=1)
print(horizontal_concat)

# 4. Concatenate with keys (hierarchical index)
print("\n4. Concatenate with Keys (Hierarchical Index):")

keyed_concat = pd.concat([q1_sales, q2_sales, q3_sales], keys=['Q1', 'Q2', 'Q3'])
print(keyed_concat)

print("\n   Access Q2 data:")
print(keyed_concat.loc['Q2'])

# 5. Concatenate with mismatched columns
print("\n5. Concatenate with Mismatched Columns:")

df1 = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

df2 = pd.DataFrame({
    'B': [5, 6],
    'C': [7, 8]
})

mismatched = pd.concat([df1, df2], ignore_index=True)
print("\n   Missing columns filled with NaN:")
print(mismatched)

# 6. Concatenate with inner join (intersection)
print("\n6. Concatenate with Inner Join:")

inner_concat = pd.concat([df1, df2], ignore_index=True, join='inner')
print("\n   Only common columns (B):")
print(inner_concat)

# 7. Concatenate series
print("\n7. Concatenate Series:")

s1 = pd.Series([1, 2, 3], name='A')
s2 = pd.Series([4, 5, 6], name='B')
s3 = pd.Series([7, 8, 9], name='C')

series_concat = pd.concat([s1, s2, s3], axis=1)
print("\n   Series to DataFrame:")
print(series_concat)

# 8. Concatenate with verify_integrity
print("\n8. Concatenate with Duplicate Check:")

df_dup1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df_dup2 = pd.DataFrame({'A': [3, 4]}, index=[1, 2])

try:
    # Raises ValueError because index label 1 appears in both frames
    pd.concat([df_dup1, df_dup2], verify_integrity=True)
except ValueError as e:
    print(f"   Error (expected): {e}")

print("\n   ✓ Use ignore_index=True or verify_integrity=False")

# 9. Concatenate multiple DataFrames at once
print("\n9. Concatenate Multiple DataFrames:")

jan = pd.DataFrame({'sales': [100, 200]}, index=['A', 'B'])
feb = pd.DataFrame({'sales': [150, 250]}, index=['A', 'B'])
mar = pd.DataFrame({'sales': [120, 220]}, index=['A', 'B'])

# List of DataFrames
monthly_data = [jan, feb, mar]
all_months = pd.concat(monthly_data, keys=['Jan', 'Feb', 'Mar'])
print(all_months)

# 10. Concatenate with sort
print("\n10. Concatenate with Column Sorting:")

df_x = pd.DataFrame({
    'Z': [1, 2],
    'A': [3, 4]
})

df_y = pd.DataFrame({
    'B': [5, 6],
    'A': [7, 8]
})

sorted_concat = pd.concat([df_x, df_y], ignore_index=True, sort=True)
print("\n   Columns sorted alphabetically:")
print(sorted_concat)

# 11. Practical example: Combining train/test/validation sets
print("\n11. Practical: Combine Data Splits:")

train = pd.DataFrame({
    'feature1': [1, 2, 3],
    'feature2': [4, 5, 6],
    'label': [0, 1, 0]
})

test = pd.DataFrame({
    'feature1': [7, 8],
    'feature2': [9, 10],
    'label': [1, 0]
})

validation = pd.DataFrame({
    'feature1': [11, 12],
    'feature2': [13, 14],
    'label': [0, 1]
})

# Combine all with dataset identifier
all_data = pd.concat(
    [train, test, validation],
    keys=['train', 'test', 'validation'],
    names=['dataset', 'row']
)
print(all_data)

# 12. Concatenate with names
print("\n12. Concatenate with Index Names:")

df_2023 = pd.DataFrame({'value': [100, 200]})
df_2024 = pd.DataFrame({'value': [150, 250]})

named_concat = pd.concat(
    [df_2023, df_2024],
    keys=['2023', '2024'],
    names=['year', 'id']
)
print(named_concat)

# 13. Comparison: concat vs merge
print("\n13. Concat vs Merge Comparison:")

comparison = pd.DataFrame({
    'Operation': ['concat', 'merge'],
    'Purpose': ['Stack/combine', 'Join on keys'],
    'Alignment': ['By index/position', 'By column values'],
    'Use_Case': ['Same structure data', 'Related data'],
    'Example': ['Monthly reports', 'Orders + Customers']
})

print(comparison.to_string(index=False))

print("\n   ✓ Concatenation complete!")

## 7. Grouping and Aggregation

**What**: Splitting data into groups and applying aggregate functions to each group.

**Why**: 
- Summarize data by categories
- Calculate statistics per group
- Identify patterns within segments
- Generate reports and insights

**When to Use**:
- Sales by region/product/time period
- Average metrics per category
- Count records by group
- Compare performance across segments

**Key Operations**:
- **groupby()**: Split data into groups
- **agg()**: Apply multiple aggregations
- **transform()**: Return same-shaped result
- **filter()**: Filter groups based on condition
- **apply()**: Custom group operations
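The practical difference between `agg()`, `transform()`, and `filter()` is the shape of what comes back. The sketch below shows this on a hypothetical five-row frame; the data and the 400 threshold are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sales rows: two for North, two for South, one for East.
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East'],
    'sales': [100, 300, 200, 600, 50],
})
grouped = df.groupby('region')['sales']

# agg: one value per group (3 groups -> 3 values).
totals = grouped.agg('sum')

# transform: one value per original row, aligned to df's index,
# so it can be combined directly with the original column.
share = df['sales'] / grouped.transform('sum')

# filter: keeps or drops whole groups; returns the original rows of kept groups.
big = df.groupby('region').filter(lambda g: g['sales'].sum() >= 400)

print(totals)
print(share.tolist())
print(big)
```

A rule of thumb: use `agg()` for summary tables, `transform()` to add group-level values back onto each row, and `filter()` to subset the data by a group-level condition.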

In [None]:
# ============================================
# 7. GROUPING AND AGGREGATION
# ============================================

print("=" * 80)
print("GROUPING AND AGGREGATION")
print("=" * 80)

# Create sample dataset
sales_data = pd.DataFrame({
    'region': ['North', 'South', 'East', 'North', 'South', 'East', 'North', 'South'],
    'product': ['Laptop', 'Laptop', 'Phone', 'Phone', 'Tablet', 'Tablet', 'Laptop', 'Phone'],
    'sales': [1000, 1200, 800, 900, 500, 550, 1100, 850],
    'quantity': [10, 12, 15, 18, 8, 9, 11, 16],
    'month': ['Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Feb', 'Mar', 'Mar']
})

print("\n0. Sample Sales Data:")
print(sales_data)

# 1. Simple groupby with single aggregation
print("\n1. Simple GroupBy with Single Aggregation:")

sales_by_region = sales_data.groupby('region')['sales'].sum()
print("\n   Total sales by region:")
print(sales_by_region)

# 2. Groupby with multiple aggregations
print("\n2. Multiple Aggregations:")

region_stats = sales_data.groupby('region')['sales'].agg(['sum', 'mean', 'count', 'min', 'max'])
print("\n   Sales statistics by region:")
print(region_stats)

# 3. Groupby multiple columns
print("\n3. GroupBy Multiple Columns:")

region_product = sales_data.groupby(['region', 'product'])['sales'].sum()
print("\n   Sales by region and product:")
print(region_product)

# 4. Aggregation with different functions for different columns
print("\n4. Different Aggregations per Column:")

multi_agg = sales_data.groupby('region').agg({
    'sales': ['sum', 'mean'],
    'quantity': ['sum', 'max']
})
print(multi_agg)

# 5. Named aggregations (cleaner output)
print("\n5. Named Aggregations:")

named_agg = sales_data.groupby('region').agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    total_quantity=('quantity', 'sum'),
    max_quantity=('quantity', 'max')
)
print(named_agg)

# 6. GroupBy with transform
print("\n6. Transform (Keep Original Shape):")

# Add group mean as new column
sales_data['region_avg_sales'] = sales_data.groupby('region')['sales'].transform('mean')
print("\n   Original data with region average:")
print(sales_data[['region', 'sales', 'region_avg_sales']])

# Calculate percentage of region total
sales_data['pct_of_region'] = (
    sales_data['sales'] / sales_data.groupby('region')['sales'].transform('sum') * 100
)
print("\n   Sales as percentage of region total:")
print(sales_data[['region', 'product', 'sales', 'pct_of_region']].round(2))

# 7. GroupBy with filter
print("\n7. Filter Groups:")

# Keep only regions with total sales > 2000
high_sales_regions = sales_data.groupby('region').filter(lambda x: x['sales'].sum() > 2000)
print("\n   Regions with total sales > 2000:")
print(high_sales_regions)

# 8. GroupBy with custom aggregation function
print("\n8. Custom Aggregation Function:")

def sales_range(x):
    return x.max() - x.min()

custom_agg = sales_data.groupby('region')['sales'].agg([
    'mean',
    ('range', sales_range),
    ('cv', lambda x: x.std() / x.mean())  # Coefficient of variation
])
print("\n   Custom metrics by region:")
print(custom_agg)

# 9. GroupBy with apply
print("\n9. Apply Custom Function to Groups:")

def top_product(group):
    return group.nlargest(1, 'sales')[['product', 'sales']]

top_per_region = sales_data.groupby('region').apply(top_product)
print("\n   Top product per region:")
print(top_per_region)

# 10. GroupBy iteration
print("\n10. Iterate Over Groups:")

for region, group_data in sales_data.groupby('region'):
    print(f"\n   {region}:")
    print(f"   Total sales: ${group_data['sales'].sum():,}")
    print(f"   Products: {group_data['product'].nunique()}")

# 11. Multiple levels of grouping
print("\n11. Multiple Grouping Levels:")

monthly_summary = sales_data.groupby(['month', 'region']).agg({
    'sales': 'sum',
    'quantity': 'sum'
}).round(2)
print(monthly_summary)

# 12. Unstack for better visualization
print("\n12. Unstack Grouped Data:")

pivot_summary = sales_data.groupby(['region', 'product'])['sales'].sum().unstack(fill_value=0)
print("\n   Sales matrix (Region × Product):")
print(pivot_summary)

# 13. GroupBy with size and count
print("\n13. Size vs Count:")

print("\n   Size (includes NaN):")
print(sales_data.groupby('region').size())

print("\n   Count per column (excludes NaN):")
print(sales_data.groupby('region').count())

# 14. Groupby with percentiles
print("\n14. Percentile Aggregations:")

percentile_agg = sales_data.groupby('region')['sales'].quantile([0.25, 0.5, 0.75])
print(percentile_agg)

# 15. Practical example: Sales performance report
print("\n15. Practical: Comprehensive Sales Report:")

report = sales_data.groupby('region').agg(
    total_sales=('sales', 'sum'),
    avg_sale=('sales', 'mean'),
    min_sale=('sales', 'min'),
    max_sale=('sales', 'max'),
    total_quantity=('quantity', 'sum'),
    num_transactions=('sales', 'count'),
    unique_products=('product', 'nunique')
).round(2)

print(report)

# Add rankings
report['sales_rank'] = report['total_sales'].rank(ascending=False)
print("\n   With rankings:")
print(report.sort_values('total_sales', ascending=False))

print("\n   ✓ Grouping and aggregation complete!")

## 8. Data Transformation (Apply, Map, Replace)

**What**: Applying functions to modify data values in DataFrames and Series.

**Why**: 
- Transform data according to business logic
- Create new features from existing columns
- Clean and standardize values
- Perform element-wise operations

**When to Use**:
- Creating derived features
- Data cleaning and normalization
- Category mapping and encoding
- Custom calculations

**Key Operations**:
- **apply()**: Apply function along axis (row/column)
- **map()**: Map values element-wise (Series; also DataFrame since pandas 2.1)
- **applymap()**: Apply function to every element of a DataFrame (deprecated since pandas 2.1, use DataFrame.map())
- **replace()**: Replace specific values
- **Lambda functions**: Quick inline transformations
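One difference between `map()` and `replace()` that is easy to miss is how they treat values missing from the lookup dictionary. The sketch below illustrates it; the `dept` Series and `labels` dictionary are hypothetical.

```python
import pandas as pd

# Hypothetical department codes; 'legal' has no entry in the lookup.
dept = pd.Series(['sales', 'IT', 'legal'])
labels = {'sales': 'Sales', 'IT': 'Information Technology'}

# map: values without a dictionary entry become NaN.
mapped = dept.map(labels)

# replace: values without a dictionary entry are left unchanged.
replaced = dept.replace(labels)

# One way to get replace-like behavior from map: fall back to the original value.
mapped_filled = dept.map(labels).fillna(dept)

print(mapped.tolist())
print(replaced.tolist())
```

The NaN produced by `map()` can be useful as a check that the lookup covers every category; `replace()` is the safer choice when only some values need changing.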

In [None]:
# ============================================
# 8. DATA TRANSFORMATION (APPLY, MAP, REPLACE)
# ============================================

print("=" * 80)
print("DATA TRANSFORMATION")
print("=" * 80)

# Create sample dataset
transform_data = pd.DataFrame({
    'name': ['alice', 'bob', 'charlie', 'david'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 75000, 55000],
    'department': ['sales', 'IT', 'sales', 'IT'],
    'performance': ['good', 'excellent', 'average', 'good']
})

print("\n0. Sample Data:")
print(transform_data)

# 1. Apply with lambda (column-wise)
print("\n1. Apply Lambda to Column:")

transform_data['salary_k'] = transform_data['salary'].apply(lambda x: x / 1000)
print("\n   Salary in thousands:")
print(transform_data[['name', 'salary', 'salary_k']])

# 2. Apply custom function
print("\n2. Apply Custom Function:")

def categorize_age(age):
    if age < 25:
        return 'Junior'
    elif age < 35:
        return 'Mid-level'
    else:
        return 'Senior'

transform_data['age_category'] = transform_data['age'].apply(categorize_age)
print(transform_data[['name', 'age', 'age_category']])

# 3. Apply to multiple columns (row-wise)
print("\n3. Apply to Rows (axis=1):")

def total_compensation(row):
    bonus = {'good': 5000, 'excellent': 10000, 'average': 2000}
    return row['salary'] + bonus.get(row['performance'], 0)

transform_data['total_comp'] = transform_data.apply(total_compensation, axis=1)
print(transform_data[['name', 'salary', 'performance', 'total_comp']])

# 4. Map with dictionary
print("\n4. Map with Dictionary:")

dept_mapping = {
    'sales': 'Sales Department',
    'IT': 'Information Technology',
    'HR': 'Human Resources'
}

transform_data['dept_full'] = transform_data['department'].map(dept_mapping)
print(transform_data[['name', 'department', 'dept_full']])

# 5. Map with Series
print("\n5. Map with Series:")

dept_budget = pd.Series({
    'sales': 100000,
    'IT': 150000,
    'HR': 80000
})

transform_data['dept_budget'] = transform_data['department'].map(dept_budget)
print(transform_data[['name', 'department', 'dept_budget']])

# 6. Map with function
print("\n6. Map with Function:")

transform_data['name_upper'] = transform_data['name'].map(str.upper)
print(transform_data[['name', 'name_upper']])

# 7. DataFrame map (element-wise, new in pandas 2.1)
print("\n7. DataFrame Map (Element-wise):")

numeric_data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Square all values
squared = numeric_data.map(lambda x: x ** 2)
print("\n   Original:")
print(numeric_data)
print("\n   Squared:")
print(squared)

# 8. Replace single value
print("\n8. Replace Single Value:")

replaced = transform_data['performance'].replace('average', 'satisfactory')
print("\n   Performance ratings:")
print(replaced)

# 9. Replace multiple values
print("\n9. Replace Multiple Values:")

perf_replace = transform_data['performance'].replace({
    'good': 'Good',
    'excellent': 'Excellent',
    'average': 'Average'
})
print(perf_replace)

# 10. Replace with regex
print("\n10. Replace with Regex:")

text_data = pd.Series(['test-123', 'data-456', 'info-789'])
cleaned = text_data.replace(r'-\d+', '', regex=True)
print("\n   Original:", text_data.tolist())
print("   Cleaned:", cleaned.tolist())

# 11. Conditional replacement (where)
print("\n11. Conditional Replacement (where/mask):")

# Replace salary > 60000 with 60000 (cap)
capped_salary = transform_data['salary'].where(transform_data['salary'] <= 60000, 60000)
print("\n   Capped salary:")
print(pd.DataFrame({'original': transform_data['salary'], 'capped': capped_salary}))

# 12. Apply with multiple return values
print("\n12. Apply Returning Multiple Values:")

def salary_stats(salary):
    return pd.Series({
        'tax': salary * 0.25,
        'net': salary * 0.75
    })

tax_net = transform_data['salary'].apply(salary_stats)
print(tax_net)

# 13. Vectorized operations vs apply
print("\n13. Vectorized Operations (Preferred):")

# Using apply (slower)
result_apply = transform_data['salary'].apply(lambda x: x * 1.1)

# Using vectorized operation (faster)
result_vectorized = transform_data['salary'] * 1.1

print("\n   Both give same result:")
print("   Apply:", result_apply.head(2).tolist())
print("   Vectorized:", result_vectorized.head(2).tolist())
print("   ✓ Use vectorized when possible (much faster for large data)")

# 14. Practical: Feature engineering
print("\n14. Practical: Feature Engineering:")

# Create multiple derived features
transform_data['salary_per_year_of_age'] = transform_data['salary'] / transform_data['age']
transform_data['is_high_earner'] = transform_data['salary'] > transform_data['salary'].median()
transform_data['dept_code'] = transform_data['department'].astype('category').cat.codes

print(transform_data[['name', 'salary_per_year_of_age', 'is_high_earner', 'dept_code']])

# 15. Chaining transformations
print("\n15. Chaining Transformations:")

result = (transform_data['name']
          .str.upper()
          .str.replace('A', '@')
          .str[:5])

print("\n   Chained string transformations:")
print(result)

print("\n   ✓ Data transformation complete!")

## 9. Handling Time Series Data

**What**: Working with date and time data, including parsing, indexing, and time-based operations.

**Why**: 
- Analyze temporal patterns and trends
- Perform time-based calculations
- Resample and aggregate time series data
- Handle date ranges and periods

**When to Use**:
- Stock price analysis
- Sales forecasting
- Event tracking and scheduling
- IoT sensor data analysis

**Key Operations**:
- **pd.to_datetime()**: Convert to datetime
- **dt accessor**: Access datetime properties
- **date_range()**: Generate date sequences
- **resample()**: Change time frequency
- **shift()**: Lag/lead time series
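A small deterministic sketch of the operations listed above, using hypothetical daily values 1 to 14 so the results are easy to check by hand. It assumes the default `'W'` frequency, whose weekly bins end on Sunday.

```python
import pandas as pd

# Hypothetical daily series covering two full Monday-to-Sunday weeks
# (2024-01-01 falls on a Monday).
idx = pd.date_range('2024-01-01', periods=14, freq='D')
daily = pd.Series(range(1, 15), index=idx, name='value')

# resample: aggregate daily values into weekly bins.
weekly_total = daily.resample('W').sum()

# shift: lag by one row; the first value has no predecessor and becomes NaN.
prev_day = daily.shift(1)

# Datetime properties are available directly on a DatetimeIndex.
day_names = daily.index.day_name()

# to_datetime: parse a string into a Timestamp.
parsed = pd.to_datetime('2024-02-15')

print(weekly_total)
print(prev_day.head(3))
print(day_names[0], parsed.month)
```

On a datetime column rather than an index, the same properties are reached through the `.dt` accessor, for example `df['date'].dt.day_name()`.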

In [None]:
# ============================================
# 9. HANDLING TIME SERIES DATA
# ============================================

print("=" * 80)
print("HANDLING TIME SERIES DATA")
print("=" * 80)

# 1. Create datetime from strings
print("\n1. Convert Strings to Datetime:")

date_strings = ['2024-01-01', '2024-02-15', '2024-03-30']
dates = pd.to_datetime(date_strings)
print(dates)

# Different formats (format='mixed' requires pandas 2.0+ and infers each element's format separately)
mixed_formats = ['01/15/2024', '2024-02-20', 'March 10, 2024']
parsed_dates = pd.to_datetime(mixed_formats, format='mixed')
print("\n   Mixed formats:")
print(parsed_dates)

# 2. DateTime properties
print("\n2. Extract DateTime Components:")

sample_dates = pd.date_range('2024-01-01', periods=5, freq='D')
df_dates = pd.DataFrame({'date': sample_dates})

df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
df_dates['day'] = df_dates['date'].dt.day
df_dates['day_name'] = df_dates['date'].dt.day_name()
df_dates['quarter'] = df_dates['date'].dt.quarter

print(df_dates)

# 3. Generate date ranges
print("\n3. Generate Date Ranges:")

# Daily
daily = pd.date_range('2024-01-01', periods=7, freq='D')
print(f"\n   Daily: {daily[0]} to {daily[-1]}")

# Business days
business_days = pd.date_range('2024-01-01', periods=5, freq='B')
print(f"   Business days: {len(business_days)} days")

# Monthly
monthly = pd.date_range('2024-01-01', periods=6, freq='MS')  # Month start
print(f"   Monthly: {monthly[0]} to {monthly[-1]}")

# Hourly
hourly = pd.date_range('2024-01-01', periods=24, freq='h')
print(f"   Hourly: {len(hourly)} hours")

# 4. Time series with datetime index
print("\n4. Time Series with DatetimeIndex:")

ts_data = pd.DataFrame({
    'sales': np.random.randint(100, 200, 10)
}, index=pd.date_range('2024-01-01', periods=10, freq='D'))

print(ts_data.head())
print(f"\n   Index type: {type(ts_data.index)}")

# 5. Selecting by date
print("\n5. Select by Date:")

print("\n   All January data:")
print(ts_data.loc['2024-01'])  # partial-string row selection requires .loc in pandas 2.0+

print("\n   Specific date:")
print(ts_data.loc['2024-01-05'])

# 6. Resampling (change frequency)
print("\n6. Resample Time Series:")

# Create sample time series
ts = pd.DataFrame({
    'value': np.random.randn(30)
}, index=pd.date_range('2024-01-01', periods=30, freq='D'))

# Resample to weekly (sum)
weekly = ts.resample('W').sum()
print("\n   Daily to Weekly (sum):")
print(weekly.head())

# Resample to weekly (mean)
weekly_mean = ts.resample('W').mean()
print("\n   Daily to Weekly (mean):")
print(weekly_mean.head())

# 7. Shifting (lag/lead)
print("\n7. Shift Time Series:")

shift_data = pd.DataFrame({
    'value': [10, 20, 30, 40, 50]
}, index=pd.date_range('2024-01-01', periods=5, freq='D'))

shift_data['prev_day'] = shift_data['value'].shift(1)  # Lag
shift_data['next_day'] = shift_data['value'].shift(-1)  # Lead

print(shift_data)

# 8. Calculate changes
print("\n8. Calculate Changes:")

shift_data['daily_change'] = shift_data['value'].diff()
shift_data['pct_change'] = shift_data['value'].pct_change() * 100

print(shift_data)

# 9. Rolling window calculations
print("\n9. Rolling Window (Moving Average):")

rolling_data = pd.DataFrame({
    'value': [10, 15, 12, 18, 20, 17, 22, 25]
}, index=pd.date_range('2024-01-01', periods=8, freq='D'))

rolling_data['ma_3'] = rolling_data['value'].rolling(window=3).mean()
rolling_data['ma_5'] = rolling_data['value'].rolling(window=5).mean()

print(rolling_data)

# 10. Time zones
print("\n10. Time Zones:")

# Create timezone-aware datetime
utc_time = pd.date_range('2024-01-01', periods=3, freq='h', tz='UTC')
print("\n   UTC times:")
print(utc_time)

# Convert to different timezone
ny_time = utc_time.tz_convert('America/New_York')
print("\n   New York times:")
print(ny_time)

# 11. Date arithmetic
print("\n11. Date Arithmetic:")

start_date = pd.Timestamp('2024-01-01')
print(f"\n   Start: {start_date}")
print(f"   + 10 days: {start_date + pd.Timedelta(days=10)}")
print(f"   + 2 weeks: {start_date + pd.Timedelta(weeks=2)}")
print(f"   + 3 months: {start_date + pd.DateOffset(months=3)}")

# 12. Period objects
print("\n12. Period Objects:")

periods = pd.period_range('2024-01', periods=12, freq='M')
print("\n   Monthly periods:")
print(periods[:6])

# 13. Business day operations
print("\n13. Business Day Calculations:")

# Add 10 business days
from pandas.tseries.offsets import BDay
biz_date = pd.Timestamp('2024-01-01') + 10 * BDay()
print(f"\n   10 business days from 2024-01-01: {biz_date}")

# 14. Practical: Time series analysis
print("\n14. Practical: Sales Analysis:")

# Generate sample sales data
sales_ts = pd.DataFrame({
    'sales': np.random.randint(1000, 2000, 90)
}, index=pd.date_range('2024-01-01', periods=90, freq='D'))

# Add features
sales_ts['day_of_week'] = sales_ts.index.day_name()
sales_ts['is_weekend'] = sales_ts.index.dayofweek >= 5
sales_ts['week'] = sales_ts.index.isocalendar().week

# Weekly summary
weekly_summary = sales_ts.resample('W').agg({
    'sales': ['sum', 'mean', 'max']
})

print("\n   Weekly sales summary:")
print(weekly_summary.head())

print("\n   ✓ Time series handling complete!")

## 10. Working with MultiIndex

**What**: DataFrames with hierarchical (multi-level) row or column indices.

**Why**: 
- Represent higher-dimensional data
- Organize complex datasets efficiently
- Enable advanced grouping and aggregation
- Simplify pivot and cross-tabulation

**When to Use**:
- Grouped time series data
- Multi-dimensional analysis
- Hierarchical data structures
- Panel data (cross-section + time)

**Key Operations**:
- **set_index()**: Create MultiIndex
- **xs()**: Cross-section selection
- **swaplevel()**: Swap index levels
- **unstack()**: Pivot level to columns
- **reset_index()**: Flatten MultiIndex
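The key operations above fit together as a round trip: flat columns become index levels, get sliced or pivoted, and can be flattened again. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical flat table with one row per (region, product) pair.
flat = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Phone'],
    'sales': [1000, 800, 1200, 900],
})

# set_index: turn two columns into a two-level row index.
indexed = flat.set_index(['region', 'product'])

# xs: take a cross-section on the inner level without slicing the outer one.
laptops = indexed.xs('Laptop', level='product')

# unstack: move the 'product' level from the rows to the columns.
wide = indexed['sales'].unstack('product')

# swaplevel: reverse the level order, then sort so lookups stay efficient.
swapped = indexed.swaplevel().sort_index()

# reset_index: flatten back to ordinary columns.
round_trip = indexed.reset_index()

print(laptops)
print(wide)
```

Sorting after `swaplevel()` matters because slicing a MultiIndex generally expects a sorted index.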

In [None]:
# ============================================
# 10. WORKING WITH MULTIINDEX
# ============================================

print("=" * 80)
print("WORKING WITH MULTIINDEX")
print("=" * 80)

# 1. Create MultiIndex from lists
print("\n1. Create MultiIndex from Lists:")

arrays = [
    ['A', 'A', 'B', 'B'],
    ['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=['letter', 'number'])
df_multi = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)

print(df_multi)

# 2. Create MultiIndex from tuples
print("\n2. Create MultiIndex from Tuples:")

tuples = [('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['letter', 'number'])
df_tuples = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)

print(df_tuples)

# 3. Create MultiIndex from product
print("\n3. Create MultiIndex from Product:")

# Cartesian product
index = pd.MultiIndex.from_product(
    [['A', 'B'], ['one', 'two', 'three']],
    names=['letter', 'number']
)
df_product = pd.DataFrame({'value': range(6)}, index=index)

print(df_product)

# 4. Set MultiIndex from columns
print("\n4. Set MultiIndex from Columns:")

df_simple = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Phone'],
    'sales': [1000, 800, 1200, 900]
})

df_indexed = df_simple.set_index(['region', 'product'])
print(df_indexed)

# 5. Selecting with MultiIndex
print("\n5. Select with MultiIndex:")

# Select outer level
print("\n   All North region:")
print(df_indexed.loc['North'])

# Select specific tuple
print("\n   North + Laptop:")
print(df_indexed.loc[('North', 'Laptop')])

# Select with slice
print("\n   North, all products:")
print(df_indexed.loc[('North', slice(None)), :])

# 6. Cross-section (xs)
print("\n6. Cross-Section Selection:")

# Get all 'Laptop' across regions
laptops = df_indexed.xs('Laptop', level='product')
print("\n   All Laptop sales:")
print(laptops)

# 7. Swap levels
print("\n7. Swap Index Levels:")

swapped = df_indexed.swaplevel('region', 'product')
print(swapped)

# Sort by new index
print("\n   Sorted by product then region:")
print(swapped.sort_index())

# 8. Unstack (pivot level to columns)
print("\n8. Unstack MultiIndex:")

unstacked = df_indexed.unstack(level='product')
print("\n   Products as columns:")
print(unstacked)

# Unstack different level
unstacked_region = df_indexed.unstack(level='region')
print("\n   Regions as columns:")
print(unstacked_region)

# 9. Stack (columns to index)
print("\n9. Stack Back:")

stacked = unstacked.stack()
print(stacked)

# 10. Reset index (flatten)
print("\n10. Reset Index to Columns:")

flattened = df_indexed.reset_index()
print(flattened)

# 11. Aggregation with MultiIndex
print("\n11. Aggregation with MultiIndex:")

# Create sample hierarchical data
multi_data = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone', 'Tablet'],
    'month': ['Jan', 'Jan', 'Jan', 'Jan', 'Jan', 'Jan'],
    'sales': [1000, 800, 500, 1200, 900, 550]
}).set_index(['region', 'product', 'month'])

print(multi_data)

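# Aggregate at different levels
# (sum(level=...) / mean(level=...) were removed in pandas 2.0; use groupby(level=...))
print("\n   Sum by region:")
print(multi_data.groupby(level='region').sum())

print("\n   Mean by product:")
print(multi_data.groupby(level='product').mean())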

# 12. Sorting MultiIndex
print("\n12. Sort MultiIndex:")

# Sort by all levels
sorted_multi = multi_data.sort_index()
print("\n   Sorted by all levels:")
print(sorted_multi)

# Sort by specific level
sorted_level = multi_data.sort_index(level='product')
print("\n   Sorted by product level:")
print(sorted_level)

# 13. MultiIndex columns
print("\n13. MultiIndex Columns:")

# Create DataFrame with MultiIndex columns
col_index = pd.MultiIndex.from_product(
    [['Sales', 'Quantity'], ['Q1', 'Q2']],
    names=['Metric', 'Quarter']
)

df_multi_col = pd.DataFrame(
    np.random.randint(100, 200, (3, 4)),
    index=['Product A', 'Product B', 'Product C'],
    columns=col_index
)

print(df_multi_col)

# Select specific column level
print("\n   All Q1 data:")
print(df_multi_col.xs('Q1', level='Quarter', axis=1))

# 14. Practical: Hierarchical time series
print("\n14. Practical: Regional Sales Time Series:")

# Create hierarchical time series
dates = pd.date_range('2024-01-01', periods=6, freq='MS')
regions = ['North', 'South']
products = ['Laptop', 'Phone']

index = pd.MultiIndex.from_product(
    [dates, regions, products],
    names=['date', 'region', 'product']
)

hierarchical_ts = pd.DataFrame({
    'sales': np.random.randint(500, 1500, len(index))
}, index=index)

print(hierarchical_ts.head(10))

# Aggregate by month and region (groupby(level=...) replaces the removed sum(level=...))
monthly_regional = hierarchical_ts.groupby(level=['date', 'region']).sum()
print("\n   Monthly sales by region:")
print(monthly_regional.head())

print("\n   ✓ MultiIndex operations complete!")

## 11. String Operations

**What**: Text manipulation and pattern matching using the pandas `.str` accessor.

**Why**: 
- Clean and standardize text data
- Extract information from strings
- Parse unstructured text
- Prepare text for analysis

**When to Use**:
- Name/address cleaning
- Extracting codes or IDs
- Text categorization
- URL/email parsing

**Key Operations**:
- **str.lower/upper()**: Case conversion
- **str.contains()**: Pattern matching
- **str.extract()**: Regex extraction
- **str.split()**: Split strings
- **str.replace()**: Replace patterns
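
The operations above can be sketched on a few made-up values before the full walkthrough. The series `contacts`, `codes`, and `phones` are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical inputs, for illustration only
contacts = pd.Series(["  Alice Smith ", "bob JONES", "Carol White"])
codes = pd.Series(["ABC-123", "XYZ-456", "DEF-789"])
phones = pd.Series(["(123) 456-7890"])

names = contacts.str.strip().str.title()               # trim, then normalise case
has_smith = names.str.contains("smith", case=False)    # boolean mask, case-insensitive
parts = codes.str.extract(r"([A-Z]+)-(\d+)")           # one column per capture group
first_names = names.str.split(" ").str[0]              # first token after splitting
digits_only = phones.str.replace(r"\D", "", regex=True)  # drop every non-digit

print(names.tolist())
print(parts)
```

Each `.str` method is vectorised over the whole Series and passes missing values through as `NaN`, which is usually preferable to a Python-level loop.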

In [None]:
# ============================================
# 11. STRING OPERATIONS
# ============================================

print("=" * 80)
print("STRING OPERATIONS")
print("=" * 80)

# Create sample text data
string_data = pd.DataFrame({
    'name': ['Alice Smith', 'Bob JONES', '  charlie brown', 'Diana Prince  '],
    'email': ['alice@example.com', 'BOB@TEST.COM', 'charlie@demo.org', 'diana@company.net'],
    'phone': ['(123) 456-7890', '987-654-3210', '555.123.4567', '800-555-1234'],
    'code': ['ABC-123', 'XYZ-456', 'DEF-789', 'GHI-012']
})

print("\n0. Sample Text Data:")
print(string_data)

# 1. Case conversion
print("\n1. Case Conversion:")

string_data['name_lower'] = string_data['name'].str.lower()
string_data['name_upper'] = string_data['name'].str.upper()
string_data['name_title'] = string_data['name'].str.title()

print(string_data[['name', 'name_lower', 'name_upper', 'name_title']].head(2))

# 2. Strip whitespace
print("\n2. Strip Whitespace:")

cleaned_names = string_data['name'].str.strip()
print("\n   Before strip:")
print(string_data['name'].tolist())
print("\n   After strip:")
print(cleaned_names.tolist())

# 3. Contains (pattern matching)
print("\n3. String Contains:")

has_smith = string_data['name'].str.contains('Smith', case=False)
print("\n   Names containing 'Smith':")
print(string_data[has_smith]['name'])

# Regex pattern: three consecutive digits anywhere in the string
has_numbers = string_data['phone'].str.contains(r'\d{3}', regex=True)
print("\n   All phones contain 3 consecutive digits:", has_numbers.all())

# 4. Starts with / Ends with
print("\n4. Starts With / Ends With:")

starts_with = string_data['code'].str.startswith('ABC')
print("\n   Codes starting with ABC:")
print(string_data[starts_with]['code'])

ends_with = string_data['email'].str.endswith('.com')
print("\n   .com emails:")
print(string_data[ends_with]['email'])

# 5. Split strings
print("\n5. Split Strings:")

# Split by space
name_parts = string_data['name'].str.strip().str.split(' ', expand=True)
name_parts.columns = ['first_name', 'last_name']
print(name_parts)

# Split email
email_parts = string_data['email'].str.split('@', expand=True)
email_parts.columns = ['username', 'domain']
print("\n   Email parts:")
print(email_parts)

# 6. Extract with regex
print("\n6. Extract with Regex:")

# Extract area code from phone
area_codes = string_data['phone'].str.extract(r'(\d{3})')
print("\n   Area codes:")
print(area_codes)

# Extract code parts
code_parts = string_data['code'].str.extract(r'([A-Z]+)-(\d+)')
code_parts.columns = ['letters', 'numbers']
print("\n   Code components:")
print(code_parts)

# 7. Replace strings
print("\n7. Replace Strings:")

# Remove parentheses, dots, whitespace, and dashes with a regex character class
standardized_phone = string_data['phone'].str.replace(r'[().\s-]', '', regex=True)
print("\n   Standardized phones:")
print(standardized_phone)

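# Replace with pattern: keep the first two characters of the username and mask the rest.
# The ^ anchor limits the match to the username; without it the domain is masked as well.
masked_email = string_data['email'].str.replace(r'^(.{2})[^@]*', r'\1***', regex=True)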
print("\n   Masked emails:")
print(masked_email)

# 8. Length
print("\n8. String Length:")

string_data['name_length'] = string_data['name'].str.len()
print(string_data[['name', 'name_length']])

# 9. Slice strings
print("\n9. Slice Strings:")

# First 3 characters
first_three = string_data['name'].str[:3]
print("\n   First 3 chars:", first_three.tolist())

# Last 3 characters
last_three = string_data['code'].str[-3:]
print("   Last 3 chars:", last_three.tolist())

# 10. Concatenate strings
print("\n10. Concatenate Strings:")

string_data['full_contact'] = (
    string_data['name'].str.strip() + 
    ' <' + string_data['email'] + '>'
)
print(string_data['full_contact'])

# 11. Find and index
print("\n11. Find Position:")

at_position = string_data['email'].str.find('@')
print("\n   Position of @ in email:")
print(at_position)

# 12. Count occurrences
print("\n12. Count Pattern Occurrences:")

dash_count = string_data['phone'].str.count('-')
print("\n   Number of dashes in phone:")
print(dash_count)

# 13. Get specific items after split
print("\n13. Get Specific Split Items:")

# Get first name only
first_names = string_data['name'].str.strip().str.split().str[0]
print("\n   First names:")
print(first_names)

# Get domain
domains = string_data['email'].str.split('@').str[1]
print("\n   Email domains:")
print(domains)

# 14. Padding
print("\n14. Pad Strings:")

# Pad with zeros
padded_codes = code_parts['numbers'].str.pad(width=5, fillchar='0')
print("\n   Padded codes:")
print(padded_codes)

# 15. Practical: Clean and standardize data
print("\n15. Practical: Data Cleaning Pipeline:")

# Comprehensive cleaning
cleaned_data = string_data.copy()

# Clean names
cleaned_data['name_clean'] = (
    cleaned_data['name']
    .str.strip()
    .str.title()
)

# Standardize emails
cleaned_data['email_clean'] = (
    cleaned_data['email']
    .str.lower()
    .str.strip()
)

# Standardize phones
cleaned_data['phone_clean'] = (
    cleaned_data['phone']
    .str.replace(r'[^0-9]', '', regex=True)
)

# Extract components
cleaned_data['area_code'] = cleaned_data['phone_clean'].str[:3]
cleaned_data['code_prefix'] = cleaned_data['code'].str.split('-').str[0]

print(cleaned_data[['name_clean', 'email_clean', 'phone_clean', 'area_code', 'code_prefix']])

print("\n   ✓ String operations complete!")

## 12. Binning and Discretization

**What**: Converting continuous numerical data into categorical bins or discrete intervals.

**Why**: 
- Simplify continuous variables
- Create categorical features
- Handle outliers
- Prepare data for analysis

**When to Use**:
- Age groups from ages
- Price ranges from prices
- Risk categories from scores
- Quantile-based grouping

**Key Operations**:
- **pd.cut()**: Bin into fixed intervals
- **pd.qcut()**: Bin into quantiles
- **Custom bins**: Specify bin edges
- **Labels**: Assign category names
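
To make the difference between `pd.cut()` and `pd.qcut()` concrete, here is a minimal sketch on six made-up ages (the values and label names are assumptions chosen so the result is easy to check by hand). `pd.cut()` uses the edges you supply, while `pd.qcut()` derives the edges from the data so that each bin holds roughly the same number of rows.

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([22, 30, 31, 50, 51, 75])

# Fixed edges; with the default right=True the bins are (0, 30], (30, 50], (50, 100]
groups = pd.cut(ages, bins=[0, 30, 50, 100], labels=["Young", "Adult", "Senior"])

# Quantile edges; q=2 splits at the median (40.5 here)
halves = pd.qcut(ages, q=2, labels=["lower", "upper"])

print(pd.DataFrame({"age": ages, "group": groups, "half": halves}))
```

Note that 30 and 50 fall into the lower bin of each pair because the intervals are closed on the right by default.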

In [None]:
# ============================================
# 12. BINNING AND DISCRETIZATION
# ============================================

print("=" * 80)
print("BINNING AND DISCRETIZATION")
print("=" * 80)

# Create sample data
binning_data = pd.DataFrame({
    'age': [22, 25, 30, 35, 42, 48, 55, 62, 68, 75],
    'income': [35000, 45000, 52000, 68000, 75000, 82000, 95000, 105000, 88000, 92000],
    'score': [45, 67, 72, 89, 56, 78, 91, 63, 82, 94]
})

print("\n0. Sample Data:")
print(binning_data)

# 1. Simple binning with cut()
print("\n1. Simple Binning (cut):")

# Create 3 equal-width bins
age_bins = pd.cut(binning_data['age'], bins=3)
print("\n   Age bins:")
print(age_bins)

# Count in each bin
print("\n   Counts per bin:")
print(age_bins.value_counts().sort_index())

# 2. Custom bin edges
print("\n2. Custom Bin Edges:")

# Define specific age groups
age_groups = pd.cut(
    binning_data['age'],
    bins=[0, 30, 50, 70, 100],
    labels=['Young', 'Adult', 'Middle-aged', 'Senior']
)

binning_data['age_group'] = age_groups
print(binning_data[['age', 'age_group']])

# 3. Right vs Left inclusive
print("\n3. Right vs Left Inclusive:")

# Default: right=True (each interval is closed on the right, e.g. (a, b])
right_inclusive = pd.cut(binning_data['age'], bins=3, right=True)
print("\n   Right inclusive, intervals of the form (a, b]:")
print(right_inclusive[:3])

# right=False: each interval is closed on the left, e.g. [a, b)
left_inclusive = pd.cut(binning_data['age'], bins=3, right=False)
print("\n   Left inclusive, intervals of the form [a, b):")
print(left_inclusive[:3])

# 4. Quantile-based binning (qcut)
print("\n4. Quantile-based Binning (qcut):")

# Divide into quartiles
income_quartiles = pd.qcut(binning_data['income'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
binning_data['income_quartile'] = income_quartiles

print(binning_data[['income', 'income_quartile']])

print("\n   Counts per quartile:")
print(income_quartiles.value_counts().sort_index())

# 5. Quantiles with duplicates
print("\n5. Quantile Binning with Duplicates:")

# Handle duplicate edges
try:
    score_deciles = pd.qcut(binning_data['score'], q=10, duplicates='drop')
    print("   Score deciles created successfully")
except Exception as e:
    print(f"   Error: {e}")

# 6. Get bin information
print("\n6. Bin Information:")

# retbins=True returns a (binned values, bin edges) tuple
age_binned, age_edges = pd.cut(binning_data['age'], bins=4, retbins=True)
print("\n   Bin edges:")
print(age_edges)

# 7. Precision control
print("\n7. Control Bin Precision:")

# Limit decimal places in bin labels
precise_bins = pd.cut(binning_data['income'], bins=3, precision=0)
print("\n   Income bins (no decimals):")
print(precise_bins)

# 8. Include lowest
print("\n8. Include Lowest Value:")

# Include minimum value
inclusive_bins = pd.cut(
    binning_data['score'],
    bins=[40, 60, 80, 100],
    include_lowest=True,
    labels=['Low', 'Medium', 'High']
)
print("\n   Score categories:")
print(inclusive_bins)

# 9. Binning with inf bounds
print("\n9. Binning with Infinity:")

# Open-ended bins
income_categories = pd.cut(
    binning_data['income'],
    bins=[0, 50000, 75000, float('inf')],
    labels=['Low', 'Medium', 'High']
)
binning_data['income_category'] = income_categories

print(binning_data[['income', 'income_category']])

# 10. Ordered categories
print("\n10. Ordered Categories:")

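# pd.cut with labels already returns an ordered categorical (ordered=True by default);
# wrapping it in pd.Categorical(..., ordered=True) just makes the ordering explicit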
score_levels = pd.cut(
    binning_data['score'],
    bins=[0, 60, 75, 90, 100],
    labels=['Fail', 'Pass', 'Good', 'Excellent']
)

score_cat = pd.Categorical(score_levels, ordered=True)
print("\n   Ordered score levels:")
print(score_cat)
print(f"   Is ordered: {score_cat.ordered}")

# 11. Practical: Risk scoring
print("\n11. Practical: Risk Scoring:")

# Create risk categories
binning_data['risk_score'] = binning_data['score']

binning_data['risk_level'] = pd.cut(
    binning_data['risk_score'],
    bins=[0, 50, 70, 85, 100],
    labels=['High Risk', 'Medium Risk', 'Low Risk', 'Very Low Risk'],
    include_lowest=True
)

print(binning_data[['score', 'risk_level']])

# 12. Multiple binning strategies
print("\n12. Compare Binning Methods:")

comparison = pd.DataFrame({
    'value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})

# Equal width bins
comparison['equal_width'] = pd.cut(comparison['value'], bins=3, labels=['Low', 'Mid', 'High'])

# Equal frequency bins (quantiles)
comparison['equal_freq'] = pd.qcut(comparison['value'], q=3, labels=['Low', 'Mid', 'High'])

print(comparison)

print("\n   Equal width distribution:")
print(comparison['equal_width'].value_counts().sort_index())

print("\n   Equal frequency distribution:")
print(comparison['equal_freq'].value_counts().sort_index())

# 13. Binning summary statistics
print("\n13. Statistics by Bin:")

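# Group by bins and calculate statistics
# (observed=False keeps empty bins; passing it explicitly avoids the pandas 2.x FutureWarning)
bin_stats = binning_data.groupby('age_group', observed=False).agg({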
    'age': ['min', 'max', 'count'],
    'income': 'mean',
    'score': 'mean'
}).round(2)

print(bin_stats)

print("\n   ✓ Binning and discretization complete!")

## 13. Window Functions (Rolling, Expanding, EWM)

**What**: Calculations over moving or expanding windows of data.

**Why**: 
- Smooth noisy time series data
- Calculate moving averages and trends
- Identify patterns over time
- Remove short-term fluctuations

**When to Use**:
- Stock price analysis (moving averages)
- Weather data smoothing
- Sales trend analysis
- Anomaly detection

**Key Operations**:
- **rolling()**: Fixed-size moving window
- **expanding()**: Cumulative window
- **ewm()**: Exponentially weighted window
- **Window functions**: sum, mean, std, min, max
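
The three window types can be compared on a tiny series whose results are easy to verify by hand. This is a sketch under stated assumptions: the input values are made up, and `adjust=False` is chosen here only because it gives the simple recursive form `y[t] = (1 - alpha) * y[t-1] + alpha * x[t]` with `alpha = 2 / (span + 1)`; the pandas default is `adjust=True`.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative values

# Fixed-size window: the first window-1 results are NaN
rolling_mean = s.rolling(window=3).mean()

# Expanding window: everything seen so far (a running total here)
expanding_sum = s.expanding().sum()

# Exponentially weighted: span=3 gives alpha = 0.5
ewm_mean = s.ewm(span=3, adjust=False).mean()

print(pd.DataFrame({
    "value": s,
    "rolling_mean": rolling_mean,
    "expanding_sum": expanding_sum,
    "ewm_mean": ewm_mean,
}))
```

Unlike `rolling()`, the exponentially weighted mean has no leading `NaN` values, because every observation so far contributes with a decaying weight.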

In [None]:
# ============================================
# 13. WINDOW FUNCTIONS (ROLLING, EXPANDING, EWM)
# ============================================

print("=" * 80)
print("WINDOW FUNCTIONS")
print("=" * 80)

# Create sample time series data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=20, freq='D')
window_data = pd.DataFrame({
    'value': np.random.randint(80, 120, 20) + np.random.randn(20) * 5
}, index=dates)

print("\n0. Sample Time Series Data:")
print(window_data.head(10))

# 1. Rolling window - Mean
print("\n1. Rolling Mean (Moving Average):")

window_data['rolling_mean_3'] = window_data['value'].rolling(window=3).mean()
window_data['rolling_mean_7'] = window_data['value'].rolling(window=7).mean()

print(window_data[['value', 'rolling_mean_3', 'rolling_mean_7']].head(10))

# 2. Rolling window - Other aggregations
print("\n2. Rolling Aggregations:")

window_data['rolling_std'] = window_data['value'].rolling(window=5).std()
window_data['rolling_min'] = window_data['value'].rolling(window=5).min()
window_data['rolling_max'] = window_data['value'].rolling(window=5).max()

print(window_data[['value', 'rolling_std', 'rolling_min', 'rolling_max']].head(10))

# 3. Rolling with min_periods
print("\n3. Rolling with Minimum Periods:")

# Calculate even when window not full
early_rolling = window_data['value'].rolling(window=5, min_periods=1).mean()
print("\n   First 7 values (min_periods=1):")
print(early_rolling.head(7))

# 4. Centered rolling window
print("\n4. Centered Rolling Window:")

# Center the window (instead of trailing)
centered = window_data['value'].rolling(window=5, center=True).mean()
print("\n   Centered vs trailing:")
comparison = pd.DataFrame({
    'value': window_data['value'],
    'trailing': window_data['value'].rolling(5).mean(),
    'centered': centered
})
print(comparison.head(10))

# 5. Rolling sum
print("\n5. Rolling Sum:")

window_data['rolling_sum_3'] = window_data['value'].rolling(window=3).sum()
print(window_data[['value', 'rolling_sum_3']].head(7))

# 6. Expanding window (cumulative)
print("\n6. Expanding Window:")

window_data['expanding_mean'] = window_data['value'].expanding().mean()
window_data['expanding_sum'] = window_data['value'].expanding().sum()

print(window_data[['value', 'expanding_mean', 'expanding_sum']].head(10))

# 7. Expanding with min_periods
print("\n7. Expanding with Min Periods:")

expanding_min = window_data['value'].expanding(min_periods=3).mean()
print("\n   First 5 values (min_periods=3):")
print(expanding_min.head(5))

# 8. Exponentially Weighted Moving Average
print("\n8. Exponentially Weighted Moving (EWM):")

window_data['ewm_mean'] = window_data['value'].ewm(span=5).mean()
print(window_data[['value', 'ewm_mean']].head(10))

# 9. EWM with different parameters
print("\n9. EWM Parameters:")

# Different spans
window_data['ewm_short'] = window_data['value'].ewm(span=3).mean()
window_data['ewm_long'] = window_data['value'].ewm(span=10).mean()

print(window_data[['value', 'ewm_short', 'ewm_long']].head(10))

# 10. Rolling correlation
print("\n10. Rolling Correlation:")

# Create second series
window_data['value2'] = window_data['value'] + np.random.randn(20) * 10

# Calculate rolling correlation
rolling_corr = window_data['value'].rolling(window=7).corr(window_data['value2'])
print("\n   Rolling 7-day correlation:")
print(rolling_corr.head(10))

# 11. Rolling apply custom function
print("\n11. Rolling Apply Custom Function:")

# Custom function: range (max - min)
def window_range(x):
    return x.max() - x.min()

window_data['rolling_range'] = window_data['value'].rolling(window=5).apply(window_range)
print(window_data[['value', 'rolling_range']].head(10))

# 12. Multiple rolling aggregations
print("\n12. Multiple Rolling Aggregations:")

rolling_stats = window_data['value'].rolling(window=7).agg(['mean', 'std', 'min', 'max'])
print(rolling_stats.head(10))

# 13. Window with datetime offset
print("\n13. Rolling Window with Time Offset:")

# 7-day rolling window
time_window = window_data['value'].rolling('7D').mean()
print("\n   7-day time-based window:")
print(time_window.head(10))

# 14. Practical: Moving average crossover
print("\n14. Practical: Moving Average Crossover Strategy:")

# Common in trading: short-term vs long-term MA
window_data['ma_short'] = window_data['value'].rolling(window=3).mean()
window_data['ma_long'] = window_data['value'].rolling(window=7).mean()

# Signal: 1 while the short MA is above the long MA, otherwise 0
# (rows where either MA is still NaN compare as False and get 0)
window_data['signal'] = (window_data['ma_short'] > window_data['ma_long']).astype(int)

print(window_data[['value', 'ma_short', 'ma_long', 'signal']].tail(10))

# 15. Practical: Smoothing noisy data
print("\n15. Practical: Data Smoothing Comparison:")

# Create noisy data
noisy_signal = pd.Series(
    [100 + i*2 + np.random.randn()*10 for i in range(20)],
    index=dates
)

smoothing = pd.DataFrame({
    'original': noisy_signal,
    'rolling_3': noisy_signal.rolling(3).mean(),
    'rolling_7': noisy_signal.rolling(7).mean(),
    'ewm_5': noisy_signal.ewm(span=5).mean()
})

print(smoothing.head(10))

print("\n   Different smoothing techniques:")
print(f"   Original std: {smoothing['original'].std():.2f}")
print(f"   Rolling-3 std: {smoothing['rolling_3'].std():.2f}")
print(f"   Rolling-7 std: {smoothing['rolling_7'].std():.2f}")
print(f"   EWM-5 std: {smoothing['ewm_5'].std():.2f}")

print("\n   ✓ Window functions complete!")

## 14. Data Sampling and Resampling

**What**: Selecting subsets of data or changing time series frequency.

**Why**: 
- Create train/test splits
- Balance datasets
- Change time granularity
- Reduce data size for testing

**When to Use**:
- Machine learning data preparation
- Time series aggregation/disaggregation
- Statistical sampling
- Performance testing with smaller datasets

**Key Operations**:
- **sample()**: Random sampling
- **resample()**: Time series frequency conversion
- **Stratified sampling**: Maintain class proportions
- **Bootstrapping**: Sample with replacement
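
The four operations above can be sketched on small, made-up data (the `df` frame, its 80/20 label split, and the `daily` series are assumptions for illustration). `GroupBy.sample()` is used for the stratified draw and requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 80 rows of 'a', 20 rows of 'b'
df = pd.DataFrame({
    "label": ["a"] * 80 + ["b"] * 20,
    "value": range(100),
})

simple = df.sample(n=10, random_state=0)                      # plain random sample
strat = df.groupby("label").sample(frac=0.1, random_state=0)  # 10% from each class
boot = df.sample(n=len(df), replace=True, random_state=0)     # bootstrap resample

# Two weeks of daily values starting on a Monday, downsampled to weekly totals
daily = pd.Series(range(14), index=pd.date_range("2024-01-01", periods=14, freq="D"))
weekly = daily.resample("W").sum()

print(strat["label"].value_counts())
print(weekly)
```

The stratified sample keeps the original 80/20 class ratio, which a plain `sample()` only does on average.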

In [None]:
# ============================================
# 14. DATA SAMPLING AND RESAMPLING
# ============================================

print("=" * 80)
print("DATA SAMPLING AND RESAMPLING")
print("=" * 80)

# Create sample dataset
sampling_data = pd.DataFrame({
    'id': range(1, 101),
    'category': np.random.choice(['A', 'B', 'C'], 100),
    'value': np.random.randint(1, 100, 100),
    'score': np.random.randn(100)
})

print("\n0. Sample Dataset:")
print(sampling_data.head())
print(f"   Shape: {sampling_data.shape}")

# 1. Random sampling
print("\n1. Random Sampling:")

# Sample 10 random rows
random_sample = sampling_data.sample(n=10)
print(random_sample)

# 2. Fraction-based sampling
print("\n2. Fraction-based Sampling:")

# Sample 20% of data
frac_sample = sampling_data.sample(frac=0.2)
print(f"\n   20% sample size: {len(frac_sample)} rows")
print(frac_sample.head())

# 3. Sampling with replacement
print("\n3. Sampling with Replacement:")

# Can get duplicate rows
bootstrap_sample = sampling_data.sample(n=10, replace=True)
print(bootstrap_sample)

# 4. Stratified sampling
print("\n4. Stratified Sampling (Maintain Proportions):")

# Check original distribution
print("\n   Original category distribution:")
print(sampling_data['category'].value_counts(normalize=True))

# Stratified sample: draw 20% from each category
# (GroupBy.sample requires pandas >= 1.1 and avoids the deprecated groupby.apply pattern)
stratified = sampling_data.groupby('category').sample(frac=0.2)

print(f"\n   Stratified sample size: {len(stratified)}")
print("\n   Stratified sample distribution:")
print(stratified['category'].value_counts(normalize=True))

# 5. Random state for reproducibility
print("\n5. Reproducible Sampling:")

sample1 = sampling_data.sample(n=5, random_state=42)
sample2 = sampling_data.sample(n=5, random_state=42)

print("\n   First sample:")
print(sample1['id'].tolist())
print("\n   Second sample (same random_state):")
print(sample2['id'].tolist())
print(f"\n   Identical: {sample1['id'].tolist() == sample2['id'].tolist()}")

# 6. Sampling by weights
print("\n6. Weighted Sampling:")

# Higher probability for higher values
weights = sampling_data['value'] / sampling_data['value'].sum()
weighted_sample = sampling_data.sample(n=10, weights=weights)

print("\n   Weighted sample (biased toward higher values):")
print(weighted_sample[['id', 'value']].sort_values('value', ascending=False))

# 7. Time series resampling - Downsampling
print("\n7. Time Series Resampling (Downsampling):")

# Create daily time series
ts_data = pd.DataFrame({
    'value': np.random.randint(50, 150, 30)
}, index=pd.date_range('2024-01-01', periods=30, freq='D'))

# Resample to weekly (mean)
weekly = ts_data.resample('W').mean()
print("\n   Daily to Weekly (mean):")
print(weekly.head())

# Resample to weekly (sum)
weekly_sum = ts_data.resample('W').sum()
print("\n   Daily to Weekly (sum):")
print(weekly_sum.head())

# 8. Time series resampling - Upsampling
print("\n8. Time Series Resampling (Upsampling):")

# Create monthly data
monthly_data = pd.DataFrame({
    'value': [100, 150, 200]
}, index=pd.date_range('2024-01-01', periods=3, freq='MS'))

# Upsample to daily (forward fill)
daily = monthly_data.resample('D').ffill()
print("\n   Monthly to Daily (forward fill):")
print(daily.head(10))

# Upsample with interpolation
daily_interp = monthly_data.resample('D').interpolate()
print("\n   Monthly to Daily (interpolation):")
print(daily_interp.head(10))

# 9. Resample with different aggregations
print("\n9. Multiple Aggregations in Resample:")

# Create detailed time series
detailed_ts = pd.DataFrame({
    'sales': np.random.randint(100, 200, 30),
    'quantity': np.random.randint(10, 50, 30)
}, index=pd.date_range('2024-01-01', periods=30, freq='D'))

# Resample with different aggregations per column
weekly_agg = detailed_ts.resample('W').agg({
    'sales': 'sum',
    'quantity': 'mean'
})

print(weekly_agg.head())

# 10. Resample with custom function
print("\n10. Resample with Custom Function:")

def price_range(x):
    return x.max() - x.min()

weekly_range = ts_data.resample('W').apply(price_range)
print("\n   Weekly value range:")
print(weekly_range.head())

# 11. Sampling for train/test split
print("\n11. Practical: Train/Test Split:")

# 80/20 split
train = sampling_data.sample(frac=0.8, random_state=42)
test = sampling_data.drop(train.index)

print(f"\n   Train size: {len(train)} ({len(train)/len(sampling_data)*100:.0f}%)")
print(f"   Test size: {len(test)} ({len(test)/len(sampling_data)*100:.0f}%)")

# Verify no overlap
print(f"   No overlap: {len(set(train.index) & set(test.index)) == 0}")

# 12. Bootstrap sampling
print("\n12. Bootstrap Sampling:")

# Multiple bootstrap samples
bootstrap_means = []
for i in range(1000):
    boot_sample = sampling_data['value'].sample(n=len(sampling_data), replace=True)
    bootstrap_means.append(boot_sample.mean())

bootstrap_means = pd.Series(bootstrap_means)
print(f"\n   Bootstrap mean estimate: {bootstrap_means.mean():.2f}")
print(f"   95% CI: [{bootstrap_means.quantile(0.025):.2f}, {bootstrap_means.quantile(0.975):.2f}]")
print(f"   Actual mean: {sampling_data['value'].mean():.2f}")

# 13. Cross-validation folds
print("\n13. Create Cross-Validation Folds:")

# Simple K-fold creation (contiguous folds, no shuffling)
n_folds = 5
fold_size = len(sampling_data) // n_folds
indices = sampling_data.index.tolist()

for fold in range(n_folds):
    test_idx = indices[fold * fold_size : (fold + 1) * fold_size]
    test_set = set(test_idx)  # set lookup keeps the filter linear in the number of rows
    train_idx = [i for i in indices if i not in test_set]
    
    print(f"   Fold {fold + 1}: Train={len(train_idx)}, Test={len(test_idx)}")

# 14. Time-based resampling periods
print("\n14. Different Resampling Periods:")

# Create hourly data
hourly_data = pd.DataFrame({
    'value': np.random.randint(50, 150, 24 * 7)  # 1 week of hourly data
}, index=pd.date_range('2024-01-01', periods=24*7, freq='h'))

print("\n   Original: Hourly data")
print(f"   Period: {hourly_data.index[0]} to {hourly_data.index[-1]}")

# Different frequencies
print("\n   Resample to 6-hour:")
print(hourly_data.resample('6h').mean().head())

print("\n   Resample to daily:")
print(hourly_data.resample('D').mean().head())

print("\n   Resample to business days:")
print(hourly_data.resample('B').mean().head())

print("\n   ✓ Sampling and resampling complete!")

## 15. Advanced Wrangling Techniques

**What**: Complex data manipulation techniques and best practices for efficient workflows.

**Why**: 
- Optimize performance
- Write cleaner, more maintainable code
- Handle complex data scenarios
- Build efficient data pipelines

**When to Use**:
- Large-scale data processing
- Production data pipelines
- Complex transformations
- Performance-critical applications

**Key Techniques**:
- **Method chaining**: Chain operations fluently
- **pipe()**: Custom transformation functions
- **assign()**: Add columns in chains
- **query()**: SQL-like filtering
- **eval()**: Fast expression evaluation
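
The techniques above combine naturally into a single pipeline. The following is a minimal sketch on a made-up `staff` table; the helper `add_bonus` and its `rate` parameter are hypothetical names introduced for this example. The helper returns a new DataFrame through `assign()` rather than modifying its input, which keeps each step of the chain free of side effects.

```python
import pandas as pd

# Hypothetical staff table, for illustration only
staff = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 30, 35, 28],
    "salary": [50000, 60000, 75000, 55000],
})

def add_bonus(df, rate):
    """Return a copy of df with a bonus column; the input is left untouched."""
    return df.assign(bonus=df["salary"] * rate)

result = (
    staff
    .query("age > 25")                                  # SQL-like row filter
    .pipe(add_bonus, rate=0.1)                          # custom step with an argument
    .assign(total=lambda d: d["salary"] + d["bonus"])   # column derived inside the chain
    .sort_values("total", ascending=False)
)

monthly = staff.eval("salary / 12")  # expression evaluated from a string

print(result)
```

Because no step mutates `staff`, the pipeline can be rerun or reordered without the original table changing underneath it.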

In [None]:
# ============================================
# 15. ADVANCED WRANGLING TECHNIQUES
# ============================================

print("=" * 80)
print("ADVANCED WRANGLING TECHNIQUES")
print("=" * 80)

# Create sample data
advanced_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000, 60000, 75000, 55000, 65000],
    'department': ['Sales', 'IT', 'Sales', 'IT', 'HR'],
    'performance': [8.5, 9.2, 7.8, 8.9, 9.0]
})

print("\n0. Sample Data:")
print(advanced_data)

# 1. Method chaining
print("\n1. Method Chaining:")

# Chain multiple operations
result = (advanced_data
          .query('age > 25')
          .assign(bonus=lambda x: x['salary'] * 0.1)
          .sort_values('bonus', ascending=False)
          .head(3))

print(result)

# 2. Using pipe() for custom transformations
print("\n2. Custom Transformations with pipe():")

def add_salary_category(df):
    # assign() returns a new DataFrame, so the caller's data is not mutated
    return df.assign(salary_category=pd.cut(
        df['salary'],
        bins=[0, 55000, 70000, 100000],
        labels=['Low', 'Medium', 'High']
    ))

def add_age_group(df):
    return df.assign(age_group=pd.cut(
        df['age'],
        bins=[0, 30, 40, 100],
        labels=['Young', 'Middle', 'Senior']
    ))

# Chain custom functions
piped_result = (advanced_data
                .pipe(add_salary_category)
                .pipe(add_age_group))

print(piped_result[['name', 'salary_category', 'age_group']])

# 3. assign() for adding columns
print("\n3. Adding Columns with assign():")

# Add multiple columns at once
assigned = advanced_data.assign(
    salary_k=lambda x: x['salary'] / 1000,
    bonus=lambda x: x['salary'] * 0.1,
    total_comp=lambda x: x['salary'] + x['bonus']  # Can reference newly created columns
)

print(assigned[['name', 'salary', 'salary_k', 'bonus', 'total_comp']].head())

# 4. query() for filtering
print("\n4. Query Method (SQL-like):")

# Simple query
high_performers = advanced_data.query('performance > 8.5')
print("\n   High performers:")
print(high_performers)

# Complex query
filtered = advanced_data.query('age > 27 and salary >= 60000')
print("\n   Age > 27 AND salary >= 60000:")
print(filtered)

# Query with variables
min_age = 30
dept = 'IT'
var_query = advanced_data.query('age >= @min_age and department == @dept')
print(f"\n   Age >= {min_age} in {dept}:")
print(var_query)

# 5. eval() for fast calculations
print("\n5. eval() for Fast Expression Evaluation:")

# Instead of: df['total'] = df['a'] + df['b'] - df['c']
calc_data = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [10, 20, 30, 40, 50],
    'c': [5, 10, 15, 20, 25]
})

calc_data['total'] = calc_data.eval('a + b - c')
print(calc_data)

# 6. explode() for list-like columns
print("\n6. Explode List-like Columns:")

list_data = pd.DataFrame({
    'customer': ['A', 'B', 'C'],
    'products': [['Laptop', 'Phone'], ['Tablet'], ['Laptop', 'Monitor', 'Keyboard']]
})

print("\n   Original:")
print(list_data)

exploded = list_data.explode('products')
print("\n   Exploded:")
print(exploded)

# 7. melt with multiple value columns
print("\n7. Advanced Melt:")

wide_data = pd.DataFrame({
    'product': ['A', 'B'],
    'Q1_sales': [100, 150],
    'Q1_units': [10, 15],
    'Q2_sales': [120, 160],
    'Q2_units': [12, 16]
})

# Melt only the sales columns (the *_units columns are left out of the result)
melted = wide_data.melt(
    id_vars='product',
    value_vars=['Q1_sales', 'Q2_sales'],
    var_name='quarter',
    value_name='sales'
)

print(melted)

# 8. Combining merge + groupby
print("\n8. Merge + GroupBy Pipeline:")

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer_id': [101, 102, 101, 103],
    'amount': [100, 150, 200, 120]
})

customers = pd.DataFrame({
    'customer_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie'],
    'tier': ['Gold', 'Silver', 'Gold']
})

# Complex pipeline
result = (orders
          .merge(customers, on='customer_id')
          .groupby('tier')
          .agg({
              'amount': ['sum', 'mean', 'count'],
              'customer_id': 'nunique'
          })
          .round(2))

print(result)
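
# A small sketch, assuming pandas >= 0.25: the dict-of-lists agg() above returns
# MultiIndex columns such as ('amount', 'sum'). Named aggregation yields flat,
# self-describing column names instead. `orders_demo` is an illustrative frame
# that mirrors what the orders/customers merge produces.
import pandas as pd

orders_demo = pd.DataFrame({
    'customer_id': [101, 102, 101, 103],
    'tier': ['Gold', 'Silver', 'Gold', 'Gold'],
    'amount': [100, 150, 200, 120]
})

tier_summary = (orders_demo
                .groupby('tier')
                .agg(total_amount=('amount', 'sum'),
                     mean_amount=('amount', 'mean'),
                     order_count=('amount', 'count'),
                     unique_customers=('customer_id', 'nunique'))
                .round(2))

print(tier_summary)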

# 9. Using where/mask for conditional assignment
print("\n9. Conditional Assignment with where/mask:")

data = pd.DataFrame({
    'value': [10, 20, 30, 40, 50]
})

# where: keep values where condition is True
data['capped_at_30'] = data['value'].where(data['value'] <= 30, 30)

# mask: replace values where condition is True
data['floored_at_20'] = data['value'].mask(data['value'] < 20, 20)

print(data)
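
# For simple caps and floors like the two above, Series.clip() gives the same
# result more directly. A minimal sketch on an illustrative frame (`clip_demo`):
import pandas as pd

clip_demo = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

clip_demo['capped_at_30'] = clip_demo['value'].clip(upper=30)   # values above 30 become 30
clip_demo['floored_at_20'] = clip_demo['value'].clip(lower=20)  # values below 20 become 20

print(clip_demo)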

# 10. Complex transformations with groupby.transform
print("\n10. GroupBy Transform for Complex Features:")

sales_data = pd.DataFrame({
    'product': ['A', 'A', 'B', 'B', 'A'],
    'sales': [100, 150, 200, 180, 120]
})

# Add group statistics (pct_of_total is each row's share of its own product's total)
sales_data['product_mean'] = sales_data.groupby('product')['sales'].transform('mean')
sales_data['diff_from_mean'] = sales_data['sales'] - sales_data['product_mean']
sales_data['pct_of_total'] = sales_data.groupby('product')['sales'].transform(
    lambda x: x / x.sum() * 100
)

print(sales_data)

# 11. pd.crosstab with normalization
print("\n11. Advanced CrossTab:")

ct_data = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South', 'North'],
    'product': ['A', 'A', 'B', 'B', 'A'],
    'sales': [100, 150, 200, 180, 120]
})

# Crosstab with values and normalization
crosstab = pd.crosstab(
    ct_data['region'],
    ct_data['product'],
    values=ct_data['sales'],
    aggfunc='sum',
    normalize='index',  # Row percentages
    margins=True
) * 100

print("\n   Sales percentage by region:")
print(crosstab.round(2))

# 12. Efficient string operations
print("\n12. Efficient String Vectorization:")

text_data = pd.DataFrame({
    'text': ['  Hello World  ', 'PYTHON pandas', 'Data Wrangling']
})

# Chain string operations
text_data['processed'] = (text_data['text']
                          .str.strip()
                          .str.lower()
                          .str.replace(' ', '_', regex=False))  # literal match; default differs across pandas versions

print(text_data)

# 13. Memory optimization
print("\n13. Memory Optimization:")

mem_data = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B'] * 20,
    'value': range(100)
})

print(f"\n   Original memory: {mem_data.memory_usage(deep=True).sum() / 1024:.2f} KB")

# Convert to category
mem_data['category'] = mem_data['category'].astype('category')

print(f"   After category: {mem_data.memory_usage(deep=True).sum() / 1024:.2f} KB")
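
# A hedged sketch of a second memory lever: downcasting numeric columns.
# pd.to_numeric(..., downcast='integer') picks the smallest integer dtype that
# can hold the values. `num_demo` is an illustrative frame; check value ranges
# on real data before downcasting, since later arithmetic can overflow.
import pandas as pd

num_demo = pd.DataFrame({'value': range(100)})  # stored as int64 by default

before_bytes = num_demo['value'].memory_usage(index=False)
num_demo['value'] = pd.to_numeric(num_demo['value'], downcast='integer')
after_bytes = num_demo['value'].memory_usage(index=False)

print(f"\n   Downcast dtype: {num_demo['value'].dtype}")
print(f"   Column bytes: {before_bytes} -> {after_bytes}")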

# 14. Practical: Complete data pipeline
print("\n14. Practical: Complete Data Pipeline:")

# Simulate raw data
raw_data = pd.DataFrame({
    'customer_name': ['  Alice Smith', 'BOB JONES  ', 'charlie brown'],
    'purchase_date': ['2024-01-15', '2024-02-20', '2024-01-30'],
    'amount': [100, 200, 150],
    'category': ['electronics', 'electronics', 'books']
})

# Complete pipeline
cleaned_data = (raw_data
                # Clean names
                .assign(customer_name=lambda x: x['customer_name'].str.strip().str.title())
                # Parse dates
                .assign(purchase_date=lambda x: pd.to_datetime(x['purchase_date']))
                # Add derived features
                .assign(
                    month=lambda x: x['purchase_date'].dt.month,
                    year=lambda x: x['purchase_date'].dt.year,
                    amount_category=lambda x: pd.cut(x['amount'], bins=[0, 100, 200, 1000], labels=['Low', 'Medium', 'High'])
                )
                # Optimize memory
                .assign(category=lambda x: x['category'].astype('category'))
                # Sort
                .sort_values('purchase_date')
                # Reset index
                .reset_index(drop=True))

print(cleaned_data)
print(f"\n   Data types:\n{cleaned_data.dtypes}")

# 15. Best practices summary
print("\n15. Best Practices Summary:")

best_practices = pd.DataFrame({
    'Practice': [
        'Use method chaining',
        'Vectorize operations',
        'Use query() for filtering',
        'Optimize dtypes (category)',
        'Use assign() over direct assignment',
        'Prefer apply() to explicit loops',
        'Use pipe() for reusable transforms',
        'Profile memory usage'
    ],
    'Benefit': [
        'Readable, maintainable code',
        'Often orders of magnitude faster than Python loops',
        'SQL-like readability',
        'Large memory savings for low-cardinality columns',
        'Works in method chains',
        'More concise; vectorize instead when speed matters',
        'Modular, testable code',
        'Identify bottlenecks'
    ]
})

print(best_practices.to_string(index=False))
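
# pipe() appears in the list above but is not demonstrated in this section.
# A minimal sketch with hypothetical helper functions (clean_names, add_month)
# and an illustrative frame (`pipe_demo`); each helper takes a DataFrame and
# returns a new one, so it can be tested on its own and reused across chains.
import pandas as pd


def clean_names(df, column):
    """Return a copy with whitespace stripped and title case applied to `column`."""
    return df.assign(**{column: df[column].str.strip().str.title()})


def add_month(df, date_column):
    """Return a copy with a `month` column derived from `date_column`."""
    dates = pd.to_datetime(df[date_column])
    return df.assign(month=dates.dt.month)


pipe_demo = pd.DataFrame({
    'customer_name': ['  alice smith', 'BOB JONES  '],
    'purchase_date': ['2024-01-15', '2024-02-20']
})

piped = (pipe_demo
         .pipe(clean_names, 'customer_name')
         .pipe(add_month, 'purchase_date'))

print(piped)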

print("\n" + "=" * 80)
print("✓ DATA WRANGLING LEARNING COMPLETE!")
print("=" * 80)
print("\nYou've learned 15 comprehensive data wrangling techniques:")
print("1. Data Loading and Inspection")
print("2. Filtering and Subsetting")
print("3. Sorting and Ranking")
print("4. Reshaping Data")
print("5. Merging and Joining DataFrames")
print("6. Concatenating Data")
print("7. Grouping and Aggregation")
print("8. Data Transformation (Apply, Map, Replace)")
print("9. Handling Time Series Data")
print("10. Working with MultiIndex")
print("11. String Operations")
print("12. Binning and Discretization")
print("13. Window Functions (Rolling, Expanding, EWM)")
print("14. Data Sampling and Resampling")
print("15. Advanced Wrangling Techniques")
print("\nNext steps: Practice these techniques on real datasets!")