# Pivoting Data

Reshaping data between wide and long formats is essential for different types of analyses and visualizations. While R's `tidyr::pivot_wider()` and `pivot_longer()` provide intuitive interfaces, pandas offers several methods for pivoting: `pivot()`, `pivot_table()`, `melt()`, and `stack()`/`unstack()`. This chapter shows how to achieve tidyverse-style data reshaping in pandas.

## Best Practices Summary

Quick reference for pivoting patterns:

| Task | R (tidyr) | Pandas |
|------|-----------|--------|
| Wide to long | `pivot_longer(df, cols, names_to, values_to)` | `df.melt(id_vars, value_vars, var_name, value_name)` |
| Long to wide | `pivot_wider(df, names_from, values_from)` | `df.pivot(index, columns, values)` |
| With aggregation | `pivot_wider(..., values_fn = sum)` | `df.pivot_table(..., aggfunc='sum')` |
| Multiple values | `pivot_wider(..., values_from = c(x, y))` | `df.pivot(..., values=['x', 'y'])` |
| Fill missing | `pivot_wider(..., values_fill = 0)` | `df.pivot_table(..., fill_value=0)` |
| Complex names | `pivot_longer(..., names_pattern = "(.*)_(.*)")` | Use `melt()` then `str.extract()` |

## Tips for Tidyverse Users

1. **Use `melt()` for `pivot_longer()`**: It's the most direct equivalent and very flexible.

2. **Choose `pivot()` vs `pivot_table()`**: Use `pivot()` for simple reshaping, `pivot_table()` when you need aggregation.

3. **Remember `reset_index()`**: After pivoting, you often need to reset the index to get regular columns back.

4. **Handle multi-level columns**: After pivoting multiple values, flatten column names for easier access.

5. **Chain operations**: Pivoting works well in method chains:
   ```python
   (df
    .melt(id_vars=['id'], var_name='metric', value_name='value')
    .query('value > 0')
    .pivot(index='id', columns='metric', values='value'))
   ```
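
Tips 3 and 4 are easier to see on a concrete frame. The sketch below uses a small made-up `long_df` (hypothetical data, including the `n_obs` column, not taken from this chapter's examples). It shows `reset_index()` restoring `id` as a column, `rename_axis(columns=None)` clearing the leftover `metric` label on the column axis, and one possible way to flatten two-level column names.

```python
import pandas as pd

# Hypothetical long-format data, for illustration only
long_df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'metric': ['height', 'weight', 'height', 'weight'],
    'value': [180, 75, 165, 60],
    'n_obs': [3, 3, 2, 2],
})

# Tip 3: reset the index, then drop the 'metric' label left on the column axis
tidy_wide = (
    long_df
    .pivot(index='id', columns='metric', values='value')
    .reset_index()
    .rename_axis(columns=None)
)
print(tidy_wide.columns.tolist())

# Tip 4: pivoting two value columns yields (value, metric) column tuples;
# join each tuple into a single name such as 'value_height'
multi = long_df.pivot(index='id', columns='metric', values=['value', 'n_obs'])
multi.columns = ['_'.join(col) for col in multi.columns]
flat_wide = multi.reset_index()
print(flat_wide.columns.tolist())
```

Joining with an underscore mirrors tidyr's default `names_sep = "_"`, but any separator works as long as it does not already occur in the level values.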

Pivoting data in pandas offers multiple approaches for different scenarios. While the syntax differs from tidyr, the concepts are similar, and pandas often provides more control over the reshaping process, especially when dealing with complex data structures or when aggregation is needed.

## Wide to Long Format (Melt)

Converting wide data to long format, similar to `pivot_longer()`:

In [1]:
import pandas as pd
import numpy as np

# Create wide format data
df_wide = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'jan_sales': [100, 150, 120],
    'feb_sales': [110, 140, 130],
    'mar_sales': [120, 160, 125]
})
df_wide

,name,age,jan_sales,feb_sales,mar_sales
0,Alice,25,100,110,120
1,Bob,30,150,140,160
2,Charlie,35,120,130,125


In [2]:
# Basic melt - pivot_longer equivalent
# R: pivot_longer(df, cols = jan_sales:mar_sales, names_to = "month", values_to = "sales")
df_long = df_wide.melt(
    id_vars=['name', 'age'],
    value_vars=['jan_sales', 'feb_sales', 'mar_sales'],
    var_name='month',
    value_name='sales'
)
df_long

,name,age,month,sales
0,Alice,25,jan_sales,100
1,Bob,30,jan_sales,150
2,Charlie,35,jan_sales,120
3,Alice,25,feb_sales,110
4,Bob,30,feb_sales,140
5,Charlie,35,feb_sales,130
6,Alice,25,mar_sales,120
7,Bob,30,mar_sales,160
8,Charlie,35,mar_sales,125


In [3]:
# Melt with pattern matching
# R: pivot_longer(df, cols = ends_with("_sales"), names_to = "month", values_to = "sales")
# filter(like=...) matches the substring anywhere; the anchored regex matches only a suffix
sales_cols = df_wide.filter(regex='_sales$').columns
df_long2 = df_wide.melt(
    id_vars=['name', 'age'],
    value_vars=sales_cols,
    var_name='month',
    value_name='sales'
)

# Clean up month names
df_long2['month'] = df_long2['month'].str.replace('_sales', '')
df_long2

,name,age,month,sales
0,Alice,25,jan,100
1,Bob,30,jan,150
2,Charlie,35,jan,120
3,Alice,25,feb,110
4,Bob,30,feb,140
5,Charlie,35,feb,130
6,Alice,25,mar,120
7,Bob,30,mar,160
8,Charlie,35,mar,125


## Long to Wide Format (Pivot)

Converting long data to wide format, similar to `pivot_wider()`:

In [4]:
# Create long format data
df_long = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=9),
    'store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'metric': ['sales', 'customers', 'returns'] * 3,
    'value': [1000, 50, 5, 1200, 60, 8, 900, 45, 3]
})
df_long

,date,store,metric,value
0,2024-01-01,A,sales,1000
1,2024-01-02,A,customers,50
2,2024-01-03,A,returns,5
3,2024-01-04,B,sales,1200
4,2024-01-05,B,customers,60
5,2024-01-06,B,returns,8
6,2024-01-07,C,sales,900
7,2024-01-08,C,customers,45
8,2024-01-09,C,returns,3


In [5]:
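# Basic pivot - pivot_wider equivalent
# R: pivot_wider(df, names_from = metric, values_from = value)
# Note: every row has a different date, so each (date, store) key holds a
# single metric and the other two cells come out as NaN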
df_wide = df_long.pivot(
    index=['date', 'store'],
    columns='metric',
    values='value'
).reset_index()
df_wide

metric,date,store,customers,returns,sales
0,2024-01-01,A,,,1000.0
1,2024-01-02,A,50.0,,
2,2024-01-03,A,,5.0,
3,2024-01-04,B,,,1200.0
4,2024-01-05,B,60.0,,
5,2024-01-06,B,,8.0,
6,2024-01-07,C,,,900.0
7,2024-01-08,C,45.0,,
8,2024-01-09,C,,3.0,
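
If the goal is one row per store, the rough counterpart of tidyr's `id_cols = store` is to pass only `store` as the index. The sketch below rebuilds the same long frame so that it runs on its own; `by_store` is a name introduced here for illustration.

```python
import pandas as pd

# Same long-format data as in the cell above
df_long = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=9),
    'store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'metric': ['sales', 'customers', 'returns'] * 3,
    'value': [1000, 50, 5, 1200, 60, 8, 900, 45, 3],
})

# Index on 'store' only: the 'date' column is left out of the result,
# and each store collapses to a single, fully populated row
by_store = (
    df_long
    .pivot(index='store', columns='metric', values='value')
    .reset_index()
    .rename_axis(columns=None)
)
print(by_store)
```

This works because each (store, metric) pair occurs exactly once; with repeated pairs `pivot()` would raise and `pivot_table()` would be needed instead.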


In [6]:
# Handle multiple value columns
df_multi = pd.DataFrame({
    'product': ['A', 'A', 'B', 'B'] * 2,
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'] * 2,
    'region': ['East', 'East', 'East', 'East', 'West', 'West', 'West', 'West'],
    'revenue': [100, 120, 150, 160, 110, 130, 140, 170],
    'units': [10, 12, 15, 16, 11, 13, 14, 17]
})

# Pivot with multiple values
# R: pivot_wider(df, names_from = quarter, values_from = c(revenue, units))
df_wide_multi = df_multi.pivot(
    index=['product', 'region'],
    columns='quarter',
    values=['revenue', 'units']
)
df_wide_multi

,,revenue,revenue,units,units
,quarter,Q1,Q2,Q1,Q2
product,region,,,,
A,East,100,120,10,12
A,West,110,130,11,13
B,East,150,160,15,16
B,West,140,170,14,17


## Pivot Table for Aggregation

When you need to aggregate during pivoting:

In [7]:
# Create data with duplicates (no random seed is set, so the totals below change between runs)
df_sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=20),
    'product': np.random.choice(['A', 'B', 'C'], 20),
    'region': np.random.choice(['North', 'South'], 20),
    'sales': np.random.randint(50, 200, 20),
    'quantity': np.random.randint(1, 10, 20)
})

# Pivot table with aggregation
# R: df %>% 
#     pivot_wider(names_from = product, values_from = sales, values_fn = sum)
pivot_result = df_sales.pivot_table(
    index='region',
    columns='product',
    values='sales',
    aggfunc='sum',
    fill_value=0
)
pivot_result

product,A,B,C
region,,,
North,307,324,484
South,1049,169,148


In [8]:
# Multiple aggregations
# R: df %>% 
#     pivot_wider(names_from = product, 
#                 values_from = c(sales, quantity),
#                 values_fn = list(sales = sum, quantity = mean))
pivot_multi_agg = df_sales.pivot_table(
    index='region',
    columns='product',
    values=['sales', 'quantity'],
    aggfunc={'sales': 'sum', 'quantity': 'mean'},
    fill_value=0
)
pivot_multi_agg.round(1)

,quantity,quantity,quantity,sales,sales,sales
product,A,B,C,A,B,C
region,,,,,,
North,6.0,4.0,6.6,307,324,484
South,3.3,5.0,3.0,1049,169,148
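
For tidyverse users it may help to read `pivot_table()` as roughly a `group_by()` plus `summarise()` followed by a widening step. The sketch below compares the two routes on a small fixed frame (hypothetical values, chosen so the result is reproducible); it illustrates the idea and makes no claim about how pandas implements `pivot_table()` internally.

```python
import pandas as pd

# Small deterministic example; the values are made up for illustration
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'South'],
    'product': ['A', 'A', 'A', 'B', 'B'],
    'sales': [100, 50, 200, 30, 70],
})

# Route 1: pivot_table aggregates and widens in one call
via_pivot = df.pivot_table(
    index='region', columns='product', values='sales',
    aggfunc='sum', fill_value=0
)

# Route 2: aggregate with groupby, then move 'product' into the columns
via_groupby = (
    df.groupby(['region', 'product'])['sales']
    .sum()
    .unstack(fill_value=0)
)

print(via_pivot)
print(via_groupby)
```

Both tables hold the same values here; the `groupby` route can be handy when the aggregation step needs more than `aggfunc` comfortably expresses.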


## Complex Melting Patterns

Advanced patterns for reshaping wide data:

In [9]:
# Create complex wide data
df_complex_wide = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'test1_math': [85, 92, 78],
    'test1_english': [88, 85, 92],
    'test2_math': [87, 94, 80],
    'test2_english': [90, 87, 91],
    'final_math': [88, 93, 82],
    'final_english': [89, 86, 93]
})
df_complex_wide

,id,name,test1_math,test1_english,test2_math,test2_english,final_math,final_english
0,1,Alice,85,88,87,90,88,89
1,2,Bob,92,85,94,87,93,86
2,3,Charlie,78,92,80,91,82,93


In [10]:
# Melt with pattern extraction
# R: pivot_longer(df, 
#                 cols = -c(id, name),
#                 names_to = c("test_type", "subject"),
#                 names_pattern = "(.*)_(.*)",
#                 values_to = "score")

# First melt to long format
melted = df_complex_wide.melt(
    id_vars=['id', 'name'],
    var_name='test_subject',
    value_name='score'
)

# Extract test type and subject from column names
melted[['test_type', 'subject']] = melted['test_subject'].str.extract(r'(.+)_(.+)')
melted = melted.drop(columns='test_subject')
melted.head(10)

,id,name,score,test_type,subject
0,1,Alice,85,test1,math
1,2,Bob,92,test1,math
2,3,Charlie,78,test1,math
3,1,Alice,88,test1,english
4,2,Bob,85,test1,english
5,3,Charlie,92,test1,english
6,1,Alice,87,test2,math
7,2,Bob,94,test2,math
8,3,Charlie,80,test2,math
9,1,Alice,90,test2,english
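
An alternative that avoids the temporary `test_subject` column is to split the column labels into a two-level `MultiIndex` first and then stack both levels. This is a sketch under the assumption that every non-id column follows the `<test>_<subject>` pattern; `long_scores` is a name introduced here. Row order can differ between pandas versions, so sort explicitly if it matters.

```python
import pandas as pd

df_complex_wide = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'test1_math': [85, 92, 78],
    'test1_english': [88, 85, 92],
    'test2_math': [87, 94, 80],
    'test2_english': [90, 87, 91],
    'final_math': [88, 93, 82],
    'final_english': [89, 86, 93],
})

# Keep the identifier columns in the index so only score columns remain
wide = df_complex_wide.set_index(['id', 'name'])

# 'test1_math' -> ('test1', 'math'): the columns become a two-level MultiIndex
wide.columns = wide.columns.str.split('_', expand=True)
wide.columns.names = ['test_type', 'subject']

# Stacking both column levels leaves a Series with one score per row
long_scores = (
    wide.stack(['test_type', 'subject'])
    .rename('score')
    .reset_index()
)
print(long_scores.head())
```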


## Stack and Unstack Methods

Alternative approaches for reshaping:

In [11]:
# Create multi-index DataFrame
df_multi_index = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'B': ['one', 'two', 'one', 'two'],
    'C': [1, 2, 3, 4],
    'D': [10, 20, 30, 40]
})
df_indexed = df_multi_index.set_index(['A', 'B'])
df_indexed

,,C,D
A,B,,
foo,one,1,10
foo,two,2,20
bar,one,3,30
bar,two,4,40


In [12]:
# Unstack - move index level to columns
# Similar to pivot_wider
df_unstacked = df_indexed.unstack()
df_unstacked

,C,C,D,D
B,one,two,one,two
A,,,,
bar,3,4,30,40
foo,1,2,10,20


In [13]:
# Stack - move columns to index
# Similar to pivot_longer
df_stacked = df_unstacked.stack()
df_stacked

,,C,D
A,B,,
bar,one,3,30
bar,two,4,40
foo,one,1,10
foo,two,2,20
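
By default `unstack()` moves the innermost index level (`B` here) into the columns. The `level` argument selects a different one, which is the closest analogue to choosing `names_from` in `pivot_wider()`. A minimal sketch on the same frame, with `wide_by_a` and `wide_by_b` as illustrative names:

```python
import pandas as pd

df_indexed = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'B': ['one', 'two', 'one', 'two'],
    'C': [1, 2, 3, 4],
    'D': [10, 20, 30, 40],
}).set_index(['A', 'B'])

# Default: the innermost level 'B' becomes the new column level
wide_by_b = df_indexed.unstack()

# Explicit level: 'A' becomes the new column level and 'B' stays as the index
wide_by_a = df_indexed.unstack(level='A')

print(wide_by_a)
```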


## Real-World Reshaping Examples

Practical examples of data reshaping:

In [14]:
# Example 1: Survey data from wide to long
survey_wide = pd.DataFrame({
    'respondent_id': [1, 2, 3, 4],
    'age': [25, 35, 45, 30],
    'q1_satisfaction': [4, 5, 3, 4],
    'q2_satisfaction': [3, 5, 4, 4],
    'q3_satisfaction': [5, 4, 3, 5],
    'q1_importance': [5, 4, 5, 3],
    'q2_importance': [4, 5, 5, 4],
    'q3_importance': [3, 3, 4, 5]
})

# Reshape to have one row per question per respondent
# First, separate satisfaction and importance
satisfaction = survey_wide.melt(
    id_vars=['respondent_id', 'age'],
    value_vars=['q1_satisfaction', 'q2_satisfaction', 'q3_satisfaction'],
    var_name='question',
    value_name='satisfaction'
)
satisfaction['question'] = satisfaction['question'].str.extract(r'(q\d+)')

importance = survey_wide.melt(
    id_vars=['respondent_id', 'age'],
    value_vars=['q1_importance', 'q2_importance', 'q3_importance'],
    var_name='question',
    value_name='importance'
)
importance['question'] = importance['question'].str.extract(r'(q\d+)')

# Merge back together
survey_long = satisfaction.merge(
    importance[['respondent_id', 'question', 'importance']],
    on=['respondent_id', 'question']
)
survey_long.head(10)

,respondent_id,age,question,satisfaction,importance
0,1,25,q1,4,5
1,2,35,q1,5,4
2,3,45,q1,3,5
3,4,30,q1,4,3
4,1,25,q2,3,4
5,2,35,q2,5,5
6,3,45,q2,4,5
7,4,30,q2,4,4
8,1,25,q3,5,3
9,2,35,q3,4,3
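
The two melts and the merge can also be collapsed into one melt followed by a pivot, which is close in spirit to tidyr's `names_to = c("question", ".value")` with `names_sep = "_"`. The sketch below assumes every survey column is named `<question>_<measure>`; the intermediate names `key` and `measure` are introduced here for illustration.

```python
import pandas as pd

survey_wide = pd.DataFrame({
    'respondent_id': [1, 2, 3, 4],
    'age': [25, 35, 45, 30],
    'q1_satisfaction': [4, 5, 3, 4],
    'q2_satisfaction': [3, 5, 4, 4],
    'q3_satisfaction': [5, 4, 3, 5],
    'q1_importance': [5, 4, 5, 3],
    'q2_importance': [4, 5, 5, 4],
    'q3_importance': [3, 3, 4, 5],
})

# One melt over all question columns
long = survey_wide.melt(
    id_vars=['respondent_id', 'age'],
    var_name='key',
    value_name='score'
)

# 'q1_satisfaction' -> question 'q1', measure 'satisfaction'
long[['question', 'measure']] = long['key'].str.extract(r'(q\d+)_(\w+)')

# Spread the measures back out, giving one row per respondent and question
survey_long = (
    long
    .pivot(index=['respondent_id', 'age', 'question'],
           columns='measure', values='score')
    .reset_index()
    .rename_axis(columns=None)
)
print(survey_long.head())
```

The rows come out ordered by respondent rather than by question, and the measure columns are in alphabetical order, so reorder them if the layout of the merged version is needed.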


In [15]:
# Example 2: Time series data reshaping
# Create monthly data in wide format
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
products = ['Product_A', 'Product_B', 'Product_C']

data = {}
data['Store'] = ['Store_1', 'Store_2', 'Store_3']
for product in products:
    for month in months:
        col_name = f'{product}_{month}'
        data[col_name] = np.random.randint(50, 150, 3)

df_monthly_wide = pd.DataFrame(data)
df_monthly_wide.head()

,Store,Product_A_Jan,Product_A_Feb,Product_A_Mar,Product_A_Apr,Product_A_May,Product_A_Jun,Product_B_Jan,Product_B_Feb,Product_B_Mar,Product_B_Apr,Product_B_May,Product_B_Jun,Product_C_Jan,Product_C_Feb,Product_C_Mar,Product_C_Apr,Product_C_May,Product_C_Jun
0,Store_1,94,55,147,95,89,51,126,75,110,97,116,120,56,62,92,147,92,81
1,Store_2,88,65,94,143,115,94,94,119,140,68,134,74,72,121,60,99,121,115
2,Store_3,97,63,62,109,128,144,75,96,103,128,59,111,59,136,97,55,120,140


In [16]:
# Reshape to long format with product and month as separate columns
df_monthly_long = df_monthly_wide.melt(
    id_vars=['Store'],
    var_name='Product_Month',
    value_name='Sales'
)

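# Split Product_Month on the LAST underscore: product names such as
# 'Product_A' contain an underscore themselves, so a left-to-right split
# would yield 'Product' and 'A_Jan'
df_monthly_long[['Product', 'Month']] = df_monthly_long['Product_Month'].str.rsplit('_', n=1, expand=True)
df_monthly_long = df_monthly_long.drop(columns='Product_Month')

# Create a proper date column (zero-pad the month so the string reads '2024-01-01')
month_map = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6}
df_monthly_long['Month_Num'] = df_monthly_long['Month'].map(month_map)
df_monthly_long['Date'] = pd.to_datetime('2024-' + df_monthly_long['Month_Num'].astype(str).str.zfill(2) + '-01')

df_monthly_long = df_monthly_long[['Store', 'Product', 'Date', 'Sales']].sort_values(['Store', 'Product', 'Date'])
df_monthly_long.head(10)

,Store,Product,Date,Sales
0,Store_1,Product_A,2024-01-01,94
3,Store_1,Product_A,2024-02-01,55
6,Store_1,Product_A,2024-03-01,147
9,Store_1,Product_A,2024-04-01,95
12,Store_1,Product_A,2024-05-01,89
15,Store_1,Product_A,2024-06-01,51
18,Store_1,Product_B,2024-01-01,126
21,Store_1,Product_B,2024-02-01,75
24,Store_1,Product_B,2024-03-01,110
27,Store_1,Product_B,2024-04-01,97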

## Pivot with Multiple Index/Columns

Handling complex pivoting scenarios:

In [17]:
# Create hierarchical data
df_hierarchical = pd.DataFrame({
    'year': [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1', 'Q1', 'Q2', 'Q2'],
    'region': ['East', 'West', 'East', 'West', 'East', 'West', 'East', 'West'],
    'product': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
    'revenue': [100, 110, 120, 130, 140, 150, 160, 170],
    'cost': [60, 65, 70, 75, 80, 85, 90, 95]
})

# Pivot with multiple index and columns
pivot_hierarchical = df_hierarchical.pivot_table(
    index=['region', 'product'],
    columns=['year', 'quarter'],
    values=['revenue', 'cost'],
    aggfunc='sum'
)
pivot_hierarchical

,,cost,cost,cost,cost,revenue,revenue,revenue,revenue
,year,2023,2023,2024,2024,2023,2023,2024,2024
,quarter,Q1,Q2,Q1,Q2,Q1,Q2,Q1,Q2
region,product,,,,,,,,
East,A,60,70,80,90,100,120,140,160
West,A,65,75,85,95,110,130,150,170


In [18]:
# Flatten multi-level columns: join each (value, year, quarter) tuple with '_'
# map(str, ...) is needed because the year level holds integers
pivot_hierarchical.columns = ['_'.join(map(str, col)) for col in pivot_hierarchical.columns]
pivot_hierarchical.reset_index()

,region,product,cost_2023_Q1,cost_2023_Q2,cost_2024_Q1,cost_2024_Q2,revenue_2023_Q1,revenue_2023_Q2,revenue_2024_Q1,revenue_2024_Q2
0,East,A,60,70,80,90,100,120,140,160
1,West,A,65,75,85,95,110,130,150,170


## Handling Edge Cases

Dealing with common pivoting challenges:

In [19]:
# Duplicate entries
df_duplicates = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'product': ['A', 'A', 'B'],
    'sales': [100, 50, 200]  # Two entries for product A on same date
})

# pivot() will fail with duplicates
try:
    df_duplicates.pivot(index='date', columns='product', values='sales')
except ValueError as e:
    print(f"Error: {e}")
    
# Use pivot_table() instead
df_duplicates.pivot_table(
    index='date', 
    columns='product', 
    values='sales', 
    aggfunc='sum'  # Aggregate duplicates
)

Error: Index contains duplicate entries, cannot reshape


product,A,B
date,,
2024-01-01,150.0,
2024-01-02,,200.0
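
Before reaching for `pivot_table()`, it can be worth looking at which rows collide, since summing may or may not be the right way to resolve them. A minimal sketch using `DataFrame.duplicated()` on the intended index and column keys (`keys` and `dupes` are illustrative names):

```python
import pandas as pd

df_duplicates = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'product': ['A', 'A', 'B'],
    'sales': [100, 50, 200],
})

# The columns that will become the pivot's index and column labels
keys = ['date', 'product']

# keep=False flags every member of a duplicated group, not just the repeats
dupes = df_duplicates[df_duplicates.duplicated(subset=keys, keep=False)]
print(dupes)

if dupes.empty:
    print("Keys are unique; pivot() is safe to use")
else:
    print(f"{len(dupes)} rows share a key; aggregate or deduplicate first")
```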


In [20]:
# Missing combinations
df_sparse = pd.DataFrame({
    'store': ['A', 'A', 'B'],  # Store B missing product Y
    'product': ['X', 'Y', 'X'],
    'sales': [100, 150, 120]
})

# Pivot with fill_value for missing combinations
df_sparse.pivot_table(
    index='store',
    columns='product',
    values='sales',
    fill_value=0  # Fill missing with 0
)

product,X,Y
store,,
A,100.0,150.0
B,120.0,0.0


## Creating Tidyverse-Style Helper Functions

Make pivoting more tidyr-like:

In [21]:
def pivot_longer(df, cols, names_to='name', values_to='value', id_vars=None):
    """Mimics tidyr's pivot_longer"""
    if id_vars is None:
        id_vars = [col for col in df.columns if col not in cols]
    
    return df.melt(
        id_vars=id_vars,
        value_vars=cols,
        var_name=names_to,
        value_name=values_to
    )

def pivot_wider(df, names_from, values_from, id_cols=None):
    """Mimics tidyr's pivot_wider"""
    if id_cols is None:
        id_cols = [col for col in df.columns if col not in [names_from, values_from]]
    
    return df.pivot(
        index=id_cols,
        columns=names_from,
        values=values_from
    ).reset_index()

# Usage examples
df_test = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'measurement': ['height', 'weight', 'height', 'weight'],
    'value': [180, 75, 165, 60]
})

# R: pivot_wider(df, names_from = measurement, values_from = value)
pivot_wider(df_test, names_from='measurement', values_from='value')

measurement,id,height,weight
0,1,180,75
1,2,165,60
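
The `pivot_longer()` helper can be exercised the same way. The sketch below repeats its definition so that it runs on its own, and applies it to a small wide frame (`df_wide_test`, a hypothetical name) to reverse the reshaping shown above.

```python
import pandas as pd

def pivot_longer(df, cols, names_to='name', values_to='value', id_vars=None):
    """Mimics tidyr's pivot_longer (same definition as in the cell above)."""
    if id_vars is None:
        id_vars = [col for col in df.columns if col not in cols]
    return df.melt(
        id_vars=id_vars,
        value_vars=cols,
        var_name=names_to,
        value_name=values_to
    )

# Hypothetical wide frame matching the pivot_wider result above
df_wide_test = pd.DataFrame({
    'id': [1, 2],
    'height': [180, 165],
    'weight': [75, 60],
})

# R: pivot_longer(df, cols = c(height, weight), names_to = "measurement")
df_back = pivot_longer(
    df_wide_test,
    cols=['height', 'weight'],
    names_to='measurement'
)
print(df_back)
```

Note that `melt()` emits all `height` rows before all `weight` rows, whereas tidyr interleaves them per `id`; a `sort_values('id')` restores that order if needed.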


## Performance Considerations

Efficient pivoting strategies:

In [22]:
# Create large dataset
import time

np.random.seed(42)
large_df = pd.DataFrame({
    'id': np.repeat(range(1000), 12),
    'month': np.tile(range(1, 13), 1000),
    'value': np.random.randn(12000)
})

# time.perf_counter() is a monotonic, high-resolution clock intended for timing short intervals
# Method 1: pivot()
start = time.perf_counter()
pivoted1 = large_df.pivot(index='id', columns='month', values='value')
print(f"pivot(): {time.perf_counter() - start:.4f} seconds")

# Method 2: pivot_table()
start = time.perf_counter()
pivoted2 = large_df.pivot_table(index='id', columns='month', values='value', aggfunc='mean')
print(f"pivot_table(): {time.perf_counter() - start:.4f} seconds")

# Method 3: unstack()
start = time.perf_counter()
pivoted3 = large_df.set_index(['id', 'month'])['value'].unstack()
print(f"unstack(): {time.perf_counter() - start:.4f} seconds")

pivot(): 0.0028 seconds
pivot_table(): 0.0035 seconds
unstack(): 0.0013 seconds
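
Single-run timings at the millisecond scale are noisy, so the ordering above may not hold from run to run. One way to get steadier numbers is the standard library's `timeit.repeat`, taking the best of several repeats. This is a sketch: the repeat counts are arbitrary choices, and results will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of data as above: 1000 ids x 12 months
rng = np.random.default_rng(42)
large_df = pd.DataFrame({
    'id': np.repeat(np.arange(1000), 12),
    'month': np.tile(np.arange(1, 13), 1000),
    'value': rng.standard_normal(12000),
})

candidates = {
    'pivot()': lambda: large_df.pivot(index='id', columns='month', values='value'),
    'pivot_table()': lambda: large_df.pivot_table(
        index='id', columns='month', values='value', aggfunc='mean'),
    'unstack()': lambda: large_df.set_index(['id', 'month'])['value'].unstack(),
}

timings = {}
for label, func in candidates.items():
    # 5 repeats of 10 calls each; the minimum is the least disturbed measurement
    runs = timeit.repeat(func, number=10, repeat=5)
    timings[label] = min(runs) / 10
    print(f"{label}: {timings[label]:.4f} seconds per call")

# Sanity check: pivot() and unstack() produce the same table
same = candidates['pivot()']().equals(candidates['unstack()']())
print(f"pivot() and unstack() agree: {same}")
```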
