# Arranging & Sorting

Sorting data is a fundamental operation that helps reveal patterns and prepare data for analysis. While R's `dplyr::arrange()` provides a clean interface for sorting, pandas offers flexible sorting capabilities through `sort_values()` and `sort_index()`. This chapter will show you how to achieve tidyverse-style sorting in pandas.

## Best Practices Summary

Quick reference for sorting patterns:

| Task | R (dplyr) | Pandas |
|------|-----------|--------|
| Simple sort | `arrange(df, col)` | `df.sort_values('col')` |
| Descending | `arrange(df, desc(col))` | `df.sort_values('col', ascending=False)` |
| Multiple columns | `arrange(df, col1, col2)` | `df.sort_values(['col1', 'col2'])` |
| Mixed order | `arrange(df, col1, desc(col2))` | `df.sort_values(['col1', 'col2'], ascending=[True, False])` |
| NA position | `arrange(df, col)` | `df.sort_values('col', na_position='last')` |
| Top N | `slice_max(df, col, n=5)` | `df.nlargest(5, 'col')` |
| Bottom N | `slice_min(df, col, n=5)` | `df.nsmallest(5, 'col')` |
| By string length | `arrange(df, nchar(col))` | `df.sort_values('col', key=lambda x: x.str.len())` |
| Within groups | `group_by(df, g) %>% arrange(col, .by_group = TRUE)` | `df.sort_values(['g', 'col'])` |

## Tips for Tidyverse Users

1. **Use `sort_values()`, not `sort()`**: The old `DataFrame.sort()` method was deprecated and later removed from pandas; always use `sort_values()`.

2. **Remember the `ascending` parameter**: It accepts either a single boolean or a list of booleans, one per sort column.

3. **Consider `nlargest/nsmallest`**: Often faster than sorting everything when you only need top/bottom rows.

4. **Chain sorting operations**: Sorting works well in method chains:
   ```python
   (df
    .query('salary > 70000')
    .sort_values('performance', ascending=False)
    .head(10))
   ```

5. **Use the `key` parameter**: For custom sorting logic, `key` (pandas 1.1.0+) takes a function that receives each sort column as a Series and returns the values to sort by.

Sorting in pandas is straightforward and flexible. While the syntax differs from dplyr's `arrange()`, pandas offers additional capabilities like index sorting and key functions that can handle complex sorting requirements efficiently.

## Basic Sorting

The fundamental ways to sort data in pandas:

In [1]:
import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'department': ['Sales', 'IT', 'HR', 'Sales', 'IT', 'HR'],
    'salary': [70000, 85000, 65000, 72000, 90000, 68000],
    'years_exp': [5, 8, 3, 6, 10, 4],
    'performance': [4.2, 4.5, 3.8, 4.0, 4.7, 3.9],
    'hire_date': pd.to_datetime(['2019-03-15', '2016-06-01', '2021-01-10', 
                                  '2018-09-20', '2014-11-30', '2020-04-05'])
})

# Simple sorting by one column
# R: arrange(df, salary)
df.sort_values('salary')

Unnamed: 0,name,department,salary,years_exp,performance,hire_date
2,Charlie,HR,65000,3,3.8,2021-01-10
5,Frank,HR,68000,4,3.9,2020-04-05
0,Alice,Sales,70000,5,4.2,2019-03-15
3,David,Sales,72000,6,4.0,2018-09-20
1,Bob,IT,85000,8,4.5,2016-06-01
4,Eve,IT,90000,10,4.7,2014-11-30


In [2]:
# Descending order
# R: arrange(df, desc(salary))
df.sort_values('salary', ascending=False)

Unnamed: 0,name,department,salary,years_exp,performance,hire_date
4,Eve,IT,90000,10,4.7,2014-11-30
1,Bob,IT,85000,8,4.5,2016-06-01
3,David,Sales,72000,6,4.0,2018-09-20
0,Alice,Sales,70000,5,4.2,2019-03-15
5,Frank,HR,68000,4,3.9,2020-04-05
2,Charlie,HR,65000,3,3.8,2021-01-10


In [3]:
# In-place sorting (modifies original DataFrame)
# Note: This is different from R which always returns a new data frame
df_copy = df.copy()
df_copy.sort_values('salary', inplace=True)
df_copy

Unnamed: 0,name,department,salary,years_exp,performance,hire_date
2,Charlie,HR,65000,3,3.8,2021-01-10
5,Frank,HR,68000,4,3.9,2020-04-05
0,Alice,Sales,70000,5,4.2,2019-03-15
3,David,Sales,72000,6,4.0,2018-09-20
1,Bob,IT,85000,8,4.5,2016-06-01
4,Eve,IT,90000,10,4.7,2014-11-30

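Sorted output keeps the original row labels (2, 5, 0, ...), which can surprise R users who expect fresh row numbers. A minimal sketch of renumbering the result with the `ignore_index` argument of `sort_values()` (available in pandas 1.0 and later); the `staff` frame is made-up toy data, not the chapter's `df`:

```python
import pandas as pd

# Hypothetical toy data for illustration
staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [70000, 85000, 65000],
})

# Default: the original row labels travel with the rows
kept_labels = staff.sort_values('salary')

# ignore_index=True relabels the sorted rows 0..n-1
fresh_labels = staff.sort_values('salary', ignore_index=True)

print(kept_labels.index.tolist())   # [2, 0, 1]
print(fresh_labels.index.tolist())  # [0, 1, 2]
```

Calling `.reset_index(drop=True)` after the sort should give the same result on older pandas versions.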

## Multiple Column Sorting

Sorting by multiple columns with different orders:

In [4]:
# Sort by multiple columns
# R: arrange(df, department, salary)
df.sort_values(['department', 'salary'])

Unnamed: 0,name,department,salary,years_exp,performance,hire_date
2,Charlie,HR,65000,3,3.8,2021-01-10
5,Frank,HR,68000,4,3.9,2020-04-05
1,Bob,IT,85000,8,4.5,2016-06-01
4,Eve,IT,90000,10,4.7,2014-11-30
0,Alice,Sales,70000,5,4.2,2019-03-15
3,David,Sales,72000,6,4.0,2018-09-20


In [5]:
# Mixed ascending/descending order
# R: arrange(df, department, desc(salary))
df.sort_values(['department', 'salary'], ascending=[True, False])

Unnamed: 0,name,department,salary,years_exp,performance,hire_date
5,Frank,HR,68000,4,3.9,2020-04-05
2,Charlie,HR,65000,3,3.8,2021-01-10
4,Eve,IT,90000,10,4.7,2014-11-30
1,Bob,IT,85000,8,4.5,2016-06-01
3,David,Sales,72000,6,4.0,2018-09-20
0,Alice,Sales,70000,5,4.2,2019-03-15


In [6]:
# Complex multi-column sorting
# R: arrange(df, desc(performance), years_exp, salary)
df.sort_values(['performance', 'years_exp', 'salary'], 
               ascending=[False, True, True])

Unnamed: 0,name,department,salary,years_exp,performance,hire_date
4,Eve,IT,90000,10,4.7,2014-11-30
1,Bob,IT,85000,8,4.5,2016-06-01
0,Alice,Sales,70000,5,4.2,2019-03-15
3,David,Sales,72000,6,4.0,2018-09-20
5,Frank,HR,68000,4,3.9,2020-04-05
2,Charlie,HR,65000,3,3.8,2021-01-10


## Sorting with Missing Values

Handling NaN values during sorting:

In [7]:
# Create DataFrame with missing values
df_missing = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [10.5, np.nan, 8.0, 12.0, np.nan],
    'rating': [4.5, 3.8, np.nan, 4.2, 4.0],
    'stock': [100, 50, 200, np.nan, 150]
})

# Default: NaN values go to the end
# R: arrange(df, price) - NA values also go to end by default
df_missing.sort_values('price')

Unnamed: 0,product,price,rating,stock
2,C,8.0,,200.0
0,A,10.5,4.5,100.0
3,D,12.0,4.2,
1,B,,3.8,50.0
4,E,,4.0,150.0


In [8]:
# Put NaN values first
# R: arrange(df, desc(is.na(price)), price)
df_missing.sort_values('price', na_position='first')

Unnamed: 0,product,price,rating,stock
1,B,,3.8,50.0
4,E,,4.0,150.0
2,C,8.0,,200.0
0,A,10.5,4.5,100.0
3,D,12.0,4.2,


In [9]:
# Multiple columns: na_position is a single setting applied to every sort column
df_missing.sort_values(['price', 'rating'], 
                      na_position='first',
                      ascending=[True, False])

Unnamed: 0,product,price,rating,stock
4,E,,4.0,150.0
1,B,,3.8,50.0
2,C,8.0,,200.0
0,A,10.5,4.5,100.0
3,D,12.0,4.2,

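`na_position` takes one value for the whole call, so it cannot place NaN first in one column and last in another. One possible workaround, sketched here with made-up data and a hypothetical helper column named `price_missing`, is to sort on an explicit missing-value flag:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: several missing prices, one missing rating
items = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [10.5, np.nan, 8.0, np.nan, np.nan],
    'rating': [4.5, 3.8, 4.1, np.nan, 4.0],
})

ordered = (
    items
    # Flag rows with a missing price (helper column, name is an assumption)
    .assign(price_missing=items['price'].isna())
    # Flag descending puts missing prices first; rating NaN still goes last
    .sort_values(['price_missing', 'price', 'rating'],
                 ascending=[False, True, False],
                 na_position='last')
    .drop(columns='price_missing')
)

print(ordered['product'].tolist())  # ['E', 'B', 'D', 'C', 'A']
```

Rows with a missing price come first, ordered by rating from high to low with the missing rating at the end of that block; the remaining rows follow in ascending price order.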

## Sorting by Index

Pandas allows sorting by index, which has no direct equivalent in dplyr:

In [10]:
# Set a meaningful index
df_indexed = df.set_index('name')
df_indexed

name,department,salary,years_exp,performance,hire_date
Alice,Sales,70000,5,4.2,2019-03-15
Bob,IT,85000,8,4.5,2016-06-01
Charlie,HR,65000,3,3.8,2021-01-10
David,Sales,72000,6,4.0,2018-09-20
Eve,IT,90000,10,4.7,2014-11-30
Frank,HR,68000,4,3.9,2020-04-05


In [11]:
# Sort by index
# No direct R equivalent - would need to convert rownames to column first
df_indexed.sort_index()

name,department,salary,years_exp,performance,hire_date
Alice,Sales,70000,5,4.2,2019-03-15
Bob,IT,85000,8,4.5,2016-06-01
Charlie,HR,65000,3,3.8,2021-01-10
David,Sales,72000,6,4.0,2018-09-20
Eve,IT,90000,10,4.7,2014-11-30
Frank,HR,68000,4,3.9,2020-04-05


In [12]:
# Sort by index descending
df_indexed.sort_index(ascending=False)

name,department,salary,years_exp,performance,hire_date
Frank,HR,68000,4,3.9,2020-04-05
Eve,IT,90000,10,4.7,2014-11-30
David,Sales,72000,6,4.0,2018-09-20
Charlie,HR,65000,3,3.8,2021-01-10
Bob,IT,85000,8,4.5,2016-06-01
Alice,Sales,70000,5,4.2,2019-03-15


In [13]:
# Multi-level index sorting
df_multi = df.set_index(['department', 'name']).sort_index()
df_multi

department,name,salary,years_exp,performance,hire_date
HR,Charlie,65000,3,3.8,2021-01-10
HR,Frank,68000,4,3.9,2020-04-05
IT,Bob,85000,8,4.5,2016-06-01
IT,Eve,90000,10,4.7,2014-11-30
Sales,Alice,70000,5,4.2,2019-03-15
Sales,David,72000,6,4.0,2018-09-20

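With a multi-level index, `sort_index()` also accepts a list for `ascending`, one entry per level. A small sketch under that assumption, using a made-up `staff` frame rather than the chapter's `df`:

```python
import pandas as pd

# Hypothetical toy data indexed by (department, name)
staff = pd.DataFrame({
    'department': ['Sales', 'IT', 'HR', 'Sales', 'IT', 'HR'],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'salary': [70000, 85000, 65000, 72000, 90000, 68000],
}).set_index(['department', 'name'])

# Departments A-Z, names Z-A within each department
mixed = staff.sort_index(ascending=[True, False])

print(mixed.index.tolist())
```

The rough dplyr counterpart, once the index levels are ordinary columns, would be `arrange(df, department, desc(name))`.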

## Sorting with Custom Functions

Using key functions for custom sorting logic:

In [14]:
# Sort by string length
# R: arrange(df, nchar(name))
df.sort_values('name', key=lambda x: x.str.len())

Unnamed: 0,name,department,salary,years_exp,performance,hire_date
1,Bob,IT,85000,8,4.5,2016-06-01
4,Eve,IT,90000,10,4.7,2014-11-30
0,Alice,Sales,70000,5,4.2,2019-03-15
3,David,Sales,72000,6,4.0,2018-09-20
5,Frank,HR,68000,4,3.9,2020-04-05
2,Charlie,HR,65000,3,3.8,2021-01-10


In [15]:
# Sort by last character of name
# R: arrange(df, substr(name, nchar(name), nchar(name)))
df.sort_values('name', key=lambda x: x.str[-1])

Unnamed: 0,name,department,salary,years_exp,performance,hire_date
1,Bob,IT,85000,8,4.5,2016-06-01
3,David,Sales,72000,6,4.0,2018-09-20
0,Alice,Sales,70000,5,4.2,2019-03-15
2,Charlie,HR,65000,3,3.8,2021-01-10
4,Eve,IT,90000,10,4.7,2014-11-30
5,Frank,HR,68000,4,3.9,2020-04-05


In [16]:
# Case-insensitive sorting
# R: arrange(df, tolower(department))
df_mixed_case = df.copy()
df_mixed_case.loc[1, 'department'] = 'it'  # lowercase
df_mixed_case.loc[3, 'department'] = 'SALES'  # uppercase

df_mixed_case.sort_values('department', key=lambda x: x.str.lower())

Unnamed: 0,name,department,salary,years_exp,performance,hire_date
2,Charlie,HR,65000,3,3.8,2021-01-10
5,Frank,HR,68000,4,3.9,2020-04-05
1,Bob,it,85000,8,4.5,2016-06-01
4,Eve,IT,90000,10,4.7,2014-11-30
0,Alice,Sales,70000,5,4.2,2019-03-15
3,David,SALES,72000,6,4.0,2018-09-20


## Sorting in Method Chains

Integrating sorting into data pipelines:

In [17]:
# Complex chain with sorting
# R: df %>%
#     filter(salary > 68000) %>%
#     arrange(desc(performance), salary) %>%
#     select(name, department, performance, salary)
result = (df
    .query('salary > 68000')
    .sort_values(['performance', 'salary'], ascending=[False, True])
    [['name', 'department', 'performance', 'salary']]
)
result

Unnamed: 0,name,department,performance,salary
4,Eve,IT,4.7,90000
1,Bob,IT,4.5,85000
0,Alice,Sales,4.2,70000
3,David,Sales,4.0,72000


In [18]:
# Sorting after groupby operations
# R: df %>%
#     group_by(department) %>%
#     summarize(avg_salary = mean(salary)) %>%
#     arrange(desc(avg_salary))
(df
    .groupby('department')
    .agg(avg_salary=('salary', 'mean'))
    .sort_values('avg_salary', ascending=False)
)

department,avg_salary
IT,87500.0
Sales,71000.0
HR,66500.0


## Rank and Order Operations

Getting ranks and ordering positions:

In [19]:
# Add rank column
# R: mutate(df, salary_rank = rank(salary))
df['salary_rank'] = df['salary'].rank()
df[['name', 'salary', 'salary_rank']]

Unnamed: 0,name,salary,salary_rank
0,Alice,70000,3.0
1,Bob,85000,5.0
2,Charlie,65000,1.0
3,David,72000,4.0
4,Eve,90000,6.0
5,Frank,68000,2.0


In [20]:
# Different ranking methods
# R: mutate(df, 
#          rank_min = rank(salary, ties.method = "min"),
#          rank_avg = rank(salary, ties.method = "average"),
#          rank_max = rank(salary, ties.method = "max"),
#          rank_dense = dense_rank(salary))
df_ranks = df.assign(
    rank_min = df['salary'].rank(method='min'),
    rank_avg = df['salary'].rank(method='average'),
    rank_max = df['salary'].rank(method='max'),
    rank_dense = df['salary'].rank(method='dense')
)
df_ranks[['name', 'salary', 'rank_min', 'rank_avg', 'rank_max', 'rank_dense']]

Unnamed: 0,name,salary,rank_min,rank_avg,rank_max,rank_dense
0,Alice,70000,3.0,3.0,3.0,3.0
1,Bob,85000,5.0,5.0,5.0,5.0
2,Charlie,65000,1.0,1.0,1.0,1.0
3,David,72000,4.0,4.0,4.0,4.0
4,Eve,90000,6.0,6.0,6.0,6.0
5,Frank,68000,2.0,2.0,2.0,2.0

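Because every salary in the sample data is unique, all ranking methods return the same numbers. The methods only differ when values tie, as this small sketch with made-up scores shows:

```python
import pandas as pd

# Hypothetical scores with one tie (a and c both score 70)
scores = pd.Series([70, 85, 70, 90], index=['a', 'b', 'c', 'd'])

ranks = pd.DataFrame({
    'score': scores,
    'min': scores.rank(method='min'),          # tied rows share the lowest rank
    'average': scores.rank(method='average'),  # tied rows share the mean rank
    'max': scores.rank(method='max'),          # tied rows share the highest rank
    'dense': scores.rank(method='dense'),      # like 'min', but with no gaps after ties
    'first': scores.rank(method='first'),      # ties broken by order of appearance
})

print(ranks)
```

For the tied rows `a` and `c`, `min` gives 1 and 1, `average` gives 1.5 and 1.5, `max` gives 2 and 2, `dense` gives 1 and 1 (with `b` ranked 2 rather than 3), and `first` gives 1 and 2.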

In [21]:
# Ranking within groups
# R: df %>% 
#     group_by(department) %>% 
#     mutate(dept_salary_rank = rank(desc(salary)))
df['dept_salary_rank'] = df.groupby('department')['salary'].rank(ascending=False)
df.sort_values(['department', 'dept_salary_rank'])

Unnamed: 0,name,department,salary,years_exp,performance,hire_date,salary_rank,dept_salary_rank
5,Frank,HR,68000,4,3.9,2020-04-05,2.0,1.0
2,Charlie,HR,65000,3,3.8,2021-01-10,1.0,2.0
4,Eve,IT,90000,10,4.7,2014-11-30,6.0,1.0
1,Bob,IT,85000,8,4.5,2016-06-01,5.0,2.0
3,David,Sales,72000,6,4.0,2018-09-20,4.0,1.0
0,Alice,Sales,70000,5,4.2,2019-03-15,3.0,2.0


## Top N Operations

Getting top/bottom N rows:

In [22]:
# Top N by value
# R: slice_max(df, salary, n = 3)
df.nlargest(3, 'salary')

Unnamed: 0,name,department,salary,years_exp,performance,hire_date,salary_rank,dept_salary_rank
4,Eve,IT,90000,10,4.7,2014-11-30,6.0,1.0
1,Bob,IT,85000,8,4.5,2016-06-01,5.0,2.0
3,David,Sales,72000,6,4.0,2018-09-20,4.0,1.0


In [23]:
# Bottom N by value
# R: slice_min(df, performance, n = 2)
df.nsmallest(2, 'performance')

Unnamed: 0,name,department,salary,years_exp,performance,hire_date,salary_rank,dept_salary_rank
2,Charlie,HR,65000,3,3.8,2021-01-10,1.0,2.0
5,Frank,HR,68000,4,3.9,2020-04-05,2.0,1.0


In [24]:
# Top N by multiple columns
# First by performance, then by salary
# R: arrange(df, desc(performance), desc(salary)) %>% slice_head(n = 3)
df.nlargest(3, ['performance', 'salary'])

Unnamed: 0,name,department,salary,years_exp,performance,hire_date,salary_rank,dept_salary_rank
4,Eve,IT,90000,10,4.7,2014-11-30,6.0,1.0
1,Bob,IT,85000,8,4.5,2016-06-01,5.0,2.0
0,Alice,Sales,70000,5,4.2,2019-03-15,3.0,2.0


In [25]:
# Top N per group
# R: df %>% 
#     group_by(department) %>% 
#     slice_max(salary, n = 1)
df.sort_values('salary', ascending=False).groupby('department').head(1)

Unnamed: 0,name,department,salary,years_exp,performance,hire_date,salary_rank,dept_salary_rank
4,Eve,IT,90000,10,4.7,2014-11-30,6.0,1.0
3,David,Sales,72000,6,4.0,2018-09-20,4.0,1.0
5,Frank,HR,68000,4,3.9,2020-04-05,2.0,1.0

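One difference to keep in mind: dplyr's `slice_max()` keeps tied rows by default (`with_ties = TRUE`), while the sort-then-`head(1)` pattern returns exactly one row per group. A possible way to keep ties in pandas, sketched with made-up data, is to filter on a within-group rank:

```python
import pandas as pd

# Hypothetical toy data: two IT employees share the top salary
pay = pd.DataFrame({
    'department': ['IT', 'IT', 'IT', 'HR', 'HR'],
    'name': ['Eve', 'Bob', 'Gina', 'Frank', 'Charlie'],
    'salary': [90000, 85000, 90000, 68000, 65000],
})

# Exactly one row per department; which tied row wins is not specified here
first_only = pay.sort_values('salary', ascending=False).groupby('department').head(1)

# Keep every row tied for the top salary in its department
rank_in_dept = pay.groupby('department')['salary'].rank(method='min', ascending=False)
with_ties = pay[rank_in_dept <= 1]

print(with_ties['name'].tolist())  # ['Eve', 'Gina', 'Frank']
```

Changing `<= 1` to `<= n` generalizes this to the top `n` salaries per group, ties included.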

## Sorting Different Data Types

Handling various data types in sorting:

In [26]:
# Create DataFrame with various types
df_types = pd.DataFrame({
    'text': ['apple', 'Banana', '10apples', 'cherry', '2bananas'],
    'numbers': [3.14, 2.71, 1.41, 2.24, 1.73],
    'dates': pd.to_datetime(['2024-03-15', '2024-01-10', '2024-12-01', 
                            '2024-06-20', '2024-02-28']),
    'categories': pd.Categorical(['Low', 'High', 'Medium', 'Low', 'High'],
                                categories=['Low', 'Medium', 'High'],
                                ordered=True),
    'booleans': [True, False, True, False, True]
})

# Sort by date
df_types.sort_values('dates')

Unnamed: 0,text,numbers,dates,categories,booleans
1,Banana,2.71,2024-01-10,High,False
4,2bananas,1.73,2024-02-28,High,True
0,apple,3.14,2024-03-15,Low,True
3,cherry,2.24,2024-06-20,Low,False
2,10apples,1.41,2024-12-01,Medium,True


In [27]:
# Sort by categorical (respects order)
df_types.sort_values('categories')

Unnamed: 0,text,numbers,dates,categories,booleans
0,apple,3.14,2024-03-15,Low,True
3,cherry,2.24,2024-06-20,Low,False
2,10apples,1.41,2024-12-01,Medium,True
1,Banana,2.71,2024-01-10,High,False
4,2bananas,1.73,2024-02-28,High,True


In [28]:
# Natural sorting for strings with numbers
# R: df[stringr::str_order(df$text, numeric = TRUE), ]
# Pandas doesn't have built-in natural sort, but we can implement it
import re

def natural_sort_key(x):
    # Split into digit and non-digit runs; digit runs compare as integers
    return [int(text) if text.isdigit() else text.lower() 
            for text in re.split(r'([0-9]+)', x)]

df_types.iloc[df_types['text'].apply(natural_sort_key).argsort()]

Unnamed: 0,text,numbers,dates,categories,booleans
4,2bananas,1.73,2024-02-28,High,True
2,10apples,1.41,2024-12-01,Medium,True
0,apple,3.14,2024-03-15,Low,True
1,Banana,2.71,2024-01-10,High,False
3,cherry,2.24,2024-06-20,Low,False


## Stable Sorting

Understanding stable vs unstable sorting:

In [29]:
# Create DataFrame with ties
df_ties = pd.DataFrame({
    'group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'value': [10, 10, 20, 20, 10, 20],
    'order': [1, 2, 3, 4, 5, 6]  # Original order
})

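# A stable sort preserves the original order of tied rows
# For a single column, pandas defaults to kind='quicksort', which is not
# guaranteed to be stable; pass kind='stable' (or 'mergesort') to guarantee it
df_ties.sort_values('value', kind='stable')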

Unnamed: 0,group,value,order
0,A,10,1
1,B,10,2
4,A,10,5
2,A,20,3
3,B,20,4
5,B,20,6


In [30]:
# Multiple sorts to demonstrate stability
(df_ties
    .sort_values('order')  # First sort
    .sort_values('value', kind='stable')  # Stable second sort keeps the first ordering within ties
)

Unnamed: 0,group,value,order
0,A,10,1
1,B,10,2
4,A,10,5
2,A,20,3
3,B,20,4
5,B,20,6

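Stability is what makes the classic two-pass idiom work: sort by the secondary key first, then do a stable sort by the primary key. A minimal sketch with made-up data, passing `kind='stable'` explicitly so the tie order does not depend on the default algorithm, and checking the result against a single multi-column sort:

```python
import pandas as pd

# Hypothetical toy data with tied values
events = pd.DataFrame({
    'value': [10, 10, 20, 20, 10, 20],
    'order': [1, 2, 3, 4, 5, 6],
})

# Pass 1: secondary key (order, descending)
# Pass 2: primary key (value), stable so pass 1's ordering survives within ties
two_pass = (events
    .sort_values('order', ascending=False)
    .sort_values('value', kind='stable'))

# The same ordering expressed as one multi-column sort
one_pass = events.sort_values(['value', 'order'], ascending=[True, False])

print(two_pass['order'].tolist())  # [5, 2, 1, 6, 4, 3]
print(two_pass.equals(one_pass))   # True
```

In everyday code the single multi-column call is usually the clearer choice; the two-pass form mainly illustrates what a stable sort guarantees.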

## Performance Considerations

Efficient sorting strategies:

In [31]:
# Create large DataFrame
np.random.seed(42)
large_df = pd.DataFrame({
    'A': np.random.randn(100000),
    'B': np.random.choice(['X', 'Y', 'Z'], 100000),
    'C': np.random.randint(0, 1000, 100000)
})

import time

# Method 1: Single column sort
start = time.time()
sorted1 = large_df.sort_values('A')
print(f"Single column sort: {time.time() - start:.4f} seconds")

# Method 2: Multiple column sort
start = time.time()
sorted2 = large_df.sort_values(['B', 'A'])
print(f"Multiple column sort: {time.time() - start:.4f} seconds")

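# Method 3: In-place sort (the timing below includes the copy; inplace=True does
# not necessarily save memory, since pandas still builds the sorted data internally)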
start = time.time()
large_df_copy = large_df.copy()
large_df_copy.sort_values('A', inplace=True)
print(f"In-place sort: {time.time() - start:.4f} seconds")

Single column sort: 0.0135 seconds
Multiple column sort: 0.0373 seconds
In-place sort: 0.0111 seconds

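When only the top few rows are needed, `nlargest()` can stand in for a full sort followed by `head()`. A small sketch, on made-up random data, that checks both approaches select the same rows; actual timings depend on the machine and pandas version, so none are quoted here:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 random floats (ties are vanishingly unlikely)
rng = np.random.default_rng(0)
big = pd.DataFrame({'A': rng.standard_normal(100_000)})

# Full sort, then keep the first five rows
top_sorted = big.sort_values('A', ascending=False).head(5)

# Ask directly for the five largest values
top_direct = big.nlargest(5, 'A')

print(top_sorted.index.equals(top_direct.index))  # True
```

To compare speed on your own data, wrap each expression in `timeit.timeit(lambda: ..., number=20)`.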

## Advanced Sorting Patterns

Complex sorting scenarios:

In [32]:
# Sort by a computed value without adding it as a column (descending, via [::-1])
# R: arrange(df, desc(salary / years_exp))
df.iloc[df.eval('salary / years_exp').argsort()[::-1]]

Unnamed: 0,name,department,salary,years_exp,performance,hire_date,salary_rank,dept_salary_rank
2,Charlie,HR,65000,3,3.8,2021-01-10,1.0,2.0
5,Frank,HR,68000,4,3.9,2020-04-05,2.0,1.0
0,Alice,Sales,70000,5,4.2,2019-03-15,3.0,2.0
3,David,Sales,72000,6,4.0,2018-09-20,4.0,1.0
1,Bob,IT,85000,8,4.5,2016-06-01,5.0,2.0
4,Eve,IT,90000,10,4.7,2014-11-30,6.0,1.0


In [33]:
# Sort by aggregated values
# Sort employees by their department's average salary
# R: df %>% 
#     group_by(department) %>% 
#     mutate(dept_avg = mean(salary)) %>% 
#     arrange(desc(dept_avg), desc(salary))
dept_avg = df.groupby('department')['salary'].transform('mean')
df.assign(dept_avg=dept_avg).sort_values(['dept_avg', 'salary'], 
                                         ascending=[False, False])

Unnamed: 0,name,department,salary,years_exp,performance,hire_date,salary_rank,dept_salary_rank,dept_avg
4,Eve,IT,90000,10,4.7,2014-11-30,6.0,1.0,87500.0
1,Bob,IT,85000,8,4.5,2016-06-01,5.0,2.0,87500.0
3,David,Sales,72000,6,4.0,2018-09-20,4.0,1.0,71000.0
0,Alice,Sales,70000,5,4.2,2019-03-15,3.0,2.0,71000.0
5,Frank,HR,68000,4,3.9,2020-04-05,2.0,1.0,66500.0
2,Charlie,HR,65000,3,3.8,2021-01-10,1.0,2.0,66500.0


In [34]:
# Custom sort order for categorical-like data
# Define custom order for departments
dept_order = {'IT': 0, 'Sales': 1, 'HR': 2}
df.assign(dept_order=df['department'].map(dept_order)).sort_values('dept_order').drop(columns='dept_order')

Unnamed: 0,name,department,salary,years_exp,performance,hire_date,salary_rank,dept_salary_rank
1,Bob,IT,85000,8,4.5,2016-06-01,5.0,2.0
4,Eve,IT,90000,10,4.7,2014-11-30,6.0,1.0
0,Alice,Sales,70000,5,4.2,2019-03-15,3.0,2.0
3,David,Sales,72000,6,4.0,2018-09-20,4.0,1.0
2,Charlie,HR,65000,3,3.8,2021-01-10,1.0,2.0
5,Frank,HR,68000,4,3.9,2020-04-05,2.0,1.0

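The temporary column can be avoided. Two alternatives, sketched here on a made-up `staff` frame, are to pass the mapping through the `key` parameter or to convert the column to an ordered `Categorical`; `kind='stable'` is used so that rows within a department keep their original order:

```python
import pandas as pd

# Hypothetical toy data for illustration
staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'department': ['Sales', 'IT', 'HR', 'Sales', 'IT', 'HR'],
})
dept_order = {'IT': 0, 'Sales': 1, 'HR': 2}

# Option 1: map each department to its position inside the key function
by_key = staff.sort_values('department',
                           key=lambda s: s.map(dept_order),
                           kind='stable')

# Option 2: make the order part of the column's dtype
as_cat = staff.assign(
    department=pd.Categorical(staff['department'],
                              categories=['IT', 'Sales', 'HR'],
                              ordered=True))
by_cat = as_cat.sort_values('department', kind='stable')

print(by_key['name'].tolist())  # ['Bob', 'Eve', 'Alice', 'David', 'Charlie', 'Frank']
```

The categorical version is closest to sorting an R factor with custom levels, and the order then applies to every later sort on that column.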

## Creating Tidyverse-Style Helper Functions

Make sorting more dplyr-like:

In [35]:
def arrange(df, *args):
    """Mimics dplyr's arrange(); prefix a column name with '-' for descending."""
    columns = []
    ascending_list = []
    
    for arg in args:
        if arg.startswith('-'):
            # '-salary' means sort by 'salary' in descending order
            columns.append(arg[1:])
            ascending_list.append(False)
        else:
            columns.append(arg)
            ascending_list.append(True)
    
    return df.sort_values(columns, ascending=ascending_list)

# Usage examples
# R: arrange(df, department, desc(salary))
arrange(df, 'department', '-salary')

Unnamed: 0,name,department,salary,years_exp,performance,hire_date,salary_rank,dept_salary_rank
5,Frank,HR,68000,4,3.9,2020-04-05,2.0,1.0
2,Charlie,HR,65000,3,3.8,2021-01-10,1.0,2.0
4,Eve,IT,90000,10,4.7,2014-11-30,6.0,1.0
1,Bob,IT,85000,8,4.5,2016-06-01,5.0,2.0
3,David,Sales,72000,6,4.0,2018-09-20,4.0,1.0
0,Alice,Sales,70000,5,4.2,2019-03-15,3.0,2.0

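The `'-'` prefix convention is compact, but it cannot express an ascending sort on a column whose name really starts with `-`. A possible variation, sketched with a hypothetical `desc` marker class (not part of pandas) and a made-up `staff` frame, keeps the call closer to dplyr and works with `DataFrame.pipe()`:

```python
import pandas as pd

class desc(str):
    """Hypothetical marker: wrap a column name to request descending order."""

def arrange(df, *cols):
    """Sort by the given columns; columns wrapped in desc() sort descending."""
    names = [str(c) for c in cols]                        # plain column names
    ascending = [not isinstance(c, desc) for c in cols]   # one flag per column
    return df.sort_values(names, ascending=ascending)

# Hypothetical toy data for illustration
staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'department': ['Sales', 'IT', 'HR', 'Sales', 'IT', 'HR'],
    'salary': [70000, 85000, 65000, 72000, 90000, 68000],
})

# R: staff %>% arrange(department, desc(salary))
result = staff.pipe(arrange, 'department', desc('salary'))

print(result['name'].tolist())  # ['Frank', 'Charlie', 'Eve', 'Bob', 'David', 'Alice']
```

Using `.pipe()` lets the helper sit inside a method chain alongside `query()` and `assign()`.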

In [36]:
def slice_max(df, column, n=5):
    """Mimics dplyr's slice_max (without ties, like with_ties = FALSE)"""
    return df.nlargest(n, column)

def slice_min(df, column, n=5):
    """Mimics dplyr's slice_min (without ties, like with_ties = FALSE)"""
    return df.nsmallest(n, column)

# Usage
slice_max(df, 'salary', n=3)

Unnamed: 0,name,department,salary,years_exp,performance,hire_date,salary_rank,dept_salary_rank
4,Eve,IT,90000,10,4.7,2014-11-30,6.0,1.0
1,Bob,IT,85000,8,4.5,2016-06-01,5.0,2.0
3,David,Sales,72000,6,4.0,2018-09-20,4.0,1.0
