# Selecting Columns

Column selection is one of the most fundamental operations in data analysis. While R's `dplyr::select()` provides a clean, consistent interface for column selection, pandas offers multiple approaches that can be more flexible but also more confusing initially. This chapter will show you how to achieve tidyverse-style column selection in pandas.

In [1]:
import pandas as pd
import numpy as np

## Best Practices Summary

Here's a quick reference for column selection patterns:

| Task | R (dplyr) | Pandas |
|------|-----------|--------|
| Select columns | `select(df, col1, col2)` | `df[['col1', 'col2']]` |
| Select range | `select(df, col1:col3)` | `df.loc[:, 'col1':'col3']` |
| Contains pattern | `select(df, contains("test"))` | `df.filter(like='test')` |
| Starts with | `select(df, starts_with("test"))` | `df.filter(regex='^test')` |
| Ends with | `select(df, ends_with("test"))` | `df.filter(regex='test$')` |
| By type | `select(df, where(is.numeric))` | `df.select_dtypes(include='number')` |
| Exclude columns | `select(df, -c(col1, col2))` | `df.drop(columns=['col1', 'col2'])` |
| Reorder | `select(df, col2, col1, everything())` | `df[['col2', 'col1'] + other_cols]` |

## Basic Column Selection

The simplest forms of column selection in pandas:

In [2]:
# Create sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 33],
    'salary': [70000, 80000, 75000, 72000, 85000],
    'department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],
    'years_experience': [2, 5, 8, 3, 7],
    'performance_rating': [4.2, 4.5, 3.8, 4.0, 4.6]
})

df

Unnamed: 0,name,age,salary,department,years_experience,performance_rating
0,Alice,25,70000,Sales,2,4.2
1,Bob,30,80000,IT,5,4.5
2,Charlie,35,75000,HR,8,3.8
3,David,28,72000,Sales,3,4.0
4,Eve,33,85000,IT,7,4.6


### Single column selection (returns Series)

In [4]:
# R: df$name or pull(df, name); select(df, name) keeps a data frame, like df[['name']]
df['name']

0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: name, dtype: object
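
The difference between a single label and a one-element list matters when later code expects a DataFrame. A minimal sketch, using a hypothetical two-column frame `people` rather than the chapter's `df`:

```python
import pandas as pd

# Hypothetical data for illustration only
people = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

as_series = people['name']    # single label -> Series (like df$name in R)
as_frame = people[['name']]   # one-element list -> DataFrame (like select(df, name))

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```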

### Multiple columns (returns DataFrame)

In [5]:
# R: select(df, name, age, salary)
df[['name', 'age', 'salary']]

Unnamed: 0,name,age,salary
0,Alice,25,70000
1,Bob,30,80000
2,Charlie,35,75000
3,David,28,72000
4,Eve,33,85000


In [6]:
# Using variables for column names
cols_to_select = ['name', 'department']
df[cols_to_select]

Unnamed: 0,name,department
0,Alice,Sales
1,Bob,IT
2,Charlie,HR
3,David,Sales
4,Eve,IT


## Column Selection Methods Comparison

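Pandas offers several ways to select columns. The four below return the same result here; they differ in whether they select by label or by position, and in how they react to a name that does not exist (brackets and `.loc` raise `KeyError`, while `.filter(items=...)` silently skips it):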

In [7]:
# Method 1: Bracket notation (most common)
df[['name', 'age']]

Unnamed: 0,name,age
0,Alice,25
1,Bob,30
2,Charlie,35
3,David,28
4,Eve,33


In [8]:
# Method 2: .loc accessor (more explicit)
df.loc[:, ['name', 'age']]

Unnamed: 0,name,age
0,Alice,25
1,Bob,30
2,Charlie,35
3,David,28
4,Eve,33


In [10]:
# Method 3: .filter() method (regex capable)
df.filter(items=['name', 'age'])

Unnamed: 0,name,age
0,Alice,25
1,Bob,30
2,Charlie,35
3,David,28
4,Eve,33


In [11]:
# Method 4: Using column positions with iloc
df.iloc[:, [0, 1]]  # First two columns

Unnamed: 0,name,age
0,Alice,25
1,Bob,30
2,Charlie,35
3,David,28
4,Eve,33
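
One practical difference among these methods is how they treat a missing column name. The sketch below uses a hypothetical frame `demo` and a name, `height`, that it does not contain:

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
wanted = ['name', 'height']  # 'height' is not a column of demo

try:
    demo[wanted]
    bracket_raised = False
except KeyError:
    bracket_raised = True

try:
    demo.loc[:, wanted]
    loc_raised = False
except KeyError:
    loc_raised = True

# .filter(items=...) keeps only the names it finds
filtered = demo.filter(items=wanted)

print(bracket_raised, loc_raised)  # True True
print(list(filtered.columns))      # ['name']
```

Which behavior is preferable depends on the situation: an error catches typos early, while the silent version is convenient when a column is optional.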


## Selecting Column Ranges

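Like dplyr's `select(df, col1:col3)`, pandas can slice a contiguous range of columns. Label slices with `.loc` include both endpoints, while positional slices with `.iloc` exclude the stop position: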

In [13]:
# Column slicing with loc
# R: select(df, name:department)  # if using tidyselect
df.loc[:, 'name':'department']

Unnamed: 0,name,age,salary,department
0,Alice,25,70000,Sales
1,Bob,30,80000,IT
2,Charlie,35,75000,HR
3,David,28,72000,Sales
4,Eve,33,85000,IT


In [14]:
# Using integer positions
# R: select(df, 1:3)
df.iloc[:, 0:3] # Note: 0:3 means columns 0, 1, 2

Unnamed: 0,name,age,salary
0,Alice,25,70000
1,Bob,30,80000
2,Charlie,35,75000
3,David,28,72000
4,Eve,33,85000


In [15]:
# From column to end
df.loc[:, 'salary':]

Unnamed: 0,salary,department,years_experience,performance_rating
0,70000,Sales,2,4.2
1,80000,IT,5,4.5
2,75000,HR,8,3.8
3,72000,Sales,3,4.0
4,85000,IT,7,4.6
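
The endpoint rules are easy to mix up, so here is a small side-by-side sketch on a hypothetical four-column frame `demo`. It also shows `Index.get_loc`, which translates a label into a position when the two styles need to be combined:

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})

by_label = demo.loc[:, 'b':'c']   # label slice: both endpoints included
by_position = demo.iloc[:, 1:3]   # positional slice: stop is excluded
every_other = demo.iloc[:, ::2]   # positional slices accept a step

# Translate a label into a position
start = demo.columns.get_loc('b')
from_b = demo.iloc[:, start:]

print(list(by_label.columns))     # ['b', 'c']
print(list(by_position.columns))  # ['b', 'c']
print(list(every_other.columns))  # ['a', 'c']
print(list(from_b.columns))       # ['b', 'c', 'd']
```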


## Pattern-Based Selection

One area where pandas shines is pattern-based column selection:

In [16]:
# Create DataFrame with patterned column names
df_wide = pd.DataFrame({
    'id': [1, 2, 3],
    'test_score_math': [85, 92, 78],
    'test_score_english': [88, 85, 92],
    'test_score_science': [90, 88, 85],
    'final_grade_math': [87, 93, 80],
    'final_grade_english': [89, 86, 93],
    'student_name': ['Alice', 'Bob', 'Charlie']
})

df_wide

Unnamed: 0,id,test_score_math,test_score_english,test_score_science,final_grade_math,final_grade_english,student_name
0,1,85,88,90,87,89,Alice
1,2,92,85,88,93,86,Bob
2,3,78,92,85,80,93,Charlie


In [17]:
# Select columns containing 'test'
df_wide.filter(like='test')

Unnamed: 0,test_score_math,test_score_english,test_score_science
0,85,88,90
1,92,85,88
2,78,92,85


In [18]:
# Select columns starting with 'test'
df_wide.filter(regex='^test')

Unnamed: 0,test_score_math,test_score_english,test_score_science
0,85,88,90
1,92,85,88
2,78,92,85


In [19]:
# Select columns ending with 'math'
df_wide.filter(regex='math$')

Unnamed: 0,test_score_math,final_grade_math
0,85,87
1,92,93
2,78,80


In [20]:
# Complex regex pattern
df_wide.filter(regex='score|grade')

Unnamed: 0,test_score_math,test_score_english,test_score_science,final_grade_math,final_grade_english
0,85,88,90,87,89
1,92,85,88,93,86
2,78,92,85,80,93
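
An alternative to `.filter()` is a boolean mask built from the column index with the `.str` accessor. Masks use literal matching (no regex escaping) and can be combined with `&`, `|`, and `~`. A sketch on a hypothetical frame `scores`:

```python
import pandas as pd

# Hypothetical data for illustration only
scores = pd.DataFrame({
    'id': [1, 2],
    'test_score_math': [85, 92],
    'final_grade_math': [87, 93],
    'student_name': ['Alice', 'Bob'],
})

# Literal suffix/prefix checks on the column names
is_math = scores.columns.str.endswith('math')
is_test = scores.columns.str.startswith('test')

math_cols = scores.loc[:, is_math]
math_not_test = scores.loc[:, is_math & ~is_test]  # combine conditions

print(list(math_cols.columns))      # ['test_score_math', 'final_grade_math']
print(list(math_not_test.columns))  # ['final_grade_math']
```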


## Selecting by Data Type

`select_dtypes()` picks columns by data type, the counterpart of dplyr's `select(df, where(is.numeric))`:

In [21]:
# DataFrame with mixed types
df_mixed = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [70000.5, 80000.0, 75000.25],
    'is_manager': [False, True, False],
    'department': ['Sales', 'IT', 'HR'],
    'hire_date': pd.to_datetime(['2020-01-15', '2019-03-22', '2018-07-01']),
    'employee_id': [1001, 1002, 1003]
})

df_mixed

Unnamed: 0,name,age,salary,is_manager,department,hire_date,employee_id
0,Alice,25,70000.5,False,Sales,2020-01-15,1001
1,Bob,30,80000.0,True,IT,2019-03-22,1002
2,Charlie,35,75000.25,False,HR,2018-07-01,1003


In [22]:
# Select numeric columns
df_mixed.select_dtypes(include='number')

Unnamed: 0,age,salary,employee_id
0,25,70000.5,1001
1,30,80000.0,1002
2,35,75000.25,1003


In [23]:
# Select string columns (stored as object dtype by default in pandas 2.x and earlier)
df_mixed.select_dtypes(include='object')

Unnamed: 0,name,department
0,Alice,Sales
1,Bob,IT
2,Charlie,HR


In [24]:
# Select multiple types
df_mixed.select_dtypes(include=['number', 'bool'])

Unnamed: 0,age,salary,is_manager,employee_id
0,25,70000.5,False,1001
1,30,80000.0,True,1002
2,35,75000.25,False,1003


In [25]:
# Exclude certain types
df_mixed.select_dtypes(exclude='datetime')

Unnamed: 0,name,age,salary,is_manager,department,employee_id
0,Alice,25,70000.5,False,Sales,1001
1,Bob,30,80000.0,True,IT,1002
2,Charlie,35,75000.25,False,HR,1003
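
dplyr's `where()` accepts any predicate, not only type checks. A rough pandas analogue is a small helper that tests each column Series; `select_where` below is a hypothetical name, not a pandas function:

```python
import pandas as pd

def select_where(df, predicate):
    """Keep columns for which predicate(column Series) is truthy (hypothetical helper)."""
    keep = [col for col in df.columns if predicate(df[col])]
    return df[keep]

# Hypothetical data for illustration only
staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'dept': ['IT', 'IT', 'HR'],
})

# Columns with fewer than three distinct values
low_cardinality = select_where(staff, lambda s: s.nunique() < 3)

# A type predicate from pandas.api.types works the same way
numeric = select_where(staff, pd.api.types.is_numeric_dtype)

print(list(low_cardinality.columns))  # ['dept']
print(list(numeric.columns))          # ['age']
```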


## Creating Tidyverse-Style Helper Functions

To make column selection more tidyverse-like, we can create helper functions:

In [None]:
import re

def select(df, *args):
    """Select columns tidyverse-style."""
    if len(args) == 1 and isinstance(args[0], list):
        return df[args[0]]
    return df[list(args)]

def select_contains(df, pattern):
    """Select columns containing a literal substring."""
    return df.filter(like=pattern)

def select_starts_with(df, prefix):
    """Select columns starting with a literal prefix."""
    # re.escape keeps characters such as '.' or '(' from acting as regex syntax
    return df.filter(regex=f'^{re.escape(prefix)}')

def select_ends_with(df, suffix):
    """Select columns ending with a literal suffix."""
    return df.filter(regex=f'{re.escape(suffix)}$')

def select_matches(df, pattern):
    """Select columns matching regex pattern."""
    return df.filter(regex=pattern)

In [None]:
# Usage examples
print("Using select() helper:")
print(select(df, 'name', 'age', 'salary'))

In [None]:
print("Using select_contains():")
print(select_contains(df_wide, 'score'))
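
Helpers written with the DataFrame as their first argument also slot into method chains through `.pipe()`, which reads much like `%>%`. A self-contained sketch with a hypothetical frame `grades` and its own copy of a prefix helper:

```python
import re
import pandas as pd

def select_starts_with(df, prefix):
    """Select columns whose names start with a literal prefix."""
    return df.filter(regex=f'^{re.escape(prefix)}')

# Hypothetical data for illustration only
grades = pd.DataFrame({
    'id': [1, 2],
    'test_math': [85, 92],
    'test_english': [88, 85],
    'final_math': [87, 93],
})

result = (grades
          .pipe(select_starts_with, 'test')              # like %>% select(starts_with("test"))
          .assign(test_total=lambda d: d.sum(axis=1)))   # row sum of the selected columns

print(list(result.columns))         # ['test_math', 'test_english', 'test_total']
print(result['test_total'].tolist())  # [173, 177]
```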

## Excluding Columns

Sometimes it's easier to specify what you don't want:

In [26]:
# Exclude specific columns
df.drop(columns=['age', 'salary'])

Unnamed: 0,name,department,years_experience,performance_rating
0,Alice,Sales,2,4.2
1,Bob,IT,5,4.5
2,Charlie,HR,8,3.8
3,David,Sales,3,4.0
4,Eve,IT,7,4.6


In [27]:
# Alternative approach
all_cols = df.columns.tolist()
exclude = ['age', 'salary']
keep_cols = [col for col in all_cols if col not in exclude]

df[keep_cols]

Unnamed: 0,name,department,years_experience,performance_rating
0,Alice,Sales,2,4.2
1,Bob,IT,5,4.5
2,Charlie,HR,8,3.8
3,David,Sales,3,4.0
4,Eve,IT,7,4.6


In [28]:
# Exclude by pattern
cols_without_rating = df.columns[~df.columns.str.contains('rating')]
df[cols_without_rating]

Unnamed: 0,name,age,salary,department,years_experience
0,Alice,25,70000,Sales,2
1,Bob,30,80000,IT,5
2,Charlie,35,75000,HR,8
3,David,28,72000,Sales,3
4,Eve,33,85000,IT,7
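
`drop()` raises `KeyError` when a listed column is absent. Two ways to exclude columns that may or may not exist, sketched on a hypothetical frame `demo` (the `bonus` column is deliberately missing):

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({'name': ['Alice'], 'age': [25], 'salary': [70000]})

# errors='ignore' skips absent names instead of raising KeyError
dropped = demo.drop(columns=['age', 'bonus'], errors='ignore')

# A boolean mask over the column index; absent names are harmless here too
masked = demo.loc[:, ~demo.columns.isin(['age', 'bonus'])]

print(list(dropped.columns))  # ['name', 'salary']
print(list(masked.columns))   # ['name', 'salary']
```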


## Reordering Columns

Selecting columns in a specific order:

In [29]:
# Reorder columns
df[['department', 'name', 'age', 'salary', 'years_experience', 'performance_rating']]

Unnamed: 0,department,name,age,salary,years_experience,performance_rating
0,Sales,Alice,25,70000,2,4.2
1,IT,Bob,30,80000,5,4.5
2,HR,Charlie,35,75000,8,3.8
3,Sales,David,28,72000,3,4.0
4,IT,Eve,33,85000,7,4.6


In [30]:
# Move specific columns to front
def move_to_front(df, cols):
    """Move specified columns to front."""
    cols = [cols] if isinstance(cols, str) else cols
    other_cols = [c for c in df.columns if c not in cols]
    return df[cols + other_cols]

move_to_front(df, ['department', 'name'])

Unnamed: 0,department,name,age,salary,years_experience,performance_rating
0,Sales,Alice,25,70000,2,4.2
1,IT,Bob,30,80000,5,4.5
2,HR,Charlie,35,75000,8,3.8
3,Sales,David,28,72000,3,4.0
4,IT,Eve,33,85000,7,4.6


In [31]:
# Sort columns alphabetically
df[sorted(df.columns)]

Unnamed: 0,age,department,name,performance_rating,salary,years_experience
0,25,Sales,Alice,4.2,70000,2
1,30,IT,Bob,4.5,80000,5
2,35,HR,Charlie,3.8,75000,8
3,28,Sales,David,4.0,72000,3
4,33,IT,Eve,4.6,85000,7


## Renaming While Selecting

Combining selection with renaming:

In [32]:
# Select and rename in one step
df[['name', 'department']].rename(columns={
    'name': 'employee',
    'department': 'dept'
})

Unnamed: 0,employee,dept
0,Alice,Sales
1,Bob,IT
2,Charlie,HR
3,David,Sales
4,Eve,IT


In [33]:
# Prefix every selected column name in one chained step
numeric_df = df.select_dtypes(include='number').add_prefix('num_')
numeric_df

Unnamed: 0,num_age,num_salary,num_years_experience,num_performance_rating
0,25,70000,2,4.2
1,30,80000,5,4.5
2,35,75000,8,3.8
3,28,72000,3,4.0
4,33,85000,7,4.6
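
dplyr allows `select(df, employee = name)` to pick and rename at once. One way to approximate that is a helper driven by a dict of old-to-new names; `select_rename` is a hypothetical name used only for this sketch:

```python
import pandas as pd

def select_rename(df, mapping):
    """Select the keys of mapping, in order, and rename them to its values."""
    return df[list(mapping)].rename(columns=mapping)

# Hypothetical data for illustration only
demo = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'department': ['Sales', 'IT'],
})

out = select_rename(demo, {'department': 'dept', 'name': 'employee'})

print(list(out.columns))  # ['dept', 'employee']
```

Because dicts preserve insertion order, the mapping also fixes the order of the output columns.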


## Column Selection in Method Chains

Integrating column selection into pandas method chains:

In [34]:
# Create a larger dataset
np.random.seed(42)
df_large = pd.DataFrame({
    'employee_id': range(1000, 1020),
    'first_name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'] * 4,
    'last_name': ['Smith', 'Jones', 'Wilson', 'Brown', 'Davis'] * 4,
    'age': np.random.randint(25, 55, 20),
    'salary_base': np.random.randint(50000, 90000, 20),
    'salary_bonus': np.random.randint(5000, 15000, 20),
    'dept_code': np.random.choice(['S', 'I', 'H', 'F'], 20),
    'dept_name': np.random.choice(['Sales', 'IT', 'HR', 'Finance'], 20),
    'years_exp': np.random.randint(1, 15, 20),
    'perf_q1': np.random.uniform(3, 5, 20),
    'perf_q2': np.random.uniform(3, 5, 20),
    'perf_q3': np.random.uniform(3, 5, 20),
    'perf_q4': np.random.uniform(3, 5, 20)
})

df_large.head()

Unnamed: 0,employee_id,first_name,last_name,age,salary_base,salary_bonus,dept_code,dept_name,years_exp,perf_q1,perf_q2,perf_q3,perf_q4
0,1000,Alice,Smith,31,51685,6528,F,HR,13,4.21192,4.421326,3.436881,4.116204
1,1001,Bob,Jones,44,50769,8556,I,Finance,9,4.852602,3.221782,3.83302,3.807672
2,1002,Charlie,Wilson,53,52433,8890,I,IT,13,4.302154,3.878673,4.766561,3.129784
3,1003,David,Brown,39,55311,13838,I,HR,13,4.829919,3.403438,3.64869,3.507831
4,1004,Eve,Davis,35,87819,10393,I,Finance,1,4.700077,4.791527,3.244176,3.493752


In [35]:
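# Chain with column selection
# R: df %>% select(contains("name"), starts_with("salary")) %>% head()
# Note: .filter() keeps the original column order, so dept_name comes last here,
# whereas dplyr would place all the "name" matches first
result = (df_large
          .filter(regex='name|^salary')
          .head())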

result

Unnamed: 0,first_name,last_name,salary_base,salary_bonus,dept_name
0,Alice,Smith,51685,6528,HR
1,Bob,Jones,50769,8556,Finance
2,Charlie,Wilson,52433,8890,IT
3,David,Brown,55311,13838,HR
4,Eve,Davis,87819,10393,Finance


In [None]:
# More complex chain
# R: df %>% 
#     select(employee_id, first_name, last_name, starts_with("perf")) %>%
#     mutate(perf_avg = rowMeans(select(., starts_with("perf"))))
result = (df_large
          .filter(regex='^(employee_id|first_name|last_name|perf)')
          # Row means in `perf` cols
          .assign(perf_avg=lambda x: x.filter(like='perf').mean(axis=1))
          .round(2)
          .head())

result

Unnamed: 0,employee_id,first_name,last_name,perf_q1,perf_q2,perf_q3,perf_q4,perf_avg
0,1000,Alice,Smith,4.21,4.42,3.44,4.12,4.05
1,1001,Bob,Jones,4.85,3.22,3.83,3.81,3.93
2,1002,Charlie,Wilson,4.3,3.88,4.77,3.13,4.02
3,1003,David,Brown,4.83,3.4,3.65,3.51,3.85
4,1004,Eve,Davis,4.7,4.79,3.24,3.49,4.06
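
Inside a chain, the intermediate frame has no variable name, so a mask such as `df.columns.str...` would refer to the wrong object. `.loc` accepts a callable that receives the frame at that point in the chain. A sketch on a hypothetical frame `reviews`, where the columns are renamed before selection:

```python
import pandas as pd

# Hypothetical data for illustration only
reviews = pd.DataFrame({
    'employee_id': [1000, 1001],
    'perf_q1': [4.0, 3.0],
    'perf_q2': [5.0, 4.0],
    'dept_name': ['HR', 'IT'],
})

result = (reviews
          .rename(columns=str.upper)
          # The lambda sees the renamed frame, not the original `reviews`
          .loc[:, lambda d: d.columns.str.startswith('PERF')]
          .assign(PERF_AVG=lambda d: d.mean(axis=1)))

print(list(result.columns))        # ['PERF_Q1', 'PERF_Q2', 'PERF_AVG']
print(result['PERF_AVG'].tolist())  # [4.5, 3.5]
```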


## Advanced Column Selection Patterns

Some advanced patterns for complex selections:

In [39]:
# Select columns based on a condition
# Get columns where any value is greater than 100
high_value_cols = [col for col in df.select_dtypes(include='number').columns 
                   if (df[col] > 100).any()]
df[high_value_cols]

Unnamed: 0,salary
0,70000
1,80000
2,75000
3,72000
4,85000


In [40]:
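# Select columns with low missing data
# Create sample with missing data; cast the integer columns to float first
# so they can hold NaN without an implicit dtype change
df_missing = df_large.astype({'age': 'float', 'salary_base': 'float'})
df_missing.loc[0:5, 'age'] = np.nan          # .loc slices are inclusive: rows 0-5
df_missing.loc[3:8, 'salary_base'] = np.nan  # rows 3-8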

# Select columns with less than 10% missing
missing_pct = df_missing.isna().sum() / len(df_missing)
low_missing_cols = missing_pct[missing_pct < 0.1].index.tolist()
print("Columns with <10% missing:")
print(df_missing[low_missing_cols].head())

Columns with <10% missing:
   employee_id first_name last_name  salary_bonus dept_code dept_name  \
0         1000      Alice     Smith          6528         F        HR   
1         1001        Bob     Jones          8556         I   Finance   
2         1002    Charlie    Wilson          8890         I        IT   
3         1003      David     Brown         13838         I        HR   
4         1004        Eve     Davis         10393         I   Finance   

   years_exp   perf_q1   perf_q2   perf_q3   perf_q4  
0         13  4.211920  4.421326  3.436881  4.116204  
1          9  4.852602  3.221782  3.833020  3.807672  
2         13  4.302154  3.878673  4.766561  3.129784  
3         13  4.829919  3.403438  3.648690  3.507831  
4          1  4.700077  4.791527  3.244176  3.493752  
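
The same idea can be written more compactly: `isna().mean()` gives the fraction of missing values per column, and the resulting boolean Series can index the columns directly through `.loc`. A sketch on a hypothetical frame `demo` with a 25% threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'age': [25.0, np.nan, np.nan, 31.0],   # 50% missing
    'score': [3.5, 4.0, np.nan, 4.5],      # 25% missing
})

missing_share = demo.isna().mean()          # fraction of NaN per column
kept = demo.loc[:, missing_share <= 0.25]   # boolean Series aligned on column labels

print(missing_share.to_dict())  # {'id': 0.0, 'age': 0.5, 'score': 0.25}
print(list(kept.columns))       # ['id', 'score']
```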


## Performance Considerations

In practice the list-based selection methods perform similarly. The rough benchmark below shows differences small enough to be timing noise:

In [41]:
# Create large DataFrame for timing
large_df = pd.DataFrame(
    np.random.randn(10000, 100), 
    columns=[f'col_{i}' for i in range(100)]
)

# Compare selection methods
import time

# Method 1: Bracket notation
start = time.time()
for _ in range(1000):
    _ = large_df[['col_0', 'col_10', 'col_20']]
bracket_time = time.time() - start

# Method 2: .loc
start = time.time()
for _ in range(1000):
    _ = large_df.loc[:, ['col_0', 'col_10', 'col_20']]
loc_time = time.time() - start

# Method 3: .filter()
start = time.time()
for _ in range(1000):
    _ = large_df.filter(items=['col_0', 'col_10', 'col_20'])
filter_time = time.time() - start

print(f"Bracket notation: {bracket_time:.4f} seconds")
print(f".loc accessor: {loc_time:.4f} seconds")
print(f".filter() method: {filter_time:.4f} seconds")

Bracket notation: 0.0903 seconds
.loc accessor: 0.0834 seconds
.filter() method: 0.0813 seconds
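
For steadier measurements than a single `time.time()` loop, the standard library's `timeit` module can repeat each measurement and report the best run. The sketch below assumes a smaller hypothetical frame so it finishes quickly; absolute numbers will vary by machine and pandas version, so no expected output is shown:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical frame, smaller than the one above to keep the run short
frame = pd.DataFrame(np.zeros((1000, 50)),
                     columns=[f'col_{i}' for i in range(50)])
cols = ['col_0', 'col_10', 'col_20']

candidates = {
    'bracket': lambda: frame[cols],
    '.loc': lambda: frame.loc[:, cols],
    '.filter': lambda: frame.filter(items=cols),
}

# Best of 5 repeats of 200 calls each; the minimum is least affected by noise
timings = {
    label: min(timeit.repeat(func, number=200, repeat=5))
    for label, func in candidates.items()
}

for label, seconds in timings.items():
    print(f'{label}: {seconds:.4f} s per 200 calls')
```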


## Tips for Tidyverse Users

1. **Think in lists**: Most pandas column selection operations expect lists of column names.

2. **Use `.filter()` for patterns**: It's the closest equivalent to tidyselect helpers.

3. **`.loc` for explicit selection**: When you want to be clear about selecting columns (not rows), use `.loc[:, columns]`.

4. **Create helper functions**: Build your own tidyverse-style functions for common patterns.

5. **Chain operations**: Use parentheses for multi-line chains:

In [42]:
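# Example of chaining with column selection
# df_wide is used here because it has the 'score' columns; df has none
result = (df_wide
         .filter(like='score')
         .assign(total=lambda x: x.sum(axis=1))
         .head())

print("Chained operations example:")
print(result)

Chained operations example:
   test_score_math  test_score_english  test_score_science  total
0               85                  88                  90    263
1               92                  85                  88    265
2               78                  92                  85    255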


Column selection in pandas is more verbose than dplyr's `select()`, but it's also more flexible. Master these patterns, and you'll find yourself writing cleaner, more maintainable code.