# Advanced Pandas Operations

In this lecture, we'll dive deep into advanced Pandas operations. We'll cover complex data manipulation techniques, advanced indexing, grouping, merging, and performance optimization. By the end of this lecture, you'll have a strong grasp of these powerful Pandas features.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)

## 1. Advanced Data Selection and Filtering

Pandas provides powerful tools for selecting and filtering data. We'll explore boolean indexing, .loc, .iloc, and complex conditional selection.

In [2]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 60, 10),
    'C': ['foo', 'bar', 'baz', 'qux', 'quux']
})
print(df)

   A   B     C
0  1  10   foo
1  2  20   bar
2  3  30   baz
3  4  40   qux
4  5  50  quux


### Boolean Indexing

In [3]:
# Select rows where A is greater than 2
print(df[df['A'] > 2])

# Select rows where B is less than 40 and C is not 'baz'
print(df[(df['B'] < 40) & (df['C'] != 'baz')])

   A   B     C
2  3  30   baz
3  4  40   qux
4  5  50  quux
   A   B    C
0  1  10  foo
1  2  20  bar


### .loc and .iloc

In [4]:
# .loc for label-based indexing (the end label 3 is included)
print(df.loc[1:3, ['A', 'C']])

# .iloc for integer position-based indexing (the end position 3 is excluded)
print(df.iloc[1:3, [0, 2]])

   A    C
1  2  bar
2  3  baz
3  4  qux
   A    C
1  2  bar
2  3  baz
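
The difference between the two outputs above comes from how each indexer treats the end of a slice. The following is a minimal sketch using a hypothetical frame (`demo`, not part of the lecture data) with string labels, so that labels and positions cannot be confused:

```python
import pandas as pd

# Hypothetical frame with string labels so labels and positions differ
demo = pd.DataFrame({'A': [1, 2, 3, 4]}, index=['w', 'x', 'y', 'z'])

# Label-based slice: both endpoints ('x' and 'y') are included
by_label = demo.loc['x':'y']

# Position-based slice: the end position (2) is excluded
by_position = demo.iloc[1:2]

print(by_label)
print(by_position)
```

With the default integer index used in this lecture the labels happen to equal the positions, which is why the two slices look so similar in the code but return a different number of rows.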


### Conditional Selection with Multiple Criteria

In [5]:
# Complex condition: & binds tighter than |, so group the AND part explicitly
condition = ((df['A'] > 2) & (df['B'] < 50)) | df['C'].isin(['foo', 'quux'])
print(df[condition])

   A   B     C
0  1  10   foo
2  3  30   baz
3  4  40   qux
4  5  50  quux
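
When `&` and `|` are mixed, the grouping decides which rows survive. The sketch below rebuilds the sample data under a hypothetical name (`df_demo`) and compares two groupings of the same three tests; the intermediate names `in_range` and `named` are illustrative only:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 60, 10),
    'C': ['foo', 'bar', 'baz', 'qux', 'quux']
})

# Name the building blocks so the final expression stays readable
in_range = (df_demo['A'] > 2) & (df_demo['B'] < 50)
named = df_demo['C'].isin(['foo', 'quux'])

# Grouping 1: (A > 2 AND B < 50) OR C in {'foo', 'quux'}
either = df_demo[in_range | named]

# Grouping 2: A > 2 AND (B < 50 OR C in {'foo', 'quux'})
regrouped = df_demo[(df_demo['A'] > 2) & ((df_demo['B'] < 50) | named)]

print(either.index.tolist())     # rows kept by grouping 1
print(regrouped.index.tolist())  # rows kept by grouping 2
```

Storing each test in a named mask is a design choice rather than a requirement; it mainly helps when a condition grows beyond two or three terms.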


## 2. Data Transformation

Pandas offers several ways to transform data: apply() works column by column or row by row, applymap() works element by element, and both accept named functions or lambdas.

### apply() Function

In [34]:
# Apply a function to each column
def column_operation(col):
    if pd.api.types.is_numeric_dtype(col):  # Numeric columns: value range
        return col.max() - col.min()
    lengths = col.str.len()  # Non-numeric columns: string lengths
    return f"Max length: {lengths.max()}, Min length: {lengths.min()}"

print(df.apply(column_operation))

# Apply a function to each row (row['A'] and row['B'] are scalars here,
# so no dtype check is needed)
print(df.apply(lambda row: row['A'] * row['B'], axis=1))


A                               4
B                              40
C    Max length: 4, Min length: 3
dtype: object
0     10
1     40
2     90
3    160
4    250
dtype: int64


### applymap() Function

In [7]:
# Apply a function to every element
# (applymap() is deprecated since pandas 2.1; DataFrame.map() is its replacement)
print(df.applymap(lambda x: str(x).upper()))

   A   B     C
0  1  10   FOO
1  2  20   BAR
2  3  30   BAZ
3  4  40   QUX
4  5  50  QUUX



### Creating New Columns Based on Conditions

In [8]:
# Create a new column based on a condition
df['D'] = np.where(df['A'] > 3, 'High', 'Low')
print(df)

   A   B     C     D
0  1  10   foo   Low
1  2  20   bar   Low
2  3  30   baz   Low
3  4  40   qux  High
4  5  50  quux  High
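
`np.where` covers a two-way split. For three or more outcomes, `np.select` takes a list of conditions and a matching list of values. The sketch below uses a hypothetical frame (`scores`) and illustrative thresholds that are not part of the lecture data:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'A': range(1, 6)})

# Conditions are checked in order; the first one that matches wins
conditions = [scores['A'] > 3, scores['A'] > 1]
labels = ['High', 'Medium']

# Rows matching no condition receive the default value
scores['level'] = np.select(conditions, labels, default='Low')
print(scores)
```

Because the first matching condition wins, the stricter test (`A > 3`) has to come before the looser one (`A > 1`).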


## 3. Grouping and Aggregation

GroupBy operations allow you to split your data into groups, apply functions, and combine the results.

In [9]:
# Create a new DataFrame for demonstration
df2 = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [10, 20, 30, 40, 50, 60],
    'Value2': [100, 200, 300, 400, 500, 600]
})
print(df2)

  Category  Value1  Value2
0        A      10     100
1        B      20     200
2        A      30     300
3        B      40     400
4        A      50     500
5        B      60     600


### Basic GroupBy Operations

In [10]:
# Group by Category and calculate mean
print(df2.groupby('Category').mean())

# Group by Category and apply multiple aggregations
print(df2.groupby('Category').agg(['mean', 'sum', 'count']))

          Value1  Value2
Category                
A           30.0   300.0
B           40.0   400.0
         Value1            Value2            
           mean  sum count   mean   sum count
Category                                     
A          30.0   90     3  300.0   900     3
B          40.0  120     3  400.0  1200     3
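
Passing a list to `agg` produces two-level column names, as in the second output above. If flat, descriptive column names are preferred, named aggregation is one option. This sketch rebuilds the data under a hypothetical name (`sales`); the output column names are illustrative:

```python
import pandas as pd

sales = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [10, 20, 30, 40, 50, 60],
    'Value2': [100, 200, 300, 400, 500, 600]
})

# Each keyword is an output column: (input column, aggregation)
summary = sales.groupby('Category').agg(
    value1_mean=('Value1', 'mean'),
    value2_total=('Value2', 'sum'),
    n_rows=('Value1', 'count'),
)
print(summary)
```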


### Advanced Aggregation Techniques

In [11]:
# Custom aggregation function: receives one group's values as a Series
def custom_agg(x):
    return pd.Series({
        'mean': x.mean(),
        'median': x.median(),
        'std': x.std()
    })

# Apply to a single column; unstack() turns the statistic names into columns
print(df2.groupby('Category')['Value1'].apply(custom_agg).unstack())

          mean  median   std
Category                    
A         30.0    30.0  20.0
B         40.0    40.0  20.0


### Pivot Tables and Cross-tabulations

In [12]:
# Create a pivot table
df3 = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250]
})

pivot = pd.pivot_table(df3, values='Sales', index='Date', columns='Product', aggfunc='sum')
print(pivot)

# Cross-tabulation
print(pd.crosstab(df3['Date'], df3['Product'], values=df3['Sales'], aggfunc='sum'))

Product       A    B
Date                
2023-01-01  100  200
2023-01-02  150  250
Product       A    B
Date                
2023-01-01  100  200
2023-01-02  150  250
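
Pivot tables can also carry row and column totals. The following sketch, which rebuilds the sales data under a hypothetical name (`sales`), uses the `margins` and `margins_name` parameters of `pd.pivot_table`; the label `'Total'` is an arbitrary choice:

```python
import pandas as pd

sales = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250]
})

# margins=True appends a totals row and a totals column
with_totals = pd.pivot_table(
    sales,
    values='Sales',
    index='Date',
    columns='Product',
    aggfunc='sum',
    margins=True,
    margins_name='Total',
)
print(with_totals)
```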


## 4. Handling Missing Data

Dealing with missing data is a common task in data analysis. Pandas provides various methods to handle missing values effectively.

In [13]:
# Create a DataFrame with missing values
df4 = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, 40, 50],
    'C': ['a', 'b', 'c', np.nan, 'e']
})
print(df4)

     A     B    C
0  1.0  10.0    a
1  2.0   NaN    b
2  NaN  30.0    c
3  4.0  40.0  NaN
4  5.0  50.0    e
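
Before filling anything, it usually helps to see how much is missing and where. A minimal sketch, assuming the same values as above but under a hypothetical name (`frame`):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, 40, 50],
    'C': ['a', 'b', 'c', np.nan, 'e']
})

# Count missing values per column
missing_per_column = frame.isna().sum()
print(missing_per_column)

# Keep only rows with no missing values at all
complete_rows = frame.dropna()

# Keep rows where column 'A' is present, whatever the other columns hold
rows_with_a = frame.dropna(subset=['A'])

print(len(complete_rows), len(rows_with_a))
```

Dropping rows is the simplest strategy but discards data; the filling and interpolation methods in this section keep every row instead.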


### Filling Missing Values

In [14]:
# Fill with a specific value
print(df4.fillna(0))

# Fill with different values for each column
print(df4.fillna({'A': 0, 'B': 100, 'C': 'Unknown'}))

# Forward fill (fillna(method='ffill') is deprecated since pandas 2.1)
print(df4.ffill())

# Backward fill (fillna(method='bfill') is deprecated since pandas 2.1)
print(df4.bfill())

     A     B  C
0  1.0  10.0  a
1  2.0   0.0  b
2  0.0  30.0  c
3  4.0  40.0  0
4  5.0  50.0  e
     A      B        C
0  1.0   10.0        a
1  2.0  100.0        b
2  0.0   30.0        c
3  4.0   40.0  Unknown
4  5.0   50.0        e
     A     B  C
0  1.0  10.0  a
1  2.0  10.0  b
2  2.0  30.0  c
3  4.0  40.0  c
4  5.0  50.0  e
     A     B  C
0  1.0  10.0  a
1  2.0  30.0  b
2  4.0  30.0  c
3  4.0  40.0  e
4  5.0  50.0  e



### Interpolation Methods

In [15]:
# Interpolation applies to numeric data, so select the numeric columns first
numeric_cols = df4[['A', 'B']]

# Linear interpolation
print(numeric_cols.interpolate())

# Polynomial interpolation (requires SciPy)
print(numeric_cols.interpolate(method='polynomial', order=2))

     A     B
0  1.0  10.0
1  2.0  20.0
2  3.0  30.0
3  4.0  40.0
4  5.0  50.0
     A     B
0  1.0  10.0
1  2.0  20.0
2  3.0  30.0
3  4.0  40.0
4  5.0  50.0


### Handling Missing Data in Time Series

In [16]:
# Create a time series with missing data
dates = pd.date_range('20230101', periods=6)
ts = pd.Series([1, np.nan, 3, np.nan, 5, 6], index=dates)
print(ts)

# Time-aware interpolation
print(ts.interpolate(method='time'))

2023-01-01    1.0
2023-01-02    NaN
2023-01-03    3.0
2023-01-04    NaN
2023-01-05    5.0
2023-01-06    6.0
Freq: D, dtype: float64
2023-01-01    1.0
2023-01-02    2.0
2023-01-03    3.0
2023-01-04    4.0
2023-01-05    5.0
2023-01-06    6.0
Freq: D, dtype: float64


## 5. Merging and Joining DataFrames

Pandas provides powerful tools for combining different DataFrames. We'll explore various types of joins and concatenation.

In [17]:
# Create sample DataFrames
df_left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df_right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [20, 40, 50, 60]})
print("Left DataFrame:")
print(df_left)
print("\nRight DataFrame:")
print(df_right)

Left DataFrame:
  key  value
0   A      1
1   B      2
2   C      3
3   D      4

Right DataFrame:
  key  value
0   B     20
1   D     40
2   E     50
3   F     60


### Different Types of Joins

In [18]:
# Inner join
print("Inner Join:")
print(pd.merge(df_left, df_right, on='key', how='inner'))

# Outer join
print("\nOuter Join:")
print(pd.merge(df_left, df_right, on='key', how='outer'))

# Left join
print("\nLeft Join:")
print(pd.merge(df_left, df_right, on='key', how='left'))

# Right join
print("\nRight Join:")
print(pd.merge(df_left, df_right, on='key', how='right'))

Inner Join:
  key  value_x  value_y
0   B        2       20
1   D        4       40

Outer Join:
  key  value_x  value_y
0   A      1.0      NaN
1   B      2.0     20.0
2   C      3.0      NaN
3   D      4.0     40.0
4   E      NaN     50.0
5   F      NaN     60.0

Left Join:
  key  value_x  value_y
0   A        1      NaN
1   B        2     20.0
2   C        3      NaN
3   D        4     40.0

Right Join:
  key  value_x  value_y
0   B      2.0       20
1   D      4.0       40
2   E      NaN       50
3   F      NaN       60
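
When a join introduces NaN values, it is often useful to know which side each row came from. The sketch below rebuilds the two frames under hypothetical names (`left`, `right`) and uses the `indicator`, `validate`, and `suffixes` parameters of `pd.merge`; the suffix strings are illustrative:

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [20, 40, 50, 60]})

merged = pd.merge(
    left,
    right,
    on='key',
    how='outer',
    suffixes=('_left', '_right'),  # clearer than the default _x / _y
    indicator=True,                # adds a '_merge' column naming the source
    validate='one_to_one',         # raises MergeError if keys are duplicated
)
print(merged)

# How many rows matched on both sides versus only one side
counts = merged['_merge'].value_counts()
print(counts)
```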


### Merging on Index

In [19]:
# Set 'key' as index
df_left.set_index('key', inplace=True)
df_right.set_index('key', inplace=True)

# Merge on index
print(df_left.join(df_right, how='outer', lsuffix='_left', rsuffix='_right'))

     value_left  value_right
key                         
A           1.0          NaN
B           2.0         20.0
C           3.0          NaN
D           4.0         40.0
E           NaN         50.0
F           NaN         60.0


### Concatenating DataFrames

In [20]:
# Vertical concatenation
print("Vertical Concatenation:")
print(pd.concat([df_left, df_right]))

# Horizontal concatenation
print("\nHorizontal Concatenation:")
print(pd.concat([df_left, df_right], axis=1))

Vertical Concatenation:
     value
key       
A        1
B        2
C        3
D        4
B       20
D       40
E       50
F       60

Horizontal Concatenation:
     value  value
key              
A      1.0    NaN
B      2.0   20.0
C      3.0    NaN
D      4.0   40.0
E      NaN   50.0
F      NaN   60.0
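
Vertical concatenation can leave duplicate index labels (B and D above), and horizontal concatenation can leave duplicate column names. Two parameters of `pd.concat` help with this; the sketch uses small hypothetical frames (`left`, `right`) rather than the lecture data:

```python
import pandas as pd

left = pd.DataFrame({'value': [1, 2]}, index=['A', 'B'])
right = pd.DataFrame({'value': [20, 40]}, index=['B', 'D'])

# keys= adds an outer index level recording which frame each row came from
stacked = pd.concat([left, right], keys=['left', 'right'])
print(stacked)

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([left, right], ignore_index=True)
print(renumbered)
```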


## 6. Time Series Operations

Pandas excels at handling time series data. We'll explore resampling, rolling windows, and other time-based operations.

In [21]:
# Create a time series DataFrame
dates = pd.date_range('20230101', periods=100, freq='D')
ts = pd.Series(np.random.randn(len(dates)), index=dates)
df_ts = pd.DataFrame({'value': ts})
print(df_ts.head())

               value
2023-01-01 -1.173940
2023-01-02 -0.154302
2023-01-03 -0.059296
2023-01-04  1.212269
2023-01-05 -0.249615


### Resampling

In [22]:
# Resample to monthly frequency ('M' is the month-end alias; pandas 2.2+ deprecates it in favor of 'ME')
print(df_ts.resample('M').mean())

# Resample to weekly frequency with sum
print(df_ts.resample('W').sum())

               value
2023-01-31 -0.033523
2023-02-28 -0.164839
2023-03-31 -0.090123
2023-04-30  0.156954
               value
2023-01-01 -1.173940
2023-01-08 -0.675408
2023-01-15  0.357052
2023-01-22  0.228703
2023-01-29 -0.106586
2023-02-05 -0.668872
2023-02-12 -1.702220
2023-02-19 -0.823237
2023-02-26 -0.963081
2023-03-05  0.153435
2023-03-12  1.946141
2023-03-19 -3.736399
2023-03-26 -0.989691
2023-04-02 -2.221668
2023-04-09  3.127429
2023-04-16  0.369376
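
Because the series above is random, its resampled values cannot be checked by hand. The following sketch uses a small deterministic series (hypothetical name `s`, values 1 to 14 starting on a Monday) so the weekly bins can be verified, and it computes several statistics in one pass with `agg`:

```python
import pandas as pd

# 14 consecutive days starting Monday 2023-01-02, with values 1..14
s = pd.Series(range(1, 15), index=pd.date_range('2023-01-02', periods=14, freq='D'))

# 'W' bins end on Sunday by default, so each bin here covers Monday to Sunday
weekly = s.resample('W').agg(['sum', 'mean'])
print(weekly)
```

Each row is labeled with the Sunday that closes the bin, which is also why the first weekly row in the random example above contains only a single day (2023-01-01 is a Sunday).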


### Rolling Windows

In [23]:
# 7-day rolling mean
print(df_ts.rolling(window=7).mean().head(10))

# 30-day rolling standard deviation (the first 29 values are NaN until the window fills)
print(df_ts.rolling(window=30).std().head(10))

               value
2023-01-01       NaN
2023-01-02       NaN
2023-01-03       NaN
2023-01-04       NaN
2023-01-05       NaN
2023-01-06       NaN
2023-01-07 -0.243904
2023-01-08 -0.096487
2023-01-09  0.036076
2023-01-10  0.046044
            value
2023-01-01    NaN
2023-01-02    NaN
2023-01-03    NaN
2023-01-04    NaN
2023-01-05    NaN
2023-01-06    NaN
2023-01-07    NaN
2023-01-08    NaN
2023-01-09    NaN
2023-01-10    NaN
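
The leading NaN values appear because, by default, a fixed-size window needs as many observations as its size before it produces a result. The `min_periods` parameter relaxes that. A minimal sketch with a short deterministic series (hypothetical name `s`):

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range('2023-01-01', periods=5, freq='D'),
)

# Default: a result only once the window holds 3 observations
strict = s.rolling(window=3).mean()

# min_periods=1: average whatever is available so far
partial = s.rolling(window=3, min_periods=1).mean()

print(strict)
print(partial)
```

Whether partial windows are acceptable depends on the analysis: early values computed from one or two points are noisier than the rest.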


### Shifting and Lagging Data

In [25]:
# Shift data forward by 1 day
print(df_ts.shift(1).head())

# Shift data backward by 1 day
print(df_ts.shift(-1).head())

               value
2023-01-01       NaN
2023-01-02 -1.173940
2023-01-03 -0.154302
2023-01-04 -0.059296
2023-01-05  1.212269
               value
2023-01-01 -0.154302
2023-01-02 -0.059296
2023-01-03  1.212269
2023-01-04 -0.249615
2023-01-05  0.019245


### Date and Time Functionality

In [26]:
# Extract various date components
df_ts['year'] = df_ts.index.year
df_ts['month'] = df_ts.index.month
df_ts['day'] = df_ts.index.day
df_ts['dayofweek'] = df_ts.index.dayofweek
print(df_ts.head())

               value  year  month  day  dayofweek
2023-01-01 -1.173940  2023      1    1          6
2023-01-02 -0.154302  2023      1    2          0
2023-01-03 -0.059296  2023      1    3          1
2023-01-04  1.212269  2023      1    4          2
2023-01-05 -0.249615  2023      1    5          3


## 7. Performance Optimization

When working with large datasets, optimizing performance becomes crucial. We'll explore some techniques to improve the efficiency of your Pandas operations.

### Using Categorical Data Types

In [27]:
# Create a large DataFrame with repeated values
df_large = pd.DataFrame({
    'id': np.arange(1000000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000)
})

# Check memory usage
print("Memory usage before optimization:")
print(df_large.memory_usage(deep=True))

# Convert to categorical
df_large['category'] = df_large['category'].astype('category')

# Check memory usage after optimization
print("\nMemory usage after optimization:")
print(df_large.memory_usage(deep=True))

Memory usage before optimization:
Index             72
id           4000000
category    30000000
dtype: int64

Memory usage after optimization:
Index            72
id          4000000
category    1000216
dtype: int64
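
The saving comes from how a categorical column is stored: each distinct value is kept once, and the rows hold small integer codes that point to it. The sketch below inspects that layout on a smaller, seeded sample; the names `labels` and `as_category` are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
labels = pd.Series(rng.choice(['A', 'B', 'C', 'D'], size=10_000))
as_category = labels.astype('category')

# The distinct values are stored once...
print(as_category.cat.categories)

# ...and every row stores only a small integer code
print(as_category.cat.codes.dtype)

object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"before: {object_bytes} bytes, after: {category_bytes} bytes")
```

The benefit shrinks as the number of distinct values approaches the number of rows, so this conversion is mainly worthwhile for low-cardinality columns.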


### Efficient Iteration

In [None]:
import time

# Create a sample DataFrame
df_sample = pd.DataFrame({
    'A': np.random.randint(1, 100, 1000000),
    'B': np.random.randint(1, 100, 1000000)
})

# Using itertuples() for efficient iteration
start_time = time.perf_counter()
for row in df_sample.itertuples():
    _ = row.A + row.B
end_time = time.perf_counter()
print(f"Time taken with itertuples(): {end_time - start_time:.4f} seconds")

# Using iterrows() (slower: it builds a Series for every row)
start_time = time.perf_counter()
for index, row in df_sample.iterrows():
    _ = row['A'] + row['B']
end_time = time.perf_counter()
print(f"Time taken with iterrows(): {end_time - start_time:.4f} seconds")


Time taken with itertuples(): 6.1898 seconds


### Vectorization Techniques

In [None]:
# Non-vectorized operation
%time result = [x + y for x, y in zip(df_sample['A'], df_sample['B'])]

# Vectorized operation
%time result = df_sample['A'] + df_sample['B']
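
`%time` is an IPython magic, so the cell above only runs in a notebook or IPython shell. The following sketch makes the same comparison in plain Python with `time.perf_counter`, on a smaller seeded frame (hypothetical name `frame`), and checks that both approaches give the same result. Timings depend on the machine, so none are quoted here:

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({
    'A': rng.integers(1, 100, 100_000),
    'B': rng.integers(1, 100, 100_000),
})

# Non-vectorized: a Python-level loop over pairs of values
start = time.perf_counter()
looped = [a + b for a, b in zip(frame['A'], frame['B'])]
loop_seconds = time.perf_counter() - start

# Vectorized: one column-wise addition
start = time.perf_counter()
vectorized = frame['A'] + frame['B']
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")

# Both approaches should produce identical values
same_result = bool((vectorized.to_numpy() == np.array(looped)).all())
print(same_result)
```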

## Conclusion

In this comprehensive lecture on Advanced Pandas Operations, we've covered a wide range of topics including:

1. Advanced Data Selection and Filtering
2. Data Transformation
3. Grouping and Aggregation
4. Handling Missing Data
5. Merging and Joining DataFrames
6. Time Series Operations
7. Performance Optimization

These advanced techniques will allow you to manipulate and analyze complex datasets efficiently. Remember that practice is key to mastering these concepts. Try applying these techniques to your own datasets and experiment with different combinations of operations to solve real-world data problems.

As you continue your journey with Pandas, keep exploring the official documentation and stay updated with new features and best practices in the data science community.

## Exercises

To reinforce your understanding of these advanced Pandas operations, try the following exercises:

1. Create a DataFrame with at least 1000 rows and 5 columns of various data types. Perform complex filtering operations using boolean indexing and .loc/.iloc.

2. Using the same DataFrame, create new columns based on complex conditions and apply custom functions using apply() and applymap().

3. Perform advanced groupby operations with multiple columns and custom aggregation functions.

4. Create a time series DataFrame and perform resampling, rolling window calculations, and time-based indexing operations.

5. Merge multiple DataFrames using different join types and handle cases with missing data.

6. Optimize a large DataFrame (>1 million rows) using categorical data types and vectorization techniques. Compare the performance before and after optimization.

Good luck, and happy coding!