# Advanced Pandas Operations

This lecture covers advanced Pandas operations: data selection and filtering, data transformation, grouping and aggregation, handling missing data, merging and joining, time series operations, and performance optimization. Each section pairs a short explanation with examples you can run and modify.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)

## 1. Advanced Data Selection and Filtering

Pandas provides powerful tools for selecting and filtering data. We'll explore boolean indexing, .loc, .iloc, and complex conditional selection.

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 60, 10),
    'C': ['foo', 'bar', 'baz', 'qux', 'quux']
})
print(df)

### Boolean Indexing

In [None]:
# Select rows where A is greater than 2
print(df[df['A'] > 2])

# Select rows where B is less than 40 and C is not 'baz'
print(df[(df['B'] < 40) & (df['C'] != 'baz')])
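
The same filter can also be written with `DataFrame.query()`, which some readers find easier to scan. The sketch below rebuilds the sample data under the illustrative name `demo` so that it runs on its own; it is one possible spelling, not a replacement for boolean masks.

```python
import pandas as pd

# Illustrative copy of the sample data (the name `demo` is an assumption)
demo = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 60, 10),
    'C': ['foo', 'bar', 'baz', 'qux', 'quux'],
})

# Boolean-mask form: each comparison needs its own parentheses
mask_result = demo[(demo['B'] < 40) & (demo['C'] != 'baz')]

# query() form: column names are used directly inside the expression string
query_result = demo.query("B < 40 and C != 'baz'")

print(query_result)
print(mask_result.equals(query_result))
```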

### .loc and .iloc

In [None]:
# .loc for label-based indexing
print(df.loc[1:3, ['A', 'C']])

# .iloc for integer-based indexing
print(df.iloc[1:3, [0, 2]])
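
One difference between the two indexers is easy to miss: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position. A small sketch with a string index (the `demo` frame is made up for illustration) makes this visible:

```python
import pandas as pd

# A string index makes it clear which indexer works on labels
demo = pd.DataFrame(
    {'A': [1, 2, 3, 4, 5]},
    index=['a', 'b', 'c', 'd', 'e'],
)

by_label = demo.loc['b':'d']    # label slice: the end label 'd' is included
by_position = demo.iloc[1:3]    # position slice: position 3 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```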

### Conditional Selection with Multiple Criteria

In [None]:
# Complex condition: & binds more tightly than |, so the grouping is written out explicitly
condition = ((df['A'] > 2) & (df['B'] < 50)) | df['C'].isin(['foo', 'quux'])
print(df[condition])
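
When `&` and `|` are mixed, the grouping decides which rows are selected. The sketch below compares the two possible groupings on an illustrative copy of the sample data (`demo` and the mask names are assumptions made for this example):

```python
import pandas as pd

demo = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 60, 10),
    'C': ['foo', 'bar', 'baz', 'qux', 'quux'],
})

a = demo['A'] > 2
b = demo['B'] < 50
c = demo['C'].isin(['foo', 'quux'])

and_first = (a & b) | c    # how Python parses `a & b | c`
or_first = a & (b | c)     # a different filter

print(demo[and_first]['A'].tolist())
print(demo[or_first]['A'].tolist())
```

Writing the parentheses out, even when they match the default precedence, keeps the intent visible to the next reader.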

## 2. Data Transformation

Pandas offers several methods for transforming data, including apply() and applymap(), which are often combined with lambda functions, and NumPy helpers such as np.where() for conditional columns.

### apply() Function

In [None]:
# Apply a function to each numeric column
# (column 'C' holds strings, which do not support subtraction)
def column_operation(col):
    return col.max() - col.min()

print(df[['A', 'B']].apply(column_operation))

# Apply a function to each row
print(df.apply(lambda row: row['A'] * row['B'], axis=1))

### applymap() Function

In [None]:
# Apply a function to every element
print(df.applymap(lambda x: str(x).upper()))
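
In pandas 2.1 and later, `applymap()` is deprecated in favor of `DataFrame.map()`, which does the same element-wise work. A minimal sketch that runs on either version, assuming a small illustrative frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'A': [1, 2], 'C': ['foo', 'bar']})

# Use DataFrame.map where it exists (pandas 2.1+); otherwise fall back to applymap
elementwise = demo.map if hasattr(demo, 'map') else demo.applymap
upper = elementwise(lambda x: str(x).upper())

print(upper)
```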

### Creating New Columns Based on Conditions

In [None]:
# Create a new column based on a condition
df['D'] = np.where(df['A'] > 3, 'High', 'Low')
print(df)
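
`np.where` handles a single condition with two outcomes. For three or more outcomes, `np.select` is one option: it takes a list of conditions, checked in order, and a matching list of values. The thresholds and the `level` column below are made up for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': range(1, 6)})

# Conditions are evaluated in order; the first match wins
conditions = [demo['A'] >= 4, demo['A'] >= 2]
labels = ['High', 'Medium']

demo['level'] = np.select(conditions, labels, default='Low')
print(demo)
```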

## 3. Grouping and Aggregation

GroupBy operations allow you to split your data into groups, apply functions, and combine the results.

In [None]:
# Create a new DataFrame for demonstration
df2 = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [10, 20, 30, 40, 50, 60],
    'Value2': [100, 200, 300, 400, 500, 600]
})
print(df2)

### Basic GroupBy Operations

In [None]:
# Group by Category and calculate mean
print(df2.groupby('Category').mean())

# Group by Category and apply multiple aggregations
print(df2.groupby('Category').agg(['mean', 'sum', 'count']))

### Advanced Aggregation Techniques

In [None]:
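# Custom aggregation function, applied to one numeric column
# (each group is passed in as a Series)
def custom_agg(x):
    return pd.Series({
        'mean': x.mean(),
        'median': x.median(),
        'std': x.std()
    })

# unstack() turns the (Category, statistic) index into one column per statistic
print(df2.groupby('Category')['Value1'].apply(custom_agg).unstack())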
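
When different columns need different aggregations, named aggregation is often simpler than a custom function. Each keyword becomes an output column, and each value is a `(column, function)` pair. The output names below (`value1_mean`, `value2_total`, `rows`) are illustrative choices.

```python
import pandas as pd

demo = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [10, 20, 30, 40, 50, 60],
    'Value2': [100, 200, 300, 400, 500, 600],
})

# keyword = output column name, value = (input column, aggregation)
summary = demo.groupby('Category').agg(
    value1_mean=('Value1', 'mean'),
    value2_total=('Value2', 'sum'),
    rows=('Value1', 'count'),
)
print(summary)
```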

### Pivot Tables and Cross-tabulations

In [None]:
# Create a pivot table
df3 = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250]
})

pivot = pd.pivot_table(df3, values='Sales', index='Date', columns='Product', aggfunc='sum')
print(pivot)

# Cross-tabulation
print(pd.crosstab(df3['Date'], df3['Product'], values=df3['Sales'], aggfunc='sum'))
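
Pivot tables can also carry row and column totals through the `margins` argument. The sketch below repeats the sales data under the illustrative name `sales` and labels the totals `Total` via `margins_name`.

```python
import pandas as pd

sales = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250],
})

# margins=True appends a totals row and a totals column
pivot_with_totals = pd.pivot_table(
    sales,
    values='Sales',
    index='Date',
    columns='Product',
    aggfunc='sum',
    margins=True,
    margins_name='Total',
)
print(pivot_with_totals)
```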

## 4. Handling Missing Data

Dealing with missing data is a common task in data analysis. Pandas provides various methods to handle missing values effectively.

In [None]:
# Create a DataFrame with missing values
df4 = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, 40, 50],
    'C': ['a', 'b', 'c', np.nan, 'e']
})
print(df4)

### Filling Missing Values

In [None]:
# Fill with a specific value
print(df4.fillna(0))

# Fill with different values for each column
print(df4.fillna({'A': 0, 'B': 100, 'C': 'Unknown'}))

# Forward fill (fillna(method='ffill') is deprecated since pandas 2.1)
print(df4.ffill())

# Backward fill
print(df4.bfill())
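
Before choosing a fill strategy, it helps to count how many values are missing in each column. One common strategy for numeric data is to fill each column with its own mean, sketched here on an illustrative numeric-only frame named `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, 40, 50],
})

# Number of missing values per column
print(demo.isna().sum())

# demo.mean() is a Series indexed by column name, so each column gets its own mean
filled = demo.fillna(demo.mean())
print(filled)
```

Whether the mean is an appropriate fill value depends on the data; it is shown here only as an example of per-column filling.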

### Interpolation Methods

In [None]:
# Linear interpolation (numeric columns only; column 'C' holds strings)
print(df4[['A', 'B']].interpolate())

# Polynomial interpolation (requires SciPy)
print(df4[['A', 'B']].interpolate(method='polynomial', order=2))

### Handling Missing Data in Time Series

In [None]:
# Create a time series with missing data
dates = pd.date_range('20230101', periods=6)
ts = pd.Series([1, np.nan, 3, np.nan, 5, 6], index=dates)
print(ts)

# Time-aware interpolation
print(ts.interpolate(method='time'))
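
With evenly spaced dates, as above, `method='time'` gives the same result as the default linear method. The difference only appears when the timestamps are irregular. The series below is a made-up example with a three-day gap:

```python
import numpy as np
import pandas as pd

# Irregular spacing: one day, then three days
idx = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05'])
irregular = pd.Series([0.0, np.nan, 4.0], index=idx)

linear = irregular.interpolate()                   # treats points as equally spaced
time_aware = irregular.interpolate(method='time')  # weights by elapsed time

print(linear)
print(time_aware)
```

The linear method places the missing value halfway between its neighbors, while the time-aware method places it one quarter of the way, matching the one day elapsed out of four.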

## 5. Merging and Joining DataFrames

Pandas provides powerful tools for combining different DataFrames. We'll explore various types of joins and concatenation.

In [None]:
# Create sample DataFrames
df_left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df_right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [20, 40, 50, 60]})
print("Left DataFrame:")
print(df_left)
print("\nRight DataFrame:")
print(df_right)

### Different Types of Joins

In [None]:
# Inner join
print("Inner Join:")
print(pd.merge(df_left, df_right, on='key', how='inner'))

# Outer join
print("\nOuter Join:")
print(pd.merge(df_left, df_right, on='key', how='outer'))

# Left join
print("\nLeft Join:")
print(pd.merge(df_left, df_right, on='key', how='left'))

# Right join
print("\nRight Join:")
print(pd.merge(df_left, df_right, on='key', how='right'))
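
Two `merge` options help when inspecting join results: `suffixes` names the overlapping columns, and `indicator=True` adds a `_merge` column showing which side each row came from. The frames are rebuilt here as `left_demo` and `right_demo` (illustrative names) so the sketch runs on its own.

```python
import pandas as pd

left_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
right_demo = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [20, 40, 50, 60]})

merged = pd.merge(
    left_demo,
    right_demo,
    on='key',
    how='outer',
    suffixes=('_left', '_right'),  # replaces the default _x / _y
    indicator=True,                # adds the _merge column
)
print(merged)
print(merged['_merge'].value_counts())
```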

### Merging on Index

In [None]:
# Set 'key' as index
df_left.set_index('key', inplace=True)
df_right.set_index('key', inplace=True)

# Merge on index
print(df_left.join(df_right, how='outer', lsuffix='_left', rsuffix='_right'))

### Concatenating DataFrames

In [None]:
# Vertical concatenation
print("Vertical Concatenation:")
print(pd.concat([df_left, df_right]))

# Horizontal concatenation (rows are aligned on the index; keys label the two 'value' columns)
print("\nHorizontal Concatenation:")
print(pd.concat([df_left, df_right], axis=1, keys=['left', 'right']))
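
Vertical concatenation keeps the original index labels, so labels can repeat in the result. Two options for dealing with this are sketched below on small illustrative frames (`first` and `second` are made-up names): `ignore_index=True` renumbers the rows, and `keys` records which input each row came from.

```python
import pandas as pd

first = pd.DataFrame({'value': [1, 2]}, index=['A', 'B'])
second = pd.DataFrame({'value': [20, 40]}, index=['B', 'D'])

# Discard the original labels and number the rows 0..n-1
renumbered = pd.concat([first, second], ignore_index=True)

# Keep the labels and add an outer index level naming the source frame
labeled = pd.concat([first, second], keys=['first', 'second'])

print(renumbered)
print(labeled)
print(labeled.loc['second'])
```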

## 6. Time Series Operations

Pandas excels at handling time series data. We'll explore resampling, rolling windows, and other time-based operations.

In [None]:
# Create a time series DataFrame
dates = pd.date_range('20230101', periods=100, freq='D')
ts = pd.Series(np.random.randn(len(dates)), index=dates)
df_ts = pd.DataFrame({'value': ts})
print(df_ts.head())

### Resampling

In [None]:
# Resample to monthly frequency
# ('ME' is the month-end alias in pandas 2.2+; older versions use 'M')
print(df_ts.resample('ME').mean())

# Resample to weekly frequency with sum
print(df_ts.resample('W').sum())
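
A resampler also accepts several aggregations at once through `agg`. The sketch below uses a small deterministic series (values 1 to 14, an assumption made so the result is easy to check by hand) instead of random data. Weekly bins with the `'W'` alias end on Sunday, and 2023-01-01 is a Sunday, so the first bin holds a single day.

```python
import pandas as pd

demo_dates = pd.date_range('2023-01-01', periods=14, freq='D')
daily = pd.DataFrame({'value': range(1, 15)}, index=demo_dates)

# One row per week, one column per aggregation
weekly = daily['value'].resample('W').agg(['sum', 'mean', 'count'])
print(weekly)
```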

### Rolling Windows

In [None]:
# 7-day rolling mean
print(df_ts.rolling(window=7).mean().head(10))

# 30-day rolling standard deviation
print(df_ts.rolling(window=30).std().head(10))
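
By default a rolling window returns NaN until it has a full window of observations, which is why the first rows above are empty. The `min_periods` argument relaxes this requirement. A minimal sketch on a short illustrative series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

strict = s.rolling(window=3).mean()                  # NaN until 3 values are available
lenient = s.rolling(window=3, min_periods=1).mean()  # uses whatever is available so far

print(pd.DataFrame({'strict': strict, 'lenient': lenient}))
```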

### Shifting and Lagging Data

In [None]:
# Shift data forward by 1 day
print(df_ts.shift(1).head())

# Shift data backward by 1 day
print(df_ts.shift(-1).head())
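
A common use of `shift` is comparing each value with the previous one. The sketch below computes the day-over-day change and relative change on a made-up `prices` series; the first row is NaN because it has no predecessor.

```python
import pandas as pd

prices = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.date_range('2023-01-01', periods=3),
)

change = prices - prices.shift(1)           # same result as prices.diff()
pct_change = prices / prices.shift(1) - 1   # relative change from the previous day

print(pd.DataFrame({'price': prices, 'change': change, 'pct_change': pct_change}))
```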

### Date and Time Functionality

In [None]:
# Extract various date components
df_ts['year'] = df_ts.index.year
df_ts['month'] = df_ts.index.month
df_ts['day'] = df_ts.index.day
df_ts['dayofweek'] = df_ts.index.dayofweek
print(df_ts.head())
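
The attributes above are read from a `DatetimeIndex`. When the dates live in an ordinary column instead, the same components are available through the `.dt` accessor once the column has been converted with `pd.to_datetime`. The `orders` frame is a made-up example.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_date': ['2023-01-01', '2023-02-15'],
    'amount': [100, 250],
})

# Convert the strings to datetime64 values first
orders['order_date'] = pd.to_datetime(orders['order_date'])

orders['month'] = orders['order_date'].dt.month
orders['weekday'] = orders['order_date'].dt.day_name()
print(orders)
```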

## 7. Performance Optimization

When working with large datasets, optimizing performance becomes crucial. We'll explore some techniques to improve the efficiency of your Pandas operations.

### Using Categorical Data Types

In [None]:
# Create a large DataFrame with repeated values
df_large = pd.DataFrame({
    'id': np.arange(1000000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000)
})

# Check memory usage
print("Memory usage before optimization:")
print(df_large.memory_usage(deep=True))

# Convert to categorical
df_large['category'] = df_large['category'].astype('category')

# Check memory usage after optimization
print("\nMemory usage after optimization:")
print(df_large.memory_usage(deep=True))
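
The saving comes from how a categorical column is stored: each distinct value is kept once in `cat.categories`, and the rows hold small integer codes that point into that list. A sketch on a smaller illustrative series (10,000 rows, seeded for repeatability):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
labels = pd.Series(rng.choice(['A', 'B', 'C', 'D'], size=10_000))
as_category = labels.astype('category')

print(labels.memory_usage(deep=True), as_category.memory_usage(deep=True))

# The distinct values are stored once...
print(as_category.cat.categories.tolist())

# ...and each row stores only a small integer code
print(as_category.cat.codes.head().tolist())
print(as_category.cat.codes.dtype)
```

The benefit shrinks as the number of distinct values approaches the number of rows, so this conversion suits columns with many repeats.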

### Efficient Iteration

In [None]:
# Create a sample DataFrame
df_sample = pd.DataFrame({
    'A': range(1000000),
    'B': range(1000000, 2000000)
})

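# Using itertuples() for efficient iteration
# (%time is a line magic, so the whole loop must fit on one line)
%time for row in df_sample.itertuples(): _ = row.A + row.B

# Using iterrows() (much slower: it builds a Series for every row)
%time for index, row in df_sample.iterrows(): _ = row['A'] + row['B']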

### Vectorization Techniques

In [None]:
# Non-vectorized operation
%time result = [x + y for x, y in zip(df_sample['A'], df_sample['B'])]

# Vectorized operation
%time result = df_sample['A'] + df_sample['B']
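
The `%time` magic only works inside IPython or Jupyter. In a plain Python script, the same comparison can be sketched with `time.perf_counter`. The frame is smaller here (100,000 rows, an arbitrary choice) and the timings will vary by machine, so no specific numbers are claimed.

```python
import time

import pandas as pd

demo = pd.DataFrame({'A': range(100_000), 'B': range(100_000, 200_000)})

# Element-by-element loop in Python
start = time.perf_counter()
looped = [x + y for x, y in zip(demo['A'], demo['B'])]
loop_seconds = time.perf_counter() - start

# Vectorized addition on whole columns
start = time.perf_counter()
vectorized = demo['A'] + demo['B']
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
print(vectorized.tolist() == looped)  # both approaches give the same values
```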

## Conclusion

This lecture on Advanced Pandas Operations covered the following topics:

1. Advanced Data Selection and Filtering
2. Data Transformation
3. Grouping and Aggregation
4. Handling Missing Data
5. Merging and Joining DataFrames
6. Time Series Operations
7. Performance Optimization

These advanced techniques will allow you to manipulate and analyze complex datasets efficiently. Remember that practice is key to mastering these concepts. Try applying these techniques to your own datasets and experiment with different combinations of operations to solve real-world data problems.

As you continue your journey with Pandas, keep exploring the official documentation and stay updated with new features and best practices in the data science community.

## Exercises

To reinforce your understanding of these advanced Pandas operations, try the following exercises:

1. Create a DataFrame with at least 1000 rows and 5 columns of various data types. Perform complex filtering operations using boolean indexing and .loc/.iloc.

2. Using the same DataFrame, create new columns based on complex conditions and apply custom functions using apply() and applymap() (DataFrame.map() in pandas 2.1 and later).

3. Perform advanced groupby operations with multiple columns and custom aggregation functions.

4. Create a time series DataFrame and perform resampling, rolling window calculations, and time-based indexing operations.

5. Merge multiple DataFrames using different join types and handle cases with missing data.

6. Optimize a large DataFrame (>1 million rows) using categorical data types and vectorization techniques. Compare the performance before and after optimization.

Good luck, and happy coding!