# Pandas & SQL Reference Guide

## Data Selection Patterns

| SQL | Pandas |
|-----|--------|
| `SELECT col1, col2 FROM table` | `df[['col1', 'col2']]` |
| `SELECT * FROM table WHERE col > 100` | `df[df['col'] > 100]` |
| `SELECT * FROM table LIMIT 10` | `df.head(10)` |
| `SELECT DISTINCT col FROM table` | `df['col'].unique()` |
| `SELECT COUNT(*) FROM table` | `len(df)` or `df.shape[0]` |
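
The selection patterns above can be tried end to end on a small frame. This is a minimal sketch: the sample data and the result variable names are made up for illustration and do not come from any real table.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the table above.
df = pd.DataFrame({
    "col1": ["a", "b", "c", "d"],
    "col2": [1, 2, 3, 4],
    "col": [50, 150, 150, 300],
})

selected = df[["col1", "col2"]]   # SELECT col1, col2 FROM table
filtered = df[df["col"] > 100]    # SELECT * FROM table WHERE col > 100
first_two = df.head(2)            # SELECT * FROM table LIMIT 2
distinct = df["col"].unique()     # SELECT DISTINCT col FROM table
row_count = len(df)               # SELECT COUNT(*) FROM table
```

Note that `unique()` returns a NumPy array rather than a DataFrame; `df[['col']].drop_duplicates()` keeps the tabular shape if that is what the next step expects.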

## Aggregation Patterns

| SQL | Pandas |
|-----|--------|
| `SELECT AVG(col) FROM table` | `df['col'].mean()` |
| `SELECT SUM(col) FROM table` | `df['col'].sum()` |
| `SELECT MAX(col) FROM table` | `df['col'].max()` |
| `SELECT MIN(col) FROM table` | `df['col'].min()` |
| `SELECT COUNT(DISTINCT col) FROM table` | `df['col'].nunique()` |
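
When a single SQL query computes several aggregates at once, `Series.agg` with a list of function names is one way to get the same summary in pandas. The sketch below uses made-up sample values.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"col": [50, 150, 150, 300]})

# SELECT AVG(col), SUM(col), MAX(col), MIN(col), COUNT(DISTINCT col) FROM table
summary = df["col"].agg(["mean", "sum", "max", "min", "nunique"])

# The result is a Series indexed by the function names.
average = summary["mean"]
distinct_count = summary["nunique"]
```

Like SQL aggregates, which ignore `NULL`, these pandas reductions skip `NaN` values by default.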

## Grouping Patterns

| SQL | Pandas |
|-----|--------|
| `GROUP BY col` | `df.groupby('col')` |
| `GROUP BY col1, col2` | `df.groupby(['col1', 'col2'])` |
| `HAVING SUM(col) > 100` | `df.groupby('cat')['col'].sum().loc[lambda s: s > 100]` |
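
A comparison on grouped sums only produces a boolean Series; to reproduce `HAVING`, that mask still has to be applied. The sketch below shows two options on made-up data: filtering the aggregated totals, and keeping the original rows of the qualifying groups with `groupby(...).filter(...)`.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "cat": ["x", "x", "y", "z"],
    "col": [80, 40, 60, 500],
})

# SELECT cat, SUM(col) FROM table GROUP BY cat HAVING SUM(col) > 100
totals = df.groupby("cat")["col"].sum()
having = totals[totals > 100].reset_index()

# Keep the original rows that belong to groups passing the HAVING condition.
rows = df.groupby("cat").filter(lambda g: g["col"].sum() > 100)
```

Here `having` holds one row per qualifying group, while `rows` keeps every underlying record from those groups.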

---

## Advanced Pandas Techniques

### Method Chaining (Pipeline Approach)

```python
# Instead of multiple separate operations:
df_filtered = df[df['amount'] > 1000]
df_grouped = df_filtered.groupby('region')['amount'].sum()
df_result = df_grouped.reset_index().sort_values('amount', ascending=False)

# Use method chaining:
result = (df
    .query('amount > 1000')
    .groupby('region')['amount']
    .sum()
    .reset_index()
    .sort_values('amount', ascending=False)
)
```

### Query Method (SQL-like syntax)

```python
# Instead of: df[(df['age'] > 25) & (df['salary'] > 50000)]
df.query('age > 25 and salary > 50000')

# With variables:
min_age = 25
df.query('age > @min_age')
```

### Apply Functions for Complex Operations

```python
# Apply function to each row
def calculate_bonus(row):
    if row['performance'] == 'Excellent':
        return row['salary'] * 0.15
    elif row['performance'] == 'Good':
        return row['salary'] * 0.10
    else:
        return row['salary'] * 0.05

df['bonus'] = df.apply(calculate_bonus, axis=1)

# Apply element-wise to a single column (a Series)
df['region'] = df['region'].apply(lambda x: x.upper())
```
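
Row-wise `apply` calls a Python function once per row, which can be slow on large frames. As a hedged alternative, the same bonus rule can be written with vectorized operations. The sketch below swaps in `numpy.select` and the `.str` accessor; the sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "performance": ["Excellent", "Good", "Average"],
    "salary": [100000, 80000, 60000],
    "region": ["north", "south", "east"],
})

# One condition per bonus tier; rows matching neither fall back to the default rate.
conditions = [df["performance"] == "Excellent", df["performance"] == "Good"]
rates = [0.15, 0.10]
df["bonus"] = df["salary"] * np.select(conditions, rates, default=0.05)

# Vectorized string method instead of apply with a lambda.
df["region"] = df["region"].str.upper()
```

Unlike the lambda version, `.str.upper()` passes missing values through as `NaN` instead of raising an error.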

---

## Common Pitfalls and Solutions

### 1. SettingWithCopyWarning

```python
# ❌ Wrong way: chained indexing may assign to a temporary copy,
# so df can be left unchanged (and pandas may warn):
df[df['amount'] > 1000]['category'] = 'High'

# ✅ Right way:
df.loc[df['amount'] > 1000, 'category'] = 'High'
```
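
When the goal is to work on a subset without touching the original frame, taking an explicit copy makes that intent clear. This is a minimal sketch on made-up data; the variable name `high` is illustrative.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "amount": [500, 1500, 2500],
    "category": ["Low", "Low", "Low"],
})

# Independent subset: edits here never reach df.
high = df[df["amount"] > 1000].copy()
high["category"] = "High"
```

After this runs, `high` carries the new labels and `df` still holds its original `category` values.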

### 2. Index Management

```python
# Reset index after groupby operations
df.groupby('region')['amount'].sum().reset_index()

# Set meaningful index
df = df.set_index('customer_id')
```

### 3. Memory Efficiency

```python
# Use appropriate data types
df['category'] = df['category'].astype('category')  # For repeated strings
df['customer_id'] = df['customer_id'].astype('int32')  # If values fit and none are missing

# Check memory usage
df.info(memory_usage='deep')
```
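
To check whether a dtype change actually pays off, memory can be measured before and after. The sketch below assumes a made-up frame with a low-cardinality string column; it uses `pd.to_numeric(..., downcast='integer')` as an alternative to picking an integer width by hand.

```python
import pandas as pd

# Hypothetical sample data: 3,000 rows with only three distinct category strings.
df = pd.DataFrame({
    "category": ["retail", "wholesale", "online"] * 1000,
    "customer_id": range(3000),
})

before = df.memory_usage(deep=True).sum()

df["category"] = df["category"].astype("category")
# Let pandas pick the smallest integer type that holds every value.
df["customer_id"] = pd.to_numeric(df["customer_id"], downcast="integer")

after = df.memory_usage(deep=True).sum()
print(f"bytes before: {before}, after: {after}")
```

The savings depend on the data: `category` helps most when a column has few distinct values relative to its length.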

---

## Quick Tips

- Use `.loc[]` for label-based indexing
- Use `.iloc[]` for position-based indexing
- Use `.query()` for complex filtering conditions
- Chain methods for cleaner, more readable code
- Call `.copy()` on a subset you plan to modify, so the changes stay independent of the original DataFrame and do not trigger `SettingWithCopyWarning`
- Use `pd.options.display.max_columns = None` to see all columns in output
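
The difference between `.loc[]` and `.iloc[]` is easiest to see on a frame whose index labels are not in positional order. This is a small sketch with made-up labels and values.

```python
import pandas as pd

# Hypothetical sample data with string index labels.
df = pd.DataFrame(
    {"amount": [500, 1500, 2500]},
    index=["c3", "c1", "c2"],
)

by_label = df.loc["c1", "amount"]   # row labeled "c1"
by_position = df.iloc[0, 0]         # first row, first column

# Label slices include the end label; position slices exclude the end position.
label_slice = df.loc["c3":"c1"]
position_slice = df.iloc[0:1]
```

Here `label_slice` contains two rows (`c3` and `c1`), while `position_slice` contains only the first row.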