## Data Cleaning & Missing Values

### Identifying Missing Data

```python
# Count nulls per column
df.isnull().sum()

# Percentage of nulls
df.isnull().sum() / len(df) * 100
```
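
As a small worked example, the two checks above can be combined into one summary table. This is a sketch on a hypothetical four-row frame (the column names and values are made up for illustration); it relies on the fact that the mean of a boolean mask equals the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "customer": ["Ann", "Bob", None, "Dee"],
    "total_amount": [120.0, np.nan, np.nan, 80.0],
})

null_counts = df.isnull().sum()          # nulls per column
null_pct = df.isnull().mean() * 100      # same as sum() / len(df) * 100

# One row per column: count and percentage of missing values
summary = pd.DataFrame({"nulls": null_counts, "pct": null_pct})
print(summary)
```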

### Handling Missing Data

| SQL | Pandas |
|-----|--------|
| `SELECT * FROM table WHERE column IS NOT NULL` | `df.dropna(subset=['column'])` |
| `UPDATE table SET column = 'Default' WHERE column IS NULL` | `df['column'] = df['column'].fillna('Default')` |

```python
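# Fill with a statistical measure (pick one: mean or median)
# Assign the result back; inplace=True on a selected column is unreliable
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].mean())
# df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].median())

# Forward fill and backward fill (alternatives)
# fillna(method=...) is deprecated since pandas 2.1; use ffill() / bfill()
df['column'] = df['column'].ffill()  # Use previous valid value
df['column'] = df['column'].bfill()  # Use next valid value

# Linear interpolation for numeric data
df['numeric_column'] = df['numeric_column'].interpolate()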
```
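
To compare the fill strategies side by side, here is a minimal sketch on a hypothetical four-value Series with a two-value gap. The variable names are illustrative, and `interpolate()` is used with its default linear method.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap in the middle
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()                # 1.0, 1.0, 1.0, 4.0
backward = s.bfill()               # 1.0, 4.0, 4.0, 4.0
linear = s.interpolate()           # 1.0, 2.0, 3.0, 4.0
mean_filled = s.fillna(s.mean())   # mean of 1.0 and 4.0 is 2.5

print(pd.DataFrame({
    "ffill": forward,
    "bfill": backward,
    "interpolate": linear,
    "mean": mean_filled,
}))
```

Which strategy is appropriate depends on the data: forward fill suits ordered data where the last known value carries over, while interpolation assumes a roughly linear change between known points.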

---

## Data Transformation

### Conditional Logic

```python
import numpy as np

# SQL: UPDATE table SET new_column = CASE WHEN condition THEN value1 ELSE value2 END
df['category'] = np.where(df['total_amount'] > 1000, 'High Value', 'Regular')

# Multiple conditions (SQL: CASE WHEN ... WHEN ... ELSE)
conditions = [
    df['total_amount'] > 2000,
    df['total_amount'] > 1000,
    df['total_amount'] > 500
]
choices = ['Premium', 'High Value', 'Medium Value']
df['customer_tier'] = np.select(conditions, choices, default='Regular')
```
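
Like SQL's `CASE WHEN`, `np.select` takes the first matching condition, so the order of the list matters. The sketch below uses hypothetical amounts to show the tiers produced by the ordering above, and what happens if the conditions are listed from loosest to strictest instead.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts, one per tier
df = pd.DataFrame({"total_amount": [2500, 1500, 700, 100]})

# Strictest condition first: each row gets the highest tier it qualifies for
conditions = [
    df["total_amount"] > 2000,
    df["total_amount"] > 1000,
    df["total_amount"] > 500,
]
choices = ["Premium", "High Value", "Medium Value"]
df["customer_tier"] = np.select(conditions, choices, default="Regular")

# Loosest condition first: "> 500" matches before the stricter tests are reached
df["wrong_tier"] = np.select(conditions[::-1], choices[::-1], default="Regular")

print(df)
```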

### Creating New Columns

```python
# Creating new columns from existing ones
df['profit_margin'] = (df['revenue'] - df['cost']) / df['revenue'] * 100
df['full_name'] = df['first_name'] + ' ' + df['last_name']
```

### Working with Dates

```python
# Converting to datetime
df['order_date'] = pd.to_datetime(df['order_date'])

# Extracting date components
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
df['day_of_week'] = df['order_date'].dt.day_name()

# Calculating time differences
df['days_since_order'] = (pd.Timestamp.now() - df['order_date']).dt.days
```
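
The date operations can be tried end to end on a tiny frame. This sketch uses two hypothetical order dates and a fixed `as_of` timestamp (an assumption, used in place of `pd.Timestamp.now()` so the result is reproducible).

```python
import pandas as pd

# Hypothetical order dates stored as strings
df = pd.DataFrame({"order_date": ["2024-01-15", "2024-03-01"]})
df["order_date"] = pd.to_datetime(df["order_date"])

# Date components
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["day_of_week"] = df["order_date"].dt.day_name()

# Fixed reference date instead of "now", so the output does not change over time
as_of = pd.Timestamp("2024-03-31")
df["days_since_order"] = (as_of - df["order_date"]).dt.days

print(df)
```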

### Binning Continuous Data

```python
# SQL: CASE WHEN with ranges
# Bins are right-inclusive by default: (0, 25], (25, 40], (40, 60], (60, 100]
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 25, 40, 60, 100],
                         labels=['Young', 'Adult', 'Middle Age', 'Senior'])
```
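
Values on or outside the bin edges are easy to get wrong, so here is a small sketch with hypothetical ages chosen to sit on the boundaries. It shows that values outside the bins become `NaN`, and that `include_lowest=True` brings the lowest edge (here `0`) into the first bin.

```python
import pandas as pd

# Hypothetical ages placed on and around the bin edges
ages = pd.Series([0, 25, 26, 60, 100, 101])
bins = [0, 25, 40, 60, 100]
labels = ["Young", "Adult", "Middle Age", "Senior"]

# Default: intervals are open on the left, so 0 falls outside (0, 25]
groups = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge
with_lowest = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": groups, "include_lowest": with_lowest}))
```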

### Removing Duplicates

| SQL | Pandas |
|-----|--------|
| `SELECT DISTINCT * FROM table` | `df.drop_duplicates()` |
| `SELECT DISTINCT column1, column2 FROM table` | `df[['column1', 'column2']].drop_duplicates()` |
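
One subtlety: `SELECT DISTINCT column1, column2` returns only those two columns, whereas `drop_duplicates(subset=[...])` keeps every column and retains the first row for each distinct combination. A minimal sketch of the difference, using a hypothetical three-column frame:

```python
import pandas as pd

# Hypothetical data: the first two rows share the same (column1, column2) pair
df = pd.DataFrame({
    "column1": ["a", "a", "b"],
    "column2": [1, 1, 2],
    "column3": ["x", "y", "z"],
})

# Only the two selected columns, one row per distinct pair
distinct_pairs = df[["column1", "column2"]].drop_duplicates()

# All columns, first row kept for each distinct pair
first_per_pair = df.drop_duplicates(subset=["column1", "column2"])

print(distinct_pairs)
print(first_per_pair)
```

Pass `keep='last'` to `drop_duplicates` to retain the last row of each group instead of the first.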

---

## Quick Tips

- Use `.loc[]` for label-based indexing
- Use `.iloc[]` for position-based indexing
- Use `.query()` for complex filtering conditions
- Chain methods for cleaner, more readable code
- Call `.copy()` on a subset you intend to modify, to avoid `SettingWithCopyWarning` and make it explicit that the original is untouched
- Use `pd.options.display.max_columns = None` to see all columns in output
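
The indexing and filtering tips above can be sketched together on a hypothetical frame with string labels as its index (the data and variable names are illustrative only):

```python
import pandas as pd

# Hypothetical data with a labeled index
df = pd.DataFrame(
    {"total_amount": [2500, 700, 100], "region": ["East", "West", "East"]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "total_amount"]   # row label "b", column by name
by_position = df.iloc[0, 0]              # first row, first column

# query() keeps compound filters readable
east_big = df.query("region == 'East' and total_amount > 500")

# copy() gives an independent subset that is safe to modify
subset = df[df["region"] == "East"].copy()
subset["total_amount"] = subset["total_amount"] * 2

print(by_label, by_position)
print(east_big)
print(subset)
```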
