## Data Cleaning & Missing Values

### Identifying Missing Data

```python
# Count nulls per column
df.isnull().sum()

# Percentage of nulls
df.isnull().sum() / len(df) * 100
```

### Summarizing Nulls That May Cause Issues Later

```python
import pandas as pd
import seaborn as sns

# Load and inspect a dataset
dataset = 'titanic'  # You can change this to any dataset available in seaborn
df = sns.load_dataset(dataset)
print(f"\nLoaded '{dataset}' dataset.")

# Identifying missing data
null_counts = df.isnull().sum()
null_percent = (null_counts / len(df)) * 100

# Combine into a single DataFrame for better visibility
null_summary = pd.DataFrame({
    'Null Count': null_counts,
    'Null Percentage': null_percent
})

print("\nMissing Data Summary:")
print(null_summary)

print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nFirst few rows:")
print(df.head())
```

### Dropping Nulls (When the Outcome Requirements Allow It)

```python
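import pandas as pd
import seaborn as sns

# Load dataset
df = sns.load_dataset('taxis')

# SQL: SELECT * FROM table WHERE payment IS NOT NULL
# dropna() returns a new DataFrame, so assign the result to keep it
print("\nDropping rows with nulls in 'payment':")
df_clean = df.dropna(subset=['payment'])
print(f"Rows before: {len(df)}, rows after: {len(df_clean)}")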
```

**See the Day7 practice if choosing to fill nulls instead of dropping them.**

---

### Removing Duplicates

| SQL | Pandas |
|-----|--------|
| `SELECT DISTINCT * FROM table` | `df.drop_duplicates()` |
| `SELECT DISTINCT column1, column2 FROM table` | `df.drop_duplicates(subset=['column1', 'column2'])` |
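
The mappings in the table can be tried on a small frame. This is a minimal sketch using made-up data (the `rider`, `city`, and `fare` columns are hypothetical, not from any dataset used above). It also shows one difference worth knowing: `drop_duplicates(subset=...)` keeps every column and the first matching row, while `SELECT DISTINCT column1, column2` returns only the listed columns.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact duplicates;
# row 2 matches them on 'rider' and 'city' but has a different fare.
df = pd.DataFrame({
    'rider': ['ann', 'ann', 'ann', 'bob'],
    'city': ['NYC', 'NYC', 'NYC', 'LA'],
    'fare': [10.0, 10.0, 12.5, 8.0],
})

# SQL: SELECT DISTINCT * FROM table
distinct_rows = df.drop_duplicates()

# Deduplicate on two columns only; all columns are kept,
# and the first row of each (rider, city) pair survives.
distinct_pairs = df.drop_duplicates(subset=['rider', 'city'])

# Closest match to SQL: SELECT DISTINCT rider, city FROM table
distinct_cols = df[['rider', 'city']].drop_duplicates()

print(len(df), len(distinct_rows), len(distinct_pairs))
print(distinct_cols)
```

If the last occurrence should win instead of the first, `drop_duplicates` accepts `keep='last'`.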

---

## Quick Tips

- Use `.loc[]` for label-based indexing
- Use `.iloc[]` for position-based indexing
- Use `.query()` for complex filtering conditions
- Chain methods for cleaner, more readable code
- Use `.copy()` when creating a subset you plan to modify, to avoid `SettingWithCopyWarning`
- Use `pd.options.display.max_columns = None` to see all columns in output
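
The tips above can be sketched together on a small frame. The data and column names (`fare`, `passengers`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data with string labels as the index
df = pd.DataFrame(
    {'fare': [10.0, 25.0, 7.5, 40.0], 'passengers': [1, 2, 1, 4]},
    index=['a', 'b', 'c', 'd'],
)

# .loc[] selects by label, .iloc[] by integer position (same cell here)
by_label = df.loc['b', 'fare']
by_position = df.iloc[1, 0]

# .query() expresses a multi-condition filter as a string
busy = df.query('fare > 8 and passengers >= 2')

# Take an explicit copy before adding a column to a filtered subset,
# so the original frame is untouched and no warning is raised
subset = df[df['fare'] > 8].copy()
subset['fare_per_passenger'] = subset['fare'] / subset['passengers']

# Method chaining: filter, sort, then reset the index in one expression
result = (
    df.query('passengers >= 2')
      .sort_values('fare', ascending=False)
      .reset_index(drop=True)
)

print(by_label, by_position)
print(busy)
print(subset)
print(result)
```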
