# Data Cleaning and Handling Missing Data with Pandas

Real-world datasets often contain **missing or messy data**.  
Pandas marks missing values as `NaN` (or `None`) and provides tools for detecting, removing, and filling them so that gaps do not silently distort your analysis.

Common functions for handling missing data:

- **`isnull()` / `notnull()`**: Return a Boolean mask marking missing (or non-missing) values; `isna()` / `notna()` are aliases.
- **`dropna()`**: Remove rows (the default) or columns (`axis=1`) that contain missing data.
- **`fillna()`**: Replace missing values with a fixed value or a computed statistic such as the column mean or median.
- **`replace()`**: Substitute unwanted or incorrect values with corrected ones.

The goal is to ensure the dataset is consistent, accurate, and ready for analysis.


In [24]:
import pandas as pd
import numpy as np

In [25]:
# Create a DataFrame with missing values
data = {
    "Name": ["Hayley", "Taylor", "Claire", "Aurora", "Evangeline"],
    "Age": [25, np.nan, 35, np.nan, 45],
    "City": ["New York", "Los Angeles", np.nan, "Houston", "Phoenix"],
    "Salary": [50000, 60000, np.nan, 80000, 90000]
}
df = pd.DataFrame(data)
print("Original DataFrame with missing values:\n", df)

Original DataFrame with missing values:
          Name   Age         City   Salary
0      Hayley  25.0     New York  50000.0
1      Taylor   NaN  Los Angeles  60000.0
2      Claire  35.0          NaN      NaN
3      Aurora   NaN      Houston  80000.0
4  Evangeline  45.0      Phoenix  90000.0


In [26]:
# Detect missing values
print("\nCheck for missing values:\n", df.isnull())


Check for missing values:
     Name    Age   City  Salary
0  False  False  False   False
1  False   True  False   False
2  False  False   True    True
3  False   True  False   False
4  False  False  False   False
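
A full Boolean mask is hard to read on larger tables, so it is common to summarize it. The following is a minimal sketch that rebuilds the same example DataFrame and counts missing values per column; the variable names `missing_per_column`, `missing_share`, and `rows_with_missing` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same illustrative data as the DataFrame above.
df = pd.DataFrame({
    "Name": ["Hayley", "Taylor", "Claire", "Aurora", "Evangeline"],
    "Age": [25, np.nan, 35, np.nan, 45],
    "City": ["New York", "Los Angeles", np.nan, "Houston", "Phoenix"],
    "Salary": [50000, 60000, np.nan, 80000, 90000],
})

# Number of missing values in each column (True counts as 1).
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Fraction of missing values in each column (mean of the Boolean mask).
missing_share = df.isnull().mean()
print(missing_share)

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```

Checking these counts first helps decide whether dropping or filling is the more reasonable next step.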


In [27]:
# Drop rows with any missing values
dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with missing values:\n", dropped_rows)


DataFrame after dropping rows with missing values:
          Name   Age      City   Salary
0      Hayley  25.0  New York  50000.0
4  Evangeline  45.0   Phoenix  90000.0
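
Dropping every row with any gap removed three of the five rows here. When that is too aggressive, `dropna()` can be narrowed with its `subset`, `thresh`, and `axis` parameters. The sketch below shows one possible use of each on the same example data; the result names are illustrative.

```python
import numpy as np
import pandas as pd

# Same illustrative data as the DataFrame above.
df = pd.DataFrame({
    "Name": ["Hayley", "Taylor", "Claire", "Aurora", "Evangeline"],
    "Age": [25, np.nan, 35, np.nan, 45],
    "City": ["New York", "Los Angeles", np.nan, "Houston", "Phoenix"],
    "Salary": [50000, 60000, np.nan, 80000, 90000],
})

# Drop rows only when Age is missing; gaps in other columns are tolerated.
age_known = df.dropna(subset=["Age"])
print(age_known)

# Keep rows that have at least 3 non-missing values.
mostly_complete = df.dropna(thresh=3)
print(mostly_complete)

# Drop columns (axis=1) that contain any missing value.
complete_columns = df.dropna(axis=1)
print(complete_columns)
```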


In [28]:
# Fill every missing value with a fixed value (the text column City also receives 0)
filled_fixed = df.fillna(0)
print("\nFill missing values with 0:\n", filled_fixed)


Fill missing values with 0:
          Name   Age         City   Salary
0      Hayley  25.0     New York  50000.0
1      Taylor   0.0  Los Angeles  60000.0
2      Claire  35.0            0      0.0
3      Aurora   0.0      Houston  80000.0
4  Evangeline  45.0      Phoenix  90000.0


In [29]:
# Fill missing values with column mean (for numeric columns)
df_mean_filled = df.copy()
df_mean_filled["Age"] = df_mean_filled["Age"].fillna(df["Age"].mean())
df_mean_filled["Salary"] = df_mean_filled["Salary"].fillna(df["Salary"].mean())
print("\nFill numeric missing values with column mean:\n", df_mean_filled)


Fill numeric missing values with column mean:
          Name   Age         City   Salary
0      Hayley  25.0     New York  50000.0
1      Taylor  35.0  Los Angeles  60000.0
2      Claire  35.0          NaN  70000.0
3      Aurora  35.0      Houston  80000.0
4  Evangeline  45.0      Phoenix  90000.0
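
The mean fill above leaves the `City` gap in place, and a single `fillna(0)` would put a number into a text column. One way to handle each column appropriately in a single call is to pass a dictionary to `fillna()`. This is a sketch under assumptions: the placeholder string `"Unknown"` and the choice of the median are illustrative, not something the original notebook prescribes.

```python
import numpy as np
import pandas as pd

# Same illustrative data as the DataFrame above.
df = pd.DataFrame({
    "Name": ["Hayley", "Taylor", "Claire", "Aurora", "Evangeline"],
    "Age": [25, np.nan, 35, np.nan, 45],
    "City": ["New York", "Los Angeles", np.nan, "Houston", "Phoenix"],
    "Salary": [50000, 60000, np.nan, 80000, 90000],
})

# One fill value per column: medians for numbers, a placeholder for text.
# "Unknown" is an assumed placeholder chosen for this example.
fill_values = {
    "Age": df["Age"].median(),
    "City": "Unknown",
    "Salary": df["Salary"].median(),
}

filled = df.fillna(value=fill_values)
print(filled)
```

The median is often preferred over the mean when a column contains outliers, since a few extreme values do not shift it as much.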


In [30]:
# Replace specific values; the dict maps each old value to its new value
replaced_df = df.replace({"New York": "NYC", "Los Angeles": "LA"})
print("\nReplace city names:\n", replaced_df)


Replace city names:
          Name   Age     City   Salary
0      Hayley  25.0      NYC  50000.0
1      Taylor   NaN       LA  60000.0
2      Claire  35.0      NaN      NaN
3      Aurora   NaN  Houston  80000.0
4  Evangeline  45.0  Phoenix  90000.0
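
`replace()` is also useful when missing data is hidden behind sentinel values such as `"N/A"` or `-1`, which `isnull()` does not detect. The sketch below uses a small hypothetical DataFrame (`raw`, invented for this example) and converts such sentinels to `NaN` so the missing-data tools shown earlier can see them.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data in which gaps were recorded as sentinel values.
raw = pd.DataFrame({
    "City": ["New York", "N/A", "unknown", "Houston"],
    "Age": [25, -1, 35, 45],
})

cleaned = raw.copy()

# Turn text sentinels in City into real missing values.
cleaned["City"] = raw["City"].replace(["N/A", "unknown"], np.nan)

# Turn the numeric sentinel -1 in Age into a real missing value.
cleaned["Age"] = raw["Age"].replace(-1, np.nan)

print(cleaned)
print(cleaned.isnull().sum())
```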


# Real-World Analogy: Fixing an Incomplete Guest List

Imagine you are organizing a party, but the guest list has:
- Some missing names
- Missing contact information
- Old or inconsistent city names

You might:
- Remove incomplete entries (`dropna`)
- Fill missing info with placeholders (`fillna`)
- Update city names to be consistent (`replace`)

Cleaning data is like making sure your guest list is complete and accurate so the event runs smoothly.
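
The three steps of the analogy can be chained into one short cleaning pipeline. This is only a sketch: the `guests` table, its column names, and the `"unknown"` placeholder are invented for illustration.

```python
import pandas as pd

# Hypothetical guest list with a missing name, a missing phone number,
# and inconsistent city names.
guests = pd.DataFrame({
    "Name": ["Hayley", None, "Claire", "Aurora"],
    "Phone": ["555-0101", "555-0102", None, "555-0104"],
    "City": ["New York", "LA", "NYC", "Los Angeles"],
})

clean_guests = (
    guests
    .dropna(subset=["Name"])            # remove entries without a name
    .fillna({"Phone": "unknown"})       # placeholder for missing contact info
    .replace({"New York": "NYC", "Los Angeles": "LA"})  # consistent city names
    .reset_index(drop=True)             # renumber rows after dropping
)
print(clean_guests)
```

Each method returns a new DataFrame, so the original `guests` table is left unchanged and the steps can be reordered or removed independently.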
