# 📓 Lesson 5: Cleaning Missing and Duplicate Data
📘 What you will learn:
1. How to identify missing (NaN) values
2. How to handle missing data using dropna() and fillna()
3. How to detect and remove duplicate rows using drop_duplicates()

## 📁 Step 1: Load the Dataset

We use the same dataset as before: Sales_January_2019.csv

In [None]:
import pandas as pd

# Load data
df = pd.read_csv('../data/Sales_January_2019.csv')

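# Check basic info (df.info() prints its report directly and returns None,
# so wrapping it in print() would also print "None")
df.info()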


## 🔍 Step 2: Detect Missing Values
To check for missing values in each column:

In [None]:
# Count missing values in each column
print(df.isnull().sum())

The output lists every column together with the number of missing (NaN) values it contains.
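Raw counts are easier to judge next to the share of missing values and the affected rows. The sketch below uses a small made-up DataFrame (`toy` is hypothetical, not the sales file) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not taken from Sales_January_2019.csv)
toy = pd.DataFrame({
    "Product": ["Cable", None, "Monitor", "Phone"],
    "Price Each": [11.95, 3.84, np.nan, np.nan],
})

missing_count = toy.isnull().sum()             # number of NaN per column
missing_share = toy.isnull().mean() * 100      # percentage of NaN per column
rows_with_nan = toy[toy.isnull().any(axis=1)]  # rows with at least one NaN

print(missing_count)
print(missing_share)
print(rows_with_nan)
```

Looking at the affected rows before deciding what to do with them often shows whether the gaps are random or follow a pattern.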

## 🧹 Step 3: Remove Missing Values
You can remove rows with missing values using dropna():

In [None]:
# Drop all rows that have any missing value
df_cleaned = df.dropna()

# Confirm no more missing values
print(df_cleaned.isnull().sum())

# Compare shape before and after
print("Original rows:", len(df))
print("Rows after dropna:", len(df_cleaned))

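📌 Note: dropna() removes a row as soon as any one of its columns is missing, so you may lose rows that still hold useful data. Compare the row counts before and after, and use it carefully!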
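dropna() also accepts parameters that make it less aggressive. The following is a minimal sketch on made-up data (`toy` and its values are assumptions for illustration) comparing the default with `how`, `subset` and `thresh`:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Order ID": ["1", "2", np.nan, "4"],
    "Product": ["Cable", np.nan, np.nan, "Phone"],
    "Price Each": [11.95, 3.84, np.nan, np.nan],
})

drop_any = toy.dropna()                           # drop rows with any NaN (default)
drop_all_empty = toy.dropna(how="all")            # drop only rows where every value is NaN
drop_by_column = toy.dropna(subset=["Order ID"])  # only look at 'Order ID'
keep_mostly_full = toy.dropna(thresh=2)           # keep rows with at least 2 non-NaN values

print("any:", len(drop_any))
print("all:", len(drop_all_empty))
print("subset:", len(drop_by_column))
print("thresh:", len(keep_mostly_full))
```

Which variant fits depends on which columns your later analysis actually needs.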

In [None]:
# Some columns are read as strings; convert them to numeric.
# errors='coerce' turns any value that cannot be parsed into NaN.
df['Quantity Ordered'] = pd.to_numeric(df['Quantity Ordered'], errors='coerce')
df['Price Each'] = pd.to_numeric(df['Price Each'], errors='coerce')

# Drop rows where conversion failed and became NaN
df = df.dropna(subset=['Quantity Ordered', 'Price Each'])

# Check types
print(df.dtypes)
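
To see what errors='coerce' does in isolation, here is a small sketch on a hand-made Series. The stray text value is an assumption about what non-numeric data might look like, not something taken from the file:

```python
import pandas as pd

# Hypothetical raw values: numbers stored as text plus one non-numeric entry
raw = pd.Series(["2", "1", "Quantity Ordered", "3"])

# Values that cannot be parsed become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print("Failed conversions:", converted.isnull().sum())
```

Counting the NaN values right after the conversion tells you how many rows the following dropna(subset=...) call will remove.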

## 🛠 Step 4: Fill Missing Values
Instead of dropping rows, you can fill missing values with a default value or strategy:

In [None]:

# Option 1: Fill all NaNs with zero (not always recommended)

# Load data
df = pd.read_csv('../data/Sales_January_2019.csv')

# Fill all NaNs with zero
df_filled = df.fillna(0)

# Example: Find rows where 'Price Each' was filled with zero
result = df_filled.query("`Price Each` == 0")
print(result.head())
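
A single default such as zero rarely suits every column. As a sketch on made-up data (the column defaults chosen here are assumptions, not recommendations for the sales file), fillna() can take a dictionary with one fill value per column:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Product": ["Cable", None, "Phone"],
    "Quantity Ordered": [2.0, np.nan, 1.0],
    "Price Each": [10.0, 20.0, np.nan],
})

# Give each column its own default: a label, a zero, and the column median
filled = toy.fillna({
    "Product": "Unknown",
    "Quantity Ordered": 0,
    "Price Each": toy["Price Each"].median(),
})

print(filled)
```

Filling a price with the median keeps the column numeric, whereas filling every NaN with the same value can mix types within a column.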


In [None]:
# Option 2: Fill a specific column with 'Unknown'

# Load data
df = pd.read_csv('../data/Sales_January_2019.csv')

# Replace NaN in 'Purchase Address' with 'Unknown'
df['Purchase Address'] = df['Purchase Address'].fillna('Unknown')

# Check rows where value was unknown
result = df.query("`Purchase Address` == 'Unknown'")
print(result.head())

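💡 You can also use .loc for assignments

.loc takes a row selector and a column selector: df.loc[rows, columns]. Using : for the rows selects every row, so the assignment targets the whole column explicitly on the DataFrame and avoids ambiguity:


In [None]:
# Load data
df = pd.read_csv('../data/Sales_January_2019.csv')

# Replace NaN in 'Purchase Address' with 'Unknown'
# (df.loc['Purchase Address'] would address a row label, not the column)
df.loc[:, 'Purchase Address'] = df['Purchase Address'].fillna('Unknown')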

# Find those rows
result = df.query("`Purchase Address` == 'Unknown'")
print(result.head())
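
.loc can also limit the assignment to only the rows that need a value, by passing a boolean mask as the row selector. This is a minimal sketch with made-up addresses (`toy` is hypothetical):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Order ID": ["1", "2", "3"],
    "Purchase Address": ["1 Main St", None, "2 Oak Ave"],
})

# Boolean mask: True where the address is missing
missing_address = toy["Purchase Address"].isnull()

# Rows are chosen by the mask, the column by its label
toy.loc[missing_address, "Purchase Address"] = "Unknown"

print(toy)
```

The shape of the DataFrame stays the same; only the masked cells change.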


## ❌ Step 5: Remove Duplicate Rows
Count the duplicate rows, then remove them. By default, drop_duplicates() keeps the first occurrence of each repeated row:

In [None]:
# Check for duplicates
print("Duplicate rows:", df.duplicated().sum())

# Remove duplicates
df = df.drop_duplicates()

# Confirm
print("Remaining duplicates:", df.duplicated().sum())
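
duplicated() and drop_duplicates() compare whole rows by default, but both accept `subset` and `keep`. The sketch below uses made-up orders (`toy` is hypothetical) to show the difference:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Order ID": ["1", "1", "2", "2"],
    "Product": ["Cable", "Cable", "Phone", "Charger"],
})

full_duplicates = toy.duplicated().sum()                # whole-row repeats
same_order = toy.duplicated(subset=["Order ID"]).sum()  # repeats of 'Order ID' only
all_copies = toy[toy.duplicated(keep=False)]            # every copy, including the first
deduped = toy.drop_duplicates().reset_index(drop=True)  # remove repeats, renumber rows

print("Full duplicates:", full_duplicates)
print("Repeated order IDs:", same_order)
print(all_copies)
print(deduped)
```

A repeated Order ID is not necessarily an error, since one order may contain several products, so check what counts as a duplicate before using `subset`.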


## 🧠 Practice Exercises
1. Load the file Sales_January_2019.csv
2. Count missing values per column
3. Drop all rows with missing data
4. Convert Quantity Ordered and Price Each to numeric
5. Drop rows where those conversions failed
6. Count and remove duplicates

In [None]:
# 1. Load the file
df = pd.read_csv('../data/Sales_January_2019.csv')

# 2. Count missing values per column
print(df.isnull().sum())

# 3. Drop all rows with missing data
df = df.dropna()

# 4. Convert to numeric
df['Quantity Ordered'] = pd.to_numeric(df['Quantity Ordered'], errors='coerce')
df['Price Each'] = pd.to_numeric(df['Price Each'], errors='coerce')

# 5. Drop rows where the conversion failed
df = df.dropna(subset=['Quantity Ordered', 'Price Each'])

# 6. Count and remove duplicates
print("Duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()
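
The same steps can be wrapped in one reusable function. This is only a sketch: `clean_sales` is a hypothetical helper name, and the toy rows are made up so the example runs without the CSV file:

```python
import pandas as pd

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that chains the cleaning steps from this lesson."""
    out = df.dropna().copy()  # drop rows with any NaN; copy so we never touch the input
    for col in ["Quantity Ordered", "Price Each"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")  # unparseable text becomes NaN
    out = out.dropna(subset=["Quantity Ordered", "Price Each"])  # drop failed conversions
    return out.drop_duplicates().reset_index(drop=True)  # remove repeats, renumber rows

# Toy data for illustration only: one duplicate, one text row, one empty row
toy = pd.DataFrame({
    "Product": ["Cable", "Cable", "Product", None, "Phone"],
    "Quantity Ordered": ["2", "2", "Quantity Ordered", None, "1"],
    "Price Each": ["11.95", "11.95", "Price Each", None, "600"],
})

cleaned = clean_sales(toy)
print(cleaned)
print(cleaned.dtypes)
```

Keeping the steps in a function makes it easy to apply the same cleaning to other monthly files, assuming they share these column names.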


## 📌 Summary
In this lesson, you learned:
- How to detect and handle missing values
- How to use dropna() and fillna() effectively
- How to convert strings to numbers with to_numeric()
- How to clean up duplicate rows

👉 In the next lesson, you’ll learn how to change data types and use categorical data for optimization.