# 🧼 Cleaning Dirty Data (Missing Values & Type Fixes)

## 🔹 LEARNING GOALS:
- Detect and count missing values (`NaN`)
- Fill or drop missing data
- Convert column data types safely
- Understand the difference between `NaN`, `None`, `""`, and type mismatches
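
The last goal is easier to keep straight with a quick check. The snippet below is a minimal sketch (the `candidates` and `is_missing` names are made up for illustration) showing which of these values pandas counts as missing:

```python
import numpy as np
import pandas as pd

# Four kinds of "bad" values that show up in messy data
candidates = {
    "np.nan": np.nan,
    "None": None,
    "empty string": "",
    "text in a numeric column": "thirty",
}

# pd.isna() treats NaN and None as missing, but not "" or arbitrary text
is_missing = {label: bool(pd.isna(value)) for label, value in candidates.items()}

for label, flag in is_missing.items():
    print(f"{label}: missing={flag}")
```

Only the first two register as missing. Empty strings and type mismatches are ordinary values as far as `isnull()` is concerned, so they need separate handling.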


### 🧪 1. Load a Messy Dataset

In [1]:
import pandas as pd
import numpy as np

data = {
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": ["25", "thirty", 35, np.nan, "40"],
    "Signup Date": ["2022-01-01", "not a date", "2022/03/01", None, "April 5, 2022"],
    "Score": [95.5, None, 88.0, 92.5, ""]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Signup Date,Score
0,Alice,25,2022-01-01,95.5
1,Bob,thirty,not a date,
2,Charlie,35,2022/03/01,88.0
3,David,,,92.5
4,,40,"April 5, 2022",


### 🧯 2. Detecting Missing or Broken Values

In [2]:
df.isnull()

Unnamed: 0,Name,Age,Signup Date,Score
0,False,False,False,False
1,False,False,False,True
2,False,False,False,False
3,False,True,True,False
4,True,False,False,False


In [3]:
df.isnull().sum()

Unnamed: 0,0
Name,1
Age,1
Signup Date,1
Score,1


In [7]:
df[df.isnull().any(axis=1)]

Unnamed: 0,Name,Age,Signup Date,Score
1,Bob,thirty,not a date,
3,David,,,92.5
4,,40,"April 5, 2022",
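
Note that `isnull()` reports only one missing `Score` even though row 4 holds an empty string, and it does not flag `"thirty"` in `Age` at all. A minimal sketch of catching those cases, using a two-column copy of the data above (`empty_counts` and `broken_age` are illustrative names):

```python
import numpy as np
import pandas as pd

# Same messy values as the "Age" and "Score" columns above
df = pd.DataFrame({
    "Age": ["25", "thirty", 35, np.nan, "40"],
    "Score": [95.5, None, 88.0, 92.5, ""],
})

# Empty strings are values, not nulls, so compare against "" explicitly
empty_counts = (df == "").sum()

# A value is "broken" if it is present but cannot be read as a number
age_as_number = pd.to_numeric(df["Age"], errors="coerce")
broken_age = df["Age"].notna() & age_as_number.isna()

print(empty_counts)
print(df.loc[broken_age, "Age"])
```

The second check works for any column that should be numeric: coerce a copy, then look for rows that were non-null before coercion and null after it.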


### 🧹 3. Cleaning Strategy Options

In [8]:
df.fillna({
    "Name": "Unknown",
    "Age": -1,
    "Signup Date": "1970-01-01",
    "Score": 0.0
})

Unnamed: 0,Name,Age,Signup Date,Score
0,Alice,25,2022-01-01,95.5
1,Bob,thirty,not a date,0.0
2,Charlie,35,2022/03/01,88.0
3,David,-1,1970-01-01,92.5
4,Unknown,40,"April 5, 2022",


In [9]:
df.dropna()

Unnamed: 0,Name,Age,Signup Date,Score
0,Alice,25,2022-01-01,95.5
2,Charlie,35,2022/03/01,88.0
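
Two details in the outputs above are easy to miss: `fillna()` left the empty-string `Score` in row 4 untouched, and neither call changed `df` itself, because both return a new object. A small sketch of one way to handle both, assuming `""` is the only placeholder in the column:

```python
import numpy as np
import pandas as pd

scores = pd.Series([95.5, None, 88.0, 92.5, ""], name="Score")

# Turn the "" placeholder into a real null so fillna()/dropna() can see it
normalized = scores.replace("", np.nan)

# fillna() returns a new Series; assign the result to keep it
filled = normalized.fillna(0.0)

print(scores.isna().sum())      # the original Series is unchanged
print(normalized.isna().sum())
print(filled.tolist())
```

Other placeholders such as `"N/A"` or `"-"` can be passed to `replace()` as a list if the dataset uses them.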


### 🧬 4. Data Type Fixes

In [12]:
df.dtypes

Unnamed: 0,0
Name,object
Age,object
Signup Date,object
Score,object


In [13]:
df["Age"].dtype

dtype('O')

In [11]:
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")
df["Signup Date"] = pd.to_datetime(df["Signup Date"], errors="coerce")
df.dtypes

Unnamed: 0,0
Name,object
Age,float64
Signup Date,datetime64[ns]
Score,float64
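
One caveat about the date conversion: the imputed `Signup Date` column in section 5 shows 2022-01-01 in every row, which suggests that only the first date survived and that `"2022/03/01"` and `"April 5, 2022"` were coerced to `NaT` along with the genuinely bad value. On pandas 2.0 and later this happens because the format is inferred from the first value. A minimal sketch, assuming pandas 2.0 or newer, of parsing each value on its own with `format="mixed"`:

```python
import pandas as pd

raw_dates = pd.Series(
    ["2022-01-01", "not a date", "2022/03/01", None, "April 5, 2022"]
)

# format="mixed" (pandas >= 2.0) infers the format of each element separately,
# so valid dates written in different styles are not thrown away
parsed = pd.to_datetime(raw_dates, format="mixed", errors="coerce")

print(parsed)
print("valid dates:", parsed.notna().sum())
```

Per-element inference is slower and can guess wrong on ambiguous day/month orders, so an explicit `format=` is still the safer choice when a column uses one known style.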


### 🩹 5. Impute (Fill In) Missing Values After Type Fixes

In [20]:
# Assign the result back; fillna() does not modify df in place
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Age"]

Unnamed: 0,Age
0,25.0
1,35.0
2,35.0
3,35.0
4,40.0


In [21]:
# Assign the result back; fillna() does not modify df in place
df["Score"] = df["Score"].fillna(df["Score"].mean())
df["Score"]

Unnamed: 0,Score
0,95.5
1,92.0
2,88.0
3,92.5
4,92.0


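In [17]:
# Assign the result back; fillna() does not modify df in place
df["Signup Date"] = df["Signup Date"].fillna(df["Signup Date"].min())
df["Signup Date"]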

Unnamed: 0,Signup Date
0,2022-01-01
1,2022-01-01
2,2022-01-01
3,2022-01-01
4,2022-01-01


In [23]:
# Assign the result back; fillna() does not modify df in place
df["Name"] = df["Name"].fillna("Unknown")
df["Name"]

Unnamed: 0,Name
0,Alice
1,Bob
2,Charlie
3,David
4,Unknown
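
`Age` ends up as `float64` because `NaN` cannot live in a plain integer column. If whole-number ages are preferred, two options are sketched below on a standalone copy of the column (the variable names are illustrative):

```python
import pandas as pd

# Same raw values as the "Age" column, coerced to numbers first
age = pd.to_numeric(pd.Series(["25", "thirty", 35, None, "40"]), errors="coerce")

# Option 1: impute first, then cast to a plain integer dtype
age_filled = age.fillna(age.median()).astype("int64")

# Option 2: keep the gaps and use pandas' nullable integer dtype instead
age_nullable = age.astype("Int64")

print(age_filled.tolist())
print(age_nullable)
```

Option 2 keeps the information that a value was missing, which matters if a placeholder such as the median would distort later statistics.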


### 🤓 6. Cleaned Data Review

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Name         5 non-null      object        
 1   Age          5 non-null      float64       
 2   Signup Date  5 non-null      datetime64[ns]
 3   Score        5 non-null      float64       
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 292.0+ bytes


In [25]:
df.describe(include="all")

Unnamed: 0,Name,Age,Signup Date,Score
count,5,5.0,5,5.0
unique,5,,,
top,Alice,,,
freq,1,,,
mean,,34.0,2022-01-01 00:00:00,92.0
min,,25.0,2022-01-01 00:00:00,88.0
25%,,35.0,2022-01-01 00:00:00,92.0
50%,,35.0,2022-01-01 00:00:00,92.0
75%,,35.0,2022-01-01 00:00:00,92.5
max,,40.0,2022-01-01 00:00:00,95.5
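
Reading `info()` by eye works for five rows. For larger data, a small programmatic check can confirm the same two things: no nulls remain and the numeric columns really are numeric. The helper below is a hypothetical sketch, not part of pandas:

```python
import pandas as pd

def find_problems(frame, numeric_columns):
    """Return a list of problems; an empty list means the frame passed."""
    problems = []
    for column in frame.columns:
        if frame[column].isna().any():
            problems.append(f"{column}: still has missing values")
    for column in numeric_columns:
        if not pd.api.types.is_numeric_dtype(frame[column]):
            problems.append(f"{column}: expected a numeric dtype, got {frame[column].dtype}")
    return problems

# Values taken from the imputed Age and Score columns above
cleaned = pd.DataFrame({
    "Age": [25.0, 35.0, 35.0, 35.0, 40.0],
    "Score": [95.5, 92.0, 88.0, 92.5, 92.0],
})
print(find_problems(cleaned, ["Age", "Score"]))

# A frame that has not been cleaned yet fails both checks
messy = pd.DataFrame({"Age": ["25", None]})
print(find_problems(messy, ["Age"]))
```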


### 🧪 Try It Yourself

Modify the `data` dictionary at the top of this notebook. Add:
- A new column with some `None` and `""` values
- At least one row with all columns filled incorrectly

Then re-run the notebook and fix it step by step.

In [28]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": ["25", "thirty", 35, np.nan, "40"],
    "Signup Date": ["2022-01-01", "not a date", "2022/03/01", None, "April 5, 2022"],
    "Score": [95.5, None, 88.0, 92.5, ""],
    "New": [None, "", None, "", np.nan]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Signup Date,Score,New
0,Alice,25,2022-01-01,95.5,
1,Bob,thirty,not a date,,
2,Charlie,35,2022/03/01,88.0,
3,David,,,92.5,
4,,40,"April 5, 2022",,


In [31]:
df.isnull()

Unnamed: 0,Name,Age,Signup Date,Score,New
0,False,False,False,False,True
1,False,False,False,True,False
2,False,False,False,False,True
3,False,True,True,False,False
4,True,False,False,False,True


In [34]:
df.isnull().sum()

Unnamed: 0,0
Name,1
Age,1
Signup Date,1
Score,1
New,3


In [35]:
df.fillna({
    "Name": "Ethan",
    "Score": 100,
    "New": "New Value",
    "Age": 22,
    "Signup Date": "2023-01-01"
})

Unnamed: 0,Name,Age,Signup Date,Score,New
0,Alice,25,2022-01-01,95.5,New Value
1,Bob,thirty,not a date,100.0,
2,Charlie,35,2022/03/01,88.0,New Value
3,David,22,2023-01-01,92.5,
4,Ethan,40,"April 5, 2022",,New Value


In [36]:
df.dropna()

Unnamed: 0,Name,Age,Signup Date,Score,New
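
`dropna()` returns an empty frame here because every row has a null in at least one column once `New` is added. A minimal sketch of two gentler options, `subset=` and `thresh=`, on the same data (the threshold of 4 is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": ["25", "thirty", 35, np.nan, "40"],
    "Signup Date": ["2022-01-01", "not a date", "2022/03/01", None, "April 5, 2022"],
    "Score": [95.5, None, 88.0, 92.5, ""],
    "New": [None, "", None, "", np.nan],
})

# Drop a row only if one of the columns that matter is null
required = df.dropna(subset=["Name", "Score"])

# Or keep rows that have at least 4 non-null values out of 5
mostly_complete = df.dropna(thresh=4)

print(len(df.dropna()), len(required), len(mostly_complete))
```

Empty strings still count as non-null in both calls, so replacing them with `NaN` first gives a more honest result.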


### 🧠 Mini-Challenge

> 🗂 Load `"data/survey.csv"` and:
- Identify which columns have missing values
- Use `.isnull().sum()` to get a null report
- Use a mix of `.fillna()`, `.dropna()`, and `pd.to_numeric()` or `pd.to_datetime()` to clean it
- Print a summary with `.info()` and `.describe()`
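
As a starting point, the steps above could look roughly like this. The contents of `data/survey.csv` are not shown in this notebook, so the sketch reads a tiny hypothetical CSV from a string instead; the column names `respondent`, `age`, and `submitted` are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/survey.csv; the real columns are unknown
csv_text = """respondent,age,submitted
r1,34,2023-05-01
r2,,2023-05-02
r3,forty,
"""
survey = pd.read_csv(io.StringIO(csv_text))

# Null report before any cleaning
null_report = survey.isnull().sum()
print(null_report)

# Fix types; values that cannot be parsed become NaN / NaT
survey["age"] = pd.to_numeric(survey["age"], errors="coerce")
survey["submitted"] = pd.to_datetime(survey["submitted"], errors="coerce")

survey.info()
print(survey.describe())
```

With the real file, swap the `io.StringIO(...)` argument for the path and adjust the column names to match.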

In [41]:
# Uses Colab's bundled California housing sample in place of data/survey.csv
df = pd.read_csv("/content/sample_data/california_housing_test.csv")
df


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
1,-118.30,34.26,43.0,1510.0,310.0,809.0,277.0,3.5990,176500.0
2,-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0
3,-118.36,33.82,28.0,67.0,15.0,49.0,11.0,6.1359,330000.0
4,-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0
...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23.0,1450.0,642.0,1258.0,607.0,1.1790,225000.0
2996,-118.14,34.06,27.0,5257.0,1082.0,3496.0,1036.0,3.3906,237200.0
2997,-119.70,36.30,10.0,956.0,201.0,693.0,220.0,2.2895,62000.0
2998,-117.12,34.10,40.0,96.0,14.0,46.0,14.0,3.2708,162500.0


In [42]:
df.isnull().sum()

Unnamed: 0,0
longitude,0
latitude,0
housing_median_age,0
total_rooms,0
total_bedrooms,0
population,0
households,0
median_income,0
median_house_value,0


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           3000 non-null   float64
 1   latitude            3000 non-null   float64
 2   housing_median_age  3000 non-null   float64
 3   total_rooms         3000 non-null   float64
 4   total_bedrooms      3000 non-null   float64
 5   population          3000 non-null   float64
 6   households          3000 non-null   float64
 7   median_income       3000 non-null   float64
 8   median_house_value  3000 non-null   float64
dtypes: float64(9)
memory usage: 211.1 KB


### 📝 Summary

| Concept        | Tool/Function                         |
|----------------|---------------------------------------|
| Detect nulls   | `df.isnull()`, `df.isnull().sum()`    |
| Drop rows      | `df.dropna()`                         |
| Fill values    | `df.fillna()`                         |
| Convert types  | `pd.to_numeric()`, `pd.to_datetime()` |
| Replace values | `df.replace()`                        |
