
# 🧪 Beginner Lab: Handling Missing Data with pandas and scikit-learn (Ames Housing)

**Goal:** Learn simple, practical ways to find and fix missing values using **pandas** and **scikit-learn**.
We will use the dataset `Ames_outliers_removed.csv`. All code is beginner-friendly. No fancy tricks.

### What you'll do
1. Load the dataset and take a quick look.
2. Measure missing values by column.
3. Visualize which columns have the most missing data.
4. Decide on basic strategies:
   - Drop columns with too many missing values (optional).
   - Drop rows only when truly necessary (optional).
   - Fill numeric columns with **median**.
   - Fill categorical columns with **most frequent** value.
5. Save a clean, imputed CSV you can use later.



## 1) Setup

Run the cell below to import the libraries. If an import fails, install the missing package (for example, `pip install pandas scikit-learn matplotlib`) and run the cell again.


In [None]:

# Imports
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt  # We'll keep plots simple

from sklearn.impute import SimpleImputer

# Make pandas show more columns so we can read tables easily
pd.set_option("display.max_columns", 100)



## 2) Load the dataset

This notebook tries a few common locations. If it can't find the file, place `Ames_outliers_removed.csv`
in the **same folder** as this notebook and run again.


In [None]:

# Try to find the dataset in a few common paths
possible_paths = [
    "./Ames_outliers_removed.csv",
    "Ames_outliers_removed.csv",
    "/mnt/data/Ames_outliers_removed.csv"  # Used in hosted environments
]

data_path = None
for p in possible_paths:
    if os.path.exists(p):
        data_path = p
        break

if data_path is None:
    raise FileNotFoundError(
        "Could not find 'Ames_outliers_removed.csv'. "
        "Place it in the same folder as this notebook and try again."
    )

print("Using file:", data_path)
df = pd.read_csv(data_path)
print("Shape:", df.shape)
df.head()



## 3) Quick information about the data

Use `info()` to see column types and a first sense of missingness (look at "non-null" counts).


In [None]:

df.info()



## 4) Missing values overview by column

We will count how many missing values each column has and also the percentage.
Then we sort so the worst offenders are on top.


In [None]:

# Count missing values per column
missing_count = df.isna().sum()

# Percentage of missing values per column
missing_percent = (missing_count / len(df)) * 100

# Put into one easy-to-read table
missing_table = (
    pd.DataFrame({"missing_count": missing_count, "missing_percent": missing_percent})
    .sort_values(by="missing_percent", ascending=False)
)

# Show the 20 columns with the highest missing percentage
missing_table.head(20)
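
To see exactly what this table computes, here is a minimal sketch on a tiny hand-made DataFrame. The `toy` frame and its column names are made up for illustration and are not part of the Ames data. It also shows a common shortcut: the mean of a True/False mask is the fraction of True values, so `isna().mean()` gives the missing fraction directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows with a known missing pattern
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],                        # 2 of 4 missing
    "b": [1.0, 2.0, 3.0, 4.0],                              # none missing
    "c": pd.Series(["x", None, "z", "w"], dtype="object"),  # 1 of 4 missing
})

# Same recipe as the lab: count first, then convert to a percentage
toy_count = toy.isna().sum()
toy_percent = toy_count / len(toy) * 100

# Shortcut: average the True/False mask, then scale to percent
toy_percent_short = toy.isna().mean() * 100

print(toy_percent.sort_values(ascending=False))
```

Both routes give the same numbers, so use whichever reads more clearly to you.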



## 5) Visualize missingness (Top 20)

A simple bar chart helps you see which columns need attention first.


In [None]:

top_n = 20
top_missing = missing_table.head(top_n)

plt.figure(figsize=(10, 5))
plt.bar(top_missing.index.astype(str), top_missing["missing_percent"])
plt.xticks(rotation=75, ha="right")
plt.ylabel("Percent missing")
plt.title(f"Top {top_n} columns by missing percentage")
plt.tight_layout()
plt.show()
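
When column names are long, a horizontal bar chart can be easier to read than rotated labels. The sketch below is one possible variation, shown on a hypothetical toy frame rather than the real dataset. It also filters out columns with no missing values so the chart only shows columns that need attention.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],   # 50% missing
    "b": [1.0, 2.0, 3.0, 4.0],         # 0% missing
    "c": [np.nan, 2.0, 3.0, 4.0],      # 25% missing
})

toy_percent = toy.isna().mean() * 100

# Keep only columns with at least one missing value.
# Sort ascending so the largest bar is drawn last and ends up at the top.
to_plot = toy_percent[toy_percent > 0].sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(to_plot.index.astype(str), to_plot.values)
ax.set_xlabel("Percent missing")
ax.set_title("Columns with missing values")
fig.tight_layout()
```

In a notebook the figure appears when the cell finishes. In a plain script, add `plt.show()` at the end.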



## 6) Separate numeric and categorical columns

We will impute them differently:
- **Numeric:** use **median** (robust to outliers).
- **Categorical:** use **most frequent** value.


In [None]:

# Work on a copy so we always keep the original df untouched
df_work = df.copy()

# Identify numeric and categorical columns in a simple way
numeric_cols = df_work.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = df_work.select_dtypes(include=["object", "category"]).columns.tolist()

print("Numeric columns:", len(numeric_cols))
print("Categorical columns:", len(categorical_cols))
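
One thing worth checking is whether any column falls into neither group. The sketch below uses a hypothetical toy frame (all names are made up) to show that a boolean column is not selected by `"number"` and is not selected by `"object"` or `"category"` either, so it would be skipped by both imputers.

```python
import pandas as pd

# Hypothetical toy data with four different column types
toy = pd.DataFrame({
    "area": [850.0, 1200.0, 990.0],                                       # float
    "rooms": [3, 4, 3],                                                   # integer
    "style": pd.Series(["ranch", "colonial", "ranch"], dtype="object"),  # text
    "has_garage": [True, False, True],                                    # boolean
})

toy_numeric = toy.select_dtypes(include=["number"]).columns.tolist()
toy_categorical = toy.select_dtypes(include=["object", "category"]).columns.tolist()

# Columns in neither list would be left out of both imputation steps
leftover = [c for c in toy.columns if c not in toy_numeric + toy_categorical]

print("Numeric:", toy_numeric)
print("Categorical:", toy_categorical)
print("Leftover:", leftover)
```

If the same check on `df_work` returns an empty list, every column is covered. If not, decide how you want to treat the leftover columns before imputing.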



## 7) Optional: Drop columns with too many missing values

A simple rule: if more than **40%** of a column is missing, it might be safer to drop it.
This is not a law. It depends on your project. We just show how to do it.

If you don't want to drop anything, set `threshold = 1.1` so nothing is dropped.


In [None]:

# Choose a simple threshold
threshold = 0.40  # 40%
cols_to_drop = missing_table[missing_table["missing_percent"] > (threshold * 100)].index.tolist()

print("Columns suggested to drop (you can change the threshold):")
print(cols_to_drop)

# Make a version where those are dropped (optional)
df_drop_cols = df_work.drop(columns=cols_to_drop)
print("Shape before drop:", df_work.shape, " -> after drop:", df_drop_cols.shape)
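
Here is the same threshold rule on a small toy frame where the answer is easy to verify by eye. The frame and the variable names (`toy`, `keep_mask`, and so on) are hypothetical. This version works with fractions (0 to 1) instead of percentages and uses a boolean mask to select the columns to keep.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with known missing fractions
toy = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],  # 75% missing
    "some_gaps": [1.0, np.nan, 3.0, 4.0],           # 25% missing
    "complete": [1, 2, 3, 4],                       # 0% missing
})

toy_threshold = 0.40
missing_fraction = toy.isna().mean()

# True means "keep this column"
keep_mask = missing_fraction <= toy_threshold
toy_kept = toy.loc[:, keep_mask]

print("Dropped:", missing_fraction[~keep_mask].index.tolist())
print("Kept:", toy_kept.columns.tolist())
```

Only `mostly_empty` is above the 40% line, so it is the only column removed.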



## 8) Optional: Drop rows in special cases

Dropping rows is usually a last resort. Two common simple cases:
1. Drop rows that are completely empty (rare).
2. Drop rows missing a **critical** target column (for example, `SalePrice`).

We will **not** drop rows by default here. The code below shows how you would do it.


In [None]:

# Example: drop rows that are entirely missing (almost never needed)
df_no_all_na_rows = df_drop_cols.dropna(how="all")
print("Rows before:", len(df_drop_cols), " -> after removing all-NA rows:", len(df_no_all_na_rows))

# Example: if you must have SalePrice for your task
if "SalePrice" in df_no_all_na_rows.columns:
    df_no_all_na_rows = df_no_all_na_rows.dropna(subset=["SalePrice"])
    print("After dropping rows missing SalePrice:", len(df_no_all_na_rows))
else:
    print("Column 'SalePrice' not found. Skipping this specific drop example.")
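
To see why plain `dropna()` is usually too aggressive, compare the three options on a toy frame. The data below is made up for illustration. Only `SalePrice` is a column name taken from this lab, and `area` is a hypothetical stand-in feature.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete row, one empty row, two partial rows
toy = pd.DataFrame({
    "area": [8450.0, np.nan, np.nan, 9600.0],
    "SalePrice": [208500.0, np.nan, 181500.0, np.nan],
})

# Drop only rows where every value is missing
rows_all = toy.dropna(how="all")

# Drop only rows where the target is missing
rows_target = toy.dropna(subset=["SalePrice"])

# Default behavior: drop rows with any missing value at all
rows_any = toy.dropna()

print("Original rows:", len(toy))
print("After how='all':", len(rows_all))
print("After subset=['SalePrice']:", len(rows_target))
print("After plain dropna():", len(rows_any))
```

The default removes every row with even a single gap, which on a wide dataset can discard most of the data.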



## 9) Impute numeric columns (median)

Why median? It is simple and resistant to outliers. This is a good default for beginners.


In [None]:

# Take the current working DataFrame from the previous step
df_impute_base = df_no_all_na_rows.copy()

# Simple median imputer for numeric columns
num_imputer = SimpleImputer(strategy="median")

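# Keep only the numeric columns that still exist after the optional drops in step 7
numeric_cols = [c for c in numeric_cols if c in df_impute_base.columns]

# Fit on the numeric part and transform it
numeric_imputed = num_imputer.fit_transform(df_impute_base[numeric_cols])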

# Put back into a DataFrame with the same column names and index
numeric_imputed_df = pd.DataFrame(numeric_imputed, columns=numeric_cols, index=df_impute_base.index)

# Peek at how many missing remain in numeric columns (should be zero)
numeric_imputed_df.isna().sum().sum()
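
A small experiment makes the outlier argument concrete. The sketch below uses a made-up `price` column with one extreme value and compares the fill value chosen by `strategy="median"` with the one chosen by `strategy="mean"`. The fitted fill values are stored in the imputer's `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: three typical values, one gap, one extreme value
toy = pd.DataFrame({"price": [100.0, 110.0, 120.0, np.nan, 5000.0]})

median_imputer = SimpleImputer(strategy="median")
mean_imputer = SimpleImputer(strategy="mean")

filled_median = median_imputer.fit_transform(toy)
filled_mean = mean_imputer.fit_transform(toy)

# statistics_ holds one fill value per column
print("Median fill:", median_imputer.statistics_[0])
print("Mean fill:", mean_imputer.statistics_[0])
```

The median fill (115.0) sits among the typical values, while the mean fill (1332.5) is pulled far upward by the single extreme value.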



## 10) Impute categorical columns (most frequent)

For text-like columns, filling with the **most frequent** value is a simple baseline. It keeps every row, but it can inflate the share of the most common category when a column has many gaps.


In [None]:

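# Keep only the categorical columns that still exist after the optional drops in step 7
categorical_cols = [c for c in categorical_cols if c in df_impute_base.columns]

# If there are categorical columns, impute them
if len(categorical_cols) > 0: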
    cat_imputer = SimpleImputer(strategy="most_frequent")
    categorical_imputed = cat_imputer.fit_transform(df_impute_base[categorical_cols])
    categorical_imputed_df = pd.DataFrame(categorical_imputed, columns=categorical_cols, index=df_impute_base.index)
else:
    # No categorical columns found
    categorical_imputed_df = pd.DataFrame(index=df_impute_base.index)

# Check missing in categorical part
categorical_imputed_df.isna().sum().sum()
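
For comparison, here is a minimal sketch of two categorical strategies on a toy column: `"most_frequent"` as used above, and `"constant"` with a placeholder string. The `garage_type` column and its values are hypothetical. Whether a placeholder is appropriate depends on what a missing value means for that feature.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one text column with a single gap
toy = pd.DataFrame({
    "garage_type": pd.Series(["Attchd", "Attchd", "Detchd", np.nan], dtype="object"),
})

# Option 1: fill the gap with the most common category
most_frequent = SimpleImputer(strategy="most_frequent")
filled_mode = most_frequent.fit_transform(toy)

# Option 2: fill the gap with an explicit placeholder string
constant = SimpleImputer(strategy="constant", fill_value="None")
filled_constant = constant.fit_transform(toy)

print("Most frequent fill:", filled_mode[3, 0])
print("Constant fill:", filled_constant[3, 0])
```

The first option treats the gap as an ordinary `Attchd` row. The second keeps it visible as its own category.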



## 11) Combine numeric and categorical back together

Now we join the imputed numeric and categorical tables into one clean DataFrame.


In [None]:

# Combine along columns
df_imputed = pd.concat([numeric_imputed_df, categorical_imputed_df], axis=1)

# Keep original column order where possible
# Columns that were neither numeric nor categorical (for example, booleans or dates)
# were not imputed, so they will not appear in this new DataFrame
ordered_cols = [c for c in df_impute_base.columns if c in df_imputed.columns]
df_imputed = df_imputed[ordered_cols]

print("Shape after imputation:", df_imputed.shape)
df_imputed.head()
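
As an alternative to imputing the two groups separately and concatenating, scikit-learn's `ColumnTransformer` can apply both imputers in one step. This is a different technique from the one used in this lab, sketched here on hypothetical toy data. It relies on the output columns following the order of the transformer list (numeric first, then categorical).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data with gaps in every column
toy = pd.DataFrame({
    "area": [850.0, np.nan, 990.0, 1200.0],
    "style": pd.Series(["ranch", "ranch", np.nan, "colonial"], dtype="object"),
    "rooms": [3.0, 4.0, np.nan, 5.0],
})

toy_numeric = ["area", "rooms"]
toy_categorical = ["style"]

imputer = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), toy_numeric),
        ("cat", SimpleImputer(strategy="most_frequent"), toy_categorical),
    ]
)

# The result is a plain array; columns follow the transformer order above
filled = imputer.fit_transform(toy)

# Rebuild a DataFrame, then restore the original column order
toy_imputed = pd.DataFrame(filled, columns=toy_numeric + toy_categorical, index=toy.index)
toy_imputed = toy_imputed[toy.columns.tolist()]

# Mixing numbers and text yields object columns, so convert numerics back
toy_imputed = toy_imputed.astype({c: float for c in toy_numeric})

print(toy_imputed)
```

The step-by-step approach in this lab is easier to inspect. The one-step version becomes convenient later when imputation is part of a modeling pipeline.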



## 12) Check missing values before vs after

We make a quick comparison to confirm our imputation worked.


In [None]:

# Total missing values in the original working frame (after any drops)
before_total_na = df_impute_base.isna().sum().sum()

# Total missing values after imputation
after_total_na = df_imputed.isna().sum().sum()

print("Total missing values before:", before_total_na)
print("Total missing values after :", after_total_na)



## 13) Save the cleaned data

Save your imputed data to a new CSV. You can use it in future notebooks or models.


In [None]:

output_path = "Ames_clean_imputed.csv"
df_imputed.to_csv(output_path, index=False)
print("Saved:", output_path)
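
After saving, it can be worth reading the file back to confirm it still has no missing values. The sketch below does a round trip in memory on hypothetical toy data. It also shows one case that can surprise: by default `read_csv` treats certain strings, including `NA`, as missing, so a placeholder like that would come back as NaN unless you change the defaults.

```python
import io
import pandas as pd

# Hypothetical toy data: no real gaps, but one cell holds the text "NA"
toy = pd.DataFrame({
    "quality": ["Gd", "NA", "TA"],
    "area": [8450, 9600, 11250],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read back with default settings: the text "NA" is parsed as missing
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Read back treating only empty cells as missing
buffer.seek(0)
strict_read = pd.read_csv(buffer, keep_default_na=False, na_values=[""])

missing_default = int(default_read.isna().sum().sum())
missing_strict = int(strict_read.isna().sum().sum())

print("Missing with default settings:", missing_default)
print("Missing with keep_default_na=False:", missing_strict)
```

For the file saved above, the equivalent check is to reload `output_path` with `pd.read_csv` and confirm that `isna().sum().sum()` is zero.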



## 14) Next steps (optional)

- Try a different numeric strategy: `"mean"` instead of `"median"`.
- For categorical, try filling with a placeholder like `"None"` if it makes sense for the feature.
- Create **missingness indicator** columns to remember which values were missing. Example:

```python
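df_with_flags = df_impute_base.copy()
# Only flag columns that actually contain missing values
cols_with_na = df_with_flags.columns[df_with_flags.isna().any()]
for col in cols_with_na:
    df_with_flags[col + "_was_missing"] = df_with_flags[col].isna().astype(int)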
```

- Explore more advanced imputers later, like `KNNImputer` or model-based methods.
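
As a first taste of `KNNImputer`, here is a minimal sketch on hypothetical toy data. It fills a gap using the average of the nearest rows, where "nearest" is measured on the features that are present. It works on numeric columns only, and in practice features on very different scales usually need scaling first, which this sketch skips.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: two small homes, two large homes, and one small
# home whose room count is missing
toy = pd.DataFrame({
    "area": [800.0, 820.0, 2000.0, 2050.0, 810.0],
    "rooms": [2.0, 2.0, 6.0, 6.0, np.nan],
})

# Use the 2 most similar rows to fill each gap
knn = KNNImputer(n_neighbors=2)
toy_filled = pd.DataFrame(knn.fit_transform(toy), columns=toy.columns, index=toy.index)

print("KNN fill for rooms:", toy_filled.loc[4, "rooms"])
print("Median fill would be:", toy["rooms"].median())
```

The last row is closest in `area` to the two small homes, so the gap is filled with 2.0 rooms. A median fill would use 4.0, a value that matches none of the observed homes.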
