# **Data Cleaning Notebook**

## Objectives

* Clean the Ames, Iowa housing dataset for modeling.

## Inputs

* outputs/datasets/collection/house_prices_records.csv
* outputs/datasets/collection/inherited_houses.csv

## Outputs

* Cleaned data set: outputs/datasets/cleaned/house_prices_records.csv
* Cleaned data set: outputs/datasets/cleaned/inherited_houses.csv


---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor the working directory must be changed to the project root.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
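
As a small illustration of the step above, the sketch below shows what os.path.dirname() returns for a folder path. The path is a hypothetical example, not the real project location, and nothing here changes the actual working directory.

```python
import os

# Hypothetical notebook folder, used only for illustration.
notebook_dir = os.path.join("workspace", "project", "jupyter_notebooks")

# os.path.dirname() strips the last path component, which gives the parent folder.
parent_dir = os.path.dirname(notebook_dir)
print(parent_dir)
```

Note that re-running the os.chdir() cell moves up one more level each time, so it is worth confirming the current directory before loading any files.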

# Load Data Set

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/house_prices_records.csv")
df_inherited = pd.read_csv("outputs/datasets/collection/inherited_houses.csv")

# A cell renders only its last expression, so display both previews explicitly
display(df.head(3))
display(df_inherited.head(3))

---

# Cleaning of house_prices_records.csv

Initial inspection

In [None]:
print("Initial shape:", df.shape)
df.info()  # info() prints its report directly and returns None

Check missing values

In [None]:
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
print("Missing values:\n", missing_values)

Visualize missing values

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,6))
sns.heatmap(df.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values Heatmap")
plt.show()

Drop columns where more than 30% of the values are missing

In [None]:
thresh = len(df) * 0.3
cols_to_drop = [col for col in df.columns if df[col].isnull().sum() > thresh]
df.drop(columns=cols_to_drop, inplace=True)

df
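
To make the 30% rule concrete, here is a minimal sketch on a toy frame. The column names and values are invented for illustration and do not come from the housing data. It expresses the same rule as a share of missing values per column rather than a count.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names, for illustration only.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, 1.0, np.nan],  # 3 of 4 values missing
    "some_missing": [1.0, np.nan, 3.0, 4.0],          # 1 of 4 values missing
    "complete": [1, 2, 3, 4],                         # nothing missing
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean()

# Keep the same rule as above: drop a column when more than 30% is missing.
cols_to_drop = missing_share[missing_share > 0.3].index.tolist()
toy_reduced = toy.drop(columns=cols_to_drop)

print(missing_share)
print("Dropped:", cols_to_drop)
```

Printing the dropped column names, as done here, also leaves a record of what was removed, which a bare in-place drop does not.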

Fill categorical missing values with "Not Available"

In [None]:
categorical_cols = df.select_dtypes(include='object').columns
for col in categorical_cols:
    df[col] = df[col].fillna("Not Available")

df

Fill numerical missing values with 0 (assumed no feature present)

In [None]:
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_cols:
    df[col] = df[col].fillna(0)

df
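
The two fill steps can be checked on a toy frame before trusting them on the real data. The sketch below uses invented column names and values. It selects numeric columns with include="number", which is an alternative to listing int64 and float64, and is not the selection used in the cells above.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names, for illustration only.
toy = pd.DataFrame({
    "finish": ["Fin", None, "Unf"],
    "area": [120.0, np.nan, 80.0],
})

filled = toy.copy()
num_cols = filled.select_dtypes(include="number").columns
cat_cols = filled.select_dtypes(exclude="number").columns

# Same placeholders as in the notebook: a label for text, 0 for numbers.
filled[cat_cols] = filled[cat_cols].fillna("Not Available")
filled[num_cols] = filled[num_cols].fillna(0)

print(filled)
print("Mean area before:", toy["area"].mean())
print("Mean area after: ", filled["area"].mean())
```

In this toy case the fill lowers the mean of area, because the zero counts as a real observation. Filling with 0 therefore fits columns where a missing value really means the feature is absent. For other columns a different strategy may be more appropriate.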

Recheck for missing values

In [None]:
print("Remaining missing values:", df.isnull().sum().sum())

---

# Cleaning of inherited_houses.csv

Initial inspection

In [None]:
print("Initial shape:", df_inherited.shape)
df_inherited.info()  # info() prints its report directly and returns None

Check missing values

In [None]:
missing_values = df_inherited.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
print("Missing values:\n", missing_values)
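
The same missing-value check is run on both data sets, so it could be wrapped in a small helper. The function name missing_summary and the toy data below are hypothetical. This is one possible sketch, not part of the original workflow.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data with hypothetical column names, for illustration only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 1.0, 1.0, 1.0],
})
summary = missing_summary(toy)
print(summary)
```

An empty result means the frame has no missing values, which is the case reported for inherited_houses.csv.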

Visualize missing values

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,6))
sns.heatmap(df_inherited.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values Heatmap")
plt.show()

inherited_houses.csv has no missing values, so it needs no further cleaning and can be saved as is.

---

# Push files to Repo

* Save both cleaned data sets to outputs/datasets/cleaned so they can be committed to the repo.

In [None]:
import os

# exist_ok=True avoids an error when the folder already exists
os.makedirs("outputs/datasets/cleaned", exist_ok=True)

df.to_csv("outputs/datasets/cleaned/house_prices_records.csv", index=False)
df_inherited.to_csv("outputs/datasets/cleaned/inherited_houses.csv", index=False)
print("Cleaned datasets saved successfully.")
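
After saving, it can be worth checking that the cleaned data survives a round trip through CSV. The sketch below does this in memory with a toy frame, so it touches no real files. The column names and values are invented for illustration.

```python
import io
import pandas as pd

# Toy stand-in for a cleaned frame; names and values are hypothetical.
cleaned = pd.DataFrame({
    "finish": ["Fin", "Not Available"],
    "area": [120.0, 0.0],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Count missing values that reappear after reloading.
remaining = int(reloaded.isnull().sum().sum())
print("Missing values after reload:", remaining)
```

The "Not Available" label reloads as ordinary text. A shorter placeholder such as "NA" would be parsed as missing by pd.read_csv under its default settings, so the choice of label matters for any notebook that reads these files later.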