## Exploratory Data Analysis of the Ames Housing dataset

### Import required libs

In [5]:
import pandas as pd

from viz import bar, box, corrTable, heatMap, hist, scatter

### Load & Analyze the dataset

In [6]:
# before executing intake.py (raw-data)
ames_raw = pd.read_csv(r"D:\Languages\Projects\ml-portfolio\regression\house_price_ames\data\raw\AmesHousing.csv")
print(f'ames_raw -> {ames_raw.shape[0]} rows × {ames_raw.shape[1]} columns')

# after executing intake.py (cleaned-data)
ames_after = pd.read_csv(r"D:\Languages\Projects\ml-portfolio\regression\house_price_ames\data\processed\ames_cleaned.csv")
print(f'ames_after -> {ames_after.shape[0]} rows × {ames_after.shape[1]} columns')

ames_raw -> 2930 rows × 82 columns
ames_after -> 2930 rows × 121 columns


##### The processed data has 39 more columns than the raw data (121 vs. 82).  
##### Let’s verify how many columns were actually added or removed by comparing the sets of column names.

In [7]:
raw_cols = set(ames_raw.columns)
after_cols = set(ames_after.columns)

added = after_cols - raw_cols
removed = raw_cols - after_cols
common = raw_cols & after_cols

print("Added:", len(added))
print("Removed:", len(removed))
print("Common:", len(common))

Added: 108
Removed: 69
Common: 13


##### That’s a large number of removed columns: 69 of the 82 raw names have no exact match in the processed data. This is likely due to name cleaning in intake.py. Let’s normalize the column names to see the actual differences.

In [8]:
def normalize_cols(cols):
    return [c.strip().replace(" ", "_").replace("-", "_") for c in cols]

raw_cols_norm = normalize_cols(ames_raw.columns)
after_cols_norm = normalize_cols(ames_after.columns)

added = [c for c in after_cols_norm if c not in raw_cols_norm]
removed = [c for c in raw_cols_norm if c not in after_cols_norm]
common = [c for c in raw_cols_norm if c in after_cols_norm]

print("Added:", len(added))
print("Removed:", len(removed))
print("Common:", len(common))

Added: 39
Removed: 0
Common: 82
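
One caveat with the list-based comparison above: if two different raw names normalized to the same string, both would still be counted as "common". The sketch below is one way to rule that out, assuming the same normalization rule as the cell above. The helper name `find_collisions` and the example column names are hypothetical, not part of intake.py.

```python
from collections import Counter


def normalize_cols(cols):
    # Same rule as the notebook cell: trim, then map spaces and hyphens to underscores
    return [c.strip().replace(" ", "_").replace("-", "_") for c in cols]


def find_collisions(cols):
    """Return normalized names that more than one original name maps to."""
    counts = Counter(normalize_cols(cols))
    return sorted(name for name, n in counts.items() if n > 1)


# Toy input (hypothetical names): two spellings collapse to "Lot_Area"
example_cols = ["Lot Area", "Lot-Area", "Sale Price"]
print(find_collisions(example_cols))
```

Calling `find_collisions(ames_raw.columns)` and getting an empty list would suggest that each raw column maps to its own normalized name.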


##### After aligning the naming conventions, it’s clear that all 82 original features are still present. In addition, 39 new engineered features were added, most of which are `<col>_is_outlier` flags.
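
The share of outlier flags among the added columns can be checked by splitting the names on their suffix. This is a minimal sketch that assumes the flags all end in `_is_outlier`. The helper `split_outlier_flags` and the example names are hypothetical stand-ins for the `added` list computed above.

```python
def split_outlier_flags(added_cols, suffix="_is_outlier"):
    """Partition added column names into outlier flags and all other columns."""
    flags = [c for c in added_cols if c.endswith(suffix)]
    others = [c for c in added_cols if not c.endswith(suffix)]
    return flags, others


# Toy stand-in for `added`; these column names are hypothetical
example_added = ["Lot_Area_is_outlier", "Gr_Liv_Area_is_outlier", "Total_SF"]
flags, others = split_outlier_flags(example_added)

print(f"Outlier flags: {len(flags)} of {len(example_added)} added columns")
print("Other engineered features:", others)
```

In the notebook, passing the real `added` list instead of `example_added` would show exactly which of the 39 new columns are flags and which are other engineered features.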