# Notebook 01: Data Exploration

This notebook conducts exploratory data analysis of the Ames Housing dataset to understand its structure, identify patterns, and discover data quality issues. We use a discovery-driven approach without predetermined assumptions, letting the data guide our analysis through systematic investigation.


## 1. Dataset Structure Discovery

Examine the dataset's dimensions, feature types, and basic structural properties, using the data description parser as a source of domain knowledge, to establish a foundation for the later analysis.

### 1.1 Data Loading and Basic Structure

Load the dataset and inspect fundamental characteristics including dimensions, feature names, and data types.

In [23]:
# Load required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the datasets
df_train = pd.read_csv('../data/raw/train.csv')
df_test = pd.read_csv('../data/raw/test.csv')

print("Dataset Dimensions:")
print(f"Training data: {df_train.shape}")
print(f"Test data: {df_test.shape}")
print(f"Features in training: {df_train.shape[1]}")
print(f"Features in test: {df_test.shape[1]}")

# Basic info about the dataset
print("\nDataset Overview:")
print(f"Total samples for training: {len(df_train)}")
print(f"Total samples for testing: {len(df_test)}")
print(f"Feature names (first 10): {list(df_train.columns[:10])}")

Dataset Dimensions:
Training data: (1460, 81)
Test data: (1459, 80)
Features in training: 81
Features in test: 80

Dataset Overview:
Total samples for training: 1460
Total samples for testing: 1459
Feature names (first 10): ['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities']


The dataset contains 1,460 training samples and 1,459 test samples, with 81 and 80 columns respectively. The one-column difference is consistent with the test set omitting the target variable (SalePrice). The first ten column names cover identification, dwelling class, zoning, lot details, and utilities.
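The column counts alone do not prove that SalePrice is the only difference between the two files, so it can be worth checking directly. The following is a minimal sketch on small stand-in frames; the `column_difference` helper and the `toy_train` / `toy_test` data are illustrative and not part of the notebook. In the notebook, `df_train` and `df_test` would be passed instead.

```python
import pandas as pd

def column_difference(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Return the columns that appear in only one of the two frames."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
    }

# Hypothetical stand-in data (made-up values, not rows from the real files)
toy_train = pd.DataFrame(
    {"Id": [1, 2], "LotArea": [8000, 9500], "SalePrice": [200000, 180000]}
)
toy_test = pd.DataFrame({"Id": [3, 4], "LotArea": [11000, 14000]})

diff = column_difference(toy_train, toy_test)
print(diff)
```

If `only_in_train` contains anything other than the target, or `only_in_test` is non-empty, the two files would need to be aligned before modeling.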

In [25]:
# Examine data types and get basic statistics
print("Data Types Distribution:")
print(df_train.dtypes.value_counts())

print("\nBasic Data Quality Check:")
print(f"Duplicate rows in training: {df_train.duplicated().sum()}")
print(f"Features with missing data: {df_train.isnull().any().sum()}")
print(f"Total missing values: {df_train.isnull().sum().sum()}")

# Check target variable presence
print("\nTarget Variable:")
if 'SalePrice' in df_train.columns:
    print(f"SalePrice range: ${df_train['SalePrice'].min():,.0f} - ${df_train['SalePrice'].max():,.0f}")
    print(f"SalePrice mean: ${df_train['SalePrice'].mean():,.0f}")
else:
    print("SalePrice not found in training data")

Data Types Distribution:
object     43
int64      35
float64     3
Name: count, dtype: int64

Basic Data Quality Check:
Duplicate rows in training: 0
Features with missing data: 19
Total missing values: 7829

Target Variable:
SalePrice range: $34,900 - $755,000
SalePrice mean: $180,921


The training data contains 43 categorical (object) columns and 38 numerical columns (35 int64, 3 float64); the numerical count includes Id and SalePrice. There are no duplicate rows, but 19 columns contain missing values, for a total of 7,829 missing entries (about 6.6% of the 1,460 × 81 cells). SalePrice ranges from $34,900 to $755,000 with a mean of $180,921, a spread of more than 20-fold between the cheapest and most expensive sale.
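The totals above do not show which columns account for the missing entries. A per-column breakdown is a common next step; the sketch below uses a hypothetical `missing_summary` helper and a made-up `toy` frame (the column names are borrowed from the dataset, the values are not). In the notebook, `df_train` would be passed instead.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Missing count and percentage per column, for columns with any gaps."""
    counts = df.isnull().sum()
    summary = pd.DataFrame(
        {
            "missing": counts,
            "percent": (counts / len(df) * 100).round(2),
        }
    )
    # Keep only columns that actually have missing values, largest first
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)

# Hypothetical stand-in data with deliberate gaps
toy = pd.DataFrame(
    {
        "Alley": [np.nan, np.nan, "Pave", np.nan],
        "LotFrontage": [65.0, np.nan, 68.0, 60.0],
        "LotArea": [8000, 9500, 11000, 9000],
    }
)

summary = missing_summary(toy)
print(summary)
```

Sorting by count makes it easy to separate columns that are mostly empty from those with only a few gaps, which usually call for different handling.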

### 1.2 Feature Classification with Parser Integration

Use the data description parser to categorize features according to official documentation and understand their real estate domain context.

In [26]:
# Setup data description parser for domain knowledge
from data_description_parser import (
    load_feature_descriptions,
    quick_feature_lookup, 
    display_summary_table,
    get_categorical_features,
    get_numerical_features
)

# Load official documentation
feature_descriptions = load_feature_descriptions()
print("Official real estate documentation loaded successfully.")

# Display comprehensive feature overview
print("\nOfficial Feature Classification Summary:")
display_summary_table(feature_descriptions, max_rows=15)

Official real estate documentation loaded successfully.

Official Feature Classification Summary:
Feature Summary Table:
     Feature        Type                                                     Description Categories_Summary
  MSSubClass Categorical           Identifies the type of dwelling involved in the sale.      16 categories
    MSZoning Categorical       Identifies the general zoning classification of the sale.       8 categories
 LotFrontage   Numerical                     Linear feet of street connected to property          Numerical
     LotArea   Numerical                                         Lot size in square feet          Numerical
      Street Categorical                                 Type of road access to property       2 categories
       Alley Categorical                                Type of alley access to property       3 categories
    LotShape Categorical                                       General shape of property       4 categories
 LandContour Ca

The parser loaded the official documentation, which describes 79 features: 46 categorical and 33 numerical. That is two fewer than the 81 training columns, presumably because Id and SalePrice are not documented (the summary table starts at MSSubClass rather than Id). Notable categorical features include MSSubClass (16 dwelling types) and Neighborhood (25 locations). The parser counts three more categorical features than pandas detects as object columns (46 vs. 43), which suggests that some categorical features are stored as integers.

In [27]:
# Get official feature classifications and compare with pandas
categorical_features = get_categorical_features(feature_descriptions)
numerical_features = get_numerical_features(feature_descriptions)

print("Parser vs Pandas Classification Comparison:")
print(f"Parser - Categorical: {len(categorical_features)}, Numerical: {len(numerical_features)}")
print(f"Pandas - Object: {len(df_train.select_dtypes(include=['object']).columns)}, Numerical: {len(df_train.select_dtypes(include=['int64', 'float64']).columns)}")

# Identify discrepancies between parser and pandas classifications
pandas_objects = set(df_train.select_dtypes(include=['object']).columns)
parser_categorical = set(categorical_features)

print("\nClassification Analysis:")
print("Features classified as categorical by parser but numerical by pandas:")
categorical_as_numeric = parser_categorical - pandas_objects
if categorical_as_numeric:
    # Sort so the output order is deterministic (set iteration order is not)
    for feature in sorted(categorical_as_numeric)[:5]:
        print(f"  {feature}: {df_train[feature].dtype}")

Parser vs Pandas Classification Comparison:
Parser - Categorical: 46, Numerical: 33
Pandas - Object: 43, Numerical: 38

Classification Analysis:
Features classified as categorical by parser but numerical by pandas:
  MSSubClass: int64
  OverallCond: int64
  OverallQual: int64


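Key finding: three features (OverallQual, OverallCond, MSSubClass) are stored as int64 but are categorical according to the official documentation, which accounts for the 46 vs. 43 gap. OverallQual and OverallCond are ordinal ratings, whereas MSSubClass is a nominal code identifying the dwelling type, so its numeric values carry no order. Treating these columns as continuous variables could distort preprocessing and modeling: the two ratings are better handled as ordered categories and MSSubClass as an unordered one.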
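One way to encode this distinction in pandas is with categorical dtypes. The sketch below is an assumption about how the conversion could be done, not a step taken in this notebook; it uses a made-up `toy` frame in place of `df_train` and relies on the documented 1 to 10 scale for the two rating columns.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical stand-in rows; in the notebook this would be df_train
toy = pd.DataFrame(
    {
        "MSSubClass": [60, 20, 60, 70],
        "OverallQual": [7, 6, 7, 8],
        "OverallCond": [5, 8, 5, 5],
    }
)

# Ordinal ratings: ordered categorical over the documented 1-10 scale
rating_dtype = CategoricalDtype(categories=list(range(1, 11)), ordered=True)
for col in ["OverallQual", "OverallCond"]:
    toy[col] = toy[col].astype(rating_dtype)

# Nominal dwelling code: unordered categorical (the codes carry no order)
toy["MSSubClass"] = toy["MSSubClass"].astype("category")

print(toy.dtypes)
# Ordered categoricals still support comparisons and min/max
print(toy["OverallQual"].max())
print((toy["OverallQual"] > 6).tolist())
```

Whether the ratings should stay as ordered categories or remain plain integers depends on the model; tree-based models, for example, often work well with the integer form, so this conversion is a design choice rather than a requirement.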