# House Prices Dataset Exploration

## Dataset Overview

We are working with the **House Prices** dataset from OpenML, which contains information about residential properties and their sale prices. This dataset is commonly used for regression tasks and feature engineering practice.

### Dataset Characteristics
- **Total Records**: 1,460 houses
- **Total Columns**: 81 (an `Id` column, 79 predictor features, and the target variable `SalePrice`)
- **Target Variable**: `SalePrice` - the sale price of each house
- **Feature Types**: Mix of numeric and categorical variables including property dimensions, condition ratings, amenities, and sale details

### Key Features Include
- Property basics: `LotArea`, `YearBuilt`, `BldgType`, `HouseStyle`
- Quality indicators: `OverallQual`, `OverallCond`, `ExterQual`, `KitchenQual`
- Structural components: basement, garage, porch measurements
- Categorical attributes: zoning, neighborhood, sale type, and sale condition

## Exploration Methods

Our analysis will follow this structured approach:

1. **Data Structure Inspection** - Examine column names, data types, and dimensions
2. **Missing Value Analysis** - Identify and assess the extent of missing data
3. **Descriptive Statistics** - Calculate summary statistics for numeric features
4. **Distribution Analysis** - Visualize feature distributions and identify outliers
5. **Correlation Analysis** - Explore relationships between features and the target variable
6. **Categorical Feature Analysis** - Examine unique values and distributions of categorical variables
7. **Data Quality Assessment** - Check for inconsistencies and data anomalies

This systematic approach will help us understand the data before building predictive models.

In [21]:
import pandas as pd
from sklearn.datasets import fetch_openml

In [22]:
# Fetch data and convert to DataFrame
data = fetch_openml(name="house_prices", as_frame=True)
df = data.frame

In [26]:
# Show the first few rows of the DataFrame
df.head()

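|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 12 | 2008 | WD | Normal | 250000 |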


In [43]:

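# List every column (feature) of the DataFrame, numbered from 1.
# enumerate covers all columns; zip with range(1, len(df.columns)) dropped the last one.
for i, f in enumerate(df.columns, start=1):
    print("Feature {i}: {f}".format(i=i, f=f))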

Feature 1: Id
Feature 2: MSSubClass
Feature 3: MSZoning
Feature 4: LotFrontage
Feature 5: LotArea
Feature 6: Street
Feature 7: Alley
Feature 8: LotShape
Feature 9: LandContour
Feature 10: Utilities
Feature 11: LotConfig
Feature 12: LandSlope
Feature 13: Neighborhood
Feature 14: Condition1
Feature 15: Condition2
Feature 16: BldgType
Feature 17: HouseStyle
Feature 18: OverallQual
Feature 19: OverallCond
Feature 20: YearBuilt
Feature 21: YearRemodAdd
Feature 22: RoofStyle
Feature 23: RoofMatl
Feature 24: Exterior1st
Feature 25: Exterior2nd
Feature 26: MasVnrType
Feature 27: MasVnrArea
Feature 28: ExterQual
Feature 29: ExterCond
Feature 30: Foundation
Feature 31: BsmtQual
Feature 32: BsmtCond
Feature 33: BsmtExposure
Feature 34: BsmtFinType1
Feature 35: BsmtFinSF1
Feature 36: BsmtFinType2
Feature 37: BsmtFinSF2
Feature 38: BsmtUnfSF
Feature 39: TotalBsmtSF
Feature 40: Heating
Feature 41: HeatingQC
Feature 42: CentralAir
Feature 43: Electrical
Feature 44: 1stFlrSF
Feature 45: 2ndFlrSF
Featu

In [28]:
# Show no. of rows
print("Number of rows:", len(df))

Number of rows: 1460


In [None]:
# Collect the distinct column dtypes present in the DataFrame
unique_dt = set(df.dtypes)
print("Unique data types in the dataset:", unique_dt)

In [None]:
# Basic structure and types
df.shape


In [None]:
# Column data types and non-null counts
df.info()


In [None]:
# Missing value analysis
missing_counts = df.isna().sum().sort_values(ascending=False)
missing_pct = (missing_counts / len(df) * 100).round(2)
missing_summary = pd.DataFrame({"missing_count": missing_counts, "missing_pct": missing_pct})
missing_summary = missing_summary.query('missing_count > 0')
missing_summary.head(20)


In [None]:
# Descriptive statistics for numeric features
df.describe().T


In [None]:
# Separate numeric and categorical columns
numeric_cols = df.select_dtypes(include=['number']).columns
categorical_cols = df.select_dtypes(exclude=['number']).columns
print('Numeric columns:', len(numeric_cols))
print('Categorical columns:', len(categorical_cols))


In [None]:
# Distribution analysis for target and key numeric features
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid')

plt.figure(figsize=(8, 4))
sns.histplot(df['SalePrice'], kde=True)
plt.title('SalePrice Distribution')
plt.show()

key_numeric = ['OverallQual', 'GrLivArea', 'TotalBsmtSF', 'GarageArea', 'YearBuilt']
for col in key_numeric:
    if col in df.columns:
        plt.figure(figsize=(8, 4))
        sns.histplot(df[col], kde=True)
        plt.title(f'{col} Distribution')
        plt.show()
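The histograms above show shape only visually. A minimal sketch of putting a number on the asymmetry is given here, using `Series.skew()` before and after a `log1p` transform. The helper name `skew_summary` and the toy values are illustrative assumptions, not part of the original notebook, and no claim is made about the actual skewness of `SalePrice`.

```python
import numpy as np
import pandas as pd


def skew_summary(series: pd.Series) -> dict:
    """Return sample skewness of a series, raw and after log1p.

    Hypothetical helper for illustration; assumes non-negative values
    so that log1p is defined for every entry.
    """
    clean = series.dropna()
    return {
        "raw": float(clean.skew()),
        "log1p": float(np.log1p(clean).skew()),
    }


# Made-up values with a long right tail, for demonstration only
toy_prices = pd.Series([1, 1, 1, 2, 2, 3, 10])
toy_skew = skew_summary(toy_prices)
print(toy_skew)
```

On the real data this could be called as `skew_summary(df['SalePrice'])`; a positive value indicates a right tail, and a smaller value after `log1p` suggests the transform makes the distribution more symmetric.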


In [None]:
# Boxplots to identify outliers for key numeric features
for col in ['SalePrice', 'GrLivArea', 'TotalBsmtSF', 'LotArea']:
    if col in df.columns:
        plt.figure(figsize=(8, 3))
        sns.boxplot(x=df[col])
        plt.title(f'{col} Outliers')
        plt.show()
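Seaborn boxplots draw whiskers at 1.5 times the interquartile range by default, so the points beyond them can also be counted directly. The following is a rough sketch of that rule; the function name `iqr_outlier_counts`, the parameter `k`, and the toy frame are assumptions introduced for illustration.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each listed column."""
    rows = []
    for col in columns:
        if col not in frame.columns:
            continue  # skip columns that are absent, as the plotting loop does
        values = frame[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        n_out = int(((values < lower) | (values > upper)).sum())
        rows.append({"column": col, "lower": lower, "upper": upper, "n_outliers": n_out})
    result = pd.DataFrame(rows, columns=["column", "lower", "upper", "n_outliers"])
    return result.set_index("column")


# Made-up values for demonstration only: one clearly extreme point
toy_frame = pd.DataFrame({"SalePrice": [100, 110, 120, 130, 140, 1000]})
toy_outliers = iqr_outlier_counts(toy_frame, ["SalePrice"])
print(toy_outliers)
```

With the notebook's data, `iqr_outlier_counts(df, ['SalePrice', 'GrLivArea', 'TotalBsmtSF', 'LotArea'])` would give one row per column. Points flagged this way are candidates for inspection, not necessarily errors.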


In [None]:
# Correlation analysis with target
numeric_df = df.select_dtypes(include=['number'])
corr_with_target = numeric_df.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)
corr_with_target.head(15)
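`DataFrame.corr` computes Pearson correlation by default, which measures linear association. For ordinal ratings or curved relationships, a rank-based (Spearman) correlation can be a useful cross-check. The sketch below compares the two on made-up data; `target_correlations` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def target_correlations(frame: pd.DataFrame, target: str, method: str = "pearson") -> pd.Series:
    """Correlation of every numeric column with `target`, sorted descending."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr(method=method)[target].drop(target)
    return corr.sort_values(ascending=False)


# Made-up data: y grows monotonically but not linearly with x
toy_xy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})
pearson = target_correlations(toy_xy, "y", method="pearson")
spearman = target_correlations(toy_xy, "y", method="spearman")
print("Pearson:", pearson["x"])
print("Spearman:", spearman["x"])
```

On the house price data, `target_correlations(df, 'SalePrice', method='spearman')` could be compared against `corr_with_target`; features whose rank correlation is noticeably higher than their Pearson correlation may have a monotonic but nonlinear relationship with the target.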


In [None]:
# Heatmap of top correlated features
top_corr_features = corr_with_target.head(10).index
plt.figure(figsize=(8, 6))
sns.heatmap(numeric_df[top_corr_features].corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Top Correlated Features with SalePrice')
plt.show()


In [None]:
# Categorical feature analysis: top categories and counts
for col in categorical_cols[:10]:
    print(f'\n{col} value counts:')
    display(df[col].value_counts(dropna=False).head(10))


In [None]:
# Relationship between key categorical features and SalePrice
for col in ['Neighborhood', 'BldgType', 'HouseStyle', 'ExterQual', 'KitchenQual']:
    if col in df.columns:
        plt.figure(figsize=(10, 4))
        sns.boxplot(x=col, y='SalePrice', data=df)
        plt.xticks(rotation=45, ha='right')
        plt.title(f'SalePrice by {col}')
        plt.tight_layout()
        plt.show()


In [None]:
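# Data quality checks
# Exclude the unique Id column; with it included, no two rows can ever compare equal
duplicate_count = df.drop(columns=['Id'], errors='ignore').duplicated().sum()
print('Duplicate rows:', duplicate_count)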

# Basic sanity checks for numerical ranges
print('Min/Max YearBuilt:', df['YearBuilt'].min(), df['YearBuilt'].max())
print('Min/Max SalePrice:', df['SalePrice'].min(), df['SalePrice'].max())
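Range checks catch impossible values in a single column, but some inconsistencies only appear when columns are compared with each other. The sketch below counts rows that break a few simple logical constraints using the `YearBuilt`, `YearRemodAdd`, `YrSold`, and `SalePrice` columns. The function name `consistency_report` and the specific rules are assumptions chosen for illustration, and the toy rows are fabricated so that each rule fires once.

```python
import pandas as pd


def consistency_report(frame: pd.DataFrame) -> pd.Series:
    """Count rows violating simple cross-column constraints (illustrative rules)."""
    checks = {
        # A remodel year earlier than the construction year is suspicious
        "remodel_before_build": frame["YearRemodAdd"] < frame["YearBuilt"],
        # A sale year earlier than the construction year deserves a closer look
        "sold_before_built": frame["YrSold"] < frame["YearBuilt"],
        # Sale prices should be strictly positive
        "non_positive_price": frame["SalePrice"] <= 0,
    }
    return pd.Series({name: int(mask.sum()) for name, mask in checks.items()})


# Made-up rows for demonstration only
toy_houses = pd.DataFrame({
    "YearBuilt": [2000, 1995, 2010],
    "YearRemodAdd": [2005, 1990, 2010],
    "YrSold": [2008, 2007, 2009],
    "SalePrice": [200000, 150000, 0],
})
toy_report = consistency_report(toy_houses)
print(toy_report)
```

Running `consistency_report(df)` on the full dataset returns one count per rule; any nonzero count points to rows worth inspecting before modeling.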
