# Exploratory Data Analysis (EDA)

## Step 1: Target Variable Analysis

**Objective:**
Understand the statistical behavior of the target variable (`SalePrice`) to inform modeling decisions later in the pipeline.

At this stage:
- No models are trained
- No transformations are applied permanently
- The test dataset is not used for statistical inference


In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

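# Show every column when printing wide DataFrames
pd.set_option("display.max_columns", None)
# Use a light grid background for all seaborn plots
sns.set_style("whitegrid")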


In [3]:
train_path = "../data/raw/train.csv"
test_path = "../data/raw/test.csv"

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)


Train shape: (1460, 81)
Test shape: (1459, 80)


### Train vs Test Data Usage

The training dataset is used for:
- Statistical analysis
- Distribution inspection
- Decision-making regarding transformations

The test dataset is used **only** for:
- Structural validation (columns, data types, missing values)

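This separation is critical to prevent **data leakage**: if statistics computed on the test set fed into preprocessing or modeling decisions, information from unseen data would shape the pipeline and could inflate performance estimates.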
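The structural validation described above could be sketched as follows. This is a minimal illustration, not part of the original notebook: the helper name `structural_report` and its return keys are hypothetical, and it only compares column names, data types, and missing-value counts.

```python
import pandas as pd


def structural_report(train_df, test_df, target="SalePrice"):
    """Compare train/test structure only (hypothetical helper, not from the notebook)."""
    # Drop the target so both frames are compared on features only
    train_features = train_df.drop(columns=[target], errors="ignore")

    train_cols = set(train_features.columns)
    test_cols = set(test_df.columns)
    # Keep the training column order for the shared columns
    shared = [c for c in train_features.columns if c in test_cols]

    # Columns whose dtype differs between the two frames
    dtype_mismatch = {
        c: (str(train_features[c].dtype), str(test_df[c].dtype))
        for c in shared
        if train_features[c].dtype != test_df[c].dtype
    }

    # Missing-value counts per shared column, side by side
    missing = pd.DataFrame(
        {
            "train_missing": train_features[shared].isna().sum(),
            "test_missing": test_df[shared].isna().sum(),
        }
    )

    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
        "dtype_mismatch": dtype_mismatch,
        "missing": missing,
    }
```

Calling `structural_report(train_df, test_df)` on the frames loaded earlier would return the column differences, any dtype mismatches, and a table of missing-value counts, without computing any distributional statistics on the test set.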


In [4]:
target = "SalePrice"

train_df[target].describe()


count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64
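
In the summary above, the mean (about 180,921) sits above the median (163,000) and the maximum (755,000) lies far beyond the 75th percentile, which is consistent with a right-skewed distribution. One way to quantify this is sketched below. The helper name `target_shape_summary` is hypothetical, and the `log1p` transform is only an assumed candidate for comparison: it is computed on a temporary copy, so nothing is applied permanently.

```python
import numpy as np
import pandas as pd
from scipy import stats


def target_shape_summary(prices):
    """Summarize the shape of a price series, raw vs. log1p (hypothetical helper)."""
    # Work on a cleaned float copy; the caller's data is left untouched
    raw = pd.Series(prices).dropna().astype(float)
    # Temporary log1p copy, used only for comparison
    logged = np.log1p(raw)

    rows = {}
    for name, values in {"raw": raw, "log1p": logged}.items():
        rows[name] = {
            "mean": values.mean(),
            "median": values.median(),
            # Positive skew indicates a longer right tail
            "skew": stats.skew(values),
            # Excess kurtosis (0 for a normal distribution)
            "kurtosis": stats.kurtosis(values),
        }
    return pd.DataFrame.from_dict(rows, orient="index")
```

Calling `target_shape_summary(train_df[target])` would give a two-row table comparing the raw and log-scaled target. Whether a transformation is actually warranted remains a decision for a later step.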