# Exploratory Data Analysis on House Prices

In this notebook, we will perform exploratory data analysis (EDA) on the house prices dataset. We will visualize the data and derive insights that can help in understanding the factors affecting house prices.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Apply seaborn's whitegrid theme to all plots
sns.set_theme(style='whitegrid')

In [2]:
# Load the dataset
data_path = '../data/raw/house_prices.csv'
df = pd.read_csv(data_path)

# Display the first few rows of the dataset
df.head()

In [3]:
# Summary statistics of the dataset
df.describe()

In [4]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]

In [5]:
# Visualize the distribution of house prices
plt.figure(figsize=(10, 6))
sns.histplot(df['SalePrice'], bins=30, kde=True)
plt.title('Distribution of House Prices')
plt.xlabel('Sale Price')
plt.ylabel('Count')
plt.show()

In [6]:
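# Correlation heatmap (numeric columns only; correlating string columns
# raises an error in recent pandas versions)
correlation_matrix = df.select_dtypes(include='number').corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()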

## Insights
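1. The distribution of house prices is right-skewed: most houses sell toward the lower end of the price range, and a small number of expensive homes form a long right tail.
2. The correlation heatmap shows which numeric features are most strongly correlated with `SalePrice`, which helps identify candidate predictors. Correlation captures only linear association, so it is a starting point rather than proof of importance.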
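
Both insights can be checked numerically rather than read off the plots. The following is a minimal sketch, not part of the original analysis: the helper names `target_skewness` and `top_correlations` are hypothetical, and it assumes the target column is named `SalePrice`, as in the histogram cell.

```python
import pandas as pd


def target_skewness(frame: pd.DataFrame, target: str = "SalePrice") -> float:
    """Sample skewness of the target column; values above 0 indicate a right skew."""
    return float(frame[target].skew())


def top_correlations(
    frame: pd.DataFrame, target: str = "SalePrice", n: int = 10
) -> pd.Series:
    """Numeric features ranked by absolute Pearson correlation with the target."""
    # Restrict to numeric columns so string columns do not break corr()
    numeric = frame.select_dtypes(include="number")
    # Correlation of every numeric feature with the target, excluding the target itself
    corr = numeric.corr()[target].drop(target)
    # Sort by magnitude but keep the sign, so negative relationships stay visible
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)
```

With the `df` loaded earlier, `target_skewness(df)` should return a positive number if the distribution is right-skewed as described, and `top_correlations(df)` lists the features worth inspecting first. Ranking by absolute value while keeping the sign is a design choice: a strongly negative correlation is as informative as a strongly positive one.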