# Enhanced Exploratory Data Analysis (EDA) for Housing Data

This notebook performs an in-depth exploratory data analysis on the Housing dataset. The goal is to understand the relationships between different features and the house price, and to prepare the data for potential machine learning modeling.

## 1. Load and Inspect Data

The first step is to load the necessary libraries and the dataset. We will then perform an initial inspection to understand its structure, data types, and basic statistics.

### 1.1 Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.io as pio
from sklearn.preprocessing import PowerTransformer, LabelEncoder

# Set styles and defaults
plt.style.use('seaborn-v0_8-whitegrid') 
pio.renderers.default = 'iframe' 
pio.templates.default = 'plotly_white' 
pd.set_option('display.float_format', "{:.3f}".format) 
sns.set_palette('viridis') 

### 1.2 Load Data

In [None]:
df = pd.read_csv('Housing.csv')

### 1.3 Initial Data Inspection

In [None]:
print("First 5 rows of the dataframe:")
df.head()

In [None]:
print("\nInformation about the dataframe:")
df.info()

In [None]:
print("\nDescriptive statistics of the dataframe:")
df.describe().T

In [None]:
print(f"\nShape of the dataframe (rows, columns): {df.shape}")

In [None]:
print("\nMissing values per column:")
df.isna().sum()

In [None]:
print(f"\nNumber of duplicated rows: {df.duplicated().sum()}")

**Initial Findings:**
- The dataset contains 545 rows and 13 columns.
- There are no missing values.
- There are no duplicated rows.
- Several features are categorical (object type) and will need encoding for machine learning models.
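
The checks above can also be gathered into one small helper so the findings are asserted rather than read off the output. This is a minimal sketch: `quality_report` is a hypothetical helper name, and the toy frame is made-up stand-in data, not rows from `Housing.csv`.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Collect the basic data-quality figures inspected above (hypothetical helper)."""
    return {
        "n_rows": frame.shape[0],
        "n_cols": frame.shape[1],
        "n_missing": int(frame.isna().sum().sum()),
        "n_duplicates": int(frame.duplicated().sum()),
        # Columns that are not numeric will need encoding before modeling
        "non_numeric_columns": frame.select_dtypes(exclude="number").columns.tolist(),
    }

# Toy stand-in for the housing data (made-up values)
toy = pd.DataFrame({
    "price": [100, 200, 200, 300],
    "area": [10, 20, 20, 30],
    "mainroad": ["yes", "no", "no", "yes"],
})
report = quality_report(toy)
print(report)
```

On the real data, the same call would be `quality_report(df)`.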

### 1.4 Identify Categorical and Numerical Features

In [None]:
categorical_features = df.select_dtypes(include='object').columns.tolist()
numerical_features = df.select_dtypes(include=np.number).columns.tolist()
binary_categorical_features = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']
multi_categorical_features = ['furnishingstatus']

print(f"Categorical Features: {categorical_features}")
print(f"Numerical Features: {numerical_features}")

## 2. Exploratory Data Analysis (EDA) and Visualization

Now, we'll dive deeper into the data using various visualization techniques to understand distributions, relationships, and potential outliers.

### 2.1 Univariate Analysis

Analyzing individual features.

#### 2.1.1 Distribution of Target Variable (Price)

In [None]:
plt.figure(figsize=(12, 6))
sns.histplot(df['price'], kde=True, bins=30)
plt.title('Distribution of Price', fontsize=16)
plt.xlabel('Price', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.show()

plt.figure(figsize=(8, 4))
sns.boxplot(x=df['price'])
plt.title('Boxplot of Price', fontsize=16)
plt.xlabel('Price', fontsize=14)
plt.show()

print(f"Skewness of Price: {df['price'].skew():.2f}")

**Observation:** `price` is right-skewed, which is common for price data. A transformation is considered in Section 4.
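
The points drawn beyond the boxplot whiskers can also be counted directly. The sketch below applies the usual 1.5 × IQR rule (the default whisker length in matplotlib and seaborn boxplots) to a made-up series. `iqr_outlier_mask` is a hypothetical helper and the values are not taken from the dataset.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up, right-skewed toy prices
toy_price = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 50], name="price")
mask = iqr_outlier_mask(toy_price)

print(f"Flagged values: {toy_price[mask].tolist()}")
print(f"Skewness: {toy_price.skew():.2f}")
```

On the real data this would be called as `iqr_outlier_mask(df['price'])`.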

#### 2.1.2 Distribution of Numerical Features

In [None]:
num_cols_for_dist = [col for col in numerical_features if col != 'price'] 
fig, axes = plt.subplots(nrows=len(num_cols_for_dist), ncols=2, figsize=(15, len(num_cols_for_dist) * 4))
fig.suptitle('Distribution and Boxplot for Numerical Features', fontsize=18, y=1.02)
fig.tight_layout(pad=5.0)

for i, col in enumerate(num_cols_for_dist):
    sns.histplot(df[col], kde=True, ax=axes[i, 0], bins=20)
    axes[i, 0].set_title(f'Distribution of {col} (Skew: {df[col].skew():.2f})', fontsize=14)
    axes[i, 0].set_xlabel(col, fontsize=12)
    axes[i, 0].set_ylabel('Frequency', fontsize=12)
    
    sns.boxplot(x=df[col], ax=axes[i, 1])
    axes[i, 1].set_title(f'Boxplot of {col}', fontsize=14)
    axes[i, 1].set_xlabel(col, fontsize=12)

plt.show()

**Observations:**
- `area` is also right-skewed.
- `bedrooms`, `bathrooms`, `stories`, and `parking` are discrete numerical features. Their distributions show the common counts for each category.

#### 2.1.3 Frequency of Categorical Features

In [None]:
fig, axes = plt.subplots(nrows=(len(categorical_features) + 1) // 2, ncols=2, figsize=(15, len(categorical_features) * 2.5))
axes = axes.flatten() 
fig.suptitle('Frequency of Categorical Features', fontsize=18, y=1.03)
fig.tight_layout(pad=5.0)

for i, col in enumerate(categorical_features):
    sns.countplot(x=col, data=df, ax=axes[i], palette='viridis_r')
    axes[i].set_title(f'Frequency of {col}', fontsize=14)
    axes[i].set_xlabel(col, fontsize=12)
    axes[i].set_ylabel('Count', fontsize=12)
    axes[i].tick_params(axis='x', rotation=45)

if len(categorical_features) % 2 != 0:
    axes[-1].set_visible(False)

plt.show()

**Observations:**
- Most houses are on the `mainroad`.
- `guestroom`, `basement`, `hotwaterheating`, and `prefarea` have a majority of 'no' responses.
- `airconditioning` is more balanced but still more 'no' than 'yes'.
- `furnishingstatus` has 'semi-furnished' as the most common category.
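
The count plots can be backed by exact shares. As a rough sketch on made-up data (the toy frame is an assumption, not the real dataset), `value_counts(normalize=True)` gives the proportion of each category:

```python
import pandas as pd

# Made-up stand-in for two of the categorical columns
toy = pd.DataFrame({
    "mainroad": ["yes", "yes", "yes", "no"],
    "furnishingstatus": ["semi-furnished", "furnished", "semi-furnished", "unfurnished"],
})

# Share of each category per column
shares = {col: toy[col].value_counts(normalize=True) for col in toy.columns}

for col, share in shares.items():
    print(f"{col}: most common = {share.idxmax()!r} ({share.max():.0%})")
```

For the notebook's data, the loop would run over `categorical_features` on `df`.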

### 2.2 Bivariate Analysis

Exploring relationships between pairs of features, especially with the target variable `price`.

#### 2.2.1 Price vs. Area

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='area', y='price', data=df, hue='furnishingstatus', size='stories', palette='plasma', alpha=0.7)
plt.title('Price vs. Area', fontsize=16)
plt.xlabel('Area (sq. ft)', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.legend(title='Furnishing Status & Stories')
plt.show()

**Observation:** `price` increases with `area`. Marker size and color suggest that the number of stories and the furnishing status also play a role; both are examined separately below.
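
The strength of the relationship can be expressed as a number as well. The sketch below contrasts the Pearson coefficient (linear association) with the Spearman coefficient (monotonic association, computed here as the Pearson correlation of the ranks) on made-up values. Nothing here is taken from the dataset.

```python
import pandas as pd

# Made-up data with a monotonic but non-linear relationship
toy = pd.DataFrame({"area": [1, 2, 3, 4, 5], "price": [1, 4, 9, 16, 25]})

# Pearson measures linear association
pearson = toy["area"].corr(toy["price"])

# Spearman is the Pearson correlation of the ranks
spearman = toy["area"].rank().corr(toy["price"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A Spearman value clearly above the Pearson value would suggest a monotonic but curved relationship, which is one more argument for transforming skewed columns.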

#### 2.2.2 Price vs. Categorical Features

In [None]:
fig, axes = plt.subplots(nrows=(len(categorical_features) + 1) // 2, ncols=2, figsize=(18, len(categorical_features) * 3))
axes = axes.flatten()
fig.suptitle('Price vs. Categorical Features', fontsize=18, y=1.03)
fig.tight_layout(pad=5.0)

for i, col in enumerate(categorical_features):
    sns.boxplot(x=col, y='price', data=df, ax=axes[i], palette='magma_r')
    axes[i].set_title(f'Price vs. {col}', fontsize=14)
    axes[i].set_xlabel(col, fontsize=12)
    axes[i].set_ylabel('Price', fontsize=12)
    axes[i].tick_params(axis='x', rotation=45)

if len(categorical_features) % 2 != 0:
    axes[-1].set_visible(False)

plt.show()

**Observations:**
- Houses on the `mainroad`, with a `guestroom`, `basement`, `hotwaterheating`, `airconditioning`, or in a `prefarea` tend to have higher median prices.
- `furnished` houses generally command higher prices than `unfurnished` ones, with `semi-furnished` in between.

#### 2.2.3 Price vs. Other Numerical Features

In [None]:
num_cols_for_biv = [col for col in numerical_features if col not in ['price', 'area']] 
fig, axes = plt.subplots(nrows=(len(num_cols_for_biv) + 1) // 2, ncols=2, figsize=(15, len(num_cols_for_biv) * 2.5))
axes = axes.flatten()
fig.suptitle('Price vs. Other Numerical Features', fontsize=18, y=1.03)
fig.tight_layout(pad=5.0)

for i, col in enumerate(num_cols_for_biv):
    if df[col].nunique() < 10: 
        sns.boxplot(x=col, y='price', data=df, ax=axes[i], palette='coolwarm_r')
    else: 
        sns.scatterplot(x=col, y='price', data=df, ax=axes[i])  # no hue is set, so no palette is needed
    axes[i].set_title(f'Price vs. {col}', fontsize=14)
    axes[i].set_xlabel(col, fontsize=12)
    axes[i].set_ylabel('Price', fontsize=12)

if len(num_cols_for_biv) % 2 != 0:
    axes[-1].set_visible(False)
    
plt.show()

**Observations:**
- More `bedrooms`, `bathrooms`, `stories`, and `parking` spots generally correlate with higher prices, as seen from the boxplots.

### 2.3 Multivariate Analysis

Looking at relationships between multiple features simultaneously.

#### 2.3.1 Pairplot of Numerical Features

In [None]:
sns.pairplot(df[numerical_features], diag_kind='kde', corner=True, plot_kws={'alpha':0.6, 's':80, 'edgecolor':'k'})
plt.suptitle('Pairplot of Numerical Features', y=1.02, fontsize=16)
plt.show()

**Observation:** The pairplot confirms the positive correlation between `price` and `area`. It also shows some correlation between `bedrooms` and `stories`, and `bathrooms` and `stories`.

#### 2.3.2 Correlation Heatmap

In [None]:
# Build a temporary, fully numeric copy of df for the correlation heatmap.
# df_processed and label_encoder are only created in Section 3, so they are not used here.
df_temp_for_corr = df.copy()
for col in binary_categorical_features:
    # LabelEncoder sorts the classes, so 'no' -> 0 and 'yes' -> 1
    df_temp_for_corr[col] = LabelEncoder().fit_transform(df_temp_for_corr[col])
furnishing_dummies_corr = pd.get_dummies(df_temp_for_corr['furnishingstatus'], prefix='furnishing', dtype=int)
df_temp_for_corr = pd.concat([df_temp_for_corr.drop('furnishingstatus', axis=1), furnishing_dummies_corr], axis=1)

plt.figure(figsize=(12, 10))
correlation_matrix = df_temp_for_corr.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap of All Features (Processed)', fontsize=16)
plt.show()

**Observations:**
- `price` has the strongest positive correlation with `area` (0.54).
- `bathrooms`, `stories`, and `airconditioning` also show notable positive correlations with `price`.
- `furnishing_unfurnished` has a negative correlation with price, as expected.
- Some predictors are correlated with each other, for example `bathrooms` and `stories`, which hints at possible multicollinearity.
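
Pairwise correlations only hint at multicollinearity. A more direct check is the variance inflation factor (VIF), which for standardized predictors equals the diagonal of the inverse correlation matrix. The following is a sketch on synthetic data: `vif_from_correlation` is a hypothetical helper, and the columns `a`, `b`, and `a_plus_noise` are invented for illustration.

```python
import numpy as np
import pandas as pd

def vif_from_correlation(frame: pd.DataFrame) -> pd.Series:
    """VIF of each column, taken from the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=frame.columns, name="VIF")

# Synthetic predictors: 'a_plus_noise' is almost a copy of 'a', 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=200)})

vif = vif_from_correlation(toy)
print(vif.round(2))
```

Values close to 1 indicate little redundancy. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity. In the notebook, the helper would be applied to the predictor columns only, not to `price`.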

## 3. Feature Engineering (Basic)

Converting categorical features into a numerical format suitable for modeling.

### 3.1 Convert Binary Categorical Features to Numerical

In [None]:
df_processed = df.copy() # Create a copy for processing

label_encoder = LabelEncoder()
for col in binary_categorical_features:
    df_processed[col] = label_encoder.fit_transform(df_processed[col])

print("Binary features after Label Encoding:")
df_processed[binary_categorical_features].head()

### 3.2 One-Hot Encode 'furnishingstatus'

In [None]:
furnishing_status_encoded = pd.get_dummies(df_processed['furnishingstatus'], prefix='furnishing', drop_first=True, dtype=int)
df_processed = pd.concat([df_processed, furnishing_status_encoded], axis=1)
df_processed.drop('furnishingstatus', axis=1, inplace=True)

print("DataFrame after One-Hot Encoding 'furnishingstatus':")
df_processed.head()

In [None]:
print("\nInformation about the processed dataframe:")
df_processed.info()

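**Note:** `df_processed` now contains the numerically encoded categorical features. Because of `drop_first=True`, `furnished` is the baseline category and has no column of its own. `df_transformed`, created in the next section, will hold the transformed numerical features.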
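
As an alternative to `LabelEncoder`, the yes/no columns can be encoded with an explicit mapping, which keeps the meaning of 0 and 1 visible in the code. This is a sketch on a made-up frame. The `yes_no` dictionary is an assumption about the only two values present, and any other value would become `NaN`.

```python
import pandas as pd

# Made-up stand-in with one binary and one multi-category column
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# Explicit mapping: unexpected values would turn into NaN instead of a silent code
yes_no = {"no": 0, "yes": 1}
encoded = toy.assign(mainroad=toy["mainroad"].map(yes_no))

# drop_first=True removes the alphabetically first category ('furnished'),
# which then acts as the baseline (both dummy columns are 0)
dummies = pd.get_dummies(encoded["furnishingstatus"], prefix="furnishing",
                         drop_first=True, dtype=int)
encoded = pd.concat([encoded.drop(columns="furnishingstatus"), dummies], axis=1)

print(encoded)
```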

## 4. Data Transformation

Addressing skewness in numerical features to potentially improve model performance.

### 4.1 Skewness and Transformation of Numerical Features

In [None]:
df_transformed = df_processed.copy()

skewed_features = ['price', 'area'] 
power_transformer = PowerTransformer(method='yeo-johnson') 

fig, axes = plt.subplots(nrows=len(skewed_features), ncols=2, figsize=(15, len(skewed_features) * 4))
fig.suptitle('Original vs. Transformed Distributions for Skewed Features', fontsize=18, y=1.03)
fig.tight_layout(pad=5.0)

for i, col in enumerate(skewed_features):
    sns.histplot(df_transformed[col], kde=True, ax=axes[i, 0], bins=30, color='skyblue')
    axes[i, 0].set_title(f'Original Distribution of {col} (Skewness: {df_transformed[col].skew():.2f})', fontsize=14)
    axes[i, 0].set_xlabel(f'Original {col}', fontsize=12)
    axes[i, 0].set_ylabel('Frequency', fontsize=12)
    
    df_transformed[col] = power_transformer.fit_transform(df_transformed[[col]])
    
    sns.histplot(df_transformed[col], kde=True, ax=axes[i, 1], bins=30, color='lightcoral')
    axes[i, 1].set_title(f'Transformed Distribution of {col} (Skewness: {df_transformed[col].skew():.2f})', fontsize=14)
    axes[i, 1].set_xlabel(f'Transformed {col}', fontsize=12)
    axes[i, 1].set_ylabel('Frequency', fontsize=12)

plt.show()

print("\nSkewness of all numerical features (original df_processed for comparison):")
print(df_processed[numerical_features].skew())
print("\nSkewness of 'price' and 'area' in df_transformed:")
print(df_transformed[skewed_features].skew())

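**Observation:** The PowerTransformer significantly reduced the skewness of `price` and `area`, making their distributions more symmetric. Because `standardize=True` is the default, the transformed columns are also rescaled to zero mean and unit variance, so they are no longer in the original units. Note that the loop refits the same `power_transformer` object for each column, so after the loop it holds only the parameters fitted on `area`.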
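
If predictions on the transformed `price` need to be converted back to the original scale, each column needs its own fitted transformer. The sketch below keeps one `PowerTransformer` per column and checks the round trip with `inverse_transform`. The lognormal toy data and the `transformers` dictionary are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed columns (not the real housing values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "price": rng.lognormal(mean=3.0, sigma=0.5, size=300),
    "area": rng.lognormal(mean=2.0, sigma=0.5, size=300),
})

transformers = {}          # one fitted transformer per column
transformed = toy.copy()
for col in ["price", "area"]:
    pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
    transformed[col] = pt.fit_transform(toy[[col]]).ravel()
    transformers[col] = pt

# Map the transformed prices back to the original scale
recovered_price = transformers["price"].inverse_transform(transformed[["price"]]).ravel()

print(f"Skew before: {toy['price'].skew():.2f}, after: {transformed['price'].skew():.2f}")
print(f"Round trip OK: {np.allclose(recovered_price, toy['price'])}")
```

When the data is later split for modeling, fitting the transformer on the training part only would avoid leaking information from the test set.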

## 5. Advanced Visualizations & Insights

Using interactive plots and grouped analyses to uncover more complex patterns.

### 5.1 Interactive Scatter Plot: Price vs. Area with Hover Data

Using the original `df` for easier interpretation of categorical values in hover data.

In [None]:
fig = px.scatter(df, x='area', y='price', 
                 color='bedrooms', 
                 size='bathrooms', 
                 hover_data=['stories', 'airconditioning', 'furnishingstatus'],
                 title='Interactive Scatter Plot: Price vs. Area',
                 labels={'area': 'Area (sq. ft)', 'price': 'Price', 'bedrooms': 'Bedrooms'},
                 color_continuous_scale=px.colors.sequential.Viridis)
fig.show()

**Insight:** Hovering over points can reveal specific combinations of features for high or low priced houses.

### 5.2 Sunburst Chart: Furnishing Status within Preferred Area

In [None]:
sunburst_data = df.groupby(['prefarea', 'furnishingstatus']).size().reset_index(name='count')
fig = px.sunburst(sunburst_data, path=['prefarea', 'furnishingstatus'], values='count',
                  title='Housing Distribution: Furnishing Status within Preferred Areas',
                  color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

**Insight:** This chart shows the proportion of furnishing types within preferred and non-preferred areas. For instance, we can see if semi-furnished is more common in preferred areas.
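
The proportions shown by the sunburst can be read off a normalized cross-tabulation. The sketch below uses made-up rows. With `normalize="index"`, each row (one `prefarea` value) sums to 1.

```python
import pandas as pd

# Made-up stand-in for the two columns used in the sunburst
toy = pd.DataFrame({
    "prefarea": ["yes", "yes", "no", "no", "no", "no"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished",
                         "unfurnished", "semi-furnished", "furnished"],
})

# Share of each furnishing status within each prefarea group
shares = pd.crosstab(toy["prefarea"], toy["furnishingstatus"], normalize="index")
print(shares)
```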

### 5.3 Mean Price by Number of Bedrooms and Bathrooms

In [None]:
price_by_bed_bath = df.groupby(['bedrooms', 'bathrooms'])['price'].mean().unstack()

plt.figure(figsize=(10, 7))
sns.heatmap(price_by_bed_bath, annot=True, fmt=".0f", cmap="YlGnBu", linewidths=.5)
plt.title('Mean Price by Number of Bedrooms and Bathrooms', fontsize=16)
plt.xlabel('Number of Bathrooms', fontsize=14)
plt.ylabel('Number of Bedrooms', fontsize=14)
plt.show()

**Insight:** The heatmap shows that mean price generally rises with the number of bedrooms and bathrooms. Houses with 4 bedrooms and 4 bathrooms have the highest mean price, but blank cells are combinations that do not occur in the data, and means for rare combinations may rest on very few houses.
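
To judge how much weight each cell of the heatmap deserves, the group sizes can be computed alongside the means. This is a sketch on made-up rows, and `min_count` is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Made-up stand-in for the three columns used in the heatmap
toy = pd.DataFrame({
    "bedrooms":  [2, 2, 3, 3, 3, 4],
    "bathrooms": [1, 1, 1, 2, 2, 4],
    "price":     [100, 140, 150, 200, 260, 900],
})

# Mean price and number of houses for each bedroom/bathroom combination
summary = toy.groupby(["bedrooms", "bathrooms"])["price"].agg(["mean", "count"])

# Keep only combinations backed by at least `min_count` houses (hypothetical threshold)
min_count = 2
supported = summary[summary["count"] >= min_count]

print(summary)
print(supported)
```

In this toy example the highest mean belongs to a combination with a single house, which is exactly the kind of cell that should be read with caution.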

### 5.4 Impact of Air Conditioning on Price by Furnishing Status

In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='furnishingstatus', y='price', hue='airconditioning', data=df, palette='Set2_r')
plt.title('Impact of Air Conditioning on Price, Grouped by Furnishing Status', fontsize=16)
plt.xlabel('Furnishing Status', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.xticks(rotation=45)
plt.legend(title='Air Conditioning')
plt.show()

**Insight:** Across all furnishing statuses, houses with air conditioning have a noticeably higher median price and a wider price range, which suggests that air conditioning is an important indicator of value.
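
The gap visible in the grouped boxplot can be tabulated as the difference between group medians. A minimal sketch on made-up data follows. The `gap` column is a name introduced here for illustration.

```python
import pandas as pd

# Made-up stand-in for the three columns used in the boxplot
toy = pd.DataFrame({
    "furnishingstatus": ["furnished"] * 4 + ["unfurnished"] * 4,
    "airconditioning": ["no", "no", "yes", "yes"] * 2,
    "price": [100, 120, 200, 220, 80, 90, 110, 130],
})

# Median price per furnishing status, with one column per air-conditioning value
medians = (
    toy.groupby(["furnishingstatus", "airconditioning"])["price"]
    .median()
    .unstack("airconditioning")
)

# Difference in median price between houses with and without air conditioning
medians["gap"] = medians["yes"] - medians["no"]
print(medians)
```

A gap that differs across furnishing statuses would point to an interaction between the two features rather than a constant premium.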

### 5.5 Price Distribution by Number of Stories and Basement

In [None]:
fig = px.violin(df, y="price", x="stories", color="basement", box=True, points="all",
                hover_data=df.columns,
                title="Price Distribution by Stories and Basement Presence",
                labels={'stories':'Number of Stories', 'price':'Price', 'basement':'Basement (Yes/No)'})
fig.show()

**Insight:** Houses with more stories tend to be more expensive. A basement is also generally associated with higher prices, especially for houses with fewer stories.

## 6. Summary of EDA Findings

1.  **Data Quality:** The dataset is clean with no missing values or duplicates.
2.  **Target Variable (`price`):** Right-skewed, indicating that most houses are in the lower to mid-price range, with fewer expensive houses. Power transformation helps normalize its distribution.
3.  **Key Numerical Predictors:** `area` has the strongest correlation with `price` (about 0.54, a moderate positive relationship) and is also right-skewed. `bathrooms` and `stories` also correlate positively with price.
4.  **Key Categorical Predictors:** 
    *   `airconditioning` and `prefarea` have a significant positive impact on price.
    *   `furnishingstatus` shows that furnished and semi-furnished houses tend to be more expensive than unfurnished ones.
    *   `mainroad` access, `guestroom`, and `basement` presence also positively influence the price.
    *   `hotwaterheating` is a rare feature but houses with it tend to be more expensive.
5.  **Multicollinearity:** Some predictors are correlated with each other (e.g., `stories` and `bathrooms`), which might need consideration in some modeling techniques.
6.  **Transformations:** `price` and `area` benefit from Yeo-Johnson power transformation to handle skewness.
7.  **Interactions:** Advanced visualizations revealed interesting interactions, such as air conditioning's impact varying slightly across furnishing statuses, and basements having a more pronounced effect on price for houses with fewer stories.

This EDA provides a solid foundation for feature selection, preprocessing, and model building for predicting house prices.