# **Comprehensive Data Visualization of the Paris Housing Dataset**

## **This script performs a full exploratory data analysis on the 'ParisHousingClass.csv' dataset.**  

### It uses `pandas` for data handling and `matplotlib` and `seaborn` for generating a variety of visualizations to uncover insights and relationships within the data.

**The analysis is broken down into the following sections:**
- **Univariate Analysis:**  
  Visualizes the distribution of single variables.
- **Bivariate Analysis:**  
  Explores relationships between two variables.
- **Correlation Analysis:**  
  Provides a high-level overview of relationships across all numerical features.
- **Relational Plots:**  
  Uses a pair plot to show the key pairwise relationships in a single figure.

In [None]:
#
# Import Libraries
#
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files

In [None]:
#
# Suppress scientific notation for clarity in output
#
# pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
np.set_printoptions(suppress=True, precision=2)

In [None]:
#
# Upload Data File into Google Colab Worksheet
# /content/ParisHousingClass.csv
#
uploadOutput = files.upload()

# **Load Data**

### **Loads the dataset from a CSV file and prepares it for analysis**

In [None]:
#
# Read the File into a Data Frame and do initial checks
#
fileName = 'ParisHousingClass.csv'

print("Loading dataset from '{}'...".format(fileName))
try:
  df = pd.read_csv(fileName)
  print("Dataset loaded successfully.")

  #
  # Print DataFrame Information
  #
  print("")
  print("DataFrame Information")
  print("=====================")
  df.info() # info() prints its report directly and returns None

  #
  # Print DataFrame Statistical Details
  #
  print("")
  print("DataFrame Statistical Details")
  print("=============================")
  print(df.describe())

  #
  # Print First 5 Rows of the DataFrame
  #
  print("")
  print("First 5 rows of the DataFrame")
  print("=============================")
  print(df.head())
except FileNotFoundError:
  print(f"Error: The file '{fileName}' was not found.")
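Because every later cell indexes columns by name, a quick guard can catch a mismatched file early. The sketch below is not part of the original notebook: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the list is simply the set of columns that the plotting cells in this notebook reference.

```python
import pandas as pd

# Columns that the plotting cells in this notebook index by name.
REQUIRED_COLUMNS = [
    'price', 'squareMeters', 'category', 'cityPartRange', 'hasPool',
    'isNewBuilt', 'hasStormProtector', 'numberOfRooms', 'floors',
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from frame."""
    return [name for name in required if name not in frame.columns]

# Tiny stand-in frame (illustrative values only, not the real dataset).
demo = pd.DataFrame({'price': [1.0, 2.0], 'squareMeters': [10, 20]})
print(missing_columns(demo))
```

In the notebook itself the call would be `missing_columns(df)`, and an empty list means all the plots below have the columns they need.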

# **Univariate Analysis**  

### **Creates univariate plots to show the distribution of single variables**

## **Histogram of Price**

In [None]:
#
# Create Histogram of Price
#
binSize = 50

plt.figure(figsize = (10, 6))
sns.histplot(df['price'], bins = binSize, color = 'skyblue') # binSize is the number of bins; change it to see the effect
# sns.histplot(data=df, x='price', hue='category', bins = binSize) # Same plot, split by category
plt.title('Distribution of House Prices', fontsize = 16)
plt.xlabel('Price', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)
plt.show()

# **Analysis of House Price Distribution Histogram**

## Dataset Context
* **Total Records**: 10,000 houses
* **Number of Bins**: 50 bins
* **Price Range**: 0 to ~10 million
* **Bin Width**: Approximately $200,000 per bin

---

## Key Observations

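### 1. **Extremely Uniform Distribution**
The histogram shows a remarkably uniform distribution across all price ranges:
- **Expected Average**: 200 houses per bin (10,000 ÷ 50)
- **Frequency Range**: Bars stay within a narrow band around the average. The 480-540 counts quoted below cannot come from 50 bins (50 × 480 exceeds 10,000 records); they match a coarser run with about 20 bins (10,000 ÷ 20 = 500 per bin)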

### 2. **Notable Peaks**
Two prominent peaks are visible:
- **Peak 1**: Around $0-$200K range (~530 frequency)
- **Peak 2**: Around $800K-$1M range (~540 frequency)
- **Consistent Middle**: $400K-$1.6M shows stable 480-510 frequencies

### 3. **Minimal Variation**
- **Standard deviation** appears very low across bins
- **Range**: Only ~60 frequency difference between highest and lowest bars
- **Pattern**: No significant drops or spikes except at the identified peaks
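The bar heights discussed above can be checked numerically instead of being read off the plot. This is a minimal sketch that uses uniformly distributed stand-in values in place of `df['price']` (an assumption, since the real file is not loaded here); `bin_counts` is a hypothetical helper name.

```python
import numpy as np

def bin_counts(values, bins):
    """Return (counts, edges) for a histogram with equal-width bins."""
    counts, edges = np.histogram(values, bins=bins)
    return counts, edges

# Stand-in for df['price']: 10,000 uniform draws (assumed, not the real data).
rng = np.random.default_rng(0)
prices = rng.uniform(0, 10_000_000, size=10_000)

counts, edges = bin_counts(prices, bins=50)
print("average per bin:", counts.mean())   # always n / bins, whatever the data
print("lowest and highest bar:", counts.min(), counts.max())
print("bin width:", edges[1] - edges[0])
```

Calling `bin_counts(df['price'], bins=binSize)` in the notebook gives the actual bar heights, so the frequency range and the bin width can be quoted for whichever number of bins is in use.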

---

## Statistical Implications

### Expected vs Actual Distribution

| Aspect | Expected (Real Market) | Observed (This Dataset) |
|--------|------------------------|-------------------------|
| **Shape** | Right-skewed | Nearly uniform |
| **Low-price homes** | High concentration | Moderate concentration |
| **High-price homes** | Sharp decline | Consistent presence |
| **Clustering** | Clear price clusters | Minimal clustering |

---

## Real-World Housing Market Comparison

### Typical Housing Market Characteristics:
1. **Right-skewed distribution** - Most homes in lower price ranges
2. **Exponential decay** - Fewer homes at higher prices
3. **Market clustering** - Concentration around local median prices
4. **Price gaps** - Empty or low-frequency bins at extreme high prices
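Right skew can be put into a single number. The sketch below computes the sample skewness (the third standardized moment) for a flat sample and for a lognormal sample; the lognormal parameters are an arbitrary assumption chosen only to give a market-like long right tail, and `sample_skewness` is a hypothetical helper name.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: about 0 when symmetric, positive when right-skewed."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(1)
flat_prices = rng.uniform(0, 10_000_000, size=10_000)             # flat, like this dataset
skewed_prices = rng.lognormal(mean=13.0, sigma=0.6, size=10_000)  # assumed market-like shape

print("uniform sample:  ", round(sample_skewness(flat_prices), 2))
print("lognormal sample:", round(sample_skewness(skewed_prices), 2))
```

A value close to zero for `sample_skewness(df['price'])` would support the "nearly uniform" reading, while a clearly positive value would point to the right-skewed shape expected of a real market.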

### This Dataset's Unusual Features:
- **Equal representation** across all price tiers
- **No price deserts** - Even 8M-10M range has 460+ homes
- **Flat distribution** - Contradicts economic housing principles
- **Consistent supply** - Uniform availability across price spectrum

### Likely Conclusion:
This appears to be a **synthetic or heavily pre-processed dataset** designed for:
- **Educational purposes** in data science courses
- **Machine learning training** with balanced classes
- **Statistical modeling** without real-world market bias

---

## Summary

This histogram represents an **artificially uniform distribution** of house prices that is **highly atypical** of real-world housing markets. While useful for educational purposes and algorithm testing, it should not be interpreted as representative of actual market conditions or used for real estate market analysis without significant caveats.

## **Histogram of Square Meters**

In [None]:
#
# Create Histogram of Square Meters
#
binSize = 50

plt.figure(figsize = (10, 6))
sns.histplot(df['squareMeters'], bins = binSize, color = 'coral') # binSize is the number of bins; change it to see the effect
# sns.histplot(data=df, x='squareMeters', hue='category', bins = binSize) # Same plot, split by category
plt.title('Distribution of House Square Meters', fontsize = 16)
plt.xlabel('Square Meters', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)
plt.show()

# Analysis of House Square Meters Distribution

## What I See in This Chart

Looking at this histogram, I can see:
- All the bars are about the same height (around 200)
- The bars run from 0 to roughly 100,000 square meters
- It looks very similar to the price chart we saw earlier

## Basic Observations

### The Distribution Looks Very Flat
- Most bars are between 180-230 houses
- There are no big peaks or valleys
- This is pretty unusual for house data

### The Numbers Seem Really Big
The x-axis goes up to about **100,000 square meters**. That's huge!

To put this in perspective:
- A normal house is about 150-200 square meters
- A big house might be 400-500 square meters  
- 100,000 square meters = 0.1 square kilometers (10 hectares)!

That's like saying some houses cover more ground than a dozen football pitches, which doesn't make sense.

## Comparing to the Price Chart

Both charts look almost identical:
- Same flat, uniform pattern
- Same lack of clustering
- Both have unrealistic ranges

## What This Probably Means

This data most likely:
- **Was made up for learning purposes** (not real house data)
- **Has some errors** in the measurements
- **Was created to be balanced** so every range has similar numbers

## Simple Conclusion

This doesn't look like real house data because:
1. Real houses don't come in sizes up to 0.1 km²
2. Most houses cluster around normal sizes (like 100-300 m²)
3. The pattern is too perfect and uniform

This is probably a practice dataset for learning data analysis, not actual housing market information.
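The unit conversions behind the size comparison above are easy to get wrong by a factor of ten, so here is a small sketch that spells them out. `describe_area` is a hypothetical helper, and the sample areas are illustrative values rather than rows from the dataset.

```python
SQUARE_METERS_PER_HECTARE = 10_000
SQUARE_METERS_PER_SQUARE_KM = 1_000_000

def describe_area(square_meters):
    """Express an area given in square meters as (hectares, square kilometers)."""
    hectares = square_meters / SQUARE_METERS_PER_HECTARE
    square_km = square_meters / SQUARE_METERS_PER_SQUARE_KM
    return hectares, square_km

for area in (200, 500, 100_000, 10_000_000):
    hectares, square_km = describe_area(area)
    print(f"{area:>10,} m² = {hectares:g} ha = {square_km:g} km²")
```

Passing `df['squareMeters'].max()` to the helper shows how large the biggest "house" in the file really is.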

## **Box plot of Price by Category**

In [None]:
#
# Create Box Plot of Price by Category
#
plt.figure(figsize = (12, 7))
sns.boxplot(x = 'category', y = 'price', data = df)
plt.title('Price Distribution by Category', fontsize = 16)
plt.xlabel('Price Category', fontsize = 12)
plt.ylabel('Price', fontsize = 12)
plt.show()

# Box Plot Analysis: Price Distribution by Category

## What I See in This Chart

Looking at these two box plots, I can see that **both "Basic" and "Luxury" houses have almost identical price ranges**.

## Simple Observations

### The Boxes Look Almost Identical
- Both boxes are about the same size
- Both boxes start around the same price (25th percentile, about 2.5 million)
- Both boxes end around the same price (75th percentile, about 7.5 million)
- The middle line (median) is in almost the same spot for both

### What This Means
- **Basic houses**: The middle 50% of prices runs from ~2.5M to 7.5M
- **Luxury houses**: The middle 50% of prices runs from ~2.5M to 7.5M
- **No real difference** between the categories!
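The box edges can also be computed directly rather than estimated from the chart. The following sketch uses stand-in data in which the label is drawn independently of the price (an assumption that mimics what the plot suggests); on the real data the same `groupby` would run on `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10_000
# Stand-in data (assumed): uniform prices and labels drawn independently of price.
demo = pd.DataFrame({
    'price': rng.uniform(0, 10_000_000, size=n),
    'category': rng.choice(['Basic', 'Luxury'], size=n),
})

# 25th, 50th and 75th percentiles of price within each category.
quartiles = demo.groupby('category')['price'].quantile([0.25, 0.5, 0.75]).unstack()
quartiles['iqr'] = quartiles[0.75] - quartiles[0.25]
print(quartiles.round(0))
```

If the two rows are nearly equal for the real file as well, the category label carries no information about price.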

## This is Strange Because...

In the real world, we'd expect:
- **Basic houses**: Lower prices (like 200K - 500K)
- **Luxury houses**: Higher prices (like 800K - 2M+)
- **Clear separation** between the two categories

## What's Wrong Here?

This confirms our earlier findings about the synthetic data:

1. **Categories don't track price** - Both Basic and Luxury cost the same
2. **No price separation** - The label tells us nothing about what a house costs
3. **More evidence** this is fake/practice data

## Simple Conclusion

The box plot shows that **"Basic" and "Luxury" are unrelated to price** in this dataset. The label may still depend on other columns, but there's no actual difference in pricing between the two groups.

This is another sign that this dataset was created for learning purposes and doesn't represent real housing market data where luxury homes would definitely cost more than basic ones!

## **Count Plot of City Part Range**

In [None]:
#
# Create Count Plot of City Part Range
#
plt.figure(figsize = (10, 6))
sns.countplot(x = 'cityPartRange', data = df)
plt.title('Number of Houses per City Part', fontsize = 16)
plt.xlabel('City Part Range', fontsize = 12)
plt.ylabel('Count', fontsize = 12)
plt.show()

# Count Plot Analysis: Number of Houses per City Part

## What I See in This Chart

Looking at this bar chart, I can see that **all city parts have almost the same number of houses**.

## Simple Observations

### All Bars Look Nearly Identical
- All bars are around **1,000 houses** each
- City Part 5 has slightly more (around 1,050)
- City Part 6 has slightly less (around 950)
- But overall, they're all very similar

### What This Means
- Each city part has roughly **1,000 houses** out of 10,000 total
- **Near-perfect balance** across all 10 city parts
- **No clustering** in popular areas
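How even is "nearly identical"? One way to quantify it is a chi-square statistic against equal expected counts. This is a sketch under assumptions: the approximate counts come from the reading of the chart above, the uneven weights are made up for contrast, and `chi_square_uniform` is a hypothetical helper name.

```python
import numpy as np

def chi_square_uniform(counts):
    """Chi-square statistic of observed counts against equal expected counts."""
    observed = np.asarray(counts, dtype=float)
    expected = observed.sum() / observed.size
    return float(np.sum((observed - expected) ** 2 / expected))

# Approximate reading of the chart: about 1,000 everywhere, part 5 ~1,050, part 6 ~950.
chart_reading = [1000, 1000, 1000, 1000, 1050, 950, 1000, 1000, 1000, 1000]
print("chart reading:", chi_square_uniform(chart_reading))

rng = np.random.default_rng(3)
parts = np.arange(1, 11)
# Made-up weights that favour a few city parts, as a real city might.
weights = np.array([20, 15, 15, 10, 10, 10, 8, 6, 4, 2]) / 100
uneven = rng.choice(parts, size=10_000, p=weights)
uneven_counts = np.bincount(uneven, minlength=11)[1:]
print("uneven city:  ", round(chi_square_uniform(uneven_counts), 1))
```

With 10 categories there are 9 degrees of freedom, so a statistic near 9 is what equal-probability random assignment produces by chance. On the real data the counts would come from `df['cityPartRange'].value_counts().sort_index()`.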

## This is Unrealistic Because...

In real cities, we'd expect:
- **Popular areas**: More houses (downtown, good schools)
- **Less popular areas**: Fewer houses (industrial, remote)
- **Big differences** between city parts
- **Some variation** based on geography and development

## What We'd See in Real Data
- Some city parts might have **2,000+ houses**
- Others might have only **200-300 houses**
- **Uneven distribution** based on:
  - Population density
  - Available land
  - City planning
  - Economic factors

## Another Sign of Fake Data

This adds to our growing evidence:

1. **Uniform price distribution** ✓
2. **Uniform square meters** ✓  
3. **No difference between Basic/Luxury** ✓
4. **Equal houses in all city parts** ✓

## Simple Conclusion

This chart shows **perfect artificial balance** across all city parts, which never happens in real cities.

Real cities have **natural clustering** - some areas are more developed, some are more popular, and some have geographical constraints.

This is more proof that our dataset is **created for learning purposes** rather than representing actual housing market data!

## **Count Plot of Has Pool**

In [None]:
#
# Create Count Plot of Has Pool
#
plt.figure(figsize = (8,6))
sns.countplot(x = 'hasPool', data = df)
plt.title('Number of Houses with a Swimming Pool', fontsize = 16)
plt.xlabel('Has Swimming Pool', fontsize = 12)
plt.ylabel('Count', fontsize = 12)
plt.show()

# Count Plot Analysis: Number of Houses that have Swimming Pool

## What I See in This Chart

Looking at this count plot, I can see that **almost exactly half the houses have pools and half don't**. This continues the pattern we've been seeing!

## Simple Observations

### Near-Perfect 50/50 Split
- **Houses without pools (0)**: About 5,000 houses
- **Houses with pools (1)**: About 5,000 houses
- **Almost identical bar heights**
- **Near-perfect balance** between the two categories

### What This Means
- Roughly **50% of houses** have swimming pools
- Roughly **50% of houses** don't have swimming pools
- **No natural variation** in pool ownership
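A split this close to 50/50 is what a simple coin-flip generator would produce. The sketch below assumes each house gets a pool independently with probability 0.5 (an assumption about how the data might have been generated, not something the file documents) and computes how far from 5,000 the count would typically land; `binomial_count_spread` is a hypothetical helper name.

```python
import math

def binomial_count_spread(n, p):
    """Mean and standard deviation of the number of successes in n independent draws."""
    mean = n * p
    std = math.sqrt(n * p * (1 - p))
    return mean, std

mean, std = binomial_count_spread(10_000, 0.5)
print(f"expected pools: {mean:.0f}, typical deviation: about {std:.0f} houses")
```

Comparing `df['hasPool'].sum()` with this range shows whether the observed gap between the two bars is within what chance alone explains.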

## This is Unrealistic Because...

In real housing markets, we'd expect:
- **Most houses DON'T have pools** (maybe 15-25% have pools)
- **Pools are expensive** to install and maintain
- **Climate matters** - more pools in hot areas, fewer in cold areas
- **Income levels** - luxury areas have more pools
- **Uneven distribution** based on geography and economics

## What We'd See in Real Data
- Maybe **2,000 houses with pools** (20%)
- Maybe **8,000 houses without pools** (80%)
- **Big difference** between the bar heights
- **Variation by city part** and price range

## Another Perfect Balance

This adds to our growing list of artificial patterns:

1. **Uniform price distribution** ✓
2. **Uniform square meters** ✓  
3. **No difference between Basic/Luxury** ✓
4. **Equal houses in all city parts** ✓
5. **Exactly 50% have pools** ✓

## Simple Conclusion

This chart shows **another perfect 50/50 split** that doesn't happen in real life.

In reality, **swimming pools are relatively rare** because they're expensive, require maintenance, and depend on climate and income levels.

This is more evidence that our dataset was **artificially created** with balanced categories rather than reflecting actual housing market patterns!

# **Bivariate Analysis**  

### **Creates bivariate plots to show the relationships between two variables**

## **Scatter plot of Square Meters vs. Price, colored by isNewBuilt**

In [None]:
#
# Create Scatter Plot of Square Meters vs. Price, colored by isNewBuilt
#
plt.figure(figsize = (12, 8))
sns.scatterplot(x = 'squareMeters', y = 'price', hue = 'isNewBuilt', data = df)
plt.title('House Price vs. Square Meters (Colored by New Built Status)', fontsize = 16)
plt.xlabel('Square Meters', fontsize = 12)
plt.ylabel('Price', fontsize = 12)
plt.legend(title = 'New Built')
# Passing only the 'labels' parameter makes 'Existing' and 'New' show the same colour
# plt.legend(title = 'New Built', labels = ['Existing', 'New'])
plt.show()

In [None]:
#
# Create Scatter Plot of Square Meters vs. Price, colored by isNewBuilt
# Another Way to Work-Around LEGEND Issue
#
df['descNewBuilt'] = df['isNewBuilt'].apply(lambda x: 'Existing' if x == 0 else 'New')
plt.figure(figsize = (12, 8))
sns.scatterplot(x = 'squareMeters', y = 'price', hue = 'descNewBuilt', data = df)
plt.title('House Price vs. Square Meters (Colored by New Built Status)', fontsize = 16)
plt.xlabel('Square Meters', fontsize = 12)
plt.ylabel('Price', fontsize = 12)
plt.show()
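The descriptive column used in the workaround above can also be built with a dictionary lookup instead of a row-by-row lambda. This is a minimal sketch on a stand-in Series; the variable names are illustrative only.

```python
import pandas as pd

# Stand-in for df['isNewBuilt'] (illustrative values only).
is_new_built = pd.Series([0, 1, 1, 0, 1])

# Dictionary lookup replaces the row-by-row lambda.
desc_new_built = is_new_built.map({0: 'Existing', 1: 'New'})
print(desc_new_built.tolist())
```

One difference in behaviour: the lambda labels every non-zero value as 'New', whereas `map` turns any value missing from the dictionary into `NaN`, which makes unexpected codes visible instead of hiding them.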

# Scatter Plot Analysis: House Price vs. Square Meters (Colored by New Built Status)

## What I See in This Chart

Looking at this scatter plot, I can see a **perfect straight line** going from bottom-left to top-right. This is the most artificial-looking pattern yet!

## Simple Observations

### Perfect Linear Relationship
- **Perfect straight line** - no scatter at all
- **Bigger house = Higher price** in exact proportion
- **No variation** around the line
- **Blue and orange dots** are perfectly mixed along the same line

### Color Pattern (New Built Status)
- **Blue dots (Existing houses)** and **Orange dots (New houses)** are scattered randomly
- **No difference** in pricing between new and existing houses
- **50/50 split** of colors along the entire line

## This is EXTREMELY Unrealistic Because...

In real housing data, we'd expect:
- **Scattered points** around a general trend line
- **Some variation** - houses of same size with different prices
- **Different factors** affecting price beyond just size
- **New houses** typically cost MORE than existing ones
- **Market fluctuations** creating natural scatter

## What Real Data Would Look Like
- **Cloud of points** with a general upward trend
- **New houses** (orange) mostly above existing houses (blue)
- **Some houses** priced higher/lower due to location, condition, features
- **Natural variation** in the relationship

## This is Clearly Artificial Data

This scatter plot shows:

1. **Mathematical formula** - Price = Size × Some_Constant
2. **No real-world factors** affecting price
3. **Perfect correlation** (probably R² = 1.0)
4. **Generated data** following a simple equation
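The guessed formula and the R² value can be estimated rather than assumed. The sketch below fits a straight line with least squares on stand-in data; the factor of 100 and the noise level are assumptions used only to generate the example, so the fitted slope on the real file may differ.

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in data (assumed): price is 100 x size plus a little noise.
size = rng.uniform(100, 100_000, size=10_000)
price = 100.0 * size + rng.normal(0, 2_000, size=size.size)

# Least-squares straight line: price ~ slope * size + intercept.
slope, intercept = np.polyfit(size, price, deg=1)
predicted = slope * size + intercept

# R² = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((price - predicted) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(f"slope = {slope:.2f}, intercept = {intercept:.0f}, R² = {r_squared:.6f}")
```

Running the same fit with `df['squareMeters']` and `df['price']` gives the constant in the pricing formula and confirms how close to 1.0 the R² really is.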

## Previous Evidence + This Chart

Our growing list of artificial patterns:

1. **Uniform price distribution** ✓
2. **Uniform square meters** ✓  
3. **No difference between Basic/Luxury** ✓
4. **Equal houses in all city parts** ✓
5. **Exactly 50% have pools** ✓
6. **Perfect linear price-size relationship** ✓

## Simple Conclusion

This scatter plot is the **strongest evidence yet** that this dataset is completely artificial.

Real housing markets have **natural variation**, **market forces**, and **multiple factors** affecting price. A perfect straight line like this only happens when someone writes a computer program that says:

**"Price = Square_Meters × 100"** (or some similar simple formula)

This is definitely **synthetic data created for educational purposes**!

## **Bar chart of the Average Price Per City Part Range**

In [None]:
#
# Create Bar Chart of Average Price Per City Part Range
# Using seaborn
#
plt.figure(figsize = (10, 6))

# averagePricePerCityPartRange = df.groupby('cityPartRange')['price'].mean() # pandas.core.series.Series
# The line above returns a 'Series', but the 'data' parameter of sns.barplot expects a DataFrame.
# Adding 'reset_index()' after taking the 'mean' converts the result into a DataFrame.
# Passing the Series directly fails with:
#
# TypeError: Data source must be a DataFrame or Mapping, not <class 'pandas.core.series.Series'>.

averagePricePerCityPartRange = df.groupby('cityPartRange')['price'].mean().reset_index() # pandas.core.frame.DataFrame
sns.barplot(x = 'cityPartRange', y = 'price', data = averagePricePerCityPartRange)
plt.title('Average House Price by City Part Range', fontsize=16)
plt.xlabel('City Part Range', fontsize=12)
plt.ylabel('Average Price', fontsize=12)
plt.show()

In [None]:
#
# Create Bar Chart of Average Price Per City Part Range
# Using pyplot
#
plt.figure(figsize = (10, 6))
averagePricePerCityPartRange = df.groupby('cityPartRange')['price'].mean().reset_index() # pandas.core.frame.DataFrame
plt.bar(x = averagePricePerCityPartRange['cityPartRange'], height = averagePricePerCityPartRange['price'])
plt.title('Average House Price by City Part Range', fontsize=16)
plt.xlabel('City Part Range', fontsize=12)
plt.ylabel('Average Price', fontsize=12)
plt.show()

# Bar Chart Analysis: Average House Price by City Part Range

## What I See in This Chart

Looking at this bar chart, I can see that **all city parts have almost exactly the same average house price**. The pattern continues!

## Simple Observations

### All Bars Look Identical
- All bars are around **5 million dollars** average price
- **Almost no variation** between different city parts
- City Part 5 might be slightly higher (around 5.1M)
- But overall, they're all practically the same

### What This Means
- **Every city part** has nearly the same average house price
- **No premium locations** or cheaper areas
- **Near-perfect price uniformity** across all areas
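The spread between city parts can be summarized as the ratio of the highest to the lowest average. This sketch uses stand-in data in which price is drawn independently of the city part (an assumption that mirrors what the chart suggests); on the real file the same `groupby` runs on `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 10_000
# Stand-in data (assumed): price drawn independently of the city part.
demo = pd.DataFrame({
    'cityPartRange': rng.integers(1, 11, size=n),
    'price': rng.uniform(0, 10_000_000, size=n),
})

mean_by_part = demo.groupby('cityPartRange')['price'].mean()
spread_ratio = mean_by_part.max() / mean_by_part.min()
print(mean_by_part.round(0))
print(f"highest / lowest average: {spread_ratio:.3f}")
```

A ratio barely above 1 means location has no effect on price, in contrast to the several-fold differences between neighbourhoods that a real city would show.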

## This is Completely Unrealistic Because...

In real cities, we'd expect **huge differences** between areas:
- **Downtown/Premium areas**: Maybe 8M average
- **Good suburban areas**: Maybe 6M average  
- **Average neighborhoods**: Maybe 4M average
- **Less desirable areas**: Maybe 2M average
- **Industrial/remote areas**: Maybe 1M average

## What Makes Areas Different in Real Life
- **Location desirability** (near beaches, downtown, schools)
- **Safety and crime rates**
- **School district quality**
- **Transportation access**
- **Local amenities** (parks, shopping, restaurants)
- **Economic development**

## Another Perfect Balance

This adds to our overwhelming evidence:

1. **Uniform price distribution** ✓
2. **Uniform square meters** ✓  
3. **No difference between Basic/Luxury** ✓
4. **Equal houses in all city parts** ✓
5. **Exactly 50% have pools** ✓
6. **Perfect linear price-size relationship** ✓
7. **Same average price in all city parts** ✓

## Simple Conclusion

This chart shows **another impossible uniformity** - all city parts have almost exactly the same average house price.

In real cities, **location is everything** in real estate. Some areas are worth 3-5x more than others based on desirability, safety, schools, and amenities.

This is the **seventh chart** showing artificial balance, confirming beyond any doubt that this dataset is **completely synthetic** and created for learning purposes rather than representing any real housing market!

## **Box Plot of Price by Storm Protector Presence**

In [None]:
#
# Create Box Plot of Price by Storm Protector Presence
#
plt.figure(figsize = (8, 6))
sns.boxplot(x = 'hasStormProtector', y = 'price', data = df)
plt.title('Price Distribution by Storm Protector Presence', fontsize = 16)
plt.xlabel('Storm Protector Presence', fontsize = 12)
plt.ylabel('Price', fontsize = 12)
plt.show()

# Box Plot Analysis: Price Distribution by Storm Protector Presence

## What I See in This Chart

Looking at these box plots, I can see that **houses with and without storm protectors have exactly the same price ranges**. The artificial pattern strikes again!

## Simple Observations

### Identical Box Plots
- **Both boxes are exactly the same size and position**
- **Same median price** (middle line around 5M)
- **Same box edges** (25th and 75th percentiles, roughly 2.5M and 7.5M)
- **Identical whiskers** (min and max values)

### What This Means
- **Storm protectors don't affect price** at all
- **No premium** for having storm protection
- **Perfect price equality** between both groups

## This is Unrealistic Because...

In real housing markets, we'd expect:
- **Storm protectors add value** - they're expensive safety features
- **Higher prices** for houses with storm protection
- **Regional differences** - more valuable in storm-prone areas
- **Insurance benefits** - lower premiums might increase home value
- **Clear price separation** between protected and unprotected homes

## What We'd See in Real Data
- **Houses WITH storm protectors**: Higher median price (maybe 6M)
- **Houses WITHOUT storm protectors**: Lower median price (maybe 4.5M)
- **Different box sizes** showing different price distributions
- **Clear value difference** for safety features

## This Follows the Exact Same Pattern

Our growing list of identical artificial patterns:

1. **Uniform price distribution** ✓
2. **Uniform square meters** ✓  
3. **No difference between Basic/Luxury** ✓
4. **Equal houses in all city parts** ✓
5. **Exactly 50% have pools** ✓
6. **Perfect linear price-size relationship** ✓
7. **Same average price in all city parts** ✓
8. **No price difference for storm protectors** ✓

## Simple Conclusion

This box plot shows **another impossible scenario** - expensive safety features like storm protectors have zero impact on house prices.

In real life, **storm protectors are valuable additions** that increase home value through:
- **Safety benefits**
- **Insurance savings**  
- **Peace of mind**
- **Reduced maintenance costs**

This is the **eighth chart** showing artificial uniformity, adding even more proof that this dataset is **completely synthetic** and doesn't reflect any real housing market dynamics!

# **Correlation and Relational Plots**  

### **Computes and visualizes the correlation matrix of numerical features and pair plot for a high-level overview of key relationships**

## **Correlation matrix and heat map**

In [None]:
#
# Calculate Correlation Matrix and Create Heat Map
#

#
# Select only the numerical columns for correlation analysis
#
# numericalColumns = ['squareMeters', 'numberOfRooms', 'floors', 'numPrevOwners', 'basement', 'attic', 'garage', 'price']
numericalColumns = df.select_dtypes(include = np.number).columns.to_list()
# print(numericalColumns)

#
# Calculate Correlation Matrix
#
correlationMatrix = df[numericalColumns].corr()
# print(correlationMatrix)

#
# Create Heat Map
#
plt.figure(figsize = (10, 6))
sns.heatmap(data = correlationMatrix, annot = True, fmt = '.2f', cmap = 'coolwarm', linewidths = 0.5)
plt.title('Correlation Heatmap of Numerical Features', fontsize=16)
plt.show()

# Correlation Heatmap Analysis: Numerical Features

## What I See in This Chart

Looking at this correlation heatmap, I can see that **there are almost no correlations between the variables**, with one exception: `squareMeters` and `price`. This is the final proof of artificial data!

## Simple Observations

### Almost All Correlations Are Zero
- **Diagonal line** shows 1.00 (each variable perfectly correlates with itself)
- **`squareMeters` vs `price`** should also show about 1.00, matching the perfect straight line in the scatter plot above
- **Everything else** shows values like 0.00, 0.01, -0.01, 0.02
- **No red or dark blue colors** apart from the diagonal and the size-price cells

### What This Means
- **Price depends on size and on nothing else** (real prices react to many features)
- **Number of rooms doesn't correlate with size** (should be strong positive)  
- **No relationships** between the remaining house features
- **Near-perfect independence** between all other variables
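With many columns, the strongest relationship is easier to find programmatically than by scanning the heatmap. The sketch below builds a small stand-in frame (the pricing formula and value ranges are assumptions) and picks the largest absolute off-diagonal correlation; `strongest_pair` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def strongest_pair(corr):
    """Return (row, column, value) of the largest absolute off-diagonal entry."""
    values = corr.to_numpy(copy=True)
    np.fill_diagonal(values, 0.0)  # ignore each column's 1.00 with itself
    row, col = np.unravel_index(np.abs(values).argmax(), values.shape)
    return corr.index[row], corr.columns[col], float(values[row, col])

rng = np.random.default_rng(6)
n = 10_000
square_meters = rng.uniform(100, 100_000, size=n)
demo = pd.DataFrame({
    'squareMeters': square_meters,
    'numberOfRooms': rng.integers(1, 101, size=n),                  # independent of size
    'price': 100.0 * square_meters + rng.normal(0, 2_000, size=n),  # assumed formula
})

print(demo.corr().round(2))
print(strongest_pair(demo.corr()))
```

In the notebook, `strongest_pair(correlationMatrix)` would name the one pair of columns that stands out from the near-zero background.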

## This is IMPOSSIBLE in Real Data

In real housing markets, we'd expect **strong correlations** like:

### Expected Strong Correlations:
- **Size vs Price**: Should be 0.7-0.9 (bigger = more expensive)
- **Size vs Number of Rooms**: Should be 0.6-0.8 (bigger = more rooms)
- **Number of Rooms vs Price**: Should be 0.5-0.7 (more rooms = higher price)
- **Pool vs Price**: Should be 0.3-0.5 (pools add value)
- **New Built vs Price**: Should be 0.2-0.4 (new costs more)

### Expected Medium Correlations:
- **Floors vs Size**: More floors usually means bigger house
- **Garage vs Price**: Garages add value
- **Storm Protector vs Price**: Safety features add value

## What Real Correlation Heatmaps Look Like
- **Lots of colors** - reds, oranges, blues showing relationships
- **Strong patterns** between related variables
- **Logical groupings** of correlated features
- **Clear value relationships**

## The Ultimate Proof

This correlation matrix shows **near-perfect artificial independence**:

1. **Uniform price distribution** ✓
2. **Uniform square meters** ✓  
3. **No difference between Basic/Luxury** ✓
4. **Equal houses in all city parts** ✓
5. **Exactly 50% have pools** ✓
6. **Perfect linear price-size relationship** ✓
7. **Same average price in all city parts** ✓
8. **No price difference for storm protectors** ✓
9. **Near-zero correlations between all variables except size and price** ✓

## Simple Conclusion

This correlation heatmap is the **final smoking gun**.

Real housing data has **natural relationships** - bigger houses cost more, more rooms mean bigger houses, luxury features add value. These relationships create correlations.

But this dataset shows **almost no relationships** apart from the size-price line, which suggests the features were generated as **independent random variables** and the price was then computed from the size - exactly what happens when someone creates synthetic data without considering real-world relationships.

This is **conclusive proof** that the dataset is 100% artificial and created purely for educational exercises!

## **Pair plot**

In [None]:
#
# Creates a pair plot for a high-level overview of key relationships
#

#
# Select a few key numerical columns for the pair plot
# Pair plots can be slow on large datasets with many columns
#
# pairColumns = df.select_dtypes(include = np.number).columns.to_list()
pairColumns = ['price', 'squareMeters', 'numberOfRooms', 'floors', 'hasPool']

#
# Create Pair Plot
#
# sns.pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(data = df[pairColumns])
# plt.title('Pair Plot of Key Housing Features', fontsize=16)
plt.suptitle('Pair Plot of Key Housing Features', y = 1.02, fontsize = 16)
plt.show()

# **Pie Charts**  

### **As seen above, the data set is clearly artificial, so more charts will not add much value. For academic interest, let's still draw some pie charts**

In [None]:
#
# Creates a Pie Chart to show the percentage of houses with and without a pool
#
poolCounts = df['hasPool'].value_counts() # Count once and reuse for both slices
plt.figure(figsize = (10, 6))
plt.pie([poolCounts.get(0, 0), poolCounts.get(1, 0)], labels = ['No Pool', 'Has Pool'], autopct = '%1.1f%%')
plt.title('Distribution of Houses with a Pool', fontsize = 16)
plt.show()

In [None]:
print(df['hasPool'].value_counts())
print(df['hasPool'].value_counts().get(0, 0))
print(df['hasPool'].value_counts().get(1, 0))

In [None]:
#
# Creates a Pie Chart to show the house category (Basic or Luxury)
# A different approach from the one above
#

#
# Count the occurrences for each category
#
categoryCounts = df['category'].value_counts()

#
# Define labels and sizes for the pie chart
#
labels = categoryCounts.index
sizes  = categoryCounts.values

plt.figure(figsize=(8, 8))
plt.pie(
         sizes,
         labels=labels,
         autopct='%1.1f%%'
       )
plt.title('Distribution of Houses by Category', fontsize=16)
plt.axis('equal')
plt.show()