# Exploratory Data Analysis for Zurich Real Estate Price Prediction

This notebook explores the datasets used for the Zurich Real Estate Price Prediction project.

**Datasets:**
1. Property Prices by Neighborhood (bau515od5155.csv)
2. Property Prices by Building Age (bau515od5156.csv)
3. Travel Time Data (to be generated)

**Owner:** Matthieu (Primary), Anna (Support)

In [None]:
# Import the libraries used in this notebook
import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
plt.style.use('ggplot')
sns.set(style="whitegrid")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

## 1. Load Data

In [None]:
# Define file paths
DATA_DIR = "../data/raw"
NEIGHBORHOOD_DATA = "bau515od5155.csv"
BUILDING_AGE_DATA = "bau515od5156.csv"

# Load neighborhood data
neighborhood_path = os.path.join(DATA_DIR, NEIGHBORHOOD_DATA)
neighborhood_df = pd.read_csv(neighborhood_path)
print(f"Loaded neighborhood data with {neighborhood_df.shape[0]} rows and {neighborhood_df.shape[1]} columns")

# Load building age data
building_age_path = os.path.join(DATA_DIR, BUILDING_AGE_DATA)
building_age_df = pd.read_csv(building_age_path)
print(f"Loaded building age data with {building_age_df.shape[0]} rows and {building_age_df.shape[1]} columns")
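Both files are read with the same steps, so the loading logic could be wrapped in a small helper that fails early when a file is missing. The sketch below is one possible shape; the function name `load_csv_checked` is hypothetical and not part of the project.

```python
import os

import pandas as pd


def load_csv_checked(data_dir, filename):
    """Read a CSV from data_dir, raising a clear error if the file is missing.

    Hypothetical helper for illustration; the name and behavior are assumptions.
    """
    path = os.path.join(data_dir, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected data file not found: {path}")
    df = pd.read_csv(path)
    print(f"Loaded {filename} with {df.shape[0]} rows and {df.shape[1]} columns")
    return df
```

With this in place, the two loads above could be written as `load_csv_checked(DATA_DIR, NEIGHBORHOOD_DATA)` and `load_csv_checked(DATA_DIR, BUILDING_AGE_DATA)`.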

## 2. Examine Neighborhood Data

In [None]:
# Display first few rows of neighborhood data
print("First 5 rows of neighborhood data:")
neighborhood_df.head()

In [None]:
# Check data types and missing values
print("\nNeighborhood data types:")
neighborhood_df.info()

In [None]:
# Check for missing values
print("\nMissing values in neighborhood data:")
neighborhood_df.isnull().sum()
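Raw counts are easier to judge next to the share of rows they represent. A minimal sketch, assuming a hypothetical helper named `missing_summary`, that reports both and hides complete columns:

```python
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns with gaps.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have missing values, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

Calling `missing_summary(neighborhood_df)` would return an empty frame if no column has gaps.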

In [None]:
# TODO: Examine the RaumLang column to identify neighborhoods
print("\nUnique values in RaumLang (neighborhoods):")
print(neighborhood_df['RaumLang'].unique())

In [None]:
# TODO: Check the distribution of neighborhoods
neighborhood_counts = neighborhood_df['RaumLang'].value_counts()
print("\nNeighborhood counts:")
neighborhood_counts

In [None]:
# TODO: Analyze price distribution by neighborhood
# Assumption: HAMedianPreis is the median price column; adjust if the dataset differs

# Average the median price per neighborhood across all years and room counts
# (an unweighted mean of medians, so treat the result as a rough ranking)
neighborhood_prices = neighborhood_df.groupby('RaumLang')['HAMedianPreis'].mean().sort_values(ascending=False)

# Plot neighborhood prices
plt.figure(figsize=(12, 8))
neighborhood_prices.plot(kind='bar')
plt.title('Average Median Property Price by Neighborhood')
plt.xlabel('Neighborhood')
plt.ylabel('Price (CHF)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
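The chart above averages over every year in the data, which blends older and newer price levels. A possible complement is to restrict the comparison to the most recent year. The sketch below assumes `Stichtagdatjahr` holds the reporting year and `HAMedianPreis` the median price; the function name `latest_year_prices` is hypothetical.

```python
import pandas as pd


def latest_year_prices(df, group_col="RaumLang", year_col="Stichtagdatjahr",
                       price_col="HAMedianPreis"):
    """Average price per group, using only rows from the most recent year.

    Hypothetical helper; the column names are assumptions about the dataset.
    """
    latest_year = df[year_col].max()
    recent = df[df[year_col] == latest_year]
    return recent.groupby(group_col)[price_col].mean().sort_values(ascending=False)
```

The result is a Series indexed by neighborhood, so `latest_year_prices(neighborhood_df).plot(kind='bar')` would produce a chart comparable to the one above.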

## 3. Examine Building Age Data

In [None]:
# Display first few rows of building age data
print("First 5 rows of building age data:")
building_age_df.head()

In [None]:
# Check data types and missing values
print("\nBuilding age data types:")
building_age_df.info()

In [None]:
# Check for missing values
print("\nMissing values in building age data:")
building_age_df.isnull().sum()

In [None]:
# TODO: Examine the BaualterLang_noDM column to identify building age categories
print("\nUnique values in BaualterLang_noDM (building age categories):")
print(building_age_df['BaualterLang_noDM'].unique())

In [None]:
# TODO: Analyze price distribution by building age
# Assumption: HAMedianPreis is the median price column; adjust if the dataset differs

# Average the median price per building age category
# (an unweighted mean of the reported medians)
age_prices = building_age_df.groupby('BaualterLang_noDM')['HAMedianPreis'].mean().sort_values(ascending=False)

# Plot building age prices
plt.figure(figsize=(12, 6))
age_prices.plot(kind='bar')
plt.title('Average Median Property Price by Building Age')
plt.xlabel('Building Age')
plt.ylabel('Price (CHF)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## 4. Examine Room Count Distribution

In [None]:
# TODO: Examine the AnzZimmerLevel2Lang_noDM column for room counts
print("\nUnique values in AnzZimmerLevel2Lang_noDM (room counts):")
print(neighborhood_df['AnzZimmerLevel2Lang_noDM'].unique())

In [None]:
# TODO: Analyze price distribution by room count
# Group by room count and calculate average median price
room_prices = neighborhood_df.groupby('AnzZimmerLevel2Lang_noDM')['HAMedianPreis'].mean().sort_values(ascending=False)

# Plot room count prices
plt.figure(figsize=(10, 6))
room_prices.plot(kind='bar')
plt.title('Average Median Property Price by Room Count')
plt.xlabel('Room Count')
plt.ylabel('Price (CHF)')
plt.tight_layout()
plt.show()

## 5. Time Series Analysis

In [None]:
# TODO: Analyze price trends over time (2009-2024)
# Group by year and calculate average median price
yearly_prices = neighborhood_df.groupby('Stichtagdatjahr')['HAMedianPreis'].mean()

# Plot yearly price trends
plt.figure(figsize=(12, 6))
yearly_prices.plot(kind='line', marker='o')
plt.title('Average Median Property Price by Year')
plt.xlabel('Year')
plt.ylabel('Price (CHF)')
plt.grid(True)
plt.tight_layout()
plt.show()
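Periods of growth or decline are easier to spot as year-over-year changes than as absolute levels. A minimal sketch, assuming the same `Stichtagdatjahr` and `HAMedianPreis` columns used above; the function name `yearly_growth` is hypothetical.

```python
import pandas as pd


def yearly_growth(df, year_col="Stichtagdatjahr", price_col="HAMedianPreis"):
    """Year-over-year percentage change of the average median price.

    Hypothetical helper; the column names are assumptions about the dataset.
    """
    yearly = df.groupby(year_col)[price_col].mean().sort_index()
    # The first year has no predecessor, so its value is NaN
    return yearly.pct_change() * 100
```

The change is computed between consecutive rows, so a year missing from the data would make the next value span more than one year.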

## 6. Additional Insights

In [None]:
# TODO: Compare price per square meter by neighborhood (if available)
# Assumption: HAPreisWohnflaeche is the price per square meter of living area;
# adjust if the dataset differs

# Check if the column exists
if 'HAPreisWohnflaeche' in neighborhood_df.columns:
    # Group by neighborhood and calculate average price per square meter
    neighborhood_price_per_sqm = neighborhood_df.groupby('RaumLang')['HAPreisWohnflaeche'].mean().sort_values(ascending=False)
    
    # Plot neighborhood price per square meter
    plt.figure(figsize=(12, 8))
    neighborhood_price_per_sqm.plot(kind='bar')
    plt.title('Average Price per Square Meter by Neighborhood')
    plt.xlabel('Neighborhood')
    plt.ylabel('Price per Square Meter (CHF)')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

In [None]:
# TODO: Analyze correlations between features (if applicable)
# The columns below are a demonstration subset; adjust to the actual dataset

# Select columns for correlation analysis, keeping only those that exist
candidate_cols = ['Stichtagdatjahr', 'HAPreisWohnflaeche', 'HAMedianPreis']
numeric_cols = [col for col in candidate_cols if col in neighborhood_df.columns]
corr_matrix = neighborhood_df[numeric_cols].corr()

# Plot correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

## 7. Summary and Insights

### Key Findings:

1. **Neighborhood Analysis**:
   - [Fill in observations about neighborhoods and prices]
   - [Note any patterns or outliers]

2. **Building Age Impact**:
   - [Fill in observations about building age and prices]
   - [Note any trends or patterns]

3. **Room Count Analysis**:
   - [Fill in observations about room counts and prices]
   - [Note any patterns or trends]

4. **Time Trends**:
   - [Fill in observations about price trends over time]
   - [Note any significant periods of growth or decline]

### Next Steps:

1. Generate travel time data using the Google Maps API
2. Clean and preprocess the datasets for modeling
3. Engineer additional features based on insights from this EDA
4. Develop baseline models for price prediction
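
As a starting point for the cleaning step, the price columns could be coerced to numeric types and rows without any usable price dropped. This is only a sketch under assumptions: the function name `clean_price_columns` is hypothetical, and the rule for which rows to drop should be revisited once the actual data has been inspected.

```python
import pandas as pd


def clean_price_columns(df, price_cols=("HAMedianPreis", "HAPreisWohnflaeche")):
    """Coerce price columns to numeric and drop rows with no usable price.

    Hypothetical helper; column names and the drop rule are assumptions.
    """
    present = [col for col in price_cols if col in df.columns]
    cleaned = df.copy()
    if not present:
        return cleaned
    for col in present:
        # Values that cannot be parsed as numbers become NaN
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    # Drop a row only when every available price column is missing
    return cleaned.dropna(subset=present, how="all")
```

The input frame is left unchanged, so the raw data stays available for comparison.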