# California Housing - Exploratory Data Analysis

This notebook performs comprehensive exploratory data analysis on the California Housing dataset.

## Objectives
1. Load and understand the dataset
2. Assess data quality
3. Clean and preprocess data
4. Engineer features
5. Perform statistical analysis
6. Generate visualizations
7. Demonstrate SQL operations
8. Derive insights for modeling

## 1. Setup & Imports

In [49]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
from pathlib import Path

# Add the project root (parent directory) to the path so the `src` package is importable
sys.path.append('..')

# sys.path.append(str(Path(__file__).parent))

# Import project modules
from src.dataset import HousingDataProcessor
from src.services.database import DatabaseManager
from src.plots import EDAAnalyser
from src.config import FIGURES_DIR, PROCESSED_DATA_PATH, DATABASE_PATH

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Plotting settings
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✅ All imports successful!")

✅ All imports successful!


## 2. Data Loading

In [50]:
# Initialize data processor
processor = HousingDataProcessor()

# Load California housing data
data = processor.load_data()

print(f"\nDataset shape: {data.shape}")
print(f"Rows: {data.shape[0]:,}")
print(f"Columns: {data.shape[1]}")

Attempting alternative data loading method...
Downloading from alternative source...
Data loaded successfully: 20640 rows, 10 columns

Dataset shape: (20640, 10)
Rows: 20,640
Columns: 10


In [51]:
# Display first few rows
print("First 5 rows:")
data.head()

First 5 rows:


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.33 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.26 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.64 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.85 | 342200.0 | NEAR BAY |


In [52]:
# Display last few rows
print("Last 5 rows:")
data.tail()

Last 5 rows:


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.56 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.56 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.87 | 84700.0 | INLAND |
| 20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.39 | 89400.0 | INLAND |


In [53]:
# Dataset info
print("Dataset Information:")
data.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


## 3. Data Quality Assessment

In [54]:
# Check for missing values
print("Missing Values Check:")
missing = processor.check_missing_values()

if len(missing) == 0:
    print("\n✅ No missing values found!")
else:
    print(f"\n⚠️ Found missing values in {len(missing)} columns")

Missing Values Check:
Missing values found in 1 columns:
total_bedrooms    207
dtype: int64

⚠️ Found missing values in 1 columns


In [55]:
# Check for duplicates
duplicates = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

if duplicates == 0:
    print("✅ No duplicates found!")
else:
    print(f"⚠️ Found {duplicates} duplicate rows")

Number of duplicate rows: 0
✅ No duplicates found!


In [56]:
# Check data types
print("Data Types:")
print(data.dtypes)

Data Types:
longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object


## 4. Data Cleaning

In [57]:
# Handle missing values
print("Handling missing values...")
cleaned_data = processor.handle_missing_values(strategy='median')

print(f"\nBefore: {data.shape[0]:,} rows")
print(f"After: {cleaned_data.shape[0]:,} rows")

Handling missing values...
Filled total_bedrooms missing values with median: 435.00

Before: 20,640 rows
After: 20,640 rows
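
The row count is unchanged because median imputation fills gaps rather than dropping rows. As a rough illustration of what a `strategy='median'` step might do internally, here is a minimal pandas sketch. The helper name `fill_missing_with_median` and the toy frame are hypothetical; the actual logic inside `HousingDataProcessor.handle_missing_values` is not shown in this notebook and may differ.

```python
import numpy as np
import pandas as pd


def fill_missing_with_median(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with NaNs in numeric columns replaced by that column's median.

    Hypothetical helper for illustration only; not the project's implementation.
    """
    filled = df.copy()
    numeric_cols = filled.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if filled[col].isna().any():
            # Series.median() skips NaN by default, so it is computed from observed values
            filled[col] = filled[col].fillna(filled[col].median())
    return filled


# Toy data standing in for the housing frame (values are made up)
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "households": [90.0, 180.0, 270.0, 450.0],
})
filled_toy = fill_missing_with_median(toy)
print(filled_toy)
```

The median is usually preferred over the mean for skewed count columns such as `total_bedrooms`, since a few very large districts would otherwise pull the fill value upward.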


In [58]:
# Remove outliers
print("Removing outliers using IQR method...")
cleaned_data = processor.remove_outliers(method='iqr', threshold=1.5)

print(f"\nFinal cleaned dataset: {cleaned_data.shape[0]:,} rows")
print(f"Removed: {data.shape[0] - cleaned_data.shape[0]:,} rows ({((data.shape[0] - cleaned_data.shape[0])/data.shape[0]*100):.2f}%)")

Removing outliers using IQR method...
Removed 3108 outliers (15.06%). New shape: (17532, 10)

Final cleaned dataset: 17,532 rows
Removed: 3,108 rows (15.06%)
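
For readers unfamiliar with the IQR rule used above, the following is a minimal sketch of one common formulation: a row is kept only if every inspected column lies within `[Q1 - k*IQR, Q3 + k*IQR]`, with `k` playing the role of `threshold=1.5`. The function name `remove_outliers_iqr` and the choice of columns are assumptions; the notebook does not show which columns `processor.remove_outliers` actually inspects.

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, columns, threshold: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values fall inside the IQR fences for every listed column.

    Hypothetical helper for illustration only; not the project's implementation.
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - threshold * iqr
        upper = q3 + threshold * iqr
        # between() is inclusive on both ends by default
        mask &= df[col].between(lower, upper)
    return df[mask]


# Toy data (made up): one value is far outside the bulk of the distribution
toy = pd.DataFrame({"total_rooms": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})
trimmed = remove_outliers_iqr(toy, ["total_rooms"])
print(f"Removed {len(toy) - len(trimmed)} of {len(toy)} rows")
```

Because the masks are combined with a logical AND, a row is dropped as soon as any single column is out of range, which is one reason a multi-column IQR filter can remove a sizeable share of the data.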


## 5. Feature Engineering

In [59]:
# Apply feature engineering
print("Applying feature engineering...")
engineered_data = processor.apply_feature_engineering()

print(f"\nOriginal features: {cleaned_data.shape[1]}")
print(f"With engineered features: {engineered_data.shape[1]}")
print(f"\nNew features added: {engineered_data.shape[1] - cleaned_data.shape[1]}")

Applying feature engineering...
Feature engineering completed. New columns: 15
Engineered features: ['ocean_proximity', 'rooms_per_household', 'bedrooms_per_room', 'population_per_household', 'income_category', 'age_category']

Original features: 10
With engineered features: 15

New features added: 5


In [60]:
# Display engineered features
print("Engineered Features:")
new_cols = [col for col in engineered_data.columns if col not in data.columns]
print(new_cols)

# Show sample of engineered features
engineered_data[['rooms_per_household', 'bedrooms_per_room', 
                'population_per_household', 'income_category', 'age_category']].head(10)

Engineered Features:
['rooms_per_household', 'bedrooms_per_room', 'population_per_household', 'income_category', 'age_category']


| | rooms_per_household | bedrooms_per_room | population_per_household | income_category | age_category |
|---|---|---|---|---|---|
| 2 | 8.29 | 0.13 | 2.8 | very_high | old |
| 3 | 5.82 | 0.18 | 2.55 | high | old |
| 4 | 6.28 | 0.17 | 2.18 | medium | old |
| 5 | 4.76 | 0.23 | 2.14 | medium | old |
| 6 | 4.93 | 0.19 | 2.13 | medium | old |
| 7 | 4.8 | 0.22 | 1.79 | medium | old |
| 8 | 4.29 | 0.26 | 2.03 | low | old |
| 9 | 4.97 | 0.2 | 2.17 | medium | old |
| 10 | 5.48 | 0.2 | 2.26 | medium | old |
| 11 | 4.77 | 0.21 | 2.05 | medium | old |
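
The three ratio features follow directly from their names, so they can be sketched in a few lines of pandas. The sketch below uses two rows taken from the `data.head()` output earlier in this notebook (index 2 and 3). The `income_category` bin edges are purely an assumption chosen for illustration; the thresholds used by `apply_feature_engineering` are not shown here, and `age_category` is omitted for the same reason.

```python
import numpy as np
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-household ratio features and a binned income category.

    Hypothetical helper for illustration only; not the project's implementation.
    """
    out = df.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    # ASSUMPTION: these bin edges are illustrative, not the project's real thresholds
    out["income_category"] = pd.cut(
        out["median_income"],
        bins=[0.0, 2.5, 4.5, 6.0, np.inf],
        labels=["low", "medium", "high", "very_high"],
    )
    return out


# Two rows copied from the head() output above (original index 2 and 3)
sample = pd.DataFrame({
    "total_rooms": [1467.0, 1274.0],
    "total_bedrooms": [190.0, 235.0],
    "population": [496.0, 558.0],
    "households": [177.0, 219.0],
    "median_income": [7.26, 5.64],
})
features = add_ratio_features(sample)
print(features[["rooms_per_household", "bedrooms_per_room",
                "population_per_household", "income_category"]].round(2))
```

Ratios like these normalize raw district totals by district size, which tends to make districts of very different populations comparable.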


In [61]:
# Validate engineered features
print("Validating feature calculations...")

# Check rooms per household (use iloc to get first row by position, not by label)
sample_idx = 0
calculated_rooms = engineered_data.iloc[sample_idx]['total_rooms'] / engineered_data.iloc[sample_idx]['households']
stored_rooms = engineered_data.iloc[sample_idx]['rooms_per_household']

print("Sample validation for rooms_per_household:")
print(f"  Calculated: {calculated_rooms:.2f}")
print(f"  Stored: {stored_rooms:.2f}")
print(f"  Match: {np.isclose(calculated_rooms, stored_rooms)}")

Validating feature calculations...
Sample validation for rooms_per_household:
  Calculated: 8.29
  Stored: 8.29
  Match: True


## 6. Univariate Analysis

In [62]:
# Summary statistics
print("Summary Statistics:")
engineered_data.describe()

Summary Statistics:


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | rooms_per_household | bedrooms_per_room | population_per_household |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 17532.0 | 17532.0 | 17532.0 | 17532.0 | 17532.0 | 17532.0 | 17532.0 | 17532.0 | 17532.0 | 17532.0 | 17532.0 | 17532.0 |
| mean | -119.62 | 35.69 | 29.91 | 2080.07 | 427.42 | 1140.87 | 399.78 | 3.65 | 196574.94 | 5.35 | 0.21 | 2.94 |
| std | 2.0 | 2.16 | 12.28 | 1026.66 | 199.85 | 547.36 | 186.4 | 1.5 | 107345.35 | 2.46 | 0.06 | 1.02 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.5 | 14999.0 | 0.85 | 0.08 | 0.69 |
| 25% | -121.8 | 33.94 | 20.0 | 1353.0 | 284.0 | 747.0 | 267.0 | 2.51 | 113900.0 | 4.45 | 0.18 | 2.43 |
| 50% | -118.61 | 34.3 | 31.0 | 1942.0 | 405.0 | 1076.0 | 379.0 | 3.46 | 173800.0 | 5.19 | 0.2 | 2.82 |
| 75% | -118.04 | 37.74 | 38.0 | 2697.25 | 554.0 | 1492.0 | 520.0 | 4.58 | 251725.0 | 5.95 | 0.24 | 3.28 |
| max | -114.49 | 41.95 | 52.0 | 5688.0 | 1044.0 | 2721.0 | 908.0 | 8.02 | 500001.0 | 141.91 | 1.0 | 63.75 |


In [63]:
# Target variable distribution
print("Target Variable (median_house_value) Statistics:")
target_stats = engineered_data['median_house_value'].describe()
print(target_stats)

print(f"\nRange: ${target_stats['min']:,.0f} - ${target_stats['max']:,.0f}")
print(f"Mean: ${target_stats['mean']:,.0f}")
print(f"Median: ${target_stats['50%']:,.0f}")

Target Variable (median_house_value) Statistics:
count    17532.00
mean    196574.94
std     107345.35
min      14999.00
25%     113900.00
50%     173800.00
75%     251725.00
max     500001.00
Name: median_house_value, dtype: float64

Range: $14,999 - $500,001
Mean: $196,575
Median: $173,800


In [64]:
# Distribution of each numeric feature
numeric_cols = engineered_data.select_dtypes(include=[np.number]).columns

print("Numeric Features:")
for col in numeric_cols:
    print(f"\n{col}:")
    print(f"  Mean: {engineered_data[col].mean():.2f}")
    print(f"  Median: {engineered_data[col].median():.2f}")
    print(f"  Std: {engineered_data[col].std():.2f}")

Numeric Features:

longitude:
  Mean: -119.62
  Median: -118.61
  Std: 2.00

latitude:
  Mean: 35.69
  Median: 34.30
  Std: 2.16

housing_median_age:
  Mean: 29.91
  Median: 31.00
  Std: 12.28

total_rooms:
  Mean: 2080.07
  Median: 1942.00
  Std: 1026.66

total_bedrooms:
  Mean: 427.42
  Median: 405.00
  Std: 199.85

population:
  Mean: 1140.87
  Median: 1076.00
  Std: 547.36

households:
  Mean: 399.78
  Median: 379.00
  Std: 186.40

median_income:
  Mean: 3.65
  Median: 3.46
  Std: 1.50

median_house_value:
  Mean: 196574.94
  Median: 173800.00
  Std: 107345.35

rooms_per_household:
  Mean: 5.35
  Median: 5.19
  Std: 2.46

bedrooms_per_room:
  Mean: 0.21
  Median: 0.20
  Std: 0.06

population_per_household:
  Mean: 2.94
  Median: 2.82
  Std: 1.02


## 7. Bivariate Analysis

In [65]:
# Correlation with target variable
target_corr = engineered_data.select_dtypes(include=[np.number]).corr()['median_house_value'].sort_values(ascending=False)

print("Correlations with Median House Value:")
print("="*50)
print(target_corr)
print("="*50)

Correlations with Median House Value:
median_house_value          1.00
median_income               0.63
total_rooms                 0.19
housing_median_age          0.13
households                  0.11
rooms_per_household         0.10
total_bedrooms              0.08
longitude                  -0.03
population                 -0.05
latitude                   -0.16
bedrooms_per_room          -0.19
population_per_household   -0.22
Name: median_house_value, dtype: float64


In [66]:
# Top 5 positive correlations
print("Top 5 Features with Highest Positive Correlation:")
top_5_positive = target_corr[target_corr < 1.0].head(5)
print(top_5_positive)

Top 5 Features with Highest Positive Correlation:
median_income         0.63
total_rooms           0.19
housing_median_age    0.13
households            0.11
rooms_per_household   0.10
Name: median_house_value, dtype: float64


In [67]:
# Top 5 negative correlations
print("Top 5 Features with Highest Negative Correlation:")
top_5_negative = target_corr.tail(5)
print(top_5_negative)

Top 5 Features with Highest Negative Correlation:
longitude                  -0.03
population                 -0.05
latitude                   -0.16
bedrooms_per_room          -0.19
population_per_household   -0.22
Name: median_house_value, dtype: float64
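
Note that `head(5)` and `tail(5)` on a sorted correlation series select by position, so with fewer than five negative (or positive) correlations they would silently include values of the opposite sign. One way to make the split explicit is to filter by sign first. The sketch below is an alternative formulation under that assumption, using a made-up toy frame; `split_target_correlations` is a hypothetical helper name.

```python
import pandas as pd


def split_target_correlations(df: pd.DataFrame, target: str, n: int = 5):
    """Return (strongest positive, strongest negative) correlations with the target.

    Hypothetical helper for illustration only.
    """
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    positive = corr[corr > 0].sort_values(ascending=False).head(n)
    # Ascending sort puts the most negative correlation first
    negative = corr[corr < 0].sort_values().head(n)
    return positive, negative


# Toy data (made up): income rises with value, crowding falls with value
toy = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "population_per_household": [5.0, 4.0, 3.5, 2.0, 1.0],
    "median_house_value": [100.0, 210.0, 290.0, 405.0, 500.0],
})
pos, neg = split_target_correlations(toy, "median_house_value")
print("Positive:", pos.round(3).to_dict())
print("Negative:", neg.round(3).to_dict())
```

Dropping the target itself with `.drop(target)` also avoids relying on the `< 1.0` filter, which would exclude any feature that happened to be perfectly correlated with the target.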


## 8. Multivariate Analysis

In [68]:
# Initialize EDA Analyser
print("Initializing EDA Analyser...")
eda = EDAAnalyser(engineered_data)

print("\nGenerating all visualizations...")
print("This may take a few moments...")

Initializing EDA Analyser...
EDAAnalyser initialized with 17532 rows and 15 columns

Generating all visualizations...
This may take a few moments...


In [69]:
# Generate all 10 visualizations
plots = eda.generate_all_plots()

print(f"\n✅ Generated {len(plots)} visualizations!")
print(f"Saved to: {FIGURES_DIR}")

for plot_name, path in plots.items():
    print(f"  - {plot_name}: {Path(path).name}")

Generating all visualizations...

[1/10] Generating histogram...
Saved: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures\01_histogram_median_house_value.png
[2/10] Generating boxplot...
Saved: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures\02_boxplot_median_house_value_by_income_category.png
[3/10] Generating scatter plot...
Saved: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures\03_scatter_longitude_vs_latitude.png
[4/10] Generating correlation heatmap...
Saved: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures\04_correlation_heatmap.png
[5/10] Generating pairplot...
Saved: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures\05_pairplot.png
[6/10] Generating bar chart...
Saved: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures\06_bar_mean_median_house_value_by_income_category.png
[7/10] Generating violin plot...
Saved: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures\07_violin_median_income_by_age_category.png
[8/10] Generating line chart...
Saved: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures\08_line_median_house_value_by_housing_median_age.png
[9/10] Generating density plot...
Saved: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures\09_density_multiple_features.png
[10/10] Generating geographic scatter...
Saved: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures\10_geographic_scatter_median_house_value.png

All visualizations generated successfully!
Saved to: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures

✅ Generated 10 visualizations!
Saved to: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures
  - histogram: 01_histogram_median_house_value.png
  - boxplot: 02_boxplot_median_house_value_by_income_category.png
  - scatter: 03_scatter_lon

In [70]:
# Display correlation analysis
print("Detailed Correlation Analysis:")
correlations = eda.get_correlation_analysis('median_house_value')

Detailed Correlation Analysis:

Correlations with median_house_value:
median_house_value          1.00
median_income               0.63
total_rooms                 0.19
housing_median_age          0.13
households                  0.11
rooms_per_household         0.10
total_bedrooms              0.08
longitude                  -0.03
population                 -0.05
latitude                   -0.16
bedrooms_per_room          -0.19
population_per_household   -0.22
Name: median_house_value, dtype: float64


## 9. SQL Demonstrations

In [71]:
# Initialize database manager
print("Initializing Database Manager...")
db = DatabaseManager()

# Create tables
print("\nCreating database tables...")
db.create_tables()

Initializing Database Manager...
Database connection established: a:\Repositories\California-housing-prediction\notebooks\..\data\housing.db

Creating database tables...
Tables created successfully


In [72]:
# Insert data into database
print("Inserting data into database...")
db.insert_data(engineered_data, 'housing')

print("\nPopulating district summary table...")
db.populate_district_summary()

Inserting data into database...
Inserted 17532 rows into housing table

Populating district summary table...
Aggregated by income category: 4 categories
Populated district_summary table with 4 rows


4

In [73]:
# Demonstration 1: WHERE clause filtering
print("=" * 60)
print("SQL Demonstration 1: WHERE Clause")
print("=" * 60)

print("\nQuery: Filter houses with income between $30k-$50k")
filtered = db.filter_by_income(3.0, 5.0)

print(f"\nFound {len(filtered):,} records")
print("\nSample results:")
filtered[['longitude', 'latitude', 'median_income', 'median_house_value']].head(10)

SQL Demonstration 1: WHERE Clause

Query: Filter houses with income between $30k-$50k
Filtered by income (3.0 to 5.0): 7647 rows

Found 7,647 records

Sample results:


| | longitude | latitude | median_income | median_house_value |
|---|---|---|---|---|
| 0 | -122.25 | 37.85 | 3.85 | 342200.0 |
| 1 | -122.25 | 37.85 | 4.04 | 269700.0 |
| 2 | -122.25 | 37.84 | 3.66 | 299200.0 |
| 3 | -122.25 | 37.84 | 3.12 | 241400.0 |
| 4 | -122.25 | 37.84 | 3.69 | 261100.0 |
| 5 | -122.26 | 37.85 | 3.2 | 281500.0 |
| 6 | -122.26 | 37.85 | 3.27 | 241800.0 |
| 7 | -122.26 | 37.85 | 3.08 | 213500.0 |
| 8 | -122.26 | 37.83 | 3.48 | 191400.0 |
| 9 | -122.26 | 37.84 | 3.96 | 188800.0 |
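
The SQL behind `filter_by_income` is hidden inside `DatabaseManager`, so here is a minimal sketch of what such a `WHERE` filter could look like using Python's built-in `sqlite3` module. The table layout and the inclusive `BETWEEN` bounds are assumptions; the four rows are taken from outputs shown earlier in this notebook. Since `median_income` is expressed in tens of thousands of dollars, the bounds 3.0 and 5.0 correspond to $30k and $50k.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE housing ("
    "longitude REAL, latitude REAL, median_income REAL, median_house_value REAL)"
)
conn.executemany(
    "INSERT INTO housing VALUES (?, ?, ?, ?)",
    [
        (-122.25, 37.85, 3.85, 342200.0),
        (-122.25, 37.85, 4.04, 269700.0),
        (-122.24, 37.85, 7.26, 352100.0),
        (-121.09, 39.48, 1.56, 78100.0),
    ],
)

# Parameter placeholders (?) keep the bounds out of the SQL string itself
filtered = conn.execute(
    "SELECT longitude, latitude, median_income, median_house_value "
    "FROM housing WHERE median_income BETWEEN ? AND ? "
    "ORDER BY median_income",
    (3.0, 5.0),
).fetchall()

for row in filtered:
    print(row)
conn.close()
```

Passing the bounds as parameters rather than formatting them into the query text is the usual way to avoid SQL injection and quoting mistakes.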


In [74]:
# Demonstration 2: GROUP BY aggregation
print("=" * 60)
print("SQL Demonstration 2: GROUP BY Aggregation")
print("=" * 60)

print("\nQuery: Aggregate statistics by income category")
aggregated = db.aggregate_by_income_category()

print("\nAggregated Results:")
aggregated

SQL Demonstration 2: GROUP BY Aggregation

Query: Aggregate statistics by income category
Aggregated by income category: 4 categories

Aggregated Results:


| | income_category | avg_house_value | avg_rooms_per_household | avg_age | count_districts |
|---|---|---|---|---|---|
| 0 | very_high | 346940.72 | 6.78 | 25.95 | 1491 |
| 1 | high | 258614.72 | 6.06 | 27.94 | 3180 |
| 2 | medium | 186867.55 | 5.23 | 30.43 | 8556 |
| 3 | low | 117962.77 | 4.59 | 31.72 | 4305 |
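
A query of roughly the following shape would produce a table like the one above. This is a sketch with `sqlite3` under assumed table and column names, not the query that `aggregate_by_income_category` actually runs; the five sample rows are taken from outputs shown elsewhere in this notebook, and `avg_age` is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE housing ("
    "income_category TEXT, median_house_value REAL, rooms_per_household REAL)"
)
conn.executemany(
    "INSERT INTO housing VALUES (?, ?, ?)",
    [
        ("very_high", 352100.0, 8.29),
        ("high", 341300.0, 5.82),
        ("medium", 342200.0, 6.28),
        ("medium", 269700.0, 4.76),
        ("low", 226700.0, 4.29),
    ],
)

# One output row per income category, ordered from most to least expensive
aggregated = conn.execute(
    """
    SELECT income_category,
           AVG(median_house_value)  AS avg_house_value,
           AVG(rooms_per_household) AS avg_rooms_per_household,
           COUNT(*)                 AS count_districts
    FROM housing
    GROUP BY income_category
    ORDER BY avg_house_value DESC
    """
).fetchall()

for row in aggregated:
    print(row)
conn.close()
```

Each row of the housing table is a census district, so `COUNT(*)` per group gives the number of districts in that income category.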


In [75]:
# Demonstration 3: INNER JOIN
print("=" * 60)
print("SQL Demonstration 3: INNER JOIN")
print("=" * 60)

print("\nQuery: Join housing with district summary")
joined = db.join_housing_with_summary(limit=20)

print(f"\nJoined {len(joined):,} records")
print("\nSample results showing individual houses vs district averages:")
joined[['income_category', 'median_house_value', 'district_avg_value', 
        'rooms_per_household', 'district_avg_rooms']]

SQL Demonstration 3: INNER JOIN

Query: Join housing with district summary
Joined housing with district_summary: 20 rows

Joined 20 records

Sample results showing individual houses vs district averages:


| | income_category | median_house_value | district_avg_value | rooms_per_household | district_avg_rooms |
|---|---|---|---|---|---|
| 0 | very_high | 352100.0 | 346940.72 | 8.29 | 6.78 |
| 1 | high | 341300.0 | 258614.72 | 5.82 | 6.06 |
| 2 | medium | 342200.0 | 186867.55 | 6.28 | 5.23 |
| 3 | medium | 269700.0 | 186867.55 | 4.76 | 5.23 |
| 4 | medium | 299200.0 | 186867.55 | 4.93 | 5.23 |
| 5 | medium | 241400.0 | 186867.55 | 4.8 | 5.23 |
| 6 | low | 226700.0 | 117962.77 | 4.29 | 4.59 |
| 7 | medium | 261100.0 | 186867.55 | 4.97 | 5.23 |
| 8 | medium | 281500.0 | 186867.55 | 5.48 | 5.23 |
| 9 | medium | 241800.0 | 186867.55 | 4.77 | 5.23 |
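
In the result above, the `district_avg_*` columns repeat for every row with the same `income_category`, which suggests the join key is the income category and the averages are per-category values from `district_summary`. Under that assumption, the pattern can be sketched with `sqlite3` as follows. The schema, the `id` column, and the way the summary table is filled are all hypothetical stand-ins for what `DatabaseManager` does internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE housing (
        id INTEGER PRIMARY KEY,
        income_category TEXT,
        median_house_value REAL
    );
    CREATE TABLE district_summary (
        income_category TEXT PRIMARY KEY,
        district_avg_value REAL
    );
    """
)
conn.executemany(
    "INSERT INTO housing (income_category, median_house_value) VALUES (?, ?)",
    [("very_high", 352100.0), ("medium", 342200.0), ("medium", 269700.0)],
)

# Fill the summary table from the detail table, one row per category
conn.execute(
    """
    INSERT INTO district_summary (income_category, district_avg_value)
    SELECT income_category, AVG(median_house_value)
    FROM housing
    GROUP BY income_category
    """
)

# Attach each row's category average so individual and group values sit side by side
joined = conn.execute(
    """
    SELECT h.income_category, h.median_house_value, s.district_avg_value
    FROM housing AS h
    INNER JOIN district_summary AS s
        ON h.income_category = s.income_category
    ORDER BY h.id
    LIMIT ?
    """,
    (3,),
).fetchall()

for row in joined:
    print(row)
conn.close()
```

An `INNER JOIN` only returns rows with a match in both tables, so any housing row whose category is missing from the summary table would be dropped from the result.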


In [76]:
# Database statistics
print("Database Statistics:")
stats = db.get_statistics()

print(f"\nHousing records: {stats['housing_count']:,}")
print(f"Summary records: {stats['summary_count']:,}")
print(f"Database size: {stats['database_size']:.2f} MB")
print(f"Tables: {stats['tables']}")

Database Statistics:

Housing records: 17,532
Summary records: 4
Database size: 1.68 MB
Tables: ['sqlite_sequence', 'housing', 'district_summary']


## 10. Key Insights & Conclusions

In [77]:
print("="*60)
print("KEY INSIGHTS FROM EXPLORATORY DATA ANALYSIS")
print("="*60)

# Top correlations
print("\n1. TOP 3 FEATURES CORRELATED WITH HOUSE PRICES:")
top_3 = target_corr[target_corr < 1.0].head(3)
for i, (feature, corr) in enumerate(top_3.items(), 1):
    print(f"   {i}. {feature}: {corr:.3f}")

# Geographic patterns
print("\n2. GEOGRAPHIC PATTERNS:")
print(f"   - Latitude range: {engineered_data['latitude'].min():.2f} to {engineered_data['latitude'].max():.2f}")
print(f"   - Longitude range: {engineered_data['longitude'].min():.2f} to {engineered_data['longitude'].max():.2f}")
print(f"   - Data covers California state")

# Income impact
print("\n3. INCOME IMPACT:")
income_groups = aggregated.sort_values('avg_house_value', ascending=False)
for _, row in income_groups.iterrows():
    print(f"   - {row['income_category'].title()}: ${row['avg_house_value']:,.0f} (n={row['count_districts']:,})")

# Housing characteristics
print("\n4. HOUSING CHARACTERISTICS:")
print(f"   - Average rooms per household: {engineered_data['rooms_per_household'].mean():.2f}")
print(f"   - Average bedrooms per room: {engineered_data['bedrooms_per_room'].mean():.2f}")
print(f"   - Average population per household: {engineered_data['population_per_household'].mean():.2f}")

# Recommendations
print("\n5. RECOMMENDATIONS FOR MODELING:")
print(f"   ✓ Use median_income as primary predictor (correlation: {target_corr['median_income']:.3f})")
print(f"   ✓ Include geographic features (latitude, longitude)")
print(f"   ✓ Leverage engineered features (rooms_per_household, etc.)")
print(f"   ✓ Consider non-linear relationships for income")
print(f"   ✓ Account for outliers in house values")

print("\n" + "="*60)

KEY INSIGHTS FROM EXPLORATORY DATA ANALYSIS

1. TOP 3 FEATURES CORRELATED WITH HOUSE PRICES:
   1. median_income: 0.635
   2. total_rooms: 0.189
   3. housing_median_age: 0.130

2. GEOGRAPHIC PATTERNS:
   - Latitude range: 32.54 to 41.95
   - Longitude range: -124.35 to -114.49
   - Data covers California state

3. INCOME IMPACT:
   - Very_High: $346,941 (n=1,491)
   - High: $258,615 (n=3,180)
   - Medium: $186,868 (n=8,556)
   - Low: $117,963 (n=4,305)

4. HOUSING CHARACTERISTICS:
   - Average rooms per household: 5.35
   - Average bedrooms per room: 0.21
   - Average population per household: 2.94

5. RECOMMENDATIONS FOR MODELING:
   ✓ Use median_income as primary predictor (correlation: 0.635)
   ✓ Include geographic features (latitude, longitude)
   ✓ Leverage engineered features (rooms_per_household, etc.)
   ✓ Consider non-linear relationships for income
   ✓ Account for outliers in house values



## 11. Data Export

In [78]:
# Save processed data
print("Saving processed data...")
processor.save_data(engineered_data, 'processed')

print("\n✅ All data exported successfully!")
print(f"\nProcessed data available at: {PROCESSED_DATA_PATH}")
print(f"Database available at: {DATABASE_PATH}")
print(f"Visualizations available at: {FIGURES_DIR}")

Saving processed data...
Data saved to: a:\Repositories\California-housing-prediction\notebooks\..\data\processed\housing_processed.csv
Backup saved to: a:\Repositories\California-housing-prediction\notebooks\..\data\processed\housing_processed_20260107_161836.csv

✅ All data exported successfully!

Processed data available at: a:\Repositories\California-housing-prediction\notebooks\..\data\processed\housing_processed.csv
Database available at: a:\Repositories\California-housing-prediction\notebooks\..\data\housing.db
Visualizations available at: a:\Repositories\California-housing-prediction\notebooks\..\reports\figures


## Summary

This notebook has completed comprehensive exploratory data analysis on the California Housing dataset:

✅ **Data Loading**: Loaded 20,640 housing records with 10 columns (9 numeric columns, including the target `median_house_value`, plus the categorical `ocean_proximity`)

✅ **Data Quality**: Checked for missing values (207 in `total_bedrooms`), duplicate rows (none found), and data types

✅ **Data Cleaning**: Filled missing `total_bedrooms` values with the median (435) and removed 3,108 outlier rows (15.06%) using the IQR method, leaving 17,532 rows

✅ **Feature Engineering**: Created 5 new features (rooms_per_household, bedrooms_per_room, population_per_household, income_category, age_category)

✅ **Statistical Analysis**: Performed univariate, bivariate, and multivariate analysis

✅ **Visualizations**: Generated 10 different types of plots

✅ **SQL Operations**: Demonstrated WHERE, GROUP BY, and INNER JOIN queries

✅ **Insights**: Identified key patterns and relationships

**Next Steps**:
1. Train linear regression model using processed data
2. Evaluate model performance
3. Make predictions using the Streamlit dashboard

**Files Generated**:
- Processed CSV: `data/processed/housing_processed.csv`
- SQLite Database: `data/housing.db`
- Visualizations: `reports/figures/*.png` (10 plots)