# Dataset Understanding and Problem Definition  
California Housing Dataset (1990 Census)

---

## Objective
The objective of this notebook is to develop a clear and structured understanding of the California Housing dataset before performing any data analysis or modeling.

This notebook focuses on:
- Understanding the context and origin of the data
- Identifying the target variable
- Interpreting each feature and its role
- Defining assumptions and hypotheses
- Establishing a clear analysis plan

No modeling or feature transformations are performed at this stage.

---

## Dataset Context
This dataset is based on data from the **1990 California Census** and was popularized by Aurélien Géron in the book *Hands-On Machine Learning with Scikit-Learn and TensorFlow*.

The dataset provides aggregated information about housing districts in California and is widely used as an introductory dataset for data analysis and machine learning because of:
- Its manageable size
- Its clearly interpretable variables
- Its real-world imperfections, which require preprocessing

Each row represents **a geographical district** (a census block group), not an individual house. Every value in a row is therefore a district-level aggregate or median.

---

## Dataset Overview
The dataset contains numerical features and one categorical feature. Together they describe geographical location, housing characteristics, population statistics, and income levels for California districts.

Key characteristics:
- Real-world, uncleaned data
- Requires handling missing values
- Contains both spatial and socioeconomic information
- Suitable for regression tasks

---

## Target Variable
**Target:** `median_house_value`

**Description:**  
Represents the median house value for a given California district.

**Type:**  
Numerical (continuous)

**Important Notes:**
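- The target variable is capped at a maximum value (roughly $500,000 in the commonly distributed version of the data), so every district above the cap reports the same value.
- This cap censors the upper tail of the distribution, may bias estimates for expensive districts, and must be considered during analysis.
- The distribution is expected to be right-skewed.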
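
One way to check the cap once the data is loaded is to measure how many districts sit exactly at the maximum value. The sketch below is only an illustration: the helper name `capped_share` and the toy values are assumptions, not part of the dataset or of this notebook's workflow.

```python
import pandas as pd


def capped_share(values: pd.Series) -> float:
    """Return the fraction of non-missing entries equal to the series maximum."""
    values = values.dropna()
    if values.empty:
        return 0.0
    return float((values == values.max()).mean())


# Toy values for illustration only; these are not real census records.
toy_target = pd.Series(
    [120_000, 250_000, 500_001, 500_001, 180_000],
    name="median_house_value",
)

share = capped_share(toy_target)
print(f"{share:.0%} of toy districts sit at the maximum value")
```

A share that is clearly larger than a single district suggests a pile-up at the cap rather than a genuine maximum.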

---

## Feature Description
Below is a high-level explanation of each feature in the dataset.

### Geographical Features
- `longitude`: Longitude of the district.
- `latitude`: Latitude of the district.

These variables capture spatial information and are expected to strongly influence house prices due to location effects.

---

### Housing Characteristics
- `housing_median_age`: Median age of houses in the district.
- `total_rooms`: Total number of rooms across all houses in the district.
- `total_bedrooms`: Total number of bedrooms across all houses in the district.

These features describe the structural characteristics of housing within each district.

---

### Population and Household Features
- `population`: Total population in the district.
- `households`: Total number of households in the district.

These variables provide demographic context and may correlate with housing density and demand.

---

### Socioeconomic Feature
- `median_income`: Median income of households in the district, expressed in scaled units (tens of thousands of US dollars) rather than raw dollars.

This feature is expected to be one of the strongest predictors of house value.

---

### Categorical Feature
- `ocean_proximity`: Proximity of the district to the ocean (categorical).

This variable captures location desirability and is expected to have a significant impact on house prices.

---

## Feature Types
Based on domain understanding, features can be categorized as follows:

### Numerical Features
- longitude
- latitude
- housing_median_age
- total_rooms
- total_bedrooms
- population
- households
- median_income
- median_house_value (target)

### Categorical Features
- ocean_proximity
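
The categorization above comes from domain understanding, so it is worth confirming against the inferred column types once the data is loaded. A minimal sketch, assuming a pandas DataFrame; the helper name `split_feature_types` and the toy rows are hypothetical.

```python
import pandas as pd


def split_feature_types(df: pd.DataFrame, target: str = "median_house_value"):
    """Split feature columns into numerical and non-numerical, excluding the target."""
    features = df.drop(columns=[target], errors="ignore")
    numerical = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


# Toy rows for illustration only; these are not real census records.
toy = pd.DataFrame(
    {
        "longitude": [-122.23, -118.30],
        "median_income": [8.3, 3.1],
        "ocean_proximity": ["NEAR BAY", "INLAND"],
        "median_house_value": [452600.0, 150000.0],
    }
)

numerical, categorical = split_feature_types(toy)
print("numerical:", numerical)
print("categorical:", categorical)
```

If a column expected to be numerical shows up in the second list, it usually points to stray text values that need inspection before analysis.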

---

## Assumptions and Hypotheses
The following assumptions guide the exploratory analysis:

1. Districts with higher median income tend to have higher median house values.
2. Proximity to the ocean is associated with higher house prices.
3. Geographic location (latitude and longitude) captures regional price patterns.
4. Higher average household occupancy (population per household) may affect prices.
5. Older housing districts may have different pricing behavior than newer ones.
6. Aggregate room counts depend on district size and need normalization (e.g., per household) to be comparable across districts.
7. Missing values in `total_bedrooms` may introduce bias if not handled properly.
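
Although no transformations are applied in this notebook, hypotheses 4 and 6 are easier to reason about with a concrete picture of the normalization they imply. The sketch below is a hedged preview only: the function `add_household_ratios`, the derived column names, and the toy values are all assumptions introduced for illustration.

```python
import pandas as pd


def add_household_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with per-household ratios added (hypothetical column names)."""
    out = df.copy()
    # Treat zero households as missing so the ratios become NaN instead of inf.
    households = out["households"].where(out["households"] > 0)
    out["rooms_per_household"] = out["total_rooms"] / households
    out["population_per_household"] = out["population"] / households
    return out


# Toy values for illustration only; these are not real census records.
toy = pd.DataFrame(
    {
        "total_rooms": [880, 7099, 300],
        "population": [322, 2401, 90],
        "households": [126, 1138, 0],
    }
)

ratios = add_household_ratios(toy)
print(ratios[["rooms_per_household", "population_per_household"]])
```

The raw totals mostly reflect how large a district is, whereas the ratios describe the typical household, which is closer to what the hypotheses are about.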

---

## Potential Data Issues
Before analysis and modeling, the following issues are expected:
- Missing values in `total_bedrooms`
- Capped values in the target variable
- Aggregated features requiring normalization
- Spatial effects not captured by simple linear relationships
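- Potential multicollinearity among the count-based features (`total_rooms`, `total_bedrooms`, `population`, `households`), which all scale with district size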
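
Two of these issues, missing values and correlated count features, can be checked with a few lines once the data is available. A minimal sketch, assuming a pandas DataFrame; `missing_summary` is a hypothetical helper and the rows are toy values.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """List columns that contain missing values, with counts and shares."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy values for illustration only; these are not real census records.
toy = pd.DataFrame(
    {
        "total_rooms": [880, 7099, 1467, 1274],
        "total_bedrooms": [129.0, None, 190.0, 235.0],
        "households": [126, 1138, 177, 219],
    }
)

summary = missing_summary(toy)
print(summary)

# Pairwise Pearson correlations; values near 1 hint at multicollinearity.
corr = toy[["total_rooms", "total_bedrooms", "households"]].corr()
print(corr.round(2))
```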

---

## Analysis Plan
The analysis will follow these steps:

1. Load and inspect the dataset structure.
2. Examine missing values and data types.
3. Analyze the distribution of the target variable.
4. Perform univariate analysis of numerical features.
5. Analyze spatial patterns using latitude and longitude.
6. Investigate relationships between features and the target variable.
7. Prepare insights for feature engineering and baseline modeling.
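
Steps 1 and 2 of the plan could look roughly like the sketch below. It reads a tiny inline CSV so that it runs on its own; in the actual analysis the source would be the dataset file, whose path is not specified here. The helper name `inspect_structure` and the two rows are assumptions for illustration.

```python
import io

import pandas as pd


def inspect_structure(df: pd.DataFrame) -> dict:
    """Collect shape, column dtypes, and missing-value counts in one report."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
    }


# Inline toy CSV for illustration only; replace with the real file in practice.
toy_csv = io.StringIO(
    "longitude,latitude,total_bedrooms,median_house_value,ocean_proximity\n"
    "-122.23,37.88,129,452600,NEAR BAY\n"
    "-118.30,34.05,,150000,INLAND\n"
)

df = pd.read_csv(toy_csv)
report = inspect_structure(df)

print("shape:", report["shape"])
for column, n_missing in report["missing"].items():
    print(f"{column}: {n_missing} missing, dtype {report['dtypes'][column]}")
```

Keeping the report as a plain dictionary makes it easy to compare before and after any later cleaning step.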

---

## Expected Outcome
By the end of this notebook, there should be:
- A solid understanding of the dataset and its limitations
- Clearly defined assumptions guiding further analysis
- A structured plan for exploratory analysis and modeling

This notebook serves as the foundation for all subsequent analysis.
