# Predictive California Housing Price Model 1990 US Census Bureau

Author: Data-Git-Hub <br>
GitHub Project Repository Link: https://github.com/Data-Git-Hub/applied-ml-data-git-hub <br>
11 March 2025 <br>


### Introduction
The U.S. Census Bureau collects extensive demographic, social, and economic data every ten years to analyze population trends and housing conditions. This data is gathered through mailed survey forms, in-person interviews, and follow-ups to account for non-responses. Despite its rigorous methodology, census data is subject to certain limitations, including sampling errors, misreporting, and undercounting, particularly in large, diverse states like California. The state's high population density, significant immigrant communities, and transient housing situations contribute to potential inaccuracies in data collection, affecting the precision of housing market analyses. <br>

The California housing dataset in sklearn.datasets is derived from the 1990 U.S. Census and includes various economic and demographic factors that influence the housing market. Key columns in the dataset include median house values, median income, average household size, housing age, and geographical location. These variables provide a snapshot of the state's real estate landscape, allowing for predictive modeling and trend analysis. Understanding these attributes can help in assessing affordability, regional disparities, and socioeconomic influences on housing prices. <br>

A closer examination of this dataset can reveal potential challenges in the California housing market. Issues such as affordability crises, income inequality, and housing shortages may emerge when analyzing the data trends. Given California's history of rapid urban expansion and economic booms, disparities in median incomes and housing costs could indicate broader problems, such as gentrification, rising rental burdens, and insufficient housing supply. Additionally, the limitations in the dataset may obscure certain critical aspects of the market, such as informal housing arrangements or fluctuations in demand due to migration patterns. <br>

### Imports
Python libraries are collections of pre-written code that provide specific functionalities, making programming more efficient and reducing the need to write code from scratch. These libraries cover a wide range of applications, including data analysis, machine learning, web development, and automation. Some libraries, such as os, sys, math, json, and datetime, come built-in with Python as part of its standard library, providing essential functions for file handling, system operations, mathematical computations, and data serialization. Other popular third-party libraries, like pandas, numpy, matplotlib, seaborn, and scikit-learn, must be installed separately and are widely used in data science and machine learning. The extensive availability of libraries in Python's ecosystem makes it a versatile and powerful programming language for various domains. <br>


Pandas is a powerful data manipulation and analysis library that provides flexible data structures, such as DataFrames and Series. It is widely used for handling structured datasets, enabling easy data cleaning, transformation, and aggregation. Pandas is essential for data preprocessing in machine learning and statistical analysis. <br>
https://pandas.pydata.org/docs/ <br>

NumPy (Numerical Python) is a foundational library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a comprehensive collection of mathematical functions to operate on these arrays efficiently. NumPy is a key component in scientific computing and machine learning. <br>
https://numpy.org/doc/stable/ <br>

Matplotlib is a widely used data visualization library that allows users to create static, animated, and interactive plots. It provides extensive tools for generating various chart types, including line plots, scatter plots, histograms, and bar charts, making it a critical library for exploratory data analysis. <br>
https://matplotlib.org/stable/contents.html <br>

Seaborn is a statistical data visualization library built on top of Matplotlib, designed for creating visually appealing and informative plots. It simplifies complex visualizations, such as heatmaps, violin plots, and pair plots, making it easier to identify patterns and relationships in datasets. <br>
https://seaborn.pydata.org/ <br>

Scikit-learn provides a variety of tools for machine learning, including data preprocessing, model selection, and evaluation. It contains essential functions for building predictive models and analyzing datasets. <br>

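IPython.display provides the `display` function, which renders objects such as pandas DataFrames as rich output in Jupyter notebooks. <br>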

sklearn.datasets.fetch_california_housing: This function loads the California housing dataset from the 1990 U.S. Census Bureau report. It includes data on median home values, income levels, and other socioeconomic factors, making it useful for regression tasks. <br>
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html <br>

sklearn.model_selection.train_test_split: This function is used to split datasets into training and testing subsets, ensuring that models are evaluated on unseen data to prevent overfitting. It is a standard practice in machine learning workflows. <br>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html <br>

sklearn.linear_model.LinearRegression: This class implements a simple linear regression model that predicts target values based on input features. It uses the least squares method to determine the best-fit line, making it a fundamental technique in regression analysis. <br>
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html <br>

sklearn.metrics: This module provides various performance metrics for evaluating machine learning models. <br>

* Root Mean Squared Error (RMSE): Measures the standard deviation of prediction errors, helping to understand model accuracy. <br>
* Mean Absolute Error (MAE): Calculates the average absolute differences between predicted and actual values, indicating overall prediction accuracy. <br>
* R² Score (R-Squared): Represents how well the independent variables explain the variance in the target variable. A higher R² value indicates a better-fitting model.<br>
https://scikit-learn.org/stable/modules/model_evaluation.html<br>
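To make the three metrics above concrete, here is a minimal sketch that computes them by hand with NumPy. The arrays `y_true` and `y_pred` are hypothetical values chosen only for illustration; they are not taken from the housing data. The same quantities are what `mean_absolute_error`, `root_mean_squared_error`, and `r2_score` report.

```python
import numpy as np

# Hypothetical actual and predicted values, chosen only for illustration.
y_true = np.array([3.0, 2.0, 4.0, 5.0])
y_pred = np.array([2.5, 2.0, 4.5, 4.0])

errors = y_true - y_pred

# MAE: mean of the absolute errors.
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean of the squared errors.
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE:  {mae:.3f}")   # MAE:  0.500
print(f"RMSE: {rmse:.3f}")  # RMSE: 0.612
print(f"R²:   {r2:.3f}")    # R²:   0.700
```

Because RMSE squares each error before averaging, the single error of 1.0 pulls RMSE (0.612) above MAE (0.500).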

In [14]:

# Import pandas for data manipulation and analysis
import pandas as pd

# Import numpy for numerical computing
import numpy as np

# Import matplotlib for creating static visualizations
import matplotlib.pyplot as plt

# Import seaborn for statistical data visualization
import seaborn as sns

# Import display for Jupyter-friendly output
from IPython.display import display

# Import the California housing dataset from sklearn
from sklearn.datasets import fetch_california_housing

# Import train_test_split for splitting data into training and test sets
from sklearn.model_selection import train_test_split

# Import LinearRegression for building a linear regression model
from sklearn.linear_model import LinearRegression

# Import performance metrics for model evaluation
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score

### Project Outline
The following approach will be used to analyze a dataset and build a predictive model systematically, ensuring that each step contributes to a deeper understanding of the data and enhances the accuracy of the final model. By following this structured workflow, we can make informed decisions about feature selection and model evaluation, ultimately improving predictive performance. <br>

We begin with Section 1: Load and Explore the Data, where we retrieve the California housing dataset from sklearn.datasets and examine its structure. This initial exploration is crucial as it helps us understand the nature of the dataset, including the number of features, data types, missing values, and basic statistical summaries. Without this foundational step, we risk making incorrect assumptions about the data, which could lead to ineffective modeling. <br>

Next, in Section 2: Visualize Feature Distributions, we focus on graphical representations of the dataset's features using histograms, boxenplots, and scatter plots. Visualizing the data allows us to identify potential outliers, detect skewed distributions, and understand relationships between variables. This step provides insight into whether transformations or normalizations are necessary before feeding the data into a predictive model. <br>

Moving forward, Section 3: Feature Selection and Justification involves carefully selecting which features should be included in our model. Not all features contribute equally to prediction accuracy; some may be redundant or introduce noise. Using statistical methods such as correlation analysis and variance thresholds, we determine which variables are most relevant for predicting house prices. Proper feature selection enhances model efficiency and prevents overfitting. <br>

Finally, in Section 4: Train a Linear Regression Model, we implement and train a simple linear regression model using the selected features. This step involves splitting the dataset into training and testing subsets, fitting the model, and evaluating its performance using metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²). By doing so, we assess how well our model generalizes to unseen data and identify areas for potential improvement. <br>

Following this structured approach ensures a logical progression from data understanding to model development, allowing for a well-informed, data-driven approach to predictive modeling. <br>

Section 1. Load and Explore the Data <br>
Section 2. Visualize Feature Distributions <br>
Section 3. Feature Selection and Justification <br>
Section 4. Train a Linear Regression Model <br>

### Section 1. Load and Explore the Data

1.1 Load the dataset and display the first 10 rows <br>

Load the California housing dataset directly from `scikit-learn`. <br>
- The `fetch_california_housing` function returns a dictionary-like object with the data. <br>
- Convert it into a pandas DataFrame. <br>
- Display just the first 10 rows using `head()`. <br>


In [15]:
# Load the California housing dataset
data = fetch_california_housing(as_frame=True)
data_frame = data.frame  # DataFrame holding the features and the target column

# Set pandas display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 2000)
pd.set_option('display.max_colwidth', None)

# Display first 10 rows of the dataset
print("California Housing Data Preview:")
print(data_frame.head(10).to_string(), "\n")

# Check for missing values in the dataset and ensure the output always shows
missing_values = data_frame.isna().sum()

print("Missing values in California Housing Data:")
print(missing_values.to_string(), "\n")


California Housing Data Preview:
    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude   Longitude  MedHouseVal
0 8.325200 41.000000  6.984127   1.023810  322.000000  2.555556 37.880000 -122.230000     4.526000
1 8.301400 21.000000  6.238137   0.971880 2401.000000  2.109842 37.860000 -122.220000     3.585000
2 7.257400 52.000000  8.288136   1.073446  496.000000  2.802260 37.850000 -122.240000     3.521000
3 5.643100 52.000000  5.817352   1.073059  558.000000  2.547945 37.850000 -122.250000     3.413000
4 3.846200 52.000000  6.281853   1.081081  565.000000  2.181467 37.850000 -122.250000     3.422000
5 4.036800 52.000000  4.761658   1.103627  413.000000  2.139896 37.850000 -122.250000     2.697000
6 3.659100 52.000000  4.931907   0.951362 1094.000000  2.128405 37.840000 -122.250000     2.992000
7 3.120000 52.000000  4.797527   1.061824 1157.000000  1.788253 37.840000 -122.250000     2.414000
8 2.080400 42.000000  4.294118   1.117647 1206.000000  2.026891 37.840000 -1

1.2 Check for missing values and display summary statistics <br>

In the cell below: <br>
1. Use `info()` to check data types and missing values. <br>
2. Use `describe()` to see summary statistics. <br>
3. Use `isnull().sum()` to identify missing values in each column. <br>


In [29]:
# Load the California housing dataset
data = fetch_california_housing(as_frame=True)
data_frame = data.frame  # DataFrame holding the features and the target column

# Set pandas display options to prevent truncation
pd.set_option('display.max_rows', 500)  # Adjust row display limit if necessary
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.expand_frame_repr', False)  # Prevent line wrapping
pd.set_option('display.float_format', '{:.6f}'.format)  # Format float precision

# Display dataset information
print("Dataset Information:")
display(data_frame.info())  # info() prints its report itself and returns None, so display() adds a trailing "None"

print("\nSummary Statistics:")
display(data_frame.describe())  # Ensure all summary statistics are fully visible

print("\nMissing Values in the Dataset:")
display(data_frame.isnull().sum())  # Ensure all missing values are displayed

# Check for duplicate rows
duplicates = data_frame.duplicated().sum()
print(f"\nNumber of Duplicate Rows: {duplicates}\n")

# Check for any constant (single-value) columns that may not be useful
constant_columns = [col for col in data_frame.columns if data_frame[col].nunique() == 1]
print(f"Columns with a Single Unique Value: {constant_columns}\n")

# Check for zero or negative values in relevant columns
print("\nChecking for Zero or Negative Values in Key Columns:")
for col in ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"]:
    zero_negative_count = (data_frame[col] <= 0).sum()
    print(f"{col}: {zero_negative_count} values <= 0")

print("\nData exploration and cleaning checks completed!\n")


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


None


Summary Statistics:


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |



Missing Values in the Dataset:


MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64


Number of Duplicate Rows: 0

Columns with a Single Unique Value: []


Checking for Zero or Negative Values in Key Columns:
MedInc: 0 values <= 0
HouseAge: 0 values <= 0
AveRooms: 0 values <= 0
AveBedrms: 0 values <= 0
Population: 0 values <= 0
AveOccup: 0 values <= 0

Data exploration and cleaning checks completed!



### Conclusions - Section 1
- Analysis of Data: <br> 

1) How many data instances (also called data records or data rows) are there? <br>
   20,640 data records. <br>

2) How many features (also columns or attributes) are there? <br>
   There are 9 columns in total: 8 input features and 1 target column (MedHouseVal). <br>

3) What are the names of the features? ("Feature" is used most often in ML projects.) <br>
   The feature names are MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, and MedHouseVal. <br>

4) Which features are numeric? <br>
   All features are numeric. <br>

5) Which features are categorical (non-numeric)? <br>
   None. <br>

6) Are there any missing values? How should they be handled? Should we delete a sparsely populated column? Delete an incomplete data row? Substitute with a different value? <br>
   No. The checks above found no missing values in any column, so no rows or columns need to be deleted and no values need to be substituted. <br>

7) What else do you notice about the dataset? Are there any data issues? <br>
   The dataset is well-structured and clean, with no missing values, making it relatively easy to work with. All features are numeric, meaning no categorical encoding is required. However, some aspects of the data warrant closer examination. <br>

    One key observation is the wide range of values in certain features. Population runs from 3 to 35,682, and the maxima of AveRooms (about 141.9) and AveOccup (about 1,243.3) sit far above their 75th percentiles (about 6.05 and 3.28). This suggests outliers or skewness, which may affect model performance. Extreme values can distort statistical summaries and predictions, so further analysis, such as visualizing distributions with histograms or box plots, would help determine whether data transformations (e.g., log scaling) are necessary. <br>

    Another consideration is feature relationships and multicollinearity. Since variables like AveRooms, AveBedrms, and Population are interrelated, high correlations between them could affect the stability of regression models. Conducting a correlation analysis will help identify redundant features, reducing the risk of unstable coefficient estimates and overfitting. <br>

    Additionally, while the dataset provides aggregated neighborhood-level data, it does not contain individual household details. This means any insights derived must be interpreted at a regional level rather than for specific properties. This aggregation could hide localized housing trends and introduce some loss of granularity in predictions. <br>

    Overall, while the dataset is clean and well-prepared for analysis, further outlier detection, correlation checks, and potential transformations could enhance its usability for predictive modeling. <br>
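As a rough illustration of the log-scaling idea mentioned above, the sketch below measures skewness before and after a `log1p` transform. The series `population_like` is synthetic, drawn from a log-normal distribution as an assumed stand-in for a right-skewed column such as Population; it is not the real data, and the real column may behave differently.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a column like Population.
# This is an assumption for illustration, not the actual census data.
rng = np.random.default_rng(42)
population_like = pd.Series(
    rng.lognormal(mean=7.0, sigma=1.0, size=2000), name="population_like"
)

# Sample skewness: values well above 0 indicate a long right tail.
skew_before = population_like.skew()

# log1p computes log(1 + x), which compresses large values.
skew_after = np.log1p(population_like).skew()

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after log1p:  {skew_after:.2f}")
```

In the notebook, the same two calls could be applied to `data_frame["Population"]` to check whether a transform is worth considering.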

### Section 2. Visualize Feature Distributions

2.1 Create histograms, boxplots, and scatterplots <br>

- Create histograms for all numeric features using `data_frame.hist()` with 30 bins. <br>
- Create boxenplots using `sns.boxenplot()`. <br>
- Create scatter plots using `sns.pairplot()`. <br>

First, histograms <br>
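A minimal sketch of the histogram step follows. To stay self-contained it builds a small synthetic frame named `demo_frame` (a hypothetical stand-in with made-up values); in the notebook the same `hist()` call would be made on `data_frame` instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data; replace demo_frame with data_frame in the notebook.
rng = np.random.default_rng(0)
demo_frame = pd.DataFrame({
    "MedInc": rng.lognormal(mean=1.2, sigma=0.5, size=500),
    "HouseAge": rng.integers(1, 53, size=500).astype(float),
})

# One histogram per numeric column, 30 bins each.
# hist() returns a 2-D array of matplotlib Axes, one per subplot slot.
axes = demo_frame.hist(bins=30, figsize=(10, 4))
plt.tight_layout()

# In Jupyter the figure renders at the end of the cell;
# when running as a script, call plt.show() here.
```

With all nine columns of `data_frame`, a larger `figsize` such as `(12, 8)` is likely to be easier to read.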


In [17]:

# This is a Python cell. 
# Put your comments and code here.




Second - Boxenplots: generate one boxenplot for each column (good for large datasets) <br>

Example code: <br>

for column in data_frame.columns:
    plt.figure(figsize=(6, 4))
    sns.boxenplot(data=data_frame[column])
    plt.title(f'Boxenplot for {column}')
    plt.show()

In [18]:
# This is a Python cell. 
# Put your comments and code here.


Third - Scatter Plots <br>

Generate all Scatter plots (there is a LOT of data, so this will take a while) <br>

Comment out after analysis to speed up the notebook. <br>

Example code: 

sns.pairplot(data_frame)

plt.show()

In [19]:
# This is a Python cell. 
# Put your comments and code here.



### Section 3. Feature Selection and Justification
- 3.1 Choose two input features for predicting the target.
- 3.2 Justify your selection with reasoning.

Analysis: Why did you choose these features? How might they impact predictions?

---


3.1 Choose two input features for predicting the target <br>

- Select `MedInc` and `AveRooms` as predictors. <br>
- Select `MedHouseVal` as the target variable. <br>
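One way to support a feature choice like this is to rank candidate features by their correlation with the target. The sketch below is only an illustration: `rank_features_by_correlation` is a hypothetical helper, not part of pandas or scikit-learn, and the demo frame is synthetic, so its numbers say nothing about the housing data.

```python
import numpy as np
import pandas as pd


def rank_features_by_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return absolute Pearson correlations with the target, largest first."""
    correlations = frame.corr()[target].drop(target)
    return correlations.abs().sort_values(ascending=False)


# Synthetic demo: the target depends strongly on one column and weakly on another.
rng = np.random.default_rng(0)
n_rows = 1000
demo = pd.DataFrame({
    "strong_signal": rng.normal(size=n_rows),
    "weak_signal": rng.normal(size=n_rows),
    "pure_noise": rng.normal(size=n_rows),
})
demo["target"] = (
    2.0 * demo["strong_signal"]
    + 0.2 * demo["weak_signal"]
    + rng.normal(scale=0.5, size=n_rows)
)

ranking = rank_features_by_correlation(demo, "target")
print(ranking)
```

In the notebook, `rank_features_by_correlation(data_frame, "MedHouseVal")` would give the equivalent ranking for the real columns. Pearson correlation only captures linear relationships, so it is a starting point rather than a full justification.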

In the following, <br> 
X is capitalized because it represents a matrix (consistent with mathematical notation). <br>
y is lowercase because it represents a vector (consistent with mathematical notation). <br>


First: <br>
- Create a list of contributing features and the target variable <br>
- Define the target feature string (the variable we want to predict) <br>
- Define the input DataFrame <br>
- Define the output Series <br>


Example code:

features: list = ['MedInc', 'AveRooms']

target: str = 'MedHouseVal'

df_X = data_frame[features]

df_y = data_frame[target]


In [20]:
# This is a Python cell. 
# Put your comments and code here.





## Section 4. Train a Linear Regression Model
### 4.1 Split the data
Split the dataset into training and test sets (80% train / 20% test).

Call train_test_split() by passing in: 

- df_X – Feature matrix (input data) as a pandas DataFrame
- df_y – Target values as a pandas Series
- test_size – Fraction of data to use for testing (e.g., 0.2 = 20%)
- random_state – Seed value for reproducible splits

We'll get back four return values:

- X_train – Training set features (DataFrame)
- X_test – Test set features (DataFrame)
- y_train – Training set target values (Series)
- y_test – Test set target values (Series)


Example code:

X_train, X_test, y_train, y_test = train_test_split(
    df_X, df_y, test_size=0.2, random_state=42)
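The effect of `test_size` and `random_state` can be checked on a small example. The sketch below uses a hypothetical 100-row frame (`demo_X`, `demo_y`) rather than the housing data, so the shapes are easy to verify by eye.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row dataset, used only to make the split sizes obvious.
demo_X = pd.DataFrame({
    "feature_a": np.arange(100.0),
    "feature_b": np.arange(100.0) * 2,
})
demo_y = pd.Series(np.arange(100.0), name="target")

# 80% train / 20% test, with a fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42)

# Repeating the call with the same seed selects the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)            # (80, 2) (20, 2)
print(X_te.index.equals(X_te2.index))    # True
```

Features and targets stay aligned after the split: `X_te` and `y_te` share the same index values.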


In [21]:
# This is a Python cell. 
# Put your comments and code here.
# 




### 4.2 Train the model
Create and fit a `LinearRegression` model.

LinearRegression – A class from sklearn.linear_model that creates a linear regression model.

model – An instance of the LinearRegression model. This object will store the learned coefficients and intercept after training.

fit() – Trains the model by finding the best-fit line for the training data using the Ordinary Least Squares (OLS) method.

X_train – The input features used to train the model.

y_train – The target values used to train the model.


Example code:


model = LinearRegression()

model.fit(X_train, y_train)
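After fitting, the learned parameters are available as `coef_` and `intercept_`. As a sanity check of what `fit()` does, the sketch below trains on noise-free synthetic data generated from a known equation (an assumption made purely for illustration) and confirms that the model recovers it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known linear rule: y = 1 + 2*x1 + 3*x2 (no noise).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# coef_ holds one weight per input feature; intercept_ is the constant term.
print("Coefficients:", demo_model.coef_)
print("Intercept:", demo_model.intercept_)
```

On real data such as the housing set, the fit is not exact, and the coefficients are the values that minimize the sum of squared errors on the training set.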


In [22]:
# This is a Python cell. 
# Put your comments and code here.
# 



Make predictions for the test set.

The `model.predict()` method applies the fitted linear equation (the learned intercept plus each feature multiplied by its learned coefficient) to the rows of `X_test` to compute predicted values.

`y_pred` contains one predicted value for each row in `X_test`.


Example code:

y_pred = model.predict(X_test)
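For a linear regression model, a prediction is the intercept plus the dot product of the feature row with the coefficient vector. The sketch below checks this on random synthetic data; the names `X_demo`, `y_demo`, and `demo_model` are hypothetical and unrelated to the housing variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Random synthetic data; only the arithmetic matters here, not the fit quality.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 2))
y_demo = rng.normal(size=30)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Manual prediction: intercept + X @ coefficients.
manual_pred = demo_model.intercept_ + X_demo @ demo_model.coef_

# The manual result should agree with predict() up to floating-point error.
matches = np.allclose(manual_pred, demo_model.predict(X_demo))
print("Manual equation matches predict():", matches)
```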


In [23]:

# This is a Python cell. 
# Put your comments and code here.
# 


### 4.3 Report R^2, MAE, RMSE
Evaluate the model using R^2, MAE, and RMSE.

First:

- Coefficient of Determination (R^2) - This tells you how well the model explains the variation in the target variable. A value close to 1 means the model fits the data well; a value close to 0 means the model doesn’t explain the variation well.


Example code:
  
r2 = r2_score(y_test, y_pred)

print(f'R²: {r2:.2f}')



In [24]:

# This is a Python cell. 
# Put your comments and code here.
# 



Second:

- Mean Absolute Error (MAE) - This is the average of the absolute differences between the predicted values and the actual values. A smaller value means the model’s predictions are closer to the actual values.


Example code:

mae = mean_absolute_error(y_test, y_pred)

print(f'MAE: {mae:.2f}')




In [25]:
# This is a Python cell. 
# Put your comments and code here.
# 




Third:

- Root Mean Squared Error (RMSE) - This is the square root of the average of the squared differences between the predicted values and the actual values. It gives a sense of how far the predictions are from the actual values, with larger errors having more impact.

Example code:

rmse = root_mean_squared_error(y_test, y_pred)

print(f'RMSE: {rmse:.2f}')
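MAE and RMSE are reported in the units of the target. In this dataset, `MedHouseVal` is expressed in hundreds of thousands of dollars, so the errors can be converted to dollars for easier interpretation. The helper `to_dollars` below is a hypothetical convenience function, and the value passed to it is an arbitrary example rather than a result from this model.

```python
def to_dollars(error_in_target_units: float) -> float:
    """Convert an error measured in units of $100,000 into dollars."""
    return error_in_target_units * 100_000


# Arbitrary example value, not a result from the model above.
example_error = 0.5
print(f"An error of {example_error} corresponds to ${to_dollars(example_error):,.0f}")
# An error of 0.5 corresponds to $50,000
```

This conversion applies to MAE and RMSE only; R² is unitless and needs no rescaling.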


In [26]:
# This is a Python cell. 
# Put your comments and code here.
# 


### Conclusion