# Step 1: Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Load the California Housing dataset
california_housing = fetch_california_housing(as_frame=True)

# Step 3: Display dataset information
print(california_housing.DESCR)

# Step 4: Get the combined DataFrame (features plus the MedHouseVal target)
df = california_housing.frame

# Step 5: Display first few rows of the DataFrame
print(df.head())

# Step 6: Split the data into features and target variable
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

# Step 7: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 8: Standardize the features (fit the scaler on the training set only, then reuse it on the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 9: Initialize the linear regression model
model = LinearRegression()

# Step 10: Train the model
model.fit(X_train_scaled, y_train)

# Step 11: Make predictions on the testing set
y_pred = model.predict(X_test_scaled)

# Step 12: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Step 13: Print evaluation metrics
print("Mean Squared Error:", mse)
print("R-squared:", r2)


.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

# California Housing Regression Project Documentation

## Introduction
This project builds a linear regression model to predict median house values in California districts from eight numeric features, including median income, house age, and location.

## Dataset
The California Housing dataset contains 20,640 instances with 8 numeric predictive attributes and the target variable:
- MedInc: Median income in block group
- HouseAge: Median house age in block group
- AveRooms: Average number of rooms per household
- AveBedrms: Average number of bedrooms per household
- Population: Block group population
- AveOccup: Average number of household members
- Latitude: Block group latitude
- Longitude: Block group longitude
- MedHouseVal: Median house value in units of $100,000 (target variable)

## Preprocessing
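1. Loaded the dataset using the `fetch_california_housing` function with `as_frame=True`.
2. Took the combined DataFrame (`california_housing.frame`) for easier manipulation.
3. Split the data into features (X) and the target variable (y, `MedHouseVal`).
4. Split the data into training (80%) and testing (20%) sets with `random_state=42`.
5. Standardized the features using `StandardScaler`, fitted on the training set only and then applied to the testing set.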
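The order of the last two steps matters: the scaler should learn its mean and standard deviation from the training rows only, so that no information from the test rows leaks into preprocessing. The following is a minimal sketch of that idea. It uses synthetic data from `make_regression` as a stand-in for the housing features, and the `demo_` names are assumptions made for illustration, not part of the project code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 housing features (assumption, avoids a download).
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_demo = X_demo * 50.0 + 100.0  # give the columns a non-zero mean and a large spread

# Split first, so the test rows never influence the scaling statistics.
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
X_demo_train_scaled = demo_scaler.fit_transform(X_demo_train)  # learns mean/std from training rows
X_demo_test_scaled = demo_scaler.transform(X_demo_test)        # reuses the training mean/std

train_means = X_demo_train_scaled.mean(axis=0)
train_stds = X_demo_train_scaled.std(axis=0)
test_means = X_demo_test_scaled.mean(axis=0)

print("Training column means:", np.round(train_means, 3))
print("Test column means:    ", np.round(test_means, 3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The scaled test columns are only close to that, which is expected because they were transformed with the training statistics.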

## Model Building
1. Initialized a linear regression model using `LinearRegression`.
2. Trained the model on the training data.
3. Made predictions on the testing data.
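As an alternative way to organize the same steps, scaling and regression can be bundled into a single scikit-learn pipeline, so the scaler is always fitted on whatever data is passed to `fit`. The sketch below is not the code used in this project. It runs on synthetic data from `make_regression`, and the `_syn` and `manual_` names are hypothetical. It checks that the pipeline gives the same predictions as scaling by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features (assumption, not the housing data).
X_syn, y_syn = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# Pipeline: fit() scales the training data and then fits the regression on it.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_syn_train, y_syn_train)
pipe_pred = pipe.predict(X_syn_test)

# The same steps done by hand, for comparison.
manual_scaler = StandardScaler().fit(X_syn_train)
manual_model = LinearRegression().fit(manual_scaler.transform(X_syn_train), y_syn_train)
manual_pred = manual_model.predict(manual_scaler.transform(X_syn_test))

# Coefficients are on the standardized scale, one per feature.
coefficients = pipe.named_steps["linearregression"].coef_
print("Predictions match:", np.allclose(pipe_pred, manual_pred))
print("Number of coefficients:", coefficients.shape[0])
```

Because the features are standardized, the magnitudes of the coefficients are comparable across features, which makes them a rough indicator of each feature's influence on the prediction.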

## Model Evaluation
Evaluated the model's performance on the held-out testing set using the following metrics:
- Mean Squared Error (MSE): 0.5473 (in squared units of $100,000)
- R-squared: 0.6066
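
To make the two metrics concrete, here is a small sketch that computes them by hand with NumPy, using MSE = mean((y - ŷ)²) and R² = 1 - SS_res / SS_tot. The helper name `mse_and_r2` and the toy arrays are hypothetical. Only the value 0.5473 comes from the results above.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R-squared) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                    # average squared error
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2

# Toy values (hypothetical), in the same units as MedHouseVal.
toy_true = [3.0, 1.5, 2.0, 4.5]
toy_pred = [2.5, 1.0, 2.5, 4.0]
toy_mse, toy_r2 = mse_and_r2(toy_true, toy_pred)
print("Toy MSE:", toy_mse)
print("Toy R-squared:", toy_r2)

# The target is in units of $100,000, so the square root of the MSE
# converts the reported error into dollars.
reported_mse = 0.5473
rmse_in_dollars = np.sqrt(reported_mse) * 100_000
print("RMSE in dollars:", round(rmse_in_dollars))
```

Read this way, the reported MSE corresponds to a root mean squared error of about 0.74, or roughly $74,000 per district.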

## Conclusion
The linear regression model achieved an R-squared value of 0.6066 on the testing set, meaning it explains approximately 60.66% of the variance in median house value. Further tuning and feature engineering could improve the model's performance.
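
As one illustration of what feature engineering could look like here, the sketch below derives two ratio features from the existing columns. These features were not part of this project and their benefit has not been measured. The names `BedrmsPerRoom`, `Households`, and `add_ratio_features`, and the toy values, are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) using column names from the dataset.
toy_blocks = pd.DataFrame({
    "AveRooms": [6.0, 5.0, 8.0],
    "AveBedrms": [1.2, 1.0, 2.0],
    "Population": [300.0, 1200.0, 900.0],
    "AveOccup": [3.0, 2.5, 3.0],
})

def add_ratio_features(frame):
    """Return a copy of `frame` with two derived ratio columns added."""
    out = frame.copy()
    # Share of rooms that are bedrooms.
    out["BedrmsPerRoom"] = out["AveBedrms"] / out["AveRooms"]
    # Approximate household count: population divided by average household size.
    out["Households"] = out["Population"] / out["AveOccup"]
    return out

engineered = add_ratio_features(toy_blocks)
print(engineered)
```

Any such feature would need to be added before the train/test split is scaled, and its effect checked against the baseline metrics above.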

