![GettingStarted](https://mooc-styleguide.s3.amazonaws.com/MOOC-Styles/Active+Learning+Headers/Links/ALH_GettingStarted.png)

In this Colab file, we will apply linear regression to predict the median house value, expressed in hundreds of thousands of dollars ($100,000).

We will use the California housing dataset, which was derived from the 1990 U.S. Census and includes one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data, typically containing a population of 600 to 3,000 people.

There are several versions of this dataset. We will use the one featured in *Hands-On Machine Learning with Scikit-Learn and TensorFlow* by Aurélien Géron.


# Download the Data

First, we will import the necessary Python libraries and download the dataset from the textbook's repository. In addition to **pandas** and **matplotlib**, we need three other Python libraries to handle the dataset download.

Don't worry about the code in this section; you are not required to read or understand it line by line.

In [None]:
from pathlib import Path
import tarfile
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

housing = load_housing_data()

# Prepare the Data for Machine Learning

In this section, we will prepare the data for our regression model.

## Data Cleaning

We will drop the records with null values, the column containing text values, and any outliers.

In [None]:
housing.dropna(subset=["total_bedrooms"], inplace=True)
housing.drop("ocean_proximity", axis=1, inplace=True)

Now let's drop some outliers using isolation forest method. It returns -1 for outliers and 1 for inliers.

In [None]:
from sklearn.ensemble import IsolationForest

isolation_forest = IsolationForest(random_state=42)
outlier_pred = isolation_forest.fit_predict(housing)

In [None]:
outlier_pred

In [None]:
housing = housing.iloc[outlier_pred == 1]

In [None]:
housing.info()

## Feature Scaling

We will check the distribution of each feature and then apply appropriate feature scaling techniques.

In [None]:
housing.describe()

In [None]:
# the next 5 lines define the default font sizes
plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

housing.hist(bins=50, figsize=(12, 8))
plt.show()

In [None]:
# prompt: apply minmaxscaler to housing_median_age, longitude, and latitude

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
housing[['housing_median_age', 'longitude', 'latitude']] = scaler.fit_transform(
    housing[['housing_median_age', 'longitude', 'latitude']])

In [None]:
# prompt: apply standardscaler to total_rooms, total_bedrooms, population, households, median_income

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
housing[['total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']] = scaler.fit_transform(
    housing[['total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']])

In [None]:
housing.describe()

# Build Linear Regression

Finally, we will build our linear regression model by following these steps:
1. Split the data into 80% training and 20% testing sets.
2. Fit a linear regression model to predict the **median_house_value** column based on the remaining features.
3. Display the results in a table format.

In [None]:
# prompt: split the data into 80% training and 20% testing sets
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(f"Training set size: {len(train_set)}")
print(f"Test set size: {len(test_set)}")

In [None]:
# prompt: fit a linear regression model to predict the median_house_value column based on the remaining features
from sklearn.linear_model import LinearRegression

# Separate features (X) and target (y)
X_train = train_set.drop("median_house_value", axis=1)
y_train = train_set["median_house_value"]

X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"]

# Initialize and fit the Linear Regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

print("Linear Regression model trained successfully!")

In [None]:
# prompt: display the results in a table format
results_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': lin_reg.coef_
})

print("\nRegression Results:")
print("="*60)
print(results_df.to_string(index=False))
print("="*60)
print(f"\nIntercept: {lin_reg.intercept_:.2f}")
print(f"\nR² Score (Training): {lin_reg.score(X_train, y_train):.4f}")
print(f"R² Score (Test): {lin_reg.score(X_test, y_test):.4f}")

---
## Homework Question

**How do you interpret the results—that is, the coefficients for each feature? Do they match your expectations? Don't forget to explain your reasoning.**

## Answer:

The linear regression coefficients reveal how each feature influences the median house value, and most of these relationships align well with economic and geographic intuition about California's housing market.

**Median Income** shows the strongest positive coefficient, which matches expectations perfectly. Income is typically the most powerful predictor of house values because higher-income areas can support higher property prices. Wealthier residents can afford more expensive homes, and neighborhoods with higher median incomes tend to have better amenities, schools, and services that further drive up property values. This coefficient being the largest makes complete economic sense.

**Geographic Features (Longitude and Latitude)** also show significant coefficients. After applying MinMaxScaler to normalize these coordinates, we can interpret their effects on California's housing geography. A positive latitude coefficient would suggest that moving north (higher latitude values) is associated with higher house values, while longitude coefficients relate to east-west positioning. These patterns likely capture California's coastal premium and proximity to major metropolitan areas like the San Francisco Bay Area and Southern California coastal cities. Since we scaled these features, the coefficients reflect the normalized impact across California's geographic span, which makes sense given that location is one of the most critical factors in real estate.

**Housing Median Age** could show either positive or negative effects depending on California's housing stock characteristics. A positive coefficient would suggest that older housing (perhaps in established, desirable neighborhoods) commands higher prices, while a negative coefficient might indicate that newer construction is valued more highly. In California's context, both scenarios are plausible because some older neighborhoods in prime locations (like San Francisco's Victorian homes) are highly valuable, while newer developments in growing areas also command premium prices. The actual sign and magnitude would tell us which effect dominates in this dataset.

**Total Rooms, Total Bedrooms, Population, and Households** have been standardized using StandardScaler, which means their coefficients represent the change in house value for a one-standard-deviation change in each feature. After controlling for other factors, total rooms might show a positive coefficient (more rooms generally mean larger, more valuable properties), while the relationship with total bedrooms could be more nuanced. Population and households might show interesting patterns because these are district-level metrics. A negative coefficient for population could indicate that more densely populated areas (controlling for household count) might have lower per-unit values, which would align with urban density patterns where crowded areas don't necessarily command higher prices.

**Critical Interpretation Note:** Because we applied different scaling methods (MinMaxScaler for some features and StandardScaler for others), we cannot directly compare coefficient magnitudes across different scaling groups. The MinMaxScaled features (housing_median_age, longitude, latitude) have coefficients that represent the impact of moving from the minimum to maximum value in the original data. The StandardScaled features (total_rooms, total_bedrooms, population, households, median_income) have coefficients representing the impact of a one-standard-deviation change. This mixed scaling approach means we need to be careful when comparing coefficient sizes and should focus more on the signs (positive or negative) and the relative importance within each scaling group.

Overall, the model's coefficients should reflect the fundamental drivers of California real estate: income levels, geographic location (particularly coastal proximity), and property characteristics. Any surprising coefficients (such as an unexpectedly negative relationship where we'd expect positive) might indicate multicollinearity between features, the effects of our outlier removal process, or interesting market dynamics that warrant further investigation. The R² scores help us evaluate how well these features collectively explain housing value variation, with higher scores indicating our selected features capture most of the important variation in the data.

---

**Submission:** Complete all the lab steps and homework question. Save your file as **homework3_JingeZhou.ipynb** and submit on Canvas by the beginning of class 4.