
In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Dataset source (required by rubric):
# Kaggle - House Sales in King County, USA (kc_house_data.csv)
# https://www.kaggle.com/datasets/harlfoxem/housesalesprediction

# 1) Load dataset (update the filename if yours is different)

In [2]:
df = pd.read_csv("kc_house_data.csv")

# 2) Keep only the needed columns

In [3]:
df = df[["price", "sqft_living", "zipcode"]].copy()

# 3) Rename to match assignment language

In [5]:
df = df.rename(columns={
    "sqft_living": "square_footage"  # "zipcode" already has the name we want
})

# 4) Create an assignment-friendly categorical location column:
#    - Top 1/3 most expensive zipcodes -> "Downtown"
#    - Middle 1/3 -> "Suburb"
#    - Bottom 1/3 -> "Rural"

In [6]:
zip_mean_price = df.groupby("zipcode")["price"].mean().sort_values()
n = len(zip_mean_price)

bottom_cut = int(n / 3)
top_cut = int(2 * n / 3)

rural_zips = set(zip_mean_price.index[:bottom_cut])
downtown_zips = set(zip_mean_price.index[top_cut:])

def zip_to_location(z):
    if z in downtown_zips:
        return "Downtown"
    if z in rural_zips:
        return "Rural"
    return "Suburb"

df["location"] = df["zipcode"].apply(zip_to_location)
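To make the cut-off logic in step 4 concrete, here is a minimal pure-Python sketch on seven made-up zipcodes. The zipcode labels, the prices, and the `toy_` names are illustrative assumptions, not values from kc_house_data.csv.

```python
# Toy mean prices per zipcode (made-up numbers, for illustration only)
toy_zip_mean_price = {
    "Z1": 250_000, "Z2": 900_000, "Z3": 400_000, "Z4": 310_000,
    "Z5": 650_000, "Z6": 520_000, "Z7": 1_200_000,
}

# Rank zipcodes from cheapest to most expensive, mirroring sort_values()
ranked = sorted(toy_zip_mean_price, key=toy_zip_mean_price.get)
n = len(ranked)

bottom_cut = n // 3        # same result as int(n / 3) for non-negative n
top_cut = (2 * n) // 3     # same result as int(2 * n / 3)

toy_rural = set(ranked[:bottom_cut])
toy_downtown = set(ranked[top_cut:])

def toy_zip_to_location(z):
    """Map a zipcode to its price tier; anything in the middle is a Suburb."""
    if z in toy_downtown:
        return "Downtown"
    if z in toy_rural:
        return "Rural"
    return "Suburb"

labels = {z: toy_zip_to_location(z) for z in ranked}
counts = {tier: list(labels.values()).count(tier)
          for tier in ("Rural", "Suburb", "Downtown")}
print(labels)
print(counts)
```

When the number of zipcodes is not divisible by 3, the integer cut-offs round down, so the leftover zipcodes land in the Downtown group (here 2 Rural, 2 Suburb, 3 Downtown). The tiers also split zipcodes, not houses, so the three groups can contain different numbers of rows.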


# 5) Features (X) and target (y)

In [7]:
X = df[["square_footage", "location"]]
y = df["price"]


# 6) Preprocess: one-hot encode the categorical location column
#    (handle_unknown="ignore" encodes an unseen category as all zeros instead of raising an error)


In [8]:
try:
    ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
except TypeError:
    # For older scikit-learn versions
    ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("location", ohe, ["location"])
    ],
    remainder="passthrough"  # keeps square_footage as-is
)
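As a rough picture of what the preprocessor in step 6 hands to the regressor, the sketch below builds the encoded rows by hand. It is a hand-written illustration and not scikit-learn's implementation; the helper names `one_hot` and `encode_row` and the "Island" category are hypothetical.

```python
# Categories in sorted order, which is the column order seen in the printed coefficients
categories = ["Downtown", "Rural", "Suburb"]

def one_hot(location):
    """Return one 0/1 indicator per known category (all zeros if unknown)."""
    return [1.0 if location == c else 0.0 for c in categories]

def encode_row(location, square_footage):
    """Encoded location columns first, then the passthrough column."""
    return one_hot(location) + [float(square_footage)]

known_row = encode_row("Downtown", 2000)
unknown_row = encode_row("Island", 1500)   # "Island" never appeared in training

print(known_row)     # [1.0, 0.0, 0.0, 2000.0]
print(unknown_row)   # [0.0, 0.0, 0.0, 1500.0]
```

The all-zero row for the unknown category mirrors the behavior requested with handle_unknown="ignore": the model would then predict from the intercept and square footage alone.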

# 7) Pipeline: preprocess -> linear regression

In [9]:
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

# 8) Train/test split + train

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model.fit(X_train, y_train)



The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).
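The split in step 8 sets aside 20% of the rows, but the notebook never scores the model on them. A minimal sketch of two common held-out metrics follows, written in plain Python on made-up prices. The function names and the toy values are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted prices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up actual and predicted prices, for illustration only
toy_actual = [300_000, 450_000, 600_000, 750_000]
toy_predicted = [320_000, 430_000, 610_000, 700_000]

toy_r2 = r_squared(toy_actual, toy_predicted)
toy_mae = mean_absolute_error(toy_actual, toy_predicted)
print(f"R^2: {toy_r2:.4f}")
print(f"MAE: ${toy_mae:,.2f}")
```

In the notebook itself, `model.score(X_test, y_test)` returns the same R² quantity for the fitted pipeline. Because the location labels were derived from mean prices over the full dataset, a held-out score is likely to be somewhat optimistic.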



# 9) Predict: 2000 sq ft house in Downtown (rubric requirement)


In [11]:
new_house = pd.DataFrame({
    "square_footage": [2000],
    "location": ["Downtown"]
})
predicted_price = model.predict(new_house)[0]
print(f"Predicted price for a 2000 sq ft house in Downtown: ${predicted_price:,.2f}")


Predicted price for a 2000 sq ft house in Downtown: $683,203.61


# 10) Print coefficients with readable feature names

In [13]:
ohe_fitted = model.named_steps["preprocessor"].named_transformers_["location"]
location_feature_names = ohe_fitted.get_feature_names_out(["location"]).tolist()

# With remainder="passthrough", square_footage is appended after the encoded columns
feature_names = location_feature_names + ["square_footage"]

coefficients = model.named_steps["regressor"].coef_
intercept = model.named_steps["regressor"].intercept_

print("\nModel Coefficients:")
for feature, coef in zip(feature_names, coefficients):
    print(f"{feature}: {coef:.2f}")

print(f"\nIntercept: {intercept:.2f}")



Model Coefficients:
location_Downtown: 163045.83
location_Rural: -158325.73
location_Suburb: -4720.10
square_footage: 240.78

Intercept: 38607.30


Explanation:

Square Footage Coefficient:

The square footage coefficient is the estimated change in house price for each additional square foot, holding location constant. In this model the coefficient is 240.78, so each extra square foot is associated with a price increase of about $241.

Location Effect:

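The location coefficients show how Downtown, Suburb, and Rural shift the predicted price relative to the intercept. Because all three one-hot columns are kept alongside an intercept, no category is dropped as a reference. In the fitted output the three coefficients sum to approximately zero, so each one acts as an offset from an average across locations: Downtown adds about $163,046, Rural subtracts about $158,326, and Suburb subtracts about $4,720. Only the differences between locations are uniquely determined; for example, a Downtown house is predicted to cost about $321,372 more than a Rural house of the same size. Note that the labels were assigned by ranking zipcodes on mean price, so the ordering of these effects is expected by construction.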
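As a check on how the pieces fit together, the printed coefficients and intercept can be combined by hand to reproduce the step 9 prediction. This is a small sketch using the rounded values from the output above; the helper name `predict_by_hand` is hypothetical.

```python
# Values copied from the printed model output (rounded to two decimals)
intercept = 38607.30
coef = {
    "location_Downtown": 163045.83,
    "location_Rural": -158325.73,
    "location_Suburb": -4720.10,
    "square_footage": 240.78,
}

def predict_by_hand(square_footage, location):
    """Intercept + offset for the location + per-square-foot slope * size."""
    return (intercept
            + coef[f"location_{location}"]
            + coef["square_footage"] * square_footage)

downtown_2000 = predict_by_hand(2000, "Downtown")
location_sum = (coef["location_Downtown"]
                + coef["location_Rural"]
                + coef["location_Suburb"])
downtown_vs_rural = coef["location_Downtown"] - coef["location_Rural"]

print(f"By-hand prediction: ${downtown_2000:,.2f}")
print(f"Sum of location coefficients: {location_sum:.2f}")
print(f"Downtown minus Rural: ${downtown_vs_rural:,.2f}")
```

The by-hand result is about $683,213, within roughly $10 of the $683,203.61 the model printed. The small gap comes from using coefficients rounded to two decimals, which matters once the slope is multiplied by 2000.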