# House Sales in King County, USA

Predicting house prices using exploratory data analysis and regression models.

This is a cleaned, portfolio-ready version of the original IBM project notebook. It focuses on the analysis and code, and removes platform-specific boilerplate.

## 1. Dataset

This notebook uses the **King County house sales dataset** with missing values (`kc_house_data_NaN.csv`).

- Original source (IBM Developer Skills Network):  
  `kc_house_data_NaN.csv` from the course *Data Analysis with Python*  
- Direct download URL:  
  `https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/kc_house_data_NaN.csv`

There are two ways to get the data:

1. **Download the CSV manually** and place it in a local `data/` folder, or
2. Let the notebook **download it automatically** from the URL when the local file is not found.


## 2. Setup

Import the libraries used throughout the analysis and configure the environment.

In [None]:
# Suppress non-critical warnings (optional)
import warnings
warnings.filterwarnings("ignore")

import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error

%matplotlib inline


## 3. Data loading and initial inspection

The code below:

1. Defines the direct **online URL** of the dataset
2. Looks for a local copy at `data/kc_house_data_NaN.csv`
3. Loads the data **directly from the URL** if the local file is missing, and saves a local copy for future runs

This way, anyone cloning the repository can run the notebook as long as they have an internet connection or have placed the CSV into the `data/` folder.

In [None]:
DATA_URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/kc_house_data_NaN.csv"
DATA_DIR = "data"
DATA_PATH = os.path.join(DATA_DIR, "kc_house_data_NaN.csv")

# Create data directory if it doesn't exist
os.makedirs(DATA_DIR, exist_ok=True)

if os.path.exists(DATA_PATH):
    print(f"Loading data from local file: {DATA_PATH}")
    df = pd.read_csv(DATA_PATH)
else:
    print("Local file not found. Loading data from URL...")
    print(f"URL: {DATA_URL}")
    df = pd.read_csv(DATA_URL)
    # Save a local copy for future runs
    df.to_csv(DATA_PATH, index=False)
    print(f"Dataset downloaded and saved to: {DATA_PATH}")

print("Data shape:", df.shape)
df.head()


### 3.1 Structure and summary statistics

Inspect column types and basic descriptive statistics to understand the dataset.

In [None]:
df.info()
df.describe().T
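
`df.info()` reports non-null counts, but a per-column tally of missing values is often easier to scan. The sketch below is one way to produce it; `missing_summary` is a hypothetical helper (not part of the original notebook), and the toy frame holds made-up values purely to show the output shape. In the notebook it could be called as `missing_summary(df)`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(frame),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with made-up values, used only to illustrate the helper
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "bedrooms": [3.0, np.nan, 2.0, 4.0],
    "bathrooms": [1.0, 2.25, np.nan, np.nan],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```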


## 4. Data wrangling

We remove identifier columns that are not useful for prediction and fill missing values in `bedrooms` and `bathrooms` with the column mean.

In [None]:
# Drop identifier columns that do not carry predictive information
for col in ["id", "Unnamed: 0"]:
    if col in df.columns:
        df.drop(col, axis=1, inplace=True)

# Check missing values in bedrooms and bathrooms
print("Missing bedrooms before:", df['bedrooms'].isnull().sum())
print("Missing bathrooms before:", df['bathrooms'].isnull().sum())

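# Impute with the mean (assign back instead of calling fillna(inplace=True)
# on a column selection, which is unreliable in recent pandas versions)
df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].mean())
df['bathrooms'] = df['bathrooms'].fillna(df['bathrooms'].mean())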

print("Missing bedrooms after:", df['bedrooms'].isnull().sum())
print("Missing bathrooms after:", df['bathrooms'].isnull().sum())
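
To make the effect of mean imputation concrete, here is a minimal sketch on a toy frame with made-up values. `impute_with_mean` is a hypothetical helper, not part of the original notebook; it returns a copy so the input frame is left untouched.

```python
import numpy as np
import pandas as pd


def impute_with_mean(frame: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy of `frame` with NaNs in `columns` replaced by each column's mean."""
    out = frame.copy()
    for col in columns:
        # Series.mean() skips NaN by default, so the mean uses observed values only
        out[col] = out[col].fillna(out[col].mean())
    return out


# Toy data: one missing value in each column
toy = pd.DataFrame({
    "bedrooms": [3.0, np.nan, 2.0, 4.0],
    "bathrooms": [1.0, 2.0, np.nan, 3.0],
})
toy_filled = impute_with_mean(toy, ["bedrooms", "bathrooms"])
print(toy_filled)
```

Filling with the mean keeps each column's average unchanged but shrinks its variance slightly, which is usually acceptable when only a handful of rows are affected.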


## 5. Exploratory data analysis (EDA)

We explore the distribution of key variables and their relationships with the target variable `price`.

In [None]:
# Distribution of the number of floors
floors_counts = df['floors'].value_counts().to_frame(name='count')
floors_counts


In [None]:
# Price distribution by waterfront vs. non-waterfront properties
sns.boxplot(x='waterfront', y='price', data=df)
plt.title("House price distribution by waterfront")
plt.show()
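
A boxplot is easier to interpret alongside the numbers behind it. The sketch below shows a grouped summary on a toy frame with made-up prices; applied to the notebook's data, the same `groupby` call would run on `df` instead of `toy`.

```python
import pandas as pd

# Toy data with made-up prices: three non-waterfront and two waterfront homes
toy = pd.DataFrame({
    "waterfront": [0, 0, 0, 1, 1],
    "price": [300000.0, 450000.0, 520000.0, 1200000.0, 1800000.0],
})

# Count, median, and mean price per waterfront group
price_by_waterfront = toy.groupby("waterfront")["price"].agg(["count", "median", "mean"])
print(price_by_waterfront)
```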


In [None]:
# Relationship between above-ground square footage and price
sns.regplot(x='sqft_above', y='price', data=df)
plt.title("sqft_above vs price")
plt.show()


In [None]:
# Correlation of numerical features with price
corr_with_price = df.corr(numeric_only=True)['price'].sort_values(ascending=False)
corr_with_price


## 6. Model development

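We start with a simple linear regression model, move to a multivariate model using several predictive features, and then add degree-2 polynomial terms in a pipeline. The first two R² values are computed on the same data the models were fit on, so they measure fit rather than generalization.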

In [None]:
# Simple linear regression using sqft_living as the only predictor
lm_simple = LinearRegression()

X_simple = df[['sqft_living']]
y = df['price']

lm_simple.fit(X_simple, y)
r2_simple = lm_simple.score(X_simple, y)
print(f"R^2 (simple model with sqft_living): {r2_simple:.3f}")
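
For reference, the value returned by `score` is the coefficient of determination, R² = 1 − SS_res / SS_tot. The sketch below recomputes it by hand on synthetic data (the numbers are made up and only stand in for `sqft_living` and `price`) to show that the two agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for (sqft_living, price); not the real dataset
x_demo = rng.uniform(500, 4000, size=200).reshape(-1, 1)
y_demo = 280.0 * x_demo.ravel() + rng.normal(0, 50_000, size=200)

model = LinearRegression().fit(x_demo, y_demo)
y_hat = model.predict(x_demo)

ss_res = np.sum((y_demo - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(f"manual R^2: {r2_manual:.6f}")
print(f"score()   : {model.score(x_demo, y_demo):.6f}")
```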


In [None]:
# Multivariate linear regression with a richer set of features
features = [
    "floors",
    "waterfront",
    "lat",
    "bedrooms",
    "sqft_basement",
    "view",
    "bathrooms",
    "sqft_living15",
    "sqft_above",
    "grade",
    "sqft_living",
]

# Keep only features that actually exist in the dataframe
features = [f for f in features if f in df.columns]

X = df[features]
y = df['price']

lm_multi = LinearRegression()
lm_multi.fit(X, y)

r2_multi = lm_multi.score(X, y)
print("Features used:", features)
print(f"R^2 (multivariate model): {r2_multi:.3f}")


In [None]:
# Pipeline: scaling + polynomial features + linear regression
poly_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])

cv_scores = cross_val_score(poly_pipeline, X, y, cv=5, scoring="r2")
print("Cross-validated R^2 (polynomial pipeline):")
print("Scores:", cv_scores)
print("Mean R^2:", cv_scores.mean())
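
It helps to know how much the polynomial step widens the feature matrix. The sketch below counts the output columns, assuming all 11 listed features are present in the dataframe; with `degree=2` and `include_bias=False` the expansion keeps the original columns and adds every square and pairwise product.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 11  # assumption: none of the listed features was filtered out

poly_demo = PolynomialFeatures(degree=2, include_bias=False)
poly_demo.fit(np.zeros((1, n_features)))  # only the input shape matters here

n_expanded = poly_demo.n_output_features_
# 11 linear terms + 11 squares + 55 pairwise products
n_expected = n_features + n_features * (n_features + 1) // 2

print("Expanded feature count:", n_expanded)
print("Expected from formula: ", n_expected)
```

With roughly seven times as many columns as the plain linear model, the polynomial model has far more room to overfit, which is why it is scored with cross-validation here.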


## 7. Model evaluation and refinement

We evaluate the model using a train/test split and experiment with a Ridge regression model, with and without polynomial features.

In [None]:
# Train/test split
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=1
)

print("Number of training samples:", x_train.shape[0])
print("Number of test samples:", x_test.shape[0])


In [None]:
# Ridge regression on the original feature space
ridge = Ridge(alpha=0.1)
ridge.fit(x_train, y_train)

y_test_pred = ridge.predict(x_test)

r2_ridge = r2_score(y_test, y_test_pred)
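# Take the square root explicitly: the `squared` argument was deprecated
# and later removed from mean_squared_error in recent scikit-learn releases
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_test_pred))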

print(f"R^2 (Ridge, test set): {r2_ridge:.3f}")
print(f"RMSE (Ridge, test set): {rmse_ridge:,.0f}")


In [None]:
# Ridge regression with polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)

x_train_pr = poly.fit_transform(x_train)
x_test_pr = poly.transform(x_test)

ridge_poly = Ridge(alpha=0.1)
ridge_poly.fit(x_train_pr, y_train)

y_test_pred_poly = ridge_poly.predict(x_test_pr)

r2_ridge_poly = r2_score(y_test, y_test_pred_poly)
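# Take the square root explicitly (the `squared` argument is no longer
# available in recent scikit-learn releases)
rmse_ridge_poly = np.sqrt(mean_squared_error(y_test, y_test_pred_poly))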

print(f"R^2 (Ridge + polynomial features, test set): {r2_ridge_poly:.3f}")
print(f"RMSE (Ridge + polynomial features, test set): {rmse_ridge_poly:,.0f}")
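
The Ridge models above use a fixed `alpha=0.1` and unscaled inputs. As a possible refinement, the sketch below wraps scaling, polynomial expansion, and Ridge in one pipeline and compares a few `alpha` values with cross-validation. It runs on synthetic data from `make_regression`, and the candidate grid is a hypothetical choice, so the scores say nothing about the house data; in the notebook the same loop could be run on `x_train` and `y_train`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data standing in for the house features; not the real dataset
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

alphas = [0.01, 0.1, 1.0, 10.0]  # hypothetical candidate grid
mean_scores = {}
for alpha in alphas:
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
        ("model", Ridge(alpha=alpha)),
    ])
    # The whole pipeline is refit inside each fold, so the scaler never sees validation rows
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
    mean_scores[alpha] = scores.mean()

best_alpha = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("Best alpha by mean CV R^2:", best_alpha)
```

Scaling before the Ridge step matters because the penalty is applied to coefficient magnitudes, which depend on the units of each feature.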


## 8. Conclusions

- We explored key drivers of house prices in King County, such as square footage, grade, and waterfront location.
- A simple linear regression on `sqft_living` provides a baseline model.
- A multivariate model with several features improves the in-sample R² over the single-feature baseline.
- Degree-2 polynomial features, evaluated with a scaled pipeline under cross-validation and with Ridge on a held-out test set, can further improve performance, with the Ridge penalty helping to control overfitting.

This notebook is designed to be easy to run for visitors to the GitHub repository: it can download the dataset automatically if it is not already present locally.