## Predicting home prices
Real estate websites such as Zillow and Realtor use models to estimate the prices of homes across the United States. Home buyers use these sites to plan upcoming purchases, while homeowners use them to decide whether to list their home for sale. The home prices dataset contains a random sample of 76 recent home sales near Seattle, Washington. This notebook uses six of the dataset's features for each home: sales price ($1000s), square footage (1000s), number of bedrooms, number of bathrooms, year of construction, and garage size.

A data scientist plans to compare two regression models for predicting home prices.

1. Simple linear regression with square footage as a single input feature.
2. Multiple linear regression with square footage, year built, number of bedrooms, number of bathrooms, and garage size.

No interaction terms or polynomial terms will be considered. A training set will be used to fit the two regression models, and a test set will be used to evaluate the models' predictions.

In [1]:
# Import packages and functions
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression

# Import data
homes = pd.read_csv('homes.csv').dropna()

In [2]:
homes.sample(10)

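| Unnamed: 0 | ID | Price | Floor | Lot | Bath | Bed | BathBed | Year | Age | AgeSq | Gar | Status | DAc | School | DEd | DHa | DAd | DCr | DPa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 9 | 269.9 | 1.922 | 4 | 2.1 | 4 | 8.4 | 1965 | -0.5 | 0.25 | 2 | Active | 1 | Parker | 0 | 0 | 0 | 0 | 1 |
| 38 | 39 | 319.9 | 1.92 | 7 | 2.1 | 3 | 6.3 | 2004 | 3.4 | 11.56 | 2 | Active | 1 | Parker | 0 | 0 | 0 | 0 | 1 |
| 66 | 67 | 270.0 | 2.053 | 3 | 2.1 | 3 | 6.3 | 1977 | 0.7 | 0.49 | 2 | Active | 1 | Redwood | 0 | 0 | 0 | 0 | 0 |
| 9 | 10 | 238.8 | 1.92 | 5 | 2.1 | 3 | 6.3 | 1968 | -0.2 | 0.04 | 2 | Sold | 0 | Parker | 0 | 0 | 0 | 0 | 1 |
| 16 | 17 | 283.0 | 1.98 | 4 | 3.0 | 4 | 12.0 | 1971 | 0.1 | 0.01 | 2 | Active | 1 | Redwood | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 155.5 | 1.8 | 1 | 2.0 | 4 | 8.0 | 1994 | 2.4 | 5.76 | 1 | Sold | 0 | Adams | 0 | 0 | 1 | 0 | 0 |
| 64 | 65 | 236.5 | 1.95 | 4 | 3.0 | 4 | 12.0 | 1966 | -0.4 | 0.16 | 2 | Pending | 0 | Redwood | 0 | 0 | 0 | 0 | 0 |
| 22 | 23 | 232.0 | 2.031 | 4 | 2.0 | 4 | 8.0 | 1950 | -2.0 | 4.0 | 0 | Sold | 0 | Parker | 0 | 0 | 0 | 0 | 1 |
| 56 | 57 | 259.9 | 1.683 | 5 | 2.1 | 3 | 6.3 | 1979 | 0.9 | 0.81 | 1 | Active | 1 | Harris | 0 | 1 | 0 | 0 | 0 |
| 74 | 75 | 274.9 | 1.861 | 4 | 2.0 | 4 | 8.0 | 1995 | 2.5 | 6.25 | 2 | Active | 1 | Parker | 0 | 0 | 0 | 0 | 1 |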


### Simple linear regression
The first potential model is a simple linear regression model with square footage as a single input feature. Square footage is hypothesized to be the most important feature: the bigger the house, the higher the sales price.

In [3]:
# Set seed and test proportion
seed = 123
test_p = 0.20

In [4]:
# Linear regression model

# Define input and output features
X = homes[['Floor']]
y = homes[['Price']]

# Create training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_p, random_state = seed)

linearModel = LinearRegression()
linearModel = linearModel.fit(X_train, y_train)

# Metrics for linear regression
y_pred = linearModel.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)
print('MSE =', MSE)

MAE = mean_absolute_error(y_test, y_pred)
print('MAE =', MAE)

R_squared = r2_score(y_test, y_pred)
print('R-squared =', R_squared)

MSE = 4377.548634520407
MAE = 53.02646796077678
R-squared = 0.07881601353080447



### Multiple linear regression
The simple linear regression model with a single input has a low coefficient of determination (R-squared of about 0.08 on the test set): the model explains little of the variation in home prices. Adding more input features may improve the model's performance by increasing the coefficient of determination and decreasing the error metrics (MSE, MAE).
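To make the three metrics concrete, the sketch below computes MSE, MAE, and the coefficient of determination by hand with NumPy. It is only an illustration: the helper name `regression_metrics` and the toy price values are assumptions, not part of the homes dataset or of scikit-learn.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R-squared) for two equal-length sequences.

    Hypothetical helper for illustration; the notebook itself uses
    the functions from sklearn.metrics.
    """
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)       # mean squared error
    mae = np.mean(np.abs(residuals))    # mean absolute error

    # R-squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2

# Made-up prices in $1000s, chosen only to keep the arithmetic easy to follow
y_true_toy = [200.0, 250.0, 300.0, 350.0]
y_pred_toy = [210.0, 240.0, 310.0, 330.0]

mse, mae, r2 = regression_metrics(y_true_toy, y_pred_toy)
print('MSE =', mse)
print('MAE =', mae)
print('R-squared =', r2)
```

A model that always predicts the mean of `y_true` has an R-squared of 0 on that data, so a test-set value close to 0 means the model does little better than guessing the average price.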

In [5]:
# Set seed and test proportion
seed = 123
test_p = 0.20

In [6]:
# Multiple regression model

# Define input and output features
# Input features: square footage, year built, number of bedrooms,
# number of bathrooms, and garage size
X = homes[['Floor', 'Year', 'Bed', 'Bath', 'Gar']]
# Output feature: sales price
y = homes[['Price']]

# Create training and testing data
# Split the data using the seed and test proportion set in the previous cell
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_p, random_state = seed)

linearModel = LinearRegression()
linearModel = linearModel.fit(X_train, y_train)

# Metrics for linear regression
y_pred = linearModel.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)
print('MSE =', MSE)

MAE = mean_absolute_error(y_test, y_pred)
print('MAE =', MAE)

R_squared = r2_score(y_test, y_pred)
print('R-squared =', R_squared)

MSE = 4033.1910508021892
MAE = 51.5565000296788
R-squared = 0.15128047211818496
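Because the sales price is recorded in $1000s, the MSE values are in squared thousands of dollars, which is hard to interpret directly. One common way to compare the two models on the original scale is the root mean squared error (RMSE). The sketch below is an addition to the analysis, not part of the original notebook; it takes the square root of the two test-set MSE values printed above, and the variable names are assumptions.

```python
import math

# Test-set MSE values copied from the outputs of the two models above
mse_simple = 4377.548634520407     # square footage only
mse_multiple = 4033.1910508021892  # five input features

# RMSE is the square root of MSE, so it is in the same units as Price ($1000s)
rmse_simple = math.sqrt(mse_simple)
rmse_multiple = math.sqrt(mse_multiple)

print(f'RMSE (simple)   = {rmse_simple:.1f} thousand dollars')
print(f'RMSE (multiple) = {rmse_multiple:.1f} thousand dollars')
print(f'Reduction       = {rmse_simple - rmse_multiple:.1f} thousand dollars')
```

The multiple regression model improves every metric, but only modestly, and these numbers come from a single 80/20 split of a small sample, so the size of the improvement may change with a different seed.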

