<a href="https://colab.research.google.com/github/Data-Analytics-with-Python/predicting-house-prices-Kaufmann11/blob/main/House_price_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let us work with the dataset stored in [**house_prices.csv**](https://raw.githubusercontent.com/zhouy185/BUS_O712/refs/heads/main/Data/house_prices.csv) (click to download the file). The dataset lists the features of each house along with the price at which it was sold in the current year (2024).
It includes the following variables:
* **Size (sq ft)**: The total area of the house, in square feet.
* **Number of Rooms**: The total number of bedrooms in the house.
* **Neighborhood**: The type of neighborhood the house is in.
* **Year Built**: The year in which the house was built.
* **Price**: The price at which the house was sold.

In this exercise, we will use a linear regression model to predict the sale price.

First, load the data and replace 'Year Built' with the age of the house (as of 2025).

In [4]:
import pandas as pd

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/zhouy185/BUS_O712/refs/heads/main/Data/house_prices.csv')

# Calculate the age of the house as of 2025
df['Age'] = 2025 - df['Year Built']

# Drop the original 'Year Built' column, since 'Age' now carries the same information
df = df.drop(columns=['Year Built'])

# Display the first few rows with the new 'Age' column
display(df.head())

# Define features (X) and target (y)
X = df.drop('Price', axis=1)
y = df['Price']

# Display the first few rows of X and y to verify
display(X.head())
display(y.head())

,Size (sq ft),Number of Rooms,Neighborhood,Price,Age
0,3532,4,Suburb,1195126.0,49
1,3407,5,Downtown,1412375.0,15
2,2453,5,Countryside,797476.0,57
3,1635,3,Downtown,523051.0,39
4,1563,2,Suburb,532291.0,55


,Size (sq ft),Number of Rooms,Neighborhood,Age
0,3532,4,Suburb,49
1,3407,5,Downtown,15
2,2453,5,Countryside,57
3,1635,3,Downtown,39
4,1563,2,Suburb,55


,Price
0,1195126.0
1,1412375.0
2,797476.0
3,523051.0
4,532291.0


Then, visualize the correlation between columns.
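
One way to do this is sketched below. It is a minimal example, not the notebook's own solution: `df_demo` is a hypothetical stand-in built from the five rows displayed above, and the heatmap is drawn with matplotlib, which is an assumption since the notebook does not name a plotting library. In the notebook itself, `df` would take the place of `df_demo`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for df: the five rows displayed earlier in this notebook.
df_demo = pd.DataFrame({
    "Size (sq ft)": [3532, 3407, 2453, 1635, 1563],
    "Number of Rooms": [4, 5, 5, 3, 2],
    "Neighborhood": ["Suburb", "Downtown", "Countryside", "Downtown", "Suburb"],
    "Price": [1195126.0, 1412375.0, 797476.0, 523051.0, 532291.0],
    "Age": [49, 15, 57, 39, 55],
})

# Pearson correlation is only defined for numeric columns, so 'Neighborhood' is left out.
corr = df_demo.select_dtypes(include="number").corr()

# Draw the matrix as a heatmap and print the coefficient inside each cell.
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_title("Correlation between numeric columns")
fig.tight_layout()
```

In a notebook the figure is rendered when the cell finishes. The row and column for 'Price' are the ones to read first, since they show how strongly each numeric feature moves with the target.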

Next, split the data into training and test sets (80% / 20%).

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
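
To make the effect of `test_size=0.2` and `random_state=42` concrete, here is a small sketch on hypothetical toy data (`X_demo` and `y_demo` are invented for illustration and are not part of the dataset). It checks the size of each part, that features and target stay aligned, and that the split is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten rows, so a 20% test share is exactly two rows.
X_demo = pd.DataFrame({"Size (sq ft)": range(1000, 2000, 100)})
y_demo = pd.Series(range(10), name="Price")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Eight rows go to training and two to testing.
print(X_tr.shape, X_te.shape)

# Features and target are shuffled together, so their row indices still match.
print(X_tr.index.equals(y_tr.index))

# Using the same random_state again reproduces the same split.
X_tr2, _, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.index.equals(X_tr2.index))
```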

Next, combine preprocessing (scaling the numerical columns and one-hot encoding the categorical column), the linear regression model, and model fitting into a single pipeline.

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Identify categorical and numerical features
categorical_features = ['Neighborhood']
numerical_features = ['Size (sq ft)', 'Number of Rooms', 'Age']

# Create a column transformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough' # Keep other columns if any
)

# Create the Linear Regression model
model = LinearRegression()

# Create the pipeline
pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("linear_reg", model)
    ]
)

# Fit the pipeline
pipe.fit(X_train, y_train)
print("Pipeline fitted successfully!")

Pipeline fitted successfully!


Finally, use the fitted pipeline to make predictions.

In [12]:
# Create a DataFrame for new house predictions, ensuring column names match X_train
# Column order: Size (sq ft), Number of Rooms, Neighborhood, Age
new_house = pd.DataFrame(
    [
        [3000, 2, "Downtown", 50],
        [2000, 3, "Suburb", 30]
    ],
    columns=X_train.columns
)

# Make predictions using the fitted pipeline
predictions = pipe.predict(new_house)

print("Predictions for new houses:")
for i, price in enumerate(predictions):
    print(f"House {i+1}: ${price:,.2f}")

Predictions for new houses:
House 1: $944,280.76
House 2: $719,408.93
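
The held-out test set from the split can also be used to gauge how far such predictions are likely to be off. The sketch below is one possible check, not part of the original exercise: it uses hypothetical synthetic data (`X_demo`, `y_demo`, with an invented price formula) and reports R² and mean absolute error. In the notebook, the equivalent calls would be `pipe.predict(X_test)` compared against `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data: same column names as X, invented price formula.
rng = np.random.default_rng(1)
n = 200
X_demo = pd.DataFrame({
    "Size (sq ft)": rng.integers(800, 4000, n),
    "Number of Rooms": rng.integers(1, 6, n),
    "Neighborhood": rng.choice(["Suburb", "Downtown", "Countryside"], n),
    "Age": rng.integers(1, 80, n),
})
y_demo = (
    250 * X_demo["Size (sq ft)"]
    + 20_000 * X_demo["Number of Rooms"]
    - 1_000 * X_demo["Age"]
    + rng.normal(0, 20_000, n)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_pipe = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Size (sq ft)", "Number of Rooms", "Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
    ])),
    ("linear_reg", LinearRegression()),
])
demo_pipe.fit(X_tr, y_tr)

# Score only on rows the model never saw during fitting.
y_pred = demo_pipe.predict(X_te)
print(f"R^2 on test set: {r2_score(y_te, y_pred):.3f}")
print(f"MAE on test set: ${mean_absolute_error(y_te, y_pred):,.2f}")
```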
