# House Price Prediction

Finding the right house involves research, negotiation, and a fair amount of guesswork about what a property is worth. Linear regression reduces that guesswork by modeling the price as a weighted sum of a house's features. In this tutorial, we prepare a house price dataset, fit a linear regression model with a hand-written gradient update, and evaluate its predictions with the mean absolute error. Let's get started!

## Setup

The House Price Prediction dataset contains 13 columns: a record `Id`, 11 features, and the target `SalePrice`.

| #  | Column Name    | Description                                                              |
|----|----------------|--------------------------------------------------------------------------|
| 1  | Id             | Record identifier.                                                       |
| 2  | MSSubClass     | Identifies the type of dwelling involved in the sale.                    |
| 3  | MSZoning       | Identifies the general zoning classification of the sale.                |
| 4  | LotArea        | Lot size in square feet.                                                 |
| 5  | LotConfig      | Configuration of the lot.                                                |
| 6  | BldgType       | Type of dwelling.                                                        |
| 7  | OverallCond    | Rates the overall condition of the house.                                |
| 8  | YearBuilt      | Original construction year.                                              |
| 9  | YearRemodAdd   | Remodel date (same as construction date if no remodeling or additions).  |
| 10 | Exterior1st    | Exterior covering on the house.                                          |
| 11 | BsmtFinSF2     | Type 2 finished square feet.                                             |
| 12 | TotalBsmtSF    | Total square feet of basement area.                                      |
| 13 | SalePrice      | Sale price; the target to be predicted.                                  |


Run the cell below to download the dataset.

In [1]:
!mkdir -p datasets
!wget -O datasets/tutorial03-2.csv "https://docs.google.com/spreadsheets/d/1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs/export?format=csv&id=1caaR9pT24GNmq3rDQpMiIMJrmiTGarbs&gid=1150341366"


### Importing libraries

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt

### Create DataFrame

In [3]:
file_path = 'datasets/tutorial03-2.csv'
df = pd.read_csv(file_path)

## Preprocessing

In [4]:
df.drop(['Id'], axis=1, inplace=True)
# Fill missing target values with the column mean; assigning the result back
# avoids the chained-assignment warning raised by fillna(..., inplace=True)
df['SalePrice'] = df['SalePrice'].fillna(df['SalePrice'].mean())
df_copy = df.copy()


In [5]:
new_data = df_copy.dropna()
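
The two cleaning steps above can be tried on a small frame first. The sketch below uses a made-up three-column DataFrame (the name `toy` and all of its values are assumptions, not rows from the dataset) to show mean imputation of the target by assignment, followed by `dropna()` removing the rows that still have gaps in other columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from the tutorial's CSV
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "MSZoning": ["RL", "RL", "RM", None],
    "SalePrice": [200000.0, np.nan, 100000.0, 150000.0],
})

# Assign the result back instead of calling fillna(..., inplace=True) on a column
toy["SalePrice"] = toy["SalePrice"].fillna(toy["SalePrice"].mean())

# Rows that still contain a missing value in any other column are dropped
toy_clean = toy.dropna()

print(toy["SalePrice"].tolist())
print(toy_clean.index.tolist())
```

Only the first two rows survive here, because the third row lacks `LotArea` and the fourth lacks `MSZoning`.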

In [6]:
s = (new_data.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
print('No. of categorical features:', len(object_cols))

Categorical variables:
['MSZoning', 'LotConfig', 'BldgType', 'Exterior1st']
No. of categorical features: 4
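
An alternative way to pick out the categorical columns is `DataFrame.select_dtypes`. The sketch below runs it on a made-up frame (the name `toy` and its values are assumptions) and selects every non-numeric column, so it does not depend on how pandas stores strings internally.

```python
import pandas as pd

# Hypothetical toy frame mixing numeric and string columns
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "MSZoning": ["RL", "RM"],
    "BldgType": ["1Fam", "Duplex"],
    "SalePrice": [208500.0, 181500.0],
})

# Keep every column that is not numeric; these are the candidates for encoding
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
```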


### One Hot Encoding

In [7]:
OH_encoder = OneHotEncoder(sparse_output=False)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(new_data[object_cols]))
OH_cols.index = new_data.index
OH_cols.columns = OH_encoder.get_feature_names_out()
df_final = new_data.drop(object_cols, axis=1)
df_final = pd.concat([df_final, OH_cols], axis=1)
df_final.head()

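|   | MSSubClass | LotArea | OverallCond | YearBuilt | YearRemodAdd | BsmtFinSF2 | TotalBsmtSF | SalePrice | MSZoning_C (all) | MSZoning_FV | ... | Exterior1st_CemntBd | Exterior1st_HdBoard | Exterior1st_ImStucc | Exterior1st_MetalSd | Exterior1st_Plywood | Exterior1st_Stone | Exterior1st_Stucco | Exterior1st_VinylSd | Exterior1st_Wd Sdng | Exterior1st_WdShing |
|---|------------|---------|-------------|-----------|--------------|------------|-------------|-----------|------------------|-------------|-----|---------------------|---------------------|---------------------|---------------------|---------------------|-------------------|--------------------|---------------------|---------------------|---------------------|
| 0 | 60 | 8450 | 5 | 2003 | 2003 | 0.0 | 856.0 | 208500.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 20 | 9600 | 8 | 1976 | 1976 | 0.0 | 1262.0 | 181500.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 60 | 11250 | 5 | 2001 | 2002 | 0.0 | 920.0 | 223500.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 70 | 9550 | 5 | 1915 | 1970 | 0.0 | 756.0 | 140000.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 60 | 14260 | 5 | 2000 | 2000 | 0.0 | 1145.0 | 250000.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |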
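
To see what the encoder produces for a single column, the transformation can be written by hand. This is a minimal NumPy sketch on made-up zoning values, not the scikit-learn implementation; the variable names are assumptions. It builds one indicator column per category and names the columns in the `<column>_<category>` pattern seen in the table above.

```python
import numpy as np

# Hypothetical values for a single categorical column
zoning = np.array(["RL", "RM", "RL", "FV"])

# Sorted unique categories and, for each row, the position of its category
categories, codes = np.unique(zoning, return_inverse=True)

# Row i of the identity matrix is the indicator vector for category i
one_hot = np.eye(len(categories))[codes]

# Column names in the "<column>_<category>" pattern
names = [f"MSZoning_{c}" for c in categories]

print(names)
print(one_hot)
```

Each row contains exactly one 1.0, so the indicator columns of a feature always sum to one.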


### Scaling

In [8]:
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df_final), columns=df_final.columns)
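
Min-max scaling maps each column to the interval [0, 1] with x' = (x - min) / (max - min), and the inverse x = x' * (max - min) + min recovers the original units. The sketch below applies both directions to three made-up prices with plain NumPy; the array and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical sale prices
prices = np.array([34900.0, 208500.0, 755000.0])

p_min, p_max = prices.min(), prices.max()

# Forward transform: x' = (x - min) / (max - min), mapped into [0, 1]
scaled = (prices - p_min) / (p_max - p_min)

# Inverse transform: x = x' * (max - min) + min
restored = scaled * (p_max - p_min) + p_min

print(scaled)
print(restored)
```

The same inverse is what the notebook later applies to the test targets and predictions to report errors in price units.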

## Data splitting

In [9]:
X, y = df_normalized.loc[:, df_normalized.columns != 'SalePrice'], df_normalized[['SalePrice']]

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((2330, 37), (583, 37), (2330, 1), (583, 1))
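
The shapes above follow from holding out 20% of 2913 rows. The sketch below reproduces the row counts, assuming the test share is rounded up (which matches the printed shapes), and shows a shuffled split with NumPy. It is an illustration of the idea only; it will not select the same rows as `train_test_split`.

```python
import math
import numpy as np

n_samples = 2330 + 583   # total rows implied by the shapes above
test_size = 0.2

# Rounding the test share up reproduces the 583 test rows reported above
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Shuffle row positions with a fixed seed, then cut the permutation in two
rng = np.random.default_rng(42)
perm = rng.permutation(n_samples)
test_idx, train_idx = perm[:n_test], perm[n_test:]

print(n_train, n_test)
```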

In [12]:
label_max, label_min = df["SalePrice"].max(), df["SalePrice"].min()
label_max,label_min

(755000.0, 34900.0)

In [13]:
y_reg_test = y_test * (label_max - label_min) + label_min
feature_names=X_train.columns

## Model Building

In [14]:
def init_params(n_features):
    w = np.zeros((n_features, 1))
    b = 0
    return w,b

def log_likelihood(X, y, w, b):
    # Gaussian log-likelihood of the data, assuming variance = 1
    N = len(y)
    y_pred = np.dot(X, w) + b
    residual = y - y_pred
    # The local name differs from the function name to avoid shadowing it
    ll = -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum(residual ** 2)
    return ll

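def update(X, y, w, b, n_iterations=400, learning_rate=0.001):
    # Gradient ascent on the log-likelihood, assuming variance = 1
    N = len(y)
    for i in range(n_iterations):
        y_pred = np.dot(X, w) + b

        # Gradient of the log-likelihood with respect to w and b, averaged over N
        dw = np.dot(X.T, (y - y_pred)) / N
        db = np.sum(y - y_pred) / N

        # Step along the gradient to increase the likelihood; subtracting
        # these terms would move the parameters away from the optimum
        w = w + learning_rate * dw
        b = b + learning_rate * db

    return w, b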

def train_model(X, y):

    w,b=init_params(X.shape[1])

    w,b=update(X, y,w,b)

    return w,b

def predict(X,w,b):
    return np.dot(X, w) + b
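
For reference, the quantity that `log_likelihood` computes and the gradients that `update` averages can be written out. This is a sketch under the unit-variance assumption stated in the code comments; the symbols $\eta$ (learning rate), $x_i$ (row $i$ of $X$), and $\hat{y}_i$ (prediction for row $i$) are notation introduced here.

```latex
\begin{aligned}
\ell(w, b) &= -\frac{N}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2,
\qquad \hat{y}_i = x_i^{\top} w + b \\
\nabla_w \ell &= X^{\top}(y - \hat{y}), \qquad
\frac{\partial \ell}{\partial b} = \sum_{i=1}^{N}\left(y_i - \hat{y}_i\right) \\
w &\leftarrow w + \frac{\eta}{N}\, X^{\top}(y - \hat{y}), \qquad
b \leftarrow b + \frac{\eta}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)
\end{aligned}
```

Because $\ell$ is maximized, each step adds the scaled gradient. This is equivalent to gradient descent on half the mean squared error, whose gradient carries the opposite sign.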

In [15]:
W, b = train_model(X_train.values,y_train.values)
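
One way to check a gradient-based fit is to compare it with the closed-form least-squares solution on data where the answer is known. The sketch below does that on synthetic data with NumPy only. The data, the learning rate, and the iteration count are assumptions chosen so the toy problem converges quickly; they are not the tutorial's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known linear model (all values are made up)
X = rng.uniform(0.0, 1.0, size=(200, 3))
true_w = np.array([[0.5], [-0.2], [0.3]])
true_b = 0.1
y = X @ true_w + true_b + 0.01 * rng.standard_normal((200, 1))

# Gradient ascent on the unit-variance log-likelihood
w = np.zeros((3, 1))
b = 0.0
learning_rate = 0.5  # larger than the tutorial's value so the toy run converges
for _ in range(5000):
    residual = y - (X @ w + b)
    w = w + learning_rate * (X.T @ residual) / len(y)
    b = b + learning_rate * residual.sum() / len(y)

# Closed-form least-squares fit, with a column of ones for the intercept
A = np.hstack([X, np.ones((len(y), 1))])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.hstack([w.ravel(), b]))
print(solution.ravel())
```

If the two printed vectors agree, the update rule is climbing toward the same optimum that least squares finds directly.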

In [16]:
y_pred = predict(X_test,W,b)

In [17]:
y_pred=y_pred * (label_max - label_min) + label_min
mae=mean_absolute_error(y_reg_test,y_pred)
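
The mean absolute error is the average of |y - ŷ|, and because the inverse min-max transform is linear, an error on the [0, 1] scale converts to price units by multiplying with the price range. The sketch below checks this with NumPy on made-up normalized values; only `label_max` and `label_min` come from the notebook output, the rest is assumed for illustration.

```python
import numpy as np

label_max, label_min = 755000.0, 34900.0  # values reported earlier in the notebook
price_range = label_max - label_min

# Hypothetical targets and predictions on the normalized [0, 1] scale
y_true_scaled = np.array([0.25, 0.20, 0.30, 0.15])
y_hat_scaled = np.array([0.24, 0.22, 0.29, 0.18])

# Map both back to prices with the inverse min-max transform
y_true = y_true_scaled * price_range + label_min
y_hat = y_hat_scaled * price_range + label_min

# MAE: mean of the absolute differences, on each scale
mae_scaled = np.mean(np.abs(y_true_scaled - y_hat_scaled))
mae_price = np.mean(np.abs(y_true - y_hat))

print(mae_scaled, mae_price)
```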

In [18]:
mae

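858342.4210579091

This recorded MAE is larger than the whole price range (755000 - 34900 = 720100), so the run that produced it moved the parameters away from the optimum instead of toward it. An update that steps along the log-likelihood gradient should keep predictions near the observed prices and give an MAE inside that range.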