# House Price Prediction with Machine Learning

You are working on a project to predict the price of houses based on various attributes. The dataset 
contains the following columns: House Age, Number of Bedrooms, Number of Bathrooms, Area (in sq ft), 
Location, and Price. The project involves the entire lifecycle of a machine learning model, including data 
collection, preprocessing, model training, evaluation, and deployment.

# Dataset Description

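House Age: Integer, described as the age of the house in years. In the loaded data the `HouseAge` column holds the construction year (values range from 1872 to 2010) rather than an age.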

Number of Bedrooms: Integer, the number of bedrooms in the house.

Number of Bathrooms: Integer, the number of bathrooms in the house.

Area (in sq ft): Integer, the area of the house in square feet.

Location: Categorical, the location of the house (Urban, SubUrban, Rural).

Price: Integer, the price of the house in dollars.

# Tasks and Their Relevant Code


a) Data Preprocessing

Explain how you would encode the categorical column (Location).

Describe the process of feature scaling and why it is important for this project.


In [43]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import pickle


In [44]:
df = pd.read_csv('House_price.csv')
df

Unnamed: 0,HouseAge,Bedroom,FullBath,LotArea,Location,SalePrice
0,2003,3,2,8450,Urban,208500
1,1976,3,2,9600,SubUrban,181500
2,2001,3,2,11250,Rural,223500
3,1915,3,1,9550,Urban,140000
4,2000,4,2,14260,SubUrban,250000
...,...,...,...,...,...,...
1455,1999,3,2,7917,Urban,175000
1456,1978,3,2,13175,SubUrban,210000
1457,1941,4,2,9042,Urban,266500
1458,1950,2,1,9717,SubUrban,142125


In [45]:
df.Location.value_counts()

Location
Urban       754
SubUrban    573
Rural       133
Name: count, dtype: int64

In [46]:
cat_cols = []
for c in df.columns:
    if df[c].dtype == 'object':
        cat_cols.append(c)

In [47]:
cat_cols

['Location']

In [48]:
df1 = df.copy()

In [49]:
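from sklearn.preprocessing import OneHotEncoder

# One-hot encode Location. drop='first' removes the first category (Rural),
# which becomes the all-zeros baseline and avoids perfectly collinear dummies.
ohe = OneHotEncoder(drop='first')
ohe_transformed = ohe.fit_transform(df1[cat_cols])
encoded_df = pd.DataFrame(ohe_transformed.toarray(),
                          columns=ohe.get_feature_names_out(),
                          index=df1.index)

# Append the encoded columns and remove the original categorical column
df1 = pd.concat([df1, encoded_df], axis=1)
df1 = df1.drop(columns=cat_cols)
df1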


Unnamed: 0,HouseAge,Bedroom,FullBath,LotArea,SalePrice,Location_SubUrban,Location_Urban
0,2003,3,2,8450,208500,0.0,1.0
1,1976,3,2,9600,181500,1.0,0.0
2,2001,3,2,11250,223500,0.0,0.0
3,1915,3,1,9550,140000,0.0,1.0
4,2000,4,2,14260,250000,1.0,0.0
...,...,...,...,...,...,...,...
1455,1999,3,2,7917,175000,0.0,1.0
1456,1978,3,2,13175,210000,1.0,0.0
1457,1941,4,2,9042,266500,0.0,1.0
1458,1950,2,1,9717,142125,1.0,0.0


In [50]:
# Save the fitted encoder so the same encoding can be applied at prediction time
with open('encoder.pkl', 'wb') as f:
    pickle.dump(ohe, f)

In [51]:
x = df1.drop('SalePrice',
           axis = 1)
y = df1.SalePrice

Feature Scaling
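Feature scaling matters most for algorithms that compute distances between data points, such as K-Nearest Neighbors (KNN), and for models trained with gradient descent, such as neural networks or SGD-based linear models. In those cases scaling keeps features with large numeric ranges (here LotArea, which reaches 215,245) from dominating features with small ranges (Bedroom, FullBath), and it speeds up convergence. The models trained later in this notebook are less sensitive: scikit-learn's LinearRegression is solved by least squares rather than gradient descent, and tree-based ensembles split on thresholds, so their predictions are largely unaffected by rescaling. Scaling does no harm to them and keeps the preprocessing usable if a scale-sensitive model is tried later.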

The two common methods of feature scaling are:

- Standardization (Z-score normalization): Rescales each feature to have a mean of 0 and a standard deviation of 1.
- Normalization (Min-Max scaling): Rescales each feature to a fixed range, usually 0 to 1.

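As a worked example of both formulas, the sketch below rescales a handful of `HouseAge` values by hand and checks the result against scikit-learn. The array holds the first five values from the table above plus the column minimum (1872) and maximum (2010) reported by `df.describe()` at the end of the notebook, so the Min-Max range matches the full data. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# First five HouseAge values, followed by the column min and max
house_age = np.array([[2003.0], [1976.0], [2001.0], [1915.0], [2000.0],
                      [1872.0], [2010.0]])

# Min-Max scaling by hand: (x - min) / (max - min)
manual_minmax = (house_age - house_age.min()) / (house_age.max() - house_age.min())

# Standardization by hand: (x - mean) / std
# (population standard deviation, which is what StandardScaler uses)
manual_standard = (house_age - house_age.mean()) / house_age.std()

sk_minmax = MinMaxScaler().fit_transform(house_age)
sk_standard = StandardScaler().fit_transform(house_age)

print(manual_minmax.ravel().round(6))
print(np.allclose(manual_minmax, sk_minmax))
print(np.allclose(manual_standard, sk_standard))
```

For the first row, (2003 - 1872) / (2010 - 1872) = 131 / 138 ≈ 0.949275, which agrees with the scaled `HouseAge` value in the notebook's output.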
The code below shows how to apply Min-Max scaling using scikit-learn:

In [52]:
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)
x_scaled = pd.DataFrame(x_scaled, columns = x.columns)
x_scaled.head()

Unnamed: 0,HouseAge,Bedroom,FullBath,LotArea,Location_SubUrban,Location_Urban
0,0.949275,0.375,0.666667,0.03342,0.0,1.0
1,0.753623,0.375,0.666667,0.038795,1.0,0.0
2,0.934783,0.375,0.666667,0.046507,0.0,0.0
3,0.311594,0.375,0.333333,0.038561,0.0,1.0
4,0.927536,0.5,0.666667,0.060576,1.0,0.0


In [53]:
# Save the fitted scaler so new inputs can be scaled the same way
with open('scal.pkl', 'wb') as f:
    pickle.dump(scaler, f)
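One caveat: the scaler above is fitted on all rows before the train/test split, so the test rows influence the scaling range. A possible alternative, sketched below, is to split first and wrap the preprocessing in a scikit-learn `Pipeline` so the encoder and scaler are fitted on the training rows only. The `demo` frame is synthetic stand-in data with the same column names, and the price formula used to generate it is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for House_price.csv (all values are made up)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "HouseAge": rng.integers(1900, 2011, size=n),
    "Bedroom": rng.integers(1, 6, size=n),
    "FullBath": rng.integers(1, 4, size=n),
    "LotArea": rng.integers(2000, 20000, size=n),
    "Location": rng.choice(["Urban", "SubUrban", "Rural"], size=n),
})
demo["SalePrice"] = (50000 + 10 * demo["LotArea"] + 20000 * demo["Bedroom"]
                     + rng.normal(0, 5000, size=n))

numeric_cols = ["HouseAge", "Bedroom", "FullBath", "LotArea"]
categorical_cols = ["Location"]

# Scale the numeric columns and one-hot encode Location in a single step
preprocess = ColumnTransformer([
    ("scale", MinMaxScaler(), numeric_cols),
    ("encode", OneHotEncoder(drop="first"), categorical_cols),
])
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

X_demo = demo.drop(columns="SalePrice")
y_demo = demo["SalePrice"]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipeline.fit(X_tr, y_tr)            # encoder and scaler see only the training rows
print(pipeline.score(X_te, y_te))   # R² on the held-out rows
```

A fitted pipeline can also be pickled as a single object, which avoids keeping separate encoder, scaler, and model files in sync.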

b) Model Training
List at least three machine learning algorithms suitable for predicting house prices.
Explain the advantages and disadvantages of each algorithm in the context of this project.


Answer to b) Model Training

Suitable Machine Learning Algorithms
- 1. Linear Regression:

  - Advantages: Simple to implement and interpret, fast to train, and works well when the relationship between the features and the price is approximately linear.
  - Disadvantages: Assumes a linear relationship between features and target; sensitive to outliers.

- 2. Random Forest Regressor:

  - Advantages: Captures non-linear relationships and feature interactions, is fairly robust to outliers, and reduces overfitting by averaging many trees.
  - Disadvantages: More computationally expensive and less interpretable than linear models.

- 3. Gradient Boosting Regressor:

  - Advantages: Often achieves high predictive accuracy by fitting each new tree to the errors of the previous ones; handles mixed feature types well.
  - Disadvantages: Computationally intensive, sensitive to hyperparameters, and can overfit if the number of trees or the learning rate is set too high.
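The linearity assumption listed for Linear Regression can be illustrated with a small sketch. The data below is synthetic (a single made-up feature with a quadratic effect on the target) and is not drawn from the house price dataset; it only shows how a linear model and a random forest behave when the true relationship is non-linear.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on the square of one centred feature
rng = np.random.default_rng(0)
feature = rng.uniform(-3, 3, size=(500, 1))
target = feature[:, 0] ** 2 + rng.normal(0, 0.2, size=500)

f_tr, f_te, t_tr, t_te = train_test_split(feature, target, test_size=0.2, random_state=0)

# .score() returns R² on the held-out data for both regressors
lin_r2 = LinearRegression().fit(f_tr, t_tr).score(f_te, t_te)
rf_r2 = RandomForestRegressor(n_estimators=100, random_state=0).fit(f_tr, t_tr).score(f_te, t_te)

print(f"Linear Regression R²: {lin_r2:.3f}")
print(f"Random Forest R²:     {rf_r2:.3f}")
```

On data like this a straight line cannot follow the curve, whereas the forest approximates it with many small steps. On real data the gap depends on how non-linear the relationships actually are.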

c) Model Training Process
Describe the process of training a machine learning model using the preprocessed dataset.

In [54]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor


# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=42)

# Initialize the models
linear_model = LinearRegression()
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
gradient_boosting_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the models
linear_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)
gradient_boosting_model.fit(X_train, y_train)

# Print training complete message
print("Models trained successfully!")


Models trained successfully!


In [55]:
linear_model.score(X_test,y_test)

0.46030236478022524

In [56]:
linear_model.predict(X_test)

array([143343.18812524, 226601.22071825, 107457.70087273, 172574.53802989,
       234790.81470009, 132133.97724538, 133294.09412651, 196295.309555  ,
       131310.05844739, 148989.31001791, 157450.02843237, 150110.26815057,
       129083.35713078, 232081.1944821 , 225179.07638489, 131143.96641217,
       229070.85547468, 134170.24775772, 100053.06493961, 232236.07866687,
       194289.88405973, 234146.31179336, 230580.80285938, 128792.82653193,
       226052.56770168, 214338.91465802, 233086.27496089, 145011.80159491,
       227094.36541765, 220632.9194422 ,  92194.42768183, 232173.59109017,
       241313.78454853, 132865.45434322, 239087.79664482, 146375.58040997,
       130002.7843693 , 232642.51918018, 237497.62437659, 139468.82901077,
       113781.57299375, 215227.77631492, 152946.23462661, 239932.66139559,
       143930.38872073, 128292.60904234, 139029.02209691, 145000.16394064,
       240856.37769083, 143694.54292567, 130002.7843693 , 202913.16244451,
       146091.48528134, 2

In [57]:
y_test

892     154500
1105    325000
413     115000
522     159000
1036    315500
         ...  
479      89471
1361    260000
802     189000
651     108000
722     124500
Name: SalePrice, Length: 292, dtype: int64

d) Model Evaluation
List and explain three metrics you would use to evaluate the performance of your model.

Answer to d) Model Evaluation
Three metrics for evaluating the model's performance are:

- 1. Mean Absolute Error (MAE): The average absolute difference between predicted and actual prices, ignoring the direction of the error. It is expressed in dollars, which makes it easy to interpret.

- 2. Mean Squared Error (MSE): The average of the squared errors. Squaring gives more weight to large errors, and the result is in squared dollars; its square root (RMSE) is back in dollars.

- 3. R-squared (R²): The proportion of the variance in the dependent variable (price) that is explained by the independent variables in the model.
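To show what each metric computes, the sketch below evaluates them by hand with NumPy and checks the results against scikit-learn. The inputs are the first four test-set prices and the corresponding Linear Regression predictions printed earlier, rounded to cents; four points are far too few for a meaningful score and serve only to illustrate the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# First four actual prices and Linear Regression predictions from the notebook
y_true = np.array([154500.0, 325000.0, 115000.0, 159000.0])
y_hat = np.array([143343.19, 226601.22, 107457.70, 172574.54])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))    # average absolute error, in dollars
mse = np.mean(errors ** 2)       # average squared error, in squared dollars
rmse = np.sqrt(mse)              # square root of MSE, back in dollars
# R² = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R²:", r2)
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```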

The code below shows how you can calculate these metrics in Python using scikit-learn:

In [58]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predict on the test set
y_pred_linear = linear_model.predict(X_test)
y_pred_rf = random_forest_model.predict(X_test)
y_pred_gb = gradient_boosting_model.predict(X_test)

# Calculate metrics for Linear Regression
mae_linear = mean_absolute_error(y_test, y_pred_linear)
mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)

# Calculate metrics for Random Forest
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Calculate metrics for Gradient Boosting
mae_gb = mean_absolute_error(y_test, y_pred_gb)
mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)

# Print the evaluation metrics
print("Linear Regression - MAE:", mae_linear, "MSE:", mse_linear, "R²:", r2_linear)
print("Random Forest - MAE:", mae_rf, "MSE:", mse_rf, "R²:", r2_rf)
print("Gradient Boosting - MAE:", mae_gb, "MSE:", mse_gb, "R²:", r2_gb)


Linear Regression - MAE: 41893.738681622286 MSE: 4139656915.3831563 R²: 0.46030236478022524
Random Forest - MAE: 33446.462389106324 MSE: 2343881187.7746334 R²: 0.6944222286689179
Gradient Boosting - MAE: 34845.05924960067 MSE: 2681293464.5650945 R²: 0.650432929168943


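The best-performing model is the Random Forest: it has the lowest MAE and MSE and the highest R² of the three. Linear Regression performs worst on all three metrics.

In [59]:
# Save the best-performing model (Random Forest) for deployment
with open('model.pkl', 'wb') as f:
    pickle.dump(random_forest_model, f)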

In [60]:
X_test

Unnamed: 0,HouseAge,Bedroom,FullBath,LotArea,Location_SubUrban,Location_Urban
892,0.659420,0.375,0.333333,0.033252,0.0,1.0
1105,0.884058,0.375,0.666667,0.051209,1.0,0.0
413,0.398551,0.250,0.333333,0.035804,0.0,1.0
522,0.543478,0.375,0.666667,0.017294,0.0,1.0
1036,0.978261,0.250,0.666667,0.054210,0.0,1.0
...,...,...,...,...,...,...
479,0.471014,0.250,0.333333,0.021618,0.0,1.0
1361,0.963768,0.375,0.666667,0.069448,1.0,0.0
802,0.963768,0.375,0.666667,0.032247,0.0,1.0
651,0.492754,0.500,0.333333,0.036383,1.0,0.0


In [61]:
df.describe()

Unnamed: 0,HouseAge,Bedroom,FullBath,LotArea,SalePrice
count,1460.0,1460.0,1460.0,1460.0,1460.0
mean,1971.267808,2.866438,1.565068,10516.828082,180921.19589
std,30.202904,0.815778,0.550916,9981.264932,79442.502883
min,1872.0,0.0,0.0,1300.0,34900.0
25%,1954.0,2.0,1.0,7553.5,129975.0
50%,1973.0,3.0,2.0,9478.5,163000.0
75%,2000.0,3.0,2.0,11601.5,214000.0
max,2010.0,8.0,3.0,215245.0,755000.0
