# House Price Prediction Model

Now that we've covered the introduction to Machine Learning, we'll create a model that predicts house prices.

We'll load the dataset from the Hugging Face Hub with pandas, using an `hf://` path (this requires the `huggingface_hub` package to be installed).

## Loading the data

```
import pandas as pd

df = pd.read_csv("hf://datasets/mrseba/boston_house_price/housing_data.csv")
```

In [4]:
import pandas as pd

In [5]:
df = pd.read_csv("hf://datasets/mrseba/boston_house_price/housing_data.csv")

### Exploring the data

Once we've loaded the dataset, it's a good idea to explore it to understand its structure and the features available:

`print(df.head())`
`print(df.info())`
`print(df.describe())`

In [6]:
print(df.head())

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900    1  296     15.3   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671    2  242     17.8   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671    2  242     17.8   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622    3  222     18.7   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622    3  222     18.7   

        B  LSTAT  MEDV  
0  396.90   4.98  24.0  
1  396.90   9.14  21.6  
2  392.83   4.03  34.7  
3  394.63   2.94  33.4  
4  396.90    NaN  36.2  


In [7]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     486 non-null    float64
 1   ZN       486 non-null    float64
 2   INDUS    486 non-null    float64
 3   CHAS     486 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      486 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    486 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
None


In [8]:
print(df.describe())

             CRIM          ZN       INDUS        CHAS         NOX          RM  \
count  486.000000  486.000000  486.000000  486.000000  506.000000  506.000000   
mean     3.611874   11.211934   11.083992    0.069959    0.554695    6.284634   
std      8.720192   23.388876    6.835896    0.255340    0.115878    0.702617   
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000   
25%      0.081900    0.000000    5.190000    0.000000    0.449000    5.885500   
50%      0.253715    0.000000    9.690000    0.000000    0.538000    6.208500   
75%      3.560263   12.500000   18.100000    0.000000    0.624000    6.623500   
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000   

              AGE         DIS         RAD         TAX     PTRATIO           B  \
count  486.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean    68.518519    3.795043    9.549407  408.237154   18.455534  356.674032   
std     27.999513    2.1057

In [9]:
print(df.isnull().sum())

CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
B           0
LSTAT      20
MEDV        0
dtype: int64


In [10]:
df.fillna(df.mean(), inplace=True)

In [11]:
print(df.isnull().sum())

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64


So far, we've cleaned the dataset by replacing each missing value with the mean of its column.

Here are the columns we should know about:
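To make the imputation step concrete, here is a minimal sketch of what `fillna(df.mean())` does, using a small made-up table (the column names `a` and `b` are hypothetical and not part of the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the housing dataset): one missing value per column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})

# DataFrame.mean() skips NaN by default, so each mean uses only observed values.
col_means = toy.mean()  # a -> 2.0, b -> 15.0

# Passing a Series to fillna fills each column with the value under its label.
filled = toy.fillna(col_means)

print(filled)
print(filled.isnull().sum())
```

Two caveats are worth keeping in mind. For a 0/1 column such as CHAS, the mean is a fraction, so the median or the most frequent value may be a better fill. Also, computing the means on the full dataset lets rows that later end up in the test set influence the fill values; computing them on the training set only avoids that.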

1. **RM:** Average number of rooms per dwelling
2. **LSTAT:** Percentage of the population that is considered lower status
3. **PTRATIO:** Pupil-teacher ratio by town
4. **TAX:** Property tax rate per $10,000
5. **INDUS:** Proportion of non-retail business acres per town
6. **NOX:** Nitric oxides concentration (parts per 10 million)

We'll use these as features because they capture key factors that can influence house prices. A good feature set reflects both the dwelling itself (RM) and its surroundings (LSTAT, PTRATIO, TAX, INDUS, NOX).

In [12]:
features = ['RM', 'LSTAT', 'PTRATIO', 'TAX', 'INDUS', 'NOX']

# Define the target variable (house price)
target = 'MEDV'
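One quick way to sanity-check a feature choice like this is to look at how each feature correlates with the target. The sketch below uses a tiny made-up table, not the real dataset, so the numbers are purely illustrative. On the real data, the equivalent call would be `df[features + [target]].corr()[target]`.

```python
import pandas as pd

# Made-up values for illustration only; they are NOT taken from the dataset.
toy = pd.DataFrame({
    "RM":    [5.0, 6.0, 6.5, 7.0, 8.0],
    "LSTAT": [20.0, 12.0, 10.0, 6.0, 3.0],
    "MEDV":  [15.0, 21.0, 24.0, 30.0, 40.0],
})

# Pearson correlation of every feature with the target column.
corr_with_target = toy.corr()["MEDV"].drop("MEDV")

# Sort by absolute strength so the most informative features come first.
order = corr_with_target.abs().sort_values(ascending=False).index
print(corr_with_target.loc[order])
```

A correlation near +1 or -1 suggests a strong linear relationship with the price. Values near 0 do not prove a feature is useless, since correlation only captures linear effects.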

We'll need to split the dataset into a training set and a test set so that we can evaluate the model on data it has not seen during training.

We'll use `train_test_split` from `scikit-learn` for this:

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
# Define X and y
X = df[features]
y = df[target]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
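As a rough check of what this split produces, the sketch below runs the same call on stand-in arrays with 506 rows (the names `X_demo` and `y_demo` are placeholders, not part of the tutorial's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the housing dataset (506).
X_demo = np.arange(506 * 2).reshape(506, 2)
y_demo = np.arange(506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test size is rounded up: ceil(0.2 * 506) = 102, leaving 404 rows for training.
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces exactly the same partition.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```

Fixing `random_state` makes the results repeatable between runs, which is helpful when comparing models.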

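Scaling puts the features on a comparable scale. For plain linear regression this does not change the predictions, but it makes the coefficients directly comparable and matters for regularized or gradient-based models. We can use `StandardScaler` to scale the data, fitting it on the training set only: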

In [15]:
from sklearn.preprocessing import StandardScaler

In [16]:
scaler = StandardScaler()

In [17]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
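The difference between `fit_transform` and `transform` is easier to see on a tiny example. This is a minimal sketch with made-up numbers; `demo_scaler`, `train`, and `test` are illustrative names only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

demo_scaler = StandardScaler()

# fit_transform learns the per-column mean and standard deviation from the
# training data, then applies z = (x - mean) / std.
train_scaled = demo_scaler.fit_transform(train)

# transform reuses the training statistics; nothing is learned from the test data.
test_scaled = demo_scaler.transform(test)

print(demo_scaler.mean_)          # per-column means learned from train
print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

Because the test row is scaled with the training statistics, its values need not fall within the range seen in training. That is expected and keeps the test set independent of the fitting step.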

In [18]:
from sklearn.linear_model import LinearRegression

In [20]:
model = LinearRegression()

model.fit(X_train_scaled, y_train)
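After fitting, a `LinearRegression` object exposes the learned weights through `coef_` and `intercept_`. The sketch below fits made-up data that follows `y = 2*x1 - 3*x2 + 1` exactly, so the recovered values are easy to verify (`X_demo`, `y_demo`, and `demo_model` are illustrative names):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up inputs; the target follows y = 2*x1 - 3*x2 + 1 with no noise.
X_demo = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 1.0],
])
y_demo = 2 * X_demo[:, 0] - 3 * X_demo[:, 1] + 1

demo_model = LinearRegression().fit(X_demo, y_demo)

# One coefficient per input column, plus a single intercept.
print(demo_model.coef_)
print(demo_model.intercept_)
```

For the model in this tutorial, `dict(zip(features, model.coef_))` would pair each coefficient with its feature name. Since the inputs were standardized, each coefficient can be read as the change in predicted MEDV for a one-standard-deviation increase in that feature.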

In [24]:
# Make predictions on the test set

y_pred = model.predict(X_test_scaled)
print(y_pred)

[27.08732972 30.87049282 16.3858869  25.00779955 17.65684114 23.01671509
 18.12088678 14.37358011 19.92812718 19.5179874  20.55039208 21.68801223
 -1.60810192 22.58070515 19.32444402 23.84062417 19.31292202  3.46072094
 39.40367654 16.62855903 22.87345196 27.57543707 12.69147519 22.65796244
 16.85163263 12.87029228 20.46632353 19.26777287 19.04932937 18.37618248
 19.24977206 25.43572298 25.1307521  15.97016608 14.40297643 20.65161632
 32.80962729 20.75299259 20.89896378 22.42328541 12.72181775 28.52395202
 40.38236091 18.45651892 25.86509186 14.6225861  14.72621354 26.31162819
 17.78157324 30.31186821 23.60687806 33.64719465 16.91957952 25.60882202
 38.07904859 21.37574819 17.50323248 30.25570438 25.03470137 16.07187126
 25.98602456 32.51883772 29.39415458 16.69601318 27.37312755 12.85534467
 18.71975177 25.51419779 28.84587405 15.06775507 20.27709417 25.26940269
 11.97996625 21.24163562 23.10753205  6.58085352 19.99862483 38.94772401
 17.21874636 10.40562325 22.14853987  9.7804494  23

## Evaluating the Model

We can evaluate the model’s performance using Root Mean Squared Error (RMSE), which measures the typical size of the error between predicted and actual values, in the same units as the target.

In [25]:
from sklearn.metrics import mean_squared_error
import numpy as np

In [26]:
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse}")

Root Mean Squared Error: 5.281256648885003
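To see what this number is made of, here is a small worked example with made-up values that computes RMSE by hand and compares it with the `mean_squared_error` route used above. It also computes the RMSE of a naive baseline that always predicts the mean, which is one hedged way to judge whether a score is good; the baseline is an addition for illustration and not part of the original walkthrough.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# By hand: errors are 1, 0, -2 -> squared 1, 0, 4 -> mean 5/3 -> square root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same result via scikit-learn's MSE.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

# Naive baseline: always predict the mean of the true values.
baseline = np.full_like(y_true, y_true.mean())
rmse_baseline = np.sqrt(mean_squared_error(y_true, baseline))

print(rmse_manual, rmse_sklearn, rmse_baseline)
```

Assuming this dataset follows the original Boston housing data, where MEDV is expressed in thousands of dollars, an RMSE of about 5.28 means the predictions are off by roughly $5,300 in a typical case.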
