---
title: "Dataset Description"
---

## Goal of Collecting this Dataset

This dataset provides key features for predicting house prices, including area, bedrooms, bathrooms,
stories, amenities like air conditioning and parking, and information on furnishing status.
It enables analysis and modelling to understand the factors impacting house prices and develop accurate predictions in real estate markets.

## Source of the Dataset

The dataset was sourced from Kaggle: https://www.kaggle.com/datasets/harishkumardatalab/housing-price-prediction?resource=download&select=Housing.csv

## General Information

- Number of Attributes: 13
- Number of Objects: 545
- Type of Attributes: Numeric, Symmetric Binary, Ordinal.
- Class label: Price.

## About Dataset

This dataset provides comprehensive information for house price prediction, with 13 columns:

- Price: The price of the house.
- Area: The total area of the house in square feet.
- Bedrooms: The number of bedrooms in the house.
- Bathrooms: The number of bathrooms in the house.
- Stories: The number of stories in the house.
- Mainroad: Whether the house is connected to the main road (Yes/No).
- Guestroom: Whether the house has a guest room (Yes/No).
- Basement: Whether the house has a basement (Yes/No).
- Hot water heating: Whether the house has a hot water heating system (Yes/No).
- Airconditioning: Whether the house has an air conditioning system (Yes/No).
- Parking: The number of parking spaces available within the house.
- Prefarea: Whether the house is located in a preferred area (Yes/No).
- Furnishing status: The furnishing status of the house (Fully Furnished, Semi-Furnished, Unfurnished).


In [1]:
data <- read.csv("Housing.csv")

In [2]:
data

price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
10850000,7500,3,3,1,yes,no,yes,no,yes,2,yes,semi-furnished
10150000,8580,4,3,4,yes,no,no,no,yes,2,yes,semi-furnished
10150000,16200,5,3,2,yes,no,no,no,no,0,no,unfurnished
9870000,8100,4,1,2,yes,yes,yes,no,yes,2,yes,furnished
9800000,5750,3,2,4,yes,yes,no,no,yes,1,yes,unfurnished
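The attribute and object counts listed under General Information can be checked directly on a loaded data frame. The sketch below inlines the first three preview rows so it runs without `Housing.csv`; `housing_sample` is a hypothetical name, and `stringsAsFactors = TRUE` is an assumption made so that the yes/no columns load as factors, as the `summary()` output suggests.

```r
# First three rows of the preview above, inlined so the sketch is self-contained
sample_csv <- "price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished"

housing_sample <- read.csv(text = sample_csv, stringsAsFactors = TRUE)

# Number of attributes (columns) and objects (rows) in the sample
n_attributes <- ncol(housing_sample)
n_objects <- nrow(housing_sample)

# Split the columns by storage type: numeric vs. categorical (factor)
numeric_cols <- names(housing_sample)[sapply(housing_sample, is.numeric)]
factor_cols <- names(housing_sample)[sapply(housing_sample, is.factor)]

# Compact overview of column names and types
str(housing_sample)
```

On the full file, `nrow(data)` and `ncol(data)` would be the corresponding checks.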


In [3]:
# Summary of the dataset
summary(data)

     price               area          bedrooms       bathrooms    
 Min.   : 1750000   Min.   : 1650   Min.   :1.000   Min.   :1.000  
 1st Qu.: 3430000   1st Qu.: 3600   1st Qu.:2.000   1st Qu.:1.000  
 Median : 4340000   Median : 4600   Median :3.000   Median :1.000  
 Mean   : 4766729   Mean   : 5151   Mean   :2.965   Mean   :1.286  
 3rd Qu.: 5740000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.000  
 Max.   :13300000   Max.   :16200   Max.   :6.000   Max.   :4.000  
    stories      mainroad  guestroom basement  hotwaterheating airconditioning
 Min.   :1.000   no : 77   no :448   no :354   no :520         no :373        
 1st Qu.:1.000   yes:468   yes: 97   yes:191   yes: 25         yes:172        
 Median :2.000                                                                
 Mean   :1.806                                                                
 3rd Qu.:2.000                                                                
 Max.   :4.000                                    

## Data Pre-processing

We considered the following data preprocessing methods:

- Data cleaning: filling in missing values, removing noise, resolving inconsistencies, and identifying and removing outliers.
- Data transformation: transformations such as normalization may be applied. Normalization can improve the accuracy and efficiency of mining algorithms that rely on distance measurements.
- Feature selection: choosing a subset of relevant, non-redundant features from the original dataset.
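To make the normalization idea in the list above concrete, here is a minimal sketch of min-max scaling. `min_max` is a hypothetical helper written for illustration, not a function taken from the analysis, and the input values are a handful of `area` entries from the data preview.

```r
# Min-max normalization: rescale a numeric vector to the range [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))  # constant column: avoid division by zero
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# A few 'area' values from the preview
area_values <- c(7420, 8960, 9960, 7500, 16200)
area_scaled <- min_max(area_values)

print(round(area_scaled, 3))
```

After scaling, attributes with very different ranges (for example `area` and `bedrooms`) contribute on a comparable scale to distance computations.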

## 1. Data Cleaning

1.1 We checked for missing values and found none:


In [4]:
# Check missing values
missing_values <- colSums(is.na(data))
missing_values
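As a small illustration of what this check reports when values are missing, the sketch below runs the same `colSums(is.na(...))` call on a hypothetical three-row data frame (`toy`) with `NA` entries planted on purpose.

```r
# Hypothetical data frame with deliberately missing entries
toy <- data.frame(
  price = c(13300000, NA, 12250000),
  area = c(7420, 8960, NA),
  mainroad = c("yes", "no", NA)
)

# Number of missing values per column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Overall count, and the rows that have no missing value at all
total_missing <- sum(missing_per_column)
complete_rows <- toy[complete.cases(toy), ]
```

A result of all zeros, as reported for the housing data, means no imputation or row removal is needed.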

1.2 We removed duplicate rows, if any, with `unique()`. The dataset turned out to contain no redundant rows:


In [5]:
data <- unique(data)
data

price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
10850000,7500,3,3,1,yes,no,yes,no,yes,2,yes,semi-furnished
10150000,8580,4,3,4,yes,no,no,no,yes,2,yes,semi-furnished
10150000,16200,5,3,2,yes,no,no,no,no,0,no,unfurnished
9870000,8100,4,1,2,yes,yes,yes,no,yes,2,yes,furnished
9800000,5750,3,2,4,yes,yes,no,no,yes,1,yes,unfurnished
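One way to confirm whether any rows were redundant is to count them with `duplicated()` before calling `unique()`. The sketch below uses a hypothetical data frame (`toy`) in which the third row repeats the first.

```r
# Hypothetical data frame: row 3 is an exact copy of row 1
toy <- data.frame(
  price = c(9870000, 9800000, 9870000),
  area = c(8100, 5750, 8100),
  furnishingstatus = c("furnished", "unfurnished", "furnished")
)

# duplicated() marks every row that repeats an earlier row
n_duplicates <- sum(duplicated(toy))

# unique() keeps the first occurrence of each distinct row
deduplicated <- unique(toy)

print(n_duplicates)
print(nrow(deduplicated))
```

On the housing data, a count of zero would match the observation that no rows were removed.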


1.3 Detecting and removing outliers:

In [6]:
#DETECTING OUTLIERS
install.packages("outliers")





In [7]:
library(outliers)

# outlier() flags the value furthest from the column mean;
# with logical = TRUE it returns TRUE at every position holding that value.
# Note: Find_outlier is reassigned after each check, so once this cell
# finishes it holds only the row indices flagged for parking.

OutPrice <- outlier(data$price, logical = TRUE)
sum(OutPrice)
Find_outlier <- which(OutPrice)

OutArea <- outlier(data$area, logical = TRUE)
sum(OutArea)
Find_outlier <- which(OutArea)

OutBedrooms <- outlier(data$bedrooms, logical = TRUE)
sum(OutBedrooms)
Find_outlier <- which(OutBedrooms)

OutBathrooms <- outlier(data$bathrooms, logical = TRUE)
sum(OutBathrooms)
Find_outlier <- which(OutBathrooms)

OutStories <- outlier(data$stories, logical = TRUE)
sum(OutStories)
Find_outlier <- which(OutStories)

OutParking <- outlier(data$parking, logical = TRUE)
sum(OutParking)
Find_outlier <- which(OutParking)


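After finding the outliers, we remove them to obtain a cleaner dataset and a more efficient model. `Find_outlier` holds the indices from the last check (parking), so those are the rows dropped here; rows 2 and 4, which both have 3 parking spaces, no longer appear in the output: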

In [8]:
data <- data[-Find_outlier, ]

data

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
1,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
3,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
5,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
6,10850000,7500,3,3,1,yes,no,yes,no,yes,2,yes,semi-furnished
7,10150000,8580,4,3,4,yes,no,no,no,yes,2,yes,semi-furnished
8,10150000,16200,5,3,2,yes,no,no,no,no,0,no,unfurnished
9,9870000,8100,4,1,2,yes,yes,yes,no,yes,2,yes,furnished
10,9800000,5750,3,2,4,yes,yes,no,no,yes,1,yes,unfurnished
11,9800000,13200,3,1,2,yes,no,yes,no,yes,2,yes,furnished
12,9681000,6000,4,3,2,yes,yes,yes,yes,no,2,no,semi-furnished
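In the removal step above, `Find_outlier` contains only the indices from the last attribute checked. If the intent were to drop rows flagged in any numeric attribute, one possible approach is to collect the flags for all attributes first and remove the union. The sketch below is an illustration under that assumption: `flag_furthest` is a hypothetical base-R helper that imitates the "value furthest from the mean" rule, and `toy` is a small stand-in built from the first preview rows.

```r
# Hypothetical helper: TRUE at every entry equal to the value furthest from the mean
flag_furthest <- function(x) {
  deviation <- abs(x - mean(x))
  x == x[which.max(deviation)]
}

# Small stand-in for the housing data (first six preview rows, three columns)
toy <- data.frame(
  price = c(13300000, 12250000, 12250000, 12215000, 11410000, 10850000),
  area = c(7420, 8960, 9960, 7500, 7420, 7500),
  parking = c(2, 3, 2, 3, 2, 2)
)

numeric_cols <- c("price", "area", "parking")

# One logical column per attribute, TRUE where that attribute is flagged
flags <- sapply(toy[numeric_cols], flag_furthest)

# A row is dropped if it is flagged in at least one attribute
rows_to_drop <- which(rowSums(flags) > 0)

# Guard the empty case: toy[-integer(0), ] would return zero rows
cleaned <- if (length(rows_to_drop) > 0) toy[-rows_to_drop, ] else toy

print(rows_to_drop)
print(cleaned)
```

Because this rule always flags at least one value per attribute, applying it across many columns can remove a noticeable share of a small dataset, so the result is worth inspecting before it is used for modelling.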


## 2. Data Transformation

2.1 We encoded the categorical attributes with numeric labels so that machine learning algorithms and statistical analyses can process them more easily:

In [11]:

# Ordinal attribute: furnished = 3, semi-furnished = 2, unfurnished = 1
data$furnishingstatus <- factor(data$furnishingstatus,
                                levels = c("furnished", "semi-furnished", "unfurnished"),
                                labels = c("3", "2", "1"))

# Binary attributes: no = 0, yes = 1
data$mainroad <- factor(data$mainroad,
                        levels = c("no", "yes"),
                        labels = c("0", "1"))

data$guestroom <- factor(data$guestroom,
                         levels = c("no", "yes"),
                         labels = c("0", "1"))

data$hotwaterheating <- factor(data$hotwaterheating,
                               levels = c("no", "yes"),
                               labels = c("0", "1"))

data$airconditioning <- factor(data$airconditioning,
                               levels = c("no", "yes"),
                               labels = c("0", "1"))

data$prefarea <- factor(data$prefarea,
                        levels = c("no", "yes"),
                        labels = c("0", "1"))

data$basement <- factor(data$basement,
                        levels = c("no", "yes"),
                        labels = c("0", "1"))

In [12]:
data

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
1,13300000,7420,4,2,3,1,0,0,0,1,2,1,3
3,12250000,9960,3,2,2,1,0,1,0,0,2,1,2
5,11410000,7420,4,1,2,1,1,1,0,1,2,0,3
6,10850000,7500,3,3,1,1,0,1,0,1,2,1,2
7,10150000,8580,4,3,4,1,0,0,0,1,2,1,2
8,10150000,16200,5,3,2,1,0,0,0,0,0,0,1
9,9870000,8100,4,1,2,1,1,1,0,1,2,1,3
10,9800000,5750,3,2,4,1,1,0,0,1,1,1,1
11,9800000,13200,3,1,2,1,0,1,0,1,2,1,3
12,9681000,6000,4,3,2,1,1,1,1,0,2,0,2
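Note that `factor()` with labels such as "3", "2", "1" yields a factor whose labels are digit characters, not a numeric column. If true numeric values are needed later, the conversion has to go through `as.character()`. The sketch below shows the difference on a hypothetical four-element vector (`status`).

```r
# Same encoding as above, applied to a small hypothetical vector
status <- factor(c("furnished", "semi-furnished", "unfurnished", "furnished"),
                 levels = c("furnished", "semi-furnished", "unfurnished"),
                 labels = c("3", "2", "1"))

# The result is still a factor
print(is.factor(status))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(status)

# Converting via as.character() recovers the label values
values <- as.numeric(as.character(status))

print(codes)   # level codes
print(values)  # label values
```

Here the level codes run in the opposite direction to the labels, which is why applying `as.numeric()` directly to the encoded column would silently reverse the intended ordering.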


2.2 We discretized our class label, `price`, to simplify the interpretation of the target class and to make the model's results easier to analyze and communicate. We used the quartiles of `price` as the bin boundaries:

In [22]:

# Use quantiles to discretize 'price' into four categories (1 = lowest, 4 = highest)
data$Price_Category <- cut(data$price,
                           breaks = quantile(data$price, probs = 0:4/4),
                           labels = FALSE, include.lowest = TRUE)

# Drop the original 'price' column (base R, so dplyr does not need to be loaded)
data$price <- NULL
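To see how quantile-based binning assigns categories, the sketch below applies the same `cut()` and `quantile()` calls to the ten prices from the data preview. The vector name `prices` is hypothetical; with so few values the bins are only illustrative.

```r
# Prices of the first ten houses in the preview
prices <- c(13300000, 12250000, 12250000, 12215000, 11410000,
            10850000, 10150000, 10150000, 9870000, 9800000)

# Quartile boundaries: minimum, 25%, 50%, 75%, maximum
breaks <- quantile(prices, probs = 0:4 / 4)

# labels = FALSE returns integer bin codes; include.lowest = TRUE keeps the
# minimum price inside the first bin instead of turning it into NA
price_category <- cut(prices, breaks = breaks, labels = FALSE, include.lowest = TRUE)

print(breaks)
print(table(price_category))
```

One caveat: if two quantiles coincide (which can happen with heavily repeated prices), `cut()` stops with an error because the breaks are not unique, so it is worth printing the breaks first.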


## 3. Feature Selection

Feature selection is a process of selecting a subset of relevant features (or attributes) from the original set of features in a dataset. The goal of feature selection is to choose the most relevant and important features, thereby reducing dimensionality, and improving model performance.

We trained a random forest model on the predictors (all columns other than the target class) and computed the importance of each one. Ranking the columns by `%IncMSE` gave the five most relevant attributes, which we printed:

In [31]:
#Feature Selection

#ensure the results are repeatable
set.seed(7)
# load the library
library(randomForest)

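# Prepare the predictors (features) and target variable
predictors <- data[, 1:12]  # columns 1 to 12 are the predictors
target <- data[, 13]        # column 13 (Price_Category) is the target variable

# Train a Random Forest model to compute feature importance.
# The target is stored as integer codes, so randomForest fits a regression,
# which is why the importance measures are %IncMSE and IncNodePurity.
rf_model <- randomForest(predictors, target, ntree = 100, importance = TRUE)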

# Get feature importance
feature_importance <- importance(rf_model)

# Sort features by importance
sorted_feature_importance <- feature_importance[order(-feature_importance[, 1]), ]

# Print the sorted feature importance
print("Feature Importance:")
print(sorted_feature_importance)

# Select the top N important features (e.g., top 5 features)
top_n_features <- rownames(sorted_feature_importance)[1:5]

# Print the top N important features
print("Top N Important Features:")
print(top_n_features)


"The response has five or fewer unique values.  Are you sure you want to do regression?"

[1] "Feature Importance:"
                   %IncMSE IncNodePurity
area             26.137439    223.256055
bathrooms        16.580581     69.899466
prefarea         12.429317     29.900780
furnishingstatus 12.248378     53.558200
basement         11.457251     20.827736
stories          10.451395     46.572786
airconditioning  10.016677     41.312586
bedrooms          9.273410     39.241459
mainroad          9.222297     27.230329
guestroom         5.607058     15.382876
parking           4.366392     28.734186
hotwaterheating  -1.257426      7.700046
[1] "Top N Important Features:"
[1] "area"             "bathrooms"        "prefarea"         "furnishingstatus"
[5] "basement"        
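Once the top features are known, the data can be reduced to those columns plus the target. The sketch below is one possible way to do that in base R. The `%IncMSE` values are copied from the table printed above, while `toy` is a hypothetical two-row stand-in for the preprocessed data frame with made-up `Price_Category` values.

```r
# %IncMSE values copied from the importance table above
inc_mse <- c(area = 26.137439, bathrooms = 16.580581, prefarea = 12.429317,
             furnishingstatus = 12.248378, basement = 11.457251,
             stories = 10.451395, airconditioning = 10.016677,
             bedrooms = 9.273410, mainroad = 9.222297, guestroom = 5.607058,
             parking = 4.366392, hotwaterheating = -1.257426)

# Rank the features and keep the names of the top N
top_n <- 5
top_features <- names(sort(inc_mse, decreasing = TRUE))[seq_len(top_n)]

# Hypothetical stand-in for the preprocessed data (values are illustrative)
toy <- data.frame(area = c(7420, 9960), bathrooms = c(2, 2), prefarea = c(1, 1),
                  furnishingstatus = c(3, 2), basement = c(0, 1),
                  parking = c(2, 2), Price_Category = c(4, 4))

# Keep only the selected predictors plus the target column
reduced <- toy[, c(top_features, "Price_Category")]

print(names(reduced))
```

The negative `%IncMSE` for `hotwaterheating` indicates that permuting it did not hurt the model's error in this run, which is consistent with it ranking last.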
