# Interview Task: House Price Prediction Model

To begin, I was given a dataset of house prices along with various features that can influence the price of a house.

The task is to build and train a model that will predict the price of a house based on a set of given features.

The first step is to take a look at the given dataset, understand it, and make some assumptions based on real-life experience and on what a brief look at the data reveals.

In [21]:
# Import of the necessary libraries
import pandas as pd
pd.set_option('display.max_columns', None)

## Step 1: Load the dataset
For this step, I downloaded the dataset from Kagglehub (https://www.kaggle.com/ahmednour/house-price-prediction). To read the dataset efficiently, I used the pandas library, specifically the function "read_csv()", which reads the contents of the CSV file and stores them in a pandas DataFrame.

In [22]:
dataset = pd.read_csv('./dataset/house_price_dataset.csv')

After reading the CSV dataset, I need some insight into what data it contains. For that purpose, I will use the DataFrame's extensive set of methods to display the dataset, its description, and useful information about it.

In [23]:
print("Step 2: Display the dataset (first 10 rows and last 10 rows)")


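# display - function from the IPython library that renders its argument in a more readable way
# (available by default in Jupyter; imported explicitly so the cell also runs elsewhere)
from IPython.display import display
display(dataset.head(10))
display(dataset.tail(10))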

Step 2: Display the dataset (first 10 rows and last 10 rows)


Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
5,10850000,7500,3,3,1,yes,no,yes,no,yes,2,yes,semi-furnished
6,10150000,8580,4,3,4,yes,no,no,no,yes,2,yes,semi-furnished
7,10150000,16200,5,3,2,yes,no,no,no,no,0,no,unfurnished
8,9870000,8100,4,1,2,yes,yes,yes,no,yes,2,yes,furnished
9,9800000,5750,3,2,4,yes,yes,no,no,yes,1,yes,unfurnished


Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
535,2100000,3360,2,1,1,yes,no,no,no,no,1,no,unfurnished
536,1960000,3420,5,1,2,no,no,no,no,no,0,no,unfurnished
537,1890000,1700,3,1,2,yes,no,no,no,no,0,no,unfurnished
538,1890000,3649,2,1,1,yes,no,no,no,no,0,no,unfurnished
539,1855000,2990,2,1,1,no,no,no,no,no,1,no,unfurnished
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished
544,1750000,3850,3,1,2,yes,no,no,no,no,0,no,unfurnished


As can be seen in the cell above, we have data about house prices and their features:
* Price of the house,
* Area,
* Nr. of bathrooms,
* Nr. of bedrooms,
* Nr. of stories,
* Whether the house is on a main road,
* Whether the house has a guest room,
* Whether the house has a basement,
* Whether the house has hot water heating,
* Whether the house has air conditioning,
* Nr. of parking spaces,
* Whether the house is in a preferred area,
* Whether the house is furnished, semi-furnished or unfurnished.

Each of these features may be used to predict the price of the house. However, as will be seen later, some features are more important than others, and if used improperly, they may lead to worse model performance as well as longer training time.

Since we have examples of houses along with their actual prices, it is safe to assume that I will apply a Supervised Learning algorithm to train a model that predicts the price of a house from the given features. Supervised Machine Learning is a type of Machine Learning in which algorithms learn from labeled data (examples that come with the correct answers), that is, from a given mapping between inputs X and outputs Y. [[1]](https://www.coursera.org/learn/machine-learning/lecture/s91wX/supervised-learning-part-1)
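
To make the X-to-Y mapping concrete, here is a minimal sketch of supervised learning on six labeled examples copied from the head and tail of the dataset shown earlier. It fits a straight line from area to price with NumPy's least squares; the choice of a one-feature linear model and the helper name `predict` are assumptions made only for illustration, not the model built later.

```python
import numpy as np

# Labeled examples (rows 0, 1, 3, 535, 537, 544 of the dataset shown above):
# the input x is the area, the label y is the actual price.
area = np.array([7420, 8960, 7500, 3360, 1700, 3850], dtype=float)
price = np.array([13300000, 12250000, 12215000, 2100000, 1890000, 1750000], dtype=float)

# Learn the mapping price ~ w * area + b by ordinary least squares.
w, b = np.polyfit(area, price, deg=1)


def predict(x):
    """Hypothetical helper: apply the learned line to a new area value."""
    return w * x + b


# Prediction for a house that is not among the training examples.
predicted = predict(8000.0)
print(f"w = {w:.2f}, b = {b:.2f}, predicted price = {predicted:.0f}")
```

Because the six rows were hand-picked from both ends of the table, the fitted numbers say nothing about the full dataset; the point is only that the model learns from pairs of inputs and correct answers.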

At the same time, I was instructed to provide some Initial Assumptions about the dataset and the house price prediction model that may be built on it, so I will start with that.
Taking into account real-life experience and the brief analysis of the dataset, I suppose that:
1. The price of the house is strongly influenced by the Area of the house, since this is the main criterion for evaluating the price of a house.
2. The number of stories influences the actual area of the house, so it contributes to the Area and, through it, to the actual price.
3. Houses in preferred areas are more expensive than those in non-preferred areas.
4. Furnished houses are more expensive than semi-furnished and unfurnished houses.

The Initial Assumptions on the house pricing:
1. The main criterion in the price evaluation is the Area of the house.
2. Houses that are furnished are more expensive than those that are not furnished.
3. Houses with more rooms are more expensive.
4. Houses with at least 1 parking space are more expensive than those without parking spaces.
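
Assumptions like these can be checked with a correlation and a few group means. The sketch below does so on eight rows hand-copied from the head and tail output shown earlier; the variable names are hypothetical, and such a tiny, hand-picked sample cannot confirm or refute anything. On the real data, the same calls would be run on the full `dataset` DataFrame.

```python
import pandas as pd

# Rows 0, 1, 2, 7, 535, 541, 542, 543 from the displayed head/tail (subset of columns).
sample = pd.DataFrame({
    "price": [13300000, 12250000, 12250000, 10150000, 2100000, 1767150, 1750000, 1750000],
    "area": [7420, 8960, 9960, 16200, 3360, 2400, 3620, 2910],
    "parking": [2, 3, 2, 0, 1, 0, 0, 0],
    "prefarea": ["yes", "no", "yes", "no", "no", "no", "no", "no"],
    "furnishingstatus": ["furnished", "furnished", "semi-furnished", "unfurnished",
                         "unfurnished", "semi-furnished", "unfurnished", "furnished"],
})

# Assumption: price is strongly influenced by area (Pearson correlation).
area_price_corr = sample["price"].corr(sample["area"])

# Assumptions: preferred area and furnishing raise the price (mean price per group).
mean_by_prefarea = sample.groupby("prefarea")["price"].mean()
mean_by_furnishing = sample.groupby("furnishingstatus")["price"].mean()

# Assumption: at least one parking space raises the price.
parking_group = sample["parking"].gt(0).map({True: "has parking", False: "no parking"})
mean_by_parking = sample.groupby(parking_group)["price"].mean()

print(area_price_corr)
print(mean_by_prefarea, mean_by_furnishing, mean_by_parking, sep="\n\n")
```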

## Step 3: Analyze the dimensionality of the dataset

The dimensionality of the dataset refers to the number of rows and columns it has.
* Number of rows - number of examples in the dataset.
* Number of columns - number of features in the dataset. However, since this is a labeled dataset, 1 of the columns is the target column, which contains the Y values (the actual prices of the houses). The other columns contain the features (input variables, or X values) that influence the target values.

In [26]:
print(f"Dimensionality of the dataset: {dataset.shape}")
print(f"Number of rows (training examples): {dataset.shape[0]}")
print(f"Number of columns (X - nr. of features vectors, Y - targets vector): {dataset.shape[1]}")

Dimensionality of the dataset: (545, 13)
Number of rows (training examples): 545
Number of columns (X - nr. of features vectors, Y - targets vector): 13


In this case, we have 545 examples of houses, each with its input variables and the respective output variable. We also have 13 columns: 12 features and 1 target column.

In the context of ML practices, I will denote the above numbers with the following notation:
$$
\begin{split}
m = 545 \\
|\vec{X}| = 12 \text{ (for a single training example)}\\
|\vec{Y}| = 1 \text{ (for a single training example)}\\
\end{split}
$$
Besides that, when I want to refer to a specific training example, I will use the following notation:
$$
\begin{split}
i = 0 \text{ ($i^{th}$ training example)}\\
(X^{(0)}, Y^{(0)}) = ([7420, 4, 2, 3, \text{yes}, \text{no}, \text{no}, \text{no}, \text{yes}, 2, \text{yes}, \text{furnished}], [13300000]) \\
\end{split}
$$
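
The same notation can be mirrored in pandas. The following sketch rebuilds the first two rows of the dataset by hand (so that it runs on its own) and splits them into the feature part X and the target part Y; the variable names `X`, `y`, `x_i` and `y_i` are assumptions chosen to match the notation, not names used elsewhere in this notebook.

```python
import pandas as pd

columns = ["price", "area", "bedrooms", "bathrooms", "stories", "mainroad", "guestroom",
           "basement", "hotwaterheating", "airconditioning", "parking", "prefarea",
           "furnishingstatus"]
# Rows 0 and 1 copied from the head of the dataset shown earlier.
rows = [
    [13300000, 7420, 4, 2, 3, "yes", "no", "no", "no", "yes", 2, "yes", "furnished"],
    [12250000, 8960, 4, 4, 4, "yes", "no", "no", "no", "yes", 3, "no", "furnished"],
]
sample = pd.DataFrame(rows, columns=columns)

# X holds the 12 feature columns, y holds the single target column.
X = sample.drop(columns="price")
y = sample["price"]

m, n = X.shape          # m examples in this sample, n features per example
i = 0                   # index of the training example of interest
x_i = X.iloc[i].tolist()
y_i = y.iloc[i]

print(f"m = {m}, n = {n}")
print(f"(X^({i}), Y^({i})) = ({x_i}, [{y_i}])")
```

On the full dataset, `m` would be 545 instead of the 2 rows used here.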

In the next step, I am going to check the dataset for inconsistencies, such as missing values, and analyze the data types it contains.

In [29]:
print("Information:")
dataset.info()

Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB


As can be seen, the dataset contains 545 entries, as I already found out, and 13 columns. The columns have different datatypes:
* int64 - integer,
* object - from the brief look at the head and tail of the DataFrame, I can deduce that this is a string datatype that holds categorical data:
    * yes, no - by analogy, True/False values, or 1/0 values,
    * furnished, semi-furnished, unfurnished - different categories of the same feature (furnishingstatus).
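
The yes/no to 1/0 analogy and the treatment of a multi-category column can be sketched as follows. This is one possible encoding, assumed here for illustration only (a dictionary mapping for binary columns and one-hot encoding via `pd.get_dummies` for furnishingstatus); the three rows are toy values that reuse the category labels seen in the dataset.

```python
import pandas as pd

# Toy rows using the category labels that appear in the dataset.
sample = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

encoded = sample.copy()

# Binary column: map yes/no to 1/0.
encoded["mainroad"] = encoded["mainroad"].map({"yes": 1, "no": 0})

# Multi-category column: one indicator column per category (one-hot encoding).
encoded = pd.get_dummies(encoded, columns=["furnishingstatus"], dtype=int)

print(encoded)
```

One-hot encoding avoids implying an order between categories; an ordinal mapping would be an alternative if furnished, semi-furnished and unfurnished are treated as ranked levels.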

At the same time, it can be seen that there are no missing (NULL) values, but to be sure, I will check this explicitly.

In [35]:
print("Missing values:")
print(dataset.isnull().sum())

Missing values:
price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64


As can be seen, there are no missing values in the dataset, which is a good thing, because nothing has to be repaired. Had there been any, I know 3 ways to handle them:
* Drop the rows with missing values - an acceptable approach if the number of missing values is small and the dataset is large enough that losing those examples barely matters.
* Interpolate the missing values - a good approach if the number of missing values is small and the dataset is small, so that no example can be spared. It uses the values of the neighboring examples to fill the gaps, so it makes the most sense when the order of the rows is meaningful.
* Fill the missing values with the mean, median or mode of the column - a good approach if the number of missing values is large and the dataset is large enough to provide sufficient points for calculating the mean, median or mode, which gives less biased values for the missing ones.
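
The three options above can be sketched on a toy column with one gap. The values are made up for illustration and are not taken from the dataset, which has no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
toy = pd.DataFrame({"area": [3000.0, np.nan, 3600.0, 4200.0]})

# 1. Drop the rows that contain missing values.
dropped = toy.dropna()

# 2. Interpolate from the neighboring rows (linear interpolation by default).
interpolated = toy.interpolate()

# 3. Fill with a column statistic, here the median of the known values.
filled = toy.fillna(toy["area"].median())

print(dropped, interpolated, filled, sep="\n\n")
```

Here interpolation fills the gap with 3300.0 (halfway between its neighbors), while the median fill uses 3600.0.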

The next step is to look for duplicate examples in the dataset. I am not sure that house price prediction is a case in which duplicates must be ruled out, since it is quite possible for 2 houses to have the same features and the same price (new houses in the same neighborhood, for example). However, I will still check for duplicates in the dataset, just to adhere to general good practices in data analysis.

In [41]:
print(dataset[dataset.duplicated(keep=False)])

Empty DataFrame
Columns: [price, area, bedrooms, bathrooms, stories, mainroad, guestroom, basement, hotwaterheating, airconditioning, parking, prefarea, furnishingstatus]
Index: []
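
Since the check above returns nothing on this dataset, here is a small sketch of what it would flag on a toy frame with made-up values. With `keep=False`, `duplicated()` marks every copy of a repeated row, while the default marks only the later copies.

```python
import pandas as pd

# Hypothetical rows: the first two are identical.
toy = pd.DataFrame({
    "price": [1750000, 1750000, 1890000],
    "area": [3620, 3620, 1700],
})

# keep=False flags all members of each duplicate group.
all_copies = toy[toy.duplicated(keep=False)]

# The default (keep="first") flags only the repeats after the first occurrence.
later_copies = toy[toy.duplicated()]

# drop_duplicates keeps one representative per group.
deduplicated = toy.drop_duplicates()

print(all_copies, later_copies, deduplicated, sep="\n\n")
```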


As it may be seen, there are no duplicate examples in the dataset. The next step is to convert

In [10]:
print("Information:")
print(dataset.info())  # info() prints the summary itself and returns None, hence the trailing "None"

Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB
None
