# 1. Business Understanding

Upside is a real estate company based in Bangalore, India. They want a model that gives their clients a price estimate for a house based on its features.

# 2. Data Acquisition and Understanding

## 2.1 Data Understanding

- Source: This data is from [Kaggle](https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data/discussion).
- Format: The data is in CSV format.
- Content: The data contains 9 columns. These columns are:

    - area_type - Type of area measurement (e.g. Super built-up Area, Plot Area, Built-up Area)
    - availability - Possession date, or "Ready To Move" if the property is already available
    - location - Where the property is located in Bengaluru
    - size - Number of bedrooms, recorded as BHK or Bedroom (e.g. 2 BHK, 4 Bedroom)
    - society - Housing society the property belongs to
    - total_sqft - Size of the property in sq. ft.
    - bath - Number of bathrooms
    - balcony - Number of balconies
    - price - Value of the property in lakhs of Indian rupees (₹)


## 2.2 Data Acquisition

In [3]:
# loading libraries

import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

In [5]:
# adding data

data_path = r'C:\Users\njamb\Desktop\DataScience\BengaluruHousePricePrediction\Data\Raw\Bengaluru_House_Data.csv'

data = pd.read_csv(data_path)

# previewing the data

data.head()

| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200 | 2.0 | 1.0 | 51.0 |
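The loading cell above uses an absolute path that only exists on one machine. A minimal sketch of a more portable loader follows; the relative location `Data/Raw/Bengaluru_House_Data.csv` and the helper name `load_raw_data` are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical project-relative location; adjust it to the actual folder layout.
RAW_DATA_PATH = Path("Data") / "Raw" / "Bengaluru_House_Data.csv"


def load_raw_data(csv_path=RAW_DATA_PATH):
    """Read the raw CSV, failing early with a clear message if it is missing."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Raw data file not found: {csv_path.resolve()}")
    return pd.read_csv(csv_path)
```

Under these assumptions, `data = load_raw_data()` would replace the hard-coded `pd.read_csv(data_path)` call when the notebook is run from the project root.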


In [7]:
# checking shape of data

print(f"The data has {data.shape[0]} rows and {data.shape[1]} columns")

The data has 13320 rows and 9 columns


In [10]:
# checking the statistics of the numeric columns

data.describe()

| | bath | balcony | price |
|---|---|---|---|
| count | 13247.0 | 12711.0 | 13320.0 |
| mean | 2.69261 | 1.584376 | 112.565627 |
| std | 1.341458 | 0.817263 | 148.971674 |
| min | 1.0 | 0.0 | 8.0 |
| 25% | 2.0 | 1.0 | 50.0 |
| 50% | 2.0 | 2.0 | 72.0 |
| 75% | 3.0 | 2.0 | 120.0 |
| max | 40.0 | 3.0 | 3600.0 |


In [12]:
# checking information of the columns

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB
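The `info()` output lists `total_sqft` as `object` rather than a numeric type, which is also why `describe()` leaves it out. At least one entry in the data is a range ("2830 - 2882"). The sketch below shows one possible conversion; the helper name `sqft_to_float` and the choice to replace a range by its midpoint are assumptions, and any other text that cannot be parsed is treated as missing.

```python
import pandas as pd


def sqft_to_float(value):
    """Convert a total_sqft entry to float; return NaN if it cannot be parsed."""
    text = str(value).strip()
    parts = [part.strip() for part in text.split(" - ")]
    try:
        if len(parts) == 2:
            # Assumption: a range such as "2830 - 2882" is replaced by its midpoint.
            return (float(parts[0]) + float(parts[1])) / 2
        return float(text)
    except ValueError:
        # Any other non-numeric text becomes a missing value.
        return float("nan")


# Small illustrative sample; the last entry is made up to show the fallback.
sample = pd.Series(["1056", "2830 - 2882", "not a number"])
converted = sample.apply(sqft_to_float)
print(converted)
```

Applied to the full data, this would look like `data["total_sqft"] = data["total_sqft"].apply(sqft_to_float)`, after which the column would appear in `describe()`.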


## 2.3 Data Cleaning

### 2.3.1 Consistency

In [15]:
# checking if the data has any duplicate values

print(f"The data has {data.duplicated().sum()} duplicate rows")

The data has 529 duplicate rows


In [17]:
# checking some of the duplicate rows

data[data.duplicated()]

| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 971 | Super built-up Area | Ready To Move | Haralur Road | 3 BHK | NRowse | 1464 | 3.0 | 2.0 | 56.0 |
| 1115 | Super built-up Area | Ready To Move | Haralur Road | 2 BHK | NaN | 1027 | 2.0 | 2.0 | 44.0 |
| 1143 | Super built-up Area | Ready To Move | Vittasandra | 2 BHK | Prlla C | 1246 | 2.0 | 1.0 | 64.5 |
| 1290 | Super built-up Area | Ready To Move | Haralur Road | 2 BHK | NaN | 1194 | 2.0 | 2.0 | 47.0 |
| 1394 | Super built-up Area | Ready To Move | Haralur Road | 2 BHK | NaN | 1027 | 2.0 | 2.0 | 44.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13285 | Super built-up Area | Ready To Move | VHBCS Layout | 2 BHK | OlarkLa | 1353 | 2.0 | 2.0 | 110.0 |
| 13299 | Super built-up Area | 18-Dec | Whitefield | 4 BHK | Prtates | 2830 - 2882 | 5.0 | 0.0 | 154.5 |
| 13311 | Plot Area | Ready To Move | Ramamurthy Nagar | 7 Bedroom | NaN | 1500 | 9.0 | 2.0 | 250.0 |
| 13313 | Super built-up Area | Ready To Move | Uttarahalli | 3 BHK | Aklia R | 1345 | 2.0 | 1.0 | 57.0 |
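The rows flagged above are exact repeats of earlier rows. A minimal sketch of removing them with `drop_duplicates` follows, shown on a small sample built from a few of the values above. Whether these repeats are data-entry duplicates or genuinely separate listings of identical units is not stated in the source, so dropping them is an assumption.

```python
import pandas as pd

# Small sample with one exact repeat, using a subset of the columns.
sample = pd.DataFrame({
    "location": ["Haralur Road", "Haralur Road", "Vittasandra"],
    "size": ["2 BHK", "2 BHK", "2 BHK"],
    "total_sqft": ["1027", "1027", "1246"],
    "price": [44.0, 44.0, 64.5],
})

# keep="first" retains the first occurrence and drops the later repeats.
deduplicated = sample.drop_duplicates(keep="first").reset_index(drop=True)
print(f"Removed {len(sample) - len(deduplicated)} duplicate rows")
```

On the full data, the equivalent call would be `data = data.drop_duplicates().reset_index(drop=True)`.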


### 2.3.2 Completeness

In [25]:
# checking if the data has any missing values

data.isna().sum()

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64

In [41]:
# computing the percentage of missing values per column, keeping only columns with gaps

null_values = data.isna().mean().mul(100).to_frame("NullPercentage")
null_values = null_values[null_values['NullPercentage'] > 0].sort_values(by='NullPercentage')
null_values

| | NullPercentage |
|---|---|
| location | 0.007508 |
| size | 0.12012 |
| bath | 0.548048 |
| balcony | 4.572072 |
| society | 41.306306 |
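The percentages above suggest different treatments per column. The sketch below shows one possible approach, not the notebook's chosen method: drop `society` (about 41% missing), drop the few rows missing `location`, `size`, or `bath` (each under 1%), and fill `balcony` with its median. The helper name `handle_missing` and all three choices are assumptions, and the sample rows are made up for illustration.

```python
import pandas as pd


def handle_missing(df):
    """One possible treatment of the missing values; every choice is an assumption."""
    # society is missing in roughly 41% of rows, so the column is dropped.
    cleaned = df.drop(columns=["society"])
    # location, size, and bath are each missing in under 1% of rows, so those rows are dropped.
    cleaned = cleaned.dropna(subset=["location", "size", "bath"])
    # Assumption: missing balcony counts are filled with the column median.
    balcony_median = cleaned["balcony"].median()
    cleaned = cleaned.assign(balcony=cleaned["balcony"].fillna(balcony_median))
    return cleaned.reset_index(drop=True)


# Toy rows for illustration only.
sample = pd.DataFrame({
    "location": ["Kothanur", None, "Uttarahalli", "Whitefield"],
    "size": ["2 BHK", "3 BHK", "3 BHK", "4 BHK"],
    "society": [None, "Soiewre", None, "Prtates"],
    "bath": [2.0, 3.0, 2.0, 5.0],
    "balcony": [2.0, 1.0, None, 2.0],
})
cleaned = handle_missing(sample)
print(cleaned)
```

Dropping rows is cheap here because so few are affected. The median fill for `balcony` keeps about 4.6% of the rows that would otherwise be lost, at the cost of some imputed values.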


# 3. Modeling

# 4. Deployment