In [68]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
from src.viz import plot_histograms

%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 300

In [69]:
# reading the dataset
df = pd.read_csv('../data/raw/house_prices.csv')
df.sample(5)

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
1274,Super built-up Area,20-Dec,Jalahalli West,2 BHK,,1250,2.0,1.0,62.0
8141,Super built-up Area,17-Oct,Banashankari Stage V,3 BHK,Naiewre,1510,3.0,3.0,47.57
2406,Super built-up Area,Ready To Move,Marathahalli,3 BHK,PueraRi,1583,3.0,3.0,105.0
6890,Super built-up Area,Ready To Move,Subash Nagar,2 BHK,,1150,2.0,2.0,45.0
7013,Built-up Area,Ready To Move,Kengeri,3 BHK,,1436,2.0,2.0,55.0


In [70]:
# checking the shape of the data
df.shape

(13320, 9)

The dataset has 13,320 rows and 9 columns.

In [71]:
# checking the dataframe information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


There are 3 numerical columns (bath, balcony, price) and 6 non-numerical columns in the dataset. Note that total_sqft is read as object even though it holds area values.
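The same split can be listed programmatically with `DataFrame.select_dtypes`. The sketch below is a minimal example on a hand-made frame named `toy` (a hypothetical stand-in, with illustrative values rather than the full dataset) that mirrors three of the columns above.

```python
import pandas as pd

# Toy frame mirroring three columns of the dataset; values are illustrative.
toy = pd.DataFrame({
    "location": ["Kengeri", "Marathahalli"],
    "total_sqft": ["1436", "1583"],  # stored as strings, so not numeric
    "bath": [2.0, 3.0],
})

# Columns with a numeric dtype versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['bath']
print(non_numeric_cols)  # ['location', 'total_sqft']
```

Running the same two calls on `df` should reproduce the 3 / 6 split reported by `df.info()`.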

In [72]:
# looking at the summary statistics
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bath,13247.0,2.69261,1.341458,1.0,2.0,2.0,3.0,40.0
balcony,12711.0,1.584376,0.817263,0.0,1.0,2.0,2.0,3.0
price,13320.0,112.565627,148.971674,8.0,50.0,72.0,120.0,3600.0


Bath: 
- Most properties have between 1 and 3 bathrooms, with a median of 2.
- There is a property with an unusually high number of bathrooms (maximum of 40), suggesting a potential outlier.

Balcony:
- The majority of properties have 1 to 2 balconies, with a median of 2.
- The minimum of 0 shows that some units have no balcony at all.

Price:
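- Prices vary widely, with a mean of approximately 112.57 and a large standard deviation of 148.97. The mean sits well above the median of 72, which points to a right-skewed distribution.
- The middle 50% of properties are priced between 50 and 120, as indicated by the interquartile range (IQR); 8 is the minimum price.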
- There is a substantial difference between the 75th percentile and the maximum price (3600), suggesting the presence of potential outliers.
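One common way to turn "potential outlier" into a concrete check is the 1.5 * IQR rule. It is a convention, not something this notebook has settled on. With the quartiles reported above for price (50 and 120), the rule would place the upper fence at 120 + 1.5 * 70 = 225. The sketch below applies it to a small toy series; the helper name `iqr_bounds` and the values are assumptions made for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy prices, illustrative only.
prices = pd.Series([40.0, 50.0, 60.0, 72.0, 90.0, 120.0, 3600.0])

lower, upper = iqr_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]

print(lower, upper)       # -20.0 180.0
print(outliers.tolist())  # [3600.0]
```

The same helper could be called on `df['price']` or `df['bath']` to see how many rows fall outside the fences before deciding what to do with them.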

# Data Cleaning 

There are 6 non-numerical (object) columns, and linear regression cannot use them directly: each must be converted to numbers or dropped.

- Dropping availability and society as they might not add anything to the model; society is also missing in 5502 of the 13320 rows
- Dropping the balcony column since it has 609 missing values (about 4.6% of rows)
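For the categorical columns that are kept, one common conversion is one-hot encoding. The sketch below uses `pd.get_dummies` on a toy frame (illustrative values only); it is one possible approach, not necessarily the one this notebook goes on to use.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column; values are illustrative.
toy = pd.DataFrame({
    "area_type": ["Super built-up Area", "Plot Area", "Built-up Area"],
    "bath": [2.0, 5.0, 2.0],
})

# Replace area_type with one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=["area_type"], dtype=int)

print(encoded.columns.tolist())
```

Each category becomes its own indicator column, so the result is fully numeric. For a linear model, passing `drop_first=True` drops one indicator per feature to avoid perfectly collinear columns.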

In [73]:
# checking for missing values
df.isnull().sum()

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64
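Raw counts are easier to judge as a share of rows. From the output above, society is missing in 5502 of 13320 rows (about 41%), balcony in 609 (about 4.6%) and bath in 73 (about 0.5%). A minimal sketch of that calculation, shown on a toy frame with illustrative values:

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; values are illustrative.
toy = pd.DataFrame({
    "society": ["Naiewre", np.nan, np.nan, "PueraRi"],
    "bath": [2.0, 3.0, np.nan, 2.0],
    "price": [62.0, 47.57, 105.0, 45.0],
})

# The mean of the boolean null mask is the fraction of missing rows per column.
missing_pct = (
    toy.isnull().mean().mul(100).round(1).sort_values(ascending=False)
)

print(missing_pct)
```

Applied to `df`, the same expression gives the percentages directly instead of counts.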

In [74]:
# dropping the availability, society and balcony columns
df = df.drop(['availability','society','balcony'], axis=1)
# the remaining missing values are few compared to the size of the dataset, so dropping those rows
df = df.dropna()
df.head()

Unnamed: 0,area_type,location,size,total_sqft,bath,price
0,Super built-up Area,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Plot Area,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Built-up Area,Uttarahalli,3 BHK,1440,2.0,62.0
3,Super built-up Area,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Super built-up Area,Kothanur,2 BHK,1200,2.0,51.0


In [77]:
# confirming that no missing values remain
df.isnull().sum()

area_type     0
location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64