## About this DATASET

Suggest Edits
What are the things that a potential home buyer considers before purchasing a house? The location, the size of the property, vicinity to offices, schools, parks, restaurants, hospitals or the stereotypical white picket fence? What about the most important factor — the price?

Now with the lingering impact of demonetization, the enforcement of the Real Estate (Regulation and Development) Act (RERA), and the lack of trust in property developers in the city, housing units sold across India in 2017 dropped by 7 percent. In fact, the property prices in Bengaluru fell by almost 5 percent in the second half of 2017, said a study published by property consultancy Knight Frank.
For example, for a potential homeowner, over 9,000 apartment projects and flats for sale are available in the range of ₹42-52 lakh, followed by over 7,100 apartments that are in the ₹52-62 lakh budget segment, says a report by property website Makaan. According to the study, there are over 5,000 projects in the ₹15-25 lakh budget segment followed by those in the ₹34-43 lakh budget category.

Buying a home, especially in a city like Bengaluru, is a tricky choice. While the major factors are usually the same for all metros, there are others to be considered for the Silicon Valley of India. With its help millennial crowd, vibrant culture, great climate and a slew of job opportunities, it is difficult to ascertain the price of a house in Bengaluru.

BHK = Bedroom, Hall, Kitchen

https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data/data

In [18]:
!pip install numpy
!pip install numpy
!pip install pandas
!pip install scikit-learn




[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [19]:
import numpy as np
import pandas as pd
import re
import sklearn

## Data Loading

In [20]:
df1 = pd.read_csv("Bengaluru_House_Data.csv")

In [21]:
df1.head(10)

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0
5,Super built-up Area,Ready To Move,Whitefield,2 BHK,DuenaTa,1170,2.0,1.0,38.0
6,Super built-up Area,18-May,Old Airport Road,4 BHK,Jaades,2732,4.0,,204.0
7,Super built-up Area,Ready To Move,Rajaji Nagar,4 BHK,Brway G,3300,4.0,,600.0
8,Super built-up Area,Ready To Move,Marathahalli,3 BHK,,1310,3.0,1.0,63.25
9,Plot Area,Ready To Move,Gandhi Bazar,6 Bedroom,,1020,6.0,,370.0


In [22]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


In [23]:
df1.shape

(13320, 9)

In [24]:
df1.describe()

Unnamed: 0,bath,balcony,price
count,13247.0,12711.0,13320.0
mean,2.69261,1.584376,112.565627
std,1.341458,0.817263,148.971674
min,1.0,0.0,8.0
25%,2.0,1.0,50.0
50%,2.0,2.0,72.0
75%,3.0,2.0,120.0
max,40.0,3.0,3600.0


## Data PreProcessing

In [25]:
df2 = df1.drop(
    ["area_type", "society", "balcony"],
    axis = 1)

In [26]:
df2.head()

Unnamed: 0,availability,location,size,total_sqft,bath,price
0,19-Dec,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Ready To Move,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Ready To Move,Uttarahalli,3 BHK,1440,2.0,62.0
3,Ready To Move,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Ready To Move,Kothanur,2 BHK,1200,2.0,51.0


In [27]:
df2.isnull().sum()

availability     0
location         1
size            16
total_sqft       0
bath            73
price            0
dtype: int64

In [28]:
df3 = df2.dropna()

In [29]:
df3.isnull().sum()

availability    0
location        0
size            0
total_sqft      0
bath            0
price           0
dtype: int64

In [30]:
df3.head()

Unnamed: 0,availability,location,size,total_sqft,bath,price
0,19-Dec,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Ready To Move,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Ready To Move,Uttarahalli,3 BHK,1440,2.0,62.0
3,Ready To Move,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Ready To Move,Kothanur,2 BHK,1200,2.0,51.0


In [31]:
df3["availability"].unique()

array(['19-Dec', 'Ready To Move', '18-May', '18-Feb', '18-Nov', '20-Dec',
       '17-Oct', '21-Dec', '19-Sep', '20-Sep', '18-Mar', '18-Apr',
       '20-Aug', '19-Mar', '17-Sep', '18-Dec', '17-Aug', '19-Apr',
       '18-Jun', '22-Dec', '22-Jan', '18-Aug', '19-Jan', '17-Jul',
       '18-Jul', '21-Jun', '20-May', '19-Aug', '18-Sep', '17-May',
       '17-Jun', '18-Oct', '21-May', '18-Jan', '20-Mar', '17-Dec',
       '16-Mar', '19-Jun', '22-Jun', '19-Jul', '21-Feb', '19-May',
       '17-Nov', '20-Oct', '20-Jun', '19-Feb', '21-Oct', '21-Jan',
       '17-Mar', '17-Apr', '22-May', '19-Oct', '21-Jul', '21-Nov',
       '21-Mar', '16-Dec', '22-Mar', '20-Jan', '21-Sep', '21-Aug',
       '14-Nov', '19-Nov', '15-Nov', '16-Jul', '15-Jun', '17-Feb',
       '20-Nov', '20-Jul', '16-Sep', '15-Oct', '20-Feb', '15-Dec',
       '16-Oct', '22-Nov', '15-Aug', '17-Jan', '16-Nov', '20-Apr',
       '16-Jan', '14-Jul'], dtype=object)

In [32]:
df3.groupby("availability")["availability"].count()

availability
14-Jul               1
14-Nov               1
15-Aug               1
15-Dec               1
15-Jun               1
                 ...  
22-Jun              19
22-Mar               3
22-May               8
22-Nov               2
Ready To Move    10564
Name: availability, Length: 80, dtype: int64

In [33]:
df3["availability"] = df3["availability"].apply(lambda x: 
                                                x 
                                                if x in ("Ready To Move") 
                                                else "Future Possession")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3["availability"] = df3["availability"].apply(lambda x:


In [34]:
df3.groupby("availability")["availability"].count()

availability
Future Possession     2682
Ready To Move        10564
Name: availability, dtype: int64

In [35]:
df3["location"].unique()

array(['Electronic City Phase II', 'Chikka Tirupathi', 'Uttarahalli', ...,
       '12th cross srinivas nagar banshankari 3rd stage',
       'Havanur extension', 'Abshot Layout'], shape=(1304,), dtype=object)

In [36]:
df3.groupby("location")["location"].count().sort_values(ascending = False)

location
Whitefield                        534
Sarjapur  Road                    392
Electronic City                   302
Kanakpura Road                    266
Thanisandra                       233
                                 ... 
Dhanalakshmi Layout                 1
1st Stage Domlur                    1
1st Stage Radha Krishna Layout      1
Wheelers Road                       1
Tharabanahalli                      1
Name: location, Length: 1304, dtype: int64

In [37]:
location = df3.groupby("location")["location"].count().sort_values(ascending = False)

In [38]:
location_20cnt = location[location <= 20]
location_20cnt

location
Yelachenahalli                    20
Binny Pete                        20
Sanjay nagar                      20
Poorna Pragna Layout              20
HBR Layout                        20
                                  ..
Dhanalakshmi Layout                1
1st Stage Domlur                   1
1st Stage Radha Krishna Layout     1
Wheelers Road                      1
Tharabanahalli                     1
Name: location, Length: 1161, dtype: int64

In [39]:
df3["location"] = df3["location"].apply(lambda x:
                                        "Others" 
                                        if x in location_20cnt
                                        else x)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3["location"] = df3["location"].apply(lambda x:


In [40]:
df3.groupby("location")["location"].count().sort_values(ascending = False)

location
Others                4314
Whitefield             534
Sarjapur  Road         392
Electronic City        302
Kanakpura Road         266
                      ... 
Kathriguppe             22
Thubarahalli            22
Basaveshwara Nagar      21
Hoskote                 21
Ulsoor                  21
Name: location, Length: 144, dtype: int64

In [41]:
df3["size"].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

In [42]:
df3[["bhks", "name"]] = df3["size"].str.split(" ", n = 1, expand = True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3[["bhks", "name"]] = df3["size"].str.split(" ", n = 1, expand = True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3[["bhks", "name"]] = df3["size"].str.split(" ", n = 1, expand = True)


In [43]:
df3.head(10)

Unnamed: 0,availability,location,size,total_sqft,bath,price,bhks,name
0,Future Possession,Electronic City Phase II,2 BHK,1056,2.0,39.07,2,BHK
1,Ready To Move,Others,4 Bedroom,2600,5.0,120.0,4,Bedroom
2,Ready To Move,Uttarahalli,3 BHK,1440,2.0,62.0,3,BHK
3,Ready To Move,Lingadheeranahalli,3 BHK,1521,3.0,95.0,3,BHK
4,Ready To Move,Kothanur,2 BHK,1200,2.0,51.0,2,BHK
5,Ready To Move,Whitefield,2 BHK,1170,2.0,38.0,2,BHK
6,Future Possession,Old Airport Road,4 BHK,2732,4.0,204.0,4,BHK
7,Ready To Move,Rajaji Nagar,4 BHK,3300,4.0,600.0,4,BHK
8,Ready To Move,Marathahalli,3 BHK,1310,3.0,63.25,3,BHK
9,Ready To Move,Others,6 Bedroom,1020,6.0,370.0,6,Bedroom


In [44]:
df3 = df3.drop("name", axis = 1)

In [45]:
df3.head(10)

Unnamed: 0,availability,location,size,total_sqft,bath,price,bhks
0,Future Possession,Electronic City Phase II,2 BHK,1056,2.0,39.07,2
1,Ready To Move,Others,4 Bedroom,2600,5.0,120.0,4
2,Ready To Move,Uttarahalli,3 BHK,1440,2.0,62.0,3
3,Ready To Move,Lingadheeranahalli,3 BHK,1521,3.0,95.0,3
4,Ready To Move,Kothanur,2 BHK,1200,2.0,51.0,2
5,Ready To Move,Whitefield,2 BHK,1170,2.0,38.0,2
6,Future Possession,Old Airport Road,4 BHK,2732,4.0,204.0,4
7,Ready To Move,Rajaji Nagar,4 BHK,3300,4.0,600.0,4
8,Ready To Move,Marathahalli,3 BHK,1310,3.0,63.25,3
9,Ready To Move,Others,6 Bedroom,1020,6.0,370.0,6


In [46]:
df3["total_sqft"].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      shape=(2067,), dtype=object)

In [47]:
def get_maen(x):
    if re.findall("-", x):
        ss = x.strip().split("-")
        return ((float(ss[0]) + float(ss[1])) / 2)
    try:
        return float(x.strip())
    except:
        return None

In [48]:
df3["total_sqft_new"] = df3["total_sqft"].apply(get_maen)
df3.head()

Unnamed: 0,availability,location,size,total_sqft,bath,price,bhks,total_sqft_new
0,Future Possession,Electronic City Phase II,2 BHK,1056,2.0,39.07,2,1056.0
1,Ready To Move,Others,4 Bedroom,2600,5.0,120.0,4,2600.0
2,Ready To Move,Uttarahalli,3 BHK,1440,2.0,62.0,3,1440.0
3,Ready To Move,Lingadheeranahalli,3 BHK,1521,3.0,95.0,3,1521.0
4,Ready To Move,Kothanur,2 BHK,1200,2.0,51.0,2,1200.0


In [49]:
df3 = df3.drop("total_sqft", axis = 1)

In [50]:
df3.head(10)

Unnamed: 0,availability,location,size,bath,price,bhks,total_sqft_new
0,Future Possession,Electronic City Phase II,2 BHK,2.0,39.07,2,1056.0
1,Ready To Move,Others,4 Bedroom,5.0,120.0,4,2600.0
2,Ready To Move,Uttarahalli,3 BHK,2.0,62.0,3,1440.0
3,Ready To Move,Lingadheeranahalli,3 BHK,3.0,95.0,3,1521.0
4,Ready To Move,Kothanur,2 BHK,2.0,51.0,2,1200.0
5,Ready To Move,Whitefield,2 BHK,2.0,38.0,2,1170.0
6,Future Possession,Old Airport Road,4 BHK,4.0,204.0,4,2732.0
7,Ready To Move,Rajaji Nagar,4 BHK,4.0,600.0,4,3300.0
8,Ready To Move,Marathahalli,3 BHK,3.0,63.25,3,1310.0
9,Ready To Move,Others,6 Bedroom,6.0,370.0,6,1020.0


In [51]:
df3.isnull().sum()

availability       0
location           0
size               0
bath               0
price              0
bhks               0
total_sqft_new    46
dtype: int64

In [52]:
df4 = df3.dropna()

In [53]:
df4.isnull().sum()

availability      0
location          0
size              0
bath              0
price             0
bhks              0
total_sqft_new    0
dtype: int64

In [54]:
df4["bath"].unique()

array([ 2.,  5.,  3.,  4.,  6.,  1.,  9.,  8.,  7., 11., 10., 14., 27.,
       12., 16., 40., 15., 13., 18.])

In [55]:
df4.groupby("bath")["bath"].count().sort_values()

bath
14.0       1
15.0       1
27.0       1
18.0       1
40.0       1
16.0       2
13.0       3
11.0       3
12.0       7
10.0      13
9.0       41
8.0       64
7.0      102
6.0      269
5.0      521
1.0      781
4.0     1222
3.0     3274
2.0     6893
Name: bath, dtype: int64

In [56]:
df5 = df4[df4["bath"] <= 6]

In [57]:
df5.head(10)

Unnamed: 0,availability,location,size,bath,price,bhks,total_sqft_new
0,Future Possession,Electronic City Phase II,2 BHK,2.0,39.07,2,1056.0
1,Ready To Move,Others,4 Bedroom,5.0,120.0,4,2600.0
2,Ready To Move,Uttarahalli,3 BHK,2.0,62.0,3,1440.0
3,Ready To Move,Lingadheeranahalli,3 BHK,3.0,95.0,3,1521.0
4,Ready To Move,Kothanur,2 BHK,2.0,51.0,2,1200.0
5,Ready To Move,Whitefield,2 BHK,2.0,38.0,2,1170.0
6,Future Possession,Old Airport Road,4 BHK,4.0,204.0,4,2732.0
7,Ready To Move,Rajaji Nagar,4 BHK,4.0,600.0,4,3300.0
8,Ready To Move,Marathahalli,3 BHK,3.0,63.25,3,1310.0
9,Ready To Move,Others,6 Bedroom,6.0,370.0,6,1020.0


In [59]:
df6 = df5.drop("size", axis = 1)

In [60]:
df6.head()

Unnamed: 0,availability,location,bath,price,bhks,total_sqft_new
0,Future Possession,Electronic City Phase II,2.0,39.07,2,1056.0
1,Ready To Move,Others,5.0,120.0,4,2600.0
2,Ready To Move,Uttarahalli,2.0,62.0,3,1440.0
3,Ready To Move,Lingadheeranahalli,3.0,95.0,3,1521.0
4,Ready To Move,Kothanur,2.0,51.0,2,1200.0


In [61]:
df6["bhks"] = pd.to_numeric(df6["bhks"], errors = "coerce")

In [62]:
df6[df6["total_sqft_new"] / df6["bhks"] < 400]

Unnamed: 0,availability,location,bath,price,bhks,total_sqft_new
9,Ready To Move,Others,6.0,370.0,6,1020.0
16,Ready To Move,Bisuvanahalli,3.0,48.0,3,1180.0
26,Ready To Move,Electronic City,1.0,23.1,2,660.0
29,Ready To Move,Electronic City,2.0,47.0,3,1025.0
31,Ready To Move,Bisuvanahalli,2.0,35.0,3,1075.0
...,...,...,...,...,...,...
13279,Ready To Move,Others,5.0,130.0,6,1200.0
13281,Ready To Move,Margondanahalli,5.0,125.0,5,1375.0
13300,Ready To Move,Hosakerehalli,6.0,145.0,5,1500.0
13303,Ready To Move,Vidyaranyapura,5.0,70.0,5,774.0


In [63]:
df7 = df6[df6["total_sqft_new"] / df6["bhks"] > 400]

In [64]:
df7.head(10)

Unnamed: 0,availability,location,bath,price,bhks,total_sqft_new
0,Future Possession,Electronic City Phase II,2.0,39.07,2,1056.0
1,Ready To Move,Others,5.0,120.0,4,2600.0
2,Ready To Move,Uttarahalli,2.0,62.0,3,1440.0
3,Ready To Move,Lingadheeranahalli,3.0,95.0,3,1521.0
4,Ready To Move,Kothanur,2.0,51.0,2,1200.0
5,Ready To Move,Whitefield,2.0,38.0,2,1170.0
6,Future Possession,Old Airport Road,4.0,204.0,4,2732.0
7,Ready To Move,Rajaji Nagar,4.0,600.0,4,3300.0
8,Ready To Move,Marathahalli,3.0,63.25,3,1310.0
10,Future Possession,Whitefield,2.0,70.0,3,1800.0


In [65]:
df7["price_per_sqft"] = df7["price"] * 100000 / df7["total_sqft_new"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df7["price_per_sqft"] = df7["price"] * 100000 / df7["total_sqft_new"]


In [66]:
df7.head()

Unnamed: 0,availability,location,bath,price,bhks,total_sqft_new,price_per_sqft
0,Future Possession,Electronic City Phase II,2.0,39.07,2,1056.0,3699.810606
1,Ready To Move,Others,5.0,120.0,4,2600.0,4615.384615
2,Ready To Move,Uttarahalli,2.0,62.0,3,1440.0,4305.555556
3,Ready To Move,Lingadheeranahalli,3.0,95.0,3,1521.0,6245.890861
4,Ready To Move,Kothanur,2.0,51.0,2,1200.0,4250.0


In [67]:
df7["price_per_sqft"].describe()

count     11384.000000
mean       6107.117641
std        3873.141954
min         267.829813
25%        4202.129570
50%        5252.421226
75%        6761.519797
max      176470.588235
Name: price_per_sqft, dtype: float64

In [68]:
def rmv_price_outlier(df):
    df_new = pd.DataFrame()
    for key, sdf in df.groupby("location"):
        m = sdf["price_per_sqft"].mean()
        s = sdf["price_per_sqft"].std()

        # Removed Outlier
        rdf = sdf[(sdf["price_per_sqft"] <= m + s) & (sdf["price_per_sqft"] > m - s)]

        df_new = pd.concat([df_new, rdf], ignore_index = True)
        
    return df_new

In [69]:
df8 = rmv_price_outlier(df7)

In [70]:
df8.shape

(9428, 7)

In [71]:
df8.head(10)

Unnamed: 0,availability,location,bath,price,bhks,total_sqft_new,price_per_sqft
0,Ready To Move,1st Phase JP Nagar,4.0,250.0,4,2825.0,8849.557522
1,Ready To Move,1st Phase JP Nagar,3.0,167.0,3,1875.0,8906.666667
2,Ready To Move,1st Phase JP Nagar,4.0,210.0,3,2065.0,10169.491525
3,Ready To Move,1st Phase JP Nagar,3.0,157.0,3,2024.0,7756.916996
4,Ready To Move,1st Phase JP Nagar,3.0,225.0,3,2059.0,10927.634774
5,Ready To Move,1st Phase JP Nagar,2.0,100.0,2,1394.0,7173.601148
6,Future Possession,1st Phase JP Nagar,2.0,93.0,2,1077.0,8635.097493
7,Ready To Move,1st Phase JP Nagar,2.0,180.0,2,1566.0,11494.252874
8,Ready To Move,1st Phase JP Nagar,2.0,50.0,1,840.0,5952.380952
9,Future Possession,1st Phase JP Nagar,3.0,131.0,3,1590.0,8238.993711


## Prediction MODEL

In [72]:
# One-Hot Encoding
availabilty_dummy = pd.get_dummies(df8["availability"], drop_first = True).astype(int)

In [73]:
availabilty_dummy

Unnamed: 0,Ready To Move
0,1
1,1
2,1
3,1
4,1
...,...
9423,1
9424,1
9425,1
9426,1


In [74]:
# One-Hot Encoding
location_dummy = pd.get_dummies(df8["location"], drop_first = True).astype(int)

In [75]:
df9 = pd.concat([df8, availabilty_dummy, location_dummy], axis = 1)

In [76]:
df9.head(10)

Unnamed: 0,availability,location,bath,price,bhks,total_sqft_new,price_per_sqft,Ready To Move,2nd Stage Nagarbhavi,5th Phase JP Nagar,...,Ulsoor,Uttarahalli,Varthur,Vidyaranyapura,Vijayanagar,Vittasandra,Whitefield,Yelahanka,Yelahanka New Town,Yeshwanthpur
0,Ready To Move,1st Phase JP Nagar,4.0,250.0,4,2825.0,8849.557522,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Ready To Move,1st Phase JP Nagar,3.0,167.0,3,1875.0,8906.666667,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Ready To Move,1st Phase JP Nagar,4.0,210.0,3,2065.0,10169.491525,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Ready To Move,1st Phase JP Nagar,3.0,157.0,3,2024.0,7756.916996,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Ready To Move,1st Phase JP Nagar,3.0,225.0,3,2059.0,10927.634774,1,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Ready To Move,1st Phase JP Nagar,2.0,100.0,2,1394.0,7173.601148,1,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Future Possession,1st Phase JP Nagar,2.0,93.0,2,1077.0,8635.097493,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Ready To Move,1st Phase JP Nagar,2.0,180.0,2,1566.0,11494.252874,1,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Ready To Move,1st Phase JP Nagar,2.0,50.0,1,840.0,5952.380952,1,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Future Possession,1st Phase JP Nagar,3.0,131.0,3,1590.0,8238.993711,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [77]:
df9.shape

(9428, 151)

In [78]:
df10 = df9.drop(["availability", "location", "price_per_sqft"], axis = 1)

In [79]:
df10.head(10)

Unnamed: 0,bath,price,bhks,total_sqft_new,Ready To Move,2nd Stage Nagarbhavi,5th Phase JP Nagar,6th Phase JP Nagar,7th Phase JP Nagar,8th Phase JP Nagar,...,Ulsoor,Uttarahalli,Varthur,Vidyaranyapura,Vijayanagar,Vittasandra,Whitefield,Yelahanka,Yelahanka New Town,Yeshwanthpur
0,4.0,250.0,4,2825.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3.0,167.0,3,1875.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4.0,210.0,3,2065.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3.0,157.0,3,2024.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,3.0,225.0,3,2059.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,2.0,100.0,2,1394.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,2.0,93.0,2,1077.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,2.0,180.0,2,1566.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,2.0,50.0,1,840.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,3.0,131.0,3,1590.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [80]:
X = df10.drop(["price"], axis = 1)
y = df10["price"]

In [81]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [82]:
print(f"Shape of X_train_set features: {X_train.shape}")
print(f"Shape of X_test_set features: {X_test.shape}")
print(f"Shape of y_train_set features: {y_train.shape}")
print(f"Shape of y_test_set features: {y_test.shape}")

Shape of X_train_set features: (7542, 147)
Shape of X_test_set features: (1886, 147)
Shape of y_train_set features: (7542,)
Shape of y_test_set features: (1886,)


In [83]:
from sklearn.linear_model import LinearRegression

from sklearn.tree import DecisionTreeRegressor

from sklearn.model_selection import cross_val_score

In [84]:
# 1-Linear Regression Model

model_LR = LinearRegression()
model_LR = model_LR.fit(X_train, y_train)
model_LR

In [85]:
y_prediction = model_LR.predict(X_test)

In [86]:
marked_model = pd.DataFrame({
    "Test Data:" : y_test,
    "Prediction Data:" : y_prediction
})

marked_model.head(10)

Unnamed: 0,Test Data:,Prediction Data:
5790,59.52,62.163694
4655,56.87,61.616705
8204,50.3,32.571101
8785,65.0,63.684101
1585,75.0,67.507402
8763,39.0,64.173245
1755,67.0,66.710606
1796,200.0,206.69319
4045,86.5,97.033558
5946,50.0,29.429282


In [88]:
# 2-Decision Tree Regression Model

model_DTR = DecisionTreeRegressor()
model_DTR = model_DTR.fit(X_train, y_train)
model_DTR

In [89]:
y_prediction = model_DTR.predict(X_test)

In [90]:
marked_model = pd.DataFrame({
    "Test Data:" : y_test,
    "Prediction Data:" : y_prediction
})

marked_model.head(10)

Unnamed: 0,Test Data:,Prediction Data:
5790,59.52,65.0
4655,56.87,70.0
8204,50.3,50.2
8785,65.0,39.0
1585,75.0,75.5
8763,39.0,49.25
1755,67.0,75.0
1796,200.0,145.0
4045,86.5,100.0
5946,50.0,27.0


## MODEL Scores

In [91]:
model_score_LR = cross_val_score(
    estimator = LinearRegression(),
    X = X_train, 
    y = y_train, 
    cv = 5
)

model_score_LR

array([0.81654493, 0.89079429, 0.80131776, 0.78554245, 0.84174438])

In [92]:
model_score_LR.mean() * 100

np.float64(82.71887633758844)

In [93]:
model_score_LR.std()

np.float64(0.03680908576212361)

In [94]:
model_score_DTR = cross_val_score(
    estimator = DecisionTreeRegressor(),
    X = X_train,
    y = y_train, 
    cv = 5
)

model_score_DTR

array([0.74099454, 0.83079254, 0.75231327, 0.72387893, 0.81502857])

In [95]:
model_score_DTR.mean() * 100

np.float64(77.26015698264328)

In [96]:
model_score_DTR.std()

np.float64(0.042357416883015184)

In [97]:
from sklearn.metrics import r2_score, mean_squared_error

In [98]:
y_prediction_DTR = model_DTR.predict(X_test)
y_prediction_LR = model_LR.predict(X_test)

In [99]:
print("R^2 score:", r2_score(y_test, y_prediction_LR * 100)) 
print("MSE:", mean_squared_error(y_test, y_prediction_LR * 100))

print("R^2 score:", r2_score(y_test, y_prediction_LR * 100))
print("MSE:", mean_squared_error(y_test, y_prediction_LR * 100))

R^2 score: -23108.690098172385
MSE: 116336638.16146298
R^2 score: -23108.690098172385
MSE: 116336638.16146298


## Make a prediction on our new data

In [100]:
df10.columns[0:3]

Index(['bath', 'price', 'bhks'], dtype='object')

In [121]:
# Model Linear Regression

model_LR, model_columns = joblib.load("House_Prediction_Model_via_LinearRegression.pkl")

# New input (example)
new_data = pd.DataFrame([[1200, 2, 3]], columns=["total_sqft", "bath", "bhk"])

# Reindex
new_data = new_data.reindex(columns=model_columns, fill_value=0)

# Predict
new_prediction = model_LR.predict(new_data)

print("Predicted price:", new_prediction[0])

Predicted price: 44.44681061782482


In [122]:
model_DTR, model_columns = joblib.load("House_Prediction_Model_via_DecisionTreeRegressor.pkl")

# New input (example)
new_data = pd.DataFrame([[5000, 7, 6]], columns=["total_sqft", "bath", "bhk"])

# Reindex
new_data = new_data.reindex(columns=model_columns, fill_value=0)

# Predict
new_prediction = model_DTR.predict(new_data)

print("Predicted price:", new_prediction[0])

Predicted price: 48.0


In [113]:
new_prediction = model_LR.predict(new_data)

print("Predicted price:", new_prediction[0])

Predicted price: 44.44681061782482


## Save MODELS

In [106]:
import joblib

In [124]:
joblib.dump((model_LR, list(X_train.columns)), "House_Prediction_Model_LinearRegression.pkl")

['House_Prediction_Model_LinearRegression.pkl']

In [128]:
# Model va ustunlar ro‘yxatini yuklab olish
model, model_columns = joblib.load("House_Prediction_Model_LinearRegression.pkl")

In [None]:
input_df = pd.DataFrame([[sqft, bath, bhk]], columns=["total_sqft", "bath", "bhk"])
input_df = input_df.reindex(columns=model_columns, fill_value=0) 

In [125]:
joblib.dump((model_DTR, list(X_train.columns)), "House_Prediction_Model_DecisionTreeRegressor.pkl")

['House_Prediction_Model_DecisionTreeRegressor.pkl']

In [130]:
# Model va ustunlar ro‘yxatini yuklab olish
model, model_columns = joblib.load("House_Prediction_Model_DecisionTreeRegressor.pkl")

In [132]:
input_df = pd.DataFrame([[sqft, bath, bhk]], columns=["total_sqft", "bath", "bhk"])
input_df = input_df.reindex(columns=model_columns, fill_value=0)

## Web Application with Streamlit

In [110]:
!pip install streamlit

Collecting streamlit
  Using cached streamlit-1.45.0-py3-none-any.whl.metadata (8.9 kB)
Collecting altair<6,>=4.0 (from streamlit)
  Using cached altair-5.5.0-py3-none-any.whl.metadata (11 kB)
Collecting blinker<2,>=1.5.0 (from streamlit)
  Using cached blinker-1.9.0-py3-none-any.whl.metadata (1.6 kB)
Collecting cachetools<6,>=4.0 (from streamlit)
  Using cached cachetools-5.5.2-py3-none-any.whl.metadata (5.4 kB)
Collecting click<9,>=7.0 (from streamlit)
  Using cached click-8.2.0-py3-none-any.whl.metadata (2.5 kB)
Collecting packaging<25,>=20 (from streamlit)
  Using cached packaging-24.2-py3-none-any.whl.metadata (3.2 kB)
Collecting pillow<12,>=7.1.0 (from streamlit)
  Using cached pillow-11.2.1-cp311-cp311-win_amd64.whl.metadata (9.1 kB)
Collecting protobuf<7,>=3.20 (from streamlit)
  Using cached protobuf-6.30.2-cp310-abi3-win_amd64.whl.metadata (593 bytes)
Collecting pyarrow>=7.0 (from streamlit)
  Using cached pyarrow-20.0.0-cp311-cp311-win_amd64.whl.metadata (3.4 kB)
Collecting 


[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [127]:
import streamlit as st
import pandas as pd
import joblib

# Modelni yuklash
model = joblib.load("House_Prediction_Model_LinearRegression.pkl")

# Sarlavha
st.title("House Price Prediction App with Linear Regression Model")

# Inputlar
sqft = st.number_input("Total Square Feet", min_value=300, max_value=10000, step=10)
bath = st.number_input("Number of Bathrooms", min_value=1, max_value=10)
bhk = st.number_input("Number of Bedrooms (BHK)", min_value=1, max_value=10)

# Bashorat tugmasi
if st.button("Predict Price"):
    input_df = pd.DataFrame([[sqft, bath, bhk]], columns=["total_sqft", "bath", "bhk"])
    prediction = model.predict(input_df)[0]
    st.success(f"Predicted Price: ₹ {prediction:,.2f}")

2025-05-12 15:36:15.627 
  command:

    streamlit run c:\Users\otabek.khamidov\Desktop\Machine Learning\CODESTUDIO - PROJECTS\ML Project 1 - House Price Prediction\venv\Lib\site-packages\ipykernel_launcher.py [ARGUMENTS]
2025-05-12 15:36:15.650 Session state does not function when running a script without `streamlit run`


In [None]:
import joblib
import pandas as pd
import streamlit as st

# Loading Model
model, model_columns = joblib.load("House_Prediction_Model_LinearRegression.pkl")

# Title
st.title("House Price Prediction App with Linear Regression Model")

# Inputs
location = st.selectbox("Location", sorted([col for col in model_columns if col not in ["total_sqft", "bath", "bhk"]]))
sqft = st.number_input("Total Square Feet", min_value=300, max_value=10000, step=10)
bath = st.number_input("Number of Bathrooms", min_value=1, max_value=10)
bhk = st.number_input("Number of Bedrooms (BHK)", min_value=1, max_value=10)

input_dict = {
    "total_sqft": sqft,
    "bath": bath,
    "bhk": bhk
    }

input_df = pd.DataFrame(columns=model_columns)
input_df.loc[0] = 0

input_df.at[0, "total_sqft"] = sqft
input_df.at[0, "bath"] = bath
input_df.at[0, "bhk"] = bhk

if location in input_df.columns:
    input_df.at[0, location] = 1

input_df = pd.DataFrame([[sqft, bath, bhk]], columns=["total_sqft", "bath", "bhk"])
input_df = input_df.reindex(columns=model_columns, fill_value=0)

# Bashorat tugmasi
if st.button("Predict Price"):
    input_df = pd.DataFrame([[sqft, bath, bhk]], columns=["total_sqft", "bath", "bhk"])
    prediction = model.predict(input_df)[0]
    st.success(f"Predicted Price: ₹ {prediction:,.2f}")