## 1. Data Loading and Preprocessing

In these steps, we will load the Bengaluru House Data dataset using pandas and perform an initial exploration to understand its structure and contents.

- **Data Import:** The dataset is loaded into a pandas DataFrame named `data`.
- **Shape:** The raw dataset contains 13,320 rows and 9 columns; after initial cleaning it has 12,530 rows and 7 columns.
- **Columns (after cleaning):**  
    - `location`: Area or locality of the property  
    - `size`: Number of bedrooms (e.g., "2 BHK", "4 Bedroom")  
    - `total_sqft`: Total area in square feet  
    - `bath`: Number of bathrooms  
    - `price`: Price of the property (in lakhs)  
    - `bhk`: Extracted number of bedrooms as integer  
    - `price_per_sqft`: Price per square foot

We will also check for missing values, data types, and unique values in key columns to guide further cleaning and preprocessing steps. This foundational understanding helps in identifying potential issues such as outliers, inconsistent data, and the need for encoding categorical variables.

In [242]:
import pandas as pd
import numpy as np

In [243]:
data = pd.read_csv('Bengaluru_House_Data.csv')

In [244]:
data.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


The preview shows the dataset's 9 columns: `area_type`, `availability`, `location`, `size`, `society`, `total_sqft`, `bath`, `balcony`, and `price`.


In [245]:
data.shape

(13320, 9)

As shown above, the dataset has 13,320 rows and 9 columns.

In [246]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


### Data Overview and Missing Values

The dataset initially contains 9 columns, with 6 columns of object type and 3 columns of float type. Below is a summary of missing values in each column:

- **location**: 1 missing value  
- **size**: 16 missing values  
- **society**: 5,502 missing values  
- **bath**: 73 missing values  
- **balcony**: 609 missing values  

To handle these missing values, we can either replace them with appropriate statistics (mean, median, or mode) or drop columns that are not important for our analysis.
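One way to decide between filling and dropping is to look at the share of missing values per column. The sketch below uses a small toy frame with made-up values, and the 40% cut-off is a hypothetical rule of thumb, not something taken from the notebook. For reference, `society` is missing in 5,502 of 13,320 rows (about 41%) in the real data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the real dataset
toy = pd.DataFrame({
    "society": ["Coomee", None, None, "Soiewre", None],
    "bath": [2.0, 5.0, np.nan, 3.0, 2.0],
    "price": [39.07, 120.0, 62.0, 95.0, 51.0],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical rule of thumb: columns missing more than 40% are drop candidates
drop_candidates = missing_share[missing_share > 0.40].index.tolist()
print(drop_candidates)  # ['society']
```

Columns below the cut-off, such as `bath` here, are better candidates for filling than for dropping.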

#### Column-wise Data Types and Non-Null Counts

| Column         | Non-Null Count | Data Type |
|----------------|---------------|-----------|
| area_type      | 13,320        | object    |
| availability   | 13,320        | object    |
| location       | 13,319        | object    |
| size           | 13,304        | object    |
| society        | 7,818         | object    |
| total_sqft     | 13,320        | object    |
| bath           | 13,247        | float64   |
| balcony        | 12,711        | float64   |
| price          | 13,320        | float64   |

We will proceed by cleaning the data, handling missing values, and dropping columns that are not relevant for further analysis.

In [247]:
for col in data.columns:
    print(data[col].value_counts())
    print("*"*20)
# The above loop prints the value counts for each column in the DataFrame,
# helping to understand the distribution and frequency of unique values in every column.

area_type
Super built-up  Area    8790
Built-up  Area          2418
Plot  Area              2025
Carpet  Area              87
Name: count, dtype: int64
********************
availability
Ready To Move    10581
18-Dec             307
18-May             295
18-Apr             271
18-Aug             200
                 ...  
15-Aug               1
17-Jan               1
16-Nov               1
16-Jan               1
14-Jul               1
Name: count, Length: 81, dtype: int64
********************
location
Whitefield                        540
Sarjapur  Road                    399
Electronic City                   302
Kanakpura Road                    273
Thanisandra                       234
                                 ... 
Bapuji Layout                       1
1st Stage Radha Krishna Layout      1
BEML Layout 5th stage               1
singapura paradise                  1
Abshot Layout                       1
Name: count, Length: 1305, dtype: int64
********************
size
2 BHK    

In [248]:
data.isna().sum()

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64

### Null Value Analysis & Next Steps

Several columns (`location`, `size`, `bath`, `balcony`, `society`) have missing values. To ensure data quality, we will:

- Fill missing values in key columns using the median (for numerics) or the mode (for categoricals).
- Drop columns with excessive nulls or low relevance.
- Convert data types and engineer new features (`bhk`, `price_per_sqft`).
- Remove outliers and standardize location categories.

These steps prepare the dataset for effective analysis and modeling.


In [249]:
data.shape

(13320, 9)

In [250]:
data.drop(columns=['area_type', 'availability', 'society', 'balcony'], inplace=True)

In [251]:
data.shape

(13320, 5)

### Dropping Irrelevant or Redundant Columns

We remove the columns `area_type`, `availability`, `society`, and `balcony` from the dataset because:

- They have many missing values or low relevance to price prediction.
- They introduce unnecessary complexity or high cardinality (e.g., `society`).
- Their information is either redundant or not useful for our analysis and modeling.

This step simplifies the dataset and focuses on the most relevant features.

In [252]:
data.describe()

Unnamed: 0,bath,price
count,13247.0,13320.0
mean,2.69261,112.565627
std,1.341458,148.971674
min,1.0,8.0
25%,2.0,50.0
50%,2.0,72.0
75%,3.0,120.0
max,40.0,3600.0


In [253]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   location    13319 non-null  object 
 1   size        13304 non-null  object 
 2   total_sqft  13320 non-null  object 
 3   bath        13247 non-null  float64
 4   price       13320 non-null  float64
dtypes: float64(2), object(3)
memory usage: 520.4+ KB


In [254]:
data['location'].value_counts()

location
Whitefield                        540
Sarjapur  Road                    399
Electronic City                   302
Kanakpura Road                    273
Thanisandra                       234
                                 ... 
Bapuji Layout                       1
1st Stage Radha Krishna Layout      1
BEML Layout 5th stage               1
singapura paradise                  1
Abshot Layout                       1
Name: count, Length: 1305, dtype: int64


There was only one missing value in the `location` column, which was filled with `'Whitefield'`, the most frequent location. This maintains data consistency.

In [255]:
data['location'] = data['location'].fillna('Whitefield')


In [256]:
data['size'].value_counts()

size
2 BHK         5199
3 BHK         4310
4 Bedroom      826
4 BHK          591
3 Bedroom      547
1 BHK          538
2 Bedroom      329
5 Bedroom      297
6 Bedroom      191
1 Bedroom      105
8 Bedroom       84
7 Bedroom       83
5 BHK           59
9 Bedroom       46
6 BHK           30
7 BHK           17
1 RK            13
10 Bedroom      12
9 BHK            8
8 BHK            5
11 BHK           2
11 Bedroom       2
10 BHK           2
14 BHK           1
13 BHK           1
12 Bedroom       1
27 BHK           1
43 Bedroom       1
16 BHK           1
19 BHK           1
18 Bedroom       1
Name: count, dtype: int64

In [257]:
data['size'] = data['size'].fillna('2 BHK')

The `size` column has only 16 missing values, which are filled with the most frequent value, `2 BHK`, to maintain consistency.
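The fill value can also be derived from the data instead of being typed in by hand. The following is a minimal sketch on a toy Series (the values are illustrative only); `Series.mode()` ignores missing entries by default and returns the most frequent value first.

```python
import pandas as pd

# Toy stand-in for the `size` column, with two missing entries
size = pd.Series(["2 BHK", "3 BHK", None, "2 BHK", "4 Bedroom", None])

# Most frequent non-missing value
most_frequent = size.mode()[0]

# Fill the gaps with that value
filled = size.fillna(most_frequent)
print(most_frequent)
print(filled.tolist())
```

Deriving the value this way keeps the cell correct if the dataset is refreshed and the most common size changes.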

In [258]:
data['bath'].median()  # bath has 73 null values, so we will replace them with the median value

2.0

In [259]:
data['bath'] = data['bath'].fillna(data['bath'].median())

Filled missing values in the `bath` column with the median number of bathrooms to maintain consistency and minimize the impact of outliers.
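To illustrate why the median is the safer choice here, the sketch below compares mean and median on a toy Series that contains one extreme value (40, the maximum `bath` reported by `describe()` above). The other values are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Toy bathroom counts with one extreme value and one missing entry
bath = pd.Series([2.0, 2.0, 3.0, 2.0, 40.0, np.nan])

print(bath.mean())    # pulled upward by the single extreme value
print(bath.median())  # unaffected by it

# Fill the missing entry with the median, as the notebook does
bath_filled = bath.fillna(bath.median())
print(bath_filled.tolist())
```

With the mean, the missing entry would have been filled with a value close to 10 bathrooms, which is not typical of any listing in the toy data.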

In [260]:
data['bath'].isna().sum()

0

In [261]:
data['bhk'] = data['size'].str.split().str.get(0).astype(int)

### Extracting Number of Bedrooms (`bhk`)

The `bhk` column is generated by extracting the numeric value from the `size` column (e.g., "2 BHK" becomes 2). This transformation standardizes the number of bedrooms as an integer feature, making it easier to analyze and model.

Converting categorical or object-type columns into numeric values (such as integers or floats) is a crucial preprocessing step. It enables more effective statistical analysis and machine learning, as most algorithms require numerical input.
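As an alternative sketch of the same extraction, a regular expression can pull the leading digits out of each label. This is not the notebook's method (which splits on whitespace), just an equivalent way to express it on a few labels that appear in the `size` column.

```python
import pandas as pd

# A few labels as they appear in the `size` column
size = pd.Series(["2 BHK", "4 Bedroom", "1 RK", "11 BHK"])

# Capture the leading run of digits and convert it to an integer
bhk = size.str.extract(r"^(\d+)", expand=False).astype(int)
print(bhk.tolist())  # [2, 4, 1, 11]
```

Note that either approach counts "1 RK" as 1, which is consistent with the `bhk` value counts in this notebook (538 + 105 + 13 = 656 rows with `bhk` equal to 1).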

In [262]:
data['bhk'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 13320 entries, 0 to 13319
Series name: bhk
Non-Null Count  Dtype
--------------  -----
13320 non-null  int32
dtypes: int32(1)
memory usage: 52.2 KB


We can see that there are no null values now.

In [263]:
data['bhk'].value_counts()

bhk
2     5544
3     4857
4     1417
1      656
5      356
6      221
7      100
8       89
9       54
10      14
11       4
27       1
19       1
16       1
43       1
14       1
12       1
13       1
18       1
Name: count, dtype: int64

Checking for outliers.


In [264]:
data[data.bhk>20]

Unnamed: 0,location,size,total_sqft,bath,price,bhk
1718,2Electronic City Phase II,27 BHK,8000,27.0,230.0,27
4684,Munnekollal,43 Bedroom,2400,40.0,660.0,43


Only two rows have a `bhk` value greater than 20, so we treat them as outliers.

Now let's check for `total_sqft` column.

In [265]:
data['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

To handle `total_sqft` values given as ranges (e.g., "1200-1500"), we take their mean. For other values, we convert them directly to float. This ensures all entries in `total_sqft` are numeric and consistent for analysis.

In [266]:
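def convertRange(x):
    # Ranges such as "1133 - 1384" are replaced by their midpoint
    temp = x.split('-')
    try:
        if len(temp) == 2:
            return (float(temp[0]) + float(temp[1])) / 2
        return float(x)
    except ValueError:
        # Anything that cannot be parsed as a number becomes missing
        return None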

In [267]:
data['total_sqft'] = data['total_sqft'].apply(convertRange)

In [268]:
data['total_sqft'].dtype

dtype('float64')
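To see how this kind of conversion treats different inputs, here is a small standalone sketch. `to_sqft` is a hypothetical helper written for this illustration that follows the same idea (midpoint for ranges, `None` for anything unparseable); the third input string is invented.

```python
def to_sqft(value):
    """Hypothetical helper: midpoint for 'a - b' ranges, float otherwise, None on failure."""
    parts = value.split("-")
    try:
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(value)
    except ValueError:
        return None


for raw in ["1056", "1133 - 1384", "not a number"]:
    print(repr(raw), "->", to_sqft(raw))
```

Entries that come back as `None` turn into `NaN` in the column, which is why later `describe()` outputs report fewer non-null `total_sqft` values (13,274) than rows (13,320).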

In [269]:
data.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3
4,Kothanur,2 BHK,1200.0,2.0,51.0,2


Before deriving the price per square foot, we check how `price` correlates with `total_sqft`.

In [270]:
data[['price', 'total_sqft']].corr()

Unnamed: 0,price,total_sqft
price,1.0,0.575559
total_sqft,0.575559,1.0


### Creating the `price_per_sqft` Feature

A new column, `price_per_sqft`, is added to represent the price per square foot for each property. This feature enables more effective comparison of property values across different locations and sizes.

In [271]:
data['price_per_sqft'] = data['price']*100000 / data['total_sqft']

In [272]:
data['price_per_sqft']

0         3699.810606
1         4615.384615
2         4305.555556
3         6245.890861
4         4250.000000
             ...     
13315     6689.834926
13316    11111.111111
13317     5258.545136
13318    10407.336319
13319     3090.909091
Name: price_per_sqft, Length: 13320, dtype: float64
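The first value can be checked by hand. Since `price` is in lakhs and one lakh is 100,000 rupees, the first row (39.07 lakhs for 1,056 sqft) works out as follows. The variable names are just for this illustration.

```python
RUPEES_PER_LAKH = 100_000

price_lakhs = 39.07   # first row of the dataset
total_sqft = 1056.0

price_per_sqft = price_lakhs * RUPEES_PER_LAKH / total_sqft
print(round(price_per_sqft, 2))  # 3699.81
```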

In [273]:
data.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,13274.0,13320.0,13320.0,13320.0,13274.0
mean,1559.626694,2.688814,112.565627,2.802778,7907.501
std,1238.405258,1.338754,148.971674,1.294496,106429.6
min,1.0,1.0,8.0,1.0,267.8298
25%,1100.0,2.0,50.0,2.0,4266.865
50%,1276.0,2.0,72.0,3.0,5434.306
75%,1680.0,3.0,120.0,3.0,7311.746
max,52272.0,40.0,3600.0,43.0,12000000.0


In [274]:
data['location'].value_counts()

location
Whitefield                        541
Sarjapur  Road                    399
Electronic City                   302
Kanakpura Road                    273
Thanisandra                       234
                                 ... 
Bapuji Layout                       1
1st Stage Radha Krishna Layout      1
BEML Layout 5th stage               1
singapura paradise                  1
Abshot Layout                       1
Name: count, Length: 1305, dtype: int64

Next, we preprocess the `location` column: we strip stray whitespace from the names and then group rarely occurring locations.

In [275]:
# Remove leading and trailing spaces from location names
data['location'] = data['location'].str.strip()
location_count = data['location'].value_counts()


In [276]:
location_count  

location
Whitefield                        542
Sarjapur  Road                    399
Electronic City                   304
Kanakpura Road                    273
Thanisandra                       237
                                 ... 
Bapuji Layout                       1
1st Stage Radha Krishna Layout      1
BEML Layout 5th stage               1
singapura paradise                  1
Abshot Layout                       1
Name: count, Length: 1294, dtype: int64

In [277]:
location_count_less_10 = location_count[location_count <= 10]
location_count_less_10

location
Dairy Circle                      10
Nagappa Reddy Layout              10
Basapura                          10
1st Block Koramangala             10
Sector 1 HSR Layout               10
                                  ..
Bapuji Layout                      1
1st Stage Radha Krishna Layout     1
BEML Layout 5th stage              1
singapura paradise                 1
Abshot Layout                      1
Name: count, Length: 1053, dtype: int64

Locations with 10 or fewer occurrences (1,053 unique locations in total) are replaced with `'other'`. This reduces the number of unique categories in the `location` column from 1,294 to 242, which keeps the later one-hot encoding manageable.

In [278]:

data['location'] = data['location'].apply(lambda x: 'other' if x in location_count_less_10 else x)

In [279]:
data['location'].value_counts()

location
other                 2885
Whitefield             542
Sarjapur  Road         399
Electronic City        304
Kanakpura Road         273
                      ... 
Nehru Nagar             11
Banjara Layout          11
LB Shastri Nagar        11
Pattandur Agrahara      11
Narayanapura            11
Name: count, Length: 242, dtype: int64
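The same grouping can be written without a row-wise `apply`. The sketch below is one possible vectorized variant on toy data; the threshold of 2 is chosen only so the toy example has something to group (the notebook uses 10).

```python
import pandas as pd

# Toy locations, including one with stray whitespace
loc = pd.Series(["A", "A", "A", "B", "B", "C", " A "]).str.strip()

counts = loc.value_counts()
min_count = 2  # hypothetical threshold for the toy data; the notebook uses 10

# Labels that occur `min_count` times or fewer
rare = counts[counts <= min_count].index

# Keep frequent labels, replace the rare ones with 'other'
grouped = loc.where(~loc.isin(rare), "other")
print(grouped.value_counts())
```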

In [280]:
data.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,13274.0,13320.0,13320.0,13320.0,13274.0
mean,1559.626694,2.688814,112.565627,2.802778,7907.501
std,1238.405258,1.338754,148.971674,1.294496,106429.6
min,1.0,1.0,8.0,1.0,267.8298
25%,1100.0,2.0,50.0,2.0,4266.865
50%,1276.0,2.0,72.0,3.0,5434.306
75%,1680.0,3.0,120.0,3.0,7311.746
max,52272.0,40.0,3600.0,43.0,12000000.0


### Outlier Detection in `total_sqft`

A review of the `total_sqft` column reveals some unrealistic values, such as properties with as little as 1 square foot of area. Such entries are clear outliers or data entry errors, as it is not feasible for any property—especially those with multiple bedrooms—to have such a small area.

To improve data quality and ensure reliable analysis, we remove properties where the average area per BHK (i.e., `total_sqft` divided by `bhk`) is less than 300 sqft. This threshold helps filter out likely outliers and data inconsistencies, resulting in a more accurate and trustworthy dataset for further exploration and modeling.

In [281]:
data = data[((data['total_sqft']/data['bhk']) >= 300)]
data.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,12530.0,12530.0,12530.0,12530.0,12530.0
mean,1594.564544,2.559537,111.382401,2.650838,6303.979357
std,1261.271296,1.077938,152.077329,0.976678,4162.237981
min,300.0,1.0,8.44,1.0,267.829813
25%,1116.0,2.0,49.0,2.0,4210.526316
50%,1300.0,2.0,70.0,3.0,5294.117647
75%,1700.0,3.0,115.0,3.0,6916.666667
max,52272.0,16.0,3600.0,16.0,176470.588235


In [282]:
data.shape

(12530, 7)
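The area-per-bedroom filter above has a side effect worth knowing: a comparison against `NaN` evaluates to `False`, so rows whose `total_sqft` could not be parsed are removed as well. A toy sketch of this (values are illustrative, apart from the 2,400 sqft / 43 bedroom listing seen earlier):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_sqft": [1200.0, 500.0, np.nan, 2400.0],
    "bhk": [2, 3, 2, 43],
})

sqft_per_bhk = toy["total_sqft"] / toy["bhk"]

# Rows below 300 sqft per bedroom are dropped; so is the NaN row
kept = toy[sqft_per_bhk >= 300]
print(kept)
```

This is why every column reports the same count (12,530) after the filter, whereas `total_sqft` previously had fewer non-null values than the other columns.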

In [283]:
def remove_outliers_sqft(df):
    # Keep rows whose price_per_sqft lies within one standard deviation
    # of the mean price_per_sqft for their location
    kept = []
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf['price_per_sqft'])
        st = np.std(subdf['price_per_sqft'])
        gen_df = subdf[(subdf['price_per_sqft'] > (m - st)) & (subdf['price_per_sqft'] <= (m + st))]
        kept.append(gen_df)
    # Concatenate once at the end instead of growing a DataFrame in the loop
    return pd.concat(kept, ignore_index=True)

data = remove_outliers_sqft(data)
data.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,10301.0,10301.0,10301.0,10301.0,10301.0
mean,1508.440608,2.471702,91.286372,2.574896,5659.062876
std,880.694214,0.979449,86.342786,0.897649,2265.774749
min,300.0,1.0,10.0,1.0,1250.0
25%,1110.0,2.0,49.0,2.0,4244.897959
50%,1286.0,2.0,67.0,2.0,5175.600739
75%,1650.0,3.0,100.0,3.0,6428.571429
max,30400.0,16.0,2200.0,16.0,24509.803922


In [284]:
data.shape

(10301, 7)

### Understanding `groupby` and Outlier Removal Logic

- **`groupby('location')`** splits the DataFrame into groups based on unique values in the `location` column.
    - **`key`**: Stores the name of the current location (a string, e.g., `'Whitefield'`).
    - **`subdf`**: Contains a DataFrame with all rows for that location.

- For each location, the function keeps only those rows where `price_per_sqft` is within one standard deviation of the mean for that location. This filters out unusually high or low property prices, keeping only the most typical values for each area.
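The per-location rule can be tried on a toy frame. The sketch below expresses it with `groupby(...).transform` instead of an explicit loop, and assumes the population standard deviation (`ddof=0`), which is NumPy's default for `np.std`. All values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["A"] * 5 + ["B"] * 3,
    "price_per_sqft": [4000.0, 5000.0, 5200.0, 4800.0, 12000.0,
                       3000.0, 3100.0, 2900.0],
})

grp = toy.groupby("location")["price_per_sqft"]
mean = grp.transform("mean")                   # per-location mean, aligned to rows
std = grp.transform(lambda s: s.std(ddof=0))   # per-location population std

# Keep rows inside (mean - std, mean + std]
mask = (toy["price_per_sqft"] > mean - std) & (toy["price_per_sqft"] <= mean + std)
kept = toy[mask]
print(kept)
```

In location A only the 12,000 listing is removed. In location B, where the prices are tightly clustered, both 2,900 and 3,100 fall outside one standard deviation and are dropped, which shows how strict a one-sigma band can be on small groups.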


In [285]:


def bhk_outlier_remove(df):
    # Row labels to drop; a plain list keeps the labels as integers
    exclude_indices = []

    for location, location_df in df.groupby('location'):
        bhk_stats = {}

        # Step 1: Calculate mean, std and count for each BHK in a location
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df['price_per_sqft']),
                'std': np.std(bhk_df['price_per_sqft']),
                'count': bhk_df.shape[0]
            }

        # Step 2: Compare each BHK price_per_sqft with the (BHK-1) mean
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk - 1)
            if stats and stats['count'] > 5:
                exclude_indices.extend(
                    bhk_df[bhk_df['price_per_sqft'] < stats['mean']].index
                )

    return df.drop(exclude_indices, axis='index')


### Understanding and Removing BHK Outliers

The `bhk_outlier_remove` function refines real estate data by removing properties where the `price_per_sqft` is unusually low compared to properties with fewer rooms (BHK) in the same location. For example, it addresses cases where a 3BHK is listed at a lower price per square foot than a 2BHK nearby, which is typically unrealistic.

#### How It Works: Two-Step Process Per Location

**1. Calculate BHK Statistics**  
For each unique location, properties are grouped by their BHK count. For every BHK group within a location, the function computes:
- Mean `price_per_sqft`
- Standard deviation of `price_per_sqft`
- Count of properties for that BHK type

This builds a statistical profile for each BHK size in every location.

**2. Compare and Exclude Outliers**  
For each BHK type in a location:
- The function looks up the mean `price_per_sqft` of the (BHK - 1) group (e.g., for 3BHK, it checks 2BHK stats).
- If a property's `price_per_sqft` is less than the mean of the smaller BHK group (and the smaller group has more than 5 properties), it is flagged as an outlier.
- The indices of these outlier properties are collected.

**Final Step: Clean the DataFrame**  
After processing all locations and BHK groups, the function removes all identified outlier rows, resulting in a cleaner dataset for further analysis and modeling.
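The two steps above can be traced on a toy location. The sketch below is a compact re-expression of the same rule for illustration, with made-up prices: six 2BHK listings averaging 5,000 per sqft and two 3BHK listings.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["A"] * 8,
    "bhk": [2, 2, 2, 2, 2, 2, 3, 3],
    "price_per_sqft": [5000.0, 5200.0, 4800.0, 5100.0, 4900.0, 5000.0,
                       4000.0, 6000.0],
})

to_drop = []
for _, loc_df in toy.groupby("location"):
    # Step 1: per-BHK mean and count within this location
    stats = loc_df.groupby("bhk")["price_per_sqft"].agg(["mean", "count"])
    # Step 2: compare each BHK group with the (BHK - 1) group
    for bhk, bhk_df in loc_df.groupby("bhk"):
        if (bhk - 1) in stats.index and stats.loc[bhk - 1, "count"] > 5:
            below = bhk_df[bhk_df["price_per_sqft"] < stats.loc[bhk - 1, "mean"]]
            to_drop.extend(below.index)

cleaned = toy.drop(index=to_drop)
print(cleaned)
```

The 3BHK listing at 4,000 per sqft is below the 2BHK mean of 5,000 and is removed, while the one at 6,000 stays. The 2BHK rows are untouched because there is no 1BHK group to compare against.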


In [286]:
data = bhk_outlier_remove(data)

In [287]:
data.shape

(7361, 7)

In [288]:
data

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
0,1st Block Jayanagar,4 BHK,2850.0,4.0,428.0,4,15017.543860
1,1st Block Jayanagar,3 BHK,1630.0,3.0,194.0,3,11901.840491
2,1st Block Jayanagar,3 BHK,1875.0,2.0,235.0,3,12533.333333
3,1st Block Jayanagar,3 BHK,1200.0,2.0,130.0,3,10833.333333
4,1st Block Jayanagar,2 BHK,1235.0,2.0,148.0,2,11983.805668
...,...,...,...,...,...,...,...
10292,other,2 BHK,1200.0,2.0,70.0,2,5833.333333
10293,other,1 BHK,1800.0,1.0,200.0,1,11111.111111
10296,other,2 BHK,1353.0,2.0,110.0,2,8130.081301
10297,other,1 Bedroom,812.0,1.0,26.0,1,3201.970443


In [289]:
data.drop(columns=['size','price_per_sqft'],inplace=True)

### Cleaned data

In [290]:
data.head()

Unnamed: 0,location,total_sqft,bath,price,bhk
0,1st Block Jayanagar,2850.0,4.0,428.0,4
1,1st Block Jayanagar,1630.0,3.0,194.0,3
2,1st Block Jayanagar,1875.0,2.0,235.0,3
3,1st Block Jayanagar,1200.0,2.0,130.0,3
4,1st Block Jayanagar,1235.0,2.0,148.0,2


In [291]:
data.to_csv("Cleaned_data.csv")

In [292]:
X = data.drop(columns=['price'])
y = data['price']


## Model Building 

Next, we import the libraries needed for model building and evaluation: tools for data preprocessing, regression algorithms (Linear Regression, Lasso, Ridge), and performance metrics. We then split the data, encode the categorical `location` feature, scale the inputs, and train the regression models.

In [293]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

In [294]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [241]:
print(X_train.shape)
print(X_test.shape)

(5888, 4)
(1473, 4)
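The split sizes follow from the 7,361 cleaned rows and `test_size = 0.2`. The arithmetic below reproduces them, assuming the test share is rounded up to a whole number of rows, which matches the shapes printed above.

```python
import math

n_samples = 7361   # rows left after outlier removal
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # 1472.2 rounded up
n_train = n_samples - n_test

print(n_train, n_test)  # 5888 1473
```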


### Applying Linear Regression

In [300]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

# Create column transformer
column_trans = make_column_transformer(
    (OneHotEncoder(sparse_output=False), ['location']),
    remainder='passthrough'
)

In [301]:
scaler = StandardScaler()
lr = LinearRegression() 

In [302]:
pipe = make_pipeline(column_trans, scaler, lr)

In [303]:
pipe.fit(X_train, y_train)

In [304]:
y_pred_lr = pipe.predict(X_test)

In [305]:
r2 = r2_score(y_test, y_pred_lr)
print("R² Score:", r2)

R² Score: 0.8252107719236643


### Applying Lasso

In [308]:
lasso = Lasso()

In [310]:

pipe = make_pipeline(column_trans, scaler, lasso)

In [311]:
pipe.fit(X_train,y_train)

In [312]:
y_pred_lasso = pipe.predict(X_test)
r2_score(y_test,y_pred_lasso)

0.8146894751690389

### Applying Ridge

In [313]:
ridge = Ridge()

In [315]:
pipe = make_pipeline(column_trans, scaler, ridge)

In [316]:
pipe.fit(X_train,y_train)

In [318]:
y_pred_ridge = pipe.predict(X_test)
r2_score(y_test,y_pred_ridge)

0.8252348502290165

In [321]:
print("No Regularisation :" ,r2_score(y_test,y_pred_lr))
print("Lasso :" ,r2_score(y_test,y_pred_lasso))
print("Ridge :" ,r2_score(y_test,y_pred_ridge))



No Regularisation : 0.8252107719236643
Lasso : 0.8146894751690389
Ridge : 0.8252348502290165
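For reference, the R² score reported above is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a few made-up prices; `r2_manual` is a hypothetical helper name used only here.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot


# Made-up prices in lakhs
y_true = [50.0, 70.0, 120.0, 60.0]
y_pred = [55.0, 65.0, 110.0, 70.0]

r2 = r2_manual(y_true, y_pred)
print(round(r2, 4))  # 0.9138
```

Read this way, the three models above explain roughly 81% to 83% of the variance in test-set prices, and Ridge and unregularised linear regression are practically indistinguishable.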


In [322]:
import pickle

In [323]:
# `pipe` currently holds the Ridge pipeline fitted above
with open("RidgeModel.pkl", "wb") as f:
    pickle.dump(pipe, f)